Hi Balraj,
Very interesting discoveries!
Regarding the skbuff frag limit, should this be considered as part of the protocol even though it was originally a Linux implementation issue leaking through? Do you know if it has been stable over time? It might be worth asking on xen-devel.
Regarding there not being enough slots: there's already a "wait for a free slot" mechanism, so we could add a "wait for n free slots". Do you have parallel threads transmitting at once? We should probably take care that a "wait for n" doesn't end up constantly getting gazumped by lots of "wait for 1"s.
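To make the gazumping concern concrete, here is a rough sketch in Python, purely for illustration. SlotPool, request and release are made-up names, not anything in the existing code, and it is a single-threaded toy model rather than a real blocking implementation. The idea is that waiters are served strictly in arrival order, so a newcomer asking for 1 slot can't overtake an older waiter asking for n:

```python
from collections import deque


class SlotPool:
    """Toy model of a ring's free slots with strictly FIFO waiters (hypothetical)."""

    def __init__(self, free):
        self.free = free
        self.waiters = deque()  # (name, slots_needed), oldest first
        self.granted = []       # names in the order they were served

    def request(self, name, n):
        # Serve immediately only if nobody is already queued; otherwise a
        # stream of 1-slot requests could keep overtaking a larger one.
        if not self.waiters and self.free >= n:
            self.free -= n
            self.granted.append(name)
        else:
            self.waiters.append((name, n))

    def release(self, n):
        self.free += n
        # Serve from the head only, and stop as soon as the head does not fit.
        while self.waiters and self.free >= self.waiters[0][1]:
            name, need = self.waiters.popleft()
            self.free -= need
            self.granted.append(name)


pool = SlotPool(free=0)
pool.request("big", 4)      # wait for n
pool.request("small-1", 1)  # wait for 1, arrives later
pool.request("small-2", 1)
for _ in range(6):
    pool.release(1)         # slots trickle back one at a time
```

The cost of strict ordering is that small requests sit behind a large one even when a slot is free, so it trades some throughput for not starving the "wait for n".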
It turned out that both of the suspected problems were real, and the interference between the two was confusing the debugging. It looks like the max skbuff frags is 18 (65536/page_size + 2), and indeed if the chain of packet fragments is longer than 19 (the logic probably allows for one extra) it locks up the ring permanently. The other problem was that the ring does indeed get depleted to the point where the available slots are fewer than the number needed for the current chain of frags. Unfortunately in this case the write is still permitted, which overwrites/corrupts freely, and things go kaplooey immediately or pretty soon thereafter.
To confirm that there is nothing else, I implemented a quick workaround: the chain of frags is never allowed to be longer than 19, and if there aren't enough free slots then the whole chain is dropped. With these two changes all tests always completed and completed correctly. However, just dropping the chain when there aren't enough slots causes excessive packet loss, which slows things down a lot and unpredictably; it should either block or the write should fail with an ENOBUFS-flavoured exception. The good news though is that it still works, and a lot of the other tricky machinery also works correctly.
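For reference, the two checks can be sketched as below, in Python purely for illustration. The names max_frags, reserve_slots and CHAIN_LIMIT are hypothetical, and the sketch assumes 4096-byte pages and one ring slot per fragment. It shows the "fail with ENOBUFS" variant rather than the drop used in the workaround:

```python
import errno

MAX_PACKET_BYTES = 65536
CHAIN_LIMIT = 19  # the cap used in the workaround; max_frags() itself gives 18


def max_frags(page_size=4096):
    """65536/page_size + 2, which is 18 for 4096-byte pages."""
    return MAX_PACKET_BYTES // page_size + 2


def reserve_slots(n_frags, free_slots, limit=CHAIN_LIMIT):
    """Return the free-slot count after reserving one slot per fragment.

    Raises instead of writing, so the ring is never overwritten.
    """
    if n_frags > limit:
        raise ValueError("chain of %d frags exceeds limit of %d" % (n_frags, limit))
    if n_frags > free_slots:
        # Assumption: the caller decides whether to block, retry or drop.
        raise OSError(errno.ENOBUFS, "not enough free ring slots")
    return free_slots - n_frags
```

A caller that prefers to block could catch the OSError and wait for slots to be released before retrying.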
Balraj