
Re: one-byte TCP writes wedging

Hi Dave, Anil,

"Wait for n free slots" is exactly what is needed.  And at this low level, even if there are multiple writers, the writes should all be serialized.  It'll be best if the order of the pkts on the wire is always the same as order of the successful writes from the application.  Also in addition to the blocking write, we should also have a non-blocking write that fails if there aren't enough free slots available.

For the fragments, I think we should have an internal threshold (say 8 fragments) and if the number of fragments in the write is greater than the threshold then it triggers a compaction or repacking before sticking it on the ring.  As long as the threshold is higher than most use cases it should have no impact at all.  In any case the repacking work has to be done at some point so it shouldn't affect performance even if it is triggered often.
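
And a rough sketch of the repacking step (threshold and page size are just illustrative; the real thing would reuse granted io_pages rather than freshly allocated Cstructs):

(* Only repack when the write exceeds the threshold. *)
let max_frags = 8
let page_size = 4096

(* Copy a fragment list into one contiguous buffer, then split it back
   into page-sized chunks, preserving byte order. *)
let compact frags =
  let total = Cstruct.lenv frags in
  let buf = Cstruct.create total in
  let _ =
    List.fold_left
      (fun off frag ->
        let len = Cstruct.len frag in
        Cstruct.blit frag 0 buf off len;
        off + len)
      0 frags
  in
  let rec split off acc =
    if off >= total then List.rev acc
    else
      let n = min page_size (total - off) in
      split (off + n) (Cstruct.sub buf off n :: acc)
  in
  split 0 []

let maybe_compact frags =
  if List.length frags > max_frags then compact frags else frags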


On Tue, May 28, 2013 at 10:15 AM, Anil Madhavapeddy <anil@xxxxxxxxxx> wrote:
I think it's far safer to serialise a single fragment batch on the ring (in an ordered Lwt_sequence) rather than have fragments interleaved across multiple packet write requests.

Any other path will have packets being transmitted out-of-order, which will severely mess up performance.  This includes the case where there are multiple outstanding write requests with different numbers of fragments -- these should go out on the wire in the order they were requested, and so a large packet could indeed block a series of small ones.  It also makes hardware offload easier if the fragments aren't scattered over the ring.
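
A minimal sketch of that single-writer shape, using Lwt_stream here in place of the Lwt_sequence, and with push_batch_on_ring standing in for the real ring operation:

open Lwt.Infix

(* One FIFO of whole packets: each element is the full fragment list for
   one packet.  A single worker drains it, so the fragments of one packet
   are never interleaved with another's on the ring. *)
let tx_stream, tx_push
  : Cstruct.t list Lwt_stream.t * (Cstruct.t list option -> unit) =
  Lwt_stream.create ()

let queue_packet frags = tx_push (Some frags)

(* push_batch_on_ring is a placeholder for "place this whole batch on
   the ring". *)
let rec tx_worker push_batch_on_ring =
  Lwt_stream.next tx_stream >>= fun frags ->
  push_batch_on_ring frags >>= fun () ->
  tx_worker push_batch_on_ring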


On 28 May 2013, at 08:46, Dave Scott <Dave.Scott@xxxxxxxxxxxxx> wrote:

Hi Balraj,

Very interesting discoveries!

Regarding the skbuff frag limit, should this be considered as part of the protocol even though it was originally a Linux implementation issue leaking through? Do you know if it has been stable over time? It might be worth asking on xen-devel.

Regarding there not being enough slots: there's already a "wait for a free slot" mechanism so we could add a "wait for n free slots". Do you have parallel threads transmitting at once? We should probably take care that a "wait for n" doesn't end up constantly getting gazumped by lots of "wait for 1"s.
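
Roughly this shape is what I mean - only a sketch, with made-up field and function names rather than the actual netfront code:

(* FIFO "wait for n free slots": requests queue in arrival order and only
   the head of the queue is ever satisfied, so a big request can't be
   starved by a stream of "wait for 1"s. *)
type t = {
  mutable free : int;                      (* free TX slots           *)
  waiters : (int * unit Lwt.u) Queue.t;    (* (slots wanted, wakener) *)
}

(* Wake queued writers, in order, while enough slots remain. *)
let rec service t =
  match Queue.peek_opt t.waiters with
  | Some (n, u) when t.free >= n ->
      ignore (Queue.pop t.waiters);
      t.free <- t.free - n;
      Lwt.wakeup_later u ();
      service t
  | _ -> ()

let wait_for_slots t n =
  if Queue.is_empty t.waiters && t.free >= n then begin
    t.free <- t.free - n;
    Lwt.return_unit
  end else begin
    let th, u = Lwt.wait () in
    Queue.add (n, u) t.waiters;
    th
  end

(* Called from the TX completion path when [n] slots are released. *)
let release t n =
  t.free <- t.free + n;
  service t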


Dave Scott

On May 27, 2013, at 11:33 PM, "Balraj Singh" <balraj.singh@xxxxxxxxxxxx> wrote:

It turned out that both the suspected problems were real problems and the interference between the two was confusing the debugging.  It looks like the max skbuff frags is 18 (65536/page_size + 2) and indeed if the chain of packet fragments is longer than 19 (the logic probably allows for one extra) it locks up the ring permanently.  The other problem was that the ring does indeed get depleted down to the point where the available slots are fewer than the number needed for the current chain of frags.  Unfortunately in this case the write is still permitted, which overwrites/corrupts freely and things immediately or pretty soon thereafter go kaplooey.

To confirm that there is nothing else, I implemented a quick workaround - the chain of frags is never allowed to be longer than 19 and if there aren't enough free slots then the whole chain is dropped.  With these two changes all tests always completed and completed correctly.  However, just dropping when there aren't enough slots causes excessive packet loss, so it slows things randomly and a lot - it should either block or the write should fail with an ENOBUFS flavoured exception.  The good news though is that it still works and a lot of the other tricky machinery also works correctly.
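
The workaround boils down to roughly this (free_slots and push are placeholders for the real ring operations, and "drop" is just the interim behaviour until this blocks or raises an ENOBUFS-style error):

let max_chain = 19

let safe_write ~free_slots ~push frags =
  let n = List.length frags in
  if n > max_chain then Error `Too_many_fragments
  else if free_slots () < n then Error `Not_enough_slots  (* dropped for now *)
  else Ok (push frags)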


On Sun, May 26, 2013 at 10:53 AM, Anil Madhavapeddy <anil@xxxxxxxxxx> wrote:
The long chain of 36-byte frags is consistent with the backend dropping it.  Does it work better if you restrict the total fragment chain length to just 10 or 11?

The first unexplained packet loss is a real alarm bell though.  The entire TCP retransmit code on our stack is just a canary that spots latent bugs elsewhere in the device stack :-)


On 25 May 2013, at 22:25, Balraj Singh <balraj.singh@xxxxxxxxxxxx> wrote:

In the particular test I am using I write 36 bytes of payload and use the Mirage equivalent of TCP_NODELAY.  This works for a bit but then suffers some packet loss (why? TBD) and triggers a retransmit.  The retransmitted packet is 1400+ bytes and is made up of a long chain of 36-byte io_pages.  I thought that it may be that the ring did not have enough slots to take all the chunks of the packet.  Making the retransmitted packet be the size of the original write improved it very significantly, but it would still fail in the same way, though less frequently.  I'm working on it - I see the available txring slots vary, but I haven't yet found a case where the slots are fully depleted or down to fewer than the number of chunks that need to be written.  I'm still narrowing it down.

This test originally was with 1-byte writes, but that seemed to wedge even before the 1st data packet made it to the wire.  This may be because of the limitation Steven mentioned.  I think I'm getting close on the 36 byte write test, once this is figured out I'll try it with 1 byte writes again.


On Sat, May 25, 2013 at 11:11 AM, Anil Madhavapeddy <anil@xxxxxxxxxx> wrote:
Balraj noticed that a stream of 1-byte writes on a TCP connection would cause netfront/netback to wedge.  This is obviously quite unrealistic, but a good regression test.

A quick chat with Steven Smith pointed out that some Linux netbacks had a limit on the number of fragments allowed (based on the skbuff chain limit size).  So you might be ending up with a situation where the backend drops the entire set of fragments, and the frontend is retransmitting all the time.

So if you modify our frontend to limit the number of fragments to ~10 or so for any given packet, that might help.  On the other hand, if you're doing writes with a TCP segment size of 1, but still only 3-4 fragments (for the Ethernet/IP/TCP headers), then we have some other bug.  What does the Netif request look like, Balraj?



