[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [3.15-rc3] Bisected: xen-netback mangles packets between two guests on a bridge since merge of "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy" series.
Thursday, May 1, 2014, 7:39:36 PM, you wrote: > Thursday, May 1, 2014, 5:46:01 PM, you wrote: >> On 01/05/14 14:59, Sander Eikelenboom wrote: >>> >>> Thursday, May 1, 2014, 3:37:41 PM, you wrote: >>> >>>> On 30/04/14 23:25, Sander Eikelenboom wrote: >>>>> >>>>> Wednesday, April 30, 2014, 10:53:39 PM, you wrote: >>>>> >>>>>> On 30/04/14 11:45, Sander Eikelenboom wrote: >>>>>>> Hi Zoltan, >>>>>>> >>>>>>> Your series "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy", >>>>>>> merged into mainline with merge commit >>>>>>> 4caeccb4de76440e433a15009636e77d003eb3d6, >>>>>>> seem to introduce a subtle bug on network traffic between 2 guests on a >>>>>>> bridge on the same host. >>>>>>> I have one guest running apache as webdav server with SSL and another >>>>>>> guest that is using that is uploading large files to that webdav server. >>>>>>> Small requests (some get's and propfind's) seem to work ok, but when >>>>>>> the bulk uploading begins it fails with: >>>>>>> >>>>>>> Attempt 1 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL >>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac >>>>>>> Attempt 2 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL >>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac >>>>>>> Attempt 3 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL >>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac >>>>>>> Attempt 4 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL >>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac >>>>>>> >>>>>>> So some how large (probably fragmented) packets can get mangled when >>>>>>> from guest to guest on the same host. >>>>>>> I don't see this with clients that upload large files from external >>>>>>> sources. >>>>>>> Probably if SSL wasn't complaining it would probably be unnoticed for >>>>>>> longer and doing some silent corruption. >>>>>>> >>>>>>> I first blamed openssl, since it started around all the latest openssl >>>>>>> mayhem and updates, but it turns out it is all xen-netback related >>>>>>> again. >>>>>>> >>>>>>> Since these commits break bisectabillity: >>>>>>> - 1bb332af4cd889e4b64dacbf4a793ceb3a70445d (note in commit >>>>>>> message && kernel panic) >>>>>>> - 62bad3199a4c20505fc36c169deef20b25e17c5f (kernel panic) >>>>>>> i stopped bisecting at this point. >>>>>>> >>>>>>> The upside is .. it's 100% reproduceable :-) >>>>>> That's good :) Can you take tcpdump captures along the way (sending >>>>>> guest, dom0, receiving guest), and try to work out which packets are >>>>>> different, and where? Although taking captures in Dom0 might change your >>>>>> result, as it triggers the pages to be copied and unmapped before they >>>>>> reach their target. >>>>> >>>>>> Thanks, >>>>>> Zoli >>>>> >>>>> >>>>> Hrrmm that sounds like a lot of data and a lot of work .. >>>> If you could make captures in the sending and receiving guest with >>>> tcpdump (take care of increasing snaplen so the whole packet is there, >>>> and filter to the SSH connection itself), and upload it somewhere for >>>> me, that would be enough for start. I will try to work out where the >>>> corruption happens. >>>> Also, do you have timestamps for the above mentioned log entries? I >>>> guess they appear on the receiving side. >>>> And some info about the componenets on the server, so I can work out >>>> where is that _ssl.c:1415, and which part of the packet it actually >>>> looks for. >>> >>> They appear on the sending side (duplicity) .. the receiving side (apache + >>> mod_dav + ssl | gnu_tls) gives a "Could not get next bucket brigade (URI:" >> I will try to repro this case in house. What versions of these >> components you used? > Both guests are debian wheezy. > The webdav server has: > ii apache2-mpm-event 2.2.22-13+deb7u1 amd64 > Apache HTTP Server - event driven model > ii apache2-utils 2.2.22-13+deb7u1 amd64 > uti > ii apache2.2-bin 2.2.22-13+deb7u1 amd64 > Apa > ii apache2.2-common 2.2.22-13+deb7u1 amd64 > Apa > ii libapache2-mod-gnutls 0.5.10-1.1 amd64 > Apa > ii libssl1.0.0:amd64 1.0.1e-2+deb7u7 amd64 > SSL > ii openssl 1.0.1e-2+deb7u7 amd64 > Sec > The guest with duplicity currently has a duplicity version from unstable > recompiled for wheezy. But i previously also tried a downgrade to the > standard > wheezy version. It uses the webdav backend and a volumesize of 100MB. > Unfortunately it seems duplicity doesn't bail out at first instance, it seems > it > only reports error after the so the full tcpdumps i got are also 100MB each. > Since the error seems to happen when it's going through > "xenvif_handle_frag_list", i have added a bunch of ratelimited printk's. > Will run that for both the cases: > skb->truesize -= skb->data_len; > skb->truesize -= nskb->data_len; > Let's see what that does different and if that gives an insight in what is > going > wrong. The debug patch that i used is attached. Here is some of the output, skb->data_len size sometimes varies slightly as does skb->truesize (is that allowed to become > 64k ?). All of it attached in dmesg.txt [ 647.256275] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 1: skb->data_len: 64890, skb->len: 65018, skb->truesize: 65658, nskb->data_len: 0, nskb->len: 0, nskb->truesize: 768 MAX_SKB_FRAGS:17 offset:128 [ 647.273483] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 2: skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:128 [ 647.291478] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:0, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:4224 [ 647.310199] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:1, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:8320 [ 647.329874] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:2, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:12416 [ 647.350268] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:3, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:16512 [ 647.371562] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:4, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:20608 [ 647.393707] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:5, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:24704 [ 647.416604] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:6, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:28800 [ 647.440482] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:7, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:32896 [ 647.465490] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:8, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:36992 [ 647.491216] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:9, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:41088 [ 647.517698] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:10, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:45184 [ 647.545022] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:11, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:49280 [ 647.573482] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:12, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:53376 [ 647.602458] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:13, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:57472 [ 647.632271] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:14, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:61568 [ 647.662608] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:15, len: 3658, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:65226 [ 647.693632] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 4: i:16, skb->data_len: 65098, skb->len: 65226, skb->truesize: 66304, nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976 MAX_SKB_FRAGS:17 offset:65226 <snip> [ 848.058828] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 1: skb->data_len: 63493, skb->len: 63621, skb->truesize: 64261, nskb->data_len: 0, nskb->len: 0, nskb->truesize: 768 MAX_SKB_FRAGS:17 offset:128 [ 848.096299] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 2: skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:128 [ 848.133427] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:0, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:4224 [ 848.170519] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:1, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:8320 [ 848.207578] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:2, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:12416 [ 848.244847] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:3, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:16512 [ 848.282044] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:4, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:20608 [ 848.319114] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:5, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:24704 [ 848.356198] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:6, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:28800 [ 848.393369] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:7, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:32896 [ 848.430450] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:8, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:36992 [ 848.467631] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:9, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:41088 [ 848.504711] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:10, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:45184 [ 848.541759] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:11, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:49280 [ 848.578914] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:12, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:53376 [ 848.616120] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:13, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:57472 [ 848.653251] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:14, len: 4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:61568 [ 848.690269] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:15, len: 3658, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:65226 [ 848.727394] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 4: i:16, skb->data_len: 65098, skb->len: 65226, skb->truesize: 66304, nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373 MAX_SKB_FRAGS:17 offset:65226 >> Zoli >>> >>> >>>>> >>>>> how ever .. could it be just a type and would the following make sense ? >>>>> >>>>> diff --git a/drivers/net/xen-netback/netback.c >>>>> b/drivers/net/xen-netback/netback.c >>>>> index 7666540..abeea10 100644 >>>>> --- a/drivers/net/xen-netback/netback.c >>>>> +++ b/drivers/net/xen-netback/netback.c >>>>> @@ -1366,7 +1366,7 @@ static int xenvif_handle_frag_list(struct xenvif >>>>> *vif, struct sk_buff *skb) >>>>> >>>>> xenvif_fill_frags(vif, nskb); >>>>> /* Subtract frags size, we will correct it later */ >>>>> - skb->truesize -= skb->data_len; >>>>> + skb->truesize -= nskb->data_len; >>>>> skb->len += nskb->len; >>>>> skb->data_len += nskb->len; >>> >>>> Nope, that's correct there: after that skb->truesize will be the size of >>>> the struct plus the linear buffer itself. The code is just about the >>>> ditch the original fragments plus the skb on the frag_list. When the new >>>> pages are created, it will update it again. >>> >>> Well i just went a head and tried this .. and the uploading does seem to >>> work fine with this change >>> .. (that obviously doesn't say anything about correctness) >>> >>>> Also, this code path runs only if the guest sends more slots we can >>>> handle (so we put the extra one to the frag_list until we can get rid of >>>> it). On Linux it can only happen with 3.2 or older guest kernels, and >>>> only occasionally. As you said, this is 100% reproducible, so I would >>>> doubt the problem is with this part of the code. >>> >>> Well this assumption seems to be incorrect: >>> - both dom0 and guest kernels are 3.15-rc3's. >>> - but we do end up in this code path >>> >>>> Zoli >>> >>> Attachment:
netback-debug6.diff Attachment:
dmesg.txt _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |