[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [3.15-rc3] Bisected: xen-netback mangles packets between two guests on a bridge since merge of "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy" series.



Thursday, May 1, 2014, 7:39:36 PM, you wrote:


> Thursday, May 1, 2014, 5:46:01 PM, you wrote:

>> On 01/05/14 14:59, Sander Eikelenboom wrote:
>>>
>>> Thursday, May 1, 2014, 3:37:41 PM, you wrote:
>>>
>>>> On 30/04/14 23:25, Sander Eikelenboom wrote:
>>>>>
>>>>> Wednesday, April 30, 2014, 10:53:39 PM, you wrote:
>>>>>
>>>>>> On 30/04/14 11:45, Sander Eikelenboom wrote:
>>>>>>> Hi Zoltan,
>>>>>>>
>>>>>>> Your series "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy", 
>>>>>>> merged into mainline with merge commit 
>>>>>>> 4caeccb4de76440e433a15009636e77d003eb3d6,
>>>>>>> seem to introduce a subtle bug on network traffic between 2 guests on a 
>>>>>>> bridge on the same host.
>>>>>>> I have one guest running apache as webdav server with SSL and another 
>>>>>>> guest that is using that is uploading large files to that webdav server.
>>>>>>> Small requests (some get's and propfind's) seem to work ok, but when 
>>>>>>> the bulk uploading begins it fails with:
>>>>>>>
>>>>>>> Attempt 1 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL 
>>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>>> Attempt 2 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL 
>>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>>> Attempt 3 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL 
>>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>>> Attempt 4 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL 
>>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>>>
>>>>>>> So some how large (probably fragmented) packets can get mangled when 
>>>>>>> from guest to guest on the same host.
>>>>>>> I don't see this with clients that upload large files from external 
>>>>>>> sources.
>>>>>>> Probably if SSL wasn't complaining it would probably be unnoticed for 
>>>>>>> longer and doing some silent corruption.
>>>>>>>
>>>>>>> I first blamed openssl, since it started around all the latest openssl 
>>>>>>> mayhem and updates, but it turns out it is all xen-netback related 
>>>>>>> again.
>>>>>>>
>>>>>>> Since these commits break bisectabillity:
>>>>>>>        - 1bb332af4cd889e4b64dacbf4a793ceb3a70445d  (note in commit 
>>>>>>> message && kernel panic)
>>>>>>>        - 62bad3199a4c20505fc36c169deef20b25e17c5f  (kernel panic)
>>>>>>> i stopped bisecting at this point.
>>>>>>>
>>>>>>> The upside is .. it's 100% reproduceable :-)
>>>>>> That's good :) Can you take tcpdump captures along the way (sending
>>>>>> guest, dom0, receiving guest), and try to work out which packets are
>>>>>> different, and where? Although taking captures in Dom0 might change your
>>>>>> result, as it triggers the pages to be copied and unmapped before they
>>>>>> reach their target.
>>>>>
>>>>>> Thanks,
>>>>>> Zoli
>>>>>
>>>>>
>>>>> Hrrmm that sounds like a lot of data and a lot of work ..
>>>> If you could make captures in the sending and receiving guest with
>>>> tcpdump (take care of increasing snaplen so the whole packet is there,
>>>> and filter to the SSH connection itself), and upload it somewhere for
>>>> me, that would be enough for start. I will try to work out where the
>>>> corruption happens.
>>>> Also, do you have timestamps for the above mentioned log entries? I
>>>> guess they appear on the receiving side.
>>>> And some info about the componenets on the server, so I can work out
>>>> where is that _ssl.c:1415, and which part of the packet it actually
>>>> looks for.
>>>
>>> They appear on the sending side (duplicity) .. the receiving side (apache +
>>> mod_dav + ssl | gnu_tls) gives a "Could not get next bucket brigade (URI:"
>> I will try to repro this case in house. What versions of these 
>> components you used?

> Both guests are debian wheezy.

> The webdav server has:
> ii  apache2-mpm-event                2.2.22-13+deb7u1           amd64        
> Apache HTTP Server - event driven model
> ii  apache2-utils                    2.2.22-13+deb7u1           amd64        
> uti
> ii  apache2.2-bin                    2.2.22-13+deb7u1           amd64        
> Apa
> ii  apache2.2-common                 2.2.22-13+deb7u1           amd64        
> Apa
> ii  libapache2-mod-gnutls            0.5.10-1.1                 amd64        
> Apa

> ii  libssl1.0.0:amd64                1.0.1e-2+deb7u7            amd64        
> SSL
> ii  openssl                          1.0.1e-2+deb7u7            amd64        
> Sec


> The guest with duplicity currently has a duplicity version from unstable 
> recompiled for wheezy. But i previously also tried a downgrade to the 
> standard 
> wheezy version. It uses the webdav backend and a volumesize of 100MB.

> Unfortunately it seems duplicity doesn't bail out at first instance, it seems 
> it 
> only reports error after the so the full tcpdumps i got are also 100MB each.

> Since the error seems to happen when it's going through 
> "xenvif_handle_frag_list", i have added a bunch of ratelimited printk's.

> Will run that for both the cases:
>         skb->truesize -= skb->data_len;
>         skb->truesize -= nskb->data_len;

> Let's see what that does different and if that gives an insight in what is 
> going 
> wrong.

The debug patch that i used is attached.
Here is some of the output, skb->data_len size sometimes varies slightly as 
does skb->truesize (is that allowed to become > 64k ?). 
All of it attached in dmesg.txt

[  647.256275] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 1: 
skb->data_len: 64890, skb->len: 65018, skb->truesize: 65658, nskb->data_len: 0, 
nskb->len: 0, nskb->truesize: 768  MAX_SKB_FRAGS:17 offset:128
[  647.273483] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 2: 
skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 208, 
nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:128
[  647.291478] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:0, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:4224
[  647.310199] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:1, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:8320
[  647.329874] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:2, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:12416
[  647.350268] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:3, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:16512
[  647.371562] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:4, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:20608
[  647.393707] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:5, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:24704
[  647.416604] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:6, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:28800
[  647.440482] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:7, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:32896
[  647.465490] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:8, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:36992
[  647.491216] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:9, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:41088
[  647.517698] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:10, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:45184
[  647.545022] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:11, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:49280
[  647.573482] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:12, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:53376
[  647.602458] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:13, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:57472
[  647.632271] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:14, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:61568
[  647.662608] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 3: i:15, len: 
3658, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 
offset:65226
[  647.693632] vif vif-8-0 vif8.0: ?!?!? xenvif_handle_frag_list 4: i:16, 
skb->data_len: 65098, skb->len: 65226, skb->truesize: 66304, nskb->data_len: 
208, nskb->len: 208, nskb->truesize: 976  MAX_SKB_FRAGS:17 offset:65226

<snip>

[  848.058828] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 1: 
skb->data_len: 63493, skb->len: 63621, skb->truesize: 64261, nskb->data_len: 0, 
nskb->len: 0, nskb->truesize: 768  MAX_SKB_FRAGS:17 offset:128 
[  848.096299] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 2: 
skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, nskb->data_len: 
1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:128 
[  848.133427] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:0, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:4224 
[  848.170519] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:1, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:8320 
[  848.207578] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:2, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:12416 
[  848.244847] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:3, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:16512 
[  848.282044] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:4, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:20608 
[  848.319114] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:5, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:24704 
[  848.356198] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:6, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:28800 
[  848.393369] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:7, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:32896 
[  848.430450] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:8, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:36992 
[  848.467631] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:9, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:41088 
[  848.504711] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:10, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:45184 
[  848.541759] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:11, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:49280 
[  848.578914] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:12, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:53376 
[  848.616120] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:13, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:57472 
[  848.653251] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:14, len: 
4096, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:61568 
[  848.690269] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 3: i:15, len: 
3658, skb->data_len: 65098, skb->len: 65226, skb->truesize: 768, 
nskb->data_len: 1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 
offset:65226 
[  848.727394] vif vif-9-0 vif9.0: ?!?!? xenvif_handle_frag_list 4: i:16, 
skb->data_len: 65098, skb->len: 65226, skb->truesize: 66304, nskb->data_len: 
1605, nskb->len: 1605, nskb->truesize: 2373  MAX_SKB_FRAGS:17 offset:65226 


>> Zoli

>>>
>>>
>>>>>
>>>>> how ever .. could it be just a type and would the following make sense ?
>>>>>
>>>>> diff --git a/drivers/net/xen-netback/netback.c 
>>>>> b/drivers/net/xen-netback/netback.c
>>>>> index 7666540..abeea10 100644
>>>>> --- a/drivers/net/xen-netback/netback.c
>>>>> +++ b/drivers/net/xen-netback/netback.c
>>>>> @@ -1366,7 +1366,7 @@ static int xenvif_handle_frag_list(struct xenvif 
>>>>> *vif, struct sk_buff *skb)
>>>>>
>>>>>           xenvif_fill_frags(vif, nskb);
>>>>>           /* Subtract frags size, we will correct it later */
>>>>> -       skb->truesize -= skb->data_len;
>>>>> +       skb->truesize -= nskb->data_len;
>>>>>           skb->len += nskb->len;
>>>>>           skb->data_len += nskb->len;
>>>
>>>> Nope, that's correct there: after that skb->truesize will be the size of
>>>> the struct plus the linear buffer itself. The code is just about the
>>>> ditch the original fragments plus the skb on the frag_list. When the new
>>>> pages are created, it will update it again.
>>>
>>> Well i just went a head and tried this .. and the uploading does seem to 
>>> work fine with this change
>>> .. (that obviously doesn't say anything about correctness)
>>>
>>>> Also, this code path runs only if the guest sends more slots we can
>>>> handle (so we put the extra one to the frag_list until we can get rid of
>>>> it). On Linux it can only happen with 3.2 or older guest kernels, and
>>>> only occasionally. As you said, this is 100% reproducible, so I would
>>>> doubt the problem is with this part of the code.
>>>
>>> Well this assumption seems to be incorrect:
>>>          - both dom0 and guest kernels are 3.15-rc3's.
>>>          - but we do end up in this code path
>>>
>>>> Zoli
>>>
>>>




Attachment: netback-debug6.diff
Description: Binary data

Attachment: dmesg.txt
Description: Text document

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.