Re: [Xen-devel] [3.15-rc3] Bisected: xen-netback mangles packets between two guests on a bridge since merge of "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy" series.
Thursday, May 1, 2014, 5:46:01 PM, you wrote:

> On 01/05/14 14:59, Sander Eikelenboom wrote:
>>
>> Thursday, May 1, 2014, 3:37:41 PM, you wrote:
>>
>>> On 30/04/14 23:25, Sander Eikelenboom wrote:
>>>>
>>>> Wednesday, April 30, 2014, 10:53:39 PM, you wrote:
>>>>
>>>>> On 30/04/14 11:45, Sander Eikelenboom wrote:
>>>>>> Hi Zoltan,
>>>>>>
>>>>>> Your series "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy",
>>>>>> merged into mainline with merge commit
>>>>>> 4caeccb4de76440e433a15009636e77d003eb3d6,
>>>>>> seems to introduce a subtle bug in network traffic between 2 guests on a
>>>>>> bridge on the same host.
>>>>>> I have one guest running apache as webdav server with SSL and another
>>>>>> guest that is uploading large files to that webdav server.
>>>>>> Small requests (some GETs and PROPFINDs) seem to work ok, but when the
>>>>>> bulk uploading begins it fails with:
>>>>>>
>>>>>> Attempt 1 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL
>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>> Attempt 2 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL
>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>> Attempt 3 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL
>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>> Attempt 4 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL
>>>>>> routines:SSL3_READ_BYTES:sslv3 alert bad record mac
>>>>>>
>>>>>> So somehow large (probably fragmented) packets can get mangled when
>>>>>> going from guest to guest on the same host.
>>>>>> I don't see this with clients that upload large files from external
>>>>>> sources.
>>>>>> If SSL wasn't complaining, it would probably go unnoticed for
>>>>>> longer while silently corrupting data.
>>>>>>
>>>>>> I first blamed openssl, since it started around all the latest openssl
>>>>>> mayhem and updates, but it turns out it is all xen-netback related again.
>>>>>>
>>>>>> Since these commits break bisectability:
>>>>>> - 1bb332af4cd889e4b64dacbf4a793ceb3a70445d (note in commit
>>>>>> message && kernel panic)
>>>>>> - 62bad3199a4c20505fc36c169deef20b25e17c5f (kernel panic)
>>>>>> I stopped bisecting at this point.
>>>>>>
>>>>>> The upside is .. it's 100% reproducible :-)
>>>>> That's good :) Can you take tcpdump captures along the way (sending
>>>>> guest, dom0, receiving guest), and try to work out which packets are
>>>>> different, and where? Although taking captures in Dom0 might change your
>>>>> result, as it triggers the pages to be copied and unmapped before they
>>>>> reach their target.
>>>>
>>>>> Thanks,
>>>>> Zoli
>>>>
>>>>
>>>> Hrrmm that sounds like a lot of data and a lot of work ..
>>> If you could make captures in the sending and receiving guest with
>>> tcpdump (take care of increasing snaplen so the whole packet is there,
>>> and filter to the SSH connection itself), and upload it somewhere for
>>> me, that would be enough for a start. I will try to work out where the
>>> corruption happens.
>>> Also, do you have timestamps for the above mentioned log entries? I
>>> guess they appear on the receiving side.
>>> And some info about the components on the server, so I can work out
>>> where that _ssl.c:1415 is, and which part of the packet it actually
>>> looks for.
>>
>> They appear on the sending side (duplicity) .. the receiving side (apache +
>> mod_dav + ssl | gnu_tls) gives a "Could not get next bucket brigade (URI:"
> I will try to repro this case in house. What versions of these
> components did you use?

Both guests are Debian wheezy.
The webdav server has:

ii  apache2-mpm-event      2.2.22-13+deb7u1  amd64  Apache HTTP Server - event driven model
ii  apache2-utils          2.2.22-13+deb7u1  amd64  uti
ii  apache2.2-bin          2.2.22-13+deb7u1  amd64  Apa
ii  apache2.2-common       2.2.22-13+deb7u1  amd64  Apa
ii  libapache2-mod-gnutls  0.5.10-1.1        amd64  Apa
ii  libssl1.0.0:amd64      1.0.1e-2+deb7u7   amd64  SSL
ii  openssl                1.0.1e-2+deb7u7   amd64  Sec

The guest with duplicity currently has a duplicity version from unstable
recompiled for wheezy, but I previously also tried a downgrade to the
standard wheezy version.
It uses the webdav backend and a volume size of 100MB.
Unfortunately it seems duplicity doesn't bail out at the first failure; it
seems to report the error only after the whole volume has been sent, so the
full tcpdumps I got are also 100MB each.

Since the error seems to happen when it's going through
"xenvif_handle_frag_list", I have added a bunch of ratelimited printk's.

I will run that for both cases:
skb->truesize -= skb->data_len;
skb->truesize -= nskb->data_len;

Let's see what that does differently and whether it gives any insight into
what is going wrong.

> Zoli
>>
>>
>>>>
>>>> however .. could it be just a typo and would the following make sense ?
>>>>
>>>> diff --git a/drivers/net/xen-netback/netback.c
>>>> b/drivers/net/xen-netback/netback.c
>>>> index 7666540..abeea10 100644
>>>> --- a/drivers/net/xen-netback/netback.c
>>>> +++ b/drivers/net/xen-netback/netback.c
>>>> @@ -1366,7 +1366,7 @@ static int xenvif_handle_frag_list(struct xenvif
>>>> *vif, struct sk_buff *skb)
>>>>
>>>> xenvif_fill_frags(vif, nskb);
>>>> /* Subtract frags size, we will correct it later */
>>>> - skb->truesize -= skb->data_len;
>>>> + skb->truesize -= nskb->data_len;
>>>> skb->len += nskb->len;
>>>> skb->data_len += nskb->len;
>>
>>> Nope, that's correct there: after that skb->truesize will be the size of
>>> the struct plus the linear buffer itself. The code is just about to
>>> ditch the original fragments plus the skb on the frag_list.
>>> When the new pages are created, it will update it again.
>>
>> Well I just went ahead and tried this .. and the uploading does seem to
>> work fine with this change
>> .. (that obviously doesn't say anything about correctness)
>>
>>> Also, this code path runs only if the guest sends more slots than we can
>>> handle (so we put the extra one to the frag_list until we can get rid of
>>> it). On Linux it can only happen with 3.2 or older guest kernels, and
>>> only occasionally. As you said, this is 100% reproducible, so I would
>>> doubt the problem is with this part of the code.
>>
>> Well this assumption seems to be incorrect:
>> - both dom0 and guest kernels are 3.15-rc3's.
>> - but we do end up in this code path
>>
>>> Zoli
>>
>>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
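
For readers following the truesize discussion in this thread, here is a
minimal userspace sketch of the arithmetic only. It is not kernel code:
struct toy_skb and all function names are hypothetical stand-ins for the
three sk_buff counters mentioned above, and any numbers fed into it are made
up. It only shows how the two candidate lines differ in what they leave in
truesize; it says nothing about which one is correct in netback or why the
corruption occurs.

```c
#include <assert.h>

/* Hypothetical stand-in for the three sk_buff counters discussed in the
 * thread. This is NOT the kernel's struct sk_buff. */
struct toy_skb {
    unsigned int len;       /* total bytes: linear part plus paged part */
    unsigned int data_len;  /* bytes held in paged fragments */
    unsigned int truesize;  /* memory accounted to this buffer */
};

/* Tail shared by both variants: fold nskb's byte counts into skb. */
void toy_merge_lengths(struct toy_skb *skb, const struct toy_skb *nskb)
{
    skb->len += nskb->len;
    skb->data_len += nskb->len;
}

/* Existing line from the quoted diff: subtract skb's own paged bytes.
 * Works on copies and returns the resulting truesize. */
unsigned int toy_truesize_existing(struct toy_skb skb, struct toy_skb nskb)
{
    assert(skb.truesize >= skb.data_len);   /* guard unsigned underflow */
    skb.truesize -= skb.data_len;
    toy_merge_lengths(&skb, &nskb);
    return skb.truesize;
}

/* Proposed line from the quoted diff: subtract nskb's paged bytes. */
unsigned int toy_truesize_proposed(struct toy_skb skb, struct toy_skb nskb)
{
    assert(skb.truesize >= nskb.data_len);  /* guard unsigned underflow */
    skb.truesize -= nskb.data_len;
    toy_merge_lengths(&skb, &nskb);
    return skb.truesize;
}
```

With a made-up skb of { len 60128, data_len 60000, truesize 61000 } and a
made-up nskb of { len 5000, data_len 5000, truesize 6000 }, the existing line
leaves truesize at 1000 (the assumed struct plus linear part), while the
proposed line leaves it at 56000.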