
Re: [Xen-devel] [3.15-rc3] Bisected: xen-netback mangles packets between two guests on a bridge since merge of "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy" series.

To: Sander Eikelenboom <linux@xxxxxxxxxxxxxx>
From: Zoltan Kiss <zoltan.kiss@xxxxxxxxxx>
Date: Thu, 1 May 2014 14:37:41 +0100
Cc: netdev@xxxxxxxxxxxxxxx, xen-devel@xxxxxxxxxxxxx, Ian Campbell <Ian.Campbell@xxxxxxxxxx>, "David S. Miller" <davem@xxxxxxxxxxxxx>
Delivery-date: Thu, 01 May 2014 13:38:16 +0000
List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 30/04/14 23:25, Sander Eikelenboom wrote:


Wednesday, April 30, 2014, 10:53:39 PM, you wrote:

On 30/04/14 11:45, Sander Eikelenboom wrote:

Hi Zoltan,

Your series "TX grant mapping with SKBTX_DEV_ZEROCOPY instead of copy", merged 
into mainline with merge commit 4caeccb4de76440e433a15009636e77d003eb3d6,
seems to introduce a subtle bug in network traffic between 2 guests on a bridge 
on the same host.
I have one guest running apache as a webdav server with SSL and another guest 
that is uploading large files to that webdav server.
Small requests (some GETs and PROPFINDs) seem to work ok, but when the bulk 
uploading begins it fails with:

Attempt 1 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL 
routines:SSL3_READ_BYTES:sslv3 alert bad record mac
Attempt 2 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL 
routines:SSL3_READ_BYTES:sslv3 alert bad record mac
Attempt 3 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL 
routines:SSL3_READ_BYTES:sslv3 alert bad record mac
Attempt 4 failed. SSLError: [Errno 1] _ssl.c:1415: error:140943FC:SSL 
routines:SSL3_READ_BYTES:sslv3 alert bad record mac

So somehow large (probably fragmented) packets can get mangled when going from 
guest to guest on the same host.
I don't see this with clients that upload large files from external sources.
If SSL weren't complaining, this would probably have gone unnoticed for longer, 
silently corrupting data.

I first blamed openssl, since it started around all the latest openssl mayhem 
and updates, but it turns out it is all xen-netback related again.

Since these commits break bisectability:
      - 1bb332af4cd889e4b64dacbf4a793ceb3a70445d  (note in commit message && kernel panic)
      - 62bad3199a4c20505fc36c169deef20b25e17c5f  (kernel panic)
I stopped bisecting at this point.

The upside is .. it's 100% reproducible :-)

That's good :) Can you take tcpdump captures along the way (sending
guest, dom0, receiving guest), and try to work out which packets are
different, and where? Although taking captures in Dom0 might change your
result, as it triggers the pages to be copied and unmapped before they
reach their target.

Thanks,
Zoli



Hrrmm that sounds like a lot of data and a lot of work ..

If you could make captures in the sending and receiving guest with
tcpdump (take care to increase snaplen so the whole packet is there,
and filter to the SSH connection itself), and upload them somewhere for
me, that would be enough for a start. I will try to work out where the
corruption happens.
Also, do you have timestamps for the above-mentioned log entries? I
guess they appear on the receiving side.
And some info about the components on the server, so I can work out
where that _ssl.c:1415 is, and which part of the packet it actually
looks for.
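
Once matching packets from the two captures have been extracted, working out where they differ is a byte-wise comparison. A minimal sketch in C, assuming the payloads are already in memory as plain buffers; `first_mismatch` is a hypothetical helper name and not part of tcpdump, libpcap, or the kernel:

```c
#include <stddef.h>

/*
 * Hypothetical helper: compare the payload seen by the sending guest with
 * the payload seen by the receiving guest.
 * Returns the offset of the first differing byte, or -1 if the two buffers
 * are identical. A length difference is reported at the shorter length.
 */
static long first_mismatch(const unsigned char *sent, size_t sent_len,
                           const unsigned char *recv, size_t recv_len)
{
    size_t n = sent_len < recv_len ? sent_len : recv_len;

    for (size_t i = 0; i < n; i++) {
        if (sent[i] != recv[i])
            return (long)i;
    }
    return sent_len == recv_len ? -1 : (long)n;
}
```

The offset of the first mismatch, compared against the sizes of the fragments making up the packet, could hint at whether a whole fragment was mangled or only part of one.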


However .. could it just be a typo, and would the following make sense?

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 7666540..abeea10 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1366,7 +1366,7 @@ static int xenvif_handle_frag_list(struct xenvif *vif, struct sk_buff *skb)

         xenvif_fill_frags(vif, nskb);
         /* Subtract frags size, we will correct it later */
-       skb->truesize -= skb->data_len;
+       skb->truesize -= nskb->data_len;
         skb->len += nskb->len;
         skb->data_len += nskb->len;

Nope, that's correct there: after that, skb->truesize will be the size of
the struct plus the linear buffer itself. The code is just about to
ditch the original fragments plus the skb on the frag_list. When the new
pages are created, it will update it again.
Also, this code path runs only if the guest sends more slots than we can
handle (so we put the extra one on the frag_list until we can get rid of
it). On Linux it can only happen with 3.2 or older guest kernels, and
only occasionally. As you said, this is 100% reproducible, so I would
doubt the problem is with this part of the code.
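
The accounting described above can be illustrated with a toy model. This is only a sketch: `struct toy_skb` and `toy_handle_frag_list` are hypothetical stand-ins, not the kernel's `struct sk_buff` or the real `xenvif_handle_frag_list`, and it assumes that truesize was earlier increased by exactly the bytes placed in the skb's own fragments.

```c
#include <assert.h>

/* Toy stand-in for the three sk_buff fields touched by the hunk. */
struct toy_skb {
    unsigned int len;      /* total bytes: linear part + fragments */
    unsigned int data_len; /* bytes held in fragments only */
    unsigned int truesize; /* memory charged: struct + linear buffer + fragments */
};

/* Mirrors the three statements of the hunk on the toy struct. */
static void toy_handle_frag_list(struct toy_skb *skb, const struct toy_skb *nskb)
{
    /* Guard against unsigned underflow in this model. */
    assert(skb->truesize >= skb->data_len);

    /*
     * Remove the skb's OWN fragment bytes from the charge, leaving only
     * struct + linear buffer. Subtracting nskb->data_len instead would
     * leave part of the old fragments still accounted for.
     */
    skb->truesize -= skb->data_len;

    /* The data carried by nskb is folded into skb's fragment area. */
    skb->len += nskb->len;
    skb->data_len += nskb->len;
}
```

With a base charge of 448 bytes and 3000 bytes in fragments, truesize drops from 3448 back to 448 in this model, which matches the "struct plus the linear buffer" description. In the real code truesize is then updated again when the new pages are attached; the toy model stops before that step.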


Zoli
