[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



Tuesday, March 11, 2014, 4:36:16 PM, you wrote:

> On Tue, Mar 11, 2014 at 02:00:41PM +0100, Sander Eikelenboom wrote:
> [...]
>> >> the issue when using 3.13.6 as a base and ..
>> >>   - pull all 3.14 patches from the  
>> >> git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git tree
>> >>   - apply paul's commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c 
>> >> xen-netback: improve guest-receive-side flow control"
>> >>   - applying annie's v2 patch
>> >>   - applying your patch
>> >> as dom0 and using a 3.14-rc5 as domU kernel.
>> >> 
>> >> Unfortunately i'm still getting the Bad grant references ..
>> >> 
>> 
>> > :-( That's bad news.
>> 
>> > I guess you always have the same DomU kernel when testing? That means we
>> > can narrow down the bug to netback only.
>> 
>> Yes my previous tests (from my previous mail):
>> 
>> - First testing a baseline that worked o.k. for several days (3.13.6 for 
>> both dom0 and domU)
>> - Testing domU 3.14-rc5 and dom0 3.13.6, this worked ok.
>> - Testing dom0 3.14-rc5 and domU 3.13.6, this failed.
>> - After that took 3.13.6 as base and first applied all the general xen 
>> related patches for the dom0 kernel, that works ok.
>> - After that started to apply the netback changes for 3.14 and that failed 
>> after the commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c xen-netback: 
>> improve guest-receive-side flow control".
>> 
>> Also seem to indicate just that, although it could also be something in this 
>> netback commit that triggers a latent bug in netfront, can't rule that one 
>> out completly.
>> 
>> But the trigger is in that commit &&
>> annie's and your patch seem to have no effect at all( on this issue) &&
>> later commits in 3.14 do seems to mask it / make it less likely to trigger, 
>> but do not fix it.
>> 

> Unfortunately I've stared at the same piece of code for some time but
> don't have immediate clue. Later commits don't look suspecious either.

> I also looked at netfront code, but there's no slot couting change
> between 3.13 and 3.14.

> Do you have some straight setup instructions so that I can try to
> reproduce.

Ok you asked for it .. so here we go .. ;-) :

- Xen-unstable
- DomU:
     - PV guest
     - Debian Wheezy
     - Kernel: 3.14-rc5 (.config see dom0 .config)
     - Running netserver
- Dom0:
     - Debian Wheezy
     - Kernel: 3.13.6 + additional patches (see attached git log and .config)
     - 2 physical NIC's
     - Autoballooning is prevented by using dom0_mem=1536M,max:1536M for xen 
and mem=1536M for dom0 kernel stanzas in grub
     - vcpu 0 is pinned on pcpu 0 and exclusively for dom0

- Networking:
     - Routed bridge
     - eth0 = internet
     - eth1 = lan 172.16.x.x
     - xen_bridge = bridge for VM's 192.168.1.x
     - iptables NAT and routing  
     - attached dom0 and domU ifconfig output
     - attached ethtool -k output for the bridge, vif and guest eth0

Triggering workload:
     - Well that's were the problem is :-)
     - The Guest has a normal disk and swap (phy/lvm) and
       shared storage with glusterfs (glusterfs server is on dom0)
     - The Guest exposes this storage via webdavs 

     What triggers it is:
     - The Guest runs it's rsync of the shared storage to a remote client on 
the internet
       So this causes traffic from Dom0 to domU (reading storage) ..
       back from domU to dom0 and via iptables NAT on to eth0 (actual rsync)
       or vice versa when syncing the other way around
     - At the same time do a backup from a windows machine from the lan to the 
webdavs server
       So this causes traffic from eth1 to domU (webdav) .. 
       back from domU to dom0 (writing to the shared storage) 

So in essence it is doing quite some netback && netfront stress testing :-)

It seems to only trigger when doing both the rsync and the webdav simultaneous.

I tried my best to emulate any of this with netperf (and multiple instances),
i tried with various (odd) packet sizes and the packet / byte rates transmitted
are higher then with the workload above ... but it doesn't seem to trigger with 
netpref

So i don't think it will be easy to replicate ...

Perhaps running through the available logging again .. and try to answer some 
questions ... this is just with one guest running kernels as before only added 
debugging to netfront and xen (diffs attached):

Mar 12 02:00:44 backup kernel: [  496.840646] net eth0: rx->offset: 0, size: 
4294967295
Mar 12 02:00:44 backup kernel: [  496.840665] net eth0: cons:1346005 slots:1 
rp:1346013 max:18 err:0 rx->id:212 rx->offset:0 size:4294967295 ref:572 
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [  496.840680] net eth0: rx->offset: 0, size: 
4294967295
Mar 12 02:00:44 backup kernel: [  496.840687] net eth0: cons:1346005 slots:2 
rp:1346013 max:18 err:-22 rx->id:214 rx->offset:0 size:4294967295 ref:657 
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [  496.840701] net eth0: rx->offset: 0, size: 
4294967295
Mar 12 02:00:44 backup kernel: [  496.840712] net eth0: cons:1346005 slots:3 
rp:1346013 max:18 err:-22 rx->id:215 rx->offset:0 size:4294967295 ref:667 
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [  496.840733] net eth0: rx->offset: 0, size: 
4294967295
Mar 12 02:00:44 backup kernel: [  496.840740] net eth0: cons:1346005 slots:4 
rp:1346013 max:18 err:-22 rx->id:216 rx->offset:0 size:4294967295 ref:716 
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [  496.840757] net eth0: rx->offset: 0, size: 
4294967295
Mar 12 02:00:44 backup kernel: [  496.840764] net eth0: cons:1346005 slots:5 
rp:1346013 max:18 err:-22 rx->id:217 rx->offset:0 size:4294967295 ref:755 
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [  496.840778] net eth0: rx->offset: 0, size: 
4294967295
Mar 12 02:00:44 backup kernel: [  496.840784] net eth0: cons:1346005 slots:6 
rp:1346013 max:18 err:-22 rx->id:218 rx->offset:0 size:4294967295 ref:592 
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [  496.840801] net eth0: rx->offset: 0, size: 
4294967295
Mar 12 02:00:44 backup kernel: [  496.840807] net eth0: cons:1346005 slots:7 
rp:1346013 max:18 err:-22 rx->id:219 rx->offset:0 size:4294967295 ref:633 
pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 
RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 
xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [  496.841043] net eth0: rx->offset: 0, size: 
4294967295

Mar 12 02:00:44 backup kernel: [  496.841052] net eth0: cons:1346025 slots:1 
rp:1346038 max:18 err:0 rx->id:232 rx->offset:0 size:4294967295 
ref:-131941395332491 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 
gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:13 cons_changed:1 
cons_before:1346024 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [  496.841067] net eth0: rx->offset: 0, size: 
4294967295
Mar 12 02:00:44 backup kernel: [  496.841074] net eth0: cons:1346025 slots:2 
rp:1346038 max:18 err:-22 rx->id:234 rx->offset:0 size:4294967295 
ref:-131941395332579 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 
gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1 
cons_before:1346024 xennet_get_extras_err:0
Mar 12 02:00:44 backup kernel: [  496.841092] net eth0: rx->offset: 0, size: 
4294967295
Mar 12 02:00:44 backup kernel: [  496.841101] net eth0: cons:1346025 slots:3 
rp:1346038 max:18 err:-22 rx->id:235 rx->offset:0 size:4294967295 
ref:-131941395332408 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 
gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1 
cons_before:1346024 xennet_get_extras_err:0


(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478146 
source_off:0 source_len:4096 op->source.offset:0 op->len:1168
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 
source_off:0 source_len:4096 op->source.offset:0 op->len:2476
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282 
source_off:0 source_len:4096 op->source.offset:0 op->len:1634
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 
source_off:0 source_len:4096 op->source.offset:1634 op->len:1620
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 
gt_version:1 ldom:0 readonly:0 allow_transitive:1
(XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy 
failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497609 
source_off:0 source_len:4096 op->source.offset:0 op->len:4096
(XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 
19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1

- Sometimes (but not always) netfront also spits out:
  dev_warn(dev, "Invalid extra type: %d\n", extra->type);
  where the extra type seems a random value (seen 196, 31 ..)
- Sometimes (but not always) netfront also spits out:
  dev_warn(dev, "Need more slots\n");
- Sometimes (but not always) netfront also spits out:
  dev_warn(dev, "Missing extra info\n");


First question that comes to my mind:
- Are the warnings netfront spits out the cause of xen reporting the bad grant 
reference ?
  Or
  Are the Bad grant references Xen is reporting .. causing netfront to spit out 
the warnings ?
- Why is that "if (rx->flags & XEN_NETRXF_extra_info) {}" part in 
xen-netfront.c doing there and changing cons midway ?
  The commit message from f942dc2552b8bfdee607be867b12a8971bb9cd85 that 
introduced the if says:

    One major change from xen.git is that the guest transmit path (i.e. what
    looks like receive to netback) has been significantly reworked to remove
    the dependency on the out of tree PageForeign page flag (a core kernel
    patch which enables a per page destructor callback on the final
    put_page). This page flag was used in order to implement a grant map
    based transmit path (where guest pages are mapped directly into SKB
    frags). Instead this version of netback uses grant copy operations into
    regular memory belonging to the backend domain. Reinstating the grant
    map functionality is something which I would like to revisit in the
    future.

  It *is* using grant copy now .. so should this part have been removed ?
  And
  Could Paul's commit that seems to be the first to trigger these events affect 
this ?

--
Sander



> Wei.

>> > Paul, do you have any idea what might go wrong?
>> 
>> > Wei.
>> 

Attachment: ethtool-domu-eth0.txt
Description: Text document

Attachment: ethtool-vif.txt
Description: Text document

Attachment: ethtool-xenbridge.txt
Description: Text document

Attachment: grant_table_debug.diff
Description: Binary data

Attachment: ifconfig-domu.txt
Description: Text document

Attachment: kernel-dom0-dotconfig.txt
Description: Text document

Attachment: kernel-dom0-git-log1.txt
Description: Text document

Attachment: netfront-debug.diff
Description: Binary data

Attachment: ifconfig-dom0.txt
Description: Text document

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.