[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"
Wednesday, March 12, 2014, 2:42:43 AM, you wrote: > Tuesday, March 11, 2014, 4:36:16 PM, you wrote: >> On Tue, Mar 11, 2014 at 02:00:41PM +0100, Sander Eikelenboom wrote: >> [...] >>> >> the issue when using 3.13.6 as a base and .. >>> >> - pull all 3.14 patches from the >>> >> git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git tree >>> >> - apply paul's commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c >>> >> xen-netback: improve guest-receive-side flow control" >>> >> - applying annie's v2 patch >>> >> - applying your patch >>> >> as dom0 and using a 3.14-rc5 as domU kernel. >>> >> >>> >> Unfortunately i'm still getting the Bad grant references .. >>> >> >>> >>> > :-( That's bad news. >>> >>> > I guess you always have the same DomU kernel when testing? That means we >>> > can narrow down the bug to netback only. >>> >>> Yes my previous tests (from my previous mail): >>> >>> - First testing a baseline that worked o.k. for several days (3.13.6 for >>> both dom0 and domU) >>> - Testing domU 3.14-rc5 and dom0 3.13.6, this worked ok. >>> - Testing dom0 3.14-rc5 and domU 3.13.6, this failed. >>> - After that took 3.13.6 as base and first applied all the general xen >>> related patches for the dom0 kernel, that works ok. >>> - After that started to apply the netback changes for 3.14 and that failed >>> after the commit "ca2f09f2b2c6c25047cfc545d057c4edfcfe561c xen-netback: >>> improve guest-receive-side flow control". >>> >>> Also seem to indicate just that, although it could also be something in >>> this netback commit that triggers a latent bug in netfront, can't rule that >>> one out completly. >>> >>> But the trigger is in that commit && >>> annie's and your patch seem to have no effect at all( on this issue) && >>> later commits in 3.14 do seems to mask it / make it less likely to trigger, >>> but do not fix it. >>> >> Unfortunately I've stared at the same piece of code for some time but >> don't have immediate clue. Later commits don't look suspecious either. >> I also looked at netfront code, but there's no slot couting change >> between 3.13 and 3.14. >> Do you have some straight setup instructions so that I can try to >> reproduce. > Ok you asked for it .. so here we go .. ;-) : > - Xen-unstable > - DomU: > - PV guest > - Debian Wheezy > - Kernel: 3.14-rc5 (.config see dom0 .config) > - Running netserver > - Dom0: > - Debian Wheezy > - Kernel: 3.13.6 + additional patches (see attached git log and .config) > - 2 physical NIC's > - Autoballooning is prevented by using dom0_mem=1536M,max:1536M for xen > and mem=1536M for dom0 kernel stanzas in grub > - vcpu 0 is pinned on pcpu 0 and exclusively for dom0 > - Networking: > - Routed bridge > - eth0 = internet > - eth1 = lan 172.16.x.x > - xen_bridge = bridge for VM's 192.168.1.x > - iptables NAT and routing > - attached dom0 and domU ifconfig output > - attached ethtool -k output for the bridge, vif and guest eth0 > Triggering workload: > - Well that's were the problem is :-) > - The Guest has a normal disk and swap (phy/lvm) and > shared storage with glusterfs (glusterfs server is on dom0) > - The Guest exposes this storage via webdavs > What triggers it is: > - The Guest runs it's rsync of the shared storage to a remote client on > the internet > So this causes traffic from Dom0 to domU (reading storage) .. > back from domU to dom0 and via iptables NAT on to eth0 (actual rsync) > or vice versa when syncing the other way around > - At the same time do a backup from a windows machine from the lan to > the webdavs server > So this causes traffic from eth1 to domU (webdav) .. > back from domU to dom0 (writing to the shared storage) > So in essence it is doing quite some netback && netfront stress testing :-) > It seems to only trigger when doing both the rsync and the webdav > simultaneous. > I tried my best to emulate any of this with netperf (and multiple instances), > i tried with various (odd) packet sizes and the packet / byte rates > transmitted > are higher then with the workload above ... but it doesn't seem to trigger > with netpref > So i don't think it will be easy to replicate ... > Perhaps running through the available logging again .. and try to answer some > questions ... this is just with one guest running kernels as before only > added debugging to netfront and xen (diffs attached): > Mar 12 02:00:44 backup kernel: [ 496.840646] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.840665] net eth0: cons:1346005 slots:1 > rp:1346013 max:18 err:0 rx->id:212 rx->offset:0 size:4294967295 ref:572 > pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 > RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 > xennet_get_extras_err:0 > Mar 12 02:00:44 backup kernel: [ 496.840680] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.840687] net eth0: cons:1346005 slots:2 > rp:1346013 max:18 err:-22 rx->id:214 rx->offset:0 size:4294967295 ref:657 > pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 > RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 > xennet_get_extras_err:0 > Mar 12 02:00:44 backup kernel: [ 496.840701] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.840712] net eth0: cons:1346005 slots:3 > rp:1346013 max:18 err:-22 rx->id:215 rx->offset:0 size:4294967295 ref:667 > pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 > RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 > xennet_get_extras_err:0 > Mar 12 02:00:44 backup kernel: [ 496.840733] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.840740] net eth0: cons:1346005 slots:4 > rp:1346013 max:18 err:-22 rx->id:216 rx->offset:0 size:4294967295 ref:716 > pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 > RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 > xennet_get_extras_err:0 > Mar 12 02:00:44 backup kernel: [ 496.840757] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.840764] net eth0: cons:1346005 slots:5 > rp:1346013 max:18 err:-22 rx->id:217 rx->offset:0 size:4294967295 ref:755 > pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 > RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 > xennet_get_extras_err:0 > Mar 12 02:00:44 backup kernel: [ 496.840778] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.840784] net eth0: cons:1346005 slots:6 > rp:1346013 max:18 err:-22 rx->id:218 rx->offset:0 size:4294967295 ref:592 > pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 > RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 > xennet_get_extras_err:0 > Mar 12 02:00:44 backup kernel: [ 496.840801] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.840807] net eth0: cons:1346005 slots:7 > rp:1346013 max:18 err:-22 rx->id:219 rx->offset:0 size:4294967295 ref:633 > pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 gso_type:0 gso_segs:0 > RING_HAS_UNCONSUMED_RESPONSES:9 cons_changed:1 cons_before:1346004 > xennet_get_extras_err:0 > Mar 12 02:00:44 backup kernel: [ 496.841043] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.841052] net eth0: cons:1346025 slots:1 > rp:1346038 max:18 err:0 rx->id:232 rx->offset:0 size:4294967295 > ref:-131941395332491 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 > gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:13 cons_changed:1 > cons_before:1346024 xennet_get_extras_err:0 > Mar 12 02:00:44 backup kernel: [ 496.841067] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.841074] net eth0: cons:1346025 slots:2 > rp:1346038 max:18 err:-22 rx->id:234 rx->offset:0 size:4294967295 > ref:-131941395332579 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 > gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1 > cons_before:1346024 xennet_get_extras_err:0 > Mar 12 02:00:44 backup kernel: [ 496.841092] net eth0: rx->offset: 0, size: > 4294967295 > Mar 12 02:00:44 backup kernel: [ 496.841101] net eth0: cons:1346025 slots:3 > rp:1346038 max:18 err:-22 rx->id:235 rx->offset:0 size:4294967295 > ref:-131941395332408 pagesize:4096 skb_ipsummed:0 is_gso:0 gso_size:0 > gso_type:0 gso_segs:0 RING_HAS_UNCONSUMED_RESPONSES:29 cons_changed:1 > cons_before:1346024 xennet_get_extras_err:0 > (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference > 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1 > (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy > failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478146 > source_off:0 source_len:4096 op->source.offset:0 op->len:1168 > (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference > 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1 > (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy > failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 > source_off:0 source_len:4096 op->source.offset:0 op->len:2476 > (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference > 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1 > (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy > failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282 > source_off:0 source_len:4096 op->source.offset:0 op->len:1634 > (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference > 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1 > (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy > failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 > source_off:0 source_len:4096 op->source.offset:1634 op->len:1620 > (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference > 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1 > (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy > failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497609 > source_off:0 source_len:4096 op->source.offset:0 op->len:4096 > (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference > 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1 Hrrmm sorry seems i was incomplete in copying and pasting the accompanying xl dmesg part: (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478146 source_off:0 source_len:4096 op->source.offset:0 op->len:1168 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 source_off:0 source_len:4096 op->source.offset:0 op->len:2476 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282 source_off:0 source_len:4096 op->source.offset:0 op->len:1634 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497610 source_off:0 source_len:4096 op->source.offset:1634 op->len:1620 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5497609 source_off:0 source_len:4096 op->source.offset:0 op->len:4096 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489800 source_off:0 source_len:4096 op->source.offset:0 op->len:4096 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489799 source_off:0 source_len:4096 op->source.offset:0 op->len:4096 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 194904079 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489798 source_off:0 source_len:4096 op->source.offset:0 op->len:4096 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 224788480 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489796 source_off:0 source_len:4096 op->source.offset:0 op->len:4096 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489795 source_off:0 source_len:4096 op->source.offset:0 op->len:4096 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489794 source_off:0 source_len:4096 op->source.offset:0 op->len:2892 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5478282 source_off:0 source_len:4096 op->source.offset:0 op->len:1634 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325377 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489794 source_off:0 source_len:4096 op->source.offset:1634 op->len:1204 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 4325379 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5489793 source_off:0 source_len:4096 op->source.offset:0 op->len:4096 (XEN) [2014-03-12 01:00:44] grant_table.c:1856:d0v2 Bad grant reference 19791875 gt_version:1 ldom:0 readonly:0 allow_transitive:1 (XEN) [2014-03-12 01:00:44] grant_table.c:2100:d0v2 acquire_grant_for_copy failed .. dest_is_gref rc:-3 source.domid:32752 dest.domid:1 s_frame:5490944 source_off:0 source_len:4096 op->source.offset:0 op->len:3268 > - Sometimes (but not always) netfront also spits out: > dev_warn(dev, "Invalid extra type: %d\n", extra->type); > where the extra type seems a random value (seen 196, 31 ..) > - Sometimes (but not always) netfront also spits out: > dev_warn(dev, "Need more slots\n"); > - Sometimes (but not always) netfront also spits out: > dev_warn(dev, "Missing extra info\n"); > First question that comes to my mind: > - Are the warnings netfront spits out the cause of xen reporting the bad > grant reference ? > Or > Are the Bad grant references Xen is reporting .. causing netfront to spit > out the warnings ? > - Why is that "if (rx->flags & XEN_NETRXF_extra_info) {}" part in > xen-netfront.c doing there and changing cons midway ? > The commit message from f942dc2552b8bfdee607be867b12a8971bb9cd85 that > introduced the if says: > One major change from xen.git is that the guest transmit path (i.e. what > looks like receive to netback) has been significantly reworked to remove > the dependency on the out of tree PageForeign page flag (a core kernel > patch which enables a per page destructor callback on the final > put_page). This page flag was used in order to implement a grant map > based transmit path (where guest pages are mapped directly into SKB > frags). Instead this version of netback uses grant copy operations into > regular memory belonging to the backend domain. Reinstating the grant > map functionality is something which I would like to revisit in the > future. > It *is* using grant copy now .. so should this part have been removed ? > And > Could Paul's commit that seems to be the first to trigger these events > affect this ? > -- > Sander >> Wei. >>> > Paul, do you have any idea what might go wrong? >>> >>> > Wei. >>> _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |