
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles



Thursday, February 27, 2014, 3:18:12 PM, you wrote:

> On Wed, Feb 26, 2014 at 04:11:23PM +0100, Sander Eikelenboom wrote:
>> 
>> Wednesday, February 26, 2014, 10:14:42 AM, you wrote:
>> 
>> 
>> > Friday, February 21, 2014, 7:32:08 AM, you wrote:
>> 
>> 
>> >> On 2014/2/20 19:18, Sander Eikelenboom wrote:
>> >>> Thursday, February 20, 2014, 10:49:58 AM, you wrote:
>> >>>
>> >>>
>> >>>> On 2014/2/19 5:25, Sander Eikelenboom wrote:
>> >>>>> Hi All,
>> >>>>>
>> >>>>> I'm currently having some network troubles with Xen and recent linux 
>> >>>>> kernels.
>> >>>>>
>> >>>>> - When running with a 3.14-rc3 kernel in dom0 and a 3.13 kernel in domU
>> >>>>>     I get what seems to be described in this thread: 
>> >>>>> http://www.spinics.net/lists/netdev/msg242953.html
>> >>>>>
>> >>>>>     In the guest:
>> >>>>>     [57539.859584] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [57539.859599] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [57539.859605] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [57539.859610] net eth0: Need more slots
>> >>>>>     [58157.675939] net eth0: Need more slots
>> >>>>>     [58725.344712] net eth0: Need more slots
>> >>>>>     [61815.849180] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [61815.849205] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [61815.849216] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [61815.849225] net eth0: Need more slots
>> >>>> This issue is familiar... and I thought it had been fixed.
>> >>>>   From the original analysis of a similar issue I hit before, the root
>> >>>> cause is that netback still creates a response when the ring is full. I
>> >>>> remember a larger MTU could trigger this issue before; what is the MTU size?
>> >>> In dom0, both the physical NICs and the guest vifs have MTU=1500.
>> >>> In domU, eth0 also has MTU=1500.
>> >>>
>> >>> So it's not jumbo frames .. just the same plain defaults everywhere ..
>> >>>
>> >>> With the patch from Wei that solves the other issue applied
>> >>> (3.14-rc3 + Wei's patch), I'm still seeing the "Need more slots" issue.
>> >>> I have extended the "need more slots" warning to also print cons,
>> >>> slots, max, rx->offset and size; hopefully that gives some more insight.
>> >>> It is indeed the VM where I had similar issues before; the primary
>> >>> thing this VM does is two simultaneous rsyncs (one push, one pull) of
>> >>> some gigabytes of data.
>> >>>
>> >>> This time it was also accompanied by a "grant_table.c:1857:d0 Bad grant
>> >>> reference" as seen below; I don't know whether it's a cause or an effect.
>> 
>> >> The log "grant_table.c:1857:d0 Bad grant reference" was also seen before.
>> >> Probably the response overlaps the request and the grant copy returns an
>> >> error when using a wrong grant reference; netback then returns resp->status
>> >> = XEN_NETIF_RSP_ERROR (-1), which is the 4294967295 printed above by the
>> >> frontend.
>> >> Would it be possible to print a log in netback's xenvif_rx_action to see
>> >> whether something is wrong with the max slots and used slots?
>> 
>> >> Thanks
>> >> Annie
>> 
>> > Looking more closely, these are perhaps 2 different issues ... the bad
>> > grant references do not happen at the same time as the netfront messages
>> > in the guest.
>> 
>> > I added some debug patches to the kernel netback, netfront and Xen
>> > grant-table code (see below).
>> > One of the things was to simplify the code for the debug key that prints
>> > the grant tables; the present code takes too long to execute and brings
>> > down the box due to stalls and NMIs, so it now only prints the number of
>> > entries per domain.
>> 
>> 
>> > Issue 1: grant_table.c:1858:d0 Bad grant reference
>> 
>> > After running the box for just one night (with 15 VMs) I get these
>> > mentions of "Bad grant reference".
>> > The maptrack also seems to increase quite fast, and the number of entries
>> > seems to have gone up quite fast as well.
>> 
>> > Most domains have just one disk (blkfront/blkback) and one NIC; a few
>> > have a second disk.
>> > The blk drivers use persistent grants, so I would assume they would reuse
>> > those and not increase the count (by much).
>> 

> As far as I can tell netfront has a pool of grant references and it
> will BUG_ON() if there are no grefs in the pool when you request one.
> Since your DomU didn't crash, I suspect the book-keeping is still
> intact.
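
(For reference, the book-keeping Wei refers to looks roughly like this on the
RX refill path in drivers/net/xen-netfront.c; this is paraphrased from memory,
so names and details may differ between kernel versions, and "np", "id",
"page" and "req" stand in for the driver's real local variables:)

    /* Claim a gref from the per-device pool; an empty pool crashes the
     * guest via BUG_ON rather than silently reusing a reference. */
    ref = gnttab_claim_grant_reference(&np->gref_rx_head);
    BUG_ON((signed short)ref < 0);
    np->grant_rx_ref[id] = ref;

    /* Grant the backend (otherend_id) access to the page backing this
     * RX request, then put the gref into the ring request. */
    gnttab_grant_foreign_access_ref(ref, np->xbdev->otherend_id,
                                    pfn_to_mfn(page_to_pfn(page)), 0);
    req->gref = ref;
    req->id   = id;

When the response is processed the reference is supposed to go back into the
pool (gnttab_release_grant_reference()), so as Wei says, a leak there would
eventually trip the BUG_ON rather than produce bad grefs.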

>> > Domain 1 seems to have increased its nr_grant_entries from 2048 to 3072
>> > sometime during the night.
>> > Domain 7 is the domain that happens to give the netfront messages.
>> 
>> > I also don't get why it is reporting the "Bad grant reference" for
>> > domain 0, which seems to have 0 active entries ..
>> > Also, is this amount of grant entries "normal", or could it be a leak
>> > somewhere?
>> 

> I suppose Dom0 expanding its maptrack is normal; I see that as well when I
> increase the number of domains. But if it keeps increasing while the
> number of DomUs stays the same then it is not normal.

It keeps increasing (without (re)starting domains), although eventually it
looks like it is settling at around a maptrack size of 31/256 frames.


> Presumably you only have netfront and blkfront using the grant table, and
> your workload as described below involved both, so it would be hard to
> tell which one is faulty.

> There are no immediate functional changes regarding slot counting in this
> dev cycle for the network driver. But there are some changes to
> blkfront/back which seem interesting (memory related).

Hmm, all the times I get a "Bad grant reference" it is related to that one
specific guest.
And it's not doing much blkback/front I/O (it's providing webdav and rsync
to network-based storage (glusterfs)).

Added some more printk's:

@@ -2072,7 +2076,11 @@ __gnttab_copy(
                                       &s_frame, &s_pg,
                                       &source_off, &source_len, 1);
         if ( rc != GNTST_okay )
-            goto error_out;
+            PIN_FAIL(error_out, GNTST_general_error,
+                     "?!?!? src_is_gref: aquire grant for copy failed current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
+                     current->domain->domain_id, op->source.domid, op->dest.domid);
+
+
         have_s_grant = 1;
         if ( op->source.offset < source_off ||
              op->len > source_len )
@@ -2096,7 +2104,11 @@ __gnttab_copy(
                                       current->domain->domain_id, 0,
                                       &d_frame, &d_pg, &dest_off, &dest_len, 1);
         if ( rc != GNTST_okay )
-            goto error_out;
+            PIN_FAIL(error_out, GNTST_general_error,
+                     "?!?!? dest_is_gref: aquire grant for copy failed current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
+                     current->domain->domain_id, op->source.domid, op->dest.domid);
+
+
         have_d_grant = 1;

This comes out:

(XEN) [2014-02-27 02:34:37] grant_table.c:2109:d0 ?!?!? dest_is_gref: aquire grant for copy failed current_dom_id:0 src_dom_id:32752 dest_dom_id:7
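
Note: 32752 is 0x7ff0, i.e. DOMID_SELF, so the source of the failing copy is
dom0 itself and it is the destination gref (for domain 7) that cannot be
acquired. That matches what netback's guest-RX path does, roughly (paraphrased
from drivers/net/xen-netback/netback.c, xenvif_gop_frag_copy(); the field
names are from memory and should be treated as a sketch, not the exact code):

    /* Sketch: how netback fills a grant-copy op for guest RX.
     * "op" is a struct gnttab_copy, "req->gref" is the grant reference
     * the frontend put into its RX ring request. */
    op->flags         = GNTCOPY_dest_gref;
    op->source.domid  = DOMID_SELF;            /* 0x7ff0 == 32752 in the log  */
    op->source.u.gmfn = virt_to_mfn(page_address(page));
    op->source.offset = offset;
    op->dest.domid    = vif->domid;            /* dest_dom_id:7 above         */
    op->dest.u.ref    = req->gref;             /* frontend-provided reference */
    op->dest.offset   = 0;
    op->len           = len;

So if that gref is stale or out of range on the frontend side, the hypervisor
would fail exactly this destination acquire.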


> My suggestion is, if you have a working baseline, you can try to set up
> different frontend / backend combinations to help narrow down the
> problem.

Will see what I can do after the weekend.

> Wei.

<snip>



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

