
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles



Thursday, February 27, 2014, 3:18:12 PM, you wrote:

> On Wed, Feb 26, 2014 at 04:11:23PM +0100, Sander Eikelenboom wrote:
>> 
>> Wednesday, February 26, 2014, 10:14:42 AM, you wrote:
>> 
>> 
>> > Friday, February 21, 2014, 7:32:08 AM, you wrote:
>> 
>> 
>> >> On 2014/2/20 19:18, Sander Eikelenboom wrote:
>> >>> Thursday, February 20, 2014, 10:49:58 AM, you wrote:
>> >>>
>> >>>
>> >>>> On 2014/2/19 5:25, Sander Eikelenboom wrote:
>> >>>>> Hi All,
>> >>>>>
>> >>>>> I'm currently having some network troubles with Xen and recent linux 
>> >>>>> kernels.
>> >>>>>
>> >>>>> - When running with a 3.14-rc3 kernel in dom0 and a 3.13 kernel in domU
>> >>>>>     I get what seems to be described in this thread: 
>> >>>>> http://www.spinics.net/lists/netdev/msg242953.html
>> >>>>>
>> >>>>>     In the guest:
>> >>>>>     [57539.859584] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [57539.859599] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [57539.859605] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [57539.859610] net eth0: Need more slots
>> >>>>>     [58157.675939] net eth0: Need more slots
>> >>>>>     [58725.344712] net eth0: Need more slots
>> >>>>>     [61815.849180] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [61815.849205] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [61815.849216] net eth0: rx->offset: 0, size: 4294967295
>> >>>>>     [61815.849225] net eth0: Need more slots
>> >>>> This issue is familiar... and I thought it had been fixed.
>> >>>>   From the original analysis of a similar issue I hit before, the root
>> >>>> cause is that netback still creates a response when the ring is full. I
>> >>>> remember a larger MTU could trigger this issue before; what is the MTU size?
>> >>> In dom0, both the physical NICs and the guest vifs have MTU=1500.
>> >>> In domU, eth0 also has MTU=1500.
>> >>>
>> >>> So it's not jumbo frames .. just the same plain defaults everywhere ..
>> >>>
>> >>> With the patch from Wei that solves the other issue applied
>> >>> (3.14-rc3 + Wei's patch), I'm still seeing the "Need more slots" issue.
>> >>> I have extended the "need more slots" warning to also print cons,
>> >>> slots, max, rx->offset and size; hopefully that gives some more insight.
>> >>> It is indeed the VM where I had similar issues before; the primary
>> >>> thing this VM does is two simultaneous rsyncs (one push, one pull) of
>> >>> some gigabytes of data.
>> >>>
>> >>> This time it was also accompanied by a "grant_table.c:1857:d0 Bad grant
>> >>> reference" as seen below; I don't know whether it's a cause or an effect.
>> 
>> >> The log "grant_table.c:1857:d0 Bad grant reference" was also seen before.
>> >> Probably the response overlaps the request and the grant copy returns an
>> >> error when using a wrong grant reference; netback then returns resp->status
>> >> = XEN_NETIF_RSP_ERROR (-1), which is the 4294967295 printed above by the
>> >> frontend.
>> >> Would it be possible to print a log in netback's xenvif_rx_action to see
>> >> whether something is wrong with the max slots and used slots?
>> 
>> >> Thanks
>> >> Annie
>> 
>> > Looking more closely, these are perhaps 2 different issues ... the bad
>> > grant references do not happen at the same time as the netfront messages
>> > in the guest.
>> 
>> > I added some debug patches to the kernel netback, netfront and Xen
>> > grant-table code (see below).
>> > One of the things was to simplify the code for the debug key that prints
>> > the grant tables; the present code takes too long to execute and brings
>> > down the box due to stalls and NMIs, so it now only prints the number of
>> > entries per domain.
>> 
>> 
>> > Issue 1: grant_table.c:1858:d0 Bad grant reference
>> 
>> > After running the box for just one night (with 15 VMs) I get these
>> > mentions of "Bad grant reference".
>> > The maptrack also seems to increase quite fast, and the number of entries
>> > seems to have gone up quite fast as well.
>> 
>> > Most domains have just one disk (blkfront/blkback) and one NIC; a few
>> > have a second disk.
>> > The blk drivers use persistent grants, so I would assume they would reuse
>> > those and not increase the count (by much).
>> 

> As far as I can tell netfront has a pool of grant references and it
> will BUG_ON() if there are no grefs in the pool when you request one.
> Since your DomU didn't crash, I suspect the book-keeping is still
> intact.
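
(For reference, the book-keeping Wei refers to looks roughly like this on the
RX refill path in drivers/net/xen-netfront.c; this is paraphrased from memory,
so names and details may differ between kernel versions, and "np", "id",
"page" and "req" stand in for the driver's real local variables:)

    /* Claim a gref from the per-device pool; an empty pool crashes the
     * guest via BUG_ON rather than silently reusing a reference. */
    ref = gnttab_claim_grant_reference(&np->gref_rx_head);
    BUG_ON((signed short)ref < 0);
    np->grant_rx_ref[id] = ref;

    /* Grant the backend (otherend_id) access to the page backing this
     * RX request, then put the gref into the ring request. */
    gnttab_grant_foreign_access_ref(ref, np->xbdev->otherend_id,
                                    pfn_to_mfn(page_to_pfn(page)), 0);
    req->gref = ref;
    req->id   = id;

When the response is processed the reference is supposed to go back into the
pool (gnttab_release_grant_reference()), so as Wei says, a leak there would
eventually trip the BUG_ON rather than produce bad grefs.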

>> > Domain 1 seems to have increased its nr_grant_entries from 2048 to 3072
>> > sometime during the night.
>> > Domain 7 is the domain that happens to give the netfront messages.
>> 
>> > I also don't get why it is reporting the "Bad grant reference" for
>> > domain 0, which seems to have 0 active entries ..
>> > Also, is this amount of grant entries "normal", or could it be a leak
>> > somewhere?
>> 

> I suppose Dom0 expanding its maptrack is normal; I see that as well when I
> increase the number of domains. But if it keeps increasing while the
> number of DomUs stays the same then it is not normal.

It keeps increasing (without (re)starting domains), although eventually it
looks like it is settling at around a maptrack size of 31/256 frames.


> Presumably you only have netfront and blkfront using the grant table, and
> your workload as described below involved both, so it would be hard to
> tell which one is faulty.

> There are no immediate functional changes regarding slot counting in this
> dev cycle for the network driver. But there are some changes to
> blkfront/back which seem interesting (memory related).

Hmm, all the times I get a "Bad grant reference" it is related to that one
specific guest.
And it's not doing much blkback/front I/O (it's providing webdav and rsync
to network-based storage (glusterfs)).

Added some more printk's:

@@ -2072,7 +2076,11 @@ __gnttab_copy(
                                       &s_frame, &s_pg,
                                       &source_off, &source_len, 1);
         if ( rc != GNTST_okay )
-            goto error_out;
+            PIN_FAIL(error_out, GNTST_general_error,
+                     "?!?!? src_is_gref: aquire grant for copy failed current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
+                     current->domain->domain_id, op->source.domid, op->dest.domid);
+
+
         have_s_grant = 1;
         if ( op->source.offset < source_off ||
              op->len > source_len )
@@ -2096,7 +2104,11 @@ __gnttab_copy(
                                       current->domain->domain_id, 0,
                                       &d_frame, &d_pg, &dest_off, &dest_len, 1);
         if ( rc != GNTST_okay )
-            goto error_out;
+            PIN_FAIL(error_out, GNTST_general_error,
+                     "?!?!? dest_is_gref: aquire grant for copy failed current_dom_id:%d src_dom_id:%d dest_dom_id:%d\n",
+                     current->domain->domain_id, op->source.domid, op->dest.domid);
+
+
         have_d_grant = 1;

This comes out:

(XEN) [2014-02-27 02:34:37] grant_table.c:2109:d0 ?!?!? dest_is_gref: aquire grant for copy failed current_dom_id:0 src_dom_id:32752 dest_dom_id:7
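
Note: 32752 is 0x7ff0, i.e. DOMID_SELF, so the source of the failing copy is
dom0 itself and it is the destination gref (for domain 7) that cannot be
acquired. That matches what netback's guest-RX path does, roughly (paraphrased
from drivers/net/xen-netback/netback.c, xenvif_gop_frag_copy(); the field
names are from memory and should be treated as a sketch, not the exact code):

    /* Sketch: how netback fills a grant-copy op for guest RX.
     * "op" is a struct gnttab_copy, "req->gref" is the grant reference
     * the frontend put into its RX ring request. */
    op->flags         = GNTCOPY_dest_gref;
    op->source.domid  = DOMID_SELF;            /* 0x7ff0 == 32752 in the log  */
    op->source.u.gmfn = virt_to_mfn(page_address(page));
    op->source.offset = offset;
    op->dest.domid    = vif->domid;            /* dest_dom_id:7 above         */
    op->dest.u.ref    = req->gref;             /* frontend-provided reference */
    op->dest.offset   = 0;
    op->len           = len;

So if that gref is stale or out of range on the frontend side, the hypervisor
would fail exactly this destination acquire.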


> My suggestion is, if you have a working baseline, you can try to set up
> different frontend / backend combinations to help narrow down the
> problem.

Will see what I can do after the weekend.

> Wei.

<snip>



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

