[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



Wednesday, March 12, 2014, 1:04:44 PM, you wrote:

> On Wed, Mar 12, 2014 at 12:45:16PM +0100, Sander Eikelenboom wrote:
> [...]
>> >> 
>> >> - Sometimes (but not always) netfront also spits out:
>> >>   dev_warn(dev, "Invalid extra type: %d\n", extra->type);
>> >>   where the extra type seems a random value (seen 196, 31 ..)
>> >> - Sometimes (but not always) netfront also spits out:
>> >>   dev_warn(dev, "Need more slots\n");
>> >> - Sometimes (but not always) netfront also spits out:
>> >>   dev_warn(dev, "Missing extra info\n");
>> >> 
>> 
>> > That's because garbage is pushed to frontend. It's trying to decode some
>> > random values.
>> 
>> >> 
>> >> First question that comes to my mind:
>> >> - Are the warnings netfront spits out the cause of xen reporting the bad 
>> >> grant reference ?
>> >>   Or
>> >>   Are the Bad grant references Xen is reporting .. causing netfront to 
>> >> spit out the warnings ?
>> 
>> > No. They are not directly connected.
>> 
>> > 0. backend breaks down a skb into slots.
>> > 1. backend gets a request from the rx ring (a slot with gref provided by
>> >    frontend), until all slots in skb are handled.
>> > 2. backend grant-copies data to that slot (with gref).
>> > 3. backend pushes a response to frontend, whose content depends on the
>> >    result of previous step.
>> > 4. frontend handles that response.
>> 
>> > So the grant copy error in hypervisor you're seeing is in 2. The
>> > complaint from frontend is in 4, when backend pushes a response with
>> > error status in it. The bug is most likely in 0, when the calculation
>> > goes wrong - either in the breakdown process, or in the macro that
>> > returns usable slots (this can lead to backend thinks there's enough
>> > slots while there's not).
>> 
>> Ok so perhaps a debug patch which does more sanity checking there ?
>> Because i don't see any errors/warnings from netback in dom0.
>> 

> From the look of the code, 0 looks correct to me. Netback won't complain
> about "bad gref" because it has no idea what a "good gref" looks like.
> Only Xen has the knowledge whether a gref is legit. All netback sees is
> the hypercall fails and it tries to push corresponding response to
> netfront. But you can probably safely assume a gref larger than several
> thousands be bad.

OK it's indeed netback setting the status to -1:

[  976.431585] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 
meta_slots_used:1 flags:3 size:958
[ 1030.855057] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 
meta_slots_used:2 flags:7 size:2974
[ 1030.861759] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? 
status:-1 i:0 nr_meta_slots:1 flags:0 size:1460
[ 1030.868499] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 
meta_slots_used:1 flags:3 size:958
[ 1030.875278] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 
meta_slots_used:1 flags:3 size:2974
[ 1068.199650] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 
meta_slots_used:8 flags:7 size:1634
[ 1068.206479] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? 
status:-1 i:0 nr_meta_slots:7 flags:4 size:2848
[ 1068.213158] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? 
status:-1 i:1 nr_meta_slots:7 flags:4 size:4096
[ 1068.219701] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? 
status:-1 i:2 nr_meta_slots:7 flags:4 size:4096
[ 1068.226194] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? 
status:-1 i:3 nr_meta_slots:7 flags:4 size:4096
[ 1068.232728] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? 
status:-1 i:4 nr_meta_slots:7 flags:4 size:4096
[ 1068.239163] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? 
status:-1 i:5 nr_meta_slots:7 flags:4 size:4096
[ 1068.245397] vif vif-7-0 vif7.0: ?!? xenvif_add_frag_responses status err? 
status:-1 i:6 nr_meta_slots:7 flags:0 size:1168
[ 1068.251457] vif vif-7-0 vif7.0: ?!? xenvif_rx_action status err? status:-1 
meta_slots_used:1 flags:3 size:958

Now to find out why ..

>> >> - Why is that "if (rx->flags & XEN_NETRXF_extra_info) {}" part in 
>> >> xen-netfront.c doing there and changing cons midway ?
>> >>   The commit message from f942dc2552b8bfdee607be867b12a8971bb9cd85 that 
>> >> introduced the if says:
>> >> 
>> >>     One major change from xen.git is that the guest transmit path (i.e. 
>> >> what
>> >>     looks like receive to netback) has been significantly reworked to 
>> >> remove
>> >>     the dependency on the out of tree PageForeign page flag (a core kernel
>> >>     patch which enables a per page destructor callback on the final
>> >>     put_page). This page flag was used in order to implement a grant map
>> >>     based transmit path (where guest pages are mapped directly into SKB
>> >>     frags). Instead this version of netback uses grant copy operations 
>> >> into
>> >>     regular memory belonging to the backend domain. Reinstating the grant
>> >>     map functionality is something which I would like to revisit in the
>> >>     future.
>> >> 
>> >>   It *is* using grant copy now .. so should this part have been removed ?
>> >>   And
>> >>   Could Paul's commit that seems to be the first to trigger these events 
>> >> affect this ?
>> >> 
>> 
>> > This depicts guest *transmit* path, but the issue you're seeing is in
>> > guest *receive* path. So that's totally different thing.
>> 
>> Errr yes well if the (rx->flags & XEN_NETRXF_extra_info) {}) should only be 
>> doing something on the transmit path ..
>> why does it seem to result to true on my issue ?  (see cons_changed=1 in the 
>> debub output.. and you see cons being cons + 1 after that)

> First thing is the response goes wild and probably contains random value
> so it's possible the flag is "set".

> For a legit response, if you see that flag it means there's an "extra
> info" slot in the response. That code snippet is to extract that extra
> info slot (could be several in a row, that's why there's a loop in
> xennet_get_extras). The request / response format is documented in
> include/xen/interface/io/netif.h -- it's only the tx one, but rx one is
> more or less the same.

> Wei.

>> 
>> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.