
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



Wednesday, March 19, 2014, 12:35:32 PM, you wrote:

> On Wed, Mar 19, 2014 at 12:11:04AM +0100, Sander Eikelenboom wrote:
> [...]
>> 
>> >>> Before Paul's change, we always reserve very large room for an incoming
>> >>> SKB. After Paul's change, we only reserve just enough room. Probably
>> >>> some extra room prevents this bug being triggered.
>> 
>> >> [  599.970745] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
>> >> npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387 
>> >> vif->rx.sring->req_event:504174
>> >> [  599.972487] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! 
>> >>  min_slots_needed:1 vif->rx.sring->req_prod:506387 
>> >> vif->rx.req_cons:506387 vif->rx.sring->req_event:506388
>> >> [  600.044322] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
>> >> npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 
>> >> req->gref:165543936 req->id:19 vif->rx.sring->req_event:506388
>> >> [  600.081167] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
>> >> npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 
>> >> npo->copy_gref:165543936  npo->copy_off:0  MAX_BUFFER_OFFSET:4096 
>> >> bytes:1168 size:1168 i:6 vif->rx.sring->req_event:506388 
>> >> estimated_slots_needed:8
>> >> [  600.118268] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end 
>> >> npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 
>> >> npo->copy_gref:165543936 npo->copy_off:1168  MAX_BUFFER_OFFSET:4096 
>> >> bytes:1168 size:0 i:7 vif->rx.sring->req_event:506388 
>> >> estimated_slots_needed:8
>> >> [  600.155378] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 
>> >> npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 
>> >> vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 
>> >> req->id:11 estimated_slots_needed:8 i(frag): 0
>> >> [  600.192438] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5 
>> >> npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 
>> >> vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 
>> >> req->id:11 estimated_slots_needed:8
>> >> [  600.229395] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 ..  
>> >> vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 
>> >> sco->meta_slots_used:8 max_upped_gso:1 skb_is_gso(skb):1 
>> >> max_slots_needed:8 j:3 is_gso:1 nr_frags:1 firstpart:1 secondpart:6 
>> >> min_slots_needed:3
>> >> [  600.266518] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 1 ..  
>> >> vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388                    
>> >>      max_upped_gso:1 skb_is_gso(skb):0 max_slots_needed:1 j:4 is_gso:0 
>> >> nr_frags:0 firstpart:1 secondpart:0 min_slots_needed:1
>> 
>> >> It seems to estimate 8 slots and need 8 slots ... however .. shouldn't 
>> >> the queue have been stopped before getting here ..
>> 
>> > *hrmm please just ignore* .. got to get some sleep i guess ..
>> 
>> Just went the empirical way .. and unconditionally upped the calculated 
>> "max_slots_needed" by one in "xenvif_rx_action"  just before checking the 
>> "xenvif_rx_ring_slots_available",
>> this has prevented all non-fatal and fatal occurrences of "cons > prod" so 
>> far. I will leave my tests running overnight, see if it survives the 
>> pounding.
>> 
>> From other net drivers i see different calculations .. seems it is kind of 
>> voodoo to determine the value .. one of which 
>> (drivers/net/ethernet/marvell/sky2.c)
>> seems to unconditionally reserve a slot for both GSO and CKSUM. Don't know 
>> if that makes any sense though:
>> 
>> /* This is the worst case number of transmit list elements for a single skb:
>>    VLAN:GSO + CKSUM + Data + skb_frags * DMA */
>> #define MAX_SKB_TX_LE   (2 + 
>> (sizeof(dma_addr_t)/sizeof(u32))*(MAX_SKB_FRAGS+1))
>> 

> Xen network "wire" format doesn't have a CKSUM metadata type though.

>> 
>> 
>> >>> Wei.
>> 
>> >>>> 
>> >>>> >> The second time it does get to the code after the RING_GET_REQUEST 
>> >>>> >> in 'get_next_rx_buffer' and that leads to mayhem ...
>> >>>> >> 
>> >>>> >> So added a netdev_warn to xenvif_start_xmit for the "stop queue" 
>> >>>> >> case .. unfortunately it now triggers net_ratelimit at the end:
>> >>>> >> 
>> >>>> >> [  402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping 
>> >>>> >> queue !  min_slots_needed:7 vif->rx.sring->req_prod:189228 
>> >>>> >> vif->rx.req_cons:189222
>> >>>> 
>> >>>> > I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
>> >>>> > req_cons < needed, the queue is stopped.
>> 
>> >> So it seems .. most of the time .. but if i look at the calculation of 
>> >> "min_slots_needed" in this function it seems completely different from 
>> >> the one in
>> >> xenvif_rx_action for max_slots_needed .. though both seem to be used for 
>> >> the same thing .. to calculate how many slots the broken-down SKB would 
>> >> need to fit in ..
>> >> So i added the calculation method from xenvif_start_xmit to 
>> >> xenvif_rx_action .. in the error case you see min_slots_needed was 3 .. 
>> >> but max_slots_needed and max_slots_used both were 8.
>> 

> Those are different estimations. max_slots_needed estimates the worst
> case while min_slots_needed estimates the best case. When
> min_slots_needed is met, netback puts the SKB into its internal queue.
> xenvif_rx_action then will dequeue that packet, check max_slots_needed,
> if met, break SKB down.

I realised that. If you don't count the cost of the calculation, the best would 
be to have min_slots and max_slots the same .. and correct. But the cost of the 
calculation does count ..
On the other hand, min_slots seems to be far too low in this case. Is 
nr_frags really the best thing to check for ?
Wouldn't something like the skb size / PAGE_SIZE (rounded up) be a better 
estimate ?


> What I would expect is, even if they yield different values, checking
> the ring availability is enough before breaking SKB down.

> In your above case, max_slots_needed was met and SKB broken down. And as
> you said in your empirical test, bumping max_slots_needed by 1 prevented
> issues, then we might have problem calculating max_slots_needed. If you
> can work out what went wrong that can be very useful.

OK, I ran the whole night with "max_slots_needed + 1" and saw no triggering of 
either:
    - the non-fatal oddity (only a temporary cons > prod ..)
    - the fatal case (leading to the grantref error)

So it seems we are off by one .. somewhere


First, the non-fatal oddity .. this one I really can't grasp:

[  567.880282] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:562972 vif->rx.req_cons:562969 
vif->rx.sring->req_event:562973
[  569.893801] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:571188 vif->rx.req_cons:571187 
vif->rx.sring->req_event:571189
[  571.892419] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:577141 vif->rx.req_cons:577138 
vif->rx.sring->req_event:577142
[  575.451383] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
npo->meta_prod:6 vif->rx.sring->req_prod:590603 vif->rx.req_cons:590603 
vif->rx.sring->req_event:589615, reserved_slots_left:0
[  575.487721] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
npo->meta_prod:6 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590604 
req->gref:569 req->id:11 vif->rx.sring->req_event:589615 reserved_slots_left:-1
[  575.524427] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590660 
vif->rx.sring->req_event:590661
[  576.508063] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
npo->meta_prod:16 vif->rx.sring->req_prod:594343 vif->rx.req_cons:594343 
vif->rx.sring->req_event:591619, reserved_slots_left:0
[  576.544708] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
npo->meta_prod:16 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594344 
req->gref:668 req->id:167 vif->rx.sring->req_event:591619 reserved_slots_left:-1
[  576.581595] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594389 
vif->rx.sring->req_event:594390
[  577.325826] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:2 vif->rx.sring->req_prod:597248 vif->rx.req_cons:597247 
vif->rx.sring->req_event:597249
[  577.576973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:598146 vif->rx.req_cons:598143 
vif->rx.sring->req_event:598147

What I don't get is:
- I warn just before the RING_GET_REQUEST with vif->rx.sring->req_prod:590603 
vif->rx.req_cons:590603 .. so cons == prod
- I warn because req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++); is 
about to add one to cons .. leading to a situation that should not exist ..
- However, in the warning just after this RING_GET_REQUEST, prod has somehow 
increased a lot .. probably the "next_rx_buffer" aspect of things .. so cons is 
much smaller than prod again ..
- However, I don't see anything in the RING_GET_REQUEST macro fiddling with 
vif->rx.sring->req_prod ..
- So who has been fiddling with vif->rx.sring->req_prod in the meantime .. my 
very limited C skills fail to see how this works :-)


Then the fatal case ...

[ 1108.566486] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:1 vif->rx.sring->req_prod:1843278 vif->rx.req_cons:1843278 
vif->rx.sring->req_event:1843279
[ 1109.775757] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:1847779 vif->rx.req_cons:1847776 
vif->rx.sring->req_event:1847780
[ 1110.041661] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:1848795 vif->rx.req_cons:1848794 
vif->rx.sring->req_event:1848796
[ 1110.301778] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:4 vif->rx.sring->req_prod:1849755 vif->rx.req_cons:1849754 
vif->rx.sring->req_event:1849756
[ 1111.565412] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291 
vif->rx.sring->req_event:1853482, reserved_slots_left:0
[ 1111.565973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
min_slots_needed:1 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291 
vif->rx.sring->req_event:1854292
[ 1111.638425] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 
req->gref:4325377 req->id:83 vif->rx.sring->req_event:1854292 
reserved_slots_left:-1
[ 1111.675042] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 
npo->copy_gref:4325377  npo->copy_off:0  MAX_BUFFER_OFFSET:4096 bytes:1168 
size:1168 i:5 vif->rx.sring->req_event:1854292 estimated_slots_needed:7 
reserved_slots_left:-1
[ 1111.711833] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end 
npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 
npo->copy_gref:4325377 npo->copy_off:1168  MAX_BUFFER_OFFSET:4096 bytes:1168 
size:0 i:6 vif->rx.sring->req_event:1854292 gso_gaps:0 estimated_slots_needed:7 
reserved_slots_left:-1
[ 1111.766836] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 after 
npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291 
vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1 
req->gref:674 req->id:76 estimated_slots_needed:7 i(frag):0 j(data):1 
reserved_slots_left:-1
[ 1111.803702] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5 
npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291 
vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1 
req->gref:674 req->id:76 estimated_slots_needed:7 j(data):1 
reserved_slots_left:-1    used in funcstart: 0 + 1 .. used_dataloop:1 .. 
used_fragloop:6
[ 1111.858602] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 ..  
vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 sco->meta_slots_used:7 
max_upped_gso:1 skb_is_gso(skb):1 max_slots_needed:7 j:3 is_gso:1 nr_frags:1 
firstpart:1 secondpart:5 reserved_slots_left:-1

OK, so if I try to pull this apart (from bottom to top):

- It starts out with "xenvif_rx_action", which calculated that we would need:
      1 slot for the DIV_ROUND_UP(offset_in_page(skb->data) + 
skb_headlen(skb), PAGE_SIZE);
      5 slots for 1 frag
      1 slot for GSO
      So 7 slots in total .. 

- It checks whether there are enough slots available with if 
(!xenvif_rx_ring_slots_available(vif, max_slots_needed)) and apparently there 
are .. so it continues

- We end up in "xenvif_gop_skb" processing .. and everything seems to go well ..
      - At the start of this function we do one unconditional RING_GET_REQUEST 
.. and potentially one conditional (if gso)
          - Question: in "xenvif_rx_action" we only reserved ONE slot for the 
GSO stuff .. so this could cost TWO slots .. is that right ?
          - But in this case we only use one ... see "used in funcstart: 0 + 1"

      - Now we loop 1 time through the "while (data < skb_tail_pointer(skb))" 
loop ... since j(data)=1
          - and call "xenvif_gop_frag_copy" from inside this loop .. which 
potentially calls get_next_rx_buffer
          - If I understand it correctly, the maximum number of slots this loop 
should take is what we calculated before with 
            DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), 
PAGE_SIZE) ? (1 slot in this case)
          - Nothing is logged yet because cons is still smaller than prod at 
this time, but from the aftermath we see this was "used_dataloop:1"
          - So that seems correct .. 

      - Now we are at the loop over "nr_frags" and process the 1 frag we have 
.. 
                  - This all goes well until the sixth iteration (i from 0..5) 
.. where we call "xenvif_gop_frag_copy" from inside this loop ...
                  - "xenvif_gop_frag_copy" calls "get_next_rx_buffer" .. cons 
becomes larger than prod .. and the "miracle" from the non-fatal case, of prod 
suddenly becoming larger, does not happen

      - So the calculation in "xenvif_rx_action" of how many slots the frags 
require does not match what the code in "xenvif_gop_frag_copy" actually needs.
        I don't see why .. it rounds up .. it should behave correctly even if 
the leftover happens to be 0 bytes ..



> Wei.



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

