
Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



Friday, March 21, 2014, 5:49:58 PM, you wrote:

> [...]
>> > Those are different estimations. max_slots_needed estimates the worst
>> > case while min_slots_needed estimates the best case. When
>> > min_slots_needed is met, netback puts the SKB into its internal queue.
>> > xenvif_rx_action then will dequeue that packet, check max_slots_needed,
>> > if met, break SKB down.
>> 
>> I realised that, though if you don't count the cost of the calculation, the
>> best would be to have min_slots and max_slots be the same .. and correct.
>> But the cost of the calculation does count ..
>> On the other hand, min_slots seems to be far too low in this case. Is
>> nr_frags the best thing to check for ?
>> Wouldn't something like the skb size / PAGE_SIZE be a better estimate ?
>> 

> It doesn't matter. It won't cause breakage at that point.

That was more from a performance point of view ... if the estimate can be that
far off, we do the more costly check .. and still bail out. So the better the
first estimate (at about the same calculation cost, of course), the better.
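
Just to make concrete what I mean by a size-based estimate, a rough sketch
(this is NOT the actual code in xenvif_start_xmit, just what I have in mind):

        /* Sketch only: bound the minimum slot count by the total payload
         * length instead of by nr_frags, since a single frag can span
         * several pages and so need several slots.
         */
        min_slots_needed = DIV_ROUND_UP(skb->len, PAGE_SIZE);
        if (skb_is_gso(skb))
                min_slots_needed++;     /* extra slot for the GSO meta data */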


>> 
>> > What I would expect is, even if they yield different values, checking
>> > the ring availability is enough before breaking SKB down.
>> 
>> > In your above case, max_slots_needed was met and SKB broken down. And as
>> > you said in your empirical test, bumping max_slots_needed by 1 prevented
>> > issues, then we might have problem calculating max_slots_needed. If you
>> > can work out what went wrong that can be very useful.
>> 
>> OK, I ran the whole night with "max_slots_needed + 1" and neither of these
>> triggered:
>>     - the non-fatal oddity (only a temporary cons > prod ..)
>>     - the fatal case (leading to the grantref error)
>> 
>> So it seems we are off by one .. somewhere
>> 
>> 
>> First, the non-fatal oddity .. this one I really can't grasp:
>> 
>> [  567.880282] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:4 vif->rx.sring->req_prod:562972 vif->rx.req_cons:562969 
>> vif->rx.sring->req_event:562973
>> [  569.893801] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:4 vif->rx.sring->req_prod:571188 vif->rx.req_cons:571187 
>> vif->rx.sring->req_event:571189
>> [  571.892419] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:4 vif->rx.sring->req_prod:577141 vif->rx.req_cons:577138 
>> vif->rx.sring->req_event:577142
>> [  575.451383] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
>> npo->meta_prod:6 vif->rx.sring->req_prod:590603 vif->rx.req_cons:590603 
>> vif->rx.sring->req_event:589615, reserved_slots_left:0
>> [  575.487721] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
>> npo->meta_prod:6 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590604 
>> req->gref:569 req->id:11 vif->rx.sring->req_event:589615 
>> reserved_slots_left:-1
>> [  575.524427] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:4 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590660 
>> vif->rx.sring->req_event:590661
>> [  576.508063] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
>> npo->meta_prod:16 vif->rx.sring->req_prod:594343 vif->rx.req_cons:594343 
>> vif->rx.sring->req_event:591619, reserved_slots_left:0
>> [  576.544708] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
>> npo->meta_prod:16 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594344 
>> req->gref:668 req->id:167 vif->rx.sring->req_event:591619 
>> reserved_slots_left:-1
>> [  576.581595] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:4 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594389 
>> vif->rx.sring->req_event:594390
>> [  577.325826] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:2 vif->rx.sring->req_prod:597248 vif->rx.req_cons:597247 
>> vif->rx.sring->req_event:597249
>> [  577.576973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:4 vif->rx.sring->req_prod:598146 vif->rx.req_cons:598143 
>> vif->rx.sring->req_event:598147
>> 
>> What I don't get is:
>> - I warn just before the RING_GET_REQUEST on
>> vif->rx.sring->req_prod:590603 vif->rx.req_cons:590603 .. so cons == prod
>> - I warn because req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++); is
>> going to add one to cons .. leading to a situation that should not exist ..
>> - however in the warning just after this RING_GET_REQUEST .. prod has somehow
>> increased a lot .. probably the "next_rx_buffer" aspect of things .. so cons
>> is much smaller than prod again ..

> Frontend is also accessing the ring. So it's normal that some index
> changed.

Ah ok .. so here the frontend is "saving us" by putting more stuff on the ring 
so we don't run dry ..
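
If I read xen/interface/io/ring.h correctly, that also answers my question
below: RING_GET_REQUEST itself never touches req_prod, it only indexes into
the shared ring, and req_prod is only ever written by the frontend when it
pushes new requests. Roughly (quoting the macros from memory, so they may not
be line-for-line exact):

        #define RING_GET_REQUEST(_r, _idx)                                  \
            (&((_r)->sring->ring[((_idx) & (RING_SIZE(_r) - 1))].req))

        #define RING_PUSH_REQUESTS(_r) do {                                 \
            wmb(); /* back sees requests /before/ updated producer index */ \
            (_r)->sring->req_prod = (_r)->req_prod_pvt;                     \
        } while (0)

So the jump in req_prod between my two warnings would just be netfront pushing
more requests in the meantime.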

>> - However, I don't see anything in the RING_GET_REQUEST macro fiddling with
>> vif->rx.sring->req_prod ..
>> - so who has been fiddling with vif->rx.sring->req_prod in the meantime ..
>> my very limited C skills fail to see how this works :-)
>> 
>> 
>> Then the fatal case ...
>> 
>> [ 1108.566486] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:1 vif->rx.sring->req_prod:1843278 vif->rx.req_cons:1843278 
>> vif->rx.sring->req_event:1843279
>> [ 1109.775757] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:4 vif->rx.sring->req_prod:1847779 vif->rx.req_cons:1847776 
>> vif->rx.sring->req_event:1847780
>> [ 1110.041661] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:4 vif->rx.sring->req_prod:1848795 vif->rx.req_cons:1848794 
>> vif->rx.sring->req_event:1848796
>> [ 1110.301778] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:4 vif->rx.sring->req_prod:1849755 vif->rx.req_cons:1849754 
>> vif->rx.sring->req_event:1849756
>> [ 1111.565412] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
>> npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291 
>> vif->rx.sring->req_event:1853482, reserved_slots_left:0
>> [ 1111.565973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:1 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291 
>> vif->rx.sring->req_event:1854292
>> [ 1111.638425] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
>> npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 
>> req->gref:4325377 req->id:83 vif->rx.sring->req_event:1854292 
>> reserved_slots_left:-1
>> [ 1111.675042] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
>> npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 
>> npo->copy_gref:4325377  npo->copy_off:0  MAX_BUFFER_OFFSET:4096 bytes:1168 
>> size:1168 i:5 vif->rx.sring->req_event:1854292 estimated_slots_needed:7 
>> reserved_slots_left:-1
>> [ 1111.711833] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end 
>> npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 
>> npo->copy_gref:4325377 npo->copy_off:1168  MAX_BUFFER_OFFSET:4096 bytes:1168 
>> size:0 i:6 vif->rx.sring->req_event:1854292 gso_gaps:0 
>> estimated_slots_needed:7 reserved_slots_left:-1
>> [ 1111.766836] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 after 
>> npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291 
>> vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1 
>> req->gref:674 req->id:76 estimated_slots_needed:7 i(frag):0 j(data):1 
>> reserved_slots_left:-1
>> [ 1111.803702] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5 
>> npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291 
>> vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1 
>> req->gref:674 req->id:76 estimated_slots_needed:7 j(data):1 
>> reserved_slots_left:-1    used in funcstart: 0 + 1 .. used_dataloop:1 .. 
>> used_fragloop:6
>> [ 1111.858602] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 ..  
>> vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 
>> sco->meta_slots_used:7 max_upped_gso:1 skb_is_gso(skb):1 max_slots_needed:7 
>> j:3 is_gso:1 nr_frags:1 firstpart:1 secondpart:5 reserved_slots_left:-1
>> 
>> Ok so if I try to pull this apart (from bottom to top):
>> 
>> - it starts out with "xenvif_rx_action" and calculated we would need:
>>       1 slot for the  DIV_ROUND_UP(offset_in_page(skb->data) + 
>> skb_headlen(skb), PAGE_SIZE);
>>       5 slots for 1 frag
>>       1 slot for GSO
>>       So 7 slots in total .. 
>> 
>> - it checks whether there are enough slots available with
>> (!xenvif_rx_ring_slots_available(vif, max_slots_needed)) and apparently there
>> are .. so it continues
>> 
>> - We end up in "xenvif_gop_skb" processing .. and everything seems to go
>> well ..
>>       - At the start of this function we do one unconditional
>> RING_GET_REQUEST .. and potentially one more conditional one (if gso)
>>           - Question: in "xenvif_rx_action" we only reserved ONE slot for
>> GSO stuff .. so this could cost TWO slots .. is that right ?

> The slot reserved for GSO is for meta data, so 1 is enough.

>>           - But in this case we only use one ... see "used in funcstart: 0 + 
>> 1"
>> 
>>       - Now we loop 1 time through the  "while (data < 
>> skb_tail_pointer(skb))" loop ... since j(data)=1
>>           - and we call "xenvif_gop_frag_copy" from inside this loop ..
>> which potentially calls .. get_next_rx_buffer
>>           - If I understand it correctly, the maximum number of slots this
>> loop should take is what we calculated before with the
>> DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb),
>> PAGE_SIZE) ? (1 slot in this case)
>>           - Nothing is logged yet because cons is still smaller than prod at 
>> this time, but from the aftermath we see this was "used_dataloop:1"
>>           - So that seems correct .. 
>> 
>>       - Now we are at the loop over "nr_frags" and we loop over the 1 frag we
>> have ..
>>                   - This all goes well until the sixth iteration (i from
>> 0..5) .. calling "xenvif_gop_frag_copy" from inside this loop ...

> Isn't there only one frag?

Yes, there is only one frag .. but it seems to be much larger than PAGE_SIZE ..
and xenvif_gop_frag_copy breaks that frag down into smaller bits .. hence the
calculation in xenvif_rx_action determining the slots needed by doing:

                for (i = 0; i < nr_frags; i++) {
                        unsigned int size;
                        size = skb_frag_size(&skb_shinfo(skb)->frags[i]);
                        max_slots_needed += DIV_ROUND_UP(size, PAGE_SIZE);
                }

But the code in xenvif_gop_frag_copy .. seems to need one more slot (from the
empirical test) .. and calling "get_next_rx_buffer" seems involved in that ..
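
A scenario I can imagine (pure speculation on my side, based on how I read the
chunking in xenvif_gop_frag_copy and start_new_rx_buffer): the estimate only
rounds up the frag *size*, but the copy loop also splits the frag at source
page boundaries, and a non-head chunk is never split across two ring buffers.
So a frag whose data does not start page-aligned produces one extra chunk, and
in the worst case every chunk needs its own ring buffer. With made-up numbers:

        frag size = 8192, page_offset within the frag = 2048

        estimate  : DIV_ROUND_UP(8192, PAGE_SIZE)              = 2 slots
        chunks    : 2048 + 4096 + 2048 (split at source page boundaries)
        worst case: each chunk gets a fresh ring buffer because it does not
                    fit in the space left in the current one   = 3 slots

That would also be consistent with the counters in my fatal log above:
"secondpart:5" slots estimated for the frag, but "used_fragloop:6" actually
consumed.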


>>                   - "xenvif_gop_frag_copy" calls "get_next_rx_buffer" ..
>> cons becomes larger than prod .. and the "miracle" from the non-fatal case of
>> prod suddenly becoming larger does not happen
>> 
>>       - So the calculation of how many slots are required for the frags in
>> "xenvif_rx_action" does not match what the code in "xenvif_gop_frag_copy"
>> seems to need,
>>         I don't see why .. it's rounding up .. it should behave correctly
>> even if the leftover happens to be of size 0 ..
>> 
>> 
>> 
>> > Wei.
>> 



_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

