[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"



Hello Sander,

Tuesday, March 18, 2014, 10:18:59 PM, you wrote:




> Tuesday, March 18, 2014, 9:14:02 PM, you wrote:


>> Tuesday, March 18, 2014, 5:04:12 PM, you wrote:

>>> On Tue, Mar 18, 2014 at 04:21:27PM +0100, Sander Eikelenboom wrote:
>>> [...]
>>>> 
>>>> Added even more warns ...
>>>> 
>>>> [  297.885969] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>>>> min_slots_needed:4 vif->rx.sring->req_prod:21764 vif->rx.req_cons:21762
>>>> [  298.760555] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>>>> min_slots_needed:3 vif->rx.sring->req_prod:22488 vif->rx.req_cons:22486
>>>> 
>>>> [  306.376176] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
>>>> npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
>>>> [  306.376556] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>>>> min_slots_needed:1 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28313
>>>> [  306.391863] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
>>>> npo->meta_prod:30 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 
>>>> req->gref:4325377 req->id:153
>>>> 
>>>> [  306.407599] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
>>>> npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 
>>>> npo->copy_gref:4325377  npo->copy_off:0  MAX_BUFFER_OFFSET:4096 bytes:640 
>>>> size:640 i:4
>>>> [  306.423913] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end 
>>>> npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28314 
>>>> npo->copy_gref:4325377 npo->copy_off:640  MAX_BUFFER_OFFSET:4096 bytes:640 
>>>> size:0 i:5
>>>> 
>>>> 
>>>> [  306.440941] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 
>>>> npo->meta_prod:31 old_meta_prod:25 vif->rx.sring->req_prod:28313 
>>>> vif->rx.req_cons:28314 gso_type:1 gso_size:1448 nr_frags:1 req->gref:638 
>>>> req->id:147
>>>> [  306.458334] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 before req 
>>>> npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 
>>>> gso_type:0 gso_size:0 nr_frags:0
>>>> [  306.476097] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 2 after req 
>>>> npo->meta_prod:31 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 
>>>> gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 req->id:154
>>>> [  306.494462] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 3 before 
>>>> npo->meta_prod:32 old_meta_prod:31 vif->rx.sring->req_prod:28313 
>>>> vif->rx.req_cons:28315 gso_type:0 gso_size:0 nr_frags:0 req->gref:4325377 
>>>> req->id:154 j:0
>>>> [  306.513424] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here start  
>>>>  npo->meta_prod:32 vif->rx.sring->req_prod:28313 vif->rx.req_cons:28315 
>>>> npo->copy_gref:4325377 npo->copy_off:0  MAX_BUFFER_OFFSET:4096 bytes:0 
>>>> size:66 i:0
>>>> [  311.390883] net_ratelimit: 317 callbacks suppressed
>>>> [  311.400901] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>>>> min_slots_needed:3 vif->rx.sring->req_prod:32386 vif->rx.req_cons:32322
>>>> 
>>>> - So in this case we are in the 3rd iteration of the loop in 
>>>> xenvif_gop_frag_copy ...
>>>> - Xenvif_start_xmit stop the queue since it has detected it needs one more 
>>>> slot which is unavailable at that time.

>>> Yes.

>>>> - The current rx thread however doesn't know and doesn't check  (neither 
>>>> in the loop in xenvif_gop_frag_copy nor in get_next_rx_buffer that the 
>>>> ring if full) .. while prod == cons now .. consumes another one ..

>>> It does check -- but not in xenvif_gop_frag_copy -- see
>>> xenvif_rx_action, which calls xenvif_rx_ring_slots_available before
>>> queueing skb to break down. That is, when you call xenvif_gop_skb there
>>> should be enough room to accommodate that SKB.

>>>> - That ring request leads to the bad grant references reported by the 
>>>> hypervisor
>>>> 
>>>> (XEN) [2014-03-18 15:02:58.928] grant_table.c:1857:d0v2 Bad grant 
>>>> reference 4325377
>>>> 
>>>> So should there be a check added there ... or should the callers  
>>>> "xenvif_gop_frag_copy" and the caller of that one "xenvif_gop_skb" already 
>>>> have anticipated that what the were about
>>>> to do wasn't going to fit anyway ?
>>>> 

>>> No, see above.

>>>> And of course .. how made Paul's change trigger this ?
>>>> 

>>> Before Paul's change, we always reserve very large room for an incoming
>>> SKB. After Paul's change, we only reserve just enough room. Probably
>>> some extra room prevents this bug being triggered.

>> [  599.970745] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req 
>> npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387 
>> vif->rx.sring->req_event:504174
>> [  599.972487] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue !  
>> min_slots_needed:1 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387 
>> vif->rx.sring->req_event:506388
>> [  600.044322] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req 
>> npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 
>> req->gref:165543936 req->id:19 vif->rx.sring->req_event:506388
>> [  600.081167] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2  
>> npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 
>> npo->copy_gref:165543936  npo->copy_off:0  MAX_BUFFER_OFFSET:4096 bytes:1168 
>> size:1168 i:6 vif->rx.sring->req_event:506388 estimated_slots_needed:8
>> [  600.118268] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end 
>> npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 
>> npo->copy_gref:165543936 npo->copy_off:1168  MAX_BUFFER_OFFSET:4096 
>> bytes:1168 size:0 i:7 vif->rx.sring->req_event:506388 
>> estimated_slots_needed:8
>> [  600.155378] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 
>> npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 
>> vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 
>> req->id:11 estimated_slots_needed:8 i(frag): 0
>> [  600.192438] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5 
>> npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 
>> vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 
>> req->id:11 estimated_slots_needed:8
>> [  600.229395] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 ..  
>> vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 
>> sco->meta_slots_used:8 max_upped_gso:1 skb_is_gso(skb):1 max_slots_needed:8 
>> j:3 is_gso:1 nr_frags:1 firstpart:1 secondpart:6 min_slots_needed:3
>> [  600.266518] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 1 ..  
>> vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388                       
>>   max_upped_gso:1 skb_is_gso(skb):0 max_slots_needed:1 j:4 is_gso:0 
>> nr_frags:0 firstpart:1 secondpart:0 min_slots_needed:1

>> It seems to estimate 8 slots and need 8 slots ... however .. shouldn't the 
>> queue have been stopped before getting here ..

> *hrmm please just ignore* .. got to get some sleep i guess ..

Just went the empirical way .. and unconditionally upped the calculated 
"max_slots_needed" by one in "xenvif_rx_action"  just before checking the 
"xenvif_rx_ring_slots_available",
this has prevented all non-fatal and fatal occurrences of "cons > prod" so far. 
I will leave my tests running overnight, see if it survives the pounding.

From other net drivers i see different calculations .. seems it is kind of 
voodoo to determine the value .. one of which 
(drivers/net/ethernet/marvell/sky2.c)
seems to unconditionally reserve a slot for both GSO and CKSUM. Don't know if 
that makes any sense though:

/* This is the worst case number of transmit list elements for a single skb:
   VLAN:GSO + CKSUM + Data + skb_frags * DMA */
#define MAX_SKB_TX_LE   (2 + (sizeof(dma_addr_t)/sizeof(u32))*(MAX_SKB_FRAGS+1))



>>> Wei.

>>>> 
>>>> >> The second time it does get to the code after the RING_GET_REQUEST in 
>>>> >> 'get_next_rx_buffer' and that leads to mayhem ...
>>>> >> 
>>>> >> So added a netdev_warn to xenvif_start_xmit for the "stop queue" case 
>>>> >> .. unfortunately it now triggers net_ratelimit at the end:
>>>> >> 
>>>> >> [  402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue 
>>>> >> !  min_slots_needed:7 vif->rx.sring->req_prod:189228 
>>>> >> vif->rx.req_cons:189222
>>>> 
>>>> > I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
>>>> > req_cons < needed, the queue is stopeed.

>> So it seems .. most of the time .. but if i look at the calculation of 
>> "min_slots_needed" in this function it seems completely different from the 
>> one in
>> xenvif_rx_action for max_slots_needed .. though both seem to be used for the 
>> same thing .. to calcultate how many slots the brokendown SKB would need to 
>> fit in ..
>> So i added the calculation method from xenvif_start_xmit to xenvif_rx_action 
>> .. in the error case you see min_slots_needed was 3 .. but max_slots_needed 
>> and max_slots_used both were 8.

>> The main difference between these calculation methods seems to be that 
>> min_slots_needed doesn't take the PAGE_SIZE into account to see how many 
>> slots are needed for the frags.

>> So Paul .. why was the "xenvif_count_skb_slots(vif, skb)" function dropped 
>> and replaced by two seperate and different calculations ?

>> --
>> Sander

>>>> 
>>>> > Wei.
>>>> 







-- 
Best regards,
 Sander                            mailto:linux@xxxxxxxxxxxxxx


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.