Re: [Xen-devel] Xen-unstable Linux 3.14-rc3 and 3.13 Network troubles "bisected"
Wednesday, March 19, 2014, 12:35:32 PM, you wrote:

> On Wed, Mar 19, 2014 at 12:11:04AM +0100, Sander Eikelenboom wrote:
> [...]

>> >>> Before Paul's change, we always reserve very large room for an incoming
>> >>> SKB. After Paul's change, we only reserve just enough room. Probably
>> >>> some extra room prevents this bug being triggered.

>> >> [ 599.970745] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387 vif->rx.sring->req_event:504174
>> >> [ 599.972487] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506387 vif->rx.sring->req_event:506388
>> >> [ 600.044322] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:37 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 req->gref:165543936 req->id:19 vif->rx.sring->req_event:506388
>> >> [ 600.081167] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 npo->copy_gref:165543936 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:1168 size:1168 i:6 vif->rx.sring->req_event:506388 estimated_slots_needed:8
>> >> [ 600.118268] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end npo->meta_prod:38 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 npo->copy_gref:165543936 npo->copy_off:1168 MAX_BUFFER_OFFSET:4096 bytes:1168 size:0 i:7 vif->rx.sring->req_event:506388 estimated_slots_needed:8
>> >> [ 600.155378] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 req->id:11 estimated_slots_needed:8 i(frag): 0
>> >> [ 600.192438] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5 npo->meta_prod:38 old_meta_prod:30 vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 gso_type:1 gso_size:1448 nr_frags:1 req->gref:570 req->id:11 estimated_slots_needed:8
>> >> [ 600.229395] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 .. vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 sco->meta_slots_used:8 max_upped_gso:1 skb_is_gso(skb):1 max_slots_needed:8 j:3 is_gso:1 nr_frags:1 firstpart:1 secondpart:6 min_slots_needed:3
>> >> [ 600.266518] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 1 .. vif->rx.sring->req_prod:506387 vif->rx.req_cons:506388 max_upped_gso:1 skb_is_gso(skb):0 max_slots_needed:1 j:4 is_gso:0 nr_frags:0 firstpart:1 secondpart:0 min_slots_needed:1

>> >> It seems to estimate 8 slots and need 8 slots ... however .. shouldn't
>> >> the queue have been stopped before getting here ..

>> > *hrmm please just ignore* .. got to get some sleep i guess ..

>> Just went the empirical way .. and unconditionally upped the calculated
>> "max_slots_needed" by one in "xenvif_rx_action" just before checking the
>> "xenvif_rx_ring_slots_available",
>> this has prevented all non-fatal and fatal occurrences of "cons > prod" so
>> far. I will leave my tests running overnight, see if it survives the
>> pounding.
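For reference, the estimate being discussed looks roughly like this in the 3.14-rc xenvif_rx_action() (paraphrased from drivers/net/xen-netback/netback.c, so treat the exact surrounding context as approximate), with the line implementing the empirical "+1" marked:

/* Inside the while ((skb = skb_dequeue(&vif->rx_queue)) != NULL) loop.
 * Worst-case slot estimate: linear area rounded up to pages, each frag
 * rounded up to pages, plus one extra slot if the skb is GSO.
 */
max_slots_needed = DIV_ROUND_UP(offset_in_page(skb->data) +
				skb_headlen(skb),
				PAGE_SIZE);
for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
	unsigned int size = skb_frag_size(&skb_shinfo(skb)->frags[i]);

	max_slots_needed += DIV_ROUND_UP(size, PAGE_SIZE);
}
if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
	max_slots_needed++;

max_slots_needed++;	/* the empirical "+1" workaround mentioned above */

/* If the skb may not fit, requeue it and bail out for now */
if (!xenvif_rx_ring_slots_available(vif, max_slots_needed)) {
	skb_queue_head(&vif->rx_queue, skb);
	break;
}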
>> From other net drivers i see different calculations .. seems it is kind of
>> voodoo to determine the value .. one of which (drivers/net/ethernet/marvell/sky2.c)
>> seems to unconditionally reserve a slot for both GSO and CKSUM. Don't know
>> if that makes any sense though:

>> /* This is the worst case number of transmit list elements for a single skb:
>>    VLAN:GSO + CKSUM + Data + skb_frags * DMA */
>> #define MAX_SKB_TX_LE	(2 + (sizeof(dma_addr_t)/sizeof(u32))*(MAX_SKB_FRAGS+1))

> Xen network "wire" format doesn't have a CKSUM metadata type though.

>> >>> Wei.

>> >>>> >> The second time it does get to the code after the RING_GET_REQUEST
>> >>>> >> in 'get_next_rx_buffer' and that leads to mayhem ...
>> >>>> >>
>> >>>> >> So added a netdev_warn to xenvif_start_xmit for the "stop queue"
>> >>>> >> case .. unfortunately it now triggers net_ratelimit at the end:
>> >>>> >>
>> >>>> >> [ 402.909693] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:7 vif->rx.sring->req_prod:189228 vif->rx.req_cons:189222

>> >>>> > I think xenvif_rx_ring_slots_available is doing its job. If req_prod -
>> >>>> > req_cons < needed, the queue is stopped.

>> >> So it seems .. most of the time .. but if i look at the calculation of
>> >> "min_slots_needed" in this function it seems completely different from the one in
>> >> xenvif_rx_action for max_slots_needed .. though both seem to be used for
>> >> the same thing .. to calculate how many slots the broken-down SKB would
>> >> need to fit in ..
>> >> So i added the calculation method from xenvif_start_xmit to
>> >> xenvif_rx_action .. in the error case you see min_slots_needed was 3 ..
>> >> but max_slots_needed and max_slots_used both were 8.

> Those are different estimations. max_slots_needed estimates the worst
> case while min_slots_needed estimates the best case. When
> min_slots_needed is met, netback puts the SKB into its internal queue.
> xenvif_rx_action then will dequeue that packet, check max_slots_needed,
> if met, break SKB down.

I realised that, though if you don't count the cost of calculation .. the best
would be to have min_slots and max_slots the same .. and being correct.
But the cost of the calculation does count .. and on the other hand min_slots
seems to be far too low in this case.
Is the nr_frags the best thing to check for ?
Wouldn't something like skbsize / PAGE_SIZE be a better estimate ?

> What I would expect is, even if they yield different values, checking
> the ring availability is enough before breaking SKB down.
> In your above case, max_slots_needed was met and SKB broken down. And as
> you said in your empirical test, bumping max_slots_needed by 1 prevented
> issues, then we might have a problem calculating max_slots_needed. If you
> can work out what went wrong that can be very useful.

OK ran the whole night with "max_slots_needed+1" and i saw no triggering of either:
- the non-fatal oddity (only a temporary cons > prod ..)
- the fatal case (leading to the grantref error)

So it seems we are off by one .. somewhere.

At first the non-fatal oddity .. this one i really can't grasp:

[ 567.880282] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:562972 vif->rx.req_cons:562969 vif->rx.sring->req_event:562973
[ 569.893801] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:571188 vif->rx.req_cons:571187 vif->rx.sring->req_event:571189
[ 571.892419] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:577141 vif->rx.req_cons:577138 vif->rx.sring->req_event:577142
[ 575.451383] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:6 vif->rx.sring->req_prod:590603 vif->rx.req_cons:590603 vif->rx.sring->req_event:589615, reserved_slots_left:0
[ 575.487721] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:6 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590604 req->gref:569 req->id:11 vif->rx.sring->req_event:589615 reserved_slots_left:-1
[ 575.524427] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:590660 vif->rx.req_cons:590660 vif->rx.sring->req_event:590661
[ 576.508063] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:16 vif->rx.sring->req_prod:594343 vif->rx.req_cons:594343 vif->rx.sring->req_event:591619, reserved_slots_left:0
[ 576.544708] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:16 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594344 req->gref:668 req->id:167 vif->rx.sring->req_event:591619 reserved_slots_left:-1
[ 576.581595] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:594389 vif->rx.req_cons:594389 vif->rx.sring->req_event:594390
[ 577.325826] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:2 vif->rx.sring->req_prod:597248 vif->rx.req_cons:597247 vif->rx.sring->req_event:597249
[ 577.576973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:598146 vif->rx.req_cons:598143 vif->rx.sring->req_event:598147

What i don't get is:
- i do warn just before the RING_GET_REQUEST on vif->rx.sring->req_prod:590603 vif->rx.req_cons:590603 .. so cons == prod
- i warn because req = RING_GET_REQUEST(&vif->rx, vif->rx.req_cons++); is going to add one to cons .. leading to a situation that should not exist ..
- however in the warn just after this RING_GET_REQUEST .. prod is somehow increased a lot .. probably the "next_rx_buffer" aspect of things .. so cons is much smaller than prod again ..
- However i don't see anything in the RING_GET_REQUEST macro fiddling with vif->rx.sring->req_prod ..
- so who has been fiddling with vif->rx.sring->req_prod in the mean time .. my very limited c skills fail to see how this works :-)
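As an aside on the RING_GET_REQUEST question above: nothing in that macro touches req_prod. req_prod lives in the shared ring page and is advanced by the guest's netfront whenever it posts new receive requests, so it can legitimately jump between the two warnings; netback only advances its private req_cons. Roughly, from include/xen/interface/io/ring.h:

/* Returns a pointer to the request slot for index _idx.  It reads the
 * shared ring but modifies neither req_prod (owned by the frontend) nor
 * req_cons; the "vif->rx.req_cons++" in the caller is netback's private
 * consumer index.
 */
#define RING_GET_REQUEST(_r, _idx)					\
    (&((_r)->sring->ring[((_idx) & (RING_SIZE(_r) - 1))].req))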
Then the fatal case ...

[ 1108.566486] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:1843278 vif->rx.req_cons:1843278 vif->rx.sring->req_event:1843279
[ 1109.775757] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:1847779 vif->rx.req_cons:1847776 vif->rx.sring->req_event:1847780
[ 1110.041661] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:1848795 vif->rx.req_cons:1848794 vif->rx.sring->req_event:1848796
[ 1110.301778] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:4 vif->rx.sring->req_prod:1849755 vif->rx.req_cons:1849754 vif->rx.sring->req_event:1849756
[ 1111.565412] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer before req npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291 vif->rx.sring->req_event:1853482, reserved_slots_left:0
[ 1111.565973] vif vif-7-0 vif7.0: ?!? xenvif_start_xmit stopping queue ! min_slots_needed:1 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854291 vif->rx.sring->req_event:1854292
[ 1111.638425] vif vif-7-0 vif7.0: ?!? get_next_rx_buffer after req npo->meta_prod:35 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 req->gref:4325377 req->id:83 vif->rx.sring->req_event:1854292 reserved_slots_left:-1
[ 1111.675042] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here 2 npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 npo->copy_gref:4325377 npo->copy_off:0 MAX_BUFFER_OFFSET:4096 bytes:1168 size:1168 i:5 vif->rx.sring->req_event:1854292 estimated_slots_needed:7 reserved_slots_left:-1
[ 1111.711833] vif vif-7-0 vif7.0: ?!? xenvif_gop_frag_copy Me here end npo->meta_prod:36 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 npo->copy_gref:4325377 npo->copy_off:1168 MAX_BUFFER_OFFSET:4096 bytes:1168 size:0 i:6 vif->rx.sring->req_event:1854292 gso_gaps:0 estimated_slots_needed:7 reserved_slots_left:-1
[ 1111.766836] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 4 after npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1 req->gref:674 req->id:76 estimated_slots_needed:7 i(frag):0 j(data):1 reserved_slots_left:-1
[ 1111.803702] vif vif-7-0 vif7.0: ?!? xenvif_gop_skb Me here 5 npo->meta_prod:36 old_meta_prod:29 vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 meta->gso_type:1 meta->gso_size:1448 nr_frags:1 req->gref:674 req->id:76 estimated_slots_needed:7 j(data):1 reserved_slots_left:-1 used in funcstart: 0 + 1 .. used_dataloop:1 .. used_fragloop:6
[ 1111.858602] vif vif-7-0 vif7.0: ?!? xenvif_rx_action me here 2 .. vif->rx.sring->req_prod:1854291 vif->rx.req_cons:1854292 sco->meta_slots_used:7 max_upped_gso:1 skb_is_gso(skb):1 max_slots_needed:7 j:3 is_gso:1 nr_frags:1 firstpart:1 secondpart:5 reserved_slots_left:-1

Ok so if i try to pull this apart (from bottom to top):

- it starts out with "xenvif_rx_action" and calculated we would need:
    1 slot for the DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE);
    5 slots for 1 frag
    1 slot for GSO
  So 7 slots in total ..
- it checks if there are enough slots available with if (!xenvif_rx_ring_slots_available(vif, max_slots_needed)) and apparently there are .. so it continues
- We end up in "xenvif_gop_skb" processing .. and everything seems to go well ..
- In the start of this function we do one unconditional RING_GET_REQUEST .. and potentially one conditional (if gso)
- Question: in "xenvif_rx_action" we only reserved ONE slot for GSO stuff .. so this could cost TWO slots .. is that right ?
- But in this case we only use one ... see "used in funcstart: 0 + 1"
- Now we loop 1 time through the "while (data < skb_tail_pointer(skb))" loop ... since j(data)=1
- and call "xenvif_gop_frag_copy" from inside this loop here .. which potentially calls .. get_next_rx_buffer
- If i understand it correctly the number of slots this loop should take at a maximum should be what we calculated before with the DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE) ? (1 slot in this case)
- Nothing is logged yet because cons is still smaller than prod at this time, but from the aftermath we see this was "used_dataloop:1"
- So that seems correct ..
- Now we are at the loop through "nr_frags" and loop the 1 frag we have ..
- This all goes well until the sixth iteration (i from 0..5) .. and call "xenvif_gop_frag_copy" from inside this loop ...
- "xenvif_gop_frag_copy" calls "get_next_rx_buffer" .. cons becomes larger than prod .. and the "miracle" from the non-fatal case of prod suddenly becoming larger does not happen
- So the calculation of how many slots are required for the frags in "xenvif_rx_action" does not match what the code in "xenvif_gop_frag_copy" seems to need, I don't see why .. it's rounding up .. it should behave correctly if the left over happens to be 0 size ..

> Wei.
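As an aside on that last point: the per-frag slot consumption is driven by the buffer-packing helper below, paraphrased from the 3.14-rc drivers/net/xen-netback/netback.c (comments reworded, so treat it as approximate). xenvif_gop_frag_copy() calls get_next_rx_buffer() - i.e. consumes another ring slot - whenever this helper returns true:

/* Should we start a new rx buffer rather than add 'size' more bytes to a
 * buffer that already holds 'offset' bytes?  MAX_BUFFER_OFFSET is PAGE_SIZE.
 */
static bool start_new_rx_buffer(int offset, unsigned long size, int head)
{
	/* Simple case: the current buffer is completely full. */
	if (offset == MAX_BUFFER_OFFSET)
		return true;

	/* Complex case: the chunk does not fit in what is left of the
	 * current buffer, but would fit in a fresh one, the current buffer
	 * already holds some data and is not the head buffer -> open a new
	 * buffer and leave the tail of the current one unused.
	 */
	if ((offset + size > MAX_BUFFER_OFFSET) &&
	    (size <= MAX_BUFFER_OFFSET) && offset && !head)
		return true;

	return false;
}

Because the "complex case" can leave earlier buffers partially used, a frag can in principle end up spread over more slots than a plain DIV_ROUND_UP(size, PAGE_SIZE) predicts; whether that is the exact off-by-one seen above is not settled here, but it is one place where the estimate and the actual packing can diverge.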