
RE: [PATCH 6/6] Under conditions of high load, it was possible for xenvif to queue up very large numbers of packets before pushing them all upstream in one go. This was partially ameliorated by use of the NDIS_RECEIVE_FLAGS_RESOURCES flag.


  • To: "paul@xxxxxxx" <paul@xxxxxxx>, "win-pv-devel@xxxxxxxxxxxxxxxxxxxx" <win-pv-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Martin Harvey <martin.harvey@xxxxxxxxxx>
  • Date: Fri, 23 Jul 2021 16:59:30 +0000
  • Accept-language: en-US
  • Delivery-date: Fri, 23 Jul 2021 16:59:43 +0000
  • List-id: Developer list for the Windows PV Drivers subproject <win-pv-devel.lists.xenproject.org>
  • Thread-topic: [PATCH 6/6] Under conditions of high load, it was possible for xenvif to queue up very large numbers of packets before pushing them all upstream in one go. This was partially ameliorated by use of the NDIS_RECEIVE_FLAGS_RESOURCES flag.


-----Original Message-----
From: Paul Durrant <xadimgnik@xxxxxxxxx> 
Sent: 23 July 2021 16:06
To: Martin Harvey <martin.harvey@xxxxxxxxxx>; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH 6/6] Under conditions of high load, it was possible for 
xenvif to queue up very large numbers of packets before pushing them all 
upstream in one go. This was partially ameliorated by use of the 
NDIS_RECEIVE_FLAGS_RESOURCES flag.


> I guess the issue is I don't see why it is a problem for xenvif to accumulate 
> a large quantity of data before pushing. As for flow control signals, yes, all 
> we essentially have is the discrepancy between buffers pushed up to XENNET and 
> those returned. This should be capped at 2048 because XENNET asserts 'low 
> resources' at that point.

I agree with you, it isn't necessarily a problem for xenvif to accumulate a 
large quantity of data at that level in the stack. Given that we probably hold 
less metadata than higher levels/layers in the stack, holding packets there is 
probably better than holding them at many other places in the stack.

Yes, I agree that all we have is the discrepancy between buffers pushed up to 
xennet and those returned. Above the xennet "low resources" limit, yes, we 
include the NDIS_RESOURCES flag when indicating NBLs, at which point data is 
copied higher up the stack and the calls return synchronously.

Herein lies the problem: that NDIS_RESOURCES flag is actually making things 
worse, not better. In order to return synchronously, there is some allocation 
and copying going on higher up (see the sketch after the two cases below).

- That's a good idea if the low NDIS resources are a local shortage of net 
buffer lists / net buffers / suchlike, which need to be conserved; that may 
have been the case back in the days of NDIS 3.0.
- However, it's a very bad idea if the shortage is some global shortage of Pool 
/ Mm structures / something else. The bug we see is that when the higher levels 
copy up the stack, with a very large batch of packets being indicated in one 
go, some of the code there assumes the allocation will succeed, and the whole 
thing blows up.
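
To make that concrete, here's a minimal sketch of the two indication paths as 
I understand them (not the actual XENNET code; RX_LOW_RESOURCE_THRESHOLD, the 
PADAPTER fields and the helper functions are invented names, assuming the 
usual <ndis.h> declarations):

    VOID
    IndicateBatch(
        IN  PADAPTER            Adapter,
        IN  PNET_BUFFER_LIST    NetBufferLists,
        IN  ULONG               Count
        )
    {
        ULONG   Flags = NDIS_RECEIVE_FLAGS_DISPATCH_LEVEL;

        if (Adapter->NblsOutstanding >= RX_LOW_RESOURCE_THRESHOLD) {
            /* Low-resource path: the stack must copy the data and the
             * call returns synchronously; ownership of the NBLs stays
             * with us. */
            Flags |= NDIS_RECEIVE_FLAGS_RESOURCES;
        }

        NdisMIndicateReceiveNetBufferLists(Adapter->NdisHandle,
                                           NetBufferLists,
                                           NDIS_DEFAULT_PORT_NUMBER,
                                           Count,
                                           Flags);

        if (Flags & NDIS_RECEIVE_FLAGS_RESOURCES) {
            /* Synchronous: we can recycle the buffers immediately, but
             * the copy (and any allocation it needs) has already
             * happened higher up, at DISPATCH_LEVEL. */
            RecycleNetBufferLists(Adapter, NetBufferLists);
        } else {
            /* Asynchronous: the buffers come back to us later via
             * MiniportReturnNetBufferLists. */
            Adapter->NblsOutstanding += Count;
        }
    }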

- The potential fix you're thinking of (I assume) is to keep track of the 
number of NBLs indicated up, and not indicate any more when that reaches a 
fixed size. The idea being, I guess, that we don't use the NDIS_RESOURCES flag, 
but instead tell XenVif not to indicate any more up the stack.
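
If I've understood it right, that amounts to something like the following 
check on the XENNET side (a sketch only; all names invented for illustration):

    #define RX_OUTSTANDING_MAX  2048

    static BOOLEAN
    CanIndicateMore(
        IN  PADAPTER    Adapter
        )
    {
        /* Indicated minus returned == buffers still held by the stack. */
        return (Adapter->NblsIndicated - Adapter->NblsReturned) <
               RX_OUTSTANDING_MAX;
    }

i.e. XENNET answers "no more" once the discrepancy hits the cap, and XenVif 
holds off until buffers are returned.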

I also don't want to do that fix, because it doesn't scale well, and it's not 
solving the problem we have. I'm also concerned about what happens when 10G 
becomes 40G, and 40G becomes 100G. Let me explain:

The issue is not the total number of packets indicated up at one time (which 
might vary quite widely), but the fact that we're pushing so many up in one DPC 
that some Mm resources (pool, etc.), which can be grown at passive, can't be 
grown at DPC (fast enough) to satisfy the secondary copying requirements.

So what I've submitted is really an attempt to **rate limit per DPC**, which 
seems to be more scalable than a conventional backpressure algorithm; that is, 
in effect, what it's doing anyway. That way, we don't need to make any 
assumptions about the overall buffering available, only about how many packets 
we can push up the stack every time each CPU gets through its DPC queue.
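
In outline, each pass of the queue DPC does roughly this (a sketch only; the 
helper and field names are approximate, not the exact patch):

    VOID
    ReceiverQueueDpc(
        IN  PXENVIF_RECEIVER_RING   Ring
        )
    {
        ULONG   Pushed = 0;

        /* Push at most RX_BUFFERING_MAX packets upstream per DPC. */
        while (Pushed < RX_BUFFERING_MAX) {
            PXENVIF_RECEIVER_PACKET Packet;

            Packet = DequeuePacket(&Ring->PacketComplete);
            if (Packet == NULL)
                break;

            IndicateUpstream(Ring, Packet);
            Pushed++;
        }

        /* If there's still work, reschedule rather than draining it all
         * now: other DPCs get to run, and higher layers get a chance to
         * digest what we've already pushed. */
        if (!IsListEmpty(&Ring->PacketComplete))
            KeInsertQueueDpc(&Ring->QueueDpc, NULL, NULL);
    }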

> >   The way it's currently implemented is:
> > 
> > - From Ring->PacketComplete, we allow RX_BUFFERING_MAX to be pushed upstream 
> > per DPC.
> 
> Why do we want to limit here? Is the concern that we end up pushing so much 
> we tip XENNET into the low resource condition? If so then that's exactly 
> where a return value from VifReceiverQueuePacket() could help. 
> XENNET essentially says 'stop giving me packets, I'm waiting for buffers back 
> from the stack'.

Well, OK, but the XenNet numbers are also invented by us and fixed, not 
determined at runtime, and not passed down from higher levels in the stack, so 
it's just a case of where we invent the arbitrary numbers. If we're going to 
invent some limit, I would prefer it to be a rate-per-DPC limit rather than a 
fixed number of buffers outstanding, and XenVif is the right place to change 
DPC scheduling behaviour.

Also, for XenNet to say yes/no based on how many packets per batch (which I 
guess is not wrong), it's having to work out where the batches begin and end 
based on the "last" flag passed up from XenVif, which is OK, but feels more 
kludgy than a limit on a while loop.
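
i.e. XenNet would end up doing something like this (parameter and helper names 
invented for illustration):

    VOID
    AdapterReceivePacket(
        IN  PADAPTER    Adapter,
        IN  PPACKET     Packet,
        IN  BOOLEAN     Last        /* "last in this batch" from XenVif */
        )
    {
        AppendToBatch(Adapter, Packet);

        if (Last) {
            /* Only here do we know the batch size, so only here can we
             * decide whether to tell XenVif to stop. */
            IndicateCurrentBatch(Adapter);
        }
    }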

> - If that fails to empty Ring->PacketComplete, we immediately schedule 
> another DPC, thus allowing other DPCs to run first, which is assumed to give 
> higher layers a chance to digest the packets pushed up, and/or free up 
> resources when those packets have been completed.

> Why is it assumed higher layers process in that fashion; they may well process 
> in context of NdisMIndicateReceiveNetBufferLists().

Because the NDIS documentation says, or at least implies, that without the 
NDIS_RESOURCES flag, packets are likely to be handled async, and with that 
flag, they're likely to be handled synchronously.

> Now, of course, the reason this works at all is because we have two DPC's: 
> PollDPC and QueueDPC. The draining from the packet queue to packet complete, 
> and indicate up to NDIS is in the  QueueDpc, and the filling of the packet 
> queue is in the PollDPC. It is the different rates of these two DPC's that 
> indicates whether the PacketQueue will fill or empty.

> The motivation for the two was so that we could use a threaded DPC for the 
> 'heavier' work but still clear the ring quickly so TBH you could more easily 
> apply back-pressure on the ring by dropping to a single threaded DPC to do 
> both jobs.

Well, here's the thing about buffering. Higher-level protocols like TCP have 
only the bandwidth-delay product to work out the buffering, and the loss rate 
to determine congestion or resource starvation. So of course the newbie 
approach is to buffer at all costs, but that tends to make cache behaviour 
worse and slow everything down, when the right thing to do is to overflow 
fixed-size buffers, giving higher-level protocols the signal that they might 
want to back off.

So I'm not overly concerned about not clearing the ring: I would prefer to drop 
the odd packet rather than frantically clear the ring and buffer so much that 
cache behaviour gets worse and everything slows down (which then increases the 
bandwidth-delay product, which then leads to more buffering...).

> > 
> > Let me know if you're happy with this approach.
> > 
>
> Not really. How about the single DPC idea?

Humm. So in a single DPC, we'd go from top to bottom:

- Indicate as much out of packet complete as we could to NDIS (but we're 
inventing a fixed limit here)
- Move as much from packetqueue to packetcomplete as we could (inventing 
another fixed limit)
- Move from ring to packetqueue (same limit as packetcomplete above)
- And if the ring overflows, tough cookies.
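
In sketch form (the limits and helper names here are invented, not real code):

    VOID
    SingleReceiverDpc(
        IN  PXENVIF_RECEIVER_RING   Ring
        )
    {
        /* 1. Indicate up to some invented fixed limit from
         *    PacketComplete to NDIS. */
        IndicateCompleted(Ring, INDICATE_LIMIT);

        /* 2. Move up to another invented limit from PacketQueue to
         *    PacketComplete. */
        ProcessQueued(Ring, PROCESS_LIMIT);

        /* 3. Move from the shared ring into PacketQueue (same limit). */
        DrainRing(Ring, PROCESS_LIMIT);

        /* 4. If the ring overflows because of the limits above, packets
         *    get dropped. */
    }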

I just think that I'd far prefer to rate limit per DPC unless/until NDIS 
provides some useful mechanism for telling us how to size rings / buffers / the 
amount of stuff you can reasonably have outstanding up the stack (most of which 
gets processed at dispatch anyway).

The above notwithstanding, we have 6 months of testing with this approach, and 
it seems to work quite nicely.
I can do some degree of rework, but that then invalidates all our performance 
results, and I'll have to resubmit a rather different patch in 3-6 months' 
time.

MH.


 


 

