RE: [PATCH 6/6] Under conditions of high load, it was possible for xenvif to queue up very large numbers of packets before pushing them all upstream in one go. This was partially ameliorated by use of the NDIS_RECEIVE_FLAGS_RESOURCES flag.
-----Original Message-----
From: Paul Durrant <xadimgnik@xxxxxxxxx>
Sent: 23 July 2021 16:06
To: Martin Harvey <martin.harvey@xxxxxxxxxx>; win-pv-devel@xxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH 6/6] Under conditions of high load, it was possible for xenvif to queue up very large numbers of packets before pushing them all upstream in one go. This was partially ameliorated by use of the NDIS_RECEIVE_FLAGS_RESOURCES flag.

> I guess the issue is I don't see why it is a problem for xenvif to accumulate
> a large quantity of data before pushing. As for flow control signals, yes,
> all we essentially have is the discrepancy between buffers pushed up to
> XENNET and those returned. This should be capped at 2048 because XENNET
> asserts 'low resources' at that point.

I agree with you, it isn't necessarily a problem for xenvif to accumulate a large quantity of data at that level in the stack. Given that we probably hold less metadata than higher levels/layers in the stack, holding packets there is probably better than holding them at many other places in the stack.

Yes, I agree that all we have is the discrepancy between buffers pushed up to xennet, and those returned. Above the xennet "low resources" limit, yes, we include the NDIS_RESOURCES flag when indicating NBLs, at which point data is copied higher up the stack and the calls return synchronously.

Herein lies the problem: that NDIS_RESOURCES flag is actually making things worse, not better. In order to return synchronously, there is some allocation and copying going on higher up.

- That's a good idea if the low NDIS resources are a local shortage of net buffer lists / net buffers / suchlike, which need to be conserved, which may have been the case back in the days of NDIS 3.0.

- However, it's a very bad idea if the shortage is some global shortage of pool / Mm structures / something else. The bug we see is that when higher levels copy up the stack, with a very large batch of packets being indicated in one go, some of the code there assumes that the alloc will succeed, and the whole thing blows up.

- The potential fix you're thinking of (I assume) is to keep track of the number of NBLs indicated up, and not indicate any more when that reaches a fixed size. The idea being, I guess, that we don't use the NDIS_RESOURCES flag, but instead tell XenVif not to indicate any more up the stack.

I also don't want to do that fix, because it doesn't scale well, and it's not solving the problem we have. I'm also concerned about what happens when 10G becomes 40G, becomes 100G. Let me explain:

The issue is not the total number of packets indicated up at one time (which might vary quite widely), but the fact that we're pushing so many up in one DPC that some Mm resources (pool etc.) which can be grown at passive can't be grown at DPC (fast enough) to satisfy the secondary copying requirements.

So what I've submitted is really an attempt to *rate limit per DPC*, which seems to be more scalable than a conventional backpressure algorithm. That is, in effect, what it's doing anyway. That way, we don't need to make any assumptions about the overall buffering available; we only make assumptions about how many packets we can push up the stack every time each CPU gets through its DPC queue.
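Just to make the per-DPC rate limit concrete, the shape of it is roughly the following. This is a simplified sketch only: the types, the helpers RingDequeueCompleted()/RingIndicatePacket(), and the RX_BUFFERING_MAX value are made up for illustration and are not the actual XENVIF code.

#include <ntddk.h>

/* Hypothetical, simplified stand-ins - not the real XENVIF types. */
typedef struct _RX_PACKET RX_PACKET, *PRX_PACKET;

typedef struct _RX_RING {
    KDPC    QueueDpc;
    /* packet lists, locks, stats etc. omitted */
} RX_RING, *PRX_RING;

extern PRX_PACKET RingDequeueCompleted(PRX_RING Ring);                  /* pop from PacketComplete */
extern VOID       RingIndicatePacket(PRX_RING Ring, PRX_PACKET Packet); /* -> VifReceiverQueuePacket -> XENNET */

#define RX_BUFFERING_MAX    512     /* illustrative value only */

VOID
RingQueueDpc(
    IN  PKDPC   Dpc,
    IN  PVOID   Context,
    IN  PVOID   Argument1,
    IN  PVOID   Argument2
    )
{
    PRX_RING    Ring = Context;
    ULONG       Count;

    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Argument1);
    UNREFERENCED_PARAMETER(Argument2);

    /* Indicate at most RX_BUFFERING_MAX packets in this DPC pass. */
    for (Count = 0; Count < RX_BUFFERING_MAX; Count++) {
        PRX_PACKET  Packet = RingDequeueCompleted(Ring);

        if (Packet == NULL)
            return;     /* PacketComplete drained - nothing left to do */

        RingIndicatePacket(Ring, Packet);
    }

    /* Budget exhausted with packets still queued: requeue the DPC so
     * other DPCs (and hence higher layers) get a chance to run before
     * we push any more, rather than indicating an unbounded batch. */
    (VOID) KeInsertQueueDpc(&Ring->QueueDpc, NULL, NULL);
}

The only tunable baked in there is the per-DPC budget; there is no assumption about how much buffering exists above us.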
> > The way it's currently implemented is:
> >
> > - From Ring->PacketComplete, we allow RX_BUFFERING_MAX to be pushed
> >   upstream per DPC.
>
> Why do we want to limit here? Is the concern that we end up pushing so much
> we tip XENNET into the low resource condition? If so then that's exactly
> where a return value from VifReceiverQueuePacket() could help. XENNET
> essentially says 'stop giving me packets, I'm waiting for buffers back from
> the stack'.

Well, OK, but the XenNet numbers are also invented by us and fixed, not determined at runtime, and not passed down from higher levels in the stack, so it's just a case of where we invent the arbitrary numbers. If we're going to invent some limit, I would prefer it to be a rate-per-DPC limit rather than a fixed number of buffers outstanding, and XenVif is the right place to change DPC scheduling behaviour. Also, for XenNet to say yes/no based on how many packets per batch (which I guess is not wrong), it's having to work out where the batches begin and end based on the "last" flag passed up from XenVif, which is OK, but feels more kludgy than a limit on a while loop.

> > - If that fails to empty Ring->PacketComplete, we immediately schedule
> >   another DPC, thus allowing other DPCs to run first, which is assumed to
> >   give higher layers a chance to digest the packets pushed up, and/or free
> >   up resources when those packets have been completed.
>
> Why is it assumed higher layers process in that fashion; they may well
> process in context of NdisMIndicateReceiveNetBufferLists().

Because the NDIS documentation says, or at least implies, that without the NDIS_RESOURCES flag packets are likely to be handled asynchronously, and with that flag they're likely to be handled synchronously.

> > Now, of course, the reason this works at all is because we have two DPCs:
> > PollDpc and QueueDpc. The draining from the packet queue to packet
> > complete, and the indication up to NDIS, is in the QueueDpc, and the
> > filling of the packet queue is in the PollDpc. It is the different rates
> > of these two DPCs that indicate whether the PacketQueue will fill or empty.
>
> The motivation for the two was so that we could use a threaded DPC for the
> 'heavier' work but still clear the ring quickly, so TBH you could more
> easily apply back-pressure on the ring by dropping to a single threaded DPC
> to do both jobs.

Well, here's the thing about buffering. Higher-level protos like TCP have only the bandwidth-delay product to work out the buffering, and the loss rate to determine congestion or resource starvation. So of course the newbie approach is to buffer at all costs, but that tends to make cache behaviour worse and slow everything down, when the right thing to do is to overflow fixed-size buffers to give the right signals to higher-level protos that they might want to back off.

So I'm not overly concerned about not clearing the ring: I would prefer to drop the odd packet than to frantically clear the ring and buffer so much that cache behaviour gets worse and it slows things down (which then increases the bandwidth-delay product, which then leads to more buffering...).

> > Let me know if you're happy with this approach.
>
> Not really. How about the single DPC idea?

Humm. So in a single DPC, we'd go from top to bottom:

- Indicate as much out of PacketComplete as we could to NDIS (but we're inventing a fixed limit here).
- Move as much from PacketQueue to PacketComplete as we could (inventing another fixed limit).
- Move from the ring to PacketQueue (same limit as PacketComplete above).
- And if the ring overflows, tough cookies.
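For reference, that single-DPC variant would come out looking roughly like this. Again, a simplified sketch: the stage helpers and fixed limits are invented for illustration and are not real XENVIF functions.

#include <ntddk.h>

/* Hypothetical stand-ins - not the actual XENVIF types or helpers. */
typedef struct _RX_RING RX_RING, *PRX_RING;

extern ULONG RingIndicateCompleted(PRX_RING Ring, ULONG Limit); /* PacketComplete -> NDIS */
extern ULONG RingProcessQueued(PRX_RING Ring, ULONG Limit);     /* PacketQueue -> PacketComplete */
extern ULONG RingPollShared(PRX_RING Ring, ULONG Limit);        /* shared ring -> PacketQueue */
extern VOID  RingDropOverflow(PRX_RING Ring);                   /* ring full: drop rather than buffer */

#define RX_INDICATE_LIMIT   512     /* invented fixed limit */
#define RX_QUEUE_LIMIT      512     /* invented fixed limit */

VOID
RingSingleDpc(
    IN  PKDPC   Dpc,
    IN  PVOID   Context,
    IN  PVOID   Argument1,
    IN  PVOID   Argument2
    )
{
    PRX_RING    Ring = Context;

    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Argument1);
    UNREFERENCED_PARAMETER(Argument2);

    /* 1. Indicate as much of PacketComplete up to NDIS as the
     *    (invented) limit allows. */
    (VOID) RingIndicateCompleted(Ring, RX_INDICATE_LIMIT);

    /* 2. Move as much as we can from PacketQueue to PacketComplete. */
    (VOID) RingProcessQueued(Ring, RX_QUEUE_LIMIT);

    /* 3. Pull from the shared ring into PacketQueue, same limit. */
    (VOID) RingPollShared(Ring, RX_QUEUE_LIMIT);

    /* 4. If the ring overflows, tough cookies: drop the excess. */
    RingDropOverflow(Ring);
}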
I just think that I'd far prefer to rate limit per DPC unless/until NDIS provides some useful mechanism for telling us how to size rings / buffers / the amount of stuff you can reasonably have outstanding up the stack (most of which gets processed at dispatch anyway).

The above notwithstanding, we have 6 months of testing with this approach, and it seems to work quite nicely. I can do some degree of rework, but that then invalidates all our performance results, and I'll have to resubmit you a rather different patch in 3-6 months' time.

MH.