
RE: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool


  • To: Petr Tesařík <petr@xxxxxxxxxxx>
  • From: "Michael Kelley (LINUX)" <mikelley@xxxxxxxxxxxxx>
  • Date: Tue, 11 Jul 2023 15:54:27 +0000
  • Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>, Petr Tesarik <petrtesarik@xxxxxxxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Thomas Bogendoerfer <tsbogend@xxxxxxxxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>, Ingo Molnar <mingo@xxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>, "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" <x86@xxxxxxxxxx>, "H. Peter Anvin" <hpa@xxxxxxxxx>, "Rafael J. Wysocki" <rafael@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>, Christoph Hellwig <hch@xxxxxx>, Marek Szyprowski <m.szyprowski@xxxxxxxxxxx>, Robin Murphy <robin.murphy@xxxxxxx>, Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx>, Hans de Goede <hdegoede@xxxxxxxxxx>, Jason Gunthorpe <jgg@xxxxxxxx>, Kees Cook <keescook@xxxxxxxxxxxx>, Saravana Kannan <saravanak@xxxxxxxxxx>, "moderated list:XEN HYPERVISOR ARM" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "moderated list:ARM PORT" <linux-arm-kernel@xxxxxxxxxxxxxxxxxxx>, open list <linux-kernel@xxxxxxxxxxxxxxx>, "open list:MIPS" <linux-mips@xxxxxxxxxxxxxxx>, "open list:XEN SWIOTLB SUBSYSTEM" <iommu@xxxxxxxxxxxxxxx>, Roberto Sassu <roberto.sassu@xxxxxxxxxxxxxxx>, Kefeng Wang <wangkefeng.wang@xxxxxxxxxx>
  • Delivery-date: Tue, 11 Jul 2023 15:54:53 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-topic: [PATCH v3 4/7] swiotlb: if swiotlb is full, fall back to a transient memory pool

From: Petr Tesařík <petr@xxxxxxxxxxx> Sent: Monday, July 10, 2023 2:36 AM
> 
> On Sat, 8 Jul 2023 15:18:32 +0000
> "Michael Kelley (LINUX)" <mikelley@xxxxxxxxxxxxx> wrote:
> 
> > From: Petr Tesařík <petr@xxxxxxxxxxx> Sent: Friday, July 7, 2023 3:22 AM
> > >
> > > On Fri, 7 Jul 2023 10:29:00 +0100
> > > Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > > On Thu, Jul 06, 2023 at 02:22:50PM +0000, Michael Kelley (LINUX) wrote:
> > > > > From: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> Sent: Thursday, July 6, 2023 1:07 AM
> > > > > >
> > > > > > On Thu, Jul 06, 2023 at 03:50:55AM +0000, Michael Kelley (LINUX) wrote:
> > > > > > > From: Petr Tesarik <petrtesarik@xxxxxxxxxxxxxxx> Sent: Tuesday, June 27, 2023 2:54 AM
> > > > > > > >
> > > > > > > > Try to allocate a transient memory pool if no suitable slots can
> > > > > > > > be found, except when allocating from a restricted pool. The
> > > > > > > > transient pool is just big enough for this one bounce buffer. It
> > > > > > > > is inserted into a per-device list of transient memory pools, and
> > > > > > > > it is freed again when the bounce buffer is unmapped.
> > > > > > > >
> > > > > > > > Transient memory pools are kept in an RCU list. A memory barrier
> > > > > > > > is required after adding a new entry, because any address within
> > > > > > > > a transient buffer must be immediately recognized as belonging to
> > > > > > > > the SWIOTLB, even if it is passed to another CPU.
> > > > > > > >
> > > > > > > > Deletion does not require any synchronization beyond RCU ordering
> > > > > > > > guarantees. After a buffer is unmapped, its physical addresses may
> > > > > > > > no longer be passed to the DMA API, so the memory range of the
> > > > > > > > corresponding stale entry in the RCU list never matches. If the
> > > > > > > > memory range gets allocated again, then it happens only after an
> > > > > > > > RCU quiescent state.
> > > > > > > >
> > > > > > > > Since bounce buffers can now be allocated from different pools,
> > > > > > > > add a parameter to swiotlb_alloc_pool() to let the caller know
> > > > > > > > which memory pool is used. Add swiotlb_find_pool() to find the
> > > > > > > > memory pool corresponding to an address. This function is now also
> > > > > > > > used by is_swiotlb_buffer(), because a simple boundary check is no
> > > > > > > > longer sufficient.
> > > > > > > >
> > > > > > > > The logic in swiotlb_alloc_tlb() is taken from
> > > > > > > > __dma_direct_alloc_pages(), simplified and enhanced to use
> > > > > > > > coherent memory pools if needed.
> > > > > > > >
> > > > > > > > Note that this is not the most efficient way to provide a bounce
> > > > > > > > buffer, but when a DMA buffer can't be mapped, something may (and
> > > > > > > > will) actually break. At that point it is better to make an
> > > > > > > > allocation, even if it may be an expensive operation.
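
[ Inline illustration only, not code from the patch: the per-device RCU list
  of transient pools described above might look roughly like the sketch below.
  The struct layout, helper names, and the dev->dma_io_tlb_pools field are my
  assumptions for the sake of the example. ]

#include <linux/device.h>
#include <linux/rculist.h>

struct transient_pool {
        struct list_head node;          /* entry in dev->dma_io_tlb_pools (assumed field) */
        phys_addr_t start, end;         /* physical range covered by this pool */
};

/* Writer side: publish the pool, then make it visible to all CPUs at once. */
static void publish_transient_pool(struct device *dev, struct transient_pool *pool)
{
        list_add_rcu(&pool->node, &dev->dma_io_tlb_pools);
        smp_mb();       /* addresses in the pool may be checked on another CPU */
}

/* Reader side: any address inside a listed pool belongs to the SWIOTLB. */
static bool addr_in_transient_pool(struct device *dev, phys_addr_t paddr)
{
        struct transient_pool *pool;
        bool found = false;

        rcu_read_lock();
        list_for_each_entry_rcu(pool, &dev->dma_io_tlb_pools, node) {
                if (paddr >= pool->start && paddr < pool->end) {
                        found = true;
                        break;
                }
        }
        rcu_read_unlock();
        return found;
}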
> > > > > > >
> > > > > > > I continue to think about swiotlb memory management from the
> > > > > > > standpoint of CoCo VMs that may be quite large with high network
> > > > > > > and storage loads.  These VMs are often running mission-critical
> > > > > > > workloads that can't tolerate a bounce buffer allocation failure.
> > > > > > > To prevent such failures, the swiotlb memory size must be overly
> > > > > > > large, which wastes memory.
> > > > > >
> > > > > > If "mission critical workloads" are in a vm that allowes overcommit 
> > > > > > and
> > > > > > no control over other vms in that same system, then you have worse
> > > > > > problems, sorry.
> > > > > >
> > > > > > Just don't do that.
> > > > > >
> > > > >
> > > > > No, the cases I'm concerned about don't involve memory overcommit.
> > > > >
> > > > > CoCo VMs must use swiotlb bounce buffers to do DMA I/O.  Current
> > > > > swiotlb code in the Linux guest allocates a configurable, but fixed,
> > > > > amount of guest memory at boot time for this purpose.  But it's hard
> > > > > to know how much swiotlb bounce buffer memory will be needed to handle
> > > > > peak I/O loads.  This patch set does dynamic allocation of swiotlb
> > > > > bounce buffer memory, which can help avoid needing to configure an
> > > > > overly large fixed size at boot.
> > > >
> > > > But, as you point out, memory allocation can fail at runtime, so how can
> > > > you "guarantee" that this will work properly anymore if you are going to
> > > > make it dynamic?
> > >
> > > In general, there is no guarantee, of course, because bounce buffers
> > > may be requested from interrupt context. I believe Michael is looking
> > > for the SWIOTLB_MAY_SLEEP flag that was introduced in my v2 series, so
> > > new pools can be allocated with GFP_KERNEL instead of GFP_NOWAIT if
> > > possible, and then there is no need to dip into the coherent pool.
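
[ Side note, for illustration only: the GFP selection Petr describes might look
  something like the sketch below.  The SWIOTLB_MAY_SLEEP name comes from this
  discussion; its value and how the flag is actually plumbed through in the v2
  series are my assumptions. ]

#include <linux/bits.h>
#include <linux/gfp.h>

#define SWIOTLB_MAY_SLEEP       BIT(1)  /* assumed value, for the sketch only */

/* Pick allocation flags for a new transient pool based on the caller's context. */
static gfp_t transient_pool_gfp(unsigned int swiotlb_flags)
{
        if (swiotlb_flags & SWIOTLB_MAY_SLEEP)
                return GFP_KERNEL;                      /* caller may block */
        return GFP_NOWAIT | __GFP_NOWARN;               /* atomic context: no sleeping */
}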
> > >
> > > Well, I have deliberately removed all complexities from my v3 series,
> > > but I have more WIP local topic branches in my local repo:
> > >
> > > - allow blocking allocations if possible
> > > - allocate a new pool before existing pools are full
> > > - free unused memory pools
> > >
> > > I can make a bigger series, or I can send another series as RFC if this
> > > is desired. ATM I don't feel confident enough that my v3 series will be
> > > accepted without major changes, so I haven't invested time into
> > > finalizing the other topic branches.
> > >
> > > @Michael: If you know that my plan is to introduce blocking allocations
> > > with a follow-up patch series, is the present approach acceptable?
> > >
> >
> > Yes, I think the present approach is acceptable as a first step.  But
> > let me elaborate a bit on my thinking.
> >
> > I was originally wondering if it is possible for swiotlb_map() to detect
> > whether it is called from a context that allows sleeping, without the use
> > of SWIOTLB_MAY_SLEEP.   This would get the benefits without having to
> > explicitly update drivers to add the flag.  But maybe that's too risky.
> 
> This is a recurring topic and it has been discussed several times in
> the mailing lists. If you ask me, the best answer is this one by Andrew
> Morton, albeit a bit dated:
> 
> https://lore.kernel.org/lkml/20080320201723.b87b3732.akpm@xxxxxxxxxxxxxxxxxxxx/

Thanks.  That's useful context.

> 
> > For the CoCo VM scenario that I'm most interested in, being a VM
> > implicitly reduces the set of drivers that are being used, and so it's
> > not that hard to add the flag in the key drivers that generate most of
> > the bounce buffer traffic.
> 
> Yes, that's my thinking as well.
> 
> > Then I was thinking about a slightly different usage for the flag than what
> > you implemented in v2 of the series.   In the case where swiotlb_map()
> > can't allocate slots because of the swiotlb pool being full (or mostly full),
> > kick the background thread (if it is not already awake) to allocate a
> > dynamic pool and grow the total size of the swiotlb.  Then if
> > SWIOTLB_MAY_SLEEP is *not* set, allocate a transient pool just as you
> > have implemented in this v3 of the series.  But if SWIOTLB_MAY_SLEEP
> > *is* set, swiotlb_map() should sleep until the background thread has
> > completed the memory allocation and grown the size of the swiotlb.
> > After the sleep, retry the slot allocation.  Maybe what I'm describing
> > is what you mean by "allow blocking allocations".  :-)
> 
> Not really, but I like the idea. After all, the only reason to have
> transient pools is when something is needed immediately while the
> background allocation is running.

You can also take the thinking one step further:  For bounce buffer
requests that allow blocking, you could decide not to grow the pool
above a specified maximum.  If the max has been reached and space
is not available, sleep until space is released by some other in-progress
request.  This could be a valid way to handle peak demand while
capping the memory allocated to the bounce buffer pool.  There
would be a latency hit because of the waiting, but that could
be a reasonable tradeoff for rare peaks.  Of course, for requests that can't
block, you'd still need to allocate a transient pool.
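
Very roughly, and purely as a sketch with invented helper names (no real
locking or accounting, just the shape of the control flow I have in mind):

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/wait.h>

/* Assumed helpers, not implemented here. */
static phys_addr_t swiotlb_try_find_slots(struct device *dev, phys_addr_t orig, size_t size);
static void swiotlb_kick_grow_worker(struct device *dev);
static phys_addr_t swiotlb_alloc_transient(struct device *dev, phys_addr_t orig, size_t size);
static bool swiotlb_space_available(struct device *dev, size_t size);
static DECLARE_WAIT_QUEUE_HEAD(swiotlb_space_wq);       /* assumed wait queue */

static phys_addr_t swiotlb_map_throttled(struct device *dev, phys_addr_t orig,
                                         size_t size, bool may_sleep)
{
        phys_addr_t tlb_addr;

        for (;;) {
                /* Try the existing pools first. */
                tlb_addr = swiotlb_try_find_slots(dev, orig, size);
                if (tlb_addr != (phys_addr_t)DMA_MAPPING_ERROR)
                        return tlb_addr;

                /* Ask the background worker to grow the swiotlb, up to a cap. */
                swiotlb_kick_grow_worker(dev);

                /* Non-blocking callers still need an immediate fallback. */
                if (!may_sleep)
                        return swiotlb_alloc_transient(dev, orig, size);

                /* Blocking callers wait for growth or for slots to be released. */
                wait_event(swiotlb_space_wq, swiotlb_space_available(dev, size));
        }
}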

Michael

> 
> > This approach effectively throttles incoming swiotlb requests when space
> > is exhausted, and gives the dynamic sizing mechanism a chance to catch
> > up in an efficient fashion.  Limiting transient pools to requests that can't
> > sleep will reduce the likelihood of exhausting the coherent memory
> > pools.  And as you mentioned above, kicking the background thread at the
> > 90% full mark (or some such heuristic) also helps the dynamic sizing
> > mechanism keep up with demand.
> 
> FWIW I did some testing, and my systems were not able to survive a
> sudden I/O peak without transient pools, no matter how low I set the
> threshold for kicking a background allocation. OTOH I always tested with the
> smallest possible SWIOTLB (256 KiB * rounded up number of CPUs, e.g. 16
> MiB on my VM with 48 CPUs). Other sizes may lead to different results.
> 
> As a matter of fact, the size of the initial SWIOTLB memory pool and the
> size(s) of additional pool(s) sound like interesting tunable parameters
> that I haven't explored in depth yet.
> 
> Petr T



 

