[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v7 0/9] Allow dynamic allocation of software IO TLB bounce buffers



On Tue, Aug 01, 2023 at 08:23:55AM +0200, Petr Tesarik wrote:
> From: Petr Tesarik <petr.tesarik.ext@xxxxxxxxxx>
> 
> Motivation
> ==========
> 
> The software IO TLB was designed with these assumptions:
> 
> 1) It would not be used much. Small systems (little RAM) don't need it, and
>    big systems (lots of RAM) would have modern DMA controllers and an IOMMU
>    chip to handle legacy devices.
> 2) A small fixed memory area (64 MiB by default) is sufficient to
>    handle the few cases which require a bounce buffer.
> 3) 64 MiB is little enough that it has no impact on the rest of the
>    system.
> 4) Bounce buffers require large contiguous chunks of low memory. Such
>    memory is precious and can be allocated only early at boot.
> 
> It turns out they are not always true:
> 
> 1) Embedded systems may have more than 4GiB RAM but no IOMMU and legacy
>    32-bit peripheral busses and/or DMA controllers.
> 2) CoCo VMs use bounce buffers for all I/O but may need substantially more
>    than 64 MiB.
> 3) Embedded developers put as many features as possible into the available
>    memory. A few dozen "missing" megabytes may limit what features can be
>    implemented.
> 4) If CMA is available, it can allocate large continuous chunks even after
>    the system has run for some time.
> 
> Goals
> =====
> 
> The goal of this work is to start with a small software IO TLB at boot and
> expand it later when/if needed.
> 
> Design
> ======
> 
> This version of the patch series retains the current slot allocation
> algorithm with multiple areas to reduce lock contention, but additional
> slots can be added when necessary.
> 
> These alternatives have been considered:
> 
> - Allocate and free buffers as needed using direct DMA API. This works
>   quite well, except in CoCo VMs where each allocation/free requires
>   decrypting/encrypting memory, which is a very expensive operation.
> 
> - Allocate a very large software IO TLB at boot, but allow to migrate pages
>   to/from it (like CMA does). For systems with CMA, this would mean two big
>   allocations at boot. Finding the balance between CMA, SWIOTLB and rest of
>   available RAM can be challenging. More importantly, there is no clear
>   benefit compared to allocating SWIOTLB memory pools from the CMA.
> 
> Implementation Constraints
> ==========================
> 
> These constraints have been taken into account:
> 
> 1) Minimize impact on devices which do not benefit from the change.
> 2) Minimize the number of memory decryption/encryption operations.
> 3) Avoid contention on a lock or atomic variable to preserve parallel
>    scalability.
> 
> Additionally, the software IO TLB code is also used to implement restricted
> DMA pools. These pools are restricted to a pre-defined physical memory
> region and must not use any other memory. In other words, dynamic
> allocation of memory pools must be disabled for restricted DMA pools.
> 
> Data Structures
> ===============
> 
> The existing struct io_tlb_mem is the central type for a SWIOTLB allocator,
> but it now contains multiple memory pools::
> 
>   io_tlb_mem
>   +---------+   io_tlb_pool
>   | SWIOTLB |         +-------+   +-------+   +-------+
>   |allocator|-->|default|-->|dynamic|-->|dynamic|-->...
>   |               |   |memory |   |memory |   |memory |
>   +---------+         | pool  |   | pool  |   | pool  |
>               +-------+   +-------+   +-------+
> 
> The allocator structure contains global state (such as flags and counters)
> and structures needed to schedule new allocations. Each memory pool
> contains the actual buffer slots and metadata. The first memory pool in the
> list is the default memory pool allocated statically at early boot.
> 
> New memory pools are allocated from a kernel worker thread. That's because
> bounce buffers are allocated when mapping a DMA buffer, which may happen in
> interrupt context where large atomic allocations would probably fail.
> Allocation from process context is much more likely to succeed, especially
> if it can use CMA.
> 
> Nonetheless, the onset of a load spike may fill up the SWIOTLB before the
> worker has a chance to run. In that case, try to allocate a small transient
> memory pool to accommodate the request. If memory is encrypted and the
> device cannot do DMA to encrypted memory, this buffer is allocated from the
> coherent atomic DMA memory pool. Reducing the size of SWIOTLB may therefore
> require increasing the size of the coherent pool with the "coherent_pool"
> command-line parameter.
> 
> Performance
> ===========
> 
> All testing compared a vanilla v6.4-rc6 kernel with a fully patched
> kernel. The kernel was booted with "swiotlb=force" to allow stress-testing
> the software IO TLB on a high-performance device that would otherwise not
> need it. CONFIG_DEBUG_FS was set to 'y' to match the configuration of
> popular distribution kernels; it is understood that parallel workloads
> suffer from contention on the recently added debugfs atomic counters.
> 
> These benchmarks were run:
> 
> - small: single-threaded I/O of 4 KiB blocks,
> - big: single-threaded I/O of 64 KiB blocks,
> - 4way: 4-way parallel I/O of 4 KiB blocks.
> 
> In all tested cases, the default 64 MiB SWIOTLB would be sufficient (but
> wasteful). The "default" pair of columns shows performance impact when
> booted with 64 MiB SWIOTLB (i.e. current state). The "growing" pair of
> columns shows the impact when booted with a 1 MiB initial SWIOTLB, which
> grew to 5 MiB at run time. The "var" column in the tables below is the
> coefficient of variance over 5 runs of the test, the "diff" column is the
> difference in read-write I/O bandwidth (MiB/s). The very first column is
> the coefficient of variance in the results of the base unpatched kernel.
> 
> First, on an x86 VM against a QEMU virtio SATA driver backed by a RAM-based
> block device on the host:
> 
>       base       default         growing
>       var     var     diff    var     diff
> small 1.96%   0.47%   -1.5%   0.52%   -2.2%
> big   2.03%   1.35%   +0.9%   2.22%   +2.9%
> 4way  0.80%   0.45%   -0.7%   1.22%   <0.1%
> 
> Second, on a Raspberry Pi4 with 8G RAM and a class 10 A1 microSD card:
> 
>       base       default         growing
>       var     var     diff    var     diff
> small 1.09%   1.69%   +0.5%   2.14%   -0.2%
> big   0.03%   0.28%   -0.5%   0.03%   -0.1%
> 4way  5.15%   2.39%   +0.2%   0.66%   <0.1%
> 
> Third, on a CoCo VM. This was a bigger system, so I also added a 24-thread
> parallel I/O test:
> 
>       base       default         growing
>       var     var     diff    var     diff
> small 2.41%   6.02%   +1.1%   10.33%  +6.7%
> big   9.20%   2.81%   -0.6%   16.84%  -0.2%
> 4way  0.86%   2.66%   -0.1%    2.22%  -4.9%
> 24way 3.19%   6.19%   +4.4%    4.08%  -5.9%
> 
> Note the increased variance of the CoCo VM, although the host was not
> otherwise loaded. These are caused by the first run, which includes the
> overhead of allocating additional bounce buffers and sharing them with the
> hypervisor. The system was not rebooted between successive runs.
> 
> Parallel tests suffer from a reduced number of areas in the dynamically
> allocated memory pools. This can be improved by allocating a larger pool
> from CMA (not implemented in this series yet).
> 
> I have no good explanation for the increase in performance of the
> 24-thread I/O test with the default (non-growing) memory pool. Although the
> difference is within variance, it seems to be real. The average bandwidth
> is consistently above that of the unpatched kernel.
> 
> To sum it up:
> 
> - All workloads benefit from reduced memory footprint.
> - No performance regressions have been observed with the default size of
>   the software IO TLB.
> - Most workloads retain their former performance even if the software IO
>   TLB grows at run time.
> 

For the driver-core touched portions:

Reviewed-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.