
Re: [RFC XEN PATCH v3 1/5] docs/designs: Add a design document for PV-IOMMU



Disclaimer: I haven't looked at the code yet.

On Thu Jul 11, 2024 at 3:04 PM BST, Teddy Astie wrote:
> Some operating systems want to use IOMMU to implement various features (e.g
> VFIO) or DMA protection.
> This patch introduce a proposal for IOMMU paravirtualization for Dom0.
>
> Signed-off-by Teddy Astie <teddy.astie@xxxxxxxxxx>
> ---
>  docs/designs/pv-iommu.md | 105 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 105 insertions(+)
>  create mode 100644 docs/designs/pv-iommu.md
>
> diff --git a/docs/designs/pv-iommu.md b/docs/designs/pv-iommu.md
> new file mode 100644
> index 0000000000..c01062a3ad
> --- /dev/null
> +++ b/docs/designs/pv-iommu.md
> @@ -0,0 +1,105 @@
> +# IOMMU paravirtualization for Dom0
> +
> +Status: Experimental
> +
> +# Background
> +
> +By default, Xen only uses the IOMMU for itself, either to make device adress
> +space coherent with guest adress space (x86 HVM/PVH) or to prevent devices
> +from doing DMA outside it's expected memory regions including the hypervisor
> +(x86 PV).

"By default...": Do you mean "currently"?

> +
> +A limitation is that guests (especially privildged ones) may want to use
> +IOMMU hardware in order to implement features such as DMA protection and
> +VFIO [1] as IOMMU functionality is not available outside of the hypervisor
> +currently.

s/privildged/privileged/

> +
> +[1] VFIO - "Virtual Function I/O" - https://www.kernel.org/doc/html/latest/driver-api/vfio.html
> +
> +# Design
> +
> +The operating system may want to have access to various IOMMU features such as
> +context management and DMA remapping. We can create a new hypercall that allows
> +the guest to have access to a new paravirtualized IOMMU interface.
> +
> +This feature is only meant to be available for the Dom0, as DomU have some
> +emulated devices that can't be managed on Xen side and are not hardware, we
> +can't rely on the hardware IOMMU to enforce DMA remapping.

Is that the reason though? While it's true we can't mix emulated and real
devices under the same emulated PCI bus covered by an IOMMU, nothing prevents us
from stating "the IOMMU(s) configured via PV-IOMMU cover busN through busM".

AFAIK, that already happens on systems with several IOMMUs, where each one might
cover a partially disjoint set of devices. But I admit I'm no expert on this.

I can definitely see a lot of interesting use cases for a PV-IOMMU interface
exposed to domUs (it'd be a subset of that of dom0, obviously); that'd
allow them to use the IOMMU without resorting to 2-stage translation, which has
terrible IOTLB miss costs.

> +
> +This interface is exposed under the `iommu_op` hypercall.
> +
> +In addition, Xen domains are modified in order to allow existence of several
> +IOMMU context including a default one that implement default behavior (e.g
> +hardware assisted paging) and can't be modified by guest. DomU cannot have
> +contexts, and therefore act as if they only have the default domain.
> +
> +Each IOMMU context within a Xen domain is identified using a domain-specific
> +context number that is used in the Xen IOMMU subsystem and the hypercall
> +interface.
> +
> +The number of IOMMU context a domain can use is predetermined at domain creation
> +and is configurable through `dom0-iommu=nb-ctx=N` xen cmdline.

nit: I think it's more typical within Xen to see "nr" rather than "nb"

> +
> +# IOMMU operations
> +
> +## Alloc context
> +
> +Create a new IOMMU context for the guest and return the context number to the
> +guest.
> +Fail if the IOMMU context limit of the guest is reached.

or -ENOMEM, I guess.

I'm guessing from this that dom0 manages the contexts on behalf of guests? Or
are these contexts for use within dom0 exclusively?

> +
> +A flag can be specified to create a identity mapping.
> +
> +## Free context
> +
> +Destroy a IOMMU context created previously.
> +It is not possible to free the default context.
> +
> +Reattach context devices to default context if specified by the guest.
> +
> +Fail if there is a device in the context and reattach-to-default flag is not
> +specified.
> +
> +## Reattach device
> +
> +Reattach a device to another IOMMU context (including the default one).
> +The target IOMMU context number must be valid and the context allocated.
> +
> +The guest needs to specify a PCI SBDF of a device he has access to.
> +
> +## Map/unmap page
> +
> +Map/unmap a page on a context.
> +The guest needs to specify a gfn and target dfn to map.

And an "order", I hope, to enable superpages and hugepages without having to
find out after the fact that the mappings are in fact mergeable and the leaf PTs
can go away.
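
To illustrate, here is a minimal sketch of what map arguments carrying an order
could look like. The struct, its field names and the helper are hypothetical
and not taken from the series; the only point is that an aligned gfn/dfn pair
plus an order gives the hypervisor enough information to pick a superpage entry
up front.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical argument layout; not taken from the patch series. */
struct pv_iommu_map_args {
    uint16_t ctx_no;  /* IOMMU context number */
    uint64_t gfn;     /* first guest frame of the range */
    uint64_t dfn;     /* first device frame of the range */
    uint32_t order;   /* the range spans 2^order contiguous 4K pages */
};

/*
 * Return true if both frames are aligned to 2^order pages, i.e. a single
 * superpage entry of that order could back the whole mapping.
 */
static bool pv_iommu_map_aligned(const struct pv_iommu_map_args *a)
{
    uint64_t mask;

    assert(a->order < 64);
    mask = (UINT64_C(1) << a->order) - 1;

    return !((a->gfn | a->dfn) & mask);
}
```

With order 9 (2M on x86), gfn 0x200 and dfn 0x400 pass the check, while gfn
0x201 does not and would have to be mapped with smaller pages.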

> +
> +Refuse to create the mapping if one already exist for the same dfn.
> +
> +## Lookup page
> +
> +Get the gfn mapped by a specific dfn.
> +
> +# Implementation considerations
> +
> +## Hypercall batching
> +
> +In order to prevent unneeded hypercalls and IOMMU flushing, it is advisable to
> +be able to batch some critical IOMMU operations (e.g map/unmap multiple pages).

See above for an additional way of reducing the load.

> +
> +## Hardware without IOMMU support
> +
> +Operating system needs to be aware on PV-IOMMU capability, and whether it is
> +able to make contexts. However, some operating system may critically fail in
> +case they are able to make a new IOMMU context. Which is supposed to happen
> +if no IOMMU hardware is available.
> +
> +The hypercall interface needs a interface to advertise the ability to create
> +and manage IOMMU contexts including the amount of context the guest is able
> +to use. Using these informations, the Dom0 may decide whether to use or not
> +the PV-IOMMU interface.

We could just return -ENOTSUPP when there's no IOMMU, then wrap an arbitrary
lookup in a pv_iommu_is_present() helper that returns true or false depending
on rc.
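
A rough sketch of that probe follows, assuming the lookup sub-op can be called
on the default context. The function-pointer type and the two stub lookups are
hypothetical stand-ins for the real hypercall wrapper, and the ENOTSUPP
fallback value is only there so the snippet builds outside the kernel.

```c
#include <errno.h>
#include <stdbool.h>

#ifndef ENOTSUPP
#define ENOTSUPP 524 /* assumption: kernel-internal value, absent from userspace errno.h */
#endif

/* Hypothetical stand-in for the "lookup page" sub-op of the hypercall. */
typedef int (*pv_iommu_lookup_fn)(unsigned int ctx_no, unsigned long dfn,
                                  unsigned long *gfn);

/* Stub: behaves like a hypervisor running without IOMMU hardware. */
static int stub_lookup_no_iommu(unsigned int ctx_no, unsigned long dfn,
                                unsigned long *gfn)
{
    (void)ctx_no; (void)dfn; (void)gfn;
    return -ENOTSUPP;
}

/* Stub: the interface exists, but nothing is mapped at this dfn. */
static int stub_lookup_unmapped(unsigned int ctx_no, unsigned long dfn,
                                unsigned long *gfn)
{
    (void)ctx_no; (void)dfn; (void)gfn;
    return -ENOENT;
}

/*
 * Probe with an arbitrary lookup. Any rc other than -ENOTSUPP, including
 * "not mapped", means the interface is there.
 */
static bool pv_iommu_is_present(pv_iommu_lookup_fn lookup)
{
    unsigned long gfn;

    return lookup(0, 0, &gfn) != -ENOTSUPP;
}
```

This only answers "is the interface there". The number of usable contexts
would still need to be reported some other way.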

> +
> +## Page pool for contexts
> +
> +In order to prevent unexpected starving on the hypervisor memory with a
> +buggy Dom0. We can preallocate the pages the contexts will use and make
> +map/unmap use these pages instead of allocating them dynamically.
> +

That seems dangerous should we need to shatter a superpage asynchronously (i.e.
because the HW misbehaves and requires it) while the pool has no pages left.
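
To make the concern concrete, here is a toy model of a fixed pool. All names
are hypothetical, and it assumes that splitting one superpage entry costs one
new page-table page.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a preallocated page-table pool; names are hypothetical. */
struct ctx_page_pool {
    size_t free_pages;
};

/* Take one page-table page from the pool; fails once the pool is empty. */
static bool pool_take(struct ctx_page_pool *pool)
{
    if (pool->free_pages == 0)
        return false;

    pool->free_pages--;
    return true;
}

/*
 * Splitting a superpage entry into a table of smaller entries needs one
 * new page-table page. With a fixed pool, this can fail at a point where
 * there is no caller to hand an error back to.
 */
static bool shatter_superpage(struct ctx_page_pool *pool)
{
    return pool_take(pool);
}
```

Once map/unmap has drained the pool, any later shatter fails even though the
guest did nothing wrong at that moment.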

Cheers,
Alejandro
