
Re: [Xen-devel] Xen virtual IOMMU high level design doc V3



> From: Lan, Tianyu
> Sent: Thursday, November 17, 2016 11:37 PM
> 
> Change since V2:
>       1) Update motivation for Xen vIOMMU - 288 vcpus support part
>       Add description of the plan to increase the vcpu count from 128 to
> 255 and the dependency between X2APIC and interrupt remapping.
>       2) Update 3.1 New vIOMMU hypercall interface
>       Change the vIOMMU hypercall from sysctl to dmop, add multi-vIOMMU
> consideration and a drain in-flight DMA subcommand.
>       3) Update 3.5 implementation consideration
>       We found it's still safe to enable the interrupt remapping function
> before adding l2 translation (DMA translation) to increase the vcpu
> number beyond 255.
>       4) Update 3.2 l2 translation - virtual device part
>       Add a proposal to deal with the race between in-flight DMA and
> invalidation operations in the hypervisor.
>       5) Update 4.4 Report vIOMMU to hvmloader
>       Add the option of building the ACPI DMAR table in the toolstack for
> discussion.
> 
> Change since V1:
>       1) Update motivation for Xen vIOMMU - 288 vcpus support part
>       2) Change definition of struct xen_sysctl_viommu_op
>       3) Update "3.5 Implementation consideration" to explain why we need
> to enable l2 translation first.
>       4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on
> the emulated I440 chipset.
>       5) Remove stale statement in the "3.3 Interrupt remapping"
> 
> Content:
> ===========================================================================
> 1. Motivation of vIOMMU
>       1.1 Enable more than 255 vcpus
>       1.2 Support VFIO-based user space driver
>       1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>       2.1 l2 translation overview

L2/L1 might be more readable than l2/l1. :-)

>       2.2 Interrupt remapping overview

To be complete, an overview of l1 translation is needed here.

> 3. Xen hypervisor
>       3.1 New vIOMMU hypercall interface
>       3.2 l2 translation
>       3.3 Interrupt remapping
>       3.4 l1 translation
>       3.5 Implementation consideration
> 4. Qemu
>       4.1 Qemu vIOMMU framework
>       4.2 Dummy xen-vIOMMU driver
>       4.3 Q35 vs. i440x
>       4.4 Report vIOMMU to hvmloader
> 
> 
> Glossary:
> ===========================================================================
> l1 translation - first-level translation to remap a virtual address to
> an intermediate (guest) physical address. (GVA->GPA)
> l2 translation - second-level translation to remap an intermediate
> physical address to a machine (host) physical address. (GPA->HPA)

If a glossary section is required, please make it complete (interrupt
remapping, DMAR, etc.)

Also please stick to what spec says. I don't think 'intermediate' physical
address is a widely-used term, and GVA->GPA/GPA->HPA are only partial
usages of those structures. You may make them an example, but be
careful with the definition.

> 
> 1 Motivation for Xen vIOMMU
> ===========================================================================
> 1.1 Enable more than 255 vcpu support

vcpu->vcpus

> HPC cloud services require VMs to provide high-performance parallel
> computing, so we hope to create a huge VM with >255 vcpus on one machine
> to meet such a requirement, pinning each vcpu to a separate pcpu.
> 
> Currently an HVM guest can support 128 vcpus at most. We can increase the
> vcpu number from 128 to 255 by relaxing some limitations and extending
> vcpu-related data structures. This also requires changing the rule for
> allocating a vcpu's APIC ID. The current rule is "(APIC ID) = (vcpu
> index) * 2" and we need to change it to "(APIC ID) = (vcpu index)".
> Andrew Cooper's CPUID improvement work will cover this to improve the
> guest's cpu topology, and we will build on it to increase the vcpu number
> from 128 to 255.
> 
> To support >255 vcpus, X2APIC mode in the guest is necessary because the
> legacy APIC (XAPIC) only supports 8-bit APIC IDs and can therefore
> support at most 255 vcpus. X2APIC mode supports 32-bit APIC IDs and
> requires the interrupt remapping function of the vIOMMU.
> 
> The reason for this is that existing PCI MSI and IOAPIC are not modified
> by the introduction of X2APIC. PCI MSI/IOAPIC can only send interrupt
> messages containing an 8-bit APIC ID, which cannot address >255 cpus.
> Interrupt remapping supports 32-bit APIC IDs, so it's necessary for
> enabling >255 cpus with x2apic mode.
> 
> Both Linux and Windows require interrupt remapping when the cpu number
> is >255.
> 
> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) on

GIOVA->GPA to be consistent

> vIOMMU. The pIOMMU l2 becomes a shadow structure of the
> vIOMMU to isolate DMA requests initiated by the user space driver.
> 

You may give more background on how VFIO manages the user space driver
to make the whole picture clearer, like what you did for >255 vcpus support.

> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) on the
> vIOMMU. The pIOMMU needs to enable both l1 and l2 translation in nested
> mode (GVA->GPA->HPA) for the passthrough device. IGD passthrough
> is the main usage today (to support the OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.

At least make it clear that SVM is about sharing the virtual address space
between the CPU and the device side, so a CPU virtual address can be
programmed into the device as a DMA destination. The format of the L1
structure is compatible with the CPU page table.

> 
> 
> 
> 2. Xen vIOMMU Architecture
> ===========================================================================
> 
> * vIOMMU will be inside the Xen hypervisor for the following reasons
>       1) Avoid round trips between Qemu and Xen hypervisor
>       2) Ease of integration with the rest of the hypervisor
>       3) HVMlite/PVH doesn't use Qemu

3) Maximum code reuse for HVMlite/PVH which doesn't use Qemu at all

> * Dummy xen-vIOMMU in Qemu as a wrapper of the new hypercall to create
> /destroy the vIOMMU in the hypervisor and deal with virtual PCI devices'
> l2 translation.
> 
> 2.1 l2 translation overview
> For virtual PCI devices, the dummy xen-vIOMMU does the translation in
> Qemu via the new hypercall.
> 
> For physical PCI device, vIOMMU in hypervisor shadows IO page table from

what's "IO page table"? L2 translation?

> IOVA->GPA to IOVA->HPA and loads the page table into the physical IOMMU.
> 
> The following diagram shows l2 translation architecture.
> +---------------------------------------------------------+
> |Qemu                                +----------------+   |
> |                                    |     Virtual    |   |
> |                                    |   PCI device   |   |
> |                                    |                |   |
> |                                    +----------------+   |
> |                                            |DMA         |
> |                                            V            |
> |  +--------------------+   Request  +----------------+   |
> |  |                    +<-----------+                |   |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> |  |                    +----------->+                |   |
> |  +---------+----------+            +-------+--------+   |
> |            |                               |            |
> |            |Hypercall                      |            |
> +--------------------------------------------+------------+
> |Hypervisor  |                               |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     |   vIOMMU    |                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     | IOMMU driver|                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> +--------------------------------------------+------------+
> |HW          v                               V            |
> |     +------+------+                 +-------------+     |
> |     |   IOMMU     +---------------->+  Memory     |     |
> |     +------+------+                 +-------------+     |
> |            ^                                            |
> |            |                                            |
> |     +------+------+                                     |
> |     | PCI Device  |                                     |
> |     +-------------+                                     |
> +---------------------------------------------------------+
> 
> 2.2 Interrupt remapping overview.
> Interrupts from virtual devices and physical devices will be delivered
> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during this

from vIOAPIC and vMSI to vLAPIC

> procedure.
> 
> +---------------------------------------------------+
> |Qemu                       |VM                     |
> |                           | +----------------+    |
> |                           | |  Device driver |    |
> |                           | +--------+-------+    |
> |                           |          ^            |
> |       +----------------+  | +--------+-------+    |
> |       | Virtual device |  | |  IRQ subsystem |    |
> |       +-------+--------+  | +--------+-------+    |
> |               |           |          ^            |
> |               |           |          |            |
> +---------------------------+-----------------------+
> |hypervisor     |                      | VIRQ       |
> |               |            +---------+--------+   |
> |               |            |      vLAPIC      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |      vIOMMU      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |   vIOAPIC/vMSI   |   |
> |               |            +----+----+--------+   |
> |               |                 ^    ^            |
> |               +-----------------+    |            |
> |                                      |            |
> +---------------------------------------------------+
> HW                                     |IRQ
>                                  +-------------------+
>                                  |   PCI Device      |
>                                  +-------------------+
> 
> 

You introduced SVM usage in the earlier motivation, but there is no
information about it in this design section. Please make it clear whether
this is a mistake. If the intention is not to cover SVM virtualization in
this design, please say so at the start of the design.

> 
> 
> 3 Xen hypervisor
> ===========================================================================
> 
> 3.1 New hypercall XEN_dmop_viommu_op
> Create a new dmop (device model operation hypercall) for vIOMMU since it
> will be called by Qemu during runtime. This hypercall should also
> support PV IOMMU, which is still under RFC review. Here we only cover
> the non-PV part.

Not sure dmop is a good terminology. Is it possible to build it on top of
the category used for HVMlite (suppose it needs some device model
related hypercalls from toolstack)?

> 
> 1) Definition of "struct xen_dmop_viommu_op" as new hypercall parameter.
> 
> struct xen_dmop_viommu_op {
>       u32 cmd;
>       u32 domid;
>       u32 viommu_id;
>       union {
>               struct {
>                       u32 capabilities;

OUT parameter?

>               } query_capabilities;
>               struct {
>                       /* IN parameters. */
>                       u32 capabilities;
>                       u64 base_address;
>                       struct {
>                               u32 size;
>                               XEN_GUEST_HANDLE_64(uint32) dev_list;
>                       } dev_scope;
>                       /* Out parameters. */
>                       u32 viommu_id;

duplicated with earlier viommu_id?

>               } create_iommu;
>               struct {
>                       /* IN parameters. */
>                       u32 vsbdf;
>                       u64 iova;
>                       /* Out parameters. */
>                       u64 translated_addr;
>                       u64 addr_mask; /* Translation page size */
>                       u32 permission;
>               } l2_translation;
>       };
> };
> 
> 
> Definition of VIOMMU access permission:

VIOMMU 'memory' access permission?

> #define VIOMMU_NONE     0
> #define VIOMMU_RO       1
> #define VIOMMU_WO       2
> #define VIOMMU_RW       3
> 
> 
> Definition of VIOMMU subops:
> #define XEN_DMOP_viommu_query_capability              0
> #define XEN_DMOP_viommu_create                                1
> #define XEN_DMOP_viommu_destroy                               2
> #define XEN_DMOP_viommu_dma_translation_for_vpdev     3

what's vpdev? virtual device in Qemu?

> #define XEN_DMOP_viommu_dma_drain_completed           4
> 
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_l1_translation          (1 << 0)
> #define XEN_VIOMMU_CAPABILITY_l2_translation          (1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping     (1 << 2)
> 
> 
> 2) Design for subops
> - XEN_DMOP_viommu_query_capability
>       Get vIOMMU capabilities (l1/l2 translation and interrupt
> remapping).
> 
> - XEN_DMOP_viommu_create
>       Create the vIOMMU in the Xen hypervisor with the dom_id,
> capabilities, register base address and device scope. If the size of the
> device list is 0, all PCI devices are under this vIOMMU except PCI
> devices assigned to another vIOMMU. The hypervisor returns the vIOMMU id.

Is it clearer to follow VT-d spec by using a INCLUDE_ALL flag
for this purpose?

> 
> - XEN_DMOP_viommu_destroy
>       Destroy the vIOMMU in the Xen hypervisor with dom_id as the parameter.

don't you require a viommu_id?

> 
> - XEN_DMOP_viommu_dma_translation_for_vpdev
>       Translate an IOVA to a GPA for the specified virtual PCI device,
> given the dom id, the PCI device's bdf and the IOVA; the Xen hypervisor
> returns the translated GPA, address mask and access permission.
> 
> - XEN_DMOP_viommu_dma_drain_completed
>       Notify the hypervisor that the dummy vIOMMU has drained in-flight
> DMA after an invalidation operation, so that the vIOMMU can mark the
> invalidation as completed in the invalidation register.
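
As an illustration of how a device model might drive the subops above, here
is a minimal sketch in C. It assumes the struct and constants defined in 3.1
are in scope; issue_viommu_dmop() is a purely hypothetical wrapper around
whatever dmop delivery mechanism is finally chosen.

#include <stdint.h>

int issue_viommu_dmop(struct xen_dmop_viommu_op *op);   /* hypothetical */

/* Query what the hypervisor can emulate, then create a vIOMMU that
 * covers all PCI devices (empty device list, as described above). */
int setup_viommu(uint32_t domid, uint64_t reg_base)
{
    struct xen_dmop_viommu_op op = { 0 };

    op.cmd = XEN_DMOP_viommu_query_capability;
    op.domid = domid;
    if (issue_viommu_dmop(&op))
        return -1;
    if (!(op.query_capabilities.capabilities &
          XEN_VIOMMU_CAPABILITY_interrupt_remapping))
        return -1;

    op.cmd = XEN_DMOP_viommu_create;
    op.create_iommu.capabilities = XEN_VIOMMU_CAPABILITY_interrupt_remapping;
    op.create_iommu.base_address = reg_base;
    op.create_iommu.dev_scope.size = 0;   /* all devices, per the text above */
    if (issue_viommu_dmop(&op))
        return -1;

    return op.create_iommu.viommu_id;     /* OUT parameter */
}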
> 
> 
> 3.2 l2 translation
> 1) For virtual PCI device
> The dummy xen-vIOMMU in Qemu translates an IOVA to the target GPA via the
> new hypercall when a DMA operation happens.
> 
> When the guest triggers an invalidation operation, there may be in-flight
> DMA requests for virtual devices that have already been translated by the
> vIOMMU and returned to Qemu. Before the vIOMMU reports the invalidation
> as completed, it's necessary to make sure these in-flight DMA operations
> have completed.

Please be clear that above is required only when read/write draining is
implied. Not all invalidations require it.

> 
> When the IOMMU driver invalidates the IOTLB, it will also wait until the
> invalidation completes. We may use this to drain in-flight DMA operations
> for virtual devices.

Host or guest IOMMU driver? We may use 'what' to drain?

> 
> Guest triggers invalidation operation and trip into vIOMMU in

trip->trap

> hypervisor to flush cache data. After this, it should go to Qemu to
> drain in-fly DMA translation.

After what? Qemu drain should happen as part of trap-emulation of
guest invalidation operation. Also who is 'it'?

> 
> To do that, dummy vIOMMU in Qemu registers the same MMIO region as
> vIOMMU's and emulation part of invalidation operation in Xen hypervisor

emulation part -> emulation handler

> returns X86EMUL_UNHANDLEABLE after flush cache. MMIO emulation part is

after flush 'physical' cache?

> supposed to send an event to Qemu so that the dummy vIOMMU gets a chance
> to start a thread to drain in-flight DMA and then return emulation done.

I suppose the guest vcpu is blocked in this phase, right?

> 
> The guest polls the IVT (invalidate IOTLB) bit in the IOTLB invalidate
> register until it's cleared after triggering an invalidation. The dummy
> vIOMMU in Qemu notifies the hypervisor via hypercall that the drain
> operation has completed, the vIOMMU clears the IVT bit, and the guest
> finishes the invalidation operation.
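
To make the ordering concrete, here is a minimal sketch of the Qemu-side
handler, with every name hypothetical (XenVIOMMUState,
xen_viommu_drain_inflight_dma(), xen_viommu_dmop()); it assumes Xen has
already flushed its own state and forwarded the unhandled register write
to Qemu:

/* Runs when Qemu receives the unhandled IOTLB-invalidation write. */
static void xen_viommu_handle_iotlb_inv(XenVIOMMUState *s)
{
    struct xen_dmop_viommu_op op = { 0 };

    /* 1. Wait for DMA requests that were already translated and handed
     *    back to virtual device models to complete. */
    xen_viommu_drain_inflight_dma(s);                /* hypothetical */

    /* 2. Tell the vIOMMU in Xen that draining is done, so it can clear
     *    the IVT bit the guest is polling. */
    op.cmd = XEN_DMOP_viommu_dma_drain_completed;
    op.domid = s->domid;
    op.viommu_id = s->viommu_id;
    xen_viommu_dmop(&op);                            /* hypothetical wrapper */
}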

So the basic idea is to have vIOMMU implement invalidation requests
as unhandled emulation when memory draining is specified, which
results in an io request sent to Qemu to enable specific draining for
virtual devices. Then why do you need a separate thread and a
hypercall to notify the hypervisor? Once the Qemu xen-viommu
wrapper completes emulation of the invalidation request, the standard
io completion flow will resume back to the Xen hypervisor and unblock
the vcpu...

> 
> 
> 2) For physical PCI device
> DMA operations go through the physical IOMMU directly, and an IO page
> table for IOVA->HPA should be loaded into the physical IOMMU. When the
> guest updates the l2 IO page table pointer field in the context entry,
> it provides an IO page table for IOVA->GPA. The vIOMMU needs to shadow
> the l2 IO page table, translate GPA->HPA and write the shadow page table
> (IOVA->HPA) pointer into the l2 page-table pointer in the context entry
> of the physical IOMMU. The IOMMU driver invalidates the associated page
> after changing the l2 IO page table when the caching mode bit is set in
> the capability register. We can use this to shadow the IO page table.
> 
> Currently all PCI devices in the same hvm domain share one IO page table
> (GPA->HPA) in the physical IOMMU driver of Xen. To support vIOMMU l2
> translation, the IOMMU driver needs to support multiple address
> spaces per device entry: use the existing IO page table (GPA->HPA)
> by default and switch to the shadow IO page table (IOVA->HPA) when the
> vIOMMU l2 translation function is enabled. These changes will not affect
> the current P2M logic.
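
A minimal sketch of the per-page shadowing step described above, assuming
512 64-bit entries per page-table page and the VT-d convention that bits
0/1 of an l2 PTE are the read/write permissions; viommu_gpa_to_hpa() is a
hypothetical lookup into the domain's P2M:

#include <stdint.h>

#define PTES_PER_PAGE   512
#define PTE_ADDR_MASK   0x000ffffffffff000ULL

uint64_t viommu_gpa_to_hpa(uint32_t domid, uint64_t gpa);   /* hypothetical */

/* Rewrite one guest page-table page (IOVA->GPA) into its shadow
 * counterpart (IOVA->HPA), preserving permission/attribute bits. */
static void shadow_l2_pt_page(uint32_t domid, const uint64_t *guest_pt,
                              uint64_t *shadow_pt)
{
    for (unsigned int i = 0; i < PTES_PER_PAGE; i++) {
        uint64_t gpte = guest_pt[i];

        if (!(gpte & 3)) {            /* neither read nor write allowed */
            shadow_pt[i] = 0;
            continue;
        }

        uint64_t hpa = viommu_gpa_to_hpa(domid, gpte & PTE_ADDR_MASK);
        shadow_pt[i] = (gpte & ~PTE_ADDR_MASK) | (hpa & PTE_ADDR_MASK);
    }
}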
> 
> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to the vlapic via vIOAPIC and vMSI. We need to add interrupt remapping
> hooks in vmsi_deliver() and ioapic_deliver() to find the target vlapic
> according to the interrupt remapping table.
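
A rough sketch of such a hook, using a deliberately simplified in-memory
IRTE representation rather than the 128-bit hardware layout;
viommu_get_irte() and vlapic_set_irq_by_id() are hypothetical helpers:

#include <stdbool.h>
#include <stdint.h>

struct domain;

struct viommu_irte {
    bool     present;
    uint8_t  vector;
    uint32_t dest_apic_id;   /* 32 bits, so >255 vcpus can be addressed */
};

int viommu_get_irte(struct domain *d, uint16_t index,
                    struct viommu_irte *irte);               /* hypothetical */
void vlapic_set_irq_by_id(struct domain *d, uint32_t apic_id,
                          uint8_t vector);                   /* hypothetical */

/* Called from vmsi_deliver()/ioapic_deliver() when the interrupt uses the
 * remappable format and carries an interrupt index. */
static int viommu_remap_and_deliver(struct domain *d, uint16_t index)
{
    struct viommu_irte irte;

    if (viommu_get_irte(d, index, &irte) || !irte.present)
        return -1;    /* blocked or invalid entry: raise a fault instead */

    vlapic_set_irq_by_id(d, irte.dest_apic_id, irte.vector);
    return 0;
}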
> 
> 
> 3.4 l1 translation
> To enable l1 translation in the guest:
> 1) The Xen IOMMU driver enables nested translation mode.
> 2) Shadow the guest's l1 translation root table (PASID table pointer)
> into the pIOMMU's context entry.
> 
> When pIOMMU nested translation is enabled, any address generated by l1
> translation is used as the input address for nesting with l2
> translation. That means the pIOMMU will translate GPA->HPA during the
> guest's l1 translation, so the pIOMMU needs to enable both l1 and l2
> translation in nested translation mode (GVA->GPA->HPA) for the
> passthrough device. The guest's l1 translation root table can be
> directly written into the pIOMMU context entry.
> 
> All of this is handled in the hypervisor and no interactions with Qemu
> are required.

It looks like you do cover SVM virtualization here... then please include
a design overview for it in the earlier section.

> 
> 3.5 Implementation consideration
> The VT-d spec doesn't define a capability bit for l2 translation.
> Architecturally there is no way to tell the guest that the l2 translation
> capability is not available. When the Linux Intel IOMMU driver enables l2
> translation, it panics if the enabling fails.
> 
> There is a kernel parameter "intel_iommu=on" and a Kconfig option
> CONFIG_INTEL_IOMMU_DEFAULT_ON which control the l2 translation function.
> When they aren't set, the l2 translation function will not be enabled by
> the IOMMU driver even if some vIOMMU registers show l2 translation as
> available. In the meantime, the irq remapping function can still work to
> support >255 vcpus.
> 
> We checked that the RHEL, SLES, Oracle Linux and Ubuntu distributions
> don't set the kernel parameter or select the Kconfig option. So it's
> still safe to emulate interrupt remapping first, with some capability
> bits (e.g. SAGAW of the Capability Register) related to l2 translation
> reported, for >255 vcpus support without l2 translation emulation.
> 
> Showing l2 capability bits is to make sure the IOMMU driver parses the
> ACPI DMAR tables successfully, because the IOMMU driver accesses these
> bits while reading the ACPI tables. Otherwise, the IOMMU instance will be
> freed on failure.

You said no capability bit for L2 translation. But here you say
"showing l2 capability bits"...

> 
> If someone adds the "intel_iommu=on" kernel parameter manually, the IOMMU
> driver will panic the guest because it can't enable the DMA remapping
> function via the gcmd register: the "Translation Enable Status" bit in
> the gsts register is never set by the vIOMMU. This reflects the actual
> vIOMMU status, i.e. that there is no l2 translation support, and warns
> the user not to enable l2 translation.

The rationale of section 3.5 is confusing. Do you mean sth. like below?

- We can first do IRQ remapping, because DMA remapping (l1/l2) and 
IRQ remapping can be enabled separately according to VT-d spec. Enabling 
of DMA remapping will be first emulated as a failure, which may lead
to guest kernel panic if intel_iommu is turned on in the guest. But it's
not a big problem because major distributions have DMA remapping
disabled by default while IRQ remapping is enabled.

- For DMA remapping, likely you'll enable L2 translation first (there is
no capability bit) with L1 translation disabled (there is a SVM capability 
bit). 

If yes, maybe we can break this design into 3 parts too, so both
design review and implementation side can move forward step by
step?

> 
> 
> 
> 4 Qemu
> ===========================================================================
> 4.1 Qemu vIOMMU framework
> Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d
> and AMD IOMMU) and report it in the guest ACPI table. So on the Xen side,
> a dummy xen-vIOMMU wrapper is required to connect with the actual vIOMMU
> in Xen, especially for the l2 translation of virtual PCI devices, because
> the emulation of virtual PCI devices lives in Qemu. Qemu's vIOMMU
> framework provides a callback to deal with l2 translation when DMA
> operations of virtual PCI devices happen.
> 
> 
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping
> and Shared Virtual Memory) via hypercall.
> 
> 2) Create the vIOMMU in the Xen hypervisor via the new hypercall with the
> DRHU register address and desired capabilities as parameters. Destroy the
> vIOMMU when the VM is shut down.
> 
> 3) Virtual PCI device's l2 translation
> Qemu already provides a DMA translation hook, which is called when DMA
> translation for a virtual PCI device happens. The dummy xen-vIOMMU passes
> the device bdf and IOVA into the Xen hypervisor via the new iommu
> hypercall and gets back the translated GPA.
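
For illustration, here is a sketch of what that hook might look like for
the dummy xen-vIOMMU. The exact prototype of Qemu's translate callback
differs between Qemu versions, XenVIOMMUState and xen_viommu_dmop() are
hypothetical, and the permission check against is_write is omitted for
brevity; the xen_dmop_viommu_op fields follow the draft in 3.1:

static IOMMUTLBEntry xen_viommu_translate(MemoryRegion *mr, hwaddr addr,
                                          bool is_write)
{
    XenVIOMMUState *s = container_of(mr, XenVIOMMUState, iommu_mr);
    struct xen_dmop_viommu_op op = { 0 };
    IOMMUTLBEntry ret = { .perm = IOMMU_NONE };

    op.cmd = XEN_DMOP_viommu_dma_translation_for_vpdev;
    op.domid = s->domid;
    op.viommu_id = s->viommu_id;
    op.l2_translation.vsbdf = s->vsbdf;
    op.l2_translation.iova = addr;

    if (xen_viommu_dmop(&op))             /* hypothetical wrapper */
        return ret;                       /* no mapping: report a fault */

    ret.iova = addr & ~op.l2_translation.addr_mask;
    ret.translated_addr = op.l2_translation.translated_addr;
    ret.addr_mask = op.l2_translation.addr_mask;
    /* VIOMMU_RO/WO/RW in the draft happen to match IOMMU_RO/WO/RW. */
    ret.perm = op.l2_translation.permission;
    return ret;
}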
> 
> 
> 4.3 Q35 vs I440x
> VT-d is introduced with the Q35 chipset. A previous concern was that VT-d
> drivers assume VT-d only exists on Q35 and newer chipsets, so we would
> have to enable Q35 first. After experiments, Linux/Windows guests can
> boot up on the emulated I440x chipset with VT-d, and the VT-d driver
> enables the interrupt remapping function. So we can skip Q35 support and
> implement vIOMMU directly.
> 
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building ACPI tables for the guest OS, and the
> OS probes the IOMMU via the ACPI DMAR table. There are two ways to pass
> the DMAR table to hvmloader. Either way, hvmloader needs to know whether
> the vIOMMU is enabled and its capabilities in order to prepare the ACPI
> DMAR table for the guest OS.
> 
> 1) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> This solution is already present in the vNVDIMM design (4.3.1
> Building Guest ACPI Tables,
> http://www.gossamer-threads.com/lists/xen/devel/439766).
> 
> 
> 2) Build ACPI DMAR table in toolstack
> Now the toolstack can build the ACPI DMAR table according to the VM
> configuration and pass it to hvmloader via the xenstore ACPI PT channel.
> But the vIOMMU MMIO region is managed by Qemu and it needs to be
> populated into the DMAR table. We may hardcode an address in both Qemu
> and the toolstack and use the same address to create the vIOMMU and build
> the DMAR table.
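
To show how little vIOMMU-specific information the DMAR table actually
needs (mainly the register base address and device scope), here is a
sketch of the relevant structures. The layout follows the ACPI/VT-d DMAR
and DRHD definitions; VIOMMU_REG_BASE is an arbitrary example value that
would have to match whatever address Qemu uses when creating the vIOMMU:

#include <stdint.h>

#define VIOMMU_REG_BASE   0xfed90000ULL    /* example only */

struct acpi_header {                       /* standard 36-byte ACPI header */
    char     signature[4];                 /* "DMAR" */
    uint32_t length;
    uint8_t  revision;
    uint8_t  checksum;
    char     oem_id[6];
    char     oem_table_id[8];
    uint32_t oem_revision;
    char     creator_id[4];
    uint32_t creator_revision;
} __attribute__((packed));

struct acpi_dmar {
    struct acpi_header header;
    uint8_t  host_address_width;           /* MGAW - 1 */
    uint8_t  flags;                        /* bit 0: INTR_REMAP */
    uint8_t  reserved[10];
    /* remapping structures (DRHD, ...) follow */
} __attribute__((packed));

struct acpi_dmar_drhd {
    uint16_t type;                         /* 0 = DRHD */
    uint16_t length;
    uint8_t  flags;                        /* bit 0: INCLUDE_PCI_ALL */
    uint8_t  reserved;
    uint16_t segment;
    uint64_t register_base_address;        /* VIOMMU_REG_BASE goes here */
    /* device scope entries follow when INCLUDE_PCI_ALL is not set */
} __attribute__((packed));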
> 
