Re: [Xen-devel] Xen virtual IOMMU high level design doc
On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> Hi All:
> The following is our Xen vIOMMU high level design for detailed
> discussion. Please have a look; comments are very much appreciated.
> This design doesn't cover the changes needed when the root port is
> moved into the hypervisor. We may design that later.

Hi,

I have a few questions.

If I understand correctly, you'll be emulating an Intel IOMMU in Xen,
so guests will essentially create Intel IOMMU style page tables.

If we were to use this on Xen/ARM, we would likely be modelling an ARM
SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
hypervisor ops for QEMU's xen dummy IOMMU queries would not really be
used. Do I understand this correctly?

Has a platform-agnostic PV-IOMMU been considered to support two-stage
translation (i.e. VFIO in the guest)? Perhaps that would hurt map/unmap
performance too much?

Best regards,
Edgar

>
>
> Content:
> ===============================================================================
> 1. Motivation of vIOMMU
> 1.1 Enable more than 255 vcpus
> 1.2 Support VFIO-based user space driver
> 1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
> 2.1 2nd level translation overview
> 2.2 Interrupt remapping overview
> 3. Xen hypervisor
> 3.1 New vIOMMU hypercall interface
> 3.2 2nd level translation
> 3.3 Interrupt remapping
> 3.4 1st level translation
> 3.5 Implementation consideration
> 4. Qemu
> 4.1 Qemu vIOMMU framework
> 4.2 Dummy xen-vIOMMU driver
> 4.3 Q35 vs. i440x
> 4.4 Report vIOMMU to hvmloader
>
>
> 1 Motivation for Xen vIOMMU
> ===============================================================================
> 1.1 Enable more than 255 vcpu support
> HPC virtualization requires support for more than 255 vcpus in a
> single VM to meet parallel computing requirements. Supporting more
> than 255 vcpus requires the interrupt remapping capability of the
> vIOMMU to deliver interrupts when the number of vcpus exceeds 255;
> otherwise a Linux guest fails to boot with more than 255 vcpus.
>
>
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> This relies on the 2nd level translation capability (IOVA->GPA) of the
> vIOMMU. The pIOMMU's 2nd level becomes a shadow structure of the
> vIOMMU's to isolate DMA requests initiated by the user space driver.
>
>
> 1.3 Support guest SVM (Shared Virtual Memory)
> This relies on the 1st level translation table capability (GVA->GPA)
> of the vIOMMU. The pIOMMU needs to enable both 1st level and 2nd level
> translation in nested mode (GVA->GPA->HPA) for the passthrough device.
> IGD passthrough is the main usage today (to support the OpenCL 2.0 SVM
> feature). In the future SVM might be used by other I/O devices too.
>
>
> 2. Xen vIOMMU Architecture
> ================================================================================
>
> * The vIOMMU will live inside the Xen hypervisor for the following
>   reasons:
>   1) Avoid round trips between Qemu and the Xen hypervisor
>   2) Ease of integration with the rest of the hypervisor
>   3) HVMlite/PVH doesn't use Qemu
> * A dummy xen-vIOMMU in Qemu acts as a wrapper around the new
>   hypercall to create/destroy the vIOMMU in the hypervisor and to
>   handle 2nd level translation for virtual PCI devices.
>
> 2.1 2nd level translation overview
> For a virtual PCI device, the dummy xen-vIOMMU performs the
> translation in Qemu via the new hypercall.
>
> For a physical PCI device, the vIOMMU in the hypervisor shadows the IO
> page table from IOVA->GPA to IOVA->HPA and loads the shadow page table
> into the physical IOMMU.
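>
> As a purely conceptual illustration of the shadowing step (the helper
> names used here -- guest_sl_lookup(), p2m_gpa_to_hpa(), shadow_sl_map()
> -- plus struct viommu and the 4KB-page, single-pass walk are
> assumptions of this sketch only, not the proposed implementation; see
> section 3.2 below for the actual plan):
>
>     /* Rebuild the IOVA->HPA shadow table after the guest updates its
>      * second-level (IOVA->GPA) page-table pointer. */
>     static int shadow_second_level(struct domain *d, struct viommu *v)
>     {
>         uint64_t iova;
>
>         for ( iova = 0; iova < v->iova_limit; iova += PAGE_SIZE )
>         {
>             uint64_t gpa, hpa;
>
>             /* Walk the guest's IOVA->GPA table built by its driver. */
>             if ( guest_sl_lookup(v, iova, &gpa) )
>                 continue;                 /* IOVA not mapped by guest */
>
>             /* Translate GPA->HPA through the domain's P2M. */
>             if ( p2m_gpa_to_hpa(d, gpa, &hpa) )
>                 return -EINVAL;           /* invalid GPA from guest */
>
>             /* Install IOVA->HPA into the shadow table that will be
>              * plugged into the physical IOMMU context entry. */
>             shadow_sl_map(v, iova, hpa);
>         }
>
>         return 0;
>     }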
> The following diagram shows the 2nd level translation architecture.
>
> +----------------------------------------------------------+
> |Qemu                       +----------------+              |
> |                           |    Virtual     |              |
> |                           |   PCI device   |              |
> |                           |                |              |
> |                           +--------+-------+              |
> |                                    |DMA                   |
> |                                    v                      |
> |  +--------------------+  Request   +----------------+     |
> |  |                    +<-----------+                |     |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |     |
> |  |                    +----------->+                |     |
> |  +---------+----------+            +--------+-------+     |
> |            |                                |             |
> |            |Hypercall                       |             |
> +------------+--------------------------------+-------------+
> |Hypervisor  |                                |             |
> |            |                                |             |
> |            v                                |             |
> |     +------+------+                         |             |
> |     |   vIOMMU    |                         |             |
> |     +------+------+                         |             |
> |            |                                |             |
> |            v                                |             |
> |     +------+------+                         |             |
> |     | IOMMU driver|                         |             |
> |     +------+------+                         |             |
> |            |                                |             |
> +------------+--------------------------------+-------------+
> |HW          v                                v             |
> |     +------+------+                  +------+------+      |
> |     |    IOMMU    +----------------->+   Memory    |      |
> |     +------+------+                  +-------------+      |
> |            ^                                              |
> |            |                                              |
> |     +------+------+                                       |
> |     | PCI Device  |                                       |
> |     +-------------+                                       |
> +----------------------------------------------------------+
>
> 2.2 Interrupt remapping overview
> Interrupts from both virtual and physical devices are delivered to the
> vLAPIC via the vIOAPIC and vMSI paths. The vIOMMU remaps the interrupt
> during this delivery.
>
> +----------------------------------------------------+
> |Qemu                       |VM                       |
> |                           |   +----------------+    |
> |                           |   | Device driver  |    |
> |                           |   +--------+-------+    |
> |                           |            ^            |
> |  +----------------+       |   +--------+-------+    |
> |  | Virtual device |       |   | IRQ subsystem  |    |
> |  +-------+--------+       |   +--------+-------+    |
> |          |                |            ^            |
> |          |                |            |            |
> +---------------------------+------------------------+
> |hypervisor|                |            |VIRQ        |
> |          |                |   +--------+--------+   |
> |          |                |   |     vLAPIC      |   |
> |          |                |   +--------+--------+   |
> |          |                |            ^            |
> |          |                |            |            |
> |          |                |   +--------+--------+   |
> |          |                |   |     vIOMMU      |   |
> |          |                |   +--------+--------+   |
> |          |                |            ^            |
> |          |                |            |            |
> |          |                |   +--------+--------+   |
> |          |                |   |  vIOAPIC/vMSI   |   |
> |          |                |   +----+----+-------+   |
> |          |                |        ^    ^           |
> |          +--------------------->   |    |           |
> |                           |             |           |
> +----------------------------------------------------+
>  HW                                       |IRQ
>                                  +--------+----------+
>                                  |    PCI Device     |
>                                  +-------------------+
>
>
>
> 3 Xen hypervisor
> ==========================================================================
>
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> 1) Definition of "struct xen_sysctl_viommu_op" as the new hypercall
> parameter.
>
> struct xen_sysctl_viommu_op {
>     u32 cmd;
>     u32 domid;
>     union {
>         struct {
>             u32 capabilities;
>         } query_capabilities;
>         struct {
>             u32 capabilities;
>             u64 base_address;
>         } create_iommu;
>         struct {
>             u8  bus;
>             u8  devfn;
>             u64 iova;
>             u64 translated_addr;
>             u64 addr_mask;          /* Translation page size */
>             IOMMUAccessFlags permission;
>         } l2_translation;
>     };
> };
>
> typedef enum {
>     IOMMU_NONE = 0,
>     IOMMU_RO   = 1,
>     IOMMU_WO   = 2,
>     IOMMU_RW   = 3,
> } IOMMUAccessFlags;
>
>
> Definition of vIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability           0
> #define XEN_SYSCTL_viommu_create                     1
> #define XEN_SYSCTL_viommu_destroy                    2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev  3
>
> Definition of vIOMMU capabilities:
> #define XEN_VIOMMU_CAPABILITY_1st_level_translation  (1 << 0)
> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation  (1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1 << 2)
>
>
> 2) Design for subops
> - XEN_SYSCTL_viommu_query_capability
>   Get the vIOMMU capabilities (1st/2nd level translation and interrupt
>   remapping).
>
> - XEN_SYSCTL_viommu_create
>   Create a vIOMMU in the Xen hypervisor with dom_id, capabilities and
>   register base address as parameters.
>
> - XEN_SYSCTL_viommu_destroy
>   Destroy the vIOMMU in the Xen hypervisor with dom_id as parameter.
>
> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>   Translate an IOVA to a GPA for the specified virtual PCI device,
>   given the dom_id, the PCI device's bdf and the IOVA; the hypervisor
>   returns the translated GPA, address mask and access permission.
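>
> As a rough illustration of how the dummy xen-vIOMMU in Qemu might
> drive these subops (issue_viommu_sysctl() below stands in for whatever
> libxc plumbing ends up carrying XEN_SYSCTL_viommu_op to the
> hypervisor; it and the error handling are assumptions of this sketch,
> not part of the proposal):
>
>     /* Query the capabilities and, if interrupt remapping is there,
>      * create a vIOMMU for the domain at the given register base. */
>     static int try_create_viommu(uint32_t domid, uint64_t reg_base)
>     {
>         struct xen_sysctl_viommu_op op = {
>             .cmd   = XEN_SYSCTL_viommu_query_capability,
>             .domid = domid,
>         };
>         int rc;
>
>         rc = issue_viommu_sysctl(&op);        /* hypothetical helper */
>         if ( rc )
>             return rc;
>
>         if ( !(op.query_capabilities.capabilities &
>                XEN_VIOMMU_CAPABILITY_interrupt_remapping) )
>             return -ENODEV;
>
>         op.cmd = XEN_SYSCTL_viommu_create;
>         op.create_iommu.capabilities = op.query_capabilities.capabilities;
>         op.create_iommu.base_address = reg_base;  /* DRHD register base */
>
>         return issue_viommu_sysctl(&op);      /* hypothetical helper */
>     }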
> 3.2 2nd level translation
> 1) For a virtual PCI device
> The dummy xen-vIOMMU in Qemu translates the IOVA into the target GPA
> via the new hypercall when a DMA operation happens.
>
> 2) For a physical PCI device
> DMA operations go through the physical IOMMU directly, so an IO page
> table for IOVA->HPA has to be loaded into the physical IOMMU. When the
> guest updates the Second-level Page-table Pointer field, it provides
> an IO page table for IOVA->GPA. The vIOMMU has to shadow this 2nd
> level translation table, translate GPA->HPA, and write the shadow page
> table (IOVA->HPA) into the Second-level Page-table Pointer of the
> physical IOMMU's context entry.
>
> Today, all PCI devices in the same HVM domain share one IO page table
> (GPA->HPA) in Xen's physical IOMMU driver. To support the vIOMMU's 2nd
> level translation, the IOMMU driver needs to support multiple address
> spaces per device entry: it uses the existing IO page table (GPA->HPA)
> by default and switches to the shadow IO page table (IOVA->HPA) once
> the 2nd level translation function is enabled. These changes will not
> affect the current P2M logic.
>
> 3.3 Interrupt remapping
> Interrupts from virtual and physical devices are delivered to the
> vLAPIC via the vIOAPIC and vMSI paths. Interrupt remapping hooks need
> to be added in vmsi_deliver() and ioapic_deliver() to find the target
> vLAPIC according to the interrupt remapping table. The diagram in
> section 2.2 shows the logic.
>
>
> 3.4 1st level translation
> When nested translation is enabled, any address generated by
> first-level translation is used as the input address for nesting with
> second-level translation. The physical IOMMU needs to enable both 1st
> level and 2nd level translation in nested translation mode
> (GVA->GPA->HPA) for the passthrough device.
>
> The VT-d context entry points to the guest's 1st level translation
> table, which is nest-translated by the 2nd level translation table, so
> it can be linked directly into the context entry of the physical
> IOMMU.
>
> To enable 1st level translation in the VM:
> 1) The Xen IOMMU driver enables nested translation mode.
> 2) The GPA root of the guest's 1st level translation table is written
>    into the context entry of the physical IOMMU.
>
> All handling is in the hypervisor; no interaction with Qemu is needed.
>
>
> 3.5 Implementation consideration
> The Linux Intel IOMMU driver will fail to load without 2nd level
> translation support, even if interrupt remapping and 1st level
> translation are available. This means 2nd level translation has to be
> enabled before the other functions.
>
>
> 4 Qemu
> ==============================================================================
> 4.1 Qemu vIOMMU framework
> Qemu has a framework for creating a virtual IOMMU (e.g. virtual Intel
> VT-d or AMD IOMMU) and reporting it in the guest ACPI tables. On the
> Xen side, a dummy xen-vIOMMU wrapper is therefore required to connect
> to the actual vIOMMU in Xen. This is needed especially for 2nd level
> translation of virtual PCI devices, because the emulation of virtual
> PCI devices lives in Qemu. Qemu's vIOMMU framework provides a callback
> to handle 2nd level translation when DMA operations of virtual PCI
> devices happen; a sketch of such a callback follows below.
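>
> As an illustration only: the dummy xen-vIOMMU's translate callback
> could plug into that framework roughly as below. This is written
> against the MemoryRegionIOMMUOps/IOMMUTLBEntry interface Qemu exposes
> at the time of writing; XenVIOMMUState and xen_viommu_dma_translate()
> (a wrapper around the XEN_SYSCTL_viommu_dma_translation_for_vpdev
> subop from section 3.1) are hypothetical names, not existing code.
>
>     /* Hypothetical translate hook for the dummy xen-vIOMMU. Qemu
>      * calls it when a virtual PCI device does a DMA access through
>      * the IOMMU memory region; the work is done by the new hypercall. */
>     static IOMMUTLBEntry xen_viommu_translate(MemoryRegion *iommu,
>                                               hwaddr addr, bool is_write)
>     {
>         XenVIOMMUState *s = container_of(iommu, XenVIOMMUState, iommu_mr);
>         IOMMUTLBEntry entry = {
>             .target_as = &address_space_memory,
>             .iova      = addr,
>             .perm      = IOMMU_NONE,          /* default: fault */
>         };
>         uint64_t gpa, mask;
>         IOMMUAccessFlags perm;
>
>         /* Ask Xen to translate IOVA->GPA for this device (bdf). */
>         if (xen_viommu_dma_translate(s->domid, s->bus, s->devfn, addr,
>                                      &gpa, &mask, &perm)) {
>             return entry;                     /* not mapped: fault */
>         }
>
>         entry.translated_addr = gpa;
>         entry.addr_mask       = mask;
>         entry.perm            = perm;
>         return entry;
>     }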
> 4.2 Dummy xen-vIOMMU driver
> 1) Query the vIOMMU capabilities (e.g. DMA translation, interrupt
> remapping and Shared Virtual Memory) via the hypercall.
>
> 2) Create the vIOMMU in the Xen hypervisor via the new hypercall, with
> the DRHD register base address and the desired capabilities as
> parameters. Destroy the vIOMMU when the VM is shut down.
>
> 3) Virtual PCI device's 2nd level translation
> Qemu already provides a DMA translation hook, which is called when a
> DMA translation for a virtual PCI device happens. The dummy xen-vIOMMU
> passes the device bdf and the IOVA to the Xen hypervisor via the new
> iommu hypercall and gets back the translated GPA.
>
>
> 4.3 Q35 vs. i440x
> VT-d was introduced with the Q35 chipset. The previous concern was
> that IOMMU drivers assume VT-d only exists on Q35 and newer chipsets,
> which would mean Q35 emulation has to be enabled first.
>
> Having consulted Linux/Windows IOMMU driver experts, we learned that
> these drivers don't make such an assumption, so we may skip the Q35
> implementation and emulate the vIOMMU on the i440x chipset. KVM
> already has vIOMMU support with DMA translation and interrupt
> remapping for virtual PCI devices. We are using KVM to experiment with
> adding a vIOMMU on i440x and testing Linux/Windows guests, and will
> report back when we have results.
>
>
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building the ACPI tables for the guest OS,
> and the OS probes the IOMMU via the ACPI DMAR table. So hvmloader
> needs to know whether the vIOMMU is enabled and what its capabilities
> are in order to prepare the ACPI DMAR table for the guest OS.
>
> There are three ways to do that:
> 1) Extend struct hvm_info_table with new fields to pass vIOMMU
>    information to hvmloader. This requires a new xc interface so that
>    Qemu can use struct hvm_info_table.
>
> 2) Pass the vIOMMU information to hvmloader via XenStore.
>
> 3) Build the ACPI DMAR table in Qemu and pass it to hvmloader via
>    XenStore. This solution is already present in the vNVDIMM design
>    (4.3.1 Building Guest ACPI Tables,
>    http://www.gossamer-threads.com/lists/xen/devel/439766).
>
> The third option seems the cleanest: hvmloader doesn't need to deal
> with any vIOMMU specifics and just passes the DMAR table through to
> the guest OS. Everything vIOMMU specific is handled in the dummy
> xen-vIOMMU driver. A sketch of the DMAR structures involved follows
> below.
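>
> For reference, a minimal sketch of the DMAR layout Qemu would build
> for option 3 (field layout per the VT-d specification; the
> single-DRHD, INCLUDE_PCI_ALL arrangement and the struct names are
> assumptions of this sketch, not the final design):
>
>     struct acpi_table_header {            /* standard 36-byte header  */
>         char     signature[4];            /* "DMAR"                   */
>         uint32_t length;                  /* header + DRHD structures */
>         uint8_t  revision;
>         uint8_t  checksum;
>         char     oem_id[6];
>         char     oem_table_id[8];
>         uint32_t oem_revision;
>         char     asl_compiler_id[4];
>         uint32_t asl_compiler_revision;
>     } __attribute__((packed));
>
>     struct acpi_dmar {
>         struct acpi_table_header header;
>         uint8_t  host_address_width;      /* MGAW - 1                 */
>         uint8_t  flags;                   /* bit 0: INTR_REMAP        */
>         uint8_t  reserved[10];
>         /* remapping structures (DRHD, ...) follow                    */
>     } __attribute__((packed));
>
>     struct acpi_dmar_drhd {               /* one hardware unit        */
>         uint16_t type;                    /* 0 = DRHD                 */
>         uint16_t length;                  /* 16 + device scope size   */
>         uint8_t  flags;                   /* bit 0: INCLUDE_PCI_ALL   */
>         uint8_t  reserved;
>         uint16_t segment;                 /* PCI segment number       */
>         uint64_t register_base_address;   /* base passed at create    */
>     } __attribute__((packed));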
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel