Re: [Xen-devel] Xen virtual IOMMU high level design doc
On Thu, Sep 15, 2016 at 10:22:36PM +0800, Lan, Tianyu wrote:
> Hi Andrew:
> Sorry to bother you. To make sure we are heading in the right direction,
> it's better to get feedback from you before we go a step further. Could
> you have a look? Thanks.
>
> On 8/17/2016 8:05 PM, Lan, Tianyu wrote:
> > Hi All:
> > The following is our Xen vIOMMU high level design for detailed
> > discussion. Please have a look. Your comments are much appreciated.
> > This design doesn't cover the changes needed when the root port is
> > moved into the hypervisor. We may design that later.
> >
> >
> > Content:
> > ===============================================================================
> >
> > 1. Motivation of vIOMMU
> >     1.1 Enable more than 255 vcpus
> >     1.2 Support VFIO-based user space driver
> >     1.3 Support guest Shared Virtual Memory (SVM)
> > 2. Xen vIOMMU Architecture
> >     2.1 2nd level translation overview
> >     2.2 Interrupt remapping overview
> > 3. Xen hypervisor
> >     3.1 New vIOMMU hypercall interface
> >     3.2 2nd level translation
> >     3.3 Interrupt remapping
> >     3.4 1st level translation
> >     3.5 Implementation considerations
> > 4. Qemu
> >     4.1 Qemu vIOMMU framework
> >     4.2 Dummy xen-vIOMMU driver
> >     4.3 Q35 vs. i440x
> >     4.4 Report vIOMMU to hvmloader
> >
> >
> > 1 Motivation for Xen vIOMMU
> > ===============================================================================
> >
> > 1.1 Enable more than 255 vcpu support
> > HPC virtualization requires support for more than 255 vcpus in a single
> > VM to meet parallel computing requirements. Supporting more than 255
> > vcpus requires the interrupt remapping capability to be present on the
> > vIOMMU in order to deliver interrupts to more than 255 vcpus; otherwise
> > a Linux guest fails to boot with more than 255 vcpus.
> >
> >
> > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> > This relies on the 2nd level translation capability (IOVA->GPA) of the
> > vIOMMU. The pIOMMU 2nd level becomes a shadowing structure of the
> > vIOMMU to isolate DMA requests initiated by the user space driver.
> >
> >
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > This relies on the 1st level translation table capability (GVA->GPA) of
> > the vIOMMU. The pIOMMU needs to enable both 1st level and 2nd level
> > translation in nested mode (GVA->GPA->HPA) for the passthrough device.
> > IGD passthrough is the main usage today (to support the OpenCL 2.0 SVM
> > feature). In the future SVM might be used by other I/O devices too.
> >
> > 2. Xen vIOMMU Architecture
> > ================================================================================
> >
> > * The vIOMMU will live inside the Xen hypervisor for the following reasons:
> >     1) Avoid round trips between Qemu and the Xen hypervisor
> >     2) Ease of integration with the rest of the hypervisor
> >     3) HVMlite/PVH doesn't use Qemu
> > * A dummy xen-vIOMMU in Qemu acts as a wrapper around the new hypercall
> >   to create/destroy the vIOMMU in the hypervisor and to handle virtual
> >   PCI devices' 2nd level translation.
> >
> > 2.1 2nd level translation overview
> > For a virtual PCI device, the dummy xen-vIOMMU does the translation in
> > Qemu via the new hypercall, as sketched below.
> >
> > For a physical PCI device, the vIOMMU in the hypervisor shadows the IO
> > page table from IOVA->GPA to IOVA->HPA and loads that page table into
> > the physical IOMMU.
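As an illustration only (not part of the design doc), the Qemu-side path
for a virtual device could look roughly like the sketch below. The types
mirror Qemu's memory-API IOMMUTLBEntry but are redefined here to keep
the sketch self-contained, and xc_viommu_translate_vpdev() is a
hypothetical wrapper around the XEN_SYSCTL_viommu_dma_translation_for_vpdev
subop defined in section 3.1:

    /*
     * Illustrative only: how the dummy xen-vIOMMU's translate callback
     * might forward a virtual device's IOVA to Xen and turn the result
     * into an IOMMUTLBEntry.  xc_viommu_translate_vpdev() and struct
     * viommu_result are placeholders, not an existing libxc API.
     */
    #include <stdint.h>

    typedef uint64_t hwaddr;

    typedef enum {
        IOMMU_NONE = 0,
        IOMMU_RO   = 1,
        IOMMU_WO   = 2,
        IOMMU_RW   = 3,
    } IOMMUAccessFlags;

    typedef struct IOMMUTLBEntry {
        hwaddr iova;             /* input IOVA from the virtual device */
        hwaddr translated_addr;  /* GPA returned by the vIOMMU */
        hwaddr addr_mask;        /* e.g. 0xfff for a 4K mapping */
        IOMMUAccessFlags perm;
    } IOMMUTLBEntry;

    struct viommu_result {
        uint64_t translated_addr;
        uint64_t addr_mask;      /* same convention as above */
        uint32_t permission;
    };

    /* Placeholder wrapper for the proposed translation subop. */
    int xc_viommu_translate_vpdev(uint32_t domid, uint8_t bus,
                                  uint8_t devfn, uint64_t iova,
                                  struct viommu_result *res);

    static IOMMUTLBEntry xen_viommu_translate(uint32_t domid, uint8_t bus,
                                              uint8_t devfn, hwaddr iova)
    {
        struct viommu_result res;
        IOMMUTLBEntry entry = { .iova = iova, .perm = IOMMU_NONE };

        /* Ask the vIOMMU in Xen to translate IOVA -> GPA for this vBDF. */
        if (xc_viommu_translate_vpdev(domid, bus, devfn, iova, &res))
            return entry;                 /* fault: no valid mapping */

        entry.translated_addr = res.translated_addr & ~res.addr_mask;
        entry.addr_mask       = res.addr_mask;
        entry.perm            = (IOMMUAccessFlags)res.permission;
        return entry;
    }

The same mechanism is described in more detail from the Qemu side in
section 4.2 below.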
> >
> > The following diagram shows the 2nd level translation architecture.
> >
> > [Diagram: 2nd level translation architecture. In Qemu, a virtual PCI
> >  device's DMA request is sent to the dummy xen-vIOMMU, which returns
> >  the target GPA used to access the emulated memory region; the dummy
> >  xen-vIOMMU obtains the translation from the vIOMMU in the hypervisor
> >  via the new hypercall. In the hypervisor, the vIOMMU sits on top of
> >  the IOMMU driver. In hardware, the physical IOMMU translates the PCI
> >  device's DMA accesses to memory.]
> >
> > 2.2 Interrupt remapping overview
> > Interrupts from virtual devices and physical devices will be delivered
> > to the vLAPIC from the vIOAPIC and vMSI. The vIOMMU remaps interrupts
> > during this procedure, as sketched below.
> >
> > [Diagram: interrupt remapping path. A hardware IRQ from a physical PCI
> >  device, or an interrupt from a virtual device emulated in Qemu, is
> >  delivered to the vIOAPIC/vMSI layer in the hypervisor, remapped by
> >  the vIOMMU, injected into the vLAPIC, and then raised as a virtual
> >  IRQ to the guest's IRQ subsystem and device driver.]
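To make the remapping step concrete, here is a deliberately simplified
sketch of the lookup that the hooks described in section 3.3 would
perform: fetch the interrupt remap table entry selected by the
remappable-format interrupt, then inject the programmed vector into the
target vLAPIC. The IRTE layout is reduced to the fields needed here, the
derivation of the table index from the MSI address/data is omitted, and
both helpers are hypothetical, not existing Xen functions:

    /*
     * Conceptual sketch of the remapping step performed before interrupt
     * delivery.  struct virte is a simplified remap table entry;
     * viommu_read_irte() and vlapic_set_irq_for_dest() are hypothetical.
     */
    #include <errno.h>
    #include <stdint.h>

    struct domain;                   /* opaque here */

    struct virte {                   /* simplified remap table entry */
        uint8_t  present;
        uint8_t  vector;             /* vector to inject */
        uint32_t dest_id;            /* target vLAPIC (APIC ID) */
    };

    /* Hypothetical: fetch entry 'index' from the guest's remap table. */
    int viommu_read_irte(struct domain *d, uint32_t index,
                         struct virte *irte);
    /* Hypothetical: inject 'vector' into the vLAPIC with that APIC ID. */
    void vlapic_set_irq_for_dest(struct domain *d, uint32_t dest_id,
                                 uint8_t vector);

    /* Remap one interrupt; 'index' was recovered from the remappable
     * MSI address/data or from the IOAPIC RTE (not shown here). */
    static int viommu_remap_and_deliver(struct domain *d, uint32_t index)
    {
        struct virte irte;

        if (viommu_read_irte(d, index, &irte))
            return -EINVAL;          /* index out of range / no table */
        if (!irte.present)
            return -ENOENT;          /* unmapped: would raise a fault */

        vlapic_set_irq_for_dest(d, irte.dest_id, irte.vector);
        return 0;
    }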
> >
> >
> > 3 Xen hypervisor
> > ==========================================================================
> >
> > 3.1 New hypercall XEN_SYSCTL_viommu_op
> > 1) Definition of "struct xen_sysctl_viommu_op" as the new hypercall
> > parameter:
> >
> > struct xen_sysctl_viommu_op {
> >     u32 cmd;
> >     u32 domid;
> >     union {
> >         struct {
> >             u32 capabilities;
> >         } query_capabilities;
> >         struct {
> >             u32 capabilities;
> >             u64 base_address;
> >         } create_iommu;
> >         struct {
> >             u8  bus;
> >             u8  devfn;
> >             u64 iova;
> >             u64 translated_addr;
> >             u64 addr_mask;      /* Translation page size */
> >             IOMMUAccessFlags permission;
> >         } second_level_translation;
> >     };
> > };
> >
> > typedef enum {
> >     IOMMU_NONE = 0,
> >     IOMMU_RO   = 1,
> >     IOMMU_WO   = 2,
> >     IOMMU_RW   = 3,
> > } IOMMUAccessFlags;
> >
> >
> > Definition of vIOMMU subops:
> > #define XEN_SYSCTL_viommu_query_capability           0
> > #define XEN_SYSCTL_viommu_create                     1
> > #define XEN_SYSCTL_viommu_destroy                    2
> > #define XEN_SYSCTL_viommu_dma_translation_for_vpdev  3
> >
> > Definition of vIOMMU capabilities:
> > #define XEN_VIOMMU_CAPABILITY_1st_level_translation  (1 << 0)
> > #define XEN_VIOMMU_CAPABILITY_2nd_level_translation  (1 << 1)
> > #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1 << 2)
> >
> >
> > 2) Design of the subops
> > - XEN_SYSCTL_viommu_query_capability
> >     Get the vIOMMU capabilities (1st/2nd level translation and
> >     interrupt remapping).
> >
> > - XEN_SYSCTL_viommu_create
> >     Create the vIOMMU in the Xen hypervisor with the domain id,
> >     capabilities and register base address as parameters.
> >
> > - XEN_SYSCTL_viommu_destroy
> >     Destroy the vIOMMU in the Xen hypervisor with the domain id as
> >     parameter.
> >
> > - XEN_SYSCTL_viommu_dma_translation_for_vpdev
> >     Translate an IOVA to a GPA for the specified virtual PCI device,
> >     given the domain id, the device's BDF and the IOVA; the hypervisor
> >     returns the translated GPA, address mask and access permission.
> >
> >
> > 3.2 2nd level translation
> > 1) For virtual PCI devices
> > The dummy xen-vIOMMU in Qemu translates the IOVA to the target GPA via
> > the new hypercall when a DMA operation happens.
> >
> > 2) For physical PCI devices
> > DMA operations go through the physical IOMMU directly, so an IO page
> > table for IOVA->HPA has to be loaded into the physical IOMMU. When the
> > guest updates the Second-level Page-table Pointer field, it provides
> > an IO page table for IOVA->GPA. The vIOMMU needs to shadow the 2nd
> > level translation table, translate GPA->HPA, and write the shadow page
> > table (IOVA->HPA) pointer into the Second-level Page-table Pointer of
> > the physical IOMMU's context entry, as sketched below.
> >
> > Currently all PCI devices in the same HVM domain share one IO page
> > table (GPA->HPA) in the physical IOMMU driver of Xen. To support the
> > 2nd level translation of the vIOMMU, the IOMMU driver needs to support
> > multiple address spaces per device entry: use the existing IO page
> > table (GPA->HPA) by default and switch to the shadow IO page table
> > (IOVA->HPA) when the 2nd level translation function is enabled. These
> > changes will not affect the current P2M logic.
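A rough sketch of the shadowing step described above, under the
assumption of hypothetical helpers: walk the guest's IOVA->GPA mappings,
translate each GPA to an HPA through the existing P2M, and install
IOVA->HPA entries into the shadow table that the pIOMMU context entry
will point to. None of the helper names below exist in Xen today:

    /*
     * Illustrative shadowing of a guest 2nd-level table (IOVA->GPA)
     * into a shadow table (IOVA->HPA).  guest_l2_next() walks the
     * guest's IO page table, p2m_gpa_to_hpa() stands in for the
     * existing GPA->HPA lookup, and shadow_set_entry() writes the
     * shadow table later referenced by the pIOMMU context entry.
     */
    #include <stdint.h>

    struct domain;                 /* opaque here */
    struct shadow_table;           /* opaque here */

    struct guest_l2_mapping {
        uint64_t iova;             /* start of mapping in device space */
        uint64_t gpa;              /* guest physical address mapped to */
        uint64_t size;             /* size of the mapping */
        unsigned int perm;         /* read/write permissions */
    };

    /* Hypothetical helpers; guest_l2_next() returns nonzero when the
     * walk is finished. */
    int guest_l2_next(struct domain *d, uint64_t gpa_root,
                      struct guest_l2_mapping *map /* in/out cursor */);
    uint64_t p2m_gpa_to_hpa(struct domain *d, uint64_t gpa);
    int shadow_set_entry(struct shadow_table *st, uint64_t iova,
                         uint64_t hpa, uint64_t size, unsigned int perm);

    /*
     * Rebuild the shadow table after the guest updated its Second-level
     * Page-table Pointer (gpa_root).  On success the caller points the
     * pIOMMU context entry at 'st' instead of the default GPA->HPA table.
     */
    static int viommu_shadow_l2(struct domain *d, uint64_t gpa_root,
                                struct shadow_table *st)
    {
        struct guest_l2_mapping map = { .size = 0 };

        while (guest_l2_next(d, gpa_root, &map) == 0) {
            uint64_t hpa = p2m_gpa_to_hpa(d, map.gpa);   /* GPA -> HPA */
            int rc = shadow_set_entry(st, map.iova, hpa,
                                      map.size, map.perm);
            if (rc)
                return rc;         /* e.g. out of memory */
        }
        return 0;
    }

Real code would also have to re-shadow or invalidate entries when the
guest issues IOTLB invalidations; that is omitted here.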
> >
> > 3.3 Interrupt remapping
> > Interrupts from virtual devices and physical devices will be delivered
> > to the vLAPIC from the vIOAPIC and vMSI. Interrupt remapping hooks need
> > to be added in vmsi_deliver() and ioapic_deliver() to find the target
> > vLAPIC according to the interrupt remapping table. The following
> > diagram shows the logic.

Uh? Missing diagram?

> >
> > 3.4 1st level translation
> > When nested translation is enabled, any address generated by the
> > first-level translation is used as the input address for nesting with
> > the second-level translation. The physical IOMMU needs to enable both
> > 1st level and 2nd level translation in nested translation mode
> > (GVA->GPA->HPA) for the passthrough device.
> >
> > The VT-d context entry points to the guest's 1st level translation
> > table, which will be nest-translated by the 2nd level translation
> > table, so it can be linked directly into the context entry of the
> > physical IOMMU.
> >
> > To enable 1st level translation in the VM:
> > 1) The Xen IOMMU driver enables nested translation mode.
> > 2) The GPA root of the guest's 1st level translation table is written
> >    into the context entry of the physical IOMMU.
> >
> > All of this is handled in the hypervisor; there is no interaction with
> > Qemu.
> >
> >
> > 3.5 Implementation considerations
> > The Linux Intel IOMMU driver will fail to load without 2nd level
> > translation support, even if interrupt remapping and 1st level
> > translation are available. This means 2nd level translation has to be
> > enabled before the other functions.
> >
> >
> > 4 Qemu
> > ==============================================================================
> >
> > 4.1 Qemu vIOMMU framework
> > Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel
> > VT-d and AMD IOMMU) and report it in the guest ACPI tables. So on the
> > Xen side a dummy xen-vIOMMU wrapper is required to connect to the
> > actual vIOMMU in Xen, especially for the 2nd level translation of
> > virtual PCI devices, because the emulation of virtual PCI devices
> > lives in Qemu. Qemu's vIOMMU framework provides a callback to handle
> > the 2nd level translation when DMA operations of virtual PCI devices
> > happen.
> >
> >
> > 4.2 Dummy xen-vIOMMU driver
> > 1) Query the vIOMMU capabilities (e.g. DMA translation, interrupt
> > remapping and Shared Virtual Memory) via the hypercall.
> >
> > 2) Create the vIOMMU in the Xen hypervisor via the new hypercall, with
> > the DRHD register base address and the desired capabilities as
> > parameters. Destroy the vIOMMU when the VM is shut down. (A sketch of
> > this query/create flow is given after item 3 below.)
> >
> > 3) Virtual PCI device 2nd level translation
> > Qemu already provides a DMA translation hook, which is called when a
> > DMA translation for a virtual PCI device happens. The dummy xen-vIOMMU
> > passes the device BDF and the IOVA to the Xen hypervisor via the new
> > IOMMU hypercall and gets back the translated GPA.
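A minimal sketch of how steps 1) and 2) above might drive the proposed
hypercall at device setup time. The constants mirror section 3.1;
xc_viommu_op() is a hypothetical libxc wrapper and struct viommu_op is a
simplified stand-in for xen_sysctl_viommu_op (no union):

    /*
     * Illustrative only: query the vIOMMU capabilities, then create the
     * vIOMMU with the wanted capabilities and the DRHD register base
     * that will be advertised in the guest's DMAR table.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define XEN_SYSCTL_viommu_query_capability  0
    #define XEN_SYSCTL_viommu_create            1

    #define XEN_VIOMMU_CAPABILITY_2nd_level_translation  (1u << 1)
    #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1u << 2)

    struct viommu_op {            /* stand-in for xen_sysctl_viommu_op */
        uint32_t cmd;
        uint32_t domid;
        uint32_t capabilities;
        uint64_t base_address;    /* DRHD register base */
    };

    int xc_viommu_op(struct viommu_op *op);  /* hypothetical wrapper */

    static int xen_viommu_realize(uint32_t domid, uint64_t drhd_base)
    {
        struct viommu_op op = { .domid = domid };
        uint32_t wanted = XEN_VIOMMU_CAPABILITY_2nd_level_translation |
                          XEN_VIOMMU_CAPABILITY_interrupt_remapping;

        /* 1) Ask Xen which vIOMMU capabilities it can provide. */
        op.cmd = XEN_SYSCTL_viommu_query_capability;
        if (xc_viommu_op(&op))
            return -1;
        if ((op.capabilities & wanted) != wanted) {
            fprintf(stderr, "xen-vIOMMU: required capabilities missing\n");
            return -1;
        }

        /* 2) Create the vIOMMU with the wanted capabilities and base. */
        op.cmd = XEN_SYSCTL_viommu_create;
        op.capabilities = wanted;
        op.base_address = drhd_base;
        return xc_viommu_op(&op);
    }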
> >
> >
> > 4.3 Q35 vs i440x
> > VT-d is introduced since the Q35 chipset. A previous concern was that

s/since/with/

> > IOMMU drivers assume VT-d only exists on Q35 and newer chipsets, so
> > that we would have to enable Q35 first.
> >
> > We consulted Linux/Windows IOMMU driver experts and learned that these
> > drivers don't have such an assumption. So we may skip the Q35
> > implementation and emulate the vIOMMU on the i440x chipset. KVM
> > already has vIOMMU support with virtual PCI device DMA translation and
> > interrupt remapping. We are using KVM to experiment with adding a
> > vIOMMU on the i440x and to test Linux/Windows guests. We will report
> > back when we have some results.

Any results?

> >
> >
> > 4.4 Report vIOMMU to hvmloader
> > Hvmloader is in charge of building the ACPI tables for the guest OS,
> > and the OS probes the IOMMU via the ACPI DMAR table. So hvmloader
> > needs to know whether the vIOMMU is enabled or not, and its
> > capabilities, in order to prepare the ACPI DMAR table for the guest OS.
> >
> > There are three ways to do that:
> > 1) Extend struct hvm_info_table and add variables to it to pass the
> > vIOMMU information to hvmloader. But this requires a new xc interface
> > to use struct hvm_info_table in Qemu.
> >
> > 2) Pass the vIOMMU information to hvmloader via Xenstore.
> >
> > 3) Build the ACPI DMAR table in Qemu and pass it to hvmloader via
> > Xenstore. This solution is already present in the vNVDIMM design
> > (4.3.1 Building Guest ACPI Tables,
> > http://www.gossamer-threads.com/lists/xen/devel/439766).
> >
> > The third option seems cleaner: hvmloader doesn't need to deal with
> > vIOMMU details and just passes the DMAR table through to the guest OS.
> > All vIOMMU specific work is done in the dummy xen-vIOMMU driver.

/me nods. That does seem the best option.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel