Re: [Xen-devel] Xen virtual IOMMU high level design doc
On Thu, Sep 15, 2016 at 10:22:36PM +0800, Lan, Tianyu wrote:
> Hi Andrew:
> Sorry to bother you. To make sure we are heading in the right
> direction, we would like your feedback before we go any further. Could
> you have a look? Thanks.
>
> On 8/17/2016 8:05 PM, Lan, Tianyu wrote:
> > Hi All:
> > The following is our Xen vIOMMU high level design for detailed
> > discussion. Please have a look. We would very much appreciate your
> > comments. This design doesn't cover the changes needed when the root
> > port is moved into the hypervisor. We may design that later.
> >
> >
> > Content:
> > ===============================================================================
> >
> > 1. Motivation of vIOMMU
> > 1.1 Enable more than 255 vcpus
> > 1.2 Support VFIO-based user space driver
> > 1.3 Support guest Shared Virtual Memory (SVM)
> > 2. Xen vIOMMU Architecture
> > 2.1 2nd level translation overview
> > 2.2 Interrupt remapping overview
> > 3. Xen hypervisor
> > 3.1 New vIOMMU hypercall interface
> > 3.2 2nd level translation
> > 3.3 Interrupt remapping
> > 3.4 1st level translation
> > 3.5 Implementation consideration
> > 4. Qemu
> > 4.1 Qemu vIOMMU framework
> > 4.2 Dummy xen-vIOMMU driver
> > 4.3 Q35 vs. i440x
> > 4.4 Report vIOMMU to hvmloader
> >
> >
> > 1 Motivation for Xen vIOMMU
> > ===============================================================================
> >
> > 1.1 Enable more than 255 vcpu support
> > HPC virtualization requires support for more than 255 vcpus in a single
> > VM to meet parallel computing requirements. Supporting more than 255
> > vcpus requires the vIOMMU to provide interrupt remapping so that
> > interrupts can be delivered to vcpus numbered above 255. Without
> > interrupt remapping, a Linux guest fails to boot with more than 255
> > vcpus.
> >
> >
> > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> > It relies on the 2nd level translation capability (IOVA->GPA) on
> > vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> > vIOMMU to isolate DMA requests initiated by user space driver.
> >
> >
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > It relies on the 1st level translation table capability (GVA->GPA) on
> > vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
> > in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> > is the main usage today (to support OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too.
> >
> > 2. Xen vIOMMU Architecture
> > ================================================================================
> >
> >
> > * vIOMMU will be inside Xen hypervisor for the following reasons
> > 1) Avoid round trips between Qemu and Xen hypervisor
> > 2) Ease of integration with the rest of the hypervisor
> > 3) HVMlite/PVH doesn't use Qemu
> > * Dummy xen-vIOMMU in Qemu acts as a wrapper around the new hypercall to
> > create/destroy the vIOMMU in the hypervisor and to handle virtual PCI
> > devices' 2nd level translation.
> >
> > 2.1 2nd level translation overview
> > For a virtual PCI device, the dummy xen-vIOMMU in Qemu does the
> > translation via the new hypercall.
> >
> > For a physical PCI device, the vIOMMU in the hypervisor shadows the IO
> > page table from IOVA->GPA to IOVA->HPA and loads the shadow page table
> > into the physical IOMMU.
> >
> > The following diagram shows the 2nd level translation architecture.
> > +---------------------------------------------------------+
> > |Qemu +----------------+ |
> > | | Virtual | |
> > | | PCI device | |
> > | | | |
> > | +----------------+ |
> > | |DMA |
> > | V |
> > | +--------------------+ Request +----------------+ |
> > | | +<-----------+ | |
> > | | Dummy xen vIOMMU | Target GPA | Memory region | |
> > | | +----------->+ | |
> > | +---------+----------+ +-------+--------+ |
> > | | | |
> > | |Hypercall | |
> > +--------------------------------------------+------------+
> > |Hypervisor | | |
> > | | | |
> > | v | |
> > | +------+------+ | |
> > | | vIOMMU | | |
> > | +------+------+ | |
> > | | | |
> > | v | |
> > | +------+------+ | |
> > | | IOMMU driver| | |
> > | +------+------+ | |
> > | | | |
> > +--------------------------------------------+------------+
> > |HW v V |
> > | +------+------+ +-------------+ |
> > | | IOMMU +---------------->+ Memory | |
> > | +------+------+ +-------------+ |
> > | ^ |
> > | | |
> > | +------+------+ |
> > | | PCI Device | |
> > | +-------------+ |
> > +---------------------------------------------------------+
> >
> > 2.2 Interrupt remapping overview.
> > Interrupts from virtual devices and physical devices will be delivered
> > to the vLAPIC from the vIOAPIC and vMSI. The vIOMMU remaps the
> > interrupts along this path.
> >
> > +---------------------------------------------------+
> > |Qemu |VM |
> > | | +----------------+ |
> > | | | Device driver | |
> > | | +--------+-------+ |
> > | | ^ |
> > | +----------------+ | +--------+-------+ |
> > | | Virtual device | | | IRQ subsystem | |
> > | +-------+--------+ | +--------+-------+ |
> > | | | ^ |
> > | | | | |
> > +---------------------------+-----------------------+
> > |hypervisor | | VIRQ |
> > | | +---------+--------+ |
> > | | | vLAPIC | |
> > | | +---------+--------+ |
> > | | ^ |
> > | | | |
> > | | +---------+--------+ |
> > | | | vIOMMU | |
> > | | +---------+--------+ |
> > | | ^ |
> > | | | |
> > | | +---------+--------+ |
> > | | | vIOAPIC/vMSI | |
> > | | +----+----+--------+ |
> > | | ^ ^ |
> > | +-----------------+ | |
> > | | |
> > +---------------------------------------------------+
> > HW |IRQ
> > +-------------------+
> > | PCI Device |
> > +-------------------+
> >
> >
> >
> >
> >
> > 3 Xen hypervisor
> > ==========================================================================
> >
> > 3.1 New hypercall XEN_SYSCTL_viommu_op
> > 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
> >
> > typedef enum {
> >     IOMMU_NONE = 0,
> >     IOMMU_RO = 1,
> >     IOMMU_WO = 2,
> >     IOMMU_RW = 3,
> > } IOMMUAccessFlags;
> >
> > struct xen_sysctl_viommu_op {
> >     u32 cmd;
> >     u32 domid;
> >     union {
> >         struct {
> >             u32 capabilities;
> >         } query_capabilities;
> >         struct {
> >             u32 capabilities;
> >             u64 base_address;
> >         } create_iommu;
> >         struct {
> >             u8 bus;
> >             u8 devfn;
> >             u64 iova;
> >             u64 translated_addr;
> >             u64 addr_mask; /* Translation page size */
> >             IOMMUAccessFlags permission;
> >         } second_level_translation;
> >     };
> > };
> >
> >
> > Definition of VIOMMU subops:
> > #define XEN_SYSCTL_viommu_query_capability 0
> > #define XEN_SYSCTL_viommu_create 1
> > #define XEN_SYSCTL_viommu_destroy 2
> > #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 3
> >
> > Definition of VIOMMU capabilities
> > #define XEN_VIOMMU_CAPABILITY_1st_level_translation (1 << 0)
> > #define XEN_VIOMMU_CAPABILITY_2nd_level_translation (1 << 1)
> > #define XEN_VIOMMU_CAPABILITY_interrupt_remapping (1 << 2)
> >
> >
> > 2) Design for subops
> > - XEN_SYSCTL_viommu_query_capability
> > Get vIOMMU capabilities (1st/2nd level translation and interrupt
> > remapping).
> >
> > - XEN_SYSCTL_viommu_create
> > Create a vIOMMU in the Xen hypervisor, with the domain id, capabilities
> > and register base address as parameters.
> >
> > - XEN_SYSCTL_viommu_destroy
> > Destroy the vIOMMU in the Xen hypervisor, with the domain id as
> > parameter.
> >
> > - XEN_SYSCTL_viommu_dma_translation_for_vpdev
> > Translate IOVA to GPA for the specified virtual PCI device. The caller
> > passes the domain id, the PCI device's bdf and the IOVA; the Xen
> > hypervisor returns the translated GPA, address mask and access
> > permission.
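As a rough illustration of how a caller might combine the query and create subops, here is a minimal sketch. The cut-down struct viommu_op, the named union member u and the helper viommu_prepare_create() are hypothetical and not part of the proposal; only the subop and capability values are taken from the definitions above. The assumed policy is that a caller must not request capabilities that the query did not report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Subop and capability values as listed in the design above. */
#define XEN_SYSCTL_viommu_query_capability          0
#define XEN_SYSCTL_viommu_create                    1
#define XEN_SYSCTL_viommu_destroy                   2
#define XEN_SYSCTL_viommu_dma_translation_for_vpdev 3

#define XEN_VIOMMU_CAPABILITY_1st_level_translation (1u << 0)
#define XEN_VIOMMU_CAPABILITY_2nd_level_translation (1u << 1)
#define XEN_VIOMMU_CAPABILITY_interrupt_remapping   (1u << 2)

/* Hypothetical, cut-down copy of the hypercall argument (assumption). */
struct viommu_op {
    uint32_t cmd;
    uint32_t domid;
    union {
        struct { uint32_t capabilities; } query_capabilities;
        struct { uint32_t capabilities; uint64_t base_address; } create_iommu;
    } u;
};

/*
 * Hypothetical helper: fill in a create request.
 * Returns 0 on success, -1 if 'wanted' contains a capability bit that is
 * not set in 'supported' (the result of an earlier capability query).
 */
static int viommu_prepare_create(struct viommu_op *op, uint32_t domid,
                                 uint32_t supported, uint32_t wanted,
                                 uint64_t base_address)
{
    assert(op != NULL);

    if (wanted & ~supported)
        return -1;

    memset(op, 0, sizeof(*op));
    op->cmd = XEN_SYSCTL_viommu_create;
    op->domid = domid;
    op->u.create_iommu.capabilities = wanted;
    op->u.create_iommu.base_address = base_address;
    return 0;
}
```

The filled-in structure would then be handed to whatever wrapper ends up issuing the sysctl; that wrapper is deliberately left out because the design does not define it.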
> >
> >
> > 3.2 2nd level translation
> > 1) For virtual PCI device
> > The dummy xen-vIOMMU in Qemu translates IOVA to target GPA via the new
> > hypercall when a DMA operation happens.
> >
> > 2) For physical PCI device
> > DMA operations go through the physical IOMMU directly, so the IO page
> > table for IOVA->HPA must be loaded into the physical IOMMU. When the
> > guest updates the Second-level Page-table pointer field, it provides an
> > IO page table for IOVA->GPA. The vIOMMU needs to shadow this 2nd level
> > translation table, translate GPA->HPA, and write the pointer to the
> > shadow page table (IOVA->HPA) into the Second-level Page-table pointer
> > field of the physical IOMMU's context entry.
> >
> > Currently all PCI devices in the same HVM domain share one IO page
> > table (GPA->HPA) in Xen's physical IOMMU driver. To support 2nd level
> > translation in the vIOMMU, the IOMMU driver needs to support multiple
> > address spaces per device entry. It uses the existing IO page table
> > (GPA->HPA) by default and switches to the shadow IO page table
> > (IOVA->HPA) when 2nd level translation is enabled. These changes will
> > not affect the current P2M logic.
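To make the shadowing step concrete, here is a minimal sketch that composes a guest IOVA->GPA table with the GPA->HPA mapping. It assumes flat, single-level arrays indexed by page frame number, which is a deliberate simplification of the multi-level VT-d page tables; INVALID_PFN and build_shadow() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_PFN UINT64_MAX  /* hypothetical "not mapped" marker */

/*
 * guest[i]  : GPA frame that IOVA frame i maps to (guest-owned table)
 * p2m[g]    : HPA frame that GPA frame g maps to (hypervisor-owned)
 * shadow[i] : resulting HPA frame for IOVA frame i (IOVA->HPA)
 *
 * Returns the number of IOVA frames that ended up mapped.
 */
static size_t build_shadow(const uint64_t *guest, size_t nr_iova,
                           const uint64_t *p2m, size_t nr_gpa,
                           uint64_t *shadow)
{
    size_t mapped = 0;

    assert(guest != NULL && p2m != NULL && shadow != NULL);

    for (size_t i = 0; i < nr_iova; i++) {
        uint64_t gfn = guest[i];

        /* An entry is usable only if both translation stages are valid. */
        if (gfn != INVALID_PFN && gfn < nr_gpa && p2m[gfn] != INVALID_PFN) {
            shadow[i] = p2m[gfn];
            mapped++;
        } else {
            shadow[i] = INVALID_PFN;
        }
    }
    return mapped;
}
```

A real implementation would also have to notice when the guest modifies its table and refresh the affected shadow entries; that part is not covered by this sketch.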
> >
> > 3.3 Interrupt remapping
> > Interrupts from virtual devices and physical devices will be delivered
> > to the vlapic from the vIOAPIC and vMSI. Interrupt remapping hooks need
> > to be added to vmsi_deliver() and ioapic_deliver() to find the target
> > vlapic according to the interrupt remapping table. The following
> > diagram shows the logic.
> >
Uh? Missing diagram?
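For illustration only, the kind of lookup such a hook might perform can be sketched as below. The entry layout is a simplified stand-in and not the real VT-d IRTE format, and struct irte_sketch and remap_interrupt() are hypothetical names; the actual hooks in vmsi_deliver() and ioapic_deliver() are not specified in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified remapping table entry (assumption, not the VT-d layout). */
struct irte_sketch {
    bool     present;
    uint8_t  vector;
    uint32_t dest_id;   /* wide enough for vcpu IDs above 255 */
};

/*
 * Look up the remapping entry selected by 'handle' and return the vector
 * and destination the interrupt should be delivered with.
 * Returns 0 on success, -1 if the handle is out of range or the entry is
 * not present (the interrupt would then be blocked).
 */
static int remap_interrupt(const struct irte_sketch *irt, size_t nr_entries,
                           uint32_t handle, uint8_t *vector,
                           uint32_t *dest_id)
{
    assert(irt != NULL && vector != NULL && dest_id != NULL);

    if (handle >= nr_entries || !irt[handle].present)
        return -1;

    *vector = irt[handle].vector;
    *dest_id = irt[handle].dest_id;
    return 0;
}
```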
> >
> > 3.4 1st level translation
> > When nested translation is enabled, any address generated by 1st level
> > translation is used as the input address for 2nd level translation.
> > The physical IOMMU needs to enable both 1st level and 2nd level
> > translation in nested translation mode (GVA->GPA->HPA) for a
> > passthrough device.
> >
> > VT-d context entry points to guest 1st level translation table which
> > will be nest-translated by 2nd level translation table and so it
> > can be directly linked to context entry of physical IOMMU.
> >
> > To enable 1st level translation in a VM
> > 1) The Xen IOMMU driver enables nested translation mode
> > 2) Write the GPA root of the guest 1st level translation table into the
> > context entry of the physical IOMMU.
> >
> > All of this is handled in the hypervisor, with no interaction with Qemu.
> >
> >
> > 3.5 Implementation consideration
> > The Linux Intel IOMMU driver fails to load without 2nd level
> > translation support, even if interrupt remapping and 1st level
> > translation are available. This means 2nd level translation has to be
> > enabled before the other functions.
> >
> >
> > 4 Qemu
> > ==============================================================================
> >
> > 4.1 Qemu vIOMMU framework
> > Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d
> > and AMD IOMMU) and report it in the guest ACPI tables. So on the Xen
> > side, a dummy xen-vIOMMU wrapper is required to connect to the actual
> > vIOMMU in Xen. This matters especially for 2nd level translation of
> > virtual PCI devices, because virtual PCI devices are emulated in Qemu.
> > Qemu's vIOMMU framework provides a callback to handle 2nd level
> > translation when a virtual PCI device performs a DMA operation.
> >
> >
> > 4.2 Dummy xen-vIOMMU driver
> > 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping
> > and Shared Virtual Memory) via hypercall.
> >
> > 2) Create vIOMMU in Xen hypervisor via new hypercall with DRHU register
> > address and desired capability as parameters. Destroy vIOMMU when VM is
> > closed.
> >
> > 3) Virtual PCI device's 2nd level translation
> > Qemu already provides a DMA translation hook. It is called when a
> > virtual PCI device's DMA needs translating. The dummy xen-vIOMMU passes
> > the device bdf and IOVA to the Xen hypervisor via the new iommu
> > hypercall and returns the translated GPA.
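A minimal sketch of what the dummy driver might do with the hypercall result is shown below. It assumes addr_mask holds the low offset bits of the translation (for example 0xfff for a 4K page), which is how Qemu's IOMMUTLBEntry uses the field of the same name; struct xlate_result and finish_translation() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    IOMMU_NONE = 0,
    IOMMU_RO = 1,
    IOMMU_WO = 2,
    IOMMU_RW = 3,
} IOMMUAccessFlags;

/* Hypothetical container for the values returned by the hypercall. */
struct xlate_result {
    uint64_t translated_addr;
    uint64_t addr_mask;          /* assumed: page size - 1 */
    IOMMUAccessFlags permission;
};

/*
 * Combine the translated page address with the page offset of the IOVA
 * and check the access permission.
 * Returns 0 and stores the GPA on success, -1 if the access is denied.
 */
static int finish_translation(const struct xlate_result *r, uint64_t iova,
                              bool is_write, uint64_t *gpa)
{
    IOMMUAccessFlags need = is_write ? IOMMU_WO : IOMMU_RO;

    assert(r != NULL && gpa != NULL);

    if ((r->permission & need) == 0)
        return -1;

    *gpa = (r->translated_addr & ~r->addr_mask) | (iova & r->addr_mask);
    return 0;
}
```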
> >
> >
> > 4.3 Q35 vs i440x
> > VT-D is introduced since Q35 chipset. Previous concern was that IOMMU
s/since/with/
> > drivers assume that VT-d only exists on Q35 and newer chipsets, so we
> > would have to enable Q35 first.
> >
> > We consulted Linux/Windows IOMMU driver experts and learned that these
> > drivers make no such assumption. So we may skip the Q35 implementation
> > and emulate the vIOMMU on the i440x chipset. KVM already has vIOMMU
> > support with virtual PCI device DMA translation and interrupt
> > remapping. We are using KVM to experiment with adding a vIOMMU on i440x
> > and to test Linux/Windows guests. We will report back when we have some
> > results.
Any results?
> >
> >
> > 4.4 Report vIOMMU to hvmloader
> > Hvmloader is in charge of building ACPI tables for the guest OS, and the
> > OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs to know
> > whether the vIOMMU is enabled, and its capabilities, in order to prepare
> > the ACPI DMAR table for the guest OS.
> >
> > There are three ways to do that.
> > 1) Extend struct hvm_info_table with new fields to pass vIOMMU
> > information to hvmloader. But this requires adding a new xc interface
> > so that Qemu can use struct hvm_info_table.
> >
> > 2) Pass vIOMMU information to hvmloader via Xenstore
> >
> > 3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> > This solution is already present in the vNVDIMM design(4.3.1
> > Building Guest ACPI Tables
> > http://www.gossamer-threads.com/lists/xen/devel/439766).
> >
> > The third option seems cleaner: hvmloader doesn't need to deal with
> > vIOMMU details and just passes the DMAR table through to the guest OS.
> > All vIOMMU-specific work is done in the dummy xen-vIOMMU driver.
/me nods. That does seem the best option.
> >
> >
> >
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> https://lists.xen.org/xen-devel