
Re: [Xen-devel] Discussion about virtual iommu support for Xen guest



> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> Sent: Friday, June 03, 2016 2:59 AM
> 
> On 02/06/16 16:03, Lan, Tianyu wrote:
> > On 5/27/2016 4:19 PM, Lan Tianyu wrote:
> >> On 26/05/16 19:35, Andrew Cooper wrote:
> >>> On 26/05/16 09:29, Lan Tianyu wrote:
> >>>
> >>> To be viable going forwards, any solution must work with PVH/HVMLite as
> >>> much as HVM.  This alone negates qemu as a viable option.
> >>>
> >>> From a design point of view, having Xen needing to delegate to qemu to
> >>> inject an interrupt into a guest seems backwards.
> >>>
> >>
> >> Sorry, I am not familiar with HVMlite. HVMlite doesn't use Qemu, so
> >> the Qemu virtual iommu can't work for it. We would have to rewrite
> >> the virtual iommu in Xen, right?
> >>
> >>>
> >>> A whole lot of this would be easier to reason about if/when we get a
> >>> basic root port implementation in Xen, which is necessary for HVMLite,
> >>> and which will make the interaction with qemu rather more clean.  It is
> >>> probably worth coordinating work in this area.
> >>
> >> The virtual iommu should also sit under the basic root port in Xen, right?
> >>
> >>>
> >>> As for the individual issue of 288vcpu support, there are already
> >>> issues
> >>> with 64vcpu guests at the moment. While it is certainly fine to remove
> >>> the hard limit at 255 vcpus, there is a lot of other work required to
> >>> even get 128vcpu guests stable.
> >>
> >>
> >> Could you give some pointers to these issues? We are enabling support
> >> for more vcpus, and it can basically boot up with 255 vcpus without IR
> >> support. It would be very helpful to learn about the known issues.
> >>
> >> We will also add more tests for 128 vcpus into our regular testing to
> >> find related bugs. Increasing the max vcpu count to 255 should be a
> >> good start.
> >
> > Hi Andrew:
> > Could you give more input about the issues with 64 vcpus and what needs
> > to be done to make 128-vcpu guests stable? We hope to do something to
> > improve the situation.
> >
> > What's the progress of the PCI host bridge in Xen? In your opinion, we
> > should do that first, right? Thanks.
> 
> Very sorry for the delay.
> 
> There are multiple interacting issues here.  On the one side, it would
> be useful if we could have a central point of coordination on
> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> would you mind organising that?
> 
> For the qemu/xen interaction, the current state is woeful and a tangled
> mess.  I wish to ensure that we don't make any development decisions
> which makes the situation worse.
> 
> In your case, the two motivations are quite different; I would recommend
> dealing with them independently.
> 
> IIRC, the issue with more than 255 cpus and interrupt remapping is that
> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
> can't be programmed to generate x2apic interrupts?  In principle, if you
> don't have an IOAPIC, are there any other issues to be considered?  What
> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
> deliver xapic interrupts?

The key is the APIC ID. The introduction of x2apic did not change the
existing PCI MSI and IOAPIC message formats: a PCI MSI/IOAPIC message can
only carry an 8-bit APIC ID, which cannot address more than 255 cpus.
Interrupt remapping supports a 32-bit APIC ID, so it is required to
enable more than 255 cpus in x2apic mode.

If the LAPIC is in x2apic mode while interrupt remapping is disabled, the
IOAPIC cannot deliver interrupts to all cpus in the system when there are
more than 255 of them.
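
To make the width difference concrete, here is a rough sketch
(illustrative only, not Xen code; the layouts are abridged from my
reading of the specs, so please double-check against the SDM and the
VT-d spec):

#include <stdint.h>

/* In xAPIC compatibility format the MSI address carries the destination
 * APIC ID in bits 19:12, i.e. only 8 bits, and the IOAPIC RTE
 * destination field is likewise 8 bits wide. */
#define MSI_ADDR_DEST_ID_SHIFT  12
#define MSI_ADDR_DEST_ID_MASK   0xffu          /* only 8 bits available */

static inline uint8_t msi_compat_dest(uint32_t msi_addr)
{
    /* Cannot express an APIC ID above 255. */
    return (msi_addr >> MSI_ADDR_DEST_ID_SHIFT) & MSI_ADDR_DEST_ID_MASK;
}

/* Abridged remapped-format IRTE: the destination is a full 32 bits,
 * which is what makes x2apic IDs > 255 reachable. */
struct irte_sketch {
    uint64_t present:1, fpd:1, dm:1, rh:1, tm:1, dlm:3, avail:4,
             rsvd1:3, im:1, vector:8, rsvd2:8,
             dest_id:32;                       /* 32-bit APIC ID */
    uint64_t sid:16, sq:2, svt:2, rsvd3:44;
};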

> 
> On the other side of things, what is IGD passthrough going to look like
> in Skylake?  Is there any device-model interaction required (i.e. the
> opregion), or will it work as a completely standalone device?  What are
> your plans with the interaction of virtual graphics and shared virtual
> memory?
> 

The plan is to use a so-called universal pass-through driver in the
guest, which only accesses standard PCI resources (no opregion, PCH/MCH,
etc.).

----
Here is a brief summary of the potential usages relying on vIOMMU:

a) enable >255 vcpus on Xeon Phi, which is the initial purpose of this
thread. It requires the interrupt remapping capability to be present on
the vIOMMU;

b) support guest SVM (Shared Virtual Memory), which relies on the
1st-level translation table capability (GVA->GPA) on the vIOMMU. The
pIOMMU needs to enable both 1st-level and 2nd-level translation in
nested mode (GVA->GPA->HPA) for the passthrough device. IGD passthrough
is the main usage today (to support the OpenCL 2.0 SVM feature); in the
future SVM might be used by other I/O devices too;

c) support VFIO-based user space drivers (e.g. DPDK) in the guest,
which rely on the 2nd-level translation capability (IOVA->GPA) on the
vIOMMU. The pIOMMU 2nd level becomes a shadow of the vIOMMU 2nd level,
with GPA replaced by HPA (so it holds IOVA->HPA); a rough sketch of the
two translation chains is below.
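
A rough sketch of both chains (invented helper names, not real Xen
interfaces). In b) the pIOMMU hardware walks both levels itself, GVA->GPA
via the guest 1st-level table and then GPA->HPA via the 2nd level, so
nothing needs shadowing. In c) only the 2nd level is usable, so Xen has
to compose the two mappings in software and keep the result up to date:

#include <stdint.h>

struct domain;                                         /* opaque here */
uint64_t viommu_2nd_level_lookup(struct domain *d, uint64_t iova);
                                          /* IOVA->GPA, hypothetical  */
uint64_t p2m_lookup_hpa(struct domain *d, uint64_t gpa);
                                          /* GPA->HPA, hypothetical   */

/* What the shadow pIOMMU 2nd-level entry has to contain for case c);
 * it must be recomputed whenever either input mapping changes. */
static uint64_t shadow_iova_to_hpa(struct domain *d, uint64_t iova)
{
    uint64_t gpa = viommu_2nd_level_lookup(d, iova);
    return p2m_lookup_hpa(d, gpa);
}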

----
And below are my thoughts on the viability of implementing the vIOMMU in
Qemu:

a) enable >255 vcpus:

        o Enable Q35 in Qemu-Xen;
        o Add interrupt remapping in Qemu vIOMMU;
        o Virtual interrupt injection in the hypervisor needs to know the
virtual interrupt remapping (IR) structure, since IR sits behind
vIOAPIC/vMSI. This requires new hypervisor interfaces, as Andrew pointed
out:
                * either for the hypervisor to query the IR info from
Qemu, which is not good;
                * or for Qemu to register the IR info with the hypervisor,
which means partial IR knowledge is implemented in the hypervisor (then
why not put the whole IR emulation in Xen?); a sketch of such a
registration interface is below.
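
For the second option, the registration interface might look roughly
like this (purely hypothetical naming and layout, just to show the shape
of the information Qemu would have to hand over):

#include <stdint.h>

/* Hypothetical sketch only -- no such interface exists today.  Qemu
 * would tell Xen where the guest's interrupt remapping table lives so
 * that virtual interrupt injection can turn a remappable vIOAPIC RTE /
 * vMSI message into (vector, destination vcpu) by itself. */
struct xen_dm_op_set_viommu_irt {
    uint64_t irta_gpa;     /* guest-physical base of the virtual IRT  */
    uint32_t nr_entries;   /* number of 128-bit IRTEs                 */
    uint32_t flags;        /* e.g. whether x2apic (EIME) is enabled   */
};

At that point Xen is already parsing guest IRTEs on every injection,
which is why doing the whole IR emulation in Xen looks simpler.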

b) support SVM

        o Enable Q35 in Qemu-Xen;
        o Add 1st level translation capability in Qemu vIOMMU;
        o The VT-d context entry points to the guest 1st-level translation
table, which is nest-translated through the 2nd-level translation table,
so the vIOMMU structure can be linked in directly. This means:
                * the Xen IOMMU driver enables nested mode;
                * a new hypercall is introduced so the Qemu vIOMMU can
register the GPA root of the guest 1st-level translation table, which
is then written into the context entry of the pIOMMU (a sketch of such
a hypercall is below);
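
The new hypercall in the last bullet might carry something like the
following (hypothetical name and fields, only to show what has to flow
from the Qemu vIOMMU into the pIOMMU context entry):

#include <stdint.h>

/* Hypothetical sketch only. */
struct xen_dm_op_set_l1_root {
    uint32_t sbdf;            /* passthrough device the table is for  */
    uint32_t pasid;           /* if/when PASID-granular SVM is needed */
    uint64_t flptr_gpa;       /* GPA root of the guest GVA->GPA table */
    uint32_t address_width;   /* guest address width                  */
    uint32_t flags;
};

Xen would validate this and program the pIOMMU context entry in nested
mode with that root, while the 2nd level (GPA->HPA) stays under Xen's
control.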

c) support VFIO-based user space driver

        o Enable Q35 in Qemu-Xen;
        o Leverage the existing 2nd-level translation implementation in
the Qemu vIOMMU;
        o Change the Xen IOMMU code to support (IOVA->HPA) translation,
which means decoupling the current logic from the P2M layer (which only
handles GPA->HPA);
        o Since this is a shadowing approach, the Xen IOMMU driver needs
to know both the (IOVA->GPA) and (GPA->HPA) mappings in order to update
the (IOVA->HPA) mapping whenever either one changes. So a new interface
is required for the Qemu vIOMMU to propagate (IOVA->GPA) info into the
Xen hypervisor, where it may need to be cached (see the sketch below).
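
The propagation interface in the last bullet could be a simple map/unmap
notification (again, purely hypothetical naming):

#include <stdint.h>

/* Hypothetical sketch only.  The Qemu vIOMMU would issue this whenever
 * the guest maps or unmaps an IOVA; Xen then looks up GPA->HPA in the
 * P2M and updates the shadow (IOVA->HPA) entry in the pIOMMU. */
struct xen_dm_op_viommu_map_notify {
    uint32_t sbdf;          /* device the mapping belongs to           */
    uint32_t flags;         /* map vs. unmap, read/write permissions   */
    uint64_t iova;
    uint64_t gpa;           /* ignored for unmap                       */
    uint64_t size;
};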

----

After writing down the details above, it looks clear that putting the
vIOMMU in Qemu is not a clean design for a) and c). For b) the hypervisor
change is not that hacky, but b) alone does not seem a strong enough
reason to pursue the Qemu path. It seems we may have to go with a
hypervisor-based approach...

Anyway, I'll stop here. With the above background, let's see whether
others have better thoughts on how to accelerate the time-to-market of
these usages in Xen. Xen was once a leading hypervisor for many new
features, but recently it has not kept up that pace. If the above usages
can be enabled decoupled from the HVMlite/virtual_root_port effort, then
we can have a staged plan and move faster (first for HVM, later for
HVMlite). :-)

Thanks
Kevin

 

