
Re: [Xen-devel] Discussion about virtual iommu support for Xen guest



> From: Stefano Stabellini
> Sent: Saturday, June 04, 2016 1:15 AM
> 
> On Fri, 3 Jun 2016, Andrew Cooper wrote:
> > On 03/06/16 12:17, Tian, Kevin wrote:
> > >> Very sorry for the delay.
> > >>
> > >> There are multiple interacting issues here.  On the one side, it would
> > >> be useful if we could have a central point of coordination on
> > >> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> > >> would you mind organising that?
> > >>
> > >> For the qemu/xen interaction, the current state is woeful and a tangled
> > >> mess.  I wish to ensure that we don't make any development decisions
> > >> which makes the situation worse.
> > >>
> > >> In your case, the two motivations are quite different, so I would
> > >> recommend dealing with them independently.
> > >>
> > >> IIRC, the issue with more than 255 cpus and interrupt remapping is that
> > >> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
> > >> can't be programmed to generate x2apic interrupts?  In principle, if you
> > >> don't have an IOAPIC, are there any other issues to be considered?  What
> > >> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
> > >> deliver xapic interrupts?
> > > The key is the APIC ID. There is no modification to existing PCI MSI and
> > > IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send an
> > > interrupt message containing an 8-bit APIC ID, which cannot address >255
> > > cpus. Interrupt remapping supports 32-bit APIC IDs, so it is necessary
> > > for enabling >255 cpus with x2apic mode.
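To make the 8-bit limit concrete, here is a toy snippet that builds a
compatibility-format MSI address (field layout as I recall it from the
SDM; purely illustrative, not Xen code):

    #include <stdint.h>

    /* Compatibility-format MSI address: the destination is an 8-bit field
     * (bits 19:12), so MSI/IOAPIC alone cannot target APIC IDs above 255.
     * With interrupt remapping the message instead carries a handle into
     * the remapping table, whose entries hold a full 32-bit destination. */
    static uint32_t msi_compat_address(uint8_t dest_id, int logical_mode)
    {
        uint32_t addr = 0xFEE00000u;           /* fixed MSI address window   */
        addr |= (uint32_t)dest_id << 12;       /* Destination ID, bits 19:12 */
        addr |= (logical_mode ? 1u : 0u) << 2; /* Destination Mode, bit 2    */
        return addr;
    }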
> >
> > Thanks for clarifying.
> >
> > >
> > > If the LAPIC is in x2apic mode while interrupt remapping is disabled, the
> > > IOAPIC cannot deliver interrupts to all cpus in the system if #cpu > 255.
> >
> > Ok.  So not ideal (and we certainly want to address it), but this isn't
> > a complete show stopper for a guest.
> >
> > >> On the other side of things, what is IGD passthrough going to look like
> > >> in Skylake?  Is there any device-model interaction required (i.e. the
> > >> opregion), or will it work as a completely standalone device?  What are
> > >> your plans with the interaction of virtual graphics and shared virtual
> > >> memory?
> > >>
> > > The plan is to use a so-called universal pass-through driver in the guest
> > > which only accesses standard PCI resources (w/o opregion, PCH/MCH, etc.)
> >
> > This is fantastic news.
> >
> > >
> > > ----
> > > Here is a brief of potential usages relying on vIOMMU:
> > >
> > > a) enable >255 vcpus on Xeon Phi, which was the initial purpose of this
> > > thread. It requires the interrupt remapping capability to be present on
> > > the vIOMMU;
> > >
> > > b) support guest SVM (Shared Virtual Memory), which relies on the
> > > 1st level translation table capability (GVA->GPA) of the vIOMMU. The
> > > pIOMMU needs to enable both 1st level and 2nd level translation in
> > > nested mode (GVA->GPA->HPA) for the passthrough device. IGD passthrough
> > > is the main usage today (to support the OpenCL 2.0 SVM feature). In the
> > > future SVM might be used by other I/O devices too;
> > >
> > > c) support VFIO-based user space drivers (e.g. DPDK) in the guest,
> > > which rely on the 2nd level translation capability (IOVA->GPA) of the
> > > vIOMMU. The pIOMMU 2nd level becomes a shadow of the vIOMMU 2nd level,
> > > with GPA replaced by HPA (i.e. IOVA->HPA);
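To illustrate the shadowing in c): conceptually each pIOMMU 2nd level
entry is the composition of the guest-controlled (IOVA->GPA) mapping with
the Xen-controlled (GPA->HPA) one. A toy sketch (all names made up, stubs
in place of the real table walks):

    #include <stdint.h>

    typedef uint64_t addr_t;

    /* Stand-ins for the vIOMMU 2nd-level walk and the P2M lookup. */
    static addr_t viommu_iova_to_gpa(addr_t iova) { return iova; }
    static addr_t p2m_gpa_to_hpa(addr_t gpa)      { return gpa;  }

    /* What would be installed in the pIOMMU (shadow) 2nd level for a given
     * IOVA; it must be recomputed whenever either input mapping changes. */
    static addr_t shadow_iova_to_hpa(addr_t iova)
    {
        addr_t gpa = viommu_iova_to_gpa(iova);  /* guest-controlled */
        return p2m_gpa_to_hpa(gpa);             /* Xen-controlled   */
    }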
> >
> > All of these look like interesting things to do.  I know there is a lot
> > of interest for b).
> >
> > As a quick aside, does Xen currently boot on a Phi?  Last time I looked
> > at the Phi manual, I would expect Xen to crash on boot because of MXCSR
> > differences from more-common x86 hardware.

Tianyu can correct me on the details. Xen can boot on Xeon Phi. However,
we need a hacky patch in the guest Linux kernel to disable the dependency
check around interrupt remapping; otherwise the guest kernel fails to boot.
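For reference, the dependency being relaxed is roughly of the shape below
(purely illustrative C, not the actual Linux code or our patch): the guest
kernel refuses to boot with APIC IDs above 255 unless an interrupt
remapping unit is advertised.

    #include <stdbool.h>

    /* Illustrative only: the kernel ties x2apic with >255 APIC IDs to the
     * presence of interrupt remapping; the hacky patch mentioned above
     * effectively bypasses a check of this shape. */
    static bool x2apic_config_ok(unsigned int max_apic_id, bool ir_present)
    {
        if (max_apic_id > 255 && !ir_present)
            return false;
        return true;
    }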

Now we're suffering from some performance issues. While that analysis is
ongoing, could you elaborate on the limitation you see with a 64-vcpu
guest? It would be helpful to know whether we are hunting the same problem
or not...

> >
> > >
> > > ----
> > > And below are my thoughts on the viability of implementing vIOMMU in Qemu:
> > >
> > > a) enable >255 vcpus:
> > >
> > >   o Enable Q35 in Qemu-Xen;
> > >   o Add interrupt remapping in Qemu vIOMMU;
> > >   o Virtual interrupt injection in the hypervisor needs to know the
> > > virtual interrupt remapping (IR) structure, since IR sits behind
> > > vIOAPIC/vMSI. This requires new hypervisor interfaces, as Andrew
> > > pointed out:
> > >           * either the hypervisor queries IR state from Qemu, which
> > > is not good;
> > >           * or Qemu registers IR info with the hypervisor, which means
> > > partial IR knowledge implemented in the hypervisor (then why not put
> > > the whole IR emulation in Xen?)
> > >
> > > b) support SVM
> > >
> > >   o Enable Q35 in Qemu-Xen;
> > >   o Add the 1st level translation capability to the Qemu vIOMMU;
> > >   o The VT-d context entry points to the guest 1st level translation
> > > table, which is nest-translated via the 2nd level translation table,
> > > so the vIOMMU structure can be directly linked. It means:
> > >           * the Xen IOMMU driver enables nested mode;
> > >           * a new hypercall is introduced so the Qemu vIOMMU can
> > > register the GPA root of the guest 1st level translation table, which
> > > is then written to the context entry in the pIOMMU;
> > >
> > > c) support VFIO-based user space driver
> > >
> > >   o Enable Q35 in Qemu-Xen;
> > >   o Leverage the existing 2nd level translation implementation in the
> > > Qemu vIOMMU;
> > >   o Change the Xen IOMMU code to support (IOVA->HPA) translation, which
> > > means decoupling the current logic from the P2M layer (which handles
> > > only GPA->HPA);
> > >   o As a shadowing approach, the Xen IOMMU driver needs to know both
> > > the (IOVA->GPA) and (GPA->HPA) info to update the (IOVA->HPA) mapping
> > > whenever either one changes. So a new interface is required for the
> > > Qemu vIOMMU to propagate (IOVA->GPA) info into the Xen hypervisor,
> > > where it may need to be further cached.
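Regarding the new hypercall in b): what Qemu would register is essentially
a (device, GPA-of-1st-level-root) pair, something like the sketch below.
All names are invented for illustration; this is not an existing Xen ABI,
and the exact hypercall vehicle is deliberately left open.

    #include <stdint.h>

    /* Hypothetical payload: the Qemu vIOMMU tells Xen which guest 1st-level
     * table root a given device's context entry should point to, so Xen can
     * program the corresponding pIOMMU context entry in nested mode. */
    struct viommu_bind_gva_table {
        uint32_t sbdf;        /* segment/bus/device/function of the device */
        uint32_t flags;
        uint64_t gpa_root;    /* GPA of the guest 1st-level table root     */
    };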
> > >
> > > ----
> > >
> > > After writing down the above details, it looks clear that putting the
> > > vIOMMU in Qemu is not a clean design for a) and c). For b) the
> > > hypervisor change is not that hacky, but b) alone does not seem a
> > > strong enough reason to pursue the Qemu path. It seems we may have to
> > > go with a hypervisor-based approach...
> > >
> > > Anyway, I'll stop here. With the above background, let's see whether
> > > others have a better thought on how to accelerate the time-to-market
> > > (TTM) of those usages in Xen. Xen was once a leading hypervisor for
> > > many new features, but recently it has not been keeping up. If the
> > > above usages can be enabled decoupled from the HVMlite/virtual_root_port
> > > effort, then we can have a staged plan to move faster (first for HVM,
> > > later for HVMLite). :-)
> >
> > I dislike that we are in this situation, but I am glad to see that I am
> > not the only one who thinks that the current situation is unsustainable.
> >
> > The problem is that things were hacked up in the past on the assumption
> > that qemu could deal with everything like this.  Later, performance
> > sucked sufficiently that bits of qemu were moved back up into the
> > hypervisor, which is why the vIOAPIC is currently located there.  The
> > result is a completely tangled rat's nest.
> >
> >
> > Xen has 3 common uses for qemu, which are:
> > 1) Emulation of legacy devices
> > 2) PCI Passthrough
> > 3) PV backends

4) Mediated passthrough as for XenGT

> >
> > 3 isn't really relevant here.  For 1, we are basically just using Qemu
> > to provide an LPC implementation (with some populated slots for
> > disk/network devices).
> >
> > I think it would be far cleaner to re-engineer the current Xen/qemu
> > interaction to more closely resemble real hardware, including
> > considering having multiple vIOAPICs/vIOMMUs/etc when architecturally
> > appropriate.  I expect that it would be a far cleaner interface to use
> > and extend.  I also realise that this isn't a simple task I am
> > suggesting, but I don't see any other viable way out.

Could you give some examples of why the current Xen/qemu interface is not
good, and why moving the root port into Xen would make it far cleaner?

> >
> > Another issue in the mix is support for multiple device emulators, in
> > which case Xen is already performing first-level redirection of MMIO
> > requests.
> >
> > For HVMLite, there is specifically no qemu, and we need something which
> > can function when we want PCI Passthrough to work.  I am quite confident
> > that the correct solution here is to have a basic host bridge/root port
> > implementation in Xen (we already have 80% of this), at which point we
> > don't need any qemu interaction for PCI Passthrough at all, even for HVM
> > guests.
> >
> > From this perspective, it would make sense to have emulators map IOVAs,
> > not GPAs.  We already have mapcache_invalidate infrastructure to flush
> > mappings as they are changed by the guest.
> >
> >
> > For the HVMLite side of things, my key concern is not to try and do any
> > development which we realistically expect to have to undo/change.  As
> > you said yourself, we are struggling to sustain, and really aren't
> > helping ourselves by doing lots of work, and subsequently redoing it
> > when it doesn't work; PVH is the most obvious recent example here.
> >
> > If others agree, I think that it is well worth making some concrete
> > plans for improvements in this area for Xen 4.8.  I think the only
> > viable way forward is to try and get out of the current hole we are in.
> >
> > Thoughts?  (especially Stefano/Anthony)
> 
> Going back to the beginning of the discussion, whether we should enable
> Q35 in QEMU or not is a distraction: of course we should enable it, but
> even with Q35 in QEMU, it might not be a good idea to place the vIOMMU
> emulation there.
> 
> I agree with Andrew that the current model is flawed: the boundary
> between Xen and QEMU emulation is not clear enough. In addition, using
> QEMU on Xen introduces latency and security issues (the work to run QEMU
> as non-root and using unprivileged interfaces is not complete yet).
> 
> I think of QEMU as a provider of complex, high level emulators, such as
> the e1000, Cirrus VGA, SCSI controllers, etc., which don't necessarily
> need to be fast.

Earlier you said Qemu imposes security issues. Here you say Qemu can
still provide complex emulators. Does that mean the security issue in Qemu
comes only from the parts which should be moved into Xen? Could you
elaborate?

> 
> For core x86 components, such as the vIOMMU, for performance and ease of
> integration with the rest of the hypervisor, it seems to me that Xen
> would be the right place to implement them. As a comparison, I would
> certainly argue in favor of implementing vSMMU in the hypervisor on ARM.
> 

After some internal discussion with Tianyu/Eddie, I realized my earlier
description is incomplete: it takes only passthrough devices into
consideration (as you saw, it's mainly about the interaction between the
vIOMMU and the pIOMMU). However, from the guest's p.o.v., all devices
should be covered by the vIOMMU to match today's physical platforms,
including:

1) DMA-capable virtual device in Qemu, in Dom0 user space
2) PV devices, in Dom0 kernel space
3) Passthrough devices, in Xen hypervisor

A natural implementation is to have the vIOMMU live wherever the DMA is
emulated, which ends up with multiple vIOMMUs in multiple layers:

1) vIOMMU in Dom0 user
2) vIOMMU in Dom0 kernel
3) vIOMMU in Xen hypervisor

Of course we may come up with an option to keep all the vIOMMUs in the Xen
hypervisor, which however means every vDMA operation in Qemu or in a PV
backend driver needs to issue a Xen hypercall to get the vIOMMU's approval.
I haven't thought through how big/complex this issue is, but at first
glance it does look like a limitation.
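To make the overhead concern concrete: if the vIOMMU for emulated devices
lived in Xen, each vDMA access in Qemu would look roughly like the sketch
below (everything here is hypothetical; the translate call stands in for a
hypercall that does not exist today):

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t addr_t;

    /* Hypothetical stand-in for a hypercall asking Xen's vIOMMU to
     * translate/authorize an IOVA on behalf of a virtual device. */
    static int xen_viommu_translate(uint32_t domid, uint32_t sbdf,
                                    addr_t iova, addr_t *gpa_out)
    {
        (void)domid; (void)sbdf;
        *gpa_out = iova;              /* real version: trap into Xen */
        return 0;
    }

    static int vdev_dma_read(uint32_t domid, uint32_t sbdf,
                             addr_t iova, void *buf, size_t len)
    {
        addr_t gpa;

        /* One hypercall per DMA (or per cached translation) -- this is
         * the cost being worried about above. */
        if (xen_viommu_translate(domid, sbdf, iova, &gpa))
            return -1;

        /* ...then map/copy from gpa as Qemu does today... */
        (void)buf; (void)len; (void)gpa;
        return 0;
    }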

So we'll likely have to consider the presence of multiple vIOMMUs, each in
a different layer, regardless of whether the root complex is in Qemu or
Xen. There need to be some interface abstractions to allow the vIOMMUs and
the root complex to communicate with each other. Well, not an easy task...

In the meantime, we are confirming internally whether, as an intermediate
step, we can compose a virtual platform with only some devices covered by
the vIOMMU while other devices are not. There is no physical platform like
that, and it may break guest OS assumptions. But it could make a staged
plan viable if such a configuration is possible.
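One data point on the partial-coverage idea: the ACPI DMAR table format
does let a remapping unit cover only an explicit list of devices (a DRHD
without the INCLUDE_PCI_ALL flag enumerates device-scope entries), so such
a configuration is at least expressible to the guest; whether guest OSes
handle it gracefully is exactly what we need to confirm. Roughly (layout
paraphrased from the VT-d spec, so treat the field details as approximate):

    #include <stdint.h>

    /* DRHD (DMA Remapping Hardware Unit Definition) header, paraphrased
     * from the VT-d spec. If bit 0 of flags (INCLUDE_PCI_ALL) is clear,
     * only the devices listed in the trailing device-scope entries sit
     * behind this remapping unit. */
    #pragma pack(push, 1)
    struct dmar_drhd {
        uint16_t type;          /* 0 = DRHD                    */
        uint16_t length;
        uint8_t  flags;         /* bit 0: INCLUDE_PCI_ALL      */
        uint8_t  reserved;
        uint16_t segment;       /* PCI segment number          */
        uint64_t reg_base;      /* register base address       */
        /* followed by variable-length device-scope entries    */
    };
    #pragma pack(pop)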

> 
> However, the issue is the PCI root-complex, which today is in QEMU. I
> don't think it is a particularly bad fit there, although I can also see
> the benefit of moving it to the hypervisor. It is relevant here if it
> causes problems for implementing the vIOMMU in Xen.
> 
> From a software engineering perspective, it would be nice to keep the
> two projects (implementing vIOMMU and moving the PCI root complex to
> Xen) separate, especially given that the PCI root complex one is without
> an owner and a timeline. I don't think it is fair to ask Tianyu or Kevin
> to move the PCI root complex from QEMU to Xen in order to enable vIOMMU
> on Xen systems.
> 
> If vIOMMU in Xen and root complex in QEMU cannot be made to work
> together, then we are at an impasse. I cannot see any good way forward
> unless somebody volunteers to start working on the PCI root complex
> project soon to provide Kevin and Tianyu with a branch to base their
> work upon.
> 

 

