
Re: [Xen-devel] Discussion about virtual iommu support for Xen guest



> From: Andrew Cooper [mailto:andrew.cooper3@xxxxxxxxxx]
> Sent: Friday, June 03, 2016 2:59 AM
> 
> On 02/06/16 16:03, Lan, Tianyu wrote:
> > On 5/27/2016 4:19 PM, Lan Tianyu wrote:
> >> On 26/05/16 19:35, Andrew Cooper wrote:
> >>> On 26/05/16 09:29, Lan Tianyu wrote:
> >>>
> >>> To be viable going forwards, any solution must work with PVH/HVMLite as
> >>> much as HVM.  This alone negates qemu as a viable option.
> >>>
> >>> From a design point of view, having Xen needing to delegate to qemu to
> >>> inject an interrupt into a guest seems backwards.
> >>>
> >>
> >> Sorry, I am not familiar with HVMlite. HVMlite doesn't use Qemu, so
> >> the Qemu virtual iommu can't work for it. We would have to rewrite
> >> the virtual iommu in Xen, right?
> >>
> >>>
> >>> A whole lot of this would be easier to reason about if/when we get a
> >>> basic root port implementation in Xen, which is necessary for HVMLite,
> >>> and which will make the interaction with qemu rather more clean.  It is
> >>> probably worth coordinating work in this area.
> >>
> >> The virtual iommu should also sit under the basic root port in Xen, right?
> >>
> >>>
> >>> As for the individual issue of 288vcpu support, there are already
> >>> issues
> >>> with 64vcpu guests at the moment. While it is certainly fine to remove
> >>> the hard limit at 255 vcpus, there is a lot of other work required to
> >>> even get 128vcpu guests stable.
> >>
> >>
> >> Could you give some pointers to these issues? We are enabling support
> >> for more vcpus, and it can basically boot up with 255 vcpus without IR
> >> support. It would be very helpful to learn about the known issues.
> >>
> >> We will also add more tests for 128 vcpus into our regular testing to
> >> find related bugs. Increasing the max vcpu count to 255 should be a
> >> good start.
> >
> > Hi Andrew:
> > Could you give more input about the issues with 64 vcpus and what needs
> > to be done to make 128-vcpu guests stable? We hope to do something to
> > improve the situation.
> >
> > What's the progress of the PCI host bridge in Xen? In your opinion, we
> > should do that first, right? Thanks.
> 
> Very sorry for the delay.
> 
> There are multiple interacting issues here.  On the one side, it would
> be useful if we could have a central point of coordination on
> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> would you mind organising that?
> 
> For the qemu/xen interaction, the current state is woeful and a tangled
> mess.  I wish to ensure that we don't make any development decisions
> which makes the situation worse.
> 
> In your case, the two motivations are quite different; I would recommend
> dealing with them independently.
> 
> IIRC, the issue with more than 255 cpus and interrupt remapping is that
> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
> can't be programmed to generate x2apic interrupts?  In principle, if you
> don't have an IOAPIC, are there any other issues to be considered?  What
> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
> deliver xapic interrupts?

The key is the APIC ID. The introduction of x2apic did not change the
existing PCI MSI and IOAPIC message formats: a PCI MSI/IOAPIC message can
only carry an 8-bit APIC ID, which cannot address more than 255 cpus.
Interrupt remapping supports a 32-bit APIC ID, so it is required to
enable more than 255 cpus in x2apic mode.

If the LAPIC is in x2apic mode while interrupt remapping is disabled, the
IOAPIC cannot deliver interrupts to all cpus in the system when there are
more than 255 of them.
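
To make the width difference concrete, here is a rough sketch
(illustrative only, not Xen code; the layouts are abridged from my
reading of the specs, so please double-check against the SDM and the
VT-d spec):

#include <stdint.h>

/* In xAPIC compatibility format the MSI address carries the destination
 * APIC ID in bits 19:12, i.e. only 8 bits, and the IOAPIC RTE
 * destination field is likewise 8 bits wide. */
#define MSI_ADDR_DEST_ID_SHIFT  12
#define MSI_ADDR_DEST_ID_MASK   0xffu          /* only 8 bits available */

static inline uint8_t msi_compat_dest(uint32_t msi_addr)
{
    /* Cannot express an APIC ID above 255. */
    return (msi_addr >> MSI_ADDR_DEST_ID_SHIFT) & MSI_ADDR_DEST_ID_MASK;
}

/* Abridged remapped-format IRTE: the destination is a full 32 bits,
 * which is what makes x2apic IDs > 255 reachable. */
struct irte_sketch {
    uint64_t present:1, fpd:1, dm:1, rh:1, tm:1, dlm:3, avail:4,
             rsvd1:3, im:1, vector:8, rsvd2:8,
             dest_id:32;                       /* 32-bit APIC ID */
    uint64_t sid:16, sq:2, svt:2, rsvd3:44;
};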

> 
> On the other side of things, what is IGD passthrough going to look like
> in Skylake?  Is there any device-model interaction required (i.e. the
> opregion), or will it work as a completely standalone device?  What are
> your plans with the interaction of virtual graphics and shared virtual
> memory?
> 

The plan is to use a so-called universal pass-through driver in the
guest, which only accesses standard PCI resources (no opregion, PCH/MCH,
etc.).

----
Here is a brief summary of the potential usages relying on vIOMMU:

a) enable >255 vcpus on Xeon Phi, which is the initial purpose of this
thread. It requires the interrupt remapping capability to be present on
the vIOMMU;

b) support guest SVM (Shared Virtual Memory), which relies on the
1st-level translation table capability (GVA->GPA) on the vIOMMU. The
pIOMMU needs to enable both 1st-level and 2nd-level translation in
nested mode (GVA->GPA->HPA) for the passthrough device. IGD passthrough
is the main usage today (to support the OpenCL 2.0 SVM feature); in the
future SVM might be used by other I/O devices too;

c) support VFIO-based user space drivers (e.g. DPDK) in the guest,
which rely on the 2nd-level translation capability (IOVA->GPA) on the
vIOMMU. The pIOMMU 2nd level becomes a shadow of the vIOMMU 2nd level,
with GPA replaced by HPA (so it holds IOVA->HPA); a rough sketch of the
two translation chains is below.
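
A rough sketch of both chains (invented helper names, not real Xen
interfaces). In b) the pIOMMU hardware walks both levels itself, GVA->GPA
via the guest 1st-level table and then GPA->HPA via the 2nd level, so
nothing needs shadowing. In c) only the 2nd level is usable, so Xen has
to compose the two mappings in software and keep the result up to date:

#include <stdint.h>

struct domain;                                         /* opaque here */
uint64_t viommu_2nd_level_lookup(struct domain *d, uint64_t iova);
                                          /* IOVA->GPA, hypothetical  */
uint64_t p2m_lookup_hpa(struct domain *d, uint64_t gpa);
                                          /* GPA->HPA, hypothetical   */

/* What the shadow pIOMMU 2nd-level entry has to contain for case c);
 * it must be recomputed whenever either input mapping changes. */
static uint64_t shadow_iova_to_hpa(struct domain *d, uint64_t iova)
{
    uint64_t gpa = viommu_2nd_level_lookup(d, iova);
    return p2m_lookup_hpa(d, gpa);
}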

----
And below are my thoughts on the viability of implementing the vIOMMU in
Qemu:

a) enable >255 vcpus:

        o Enable Q35 in Qemu-Xen;
        o Add interrupt remapping in Qemu vIOMMU;
        o Virtual interrupt injection in the hypervisor needs to know the
virtual interrupt remapping (IR) structure, since IR sits behind
vIOAPIC/vMSI. This requires new hypervisor interfaces, as Andrew pointed
out:
                * either for the hypervisor to query the IR info from
Qemu, which is not good;
                * or for Qemu to register the IR info with the hypervisor,
which means partial IR knowledge is implemented in the hypervisor (then
why not put the whole IR emulation in Xen?); a sketch of such a
registration interface is below.
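
For the second option, the registration interface might look roughly
like this (purely hypothetical naming and layout, just to show the shape
of the information Qemu would have to hand over):

#include <stdint.h>

/* Hypothetical sketch only -- no such interface exists today.  Qemu
 * would tell Xen where the guest's interrupt remapping table lives so
 * that virtual interrupt injection can turn a remappable vIOAPIC RTE /
 * vMSI message into (vector, destination vcpu) by itself. */
struct xen_dm_op_set_viommu_irt {
    uint64_t irta_gpa;     /* guest-physical base of the virtual IRT  */
    uint32_t nr_entries;   /* number of 128-bit IRTEs                 */
    uint32_t flags;        /* e.g. whether x2apic (EIME) is enabled   */
};

At that point Xen is already parsing guest IRTEs on every injection,
which is why doing the whole IR emulation in Xen looks simpler.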

b) support SVM

        o Enable Q35 in Qemu-Xen;
        o Add 1st level translation capability in Qemu vIOMMU;
        o The VT-d context entry points to the guest 1st-level translation
table, which is nest-translated through the 2nd-level translation table,
so the vIOMMU structure can be linked in directly. This means:
                * the Xen IOMMU driver enables nested mode;
                * a new hypercall is introduced so the Qemu vIOMMU can
register the GPA root of the guest 1st-level translation table, which
is then written into the context entry of the pIOMMU (a sketch of such
a hypercall is below);
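
The new hypercall in the last bullet might carry something like the
following (hypothetical name and fields, only to show what has to flow
from the Qemu vIOMMU into the pIOMMU context entry):

#include <stdint.h>

/* Hypothetical sketch only. */
struct xen_dm_op_set_l1_root {
    uint32_t sbdf;            /* passthrough device the table is for  */
    uint32_t pasid;           /* if/when PASID-granular SVM is needed */
    uint64_t flptr_gpa;       /* GPA root of the guest GVA->GPA table */
    uint32_t address_width;   /* guest address width                  */
    uint32_t flags;
};

Xen would validate this and program the pIOMMU context entry in nested
mode with that root, while the 2nd level (GPA->HPA) stays under Xen's
control.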

c) support VFIO-based user space driver

        o Enable Q35 in Qemu-Xen;
        o Leverage the existing 2nd-level translation implementation in
the Qemu vIOMMU;
        o Change the Xen IOMMU code to support (IOVA->HPA) translation,
which means decoupling the current logic from the P2M layer (which only
handles GPA->HPA);
        o Since this is a shadowing approach, the Xen IOMMU driver needs
to know both the (IOVA->GPA) and (GPA->HPA) mappings in order to update
the (IOVA->HPA) mapping whenever either one changes. So a new interface
is required for the Qemu vIOMMU to propagate (IOVA->GPA) info into the
Xen hypervisor, where it may need to be cached (see the sketch below).
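
The propagation interface in the last bullet could be a simple map/unmap
notification (again, purely hypothetical naming):

#include <stdint.h>

/* Hypothetical sketch only.  The Qemu vIOMMU would issue this whenever
 * the guest maps or unmaps an IOVA; Xen then looks up GPA->HPA in the
 * P2M and updates the shadow (IOVA->HPA) entry in the pIOMMU. */
struct xen_dm_op_viommu_map_notify {
    uint32_t sbdf;          /* device the mapping belongs to           */
    uint32_t flags;         /* map vs. unmap, read/write permissions   */
    uint64_t iova;
    uint64_t gpa;           /* ignored for unmap                       */
    uint64_t size;
};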

----

After writing down the details above, it looks clear that putting the
vIOMMU in Qemu is not a clean design for a) and c). For b) the hypervisor
change is not that hacky, but b) alone does not seem a strong enough
reason to pursue the Qemu path. It seems we may have to go with a
hypervisor-based approach...

Anyway, I'll stop here. With the above background, let's see whether
others have better thoughts on how to accelerate the time-to-market of
these usages in Xen. Xen was once a leading hypervisor for many new
features, but recently it has not kept up that pace. If the above usages
can be enabled decoupled from the HVMlite/virtual_root_port effort, then
we can have a staged plan and move faster (first for HVM, later for
HVMlite). :-)

Thanks
Kevin

 

