
Re: [Xen-devel] Xen virtual IOMMU high level design doc V2



On 10/21/2016 4:36 AM, Andrew Cooper wrote:



255 vcpus support requires x2apic, and Linux disables x2apic mode if
there is no interrupt remapping function, which is provided by the vIOMMU.
The interrupt remapping function is needed to deliver interrupts when
#vcpu > 255.

This is only a requirement for xapic interrupt sources.  x2apic
interrupt sources already deliver correctly.

The key is the APIC ID. There is no modification to the existing PCI MSI
and IOAPIC formats with the introduction of x2apic. PCI MSI/IOAPIC can
only send an interrupt message containing an 8-bit APIC ID, which cannot
address >255 cpus. Interrupt remapping supports a 32-bit APIC ID, so it
is necessary for enabling >255 cpus with x2apic mode.

If the LAPIC is in x2apic mode while interrupt remapping is disabled, the
IOAPIC cannot deliver interrupts to all cpus in the system if #cpu > 255.

After spending a long time reading up on this, my first observation is
that it is very difficult to find consistent information concerning the
expected content of MSI address/data fields for x86 hardware.  Having
said that, this has been very educational.

It is now clear that any MSI message can either specify an 8 bit APIC ID
directly, or request for the message to be remapped.  Apologies for my
earlier confusion.

Never mind. I will describe this in more detail in the following version.
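
For now, a rough sketch of the two MSI address formats (the helper names
below are purely illustrative; the bit layout is as described in the
SDM/VT-d spec):

#include <stdbool.h>
#include <stdint.h>

/*
 * Compatibility format (no remapping):
 *   bits 31:20  0xFEE
 *   bits 19:12  Destination ID  (only 8 bits -> at most 255 APIC IDs)
 *   bit      3  Redirection Hint
 *   bit      2  Destination Mode
 *
 * Remappable format (VT-d interrupt remapping):
 *   bits 31:20  0xFEE
 *   bits 19:5   Handle[14:0]  (index into the Interrupt Remapping Table)
 *   bit      4  Interrupt Format = 1 (remappable)
 *   bit      3  SubHandle Valid
 *   bit      2  Handle[15]
 * The IRTE itself carries a full 32-bit destination APIC ID, which is
 * what allows delivery to >255 vcpus in x2apic mode.
 */
static inline bool msi_addr_is_remappable(uint32_t addr)
{
    return addr & (1u << 4);
}

static inline uint8_t msi_addr_compat_dest(uint32_t addr)
{
    return (addr >> 12) & 0xff;
}

static inline uint16_t msi_addr_remap_handle(uint32_t addr)
{
    return ((addr >> 5) & 0x7fff) | (((addr >> 2) & 1) << 15);
}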




3 Xen hypervisor
==========================================================================


3.1 New hypercall XEN_SYSCTL_viommu_op
This hypercall should also support pv IOMMU, which is still under RFC
review. This section only covers the non-pv part.

1) Definition of "struct xen_sysctl_viommu_op" as the new hypercall
parameter.

Why did you choose sysctl?  As these are per-domain, domctl would be a
more logical choice.  However, neither of these should be usable by
Qemu, and we are trying to split out "normal qemu operations" into dmops
which can be safely deprivileged.


Do you know the current status of dmop? I just found some discussions
about its design on the mailing list. Could we use domctl first and move
to dmop when it's ready?

I believe Paul is looking into respinning the series early in the 4.9 dev
cycle.  I expect it won't take long until they are submitted.

OK, I got it. Thanks for the information.





Definition of VIOMMU subops:
#define XEN_SYSCTL_viommu_query_capability            0
#define XEN_SYSCTL_viommu_create                      1
#define XEN_SYSCTL_viommu_destroy                     2
#define XEN_SYSCTL_viommu_dma_translation_for_vpdev   3

Definition of VIOMMU capabilities:
#define XEN_VIOMMU_CAPABILITY_l1_translation          (1 << 0)
#define XEN_VIOMMU_CAPABILITY_l2_translation          (1 << 1)
#define XEN_VIOMMU_CAPABILITY_interrupt_remapping     (1 << 2)
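
(Purely as an illustration, and not part of the design doc itself: based
on the subop descriptions in 2) below, the hypercall parameter could look
something like the sketch here. All field names are guesses.)

struct xen_sysctl_viommu_op {
    uint32_t cmd;                    /* XEN_SYSCTL_viommu_* subop        */
    uint32_t domid;                  /* target domain                    */
    union {
        struct {
            uint64_t capabilities;   /* OUT: XEN_VIOMMU_CAPABILITY_* set */
        } query_capability;
        struct {
            uint64_t capabilities;   /* IN:  capabilities to enable      */
            uint64_t base_address;   /* IN:  guest register base address */
        } create;
        /* destroy needs no arguments beyond domid */
        struct {
            uint32_t vsbdf;          /* IN:  virtual PCI seg/bus/dev/fn  */
            uint64_t iova;           /* IN:  address to translate        */
            uint64_t translated_gpa; /* OUT: resulting GPA               */
            uint64_t addr_mask;      /* OUT: translation granularity     */
            uint32_t permission;     /* OUT: read/write permission       */
        } dma_translation_for_vpdev;
    } u;
};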

How are vIOMMUs going to be modelled to guests?  On real hardware, they
all seem to end up associated with a PCI device of some sort, even if it
is just the LPC bridge.


This design just considers one vIOMMU which has all PCI devices under its
specified PCI segment; the "INCLUDE_PCI_ALL" bit of the DRHD structure is
set for the vIOMMU.

Even if the first implementation only supports a single vIOMMU, please
design the interface to cope with multiple.  It will save someone having
to go and break the API/ABI in the future when support for multiple
vIOMMUs is needed.

OK, got it.




How do we deal with multiple vIOMMUs in a single guest?

For multi-vIOMMU support, we need to add a new field in the struct
iommu_op to designate the device scope of each vIOMMU if they are under
the same PCI segment. This also requires changes to the DMAR table.




2) Design for subops
- XEN_SYSCTL_viommu_query_capability
      Get vIOMMU capabilities (l1/l2 translation and interrupt
remapping).

- XEN_SYSCTL_viommu_create
      Create a vIOMMU in the Xen hypervisor, with dom_id, capabilities
and register base address as parameters.

- XEN_SYSCTL_viommu_destroy
      Destroy the vIOMMU in the Xen hypervisor, with dom_id as parameter.

- XEN_SYSCTL_viommu_dma_translation_for_vpdev
      Translate an IOVA to a GPA for the specified virtual PCI device,
with dom_id, the PCI device's BDF and the IOVA as parameters; the Xen
hypervisor returns the translated GPA, address mask and access
permission.
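
(Again illustration only: a caller of the create subop might then look
like the following, reusing the hypothetical struct sketched earlier;
do_viommu_op() stands in for whatever routine actually issues the
hypercall.)

static int viommu_create_for_domain(uint32_t domid, uint64_t reg_base)
{
    struct xen_sysctl_viommu_op op = {
        .cmd   = XEN_SYSCTL_viommu_create,
        .domid = domid,
    };

    op.u.create.capabilities = XEN_VIOMMU_CAPABILITY_l2_translation |
                               XEN_VIOMMU_CAPABILITY_interrupt_remapping;
    op.u.create.base_address = reg_base;

    return do_viommu_op(&op);
}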


3.2 l2 translation
1) For virtual PCI devices
The dummy xen-vIOMMU in Qemu translates the IOVA to the target GPA via the
new hypercall when a DMA operation happens.

2) For physical PCI devices
DMA operations go through the physical IOMMU directly, and the IO page
table for IOVA->HPA should be loaded into the physical IOMMU. When the
guest updates the l2 Page-table pointer field, it provides an IO page
table for IOVA->GPA. The vIOMMU needs to shadow the l2 translation table,
translate GPA->HPA and set the l2 Page-table pointer in the physical
IOMMU's context entry to the shadow page table (IOVA->HPA).

How are you proposing to do this shadowing?  Do we need to trap and
emulate all writes to the vIOMMU pagetables, or is there a better way to
know when the mappings need invalidating?

No, we don't need to trap all writes to the IO page table.
From VT-d spec 6.1: "Reporting the Caching Mode as Set for the
virtual hardware requires the guest software to explicitly issue
invalidation operations on the virtual hardware for any/all updates to
the guest remapping structures. The virtualizing software may trap these
guest invalidation operations to keep the shadow translation structures
consistent to guest translation structure modifications, without
resorting to other less efficient techniques."
So any update of the IO page table will be followed by an invalidation
operation, and we use those operations to do the shadowing.
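
To sketch the idea (pseudo-code only; guest_l2_walk(), shadow_set_entry(),
shadow_clear_entry(), p2m_lookup_gpa() and flush_piommu_iotlb() are
illustrative stand-ins, not existing Xen interfaces):

/* Handle a trapped page-selective invalidation from the guest. */
static int vvtd_handle_psi(struct domain *d, uint16_t did,
                           uint64_t iova, unsigned int order)
{
    uint64_t addr, end = iova + ((uint64_t)PAGE_SIZE << order);
    uint64_t gpa, hpa;

    for ( addr = iova; addr < end; addr += PAGE_SIZE )
    {
        /* Re-read the guest's IOVA->GPA mapping for this page. */
        if ( guest_l2_walk(d, did, addr, &gpa) )
        {
            /* Mapping was removed by the guest: drop it from the shadow. */
            shadow_clear_entry(d, did, addr);
            continue;
        }

        /* Translate GPA->HPA and install IOVA->HPA in the shadow table. */
        hpa = p2m_lookup_gpa(d, gpa);
        shadow_set_entry(d, did, addr, hpa);
    }

    /* Finally issue the real invalidation on the physical IOMMU. */
    return flush_piommu_iotlb(d, iova, order);
}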

Ok.  That is helpful.

So, the guest makes some updates, and requests an invalidation.  This
traps into Xen, and we presumably re-shadow all state from fresh?

We may expose the PSI (Page Selective Invalidation) capability, so the
guest will just invalidate the associated page entries rather than
everything.

 We can then presumably send a synchronous invalidation request to Qemu at
this point?

In our design, Qemu does not cache the IOTLB, so no invalidation request
is sent to Qemu. But I am still considering how to deal with in-flight
DMA when there is an invalidation request, just as you mentioned.


How long is this likely to take?  Reshadowing all DMA and Interrupt
remapping tables sounds very expensive.




Now all PCI devices in the same hvm domain share one IO page table
(GPA->HPA) in the physical IOMMU driver of Xen. To support l2
translation in the vIOMMU, the IOMMU driver needs to support multiple
address spaces per device entry: use the existing IO page table
(GPA->HPA) by default, and switch to the shadow IO page table
(IOVA->HPA) when the l2 translation function is enabled. These changes
will not affect the current P2M logic.
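
(Sketch of that switch, for illustration; viommu_l2_enabled() and
viommu_shadow_pgd() are hypothetical, and dom_iommu()/pgd_maddr are only
meant to refer to the existing per-domain IOMMU page-table root, modulo
the exact names in the tree.)

static paddr_t context_entry_pgd(const struct domain *d,
                                 const struct pci_dev *pdev)
{
    const struct domain_iommu *hd = dom_iommu(d);

    if ( viommu_l2_enabled(d, pdev) )
        return viommu_shadow_pgd(d, pdev);  /* shadow IOVA->HPA table */

    return hd->arch.pgd_maddr;              /* default GPA->HPA table */
}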

3.3 Interrupt remapping
Interrupts from virtual devices and physical devices will be delivered
to the vlapic from the vIOAPIC and vMSI code. We need to add interrupt
remapping hooks in vmsi_deliver() and ioapic_deliver() to find the target
vlapic according to the interrupt remapping table.
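
(Rough sketch of such a hook; struct viommu_irte, viommu_lookup_irte()
and vmsi_deliver_irte() are hypothetical names, not existing code.)

static void vmsi_deliver_remapped(struct domain *d, uint16_t handle)
{
    struct viommu_irte irte;

    /* Look up the guest-programmed Interrupt Remapping Table Entry. */
    if ( viommu_lookup_irte(d, handle, &irte) )
        return;  /* Not present: raise a vIOMMU fault instead of delivering. */

    /*
     * The IRTE carries the full destination APIC ID, so delivery goes
     * through a helper that takes more than the 8-bit destination the
     * current vmsi_deliver() accepts.
     */
    vmsi_deliver_irte(d, &irte);
}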


3.4 l1 translation
When nested translation is enabled, any address generated by l1
translation is used as the input address for nesting with l2
translation. The physical IOMMU needs to enable both l1 and l2
translation in nested translation mode (GVA->GPA->HPA) for the
passthrough device.

All these l1 and l2 translations are getting confusing.  Could we
perhaps call them guest translation and host translation, or is that
likely to cause other problems?

Definitions of l1 and l2 translation from the VT-d spec:
first-level translation remaps a virtual address to an intermediate
(guest) physical address;
second-level translation remaps an intermediate physical address to a
machine (host) physical address.
So "guest" and "host" translation may not be suitable names for them?

True, but what is also confusing is that what was previously the only
level of translation is now l2.

So long as it is clearly stated somewhere in the code and/or feature doc
which address spaces each level translates between (i.e. l1 translates
linear addresses into gfns, and l2 translates gfns into mfns), it will
probably be ok.

Sure, we should rename the current level of translation to l2 translation
when introducing l1 translation, to align with the VT-d spec.


Can I recommend that you make use of the TYPE_SAFE() infrastructure to
make concrete, disparate types for any new translation functions, to
make them harder to accidentally get wrong.
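
(For example, following the existing mfn_t/gfn_t pattern; "dva" below is
just an illustrative name for a device (IO) virtual address, and the two
prototypes are hypothetical.)

TYPE_SAFE(uint64_t, dva);    /* device (IO) virtual address */

/* The translation helpers then cannot silently mix up address spaces: */
gfn_t viommu_l1_translate(struct domain *d, dva_t dva);   /* dva -> gfn */
mfn_t viommu_l2_translate(struct domain *d, gfn_t gfn);   /* gfn -> mfn */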

Ok. I got it.





The VT-d context entry points to the guest l1 translation table, which
will be nested-translated by the l2 translation table, so it can be
linked directly into the context entry of the physical IOMMU.

To enable l1 translation in a VM:
1) The Xen IOMMU driver enables nested translation mode.
2) Update the GPA root of the guest l1 translation table in the context
entry of the physical IOMMU.

All handling is in the hypervisor, with no interaction with Qemu.
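
(Illustration only: in very simplified form, step 2) amounts to something
like the sketch below. The structure is a stand-in, not the real VT-d
extended-context-entry layout or existing Xen code.)

struct ext_context_entry {
    uint64_t neste:1;       /* Nesting Enable                           */
    uint64_t flpt_root;     /* root of the guest l1 (first-level) tables,
                               the GPA root taken from the guest        */
    uint64_t slpt_root;     /* second-level (l2) page-table pointer     */
    /* ... other fields omitted ... */
};

static void viommu_enable_l1(struct ext_context_entry *ce,
                             uint64_t guest_l1_root, paddr_t host_l2_root)
{
    ce->flpt_root = guest_l1_root;  /* 2) GPA root of the guest l1 table */
    ce->slpt_root = host_l2_root;   /* existing GPA->HPA (l2) table      */
    ce->neste     = 1;              /* 1) enable nested translation mode */
}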


3.5 Implementation consideration
The VT-d spec doesn't define a capability bit for l2 translation, so
architecturally there is no way to tell the guest that the l2 translation
capability is not available. The Linux Intel IOMMU driver assumes l2
translation is always available when VT-d exists, and fails to load
without l2 translation support even if interrupt remapping and l1
translation are available. So l2 translation needs to be enabled first,
before the other functions.

What then is the purpose of the nested translation support bit in the
extended capability register?

It's to translate the output GPA from first-level translation
(IOVA->GPA) to HPA.

For details please see VT-d spec - 3.8 Nested Translation:
"When Nesting Enable (NESTE) field is 1 in extended-context-entries,
requests-with-PASID translated through first-level translation are also
subjected to nested second-level translation. Such extended-context-
entries contain both the pointer to the PASID-table (which contains the
pointer to the first-level translation structures), and the pointer to
the second-level translation structures."

I didn't phrase my question very well.  I understand what the nested
translation bit means, but I don't understand why we have a problem
signalling the presence or lack of nested translations to the guest.

In other words, why can't we hide l2 translation from the guest by
simply clearing the nested translation capability?

You mean to indicate no support for l2 translation via the nested
translation bit? But nested translation is a different function from l2
translation, even from the guest's point of view, and nested translation
only applies to requests-with-PASID (l1 translation).

The Linux Intel IOMMU driver enables l2 translation unconditionally and
frees the IOMMU instance when it fails to enable l2 translation.


 

