Re: [Xen-devel] [RFC] ARM PCI Passthrough design document
On Fri, 26 May 2017, Julien Grall wrote:
> Hi all,
>
> The document below is an RFC version of a design proposal for PCI
> Passthrough in Xen on ARM. It aims to describe from a high-level perspective
> the interactions with the different subsystems and how guests will be able
> to discover and access PCI.
>
> Currently on ARM, Xen does not have any knowledge about PCI devices. This
> means that the IOMMU and interrupt controllers (such as the ITS) requiring
> specific configuration will not work with PCI even for DOM0.
>
> The PCI Passthrough work can be divided into 2 phases:
> * Phase 1: Register all PCI devices in Xen => will allow
> to use ITS and SMMU with PCI in Xen
> * Phase 2: Assign devices to guests
>
> This document aims to describe the 2 phases, but for now only phase
> 1 is fully described.
>
>
> I think I was able to gather all of the feedback and come up with a solution
> that will satisfy all the parties. The design document has changed quite a lot
> compared to the early draft sent a few months ago. The major changes are:
> * Provide more details on how PCI works on ARM and the interactions with
> MSI controller and IOMMU
> * Provide details on the existing host bridge implementations
> * Give more explanation and justifications on the approach chosen
> * Describe the hypercalls used and how they should be called
>
> Feedback is welcome.
>
> Cheers,
Hi Julien,
I think this document is a very good first step in the right direction
and I fully agree with the approaches taken here.
I noticed a couple of grammar errors that I pointed out below.
> --------------------------------------------------------------------------------
>
> % PCI pass-through support on ARM
> % Julien Grall <julien.grall@xxxxxxxxxx>
> % Draft B
>
> # Preface
>
> This document aims to describe the components required to enable PCI
> pass-through on ARM.
>
> This is an early draft and some questions are still unanswered. When this is
> the case, the text will contain XXX.
>
> # Introduction
>
> PCI pass-through allows the guest to receive full control of physical PCI
> devices. This means the guest will have full and direct access to the PCI
> device.
>
> Xen on ARM supports a kind of guest that exploits the hardware virtualization
> support as much as possible. The guest will rely on PV drivers only
> for I/O (e.g block, network) and interrupts will come through the virtualized
> interrupt controller, therefore there are no big changes required within the
> kernel.
>
> As a consequence, it would be possible to replace PV drivers by assigning real
> devices to the guest for I/O access. Xen on ARM would therefore be able to
> run unmodified operating systems.
>
> To achieve this goal, it looks more sensible to go towards emulating the
> host bridge (there will be more details later). A guest would be able to take
> advantage of the firmware tables, obviating the need for a specific driver
> for Xen.
>
> Thus, in this document we follow the emulated host bridge approach.
>
> # PCI terminologies
>
> Each PCI device under a host bridge is uniquely identified by its Requester ID
> (AKA RID). A Requester ID is a triplet of Bus number, Device number, and
> Function.
>
> When the platform has multiple host bridges, the software can add a fourth
> number called Segment (sometimes called Domain) to differentiate host bridges.
> A PCI device will then be uniquely identified by segment:bus:device:function
> (AKA SBDF).
>
> So given a specific SBDF, it would be possible to find the host bridge and the
> RID associated with a PCI device. The pair (host bridge, RID) will often be used
> to find the relevant information for configuring the different subsystems (e.g
> IOMMU, MSI controller). For convenience, the rest of the document will use
> SBDF to refer to the pair (host bridge, RID).
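>
> For illustration, a RID can be packed/unpacked with the usual layout (bus in
> bits [15:8], device in bits [7:3], function in bits [2:0]); the helpers below
> are made up for this document and are not an existing Xen API:
>
> typedef struct {
>     uint16_t seg;   /* segment, a software concept */
>     uint16_t rid;   /* Requester ID: bus[15:8] device[7:3] function[2:0] */
> } sbdf_example_t;
>
> #define PCI_RID(bus, dev, fn) \
>     ((uint16_t)((((bus) & 0xff) << 8) | (((dev) & 0x1f) << 3) | ((fn) & 0x7)))
> #define PCI_RID_BUS(rid)  (((rid) >> 8) & 0xff)
> #define PCI_RID_DEV(rid)  (((rid) >> 3) & 0x1f)
> #define PCI_RID_FN(rid)   ((rid) & 0x7)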
>
> # PCI host bridge
>
> A PCI host bridge enables data transfer between the host processor and PCI bus
> based devices. The bridge is used to access the configuration space of each
> PCI device and, on some platforms, may also act as an MSI controller.
>
> ## Initialization of the PCI host bridge
>
> Whilst it would be expected that the bootloader takes care of initializing
> the PCI host bridge, on some platforms it is done in the Operating System.
>
> This may include enabling/configuring the clocks that could be shared among
> multiple devices.
>
> ## Accessing PCI configuration space
>
> Accessing the PCI configuration space can be divided into 2 categories:
> * Indirect access, where the configuration spaces are multiplexed. An
> example would be the legacy method on x86 (e.g 0xcf8 and 0xcfc). On ARM a
> similar method is used by the PCIe RCar root complex (see [12]).
> * ECAM access, where each configuration space has its own address space.
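>
> For ECAM, the offset of a function's configuration space within the window is
> fixed by the specification, so a minimal sketch of an accessor (assuming a
> quirk-free bridge and that the window is already mapped at 'ecam_base') could
> look like:
>
> static uint32_t ecam_read32(void *ecam_base, uint8_t bus, uint8_t dev,
>                             uint8_t fn, uint16_t reg)
> {
>     /* bus << 20 | device << 15 | function << 12 | register offset */
>     uint32_t offset = ((uint32_t)bus << 20) | ((uint32_t)dev << 15) |
>                       ((uint32_t)fn << 12) | (reg & 0xffc);
>
>     return *(volatile uint32_t *)((char *)ecam_base + offset);
> }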
>
> Whilst ECAM is a standard, some PCI host bridges will require specific fiddling
> when accessing the registers (see thunder-ecam [13]).
>
> In most cases, accessing all the PCI configuration spaces under a given PCI
> host bridge will be done the same way (i.e either indirect access or ECAM
> access). However, there are a few cases, dependent on the PCI devices accessed,
> which will use different methods (see thunder-pem [14]).
>
> ## Generic host bridge
>
> For the purpose of this document, the term "generic host bridge" will be used
> to describe any ECAM-compliant host bridge whose initialization, if required,
> has already been done by the firmware/bootloader.
>
> # Interaction of the PCI subsystem with other subsystems
>
> In order to have a PCI device fully working, Xen will need to configure
> other subsystems such as the IOMMU and the Interrupt Controller.
>
> The interaction expected between the PCI subsystem and the other subsystems
> is:
> * Add a device
> * Remove a device
> * Assign a device to a guest
> * Deassign a device from a guest
>
> XXX: Detail the interaction when assigning/deassigning device
>
> In the following subsections, the interactions will be briefly described from a
> high-level perspective. However, implementation details such as callbacks,
> structures, etc. are beyond the scope of this document.
>
> ## IOMMU
>
> The IOMMU will be used to isolate the PCI device when accessing memory (e.g
> DMA and MSI doorbells). Often the IOMMU will be configured using a MasterID
> (aka StreamID for the ARM SMMU) that can be deduced from the SBDF with the help
> of the firmware tables (see below).
>
> Whilst in theory all the memory transactions issued by a PCI device should
> go through the IOMMU, on certain platforms some of the memory transactions may
> not reach the IOMMU because they are interpreted by the host bridge. For
> instance, this could happen if the MSI doorbell is built into the PCI host
> bridge or for P2P traffic. See [6] for more details.
>
> XXX: I think this could be solved by using direct mapping (e.g GFN == MFN),
> this would mean the guest memory layout would be similar to the host one when
> PCI devices are passed through => Detail it.
>
> ## Interrupt controller
>
> PCI supports three kinds of interrupts: legacy interrupts, MSI and MSI-X. On ARM,
> legacy interrupts will be mapped to SPIs. MSI and MSI-X will write their
> payload to a doorbell belonging to an MSI controller.
>
> ### Existing MSI controllers
>
> In this section some of the existing controllers and their interaction with
> the devices will be briefly described. More details can be found in the
> respective specifications of each MSI controller.
>
> MSIs can be distinguished by some combination of
> * the Doorbell
> It is the MMIO address written to. Devices may be configured by
> software to write to arbitrary doorbells which they can address.
> An MSI controller may feature a number of doorbells.
> * the Payload
> Devices may be configured to write an arbitrary payload chosen by
> software. MSI controllers may have restrictions on permitted payload.
> Xen will have to sanitize the payload unless it is known to be always
> safe.
> * Sideband information accompanying the write
> Typically this is neither configurable nor probeable, and depends on
> the path taken through the memory system (i.e it is a property of the
> combination of MSI controller and device rather than a property of
> either in isolation).
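>
> As an illustration only (the structure below does not exist in Xen), the
> per-MSI information can be summarised as:
>
> struct msi_info_example {
>     uint64_t doorbell;  /* MMIO address the device is programmed to write to */
>     uint32_t payload;   /* value written; Xen may have to sanitize it        */
>     /*
>      * The sideband information (e.g DeviceID/StreamID) is not carried in the
>      * write itself; it is deduced from the (host bridge, RID) pair via the
>      * firmware tables.
>      */
> };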
>
> ### GICv3/GICv4 ITS
>
> The Interrupt Translation Service (ITS) is an MSI controller designed by ARM
> and integrated in the GICv3/GICv4 interrupt controller. For the specification
> see [GICV3]. Each MSI/MSI-X will be mapped to a new type of interrupt called
> LPI. This interrupt will be configured by the software using a pair (DeviceID,
> EventID).
>
> A platform may have multiple ITS blocks (e.g one per NUMA node); each of them
> belongs to an ITS group.
>
> The DeviceID is a unique identifier within an ITS group for each MSI-capable
> device; it can be deduced from the RID with the help of the firmware tables
> (see below).
>
> The EventID is a unique identifier to distinguish the different events sent
> by a device.
>
> The MSI payload will only contain the EventID as the DeviceID will be added
> afterwards by the hardware in a way that will prevent any tampering.
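>
> Conceptually, the translation can be modelled as below (this is an
> illustration of the architecture, not actual code): each DeviceID selects a
> per-device Interrupt Translation Table and the EventID indexes it to find the
> LPI.
>
> struct its_device_example {
>     uint32_t deviceid;
>     uint32_t nr_events;
>     uint32_t *itt;       /* EventID -> LPI mapping owned by this device */
> };
>
> static uint32_t its_translate_example(const struct its_device_example *dev,
>                                       uint32_t eventid)
> {
>     if ( eventid >= dev->nr_events )
>         return 0;        /* unmapped event: no LPI is generated */
>     return dev->itt[eventid];
> }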
>
> The [SBSA] appendix I describes the set of rules for the integration of the
^ redundant I
> ITS that any compliant platform should follow. Some of the rules explain
> the security implications of misbehaving devices. It ensures that a guest
> will never be able to trigger an MSI on behalf of another guest.
>
> XXX: The security implication is described in the [SBSA] but I haven't found
> any similar wording in the GICv3 specification. It is unclear to me whether
> non-SBSA compliant platforms (e.g embedded) will follow those rules.
>
> ### GICv2m
>
> The GICv2m is an extension of the GICv2 to convert MSI/MSI-X writes to unique
> interrupts. The specification can be found in the [SBSA] appendix E.
>
> Depending on the platform, the GICv2m will provide one or multiple instances
> of register frames. Each frame is composed of a doorbell and associated with
> a set of SPIs that can be discovered by reading the register MSI_TYPER.
>
> On an MSI write, the payload will contain the SPI ID to generate. Note that
> on some platforms the MSI payload may contain an offset from the base SPI
> rather than the SPI itself.
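>
> For reference, a sketch of how the SPI range of a frame could be discovered,
> assuming the register layout used by the Linux GICv2m driver (base SPI in
> MSI_TYPER bits [25:16], number of SPIs in bits [9:0]):
>
> #define V2M_MSI_TYPER      0x008
> #define V2M_MSI_SETSPI_NS  0x040   /* the doorbell register */
>
> static void v2m_frame_range_example(void *frame_base,
>                                     uint32_t *base_spi, uint32_t *nr_spis)
> {
>     uint32_t typer =
>         *(volatile uint32_t *)((char *)frame_base + V2M_MSI_TYPER);
>
>     *base_spi = (typer >> 16) & 0x3ff;
>     *nr_spis  = typer & 0x3ff;
> }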
>
> The frame will only generate an SPI if the written value corresponds to an SPI
> allocated to the frame. Each VM should have exclusive access to the frame to ensure
^ exclusive access ?
> isolation and prevent a guest OS from triggering an MSI on behalf of another
> guest OS.
>
> XXX: Linux seems to consider GICv2m as unsafe by default. From my understanding,
> it is still unclear how we should proceed on Xen, as GICv2m should be safe
> as long as the frame is only accessed by one guest.
It seems to me that you are right
> ### Other MSI controllers
>
> Servers compliant with SBSA level 1 and higher will have to use either the ITS
> or the GICv2m. However, these are by no means the only MSI controllers available.
> A hardware vendor may decide to use a custom MSI controller which can be
> integrated in the PCI host bridge.
>
> Whether it will be possible to write an MSI securely will depend on the
> MSI controller implementation.
>
> XXX: I am happy to give a brief explanation of more MSI controllers (such
> as Xilinx and Renesas) if people think it is necessary.
>
> This design document does not pertain to a specific MSI controller and will try
> to be as agnostic as possible. When possible, it will give insight into how to
> integrate the MSI controller.
>
> # Information available in the firmware tables
>
> ## ACPI
>
> ### Host bridges
>
> The static table MCFG (see 4.2 in [1]) will describe the host bridges available
> at boot that support ECAM. Unfortunately, there are platforms out there
> (see [2]) that re-use MCFG to describe host bridges that are not fully ECAM
> compatible.
>
> This means that Xen needs to account for possible quirks in the host bridge.
> The Linux community are working on a patch series for this, see [2] and [3],
> where quirks will be detected with:
> * OEM ID
> * OEM Table ID
> * OEM Revision
> * PCI Segment
> * PCI bus number range (wildcard allowed)
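>
> A sketch of what a quirk entry could look like in Xen, loosely modelled on the
> Linux proposal (the names below are made up for illustration):
>
> struct mcfg_quirk_example {
>     char oem_id[7];         /* 6 characters + NUL, as in the ACPI table header */
>     char oem_table_id[9];   /* 8 characters + NUL                              */
>     uint32_t oem_revision;
>     uint16_t segment;
>     uint8_t bus_start;
>     uint8_t bus_end;
>     /* Non-standard accessor used instead of the plain ECAM one */
>     uint32_t (*cfg_read)(uint16_t seg, uint8_t bus, uint8_t devfn,
>                          uint16_t reg, uint8_t len);
> };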
>
> Based on what Linux is currently doing, there are two kinds of quirks:
> * Accesses to the configuration space of certain sizes are not allowed
> * A specific driver is necessary for driving the host bridge
>
> The former is straightforward to solve but the latter will require more thought.
> Instantiation of a specific driver for the host controller can be easily done
> if Xen has the information to detect it. However, those drivers may require
> resources described in ASL (see [4] for instance).
>
> The number of platforms requiring a specific PCI host bridge driver is currently
> limited. Whilst it is not possible to predict the future, it is expected that
> upcoming platforms will have fully ECAM-compliant PCI host bridges. Therefore,
> given Xen does not have any ASL parser, the approach suggested is to hardcode
> the missing values. This could be revisited in the future if necessary.
>
> ### Finding information to configure IOMMU and MSI controller
>
> The static table [IORT] will provide information that will help to deduce
> data (such as MasterID and DeviceID) to configure both the IOMMU and the MSI
> controller from a given SBDF.
>
> ## Finding which NUMA node a PCI device belongs to
>
> On NUMA systems, the NUMA node associated with a PCI device can be found using
> the _PXM method of the host bridge (?).
>
> XXX: I am not entirely sure where the _PXM will be (i.e host bridge vs PCI
> device).
>
> ## Device Tree
>
> ### Host bridges
>
> Each Device Tree node associated with a host bridge will have at least the
> following properties (see bindings in [8]):
> - device_type: will always be "pci".
> - compatible: a string indicating which driver to instantiate
>
> The node may also contain optional properties such as:
> - linux,pci-domain: assign a fixed segment number
> - bus-range: indicate the range of bus numbers supported
>
> When the property linux,pci-domain is not present, the operating system would
> have to allocate the segment number for each host bridge.
>
> ### Finding information to configure IOMMU and MSI controller
>
> ### Configuring the IOMMU
>
> The Device Tree provides a generic IOMMU binding (see [10]) which uses the
> properties "iommu-map" and "iommu-map-mask" to describe the relationship
> between a RID and a MasterID.
>
> These properties will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding MasterID.
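>
> The translation described by "iommu-map" (and by "msi-map" below) is a simple
> linear remapping. A sketch, assuming the entries have already been parsed from
> the Device Tree (the structure is made up for illustration):
>
> /* One parsed entry of "iommu-map"/"msi-map": <rid-base parent out-base length> */
> struct rid_map_entry_example {
>     uint32_t rid_base;
>     uint32_t out_base;   /* MasterID base (iommu-map) or DeviceID base (msi-map) */
>     uint32_t length;
> };
>
> /* 'rid' is expected to have been masked with iommu-map-mask/msi-map-mask. */
> static bool rid_to_output_example(const struct rid_map_entry_example *map,
>                                   unsigned int nr, uint32_t rid, uint32_t *out)
> {
>     for ( unsigned int i = 0; i < nr; i++ )
>     {
>         if ( rid >= map[i].rid_base && rid - map[i].rid_base < map[i].length )
>         {
>             *out = map[i].out_base + (rid - map[i].rid_base);
>             return true;
>         }
>     }
>     return false;
> }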
>
> Note that the ARM SMMU also has a legacy binding (see [9]), but it does not
> have a way to describe the relationship between RID and StreamID. Instead it
> is assumed that StreamID == RID. This binding has now been deprecated in favor
> of the generic IOMMU binding.
>
> ### Configuring the MSI controller
>
> The relationship between the RID and data required to configure the MSI
> controller (such as DeviceID) can be found using the property "msi-map"
> (see [11]).
>
> This property will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding DeviceID.
>
> ## Finding which NUMA node a PCI device belongs to
>
> On NUMA systems, the NUMA node associated with a PCI device can be found using
> the property "numa-node-id" (see [15]) present in the host bridge Device Tree
> node.
>
> # Discovering PCI devices
>
> Whilst PCI devices are currently available in the hardware domain, the
> hypervisor does not have any knowledge of them. The first step of supporting
> PCI pass-through is to make Xen aware of the PCI devices.
>
> Xen will require access to the PCI configuration space to retrieve information
> for the PCI devices or access it on behalf of the guest via the emulated
> host bridge.
>
> This means that Xen should be in charge of controlling the host bridge. However,
> for some host controllers, this may be difficult to implement in Xen because of
> dependencies on other components (e.g clocks, see more details in the "PCI host
> bridge" section).
>
> For this reason, the approach chosen in this document is to let the hardware
> domain discover the host bridges, scan the PCI devices and then report
> everything to Xen. This does not rule out the possibility of doing everything
> without the help of the hardware domain in the future.
>
> ## Who is in charge of the host bridge?
>
> There are numerous host bridge implementations on ARM. Some of them require a
> specific driver as they cannot be driven by a generic host bridge driver.
> Porting those drivers may be complex due to dependencies on other components.
>
> This could be seen as a signal to leave the host bridge drivers in the hardware
> domain. Because Xen would need to access the configuration space, all the
> accesses would have to be forwarded to the hardware domain which in turn would
> access the hardware.
>
> In this design document, we are considering that the host bridge driver can
> be ported to Xen. In the case where it is not possible, an interface to forward
> configuration space accesses would need to be defined. The interface details
> are out of scope.
>
> ## Discovering and registering host bridge
>
> The approach taken in the document will require communication between Xen and
> the hardware domain. In this case, they would need to agree on the segment
> number associated with a host bridge. However, this number is not available in
> the Device Tree case.
>
> The hardware domain will register new host bridges using the existing hypercall
> PHYSDEVOP_pci_mmcfg_reserved:
>
> #define XEN_PCI_MMCFG_RESERVED 1
>
> struct physdev_pci_mmcfg_reserved {
>     /* IN */
>     uint64_t address;
>     uint16_t segment;
>     /* Range of bus supported by the host bridge */
>     uint8_t start_bus;
>     uint8_t end_bus;
>
>     uint32_t flags;
> };
>
> Some of the host bridges may not have a separate configuration address space
> region described in the firmware tables. To simplify the registration, the
> field 'address' should contain the base address of one of the regions
> described in the firmware tables:
> * For ACPI, it would be the base address specified in the MCFG or in the
> _CBA method.
> * For Device Tree, this would be any base address of a region
> specified in the "reg" property.
>
> The field 'flags' is expected to have XEN_PCI_MMCFG_RESERVED set.
>
> It is expected that this hypercall is called before any PCI device is
> registered to Xen.
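>
> For illustration, the hardware domain side could look roughly like the
> following, using the Linux HYPERVISOR_physdev_op() wrapper (error handling
> omitted; 'ecam_base_address', 'segment', 'start_bus' and 'end_bus' are
> placeholders coming from the firmware tables):
>
> struct physdev_pci_mmcfg_reserved r = {
>     .address   = ecam_base_address,   /* from MCFG/_CBA or the "reg" property */
>     .segment   = segment,
>     .start_bus = start_bus,
>     .end_bus   = end_bus,
>     .flags     = XEN_PCI_MMCFG_RESERVED,
> };
>
> rc = HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);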
>
> When the hardware domain is in charge of the host bridge, this hypercall will
> be used to tell Xen about the existence of a host bridge in order to find the
> associated information for configuring the MSI controller and the IOMMU.
>
> ## Discovering and registering PCI devices
>
> The hardware domain will scan the host bridge to find the list of PCI devices
> available and then report it to Xen using the existing hypercall
> PHYSDEVOP_pci_device_add:
>
> #define XEN_PCI_DEV_EXTFN 0x1
> #define XEN_PCI_DEV_VIRTFN 0x2
> #define XEN_PCI_DEV_PXM 0x4
>
> struct physdev_pci_device_add {
>     /* IN */
>     uint16_t seg;
>     uint8_t bus;
>     uint8_t devfn;
>     uint32_t flags;
>     struct {
>         uint8_t bus;
>         uint8_t devfn;
>     } physfn;
>     /*
>      * Optional parameters array.
>      * First element ([0]) is PXM domain associated with the device (if
>      * XEN_PCI_DEV_PXM is set)
>      */
>     uint32_t optarr[0];
> };
>
> When XEN_PCI_DEV_PXM is set in the field 'flags', optarr[0] will contain the
> NUMA node ID associated with the device:
> * For ACPI, it would be the value returned by the method _PXM
> * For Device Tree, this would be the value found in the property
> "numa-node-id".
> For more details see the section "Finding which NUMA node a PCI device belongs
> to" in "ACPI" and "Device Tree".
>
> XXX: I still don't fully understand how XEN_PCI_DEV_EXTFN and XEN_PCI_DEV_VIRTFN
> will work. AFAICT, the former is used when the bus supports ARI and the only
> usage is in the x86 IOMMU code. For the latter, this is related to IOV but I am
> not sure what devfn and physfn.devfn will correspond to.
>
> Note that x86 currently provides two more hypercalls (PHYSDEVOP_manage_pci_add
> and PHYSDEVOP_manage_pci_add_ext) to register PCI devices. However they are a
> subset of the hypercall PHYSDEVOP_pci_device_add. Therefore, it is suggested
> to leave them unimplemented on ARM.
>
> ## Removing PCI devices
>
> The hardware domain will be in charge of telling Xen that a device has been
> removed using the existing hypercall PHYSDEVOP_pci_device_remove:
>
> struct physdev_pci_device {
>     /* IN */
>     uint16_t seg;
>     uint8_t bus;
>     uint8_t devfn;
> };
>
> Note that x86 currently provides one more hypercall (PHYSDEVOP_manage_pci_remove)
> to remove PCI devices. However it does not allow passing a segment number.
> Therefore it is suggested to leave it unimplemented on ARM.
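>
> Similarly, an illustrative removal call (placeholders as before):
>
> struct physdev_pci_device device = {
>     .seg   = segment,
>     .bus   = bus,
>     .devfn = devfn,
> };
>
> rc = HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_remove, &device);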
>
> # Glossary
>
> ECAM: Enhanced Configuration Access Mechanism
> SBDF: Segment Bus Device Function. The segment is a software concept.
> MSI: Message Signaled Interrupt
> MSI doorbell: MMIO address written to by a device to generate an MSI
> SPI: Shared Peripheral Interrupt
> LPI: Locality-specific Peripheral Interrupt
> ITS: Interrupt Translation Service
>
> # Specifications
> [SBSA] ARM-DEN-0029 v3.0
> [GICV3] IHI0069C
> [IORT] DEN0049B
>
> # Bibliography
>
> [1] PCI firmware specification, rev 3.2
> [2] https://www.spinics.net/lists/linux-pci/msg56715.html
> [3] https://www.spinics.net/lists/linux-pci/msg56723.html
> [4] https://www.spinics.net/lists/linux-pci/msg56728.html
> [6] https://www.spinics.net/lists/kvm/msg140116.html
> [7] http://www.firmware.org/1275/bindings/pci/pci2_1.pdf
> [8] Documentation/devicetree/bindings/pci
> [9] Documentation/devicetree/bindings/iommu/arm,smmu.txt
> [10] Documentation/devicetree/bindings/pci/pci-iommu.txt
> [11] Documentation/devicetree/bindings/pci/pci-msi.txt
> [12] drivers/pci/host/pcie-rcar.c
> [13] drivers/pci/host/pci-thunder-ecam.c
> [14] drivers/pci/host/pci-thunder-pem.c
> [15] Documentation/devicetree/bindings/numa.txt
>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel