
Re: [Xen-devel] [RFC] ARM PCI Passthrough design document



On Fri, 26 May 2017, Julien Grall wrote:
> Hi all,
> 
> The document below is an RFC version of a design proposal for PCI
> Passthrough in Xen on ARM. It aims to describe, from a high-level perspective,
> the interaction with the different subsystems and how guests will be able
> to discover and access PCI.
> 
> Currently on ARM, Xen does not have any knowledge about PCI devices. This
> means that IOMMU and interrupt controller (such as ITS) requiring specific
> configuration will not work with PCI even with DOM0.
> 
> The PCI Passthrough work could be divided in 2 phases:
>         * Phase 1: Register all PCI devices in Xen => will allow
>                    to use ITS and SMMU with PCI in Xen
>         * Phase 2: Assign devices to guests
> 
> This document aims to describe the 2 phases, but for now only phase
> 1 is fully described.
> 
> 
> I think I was able to gather all of the feedback and come up with a solution
> that will satisfy all the parties. The design document has changed quite a lot
> compared to the early draft sent a few months ago. The major changes are:
>       * Provide more details on how PCI works on ARM and the interactions with
>       the MSI controller and IOMMU
>       * Provide details on the existing host bridge implementations
>       * Give more explanation and justifications on the approach chosen 
>       * Describing the hypercalls used and how they should be called
> 
> Feedback is welcome.
> 
> Cheers,

Hi Julien,

I think this document is a very good first step in the right direction
and I fully agree with the approaches taken here.

I noticed a couple of grammar errors, which I pointed out below.


> --------------------------------------------------------------------------------
> 
> % PCI pass-through support on ARM
> % Julien Grall <julien.grall@xxxxxxxxxx>
> % Draft B
> 
> # Preface
> 
> This document aims to describe the components required to enable PCI
> pass-through on ARM.
> 
> This is an early draft and some questions are still unanswered. When this is
> the case, the text will contain XXX.
> 
> # Introduction
> 
> PCI pass-through allows the guest to receive full control of physical PCI
> devices. This means the guest will have full and direct access to the PCI
> device.
> 
> Xen on ARM supports a kind of guest that exploits the virtualization support
> in hardware as much as possible. The guest will rely on PV drivers only
> for I/O (e.g block, network), and interrupts will come through the virtualized
> interrupt controller; therefore there are no big changes required within the
> kernel.
> 
> As a consequence, it would be possible to replace PV drivers by assigning real
> devices to the guest for I/O access. Xen on ARM would therefore be able to
> run unmodified operating systems.
> 
> To achieve this goal, it looks more sensible to go towards emulating the
> host bridge (there will be more details later). A guest would be able to take
> advantage of the firmware tables, obviating the need for a specific driver
> for Xen.
> 
> Thus, in this document we follow the emulated host bridge approach.
> 
> # PCI terminology
> 
> Each PCI device under a host bridge is uniquely identified by its Requester ID
> (AKA RID). A Requester ID is a triplet of Bus number, Device number, and
> Function.
> 
> When the platform has multiple host bridges, the software can add a fourth
> number called Segment (sometimes called Domain) to differentiate host bridges.
> A PCI device will then be uniquely identified by segment:bus:device:function
> (AKA SBDF).
> 
> So given a specific SBDF, it would be possible to find the host bridge and the
> RID associated to a PCI device. The pair (host bridge, RID) will often be used
> to find the relevant information for configuring the different subsystems (e.g
> IOMMU, MSI controller). For convenience, the rest of the document will use
> SBDF to refer to the pair (host bridge, RID).
> 
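As a small aside for readers less familiar with the encoding, here is a minimal
sketch of how an SBDF relates to the RID, assuming the usual 8-bit bus, 5-bit
device, 3-bit function split (the names below are illustrative, not a proposed
Xen interface):

#include <stdint.h>

/* Illustrative only: the segment is a software concept identifying the host
 * bridge, while the RID is the 16-bit bus/device/function triplet seen by
 * the hardware (IOMMU, MSI controller). */
typedef struct {
    uint16_t segment;   /* one per host bridge, assigned by software */
    uint8_t  bus;
    uint8_t  devfn;     /* device in bits [7:3], function in bits [2:0] */
} sbdf_t;

static inline uint16_t sbdf_to_rid(sbdf_t sbdf)
{
    return ((uint16_t)sbdf.bus << 8) | sbdf.devfn;
}
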
> # PCI host bridge
> 
> A PCI host bridge enables data transfer between a host processor and PCI
> bus-based devices. The bridge is used to access the configuration space of each
> PCI device and, on some platforms, may also act as an MSI controller.
> 
> ## Initialization of the PCI host bridge
> 
> Whilst it would be expected that the bootloader takes care of initializing
> the PCI host bridge, on some platforms it is done in the Operating System.
> 
> This may include enabling/configuring the clocks that could be shared among
> multiple devices.
> 
> ## Accessing PCI configuration space
> 
> Accessing the PCI configuration space can be divided into 2 categories:
>     * Indirect access, where the configuration spaces are multiplexed. An
>     example would be the legacy method on x86 (e.g 0xcf8 and 0xcfc). On ARM a
>     similar method is used by PCIe RCar root complex (see [12]).
>     * ECAM access, each configuration space will have its own address space.
> 
> Whilst ECAM is a standard, some PCI host bridges will require specific
> fiddling when accessing the registers (see thunder-ecam [13]).
> 
> In most cases, accessing all the PCI configuration spaces under a
> given PCI host will be done the same way (i.e either indirect access or ECAM
> access). However, there are a few cases, dependent on the PCI devices accessed,
> which will use different methods (see thunder-pem [14]).
> 
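As a side note on the ECAM case: the offset of each function's configuration
space is fixed by the specification (4KB per function, bus in bits [27:20],
device/function in bits [19:12]), so a generic access path essentially boils
down to the sketch below, with 'base' being the window described by the
firmware (an illustration only, not the quirk handling mentioned above):

#include <stdint.h>

/* ECAM: compute the address of a configuration space register. */
static inline volatile void *ecam_cfg_addr(volatile uint8_t *base, uint8_t bus,
                                           uint8_t devfn, uint16_t reg)
{
    return base + (((uint32_t)bus << 20) |
                   ((uint32_t)devfn << 12) |
                   (reg & 0xfff));
}
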
> ## Generic host bridge
> 
> For the purpose of this document, the term "generic host bridge" will be used
> to describe any ECAM-compliant host bridge whose initialization, if required,
> has already been done by the firmware/bootloader.
> 
> # Interaction of the PCI subsystem with other subsystems
> 
> In order to have a PCI device fully working, Xen will need to configure
> other subsystems such as the IOMMU and the Interrupt Controller.
> 
> The interactions expected between the PCI subsystem and the other subsystems
> are:
>     * Add a device
>     * Remove a device
>     * Assign a device to a guest
>     * Deassign a device from a guest
> 
> XXX: Detail the interaction when assigning/deassigning device
> 
> In the following subsections, the interactions will be briefly described from
> a higher level perspective. However, implementation details such as callbacks,
> structures, etc. are beyond the scope of this document.
> 
> ## IOMMU
> 
> The IOMMU will be used to isolate the PCI device when accessing the memory
> (e.g DMA and MSI Doorbells). Often the IOMMU will be configured using a MasterID
> (aka StreamID for ARM SMMU)  that can be deduced from the SBDF with the help
> of the firmware tables (see below).
> 
> Whilst in theory, all the memory transactions issued by a PCI device should
> go through the IOMMU, on certain platforms some of the memory transactions may
> not reach the IOMMU because they are interpreted by the host bridge. For
> instance, this could happen if the MSI doorbell is built into the PCI host
> bridge or for P2P traffic. See [6] for more details.
> 
> XXX: I think this could be solved by using direct mapping (e.g GFN == MFN),
> this would mean the guest memory layout would be similar to the host one when
> PCI devices are passed through => Detail it.
> 
> ## Interrupt controller
> 
> PCI supports three kinds of interrupts: legacy interrupts, MSI and MSI-X. On
> ARM, legacy interrupts will be mapped to SPIs. MSI and MSI-X will write their
> payload in a doorbell belonging to an MSI controller.
> 
> ### Existing MSI controllers
> 
> In this section some of the existing controllers and their interaction with
> the devices will be briefly described. More details can be found in the
> respective specifications of each MSI controller.
> 
> MSIs can be distinguished by some combination of
>     * the Doorbell
>         It is the MMIO address written to. Devices may be configured by
>         software to write to arbitrary doorbells which they can address.
>         An MSI controller may feature a number of doorbells.
>     * the Payload
>         Devices may be configured to write an arbitrary payload chosen by
>         software. MSI controllers may have restrictions on permitted payload.
>         Xen will have to sanitize the payload unless it is known to be always
>         safe.
>     * Sideband information accompanying the write
>         Typically this is neither configurable nor probeable, and depends on
>         the path taken through the memory system (i.e it is a property of the
>         combination of MSI controller and device rather than a property of
>         either in isolation).
> 
> ### GICv3/GICv4 ITS
> 
> The Interrupt Translation Service (ITS) is an MSI controller designed by ARM
> and integrated in the GICv3/GICv4 interrupt controller. For the specification
> see [GICV3]. Each MSI/MSI-X will be mapped to a new type of interrupt called
> LPI. This interrupt will be configured by the software using a pair (DeviceID,
> EventID).
> 
> A platform may have multiple ITS blocks (e.g one per NUMA node), each of them
> belonging to an ITS group.
> 
> The DeviceID is a unique identifier within an ITS group for each MSI-capable
> device that can be deduced from the RID with the help of the firmware tables
> (see below).
> 
> The EventID is a unique identifier to distinguish the different events sent
> by a device.
> 
> The MSI payload will only contain the EventID as the DeviceID will be added
> afterwards by the hardware in a way that will prevent any tampering.
> 
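To make the (DeviceID, EventID) pair a bit more concrete: conceptually the ITS
keeps one interrupt translation table per DeviceID, and the EventID indexes
into it to find the LPI to raise. A purely illustrative model of that lookup
(this is not the actual ITS command interface or table layout):

#include <stdint.h>

/* Illustrative model: one Interrupt Translation Table (ITT) per DeviceID,
 * indexed by EventID, yielding the LPI to inject. */
struct its_device {
    uint32_t deviceid;    /* deduced from the RID via the firmware tables */
    uint32_t nr_events;
    uint32_t *itt;        /* itt[eventid] == LPI number */
};

static inline uint32_t its_translate(const struct its_device *dev,
                                     uint32_t eventid)
{
    /* Out-of-range EventIDs must not generate an interrupt. */
    return (eventid < dev->nr_events) ? dev->itt[eventid] : 0;
}
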
> The [SBSA] appendix I describes the set of rules for the integration of the
                      ^ redundant I


> ITS that any compliant platform should follow. Some of the rules will explain
> the security implications of a misbehaving device. They ensure that a guest
> will never be able to trigger an MSI on behalf of another guest.
> 
> XXX: The security implication is described in the [SBSA] but I haven't found
> any similar wording in the GICv3 specification. It is unclear to me if
> non-SBSA compliant platforms (e.g embedded) will follow those rules.
> 
> ### GICv2m
> 
> The GICv2m is an extension of the GICv2 to convert MSI/MSI-X writes to unique
> interrupts. The specification can be found in the [SBSA] appendix E.
> 
> Depending on the platform, the GICv2m will provide one or multiple instances
> of register frames. Each frame is composed of a doorbell and is associated with
> a set of SPIs that can be discovered by reading the register MSI_TYPER.
> 
> On an MSI write, the payload will contain the SPI ID to generate. Note that
> on some platforms the MSI payload may contain an offset from the base SPI
> rather than the SPI itself.
> 
> The frame will only generate an SPI if the written value corresponds to an SPI
> allocated to the frame. Each VM should have exclusity to the frame to ensure
                                               ^ exclusive access ?


> isolation and prevent a guest OS from triggering an MSI on behalf of another
> guest OS.
> 
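Regarding the frame only generating SPIs it owns, I would expect the check
Xen (or the hardware) performs to be roughly the one below, with the base SPI
and the number of SPIs coming from MSI_TYPER (names are mine, this is only a
sketch):

#include <stdbool.h>
#include <stdint.h>

/* One GICv2m frame: a doorbell plus a contiguous block of SPIs, as
 * advertised by its MSI_TYPER register. */
struct v2m_frame {
    uint32_t spi_base;    /* first SPI owned by the frame */
    uint32_t nr_spis;     /* number of SPIs owned by the frame */
};

/* A written payload only raises an interrupt if it names an SPI that
 * actually belongs to the frame. */
static bool v2m_payload_valid(const struct v2m_frame *frame, uint32_t payload)
{
    return payload >= frame->spi_base &&
           payload < frame->spi_base + frame->nr_spis;
}
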
> XXX: Linux seems to consider GICv2m as unsafe by default. From my 
> understanding,
> it is still unclear how we should proceed on Xen, as GICv2m should be safe
> as long as the frame is only accessed by one guest.

It seems to me that you are right.


> ### Other MSI controllers
> 
> Servers compliant with SBSA level 1 and higher will have to use either ITS
> or GICv2m. However, these are by no means the only MSI controllers available.
> A hardware vendor may decide to use a custom MSI controller which can be
> integrated in the PCI host bridge.
> 
> Whether it will be possible to write an MSI securely will depend on the
> MSI controller implementation.
> 
> XXX: I am happy to give a brief explanation of more MSI controllers (such
> as Xilinx and Renesas) if people think it is necessary.
> 
> This design document does not pertain to a specific MSI controller and will
> try to be as agnostic as possible. When possible, it will give insight into how
> to integrate the MSI controller.
> 
> # Information available in the firmware tables
> 
> ## ACPI
> 
> ### Host bridges
> 
> The static table MCFG (see 4.2 in [1]) will describe the host bridges available
> at boot that support ECAM. Unfortunately, there are platforms out there
> (see [2]) that re-use MCFG to describe host bridges that are not fully
> ECAM-compatible.
> 
> This means that Xen needs to account for possible quirks in the host bridge.
> The Linux community are working on a patch series for this, see [2] and [3],
> where quirks will be detected with:
>     * OEM ID
>     * OEM Table ID
>     * OEM Revision
>     * PCI Segment
>     * PCI bus number range (wildcard allowed)
> 
> Based on what Linux is currently doing, there are two kinds of quirks:
>     * Accesses to the configuration space of certain sizes are not allowed
>     * A specific driver is necessary for driving the host bridge
> 
> The former is straightforward to solve but the latter will require more 
> thought.
> Instantiation of a specific driver for the host controller can be easily done
> if Xen has the information to detect it. However, those drivers may require
> resources described in ASL (see [4] for instance).
> 
> The number of platforms requiring a specific PCI host bridge driver is
> currently limited. Whilst it is not possible to predict the future, upcoming
> platforms are expected to have fully ECAM-compliant PCI host bridges. Therefore,
> given that Xen does not have any ASL parser, the suggested approach is to
> hardcode the missing values. This could be revisited in the future if necessary.
> 
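For reference, the quirk matching described above could be captured with a
small static table keyed on the MCFG header fields, along the lines of what
the Linux series does. A sketch only (the field names and the pci_cfg_ops type
are assumptions, not an existing interface):

#include <stdint.h>

struct pci_cfg_ops;   /* host-bridge specific configuration space accessors */

/* One quirk entry, matched against the MCFG header and the segment/bus
 * range of the host bridge being probed. */
struct mcfg_quirk {
    char     oem_id[7];         /* 6 characters + NUL terminator */
    char     oem_table_id[9];   /* 8 characters + NUL terminator */
    uint32_t oem_revision;
    uint16_t segment;           /* 0xffff could act as a wildcard */
    uint8_t  bus_start, bus_end;
    const struct pci_cfg_ops *ops;  /* NULL: plain ECAM access is fine */
};
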
> ### Finding information to configure IOMMU and MSI controller
> 
> The static table [IORT] will provide information that will help to deduce
> data (such as MasterID and DeviceID) to configure both the IOMMU and the MSI
> controller from a given SBDF.
> 
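For what it is worth, the IORT ID mappings are simply ranges that rebase an
input ID onto an output ID (RID -> StreamID towards an SMMU, RID -> DeviceID
towards an ITS group), so the lookup Xen needs is essentially the sketch below,
assuming the table has already been parsed into such ranges:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One parsed IORT ID mapping: [input_base, input_base + count) translates
 * to [output_base, ...) on the output node (SMMU or ITS group). */
struct iort_id_mapping {
    uint32_t input_base;
    uint32_t count;
    uint32_t output_base;
};

static bool iort_translate(const struct iort_id_mapping *map, size_t nr,
                           uint32_t rid, uint32_t *out)
{
    for (size_t i = 0; i < nr; i++) {
        if (rid >= map[i].input_base &&
            rid < map[i].input_base + map[i].count) {
            *out = map[i].output_base + (rid - map[i].input_base);
            return true;
        }
    }
    return false;
}
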
> ## Finding which NUMA node a PCI device belongs to
> 
> On a NUMA system, the NUMA node associated with a PCI device can be found using
> the _PXM method of the host bridge (?).
> 
> XXX: I am not entirely sure where the _PXM will be (i.e host bridge vs PCI
> device).
> 
> ## Device Tree
> 
> ### Host bridges
> 
> Each Device Tree node associated with a host bridge will have at least the
> following properties (see bindings in [8]):
>     - device_type: will always be "pci".
>     - compatible: a string indicating which driver to instantiate
> 
> The node may also contain optional properties such as:
>     - linux,pci-domain: assigns a fixed segment number
>     - bus-range: indicates the range of bus numbers supported
> 
> When the property linux,pci-domain is not present, the operating system would
> have to allocate a segment number for each host bridge.
> 
> ### Finding information to configure IOMMU and MSI controller
> 
> ### Configuring the IOMMU
> 
> The Device Tree provides a generic IOMMU binding (see [10]) which uses the
> properties "iommu-map" and "iommu-map-mask" to describe the relationship
> between the RID and the MasterID.
> 
> These properties will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding MasterID.
> 
> Note that the ARM SMMU also has a legacy binding (see [9]), but it does not
> have a way to describe the relationship between RID and StreamID. Instead it
> is assumed that StreamID == RID. This binding has now been deprecated in favor
> of the generic IOMMU binding.
> 
> ### Configuring the MSI controller
> 
> The relationship between the RID and data required to configure the MSI
> controller (such as DeviceID) can be found using the property "msi-map"
> (see [11]).
> 
> This property will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding DeviceID.
> 
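Both "iommu-map" and "msi-map" share the same format: a list of (rid-base,
phandle, output-base, length) entries, with the optional "*-map-mask" applied
to the RID first. So the translation Xen would need boils down to the sketch
below, assuming the property has already been parsed into entries (names are
mine):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One parsed "iommu-map"/"msi-map" entry:
 * (rid-base, target phandle, output-base, length). */
struct dt_id_map_entry {
    uint32_t rid_base;
    uint32_t phandle;       /* the IOMMU or MSI controller node */
    uint32_t output_base;   /* MasterID or DeviceID base */
    uint32_t length;
};

static bool dt_map_rid(const struct dt_id_map_entry *map, size_t nr,
                       uint32_t map_mask, uint32_t rid, uint32_t *out)
{
    uint32_t masked = rid & map_mask;   /* "*-map-mask" defaults to 0xffff */

    for (size_t i = 0; i < nr; i++) {
        if (masked >= map[i].rid_base &&
            masked < map[i].rid_base + map[i].length) {
            *out = map[i].output_base + (masked - map[i].rid_base);
            return true;
        }
    }
    return false;
}
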
> ## Finding which NUMA node a PCI device belongs to
> 
> On a NUMA system, the NUMA node associated with a PCI device can be found using
> the property "numa-node-id" (see [15]) present in the host bridge Device Tree
> node.
> 
> # Discovering PCI devices
> 
> Whilst PCI devices are currently available in the hardware domain, the
> hypervisor does not have any knowledge of them. The first step of supporting
> PCI pass-through is to make Xen aware of the PCI devices.
> 
> Xen will require access to the PCI configuration space to retrieve information
> for the PCI devices or access it on behalf of the guest via the emulated
> host bridge.
> 
> This means that Xen should be in charge of controlling the host bridge.
> However, for some host controllers, this may be difficult to implement in Xen
> because of dependencies on other components (e.g clocks, see more details in
> the "PCI host bridge" section).
> 
> For this reason, the approach chosen in this document is to let the hardware
> domain discover the host bridges, scan the PCI devices and then report
> everything to Xen. This does not rule out the possibility of doing everything
> without the help of the hardware domain in the future.
> 
> ## Who is in charge of the host bridge?
> 
> There are numerous host bridge implementations on ARM. Some of them require
> a specific driver as they cannot be driven by a generic host bridge driver.
> Porting those drivers may be complex due to dependencies on other components.
> 
> This could be seen as a signal to leave the host bridge drivers in the hardware
> domain. Because Xen would need to access the configuration space, all the
> accesses would have to be forwarded to the hardware domain, which in turn would
> access the hardware.
> 
> In this design document, we are considering that the host bridge driver can
> be ported to Xen. In case it is not possible, an interface to forward
> configuration space accesses would need to be defined. The interface details
> are out of scope.
> 
> ## Discovering and registering host bridge
> 
> The approach taken in the document will require communication between Xen and
> the hardware domain. In this case, they would need to agree on the segment
> number associated with a host bridge. However, this number is not available in
> the Device Tree case.
> 
> The hardware domain will register new host bridges using the existing
> hypercall PHYSDEVOP_pci_mmcfg_reserved:
> 
> #define XEN_PCI_MMCFG_RESERVED 1
> 
> struct physdev_pci_mmcfg_reserved {
>     /* IN */
>     uint64_t    address;
>     uint16_t    segment;
>     /* Range of bus supported by the host bridge */
>     uint8_t     start_bus;
>     uint8_t     end_bus;
> 
>     uint32_t    flags;
> }
> 
> Some of the host bridges may not have a separate configuration address space
> region described in the firmware tables. To simplify the registration, the
> field 'address' should contain the base address of one of the regions
> described in the firmware tables.
>     * For ACPI, it would be the base address specified in the MCFG or in the
>     _CBA method.
>     * For Device Tree, this would be any base address of region
>     specified in the "reg" property.
> 
> The field 'flags' is expected to have XEN_PCI_MMCFG_RESERVED set.
> 
> It is expected that this hypercall is called before any PCI device is
> registered to Xen.
> 
> When the hardware domain is in charge of the host bridge, this hypercall will
> be used to tell Xen about the existence of a host bridge in order to find the
> associated information for configuring the MSI controller and the IOMMU.
> 
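Just to make the expected flow concrete, here is a sketch of how the hardware
domain (Linux in this case) might register an ECAM host bridge it has
discovered; the values are made up and the wrapper follows the existing Linux
convention for issuing physdev ops:

#include <asm/xen/hypercall.h>
#include <xen/interface/physdev.h>

static int register_host_bridge(void)
{
    /* Illustrative values: ECAM base taken from the MCFG or the DT "reg"
     * property, segment 0, covering buses 0-255. */
    struct physdev_pci_mmcfg_reserved r = {
        .address   = 0x40000000ULL,
        .segment   = 0,
        .start_bus = 0,
        .end_bus   = 255,
        .flags     = XEN_PCI_MMCFG_RESERVED,
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
}
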
> ## Discovering and registering PCI devices
> 
> The hardware domain will scan the host bridge to find the list of PCI devices
> available and then report it to Xen using the existing hypercall
> PHYSDEVOP_pci_device_add:
> 
> #define XEN_PCI_DEV_EXTFN   0x1
> #define XEN_PCI_DEV_VIRTFN  0x2
> #define XEN_PCI_DEV_PXM     0x4
> 
> struct physdev_pci_device_add {
>     /* IN */
>     uint16_t    seg;
>     uint8_t     bus;
>     uint8_t     devfn;
>     uint32_t    flags;
>     struct {
>         uint8_t bus;
>         uint8_t devfn;
>     } physfn;
>     /*
>      * Optional parameters array.
>      * First element ([0]) is PXM domain associated with the device (if
>      * XEN_PCI_DEV_PXM is set)
>      */
>     uint32_t optarr[0];
> }
> 
> When XEN_PCI_DEV_PXM is set in the field 'flags', optarr[0] will contain the
> NUMA node ID associated with the device:
>     * For ACPI, it would be the value returned by the method _PXM
>     * For Device Tree, this would be the value found in the property
>     "numa-node-id".
> For more details see the section "Finding which NUMA node a PCI device belongs
> to" in "ACPI" and "Device Tree".
> 
> XXX: I still don't fully understand how XEN_PCI_DEV_EXTFN and XEN_PCI_DEV_VIRTFN
> will work. AFAICT, the former is used when the bus supports ARI and the only
> usage is in the x86 IOMMU code. For the latter, this is related to IOV but I am
> not sure what devfn and physfn.devfn will correspond to.
> 
> Note that x86 currently provides two more hypercalls (PHYSDEVOP_manage_pci_add
> and PHYSDEVOP_manage_pci_add_ext) to register PCI devices. However they are
> a subset of the hypercall PHYSDEVOP_pci_device_add. Therefore, it is suggested
> to leave them unimplemented on ARM.
> 
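Similarly, reporting a discovered device together with its NUMA node could
look like the sketch below from the hardware domain side (made-up values; the
wrapper struct is there only to provide storage for the trailing optarr[]):

#include <asm/xen/hypercall.h>
#include <xen/interface/physdev.h>

/* Sketch: report device 0000:01:00.0, located on NUMA node 1, to Xen. */
static int report_pci_device(void)
{
    struct {
        struct physdev_pci_device_add add;
        uint32_t pxm;                      /* storage for optarr[0] */
    } op = {
        .add = {
            .seg   = 0,
            .bus   = 1,
            .devfn = (0 << 3) | 0,         /* device 0, function 0 */
            .flags = XEN_PCI_DEV_PXM,      /* optarr[0] holds the NUMA node */
        },
        .pxm = 1,
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &op.add);
}
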
> ## Removing PCI devices
> 
> The hardware domain will be in charge of telling Xen that a device has been
> removed, using the existing hypercall PHYSDEVOP_pci_device_remove:
> 
> struct physdev_pci_device {
>     /* IN */
>     uint16_t    seg;
>     uint8_t     bus;
>     uint8_t     devfn;
> }
> 
> Note that x86 currently provides one more hypercall (PHYSDEVOP_manage_pci_remove)
> to remove PCI devices. However it does not allow passing a segment number.
> Therefore it is suggested to leave it unimplemented on ARM.
> 
> # Glossary
> 
> ECAM: Enhanced Configuration Mechanism
> SBDF: Segment Bus Device Function. The segment is a software concept.
> MSI: Message Signaled Interrupt
> MSI doorbell: MMIO address written to by a device to generate an MSI
> SPI: Shared Peripheral Interrupt
> LPI: Locality-specific Peripheral Interrupt
> ITS: Interrupt Translation Service
> 
> # Specifications
> [SBSA]  ARM-DEN-0029 v3.0
> [GICV3] IHI0069C
> [IORT]  DEN0049B
> 
> # Bibliography
> 
> [1] PCI firmware specification, rev 3.2
> [2] https://www.spinics.net/lists/linux-pci/msg56715.html
> [3] https://www.spinics.net/lists/linux-pci/msg56723.html
> [4] https://www.spinics.net/lists/linux-pci/msg56728.html
> [6] https://www.spinics.net/lists/kvm/msg140116.html
> [7] http://www.firmware.org/1275/bindings/pci/pci2_1.pdf
> [8] Documents/devicetree/bindings/pci
> [9] Documents/devicetree/bindings/iommu/arm,smmu.txt
> [10] Document/devicetree/bindings/pci/pci-iommu.txt
> [11] Documents/devicetree/bindings/pci/pci-msi.txt
> [12] drivers/pci/host/pcie-rcar.c
> [13] drivers/pci/host/pci-thunder-ecam.c
> [14] drivers/pci/host/pci-thunder-pem.c
> [15] Documents/devicetree/bindings/numa.txt
> 


 

