
Re: [Xen-devel] [RFC] ARM PCI Passthrough design document



On Fri, 26 May 2017, Julien Grall wrote:
> Hi all,
> 
> The document below is an RFC version of a design proposal for PCI
> Passthrough in Xen on ARM. It aims to describe, from a high-level perspective,
> the interaction with the different subsystems and how guests will be able
> to discover and access PCI.
> 
> Currently on ARM, Xen does not have any knowledge about PCI devices. This
> means that IOMMU and interrupt controller (such as ITS) requiring specific
> configuration will not work with PCI even with DOM0.
> 
> The PCI Passthrough work could be divided in 2 phases:
>         * Phase 1: Register all PCI devices in Xen => will allow
>                    to use ITS and SMMU with PCI in Xen
>         * Phase 2: Assign devices to guests
> 
> This document aims to describe the 2 phases, but for now only phase
> 1 is fully described.
> 
> 
> I think I was able to gather all of the feedback and come up with a solution
> that will satisfy all the parties. The design document has changed quite a lot
> compared to the early draft sent a few months ago. The major changes are:
>       * Provide more details on how PCI works on ARM and the interactions with
>       the MSI controller and IOMMU
>       * Provide details on the existing host bridge implementations
>       * Give more explanation and justifications on the approach chosen 
>       * Describing the hypercalls used and how they should be called
> 
> Feedback is welcome.
> 
> Cheers,

Hi Julien,

I think this document is a very good first step in the right direction
and I fully agree with the approaches taken here.

I noticed a couple of grammar errors, which I pointed out below.


> --------------------------------------------------------------------------------
> 
> % PCI pass-through support on ARM
> % Julien Grall <julien.grall@xxxxxxxxxx>
> % Draft B
> 
> # Preface
> 
> This document aims to describe the components required to enable PCI
> pass-through on ARM.
> 
> This is an early draft and some questions are still unanswered. When this is
> the case, the text will contain XXX.
> 
> # Introduction
> 
> PCI pass-through allows the guest to receive full control of physical PCI
> devices. This means the guest will have full and direct access to the PCI
> device.
> 
> Xen on ARM supports a kind of guest that exploits the virtualization support
> in hardware as much as possible. The guest will rely on PV drivers only
> for I/O (e.g block, network), and interrupts will come through the virtualized
> interrupt controller; therefore there are no big changes required within the
> kernel.
> 
> As a consequence, it would be possible to replace PV drivers by assigning real
> devices to the guest for I/O access. Xen on ARM would therefore be able to
> run unmodified operating systems.
> 
> To achieve this goal, it looks more sensible to go towards emulating the
> host bridge (there will be more details later). A guest would be able to take
> advantage of the firmware tables, obviating the need for a specific driver
> for Xen.
> 
> Thus, in this document we follow the emulated host bridge approach.
> 
> # PCI terminology
> 
> Each PCI device under a host bridge is uniquely identified by its Requester ID
> (AKA RID). A Requester ID is a triplet of Bus number, Device number, and
> Function.
> 
> When the platform has multiple host bridges, the software can add a fourth
> number called Segment (sometimes called Domain) to differentiate host bridges.
> A PCI device will then be uniquely identified by segment:bus:device:function
> (AKA SBDF).
> 
> So given a specific SBDF, it would be possible to find the host bridge and the
> RID associated to a PCI device. The pair (host bridge, RID) will often be used
> to find the relevant information for configuring the different subsystems (e.g
> IOMMU, MSI controller). For convenience, the rest of the document will use
> SBDF to refer to the pair (host bridge, RID).
> 
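As a small aside for readers less familiar with the encoding, here is a minimal
sketch of how an SBDF relates to the RID, assuming the usual 8-bit bus, 5-bit
device, 3-bit function split (the names below are illustrative, not a proposed
Xen interface):

#include <stdint.h>

/* Illustrative only: the segment is a software concept identifying the host
 * bridge, while the RID is the 16-bit bus/device/function triplet seen by
 * the hardware (IOMMU, MSI controller). */
typedef struct {
    uint16_t segment;   /* one per host bridge, assigned by software */
    uint8_t  bus;
    uint8_t  devfn;     /* device in bits [7:3], function in bits [2:0] */
} sbdf_t;

static inline uint16_t sbdf_to_rid(sbdf_t sbdf)
{
    return ((uint16_t)sbdf.bus << 8) | sbdf.devfn;
}
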
> # PCI host bridge
> 
> A PCI host bridge enables data transfer between a host processor and PCI
> bus-based devices. The bridge is used to access the configuration space of each
> PCI device and, on some platforms, may also act as an MSI controller.
> 
> ## Initialization of the PCI host bridge
> 
> Whilst it would be expected that the bootloader takes care of initializing
> the PCI host bridge, on some platforms it is done in the Operating System.
> 
> This may include enabling/configuring the clocks that could be shared among
> multiple devices.
> 
> ## Accessing PCI configuration space
> 
> Accessing the PCI configuration space can be divided into 2 categories:
>     * Indirect access, where the configuration spaces are multiplexed. An
>     example would be the legacy method on x86 (e.g 0xcf8 and 0xcfc). On ARM a
>     similar method is used by PCIe RCar root complex (see [12]).
>     * ECAM access, each configuration space will have its own address space.
> 
> Whilst ECAM is a standard, some PCI host bridges will require specific
> fiddling when accessing the registers (see thunder-ecam [13]).
> 
> In most cases, accessing all the PCI configuration spaces under a
> given PCI host will be done the same way (i.e either indirect access or ECAM
> access). However, there are a few cases, dependent on the PCI devices accessed,
> which will use different methods (see thunder-pem [14]).
> 
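As a side note on the ECAM case: the offset of each function's configuration
space is fixed by the specification (4KB per function, bus in bits [27:20],
device/function in bits [19:12]), so a generic access path essentially boils
down to the sketch below, with 'base' being the window described by the
firmware (an illustration only, not the quirk handling mentioned above):

#include <stdint.h>

/* ECAM: compute the address of a configuration space register. */
static inline volatile void *ecam_cfg_addr(volatile uint8_t *base, uint8_t bus,
                                           uint8_t devfn, uint16_t reg)
{
    return base + (((uint32_t)bus << 20) |
                   ((uint32_t)devfn << 12) |
                   (reg & 0xfff));
}
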
> ## Generic host bridge
> 
> For the purpose of this document, the term "generic host bridge" will be used
> to describe any ECAM-compliant host bridge whose initialization, if required,
> has already been done by the firmware/bootloader.
> 
> # Interaction of the PCI subsystem with other subsystems
> 
> In order to have a PCI device fully working, Xen will need to configure
> other subsystems such as the IOMMU and the Interrupt Controller.
> 
> The interactions expected between the PCI subsystem and the other subsystems
> are:
>     * Add a device
>     * Remove a device
>     * Assign a device to a guest
>     * Deassign a device from a guest
> 
> XXX: Detail the interaction when assigning/deassigning device
> 
> In the following subsections, the interactions will be briefly described from
> a higher level perspective. However, implementation details such as callbacks,
> structures, etc. are beyond the scope of this document.
> 
> ## IOMMU
> 
> The IOMMU will be used to isolate the PCI device when accessing the memory
> (e.g DMA and MSI Doorbells). Often the IOMMU will be configured using a MasterID
> (aka StreamID for ARM SMMU)  that can be deduced from the SBDF with the help
> of the firmware tables (see below).
> 
> Whilst in theory, all the memory transactions issued by a PCI device should
> go through the IOMMU, on certain platforms some of the memory transactions may
> not reach the IOMMU because they are interpreted by the host bridge. For
> instance, this could happen if the MSI doorbell is built into the PCI host
> bridge or for P2P traffic. See [6] for more details.
> 
> XXX: I think this could be solved by using direct mapping (e.g GFN == MFN),
> this would mean the guest memory layout would be similar to the host one when
> PCI devices are passed through => Detail it.
> 
> ## Interrupt controller
> 
> PCI supports three kinds of interrupts: legacy interrupts, MSI and MSI-X. On
> ARM, legacy interrupts will be mapped to SPIs. MSI and MSI-X will write their
> payload in a doorbell belonging to an MSI controller.
> 
> ### Existing MSI controllers
> 
> In this section some of the existing controllers and their interaction with
> the devices will be briefly described. More details can be found in the
> respective specifications of each MSI controller.
> 
> MSIs can be distinguished by some combination of
>     * the Doorbell
>         It is the MMIO address written to. Devices may be configured by
>         software to write to arbitrary doorbells which they can address.
>         An MSI controller may feature a number of doorbells.
>     * the Payload
>         Devices may be configured to write an arbitrary payload chosen by
>         software. MSI controllers may have restrictions on permitted payload.
>         Xen will have to sanitize the payload unless it is known to be always
>         safe.
>     * Sideband information accompanying the write
>         Typically this is neither configurable nor probeable, and depends on
>         the path taken through the memory system (i.e it is a property of the
>         combination of MSI controller and device rather than a property of
>         either in isolation).
> 
> ### GICv3/GICv4 ITS
> 
> The Interrupt Translation Service (ITS) is an MSI controller designed by ARM
> and integrated in the GICv3/GICv4 interrupt controller. For the specification
> see [GICV3]. Each MSI/MSI-X will be mapped to a new type of interrupt called
> LPI. This interrupt will be configured by the software using a pair (DeviceID,
> EventID).
> 
> A platform may have multiple ITS blocks (e.g one per NUMA node), each of them
> belonging to an ITS group.
> 
> The DeviceID is a unique identifier within an ITS group for each MSI-capable
> device that can be deduced from the RID with the help of the firmware tables
> (see below).
> 
> The EventID is a unique identifier to distinguish the different events sent
> by a device.
> 
> The MSI payload will only contain the EventID as the DeviceID will be added
> afterwards by the hardware in a way that will prevent any tampering.
> 
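To make the (DeviceID, EventID) pair a bit more concrete: conceptually the ITS
keeps one interrupt translation table per DeviceID, and the EventID indexes
into it to find the LPI to raise. A purely illustrative model of that lookup
(this is not the actual ITS command interface or table layout):

#include <stdint.h>

/* Illustrative model: one Interrupt Translation Table (ITT) per DeviceID,
 * indexed by EventID, yielding the LPI to inject. */
struct its_device {
    uint32_t deviceid;    /* deduced from the RID via the firmware tables */
    uint32_t nr_events;
    uint32_t *itt;        /* itt[eventid] == LPI number */
};

static inline uint32_t its_translate(const struct its_device *dev,
                                     uint32_t eventid)
{
    /* Out-of-range EventIDs must not generate an interrupt. */
    return (eventid < dev->nr_events) ? dev->itt[eventid] : 0;
}
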
> The [SBSA] appendix I describes the set of rules for the integration of the
                      ^ redundant I


> ITS that any compliant platform should follow. Some of the rules will explain
> the security implications of a misbehaving device. They ensure that a guest
> will never be able to trigger an MSI on behalf of another guest.
> 
> XXX: The security implication is described in the [SBSA] but I haven't found
> any similar wording in the GICv3 specification. It is unclear to me if
> non-SBSA compliant platforms (e.g embedded) will follow those rules.
> 
> ### GICv2m
> 
> The GICv2m is an extension of the GICv2 to convert MSI/MSI-X writes to unique
> interrupts. The specification can be found in the [SBSA] appendix E.
> 
> Depending on the platform, the GICv2m will provide one or multiple instances
> of register frames. Each frame is composed of a doorbell and is associated with
> a set of SPIs that can be discovered by reading the register MSI_TYPER.
> 
> On an MSI write, the payload will contain the SPI ID to generate. Note that
> on some platforms the MSI payload may contain an offset from the base SPI
> rather than the SPI itself.
> 
> The frame will only generate an SPI if the written value corresponds to an SPI
> allocated to the frame. Each VM should have exclusity to the frame to ensure
                                               ^ exclusive access ?


> isolation and prevent a guest OS from triggering an MSI on behalf of another
> guest OS.
> 
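Regarding the frame only generating SPIs it owns, I would expect the check
Xen (or the hardware) performs to be roughly the one below, with the base SPI
and the number of SPIs coming from MSI_TYPER (names are mine, this is only a
sketch):

#include <stdbool.h>
#include <stdint.h>

/* One GICv2m frame: a doorbell plus a contiguous block of SPIs, as
 * advertised by its MSI_TYPER register. */
struct v2m_frame {
    uint32_t spi_base;    /* first SPI owned by the frame */
    uint32_t nr_spis;     /* number of SPIs owned by the frame */
};

/* A written payload only raises an interrupt if it names an SPI that
 * actually belongs to the frame. */
static bool v2m_payload_valid(const struct v2m_frame *frame, uint32_t payload)
{
    return payload >= frame->spi_base &&
           payload < frame->spi_base + frame->nr_spis;
}
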
> XXX: Linux seems to consider GICv2m as unsafe by default. From my 
> understanding,
> it is still unclear how we should proceed on Xen, as GICv2m should be safe
> as long as the frame is only accessed by one guest.

It seems to me that you are right.


> ### Other MSI controllers
> 
> Servers compliant with SBSA level 1 and higher will have to use either ITS
> or GICv2m. However, these are by no means the only MSI controllers available.
> A hardware vendor may decide to use a custom MSI controller which can be
> integrated in the PCI host bridge.
> 
> Whether it will be possible to write an MSI securely will depend on the
> MSI controller implementation.
> 
> XXX: I am happy to give a brief explanation of more MSI controllers (such
> as Xilinx and Renesas) if people think it is necessary.
> 
> This design document does not pertain to a specific MSI controller and will
> try to be as agnostic as possible. When possible, it will give insight into how
> to integrate the MSI controller.
> 
> # Information available in the firmware tables
> 
> ## ACPI
> 
> ### Host bridges
> 
> The static table MCFG (see 4.2 in [1]) will describe the host bridges available
> at boot that support ECAM. Unfortunately, there are platforms out there
> (see [2]) that re-use MCFG to describe host bridges that are not fully
> ECAM-compatible.
> 
> This means that Xen needs to account for possible quirks in the host bridge.
> The Linux community are working on a patch series for this, see [2] and [3],
> where quirks will be detected with:
>     * OEM ID
>     * OEM Table ID
>     * OEM Revision
>     * PCI Segment
>     * PCI bus number range (wildcard allowed)
> 
> Based on what Linux is currently doing, there are two kinds of quirks:
>     * Accesses to the configuration space of certain sizes are not allowed
>     * A specific driver is necessary for driving the host bridge
> 
> The former is straightforward to solve but the latter will require more 
> thought.
> Instantiation of a specific driver for the host controller can be easily done
> if Xen has the information to detect it. However, those drivers may require
> resources described in ASL (see [4] for instance).
> 
> The number of platforms requiring a specific PCI host bridge driver is
> currently limited. Whilst it is not possible to predict the future, upcoming
> platforms are expected to have fully ECAM-compliant PCI host bridges. Therefore,
> given that Xen does not have any ASL parser, the suggested approach is to
> hardcode the missing values. This could be revisited in the future if necessary.
> 
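For reference, the quirk matching described above could be captured with a
small static table keyed on the MCFG header fields, along the lines of what
the Linux series does. A sketch only (the field names and the pci_cfg_ops type
are assumptions, not an existing interface):

#include <stdint.h>

struct pci_cfg_ops;   /* host-bridge specific configuration space accessors */

/* One quirk entry, matched against the MCFG header and the segment/bus
 * range of the host bridge being probed. */
struct mcfg_quirk {
    char     oem_id[7];         /* 6 characters + NUL terminator */
    char     oem_table_id[9];   /* 8 characters + NUL terminator */
    uint32_t oem_revision;
    uint16_t segment;           /* 0xffff could act as a wildcard */
    uint8_t  bus_start, bus_end;
    const struct pci_cfg_ops *ops;  /* NULL: plain ECAM access is fine */
};
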
> ### Finding information to configure IOMMU and MSI controller
> 
> The static table [IORT] will provide information that will help to deduce
> data (such as MasterID and DeviceID) to configure both the IOMMU and the MSI
> controller from a given SBDF.
> 
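For what it is worth, the IORT ID mappings are simply ranges that rebase an
input ID onto an output ID (RID -> StreamID towards an SMMU, RID -> DeviceID
towards an ITS group), so the lookup Xen needs is essentially the sketch below,
assuming the table has already been parsed into such ranges:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One parsed IORT ID mapping: [input_base, input_base + count) translates
 * to [output_base, ...) on the output node (SMMU or ITS group). */
struct iort_id_mapping {
    uint32_t input_base;
    uint32_t count;
    uint32_t output_base;
};

static bool iort_translate(const struct iort_id_mapping *map, size_t nr,
                           uint32_t rid, uint32_t *out)
{
    for (size_t i = 0; i < nr; i++) {
        if (rid >= map[i].input_base &&
            rid < map[i].input_base + map[i].count) {
            *out = map[i].output_base + (rid - map[i].input_base);
            return true;
        }
    }
    return false;
}
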
> ## Finding which NUMA node a PCI device belongs to
> 
> On a NUMA system, the NUMA node associated with a PCI device can be found using
> the _PXM method of the host bridge (?).
> 
> XXX: I am not entirely sure where the _PXM will be (i.e host bridge vs PCI
> device).
> 
> ## Device Tree
> 
> ### Host bridges
> 
> Each Device Tree node associated with a host bridge will have at least the
> following properties (see bindings in [8]):
>     - device_type: will always be "pci".
>     - compatible: a string indicating which driver to instantiate
> 
> The node may also contain optional properties such as:
>     - linux,pci-domain: assigns a fixed segment number
>     - bus-range: indicates the range of bus numbers supported
> 
> When the property linux,pci-domain is not present, the operating system would
> have to allocate a segment number for each host bridge.
> 
> ### Finding information to configure IOMMU and MSI controller
> 
> ### Configuring the IOMMU
> 
> The Device Tree provides a generic IOMMU binding (see [10]) which uses the
> properties "iommu-map" and "iommu-map-mask" to describe the relationship
> between the RID and the MasterID.
> 
> These properties will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding MasterID.
> 
> Note that the ARM SMMU also has a legacy binding (see [9]), but it does not
> have a way to describe the relationship between RID and StreamID. Instead it
> is assumed that StreamID == RID. This binding has now been deprecated in favor
> of the generic IOMMU binding.
> 
> ### Configuring the MSI controller
> 
> The relationship between the RID and data required to configure the MSI
> controller (such as DeviceID) can be found using the property "msi-map"
> (see [11]).
> 
> This property will be present in the host bridge Device Tree node. From a
> given SBDF, it will be possible to find the corresponding DeviceID.
> 
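Both "iommu-map" and "msi-map" share the same format: a list of (rid-base,
phandle, output-base, length) entries, with the optional "*-map-mask" applied
to the RID first. So the translation Xen would need boils down to the sketch
below, assuming the property has already been parsed into entries (names are
mine):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One parsed "iommu-map"/"msi-map" entry:
 * (rid-base, target phandle, output-base, length). */
struct dt_id_map_entry {
    uint32_t rid_base;
    uint32_t phandle;       /* the IOMMU or MSI controller node */
    uint32_t output_base;   /* MasterID or DeviceID base */
    uint32_t length;
};

static bool dt_map_rid(const struct dt_id_map_entry *map, size_t nr,
                       uint32_t map_mask, uint32_t rid, uint32_t *out)
{
    uint32_t masked = rid & map_mask;   /* "*-map-mask" defaults to 0xffff */

    for (size_t i = 0; i < nr; i++) {
        if (masked >= map[i].rid_base &&
            masked < map[i].rid_base + map[i].length) {
            *out = map[i].output_base + (masked - map[i].rid_base);
            return true;
        }
    }
    return false;
}
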
> ## Finding which NUMA node a PCI device belongs to
> 
> On a NUMA system, the NUMA node associated with a PCI device can be found using
> the property "numa-node-id" (see [15]) present in the host bridge Device Tree
> node.
> 
> # Discovering PCI devices
> 
> Whilst PCI devices are currently available in the hardware domain, the
> hypervisor does not have any knowledge of them. The first step of supporting
> PCI pass-through is to make Xen aware of the PCI devices.
> 
> Xen will require access to the PCI configuration space to retrieve information
> for the PCI devices or access it on behalf of the guest via the emulated
> host bridge.
> 
> This means that Xen should be in charge of controlling the host bridge.
> However, for some host controllers, this may be difficult to implement in Xen
> because of dependencies on other components (e.g clocks, see more details in
> the "PCI host bridge" section).
> 
> For this reason, the approach chosen in this document is to let the hardware
> domain discover the host bridges, scan the PCI devices and then report
> everything to Xen. This does not rule out the possibility of doing everything
> without the help of the hardware domain in the future.
> 
> ## Who is in charge of the host bridge?
> 
> There are numerous host bridge implementations on ARM. Some of them require
> a specific driver as they cannot be driven by a generic host bridge driver.
> Porting those drivers may be complex due to dependencies on other components.
> 
> This could be seen as a signal to leave the host bridge drivers in the hardware
> domain. Because Xen would need to access the configuration space, all the
> accesses would have to be forwarded to the hardware domain, which in turn would
> access the hardware.
> 
> In this design document, we are considering that the host bridge driver can
> be ported to Xen. In case it is not possible, an interface to forward
> configuration space accesses would need to be defined. The interface details
> are out of scope.
> 
> ## Discovering and registering host bridge
> 
> The approach taken in the document will require communication between Xen and
> the hardware domain. In this case, they would need to agree on the segment
> number associated with a host bridge. However, this number is not available in
> the Device Tree case.
> 
> The hardware domain will register new host bridges using the existing
> hypercall PHYSDEVOP_pci_mmcfg_reserved:
> 
> #define XEN_PCI_MMCFG_RESERVED 1
> 
> struct physdev_pci_mmcfg_reserved {
>     /* IN */
>     uint64_t    address;
>     uint16_t    segment;
>     /* Range of bus supported by the host bridge */
>     uint8_t     start_bus;
>     uint8_t     end_bus;
> 
>     uint32_t    flags;
> }
> 
> Some of the host bridges may not have a separate configuration address space
> region described in the firmware tables. To simplify the registration, the
> field 'address' should contain the base address of one of the regions
> described in the firmware tables.
>     * For ACPI, it would be the base address specified in the MCFG or in the
>     _CBA method.
>     * For Device Tree, this would be any base address of region
>     specified in the "reg" property.
> 
> The field 'flags' is expected to have XEN_PCI_MMCFG_RESERVED set.
> 
> It is expected that this hypercall is called before any PCI device is
> registered to Xen.
> 
> When the hardware domain is in charge of the host bridge, this hypercall will
> be used to tell Xen about the existence of a host bridge in order to find the
> associated information for configuring the MSI controller and the IOMMU.
> 
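Just to make the expected flow concrete, here is a sketch of how the hardware
domain (Linux in this case) might register an ECAM host bridge it has
discovered; the values are made up and the wrapper follows the existing Linux
convention for issuing physdev ops:

#include <asm/xen/hypercall.h>
#include <xen/interface/physdev.h>

static int register_host_bridge(void)
{
    /* Illustrative values: ECAM base taken from the MCFG or the DT "reg"
     * property, segment 0, covering buses 0-255. */
    struct physdev_pci_mmcfg_reserved r = {
        .address   = 0x40000000ULL,
        .segment   = 0,
        .start_bus = 0,
        .end_bus   = 255,
        .flags     = XEN_PCI_MMCFG_RESERVED,
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
}
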
> ## Discovering and registering PCI devices
> 
> The hardware domain will scan the host bridge to find the list of PCI devices
> available and then report it to Xen using the existing hypercall
> PHYSDEVOP_pci_device_add:
> 
> #define XEN_PCI_DEV_EXTFN   0x1
> #define XEN_PCI_DEV_VIRTFN  0x2
> #define XEN_PCI_DEV_PXM     0x4
> 
> struct physdev_pci_device_add {
>     /* IN */
>     uint16_t    seg;
>     uint8_t     bus;
>     uint8_t     devfn;
>     uint32_t    flags;
>     struct {
>         uint8_t bus;
>         uint8_t devfn;
>     } physfn;
>     /*
>      * Optional parameters array.
>      * First element ([0]) is PXM domain associated with the device (if
>      * XEN_PCI_DEV_PXM is set)
>      */
>     uint32_t optarr[0];
> }
> 
> When XEN_PCI_DEV_PXM is set in the field 'flags', optarr[0] will contain the
> NUMA node ID associated with the device:
>     * For ACPI, it would be the value returned by the method _PXM
>     * For Device Tree, this would be the value found in the property
>     "numa-node-id".
> For more details see the section "Finding which NUMA node a PCI device belongs
> to" in "ACPI" and "Device Tree".
> 
> XXX: I still don't fully understand how XEN_PCI_DEV_EXTFN and XEN_PCI_DEV_VIRTFN
> will work. AFAICT, the former is used when the bus supports ARI and the only
> usage is in the x86 IOMMU code. For the latter, this is related to IOV but I am
> not sure what devfn and physfn.devfn will correspond to.
> 
> Note that x86 currently provides two more hypercalls (PHYSDEVOP_manage_pci_add
> and PHYSDEVOP_manage_pci_add_ext) to register PCI devices. However they are
> a subset of the hypercall PHYSDEVOP_pci_device_add. Therefore, it is suggested
> to leave them unimplemented on ARM.
> 
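Similarly, reporting a discovered device together with its NUMA node could
look like the sketch below from the hardware domain side (made-up values; the
wrapper struct is there only to provide storage for the trailing optarr[]):

#include <asm/xen/hypercall.h>
#include <xen/interface/physdev.h>

/* Sketch: report device 0000:01:00.0, located on NUMA node 1, to Xen. */
static int report_pci_device(void)
{
    struct {
        struct physdev_pci_device_add add;
        uint32_t pxm;                      /* storage for optarr[0] */
    } op = {
        .add = {
            .seg   = 0,
            .bus   = 1,
            .devfn = (0 << 3) | 0,         /* device 0, function 0 */
            .flags = XEN_PCI_DEV_PXM,      /* optarr[0] holds the NUMA node */
        },
        .pxm = 1,
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_device_add, &op.add);
}
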
> ## Removing PCI devices
> 
> The hardware domain will be in charge of telling Xen that a device has been
> removed, using the existing hypercall PHYSDEVOP_pci_device_remove:
> 
> struct physdev_pci_device {
>     /* IN */
>     uint16_t    seg;
>     uint8_t     bus;
>     uint8_t     devfn;
> }
> 
> Note that x86 currently provides one more hypercall (PHYSDEVOP_manage_pci_remove)
> to remove PCI devices. However it does not allow passing a segment number.
> Therefore it is suggested to leave it unimplemented on ARM.
> 
> # Glossary
> 
> ECAM: Enhanced Configuration Mechanism
> SBDF: Segment Bus Device Function. The segment is a software concept.
> MSI: Message Signaled Interrupt
> MSI doorbell: MMIO address written to by a device to generate an MSI
> SPI: Shared Peripheral Interrupt
> LPI: Locality-specific Peripheral Interrupt
> ITS: Interrupt Translation Service
> 
> # Specifications
> [SBSA]  ARM-DEN-0029 v3.0
> [GICV3] IHI0069C
> [IORT]  DEN0049B
> 
> # Bibliography
> 
> [1] PCI firmware specification, rev 3.2
> [2] https://www.spinics.net/lists/linux-pci/msg56715.html
> [3] https://www.spinics.net/lists/linux-pci/msg56723.html
> [4] https://www.spinics.net/lists/linux-pci/msg56728.html
> [6] https://www.spinics.net/lists/kvm/msg140116.html
> [7] http://www.firmware.org/1275/bindings/pci/pci2_1.pdf
> [8] Documents/devicetree/bindings/pci
> [9] Documents/devicetree/bindings/iommu/arm,smmu.txt
> [10] Document/devicetree/bindings/pci/pci-iommu.txt
> [11] Documents/devicetree/bindings/pci/pci-msi.txt
> [12] drivers/pci/host/pcie-rcar.c
> [13] drivers/pci/host/pci-thunder-ecam.c
> [14] drivers/pci/host/pci-thunder-pem.c
> [15] Documents/devicetree/bindings/numa.txt
> 


 

