Xen project Mailing List

Re: [Xen-devel] [DRAFT RFC] PVHv2 interaction with physical devices

To: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>

From: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Date: Thu, 10 Nov 2016 11:39:08 +0100

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Kelly <Kelly.Zytaruk@xxxxxxx>, Julien Grall <julien.grall@xxxxxxx>, Paul Durrant <paul.durrant@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>

Delivery-date: Thu, 10 Nov 2016 10:41:57 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> > Hello,
> >
> > I'm attaching a draft of how a PVHv2 Dom0 is supposed to interact with
> > physical devices, and what needs to be done inside of Xen in order to
> > achieve it. Current draft is RFC because I'm quite sure I'm missing bits
> > that should be written down here. So far I've tried to describe what my
> > previous series attempted to do by adding a bunch of IO and memory space
> > handlers.
> >
> > Please note that this document only applies to PVHv2 Dom0, it is not
> > applicable to untrusted domains that will need more handlers in order to
> > secure Xen and other domains running on the same system. The idea is that
> > this can be expanded to untrusted domains also in the long term, thus
> > having a single set of IO and memory handlers for passed-through devices.
> >
> > Roger.
> >
> > ---8<---
> >
> > This document describes how a PVHv2 Dom0 is supposed to interact with
> > physical devices.
> >
> > Architecture
> > ============
> >
> > Purpose
> > -------
> >
> > Previous Dom0 implementations have always used PIRQs (physical interrupts
> > routed over event channels) in order to receive events from physical
> > devices. This prevents Dom0 from taking advantage of new hardware
> > virtualization features, like posted interrupts or hardware virtualized
> > local APIC. Also the current device memory management in the PVH Dom0
> > implementation is lacking, and might not support devices that have memory
> > regions past the 4GB boundary.
>
> memory regions meaning BAR regions?

Yes.

> > The new PVH implementation (PVHv2) should overcome the interrupt
> > limitations by providing the same interface that's used on bare metal (a
> > local APIC and IO APICs) thus allowing the usage of advanced hardware
> > assisted virtualization techniques. This also aligns with the trend in
> > the hardware industry to move part of the emulation into the silicon
> > itself.
>
> What if the hardware PVH2 runs on does not have vAPIC?

The emulated local APIC provided by Xen will be used.

> > In order to improve the mapping of device memory areas, Xen will have to
> > know of those devices in advance (before Dom0 tries to interact with them)
> > so that the memory BARs will be properly mapped into Dom0 memory map.
>
> Oh, that is going to be a problem with SR-IOV. Those are created _after_
> dom0 has booted. In fact they are done by the drivers themselves.
>
> See xen_add_device in drivers/xen/pci.c how this is handled.

Is the process of creating those VFs something standard? (In the sense that
it can be detected by Xen, and proper mappings established.)

> > The following document describes the proposed interface and implementation
> > of all the logic needed in order to achieve the functionality described
> > above.
> >
> > MMIO areas
> > ==========
> >
> > Overview
> > --------
> >
> > On x86 systems certain regions of memory might be used in order to manage
> > physical devices on the system. Access to these areas is critical for a
> > PVH Dom0 in order to operate properly. Unlike the previous PVH Dom0
> > implementation (PVHv1), which was set up with identity mappings of all the
> > holes and reserved regions found in the memory map, this new
> > implementation intends to map only what's actually needed by the Dom0.
>
> And why was the previous approach not working?

The previous PVHv1 implementation would only identity map holes and reserved
areas in the guest memory map, or up to the 4GB boundary if the guest memory
map is smaller than 4GB. If a device has a BAR past the 4GB boundary, for
example, it would not be identity mapped in the p2m.

> > Low 1MB
> > -------
> >
> > When booted with a legacy BIOS, the low 1MB contains firmware related data
> > that should be identity mapped to the Dom0. This includes the EBDA, video
> > memory and possibly ROMs. All non RAM regions below 1MB will be identity
> > mapped to the Dom0 so that it can access this data freely.
> >
> > ACPI regions
> > ------------
> >
> > ACPI regions will be identity mapped to the Dom0, this implies regions with
> > type 3 and 4 in the e820 memory map. Also, since some BIOSes report
> > incorrect memory maps, the top-level tables discovered by Xen (as listed
> > in the {X/R}SDT) that are not in RAM regions will be mapped to Dom0.
> >
> > PCI memory BARs
> > ---------------
> >
> > PCI devices discovered by Xen will have their BARs scanned in order to
> > detect memory BARs, and those will be identity mapped to Dom0. Since BARs
> > can be freely moved by the Dom0 OS by writing to the appropriate PCI
> > config space register, Xen must trap those accesses and unmap the previous
> > region and map the new one as set by Dom0.
>
> You can make that simpler - we have hypercalls to "notify" in Linux
> when a device is changing. Those can provide that information as well.
> (This is what PV dom0 does).
>
> Also you are missing one important part - the MMCFG. That is required
> for Xen to be able to poke at the PCI configuration spaces (above the 256).
> And you can only get the MMCFG if the ACPI DSDT has been parsed.

Hm, I guess I'm missing something, but at least on my hardware Xen seems to
be able to parse the MCFG ACPI table before Dom0 does anything with the DSDT:

(XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
(XEN) PCI: MCFG area at f8000000 reserved in E820
(XEN) PCI: Using MCFG for segment 0000 bus 00-3f

> So if you do the PCI bus scanning _before_ booting PVH dom0, you may
> need to update your view of PCI devices after the MMCFG locations
> have been provided to you.

I'm not opposed to keeping PHYSDEVOP_pci_mmcfg_reserved, but I have yet to
see hardware where this is actually needed. Also, AFAICT, FreeBSD at least is
only able to detect MMCFG regions present in the MCFG ACPI table:

http://fxr.watson.org/fxr/source/dev/acpica/acpi.c?im=excerp#L1861

> > Limitations
> > -----------
> >
> > - Xen needs to be aware of any PCI device before Dom0 tries to interact
> >   with it, so that the MMIO regions are properly mapped.
> >
> > Interrupt management
> > ====================
> >
> > Overview
> > --------
> >
> > On x86 systems there are three different mechanisms that can be used in
> > order to deliver interrupts: IO APIC, MSI and MSI-X. Note that each device
> > might support different methods, but those are never active at the same
> > time.
> >
> > Legacy PCI interrupts
> > ---------------------
> >
> > The only way to deliver legacy PCI interrupts to PVHv2 guests is using the
> > IO APIC, PVHv2 domains don't have an emulated PIC. As a consequence the
> > ACPI _PIC method must be set to APIC mode by the Dom0 OS.
> >
> > Xen will always provide a single IO APIC, that will match the number of
> > possible GSIs of the underlying hardware. This is possible because ACPI
> > uses a system cookie in order to name interrupts, so the IO APIC device ID
> > or pin number is not used in _PRT methods.
>
> So the MADT that is presented to dom0 will be mangled? That is
> where the IOAPIC information along with the number of GSIs is presented.

Yes, the MADT presented to Dom0 is created by Xen, this is already part of my
series, see patch:

https://lists.xenproject.org/archives/html/xen-devel/2016-10/msg02017.html

The IO APIC information is presented in the MADT IO APIC entries, while the
total number of GSIs is calculated by the Dom0 by poking at how many pins
each IO APIC has (this information is not directly fetched from ACPI).

> > XXX: is it possible to have more than 256 GSIs?
>
> Yeah. If you have enough of the IOAPICs you can have more than 256. But
> I don't think any OS has taken that into account as the GSI values are
> always uint8_t.

Right, so AFAICT providing a single IO APIC with enough pins should be fine.

> > The binding between the underlying physical interrupt and the emulated
> > interrupt is performed when unmasking an IO APIC pin, so writes to the
> > IOREDTBL registers that unset the mask bit will trigger this binding
> > and enable the interrupt.
> >
> > MSI Interrupts
> > --------------
> >
> > MSI interrupts are set up using the PCI config space, either the IO ports
> > or the memory mapped configuration area. This means that both spaces
> > should be trapped by Xen, in order to detect accesses to these registers
> > and properly emulate them.
> >
> > Since the offset of the MSI registers is not fixed, Xen has to query the
> > PCI configuration space in order to find the offset of the PCI_CAP_ID_MSI,
> > and then set up the correct traps, which also vary depending on the
> > capabilities of the device. The following list contains the set of MSI
> > registers that Xen will trap, please take into account that some devices
> > might only implement a subset of those registers, so not all traps will
> > be used:
> >
> > - Message control register (offset 2): Xen traps accesses to this
> >   register, and stores the data written to it into an internal structure.
> >   When the OS sets the MSI enable bit (bit 0) Xen will set up the
> >   configured MSI interrupts and route them to the guest.
> >
> > - Message address register (offset 4): writes and reads to this register
> >   are trapped by Xen, and the value is stored into an internal structure.
> >   This is later used when MSI are enabled in order to configure the
> >   vectors injected to the guest. Writes to this register with MSI already
> >   enabled will cause a reconfiguration of the binding of interrupts to the
> >   guest.
> >
> > - Message data register (offset 8 or 12 if message address is 64bits):
> >   writes and reads to this register are trapped by Xen, and the value is
> >   stored into an internal structure. This is used when MSI are enabled in
> >   order to configure the vector where the guest expects to receive those
> >   interrupts. Writes to this register with MSI already enabled will cause
> >   a reconfiguration of the binding of interrupts to the guest.
> >
> > - Mask and pending bits: reads or writes to those registers are not
> >   trapped by Xen.
> >
> > MSI-X Interrupts
> > ----------------
> >
> > MSI-X, in contrast with MSI, has part of the configuration registers in
> > the PCI configuration space, while others reside inside the memory BARs of
> > the device. So in this case Xen needs to set up traps for both the PCI
> > configuration space and two different memory regions. Xen has to query the
> > position of the MSI-X capability using the PCI_CAP_ID_MSIX, and set up a
> > handler in order to trap accesses to the different registers. Xen also has
> > to figure out the position of the MSI-X table and PBA, using the table BIR
> > and table offset, and the PBA BIR and PBA offset. Once those are known a
> > handler should also be set up in order to trap accesses to those memory
> > regions.
> >
> > This is the list of MSI-X registers that are used in order to manage MSI-X
> > in the PCI configuration space:
> >
> > - Message control: Xen should trap accesses to this register in order to
> >   detect changes to the MSI-X enable field (bit 15). Changes to this bit
> >   will trigger the setup of the MSI-X table entries configured. Writes
> >   to the function mask bit will be passed-through to the underlying
> >   register.
> >
> > - Table offset, table BIR, PBA offset, PBA BIR: accesses to those
> >   registers are not trapped by Xen.
> >
> > The following registers reside in memory, and are pointed out by the Table
> > and PBA fields found in the PCI configuration space:
> >
> > - Message address and data: writes and reads to those registers are
> >   trapped by Xen, and the value is stored into an internal structure. This
> >   is later used by Xen in order to configure the interrupt injected to the
> >   guest. Writes to those registers with MSI-X already enabled will not
> >   cause a reconfiguration of the interrupt.
> >
> > - Vector control: writes and reads are trapped, clearing the mask bit
> >   (bit 0) will cause Xen to set up the configured interrupt if MSI-X is
> >   globally enabled in the message control field.
> >
> > - Pending bits array: writes and reads to this register are not trapped by
> >   Xen.
> >
> > Limitations
> > -----------
> >
> > - Due to the fact that Dom0 is not able to parse dynamic ACPI tables,
> >   some UART devices might only function in polling mode, because Xen
> >   will be unable to properly configure the interrupt pins without Dom0
> >   collaboration, and the UART in use by Xen should be explicitly
> >   blacklisted from Dom0 access.
>
> By blacklisting the IO ports too?

Well, I was planning to somehow use the STAO ACPI table, but I'm not really
sure how Xen can blacklist a device without parsing the DSDT:

https://lists.xen.org/archives/html/xen-devel/2016-08/pdfYfOWKJ83jH.pdf

Since this table is under Xen's control, we could always make changes to it
in order to suit our needs, although I'm not really sure how a device can be
blacklisted without knowing its ACPI namespace path, and I don't know how to
get that without parsing the DSDT.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel
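The capability lookup described in the MSI and MSI-X sections of the draft
can be sketched as follows. This is a minimal Python illustration, not Xen
code: it models a device's config space as a 256-byte buffer, walks the
capability list to find PCI_CAP_ID_MSI and PCI_CAP_ID_MSIX, derives the MSI
register offsets that would need trapping, and decodes the MSI-X table and
PBA BIR/offset. The register layout follows the PCI specification; the
function names, the dictionary keys and the example_config_space() helper
are hypothetical and exist only for this example.

```python
import struct

PCI_STATUS = 0x06            # status register offset
PCI_STATUS_CAP_LIST = 0x10   # status bit 4: capability list present
PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_MSI = 0x05
PCI_CAP_ID_MSIX = 0x11


def find_capability(cfg, cap_id):
    """Return the config-space offset of capability cap_id, or None."""
    status = struct.unpack_from("<H", cfg, PCI_STATUS)[0]
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guard against malformed, looping capability lists
    while pos >= 0x40 and pos not in seen:
        seen.add(pos)
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xFC  # next-capability pointer
    return None


def msi_registers(cfg, pos):
    """Offsets of the MSI registers the draft proposes to trap."""
    control = struct.unpack_from("<H", cfg, pos + 2)[0]
    is_64bit = bool(control & 0x80)  # bit 7: 64-bit address capable
    return {
        "control": pos + 2,
        "address": pos + 4,
        "data": pos + (12 if is_64bit else 8),
        "enabled": bool(control & 0x1),  # bit 0: MSI enable
    }


def msix_layout(cfg, pos):
    """Decode the MSI-X table and PBA location from the capability."""
    control = struct.unpack_from("<H", cfg, pos + 2)[0]
    table = struct.unpack_from("<I", cfg, pos + 4)[0]
    pba = struct.unpack_from("<I", cfg, pos + 8)[0]
    return {
        "entries": (control & 0x7FF) + 1,   # table size is encoded as N-1
        "enabled": bool(control & 0x8000),  # bit 15: MSI-X enable
        "table_bir": table & 0x7,
        "table_offset": table & 0xFFFFFFF8,
        "pba_bir": pba & 0x7,
        "pba_offset": pba & 0xFFFFFFF8,
    }


def example_config_space():
    """Hypothetical device: MSI at 0x50 (64-bit), MSI-X at 0x70."""
    cfg = bytearray(256)
    struct.pack_into("<H", cfg, PCI_STATUS, PCI_STATUS_CAP_LIST)
    cfg[PCI_CAPABILITY_LIST] = 0x50
    cfg[0x50:0x52] = bytes([PCI_CAP_ID_MSI, 0x70])
    struct.pack_into("<H", cfg, 0x52, 0x0080)
    cfg[0x70:0x72] = bytes([PCI_CAP_ID_MSIX, 0x00])
    struct.pack_into("<H", cfg, 0x72, 0x0007)       # 8 table entries
    struct.pack_into("<I", cfg, 0x74, 0x2000 | 2)   # table: BAR 2 + 0x2000
    struct.pack_into("<I", cfg, 0x78, 0x3000 | 2)   # PBA:   BAR 2 + 0x3000
    return cfg
```

With the example buffer, find_capability(cfg, PCI_CAP_ID_MSIX) returns 0x70
and msix_layout() reports an 8-entry table in BAR 2 at offset 0x2000, which
is the memory range a handler for the MSI-X table would have to cover.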
