
Re: [Xen-devel] [DRAFT RFC] PVHv2 interaction with physical devices



On Thu, Nov 10, 2016 at 11:39:08AM +0100, Roger Pau Monné wrote:
> On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> > > Hello,
> > > 
> > > I'm attaching a draft of how a PVHv2 Dom0 is supposed to interact with
> > > physical devices, and what needs to be done inside Xen in order to
> > > achieve it. The current draft is an RFC because I'm quite sure I'm
> > > missing bits that should be written down here. So far I've tried to
> > > describe what my previous series attempted to do by adding a bunch of
> > > IO and memory space handlers.
> > > 
> > > Please note that this document only applies to PVHv2 Dom0; it is not
> > > applicable to untrusted domains, which will need more handlers in order
> > > to secure Xen and other domains running on the same system. The idea is
> > > that this can be expanded to untrusted domains in the long term, thus
> > > having a single set of IO and memory handlers for passed-through devices.
> > > 
> > > Roger.
> > > 
> > > ---8<---
> > > 
> > > This document describes how a PVHv2 Dom0 is supposed to interact with
> > > physical devices.
> > > 
> > > Architecture
> > > ============
> > > 
> > > Purpose
> > > -------
> > > 
> > > Previous Dom0 implementations have always used PIRQs (physical interrupts
> > > routed over event channels) in order to receive events from physical
> > > devices. This prevents Dom0 from taking advantage of new hardware
> > > virtualization features, like posted interrupts or a hardware virtualized
> > > local APIC. Also, the current device memory management in the PVH Dom0
> > > implementation is lacking, and might not support devices that have memory
> > > regions past the 4GB boundary.
> > 
> > memory regions meaning BAR regions?
> 
> Yes.
>  
> > > 
> > > The new PVH implementation (PVHv2) should overcome the interrupt
> > > limitations by providing the same interface that's used on bare metal
> > > (local and IO APICs), thus allowing the use of advanced hardware assisted
> > > virtualization techniques. This also aligns with the trend in the hardware
> > > industry to move part of the emulation into the silicon itself.
> > 
> > What if the hardware PVHv2 runs on does not have vAPIC support?
> 
> The emulated local APIC provided by Xen will be used.
> 
> > > 
> > > In order to improve the mapping of device memory areas, Xen will have to
> > > know about those devices in advance (before Dom0 tries to interact with
> > > them) so that the memory BARs will be properly mapped into the Dom0
> > > memory map.
> > 
> > Oh, that is going to be a problem with SR-IOV. The virtual functions (VFs)
> > are created _after_ dom0 has booted. In fact they are created by the
> > drivers themselves.
> > 
> > See xen_add_device in drivers/xen/pci.c for how this is handled.
> 
> Is the process of creating those VFs something standard? (In the sense that 
> it can be detected by Xen, and proper mappings established.)

Yes and no.

You can read from the PCI configuration space that the device (the physical
function, PF) has SR-IOV. But that capability may live in the extended
configuration registers, so you need MMCFG to reach it. Anyhow, the only
thing the PF will tell you is the BAR regions the VFs will occupy (since
they are behind the bridge), but not the BDFs:

        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 128, stride: 2, Device ID: 10ca
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 00000000fbda0000 (64-bit, non-prefetchable)
                Region 3: Memory at 00000000fbd80000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: igb

And if I enable SR-IOV on the PF I get:

0a:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:10.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:10.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:10.6 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:11.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:11.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)

-bash-4.1# lspci -s 0a:10.0 -v
0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
        Subsystem: Super Micro Computer Inc Device 10c9
        Flags: bus master, fast devsel, latency 0
        [virtual] Memory at fbda0000 (64-bit, non-prefetchable) [size=16K]
        [virtual] Memory at fbd80000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: igbvf

-bash-4.1# lspci -s 0a:11.4 -v
0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
        Subsystem: Super Micro Computer Inc Device 10c9
        Flags: bus master, fast devsel, latency 0
        [virtual] Memory at fbdb8000 (64-bit, non-prefetchable) [size=16K]
        [virtual] Memory at fbd98000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: igbvf
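
As a side note, the addresses the VFs eventually appear at can be derived from
the PF's routing ID plus the "VF offset" and "stride" fields shown above, per
the SR-IOV spec (with the caveat that those fields can change with the ARI
setting). A minimal sketch in plain C follows; the vf_rid() helper is mine,
not anything that exists in Xen or Linux, and it just reproduces the 0a:10.0,
0a:10.2, ... layout from the listing above:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: compute the routing ID (bus << 8 | devfn) of VF n
 * from the PF routing ID and the "First VF Offset" / "VF Stride" fields of
 * the SR-IOV extended capability.  Per the SR-IOV spec:
 *   VF_RID(n) = PF_RID + First_VF_Offset + n * VF_Stride   (n starting at 0)
 */
static uint16_t vf_rid(uint16_t pf_rid, uint16_t vf_offset,
                       uint16_t vf_stride, unsigned int n)
{
    return pf_rid + vf_offset + n * vf_stride;
}

int main(void)
{
    /* Values taken from the lspci dump above: PF at 0a:00.0,
     * VF offset 128, stride 2, Total VFs 8. */
    uint16_t pf = (0x0a << 8) | (0x00 << 3) | 0x0;
    unsigned int n;

    for (n = 0; n < 8; n++) {
        uint16_t rid = vf_rid(pf, 128, 2, n);
        printf("VF%u -> %02x:%02x.%x\n", n,
               rid >> 8, (rid >> 3) & 0x1f, rid & 0x7);
    }
    return 0;
}

The catch remains that the VFs only actually exist once the driver programs
NumVFs and sets VF Enable, which on Linux happens well after Dom0 has booted,
so Xen would still need either a notification from Dom0 (like the
PHYSDEVOP_pci_device_add call that xen_add_device issues) or to trap the
write to the SR-IOV control register.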


> 
> > > 
> > > The following document describes the proposed interface and implementation
> > > of all the logic needed in order to achieve the functionality described 
> > > above.
> > > 
> > > MMIO areas
> > > ==========
> > > 
> > > Overview
> > > --------
> > > 
> > > On x86 systems certain regions of memory might be used in order to manage
> > > physical devices on the system. Access to these areas is critical for a
> > > PVH Dom0 to operate properly. Unlike the previous PVH Dom0 implementation
> > > (PVHv1), which was set up with identity mappings of all the holes and
> > > reserved regions found in the memory map, this new implementation intends
> > > to map only what's actually needed by the Dom0.
> > 
> > And why was the previous approach not working?
> 
> The previous PVHv1 implementation would only identity map holes and reserved 
> areas in the guest memory map, or up to the 4GB boundary if the guest memory 
> map is smaller than 4GB. If a device has a BAR past the 4GB boundary, for 
> example, it would not be identity mapped in the p2m.
> 
> > > 
> > > Low 1MB
> > > -------
> > > 
> > > When booted with a legacy BIOS, the low 1MB contains firmware-related data
> > > that should be identity mapped to the Dom0. This includes the EBDA, video
> > > memory and possibly ROMs. All non-RAM regions below 1MB will be identity
> > > mapped to the Dom0 so that it can access this data freely.
> > > 
> > > ACPI regions
> > > ------------
> > > 
> > > ACPI regions will be identity mapped to the Dom0; this covers regions with
> > > types 3 (ACPI) and 4 (ACPI NVS) in the e820 memory map. Also, since some
> > > BIOSes report incorrect memory maps, the top-level tables discovered by
> > > Xen (as listed in the {X/R}SDT) that are not in RAM regions will be mapped
> > > to Dom0.
> > > 
> > > PCI memory BARs
> > > ---------------
> > > 
> > > PCI devices discovered by Xen will have their BARs scanned in order to
> > > detect memory BARs, and those will be identity mapped to Dom0. Since BARs
> > > can be freely moved by the Dom0 OS by writing to the appropriate PCI
> > > config space registers, Xen must trap those accesses, unmap the previous
> > > region and map the new one as set by Dom0.
> > 
> > You can make that simpler - we have hypercalls to "notify" in Linux
> > when a device is changing. Those can provide that information as well.
> > (This is what PV dom0 does).
> > 
> > Also, you are missing one important part - the MMCFG. That is required
> > for Xen to be able to poke at the PCI configuration space above the first
> > 256 bytes. And you can only get the MMCFG if the ACPI DSDT has been parsed.
> 
> Hm, I guess I'm missing something, but at least on my hardware Xen seems to 
> be able to parse the MCFG ACPI table before Dom0 does anything with the 
> DSDT:
> 
> (XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
> (XEN) PCI: MCFG area at f8000000 reserved in E820
> (XEN) PCI: Using MCFG for segment 0000 bus 00-3f
> 
> > So if you do the PCI bus scanning _before_ booting PVH dom0, you may
> > need to update your view of PCI devices after the MMCFG locations
> > have been provided to you.
> 
> I'm not opposed to keeping PHYSDEVOP_pci_mmcfg_reserved, but I have yet to 
> see hardware where this is actually needed. Also, AFAICT, FreeBSD at least 
> is only able to detect MMCFG regions present in the MCFG ACPI table:

There is some hardware out there (I think I saw this with an IBM HS-20,
but I can't recall the details). The specification says that the MMCFG
region _may_ be described in the static MCFG table, but that is not
guaranteed, which means it can also bubble up via the ACPI DSDT code.
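
FWIW, Linux already has plumbing for this: once Dom0 has parsed the DSDT and
verified that an MMCFG window is reserved, it hands the area to Xen with
PHYSDEVOP_pci_mmcfg_reserved. Below is a simplified sketch modelled on
xen_mcfg_late() in drivers/xen/pci.c; the notify_xen_mmcfg() wrapper name is
made up, and error handling plus the iteration over the kernel's MMCFG list
are omitted:

#include <linux/types.h>
#include <xen/interface/physdev.h>   /* struct physdev_pci_mmcfg_reserved */
#include <asm/xen/hypercall.h>       /* HYPERVISOR_physdev_op() */

/*
 * Simplified sketch: tell Xen that an MMCFG (ECAM) area discovered and
 * validated by Dom0 (via the MCFG table and/or the DSDT) is reserved and
 * safe to use for extended config space accesses.
 */
static int notify_xen_mmcfg(u64 address, u16 segment, u8 start_bus, u8 end_bus)
{
    struct physdev_pci_mmcfg_reserved r = {
        .address   = address,     /* physical base of the ECAM window */
        .segment   = segment,     /* PCI segment (domain) number */
        .start_bus = start_bus,
        .end_bus   = end_bus,
        .flags     = XEN_PCI_MMCFG_RESERVED,
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
}

So even if Xen scans the buses before Dom0 boots, it can refresh its view of
the extended config space once Dom0 reports the MMCFG regions this way.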


 

