
Re: [Xen-devel] [DRAFT RFC] PVHv2 interaction with physical devices



On Thu, Nov 10, 2016 at 11:39:08AM +0100, Roger Pau Monné wrote:
> On Wed, Nov 09, 2016 at 01:45:17PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Wed, Nov 09, 2016 at 04:59:12PM +0100, Roger Pau Monné wrote:
> > > Hello,
> > > 
> > > I'm attaching a draft of how a PVHv2 Dom0 is supposed to interact with
> > > physical devices, and what needs to be done inside Xen in order to
> > > achieve it. The current draft is an RFC because I'm quite sure I'm
> > > missing bits that should be written down here. So far I've tried to
> > > describe what my previous series attempted to do by adding a bunch of
> > > IO and memory space handlers.
> > > 
> > > Please note that this document only applies to PVHv2 Dom0; it is not
> > > applicable to untrusted domains, which will need more handlers in order
> > > to secure Xen and other domains running on the same system. The idea is
> > > that this can be expanded to untrusted domains in the long term, thus
> > > having a single set of IO and memory handlers for passed-through devices.
> > > 
> > > Roger.
> > > 
> > > ---8<---
> > > 
> > > This document describes how a PVHv2 Dom0 is supposed to interact with
> > > physical devices.
> > > 
> > > Architecture
> > > ============
> > > 
> > > Purpose
> > > -------
> > > 
> > > Previous Dom0 implementations have always used PIRQs (physical interrupts
> > > routed over event channels) in order to receive events from physical
> > > devices. This prevents Dom0 from taking advantage of new hardware
> > > virtualization features, like posted interrupts or a hardware virtualized
> > > local APIC. Also, the current device memory management in the PVH Dom0
> > > implementation is lacking, and might not support devices that have memory
> > > regions past the 4GB boundary.
> > 
> > memory regions meaning BAR regions?
> 
> Yes.
>  
> > > 
> > > The new PVH implementation (PVHv2) should overcome the interrupt
> > > limitations by providing the same interface that's used on bare metal
> > > (local and IO APICs), thus allowing the use of advanced hardware assisted
> > > virtualization techniques. This also aligns with the trend in the hardware
> > > industry to move part of the emulation into the silicon itself.
> > 
> > What if the hardware PVHv2 runs on does not have vAPIC support?
> 
> The emulated local APIC provided by Xen will be used.
> 
> > > 
> > > In order to improve the mapping of device memory areas, Xen will have to
> > > know about those devices in advance (before Dom0 tries to interact with
> > > them) so that the memory BARs will be properly mapped into the Dom0
> > > memory map.
> > 
> > Oh, that is going to be a problem with SR-IOV. The virtual functions (VFs)
> > are created _after_ dom0 has booted. In fact they are created by the
> > drivers themselves.
> > 
> > See xen_add_device in drivers/xen/pci.c for how this is handled.
> 
> Is the process of creating those VFs something standard? (In the sense that 
> it can be detected by Xen, and proper mappings established.)

Yes and no.

You can read from the PCI configuration space that the device (the physical
function, PF) has SR-IOV. But that capability may live in the extended
configuration registers, so you need MMCFG to reach it. Anyhow, the only
thing the PF will tell you is the BAR regions the VFs will occupy (since
they are behind the bridge), but not the BDFs:

        Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
                IOVCap: Migration-, Interrupt Message Number: 000
                IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy+
                IOVSta: Migration-
                Initial VFs: 8, Total VFs: 8, Number of VFs: 0, Function Dependency Link: 00
                VF offset: 128, stride: 2, Device ID: 10ca
                Supported Page Size: 00000553, System Page Size: 00000001
                Region 0: Memory at 00000000fbda0000 (64-bit, non-prefetchable)
                Region 3: Memory at 00000000fbd80000 (64-bit, non-prefetchable)
                VF Migration: offset: 00000000, BIR: 0
        Kernel driver in use: igb

And if I enable SR-IOV on the PF I get:

0a:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:10.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:10.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:10.6 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:11.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:11.2 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)

-bash-4.1# lspci -s 0a:10.0 -v
0a:10.0 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
        Subsystem: Super Micro Computer Inc Device 10c9
        Flags: bus master, fast devsel, latency 0
        [virtual] Memory at fbda0000 (64-bit, non-prefetchable) [size=16K]
        [virtual] Memory at fbd80000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: igbvf

-bash-4.1# lspci -s 0a:11.4 -v
0a:11.4 Ethernet controller: Intel Corporation 82576 Virtual Function (rev 01)
        Subsystem: Super Micro Computer Inc Device 10c9
        Flags: bus master, fast devsel, latency 0
        [virtual] Memory at fbdb8000 (64-bit, non-prefetchable) [size=16K]
        [virtual] Memory at fbd98000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
        Capabilities: [a0] Express Endpoint, MSI 00
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: igbvf
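
As a side note, the addresses the VFs eventually appear at can be derived from
the PF's routing ID plus the "VF offset" and "stride" fields shown above, per
the SR-IOV spec (with the caveat that those fields can change with the ARI
setting). A minimal sketch in plain C follows; the vf_rid() helper is mine,
not anything that exists in Xen or Linux, and it just reproduces the 0a:10.0,
0a:10.2, ... layout from the listing above:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: compute the routing ID (bus << 8 | devfn) of VF n
 * from the PF routing ID and the "First VF Offset" / "VF Stride" fields of
 * the SR-IOV extended capability.  Per the SR-IOV spec:
 *   VF_RID(n) = PF_RID + First_VF_Offset + n * VF_Stride   (n starting at 0)
 */
static uint16_t vf_rid(uint16_t pf_rid, uint16_t vf_offset,
                       uint16_t vf_stride, unsigned int n)
{
    return pf_rid + vf_offset + n * vf_stride;
}

int main(void)
{
    /* Values taken from the lspci dump above: PF at 0a:00.0,
     * VF offset 128, stride 2, Total VFs 8. */
    uint16_t pf = (0x0a << 8) | (0x00 << 3) | 0x0;
    unsigned int n;

    for (n = 0; n < 8; n++) {
        uint16_t rid = vf_rid(pf, 128, 2, n);
        printf("VF%u -> %02x:%02x.%x\n", n,
               rid >> 8, (rid >> 3) & 0x1f, rid & 0x7);
    }
    return 0;
}

The catch remains that the VFs only actually exist once the driver programs
NumVFs and sets VF Enable, which on Linux happens well after Dom0 has booted,
so Xen would still need either a notification from Dom0 (like the
PHYSDEVOP_pci_device_add call that xen_add_device issues) or to trap the
write to the SR-IOV control register.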


> 
> > > 
> > > The following document describes the proposed interface and implementation
> > > of all the logic needed in order to achieve the functionality described 
> > > above.
> > > 
> > > MMIO areas
> > > ==========
> > > 
> > > Overview
> > > --------
> > > 
> > > On x86 systems certain regions of memory might be used in order to manage
> > > physical devices on the system. Access to these areas is critical for a
> > > PVH Dom0 to operate properly. Unlike the previous PVH Dom0 implementation
> > > (PVHv1), which was set up with identity mappings of all the holes and
> > > reserved regions found in the memory map, this new implementation intends
> > > to map only what's actually needed by the Dom0.
> > 
> > And why was the previous approach not working?
> 
> The previous PVHv1 implementation would only identity map holes and reserved 
> areas in the guest memory map, or up to the 4GB boundary if the guest memory 
> map is smaller than 4GB. If a device has a BAR past the 4GB boundary, for 
> example, it would not be identity mapped in the p2m.
> 
> > > 
> > > Low 1MB
> > > -------
> > > 
> > > When booted with a legacy BIOS, the low 1MB contains firmware-related data
> > > that should be identity mapped to the Dom0. This includes the EBDA, video
> > > memory and possibly ROMs. All non-RAM regions below 1MB will be identity
> > > mapped to the Dom0 so that it can access this data freely.
> > > 
> > > ACPI regions
> > > ------------
> > > 
> > > ACPI regions will be identity mapped to the Dom0; this covers regions with
> > > types 3 (ACPI) and 4 (ACPI NVS) in the e820 memory map. Also, since some
> > > BIOSes report incorrect memory maps, the top-level tables discovered by
> > > Xen (as listed in the {X/R}SDT) that are not in RAM regions will be mapped
> > > to Dom0.
> > > 
> > > PCI memory BARs
> > > ---------------
> > > 
> > > PCI devices discovered by Xen will have their BARs scanned in order to
> > > detect memory BARs, and those will be identity mapped to Dom0. Since BARs
> > > can be freely moved by the Dom0 OS by writing to the appropriate PCI
> > > config space registers, Xen must trap those accesses, unmap the previous
> > > region and map the new one as set by Dom0.
> > 
> > You can make that simpler - we have hypercalls to "notify" in Linux
> > when a device is changing. Those can provide that information as well.
> > (This is what PV dom0 does).
> > 
> > Also, you are missing one important part - the MMCFG. That is required
> > for Xen to be able to poke at the PCI configuration space above the first
> > 256 bytes. And you can only get the MMCFG if the ACPI DSDT has been parsed.
> 
> Hm, I guess I'm missing something, but at least on my hardware Xen seems to 
> be able to parse the MCFG ACPI table before Dom0 does anything with the 
> DSDT:
> 
> (XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
> (XEN) PCI: MCFG area at f8000000 reserved in E820
> (XEN) PCI: Using MCFG for segment 0000 bus 00-3f
> 
> > So if you do the PCI bus scanning _before_ booting PVH dom0, you may
> > need to update your view of PCI devices after the MMCFG locations
> > have been provided to you.
> 
> I'm not opposed to keeping PHYSDEVOP_pci_mmcfg_reserved, but I have yet to 
> see hardware where this is actually needed. Also, AFAICT, FreeBSD at least 
> is only able to detect MMCFG regions present in the MCFG ACPI table:

There is some hardware out there (I think I saw this with an IBM HS-20,
but I can't recall the details). The specification says that the MMCFG
region _may_ be described in the static MCFG table, but that is not
guaranteed, which means it can also bubble up via the ACPI DSDT code.
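
FWIW, Linux already has plumbing for this: once Dom0 has parsed the DSDT and
verified that an MMCFG window is reserved, it hands the area to Xen with
PHYSDEVOP_pci_mmcfg_reserved. Below is a simplified sketch modelled on
xen_mcfg_late() in drivers/xen/pci.c; the notify_xen_mmcfg() wrapper name is
made up, and error handling plus the iteration over the kernel's MMCFG list
are omitted:

#include <linux/types.h>
#include <xen/interface/physdev.h>   /* struct physdev_pci_mmcfg_reserved */
#include <asm/xen/hypercall.h>       /* HYPERVISOR_physdev_op() */

/*
 * Simplified sketch: tell Xen that an MMCFG (ECAM) area discovered and
 * validated by Dom0 (via the MCFG table and/or the DSDT) is reserved and
 * safe to use for extended config space accesses.
 */
static int notify_xen_mmcfg(u64 address, u16 segment, u8 start_bus, u8 end_bus)
{
    struct physdev_pci_mmcfg_reserved r = {
        .address   = address,     /* physical base of the ECAM window */
        .segment   = segment,     /* PCI segment (domain) number */
        .start_bus = start_bus,
        .end_bus   = end_bus,
        .flags     = XEN_PCI_MMCFG_RESERVED,
    };

    return HYPERVISOR_physdev_op(PHYSDEVOP_pci_mmcfg_reserved, &r);
}

So even if Xen scans the buses before Dom0 boots, it can refresh its view of
the extended config space once Dom0 reports the MMCFG regions this way.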


 

