Re: [Xen-devel] [DRAFT RFC] PVHv2 interaction with physical devices



On Wed, Nov 09, 2016 at 06:51:49PM +0000, Andrew Cooper wrote:
> On 09/11/16 15:59, Roger Pau Monné wrote:
> > Low 1MB
> > -------
> >
> > When booted with a legacy BIOS, the low 1MB contains firmware-related data
> > that should be identity mapped to the Dom0. This includes the EBDA, video
> > memory and possibly ROMs. All non-RAM regions below 1MB will be identity
> > mapped to the Dom0 so that it can access this data freely.
> 
> Are you proposing a unilateral identity map of the first 1MB, or just
> the interesting regions?

The current approach identity maps the first 1MB except for RAM regions,
which are instead populated in the p2m, with the data from the original
pages copied over. This is done because the AP boot trampoline is placed in
the RAM regions below 1MB, and the emulator is not able to execute code
from pages marked as p2m_mmio_direct.
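
Roughly, the logic is something like the following sketch (the helper
names, e820_type_of(), populate_p2m_ram(), copy_page_from_host() and
map_identity_mmio(), are made up for illustration, not the actual Xen
internals):

    /* Sketch: build Dom0's view of the low 1MB from the e820 map. */
    #define MB1       0x100000UL
    #define PAGE_SIZE 4096UL

    for ( unsigned long pfn = 0; pfn < MB1 / PAGE_SIZE; pfn++ )
    {
        if ( e820_type_of(pfn << 12) == E820_RAM )
        {
            /*
             * RAM below 1MB is populated in the p2m (so the AP boot
             * trampoline can be executed from it) and the original
             * contents are copied over.
             */
            populate_p2m_ram(d, pfn);
            copy_page_from_host(d, pfn);
        }
        else
        {
            /*
             * Everything else (EBDA, video memory, ROMs) is identity
             * mapped with type p2m_mmio_direct.
             */
            map_identity_mmio(d, pfn);
        }
    }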
 
> One thing to remember is the iBFT, for iSCSI boot, which lives in
> regular RAM and needs searching for.

And I guess this is not static data that just needs to be read by the OS? 
Then I will have to look into fixing the emulator to deal with 
p2m_mmio_direct regions.

> >
> > ACPI regions
> > ------------
> >
> > ACPI regions will be identity mapped to the Dom0; this means regions with
> > types 3 (ACPI reclaimable) and 4 (ACPI NVS) in the e820 memory map. Also,
> > since some BIOSes report incorrect memory maps, the top-level tables
> > discovered by Xen (as listed in the {X/R}SDT) that are not in RAM regions
> > will be mapped to Dom0.
> >
> > PCI memory BARs
> > ---------------
> >
> > PCI devices discovered by Xen will have their BARs scanned in order to detect
> > memory BARs, and those will be identity mapped to Dom0. Since BARs can be
> > freely moved by the Dom0 OS by writing to the appropriate PCI config space
> > register, Xen must trap those accesses and unmap the previous region and
> > map the new one as set by Dom0.
> >
> > Limitations
> > -----------
> >
> >  - Xen needs to be aware of any PCI device before Dom0 tries to interact
> >    with it, so that the MMIO regions are properly mapped.
> >
> > Interrupt management
> > ====================
> >
> > Overview
> > --------
> >
> > On x86 systems there are three different mechanisms that can be used in order
> > to deliver interrupts: IO APIC, MSI and MSI-X. Note that each device might
> > support different methods, but those are never active at the same time.
> >
> > Legacy PCI interrupts
> > ---------------------
> >
> > The only way to deliver legacy PCI interrupts to PVHv2 guests is using the
> > IO APIC; PVHv2 domains don't have an emulated PIC. As a consequence, the
> > ACPI _PIC method must be set to APIC mode by the Dom0 OS.
> >
> > Xen will always provide a single IO APIC, with a number of pins that
> > matches the number of possible GSIs of the underlying hardware. This is
> > possible because ACPI uses a system-wide cookie (the GSI) in order to name
> > interrupts, so the IO APIC device ID or pin number is not used in _PRT
> > methods.
> >
> > XXX: is it possible to have more than 256 GSIs?
> 
> Yes.  There is no restriction on the number of IO-APIC in a system, and
> no restriction on the number of PCI bridges these IO-APICs serve.
> 
> However, I would suggest it would be better to offer a 1-to-1 view
> of system IO-APICs to vIO-APICs in PVHv2 dom0, or the pin mappings are
> going to get confused when reading the ACPI tables.

Hm, I've been searching for this, but it seems to me that ACPI tables will 
always use GSIs in APIC mode in order to describe interrupts, so it doesn't 
seem to matter whether those GSIs are scattered across multiple IO APICs or 
just a single one.
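
In other words, because ACPI names interrupts by GSI, a single vIO-APIC
sized to cover every host GSI makes the translation trivial. A minimal
sketch (the structure and names are illustrative, not Xen's actual
vIO-APIC code):

    /* Single emulated IO APIC covering all host GSIs. */
    struct vioapic {
        unsigned int base_gsi;  /* 0 for the single vIO-APIC */
        unsigned int nr_pins;   /* highest possible host GSI + 1 */
    };

    static unsigned int gsi_to_pin(const struct vioapic *v, unsigned int gsi)
    {
        /* No need to search across multiple IO APICs. */
        return gsi - v->base_gsi;
    }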

> >
> > The binding between the underlying physical interrupt and the emulated
> > interrupt is performed when unmasking an IO APIC pin, so writes to the
> > IOREDTBL registers that unset the mask bit will trigger this binding
> > and enable the interrupt.
> >
> > MSI Interrupts
> > --------------
> >
> > MSI interrupts are set up using the PCI config space, either the IO ports
> > or the memory mapped configuration area. This means that both spaces should
> > be trapped by Xen, in order to detect accesses to these registers and
> > properly emulate them.
> 
> cfc/cf8 need trapping unconditionally, and the MMCFG region can only be
> intercepted in units of 4k.  As a result, Xen will unconditionally see
> all config accesses anyway.

Yes, that's right (however Xen might decide to simply pass some of them
through).
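
For reference, the decode of a value written to port 0xcf8 is fixed by the
PCI specification, so the trap handler can always tell which device and
register an access targets before deciding whether to emulate it or pass it
through:

    #include <stdbool.h>
    #include <stdint.h>

    /* Layout of the 0xcf8 config address register (PCI spec). */
    static inline bool cf8_enabled(uint32_t cf8) { return cf8 & 0x80000000u; }
    static inline unsigned int cf8_bus(uint32_t cf8) { return (cf8 >> 16) & 0xff; }
    static inline unsigned int cf8_dev(uint32_t cf8) { return (cf8 >> 11) & 0x1f; }
    static inline unsigned int cf8_fn(uint32_t cf8)  { return (cf8 >> 8) & 0x7; }
    static inline unsigned int cf8_reg(uint32_t cf8) { return cf8 & 0xfc; }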

> >
> > Since the offset of the MSI registers is not fixed, Xen has to query the
> > PCI configuration space in order to find the offset of the PCI_CAP_ID_MSI,
> > and then setup the correct traps, which also vary depending on the
> > capabilities of the device.
> 
> Although only once at start-of-day.  The layout of capabilities in
> config space for a particular device is static.

Yes, the MSI capability offset is fetched at start-of-day and then
stored.
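
The scan itself is the standard capability list walk, something like the
below (sbdf_t, pci_read8() and pci_read16() stand in for the real config
space accessors; the constants are the standard PCI ones):

    #include <stdint.h>

    #define PCI_STATUS           0x06
    #define PCI_STATUS_CAP_LIST  0x10
    #define PCI_CAPABILITY_LIST  0x34
    #define PCI_CAP_ID_MSI       0x05

    static uint8_t find_msi_cap(sbdf_t sbdf)
    {
        unsigned int ttl = 48;  /* guard against malformed capability loops */
        uint8_t pos;

        if ( !(pci_read16(sbdf, PCI_STATUS) & PCI_STATUS_CAP_LIST) )
            return 0;

        pos = pci_read8(sbdf, PCI_CAPABILITY_LIST) & ~3;
        while ( pos && ttl-- )
        {
            if ( pci_read8(sbdf, pos) == PCI_CAP_ID_MSI )
                return pos;  /* cache it: the layout is static */
            pos = pci_read8(sbdf, pos + 1) & ~3;
        }

        return 0;
    }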

> >  The following list contains the set of MSI
> > registers that Xen will trap; please take into account that some devices
> > might only implement a subset of those registers, so not all traps will
> > be used:
> >
> >  - Message control register (offset 2): Xen traps accesses to this register,
> >    and stores the data written to it into an internal structure. When the OS
> >    sets the MSI enable bit (bit 0 of this register) Xen will set up the
> >    configured MSI interrupts and route them to the guest.
> >
> >  - Message address register (offset 4): writes and reads to this register
> >    are trapped by Xen, and the value is stored into an internal structure.
> >    This is later used when MSI is enabled in order to configure the vectors
> >    injected into the guest. Writes to this register with MSI already enabled
> >    will cause a reconfiguration of the binding of interrupts to the guest.
> >
> >  - Message data register (offset 8, or 12 if the message address is 64 bits):
> >    writes and reads to this register are trapped by Xen, and the value is
> >    stored into an internal structure. This is used when MSI is enabled in
> >    order to configure the vector where the guest expects to receive those
> >    interrupts. Writes to this register with MSI already enabled will cause a
> >    reconfiguration of the binding of interrupts to the guest.
> >
> >  - Mask and pending bits: reads or writes to those registers are not trapped
> >    by Xen.
> 
> These must be trapped.  In all cases, Xen must maintain the guest's idea
> of whether something is masked, and Xen's own idea.  This is necessary
> for interrupt migration.

Oh, so the mask bits must be trapped and the interrupt masked using the Xen
interrupt API then, noted.
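
Something along these lines, presumably (a sketch only; the structure and
hw_write_mask_bits() are made-up names): Xen would keep the guest's view of
the mask separate from its own, and only unmask the vector on hardware when
neither side wants it masked.

    #include <stdint.h>

    struct msi_mask_state {
        uint32_t guest_mask;  /* what the guest last wrote */
        uint32_t host_mask;   /* Xen's own masking, e.g. during migration */
    };

    static void guest_mask_write(struct msi_mask_state *s, uint32_t val)
    {
        s->guest_mask = val;
        /* Hardware (via Xen's interrupt API) sees the union of both views. */
        hw_write_mask_bits(s->guest_mask | s->host_mask);
    }

    static uint32_t guest_mask_read(const struct msi_mask_state *s)
    {
        /* The guest reads back its own view, not the combined state. */
        return s->guest_mask;
    }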

> Having said that, the entire interrupt remapping subsystem in Xen is in
> dire need of an overhaul.  It is terminally dumb and inefficient.  With
> interrupt remapping enabled, Xen should never need to touch interrupt
> sources for non-guest actions.
> 
> >
> > MSI-X Interrupts
> > ----------------
> >
> > MSI-X, in contrast with MSI, has part of its configuration registers in
> > the PCI configuration space, while the rest reside inside the memory BARs
> > of the device. So in this case Xen needs to set up traps for both the PCI
> > configuration space and two different memory regions. Xen has to query the
> > position of the MSI-X capability using the PCI_CAP_ID_MSIX, and set up a
> > handler in order to trap accesses to the different registers. Xen also has
> > to figure out the position of the MSI-X table and PBA, using the table BIR
> > and table offset, and the PBA BIR and PBA offset. Once those are known a
> > handler should also be set up in order to trap accesses to those memory
> > regions.
> >
> > This is the list of MSI-X registers that are used in order to manage MSI-X
> > in the PCI configuration space:
> >
> >  - Message control: Xen should trap accesses to this register in order to
> >    detect changes to the MSI-X enable field (bit 15). Changes to this bit
> >    will trigger the setup of the configured MSI-X table entries. Writes
> >    to the function mask bit will be passed-through to the underlying
> >    register.
> >
> >  - Table offset, table BIR, PBA offset, PBA BIR: accesses to those registers
> >    are not trapped by Xen.
> 
> These will be trapped, but are read-only so Xen needn't do anything
> exciting as part of emulation.

Right, those are read-only.
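
For completeness, locating the table and PBA from those read-only registers
works as below (msix_cap_read32() and bar_base() are illustrative helpers;
the bit layout is from the PCI spec, with the BIR in the low 3 bits and the
offset in the remaining bits):

    #define PCI_MSIX_TABLE  4  /* offset within the MSI-X capability */
    #define PCI_MSIX_PBA    8

    uint32_t tbl = msix_cap_read32(sbdf, msix_pos + PCI_MSIX_TABLE);
    uint64_t table_addr = bar_base(sbdf, tbl & 0x7) + (tbl & ~0x7u);

    uint32_t pba = msix_cap_read32(sbdf, msix_pos + PCI_MSIX_PBA);
    uint64_t pba_addr = bar_base(sbdf, pba & 0x7) + (pba & ~0x7u);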

> >
> > The following registers reside in memory, and are pointed out by the Table
> > and PBA fields found in the PCI configuration space:
> >
> >  - Message address and data: writes and reads to those registers are trapped
> >    by Xen, and the value is stored into an internal structure. This is later
> >    used by Xen in order to configure the interrupt injected to the guest.
> >    Writes to those registers with MSI-X already enabled will not cause a
> >    reconfiguration of the interrupt.
> >
> >  - Vector control: writes and reads are trapped; clearing the mask bit
> >    (bit 0) will cause Xen to set up the configured interrupt if MSI-X is
> >    globally enabled in the message control field.
> >
> >  - Pending bits array: writes and reads to this register are not trapped by
> >    Xen.
> >
> > Limitations
> > -----------
> >
> >  - Due to the fact that Dom0 is not able to parse dynamic ACPI tables,
> >    some UART devices might only function in polling mode, because Xen
> >    will be unable to properly configure the interrupt pins without Dom0
> >    collaboration, and the UART in use by Xen should be explicitly
> >    blacklisted from Dom0 access.
> 
> This reminds me that we need to include some HPET quirks in Xen as well.
> 
> There is an entire range of Nehalem era machines where Linux finds an
> HPET in the IOH via quirks alone, and not via the ACPI tables, and
> nothing in Xen currently knows to disallow this access.

Hm, if it's using quirks it's going to be hard to prevent this. At worst,
Linux is going to discover that the HPET is non-functional, at least I
assume?

Roger.


 

