
Re: [Xen-devel] [RFC PATCH 07/12] hvmloader: allocate MMCONFIG area in the MMIO hole + minor code refactoring



On Mon, 26 Mar 2018 10:24:38 +0100
Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:

>On Sat, Mar 24, 2018 at 08:32:44AM +1000, Alexey G wrote:
[...]
>> In fact, the emulated chipset (NB+SB combo without supplemental
>> devices) itself is a small part of required emulation. It's
>> relatively easy to provide own analogs of for eg. 'mch' and
>> 'ICH9-LPC' QEMU PCIDevice's, the problem is to glue all remaining
>> parts together.
>> 
>> I assume the final goal in this case is to have only a set of
>> necessary QEMU PCIDevice's for which we will be providing I/O, MMIO
>> and PCI conf trapping facilities. Only devices such as rtl8139,
>> ich9-ahci and few others.
>> 
>> Basically, this means a new, chipset-less QEMU machine type.
>> Well, in theory it is possible with a bit of effort I think. The main
>> question is where will be the NB/SB/PCIbus emulating part reside in
>> this case.  
>
>Mostly inside of Xen. Of course the IDE/SATA/USB/Ethernet... part of
>the southbridge will be emulated by a device model (ie: QEMU).
>
>As you mention above, I also took a look and it seems like the amount
>of registers that we should emulate for Q35 DRAM controller (D0:F0) is
>fairly minimal based on current QEMU implementation. We could even
>possibly get away by just emulating PCIEXBAR.

MCH emulation alone might not be enough. Besides, some
southbridge-specific features, like emulating the ACPI PM facilities
for domain power management (basically, anything at PMBASE), will be
preferable to implement on the Xen side, especially considering that
the ACPI tables are already provided by Xen's libacpi/hvmloader, not
the device model.
I think the feature will need to cover at least the NB+SB
combination, Q35 MCH + ICH9 for a start, ideally 82441FX+PIIX4 as
well. Also, Xen should control emulated/PT PCI device placement.

Before going this way, it would be good to weigh all the risks.
There seem to be two main directions currently:

I. (conservative) Let the main device model (QEMU) inform Xen about
the current chipset-specific MMCONFIG location, so that Xen knows
that some MMIO accesses to this area must be forwarded to other ioreq
servers (device emulators) in the form of PCI config read/write
ioreqs, if the BDF corresponding to a MMCONFIG offset points to a PCI
device owned by a device emulator.
For device emulators the conversion of MMIO accesses to PCI config
ones is a mandatory step, while the owner of the MMCONFIG MMIO range
may receive MMIO accesses in their native form without conversion
(a strongly preferable option for QEMU).
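For reference, the MMIO-to-PCI-config translation above follows the
standard PCIe ECAM layout. A minimal sketch of the decoding step
(names are illustrative, not actual Xen code):

```c
#include <stdint.h>

/* Decode a PCIe ECAM (MMCONFIG) offset into BDF + register offset.
 * Standard ECAM layout: bits 27:20 = bus, 19:15 = device,
 * 14:12 = function, 11:0 = config-space register offset. */
typedef struct {
    uint8_t  bus;
    uint8_t  dev;    /* 0..31 */
    uint8_t  fn;     /* 0..7  */
    uint16_t reg;    /* 0..4095 */
} ecam_addr_t;

static ecam_addr_t ecam_decode(uint64_t mmio_addr, uint64_t mmcfg_base)
{
    uint64_t off = mmio_addr - mmcfg_base;
    ecam_addr_t a = {
        .bus = (off >> 20) & 0xff,
        .dev = (off >> 15) & 0x1f,
        .fn  = (off >> 12) & 0x07,
        .reg = off & 0xfff,
    };
    return a;
}
```

The resulting BDF would then be matched against the PCI devices
claimed by each ioreq server to pick the forwarding target.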

This approach assumes introducing a new dmop/hypercall (something
like XEN_DMOP_mmcfg_location_change) to pass basic MMCONFIG
information to Xen -- address, enabled/disabled status (or simply
address=0 instead) and the size of the MMCONFIG area, eg. as a number
of buses. This information is enough to select the proper ioreq
server in Xen and to allow multiple device emulators to function
properly.
For future compatibility we can also provide the segment and
start/end bus range as arguments.
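A possible payload layout for such a dmop, purely as a sketch (the
hypercall name and all field names are hypothetical, derived from the
description above):

```c
#include <stdint.h>

/* Hypothetical XEN_DMOP_mmcfg_location_change payload (sketch only,
 * not an existing Xen interface). addr == 0 means MMCONFIG is
 * disabled. */
struct xen_dm_op_mmcfg_location_change {
    uint64_t addr;       /* base of the MMCONFIG area, 0 = disabled */
    uint16_t segment;    /* PCI segment group, 0 for now */
    uint8_t  start_bus;  /* first decoded bus number */
    uint8_t  end_bus;    /* last decoded bus number */
};

/* The decoded area size follows from the bus range: 1MB per bus. */
static inline uint64_t mmcfg_size(
    const struct xen_dm_op_mmcfg_location_change *op)
{
    return (uint64_t)(op->end_bus - op->start_bus + 1) << 20;
}
```

A full 256-bus segment would thus describe the usual 256MB area.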

What this approach will require:
--------------------------------

- new notification-style dmop/hypercall to tell Xen about the current
  emulated MMCONFIG location

- trivial changes in QEMU to use this dmop in Q35 PCIEXBAR handling code

- relatively simple Xen changes in ioreq.c to use the provided range
  for ioreq server selection, and to provide MMIO -> PCI config ioreq
  translation for supplemental ioreq servers which don't know anything
  about the emulated system

Risks:
------

Risk to break anything is minimal in this case.

If QEMU does not provide this information (eg. because an outdated
version is installed), only basic PCI config space accesses via
CF8/CFC will be forwarded to a distinct ioreq server. This means that
extended PCI config space accesses won't be forwarded to specific
device emulators. Apart from these device emulators, everything else
will continue to work properly in this case. For guest OSes without
PCIe ECAM support there will be no difference either way.

In general, no breakthrough improvements, but no negative
side-effects either. PCIe ECAM just works as expected and
compatibility with multiple ioreq servers is retained.


II. (a new feature) Move chipset emulation to Xen directly.

In this case no separate notification is necessary, as Xen will be
emulating the chosen chipset itself. The MMCONFIG location will be
known from its own PCIEXBAR emulation.
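For illustration, decoding a written PCIEXBAR value is simple. A
hedged sketch, assuming the Q35 register layout as implemented by
QEMU's mch emulation (D0:F0 config offset 0x60, bit 0 = enable,
bits 2:1 = length field); the function name is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of Q35 PCIEXBAR decoding (illustrative, not actual Xen/QEMU
 * code): bit 0 = enable, bits 2:1 = length (0 = 256MB for buses
 * 0-255, 1 = 128MB, 2 = 64MB), upper bits hold the MMCONFIG base. */
static bool pciexbar_decode(uint64_t val, uint64_t *base, uint64_t *size)
{
    static const uint64_t sizes[] = {
        256ULL << 20, 128ULL << 20, 64ULL << 20,
    };
    unsigned int len_field = (unsigned int)(val >> 1) & 3;

    if ( !(val & 1) || len_field > 2 )
        return false;       /* disabled, or reserved length encoding */

    *size = sizes[len_field];
    /* Mask off bits below the natural alignment and above bit 35. */
    *base = val & ~(*size - 1) & ((1ULL << 36) - 1);
    return true;
}
```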

QEMU will be used only to emulate a minimal set of unrelated devices
(eg. storage/network/vga), so there is less dependency on QEMU
overall.

More freedom to implement specific features in the future, like SMRAM
support for EFI firmware needs. Chipset remapping (aka reclaim)
functionality for memory relocation can be implemented under complete
Xen control, avoiding the use of unsafe add_to_physmap hypercalls.

In the future this will allow moving the passthrough-supporting code
from QEMU (hw/xen/xen-pt*.c) into Xen, merging it with Roger's vpci
series.
This will improve eg. the PT + stubdomain situation a lot -- PCI
config space accesses for PT devices will be handled in a uniform way
without Dom0 interaction.
This particular feature can be implemented for the previous approach
as well, but it is easier to do when Xen controls the emulated
machine.
In general, this is a good long-term direction.

What this approach will require:
--------------------------------

- Changes in QEMU code to support a new chipset-less machine (or
  machines). In theory this might be possible to implement on top of
  the "null" machine concept

- Major changes in Xen code to implement the actual chipset emulation
  there

- Changes on the toolstack side as the emulated machine will be
  selected and used differently

- Moving passthrough support from QEMU to Xen will likely require
  re-dividing the areas of responsibility for PCI device passthrough
  between xen-pciback and the hypervisor. It might be more convenient
  to perform some tasks of xen-pciback in Xen directly

- a strong dependency between Xen/libxl/QEMU/etc versions -- any
  outdated component will be a major problem. This can be resolved by
  providing some compatibility code

- longer implementation time

Risks:
------

- A major architecture change with possible issues encountered during
  the implementation

- Moving the emulation of the machine to Xen creates a non-zero risk
  of introducing a security issue while extending the emulation
  support further. As all emulation will take place at the most
  trusted level, any exploitable bug in the chipset emulation code may
  compromise the whole system

- there is a risk of encountering dependencies on chipset devices
  missing from QEMU. Some QEMU devices (which depend on QEMU chipset
  devices/properties) might not work without extra patches. In theory
  this can be addressed by leaving dummy MCH/LPC/pci-host devices in
  place while not forwarding any IO/MMIO/PCI conf accesses to them
  (using them simply as compat placeholders)

- risk of incompatibility with future QEMU versions

In both cases, to address security concerns, PCIEXBAR and other MCH
registers can be made write-once (RO on all further accesses, similar
to a TXT-locked system).
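Write-once semantics are trivial to implement; a sketch (illustrative
naming, not existing Xen code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Write-once register state, eg. for PCIEXBAR: the first write
 * latches the value, all further writes are ignored, so the register
 * reads as RO thereafter (similar to a TXT-locked system). */
struct wo_reg {
    uint64_t val;
    bool     locked;
};

static void wo_reg_write(struct wo_reg *r, uint64_t val)
{
    if ( !r->locked )
    {
        r->val = val;
        r->locked = true;
    }
}
```

With this in place, firmware programs the register once during boot
and the guest OS cannot relocate the area later.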

[...]
>> Regarding control of the guest memory map in the toolstack only...
>> The problem is, only firmware can see a final memory map at the
>> moment. And only the device model knows about invisible "service"
>> ranges for emulated devices, like the LFB content (aka "VRAM") when
>> it is not mapped to a guest.
>> 
>> In order to calculate the final memory/MMIO hole split, we need to
>> know:
>> 
>> 1) all PCI devices on a PCI bus. At the moment Xen contributes only
>> devices like PT to the final PCI bus (via QMP device_add). Others are
>> QEMU ones. Even Xen platform PCI device relies on QEMU emulation.
>> Non-QEMU device emulators are another source of virtual PCI devices I
>> guess.
>> 
>> 2) all chipset-specific emulated MMIO ranges. MMCONFIG is one of them
>> and largest (up to 256Mb for a segment). There are few other smaller
>> ranges, eg. Root Complex registers. All this ranges depend on the
>> emulated chipset.
>> 
>> 3) all reserved memory ranges (this one what toolstack already knows)
>> 
>> 4) all "service" guest memory ranges like backing storage for VRAM in
>> QEMU. Emulated Option ROMs should belong here too, but IIRC xen-hvm.c
>> either intentionally or by mistake handles them as emulated ranges
>> currently.
>> 
>> If we miss any of these (like what are the chipset-specific ranges
>> and their size alignment requirements) -- we're in trouble. But, if
>> we know *all* of these, we can pre-calculate the MMIO hole size.
>> Although this is a bit fragile to do from the toolstack because both
>> sizing algo in the toolstack and MMIO BAR allocation code in the
>> firmware (hvmloader) must have their algorithms synchronized,
>> because it is possible to stuff BARs into the MMIO hole in
>> different ways,
>> especially when PCI-PCI bridges will appear on the scene. Both need
>> to do it in a consistent way (resulting in similar set of gaps
>> between allocated BARs), otherwise expected MMIO hole sizes won't
>> match, which means we may need to relocate MMIO BARs to the high
>> MMIO hole and this in turn may lead to those overlaps with QEMU
>> memories.  
>
>I agree that the current memory layout management (or the lack of it)
>is concerning. Although related, I think this should be tackled as a
>different issue from the chipset one IMHO.
>
>Since you already posted the Q35 series I would attempt to get that
>done first before jumping into the memory layout one.

It is somewhat related to the chipset question, because the
memory/MMIO layout inconsistency can be solved more, well, naturally
on Q35.

Basically, we have a non-standard MMIO hole layout where the start of
the high MMIO hole does not match the top of addressable RAM (due to
invisible ranges of the device model).

Q35 natively provides facilities that allow firmware to modify (via
emulation) or discover such an MMIO hole setup, which can be used for
safe MMIO BAR allocation to avoid overlaps with QEMU-owned invisible
ranges.

It doesn't really matter which registers to pick for this task, but
for Q35 this approach is at least consistent with what a real system
does (PV/PVH people will find this peculiarity pointless, I
suppose :) ).

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
