
[Xen-devel] HVMlite ABI specification DRAFT C + implementation outline


I've Cced a bunch of people who have expressed interest in the HVMlite 
design/implementation, both from a Xen and an OS point of view. If you 
would like to be removed, please say so and I will drop you from 
further iterations. The same applies if you want to be added to the Cc.

This is an initial draft on the HVMlite design and implementation. I've 
mixed certain aspects of the design with the implementation, because I 
think we are quite tied by the implementation possibilities in certain 
aspects, so not speaking about it would make the document incomplete. I 
might be wrong on that, so feel free to comment otherwise if you would 
prefer a different approach.

The document is still not complete. I'm of course not as knowledgeable 
as some people on the Cc, so please correct me if you think there are 
mistakes or simply impossible goals.

I think I've managed to integrate all the comments from DRAFT B. I still
haven't done a s/HVMlite/PVH/, but I plan to do so once the document is
finished and ready to go inside of the Xen tree.


Xen HVMlite ABI

Boot ABI

Since the Xen entry point into the kernel can be different from the
native entry point, an `ELFNOTE` is used in order to tell the domain
builder how to load and jump into the kernel entry point:

    ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,          .long,  xen_start32)

The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
kernel supports the boot ABI described in this document.

The domain builder shall load the kernel into the guest memory space and
jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
following machine state:

 * `ebx`: contains the physical memory address where the loader has placed
   the boot start info structure.

 * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.

 * `cr4`: all bits are cleared.

 * `cs`: must be a 32-bit read/execute code segment with a base of '0'
   and a limit of '0xFFFFFFFF'. The selector value is unspecified.

 * `ds`, `es`: must be a 32-bit read/write data segment with a base of
   '0' and a limit of '0xFFFFFFFF'. The selector values are all unspecified.

 * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit of '0x67'.

 * `eflags`: all modifiable bits are clear.

All other processor registers are unspecified. The OS is in charge of setting
up its own stack, GDT and IDT.

The layout of the boot start info data is the following (pointed to by %ebx):

NOTE: nothing will be loaded at physical address 0, so a 0 value in any of the
address fields should be treated as not present.

 0 +----------------+
   | magic          | Contains the magic value 0x336ec578
   |                | ("xEn3" with the 0x80 bit of the "E" set).
 4 +----------------+
   | version        | Version of this structure. Current version is 0.
   |                | New versions are guaranteed to be backwards-compatible.
 8 +----------------+
   | flags          | SIF_xxx flags.
12 +----------------+
   | cmdline_paddr  | Physical address of the command line,
   |                | a zero-terminated ASCII string.
16 +----------------+
   | nr_modules     | Number of modules passed to the kernel.
20 +----------------+
   | modlist_paddr  | Physical address of an array of modules
   |                | (layout of the structure below).
24 +----------------+
   | rsdp_paddr     | Physical address of the RSDP ACPI data structure.
28 +----------------+

The layout of each entry in the module structure is the following:

 0 +----------------+
   | paddr          | Physical address of the module.
 8 +----------------+
   | size           | Size of the module in bytes.
16 +----------------+
   | cmdline_paddr  | Physical address of the command line,
   |                | a zero-terminated ASCII string.
24 +----------------+
   | reserved       |
32 +----------------+

Note that the address and size fields of each module are 64-bit unsigned
integers. However, Xen will always try to place all modules below the 4GiB
boundary.

Other information relevant to booting a guest kernel
(console page address, xenstore event channel...) can be obtained
using HVMPARAMS, just as is done for HVM guests.

The hypercall page is also set up in the same way as for HVM guests,
using the hypervisor cpuid leaves and MSR ranges.

Hardware description

Hardware description can come from two different sources, just like on (PV)HVM:

Description of PV devices will always come from xenbus, and in fact
xenbus is the only hardware description that is guaranteed to always be
provided to HVMlite guests.

Description of physical hardware devices will always come from ACPI; in the
absence of any physical hardware devices, ACPI tables may not be provided. The
presence of ACPI tables can be detected by finding the RSDP, just like on
bare metal.

Non-PV devices exposed to the guest

The initial idea was not to provide any emulated devices at all to an HVMlite
guest as the default option. We have however identified certain situations
where emulated devices could be interesting, both from a performance and an
ease-of-implementation point of view. The following list tries to encompass
the different scenarios identified:

 * 1. HVMlite with no emulated devices at all
   This is the current implementation inside of Xen: everything is disabled
   by default and the guest has access to the PV devices only. This is of
   course the most secure design because it has the smallest attack surface.

 * 2. HVMlite with a local APIC
   This is the default mode unless specified otherwise. It adds an emulated
   local APIC in order to ease the implementation, since the local APIC is
   mostly considered part of the CPU package these days. A minimal set of
   ACPI tables (RSDT, FADT and MADT) is mandatory in order to perform the
   hardware description.

 * 3. HVMlite with (or capable of) PCI passthrough
   The current model of PCI passthrough in PV guests is complex and requires
   heavy modifications to the guest OS. Going forward we would like to remove
   this limitation, by providing an interface that's the same as found on bare
   metal. In order to do this, at least an emulated local APIC should be
   provided to guests, together with the access to a PCI-Root complex.
   As said in the 'Hardware description' section above, this will also require
   ACPI. So this proposed scenario will require the following elements that are
   not present in the minimal (or default) HVMlite implementation: ACPI, local
   APIC, IO APIC (optional) and PCI-Root complex.

 * 4. HVMlite hardware domain
   The aim is that a HVMlite hardware domain is going to work exactly like a
   HVMlite domain with passed-through devices. This means that the domain will
   need access to the same set of emulated devices, and that some ACPI tables
   must be fixed in order to reflect the reality of the container the hardware
   domain is running on. The ACPI section contains more detailed information
   about which/how these tables are going to be fixed.

   Note that in this scenario the hardware domain will always have a local APIC
   and possibly an IO APIC (provided that the physical host also has one), and
   that the usage of many PHYSDEV operations and PIRQ event channels is going
   to be removed in favour of the bare metal mechanisms.

An administrator should be able to change the default setting using the
following tunables that are part of the xl toolstack:

 * lapic: defaults to true. Indicates whether a local APIC is provided.
 * ioapic: defaults to false. Indicates whether an IO APIC is provided
   (requires lapic and acpi set to true).
 * acpi: defaults to true. Indicates whether ACPI tables are provided.
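Assuming the proposed names land as plain booleans in the xl domain configuration, selecting the default scenario (local APIC, ACPI, no IO APIC) might look like the following hypothetical fragment:

```
# hypothetical xl guest configuration fragment
lapic  = 1   # emulated local APIC (the proposed default)
ioapic = 0   # no IO APIC (the proposed default)
acpi   = 1   # provide ACPI tables
```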

It is important to note that HVMlite guests are *never* going to have
access to the following devices: the 8259 PIC and the 8254 PIT. There's no
way to signal the absence of these devices using ACPI, so it must be assumed
that they are never present, and never will be, since they are considered
legacy hardware.


ACPI tables will be provided both to the hardware domain and to unprivileged
domains. In the case of unprivileged guests ACPI tables are going to be
created by the toolstack and will only contain the set of devices available
to the guest, which will at least be the following: local APIC and
optionally an IO APIC and passed-through device(s). In order to provide this
information from ACPI the following tables are needed as a minimum: RSDT,
FADT, MADT and DSDT. If an administrator decides to not provide a local APIC,
the MADT table is not going to be provided to the guest OS.

The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to be used
to signal guests that there's no RTC device (the Xen PV wall clock should be
used instead). It is likely that this flag is not going to be set for the
hardware domain, since it should have access to the RTC present in the host
(if there's one). The ACPI_FADT_NO_VGA is also very likely to be set in the
same boot_flags FADT field for DomUs in order to signal that there's no VGA
adapter present.

In the case of the hardware domain, Xen has traditionally passed-through the
native ACPI tables to the guest. This is something that of course we still
want to do, but in the case of HVMlite Xen will have to make sure that
the data passed in the ACPI tables to the hardware domain contain the accurate
hardware description. This means that at least certain tables will have to
be modified/mangled before being presented to the guest:

 * MADT: the number of local APIC entries needs to be fixed to match the
         number of vCPUs available to the guest. The address of the IO
         APIC(s) also needs to be fixed in order to match the emulated
         ones that we are going to provide.

 * DSDT: certain devices reported in the DSDT may not be available to the guest,
         but since the DSDT is a run-time generated table we cannot fix it. In
         order to cope with this, a STAO table will be provided that should
         be able to signal which devices are not available to the hardware
         domain. This is in line with the Xen/ACPI implementation for ARM.

 * MPST, PMTT, SBTT, SRAT and SLIT: won't be initially presented to the guest,
   until we get our act together on the vNUMA stuff.

NB: there are corner cases that I'm not sure how to solve properly. Currently
the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm aware
of the following:

 * 1. Reporting CPU PM info back to Xen: this comes from the DSDT/SSDT table,
   and since this table is only available to the hardware domain it has to
   report the PM info back to Xen so that Xen can perform proper PM.
 * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
   mixed with native ACPICA code in most OSes. This is awkward and requires
   the usage of hooks into ACPICA which we have not yet managed to upstream.

AP startup

AP startup is performed using hypercalls. The following VCPU operations
are used in order to bring up secondary vCPUs:

 * VCPUOP_initialise is used to set the initial state of the vCPU. The
   argument passed to the hypercall must be of type vcpu_hvm_context.
   See public/hvm/hvm_vcpu.h for the layout of the structure. Note that
   this hypercall allows starting the vCPU in several modes (16-, 32- or
   64-bit), regardless of the mode the BSP is currently running in.

 * VCPUOP_up is used to launch the vCPU once the initial state has been
   set using VCPUOP_initialise.

 * VCPUOP_down is used to bring down a vCPU.

 * VCPUOP_is_up is used to scan for available vCPUs.

Additionally, if a local APIC is available CPU bringup can also be performed
using the hardware native AP startup sequence (IPIs). In this case the
hypercall interface will still be provided, as a faster and more convenient
way of starting APs.

MMIO mapping

For DomUs without any devices passed through, no direct MMIO mappings will be
present in the physical memory map presented to the guest. For DomUs with
passed-through devices, the toolstack will create direct MMIO mappings as
part of the domain build process, so no action will be required
from the DomU.

For the hardware domain initial direct MMIO mappings will be set for the
following regions where applicable:

NOTE: ranges are defined using memory addresses, not pages.

 * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
   memory map at the same position. Non-legacy boots are unlikely to have
   the low 1MiB mapped 1:1, since there's nothing relevant there.

 * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
   1:1 to the guest physical memory map. There are going to be exceptions if
   Xen has to modify the tables before presenting them to the guest.

 * Any RMRR regions found by Xen will also be mapped 1:1 into the hardware
   domain physical memory map. Such mappings will be removed if the device is
   passed-through to another guest.

 * PCI Express MMCFG: no MMCFG areas will be mapped by default into the guest
   memory map. In order to have these areas mapped the hardware domain
   must use the PHYSDEVOP_pci_mmcfg_reserved hypercall. On successful return
   from this hypercall the requested MMCFG areas will be mapped 1:1 into the
   guest memory space.

 * PCI BARs: no BARs will be mapped by default into the hardware domain. In
   order to have BARs mapped, the hardware domain must issue the
   PHYSDEVOP_pci_device_add hypercall for each PCI device it finds. On
   return from this hypercall the BARs reported in the PCI configuration
   space are guaranteed to be mapped 1:1 into the guest physical memory
   map. Further changes to the position of the BARs will be intercepted by
   Xen, and the remapping will be transparent from the guest's point of
   view (ie: there will be no need to re-issue the hypercall).

Xen HVMlite implementation plan

This is of course not part of the ABI, but I guess it makes sense to add it
here so that the tasks required to make the proposed implementation above a
reality can be split up more easily. I've tried to break the tasks into
smaller sub-tasks where possible.


DomU:

 1. Initial HVMlite implementation based on an HVM guest: no emulated devices
    will be provided, and the interface will be exactly the same as a PVH
    guest except for the boot ABI.

 2. Provide ACPI tables to HVMlite guests: the initial set of provided tables
    will be: RSDT, FADT, MADT (iff local APIC is enabled).

 3. Enable the local APIC by default for HVMlite guests.

 4. Provide options to xl/libxl in order to allow admins to select the
    presence of a local APIC and IO APIC to HVMlite guests.

 5. Implement an emulated PCI Root Complex inside of Xen.

 6. Provide a DSDT table to HVMlite guests in order to signal the presence
    of PCI-passthrough devices.

IMHO, we should focus on (2) and (3) at the moment, and (4) is quite trivial
once those two are in place. (5) and (6) should be implemented once HVMlite
hardware domains are functional.

When implementing (2) it would be good to place the ACPI related code in a
place that's accessible from libxl, hvmloader and Xen itself, in order
to reduce code duplication. hvmloader already has most if not all the required
code in order to build the tables that are needed for HVMlite DomU.


Dom0:

 1. Add a new Dom0 builder specific to HVM-like domains. PV domains have
    different requirements, and sharing the same Dom0 domain builder only
    makes the code for both cases harder to read and disentangle.

 2. Implement the code required in order to mangle/modify the ACPI tables
    provided to Dom0, so that it matches the reality of the container provided
    to Dom0.

 3. Allow HVM Dom0 to use PHYSDEVOP_pci_mmcfg_reserved and
    PHYSDEVOP_pci_device_add and make sure these hypercalls add the proper
    MMIO mappings.

 4. Do the necessary wiring so that interrupts from physical devices are
    received by Dom0 using the emulated interrupt controllers (local and IO
    APICs).

This plan is not as detailed as the DomU one, since the Dom0 work is not as
advanced as the DomU work, and is also tied to the DomU implementation.
