
Re: [Xen-devel] HVMlite ABI specification DRAFT B + implementation outline

>>> On 08.02.16 at 20:03, <roger.pau@xxxxxxxxxx> wrote:
> Boot ABI
> --------
> Since the Xen entry point into the kernel can be different from the
> native entry point, an `ELFNOTE` is used in order to tell the domain
> builder how to load and jump into the kernel entry point:
>     ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,          .long,  xen_start32)
> The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
> kernel supports the boot ABI described in this document.
> The domain builder shall load the kernel into the guest memory space and
> jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
> following machine state:
>  * `ebx`: contains the physical memory address where the loader has placed
>    the boot start info structure.
>  * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.
>  * `cr4`: all bits are cleared.
>  * `cs`: must be a 32-bit read/execute code segment with a base of '0'
>    and a limit of '0xFFFFFFFF'. The selector value is unspecified.
>  * `ds`, `es`: must be a 32-bit read/write data segment with a base of
>    '0' and a limit of '0xFFFFFFFF'. The selector values are all unspecified.
>  * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit of 
> '0x67'.
>  * `eflags`: all user settable bits are clear.

The word "user" here can be mistaken. Perhaps better "all modifiable bits
are clear"?

> All other processor registers and flag bits are unspecified. The OS is in
> charge of setting up its own stack, GDT and IDT.

The "flag bits" part should now probably be dropped?

> The format of the boot start info structure is the following (pointed to
> be %ebx):

"... by %ebx"

> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of 
> the address fields should be treated as not present.
>  0 +----------------+
>    | magic          | Contains the magic value 0x336ec578
>    |                | ("xEn3" with the 0x80 bit of the "E" set).
>  4 +----------------+
>    | flags          | SIF_xxx flags.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | nr_modules     | Number of modules passed to the kernel.
> 16 +----------------+
>    | modlist_paddr  | Physical address of an array of modules
>    |                | (layout of the structure below).
> 20 +----------------+

There having been talk about extending the structure, I think we
need some indicator that the consumer can use to know which
fields are present. I.e. either a version field, another flags one,
or a size one.

> The layout of each entry in the module structure is the following:
>  0 +----------------+
>    | paddr          | Physical address of the module.
>  4 +----------------+
>    | size           | Size of the module in bytes.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | reserved       |
> 16 +----------------+

I've been thinking about this on draft A already: Do we really want
to paint ourselves into the corner of not supporting >4Gb modules,
by limiting their addresses and sizes to 32 bits?

> Hardware description
> --------------------
> Hardware description can come from two different sources, just like on 
> guests.
> Description of PV devices will always come from xenbus, and in fact
> xenbus is the only hardware description that is guaranteed to always be
> provided to HVMlite guests.
> Description of physical hardware devices will always come from ACPI, in the
> absence of any physical hardware device no ACPI tables will be provided.

This seems too strict: How about "in the absence of any physical
hardware device ACPI tables may not be provided"?

> Non-PV devices exposed to the guest
> -----------------------------------
> The initial idea was to simply not provide any emulated devices to a 
> HVMlite
> guest as the default option. We have however identified certain situations
> where emulated devices could be interesting, both from a performance and
> ease of implementation point of view. The following list tries to encompass
> the different identified scenarios:
>  * 1. HVMlite with no emulated devices at all
>    ------------------------------------------
>    This is the current implementation inside of Xen, everything is disabled
>    by default and the guest has access to the PV devices only. This is of
> course the most secure design because it has the smallest attack surface.


>  * 2. HVMlite with (or capable to) PCI-passthrough
>    -----------------------------------------------
> The current model of PCI-passthrough in PV guests is complex and requires
>    heavy modifications to the guest OS. Going forward we would like to remove
>    this limitation, by providing an interface that's the same as found on bare
>    metal. In order to do this, at least an emulated local APIC should be
>    provided to guests, together with the access to a PCI-Root complex.
>    As said in the 'Hardware description' section above, this will also require
>    ACPI. So this proposed scenario will require the following elements that 
> are
>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>    APIC, IO APIC (optional) and PCI-Root complex.

Are you reasonably convinced that the absence of an IO-APIC
won't, with LAPICs present, cause more confusion than aid to the
OSes wanting to adopt PVHv2?

>  * 3. HVMlite hardware domain
>    --------------------------
>    The aim is that a HVMlite hardware domain is going to work exactly like a
>    HVMlite domain with passed-through devices. This means that the domain will
>    need access to the same set of emulated devices, and that some ACPI tables
>    must be fixed in order to reflect the reality of the container the hardware
>    domain is running on. The ACPI section contains more detailed information
>    about which/how these tables are going to be fixed.
>    Note that in this scenario the hardware domain will *always* have a local
>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>    channels is going to be removed in favour of the bare metal mechanisms.

Do you really mean "*always*"? What about a system without IO-APIC?
Would you mean to emulate one there for no reason?

Also I think you should say "the usage of many PHYSDEV operations",
because - as we've already pointed out - some are unavoidable.

> ACPI
> ----
> ACPI tables will be provided to the hardware domain or to unprivileged
> domains. In the case of unprivileged guests ACPI tables are going to be
> created by the toolstack and will only contain the set of devices available
> to the guest, which will at least be the following: local APIC and
> optionally an IO APIC and passed-through device(s). In order to provide this
> information from ACPI the following tables are needed as a minimum: RSDT,
> FADT, MADT and DSDT. If an administrator decides to not provide a local APIC,
> the MADT table is not going to be provided to the guest OS.
> The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to be 
> used
> to signal guests that there's no RTC device (the Xen PV wall clock should be
> used instead). It is likely that this flag is not going to be set for the
> hardware domain, since it should have access to the RTC present in the host
> (if there's one). The ACPI_FADT_NO_VGA is also very likely to be set in the
> same boot_flags FADT field for DomUs in order to signal that there's no VGA
> adapter present.
> Finally the ACPI_FADT_HW_REDUCED is going to be set in the FADT flags field
> in order to signal that there are no legacy devices: i8259 PIC or i8254 PIT.
> There's no intention to enable these devices, so it is expected that the
> hardware-reduced FADT flag is always going to be set.

We'll need to be absolutely certain that use of this flag doesn't carry
any further implications.
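For clarity, the three flags under discussion live in two different FADT
fields. The flag names and values below are ACPICA's (from actbl.h); the
struct is a trimmed sketch of mine, not the real FADT layout:

```c
#include <stdint.h>
#include <stdbool.h>

/* ACPICA flag values; the struct below is a trimmed sketch of the FADT,
 * not the full actbl.h layout. */
#define ACPI_FADT_NO_VGA       (1 << 2)   /* boot_flags: VGA not present */
#define ACPI_FADT_NO_CMOS_RTC  (1 << 5)   /* boot_flags: CMOS RTC not present */
#define ACPI_FADT_HW_REDUCED   (1 << 20)  /* flags: hardware-reduced ACPI */

struct fadt_sketch {
    uint16_t boot_flags;  /* IA-PC boot architecture flags */
    uint32_t flags;       /* fixed feature flags */
};

/* A guest following this draft would fall back to the Xen PV wall clock
 * whenever the RTC is flagged as absent. */
static bool use_pv_wallclock(const struct fadt_sketch *f)
{
    return (f->boot_flags & ACPI_FADT_NO_CMOS_RTC) != 0;
}
```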

> In the case of the hardware domain, Xen has traditionally passed-through the
> native ACPI tables to the guest. This is something that of course we still
> want to do, but in the case of HVMlite Xen will have to make sure that
> the data passed in the ACPI tables to the hardware domain contain the 
> accurate
> hardware description. This means that at least certain tables will have to
> be modified/mangled before being presented to the guest:
>  * MADT: the number of local APIC entries need to be fixed to match the number
>          of vCPUs available to the guest. The address of the IO APIC(s) also
>          need to be fixed in order to match the emulated ones that we are 
> going
>          to provide.
>  * DSDT: certain devices reported in the DSDT may not be available to the 
> guest,
>          but since the DSDT is a run-time generated table we cannot fix it. In
>          order to cope with this, a STAO table will be provided that should
>          be able to signal which devices are not available to the hardware
>          domain. This is in line with the Xen/ACPI implementation for ARM.

Will STAO be sufficient for everything that may need customization?
I'm particularly worried about processor related methods in DSDT or
SSDT, which - if we're really meaning to do as you say - would need
to be limited (or extended) to the number of vCPU-s Dom0 gets.
What's even less clear to me is how you mean to deal with P-, C-,
and (once supported) T-state management for CPUs which don't
have a vCPU equivalent in Dom0.

> NB: there are corner cases that I'm not sure how to solve properly. Currently
> the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm 
> aware
> of the following:
>  * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
>    since this table is only available to the hardware domain it has to report
>    the PM info back to Xen so that Xen can perform proper PM.
>  * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
>    mixed with native ACPICA code in most OSes. This is awkward and requires
>    the usage of hooks into ACPICA which we have not yet managed to upstream.

Iirc shutdown doesn't require any custom patches anymore in Linux.

>  * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
>    intrusive in general, so I'm not that pushed to remove it. It's generally
>    easy in any OS to add some kind of hook that's executed every time a PCI
>    device is discovered.
>  * 4. Report PCI memory-mapped configuration areas to Xen: my opinion 
> regarding
>    this one is the same as (3), it's not really intrusive so I'm not very
>    pushed to remove it.

As said in another reply - for both of these, we just can't remove the
reporting to Xen.

> MMIO mapping
> ------------
> For DomUs without any device passed-through no direct MMIO mappings will be
> present in the physical memory map presented to the guest. For DomUs with
> devices passed-though the toolstack will create direct MMIO mappings as
> part of the domain build process, and thus no action will be required
> from the DomU.
> For the hardware domain initial direct MMIO mappings will be set for the
> following regions:
> NOTE: ranges are defined using memory addresses, not pages.
>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>    memory map at the same position.
>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>    guest physical memory.

When have you last seen a machine with a hole right below the
16Mb boundary?

>  * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
>    1:1 to the guest physical memory map. There are going to be exceptions if
>    Xen has to modify the tables before presenting them to the guest.
>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>    time they will also be made available to the guest at the same position
>    in its physical memory map. It is possible that Xen will trap accesses to
>    those regions, but a guest should be able to use the native configuration
>    mechanism in order to interact with this configuration space. If the
>    hardware domain reports the presence of any of those regions using the
>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>    them.

s/all guest/allow Dom0/ in this last sentence?
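As a sanity check on the list above, the statically known hardware domain
identity ranges could be expressed as below (my sketch; the ACPI, MMCFG
and BAR ranges are discovered from the host at run time and hence are not
listed):

```c
#include <stdint.h>
#include <stdbool.h>

/* Initial 1:1 MMIO ranges for the hardware domain, per the list above.
 * Ranges are [start, end] memory addresses, not page numbers. */
struct mmio_range { uint64_t start, end; };

static const struct mmio_range hwdom_identity[] = {
    { 0x0,      0xFFFFF  },  /* low 1MiB */
    { 0xF00000, 0xFFFFFF },  /* ISA memory hole */
    /* E820_ACPI/E820_NVS regions and MMCFG are added from the host map. */
};

static bool is_identity_mapped(uint64_t addr)
{
    for (unsigned i = 0;
         i < sizeof(hwdom_identity) / sizeof(hwdom_identity[0]); i++)
        if (addr >= hwdom_identity[i].start && addr <= hwdom_identity[i].end)
            return true;
    return false;
}
```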

>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>    the PCI devices without hardware domain interaction. In order to have
>    the BARs of PCI devices properly mapped the hardware domain needs to
>    call the PHYSDEVOP_pci_device_add hypercall, that will take care of setting
>    up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
>    procedure will be transparent from the guest's point of view, and upon
>    returning
>    from the hypercall mappings must be already established.

I'm not sure this can work, as it imposes restrictions on the ordering
of operations internal of the Dom0 OS: Successfully having probed
for a PCI device (and hence reporting its presence to Xen) doesn't
imply its BARs have already got set up. Together with the possibility
of the OS re-assigning BARs I think we will actually need another
hypercall, or the same device-add hypercall may need to be issued
more than once per device (i.e. also every time any BAR assignment
got changed).
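To illustrate the re-reporting idea: the real interface is struct
physdev_pci_device_add in xen/include/public/physdev.h; the trimmed
struct and hook below are my own sketch of what a Dom0 would call both
at discovery time and again after any BAR reassignment:

```c
#include <stdint.h>

/* Standard PCI devfn encoding: device in bits 7:3, function in bits 2:0. */
#define PCI_DEVFN(dev, fn)  ((((dev) & 0x1f) << 3) | ((fn) & 0x07))

/* Sketch only: the real hypercall argument is struct
 * physdev_pci_device_add in xen/include/public/physdev.h. */
struct pci_device_id {
    uint16_t seg;    /* PCI segment (domain) */
    uint8_t  bus;
    uint8_t  devfn;
};

/* Hypothetical Dom0 hook: issued when a device is discovered, and again
 * whenever one of its BARs is (re)assigned, so Xen can redo the 1:1
 * mappings - addressing the ordering concern above. */
static int report_device(const struct pci_device_id *dev)
{
    (void)dev;  /* would issue PHYSDEVOP_pci_device_add here */
    return 0;
}
```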


Xen-devel mailing list