
Re: [Xen-devel] HVMlite ABI specification DRAFT B + implementation outline

>>> On 08.02.16 at 20:03, <roger.pau@xxxxxxxxxx> wrote:
> Boot ABI
> --------
> Since the Xen entry point into the kernel can be different from the
> native entry point, an `ELFNOTE` is used in order to tell the domain
> builder how to load and jump into the kernel entry point:
>     ELFNOTE(Xen, XEN_ELFNOTE_PHYS32_ENTRY,          .long,  xen_start32)
> The presence of the `XEN_ELFNOTE_PHYS32_ENTRY` note indicates that the
> kernel supports the boot ABI described in this document.
> The domain builder shall load the kernel into the guest memory space and
> jump into the entry point defined at `XEN_ELFNOTE_PHYS32_ENTRY` with the
> following machine state:
>  * `ebx`: contains the physical memory address where the loader has placed
>    the boot start info structure.
>  * `cr0`: bit 0 (PE) must be set. All the other writeable bits are cleared.
>  * `cr4`: all bits are cleared.
>  * `cs`: must be a 32-bit read/execute code segment with a base of '0'
>    and a limit of '0xFFFFFFFF'. The selector value is unspecified.
>  * `ds`, `es`: must be a 32-bit read/write data segment with a base of
>    '0' and a limit of '0xFFFFFFFF'. The selector values are all unspecified.
>  * `tr`: must be a 32-bit TSS (active) with a base of '0' and a limit of 
> '0x67'.
>  * `eflags`: all user settable bits are clear.

The word "user" here can be mistaken. Perhaps better "all modifiable bits
are clear"?

> All other processor registers and flag bits are unspecified. The OS is in
> charge of setting up its own stack, GDT and IDT.

The "flag bits" part should now probably be dropped?

> The format of the boot start info structure is the following (pointed to
> be %ebx):

"... by %ebx"

> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of 
> the address fields should be treated as not present.
>  0 +----------------+
>    | magic          | Contains the magic value 0x336ec578
>    |                | ("xEn3" with the 0x80 bit of the "E" set).
>  4 +----------------+
>    | flags          | SIF_xxx flags.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | nr_modules     | Number of modules passed to the kernel.
> 16 +----------------+
>    | modlist_paddr  | Physical address of an array of modules
>    |                | (layout of the structure below).
> 20 +----------------+

There having been talk about extending the structure, I think we
need some indicator that the consumer can use to know which
fields are present. I.e. either a version field, another flags one,
or a size one.

> The layout of each entry in the module structure is the following:
>  0 +----------------+
>    | paddr          | Physical address of the module.
>  4 +----------------+
>    | size           | Size of the module in bytes.
>  8 +----------------+
>    | cmdline_paddr  | Physical address of the command line,
>    |                | a zero-terminated ASCII string.
> 12 +----------------+
>    | reserved       |
> 16 +----------------+

I've been thinking about this on draft A already: Do we really want
to paint ourselves into the corner of not supporting >4Gb modules,
by limiting their addresses and sizes to 32 bits?

> Hardware description
> --------------------
> Hardware description can come from two different sources, just like on 
> guests.
> Description of PV devices will always come from xenbus, and in fact
> xenbus is the only hardware description that is guaranteed to always be
> provided to HVMlite guests.
> Description of physical hardware devices will always come from ACPI, in the
> absence of any physical hardware device no ACPI tables will be provided.

This seems too strict: How about "in the absence of any physical
hardware device ACPI tables may not be provided"?

> Non-PV devices exposed to the guest
> -----------------------------------
> The initial idea was to simply not provide any emulated devices to a 
> HVMlite
> guest as the default option. We have however identified certain situations
> where emulated devices could be interesting, both from a performance and
> ease of implementation point of view. The following list tries to encompass
> the different identified scenarios:
>  * 1. HVMlite with no emulated devices at all
>    ------------------------------------------
>    This is the current implementation inside of Xen, everything is disabled
>    by default and the guest has access to the PV devices only. This is of
> course the most secure design because it has the smallest attack surface.


>  * 2. HVMlite with (or capable to) PCI-passthrough
>    -----------------------------------------------
> The current model of PCI-passthrough in PV guests is complex and requires
>    heavy modifications to the guest OS. Going forward we would like to remove
>    this limitation, by providing an interface that's the same as found on bare
>    metal. In order to do this, at least an emulated local APIC should be
>    provided to guests, together with the access to a PCI-Root complex.
>    As said in the 'Hardware description' section above, this will also require
>    ACPI. So this proposed scenario will require the following elements that 
> are
>    not present in the minimal (or default) HVMlite implementation: ACPI, local
>    APIC, IO APIC (optional) and PCI-Root complex.

Are you reasonably convinced that the absence of an IO-APIC
won't, with LAPICs present, cause more confusion than aid to the
OSes wanting to adopt PVHv2?

>  * 3. HVMlite hardware domain
>    --------------------------
>    The aim is that a HVMlite hardware domain is going to work exactly like a
>    HVMlite domain with passed-through devices. This means that the domain will
>    need access to the same set of emulated devices, and that some ACPI tables
>    must be fixed in order to reflect the reality of the container the hardware
>    domain is running on. The ACPI section contains more detailed information
>    about which/how these tables are going to be fixed.
>    Note that in this scenario the hardware domain will *always* have a local
>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>    channels is going to be removed in favour of the bare metal mechanisms.

Do you really mean "*always*"? What about a system without IO-APIC?
Would you mean to emulate one there for no reason?

Also I think you should say "the usage of many PHYSDEV operations",
because - as we've already pointed out - some are unavoidable.

> ACPI
> ----
> ACPI tables will be provided to the hardware domain or to unprivileged
> domains. In the case of unprivileged guests ACPI tables are going to be
> created by the toolstack and will only contain the set of devices available
> to the guest, which will at least be the following: local APIC and
> optionally an IO APIC and passed-through device(s). In order to provide this
> information from ACPI the following tables are needed as a minimum: RSDT,
> FADT, MADT and DSDT. If an administrator decides to not provide a local APIC,
> the MADT table is not going to be provided to the guest OS.
> The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to be 
> used
> to signal guests that there's no RTC device (the Xen PV wall clock should be
> used instead). It is likely that this flag is not going to be set for the
> hardware domain, since it should have access to the RTC present in the host
> (if there's one). The ACPI_FADT_NO_VGA is also very likely to be set in the
> same boot_flags FADT field for DomUs in order to signal that there's no VGA
> adapter present.
> Finally the ACPI_FADT_HW_REDUCED is going to be set in the FADT flags field
> in order to signal that there are no legacy devices: i8259 PIC or i8254 PIT.
> There's no intention to enable these devices, so it is expected that the
> hardware-reduced FADT flag is always going to be set.

We'll need to be absolutely certain that use of this flag doesn't carry
any further implications.
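For clarity, the three flags under discussion live in two different FADT
fields. The flag names and values below are ACPICA's (from actbl.h); the
struct is a trimmed sketch of mine, not the real FADT layout:

```c
#include <stdint.h>
#include <stdbool.h>

/* ACPICA flag values; the struct below is a trimmed sketch of the FADT,
 * not the full actbl.h layout. */
#define ACPI_FADT_NO_VGA       (1 << 2)   /* boot_flags: VGA not present */
#define ACPI_FADT_NO_CMOS_RTC  (1 << 5)   /* boot_flags: CMOS RTC not present */
#define ACPI_FADT_HW_REDUCED   (1 << 20)  /* flags: hardware-reduced ACPI */

struct fadt_sketch {
    uint16_t boot_flags;  /* IA-PC boot architecture flags */
    uint32_t flags;       /* fixed feature flags */
};

/* A guest following this draft would fall back to the Xen PV wall clock
 * whenever the RTC is flagged as absent. */
static bool use_pv_wallclock(const struct fadt_sketch *f)
{
    return (f->boot_flags & ACPI_FADT_NO_CMOS_RTC) != 0;
}
```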

> In the case of the hardware domain, Xen has traditionally passed-through the
> native ACPI tables to the guest. This is something that of course we still
> want to do, but in the case of HVMlite Xen will have to make sure that
> the data passed in the ACPI tables to the hardware domain contain the 
> accurate
> hardware description. This means that at least certain tables will have to
> be modified/mangled before being presented to the guest:
>  * MADT: the number of local APIC entries need to be fixed to match the number
>          of vCPUs available to the guest. The address of the IO APIC(s) also
>          need to be fixed in order to match the emulated ones that we are 
> going
>          to provide.
>  * DSDT: certain devices reported in the DSDT may not be available to the 
> guest,
>          but since the DSDT is a run-time generated table we cannot fix it. In
>          order to cope with this, a STAO table will be provided that should
>          be able to signal which devices are not available to the hardware
>          domain. This is in line with the Xen/ACPI implementation for ARM.

Will STAO be sufficient for everything that may need customization?
I'm particularly worried about processor related methods in DSDT or
SSDT, which - if we're really meaning to do as you say - would need
to be limited (or extended) to the number of vCPU-s Dom0 gets.
What's even less clear to me is how you mean to deal with P-, C-,
and (once supported) T-state management for CPUs which don't
have a vCPU equivalent in Dom0.

> NB: there are corner cases that I'm not sure how to solve properly. Currently
> the hardware domain has some 'hacks' regarding ACPI and Xen. At least I'm 
> aware
> of the following:
>  * 1. Reporting CPU PM info back to Xen: this comes from the DSDT table, and
>    since this table is only available to the hardware domain it has to report
>    the PM info back to Xen so that Xen can perform proper PM.
>  * 2. Doing proper shutdown (S5) requires the usage of a hypercall, which is
>    mixed with native ACPICA code in most OSes. This is awkward and requires
>    the usage of hooks into ACPICA which we have not yet managed to upstream.

Iirc shutdown doesn't require any custom patches anymore in Linux.

>  * 3. Reporting the PCI devices it finds to the hypervisor: this is not very
>    intrusive in general, so I'm not that pushed to remove it. It's generally
>    easy in any OS to add some kind of hook that's executed every time a PCI
>    device is discovered.
>  * 4. Report PCI memory-mapped configuration areas to Xen: my opinion 
> regarding
>    this one is the same as (3), it's not really intrusive so I'm not very
>    pushed to remove it.

As said in another reply - for both of these, we just can't remove the
reporting to Xen.

> MMIO mapping
> ------------
> For DomUs without any device passed-through no direct MMIO mappings will be
> present in the physical memory map presented to the guest. For DomUs with
> devices passed-though the toolstack will create direct MMIO mappings as
> part of the domain build process, and thus no action will be required
> from the DomU.
> For the hardware domain initial direct MMIO mappings will be set for the
> following regions:
> NOTE: ranges are defined using memory addresses, not pages.
>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>    memory map at the same position.
>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>    guest physical memory.

When have you last seen a machine with a hole right below the
16Mb boundary?

>  * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
>    1:1 to the guest physical memory map. There are going to be exceptions if
>    Xen has to modify the tables before presenting them to the guest.
>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>    time they will also be made available to the guest at the same position
>    in its physical memory map. It is possible that Xen will trap accesses to
>    those regions, but a guest should be able to use the native configuration
>    mechanism in order to interact with this configuration space. If the
>    hardware domain reports the presence of any of those regions using the
>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also all guest access to
>    them.

s/all guest/allow Dom0/ in this last sentence?
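As a sanity check on the list above, the statically known hardware domain
identity ranges could be expressed as below (my sketch; the ACPI, MMCFG
and BAR ranges are discovered from the host at run time and hence are not
listed):

```c
#include <stdint.h>
#include <stdbool.h>

/* Initial 1:1 MMIO ranges for the hardware domain, per the list above.
 * Ranges are [start, end] memory addresses, not page numbers. */
struct mmio_range { uint64_t start, end; };

static const struct mmio_range hwdom_identity[] = {
    { 0x0,      0xFFFFF  },  /* low 1MiB */
    { 0xF00000, 0xFFFFFF },  /* ISA memory hole */
    /* E820_ACPI/E820_NVS regions and MMCFG are added from the host map. */
};

static bool is_identity_mapped(uint64_t addr)
{
    for (unsigned i = 0;
         i < sizeof(hwdom_identity) / sizeof(hwdom_identity[0]); i++)
        if (addr >= hwdom_identity[i].start && addr <= hwdom_identity[i].end)
            return true;
    return false;
}
```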

>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>    the PCI devices without hardware domain interaction. In order to have
>    the BARs of PCI devices properly mapped the hardware domain needs to
>    call the PHYSDEVOP_pci_device_add hypercall, that will take care of setting
>    up the BARs in the guest physical memory map using 1:1 MMIO mappings. This
>    procedure will be transparent from the guest's point of view, and upon
>    returning
>    from the hypercall mappings must be already established.

I'm not sure this can work, as it imposes restrictions on the ordering
of operations internal of the Dom0 OS: Successfully having probed
for a PCI device (and hence reporting its presence to Xen) doesn't
imply its BARs have already got set up. Together with the possibility
of the OS re-assigning BARs I think we will actually need another
hypercall, or the same device-add hypercall may need to be issued
more than once per device (i.e. also every time any BAR assignment
got changed).
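To illustrate the re-reporting idea: the real interface is struct
physdev_pci_device_add in xen/include/public/physdev.h; the trimmed
struct and hook below are my own sketch of what a Dom0 would call both
at discovery time and again after any BAR reassignment:

```c
#include <stdint.h>

/* Standard PCI devfn encoding: device in bits 7:3, function in bits 2:0. */
#define PCI_DEVFN(dev, fn)  ((((dev) & 0x1f) << 3) | ((fn) & 0x07))

/* Sketch only: the real hypercall argument is struct
 * physdev_pci_device_add in xen/include/public/physdev.h. */
struct pci_device_id {
    uint16_t seg;    /* PCI segment (domain) */
    uint8_t  bus;
    uint8_t  devfn;
};

/* Hypothetical Dom0 hook: issued when a device is discovered, and again
 * whenever one of its BARs is (re)assigned, so Xen can redo the 1:1
 * mappings - addressing the ordering concern above. */
static int report_device(const struct pci_device_id *dev)
{
    (void)dev;  /* would issue PHYSDEVOP_pci_device_add here */
    return 0;
}
```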


Xen-devel mailing list