
Re: [Xen-devel] HVMlite ABI specification DRAFT B + implementation outline



>>> On 10.02.16 at 13:01, <roger.pau@xxxxxxxxxx> wrote:
> El 9/2/16 a les 14:24, Jan Beulich ha escrit:
>>>>> On 08.02.16 at 20:03, <roger.pau@xxxxxxxxxx> wrote:
>>> The layout of each entry in the module structure is the following:
>>>
>>>  0 +----------------+
>>>    | paddr          | Physical address of the module.
>>>  4 +----------------+
>>>    | size           | Size of the module in bytes.
>>>  8 +----------------+
>>>    | cmdline_paddr  | Physical address of the command line,
>>>    |                | a zero-terminated ASCII string.
>>> 12 +----------------+
>>>    | reserved       |
>>> 16 +----------------+
>> 
>> I've been thinking about this since draft A already: do we really want
>> to paint ourselves into the corner of not supporting >4GB modules
>> by limiting their addresses and sizes to 32 bits?
> 
> Hm, that's a tricky question. TBH I doubt we are going to see modules
> >4GB ATM, but maybe in the future this will no longer hold.
> 
> I wouldn't mind making all the fields in the module structure 64 bits,
> but I think we should then spell out that Xen will always try to place
> the modules below the 4GiB boundary when possible.

Sounds reasonable.
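
To make the agreed change concrete, here is a minimal sketch of what a
widened module entry could look like, assuming all fields become 64 bits
as discussed; the struct and field names are illustrative, not part of
the spec:

    /* Illustrative sketch only: one entry of the proposed module list,
     * with all fields widened to 64 bits as discussed above.
     * Struct and field names are hypothetical, not taken from the spec. */
    #include <stdint.h>

    struct hvm_module_entry {
        uint64_t paddr;          /* Physical address of the module. */
        uint64_t size;           /* Size of the module in bytes. */
        uint64_t cmdline_paddr;  /* Physical address of the zero-terminated
                                  * ASCII command line. */
        uint64_t reserved;       /* Must be zero. */
    };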

>>>  * 2. HVMlite with (or capable of) PCI-passthrough
>>>    -----------------------------------------------
>>>    The current model of PCI-passthrough in PV guests is complex and
>>>    requires heavy modifications to the guest OS. Going forward we would
>>>    like to remove this limitation by providing an interface that's the
>>>    same as found on bare metal. In order to do this, at least an emulated
>>>    local APIC should be provided to guests, together with access to a
>>>    PCI-Root complex. As said in the 'Hardware description' section above,
>>>    this will also require ACPI. So this proposed scenario will require
>>>    the following elements that are not present in the minimal (or
>>>    default) HVMlite implementation: ACPI, local APIC, IO APIC (optional)
>>>    and PCI-Root complex.
>> 
>> Are you reasonably convinced that the absence of an IO-APIC
>> won't, with LAPICs present, cause more confusion than aid to the
>> OSes wanting to adopt PVHv2?
> 
> As long as the data provided in the MADT represents the container
> provided to the guest, I think we should be fine. If there's no IO APIC,
> no entries of type 1 (IO APIC) will be present in the MADT.

I understand that, but are you certain OSes are prepared for that?
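
For reference, a type 1 (IO APIC) MADT entry has the layout below,
following the ACPI spec; when the guest has no IO APIC the toolstack
simply emits no such entries. The struct name and C representation are
just an illustrative sketch:

    /* Sketch of an ACPI MADT type 1 (I/O APIC) entry, layout per the
     * ACPI spec. Struct name and representation are illustrative only. */
    #include <stdint.h>

    struct acpi_madt_io_apic {
        uint8_t  type;        /* 1 = I/O APIC */
        uint8_t  length;      /* 12 */
        uint8_t  io_apic_id;
        uint8_t  reserved;
        uint32_t address;     /* Physical address of the I/O APIC */
        uint32_t gsi_base;    /* Global System Interrupt base */
    } __attribute__((packed));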

>>>  * 3. HVMlite hardware domain
>>>    --------------------------
>>>    The aim is that an HVMlite hardware domain is going to work exactly
>>>    like an HVMlite domain with passed-through devices. This means that
>>>    the domain will need access to the same set of emulated devices, and
>>>    that some ACPI tables must be fixed in order to reflect the reality of
>>>    the container the hardware domain is running on. The ACPI section
>>>    contains more detailed information about which/how these tables are
>>>    going to be fixed.
>>>
>>>    Note that in this scenario the hardware domain will *always* have a
>>>    local APIC and IO APIC, and that the usage of PHYSDEV operations and
>>>    PIRQ event channels is going to be removed in favour of the bare metal
>>>    mechanisms.
>> 
>> Do you really mean "*always*"? What about a system without IO-APIC?
>> Would you mean to emulate one there for no reason?
> 
> Oh, a real system without an IO APIC. No, then we wouldn't provide one
> to the hardware domain, since it makes no sense.

I.e. the above should say "... will always have local APICs and IO-APICs
mirroring the physical machine's, ..." or something equivalent.

>>> ACPI
>>> ----
>>>
>>> ACPI tables will be provided to the hardware domain and to unprivileged
>>> domains. In the case of unprivileged guests ACPI tables are going to be
>>> created by the toolstack and will only contain the set of devices
>>> available to the guest, which will at least be the following: a local
>>> APIC and optionally an IO APIC and passed-through device(s). In order to
>>> provide this information from ACPI the following tables are needed as a
>>> minimum: RSDT, FADT, MADT and DSDT. If an administrator decides not to
>>> provide a local APIC, the MADT table is not going to be provided to the
>>> guest OS.
>>>
>>> The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to
>>> be used to signal guests that there's no RTC device (the Xen PV wall
>>> clock should be used instead). It is likely that this flag is not going
>>> to be set for the hardware domain, since it should have access to the
>>> RTC present in the host (if there's one). The ACPI_FADT_NO_VGA flag is
>>> also very likely to be set in the same boot_flags FADT field for DomUs
>>> in order to signal that there's no VGA adapter present.
>>>
>>> Finally, the ACPI_FADT_HW_REDUCED flag is going to be set in the FADT
>>> flags field in order to signal that there are no legacy devices: i8259
>>> PIC or i8254 PIT. There's no intention to enable these devices, so it is
>>> expected that the hardware-reduced FADT flag is always going to be set.
>> 
>> We'll need to be absolutely certain that use of this flag doesn't carry
>> any further implications.
> 
> No, after taking a closer look at the ACPI spec I don't think we can use
> this flag. It has some implications that wouldn't be true, for example:
> 
>  - UEFI must be used for boot.
>  - Sleep state entering is different. Using SLEEP_CONTROL_REG and
> SLEEP_STATUS_REG instead of SLP_TYP, SLP_EN and WAK_STS. This of course
> is not something that we can decide for Dom0.
> 
> And there are more implications which I think would not hold in our case.
> 
> So are we just going to say that HVMlite systems will never have an i8259
> PIC or i8254 PIT? Because I don't see a proper way to report this using
> standard ACPI fields.

I think so, yes.
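
For the flags that do remain usable, a toolstack building a DomU FADT
might set the IA-PC boot architecture bits roughly as sketched below; the
bit values follow the ACPI spec's IAPC_BOOT_ARCH definitions, while the
helper function and its surroundings are hypothetical:

    /* Illustrative sketch: setting the IA-PC boot architecture flags in a
     * DomU FADT. Bit positions follow the ACPI spec ("IAPC_BOOT_ARCH");
     * the surrounding toolstack code is hypothetical. */
    #include <stdint.h>

    #define ACPI_FADT_NO_VGA        (1U << 2)  /* VGA Not Present */
    #define ACPI_FADT_NO_CMOS_RTC   (1U << 5)  /* CMOS RTC Not Present */

    static void set_domu_boot_flags(uint16_t *boot_flags)
    {
        /* No emulated RTC: guests should use the Xen PV wall clock. */
        *boot_flags |= ACPI_FADT_NO_CMOS_RTC;
        /* No emulated VGA adapter for DomUs. */
        *boot_flags |= ACPI_FADT_NO_VGA;
        /* Note: ACPI_FADT_HW_REDUCED (a bit in the separate "Flags" field)
         * is deliberately not set, per the discussion above. */
    }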

>>> MMIO mapping
>>> ------------
>>>
>>> For DomUs without any device passed-through no direct MMIO mappings will be
>>> present in the physical memory map presented to the guest. For DomUs with
>>> devices passed-through the toolstack will create direct MMIO mappings as
>>> part of the domain build process, and thus no action will be required
>>> from the DomU.
>>>
>>> For the hardware domain initial direct MMIO mappings will be set for the
>>> following regions:
>>>
>>> NOTE: ranges are defined using memory addresses, not pages.
>>>
>>>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>>>    memory map at the same position.
>>>
>>>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>>>    guest physical memory.
>> 
>> When have you last seen a machine with a hole right below the
>> 16Mb boundary?
> 
> Right, I will remove this. Even my old Nehalem boxes (the first Intel
> architecture with an IOMMU, IIRC) don't have such a hole.
> 
> Should I also mention RMRR?
> 
>   * Any RMRR regions reported will also be mapped 1:1 to Dom0.

That's a good idea, yes. But please make explicit that such
mappings will go away together with the removal of devices (for
pass-through purposes) from Dom0.
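
Taken together, the initial Dom0 mappings discussed here would
conceptually be set up as in the sketch below; map_identity_mmio(),
p2m_map_1to1() and map_rmrr_regions_1to1() are hypothetical stand-ins for
whatever p2m/IOMMU interfaces the implementation ends up using:

    /* Conceptual sketch of the initial Dom0 identity mappings discussed
     * above. All helpers are hypothetical stand-ins. */
    #include <stdint.h>

    struct domain;                                       /* opaque here */
    int p2m_map_1to1(struct domain *d, uint64_t gfn, uint64_t nr_pages);
    int map_rmrr_regions_1to1(struct domain *d);

    #define PAGE_SHIFT 12

    /* Map [start, end] (inclusive byte addresses) 1:1 into the guest. */
    static int map_identity_mmio(struct domain *d, uint64_t start,
                                 uint64_t end)
    {
        uint64_t gfn = start >> PAGE_SHIFT;
        uint64_t nr  = (end >> PAGE_SHIFT) - gfn + 1;

        return p2m_map_1to1(d, gfn, nr);
    }

    static int setup_dom0_mmio(struct domain *d)
    {
        int rc;

        /* Low 1MiB mapped into the guest at the same position. */
        rc = map_identity_mmio(d, 0x0, 0xFFFFF);
        if ( rc )
            return rc;

        /* RMRR regions reported by the IOMMU are also mapped 1:1; these
         * mappings go away again when the device they belong to is
         * removed from Dom0 for pass-through. */
        return map_rmrr_regions_1to1(d);
    }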

>>>  * PCI BARs: it's not possible for Xen to know the position of the BARs
>>>    of the PCI devices without hardware domain interaction. In order to
>>>    have the BARs of PCI devices properly mapped, the hardware domain
>>>    needs to call the PHYSDEVOP_pci_device_add hypercall, which will take
>>>    care of setting up the BARs in the guest physical memory map using
>>>    1:1 MMIO mappings. This procedure will be transparent from the
>>>    guest's point of view, and upon returning from the hypercall the
>>>    mappings must already be established.
>> 
>> I'm not sure this can work, as it imposes restrictions on the ordering
>> of operations internal to the Dom0 OS: successfully having probed
>> for a PCI device (and hence reported its presence to Xen) doesn't
>> imply its BARs have already been set up. Together with the possibility
>> of the OS re-assigning BARs, I think we will actually need another
>> hypercall, or the same device-add hypercall may need to be issued
>> more than once per device (i.e. also every time any BAR assignment
>> changes).
> 
> We already trap accesses to 0xcf8/0xcfc, so can't we detect BAR
> reassignments and act accordingly, changing the MMIO mapping?
>
> I was thinking that we could do the initial mapping at the current
> position when issuing the hypercall, and then detect further changes and
> perform remapping if needed, but maybe I'm missing something again that
> makes this approach infeasible.

I think that's certainly possible, but will require quite a bit of care
when implementing. (In fact this way I think we could then also
observe bus renumbering, without requiring Dom0 to remove and
then re-add all affected devices. Konrad - what do you think?)
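
A very rough sketch of how the existing 0xcf8/0xcfc intercept could
notice BAR writes and trigger a remap is below; every name in it is
hypothetical, and real code would additionally need to handle
size-probing writes properly, 64-bit and I/O BARs, and the bus
renumbering mentioned above:

    /* Rough sketch of BAR-write detection in the 0xcf8/0xcfc intercept.
     * All names are hypothetical; real code needs considerably more care
     * (size probes, 64-bit/I/O BARs, sub-dword accesses, renumbering). */
    #include <stdint.h>

    struct domain;
    /* Hypothetical helper that re-establishes the 1:1 MMIO mapping. */
    void remap_bar(struct domain *d, uint16_t bdf, unsigned int bar,
                   uint64_t new_addr);

    static uint32_t cf8_latch;   /* last value written to 0xcf8 */

    static void handle_cf8_write(uint32_t val)
    {
        cf8_latch = val;
    }

    static void handle_cfc_write(struct domain *d, uint32_t val)
    {
        unsigned int reg = cf8_latch & 0xfc;
        uint16_t     bdf = (cf8_latch >> 8) & 0xffff;   /* bus/dev/fn */

        /* BARs live at config space offsets 0x10-0x24 (type 0 header). */
        if ( (cf8_latch & 0x80000000) && reg >= 0x10 && reg <= 0x24 )
        {
            unsigned int bar = (reg - 0x10) / 4;

            if ( val != 0xffffffff )   /* ignore size-probing writes */
                remap_bar(d, bdf, bar, val & ~0xfUL);
        }

        /* Forward the write to the real/emulated config space here. */
    }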

Jan



 

