
Re: [Xen-devel] HVMlite ABI specification DRAFT B + implementation outline



>>> On 10.02.16 at 13:01, <roger.pau@xxxxxxxxxx> wrote:
> El 9/2/16 a les 14:24, Jan Beulich ha escrit:
>>>>> On 08.02.16 at 20:03, <roger.pau@xxxxxxxxxx> wrote:
>>> The layout of each entry in the module structure is the following:
>>>
>>>  0 +----------------+
>>>    | paddr          | Physical address of the module.
>>>  4 +----------------+
>>>    | size           | Size of the module in bytes.
>>>  8 +----------------+
>>>    | cmdline_paddr  | Physical address of the command line,
>>>    |                | a zero-terminated ASCII string.
>>> 12 +----------------+
>>>    | reserved       |
>>> 16 +----------------+
>> 
>> I've been thinking about this since draft A already: do we really want
>> to paint ourselves into the corner of not supporting >4GB modules
>> by limiting their addresses and sizes to 32 bits?
> 
> Hm, that's a tricky question. TBH I doubt we are going to see modules
> >4GB ATM, but maybe in the future this will no longer hold.
> 
> I wouldn't mind making all the fields in the module structure 64 bits,
> but I think we should then spell out that Xen will always try to place
> the modules below the 4GiB boundary when possible.

Sounds reasonable.
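
To make the agreed change concrete, here is a minimal sketch of what a
widened module entry could look like, assuming all fields become 64 bits
as discussed; the struct and field names are illustrative, not part of
the spec:

    /* Illustrative sketch only: one entry of the proposed module list,
     * with all fields widened to 64 bits as discussed above.
     * Struct and field names are hypothetical, not taken from the spec. */
    #include <stdint.h>

    struct hvm_module_entry {
        uint64_t paddr;          /* Physical address of the module. */
        uint64_t size;           /* Size of the module in bytes. */
        uint64_t cmdline_paddr;  /* Physical address of the zero-terminated
                                  * ASCII command line. */
        uint64_t reserved;       /* Must be zero. */
    };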

>>>  * 2. HVMlite with (or capable of) PCI-passthrough
>>>    -----------------------------------------------
>>>    The current model of PCI-passthrough in PV guests is complex and
>>>    requires heavy modifications to the guest OS. Going forward we would
>>>    like to remove this limitation by providing an interface that's the
>>>    same as found on bare metal. In order to do this, at least an emulated
>>>    local APIC should be provided to guests, together with access to a
>>>    PCI-Root complex. As said in the 'Hardware description' section above,
>>>    this will also require ACPI. So this proposed scenario will require
>>>    the following elements that are not present in the minimal (or
>>>    default) HVMlite implementation: ACPI, local APIC, IO APIC (optional)
>>>    and PCI-Root complex.
>> 
>> Are you reasonably convinced that the absence of an IO-APIC
>> won't, with LAPICs present, cause more confusion than aid to the
>> OSes wanting to adopt PVHv2?
> 
> As long as the data provided in the MADT represents the container
> provided to the guest, I think we should be fine. If there's no IO APIC,
> no entries of type 1 (IO APIC) will be present in the MADT.

I understand that, but are you certain OSes are prepared for that?
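
For reference, a type 1 (IO APIC) MADT entry has the layout below,
following the ACPI spec; when the guest has no IO APIC the toolstack
simply emits no such entries. The struct name and C representation are
just an illustrative sketch:

    /* Sketch of an ACPI MADT type 1 (I/O APIC) entry, layout per the
     * ACPI spec. Struct name and representation are illustrative only. */
    #include <stdint.h>

    struct acpi_madt_io_apic {
        uint8_t  type;        /* 1 = I/O APIC */
        uint8_t  length;      /* 12 */
        uint8_t  io_apic_id;
        uint8_t  reserved;
        uint32_t address;     /* Physical address of the I/O APIC */
        uint32_t gsi_base;    /* Global System Interrupt base */
    } __attribute__((packed));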

>>>  * 3. HVMlite hardware domain
>>>    --------------------------
>>>    The aim is that an HVMlite hardware domain is going to work exactly
>>>    like an HVMlite domain with passed-through devices. This means that
>>>    the domain will need access to the same set of emulated devices, and
>>>    that some ACPI tables must be fixed in order to reflect the reality of
>>>    the container the hardware domain is running on. The ACPI section
>>>    contains more detailed information about which/how these tables are
>>>    going to be fixed.
>>>
>>>    Note that in this scenario the hardware domain will *always* have a
>>>    local APIC and IO APIC, and that the usage of PHYSDEV operations and
>>>    PIRQ event channels is going to be removed in favour of the bare metal
>>>    mechanisms.
>> 
>> Do you really mean "*always*"? What about a system without IO-APIC?
>> Would you mean to emulate one there for no reason?
> 
> Oh, a real system without an IO APIC. No, then we wouldn't provide one
> to the hardware domain, since it makes no sense.

I.e. the above should say "... will always have local APICs and IO-APICs
mirroring the physical machine's, ..." or something equivalent.

>>> ACPI
>>> ----
>>>
>>> ACPI tables will be provided to the hardware domain and to unprivileged
>>> domains. In the case of unprivileged guests ACPI tables are going to be
>>> created by the toolstack and will only contain the set of devices
>>> available to the guest, which will at least be the following: a local
>>> APIC and optionally an IO APIC and passed-through device(s). In order to
>>> provide this information from ACPI the following tables are needed as a
>>> minimum: RSDT, FADT, MADT and DSDT. If an administrator decides not to
>>> provide a local APIC, the MADT table is not going to be provided to the
>>> guest OS.
>>>
>>> The ACPI_FADT_NO_CMOS_RTC flag in the FADT boot_flags field is going to
>>> be used to signal guests that there's no RTC device (the Xen PV wall
>>> clock should be used instead). It is likely that this flag is not going
>>> to be set for the hardware domain, since it should have access to the
>>> RTC present in the host (if there's one). The ACPI_FADT_NO_VGA flag is
>>> also very likely to be set in the same boot_flags FADT field for DomUs
>>> in order to signal that there's no VGA adapter present.
>>>
>>> Finally, the ACPI_FADT_HW_REDUCED flag is going to be set in the FADT
>>> flags field in order to signal that there are no legacy devices: i8259
>>> PIC or i8254 PIT. There's no intention to enable these devices, so it is
>>> expected that the hardware-reduced FADT flag is always going to be set.
>> 
>> We'll need to be absolutely certain that use of this flag doesn't carry
>> any further implications.
> 
> No, after taking a closer look at the ACPI spec I don't think we can use
> this flag. It has some implications that wouldn't be true, for example:
> 
>  - UEFI must be used for boot.
>  - Sleep state entering is different. Using SLEEP_CONTROL_REG and
> SLEEP_STATUS_REG instead of SLP_TYP, SLP_EN and WAK_STS. This of course
> is not something that we can decide for Dom0.
> 
> And there are more implications which I think would not hold in our case.
> 
> So are we just going to say that HVMlite systems will never have an i8259
> PIC or i8254 PIT? Because I don't see a proper way to report this using
> standard ACPI fields.

I think so, yes.
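
For the flags that do remain usable, a toolstack building a DomU FADT
might set the IA-PC boot architecture bits roughly as sketched below; the
bit values follow the ACPI spec's IAPC_BOOT_ARCH definitions, while the
helper function and its surroundings are hypothetical:

    /* Illustrative sketch: setting the IA-PC boot architecture flags in a
     * DomU FADT. Bit positions follow the ACPI spec ("IAPC_BOOT_ARCH");
     * the surrounding toolstack code is hypothetical. */
    #include <stdint.h>

    #define ACPI_FADT_NO_VGA        (1U << 2)  /* VGA Not Present */
    #define ACPI_FADT_NO_CMOS_RTC   (1U << 5)  /* CMOS RTC Not Present */

    static void set_domu_boot_flags(uint16_t *boot_flags)
    {
        /* No emulated RTC: guests should use the Xen PV wall clock. */
        *boot_flags |= ACPI_FADT_NO_CMOS_RTC;
        /* No emulated VGA adapter for DomUs. */
        *boot_flags |= ACPI_FADT_NO_VGA;
        /* Note: ACPI_FADT_HW_REDUCED (a bit in the separate "Flags" field)
         * is deliberately not set, per the discussion above. */
    }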

>>> MMIO mapping
>>> ------------
>>>
>>> For DomUs without any device passed-through no direct MMIO mappings will be
>>> present in the physical memory map presented to the guest. For DomUs with
>>> devices passed-through the toolstack will create direct MMIO mappings as
>>> part of the domain build process, and thus no action will be required
>>> from the DomU.
>>>
>>> For the hardware domain initial direct MMIO mappings will be set for the
>>> following regions:
>>>
>>> NOTE: ranges are defined using memory addresses, not pages.
>>>
>>>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>>>    memory map at the same position.
>>>
>>>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>>>    guest physical memory.
>> 
>> When have you last seen a machine with a hole right below the
>> 16Mb boundary?
> 
> Right, I will remove this. Even my old Nehalem boxes (the first Intel
> architecture with an IOMMU, IIRC) don't have such a hole.
> 
> Should I also mention RMRR?
> 
>   * Any RMRR regions reported will also be mapped 1:1 to Dom0.

That's a good idea, yes. But please make explicit that such
mappings will go away together with the removal of devices (for
pass-through purposes) from Dom0.
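
Taken together, the initial Dom0 mappings discussed here would
conceptually be set up as in the sketch below; map_identity_mmio(),
p2m_map_1to1() and map_rmrr_regions_1to1() are hypothetical stand-ins for
whatever p2m/IOMMU interfaces the implementation ends up using:

    /* Conceptual sketch of the initial Dom0 identity mappings discussed
     * above. All helpers are hypothetical stand-ins. */
    #include <stdint.h>

    struct domain;                                       /* opaque here */
    int p2m_map_1to1(struct domain *d, uint64_t gfn, uint64_t nr_pages);
    int map_rmrr_regions_1to1(struct domain *d);

    #define PAGE_SHIFT 12

    /* Map [start, end] (inclusive byte addresses) 1:1 into the guest. */
    static int map_identity_mmio(struct domain *d, uint64_t start,
                                 uint64_t end)
    {
        uint64_t gfn = start >> PAGE_SHIFT;
        uint64_t nr  = (end >> PAGE_SHIFT) - gfn + 1;

        return p2m_map_1to1(d, gfn, nr);
    }

    static int setup_dom0_mmio(struct domain *d)
    {
        int rc;

        /* Low 1MiB mapped into the guest at the same position. */
        rc = map_identity_mmio(d, 0x0, 0xFFFFF);
        if ( rc )
            return rc;

        /* RMRR regions reported by the IOMMU are also mapped 1:1; these
         * mappings go away again when the device they belong to is
         * removed from Dom0 for pass-through. */
        return map_rmrr_regions_1to1(d);
    }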

>>>  * PCI BARs: it's not possible for Xen to know the position of the BARs
>>>    of the PCI devices without hardware domain interaction. In order to
>>>    have the BARs of PCI devices properly mapped, the hardware domain
>>>    needs to call the PHYSDEVOP_pci_device_add hypercall, which will take
>>>    care of setting up the BARs in the guest physical memory map using
>>>    1:1 MMIO mappings. This procedure will be transparent from the
>>>    guest's point of view, and upon returning from the hypercall the
>>>    mappings must already be established.
>> 
>> I'm not sure this can work, as it imposes restrictions on the ordering
>> of operations internal to the Dom0 OS: successfully having probed
>> for a PCI device (and hence reported its presence to Xen) doesn't
>> imply its BARs have already been set up. Together with the possibility
>> of the OS re-assigning BARs, I think we will actually need another
>> hypercall, or the same device-add hypercall may need to be issued
>> more than once per device (i.e. also every time any BAR assignment
>> changes).
> 
> We already trap accesses to 0xcf8/0xcfc, so can't we detect BAR
> reassignments and act accordingly, changing the MMIO mapping?
>
> I was thinking that we could do the initial mapping at the current
> position when issuing the hypercall, and then detect further changes and
> perform remapping if needed, but maybe I'm missing something again that
> makes this approach infeasible.

I think that's certainly possible, but will require quite a bit of care
when implementing. (In fact this way I think we could then also
observe bus renumbering, without requiring Dom0 to remove and
then re-add all affected devices. Konrad - what do you think?)
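
A very rough sketch of how the existing 0xcf8/0xcfc intercept could
notice BAR writes and trigger a remap is below; every name in it is
hypothetical, and real code would additionally need to handle
size-probing writes properly, 64-bit and I/O BARs, and the bus
renumbering mentioned above:

    /* Rough sketch of BAR-write detection in the 0xcf8/0xcfc intercept.
     * All names are hypothetical; real code needs considerably more care
     * (size probes, 64-bit/I/O BARs, sub-dword accesses, renumbering). */
    #include <stdint.h>

    struct domain;
    /* Hypothetical helper that re-establishes the 1:1 MMIO mapping. */
    void remap_bar(struct domain *d, uint16_t bdf, unsigned int bar,
                   uint64_t new_addr);

    static uint32_t cf8_latch;   /* last value written to 0xcf8 */

    static void handle_cf8_write(uint32_t val)
    {
        cf8_latch = val;
    }

    static void handle_cfc_write(struct domain *d, uint32_t val)
    {
        unsigned int reg = cf8_latch & 0xfc;
        uint16_t     bdf = (cf8_latch >> 8) & 0xffff;   /* bus/dev/fn */

        /* BARs live at config space offsets 0x10-0x24 (type 0 header). */
        if ( (cf8_latch & 0x80000000) && reg >= 0x10 && reg <= 0x24 )
        {
            unsigned int bar = (reg - 0x10) / 4;

            if ( val != 0xffffffff )   /* ignore size-probing writes */
                remap_bar(d, bdf, bar, val & ~0xfUL);
        }

        /* Forward the write to the real/emulated config space here. */
    }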

Jan



 

