
Re: [Xen-devel] HVMlite ABI specification DRAFT B + implementation outline

On 9/2/16 at 11:56, Andrew Cooper wrote:
> On 08/02/16 19:03, Roger Pau Monné wrote:
>> The format of the boot start info structure is the following (pointed to
>> by %ebx):
>> NOTE: nothing will be loaded at physical address 0, so a 0 value in any of
>> the address fields should be treated as not present.
>>  0 +----------------+
>>    | magic          | Contains the magic value 0x336ec578
>>    |                | ("xEn3" with the 0x80 bit of the "E" set).
>>  4 +----------------+
>>    | flags          | SIF_xxx flags.
>>  8 +----------------+
>>    | cmdline_paddr  | Physical address of the command line,
>>    |                | a zero-terminated ASCII string.
>> 12 +----------------+
>>    | nr_modules     | Number of modules passed to the kernel.
>> 16 +----------------+
>>    | modlist_paddr  | Physical address of an array of modules
>>    |                | (layout of the structure below).
>> 20 +----------------+
>> The layout of each entry in the module structure is the following:
>>  0 +----------------+
>>    | paddr          | Physical address of the module.
>>  4 +----------------+
>>    | size           | Size of the module in bytes.
>>  8 +----------------+
>>    | cmdline_paddr  | Physical address of the command line,
>>    |                | a zero-terminated ASCII string.
>> 12 +----------------+
>>    | reserved       |
>> 16 +----------------+
>> Other relevant information needed in order to boot a guest kernel
>> (console page address, xenstore event channel...) can be obtained
>> using HVMPARAMS, just like it's done on HVM guests.
>> The setup of the hypercall page is also performed in the same way
>> as HVM guests, using the hypervisor cpuid leaves and msr ranges.
>> Hardware description
>> --------------------
>> Hardware description can come from two different sources, just like on
>> (PV)HVM guests.
>> Description of PV devices will always come from xenbus, and in fact
>> xenbus is the only hardware description that is guaranteed to always be
>> provided to HVMlite guests.
>> Description of physical hardware devices will always come from ACPI; in the
>> absence of any physical hardware device no ACPI tables will be provided. The
>> presence of ACPI tables can be detected by finding the RSDP, just like on
>> bare metal.
> As we are extending the base structure, why not have an RSDP paddr in it
> as well?  This avoids the need to scan RAM, and also serves as an
> indication of "No ACPI".

Right, this seems fine to me. I can send a patch later to expand the
structure unless anyone else complains.
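
For illustration, the expanded structure could look roughly like the sketch below. The field name rsdp_paddr, its placement at offset 20, and the struct and helper names are assumptions of this sketch, not something agreed in this thread; the remaining fields follow the layout quoted above. As with the other address fields, 0 would mean "not present", which doubles as the "No ACPI" indication.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* "xEn3" with the 0x80 bit of the 'E' set, read as a little-endian u32. */
#define HVMLITE_START_MAGIC 0x336ec578u

static_assert(((uint32_t)'x' | ((uint32_t)('E' | 0x80) << 8) |
               ((uint32_t)'n' << 16) | ((uint32_t)'3' << 24)) ==
              HVMLITE_START_MAGIC, "magic does not match its description");

/* Start info block with a hypothetical RSDP field appended at the end. */
struct hvmlite_start_info {
    uint32_t magic;          /* offset  0 */
    uint32_t flags;          /* offset  4: SIF_xxx flags */
    uint32_t cmdline_paddr;  /* offset  8: 0 = no command line */
    uint32_t nr_modules;     /* offset 12 */
    uint32_t modlist_paddr;  /* offset 16: 0 = no modules */
    uint32_t rsdp_paddr;     /* offset 20: ASSUMED new field, 0 = no ACPI */
};

static_assert(offsetof(struct hvmlite_start_info, modlist_paddr) == 16,
              "existing fields must keep their offsets");
static_assert(offsetof(struct hvmlite_start_info, rsdp_paddr) == 20,
              "new field is appended after the existing ones");

/* Returns 1 if the block looks valid and advertises ACPI tables, else 0. */
int start_info_has_acpi(const struct hvmlite_start_info *si)
{
    return si->magic == HVMLITE_START_MAGIC && si->rsdp_paddr != 0;
}
```

Appending the field keeps the existing offsets unchanged, so a kernel that only knows the first 20 bytes would keep working.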

>> Non-PV devices exposed to the guest
>> -----------------------------------
>> The initial idea was to simply not provide any emulated devices to a
>> HVMlite guest as the default option. We have however identified certain
>> situations where emulated devices could be interesting, both from a
>> performance and ease of implementation point of view. The following list
>> tries to encompass the different identified scenarios:
>>  * 1. HVMlite with no emulated devices at all
>>    ------------------------------------------
>>    This is the current implementation inside of Xen: everything is disabled
>>    by default and the guest has access to the PV devices only. This is of
>>    course the most secure design because it has the smallest attack surface.
>>  * 2. HVMlite with (or capable of) PCI-passthrough
>>    -----------------------------------------------
>>    The current model of PCI-passthrough in PV guests is complex and requires
>>    heavy modifications to the guest OS. Going forward we would like to remove
>>    this limitation, by providing an interface that's the same as found on
>>    bare metal. In order to do this, at least an emulated local APIC should be
>>    provided to guests, together with access to a PCI-Root complex.
>>    As said in the 'Hardware description' section above, this will also
>>    require ACPI. So this proposed scenario will require the following
>>    elements that are not present in the minimal (or default) HVMlite
>>    implementation: ACPI, local APIC, IO APIC (optional) and PCI-Root complex.
>>  * 3. HVMlite hardware domain
>>    --------------------------
>>    The aim is that a HVMlite hardware domain is going to work exactly like a
>>    HVMlite domain with passed-through devices. This means that the domain
>>    will need access to the same set of emulated devices, and that some ACPI
>>    tables must be fixed in order to reflect the reality of the container the
>>    hardware domain is running on. The ACPI section contains more detailed
>>    information about which/how these tables are going to be fixed.
>>    Note that in this scenario the hardware domain will *always* have a local
>>    APIC and IO APIC, and that the usage of PHYSDEV operations and PIRQ event
>>    channels is going to be removed in favour of the bare metal mechanisms.
>> The default model for HVMlite guests is going to be to provide a local APIC
>> together with a minimal set of ACPI tables that accurately match the reality
>> of the container the guest is running on.
> This statement is contrary to option 1 above, which states that all
> emulation is disabled.
> FWIW, I think there needs to be a 4th option, in between current 1 and 2,
> which is HVMLite + LAPIC.  This is then the default HVMLite ABI, and is
> not passthrough-capable.

Right, I think this makes sense, because (2) is not exactly the same: it
requires the presence of a PCI root complex.

>>  An administrator should be able to change
>> the default setting using the following tunables that are part of the xl
>> toolstack:
>>  * lapic: default to true. Indicates whether a local APIC is provided.
>>  * ioapic: default to false. Indicates whether an IO APIC is provided
>>    (requires lapic set to true).
>>  * acpi: default to true. Indicates whether ACPI tables are provided.
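
The defaults and the one dependency between these tunables can be sketched as a small validation step. The struct and function names below are hypothetical and are not taken from xl or libxl; only the default values and the rule that ioapic requires lapic come from the list above.

```c
#include <stdbool.h>

/* Hypothetical container for the three tunables described above. */
struct hvmlite_emul_cfg {
    bool lapic;   /* local APIC provided */
    bool ioapic;  /* IO APIC provided (requires lapic) */
    bool acpi;    /* ACPI tables provided */
};

/* Defaults as listed: lapic on, ioapic off, acpi on. */
struct hvmlite_emul_cfg hvmlite_default_cfg(void)
{
    struct hvmlite_emul_cfg cfg = { .lapic = true, .ioapic = false, .acpi = true };
    return cfg;
}

/* A configuration is rejected only when ioapic is set without lapic. */
bool hvmlite_cfg_valid(const struct hvmlite_emul_cfg *cfg)
{
    return cfg->lapic || !cfg->ioapic;
}
```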
>> <snip>
>> MMIO mapping
>> ------------
>> For DomUs without any device passed-through no direct MMIO mappings will be
>> present in the physical memory map presented to the guest. For DomUs with
>> devices passed-through the toolstack will create direct MMIO mappings as
>> part of the domain build process, and thus no action will be required
>> from the DomU.
>> For the hardware domain initial direct MMIO mappings will be set for the
>> following regions:
>> NOTE: ranges are defined using memory addresses, not pages.
> I would preface this with "where applicable".  Non-legacy boots are
> unlikely to have anything interesting in the first 1MB.

Yes, I've only taken legacy (BIOS) boot into account here. I'm not
familiar with UEFI, so I'm not really sure how different it is, or which
memory regions should be mapped into the guest physmap in that case. I
should have made this explicit by adding a title, like:

Legacy BIOS boot

>>  * [0x0, 0xFFFFF]: the low 1MiB will be mapped into the physical guest
>>    memory map at the same position.
>>  * [0xF00000, 0xFFFFFF]: the ISA memory hole will be mapped 1:1 into the
>>    guest physical memory.
>>  * ACPI memory areas: regions with type E820_ACPI or E820_NVS will be mapped
>>    1:1 to the guest physical memory map. There are going to be exceptions if
>>    Xen has to modify the tables before presenting them to the guest.
>>  * PCI Express MMCFG: if Xen is able to identify any of these regions at boot
>>    time they will also be made available to the guest at the same position
>>    in its physical memory map. It is possible that Xen will trap accesses to
>>    those regions, but a guest should be able to use the native configuration
>>    mechanism in order to interact with this configuration space. If the
>>    hardware domain reports the presence of any of those regions using the
>>    PHYSDEVOP_pci_mmcfg_reserved hypercall Xen will also allow guest access to
>>    them.
>>  * PCI BARs: it's not possible for Xen to know the position of the BARs of
>>    the PCI devices without hardware domain interaction.
> Xen requires no dom0 interaction to find all information like this for
> devices in segment 0 (i.e. all current hardware).  Segments other than 0
> may have their MMCONF regions expressed in AML only.

Thanks for the comments, please bear with me. I think we are mixing two
things here: one is the MMCFG areas, and the other is the BARs of
each PCI device.

AFAIK MMCFG areas are described in the 'MCFG' ACPI table, which is
static and which Xen should be able to parse on its own. So I'm not sure
why PHYSDEVOP_pci_mmcfg_reserved is needed at all.
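
For reference, each MCFG entry gives a base address, a PCI segment and a bus range, and the configuration space of a function then sits at a fixed offset from that base (the standard PCIe ECAM layout, 4KiB per function). A rough sketch of the address calculation follows; the struct and function names are made up for this example and are not Xen's:

```c
#include <stdint.h>

/* One MCFG allocation entry, reduced to the fields used here. */
struct mmcfg_region {
    uint64_t base;       /* base address as reported, i.e. where bus 0 sits */
    uint16_t segment;    /* PCI segment group */
    uint8_t  start_bus;  /* first bus decoded by this region */
    uint8_t  end_bus;    /* last bus decoded by this region */
};

/*
 * Physical address of config register 'reg' of bus:dev.fn inside the
 * region, or 0 if the coordinates fall outside of it.
 */
uint64_t mmcfg_addr(const struct mmcfg_region *r, uint8_t bus,
                    uint8_t dev, uint8_t fn, uint16_t reg)
{
    if (bus < r->start_bus || bus > r->end_bus ||
        dev > 31 || fn > 7 || reg > 4095)
        return 0;

    return r->base + (((uint64_t)bus << 20) | ((uint64_t)dev << 15) |
                      ((uint64_t)fn << 12) | reg);
}
```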

Then for BARs you need to know the specific PCI devices, which are
enumerated in the DSDT or similar ACPI tables, which are not static, and
thus cannot be parsed by Xen. We could do a brute force scan of the
whole PCI bus using the config registers, but that seems hacky. And as
Boris said we need to keep the usage of PHYSDEVOP_pci_device_add in
order to notify Xen of the PXM information.

If we indeed have all the information about the BARs (position and size)
we could pre-map them 1:1 before creating the hardware domain, and thus
no modifications will be needed to the PHYSDEVOP_pci_device_add hypercall.
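
For what it's worth, the brute-force option mentioned above comes down to the usual write-all-ones probe on each BAR. A minimal sketch of the size decoding for a 32-bit memory BAR, where 'probe' is the value read back after writing 0xFFFFFFFF to the register (the function name is hypothetical, and 64-bit and I/O BARs are deliberately not handled):

```c
#include <stdint.h>

/*
 * Size in bytes of a 32-bit memory BAR, given the value read back after
 * writing all ones to it. Returns 0 if the BAR is not implemented.
 */
uint32_t mem_bar32_size(uint32_t probe)
{
    uint32_t mask = probe & ~0xFu;  /* low 4 bits are type/prefetch flags */

    if (mask == 0)
        return 0;

    /* The writable address bits are set; the size is the two's complement. */
    return ~mask + 1;
}
```

The position is simply the original BAR value with the low flag bits masked off, so position and size together give the range that would be pre-mapped 1:1.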

