[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-devel] [PATCH v2] docs: add PVH specification



Introduce a document that describes the interfaces used on PVH. This
document has been designed from a guest OS point of view (i.e.: what a guest
needs to do in order to support PVH).

Signed-off-by: Roger Pau Monnà <roger.pau@xxxxxxxxxx>
Cc: Jan Beulich <JBeulich@xxxxxxxx>
Cc: Mukesh Rathor <mukesh.rathor@xxxxxxxxxx>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx>
Cc: David Vrabel <david.vrabel@xxxxxxxxxx>
---
The document is still far from complete IMHO, but it might be best to just
commit what we currently have rather than wait for a full document.

I will try to fill the gaps as I go implementing new features on FreeBSD.
---
 docs/misc/pvh.markdown | 369 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 369 insertions(+)
 create mode 100644 docs/misc/pvh.markdown

diff --git a/docs/misc/pvh.markdown b/docs/misc/pvh.markdown
new file mode 100644
index 0000000..0f075c4
--- /dev/null
+++ b/docs/misc/pvh.markdown
@@ -0,0 +1,369 @@
+# PVH Specification #
+
+## Rationale ##
+
+PVH is a new kind of guest that has been introduced on Xen 4.4 as a DomU, and
+on Xen 4.5 as a Dom0. The aim of PVH is to make use of the hardware
+virtualization extensions present in modern x86 CPUs in order to
+improve performance.
+
+PVH is considered a mix between PV and HVM, and can be seen as a PV guest
+that runs inside of an HVM container, or as a PVHVM guest without any emulated
+devices. The design goal of PVH is to provide the best performance possible and
+to reduce the amount of modifications needed for a guest OS to run in this mode
+(compared to pure PV).
+
+This document tries to describe the interfaces used by PVH guests, focusing
+on how an OS should make use of them in order to support PVH.
+
+## Early boot ##
+
+PVH guests use the PV boot mechanism, that means that the kernel is loaded and
+directly launched by Xen (by jumping into the entry point). In order to do this
+Xen ELF Notes need to be added to the guest kernel, so that they contain the
+information needed by Xen. Here is an example of the ELF Notes added to the
+FreeBSD amd64 kernel in order to boot as PVH:
+
+    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_OS,       .asciz, "FreeBSD")
+    ELFNOTE(Xen, XEN_ELFNOTE_GUEST_VERSION,  .asciz, 
__XSTRING(__FreeBSD_version))
+    ELFNOTE(Xen, XEN_ELFNOTE_XEN_VERSION,    .asciz, "xen-3.0")
+    ELFNOTE(Xen, XEN_ELFNOTE_VIRT_BASE,      .quad,  KERNBASE)
+    ELFNOTE(Xen, XEN_ELFNOTE_PADDR_OFFSET,   .quad,  KERNBASE)
+    ELFNOTE(Xen, XEN_ELFNOTE_ENTRY,          .quad,  xen_start)
+    ELFNOTE(Xen, XEN_ELFNOTE_HYPERCALL_PAGE, .quad,  hypercall_page)
+    ELFNOTE(Xen, XEN_ELFNOTE_HV_START_LOW,   .quad,  HYPERVISOR_VIRT_START)
+    ELFNOTE(Xen, XEN_ELFNOTE_FEATURES,       .asciz, 
"writable_descriptor_tables|auto_translated_physmap|supervisor_mode_kernel|hvm_callback_vector")
+    ELFNOTE(Xen, XEN_ELFNOTE_PAE_MODE,       .asciz, "yes")
+    ELFNOTE(Xen, XEN_ELFNOTE_L1_MFN_VALID,   .long,  PG_V, PG_V)
+    ELFNOTE(Xen, XEN_ELFNOTE_LOADER,         .asciz, "generic")
+    ELFNOTE(Xen, XEN_ELFNOTE_SUSPEND_CANCEL, .long,  0)
+    ELFNOTE(Xen, XEN_ELFNOTE_BSD_SYMTAB,     .asciz, "yes")
+
+On the linux side, the above can be found in `arch/x86/xen/xen-head.S`.
+
+It is important to highlight the following notes:
+
+  * `XEN_ELFNOTE_ENTRY`: contains the virtual memory address of the kernel 
entry
+    point.
+  * `XEN_ELFNOTE_HYPERCALL_PAGE`: contains the virtual memory address of the
+    hypercal page inside of the guest kernel (this memory region will be filled
+    by Xen prior to booting).
+  * `XEN_ELFNOTE_FEATURES`: contains the list of features supported by the 
kernel.
+    In the example above the kernel is only able to boot as a PVH guest, but
+    those options can be mixed with the ones used by pure PV guests in order to
+    have a kernel that supports both PV and PVH (like Linux). The list of
+    options available can be found in the `features.h` public header.
+
+Xen will jump into the kernel entry point defined in `XEN_ELFNOTE_ENTRY` with
+paging enabled (either long mode or protected mode with paging turned on
+depending on the kernel bitness) and some basic page tables setup. An important
+distinction for a 64bit PVH is that it is launched at privilege level 0 as
+opposed to a 64bit PV guest which is launched at privilege level 3.
+
+Also, the `rsi` (`esi` on 32bits) register is going to contain the virtual
+memory address were Xen has placed the `start_info` structure. The `rsp` (`esp`
+on 32bits) will point to the top of an initial single page stack, that can be
+used by the guest kernel. The `start_info` structure contains all the info the
+guest needs in order to initialize. More information about the contents can be
+found on the `xen.h` public header.
+
+### Initial amd64 control registers values ###
+
+Initial values for the control registers are set up by Xen before booting the
+guest kernel. The guest kernel can expect to find the following features
+enabled by Xen.
+
+`CR0` has the following bits set by Xen:
+
+  * PE (bit 0): protected mode enable.
+  * ET (bit 4): 387 or newer processor.
+  * PG (bit 31): paging enabled.
+
+`CR4` has the following bits set by Xen:
+
+  * PAE (bit 5): PAE enabled.
+
+And finally in `EFER` the following features are enabled:
+
+  * LME (bit 8): Long mode enable.
+  * LMA (bit 10): Long mode active.
+
+At least the following flags in `EFER` are guaranteed to be disabled:
+
+  * SCE (bit 0): System call extensions disabled.
+  * NXE (bit 11): No-Execute disabled.
+
+There's no guarantee about the state of the other bits in the `EFER` register.
+
+All the segments selectors are set with a flat base at zero.
+
+The `cs` segment selector attributes are set to 0x0a09b, which describes an
+executable and readable code segment only accessible by the most privileged
+level. The segment is also set as a 64-bit code segment (`L` flag set) with a
+default operation size of 16bits (`D` flag unset).
+
+The remaining segment selectors (`ds`, `ss`, `es`, `fs` and `gs`) are all set
+to the same values. The attributes are set to 0x0c093, which implies a read and
+write data segment only accessible by the most privileged level. It is 
important
+to notice that for the `ss` selector the stack is set to use a 32bit pointer
+(`B` flag set).
+
+The `FS.base` and `GS.base` MSRs are zeroed out.
+
+The `IDT` and `GDT` are also zeroed, so the guest must be specially careful to
+not trigger a fault until after they have been properly set. The way of setting
+the IDT and the GDT is using the native instructions as would be done on bare
+metal.
+
+The `RFLAGS` register is guaranteed to be clear when jumping into the kernel
+entry point.
+
+## Memory ##
+
+Since PVH guests rely on virtualization extensions provided by the CPU, they
+have access to a hardware virtualized MMU, which means page-table related
+operations should use the same instructions used on native.
+
+There are however some differences with native. The usage of native MTRR
+operations is forbidden, and `XENPF_*_memtype` hypercalls should be used
+instead. This can be avoided by simply not using MTRR and setting all the
+memory attributes using PAT, which doesn't require the usage of any hypercalls.
+
+Since PVH doesn't use a BIOS in order to boot, the physical memory map has
+to be retrieved using the `XENMEM_memory_map` hypercall, which will return
+an e820 map. This memory map might contain holes that describe MMIO regions,
+that will be already setup by Xen.
+
+*TODO*: we need to figure out what to do with MMIO regions, right now Xen
+sets all the holes in the native e820 to MMIO regions for Dom0 up to 4GB. We
+need to decide what to do with MMIO regions above 4GB on Dom0, and what to do
+for PVH DomUs with pci-passthrough.
+
+In the case of a guest started with memory != maxmem, the e820 memory map
+returned by Xen will contain the memory up to maxmem. The guest has to be very
+careful to only use the lower memory pages up to the value contained in
+`start_info->nr_pages` because any memory page above that value will not be
+populated.
+
+## Physical devices ##
+
+When running as Dom0 the guest OS has the ability to interact with the physical
+devices present in the system. A note should be made that PVH guests require
+a working IOMMU in order to interact with physical devices.
+
+The first step in order to manipulate the devices is to make Xen aware of
+them. Due to the fact that all the hardware description on x86 comes from
+ACPI, Dom0 is responsible of parsing the ACPI tables and notify Xen about the
+devices it finds. This is done with the `PHYSDEVOP_pci_device_add` hypercall.
+
+*TODO*: explain the way to register the different kinds of PCI devices, like
+devices with virtual functions.
+
+## Interrupts ##
+
+All interrupts on PVH guests are routed over event channels, see
+[Event Channel Internals][event_channels] for more detailed information about
+event channels. In order to inject interrupts into the guest an IDT vector is
+used. This is the same mechanism used on PVHVM guests, and allows having
+per-cpu interrupts that can be used to deliver timers or IPIs.
+
+In order to register the callback IDT vector the `HVMOP_set_param` hypercall
+is used with the following values:
+
+    domid = DOMID_SELF
+    index = HVM_PARAM_CALLBACK_IRQ
+    value = (0x2 << 56) | vector_value
+
+In order to know which event channel has fired, we need to look into the
+information provided in the `shared_info` structure. The `evtchn_pending`
+array is used as a bitmap in order to find out which event channel has
+fired. Event channels can also be masked by setting it's port value in the
+`shared_info->evtchn_mask` bitmap.
+
+### Interrupts from physical devices ###
+
+When running as Dom0 (or when using pci-passthrough) interrupts from physical
+devices are routed over event channels. There are 3 different kind of
+physical interrupts that can be routed over event channels by Xen: IO APIC,
+MSI and MSI-X interrupts.
+
+Since physical interrupts usually need EOI (End Of Interrupt), Xen allows the
+registration of a memory region that will contain whether a physical interrupt
+needs EOI from the guest or not. This is done with the
+`PHYSDEVOP_pirq_eoi_gmfn_v2` hypercall that takes a parameter containing the
+physical address of the memory page that will act as a bitmap. Then in order to
+find out if an IRQ needs EOI or not, the OS can perform a simple bit test on 
the
+memory page using the PIRQ value.
+
+### IO APIC interrupt routing ###
+
+IO APIC interrupts can be routed over event channels using `PHYSDEVOP`
+hypercalls. First the IRQ is registered using the `PHYSDEVOP_map_pirq`
+hypercall, as an example IRQ#9 is used here:
+
+    domid = DOMID_SELF
+    type = MAP_PIRQ_TYPE_GSI
+    index = 9
+    pirq = 9
+
+The IRQ#9 is now registered as PIRQ#9. The triggering and polarity can also
+be configured using the `PHYSDEVOP_setup_gsi` hypercall:
+
+    gsi = 9 # This is the IRQ value.
+    triggering = 0
+    polarity = 0
+
+In this example the IRQ would be configured to use edge triggering and high
+polarity.
+
+Finally the PIRQ can be bound to an event channel using the
+`EVTCHNOP_bind_pirq`, that will return the event channel port the PIRQ has been
+assigned. After this the event channel will be ready for delivery.
+
+*NOTE*: when running as Dom0, the guest has to parse the interrupt overrides
+found on the ACPI tables and notify Xen about them.
+
+### MSI ###
+
+In order to configure MSI interrupts for a device, Xen must be made aware of
+it's presence first by using the `PHYSDEVOP_pci_device_add` as described above.
+Then the `PHYSDEVOP_map_pirq` hypercall is used:
+
+    domid = DOMID_SELF
+    type = MAP_PIRQ_TYPE_MSI_SEG or MAP_PIRQ_TYPE_MULTI_MSI
+    index = -1
+    pirq = -1
+    bus = pci_device_bus
+    devfn = pci_device_function
+    entry_nr = number of MSI interrupts
+
+The type has to be set to `MAP_PIRQ_TYPE_MSI_SEG` if only one MSI interrupt
+source is being configured. On devices that support MSI interrupt groups
+`MAP_PIRQ_TYPE_MULTI_MSI` can be used to configure them by also placing the
+number of MSI interrupts in the `entry_nr` field.
+
+The values in the `bus` and `devfn` field should be the same as the ones used
+when registering the device with `PHYSDEVOP_pci_device_add`.
+
+### MSI-X ###
+
+*TODO*: how to register/use them.
+
+## Event timers and timecounters ##
+
+Since some hardware is not available on PVH (like the local APIC), Xen provides
+the OS with suitable replacements in order to get the same functionality. One
+of them is the timer interface. Using a set of hypercalls, a guest OS can set
+event timers that will deliver and event channel interrupt to the guest.
+
+In order to use the timer provided by Xen the guest OS first needs to register
+a VIRQ event channel to be used by the timer to deliver the interrupts. The
+event channel is registered using the `EVTCHNOP_bind_virq` hypercall, that
+only takes two parameters:
+
+    virq = VIRQ_TIMER
+    vcpu = vcpu_id
+
+The port that's going to be used by Xen in order to deliver the interrupt is
+returned in the `port` field. Once the interrupt is set, the timer can be
+programmed using the `VCPUOP_set_singleshot_timer` hypercall.
+
+    flags = VCPU_SSHOTTMR_future
+    timeout_abs_ns = absolute value when the timer should fire
+
+It is important to notice that the `VCPUOP_set_singleshot_timer` hypercall must
+be executed from the same vCPU where the timer should fire, or else Xen will
+refuse to set it. This is a single-shot timer, so it must be set by the OS
+every time it fires if a periodic timer is desired.
+
+Xen also shares a memory region with the guest OS that contains time related
+values that are updated periodically. This values can be used to implement a
+timecounter or to obtain the current time. This information is placed inside of
+`shared_info->vcpu_info[vcpu_id].time`. The uptime (time since the guest has
+been launched) can be calculated using the following expression and the values
+stored in the `vcpu_time_info` struct:
+
+    system_time + ((((tsc - tsc_timestamp) << tsc_shift) * tsc_to_system_mul) 
>> 32)
+
+The timeout that is passed to `VCPUOP_set_singleshot_timer` has to be
+calculated using the above value, plus the timeout the system wants to set.
+
+If the OS also wants to obtain the current wallclock time, the value calculated
+above has to be added to the values found in `shared_info->wc_sec` and
+`shared_info->wc_nsec`.
+
+## SMP discover and bring up ##
+
+The process of bringing up secondary CPUs is obviously different from native,
+since PVH doesn't have a local APIC. The first thing to do is to figure out
+how many vCPUs the guest has. This is done using the `VCPUOP_is_up` hypercall,
+using for example this simple loop:
+
+    for (i = 0; i < MAXCPU; i++) {
+        ret = HYPERVISOR_vcpu_op(VCPUOP_is_up, i, NULL);
+        if (ret >= 0)
+            /* vCPU#i is present */
+    }
+
+Note than when running as Dom0, the ACPI tables might report a different number
+of available CPUs. This is because the value on the ACPI tables is the
+number of physical CPUs the host has, and it might bear no resemblance with the
+number of vCPUs Dom0 actually has so it should be ignored.
+
+In order to bring up the secondary vCPUs they must be configured first. This is
+achieved using the `VCPUOP_initialise` hypercall. A valid context has to be
+passed to the vCPU in order to boot. The relevant fields for PVH guests are
+the following:
+
+  * `flags`: contains `VGCF_*` flags (see `arch-x86/xen.h` public header).
+  * `user_regs`: struct that contains the register values that will be set on
+    the vCPU before booting. All GPRs are available to be set, however, the
+    most relevant ones are `rip` and `rsp` in order to set the start address
+    and the stack. Please note, all selectors must be null.
+  * `ctrlreg[3]`: contains the address of the page tables that will be used by
+    the vCPU. Other control registers should be set to zero, or else the
+    hypercall will fail with -EINVAL.
+
+After the vCPU is initialized with the proper values, it can be started by
+using the `VCPUOP_up` hypercall. The values of the other control registers of
+the vCPU will be the same as the ones described in the `control registers`
+section.
+
+Examples about how to bring up secondary CPUs can be found on the FreeBSD
+code base in `sys/x86/xen/pv.c` and on Linux `arch/x86/xen/smp.c`.
+
+## Control operations (reboot/shutdown) ##
+
+Reboot and shutdown operations on PVH guests are performed using hypercalls.
+In order to issue a reboot, a guest must use the `SHUTDOWN_reboot` hypercall.
+In order to perform a power off from a guest DomU, the `SHUTDOWN_poweroff`
+hypercall should be used.
+
+The way to perform a full system power off from Dom0 is different than what's
+done in a DomU guest. In order to perform a power off from Dom0 the native
+ACPI path should be followed, but the guest should not write the `SLP_EN`
+bit to the Pm1Control register. Instead the `XENPF_enter_acpi_sleep` hypercall
+should be used, filling the following data in the `xen_platform_op` struct:
+
+    cmd = XENPF_enter_acpi_sleep
+    interface_version = XENPF_INTERFACE_VERSION
+    u.enter_acpi_sleep.pm1a_cnt_val = Pm1aControlValue
+    u.enter_acpi_sleep.pm1b_cnt_val = Pm1bControlValue
+
+This will allow Xen to do it's clean up and to power off the system. If the
+host is using hardware reduced ACPI, the following field should also be set:
+
+    u.enter_acpi_sleep.flags = XENPF_ACPI_SLEEP_EXTENDED (0x1)
+
+## CPUID ##
+
+*TDOD*: describe which cpuid flags a guest should ignore and also which flags
+describe features can be used. It would also be good to describe the set of
+cpuid flags that will always be present when running as PVH.
+
+## Final notes ##
+
+All the other hardware functionality not described in this document should be
+assumed to be performed in the same way as native.
+
+[event_channels]: http://wiki.xen.org/wiki/Event_Channel_Internals
-- 
1.8.5.2 (Apple Git-48)


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.