[Xen-devel] [RFC, PATCH 1/24] i386 Vmi documentation
Index: linux-2.6.16-rc5/Documentation/vmi_spec.txt
===================================================================
--- linux-2.6.16-rc5.orig/Documentation/vmi_spec.txt	2006-03-09 23:33:29.000000000 -0800
+++ linux-2.6.16-rc5/Documentation/vmi_spec.txt	2006-03-10 12:55:29.000000000 -0800
@@ -0,0 +1,2197 @@
+
+                 Paravirtualization API Version 2.0
+
+       Zachary Amsden, Daniel Arai, Daniel Hecht, Pratap Subrahmanyam
+                 Copyright (C) 2005, 2006, VMware, Inc.
+                 All rights reserved
+
+Revision history:
+   1.0: Initial version
+   1.1: arai 2005-11-15
+        Added SMP-related sections: AP startup and Local APIC support
+   1.2: dhecht 2006-02-23
+        Added Time Interface section and Time related VMI calls
+
+Contents
+
+1) Motivations
+2) Overview
+   Initialization
+   Privilege model
+   Memory management
+   Segmentation
+   Interrupt and I/O subsystem
+   IDT management
+   Transparent Paravirtualization
+   3rd Party Extensions
+   AP Startup
+   State Synchronization in SMP systems
+   Local APIC Support
+   Time Interface
+3) Architectural Differences from Native Hardware
+4) ROM Implementation
+   Detection
+   Data layout
+   Call convention
+   PCI implementation
+
+Appendix A - VMI ROM low level ABI
+Appendix B - VMI C prototypes
+Appendix C - Sensitive x86 instructions
+
+
+1) Motivations
+
+   There are several high level goals which must be balanced in designing
+   an API for paravirtualization. The most general concerns are:
+
+     Portability - it should be easy to port a guest OS to use the API
+     High performance - the API must not obstruct a high performance
+       hypervisor implementation
+     Maintainability - it should be easy to maintain and upgrade the guest
+       OS
+     Extensibility - it should be possible to expand the API in the
+       future
+
+   Portability.
+
+   The general approach to paravirtualization rather than full
+   virtualization is to modify the guest operating system. This means
+   there is implicitly some code cost to port a guest OS to run in a
+   paravirtual environment. The closer the API resembles a native
+   platform which the OS supports, the lower the cost of porting.
+   Rather than provide an alternative, high level interface for this
+   API, the approach is to provide a low level interface which
+   encapsulates the sensitive and performance critical parts of the
+   system. Thus, we have direct parallels to most privileged
+   instructions, and the process of converting a guest OS to use these
+   instructions is in many cases a simple replacement of one function
+   for another. Although this is sufficient for CPU virtualization,
+   performance concerns have forced us to add additional calls for
+   memory management, and notifications about updates to certain CPU
+   data structures. Support for this in the Linux operating system has
+   proved to be very low in cost because of the already somewhat
+   portable and modular design of the memory management layer.
+
+   High Performance.
+
+   Providing a low level API that closely resembles hardware does not
+   provide any support for compound operations; indeed, typical
+   compound operations on hardware include updating many page table
+   entries, flushing system TLBs, or providing floating point safety.
+   Since these operations may require several privileged or sensitive
+   operations, it becomes important to defer some of these operations
+   until explicit flushes are issued, or to provide higher level
+   operations around some of these functions.
+   In keeping with the goal of portability, this has been done only
+   when deemed necessary for performance reasons, and we have tried to
+   package these compound operations into methods that are typically
+   used in guest operating systems. In the future, we envision that
+   additional higher level abstractions will be added as an adjunct to
+   the low-level API. These higher level abstractions will target large
+   bulk operations such as creation and destruction of address spaces,
+   context switches, and thread creation and control.
+
+   Maintainability.
+
+   In the course of development with a virtualized environment, it is
+   not uncommon for support of new features or higher performance to
+   require radical changes to the operation of the system. If these
+   changes are visible to the guest OS in a paravirtualized system,
+   this will require updates to the guest kernel, which presents a
+   maintenance problem. In the Linux world, the rapid pace of
+   development on the kernel means new kernel versions are produced
+   every few months. This rapid pace is not always appropriate for end
+   users, so it is not uncommon to have dozens of different versions of
+   the Linux kernel in use that must be actively supported. Keeping
+   this many versions in sync with potentially radical changes in the
+   paravirtualized system is not a scalable solution. To reduce the
+   maintenance burden as much as possible, while still allowing the
+   implementation to accommodate changes, the design provides a stable
+   ABI with semantic invariants. The underlying implementation of the
+   ABI, and the details of what data it communicates to the hypervisor
+   and how, are not visible to the guest OS. As a result, in most
+   cases, the guest OS need not even be recompiled to work with a newer
+   hypervisor. This allows performance optimizations, bug fixes,
+   debugging, or statistical instrumentation to be added to the API
+   implementation without any impact on the guest kernel. This is
+   achieved by publishing a block of code from the hypervisor in the
+   form of a ROM. The guest OS makes calls into this ROM to perform
+   privileged or sensitive actions in the system.
+
+   Extensibility.
+
+   In order to provide a vehicle for new features, new device support,
+   and general evolution, the API uses feature compartmentalization
+   with controlled versioning. The API is split into sections, with
+   each section having independent versions. Each section has a top
+   level version which is incremented for each major revision, with a
+   minor version indicating incremental level. Version compatibility
+   is based on matching the major version field, and changes of the
+   major version are assumed to break compatibility; this allows
+   compatibility to be determined accurately. In the event of
+   incompatible API changes, multiple APIs may be advertised by the
+   hypervisor if it wishes to support older versions of guest kernels.
+   This provides the most general forward / backward compatibility
+   possible. Currently, the API has a core section for CPU / MMU
+   virtualization support, with additional sections provided for each
+   supported device class.
+
+2) Overview
+
+   Initialization.
+
+   Initialization is done with a bootstrap loader that creates
+   the "start of day" state. This is a known state, running 32-bit
+   protected mode code with paging enabled. The guest has all the
+   standard structures in memory that are provided by a native ROM
+   boot environment, including a memory map and ACPI tables.
+   For the native hardware, this bootstrap loader can be run before
+   the kernel code proper, and this environment can be created
+   readily from within the hypervisor for the virtual case. At
+   some point, the bootstrap loader or the kernel itself invokes
+   the initialization call to enter paravirtualized mode.
+
+   Privilege Model.
+
+   The guest kernel must be modified to run at a dynamic privilege
+   level, since if entry to paravirtual mode is successful, the kernel
+   is no longer allowed to run at the highest hardware privilege level.
+   On the IA-32 architecture, this means the kernel will be running at
+   CPL 1-2, with the hypervisor running at CPL 0 and user code at
+   CPL 3. The IOPL will be lowered as well to avoid giving the guest
+   direct access to hardware ports and control of the interrupt flag.
+
+   This change causes certain IA-32 instructions to become "sensitive",
+   so additional support for clearing and setting the hardware
+   interrupt flag is present. Since the switch into paravirtual mode
+   may happen dynamically, the guest OS must not rely on testing for a
+   specific privilege level by checking the RPL field of segment
+   selectors, but should check for privileged execution by performing
+   an (RPL != 3 && !EFLAGS_VM) comparison. This means the DPL of kernel
+   ring descriptors in the GDT or LDT may be raised to match the CPL of
+   the kernel. This change is visible by inspecting the segment
+   registers while running in privileged code, and by using the LAR
+   instruction.
+
+   The system also cannot be allowed to write directly to the hardware
+   GDT, LDT, IDT, or TSS, so these data structures are maintained by the
+   hypervisor, and may be shadowed or guest visible structures. These
+   structures are required to be page aligned to support non-shadowed
+   operation.
+
+   Currently, the system only provides for two guest security domains:
+   kernel (which runs at the equivalent of virtual CPL-0), and user
+   (which runs at the equivalent of virtual CPL-3, with no hardware
+   access). Typically, this is not a problem, but if a guest OS relies
+   on using multiple hardware rings for privilege isolation, this
+   interface would need to be expanded to support that.
+
+   Memory Management.
+
+   Since a virtual machine typically does not have access to all the
+   physical memory on the machine, there is a need to redefine the
+   physical address space layout for the virtual machine. The
+   spectrum of possibilities ranges from presenting the guest with
+   a view of a physically contiguous memory of a boot-time determined
+   size, exactly what the guest would see when running on hardware, to
+   the opposite, which presents the guest with the actual machine pages
+   which the hypervisor has allocated for it. The latter approach
+   requires the guest to obtain information about the pages it has
+   from the hypervisor; this can be done by using the memory map which
+   would normally be passed to the guest by the BIOS.
+
+   The interface is designed to support either mode of operation.
+   This allows the implementation to use either direct page tables
+   or shadow page tables, or some combination of both. All writes to
+   page table entries are done through calls to the hypervisor
+   interface layer. The guest notifies the hypervisor about page
+   table updates, flushes, and invalidations through API calls.
+
+   The guest OS is also responsible for notifying the hypervisor about
+   which pages in its physical memory are going to be used to hold page
+   tables or page directories; a sketch of this notification pattern
+   follows.
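+   As an illustration only, the pattern looks roughly like the sketch
+   below. The vmi_notify_pt_page() and vmi_release_pt_page() wrappers
+   are hypothetical stand-ins for the MMU registration calls of
+   Appendix A (not shown in this excerpt); a real port would use the
+   actual VMI entry points.
+
+      /* Hypothetical sketch: declaring page table pages to the
+         hypervisor. The vmi_* names are illustrative stand-ins,
+         not part of the ABI described in this document. */
+      extern void vmi_notify_pt_page(unsigned long pfn);  /* hypothetical */
+      extern void vmi_release_pt_page(unsigned long pfn); /* hypothetical */
+      extern void *pfn_to_virt(unsigned long pfn);
+      extern void clear_page(void *page);
+
+      void example_new_page_table(unsigned long pfn)
+      {
+          /* Zero the page while it is still an ordinary page. */
+          clear_page(pfn_to_virt(pfn));
+
+          /* Declare it before the first PTE is written, so a
+             shadowing hypervisor can begin tracking it. */
+          vmi_notify_pt_page(pfn);
+      }
+
+      void example_free_page_table(unsigned long pfn)
+      {
+          /* Release promptly so the hypervisor can free its shadow. */
+          vmi_release_pt_page(pfn);
+      }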
+   Both PAE and non-PAE paging modes are
+   supported. When the guest is finished using pages as page tables, it
+   should release them promptly to allow the hypervisor to free the
+   page table shadows. Using a page as both a page table and a page
+   directory for linear page table access is possible, but currently
+   not supported by our implementation.
+
+   The hypervisor lives concurrently in the same address space as the
+   guest operating system. Although this is not strictly necessary on
+   IA-32 hardware, performance would be severely degraded if that were
+   not the case. The hypervisor must therefore reserve some portion of
+   linear address space for its own use. The implementation currently
+   reserves the top 64 megabytes of linear space for the hypervisor.
+   This requires the guest to relocate any data in high linear space
+   down by 64 megabytes. For non-paging mode guests, this means the
+   high 64 megabytes of physical memory should be reserved. Because
+   page tables are not sensitive to CPL, only to user/supervisor level,
+   the hypervisor must rely on segment protection to ensure that the
+   guest cannot access this 64 megabyte region.
+
+   An experimental patch is available to enable boot-time sizing of
+   the hypervisor hole.
+
+   Segmentation.
+
+   The IA-32 architecture provides segmented virtual memory, which can
+   be used as another form of privilege separation. Each segment
+   contains a base, limit, and properties. The base is added to the
+   virtual address to form a linear address. The limit determines the
+   length of linear space which is addressable through the segment.
+   The properties determine read/write, code and data size of the
+   region, as well as the direction in which segments grow. Segments
+   are loaded from descriptors in one of two system tables, the GDT or
+   the LDT, and the values loaded are cached until the next load of the
+   segment. This property, known as segment caching, allows the
+   machine to be put into a non-reversible state by writing over the
+   descriptor table entry from which a segment was loaded. There is no
+   efficient way to extract the base field of the segment after it is
+   loaded, as it is hidden by the processor. In a hypervisor
+   environment, the guest OS can be interrupted at any point in time by
+   interrupts and NMIs which must be serviced by the hypervisor. The
+   hypervisor must be able to recreate the original guest state when it
+   is done servicing the external event.
+
+   To avoid creating non-reversible segments, the hypervisor will
+   forcibly reload any live segment registers that are updated by
+   writes to the descriptor tables. *N.B - in the event that a segment
+   is put into an invalid or not-present state by an update to the
+   descriptor table, the segment register must be forced to NULL so
+   that reloading it will not cause a general protection fault (#GP)
+   when restoring the guest state. This may require the guest to save
+   the segment register value before issuing a hypervisor API call
+   which will update the descriptor table.*
+
+   Because the hypervisor must protect its own memory space from
+   privileged code running in the guest at CPL 1-2, descriptors may not
+   provide access to the 64 megabyte region of high linear space. To
+   achieve this, the hypervisor will truncate descriptors in the
+   descriptor tables. This means that attempts by the guest to access
+   through negative offsets to the segment base will fault, so this is
+   highly discouraged (some TLS implementations on Linux do this).
+   In addition, this causes the truncated length of the segment to
+   become visible to the guest through the LSL instruction.
+
+   Interrupt and I/O Subsystem.
+
+   For security reasons, the guest operating system is not given
+   control over the hardware interrupt flag. We provide a virtual
+   interrupt flag that is under guest control. The guest operating
+   system always runs with hardware interrupts enabled, but hardware
+   interrupts are transparent to the guest. The API provides calls for
+   all instructions which modify the interrupt flag.
+
+   The paravirtualization environment provides a legacy programmable
+   interrupt controller (PIC) to the virtual machine. Future releases
+   will provide a virtual interrupt controller (VIC) with more
+   advanced features.
+
+   In addition to a virtual interrupt flag, there is also a virtual
+   IOPL field which the guest can use to enable access to port I/O
+   from userspace for privileged applications.
+
+   Generic PCI based device probing is available to detect virtual
+   devices. The use of PCI is pragmatic, since it allows a vendor
+   ID, class ID, and device ID to identify the appropriate driver
+   for each virtual device.
+
+   IDT Management.
+
+   The paravirtual operating environment provides the traditional x86
+   interrupt descriptor table for handling external interrupts,
+   software interrupts, and exceptions. The interrupt descriptor table
+   provides the destination code selector and EIP for interruptions.
+   The current task state structure (TSS) provides the new stack
+   address to use for interruptions that result in a privilege level
+   change. The guest OS is responsible for notifying the hypervisor
+   when it updates the stack address in the TSS.
+
+   Two types of indirect control flow are of critical importance to the
+   performance of an operating system. These are system calls and page
+   faults. The guest is also responsible for calling out to the
+   hypervisor when it updates gates in the IDT. Making IDT and TSS
+   updates known to the hypervisor in this fashion allows efficient
+   delivery through these performance critical gates.
+
+   Transparent Paravirtualization.
+
+   The guest operating system may have an alternative implementation
+   of the VMI option ROM compiled in. This implementation should
+   provide versions of the VMI calls that are suitable for
+   running on native x86 hardware. This code may be used by the guest
+   operating system while it is being loaded, and may also be used if
+   the operating system is loaded on hardware that does not support
+   paravirtualization.
+
+   When the guest detects that the VMI option ROM is available, it
+   replaces the compiled-in version of the ROM with the ROM provided by
+   the platform. This can be accomplished by copying the ROM contents,
+   or by remapping the virtual address containing the compiled-in ROM
+   to point to the platform's ROM. When booting on a platform that
+   does not provide a VMI ROM, the operating system can continue to use
+   the compiled-in version to run in a non-paravirtualized fashion.
+
+   3rd Party Extensions.
+
+   If desired, it should be possible for 3rd party virtual machine
+   monitors to implement a paravirtualization environment that can run
+   guests written to this specification.
+
+   The general mechanism for providing customized features and
+   capabilities is to provide notification of these features through
+   the CPUID call, and to allow configuration of CPU features
+   through RDMSR / WRMSR instructions, as in the sketch below.
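+   As a minimal sketch (assuming, purely for illustration, that the
+   hypervisor publishes a vendor string on CPUID leaf 0x40000000 - the
+   exact leaf numbering is not yet formalized by this specification):
+
+      #include <string.h>
+
+      static void cpuid(unsigned int leaf, unsigned int *a,
+                        unsigned int *b, unsigned int *c, unsigned int *d)
+      {
+          __asm__ __volatile__("cpuid"
+                               : "=a"(*a), "=b"(*b), "=c"(*c), "=d"(*d)
+                               : "a"(leaf));
+      }
+
+      /* Returns nonzero if the (hypothetical) vendor string matches. */
+      static int running_on_example_hypervisor(void)
+      {
+          unsigned int eax, vendor[3];
+
+          cpuid(0x40000000, &eax, &vendor[0], &vendor[1], &vendor[2]);
+          return memcmp(vendor, "ExampleHyprV", 12) == 0;
+      }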
+   This allows a hypervisor vendor
+   ID to be published, and the kernel may enable or disable specific
+   features based on this ID. This has the advantage of following
+   closely the boot time logic of many operating systems that enable
+   certain performance enhancements or bugfixes based on processor
+   revision, using exactly the same mechanism.
+
+   An exact formal specification of the new CPUID functions and which
+   functions are vendor specific is still needed.
+
+   AP Startup.
+
+   Application Processor startup in paravirtual SMP systems works a bit
+   differently than in a traditional x86 system.
+
+   APs will launch directly in paravirtual mode with initial state
+   provided by the BSP. Rather than the traditional init/startup
+   IPI sequence, the BSP must issue the init IPI, a set application
+   processor state hypercall, followed by the startup IPI.
+
+   The initial state contains the AP's control registers, general
+   purpose registers and segment registers, as well as the IDTR,
+   GDTR, LDTR and EFER. Any processor state not included in the initial
+   AP state (including x87 FPRs, SSE register states, and MSRs other than
+   EFER) is left in the poweron state.
+
+   The BSP must construct the initial GDT used by each AP. The segment
+   register hidden state will be loaded from the GDT specified in the
+   initial AP state. The IDT and (if used) LDT may either be constructed by
+   the BSP or by the AP.
+
+   Similarly, the initial page tables used by each AP must also be
+   constructed by the BSP.
+
+   If an AP's initial state is invalid, or no initial state is provided
+   before a start IPI is received by that AP, then the AP will fail to start.
+   It is therefore advisable to have a timeout for waiting for APs to start,
+   as is recommended for traditional x86 systems.
+
+   See VMI_SetInitialAPState in Appendix A for a description of the
+   VMI_SetInitialAPState hypercall and the associated APState data structure.
+
+   State Synchronization In SMP Systems.
+
+   Some in-memory data structures that may require no special synchronization
+   on a traditional x86 system need special handling when run on a
+   hypervisor. Two of particular note are the descriptor tables and page
+   tables.
+
+   Each processor in an SMP system should have its own GDT and LDT. Changes
+   to each processor's descriptor tables must be made on that processor
+   via the appropriate VMI calls. There is no VMI interface for updating
+   another CPU's descriptor tables (aside from VMI_SetInitialAPState),
+   and the results of memory writes to other processors' descriptor tables
+   are undefined.
+
+   Page tables have slightly different semantics than in a traditional x86
+   system. As in traditional x86 systems, page table writes may not be
+   respected by the current CPU until a TLB flush or invlpg is issued.
+   In a paravirtual system, the hypervisor implementation is free to
+   provide either shared or private caches of the guest's page tables.
+   Page table updates must therefore be propagated to the other CPUs
+   before they are guaranteed to be noticed.
+
+   In particular, when doing TLB shootdown, the initiating processor
+   must ensure that all deferred page table updates are flushed to the
+   hypervisor, to ensure that the receiving processor has the most up-to-date
+   mapping when it performs its invlpg.
+
+   Local APIC Support.
+
+   A traditional x86 local APIC is provided by the hypervisor. The local
+   APIC is enabled and its address is set via the IA32_APIC_BASE MSR, as
+   usual. APIC registers may be read and written via ordinary memory
+   operations.
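+   As a minimal sketch of the memory-mapped access path (the 0xFEE00000
+   default base and the 0x20 APIC ID register offset are standard x86
+   values assumed here; a guest should take the actual base from the
+   IA32_APIC_BASE MSR):
+
+      #include <stdint.h>
+
+      #define APIC_BASE   0xFEE00000u  /* assumed default base */
+      #define APIC_ID_REG 0x20u        /* local APIC ID register */
+
+      static inline uint32_t apic_read(uint32_t reg)
+      {
+          /* An ordinary (volatile) memory read of an APIC register. */
+          return *(volatile uint32_t *)(APIC_BASE + reg);
+      }
+
+      static inline uint32_t apic_id(void)
+      {
+          return apic_read(APIC_ID_REG) >> 24;
+      }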
+
+   For performance reasons, however, higher performance APIC read and
+   write interfaces are also provided. If possible, these interfaces
+   should be used to access the local APIC.
+
+   The IO-APIC is not included in this spec, as it is typically not
+   performance critical, and used mainly for initial wiring of IRQ pins.
+   Currently, we implement a fully functional IO-APIC with all the
+   capabilities of real hardware. This may seem like an unnecessary burden,
+   but if the goal is transparent paravirtualization, the kernel must
+   provide fallback support for an IO-APIC anyway. In addition, the
+   hypervisor must support an IO-APIC for SMP non-paravirtualized guests.
+   The net result is less code on both sides, and an already well defined
+   interface between the two. This avoids the complexity burden of having
+   to support two different interfaces to achieve the same task.
+
+   One shortcut we have found most helpful is to simply disable NMI delivery
+   to the paravirtualized kernel. There is no reason NMIs can't be
+   supported, but typical uses for them are not as productive in a
+   virtualized environment. Watchdog NMIs are of limited use if the OS is
+   already correct and running on stable hardware; profiling NMIs are
+   similarly of less use, since this task is accomplished with more accuracy
+   in the VMM itself; and NMIs for machine check errors should be handled
+   outside of the VM. The addition of NMI support does create additional
+   complexity for the trap handling code in the VM, and although the task is
+   surmountable, the value proposition is debatable. Here, again, feedback
+   is desired.
+
+   Time Interface.
+
+   In a virtualized environment, virtual machines (VMs) will time share
+   the system with each other and with other processes running on the
+   host system. Therefore, a VM's virtual CPUs (VCPUs) will be
+   executing on the host's physical CPUs (PCPUs) for only some portion
+   of time. This section of the VMI exposes a paravirtual view of
+   time to the guest operating systems so that they may operate more
+   effectively in a virtual environment. The interface also provides
+   a way for the VCPUs to set alarms in this paravirtual view of time.
+
+   Time Domains:
+
+   a) Wallclock Time:
+
+   Wallclock time exposed to the VM through this interface indicates
+   the number of nanoseconds since epoch, 1970-01-01T00:00:00Z (ISO
+   8601 date format). If the host's wallclock time changes (say, when
+   an error in the host's clock is corrected), so does the wallclock
+   time as viewed through this interface.
+
+   b) Real Time:
+
+   Another view of time accessible through this interface is real
+   time. Real time always progresses except when the VM is
+   stopped or suspended. Real time is presented to the guest as a
+   counter which increments at a constant rate defined (and presented)
+   by the hypervisor. All the VCPUs of a VM share the same real time
+   counter.
+
+   The unit of the counter is called "cycles". The unit and initial
+   value (corresponding to the time the VM enters para-virtual mode)
+   are chosen by the hypervisor so that the real time counter will not
+   roll over in any practical length of time. It is expected that the
+   frequency (cycles per second) is chosen such that this clock
+   provides a "high-resolution" view of time. The unit can only
+   change when the VM (re)enters paravirtual mode.
+
+   c) Stolen time and Available time:
+
+   A VCPU is always in one of three states: running, halted, or ready.
+   The VCPU is in the 'running' state if it is executing.
+   When the
+   VCPU executes the HLT interface, the VCPU enters the 'halted' state
+   and remains halted until there is some work pending for the VCPU
+   (e.g. an alarm expires, host I/O completes on behalf of virtual
+   I/O). At this point, the VCPU enters the 'ready' state (waiting
+   for the hypervisor to reschedule it). Finally, at any time when
+   the VCPU is in neither the 'running' state nor the 'halted' state,
+   it is in the 'ready' state.
+
+   For example, consider the following sequence of events, with times
+   given in real time:
+
+   (Example 1)
+
+   At 0 ms, VCPU executing guest code.
+   At 1 ms, VCPU requests virtual I/O.
+   At 2 ms, Host performs I/O for virtual I/O.
+   At 3 ms, VCPU executes VMI_Halt.
+   At 4 ms, Host completes I/O for virtual I/O request.
+   At 5 ms, VCPU begins executing guest code, vectoring to the interrupt
+            handler for the device initiating the virtual I/O.
+   At 6 ms, VCPU preempted by hypervisor.
+   At 9 ms, VCPU begins executing guest code.
+
+   From 0 ms to 3 ms, VCPU is in the 'running' state. At 3 ms, VCPU
+   enters the 'halted' state and remains in this state until the 4 ms
+   mark. From 4 ms to 5 ms, the VCPU is in the 'ready' state. At 5
+   ms, the VCPU re-enters the 'running' state until it is preempted by
+   the hypervisor at the 6 ms mark. From 6 ms to 9 ms, VCPU is again
+   in the 'ready' state, and finally 'running' again after 9 ms.
+
+   Stolen time is defined per VCPU to progress at the rate of real
+   time when the VCPU is in the 'ready' state, and does not progress
+   otherwise. Available time is defined per VCPU to progress at the
+   rate of real time when the VCPU is in the 'running' and 'halted'
+   states, and does not progress when the VCPU is in the 'ready'
+   state.
+
+   So, for the above example, the following table indicates these time
+   values for the VCPU at each ms boundary:
+
+      Real time    Stolen time    Available time
+          0             0                0
+          1             0                1
+          2             0                2
+          3             0                3
+          4             0                4
+          5             1                4
+          6             1                5
+          7             2                5
+          8             3                5
+          9             4                5
+         10             4                6
+
+   Notice that at any point:
+
+      real_time == stolen_time + available_time
+
+   Stolen time and available time are also presented as counters in
+   "cycles" units. The initial value of the stolen time counter is 0.
+   This implies the initial value of the available time counter is the
+   same as the real time counter.
+
+   Alarms:
+
+   Alarms can be set (armed) against the real time counter or the
+   available time counter. Alarms can be programmed to expire once
+   (one-shot) or on a regular period (periodic). They are armed by
+   indicating an absolute counter value expiry, and in the case of a
+   periodic alarm, a non-zero relative period counter value. [TBD:
+   The method of wiring the alarms to an interrupt vector is dependent
+   upon the virtual interrupt controller portion of the interface.
+   Currently, the alarms may be wired as if they are attached to IRQ0
+   or the vector in the local APIC LVTT. This way, the alarms can be
+   used as drop-in replacements for the PIT or local APIC timer.]
+
+   Alarms are per-vcpu mechanisms. An alarm set by vcpu0 will fire
+   only on vcpu0, while an alarm set by vcpu1 will only fire on vcpu1.
+   If an alarm is set relative to available time, its expiry is a
+   value relative to the available time counter of the vcpu that set
+   it.
+
+   The interface includes a method to cancel (disarm) an alarm. On
+   each vcpu, one alarm can be set against each of the two counters
+   (real time and available time). A vcpu in the 'halted' state
+   becomes 'ready' when the counter of any of its armed alarms
+   reaches its expiry.
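+   For instance, a guest might arm a periodic real time alarm as a
+   drop-in PIT replacement roughly as follows. This is a sketch only:
+   the VMI_SetAlarm, VMI_GetCycleFrequency and VMI_GetCycleCounter
+   bindings are the ones defined in Appendix A, but the numeric values
+   of the VMI_ALARM_* flags below are placeholders, not the real ABI
+   encodings.
+
+      typedef unsigned long long VMI_CYCLES;
+      typedef unsigned int VMI_UINT32;
+
+      extern void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
+                               VMI_CYCLES period);
+      extern VMI_CYCLES VMI_GetCycleFrequency(void);
+      extern VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);
+
+      #define VMI_CYCLES_REAL       0
+      #define VMI_ALARM_WIRED_IRQ0  (0u << 8) /* placeholder value */
+      #define VMI_ALARM_IS_PERIODIC (1u << 9) /* placeholder value */
+      #define TICK_HZ               100       /* guest's chosen rate */
+
+      void example_start_periodic_tick(void)
+      {
+          VMI_CYCLES period = VMI_GetCycleFrequency() / TICK_HZ;
+          VMI_CYCLES now    = VMI_GetCycleCounter(VMI_CYCLES_REAL);
+
+          /* One alarm per counter per vcpu; this arms the real time
+             alarm, delivered as if wired to IRQ0. */
+          VMI_SetAlarm(VMI_ALARM_WIRED_IRQ0 | VMI_ALARM_IS_PERIODIC |
+                       VMI_CYCLES_REAL, now + period, period);
+      }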
+
+   An alarm "fires" by signaling the virtual interrupt controller. An
+   alarm will fire as soon as possible after the counter value is
+   greater than or equal to the alarm's current expiry. However, an
+   alarm can fire only when its vcpu is in the 'running' state.
+
+   If the alarm is periodic, a sequence of expiry values,
+
+      E(i) = e0 + p * i,  i = 0, 1, 2, 3, ...
+
+   where 'e0' is the expiry specified when setting the alarm and 'p'
+   is the period of the alarm, is used to arm the alarm. Initially,
+   E(0) is used as the expiry. When the alarm fires, the next expiry
+   value in the sequence that is greater than the current value of the
+   counter is used as the alarm's new expiry.
+
+   One-shot alarms have only one expiry. When a one-shot alarm fires,
+   it is automatically disarmed.
+
+   Suppose an alarm is set relative to real time with expiry at the 3
+   ms mark and a period of 2 ms. It will expire on these real time
+   marks: 3, 5, 7, 9. Note that even if the alarm does not fire
+   during the 5 ms to 7 ms interval, the alarm can fire at most once
+   during the 7 ms to 9 ms interval (unless, of course, it is
+   reprogrammed).
+
+   If an alarm is set relative to available time with expiry at the 1
+   ms mark (in available time) and with a period of 2 ms, then it will
+   expire on these available time marks: 1, 3, 5. In the scenario
+   described in example 1, those available time values correspond to
+   these values in real time: 1, 3, 6.
+
+3) Architectural Differences from Native Hardware.
+
+   For the sake of performance, some requirements are imposed on kernel
+   fault handlers which are not present on real hardware. Most modern
+   operating systems should have no trouble meeting these requirements.
+   Failure to meet these requirements may prevent the kernel from
+   working properly.
+
+   1) The hardware flags on entry to a fault handler may not match
+      the EFLAGS image on the fault handler stack. The stack image
+      is correct, and will have the correct state of the interrupt
+      and arithmetic flags.
+
+   2) The stack used for kernel traps must be flat - that is, zero base,
+      segment limit determined by the hypervisor.
+
+   3) On entry to any fault handler, the stack must have sufficient space
+      to hold 32 bytes of data, or the guest may be terminated.
+
+   4) When calling VMI functions, the kernel must be running on a
+      flat 32-bit stack and code segment.
+
+   5) Most VMI functions require flat data and extra segment (DS and ES)
+      segments as well; notable exceptions are IRET and SYSEXIT.
+      XXXPara - may need to add STI and CLI to this list.
+
+   6) Interrupts must always be enabled when running code in userspace.
+
+   7) IOPL semantics for userspace are changed; although userspace may be
+      granted port access, it cannot affect the interrupt flag.
+
+   8) The EIPs at which faults may occur in VMI calls may not match the
+      original native instruction EIP; this is a bug in the system
+      today, as many guests do rely on lazy fault handling.
+
+   9) On entry to V8086 mode, MSR_SYSENTER_CS is cleared to zero.
+
+   10) Todo - we would like to support these features, but they are not
+       fully tested and / or implemented:
+
+       Userspace 16-bit stack support
+       Proper handling of faulting IRETs
+
+4) ROM Implementation
+
+   Modularization
+
+   Originally, we envisioned modularizing the ROM API into several
+   subsections, but the close coupling between the initial layers
+   and the requirement to support native PCI bus devices have made
+   ROM components for network or block devices unnecessary to this
+   point in time.
+
+   VMI - the virtual machine interface. This is the core CPU, I/O
+         and MMU virtualization layer. I/O is currently limited
+         to port access to emulated devices.
+
+   Detection
+
+   The presence of hypervisor ROMs can be recognized by scanning the
+   upper region of the first megabyte of physical memory. Multiple
+   ROMs may be provided to support older API versions for legacy guest
+   OS support. ROM detection is done in the traditional manner, by
+   scanning the memory region from C8000h - DFFFFh in 2 kilobyte
+   increments. The romSignature bytes must be '0x55, 0xAA', and the
+   checksum of the region indicated by the romLength field must be zero.
+   The checksum is a simple 8-bit addition of all bytes in the ROM region.
+
+   Data layout
+
+      typedef struct HyperRomHeader {
+         uint16_t        romSignature;
+         int8_t          romLength;
+         unsigned char   romEntry[4];
+         uint8_t         romPad0;
+         uint32_t        hyperSignature;
+         uint8_t         APIVersionMinor;
+         uint8_t         APIVersionMajor;
+         uint8_t         reserved0;
+         uint8_t         reserved1;
+         uint32_t        reserved2;
+         uint32_t        reserved3;
+         uint16_t        pciHeaderOffset;
+         uint16_t        pnpHeaderOffset;
+         uint32_t        romPad3;
+         char            reserved[32];
+         char            elfHeader[64];
+      } HyperRomHeader;
+
+   The first set of fields is defined by the BIOS:
+
+      romSignature    - fixed 0xAA55, BIOS ROM signature
+      romLength       - the length of the ROM, in 512 byte chunks.
+                        Determines the area to be checksummed.
+      romEntry        - 16-bit initialization code stub used by BIOS.
+      romPad0         - reserved
+
+   The next set of fields is defined by this API:
+
+      hyperSignature  - a 4 byte signature providing recognition of the
+                        device class represented by this ROM. Each
+                        device class defines its own unique signature.
+      APIVersionMinor - the revision level of this device class' API.
+                        This indicates incremental changes to the API.
+      APIVersionMajor - the major version. Used to indicate large
+                        revisions or additions to the API which break
+                        compatibility with the previous version.
+      reserved0,1,2,3 - for future expansion
+
+   The next set of fields is defined by the PCI / PnP BIOS spec:
+
+      pciHeaderOffset - relative offset to the PCI device header from
+                        the start of this ROM.
+      pnpHeaderOffset - relative offset to the PnP boot header from the
+                        start of this ROM.
+      romPad3         - reserved by PCI spec.
+
+   Finally, there is space for future header fields, and an area
+   reserved for an ELF header to point to symbol information.
+
+Appendix A - VMI ROM Low Level ABI
+
+   OS writers intending to port their OS to the paravirtualizable x86
+   processor being modeled by this hypervisor need to access the
+   hypervisor through the VMI layer. It is possible, although currently
+   unimplemented, to add or replace the functionality of
+   individual hypervisor calls by providing your own ROM images. This is
+   intended to allow third party customizations.
+
+   VMI compatible ROMs use the signature "cVmi" in the hyperSignature
+   field of the ROM header.
+
+   Many of these calls are compatible with the SVR4 C call ABI, using up
+   to three register arguments. Some calls are not, due to restrictions
+   of the native instruction set. Calls which diverge from this ABI are
+   noted. In GNU terms, this means most of the calls are compatible with
+   regparm(3) argument passing.
+
+   Most of these calls behave as standard C functions, and as such, may
+   clobber registers EAX, EDX, ECX, flags. Memory clobbers are noted
+   explicitly, since many of them may be inlined without a memory clobber.
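+   To make the convention concrete, a guest might declare and invoke an
+   entry point roughly as follows. This is a sketch: the regparm(3)
+   spelling of VMICALL matches the GNU terms above, but the indirection
+   through a function pointer located at ROM detection time is a
+   simplification, and vmi_out_byte is an illustrative name.
+
+      /* Up to three arguments in EAX, EDX, ECX, like GNU regparm(3). */
+      #define VMICALL __attribute__((regparm(3)))
+
+      typedef unsigned int VMI_UINT;
+
+      /* Type of the VMI_OUTB entry point described later in this
+         appendix: value in EAX, port number in EDX. */
+      typedef VMICALL void vmi_outb_fn(VMI_UINT value, VMI_UINT port);
+
+      /* Illustrative: set during ROM detection to point at the ROM's
+         VMI_OUTB entry (mechanism not shown). */
+      static vmi_outb_fn *vmi_out_byte;
+
+      static void example_out(void)
+      {
+          vmi_out_byte(0x80, 0x3F8); /* write one byte to an I/O port */
+      }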
+
+   Most of these calls require well defined segment conventions - that is,
+   flat full size 32-bit segments for all the general segments, CS, SS, DS,
+   ES. Exceptions in some cases are noted.
+
+   The net result of these choices is that most of the calls are very
+   easy to make from C-code, and calls that are likely to be required in
+   low level trap handling code are easy to call from assembler. Most
+   of these calls are also very easily implemented by the hypervisor
+   vendor in C code, and only the performance critical calls from
+   assembler paths require custom assembly implementations.
+
+   CORE INTERFACE CALLS
+
+   This set of calls provides the base functionality to establish running
+   the kernel in VMI mode.
+
+   The interface will be expanded to include feature negotiation, more
+   explicit control over call bundling and flushing, and hypervisor
+   notifications to allow inline code patching.
+
+   VMI_Init
+
+      VMICALL VMI_INT VMI_Init(void);
+
+      Initializes the hypervisor environment. Returns zero on success,
+      or -1 if the hypervisor could not be initialized. Note that this
+      is a recoverable error if the guest provides the requisite native
+      code to support transparent paravirtualization.
+
+      Inputs:   None
+      Outputs:  EAX = result
+      Clobbers: Standard
+      Segments: Standard
+
+
+   PROCESSOR STATE CALLS
+
+   This set of calls controls the online status of the processor. It
+   includes interrupt control, reboot, halt, and shutdown functionality.
+   Future expansions may include deep sleep and hotplug CPU capabilities.
+
+   VMI_DisableInterrupts
+
+      VMICALL void VMI_DisableInterrupts(void);
+
+      Disable maskable interrupts on the processor.
+
+      Inputs:   None
+      Outputs:  None
+      Clobbers: Flags only
+      Segments: As this is both performance critical and likely to
+        be called from low level interrupt code, this call does not
+        require flat DS/ES segments, but uses the stack segment for
+        data access. Therefore only CS/SS must be well defined.
+
+   VMI_EnableInterrupts
+
+      VMICALL void VMI_EnableInterrupts(void);
+
+      Enable maskable interrupts on the processor. Note that the
+      current implementation will always deliver any pending interrupts
+      on a call which enables interrupts, for compatibility with kernel
+      code which expects this behavior. Whether this should be required
+      is open for debate.
+
+      Inputs:   None
+      Outputs:  None
+      Clobbers: Flags only
+      Segments: CS/SS only
+
+   VMI_GetInterruptMask
+
+      VMICALL VMI_UINT VMI_GetInterruptMask(void);
+
+      Returns the current interrupt state mask of the processor. The
+      mask is defined to be 0x200 (matching processor flag IF) to indicate
+      interrupts are enabled.
+
+      Inputs:   None
+      Outputs:  EAX = mask
+      Clobbers: Flags only
+      Segments: CS/SS only
+
+   VMI_SetInterruptMask
+
+      VMICALL void VMI_SetInterruptMask(VMI_UINT mask);
+
+      Set the current interrupt state mask of the processor. Also
+      delivers any pending interrupts if the mask is set to allow
+      them.
+
+      Inputs:   EAX = mask
+      Outputs:  None
+      Clobbers: Flags only
+      Segments: CS/SS only
+
+   VMI_DeliverInterrupts (For future debate)
+
+      Enable and deliver any pending interrupts. This would remove
+      the implicit delivery semantic from the SetInterruptMask and
+      EnableInterrupts calls.
+
+   VMI_Pause
+
+      VMICALL void VMI_Pause(void);
+
+      Pause the processor temporarily, to allow a hypertwin or remote
+      CPU to continue operation without lock or cache contention.
+
+      Inputs:   None
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_Halt
+
+      VMICALL void VMI_Halt(void);
+
+      Put the processor into interruptible halt mode. This is defined
+      to be a non-running mode where maskable interrupts are enabled,
+      not a deep low power sleep mode.
+
+      Inputs:   None
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_Shutdown
+
+      VMICALL void VMI_Shutdown(void);
+
+      Put the processor into non-interruptible halt mode. This is defined
+      to be a non-running mode where maskable interrupts are disabled;
+      it indicates a power-off event for this CPU.
+
+      Inputs:   None
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_Reboot:
+
+      VMICALL void VMI_Reboot(VMI_INT how);
+
+      Reboot the virtual machine, using a hard or soft reboot. A soft
+      reboot corresponds to the effects of an INIT IPI, and preserves
+      some APIC and CR state. A hard reboot corresponds to a hardware
+      reset.
+
+      Inputs:   EAX = reboot mode
+                #define VMI_REBOOT_SOFT 0x0
+                #define VMI_REBOOT_HARD 0x1
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_SetInitialAPState:
+
+      VMICALL void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID);
+
+      Sets the initial state of the application processor with local APIC ID
+      "apicID" to the state in apState. apState must be the page-aligned
+      linear address of the APState structure describing the initial state of
+      the specified application processor.
+
+      Control register CR0 must have both PE and PG set; the result of
+      either of these bits being cleared is undefined. It is recommended
+      that for best performance, all processors in the system have the same
+      setting of the CR4 PAE bit. LME and LMA in EFER are both currently
+      unsupported. The result of setting either of these bits is undefined.
+
+      Inputs:   EAX = pointer to APState structure for new co-processor
+                EDX = APIC ID of processor to initialize
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+
+   DESCRIPTOR RELATED CALLS
+
+   VMI_SetGDT
+
+      VMICALL void VMI_SetGDT(VMI_DTR *gdtr);
+
+      Load the global descriptor table limit and base registers. In
+      addition to the straightforward load of the hardware registers, this
+      has the additional side effect of reloading all segment registers in a
+      virtual machine. The reason is that otherwise, the hidden part of
+      segment registers (the base field) may be put into a non-reversible
+      state. Non-reversible segments are problematic because they cannot be
+      reloaded - any subsequent loads of the segment will load the new
+      descriptor state. In general, it is not possible to resume direct
+      execution of the virtual machine if certain segments become
+      non-reversible.
+
+      A load of the GDTR may cause the guest visible memory image of the GDT
+      to be changed. This allows the hypervisor to share the GDT pages with
+      the guest, but also continue to maintain appropriate protections on the
+      GDT page by transparently adjusting the DPL and RPL of descriptors in
+      the GDT.
+
+      Inputs:   EAX = pointer to descriptor limit / base
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_SetIDT
+
+      VMICALL void VMI_SetIDT(VMI_DTR *idtr);
+
+      Load the interrupt descriptor table limit and base registers. The IDT
+      format is defined to be the same as native hardware.
+
+      A load of the IDTR may cause the guest visible memory image of the IDT
+      to be changed.
+      This allows the hypervisor to rewrite the IDT pages in
+      a format more suitable to the hypervisor, which may include adjusting
+      the DPL and RPL of descriptors in the guest IDT.
+
+      Inputs:   EAX = pointer to descriptor limit / base
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_SetLDT
+
+      VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel);
+
+      Load the local descriptor table. This has the additional side effect
+      of reloading all segment registers. See VMI_SetGDT for an
+      explanation of why this is required. A load of the LDT may cause the
+      guest visible memory image of the LDT to be changed, just as with GDT
+      and IDT loads.
+
+      Inputs:   EAX = GDT selector of LDT descriptor
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_SetTR
+
+      VMICALL void VMI_SetTR(VMI_SELECTOR trSel);
+
+      Load the task register. Functionally equivalent to the LTR
+      instruction.
+
+      Inputs:   EAX = GDT selector of TR descriptor
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_GetGDT
+
+      VMICALL void VMI_GetGDT(VMI_DTR *gdtr);
+
+      Copy the GDT limit and base fields into the provided pointer. This is
+      equivalent to the SGDT instruction, which is non-virtualizable.
+
+      Inputs:   EAX = pointer to descriptor limit / base
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_GetIDT
+
+      VMICALL void VMI_GetIDT(VMI_DTR *idtr);
+
+      Copy the IDT limit and base fields into the provided pointer. This is
+      equivalent to the SIDT instruction, which is non-virtualizable.
+
+      Inputs:   EAX = pointer to descriptor limit / base
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_GetLDT
+
+      VMICALL VMI_SELECTOR VMI_GetLDT(void);
+
+      Read the local descriptor table selector. Functionally equivalent
+      to the SLDT instruction, which is non-virtualizable.
+
+      Inputs:   None
+      Outputs:  EAX = selector of LDT descriptor
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_GetTR
+
+      VMICALL VMI_SELECTOR VMI_GetTR(void);
+
+      Read the task register selector. Functionally equivalent to the STR
+      instruction, which is non-virtualizable.
+
+      Inputs:   None
+      Outputs:  EAX = selector of TR descriptor
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_WriteGDTEntry
+
+      VMICALL void VMI_WriteGDTEntry(void *gdt, VMI_UINT entry,
+                                     VMI_UINT32 descLo,
+                                     VMI_UINT32 descHi);
+
+      Write a descriptor to a GDT entry. Note that writes to the GDT itself
+      may be disallowed by the hypervisor, in which case this call must be
+      converted into a hypercall. In addition, since the descriptor may need
+      to be modified to change limits and / or permissions, the guest kernel
+      should not assume the update will be binary identical to the passed
+      input.
+
+      Inputs:   EAX = pointer to GDT base
+                EDX = GDT entry number
+                ECX = descriptor low word
+                ST(1) = descriptor high word
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_WriteLDTEntry
+
+      VMICALL void VMI_WriteLDTEntry(void *ldt, VMI_UINT entry,
+                                     VMI_UINT32 descLo,
+                                     VMI_UINT32 descHi);
+
+      Write a descriptor to an LDT entry. Note that writes to the LDT itself
+      may be disallowed by the hypervisor, in which case this call must be
+      converted into a hypercall. In addition, since the descriptor may need
+      to be modified to change limits and / or permissions, the guest kernel
+      should not assume the update will be binary identical to the passed
+      input.
+
+      Inputs:   EAX = pointer to LDT base
+                EDX = LDT entry number
+                ECX = descriptor low word
+                ST(1) = descriptor high word
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_WriteIDTEntry
+
+      VMICALL void VMI_WriteIDTEntry(void *idt, VMI_UINT entry,
+                                     VMI_UINT32 descLo,
+                                     VMI_UINT32 descHi);
+
+      Write a descriptor to an IDT entry. Since the descriptor may need to be
+      modified to change limits and / or permissions, the guest kernel should
+      not assume the update will be binary identical to the passed input.
+
+      Inputs:   EAX = pointer to IDT base
+                EDX = IDT entry number
+                ECX = descriptor low word
+                ST(1) = descriptor high word
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+
+   CPU CONTROL CALLS
+
+   These calls encapsulate the set of privileged instructions used to
+   manipulate the CPU control state. These instructions are all properly
+   virtualizable using trap and emulate, but for performance reasons, a
+   direct call may be more efficient. With hardware virtualization
+   capabilities, many of these calls can be left as IDENT translations, that
+   is, inline implementations of the native instructions, which are not
+   rewritten by the hypervisor. Some of these calls are performance critical
+   during context switch paths, and some are not, but they are all included
+   for completeness, with the exception of the obsolete LMSW and SMSW
+   instructions.
+
+   VMI_WRMSR
+
+      VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg);
+
+      Write to a model specific register. This functions identically to the
+      hardware WRMSR instruction. Note that a hypervisor may not implement
+      the full set of MSRs supported by native hardware, since many of them
+      are not useful in the context of a virtual machine.
+
+      Inputs:   ECX = model specific register index
+                EAX = low word of register
+                EDX = high word of register
+      Outputs:  None
+      Clobbers: Standard, Memory
+      Segments: Standard
+
+   VMI_RDMSR
+
+      VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg);
+
+      Read from a model specific register. This functions identically to the
+      hardware RDMSR instruction. Note that a hypervisor may not implement
+      the full set of MSRs supported by native hardware, since many of them
+      are not useful in the context of a virtual machine.
+
+      Inputs:   ECX = model specific register index
+      Outputs:  EAX = low word of register
+                EDX = high word of register
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_SetCR0
+
+      VMICALL void VMI_SetCR0(VMI_UINT val);
+
+      Write to control register zero. This can cause TLB flush and FPU
+      handling side effects. The set of features available to the kernel
+      depend on the completeness of the hypervisor. An explicit list of
+      supported functionality or required settings may need to be negotiated
+      by the hypervisor and kernel during bootstrapping. This is likely to
+      be implementation or vendor specific, and the precise restrictions are
+      not yet worked out. Our implementation in general supports turning on
+      additional functionality - enabling protected mode, paging, page write
+      protections; however, once those features have been enabled, they may
+      not be disabled on the virtual hardware.
+
+      Inputs:   EAX = input to control register
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_SetCR2
+
+      VMICALL void VMI_SetCR2(VMI_UINT val);
+
+      Write to control register two. This has no side effects other than
+      updating the CR2 register value.
+
+      Inputs:   EAX = input to control register
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_SetCR3
+
+      VMICALL void VMI_SetCR3(VMI_UINT val);
+
+      Write to control register three. This causes a TLB flush on the local
+      processor. In addition, this update may be queued as part of a lazy
+      call invocation, which allows multiple hypercalls to be issued during
+      the context switch path. The queuing convention is to be negotiated
+      with the hypervisor during bootstrapping, but the interfaces for this
+      negotiation are currently vendor specific.
+
+      Inputs:   EAX = input to control register
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+      Queue Class: MMU
+
+   VMI_SetCR4
+
+      VMICALL void VMI_SetCR4(VMI_UINT val);
+
+      Write to control register four. This can cause TLB flush and many
+      other CPU side effects. The set of features available to the kernel
+      depend on the completeness of the hypervisor. An explicit list of
+      supported functionality or required settings may need to be negotiated
+      by the hypervisor and kernel during bootstrapping. This is likely to
+      be implementation or vendor specific, and the precise restrictions are
+      not yet worked out. Our implementation in general supports turning on
+      additional MMU functionality - enabling global pages, large pages, PAE
+      mode, and other features - however, once those features have been
+      enabled, they may not be disabled on the virtual hardware. The
+      remaining CPU control bits of CR4 remain active and behave identically
+      to real hardware.
+
+      Inputs:   EAX = input to control register
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_GetCR0
+   VMI_GetCR2
+   VMI_GetCR3
+   VMI_GetCR4
+
+      VMICALL VMI_UINT32 VMI_GetCR0(void);
+      VMICALL VMI_UINT32 VMI_GetCR2(void);
+      VMICALL VMI_UINT32 VMI_GetCR3(void);
+      VMICALL VMI_UINT32 VMI_GetCR4(void);
+
+      Read the value of a control register into EAX. The register contents
+      are identical to the native hardware control registers; CR0 contains
+      the control bits and task switched flag, CR2 contains the last page
+      fault address, CR3 contains the page directory base pointer, and CR4
+      contains various feature control bits.
+
+      Inputs:   None
+      Outputs:  EAX = value of control register
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_CLTS
+
+      VMICALL void VMI_CLTS(void);
+
+      Used to clear the task switched (TS) flag in control register zero. A
+      replacement for the CLTS instruction.
+
+      Inputs:   None
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_SetDR
+
+      VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val);
+
+      Set the debug register to the given value. If a hypervisor
+      implementation supports debug registers, this functions equivalently to
+      native hardware move to DR instructions.
+
+      Inputs:   EAX = debug register number
+                EDX = debug register value
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_GetDR
+
+      VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num);
+
+      Read a debug register. If debug registers are not supported, the
+      implementation is free to return zero values.
+
+      Inputs:   EAX = debug register number
+      Outputs:  EAX = debug register value
+      Clobbers: Standard
+      Segments: Standard
+
+
+   PROCESSOR INFORMATION CALLS
+
+   These calls provide access to processor identification, performance and
+   cycle data, which may be inaccurate due to the nature of running on
+   virtual hardware. This information may be visible in a non-virtualizable
+   way to applications running outside of the kernel.
+   As such, both RDTSC
+   and RDPMC should be disabled by kernels or hypervisors where information
+   leakage is a concern, and the accuracy of data retrieved by these functions
+   is up to the individual hypervisor vendor.
+
+   VMI_CPUID
+
+      /* Not expressible as a C function */
+
+      The CPUID instruction provides processor feature identification in a
+      vendor specific manner. The instruction itself is non-virtualizable
+      without hardware support, requiring a hypervisor assisted CPUID call
+      that emulates the effect of the native instruction, while masking any
+      unsupported CPU feature bits.
+
+      Inputs:   EAX = CPUID number
+                ECX = sub-level query (nonstandard)
+      Outputs:  EAX = CPUID dword 0
+                EBX = CPUID dword 1
+                ECX = CPUID dword 2
+                EDX = CPUID dword 3
+      Clobbers: Flags only
+      Segments: Standard
+
+   VMI_RDTSC
+
+      VMICALL VMI_UINT64 VMI_RDTSC(void);
+
+      The RDTSC instruction provides a cycles counter which may be made
+      visible to userspace. For better or worse, many applications have made
+      use of this feature to implement userspace timers, database indices, or
+      for micro-benchmarking of performance. This instruction is extremely
+      problematic for virtualization, because even though it is selectively
+      virtualizable using trap and emulate, it is much more expensive to
+      virtualize it in this fashion. On the other hand, if this instruction
+      is allowed to execute without trapping, the cycle counter provided
+      could be wrong in any number of circumstances due to hardware drift,
+      migration, suspend/resume, CPU hotplug, and other unforeseen
+      consequences of running inside of a virtual machine. There is no
+      standard specification for how this instruction operates when issued
+      from userspace programs, but the VMI call here provides a proper
+      interface for the kernel to read this cycle counter.
+
+      Inputs:   None
+      Outputs:  EAX = low word of TSC cycle counter
+                EDX = high word of TSC cycle counter
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_RDPMC
+
+      VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter);
+
+      Similar to RDTSC, this call provides the functionality of reading
+      processor performance counters. It is also selectively visible to
+      userspace, and maintaining accurate data for the performance counters
+      is an extremely difficult task due to the side effects introduced by
+      the hypervisor.
+
+      Inputs:   ECX = performance counter index
+      Outputs:  EAX = low word of counter
+                EDX = high word of counter
+      Clobbers: Standard
+      Segments: Standard
+
+
+   STACK / PRIVILEGE TRANSITION CALLS
+
+   This set of calls encapsulates mechanisms required to transfer between
+   higher privileged kernel tasks and userspace. The stack switching and
+   return mechanisms are also used to return from interrupt handlers into
+   the kernel, which may involve atomic interrupt state and stack
+   transitions.
+
+   VMI_UpdateKernelStack
+
+      VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0);
+
+      Inform the hypervisor that a new kernel stack pointer has been loaded
+      in the TSS structure. This new kernel stack pointer will be used for
+      entry into the kernel on interrupts from userspace.
+
+      Inputs:   EAX = pointer to TSS structure
+                EDX = new kernel stack top
+      Outputs:  None
+      Clobbers: Standard
+      Segments: Standard
+
+   VMI_IRET
+
+      /* No C prototype provided */
+
+      Perform a near equivalent of the IRET instruction, which atomically
+      switches off the current stack and restores the interrupt mask. This
+      may return to userspace or back to the kernel from an interrupt or
+      exception handler.
+      The VMI_IRET call does not restore IOPL from the
+      stack image, as the native hardware equivalent would. Instead, IOPL
+      must be explicitly restored using a VMI_SetIOPL call. The VMI_IRET
+      call does, however, restore the state of the EFLAGS_VM bit from the
+      stack image in the event that the hypervisor and kernel both support
+      V8086 execution mode. If the hypervisor does not support V8086 mode,
+      this cannot be silently ignored; it generates an error that the guest
+      must deal with. Note this call is made using a CALL instruction, just
+      as all other VMI calls, so the EIP of the call site is available to the
+      VMI layer. This allows faults during the sequence to be properly
+      passed back to the guest kernel with the correct EIP.
+
+      Note that returning to userspace with interrupts disabled is an invalid
+      operation in a paravirtualized kernel, and the results of an attempt to
+      do so are undefined.
+
+      Also note that when issuing the VMI_IRET call, the userspace data
+      segments may have already been restored, so only the stack and code
+      segments can be assumed valid.
+
+      There is currently no support for IRET calls from a 16-bit stack
+      segment, which poses a problem for supporting certain userspace
+      applications which make use of high bits of ESP on a 16-bit stack. How
+      to best resolve this is an open question. One possibility is to
+      introduce a new VMI call which can operate on 16-bit segments, since it
+      is desirable to make the common case here as fast as possible.
+
+      Inputs:   ST(0) = New EIP
+                ST(1) = New CS
+                ST(2) = New Flags (including interrupt mask)
+                ST(3) = New ESP (for userspace returns)
+                ST(4) = New SS (for userspace returns)
+                ST(5) = New ES (for v8086 returns)
+                ST(6) = New DS (for v8086 returns)
+                ST(7) = New FS (for v8086 returns)
+                ST(8) = New GS (for v8086 returns)
+      Outputs:  None (does not return)
+      Clobbers: None (does not return)
+      Segments: CS / SS only
+
+   VMI_SYSEXIT
+
+      /* No C prototype provided */
+
+      For hypervisors and processors which support SYSENTER / SYSEXIT, the
+      VMI_SYSEXIT call is provided as a binary equivalent to the native
+      SYSEXIT instruction. Since interrupts must always be enabled in
+      userspace, the VMI version of this function always combines atomically
+      enabling interrupts with the return to userspace.
+
+      Inputs:   EDX = New EIP
+                ECX = New ESP
+      Outputs:  None (does not return)
+      Clobbers: None (does not return)
+      Segments: CS / SS only
+
+
+   I/O CALLS
+
+   This set of calls incorporates I/O related calls - PIO, setting I/O
+   privilege level, and forcing memory writeback for device coherency.
+
+   VMI_INB
+   VMI_INW
+   VMI_INL
+
+      VMICALL VMI_UINT8  VMI_INB(VMI_UINT dummy, VMI_UINT port);
+      VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port);
+      VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port);
+
+      Input a byte, word, or doubleword from an I/O port. These
+      instructions have binary equivalent semantics to native instructions.
+
+      Inputs:   EDX = port number
+                EDX, rather than EAX is used, because the native
+                encoding of the instruction may use this register
+                implicitly.
+      Outputs:  EAX = port value
+      Clobbers: Memory only
+      Segments: Standard
+
+   VMI_OUTB
+   VMI_OUTW
+   VMI_OUTL
+
+      VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
+      VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
+      VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);
+
+      Output a byte, word, or doubleword to an I/O port. These
+      instructions have binary equivalent semantics to native instructions.
+   VMI_OUTB
+   VMI_OUTW
+   VMI_OUTL
+
+       VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port);
+       VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port);
+       VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port);
+
+       Output a byte, word, or doubleword to an I/O port. These
+       instructions have binary equivalent semantics to native
+       instructions.
+
+       Inputs:     EAX = port value
+                   EDX = port number
+       Outputs:    None
+       Clobbers:   None
+       Segments:   Standard
+
+   VMI_INSB
+   VMI_INSW
+   VMI_INSL
+
+       /* Not expressible as C functions */
+
+       Input a string of bytes, words, or doublewords from an I/O port.
+       These instructions have binary equivalent semantics to native
+       instructions. They do not follow a C calling convention, and
+       clobber only the same registers as the native instructions.
+
+       Inputs:     EDI = destination address
+                   EDX = port number
+                   ECX = count
+       Outputs:    None
+       Clobbers:   EDI, ECX, Memory
+       Segments:   Standard
+
+   VMI_OUTSB
+   VMI_OUTSW
+   VMI_OUTSL
+
+       /* Not expressible as C functions */
+
+       Output a string of bytes, words, or doublewords to an I/O port.
+       These instructions have binary equivalent semantics to native
+       instructions. They do not follow a C calling convention, and
+       clobber only the same registers as the native instructions.
+
+       Inputs:     ESI = source address
+                   EDX = port number
+                   ECX = count
+       Outputs:    None
+       Clobbers:   ESI, ECX
+       Segments:   Standard
+
+   VMI_IODelay
+
+       VMICALL void VMI_IODelay(void);
+
+       Delay the processor by the time required to access a bus register.
+       This is easily implemented on native hardware by an access to a
+       bus scratch register, but is typically not useful in a virtual
+       machine. It is paravirtualized to remove the overhead implied by
+       executing the native delay.
+
+       Inputs:     None
+       Outputs:    None
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_SetIOPLMask
+
+       VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask);
+
+       Set the IOPL mask of the processor to allow userspace to access
+       I/O ports. Note the mask is pre-shifted, so an IOPL of 3 would be
+       expressed as (3 << 12). If the guest chooses to use IOPL to allow
+       CPL-3 access to I/O ports, it must explicitly set and restore IOPL
+       using this call; attempting to set the IOPL flags with popf or
+       iret may have no effect.
+
+       Inputs:     EAX = Mask
+       Outputs:    None
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_WBINVD
+
+       VMICALL void VMI_WBINVD(void);
+
+       Write back and invalidate the data cache. This is used to
+       synchronize I/O memory.
+
+       Inputs:     None
+       Outputs:    None
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_INVD
+
+       This instruction is deprecated. It is invalid to execute in a
+       virtual machine. It is documented here only because it is still
+       declared in the interface, and dropping it would have required a
+       version change.
+
+
+   APIC CALLS
+
+   APIC virtualization is currently quite simple. These calls support
+   the functionality of the hardware APIC in a form that allows for a
+   more efficient implementation in a hypervisor, by avoiding trapped
+   accesses to APIC memory. The calls are kept simple to make the
+   implementation compatible with native hardware. The APIC must be
+   mapped at a page boundary in the processor virtual address space.
+
+   VMI_APICWrite
+
+       VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value);
+
+       Write to a local APIC register. Side effects are the same as on
+       native hardware APICs.
+
+       Inputs:     EAX = APIC register address
+                   EDX = value to write
+       Outputs:    None
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_APICRead
+
+       VMICALL VMI_UINT32 VMI_APICRead(void *reg);
+
+       Read from a local APIC register. Side effects are the same as on
+       native hardware APICs.
+
+       Inputs:     EAX = APIC register address
+       Outputs:    EAX = APIC register value
+       Clobbers:   Standard
+       Segments:   Standard
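+
+       As a hedged sketch of how these calls replace memory-mapped APIC
+       accesses: the register offsets below (version at 0x30, EOI at
+       0xB0) follow the standard local APIC layout, but APIC_BASE_VA is
+       an assumed, guest-chosen page-aligned mapping address, not
+       something this interface defines.
+
+       #define APIC_BASE_VA 0xfffe0000UL  /* assumed mapping address */
+       #define APIC_VERSION 0x30          /* local APIC version register */
+       #define APIC_EOI     0xb0          /* end-of-interrupt register */
+
+       static VMI_UINT32 apic_version(void)
+       {
+           return VMI_APICRead((void *)(APIC_BASE_VA + APIC_VERSION));
+       }
+
+       static void apic_eoi(void)
+       {
+           /* Acknowledge the current interrupt; same side effects as a
+              native APIC EOI write, without a trapped MMIO access. */
+           VMI_APICWrite((void *)(APIC_BASE_VA + APIC_EOI), 0);
+       }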
+
+
+   TIMER CALLS
+
+   The VMI interfaces define a highly accurate and efficient timer
+   interface that is available when running inside of a hypervisor. This
+   is an optional but highly recommended feature which avoids many of the
+   problems presented by classical timer virtualization. It provides
+   notions of stolen time, counters, and wall clock time which allow the
+   VM to obtain the most accurate time information in a way that is free
+   of races and legacy hardware dependence.
+
+   VMI_GetWallclockTime
+
+       VMI_NANOSECS VMICALL VMI_GetWallclockTime(void);
+
+       VMI_GetWallclockTime returns the current wallclock time as the
+       number of nanoseconds since the epoch. Nanosecond resolution
+       along with the 64-bit unsigned type provides over 580 years from
+       epoch until rollover. The wallclock time is relative to the
+       host's wallclock time.
+
+       Inputs:     None
+       Outputs:    EAX = low word, wallclock time in nanoseconds
+                   EDX = high word, wallclock time in nanoseconds
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_WallclockUpdated
+
+       VMI_BOOL VMICALL VMI_WallclockUpdated(void);
+
+       VMI_WallclockUpdated returns TRUE if the wallclock time has
+       changed relative to the real cycle counter since the previous time
+       that VMI_WallclockUpdated was polled. For example, while a VM is
+       suspended, the real cycle counter will halt, but wallclock time
+       will continue to advance. Upon resuming the VM, the first call to
+       VMI_WallclockUpdated will return TRUE.
+
+       Inputs:     None
+       Outputs:    EAX = 0 for FALSE, 1 for TRUE
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_GetCycleFrequency
+
+       VMICALL VMI_CYCLES VMI_GetCycleFrequency(void);
+
+       VMI_GetCycleFrequency returns the number of cycles in one second.
+       This value can be used by the guest to convert between cycles and
+       other time units.
+
+       Inputs:     None
+       Outputs:    EAX = low word, cycle frequency
+                   EDX = high word, cycle frequency
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_GetCycleCounter
+
+       VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter);
+
+       VMI_GetCycleCounter returns the current value, in cycle units, of
+       the counter corresponding to 'whichCounter' if it is one of
+       VMI_CYCLES_REAL, VMI_CYCLES_AVAILABLE or VMI_CYCLES_STOLEN.
+       VMI_GetCycleCounter returns 0 for any other value of
+       'whichCounter'.
+
+       Inputs:     EAX = counter index, one of
+                   #define VMI_CYCLES_REAL       0
+                   #define VMI_CYCLES_AVAILABLE  1
+                   #define VMI_CYCLES_STOLEN     2
+       Outputs:    EAX = low word, cycle counter
+                   EDX = high word, cycle counter
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_SetAlarm
+
+       VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry,
+                                 VMI_CYCLES period);
+
+       VMI_SetAlarm is used to arm the vcpu's alarms. The 'flags'
+       parameter is used to specify which counter's alarm is being set
+       (VMI_CYCLES_REAL or VMI_CYCLES_AVAILABLE), how to deliver the
+       alarm to the vcpu (VMI_ALARM_WIRED_IRQ0 or VMI_ALARM_WIRED_LVTT),
+       and the mode (VMI_ALARM_IS_ONESHOT or VMI_ALARM_IS_PERIODIC). If
+       the alarm is set against the VMI_CYCLES_STOLEN counter or an
+       undefined counter number, the call is a nop. The 'expiry'
+       parameter indicates the expiry time of the alarm, and for periodic
+       alarms, the 'period' parameter indicates the period of the alarm.
+       If the value of 'period' is zero, the alarm is armed as a one-shot
+       alarm regardless of the mode specified by 'flags'. Finally, a
+       call to VMI_SetAlarm for an alarm that is already armed is
+       equivalent to first calling VMI_CancelAlarm and then calling
+       VMI_SetAlarm, except that the value returned by VMI_CancelAlarm is
+       not accessible.
+
+       /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */
+
+       Inputs:     EAX = flags value, cycle counter number or'ed with
+                   #define VMI_ALARM_WIRED_IRQ0   0x00000000
+                   #define VMI_ALARM_WIRED_LVTT   0x00010000
+                   #define VMI_ALARM_IS_ONESHOT   0x00000000
+                   #define VMI_ALARM_IS_PERIODIC  0x00000100
+                   EDX = low word, alarm expiry
+                   ECX = high word, alarm expiry
+                   ST(0) = low word, alarm period
+                   ST(1) = high word, alarm period
+       Outputs:    None
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_CancelAlarm
+
+       VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags);
+
+       VMI_CancelAlarm is used to disarm an alarm. The 'flags' parameter
+       indicates which alarm to cancel (VMI_CYCLES_REAL or
+       VMI_CYCLES_AVAILABLE). The return value indicates whether or not
+       the cancel succeeded. A return value of FALSE indicates that the
+       alarm was already disarmed, either because a) the alarm was never
+       set, or b) it was a one-shot alarm and has already fired (though
+       perhaps not yet been delivered to the guest). TRUE indicates that
+       the alarm was armed and either a) the alarm was one-shot and has
+       not yet fired (and will no longer fire until it is rearmed), or
+       b) the alarm was periodic.
+
+       Inputs:     EAX = cycle counter number
+       Outputs:    EAX = 0 for FALSE, 1 for TRUE
+       Clobbers:   Standard
+       Segments:   Standard
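+
+       Putting the timer calls together, here is a hedged sketch of
+       arming a periodic 100 Hz tick against real time, delivered as a
+       virtual IRQ0; the GUEST_HZ constant and function names are
+       illustrative, not part of the interface.
+
+       #define GUEST_HZ 100   /* assumed guest tick rate */
+
+       static void start_periodic_tick(void)
+       {
+           VMI_CYCLES rate   = VMI_GetCycleFrequency();
+           VMI_CYCLES period = rate / GUEST_HZ;
+           VMI_CYCLES now    = VMI_GetCycleCounter(VMI_CYCLES_REAL);
+
+           /* First expiry one period from now, then every period. */
+           VMI_SetAlarm(VMI_CYCLES_REAL | VMI_ALARM_WIRED_IRQ0 |
+                        VMI_ALARM_IS_PERIODIC, now + period, period);
+       }
+
+       static void stop_periodic_tick(void)
+       {
+           (void) VMI_CancelAlarm(VMI_CYCLES_REAL);
+       }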
+
+
+   MMU CALLS
+
+   The MMU plays a large role in paravirtualization due to the large
+   performance opportunities realized by gaining insight into the guest
+   machine's use of page tables. These calls are designed to accommodate
+   the existing MMU functionality in the guest OS while providing the
+   hypervisor with hints that can be used to optimize performance to a
+   large degree.
+
+   VMI_SetLinearMapping
+
+       VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va,
+                                         VMI_UINT32 pages, VMI_UINT32 ppn);
+
+       /* The number of VMI address translation slots */
+       #define VMI_LINEAR_MAP_SLOTS 4
+
+       Register a virtual to physical translation of a virtual address
+       range to physical pages. This may be used to register single
+       pages or to register large ranges. There is an upper limit on the
+       number of active mappings, which should be sufficient to allow the
+       hypervisor and VMI layer to perform page translation without
+       requiring dynamic storage. Translations are only required to be
+       registered for addresses used to access page table entries through
+       the VMI page table access functions. The guest is free to use the
+       provided linear map slots in the manner it finds most convenient.
+       Kernels which linearly map a large chunk of physical memory and
+       use page tables in this linear region will only need to register
+       one such region after initialization of the VMI. Hypervisors
+       which do not require linear to physical conversion hints are free
+       to leave these calls as NOPs, which is the default when inlined
+       into the native kernel.
+
+       Inputs:     EAX = linear map slot
+                   EDX = virtual address start of mapping
+                   ECX = number of pages in mapping
+                   ST(0) = physical frame number to which pages are mapped
+       Outputs:    None
+       Clobbers:   Standard
+       Segments:   Standard
+
+   VMI_FlushTLB
+
+       VMICALL void VMI_FlushTLB(int how);
+
+       Flush all non-global mappings in the TLB, optionally flushing
+       global mappings as well. The VMI_FLUSH_TLB flag should always be
+       specified, optionally or'ed with the VMI_FLUSH_GLOBAL flag.
+
+       Inputs:     EAX = flush type
+                   #define VMI_FLUSH_TLB     0x01
+                   #define VMI_FLUSH_GLOBAL  0x02
+       Outputs:    None
+       Clobbers:   Standard, memory (implied)
+       Segments:   Standard
+
+   VMI_InvalPage
+
+       VMICALL void VMI_InvalPage(VMI_UINT32 va);
+
+       Invalidate the TLB mapping for a single page or large page at the
+       given virtual address.
+
+       Inputs:     EAX = virtual address
+       Outputs:    None
+       Clobbers:   Standard, memory (implied)
+       Segments:   Standard
+
+   The remaining documentation here needs updating when the PTE accessors
+   are simplified.
+
+   70) VMI_SetPte
+
+       void VMI_SetPte(VMI_PTE pte, VMI_PTE *ptep);
+
+       Assigns a new value to a page table / directory entry. It is a
+       requirement that ptep points to a page that has already been
+       registered with the hypervisor as a page of the appropriate type
+       using the VMI_RegisterPageUsage function.
+
+   71) VMI_SwapPte
+
+       VMI_PTE VMI_SwapPte(VMI_PTE pte, VMI_PTE *ptep);
+
+       Write 'pte' into the page table entry pointed to by 'ptep', and
+       return the old value of '*ptep'. This function acts atomically on
+       the PTE to provide up to date A/D bit information in the returned
+       value.
+
+   72) VMI_TestAndSetPteBit
+
+       VMI_BOOL VMI_TestAndSetPteBit(VMI_INT bit, VMI_PTE *ptep);
+
+       Atomically set a bit in a page table entry. Returns zero if the
+       bit was not set, and non-zero if the bit was set.
+
+   73) VMI_TestAndClearPteBit
+
+       VMI_BOOL VMI_TestAndClearPteBit(VMI_INT bit, VMI_PTE *ptep);
+
+       Atomically clear a bit in a page table entry. Returns zero if the
+       bit was not set, and non-zero if the bit was set.
+
+   74) VMI_SetPteLong
+   75) VMI_SwapPteLong
+   76) VMI_TestAndSetPteBitLong
+   77) VMI_TestAndClearPteBitLong
+
+       void VMI_SetPteLong(VMI_PAE_PTE pte, VMI_PAE_PTE *ptep);
+       VMI_PAE_PTE VMI_SwapPteLong(VMI_PAE_PTE pte, VMI_PAE_PTE *ptep);
+       VMI_BOOL VMI_TestAndSetPteBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
+       VMI_BOOL VMI_TestAndClearPteBitLong(VMI_INT bit, VMI_PAE_PTE *ptep);
+
+       These functions act identically to the 32-bit PTE update
+       functions, but provide support for PAE mode. The calls are
+       guaranteed to never create a temporarily invalid but present page
+       mapping that could be accidentally prefetched by another
+       processor, and all returned bits are guaranteed to be atomically
+       up to date.
+
+       One special exception is that the VMI_SwapPteLong function only
+       provides synchronization against A/D bits from other processors,
+       not against other invocations of VMI_SwapPteLong.
+
+   78) VMI_ClonePageTable
+       VMI_ClonePageDirectory
+
+       #define VMI_MKCLONE(start, count) (((start) << 16) | (count))
+
+       void VMI_ClonePageTable(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
+                               VMI_UINT32 flags);
+       void VMI_ClonePageDirectory(VMI_UINT32 dstPPN, VMI_UINT32 srcPPN,
+                                   VMI_UINT32 flags);
+
+       These functions tell the hypervisor to allocate a page shadow at
+       the PT or PD level using a shadow template. Because spare bits
+       are available in 'flags', these two calls may eventually be
+       merged, with the flags also indicating whether the shadows are
+       PAE.
+
+   80) VMI_RegisterPageUsage
+   81) VMI_ReleasePage
+
+       #define VMI_PAGE_PT   0x01
+       #define VMI_PAGE_PD   0x02
+       #define VMI_PAGE_PDP  0x04
+       #define VMI_PAGE_PML4 0x08
+       #define VMI_PAGE_GDT  0x10
+       #define VMI_PAGE_LDT  0x20
+       #define VMI_PAGE_IDT  0x40
+       #define VMI_PAGE_TSS  0x80
+
+       void VMI_RegisterPageUsage(VMI_UINT32 ppn, int flags);
+       void VMI_ReleasePage(VMI_UINT32 ppn, int flags);
+
+       These are used to register a page with the hypervisor as being of
+       a particular type; for instance, VMI_PAGE_PT says it is a page
+       table page.
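+
+       As a hedged sketch of how the MMU calls combine (using the
+       accessors as currently documented, which the note above says may
+       be simplified): a guest that keeps its page tables inside a
+       linearly mapped region might register that region once, declare a
+       new page table page, and then install entries through the
+       accessor. KERNEL_BASE, the helper name, and the slot usage are
+       all illustrative assumptions.
+
+       #define KERNEL_BASE 0xc0000000   /* assumed linear map base */
+
+       /* Hypothetical helper: install 'pte' at 'index' within a newly
+          allocated page table in physical page 'pt_ppn'. Assumes the
+          guest called VMI_SetLinearMapping(0, KERNEL_BASE, npages, 0)
+          once at startup to register the linear region. */
+       void example_install_pte(VMI_UINT32 pt_ppn, int index, VMI_PTE pte)
+       {
+           VMI_PTE *pt = (VMI_PTE *)(KERNEL_BASE + (pt_ppn << 12));
+
+           /* Declare the new page as a page table before use. */
+           VMI_RegisterPageUsage(pt_ppn, VMI_PAGE_PT);
+
+           /* Update the entry through the VMI accessor rather than a
+              direct store, so the hypervisor can track the write. */
+           VMI_SetPte(pte, &pt[index]);
+       }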
+
+   85) VMI_SetDeferredMode
+
+       void VMI_SetDeferredMode(VMI_UINT32 deferBits);
+
+       Set the lazy state update mode to the specified set of bits. This
+       allows the processor, hypervisor, or VMI layer to lazily update
+       certain CPU and MMU state. When setting this to a more permissive
+       setting, no flush is implied, but when clearing bits in the
+       current defer mask, all pending state will be flushed.
+
+       The 'deferBits' parameter is a mask specifying which classes of
+       state updates may be deferred.
+
+       #define VMI_DEFER_NONE 0x00
+
+       Disallow all asynchronous state updates. This is the default
+       state.
+
+       #define VMI_DEFER_MMU 0x01
+
+       Allow page table updates to be deferred. Note that page faults,
+       invalidations and TLB flushes will implicitly flush all pending
+       updates.
+
+       #define VMI_DEFER_CPU 0x02
+
+       Allow CPU state updates to control registers to be deferred, with
+       the exception of updates that change FPU state. This is useful
+       for combining a reload of the page table base in CR3 with other
+       updates, such as the current kernel stack.
+
+       #define VMI_DEFER_DT 0x04
+
+       Allow descriptor table updates to be delayed. This allows the
+       VMI_WriteGDTEntry / VMI_WriteIDTEntry / VMI_WriteLDTEntry calls to
+       be asynchronously queued.
+
+   86) VMI_FlushDeferredCalls
+
+       void VMI_FlushDeferredCalls(void);
+
+       Flush all asynchronous state updates which may be queued as a
+       result of setting deferred update mode.
+
+
+Appendix B - VMI C prototypes
+
+   Most of the VMI calls are properly callable C functions. Note that
+   for the absolute best performance, assembly calls are preferable in
+   some cases, as they do not imply all of the side effects of a C
+   function call, such as register clobbers and memory accesses.
+   Nevertheless, these wrappers serve as a useful interface definition
+   for higher level languages.
+
+   In some cases, a dummy variable is passed as an unused input to force
+   proper alignment of the remaining register values.
+
+   The calling convention for these is defined to be the standard GCC
+   convention with register passing.
The regparm call interface is documented at: + + http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html + + Types used by these calls: + + VMI_UINT64 64 bit unsigned integer + VMI_UINT32 32 bit unsigned integer + VMI_UINT16 16 bit unsigned integer + VMI_UINT8 8 bit unsigned integer + VMI_INT 32 bit integer + VMI_UINT 32 bit unsigned integer + VMI_DTR 6 byte compressed descriptor table limit/base + VMI_PTE 4 byte page table entry (or page directory) + VMI_LONG_PTE 8 byte page table entry (or PDE or PDPE) + VMI_SELECTOR 16 bit segment selector + VMI_BOOL 32 bit unsigned integer + VMI_CYCLES 64 bit unsigned integer + VMI_NANOSECS 64 bit unsigned integer + + + #ifndef VMI_PROTOTYPES_H + #define VMI_PROTOTYPES_H + + /* Insert local type definitions here */ + typedef struct VMI_DTR { + uint16 limit; + uint32 offset __attribute__ ((packed)); + } VMI_DTR; + + typedef struct APState { + VMI_UINT32 cr0; + VMI_UINT32 cr2; + VMI_UINT32 cr3; + VMI_UINT32 cr4; + + VMI_UINT64 efer; + + VMI_UINT32 eip; + VMI_UINT32 eflags; + VMI_UINT32 eax; + VMI_UINT32 ebx; + VMI_UINT32 ecx; + VMI_UINT32 edx; + VMI_UINT32 esp; + VMI_UINT32 ebp; + VMI_UINT32 esi; + VMI_UINT32 edi; + VMI_UINT16 cs; + VMI_UINT16 ss; + + VMI_UINT16 ds; + VMI_UINT16 es; + VMI_UINT16 fs; + VMI_UINT16 gs; + VMI_UINT16 ldtr; + + VMI_UINT16 gdtrLimit; + VMI_UINT32 gdtrBase; + VMI_UINT32 idtrBase; + VMI_UINT16 idtrLimit; + } APState; + + #define VMICALL __attribute__((regparm(3))) + + /* CORE INTERFACE CALLS */ + VMICALL void VMI_Init(void); + + /* PROCESSOR STATE CALLS */ + VMICALL void VMI_DisableInterrupts(void); + VMICALL void VMI_EnableInterrupts(void); + + VMICALL VMI_UINT VMI_GetInterruptMask(void); + VMICALL void VMI_SetInterruptMask(VMI_UINT mask); + + VMICALL void VMI_Pause(void); + VMICALL void VMI_Halt(void); + VMICALL void VMI_Shutdown(void); + VMICALL void VMI_Reboot(VMI_INT how); + + #define VMI_REBOOT_SOFT 0x0 + #define VMI_REBOOT_HARD 0x1 + + void VMI_SetInitialAPState(APState *apState, VMI_UINT32 apicID); + + /* DESCRIPTOR RELATED CALLS */ + VMICALL void VMI_SetGDT(VMI_DTR *gdtr); + VMICALL void VMI_SetIDT(VMI_DTR *idtr); + VMICALL void VMI_SetLDT(VMI_SELECTOR ldtSel); + VMICALL void VMI_SetTR(VMI_SELECTOR ldtSel); + + VMICALL void VMI_GetGDT(VMI_DTR *gdtr); + VMICALL void VMI_GetIDT(VMI_DTR *idtr); + VMICALL VMI_SELECTOR VMI_GetLDT(void); + VMICALL VMI_SELECTOR VMI_GetTR(void); + + VMICALL void VMI_WriteGDTEntry(void *gdt, + VMI_UINT entry, + VMI_UINT32 descLo, + VMI_UINT32 descHi); + VMICALL void VMI_WriteLDTEntry(void *gdt, + VMI_UINT entry, + VMI_UINT32 descLo, + VMI_UINT32 descHi); + VMICALL void VMI_WriteIDTEntry(void *gdt, + VMI_UINT entry, + VMI_UINT32 descLo, + VMI_UINT32 descHi); + + /* CPU CONTROL CALLS */ + VMICALL void VMI_WRMSR(VMI_UINT64 val, VMI_UINT32 reg); + VMICALL void VMI_WRMSR_SPLIT(VMI_UINT32 valLo, VMI_UINT32 valHi, + VMI_UINT32 reg); + + /* Not truly a proper C function; use dummy to align reg in ECX */ + VMICALL VMI_UINT64 VMI_RDMSR(VMI_UINT64 dummy, VMI_UINT32 reg); + + VMICALL void VMI_SetCR0(VMI_UINT val); + VMICALL void VMI_SetCR2(VMI_UINT val); + VMICALL void VMI_SetCR3(VMI_UINT val); + VMICALL void VMI_SetCR4(VMI_UINT val); + + VMICALL VMI_UINT32 VMI_GetCR0(void); + VMICALL VMI_UINT32 VMI_GetCR2(void); + VMICALL VMI_UINT32 VMI_GetCR3(void); + VMICALL VMI_UINT32 VMI_GetCR4(void); + + VMICALL void VMI_CLTS(void); + + VMICALL void VMI_SetDR(VMI_UINT32 num, VMI_UINT32 val); + VMICALL VMI_UINT32 VMI_GetDR(VMI_UINT32 num); + + /* PROCESSOR INFORMATION CALLS */ + + VMICALL VMI_UINT64 VMI_RDTSC(void); 
+ VMICALL VMI_UINT64 VMI_RDPMC(VMI_UINT64 dummy, VMI_UINT32 counter); + + /* STACK / PRIVILEGE TRANSITION CALLS */ + VMICALL void VMI_UpdateKernelStack(void *tss, VMI_UINT32 esp0); + + /* I/O CALLS */ + /* Native port in EDX - use dummy */ + VMICALL VMI_UINT8 VMI_INB(VMI_UINT dummy, VMI_UINT port); + VMICALL VMI_UINT16 VMI_INW(VMI_UINT dummy, VMI_UINT port); + VMICALL VMI_UINT32 VMI_INL(VMI_UINT dummy, VMI_UINT port); + + VMICALL void VMI_OUTB(VMI_UINT value, VMI_UINT port); + VMICALL void VMI_OUTW(VMI_UINT value, VMI_UINT port); + VMICALL void VMI_OUTL(VMI_UINT value, VMI_UINT port); + + VMICALL void VMI_IODelay(void); + VMICALL void VMI_WBINVD(void); + VMICALL void VMI_SetIOPLMask(VMI_UINT32 mask); + + /* APIC CALLS */ + VMICALL void VMI_APICWrite(void *reg, VMI_UINT32 value); + VMICALL VMI_UINT32 VMI_APICRead(void *reg); + + /* TIMER CALLS */ + VMICALL VMI_NANOSECS VMI_GetWallclockTime(void); + VMICALL VMI_BOOL VMI_WallclockUpdated(void); + + /* Predefined rate of the wallclock. */ + #define VMI_WALLCLOCK_HZ 1000000000 + + VMICALL VMI_CYCLES VMI_GetCycleFrequency(void); + VMICALL VMI_CYCLES VMI_GetCycleCounter(VMI_UINT32 whichCounter); + + /* Defined cycle counters */ + #define VMI_CYCLES_REAL 0 + #define VMI_CYCLES_AVAILABLE 1 + #define VMI_CYCLES_STOLEN 2 + + VMICALL void VMI_SetAlarm(VMI_UINT32 flags, VMI_CYCLES expiry, + VMI_CYCLES period); + VMICALL VMI_BOOL VMI_CancelAlarm(VMI_UINT32 flags); + + /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */ + #define VMI_ALARM_COUNTER_MASK 0x000000ff + + #define VMI_ALARM_WIRED_IRQ0 0x00000000 + #define VMI_ALARM_WIRED_LVTT 0x00010000 + + #define VMI_ALARM_IS_ONESHOT 0x00000000 + #define VMI_ALARM_IS_PERIODIC 0x00000100 + + /* MMU CALLS */ + VMICALL void VMI_SetLinearMapping(int slot, VMI_UINT32 va, + VMI_UINT32 pages, VMI_UINT32 ppn); + + /* The number of VMI address translation slot */ + #define VMI_LINEAR_MAP_SLOTS 4 + + VMICALL void VMI_InvalPage(VMI_UINT32 va); + VMICALL void VMI_FlushTLB(int how); + + /* Flags used by VMI_FlushTLB call */ + #define VMI_FLUSH_TLB 0x01 + #define VMI_FLUSH_GLOBAL 0x02 + + #endif + + +Appendix C - Sensitive x86 instructions in the paravirtual environment + + This is a list of x86 instructions which may operate in a different manner + when run inside of a paravirtual environment. + + ARPL - continues to function as normal, but kernel segment registers + may be different, so parameters to this instruction may need + to be modified. (System) + + IRET - the IRET instruction will be unable to change the IOPL, VM, + VIF, VIP, or IF fields. (System) + + the IRET instruction may #GP if the return CS/SS RPL are + below the CPL, or are not equal. (System) + + LAR - the LAR instruction will reveal changes to the DPL field of + descriptors in the GDT and LDT tables. (System, User) + + LSL - the LSL instruction will reveal changes to the segment limit + of descriptors in the GDT and LDT tables. (System, User) + + LSS - the LSS instruction may #GP if the RPL is not set properly. + (System) + + MOV - the mov %seg, %reg instruction may reveal a different RPL + on the segment register. (System) + + The mov %reg, %ss instruction may #GP if the RPL is not set + to the current CPL. (System) + + POP - the pop %ss instruction may #GP if the RPL is not set to + the appropriate CPL. (System) + + POPF - the POPF instruction will be unable to set the hardware + interrupt flag. (System) + + PUSH - the push %seg instruction may reveal a different RPL on the + segment register. 
+       (System)
+
+   PUSHF - the PUSHF instruction will reveal a possibly different IOPL,
+       and the value of the hardware interrupt flag, which is always
+       set. (System, User)
+
+   SGDT - the SGDT instruction will reveal the location and length of
+       the GDT shadow instead of the guest GDT. (System, User)
+
+   SIDT - the SIDT instruction will reveal the location and length of
+       the IDT shadow instead of the guest IDT. (System, User)
+
+   SLDT - the SLDT instruction will reveal the selector used for
+       the shadow LDT rather than the selector loaded by the guest.
+       (System, User)
+
+   STR - the STR instruction will reveal the selector used for the
+       shadow TSS rather than the selector loaded by the guest.
+       (System, User)
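+
+   Because PUSHF and POPF cannot be used to reliably sample or restore
+   the virtual interrupt flag (see the POPF and PUSHF entries above), a
+   paravirtualized guest must save and restore its interrupt state
+   through the VMI calls instead. A hedged sketch, replacing the usual
+   pushf / cli ... popf idiom, using the prototypes from Appendix B:
+
+       static void critical_section(void)
+       {
+           /* Save the virtual interrupt mask, then disable interrupts. */
+           VMI_UINT flags = VMI_GetInterruptMask();
+           VMI_DisableInterrupts();
+
+           /* ... code that must not be interrupted ... */
+
+           /* Restore the previously saved interrupt state. */
+           VMI_SetInterruptMask(flags);
+       }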