[Xen-changelog] Split up docs. Signed-off-by: Robb Romans <3r@xxxxxxxxxx>
# HG changeset patch # User kaf24@xxxxxxxxxxxxxxxxxxxx # Node ID 750ad97f37b0a49451c9b887c8ccb9134cc8a1ec # Parent c0796e18b6a45f0352770e700e3f6cae028bd2e3 Split up docs. Signed-off-by: Robb Romans <3r@xxxxxxxxxx> diff -r c0796e18b6a4 -r 750ad97f37b0 docs/Makefile --- a/docs/Makefile Tue Sep 20 09:08:26 2005 +++ b/docs/Makefile Tue Sep 20 09:17:33 2005 @@ -12,7 +12,7 @@ pkgdocdir := /usr/share/doc/xen -DOC_TEX := $(wildcard src/*.tex) +DOC_TEX := src/user.tex src/interface.tex DOC_PS := $(patsubst src/%.tex,ps/%.ps,$(DOC_TEX)) DOC_PDF := $(patsubst src/%.tex,pdf/%.pdf,$(DOC_TEX)) DOC_HTML := $(patsubst src/%.tex,html/%/index.html,$(DOC_TEX)) diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/interface.tex --- a/docs/src/interface.tex Tue Sep 20 09:08:26 2005 +++ b/docs/src/interface.tex Tue Sep 20 09:17:33 2005 @@ -87,1084 +87,23 @@ mechanism and policy within the system. +%% chapter Virtual Architecture moved to architecture.tex +\include{src/interface/architecture} -\chapter{Virtual Architecture} +%% chapter Memory moved to memory.tex +\include{src/interface/memory} -On a Xen-based system, the hypervisor itself runs in {\it ring 0}. It -has full access to the physical memory available in the system and is -responsible for allocating portions of it to the domains. Guest -operating systems run in and use {\it rings 1}, {\it 2} and {\it 3} as -they see fit. Segmentation is used to prevent the guest OS from -accessing the portion of the address space that is reserved for -Xen. We expect most guest operating systems will use ring 1 for their -own operation and place applications in ring 3. +%% chapter Devices moved to devices.tex +\include{src/interface/devices} -In this chapter we consider the basic virtual architecture provided -by Xen: the basic CPU state, exception and interrupt handling, and -time. Other aspects such as memory and device access are discussed -in later chapters. - -\section{CPU state} - -All privileged state must be handled by Xen. The guest OS has no -direct access to CR3 and is not permitted to update privileged bits in -EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen; -these are analogous to system calls but occur from ring 1 to ring 0. - -A list of all hypercalls is given in Appendix~\ref{a:hypercalls}. - - - -\section{Exceptions} - -A virtual IDT is provided --- a domain can submit a table of trap -handlers to Xen via the {\tt set\_trap\_table()} hypercall. Most trap -handlers are identical to native x86 handlers, although the page-fault -handler is somewhat different. - - -\section{Interrupts and events} - -Interrupts are virtualized by mapping them to \emph{events}, which are -delivered asynchronously to the target domain using a callback -supplied via the {\tt set\_callbacks()} hypercall. A guest OS can map -these events onto its standard interrupt dispatch mechanisms. Xen is -responsible for determining the target domain that will handle each -physical interrupt source. For more details on the binding of event -sources to events, see Chapter~\ref{c:devices}. - - - -\section{Time} - -Guest operating systems need to be aware of the passage of both real -(or wallclock) time and their own `virtual time' (the time for -which they have been executing). Furthermore, Xen has a notion of -time which is used for scheduling. The following notions of -time are provided: - -\begin{description} -\item[Cycle counter time.] - -This provides a fine-grained time reference. The cycle counter time is -used to accurately extrapolate the other time references. 
On SMP machines -it is currently assumed that the cycle counter time is synchronized between -CPUs. The current x86-based implementation achieves this within inter-CPU -communication latencies. - -\item[System time.] - -This is a 64-bit counter which holds the number of nanoseconds that -have elapsed since system boot. - - -\item[Wall clock time.] - -This is the time of day in a Unix-style {\tt struct timeval} (seconds -and microseconds since 1 January 1970, adjusted by leap seconds). An -NTP client hosted by {\it domain 0} can keep this value accurate. - - -\item[Domain virtual time.] - -This progresses at the same pace as system time, but only while a -domain is executing --- it stops while a domain is de-scheduled. -Therefore the share of the CPU that a domain receives is indicated by -the rate at which its virtual time increases. - -\end{description} - - -Xen exports timestamps for system time and wall-clock time to guest -operating systems through a shared page of memory. Xen also provides -the cycle counter time at the instant the timestamps were calculated, -and the CPU frequency in Hertz. This allows the guest to extrapolate -system and wall-clock times accurately based on the current cycle -counter time. - -Since all time stamps need to be updated and read \emph{atomically} -two version numbers are also stored in the shared info page. The -first is incremented prior to an update, while the second is only -incremented afterwards. Thus a guest can be sure that it read a consistent -state by checking the two version numbers are equal. - -Xen includes a periodic ticker which sends a timer event to the -currently executing domain every 10ms. The Xen scheduler also sends a -timer event whenever a domain is scheduled; this allows the guest OS -to adjust for the time that has passed while it has been inactive. In -addition, Xen allows each domain to request that they receive a timer -event sent at a specified system time by using the {\tt -set\_timer\_op()} hypercall. Guest OSes may use this timer to -implement timeout values when they block. - - - -%% % akw: demoting this to a section -- not sure if there is any point -%% % though, maybe just remove it. - -\section{Xen CPU Scheduling} - -Xen offers a uniform API for CPU schedulers. It is possible to choose -from a number of schedulers at boot and it should be easy to add more. -The BVT, Atropos and Round Robin schedulers are part of the normal -Xen distribution. BVT provides proportional fair shares of the CPU to -the running domains. Atropos can be used to reserve absolute shares -of the CPU for each domain. Round-robin is provided as an example of -Xen's internal scheduler API. - -\paragraph*{Note: SMP host support} -Xen has always supported SMP host systems. Domains are statically assigned to -CPUs, either at creation time or when manually pinning to a particular CPU. -The current schedulers then run locally on each CPU to decide which of the -assigned domains should be run there. The user-level control software -can be used to perform coarse-grain load-balancing between CPUs. - - -%% More information on the characteristics and use of these schedulers is -%% available in {\tt Sched-HOWTO.txt}. - - -\section{Privileged operations} - -Xen exports an extended interface to privileged domains (viz.\ {\it - Domain 0}). This allows such domains to build and boot other domains -on the server, and provides control interfaces for managing -scheduling, memory, networking, and block devices. 
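As an illustration of the version-number scheme described in the Time section above, the sketch below shows how a guest might read a consistent system-time snapshot from the shared info page. The structure layout and field names are assumptions made for this example (the authoritative definitions live in {\tt xen/include/public/xen.h}), and memory barriers are omitted for brevity.

\begin{verbatim}
/* Illustrative layout only -- see xen/include/public/xen.h for the
 * real shared info definition. */
typedef struct {
    volatile unsigned long version1;      /* bumped before an update */
    volatile unsigned long version2;      /* bumped after an update  */
    unsigned long long     tsc_timestamp; /* cycle counter at update */
    unsigned long long     system_time;   /* nanoseconds since boot  */
} shared_time_t;

/* Retry until the two version numbers match, which guarantees that
 * no update was in progress while the timestamp was being read. */
unsigned long long read_system_time(shared_time_t *t)
{
    unsigned long pre, post;
    unsigned long long stamp;

    do {
        post  = t->version2;      /* 'incremented after' counter  */
        stamp = t->system_time;
        pre   = t->version1;      /* 'incremented before' counter */
    } while (pre != post);

    return stamp;
}
\end{verbatim}

The same loop can be extended to capture the cycle-counter timestamp and CPU frequency, allowing the guest to extrapolate system and wall-clock time between updates as described above.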
- - -\chapter{Memory} -\label{c:memory} - -Xen is responsible for managing the allocation of physical memory to -domains, and for ensuring safe use of the paging and segmentation -hardware. - - -\section{Memory Allocation} - - -Xen resides within a small fixed portion of physical memory; it also -reserves the top 64MB of every virtual address space. The remaining -physical memory is available for allocation to domains at a page -granularity. Xen tracks the ownership and use of each page, which -allows it to enforce secure partitioning between domains. - -Each domain has a maximum and current physical memory allocation. -A guest OS may run a `balloon driver' to dynamically adjust its -current memory allocation up to its limit. - - -%% XXX SMH: I use machine and physical in the next section (which -%% is kinda required for consistency with code); wonder if this -%% section should use same terms? -%% -%% Probably. -%% -%% Merging this and below section at some point prob makes sense. - -\section{Pseudo-Physical Memory} - -Since physical memory is allocated and freed on a page granularity, -there is no guarantee that a domain will receive a contiguous stretch -of physical memory. However most operating systems do not have good -support for operating in a fragmented physical address space. To aid -porting such operating systems to run on top of Xen, we make a -distinction between \emph{machine memory} and \emph{pseudo-physical -memory}. - -Put simply, machine memory refers to the entire amount of memory -installed in the machine, including that reserved by Xen, in use by -various domains, or currently unallocated. We consider machine memory -to comprise a set of 4K \emph{machine page frames} numbered -consecutively starting from 0. Machine frame numbers mean the same -within Xen or any domain. - -Pseudo-physical memory, on the other hand, is a per-domain -abstraction. It allows a guest operating system to consider its memory -allocation to consist of a contiguous range of physical page frames -starting at physical frame 0, despite the fact that the underlying -machine page frames may be sparsely allocated and in any order. - -To achieve this, Xen maintains a globally readable {\it -machine-to-physical} table which records the mapping from machine page -frames to pseudo-physical ones. In addition, each domain is supplied -with a {\it physical-to-machine} table which performs the inverse -mapping. Clearly the machine-to-physical table has size proportional -to the amount of RAM installed in the machine, while each -physical-to-machine table has size proportional to the memory -allocation of the given domain. - -Architecture dependent code in guest operating systems can then use -the two tables to provide the abstraction of pseudo-physical -memory. In general, only certain specialized parts of the operating -system (such as page table management) needs to understand the -difference between machine and pseudo-physical addresses. - -\section{Page Table Updates} - -In the default mode of operation, Xen enforces read-only access to -page tables and requires guest operating systems to explicitly request -any modifications. Xen validates all such requests and only applies -updates that it deems safe. This is necessary to prevent domains from -adding arbitrary mappings to their page tables. - -To aid validation, Xen associates a type and reference count with each -memory page. 
A page has one of the following -mutually-exclusive types at any point in time: page directory ({\sf -PD}), page table ({\sf PT}), local descriptor table ({\sf LDT}), -global descriptor table ({\sf GDT}), or writable ({\sf RW}). Note that -a guest OS may always create readable mappings of its own memory -regardless of its current type. -%%% XXX: possibly explain more about ref count 'lifecyle' here? -This mechanism is used to -maintain the invariants required for safety; for example, a domain -cannot have a writable mapping to any part of a page table as this -would require the page concerned to simultaneously be of types {\sf - PT} and {\sf RW}. - - -%\section{Writable Page Tables} - -Xen also provides an alternative mode of operation in which guests be -have the illusion that their page tables are directly writable. Of -course this is not really the case, since Xen must still validate -modifications to ensure secure partitioning. To this end, Xen traps -any write attempt to a memory page of type {\sf PT} (i.e., that is -currently part of a page table). If such an access occurs, Xen -temporarily allows write access to that page while at the same time -{\em disconnecting} it from the page table that is currently in -use. This allows the guest to safely make updates to the page because -the newly-updated entries cannot be used by the MMU until Xen -revalidates and reconnects the page. -Reconnection occurs automatically in a number of situations: for -example, when the guest modifies a different page-table page, when the -domain is preempted, or whenever the guest uses Xen's explicit -page-table update interfaces. - -Finally, Xen also supports a form of \emph{shadow page tables} in -which the guest OS uses a independent copy of page tables which are -unknown to the hardware (i.e.\ which are never pointed to by {\tt -cr3}). Instead Xen propagates changes made to the guest's tables to the -real ones, and vice versa. This is useful for logging page writes -(e.g.\ for live migration or checkpoint). A full version of the shadow -page tables also allows guest OS porting with less effort. - -\section{Segment Descriptor Tables} - -On boot a guest is supplied with a default GDT, which does not reside -within its own memory allocation. If the guest wishes to use other -than the default `flat' ring-1 and ring-3 segments that this GDT -provides, it must register a custom GDT and/or LDT with Xen, -allocated from its own memory. Note that a number of GDT -entries are reserved by Xen -- any custom GDT must also include -sufficient space for these entries. - -For example, the following hypercall is used to specify a new GDT: - -\begin{quote} -int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em entries}) - -{\em frame\_list}: An array of up to 16 machine page frames within -which the GDT resides. Any frame registered as a GDT frame may only -be mapped read-only within the guest's address space (e.g., no -writable mappings, no use as a page-table page, and so on). - -{\em entries}: The number of descriptor-entry slots in the GDT. Note -that the table must be large enough to contain Xen's reserved entries; -thus we must have `{\em entries $>$ LAST\_RESERVED\_GDT\_ENTRY}\ '. -Note also that, after registering the GDT, slots {\em FIRST\_} through -{\em LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest and -may be overwritten by Xen. -\end{quote} - -The LDT is updated via the generic MMU update mechanism (i.e., via -the {\tt mmu\_update()} hypercall. 
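To make the calling convention above concrete, the following sketch registers a custom GDT from within a guest kernel. The {\tt hypercall2} wrapper and the operation number are placeholders assumed for this example; a real port issues {\tt int \$82} through its own architecture-specific wrappers, and the reserved-entry constants are those in {\tt xen/include/public/arch-x86\_32.h}.

\begin{verbatim}
/* Placeholder wrapper for the int $82 hypercall convention; the real
 * wrapper and operation number come from the guest port's headers. */
extern long hypercall2(unsigned int op, void *arg1, unsigned long arg2);
#define HYPERCALL_set_gdt 2               /* illustrative value only */

/* frame_list: up to 16 machine frame numbers backing the new GDT,
 *             each mapped read-only within the guest.
 * entries   : number of descriptor slots, which must be greater than
 *             LAST_RESERVED_GDT_ENTRY so Xen's entries still fit.   */
static int install_gdt(unsigned long *frame_list, int entries)
{
    return (int)hypercall2(HYPERCALL_set_gdt, frame_list, entries);
}
\end{verbatim}

After this call succeeds, the guest must treat the reserved descriptor slots as owned by Xen and may load its own selectors from the remaining entries.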
- -\section{Start of Day} - -The start-of-day environment for guest operating systems is rather -different to that provided by the underlying hardware. In particular, -the processor is already executing in protected mode with paging -enabled. - -{\it Domain 0} is created and booted by Xen itself. For all subsequent -domains, the analogue of the boot-loader is the {\it domain builder}, -user-space software running in {\it domain 0}. The domain builder -is responsible for building the initial page tables for a domain -and loading its kernel image at the appropriate virtual address. - - - -\chapter{Devices} -\label{c:devices} - -Devices such as network and disk are exported to guests using a -split device driver. The device driver domain, which accesses the -physical device directly also runs a {\em backend} driver, serving -requests to that device from guests. Each guest will use a simple -{\em frontend} driver, to access the backend. Communication between these -domains is composed of two parts: First, data is placed onto a shared -memory page between the domains. Second, an event channel between the -two domains is used to pass notification that data is outstanding. -This separation of notification from data transfer allows message -batching, and results in very efficient device access. - -Event channels are used extensively in device virtualization; each -domain has a number of end-points or \emph{ports} each of which -may be bound to one of the following \emph{event sources}: -\begin{itemize} - \item a physical interrupt from a real device, - \item a virtual interrupt (callback) from Xen, or - \item a signal from another domain -\end{itemize} - -Events are lightweight and do not carry much information beyond -the source of the notification. Hence when performing bulk data -transfer, events are typically used as synchronization primitives -over a shared memory transport. Event channels are managed via -the {\tt event\_channel\_op()} hypercall; for more details see -Section~\ref{s:idc}. - -This chapter focuses on some individual device interfaces -available to Xen guests. - -\section{Network I/O} - -Virtual network device services are provided by shared memory -communication with a backend domain. From the point of view of -other domains, the backend may be viewed as a virtual ethernet switch -element with each domain having one or more virtual network interfaces -connected to it. - -\subsection{Backend Packet Handling} - -The backend driver is responsible for a variety of actions relating to -the transmission and reception of packets from the physical device. -With regard to transmission, the backend performs these key actions: - -\begin{itemize} -\item {\bf Validation:} To ensure that domains do not attempt to - generate invalid (e.g. spoofed) traffic, the backend driver may - validate headers ensuring that source MAC and IP addresses match the - interface that they have been sent from. - - Validation functions can be configured using standard firewall rules - ({\small{\tt iptables}} in the case of Linux). - -\item {\bf Scheduling:} Since a number of domains can share a single - physical network interface, the backend must mediate access when - several domains each have packets queued for transmission. This - general scheduling function subsumes basic shaping or rate-limiting - schemes. - -\item {\bf Logging and Accounting:} The backend domain can be - configured with classifier rules that control how packets are - accounted or logged. 
For example, log messages might be generated - whenever a domain attempts to send a TCP packet containing a SYN. -\end{itemize} - -On receipt of incoming packets, the backend acts as a simple -demultiplexer: Packets are passed to the appropriate virtual -interface after any necessary logging and accounting have been carried -out. - -\subsection{Data Transfer} - -Each virtual interface uses two ``descriptor rings'', one for transmit, -the other for receive. Each descriptor identifies a block of contiguous -physical memory allocated to the domain. - -The transmit ring carries packets to transmit from the guest to the -backend domain. The return path of the transmit ring carries messages -indicating that the contents have been physically transmitted and the -backend no longer requires the associated pages of memory. - -To receive packets, the guest places descriptors of unused pages on -the receive ring. The backend will return received packets by -exchanging these pages in the domain's memory with new pages -containing the received data, and passing back descriptors regarding -the new packets on the ring. This zero-copy approach allows the -backend to maintain a pool of free pages to receive packets into, and -then deliver them to appropriate domains after examining their -headers. - -% -%Real physical addresses are used throughout, with the domain performing -%translation from pseudo-physical addresses if that is necessary. - -If a domain does not keep its receive ring stocked with empty buffers then -packets destined to it may be dropped. This provides some defence against -receive livelock problems because an overload domain will cease to receive -further data. Similarly, on the transmit path, it provides the application -with feedback on the rate at which packets are able to leave the system. - - -Flow control on rings is achieved by including a pair of producer -indexes on the shared ring page. Each side will maintain a private -consumer index indicating the next outstanding message. In this -manner, the domains cooperate to divide the ring into two message -lists, one in each direction. Notification is decoupled from the -immediate placement of new messages on the ring; the event channel -will be used to generate notification when {\em either} a certain -number of outstanding messages are queued, {\em or} a specified number -of nanoseconds have elapsed since the oldest message was placed on the -ring. - -% Not sure if my version is any better -- here is what was here before: -%% Synchronization between the backend domain and the guest is achieved using -%% counters held in shared memory that is accessible to both. Each ring has -%% associated producer and consumer indices indicating the area in the ring -%% that holds descriptors that contain data. After receiving {\it n} packets -%% or {\t nanoseconds} after receiving the first packet, the hypervisor sends -%% an event to the domain. - -\section{Block I/O} - -All guest OS disk access goes through the virtual block device VBD -interface. This interface allows domains access to portions of block -storage devices visible to the the block backend device. The VBD -interface is a split driver, similar to the network interface -described above. A single shared memory ring is used between the -frontend and backend drivers, across which read and write messages are -sent. - -Any block device accessible to the backend domain, including -network-based block (iSCSI, *NBD, etc), loopback and LVM/MD devices, -can be exported as a VBD. 
Each VBD is mapped to a device node in the -guest, specified in the guest's startup configuration. - -Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since -similar functionality can be achieved using the more complete LVM -system, which is already in widespread use. - -\subsection{Data Transfer} - -The single ring between the guest and the block backend supports three -messages: - -\begin{description} -\item [{\small {\tt PROBE}}:] Return a list of the VBDs available to this guest - from the backend. The request includes a descriptor of a free page - into which the reply will be written by the backend. - -\item [{\small {\tt READ}}:] Read data from the specified block device. The - front end identifies the device and location to read from and - attaches pages for the data to be copied to (typically via DMA from - the device). The backend acknowledges completed read requests as - they finish. - -\item [{\small {\tt WRITE}}:] Write data to the specified block device. This - functions essentially as {\small {\tt READ}}, except that the data moves to - the device instead of from it. -\end{description} - -% um... some old text -%% In overview, the same style of descriptor-ring that is used for -%% network packets is used here. Each domain has one ring that carries -%% operation requests to the hypervisor and carries the results back -%% again. - -%% Rather than copying data, the backend simply maps the domain's buffers -%% in order to enable direct DMA to them. The act of mapping the buffers -%% also increases the reference counts of the underlying pages, so that -%% the unprivileged domain cannot try to return them to the hypervisor, -%% install them as page tables, or any other unsafe behaviour. -%% %block API here - - -\chapter{Further Information} - - -If you have questions that are not answered by this manual, the -sources of information listed below may be of interest to you. Note -that bug reports, suggestions and contributions related to the -software (or the documentation) should be sent to the Xen developers' -mailing list (address below). - -\section{Other documentation} - -If you are mainly interested in using (rather than developing for) -Xen, the {\em Xen Users' Manual} is distributed in the {\tt docs/} -directory of the Xen source distribution. - -% Various HOWTOs are also available in {\tt docs/HOWTOS}. - -\section{Online references} - -The official Xen web site is found at: -\begin{quote} -{\tt http://www.cl.cam.ac.uk/Research/SRG/netos/xen/} -\end{quote} - -This contains links to the latest versions of all on-line -documentation. - -\section{Mailing lists} - -There are currently four official Xen mailing lists: - -\begin{description} -\item[xen-devel@xxxxxxxxxxxxxxxxxxx] Used for development -discussions and bug reports. Subscribe at: \\ -{\small {\tt http://lists.xensource.com/xen-devel}} -\item[xen-users@xxxxxxxxxxxxxxxxxxx] Used for installation and usage -discussions and requests for help. Subscribe at: \\ -{\small {\tt http://lists.xensource.com/xen-users}} -\item[xen-announce@xxxxxxxxxxxxxxxxxxx] Used for announcements only. -Subscribe at: \\ -{\small {\tt http://lists.xensource.com/xen-announce}} -\item[xen-changelog@xxxxxxxxxxxxxxxxxxx] Changelog feed -from the unstable and 2.0 trees - developer oriented. Subscribe at: \\ -{\small {\tt http://lists.xensource.com/xen-changelog}} -\end{description} - -Of these, xen-devel is the most active. 
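Referring back to the descriptor rings described in the Devices chapter above, the sketch below illustrates the producer/consumer flow control used on the network and block rings. The structure layout, ring size and request format are assumptions made purely for illustration; the real ring definitions live in the public I/O headers shipped with Xen.

\begin{verbatim}
/* Illustrative shared ring: one producer index per direction lives in
 * the shared page, while each side keeps a private consumer index. */
#define RING_SIZE 64                       /* entries (power of two)  */

typedef struct {
    unsigned int req_prod;                 /* written by the frontend */
    unsigned int resp_prod;                /* written by the backend  */
    struct { unsigned long id, buffer; } ring[RING_SIZE];
} shared_ring_t;

static unsigned int req_cons;              /* backend-private index   */

static void handle_request(unsigned long id, unsigned long buffer)
{
    /* Device-specific processing would go here. */
    (void)id; (void)buffer;
}

/* Backend side: drain outstanding requests.  Event-channel
 * notification is decoupled from placing entries on the ring, so a
 * single event may cover a whole batch of messages. */
static void drain_requests(shared_ring_t *r)
{
    while (req_cons != r->req_prod) {
        unsigned int idx = req_cons % RING_SIZE;
        handle_request(r->ring[idx].id, r->ring[idx].buffer);
        req_cons++;
    }
}
\end{verbatim}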
- - +%% chapter Further Information moved to further_info.tex +\include{src/interface/further_info} \appendix -%\newcommand{\hypercall}[1]{\vspace{5mm}{\large\sf #1}} - - - - - -\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}} - - - - - - -\chapter{Xen Hypercalls} -\label{a:hypercalls} - -Hypercalls represent the procedural interface to Xen; this appendix -categorizes and describes the current set of hypercalls. - -\section{Invoking Hypercalls} - -Hypercalls are invoked in a manner analogous to system calls in a -conventional operating system; a software interrupt is issued which -vectors to an entry point within Xen. On x86\_32 machines the -instruction required is {\tt int \$82}; the (real) IDT is setup so -that this may only be issued from within ring 1. The particular -hypercall to be invoked is contained in {\tt EAX} --- a list -mapping these values to symbolic hypercall names can be found -in {\tt xen/include/public/xen.h}. - -On some occasions a set of hypercalls will be required to carry -out a higher-level function; a good example is when a guest -operating wishes to context switch to a new process which -requires updating various privileged CPU state. As an optimization -for these cases, there is a generic mechanism to issue a set of -hypercalls as a batch: - -\begin{quote} -\hypercall{multicall(void *call\_list, int nr\_calls)} - -Execute a series of hypervisor calls; {\tt nr\_calls} is the length of -the array of {\tt multicall\_entry\_t} structures pointed to be {\tt -call\_list}. Each entry contains the hypercall operation code followed -by up to 7 word-sized arguments. -\end{quote} - -Note that multicalls are provided purely as an optimization; there is -no requirement to use them when first porting a guest operating -system. - - -\section{Virtual CPU Setup} - -At start of day, a guest operating system needs to setup the virtual -CPU it is executing on. This includes installing vectors for the -virtual IDT so that the guest OS can handle interrupts, page faults, -etc. However the very first thing a guest OS must setup is a pair -of hypervisor callbacks: these are the entry points which Xen will -use when it wishes to notify the guest OS of an occurrence. - -\begin{quote} -\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long - event\_address, unsigned long failsafe\_selector, unsigned long - failsafe\_address) } - -Register the normal (``event'') and failsafe callbacks for -event processing. In each case the code segment selector and -address within that segment are provided. The selectors must -have RPL 1; in XenLinux we simply use the kernel's CS for both -{\tt event\_selector} and {\tt failsafe\_selector}. - -The value {\tt event\_address} specifies the address of the guest OSes -event handling and dispatch routine; the {\tt failsafe\_address} -specifies a separate entry point which is used only if a fault occurs -when Xen attempts to use the normal callback. -\end{quote} - - -After installing the hypervisor callbacks, the guest OS can -install a `virtual IDT' by using the following hypercall: - -\begin{quote} -\hypercall{set\_trap\_table(trap\_info\_t *table)} - -Install one or more entries into the per-domain -trap handler table (essentially a software version of the IDT). -Each entry in the array pointed to by {\tt table} includes the -exception vector number with the corresponding segment selector -and entry point. 
Most guest OSes can use the same handlers on -Xen as when running on the real hardware; an exception is the -page fault handler (exception vector 14) where a modified -stack-frame layout is used. - - -\end{quote} - - - -\section{Scheduling and Timer} - -Domains are preemptively scheduled by Xen according to the -parameters installed by domain 0 (see Section~\ref{s:dom0ops}). -In addition, however, a domain may choose to explicitly -control certain behavior with the following hypercall: - -\begin{quote} -\hypercall{sched\_op(unsigned long op)} - -Request scheduling operation from hypervisor. The options are: {\it -yield}, {\it block}, and {\it shutdown}. {\it yield} keeps the -calling domain runnable but may cause a reschedule if other domains -are runnable. {\it block} removes the calling domain from the run -queue and cause is to sleeps until an event is delivered to it. {\it -shutdown} is used to end the domain's execution; the caller can -additionally specify whether the domain should reboot, halt or -suspend. -\end{quote} - -To aid the implementation of a process scheduler within a guest OS, -Xen provides a virtual programmable timer: - -\begin{quote} -\hypercall{set\_timer\_op(uint64\_t timeout)} - -Request a timer event to be sent at the specified system time (time -in nanoseconds since system boot). The hypercall actually passes the -64-bit timeout value as a pair of 32-bit values. - -\end{quote} - -Note that calling {\tt set\_timer\_op()} prior to {\tt sched\_op} -allows block-with-timeout semantics. - - -\section{Page Table Management} - -Since guest operating systems have read-only access to their page -tables, Xen must be involved when making any changes. The following -multi-purpose hypercall can be used to modify page-table entries, -update the machine-to-physical mapping table, flush the TLB, install -a new page-table base pointer, and more. - -\begin{quote} -\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)} - -Update the page table for the domain; a set of {\tt count} updates are -submitted for processing in a batch, with {\tt success\_count} being -updated to report the number of successful updates. - -Each element of {\tt req[]} contains a pointer (address) and value; -the least significant 2-bits of the pointer are used to distinguish -the type of update requested as follows: -\begin{description} - -\item[\it MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or -page table entry to the associated value; Xen will check that the -update is safe, as described in Chapter~\ref{c:memory}. - -\item[\it MMU\_MACHPHYS\_UPDATE:] update an entry in the - machine-to-physical table. The calling domain must own the machine - page in question (or be privileged). - -\item[\it MMU\_EXTENDED\_COMMAND:] perform additional MMU operations. -The set of additional MMU operations is considerable, and includes -updating {\tt cr3} (or just re-installing it for a TLB flush), -flushing the cache, installing a new LDT, or pinning \& unpinning -page-table pages (to ensure their reference count doesn't drop to zero -which would require a revalidation of all entries). - -Further extended commands are used to deal with granting and -acquiring page ownership; see Section~\ref{s:idc}. - - -\end{description} - -More details on the precise format of all commands can be -found in {\tt xen/include/public/xen.h}. - - -\end{quote} - -Explicitly updating batches of page table entries is extremely -efficient, but can require a number of alterations to the guest -OS. 
Using the writable page table mode (Chapter~\ref{c:memory}) is -recommended for new OS ports. - -Regardless of which page table update mode is being used, however, -there are some occasions (notably handling a demand page fault) where -a guest OS will wish to modify exactly one PTE rather than a -batch. This is catered for by the following: - -\begin{quote} -\hypercall{update\_va\_mapping(unsigned long page\_nr, unsigned long -val, \\ unsigned long flags)} - -Update the currently installed PTE for the page {\tt page\_nr} to -{\tt val}. As with {\tt mmu\_update()}, Xen checks the modification -is safe before applying it. The {\tt flags} determine which kind -of TLB flush, if any, should follow the update. - -\end{quote} - -Finally, sufficiently privileged domains may occasionally wish to manipulate -the pages of others: -\begin{quote} - -\hypercall{update\_va\_mapping\_otherdomain(unsigned long page\_nr, -unsigned long val, unsigned long flags, uint16\_t domid)} - -Identical to {\tt update\_va\_mapping()} save that the pages being -mapped must belong to the domain {\tt domid}. - -\end{quote} - -This privileged operation is currently used by backend virtual device -drivers to safely map pages containing I/O data. - - - -\section{Segmentation Support} - -Xen allows guest OSes to install a custom GDT if they require it; -this is context switched transparently whenever a domain is -[de]scheduled. The following hypercall is effectively a -`safe' version of {\tt lgdt}: - -\begin{quote} -\hypercall{set\_gdt(unsigned long *frame\_list, int entries)} - -Install a global descriptor table for a domain; {\tt frame\_list} is -an array of up to 16 machine page frames within which the GDT resides, -with {\tt entries} being the actual number of descriptor-entry -slots. All page frames must be mapped read-only within the guest's -address space, and the table must be large enough to contain Xen's -reserved entries (see {\tt xen/include/public/arch-x86\_32.h}). - -\end{quote} - -Many guest OSes will also wish to install LDTs; this is achieved by -using {\tt mmu\_update()} with an extended command, passing the -linear address of the LDT base along with the number of entries. No -special safety checks are required; Xen needs to perform this task -simply since {\tt lldt} requires CPL 0. - - -Xen also allows guest operating systems to update just an -individual segment descriptor in the GDT or LDT: - -\begin{quote} -\hypercall{update\_descriptor(unsigned long ma, unsigned long word1, -unsigned long word2)} - -Update the GDT/LDT entry at machine address {\tt ma}; the new -8-byte descriptor is stored in {\tt word1} and {\tt word2}. -Xen performs a number of checks to ensure the descriptor is -valid. - -\end{quote} - -Guest OSes can use the above in place of context switching entire -LDTs (or the GDT) when the number of changing descriptors is small. - -\section{Context Switching} - -When a guest OS wishes to context switch between two processes, -it can use the page table and segmentation hypercalls described -above to perform the the bulk of the privileged work. In addition, -however, it will need to invoke Xen to switch the kernel (ring 1) -stack pointer: - -\begin{quote} -\hypercall{stack\_switch(unsigned long ss, unsigned long esp)} - -Request kernel stack switch from hypervisor; {\tt ss} is the new -stack segment, which {\tt esp} is the new stack pointer. 
- -\end{quote} - -A final useful hypercall for context switching allows ``lazy'' -save and restore of floating point state: - -\begin{quote} -\hypercall{fpu\_taskswitch(void)} - -This call instructs Xen to set the {\tt TS} bit in the {\tt cr0} -control register; this means that the next attempt to use floating -point will cause a trap which the guest OS can trap. Typically it will -then save/restore the FP state, and clear the {\tt TS} bit. -\end{quote} - -This is provided as an optimization only; guest OSes can also choose -to save and restore FP state on all context switches for simplicity. - - -\section{Physical Memory Management} - -As mentioned previously, each domain has a maximum and current -memory allocation. The maximum allocation, set at domain creation -time, cannot be modified. However a domain can choose to reduce -and subsequently grow its current allocation by using the -following call: - -\begin{quote} -\hypercall{dom\_mem\_op(unsigned int op, unsigned long *extent\_list, - unsigned long nr\_extents, unsigned int extent\_order)} - -Increase or decrease current memory allocation (as determined by -the value of {\tt op}). Each invocation provides a list of -extents each of which is $2^s$ pages in size, -where $s$ is the value of {\tt extent\_order}. - -\end{quote} - -In addition to simply reducing or increasing the current memory -allocation via a `balloon driver', this call is also useful for -obtaining contiguous regions of machine memory when required (e.g. -for certain PCI devices, or if using superpages). - - -\section{Inter-Domain Communication} -\label{s:idc} - -Xen provides a simple asynchronous notification mechanism via -\emph{event channels}. Each domain has a set of end-points (or -\emph{ports}) which may be bound to an event source (e.g. a physical -IRQ, a virtual IRQ, or an port in another domain). When a pair of -end-points in two different domains are bound together, then a `send' -operation on one will cause an event to be received by the destination -domain. - -The control and use of event channels involves the following hypercall: - -\begin{quote} -\hypercall{event\_channel\_op(evtchn\_op\_t *op)} - -Inter-domain event-channel management; {\tt op} is a discriminated -union which allows the following 7 operations: - -\begin{description} - -\item[\it alloc\_unbound:] allocate a free (unbound) local - port and prepare for connection from a specified domain. -\item[\it bind\_virq:] bind a local port to a virtual -IRQ; any particular VIRQ can be bound to at most one port per domain. -\item[\it bind\_pirq:] bind a local port to a physical IRQ; -once more, a given pIRQ can be bound to at most one port per -domain. Furthermore the calling domain must be sufficiently -privileged. -\item[\it bind\_interdomain:] construct an interdomain event -channel; in general, the target domain must have previously allocated -an unbound port for this channel, although this can be bypassed by -privileged domains during domain setup. -\item[\it close:] close an interdomain event channel. -\item[\it send:] send an event to the remote end of a -interdomain event channel. -\item[\it status:] determine the current status of a local port. -\end{description} - -For more details see -{\tt xen/include/public/event\_channel.h}. - -\end{quote} - -Event channels are the fundamental communication primitive between -Xen domains and seamlessly support SMP. 
However they provide little -bandwidth for communication {\sl per se}, and hence are typically -married with a piece of shared memory to produce effective and -high-performance inter-domain communication. - -Safe sharing of memory pages between guest OSes is carried out by -granting access on a per page basis to individual domains. This is -achieved by using the {\tt grant\_table\_op()} hypercall. - -\begin{quote} -\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)} - -Grant or remove access to a particular page to a particular domain. - -\end{quote} - -This is not currently widely in use by guest operating systems, but -we intend to integrate support more fully in the near future. - -\section{PCI Configuration} - -Domains with physical device access (i.e.\ driver domains) receive -limited access to certain PCI devices (bus address space and -interrupts). However many guest operating systems attempt to -determine the PCI configuration by directly access the PCI BIOS, -which cannot be allowed for safety. - -Instead, Xen provides the following hypercall: - -\begin{quote} -\hypercall{physdev\_op(void *physdev\_op)} - -Perform a PCI configuration option; depending on the value -of {\tt physdev\_op} this can be a PCI config read, a PCI config -write, or a small number of other queries. - -\end{quote} - - -For examples of using {\tt physdev\_op()}, see the -Xen-specific PCI code in the linux sparse tree. - -\section{Administrative Operations} -\label{s:dom0ops} - -A large number of control operations are available to a sufficiently -privileged domain (typically domain 0). These allow the creation and -management of new domains, for example. A complete list is given -below: for more details on any or all of these, please see -{\tt xen/include/public/dom0\_ops.h} - - -\begin{quote} -\hypercall{dom0\_op(dom0\_op\_t *op)} - -Administrative domain operations for domain management. The options are: - -\begin{description} -\item [\it DOM0\_CREATEDOMAIN:] create a new domain - -\item [\it DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run -queue. - -\item [\it DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable - once again. 
- -\item [\it DOM0\_DESTROYDOMAIN:] deallocate all resources associated -with a domain - -\item [\it DOM0\_GETMEMLIST:] get list of pages used by the domain - -\item [\it DOM0\_SCHEDCTL:] - -\item [\it DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain - -\item [\it DOM0\_BUILDDOMAIN:] do final guest OS setup for domain - -\item [\it DOM0\_GETDOMAINFO:] get statistics about the domain - -\item [\it DOM0\_GETPAGEFRAMEINFO:] - -\item [\it DOM0\_GETPAGEFRAMEINFO2:] - -\item [\it DOM0\_IOPL:] set I/O privilege level - -\item [\it DOM0\_MSR:] read or write model specific registers - -\item [\it DOM0\_DEBUG:] interactively invoke the debugger - -\item [\it DOM0\_SETTIME:] set system time - -\item [\it DOM0\_READCONSOLE:] read console content from hypervisor buffer ring - -\item [\it DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU - -\item [\it DOM0\_GETTBUFS:] get information about the size and location of - the trace buffers (only on trace-buffer enabled builds) - -\item [\it DOM0\_PHYSINFO:] get information about the host machine - -\item [\it DOM0\_PCIDEV\_ACCESS:] modify PCI device access permissions - -\item [\it DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler - -\item [\it DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes - -\item [\it DOM0\_SETDOMAININITIALMEM:] set initial memory allocation of a domain - -\item [\it DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain - -\item [\it DOM0\_SETDOMAINVMASSIST:] set domain VM assist options -\end{description} -\end{quote} - -Most of the above are best understood by looking at the code -implementing them (in {\tt xen/common/dom0\_ops.c}) and in -the user-space tools that use them (mostly in {\tt tools/libxc}). - -\section{Debugging Hypercalls} - -A few additional hypercalls are mainly useful for debugging: - -\begin{quote} -\hypercall{console\_io(int cmd, int count, char *str)} - -Use Xen to interact with the console; operations are: - -{\it CONSOLEIO\_write}: Output count characters from buffer str. - -{\it CONSOLEIO\_read}: Input at most count characters into buffer str. -\end{quote} - -A pair of hypercalls allows access to the underlying debug registers: -\begin{quote} -\hypercall{set\_debugreg(int reg, unsigned long value)} - -Set debug register {\tt reg} to {\tt value} - -\hypercall{get\_debugreg(int reg)} - -Return the contents of the debug register {\tt reg} -\end{quote} - -And finally: -\begin{quote} -\hypercall{xen\_version(int cmd)} - -Request Xen version number. -\end{quote} - -This is useful to ensure that user-space tools are in sync -with the underlying hypervisor. - -\section{Deprecated Hypercalls} - -Xen is under constant development and refinement; as such there -are plans to improve the way in which various pieces of functionality -are exposed to guest OSes. - -\begin{quote} -\hypercall{vm\_assist(unsigned int cmd, unsigned int type)} - -Toggle various memory management modes (in particular wrritable page -tables and superpage support). - -\end{quote} - -This is likely to be replaced with mode values in the shared -information page since this is more resilient for resumption -after migration or checkpoint. - - - - - - +%% chapter hypercalls moved to hypercalls.tex +\include{src/interface/hypercalls} %% @@ -1173,279 +112,9 @@ %% new scheduler... not clear how many of them there are... %% -\begin{comment} - -\chapter{Scheduling API} - -The scheduling API is used by both the schedulers described above and should -also be used by any new schedulers. 
It provides a generic interface and also -implements much of the ``boilerplate'' code. - -Schedulers conforming to this API are described by the following -structure: - -\begin{verbatim} -struct scheduler -{ - char *name; /* full name for this scheduler */ - char *opt_name; /* option name for this scheduler */ - unsigned int sched_id; /* ID for this scheduler */ - - int (*init_scheduler) (); - int (*alloc_task) (struct task_struct *); - void (*add_task) (struct task_struct *); - void (*free_task) (struct task_struct *); - void (*rem_task) (struct task_struct *); - void (*wake_up) (struct task_struct *); - void (*do_block) (struct task_struct *); - task_slice_t (*do_schedule) (s_time_t); - int (*control) (struct sched_ctl_cmd *); - int (*adjdom) (struct task_struct *, - struct sched_adjdom_cmd *); - s32 (*reschedule) (struct task_struct *); - void (*dump_settings) (void); - void (*dump_cpu_state) (int); - void (*dump_runq_el) (struct task_struct *); -}; -\end{verbatim} - -The only method that {\em must} be implemented is -{\tt do\_schedule()}. However, if there is not some implementation for the -{\tt wake\_up()} method then waking tasks will not get put on the runqueue! - -The fields of the above structure are described in more detail below. - -\subsubsection{name} - -The name field should point to a descriptive ASCII string. - -\subsubsection{opt\_name} - -This field is the value of the {\tt sched=} boot-time option that will select -this scheduler. - -\subsubsection{sched\_id} - -This is an integer that uniquely identifies this scheduler. There should be a -macro corrsponding to this scheduler ID in {\tt <xen/sched-if.h>}. - -\subsubsection{init\_scheduler} - -\paragraph*{Purpose} - -This is a function for performing any scheduler-specific initialisation. For -instance, it might allocate memory for per-CPU scheduler data and initialise it -appropriately. - -\paragraph*{Call environment} - -This function is called after the initialisation performed by the generic -layer. The function is called exactly once, for the scheduler that has been -selected. - -\paragraph*{Return values} - -This should return negative on failure --- this will cause an -immediate panic and the system will fail to boot. - -\subsubsection{alloc\_task} - -\paragraph*{Purpose} -Called when a {\tt task\_struct} is allocated by the generic scheduler -layer. A particular scheduler implementation may use this method to -allocate per-task data for this task. It may use the {\tt -sched\_priv} pointer in the {\tt task\_struct} to point to this data. - -\paragraph*{Call environment} -The generic layer guarantees that the {\tt sched\_priv} field will -remain intact from the time this method is called until the task is -deallocated (so long as the scheduler implementation does not change -it explicitly!). - -\paragraph*{Return values} -Negative on failure. - -\subsubsection{add\_task} - -\paragraph*{Purpose} - -Called when a task is initially added by the generic layer. - -\paragraph*{Call environment} - -The fields in the {\tt task\_struct} are now filled out and available for use. -Schedulers should implement appropriate initialisation of any per-task private -information in this method. - -\subsubsection{free\_task} - -\paragraph*{Purpose} - -Schedulers should free the space used by any associated private data -structures. - -\paragraph*{Call environment} - -This is called when a {\tt task\_struct} is about to be deallocated. 
-The generic layer will have done generic task removal operations and -(if implemented) called the scheduler's {\tt rem\_task} method before -this method is called. - -\subsubsection{rem\_task} - -\paragraph*{Purpose} - -This is called when a task is being removed from scheduling (but is -not yet being freed). - -\subsubsection{wake\_up} - -\paragraph*{Purpose} - -Called when a task is woken up, this method should put the task on the runqueue -(or do the scheduler-specific equivalent action). - -\paragraph*{Call environment} - -The task is already set to state RUNNING. - -\subsubsection{do\_block} - -\paragraph*{Purpose} - -This function is called when a task is blocked. This function should -not remove the task from the runqueue. - -\paragraph*{Call environment} - -The EVENTS\_MASTER\_ENABLE\_BIT is already set and the task state changed to -TASK\_INTERRUPTIBLE on entry to this method. A call to the {\tt - do\_schedule} method will be made after this method returns, in -order to select the next task to run. - -\subsubsection{do\_schedule} - -This method must be implemented. - -\paragraph*{Purpose} - -The method is called each time a new task must be chosen for scheduling on the -current CPU. The current time as passed as the single argument (the current -task can be found using the {\tt current} macro). - -This method should select the next task to run on this CPU and set it's minimum -time to run as well as returning the data described below. - -This method should also take the appropriate action if the previous -task has blocked, e.g. removing it from the runqueue. - -\paragraph*{Call environment} - -The other fields in the {\tt task\_struct} are updated by the generic layer, -which also performs all Xen-specific tasks and performs the actual task switch -(unless the previous task has been chosen again). - -This method is called with the {\tt schedule\_lock} held for the current CPU -and local interrupts disabled. - -\paragraph*{Return values} - -Must return a {\tt struct task\_slice} describing what task to run and how long -for (at maximum). - -\subsubsection{control} - -\paragraph*{Purpose} - -This method is called for global scheduler control operations. It takes a -pointer to a {\tt struct sched\_ctl\_cmd}, which it should either -source data from or populate with data, depending on the value of the -{\tt direction} field. - -\paragraph*{Call environment} - -The generic layer guarantees that when this method is called, the -caller selected the correct scheduler ID, hence the scheduler's -implementation does not need to sanity-check these parts of the call. - -\paragraph*{Return values} - -This function should return the value to be passed back to user space, hence it -should either be 0 or an appropriate errno value. - -\subsubsection{sched\_adjdom} - -\paragraph*{Purpose} - -This method is called to adjust the scheduling parameters of a particular -domain, or to query their current values. The function should check -the {\tt direction} field of the {\tt sched\_adjdom\_cmd} it receives in -order to determine which of these operations is being performed. - -\paragraph*{Call environment} - -The generic layer guarantees that the caller has specified the correct -control interface version and scheduler ID and that the supplied {\tt -task\_struct} will not be deallocated during the call (hence it is not -necessary to {\tt get\_task\_struct}). 
- -\paragraph*{Return values} - -This function should return the value to be passed back to user space, hence it -should either be 0 or an appropriate errno value. - -\subsubsection{reschedule} - -\paragraph*{Purpose} - -This method is called to determine if a reschedule is required as a result of a -particular task. - -\paragraph*{Call environment} -The generic layer will cause a reschedule if the current domain is the idle -task or it has exceeded its minimum time slice before a reschedule. The -generic layer guarantees that the task passed is not currently running but is -on the runqueue. - -\paragraph*{Return values} - -Should return a mask of CPUs to cause a reschedule on. - -\subsubsection{dump\_settings} - -\paragraph*{Purpose} - -If implemented, this should dump any private global settings for this -scheduler to the console. - -\paragraph*{Call environment} - -This function is called with interrupts enabled. - -\subsubsection{dump\_cpu\_state} - -\paragraph*{Purpose} - -This method should dump any private settings for the specified CPU. - -\paragraph*{Call environment} - -This function is called with interrupts disabled and the {\tt schedule\_lock} -for the specified CPU held. - -\subsubsection{dump\_runq\_el} - -\paragraph*{Purpose} - -This method should dump any private settings for the specified task. - -\paragraph*{Call environment} - -This function is called with interrupts disabled and the {\tt schedule\_lock} -for the task's CPU held. - -\end{comment} - +%% \include{src/interface/scheduling} +%% scheduling information moved to scheduling.tex +%% still commented out @@ -1457,74 +126,9 @@ %% (and/or kip's stuff?) and write about that instead? %% -\begin{comment} - -\chapter{Debugging} - -Xen provides tools for debugging both Xen and guest OSes. Currently, the -Pervasive Debugger provides a GDB stub, which provides facilities for symbolic -debugging of Xen itself and of OS kernels running on top of Xen. The Trace -Buffer provides a lightweight means to log data about Xen's internal state and -behaviour at runtime, for later analysis. - -\section{Pervasive Debugger} - -Information on using the pervasive debugger is available in pdb.txt. - - -\section{Trace Buffer} - -The trace buffer provides a means to observe Xen's operation from domain 0. -Trace events, inserted at key points in Xen's code, record data that can be -read by the {\tt xentrace} tool. Recording these events has a low overhead -and hence the trace buffer may be useful for debugging timing-sensitive -behaviours. - -\subsection{Internal API} - -To use the trace buffer functionality from within Xen, you must {\tt \#include -<xen/trace.h>}, which contains definitions related to the trace buffer. Trace -events are inserted into the buffer using the {\tt TRACE\_xD} ({\tt x} = 0, 1, -2, 3, 4 or 5) macros. These all take an event number, plus {\tt x} additional -(32-bit) data as their arguments. For trace buffer-enabled builds of Xen these -will insert the event ID and data into the trace buffer, along with the current -value of the CPU cycle-counter. For builds without the trace buffer enabled, -the macros expand to no-ops and thus can be left in place without incurring -overheads. - -\subsection{Trace-enabled builds} - -By default, the trace buffer is enabled only in debug builds (i.e. {\tt NDEBUG} -is not defined). It can be enabled separately by defining {\tt TRACE\_BUFFER}, -either in {\tt <xen/config.h>} or on the gcc command line. 
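As a concrete use of the {\tt TRACE\_xD} macros described in the Internal API section above, a trace point might be inserted as follows. The event identifier is invented for this example; real event numbers follow the conventions defined alongside the macros in {\tt <xen/trace.h>}.

\begin{verbatim}
#include <xen/trace.h>

#define TRC_EXAMPLE_SCHED 0x0001   /* hypothetical event ID */

void example_trace_point(unsigned int domid, unsigned int cpu)
{
    /* In trace-enabled builds this records the event ID, both data
     * words and the current cycle counter; otherwise it is a no-op. */
    TRACE_2D(TRC_EXAMPLE_SCHED, domid, cpu);
}
\end{verbatim}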
- -The size (in pages) of the per-CPU trace buffers can be specified using the -{\tt tbuf\_size=n } boot parameter to Xen. If the size is set to 0, the trace -buffers will be disabled. - -\subsection{Dumping trace data} - -When running a trace buffer build of Xen, trace data are written continuously -into the buffer data areas, with newer data overwriting older data. This data -can be captured using the {\tt xentrace} program in domain 0. - -The {\tt xentrace} tool uses {\tt /dev/mem} in domain 0 to map the trace -buffers into its address space. It then periodically polls all the buffers for -new data, dumping out any new records from each buffer in turn. As a result, -for machines with multiple (logical) CPUs, the trace buffer output will not be -in overall chronological order. - -The output from {\tt xentrace} can be post-processed using {\tt -xentrace\_cpusplit} (used to split trace data out into per-cpu log files) and -{\tt xentrace\_format} (used to pretty-print trace data). For the predefined -trace points, there is an example format file in {\tt tools/xentrace/formats }. - -For more information, see the manual pages for {\tt xentrace}, {\tt -xentrace\_format} and {\tt xentrace\_cpusplit}. - -\end{comment} - - +%% \include{src/interface/debugging} +%% debugging information moved to debugging.tex +%% still commented out \end{document} diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user.tex --- a/docs/src/user.tex Tue Sep 20 09:08:26 2005 +++ b/docs/src/user.tex Tue Sep 20 09:17:33 2005 @@ -59,1803 +59,36 @@ \renewcommand{\floatpagefraction}{.8} \setstretch{1.1} + \part{Introduction and Tutorial} -\chapter{Introduction} - -Xen is a {\em paravirtualising} virtual machine monitor (VMM), or -`hypervisor', for the x86 processor architecture. Xen can securely -execute multiple virtual machines on a single physical system with -close-to-native performance. The virtual machine technology -facilitates enterprise-grade functionality, including: - -\begin{itemize} -\item Virtual machines with performance close to native - hardware. -\item Live migration of running virtual machines between physical hosts. -\item Excellent hardware support (supports most Linux device drivers). -\item Sandboxed, restartable device drivers. -\end{itemize} - -Paravirtualisation permits very high performance virtualisation, -even on architectures like x86 that are traditionally -very hard to virtualise. -The drawback of this approach is that it requires operating systems to -be {\em ported} to run on Xen. Porting an OS to run on Xen is similar -to supporting a new hardware platform, however the process -is simplified because the paravirtual machine architecture is very -similar to the underlying native hardware. Even though operating system -kernels must explicitly support Xen, a key feature is that user space -applications and libraries {\em do not} require modification. - -Xen support is available for increasingly many operating systems: -right now, Linux 2.4, Linux 2.6 and NetBSD are available for Xen 2.0. -A FreeBSD port is undergoing testing and will be incorporated into the -release soon. Other OS ports, including Plan 9, are in progress. We -hope that that arch-xen patches will be incorporated into the -mainstream releases of these operating systems in due course (as has -already happened for NetBSD). - -Possible usage scenarios for Xen include: -\begin{description} -\item [Kernel development.] Test and debug kernel modifications in a - sandboxed virtual machine --- no need for a separate test - machine. 
-\item [Multiple OS configurations.] Run multiple operating systems - simultaneously, for instance for compatibility or QA purposes. -\item [Server consolidation.] Move multiple servers onto a single - physical host with performance and fault isolation provided at - virtual machine boundaries. -\item [Cluster computing.] Management at VM granularity provides more - flexibility than separately managing each physical host, but - better control and isolation than single-system image solutions, - particularly by using live migration for load balancing. -\item [Hardware support for custom OSes.] Allow development of new OSes - while benefiting from the wide-ranging hardware support of - existing OSes such as Linux. -\end{description} - -\section{Structure of a Xen-Based System} - -A Xen system has multiple layers, the lowest and most privileged of -which is Xen itself. -Xen in turn may host multiple {\em guest} operating systems, each of -which is executed within a secure virtual machine (in Xen terminology, -a {\em domain}). Domains are scheduled by Xen to make effective use of -the available physical CPUs. Each guest OS manages its own -applications, which includes responsibility for scheduling each -application within the time allotted to the VM by Xen. - -The first domain, {\em domain 0}, is created automatically when the -system boots and has special management privileges. Domain 0 builds -other domains and manages their virtual devices. It also performs -administrative tasks such as suspending, resuming and migrating other -virtual machines. - -Within domain 0, a process called \emph{xend} runs to manage the system. -\Xend is responsible for managing virtual machines and providing access -to their consoles. Commands are issued to \xend over an HTTP -interface, either from a command-line tool or from a web browser. - -\section{Hardware Support} - -Xen currently runs only on the x86 architecture, requiring a `P6' or -newer processor (e.g. Pentium Pro, Celeron, Pentium II, Pentium III, -Pentium IV, Xeon, AMD Athlon, AMD Duron). Multiprocessor machines are -supported, and we also have basic support for HyperThreading (SMT), -although this remains a topic for ongoing research. A port -specifically for x86/64 is in progress, although Xen already runs on -such systems in 32-bit legacy mode. In addition a port to the IA64 -architecture is approaching completion. We hope to add other -architectures such as PPC and ARM in due course. - - -Xen can currently use up to 4GB of memory. It is possible for x86 -machines to address up to 64GB of physical memory but there are no -current plans to support these systems: The x86/64 port is the -planned route to supporting larger memory sizes. - -Xen offloads most of the hardware support issues to the guest OS -running in Domain~0. Xen itself contains only the code required to -detect and start secondary processors, set up interrupt routing, and -perform PCI bus enumeration. Device drivers run within a privileged -guest OS rather than within Xen itself. This approach provides -compatibility with the majority of device hardware supported by Linux. -The default XenLinux build contains support for relatively modern -server-class network and disk hardware, but you can add support for -other hardware by configuring your XenLinux kernel in the normal way. - -\section{History} - -Xen was originally developed by the Systems Research Group at the -University of Cambridge Computer Laboratory as part of the XenoServers -project, funded by the UK-EPSRC. 
-XenoServers aim to provide a `public infrastructure for -global distributed computing', and Xen plays a key part in that, -allowing us to efficiently partition a single machine to enable -multiple independent clients to run their operating systems and -applications in an environment providing protection, resource -isolation and accounting. The project web page contains further -information along with pointers to papers and technical reports: -\path{http://www.cl.cam.ac.uk/xeno} - -Xen has since grown into a fully-fledged project in its own right, -enabling us to investigate interesting research issues regarding the -best techniques for virtualising resources such as the CPU, memory, -disk and network. The project has been bolstered by support from -Intel Research Cambridge, and HP Labs, who are now working closely -with us. - -Xen was first described in a paper presented at SOSP in -2003\footnote{\tt -http://www.cl.cam.ac.uk/netos/papers/2003-xensosp.pdf}, and the first -public release (1.0) was made that October. Since then, Xen has -significantly matured and is now used in production scenarios on -many sites. - -Xen 2.0 features greatly enhanced hardware support, configuration -flexibility, usability and a larger complement of supported operating -systems. This latest release takes Xen a step closer to becoming the -definitive open source solution for virtualisation. - -\chapter{Installation} - -The Xen distribution includes three main components: Xen itself, ports -of Linux 2.4 and 2.6 and NetBSD to run on Xen, and the user-space -tools required to manage a Xen-based system. This chapter describes -how to install the Xen 2.0 distribution from source. Alternatively, -there may be pre-built packages available as part of your operating -system distribution. - -\section{Prerequisites} -\label{sec:prerequisites} - -The following is a full list of prerequisites. Items marked `$\dag$' -are required by the \xend control tools, and hence required if you -want to run more than one virtual machine; items marked `$*$' are only -required if you wish to build from source. -\begin{itemize} -\item A working Linux distribution using the GRUB bootloader and -running on a P6-class (or newer) CPU. -\item [$\dag$] The \path{iproute2} package. -\item [$\dag$] The Linux bridge-utils\footnote{Available from -{\tt http://bridge.sourceforge.net}} (e.g., \path{/sbin/brctl}) -\item [$\dag$] An installation of Twisted v1.3 or -above\footnote{Available from {\tt -http://www.twistedmatrix.com}}. There may be a binary package -available for your distribution; alternatively it can be installed by -running `{\sl make install-twisted}' in the root of the Xen source -tree. -\item [$*$] Build tools (gcc v3.2.x or v3.3.x, binutils, GNU make). -\item [$*$] Development installation of libcurl (e.g., libcurl-devel) -\item [$*$] Development installation of zlib (e.g., zlib-dev). -\item [$*$] Development installation of Python v2.2 or later (e.g., python-dev). -\item [$*$] \LaTeX and transfig are required to build the documentation. -\end{itemize} - -Once you have satisfied the relevant prerequisites, you can -now install either a binary or source distribution of Xen. 
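As a purely illustrative sketch (the package names here are assumptions and
vary between distributions and releases), the run-time prerequisites marked
`$\dag$' above might be installed on a Debian-based system with:
\begin{quote}
\begin{verbatim}
# apt-get install iproute bridge-utils python-twisted
\end{verbatim}
\end{quote}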
- -\section{Installing from Binary Tarball} - -Pre-built tarballs are available for download from the Xen -download page -\begin{quote} -{\tt http://xen.sf.net} -\end{quote} - -Once you've downloaded the tarball, simply unpack and install: -\begin{verbatim} -# tar zxvf xen-2.0-install.tgz -# cd xen-2.0-install -# sh ./install.sh -\end{verbatim} - -Once you've installed the binaries you need to configure -your system as described in Section~\ref{s:configure}. - -\section{Installing from Source} - -This section describes how to obtain, build, and install -Xen from source. - -\subsection{Obtaining the Source} - -The Xen source tree is available as either a compressed source tar -ball or as a clone of our master BitKeeper repository. - -\begin{description} -\item[Obtaining the Source Tarball]\mbox{} \\ -Stable versions (and daily snapshots) of the Xen source tree are -available as compressed tarballs from the Xen download page -\begin{quote} -{\tt http://xen.sf.net} -\end{quote} - -\item[Using BitKeeper]\mbox{} \\ -If you wish to install Xen from a clone of our latest BitKeeper -repository then you will need to install the BitKeeper tools. -Download instructions for BitKeeper can be obtained by filling out the -form at: - -\begin{quote} -{\tt http://www.bitmover.com/cgi-bin/download.cgi} -\end{quote} -The public master BK repository for the 2.0 release lives at: -\begin{quote} -{\tt bk://xen.bkbits.net/xen-2.0.bk} -\end{quote} -You can use BitKeeper to -download it and keep it updated with the latest features and fixes. - -Change to the directory in which you want to put the source code, then -run: -\begin{verbatim} -# bk clone bk://xen.bkbits.net/xen-2.0.bk -\end{verbatim} - -Under your current directory, a new directory named \path{xen-2.0.bk} -has been created, which contains all the source code for Xen, the OS -ports, and the control tools. You can update your repository with the -latest changes at any time by running: -\begin{verbatim} -# cd xen-2.0.bk # to change into the local repository -# bk pull # to update the repository -\end{verbatim} -\end{description} - -%\section{The distribution} -% -%The Xen source code repository is structured as follows: -% -%\begin{description} -%\item[\path{tools/}] Xen node controller daemon (Xend), command line tools, -% control libraries -%\item[\path{xen/}] The Xen VMM. -%\item[\path{linux-*-xen-sparse/}] Xen support for Linux. -%\item[\path{linux-*-patches/}] Experimental patches for Linux. -%\item[\path{netbsd-*-xen-sparse/}] Xen support for NetBSD. -%\item[\path{docs/}] Various documentation files for users and developers. -%\item[\path{extras/}] Bonus extras. -%\end{description} - -\subsection{Building from Source} - -The top-level Xen Makefile includes a target `world' that will do the -following: - -\begin{itemize} -\item Build Xen -\item Build the control tools, including \xend -\item Download (if necessary) and unpack the Linux 2.6 source code, - and patch it for use with Xen -\item Build a Linux kernel to use in domain 0 and a smaller - unprivileged kernel, which can optionally be used for - unprivileged virtual machines. -\end{itemize} - - -After the build has completed you should have a top-level -directory called \path{dist/} in which all resulting targets -will be placed; of particular interest are the two kernels -XenLinux kernel images, one with a `-xen0' extension -which contains hardware device drivers and drivers for Xen's virtual -devices, and one with a `-xenU' extension that just contains the -virtual ones. 
These are found in \path{dist/install/boot/} along -with the image for Xen itself and the configuration files used -during the build. - -The NetBSD port can be built using: -\begin{quote} -\begin{verbatim} -# make netbsd20 -\end{verbatim} -\end{quote} -NetBSD port is built using a snapshot of the netbsd-2-0 cvs branch. -The snapshot is downloaded as part of the build process, if it is not -yet present in the \path{NETBSD\_SRC\_PATH} search path. The build -process also downloads a toolchain which includes all the tools -necessary to build the NetBSD kernel under Linux. - -To customize further the set of kernels built you need to edit -the top-level Makefile. Look for the line: - -\begin{quote} -\begin{verbatim} -KERNELS ?= mk.linux-2.6-xen0 mk.linux-2.6-xenU -\end{verbatim} -\end{quote} - -You can edit this line to include any set of operating system kernels -which have configurations in the top-level \path{buildconfigs/} -directory, for example \path{mk.linux-2.4-xenU} to build a Linux 2.4 -kernel containing only virtual device drivers. - -%% Inspect the Makefile if you want to see what goes on during a build. -%% Building Xen and the tools is straightforward, but XenLinux is more -%% complicated. The makefile needs a `pristine' Linux kernel tree to which -%% it will then add the Xen architecture files. You can tell the -%% makefile the location of the appropriate Linux compressed tar file by -%% setting the LINUX\_SRC environment variable, e.g. \\ -%% \verb!# LINUX_SRC=/tmp/linux-2.6.11.tar.bz2 make world! \\ or by -%% placing the tar file somewhere in the search path of {\tt -%% LINUX\_SRC\_PATH} which defaults to `{\tt .:..}'. If the makefile -%% can't find a suitable kernel tar file it attempts to download it from -%% kernel.org (this won't work if you're behind a firewall). - -%% After untaring the pristine kernel tree, the makefile uses the {\tt -%% mkbuildtree} script to add the Xen patches to the kernel. - - -%% The procedure is similar to build the Linux 2.4 port: \\ -%% \verb!# LINUX_SRC=/path/to/linux2.4/source make linux24! - - -%% \framebox{\parbox{5in}{ -%% {\bf Distro specific:} \\ -%% {\it Gentoo} --- if not using udev (most installations, currently), you'll need -%% to enable devfs and devfs mount at boot time in the xen0 config. -%% }} - -\subsection{Custom XenLinux Builds} - -% If you have an SMP machine you may wish to give the {\tt '-j4'} -% argument to make to get a parallel build. - -If you wish to build a customized XenLinux kernel (e.g. to support -additional devices or enable distribution-required features), you can -use the standard Linux configuration mechanisms, specifying that the -architecture being built for is \path{xen}, e.g: -\begin{quote} -\begin{verbatim} -# cd linux-2.6.11-xen0 -# make ARCH=xen xconfig -# cd .. -# make -\end{verbatim} -\end{quote} - -You can also copy an existing Linux configuration (\path{.config}) -into \path{linux-2.6.11-xen0} and execute: -\begin{quote} -\begin{verbatim} -# make ARCH=xen oldconfig -\end{verbatim} -\end{quote} - -You may be prompted with some Xen-specific options; we -advise accepting the defaults for these options. - -Note that the only difference between the two types of Linux kernel -that are built is the configuration file used for each. The "U" -suffixed (unprivileged) versions don't contain any of the physical -hardware device drivers, leading to a 30\% reduction in size; hence -you may prefer these for your non-privileged domains. 
The `0'
-suffixed privileged versions can be used to boot the system, as well
-as in driver domains and unprivileged domains.
-
-
-\subsection{Installing the Binaries}
-
-
-The files produced by the build process are stored under the
-\path{dist/install/} directory. To install them in their default
-locations, do:
-\begin{quote}
-\begin{verbatim}
-# make install
-\end{verbatim}
-\end{quote}
-
-
-Alternatively, users with special installation requirements may wish
-to install them manually by copying the files to their appropriate
-destinations.
-
-%% Files in \path{install/boot/} include:
-%% \begin{itemize}
-%% \item \path{install/boot/xen-2.0.gz} Link to the Xen 'kernel'
-%% \item \path{install/boot/vmlinuz-2.6-xen0} Link to domain 0 XenLinux kernel
-%% \item \path{install/boot/vmlinuz-2.6-xenU} Link to unprivileged XenLinux kernel
-%% \end{itemize}
-
-The \path{dist/install/boot} directory will also contain the config files
-used for building the XenLinux kernels, and also versions of Xen and
-XenLinux kernels that contain debug symbols (\path{xen-syms-2.0.6} and
-\path{vmlinux-syms-2.6.11.11-xen0}) which are essential for interpreting crash
-dumps. Retain these files as the developers may wish to see them if
-you post on the mailing list.
-
-
-
-
-
-\section{Configuration}
-\label{s:configure}
-Once you have built and installed the Xen distribution, it is
-simple to prepare the machine for booting and running Xen.
-
-\subsection{GRUB Configuration}
-
-An entry should be added to \path{grub.conf} (often found under
-\path{/boot/} or \path{/boot/grub/}) to allow Xen / XenLinux to boot.
-This file is sometimes called \path{menu.lst}, depending on your
-distribution. The entry should look something like the following:
-
-{\small
-\begin{verbatim}
-title Xen 2.0 / XenLinux 2.6
- kernel /boot/xen-2.0.gz dom0_mem=131072
- module /boot/vmlinuz-2.6-xen0 root=/dev/sda4 ro console=tty0
-\end{verbatim}
-}
-
-The kernel line tells GRUB where to find Xen itself and what boot
-parameters should be passed to it (in this case, setting domain 0's
-memory allocation in kilobytes). For more
-details on the various Xen boot parameters see Section~\ref{s:xboot}.
-
-The module line of the configuration describes the location of the
-XenLinux kernel that Xen should start and the parameters that should
-be passed to it (these are standard Linux parameters, identifying the
-root device and specifying it be initially mounted read only and
-instructing that console output be sent to the screen). Some
-distributions such as SuSE do not require the \path{ro} parameter.
-
-%% \framebox{\parbox{5in}{
-%% {\bf Distro specific:} \\
-%% {\it SuSE} --- Omit the {\tt ro} option from the XenLinux kernel
-%% command line, since the partition won't be remounted rw during boot.
-%% }}
-
-
-If you want to use an initrd, just add another \path{module} line to
-the configuration, as usual:
-{\small
-\begin{verbatim}
- module /boot/my_initrd.gz
-\end{verbatim}
-}
-
-As always when installing a new kernel, it is recommended that you do
-not delete existing menu options from \path{menu.lst} --- you may want
-to boot your old Linux kernel in future, particularly if you
-have problems.
-
-
-\subsection{Serial Console (optional)}
-
-%% kernel /boot/xen-2.0.gz dom0_mem=131072 com1=115200,8n1
-%% module /boot/vmlinuz-2.6-xen0 root=/dev/sda4 ro
-
-
-In order to configure Xen serial console output, it is necessary to add
-a boot option to your GRUB config; e.g.
replace the above kernel line -with: -\begin{quote} -{\small -\begin{verbatim} - kernel /boot/xen.gz dom0_mem=131072 com1=115200,8n1 -\end{verbatim}} -\end{quote} - -This configures Xen to output on COM1 at 115,200 baud, 8 data bits, -1 stop bit and no parity. Modify these parameters for your set up. - -One can also configure XenLinux to share the serial console; to -achieve this append ``\path{console=ttyS0}'' to your -module line. - - -If you wish to be able to log in over the XenLinux serial console it -is necessary to add a line into \path{/etc/inittab}, just as per -regular Linux. Simply add the line: -\begin{quote} -{\small -{\tt c:2345:respawn:/sbin/mingetty ttyS0} -} -\end{quote} - -and you should be able to log in. Note that to successfully log in -as root over the serial line will require adding \path{ttyS0} to -\path{/etc/securetty} in most modern distributions. - -\subsection{TLS Libraries} - -Users of the XenLinux 2.6 kernel should disable Thread Local Storage -(e.g.\ by doing a \path{mv /lib/tls /lib/tls.disabled}) before -attempting to run with a XenLinux kernel\footnote{If you boot without first -disabling TLS, you will get a warning message during the boot -process. In this case, simply perform the rename after the machine is -up and then run \texttt{/sbin/ldconfig} to make it take effect.}. You can -always reenable it by restoring the directory to its original location -(i.e.\ \path{mv /lib/tls.disabled /lib/tls}). - -The reason for this is that the current TLS implementation uses -segmentation in a way that is not permissible under Xen. If TLS is -not disabled, an emulation mode is used within Xen which reduces -performance substantially. - -We hope that this issue can be resolved by working with Linux -distribution vendors to implement a minor backward-compatible change -to the TLS library. - -\section{Booting Xen} - -It should now be possible to restart the system and use Xen. Reboot -as usual but choose the new Xen option when the Grub screen appears. - -What follows should look much like a conventional Linux boot. The -first portion of the output comes from Xen itself, supplying low level -information about itself and the machine it is running on. The -following portion of the output comes from XenLinux. - -You may see some errors during the XenLinux boot. These are not -necessarily anything to worry about --- they may result from kernel -configuration differences between your XenLinux kernel and the one you -usually use. - -When the boot completes, you should be able to log into your system as -usual. If you are unable to log in to your system running Xen, you -should still be able to reboot with your normal Linux kernel. - - -\chapter{Starting Additional Domains} - -The first step in creating a new domain is to prepare a root -filesystem for it to boot off. Typically, this might be stored in a -normal partition, an LVM or other volume manager partition, a disk -file or on an NFS server. A simple way to do this is simply to boot -from your standard OS install CD and install the distribution into -another partition on your hard drive. - -To start the \xend control daemon, type -\begin{quote} -\verb!# xend start! -\end{quote} -If you -wish the daemon to start automatically, see the instructions in -Section~\ref{s:xend}. Once the daemon is running, you can use the -\path{xm} tool to monitor and maintain the domains running on your -system. This chapter provides only a brief tutorial: we provide full -details of the \path{xm} tool in the next chapter. 
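As a quick sanity check that the daemon is up before creating any new guests,
you can list the running domains; on a freshly booted system only domain 0
should appear (the output below is illustrative and abridged; times and
memory sizes will differ):
\begin{quote}
\begin{verbatim}
# xm list
Name       Id  Mem(MB)  CPU  State  Time(s)  Console
Domain-0    0      251    0  r----     20.1
\end{verbatim}
\end{quote}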
- -%\section{From the web interface} -% -%Boot the Xen machine and start Xensv (see Chapter~\ref{cha:xensv} for -%more details) using the command: \\ -%\verb_# xensv start_ \\ -%This will also start Xend (see Chapter~\ref{cha:xend} for more information). -% -%The domain management interface will then be available at {\tt -%http://your\_machine:8080/}. This provides a user friendly wizard for -%starting domains and functions for managing running domains. -% -%\section{From the command line} - - -\section{Creating a Domain Configuration File} - -Before you can start an additional domain, you must create a -configuration file. We provide two example files which you -can use as a starting point: -\begin{itemize} - \item \path{/etc/xen/xmexample1} is a simple template configuration file - for describing a single VM. - - \item \path{/etc/xen/xmexample2} file is a template description that - is intended to be reused for multiple virtual machines. Setting - the value of the \path{vmid} variable on the \path{xm} command line - fills in parts of this template. -\end{itemize} - -Copy one of these files and edit it as appropriate. -Typical values you may wish to edit include: - -\begin{quote} -\begin{description} -\item[kernel] Set this to the path of the kernel you compiled for use - with Xen (e.g.\ \path{kernel = '/boot/vmlinuz-2.6-xenU'}) -\item[memory] Set this to the size of the domain's memory in -megabytes (e.g.\ \path{memory = 64}) -\item[disk] Set the first entry in this list to calculate the offset -of the domain's root partition, based on the domain ID. Set the -second to the location of \path{/usr} if you are sharing it between -domains (e.g.\ \path{disk = ['phy:your\_hard\_drive\%d,sda1,w' \% -(base\_partition\_number + vmid), 'phy:your\_usr\_partition,sda6,r' ]} -\item[dhcp] Uncomment the dhcp variable, so that the domain will -receive its IP address from a DHCP server (e.g.\ \path{dhcp='dhcp'}) -\end{description} -\end{quote} - -You may also want to edit the {\bf vif} variable in order to choose -the MAC address of the virtual ethernet interface yourself. For -example: -\begin{quote} -\verb_vif = ['mac=00:06:AA:F6:BB:B3']_ -\end{quote} -If you do not set this variable, \xend will automatically generate a -random MAC address from an unused range. - - -\section{Booting the Domain} - -The \path{xm} tool provides a variety of commands for managing domains. -Use the \path{create} command to start new domains. Assuming you've -created a configuration file \path{myvmconf} based around -\path{/etc/xen/xmexample2}, to start a domain with virtual -machine ID~1 you should type: - -\begin{quote} -\begin{verbatim} -# xm create -c myvmconf vmid=1 -\end{verbatim} -\end{quote} - - -The \path{-c} switch causes \path{xm} to turn into the domain's -console after creation. The \path{vmid=1} sets the \path{vmid} -variable used in the \path{myvmconf} file. - - -You should see the console boot messages from the new domain -appearing in the terminal in which you typed the command, -culminating in a login prompt. - - -\section{Example: ttylinux} - -Ttylinux is a very small Linux distribution, designed to require very -few resources. We will use it as a concrete example of how to start a -Xen domain. Most users will probably want to install a full-featured -distribution once they have mastered the basics\footnote{ttylinux is -maintained by Pascal Schmidt. You can download source packages from -the distribution's home page: {\tt http://www.minimalinux.org/ttylinux/}}. 
-
-\begin{enumerate}
-\item Download and extract the ttylinux disk image from the Files
-section of the project's SourceForge site (see
-\path{http://sf.net/projects/xen/}).
-\item Create a configuration file like the following:
-\begin{verbatim}
-kernel = "/boot/vmlinuz-2.6-xenU"
-memory = 64
-name = "ttylinux"
-nics = 1
-ip = "1.2.3.4"
-disk = ['file:/path/to/ttylinux/rootfs,sda1,w']
-root = "/dev/sda1 ro"
-\end{verbatim}
-\item Now start the domain and connect to its console:
-\begin{verbatim}
-xm create configfile -c
-\end{verbatim}
-\item Log in as root, password root.
-\end{enumerate}
-
-
-\section{Starting / Stopping Domains Automatically}
-
-It is possible to have certain domains start automatically at boot
-time and to have dom0 wait for all running domains to shut down before
-it shuts down the system.
-
-To specify that a domain should start at boot time, place its
-configuration file (or a link to it) under \path{/etc/xen/auto/}.
-
-A Sys-V style init script for RedHat and LSB-compliant systems is
-provided and will be automatically copied to \path{/etc/init.d/}
-during install. You can then enable it in the appropriate way for
-your distribution.
-
-For instance, on RedHat:
-
-\begin{quote}
-\verb_# chkconfig --add xendomains_
-\end{quote}
-
-By default, this will start the boot-time domains in runlevels 3, 4
-and 5.
-
-You can also use the \path{service} command to run this script
-manually, e.g.:
-
-\begin{quote}
-\verb_# service xendomains start_
-
-Starts all the domains with config files under /etc/xen/auto/.
-\end{quote}
-
-
-\begin{quote}
-\verb_# service xendomains stop_
-
-Shuts down ALL running Xen domains.
-\end{quote}
-
-\chapter{Domain Management Tools}
-
-The previous chapter described a simple example of how to configure
-and start a domain. This chapter summarises the tools available to
-manage running domains.
-
-\section{Command-line Management}
-
-Command line management tasks are also performed using the \path{xm}
-tool. For online help on the available commands, type:
-\begin{quote}
-\verb_# xm help_
-\end{quote}
-
-You can also type \path{xm help $<$command$>$} for more information
-on a given command.
-
-\subsection{Basic Management Commands}
-
-The most important \path{xm} commands are:
-\begin{quote}
-\verb_# xm list_: Lists all domains running.\\
-\verb_# xm consoles_: Gives information about the domain consoles.\\
-\verb_# xm console_: Opens a console to a domain (e.g.\
- \verb_# xm console myVM_)
-\end{quote}
-
-\subsection{\tt xm list}
-
-The output of \path{xm list} is in rows of the following format:
-\begin{center}
-{\tt name domid memory cpu state cputime console}
-\end{center}
-
-\begin{quote}
-\begin{description}
-\item[name] The descriptive name of the virtual machine.
-\item[domid] The number of the domain ID this virtual machine is running in.
-\item[memory] Memory size in megabytes.
-\item[cpu] The CPU this domain is running on.
-\item[state] Domain state consists of 5 fields:
- \begin{description}
- \item[r] running
- \item[b] blocked
- \item[p] paused
- \item[s] shutdown
- \item[c] crashed
- \end{description}
-\item[cputime] How much CPU time (in seconds) the domain has used so far.
-\item[console] TCP port accepting connections to the domain's console.
-\end{description}
-\end{quote}
-
-The \path{xm list} command also supports a long output format when the
-\path{-l} switch is used. This outputs the full details of the
-running domains in \xend's SXP configuration format.
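For instance, to capture this SXP description to a file for later inspection
(the file name here is arbitrary):
\begin{quote}
\begin{verbatim}
# xm list -l > /tmp/domains.sxp
\end{verbatim}
\end{quote}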
-
-For example, suppose the system is running the ttylinux domain as
-described earlier. The list command should produce output somewhat
-like the following:
-\begin{verbatim}
-# xm list
-Name Id Mem(MB) CPU State Time(s) Console
-Domain-0 0 251 0 r---- 172.2
-ttylinux 5 63 0 -b--- 3.0 9605
-\end{verbatim}
-
-Here we can see the details for the ttylinux domain, as well as for
-domain 0 (which, of course, is always running). Note that the console
-port for the ttylinux domain is 9605. You can connect to this port
-over TCP using a terminal program (e.g. \path{telnet} or, better,
-\path{xencons}). The simplest way to connect is to use the \path{xm console}
-command, specifying the domain name or ID. To connect to the console
-of the ttylinux domain, we could use any of the following:
-\begin{verbatim}
-# xm console ttylinux
-# xm console 5
-# xencons localhost 9605
-\end{verbatim}
-
-\section{Domain Save and Restore}
-
-The administrator of a Xen system may suspend a virtual machine's
-current state into a disk file in domain 0, allowing it to be resumed
-at a later time.
-
-The ttylinux domain described earlier can be suspended to disk using
-the command:
-\begin{verbatim}
-# xm save ttylinux ttylinux.xen
-\end{verbatim}
-
-This will stop the domain named `ttylinux' and save its current state
-into a file called \path{ttylinux.xen}.
-
-To resume execution of this domain, use the \path{xm restore} command:
-\begin{verbatim}
-# xm restore ttylinux.xen
-\end{verbatim}
-
-This will restore the state of the domain and restart it. The domain
-will carry on as before and the console may be reconnected using the
-\path{xm console} command, as above.
-
-\section{Live Migration}
-
-Live migration is used to transfer a domain between physical hosts
-whilst that domain continues to perform its usual activities --- from
-the user's perspective, the migration should be imperceptible.
-
-To perform a live migration, both hosts must be running Xen / \xend and
-the destination host must have sufficient resources (e.g. memory
-capacity) to accommodate the domain after the move. Furthermore we
-currently require both source and destination machines to be on the
-same L2 subnet.
-
-Currently, there is no support for providing automatic remote access
-to filesystems stored on local disk when a domain is migrated.
-Administrators should choose an appropriate storage solution
-(i.e. SAN, NAS, etc.) to ensure that domain filesystems are also
-available on their destination node. GNBD is a good method for
-exporting a volume from one machine to another. iSCSI can do a similar
-job, but is more complex to set up.
-
-When a domain migrates, its MAC and IP address move with it; thus it
-is only possible to migrate VMs within the same layer-2 network and IP
-subnet. If the destination node is on a different subnet, the
-administrator would need to manually configure a suitable etherip or
-IP tunnel in the domain 0 of the remote node.
-
-A domain may be migrated using the \path{xm migrate} command. To
-live migrate a domain to another machine, we would use
-the command:
-
-\begin{verbatim}
-# xm migrate --live mydomain destination.ournetwork.com
-\end{verbatim}
-
-Without the \path{--live} flag, \xend simply stops the domain and
-copies the memory image over to the new node and restarts it. Since
-domains can have large allocations this can be quite time consuming,
-even on a Gigabit network.
With the \path{--live} flag \xend attempts -to keep the domain running while the migration is in progress, -resulting in typical `downtimes' of just 60--300ms. - -For now it will be necessary to reconnect to the domain's console on -the new machine using the \path{xm console} command. If a migrated -domain has any open network connections then they will be preserved, -so SSH connections do not have this limitation. - -\section{Managing Domain Memory} - -XenLinux domains have the ability to relinquish / reclaim machine -memory at the request of the administrator or the user of the domain. - -\subsection{Setting memory footprints from dom0} - -The machine administrator can request that a domain alter its memory -footprint using the \path{xm set-mem} command. For instance, we can -request that our example ttylinux domain reduce its memory footprint -to 32 megabytes. - -\begin{verbatim} -# xm set-mem ttylinux 32 -\end{verbatim} - -We can now see the result of this in the output of \path{xm list}: - -\begin{verbatim} -# xm list -Name Id Mem(MB) CPU State Time(s) Console -Domain-0 0 251 0 r---- 172.2 -ttylinux 5 31 0 -b--- 4.3 9605 -\end{verbatim} - -The domain has responded to the request by returning memory to Xen. We -can restore the domain to its original size using the command line: - -\begin{verbatim} -# xm set-mem ttylinux 64 -\end{verbatim} - -\subsection{Setting memory footprints from within a domain} - -The virtual file \path{/proc/xen/balloon} allows the owner of a -domain to adjust their own memory footprint. Reading the file -(e.g. \path{cat /proc/xen/balloon}) prints out the current -memory footprint of the domain. Writing the file -(e.g. \path{echo new\_target > /proc/xen/balloon}) requests -that the kernel adjust the domain's memory footprint to a new value. - -\subsection{Setting memory limits} - -Xen associates a memory size limit with each domain. By default, this -is the amount of memory the domain is originally started with, -preventing the domain from ever growing beyond this size. To permit a -domain to grow beyond its original allocation or to prevent a domain -you've shrunk from reclaiming the memory it relinquished, use the -\path{xm maxmem} command. - -\chapter{Domain Filesystem Storage} - -It is possible to directly export any Linux block device in dom0 to -another domain, or to export filesystems / devices to virtual machines -using standard network protocols (e.g. NBD, iSCSI, NFS, etc). This -chapter covers some of the possibilities. - - -\section{Exporting Physical Devices as VBDs} -\label{s:exporting-physical-devices-as-vbds} - -One of the simplest configurations is to directly export -individual partitions from domain 0 to other domains. To -achieve this use the \path{phy:} specifier in your domain -configuration file. For example a line like -\begin{quote} -\verb_disk = ['phy:hda3,sda1,w']_ -\end{quote} -specifies that the partition \path{/dev/hda3} in domain 0 -should be exported read-write to the new domain as \path{/dev/sda1}; -one could equally well export it as \path{/dev/hda} or -\path{/dev/sdb5} should one wish. - -In addition to local disks and partitions, it is possible to export -any device that Linux considers to be ``a disk'' in the same manner. -For example, if you have iSCSI disks or GNBD volumes imported into -domain 0 you can export these to other domains using the \path{phy:} -disk syntax. 
E.g.: -\begin{quote} -\verb_disk = ['phy:vg/lvm1,sda2,w']_ -\end{quote} - - - -\begin{center} -\framebox{\bf Warning: Block device sharing} -\end{center} -\begin{quote} -Block devices should typically only be shared between domains in a -read-only fashion otherwise the Linux kernel's file systems will get -very confused as the file system structure may change underneath them -(having the same ext3 partition mounted rw twice is a sure fire way to -cause irreparable damage)! \Xend will attempt to prevent you from -doing this by checking that the device is not mounted read-write in -domain 0, and hasn't already been exported read-write to another -domain. -If you want read-write sharing, export the directory to other domains -via NFS from domain0 (or use a cluster file system such as GFS or -ocfs2). - -\end{quote} - - -\section{Using File-backed VBDs} - -It is also possible to use a file in Domain 0 as the primary storage -for a virtual machine. As well as being convenient, this also has the -advantage that the virtual block device will be {\em sparse} --- space -will only really be allocated as parts of the file are used. So if a -virtual machine uses only half of its disk space then the file really -takes up half of the size allocated. - -For example, to create a 2GB sparse file-backed virtual block device -(actually only consumes 1KB of disk): -\begin{quote} -\verb_# dd if=/dev/zero of=vm1disk bs=1k seek=2048k count=1_ -\end{quote} - -Make a file system in the disk file: -\begin{quote} -\verb_# mkfs -t ext3 vm1disk_ -\end{quote} - -(when the tool asks for confirmation, answer `y') - -Populate the file system e.g. by copying from the current root: -\begin{quote} -\begin{verbatim} -# mount -o loop vm1disk /mnt -# cp -ax /{root,dev,var,etc,usr,bin,sbin,lib} /mnt -# mkdir /mnt/{proc,sys,home,tmp} -\end{verbatim} -\end{quote} - -Tailor the file system by editing \path{/etc/fstab}, -\path{/etc/hostname}, etc (don't forget to edit the files in the -mounted file system, instead of your domain 0 filesystem, e.g. you -would edit \path{/mnt/etc/fstab} instead of \path{/etc/fstab} ). For -this example put \path{/dev/sda1} to root in fstab. - -Now unmount (this is important!): -\begin{quote} -\verb_# umount /mnt_ -\end{quote} - -In the configuration file set: -\begin{quote} -\verb_disk = ['file:/full/path/to/vm1disk,sda1,w']_ -\end{quote} - -As the virtual machine writes to its `disk', the sparse file will be -filled in and consume more space up to the original 2GB. - -{\bf Note that file-backed VBDs may not be appropriate for backing -I/O-intensive domains.} File-backed VBDs are known to experience -substantial slowdowns under heavy I/O workloads, due to the I/O handling -by the loopback block device used to support file-backed VBDs in dom0. -Better I/O performance can be achieved by using either LVM-backed VBDs -(Section~\ref{s:using-lvm-backed-vbds}) or physical devices as VBDs -(Section~\ref{s:exporting-physical-devices-as-vbds}). - -Linux supports a maximum of eight file-backed VBDs across all domains by -default. This limit can be statically increased by using the {\em -max\_loop} module parameter if CONFIG\_BLK\_DEV\_LOOP is compiled as a -module in the dom0 kernel, or by using the {\em max\_loop=n} boot option -if CONFIG\_BLK\_DEV\_LOOP is compiled directly into the dom0 kernel. 
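As a concrete sketch of the two cases just described (the value 64 is
arbitrary): if the loopback driver is built as a module, load it with a
larger limit; if it is built into the dom0 kernel, append the option to the
XenLinux module line in \path{grub.conf}:
\begin{quote}
\begin{verbatim}
# modprobe loop max_loop=64
\end{verbatim}
\end{quote}
or
\begin{quote}
\begin{verbatim}
 module /boot/vmlinuz-2.6-xen0 root=/dev/sda4 ro max_loop=64
\end{verbatim}
\end{quote}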
-
-
-\section{Using LVM-backed VBDs}
-\label{s:using-lvm-backed-vbds}
-
-A particularly appealing solution is to use LVM volumes
-as backing for domain file-systems since this allows dynamic
-growing/shrinking of volumes as well as snapshot and other
-features.
-
-To initialise a partition to support LVM volumes:
-\begin{quote}
-\begin{verbatim}
-# pvcreate /dev/sda10
-\end{verbatim}
-\end{quote}
-
-Create a volume group named `vg' on the physical partition:
-\begin{quote}
-\begin{verbatim}
-# vgcreate vg /dev/sda10
-\end{verbatim}
-\end{quote}
-
-Create a logical volume of size 4GB named `myvmdisk1':
-\begin{quote}
-\begin{verbatim}
-# lvcreate -L4096M -n myvmdisk1 vg
-\end{verbatim}
-\end{quote}
-
-You should now see that you have a \path{/dev/vg/myvmdisk1} device.
-Make a filesystem, mount it and populate it, e.g.:
-\begin{quote}
-\begin{verbatim}
-# mkfs -t ext3 /dev/vg/myvmdisk1
-# mount /dev/vg/myvmdisk1 /mnt
-# cp -ax / /mnt
-# umount /mnt
-\end{verbatim}
-\end{quote}
-
-Now configure your VM with the following disk configuration:
-\begin{quote}
-\begin{verbatim}
- disk = [ 'phy:vg/myvmdisk1,sda1,w' ]
-\end{verbatim}
-\end{quote}
-
-LVM enables you to grow the size of logical volumes, but you'll need
-to resize the corresponding file system to make use of the new
-space. Some file systems (e.g. ext3) now support on-line resize. See
-the LVM manuals for more details.
-
-You can also use LVM for creating copy-on-write clones of LVM
-volumes (known as writable persistent snapshots in LVM
-terminology). This facility is new in Linux 2.6.8, so isn't as
-stable as one might hope. In particular, using lots of CoW LVM
-disks consumes a lot of dom0 memory, and error conditions such as
-running out of disk space are not handled well. Hopefully this
-will improve in future.
-
-To create two copy-on-write clones of the above file system you
-would use the following commands:
-
-\begin{quote}
-\begin{verbatim}
-# lvcreate -s -L1024M -n myclonedisk1 /dev/vg/myvmdisk1
-# lvcreate -s -L1024M -n myclonedisk2 /dev/vg/myvmdisk1
-\end{verbatim}
-\end{quote}
-
-Each of these can grow to have 1GB of differences from the master
-volume. You can grow the amount of space for storing the
-differences using the lvextend command, e.g.:
-\begin{quote}
-\begin{verbatim}
-# lvextend -L+100M /dev/vg/myclonedisk1
-\end{verbatim}
-\end{quote}
-
-Don't let the `differences volume' ever fill up, otherwise LVM gets
-rather confused. It may be possible to automate the growing
-process by using \path{dmsetup wait} to spot the volume getting full
-and then issuing an \path{lvextend}.
-
-In principle, it is possible to continue writing to the volume
-that has been cloned (the changes will not be visible to the
-clones), but we wouldn't recommend this: have the cloned volume
-as a `pristine' file system install that isn't mounted directly
-by any of the virtual machines.
-
-
-\section{Using NFS Root}
-
-First, populate a root filesystem in a directory on the server
-machine. This can be on a distinct physical machine, or simply
-run within a virtual machine on the same node.
-
-Now configure the NFS server to export this filesystem over the
-network by adding a line to \path{/etc/exports}, for instance:
-
-\begin{quote}
-\begin{small}
-\begin{verbatim}
-/export/vm1root 1.2.3.4/24(rw,sync,no_root_squash)
-\end{verbatim}
-\end{small}
-\end{quote}
-
-Finally, configure the domain to use NFS root. In addition to the
-normal variables, you should make sure to set the following values in
-the domain's configuration file:
-
-\begin{quote}
-\begin{small}
-\begin{verbatim}
-root = '/dev/nfs'
-nfs_server = '2.3.4.5' # substitute IP address of server
-nfs_root = '/path/to/root' # path to root FS on the server
-\end{verbatim}
-\end{small}
-\end{quote}
-
-The domain will need network access at boot time, so either statically
-configure an IP address (using the config variables \path{ip},
-\path{netmask}, \path{gateway}, \path{hostname}) or enable DHCP
-(\path{dhcp='dhcp'}).
-
-Note that the Linux NFS root implementation is known to have stability
-problems under high load (this is not a Xen-specific problem), so this
-configuration may not be appropriate for critical servers.
+
+%% Chapter Introduction moved to introduction.tex
+\include{src/user/introduction}
+
+%% Chapter Installation moved to installation.tex
+\include{src/user/installation}
+
+%% Chapter Starting Additional Domains moved to start_addl_dom.tex
+\include{src/user/start_addl_dom}
+
+%% Chapter Domain Management Tools moved to domain_mgmt.tex
+\include{src/user/domain_mgmt}
+
+%% Chapter Domain Filesystem Storage moved to domain_filesystem.tex
+\include{src/user/domain_filesystem}
+
 \part{User Reference Documentation}
-\chapter{Control Software}
-
-The Xen control software includes the \xend node control daemon (which
-must be running), the xm command line tools, and the prototype
-xensv web interface.
-
-\section{\Xend (node control daemon)}
-\label{s:xend}
-
-The Xen Daemon (\Xend) performs system management functions related to
-virtual machines. It forms a central point of control for a machine
-and can be controlled using an HTTP-based protocol. \Xend must be
-running in order to start and manage virtual machines.
-
-\Xend must be run as root because it needs access to privileged system
-management functions. A small set of commands may be issued on the
-\xend command line:
-
-\begin{tabular}{ll}
-\verb!# xend start! & start \xend, if not already running \\
-\verb!# xend stop! & stop \xend if already running \\
-\verb!# xend restart! & restart \xend if running, otherwise start it \\
-% \verb!# xend trace_start! & start \xend, with very detailed debug logging \\
-\verb!# xend status! & indicates \xend status by its return code
-\end{tabular}
-
-A SysV init script called {\tt xend} is provided to start \xend at boot
-time. {\tt make install} installs this script in \path{/etc/init.d}.
-To enable it, you have to make symbolic links in the appropriate
-runlevel directories or use the {\tt chkconfig} tool, where available.
-
-Once \xend is running, more sophisticated administration can be done
-using the xm tool (see Section~\ref{s:xm}) and the experimental
-Xensv web interface (see Section~\ref{s:xensv}).
-
-As \xend runs, events will be logged to \path{/var/log/xend.log} and,
-if the migration assistant daemon (\path{xfrd}) has been started,
-\path{/var/log/xfrd.log}. These may be of use for troubleshooting
-problems.
-
-\section{Xm (command line interface)}
-\label{s:xm}
-
-The xm tool is the primary tool for managing Xen from the console.
-The general format of an xm command line is:
-
-\begin{verbatim}
-# xm command [switches] [arguments] [variables]
-\end{verbatim}
-
-The available {\em switches} and {\em arguments} are dependent on the
-{\em command} chosen.
The {\em variables} may be set using -declarations of the form {\tt variable=value} and command line -declarations override any of the values in the configuration file -being used, including the standard variables described above and any -custom variables (for instance, the \path{xmdefconfig} file uses a -{\tt vmid} variable). - -The available commands are as follows: - -\begin{description} -\item[set-mem] Request a domain to adjust its memory footprint. -\item[create] Create a new domain. -\item[destroy] Kill a domain immediately. -\item[list] List running domains. -\item[shutdown] Ask a domain to shutdown. -\item[dmesg] Fetch the Xen (not Linux!) boot output. -\item[consoles] Lists the available consoles. -\item[console] Connect to the console for a domain. -\item[help] Get help on xm commands. -\item[save] Suspend a domain to disk. -\item[restore] Restore a domain from disk. -\item[pause] Pause a domain's execution. -\item[unpause] Unpause a domain. -\item[pincpu] Pin a domain to a CPU. -\item[bvt] Set BVT scheduler parameters for a domain. -\item[bvt\_ctxallow] Set the BVT context switching allowance for the system. -\item[atropos] Set the atropos parameters for a domain. -\item[rrobin] Set the round robin time slice for the system. -\item[info] Get information about the Xen host. -\item[call] Call a \xend HTTP API function directly. -\end{description} - -For a detailed overview of switches, arguments and variables to each command -try -\begin{quote} -\begin{verbatim} -# xm help command -\end{verbatim} -\end{quote} - -\section{Xensv (web control interface)} -\label{s:xensv} - -Xensv is the experimental web control interface for managing a Xen -machine. It can be used to perform some (but not yet all) of the -management tasks that can be done using the xm tool. - -It can be started using: -\begin{quote} -\verb_# xensv start_ -\end{quote} -and stopped using: -\begin{quote} -\verb_# xensv stop_ -\end{quote} - -By default, Xensv will serve out the web interface on port 8080. This -can be changed by editing -\path{/usr/lib/python2.3/site-packages/xen/sv/params.py}. - -Once Xensv is running, the web interface can be used to create and -manage running domains. - - - - -\chapter{Domain Configuration} -\label{cha:config} - -The following contains the syntax of the domain configuration -files and description of how to further specify networking, -driver domain and general scheduling behaviour. - -\section{Configuration Files} -\label{s:cfiles} - -Xen configuration files contain the following standard variables. -Unless otherwise stated, configuration items should be enclosed in -quotes: see \path{/etc/xen/xmexample1} and \path{/etc/xen/xmexample2} -for concrete examples of the syntax. - -\begin{description} -\item[kernel] Path to the kernel image -\item[ramdisk] Path to a ramdisk image (optional). -% \item[builder] The name of the domain build function (e.g. {\tt'linux'} or {\tt'netbsd'}. -\item[memory] Memory size in megabytes. -\item[cpu] CPU to run this domain on, or {\tt -1} for - auto-allocation. -\item[console] Port to export the domain console on (default 9600 + domain ID). -\item[nics] Number of virtual network interfaces. -\item[vif] List of MAC addresses (random addresses are assigned if not - given) and bridges to use for the domain's network interfaces, e.g. 
-\begin{verbatim} -vif = [ 'mac=aa:00:00:00:00:11, bridge=xen-br0', - 'bridge=xen-br1' ] -\end{verbatim} - to assign a MAC address and bridge to the first interface and assign - a different bridge to the second interface, leaving \xend to choose - the MAC address. -\item[disk] List of block devices to export to the domain, e.g. \\ - \verb_disk = [ 'phy:hda1,sda1,r' ]_ \\ - exports physical device \path{/dev/hda1} to the domain - as \path{/dev/sda1} with read-only access. Exporting a disk read-write - which is currently mounted is dangerous -- if you are \emph{certain} - you wish to do this, you can specify \path{w!} as the mode. -\item[dhcp] Set to {\tt 'dhcp'} if you want to use DHCP to configure - networking. -\item[netmask] Manually configured IP netmask. -\item[gateway] Manually configured IP gateway. -\item[hostname] Set the hostname for the virtual machine. -\item[root] Specify the root device parameter on the kernel command - line. -\item[nfs\_server] IP address for the NFS server (if any). -\item[nfs\_root] Path of the root filesystem on the NFS server (if any). -\item[extra] Extra string to append to the kernel command line (if - any) -\item[restart] Three possible options: - \begin{description} - \item[always] Always restart the domain, no matter what - its exit code is. - \item[never] Never restart the domain. - \item[onreboot] Restart the domain iff it requests reboot. - \end{description} -\end{description} - -For additional flexibility, it is also possible to include Python -scripting commands in configuration files. An example of this is the -\path{xmexample2} file, which uses Python code to handle the -\path{vmid} variable. - - -%\part{Advanced Topics} - -\section{Network Configuration} - -For many users, the default installation should work `out of the box'. -More complicated network setups, for instance with multiple ethernet -interfaces and/or existing bridging setups will require some -special configuration. - -The purpose of this section is to describe the mechanisms provided by -\xend to allow a flexible configuration for Xen's virtual networking. - -\subsection{Xen virtual network topology} - -Each domain network interface is connected to a virtual network -interface in dom0 by a point to point link (effectively a `virtual -crossover cable'). These devices are named {\tt -vif$<$domid$>$.$<$vifid$>$} (e.g. {\tt vif1.0} for the first interface -in domain 1, {\tt vif3.1} for the second interface in domain 3). - -Traffic on these virtual interfaces is handled in domain 0 using -standard Linux mechanisms for bridging, routing, rate limiting, etc. -Xend calls on two shell scripts to perform initial configuration of -the network and configuration of new virtual interfaces. By default, -these scripts configure a single bridge for all the virtual -interfaces. Arbitrary routing / bridging configurations can be -configured by customising the scripts, as described in the following -section. - -\subsection{Xen networking scripts} - -Xen's virtual networking is configured by two shell scripts (by -default \path{network} and \path{vif-bridge}). These are -called automatically by \xend when certain events occur, with -arguments to the scripts providing further contextual information. -These scripts are found by default in \path{/etc/xen/scripts}. The -names and locations of the scripts can be configured in -\path{/etc/xen/xend-config.sxp}. 
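The two scripts are described below. As a quick diagnostic aid (not part of
the configuration itself), the bridge-utils package listed in the
prerequisites can be used from domain 0 to check which virtual interfaces
have been attached to the bridge:
\begin{quote}
\begin{verbatim}
# brctl show
\end{verbatim}
\end{quote}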
- -\begin{description} - -\item[network:] This script is called whenever \xend is started or -stopped to respectively initialise or tear down the Xen virtual -network. In the default configuration initialisation creates the -bridge `xen-br0' and moves eth0 onto that bridge, modifying the -routing accordingly. When \xend exits, it deletes the Xen bridge and -removes eth0, restoring the normal IP and routing configuration. - -%% In configurations where the bridge already exists, this script could -%% be replaced with a link to \path{/bin/true} (for instance). - -\item[vif-bridge:] This script is called for every domain virtual -interface and can configure firewalling rules and add the vif -to the appropriate bridge. By default, this adds and removes -VIFs on the default Xen bridge. - -\end{description} - -For more complex network setups (e.g. where routing is required or -integrate with existing bridges) these scripts may be replaced with -customised variants for your site's preferred configuration. - -%% There are two possible types of privileges: IO privileges and -%% administration privileges. - -\section{Driver Domain Configuration} - -I/O privileges can be assigned to allow a domain to directly access -PCI devices itself. This is used to support driver domains. - -Setting backend privileges is currently only supported in SXP format -config files. To allow a domain to function as a backend for others, -somewhere within the {\tt vm} element of its configuration file must -be a {\tt backend} element of the form {\tt (backend ({\em type}))} -where {\tt \em type} may be either {\tt netif} or {\tt blkif}, -according to the type of virtual device this domain will service. -%% After this domain has been built, \xend will connect all new and -%% existing {\em virtual} devices (of the appropriate type) to that -%% backend. - -Note that a block backend cannot currently import virtual block -devices from other domains, and a network backend cannot import -virtual network devices from other domains. Thus (particularly in the -case of block backends, which cannot import a virtual block device as -their root filesystem), you may need to boot a backend domain from a -ramdisk or a network device. - -Access to PCI devices may be configured on a per-device basis. Xen -will assign the minimal set of hardware privileges to a domain that -are required to control its devices. This can be configured in either -format of configuration file: - -\begin{itemize} -\item SXP Format: Include device elements of the form: \\ -\centerline{ {\tt (device (pci (bus {\em x}) (dev {\em y}) (func {\em z})))}} \\ - inside the top-level {\tt vm} element. Each one specifies the address - of a device this domain is allowed to access --- - the numbers {\em x},{\em y} and {\em z} may be in either decimal or - hexadecimal format. -\item Flat Format: Include a list of PCI device addresses of the - format: \\ -\centerline{{\tt pci = ['x,y,z', ...]}} \\ -where each element in the - list is a string specifying the components of the PCI device - address, separated by commas. The components ({\tt \em x}, {\tt \em - y} and {\tt \em z}) of the list may be formatted as either decimal - or hexadecimal. -\end{itemize} - -%% \section{Administration Domains} - -%% Administration privileges allow a domain to use the `dom0 -%% operations' (so called because they are usually available only to -%% domain 0). A privileged domain can build other domains, set scheduling -%% parameters, etc. 
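For example, to grant a domain access to the device at PCI bus 1, device 2,
function 0 (an illustrative address; substitute that of your own device),
either of the following could be used:
\begin{quote}
\begin{verbatim}
(device (pci (bus 0x1) (dev 0x2) (func 0x0)))
\end{verbatim}
\end{quote}
in SXP format, or
\begin{quote}
\begin{verbatim}
pci = [ '0x1,0x2,0x0' ]
\end{verbatim}
\end{quote}
in the flat format.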
- -% Support for other administrative domains is not yet available... perhaps -% we should plumb it in some time - - - - - -\section{Scheduler Configuration} -\label{s:sched} - - -Xen offers a boot time choice between multiple schedulers. To select -a scheduler, pass the boot parameter {\em sched=sched\_name} to Xen, -substituting the appropriate scheduler name. Details of the schedulers -and their parameters are included below; future versions of the tools -will provide a higher-level interface to these tools. - -It is expected that system administrators configure their system to -use the scheduler most appropriate to their needs. Currently, the BVT -scheduler is the recommended choice. - -\subsection{Borrowed Virtual Time} - -{\tt sched=bvt} (the default) \\ - -BVT provides proportional fair shares of the CPU time. It has been -observed to penalise domains that block frequently (e.g. I/O intensive -domains), but this can be compensated for by using warping. - -\subsubsection{Global Parameters} - -\begin{description} -\item[ctx\_allow] - the context switch allowance is similar to the `quantum' - in traditional schedulers. It is the minimum time that - a scheduled domain will be allowed to run before being - pre-empted. -\end{description} - -\subsubsection{Per-domain parameters} - -\begin{description} -\item[mcuadv] - the MCU (Minimum Charging Unit) advance determines the - proportional share of the CPU that a domain receives. It - is set inversely proportionally to a domain's sharing weight. -\item[warp] - the amount of `virtual time' the domain is allowed to warp - backwards -\item[warpl] - the warp limit is the maximum time a domain can run warped for -\item[warpu] - the unwarp requirement is the minimum time a domain must - run unwarped for before it can warp again -\end{description} - -\subsection{Atropos} - -{\tt sched=atropos} \\ - -Atropos is a soft real time scheduler. It provides guarantees about -absolute shares of the CPU, with a facility for sharing -slack CPU time on a best-effort basis. It can provide timeliness -guarantees for latency-sensitive domains. - -Every domain has an associated period and slice. The domain should -receive `slice' nanoseconds every `period' nanoseconds. This allows -the administrator to configure both the absolute share of the CPU a -domain receives and the frequency with which it is scheduled. - -%% When -%% domains unblock, their period is reduced to the value of the latency -%% hint (the slice is scaled accordingly so that they still get the same -%% proportion of the CPU). For each subsequent period, the slice and -%% period times are doubled until they reach their original values. - -Note: don't overcommit the CPU when using Atropos (i.e. don't reserve -more CPU than is available --- the utilisation should be kept to -slightly less than 100\% in order to ensure predictable behaviour). - -\subsubsection{Per-domain parameters} - -\begin{description} -\item[period] The regular time interval during which a domain is - guaranteed to receive its allocation of CPU time. -\item[slice] - The length of time per period that a domain is guaranteed to run - for (in the absence of voluntary yielding of the CPU). -\item[latency] - The latency hint is used to control how soon after - waking up a domain it should be scheduled. -\item[xtratime] This is a boolean flag that specifies whether a domain - should be allowed a share of the system slack time. 
-\end{description} - -\subsection{Round Robin} - -{\tt sched=rrobin} \\ - -The round robin scheduler is included as a simple demonstration of -Xen's internal scheduler API. It is not intended for production use. - -\subsubsection{Global Parameters} - -\begin{description} -\item[rr\_slice] - The maximum time each domain runs before the next - scheduling decision is made. -\end{description} - - - - - - - - - - - - -\chapter{Build, Boot and Debug options} - -This chapter describes the build- and boot-time options -which may be used to tailor your Xen system. - -\section{Xen Build Options} - -Xen provides a number of build-time options which should be -set as environment variables or passed on make's command-line. - -\begin{description} -\item[verbose=y] Enable debugging messages when Xen detects an unexpected condition. -Also enables console output from all domains. -\item[debug=y] -Enable debug assertions. Implies {\bf verbose=y}. -(Primarily useful for tracing bugs in Xen). -\item[debugger=y] -Enable the in-Xen debugger. This can be used to debug -Xen, guest OSes, and applications. -\item[perfc=y] -Enable performance counters for significant events -within Xen. The counts can be reset or displayed -on Xen's console via console control keys. -\item[trace=y] -Enable per-cpu trace buffers which log a range of -events within Xen for collection by control -software. -\end{description} - -\section{Xen Boot Options} -\label{s:xboot} - -These options are used to configure Xen's behaviour at runtime. They -should be appended to Xen's command line, either manually or by -editing \path{grub.conf}. - -\begin{description} -\item [noreboot ] - Don't reboot the machine automatically on errors. This is - useful to catch debug output if you aren't catching console messages - via the serial line. - -\item [nosmp ] - Disable SMP support. - This option is implied by `ignorebiostables'. - -\item [watchdog ] - Enable NMI watchdog which can report certain failures. - -\item [noirqbalance ] - Disable software IRQ balancing and affinity. This can be used on - systems such as Dell 1850/2850 that have workarounds in hardware for - IRQ-routing issues. - -\item [badpage=$<$page number$>$,$<$page number$>$, \ldots ] - Specify a list of pages not to be allocated for use - because they contain bad bytes. For example, if your - memory tester says that byte 0x12345678 is bad, you would - place `badpage=0x12345' on Xen's command line. - -\item [com1=$<$baud$>$,DPS,$<$io\_base$>$,$<$irq$>$ - com2=$<$baud$>$,DPS,$<$io\_base$>$,$<$irq$>$ ] \mbox{}\\ - Xen supports up to two 16550-compatible serial ports. - For example: `com1=9600, 8n1, 0x408, 5' maps COM1 to a - 9600-baud port, 8 data bits, no parity, 1 stop bit, - I/O port base 0x408, IRQ 5. - If some configuration options are standard (e.g., I/O base and IRQ), - then only a prefix of the full configuration string need be - specified. If the baud rate is pre-configured (e.g., by the - bootloader) then you can specify `auto' in place of a numeric baud - rate. - -\item [console=$<$specifier list$>$ ] - Specify the destination for Xen console I/O. - This is a comma-separated list of, for example: -\begin{description} - \item[vga] use VGA console and allow keyboard input - \item[com1] use serial port com1 - \item[com2H] use serial port com2. Transmitted chars will - have the MSB set. Received chars must have - MSB set. - \item[com2L] use serial port com2. Transmitted chars will - have the MSB cleared. Received chars must - have MSB cleared. 
-\end{description} - The latter two examples allow a single port to be - shared by two subsystems (e.g. console and - debugger). Sharing is controlled by MSB of each - transmitted/received character. - [NB. Default for this option is `com1,vga'] - -\item [sync\_console ] - Force synchronous console output. This is useful if you system fails - unexpectedly before it has sent all available output to the - console. In most cases Xen will automatically enter synchronous mode - when an exceptional event occurs, but this option provides a manual - fallback. - -\item [conswitch=$<$switch-char$><$auto-switch-char$>$ ] - Specify how to switch serial-console input between - Xen and DOM0. The required sequence is CTRL-$<$switch-char$>$ - pressed three times. Specifying the backtick character - disables switching. - The $<$auto-switch-char$>$ specifies whether Xen should - auto-switch input to DOM0 when it boots --- if it is `x' - then auto-switching is disabled. Any other value, or - omitting the character, enables auto-switching. - [NB. default switch-char is `a'] - -\item [nmi=xxx ] - Specify what to do with an NMI parity or I/O error. \\ - `nmi=fatal': Xen prints a diagnostic and then hangs. \\ - `nmi=dom0': Inform DOM0 of the NMI. \\ - `nmi=ignore': Ignore the NMI. - -\item [mem=xxx ] - Set the physical RAM address limit. Any RAM appearing beyond this - physical address in the memory map will be ignored. This parameter - may be specified with a B, K, M or G suffix, representing bytes, - kilobytes, megabytes and gigabytes respectively. The - default unit, if no suffix is specified, is kilobytes. - -\item [dom0\_mem=xxx ] - Set the amount of memory to be allocated to domain0. In Xen 3.x the parameter - may be specified with a B, K, M or G suffix, representing bytes, - kilobytes, megabytes and gigabytes respectively; if no suffix is specified, - the parameter defaults to kilobytes. In previous versions of Xen, suffixes - were not supported and the value is always interpreted as kilobytes. - -\item [tbuf\_size=xxx ] - Set the size of the per-cpu trace buffers, in pages - (default 1). Note that the trace buffers are only - enabled in debug builds. Most users can ignore - this feature completely. - -\item [sched=xxx ] - Select the CPU scheduler Xen should use. The current - possibilities are `bvt' (default), `atropos' and `rrobin'. - For more information see Section~\ref{s:sched}. - -\item [apic\_verbosity=debug,verbose ] - Print more detailed information about local APIC and IOAPIC configuration. - -\item [lapic ] - Force use of local APIC even when left disabled by uniprocessor BIOS. - -\item [nolapic ] - Ignore local APIC in a uniprocessor system, even if enabled by the BIOS. - -\item [apic=bigsmp,default,es7000,summit ] - Specify NUMA platform. This can usually be probed automatically. - -\end{description} - -In addition, the following options may be specified on the Xen command -line. Since domain 0 shares responsibility for booting the platform, -Xen will automatically propagate these options to its command -line. These options are taken from Linux's command-line syntax with -unchanged semantics. - -\begin{description} -\item [acpi=off,force,strict,ht,noirq,\ldots ] - Modify how Xen (and domain 0) parses the BIOS ACPI tables. - -\item [acpi\_skip\_timer\_override ] - Instruct Xen (and domain 0) to ignore timer-interrupt override - instructions specified by the BIOS ACPI tables. 
- -\item [noapic ] - Instruct Xen (and domain 0) to ignore any IOAPICs that are present in - the system, and instead continue to use the legacy PIC. - -\end{description} - -\section{XenLinux Boot Options} - -In addition to the standard Linux kernel boot options, we support: -\begin{description} -\item[xencons=xxx ] Specify the device node to which the Xen virtual -console driver is attached. The following options are supported: -\begin{center} -\begin{tabular}{l} -`xencons=off': disable virtual console \\ -`xencons=tty': attach console to /dev/tty1 (tty0 at boot-time) \\ -`xencons=ttyS': attach console to /dev/ttyS0 -\end{tabular} -\end{center} -The default is ttyS for dom0 and tty for all other domains. -\end{description} - - - -\section{Debugging} -\label{s:keys} - -Xen has a set of debugging features that can be useful to try and -figure out what's going on. Hit 'h' on the serial line (if you -specified a baud rate on the Xen command line) or ScrollLock-h on the -keyboard to get a list of supported commands. - -If you have a crash you'll likely get a crash dump containing an EIP -(PC) which, along with an \path{objdump -d image}, can be useful in -figuring out what's happened. Debug a Xenlinux image just as you -would any other Linux kernel. - -%% We supply a handy debug terminal program which you can find in -%% \path{/usr/local/src/xen-2.0.bk/tools/misc/miniterm/} -%% This should be built and executed on another machine that is connected -%% via a null modem cable. Documentation is included. -%% Alternatively, if the Xen machine is connected to a serial-port server -%% then we supply a dumb TCP terminal client, {\tt xencons}. - - +%% Chapter Control Software moved to control_software.tex +\include{src/user/control_software} + +%% Chapter Domain Configuration moved to domain_configuration.tex +\include{src/user/domain_configuration} + +%% Chapter Build, Boot and Debug Options moved to build.tex +\include{src/user/build} \chapter{Further Support} @@ -1875,6 +108,7 @@ %Various HOWTOs are available in \path{docs/HOWTOS} but this content is %being integrated into this manual. + \section{Online References} The official Xen web site is found at: @@ -1884,6 +118,7 @@ This contains links to the latest versions of all on-line documentation (including the lateset version of the FAQ). + \section{Mailing Lists} @@ -1905,326 +140,18 @@ \end{description} + \appendix - -\chapter{Installing Xen / XenLinux on Debian} - -The Debian project provides a tool called \path{debootstrap} which -allows a base Debian system to be installed into a filesystem without -requiring the host system to have any Debian-specific software (such -as \path{apt}. - -Here's some info how to install Debian 3.1 (Sarge) for an unprivileged -Xen domain: - -\begin{enumerate} -\item Set up Xen 2.0 and test that it's working, as described earlier in - this manual. - -\item Create disk images for root-fs and swap (alternatively, you - might create dedicated partitions, LVM logical volumes, etc. if - that suits your setup). -\begin{small}\begin{verbatim} -dd if=/dev/zero of=/path/diskimage bs=1024k count=size_in_mbytes -dd if=/dev/zero of=/path/swapimage bs=1024k count=size_in_mbytes -\end{verbatim}\end{small} - If you're going to use this filesystem / disk image only as a - `template' for other vm disk images, something like 300 MB should - be enough.. 
(of course it depends what kind of packages you are - planning to install to the template) - -\item Create the filesystem and initialise the swap image -\begin{small}\begin{verbatim} -mkfs.ext3 /path/diskimage -mkswap /path/swapimage -\end{verbatim}\end{small} - -\item Mount the disk image for installation -\begin{small}\begin{verbatim} -mount -o loop /path/diskimage /mnt/disk -\end{verbatim}\end{small} - -\item Install \path{debootstrap} - -Make sure you have debootstrap installed on the host. If you are -running Debian sarge (3.1 / testing) or unstable you can install it by -running \path{apt-get install debootstrap}. Otherwise, it can be -downloaded from the Debian project website. - -\item Install Debian base to the disk image: -\begin{small}\begin{verbatim} -debootstrap --arch i386 sarge /mnt/disk \ - http://ftp.<countrycode>.debian.org/debian -\end{verbatim}\end{small} - -You can use any other Debian http/ftp mirror you want. - -\item When debootstrap completes successfully, modify settings: -\begin{small}\begin{verbatim} -chroot /mnt/disk /bin/bash -\end{verbatim}\end{small} - -Edit the following files using vi or nano and make needed changes: -\begin{small}\begin{verbatim} -/etc/hostname -/etc/hosts -/etc/resolv.conf -/etc/network/interfaces -/etc/networks -\end{verbatim}\end{small} - -Set up access to the services, edit: -\begin{small}\begin{verbatim} -/etc/hosts.deny -/etc/hosts.allow -/etc/inetd.conf -\end{verbatim}\end{small} - -Add Debian mirror to: -\begin{small}\begin{verbatim} -/etc/apt/sources.list -\end{verbatim}\end{small} - -Create fstab like this: -\begin{small}\begin{verbatim} -/dev/sda1 / ext3 errors=remount-ro 0 1 -/dev/sda2 none swap sw 0 0 -proc /proc proc defaults 0 0 -\end{verbatim}\end{small} - -Logout - -\item Unmount the disk image -\begin{small}\begin{verbatim} -umount /mnt/disk -\end{verbatim}\end{small} - -\item Create Xen 2.0 configuration file for the new domain. You can - use the example-configurations coming with Xen as a template. - - Make sure you have the following set up: -\begin{small}\begin{verbatim} -disk = [ 'file:/path/diskimage,sda1,w', 'file:/path/swapimage,sda2,w' ] -root = "/dev/sda1 ro" -\end{verbatim}\end{small} - -\item Start the new domain -\begin{small}\begin{verbatim} -xm create -f domain_config_file -\end{verbatim}\end{small} - -Check that the new domain is running: -\begin{small}\begin{verbatim} -xm list -\end{verbatim}\end{small} - -\item Attach to the console of the new domain. - You should see something like this when starting the new domain: - -\begin{small}\begin{verbatim} -Started domain testdomain2, console on port 9626 -\end{verbatim}\end{small} - - There you can see the ID of the console: 26. You can also list - the consoles with \path{xm consoles} (ID is the last two - digits of the port number.) - - Attach to the console: - -\begin{small}\begin{verbatim} -xm console 26 -\end{verbatim}\end{small} - - or by telnetting to the port 9626 of localhost (the xm console - program works better). - -\item Log in and run base-config - - As a default there's no password for the root. - - Check that everything looks OK, and the system started without - errors. Check that the swap is active, and the network settings are - correct. - - Run \path{/usr/sbin/base-config} to set up the Debian settings. - - Set up the password for root using passwd. - -\item Done. 
You can exit the console by pressing \path{Ctrl + ]} - -\end{enumerate} - -If you need to create new domains, you can just copy the contents of -the `template'-image to the new disk images, either by mounting the -template and the new image, and using \path{cp -a} or \path{tar} or by -simply copying the image file. Once this is done, modify the -image-specific settings (hostname, network settings, etc). - -\chapter{Installing Xen / XenLinux on Redhat or Fedora Core} - -When using Xen / XenLinux on a standard Linux distribution there are -a couple of things to watch out for: - -Note that, because domains>0 don't have any privileged access at all, -certain commands in the default boot sequence will fail e.g. attempts -to update the hwclock, change the console font, update the keytable -map, start apmd (power management), or gpm (mouse cursor). Either -ignore the errors (they should be harmless), or remove them from the -startup scripts. Deleting the following links are a good start: -{\path{S24pcmcia}}, {\path{S09isdn}}, -{\path{S17keytable}}, {\path{S26apmd}}, -{\path{S85gpm}}. - -If you want to use a single root file system that works cleanly for -both domain 0 and unprivileged domains, a useful trick is to use -different 'init' run levels. For example, use -run level 3 for domain 0, and run level 4 for other domains. This -enables different startup scripts to be run in depending on the run -level number passed on the kernel command line. - -If using NFS root files systems mounted either from an -external server or from domain0 there are a couple of other gotchas. -The default {\path{/etc/sysconfig/iptables}} rules block NFS, so part -way through the boot sequence things will suddenly go dead. - -If you're planning on having a separate NFS {\path{/usr}} partition, the -RH9 boot scripts don't make life easy - they attempt to mount NFS file -systems way to late in the boot process. The easiest way I found to do -this was to have a {\path{/linuxrc}} script run ahead of -{\path{/sbin/init}} that mounts {\path{/usr}}: - -\begin{quote} -\begin{small}\begin{verbatim} - #!/bin/bash - /sbin/ipconfig lo 127.0.0.1 - /sbin/portmap - /bin/mount /usr - exec /sbin/init "$@" <>/dev/console 2>&1 -\end{verbatim}\end{small} -\end{quote} - -%$ XXX SMH: font lock fix :-) - -The one slight complication with the above is that -{\path{/sbin/portmap}} is dynamically linked against -{\path{/usr/lib/libwrap.so.0}} Since this is in -{\path{/usr}}, it won't work. This can be solved by copying the -file (and link) below the /usr mount point, and just let the file be -'covered' when the mount happens. - -In some installations, where a shared read-only {\path{/usr}} is -being used, it may be desirable to move other large directories over -into the read-only {\path{/usr}}. For example, you might replace -{\path{/bin}}, {\path{/lib}} and {\path{/sbin}} with -links into {\path{/usr/root/bin}}, {\path{/usr/root/lib}} -and {\path{/usr/root/sbin}} respectively. This creates other -problems for running the {\path{/linuxrc}} script, requiring -bash, portmap, mount, ifconfig, and a handful of other shared -libraries to be copied below the mount point --- a simple -statically-linked C program would solve this problem. - - - - -\chapter{Glossary of Terms} - -\begin{description} -\item[Atropos] One of the CPU schedulers provided by Xen. - Atropos provides domains with absolute shares - of the CPU, with timeliness guarantees and a - mechanism for sharing out `slack time'. 
- -\item[BVT] The BVT scheduler is used to give proportional - fair shares of the CPU to domains. - -\item[Exokernel] A minimal piece of privileged code, similar to - a {\bf microkernel} but providing a more - `hardware-like' interface to the tasks it - manages. This is similar to a paravirtualising - VMM like {\bf Xen} but was designed as a new - operating system structure, rather than - specifically to run multiple conventional OSs. - -\item[Domain] A domain is the execution context that - contains a running {\bf virtual machine}. - The relationship between virtual machines - and domains on Xen is similar to that between - programs and processes in an operating - system: a virtual machine is a persistent - entity that resides on disk (somewhat like - a program). When it is loaded for execution, - it runs in a domain. Each domain has a - {\bf domain ID}. - -\item[Domain 0] The first domain to be started on a Xen - machine. Domain 0 is responsible for managing - the system. - -\item[Domain ID] A unique identifier for a {\bf domain}, - analogous to a process ID in an operating - system. - -\item[Full virtualisation] An approach to virtualisation which - requires no modifications to the hosted - operating system, providing the illusion of - a complete system of real hardware devices. - -\item[Hypervisor] An alternative term for {\bf VMM}, used - because it means `beyond supervisor', - since it is responsible for managing multiple - `supervisor' kernels. - -\item[Live migration] A technique for moving a running virtual - machine to another physical host, without - stopping it or the services running on it. - -\item[Microkernel] A small base of code running at the highest - hardware privilege level. A microkernel is - responsible for sharing CPU and memory (and - sometimes other devices) between less - privileged tasks running on the system. - This is similar to a VMM, particularly a - {\bf paravirtualising} VMM but typically - addressing a different problem space and - providing different kind of interface. - -\item[NetBSD/Xen] A port of NetBSD to the Xen architecture. - -\item[Paravirtualisation] An approach to virtualisation which requires - modifications to the operating system in - order to run in a virtual machine. Xen - uses paravirtualisation but preserves - binary compatibility for user space - applications. - -\item[Shadow pagetables] A technique for hiding the layout of machine - memory from a virtual machine's operating - system. Used in some {\bf VMMs} to provide - the illusion of contiguous physical memory, - in Xen this is used during - {\bf live migration}. - -\item[Virtual Machine] The environment in which a hosted operating - system runs, providing the abstraction of a - dedicated machine. A virtual machine may - be identical to the underlying hardware (as - in {\bf full virtualisation}, or it may - differ, as in {\bf paravirtualisation}. - -\item[VMM] Virtual Machine Monitor - the software that - allows multiple virtual machines to be - multiplexed on a single physical machine. - -\item[Xen] Xen is a paravirtualising virtual machine - monitor, developed primarily by the - Systems Research Group at the University - of Cambridge Computer Laboratory. - -\item[XenLinux] Official name for the port of the Linux kernel - that runs on Xen. 
- -\end{description} +%% Chapter Installing Xen / XenLinux on Debian moved to debian.tex +\include{src/user/debian} + +%% Chapter Installing Xen on Red Hat moved to redhat.tex +\include{src/user/redhat} + + +%% Chapter Glossary of Terms moved to glossary.tex +\include{src/user/glossary} \end{document} diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/interface/architecture.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/interface/architecture.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,140 @@ +\chapter{Virtual Architecture} + +On a Xen-based system, the hypervisor itself runs in {\it ring 0}. It +has full access to the physical memory available in the system and is +responsible for allocating portions of it to the domains. Guest +operating systems run in and use {\it rings 1}, {\it 2} and {\it 3} as +they see fit. Segmentation is used to prevent the guest OS from +accessing the portion of the address space that is reserved for Xen. +We expect most guest operating systems will use ring 1 for their own +operation and place applications in ring 3. + +In this chapter we consider the basic virtual architecture provided by +Xen: the basic CPU state, exception and interrupt handling, and time. +Other aspects such as memory and device access are discussed in later +chapters. + + +\section{CPU state} + +All privileged state must be handled by Xen. The guest OS has no +direct access to CR3 and is not permitted to update privileged bits in +EFLAGS. Guest OSes use \emph{hypercalls} to invoke operations in Xen; +these are analogous to system calls but occur from ring 1 to ring 0. + +A list of all hypercalls is given in Appendix~\ref{a:hypercalls}. + + +\section{Exceptions} + +A virtual IDT is provided --- a domain can submit a table of trap +handlers to Xen via the {\tt set\_trap\_table()} hypercall. Most trap +handlers are identical to native x86 handlers, although the page-fault +handler is somewhat different. + + +\section{Interrupts and events} + +Interrupts are virtualized by mapping them to \emph{events}, which are +delivered asynchronously to the target domain using a callback +supplied via the {\tt set\_callbacks()} hypercall. A guest OS can map +these events onto its standard interrupt dispatch mechanisms. Xen is +responsible for determining the target domain that will handle each +physical interrupt source. For more details on the binding of event +sources to events, see Chapter~\ref{c:devices}. + + +\section{Time} + +Guest operating systems need to be aware of the passage of both real +(or wallclock) time and their own `virtual time' (the time for which +they have been executing). Furthermore, Xen has a notion of time which +is used for scheduling. The following notions of time are provided: + +\begin{description} +\item[Cycle counter time.] + + This provides a fine-grained time reference. The cycle counter time + is used to accurately extrapolate the other time references. On SMP + machines it is currently assumed that the cycle counter time is + synchronized between CPUs. The current x86-based implementation + achieves this within inter-CPU communication latencies. + +\item[System time.] + + This is a 64-bit counter which holds the number of nanoseconds that + have elapsed since system boot. + +\item[Wall clock time.] + + This is the time of day in a Unix-style {\tt struct timeval} + (seconds and microseconds since 1 January 1970, adjusted by leap + seconds). An NTP client hosted by {\it domain 0} can keep this + value accurate. + +\item[Domain virtual time.] 
+ + This progresses at the same pace as system time, but only while a + domain is executing --- it stops while a domain is de-scheduled. + Therefore the share of the CPU that a domain receives is indicated + by the rate at which its virtual time increases. + +\end{description} + + +Xen exports timestamps for system time and wall-clock time to guest +operating systems through a shared page of memory. Xen also provides +the cycle counter time at the instant the timestamps were calculated, +and the CPU frequency in Hertz. This allows the guest to extrapolate +system and wall-clock times accurately based on the current cycle +counter time. + +Since all time stamps need to be updated and read \emph{atomically} +two version numbers are also stored in the shared info page. The first +is incremented prior to an update, while the second is only +incremented afterwards. Thus a guest can be sure that it read a +consistent state by checking the two version numbers are equal. + +Xen includes a periodic ticker which sends a timer event to the +currently executing domain every 10ms. The Xen scheduler also sends a +timer event whenever a domain is scheduled; this allows the guest OS +to adjust for the time that has passed while it has been inactive. In +addition, Xen allows each domain to request that they receive a timer +event sent at a specified system time by using the {\tt + set\_timer\_op()} hypercall. Guest OSes may use this timer to +implement timeout values when they block. + + + +%% % akw: demoting this to a section -- not sure if there is any point +%% % though, maybe just remove it. + +\section{Xen CPU Scheduling} + +Xen offers a uniform API for CPU schedulers. It is possible to choose +from a number of schedulers at boot and it should be easy to add more. +The BVT, Atropos and Round Robin schedulers are part of the normal Xen +distribution. BVT provides proportional fair shares of the CPU to the +running domains. Atropos can be used to reserve absolute shares of +the CPU for each domain. Round-robin is provided as an example of +Xen's internal scheduler API. + +\paragraph*{Note: SMP host support} +Xen has always supported SMP host systems. Domains are statically +assigned to CPUs, either at creation time or when manually pinning to +a particular CPU. The current schedulers then run locally on each CPU +to decide which of the assigned domains should be run there. The +user-level control software can be used to perform coarse-grain +load-balancing between CPUs. + + +%% More information on the characteristics and use of these schedulers +%% is available in {\tt Sched-HOWTO.txt}. + + +\section{Privileged operations} + +Xen exports an extended interface to privileged domains (viz.\ {\it + Domain 0}). This allows such domains to build and boot other domains +on the server, and provides control interfaces for managing +scheduling, memory, networking, and block devices. diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/interface/debugging.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/interface/debugging.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,62 @@ +\chapter{Debugging} + +Xen provides tools for debugging both Xen and guest OSes. Currently, the +Pervasive Debugger provides a GDB stub, which provides facilities for symbolic +debugging of Xen itself and of OS kernels running on top of Xen. The Trace +Buffer provides a lightweight means to log data about Xen's internal state and +behaviour at runtime, for later analysis. 
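+
+As a small taste of the latter facility, a developer instrumenting Xen
+might record an event using the {\tt TRACE\_xD} macros described under
+`Internal API' below. The fragment is only a sketch: {\tt TRC\_MY\_EVENT}
+is a made-up event number, not one of the predefined identifiers.
+
+\begin{small}\begin{verbatim}
+#include <xen/trace.h>
+
+#define TRC_MY_EVENT 0xf0001   /* hypothetical event number */
+
+void my_instrumented_path(unsigned long domid, unsigned long vaddr)
+{
+    /* Event ID plus two 32-bit data words; expands to a no-op in
+       builds without the trace buffer enabled. */
+    TRACE_2D(TRC_MY_EVENT, domid, vaddr);
+}
+\end{verbatim}\end{small}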
+ +\section{Pervasive Debugger} + +Information on using the pervasive debugger is available in pdb.txt. + + +\section{Trace Buffer} + +The trace buffer provides a means to observe Xen's operation from domain 0. +Trace events, inserted at key points in Xen's code, record data that can be +read by the {\tt xentrace} tool. Recording these events has a low overhead +and hence the trace buffer may be useful for debugging timing-sensitive +behaviours. + +\subsection{Internal API} + +To use the trace buffer functionality from within Xen, you must {\tt \#include +<xen/trace.h>}, which contains definitions related to the trace buffer. Trace +events are inserted into the buffer using the {\tt TRACE\_xD} ({\tt x} = 0, 1, +2, 3, 4 or 5) macros. These all take an event number, plus {\tt x} additional +(32-bit) data as their arguments. For trace buffer-enabled builds of Xen these +will insert the event ID and data into the trace buffer, along with the current +value of the CPU cycle-counter. For builds without the trace buffer enabled, +the macros expand to no-ops and thus can be left in place without incurring +overheads. + +\subsection{Trace-enabled builds} + +By default, the trace buffer is enabled only in debug builds (i.e. {\tt NDEBUG} +is not defined). It can be enabled separately by defining {\tt TRACE\_BUFFER}, +either in {\tt <xen/config.h>} or on the gcc command line. + +The size (in pages) of the per-CPU trace buffers can be specified using the +{\tt tbuf\_size=n } boot parameter to Xen. If the size is set to 0, the trace +buffers will be disabled. + +\subsection{Dumping trace data} + +When running a trace buffer build of Xen, trace data are written continuously +into the buffer data areas, with newer data overwriting older data. This data +can be captured using the {\tt xentrace} program in domain 0. + +The {\tt xentrace} tool uses {\tt /dev/mem} in domain 0 to map the trace +buffers into its address space. It then periodically polls all the buffers for +new data, dumping out any new records from each buffer in turn. As a result, +for machines with multiple (logical) CPUs, the trace buffer output will not be +in overall chronological order. + +The output from {\tt xentrace} can be post-processed using {\tt +xentrace\_cpusplit} (used to split trace data out into per-cpu log files) and +{\tt xentrace\_format} (used to pretty-print trace data). For the predefined +trace points, there is an example format file in {\tt tools/xentrace/formats }. + +For more information, see the manual pages for {\tt xentrace}, {\tt +xentrace\_format} and {\tt xentrace\_cpusplit}. diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/interface/devices.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/interface/devices.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,178 @@ +\chapter{Devices} +\label{c:devices} + +Devices such as network and disk are exported to guests using a split +device driver. The device driver domain, which accesses the physical +device directly also runs a \emph{backend} driver, serving requests to +that device from guests. Each guest will use a simple \emph{frontend} +driver, to access the backend. Communication between these domains is +composed of two parts: First, data is placed onto a shared memory page +between the domains. Second, an event channel between the two domains +is used to pass notification that data is outstanding. This +separation of notification from data transfer allows message batching, +and results in very efficient device access. 
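+
+The fragment below sketches the producer half of this pattern: a
+request descriptor is written to the shared page and the backend is
+then notified over the event channel. The ring layout and the {\tt
+notify\_backend()} helper are illustrative placeholders rather than
+part of the actual interface; real drivers use the ring formats and
+notification stubs negotiated at device setup.
+
+\begin{small}\begin{verbatim}
+#define RING_SIZE 32
+
+struct shared_ring {
+    unsigned int  req_prod;         /* written by the frontend  */
+    unsigned int  req_cons;         /* written by the backend   */
+    unsigned long req[RING_SIZE];   /* device-specific requests */
+};
+
+/* Placeholder: wraps the `send' operation of event_channel_op(). */
+extern void notify_backend(int evtchn_port);
+
+static void submit_request(struct shared_ring *ring, int port,
+                           unsigned long descriptor)
+{
+    ring->req[ring->req_prod % RING_SIZE] = descriptor; /* 1. data  */
+    __sync_synchronize();     /* make the descriptor visible first  */
+    ring->req_prod++;                                   /* publish  */
+    notify_backend(port);                               /* 2. event */
+}
+\end{verbatim}\end{small}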
+ +Event channels are used extensively in device virtualization; each +domain has a number of end-points or \emph{ports} each of which may be +bound to one of the following \emph{event sources}: +\begin{itemize} + \item a physical interrupt from a real device, + \item a virtual interrupt (callback) from Xen, or + \item a signal from another domain +\end{itemize} + +Events are lightweight and do not carry much information beyond the +source of the notification. Hence when performing bulk data transfer, +events are typically used as synchronization primitives over a shared +memory transport. Event channels are managed via the {\tt + event\_channel\_op()} hypercall; for more details see +Section~\ref{s:idc}. + +This chapter focuses on some individual device interfaces available to +Xen guests. + + +\section{Network I/O} + +Virtual network device services are provided by shared memory +communication with a backend domain. From the point of view of other +domains, the backend may be viewed as a virtual ethernet switch +element with each domain having one or more virtual network interfaces +connected to it. + +\subsection{Backend Packet Handling} + +The backend driver is responsible for a variety of actions relating to +the transmission and reception of packets from the physical device. +With regard to transmission, the backend performs these key actions: + +\begin{itemize} +\item {\bf Validation:} To ensure that domains do not attempt to + generate invalid (e.g. spoofed) traffic, the backend driver may + validate headers ensuring that source MAC and IP addresses match the + interface that they have been sent from. + + Validation functions can be configured using standard firewall rules + ({\small{\tt iptables}} in the case of Linux). + +\item {\bf Scheduling:} Since a number of domains can share a single + physical network interface, the backend must mediate access when + several domains each have packets queued for transmission. This + general scheduling function subsumes basic shaping or rate-limiting + schemes. + +\item {\bf Logging and Accounting:} The backend domain can be + configured with classifier rules that control how packets are + accounted or logged. For example, log messages might be generated + whenever a domain attempts to send a TCP packet containing a SYN. +\end{itemize} + +On receipt of incoming packets, the backend acts as a simple +demultiplexer: Packets are passed to the appropriate virtual interface +after any necessary logging and accounting have been carried out. + +\subsection{Data Transfer} + +Each virtual interface uses two ``descriptor rings'', one for +transmit, the other for receive. Each descriptor identifies a block +of contiguous physical memory allocated to the domain. + +The transmit ring carries packets to transmit from the guest to the +backend domain. The return path of the transmit ring carries messages +indicating that the contents have been physically transmitted and the +backend no longer requires the associated pages of memory. + +To receive packets, the guest places descriptors of unused pages on +the receive ring. The backend will return received packets by +exchanging these pages in the domain's memory with new pages +containing the received data, and passing back descriptors regarding +the new packets on the ring. This zero-copy approach allows the +backend to maintain a pool of free pages to receive packets into, and +then deliver them to appropriate domains after examining their +headers. 
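+
+A guest must therefore keep the receive ring topped up with free
+pages. The sketch below shows roughly what this looks like; the ring
+layout and the {\tt alloc\_free\_page\_frame()} helper are assumptions
+made for illustration only.
+
+\begin{small}\begin{verbatim}
+#define RX_RING_SIZE 256
+
+struct rx_ring {
+    unsigned int  prod;                /* buffers posted by guest   */
+    unsigned long pfn[RX_RING_SIZE];   /* descriptors of free pages */
+};
+
+extern unsigned long alloc_free_page_frame(void);   /* returns a pfn */
+
+static void refill_rx_ring(struct rx_ring *ring, unsigned int budget)
+{
+    while (budget--) {
+        /* Hand an unused page to the backend; it will be exchanged
+           for a page containing a received packet. */
+        ring->pfn[ring->prod % RX_RING_SIZE] = alloc_free_page_frame();
+        ring->prod++;
+    }
+}
+\end{verbatim}\end{small}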
+
+% Real physical addresses are used throughout, with the domain
+% performing translation from pseudo-physical addresses if that is
+% necessary.
+
+If a domain does not keep its receive ring stocked with empty buffers
+then packets destined to it may be dropped. This provides some
+defence against receive livelock problems because an overloaded domain
+will cease to receive further data. Similarly, on the transmit path,
+it provides the application with feedback on the rate at which packets
+are able to leave the system.
+
+Flow control on rings is achieved by including a pair of producer
+indexes on the shared ring page. Each side will maintain a private
+consumer index indicating the next outstanding message. In this
+manner, the domains cooperate to divide the ring into two message
+lists, one in each direction. Notification is decoupled from the
+immediate placement of new messages on the ring; the event channel
+will be used to generate notification when {\em either} a certain
+number of outstanding messages are queued, {\em or} a specified number
+of nanoseconds have elapsed since the oldest message was placed on the
+ring.
+
+%% Not sure if my version is any better -- here is what was here
+%% before: Synchronization between the backend domain and the guest is
+%% achieved using counters held in shared memory that is accessible to
+%% both.  Each ring has associated producer and consumer indices
+%% indicating the area in the ring that holds descriptors that contain
+%% data.  After receiving {\it n} packets or {\t nanoseconds} after
+%% receiving the first packet, the hypervisor sends an event to the
+%% domain.
+
+
+\section{Block I/O}
+
+All guest OS disk access goes through the virtual block device VBD
+interface.  This interface allows domains access to portions of block
+storage devices visible to the block backend device.  The VBD
+interface is a split driver, similar to the network interface
+described above.  A single shared memory ring is used between the
+frontend and backend drivers, across which read and write messages are
+sent.
+
+Any block device accessible to the backend domain, including
+network-based block devices (iSCSI, *NBD, etc.), loopback and LVM/MD
+devices, can be exported as a VBD.  Each VBD is mapped to a device
+node in the guest, specified in the guest's startup configuration.
+
+Old (Xen 1.2) virtual disks are not supported under Xen 2.0, since
+similar functionality can be achieved using the more complete LVM
+system, which is already in widespread use.
+
+\subsection{Data Transfer}
+
+The single ring between the guest and the block backend supports three
+messages:
+
+\begin{description}
+\item [{\small {\tt PROBE}}:] Return a list of the VBDs available to
+  this guest from the backend.  The request includes a descriptor of a
+  free page into which the reply will be written by the backend.
+
+\item [{\small {\tt READ}}:] Read data from the specified block
+  device.  The front end identifies the device and location to read
+  from and attaches pages for the data to be copied to (typically via
+  DMA from the device).  The backend acknowledges completed read
+  requests as they finish.
+
+\item [{\small {\tt WRITE}}:] Write data to the specified block
+  device.  This functions essentially as {\small {\tt READ}}, except
+  that the data moves to the device instead of from it.
+\end{description}
+
+%% um... some old text: In overview, the same style of descriptor-ring
+%% that is used for network packets is used here. 
Each domain has one +%% ring that carries operation requests to the hypervisor and carries +%% the results back again. + +%% Rather than copying data, the backend simply maps the domain's +%% buffers in order to enable direct DMA to them. The act of mapping +%% the buffers also increases the reference counts of the underlying +%% pages, so that the unprivileged domain cannot try to return them to +%% the hypervisor, install them as page tables, or any other unsafe +%% behaviour. +%% +%% % block API here diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/interface/further_info.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/interface/further_info.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,49 @@ +\chapter{Further Information} + +If you have questions that are not answered by this manual, the +sources of information listed below may be of interest to you. Note +that bug reports, suggestions and contributions related to the +software (or the documentation) should be sent to the Xen developers' +mailing list (address below). + + +\section{Other documentation} + +If you are mainly interested in using (rather than developing for) +Xen, the \emph{Xen Users' Manual} is distributed in the {\tt docs/} +directory of the Xen source distribution. + +% Various HOWTOs are also available in {\tt docs/HOWTOS}. + + +\section{Online references} + +The official Xen web site is found at: +\begin{quote} +{\tt http://www.cl.cam.ac.uk/Research/SRG/netos/xen/} +\end{quote} + +This contains links to the latest versions of all on-line +documentation. + + +\section{Mailing lists} + +There are currently four official Xen mailing lists: + +\begin{description} +\item[xen-devel@xxxxxxxxxxxxxxxxxxx] Used for development + discussions and bug reports. Subscribe at: \\ + {\small {\tt http://lists.xensource.com/xen-devel}} +\item[xen-users@xxxxxxxxxxxxxxxxxxx] Used for installation and usage + discussions and requests for help. Subscribe at: \\ + {\small {\tt http://lists.xensource.com/xen-users}} +\item[xen-announce@xxxxxxxxxxxxxxxxxxx] Used for announcements only. + Subscribe at: \\ + {\small {\tt http://lists.xensource.com/xen-announce}} +\item[xen-changelog@xxxxxxxxxxxxxxxxxxx] Changelog feed + from the unstable and 2.0 trees - developer oriented. Subscribe at: \\ + {\small {\tt http://lists.xensource.com/xen-changelog}} +\end{description} + +Of these, xen-devel is the most active. diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/interface/hypercalls.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/interface/hypercalls.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,524 @@ + +\newcommand{\hypercall}[1]{\vspace{2mm}{\sf #1}} + +\chapter{Xen Hypercalls} +\label{a:hypercalls} + +Hypercalls represent the procedural interface to Xen; this appendix +categorizes and describes the current set of hypercalls. + +\section{Invoking Hypercalls} + +Hypercalls are invoked in a manner analogous to system calls in a +conventional operating system; a software interrupt is issued which +vectors to an entry point within Xen. On x86\_32 machines the +instruction required is {\tt int \$82}; the (real) IDT is setup so +that this may only be issued from within ring 1. The particular +hypercall to be invoked is contained in {\tt EAX} --- a list +mapping these values to symbolic hypercall names can be found +in {\tt xen/include/public/xen.h}. 
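+
+As a concrete (if simplified) illustration, a 32-bit guest might wrap
+this trap in a small stub along the following lines. The wrapper name
+and single-argument form are illustrative only; real ports generate a
+family of such stubs, passing further arguments in EBX, ECX, EDX, ESI
+and EDI, with the hypercall trap vector being 0x82.
+
+\begin{small}\begin{verbatim}
+/* Sketch of a one-argument hypercall stub for x86_32 guests.  The
+   hypercall number goes in EAX; the trap gate is vector 0x82. */
+static inline long hypercall1(int op, unsigned long arg1)
+{
+    long ret;
+    asm volatile ( "int $0x82"
+                   : "=a" (ret)
+                   : "0" (op), "b" (arg1)
+                   : "memory" );
+    return ret;
+}
+
+/* Example: relinquish the CPU via the sched_op() hypercall:
+   long rc = hypercall1(__HYPERVISOR_sched_op, SCHEDOP_yield);  */
+\end{verbatim}\end{small}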
+
+On some occasions a set of hypercalls will be required to carry
+out a higher-level function; a good example is when a guest
+operating system wishes to context switch to a new process which
+requires updating various privileged CPU state. As an optimization
+for these cases, there is a generic mechanism to issue a set of
+hypercalls as a batch:
+
+\begin{quote}
+\hypercall{multicall(void *call\_list, int nr\_calls)}
+
+Execute a series of hypervisor calls; {\tt nr\_calls} is the length of
+the array of {\tt multicall\_entry\_t} structures pointed to by {\tt
+call\_list}. Each entry contains the hypercall operation code followed
+by up to 7 word-sized arguments.
+\end{quote}
+
+Note that multicalls are provided purely as an optimization; there is
+no requirement to use them when first porting a guest operating
+system.
+
+
+\section{Virtual CPU Setup}
+
+At start of day, a guest operating system needs to set up the virtual
+CPU it is executing on. This includes installing vectors for the
+virtual IDT so that the guest OS can handle interrupts, page faults,
+etc. However the very first thing a guest OS must set up is a pair
+of hypervisor callbacks: these are the entry points which Xen will
+use when it wishes to notify the guest OS of an occurrence.
+
+\begin{quote}
+\hypercall{set\_callbacks(unsigned long event\_selector, unsigned long
+  event\_address, unsigned long failsafe\_selector, unsigned long
+  failsafe\_address) }
+
+Register the normal (``event'') and failsafe callbacks for
+event processing. In each case the code segment selector and
+address within that segment are provided. The selectors must
+have RPL 1; in XenLinux we simply use the kernel's CS for both
+{\tt event\_selector} and {\tt failsafe\_selector}.
+
+The value {\tt event\_address} specifies the address of the guest OS's
+event handling and dispatch routine; the {\tt failsafe\_address}
+specifies a separate entry point which is used only if a fault occurs
+when Xen attempts to use the normal callback.
+\end{quote}
+
+
+After installing the hypervisor callbacks, the guest OS can
+install a `virtual IDT' by using the following hypercall:
+
+\begin{quote}
+\hypercall{set\_trap\_table(trap\_info\_t *table)}
+
+Install one or more entries into the per-domain
+trap handler table (essentially a software version of the IDT).
+Each entry in the array pointed to by {\tt table} includes the
+exception vector number with the corresponding segment selector
+and entry point. Most guest OSes can use the same handlers on
+Xen as when running on the real hardware; an exception is the
+page fault handler (exception vector 14) where a modified
+stack-frame layout is used.
+
+
+\end{quote}
+
+
+
+\section{Scheduling and Timer}
+
+Domains are preemptively scheduled by Xen according to the
+parameters installed by domain 0 (see Section~\ref{s:dom0ops}).
+In addition, however, a domain may choose to explicitly
+control certain behavior with the following hypercall:
+
+\begin{quote}
+\hypercall{sched\_op(unsigned long op)}
+
+Request scheduling operation from hypervisor. The options are: {\it
+yield}, {\it block}, and {\it shutdown}. {\it yield} keeps the
+calling domain runnable but may cause a reschedule if other domains
+are runnable. {\it block} removes the calling domain from the run
+queue and causes it to sleep until an event is delivered to it. {\it
+shutdown} is used to end the domain's execution; the caller can
+additionally specify whether the domain should reboot, halt or
+suspend. 
+\end{quote} + +To aid the implementation of a process scheduler within a guest OS, +Xen provides a virtual programmable timer: + +\begin{quote} +\hypercall{set\_timer\_op(uint64\_t timeout)} + +Request a timer event to be sent at the specified system time (time +in nanoseconds since system boot). The hypercall actually passes the +64-bit timeout value as a pair of 32-bit values. + +\end{quote} + +Note that calling {\tt set\_timer\_op()} prior to {\tt sched\_op} +allows block-with-timeout semantics. + + +\section{Page Table Management} + +Since guest operating systems have read-only access to their page +tables, Xen must be involved when making any changes. The following +multi-purpose hypercall can be used to modify page-table entries, +update the machine-to-physical mapping table, flush the TLB, install +a new page-table base pointer, and more. + +\begin{quote} +\hypercall{mmu\_update(mmu\_update\_t *req, int count, int *success\_count)} + +Update the page table for the domain; a set of {\tt count} updates are +submitted for processing in a batch, with {\tt success\_count} being +updated to report the number of successful updates. + +Each element of {\tt req[]} contains a pointer (address) and value; +the least significant 2-bits of the pointer are used to distinguish +the type of update requested as follows: +\begin{description} + +\item[\it MMU\_NORMAL\_PT\_UPDATE:] update a page directory entry or +page table entry to the associated value; Xen will check that the +update is safe, as described in Chapter~\ref{c:memory}. + +\item[\it MMU\_MACHPHYS\_UPDATE:] update an entry in the + machine-to-physical table. The calling domain must own the machine + page in question (or be privileged). + +\item[\it MMU\_EXTENDED\_COMMAND:] perform additional MMU operations. +The set of additional MMU operations is considerable, and includes +updating {\tt cr3} (or just re-installing it for a TLB flush), +flushing the cache, installing a new LDT, or pinning \& unpinning +page-table pages (to ensure their reference count doesn't drop to zero +which would require a revalidation of all entries). + +Further extended commands are used to deal with granting and +acquiring page ownership; see Section~\ref{s:idc}. + + +\end{description} + +More details on the precise format of all commands can be +found in {\tt xen/include/public/xen.h}. + + +\end{quote} + +Explicitly updating batches of page table entries is extremely +efficient, but can require a number of alterations to the guest +OS. Using the writable page table mode (Chapter~\ref{c:memory}) is +recommended for new OS ports. + +Regardless of which page table update mode is being used, however, +there are some occasions (notably handling a demand page fault) where +a guest OS will wish to modify exactly one PTE rather than a +batch. This is catered for by the following: + +\begin{quote} +\hypercall{update\_va\_mapping(unsigned long page\_nr, unsigned long +val, \\ unsigned long flags)} + +Update the currently installed PTE for the page {\tt page\_nr} to +{\tt val}. As with {\tt mmu\_update()}, Xen checks the modification +is safe before applying it. The {\tt flags} determine which kind +of TLB flush, if any, should follow the update. 
+
+\end{quote}
+
+Finally, sufficiently privileged domains may occasionally wish to manipulate
+the pages of others:
+\begin{quote}
+
+\hypercall{update\_va\_mapping\_otherdomain(unsigned long page\_nr,
+unsigned long val, unsigned long flags, uint16\_t domid)}
+
+Identical to {\tt update\_va\_mapping()} save that the pages being
+mapped must belong to the domain {\tt domid}.
+
+\end{quote}
+
+This privileged operation is currently used by backend virtual device
+drivers to safely map pages containing I/O data.
+
+
+
+\section{Segmentation Support}
+
+Xen allows guest OSes to install a custom GDT if they require it;
+this is context switched transparently whenever a domain is
+[de]scheduled. The following hypercall is effectively a
+`safe' version of {\tt lgdt}:
+
+\begin{quote}
+\hypercall{set\_gdt(unsigned long *frame\_list, int entries)}
+
+Install a global descriptor table for a domain; {\tt frame\_list} is
+an array of up to 16 machine page frames within which the GDT resides,
+with {\tt entries} being the actual number of descriptor-entry
+slots. All page frames must be mapped read-only within the guest's
+address space, and the table must be large enough to contain Xen's
+reserved entries (see {\tt xen/include/public/arch-x86\_32.h}).
+
+\end{quote}
+
+Many guest OSes will also wish to install LDTs; this is achieved by
+using {\tt mmu\_update()} with an extended command, passing the
+linear address of the LDT base along with the number of entries. No
+special safety checks are required; Xen needs to perform this task
+simply since {\tt lldt} requires CPL 0.
+
+
+Xen also allows guest operating systems to update just an
+individual segment descriptor in the GDT or LDT:
+
+\begin{quote}
+\hypercall{update\_descriptor(unsigned long ma, unsigned long word1,
+unsigned long word2)}
+
+Update the GDT/LDT entry at machine address {\tt ma}; the new
+8-byte descriptor is stored in {\tt word1} and {\tt word2}.
+Xen performs a number of checks to ensure the descriptor is
+valid.
+
+\end{quote}
+
+Guest OSes can use the above in place of context switching entire
+LDTs (or the GDT) when the number of changing descriptors is small.
+
+\section{Context Switching}
+
+When a guest OS wishes to context switch between two processes,
+it can use the page table and segmentation hypercalls described
+above to perform the bulk of the privileged work. In addition,
+however, it will need to invoke Xen to switch the kernel (ring 1)
+stack pointer:
+
+\begin{quote}
+\hypercall{stack\_switch(unsigned long ss, unsigned long esp)}
+
+Request kernel stack switch from hypervisor; {\tt ss} is the new
+stack segment and {\tt esp} is the new stack pointer.
+
+\end{quote}
+
+A final useful hypercall for context switching allows ``lazy''
+save and restore of floating point state:
+
+\begin{quote}
+\hypercall{fpu\_taskswitch(void)}
+
+This call instructs Xen to set the {\tt TS} bit in the {\tt cr0}
+control register; this means that the next attempt to use floating
+point will cause a trap which the guest OS can catch. Typically it will
+then save/restore the FP state, and clear the {\tt TS} bit.
+\end{quote}
+
+This is provided as an optimization only; guest OSes can also choose
+to save and restore FP state on all context switches for simplicity.
+
+
+\section{Physical Memory Management}
+
+As mentioned previously, each domain has a maximum and current
+memory allocation. The maximum allocation, set at domain creation
+time, cannot be modified. 
However a domain can choose to reduce +and subsequently grow its current allocation by using the +following call: + +\begin{quote} +\hypercall{dom\_mem\_op(unsigned int op, unsigned long *extent\_list, + unsigned long nr\_extents, unsigned int extent\_order)} + +Increase or decrease current memory allocation (as determined by +the value of {\tt op}). Each invocation provides a list of +extents each of which is $2^s$ pages in size, +where $s$ is the value of {\tt extent\_order}. + +\end{quote} + +In addition to simply reducing or increasing the current memory +allocation via a `balloon driver', this call is also useful for +obtaining contiguous regions of machine memory when required (e.g. +for certain PCI devices, or if using superpages). + + +\section{Inter-Domain Communication} +\label{s:idc} + +Xen provides a simple asynchronous notification mechanism via +\emph{event channels}. Each domain has a set of end-points (or +\emph{ports}) which may be bound to an event source (e.g. a physical +IRQ, a virtual IRQ, or an port in another domain). When a pair of +end-points in two different domains are bound together, then a `send' +operation on one will cause an event to be received by the destination +domain. + +The control and use of event channels involves the following hypercall: + +\begin{quote} +\hypercall{event\_channel\_op(evtchn\_op\_t *op)} + +Inter-domain event-channel management; {\tt op} is a discriminated +union which allows the following 7 operations: + +\begin{description} + +\item[\it alloc\_unbound:] allocate a free (unbound) local + port and prepare for connection from a specified domain. +\item[\it bind\_virq:] bind a local port to a virtual +IRQ; any particular VIRQ can be bound to at most one port per domain. +\item[\it bind\_pirq:] bind a local port to a physical IRQ; +once more, a given pIRQ can be bound to at most one port per +domain. Furthermore the calling domain must be sufficiently +privileged. +\item[\it bind\_interdomain:] construct an interdomain event +channel; in general, the target domain must have previously allocated +an unbound port for this channel, although this can be bypassed by +privileged domains during domain setup. +\item[\it close:] close an interdomain event channel. +\item[\it send:] send an event to the remote end of a +interdomain event channel. +\item[\it status:] determine the current status of a local port. +\end{description} + +For more details see +{\tt xen/include/public/event\_channel.h}. + +\end{quote} + +Event channels are the fundamental communication primitive between +Xen domains and seamlessly support SMP. However they provide little +bandwidth for communication {\sl per se}, and hence are typically +married with a piece of shared memory to produce effective and +high-performance inter-domain communication. + +Safe sharing of memory pages between guest OSes is carried out by +granting access on a per page basis to individual domains. This is +achieved by using the {\tt grant\_table\_op()} hypercall. + +\begin{quote} +\hypercall{grant\_table\_op(unsigned int cmd, void *uop, unsigned int count)} + +Grant or remove access to a particular page to a particular domain. + +\end{quote} + +This is not currently widely in use by guest operating systems, but +we intend to integrate support more fully in the near future. + +\section{PCI Configuration} + +Domains with physical device access (i.e.\ driver domains) receive +limited access to certain PCI devices (bus address space and +interrupts). 
However many guest operating systems attempt to +determine the PCI configuration by directly access the PCI BIOS, +which cannot be allowed for safety. + +Instead, Xen provides the following hypercall: + +\begin{quote} +\hypercall{physdev\_op(void *physdev\_op)} + +Perform a PCI configuration option; depending on the value +of {\tt physdev\_op} this can be a PCI config read, a PCI config +write, or a small number of other queries. + +\end{quote} + + +For examples of using {\tt physdev\_op()}, see the +Xen-specific PCI code in the linux sparse tree. + +\section{Administrative Operations} +\label{s:dom0ops} + +A large number of control operations are available to a sufficiently +privileged domain (typically domain 0). These allow the creation and +management of new domains, for example. A complete list is given +below: for more details on any or all of these, please see +{\tt xen/include/public/dom0\_ops.h} + + +\begin{quote} +\hypercall{dom0\_op(dom0\_op\_t *op)} + +Administrative domain operations for domain management. The options are: + +\begin{description} +\item [\it DOM0\_CREATEDOMAIN:] create a new domain + +\item [\it DOM0\_PAUSEDOMAIN:] remove a domain from the scheduler run +queue. + +\item [\it DOM0\_UNPAUSEDOMAIN:] mark a paused domain as schedulable + once again. + +\item [\it DOM0\_DESTROYDOMAIN:] deallocate all resources associated +with a domain + +\item [\it DOM0\_GETMEMLIST:] get list of pages used by the domain + +\item [\it DOM0\_SCHEDCTL:] + +\item [\it DOM0\_ADJUSTDOM:] adjust scheduling priorities for domain + +\item [\it DOM0\_BUILDDOMAIN:] do final guest OS setup for domain + +\item [\it DOM0\_GETDOMAINFO:] get statistics about the domain + +\item [\it DOM0\_GETPAGEFRAMEINFO:] + +\item [\it DOM0\_GETPAGEFRAMEINFO2:] + +\item [\it DOM0\_IOPL:] set I/O privilege level + +\item [\it DOM0\_MSR:] read or write model specific registers + +\item [\it DOM0\_DEBUG:] interactively invoke the debugger + +\item [\it DOM0\_SETTIME:] set system time + +\item [\it DOM0\_READCONSOLE:] read console content from hypervisor buffer ring + +\item [\it DOM0\_PINCPUDOMAIN:] pin domain to a particular CPU + +\item [\it DOM0\_GETTBUFS:] get information about the size and location of + the trace buffers (only on trace-buffer enabled builds) + +\item [\it DOM0\_PHYSINFO:] get information about the host machine + +\item [\it DOM0\_PCIDEV\_ACCESS:] modify PCI device access permissions + +\item [\it DOM0\_SCHED\_ID:] get the ID of the current Xen scheduler + +\item [\it DOM0\_SHADOW\_CONTROL:] switch between shadow page-table modes + +\item [\it DOM0\_SETDOMAININITIALMEM:] set initial memory allocation of a domain + +\item [\it DOM0\_SETDOMAINMAXMEM:] set maximum memory allocation of a domain + +\item [\it DOM0\_SETDOMAINVMASSIST:] set domain VM assist options +\end{description} +\end{quote} + +Most of the above are best understood by looking at the code +implementing them (in {\tt xen/common/dom0\_ops.c}) and in +the user-space tools that use them (mostly in {\tt tools/libxc}). + +\section{Debugging Hypercalls} + +A few additional hypercalls are mainly useful for debugging: + +\begin{quote} +\hypercall{console\_io(int cmd, int count, char *str)} + +Use Xen to interact with the console; operations are: + +{\it CONSOLEIO\_write}: Output count characters from buffer str. + +{\it CONSOLEIO\_read}: Input at most count characters into buffer str. 
+\end{quote}
+
+A pair of hypercalls allows access to the underlying debug registers:
+\begin{quote}
+\hypercall{set\_debugreg(int reg, unsigned long value)}
+
+Set debug register {\tt reg} to {\tt value}.
+
+\hypercall{get\_debugreg(int reg)}
+
+Return the contents of the debug register {\tt reg}.
+\end{quote}
+
+And finally:
+\begin{quote}
+\hypercall{xen\_version(int cmd)}
+
+Request Xen version number.
+\end{quote}
+
+This is useful to ensure that user-space tools are in sync
+with the underlying hypervisor.
+
+\section{Deprecated Hypercalls}
+
+Xen is under constant development and refinement; as such there
+are plans to improve the way in which various pieces of functionality
+are exposed to guest OSes.
+
+\begin{quote}
+\hypercall{vm\_assist(unsigned int cmd, unsigned int type)}
+
+Toggle various memory management modes (in particular writable page
+tables and superpage support).
+
+\end{quote}
+
+This is likely to be replaced with mode values in the shared
+information page since this is more resilient for resumption
+after migration or checkpoint.
diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/interface/memory.tex
--- /dev/null Tue Sep 20 09:08:26 2005
+++ b/docs/src/interface/memory.tex Tue Sep 20 09:17:33 2005
@@ -0,0 +1,162 @@
+\chapter{Memory}
+\label{c:memory}
+
+Xen is responsible for managing the allocation of physical memory to
+domains, and for ensuring safe use of the paging and segmentation
+hardware.
+
+
+\section{Memory Allocation}
+
+Xen resides within a small fixed portion of physical memory; it also
+reserves the top 64MB of every virtual address space. The remaining
+physical memory is available for allocation to domains at a page
+granularity. Xen tracks the ownership and use of each page, which
+allows it to enforce secure partitioning between domains.
+
+Each domain has a maximum and current physical memory allocation. A
+guest OS may run a `balloon driver' to dynamically adjust its current
+memory allocation up to its limit.
+
+
+%% XXX SMH: I use machine and physical in the next section (which is
+%% kinda required for consistency with code); wonder if this section
+%% should use same terms?
+%%
+%% Probably.
+%%
+%% Merging this and below section at some point prob makes sense.
+
+\section{Pseudo-Physical Memory}
+
+Since physical memory is allocated and freed on a page granularity,
+there is no guarantee that a domain will receive a contiguous stretch
+of physical memory. However, most operating systems do not have good
+support for operating in a fragmented physical address space. To aid
+porting such operating systems to run on top of Xen, we make a
+distinction between \emph{machine memory} and \emph{pseudo-physical
+ memory}.
+
+Put simply, machine memory refers to the entire amount of memory
+installed in the machine, including that reserved by Xen, in use by
+various domains, or currently unallocated. We consider machine memory
+to comprise a set of 4K \emph{machine page frames} numbered
+consecutively starting from 0. Machine frame numbers mean the same
+within Xen or any domain.
+
+Pseudo-physical memory, on the other hand, is a per-domain
+abstraction. It allows a guest operating system to consider its memory
+allocation to consist of a contiguous range of physical page frames
+starting at physical frame 0, despite the fact that the underlying
+machine page frames may be sparsely allocated and in any order.
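+
+To make the distinction concrete, the hedged sketch below shows the
+translation a guest might perform between the two views, using the
+translation tables introduced in the next paragraph. The table and
+function names here are illustrative assumptions only, not the actual
+Xen or XenLinux symbols.
+
+\begin{verbatim}
+/* Illustrative sketch only: the identifiers below are assumed for
+ * the purposes of this example.  One table is supplied per domain
+ * (pseudo-physical to machine); the other is maintained globally
+ * by Xen and is readable by all domains (machine to
+ * pseudo-physical). */
+extern unsigned long *phys_to_machine_table;   /* per-domain         */
+extern unsigned long *machine_to_phys_table;   /* global, read-only  */
+
+unsigned long pfn_to_mfn(unsigned long pfn)
+{
+    return phys_to_machine_table[pfn];  /* pseudo-physical -> machine */
+}
+
+unsigned long mfn_to_pfn(unsigned long mfn)
+{
+    return machine_to_phys_table[mfn];  /* machine -> pseudo-physical */
+}
+\end{verbatim}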
+
+To achieve this abstraction, Xen maintains a globally readable {\it
+ machine-to-physical} table which records the mapping from machine
+page frames to pseudo-physical ones. In addition, each domain is
+supplied with a {\it physical-to-machine} table which performs the
+inverse mapping. Clearly the machine-to-physical table has size
+proportional to the amount of RAM installed in the machine, while each
+physical-to-machine table has size proportional to the memory
+allocation of the given domain.
+
+Architecture-dependent code in guest operating systems can then use
+the two tables to provide the abstraction of pseudo-physical memory.
+In general, only certain specialized parts of the operating system
+(such as page table management) need to understand the difference
+between machine and pseudo-physical addresses.
+
+
+\section{Page Table Updates}
+
+In the default mode of operation, Xen enforces read-only access to
+page tables and requires guest operating systems to explicitly request
+any modifications. Xen validates all such requests and only applies
+updates that it deems safe. This is necessary to prevent domains from
+adding arbitrary mappings to their page tables.
+
+To aid validation, Xen associates a type and reference count with each
+memory page. A page has one of the following mutually-exclusive types
+at any point in time: page directory ({\sf PD}), page table ({\sf
+ PT}), local descriptor table ({\sf LDT}), global descriptor table
+({\sf GDT}), or writable ({\sf RW}). Note that a guest OS may always
+create readable mappings of its own memory regardless of its current
+type.
+
+%%% XXX: possibly explain more about ref count 'lifecycle' here?
+This mechanism is used to maintain the invariants required for safety;
+for example, a domain cannot have a writable mapping to any part of a
+page table as this would require the page concerned to simultaneously
+be of types {\sf PT} and {\sf RW}.
+
+
+% \section{Writable Page Tables}
+
+Xen also provides an alternative mode of operation in which guests
+have the illusion that their page tables are directly writable. Of
+course this is not really the case, since Xen must still validate
+modifications to ensure secure partitioning. To this end, Xen traps
+any write attempt to a memory page of type {\sf PT} (i.e., that is
+currently part of a page table). If such an access occurs, Xen
+temporarily allows write access to that page while at the same time
+\emph{disconnecting} it from the page table that is currently in use.
+This allows the guest to safely make updates to the page because the
+newly-updated entries cannot be used by the MMU until Xen revalidates
+and reconnects the page. Reconnection occurs automatically in a
+number of situations: for example, when the guest modifies a different
+page-table page, when the domain is preempted, or whenever the guest
+uses Xen's explicit page-table update interfaces.
+
+Finally, Xen also supports a form of \emph{shadow page tables} in
+which the guest OS uses an independent copy of page tables which are
+unknown to the hardware (i.e.\ which are never pointed to by {\tt
+ cr3}). Instead, Xen propagates changes made to the guest's tables to
+the real ones, and vice versa. This is useful for logging page writes
+(e.g.\ for live migration or checkpoint). A full version of the shadow
+page tables also allows guest OS porting with less effort.
+
+
+\section{Segment Descriptor Tables}
+
+On boot a guest is supplied with a default GDT, which does not reside
+within its own memory allocation.
If the guest wishes to use segments other
+than the default `flat' ring-1 and ring-3 segments that this GDT
+provides, it must register a custom GDT and/or LDT with Xen, allocated
+from its own memory. Note that a number of GDT entries are reserved by
+Xen -- any custom GDT must also include sufficient space for these
+entries.
+
+For example, the following hypercall is used to specify a new GDT:
+
+\begin{quote}
+ int {\bf set\_gdt}(unsigned long *{\em frame\_list}, int {\em
+ entries})
+
+ \emph{frame\_list}: An array of up to 16 machine page frames within
+ which the GDT resides. Any frame registered as a GDT frame may only
+ be mapped read-only within the guest's address space (e.g., no
+ writable mappings, no use as a page-table page, and so on).
+
+ \emph{entries}: The number of descriptor-entry slots in the GDT.
+ Note that the table must be large enough to contain Xen's reserved
+ entries; thus we must have `{\em entries $>$
+ LAST\_RESERVED\_GDT\_ENTRY}\ '. Note also that, after registering
+ the GDT, slots \emph{FIRST\_} through
+ \emph{LAST\_RESERVED\_GDT\_ENTRY} are no longer usable by the guest
+ and may be overwritten by Xen.
+\end{quote}
+
+The LDT is updated via the generic MMU update mechanism (i.e., via the
+{\tt mmu\_update()} hypercall).
+
+\section{Start of Day}
+
+The start-of-day environment for guest operating systems is rather
+different to that provided by the underlying hardware. In particular,
+the processor is already executing in protected mode with paging
+enabled.
+
+{\it Domain 0} is created and booted by Xen itself. For all subsequent
+domains, the analogue of the boot-loader is the {\it domain builder},
+user-space software running in {\it domain 0}. The domain builder is
+responsible for building the initial page tables for a domain and
+loading its kernel image at the appropriate virtual address.
diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/interface/scheduling.tex
--- /dev/null Tue Sep 20 09:08:26 2005
+++ b/docs/src/interface/scheduling.tex Tue Sep 20 09:17:33 2005
@@ -0,0 +1,268 @@
+\chapter{Scheduling API}
+
+The scheduling API is used by both the schedulers described above and should
+also be used by any new schedulers. It provides a generic interface and also
+implements much of the ``boilerplate'' code.
+
+Schedulers conforming to this API are described by the following
+structure:
+
+\begin{verbatim}
+struct scheduler
+{
+ char *name; /* full name for this scheduler */
+ char *opt_name; /* option name for this scheduler */
+ unsigned int sched_id; /* ID for this scheduler */
+
+ int (*init_scheduler) ();
+ int (*alloc_task) (struct task_struct *);
+ void (*add_task) (struct task_struct *);
+ void (*free_task) (struct task_struct *);
+ void (*rem_task) (struct task_struct *);
+ void (*wake_up) (struct task_struct *);
+ void (*do_block) (struct task_struct *);
+ task_slice_t (*do_schedule) (s_time_t);
+ int (*control) (struct sched_ctl_cmd *);
+ int (*adjdom) (struct task_struct *,
+ struct sched_adjdom_cmd *);
+ s32 (*reschedule) (struct task_struct *);
+ void (*dump_settings) (void);
+ void (*dump_cpu_state) (int);
+ void (*dump_runq_el) (struct task_struct *);
+};
+\end{verbatim}
+
+The only method that {\em must} be implemented is
+{\tt do\_schedule()}. However, if the
+{\tt wake\_up()} method is not implemented then waking tasks will not be put
+on the runqueue!
+
+The fields of the above structure are described in more detail below.
+
+\subsubsection{name}
+
+The name field should point to a descriptive ASCII string.
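+
+Before describing the remaining fields in turn, the sketch below shows
+how a hypothetical scheduler might fill in this structure. It is an
+illustration only: the identifiers {\tt example\_schedule} and
+{\tt example\_sched} are invented here, and the field names of the
+returned {\tt task\_slice\_t} are assumed rather than quoted from the
+headers. As noted above, only {\tt do\_schedule()} is mandatory.
+
+\begin{verbatim}
+/* Hypothetical example -- not a scheduler shipped with Xen. */
+static task_slice_t example_schedule(s_time_t now)
+{
+    task_slice_t ret;
+    ret.task = current;        /* keep running the current task...     */
+    ret.time = MILLISECS(10);  /* ...for at most another 10ms (assumes
+                                  a MILLISECS() helper is available)   */
+    return ret;
+}
+
+static struct scheduler example_sched = {
+    .name        = "Example scheduler",   /* descriptive name            */
+    .opt_name    = "example",             /* selected with sched=example */
+    .sched_id    = 99,                    /* unique ID, normally a macro
+                                             in <xen/sched-if.h>         */
+    .do_schedule = example_schedule,      /* the only mandatory hook;
+                                             without wake_up(), woken
+                                             tasks never reach the
+                                             runqueue                    */
+};
+\end{verbatim}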
+
+\subsubsection{opt\_name}
+
+This field is the value of the {\tt sched=} boot-time option that will select
+this scheduler.
+
+\subsubsection{sched\_id}
+
+This is an integer that uniquely identifies this scheduler. There should be a
+macro corresponding to this scheduler ID in {\tt <xen/sched-if.h>}.
+
+\subsubsection{init\_scheduler}
+
+\paragraph*{Purpose}
+
+This is a function for performing any scheduler-specific initialisation. For
+instance, it might allocate memory for per-CPU scheduler data and initialise it
+appropriately.
+
+\paragraph*{Call environment}
+
+This function is called after the initialisation performed by the generic
+layer. The function is called exactly once, for the scheduler that has been
+selected.
+
+\paragraph*{Return values}
+
+This should return negative on failure --- this will cause an
+immediate panic and the system will fail to boot.
+
+\subsubsection{alloc\_task}
+
+\paragraph*{Purpose}
+Called when a {\tt task\_struct} is allocated by the generic scheduler
+layer. A particular scheduler implementation may use this method to
+allocate per-task data for this task. It may use the {\tt
+sched\_priv} pointer in the {\tt task\_struct} to point to this data.
+
+\paragraph*{Call environment}
+The generic layer guarantees that the {\tt sched\_priv} field will
+remain intact from the time this method is called until the task is
+deallocated (so long as the scheduler implementation does not change
+it explicitly!).
+
+\paragraph*{Return values}
+Negative on failure.
+
+\subsubsection{add\_task}
+
+\paragraph*{Purpose}
+
+Called when a task is initially added by the generic layer.
+
+\paragraph*{Call environment}
+
+The fields in the {\tt task\_struct} are now filled out and available for use.
+Schedulers should implement appropriate initialisation of any per-task private
+information in this method.
+
+\subsubsection{free\_task}
+
+\paragraph*{Purpose}
+
+Schedulers should free the space used by any associated private data
+structures.
+
+\paragraph*{Call environment}
+
+This is called when a {\tt task\_struct} is about to be deallocated.
+The generic layer will have done generic task removal operations and
+(if implemented) called the scheduler's {\tt rem\_task} method before
+this method is called.
+
+\subsubsection{rem\_task}
+
+\paragraph*{Purpose}
+
+This is called when a task is being removed from scheduling (but is
+not yet being freed).
+
+\subsubsection{wake\_up}
+
+\paragraph*{Purpose}
+
+Called when a task is woken up, this method should put the task on the runqueue
+(or do the scheduler-specific equivalent action).
+
+\paragraph*{Call environment}
+
+The task is already set to state RUNNING.
+
+\subsubsection{do\_block}
+
+\paragraph*{Purpose}
+
+This function is called when a task is blocked. This function should
+not remove the task from the runqueue.
+
+\paragraph*{Call environment}
+
+The EVENTS\_MASTER\_ENABLE\_BIT is already set and the task state changed to
+TASK\_INTERRUPTIBLE on entry to this method. A call to the {\tt
+ do\_schedule} method will be made after this method returns, in
+order to select the next task to run.
+
+\subsubsection{do\_schedule}
+
+This method must be implemented.
+
+\paragraph*{Purpose}
+
+The method is called each time a new task must be chosen for scheduling on the
+current CPU. The current time is passed as the single argument (the current
+task can be found using the {\tt current} macro).
+
+This method should select the next task to run on this CPU and set its minimum
+time to run, as well as returning the data described below.
+
+This method should also take the appropriate action if the previous
+task has blocked, e.g. removing it from the runqueue.
+
+\paragraph*{Call environment}
+
+The other fields in the {\tt task\_struct} are updated by the generic layer,
+which also performs all Xen-specific tasks and performs the actual task switch
+(unless the previous task has been chosen again).
+
+This method is called with the {\tt schedule\_lock} held for the current CPU
+and local interrupts disabled.
+
+\paragraph*{Return values}
+
+Must return a {\tt struct task\_slice} describing which task to run and for
+how long (at maximum).
+
+\subsubsection{control}
+
+\paragraph*{Purpose}
+
+This method is called for global scheduler control operations. It takes a
+pointer to a {\tt struct sched\_ctl\_cmd}, which it should either
+source data from or populate with data, depending on the value of the
+{\tt direction} field.
+
+\paragraph*{Call environment}
+
+The generic layer guarantees that when this method is called, the
+caller selected the correct scheduler ID, hence the scheduler's
+implementation does not need to sanity-check these parts of the call.
+
+\paragraph*{Return values}
+
+This function should return the value to be passed back to user space, hence it
+should either be 0 or an appropriate errno value.
+
+\subsubsection{sched\_adjdom}
+
+\paragraph*{Purpose}
+
+This method is called to adjust the scheduling parameters of a particular
+domain, or to query their current values. The function should check
+the {\tt direction} field of the {\tt sched\_adjdom\_cmd} it receives in
+order to determine which of these operations is being performed.
+
+\paragraph*{Call environment}
+
+The generic layer guarantees that the caller has specified the correct
+control interface version and scheduler ID and that the supplied {\tt
+task\_struct} will not be deallocated during the call (hence it is not
+necessary to {\tt get\_task\_struct}).
+
+\paragraph*{Return values}
+
+This function should return the value to be passed back to user space, hence it
+should either be 0 or an appropriate errno value.
+
+\subsubsection{reschedule}
+
+\paragraph*{Purpose}
+
+This method is called to determine if a reschedule is required as a result of a
+particular task.
+
+\paragraph*{Call environment}
+The generic layer will cause a reschedule if the current domain is the idle
+task or it has exceeded its minimum time slice before a reschedule. The
+generic layer guarantees that the task passed is not currently running but is
+on the runqueue.
+
+\paragraph*{Return values}
+
+Should return a mask of CPUs on which to cause a reschedule.
+
+\subsubsection{dump\_settings}
+
+\paragraph*{Purpose}
+
+If implemented, this should dump any private global settings for this
+scheduler to the console.
+
+\paragraph*{Call environment}
+
+This function is called with interrupts enabled.
+
+\subsubsection{dump\_cpu\_state}
+
+\paragraph*{Purpose}
+
+This method should dump any private settings for the specified CPU.
+
+\paragraph*{Call environment}
+
+This function is called with interrupts disabled and the {\tt schedule\_lock}
+for the specified CPU held.
+
+\subsubsection{dump\_runq\_el}
+
+\paragraph*{Purpose}
+
+This method should dump any private settings for the specified task.
+ +\paragraph*{Call environment} + +This function is called with interrupts disabled and the {\tt schedule\_lock} +for the task's CPU held. diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/build.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/build.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,170 @@ +\chapter{Build, Boot and Debug Options} + +This chapter describes the build- and boot-time options which may be +used to tailor your Xen system. + + +\section{Xen Build Options} + +Xen provides a number of build-time options which should be set as +environment variables or passed on make's command-line. + +\begin{description} +\item[verbose=y] Enable debugging messages when Xen detects an + unexpected condition. Also enables console output from all domains. +\item[debug=y] Enable debug assertions. Implies {\bf verbose=y}. + (Primarily useful for tracing bugs in Xen). +\item[debugger=y] Enable the in-Xen debugger. This can be used to + debug Xen, guest OSes, and applications. +\item[perfc=y] Enable performance counters for significant events + within Xen. The counts can be reset or displayed on Xen's console + via console control keys. +\item[trace=y] Enable per-cpu trace buffers which log a range of + events within Xen for collection by control software. +\end{description} + + +\section{Xen Boot Options} +\label{s:xboot} + +These options are used to configure Xen's behaviour at runtime. They +should be appended to Xen's command line, either manually or by +editing \path{grub.conf}. + +\begin{description} +\item [ noreboot ] Don't reboot the machine automatically on errors. + This is useful to catch debug output if you aren't catching console + messages via the serial line. +\item [ nosmp ] Disable SMP support. This option is implied by + `ignorebiostables'. +\item [ watchdog ] Enable NMI watchdog which can report certain + failures. +\item [ noirqbalance ] Disable software IRQ balancing and affinity. + This can be used on systems such as Dell 1850/2850 that have + workarounds in hardware for IRQ-routing issues. +\item [ badpage=$<$page number$>$,$<$page number$>$, \ldots ] Specify + a list of pages not to be allocated for use because they contain bad + bytes. For example, if your memory tester says that byte 0x12345678 + is bad, you would place `badpage=0x12345' on Xen's command line. +\item [ com1=$<$baud$>$,DPS,$<$io\_base$>$,$<$irq$>$ + com2=$<$baud$>$,DPS,$<$io\_base$>$,$<$irq$>$ ] \mbox{}\\ + Xen supports up to two 16550-compatible serial ports. For example: + `com1=9600, 8n1, 0x408, 5' maps COM1 to a 9600-baud port, 8 data + bits, no parity, 1 stop bit, I/O port base 0x408, IRQ 5. If some + configuration options are standard (e.g., I/O base and IRQ), then + only a prefix of the full configuration string need be specified. If + the baud rate is pre-configured (e.g., by the bootloader) then you + can specify `auto' in place of a numeric baud rate. +\item [ console=$<$specifier list$>$ ] Specify the destination for Xen + console I/O. This is a comma-separated list of, for example: + \begin{description} + \item[ vga ] Use VGA console and allow keyboard input. + \item[ com1 ] Use serial port com1. + \item[ com2H ] Use serial port com2. Transmitted chars will have the + MSB set. Received chars must have MSB set. + \item[ com2L] Use serial port com2. Transmitted chars will have the + MSB cleared. Received chars must have MSB cleared. + \end{description} + The latter two examples allow a single port to be shared by two + subsystems (e.g.\ console and debugger). 
Sharing is controlled by
+ the MSB of each transmitted/received character. [NB. Default for this
+ option is `com1,vga']
+\item [ sync\_console ] Force synchronous console output. This is
+ useful if your system fails unexpectedly before it has sent all
+ available output to the console. In most cases Xen will
+ automatically enter synchronous mode when an exceptional event
+ occurs, but this option provides a manual fallback.
+\item [ conswitch=$<$switch-char$><$auto-switch-char$>$ ] Specify how
+ to switch serial-console input between Xen and DOM0. The required
+ sequence is CTRL-$<$switch-char$>$ pressed three times. Specifying
+ the backtick character disables switching. The
+ $<$auto-switch-char$>$ specifies whether Xen should auto-switch
+ input to DOM0 when it boots --- if it is `x' then auto-switching is
+ disabled. Any other value, or omitting the character, enables
+ auto-switching. [NB. Default switch-char is `a'.]
+\item [ nmi=xxx ]
+ Specify what to do with an NMI parity or I/O error. \\
+ `nmi=fatal': Xen prints a diagnostic and then hangs. \\
+ `nmi=dom0': Inform DOM0 of the NMI. \\
+ `nmi=ignore': Ignore the NMI.
+\item [ mem=xxx ] Set the physical RAM address limit. Any RAM
+ appearing beyond this physical address in the memory map will be
+ ignored. This parameter may be specified with a B, K, M or G suffix,
+ representing bytes, kilobytes, megabytes and gigabytes respectively.
+ The default unit, if no suffix is specified, is kilobytes.
+\item [ dom0\_mem=xxx ] Set the amount of memory to be allocated to
+ domain0. In Xen 3.x the parameter may be specified with a B, K, M or
+ G suffix, representing bytes, kilobytes, megabytes and gigabytes
+ respectively; if no suffix is specified, the parameter defaults to
+ kilobytes. In previous versions of Xen, suffixes were not supported
+ and the value was always interpreted as kilobytes.
+\item [ tbuf\_size=xxx ] Set the size of the per-cpu trace buffers, in
+ pages (default 1). Note that the trace buffers are only enabled in
+ debug builds. Most users can ignore this feature completely.
+\item [ sched=xxx ] Select the CPU scheduler Xen should use. The
+ current possibilities are `bvt' (default), `atropos' and `rrobin'.
+ For more information see Section~\ref{s:sched}.
+\item [ apic\_verbosity=debug,verbose ] Print more detailed
+ information about local APIC and IOAPIC configuration.
+\item [ lapic ] Force use of local APIC even when left disabled by
+ uniprocessor BIOS.
+\item [ nolapic ] Ignore local APIC in a uniprocessor system, even if
+ enabled by the BIOS.
+\item [ apic=bigsmp,default,es7000,summit ] Specify NUMA platform.
+ This can usually be probed automatically.
+\end{description}
+
+In addition, the following options may be specified on the Xen command
+line. Since domain 0 shares responsibility for booting the platform,
+Xen will automatically propagate these options to its command line.
+These options are taken from Linux's command-line syntax with
+unchanged semantics.
+
+\begin{description}
+\item [ acpi=off,force,strict,ht,noirq,\ldots ] Modify how Xen (and
+ domain 0) parses the BIOS ACPI tables.
+\item [ acpi\_skip\_timer\_override ] Instruct Xen (and domain~0) to
+ ignore timer-interrupt override instructions specified by the BIOS
+ ACPI tables.
+\item [ noapic ] Instruct Xen (and domain~0) to ignore any IOAPICs
+ that are present in the system, and instead continue to use the
+ legacy PIC.
+\end{description} + + +\section{XenLinux Boot Options} + +In addition to the standard Linux kernel boot options, we support: +\begin{description} +\item[ xencons=xxx ] Specify the device node to which the Xen virtual + console driver is attached. The following options are supported: + \begin{center} + \begin{tabular}{l} + `xencons=off': disable virtual console \\ + `xencons=tty': attach console to /dev/tty1 (tty0 at boot-time) \\ + `xencons=ttyS': attach console to /dev/ttyS0 + \end{tabular} +\end{center} +The default is ttyS for dom0 and tty for all other domains. +\end{description} + + +\section{Debugging} +\label{s:keys} + +Xen has a set of debugging features that can be useful to try and +figure out what's going on. Hit `h' on the serial line (if you +specified a baud rate on the Xen command line) or ScrollLock-h on the +keyboard to get a list of supported commands. + +If you have a crash you'll likely get a crash dump containing an EIP +(PC) which, along with an \path{objdump -d image}, can be useful in +figuring out what's happened. Debug a Xenlinux image just as you +would any other Linux kernel. + +%% We supply a handy debug terminal program which you can find in +%% \path{/usr/local/src/xen-2.0.bk/tools/misc/miniterm/} This should +%% be built and executed on another machine that is connected via a +%% null modem cable. Documentation is included. Alternatively, if the +%% Xen machine is connected to a serial-port server then we supply a +%% dumb TCP terminal client, {\tt xencons}. diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/control_software.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/control_software.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,115 @@ +\chapter{Control Software} + +The Xen control software includes the \xend\ node control daemon +(which must be running), the xm command line tools, and the prototype +xensv web interface. + +\section{\Xend\ (node control daemon)} +\label{s:xend} + +The Xen Daemon (\Xend) performs system management functions related to +virtual machines. It forms a central point of control for a machine +and can be controlled using an HTTP-based protocol. \Xend\ must be +running in order to start and manage virtual machines. + +\Xend\ must be run as root because it needs access to privileged +system management functions. A small set of commands may be issued on +the \xend\ command line: + +\begin{tabular}{ll} + \verb!# xend start! & start \xend, if not already running \\ + \verb!# xend stop! & stop \xend\ if already running \\ + \verb!# xend restart! & restart \xend\ if running, otherwise start it \\ + % \verb!# xend trace_start! & start \xend, with very detailed debug logging \\ + \verb!# xend status! & indicates \xend\ status by its return code +\end{tabular} + +A SysV init script called {\tt xend} is provided to start \xend\ at +boot time. {\tt make install} installs this script in +\path{/etc/init.d}. To enable it, you have to make symbolic links in +the appropriate runlevel directories or use the {\tt chkconfig} tool, +where available. + +Once \xend\ is running, more sophisticated administration can be done +using the xm tool (see Section~\ref{s:xm}) and the experimental Xensv +web interface (see Section~\ref{s:xensv}). + +As \xend\ runs, events will be logged to \path{/var/log/xend.log} and, +if the migration assistant daemon (\path{xfrd}) has been started, +\path{/var/log/xfrd.log}. These may be of use for troubleshooting +problems. 
+ +\section{Xm (command line interface)} +\label{s:xm} + +The xm tool is the primary tool for managing Xen from the console. +The general format of an xm command line is: + +\begin{verbatim} +# xm command [switches] [arguments] [variables] +\end{verbatim} + +The available \emph{switches} and \emph{arguments} are dependent on +the \emph{command} chosen. The \emph{variables} may be set using +declarations of the form {\tt variable=value} and command line +declarations override any of the values in the configuration file +being used, including the standard variables described above and any +custom variables (for instance, the \path{xmdefconfig} file uses a +{\tt vmid} variable). + +The available commands are as follows: + +\begin{description} +\item[set-mem] Request a domain to adjust its memory footprint. +\item[create] Create a new domain. +\item[destroy] Kill a domain immediately. +\item[list] List running domains. +\item[shutdown] Ask a domain to shutdown. +\item[dmesg] Fetch the Xen (not Linux!) boot output. +\item[consoles] Lists the available consoles. +\item[console] Connect to the console for a domain. +\item[help] Get help on xm commands. +\item[save] Suspend a domain to disk. +\item[restore] Restore a domain from disk. +\item[pause] Pause a domain's execution. +\item[unpause] Un-pause a domain. +\item[pincpu] Pin a domain to a CPU. +\item[bvt] Set BVT scheduler parameters for a domain. +\item[bvt\_ctxallow] Set the BVT context switching allowance for the + system. +\item[atropos] Set the atropos parameters for a domain. +\item[rrobin] Set the round robin time slice for the system. +\item[info] Get information about the Xen host. +\item[call] Call a \xend\ HTTP API function directly. +\end{description} + +For a detailed overview of switches, arguments and variables to each +command try +\begin{quote} +\begin{verbatim} +# xm help command +\end{verbatim} +\end{quote} + +\section{Xensv (web control interface)} +\label{s:xensv} + +Xensv is the experimental web control interface for managing a Xen +machine. It can be used to perform some (but not yet all) of the +management tasks that can be done using the xm tool. + +It can be started using: +\begin{quote} + \verb_# xensv start_ +\end{quote} +and stopped using: +\begin{quote} + \verb_# xensv stop_ +\end{quote} + +By default, Xensv will serve out the web interface on port 8080. This +can be changed by editing +\path{/usr/lib/python2.3/site-packages/xen/sv/params.py}. + +Once Xensv is running, the web interface can be used to create and +manage running domains. diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/debian.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/debian.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,154 @@ +\chapter{Installing Xen / XenLinux on Debian} + +The Debian project provides a tool called \path{debootstrap} which +allows a base Debian system to be installed into a filesystem without +requiring the host system to have any Debian-specific software (such +as \path{apt}). + +Here's some info how to install Debian 3.1 (Sarge) for an unprivileged +Xen domain: + +\begin{enumerate} + +\item Set up Xen and test that it's working, as described earlier in + this manual. + +\item Create disk images for rootfs and swap. Alternatively, you might + create dedicated partitions, LVM logical volumes, etc.\ if that + suits your setup. 
+\begin{verbatim} +dd if=/dev/zero of=/path/diskimage bs=1024k count=size_in_mbytes +dd if=/dev/zero of=/path/swapimage bs=1024k count=size_in_mbytes +\end{verbatim} + + If you're going to use this filesystem / disk image only as a + `template' for other vm disk images, something like 300 MB should be + enough. (of course it depends what kind of packages you are planning + to install to the template) + +\item Create the filesystem and initialise the swap image +\begin{verbatim} +mkfs.ext3 /path/diskimage +mkswap /path/swapimage +\end{verbatim} + +\item Mount the disk image for installation +\begin{verbatim} +mount -o loop /path/diskimage /mnt/disk +\end{verbatim} + +\item Install \path{debootstrap}. Make sure you have debootstrap + installed on the host. If you are running Debian Sarge (3.1 / + testing) or unstable you can install it by running \path{apt-get + install debootstrap}. Otherwise, it can be downloaded from the + Debian project website. + +\item Install Debian base to the disk image: +\begin{verbatim} +debootstrap --arch i386 sarge /mnt/disk \ + http://ftp.<countrycode>.debian.org/debian +\end{verbatim} + + You can use any other Debian http/ftp mirror you want. + +\item When debootstrap completes successfully, modify settings: +\begin{verbatim} +chroot /mnt/disk /bin/bash +\end{verbatim} + +Edit the following files using vi or nano and make needed changes: +\begin{verbatim} +/etc/hostname +/etc/hosts +/etc/resolv.conf +/etc/network/interfaces +/etc/networks +\end{verbatim} + +Set up access to the services, edit: +\begin{verbatim} +/etc/hosts.deny +/etc/hosts.allow +/etc/inetd.conf +\end{verbatim} + +Add Debian mirror to: +\begin{verbatim} +/etc/apt/sources.list +\end{verbatim} + +Create fstab like this: +\begin{verbatim} +/dev/sda1 / ext3 errors=remount-ro 0 1 +/dev/sda2 none swap sw 0 0 +proc /proc proc defaults 0 0 +\end{verbatim} + +Logout + +\item Unmount the disk image +\begin{verbatim} +umount /mnt/disk +\end{verbatim} + +\item Create Xen 2.0 configuration file for the new domain. You can + use the example-configurations coming with Xen as a template. + + Make sure you have the following set up: +\begin{verbatim} +disk = [ 'file:/path/diskimage,sda1,w', 'file:/path/swapimage,sda2,w' ] +root = "/dev/sda1 ro" +\end{verbatim} + +\item Start the new domain +\begin{verbatim} +xm create -f domain_config_file +\end{verbatim} + +Check that the new domain is running: +\begin{verbatim} +xm list +\end{verbatim} + +\item Attach to the console of the new domain. You should see + something like this when starting the new domain: + +\begin{verbatim} +Started domain testdomain2, console on port 9626 +\end{verbatim} + + There you can see the ID of the console: 26. You can also list the + consoles with \path{xm consoles} (ID is the last two digits of the + port number.) + + Attach to the console: + +\begin{verbatim} +xm console 26 +\end{verbatim} + + or by telnetting to the port 9626 of localhost (the xm console + program works better). + +\item Log in and run base-config + + As a default there's no password for the root. + + Check that everything looks OK, and the system started without + errors. Check that the swap is active, and the network settings are + correct. + + Run \path{/usr/sbin/base-config} to set up the Debian settings. + + Set up the password for root using passwd. + +\item Done. 
You can exit the console by pressing {\path{Ctrl + ]}} + +\end{enumerate} + + +If you need to create new domains, you can just copy the contents of +the `template'-image to the new disk images, either by mounting the +template and the new image, and using \path{cp -a} or \path{tar} or by +simply copying the image file. Once this is done, modify the +image-specific settings (hostname, network settings, etc). diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/domain_configuration.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/domain_configuration.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,281 @@ +\chapter{Domain Configuration} +\label{cha:config} + +The following contains the syntax of the domain configuration files +and description of how to further specify networking, driver domain +and general scheduling behavior. + + +\section{Configuration Files} +\label{s:cfiles} + +Xen configuration files contain the following standard variables. +Unless otherwise stated, configuration items should be enclosed in +quotes: see \path{/etc/xen/xmexample1} and \path{/etc/xen/xmexample2} +for concrete examples of the syntax. + +\begin{description} +\item[kernel] Path to the kernel image. +\item[ramdisk] Path to a ramdisk image (optional). + % \item[builder] The name of the domain build function (e.g. + % {\tt'linux'} or {\tt'netbsd'}. +\item[memory] Memory size in megabytes. +\item[cpu] CPU to run this domain on, or {\tt -1} for auto-allocation. +\item[console] Port to export the domain console on (default 9600 + + domain ID). +\item[nics] Number of virtual network interfaces. +\item[vif] List of MAC addresses (random addresses are assigned if not + given) and bridges to use for the domain's network interfaces, e.g.\ +\begin{verbatim} +vif = [ 'mac=aa:00:00:00:00:11, bridge=xen-br0', + 'bridge=xen-br1' ] +\end{verbatim} + to assign a MAC address and bridge to the first interface and assign + a different bridge to the second interface, leaving \xend\ to choose + the MAC address. +\item[disk] List of block devices to export to the domain, e.g.\ \\ + \verb_disk = [ 'phy:hda1,sda1,r' ]_ \\ + exports physical device \path{/dev/hda1} to the domain as + \path{/dev/sda1} with read-only access. Exporting a disk read-write + which is currently mounted is dangerous -- if you are \emph{certain} + you wish to do this, you can specify \path{w!} as the mode. +\item[dhcp] Set to {\tt `dhcp'} if you want to use DHCP to configure + networking. +\item[netmask] Manually configured IP netmask. +\item[gateway] Manually configured IP gateway. +\item[hostname] Set the hostname for the virtual machine. +\item[root] Specify the root device parameter on the kernel command + line. +\item[nfs\_server] IP address for the NFS server (if any). +\item[nfs\_root] Path of the root filesystem on the NFS server (if + any). +\item[extra] Extra string to append to the kernel command line (if + any) +\item[restart] Three possible options: + \begin{description} + \item[always] Always restart the domain, no matter what its exit + code is. + \item[never] Never restart the domain. + \item[onreboot] Restart the domain iff it requests reboot. + \end{description} +\end{description} + +For additional flexibility, it is also possible to include Python +scripting commands in configuration files. An example of this is the +\path{xmexample2} file, which uses Python code to handle the +\path{vmid} variable. 
+
+
+%\part{Advanced Topics}
+
+
+\section{Network Configuration}
+
+For many users, the default installation should work ``out of the
+box''. More complicated network setups, for instance with multiple
+Ethernet interfaces and/or existing bridging setups, will require some
+special configuration.
+
+The purpose of this section is to describe the mechanisms provided by
+\xend\ to allow a flexible configuration for Xen's virtual networking.
+
+\subsection{Xen virtual network topology}
+
+Each domain network interface is connected to a virtual network
+interface in dom0 by a point-to-point link (effectively a ``virtual
+crossover cable''). These devices are named {\tt
+ vif$<$domid$>$.$<$vifid$>$} (e.g.\ {\tt vif1.0} for the first
+interface in domain~1, {\tt vif3.1} for the second interface in
+domain~3).
+
+Traffic on these virtual interfaces is handled in domain~0 using
+standard Linux mechanisms for bridging, routing, rate limiting, etc.
+Xend calls on two shell scripts to perform initial configuration of
+the network and configuration of new virtual interfaces. By default,
+these scripts configure a single bridge for all the virtual
+interfaces. Arbitrary routing / bridging configurations can be
+configured by customizing the scripts, as described in the following
+section.
+
+\subsection{Xen networking scripts}
+
+Xen's virtual networking is configured by two shell scripts (by
+default \path{network} and \path{vif-bridge}). These are called
+automatically by \xend\ when certain events occur, with arguments to
+the scripts providing further contextual information. These scripts
+are found by default in \path{/etc/xen/scripts}. The names and
+locations of the scripts can be configured in
+\path{/etc/xen/xend-config.sxp}.
+
+\begin{description}
+\item[network:] This script is called whenever \xend\ is started or
+ stopped to respectively initialize or tear down the Xen virtual
+ network. In the default configuration initialization creates the
+ bridge `xen-br0' and moves eth0 onto that bridge, modifying the
+ routing accordingly. When \xend\ exits, it deletes the Xen bridge
+ and removes eth0, restoring the normal IP and routing configuration.
+
+ %% In configurations where the bridge already exists, this script
+ %% could be replaced with a link to \path{/bin/true} (for instance).
+
+\item[vif-bridge:] This script is called for every domain virtual
+ interface and can configure firewalling rules and add the vif to the
+ appropriate bridge. By default, this adds and removes VIFs on the
+ default Xen bridge.
+\end{description}
+
+For more complex network setups (e.g.\ where routing is required or
+integration with existing bridges is needed) these scripts may be
+replaced with customized variants for your site's preferred
+configuration.
+
+%% There are two possible types of privileges: IO privileges and
+%% administration privileges.
+
+
+\section{Driver Domain Configuration}
+
+I/O privileges can be assigned to allow a domain to directly access
+PCI devices itself. This is used to support driver domains.
+
+Setting back-end privileges is currently only supported in SXP format
+config files. To allow a domain to function as a back-end for others,
+somewhere within the {\tt vm} element of its configuration file must
+be a {\tt back-end} element of the form {\tt (back-end ({\em type}))}
+where {\tt \em type} may be either {\tt netif} or {\tt blkif},
+according to the type of virtual device this domain will service.
+%% After this domain has been built, \xend will connect all new and +%% existing {\em virtual} devices (of the appropriate type) to that +%% back-end. + +Note that a block back-end cannot currently import virtual block +devices from other domains, and a network back-end cannot import +virtual network devices from other domains. Thus (particularly in the +case of block back-ends, which cannot import a virtual block device as +their root filesystem), you may need to boot a back-end domain from a +ramdisk or a network device. + +Access to PCI devices may be configured on a per-device basis. Xen +will assign the minimal set of hardware privileges to a domain that +are required to control its devices. This can be configured in either +format of configuration file: + +\begin{itemize} +\item SXP Format: Include device elements of the form: \\ + \centerline{ {\tt (device (pci (bus {\em x}) (dev {\em y}) (func {\em z})))}} \\ + inside the top-level {\tt vm} element. Each one specifies the + address of a device this domain is allowed to access --- the numbers + \emph{x},\emph{y} and \emph{z} may be in either decimal or + hexadecimal format. +\item Flat Format: Include a list of PCI device addresses of the + format: \\ + \centerline{{\tt pci = ['x,y,z', \ldots]}} \\ + where each element in the list is a string specifying the components + of the PCI device address, separated by commas. The components + ({\tt \em x}, {\tt \em y} and {\tt \em z}) of the list may be + formatted as either decimal or hexadecimal. +\end{itemize} + +%% \section{Administration Domains} + +%% Administration privileges allow a domain to use the `dom0 +%% operations' (so called because they are usually available only to +%% domain 0). A privileged domain can build other domains, set +%% scheduling parameters, etc. + +% Support for other administrative domains is not yet available... +% perhaps we should plumb it in some time + + +\section{Scheduler Configuration} +\label{s:sched} + +Xen offers a boot time choice between multiple schedulers. To select +a scheduler, pass the boot parameter \emph{sched=sched\_name} to Xen, +substituting the appropriate scheduler name. Details of the +schedulers and their parameters are included below; future versions of +the tools will provide a higher-level interface to these tools. + +It is expected that system administrators configure their system to +use the scheduler most appropriate to their needs. Currently, the BVT +scheduler is the recommended choice. + +\subsection{Borrowed Virtual Time} + +{\tt sched=bvt} (the default) \\ + +BVT provides proportional fair shares of the CPU time. It has been +observed to penalize domains that block frequently (e.g.\ I/O +intensive domains), but this can be compensated for by using warping. + +\subsubsection{Global Parameters} + +\begin{description} +\item[ctx\_allow] The context switch allowance is similar to the + ``quantum'' in traditional schedulers. It is the minimum time that + a scheduled domain will be allowed to run before being preempted. +\end{description} + +\subsubsection{Per-domain parameters} + +\begin{description} +\item[mcuadv] The MCU (Minimum Charging Unit) advance determines the + proportional share of the CPU that a domain receives. It is set + inversely proportionally to a domain's sharing weight. +\item[warp] The amount of ``virtual time'' the domain is allowed to + warp backwards. +\item[warpl] The warp limit is the maximum time a domain can run + warped for. 
+\item[warpu] The unwarp requirement is the minimum time a domain must + run unwarped for before it can warp again. +\end{description} + +\subsection{Atropos} + +{\tt sched=atropos} \\ + +Atropos is a soft real time scheduler. It provides guarantees about +absolute shares of the CPU, with a facility for sharing slack CPU time +on a best-effort basis. It can provide timeliness guarantees for +latency-sensitive domains. + +Every domain has an associated period and slice. The domain should +receive `slice' nanoseconds every `period' nanoseconds. This allows +the administrator to configure both the absolute share of the CPU a +domain receives and the frequency with which it is scheduled. + +%% When domains unblock, their period is reduced to the value of the +%% latency hint (the slice is scaled accordingly so that they still +%% get the same proportion of the CPU). For each subsequent period, +%% the slice and period times are doubled until they reach their +%% original values. + +Note: don't over-commit the CPU when using Atropos (i.e.\ don't reserve +more CPU than is available --- the utilization should be kept to +slightly less than 100\% in order to ensure predictable behavior). + +\subsubsection{Per-domain parameters} + +\begin{description} +\item[period] The regular time interval during which a domain is + guaranteed to receive its allocation of CPU time. +\item[slice] The length of time per period that a domain is guaranteed + to run for (in the absence of voluntary yielding of the CPU). +\item[latency] The latency hint is used to control how soon after + waking up a domain it should be scheduled. +\item[xtratime] This is a boolean flag that specifies whether a domain + should be allowed a share of the system slack time. +\end{description} + +\subsection{Round Robin} + +{\tt sched=rrobin} \\ + +The round robin scheduler is included as a simple demonstration of +Xen's internal scheduler API. It is not intended for production use. + +\subsubsection{Global Parameters} + +\begin{description} +\item[rr\_slice] The maximum time each domain runs before the next + scheduling decision is made. +\end{description} diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/domain_filesystem.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/domain_filesystem.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,243 @@ +\chapter{Domain Filesystem Storage} + +It is possible to directly export any Linux block device in dom0 to +another domain, or to export filesystems / devices to virtual machines +using standard network protocols (e.g.\ NBD, iSCSI, NFS, etc.). This +chapter covers some of the possibilities. + + +\section{Exporting Physical Devices as VBDs} +\label{s:exporting-physical-devices-as-vbds} + +One of the simplest configurations is to directly export individual +partitions from domain~0 to other domains. To achieve this use the +\path{phy:} specifier in your domain configuration file. For example a +line like +\begin{quote} + \verb_disk = ['phy:hda3,sda1,w']_ +\end{quote} +specifies that the partition \path{/dev/hda3} in domain~0 should be +exported read-write to the new domain as \path{/dev/sda1}; one could +equally well export it as \path{/dev/hda} or \path{/dev/sdb5} should +one wish. + +In addition to local disks and partitions, it is possible to export +any device that Linux considers to be ``a disk'' in the same manner. +For example, if you have iSCSI disks or GNBD volumes imported into +domain~0 you can export these to other domains using the \path{phy:} +disk syntax. 
E.g.: +\begin{quote} + \verb_disk = ['phy:vg/lvm1,sda2,w']_ +\end{quote} + +\begin{center} + \framebox{\bf Warning: Block device sharing} +\end{center} +\begin{quote} + Block devices should typically only be shared between domains in a + read-only fashion otherwise the Linux kernel's file systems will get + very confused as the file system structure may change underneath + them (having the same ext3 partition mounted \path{rw} twice is a + sure fire way to cause irreparable damage)! \Xend\ will attempt to + prevent you from doing this by checking that the device is not + mounted read-write in domain~0, and hasn't already been exported + read-write to another domain. If you want read-write sharing, + export the directory to other domains via NFS from domain~0 (or use + a cluster file system such as GFS or ocfs2). +\end{quote} + + +\section{Using File-backed VBDs} + +It is also possible to use a file in Domain~0 as the primary storage +for a virtual machine. As well as being convenient, this also has the +advantage that the virtual block device will be \emph{sparse} --- +space will only really be allocated as parts of the file are used. So +if a virtual machine uses only half of its disk space then the file +really takes up half of the size allocated. + +For example, to create a 2GB sparse file-backed virtual block device +(actually only consumes 1KB of disk): +\begin{quote} + \verb_# dd if=/dev/zero of=vm1disk bs=1k seek=2048k count=1_ +\end{quote} + +Make a file system in the disk file: +\begin{quote} + \verb_# mkfs -t ext3 vm1disk_ +\end{quote} + +(when the tool asks for confirmation, answer `y') + +Populate the file system e.g.\ by copying from the current root: +\begin{quote} +\begin{verbatim} +# mount -o loop vm1disk /mnt +# cp -ax /{root,dev,var,etc,usr,bin,sbin,lib} /mnt +# mkdir /mnt/{proc,sys,home,tmp} +\end{verbatim} +\end{quote} + +Tailor the file system by editing \path{/etc/fstab}, +\path{/etc/hostname}, etc.\ Don't forget to edit the files in the +mounted file system, instead of your domain~0 filesystem, e.g.\ you +would edit \path{/mnt/etc/fstab} instead of \path{/etc/fstab}. For +this example put \path{/dev/sda1} to root in fstab. + +Now unmount (this is important!): +\begin{quote} + \verb_# umount /mnt_ +\end{quote} + +In the configuration file set: +\begin{quote} + \verb_disk = ['file:/full/path/to/vm1disk,sda1,w']_ +\end{quote} + +As the virtual machine writes to its `disk', the sparse file will be +filled in and consume more space up to the original 2GB. + +{\bf Note that file-backed VBDs may not be appropriate for backing + I/O-intensive domains.} File-backed VBDs are known to experience +substantial slowdowns under heavy I/O workloads, due to the I/O +handling by the loopback block device used to support file-backed VBDs +in dom0. Better I/O performance can be achieved by using either +LVM-backed VBDs (Section~\ref{s:using-lvm-backed-vbds}) or physical +devices as VBDs (Section~\ref{s:exporting-physical-devices-as-vbds}). + +Linux supports a maximum of eight file-backed VBDs across all domains +by default. This limit can be statically increased by using the +\emph{max\_loop} module parameter if CONFIG\_BLK\_DEV\_LOOP is +compiled as a module in the dom0 kernel, or by using the +\emph{max\_loop=n} boot option if CONFIG\_BLK\_DEV\_LOOP is compiled +directly into the dom0 kernel. 
+
+
+\section{Using LVM-backed VBDs}
+\label{s:using-lvm-backed-vbds}
+
+A particularly appealing solution is to use LVM volumes as backing for
+domain file-systems since this allows dynamic growing/shrinking of
+volumes as well as snapshot and other features.
+
+To initialize a partition to support LVM volumes:
+\begin{quote}
+\begin{verbatim}
+# pvcreate /dev/sda10
+\end{verbatim}
+\end{quote}
+
+Create a volume group named `vg' on the physical partition:
+\begin{quote}
+\begin{verbatim}
+# vgcreate vg /dev/sda10
+\end{verbatim}
+\end{quote}
+
+Create a logical volume of size 4GB named `myvmdisk1':
+\begin{quote}
+\begin{verbatim}
+# lvcreate -L4096M -n myvmdisk1 vg
+\end{verbatim}
+\end{quote}
+
+You should now see that you have a \path{/dev/vg/myvmdisk1} device. Make a
+filesystem, mount it and populate it, e.g.:
+\begin{quote}
+\begin{verbatim}
+# mkfs -t ext3 /dev/vg/myvmdisk1
+# mount /dev/vg/myvmdisk1 /mnt
+# cp -ax / /mnt
+# umount /mnt
+\end{verbatim}
+\end{quote}
+
+Now configure your VM with the following disk configuration:
+\begin{quote}
+\begin{verbatim}
+ disk = [ 'phy:vg/myvmdisk1,sda1,w' ]
+\end{verbatim}
+\end{quote}
+
+LVM enables you to grow the size of logical volumes, but you'll need
+to resize the corresponding file system to make use of the new space.
+Some file systems (e.g.\ ext3) now support online resize. See the LVM
+manuals for more details.
+
+You can also use LVM for creating copy-on-write (CoW) clones of LVM
+volumes (known as writable persistent snapshots in LVM terminology).
+This facility is new in Linux 2.6.8, so isn't as stable as one might
+hope. In particular, using lots of CoW LVM disks consumes a lot of
+dom0 memory, and error conditions such as running out of disk space
+are not handled well. Hopefully this will improve in future.
+
+To create two copy-on-write clones of the above file system you would
+use the following commands:
+
+\begin{quote}
+\begin{verbatim}
+# lvcreate -s -L1024M -n myclonedisk1 /dev/vg/myvmdisk1
+# lvcreate -s -L1024M -n myclonedisk2 /dev/vg/myvmdisk1
+\end{verbatim}
+\end{quote}
+
+Each of these can grow to have 1GB of differences from the master
+volume. You can grow the amount of space for storing the differences
+using the lvextend command, e.g.:
+\begin{quote}
+\begin{verbatim}
+# lvextend +100M /dev/vg/myclonedisk1
+\end{verbatim}
+\end{quote}
+
+Don't let the `differences volume' ever fill up, otherwise LVM gets
+rather confused. It may be possible to automate the growing process by
+using \path{dmsetup wait} to spot the volume getting full and then
+issue an \path{lvextend}.
+
+In principle, it is possible to continue writing to the volume that
+has been cloned (the changes will not be visible to the clones), but
+we wouldn't recommend this: have the cloned volume as a `pristine'
+file system install that isn't mounted directly by any of the virtual
+machines.
+
+
+\section{Using NFS Root}
+
+First, populate a root filesystem in a directory on the server
+machine. This can be on a distinct physical machine, or simply run
+within a virtual machine on the same node.
+
+Now configure the NFS server to export this filesystem over the
+network by adding a line to \path{/etc/exports}, for instance:
+
+\begin{quote}
+ \begin{small}
+\begin{verbatim}
+/export/vm1root 1.2.3.4/24 (rw,sync,no_root_squash)
+\end{verbatim}
+ \end{small}
+\end{quote}
+
+Finally, configure the domain to use NFS root.
In addition to the
+normal variables, you should make sure to set the following values in
+the domain's configuration file:
+
+\begin{quote}
+ \begin{small}
+\begin{verbatim}
+root = '/dev/nfs'
+nfs_server = '2.3.4.5' # substitute IP address of server
+nfs_root = '/path/to/root' # path to root FS on the server
+\end{verbatim}
+ \end{small}
+\end{quote}
+
+The domain will need network access at boot time, so either statically
+configure an IP address using the config variables \path{ip},
+\path{netmask}, \path{gateway}, \path{hostname}; or enable DHCP
+(\path{dhcp='dhcp'}).
+
+Note that the Linux NFS root implementation is known to have stability
+problems under high load (this is not a Xen-specific problem), so this
+configuration may not be appropriate for critical servers.
diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/domain_mgmt.tex
--- /dev/null Tue Sep 20 09:08:26 2005
+++ b/docs/src/user/domain_mgmt.tex Tue Sep 20 09:17:33 2005
@@ -0,0 +1,203 @@
+\chapter{Domain Management Tools}
+
+The previous chapter described a simple example of how to configure
+and start a domain. This chapter summarises the tools available to
+manage running domains.
+
+
+\section{Command-line Management}
+
+Command line management tasks are also performed using the \path{xm}
+tool. For online help for the commands available, type:
+\begin{quote}
+ \verb_# xm help_
+\end{quote}
+
+You can also type \path{xm help $<$command$>$} for more information on
+a given command.
+
+\subsection{Basic Management Commands}
+
+The most important \path{xm} commands are:
+\begin{quote}
+ \verb_# xm list_: Lists all domains running.\\
+ \verb_# xm consoles_: Gives information about the domain consoles.\\
+ \verb_# xm console_: Opens a console to a domain (e.g.\
+ \verb_# xm console myVM_)
+\end{quote}
+
+\subsection{\tt xm list}
+
+The output of \path{xm list} is in rows of the following format:
+\begin{center} {\tt name domid memory cpu state cputime console}
+\end{center}
+
+\begin{quote}
+ \begin{description}
+ \item[name] The descriptive name of the virtual machine.
+ \item[domid] The number of the domain ID this virtual machine is
+ running in.
+ \item[memory] Memory size in megabytes.
+ \item[cpu] The CPU this domain is running on.
+ \item[state] Domain state consists of 5 fields:
+ \begin{description}
+ \item[r] running
+ \item[b] blocked
+ \item[p] paused
+ \item[s] shutdown
+ \item[c] crashed
+ \end{description}
+ \item[cputime] How much CPU time (in seconds) the domain has used so
+ far.
+ \item[console] TCP port accepting connections to the domain's
+ console.
+ \end{description}
+\end{quote}
+
+The \path{xm list} command also supports a long output format when the
+\path{-l} switch is used. This outputs the full details of the
+running domains in \xend's SXP configuration format.
+
+For example, suppose the system is running the ttylinux domain as
+described earlier. The list command should produce output somewhat
+like the following:
+\begin{verbatim}
+# xm list
+Name Id Mem(MB) CPU State Time(s) Console
+Domain-0 0 251 0 r---- 172.2
+ttylinux 5 63 0 -b--- 3.0 9605
+\end{verbatim}
+
+Here we can see the details for the ttylinux domain, as well as for
+domain~0 (which, of course, is always running). Note that the console
+port for the ttylinux domain is 9605. This can be connected to over TCP
+using a terminal program (e.g. \path{telnet} or, better,
+\path{xencons}). The simplest way to connect is to use the
+\path{xm~console} command, specifying the domain name or ID.
To +connect to the console of the ttylinux domain, we could use any of the +following: +\begin{verbatim} +# xm console ttylinux +# xm console 5 +# xencons localhost 9605 +\end{verbatim} + +\section{Domain Save and Restore} + +The administrator of a Xen system may suspend a virtual machine's +current state into a disk file in domain~0, allowing it to be resumed +at a later time. + +The ttylinux domain described earlier can be suspended to disk using +the command: +\begin{verbatim} +# xm save ttylinux ttylinux.xen +\end{verbatim} + +This will stop the domain named `ttylinux' and save its current state +into a file called \path{ttylinux.xen}. + +To resume execution of this domain, use the \path{xm restore} command: +\begin{verbatim} +# xm restore ttylinux.xen +\end{verbatim} + +This will restore the state of the domain and restart it. The domain +will carry on as before and the console may be reconnected using the +\path{xm console} command, as above. + +\section{Live Migration} + +Live migration is used to transfer a domain between physical hosts +whilst that domain continues to perform its usual activities --- from +the user's perspective, the migration should be imperceptible. + +To perform a live migration, both hosts must be running Xen / \xend\ +and the destination host must have sufficient resources (e.g.\ memory +capacity) to accommodate the domain after the move. Furthermore we +currently require both source and destination machines to be on the +same L2 subnet. + +Currently, there is no support for providing automatic remote access +to filesystems stored on local disk when a domain is migrated. +Administrators should choose an appropriate storage solution (i.e.\ +SAN, NAS, etc.) to ensure that domain filesystems are also available +on their destination node. GNBD is a good method for exporting a +volume from one machine to another. iSCSI can do a similar job, but is +more complex to set up. + +When a domain migrates, it's MAC and IP address move with it, thus it +is only possible to migrate VMs within the same layer-2 network and IP +subnet. If the destination node is on a different subnet, the +administrator would need to manually configure a suitable etherip or +IP tunnel in the domain~0 of the remote node. + +A domain may be migrated using the \path{xm migrate} command. To live +migrate a domain to another machine, we would use the command: + +\begin{verbatim} +# xm migrate --live mydomain destination.ournetwork.com +\end{verbatim} + +Without the \path{--live} flag, \xend\ simply stops the domain and +copies the memory image over to the new node and restarts it. Since +domains can have large allocations this can be quite time consuming, +even on a Gigabit network. With the \path{--live} flag \xend\ attempts +to keep the domain running while the migration is in progress, +resulting in typical `downtimes' of just 60--300ms. + +For now it will be necessary to reconnect to the domain's console on +the new machine using the \path{xm console} command. If a migrated +domain has any open network connections then they will be preserved, +so SSH connections do not have this limitation. + + +\section{Managing Domain Memory} + +XenLinux domains have the ability to relinquish / reclaim machine +memory at the request of the administrator or the user of the domain. + +\subsection{Setting memory footprints from dom0} + +The machine administrator can request that a domain alter its memory +footprint using the \path{xm set-mem} command. 
For instance, we can +request that our example ttylinux domain reduce its memory footprint +to 32 megabytes. + +\begin{verbatim} +# xm set-mem ttylinux 32 +\end{verbatim} + +We can now see the result of this in the output of \path{xm list}: + +\begin{verbatim} +# xm list +Name Id Mem(MB) CPU State Time(s) Console +Domain-0 0 251 0 r---- 172.2 +ttylinux 5 31 0 -b--- 4.3 9605 +\end{verbatim} + +The domain has responded to the request by returning memory to Xen. We +can restore the domain to its original size using the command line: + +\begin{verbatim} +# xm set-mem ttylinux 64 +\end{verbatim} + +\subsection{Setting memory footprints from within a domain} + +The virtual file \path{/proc/xen/balloon} allows the owner of a domain +to adjust their own memory footprint. Reading the file (e.g.\ +\path{cat /proc/xen/balloon}) prints out the current memory footprint +of the domain. Writing the file (e.g.\ \path{echo new\_target > + /proc/xen/balloon}) requests that the kernel adjust the domain's +memory footprint to a new value. + +\subsection{Setting memory limits} + +Xen associates a memory size limit with each domain. By default, this +is the amount of memory the domain is originally started with, +preventing the domain from ever growing beyond this size. To permit a +domain to grow beyond its original allocation or to prevent a domain +you've shrunk from reclaiming the memory it relinquished, use the +\path{xm maxmem} command. diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/glossary.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/glossary.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,79 @@ +\chapter{Glossary of Terms} + +\begin{description} + +\item[Atropos] One of the CPU schedulers provided by Xen. Atropos + provides domains with absolute shares of the CPU, with timeliness + guarantees and a mechanism for sharing out `slack time'. + +\item[BVT] The BVT scheduler is used to give proportional fair shares + of the CPU to domains. + +\item[Exokernel] A minimal piece of privileged code, similar to a {\bf + microkernel} but providing a more `hardware-like' interface to the + tasks it manages. This is similar to a paravirtualising VMM like + {\bf Xen} but was designed as a new operating system structure, + rather than specifically to run multiple conventional OSs. + +\item[Domain] A domain is the execution context that contains a + running {\bf virtual machine}. The relationship between virtual + machines and domains on Xen is similar to that between programs and + processes in an operating system: a virtual machine is a persistent + entity that resides on disk (somewhat like a program). When it is + loaded for execution, it runs in a domain. Each domain has a {\bf + domain ID}. + +\item[Domain 0] The first domain to be started on a Xen machine. + Domain 0 is responsible for managing the system. + +\item[Domain ID] A unique identifier for a {\bf domain}, analogous to + a process ID in an operating system. + +\item[Full virtualisation] An approach to virtualisation which + requires no modifications to the hosted operating system, providing + the illusion of a complete system of real hardware devices. + +\item[Hypervisor] An alternative term for {\bf VMM}, used because it + means `beyond supervisor', since it is responsible for managing + multiple `supervisor' kernels. + +\item[Live migration] A technique for moving a running virtual machine + to another physical host, without stopping it or the services + running on it. 
+ +\item[Microkernel] A small base of code running at the highest + hardware privilege level. A microkernel is responsible for sharing + CPU and memory (and sometimes other devices) between less privileged + tasks running on the system. This is similar to a VMM, particularly + a {\bf paravirtualising} VMM but typically addressing a different + problem space and providing different kind of interface. + +\item[NetBSD/Xen] A port of NetBSD to the Xen architecture. + +\item[Paravirtualisation] An approach to virtualisation which requires + modifications to the operating system in order to run in a virtual + machine. Xen uses paravirtualisation but preserves binary + compatibility for user space applications. + +\item[Shadow pagetables] A technique for hiding the layout of machine + memory from a virtual machine's operating system. Used in some {\bf + VMMs} to provide the illusion of contiguous physical memory, in + Xen this is used during {\bf live migration}. + +\item[Virtual Machine] The environment in which a hosted operating + system runs, providing the abstraction of a dedicated machine. A + virtual machine may be identical to the underlying hardware (as in + {\bf full virtualisation}, or it may differ, as in {\bf + paravirtualisation}). + +\item[VMM] Virtual Machine Monitor - the software that allows multiple + virtual machines to be multiplexed on a single physical machine. + +\item[Xen] Xen is a paravirtualising virtual machine monitor, + developed primarily by the Systems Research Group at the University + of Cambridge Computer Laboratory. + +\item[XenLinux] Official name for the port of the Linux kernel that + runs on Xen. + +\end{description} diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/installation.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/installation.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,394 @@ +\chapter{Installation} + +The Xen distribution includes three main components: Xen itself, ports +of Linux 2.4 and 2.6 and NetBSD to run on Xen, and the userspace +tools required to manage a Xen-based system. This chapter describes +how to install the Xen~2.0 distribution from source. Alternatively, +there may be pre-built packages available as part of your operating +system distribution. + + +\section{Prerequisites} +\label{sec:prerequisites} + +The following is a full list of prerequisites. Items marked `$\dag$' +are required by the \xend\ control tools, and hence required if you +want to run more than one virtual machine; items marked `$*$' are only +required if you wish to build from source. +\begin{itemize} +\item A working Linux distribution using the GRUB bootloader and + running on a P6-class (or newer) CPU. +\item [$\dag$] The \path{iproute2} package. +\item [$\dag$] The Linux bridge-utils\footnote{Available from {\tt + http://bridge.sourceforge.net}} (e.g., \path{/sbin/brctl}) +\item [$\dag$] An installation of Twisted~v1.3 or + above\footnote{Available from {\tt http://www.twistedmatrix.com}}. + There may be a binary package available for your distribution; + alternatively it can be installed by running `{\sl make + install-twisted}' in the root of the Xen source tree. +\item [$*$] Build tools (gcc v3.2.x or v3.3.x, binutils, GNU make). +\item [$*$] Development installation of libcurl (e.g., libcurl-devel) +\item [$*$] Development installation of zlib (e.g., zlib-dev). +\item [$*$] Development installation of Python v2.2 or later (e.g., + python-dev). +\item [$*$] \LaTeX\ and transfig are required to build the + documentation. 
+\end{itemize} + +Once you have satisfied the relevant prerequisites, you can now +install either a binary or source distribution of Xen. + + +\section{Installing from Binary Tarball} + +Pre-built tarballs are available for download from the Xen download +page +\begin{quote} {\tt http://xen.sf.net} +\end{quote} + +Once you've downloaded the tarball, simply unpack and install: +\begin{verbatim} +# tar zxvf xen-2.0-install.tgz +# cd xen-2.0-install +# sh ./install.sh +\end{verbatim} + +Once you've installed the binaries you need to configure your system +as described in Section~\ref{s:configure}. + + +\section{Installing from Source} + +This section describes how to obtain, build, and install Xen from +source. + +\subsection{Obtaining the Source} + +The Xen source tree is available as either a compressed source tar +ball or as a clone of our master BitKeeper repository. + +\begin{description} +\item[Obtaining the Source Tarball]\mbox{} \\ + Stable versions (and daily snapshots) of the Xen source tree are + available as compressed tarballs from the Xen download page + \begin{quote} {\tt http://xen.sf.net} + \end{quote} + +\item[Using BitKeeper]\mbox{} \\ + If you wish to install Xen from a clone of our latest BitKeeper + repository then you will need to install the BitKeeper tools. + Download instructions for BitKeeper can be obtained by filling out + the form at: + \begin{quote} {\tt http://www.bitmover.com/cgi-bin/download.cgi} +\end{quote} +The public master BK repository for the 2.0 release lives at: +\begin{quote} {\tt bk://xen.bkbits.net/xen-2.0.bk} +\end{quote} +You can use BitKeeper to download it and keep it updated with the +latest features and fixes. + +Change to the directory in which you want to put the source code, then +run: +\begin{verbatim} +# bk clone bk://xen.bkbits.net/xen-2.0.bk +\end{verbatim} + +Under your current directory, a new directory named \path{xen-2.0.bk} +has been created, which contains all the source code for Xen, the OS +ports, and the control tools. You can update your repository with the +latest changes at any time by running: +\begin{verbatim} +# cd xen-2.0.bk # to change into the local repository +# bk pull # to update the repository +\end{verbatim} +\end{description} + +% \section{The distribution} +% +% The Xen source code repository is structured as follows: +% +% \begin{description} +% \item[\path{tools/}] Xen node controller daemon (Xend), command line +% tools, control libraries +% \item[\path{xen/}] The Xen VMM. +% \item[\path{linux-*-xen-sparse/}] Xen support for Linux. +% \item[\path{linux-*-patches/}] Experimental patches for Linux. +% \item[\path{netbsd-*-xen-sparse/}] Xen support for NetBSD. +% \item[\path{docs/}] Various documentation files for users and +% developers. +% \item[\path{extras/}] Bonus extras. +% \end{description} + +\subsection{Building from Source} + +The top-level Xen Makefile includes a target `world' that will do the +following: + +\begin{itemize} +\item Build Xen. +\item Build the control tools, including \xend. +\item Download (if necessary) and unpack the Linux 2.6 source code, + and patch it for use with Xen. +\item Build a Linux kernel to use in domain 0 and a smaller + unprivileged kernel, which can optionally be used for unprivileged + virtual machines. 
+\end{itemize} + +After the build has completed you should have a top-level directory +called \path{dist/} in which all resulting targets will be placed; of +particular interest are the two kernels XenLinux kernel images, one +with a `-xen0' extension which contains hardware device drivers and +drivers for Xen's virtual devices, and one with a `-xenU' extension +that just contains the virtual ones. These are found in +\path{dist/install/boot/} along with the image for Xen itself and the +configuration files used during the build. + +The NetBSD port can be built using: +\begin{quote} +\begin{verbatim} +# make netbsd20 +\end{verbatim} +\end{quote} +NetBSD port is built using a snapshot of the netbsd-2-0 cvs branch. +The snapshot is downloaded as part of the build process, if it is not +yet present in the \path{NETBSD\_SRC\_PATH} search path. The build +process also downloads a toolchain which includes all the tools +necessary to build the NetBSD kernel under Linux. + +To customize further the set of kernels built you need to edit the +top-level Makefile. Look for the line: + +\begin{quote} +\begin{verbatim} +KERNELS ?= mk.linux-2.6-xen0 mk.linux-2.6-xenU +\end{verbatim} +\end{quote} + +You can edit this line to include any set of operating system kernels +which have configurations in the top-level \path{buildconfigs/} +directory, for example \path{mk.linux-2.4-xenU} to build a Linux 2.4 +kernel containing only virtual device drivers. + +%% Inspect the Makefile if you want to see what goes on during a +%% build. Building Xen and the tools is straightforward, but XenLinux +%% is more complicated. The makefile needs a `pristine' Linux kernel +%% tree to which it will then add the Xen architecture files. You can +%% tell the makefile the location of the appropriate Linux compressed +%% tar file by +%% setting the LINUX\_SRC environment variable, e.g. \\ +%% \verb!# LINUX_SRC=/tmp/linux-2.6.11.tar.bz2 make world! \\ or by +%% placing the tar file somewhere in the search path of {\tt +%% LINUX\_SRC\_PATH} which defaults to `{\tt .:..}'. If the +%% makefile can't find a suitable kernel tar file it attempts to +%% download it from kernel.org (this won't work if you're behind a +%% firewall). + +%% After untaring the pristine kernel tree, the makefile uses the {\tt +%% mkbuildtree} script to add the Xen patches to the kernel. + + +%% The procedure is similar to build the Linux 2.4 port: \\ +%% \verb!# LINUX_SRC=/path/to/linux2.4/source make linux24! + + +%% \framebox{\parbox{5in}{ +%% {\bf Distro specific:} \\ +%% {\it Gentoo} --- if not using udev (most installations, +%% currently), you'll need to enable devfs and devfs mount at boot +%% time in the xen0 config. }} + +\subsection{Custom XenLinux Builds} + +% If you have an SMP machine you may wish to give the {\tt '-j4'} +% argument to make to get a parallel build. + +If you wish to build a customized XenLinux kernel (e.g. to support +additional devices or enable distribution-required features), you can +use the standard Linux configuration mechanisms, specifying that the +architecture being built for is \path{xen}, e.g: +\begin{quote} +\begin{verbatim} +# cd linux-2.6.11-xen0 +# make ARCH=xen xconfig +# cd .. 
+# make +\end{verbatim} +\end{quote} + +You can also copy an existing Linux configuration (\path{.config}) +into \path{linux-2.6.11-xen0} and execute: +\begin{quote} +\begin{verbatim} +# make ARCH=xen oldconfig +\end{verbatim} +\end{quote} + +You may be prompted with some Xen-specific options; we advise +accepting the defaults for these options. + +Note that the only difference between the two types of Linux kernel +that are built is the configuration file used for each. The `U' +suffixed (unprivileged) versions don't contain any of the physical +hardware device drivers, leading to a 30\% reduction in size; hence +you may prefer these for your non-privileged domains. The `0' +suffixed privileged versions can be used to boot the system, as well +as in driver domains and unprivileged domains. + +\subsection{Installing the Binaries} + +The files produced by the build process are stored under the +\path{dist/install/} directory. To install them in their default +locations, do: +\begin{quote} +\begin{verbatim} +# make install +\end{verbatim} +\end{quote} + +Alternatively, users with special installation requirements may wish +to install them manually by copying the files to their appropriate +destinations. + +%% Files in \path{install/boot/} include: +%% \begin{itemize} +%% \item \path{install/boot/xen-2.0.gz} Link to the Xen 'kernel' +%% \item \path{install/boot/vmlinuz-2.6-xen0} Link to domain 0 +%% XenLinux kernel +%% \item \path{install/boot/vmlinuz-2.6-xenU} Link to unprivileged +%% XenLinux kernel +%% \end{itemize} + +The \path{dist/install/boot} directory will also contain the config +files used for building the XenLinux kernels, and also versions of Xen +and XenLinux kernels that contain debug symbols (\path{xen-syms-2.0.6} +and \path{vmlinux-syms-2.6.11.11-xen0}) which are essential for +interpreting crash dumps. Retain these files as the developers may +wish to see them if you post on the mailing list. + + +\section{Configuration} +\label{s:configure} + +Once you have built and installed the Xen distribution, it is simple +to prepare the machine for booting and running Xen. + +\subsection{GRUB Configuration} + +An entry should be added to \path{grub.conf} (often found under +\path{/boot/} or \path{/boot/grub/}) to allow Xen / XenLinux to boot. +This file is sometimes called \path{menu.lst}, depending on your +distribution. The entry should look something like the following: + +{\small +\begin{verbatim} +title Xen 2.0 / XenLinux 2.6 + kernel /boot/xen-2.0.gz dom0_mem=131072 + module /boot/vmlinuz-2.6-xen0 root=/dev/sda4 ro console=tty0 +\end{verbatim} +} + +The kernel line tells GRUB where to find Xen itself and what boot +parameters should be passed to it (in this case, setting domain 0's +memory allocation in kilobytes and the settings for the serial port). +For more details on the various Xen boot parameters see +Section~\ref{s:xboot}. + +The module line of the configuration describes the location of the +XenLinux kernel that Xen should start and the parameters that should +be passed to it (these are standard Linux parameters, identifying the +root device and specifying it be initially mounted read only and +instructing that console output be sent to the screen). Some +distributions such as SuSE do not require the \path{ro} parameter. + +%% \framebox{\parbox{5in}{ +%% {\bf Distro specific:} \\ +%% {\it SuSE} --- Omit the {\tt ro} option from the XenLinux +%% kernel command line, since the partition won't be remounted rw +%% during boot. 
}} + + +If you want to use an initrd, just add another \path{module} line to +the configuration, as usual: + +{\small +\begin{verbatim} + module /boot/my_initrd.gz +\end{verbatim} +} + +As always when installing a new kernel, it is recommended that you do +not delete existing menu options from \path{menu.lst} --- you may want +to boot your old Linux kernel in future, particularly if you have +problems. + +\subsection{Serial Console (optional)} + +%% kernel /boot/xen-2.0.gz dom0_mem=131072 com1=115200,8n1 +%% module /boot/vmlinuz-2.6-xen0 root=/dev/sda4 ro + + +In order to configure Xen serial console output, it is necessary to +add an boot option to your GRUB config; e.g.\ replace the above kernel +line with: +\begin{quote} +{\small +\begin{verbatim} + kernel /boot/xen.gz dom0_mem=131072 com1=115200,8n1 +\end{verbatim}} +\end{quote} + +This configures Xen to output on COM1 at 115,200 baud, 8 data bits, 1 +stop bit and no parity. Modify these parameters for your set up. + +One can also configure XenLinux to share the serial console; to +achieve this append ``\path{console=ttyS0}'' to your module line. + +If you wish to be able to log in over the XenLinux serial console it +is necessary to add a line into \path{/etc/inittab}, just as per +regular Linux. Simply add the line: +\begin{quote} {\small {\tt c:2345:respawn:/sbin/mingetty ttyS0}} +\end{quote} + +and you should be able to log in. Note that to successfully log in as +root over the serial line will require adding \path{ttyS0} to +\path{/etc/securetty} in most modern distributions. + +\subsection{TLS Libraries} + +Users of the XenLinux 2.6 kernel should disable Thread Local Storage +(e.g.\ by doing a \path{mv /lib/tls /lib/tls.disabled}) before +attempting to run with a XenLinux kernel\footnote{If you boot without + first disabling TLS, you will get a warning message during the boot + process. In this case, simply perform the rename after the machine + is up and then run \texttt{/sbin/ldconfig} to make it take effect.}. +You can always reenable it by restoring the directory to its original +location (i.e.\ \path{mv /lib/tls.disabled /lib/tls}). + +The reason for this is that the current TLS implementation uses +segmentation in a way that is not permissible under Xen. If TLS is +not disabled, an emulation mode is used within Xen which reduces +performance substantially. + +We hope that this issue can be resolved by working with Linux +distribution vendors to implement a minor backward-compatible change +to the TLS library. + + +\section{Booting Xen} + +It should now be possible to restart the system and use Xen. Reboot +as usual but choose the new Xen option when the Grub screen appears. + +What follows should look much like a conventional Linux boot. The +first portion of the output comes from Xen itself, supplying low level +information about itself and the machine it is running on. The +following portion of the output comes from XenLinux. + +You may see some errors during the XenLinux boot. These are not +necessarily anything to worry about --- they may result from kernel +configuration differences between your XenLinux kernel and the one you +usually use. + +When the boot completes, you should be able to log into your system as +usual. If you are unable to log in to your system running Xen, you +should still be able to reboot with your normal Linux kernel. 
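+
+Once the \xend\ control daemon has been started (see the following
+chapters), a quick way to confirm that the system is indeed running
+under Xen is to list the running domains; on a freshly booted machine
+only domain~0 should appear:
+
+\begin{quote}
+\begin{verbatim}
+# xm list
+\end{verbatim}
+\end{quote}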
diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/introduction.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/introduction.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,143 @@ +\chapter{Introduction} + + +Xen is a \emph{paravirtualising} virtual machine monitor (VMM), or +`hypervisor', for the x86 processor architecture. Xen can securely +execute multiple virtual machines on a single physical system with +close-to-native performance. The virtual machine technology +facilitates enterprise-grade functionality, including: + +\begin{itemize} +\item Virtual machines with performance close to native hardware. +\item Live migration of running virtual machines between physical + hosts. +\item Excellent hardware support (supports most Linux device drivers). +\item Sandboxed, re-startable device drivers. +\end{itemize} + +Paravirtualisation permits very high performance virtualisation, even +on architectures like x86 that are traditionally very hard to +virtualise. + +The drawback of this approach is that it requires operating systems to +be \emph{ported} to run on Xen. Porting an OS to run on Xen is +similar to supporting a new hardware platform, however the process is +simplified because the paravirtual machine architecture is very +similar to the underlying native hardware. Even though operating +system kernels must explicitly support Xen, a key feature is that user +space applications and libraries \emph{do not} require modification. + +Xen support is available for increasingly many operating systems: +right now, Linux 2.4, Linux 2.6 and NetBSD are available for Xen 2.0. +A FreeBSD port is undergoing testing and will be incorporated into the +release soon. Other OS ports, including Plan 9, are in progress. We +hope that that arch-xen patches will be incorporated into the +mainstream releases of these operating systems in due course (as has +already happened for NetBSD). + +Possible usage scenarios for Xen include: + +\begin{description} +\item [Kernel development.] Test and debug kernel modifications in a + sandboxed virtual machine --- no need for a separate test machine. +\item [Multiple OS configurations.] Run multiple operating systems + simultaneously, for instance for compatibility or QA purposes. +\item [Server consolidation.] Move multiple servers onto a single + physical host with performance and fault isolation provided at + virtual machine boundaries. +\item [Cluster computing.] Management at VM granularity provides more + flexibility than separately managing each physical host, but better + control and isolation than single-system image solutions, + particularly by using live migration for load balancing. +\item [Hardware support for custom OSes.] Allow development of new + OSes while benefiting from the wide-ranging hardware support of + existing OSes such as Linux. +\end{description} + + +\section{Structure of a Xen-Based System} + +A Xen system has multiple layers, the lowest and most privileged of +which is Xen itself. + +Xen in turn may host multiple \emph{guest} operating systems, each of +which is executed within a secure virtual machine (in Xen terminology, +a \emph{domain}). Domains are scheduled by Xen to make effective use +of the available physical CPUs. Each guest OS manages its own +applications, which includes responsibility for scheduling each +application within the time allotted to the VM by Xen. + +The first domain, \emph{domain 0}, is created automatically when the +system boots and has special management privileges. 
Domain 0 builds +other domains and manages their virtual devices. It also performs +administrative tasks such as suspending, resuming and migrating other +virtual machines. + +Within domain 0, a process called \emph{xend} runs to manage the +system. \Xend is responsible for managing virtual machines and +providing access to their consoles. Commands are issued to \xend over +an HTTP interface, either from a command-line tool or from a web +browser. + + +\section{Hardware Support} + +Xen currently runs only on the x86 architecture, requiring a `P6' or +newer processor (e.g. Pentium Pro, Celeron, Pentium II, Pentium III, +Pentium IV, Xeon, AMD Athlon, AMD Duron). Multiprocessor machines are +supported, and we also have basic support for HyperThreading (SMT), +although this remains a topic for ongoing research. A port +specifically for x86/64 is in progress, although Xen already runs on +such systems in 32-bit legacy mode. In addition a port to the IA64 +architecture is approaching completion. We hope to add other +architectures such as PPC and ARM in due course. + +Xen can currently use up to 4GB of memory. It is possible for x86 +machines to address up to 64GB of physical memory but there are no +current plans to support these systems: The x86/64 port is the planned +route to supporting larger memory sizes. + +Xen offloads most of the hardware support issues to the guest OS +running in Domain~0. Xen itself contains only the code required to +detect and start secondary processors, set up interrupt routing, and +perform PCI bus enumeration. Device drivers run within a privileged +guest OS rather than within Xen itself. This approach provides +compatibility with the majority of device hardware supported by Linux. +The default XenLinux build contains support for relatively modern +server-class network and disk hardware, but you can add support for +other hardware by configuring your XenLinux kernel in the normal way. + + +\section{History} + +Xen was originally developed by the Systems Research Group at the +University of Cambridge Computer Laboratory as part of the XenoServers +project, funded by the UK-EPSRC. + +XenoServers aim to provide a `public infrastructure for global +distributed computing', and Xen plays a key part in that, allowing us +to efficiently partition a single machine to enable multiple +independent clients to run their operating systems and applications in +an environment providing protection, resource isolation and +accounting. The project web page contains further information along +with pointers to papers and technical reports: +\path{http://www.cl.cam.ac.uk/xeno} + +Xen has since grown into a fully-fledged project in its own right, +enabling us to investigate interesting research issues regarding the +best techniques for virtualising resources such as the CPU, memory, +disk and network. The project has been bolstered by support from +Intel Research Cambridge, and HP Labs, who are now working closely +with us. + +Xen was first described in a paper presented at SOSP in +2003\footnote{\tt + http://www.cl.cam.ac.uk/netos/papers/2003-xensosp.pdf}, and the +first public release (1.0) was made that October. Since then, Xen has +significantly matured and is now used in production scenarios on many +sites. + +Xen 2.0 features greatly enhanced hardware support, configuration +flexibility, usability and a larger complement of supported operating +systems. This latest release takes Xen a step closer to becoming the +definitive open source solution for virtualisation. 
diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/redhat.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/redhat.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,61 @@ +\chapter{Installing Xen / XenLinux on Red~Hat or Fedora Core} + +When using Xen / XenLinux on a standard Linux distribution there are a +couple of things to watch out for: + +Note that, because domains greater than 0 don't have any privileged +access at all, certain commands in the default boot sequence will fail +e.g.\ attempts to update the hwclock, change the console font, update +the keytable map, start apmd (power management), or gpm (mouse +cursor). Either ignore the errors (they should be harmless), or +remove them from the startup scripts. Deleting the following links +are a good start: {\path{S24pcmcia}}, {\path{S09isdn}}, +{\path{S17keytable}}, {\path{S26apmd}}, {\path{S85gpm}}. + +If you want to use a single root file system that works cleanly for +both domain~0 and unprivileged domains, a useful trick is to use +different `init' run levels. For example, use run level 3 for +domain~0, and run level 4 for other domains. This enables different +startup scripts to be run in depending on the run level number passed +on the kernel command line. + +If using NFS root files systems mounted either from an external server +or from domain0 there are a couple of other gotchas. The default +{\path{/etc/sysconfig/iptables}} rules block NFS, so part way through +the boot sequence things will suddenly go dead. + +If you're planning on having a separate NFS {\path{/usr}} partition, +the RH9 boot scripts don't make life easy - they attempt to mount NFS +file systems way to late in the boot process. The easiest way I found +to do this was to have a {\path{/linuxrc}} script run ahead of +{\path{/sbin/init}} that mounts {\path{/usr}}: + +\begin{quote} + \begin{small}\begin{verbatim} + #!/bin/bash + /sbin/ipconfig lo 127.0.0.1 + /sbin/portmap + /bin/mount /usr + exec /sbin/init "$@" <>/dev/console 2>&1 +\end{verbatim}\end{small} +\end{quote} + +%% $ XXX SMH: font lock fix :-) + +The one slight complication with the above is that +{\path{/sbin/portmap}} is dynamically linked against +{\path{/usr/lib/libwrap.so.0}} Since this is in {\path{/usr}}, it +won't work. This can be solved by copying the file (and link) below +the {\path{/usr}} mount point, and just let the file be `covered' when +the mount happens. + +In some installations, where a shared read-only {\path{/usr}} is being +used, it may be desirable to move other large directories over into +the read-only {\path{/usr}}. For example, you might replace +{\path{/bin}}, {\path{/lib}} and {\path{/sbin}} with links into +{\path{/usr/root/bin}}, {\path{/usr/root/lib}} and +{\path{/usr/root/sbin}} respectively. This creates other problems for +running the {\path{/linuxrc}} script, requiring bash, portmap, mount, +ifconfig, and a handful of other shared libraries to be copied below +the mount point --- a simple statically-linked C program would solve +this problem. diff -r c0796e18b6a4 -r 750ad97f37b0 docs/src/user/start_addl_dom.tex --- /dev/null Tue Sep 20 09:08:26 2005 +++ b/docs/src/user/start_addl_dom.tex Tue Sep 20 09:17:33 2005 @@ -0,0 +1,172 @@ +\chapter{Starting Additional Domains} + +The first step in creating a new domain is to prepare a root +filesystem for it to boot from. Typically, this might be stored in a +normal partition, an LVM or other volume manager partition, a disk +file or on an NFS server. 
A simple way to do this is simply to boot +from your standard OS install CD and install the distribution into +another partition on your hard drive. + +To start the \xend\ control daemon, type +\begin{quote} + \verb!# xend start! +\end{quote} + +If you wish the daemon to start automatically, see the instructions in +Section~\ref{s:xend}. Once the daemon is running, you can use the +\path{xm} tool to monitor and maintain the domains running on your +system. This chapter provides only a brief tutorial. We provide full +details of the \path{xm} tool in the next chapter. + +% \section{From the web interface} +% +% Boot the Xen machine and start Xensv (see Chapter~\ref{cha:xensv} +% for more details) using the command: \\ +% \verb_# xensv start_ \\ +% This will also start Xend (see Chapter~\ref{cha:xend} for more +% information). +% +% The domain management interface will then be available at {\tt +% http://your\_machine:8080/}. This provides a user friendly wizard +% for starting domains and functions for managing running domains. +% +% \section{From the command line} + + +\section{Creating a Domain Configuration File} + +Before you can start an additional domain, you must create a +configuration file. We provide two example files which you can use as +a starting point: +\begin{itemize} +\item \path{/etc/xen/xmexample1} is a simple template configuration + file for describing a single VM. + +\item \path{/etc/xen/xmexample2} file is a template description that + is intended to be reused for multiple virtual machines. Setting the + value of the \path{vmid} variable on the \path{xm} command line + fills in parts of this template. +\end{itemize} + +Copy one of these files and edit it as appropriate. Typical values +you may wish to edit include: + +\begin{quote} +\begin{description} +\item[kernel] Set this to the path of the kernel you compiled for use + with Xen (e.g.\ \path{kernel = `/boot/vmlinuz-2.6-xenU'}) +\item[memory] Set this to the size of the domain's memory in megabytes + (e.g.\ \path{memory = 64}) +\item[disk] Set the first entry in this list to calculate the offset + of the domain's root partition, based on the domain ID. Set the + second to the location of \path{/usr} if you are sharing it between + domains (e.g.\ \path{disk = [`phy:your\_hard\_drive\%d,sda1,w' \% + (base\_partition\_number + vmid), + `phy:your\_usr\_partition,sda6,r' ]} +\item[dhcp] Uncomment the dhcp variable, so that the domain will + receive its IP address from a DHCP server (e.g.\ \path{dhcp=`dhcp'}) +\end{description} +\end{quote} + +You may also want to edit the {\bf vif} variable in order to choose +the MAC address of the virtual ethernet interface yourself. For +example: +\begin{quote} +\verb_vif = [`mac=00:06:AA:F6:BB:B3']_ +\end{quote} +If you do not set this variable, \xend\ will automatically generate a +random MAC address from an unused range. + + +\section{Booting the Domain} + +The \path{xm} tool provides a variety of commands for managing +domains. Use the \path{create} command to start new domains. Assuming +you've created a configuration file \path{myvmconf} based around +\path{/etc/xen/xmexample2}, to start a domain with virtual machine +ID~1 you should type: + +\begin{quote} +\begin{verbatim} +# xm create -c myvmconf vmid=1 +\end{verbatim} +\end{quote} + +The \path{-c} switch causes \path{xm} to turn into the domain's +console after creation. The \path{vmid=1} sets the \path{vmid} +variable used in the \path{myvmconf} file. 
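+
+Because the \path{xmexample2}-style template is parameterised on
+\path{vmid}, the same configuration file can in principle be reused
+for further guests simply by supplying a different value on the
+command line; the numbering below is purely illustrative:
+
+\begin{quote}
+\begin{verbatim}
+# xm create -c myvmconf vmid=2
+\end{verbatim}
+\end{quote}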
+ +You should see the console boot messages from the new domain appearing +in the terminal in which you typed the command, culminating in a login +prompt. + + +\section{Example: ttylinux} + +Ttylinux is a very small Linux distribution, designed to require very +few resources. We will use it as a concrete example of how to start a +Xen domain. Most users will probably want to install a full-featured +distribution once they have mastered the basics\footnote{ttylinux is + maintained by Pascal Schmidt. You can download source packages from + the distribution's home page: {\tt + http://www.minimalinux.org/ttylinux/}}. + +\begin{enumerate} +\item Download and extract the ttylinux disk image from the Files + section of the project's SourceForge site (see + \path{http://sf.net/projects/xen/}). +\item Create a configuration file like the following: +\begin{verbatim} +kernel = "/boot/vmlinuz-2.6-xenU" +memory = 64 +name = "ttylinux" +nics = 1 +ip = "1.2.3.4" +disk = ['file:/path/to/ttylinux/rootfs,sda1,w'] +root = "/dev/sda1 ro" +\end{verbatim} +\item Now start the domain and connect to its console: +\begin{verbatim} +xm create configfile -c +\end{verbatim} +\item Login as root, password root. +\end{enumerate} + + +\section{Starting / Stopping Domains Automatically} + +It is possible to have certain domains start automatically at boot +time and to have dom0 wait for all running domains to shutdown before +it shuts down the system. + +To specify a domain is to start at boot-time, place its configuration +file (or a link to it) under \path{/etc/xen/auto/}. + +A Sys-V style init script for Red Hat and LSB-compliant systems is +provided and will be automatically copied to \path{/etc/init.d/} +during install. You can then enable it in the appropriate way for +your distribution. + +For instance, on Red Hat: + +\begin{quote} + \verb_# chkconfig --add xendomains_ +\end{quote} + +By default, this will start the boot-time domains in runlevels 3, 4 +and 5. + +You can also use the \path{service} command to run this script +manually, e.g: + +\begin{quote} + \verb_# service xendomains start_ + + Starts all the domains with config files under /etc/xen/auto/. +\end{quote} + +\begin{quote} + \verb_# service xendomains stop_ + + Shuts down ALL running Xen domains. +\end{quote} _______________________________________________ Xen-changelog mailing list Xen-changelog@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-changelog