Xen project Mailing List

[Xen-changelog] [xen staging] docs/markdown: Switch to using pandoc, and fix underscore escaping

Date: Wed, 02 Jan 2019 17:55:14 +0000

Delivery-date: Wed, 02 Jan 2019 17:55:19 +0000

List-id: "Change log for Mercurial (receive only)" <xen-changelog.lists.xenproject.org>

commit d661611d080c833092b8a26a5a43d343e08dd404
Author:     Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
AuthorDate: Wed Jan 2 10:26:47 2019 +0000
Commit:     Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
CommitDate: Wed Jan 2 17:50:36 2019 +0000

    docs/markdown: Switch to using pandoc, and fix underscore escaping

    c/s a3a99df44 "docs/cmdline: Rewrite the cpuid_mask_* section" completely
    forgot about how markdown gets rendered to HTML (as opposed to PDF), because
    we use different translators depending on the destination format.

    markdown and pandoc are very similar markup languages, but a couple of
    details about pandoc cause it to have far more user-friendly inline markup.

    Switch all markdown documents to be pandoc (so we are using a single
    translator, and therefore a single flavour of markdown), which fixes the
    rendered docs on xenbits.xen.org/docs.

    While changing the format, fix the remainder of the escaped underscores in
    the same manner as the previous patch. The two problem cases here are
    __LINE__ and __FILE__ where the first underscore still needs escaping.

    In addition, dmop.markdown and dom0less.markdown didn't used to get
    processed, as only .markdown files in the misc/ directory got considered.
    dom0less.pandoc gets picked up automatically now, due to being in the
    features/ directory, but designs/ needs adding to the pandoc directory list
    for dmop.pandoc to get processed.

    While editing in appropriate areas, take the opportunity to fix some markup
    to the surrounding style, and drop trailing whitespace.

    No change in content - only formatting. This results in the text being
    easier to read and grep.

    Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
    Acked-by: Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
---
 docs/Makefile                           |    2 +-
 docs/designs/dmop.markdown              |  175 ---
 docs/designs/dmop.pandoc                |  175 +++
 docs/features/dom0less.markdown         |   49 -
 docs/features/dom0less.pandoc           |   49 +
 docs/misc/9pfs.markdown                 |  419 ------
 docs/misc/9pfs.pandoc                   |  419 ++++++
 docs/misc/coverage.markdown             |  124 --
 docs/misc/coverage.pandoc               |  124 ++
 docs/misc/efi.markdown                  |  118 --
 docs/misc/efi.pandoc                    |  118 ++
 docs/misc/hvm-emulated-unplug.markdown  |   97 --
 docs/misc/hvm-emulated-unplug.pandoc    |   97 ++
 docs/misc/livepatch.markdown            | 1108 ----------------
 docs/misc/livepatch.pandoc              | 1108 ++++++++++++++++
 docs/misc/pv-drivers-lifecycle.markdown |   57 -
 docs/misc/pv-drivers-lifecycle.pandoc   |   57 +
 docs/misc/pvcalls.markdown              | 1092 ---------------
 docs/misc/pvcalls.pandoc                | 1092 +++++++++++++++
 docs/misc/pvh.markdown                  |  112 --
 docs/misc/pvh.pandoc                    |  112 ++
 docs/misc/x86-xenpv-bootloader.markdown |   49 -
 docs/misc/x86-xenpv-bootloader.pandoc   |   49 +
 docs/misc/xen-command-line.markdown     | 2199 -------------------------------
 docs/misc/xen-command-line.pandoc       | 2199 +++++++++++++++++++++++++++++++
 docs/misc/xenstore-paths.markdown       |  640 ---------
 docs/misc/xenstore-paths.pandoc         |  640 +++++++++
 docs/misc/xl-psr.markdown               |  254 ----
 docs/misc/xl-psr.pandoc                 |  254 ++++
 29 files changed, 6494 insertions(+), 6494 deletions(-)

diff --git a/docs/Makefile b/docs/Makefile
index fba6673db6..8f933cf93f 100644
--- a/docs/Makefile
+++ b/docs/Makefile
@@ -17,7 +17,7 @@ MARKDOWNSRC-y := $(sort $(shell find misc -name '*.markdown' -print))
 
 TXTSRC-y := $(sort $(shell find misc -name '*.txt' -print))
 
-PANDOCSRC-y := $(sort $(shell find process/ features/ misc/ specs/ -name '*.pandoc' -print))
+PANDOCSRC-y := $(sort $(shell find designs/ features/ misc/ process/ specs/ -name '*.pandoc' -print))
 
 # Documentation targets
 DOC_MAN1 := $(patsubst man/%.pod.1,man1/%.1,$(MAN1SRC-y)) \
diff --git a/docs/designs/dmop.markdown
b/docs/designs/dmop.markdown deleted file mode 100644 index 8e9f95af47..0000000000 --- a/docs/designs/dmop.markdown +++ /dev/null @@ -1,175 +0,0 @@ -DMOP -==== - -Introduction ------------- - -The aim of DMOP is to prevent a compromised device model from compromising -domains other than the one it is providing emulation for (which is therefore -likely already compromised). - -The problem occurs when you a device model issues an hypercall that -includes references to user memory other than the operation structure -itself, such as with Track dirty VRAM (as used in VGA emulation). -Is this case, the address of this other user memory needs to be vetted, -to ensure it is not within restricted address ranges, such as kernel -memory. The real problem comes down to how you would vet this address - -the idea place to do this is within the privcmd driver, without privcmd -having to have specific knowledge of the hypercall's semantics. - -The Design ----------- - -The privcmd driver implements a new restriction ioctl, which takes a domid -parameter. After that restriction ioctl is issued, all unaudited operations -on the privcmd driver will cease to function, including regular hypercalls. -DMOP hypercalls will continue to function as they can be audited. - -A DMOP hypercall consists of a domid (which is audited to verify that it -matches any restriction in place) and an array of buffers and lengths, -with the first one containing the specific DMOP parameters. These can -then reference further buffers from within in the array. Since the only -user buffers passed are that found with that array, they can all can be -audited by privcmd. 
- -The following code illustrates this idea: - -struct xen_dm_op { - uint32_t op; -}; - -struct xen_dm_op_buf { - XEN_GUEST_HANDLE(void) h; - unsigned long size; -}; -typedef struct xen_dm_op_buf xen_dm_op_buf_t; - -enum neg_errnoval -HYPERVISOR_dm_op(domid_t domid, - xen_dm_op_buf_t bufs[], - unsigned int nr_bufs) - -@domid is the domain the hypercall operates on. -@bufs points to an array of buffers where @bufs[0] contains a struct -dm_op, describing the specific device model operation and its parameters. -@bufs[1..] may be referenced in the parameters for the purposes of -passing extra information to or from the domain. -@nr_bufs is the number of buffers in the @bufs array. - -It is forbidden for the above struct (xen_dm_op) to contain any guest -handles. If they are needed, they should instead be in -HYPERVISOR_dm_op->bufs. - -Validation by privcmd driver ----------------------------- - -If the privcmd driver has been restricted to specific domain (using a - new ioctl), when it received an op, it will: - -1. Check hypercall is DMOP. - -2. Check domid == restricted domid. - -3. For each @nr_bufs in @bufs: Check @h and @size give a buffer - wholly in the user space part of the virtual address space. (e.g. - Linux will use access_ok()). - - -Xen Implementation ------------------- - -Since a DMOP buffers need to be copied from or to the guest, functions for -doing this would be written as below. Note that care is taken to prevent -damage from buffer under- or over-run situations. If the DMOP is called -with incorrectly sized buffers, zeros will be read, while extra is ignored. 
- -static bool copy_buf_from_guest(xen_dm_op_buf_t bufs[], - unsigned int nr_bufs, void *dst, - unsigned int idx, size_t dst_size) -{ - size_t size; - - if ( idx >= nr_bufs ) - return false; - - memset(dst, 0, dst_size); - - size = min_t(size_t, dst_size, bufs[idx].size); - - return !copy_from_guest(dst, bufs[idx].h, size); -} - -static bool copy_buf_to_guest(xen_dm_op_buf_t bufs[], - unsigned int nr_bufs, unsigned int idx, - void *src, size_t src_size) -{ - size_t size; - - if ( idx >= nr_bufs ) - return false; - - size = min_t(size_t, bufs[idx].size, src_size); - - return !copy_to_guest(bufs[idx].h, src, size); -} - -This leaves do_dm_op easy to implement as below: - -static int dm_op(domid_t domid, - unsigned int nr_bufs, - xen_dm_op_buf_t bufs[]) -{ - struct domain *d; - struct xen_dm_op op; - bool const_op = true; - long rc; - - rc = rcu_lock_remote_domain_by_id(domid, &d); - if ( rc ) - return rc; - - if ( !is_hvm_domain(d) ) - goto out; - - rc = xsm_dm_op(XSM_DM_PRIV, d); - if ( rc ) - goto out; - - if ( !copy_buf_from_guest(bufs, nr_bufs, &op, 0, sizeof(op)) ) - { - rc = -EFAULT; - goto out; - } - - switch ( op.op ) - { - default: - rc = -EOPNOTSUPP; - break; - } - - if ( !rc && - !const_op && - !copy_buf_to_guest(bufs, nr_bufs, 0, &op, sizeof(op)) ) - rc = -EFAULT; - - out: - rcu_unlock_domain(d); - - return rc; -} - -long do_dm_op(domid_t domid, - unsigned int nr_bufs, - XEN_GUEST_HANDLE_PARAM(xen_dm_op_buf_t) bufs) -{ - struct xen_dm_op_buf nat[MAX_NR_BUFS]; - - if ( nr_bufs > MAX_NR_BUFS ) - return -EINVAL; - - if ( copy_from_guest_offset(nat, bufs, 0, nr_bufs) ) - return -EFAULT; - - return dm_op(domid, nr_bufs, nat); -} diff --git a/docs/designs/dmop.pandoc b/docs/designs/dmop.pandoc new file mode 100644 index 0000000000..8e9f95af47 --- /dev/null +++ b/docs/designs/dmop.pandoc @@ -0,0 +1,175 @@ +DMOP +==== + +Introduction +------------ + +The aim of DMOP is to prevent a compromised device model from compromising +domains other than the one it is 
providing emulation for (which is therefore +likely already compromised). + +The problem occurs when you a device model issues an hypercall that +includes references to user memory other than the operation structure +itself, such as with Track dirty VRAM (as used in VGA emulation). +Is this case, the address of this other user memory needs to be vetted, +to ensure it is not within restricted address ranges, such as kernel +memory. The real problem comes down to how you would vet this address - +the idea place to do this is within the privcmd driver, without privcmd +having to have specific knowledge of the hypercall's semantics. + +The Design +---------- + +The privcmd driver implements a new restriction ioctl, which takes a domid +parameter. After that restriction ioctl is issued, all unaudited operations +on the privcmd driver will cease to function, including regular hypercalls. +DMOP hypercalls will continue to function as they can be audited. + +A DMOP hypercall consists of a domid (which is audited to verify that it +matches any restriction in place) and an array of buffers and lengths, +with the first one containing the specific DMOP parameters. These can +then reference further buffers from within in the array. Since the only +user buffers passed are that found with that array, they can all can be +audited by privcmd. + +The following code illustrates this idea: + +struct xen_dm_op { + uint32_t op; +}; + +struct xen_dm_op_buf { + XEN_GUEST_HANDLE(void) h; + unsigned long size; +}; +typedef struct xen_dm_op_buf xen_dm_op_buf_t; + +enum neg_errnoval +HYPERVISOR_dm_op(domid_t domid, + xen_dm_op_buf_t bufs[], + unsigned int nr_bufs) + +@domid is the domain the hypercall operates on. +@bufs points to an array of buffers where @bufs[0] contains a struct +dm_op, describing the specific device model operation and its parameters. +@bufs[1..] may be referenced in the parameters for the purposes of +passing extra information to or from the domain. 
+@nr_bufs is the number of buffers in the @bufs array. + +It is forbidden for the above struct (xen_dm_op) to contain any guest +handles. If they are needed, they should instead be in +HYPERVISOR_dm_op->bufs. + +Validation by privcmd driver +---------------------------- + +If the privcmd driver has been restricted to specific domain (using a + new ioctl), when it received an op, it will: + +1. Check hypercall is DMOP. + +2. Check domid == restricted domid. + +3. For each @nr_bufs in @bufs: Check @h and @size give a buffer + wholly in the user space part of the virtual address space. (e.g. + Linux will use access_ok()). + + +Xen Implementation +------------------ + +Since a DMOP buffers need to be copied from or to the guest, functions for +doing this would be written as below. Note that care is taken to prevent +damage from buffer under- or over-run situations. If the DMOP is called +with incorrectly sized buffers, zeros will be read, while extra is ignored. + +static bool copy_buf_from_guest(xen_dm_op_buf_t bufs[], + unsigned int nr_bufs, void *dst, + unsigned int idx, size_t dst_size) +{ + size_t size; + + if ( idx >= nr_bufs ) + return false; + + memset(dst, 0, dst_size); + + size = min_t(size_t, dst_size, bufs[idx].size); + + return !copy_from_guest(dst, bufs[idx].h, size); +} + +static bool copy_buf_to_guest(xen_dm_op_buf_t bufs[], + unsigned int nr_bufs, unsigned int idx, + void *src, size_t src_size) +{ + size_t size; + + if ( idx >= nr_bufs ) + return false; + + size = min_t(size_t, bufs[idx].size, src_size); + + return !copy_to_guest(bufs[idx].h, src, size); +} + +This leaves do_dm_op easy to implement as below: + +static int dm_op(domid_t domid, + unsigned int nr_bufs, + xen_dm_op_buf_t bufs[]) +{ + struct domain *d; + struct xen_dm_op op; + bool const_op = true; + long rc; + + rc = rcu_lock_remote_domain_by_id(domid, &d); + if ( rc ) + return rc; + + if ( !is_hvm_domain(d) ) + goto out; + + rc = xsm_dm_op(XSM_DM_PRIV, d); + if ( rc ) + goto out; + + if ( 
!copy_buf_from_guest(bufs, nr_bufs, &op, 0, sizeof(op)) ) + { + rc = -EFAULT; + goto out; + } + + switch ( op.op ) + { + default: + rc = -EOPNOTSUPP; + break; + } + + if ( !rc && + !const_op && + !copy_buf_to_guest(bufs, nr_bufs, 0, &op, sizeof(op)) ) + rc = -EFAULT; + + out: + rcu_unlock_domain(d); + + return rc; +} + +long do_dm_op(domid_t domid, + unsigned int nr_bufs, + XEN_GUEST_HANDLE_PARAM(xen_dm_op_buf_t) bufs) +{ + struct xen_dm_op_buf nat[MAX_NR_BUFS]; + + if ( nr_bufs > MAX_NR_BUFS ) + return -EINVAL; + + if ( copy_from_guest_offset(nat, bufs, 0, nr_bufs) ) + return -EFAULT; + + return dm_op(domid, nr_bufs, nat); +} diff --git a/docs/features/dom0less.markdown b/docs/features/dom0less.markdown deleted file mode 100644 index 4e342b7957..0000000000 --- a/docs/features/dom0less.markdown +++ /dev/null @@ -1,49 +0,0 @@ -Dom0less -======== - -"Dom0less" is a set of Xen features that enable the deployment of a Xen -system without an control domain (often referred to as "dom0"). Each -feature can be used independently from the others, unless otherwise -stated. - -Booting Multiple Domains from Device Tree ------------------------------------------ - -This feature enables Xen to create a set of DomUs at boot time. -Information about the DomUs to be created by Xen is passed to the -hypervisor via Device Tree. Specifically, the existing Device Tree based -Multiboot specification has been extended to allow for multiple domains -to be passed to Xen. See docs/misc/arm/device-tree/booting.txt for more -information about the Multiboot specification and how to use it. - -Currently, a control domain ("dom0") is still required, but in the -future it will become unnecessary when all domains are created -directly from Xen. Instead of waiting for the control domain to be fully -booted and the Xen tools to become available, domains created by Xen -this way are started right away in parallel. Hence, their boot time is -typically much shorter. 
- -Domains started by Xen at boot time currently have the following -limitations: - -- They cannot be properly shutdown or rebooted using xl. If one of them - crashes, the whole platform should be rebooted. - -- Some xl operations might not work as expected. xl is meant to be used - with domains that have been created by it. Using xl with domains - started by Xen at boot might not work as expected. - -- The GIC version is the native version. In absence of other - information, the GIC version exposed to the domains started by Xen at - boot is the same as the native GIC version. - -- No PV drivers. There is no support for PV devices at the moment. All - devices need to be statically assigned to guests. - -- Pinning vCPUs of domains started by Xen at boot can be - done from the control domain, using `xl vcpu-pin` as usual. It is not - currently possible to configure vCPU pinning without a control domain. - However, the NULL scheduler can be selected by passing `sched=null` to - the Xen command line. The NULL scheduler automatically assigns and - pins vCPUs to pCPUs, but the vCPU-pCPU assignments cannot be - configured. diff --git a/docs/features/dom0less.pandoc b/docs/features/dom0less.pandoc new file mode 100644 index 0000000000..4e342b7957 --- /dev/null +++ b/docs/features/dom0less.pandoc @@ -0,0 +1,49 @@ +Dom0less +======== + +"Dom0less" is a set of Xen features that enable the deployment of a Xen +system without an control domain (often referred to as "dom0"). Each +feature can be used independently from the others, unless otherwise +stated. + +Booting Multiple Domains from Device Tree +----------------------------------------- + +This feature enables Xen to create a set of DomUs at boot time. +Information about the DomUs to be created by Xen is passed to the +hypervisor via Device Tree. Specifically, the existing Device Tree based +Multiboot specification has been extended to allow for multiple domains +to be passed to Xen. 
See docs/misc/arm/device-tree/booting.txt for more +information about the Multiboot specification and how to use it. + +Currently, a control domain ("dom0") is still required, but in the +future it will become unnecessary when all domains are created +directly from Xen. Instead of waiting for the control domain to be fully +booted and the Xen tools to become available, domains created by Xen +this way are started right away in parallel. Hence, their boot time is +typically much shorter. + +Domains started by Xen at boot time currently have the following +limitations: + +- They cannot be properly shutdown or rebooted using xl. If one of them + crashes, the whole platform should be rebooted. + +- Some xl operations might not work as expected. xl is meant to be used + with domains that have been created by it. Using xl with domains + started by Xen at boot might not work as expected. + +- The GIC version is the native version. In absence of other + information, the GIC version exposed to the domains started by Xen at + boot is the same as the native GIC version. + +- No PV drivers. There is no support for PV devices at the moment. All + devices need to be statically assigned to guests. + +- Pinning vCPUs of domains started by Xen at boot can be + done from the control domain, using `xl vcpu-pin` as usual. It is not + currently possible to configure vCPU pinning without a control domain. + However, the NULL scheduler can be selected by passing `sched=null` to + the Xen command line. The NULL scheduler automatically assigns and + pins vCPUs to pCPUs, but the vCPU-pCPU assignments cannot be + configured. diff --git a/docs/misc/9pfs.markdown b/docs/misc/9pfs.markdown deleted file mode 100644 index 7f13831e06..0000000000 --- a/docs/misc/9pfs.markdown +++ /dev/null @@ -1,419 +0,0 @@ -# Xen transport for 9pfs version 1 - -## Background - -9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very -simple and describes a series of commands and responses. 
It is -completely independent from the communication channels, in fact many -clients and servers support multiple channels, usually called -"transports". For example the Linux client supports tcp and unix -sockets, fds, virtio and rdma. - - -### 9pfs protocol - -This document won't cover the full 9pfs specification. Please refer to -this [paper] and this [website] for a detailed description of it. -However it is useful to know that each 9pfs request and response has the -following header: - - struct header { - uint32_t size; - uint8_t id; - uint16_t tag; - } __attribute__((packed)); - - 0 4 5 7 - +---------+--+----+ - | size |id|tag | - +---------+--+----+ - -- *size* -The size of the request or response. - -- *id* -The 9pfs request or response operation. - -- *tag* -Unique id that identifies a specific request/response pair. It is used -to multiplex operations on a single channel. - -It is possible to have multiple requests in-flight at any given time. - - -## Rationale - -This document describes a Xen based transport for 9pfs, in the -traditional PV frontend and backend format. The PV frontend is used by -the client to send commands to the server. The PV backend is used by the -9pfs server to receive commands from clients and send back responses. - -The transport protocol supports multiple rings up to the maximum -supported by the backend. The size of every ring is also configurable -and can span multiple pages, up to the maximum supported by the backend -(although it cannot be more than 2MB). The design is to exploit -parallelism at the vCPU level and support multiple outstanding requests -simultaneously. - -This document does not cover the 9pfs client/server design or -implementation, only the transport for it. - - -## Xenstore - -The frontend and the backend connect via xenstore to exchange -information. The toolstack creates front and back nodes with state -[XenbusStateInitialising]. The protocol node name is **9pfs**. 
- -Multiple rings are supported for each frontend and backend connection. - -### Backend XenBus Nodes - -Backend specific properties, written by the backend, read by the -frontend: - - versions - Values: <string> - - List of comma separated protocol versions supported by the backend. - For example "1,2,3". Currently the value is just "1", as there is - only one version. N.B.: this is the version of the Xen trasport - protocol, not the version of 9pfs supported by the server. - - max-rings - Values: <uint32_t> - - The maximum supported number of rings per frontend. - - max-ring-page-order - Values: <uint32_t> - - The maximum supported size of a memory allocation in units of - log2n(machine pages), e.g. 1 = 2 pages, 2 == 4 pages, etc. It - must be at least 1. - -Backend configuration nodes, written by the toolstack, read by the -backend: - - path - Values: <string> - - Host filesystem path to share. - - tag - Values: <string> - - Alphanumeric tag that identifies the 9pfs share. The client needs - to know the tag to be able to mount it. - - security-model - Values: "none" - - *none*: files are stored using the same credentials as they are - created on the guest (no user ownership squash or remap) - Only "none" is supported in this version of the protocol. - -### Frontend XenBus Nodes - - version - Values: <string> - - Protocol version, chosen among the ones supported by the backend - (see **versions** under [Backend XenBus Nodes]). Currently the - value must be "1". - - num-rings - Values: <uint32_t> - - Number of rings. It needs to be lower or equal to max-rings. - - event-channel-<num> (event-channel-0, event-channel-1, etc) - Values: <uint32_t> - - The identifier of the Xen event channel used to signal activity - in the ring buffer. One for each ring. - - ring-ref<num> (ring-ref0, ring-ref1, etc) - Values: <uint32_t> - - The Xen grant reference granting permission for the backend to - map a page with information to setup a share ring. One for each - ring. 
- -### State Machine - -Initialization: - - *Front* *Back* - XenbusStateInitialising XenbusStateInitialising - - Query virtual device - Query backend device - properties. identification data. - - Setup OS device instance. - Publish backend features - - Allocate and initialize the and transport parameters - request ring. | - - Publish transport parameters | - that will be in effect during V - this connection. XenbusStateInitWait - | - | - V - XenbusStateInitialised - - - Query frontend transport parameters. - - Connect to the request ring and - event channel. - | - | - V - XenbusStateConnected - - - Query backend device properties. - - Finalize OS virtual device - instance. - | - | - V - XenbusStateConnected - -Once frontend and backend are connected, they have a shared page per -ring, which are used to setup the rings, and an event channel per ring, -which are used to send notifications. - -Shutdown: - - *Front* *Back* - XenbusStateConnected XenbusStateConnected - | - | - V - XenbusStateClosing - - - Unmap grants - - Unbind evtchns - | - | - V - XenbusStateClosing - - - Unbind evtchns - - Free rings - - Free data structures - | - | - V - XenbusStateClosed - - - Free remaining data structures - | - | - V - XenbusStateClosed - - -## Ring Setup - -The shared page has the following layout: - - typedef uint32_t XEN_9PFS_RING_IDX; - - struct xen_9pfs_intf { - XEN_9PFS_RING_IDX in_cons, in_prod; - uint8_t pad[56]; - XEN_9PFS_RING_IDX out_cons, out_prod; - uint8_t pad[56]; - - uint32_t ring_order; - /* this is an array of (1 << ring_order) elements */ - grant_ref_t ref[1]; - }; - - /* not actually C compliant (ring_order changes from ring to ring) */ - struct ring_data { - char in[((1 << ring_order) << PAGE_SHIFT) / 2]; - char out[((1 << ring_order) << PAGE_SHIFT) / 2]; - }; - -- **ring_order** - It represents the order of the data ring. The following list of grant - references is of `(1 << ring_order)` elements. 
It cannot be greater than - **max-ring-page-order**, as specified by the backend on XenBus. -- **ref[]** - The list of grant references which will contain the actual data. They are - mapped contiguosly in virtual memory. The first half of the pages is the - **in** array, the second half is the **out** array. The array must - have a power of two number of elements. -- **out** is an array used as circular buffer - It contains client requests. The producer is the frontend, the - consumer is the backend. -- **in** is an array used as circular buffer - It contains server responses. The producer is the backend, the - consumer is the frontend. -- **out_cons**, **out_prod** - Consumer and producer indices for client requests. They keep track of - how much data has been written by the frontend to **out** and how much - data has already been consumed by the backend. **out_prod** is - increased by the frontend, after writing data to **out**. **out_cons** - is increased by the backend, after reading data from **out**. -- **in_cons** and **in_prod** - Consumer and producer indices for responses. They keep track of how - much data has already been consumed by the frontend from the **in** - array. **in_prod** is increased by the backend, after writing data to - **in**. **in_cons** is increased by the frontend, after reading data - from **in**. - -The binary layout of `struct xen_9pfs_intf` follows: - - 0 4 8 64 68 72 76 - +---------+---------+-----//-----+---------+---------+---------+ - | in_cons | in_prod | padding |out_cons |out_prod |ring_orde| - +---------+---------+-----//-----+---------+---------+---------+ - - 76 80 84 4092 4096 - +---------+---------+----//---+---------+ - | ref[0] | ref[1] | | ref[N] | - +---------+---------+----//---+---------+ - -**N.B** For one page, N is maximum 991 (4096-132)/4, but given that N -needs to be a power of two, actually max N is 512. As 512 == (1 << 9), -the maximum possible max-ring-page-order value is 9. 
- -The binary layout of the ring buffers follow: - - 0 ((1<<ring_order)<<PAGE_SHIFT)/2 ((1<<ring_order)<<PAGE_SHIFT) - +------------//-------------+------------//-------------+ - | in | out | - +------------//-------------+------------//-------------+ - -## Why ring.h is not needed - -Many Xen PV protocols use the macros provided by [ring.h] to manage -their shared ring for communication. This procotol does not, because it -actually comes with two rings: the **in** ring and the **out** ring. -Each of them is mono-directional, and there is no static request size: -the producer writes opaque data to the ring. On the other end, in -[ring.h] they are combined, and the request size is static and -well-known. In this protocol: - - in -> backend to frontend only - out-> frontend to backend only - -In the case of the **in** ring, the frontend is the consumer, and the -backend is the producer. Everything is the same but mirrored for the -**out** ring. - -The producer, the backend in this case, never reads from the **in** -ring. In fact, the producer doesn't need any notifications unless the -ring is full. This version of the protocol doesn't take advantage of it, -leaving room for optimizations. - -On the other end, the consumer always requires notifications, unless it -is already actively reading from the ring. The producer can figure it -out, without any additional fields in the protocol, by comparing the -indexes at the beginning and the end of the function. This is similar to -what [ring.h] does. 
- -## Ring Usage - -The **in** and **out** arrays are used as circular buffers: - - 0 sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2 - +-----------------------------------+ - |to consume| free |to consume | - +-----------------------------------+ - ^ ^ - prod cons - - 0 sizeof(array) - +-----------------------------------+ - | free | to consume | free | - +-----------------------------------+ - ^ ^ - cons prod - -The following functions are provided to read and write to an array: - - #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1)) - - static inline void xen_9pfs_read(char *buf, - XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons, - uint8_t *h, size_t len) { - if (*masked_cons < *masked_prod) { - memcpy(h, buf + *masked_cons, len); - } else { - if (len > XEN_9PFS_RING_SIZE - *masked_cons) { - memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons); - memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf, len - (XEN_9PFS_RING_SIZE - *masked_cons)); - } else { - memcpy(h, buf + *masked_cons, len); - } - } - *masked_cons = _MASK_XEN_9PFS_IDX(*masked_cons + len); - } - - static inline void xen_9pfs_write(char *buf, - XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons, - uint8_t *opaque, size_t len) { - if (*masked_prod < *masked_cons) { - memcpy(buf + *masked_prod, opaque, len); - } else { - if (len > XEN_9PFS_RING_SIZE - *masked_prod) { - memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod); - memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod), len - (XEN_9PFS_RING_SIZE - *masked_prod)); - } else { - memcpy(buf + *masked_prod, opaque, len); - } - } - *masked_prod = _MASK_XEN_9PFS_IDX(*masked_prod + len); - } - -The producer (the backend for **in**, the frontend for **out**) writes to the -array in the following way: - -- read *cons*, *prod* from shared memory -- general memory barrier -- verify *prod* against local copy (consumer shouldn't change it) -- write to array at position *prod* up 
to *cons*, wrapping around the circular - buffer when necessary -- write memory barrier -- increase *prod* -- notify the other end via event channel - -The consumer (the backend for **out**, the frontend for **in**) reads from the -array in the following way: - -- read *prod*, *cons* from shared memory -- read memory barrier -- verify *cons* against local copy (producer shouldn't change it) -- read from array at position *cons* up to *prod*, wrapping around the circular - buffer when necessary -- general memory barrier -- increase *cons* -- notify the other end via event channel - -The producer takes care of writing only as many bytes as available in the buffer -up to *cons*. The consumer takes care of reading only as many bytes as available -in the buffer up to *prod*. - - -## Request/Response Workflow - -The client chooses one of the available rings, then it sends a request -to the other end on the *out* array, following the producer workflow -described in [Ring Usage]. - -The server receives the notification and reads the request, following -the consumer workflow described in [Ring Usage]. The server knows how -much to read because it is specified in the *size* field of the 9pfs -header. The server processes the request and sends back a response on -the *in* array of the same ring, following the producer workflow as -usual. Thus, every request/response pair is on one ring. - -The client receives a notification and reads the response from the *in* -array. The client knows how much data to read because it is specified in -the *size* field of the 9pfs header. 
- - -[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf -[website]: https://github.com/chaos/diod/blob/master/protocol.md -[XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html -[ring.h]: http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD diff --git a/docs/misc/9pfs.pandoc b/docs/misc/9pfs.pandoc new file mode 100644 index 0000000000..a4dc86f639 --- /dev/null +++ b/docs/misc/9pfs.pandoc @@ -0,0 +1,419 @@ +# Xen transport for 9pfs version 1 + +## Background + +9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very +simple and describes a series of commands and responses. It is +completely independent from the communication channels, in fact many +clients and servers support multiple channels, usually called +"transports". For example the Linux client supports tcp and unix +sockets, fds, virtio and rdma. + + +### 9pfs protocol + +This document won't cover the full 9pfs specification. Please refer to +this [paper] and this [website] for a detailed description of it. +However it is useful to know that each 9pfs request and response has the +following header: + + struct header { + uint32_t size; + uint8_t id; + uint16_t tag; + } __attribute__((packed)); + + 0 4 5 7 + +---------+--+----+ + | size |id|tag | + +---------+--+----+ + +- *size* +The size of the request or response. + +- *id* +The 9pfs request or response operation. + +- *tag* +Unique id that identifies a specific request/response pair. It is used +to multiplex operations on a single channel. + +It is possible to have multiple requests in-flight at any given time. + + +## Rationale + +This document describes a Xen based transport for 9pfs, in the +traditional PV frontend and backend format. The PV frontend is used by +the client to send commands to the server. 
The PV backend is used by the +9pfs server to receive commands from clients and send back responses. + +The transport protocol supports multiple rings up to the maximum +supported by the backend. The size of every ring is also configurable +and can span multiple pages, up to the maximum supported by the backend +(although it cannot be more than 2MB). The design is to exploit +parallelism at the vCPU level and support multiple outstanding requests +simultaneously. + +This document does not cover the 9pfs client/server design or +implementation, only the transport for it. + + +## Xenstore + +The frontend and the backend connect via xenstore to exchange +information. The toolstack creates front and back nodes with state +[XenbusStateInitialising]. The protocol node name is **9pfs**. + +Multiple rings are supported for each frontend and backend connection. + +### Backend XenBus Nodes + +Backend specific properties, written by the backend, read by the +frontend: + + versions + Values: <string> + + List of comma separated protocol versions supported by the backend. + For example "1,2,3". Currently the value is just "1", as there is + only one version. N.B.: this is the version of the Xen transport + protocol, not the version of 9pfs supported by the server. + + max-rings + Values: <uint32_t> + + The maximum supported number of rings per frontend. + + max-ring-page-order + Values: <uint32_t> + + The maximum supported size of a memory allocation in units of + log2(machine pages), e.g. 1 == 2 pages, 2 == 4 pages, etc. It + must be at least 1. + +Backend configuration nodes, written by the toolstack, read by the +backend: + + path + Values: <string> + + Host filesystem path to share. + + tag + Values: <string> + + Alphanumeric tag that identifies the 9pfs share. The client needs + to know the tag to be able to mount it. 
+ + security-model + Values: "none" + + *none*: files are stored using the same credentials as they are + created on the guest (no user ownership squash or remap) + Only "none" is supported in this version of the protocol. + +### Frontend XenBus Nodes + + version + Values: <string> + + Protocol version, chosen among the ones supported by the backend + (see **versions** under [Backend XenBus Nodes]). Currently the + value must be "1". + + num-rings + Values: <uint32_t> + + Number of rings. It needs to be lower or equal to max-rings. + + event-channel-<num> (event-channel-0, event-channel-1, etc) + Values: <uint32_t> + + The identifier of the Xen event channel used to signal activity + in the ring buffer. One for each ring. + + ring-ref<num> (ring-ref0, ring-ref1, etc) + Values: <uint32_t> + + The Xen grant reference granting permission for the backend to + map a page with information to setup a share ring. One for each + ring. + +### State Machine + +Initialization: + + *Front* *Back* + XenbusStateInitialising XenbusStateInitialising + - Query virtual device - Query backend device + properties. identification data. + - Setup OS device instance. - Publish backend features + - Allocate and initialize the and transport parameters + request ring. | + - Publish transport parameters | + that will be in effect during V + this connection. XenbusStateInitWait + | + | + V + XenbusStateInitialised + + - Query frontend transport parameters. + - Connect to the request ring and + event channel. + | + | + V + XenbusStateConnected + + - Query backend device properties. + - Finalize OS virtual device + instance. + | + | + V + XenbusStateConnected + +Once frontend and backend are connected, they have a shared page per +ring, which are used to setup the rings, and an event channel per ring, +which are used to send notifications. 
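The per-ring frontend nodes listed under Frontend XenBus Nodes use two slightly different naming patterns: `event-channel-<num>` has a dash before the number, while `ring-ref<num>` does not. A minimal sketch of how a frontend might build these key names and check **num-rings** against **max-rings** follows. The function names are hypothetical and not part of the protocol, and treating zero rings as invalid is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats "ring-ref<num>" (no dash before the number).
 * Returns 0 on success, -1 if the key did not fit in buf. */
static int xen_9pfs_ring_ref_key(char *buf, size_t len, unsigned int num)
{
    int n = snprintf(buf, len, "ring-ref%u", num);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: formats "event-channel-<num>" (dash before the number).
 * Returns 0 on success, -1 if the key did not fit in buf. */
static int xen_9pfs_evtchn_key(char *buf, size_t len, unsigned int num)
{
    int n = snprintf(buf, len, "event-channel-%u", num);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* num-rings must be lower than or equal to the backend's max-rings.
 * Assumption of this sketch: a connection needs at least one ring. */
static int xen_9pfs_num_rings_ok(uint32_t num_rings, uint32_t max_rings)
{
    return num_rings >= 1 && num_rings <= max_rings;
}
```

A frontend would write one key of each kind per ring, for ring numbers 0 to num-rings - 1.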
+ +Shutdown: + + *Front* *Back* + XenbusStateConnected XenbusStateConnected + | + | + V + XenbusStateClosing + + - Unmap grants + - Unbind evtchns + | + | + V + XenbusStateClosing + + - Unbind evtchns + - Free rings + - Free data structures + | + | + V + XenbusStateClosed + + - Free remaining data structures + | + | + V + XenbusStateClosed + + +## Ring Setup + +The shared page has the following layout: + + typedef uint32_t XEN_9PFS_RING_IDX; + + struct xen_9pfs_intf { + XEN_9PFS_RING_IDX in_cons, in_prod; + uint8_t pad1[56]; + XEN_9PFS_RING_IDX out_cons, out_prod; + uint8_t pad2[56]; + + uint32_t ring_order; + /* this is an array of (1 << ring_order) elements */ + grant_ref_t ref[1]; + }; + + /* not actually C compliant (ring_order changes from ring to ring) */ + struct ring_data { + char in[((1 << ring_order) << PAGE_SHIFT) / 2]; + char out[((1 << ring_order) << PAGE_SHIFT) / 2]; + }; + +- **ring_order** + It represents the order of the data ring. The following list of grant + references is of `(1 << ring_order)` elements. It cannot be greater than + **max-ring-page-order**, as specified by the backend on XenBus. +- **ref[]** + The list of grant references which will contain the actual data. They are + mapped contiguously in virtual memory. The first half of the pages is the + **in** array, the second half is the **out** array. The array must + have a power of two number of elements. +- **out** is an array used as circular buffer + It contains client requests. The producer is the frontend, the + consumer is the backend. +- **in** is an array used as circular buffer + It contains server responses. The producer is the backend, the + consumer is the frontend. +- **out_cons**, **out_prod** + Consumer and producer indices for client requests. They keep track of + how much data has been written by the frontend to **out** and how much + data has already been consumed by the backend. **out_prod** is + increased by the frontend, after writing data to **out**. 
**out_cons** + is increased by the backend, after reading data from **out**. +- **in_cons** and **in_prod** + Consumer and producer indices for responses. They keep track of how + much data has already been consumed by the frontend from the **in** + array. **in_prod** is increased by the backend, after writing data to + **in**. **in_cons** is increased by the frontend, after reading data + from **in**. + +The binary layout of `struct xen_9pfs_intf` follows: + + 0 4 8 64 68 72 76 + +---------+---------+-----//-----+---------+---------+---------+ + | in_cons | in_prod | padding |out_cons |out_prod |ring_orde| + +---------+---------+-----//-----+---------+---------+---------+ + + 76 80 84 4092 4096 + +---------+---------+----//---+---------+ + | ref[0] | ref[1] | | ref[N] | + +---------+---------+----//---+---------+ + +**N.B.** For one page, N is at most 991, i.e. (4096-132)/4, but given that N +needs to be a power of two, the actual maximum N is 512. As 512 == (1 << 9), +the maximum possible max-ring-page-order value is 9. + +The binary layout of the ring buffers follows: + + 0 ((1<<ring_order)<<PAGE_SHIFT)/2 ((1<<ring_order)<<PAGE_SHIFT) + +------------//-------------+------------//-------------+ + | in | out | + +------------//-------------+------------//-------------+ + +## Why ring.h is not needed + +Many Xen PV protocols use the macros provided by [ring.h] to manage +their shared ring for communication. This protocol does not, because it +actually comes with two rings: the **in** ring and the **out** ring. +Each of them is mono-directional, and there is no static request size: +the producer writes opaque data to the ring. On the other hand, in +[ring.h] they are combined, and the request size is static and +well-known. In this protocol: + + in -> backend to frontend only + out-> frontend to backend only + +In the case of the **in** ring, the frontend is the consumer, and the +backend is the producer. Everything is the same but mirrored for the +**out** ring. 
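The size arithmetic from the Ring Setup section can be checked with a short sketch. It assumes 4 KiB pages and a 4-byte `grant_ref_t`, matching the one-page calculation in the text. The helper names are hypothetical and only illustrate the layout; they are not part of the protocol headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, as in the one-page arithmetic in the text. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  ((size_t)1 << PAGE_SHIFT)

/* Byte offset of ref[0] in the shared page: two index pairs, each followed
 * by a 56-byte pad, then the 4-byte ring_order field. */
#define XEN_9PFS_REF_OFFSET (2 * (2 * sizeof(uint32_t) + 56) + sizeof(uint32_t))

/* Size in bytes of each of the in and out arrays for a given ring_order. */
static size_t xen_9pfs_array_size(uint32_t ring_order)
{
    return (((size_t)1 << ring_order) << PAGE_SHIFT) / 2;
}

/* Largest ring_order whose (1 << ring_order) grant references still fit
 * in a single shared page (references assumed to be 4 bytes each). */
static uint32_t xen_9pfs_max_order_one_page(void)
{
    size_t max_refs = (PAGE_SIZE - XEN_9PFS_REF_OFFSET) / sizeof(uint32_t);
    uint32_t order = 0;

    while (((size_t)1 << (order + 1)) <= max_refs)
        order++;
    return order;
}

/* Start of the out array inside the contiguously mapped data pages:
 * the in array occupies the first half, the out array the second half. */
static char *xen_9pfs_out_array(char *data, uint32_t ring_order)
{
    assert(data != NULL);
    return data + xen_9pfs_array_size(ring_order);
}
```

With these assumptions the reference list starts at byte 132, at most 991 references fit in the page, and the largest power of two not exceeding that is 512, which gives an order of 9 and a 2MB data area.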
+ +The producer, the backend in this case, never reads from the **in** +ring. In fact, the producer doesn't need any notifications unless the +ring is full. This version of the protocol doesn't take advantage of it, +leaving room for optimizations. + +On the other hand, the consumer always requires notifications, unless it +is already actively reading from the ring. The producer can figure it +out, without any additional fields in the protocol, by comparing the +indexes at the beginning and the end of the function. This is similar to +what [ring.h] does. + +## Ring Usage + +The **in** and **out** arrays are used as circular buffers: + + 0 sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2 + +-----------------------------------+ + |to consume| free |to consume | + +-----------------------------------+ + ^ ^ + prod cons + + 0 sizeof(array) + +-----------------------------------+ + | free | to consume | free | + +-----------------------------------+ + ^ ^ + cons prod + +The following functions are provided to read and write to an array: + + #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1)) + + static inline void xen_9pfs_read(char *buf, + XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons, + uint8_t *h, size_t len) { + if (*masked_cons < *masked_prod) { + memcpy(h, buf + *masked_cons, len); + } else { + if (len > XEN_9PFS_RING_SIZE - *masked_cons) { + memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons); + memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf, len - (XEN_9PFS_RING_SIZE - *masked_cons)); + } else { + memcpy(h, buf + *masked_cons, len); + } + } + *masked_cons = MASK_XEN_9PFS_IDX(*masked_cons + len); + } + + static inline void xen_9pfs_write(char *buf, + XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons, + uint8_t *opaque, size_t len) { + if (*masked_prod < *masked_cons) { + memcpy(buf + *masked_prod, opaque, len); + } else { + if (len > XEN_9PFS_RING_SIZE - *masked_prod) { + memcpy(buf + 
*masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod); + memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod), len - (XEN_9PFS_RING_SIZE - *masked_prod)); + } else { + memcpy(buf + *masked_prod, opaque, len); + } + } + *masked_prod = MASK_XEN_9PFS_IDX(*masked_prod + len); + } + +The producer (the backend for **in**, the frontend for **out**) writes to the +array in the following way: + +- read *cons*, *prod* from shared memory +- general memory barrier +- verify *prod* against local copy (consumer shouldn't change it) +- write to array at position *prod* up to *cons*, wrapping around the circular + buffer when necessary +- write memory barrier +- increase *prod* +- notify the other end via event channel + +The consumer (the backend for **out**, the frontend for **in**) reads from the +array in the following way: + +- read *prod*, *cons* from shared memory +- read memory barrier +- verify *cons* against local copy (producer shouldn't change it) +- read from array at position *cons* up to *prod*, wrapping around the circular + buffer when necessary +- general memory barrier +- increase *cons* +- notify the other end via event channel + +The producer takes care of writing only as many bytes as available in the buffer +up to *cons*. The consumer takes care of reading only as many bytes as available +in the buffer up to *prod*. + + +## Request/Response Workflow + +The client chooses one of the available rings, then it sends a request +to the other end on the *out* array, following the producer workflow +described in [Ring Usage]. + +The server receives the notification and reads the request, following +the consumer workflow described in [Ring Usage]. The server knows how +much to read because it is specified in the *size* field of the 9pfs +header. The server processes the request and sends back a response on +the *in* array of the same ring, following the producer workflow as +usual. Thus, every request/response pair is on one ring. 
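As a rough illustration of how a consumer can learn the message length, the sketch below decodes the 7-byte 9pfs header from bytes that have already been copied out of the ring. It relies on two properties of 9P: integers are little-endian on the wire, and *size* counts the whole message including the header. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define P9_HEADER_LEN 7 /* size[4] + id[1] + tag[2] */

/* Unpacked copy of the header fields (hypothetical type name). */
struct p9_header {
    uint32_t size; /* total message length, header included */
    uint8_t  id;   /* request or response operation */
    uint16_t tag;  /* identifies the request/response pair */
};

/* Decodes a header from buf, which must hold bytes already copied out of
 * the ring. Returns 0 on success, -1 if fewer than 7 bytes are available
 * or if the advertised size is smaller than the header itself. */
static int p9_decode_header(const uint8_t *buf, size_t avail,
                            struct p9_header *out)
{
    if (avail < P9_HEADER_LEN)
        return -1;

    out->size = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
                ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    out->id = buf[4];
    out->tag = (uint16_t)(buf[5] | (buf[6] << 8));

    return out->size < P9_HEADER_LEN ? -1 : 0;
}

/* Bytes of the message that follow the header. */
static uint32_t p9_payload_len(const struct p9_header *h)
{
    return h->size - P9_HEADER_LEN;
}
```

After decoding the header, the reader would wait until `p9_payload_len()` further bytes are available in the ring before consuming the rest of the message.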
+ +The client receives a notification and reads the response from the *in* +array. The client knows how much data to read because it is specified in +the *size* field of the 9pfs header. + + +[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf +[website]: https://github.com/chaos/diod/blob/master/protocol.md +[XenbusStateInitialising]: http://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html +[ring.h]: http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD diff --git a/docs/misc/coverage.markdown b/docs/misc/coverage.markdown deleted file mode 100644 index 3554659fe4..0000000000 --- a/docs/misc/coverage.markdown +++ /dev/null @@ -1,124 +0,0 @@ -# Coverage support for Xen - -Coverage support allows you to get coverage information from Xen execution. -You can see how many times a line is executed. - -Some compilers have specific options that enable the collection of this -information. Every basic block in the code will be instrumented by the compiler -to compute these statistics. It should not be used in production as it slows -down your hypervisor. - -# GCOV (GCC coverage) - -## Enable coverage - -Test coverage support can be turned on compiling Xen with the `CONFIG_COVERAGE` -option set to `y`. - -Change your `.config` or run `make -C xen menuconfig`. - -## Extract coverage data - -To extract data you use a simple utility called `xencov`. -It allows you to do 2 operations: - -* `xencov read` extract data -* `xencov reset` reset all coverage counters - -Another utility (`xencov_split`) is used to split extracted data file into -files needed by userspace tools. - -## Split coverage data - -Once you extracted data from Xen, it is time to create files which the coverage -tools can understand. To do it you need to run `xencov_split` utility. 
- -The utility just takes an input file and splits the blob into gcc .gcda files -in the same directory that you execute the script. As file names are generated -relative to the current directory, it could be a good idea to run the script -from `/` on your build machine. - -Code for splitting the blob is put in another utility for some reason: -* It is simpler to maintain a high level script than a C program; -* You don't need to execute on the Xen host so you just need to copy the file to - your development box (you usually need development files anyway). - -## Possible use - -**This section is just an example on how to use these tools!** - -This example assumes you compiled Xen from `~/xen-unstable` and installed into -the host. **Consider that if you even recompile Xen you are not able to use -blob extracted from xencov!** - -* Ensure the `lcov` package is installed -* From the Xen host machine extract the coverage blob - - cd /root - xencov read coverage.dat - -* Copy the extracted blob to your dev machine - - cd ~ - scp root@myhost:coverage.dat - -* Extract the coverage information - - (cd / && xencov_split ~/coverage.dat) - -* Produce coverage html output - - cd ~/xen-unstable - rm -rf cov.info cov - geninfo -o cov.info xen - mkdir cov - genhtml -o cov cov.info - -* See output in a browser - - firefox cov/index.html - -# LLVM coverage - -## Enable coverage - -Coverage can be enabled using a Kconfig option, from the top-level directory -use the following command to display the Kconfig menu: - - make -C xen menuconfig clang=y - -The code coverage option can be found inside of the "Debugging Options" -section. After enabling it just compile Xen as you would normally do: - - make xen clang=y - -## Extract coverage data - -LLVM coverage can be extracted from the hypervisor using the `xencov` tool. 
-The following actions are available: - -* `xencov read` extract data -* `xencov reset` reset all coverage counters -* `xencov read-reset` extract data and reset counters at the same time. - -## Possible use - -**This section is just an example on how to use these tools!** - -This example assumes you compiled Xen and copied the xen-syms file from -xen/xen-syms into your current directory. - -* Extract the coverage data from Xen: - - xencov read xen.profraw - -* Convert the data into a profile. Note that you can merge more than one - profraw file into a single profdata file. - - llvm-profdata merge xen.profraw -o xen.profdata - -* Generate a HTML report of the code coverage: - - llvm-cov show -format=html -output-dir=cov/ xen-syms -instr-profile=xen.profdata - -* Open cov/index.html with your browser in order to display the profile. diff --git a/docs/misc/coverage.pandoc b/docs/misc/coverage.pandoc new file mode 100644 index 0000000000..3554659fe4 --- /dev/null +++ b/docs/misc/coverage.pandoc @@ -0,0 +1,124 @@ +# Coverage support for Xen + +Coverage support allows you to get coverage information from Xen execution. +You can see how many times a line is executed. + +Some compilers have specific options that enable the collection of this +information. Every basic block in the code will be instrumented by the compiler +to compute these statistics. It should not be used in production as it slows +down your hypervisor. + +# GCOV (GCC coverage) + +## Enable coverage + +Test coverage support can be turned on compiling Xen with the `CONFIG_COVERAGE` +option set to `y`. + +Change your `.config` or run `make -C xen menuconfig`. + +## Extract coverage data + +To extract data you use a simple utility called `xencov`. +It allows you to do 2 operations: + +* `xencov read` extract data +* `xencov reset` reset all coverage counters + +Another utility (`xencov_split`) is used to split extracted data file into +files needed by userspace tools. 
+ +## Split coverage data + +Once you extracted data from Xen, it is time to create files which the coverage +tools can understand. To do it you need to run `xencov_split` utility. + +The utility just takes an input file and splits the blob into gcc .gcda files +in the same directory that you execute the script. As file names are generated +relative to the current directory, it could be a good idea to run the script +from `/` on your build machine. + +Code for splitting the blob is put in another utility for some reason: +* It is simpler to maintain a high level script than a C program; +* You don't need to execute on the Xen host so you just need to copy the file to + your development box (you usually need development files anyway). + +## Possible use + +**This section is just an example on how to use these tools!** + +This example assumes you compiled Xen from `~/xen-unstable` and installed into +the host. **Consider that if you even recompile Xen you are not able to use +blob extracted from xencov!** + +* Ensure the `lcov` package is installed +* From the Xen host machine extract the coverage blob + + cd /root + xencov read coverage.dat + +* Copy the extracted blob to your dev machine + + cd ~ + scp root@myhost:coverage.dat + +* Extract the coverage information + + (cd / && xencov_split ~/coverage.dat) + +* Produce coverage html output + + cd ~/xen-unstable + rm -rf cov.info cov + geninfo -o cov.info xen + mkdir cov + genhtml -o cov cov.info + +* See output in a browser + + firefox cov/index.html + +# LLVM coverage + +## Enable coverage + +Coverage can be enabled using a Kconfig option, from the top-level directory +use the following command to display the Kconfig menu: + + make -C xen menuconfig clang=y + +The code coverage option can be found inside of the "Debugging Options" +section. 
After enabling it just compile Xen as you would normally do: + + make xen clang=y + +## Extract coverage data + +LLVM coverage can be extracted from the hypervisor using the `xencov` tool. +The following actions are available: + +* `xencov read` extract data +* `xencov reset` reset all coverage counters +* `xencov read-reset` extract data and reset counters at the same time. + +## Possible use + +**This section is just an example on how to use these tools!** + +This example assumes you compiled Xen and copied the xen-syms file from +xen/xen-syms into your current directory. + +* Extract the coverage data from Xen: + + xencov read xen.profraw + +* Convert the data into a profile. Note that you can merge more than one + profraw file into a single profdata file. + + llvm-profdata merge xen.profraw -o xen.profdata + +* Generate a HTML report of the code coverage: + + llvm-cov show -format=html -output-dir=cov/ xen-syms -instr-profile=xen.profdata + +* Open cov/index.html with your browser in order to display the profile. diff --git a/docs/misc/efi.markdown b/docs/misc/efi.markdown deleted file mode 100644 index 5b54314134..0000000000 --- a/docs/misc/efi.markdown +++ /dev/null @@ -1,118 +0,0 @@ -For x86, building xen.efi requires gcc 4.5.x or above (4.6.x or newer -recommended, as 4.5.x was probably never really tested for this purpose) and -binutils 2.22 or newer. Additionally, the binutils build must be configured to -include support for the x86_64-pep emulation (i.e. -`--enable-targets=x86_64-pep` or an option of equivalent effect should be -passed to the configure script). - -For arm64, the PE/COFF header is open-coded in assembly, so no toolchain -support for PE/COFF is required. Also, the PE/COFF header co-exists with the -normal Image format, so a single binary may be booted as an Image file or as an -EFI application. 
When booted as an EFI application, Xen requires a -configuration file as described below unless a bootloader, such as GRUB, has -loaded the modules and describes them in the device tree provided to Xen. If a -bootloader provides a device tree containing modules then any configuration -files are ignored, and the bootloader is responsible for populating all -relevant device tree nodes. - -Once built, `make install-xen` will place the resulting binary directly into -the EFI boot partition, provided `EFI_VENDOR` is set in the environment (and -`EFI_MOUNTPOINT` is overridden as needed, should the default of `/boot/efi` not -match your system). The xen.efi binary will also be installed in -`/usr/lib64/efi/`, unless `EFI_DIR` is set in the environment to override this -default. - -The binary itself will require a configuration file (names with the `.efi` -extension of the binary's name replaced by `.cfg`, and - until an existing -file is found - trailing name components dropped at `.`, `-`, and `_` -separators will be tried) to be present in the same directory as the binary. -(To illustrate the name handling, a binary named `xen-4.2-unstable.efi` would -try `xen-4.2-unstable.cfg`, `xen-4.2.cfg`, `xen-4.cfg`, and `xen.cfg` in -order.) One can override this with a command line option (`-cfg=<filename>`). -This configuration file and EFI commandline are only used for booting directly -from EFI firmware, or when using an EFI loader that does not support -the multiboot2 protocol. When booting using GRUB or another multiboot aware -loader the EFI commandline is ignored and all information is passed from -the loader to Xen using the multiboot protocol. - -The configuration file consists of one or more sections headed by a section -name enclosed in square brackets, with individual values specified in each -section. 
A section named `[global]` is treated specially to allow certain -settings to apply to all other sections (or to provide defaults for certain -settings in case individual sections don't specify them). This file (for now) -needs to be of ASCII type and not e.g. UTF-8 or UTF-16. A typical file would -thus look like this (`#` serving as comment character): - - **************************example begin****************************** - - [global] - default=sle11sp2 - - [sle11sp2] - options=console=vga,com1 com1=57600 loglvl=all noreboot - kernel=vmlinuz-3.0.31-0.4-xen [domain 0 command line options] - ramdisk=initrd-3.0.31-0.4-xen - - **************************example end******************************** - -The individual values used here are: - -###`default=<name>` - -Specifies the section to use for booting, if none was specified on the command -line; only meaningful in the `[global]` section. This isn't required; if -absent, section headers will be ignored and for each value looked for the -first instance within the file will be used. - -###`options=<text>` - -Specifies the options passed to the hypervisor, see [Xen Hypervisor Command -Line Options](xen-command-line.html). - -###`kernel=<filename>[ <options>]` - -Specifies the Dom0 kernel binary and the options to pass to it. - -The options should in general be the same as is used when booting -natively, e.g. including `root=...` etc. - -Check your bootloader (e.g. grub) configuration or `/proc/cmdline` for -the native configuration. - -###`ramdisk=<filename>` - -Specifies a Linux-style initial RAM disk image to load. - -Other values to specify are: - -###`video=gfx-<xres>[x<yres>[x<depth>]]` - -Specifies a video mode to select if available. In case of problems, the -`-basevideo` command line option can be used to skip altering video modes. - -###`xsm=<filename>` - -Specifies an XSM module to load. - -###`ucode=<filename>` - -Specifies a CPU microcode blob to load. 
(x86 only) - -###`dtb=<filename>` - -Specifies a device tree file to load. The platform firmware may provide a -DTB in an EFI configuration table, so this field is optional in that -case. A dtb specified in the configuration file will override a device tree -provided in the EFI configuration table. (ARM only) - -###`chain=<filename>` - -Specifies an alternate configuration file to use in case the specified section -(and in particular its `kernel=` setting) can't be found in the default (or -specified) configuration file. This is only meaningful in the [global] section -and really not meant to be used together with the `-cfg=` command line option. - -Filenames must be specified relative to the location of the EFI binary. - -Extra options to be passed to Xen can also be specified on the command line, -following a `--` separator option. diff --git a/docs/misc/efi.pandoc b/docs/misc/efi.pandoc new file mode 100644 index 0000000000..23c1a2732d --- /dev/null +++ b/docs/misc/efi.pandoc @@ -0,0 +1,118 @@ +For x86, building xen.efi requires gcc 4.5.x or above (4.6.x or newer +recommended, as 4.5.x was probably never really tested for this purpose) and +binutils 2.22 or newer. Additionally, the binutils build must be configured to +include support for the x86_64-pep emulation (i.e. +`--enable-targets=x86_64-pep` or an option of equivalent effect should be +passed to the configure script). + +For arm64, the PE/COFF header is open-coded in assembly, so no toolchain +support for PE/COFF is required. Also, the PE/COFF header co-exists with the +normal Image format, so a single binary may be booted as an Image file or as an +EFI application. When booted as an EFI application, Xen requires a +configuration file as described below unless a bootloader, such as GRUB, has +loaded the modules and describes them in the device tree provided to Xen. 
If a +bootloader provides a device tree containing modules then any configuration +files are ignored, and the bootloader is responsible for populating all +relevant device tree nodes. + +Once built, `make install-xen` will place the resulting binary directly into +the EFI boot partition, provided `EFI_VENDOR` is set in the environment (and +`EFI_MOUNTPOINT` is overridden as needed, should the default of `/boot/efi` not +match your system). The xen.efi binary will also be installed in +`/usr/lib64/efi/`, unless `EFI_DIR` is set in the environment to override this +default. + +The binary itself will require a configuration file (names with the `.efi` +extension of the binary's name replaced by `.cfg`, and - until an existing +file is found - trailing name components dropped at `.`, `-`, and `_` +separators will be tried) to be present in the same directory as the binary. +(To illustrate the name handling, a binary named `xen-4.2-unstable.efi` would +try `xen-4.2-unstable.cfg`, `xen-4.2.cfg`, `xen-4.cfg`, and `xen.cfg` in +order.) One can override this with a command line option (`-cfg=<filename>`). +This configuration file and EFI commandline are only used for booting directly +from EFI firmware, or when using an EFI loader that does not support +the multiboot2 protocol. When booting using GRUB or another multiboot aware +loader the EFI commandline is ignored and all information is passed from +the loader to Xen using the multiboot protocol. + +The configuration file consists of one or more sections headed by a section +name enclosed in square brackets, with individual values specified in each +section. A section named `[global]` is treated specially to allow certain +settings to apply to all other sections (or to provide defaults for certain +settings in case individual sections don't specify them). This file (for now) +needs to be of ASCII type and not e.g. UTF-8 or UTF-16. 
A typical file would +thus look like this (`#` serving as comment character): + + **************************example begin****************************** + + [global] + default=sle11sp2 + + [sle11sp2] + options=console=vga,com1 com1=57600 loglvl=all noreboot + kernel=vmlinuz-3.0.31-0.4-xen [domain 0 command line options] + ramdisk=initrd-3.0.31-0.4-xen + + **************************example end******************************** + +The individual values used here are: + +###`default=<name>` + +Specifies the section to use for booting, if none was specified on the command +line; only meaningful in the `[global]` section. This isn't required; if +absent, section headers will be ignored and for each value looked for the +first instance within the file will be used. + +###`options=<text>` + +Specifies the options passed to the hypervisor, see [Xen Hypervisor Command +Line Options](xen-command-line.html). + +###`kernel=<filename>[ <options>]` + +Specifies the Dom0 kernel binary and the options to pass to it. + +The options should in general be the same as is used when booting +natively, e.g. including `root=...` etc. + +Check your bootloader (e.g. grub) configuration or `/proc/cmdline` for +the native configuration. + +###`ramdisk=<filename>` + +Specifies a Linux-style initial RAM disk image to load. + +Other values to specify are: + +###`video=gfx-<xres>[x<yres>[x<depth>]]` + +Specifies a video mode to select if available. In case of problems, the +`-basevideo` command line option can be used to skip altering video modes. + +###`xsm=<filename>` + +Specifies an XSM module to load. + +###`ucode=<filename>` + +Specifies a CPU microcode blob to load. (x86 only) + +###`dtb=<filename>` + +Specifies a device tree file to load. The platform firmware may provide a +DTB in an EFI configuration table, so this field is optional in that +case. A dtb specified in the configuration file will override a device tree +provided in the EFI configuration table. 
(ARM only) + +###`chain=<filename>` + +Specifies an alternate configuration file to use in case the specified section +(and in particular its `kernel=` setting) can't be found in the default (or +specified) configuration file. This is only meaningful in the [global] section +and really not meant to be used together with the `-cfg=` command line option. + +Filenames must be specified relative to the location of the EFI binary. + +Extra options to be passed to Xen can also be specified on the command line, +following a `--` separator option. diff --git a/docs/misc/hvm-emulated-unplug.markdown b/docs/misc/hvm-emulated-unplug.markdown deleted file mode 100644 index f6b27ed04f..0000000000 --- a/docs/misc/hvm-emulated-unplug.markdown +++ /dev/null @@ -1,97 +0,0 @@ -#Xen HVM emulated device unplug protocol - -The protocol covers three basic things: - - * Disconnecting emulated devices. - * Getting log messages out of the drivers and into dom0. - * Allowing dom0 to block the loading of specific drivers. This is - intended as a backwards-compatibility thing: if we discover a bug - in some old version of the drivers, then rather than working around - it in Xen, we have the option of just making those drivers fall - back to emulated mode. - -The current protocol works like this (from the point of view of -drivers): - -1. When the drivers first come up, they check whether the unplug logic - is available by reading a two-byte magic number from IO port `0x10`. - These should be `0x49d2`. If the magic number doesn't match, the - drivers don't do anything. - -2. The drivers read a one-byte protocol version from IO port `0x12`. If - this is 0, skip to 6. - -3. The drivers write a two-byte product number to IO port `0x12`. At - the moment, the only drivers using this protocol are our - closed-source ones, which use product number 1. - -4. The drivers write a four-byte build number to IO port `0x10`. - -5. The drivers check the magic number by reading two bytes from `0x10` - again. 
If it's changed from `0x49d2` to `0xd249`, the drivers are - blacklisted and should not load. - -6. The drivers write a two-byte bitmask of devices to unplug to IO - port `0x10`. The defined bits are: - - * `0` -- All emulated IDE and SCSI disks (not including CD drives). - * `1` -- All emulated NICs. - * `2` -- All IDE disks except for the primary master (not including CD - drives). This is overridden by bit 0. - * `3` -- All emulated NVMe disks. - - The relevant emulated devices then disappear from the relevant - buses. For most guest operating systems, you want to do this - before device enumeration happens. - -Once the drivers have checked the magic number, they can send log -messages to qemu which will be logged to wherever qemu's logs go -(`/var/log/xen/qemu-dm.log` on normal Xen, dom0 syslog on XenServer). -These messages are written to IO port `0x12` a byte at a time, and are -terminated by newlines. There's a fairly aggressive rate limiter on -these messages, so they shouldn't be used for anything even vaguely -high-volume, but they're rather useful for debugging and support. - -It is still permitted for a driver to use this logging feature if it -is blacklisted, but *ONLY* if it has checked the magic number and found -it to be `0x49d2` or `0xd249`. - -This isn't exactly a pretty protocol, but it does solve the problem. - -The blacklist is, from qemu's point of view, handled mostly through -xenstore. A driver version is considered to be blacklisted if -`/mh/driver-blacklist/{product_name}/{build_number}` exists and is -readable, where `{build_number}` is the build number from step 4 as a -decimal number. `{product_name}` is a string corresponding to the -product number in step 3. - -The master registry of product names and numbers is in -xen/include/public/hvm/pvdrivers.h. 
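The numbered driver-side steps described above can be sketched in C. This is only an illustration, not code from Xen or from any driver. The port numbers, magic values and bit assignments come from the text; `struct port_io`, `unplug_handshake()`, the `UNPLUG_*` names and the `fake_platform` device model are all hypothetical names introduced here. Real port I/O is replaced by callbacks so the sketch can be exercised without hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the protocol description; the macro names are made up. */
#define UNPLUG_PORT_MAGIC        0x10   /* read: magic; write: build number or unplug mask */
#define UNPLUG_PORT_VERSION      0x12   /* read: protocol version; write: product number */
#define UNPLUG_MAGIC             0x49d2
#define UNPLUG_MAGIC_BLACKLISTED 0xd249

#define UNPLUG_ALL_IDE_SCSI_DISKS (1u << 0)
#define UNPLUG_ALL_NICS           (1u << 1)
#define UNPLUG_AUX_IDE_DISKS      (1u << 2)
#define UNPLUG_NVME_DISKS         (1u << 3)

/* Hypothetical port-I/O hooks; width is the access size in bytes. */
struct port_io {
    uint32_t (*in)(void *ctx, uint16_t port, unsigned int width);
    void (*out)(void *ctx, uint16_t port, unsigned int width, uint32_t val);
    void *ctx;
};

enum unplug_result { UNPLUG_DONE, UNPLUG_NOT_AVAILABLE, UNPLUG_BLACKLISTED };

static enum unplug_result unplug_handshake(const struct port_io *io,
                                           uint16_t product, uint32_t build,
                                           uint16_t mask)
{
    assert(io != NULL && io->in != NULL && io->out != NULL);

    /* Step 1: the unplug logic is present only if the magic matches. */
    if (io->in(io->ctx, UNPLUG_PORT_MAGIC, 2) != UNPLUG_MAGIC)
        return UNPLUG_NOT_AVAILABLE;

    /* Step 2: protocol version 0 skips straight to the unplug request. */
    if (io->in(io->ctx, UNPLUG_PORT_VERSION, 1) != 0) {
        io->out(io->ctx, UNPLUG_PORT_VERSION, 2, product);   /* step 3 */
        io->out(io->ctx, UNPLUG_PORT_MAGIC, 4, build);       /* step 4 */

        /* Step 5: a byte-swapped magic means this build is blacklisted. */
        if (io->in(io->ctx, UNPLUG_PORT_MAGIC, 2) == UNPLUG_MAGIC_BLACKLISTED)
            return UNPLUG_BLACKLISTED;
    }

    /* Step 6: request the unplug of the selected device classes. */
    io->out(io->ctx, UNPLUG_PORT_MAGIC, 2, mask);
    return UNPLUG_DONE;
}

/* Toy device model (hypothetical) that blacklists exactly one build number. */
struct fake_platform {
    uint16_t magic;
    uint8_t version;
    uint32_t blacklisted_build;
    uint16_t unplug_mask;
};

static uint32_t fake_in(void *ctx, uint16_t port, unsigned int width)
{
    struct fake_platform *dev = ctx;

    if (port == UNPLUG_PORT_MAGIC && width == 2)
        return dev->magic;
    if (port == UNPLUG_PORT_VERSION && width == 1)
        return dev->version;
    return ~0u;   /* unclaimed port: all ones */
}

static void fake_out(void *ctx, uint16_t port, unsigned int width, uint32_t val)
{
    struct fake_platform *dev = ctx;

    if (port == UNPLUG_PORT_MAGIC && width == 4 && val == dev->blacklisted_build)
        dev->magic = UNPLUG_MAGIC_BLACKLISTED;
    else if (port == UNPLUG_PORT_MAGIC && width == 2)
        dev->unplug_mask = (uint16_t)val;
}
```

A caller would fill a `struct port_io` with its platform's port accessors and pass the OR of the wanted `UNPLUG_*` bits as `mask`; the fake device only stands in for the device side so that the ordering of reads and writes can be checked.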
- -NOTE: The IO ports implementing the unplug protocol are implemented -as part of the Xen Platform PCI Device, so if that device is not -present in the system then this protocol will not work. - - -Unplug protocol for old SUSE PVonHVM - -During xen-3.0.4 timeframe an unofficial unplug protocol was added to -the xen-platform-pci kernel module. The value 0x1 was written to offset -0x4 in the memory region of the Xen Platform PCI Device. This was done -unconditionally. The corresponding code in qemu-xen-traditional did an -unplug of all NIC, IDE and SCSI devices. This was used in all SUSE -releases up to openSUSE 12.3, SLES11SP3. Starting with openSUSE 13.1 and -SLES11SP4/SLE12 the official protocol was used. - -Unplug protocol for old Novell VMDP - -During Xen-3.0 timeframe an unofficial unplug protocol was used in -Novells VMDP. Depending on how VMDP was configured it would control all -devices, or either NIC or storage. To control all devices the value 0x1 -was written to offset 0x4 in the memory region of the Xen Platform PCI -Device. This was supposed to unplug NIC, IDE and SCSI devices. If VMDP -was configured to control just NIC devices it would write the value 0x2 -to offset 0x8. If VMDP was configured to control just storage devices it -would write the value 0x1 to offset 0x8. Starting with VMDP version 1.7 -(released 2011) the official protocol was used. - diff --git a/docs/misc/hvm-emulated-unplug.pandoc b/docs/misc/hvm-emulated-unplug.pandoc new file mode 100644 index 0000000000..f6b27ed04f --- /dev/null +++ b/docs/misc/hvm-emulated-unplug.pandoc @@ -0,0 +1,97 @@ +#Xen HVM emulated device unplug protocol + +The protocol covers three basic things: + + * Disconnecting emulated devices. + * Getting log messages out of the drivers and into dom0. + * Allowing dom0 to block the loading of specific drivers. 
This is + intended as a backwards-compatibility thing: if we discover a bug + in some old version of the drivers, then rather than working around + it in Xen, we have the option of just making those drivers fall + back to emulated mode. + +The current protocol works like this (from the point of view of +drivers): + +1. When the drivers first come up, they check whether the unplug logic + is available by reading a two-byte magic number from IO port `0x10`. + These should be `0x49d2`. If the magic number doesn't match, the + drivers don't do anything. + +2. The drivers read a one-byte protocol version from IO port `0x12`. If + this is 0, skip to 6. + +3. The drivers write a two-byte product number to IO port `0x12`. At + the moment, the only drivers using this protocol are our + closed-source ones, which use product number 1. + +4. The drivers write a four-byte build number to IO port `0x10`. + +5. The drivers check the magic number by reading two bytes from `0x10` + again. If it's changed from `0x49d2` to `0xd249`, the drivers are + blacklisted and should not load. + +6. The drivers write a two-byte bitmask of devices to unplug to IO + port `0x10`. The defined bits are: + + * `0` -- All emulated IDE and SCSI disks (not including CD drives). + * `1` -- All emulated NICs. + * `2` -- All IDE disks except for the primary master (not including CD + drives). This is overridden by bit 0. + * `3` -- All emulated NVMe disks. + + The relevant emulated devices then disappear from the relevant + buses. For most guest operating systems, you want to do this + before device enumeration happens. + +Once the drivers have checked the magic number, they can send log +messages to qemu which will be logged to wherever qemu's logs go +(`/var/log/xen/qemu-dm.log` on normal Xen, dom0 syslog on XenServer). +These messages are written to IO port `0x12` a byte at a time, and are +terminated by newlines. 
There's a fairly aggressive rate limiter on +these messages, so they shouldn't be used for anything even vaguely +high-volume, but they're rather useful for debugging and support. + +It is still permitted for a driver to use this logging feature if it +is blacklisted, but *ONLY* if it has checked the magic number and found +it to be `0x49d2` or `0xd249`. + +This isn't exactly a pretty protocol, but it does solve the problem. + +The blacklist is, from qemu's point of view, handled mostly through +xenstore. A driver version is considered to be blacklisted if +`/mh/driver-blacklist/{product_name}/{build_number}` exists and is +readable, where `{build_number}` is the build number from step 4 as a +decimal number. `{product_name}` is a string corresponding to the +product number in step 3. + +The master registry of product names and numbers is in +xen/include/public/hvm/pvdrivers.h. + +NOTE: The IO ports implementing the unplug protocol are implemented +as part of the Xen Platform PCI Device, so if that device is not +present in the system then this protocol will not work. + + +Unplug protocol for old SUSE PVonHVM + +During xen-3.0.4 timeframe an unofficial unplug protocol was added to +the xen-platform-pci kernel module. The value 0x1 was written to offset +0x4 in the memory region of the Xen Platform PCI Device. This was done +unconditionally. The corresponding code in qemu-xen-traditional did an +unplug of all NIC, IDE and SCSI devices. This was used in all SUSE +releases up to openSUSE 12.3, SLES11SP3. Starting with openSUSE 13.1 and +SLES11SP4/SLE12 the official protocol was used. + +Unplug protocol for old Novell VMDP + +During Xen-3.0 timeframe an unofficial unplug protocol was used in +Novells VMDP. Depending on how VMDP was configured it would control all +devices, or either NIC or storage. To control all devices the value 0x1 +was written to offset 0x4 in the memory region of the Xen Platform PCI +Device. This was supposed to unplug NIC, IDE and SCSI devices. 
If VMDP +was configured to control just NIC devices it would write the value 0x2 +to offset 0x8. If VMDP was configured to control just storage devices it +would write the value 0x1 to offset 0x8. Starting with VMDP version 1.7 +(released 2011) the official protocol was used. + diff --git a/docs/misc/livepatch.markdown b/docs/misc/livepatch.markdown deleted file mode 100644 index 2bdf871578..0000000000 --- a/docs/misc/livepatch.markdown +++ /dev/null @@ -1,1108 +0,0 @@ -# Xen Live Patching Design v1 - -## Rationale - -A mechanism is required to binarily patch the running hypervisor with new -opcodes that have come about due to primarily security updates. - -This document describes the design of the API that would allow us to -upload to the hypervisor binary patches. - -The document is split in four sections: - - * Detailed descriptions of the problem statement. - * Design of the data structures. - * Design of the hypercalls. - * Implementation notes that should be taken into consideration. - - -## Glossary - - * splice - patch in the binary code with new opcodes - * trampoline - a jump to a new instruction. - * payload - telemetries of the old code along with binary blob of the new - function (if needed). - * reloc - telemetries contained in the payload to construct proper trampoline. - -## History - -The document has gone under various reviews and only covers v1 design. - -The end of the document has a section titled `Not Yet Done` which -outlines ideas and design for the future version of this work. - -## Multiple ways to patch - -The mechanism needs to be flexible to patch the hypervisor in multiple ways -and be as simple as possible. The compiled code is contiguous in memory with -no gaps - so we have no luxury of 'moving' existing code and must either -insert a trampoline to the new code to be executed - or only modify in-place -the code if there is sufficient space. 
The placement of new code has to be done -by hypervisor and the virtual address for the new code is allocated dynamically. - -This implies that the hypervisor must compute the new offsets when splicing -in the new trampoline code. Where the trampoline is added (inside -the function we are patching or just the callers?) is also important. - -To lessen the amount of code in hypervisor, the consumer of the API -is responsible for identifying which mechanism to employ and how many locations -to patch. Combinations of modifying in-place code, adding trampoline, etc -has to be supported. The API should allow read/write any memory within -the hypervisor virtual address space. - -We must also have a mechanism to query what has been applied and a mechanism -to revert it if needed. - -## Workflow - -The expected workflows of higher-level tools that manage multiple patches -on production machines would be: - - * The first obvious task is loading all available / suggested - hotpatches when they are available. - * Whenever new hotpatches are installed, they should be loaded too. - * One wants to query which modules have been loaded at runtime. - * If unloading is deemed safe (see unloading below), one may want to - support a workflow where a specific hotpatch is marked as bad and - unloaded. - -## Patching code - -The first mechanism to patch that comes in mind is in-place replacement. -That is replace the affected code with new code. Unfortunately the x86 -ISA is variable size which places limits on how much space we have available -to replace the instructions. That is not a problem if the change is smaller -than the original opcode and we can fill it with nops. Problems will -appear if the replacement code is longer. - -The second mechanism is by ti replace the call or jump to the -old function with the address of the new function. - -A third mechanism is to add a jump to the new function at the -start of the old function. N.B. 
The Xen hypervisor implements the third -mechanism. See `Trampoline (e9 opcode)` section for more details. - -### Example of trampoline and in-place splicing - -As example we will assume the hypervisor does not have XSA-132 (see -[domctl/sysctl: don't leak hypervisor stack to toolstacks](http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=4ff3449f0e9d175ceb9551d3f2aecb59273f639d)) -and we would like to binary patch the hypervisor with it. The original code -looks as so: - - 48 89 e0 mov %rsp,%rax - 48 25 00 80 ff ff and $0xffffffffffff8000,%rax - -while the new patched hypervisor would be: - - 48 c7 45 b8 00 00 00 00 movq $0x0,-0x48(%rbp) - 48 c7 45 c0 00 00 00 00 movq $0x0,-0x40(%rbp) - 48 c7 45 c8 00 00 00 00 movq $0x0,-0x38(%rbp) - 48 89 e0 mov %rsp,%rax - 48 25 00 80 ff ff and $0xffffffffffff8000,%rax - -This is inside the arch\_do\_domctl. This new change adds 21 extra -bytes of code which alters all the offsets inside the function. To alter -these offsets and add the extra 21 bytes of code we might not have enough -space in .text to squeeze this in. - -As such we could simplify this problem by only patching the site -which calls arch\_do\_domctl: - - do_domctl: - e8 4b b1 05 00 callq ffff82d08015fbb9 <arch_do_domctl> - -with a new address for where the new `arch_do_domctl` would be (this -area would be allocated dynamically). - -Astute readers will wonder what we need to do if we were to patch `do_domctl` -- which is not called directly by hypervisor but on behalf of the guests via -the `compat_hypercall_table` and `hypercall_table`. Patching the offset in -`hypercall_table` for `do_domctl`: - - ffff82d08024d490: 79 30 - ffff82d08024d492: 10 80 d0 82 ff ff - -with the new address where the new `do_domctl` is possible. The other -place where it is used is in `hvm_hypercall64_table` which would need -to be patched in a similar way. This would require an in-place splicing -of the new virtual address of `arch_do_domctl`. 
- -In summary this example patched the callee of the affected function by - - * Allocating memory for the new code to live in, - * Changing the virtual address in all the functions which called the old - code (computing the new offset, patching the callq with a new callq). - * Changing the function pointer tables with the new virtual address of - the function (splicing in the new virtual address). Since this table - resides in the .rodata section we would need to temporarily change the - page table permissions during this part. - -However it has drawbacks - the safety checks which have to make sure -the function is not on the stack - must also check every caller. For some -patches this could mean - if there were an sufficient large amount of -callers - that we would never be able to apply the update. - -Having the patching done at predetermined instances where the stacks -are not deep mostly solves this problem. - -### Example of different trampoline patching. - -An alternative mechanism exists where we can insert a trampoline in the -existing function to be patched to jump directly to the new code. This -lessens the locations to be patched to one but it puts pressure on the -CPU branching logic (I-cache, but it is just one unconditional jump). - -For this example we will assume that the hypervisor has not been compiled with -XSA-125 (see -[pre-fill structures for certain HYPERVISOR\_xen\_version sub-ops](http://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=fe2e079f642effb3d24a6e1a7096ef26e691d93e)) -which mem-sets an structure in `xen_version` hypercall. This function is not -called **anywhere** in the hypervisor (it is called by the guest) but -referenced in the `compat_hypercall_table` and `hypercall_table` (and -indirectly called from that). Patching the offset in `hypercall_table` for the -old `do_xen_version`: - - ffff82d08024b270 <hypercall_table>: - ... 
- ffff82d08024b2f8: 9e 2f 11 80 d0 82 ff ff - -with the new address where the new `do_xen_version` is possible. The other -place where it is used is in `hvm_hypercall64_table` which would need -to be patched in a similar way. This would require an in-place splicing -of the new virtual address of `do_xen_version`. - -An alternative solution would be to patch insert a trampoline in the -old `do_xen_version` function to directly jump to the new `do_xen_version`: - - ffff82d080112f9e do_xen_version: - ffff82d080112f9e: 48 c7 c0 da ff ff ff mov $0xffffffffffffffda,%rax - ffff82d080112fa5: 83 ff 09 cmp $0x9,%edi - ffff82d080112fa8: 0f 87 24 05 00 00 ja ffff82d0801134d2 ; do_xen_version+0x534 - -with: - - ffff82d080112f9e do_xen_version: - ffff82d080112f9e: e9 XX YY ZZ QQ jmpq [new do_xen_version] - -which would lessen the amount of patching to just one location. - -In summary this example patched the affected function to jump to the -new replacement function which required: - - * Allocating memory for the new code to live in, - * Inserting trampoline with new offset in the old function to point to the - new function. - * Optionally we can insert in the old function a trampoline jump to an function - providing an BUG\_ON to catch errant code. - -The disadvantage of this are that the unconditional jump will consume a small -I-cache penalty. However the simplicity of the patching and higher chance -of passing safety checks make this a worthwhile option. - -This patching has a similar drawback as inline patching - the safety -checks have to make sure the function is not on the stack. However -since we are replacing at a higher level (a full function as opposed -to various offsets within functions) the checks are simpler. - -Having the patching done at predetermined instances where the stacks -are not deep mostly solves this problem as well. 
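The arithmetic behind the trampoline example above can be sketched in C. This is a minimal illustration and not the hypervisor's implementation: `read_le64()` and `make_trampoline()` are hypothetical helpers. It shows that the `hypercall_table` bytes `9e 2f 11 80 d0 82 ff ff` are the little-endian address of `do_xen_version` (`0xffff82d080112f9e`), and how the `XX YY ZZ QQ` bytes of the `e9` jump follow from the old and new addresses, assuming the standard x86-64 `jmp rel32` encoding in which the displacement is relative to the end of the five-byte instruction.

```c
#include <stdint.h>

#define JMP_REL32_OPCODE 0xe9
#define JMP_REL32_LEN    5      /* one opcode byte plus a 32-bit displacement */

/* Decode a little-endian 64-bit pointer, as stored in a function pointer table. */
static uint64_t read_le64(const uint8_t bytes[8])
{
    uint64_t value = 0;

    for (int i = 7; i >= 0; i--)
        value = (value << 8) | bytes[i];
    return value;
}

/*
 * Encode "jmp new_addr" at old_addr into out[5].
 * Returns 0 on success, or -1 if new_addr is not reachable with a
 * signed 32-bit displacement.
 */
static int make_trampoline(uint64_t old_addr, uint64_t new_addr, uint8_t out[5])
{
    /* The displacement is measured from the instruction following the jump. */
    int64_t disp = (int64_t)(new_addr - (old_addr + JMP_REL32_LEN));

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    uint32_t rel = (uint32_t)(int32_t)disp;

    out[0] = JMP_REL32_OPCODE;
    out[1] = (uint8_t)(rel & 0xff);           /* displacement is little-endian */
    out[2] = (uint8_t)((rel >> 8) & 0xff);
    out[3] = (uint8_t)((rel >> 16) & 0xff);
    out[4] = (uint8_t)((rel >> 24) & 0xff);
    return 0;
}
```

For instance, with a made-up replacement address of `0xffff82d080200000`, the displacement from `0xffff82d080112f9e` is `0x000ed05d`, so the trampoline bytes would be `e9 5d d0 0e 00`. The range check matters because the new code is allocated dynamically and has to land within ±2 GiB of the function being patched for this encoding to work.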
- -### Security - -With this method we can re-write the hypervisor - and as such we **MUST** be -diligent in only allowing certain guests to perform this operation. - -Furthermore with SecureBoot or tboot, we **MUST** also verify the signature -of the payload to be certain it came from a trusted source and integrity -was intact. - -As such the hypercall **MUST** support an XSM policy to limit what the guest -is allowed to invoke. If the system is booted with signature checking the -signature checking will be enforced. - -## Design of payload format - -The payload **MUST** contain enough data to allow us to apply the update -and also safely reverse it. As such we **MUST** know: - - * The locations in memory to be patched. This can be determined dynamically - via symbols or via virtual addresses. - * The new code that will be patched in. - -This binary format can be constructed using an custom binary format but -there are severe disadvantages of it: - - * The format might need to be changed and we need an mechanism to accommodate - that. - * It has to be platform agnostic. - * Easily constructed using existing tools. - -As such having the payload in an ELF file is the sensible way. We would be -carrying the various sets of structures (and data) in the ELF sections under -different names and with definitions. - -Note that every structure has padding. This is added so that the hypervisor -can re-use those fields as it sees fit. - -Earlier design attempted to ineptly explain the relations of the ELF sections -to each other without using proper ELF mechanism (sh\_info, sh\_link, data -structures using Elf types, etc). This design will explain the structures -and how they are used together and not dig in the ELF format - except mention -that the section names should match the structure names. - -The Xen Live Patch payload is a relocatable ELF binary. A typical binary would have: - - * One or more .text sections. - * Zero or more read-only data sections. 
- * Zero or more data sections. - * Relocations for each of these sections. - -It may also have some architecture-specific sections. For example: - - * Alternatives instructions. - * Bug frames. - * Exception tables. - * Relocations for each of these sections. - -The Xen Live Patch core code loads the payload as a standard ELF binary, relocates it -and handles the architecture-specifc sections as needed. This process is much -like what the Linux kernel module loader does. - -The payload contains at least three sections: - - * `.livepatch.funcs` - which is an array of livepatch\_func structures. - * `.livepatch.depends` - which is an ELF Note that describes what the payload - depends on. **MUST** have one. - * `.note.gnu.build-id` - the build-id of this payload. **MUST** have one. - -### .livepatch.funcs - -The `.livepatch.funcs` contains an array of livepatch\_func structures -which describe the functions to be patched: - - struct livepatch_func { - const char *name; - void *new_addr; - void *old_addr; - uint32_t new_size; - uint32_t old_size; - uint8_t version; - uint8_t opaque[31]; - }; - -The size of the structure is 64 bytes on 64-bit hypervisors. It will be -52 on 32-bit hypervisors. - - * `name` is the symbol name of the old function. Only used if `old_addr` is - zero, otherwise will be used during dynamic linking (when hypervisor loads - the payload). - * `old_addr` is the address of the function to be patched and is filled in at - payload generation time if hypervisor function address is known. If unknown, - the value *MUST* be zero and the hypervisor will attempt to resolve the - address. - * `new_addr` can either have a non-zero value or be zero. - * If there is a non-zero value, then it is the address of the function that - is replacing the old function and the address is recomputed during - relocation. The value **MUST** be the address of the new function in the - payload file. 
- * If the value is zero, then we NOPing out at the `old_addr` location - `new_size` bytes. - * `old_size` contains the sizes of the respective `old_addr` function in - bytes. The value of `old_size` **MUST** not be zero. - * `new_size` depends on what `new_addr` contains: - * If `new_addr` contains an non-zero value, then `new_size` has the size of - the new function (which will replace the one at `old_addr`) in bytes. - * If the value of `new_addr` is zero then `new_size` determines how many - instruction bytes to NOP (up to opaque size modulo smallest platform - instruction - 1 byte x86 and 4 bytes on ARM). - * `version` is to be one. - * `opaque` **MUST** be zero. - -The size of the `livepatch_func` array is determined from the ELF section -size. - -When applying the patch the hypervisor iterates over each `livepatch_func` -structure and the core code inserts a trampoline at `old_addr` to `new_addr`. -The `new_addr` is altered when the ELF payload is loaded. - -When reverting a patch, the hypervisor iterates over each `livepatch_func` -and the core code copies the data from the undo buffer (private internal copy) -to `old_addr`. - -It optionally may contain the address of functions to be called right before -being applied and after being reverted: - - * `.livepatch.hooks.load` - an array of function pointers. - * `.livepatch.hooks.unload` - an array of function pointers. - - -### Example of .livepatch.funcs - -A simple example of what a payload file can be: - - /* MUST be in sync with hypervisor. */ - struct livepatch_func { - const char *name; - void *new_addr; - void *old_addr; - uint32_t new_size; - uint32_t old_size; - uint8_t version; - uint8_t pad[31]; - }; - - /* Our replacement function for xen_extra_version. 
*/ - const char *xen_hello_world(void) - { - return "Hello World"; - } - - static unsigned char patch_this_fnc[] = "xen_extra_version"; - - struct livepatch_func livepatch_hello_world = { - .version = LIVEPATCH_PAYLOAD_VERSION, - .name = patch_this_fnc, - .new_addr = xen_hello_world, - .old_addr = (void *)0xffff82d08013963c, /* Extracted from xen-syms. */ - .new_size = 13, /* To be be computed by scripts. */ - .old_size = 13, /* -----------""--------------- */ - } __attribute__((__section__(".livepatch.funcs"))); - -Code must be compiled with `-fPIC`. - -### .livepatch.hooks.load and .livepatch.hooks.unload - -This section contains an array of function pointers to be executed -before payload is being applied (.livepatch.funcs) or after reverting -the payload. This is useful to prepare data structures that need to -be modified patching. - -Each entry in this array is eight bytes. - -The type definition of the function are as follow: - - typedef void (*livepatch_loadcall_t)(void); - typedef void (*livepatch_unloadcall_t)(void); - -### .livepatch.depends and .note.gnu.build-id - -To support dependencies checking and safe loading (to load the -appropiate payload against the right hypervisor) there is a need -to embbed an build-id dependency. - -This is done by the payload containing an section `.livepatch.depends` -which follows the format of an ELF Note. The contents of this -(name, and description) are specific to the linker utilized to -build the hypevisor and payload. - -If GNU linker is used then the name is `GNU` and the description -is a NT\_GNU\_BUILD\_ID type ID. The description can be an SHA1 -checksum, MD5 checksum or any unique value. - -The size of these structures varies with the `--build-id` linker option. - -## Hypercalls - -We will employ the sub operations of the system management hypercall (sysctl). -There are to be four sub-operations: - - * upload the payloads. - * listing of payloads summary uploaded and their state. 
- * getting an particular payload summary and its state. - * command to apply, delete, or revert the payload. - -Most of the actions are asynchronous therefore the caller is responsible -to verify that it has been applied properly by retrieving the summary of it -and verifying that there are no error codes associated with the payload. - -We **MUST** make some of them asynchronous due to the nature of patching -it requires every physical CPU to be lock-step with each other. -The patching mechanism while an implementation detail, is not an short -operation and as such the design **MUST** assume it will be an long-running -operation. - -The sub-operations will spell out how preemption is to be handled (if at all). - -Furthermore it is possible to have multiple different payloads for the same -function. As such an unique name per payload has to be visible to allow proper manipulation. - -The hypercall is part of the `xen_sysctl`. The top level structure contains -one uint32\_t to determine the sub-operations and one padding field which -*MUST* always be zero. - - struct xen_sysctl_livepatch_op { - uint32_t cmd; /* IN: XEN_SYSCTL_LIVEPATCH_*. */ - uint32_t pad; /* IN: Always zero. */ - union { - ... see below ... - } u; - }; - -while the rest of hypercall specific structures are part of the this structure. - -### Basic type: struct xen\_livepatch\_name - -Most of the hypercalls employ an shared structure called `struct xen_livepatch_name` -which contains: - - * `name` - pointer where the string for the name is located. - * `size` - the size of the string - * `pad` - padding - to be zero. - -The structure is as follow: - - /* - * Uniquely identifies the payload. Should be human readable. - * Includes the NUL terminator - */ - #define XEN_LIVEPATCH_NAME_SIZE 128 - struct xen_livepatch_name { - XEN_GUEST_HANDLE_64(char) name; /* IN, pointer to name. */ - uint16_t size; /* IN, size of name. May be upto - XEN_LIVEPATCH_NAME_SIZE. */ - uint16_t pad[3]; /* IN: MUST be zero. 
*/ - }; - -### XEN\_SYSCTL\_LIVEPATCH\_UPLOAD (0) - -Upload a payload to the hypervisor. The payload is verified -against basic checks and if there are any issues the proper return code -will be returned. The payload is not applied at this time - that is -controlled by *XEN\_SYSCTL\_LIVEPATCH\_ACTION*. - -The caller provides: - - * A `struct xen_livepatch_name` called `name` which has the unique name. - * `size` the size of the ELF payload (in bytes). - * `payload` the virtual address of where the ELF payload is. - -The `name` could be an UUID that stays fixed forever for a given -payload. It can be embedded into the ELF payload at creation time -and extracted by tools. - -The return value is zero if the payload was succesfully uploaded. -Otherwise an -XEN\_EXX return value is provided. Duplicate `name` are not supported. - -The `payload` is the ELF payload as mentioned in the `Payload format` section. - -The structure is as follow: - - struct xen_sysctl_livepatch_upload { - xen_livepatch_name_t name; /* IN, name of the patch. */ - uint64_t size; /* IN, size of the ELF file. */ - XEN_GUEST_HANDLE_64(uint8) payload; /* IN: ELF file. */ - }; - -### XEN\_SYSCTL\_LIVEPATCH\_GET (1) - -Retrieve an status of an specific payload. This caller provides: - - * A `struct xen_livepatch_name` called `name` which has the unique name. - * A `struct xen_livepatch_status` structure. The member values will - be over-written upon completion. - -Upon completion the `struct xen_livepatch_status` is updated. - - * `status` - indicates the current status of the payload: - * *LIVEPATCH\_STATUS\_CHECKED* (1) loaded and the ELF payload safety checks passed. - * *LIVEPATCH\_STATUS\_APPLIED* (2) loaded, checked, and applied. - * No other value is possible. - * `rc` - -XEN\_EXX type errors encountered while performing the last - LIVEPATCH\_ACTION\_\* operation. The normal values can be zero or -XEN\_EAGAIN which - respectively mean: success or operation in progress. 
Other values - imply an error occurred. If there is an error in `rc`, `status` will **NOT** - have changed. - -The return value of the hypercall is zero on success and -XEN\_EXX on failure. -(Note that the `rc` value can be different from the return value, as in -rc=-XEN\_EAGAIN and return value can be 0). - -For example, supposing there is an payload: - - status: LIVEPATCH_STATUS_CHECKED - rc: 0 - -We apply an action - LIVEPATCH\_ACTION\_REVERT - to revert it (which won't work -as we have not even applied it. Afterwards we will have: - - status: LIVEPATCH_STATUS_CHECKED - rc: -XEN_EINVAL - -It has failed but it remains loaded. - -This operation is synchronous and does not require preemption. - -The structure is as follow: - - struct xen_livepatch_status { - #define LIVEPATCH_STATUS_CHECKED 1 - #define LIVEPATCH_STATUS_APPLIED 2 - uint32_t state; /* OUT: LIVEPATCH_STATE_*. */ - int32_t rc; /* OUT: 0 if no error, otherwise -XEN_EXX. */ - }; - - struct xen_sysctl_livepatch_get { - xen_livepatch_name_t name; /* IN, the name of the payload. */ - xen_livepatch_status_t status; /* IN/OUT: status of the payload. */ - }; - -### XEN\_SYSCTL\_LIVEPATCH\_LIST (2) - -Retrieve an array of abbreviated status and names of payloads that are loaded in the -hypervisor. - -The caller provides: - - * `version`. Version of the payload. Caller should re-use the field provided by - the hypervisor. If the value differs the data is stale. - * `idx` Index iterator. The index into the hypervisor's payload count. It is - recommended that on first invocation zero be used so that `nr` (which the - hypervisor will update with the remaining payload count) be provided. - Also the hypervisor will provide `version` with the most current value. - * `nr` The max number of entries to populate. Can be zero which will result - in the hypercall being a probing one and return the number of payloads - (and update the `version`). - * `pad` - *MUST* be zero. 
- * `status` Virtual address of where to write `struct xen_livepatch_status`
-   structures. Caller *MUST* allocate up to `nr` of them.
- * `name` - Virtual address of where to write the unique name of the payload.
-   Caller *MUST* allocate up to `nr` of them. Each *MUST* be of
-   **XEN\_LIVEPATCH\_NAME\_SIZE** size. Note that **XEN\_LIVEPATCH\_NAME\_SIZE** includes
-   the NUL terminator.
- * `len` - Virtual address of where to write the length of each unique name
-   of the payload. Caller *MUST* allocate up to `nr` of them. Each *MUST* be
-   of sizeof(uint32\_t) (4 bytes).
-
-If the hypercall returns an positive number, it is the number (upto `nr`
-provided to the hypercall) of the payloads returned, along with `nr` updated
-with the number of remaining payloads, `version` updated (it may be the same
-across hypercalls - if it varies the data is stale and further calls could
-fail). The `status`, `name`, and `len` are updated at their designed index
-value (`idx`) with the returned value of data.
-
-If the hypercall returns -XEN\_E2BIG the `nr` is too big and should be
-lowered.
-
-If the hypercall returns an zero value there are no more payloads.
-
-Note that due to the asynchronous nature of hypercalls the control domain might
-have added or removed a number of payloads making this information stale. It is
-the responsibility of the toolstack to use the `version` field to check
-between each invocation. if the version differs it should discard the stale
-data and start from scratch. It is OK for the toolstack to use the new
-`version` field.
-
-The `struct xen_livepatch_status` structure contains an status of payload which includes:
-
- * `status` - indicates the current status of the payload:
-   * *LIVEPATCH\_STATUS\_CHECKED* (1) loaded and the ELF payload safety checks passed.
-   * *LIVEPATCH\_STATUS\_APPLIED* (2) loaded, checked, and applied.
-   * No other value is possible.
- * `rc` - -XEN\_EXX type errors encountered while performing the last
-   LIVEPATCH\_ACTION\_\* operation. The normal values can be zero or -XEN\_EAGAIN which
-   respectively mean: success or operation in progress. Other values
-   imply an error occurred. If there is an error in `rc`, `status` will **NOT**
-   have changed.
-
-The structure is as follow:
-
-    struct xen_sysctl_livepatch_list {
-        uint32_t version;                       /* OUT: Hypervisor stamps value.
-                                                   If varies between calls, we are
-                                                   getting stale data. */
-        uint32_t idx;                           /* IN: Index into hypervisor list. */
-        uint32_t nr;                            /* IN: How many status, names, and len
-                                                   should be filled out. Can be zero to get
-                                                   amount of payloads and version.
-                                                   OUT: How many payloads left. */
-        uint32_t pad;                           /* IN: Must be zero. */
-        XEN_GUEST_HANDLE_64(xen_livepatch_status_t) status;  /* OUT. Must have enough
-                                                   space allocate for nr of them. */
-        XEN_GUEST_HANDLE_64(char) id;           /* OUT: Array of names. Each member
-                                                   MUST XEN_LIVEPATCH_NAME_SIZE in size.
-                                                   Must have nr of them. */
-        XEN_GUEST_HANDLE_64(uint32) len;        /* OUT: Array of lengths of name's.
-                                                   Must have nr of them. */
-    };
-
-### XEN\_SYSCTL\_LIVEPATCH\_ACTION (3)
-
-Perform an operation on the payload structure referenced by the `name` field.
-The operation request is asynchronous and the status should be retrieved
-by using either **XEN\_SYSCTL\_LIVEPATCH\_GET** or **XEN\_SYSCTL\_LIVEPATCH\_LIST** hypercall.
-
-The caller provides:
-
- * A `struct xen_livepatch_name` `name` containing the unique name.
- * `cmd` The command requested:
-   * *LIVEPATCH\_ACTION\_UNLOAD* (1) Unload the payload.
-     Any further hypercalls against the `name` will result in failure unless
-     **XEN\_SYSCTL\_LIVEPATCH\_UPLOAD** hypercall is perfomed with same `name`.
-   * *LIVEPATCH\_ACTION\_REVERT* (2) Revert the payload. If the operation takes
-     more time than the upper bound of time the `rc` in `xen_livepatch_status`
-     retrieved via **XEN\_SYSCTL\_LIVEPATCH\_GET** will be -XEN\_EBUSY.
-   * *LIVEPATCH\_ACTION\_APPLY* (3) Apply the payload. If the operation takes
-     more time than the upper bound of time the `rc` in `xen_livepatch_status`
-     retrieved via **XEN\_SYSCTL\_LIVEPATCH\_GET** will be -XEN\_EBUSY.
-   * *LIVEPATCH\_ACTION\_REPLACE* (4) Revert all applied payloads and apply this
-     payload. If the operation takes more time than the upper bound of time
-     the `rc` in `xen_livepatch_status` retrieved via **XEN\_SYSCTL\_LIVEPATCH\_GET**
-     will be -XEN\_EBUSY.
- * `time` The upper bound of time (ns) the cmd should take. Zero means to use
-   the hypervisor default. If within the time the operation does not succeed
-   the operation would go in error state.
- * `pad` - *MUST* be zero.
-
-The return value will be zero unless the provided fields are incorrect.
-
-The structure is as follow:
-
-    #define LIVEPATCH_ACTION_UNLOAD  1
-    #define LIVEPATCH_ACTION_REVERT  2
-    #define LIVEPATCH_ACTION_APPLY   3
-    #define LIVEPATCH_ACTION_REPLACE 4
-    struct xen_sysctl_livepatch_action {
-        xen_livepatch_name_t name;  /* IN, name of the patch. */
-        uint32_t cmd;               /* IN: LIVEPATCH_ACTION_* */
-        uint32_t time;              /* IN: If zero then uses */
-                                    /* hypervisor default. */
-                                    /* Or upper bound of time (ns) */
-                                    /* for operation to take. */
-    };
-
-
-## State diagrams of LIVEPATCH\_ACTION commands.
-
-There is a strict ordering state of what the commands can be.
-The LIVEPATCH\_ACTION prefix has been dropped to easy reading and
-does not include the LIVEPATCH\_STATES:
-
-                  /->\
-                  \  /
-     UNLOAD <--- CHECK ---> REPLACE|APPLY --> REVERT --\
-                    \                                  |
-                     \-------------------<-------------/
-
-## State transition table of LIVEPATCH\_ACTION commands and LIVEPATCH\_STATUS.
-
-Note that:
-
- - The CHECKED state is the starting one achieved with *XEN\_SYSCTL\_LIVEPATCH\_UPLOAD* hypercall.
- - The REVERT operation on success will automatically move to the CHECKED state.
- - There are two STATES: CHECKED and APPLIED.
- - There are four actions (aka commands): APPLY, REPLACE, REVERT, and UNLOAD.
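Editorial sketch (not part of the patch): the states and actions listed above can be read as a small transition function. The constants reuse the values from the `#define` lists in the quoted document; the helper names `livepatch_action_allowed` and `livepatch_next_state`, and the sentinel `LIVEPATCH_STATUS_NONE`, are hypothetical and exist only for illustration. It assumes, as the document states, that a failed action leaves the state unchanged.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as given in the quoted document. */
#define LIVEPATCH_STATUS_CHECKED 1
#define LIVEPATCH_STATUS_APPLIED 2

#define LIVEPATCH_ACTION_UNLOAD  1
#define LIVEPATCH_ACTION_REVERT  2
#define LIVEPATCH_ACTION_APPLY   3
#define LIVEPATCH_ACTION_REPLACE 4

/* Hypothetical sentinel: the payload no longer exists after UNLOAD. */
#define LIVEPATCH_STATUS_NONE    0

/* Hypothetical helper: is `action` a valid request in `state`? */
bool livepatch_action_allowed(uint32_t state, uint32_t action)
{
    switch (action) {
    case LIVEPATCH_ACTION_UNLOAD:
    case LIVEPATCH_ACTION_APPLY:
    case LIVEPATCH_ACTION_REPLACE:
        /* These are only valid on a payload that is loaded but not applied. */
        return state == LIVEPATCH_STATUS_CHECKED;
    case LIVEPATCH_ACTION_REVERT:
        /* Only an applied payload can be reverted. */
        return state == LIVEPATCH_STATUS_APPLIED;
    default:
        return false;
    }
}

/*
 * Hypothetical helper: state after `action` completes.
 * Invalid or failed actions (error or timeout) keep the current state.
 */
uint32_t livepatch_next_state(uint32_t state, uint32_t action, bool success)
{
    if (!livepatch_action_allowed(state, action) || !success)
        return state;

    switch (action) {
    case LIVEPATCH_ACTION_APPLY:
    case LIVEPATCH_ACTION_REPLACE:
        return LIVEPATCH_STATUS_APPLIED;
    case LIVEPATCH_ACTION_REVERT:
        return LIVEPATCH_STATUS_CHECKED;
    default: /* LIVEPATCH_ACTION_UNLOAD */
        return LIVEPATCH_STATUS_NONE;
    }
}
```

A toolstack-side check might call `livepatch_action_allowed()` before issuing the hypercall. The hypervisor remains the authority on what is valid, so this is a convenience only.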
-
-The state transition table of valid states and action states:
-
-    +---------+---------+--------------------------------+-------+--------+
-    | ACTION  | Current | Result                         |   Next STATE:  |
-    | ACTION  | STATE   |                                |CHECKED|APPLIED |
-    +---------+----------+-------------------------------+-------+--------+
-    | UNLOAD  | CHECKED | Unload payload. Always works.  |       |        |
-    |         |         | No next states.                |       |        |
-    +---------+---------+--------------------------------+-------+--------+
-    | APPLY   | CHECKED | Apply payload (success).       |       |   x    |
-    +---------+---------+--------------------------------+-------+--------+
-    | APPLY   | CHECKED | Apply payload (error|timeout)  |   x   |        |
-    +---------+---------+--------------------------------+-------+--------+
-    | REPLACE | CHECKED | Revert payloads and apply new  |       |   x    |
-    |         |         | payload with success.          |       |        |
-    +---------+---------+--------------------------------+-------+--------+
-    | REPLACE | CHECKED | Revert payloads and apply new  |   x   |        |
-    |         |         | payload with error.            |       |        |
-    +---------+---------+--------------------------------+-------+--------+
-    | REVERT  | APPLIED | Revert payload (success).      |   x   |        |
-    +---------+---------+--------------------------------+-------+--------+
-    | REVERT  | APPLIED | Revert payload (error|timeout) |       |   x    |
-    +---------+---------+--------------------------------+-------+--------+
-
-All the other state transitions are invalid.
-
-## Sequence of events.
-
-The normal sequence of events is to:
-
- 1. *XEN\_SYSCTL\_LIVEPATCH\_UPLOAD* to upload the payload. If there are errors *STOP* here.
- 2. *XEN\_SYSCTL\_LIVEPATCH\_GET* to check the `->rc`. If *-XEN\_EAGAIN* spin. If zero go to next step.
- 3. *XEN\_SYSCTL\_LIVEPATCH\_ACTION* with *LIVEPATCH\_ACTION\_APPLY* to apply the patch.
- 4. *XEN\_SYSCTL\_LIVEPATCH\_GET* to check the `->rc`. If in *-XEN\_EAGAIN* spin. If zero exit with success.
-
-
-## Addendum
-
-Implementation quirks should not be discussed in a design document.
-
-However these observations can provide aid when developing against this
-document.
-
-
-### Alternative assembler
-
-Alternative assembler is a mechanism to use different instructions depending
-on what the CPU supports. This is done by providing multiple streams of code
-that can be patched in - or if the CPU does not support it - padded with
-`nop` operations. The alternative assembler macros cause the compiler to
-expand the code to place a most generic code in place - emit a special
-ELF .section header to tag this location. During run-time the hypervisor
-can leave the areas alone or patch them with an better suited opcodes.
-
-Note that patching functions that copy to or from guest memory requires
-to support alternative support. For example this can be due to SMAP
-(specifically *stac* and *clac* operations) which is enabled on Broadwell
-and later architectures. It may be related to other alternative instructions.
-
-### When to patch
-
-During the discussion on the design two candidates bubbled where
-the call stack for each CPU would be deterministic. This would
-minimize the chance of the patch not being applied due to safety
-checks failing. Safety checks such as not patching code which
-is on the stack - which can lead to corruption.
-
-#### Rendezvous code instead of stop\_machine for patching
-
-The hypervisor's time rendezvous code runs synchronously across all CPUs
-every second. Using the `stop_machine` to patch can stall the time rendezvous
-code and result in NMI. As such having the patching be done at the tail
-of rendezvous code should avoid this problem.
-
-However the entrance point for that code is `do_softirq ->
-timer_softirq_action -> time_calibration` which ends up calling
-`on_selected_cpus` on remote CPUs.
-
-The remote CPUs receive CALL\_FUNCTION\_VECTOR IPI and execute the
-desired function.
-
-#### Before entering the guest code.
-
-Before we call VMXResume we check whether any soft IRQs need to be executed.
-This is a good spot because all Xen stacks are effectively empty at
-that point.
-
-To randezvous all the CPUs an barrier with an maximum timeout (which
-could be adjusted), combined with forcing all other CPUs through the
-hypervisor with IPIs, can be utilized to execute lockstep instructions
-on all CPUs.
-
-The approach is similar in concept to `stop_machine` and the time rendezvous
-but is time-bound. However the local CPU stack is much shorter and
-a lot more deterministic.
-
-This is implemented in the Xen hypervisor.
-
-### Compiling the hypervisor code
-
-Hotpatch generation often requires support for compiling the target
-with `-ffunction-sections` / `-fdata-sections`. Changes would have to
-be done to the linker scripts to support this.
-
-### Generation of Live Patch ELF payloads
-
-The design of that is not discussed in this design.
-
-This is implemented in a seperate tool which lives in a seperate
-GIT repo.
-
-Currently it resides at git://xenbits.xen.org/livepatch-build-tools.git
-
-### Exception tables and symbol tables growth
-
-We may need support for adapting or augmenting exception tables if
-patching such code. Hotpatches may need to bring their own small
-exception tables (similar to how Linux modules support this).
-
-If supporting hotpatches that introduce additional exception-locations
-is not important, one could also change the exception table in-place
-and reorder it afterwards.
-
-As found almost every patch (XSA) to a non-trivial function requires
-additional entries in the exception table and/or the bug frames.
-
-This is implemented in the Xen hypervisor.
-
-### .rodata sections
-
-The patching might require strings to be updated as well. As such we must be
-also able to patch the strings as needed. This sounds simple - but the compiler
-has a habit of coalescing strings that are the same - which means if we in-place
-alter the strings - other users will be inadvertently affected as well.
-
-This is also where pointers to functions live - and we may need to patch this
-as well. And switch-style jump tables.
-
-To guard against that we must be prepared to do patching similar to
-trampoline patching or in-line depending on the flavour. If we can
-do in-line patching we would need to:
-
- * Alter `.rodata` to be writeable.
- * Inline patch.
- * Alter `.rodata` to be read-only.
-
-If are doing trampoline patching we would need to:
-
- * Allocate a new memory location for the string.
- * All locations which use this string will have to be updated to use the
-   offset to the string.
- * Mark the region RO when we are done.
-
-The trampoline patching is implemented in the Xen hypervisor.
-
-### .bss and .data sections.
-
-In place patching writable data is not suitable as it is unclear what should be done
-depending on the current state of data. As such it should not be attempted.
-
-However, functions which are being patched can bring in changes to strings
-(.data or .rodata section changes), or even to .bss sections.
-
-As such the ELF payload can introduce new .rodata, .bss, and .data sections.
-Patching in the new function will end up also patching in the new .rodata
-section and the new function will reference the new string in the new
-.rodata section.
-
-This is implemented in the Xen hypervisor.
-
-### Security
-
-Only the privileged domain should be allowed to do this operation.
-
-### Live patch interdependencies
-
-Live patch patches interdependencies are tricky.
-
_______________________________________________
Xen-changelog mailing list
Xen-changelog@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/xen-changelog
