Re: [PATCH] docs/designs: re-work the xenstore migration document...
On 24.04.20 15:37, Paul Durrant wrote:
From: Paul Durrant <pdurrant@xxxxxxxxxx>

... to specify a separate migration stream that will also be suitable for
live update.

The original scope of the document was to support non-cooperative migration
of guests [1] but, since then, live update of xenstored has been brought into
scope. Thus it makes more sense to define a separate image format for
serializing xenstore state that is suitable for both purposes.

The document has been limited to specifying a new image format. The mechanism
for acquiring the image for live update or migration is not covered as that
is more appropriately dealt with by a patch to docs/misc/xenstore.txt. It is
also expected that, when the first implementation of live update or migration
making use of this specification is committed, that the document is moved from
docs/designs into docs/specs.

[1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md

Signed-off-by: Paul Durrant <pdurrant@xxxxxxxxxx>
---
Juergen Gross <jgross@xxxxxxxx>
Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
George Dunlap <george.dunlap@xxxxxxxxxx>
Ian Jackson <ian.jackson@xxxxxxxxxxxxx>
Jan Beulich <jbeulich@xxxxxxxx>
Julien Grall <julien@xxxxxxx>
Stefano Stabellini <sstabellini@xxxxxxxxxx>
Wei Liu <wl@xxxxxxx>
---
 docs/designs/xenstore-migration.md | 472 +++++++++++++++++++----------
 1 file changed, 309 insertions(+), 163 deletions(-)

diff --git a/docs/designs/xenstore-migration.md b/docs/designs/xenstore-migration.md
index 6ab351e8fe..c96bad48eb 100644
--- a/docs/designs/xenstore-migration.md
+++ b/docs/designs/xenstore-migration.md
@@ -3,254 +3,400 @@
 ## Background

 The design for *Non-Cooperative Migration of Guests*[1] explains that extra
-save records are required in the migrations stream to allow a guest running
-PV drivers to be migrated without its co-operation. Moreover the save
-records must include details of registered xenstore watches as well as
-content; information that cannot currently be recovered from `xenstored`,
-and hence some extension to the xenstore protocol[2] will also be required.
-
-The *libxenlight Domain Image Format* specification[3] already defines a
-record type `EMULATOR_XENSTORE_DATA` but this is not suitable for
-transferring xenstore data pertaining to the domain directly as it is
-specified such that keys are relative to the path
-`/local/domain/$dm_domid/device-model/$domid`. Thus it is necessary to
-define at least one new save record type.
+save records are required in the migrations stream to allow a guest running PV
+drivers to be migrated without its co-operation. Moreover the save records must
+include details of registered xenstore watches as well as content; information
+that cannot currently be recovered from `xenstored`, and hence some extension
+to the xenstored implementations will also be required.
+
+As a similar set of data is needed for transferring xenstore data from one
+instance to another when live updating xenstored this document proposes an
+image format for a 'migration stream' suitable for both purposes.

 ## Proposal

-### New Save Record
+The image format consists of a _header_ followed by 1 or more _records_. Each
+record consists of a type and length field, followed by any data mandated by
+the record type. At minimum there will be a single record of type `END`
+(defined below).

-A new mandatory record type should be defined within the libxenlight Domain
-Image Format:
+### Header

-`0x00000007: DOMAIN_XENSTORE_DATA`
+The header identifies the stream as a `xenstore` stream, including the version
+of the specification that it complies with.

-An arbitrary number of these records may be present in the migration
-stream and may appear in any order. The format of each record should be as
-follows:
+All fields in this header must be in _big-endian_ byte order, regardless of
+the setting of the endianness bit.

 ```
     0       1       2       3       4       5       6       7    octet
 +-------+-------+-------+-------+-------+-------+-------+-------+
-| type                          | record specific data          |
-+-------------------------------+                               |
-...
-+---------------------------------------------------------------+
+| ident                                                         |
++-------------------------------+-------------------------------|
+| version                       | flags                         |
++-------------------------------+-------------------------------+
 ```

-where type is one of the following values
+| Field     | Description                                       |
+|-----------|---------------------------------------------------|
+| `ident`   | 0x78656e73746f7265 ('xenstore' in ASCII)          |
+|           |                                                   |
+| `version` | 0x00000001 (the version of the specification)     |
+|           |                                                   |
+| `flags`   | 0 (LSB): Endianness: 0 = little, 1 = big          |
+|           |                                                   |
+|           | 1-31: Reserved (must be zero)                     |

-| Field  | Description                                      |
-|--------|--------------------------------------------------|
-| `type` | 0x00000000: invalid                              |
-|        | 0x00000001: NODE_DATA                            |
-|        | 0x00000002: WATCH_DATA                           |
-|        | 0x00000003: TRANSACTION_DATA                     |
-|        | 0x00000004 - 0xFFFFFFFF: reserved for future use |
+### Records

+Records immediately follow the header and have the following format:
-and data is one of the record data formats described in the following
-sections.
+```
+    0       1       2       3       4       5       6       7    octet
++-------+-------+-------+-------+-------+-------+-------+-------+
+| type                          | len                           |
++-------------------------------+-------------------------------+
+| body
+...
+|       | padding (0 to 7 octets)                               |
++-------+-------------------------------------------------------+
+```
+
+NOTE: padding octets here and in all subsequent format specifications must be
+      zero, unless stated otherwise.

What about: "... are written as zero and should be ignored on read."
-NOTE: The record data does not contain an overall length because the
-libxenlight record header specifies the length.
+| Field  | Description                                          |
+|--------|------------------------------------------------------|
+| `type` | 0x00000000: END                                      |
+|        | 0x00000001: GLOBAL_DATA                              |
+|        | 0x00000002: CONNECTION_DATA                          |
+|        | 0x00000003: WATCH_DATA                               |
+|        | 0x00000004: TRANSACTION_DATA                         |
+|        | 0x00000005: NODE_DATA                                |
+|        | 0x00000006 - 0xFFFFFFFF: reserved for future use     |
+|        |                                                      |
+| `len`  | The length (in octets) of `body`                     |
+|        |                                                      |
+| `body` | The type-specific record data                        |

-**NODE_DATA**
+The various formats of the type-specific data are described in the following
+sections:

+\pagebreak
-Each NODE_DATA record specifies a single node in xenstore and is formatted
-as follows:
+### END

+The end record marks the end of the image, and is the final record
+in the stream.

 ```
-    0       1       2       3     octet
-+-------+-------+-------+-------+
-| NODE_DATA                     |
-+-------------------------------+
-| path length                   |
-+-------------------------------+
-| path data                     |
-...
-| pad (0 to 3 octets)           |
-+-------------------------------+
-| perm count (N)                |
-+-------------------------------+
-| perm0                         |
-+-------------------------------+
-...
-+-------------------------------+
-| permN                         |
-+-------------------------------+
-| value length                  |
-+-------------------------------+
-| value data                    |
-...
-| pad (0 to 3 octets)           |
-+-------------------------------+
+    0       1       2       3       4       5       6       7    octet
++-------+-------+-------+-------+-------+-------+-------+-------+
 ```

-where perm0..N are formatted as follows:
+The end record contains no fields; its body length is 0.
+
+\pagebreak
+
+### GLOBAL_DATA
+
+This record is only relevant for live update. It contains details of global
+xenstored state that needs to be restored.

 ```
-    0       1       2       3     octet
+    0       1       2       3    octet
 +-------+-------+-------+-------+
-| perm  | pad   | domid         |
+| rw-socket-fd                  |
++-------------------------------+
+| ro-socket-fd                  |
 +-------------------------------+
 ```

-path length and value length are specified in octets (excluding the NUL
-terminator of the path). perm should be one of the ASCII values `w`, `r`,
-`b` or `n` as described in [2]. All pad values should be 0.
-All paths should be absolute (i.e. start with `/`) and as described in
-[2].
+| Field          | Description                                  |
+|----------------|----------------------------------------------|
+| `rw-socket-fd` | The file descriptor of the socket accepting  |
+|                | read-write connections                       |
+|                |                                              |
+| `ro-socket-fd` | The file descriptor of the socket accepting  |
+|                | read-only connections                        |
+
+xenstored will resume in the original process context. Hence `rw-socket-fd` and
+`ro-socket-fd` simply specify the file descriptors of the sockets.

-**WATCH_DATA**
+\pagebreak

+### CONNECTION_DATA
-Each WATCH_DATA record specifies a registered watch and is formatted as
-follows:
+For live update the image format will contain a `CONNECTION_DATA` record for
+each connection to xenstore. For migration it will only contain a record for
+the domain being migrated.

 ```
-    0       1       2       3     octet
-+-------+-------+-------+-------+
-| WATCH_DATA                    |
-+-------------------------------+
-| wpath length                  |
-+-------------------------------+
-| wpath data                    |
-...
-| pad (0 to 3 octets)           |
-+-------------------------------+
+    0       1       2       3       4       5       6       7    octet
++-------+-------+-------+-------+-------+-------+-------+-------+
+| conn-id       | pad                                           |
++---------------+-----------------------------------------------+
+| conn-type     | conn-spec
+...
++-------------------------------+-------------------------------+

I'd rather drop the pad, and replace it by conn-type and a 2-byte flag
field (for the flags INTRODUCE, RELEASE, read-only).
+| data-len                      | data
 +-------------------------------+
-| token length                  |
-+-------------------------------+
-| token data                    |
 ...
-| pad (0 to 3 octets)           |
-+-------------------------------+
 ```

-wpath length and token length are specified in octets (excluding the NUL
-terminator). The wpath should be as described for the `WATCH` operation in
-[2]. The token is an arbitrary string of octets not containing any NUL
-values.
+| Field       | Description                                     |
+|-------------|-------------------------------------------------|
+| `conn-id`   | A non-zero number used to identify this         |
+|             | connection in subsequent connection-specific    |
+|             | records                                         |
+|             |                                                 |
+| `conn-type` | 0x0000: shared ring                             |
+|             | 0x0001: socket                                  |
+|             |                                                 |
+| `conn-spec` | See below                                       |
+|             |                                                 |
+| `data-len`  | The length (in octets) of any pending data not  |
+|             | yet written to the connection                   |
+|             |                                                 |
+| `data`      | Pending data (may be empty)                     |

-**TRANSACTION_DATA**
+The format of `conn-spec` is dependent upon `conn-type`.

+\pagebreak
-Each TRANSACTION_DATA record specifies an open transaction and is formatted
-as follows:
+For `shared ring` connections it is as follows:

 ```
-    0       1       2       3     octet
-+-------+-------+-------+-------+
-| TRANSACTION_DATA              |
-+-------------------------------+
-| tx_id                         |
-+-------------------------------+
+    0       1       2       3       4       5       6       7    octet
+                +-------+-------+-------+-------+-------+-------+
+                | domid         | tdomid        | flags         |
++---------------+---------------+---------------+---------------+
+| revtchn                       | levtchn                       |
++-------------------------------+-------------------------------+
+| mfn                                                           |
++---------------------------------------------------------------+

levtchn is not needed IMO. Event channels can be closed and reopened, so
levtchn will have a new value in the common case.

With my suggestion above regarding flags we would have just 16 bytes now,
which can be aligned quite nicely in a sub-structure.
 ```

-where tx_id is the non-zero identifier values of an open transaction.
-
-### Protocol Extension
+| Field      | Description                                      |
+|------------|--------------------------------------------------|
+| `domid`    | The domain-id that owns the shared page          |
+|            |                                                  |
+| `tdomid`   | The domain-id that `domid` acts on behalf of if  |
+|            | it has been subject to an SET_TARGET             |
+|            | operation [2] or DOMID_INVALID otherwise         |
+|            |                                                  |
+| `flags`    | A bit-wise OR of:                                |
+|            | 0x0001: INTRODUCE has been issued                |
+|            | 0x0002: RELEASE has been issued                  |
+|            |                                                  |
+| `revtchn`  | The port number of the interdomain channel used  |
+|            | by `domid` to communicate with xenstored         |
+|            |                                                  |
+| `levtchn`  | For a live update this will be the port number   |
+|            | of the interdomain channel used by xenstored     |
+|            | itself otherwise, for migration, it will be -1   |
+|            |                                                  |
+| `mfn`      | The MFN of the shared page for a live update or  |
+|            | INVALID_MFN otherwise                            |
+
+Since the ABI guarantees that entry 1 in `domid`'s grant table will always
+contain the GFN of the shared page, so for a live update `mfn` can be used to
+give confidence that `domid` has not been re-cycled during the update.
+
+
+For `socket` connections it is as follows:

-Before xenstore state is migrated it is necessary to wait for any pending
-reads, writes, watch registrations etc. to complete, and also to make sure
-that xenstored does not start processing any new requests (so that new
-requests remain pending on the shared ring for subsequent processing on the
-new host). Hence the following operation is needed:

 ```
-QUIESCE <domid>|
-
-Complete processing of any request issued by the specified domain, and
-do not process any further requests from the shared ring.
+    0       1       2       3       4       5       6       7    octet
+                +-------+-------+-------+-------+-------+-------+
+                | flags         | socket-fd                     |
+                +---------------+-------------------------------+
 ```

-The `WATCH` operation does not allow specification of a `<domid>`; it is
-assumed that the watch pertains to the domain that owns the shared ring
-over which the operation is passed. Hence, for the tool-stack to be able
-to register a watch on behalf of a domain a new operation is needed:
-
-```
-ADD_DOMAIN_WATCHES <domid>|<watch>|+

+| Field       | Description                                     |
+|-------------|-------------------------------------------------|
+| `flags`     | A bit-wise OR of:                               |
+|             | 0001: read-only                                 |
+|             |                                                 |
+| `socket-fd` | The file descriptor of the connected socket     |

-Adds watches on behalf of the specified domain.
+This type of connection is only relevant for live update, where the xenstored
+resumes in the original process context. Hence `socket-fd` simply specify
+the file descriptor of the socket connection.

-<watch> is a NUL separated tuple of <path>|<token>. The semantics of this
-operation are identical to the domain issuing WATCH <path>|<token>| for
-each <watch>.
-```
+\pagebreak
+
+### WATCH_DATA
+
+The image format will contain a `WATCH_DATA` record for each watch registered
+by a connection for which there is `CONNECTION_DATA` record previously present.

-The watch information for a domain also needs to be extracted from the
-sending xenstored so the following operation is also needed:

 ```
-GET_DOMAIN_WATCHES <domid>|<index> <gencnt>|<watch>|*
+    0       1       2       3    octet
++-------+-------+-------+-------+
+| conn-id                       |
++---------------+---------------+
+| wpath-len     | token-len     |
++---------------+---------------+
+| wpath
+...
+| token
+...
+```
+
-Gets the list of watches that are currently registered for the domain.
+| Field       | Description                                     |
+|-------------|-------------------------------------------------|
+| `conn-id`   | The connection that issued the `WATCH`          |
+|             | operation [2]                                   |
+|             |                                                 |
+| `wpath-len` | The length (in octets) of `wpath` including the |
+|             | NUL terminator                                  |
+|             |                                                 |
+| `token-len` | The length (in octets) of `token` including the |
+|             | NUL terminator                                  |
+|             |                                                 |
+| `wpath`     | The watch path, as specified in the `WATCH`     |
+|             | operation                                       |
+|             |                                                 |
+| `token`     | The watch identifier token, as specified in the |
+|             | `WATCH` operation                               |
+
+\pagebreak
+
+### TRANSACTION_DATA
+
+The image format will contain a `TRANSACTION_DATA` record for each transaction
+that is pending on a connection for which there is `CONNECTION_DATA` record
+previously present.

-<watch> is a NUL separated tuple of <path>|<token>. The sub-list returned
-will start at <index> items into the the overall list of watches and may
-be truncated (at a <watch> boundary) such that the returned data fits
-within XENSTORE_PAYLOAD_MAX.

-If <index> is beyond the end of the overall list then the returned sub-
-list will be empty. If the value of <gencnt> changes then it indicates
-that the overall watch list has changed and thus it may be necessary
-to re-issue the operation for previous values of <index>.
 ```
+    0       1       2       3    octet
++-------+-------+-------+-------+
+| conn-id                       |
++-------------------------------+
+| tx-id                         |
++-------------------------------+
+```
+
+
+| Field          | Description                                  |
+|----------------|----------------------------------------------|
+| `conn-id`      | The connection that issued the               |
+|                | `TRANSACTION_START` operation [2]            |
+|                |                                              |
+| `tx-id`        | The transaction id passed back to the domain |
+|                | by the `TRANSACTION_START` operation         |
+
+\pagebreak

-To deal with transactions that were pending when the domain is migrated
-it is necessary to start transactions with the same tx_id on behalf of the
-domain in the receiving xenstored.
+### NODE_DATA

-NOTE: For safety each such transaction should result in an `EAGAIN` when
-the `TRANSACTION_END` operation is performed, as modifications made under
-the tx_id will not be part of the migration stream.
+For live update the image format will contain a `NODE_DATA` record for each
+node in xenstore. For migration it will only contain a record for the nodes
+relating to the domain being migrated. The `NODE_DATA` may be related to
+a _committed_ node (globally visible in xenstored) or a _pending_ node (created
+or modified by a transaction for which there is also a `TRANSACTION_DATA`
+record previously present).

-The `TRANSACTION_START` operation does not allow specification of a
-`<domid>`; it is assumed that the transaction pertains to the domain that
-owns the shared ring over which the operation is passed. Neither does it
-allow a `<transid>` to be specified; it is always chosen by xenstored.
-Hence, for the tool-stack to be able to open a transaction on behalf of a
-domain a new operation is needed:

 ```
-START_DOMAIN_TRANSACTION <domid>|<transid>|
+    0       1       2       3    octet
++-------+-------+-------+-------+
+| conn-id                       |
++-------------------------------+
+| tx-id                         |
++---------------+---------------+
+| access        | perm-count    |
++---------------+---------------+
+| perm1                         |
++-------------------------------+
+...
++-------------------------------+
+| permN                         |
++---------------+---------------+
+| path-len      | value-len     |
++---------------+---------------+

I'd rather move path-len and value-len above perm1 in order to have the
fixed-length fields in a common structure.

+| path
+...
+| value
+...
+```
+
+
+| Field        | Description                                    |
+|--------------|------------------------------------------------|
+| `conn-id`    | If this value is non-zero then this record     |
+|              | related to a pending transaction               |
+|              |                                                |
+| `tx-id`      | This value should be ignored if `conn-id` is   |
+|              | zero. Otherwise it specifies the id of the     |
+|              | pending transaction                            |
+|              |                                                |
+| `access`     | This value should be ignored if this record    |
+|              | does not relate to a pending transaction,      |
+|              | otherwise it specifies the accesses made to    |
+|              | the node and hence is a bitwise OR of:         |
+|              |                                                |
+|              | 0x0001: read                                   |
+|              | 0x0002: written                                |
+|              |                                                |
+|              | The value will be zero for a deleted node      |
+|              |                                                |
+| `perm-count` | The number (N) of node permission specifiers   |
+|              | (which will be 0 for a node deleted in a       |
+|              | pending transaction)                           |
+|              |                                                |
+| `perm1..N`   | A list of zero or more node permission         |
+|              | specifiers (see below)                         |
+|              |                                                |
+| `path-len`   | The length (in octets) of `path` including the |
+|              | NUL terminator                                 |
+|              |                                                |
+| `value-len`  | The length (in octets) of `value` (which will  |
+|              | be zero for a deleted node)                    |
+|              |                                                |
+| `path`       | The absolute path of the node                  |
+|              |                                                |
+| `value`      | The node value (which may be empty or contain  |
+|              | NUL octets)                                    |
+
+
+A node permission specifier has the following format:

-Starts a transaction on behalf of a domain.

-The semantics of this are similar to the domain issuing
-TRANSACTION_START and receiving the specified <transid> as the response.
-The main difference is that the transaction will be immediately marked as
-'conflicting' such that when the domain issues TRANSACTION_END T|, it will
-result in EAGAIN.
+```
+    0       1       2       3    octet
++-------+-------+-------+-------+
+| perm  | pad   | domid         |
++-------+-------+---------------+
 ```

-It may also be desirable to state in the protocol specification that
-the `INTRODUCE` operation should not clear the `<gfn>` specified such that
-a `RELEASE` operation followed by an `INTRODUCE` operation form an
-idempotent pair. The current implementation of *C xentored* does this
-(in the `domain_conn_reset()` function) but this could be dropped as this
-behaviour is not currently specified and the page will always be zeroed
-for a newly created domain.
+| Field   | Description                                         |
+|---------|-----------------------------------------------------|
+| `perm`  | One of the ASCII values `w`, `r`, `b` or `n` as     |
+|         | specified for the `SET_PERMS` operation [2]         |
+|         |                                                     |
+| `domid` | The domain-id to which the permission relates       |

 * * *

 [1] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/designs/non-cooperative-migration.md
+
 [2] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/misc/xenstore.txt
-[3] See https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=docs/specs/libxl-migration-stream.pandoc


Juergen
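
P.S. As a rough illustration of the node permission specifier layout at the end of the patch (`perm`, `pad`, `domid`), a reader could decode one entry along the lines of the Python sketch below. It assumes that `domid` follows the byte order selected by the header's endianness bit, which the patch does not spell out for this field. `parse_perm` and `VALID_PERMS` are names invented for the example.

```python
import struct

VALID_PERMS = {"w", "r", "b", "n"}  # ASCII values listed for `perm`

def parse_perm(buf, big_endian=False):
    """Decode one 4-octet permission specifier into (perm, domid)."""
    # Assumption: domid uses the stream byte order chosen by the header flags.
    order = ">" if big_endian else "<"
    # Layout: 1 octet perm, 1 octet pad, 2 octets domid.
    perm, _pad, domid = struct.unpack_from(order + "cBH", buf, 0)
    perm = perm.decode("ascii")
    if perm not in VALID_PERMS:
        raise ValueError("unknown permission character: %r" % perm)
    return perm, domid
```

The pad octet is read and discarded here rather than checked against zero.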