Xen project Mailing List

Re: [Xen-devel] [DOC v8] PV Calls protocol design

.snip..

> #### Frontend XenBus Nodes
>
> version
>      Values:         <string>
>
>      Protocol version, chosen among the ones supported by the backend
>      (see **versions** under [Backend XenBus Nodes]). Currently the
>      value must be "1".
>
> port
>      Values:         <uint32_t>
>
>      The identifier of the Xen event channel used to signal activity
>      in the command ring.
>
> ring-ref
>      Values:         <uint32_t>
>
>      The Xen grant reference granting permission for the backend to map
>      the sole page in a single page sized command ring.
>
> #### Backend XenBus Nodes
>
> versions
>      Values:         <string>
>
>      List of comma separated protocol versions supported by the backend.
>      For example "1,2,3". Currently the value is just "1", as there is
>      only one version.
>
> max-page-order
>      Values:         <uint32_t>
>
>      The maximum supported size of a memory allocation in units of
>      log2n(machine pages), e.g. 0 == 1 page, 1 == 2 pages, 2 == 4 pages,
>      etc.

.. for the **data rings** (not to be confused with the command ring).

>
> function-calls
>      Values:         <uint32_t>
>
>      Value "0" means that no calls are supported.
>      Value "1" means that socket, connect, release, bind, listen, accept
>      and poll are supported.
>

..snip..

> ### Commands Ring
>
> The shared ring is used by the frontend to forward POSIX function calls
> to the backend. We shall refer to this ring as **commands ring** to
> distinguish it from other rings which can be created later in the
> lifecycle of the protocol (see [Indexes Page and Data ring]). The grant
> reference for the shared page for this ring is shared on xenstore (see
> [Frontend XenBus Nodes]). The ring format is defined using the familiar
> `DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`). Frontend
> requests are allocated on the ring using the `RING_GET_REQUEST` macro.
> The list of commands below is in calling order.
>
> The format is defined as follows:
>
>     #define PVCALLS_SOCKET         0
>     #define PVCALLS_CONNECT        1
>     #define PVCALLS_RELEASE        2
>     #define PVCALLS_BIND           3
>     #define PVCALLS_LISTEN         4
>     #define PVCALLS_ACCEPT         5
>     #define PVCALLS_POLL           6
>
>     struct xen_pvcalls_request {
>         uint32_t req_id; /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         union {
>             struct xen_pvcalls_socket {
>                 uint64_t id;
>                 uint32_t domain;
>                 uint32_t type;
>                 uint32_t protocol;
>     #ifdef CONFIG_X86_32
>                 uint8_t pad[4];

Could that be shifted to the right?

>     #endif
>             } socket;
>             struct xen_pvcalls_connect {
>                 uint64_t id;
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref;
>                 uint32_t evtchn;
>     #ifdef CONFIG_X86_32
>                 uint8_t pad[4];
>     #endif
>             } connect;
>             struct xen_pvcalls_release {
>                 uint64_t id;
>                 uint8_t reuse;
>     #ifdef CONFIG_X86_32
>                 uint8_t pad[7];

Could that be shifted to the right?

>     #endif
>             } release;
>             struct xen_pvcalls_bind {
>                 uint64_t id;
>                 uint8_t addr[28];
>                 uint32_t len;
>             } bind;
>             struct xen_pvcalls_listen {
>                 uint64_t id;
>                 uint32_t backlog;
>     #ifdef CONFIG_X86_32
>                 uint8_t pad[4];

Could that be shifted to the right?

>     #endif
>             } listen;
>             struct xen_pvcalls_accept {
>                 uint64_t id;
>                 uint64_t id_new;
>                 grant_ref_t ref;
>                 uint32_t evtchn;
>             } accept;
>             struct xen_pvcalls_poll {
>                 uint64_t id;
>             } poll;
>             /* dummy member to force sizeof(struct xen_pvcalls_request) to
>                match across archs */
>             struct xen_pvcalls_dummy {
>                 uint8_t dummy[56];
>             } dummy;
>         } u;
>     };
>
> The first two fields are common for every command. Their binary layout
> is:
>
>     0       4       8
>     +-------+-------+
>     |req_id |  cmd  |
>     +-------+-------+
>
> - **req_id** is generated by the frontend and is a cookie used to
>   identify one specific request/response pair of commands. Not to be
>   confused with any command **id** which are used to identify a socket
>   across multiple commands, see [Socket].
> - **cmd** is the command requested by the frontend:
>
>     - `PVCALLS_SOCKET`:         0
>     - `PVCALLS_CONNECT`:        1
>     - `PVCALLS_RELEASE`:        2
>     - `PVCALLS_BIND`:           3
>     - `PVCALLS_LISTEN`:         4
>     - `PVCALLS_ACCEPT`:         5
>     - `PVCALLS_POLL`:           6
>
> Both fields are echoed back by the backend. See [Socket families and
> address format] for the format of the **addr** field of connect and
> bind. The maximum size of command specific arguments is 56 bytes. Any
> future command that requires more than that will need to bump the
> **version** of the protocol.
>
> Similarly to other Xen ring based protocols, after writing a request to
> the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and
> issues an event channel notification when a notification is required.
>
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE`
> macro.
> The format is the following:
>
>     struct xen_pvcalls_response {
>         uint32_t req_id;
>         uint32_t cmd;
>         int32_t ret;
>         uint32_t pad;
>         union {
>             struct _xen_pvcalls_socket {
>                 uint64_t id;
>             } socket;
>             struct _xen_pvcalls_connect {
>                 uint64_t id;
>             } connect;
>             struct _xen_pvcalls_release {
>                 uint64_t id;
>             } release;
>             struct _xen_pvcalls_bind {
>                 uint64_t id;
>             } bind;
>             struct _xen_pvcalls_listen {
>                 uint64_t id;
>             } listen;
>             struct _xen_pvcalls_accept {
>                 uint64_t id;
>             } accept;
>             struct _xen_pvcalls_poll {
>                 uint64_t id;
>             } poll;
>             struct _xen_pvcalls_dummy {
>                 uint8_t dummy[8];
>             } dummy;
>         } u;
>     };
>
> The first four fields are common for every response. Their binary layout
> is:
>
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |req_id |  cmd  |  ret  |  pad  |
>     +-------+-------+-------+-------+
>
> - **req_id**: echoed back from request
> - **cmd**: echoed back from request
> - **ret**: return value, identifies success (0) or failure (see [Error
>   numbers] in further sections). If the **cmd** is not supported by the
>   backend, ret is ENOTSUP.
> - **pad**: padding
>
> After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks
> whether it needs to notify the frontend and does so via event channel.
>
> A description of each command and its additional request and response
> fields follows.
>
>
> #### Socket
>
> The **socket** operation corresponds to the POSIX [socket][socket]
> function. It creates a new socket of the specified family, type and
> protocol. **id** is freely chosen by the frontend and references this
> specific socket from this point forward. See [Socket families and
> address format].

.. to see which ones are supported by different versions of the protocol.

>
> Request fields:
>
> - **cmd** value: 0
> - additional fields:
>   - **id**: generated by the frontend, it identifies the new socket
>   - **domain**: the communication domain
>   - **type**: the socket type
>   - **protocol**: the particular protocol to be used with the socket, usually
>     0
>
> Request binary layout:
>
>     8       12      16      20      24      28
>     +-------+-------+-------+-------+-------+
>     |       id      |domain | type  |protoco|
>     +-------+-------+-------+-------+-------+
>
> Response additional fields:
>
> - **id**: echoed back from request
>
> Response binary layout:
>
>     16       20       24
>     +-------+--------+
>     |       id       |
>     +-------+--------+
>
> Return value:
>
>   - 0 on success
>   - See the [POSIX socket function][socket] for error names; see
>     [Error numbers] in further sections.
>
> #### Connect
>
> The **connect** operation corresponds to the POSIX [connect][connect]
> function. It connects a previously created socket (identified by **id**)
> to the specified address.
>
> The connect operation creates a new shared ring, which we'll call **data
> ring**. The data ring is used to send and receive data from the
> socket. The connect operation passes two additional parameters:
> **evtchn** and **ref**. **evtchn** is the port number of a new event
> channel which will be used for notifications of activity on the data
> ring.
> **ref** is the grant reference of the **indexes page**: a page
> which contains shared indexes that point to the write and read locations
> in the data ring. The **indexes page** also contains the full array of

s/data ring/**data ring**/

> grant references for the data ring. When the frontend issues a
> **connect** command, the backend:
>
> - finds its own internal socket corresponding to **id**
> - connects the socket to **addr**
> - maps the grant reference **ref**, the indexes page, see struct
>   pvcalls_data_intf
> - maps all the grant references listed in `struct pvcalls_data_intf` and
>   uses them as shared memory for the data ring

s/data ring/**data ring**/ perhaps?

> - binds the **evtchn**
> - replies to the frontend
>
> The [Indexes Page and Data ring] format will be described in the
> following section. The data ring is unmapped and freed upon issuing a
> **release** command on the active socket identified by **id**. A
> frontend stage change can also cause data rings to be unmapped.

s/stage/state/

>
> Request fields:
>
> - **cmd** value: 1
> - additional fields:
>   - **id**: identifies the socket
>   - **addr**: address to connect to, see [Socket families and address format]

Hm, so what do we do if we want to support AF_UNIX which has an addr
of 108 bytes?

>   - **len**: address length up to 28 octets.
>   - **flags**: flags for the connection, reserved for future usage
>   - **ref**: grant reference of the indexes page
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
>
> Request binary layout:
>
>     8       12      16      20      24      28      32      36      40      44
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |       id      |                            addr                       |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     | len   | flags |  ref  |evtchn |
>     +-------+-------+-------+-------+
>
> Response additional fields:
>
> - **id**: echoed back from request
>
> Response binary layout:
>
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
>
> Return value:
>
>   - 0 on success
>   - See the [POSIX connect function][connect] for error names; see
>     [Error numbers] in further sections.
>
> #### Release
>
> The **release** operation closes an existing active or passive socket.
>
> When a release command is issued on a passive socket, the backend
> releases it and frees its internal mappings. When a release command is
> issued for an active socket, the data ring and indexes page are also
> unmapped and freed:
>
> - frontend sends release command for an active socket
> - backend releases the socket
> - backend unmaps the data ring
> - backend unmaps the indexes page
> - backend unbinds the event channel
> - backend replies to frontend with a **ret** value
> - frontend frees data ring, indexes page and unbinds event channel
>
> Request fields:
>
> - **cmd** value: 2
> - additional fields:
>   - **id**: identifies the socket
>   - **reuse**: an optimization hint for the backend. The field is
>     ignored for passive sockets. When set to 1, the frontend lets the
>     backend know that it will reuse exactly the same set of grant pages
>     (indexes page and data ring) and event channel when creating one of
>     the next active sockets. The backend can take advantage of it by
>     delaying unmapping grants and unbinding the event channel. The
>     backend is free to ignore the hint.
>     Reused data rings are found by
>     **ref**, the grant reference of the page containing the indexes.
>
> Request binary layout:
>
>     8       12      16    17
>     +-------+-------+-----+
>     |       id      |reuse|
>     +-------+-------+-----+
>
> Response additional fields:
>
> - **id**: echoed back from request
>
> Response binary layout:
>
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
>
> Return value:
>
>   - 0 on success
>   - See the [POSIX shutdown function][shutdown] for error names; see
>     [Error numbers] in further sections.
>
> #### Bind
>
> The **bind** operation corresponds to the POSIX [bind][bind] function.
> It assigns the address passed as parameter to a previously created
> socket, identified by **id**. **Bind**, **listen** and **accept** are
> the three operations required to have fully working passive sockets and
> should be issued in that order.
>
> Request fields:
>
> - **cmd** value: 3
> - additional fields:
>   - **id**: identifies the socket
>   - **addr**: address to bind to, see [Socket families and address
>     format]
>   - **len**: address length

.. up to 28 octets.

>
> Request binary layout:
>
>     8       12      16      20      24      28      32      36      40      44
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |       id      |                            addr                       |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |  len  |
>     +-------+
>
> Response additional fields:
>
> - **id**: echoed back from request
>
> Response binary layout:
>
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
>
> Return value:
>
>   - 0 on success
>   - See the [POSIX bind function][bind] for error names; see
>     [Error numbers] in further sections.
>

..snip..

> #### Accept
>
> The **accept** operation extracts the first connection request on the
> queue of pending connections for the listening socket identified by
> **id** and creates a new connected socket. The id of the new socket is
> also chosen by the frontend and passed as an additional field of the
> accept request struct (**id_new**).
> See the [POSIX accept function][accept]
> as reference.
>
> Similarly to the **connect** operation, **accept** creates new [Indexes
> Page and Data ring]. The data ring is used to send and receive data from
> the socket. The **accept** operation passes two additional parameters:
> **evtchn** and **ref**. **evtchn** is the port number of a new event
> channel which will be used for notifications of activity on the data

s/data/**data/

> ring. **ref** is the grant reference of the **indexes page**: a page

s/ring/ring**/

> which contains shared indexes that point to the write and read locations
> in the data ring. The **indexes page** also contains the full array of

Perhaps highlight data ring here?

> grant references for the data ring.
>
> The backend will reply to the request only when a new connection is
> successfully accepted, i.e. the backend does not return EAGAIN or
> EWOULDBLOCK.
>
> Example workflow:
>
> - frontend issues an **accept** request
> - backend waits for a connection to be available on the socket
> - a new connection becomes available
> - backend accepts the new connection
> - backend creates an internal mapping from **id_new** to the new socket
> - backend maps the grant reference **ref**, the indexes page, see struct
>   pvcalls_data_intf
> - backend maps all the grant references listed in `struct
>   pvcalls_data_intf` and uses them as shared memory for the new data
>   ring **in** and **out** arrays
> - backend binds to the **evtchn**
> - backend replies to the frontend with a **ret** value
>
> Request fields:
>
> - **cmd** value: 5
> - additional fields:
>   - **id**: id of listening socket
>   - **id_new**: id of the new socket
>   - **ref**: grant reference of the indexes page
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
>
> Request binary layout:
>
>     8       12      16      20      24      28      32
>     +-------+-------+-------+-------+-------+-------+
>     |       id      |    id_new     |  ref  |evtchn |
>     +-------+-------+-------+-------+-------+-------+
>
> Response
> additional fields:
>
> - **id**: id of the listening socket, echoed back from request
>
> Response binary layout:
>
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
>
> Return value:
>
>   - 0 on success
>   - See the [POSIX accept function][accept] for error names; see
>     [Error numbers] in further sections.
>

..snip..

> ### Indexes Page and Data ring
>
> Data rings are used for sending and receiving data over a connected
> socket. They are created upon a successful **accept** or **connect**
> command.
> The **sendmsg** and **recvmsg** calls are implemented by sending data and
> receiving data from a data ring, and updating the corresponding indexes
> on the **indexes page**.
>
> Firstly, the **indexes page** is shared by a **connect** or **accept**
> command, see **ref** parameter in their sections. The content of the
> **indexes page** is represented by `struct pvcalls_data_intf`, see
> below. The structure contains the list of grant references which
> constitute the **in** and **out** buffers of the data ring, see ref[]
> below. The backend maps the grant references contiguously. Of the
> resulting shared memory, the first half is dedicated to the **in** array
> and the second half to the **out** array. They are used as circular
> buffers for transferring data, and, together, they are the data ring.
>
>
>     +---------------------------+                 Indexes page
>     | Command ring:             |                 +----------------------+
>     | @0: xen_pvcalls_connect:  |                 |@0 pvcalls_data_intf: |

^-- The first 64 bytes are reserved for the in_cons, etc. Perhaps just
start at @64 (and naturally add that to the 'ref').

>     | @44: ref  +--------------------------------->+@76: ring_order = 1   |
>     |                           |                 |@80: ref[0]+          |
>     +---------------------------+                 |@84: ref[1]+          |
>                                                   |           |          |
>                                                   |           |          |
>                                                   +----------------------+
>                                                               |
>                                                               v (data ring)
>                                                       +-------+-----------+
>                                                       |  @0->4095: in     |
>                                                       |  ref[0]           |
>                                                       |-------------------|
>                                                       |  @4096->8191: out |
>                                                       |  ref[1]           |
>                                                       +-------------------+
>

Thank you!
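To make the in/out split described above concrete, here is a minimal sketch of how a contiguously mapped data ring could be divided into its two halves. It assumes 4 KiB pages; `SKETCH_PAGE_SHIFT`, `struct sketch_data_ring` and `sketch_split()` are hypothetical names used only for illustration and are not part of the proposed header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the two-page example in the diagram. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical helper type, not part of the proposed protocol header. */
struct sketch_data_ring {
    uint8_t *in;       /* first half: backend produces, frontend consumes */
    uint8_t *out;      /* second half: frontend produces, backend consumes */
    size_t array_size; /* size in bytes of each of the two arrays */
};

/*
 * Split a region of (1 << ring_order) pages, mapped contiguously from the
 * grant references in ref[], into the **in** and **out** arrays.
 */
static struct sketch_data_ring sketch_split(void *mapping, uint32_t ring_order)
{
    struct sketch_data_ring r;
    size_t total;

    assert(mapping != NULL);
    assert(ring_order < 20); /* arbitrary sanity bound for this sketch */

    total = ((size_t)1 << ring_order) << SKETCH_PAGE_SHIFT;
    r.array_size = total / 2;
    r.in = (uint8_t *)mapping;
    r.out = r.in + r.array_size;
    return r;
}
```

With `ring_order = 1` and 4 KiB pages this gives two 4096-byte arrays: **in** covers offsets 0 to 4095 and **out** covers offsets 4096 to 8191 of the mapping.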
> #### Indexes Page Structure
>
>     typedef uint32_t PVCALLS_RING_IDX;
>
>     struct pvcalls_data_intf {
>         PVCALLS_RING_IDX in_cons, in_prod;
>         int32_t in_error;

You don't want to perhaps include in_event?

>
>         uint8_t pad[52];
>
>         PVCALLS_RING_IDX out_cons, out_prod;
>         int32_t out_error;

And out_event as a way to do some form of interrupt mitigation
(similar to what you had proposed?)

>
>         uint32_t ring_order;
>         grant_ref_t ref[];
>     };
>
>     /* not actually C compliant (ring_order changes from socket to socket) */
>     struct pvcalls_data {
>         char in[((1<<ring_order)<<PAGE_SHIFT)/2];
>         char out[((1<<ring_order)<<PAGE_SHIFT)/2];
>     };
>
> - **ring_order**
>   It represents the order of the data ring. The following list of grant
>   references is of `(1 << ring_order)` elements. It cannot be greater than
>   **max-page-order**, as specified by the backend on XenBus. It has to
>   be one at minimum.

Oh? Why not zero (4KB)? The 'max-page-order' description has an example
of zero order. If it MUST be one or more, then perhaps 'max-page-order'
should say that it MUST be at least one?

> - **ref[]**
>   The list of grant references which will contain the actual data. They are
>   mapped contiguously in virtual memory. The first half of the pages is the
>   **in** array, the second half is the **out** array. The arrays must
>   have a power of two size. Together, their size is `(1 << ring_order) *
>   PAGE_SIZE`.
> - **in** is an array used as circular buffer
>   It contains data read from the socket. The producer is the backend, the
>   consumer is the frontend.
> - **out** is an array used as circular buffer
>   It contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.
> - **in_cons** and **in_prod**
>   Consumer and producer indexes for data read from the socket. They keep track
>   of how much data has already been consumed by the frontend from the **in**
>   array. **in_prod** is increased by the backend, after writing data to
>   **in**.
>   **in_cons** is increased by the frontend, after reading data from **in**.
> -ring-page-order ???
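
As an illustration of the **in_cons** / **in_prod** bookkeeping quoted above, here is a minimal sketch of a frontend-side read from the **in** array. It assumes that the indexes run freely and are masked on use (the quoted text only says the arrays have a power of two size, so this is an assumption), and it leaves out the memory barriers a real implementation would need. All `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t PVCALLS_RING_IDX;

/* Assumption: indexes are free-running; size is a power of two. */
static inline uint32_t sketch_mask(PVCALLS_RING_IDX idx, uint32_t size)
{
    return idx & (size - 1);
}

/* Bytes produced but not yet consumed; unsigned wrap-around keeps it valid. */
static inline uint32_t sketch_queued(PVCALLS_RING_IDX prod,
                                     PVCALLS_RING_IDX cons)
{
    return prod - cons;
}

/*
 * Copy up to len bytes out of the **in** array and return the value the
 * frontend would then store in in_cons. Barriers are omitted: real code
 * must order the read of in_prod before the data reads, and the data
 * reads before publishing the new in_cons.
 */
static PVCALLS_RING_IDX sketch_read_in(const uint8_t *in, uint32_t size,
                                       PVCALLS_RING_IDX prod,
                                       PVCALLS_RING_IDX cons,
                                       uint8_t *dst, uint32_t len,
                                       uint32_t *copied)
{
    uint32_t avail, n, off, first;

    assert(size != 0 && (size & (size - 1)) == 0);
    avail = sketch_queued(prod, cons);
    assert(avail <= size);

    n = len < avail ? len : avail;
    off = sketch_mask(cons, size);
    first = n < size - off ? n : size - off;

    memcpy(dst, in + off, first);       /* up to the end of the array */
    memcpy(dst + first, in, n - first); /* wrapped remainder, if any */
    *copied = n;
    return cons + n;
}
```

The **out** direction would be symmetrical, with the frontend advancing out_prod after writing and the backend advancing out_cons after reading.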
