
Re: [Xen-devel] [DOC v8] PV Calls protocol design



..snip..
> #### Frontend XenBus Nodes
> 
> version
>      Values:         <string>
> 
>      Protocol version, chosen among the ones supported by the backend
>      (see **versions** under [Backend XenBus Nodes]). Currently the
>      value must be "1".
> 
> port
>      Values:         <uint32_t>
> 
>      The identifier of the Xen event channel used to signal activity
>      in the command ring.
> 
> ring-ref
>      Values:         <uint32_t>
> 
>      The Xen grant reference granting permission for the backend to map
>      the sole page in a single page sized command ring.
> 
> #### Backend XenBus Nodes
> 
> versions
>      Values:         <string>
> 
>      List of comma separated protocol versions supported by the backend.
>      For example "1,2,3". Currently the value is just "1", as there is
>      only one version.
> 
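To make the comma separated format concrete, here is a minimal sketch of how a frontend might check that the version it wants is in **versions**. `pvcalls_version_supported` is a hypothetical helper, not something the document defines, and reading the node from xenstore is left out:

```c
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the protocol document): returns true if
 * `wanted` appears in a comma separated list such as "1,2,3".
 */
static bool pvcalls_version_supported(const char *versions, unsigned long wanted)
{
    const char *p = versions;

    while (*p != '\0') {
        char *end;
        unsigned long v = strtoul(p, &end, 10);

        if (end == p)        /* no digits here: treat the list as malformed */
            return false;
        if (v == wanted)
            return true;
        if (*end != ',')     /* end of list or unexpected character */
            break;
        p = end + 1;         /* skip the comma, parse the next entry */
    }
    return false;
}
```

With the current backend value of "1", only a frontend asking for version 1 would get a match.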
> max-page-order
>      Values:         <uint32_t>
> 
>      The maximum supported size of a memory allocation in units of
>      log2n(machine pages), e.g. 0 == 1 page, 1 == 2 pages, 2 == 4 pages,
>      etc.

.. for the **data rings** (not to be confused with the command ring).

> 
> function-calls
>      Values:         <uint32_t>
> 
>      Value "0" means that no calls are supported.
>      Value "1" means that socket, connect, release, bind, listen, accept
>      and poll are supported.
> 
..snip..
> ### Commands Ring
> 
> The shared ring is used by the frontend to forward POSIX function calls
> to the backend. We shall refer to this ring as **commands ring** to
> distinguish it from other rings which can be created later in the
> lifecycle of the protocol (see [Indexes Page and Data ring]). The grant
> reference for shared page for this ring is shared on xenstore (see
> [Frontend XenBus Nodes]). The ring format is defined using the familiar
> `DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`).  Frontend
> requests are allocated on the ring using the `RING_GET_REQUEST` macro.
> The list of commands below is in calling order.
> 
> The format is defined as follows:
>     
>     #define PVCALLS_SOCKET         0
>     #define PVCALLS_CONNECT        1
>     #define PVCALLS_RELEASE        2
>     #define PVCALLS_BIND           3
>     #define PVCALLS_LISTEN         4
>     #define PVCALLS_ACCEPT         5
>     #define PVCALLS_POLL           6
> 
>     struct xen_pvcalls_request {
>       uint32_t req_id; /* private to guest, echoed in response */
>       uint32_t cmd;    /* command to execute */
>       union {
>               struct xen_pvcalls_socket {
>                       uint64_t id;
>                       uint32_t domain;
>                       uint32_t type;
>                       uint32_t protocol;
>                 #ifdef CONFIG_X86_32
>                 uint8_t pad[4];

Could that be shifted to the right?
>                 #endif
>               } socket;
>               struct xen_pvcalls_connect {
>                       uint64_t id;
>                       uint8_t addr[28];
>                       uint32_t len;
>                       uint32_t flags;
>                       grant_ref_t ref;
>                       uint32_t evtchn;
>                 #ifdef CONFIG_X86_32
>                 uint8_t pad[4];
>                 #endif
>               } connect;
>               struct xen_pvcalls_release {
>                       uint64_t id;
>                       uint8_t reuse;
>                 #ifdef CONFIG_X86_32
>                 uint8_t pad[7];

Could that be shifted to the right?
>                 #endif
>               } release;
>               struct xen_pvcalls_bind {
>                       uint64_t id;
>                       uint8_t addr[28];
>                       uint32_t len;
>               } bind;
>               struct xen_pvcalls_listen {
>                       uint64_t id;
>                       uint32_t backlog;
>                 #ifdef CONFIG_X86_32
>                 uint8_t pad[4];

Could that be shifted to the right?
>                 #endif
>               } listen;
>               struct xen_pvcalls_accept {
>                       uint64_t id;
>                       uint64_t id_new;
>                       grant_ref_t ref;
>                       uint32_t evtchn;
>               } accept;
>               struct xen_pvcalls_poll {
>                       uint64_t id;
>               } poll;
>               /* dummy member to force sizeof(struct xen_pvcalls_request)
>                * to match across archs */
>               struct xen_pvcalls_dummy {
>                       uint8_t dummy[56];
>               } dummy;
>       } u;
>     };
> 
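For what it's worth, the 56-byte dummy can be checked at compile time. A minimal sketch, assuming `grant_ref_t` is a `uint32_t` (as in Xen's public headers) and keeping only the largest member, connect, plus the dummy. The `CONFIG_X86_32` padding is left out on purpose here, since the dummy member alone already fixes the size of the union:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t;   /* assumption: same as Xen's public headers */

/* Trimmed copy of the request: only the largest member and the dummy. */
struct xen_pvcalls_request {
    uint32_t req_id;
    uint32_t cmd;
    union {
        struct xen_pvcalls_connect {
            uint64_t id;
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref;
            uint32_t evtchn;
        } connect;
        struct xen_pvcalls_dummy {
            uint8_t dummy[56];
        } dummy;
    } u;
};

/* The command specific arguments start right after req_id and cmd. */
static_assert(offsetof(struct xen_pvcalls_request, u) == 8,
              "arguments must start at byte 8");
/* The union is exactly the 56 bytes the document promises. */
static_assert(sizeof(((struct xen_pvcalls_request *)0)->u) == 56,
              "command arguments must be 56 bytes");
static_assert(sizeof(struct xen_pvcalls_request) == 64,
              "request must be 64 bytes on every arch");
```

If a future command grew past 56 bytes, the second assertion would fail at build time, which is the point where the **version** would need to change.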
> The first two fields are common for every command. Their binary layout
> is:
> 
>     0       4       8
>     +-------+-------+
>     |req_id |  cmd  |
>     +-------+-------+
> 
> - **req_id** is generated by the frontend and is a cookie used to
>   identify one specific request/response pair of commands. Not to be
>   confused with any command **id** which are used to identify a socket
>   across multiple commands, see [Socket].
> - **cmd** is the command requested by the frontend:
> 
>     - `PVCALLS_SOCKET`:  0
>     - `PVCALLS_CONNECT`: 1
>     - `PVCALLS_RELEASE`: 2
>     - `PVCALLS_BIND`:    3
>     - `PVCALLS_LISTEN`:  4
>     - `PVCALLS_ACCEPT`:  5
>     - `PVCALLS_POLL`:    6
> 
> Both fields are echoed back by the backend. See [Socket families and
> address format] for the format of the **addr** field of connect and
> bind. The maximum size of command specific arguments is 56 bytes. Any
> future command that requires more than that will need to bump the
> **version** of the protocol.
> 
> Similarly to other Xen ring based protocols, after writing a request to
> the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and
> issues an event channel notification when a notification is required.
> 
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE`
> macro. The format is the following:
> 
>     struct xen_pvcalls_response {
>         uint32_t req_id;
>         uint32_t cmd;
>         int32_t ret;
>         uint32_t pad;
>         union {
>               struct _xen_pvcalls_socket {
>                       uint64_t id;
>               } socket;
>               struct _xen_pvcalls_connect {
>                       uint64_t id;
>               } connect;
>               struct _xen_pvcalls_release {
>                       uint64_t id;
>               } release;
>               struct _xen_pvcalls_bind {
>                       uint64_t id;
>               } bind;
>               struct _xen_pvcalls_listen {
>                       uint64_t id;
>               } listen;
>               struct _xen_pvcalls_accept {
>                       uint64_t id;
>               } accept;
>               struct _xen_pvcalls_poll {
>                       uint64_t id;
>               } poll;
>               struct _xen_pvcalls_dummy {
>                       uint8_t dummy[8];
>               } dummy;
>       } u;
>     };
> 
> The first four fields are common for every response. Their binary layout
> is:
> 
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |req_id |  cmd  |  ret  |  pad  |
>     +-------+-------+-------+-------+
> 
> - **req_id**: echoed back from request
> - **cmd**: echoed back from request
> - **ret**: return value, identifies success (0) or failure (see [Error
>   numbers] in further sections). If the **cmd** is not supported by the
>   backend, ret is ENOTSUP.
> - **pad**: padding
> 
> After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks
> whether it needs to notify the frontend and does so via event channel.
> 
> A description of each command, their additional request and response
> fields follow.
> 
> 
> #### Socket
> 
> The **socket** operation corresponds to the POSIX [socket][socket]
> function. It creates a new socket of the specified family, type and
> protocol. **id** is freely chosen by the frontend and references this
> specific socket from this point forward. See [Socket families and
> address format].

.. to see which ones are supported by different versions of the
protocol.

> 
> Request fields:
> 
> - **cmd** value: 0
> - additional fields:
>   - **id**: generated by the frontend, it identifies the new socket
>   - **domain**: the communication domain
>   - **type**: the socket type
>   - **protocol**: the particular protocol to be used with the socket, usually > 0
> 
> Request binary layout:
> 
>     8       12      16      20     24       28
>     +-------+-------+-------+-------+-------+
>     |       id      |domain | type  |protoco|
>     +-------+-------+-------+-------+-------+
> 
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16       20       24
>     +-------+--------+
>     |       id       |
>     +-------+--------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX socket function][socket] for error names; see
>     [Error numbers] in further sections.
> 
> #### Connect
> 
> The **connect** operation corresponds to the POSIX [connect][connect]
> function. It connects a previously created socket (identified by **id**)
> to the specified address.
> 
> The connect operation creates a new shared ring, which we'll call **data
> ring**. The data ring is used to send and receive data from the
> socket. The connect operation passes two additional parameters:
> **evtchn** and **ref**. **evtchn** is the port number of a new event
> channel which will be used for notifications of activity on the data
> ring. **ref** is the grant reference of the **indexes page**: a page
> which contains shared indexes that point to the write and read locations
> in the data ring. The **indexes page** also contains the full array of

s/data ring/**data ring**/ 

> grant references for the data ring. When the frontend issues a
> **connect** command, the backend:
> 
> - finds its own internal socket corresponding to **id**
> - connects the socket to **addr**
> - maps the grant reference **ref**, the indexes page, see struct
>   pvcalls_data_intf
> - maps all the grant references listed in `struct pvcalls_data_intf` and
>   uses them as shared memory for the data ring

s/data ring/**data ring**/ perhaps?

> - binds the **evtchn**
> - replies to the frontend
> 
> The [Indexes Page and Data ring] format will be described in the
> following section. The data ring is unmapped and freed upon issuing a
> **release** command on the active socket identified by **id**. A
> frontend stage change can also cause data rings to be unmapped.

s/stage/state/
> 
> Request fields:
> 
> - **cmd** value: 1
> - additional fields:
>   - **id**: identifies the socket
>   - **addr**: address to connect to, see [Socket families and address format]


Hm, so what do we do if we want to support AF_UNIX which has an addr of
108 bytes?

>   - **len**: address length

.. up to 28 octets.

>   - **flags**: flags for the connection, reserved for future usage
>   - **ref**: grant reference of the indexes page
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
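On the 28-octet **addr** limit: a small sketch of which common socket address structures would fit. `PVCALLS_ADDR_MAX` and `pvcalls_addr_fits` are hypothetical names used only for illustration; the structure sizes come from the system headers and can differ between platforms:

```c
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/un.h>

#define PVCALLS_ADDR_MAX 28   /* size of addr[] in the connect and bind requests */

/* Hypothetical helper: does a socket address of `len` bytes fit in addr[]? */
static int pvcalls_addr_fits(size_t len)
{
    return len <= PVCALLS_ADDR_MAX;
}
```

Calling it with `sizeof(struct sockaddr_in)` and `sizeof(struct sockaddr_in6)` should succeed, while `sizeof(struct sockaddr_un)` should not, which is the AF_UNIX concern raised above.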
> Request binary layout:
> 
>     8       12      16      20      24      28      32      36      40      44
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |       id      |                            addr                       |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     | len   | flags |  ref  |evtchn |
>     +-------+-------+-------+-------+
> 
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX connect function][connect] for error names; see
>     [Error numbers] in further sections.
> 
> #### Release
> 
> The **release** operation closes an existing active or a passive socket.
> 
> When a release command is issued on a passive socket, the backend
> releases it and frees its internal mappings. When a release command is
> issued for an active socket, the data ring and indexes page are also
> unmapped and freed:
> 
> - frontend sends release command for an active socket
> - backend releases the socket
> - backend unmaps the data ring
> - backend unmaps the indexes page
> - backend unbinds the event channel
> - backend replies to frontend with a **ret** value
> - frontend frees data ring, indexes page and unbinds event channel
> 
> Request fields:
> 
> - **cmd** value: 2
> - additional fields:
>   - **id**: identifies the socket
>   - **reuse**: an optimization hint for the backend. The field is
>     ignored for passive sockets. When set to 1, the frontend lets the
>     backend know that it will reuse exactly the same set of grant pages
>     (indexes page and data ring) and event channel when creating one of
>     the next active sockets. The backend can take advantage of it by
>     delaying unmapping grants and unbinding the event channel. The
>     backend is free to ignore the hint. Reused data rings are found by
>     **ref**, the grant reference of the page containing the indexes.
> 
> Request binary layout:
> 
>     8       12      16    17
>     +-------+-------+-----+
>     |       id      |reuse|
>     +-------+-------+-----+
> 
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX shutdown function][shutdown] for error names; see
>     [Error numbers] in further sections.
> 
> #### Bind
> 
> The **bind** operation corresponds to the POSIX [bind][bind] function.
> It assigns the address passed as parameter to a previously created
> socket, identified by **id**. **Bind**, **listen** and **accept** are
> the three operations required to have fully working passive sockets and
> should be issued in that order.
> 
> Request fields:
> 
> - **cmd** value: 3
> - additional fields:
>   - **id**: identifies the socket
>   - **addr**: address to bind to, see [Socket families and address
>     format]
>   - **len**: address length

.. up to 28 octets.
> 
> Request binary layout:
> 
>     8       12      16      20      24      28      32      36      40      44
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |       id      |                            addr                       |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |  len  |
>     +-------+
> 
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX bind function][bind] for error names; see
>     [Error numbers] in further sections.
> 
> 
..snip..
> #### Accept
> 
> The **accept** operation extracts the first connection request on the
> queue of pending connections for the listening socket identified by
> **id** and creates a new connected socket. The id of the new socket is
> also chosen by the frontend and passed as an additional field of the
> accept request struct (**id_new**). See the [POSIX accept function][accept]
> as reference.
> 
> Similarly to the **connect** operation, **accept** creates new [Indexes
> Page and Data ring]. The data ring is used to send and receive data from
> the socket. The **accept** operation passes two additional parameters:
> **evtchn** and **ref**. **evtchn** is the port number of a new event
> channel which will be used for notifications of activity on the data

s/data/**data/
> ring. **ref** is the grant reference of the **indexes page**: a page

s/ring/ring**/

> which contains shared indexes that point to the write and read locations
> in the data ring. The **indexes page** also contains the full array of

Perhaps highlight data ring here?

> grant references for the data ring.
> 
> The backend will reply to the request only when a new connection is
> successfully accepted, i.e. the backend does not return EAGAIN or
> EWOULDBLOCK.
> 
> Example workflow:
> 
> - frontend issues an **accept** request
> - backend waits for a connection to be available on the socket
> - a new connection becomes available
> - backend accepts the new connection
> - backend creates an internal mapping from **id_new** to the new socket
> - backend maps the grant reference **ref**, the indexes page, see struct
>   pvcalls_data_intf
> - backend maps all the grant references listed in `struct
>   pvcalls_data_intf` and uses them as shared memory for the new data
>   ring **in** and **out** arrays
> - backend binds to the **evtchn**
> - backend replies to the frontend with a **ret** value
> 
> Request fields:
> 
> - **cmd** value: 5
> - additional fields:
>   - **id**: id of listening socket
>   - **id_new**: id of the new socket
>   - **ref**: grant reference of the indexes page
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
> Request binary layout:
> 
>     8       12      16      20      24      28      32
>     +-------+-------+-------+-------+-------+-------+
>     |       id      |    id_new     |  ref  |evtchn |
>     +-------+-------+-------+-------+-------+-------+
> 
> Response additional fields:
> 
> - **id**: id of the listening socket, echoed back from request
> 
> Response binary layout:
> 
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX accept function][accept] for error names; see
>     [Error numbers] in further sections.
> 
> 
..snip..
> ### Indexes Page and Data ring
> 
> Data rings are used for sending and receiving data over a connected socket.
> They are created upon a successful **accept** or **connect** command.
> The **sendmsg** and **recvmsg** calls are implemented by sending data and
> receiving data from a data ring, and updating the corresponding indexes
> on the **indexes page**.
> 
> Firstly, the **indexes page** is shared by a **connect** or **accept**
> command, see **ref** parameter in their sections. The content of the
> **indexes page** is represented by `struct pvcalls_data_intf`, see
> below. The structure contains the list of grant references which
> constitute the **in** and **out** buffers of the data ring, see ref[]
> below. The backend maps the grant references contiguously. Of the
> resulting shared memory, the first half is dedicated to the **in** array
> and the second half to the **out** array. They are used as circular
> buffers for transferring data, and, together, they are the data ring.
> 
> 
>   +---------------------------+                 Indexes page
>   | Command ring:             |                 +----------------------+
>   | @0: xen_pvcalls_connect:  |                 |@0 pvcalls_data_intf: |
      ^-- The first 64 bytes are reserved for the in_cons, etc.
           Perhaps just start at @64 (And naturally add that to the 'ref')

      
>   | @44: ref  +-------------------------------->+@76: ring_order = 1   |
>   |                           |                 |@80: ref[0]+          |
>   +---------------------------+                 |@84: ref[1]+          |
>                                                 |           |          |
>                                                 |           |          |
>                                                 +----------------------+
>                                                             |
>                                                             v (data ring)
>                                                     +-------+-----------+
>                                                     |  @0->4095: in     |
>                                                     |  ref[0]           |
>                                                     |-------------------|
>                                                     |  @4096->8191: out |
>                                                     |  ref[1]           |
>                                                     +-------------------+
>  
> 

Thank you!

> #### Indexes Page Structure
> 
>     typedef uint32_t PVCALLS_RING_IDX;
> 
>     struct pvcalls_data_intf {
>       PVCALLS_RING_IDX in_cons, in_prod;
>       int32_t in_error;

Would you perhaps want to include in_event?
> 
>       uint8_t pad[52];
> 
>       PVCALLS_RING_IDX out_cons, out_prod;
>       int32_t out_error;

And out_event, as a way to do some form of interrupt mitigation
(similar to what you had proposed?)

> 
>       uint32_t ring_order;
>       grant_ref_t ref[];
>     };
> 
>     /* not actually C compliant (ring_order changes from socket to socket) */
>     struct pvcalls_data {
>         char in[((1<<ring_order)<<PAGE_SHIFT)/2];
>         char out[((1<<ring_order)<<PAGE_SHIFT)/2];
>     };
> 
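The sizes implied by the pseudo-struct above can be written out as plain arithmetic. A minimal sketch, assuming 4KB pages (`PVCALLS_PAGE_SHIFT` of 12); the macro and both helper names are hypothetical and only illustrate the formulas in the quoted text:

```c
#include <stddef.h>
#include <stdint.h>

#define PVCALLS_PAGE_SHIFT 12   /* assumption: 4KB pages */

/* Number of grant references listed in ref[] for a given ring_order. */
static uint32_t pvcalls_nr_refs(uint32_t ring_order)
{
    return UINT32_C(1) << ring_order;
}

/* Size in bytes of each of the in and out arrays (half of the data ring). */
static size_t pvcalls_array_size(uint32_t ring_order)
{
    return (((size_t)1 << ring_order) << PVCALLS_PAGE_SHIFT) / 2;
}
```

With ring_order = 1 this gives two grant references and 4096 bytes for each of **in** and **out**. With ring_order = 0 the same formula would give 2048 bytes each, i.e. half a page per array.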
> - **ring_order**
>   It represents the order of the data ring. The following list of grant
>   references is of `(1 << ring_order)` elements. It cannot be greater than
>   **max-page-order**, as specified by the backend on XenBus. It has to
>   be one at minimum.

Oh? Why not zero (4KB)? The 'max-page-order' description gives zero order
as an example. If ring_order MUST be one or more, then 'max-page-order'
should say that it MUST be at least one.


> - **ref[]**
>   The list of grant references which will contain the actual data. They are
>   mapped contiguously in virtual memory. The first half of the pages is the
>   **in** array, the second half is the **out** array. The arrays must
>   have a power of two size. Together, their size is `(1 << ring_order) *
>   PAGE_SIZE`.
> - **in** is an array used as circular buffer
>   It contains data read from the socket. The producer is the backend, the
>   consumer is the frontend.
> - **out** is an array used as circular buffer
>   It contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.
> - **in_cons** and **in_prod**
>   Consumer and producer indexes for data read from the socket. They keep track
>   of how much data has already been consumed by the frontend from the **in**
>   array. **in_prod** is increased by the backend, after writing data to
>   **in**. **in_cons** is increased by the frontend, after reading data from
>   **in**.
> -ring-page-order

??? 
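To check my reading of the index semantics, here is a minimal sketch of the arithmetic. It assumes the indexes are free-running 32-bit counters that are masked only when accessing the array; the quoted text says only that they are increased, so that part is my assumption, and both helper names are hypothetical:

```c
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX;

/* Bytes written by the producer and not yet read by the consumer. */
static PVCALLS_RING_IDX pvcalls_queued(PVCALLS_RING_IDX prod,
                                       PVCALLS_RING_IDX cons)
{
    return prod - cons;   /* unsigned arithmetic handles index wraparound */
}

/* Offset into the in or out array for a free-running index. */
static PVCALLS_RING_IDX pvcalls_mask(PVCALLS_RING_IDX idx,
                                     PVCALLS_RING_IDX array_size)
{
    return idx & (array_size - 1);   /* valid only for power of two sizes */
}
```

If that reading is right, it would also explain why the arrays must have a power of two size.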
