
Re: [Xen-devel] [DRAFT 1] XenSock protocol design document



On 08/07/16 13:23, Stefano Stabellini wrote:
> Hi all,
> 
> as promised, this is the design document for the XenSock protocol I
> mentioned here:
> 
> http://marc.info/?l=xen-devel&m=146520572428581
> 
> It is still in its early days but should give you a good idea of what it
> looks like and how it is supposed to work. Let me know if you find gaps
> in the document and I'll fill them in the next version.
> 
> You can find prototypes of the Linux frontend and backend drivers here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git xensock-1
> 
> To use them, make sure to enable CONFIG_XENSOCK in your kernel config
> and add "xensock=1" to the command line of your DomU Linux kernel. You
> also need the toolstack to create the initial xenstore nodes for the
> protocol. To do that, please apply the attached patch to libxl (the
> patch is based on Xen 4.7.0-rc3) and add "xensock=1" to your DomU config
> file.
> 
> Feel free to try them out! Please be kind, they are only prototypes with
> a few known issues :-) But they should work well enough to run simple
> tests. If you find something missing, let me know or, even better, write
> a patch!
> 
> I'll follow up with a separate document to cover the design of my
> particular implementation of the protocol.
> 
> Cheers,
> 
> Stefano
> 
> ---
> 
> # XenSocks Protocol v1
> 
> ## Rationale
> 
> XenSocks is a paravirtualized protocol for the POSIX socket API.
> 
> The purpose of XenSocks is to allow a specific set of POSIX calls to be
> implemented in a domain other than your own: connect, accept, bind,
> release, listen, poll, recvmsg and sendmsg can all be carried out by
> another domain.
> 
> XenSocks provides the following benefits:
> * guest networking works out of the box with VPNs, wireless networks and
>   any other complex configurations on the host
> * guest services listen on ports bound directly to the backend domain IP
>   addresses
> * localhost becomes a secure namespace for inter-VM communications
> * full visibility of the guest behavior on the backend domain, allowing
>   for inexpensive filtering and manipulation of any guest calls
> * excellent performance
> 
> 
> ## Design
> 
> ### Xenstore
> 
> The frontend and the backend connect to each other by exchanging information
> via xenstore. The toolstack creates front and back nodes with state
> XenbusStateInitialising. There can only be one XenSock frontend per domain.
> 
> #### Frontend XenBus Nodes
> 
> port
>      Values:         <uint32_t>
> 
>      The identifier of the Xen event channel used to signal activity
>      in the ring buffer.
> 
> ring-ref
>      Values:         <uint32_t>
> 
>      The Xen grant reference granting permission for the backend to map
>      the sole page in a single page sized ring buffer.
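> 
> As a purely illustrative example (the xenstore path below is hypothetical;
> the exact location of the frontend area is chosen by the toolstack and is
> not part of this protocol), the nodes could end up looking like:
> 
>     /local/domain/<domid>/device/xensock/0/port     = "<event channel port>"
>     /local/domain/<domid>/device/xensock/0/ring-ref = "<grant reference>"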
> 
> 
> #### State Machine
> 
>     **Front**                             **Back**
>     XenbusStateInitialising               XenbusStateInitialising
>     - Query virtual device                - Query backend device
>       properties.                           identification data.
>     - Setup OS device instance.                          |
>     - Allocate and initialize the                        |
>       request ring.                                      V
>     - Publish transport parameters                XenbusStateInitWait
>       that will be in effect during
>       this connection.
>                  |
>                  |
>                  V
>        XenbusStateInitialised
> 
>                                           - Query frontend transport
>                                             parameters.
>                                           - Connect to the request ring and
>                                             event channel.
>                                                          |
>                                                          |
>                                                          V
>                                                  XenbusStateConnected
> 
>      - Query backend device properties.
>      - Finalize OS virtual device
>        instance.
>                  |
>                  |
>                  V
>         XenbusStateConnected
> 
> Once frontend and backend are connected, they have a shared page, which is
> used to exchange messages over a ring, and an event channel, which is used
> to send notifications.
> 
> 
> ### Commands Ring
> 
> The shared ring is used by the frontend to forward socket API calls to the
> backend. I'll refer to this ring as the **commands ring** to distinguish it
> from other rings which will be created later in the lifecycle of the
> protocol (data rings). The ring format is defined using the familiar
> `DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`). Frontend
> requests are allocated on the ring using the `RING_GET_REQUEST` macro.
> 
> The format is defined as follows:
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
>     
>     #define XENSOCK_CONNECT        0
>     #define XENSOCK_RELEASE        3
>     #define XENSOCK_BIND           4
>     #define XENSOCK_LISTEN         5
>     #define XENSOCK_ACCEPT         6
>     #define XENSOCK_POLL           7
>     
>     struct xen_xensock_request {
>         uint32_t id;     /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         uint64_t sockid; /* id of the socket */
>         union {
>             struct xen_xensock_connect {
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } connect;
>             struct xen_xensock_bind {
>                 uint8_t addr[28]; /* ipv6 ready */
>                 uint32_t len;
>             } bind;
>             struct xen_xensock_accept {
>                 uint64_t sockid;
>                 grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                 uint32_t evtchn;
>             } accept;
>         } u;
>     };

Below you write that the data ring is flexible and can support different
ring sizes. This contradicts the definition above: as soon as you modify
the ring size you change the interface. You'd have to modify all guests
and the host at the same time.

The flexibility should be kept, so I suggest ring size negotiation via
xenstore: the backend should indicate the maximum supported size and
the frontend should tell which size it is using. In the beginning I'd
see no problem with accepting a connection only if both values are
XENSOCK_DATARING_PAGES.

The connect and accept calls should reference only one page (possibly
with an offset into that page) holding the grant_ref_t array of the
needed size.

> 
> The first three fields are common to every command. Their binary layout
> is:
> 
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |
>     +-------+-------+-------+-------+
> 
> - **id** is generated by the frontend and identifies one specific request
> - **cmd** is the command requested by the frontend:
>     - `XENSOCK_CONNECT`: 0
>     - `XENSOCK_RELEASE`: 3
>     - `XENSOCK_BIND`:    4
>     - `XENSOCK_LISTEN`:  5
>     - `XENSOCK_ACCEPT`:  6
>     - `XENSOCK_POLL`:    7

Any reason for omitting the values 1 and 2?

> - **sockid** is generated by the frontend and identifies the socket to
>   connect, bind, etc. A new sockid is required on `XENSOCK_CONNECT` and
>   `XENSOCK_BIND` commands. A new sockid is also required on `XENSOCK_ACCEPT`,
>   for the new socket.
> 
> All three fields are echoed back by the backend.
> 
> As with other Xen ring-based protocols, after writing a request to the ring,
> the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event
> channel notification when one is required.
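> 
> For illustration, a minimal frontend-side sketch of the submission path,
> assuming a ring instantiated with `DEFINE_RING_TYPES` and Linux-style
> helpers (`notify_remote_via_evtchn`); function and variable names below
> are illustrative only, not part of the protocol:
> 
>     /* Instantiate the standard ring types for the commands ring
>      * (see xen/include/public/io/ring.h). */
>     DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
>                       struct xen_xensock_response);
>     
>     /* Sketch: enqueue a XENSOCK_LISTEN request and notify the backend
>      * if needed. */
>     static void xensock_send_listen(struct xen_xensock_front_ring *ring,
>             uint32_t id, uint64_t sockid, uint32_t cmd_evtchn)
>     {
>         struct xen_xensock_request *req;
>         int notify;
>     
>         req = RING_GET_REQUEST(ring, ring->req_prod_pvt);
>         ring->req_prod_pvt++;
>     
>         req->id = id;          /* echoed back in the response */
>         req->cmd = XENSOCK_LISTEN;
>         req->sockid = sockid;  /* id of a previously bound socket */
>     
>         RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(ring, notify);
>         if (notify)
>             notify_remote_via_evtchn(cmd_evtchn);
>     }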
> 
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE`
> macro. The format is the following:
> 
>     struct xen_xensock_response {
>         uint32_t id;
>         uint32_t cmd;
>         uint64_t sockid;
>         int32_t ret;
>     };
>    
>     0       4       8       12      16      20
>     +-------+-------+-------+-------+-------+
>     |  id   |  cmd  |     sockid    |  ret  |
>     +-------+-------+-------+-------+-------+
> 
> - **id**: echoed back from request
> - **cmd**: echoed back from request
> - **sockid**: echoed back from request
> - **ret**: return value, identifies success or failure
> 
> After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks
> whether it needs to notify the frontend and does so via the event channel.
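> 
> The backend side is symmetric; a sketch, again with illustrative names:
> 
>     /* Sketch: post a response for a completed request and notify the
>      * frontend if required. */
>     static void xensock_send_response(struct xen_xensock_back_ring *ring,
>             struct xen_xensock_request *req, int32_t ret,
>             uint32_t cmd_evtchn)
>     {
>         struct xen_xensock_response *rsp;
>         int notify;
>     
>         rsp = RING_GET_RESPONSE(ring, ring->rsp_prod_pvt);
>         ring->rsp_prod_pvt++;
>     
>         rsp->id = req->id;         /* echoed back from the request */
>         rsp->cmd = req->cmd;
>         rsp->sockid = req->sockid;
>         rsp->ret = ret;            /* 0 or a negative error code */
>     
>         RING_PUSH_RESPONSES_AND_CHECK_NOTIFY(ring, notify);
>         if (notify)
>             notify_remote_via_evtchn(cmd_evtchn);
>     }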
> 
> A description of each command, its additional request fields and the
> expected response follows.
> 
> 
> #### Connect
> 
> The **connect** operation corresponds to the connect system call. It
> connects a socket to the specified address. **sockid** is freely chosen by
> the frontend and references this specific socket from this point forward.
> 
> The connect operation creates a new shared ring, which we'll call the
> **data ring**. The new ring is used to send and receive data over the
> connected socket. The information necessary to set up the new ring, such as
> grant table references and event channel ports, is passed from the frontend
> to the backend as part of this request. A **data ring** is unmapped and
> freed upon issuing a **release** command on the active socket identified by
> **sockid**.
> 
> When the frontend issues a **connect** command, the backend:
> - creates a new socket and connects it to **addr**
> - creates an internal mapping from **sockid** to its own socket
> - maps all the grant references and uses them as shared memory for the new
>   data ring
> - binds the **evtchn**
> - replies to the frontend
> 
> The data ring format will be described in the following section.
> 
> Fields:
> 
> - **cmd** value: 0
> - additional fields:
>   - **addr**: address to connect to, in struct sockaddr format

So you expect only Linux guests with the current sockaddr layout?
Please specify the structure in the interface.

>   - **len**: address length
>   - **flags**: flags for the connection, reserved for future usage
>   - **ref**: grant references of the data ring
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |                            addr                       |  len  |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         | flags |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] |ref[6] |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|ref[14]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|ref[22]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|ref[30]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|ref[38]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|ref[46]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|ref[54]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|ref[62]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[63]|evtchn |  
>         +-------+-------+
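> 
> For illustration, the frontend side could fill a connect request as
> sketched below, assuming the usual C socket definitions. The data ring
> pages are assumed to have already been allocated and granted to the
> backend, and the event channel already allocated; all names below are
> illustrative only:
> 
>     /* Sketch: fill in a XENSOCK_CONNECT request. 'refs' holds the
>      * XENSOCK_DATARING_PAGES grant references of the data ring pages,
>      * 'data_evtchn' the event channel allocated for the data ring. */
>     static void xensock_fill_connect(struct xen_xensock_request *req,
>             uint32_t id, uint64_t sockid, const struct sockaddr *sa,
>             uint32_t len, const grant_ref_t *refs, uint32_t data_evtchn)
>     {
>         unsigned int i;
>     
>         req->id = id;
>         req->cmd = XENSOCK_CONNECT;
>         req->sockid = sockid;                  /* chosen by the frontend */
>         memcpy(req->u.connect.addr, sa, len);  /* len <= sizeof(addr) */
>         req->u.connect.len = len;
>         req->u.connect.flags = 0;              /* reserved */
>         for (i = 0; i < XENSOCK_DATARING_PAGES; i++)
>             req->u.connect.ref[i] = refs[i];
>         req->u.connect.evtchn = data_evtchn;
>     }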
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the socket system call

Again: don't think only Linux.

> 
> #### Release
> 
> The **release** operation closes an existing active or passive socket.
> 
> When a release command is issued on a passive socket, the backend releases
> it and frees its internal mappings. When a release command is issued for an
> active socket, the data ring is also unmapped and freed:
> 
> - frontend sends release command for an active socket
> - backend releases the socket
> - backend unmaps the ring
> - backend unbinds the evtchn
> - backend replies to frontend
> - frontend frees ring and unbinds evtchn
> 
> Fields:
> 
> - **cmd** value: 3
> - additional fields: none
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the shutdown system call

Again Linux only.

> 
> #### Bind
> 
> The **bind** operation assigns the address passed as a parameter to the
> socket. It corresponds to the bind system call. **sockid** is freely chosen
> by the frontend and references this specific socket from this point forward.
> **Bind**, **listen** and **accept** are the three operations required to
> have fully working passive sockets and should be issued in this order.
> 
> Fields:
> 
> - **cmd** value: 4
> - additional fields:
>   - **addr**: address to bind to, in struct sockaddr format

Ditto.

>   - **len**: address length
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |                            addr                       |  len  |
>         +-------+-------+-------+-------+-------+-------+-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the bind system call

Again.

> 
> 
> #### Listen
> 
> The **listen** operation marks the socket as a passive socket. It
> corresponds to the listen system call.
> 
> Fields:
> 
> - **cmd** value: 5
> - additional fields: none
> 
> Return value:
>   - 0 on success
>   - less than 0 on failure, see the error codes of the listen system call

Again.

> 
> 
> #### Accept
> 
> The **accept** operation extracts the first connection request on the queue
> of pending connections for the listening socket identified by **sockid**
> and creates a new connected socket. The **sockid** of the new socket is
> also chosen by the frontend and passed as an additional field of the accept
> request struct.
> 
> Similarly to the **connect** operation, **accept** creates a new data ring.
> The information necessary to set up the new ring, such as grant table
> references and event channel ports, is passed from the frontend to the
> backend as part of the request.
> 
> The backend will reply to the request only when a new connection is
> successfully accepted, i.e. the backend does not return EAGAIN or
> EWOULDBLOCK.
> 
> Example workflow:
> 
> - frontend issues an **accept** request
> - backend waits for a connection to be available on the socket
> - a new connection becomes available
> - backend accepts the new connection
> - backend creates an internal mapping from **sockid** to the new socket
> - backend maps all the grant references and uses them as shared memory for the
>   new data ring
> - backend binds the **evtchn**
> - backend replies to the frontend
> 
> Fields:
> 
> - **cmd** value: 6
> - additional fields:
>   - **sockid**: id of the new socket
>   - **ref**: grant references of the data ring
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
> Binary layout:
> 
>         16      20      24      28      32      36      40      44     48
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |    sockid     |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] | 
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[6] |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[14]|ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[22]|ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[30]|ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[38]|ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[46]|ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[54]|ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|
>         +-------+-------+-------+-------+-------+-------+-------+-------+
>         |ref[62]|ref[63]|evtchn | 
>         +-------+-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the accept system call

Again.

> 
> 
> #### Poll
> 
> The **poll** operation is only valid for passive sockets. For active
> sockets, the frontend should look at the state of the data ring. When a new
> connection is available in the queue of the passive socket, the backend
> generates a response and notifies the frontend.
> 
> Fields:
> 
> - **cmd** value: 7
> - additional fields: none
> 
> Return value:
> 
>   - 0 on success
>   - less than 0 on failure, see the error codes of the poll system call

Again.

> 
> 
> ### Data ring
> 
> Data rings are used for sending and receiving data over a connected socket.
> They are created upon a successful **accept** or **connect** command. The
> ring works in a similar way to the existing Xen console ring.
> #### Format
> 
>     #define XENSOCK_DATARING_ORDER 6
>     #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
>     #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
>     typedef uint32_t XENSOCK_RING_IDX;
>     
>     struct xensock_ring_intf {
>       char in[XENSOCK_DATARING_SIZE/4];
>       char out[XENSOCK_DATARING_SIZE/2];
>       XENSOCK_RING_IDX in_cons, in_prod;
>       XENSOCK_RING_IDX out_cons, out_prod;
>       int32_t in_error, out_error;
>     };

So you are wasting nearly 64kB of memory?

Wouldn't it make more sense to have 1 page with the admin data (in_*,
out_*) and the appropriate number of pages with the ring buffers? The
admin page could even be the one holding the grant_ref_t array of the
ring buffer pages needed for accept/connect.

> The design is flexible and can support different ring sizes (at compile
> time). The following description is based on order 6 rings, chosen because
> they provide excellent performance.
> 
> - **in** is an array of 65536 bytes, used as a circular buffer. It contains
>   data read from the socket. The producer is the backend, the consumer is
>   the frontend.
> - **out** is an array of 131072 bytes, used as a circular buffer. It
>   contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.
> - **in_cons** and **in_prod**
>   Consumer and producer pointers for data read from the socket. They keep
>   track of how much data has already been consumed by the frontend from the
>   **in** array. **in_prod** is increased by the backend, after writing data
>   to **in**. **in_cons** is increased by the frontend, after reading data
>   from **in**.
> - **out_cons**, **out_prod**
>   Consumer and producer pointers for the data to be written to the socket.
>   They keep track of how much data has been written by the frontend to
>   **out** and how much data has already been consumed by the backend.
>   **out_prod** is increased by the frontend, after writing data to **out**.
>   **out_cons** is increased by the backend, after reading data from **out**.
> - **in_error** and **out_error**
>   They signal errors when reading from the socket (**in_error**) or when
>   writing to the socket (**out_error**). 0 means no errors. When an error
>   occurs, no further read or write operations are performed on the socket.
>   In the case of an orderly socket shutdown (i.e. read returns 0),
>   **in_error** is set to -ENOTCONN. **in_error** and **out_error**

Which value? I've found systems with: 57, 76, 107, 134 or 235 (just to
make clear that even an errno name isn't optimal).

>   are never set to -EAGAIN or -EWOULDBLOCK.
> 
> The binary layout follows:
> 
>     0        65536           196608     196612    196616    196620   196624   196628   196632
>     +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
>     |    in    |      out       | in_cons | in_prod |out_cons |out_prod |in_error |out_error|
>     +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
> 
> #### Workflow
> 
> The **in** and **out** arrays are used as circular buffers:
>     
>     0                               sizeof(array)
>     +-----------------------------------+
>     |to consume|    free    |to consume |
>     +-----------------------------------+
>                ^            ^
>                prod         cons
> 
>     0                               sizeof(array)
>     +-----------------------------------+
>     |  free    | to consume |   free    |
>     +-----------------------------------+
>                ^            ^
>                cons         prod
> 
> The following function is provided to calculate how many bytes are
> currently left unconsumed in an array:
> 
>     #define _MASK_XENSOCK_IDX(idx, ring_size) ((idx) & (ring_size-1))
> 
>     static inline XENSOCK_RING_IDX xensock_ring_queued(XENSOCK_RING_IDX prod,
>               XENSOCK_RING_IDX cons,
>               XENSOCK_RING_IDX ring_size)
>     {
>       XENSOCK_RING_IDX size;
>     
>       if (prod == cons)
>               return 0;
>     
>       prod = _MASK_XENSOCK_IDX(prod, ring_size);
>       cons = _MASK_XENSOCK_IDX(cons, ring_size);
>     
>       if (prod == cons)
>               return ring_size;
>     
>       if (prod > cons)
>               size = prod - cons;
>       else {
>               size = ring_size - cons;
>               size += prod;
>       }
>       return size;
>     }
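> 
> The free space available to the producer follows directly from it; as a
> convenience sketch (not part of the protocol definition):
> 
>     static inline XENSOCK_RING_IDX xensock_ring_free(XENSOCK_RING_IDX prod,
>               XENSOCK_RING_IDX cons,
>               XENSOCK_RING_IDX ring_size)
>     {
>       /* Free bytes are whatever is not currently queued. */
>       return ring_size - xensock_ring_queued(prod, cons, ring_size);
>     }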
> 
> The producer (the backend for **in**, the frontend for **out**) writes to the
> array in the following way:
> 
> - read *cons*, *prod*, *error* from shared memory
> - memory barrier
> - return on *error*
> - write to array at position *prod* up to *cons*, wrapping around the circular
>   buffer when necessary
> - memory barrier
> - increase *prod*
> - notify the other end via evtchn
> 
> The consumer (the backend for **out**, the frontend for **in**) reads from the
> array in the following way:
> 
> - read *prod*, *cons*, *error* from shared memory
> - memory barrier
> - return on *error*
> - read from array at position *cons* up to *prod*, wrapping around the
>   circular buffer when necessary
> - memory barrier
> - increase *cons*
> - notify the other end via evtchn
> 
> The producer takes care of writing only as many bytes as are free in the
> buffer up to *cons*. The consumer takes care of reading only as many bytes
> as are available in the buffer up to *prod*. *error* is set by the backend
> when an error occurs writing to or reading from the socket.
> 


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

