[Xen-devel] [DRAFT 1] XenSock protocol design document
Hi all,

as promised, this is the design document for the XenSock protocol I mentioned here:

http://marc.info/?l=xen-devel&m=146520572428581

It is still in its early days but should give you a good idea of what it looks like and how it is supposed to work. Let me know if you find gaps in the document and I'll fill them in the next version.

You can find prototypes of the Linux frontend and backend drivers here:

git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git xensock-1

To use them, make sure to enable CONFIG_XENSOCK in your kernel config and add "xensock=1" to the command line of your DomU Linux kernel. You also need the toolstack to create the initial xenstore nodes for the protocol. To do that, please apply the attached patch to libxl (the patch is based on Xen 4.7.0-rc3) and add "xensock=1" to your DomU config file.

Feel free to try them out! Please be kind, they are only prototypes with a few known issues :-) But they should work well enough to run simple tests. If you find something missing, let me know or, even better, write a patch!

I'll follow up with a separate document to cover the design of my particular implementation of the protocol.

Cheers,

Stefano

---

# XenSocks Protocol v1

## Rationale

XenSocks is a paravirtualized protocol for the POSIX socket API.

The purpose of XenSocks is to allow the implementation of a specific set of POSIX calls to be done in a domain other than your own. It allows connect, accept, bind, release, listen, poll, recvmsg and sendmsg to be implemented in another domain.

XenSocks provides the following benefits:

* guest networking works out of the box with VPNs, wireless networks and any other complex configurations on the host
* guest services listen on ports bound directly to the backend domain IP addresses
* localhost becomes a secure namespace for intra-VM communications
* full visibility of the guest behavior on the backend domain, allowing for inexpensive filtering and manipulation of any guest calls
* excellent performance

## Design

### Xenstore

The frontend and the backend connect to each other exchanging information via xenstore. The toolstack creates front and back nodes with state XenbusStateInitialising. There can only be one XenSock frontend per domain.

#### Frontend XenBus Nodes

    port
         Values:         <uint32_t>

         The identifier of the Xen event channel used to signal activity
         in the ring buffer.

    ring-ref
         Values:         <uint32_t>

         The Xen grant reference granting permission for the backend to map
         the sole page in a single page sized ring buffer.

#### State Machine

    **Front**                             **Back**
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.                          |
    - Allocate and initialize the                        |
      request ring.                                      V
    - Publish transport parameters                XenbusStateInitWait
      that will be in effect during
      this connection.
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

    - Query backend device properties.
    - Finalize OS virtual device
      instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they have a shared page, which is used to exchange messages over a ring, and an event channel, which is used to send notifications.

### Commands Ring

The shared ring is used by the frontend to forward socket API calls to the backend.
I'll refer to this ring as **commands ring** to distinguish it from other rings which will be created later in the lifecycle of the protocol (data rings). The ring format is defined using the familiar `DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`). Frontend requests are allocated on the ring using the `RING_GET_REQUEST` macro.

The format is defined as follows:

    #define XENSOCK_DATARING_ORDER 6
    #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
    #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)

    #define XENSOCK_CONNECT        0
    #define XENSOCK_RELEASE        3
    #define XENSOCK_BIND           4
    #define XENSOCK_LISTEN         5
    #define XENSOCK_ACCEPT         6
    #define XENSOCK_POLL           7

    struct xen_xensock_request {
        uint32_t id;     /* private to guest, echoed in response */
        uint32_t cmd;    /* command to execute */
        uint64_t sockid; /* id of the socket */
        union {
            struct xen_xensock_connect {
                uint8_t addr[28];
                uint32_t len;
                uint32_t flags;
                grant_ref_t ref[XENSOCK_DATARING_PAGES];
                uint32_t evtchn;
            } connect;
            struct xen_xensock_bind {
                uint8_t addr[28]; /* ipv6 ready */
                uint32_t len;
            } bind;
            struct xen_xensock_accept {
                uint64_t sockid;
                grant_ref_t ref[XENSOCK_DATARING_PAGES];
                uint32_t evtchn;
            } accept;
        } u;
    };

The first three fields are common for every command. Their binary layout is:

    0       4       8       12      16
    +-------+-------+-------+-------+
    |  id   |  cmd  |    sockid     |
    +-------+-------+-------+-------+

- **id** is generated by the frontend and identifies one specific request
- **cmd** is the command requested by the frontend:
    - `XENSOCK_CONNECT`: 0
    - `XENSOCK_RELEASE`: 3
    - `XENSOCK_BIND`:    4
    - `XENSOCK_LISTEN`:  5
    - `XENSOCK_ACCEPT`:  6
    - `XENSOCK_POLL`:    7
- **sockid** is generated by the frontend and identifies the socket to connect, bind, etc. A new sockid is required on `XENSOCK_CONNECT` and `XENSOCK_BIND` commands. A new sockid is also required on `XENSOCK_ACCEPT`, for the new socket.

All three fields are echoed back by the backend.

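As a sanity check on the request format above, the following minimal sketch restates the struct and verifies the byte offsets at compile time. It assumes `grant_ref_t` is a 32-bit integer (as in Xen's public grant table header). The helper `xensock_fill_bind` is hypothetical and not part of the protocol or the prototype drivers. It only illustrates how a frontend might populate a bind request before placing it on the commands ring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XENSOCK_DATARING_ORDER 6
#define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
#define XENSOCK_BIND           4

/* Assumption: grant references are 32-bit values. */
typedef uint32_t grant_ref_t;

struct xen_xensock_request {
    uint32_t id;     /* private to guest, echoed in response */
    uint32_t cmd;    /* command to execute */
    uint64_t sockid; /* id of the socket */
    union {
        struct xen_xensock_connect {
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
        } connect;
        struct xen_xensock_bind {
            uint8_t addr[28];
            uint32_t len;
        } bind;
        struct xen_xensock_accept {
            uint64_t sockid;
            grant_ref_t ref[XENSOCK_DATARING_PAGES];
            uint32_t evtchn;
        } accept;
    } u;
};

/* Common header: id at 0, cmd at 4, sockid at 8, payload from 16. */
_Static_assert(offsetof(struct xen_xensock_request, id) == 0, "id");
_Static_assert(offsetof(struct xen_xensock_request, cmd) == 4, "cmd");
_Static_assert(offsetof(struct xen_xensock_request, sockid) == 8, "sockid");
_Static_assert(offsetof(struct xen_xensock_request, u) == 16, "payload");

/* Command-specific fields: addr is 28 bytes, so len sits at 44. */
_Static_assert(offsetof(struct xen_xensock_request, u.connect.len) == 44, "len");
_Static_assert(offsetof(struct xen_xensock_request, u.connect.ref) == 52, "ref");
_Static_assert(offsetof(struct xen_xensock_request, u.accept.ref) == 24, "ref");

/*
 * Hypothetical helper: fill a bind request from a caller-supplied
 * sockaddr blob. Returns 0 on success, -1 if the address does not fit.
 */
static int xensock_fill_bind(struct xen_xensock_request *req, uint32_t id,
                             uint64_t sockid, const void *addr, uint32_t len)
{
    if (len > sizeof(req->u.bind.addr))
        return -1;
    memset(req, 0, sizeof(*req));
    req->id = id;
    req->cmd = XENSOCK_BIND;
    req->sockid = sockid;
    memcpy(req->u.bind.addr, addr, len);
    req->u.bind.len = len;
    return 0;
}
```

In a real frontend the request would be obtained with `RING_GET_REQUEST` rather than from a local variable; the sketch leaves the ring handling out on purpose.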
As in other Xen ring-based protocols, after writing a request to the ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an event channel notification when a notification is required.

Backend responses are allocated on the ring using the `RING_GET_RESPONSE` macro. The format is the following:

    struct xen_xensock_response {
        uint32_t id;
        uint32_t cmd;
        uint64_t sockid;
        int32_t ret;
    };

    0       4       8       12      16      20
    +-------+-------+-------+-------+-------+
    |  id   |  cmd  |    sockid     |  ret  |
    +-------+-------+-------+-------+-------+

- **id**: echoed back from request
- **cmd**: echoed back from request
- **sockid**: echoed back from request
- **ret**: return value, identifies success or failure

After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks whether it needs to notify the frontend and does so via event channel.

A description of each command, its additional request fields and the expected responses follows.

#### Connect

The **connect** operation corresponds to the connect system call. It connects a socket to the specified address. **sockid** is freely chosen by the frontend and references this specific socket from this point forward.

The connect operation creates a new shared ring, which we'll call **data ring**. The new ring is used to send and receive data over the connected socket. The information necessary to set up the new ring, such as grant table references and event channel ports, is passed from the frontend to the backend as part of this request. A **data ring** is unmapped and freed upon issuing a **release** command on the active socket identified by **sockid**.

When the frontend issues a **connect** command, the backend:

- creates a new socket and connects it to **addr**
- creates an internal mapping from **sockid** to its own socket
- maps all the grant references and uses them as shared memory for the new data ring
- binds the **evtchn**
- replies to the frontend

The data ring format will be described in the following section.

Fields:

- **cmd** value: 0
- additional fields:
    - **addr**: address to connect to, in struct sockaddr format
    - **len**: address length
    - **flags**: flags for the connection, reserved for future usage
    - **ref**: grant references of the data ring
    - **evtchn**: port number of the evtchn to signal activity on the data ring

Binary layout:

    16      20      24      28      32      36      40      44      48
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |                         addr                          |  len  |
    +-------+-------+-------+-------+-------+-------+-------+-------+
    | flags |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] |ref[6] |
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|ref[14]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|ref[22]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|ref[30]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|ref[38]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|ref[46]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|ref[54]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|ref[62]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[63]|evtchn |
    +-------+-------+

Return value:

- 0 on success
- less than 0 on failure, see the error codes of the socket system call

#### Release

The **release** operation closes an existing active or passive socket.

When a release command is issued on a passive socket, the backend releases it and frees its internal mappings.
When a release command is issued for an active socket, the data ring is also unmapped and freed:

- frontend sends release command for an active socket
- backend releases the socket
- backend unmaps the ring
- backend unbinds the evtchn
- backend replies to frontend
- frontend frees ring and unbinds evtchn

Fields:

- **cmd** value: 3
- additional fields: none

Return value:

- 0 on success
- less than 0 on failure, see the error codes of the shutdown system call

#### Bind

The **bind** operation assigns the address passed as parameter to the socket. It corresponds to the bind system call. **sockid** is freely chosen by the frontend and references this specific socket from this point forward. **Bind**, **listen** and **accept** are the three operations required to have fully working passive sockets and should be issued in this order.

Fields:

- **cmd** value: 4
- additional fields:
    - **addr**: address to bind to, in struct sockaddr format
    - **len**: address length

Binary layout:

    16      20      24      28      32      36      40      44      48
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |                         addr                          |  len  |
    +-------+-------+-------+-------+-------+-------+-------+-------+

Return value:

- 0 on success
- less than 0 on failure, see the error codes of the bind system call

#### Listen

The **listen** operation marks the socket as a passive socket. It corresponds to the listen system call.

Fields:

- **cmd** value: 5
- additional fields: none

Return value:

- 0 on success
- less than 0 on failure, see the error codes of the listen system call

#### Accept

The **accept** operation extracts the first connection request on the queue of pending connections for the listening socket identified by **sockid** and creates a new connected socket. The **sockid** of the new socket is also chosen by the frontend and passed as an additional field of the accept request struct.

Similarly to the **connect** operation, **accept** creates a new data ring.
The information necessary to set up the new ring, such as grant table references and event channel ports, is passed from the frontend to the backend as part of the request.

The backend will reply to the request only when a new connection is successfully accepted, i.e. the backend does not return EAGAIN or EWOULDBLOCK.

Example workflow:

- frontend issues an **accept** request
- backend waits for a connection to be available on the socket
- a new connection becomes available
- backend accepts the new connection
- backend creates an internal mapping from **sockid** to the new socket
- backend maps all the grant references and uses them as shared memory for the new data ring
- backend binds the **evtchn**
- backend replies to the frontend

Fields:

- **cmd** value: 6
- additional fields:
    - **sockid**: id of the new socket
    - **ref**: grant references of the data ring
    - **evtchn**: port number of the evtchn to signal activity on the data ring

Binary layout:

    16      20      24      28      32      36      40      44      48
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |    sockid     |ref[0] |ref[1] |ref[2] |ref[3] |ref[4] |ref[5] |
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[6] |ref[7] |ref[8] |ref[9] |ref[10]|ref[11]|ref[12]|ref[13]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[14]|ref[15]|ref[16]|ref[17]|ref[18]|ref[19]|ref[20]|ref[21]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[22]|ref[23]|ref[24]|ref[25]|ref[26]|ref[27]|ref[28]|ref[29]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[30]|ref[31]|ref[32]|ref[33]|ref[34]|ref[35]|ref[36]|ref[37]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[38]|ref[39]|ref[40]|ref[41]|ref[42]|ref[43]|ref[44]|ref[45]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[46]|ref[47]|ref[48]|ref[49]|ref[50]|ref[51]|ref[52]|ref[53]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[54]|ref[55]|ref[56]|ref[57]|ref[58]|ref[59]|ref[60]|ref[61]|
    +-------+-------+-------+-------+-------+-------+-------+-------+
    |ref[62]|ref[63]|evtchn |
    +-------+-------+-------+

Return value:

- 0 on success
- less than 0 on failure, see the error codes of the accept system call

#### Poll

The **poll** operation is only valid for passive sockets. For active sockets, the frontend should look at the state of the data ring. When a new connection is available in the queue of the passive socket, the backend generates a response and notifies the frontend.

Fields:

- **cmd** value: 7
- additional fields: none

Return value:

- 0 on success
- less than 0 on failure, see the error codes of the poll system call

### Data ring

Data rings are used for sending and receiving data over a connected socket. They are created upon a successful **accept** or **connect** command. The ring works in a similar way to the existing Xen console ring.

#### Format

    #define XENSOCK_DATARING_ORDER 6
    #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
    #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
    typedef uint32_t XENSOCK_RING_IDX;

    struct xensock_ring_intf {
        char in[XENSOCK_DATARING_SIZE/4];
        char out[XENSOCK_DATARING_SIZE/2];
        XENSOCK_RING_IDX in_cons, in_prod;
        XENSOCK_RING_IDX out_cons, out_prod;
        int32_t in_error, out_error;
    };

The design is flexible and can support different ring sizes (at compile time). The following description is based on order 6 rings, chosen because they provide excellent performance.

- **in** is an array of 65536 bytes, used as circular buffer
  It contains data read from the socket. The producer is the backend, the consumer is the frontend.
- **out** is an array of 131072 bytes, used as circular buffer
  It contains data to be written to the socket. The producer is the frontend, the consumer is the backend.
- **in_cons** and **in_prod**
  Consumer and producer pointers for data read from the socket.
  They keep track of how much data has already been consumed by the frontend from the **in** array. **in_prod** is increased by the backend, after writing data to **in**. **in_cons** is increased by the frontend, after reading data from **in**.
- **out_cons**, **out_prod**
  Consumer and producer pointers for the data to be written to the socket. They keep track of how much data has been written by the frontend to **out** and how much data has already been consumed by the backend. **out_prod** is increased by the frontend, after writing data to **out**. **out_cons** is increased by the backend, after reading data from **out**.
- **in_error** and **out_error**
  They signal errors when reading from the socket (**in_error**) or when writing to the socket (**out_error**). 0 means no errors. When an error occurs, no further read or write operations are performed on the socket. In the case of an orderly socket shutdown (i.e. read returns 0) **in_error** is set to -ENOTCONN. **in_error** and **out_error** are never set to -EAGAIN or -EWOULDBLOCK.

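The sizes and offsets of these fields can be checked mechanically. The sketch below restates the struct and asserts each offset at compile time. It assumes 4 KiB pages (`PAGE_SHIFT` of 12), which is what the 65536 and 131072 byte counts imply for an order 6 ring. The two `xensock_*_size` helpers are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the byte counts in the text. */
#define PAGE_SHIFT 12

#define XENSOCK_DATARING_ORDER 6
#define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
#define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
typedef uint32_t XENSOCK_RING_IDX;

struct xensock_ring_intf {
    char in[XENSOCK_DATARING_SIZE/4];
    char out[XENSOCK_DATARING_SIZE/2];
    XENSOCK_RING_IDX in_cons, in_prod;
    XENSOCK_RING_IDX out_cons, out_prod;
    int32_t in_error, out_error;
};

/* The two circular buffers come first, followed by six 32-bit fields. */
_Static_assert(offsetof(struct xensock_ring_intf, in) == 0, "in");
_Static_assert(offsetof(struct xensock_ring_intf, out) == 65536, "out");
_Static_assert(offsetof(struct xensock_ring_intf, in_cons) == 196608, "in_cons");
_Static_assert(offsetof(struct xensock_ring_intf, in_prod) == 196612, "in_prod");
_Static_assert(offsetof(struct xensock_ring_intf, out_cons) == 196616, "out_cons");
_Static_assert(offsetof(struct xensock_ring_intf, out_prod) == 196620, "out_prod");
_Static_assert(offsetof(struct xensock_ring_intf, in_error) == 196624, "in_error");
_Static_assert(offsetof(struct xensock_ring_intf, out_error) == 196628, "out_error");

/* The whole interface must fit in the 64 granted pages. */
_Static_assert(sizeof(struct xensock_ring_intf) <= XENSOCK_DATARING_SIZE,
               "interface fits in the data ring mapping");

/* Hypothetical helpers: capacity of each direction in bytes. */
static size_t xensock_in_size(void)
{
    return sizeof(((struct xensock_ring_intf *)0)->in);
}

static size_t xensock_out_size(void)
{
    return sizeof(((struct xensock_ring_intf *)0)->out);
}
```

Under these assumptions the struct occupies 196632 of the 262144 mapped bytes, and both buffer sizes are powers of two.
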
The binary layout follows:

    0          65536            196608    196612    196616    196620    196624    196628    196632
    +----//----+-------//-------+---------+---------+---------+---------+---------+---------+
    |    in    |      out       | in_cons | in_prod |out_cons |out_prod |in_error |out_error|
    +----//----+-------//-------+---------+---------+---------+---------+---------+---------+

#### Workflow

The **in** and **out** arrays are used as circular buffers:

    0                               sizeof(array)
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |   free   | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following function is provided to calculate how many bytes are currently left unconsumed in an array:

    #define _MASK_XENSOCK_IDX(idx, ring_size) ((idx) & (ring_size-1))

    static inline XENSOCK_RING_IDX xensock_ring_queued(XENSOCK_RING_IDX prod,
            XENSOCK_RING_IDX cons,
            XENSOCK_RING_IDX ring_size)
    {
        XENSOCK_RING_IDX size;

        if (prod == cons)
            return 0;

        prod = _MASK_XENSOCK_IDX(prod, ring_size);
        cons = _MASK_XENSOCK_IDX(cons, ring_size);

        if (prod == cons)
            return ring_size;

        if (prod > cons)
            size = prod - cons;
        else {
            size = ring_size - cons;
            size += prod;
        }
        return size;
    }

The producer (the backend for **in**, the frontend for **out**) writes to the array in the following way:

- read *cons*, *prod*, *error* from shared memory
- memory barrier
- return on *error*
- write to array at position *prod* up to *cons*, wrapping around the circular buffer when necessary
- memory barrier
- increase *prod*
- notify the other end via evtchn

The consumer (the backend for **out**, the frontend for **in**) reads from the array in the following way:

- read *prod*, *cons*, *error* from shared memory
- memory barrier
- return on *error*
- read from array at position *cons* up to *prod*, wrapping around the circular buffer when necessary
- memory barrier
- increase *cons*
- notify the other end via evtchn

The producer
takes care of writing only as many bytes as available in the buffer up to *cons*. The consumer takes care of reading only as many bytes as available in the buffer up to *prod*. *error* is set by the backend when an error occurs writing or reading from the socket.

Attachment:
xensock-libxl