
Re: [Xen-devel] [DRAFT v5] PV Calls protocol design document (former XenSock)



Ping?

On Thu, 4 Aug 2016, Stefano Stabellini wrote:
> Hi all,
> 
> This is the design document of the PV Calls protocol. You can find
> prototypes of the Linux frontend and backend drivers here:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git pvcalls-5
> 
> To use them, make sure to enable CONFIG_PVCALLS in your kernel config
> and add "pvcalls=1" to the command line of your DomU Linux kernel. You
> also need the toolstack to create the initial xenstore nodes for the
> protocol. To do that, please apply the attached patch to libxl (the
> patch is based on Xen 4.7.0-rc3) and add "pvcalls=1" to your DomU config
> file.
> 
> Note that previous versions of the protocol were named XenSock. It has
> been renamed for clarity of scope and to avoid confusion with hv_sock
> and vsock, which are used for inter-VM communication.
> 
> Cheers,
> 
> Stefano
> 
> Changes in v5:
> - clarify text
> - rename id to req_id
> - rename sockid to id
> - move id to request and response specific fields
> - add version node to xenstore
> 
> Changes in v4:
> - rename xensock to pvcalls
> 
> Changes in v3:
> - add a dummy element to struct xen_xensock_request to make sure the
>   size of the struct is the same on both x86_32 and x86_64
> 
> Changes in v2:
> - add max-dataring-page-order
> - add "Publish backend features and transport parameters" to backend
>   xenbus workflow
> - update new cmd values
> - update xen_xensock_request
> - add backlog parameter to listen and binary layout
> - add description of new data ring format (interface+data)
> - modify connect and accept to reflect new data ring format
> - add link to POSIX docs
> - add error numbers
> - add address format section and relevant numeric definitions
> - add explicit mention of unimplemented commands
> - add protocol node name
> - add xenbus shutdown diagram
> - add socket operation
> 
> ---
> 
> # PV Calls Protocol version 1
> 
> ## Rationale
> 
> PV Calls is a paravirtualized protocol that allows the implementation of
> a set of POSIX functions in a different domain. The PV Calls frontend
> sends POSIX function calls to the backend, which implements them and
> returns a value to the frontend.
> 
> This version of the document covers networking function calls, such as
> connect, accept, bind, release, listen, poll, recvmsg and sendmsg; but
> the protocol is meant to be easily extended to cover different sets of
> calls. Unimplemented commands return ENOTSUPP.
> 
> PV Calls provides the following benefits:
> * full visibility of the guest behavior on the backend domain, allowing
>   for inexpensive filtering and manipulation of any guest calls
> * excellent performance
> 
> Specifically, PV Calls for networking offer these advantages:
> * guest networking works out of the box with VPNs, wireless networks and
>   any other complex configurations on the host
> * guest services listen on ports bound directly to the backend domain IP
>   addresses
> * localhost becomes a secure namespace for inter-VMs communications
> 
> 
> ## Design
> 
> ### Xenstore
> 
> The frontend and the backend connect to each other exchanging information via
> xenstore. The toolstack creates front and back nodes with state
> XenbusStateInitialising. The protocol node name is **pvcalls**. There can only
> be one PV Calls frontend per domain.
> 
> #### Frontend XenBus Nodes
> 
> port
>      Values:         <uint32_t>
> 
>      The identifier of the Xen event channel used to signal activity
>      in the ring buffer.
> 
> ring-ref
>      Values:         <uint32_t>
> 
>      The Xen grant reference granting permission for the backend to map
>      the sole page in a single page sized ring buffer.
> 
> #### Backend XenBus Nodes
> 
> version
>      Values:         <uint32_t>
> 
>      Protocol version supported by the backend.
>     
> max-dataring-page-order
>      Values:         <uint32_t>
> 
>      The maximum supported size of the data ring, expressed as log2 of the
>      number of machine pages (e.g. 0 == 1 page, 1 == 2 pages, 2 == 4 pages).
> 
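For illustration, the mapping from a page order to a ring size can be sketched in C as below. The helper names and the 4096-byte page size are assumptions made for this sketch; they are not part of the protocol.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB machine pages, as on x86. */
#define SKETCH_PAGE_SIZE 4096u

/* Number of pages in a data ring of the given order: 2^order. */
static inline uint32_t sketch_dataring_pages(uint32_t order)
{
    assert(order < 20); /* keeps the byte count below within 32 bits */
    return UINT32_C(1) << order;
}

/* Total size in bytes of a data ring of the given order. */
static inline uint32_t sketch_dataring_bytes(uint32_t order)
{
    return sketch_dataring_pages(order) * SKETCH_PAGE_SIZE;
}
```

A frontend would presumably compare the order it wants against the value read from max-dataring-page-order before sizing its ring.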
> #### State Machine
> 
> Initialization:
> 
>     *Front*                               *Back*
>     XenbusStateInitialising               XenbusStateInitialising
>     - Query virtual device                - Query backend device
>       properties.                           identification data.
>     - Setup OS device instance.           - Publish backend features
>     - Allocate and initialize the           and transport parameters
>       request ring.                                      |
>     - Publish transport parameters                       |
>       that will be in effect during                      V
>       this connection.                            XenbusStateInitWait
>                  |
>                  |
>                  V
>        XenbusStateInitialised
> 
>                                           - Query frontend transport
>                                             parameters.
>                                           - Connect to the request ring and
>                                             event channel.
>                                                          |
>                                                          |
>                                                          V
>                                                  XenbusStateConnected
> 
>      - Query backend device properties.
>      - Finalize OS virtual device
>        instance.
>                  |
>                  |
>                  V
>         XenbusStateConnected
> 
> Once frontend and backend are connected, they have a shared page, which
> is used to exchange messages over a ring, and an event channel, which is
> used to send notifications.
> 
> Shutdown:
> 
>     *Front*                            *Back*
>     XenbusStateConnected               XenbusStateConnected
>                 |
>                 |
>                 V
>        XenbusStateClosing
> 
>                                        - Unmap grants
>                                        - Unbind evtchns
>                                                  |
>                                                  |
>                                                  V
>                                          XenbusStateClosing
> 
>     - Unbind evtchns
>     - Free rings
>     - Free data structures
>                |
>                |
>                V
>        XenbusStateClosed
> 
>                                        - Free remaining data structures
>                                                  |
>                                                  |
>                                                  V
>                                          XenbusStateClosed
> 
> 
> ### Commands Ring
> 
> The shared ring is used by the frontend to forward POSIX function calls to the
> backend. I'll refer to this ring as **commands ring** to distinguish it from
> other rings which can be created later in the lifecycle of the protocol (data
> rings). The ring format is defined using the familiar `DEFINE_RING_TYPES`
> macro (`xen/include/public/io/ring.h`). Frontend requests are allocated on
> the ring using the `RING_GET_REQUEST` macro.
> 
> The format is defined as follows:
>     
>     #define PVCALLS_SOCKET         0
>     #define PVCALLS_CONNECT        1
>     #define PVCALLS_RELEASE        2
>     #define PVCALLS_BIND           3
>     #define PVCALLS_LISTEN         4
>     #define PVCALLS_ACCEPT         5
>     #define PVCALLS_POLL           6
>     
>     struct xen_pvcalls_request {
>         uint32_t req_id; /* private to guest, echoed in response */
>         uint32_t cmd;    /* command to execute */
>         union {
>             struct xen_pvcalls_socket {
>                 uint64_t id;
>                 uint32_t domain;
>                 uint32_t type;
>                 uint32_t protocol;
>             } socket;
>             struct xen_pvcalls_connect {
>                 uint64_t id;
>                 uint8_t addr[28];
>                 uint32_t len;
>                 uint32_t flags;
>                 grant_ref_t ref;
>                 uint32_t evtchn;
>             } connect;
>             struct xen_pvcalls_release {
>                 uint64_t id;
>             } release;
>             struct xen_pvcalls_bind {
>                 uint64_t id;
>                 uint8_t addr[28];
>                 uint32_t len;
>             } bind;
>             struct xen_pvcalls_listen {
>                 uint64_t id;
>                 uint32_t backlog;
>             } listen;
>             struct xen_pvcalls_accept {
>                 uint64_t id;
>                 uint64_t id_new;
>                 grant_ref_t ref;
>                 uint32_t evtchn;
>             } accept;
>             struct xen_pvcalls_poll {
>                 uint64_t id;
>             } poll;
>             /* dummy member to force sizeof(struct xen_pvcalls_request)
>                to match across archs */
>             struct xen_pvcalls_dummy {
>                 uint8_t dummy[56];
>             } dummy;
>         } u;
>     };
> 
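To show why the dummy member matters, here is a trimmed sketch of the request that keeps only the largest real member (connect) and the dummy. The connect fields add up to 52 bytes. With 8-byte alignment for uint64_t (x86_64) the union would round up to 56, while with 4-byte alignment (x86_32) it would stay at 52; the 56-byte dummy pins it to 56 in both cases. The struct name is made up for this sketch, and grant_ref_t is assumed to be a 32-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t grant_ref_t; /* assumption: 32-bit grant reference */

/* Trimmed copy of the request: header, connect member and dummy only. */
struct sketch_pvcalls_request {
    uint32_t req_id;
    uint32_t cmd;
    union {
        struct {
            uint64_t id;
            uint8_t addr[28];
            uint32_t len;
            uint32_t flags;
            grant_ref_t ref;
            uint32_t evtchn;
        } connect;            /* 52 bytes of fields */
        struct {
            uint8_t dummy[56];
        } dummy;              /* forces the union to 56 bytes */
    } u;
};

/* 8 bytes of header plus a 56-byte union, whatever the alignment rules. */
static_assert(sizeof(struct sketch_pvcalls_request) == 64,
              "request must be 64 bytes on every architecture");
static_assert(offsetof(struct sketch_pvcalls_request, u) == 8,
              "command-specific fields start at byte 8");
```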
> The first two fields are common for every command. Their binary layout
> is:
> 
>     0       4       8
>     +-------+-------+
>     |req_id |  cmd  |
>     +-------+-------+
> 
> - **req_id** is generated by the frontend and identifies one specific request
> - **cmd** is the command requested by the frontend:
> 
>     - `PVCALLS_SOCKET`:  0
>     - `PVCALLS_CONNECT`: 1
>     - `PVCALLS_RELEASE`: 2
>     - `PVCALLS_BIND`:    3
>     - `PVCALLS_LISTEN`:  4
>     - `PVCALLS_ACCEPT`:  5
>     - `PVCALLS_POLL`:    6
> 
> Both fields are echoed back by the backend.
> 
> As for the other Xen ring based protocols, after writing a request to the
> ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an
> event channel notification when a notification is required.
> 
> Backend responses are allocated on the ring using the `RING_GET_RESPONSE`
> macro. The format is the following:
> 
>     struct xen_pvcalls_response {
>         uint32_t req_id;
>         uint32_t cmd;
>         int32_t ret;
>         uint32_t pad;
>         union {
>             struct _xen_pvcalls_socket {
>                 uint64_t id;
>             } socket;
>             struct _xen_pvcalls_connect {
>                 uint64_t id;
>             } connect;
>             struct _xen_pvcalls_release {
>                 uint64_t id;
>             } release;
>             struct _xen_pvcalls_bind {
>                 uint64_t id;
>             } bind;
>             struct _xen_pvcalls_listen {
>                 uint64_t id;
>             } listen;
>             struct _xen_pvcalls_accept {
>                 uint64_t id;
>             } accept;
>             struct _xen_pvcalls_poll {
>                 uint64_t id;
>             } poll;
>             struct _xen_pvcalls_dummy {
>                 uint8_t dummy[8];
>             } dummy;
>         } u;
>     };
>     
> The first four fields are common for every response. Their binary layout
> is:
> 
>     0       4       8       12      16
>     +-------+-------+-------+-------+
>     |req_id |  cmd  |  ret  |  pad  |
>     +-------+-------+-------+-------+
> 
> - **req_id**: echoed back from request
> - **cmd**: echoed back from request
> - **ret**: return value, identifies success (0) or failure (see error numbers
>   below). If the **cmd** is not supported by the backend, ret is ENOTSUPP.
> - **pad**: padding
> 
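As a rough sketch of how a backend might fill the common response fields, the C below echoes req_id and cmd and returns ENOTSUPP for commands it does not know. The struct and function names are invented for this example, and treating every cmd above PVCALLS_POLL as unsupported is an assumption about one possible backend.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PVCALLS_POLL            6       /* highest command in this version */
#define SKETCH_PVCALLS_ENOTSUPP (-524)  /* value from the error number table */

/* Common header of every response, following the layout above. */
struct sketch_pvcalls_response_hdr {
    uint32_t req_id;
    uint32_t cmd;
    int32_t ret;
    uint32_t pad;
};

static_assert(sizeof(struct sketch_pvcalls_response_hdr) == 16,
              "common response fields occupy bytes 0-16");

/*
 * Fill the common fields. Unknown commands get ENOTSUPP; for known
 * commands the caller passes in the result of the underlying call.
 */
static inline void sketch_fill_response(struct sketch_pvcalls_response_hdr *rsp,
                                        uint32_t req_id, uint32_t cmd,
                                        int32_t result)
{
    memset(rsp, 0, sizeof(*rsp));  /* also clears pad */
    rsp->req_id = req_id;          /* echoed back from the request */
    rsp->cmd = cmd;                /* echoed back from the request */
    rsp->ret = (cmd > PVCALLS_POLL) ? SKETCH_PVCALLS_ENOTSUPP : result;
}
```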
> After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks
> whether it needs to notify the frontend and does so via event channel.
> 
> A description of each command, with its additional request and response
> fields, follows.
> 
> 
> #### Socket
> 
> The **socket** operation corresponds to the POSIX [socket][socket] function.
> It creates a new socket of the specified family, type and protocol. **id** is
> freely chosen by the frontend and references this specific socket from this
> point forward. See "Socket families and address format" below.
> 
> Request fields:
> 
> - **cmd** value: 0
> - additional fields:
>   - **id**: generated by the frontend, it identifies the new socket
>   - **domain**: the communication domain
>   - **type**: the socket type
>   - **protocol**: the particular protocol to be used with the socket, usually 0
> 
> Request binary layout:
> 
>     8       12      16      20      24      28
>     +-------+-------+-------+-------+-------+
>     |       id      |domain | type  |protoco|
>     +-------+-------+-------+-------+-------+
> 
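The following sketch prepares a socket request for the only combination this version supports (AF_INET, SOCK_STREAM, protocol 0) and checks the field offsets against the layout above. The flattened struct and the helper name are illustrative only; a real frontend would fill struct xen_pvcalls_request on the commands ring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PVCALLS_SOCKET      0
#define SKETCH_AF_INET      2  /* AF_INET value used by this document */
#define SKETCH_SOCK_STREAM  1  /* SOCK_STREAM value used by this document */

/* Common header followed by the socket-specific fields only. */
struct sketch_socket_request {
    uint32_t req_id;
    uint32_t cmd;
    uint64_t id;
    uint32_t domain;
    uint32_t type;
    uint32_t protocol;
};

static_assert(offsetof(struct sketch_socket_request, id) == 8, "id at 8");
static_assert(offsetof(struct sketch_socket_request, domain) == 16, "domain at 16");
static_assert(offsetof(struct sketch_socket_request, type) == 20, "type at 20");
static_assert(offsetof(struct sketch_socket_request, protocol) == 24, "protocol at 24");

/* Fill a request for a TCP/IPv4 socket identified by the frontend-chosen id. */
static inline void sketch_prepare_socket(struct sketch_socket_request *req,
                                         uint32_t req_id, uint64_t id)
{
    memset(req, 0, sizeof(*req));
    req->req_id = req_id;
    req->cmd = PVCALLS_SOCKET;
    req->id = id;
    req->domain = SKETCH_AF_INET;
    req->type = SKETCH_SOCK_STREAM;
    req->protocol = 0;
}
```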
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16       20       24
>     +-------+--------+
>     |       id       |
>     +-------+--------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX socket function][socket] for error names; the
>     corresponding error numbers are specified later in this document.
> 
> #### Connect
> 
> The **connect** operation corresponds to the POSIX [connect][connect]
> function. It connects a previously created socket (identified by **id**) to
> the specified address.
> 
> The connect operation creates a new shared ring, which we'll call **data
> ring**. The data ring is used to send and receive data from the socket.
> The connect operation passes two additional parameters which are
> utilized to setup the new ring: **evtchn** and **ref**. **evtchn** is the
> port number of a new event channel which will be used for notifications
> of activity on the data ring. **ref** is the grant reference of a page
> which contains shared pointers to write and read data from the data ring
> and the full array of grant references for the ring buffers. It will be
> described in more detail later. The data ring is unmapped and freed upon
> issuing a **release** command on the active socket identified by **id**.
> 
> When the frontend issues a **connect** command, the backend:
> - finds its own internal socket corresponding to **id**
> - connects the socket to **addr**
> - maps the grant reference **ref**, the shared page contains the data
>   ring interface (`struct pvcalls_data_intf`)
> - maps all the grant references listed in `struct pvcalls_data_intf` and
>   uses them as shared memory for the ring buffers
> - binds the **evtchn**
> - replies to the frontend
> 
> The data ring format will be described in the following section.
> 
> Request fields:
> 
> - **cmd** value: 1
> - additional fields:
>   - **id**: identifies the socket
>   - **addr**: address to connect to, see the address format section for more
>     information
>   - **len**: address length
>   - **flags**: flags for the connection, reserved for future usage
>   - **ref**: grant reference of the page containing `struct
>     pvcalls_data_intf`
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
> Request binary layout:
> 
>     8       12      16      20      24      28      32      36      40      44
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |       id      |                            addr                       |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     | len   | flags |  ref  |evtchn |
>     +-------+-------+-------+-------+
> 
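The second row of the layout above has no offsets. Assuming the fields are packed as in the struct definition, they continue at bytes 44, 48, 52 and 56. The sketch below checks that and shows one way a frontend might fill the request. The flattened struct, the helper name and the 32-bit grant_ref_t are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PVCALLS_CONNECT 1
typedef uint32_t grant_ref_t; /* assumption: 32-bit grant reference */

/* Common header followed by the connect-specific fields only. */
struct sketch_connect_request {
    uint32_t req_id;
    uint32_t cmd;
    uint64_t id;
    uint8_t addr[28];
    uint32_t len;
    uint32_t flags;
    grant_ref_t ref;
    uint32_t evtchn;
};

static_assert(offsetof(struct sketch_connect_request, addr) == 16, "addr at 16");
static_assert(offsetof(struct sketch_connect_request, len) == 44, "len at 44");
static_assert(offsetof(struct sketch_connect_request, flags) == 48, "flags at 48");
static_assert(offsetof(struct sketch_connect_request, ref) == 52, "ref at 52");
static_assert(offsetof(struct sketch_connect_request, evtchn) == 56, "evtchn at 56");

/* Returns 0 on success, -1 if the address does not fit in addr[28]. */
static inline int sketch_prepare_connect(struct sketch_connect_request *req,
                                         uint32_t req_id, uint64_t id,
                                         const void *addr, uint32_t len,
                                         grant_ref_t ref, uint32_t evtchn)
{
    if (len > sizeof(req->addr))
        return -1;
    memset(req, 0, sizeof(*req));
    req->req_id = req_id;
    req->cmd = PVCALLS_CONNECT;
    req->id = id;
    memcpy(req->addr, addr, len);
    req->len = len;
    req->flags = 0;       /* reserved for future use */
    req->ref = ref;       /* page holding struct pvcalls_data_intf */
    req->evtchn = evtchn; /* event channel for the new data ring */
    return 0;
}
```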
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX connect function][connect] for error names; the
>     corresponding error numbers are specified later in this document.
> 
> #### Release
> 
> The **release** operation closes an existing active or passive socket.
> 
> When a release command is issued on a passive socket, the backend releases it
> and frees its internal mappings. When a release command is issued for an
> active socket, the data ring is also unmapped and freed:
> 
> - frontend sends release command for an active socket
> - backend releases the socket
> - backend unmaps the data ring buffers
> - backend unmaps the data ring interface
> - backend unbinds the evtchn
> - backend replies to frontend
> - frontend frees ring and unbinds evtchn
> 
> Request fields:
> 
> - **cmd** value: 2
> - additional fields:
>   - **id**: identifies the socket
> 
> Request binary layout:
> 
>     8       12      16
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX shutdown function][shutdown] for error names; the
>     corresponding error numbers are specified later in this document.
> 
> #### Bind
> 
> The **bind** operation corresponds to the POSIX [bind][bind] function. It
> assigns the address passed as parameter to a previously created socket,
> identified by **id**. **Bind**, **listen** and **accept** are the three
> operations required to have fully working passive sockets and should be
> issued in this order.
> 
> Request fields:
> 
> - **cmd** value: 3
> - additional fields:
>   - **id**: identifies the socket
>   - **addr**: address to bind to, see the address format section for more
>     information
>   - **len**: address length
> 
> Request binary layout:
> 
>     8       12      16      20      24      28      32      36      40      44
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |       id      |                            addr                       |
>     +-------+-------+-------+-------+-------+-------+-------+-------+-------+
>     |  len  |
>     +-------+
> 
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX bind function][bind] for error names; the corresponding
>     error numbers are specified later in this document.
> 
> 
> #### Listen
> 
> The **listen** operation marks the socket as a passive socket. It
> corresponds to the [POSIX listen function][listen].
> 
> Request fields:
> 
> - **cmd** value: 4
> - additional fields:
>   - **id**: identifies the socket
>   - **backlog**: the maximum length to which the queue of pending
>     connections may grow
> 
> Request binary layout:
> 
>     8       12      16      20
>     +-------+-------+-------+
>     |       id      |backlog|
>     +-------+-------+-------+
> 
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Return value:
>   - 0 on success
>   - See the [POSIX listen function][listen] for error names; the corresponding
>     error numbers are specified later in this document.
> 
> 
> #### Accept
> 
> The **accept** operation extracts the first connection request on the
> queue of pending connections for the listening socket identified by
> **id** and creates a new connected socket. The id of the new socket is
> also chosen by the frontend and passed as an additional field of the
> accept request struct (**id_new**). See the [POSIX accept function][accept]
> as reference.
> 
> Similarly to the **connect** operation, **accept** creates a new data ring.
> Information necessary to set up the new ring, such as the grant table
> reference of the page containing the data ring interface (`struct
> pvcalls_data_intf`) and the event channel port, is passed from the frontend to
> the backend as part of the request.
> 
> The backend will reply to the request only when a new connection is
> successfully accepted, i.e. the backend does not return EAGAIN or EWOULDBLOCK.
> 
> Example workflow:
> 
> - frontend issues an **accept** request
> - backend waits for a connection to be available on the socket
> - a new connection becomes available
> - backend accepts the new connection
> - backend creates an internal mapping from **id_new** to the new socket
> - backend maps the grant reference **ref**, the shared page contains the
>   data ring interface (`struct pvcalls_data_intf`)
> - backend maps all the grant references listed in `struct
>   pvcalls_data_intf` and uses them as shared memory for the new data
>   ring
> - backend binds the **evtchn**
> - backend replies to the frontend
> 
> Request fields:
> 
> - **cmd** value: 5
> - additional fields:
>   - **id**: id of listening socket
>   - **id_new**: id of the new socket
>   - **ref**: grant reference of the data ring interface (`struct
>     pvcalls_data_intf`)
>   - **evtchn**: port number of the evtchn to signal activity on the data ring
> 
> Request binary layout:
> 
>     8       12      16      20      24      28      32
>     +-------+-------+-------+-------+-------+-------+
>     |       id      |    id_new     |  ref  |evtchn |
>     +-------+-------+-------+-------+-------+-------+
> 
> Response additional fields:
> 
> - **id**: id of the listening socket, echoed back from request
> 
> Response binary layout:
> 
>     16      20      24
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX accept function][accept] for error names; the corresponding
>     error numbers are specified later in this document.
> 
> 
> #### Poll
> 
> In this version of the protocol, the **poll** operation is only valid
> for passive sockets. For active sockets, the frontend should look at the
> state of the data ring. When a new connection is available in the queue
> of the passive socket, the backend generates a response and notifies the
> frontend.
> 
> Request fields:
> 
> - **cmd** value: 6
> - additional fields:
>   - **id**: identifies the listening socket
> 
> Request binary layout:
> 
>     8       12      16
>     +-------+-------+
>     |       id      |
>     +-------+-------+
> 
> 
> Response additional fields:
> 
> - **id**: echoed back from request
> 
> Response binary layout:
> 
>     16       20       24
>     +--------+--------+
>     |        id       |
>     +--------+--------+
> 
> Return value:
> 
>   - 0 on success
>   - See the [POSIX poll function][poll] for error names; the corresponding
>     error numbers are specified later in this document.
> 
> #### Error numbers
> 
> The numbers corresponding to the error names specified by POSIX are:
> 
>     [EPERM]         -1
>     [ENOENT]        -2
>     [ESRCH]         -3
>     [EINTR]         -4
>     [EIO]           -5
>     [ENXIO]         -6
>     [E2BIG]         -7
>     [ENOEXEC]       -8
>     [EBADF]         -9
>     [ECHILD]        -10
>     [EAGAIN]        -11
>     [EWOULDBLOCK]   -11
>     [ENOMEM]        -12
>     [EACCES]        -13
>     [EFAULT]        -14
>     [EBUSY]         -16
>     [EEXIST]        -17
>     [EXDEV]         -18
>     [ENODEV]        -19
>     [EISDIR]        -21
>     [EINVAL]        -22
>     [ENFILE]        -23
>     [EMFILE]        -24
>     [ENOSPC]        -28
>     [EROFS]         -30
>     [EMLINK]        -31
>     [EDOM]          -33
>     [ERANGE]        -34
>     [EDEADLK]       -35
>     [EDEADLOCK]     -35
>     [ENAMETOOLONG]  -36
>     [ENOLCK]        -37
>     [ENOSYS]        -38
>     [ENOTEMPTY]     -39
>     [ENODATA]       -61
>     [ETIME]         -62
>     [EBADMSG]       -74
>     [EOVERFLOW]     -75
>     [EILSEQ]        -84
>     [ERESTART]      -85
>     [ENOTSOCK]      -88
>     [EOPNOTSUPP]    -95
>     [EAFNOSUPPORT]  -97
>     [EADDRINUSE]    -98
>     [EADDRNOTAVAIL] -99
>     [ENOBUFS]       -105
>     [EISCONN]       -106
>     [ENOTCONN]      -107
>     [ETIMEDOUT]     -110
>     [ENOTSUPP]      -524
> 
> #### Socket families and address format
> 
> The following definitions and explicit sizes, together with POSIX
> [sys/socket.h][address] and [netinet/in.h][in] define socket families and
> address format. Please be aware that only the **domain** `AF_INET`, **type**
> `SOCK_STREAM` and **protocol** `0` are supported by this version of the spec.
> 
>     #define AF_UNSPEC   0
>     #define AF_UNIX     1   /* Unix domain sockets      */
>     #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
>     #define AF_INET     2   /* Internet IP Protocol     */
>     #define AF_INET6    10  /* IP version 6         */
> 
>     #define SOCK_STREAM 1
>     #define SOCK_DGRAM  2
>     #define SOCK_RAW    3
> 
>     /* generic address format */
>     struct sockaddr {
>         uint16_t sa_family;
>         char sa_data[26];
>     };
> 
>     struct in_addr {
>         uint32_t s_addr;
>     };
> 
>     /* AF_INET address format */
>     struct sockaddr_in {
>         uint16_t         sin_family;
>         uint16_t         sin_port;
>         struct in_addr   sin_addr;
>         char             sin_zero[20];
>     };
> 
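As a quick check that the AF_INET address format fills the 28-byte addr field exactly, here is a small sketch. The sketch_ prefixed names are invented for this example. Storing the port and address in network byte order follows the POSIX convention and is an assumption here, since the text above does not spell it out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_AF_INET 2

struct sketch_in_addr {
    uint32_t s_addr;
};

/* Same field sizes as the AF_INET address format above. */
struct sketch_sockaddr_in {
    uint16_t              family;       /* address family, AF_INET */
    uint16_t              sin_port;
    struct sketch_in_addr sin_addr;
    char                  sin_zero[20];
};

static_assert(sizeof(struct sketch_sockaddr_in) == 28,
              "address must fill addr[28] exactly");

/* Build an AF_INET address from four address bytes and a host-order port. */
static inline void sketch_fill_sockaddr_in(struct sketch_sockaddr_in *sa,
                                           const uint8_t ip[4], uint16_t port)
{
    uint8_t port_be[2] = { (uint8_t)(port >> 8), (uint8_t)(port & 0xff) };

    memset(sa, 0, sizeof(*sa));
    sa->family = SKETCH_AF_INET;
    memcpy(&sa->sin_port, port_be, sizeof(port_be)); /* big-endian port */
    memcpy(&sa->sin_addr.s_addr, ip, 4);             /* bytes already in network order */
}
```

The resulting struct can be copied byte for byte into the addr field of a connect or bind request, with len set to 28.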
> 
> ### Data ring
> 
> Data rings are used for sending and receiving data over a connected socket.
> They are created upon a successful **accept** or **connect** command.
> 
> A data ring is composed of two pieces: the interface and the **in** and
> **out** buffers. The interface, represented by `struct pvcalls_data_intf`, is
> shared first and resides on the page whose grant reference is passed by
> **accept** and **connect** as a parameter. `struct pvcalls_data_intf` contains
> the list of grant references which constitute the **in** and **out** data
> buffers.
> 
> #### Data ring interface
> 
>     struct pvcalls_data_intf {
>       PVCALLS_RING_IDX in_cons, in_prod;
>       PVCALLS_RING_IDX out_cons, out_prod;
>       int32_t in_error, out_error;
>     
>       uint32_t ring_order;
>       grant_ref_t ref[];
>     };
> 
>     /* not actually C compliant (ring_order changes from socket to socket) */
>     struct pvcalls_data {
>         char in[((1<<ring_order)<<PAGE_SHIFT)/2];
>         char out[((1<<ring_order)<<PAGE_SHIFT)/2];
>     };
> 
> - **ring_order**
>   It represents the order of the data ring. The following list of grant
>   references is of `(1 << ring_order)` elements. It cannot be greater than
>   **max-dataring-page-order**, as specified by the backend on XenBus.
> - **ref[]**
>   The list of grant references which will contain the actual data. They are
>   mapped contiguously in virtual memory. The first half of the pages is the
>   **in** array, the second half is the **out** array.
> - **in** is an array used as circular buffer
>   It contains data read from the socket. The producer is the backend, the
>   consumer is the frontend.
> - **out** is an array used as circular buffer
>   It contains data to be written to the socket. The producer is the frontend,
>   the consumer is the backend.
> - **in_cons** and **in_prod**
>   Consumer and producer pointers for data read from the socket. They keep
>   track of how much data has already been consumed by the frontend from the
>   **in** array. **in_prod** is increased by the backend, after writing data to
>   **in**. **in_cons** is increased by the frontend, after reading data from
>   **in**.
> - **out_cons**, **out_prod**
>   Consumer and producer pointers for the data to be written to the socket.
>   They keep track of how much data has been written by the frontend to
>   **out** and how much data has already been consumed by the backend.
>   **out_prod** is increased by the frontend, after writing data to **out**.
>   **out_cons** is increased by the backend, after reading data from **out**.
> - **in_error** and **out_error**
>   They signal errors when reading from the socket (**in_error**) or when
>   writing to the socket (**out_error**). 0 means no errors. When an error
>   occurs, no further read or write operations are performed on the socket. In
>   the case of an orderly socket shutdown (i.e. read returns 0) **in_error** is
>   set to ENOTCONN. **in_error** and **out_error** are never set to EAGAIN or
>   EWOULDBLOCK.
> 
> The binary layout of `struct pvcalls_data_intf` follows:
> 
>     0         4         8         12        16        20        24        28
>     +---------+---------+---------+---------+---------+---------+----------+
>     | in_cons | in_prod |out_cons |out_prod |in_error |out_error|ring_order|
>     +---------+---------+---------+---------+---------+---------+----------+
> 
>     28        32        36        40        4092     4096
>     +---------+---------+---------+----//---+---------+
>     |  ref[0] |  ref[1] |  ref[2] |         |  ref[N] |
>     +---------+---------+---------+----//---+---------+
> 
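Since the interface must fit in a single page, the layout above bounds how many grant references it can carry. A sketch of that arithmetic follows, assuming 4096-byte pages, a 32-bit PVCALLS_RING_IDX and a 32-bit grant_ref_t (the sizes the layout implies). The SKETCH_ and sketch_ names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX; /* assumption: 4 bytes, as the layout implies */
typedef uint32_t grant_ref_t;      /* assumption: 32-bit grant reference */

#define SKETCH_PAGE_SIZE 4096u     /* assumption: 4 KiB pages */

struct pvcalls_data_intf {
    PVCALLS_RING_IDX in_cons, in_prod;
    PVCALLS_RING_IDX out_cons, out_prod;
    int32_t in_error, out_error;

    uint32_t ring_order;
    grant_ref_t ref[];
};

static_assert(offsetof(struct pvcalls_data_intf, ring_order) == 24,
              "ring_order at byte 24");
static_assert(offsetof(struct pvcalls_data_intf, ref) == 28,
              "ref[] starts at byte 28");

/* Grant references that fit in the remainder of the interface page. */
#define SKETCH_MAX_REFS \
    ((SKETCH_PAGE_SIZE - offsetof(struct pvcalls_data_intf, ref)) / sizeof(grant_ref_t))

/* Largest ring_order whose 2^order references still fit in one page. */
static inline uint32_t sketch_max_ring_order(void)
{
    uint32_t order = 0;

    while ((UINT32_C(2) << order) <= SKETCH_MAX_REFS)
        order++;
    return order;
}
```

Under these assumptions the page holds 1017 references, so the interface itself limits ring_order to 9 (512 pages), independently of max-dataring-page-order.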
> The binary layout of the ring buffers follows:
> 
>     0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
>     +------------//-------------+------------//-------------+
>     |            in             |           out             |
>     +------------//-------------+------------//-------------+
> 
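A minimal sketch of how either end might locate the two arrays once the pages are mapped contiguously. The struct, the function name and the PAGE_SHIFT value of 12 are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Pointers to the two halves of a mapped data ring. */
struct sketch_data_ring {
    char *in;          /* first half: data read from the socket */
    char *out;         /* second half: data to be written to the socket */
    size_t array_size; /* size of each half in bytes */
};

/* Split a contiguous mapping of (1 << ring_order) pages into in and out. */
static inline struct sketch_data_ring sketch_split_ring(void *base,
                                                        uint32_t ring_order)
{
    struct sketch_data_ring r;
    size_t total;

    assert(base != NULL);
    total = ((size_t)1 << ring_order) << SKETCH_PAGE_SHIFT;
    r.array_size = total / 2;
    r.in = (char *)base;
    r.out = (char *)base + r.array_size;
    return r;
}
```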
> #### Workflow
> 
> The **in** and **out** arrays are used as circular buffers:
>     
>     0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
>     +-----------------------------------+
>     |to consume|    free    |to consume |
>     +-----------------------------------+
>                ^            ^
>                prod         cons
> 
>     0                               sizeof(array)
>     +-----------------------------------+
>     |  free    | to consume |   free    |
>     +-----------------------------------+
>                ^            ^
>                cons         prod
> 
> The following function is provided to calculate how many bytes are currently
> left unconsumed in an array:
> 
>     #define _MASK_PVCALLS_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))
> 
>     static inline PVCALLS_RING_IDX pvcalls_ring_queued(PVCALLS_RING_IDX prod,
>               PVCALLS_RING_IDX cons,
>               PVCALLS_RING_IDX ring_size)
>     {
>       PVCALLS_RING_IDX size;
>     
>       if (prod == cons)
>               return 0;
>     
>       prod = _MASK_PVCALLS_IDX(prod, ring_size);
>       cons = _MASK_PVCALLS_IDX(cons, ring_size);
>     
>       if (prod == cons)
>               return ring_size;
>     
>       if (prod > cons)
>               size = prod - cons;
>       else {
>               size = ring_size - cons;
>               size += prod;
>       }
>       return size;
>     }
> 
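A worked example may help here. The sketch below restates the same calculation under different names and adds a helper for the free space. It assumes the indexes are free-running 32-bit counters and that ring_size is a power of two; both are assumptions drawn from how the masking is used, and the sketch_ names are invented.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t PVCALLS_RING_IDX; /* assumption: 32-bit index */

#define SKETCH_MASK_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))

/* Bytes currently queued, same logic as the function above. */
static inline PVCALLS_RING_IDX sketch_ring_queued(PVCALLS_RING_IDX prod,
                                                  PVCALLS_RING_IDX cons,
                                                  PVCALLS_RING_IDX ring_size)
{
    /* masking only works for power-of-two sizes */
    assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);

    if (prod == cons)
        return 0;                       /* empty */

    prod = SKETCH_MASK_IDX(prod, ring_size);
    cons = SKETCH_MASK_IDX(cons, ring_size);

    if (prod == cons)
        return ring_size;               /* full */
    if (prod > cons)
        return prod - cons;
    return ring_size - cons + prod;     /* queued data wraps around */
}

/* Bytes the producer may still write. */
static inline PVCALLS_RING_IDX sketch_ring_free(PVCALLS_RING_IDX prod,
                                                PVCALLS_RING_IDX cons,
                                                PVCALLS_RING_IDX ring_size)
{
    return ring_size - sketch_ring_queued(prod, cons, ring_size);
}
```

With a 4096-byte array, cons = 4000 and prod = 4196, the masked values are 4000 and 100, so 4096 - 4000 + 100 = 196 bytes are queued and 3900 are free.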
> The producer (the backend for **in**, the frontend for **out**) writes to the
> array in the following way:
> 
> - read *cons*, *prod*, *error* from shared memory
> - memory barrier
> - return on *error*
> - write to array at position *prod* up to *cons*, wrapping around the circular
>   buffer when necessary
> - memory barrier
> - increase *prod*
> - notify the other end via evtchn
> 
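The producer steps above can be sketched roughly as follows. This is a single-address-space illustration, not driver code: the struct is a local stand-in for the shared indexes, atomic_thread_fence stands in for the real Xen memory barriers, the event channel notification is left as a comment, and the indexes are assumed to be free-running with a power-of-two array size.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t PVCALLS_RING_IDX; /* assumption: 32-bit index */

/* Local stand-in for one direction of a data ring (hypothetical). */
struct sketch_prod_ring {
    PVCALLS_RING_IDX cons, prod;
    int32_t error;
    char *buf;
    PVCALLS_RING_IDX size;         /* array size, power of two */
};

/* Write up to len bytes; returns the bytes written or the pending error. */
static inline int32_t sketch_ring_write(struct sketch_prod_ring *r,
                                        const char *data, PVCALLS_RING_IDX len)
{
    PVCALLS_RING_IDX cons = r->cons, prod = r->prod, space, off, first;
    int32_t error = r->error;

    assert((r->size & (r->size - 1)) == 0);
    atomic_thread_fence(memory_order_seq_cst); /* barrier after reading indexes */
    if (error)
        return error;

    space = r->size - (prod - cons);  /* free bytes, up to cons */
    if (len > space)
        len = space;

    off = prod & (r->size - 1);
    first = r->size - off;            /* bytes before the wrap point */
    if (first > len)
        first = len;
    memcpy(r->buf + off, data, first);
    memcpy(r->buf, data + first, len - first); /* wrapped remainder */

    atomic_thread_fence(memory_order_seq_cst); /* data visible before prod */
    r->prod = prod + len;
    /* a real producer would now notify the other end via the evtchn */
    return (int32_t)len;
}
```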
> The consumer (the backend for **out**, the frontend for **in**) reads from the
> array in the following way:
> 
> - read *prod*, *cons*, *error* from shared memory
> - memory barrier
> - return on *error*
> - read from array at position *cons* up to *prod*, wrapping around the
>   circular buffer when necessary
> - memory barrier
> - increase *cons*
> - notify the other end via evtchn
> 
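The consumer side can be sketched in the same spirit. As before, this is only an illustration under assumptions: a local struct instead of shared memory, atomic_thread_fence in place of the Xen barriers, free-running indexes, a power-of-two array size, and the notification left as a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t PVCALLS_RING_IDX; /* assumption: 32-bit index */

/* Local stand-in for one direction of a data ring (hypothetical). */
struct sketch_cons_ring {
    PVCALLS_RING_IDX cons, prod;
    int32_t error;
    const char *buf;
    PVCALLS_RING_IDX size;         /* array size, power of two */
};

/* Read up to len bytes; returns the bytes read or the pending error. */
static inline int32_t sketch_ring_read(struct sketch_cons_ring *r,
                                       char *dst, PVCALLS_RING_IDX len)
{
    PVCALLS_RING_IDX prod = r->prod, cons = r->cons, queued, off, first;
    int32_t error = r->error;

    assert((r->size & (r->size - 1)) == 0);
    atomic_thread_fence(memory_order_seq_cst); /* barrier after reading indexes */
    if (error)
        return error;

    queued = prod - cons;             /* unconsumed bytes, up to prod */
    if (len > queued)
        len = queued;

    off = cons & (r->size - 1);
    first = r->size - off;            /* bytes before the wrap point */
    if (first > len)
        first = len;
    memcpy(dst, r->buf + off, first);
    memcpy(dst + first, r->buf, len - first);  /* wrapped remainder */

    atomic_thread_fence(memory_order_seq_cst); /* finish reads before cons */
    r->cons = cons + len;
    /* a real consumer would now notify the other end via the evtchn */
    return (int32_t)len;
}
```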
> The producer takes care of writing only as many bytes as available in the
> buffer up to *cons*. The consumer takes care of reading only as many bytes as
> available in the buffer up to *prod*. *error* is set by the backend when an
> error occurs writing or reading from the socket.
> 
> 
> [address]: http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html
> [in]: http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
> [socket]: http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html
> [connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
> [shutdown]: http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html
> [bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html
> [listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html
> [accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html
> [poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html
> 
