
Re: [Xen-devel] [DRAFT v3] XenSock protocol design document



Thursday, July 28, 2016, 8:11:53 PM, you wrote:

> ping

Hi Stefano,

JFYI:
Since this doesn't seem to be checked with the upstream kernel yet,
I don't know if you are aware of the opinions expressed upstream 
about the proposed Hyper-V socket patches:
http://lkml.iu.edu/hypermail/linux/kernel/1607.3/01748.html

(and if that should either influence your design or design process)

--
Sander

> On Wed, 20 Jul 2016, Stefano Stabellini wrote:
>> Hi all,
>> 
>> This is the design document of the XenSock protocol. You can find
>> prototypes of the Linux frontend and backend drivers here:
>> 
>> git://git.kernel.org/pub/scm/linux/kernel/git/sstabellini/xen.git xensock-3
>> 
>> To use them, make sure to enable CONFIG_XENSOCK in your kernel config
>> and add "xensock=1" to the command line of your DomU Linux kernel. You
>> also need the toolstack to create the initial xenstore nodes for the
>> protocol. To do that, please apply the attached patch to libxl (the
>> patch is based on Xen 4.7.0-rc3) and add "xensock=1" to your DomU config
>> file.
>> 
>> Cheers,
>> 
>> Stefano
>> 
>> 
>> Changes in v3:
>> - add a dummy element to struct xen_xensock_request to make sure the
>>   size of the struct is the same on both x86_32 and x86_64
>> 
>> Changes in v2:
>> - add max-dataring-page-order
>> - add "Publish backend features and transport parameters" to backend
>>   xenbus workflow
>> - update new cmd values
>> - update xen_xensock_request
>> - add backlog parameter to listen and binary layout
>> - add description of new data ring format (interface+data)
>> - modify connect and accept to reflect new data ring format
>> - add link to POSIX docs
>> - add error numbers
>> - add address format section and relevant numeric definitions
>> - add explicit mention of unimplemented commands
>> - add protocol node name
>> - add xenbus shutdown diagram
>> - add socket operation
>> 
>> ---
>> 
>> 
>> # XenSocks Protocol v1
>> 
>> ## Rationale
>> 
>> XenSocks is a paravirtualized protocol for the POSIX socket API.
>> 
>> The purpose of XenSocks is to allow the implementation of a specific set
>> of POSIX functions to be done in a domain other than your own. It allows
>> connect, accept, bind, release, listen, poll, recvmsg and sendmsg to be
>> implemented in another domain.
>> 
>> XenSocks provides the following benefits:
>> * guest networking works out of the box with VPNs, wireless networks and
>>   any other complex configurations on the host
>> * guest services listen on ports bound directly to the backend domain IP
>>   addresses
>> * localhost becomes a secure namespace for inter-VMs communications
>> * full visibility of the guest behavior on the backend domain, allowing
>>   for inexpensive filtering and manipulation of any guest calls
>> * excellent performance
>> 
>> 
>> ## Design
>> 
>> ### Xenstore
>> 
>> The frontend and the backend connect to each other exchanging information via
>> xenstore. The toolstack creates front and back nodes with state
>> XenbusStateInitialising. The protocol node name is **xensock**. There can
>> only be one XenSock frontend per domain.
>> 
>> #### Frontend XenBus Nodes
>> 
>> port
>>      Values:         <uint32_t>
>> 
>>      The identifier of the Xen event channel used to signal activity
>>      in the ring buffer.
>> 
>> ring-ref
>>      Values:         <uint32_t>
>> 
>>      The Xen grant reference granting permission for the backend to map
>>      the sole page in a single page sized ring buffer.
>> 
>> #### Backend XenBus Nodes
>> 
>> max-dataring-page-order
>>     Values:         <uint32_t>
>> 
>>     The maximum supported size of the data ring, expressed as log2 of the
>>     number of machine pages (e.g. 0 == 1 page, 1 == 2 pages, 2 == 4 pages).
>> 
>> #### State Machine
>> 
>> Initialization:
>> 
>>     *Front*                               *Back*
>>     XenbusStateInitialising               XenbusStateInitialising
>>     - Query virtual device                - Query backend device
>>       properties.                           identification data.
>>     - Setup OS device instance.           - Publish backend features
>>     - Allocate and initialize the           and transport parameters
>>       request ring.                                      |
>>     - Publish transport parameters                       |
>>       that will be in effect during                      V
>>       this connection.                            XenbusStateInitWait
>>                  |
>>                  |
>>                  V
>>        XenbusStateInitialised
>> 
>>                                           - Query frontend transport
>>                                             parameters.
>>                                           - Connect to the request ring and
>>                                             event channel.
>>                                                          |
>>                                                          |
>>                                                          V
>>                                                  XenbusStateConnected
>> 
>>      - Query backend device properties.
>>      - Finalize OS virtual device
>>        instance.
>>                  |
>>                  |
>>                  V
>>         XenbusStateConnected
>> 
>> Once frontend and backend are connected, they have a shared page, which
>> is used to exchange messages over a ring, and an event channel,
>> which is used to send notifications.
>> 
>> Shutdown:
>> 
>>     *Front*                            *Back*
>>     XenbusStateConnected               XenbusStateConnected
>>                 |
>>                 |
>>                 V
>>        XenbusStateClosing
>> 
>>                                        - Unmap grants
>>                                        - Unbind evtchns
>>                                                  |
>>                                                  |
>>                                                  V
>>                                          XenbusStateClosing
>> 
>>     - Unbind evtchns
>>     - Free rings
>>     - Free data structures
>>                |
>>                |
>>                V
>>        XenbusStateClosed
>> 
>>                                        - Free remaining data structures
>>                                                  |
>>                                                  |
>>                                                  V
>>                                          XenbusStateClosed
>> 
>> 
>> ### Commands Ring
>> 
>> The shared ring is used by the frontend to forward socket API calls to the
>> backend. I'll refer to this ring as **commands ring** to distinguish it from
>> other rings which will be created later in the lifecycle of the protocol
>> (data rings). The ring format is defined using the familiar
>> `DEFINE_RING_TYPES` macro (`xen/include/public/io/ring.h`). Frontend requests
>> are allocated on the ring using the `RING_GET_REQUEST` macro.
>> 
>> The format is defined as follows:
>>     
>>     #define XENSOCK_SOCKET         0
>>     #define XENSOCK_CONNECT        1
>>     #define XENSOCK_RELEASE        2
>>     #define XENSOCK_BIND           3
>>     #define XENSOCK_LISTEN         4
>>     #define XENSOCK_ACCEPT         5
>>     #define XENSOCK_POLL           6
>>     
>>     struct xen_xensock_request {
>>       uint32_t id; /* private to guest, echoed in response */
>>       uint32_t cmd; /* command to execute */
>>       uint64_t sockid;
>>       union {
>>               struct xen_xensock_socket {
>>                       uint32_t domain;
>>                       uint32_t type;
>>                       uint32_t protocol;
>>               } socket;
>>               struct xen_xensock_connect {
>>                       uint8_t addr[28];
>>                       uint32_t len;
>>                       uint32_t flags;
>>                       grant_ref_t ref;
>>                       uint32_t evtchn;
>>               } connect;
>>               struct xen_xensock_bind {
>>                       uint8_t addr[28];
>>                       uint32_t len;
>>               } bind;
>>               struct xen_xensock_listen {
>>                       uint32_t backlog;
>>               } listen;
>>               struct xen_xensock_accept {
>>                       uint64_t sockid;
>>                       grant_ref_t ref;
>>                       uint32_t evtchn;
>>               } accept;
>>               /* dummy member to force sizeof(struct xen_xensock_request)
>>                  to match across archs */
>>               struct xen_xensock_dummy {
>>                       uint8_t dummy[48];
>>               } dummy;
>>       } u;
>>     };
>> 
>> The first three fields are common for every command. Their binary layout
>> is:
>> 
>>     0       4       8       12      16
>>     +-------+-------+-------+-------+
>>     |  id   |  cmd  |     sockid    |
>>     +-------+-------+-------+-------+
>> 
>> - **id** is generated by the frontend and identifies one specific request
>> - **cmd** is the command requested by the frontend:
>>     - `XENSOCK_SOCKET`:  0
>>     - `XENSOCK_CONNECT`: 1
>>     - `XENSOCK_RELEASE`: 2
>>     - `XENSOCK_BIND`:    3
>>     - `XENSOCK_LISTEN`:  4
>>     - `XENSOCK_ACCEPT`:  5
>>     - `XENSOCK_POLL`:    6
>> - **sockid** is generated by the frontend and identifies the socket to
>>   connect, bind, etc. A new sockid is required on the `XENSOCK_SOCKET`
>>   command. A new sockid is also required on `XENSOCK_ACCEPT`, for the new
>>   socket.
>> 
>> All three fields are echoed back by the backend.
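
The layout above can be checked mechanically. The following is a minimal sketch, not taken from the prototype drivers: it restates the request structure and asserts the offsets shown in the binary layout diagrams. It assumes `grant_ref_t` is a 32-bit integer, as in Xen's public grant table header; the helper `xensock_request_size` is a made-up name used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: grant_ref_t is a 32-bit integer, as in Xen's public headers. */
typedef uint32_t grant_ref_t;

struct xen_xensock_request {
	uint32_t id;     /* private to guest, echoed in response */
	uint32_t cmd;    /* command to execute */
	uint64_t sockid;
	union {
		struct { uint32_t domain, type, protocol; } socket;
		struct {
			uint8_t addr[28];
			uint32_t len;
			uint32_t flags;
			grant_ref_t ref;
			uint32_t evtchn;
		} connect;
		struct { uint8_t addr[28]; uint32_t len; } bind;
		struct { uint32_t backlog; } listen;
		struct { uint64_t sockid; grant_ref_t ref; uint32_t evtchn; } accept;
		/* pads the union to a fixed 48 bytes */
		struct { uint8_t dummy[48]; } dummy;
	} u;
};

/* Common header: id at 0, cmd at 4, sockid at 8, payload from 16. */
_Static_assert(offsetof(struct xen_xensock_request, id) == 0, "id");
_Static_assert(offsetof(struct xen_xensock_request, cmd) == 4, "cmd");
_Static_assert(offsetof(struct xen_xensock_request, sockid) == 8, "sockid");
_Static_assert(offsetof(struct xen_xensock_request, u) == 16, "payload");

/* 16 bytes of header plus the 48-byte union. */
_Static_assert(sizeof(struct xen_xensock_request) == 64, "request size");

/* Hypothetical helper so the size can also be inspected at run time. */
size_t xensock_request_size(void)
{
	return sizeof(struct xen_xensock_request);
}
```

Without the 48-byte dummy member the largest union member would be `connect` (44 bytes), and the size of the union would then depend on how the architecture aligns the `uint64_t` inside `accept`; that is presumably the mismatch the v3 change addresses.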
>> 
>> As with the other Xen ring-based protocols, after writing a request to the
>> ring, the frontend calls `RING_PUSH_REQUESTS_AND_CHECK_NOTIFY` and issues an
>> event channel notification when a notification is required.
>> 
>> Backend responses are allocated on the ring using the `RING_GET_RESPONSE`
>> macro. The format is the following:
>> 
>>     struct xen_xensock_response {
>>         uint32_t id;
>>         uint32_t cmd;
>>         uint64_t sockid;
>>         int32_t ret;
>>     };
>>     
>>     0       4       8       12      16      20
>>     +-------+-------+-------+-------+-------+
>>     |  id   |  cmd  |     sockid    |  ret  |
>>     +-------+-------+-------+-------+-------+
>> 
>> - **id**: echoed back from request
>> - **cmd**: echoed back from request
>> - **sockid**: echoed back from request
>> - **ret**: return value, identifies success (0) or failure (see error numbers
>>   below). If the **cmd** is not supported by the backend, ret is ENOTSUPP.
>> 
>> After calling `RING_PUSH_RESPONSES_AND_CHECK_NOTIFY`, the backend checks
>> whether it needs to notify the frontend and does so via event channel.
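
To make the echo rule concrete, here is a small sketch of how a backend might fill a response. It is an illustration only: `struct xensock_req_hdr`, `xensock_cmd_supported`, `xensock_fill_response` and `XENSOCK_ENOTSUPP` are hypothetical names, and the ring macros and event channel notification are left out.

```c
#include <stdint.h>

#define XENSOCK_SOCKET   0
#define XENSOCK_POLL     6
/* Hypothetical name; the value comes from the error number table. */
#define XENSOCK_ENOTSUPP 524

/* Hypothetical: only the three fields common to every request. */
struct xensock_req_hdr {
	uint32_t id;
	uint32_t cmd;
	uint64_t sockid;
};

struct xen_xensock_response {
	uint32_t id;
	uint32_t cmd;
	uint64_t sockid;
	int32_t ret;
};

/* Commands 0 (socket) through 6 (poll) are defined by this version. */
static int xensock_cmd_supported(uint32_t cmd)
{
	return cmd <= XENSOCK_POLL;
}

/*
 * Echo id, cmd and sockid back to the frontend. ret is the result of the
 * operation (0 or a negative error number), replaced by -ENOTSUPP when the
 * command is unknown to the backend.
 */
void xensock_fill_response(const struct xensock_req_hdr *req, int32_t ret,
			   struct xen_xensock_response *rsp)
{
	rsp->id = req->id;
	rsp->cmd = req->cmd;
	rsp->sockid = req->sockid;
	rsp->ret = xensock_cmd_supported(req->cmd) ? ret : -XENSOCK_ENOTSUPP;
}
```

The frontend can then match a response to its pending request by **id** alone, since the value is private to the guest and returned unchanged.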
>> 
>> A description of each command, its additional request fields and the
>> expected responses follows.
>> 
>> 
>> #### Socket
>> 
>> The **socket** operation corresponds to the POSIX [socket][socket] function.
>> It creates a new socket of the specified family, type and protocol.
>> **sockid** is freely chosen by the frontend and references this specific
>> socket from this point forward. See "Socket families and address format"
>> below.
>> 
>> Fields:
>> 
>> - **cmd** value: 0
>> - additional fields:
>>   - **domain**: the communication domain
>>   - **type**: the socket type
>>   - **protocol**: the particular protocol to be used with the socket,
>>     usually 0
>> 
>> Binary layout:
>> 
>>         16       20      24       28
>>         +--------+--------+--------+
>>         | domain |  type  |protocol|
>>         +--------+--------+--------+
>> 
>> Return value:
>> 
>>   - 0 on success
>>   - See the [POSIX socket function][socket] for error names; the
>>     corresponding error numbers are specified later in this document.
>> 
>> #### Connect
>> 
>> The **connect** operation corresponds to the POSIX [connect][connect]
>> function. It connects a previously created socket (identified by
>> **sockid**) to the specified address.
>> 
>> The connect operation creates a new shared ring, which we'll call **data
>> ring**. The data ring is used to send and receive data from the socket.
>> The connect operation passes two additional parameters which are
>> used to set up the new ring: **evtchn** and **ref**. **evtchn** is the
>> port number of a new event channel which will be used for notifications
>> of activity on the data ring. **ref** is the grant reference of a page
>> which contains shared pointers to write and read data from the data ring
>> and the full array of grant references for the ring buffers. It will be
>> described in more detail later. The data ring is unmapped and freed upon
>> issuing a **release** command on the active socket identified by **sockid**.
>> 
>> When the frontend issues a **connect** command, the backend:
>> - finds its own internal socket corresponding to **sockid**
>> - connects the socket to **addr**
>> - maps the grant reference **ref**; the shared page contains the data
>>   ring interface (`struct xensock_data_intf`)
>> - maps all the grant references listed in `struct xensock_data_intf` and
>>   uses them as shared memory for the ring buffers
>> - binds the **evtchn**
>> - replies to the frontend
>> 
>> The data ring format will be described in the following section.
>> 
>> Fields:
>> 
>> - **cmd** value: 1
>> - additional fields:
>>   - **addr**: address to connect to, see the address format section for more
>>     information
>>   - **len**: address length
>>   - **flags**: flags for the connection, reserved for future usage
>>   - **ref**: grant reference of the page containing `struct
>>     xensock_data_intf`
>>   - **evtchn**: port number of the evtchn to signal activity on the data ring
>> 
>> 
>> Binary layout:
>> 
>>         16      20      24      28      32      36      40      44     48
>>         +-------+-------+-------+-------+-------+-------+-------+-------+
>>         |                            addr                       |  len  |
>>         +-------+-------+-------+-------+-------+-------+-------+-------+
>>         | flags |  ref  |evtchn |
>>         +-------+-------+-------+
>> 
>> Return value:
>> 
>>   - 0 on success
>>   - See the [POSIX connect function][connect] for error names; the
>>     corresponding error numbers are specified later in this document.
>> 
>> #### Release
>> 
>> The **release** operation closes an existing active or passive socket.
>> 
>> When a release command is issued on a passive socket, the backend releases it
>> and frees its internal mappings. When a release command is issued for an
>> active socket, the data ring is also unmapped and freed:
>> 
>> - frontend sends release command for an active socket
>> - backend releases the socket
>> - backend unmaps the data ring buffers
>> - backend unmaps the data ring interface
>> - backend unbinds the evtchn
>> - backend replies to frontend
>> - frontend frees ring and unbinds evtchn
>> 
>> Fields:
>> 
>> - **cmd** value: 2
>> - additional fields: none
>> 
>> Return value:
>> 
>>   - 0 on success
>>   - See the [POSIX shutdown function][shutdown] for error names; the
>>     corresponding error numbers are specified later in this document.
>> 
>> #### Bind
>> 
>> The **bind** operation corresponds to the POSIX [bind][bind] function. It
>> assigns the address passed as parameter to a previously created socket,
>> identified by **sockid**. **Bind**, **listen** and **accept** are the three
>> operations required to have fully working passive sockets and should be
>> issued in this order.
>> 
>> Fields:
>> 
>> - **cmd** value: 3
>> - additional fields:
>>   - **addr**: address to bind to, see the address format section for more
>>     information
>>   - **len**: address length
>> 
>> Binary layout:
>> 
>>         16      20      24      28      32      36      40      44     48
>>         +-------+-------+-------+-------+-------+-------+-------+-------+
>>         |                            addr                       |  len  |
>>         +-------+-------+-------+-------+-------+-------+-------+-------+
>> 
>> Return value:
>> 
>>   - 0 on success
>>   - See the [POSIX bind function][bind] for error names; the corresponding
>>     error numbers are specified later in this document.
>> 
>> 
>> #### Listen
>> 
>> The **listen** operation marks the socket as a passive socket. It
>> corresponds to the [POSIX listen function][listen].
>> 
>> Fields:
>> 
>> - **cmd** value: 4
>> - additional fields:
>>   - **backlog**: the maximum length to which the queue of pending
>>     connections may grow
>> 
>> Binary layout:
>> 
>>         16      20
>>         +-------+
>>         |backlog|
>>         +-------+
>> 
>> Return value:
>>   - 0 on success
>>   - See the [POSIX listen function][listen] for error names; the
>>     corresponding error numbers are specified later in this document.
>> 
>> 
>> #### Accept
>> 
>> The **accept** operation extracts the first connection request on the queue
>> of pending connections for the listening socket identified by **sockid** and
>> creates a new connected socket. The **sockid** of the new socket is also
>> chosen by the frontend and passed as an additional field of the accept
>> request struct. See the [POSIX accept function][accept] as reference.
>> 
>> Similarly to the **connect** operation, **accept** creates a new data ring.
>> Information necessary to set up the new ring, such as the grant table
>> reference of the page containing the data ring interface (`struct
>> xensock_data_intf`) and the event channel port, is passed from the frontend
>> to the backend as part of the request.
>> 
>> The backend will reply to the request only when a new connection is
>> successfully accepted, i.e. the backend does not return EAGAIN or
>> EWOULDBLOCK.
>> 
>> Example workflow:
>> 
>> - frontend issues an **accept** request
>> - backend waits for a connection to be available on the socket
>> - a new connection becomes available
>> - backend accepts the new connection
>> - backend creates an internal mapping from **sockid** to the new socket
>> - backend maps the grant reference **ref**; the shared page contains the
>>   data ring interface (`struct xensock_data_intf`)
>> - backend maps all the grant references listed in `struct
>>   xensock_data_intf` and uses them as shared memory for the new data
>>   ring
>> - backend binds the **evtchn**
>> - backend replies to the frontend
>> 
>> Fields:
>> 
>> - **cmd** value: 5
>> - additional fields:
>>   - **sockid**: id of the new socket
>>   - **ref**: grant reference of the data ring interface (`struct
>>     xensock_data_intf`)
>>   - **evtchn**: port number of the evtchn to signal activity on the data ring
>> 
>> Binary layout:
>> 
>>         16      20      24      28      32
>>         +-------+-------+-------+-------+
>>         |    sockid     |  ref  |evtchn |
>>         +-------+-------+-------+-------+
>> 
>> Return value:
>> 
>>   - 0 on success
>>   - See the [POSIX accept function][accept] for error names; the
>>     corresponding error numbers are specified later in this document.
>> 
>> 
>> #### Poll
>> 
>> The **poll** operation is only valid for passive sockets. For active sockets,
>> the frontend should look at the state of the data ring. When a new
>> connection is available in the queue of the passive socket, the backend
>> generates a response and notifies the frontend.
>> 
>> Fields:
>> 
>> - **cmd** value: 6
>> - additional fields: none
>> 
>> Return value:
>> 
>>   - 0 on success
>>   - See the [POSIX poll function][poll] for error names; the corresponding
>>     error numbers are specified later in this document.
>> 
>> #### Error numbers
>> 
>> The numbers corresponding to the error names specified by POSIX are:
>> 
>>     [EPERM]         -1
>>     [ENOENT]        -2
>>     [ESRCH]         -3
>>     [EINTR]         -4
>>     [EIO]           -5
>>     [ENXIO]         -6
>>     [E2BIG]         -7
>>     [ENOEXEC]       -8
>>     [EBADF]         -9
>>     [ECHILD]        -10
>>     [EAGAIN]        -11
>>     [EWOULDBLOCK]   -11
>>     [ENOMEM]        -12
>>     [EACCES]        -13
>>     [EFAULT]        -14
>>     [EBUSY]         -16
>>     [EEXIST]        -17
>>     [EXDEV]         -18
>>     [ENODEV]        -19
>>     [EISDIR]        -21
>>     [EINVAL]        -22
>>     [ENFILE]        -23
>>     [EMFILE]        -24
>>     [ENOSPC]        -28
>>     [EROFS]         -30
>>     [EMLINK]        -31
>>     [EDOM]          -33
>>     [ERANGE]        -34
>>     [EDEADLK]       -35
>>     [EDEADLOCK]     -35
>>     [ENAMETOOLONG]  -36
>>     [ENOLCK]        -37
>>     [ENOSYS]        -38
>>     [ENOTEMPTY]     -39
>>     [ENODATA]       -61
>>     [ETIME]         -62
>>     [EBADMSG]       -74
>>     [EOVERFLOW]     -75
>>     [EILSEQ]        -84
>>     [ERESTART]      -85
>>     [ENOTSOCK]      -88
>>     [EOPNOTSUPP]    -95
>>     [EAFNOSUPPORT]  -97
>>     [EADDRINUSE]    -98
>>     [EADDRNOTAVAIL] -99
>>     [ENOBUFS]       -105
>>     [EISCONN]       -106
>>     [ENOTCONN]      -107
>>     [ETIMEDOUT]     -110
>>     [ENOTSUPP]      -524
>> 
>> #### Socket families and address format
>> 
>> The following definitions and explicit sizes, together with POSIX
>> [sys/socket.h][address] and [netinet/in.h][in] define socket families and
>> address format. Please be aware that only the **domain** `AF_INET`, **type**
>> `SOCK_STREAM` and **protocol** `0` are supported by this version of the spec.
>> 
>>     #define AF_UNSPEC   0
>>     #define AF_UNIX     1   /* Unix domain sockets      */
>>     #define AF_LOCAL    1   /* POSIX name for AF_UNIX   */
>>     #define AF_INET     2   /* Internet IP Protocol     */
>>     #define AF_INET6    10  /* IP version 6         */
>> 
>>     #define SOCK_STREAM 1
>>     #define SOCK_DGRAM  2
>>     #define SOCK_RAW    3
>> 
>>     /* generic address format */
>>     struct sockaddr {
>>         uint16_t sa_family_t;
>>         char sa_data[26];
>>     };
>> 
>>     struct in_addr {
>>         uint32_t s_addr;
>>     };
>> 
>>     /* AF_INET address format */
>>     struct sockaddr_in {
>>         uint16_t         sa_family_t;
>>         uint16_t         sin_port;
>>         struct in_addr   sin_addr;
>>         char             sin_zero[20];
>>     };
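
As an illustration of the address format, the sketch below fills the 28-byte **addr** field of a connect or bind request for an IPv4 endpoint. It is not taken from the prototype drivers. The names `xensock_sockaddr_in`, `XENSOCK_AF_INET` and `xensock_encode_inet` are hypothetical (renamed so they do not clash with system headers), the port and address are stored in network byte order as POSIX requires for `sockaddr_in`, and the returned length is simply the size of the structure as defined above, since the document does not state which value the backend expects in **len**.

```c
#include <stdint.h>
#include <string.h>

/* Same value as AF_INET in the definitions above. */
#define XENSOCK_AF_INET 2

/* Mirrors the 28-byte struct sockaddr_in from the definitions above. */
struct xensock_sockaddr_in {
	uint16_t family;    /* stored in host byte order in this sketch */
	uint16_t sin_port;  /* network byte order */
	uint32_t sin_addr;  /* network byte order */
	char sin_zero[20];
};

_Static_assert(sizeof(struct xensock_sockaddr_in) == 28, "must fit addr[28]");

/*
 * Encode an IPv4 address (four octets, most significant first) and a port
 * given in host byte order into addr. Returns the number of bytes used.
 */
uint32_t xensock_encode_inet(uint8_t addr[28], const uint8_t ip[4],
			     uint16_t port)
{
	struct xensock_sockaddr_in sin;
	/* Build the big-endian port by hand to avoid depending on htons(). */
	uint8_t port_be[2] = { (uint8_t)(port >> 8), (uint8_t)(port & 0xff) };

	memset(&sin, 0, sizeof(sin));
	sin.family = XENSOCK_AF_INET;
	memcpy(&sin.sin_port, port_be, sizeof(port_be));
	memcpy(&sin.sin_addr, ip, 4);

	memcpy(addr, &sin, sizeof(sin));
	return (uint32_t)sizeof(sin);
}
```

For 192.168.0.1 port 8080, bytes 2 and 3 of `addr` hold 0x1F 0x90 and bytes 4 to 7 hold the four address octets in order.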
>> 
>> 
>> ### Data ring
>> 
>> Data rings are used for sending and receiving data over a connected socket.
>> They are created upon a successful **accept** or **connect** command.
>> 
>> A data ring is composed of two pieces: the interface and the **in** and
>> **out** buffers. The interface, represented by `struct xensock_data_intf`, is
>> shared first and resides on the page whose grant reference is passed by
>> **accept** and **connect** as parameter. `struct xensock_data_intf` contains
>> the list of grant references which constitute the **in** and **out** data
>> buffers.
>> 
>> #### Data ring interface
>> 
>>     struct xensock_data_intf {
>>       XENSOCK_RING_IDX in_cons, in_prod;
>>       XENSOCK_RING_IDX out_cons, out_prod;
>>       int32_t in_error, out_error;
>>     
>>       uint32_t ring_order;
>>       grant_ref_t ref[];
>>     };
>> 
>>     /* not actually C compliant (ring_order changes from socket to socket) */
>>     struct xensock_data {
>>         char in[((1<<ring_order)<<PAGE_SHIFT)/2];
>>         char out[((1<<ring_order)<<PAGE_SHIFT)/2];
>>     };
>> 
>> - **ring_order**
>>   It represents the order of the data ring. The following list of grant
>>   references is of `(1 << ring_order)` elements. It cannot be greater than
>>   **max-dataring-page-order**, as specified by the backend on XenBus.
>> - **ref[]**
>>   The list of grant references which will contain the actual data. They are
>>   mapped contiguously in virtual memory. The first half of the pages is the
>>   **in** array, the second half is the **out** array.
>> - **in** is an array used as circular buffer
>>   It contains data read from the socket. The producer is the backend, the
>>   consumer is the frontend.
>> - **out** is an array used as circular buffer
>>   It contains data to be written to the socket. The producer is the frontend,
>>   the consumer is the backend.
>> - **in_cons** and **in_prod**
>>   Consumer and producer pointers for data read from the socket. They keep
>>   track of how much data has already been consumed by the frontend from the
>>   **in** array. **in_prod** is increased by the backend, after writing data
>>   to **in**. **in_cons** is increased by the frontend, after reading data
>>   from **in**.
>> - **out_cons**, **out_prod**
>>   Consumer and producer pointers for the data to be written to the socket.
>>   They keep track of how much data has been written by the frontend to
>>   **out** and how much data has already been consumed by the backend.
>>   **out_prod** is increased by the frontend, after writing data to **out**.
>>   **out_cons** is increased by the backend, after reading data from **out**.
>> - **in_error** and **out_error**
>>   They signal errors when reading from the socket (**in_error**) or when
>>   writing to the socket (**out_error**). 0 means no errors. When an error
>>   occurs, no further read or write operations are performed on the socket.
>>   In the case of an orderly socket shutdown (i.e. read returns 0)
>>   **in_error** is set to ENOTCONN. **in_error** and **out_error** are never
>>   set to EAGAIN or EWOULDBLOCK.
>> 
>> The binary layout of `struct xensock_data_intf` follows:
>> 
>>     0         4         8         12        16        20        24        28
>>     +---------+---------+---------+---------+---------+---------+----------+
>>     | in_cons | in_prod |out_cons |out_prod |in_error |out_error|ring_order|
>>     +---------+---------+---------+---------+---------+---------+----------+
>> 
>>     28        32        36        40        4092     4096
>>     +---------+---------+---------+----//---+---------+
>>     |  ref[0] |  ref[1] |  ref[2] |         |  ref[N] |
>>     +---------+---------+---------+----//---+---------+
>> 
>> The binary layout of the ring buffers follows:
>> 
>>     0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
>>     +------------//-------------+------------//-------------+
>>     |            in             |           out             |
>>     +------------//-------------+------------//-------------+
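
The sizes implied by the two layouts can be worked out with a few helpers. This is a sketch under stated assumptions: 4 KiB pages (`PAGE_SHIFT` of 12), 32-bit grant references, and the 28-byte header shown in the interface layout. All function and macro names here are hypothetical.

```c
#include <stdint.h>

#define XENSOCK_PAGE_SHIFT  12u  /* assumption: 4 KiB pages */
#define XENSOCK_INTF_HEADER 28u  /* bytes before ref[0] in the layout above */

/* Number of pages, and therefore of grant references, in a data ring. */
uint32_t xensock_ring_pages(uint32_t ring_order)
{
	return 1u << ring_order;
}

/* Size in bytes of each of the in and out arrays (half the ring each). */
uint32_t xensock_array_size(uint32_t ring_order)
{
	return ((1u << ring_order) << XENSOCK_PAGE_SHIFT) / 2;
}

/* How many 32-bit grant references fit after the header in one page. */
uint32_t xensock_max_refs(void)
{
	return ((1u << XENSOCK_PAGE_SHIFT) - XENSOCK_INTF_HEADER) /
	       (uint32_t)sizeof(uint32_t);
}

/*
 * A ring_order is usable if it does not exceed the backend's
 * max-dataring-page-order and its reference list fits in the interface page.
 */
int xensock_ring_order_ok(uint32_t ring_order, uint32_t max_order)
{
	if (ring_order > max_order || ring_order >= 32)
		return 0;
	return xensock_ring_pages(ring_order) <= xensock_max_refs();
}
```

With these assumptions a `ring_order` of 2 gives four pages and two 8192-byte arrays, and a single interface page has room for 1017 references.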
>> 
>> #### Workflow
>> 
>> The **in** and **out** arrays are used as circular buffers:
>>     
>>     0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
>>     +-----------------------------------+
>>     |to consume|    free    |to consume |
>>     +-----------------------------------+
>>                ^            ^
>>                prod         cons
>> 
>>     0                               sizeof(array)
>>     +-----------------------------------+
>>     |  free    | to consume |   free    |
>>     +-----------------------------------+
>>                ^            ^
>>                cons         prod
>> 
>> The following function is provided to calculate how many bytes are currently
>> left unconsumed in an array:
>> 
>>     #define _MASK_XENSOCK_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))
>> 
>>     static inline XENSOCK_RING_IDX xensock_ring_queued(XENSOCK_RING_IDX prod,
>>               XENSOCK_RING_IDX cons,
>>               XENSOCK_RING_IDX ring_size)
>>     {
>>       XENSOCK_RING_IDX size;
>>     
>>       if (prod == cons)
>>               return 0;
>>     
>>       prod = _MASK_XENSOCK_IDX(prod, ring_size);
>>       cons = _MASK_XENSOCK_IDX(cons, ring_size);
>>     
>>       if (prod == cons)
>>               return ring_size;
>>     
>>       if (prod > cons)
>>               size = prod - cons;
>>       else {
>>               size = ring_size - cons;
>>               size += prod;
>>       }
>>       return size;
>>     }
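
A few sample values may help when reading the function above. The sketch below repeats it so that it compiles on its own, assuming `XENSOCK_RING_IDX` is a 32-bit unsigned integer (the document does not define it) and that `ring_size` is a power of two. `xensock_ring_free` is a hypothetical companion helper, not part of the proposal.

```c
#include <stdint.h>

/* Assumption: the index type is a 32-bit unsigned integer. */
typedef uint32_t XENSOCK_RING_IDX;

#define _MASK_XENSOCK_IDX(idx, ring_size) ((idx) & ((ring_size) - 1))

/* Bytes queued between cons and prod, as defined in the document. */
static inline XENSOCK_RING_IDX xensock_ring_queued(XENSOCK_RING_IDX prod,
		XENSOCK_RING_IDX cons,
		XENSOCK_RING_IDX ring_size)
{
	XENSOCK_RING_IDX size;

	if (prod == cons)
		return 0;               /* empty */

	prod = _MASK_XENSOCK_IDX(prod, ring_size);
	cons = _MASK_XENSOCK_IDX(cons, ring_size);

	if (prod == cons)
		return ring_size;       /* indexes differ but masks match: full */

	if (prod > cons)
		size = prod - cons;
	else {
		size = ring_size - cons; /* data wraps around the end */
		size += prod;
	}
	return size;
}

/* Hypothetical helper: bytes the producer may still write. */
static inline XENSOCK_RING_IDX xensock_ring_free(XENSOCK_RING_IDX prod,
		XENSOCK_RING_IDX cons,
		XENSOCK_RING_IDX ring_size)
{
	return ring_size - xensock_ring_queued(prod, cons, ring_size);
}
```

With an 8-byte array, `prod == 8, cons == 0` is a full ring (8 bytes queued) even though both indexes mask to 0, and `prod == 10, cons == 6` has 4 bytes queued across the wrap point. The distinction between empty and full relies on the indexes being free-running rather than stored already masked.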
>> 
>> The producer (the backend for **in**, the frontend for **out**) writes to the
>> array in the following way:
>> 
>> - read *cons*, *prod*, *error* from shared memory
>> - memory barrier
>> - return on *error*
>> - write to array at position *prod* up to *cons*, wrapping around the
>>   circular buffer when necessary
>> - memory barrier
>> - increase *prod*
>> - notify the other end via evtchn
>> 
>> The consumer (the backend for **out**, the frontend for **in**) reads from
>> the array in the following way:
>> 
>> - read *prod*, *cons*, *error* from shared memory
>> - memory barrier
>> - return on *error*
>> - read from array at position *cons* up to *prod*, wrapping around the
>>   circular buffer when necessary
>> - memory barrier
>> - increase *cons*
>> - notify the other end via evtchn
>> 
>> The producer takes care of writing only as many bytes as available in the
>> buffer up to *cons*. The consumer takes care of reading only as many bytes as
>> available in the buffer up to *prod*. *error* is set by the backend when an
>> error occurs writing or reading from the socket.
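
The producer and consumer steps can be sketched for one direction of a data ring as follows. This is an illustration under assumptions, not the prototype code: `struct xensock_dir`, `xensock_produce` and `xensock_consume` are hypothetical names, the indexes are treated as free-running 32-bit counters, C11 fences stand in for the memory barriers a real driver would use, and the event channel notification is reduced to a comment.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t XENSOCK_RING_IDX; /* assumption: 32-bit index type */

/* Hypothetical view of one direction (in or out) of a data ring. */
struct xensock_dir {
	XENSOCK_RING_IDX *prod;  /* points into the shared interface page */
	XENSOCK_RING_IDX *cons;
	int32_t *error;
	uint8_t *buf;            /* the in or out array */
	XENSOCK_RING_IDX size;   /* array size, a power of two */
};

static XENSOCK_RING_IDX ring_mask(XENSOCK_RING_IDX idx, XENSOCK_RING_IDX size)
{
	return idx & (size - 1);
}

/* Producer: returns bytes written (possibly fewer than len) or the error. */
int32_t xensock_produce(struct xensock_dir *d, const uint8_t *src, uint32_t len)
{
	XENSOCK_RING_IDX prod = *d->prod, cons = *d->cons;
	int32_t error = *d->error;
	uint32_t space, first;

	atomic_thread_fence(memory_order_seq_cst); /* barrier after the reads */
	if (error)
		return error;

	space = d->size - (prod - cons);           /* free bytes up to cons */
	if (len > space)
		len = space;
	first = d->size - ring_mask(prod, d->size); /* bytes before the wrap */
	if (first > len)
		first = len;
	memcpy(d->buf + ring_mask(prod, d->size), src, first);
	memcpy(d->buf, src + first, len - first);

	atomic_thread_fence(memory_order_seq_cst); /* data before index */
	*d->prod = prod + len;
	/* a real implementation would notify the other end via evtchn here */
	return (int32_t)len;
}

/* Consumer: returns bytes read (possibly fewer than len) or the error. */
int32_t xensock_consume(struct xensock_dir *d, uint8_t *dst, uint32_t len)
{
	XENSOCK_RING_IDX prod = *d->prod, cons = *d->cons;
	int32_t error = *d->error;
	uint32_t avail, first;

	atomic_thread_fence(memory_order_seq_cst);
	if (error)
		return error;

	avail = prod - cons;                       /* queued bytes up to prod */
	if (len > avail)
		len = avail;
	first = d->size - ring_mask(cons, d->size);
	if (first > len)
		first = len;
	memcpy(dst, d->buf + ring_mask(cons, d->size), first);
	memcpy(dst + first, d->buf, len - first);

	atomic_thread_fence(memory_order_seq_cst); /* reads before index */
	*d->cons = cons + len;
	/* a real implementation would notify the other end via evtchn here */
	return (int32_t)len;
}
```

Each side only ever advances its own index, so the two ends never write the same field. Across domains the shared fields would additionally need to be read through volatile or equivalent accessors, which this single-process sketch leaves out.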
>> 
>> 
>> [address]: http://pubs.opengroup.org/onlinepubs/7908799/xns/syssocket.h.html
>> [in]: http://pubs.opengroup.org/onlinepubs/000095399/basedefs/netinet/in.h.html
>> [socket]: http://pubs.opengroup.org/onlinepubs/009695399/functions/socket.html
>> [connect]: http://pubs.opengroup.org/onlinepubs/7908799/xns/connect.html
>> [shutdown]: http://pubs.opengroup.org/onlinepubs/7908799/xns/shutdown.html
>> [bind]: http://pubs.opengroup.org/onlinepubs/7908799/xns/bind.html
>> [listen]: http://pubs.opengroup.org/onlinepubs/7908799/xns/listen.html
>> [accept]: http://pubs.opengroup.org/onlinepubs/7908799/xns/accept.html
>> [poll]: http://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.html




