
Re: [Xen-devel] RFC: XenSock brainstorming



Now that Xen 4.7 is out the door, is there any more feedback on this?

On Mon, 6 Jun 2016, Stefano Stabellini wrote:
> Hi all,
> 
> a couple of months ago I started working on a new PV protocol for
> virtualizing syscalls. I named it XenSock, as its main purpose is to
> allow the implementation of the POSIX socket API in a domain other than
> the caller's. It allows connect, accept, recvmsg, sendmsg, etc.
> to be implemented directly in Dom0. In a way this is conceptually
> similar to virtio-9pfs, but for sockets rather than filesystem APIs.
> See this diagram as reference:
> 
> https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing
> 
> The frontends and backends could live either in userspace or kernel
> space, with different trade-offs. My current prototype is based on Linux
> kernel drivers but it would be nice to have userspace drivers too.
> Discussing where the drivers could be implemented is beyond the scope
> of this email.
> 
> 
> # Goals
> 
> The goal of the protocol is to provide networking capabilities to any
> guest, with the following added benefits:
> 
> * guest networking should work out of the box with VPNs, wireless
>   networks and any other complex network configurations in Dom0
> 
> * guest services should listen on ports bound directly to Dom0 IP
>   addresses, fitting naturally into a Docker-based workflow, where guests
>   are Docker containers
> 
> * Dom0 should have full visibility into guest behavior and should be
>   able to perform inexpensive filtering and manipulation of guest calls
> 
> * XenSock should provide excellent performance. Unoptimized early code
>   reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3
>   streams.
> 
> 
> # Status
> 
> I would like to get feedback on the high level architecture, the data
> path and the ring formats.
> 
> Beware that the protocol and drivers are in their very early days. I don't
> have all the information to write a design document yet. The ABI is
> neither complete nor stable.
> 
> The code is not ready for xen-devel yet, but I would be happy to push a
> git branch if somebody is interested in contributing to the project.
> 
> 
> # Design and limitations
> 
> The frontend connects to the backend following the traditional xenstore
> based exchange of information.
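> 
> As a rough sketch (not prototype code; the node names "ring-ref" and
> "event-channel" are placeholders, not a finalized ABI), the frontend
> could advertise its command ring with the usual xenbus transaction:
> 
> #include <xen/xenbus.h>
> #include <xen/grant_table.h>
> 
> static int xensock_publish_ring(struct xenbus_device *dev,
>                                 grant_ref_t ring_ref, unsigned int evtchn)
> {
>         struct xenbus_transaction xbt;
>         int err;
> 
> again:
>         err = xenbus_transaction_start(&xbt);
>         if (err)
>                 return err;
> 
>         /* advertise the command ring grant and its event channel */
>         err = xenbus_printf(xbt, dev->nodename, "ring-ref", "%u", ring_ref);
>         if (err)
>                 goto abort;
> 
>         err = xenbus_printf(xbt, dev->nodename, "event-channel", "%u", evtchn);
>         if (err)
>                 goto abort;
> 
>         err = xenbus_transaction_end(xbt, 0);
>         if (err == -EAGAIN)
>                 goto again;
>         return err;
> 
> abort:
>         xenbus_transaction_end(xbt, 1);
>         return err;
> }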
> 
> Frontend and backend set up an event channel and a shared ring. The ring
> is used by the frontend to forward socket API calls to the backend. I am
> referring to this ring as the command ring. This is an example of the
> ring format:
> 
> #define XENSOCK_CONNECT        0
> #define XENSOCK_RELEASE        3
> #define XENSOCK_BIND           4
> #define XENSOCK_LISTEN         5
> #define XENSOCK_ACCEPT         6
> #define XENSOCK_POLL           7
> 
> struct xen_xensock_request {
>       uint32_t id;     /* private to guest, echoed in response */
>       uint32_t cmd;    /* command to execute */
>       uint64_t sockid; /* id of the socket */
>       union {
>               struct xen_xensock_connect {
>                       uint8_t addr[28];
>                       uint32_t len;
>                       uint32_t flags;
>                       grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                       uint32_t evtchn;
>               } connect;
>               struct xen_xensock_bind {
>                       uint8_t addr[28]; /* ipv6 ready */
>                       uint32_t len;
>               } bind;
>               struct xen_xensock_accept {
>                       grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                       uint32_t evtchn;
>                       uint64_t sockid;
>               } accept;
>       } u;
> };
> 
> struct xen_xensock_response {
>       uint32_t id;
>       uint32_t cmd;
>       uint64_t sockid;
>       int32_t ret;
> };
> 
> DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
>                 struct xen_xensock_response);
> 
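> To make the command ring usage concrete, a frontend could queue a
> XENSOCK_CONNECT request with the standard ring.h macros roughly as
> follows (illustrative sketch only; function and variable names are made
> up, and the definitions above are assumed to live in a shared header):
> 
> #include <linux/kernel.h>
> #include <linux/errno.h>
> #include <linux/string.h>
> #include <xen/events.h>
> #include <xen/interface/io/ring.h>
> 
> static int xensock_send_connect(struct xen_xensock_front_ring *ring,
>                                 int irq, uint64_t sockid,
>                                 const uint8_t *addr, uint32_t len)
> {
>         struct xen_xensock_request *req;
>         int notify;
> 
>         if (RING_FULL(ring))
>                 return -EAGAIN;         /* caller retries later */
> 
>         req = RING_GET_REQUEST(ring, ring->req_prod_pvt++);
>         req->id = 0;                    /* private cookie, echoed back */
>         req->cmd = XENSOCK_CONNECT;
>         req->sockid = sockid;
>         memcpy(req->u.connect.addr, addr,
>                min_t(uint32_t, len, sizeof(req->u.connect.addr)));
>         req->u.connect.len = len;
>         /* grant refs and evtchn of the new data ring would go here */
> 
>         RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(ring, notify);
>         if (notify)
>                 notify_remote_via_irq(irq);
> 
>         return 0;
> }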
> 
> Connect and accept lead to the creation of new active sockets. Today
> each active socket has its own event channel and ring for sending and
> receiving data. Data rings have the following format:
> 
> #define XENSOCK_DATARING_ORDER 2
> #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
> #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
> 
> typedef uint32_t XENSOCK_RING_IDX;
> 
> struct xensock_ring_intf {
>       char in[XENSOCK_DATARING_SIZE/4];
>       char out[XENSOCK_DATARING_SIZE/2];
>       XENSOCK_RING_IDX in_cons, in_prod;
>       XENSOCK_RING_IDX out_cons, out_prod;
>       int32_t in_error, out_error;
> };
> 
> The ring works like the Xen console ring (see
> xen/include/public/io/console.h). Data is copied to/from the ring by
> both frontend and backend. in_error, out_error are used to report
> errors. This simple design works well, but it requires at least 1 page
> per active socket. To get good performance (~20 Gbit/sec single stream),
> we need buffers of at least 64K, so actually we are looking at about 64
> pages per ring (order 6).
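> 
> To illustrate, a write into the out buffer on the frontend side could
> look roughly like the console driver's write path (simplified sketch,
> reusing struct xensock_ring_intf from above; barriers and the byte-wise
> copy follow the console driver pattern):
> 
> #include <asm/barrier.h>
> 
> static size_t xensock_data_write(struct xensock_ring_intf *intf,
>                                  const char *buf, size_t len)
> {
>         XENSOCK_RING_IDX cons, prod;
>         size_t sent = 0;
> 
>         cons = intf->out_cons;
>         prod = intf->out_prod;
>         mb();                   /* read the indexes before writing data */
> 
>         while (sent < len && (prod - cons) < sizeof(intf->out)) {
>                 intf->out[prod & (sizeof(intf->out) - 1)] = buf[sent];
>                 prod++;
>                 sent++;
>         }
> 
>         wmb();                  /* data must be visible before the index */
>         intf->out_prod = prod;
> 
>         return sent;            /* 0 means the out buffer was full */
> }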
> 
> I am currently investigating the usage of AVX2 to perform the data copy.
> 
> 
> # Brainstorming
> 
> Is 64 pages per active socket a reasonable amount in the context of
> modern OS-level networking? I believe that regular Linux TCP sockets
> allocate something of that order of magnitude.
> 
> If that's too much, I spent some time thinking about ways to reduce it.
> Some ideas follow.
> 
> 
> We could split up send and receive into two different data structures. I
> am thinking of introducing a single ring for all active sockets with
> variable-size messages for sending data. Something like the following:
> 
> struct xensock_ring_entry {
>       uint64_t sockid; /* identifies a socket */
>       uint32_t len;    /* length of data to follow */
>       uint8_t data[];  /* variable length data */
> };
>  
> One ring would be dedicated to holding xensock_ring_entry structures,
> one after another in a classic circular fashion. Two indexes, out_cons
> and out_prod, would still be used the same way they are used in the
> console ring, but I would place them on a separate page for clarity:
> 
> struct xensock_ring_intf {
>       XENSOCK_RING_IDX out_cons, out_prod;
> };
> 
> The frontend, that is, the producer, writes a new struct
> xensock_ring_entry to the ring, taking care not to exceed the remaining
> free bytes available. Then it increments out_prod by the amount written.
> The backend, that is, the consumer, reads the next struct
> xensock_ring_entry, reading as much data as specified by "len". Then it
> increments out_cons by the size of the struct xensock_ring_entry read.
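> 
> In code, the producer side could look roughly like this (illustrative
> sketch; names are made up, the data area is assumed to live on pages
> separate from the index page and to be a power of two in size, and
> wrap-around is handled with a two-part copy):
> 
> #include <linux/kernel.h>
> #include <linux/errno.h>
> #include <linux/string.h>
> #include <asm/barrier.h>
> 
> static void xensock_ring_write(uint8_t *ring, uint32_t ring_size,
>                                XENSOCK_RING_IDX prod,
>                                const void *src, uint32_t len)
> {
>         uint32_t off = prod & (ring_size - 1);
>         uint32_t chunk = min(len, ring_size - off);
> 
>         memcpy(ring + off, src, chunk);
>         memcpy(ring, (const uint8_t *)src + chunk, len - chunk);
> }
> 
> static int xensock_send_entry(struct xensock_ring_intf *intf,
>                               uint8_t *ring, uint32_t ring_size,
>                               uint64_t sockid,
>                               const uint8_t *data, uint32_t len)
> {
>         struct xensock_ring_entry hdr = { .sockid = sockid, .len = len };
>         uint32_t needed = sizeof(hdr) + len;
>         XENSOCK_RING_IDX cons = intf->out_cons;
>         XENSOCK_RING_IDX prod = intf->out_prod;
> 
>         mb();                   /* read the indexes before touching data */
> 
>         if (ring_size - (prod - cons) < needed)
>                 return -ENOBUFS;        /* not enough free bytes */
> 
>         xensock_ring_write(ring, ring_size, prod, &hdr, sizeof(hdr));
>         xensock_ring_write(ring, ring_size, prod + sizeof(hdr), data, len);
> 
>         wmb();                  /* entry must be visible before out_prod */
>         intf->out_prod = prod + needed;
> 
>         return 0;
> }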
> 
> I think this could work. Theoretically we could do the same thing for
> receive: a separate single ring shared by all active sockets. We could
> even reuse struct xensock_ring_entry.
> 
> 
> However, I have doubts that this model could work well for receive. When
> sending data, all sockets on the frontend side copy buffers onto this
> single ring. If there is no room, the frontend returns ENOBUFS. The
> backend picks up the data from the ring and calls sendmsg, which can
> also return ENOBUFS. In that case we don't increment out_cons, leaving
> the data on the ring. The backend will try again in the near future.
> Error messages would have to go on a separate data structure which I
> haven't finalized yet.
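> 
> Sketched as code, the backend's consumer loop for that shared send ring
> could look like this (rough sketch; xensock_deliver() is a placeholder
> for looking up the struct socket bound to the sockid and calling sendmsg
> on it, and entries are copied into a scratch buffer sized for the
> largest payload):
> 
> #include <linux/kernel.h>
> #include <linux/errno.h>
> #include <linux/string.h>
> #include <asm/barrier.h>
> 
> int xensock_deliver(uint64_t sockid, const uint8_t *buf, uint32_t len);
> 
> static void xensock_ring_read(const uint8_t *ring, uint32_t ring_size,
>                               XENSOCK_RING_IDX cons, void *dst, uint32_t len)
> {
>         uint32_t off = cons & (ring_size - 1);
>         uint32_t chunk = min(len, ring_size - off);
> 
>         memcpy(dst, ring + off, chunk);
>         memcpy((uint8_t *)dst + chunk, ring, len - chunk);
> }
> 
> static void xensock_backend_consume(struct xensock_ring_intf *intf,
>                                     const uint8_t *ring, uint32_t ring_size,
>                                     uint8_t *scratch)
> {
>         XENSOCK_RING_IDX cons = intf->out_cons;
>         XENSOCK_RING_IDX prod = intf->out_prod;
>         struct xensock_ring_entry hdr;
> 
>         rmb();                  /* read out_prod before the entries */
> 
>         while (prod - cons >= sizeof(hdr)) {
>                 xensock_ring_read(ring, ring_size, cons, &hdr, sizeof(hdr));
>                 xensock_ring_read(ring, ring_size, cons + sizeof(hdr),
>                                   scratch, hdr.len);
> 
>                 /* If sendmsg returns ENOBUFS, leave out_cons alone so the
>                  * entry stays on the ring and is retried later. */
>                 if (xensock_deliver(hdr.sockid, scratch, hdr.len) == -ENOBUFS)
>                         break;
> 
>                 cons += sizeof(hdr) + hdr.len;
>         }
> 
>         mb();                   /* finish reading before freeing the space */
>         intf->out_cons = cons;
> }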
> 
> When receiving from a socket, the backend copies data to the ring as
> soon as data is available, perhaps before the frontend requests the
> data. Buffers are copied to the ring not necessarily in the order that
> the frontend might want to read them. Thus the frontend would have to
> copy them out of the common ring into private per-socket dynamic buffers
> just to free the ring as soon as possible and consume the next
> xensock_ring_entry. It doesn't look very advantageous in terms of memory
> consumption and performance.
> 
> Alternatively, the frontend would have to leave the data on the ring if
> the application hasn't asked for it yet. In that case the frontend could
> look ahead without incrementing the in_cons pointer. It would have to
> keep track of which entries have been consumed and which have not. Only
> when the ring is full would the frontend have no choice but to copy the
> data out of the ring into temporary buffers. I am not sure how well this
> could work in practice.
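> 
> A minimal look-ahead sketch follows (hypothetical; it assumes a receive
> counterpart of the index page, leaves out the bookkeeping of
> already-delivered entries, and, for brevity, assumes entries never wrap
> around the end of the buffer). It scans from a private cursor without
> moving in_cons:
> 
> #include <asm/barrier.h>
> 
> struct xensock_recv_intf {
>         XENSOCK_RING_IDX in_cons, in_prod;
> };
> 
> static struct xensock_ring_entry *
> xensock_lookahead(struct xensock_recv_intf *intf, uint8_t *ring,
>                   uint32_t ring_size, XENSOCK_RING_IDX *cursor,
>                   uint64_t sockid)
> {
>         XENSOCK_RING_IDX prod = intf->in_prod;
> 
>         rmb();                  /* read in_prod before the entries */
> 
>         while (*cursor != prod) {
>                 struct xensock_ring_entry *entry = (void *)
>                         (ring + (*cursor & (ring_size - 1)));
> 
>                 *cursor += sizeof(*entry) + entry->len;
>                 if (entry->sockid == sockid)
>                         return entry;   /* in_cons deliberately untouched */
>         }
> 
>         return NULL;            /* nothing pending for this socket */
> }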
> 
> As a compromise, we could use a single shared ring for sending data, and
> 1 ring per active socket to receive data. This would cut the per-socket
> memory consumption in half (maybe to a quarter, by moving the indexes
> out of the shared data ring onto a separate page) and might be an
> acceptable trade-off.
> 
> Any feedback or ideas?
> 
> 
> Many thanks,
> 
> Stefano
> 
