
Re: [Xen-devel] RFC: XenSock brainstorming



Now that Xen 4.7 is out the door, is there any more feedback on this?

On Mon, 6 Jun 2016, Stefano Stabellini wrote:
> Hi all,
> 
> a couple of months ago I started working on a new PV protocol for
> virtualizing syscalls. I named it XenSock, as its main purpose is to
> allow the implementation of the POSIX socket API in a domain other than
> the caller's. It allows connect, accept, recvmsg, sendmsg, etc.
> to be implemented directly in Dom0. In a way this is conceptually
> similar to virtio-9pfs, but for sockets rather than filesystem APIs.
> See this diagram as reference:
> 
> https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing
> 
> The frontends and backends could live either in userspace or kernel
> space, with different trade-offs. My current prototype is based on Linux
> kernel drivers but it would be nice to have userspace drivers too.
> Discussing where the drivers could be implemented is beyond the scope
> of this email.
> 
> 
> # Goals
> 
> The goal of the protocol is to provide networking capabilities to any
> guest, with the following added benefits:
> 
> * guest networking should work out of the box with VPNs, wireless
>   networks and any other complex network configurations in Dom0
> 
> * guest services should listen on ports bound directly to Dom0 IP
>   addresses, fitting naturally into a Docker-based workflow, where guests
>   are Docker containers
> 
> * Dom0 should have full visibility into guest behavior and should be
>   able to perform inexpensive filtering and manipulation of guest calls
> 
> * XenSock should provide excellent performance. Unoptimized early code
>   reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3
>   streams.
> 
> 
> # Status
> 
> I would like to get feedback on the high level architecture, the data
> path and the ring formats.
> 
> Beware that the protocol and drivers are in their very early days. I don't
> have all the information to write a design document yet. The ABI is
> neither complete nor stable.
> 
> The code is not ready for xen-devel yet, but I would be happy to push a
> git branch if somebody is interested in contributing to the project.
> 
> 
> # Design and limitations
> 
> The frontend connects to the backend following the traditional xenstore
> based exchange of information.
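> 
> As a rough sketch (not prototype code; the node names "ring-ref" and
> "event-channel" are placeholders, not a finalized ABI), the frontend
> could advertise its command ring with the usual xenbus transaction:
> 
> #include <xen/xenbus.h>
> #include <xen/grant_table.h>
> 
> static int xensock_publish_ring(struct xenbus_device *dev,
>                                 grant_ref_t ring_ref, unsigned int evtchn)
> {
>         struct xenbus_transaction xbt;
>         int err;
> 
> again:
>         err = xenbus_transaction_start(&xbt);
>         if (err)
>                 return err;
> 
>         /* advertise the command ring grant and its event channel */
>         err = xenbus_printf(xbt, dev->nodename, "ring-ref", "%u", ring_ref);
>         if (err)
>                 goto abort;
> 
>         err = xenbus_printf(xbt, dev->nodename, "event-channel", "%u", evtchn);
>         if (err)
>                 goto abort;
> 
>         err = xenbus_transaction_end(xbt, 0);
>         if (err == -EAGAIN)
>                 goto again;
>         return err;
> 
> abort:
>         xenbus_transaction_end(xbt, 1);
>         return err;
> }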
> 
> Frontend and backend set up an event channel and a shared ring. The ring
> is used by the frontend to forward socket API calls to the backend. I am
> referring to this ring as the command ring. This is an example of the
> ring format:
> 
> #define XENSOCK_CONNECT        0
> #define XENSOCK_RELEASE        3
> #define XENSOCK_BIND           4
> #define XENSOCK_LISTEN         5
> #define XENSOCK_ACCEPT         6
> #define XENSOCK_POLL           7
> 
> struct xen_xensock_request {
>       uint32_t id;     /* private to guest, echoed in response */
>       uint32_t cmd;    /* command to execute */
>       uint64_t sockid; /* id of the socket */
>       union {
>               struct xen_xensock_connect {
>                       uint8_t addr[28];
>                       uint32_t len;
>                       uint32_t flags;
>                       grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                       uint32_t evtchn;
>               } connect;
>               struct xen_xensock_bind {
>                       uint8_t addr[28]; /* ipv6 ready */
>                       uint32_t len;
>               } bind;
>               struct xen_xensock_accept {
>                       grant_ref_t ref[XENSOCK_DATARING_PAGES];
>                       uint32_t evtchn;
>                       uint64_t sockid;
>               } accept;
>       } u;
> };
> 
> struct xen_xensock_response {
>       uint32_t id;
>       uint32_t cmd;
>       uint64_t sockid;
>       int32_t ret;
> };
> 
> DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
>                 struct xen_xensock_response);
> 
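> To make the command ring usage concrete, a frontend could queue a
> XENSOCK_CONNECT request with the standard ring.h macros roughly as
> follows (illustrative sketch only; function and variable names are made
> up, and the definitions above are assumed to live in a shared header):
> 
> #include <linux/kernel.h>
> #include <linux/errno.h>
> #include <linux/string.h>
> #include <xen/events.h>
> #include <xen/interface/io/ring.h>
> 
> static int xensock_send_connect(struct xen_xensock_front_ring *ring,
>                                 int irq, uint64_t sockid,
>                                 const uint8_t *addr, uint32_t len)
> {
>         struct xen_xensock_request *req;
>         int notify;
> 
>         if (RING_FULL(ring))
>                 return -EAGAIN;         /* caller retries later */
> 
>         req = RING_GET_REQUEST(ring, ring->req_prod_pvt++);
>         req->id = 0;                    /* private cookie, echoed back */
>         req->cmd = XENSOCK_CONNECT;
>         req->sockid = sockid;
>         memcpy(req->u.connect.addr, addr,
>                min_t(uint32_t, len, sizeof(req->u.connect.addr)));
>         req->u.connect.len = len;
>         /* grant refs and evtchn of the new data ring would go here */
> 
>         RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(ring, notify);
>         if (notify)
>                 notify_remote_via_irq(irq);
> 
>         return 0;
> }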
> 
> Connect and accept lead to the creation of new active sockets. Today
> each active socket has its own event channel and ring for sending and
> receiving data. Data rings have the following format:
> 
> #define XENSOCK_DATARING_ORDER 2
> #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
> #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
> 
> typedef uint32_t XENSOCK_RING_IDX;
> 
> struct xensock_ring_intf {
>       char in[XENSOCK_DATARING_SIZE/4];
>       char out[XENSOCK_DATARING_SIZE/2];
>       XENSOCK_RING_IDX in_cons, in_prod;
>       XENSOCK_RING_IDX out_cons, out_prod;
>       int32_t in_error, out_error;
> };
> 
> The ring works like the Xen console ring (see
> xen/include/public/io/console.h). Data is copied to/from the ring by
> both frontend and backend. in_error, out_error are used to report
> errors. This simple design works well, but it requires at least 1 page
> per active socket. To get good performance (~20 Gbit/sec single stream),
> we need buffers of at least 64K, so actually we are looking at about 64
> pages per ring (order 6).
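> 
> To illustrate, a write into the out buffer on the frontend side could
> look roughly like the console driver's write path (simplified sketch,
> reusing struct xensock_ring_intf from above; barriers and the byte-wise
> copy follow the console driver pattern):
> 
> #include <asm/barrier.h>
> 
> static size_t xensock_data_write(struct xensock_ring_intf *intf,
>                                  const char *buf, size_t len)
> {
>         XENSOCK_RING_IDX cons, prod;
>         size_t sent = 0;
> 
>         cons = intf->out_cons;
>         prod = intf->out_prod;
>         mb();                   /* read the indexes before writing data */
> 
>         while (sent < len && (prod - cons) < sizeof(intf->out)) {
>                 intf->out[prod & (sizeof(intf->out) - 1)] = buf[sent];
>                 prod++;
>                 sent++;
>         }
> 
>         wmb();                  /* data must be visible before the index */
>         intf->out_prod = prod;
> 
>         return sent;            /* 0 means the out buffer was full */
> }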
> 
> I am currently investigating the usage of AVX2 to perform the data copy.
> 
> 
> # Brainstorming
> 
> Is 64 pages per active socket a reasonable amount in the context of
> modern OS-level networking? I believe that regular Linux TCP sockets
> allocate something of that order of magnitude.
> 
> If that's too much, I spent some time thinking about ways to reduce it.
> Some ideas follow.
> 
> 
> We could split up send and receive into two different data structures. I
> am thinking of introducing a single ring for all active sockets with
> variable-size messages for sending data. Something like the following:
> 
> struct xensock_ring_entry {
>       uint64_t sockid; /* identifies a socket */
>       uint32_t len;    /* length of data to follow */
>       uint8_t data[];  /* variable length data */
> };
>  
> One ring would be dedicated to holding xensock_ring_entry structures,
> one after another in a classic circular fashion. Two indexes, out_cons
> and out_prod, would still be used the same way they are used in the
> console ring, but I would place them on a separate page for clarity:
> 
> struct xensock_ring_intf {
>       XENSOCK_RING_IDX out_cons, out_prod;
> };
> 
> The frontend, that is, the producer, writes a new struct
> xensock_ring_entry to the ring, taking care not to exceed the remaining
> free bytes available. Then it increments out_prod by the amount written.
> The backend, that is, the consumer, reads the next struct
> xensock_ring_entry, reading as much data as specified by "len". Then it
> increments out_cons by the size of the struct xensock_ring_entry read.
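> 
> In code, the producer side could look roughly like this (illustrative
> sketch; names are made up, the data area is assumed to live on pages
> separate from the index page and to be a power of two in size, and
> wrap-around is handled with a two-part copy):
> 
> #include <linux/kernel.h>
> #include <linux/errno.h>
> #include <linux/string.h>
> #include <asm/barrier.h>
> 
> static void xensock_ring_write(uint8_t *ring, uint32_t ring_size,
>                                XENSOCK_RING_IDX prod,
>                                const void *src, uint32_t len)
> {
>         uint32_t off = prod & (ring_size - 1);
>         uint32_t chunk = min(len, ring_size - off);
> 
>         memcpy(ring + off, src, chunk);
>         memcpy(ring, (const uint8_t *)src + chunk, len - chunk);
> }
> 
> static int xensock_send_entry(struct xensock_ring_intf *intf,
>                               uint8_t *ring, uint32_t ring_size,
>                               uint64_t sockid,
>                               const uint8_t *data, uint32_t len)
> {
>         struct xensock_ring_entry hdr = { .sockid = sockid, .len = len };
>         uint32_t needed = sizeof(hdr) + len;
>         XENSOCK_RING_IDX cons = intf->out_cons;
>         XENSOCK_RING_IDX prod = intf->out_prod;
> 
>         mb();                   /* read the indexes before touching data */
> 
>         if (ring_size - (prod - cons) < needed)
>                 return -ENOBUFS;        /* not enough free bytes */
> 
>         xensock_ring_write(ring, ring_size, prod, &hdr, sizeof(hdr));
>         xensock_ring_write(ring, ring_size, prod + sizeof(hdr), data, len);
> 
>         wmb();                  /* entry must be visible before out_prod */
>         intf->out_prod = prod + needed;
> 
>         return 0;
> }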
> 
> I think this could work. Theoretically we could do the same thing for
> receive: a separate single ring shared by all active sockets. We could
> even reuse struct xensock_ring_entry.
> 
> 
> However, I have doubts that this model could work well for receive. When
> sending data, all sockets on the frontend side copy buffers onto this
> single ring. If there is no room, the frontend returns ENOBUFS. The
> backend picks up the data from the ring and calls sendmsg, which can
> also return ENOBUFS. In that case we don't increment out_cons, leaving
> the data on the ring. The backend will try again in the near future.
> Error messages would have to go on a separate data structure which I
> haven't finalized yet.
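> 
> Sketched as code, the backend's consumer loop for that shared send ring
> could look like this (rough sketch; xensock_deliver() is a placeholder
> for looking up the struct socket bound to the sockid and calling sendmsg
> on it, and entries are copied into a scratch buffer sized for the
> largest payload):
> 
> #include <linux/kernel.h>
> #include <linux/errno.h>
> #include <linux/string.h>
> #include <asm/barrier.h>
> 
> int xensock_deliver(uint64_t sockid, const uint8_t *buf, uint32_t len);
> 
> static void xensock_ring_read(const uint8_t *ring, uint32_t ring_size,
>                               XENSOCK_RING_IDX cons, void *dst, uint32_t len)
> {
>         uint32_t off = cons & (ring_size - 1);
>         uint32_t chunk = min(len, ring_size - off);
> 
>         memcpy(dst, ring + off, chunk);
>         memcpy((uint8_t *)dst + chunk, ring, len - chunk);
> }
> 
> static void xensock_backend_consume(struct xensock_ring_intf *intf,
>                                     const uint8_t *ring, uint32_t ring_size,
>                                     uint8_t *scratch)
> {
>         XENSOCK_RING_IDX cons = intf->out_cons;
>         XENSOCK_RING_IDX prod = intf->out_prod;
>         struct xensock_ring_entry hdr;
> 
>         rmb();                  /* read out_prod before the entries */
> 
>         while (prod - cons >= sizeof(hdr)) {
>                 xensock_ring_read(ring, ring_size, cons, &hdr, sizeof(hdr));
>                 xensock_ring_read(ring, ring_size, cons + sizeof(hdr),
>                                   scratch, hdr.len);
> 
>                 /* If sendmsg returns ENOBUFS, leave out_cons alone so the
>                  * entry stays on the ring and is retried later. */
>                 if (xensock_deliver(hdr.sockid, scratch, hdr.len) == -ENOBUFS)
>                         break;
> 
>                 cons += sizeof(hdr) + hdr.len;
>         }
> 
>         mb();                   /* finish reading before freeing the space */
>         intf->out_cons = cons;
> }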
> 
> When receiving from a socket, the backend copies data to the ring as
> soon as data is available, perhaps before the frontend requests the
> data. Buffers are copied to the ring not necessarily in the order that
> the frontend might want to read them. Thus the frontend would have to
> copy them out of the common ring into private per-socket dynamic buffers
> just to free the ring as soon as possible and consume the next
> xensock_ring_entry. It doesn't look very advantageous in terms of memory
> consumption and performance.
> 
> Alternatively, the frontend would have to leave the data on the ring if
> the application hasn't asked for it yet. In that case the frontend could
> look ahead without incrementing the in_cons pointer. It would have to
> keep track of which entries have been consumed and which have not. Only
> when the ring is full would the frontend have no choice but to copy the
> data out of the ring into temporary buffers. I am not sure how well this
> could work in practice.
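> 
> A minimal look-ahead sketch follows (hypothetical; it assumes a receive
> counterpart of the index page, leaves out the bookkeeping of
> already-delivered entries, and, for brevity, assumes entries never wrap
> around the end of the buffer). It scans from a private cursor without
> moving in_cons:
> 
> #include <asm/barrier.h>
> 
> struct xensock_recv_intf {
>         XENSOCK_RING_IDX in_cons, in_prod;
> };
> 
> static struct xensock_ring_entry *
> xensock_lookahead(struct xensock_recv_intf *intf, uint8_t *ring,
>                   uint32_t ring_size, XENSOCK_RING_IDX *cursor,
>                   uint64_t sockid)
> {
>         XENSOCK_RING_IDX prod = intf->in_prod;
> 
>         rmb();                  /* read in_prod before the entries */
> 
>         while (*cursor != prod) {
>                 struct xensock_ring_entry *entry = (void *)
>                         (ring + (*cursor & (ring_size - 1)));
> 
>                 *cursor += sizeof(*entry) + entry->len;
>                 if (entry->sockid == sockid)
>                         return entry;   /* in_cons deliberately untouched */
>         }
> 
>         return NULL;            /* nothing pending for this socket */
> }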
> 
> As a compromise, we could use a single shared ring for sending data, and
> 1 ring per active socket to receive data. This would cut the per-socket
> memory consumption in half (maybe to a quarter, by moving the indexes
> out of the shared data ring onto a separate page) and might be an
> acceptable trade-off.
> 
> Any feedback or ideas?
> 
> 
> Many thanks,
> 
> Stefano
> 
