Re: [Xen-devel] RFC: XenSock brainstorming
Although discussing the goals is fun, feedback on the design of the protocol is particularly welcome.

On Thu, 23 Jun 2016, Stefano Stabellini wrote:
> Now that Xen 4.7 is out of the door, any more feedback on this?

On Mon, 6 Jun 2016, Stefano Stabellini wrote:

Hi all,

a couple of months ago I started working on a new PV protocol for virtualizing syscalls. I named it XenSock, as its main purpose is to allow the implementation of the POSIX socket API in a domain other than the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc. to be implemented directly in Dom0. In a way this is conceptually similar to virtio-9pfs, but for sockets rather than filesystem APIs. See this diagram for reference:

https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing

The frontends and backends could live either in userspace or kernel space, with different trade-offs. My current prototype is based on Linux kernel drivers, but it would be nice to have userspace drivers too. Discussing where the drivers could be implemented is beyond the scope of this email.


# Goals

The goal of the protocol is to provide networking capabilities to any guest, with the following added benefits:

* guest networking should work out of the box with VPNs, wireless networks and any other complex network configurations in Dom0

* guest services should listen on ports bound directly to Dom0 IP addresses, fitting naturally in a Docker based workflow, where guests are Docker containers

* Dom0 should have full visibility into guest behavior and should be able to perform inexpensive filtering and manipulation of guest calls

* XenSock should provide excellent performance. Unoptimized early code reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3 streams.


# Status

I would like to get feedback on the high level architecture, the data path and the ring formats.

Beware that the protocol and drivers are in their very early days. I don't have all the information to write a design document yet. The ABI is neither complete nor stable.

The code is not ready for xen-devel yet, but I would be happy to push a git branch if somebody is interested in contributing to the project.


# Design and limitations

The frontend connects to the backend following the traditional xenstore based exchange of information.

Frontend and backend set up an event channel and a shared ring. The ring is used by the frontend to forward socket API calls to the backend. I am referring to this ring as the command ring. This is an example of the ring format:

#define XENSOCK_CONNECT  0
#define XENSOCK_RELEASE  3
#define XENSOCK_BIND     4
#define XENSOCK_LISTEN   5
#define XENSOCK_ACCEPT   6
#define XENSOCK_POLL     7

struct xen_xensock_request {
	uint32_t id;     /* private to guest, echoed in response */
	uint32_t cmd;    /* command to execute */
	uint64_t sockid; /* id of the socket */
	union {
		struct xen_xensock_connect {
			uint8_t addr[28];
			uint32_t len;
			uint32_t flags;
			grant_ref_t ref[XENSOCK_DATARING_PAGES];
			uint32_t evtchn;
		} connect;
		struct xen_xensock_bind {
			uint8_t addr[28]; /* ipv6 ready */
			uint32_t len;
		} bind;
		struct xen_xensock_accept {
			grant_ref_t ref[XENSOCK_DATARING_PAGES];
			uint32_t evtchn;
			uint64_t sockid;
		} accept;
	} u;
};

struct xen_xensock_response {
	uint32_t id;
	uint32_t cmd;
	uint64_t sockid;
	int32_t ret;
};

DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
		  struct xen_xensock_response);
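To give an idea of how the command ring would be driven, here is a rough sketch of a Linux kernel frontend queueing a XENSOCK_CONNECT request with the standard macros from xen/include/public/io/ring.h. The xensock_frontend structure, its irq field and the hard-coded request id are made up for the example and are not part of the proposed ABI; error handling and ring-full checks are omitted.

/*
 * Illustrative only: how a Linux kernel frontend might queue a
 * XENSOCK_CONNECT request on the command ring, using the types and
 * macros generated by DEFINE_RING_TYPES above.
 */
#include <linux/string.h>
#include <xen/events.h>
#include <xen/grant_table.h>
#include <xen/interface/io/ring.h>

struct xensock_frontend {
	struct xen_xensock_front_ring cmd_ring; /* from DEFINE_RING_TYPES */
	int irq;                                /* bound to the command ring evtchn */
};

static void xensock_connect_request(struct xensock_frontend *fe,
				    uint64_t sockid,
				    const uint8_t *addr, uint32_t len,
				    grant_ref_t *refs, uint32_t data_evtchn)
{
	struct xen_xensock_request *req;
	int notify;

	/* Claim the next free request slot on the command ring. */
	req = RING_GET_REQUEST(&fe->cmd_ring, fe->cmd_ring.req_prod_pvt);
	req->id = 1;                 /* echoed in the response; a real driver allocates ids */
	req->cmd = XENSOCK_CONNECT;
	req->sockid = sockid;
	memcpy(req->u.connect.addr, addr, len);
	req->u.connect.len = len;
	req->u.connect.flags = 0;
	/* Grants and event channel of the per-socket data ring. */
	memcpy(req->u.connect.ref, refs,
	       sizeof(grant_ref_t) * XENSOCK_DATARING_PAGES);
	req->u.connect.evtchn = data_evtchn;

	/* Publish the request and kick the backend if it needs a notification. */
	fe->cmd_ring.req_prod_pvt++;
	RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&fe->cmd_ring, notify);
	if (notify)
		notify_remote_via_irq(fe->irq);
}

The response, carrying the same id and sockid plus a ret value, would come back on the same ring.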
Connect and accept lead to the creation of new active sockets. Today each active socket has its own event channel and ring for sending and receiving data. Data rings have the following format:

#define XENSOCK_DATARING_ORDER 2
#define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
#define XENSOCK_DATARING_SIZE  (XENSOCK_DATARING_PAGES << PAGE_SHIFT)

typedef uint32_t XENSOCK_RING_IDX;

struct xensock_ring_intf {
	char in[XENSOCK_DATARING_SIZE/4];
	char out[XENSOCK_DATARING_SIZE/2];
	XENSOCK_RING_IDX in_cons, in_prod;
	XENSOCK_RING_IDX out_cons, out_prod;
	int32_t in_error, out_error;
};

The ring works like the Xen console ring (see xen/include/public/io/console.h). Data is copied to/from the ring by both frontend and backend. in_error and out_error are used to report errors. This simple design works well, but it requires at least 1 page per active socket. To get good performance (~20 Gbit/sec single stream), we need buffers of at least 64K, so actually we are looking at about 64 pages per ring (order 6).
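For reference, this is roughly what the console-ring style send path looks like from the frontend: free-running out_prod/out_cons indexes, masked by the buffer size on every access. xensock_write and XENSOCK_OUT_MASK are illustrative names, not code from the prototype.

/*
 * Illustrative sketch of the console-ring style send path.  Barriers
 * follow the usual Xen ring convention: read the indexes before
 * touching data, make the data visible before publishing the index.
 */
#include <linux/kernel.h>
#include <linux/string.h>
#include <asm/barrier.h>

#define XENSOCK_OUT_MASK(intf, idx) ((idx) & (sizeof((intf)->out) - 1))

static size_t xensock_write(struct xensock_ring_intf *intf,
			    const char *buf, size_t len)
{
	XENSOCK_RING_IDX cons, prod;
	size_t avail, copied = 0;

	cons = intf->out_cons;
	prod = intf->out_prod;
	virt_mb(); /* read indexes before writing data */

	avail = sizeof(intf->out) - (prod - cons);
	if (len > avail)
		len = avail; /* caller retries the remainder later */

	while (copied < len) {
		/* Copy up to the end of the circular buffer, then wrap. */
		size_t off = XENSOCK_OUT_MASK(intf, prod);
		size_t chunk = min(len - copied, sizeof(intf->out) - off);

		memcpy(intf->out + off, buf + copied, chunk);
		copied += chunk;
		prod += chunk;
	}

	virt_wmb(); /* make the data visible before publishing the index */
	intf->out_prod = prod;
	/* The caller then notifies the backend via the socket's event channel. */
	return copied;
}

The receive path is symmetric, with the backend producing into "in" and the frontend consuming via in_cons/in_prod.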
I am currently investigating the usage of AVX2 to perform the data copy.


# Brainstorming

Are 64 pages per active socket a reasonable amount in the context of modern OS level networking? I believe that regular Linux TCP sockets allocate something in that order of magnitude.

If that's too much, I spent some time thinking about ways to reduce it. Some ideas follow.


We could split up send and receive into two different data structures. I am thinking of introducing a single ring for all active sockets, with variable size messages for sending data. Something like the following:

struct xensock_ring_entry {
	uint64_t sockid; /* identifies a socket */
	uint32_t len;    /* length of data to follow */
	uint8_t data[];  /* variable length data */
};

One ring would be dedicated to holding xensock_ring_entry structures, one after another in a classic circular fashion. Two indexes, out_cons and out_prod, would still be used the same way they are used in the console ring, but I would place them on a separate page for clarity:

struct xensock_ring_intf {
	XENSOCK_RING_IDX out_cons, out_prod;
};

The frontend, that is the producer, writes a new struct xensock_ring_entry to the ring, careful not to exceed the remaining free bytes available. Then it increments out_prod by the written amount. The backend, that is the consumer, reads the new struct xensock_ring_entry, reading as much data as specified by "len". Then it increments out_cons by the size of the struct xensock_ring_entry read.
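To make the scheme concrete, here is a rough sketch of the producer side, assuming the data area is a power-of-two sized byte array on its own pages and that entries are copied byte-wise with wrap-around. ring_copy_in, xensock_out_ring and XENSOCK_OUT_RING_SIZE are illustrative names and values, not part of the ABI.

/*
 * Rough sketch of the frontend (producer) side of the shared send ring
 * with variable length entries.  out_prod is only updated once the
 * whole entry (header + payload) is in place, so the backend never
 * sees a partially written entry.
 */
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <asm/barrier.h>

#define XENSOCK_OUT_RING_SIZE (1 << 16) /* example size, not from the ABI */

struct xensock_out_ring {
	struct xensock_ring_intf *intf; /* out_cons/out_prod, on a separate page */
	uint8_t *data;                  /* XENSOCK_OUT_RING_SIZE bytes of shared pages */
};

/* Copy len bytes into the circular data area starting at index prod. */
static void ring_copy_in(struct xensock_out_ring *ring, XENSOCK_RING_IDX prod,
			 const void *src, uint32_t len)
{
	uint32_t off = prod & (XENSOCK_OUT_RING_SIZE - 1);
	uint32_t chunk = min_t(uint32_t, len, XENSOCK_OUT_RING_SIZE - off);

	memcpy(ring->data + off, src, chunk);
	memcpy(ring->data, (const uint8_t *)src + chunk, len - chunk);
}

static int xensock_send_entry(struct xensock_out_ring *ring,
			      uint64_t sockid, const void *buf, uint32_t len)
{
	struct xensock_ring_entry hdr = {
		.sockid = sockid,
		.len    = len,
	};
	XENSOCK_RING_IDX cons = ring->intf->out_cons;
	XENSOCK_RING_IDX prod = ring->intf->out_prod;
	uint32_t needed = sizeof(hdr) + len;

	virt_mb(); /* read out_cons before checking for space */
	if (needed > XENSOCK_OUT_RING_SIZE - (prod - cons))
		return -ENOBUFS; /* not enough room, caller retries later */

	ring_copy_in(ring, prod, &hdr, sizeof(hdr));
	ring_copy_in(ring, prod + sizeof(hdr), buf, len);

	virt_wmb(); /* entry fully written before publishing it */
	ring->intf->out_prod = prod + needed;
	/* Notify the backend via the shared ring's event channel here. */
	return 0;
}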
I think this could work. Theoretically we could do the same thing for receive: a separate single ring shared by all active sockets. We could even reuse struct xensock_ring_entry.


However I have doubts that this model could work well for receive. When sending data, all sockets on the frontend side copy buffers onto this single ring. If there is no room, the frontend returns ENOBUFS. The backend picks up the data from the ring and calls sendmsg, which can also return ENOBUFS. In that case we don't increment out_cons, leaving the data on the ring. The backend will try again in the near future. Error messages would have to go on a separate data structure which I haven't finalized yet.

When receiving from a socket, the backend copies data to the ring as soon as data is available, perhaps before the frontend requests the data. Buffers are copied to the ring not necessarily in the order that the frontend might want to read them. Thus the frontend would have to copy them out of the common ring into private per-socket dynamic buffers just to free the ring as soon as possible and consume the next xensock_ring_entry. It doesn't look very advantageous in terms of memory consumption and performance.

Alternatively, the frontend would have to leave the data on the ring if the application hasn't asked for it yet. In that case the frontend could look ahead without incrementing the in_cons pointer. It would have to keep track of which entries have been consumed and which have not. Only when the ring is full would the frontend have no other choice but to copy the data out of the ring into temporary buffers. I am not sure how well this could work in practice.

As a compromise, we could use a single shared ring for sending data, and 1 ring per active socket to receive data. This would cut the per-socket memory consumption in half (maybe to a quarter, moving the indexes out of the shared data ring into a separate page) and might be an acceptable trade-off.

Any feedback or ideas?


Many thanks,

Stefano

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel