
Re: [Xen-devel] RFC: XenSock brainstorming



Although discussing the goals is fun, feedback on the design of the
protocol is particularly welcome.

On Thu, 23 Jun 2016, Stefano Stabellini wrote:
> Now that Xen 4.7 is out of the door, any more feedback on this?
> 
> On Mon, 6 Jun 2016, Stefano Stabellini wrote:
> > Hi all,
> > 
> > a couple of months ago I started working on a new PV protocol for
> > virtualizing syscalls. I named it XenSock, as its main purpose is to
> > allow the implementation of the POSIX socket API in a domain other than
> > the one of the caller. It allows connect, accept, recvmsg, sendmsg, etc
> > to be implemented directly in Dom0. In a way this is conceptually
> > similar to virtio-9pfs, but for sockets rather than filesystem APIs.
> > See this diagram as reference:
> > 
> > https://docs.google.com/presentation/d/1z4AICTY2ejAjZ-Ul15GTL3i_wcmhKQJA7tcXwhI3dys/edit?usp=sharing
> > 
> > The frontends and backends could live either in userspace or kernel
> > space, with different trade-offs. My current prototype is based on Linux
> > kernel drivers but it would be nice to have userspace drivers too.
> > Discussing where the drivers could be implemented is beyond the scope
> > of this email.
> > 
> > 
> > # Goals
> > 
> > The goal of the protocol is to provide networking capabilities to any
> > guest, with the following added benefits:
> > 
> > * guest networking should work out of the box with VPNs, wireless
> >   networks and any other complex network configurations in Dom0
> > 
> > * guest services should listen on ports bound directly to Dom0 IP
> >   addresses, fitting naturally into a Docker-based workflow, where guests
> >   are Docker containers
> > 
> > * Dom0 should have full visibility into guest behavior and should be
> >   able to perform inexpensive filtering and manipulation of guest calls
> > 
> > * XenSock should provide excellent performance. Unoptimized early code
> >   reaches 22 Gbit/sec TCP single stream and scales to 60 Gbit/sec with 3
> >   streams.
> > 
> > 
> > # Status
> > 
> > I would like to get feedback on the high-level architecture, the data
> > path and the ring formats.
> > 
> > Beware that the protocol and drivers are in their very early days. I don't
> > have all the information to write a design document yet. The ABI is
> > neither complete nor stable.
> > 
> > The code is not ready for xen-devel yet, but I would be happy to push a
> > git branch if somebody is interested in contributing to the project.
> > 
> > 
> > # Design and limitations
> > 
> > The frontend connects to the backend following the traditional
> > xenstore-based exchange of information.
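> > 
> > I haven't pinned down the exact xenstore nodes yet; the handshake would
> > follow the usual PV driver pattern, along these lines (the node names
> > below are tentative, not part of the proposed ABI):
> > 
> > /local/domain/<domid>/device/xensock/0/backend       = <backend path>
> > /local/domain/<domid>/device/xensock/0/ring-ref      = <command ring grant>
> > /local/domain/<domid>/device/xensock/0/event-channel = <command ring port>
> > /local/domain/<domid>/device/xensock/0/state         = <XenbusState>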
> > 
> > Frontend and backend set up an event channel and a shared ring. The
> > ring is used by the frontend to forward socket API calls to the
> > backend. I am referring to this ring as the command ring. This is an
> > example of the ring format:
> > 
> > #define XENSOCK_CONNECT        0
> > #define XENSOCK_RELEASE        3
> > #define XENSOCK_BIND           4
> > #define XENSOCK_LISTEN         5
> > #define XENSOCK_ACCEPT         6
> > #define XENSOCK_POLL           7
> > 
> > struct xen_xensock_request {
> >     uint32_t id;     /* private to guest, echoed in response */
> >     uint32_t cmd;    /* command to execute */
> >     uint64_t sockid; /* id of the socket */
> >     union {
> >             struct xen_xensock_connect {
> >                     uint8_t addr[28];
> >                     uint32_t len;
> >                     uint32_t flags;
> >                     grant_ref_t ref[XENSOCK_DATARING_PAGES];
> >                     uint32_t evtchn;
> >             } connect;
> >             struct xen_xensock_bind {
> >                     uint8_t addr[28]; /* ipv6 ready */
> >                     uint32_t len;
> >             } bind;
> >             struct xen_xensock_accept {
> >                     grant_ref_t ref[XENSOCK_DATARING_PAGES];
> >                     uint32_t evtchn;
> >                     uint64_t sockid;
> >             } accept;
> >     } u;
> > };
> > 
> > struct xen_xensock_response {
> >     uint32_t id;
> >     uint32_t cmd;
> >     uint64_t sockid;
> >     int32_t ret;
> > };
> > 
> > DEFINE_RING_TYPES(xen_xensock, struct xen_xensock_request,
> >               struct xen_xensock_response);
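> > 
> > For illustration, this is roughly how the frontend would queue a
> > connect request on the command ring. It assumes a Linux frontend using
> > the standard ring.h macros; the function name and the way grants and
> > the event channel are obtained are placeholders, not part of the ABI:
> > 
> > static int xensock_queue_connect(struct xen_xensock_front_ring *ring,
> >                                  uint32_t id, uint64_t sockid,
> >                                  const uint8_t *addr, uint32_t addrlen,
> >                                  grant_ref_t *refs, uint32_t evtchn)
> > {
> >     struct xen_xensock_request *req;
> >     int notify, i;
> > 
> >     if (RING_FULL(ring))
> >         return -EBUSY;  /* no free slot, caller retries later */
> >     if (addrlen > sizeof(req->u.connect.addr))
> >         return -EINVAL;
> > 
> >     req = RING_GET_REQUEST(ring, ring->req_prod_pvt);
> >     req->id = id;                   /* echoed back in the response */
> >     req->cmd = XENSOCK_CONNECT;
> >     req->sockid = sockid;
> >     memcpy(req->u.connect.addr, addr, addrlen);
> >     req->u.connect.len = addrlen;
> >     req->u.connect.flags = 0;
> >     for (i = 0; i < XENSOCK_DATARING_PAGES; i++)
> >         req->u.connect.ref[i] = refs[i];  /* grants for the new data ring */
> >     req->u.connect.evtchn = evtchn;       /* evtchn for the new data ring */
> > 
> >     ring->req_prod_pvt++;
> >     RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(ring, notify);
> >     return notify;  /* caller kicks the command ring evtchn if nonzero */
> > }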
> > 
> > 
> > Connect and accept lead to the creation of new active sockets. Today
> > each active socket has its own event channel and ring for sending and
> > receiving data. Data rings have the following format:
> > 
> > #define XENSOCK_DATARING_ORDER 2
> > #define XENSOCK_DATARING_PAGES (1 << XENSOCK_DATARING_ORDER)
> > #define XENSOCK_DATARING_SIZE (XENSOCK_DATARING_PAGES << PAGE_SHIFT)
> > 
> > typedef uint32_t XENSOCK_RING_IDX;
> > 
> > struct xensock_ring_intf {
> >     char in[XENSOCK_DATARING_SIZE/4];
> >     char out[XENSOCK_DATARING_SIZE/2];
> >     XENSOCK_RING_IDX in_cons, in_prod;
> >     XENSOCK_RING_IDX out_cons, out_prod;
> >     int32_t in_error, out_error;
> > };
> > 
> > The ring works like the Xen console ring (see
> > xen/include/public/io/console.h). Data is copied to/from the ring by
> > both frontend and backend. in_error, out_error are used to report
> > errors. This simple design works well, but it requires at least 1 page
> > per active socket. To get good performance (~20 Gbit/sec single stream),
> > we need buffers of at least 64K, so actually we are looking at about 64
> > pages per ring (order 6).
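> > 
> > For reference, the frontend send path would look much like the console
> > driver's write path. A minimal sketch (the helper name and MASK macro
> > are mine, the semantics are the console ring's, barriers as in Linux):
> > 
> > #define XENSOCK_OUT_SIZE       (XENSOCK_DATARING_SIZE / 2)
> > #define MASK_XENSOCK_OUT(idx)  ((idx) & (XENSOCK_OUT_SIZE - 1))
> > 
> > /* Copy up to len bytes from buf into the out buffer, return bytes copied. */
> > static uint32_t xensock_write(struct xensock_ring_intf *intf,
> >                               const char *buf, uint32_t len)
> > {
> >     XENSOCK_RING_IDX cons = intf->out_cons;
> >     XENSOCK_RING_IDX prod = intf->out_prod;
> >     uint32_t sent = 0;
> > 
> >     mb();   /* read the indexes before touching the buffer */
> > 
> >     while (sent < len && (prod - cons) < XENSOCK_OUT_SIZE)
> >         intf->out[MASK_XENSOCK_OUT(prod++)] = buf[sent++];
> > 
> >     wmb();  /* make the data visible before publishing the new index */
> >     intf->out_prod = prod;
> > 
> >     return sent;  /* caller notifies the backend via the event channel */
> > }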
> > 
> > I am currently investigating the use of AVX2 to perform the data copy.
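> > 
> > The idea is simply to replace the byte-wise copy above with 32-byte
> > wide loads and stores. A userspace-flavoured sketch using intrinsics
> > (a kernel driver would also need kernel_fpu_begin/end and tail
> > handling for lengths that are not a multiple of 32):
> > 
> > #include <immintrin.h>
> > #include <stddef.h>
> > #include <stdint.h>
> > 
> > /* Copy len bytes, len a multiple of 32; pointers need not be aligned.
> >  * Build with -mavx2. */
> > static void avx2_copy(void *dst, const void *src, size_t len)
> > {
> >     uint8_t *d = dst;
> >     const uint8_t *s = src;
> >     size_t i;
> > 
> >     for (i = 0; i < len; i += 32) {
> >         __m256i v = _mm256_loadu_si256((const __m256i *)(s + i));
> >         _mm256_storeu_si256((__m256i *)(d + i), v);
> >     }
> > }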
> > 
> > 
> > # Brainstorming
> > 
> > Are 64 pages per active socket a reasonable amount in the context of
> > modern OS-level networking? I believe that regular Linux TCP sockets
> > allocate buffers of roughly that order of magnitude.
> > 
> > If that's too much, I spent some time thinking about ways to reduce it.
> > Some ideas follow.
> > 
> > 
> > We could split up send and receive into two different data structures. I
> > am thinking of introducing a single ring for all active sockets with
> > variable-size messages for sending data. Something like the following:
> > 
> > struct xensock_ring_entry {
> >     uint64_t sockid; /* identifies a socket */
> >     uint32_t len;    /* length of data to follow */
> >     uint8_t data[];  /* variable length data */
> > };
> >  
> > One ring would be dedicated to holding xensock_ring_entry structures,
> > one after another in a classic circular fashion. Two indexes, out_cons
> > and out_prod, would still be used the same way they are used in the
> > console ring, but I would place them on a separate page for clarity:
> > 
> > struct xensock_ring_intf {
> >     XENSOCK_RING_IDX out_cons, out_prod;
> > };
> > 
> > The frontend, which is the producer, writes a new struct
> > xensock_ring_entry to the ring, taking care not to exceed the
> > remaining free space. It then increments out_prod by the number of
> > bytes written. The backend, which is the consumer, reads the new
> > struct xensock_ring_entry, consuming as much data as "len" specifies.
> > It then increments out_cons by the total size of the entry it read.
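> > 
> > A Linux-flavoured sketch of the frontend side (the ring size, the MASK
> > macro and the separate send_ring data area are illustrative, not a
> > final ABI):
> > 
> > #define XENSOCK_SEND_RING_SIZE  (1 << 16)  /* example: 64K, power of two */
> > #define MASK_SEND(idx)          ((idx) & (XENSOCK_SEND_RING_SIZE - 1))
> > 
> > /* send_ring points at the shared data area, which lives on separate
> >  * pages from the struct xensock_ring_intf holding the indexes */
> > static int xensock_send(struct xensock_ring_intf *intf, uint8_t *send_ring,
> >                         uint64_t sockid, const void *buf, uint32_t len)
> > {
> >     struct xensock_ring_entry ent = { .sockid = sockid, .len = len };
> >     XENSOCK_RING_IDX prod = intf->out_prod;
> >     uint32_t total = sizeof(ent) + len;
> >     const uint8_t *p;
> >     uint32_t i;
> > 
> >     if (XENSOCK_SEND_RING_SIZE - (prod - intf->out_cons) < total)
> >         return -ENOBUFS;  /* not enough room for the whole entry */
> > 
> >     /* header first, then payload; byte-wise to keep wrap handling trivial */
> >     for (p = (const uint8_t *)&ent, i = 0; i < sizeof(ent); i++)
> >         send_ring[MASK_SEND(prod++)] = p[i];
> >     for (p = buf, i = 0; i < len; i++)
> >         send_ring[MASK_SEND(prod++)] = p[i];
> > 
> >     wmb();  /* entry fully written before publishing the new index */
> >     intf->out_prod = prod;
> >     return 0;  /* caller notifies the backend */
> > }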
> > 
> > I think this could work. Theoretically we could do the same thing for
> > receive: a separate single ring shared by all active sockets. We could
> > even reuse struct xensock_ring_entry.
> > 
> > 
> > However I have doubts that this model could work well for receive. When
> > sending data, all sockets on the frontend side copy buffers onto this
> > single ring. If there is no room, the frontend returns ENOBUFS. The
> > backend picks up the data from the ring and calls sendmsg, which can
> > also return ENOBUFS. In that case we don't increment out_cons, leaving
> > the data on the ring. The backend will try again in the near future.
> > Error messages would have to go on a separate data structure which I
> > haven't finalized yet.
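> > 
> > In rough pseudo-C, the backend side would be something like the
> > following; read_entry_header and forward_to_socket are placeholders
> > for the copy-out of the header and the actual sendmsg call, both
> > handling index wrap-around internally:
> > 
> > static void xensock_backend_drain(struct xensock_ring_intf *intf,
> >                                   uint8_t *send_ring)
> > {
> >     XENSOCK_RING_IDX cons = intf->out_cons;
> >     XENSOCK_RING_IDX prod = intf->out_prod;
> >     struct xensock_ring_entry ent;
> >     int ret;
> > 
> >     rmb();  /* read the producer index before reading any entry */
> > 
> >     while (cons != prod) {
> >         read_entry_header(send_ring, cons, &ent);
> >         ret = forward_to_socket(ent.sockid, send_ring,
> >                                 cons + sizeof(ent), ent.len);
> >         if (ret == -ENOBUFS)
> >             break;  /* leave the entry on the ring and retry later */
> >         cons += sizeof(ent) + ent.len;
> >     }
> > 
> >     mb();   /* finish reading the data before freeing the space */
> >     intf->out_cons = cons;
> > }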
> > 
> > When receiving from a socket, the backend copies data to the ring as
> > soon as data is available, perhaps before the frontend requests the
> > data. Buffers are copied to the ring not necessarily in the order that
> > the frontend might want to read them. Thus the frontend would have to
> > copy them out of the common ring into private per-socket dynamic buffers
> > just to free the ring as soon as possible and consume the next
> > xensock_ring_entry. It doesn't look very advantageous in terms of memory
> > consumption and performance.
> > 
> > Alternatively, the frontend would have to leave the data on the ring if
> > the application didn't ask for it yet. In that case the frontend could
> > look ahead without incrementing the in_cons pointer. It would have to
> > keep track of which entries have been consumed and which entries have
> > not been consumed. Only when the ring is full would the frontend have
> > no choice but to copy the data out of the ring into temporary buffers.
> > I am not sure how well this could work in practice.
> > 
> > As a compromise, we could use a single shared ring for sending data, and
> > 1 ring per active socket to receive data. This would cut the per-socket
> > memory consumption in half (maybe to a quarter, moving out the indexes
> > from the shared data ring into a separate page) and might be an
> > acceptable trade-off.
> > 
> > Any feedback or ideas?
> > 
> > 
> > Many thanks,
> > 
> > Stefano
> > 
> 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel

 

