Re: [Xen-devel] Re: Interdomain comms
I quite like some of this.  A few more comments below.

On Sat, 2005-05-07 at 19:57 -0500, Eric Van Hensbergen wrote:
>
> In our world, this would result in you holding a Fid pointing to the
> open object.  The Fid is a pointer to meta-data and is considered
> state on both the FE and the BE.  (this has downsides in terms of
> reliability and the ability to recover sessions or fail over to
> different BE's -- one of our summer students will be addressing the
> reliability problem this summer).

OK, so this is an area of concern for me.  I used the last version of
the sketchy API I outlined to create an HA cluster infrastructure, so I
had to solve these kinds of protocol issues and, whilst it was actually
pretty easy starting from scratch, retrofitting a solution to an
existing protocol might be challenging, even for a summer student.

> The FE performs a read operation passing it the necessary bits:
>
> ret = read( fd, *buf, count );

Here the API is coupling the client to the memory management
implementation by assuming that the buffer is mapped into the client's
virtual address space.  This is likely to be true most of the time, so
an API at this level will be useful, but I'd also like to be able to
write I/O applications that manage the data in buffers that are never
mapped into the application address space.  Also, I'd like to be able
to write applications that have clients which use different types of
buffers without having to code for each case in my application.

This is why my API deals in terms of local_buffer_references which, for
the sake of argument, might look like this:

struct local_buffer_reference
{
    local_buffer_reference_type type;
    local_buffer_reference_base base;
    buffer_reference_offset     offset;
    buffer_reference_length     length;
};

A local buffer reference of type virtual_address would have a base
value equal to the buf pointer above, an offset of zero and a length of
count.  A local buffer reference for the hidden buffer pages would have
a different type, say hidden_buffer_page; the base would be a pointer
to a vector of page indices, the offset would be the offset of the
start of the buffer into the first page and the length would be the
length of the buffer.

So, my application can deal with buffers described like that without
having to worry about the flavour of memory management backing them.
Also, I can change the memory management without changing all the calls
to the API; I only have to change where I get buffers from.

BTW, I learnt this specific abstraction from an embedded OS architected
by Nik Shalor.  He might have got it from somewhere else.
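To make that concrete, here is a rough sketch in C of how the two
flavours described above might be constructed.  The message only gives
the shape of the struct, so the concrete type definitions, the enum
values and the helper names below are my own placeholders rather than
the real ones:

#include <stddef.h>
#include <stdint.h>

typedef enum {
    virtual_address,    /* buffer is mapped into the caller's VA space */
    hidden_buffer_page  /* buffer lives in pages the caller never maps */
} local_buffer_reference_type;

typedef void  *local_buffer_reference_base;
typedef size_t buffer_reference_offset;
typedef size_t buffer_reference_length;

struct local_buffer_reference
{
    local_buffer_reference_type type;
    local_buffer_reference_base base;
    buffer_reference_offset     offset;
    buffer_reference_length     length;
};

/* Ordinary mapped buffer: base is the buf pointer, offset is zero. */
static struct local_buffer_reference
make_va_reference(void *buf, size_t count)
{
    struct local_buffer_reference ref;
    ref.type   = virtual_address;
    ref.base   = buf;
    ref.offset = 0;
    ref.length = count;
    return ref;
}

/* Hidden buffer pages: base points at a vector of page indices and
 * offset is the offset of the start of the data into the first page. */
static struct local_buffer_reference
make_hidden_page_reference(uint32_t *page_indices,
                           size_t    offset_in_first_page,
                           size_t    count)
{
    struct local_buffer_reference ref;
    ref.type   = hidden_buffer_page;
    ref.base   = page_indices;
    ref.offset = offset_in_first_page;
    ref.length = count;
    return ref;
}

Either flavour can then be handed to the same API calls; only the code
that allocates the buffers needs to know which flavour it is dealing
with.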
> This actually would be translated (in collaboration with local
> meta-data) into a t_read message:
>
> t_read tag fid offset count (where offset is determined by local fid
> metadata)
>
> The BE receives the read request, and based on state information kept
> in the Fid (basically your metadata), it finds the file contents in
> the buffer cache.  It sends a response packet with a pointer to its
> local buffer cache entry:
>
> r_read tag count *data
>
> There are a couple ways we could go when the FE receives the response:
>
> a) it could memcopy the data to the user buffer *buf.  This is the
> way things currently work, and isn't very efficient -- but may be
> the way to go for the ultra-paranoid who don't like sharing memory
> references between partitions.
>
> b) We could have registered the memory pointed to by *buf and passed
> that reference along the path -- but then it probably would just
> amount to the BE doing the copy rather than the front end.  Perhaps
> this approximates what you were talking about doing?

No, 'c' is closer to what I was sketching out, except that I was
proposing a general mechanism that had one code path in the clients
even though the underlying implementation could be a, b or c, or any
other memory management strategy determined at run time according to
the type of the local buffer references involved.

> c) As long as the buffers in question (both *buf and the buffer cache
> entry) were page-aligned, etc. -- we could play clever VM games
> marking the page as shared RO between the two partitions and alias the
> virtual memory pointed to by *buf to the shared page.  This is very
> sketchy and high level and I need to delve into all sorts of details
> -- but the idea would be to use virtual memory as your friend for
> these sort of shared read-only buffer caches.  It would also require
> careful allocation of buffers of the right size on the right alignment
> -- but driver writers are used to that sort of thing.

Yes, it also requires buffers of the right size and alignment to be
used at the receiving end of any network transfers, and for the
alignment to be preserved across the network even if the transfer
starts at a non-zero page offset.  You might think that once the data
goes over the network you don't care, but it might be received by an
application that wants to share pages with another application so, in
fact, it does matter.  This is just something you have to get right if
you want any kind of page referencing technique to work, although you
can fall back to memcopy for misaligned data if necessary.
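For what it's worth, here is a rough sketch of the kind of run-time
dispatch I mean, using the type definitions from the earlier sketch.
This is not real code: share_pages_read_only() and
copy_into_hidden_pages() are hypothetical infrastructure helpers, not
existing Xen or 9P calls, and the alignment checks are elided.

#include <string.h>

/* Hypothetical helpers provided by the infrastructure. */
extern int share_pages_read_only(void *dst_va, const void *src_data,
                                 size_t len);
extern int copy_into_hidden_pages(void *page_index_vector, size_t offset,
                                  const void *src_data, size_t len);

/* One client-visible code path; the strategy (a, b, c or something
 * else) is chosen here according to the buffer reference type. */
int transfer_to_local_buffer(struct local_buffer_reference *dst,
                             const void *src_data, size_t src_len,
                             int src_is_page_aligned)
{
    if (src_len > dst->length)
        return -1;                     /* destination too small */

    switch (dst->type) {
    case virtual_address:
        if (src_is_page_aligned && dst->offset == 0) {
            /* Option c: alias the destination onto the shared,
             * read-only source pages instead of copying. */
            return share_pages_read_only(dst->base, src_data, src_len);
        }
        /* Options a/b: fall back to a plain copy for misaligned data. */
        memcpy((char *)dst->base + dst->offset, src_data, src_len);
        return 0;

    case hidden_buffer_page:
        /* The client never maps these pages; the infrastructure copies
         * (or remaps) into the pages named by the index vector. */
        return copy_into_hidden_pages(dst->base, dst->offset,
                                      src_data, src_len);
    }

    return -1;                         /* unknown buffer flavour */
}

The clients only ever see the one call (or, more likely, never see even
that because it is buried in the infrastructure), so the memory
management strategy can change without touching them.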
> The above looks complicated, but to a FE writer would be as simple as:
>
> channel = dial("net!BE"); /* establish connection */
> /* in my current code, channel is passed as an argument to the FE as a
> boot arg */
> root = fsmount(channel, NULL); /* this does the t_version, auth, & attach */
> fd = open(root, "/some/path/file", OREAD);
> ret = read(fd, *buf, sizeof(buf));
> close(fd);
> close(root);
> close(channel);

So, this is obviously a blocking API.  My API was non-blocking because
the network latency means that you need a lot of concurrency for high
throughput and you don't necessarily want so many threads.  Like AIO.
(There's a rough sketch of the kind of thing I mean at the end of this
message.)  Having a blocking API as well is convenient though.

> If you want to get fancy, you could get rid of the root arg to open
> and use a private name space (after fsmount):
>
> bind(root, "/mnt/be", MREPL); /* bind the back end to a well known place */
>
> then it would be:
>
> fd = open("/mnt/be/some/path/file", OREAD);
>
> There's also all sorts of cool stuff you can do on the domain
> controller to provision child partitions using dynamic name space and
> then just exporting the custom fashioned environment using 9P -- but
> that's higher level organization stuff again.  There's all sorts of
> cool tricks you can play with 9P (similar to the stuff that the FUSE
> and FiST user-space file system packages provide) like copy-on-write
> file systems, COW block devices, muxed ttys, etc. etc.

I'm definitely going to study the organisational aspects of 9P.

> I've described it in terms of a file system, using your example as a
> basis, but the same sort of thing would be true for a block device or
> a network connection (with some slightly different semantic rules on
> the network connection).  The main point is to keep things simple for
> the FE and BE writers, and deal with all the accounting and magic you
> describe within the infrastructure (no small task).

If you use the buffer abstraction I described then you can start with a
very simple mm implementation (even just the memcopy path in the sketch
above) and improve it without having to change the clients.

> Another difference would involve what would happen if you did have to
> bridge a cluster network - the 9P network encapsulation is well
> defined, all you would need to do (at the I/O partition bridging the
> network) is marshall the data according to the existing protocol spec.
> For more intelligent networks using RDMA and such things, you could
> keep the scatter/gather style semantics and send pointers into the
> RDMA space for buffer references.

I don't see this as a difference: my API was explicitly compatible with
a networked implementation.

One of the thoughts that did occur to me was that a reliance on
in-order message delivery (which 9P has) turns out to be quite painful
to satisfy when combined with multi-pathing and failover, because a
link failure leaves the remaining links full of messages that can't be
delivered until the lost messages (which happened to include the
message that must be delivered first) have been resent.  This is more
of a problem than you might think, because all communications stall for
the time taken to _detect_ the lost link, and all the remaining links
are full, so you can't do the recovery without cleaning up another
link.  This is not insurmountable, but it's not obviously the best
solution either, particularly since, with SMP machines, you aren't
going to want to serialise the concurrent activity, so there needn't be
any fundamental in-order requirements at the protocol level.

> As I said before, there's lots of similarities in what we are talking
> about, I'm just gluing a slightly more abstract interface on top,
> which has some benefits in some additional organizational and security
> mechanisms (and a well-established (but not widely used yet) network
> protocol encapsulation).

Maybe something like my API underneath and something like the
organisational stuff of 9P on top would be good.  I'd have to see how
the 9P organisational stuff stacks up against the publish and subscribe
mechanisms for resource discovery that I'm used to.  Also, the
infrastructure requirements for building fault-tolerant systems are
quite demanding and I'd have to be confident that 9P was up to the task
before I'd be happy with it personally.

> There are plenty of details I know I'm glossing over, and I'm sure
> I'll need lots of help getting things right.  I'd have preferred
> staying quiet until I had my act together a little more, but Orran and
> Ron convinced me that it was important to let people know the
> direction I'm planning on exploring.

Yes, definitely worthwhile.  I'd like to see more discussion like this
on the xen-devel list.  On the one hand, it's kind of embarrassing to
discuss vaporware and half-finished ideas but, on the other, the
opportunity for public comment at an early stage in the process is
probably going to save a lot of effort in the long run.
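As promised above, here is a very rough sketch of the kind of
non-blocking interface I have in mind.  The names (io_request,
submit_read, the completion callback and the opaque channel and object
handles) are made up for this message and are not my actual API:

struct channel;          /* opaque: the FE/BE connection */
struct object_handle;    /* opaque: the thing being read */
struct io_request;

typedef void (*io_completion_fn)(struct io_request *req, int status,
                                 size_t bytes_transferred, void *context);

struct io_request
{
    struct local_buffer_reference buffer;   /* where the data lands     */
    unsigned long                 offset;   /* offset within the object */
    io_completion_fn              done;     /* called on completion     */
    void                         *context;  /* caller's cookie          */
};

/* Submit and return immediately; many requests can be outstanding at
 * once, which is what hides the network latency. */
int submit_read(struct channel *ch, struct object_handle *obj,
                struct io_request *req);

A blocking read is then just submit_read() plus a wait for the
completion callback, so providing the convenient blocking API on top is
cheap.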
--
Harry Butterworth <harry@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel