Re: [Xen-devel] Re: Interdomain comms
On 5/7/05, Harry Butterworth <harry@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> If you could go through with 9p the same concrete example I did I think
> I'd find that useful.  Also, I should probably spend another 10 mins on
> the docs ;-)

It would be way better if we could step you through a demo in the environment. The documentation for Plan 9 has always been spartan at best -- so getting a good idea of how things work without looking over the shoulder of someone experienced has always been somewhat difficult. I've been trying to correct this while pushing some of the bits into Linux -- my Freenix v9fs paper spent quite a bit of time talking about how 9P actually works and showing traces for typical file system operations. The OLS paper which I'm supposed to be writing right now covers the same sort of thing for dynamic private name spaces.

The reason I didn't post a counter-example is that I didn't see much difference between our two ideas for the scenario you lay out (though you obviously understand a lot more about some of the details that I haven't gotten into yet). From the looks of things, you had already established an authenticated connection between the FE and BE, the top-level session was already in place, and whatever file system you are using to read file data had already traversed to the file and opened it.

A quick summary of how we do the above with 9P (a rough code sketch of the sequence follows the list):

a) Establish the connection to the BE. There are various semantics possible here; in my current stuff I pass a reference to the Channel around. Alternatively you could use socket-like semantics to connect to the other partition.

b) Issue a protocol negotiation packet to establish buffer sizes, protocol version, etc. (t_version).

c) Attach to the remote resource, providing authentication information if necessary (t_attach). This also creates an initial piece of shared meta-data referencing the root resource of the BE. For devices there may only be a single resource, or one resource such as a block device may have multiple nodes such as partitions; in Plan 9, devices also present different aspects of their interface (ctl, data, stat, etc.) as different nodes in a hierarchical fashion. The reference to a node in the tree is called a FID (think of it as a file descriptor) -- it contains information about who has attached to the resource and where in the hierarchy they are. A key thing to remember is that in Plan 9, every device is a file server.

d) You would then traverse the object tree to the resource you wanted to use (in your case it sounded like a file in a file system, so the metaphor is straightforward). The client issues a t_walk message to perform this traversal.

e) The object would then need to be opened (t_open) with information about what type of operation will be executed (read, write, both), and this can include additional information about the type of transactions (append only, exclusive access, etc.) that may be beneficial to managing the underlying resource. The BE could use information cached in the Fid from the attach to check the FE's permission to access this resource with that mode.

In our world, this would result in you holding a Fid pointing to the open object. The Fid is a pointer to meta-data and is considered state on both the FE and the BE. (This has downsides in terms of reliability and the ability to recover sessions or fail over to different BEs -- one of our summer students will be addressing the reliability problem this summer.)
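To make steps (a)-(e) a bit more concrete, here is a rough, runnable sketch of the sequence from the FE side. The p9_* helpers are hypothetical stand-ins that only trace the T-message each step would issue; their names, arguments, the 8192-byte msize and the "9P2000" version string are assumptions for illustration, not the actual FE/BE code being discussed.

    /* Illustrative only: hypothetical helpers that trace the T-message each
     * step would issue.  Nothing here is real FE/BE code. */
    #include <stdio.h>

    typedef unsigned fid_t;      /* a fid names a node in the BE's exported tree */

    /* step (b): negotiate buffer size and protocol version */
    static void p9_version(unsigned msize, const char *ver)
    { printf("-> t_version msize=%u version=%s\n", msize, ver); }

    /* step (c): attach to the BE's root resource, establishing the root fid */
    static void p9_attach(fid_t fid, const char *uname, const char *aname)
    { printf("-> t_attach fid=%u uname=%s aname=%s\n", fid, uname, aname); }

    /* step (d): walk from the root fid to the object of interest */
    static void p9_walk(fid_t fid, fid_t newfid, const char *path)
    { printf("-> t_walk fid=%u newfid=%u path=%s\n", fid, newfid, path); }

    /* step (e): open the object, declaring how it will be used */
    static void p9_open(fid_t fid, const char *mode)
    { printf("-> t_open fid=%u mode=%s\n", fid, mode); }

    int main(void)
    {
        /* step (a), establishing the channel itself, is transport specific
         * (a passed channel reference, socket-like connect, ...) and is
         * left out here. */
        p9_version(8192, "9P2000");
        p9_attach(0, "fe", "");
        p9_walk(0, 1, "some/path/file");
        p9_open(1, "OREAD");
        return 0;
    }

At the end of that trace the FE holds fid 1 open for reading, which is the state the next section starts from.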
The FE performs a read operation, passing it the necessary bits:

    ret = read( fd, *buf, count );

This actually would be translated (in collaboration with local meta-data) into a t_read message:

    t_read tag fid offset count

(where offset is determined by local fid metadata). The BE receives the read request and, based on state information kept in the Fid (basically your metadata), it finds the file contents in the buffer cache. It sends a response packet with a pointer to its local buffer cache entry:

    r_read tag count *data

There are a couple of ways we could go when the FE receives the response:

a) It could memcopy the data to the user buffer *buf. This is the way things currently work, and it isn't very efficient -- but it may be the way to go for the ultra-paranoid who don't like sharing memory references between partitions.

b) We could have registered the memory pointed to by *buf and passed that reference along the path -- but then it probably would just amount to the BE doing the copy rather than the front end. Perhaps this approximates what you were talking about doing?

c) As long as the buffers in question (both *buf and the buffer cache entry) were page-aligned, etc., we could play clever VM games, marking the page as shared RO between the two partitions and aliasing the virtual memory pointed to by *buf to the shared page. This is very sketchy and high level and I need to delve into all sorts of details -- but the idea would be to use virtual memory as your friend for this sort of shared read-only buffer cache. It would also require careful allocation of buffers of the right size on the right alignment -- but driver writers are used to that sort of thing. To do this sort of thing, we'd need to do the exact same sort of accounting you describe:

> The implementation of local_buffer_reference_copy for that specific
> combination of buffer types maps the BE pages into the FE address space
> incrementing their reference counts and also unmaps the old FE pages and
> decrements their reference counts, returning them to the free pool if
> necessary.

When the FE was done with the BE, it would close the resources (issuing t_clunk on any fids associated with the BE).

The above looks complicated, but to a FE writer it would be as simple as:

    channel = dial("net!BE");       /* establish connection */
    /* in my current code, channel is passed as an argument to the FE as a boot arg */
    root = fsmount(channel, NULL);  /* this does the t_version, auth, & attach */
    fd = open(root, "/some/path/file", OREAD);
    ret = read(fd, *buf, sizeof(buf));
    close(fd);
    close(root);
    close(channel);

If you want to get fancy, you could get rid of the root arg to open and use a private name space (after fsmount):

    bind(root, "/mnt/be", MREPL);   /* bind the back end to a well known place */

then it would be:

    fd = open("/mnt/be/some/path/file", OREAD);

There's also all sorts of cool stuff you can do on the domain controller to provision child partitions using dynamic name space and then just export the custom-fashioned environment using 9P -- but that's higher-level organization stuff again. There are all sorts of cool tricks you can play with 9P (similar to the stuff that the FUSE and FiST user-space file system packages provide) like copy-on-write file systems, COW block devices, muxed ttys, etc. The reality is that I'm not sure I'd actually want to use a BE to implement a file system, but it's quite reasonable to implement a buffer cache that way.
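Stepping back to the read path for a moment, here is a minimal, self-contained sketch of option (a), where the r_read payload is simply memcopied into the caller's buffer. The fidstate struct, the be_transact_read() stub and its static "buffer cache" are invented for illustration -- they stand in for the local fid metadata and the real FE/BE transport, nothing more.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct fidstate {            /* the FE's local metadata for one open fid */
        uint32_t fid;
        uint64_t offset;         /* supplies the offset field of t_read */
    };

    struct r_read {              /* shape of the reply: r_read tag count *data */
        uint32_t count;
        const uint8_t *data;     /* points at the BE's buffer cache entry */
    };

    /* Stand-in for the BE: serve t_read out of a static "buffer cache". */
    static const uint8_t buffer_cache[] = "contents of /some/path/file\n";

    static int be_transact_read(uint32_t fid, uint64_t offset, uint32_t count,
                                struct r_read *reply)
    {
        (void)fid;               /* a real BE would look up its state by fid */
        if (offset >= sizeof(buffer_cache))
            return -1;
        if (offset + count > sizeof(buffer_cache))
            count = (uint32_t)(sizeof(buffer_cache) - offset);
        reply->count = count;
        reply->data  = buffer_cache + offset;
        return 0;
    }

    /* read()-shaped wrapper: translate into t_read using the fid metadata,
     * then take option (a) and copy the reply into the user buffer. */
    static long fe_read(struct fidstate *f, void *buf, uint32_t count)
    {
        struct r_read reply;

        if (be_transact_read(f->fid, f->offset, count, &reply) < 0)
            return -1;
        memcpy(buf, reply.data, reply.count);    /* the memcopy of option (a) */
        f->offset += reply.count;                /* advance the cursor in the fid */
        return (long)reply.count;
    }

    int main(void)
    {
        struct fidstate f = { 1, 0 };
        char buf[64];
        long n = fe_read(&f, buf, sizeof(buf) - 1);

        if (n >= 0) {
            buf[n] = '\0';
            printf("read %ld bytes: %s", n, buf);
        }
        return 0;
    }

The wrapper doesn't care what is behind the fid -- a file, a device node, or a BE-side buffer cache -- which is the point the next paragraph picks up.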
In all likelihood this would result in the FE opening up a connection (and a single object) on the BE buffer cache, then using different offsets to grab specific blocks from the BE buffer cache using the t_read operation. I've described it in terms of a file system, using your example as a basis, but the same sort of thing would be true for a block device or a network connection (with some slightly different semantic rules on the network connection). The main point is to keep things simple for the FE and BE writers, and deal with all the accounting and magic you describe within the infrastructure (no small task).

Another difference involves what would happen if you did have to bridge a cluster network: the 9P network encapsulation is well defined, so all you would need to do (at the I/O partition bridging the network) is marshal the data according to the existing protocol spec (a minimal sketch of that framing is appended at the end of this note). For more intelligent networks using RDMA and such things, you could keep the scatter/gather style semantics and send pointers into the RDMA space for buffer references.

As I said before, there's a lot of similarity in what we are talking about; I'm just gluing a slightly more abstract interface on top, which has some benefits in terms of additional organizational and security mechanisms (and a well-established, though not yet widely used, network protocol encapsulation). There are plenty of details I know I'm glossing over, and I'm sure I'll need lots of help getting things right. I'd have preferred staying quiet until I had my act together a little more, but Orran and Ron convinced me that it was important to let people know the direction I'm planning on exploring.

        -eric
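Appended sketch of the framing mentioned above: what "marshal according to the existing protocol spec" amounts to for t_read. The field layout (size[4] type[1] tag[2] fid[4] offset[8] count[4], little-endian) follows the published 9P2000 encapsulation; the helper functions and the main() driver are purely illustrative and not part of any bridging code that exists today.

    #include <stdint.h>
    #include <stdio.h>

    #define TREAD 116                        /* t_read type code in 9P2000 */

    static void put16(uint8_t *p, uint16_t v) { p[0] = v; p[1] = v >> 8; }
    static void put32(uint8_t *p, uint32_t v)
    { p[0] = v; p[1] = v >> 8; p[2] = v >> 16; p[3] = v >> 24; }
    static void put64(uint8_t *p, uint64_t v)
    { put32(p, (uint32_t)v); put32(p + 4, (uint32_t)(v >> 32)); }

    /* marshal "t_read tag fid offset count" into buf; returns message length */
    static size_t pack_tread(uint8_t *buf, uint16_t tag, uint32_t fid,
                             uint64_t offset, uint32_t count)
    {
        const size_t len = 4 + 1 + 2 + 4 + 8 + 4;  /* fixed-size t_read */

        put32(buf, (uint32_t)len);           /* size[4], includes itself */
        buf[4] = TREAD;                      /* type[1] */
        put16(buf + 5, tag);                 /* tag[2] */
        put32(buf + 7, fid);                 /* fid[4] */
        put64(buf + 11, offset);             /* offset[8] */
        put32(buf + 19, count);              /* count[4] */
        return len;
    }

    int main(void)
    {
        uint8_t buf[32];
        size_t n = pack_tread(buf, 1, 1, 0, 4096);
        printf("t_read frame is %zu bytes\n", n);   /* 23 bytes */
        return 0;
    }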