Re: [Xen-devel] Re: Interdomain comms
I like it. To start with, local communication only would be fine. Eventually it would scale neatly to things like remote device access.

I particularly like the abstraction for remote memory - this would be an excellent fit to take advantage of RDMA where available (e.g. a cluster running on an IB fabric).

Cheers,
Mark

On Friday 06 May 2005 13:14, Harry Butterworth wrote:
> On Fri, 2005-05-06 at 08:46 +0100, Mike Wray wrote:
> > Harry Butterworth wrote:
> > > The overhead in terms of client code to establish an entity on the xen inter-domain communication "bus" is currently of the order of 1000 statements (counting FE, BE and slice of xend). A better inter-domain communication API could reduce this to fewer than 10 statements. If it's not done by the time I finish the USB work, I will hopefully be allowed to help with this.
> >
> > This reminded me you had suggested a different model for inter-domain comms. I recently suggested a more socket-like API but it didn't go down well.
>
> What exactly were the issues with the socket-like proposal?
>
> > I agree with you that the event channel model could be improved - what kind of comms model do you suggest?
>
> The event-channel and shared memory page are fine as low-level primitives to implement a comms channel between domains on the same physical machine. The problem is that the primitives are unnecessarily low-level from the client's perspective and result in too much per-client code.
>
> The inter-domain communication API should preserve the efficiency of these primitives but provide a higher-level API which is more convenient to use.
>
> Another issue with the current API is that, in the future, it is likely (for a number of virtual-iron/fault-tolerant-virtual-machine-like reasons) that it will be useful for the inter-domain communication API to span physical nodes in a cluster. The problem with the current API is that it directly couples the clients to a shared-memory implementation with a direct connection between the front-end and back-end domains, so the clients would all need to be rewritten if the implementation were to span physical machines or require indirection. Eventually I would expect the effort invested in the clients of the inter-domain API to equal or exceed the effort invested in the hypervisor, in the same way that the linux device drivers make up the bulk of the linux kernel code. There is a risk, therefore, that this might become a significant architectural limitation.
>
> So, I think we're looking for a higher-level API which can preserve the current efficient implementation for domains resident on the same physical machine but allows for domains to be separated by a network interface without having to rewrite all the drivers.
>
> The API needs to address the following issues:
>
> Resource discovery --- Discovering the targets of IDC is an inherent requirement.
>
> Dynamic behaviour --- Domains are going to come and go all the time.
>
> Stale communications --- When domains come and go, client protocols must have a way to recover from communications in flight, or potentially in flight, from before the last transition.
>
> Deadlock --- IDC is a shared resource and must not introduce resource deadlock issues, for example when FEs and BEs are arranged symmetrically in reverse across the same interface, or when BEs are stacked and so introduce chains of dependencies.
>
> Security --- There are varying degrees of trust between the domains.
>
> Ease of use --- This is important for developer productivity and also to help ensure the other goals (security/robustness) are actually met.
>
> Efficiency/Performance --- obviously.
>
> I'd need a few days (which I don't have right now) to put together a coherent proposal tailored specifically to xen. However, it would probably be along the lines of the following:
>
> A buffer abstraction to decouple the IDC API from the memory management implementation:
>
> struct local_buffer_reference;
>
> An endpoint abstraction to represent one end of an IDC connection. It's important that this is done on a per-connection basis rather than having one per domain for all IDC activity, because it avoids deadlock issues arising from chained, dependent communication.
>
> struct idc_endpoint;
>
> A message abstraction, because some protocols are more efficiently implemented using one-way messages than request-response pairs, particularly when the protocol involves more than two parties.
>
> struct idc_message
> {
>     ...
>     struct local_buffer_reference message_body;
> };
>
> /* When a received message is finished with */
>
> void idc_message_complete( struct idc_message * message );
>
> A request-response transaction abstraction, because most protocols are more easily implemented with these.
>
> struct idc_transaction
> {
>     ...
>     struct local_buffer_reference transaction_parameters;
>     struct local_buffer_reference transaction_status;
> };
>
> /* Useful to have an error code in addition to status. */
>
> /* When a received transaction is finished with. */
>
> void idc_transaction_complete
>     ( struct idc_transaction * transaction, error_code error );
>
> /* When an initiated transaction completes. The error code also reports transport errors when the endpoint disconnects whilst the transaction is outstanding. */
>
> error_code idc_transaction_query_error_code
>     ( struct idc_transaction * transaction );
>
> An IDC address abstraction:
>
> struct idc_address;
>
> A mechanism to initiate connection establishment; this can't fail because the endpoint resource is pre-allocated and create doesn't actually need to establish the connection.
>
> The endpoint calls the registered notification functions as follows:
>
> 'appear' when the remote endpoint is discovered, then 'disappear' if it goes away again, or 'connect' if a connection is actually established.
>
> After 'connect', the client can submit messages and transactions.
>
> 'disconnect' when the connection is failing; the client must wait for outstanding messages and transactions to complete (successfully or with a transport error) before completing the disconnect callback, and must flush received messages and transactions whilst disconnected.
>
> Then 'connect' if the connection is re-established, or 'disappear' if the remote endpoint has gone away.
>
> A disconnect, connect cycle guarantees that the remote endpoint also goes through a disconnect, connect cycle.
>
> This API allows multi-pathing clients to make intelligent decisions and provides sufficient guarantees about stale messages and transactions to make a useful foundation.
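To make that lifecycle concrete, a front-end client might implement the four callbacks roughly as below. This is purely illustrative and not part of the proposal above: the idc_* names come from the sketch in this thread, while everything prefixed fe_ (and its helpers) is hypothetical.

/* Illustration only: one way a front-end might track the lifecycle
   described above.  idc_* types are the proposed ones; fe_* is invented. */

struct fe_device
{
    struct idc_endpoint * endpoint;
    int                   connected;  /* non-zero between 'connect' and 'disconnect' */
};

/* Hypothetical driver helpers assumed to exist: */
struct fe_device * fe_device_from_endpoint( struct idc_endpoint * endpoint );
void fe_complete_outstanding_and_flush( struct fe_device * dev,
                                        struct callback * done );

static void fe_appear( struct idc_endpoint * endpoint )
{
    /* Remote endpoint discovered; nothing may be submitted yet.
       Either 'connect' or 'disappear' follows. */
}

static void fe_connect( struct idc_endpoint * endpoint )
{
    /* From now on the client may submit messages and transactions. */
    fe_device_from_endpoint( endpoint )->connected = 1;
}

static void fe_disconnect( struct idc_endpoint * endpoint,
                           struct callback * callback )
{
    /* Stop submitting, wait for everything outstanding to finish
       (successfully or with a transport error), flush anything received
       whilst disconnected, then complete the callback.  A later 'connect'
       guarantees the remote end saw the same disconnect/connect cycle. */
    struct fe_device * dev = fe_device_from_endpoint( endpoint );
    dev->connected = 0;
    fe_complete_outstanding_and_flush( dev, callback );
}

static void fe_disappear( struct idc_endpoint * endpoint )
{
    /* Remote endpoint has gone away; only 'appear' can follow. */
}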
>
> void idc_endpoint_create
> (
>     struct idc_endpoint * endpoint,
>     struct idc_address address,
>     void ( * appear )( struct idc_endpoint * endpoint ),
>     void ( * connect )( struct idc_endpoint * endpoint ),
>     void ( * disconnect )
>         ( struct idc_endpoint * endpoint, struct callback * callback ),
>     void ( * disappear )( struct idc_endpoint * endpoint ),
>     void ( * handle_message )
>         ( struct idc_endpoint * endpoint, struct idc_message * message ),
>     void ( * handle_transaction )
>         ( struct idc_endpoint * endpoint, struct idc_transaction * transaction )
> );
>
> void idc_endpoint_submit_message
>     ( struct idc_endpoint * endpoint, struct idc_message * message );
>
> void idc_endpoint_submit_transaction
>     ( struct idc_endpoint * endpoint, struct idc_transaction * transaction );
>
> idc_endpoint_destroy completes the callback once the endpoint has 'disconnected' and 'disappeared' and the endpoint resource is free for reuse for a different connection.
>
> void idc_endpoint_destroy
>     ( struct idc_endpoint * endpoint, struct callback * callback );
>
> The messages and the transaction parameters and status must be of finite length (these quota properties might be parameters of the endpoint resource allocation). A mechanism for efficient, arbitrary-length bulk transfer is needed too.
>
> An abstraction for buffers owned by remote domains:
>
> struct remote_buffer_reference;
>
> A local buffer can be registered with the IDC to get a remote buffer reference:
>
> struct remote_buffer_reference idc_register_buffer
>     ( struct local_buffer_reference buffer,
>       some kind of resource probably required here );
>
> Remote buffer references may be passed between domains in idc messages or in transaction parameters or transaction status.
>
> Remote buffer references may be forwarded between domains and are usable from any domain.
>
> Once in possession of a remote buffer reference, a domain can transfer data between the remote buffer and a local buffer:
>
> void idc_send_to_remote_buffer
> (
>     struct remote_buffer_reference remote_buffer,
>     struct local_buffer_reference local_buffer,
>     struct callback * callback, /* transfer completes asynchronously */
>     some kind of resource required here
> );
>
> void idc_receive_from_remote_buffer
> (
>     struct remote_buffer_reference remote_buffer,
>     struct local_buffer_reference local_buffer,
>     struct callback * callback, /* Again, completes asynchronously */
>     some kind of resource required here
> );
>
> A buffer can be unregistered to free the local buffer independently of any remote buffer references still knocking around in remote domains (subsequent sends/receives fail):
>
> void idc_unregister_buffer
>     ( probably a pointer to the resource passed on registration );
>
> So, the 1000 statements of establishment code in the current drivers becomes:
>
> Receive an idc address from somewhere (resource discovery is outside the scope of this sketch).
>
> Allocate an IDC endpoint from somewhere (resource management is again outside the scope of this sketch).
>
> Call idc_endpoint_create.
>
> Wait for 'connect' before attempting to use the connection for the device-specific protocol implemented using messages/transactions/remote buffer references.
>
> Call idc_endpoint_destroy and quiesce before unloading the module.
>
> The implementation of the local buffer references and memory management can hide the use of pages which are shared between domains and reference counted, to provide a zero-copy implementation of bulk data transfer and shared page-caches.
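As a rough illustration of that claimed reduction, the establishment path for a front-end against this API might look like the sketch below. It builds on the hypothetical fe_* callbacks from the earlier illustration; the idc_* calls are the proposed ones, everything else is invented, and the unspecified "some kind of resource" arguments are simply noted in comments.

/* Illustration only: establishment, one bulk transfer, and teardown
   for a hypothetical front-end using the API proposed above. */

/* Device-specific protocol handlers (bodies not shown): */
void fe_handle_message( struct idc_endpoint * endpoint,
                        struct idc_message * message );
void fe_handle_transaction( struct idc_endpoint * endpoint,
                            struct idc_transaction * transaction );

/* Endpoint allocated from somewhere; resource management is outside the
   scope of the sketch above, so it is outside the scope here too. */
static struct idc_endpoint * fe_endpoint;

void fe_driver_init( struct idc_address address )  /* address from discovery */
{
    /* Can't fail: the endpoint is pre-allocated and no connection needs to
       be established yet; this just registers the callbacks. */
    idc_endpoint_create( fe_endpoint, address,
                         fe_appear, fe_connect, fe_disconnect, fe_disappear,
                         fe_handle_message, fe_handle_transaction );
}

/* After 'connect', bulk data moves via remote buffer references carried in
   messages or transaction parameters, e.g. filling a buffer the back end
   described to us: */
void fe_write_bulk( struct remote_buffer_reference remote,
                    struct local_buffer_reference local,
                    struct callback * done )
{
    /* The proposal leaves a "some kind of resource" argument unspecified,
       so it is omitted from this call. */
    idc_send_to_remote_buffer( remote, local, done );
}

void fe_driver_exit( struct callback * done )
{
    /* Completes once the endpoint has 'disconnected' and 'disappeared' and
       the endpoint resource is free for reuse. */
    idc_endpoint_destroy( fe_endpoint, done );
}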
>
> I implemented something very similar to this before for a cluster interconnect and it worked very nicely. There are some subtleties to get right about the remote buffer reference implementation and the implications for out-of-order and idempotent bulk data transfers.
>
> As I said, it would require a few more days' work to nail down a good API.
>
> Harry.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel