
Thoughts on cloud control APIs for Mirage



Mirage now has a number of protocols implemented as libraries, as
well as device drivers. What's missing is an effective control stack to
glue all this together into a proper OS.  So far, we are just wiring
together applications manually from the libraries, which is fine for
development but not for any real deployment.

I've been re-reading the Plan 9 papers [1] for inspiration, and many of
the ideas there are highly applicable to us. To realise the Mirage goal of
synthesising microkernels that are 'minimal for purpose', we need to:

- have multiple intercommunicating components, separated by process
  boundaries (on UNIX) or by VM isolation (on Xen), or connected by a
  simple function call when compiled as part of the same kernel.

- minimise information flow between components, so they can be
  dynamically split up ('self scaling') or combined more easily.

- deal with the full lifecycle of all these VMs and processes, and not 
  just spawning them.
 
Plan 9 was built on very similar principles: instead of a big monolithic
kernel, the system is built from many processes that communicate via a
well-defined wire protocol (9P), with per-process namespaces and filesystem
abstractions for almost every service.  For example, instead of 'ifconfig',
the network is simply exposed as a /net filesystem and configured through
filesystem calls rather than a separate command-line tool.  Crucially, the
9P protocol can be invoked remotely over the network, or directly via a
simple function call (for in-kernel operations).
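
To make this concrete, here is a minimal OCaml sketch of what driving a
/net-style interface might look like from a Mirage library.  The CONTROL
signature and the paths are hypothetical (nothing here is an existing
Mirage or Plan 9 API); the point is only that configuration becomes reads
and writes on a hierarchical namespace:

  (* Hypothetical control interface: all names here are assumptions, not
     an existing API. *)
  module type CONTROL = sig
    val read  : string -> string Lwt.t          (* read a control file *)
    val write : string -> string -> unit Lwt.t  (* write a control file *)
  end

  (* Bring up an interface by writing a command to its ctl file, then read
     back its status, instead of shelling out to an ifconfig equivalent. *)
  let configure_net (module C : CONTROL) =
    let open Lwt.Infix in
    C.write "/net/eth0/ctl" "bind 10.0.0.2 255.255.255.0" >>= fun () ->
    C.read "/net/eth0/status"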
 
In contrast, modern cloud stacks are just terribly designed: they consist
of a huge amount of static specification of VM and network state, with
little attention paid to simple UNIX/Plan 9 principles that can be used to
build the more complicated abstractions.

So, this leaves us with an interesting opportunity: to implement the
Mirage control interface using similar principles:

- a per-deployment global hierarchical tree (i.e. a filesystem), with ways
  to synchronise on entries (i.e. blocking I/O, or a select/poll
  equivalent).  Our consistency model may vary somewhat: we could be
  strongly consistent between VMs running on the same physical host, and
  more loosely consistent cluster-wide.  (A sketch of what this API might
  look like follows this list.)

- every library exposes a set of keys and values, as well as a mechanism
  for session setup, authentication and teardown (the lifecycle of the
  process).  Plan 9 used ASCII for everything, whereas Mirage would layer
  a well-typed API on top of it (e.g. just write a record to a file rather
  than manually serialising it).

- extend the Xen Cloud Platform to support delegation, so that microVMs
  can be monitored or killed by supervisors. Unlike Plan 9, this also
  includes operations across physical hosts (e.g. live relocation), or
  across cloud providers.
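
As a rough illustration of the first two points above, here is an OCaml
sketch of the shape such a control-tree API could take.  The CTREE
signature, the vm_state record and the paths are all made up for this
example; they just show typed writes instead of manual serialisation, plus
blocking waits on entries as the synchronisation primitive:

  (* All names below are hypothetical; only the shape of the API matters. *)
  type vm_state = { name : string; memory_mb : int; running : bool }

  module type CTREE = sig
    (* Write a typed record to an entry, rather than hand-serialising
       ASCII as Plan 9 does. *)
    val write : string -> vm_state -> unit Lwt.t
    (* Block until the entry at the given path changes: the moral
       equivalent of blocking I/O or select/poll on a control file. *)
    val wait  : string -> vm_state Lwt.t
  end

  let supervise (module T : CTREE) =
    let open Lwt.Infix in
    T.write "/deploy/www1/state"
      { name = "www1"; memory_mb = 64; running = true } >>= fun () ->
    T.wait "/deploy/www1/state" >>= fun st ->
    (* React to the change; here we only report whether the VM is up. *)
    Lwt.return (if st.running then `Running else `Needs_restart)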
  
There are some nice implications of this work that go beyond Mirage:

- it generally applies to all of the exokernel libraries out there,
  including HalVM (Haskell) or GuestVM (Java), as they all have this
  control problem that makes manipulating raw kernels such a pain.

- it can easily be extended to support existing applications on a
  monolithic guest kernel, and make it easier to manage them too.

- application synthesis becomes much more viable: this approach could let
  me build an HTTP microkernel without a TCP stack, and simply receive a
  typed RPC from an HTTP proxy (which has done all the work of parsing the
  TCP and HTTP bits, so why repeat it?).  If my HTTP server microkernel
  later live migrates away, then it could swap back to a network connection
  (there's a sketch of this just after the list).

  Modern cloudy applications (especially Hadoop or CIEL) use HTTP very
  heavily to talk between components, so optimising this part of the stack
  is worthwhile (numbers needed!)
 
- even if components are compiled into the same binary and use function
  calls, they still have to establish and authenticate connections to each
  other.  This makes monitoring and scaling hugely easier, since the
  control filesystem operations provide a natural logging and introspection
  point, even for large clusters.  If we had a hardware-capability-aware
  CPU in the future, it could use this information too :-)
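
As a concrete (and again entirely hypothetical) sketch of the last two
points, the OCaml below shows an HTTP microkernel that only ever deals in
a parsed request record; whether that record arrives as a typed RPC from a
proxy linked into the same kernel, or has to be re-parsed off a real TCP
connection after a migration, is hidden behind a transport type.  None of
these types exist in Mirage today:

  (* Hypothetical request record, standing in for a typed RPC from an HTTP
     proxy that has already done the TCP and HTTP parsing. *)
  type http_request = {
    meth    : string;                     (* "GET", "POST", ... *)
    uri     : string;
    headers : (string * string) list;
    body    : string;
  }

  type transport =
    | Local  of (unit -> http_request Lwt.t)  (* function call / shared memory *)
    | Remote of (unit -> string Lwt.t)        (* raw bytes over a TCP socket *)

  let recv t =
    let open Lwt.Infix in
    match t with
    | Local next ->
      (* The proxy hands us an already-parsed request: no TCP stack needed. *)
      next ()
    | Remote read_bytes ->
      (* After a live migration we fall back to the wire and parse HTTP
         ourselves; the parsing itself is elided from this sketch. *)
      read_bytes () >>= fun _raw ->
      Lwt.fail_with "HTTP parsing elided from this sketch"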

I highly recommend that anyone interested in this area read the Plan 9
papers, as they're a really good read anyway [1].  The Scout OS and
x-kernel work is also worth a look.  Our main difference from all of this
is the heavy emphasis on type-safe components, as well as realistic
deployment due to the use of Xen cloud providers as a stable hardware
interface.

In the very short term, Mort and I have an OpenFlow tutorial coming up in
mid-November, so I'll lash up a manual version of this for the network
stack as soon as possible, so that you can configure all the tap
interfaces and such much more quickly.  Meanwhile, any and all thoughts
are most welcome!

[1] Plan 9 papers: http://cm.bell-labs.com/sys/doc/

-- 
Anil Madhavapeddy                                 http://anil.recoil.org



 

