
Re: [Xen-devel] Is continuous replication of state possible?



On Sat, 2005-01-08 at 18:21 -0800, Jacob Gorm Hansen wrote:
> There is a paper by Bressoud and Schneider about hypervisor-based fault 
> tolerance on the PA-RISC (which had precise performance counters) which 
> is worth reading, I found a copy online at 
> http://roc.cs.berkeley.edu/294fall01/readings/bressoud.pdf .

Thanks, I hadn't read this paper before and found it interesting. There
are some good ideas in it, such as using the recovery register to divide
execution into epochs of instructions, but I think they made a mistake
in choosing a primary-backup design rather than an N-way active one, and
the I/O architecture is fundamentally flawed as a result of confusion
over whether the peripheral devices sit inside or outside the boundary
of the replicated system.

I think a better approach for a fail-stop fault-tolerant system would be
to take an N-way active approach and put the peripherals clearly outside
the boundary of the replicated system:

With this approach (a rough sketch in Python follows the list below):

1) A distributed consensus protocol is used to reach agreement on a
sequence of inputs to the system.
2) The same sequence of inputs is passed to all replicas. This is the
input boundary of the replicated fault-tolerant context.
3) The replicas start with the same state and receive the same sequence
of inputs, so they produce the same sequence of responses. Every
response from every replica is of the form "execute response X on node
Y", so the output is identical across replicas. This is the output
boundary of the replicated fault-tolerant context.
4) The output of the replicas itself specifies which node must actually
action it. That one node actions the output; the remaining nodes do
nothing.
5) With peripherals outside the replication boundary, all communication
from a peripheral to the virtual machine is passed through the
distributed consensus protocol and committed to the input sequence
before being processed by all replicas. All communication from the
fault-tolerant virtual machine to peripherals goes through a specific
node chosen by the virtual machine. If a peripheral is accessible from
multiple nodes then the virtual machine sees multiple redundant paths to
it and may use multi-pathing software to fail over to another path when
a node fails. If a peripheral is accessible from only one node then the
virtual machine may use RAID over several such peripherals attached to
different nodes to build a fault-tolerant virtual peripheral.
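To make the data flow of steps 1-4 concrete, here is a minimal Python
sketch. The consensus step is stubbed out (a real system would run a
full PAXOS round), and all the names here - Replica, Output, run_round -
are illustrative inventions, not anything from the paper:

    import zlib
    from dataclasses import dataclass

    @dataclass
    class Output:
        action: str   # "execute response X ..."
        node: int     # "... on node Y": the node that must action it

    @dataclass
    class Replica:
        node_id: int
        state: int = 0

        def step(self, inp: str) -> Output:
            # Deterministic transition: same state + same agreed input
            # means every replica computes the same state and output.
            self.state = zlib.crc32(inp.encode(), self.state)
            return Output(action=f"respond-to:{inp}", node=self.state % 3)

    def run_round(replicas, proposed_inputs):
        # Step 1: agree on the next input (stubbed; really consensus).
        agreed = sorted(proposed_inputs)[0]
        # Steps 2-3: identical input to every replica; outputs agree.
        outputs = [r.step(agreed) for r in replicas]
        assert len({(o.action, o.node) for o in outputs}) == 1
        # Step 4: only the node named in the output actions it.
        for r in replicas:
            if r.node_id == outputs[0].node:
                print(f"node {r.node_id} actions: {outputs[0].action}")

    replicas = [Replica(node_id=i) for i in range(3)]
    run_round(replicas, ["input-b", "input-a"])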

This approach also extends to byzantine fault-tolerant systems: use a
byzantine fault-tolerant distributed consensus protocol, and have each
replica digitally sign each output and forward the signature to the
replica responsible for actioning it, so that the actioner can prove it
is operating on behalf of a quorum of replicas. In this case a byzantine
failure of a node still looks like a path failure to the virtual
machine, because communications from the faulty node are dropped by the
recipient when the digital signatures are found to be insufficient.
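As a concrete illustration of the quorum-signature idea, here is a
minimal sketch assuming the usual bound of 3f+1 replicas tolerating f
byzantine faults with a quorum of 2f+1. HMAC with per-replica keys
stands in for real digital signatures purely to keep the example
self-contained; a real system would use asymmetric signatures:

    import hashlib
    import hmac

    F = 1                 # byzantine faults tolerated (assumed)
    N = 3 * F + 1         # replicas
    QUORUM = 2 * F + 1    # signatures the actioner must present

    # Per-replica signing keys, known to the verifier (sketch only).
    keys = {i: bytes([i + 1]) * 16 for i in range(N)}

    def sign(replica_id: int, output: bytes) -> bytes:
        return hmac.new(keys[replica_id], output, hashlib.sha256).digest()

    def verify_quorum(output: bytes, sigs: dict) -> bool:
        # The recipient drops the communication unless a quorum of
        # replicas signed this exact output, so a lone byzantine node
        # cannot forge a valid request.
        valid = sum(1 for rid, sig in sigs.items()
                    if hmac.compare_digest(sig, sign(rid, output)))
        return valid >= QUORUM

    output = b"execute response X on node Y"
    sigs = {i: sign(i, output) for i in range(N)}
    sigs[0] = b"garbage"                 # one byzantine replica lies
    print(verify_quorum(output, sigs))   # True: 3 of 4 valid >= quorum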

Here are a few other random comments on the paper:

With the recovery register approach you can obviously break out of an
epoch early if you encounter an idle instruction, since the idle will
occur at the same point on all replicas.
If you emulate the CPU in software then this is very easy to arrange.
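For what it's worth, here is a minimal sketch of the epoch idea in a
software CPU emulator. The toy instruction set and the EPOCH_LENGTH
constant are invented for illustration; the point is only that an epoch
ends either at a fixed instruction count (the role the recovery register
plays in hardware) or early at a deterministic idle instruction:

    EPOCH_LENGTH = 4  # instructions per epoch; same on every replica

    def run_epoch(program, pc):
        """Execute up to EPOCH_LENGTH instructions, stopping early on IDLE.

        Every replica runs the same program from the same pc, so all
        replicas end the epoch at the same instruction boundary and can
        deliver the same agreed-upon interrupts there.
        """
        executed = 0
        while executed < EPOCH_LENGTH and pc < len(program):
            insn = program[pc]
            pc += 1       # emulate the instruction (details elided)
            executed += 1
            if insn == "IDLE":
                # Deterministic across replicas: safe to end early.
                break
        return pc         # epoch boundary: deliver pending interrupts

    program = ["ADD", "LOAD", "IDLE", "STORE", "ADD", "SUB"]
    pc = 0
    while pc < len(program):
        pc = run_epoch(program, pc)
        print(f"epoch boundary at pc={pc}")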

The time-of-day clock is a good example of solving the problem of
non-deterministic operations by asking the hypervisor and having it pass
the result through the consensus protocol to all replicas.
In terms of the approach outlined above, this translates into a replica
output directed at a specific node asking that node for the time. The
node then sends the time through the distributed consensus protocol so
that it is received by all replicas. This works for random number
generation as well (as someone noted in another post on this topic), but
in that case you'd want to ask a node for a big batch of random numbers
in advance. The node would generate a batch and pass it through the
distributed consensus protocol to the replicas, so that each replica had
the same random number pool and didn't incur the consensus protocol
overhead on every request. Using a consensus protocol like PAXOS, which
supports multiple concurrent ballots, would allow the random number pool
to be replenished at a low-water mark without ever stalling the virtual
machine.
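Here is a minimal sketch of the batched pool with a low-water mark. The
consensus_deliver callback and the LOW_WATER/BATCH constants are
invented for illustration; consensus_deliver stands in for delivery of
the batch through the consensus protocol's agreed input sequence:

    import os
    from collections import deque

    LOW_WATER = 8   # refill threshold, chosen arbitrarily for the sketch
    BATCH = 32      # numbers generated per consensus round

    class RandomPool:
        def __init__(self):
            self.pool = deque()

        def consensus_deliver(self, batch):
            # Invoked identically on every replica once the batch has
            # been committed to the agreed input sequence.
            self.pool.extend(batch)

        def next_random(self, request_refill):
            value = self.pool.popleft()
            if len(self.pool) < LOW_WATER:
                # Kick off a new ballot before the pool runs dry, so
                # the virtual machine never stalls on the protocol.
                request_refill()
            return value

    pool = RandomPool()
    pool.consensus_deliver(list(os.urandom(BATCH)))  # one node made this
    for _ in range(BATCH - LOW_WATER + 1):
        pool.next_random(lambda: print("refill requested at low-water mark"))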

"Fundamental to our design is communication among the hypervisors. This
implies that the hypervisors must not be partitioned by communications
failures."

PAXOS does much better and, I think, shows the limit of what can be
achieved for fail-stop fault-tolerant systems.

"I/O Accessibility Assumption: I/O operations possible by the processor
executing the primary virtual machine are also possible by the processor
executing a backup virtual machine."

See the discussion above.

"We assume that the channel linking the primary and backup processors is
FIFO...that the processor executing the backup detects the primary's
processor failure only after receiving the last message sent by the
primary's hypervisor"

The PAXOS protocol does significantly better than this.

For anyone interested in replication, "The Part-Time Parliament" by
Leslie Lamport, which describes PAXOS, is a great read.

After that, I'd recommend reading "Practical Byzantine Fault Tolerance"
by Miguel Castro and Barbara Liskov, and "Provably Secure Competitive
Routing against Proactive Byzantine Adversaries via Reinforcement
Learning" by Baruch Awerbuch, David Holmer and Herbert Rubens.

-- 
Harry Butterworth <harry@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>


