[Xen-devel] Remus: Possible disk replication consistency bug
Greetings,

Short version:

1. Is there any way to get disk replication to work with the blktap2 driver when I use two file disk images (say, a disk image and a swap image)?

2. [Possible bug] How does Remus guarantee that, after failover, when a replicated VM boots on the backup physical machine, its memory state is consistent with its disk state? Remus uses two separate channels, one for memory updates and the other for disk updates. The primary decides when to send individual commit messages on each of these channels, but there appears to be no mechanism in place at the backup site to coordinate if and when these updates should be applied. Thus, we can have the following execution scenario:

- Backup receives the commit for the disk updates of epoch E
- Primary crashes before sending the commit for the memory updates of epoch E
- Backup resumes the execution of the guest VM using the latest available information
- The guest VM's memory state corresponds to epoch E - 1, while its disk state corresponds to epoch E. This is inconsistent.

Long version:

1. In the simplified version of what I need, I have a single guest VM with access to two disk images, disk.img and swap.img, which are both replicated using the blktap2 driver. To achieve this I followed the guide at http://remusha.wikidot.com/ , which nets me the following setup (I deviated a little when creating the guest):

- Xen 4.2 (unstable), changeset 24465:5b2676ac1321
- Dom0 kernel: Linux 2.6.32.40 x86_64 (commit 2b494f184d3337d10d59226e3632af56ea66629a)
- DomU kernel: Linux 3.0.4 x86_64

I have a guest VM, named frank, whose configuration file, frank.cfg, contains the following parameters:

    disk = [
        'tap2:remus:nslrack83.epfl.ch:9002|aio:/vserver/images/domains/frank/disk.img,xvda2,w',
        'tap2:remus:nslrack83.epfl.ch:9003|aio:/vserver/images/domains/frank/swap.img,xvda1,w',
    ]

which is correct according to the documentation guidelines posted at http://nss.cs.ubc.ca/remus/doc.html . Notice that I have assigned a different tapdisk remote server channel (port) to disk.img and swap.img. Assigning the same port number to both of them will not work, since the primary and the backup physical machines each spawn one tapdisk daemon per disk image. I suppose that both daemons on the backup try to bind the same port number and, thus, one of them fails. This causes the procedure to hang. (In fact, the affected tapdisk instances on the primary and backup enter some kind of busy polling loop and consume 100% of the CPU assigned to them.) For whatever reason, however, things are not much different when I assign different ports for replicating disk.img and swap.img. This is something I cannot explain myself, and it is where I ultimately gave up on trying to get disk replication working with blktap2. Note that if I disable access to swap.img in frank.cfg, the whole process works as it should, disk and memory replication and all, which demonstrates that I have a working setup among my physical machines.

2. In the meantime, I have also been digging into the source code of Remus, blktap2 and some parts of drbd, and I think I may have come across a possible bug. If my observations are correct, it is possible that after a (very unlucky) primary machine failure, the replicated VM is resumed on the backup machine with its memory in epoch A state and its disk in epoch B state.
- If we are using blktap2, then it can be that A = B + 1 (the disk state is one epoch behind the memory state) or A = B - 1 (the disk state is one epoch ahead of the memory state)
- If we are using drbd, then it can be that A = B - 1 (the disk state is one epoch ahead of the memory state)

Remus uses two different channels of communication, one for memory updates and one for disk updates. If I understand the code structure correctly, the issue I describe stems from the fact that Remus also uses these channels to send two separate commit messages: one to the xc_restore process, for memory, and one to the server tapdisk2 daemon (similarly for drbd), for disk updates. These messages are needed in order to mark the boundaries of checkpoint epochs on each channel. However, what I feel is missing (or I haven't been able to find it) is a process on the backup machine that decides when the local VM state (memory or disk) gets updated. For example, the backup should update the state of a VM to epoch A iff it has received all updates pertaining to epoch A (disk and memory). Then, and only then, can the backup send a Checkpoint Acknowledgement message to the primary, at which point the primary can release the VM's network buffers.

The files of interest to us are the following:

xen-unstable/tools/libxc/xc_domain_save.c
xen-unstable/tools/remus/remus
xen-unstable/tools/python/xen/remus/device.py
xen-unstable/tools/blktap2/drivers/block-remus.c

- assuming that we are using disk replication with blktap2

Assume that the primary machine is about to send a commit message to the backup machine. Thus, we are at line 1982 in the xc_domain_save.c file. The primary is about to execute the discard_file_cache() function, which causes it to do an fsync() on the migration socket. I am not sure about the particular mechanics of calling fsync() on a connected TCP socket, but I presume that the intended behaviour is to wait until the last byte written to that socket has been acknowledged by the receiver (this violates the end-to-end argument, but works in the common case where the two machines are connected back to back with a crossover cable). Then, the primary invokes the checkpoint callback, which brings us to the commit() function at line 166 in remus. This function invokes buf.commit() for each of the available buffers. For network buffers, this causes them to release their output, whereas for disk buffers, it causes remus to wait for an acknowledgement for that particular disk from the backup, as seen in device.py. Since disk buffers have been inserted first, remus waits for acknowledgements for all disk buffers before it releases any network buffer. Notice that at line 89, in the postsuspend() function of device.py, remus sends a 'flush' message to the disk control channel, which is all it takes for the secondary machine to release the pending disk updates to its disk state (block-remus.c).

The bug I describe can occur if the primary crashes after invoking discard_file_cache() in xc_domain_save.c and before having a chance to invoke any of the buf.commit() calls in remus. If the 'flush' message has not left the primary's socket buffer at the time of the crash, then we have the A = B + 1 case outlined above, where the memory state is one epoch ahead of the disk state.
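To make the ordering concrete, here is a rough sketch of how I read the per-epoch sequence on the primary side. This is pseudo-Python of my own; none of these function names exist in the actual tools, and it only illustrates the ordering and the first crash window (the second one is described right below).

    # Pseudo-Python sketch of the primary-side ordering, as I read xc_domain_save.c,
    # tools/remus/remus and device.py. All names here are made up for illustration.
    def replicate_one_epoch(guest, memory_sock, disk_channels, network_buffers):
        guest.suspend()

        # postsuspend() in device.py: a 'flush' message is written to each disk
        # control channel. write() only queues it in the primary's local socket
        # buffer; the backup may or may not have received it yet.
        for disk in disk_channels:
            disk.write('flush')

        # The guest resumes and runs speculatively; its network output stays buffered.
        guest.resume()

        # xc_domain_save.c: stream this epoch's dirty pages, then discard_file_cache(),
        # which does an fsync() on the migration socket.
        memory_sock.send_dirty_pages(guest)
        memory_sock.fsync()

        # <-- crash window: discard_file_cache() has completed, but the 'flush'
        #     messages are still sitting in the primary's socket buffers and never
        #     reach the backup. The backup ends up with memory from epoch E and
        #     disk from epoch E - 1: the A = B + 1 case.

        # commit() at line 166 in remus: disk buffers were inserted first, so remus
        # waits for every disk acknowledgement before releasing any network buffer.
        for disk in disk_channels:
            disk.wait_for_ack()
        for buf in network_buffers:
            buf.release()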
Similarly, if the primary crashes after sending the 'flush' message but before calling discard_file_cache(), we have the A = B - 1 case, where the VM's memory state is one epoch behind its disk state. What is even more worrying, however, is that block-remus.c and xc_domain_save.c have two entirely disjoint heartbeat mechanisms, which can potentially amount to an entirely new level of trouble.

- assuming that we use drbd

The setup is similar to the one described for blktap2. In this case, however, remus forces the primary to wait in the preresume() callback until it has received an acknowledgement for the disk updates. Unless I am missing something, this makes little sense to me, as we keep the VM suspended for a round-trip time's worth of time when it could have been running. Shouldn't the waiting logic be moved to the commit() function instead? In any case, we have a similar scenario: drbd finishes sending the commit message to the backup and the primary crashes immediately, before returning from the postcopy callback. Thus, the backup machine receives a commit for the disk updates but no commit for the memory updates. Since the two do not coordinate with each other, it will happily apply the disk updates and ruin memory-disk consistency on the guest VM.

- Epilogue

I think it would be better/cleaner/more consistent to have some kind of remus server daemon running on the backup physical machine. That daemon would coordinate when disk and memory are committed to the guest VM's state (that is, once the daemon has received all checkpoint state pertaining to a particular VM). As such, it is the daemon that should decide when to send a Checkpoint Acknowledgement message to the primary physical machine.
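To make the epilogue a bit more concrete, something along the following lines is what I have in mind. Again, this is made-up pseudo-Python, not a patch; BackupCoordinator and all of its methods are names I invented for illustration, and it simplifies things by assuming the primary never runs more than one epoch ahead.

    # Sketch of a backup-side coordinator: nothing is applied, and nothing is acked,
    # until both the memory commit and every disk commit for the epoch have arrived.
    class BackupCoordinator:
        def __init__(self, memory_buffer, disk_buffers, control_channel):
            self.memory_buffer = memory_buffer   # buffered xc_restore state
            self.disk_buffers = disk_buffers     # dict: disk id -> buffered writes
            self.control = control_channel       # back-channel to the primary
            self.memory_committed = False
            self.disks_committed = set()

        def on_memory_commit(self, epoch):
            self.memory_committed = True
            self.maybe_apply(epoch)

        def on_disk_commit(self, epoch, disk_id):
            self.disks_committed.add(disk_id)
            self.maybe_apply(epoch)

        def maybe_apply(self, epoch):
            # Apply epoch E iff *all* of its updates (memory and every disk) are here.
            if not self.memory_committed:
                return
            if self.disks_committed != set(self.disk_buffers):
                return

            for buf in self.disk_buffers.values():
                buf.apply()                      # flush the buffered writes to the image
            self.memory_buffer.apply()           # make the new memory state current

            # Only now is it safe to tell the primary, which may then release the
            # guest's buffered network output.
            self.control.send(('checkpoint-ack', epoch))

            self.memory_committed = False
            self.disks_committed.clear()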
Cheers!
dmelisso