
[Xen-devel] Shared disk corruption caused by migration


  • To: <xen-devel@xxxxxxxxxxxxxxxxxxx>
  • From: "Dutton, Jeff" <Jeff.Dutton@xxxxxxxxxxx>
  • Date: Mon, 21 Aug 2006 12:51:30 -0400
  • Delivery-date: Mon, 21 Aug 2006 09:51:56 -0700
  • List-id: Xen developer discussion <xen-devel.lists.xensource.com>
  • Thread-index: AcbFQg1VaRXOhoOXS72Mkld2/HQgow==
  • Thread-topic: Shared disk corruption caused by migration

We are seeing a disk corruption problem when migrating a VM between two nodes that are both active writers of a shared storage block device.

 

The corruption seems to be caused by a lack of synchronization between the migration source and destination regarding outstanding block write requests.  The failing scenario is as follows:

 

1)  The VM has block write A in progress on the source node X at the time the VM is migrated.

2)  After migration, the blkfront driver requeues A on the destination node Y.  The shared storage already has a write to the same block in flight (A from X), so it ignores the requeued request and completes it immediately.

3)  Now that the VM is running on Y, it issues a new block write A' to the same block number as A.  Request A' is also ignored and completed immediately, for the same reason as in #2.

 

The corruption we are seeing is that the block ends up containing the data from A, not A' as the VM expects.  The underlying problem is that the shared storage doesn't guarantee the outcome of concurrent writes to the same block from X and Y.  It chooses to ignore and immediately complete the second request, which I understand is one of the acceptable strategies for handling concurrent writes to the same block.  That behavior is harmless when the request being dropped is the redundant requeue of A, but when the request being dropped is the new write A', we get corruption.

 

The problem only shows up under heavy disk load (e.g. the Bonnie benchmark) while migrating, so most users probably haven't seen it.

If I understand this correctly, though, this could affect anyone using shared block storage with dual active writers and live migration.  When we run with a single active writer and then move the active writer to the destination node, all outstanding requests get flushed in the background and we don't see this problem.

 

The blkfront xenbus_driver doesn't have a "suspend" method.  To fix the problem, I was going to add one that flushes the outstanding requests on the migration source before the domain is suspended.  Alternatively, maybe we can cancel all outstanding I/O requests to eliminate the concurrency between the two nodes.  Does the Linux block I/O interface allow canceling requests that have already been queued?
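For the suspend approach, something along these lines is what I had in mind.  This is only a rough sketch and has not been compiled or tested: it assumes the int (*suspend)(struct xenbus_device *) hook signature, that the blkfront_info is stashed in dev->dev.driver_data as blkfront_probe does, that the blkif_io_lock / BLKIF_STATE_SUSPENDED names from blkfront.c apply here, and it uses a hypothetical waitqueue (flush_wq) that would have to be added to struct blkfront_info and woken from blkif_interrupt() each time a response is consumed:

/*
 * Sketch only: drain all in-flight block requests before the domain
 * is suspended for migration.  flush_wq is hypothetical and does not
 * exist in struct blkfront_info today.
 */
static int blkfront_suspend(struct xenbus_device *dev)
{
        struct blkfront_info *info = dev->dev.driver_data;

        /* Stop queueing new requests to the backend. */
        spin_lock_irq(&blkif_io_lock);
        info->connected = BLKIF_STATE_SUSPENDED;
        spin_unlock_irq(&blkif_io_lock);

        /*
         * Wait until the backend has responded to everything we have
         * already put on the ring, i.e. the ring is completely free.
         */
        wait_event(info->flush_wq,
                   RING_FREE_REQUESTS(&info->ring) == RING_SIZE(&info->ring));

        return 0;
}

The hook would then be wired up by setting .suspend = blkfront_suspend in the blkfront xenbus_driver declaration.  The attraction of draining in the suspend path is that it needs no new support from the block layer, unlike cancellation.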

 

Anyone else seeing this problem?  Any other ideas for solutions?

 

Thanks,

Jeff

 

 

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel

 

