[Xen-devel] Domain saving and filesystem corruption

I have been using Xen for over a year now. For the most part I have had
very good success with it and we are now working on rolling it out
throughout my company. But I just ran across something really annoying
and dangerous.

When I first started playing with Xen I read all of the docs I could
find, and at that time I am pretty sure Xen did not automatically save
domains when the machine was shut down. Later on I noticed that it was
trying to do so but was failing because the save directory did not
exist on my machine for some reason (it was not created during the
install). After that I completely forgot about this behavior. A month or
two ago I upgraded to Xen 3.0 from mercurial (I don't have the sources
around anymore and I don't see how to get Xen to tell me its exact
version), and it seems that domain saving on shutdown is now working.
Great. I recently had some unrelated system problems that forced me to
shut down, boot from a rescue disk, mount the logical volume normally
used by my mail server, and do quite a bit of work on it. Once done I
booted the system normally, Xen started the mail domain, and all kinds
of weird filesystem-related things started happening. I shut down the
domain, ran fsck on the mail server logical volume, and found thousands
of errors.

Then I realized what had happened. The saved domain state included
internal buffers and who knows what else that had not been synced to
the disk. So I mounted a very dirty filesystem, made a bunch of changes,
and then the mail server domain came back up expecting the fs to be in
the same state it was left in. It proceeded as if everything were
normal, which ended up causing massive corruption and many lost emails.
Fortunately this is on a dev machine which hosts a bunch of personal
domains and other stuff and not business critical things. But it is
still highly annoying.

I recommend that whenever Xen saves a domain, the domain somehow
sync the filesystem state to disk. Ideally the fs would even be marked
clean so that if someone needs to mount the fs while the domain is not
running, as I did, they can. There really needs to be a way for a Xen
domain, upon being started, to know that the fs is in a sane and
consistent state just as it was when it was saved. Ensuring that only
filesystems marked clean are left after a save and mounted upon restart
is one way to do that. Or is there some sort of timestamp in the fs,
such as a last mount time, that could be saved with the domain state
and then checked to make sure it has not changed when the domain is
restarted? I realize that most of these things are
filesystem/OS specific. It would be really nice to have a general
solution to this. I think something needs to be done because the current
situation seems quite dangerous. For now I have disabled the
saving/restarting of domains and will do so on all of our production
systems also. It's a risk I just can't take.
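
To make the timestamp idea above concrete, here is a rough sketch of
what such a check could look like. It assumes the guest volume is
ext2/ext3, where the superblock records the last mount time, last write
time, and mount count; other filesystems keep this information
elsewhere, if at all. Nothing here is part of Xen, and the function
names are made up for illustration. The fingerprint would have to be
taken after the domain has been saved, and compared just before it is
restored.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # the ext2/ext3 superblock starts 1024 bytes into the volume
EXT_MAGIC = 0xEF53        # ext2/ext3 magic number (s_magic)


def parse_fingerprint(superblock: bytes) -> dict:
    """Extract mount/write bookkeeping fields from raw superblock bytes.

    Assumption: ext2/ext3 on-disk layout, little-endian, fields at byte 44:
    s_mtime, s_wtime, s_mnt_count, s_max_mnt_count, s_magic, s_state.
    """
    mtime, wtime, mnt_count, _max_mnt, magic, _state = struct.unpack_from(
        "<IIHhHH", superblock, 44
    )
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/ext3 superblock")
    return {"mtime": mtime, "wtime": wtime, "mnt_count": mnt_count}


def read_fingerprint(device_path: str) -> dict:
    """Read the fingerprint from a block device or image file (hypothetical helper)."""
    with open(device_path, "rb") as dev:
        dev.seek(SUPERBLOCK_OFFSET)
        return parse_fingerprint(dev.read(1024))


def safe_to_restore(saved: dict, current: dict) -> bool:
    """True only if the volume shows no mount or write since the domain was saved."""
    return saved == current
```

A wrapper around save and restore could store the result of
read_fingerprint() for each of the domain's volumes next to the save
file, and refuse to restore (booting the domain fresh instead) if
safe_to_restore() returns False. This only detects that someone touched
the volume; it does nothing to make the fs clean at save time.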

I mentioned this to someone on the IRC channel and they said "That is
documented behavior." Unfortunately that doesn't bring back my data. It
wasn't documented when I started using Xen and I can't possibly keep up
on everything written about Xen in the meantime.

Tracy R Reed

