
[Xen-devel] Xen hangs with NFS root under high loads

Hi All!

We now have a small and growing group of customers running on Xen-hosted
machines -- Chris Clarke (in the Cc:) was the first, a few months ago,
under Xen 1.0 (would that make him the first commercial Xenoserver
customer?).  We switched to 1.2 in mid-February.  Other than the
following, the only recent issues are related to working out the bugs
and features in my own controller code, which I owe you another copy of.

But we have seen a recurring issue where a few domains hang for no
readily apparent reason, don't respond to 'xc_dom_control.py shutdown',
but do respond to 'xc_dom_control.py destroy'.  I usually see
alternating "NFS server not found" and "NFS server OK" messages on the
domain 0 console around the time that a guest on that node hangs.  The
hangs usually seem to coincide with someone running something I/O
intensive like 'rsync' or 'apt-get' in the guest domain.
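
To make that correlation easier to check, here is a minimal sketch of
counting how often the NFS status flips in a captured domain 0 console
log.  The patterns are assumptions: they loosely match both the wording
quoted above and the stock kernel's "nfs: server ... not responding" /
"nfs: server ... OK" form, and count_nfs_flaps is a hypothetical helper,
not part of any Xen tool.

```python
import re

# Assumed message shapes; adjust to whatever the console actually prints.
NFS_DOWN = re.compile(r"nfs.*server.*not (found|responding)", re.IGNORECASE)
NFS_UP = re.compile(r"nfs.*server.*\bOK\b", re.IGNORECASE)


def count_nfs_flaps(lines):
    """Count transitions between NFS 'down' and 'up' messages."""
    state = None  # last seen NFS state: None, "down", or "up"
    flaps = 0
    for line in lines:
        if NFS_DOWN.search(line):
            new_state = "down"
        elif NFS_UP.search(line):
            new_state = "up"
        else:
            continue  # unrelated console line
        if state is not None and new_state != state:
            flaps += 1
        state = new_state
    return flaps
```

Feeding it the console lines from the few minutes around a hang, and
comparing the count against a quiet period, would show whether the
flapping really tracks the guest hangs.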

Right now I'm running all swap partitions in VBD's, and the root
partitions are all on a central NFS server so that:

- I can mirror them and back them up.  

- We can migrate guests between nodes by assigning a guest to a
  different node -- right now that's implemented via shutdown/reboot.

- We can recover from hardware failure in a couple of minutes, just by
  assigning a guest to a different node.

But when researching this problem I noted a message from Ian (18 Mar
2004) saying:

  We've seen some weird hangs under extreme conditions with NFS
  root, but we can reproduce these on stock Linux :-(

Ian, do these symptoms sound like this is what we're hitting?  Until I
can reliably reproduce the problem myself, I'm going to assume this is
the case.

What are other people doing to meet those requirements of backups,
migration, and failover?  How is the live migration code?  The
copy-on-write NFSd, or COW VBD's?  Any other backup or mirroring code
added to VBD's lately?  Other alternatives (ENBD etc.) that anyone knows
from experience to be production-quality?

Here's what I'm going to have to do unless I hear otherwise:

- Try moving the NFS server to the Xen server node itself.  This will
  provide better bandwidth and latency versus the 100Mb switch we're
  going through now.  I don't know if that will help.  I will need to
  back up each individual node's disk then.  Each node's disks will need
  to be mirrored (who else is using md raid 1 for DOM0's root
  partition?)  And we won't be able to cleanly migrate guests between
  nodes.  No hardware failover either.  Grrr.

- If that doesn't work, then I'll need to migrate each root into a Xen
  virtual block device on the node (right now only swap is there).  Then
  I won't be able to ensure backups get done myself -- any backups will
  have to be done from within each guest's O/S.  They can't be mirrored.
  And migrating between nodes becomes doubly hard, and can take hours
  depending on partition size.  No hardware failover.


Stephen G. Traugott  (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
http://www.stevegt.com -- http://Infrastructures.Org 
