
[Xen-users] xen dom0 nfs hangs


I have Xen running, most recently under Ubuntu 15, on a host that runs a small number of domUs doing small, non-intensive jobs (DNS serving, spam filtering, RADIUS). The dom0 NFS-mounts a directory on my NetApp filer that holds the domU disk image files, and the domU config files all use loopback mounts for these disk images. Occasionally, for some reason I have yet to fathom, NFS simply stops working from the dom0 and all processes accessing NFS hang. I get messages about 'task blocked more than 120 seconds' (from qemu-system-i386) and so forth. The dom0 is otherwise responsive: it is not swapping, it is not under high load, and it logs no other kernel messages; it's simply that NFS has gone away. Other dom0 hosts that NFS-mount domU disk image files from this same filer have no problems at all. The domUs on the affected Xen host hang: networking still works, they are reachable by ping, and anything inside the domU that does not depend on disk access continues to work, but any process that touches disk (sendmail, for example) is hung.
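To timestamp when the hangs begin, the hung-task warnings can be pulled out of the kernel log. This is a minimal sketch, assuming the messages have the usual kernel form "INFO: task NAME:PID blocked for more than N seconds."; the function name and the sample line in the comment are made up for illustration.

```python
import re

# Matches the kernel hung-task warning, e.g. (hypothetical sample line):
#   [12345.678901] INFO: task qemu-system-i38:2345 blocked for more than 120 seconds.
HUNG_TASK_RE = re.compile(
    r"task (?P<comm>.+?):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of (command, pid, seconds) for each hung-task warning."""
    hits = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits
```

It can be fed any iterable of log lines, for example the output of dmesg piped in and passed as sys.stdin.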

I have taken the following troubleshooting steps:

The host was originally an AMD box running Ubuntu 14. I tried all of the memory tuning advice: minimum dom0 memory, CPU pinning, etc. NFS continued to hang.

I upgraded the box to an Intel hexa-core platform with 64 GB of RAM. Same problems.

I installed a dedicated 4-port GigE NIC and put the NFS traffic onto its own bonded port channel. Same problems.

I upgraded to Ubuntu 15. Same problems.

I tuned even more kernel variables, such as swappiness, dirty cache and so forth, down to almost nothing. Same problems.
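For anyone comparing notes, the values actually in effect can be captured with a short script. This is only a sketch: it assumes the tunables in question are the standard Linux ones under /proc/sys/vm (swappiness, dirty_ratio, dirty_background_ratio), which may not be the exact set I changed, and the function name is made up.

```python
from pathlib import Path

# Assumed set of tunables; adjust to whatever was actually changed.
VM_KEYS = ("swappiness", "dirty_ratio", "dirty_background_ratio")


def read_vm_settings(base="/proc/sys/vm", keys=VM_KEYS):
    """Return {name: int value, or None if missing/unreadable} for each key."""
    settings = {}
    for key in keys:
        try:
            settings[key] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            settings[key] = None
    return settings
```

Running it on the affected dom0 and on one of the healthy dom0 hosts makes it easy to diff the two configurations.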

I have a SPAN port capturing all network traffic to and from the box during the problem period. I can see nothing obviously going wrong, but I don't have good tools beyond tcpdump to dig into the traffic.

I have arpwatch running to make sure we don't have an IP conflict on the NFS network. Nothing noted.

I have the switches doing extended debugging for all interface state transitions and STP transitions. Nothing noted, no errors; everything is clean and good.

I have had the experience where, after an NFS hang lasting more than 2 hours, NFS suddenly comes right back and picks up where it left off; all VMs suddenly come back to life and everything is good again.
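To line up the start and end of a hang with the packet captures, a small probe can record when the mount stops answering. This is a rough sketch, not something I have run in production: the function name and timeout are made up, and it assumes a stat() on the mount point blocks while NFS is hung. The check runs in a daemon thread so that the caller is not stuck along with it.

```python
import os
import threading
import time


def probe(path, timeout=5.0, check=os.stat):
    """Return (ok, elapsed_seconds).

    ok is False if check(path) raised OSError or did not finish within
    timeout seconds. A stuck worker thread is abandoned, not killed.
    """
    result = {}

    def worker():
        try:
            check(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    thread = threading.Thread(target=worker, daemon=True)
    start = time.monotonic()
    thread.start()
    thread.join(timeout)
    elapsed = time.monotonic() - start
    return result.get("ok", False), elapsed
```

Calling it in a loop against the NFS mount point and logging the wall-clock time of each failure gives timestamps to search for in the tcpdump output.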

The short-term fix when this occurs is simply to reboot the box. Then everything comes back and all is well. But the problem continues unabated, and I have been fighting this for too long. My best guess is that it's "something" with NFS, but that is all. If I can't find a solution soon, I would be willing to consider other storage methods, including iSCSI. The issue with that, however, is that NFS makes sense to me: I can deal with it, I know how to back it up, and I know how to manage the space and the mounts, whereas iSCSI is an enigma to me. I haven't found any really good howtos or other documents showing how to connect all the pieces; if someone has a pointer, please shoot it my way. I'd love to understand what is actually killing NFS and fix that problem instead, but at this point getting away from this problem and restoring stability here is more important.

Thank you.


Xen-users mailing list


