[Xen-users] xen dom0 nfs hangs

To: xen-users@xxxxxxxxxxxxx
From: Mike <mike+xen@xxxxxxxxxxxxxxxxx>
Date: Sun, 31 May 2015 12:09:21 -0700
Delivery-date: Sun, 31 May 2015 19:10:53 +0000
List-id: Xen user discussion <xen-users.lists.xen.org>

Hi,

I have Xen running, most recently under Ubuntu 15, on a host which runs a small number of domUs doing small, non-intensive jobs (DNS serving, spam filtering, RADIUS). This dom0 NFS-mounts a directory holding the domU disk image files on my NetApp filer, and the domU config files all use loopback mounts for these disk images. Occasionally, for some reason I have yet to fathom, NFS simply stops working from the dom0 and all processes accessing NFS hang. I get messages about 'task blocked more than 120 seconds' (from qemu-system-i386) and so forth. The dom0 is otherwise responsive: it is not swapping, there is no high load, and there are no other kernel messages. It's simply that NFS has gone away. Other dom0 hosts NFS-mounting domU disk image files from this same filer have no problems at all. The domUs on the affected Xen host hang: networking still works, they are ping-reachable, and anything not depending on disk access from inside the domU itself continues to work, but any process that touches disk (sendmail, for example) is hung.
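
The symptom described above (dom0 responsive, NFS unresponsive) could be timestamped with a small probe along the lines of the sketch below. It is only an illustration: the mount point `/mnt/domU-images`, the 5-second timeout, and the function name `probe_path` are assumptions, not details from this setup. The stat call runs in a daemon thread because a call against a hung NFS mount may block indefinitely.

```python
import os
import threading
import time


def probe_path(path, timeout_s=5.0, stat_fn=os.stat):
    """Try to stat `path`; return (ok, elapsed_seconds).

    ok is False if the call raised OSError or did not finish within
    timeout_s. A stuck worker thread is simply abandoned (daemon thread),
    so the caller never blocks longer than timeout_s.
    """
    result = {}

    def worker():
        try:
            stat_fn(path)
            result["ok"] = True
        except OSError:
            result["ok"] = False

    thread = threading.Thread(target=worker, daemon=True)
    start = time.monotonic()
    thread.start()
    thread.join(timeout_s)
    elapsed = time.monotonic() - start
    if thread.is_alive():
        # The stat call is still blocked: treat the mount as hung.
        return False, elapsed
    return result.get("ok", False), elapsed


if __name__ == "__main__":
    # Hypothetical mount point; replace with the real NFS mount.
    ok, elapsed = probe_path("/mnt/domU-images")
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    print(f"{stamp} ok={ok} elapsed={elapsed:.2f}s")
```

Run from cron or a loop, the logged timestamps would give a start and end time for each hang, which can then be lined up against the packet capture and the switch logs.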


I have taken the following troubleshooting steps:

The host originally was an AMD box running Ubuntu 14. I tried all of the memory tuning advice: minimum dom0 memory, CPU pinning, etc. NFS continued to have hangs.

I upgraded the box to an Intel hexacore platform with 64 GB of RAM. Same problems.

I installed a dedicated 4-port GigE NIC and put the NFS traffic onto its own bonded port channel. Same problems.


I upgraded to Ubuntu 15. Same problems.

I tuned even more kernel variables, such as swappiness and the dirty cache settings, down to almost nothing. Same problems.

I have SPAN capturing all network traffic to and from the box during the problem period. I can see nothing going obviously wrong, but I don't have good tools beyond tcpdump to really dig into the traffic.

I have arpwatch running to make sure we don't have an IP conflict on the NFS network. Nothing noted.

I have the switches doing extended debugging for all interface state transitions and STP transitions. Nothing noted, no errors; everything is clean and good.

I have had the experience where, during an NFS hang lasting more than 2 hours, it suddenly comes right back and picks up where it left off: all VMs suddenly come back to life and things are all good again.
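
For anyone comparing settings between the affected dom0 and the healthy ones, the kernel variables mentioned in the steps above (swappiness, dirty cache) can be read back from `/proc/sys/vm`. The following is a minimal sketch under assumptions: the three tunables listed are standard Linux vm sysctls, but which ones were actually changed on this host is not stated, and the helper name `read_vm_tunables` is hypothetical.

```python
from pathlib import Path

# Standard Linux vm sysctls; assumed here, since the exact variables
# tuned on the affected host are not listed in the post.
DEFAULT_TUNABLES = ("swappiness", "dirty_ratio", "dirty_background_ratio")


def read_vm_tunables(names=DEFAULT_TUNABLES, base="/proc/sys/vm"):
    """Return {name: value-as-string}, or None where a file is unreadable."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(base) / name).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_tunables().items():
        print(f"vm.{name} = {value}")
```

Saving this output on each dom0 makes it easy to diff the working hosts against the one that hangs.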

The short-term fix when this occurs is to simply reboot the box. Then everything comes back and all is well. But the problem continues unabated and I have been fighting this for too long. The best I can guess is that it's "something" with NFS, but that is all. If I can't find a solution soon, I would be willing to consider other storage methods, including iSCSI. The issue with that, however, is that NFS makes sense to me: I can deal with it, I know how to back it up and how to manage the space and the mounts and such, whereas iSCSI is an enigma to me. I haven't found any really good howtos or other documents showing how to connect all the pieces, unless someone has a pointer they can shoot my way. I'd love to understand what actually appears to be killing NFS and to fix that problem instead, but at this point just getting away from this problem and restoring stability here is more important.


Thank you.

Mike-


