This may be related to a post from a couple of days ago about random reboots, but our problem arises in a different environment and situation.

We are running a two-node cluster. Both nodes run Debian Squeeze + Xen 4.02 on top of OCFS2 1.4.4-3. The kernel is 2.6.32-5-xen-amd64. Both nodes store and run VMs on the OCFS2 partition, which the two boxes access via iSCSI. We run a network stress test in which two VMs pass a large file between them. One VM exports an NFS share containing the file, and the other VM copies this file (an arbitrarily chosen large file, a 4.6 GB debian.iso) back and forth between the NFS share and its own local directory. At present the network configuration is giving us no problems: no lost packets, collisions, etc.
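For anyone wanting to reproduce the test, one round of the copy could look roughly like the sketch below. This is not the exact script we run: the function name copy_round, the ".back" suffix and the md5sum check are assumptions added for illustration, so that silent corruption would show up as well as an outright crash.

```shell
#!/bin/sh
# Sketch of one stress round: NFS share -> local directory -> NFS share.
# copy_round, its arguments and the ".back" suffix are hypothetical names.
copy_round() {
    src=$1          # large file on the NFS share
    local_dir=$2    # local directory inside the copying VM
    name=$(basename "$src")

    # Checksum of the original, used to verify both copies.
    want=$(md5sum < "$src" | cut -d' ' -f1)

    # Share -> local.
    cp "$src" "$local_dir/$name" || return 1
    got=$(md5sum < "$local_dir/$name" | cut -d' ' -f1)
    [ "$want" = "$got" ] || { echo "mismatch after copy from share" >&2; return 1; }

    # Local -> share, under a temporary name.
    cp "$local_dir/$name" "$src.back" || return 1
    back=$(md5sum < "$src.back" | cut -d' ' -f1)

    # Clean up so the next round starts from the same state.
    rm -f "$src.back" "$local_dir/$name"
    [ "$want" = "$back" ]
}
```

It would be called in a loop, e.g. `while copy_round /mnt/nfs/debian.iso /home/test; do :; done`, where both paths are placeholders.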

The VMs are Lucid instances (Ubuntu 10.04) created with the following command:

sudo xen-create-image --hostname lucidxentest --ip --pygrub
plus these xen-tools.conf parameters: size = 8 Gb, image = full, memory = 512, swap = 512

The stress test runs successfully for anywhere from 1 to 12 hours, and then the system reboots. Afterwards, the file copy has been interrupted, the VMs have crashed, and one of the nodes has rebooted.

I have noticed occasional reports of a kernel error (linux/mm/slub.c 2969!), similar to Debian bug #634047. But I find no firm correlation, since kern.log and messages often do not show this kernel error.
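To check for a correlation, the logs can be scanned after each reboot with something like the sketch below. The function name count_slub_hits is made up for illustration, and the pattern only matches the file name mm/slub.c, since the exact wording of the kernel message may vary.

```shell
#!/bin/sh
# Count log lines that mention mm/slub.c in the given files.
# count_slub_hits is a hypothetical helper; unreadable files are skipped.
count_slub_hits() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    cat "$@" 2>/dev/null | grep -c 'mm/slub\.c' || true
}
```

Usage on a Debian node would be `count_slub_hits /var/log/kern.log /var/log/messages`.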

Some things I have tried:

a basic reinstall of all the components of the system (squeeze + xen + ocfs2)

a memtest on both nodes. (no problems).

changing the default Debian I/O scheduler in combination with ocfs2: cfq, deadline, anticipatory, noop.

currently looking into, but not yet tested: adjusting (1) the halt state setting in the BIOS; (2) the cpufreq=dom0-kernel setting (frequency scaling).
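For the I/O scheduler item above, the scheduler can be switched at runtime through sysfs, where the active one is shown in square brackets. The following is a rough sketch, not necessarily how we did it: the helper names are made up, and the device name (e.g. sdb) is a placeholder for whichever disk backs the iSCSI volume.

```shell
#!/bin/sh
# Print the active scheduler from a sysfs line such as
# "noop anticipatory deadline [cfq]" (the active one is in brackets).
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Switch a block device to the given scheduler (needs root).
# The device name is a placeholder, e.g. sdb.
set_scheduler() {
    dev=$1
    sched=$2
    echo "$sched" > "/sys/block/$dev/queue/scheduler"
}
```

For example, `set_scheduler sdb deadline` followed by `active_scheduler "$(cat /sys/block/sdb/queue/scheduler)"` should confirm the change.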

Any suggestions are welcome!

Ben Weaver



