
[Xen-users] strange scheduling / io problem



Hello there.
I'm trying to find the cause of a strange problem I'm seeing with Xen, but I'm not making much progress. A little background:

I have two Dell SC1425 (dual P4/Xeon (nocona) HTT non-VTx) servers running Xen and Gentoo Linux. The current setup is (everything installed from Gentoo portage):
- Xen 3.0.4-1
- Gentoo linux domains
- everything pure x86_64, no 32bit code
- Kernel 2.6.20-xen-r2, different kernels for Dom0 and DomU
- drbd 8.0.5

I have drbd doing "network raid" over a crossover gigabit link between the two servers. Then I use LVM2 on the drbd disks to make logical volumes for the DomUs. The Dom0s are on good old software raid1 with no drbd or lvm. This setup has proven stable and has been working for a long time; it actually started in early 2006 with earlier versions of Xen and Linux.

I am now in the process of upgrading one of the two nodes, and I rebuilt the system from scratch with Xen-3.2.1, Linux-2.6.21-xen, drbd-8.0.12.

Everything is working well except for one thing: I have an rsyncd server in the Dom0 serving a local copy of the portage tree. If I run "emerge --sync" from a DomU using my local rsyncd server, the DomU nearly freezes. Looking at xm top / xm list, it seems that it either doesn't get cpu time scheduled or is blocked waiting for something: the "cpu time" consumed by the DomU does not increase for a while. After some time the rsync client in the DomU times out and then everything works normally again. Curious facts:

1) the whole system is otherwise idle and doing nothing
2) just the Dom0 and one DomU for the tests
3) if I put some disk load on both dom0 and domU (I use tar cf tmpfile.tar /usr) everything is ok
4) if I put some cpu load on both dom0 and domU (I use distributed.net's client to bring cpu use to 100% on both physical cpus) everything is ok
5) if I put some simple network load ("nc -l -p 1999 > /dev/null" in Dom0 and "dd if=/dev/zero bs=1024k count=10240 | nc 172.16.0.2 1999 -q1" in DomU) everything is ok
6) if I do the dnetc + tar + netcat things all together both in dom0 and domU, everything is ok and both domains are still responsive
7) if I run "emerge --sync" in domU against an rsyncd on another machine (a gentoo official mirror or even my other node connected via crossover gigabit) everything is ok

.....but if I run "emerge --sync" in domU against the rsyncd server on dom0 on the same hardware, the dom0 runs ok and stays responsive while the domU becomes sluggish: hitting enter at the empty login prompt, without a username, on "xm console" takes 40 seconds before I get another login prompt. The rsync client transfers a little data before this "freezing" occurs.

I really can't figure out what rsync (domU) + rsyncd (dom0) does that makes it behave like this, and I can't reproduce the problem with any other test. I tried using my "old" kernel for the domU (the linux-2.6.20-xen-r2 from the other running node) and it behaves exactly the same. I use the default scheduler (credit) with default settings; dom0 has access to all cpus (4 logical cpus: they're two hyperthreading Xeons), and the domU runs a single-processor kernel with just one vcpu.
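
For anyone who wants to put numbers on the stall instead of eyeballing xm top, here is a minimal sketch in Python. It assumes that "xm list" prints one header row followed by one row per domain, with the domain name in the first column and the cumulative cpu time in seconds in the last column; the function names and the 0.01 second threshold are made up for illustration. Keep in mind that an idle domU also accumulates almost no cpu time, so the result only means something while the rsync is supposed to be running.

```python
import subprocess
import time


def parse_cpu_times(xm_list_output):
    """Map domain name -> cumulative cpu seconds from `xm list` text.

    Assumes the first row is a header and that the last column of
    every other row is the Time(s) value.
    """
    times = {}
    for line in xm_list_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            times[fields[0]] = float(fields[-1])
        except ValueError:
            continue  # not a domain row, ignore it
    return times


def find_stalls(samples, domain, min_delta=0.01):
    """Return the indices i where the domain's cpu time grew by less
    than min_delta seconds between samples[i - 1] and samples[i]."""
    stalls = []
    for i in range(1, len(samples)):
        prev = samples[i - 1].get(domain)
        cur = samples[i].get(domain)
        if prev is not None and cur is not None and cur - prev < min_delta:
            stalls.append(i)
    return stalls


def sample_xm_list(count, interval):
    """Run `xm list` `count` times (in dom0, as root), sleeping
    `interval` seconds after each run, and return the parsed samples."""
    samples = []
    for _ in range(count):
        result = subprocess.run(["xm", "list"], capture_output=True,
                                text=True, check=True)
        samples.append(parse_cpu_times(result.stdout))
        time.sleep(interval)
    return samples
```

Run in dom0 while the sync is in progress, something like find_stalls(sample_xm_list(60, 1.0), "mydomU") would list the one-second intervals in which the guest got essentially no cpu time ("mydomU" is a placeholder for the real domain name).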

Any help would be appreciated. Thanks.

--
Luca Lesinigo


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users


 

