
[Xen-devel] slow live migration / xc_restore on xen4 pvops



Hi,

In preparation for our soon-to-arrive central storage array I wanted to test live migration and Remus replication, and stumbled upon a problem. When migrating a test VM (512 MB RAM, idle) between my 3 servers, two of them are extremely slow in "receiving" the VM. There is little to no CPU utilization from xc_restore until shortly before the migration is complete.
The same goes for xm restore.
The xend.log contains:
[2010-06-01 21:16:27 5211] DEBUG (XendCheckpoint:286) restore:shadow=0x0, _static_max=0x20000000, _static_min=0x0,
[2010-06-01 21:16:27 5211] DEBUG (XendCheckpoint:305) [xc_restore]: /usr/lib/xen/bin/xc_restore 48 43 1 2 0 0 0 0
[2010-06-01 21:16:27 5211] INFO (XendCheckpoint:423) xc_domain_restore start: p2m_size = 20000
[2010-06-01 21:16:27 5211] INFO (XendCheckpoint:423) Reloading memory pages: 0%
[2010-06-01 21:20:57 5211] INFO (XendCheckpoint:423) ERROR Internal error: Error when reading batch size
[2010-06-01 21:20:57 5211] INFO (XendCheckpoint:423) ERROR Internal error: error when buffering batch, finishing

This is what gets logged when receiving a VM via live migration finally finishes. Note the large gap in the timestamps: 21:16:27 to 21:20:57, four and a half minutes.
The VM is perfectly fine after that; it just takes way too long.
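
To spot such a stall without reading the timestamps by eye, the log can be scanned for the largest pause between consecutive entries. The following is only a rough sketch: the function name largest_gap and the regular expression are illustrative assumptions based on the "[date time pid]" prefix seen in the excerpt above, not anything shipped with xend.

```python
import re
from datetime import datetime

# Assumed prefix format, taken from the excerpt: "[YYYY-MM-DD HH:MM:SS pid]".
STAMP = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \d+\]")

def largest_gap(log_text):
    """Return (gap_seconds, earlier, later) for the longest pause between entries."""
    stamps = [datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
              for m in STAMP.finditer(log_text)]
    best = (0.0, None, None)
    # Compare each timestamp with the one that follows it.
    for earlier, later in zip(stamps, stamps[1:]):
        gap = (later - earlier).total_seconds()
        if gap > best[0]:
            best = (gap, earlier, later)
    return best

# Two entries copied from the xend.log excerpt above.
sample = (
    "[2010-06-01 21:16:27 5211] INFO (XendCheckpoint:423) Reloading memory pages: 0%\n"
    "[2010-06-01 21:20:57 5211] INFO (XendCheckpoint:423) ERROR Internal error: "
    "Error when reading batch size\n"
)

gap, start, end = largest_gap(sample)
print(f"{gap:.0f}s between {start} and {end}")
```

For the two entries shown, this reports a gap of 270 seconds, which matches the four and a half minutes between "Reloading memory pages: 0%" and the first error line.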


First off, let me explain my server setup; detailed information on my attempts to narrow down the error follows. I have 3 servers running Xen 4 with 2.6.31.13-pvops as the kernel; it's the current kernel from Jeremy's xen/master git branch.
The guests are running vanilla 2.6.32.11 kernels.

The 3 servers differ slightly in hardware: two are Dell PE 2950s and one is a Dell R710. The 2950s have 2 quad-core Xeon CPUs each (L5335 and L5410), the R710 has 2 quad-core Xeon E5520s.
All machines have 24 GB of RAM.

They are called "tarballerina" (E5520), "xenturio1" (L5335) and "xenturio2" (L5410).

Currently I use tarballerina for testing purposes, but I don't consider anything in my setup "stable".
xenturio1 has 27 guests running, xenturio2 25.
No guest does anything that would even put a dent into the system's performance (LDAP servers, RADIUS, department webservers, etc.).

I created a test VM called "hatest" on my current central iSCSI storage. It idles around and has 2 VCPUs and 512 MB of RAM.

First I tested xm save/restore:
tarballerina:~# time xm restore /var/saverestore-t.mem
real    0m13.227s
user    0m0.090s
sys     0m0.023s
xenturio1:~# time xm restore /var/saverestore-x1.mem
real    4m15.173s
user    0m0.138s
sys     0m0.029s


When migrating to xenturio1 or 2, the migration takes 181 to 278 seconds; when migrating to tarballerina it takes roughly 30 seconds:
tarballerina:~# time xm migrate --live hatest 10.0.1.98
real    3m57.971s
user    0m0.086s
sys     0m0.029s
xenturio1:~# time xm migrate --live hatest 10.0.1.100
real    0m43.588s
user    0m0.123s
sys     0m0.034s
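
To compare the two directions in plain seconds, the "real" lines printed by time can be converted. This is a small sketch only: the helper name real_seconds is made up for illustration, and the pattern assumes the "XmY.YYYs" format shown in the transcripts above.

```python
import re

# Assumed format of the shell's `time` output, e.g. "real    3m57.971s".
REAL = re.compile(r"real\s+(\d+)m([\d.]+)s")

def real_seconds(time_output):
    """Convert the 'real' line of `time` output to seconds."""
    match = REAL.search(time_output)
    if match is None:
        raise ValueError("no 'real' line found")
    return int(match.group(1)) * 60 + float(match.group(2))

# Values copied from the two migrations above.
from_tarballerina = real_seconds("real    3m57.971s")
from_xenturio1 = real_seconds("real    0m43.588s")

print(f"from tarballerina: {from_tarballerina:.1f}s")
print(f"from xenturio1:    {from_xenturio1:.1f}s")
print(f"ratio: {from_tarballerina / from_xenturio1:.1f}x")
```

With the numbers above, the migration started on tarballerina takes about 238 seconds against about 44 seconds in the other direction, a factor of roughly 5.5.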


--- attempt at narrowing it down ----
My first guess was that, since tarballerina had almost no guests running that did anything, it could be an issue of memory usage by the tapdisk2 processes (each dom0 has been mem-set to 4096M).
I then started almost all the VMs that I have on tarballerina:
tarballerina:~# time xm save saverestore-t /var/saverestore-t.mem
real    0m2.884s
tarballerina:~# time xm restore /var/saverestore-t.mem
real    0m15.594s


I tried this several times; sometimes it took 30+ seconds.

Then I started 2 VMs that run load- and I/O-generating processes (stress, dd, openssl encryption, md5sum).
But this didn't affect xm restore performance; it was still quite fast:
tarballerina:~# time xm save saverestore-t /var/saverestore-t.mem
real    0m7.476s
user    0m0.101s
sys     0m0.022s
tarballerina:~# time xm restore /var/saverestore-t.mem
real    0m45.544s
user    0m0.094s
sys     0m0.022s

I tried several times again; restore took 17 to 45 seconds.
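
Since the timings vary this much between runs, repeating the save/restore cycle in a loop and recording the spread gives a more reliable picture than single measurements. Below is a minimal sketch under some assumptions: time_cycle and summarize are hypothetical helper names, XM_CYCLE simply reuses the two xm commands from the transcripts above, and the demo runs a harmless no-op so the script works on any machine.

```python
import subprocess
import sys
import time

# One save/restore cycle, using the commands from the transcripts above.
# Not executed by this sketch; pass it to time_cycle() on a dom0.
XM_CYCLE = [
    ("save", ["xm", "save", "saverestore-t", "/var/saverestore-t.mem"]),
    ("restore", ["xm", "restore", "/var/saverestore-t.mem"]),
]

def time_cycle(steps, runs=5):
    """Run every (label, argv) step `runs` times; return {label: [seconds, ...]}."""
    results = {label: [] for label, _ in steps}
    for _ in range(runs):
        for label, argv in steps:
            start = time.monotonic()
            subprocess.run(argv, check=True)  # raises if the command fails
            results[label].append(time.monotonic() - start)
    return results

def summarize(results):
    """Return {label: (fastest, slowest)} so the spread is visible at a glance."""
    return {label: (min(times), max(times)) for label, times in results.items()}

# Harmless stand-in command so the sketch runs anywhere.
demo = [("noop", [sys.executable, "-c", "pass"])]
for label, (fastest, slowest) in summarize(time_cycle(demo, runs=3)).items():
    print(f"{label}: {fastest:.3f}s to {slowest:.3f}s")
```

On a dom0, calling summarize(time_cycle(XM_CYCLE, runs=5)) would give the fastest and slowest save and restore over five cycles, which is the kind of range quoted above.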

Then I tried migrating the test VM to tarballerina again. It was still fast, in spite of several VMs running, including the load- and I/O-generating ones.
Together these ate almost all available RAM.
CPU times for xc_restore according to the target machine's "top":
tarballerina -> xenturio1: 0:05:xx, cpu 2-4%, near the end 40%.
xenturio1 -> tarballerina: 0:04:xx, cpu 4-8%, near the end 54%.

tarballerina:~# time xm migrate --live hatest 10.0.1.98
real    3m29.779s
user    0m0.102s
sys     0m0.017s
xenturio1:~# time xm migrate --live hatest 10.0.1.100
real    0m28.386s
user    0m0.154s
sys     0m0.032s


So my attempt at narrowing the problem down failed: it's neither the free memory of the dom0 nor the load, I/O or memory that the other domUs utilize.
---end attempt---

More info (xm list, meminfo, table with migration times, etc.) on my setup can be found here:
http://andiolsi.rz.uni-lueneburg.de/node/37

Someone else reported the same error in his logfile; this may or may not be related:
http://lists.xensource.com/archives/html/xen-users/2010-05/msg00318.html

Further information can be given, should demand for it arise.

With best regards

---
Andreas Olsowski <andreas.olsowski@xxxxxxxxxxxxxxx>
Leuphana Universität Lüneburg
System- und Netzwerktechnik
Rechenzentrum, Geb 7, Raum 15
Scharnhorststr. 1
21335 Lüneburg

Tel: ++49 4131 / 6771309





 

