[Xen-devel] Live migration bug introduced in 2.6.32.16?
Hi All,

It looks like a live migration bug may have been introduced in 2.6.32.16. I've been experiencing issues where, upon live migration, the domU simply hangs once it gets resumed on the target dom0. I've been unable to get any crash information out of the domU, and nothing comes up in xm dmesg. There could be a kernel panic happening, but since I can't connect to the console during the migration I haven't been able to get anything useful. Comparing a successful migration to a failed one in xend.log and xen-debug.log, nothing stands out as being different.

Testing a wide variety of VMs to see why some worked and some didn't, I've narrowed it down to the domU kernel version, and to 2.6.32.16 specifically, by trying these versions:

2.6.32.8   good
2.6.32.15  good
2.6.32.16  bad
2.6.32.17  bad
2.6.32.20  bad
2.6.32.24  bad
2.6.32.28  bad
2.6.37     bad

All are the stock kernels off kernel.org.

Note that this isn't consistent at all. I've got 6 dom0's and this only happens when migrating in certain directions between certain dom0's:

xen1->xen2 crash
xen1->xen5 crash
xen1->xen6 crash
xen2->xen5 crash
xen2->xen1 works
xen3->xen1 works
xen5->xen2 works
xen5->xen6 works
xen6->xen1 works
xen6->xen5 works

Previously, xen6->xen5 worked but xen5->xen6 didn't. After a few reboots of the dom0s, however, the problem between them resolved itself and now I can go xen5->xen6 and back all day on 2.6.32.16 without issues. If I then migrate the domU to xen1 it's fine, but moving it back to xen5 locks it up on resume.

All 6 Xen dom0's are identical:

xen5 ~ # xm info
host                   : xen5
release                : 2.6.31.13
version                : #11 SMP Wed Jan 26 10:55:28 PST 2011
machine                : x86_64
nr_cpus                : 12
nr_nodes               : 2
cores_per_socket       : 6
threads_per_core       : 1
cpu_mhz                : 2266
hw_caps                : bfebfbff:2c100800:00000000:00001f40:009ee3fd:00000000:00000001:00000000
virt_caps              : hvm hvm_directio
total_memory           : 40950
free_memory            : 38380
node_to_cpu            : node0:0-5 node1:6-11
node_to_memory         : node0:23388 node1:14991
node_to_dma32_mem      : node0:2994 node1:0
max_node_id            : 1
xen_major              : 4
xen_minor              : 0
xen_extra              : .1-rc6-pre
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
xen_commandline        : console=com1,com2,vga com1=115200,8n1 com2=115200,8n1 dom0_mem=1024M dom0_max_vcpus=1 dom0_vcpus_pin=true
cc_compiler            : gcc version 4.3.4 (Gentoo 4.3.4 p1.1, pie-10.1.5)
cc_compile_by          : root
cc_compile_domain      :
cc_compile_date        : Tue Jan 25 17:05:03 PST 2011
xend_config_format     : 4

I've tried updating to a newer dom0 release but ran into linking issues due to as-needed, so I haven't managed to get them up yet.

Looking at the changelog for 2.6.32.16 (http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.32.16), there were two Xen patches, both involving resuming. Diffs for the two patches:

http://git.kernel.org/?p=linux/kernel/git/longterm/linux-2.6.32.y.git;a=commitdiff;h=0f58db21025d979e38db691861985ebc931551b1
http://git.kernel.org/?p=linux/kernel/git/longterm/linux-2.6.32.y.git;a=commitdiff;h=b6d1fd29840e29d1a87d0ab15ee1ccc90ea15ec4

I've tried reverting them together and one at a time, yet the problem still happens. I then took 2.6.32.15, applied those two patches, and it's completely stable. So whatever is causing this was apparently not a Xen-related patch?

I'm completely stumped at this point and don't want to just try applying every patch in 2.6.32.16 to see which one is doing it; compiling and testing all these kernels is time consuming =)

Anyone have any ideas on what might be going on here or how I can debug it further?
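One way to narrow it down without hand-applying every patch in 2.6.32.16 might be to let git bisect walk the 2.6.32 longterm tree between the v2.6.32.15 and v2.6.32.16 tags; that only needs roughly log2(number of patches) domU kernel builds instead of one build per patch. A rough sketch, assuming the git clone URL for the same longterm tree the commitdiff links above point at, and using whichever migration direction currently fails as the test:

  # clone the 2.6.32 longterm tree (assumed git URL for the gitweb tree linked above)
  git clone git://git.kernel.org/pub/scm/linux/kernel/git/longterm/linux-2.6.32.y.git
  cd linux-2.6.32.y

  git bisect start
  git bisect bad  v2.6.32.16     # first domU kernel that hangs on resume
  git bisect good v2.6.32.15     # last domU kernel that migrates cleanly

  # at each step git checks out a candidate revision: build it with the usual
  # domU .config, boot the test domU on it, then retry a failing direction,
  # e.g. from xen1:
  #     xm migrate --live <domU> xen5
  # and report the result back:
  git bisect good                # or: git bisect bad
  # repeat until git prints the first bad commit, then clean up:
  git bisect reset

Since the failing directions have shifted after dom0 reboots before, it's probably worth repeating the migration a few times per bisect step before marking a kernel good.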
Thanks,
Nathan

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel