
RE: [Xen-devel] Making snapshot of logical volumes handling HVM domU causes OOPS and instability



Jeremy Fitzhardinge wrote:
>  On 08/27/2010 06:22 PM, Scott Garron wrote:
>> I use LVM volumes for domU disks.  To create backups, I create a
>> snapshot of the volume, mount the snapshot in the dom0, mount an
>> equally-sized backup volume from another physical storage source, run
>> an rsync from one to the other, unmount both, then remove the
>> snapshot.  This includes creating snapshots of, and mounting, NTFS
>> volumes from Windows-based HVM guests.
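
The backup loop described above amounts to something like the following
shell sketch (the volume group names, snapshot size, and mount points
below are hypothetical; DRY_RUN=1 prints the commands instead of
executing them):

```shell
#!/bin/sh
# Sketch of the per-volume snapshot backup loop described above.
set -e

run() {
    # Print instead of execute when DRY_RUN=1, so the sequence can be
    # inspected without touching any volumes.
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi
}

backup_volume() {
    vg=$1; lv=$2
    run lvcreate --snapshot --size 1G --name "${lv}-snap" "/dev/$vg/$lv"
    run mount -o ro "/dev/$vg/${lv}-snap" /mnt/snap      # snapshot, read-only
    run mount "/dev/backupvg/${lv}-backup" /mnt/backup   # backup LV on other storage
    run rsync -aHx --delete /mnt/snap/ /mnt/backup/      # copy snapshot -> backup
    run umount /mnt/backup
    run umount /mnt/snap
    run lvremove -f "/dev/$vg/${lv}-snap"                # drop the snapshot
}

# Repeat for each guest disk in the volume group:
DRY_RUN=1
for lv in guest01-disk guest02-disk; do
    backup_volume vg0 "$lv"
done
```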
>> 
>> This practice may not be perfect, but it had worked fine for me for a
>> couple of years while I was running Xen 3.2.1 and a linux-2.6.18.8-xen
>> dom0 (and the same kernel for domU).  After newer versions of udev
>> started complaining about the kernel being too old, I thought it was
>> well past time to transition to a newer version of Xen and a newer
>> dom0 kernel.  That transition has been a gigantic learning experience,
>> let me tell you.
>> 
>> After that transition, here's the problem I've been wrestling with and
>> can't seem to find a solution for: any time I start manipulating a
>> volume group to add or remove a snapshot of a logical volume that is
>> used as a disk by a running HVM guest, new calls into LVM2 and/or
>> Xen's storage layer lock up and spin forever.  The first time I ran
>> across the problem, the only symptom was that any command touching LVM
>> would freeze and could not be signaled to do anything.  In other
>> words: no error messages, nothing in dmesg, nothing in syslog...
>> the commands would just freeze and never return.  That was with the
>> 2.6.31.14 kernel, which is what is currently retrieved if you check
>> out xen-4.0-testing.hg and just do a make dist.
>> 
>> I have since checked out and compiled 2.6.32.18 that comes from doing
>> git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x, as
>> described on the Wiki page here:
>> http://wiki.xensource.com/xenwiki/XenParavirtOps
>> 
>> If I run that kernel for dom0 but continue to use 2.6.31.14 for the
>> paravirtualized domUs, everything works fine until I try to manipulate
>> the snapshots of the HVM volumes.  Today, I got this kernel OOPS:
> 
> That's definitely bad.  Something is causing udevd to end up with bad
> pagetables, which then cause a kernel crash on exit.  I'm not sure if
> it's *the* udevd or some transient child, but either way it's bad.
> 
> Any thoughts on this, Daniel?
> 
>> 
>> ---------------------------
>> 
>> [78084.004530] BUG: unable to handle kernel paging request at ffff8800267c9010
>> [78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
>> [78084.005065] Oops: 0003 [#1] SMP
>> [78084.005234] last sysfs file: /sys/devices/virtual/block/dm-32/removable
>> [78084.005256] CPU 1
>> [78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot
>> nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
>> nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport
>> k8temp floppy forcedeth [last unloaded: scsi_wait_scan]
>> [78084.005256] Pid: 22814, comm: udevd Tainted: G        W  2.6.32.18 #1 H8SMI
>> [78084.005256] RIP: e030:[<ffffffff810382ff>]  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.005256] RSP: e02b:ffff88002e2e1d18  EFLAGS: 00010246
>> [78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX: ffff880000000000
>> [78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI: 0000000000000004
>> [78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09: dead000000100100
>> [78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12: 0000000000000000
>> [78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15: ffff880029248000
>> [78084.005256] FS:  00007fa07d87f7a0(0000) GS:ffff880002d81000(0000) knlGS:0000000000000000
>> [78084.005256] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4: 0000000000000660
>> [78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000, task ffff880019491e80)
>> [78084.005256] Stack:
>> [78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8 ffffffff810fb8a5
>> [78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003 0000000000000000
>> [78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff 000000000061dfff
>> [78084.005256] Call Trace:
>> [78084.005256]  [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
>> [78084.005256]  [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
>> [78084.005256]  [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
>> [78084.005256]  [<ffffffff8107714b>] mmput+0x39/0xda
>> [78084.005256]  [<ffffffff8107adff>] exit_mm+0xfb/0x106
>> [78084.005256]  [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
>> [78084.005256]  [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
>> [78084.005256]  [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
>> [78084.005256]  [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
>> [78084.005256]  [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
>> [78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53
>> 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0
>> 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
>> [78084.005256] RIP  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.005256]  RSP <ffff88002e2e1d18>
>> [78084.005256] CR2: ffff8800267c9010
>> [78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
>> [78084.005256] Fixing recursive fault but reboot is needed!
>> 
>> ---------------------------
>> 
>> After that was printed on the console, anything that interacts with
>> Xen (xentop, xm) would freeze and never return.  After I tried to do a
>> sane shutdown of the guests, the whole dom0 locked up completely.
>> Even the alt-sysrq keys stopped working after I had looked at a couple
>> of them.
>> 
>> I feel it's probably necessary to mention that this happens after
>> several fairly rapid-fire creations and deletions of snapshot volumes.
>> I have it scripted to make a snapshot, mount it, mount a backup
>> volume, rsync it, unmount both volumes, and delete the snapshot, for
>> 19 volumes in a row.  In other words, there's a lot of disk I/O going
>> on around the time of the lockup.  It always seems to coincide with
>> when the script gets to the volumes that are being used by the active,
>> running Windows Server 2008 HVM guests.  That may just be coincidence,
>> though, because those are the last ones on the list; the 15 volumes
>> used by active, running paravirtualized Linux guests are at the top of
>> the list.
>> 
>> 
>> Another issue that comes up is that if I run the 2.6.32.18 pvops
>> kernel for my Linux domUs, after a time (usually only about an hour or
>> so) the network interfaces stop responding.  I don't know if the
>> problem is related, but it was something else that I noticed.  The
>> only way to get network access back is to reboot the domU.  When I
>> reverted the domU kernel to 2.6.31.14, this problem went away.
> 
> That's a separate problem in netfront that appears to be a bug in the
> "smartpoll" code.  I think Dongxiao is looking into it. 

Yes, I have been trying to reproduce this for the last few days, but I could 
not catch it locally. I tried both netperf and ping for a long time, but the 
bug is never triggered. What workload are you running when you hit the bug?

Thanks,
Dongxiao

> 
>> I'm not 100% sure, but I think this issue also causes xm console to
>> stop accepting keyboard input on the console you connect to.  If I
>> connect to a console and then issue an xm shutdown for the same domU
>> from another terminal, all of the console messages showing the
>> play-by-play of the shutdown process are displayed, but my keyboard
>> input doesn't seem to make it through.
> 
> Hm, I'm not familiar with this problem.  Perhaps it's just something
> wrong with your console settings for the domain?  Do you have
> "console=" on the kernel command line?
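
For reference, an interactive console for a pvops domU generally needs the
guest kernel pointed at the Xen paravirtual console device.  A sketch of
the relevant fragment of an xm domain config file (the file name is an
example; hvc0 is the usual PV console device, but check your setup):

```
# /etc/xen/guest.cfg (fragment)
extra = "console=hvc0"   # send the guest kernel's console to the Xen PV console
```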
> 
>> Since I'm not a developer, I don't know if these questions are better
>> suited for the xen-users list, but since it generated an OOPS with the
>> word "BUG" in capital letters, I thought I'd post it here.  If that
>> assumption was incorrect, just give me a gentle nudge and I'll
>> redirect the inquiry somewhere more appropriate.  :)
> 
> Nope, they're both xen-devel fodder.  Thanks for posting.
> 
>     J


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
