
RE: [Xen-devel] Making snapshot of logical volumes handling HVM domU causes OOPS and instability



Jeremy Fitzhardinge wrote:
>  On 08/27/2010 06:22 PM, Scott Garron wrote:
>> I use LVM volumes for domU disks.  To create backups, I create a
>> snapshot of the volume, mount the snapshot in the dom0, mount an
>> equally-sized backup volume from another physical storage source, run
>> an rsync from one to the other, unmount both, then remove the
>> snapshot.  This includes creating snapshots of, and mounting, NTFS
>> volumes from Windows-based HVM guests.
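
The backup loop described above amounts to something like the following
shell sketch (the volume group names, snapshot size, and mount points
below are hypothetical; DRY_RUN=1 prints the commands instead of
executing them):

```shell
#!/bin/sh
# Sketch of the per-volume snapshot backup loop described above.
set -e

run() {
    # Print instead of execute when DRY_RUN=1, so the sequence can be
    # inspected without touching any volumes.
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi
}

backup_volume() {
    vg=$1; lv=$2
    run lvcreate --snapshot --size 1G --name "${lv}-snap" "/dev/$vg/$lv"
    run mount -o ro "/dev/$vg/${lv}-snap" /mnt/snap      # snapshot, read-only
    run mount "/dev/backupvg/${lv}-backup" /mnt/backup   # backup LV on other storage
    run rsync -aHx --delete /mnt/snap/ /mnt/backup/      # copy snapshot -> backup
    run umount /mnt/backup
    run umount /mnt/snap
    run lvremove -f "/dev/$vg/${lv}-snap"                # drop the snapshot
}

# Repeat for each guest disk in the volume group:
DRY_RUN=1
for lv in guest01-disk guest02-disk; do
    backup_volume vg0 "$lv"
done
```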
>> 
>> This practice may not be perfect, but it had worked fine for me for a
>> couple of years while I was running Xen 3.2.1 and a linux-2.6.18.8-xen
>> dom0 (and the same kernel for domU).  After newer versions of udev
>> started complaining about the kernel being too old, I thought it was
>> well past time to transition to a newer version of Xen and a newer
>> dom0 kernel.  That transition has been a gigantic learning experience,
>> let me tell you.
>> 
>> After that transition, here's the problem I've been wrestling with and
>> can't seem to find a solution for: any time I start manipulating a
>> volume group to add or remove a snapshot of a logical volume that is
>> used as a disk by a running HVM guest, new calls into LVM2 and/or
>> Xen's storage layer lock up and spin forever.  The first time I ran
>> across the problem, the only symptom was that any command touching LVM
>> would freeze and could not be signaled to do anything.  In other
>> words: no error messages, nothing in dmesg, nothing in syslog...
>> the commands would just freeze and never return.  That was with the
>> 2.6.31.14 kernel, which is what is currently retrieved if you check
>> out xen-4.0-testing.hg and just do a make dist.
>> 
>> I have since checked out and compiled 2.6.32.18 that comes from doing
>> git checkout -b xen/stable-2.6.32.x origin/xen/stable-2.6.32.x, as
>> described on the Wiki page here:
>> http://wiki.xensource.com/xenwiki/XenParavirtOps
>> 
>> If I run that kernel for dom0 but continue to use 2.6.31.14 for the
>> paravirtualized domUs, everything works fine until I try to manipulate
>> the snapshots of the HVM volumes.  Today, I got this kernel OOPS:
> 
> That's definitely bad.  Something is causing udevd to end up with bad
> pagetables, which then cause a kernel crash on exit.  I'm not sure if
> it's *the* udevd or some transient child, but either way it's bad.
> 
> Any thoughts on this, Daniel?
> 
>> 
>> ---------------------------
>> 
>> [78084.004530] BUG: unable to handle kernel paging request at ffff8800267c9010
>> [78084.004710] IP: [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.004886] PGD 1002067 PUD 1006067 PMD 217067 PTE 80100000267c9065
>> [78084.005065] Oops: 0003 [#1] SMP
>> [78084.005234] last sysfs file: /sys/devices/virtual/block/dm-32/removable
>> [78084.005256] CPU 1
>> [78084.005256] Modules linked in: tun xt_multiport fuse dm_snapshot
>> nf_nat_tftp nf_conntrack_tftp nf_nat_pptp nf_conntrack_pptp
>> nf_conntrack_proto_gre nf_nat_proto_gre ntfs parport_pc parport
>> k8temp floppy forcedeth [last unloaded: scsi_wait_scan]
>> [78084.005256] Pid: 22814, comm: udevd Tainted: G        W  2.6.32.18 #1 H8SMI
>> [78084.005256] RIP: e030:[<ffffffff810382ff>]  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.005256] RSP: e02b:ffff88002e2e1d18  EFLAGS: 00010246
>> [78084.005256] RAX: 0000000000000000 RBX: ffff8800267c9010 RCX: ffff880000000000
>> [78084.005256] RDX: dead000000100100 RSI: 0000000000000000 RDI: 0000000000000004
>> [78084.005256] RBP: ffff88002e2e1d28 R08: 0000000001993000 R09: dead000000100100
>> [78084.005256] R10: 800000016e90e165 R11: 0000000000000000 R12: 0000000000000000
>> [78084.005256] R13: ffff880002d8f580 R14: 0000000000400000 R15: ffff880029248000
>> [78084.005256] FS:  00007fa07d87f7a0(0000) GS:ffff880002d81000(0000) knlGS:0000000000000000
>> [78084.005256] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
>> [78084.005256] CR2: ffff8800267c9010 CR3: 0000000001001000 CR4: 0000000000000660
>> [78084.005256] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [78084.005256] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> [78084.005256] Process udevd (pid: 22814, threadinfo ffff88002e2e0000, task ffff880019491e80)
>> [78084.005256] Stack:
>> [78084.005256]  0000000000600000 000000000061e000 ffff88002e2e1de8 ffffffff810fb8a5
>> [78084.005256] <0> 00007fff13ffffff 0000000100000206 ffff880003158003 0000000000000000
>> [78084.005256] <0> 0000000000000000 000000000061dfff 000000000061dfff 000000000061dfff
>> [78084.005256] Call Trace:
>> [78084.005256]  [<ffffffff810fb8a5>] free_pgd_range+0x27c/0x45e
>> [78084.005256]  [<ffffffff810fbb2b>] free_pgtables+0xa4/0xc7
>> [78084.005256]  [<ffffffff810ff1fd>] exit_mmap+0x107/0x13f
>> [78084.005256]  [<ffffffff8107714b>] mmput+0x39/0xda
>> [78084.005256]  [<ffffffff8107adff>] exit_mm+0xfb/0x106
>> [78084.005256]  [<ffffffff8107c86d>] do_exit+0x1e8/0x6ff
>> [78084.005256]  [<ffffffff815c228b>] ? do_page_fault+0x2cd/0x2fd
>> [78084.005256]  [<ffffffff8107ce0d>] do_group_exit+0x89/0xb3
>> [78084.005256]  [<ffffffff8107ce49>] sys_exit_group+0x12/0x16
>> [78084.005256]  [<ffffffff8103cc82>] system_call_fastpath+0x16/0x1b
>> [78084.005256] Code: 48 83 c4 28 5b c9 c3 55 48 89 e5 41 54 49 89 f4 53
>> 48 89 fb e8 fc ee ff ff 48 89 df ff 05 52 8f 9e 00 e8 78 e4 ff ff 84 c0
>> 75 05 <4c> 89 23 eb 16 e8 e0 ee ff ff 4c 89 e6 48 89 df ff 05 37 8f 9e
>> [78084.005256] RIP  [<ffffffff810382ff>] xen_set_pmd+0x24/0x44
>> [78084.005256]  RSP <ffff88002e2e1d18>
>> [78084.005256] CR2: ffff8800267c9010
>> [78084.005256] ---[ end trace 4eaa2a86a8e2da24 ]---
>> [78084.005256] Fixing recursive fault but reboot is needed!
>> 
>> ---------------------------
>> 
>> After that was printed on the console, anything that interacts with
>> Xen (xentop, xm) would freeze and never return.  After I tried to do a
>> sane shutdown of the guests, the whole dom0 locked up completely.
>> Even the alt-sysrq keys stopped working after I had looked at a couple
>> of them.
>> 
>> I feel it's probably necessary to mention that this happens after
>> several fairly rapid-fire creations and deletions of snapshot volumes.
>> I have it scripted to make a snapshot, mount it, mount a backup
>> volume, rsync it, unmount both volumes, and delete the snapshot, for
>> 19 volumes in a row.  In other words, there's a lot of disk I/O going
>> on around the time of the lockup.  It always seems to coincide with
>> when the script gets to the volumes that are being used by the active,
>> running Windows Server 2008 HVM guests.  That may just be coincidence,
>> though, because those are the last ones on the list; the 15 volumes
>> used by active, running paravirtualized Linux guests are at the top of
>> the list.
>> 
>> 
>> Another issue that comes up is that if I run the 2.6.32.18 pvops
>> kernel for my Linux domUs, after a time (usually only about an hour or
>> so) the network interfaces stop responding.  I don't know if the
>> problem is related, but it was something else that I noticed.  The
>> only way to get network access back is to reboot the domU.  When I
>> reverted the domU kernel to 2.6.31.14, this problem went away.
> 
> That's a separate problem in netfront that appears to be a bug in the
> "smartpoll" code.  I think Dongxiao is looking into it. 

Yes, I have been trying to reproduce this for the last few days, but I could 
not catch it locally. I tried both netperf and ping for a long time, but the 
bug is never triggered. What workload are you running when you hit the bug?

Thanks,
Dongxiao

> 
>> I'm not 100% sure, but I think this issue also causes xm console to
>> stop accepting keyboard input on the console you connect to.  If I
>> connect to a console and then issue an xm shutdown for the same domU
>> from another terminal, all of the console messages showing the
>> play-by-play of the shutdown process are displayed, but my keyboard
>> input doesn't seem to make it through.
> 
> Hm, I'm not familiar with this problem.  Perhaps it's just something
> wrong with your console settings for the domain?  Do you have
> "console=" on the kernel command line?
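
For reference, an interactive console for a pvops domU generally needs the
guest kernel pointed at the Xen paravirtual console device.  A sketch of
the relevant fragment of an xm domain config file (the file name is an
example; hvc0 is the usual PV console device, but check your setup):

```
# /etc/xen/guest.cfg (fragment)
extra = "console=hvc0"   # send the guest kernel's console to the Xen PV console
```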
> 
>> Since I'm not a developer, I don't know if these questions are better
>> suited for the xen-users list, but since it generated an OOPS with the
>> word "BUG" in capital letters, I thought I'd post it here.  If that
>> assumption was incorrect, just give me a gentle nudge and I'll
>> redirect the inquiry somewhere more appropriate.  :)
> 
> Nope, they're both xen-devel fodder.  Thanks for posting.
> 
>     J


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
