[Xen-users] Xen 4.1 with tmem / xm save -> crash


  • To: xen-users@xxxxxxxxxxxxx
  • From: Jana Saout <jana@xxxxxxxx>
  • Date: Sat, 23 Jun 2012 11:14:15 +0200
  • Delivery-date: Sat, 23 Jun 2012 09:16:48 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

Hello,

I've known for a while that my machines would crash when trying to
migrate VMs from one host to another.  Lately I've started to look into
the issue and noticed that it happens when Xen is trying to save the VM,
i.e. also when using "xm save" (on a PV guest).  I got access to the
console and noticed that when it happens, the hypervisor logs a fatal
page fault and then reboots:

(Note that this happened with various 4.1.x Xen hypervisors and tools;
I recently upgraded to the latest version just to rule out that this is
a bug that has already been fixed.)

I have googled for the issue, but it seems I am (again) the only person
running into this kind of trouble...

(XEN) ----[ Xen-4.1.3-rc2-pre  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    8
(XEN) RIP:    e008:[<ffff82c48012e92c>] do_tmem_op+0x116c/0x1630
(XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff830424383c30   rcx: 0000000000000000
(XEN) rdx: ffff83101bab4620   rsi: 0000000000000000   rdi: 0000000000000001
(XEN) rbp: 000000854e7fa7b0   rsp: ffff8304247b7e08   r8:  0000000000000000
(XEN) r9:  0000000000000010   r10: ffff82c48020b3a0   r11: 0000000000000286
(XEN) r12: ffff83101bab5c30   r13: 0000000000000008   r14: 00000000ffffffff
(XEN) r15: 0000000000000001   cr0: 0000000080050033   cr4: 00000000000006f0
(XEN) cr3: 000000081d7de000   cr2: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen stack trace from rsp=ffff8304247b7e08:
(XEN)    ffff830424454000 0000000000000023 0000000000000000 00000000103ea640
(XEN)    0000000000000000 0000001024454000 0000000000000000 0000000100000012
(XEN)    0000000000000010 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 ffff82c480180cb7 00007f0b277d0000 ffff880117240088
(XEN)    ffff8300d7ada000 ffff8304247b7f18 0000000000000001 ffff82c4801fe396
(XEN)    0000000000000007 0000000000000246 0000000000000000 0000000000000000
(XEN)    00007f0b27c10ff9 000000000000e033 0000000000010203 ffff8300d7ada000
(XEN)    ffff88011c7ede98 00000000ffffffe7 ffff880118003540 0000000000000003
(XEN)    00007fff97d5cb70 ffff82c4801f9ad8 00007fff97d5cb70 0000000000000003
(XEN)    ffff880118003540 00000000ffffffe7 ffff88011c7ede98 00007fff97d5cb70
(XEN)    0000000000000286 00007f0b27bfc438 0000000000000010 00007f0b28038568
(XEN)    0000000000000026 ffffffff810014ca 00007f0b2802d358 0000000000000000
(XEN)    0000000001155004 0000010000000000 ffffffff810014ca 000000000000e033
(XEN)    0000000000000286 ffff88011c7ede40 000000000000e02b 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000008
(XEN)    ffff8300d7ada000 0000003fa44ff880 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82c48012e92c>] do_tmem_op+0x116c/0x1630
(XEN)    [<ffff82c480180cb7>] copy_from_user+0x27/0x90
(XEN)    [<ffff82c4801fe396>] do_iret+0xb6/0x1a0
(XEN)    [<ffff82c4801f9ad8>] syscall_enter+0x88/0x8d
(XEN)    
(XEN) Pagetable walk from 0000000000000000:
(XEN)  L4[0x000] = 000000101b8d7067 0000000000119c69
(XEN)  L3[0x000] = 0000000420d48067 000000000011d277
(XEN)  L2[0x000] = 0000000000000000 ffffffffffffffff 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 8:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0002]
(XEN) Faulting linear address: 0000000000000000
(XEN) ****************************************
(XEN) 

I've nailed down the crash in do_tmem_op (xen/common/tmem.c) to this line:

    case TMEMC_SAVE_GET_POOL_UUID:
        if ( pool == NULL )
            break;
        uuid = (uint64_t *)buf.p;
-->     *uuid++ = pool->uuid[0];
        *uuid = pool->uuid[1];
        rc = 0;

Apparently buf.p (%rcx) is NULL.

I've been trying to figure out how this could happen, but I don't know
enough about the mechanisms involved.  Apparently this op gets called
from Dom0 userspace (xend, via tools/libxc/xc_tmem.c):

(void)xc_tmem_control(xch,i,TMEMC_SAVE_GET_POOL_UUID,dom,sizeof(uuid),0,0,&uuid);

Which then calls

xen_set_guest_handle(op.u.ctrl.buf, buf);
rc = do_tmem_op(xch,&op);

There are also calls to some bounce-buffer handling when the subop is
TMEMC_LIST, but not in the other cases (I have no idea what this is for).

(Note: the Dom0 kernel is 3.1, but the same has been happening with
older kernels too, and I don't think the kernel is involved here at all.)

Has anyone else seen this issue?  What am I doing wrong?  Does my
analysis maybe help a bit in figuring out what is going on?

Thanks,
        Jana



_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users
