Hi:
I found another bug while testing memory sharing.
In this test, I start 24 Linux HVM guests, and each of them reboots via "xm reboot" every 30 minutes.
After several hours, some of the HVM guests crash. All of the crashed guests stopped during boot.
The bug still exists even when I disable page sharing by tricking tapdisk into seeing
xc_memshr_nominate_gref() return failure.
No special log messages were found.
I was able to dump the crash stack.
What could be happening?
Thanks.
PID: 2307  TASK: ffff810014166100  CPU: 0  COMMAND: "setfont"
 #0 [ffff8100123cd900] xen_panic_event at ffffffff88001d28
 #1 [ffff8100123cd920] notifier_call_chain at ffffffff80066eaa
 #2 [ffff8100123cd940] panic at ffffffff8009094a
 #3 [ffff8100123cda30] oops_end at ffffffff80064fca
 #4 [ffff8100123cda40] do_page_fault at ffffffff80066dc0
 #5 [ffff8100123cdb30] error_exit at ffffffff8005dde9
    [exception RIP: vgacon_do_font_op+363]
    RIP: ffffffff800515e5  RSP: ffff8100123cdbe8  RFLAGS: 00010203
    RAX: 0000000000000000  RBX: ffffffff804b3740  RCX: ffff8100000a03fc
    RDX: 00000000000003fd  RSI: ffff810011cec000  RDI: ffffffff803244c4
    RBP: ffff810011cec000   R8: d0d6999996000000   R9: 0000009090b0b0ff
    R10: 0000000000000000  R11: 0000000000000000  R12: 0000000000000004
    R13: 0000000000000001  R14: 0000000000000001  R15: 000000000000000e
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffff8100123cdc20] vgacon_font_set at ffffffff8016bec5
 #7 [ffff8100123cdc60] con_font_op at ffffffff801aa86b
 #8 [ffff8100123cdcd0] vt_ioctl at ffffffff801a5af4
 #9 [ffff8100123cdd70] tty_ioctl at ffffffff80038a2c
#10 [ffff8100123cdeb0] do_ioctl at ffffffff800420d9
#11 [ffff8100123cded0] vfs_ioctl at ffffffff800302ce
#12 [ffff8100123cdf40] sys_ioctl at ffffffff8004c766
#13 [ffff8100123cdf80] tracesys at ffffffff8005d28d (via system_call)
    RIP: 00000039294cc557  RSP: 00007fff54c4aec8  RFLAGS: 00000246
    RAX: ffffffffffffffda  RBX: ffffffff8005d28d  RCX: ffffffffffffffff
    RDX: 00007fff54c4aee0  RSI: 0000000000004b72  RDI: 0000000000000003
    RBP: 000000001d747ab0   R8: 0000000000000010   R9: 0000000000800000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000010
    R13: 0000000000000200  R14: 0000000000000008  R15: 0000000000000008
    ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

> Date: Fri, 21 Jan 2011 14:45:14 -0500
> Subject: Re: mem_sharing: summarized problems when domain is dying
> From: juihaochiang@xxxxxxxxx
> To: Tim.Deegan@xxxxxxxxxx
> CC: tinnycloud@xxxxxxxxxxx; xen-devel@xxxxxxxxxxxxxxxxxxx
>
> Hi
>
> On Fri, Jan 21, 2011 at 11:19 AM, Jui-Hao Chiang <juihaochiang@xxxxxxxxx> wrote:
> > Hi, Tim:
> >
> > From tinnycloud's result, here I summarize the current problem and
> > findings of mem_sharing due to domain dying.
> > (1) When domain is dying, alloc_domheap_page() and
> > set_shared_p2m_entry() would just fail. So the shr_lock is not enough
> > to ensure that the domain won't die in the middle of mem_sharing code.
> > As tinnycloud's code shows, is that better to use
> > rcu_lock_domain_by_id before calling the above two functions?
> >
>
> There seems no good locking to protect a domain from changing the
> is_dying state. So the unshare function could fail in the middle in
> several points, e.g., alloc_domheap_page and set_shared_p2m_entry.
> If that's the case, we need to add some checking, and probably revert
> the things we have done when is_dying is changed in the middle.
>
> Any comments?
>
> Jui-Hao
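
For reference, the "add some checking and revert" idea from the quoted mail could look roughly like the sketch below. This is only a standalone mock, not Xen code: struct mock_domain and every mock_* function are hypothetical stand-ins that I made up for illustration. The only behaviour taken from the thread is that the allocation step and the p2m update step fail once the domain is dying. The die_after_alloc flag is an assumption used to simulate is_dying changing in the middle of the unshare path.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a domain; NOT Xen's struct domain. */
struct mock_domain {
    bool is_dying;            /* set once the domain starts dying */
    int  pages_allocated;     /* private pages currently held */
    int  p2m_private_entries; /* p2m entries switched to private copies */
};

/* Mock of the allocation step: fails if the domain is dying. */
static bool mock_alloc_page(struct mock_domain *d)
{
    if (d->is_dying)
        return false;
    d->pages_allocated++;
    return true;
}

/* Undo a successful mock_alloc_page(). */
static void mock_free_page(struct mock_domain *d)
{
    d->pages_allocated--;
}

/* Mock of the p2m update step: fails if the domain is dying. */
static bool mock_set_p2m_entry(struct mock_domain *d)
{
    if (d->is_dying)
        return false;
    d->p2m_private_entries++;
    return true;
}

/*
 * Mock unshare path: allocate a private page, then update the p2m.
 * If the second step fails, the allocation is reverted so that no
 * half-finished state is left behind. Returns 0 on success, -1 on failure.
 */
static int mock_unshare_page(struct mock_domain *d, bool die_after_alloc)
{
    if (!mock_alloc_page(d))
        return -1;                /* nothing done yet, nothing to revert */

    if (die_after_alloc)
        d->is_dying = true;       /* simulate the state change mid-way */

    if (!mock_set_p2m_entry(d)) {
        mock_free_page(d);        /* revert the step already completed */
        return -1;
    }
    return 0;
}
```

In this mock, a call that succeeds leaves one more page and one more private entry, while a call where the domain starts dying between the two steps returns -1 and leaves both counters unchanged. Whether the real unshare path can be unwound this simply is exactly the open question in the thread.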