
Re: [Xen-devel] [BUG] xen-mceinj tool testing cause dom0 crash



On 11/07/17 01:37 -0700, Jan Beulich wrote:
> >>> On 07.11.17 at 09:23, <xudong.hao@xxxxxxxxx> wrote:
> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> >> Sent: Tuesday, November 7, 2017 4:09 PM
> >> >>> On 07.11.17 at 02:37, <xudong.hao@xxxxxxxxx> wrote:
> >> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> >> >> Sent: Monday, November 6, 2017 5:17 PM
> >> >> >>> On 03.11.17 at 09:29, <xudong.hao@xxxxxxxxx> wrote:
> >> >> > We figured out the problem, some corner scripts triggered the error
> >> >> > injection at the same page (pfn 0x180020) twice, i.e. "./xen-mceinj
> >> >> > -t 0" run over one time, which resulted in Dom0 crash.
> >> >>
> >> >> But isn't this a valid scenario, which shouldn't result in a kernel
> >> >> crash? What if two successive #MCs occurred for the same page?
> >> >> I.e. ...
> >> >>
> >> >
> >> > Yes, it's another valid scenario, the expect result is kernel crash.
> >> 
> >> Kernel _crash_ or rather kernel _panic_? Of course without any kernel
> >> messages we can't tell one from the other, but to me this makes a
> >> difference nevertheless.
> >> 
> > Exactly, Dom0 crash.
> 
> I don't believe a crash is the expected outcome here.
>

This test case injects two errors into the same dom0 page. During the
first injection, offline_page() is called to set the PGC_broken flag of
that page. During the second injection, offline_page() detects that the
same broken page is being touched again, and then tries to shut down the
page owner, i.e. dom0 in this case:

    /*
     * NB. When broken page belong to guest, usually hypervisor will
     * notify the guest to handle the broken page. However, hypervisor
     * need to prevent malicious guest access the broken page again.
     * Under such case, hypervisor shutdown guest, preventing recursive mce.
     */
    if ( (pg->count_info & PGC_broken) && (owner = page_get_owner(pg)) )
    {
        *status = PG_OFFLINE_AGAIN;
        domain_shutdown(owner, SHUTDOWN_crash);
        return 0;
    }

So I think the Dom0 crash and the subsequent machine reboot are the
expected behavior here.
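
To make the two-injection sequence concrete, here is a minimal
stand-alone C model of the check quoted above. It is only a sketch, not
Xen code: struct domain, struct page_info, the flag value and the
helper offline_page_model() are simplified stand-ins I made up for
illustration, and the first-injection path is reduced to just setting
the flag.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; NOT the real Xen definitions. */
#define PGC_broken (1UL << 0)

enum offline_status { PG_OFFLINE_OK, PG_OFFLINE_AGAIN };

struct domain {
    int domain_id;
    int crashed;               /* set when the model "shuts down" the domain */
};

struct page_info {
    unsigned long count_info;  /* page flags, only PGC_broken is modelled */
    struct domain *owner;      /* NULL if the page has no owner */
};

/* Hypothetical stand-in for domain_shutdown(owner, SHUTDOWN_crash). */
static void domain_shutdown_crash(struct domain *d)
{
    d->crashed = 1;
}

/*
 * Model of the relevant part of offline_page():
 *  - first call on a page: mark it broken;
 *  - later call on an already broken, owned page: crash the owner.
 */
static int offline_page_model(struct page_info *pg,
                              enum offline_status *status)
{
    struct domain *owner;

    assert(pg != NULL && status != NULL);

    if ( (pg->count_info & PGC_broken) && (owner = pg->owner) != NULL )
    {
        *status = PG_OFFLINE_AGAIN;
        domain_shutdown_crash(owner);
        return 0;
    }

    pg->count_info |= PGC_broken;
    *status = PG_OFFLINE_OK;
    return 0;
}
```

Calling offline_page_model() twice on the same page owned by a model
dom0 leaves dom0 untouched after the first call and marks it crashed
after the second one, which mirrors running "./xen-mceinj -t 0" twice
against the same pfn.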

However, it looks like an (unexpected) page fault happens during the
reboot. Xudong, can you check whether a normal reboot on that machine
also triggers a page fault?

> > And I didn't see any "kernel panic" message from the log -- attach the 
> > original log again.
> 
> Well, as said - there's _no_ kernel log message at all, and hence we
> can't tell whether it's a crash or a plain panic. Iirc Xen's "Hardware
> Dom0 crashed" can't distinguish the two cases.
> 

The crash is triggered in offline_page() before Xen can inject the
error into Dom0, so there is no dom0 kernel log around the crash.

This can be confirmed by dumping the call trace when
hwdom_shutdown(SHUTDOWN_crash) is called. Xudong, can you do this?

Thanks,
Haozhong
