
Re: [Xen-devel] [BUG] xen-mceinj tool testing cause dom0 crash



> -----Original Message-----
> From: Zhang, Haozhong
> Sent: Thursday, November 9, 2017 9:45 AM
> To: Jan Beulich <JBeulich@xxxxxxxx>; Hao, Xudong <xudong.hao@xxxxxxxxx>
> Cc: Julien Grall <julien.grall@xxxxxxx>; George Dunlap
> <George.Dunlap@xxxxxxxxxx>; Lars Kurth <lars.kurth@xxxxxxxxxx>; xen-
> devel@xxxxxxxxxxxxx
> Subject: Re: [Xen-devel] [BUG] xen-mceinj tool testing cause dom0 crash
> 
> On 11/07/17 01:37 -0700, Jan Beulich wrote:
> > >>> On 07.11.17 at 09:23, <xudong.hao@xxxxxxxxx> wrote:
> > >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> > >> Sent: Tuesday, November 7, 2017 4:09 PM
> > >> >>> On 07.11.17 at 02:37, <xudong.hao@xxxxxxxxx> wrote:
> > >> >> From: Jan Beulich [mailto:JBeulich@xxxxxxxx]
> > >> >> Sent: Monday, November 6, 2017 5:17 PM
> > >> >> >>> On 03.11.17 at 09:29, <xudong.hao@xxxxxxxxx> wrote:
> > >> >> > We figured out the problem: some corner-case scripts triggered the
> > >> >> > error injection on the same page (pfn 0x180020) twice, i.e.
> > >> >> > "./xen-mceinj -t 0" was run more than once, which resulted in a Dom0 crash.
> > >> >>
> > >> >> But isn't this a valid scenario, which shouldn't result in a kernel
> > >> >> crash? What if two successive #MCs occurred for the same page?
> > >> >> I.e. ...
> > >> >>
> > >> >
> > >> > Yes, it's another valid scenario; the expected result is a kernel crash.
> > >>
> > >> Kernel _crash_ or rather kernel _panic_? Of course without any
> > >> kernel messages we can't tell one from the other, but to me this makes a
> > >> difference nevertheless.
> > >>
> > > Exactly, Dom0 crash.
> >
> > I don't believe a crash is the expected outcome here.
> >
> 
> This test case injects two errors into the same dom0 page. During the first
> injection, offline_page() is called to set the PGC_broken flag of that page.
> During the second injection, offline_page() detects that the same broken page
> is touched again, and then tries to shut down the page owner, i.e. dom0 in
> this case:
> 
>     /*
>      * NB. When broken page belong to guest, usually hypervisor will
>      * notify the guest to handle the broken page. However, hypervisor
>      * need to prevent malicious guest access the broken page again.
>      * Under such case, hypervisor shutdown guest, preventing recursive mce.
>      */
>     if ( (pg->count_info & PGC_broken) && (owner = page_get_owner(pg)) )
>     {
>         *status = PG_OFFLINE_AGAIN;
>         domain_shutdown(owner, SHUTDOWN_crash);
>         return 0;
>     }
> 
> So I think the Dom0 crash and the subsequent machine reboot are the expected
> behavior here.
> 
> But it looks like an (unexpected) page fault happens during the reboot.
> Xudong, can you check whether a normal reboot on that machine triggers a
> page fault?
> 
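
For readers following the thread, the double-injection behavior described above can be replayed with a small stand-alone model. This is only a sketch: every toy_ name below is hypothetical and merely mimics the flag check in the quoted snippet. It is not the Xen implementation of offline_page(), which does considerably more, and the flag and status values are invented for the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the model; the values are NOT taken from Xen. */
#define TOY_PGC_BROKEN (1u << 0)

enum toy_status { TOY_OFFLINE_FIRST, TOY_OFFLINE_AGAIN };

struct toy_domain {
    int  id;
    bool crashed;   /* models the effect of domain_shutdown(owner, SHUTDOWN_crash) */
};

struct toy_page {
    unsigned int       count_info;
    struct toy_domain *owner;
};

/*
 * Models only the check quoted above: a page that is already marked
 * broken and still has an owner causes that owner to be crashed.
 * Otherwise the page is just marked broken.
 */
enum toy_status toy_offline_page(struct toy_page *pg)
{
    if ((pg->count_info & TOY_PGC_BROKEN) && pg->owner != NULL) {
        pg->owner->crashed = true;
        return TOY_OFFLINE_AGAIN;
    }

    pg->count_info |= TOY_PGC_BROKEN;
    return TOY_OFFLINE_FIRST;
}

/*
 * Replays the test case: two injections on the same dom0-owned page.
 * Returns true if dom0 survives the first injection and is crashed
 * by the second one.
 */
bool toy_double_injection_crashes_owner(void)
{
    struct toy_domain dom0 = { .id = 0, .crashed = false };
    struct toy_page   pg   = { .count_info = 0, .owner = &dom0 };

    enum toy_status first = toy_offline_page(&pg);
    bool crashed_after_first = dom0.crashed;

    enum toy_status second = toy_offline_page(&pg);

    return first == TOY_OFFLINE_FIRST && !crashed_after_first &&
           second == TOY_OFFLINE_AGAIN && dom0.crashed;
}
```

In this model, calling toy_double_injection_crashes_owner() returns true, which matches the sequence described in the quoted mail: the first injection only marks the page, and the second one takes the owner down.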

Yes, a normal reboot of Dom0 triggered a Xen page fault on an Intel Skylake
4-socket platform, but there was no page fault on a Skylake 2-socket system or
on Broadwell platforms.

Haozhong, will you fix this page fault issue?


Thanks,
-Xudong



 

