
Re: [Xen-devel] [xen-unstable test] 106580: regressions - trouble: blocked/broken/fail/pass



On 10/03/17 08:37, Jan Beulich wrote:
>>>> On 10.03.17 at 08:20, <osstest-admin@xxxxxxxxxxxxxx> wrote:
>> flight 106580 xen-unstable real [real]
>> http://logs.test-lab.xenproject.org/osstest/logs/106580/ 
>>
>> Regressions :-(
>>
>> Tests which did not succeed and are blocking,
>> including tests which could not be run:
>>  test-armhf-armhf-xl-arndale   3 host-install(3)        broken REGR. vs. 106534
>>  test-amd64-amd64-migrupgrade 10 xen-boot/dst_host        fail REGR. vs. 106534
> The NMI watchdog has hit the EOI timer waiting to be able to send
> an IPI on CPU1:
>
> Mar 10 00:09:32.745677 (XEN) Xen call trace:
> Mar 10 00:09:32.745727 (XEN)    [<ffff82d080134083>] _spin_lock+0x2c/0x4f
> Mar 10 00:09:32.745779 (XEN)    [<ffff82d080133e34>] on_selected_cpus+0x2c/0xc6
> Mar 10 00:09:32.753699 (XEN)    [<ffff82d080177101>] irq.c#irq_guest_eoi_timer_fn+0x142/0x165
> Mar 10 00:09:32.761711 (XEN)    [<ffff82d080136ddc>] timer.c#execute_timer+0x47/0x62
> Mar 10 00:09:32.769683 (XEN)    [<ffff82d080136ed2>] timer.c#timer_softirq_action+0xdb/0x22c
> Mar 10 00:09:32.769744 (XEN)    [<ffff82d0801337e1>] softirq.c#__do_softirq+0x7f/0x8a
> Mar 10 00:09:32.777697 (XEN)    [<ffff82d080133836>] do_softirq+0x13/0x15
> Mar 10 00:09:32.785792 (XEN)    [<ffff82d080255081>] entry.o#process_softirqs+0x21/0x30
>
> That lock is being held by CPU2:
>
> Mar 10 00:15:25.133639 (XEN) Xen call trace:
> Mar 10 00:15:25.133655 (XEN)    [<ffff82d080102389>] __bitmap_empty+0x54/0x96
> Mar 10 00:15:25.141636 (XEN)    [<ffff82d080133eb5>] on_selected_cpus+0xad/0xc6
> Mar 10 00:15:25.149635 (XEN)    [<ffff82d0801ca640>] powernow.c#powernow_cpufreq_cpu_init+0x20d/0x372
> Mar 10 00:15:25.157633 (XEN)    [<ffff82d08014c476>] cpufreq_add_cpu+0x1d6/0x5d3
> Mar 10 00:15:25.157654 (XEN)    [<ffff82d0801ca173>] cpufreq_cpu_init+0x17/0x1a
> Mar 10 00:15:25.165658 (XEN)    [<ffff82d08014cd8d>] set_px_pminfo+0x2b6/0x2f7
> Mar 10 00:15:25.165679 (XEN)    [<ffff82d0801956dd>] do_platform_op+0xe69/0x1959
> Mar 10 00:15:25.173667 (XEN)    [<ffff82d080251485>] pv_hypercall+0x1ef/0x42d
> Mar 10 00:15:25.181678 (XEN)    [<ffff82d080254ff6>] entry.o#test_all_events+0/0x30
>
> Register state tells us that it's CPU5 not responding. The only piece
> of information we have about CPU5 is
>
> Mar 10 00:09:32.809709 (XEN) CPU5 @ e008:ffff82d080134083 (0000000000000000)
>
> which is also in _spin_lock(), but which I'm afraid is too little to
> diagnose the issue. I'm therefore wondering whether we wouldn't be
> better off defaulting "async-show-all" to true in debug builds.
>
> What I'm also puzzled by is that the system is still partly alive after
> the panic: There's Dom0 output, and it is also reacting to debug
> key input. I would have expected a panic to bring down the system
> right away...

Not very surprising.  We crashed because the IPI lock was unavailable,
then disabled the watchdog in machine_halt() and tried to IPI again.  CPU1
is almost certainly stuck trying to broadcast __machine_halt().
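
To spell the interleaving out, here is a small user-space analogue (plain
pthreads; cpu1, cpu2 and call_lock are just stand-ins for the traces above,
not Xen code) which wedges the same way: one thread takes the call lock and
waits forever for an acknowledgement that never arrives, and the other then
blocks on that same lock, exactly like the EOI timer softirq did before the
watchdog fired:

/*
 * Analogue only: "cpu2" holds the cross-CPU call lock while waiting for
 * an ack that never comes (the silent CPU5); "cpu1" then blocks on the
 * same lock, like irq_guest_eoi_timer_fn() -> on_selected_cpus() did.
 * Build with: gcc -pthread deadlock.c && ./a.out   (it simply hangs)
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int ipi_acked;          /* never set: CPU5 never answers */

static void *cpu2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&call_lock);     /* on_selected_cpus() takes lock */
    printf("cpu2: IPI sent, waiting for ack...\n");
    while ( !ipi_acked )                /* spins forever                 */
        ;
    pthread_mutex_unlock(&call_lock);   /* never reached                 */
    return NULL;
}

static void *cpu1(void *arg)
{
    (void)arg;
    sleep(1);                           /* let cpu2 win the lock first   */
    printf("cpu1: EOI timer wants to IPI, taking lock...\n");
    pthread_mutex_lock(&call_lock);     /* blocks here; in Xen the NMI   */
    return NULL;                        /* watchdog fires at this point  */
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t2, NULL, cpu2, NULL);
    pthread_create(&t1, NULL, cpu1, NULL);
    pthread_join(t1, NULL);             /* never completes               */
    return 0;
}

machine_halt() then repeating the same locked broadcast from the panic path
is why the halt itself wedges as well.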

This is the second odd corner case we have seen around machine_halt().
The last one was because it is unsafe to use if you panic() from the
middle of context_switch(): interrupts get re-enabled and a guest irq
hits an assertion.  The fix in both cases, to make it more reliable, is
to use an NMI broadcast and leave interrupts disabled.
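
The shape of that fix, purely as a sketch (every function below is a
hypothetical stand-in, stubbed out with prints so the file compiles and
runs; none of them are real Xen symbols), would be roughly:

/*
 * Toy sketch of "NMI broadcast, interrupts stay disabled" -- not Xen code.
 */
#include <stdio.h>

static void local_irq_disable(void)   { puts("irqs off (and they stay off)"); }
static void watchdog_disable(void)    { puts("watchdog off"); }
static void send_nmi_allbutself(void) { puts("NMI -> every other CPU"); }
static void park_cpu(void)            { puts("cpu parked, irqs still off"); }

/* What each remote CPU would run from its NMI handler: park immediately,
 * touching no locks, so even a CPU stuck in _spin_lock() gets stopped. */
static void nmi_park_handler(void)
{
    park_cpu();
}

/* The halt path itself: no cross-CPU call lock to wait on, and interrupts
 * are never re-enabled, so neither of the two failure modes above bites. */
static void reliable_machine_halt(void)
{
    local_irq_disable();
    watchdog_disable();
    send_nmi_allbutself();    /* asynchronous, needs no acknowledgement */
    nmi_park_handler();       /* the local CPU parks too */
}

int main(void)
{
    reliable_machine_halt();
    return 0;
}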

IMO, noreboot isn't a clever thing to be using at all.  OSSTest should
be installing a crash kernel and collecting crash logs, which would be
far more useful for diagnosis.
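
For reference, roughly what that involves -- all paths, sizes and flags
below are illustrative assumptions rather than a tested osstest recipe:

# 1) Reserve memory for a crash environment on the Xen command line in
#    the bootloader entry, e.g.:
#        crashkernel=256M@64M

# 2) In dom0, load a panic kernel with kexec-tools so it takes over when
#    the hypervisor crashes:
kexec -p /boot/vmlinuz-crash \
      --initrd=/boot/initrd-crash.img \
      --append="root=/dev/sda1 console=ttyS0 maxcpus=1"

# 3) After a crash, the panic kernel boots from the reserved region and
#    the old Xen/dom0 state can be saved from /proc/vmcore with the usual
#    kdump tooling, then attached to the flight logs.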

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 

