Xen project Mailing List

Re: [Xen-devel] [PATCH] x86/nmi: lower initial watchdog frequency to avoid boot hangs

To: Alexey G <x1917x@xxxxxxxxx>, Igor Druzhinin <igor.druzhinin@xxxxxxxxxx>

From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Tue, 6 Feb 2018 14:21:12 +0000

Cc: jbeulich@xxxxxxxx, xen-devel@xxxxxxxxxxxxx

Delivery-date: Tue, 06 Feb 2018 14:24:42 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 06/02/18 03:10, Alexey G wrote:
> On Mon, 5 Feb 2018 21:18:42 +0000
> Igor Druzhinin <igor.druzhinin@xxxxxxxxxx> wrote:
>
>> We're noticing a reproducible system boot hang on certain
>> post-Skylake platforms where the BIOS is configured in
>> legacy boot mode with x2APIC disabled. The system stalls
>> immediately after writing the first SMP initialization
>> sequence into APIC ICR.
>>
>> The cause of the problem is watchdog NMI handler execution -
>> somewhere near the end of NMI handling (after it's already
>> rescheduled the next NMI) it tries to access IO port 0x61
>> to get the actual NMI reason on CPU0. Unfortunately, this
>> port is emulated by BIOS using SMIs and this emulation
>> apparently might take more than we expect under certain
>> conditions. As the result, the system is constantly moving
>> between NMI and SMI handler and not making any progress.
>>
>> Just lower the initial frequency for now as we lower it later
>> even more anyway.
>
> I/O port 61h normally is not emulated by SMI legacy kbd handling code
> in BIOS, only ports like 60h, 64h, etc.
> Contrary to USB legacy emulation, it has to intercept port 61h via a
> different approach -- generic SMI I/O trap, which is not common (at least
> it was) to use by BIOSes... although it is possible as EFI interface and
> code for this is available. The assumption about port 61h being trapped by
> the SMI handler must be explicitly confirmed by checking I/O Trap control
> regs in the RCBA region.
>
> If I/O trap regs won't show an active I/O trap on I/O port 61h -- the
> root cause might be different (might even be related to stuff like
> NMI2SMI logic).
>
> If the problem is actually due to NMI handler being preempted by another
> NMI which occurred after (a long) execution of triggered SMI handler, it
> might be better to do all sensitive stuff before re-enabling NMIs by IRET in
> the NMI handler.

The problem is that the SMI handler executes enough instructions to
trigger another NMI (which is based on the retired instruction count),
which gets delivered once the SMI handler returns, and servicing the NMI
triggers a new SMI, which triggers a new NMI. This is why the system
stalls.

I'll leave the how/why port 0x61 is trapping to SMI to Igor, but it is
only a secondary concern here. We cannot reasonably have the watchdog
able to trip because of exclusively SMI activity, or we'll potentially
livelock anywhere.

~Andrew
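
The NMI/SMI loop described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Xen code: the function name `useful_work`, its parameters, and every instruction count are hypothetical. It assumes the watchdog counter keeps counting retired instructions while the SMI handler runs, that the NMI handler re-arms the counter before reading port 0x61, and that an NMI which becomes pending during handling is delivered immediately after the handler returns.

```python
def useful_work(period, smi_cost, handler_cost=100, total=1_000_000):
    """Toy model of a watchdog NMI driven by retired-instruction count.

    period       -- instructions between watchdog NMIs (hypothetical value)
    smi_cost     -- instructions the SMI handler retires per port 0x61 trap
    handler_cost -- instructions retired by the NMI handler itself
    total        -- instruction budget for the whole simulation

    Returns how many instructions were retired by the interrupted code,
    i.e. outside NMI and SMI handling.
    """
    retired = 0          # every instruction retired so far
    useful = 0           # instructions retired outside NMI/SMI handling
    countdown = period   # instructions left until the counter overflows

    while retired < total:
        if countdown > 0:
            # The interrupted code runs until the counter overflows
            # or the simulation budget is used up.
            step = min(countdown, total - retired)
            useful += step
            retired += step
            countdown -= step
            if countdown > 0:
                break    # budget exhausted before the next NMI

        # NMI delivered: the handler re-arms the counter first, then
        # reads port 0x61, which traps to the SMI handler. All of those
        # instructions count against the freshly armed period.
        overhead = handler_cost + smi_cost
        retired += overhead
        countdown = max(0, period - overhead)
        # countdown == 0 means the next NMI is already pending when this
        # handler returns, so the interrupted code gets no instructions.

    return useful


if __name__ == "__main__":
    # Short period: the SMI alone outlasts it, so only the stretch before
    # the first NMI counts as useful work.
    print(useful_work(period=1_000, smi_cost=5_000))
    # Longer period (lower watchdog frequency): most of the budget is useful.
    print(useful_work(period=100_000, smi_cost=5_000))
```

In this model, once the handler plus SMI cost reaches the period, the interrupted code never runs again. Lengthening the period restores progress, which is the effect the patch aims for, though it does not address the more general concern about the watchdog being driven purely by SMI activity.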
