Re: [BUG] x2apic broken with current AMD hardware



On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
> On 16.03.2023 23:03, Elliott Mitchell wrote:
> > On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
> >> On 11.03.2023 01:09, Elliott Mitchell wrote:
> >>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
> >>>>
> >>>> In any event you will want to collect a serial log at maximum verbosity.
> >>>> It would also be of interest to know whether turning off the IOMMU avoids
> >>>> the issue as well (on the assumption that your system has less than 255
> >>>> CPUs).
> >>>
> >>> I think I might have figured out the situation in a different fashion.
> >>>
> >>> I was taking a look at the BIOS manual for this motherboard and noticed
> >>> a mention of a "Local APIC Mode" setting.  Four values are listed
> >>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
> >>>
> >>> That is the sort of setting I likely left at "Auto" and that may well
> >>> result in x2 functionality being disabled.  Perhaps the x2APIC
> >>> functionality on AMD is detecting whether the hardware is present, and
> >>> failing to test whether it has been enabled?  (could be useful to output
> >>> a message suggesting enabling the hardware feature)
> >>
> >> Can we please move to a little more technical terms here? What is "present"
> >> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
> >> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
> >> also left unclear what the four modes of BIOS operation evaluate to. Even
> >> if we knew that, overriding e.g. "Compatibility" (which likely means some
> >> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
> >> In "Auto" mode Xen likely should work - the only way I could interpret the
> >> other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
> >> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
> >> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
> >> speculation on my part ...
> > 
> > I provided the information I had discovered.  There is a setting for this
> > motherboard (likely present on some similar motherboards) which /may/
> > affect the issue.  I doubt I've tried "compatibility", but none of the
> > values I've tried have gotten the system to boot without "x2apic=false"
> > on Xen's command-line.
> > 
> > When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
> > I see the line "(XEN) - x2APIC".  Later is the line
> > "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
> > leaves the x2APIC turned off since neither line is present.
> 
> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
> mode. Are you sure that's the case when using "Auto"?

grep -eAPIC\ driver -e-\ x2APIC:

"Auto":
(XEN) Using APIC driver default
(XEN) Overriding APIC driver with bigsmp
(XEN) Switched to APIC driver x2apic_cluster

"x2APIC":
(XEN) Using APIC driver x2apic_cluster
(XEN) - x2APIC

Yes, I'm sure.

> > In both cases the line "(XEN) Switched to APIC driver x2apic_cluster" is
> > present (so perhaps "Auto" merely doesn't activate it).
> 
> Did you also try "x2apic_phys" on the Xen command line (just to be sure
> this isn't a clustered-mode only issue)?

No.  In fact x2apic_cluster is mentioned in all failure cases.

> > Appears error_interrupt() needs locking or some concurrency handling
> > mechanism since the last error is jumbled.  With the setting "x2APIC"
> > I get a bunch of:
> > "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
> > (apparently one for each core)
> > Followed by "Receive accept error, Receive accept error," (again,
> > apparently one for each core).  Then a bunch of newlines (same pattern).
> 
> This is a known issue, but since the messages shouldn't appear in the
> first place so far no-one has bothered to address this.

I won't claim it is the best solution, but see the other message.

I'm tempted to propose allowing _Static_assert() since it is valuable
functionality for preventing issues.
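
To make both points concrete, here is a minimal standalone sketch.  It
is not the patch from the other message and not Xen's actual code; the
names format_apic_error(), ERR_PREFIX_MAX and ERR_MSG_MAX are made up for
illustration, and the bit names follow the ESR layout as I understand
it.  The idea is to build the whole line in one buffer so it can be
emitted with a single print call (assuming a single call is output as a
unit), and to let _Static_assert() check at build time that the buffer
can hold the longest possible line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Names of the eight APIC Error Status Register bits, lowest bit first. */
static const char *const esr_fields[] = {
    "Send CS error",
    "Receive CS error",
    "Send accept error",
    "Receive accept error",
    "Redirectable IPI",
    "Send illegal vector",
    "Received illegal vector",
    "Illegal register address",
};

#define ESR_BITS       (sizeof(esr_fields) / sizeof(esr_fields[0]))
/* Length of the longest name in esr_fields[]. */
#define ESR_NAME_MAX   (sizeof("Illegal register address") - 1)
/* "APIC error on CPU" + up to 10 digits + ": " + "xx(xx)". */
#define ERR_PREFIX_MAX 35
/* Prefix, then ", <name>" per bit, then newline and terminating NUL. */
#define ERR_MSG_MAX    (ERR_PREFIX_MAX + ESR_BITS * (2 + ESR_NAME_MAX) + 2)

/* Compile-time checks: a wrong table or a too-small buffer breaks the build. */
_Static_assert(ESR_BITS == 8, "one name is needed per ESR error bit");
_Static_assert(ERR_MSG_MAX <= 256, "worst-case line must fit a 256 byte buffer");

/*
 * Format one complete error line into buf.  Returns the number of
 * characters written (excluding the NUL), or 0 if buf is too small to
 * hold the worst case.
 */
size_t format_apic_error(char *buf, size_t len, unsigned int cpu,
                         unsigned int before, unsigned int after)
{
    size_t used;
    unsigned int i;

    assert(buf != NULL);
    if (len < ERR_MSG_MAX)
        return 0;

    used = (size_t)snprintf(buf, len, "APIC error on CPU%u: %02x(%02x)",
                            cpu, before & 0xffu, after & 0xffu);

    /* Decode the second (post-write) ESR value, one name per set bit. */
    for (i = 0; i < ESR_BITS; i++)
        if (after & (1u << i))
            used += (size_t)snprintf(buf + used, len - used, ", %s",
                                     esr_fields[i]);

    used += (size_t)snprintf(buf + used, len - used, "\n");
    return used;
}
```

A caller would declare "char buf[256];", call
format_apic_error(buf, sizeof(buf), cpu, v, v1), and hand the result to
one print call with a "%s" format, so output from different cores can
only interleave between whole lines rather than inside them.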

> > With the setting "auto" the last message is a series of
> > "(XEN) CPU#: No irq handler for vector ## (IRQ -2147483648, LAPIC)" on
> > 2 different cores.  Rather more of the lines were from one core, the
> > vector value varied some.
> > 
> > Notably, both sets of final error messages appeared after the Domain 0
> > kernel thought it had been operating for >30 seconds.  Lack of
> > response to interrupt generating events (pressing keys on USB keyboard)
> > suggests interrupts weren't getting through.
> > 
> > 
> > With "x2apic=false" error messages similar to the "Local APIC Mode"
> > of "x2APIC" appear >45 seconds after Domain 0 kernel start.  Of note
> > first "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
> > appears for all cores with "Receive accept error,".
> > 
> > Yet later a variation on this message starts appearing:
> > "(XEN) APIC error on CPU#: 08(08)(XEN) APIC error on CPU#: 08(08)"
> > this one appears multiple times.
> 
> That's certainly odd, as it suggests that things were okay for a short
> while.

Note that "later" means further down the log.  On checking, this could
be almost immediately afterwards: there are 6 "APIC error" lines in
total (the first with "00(08)", the rest with "08(08)"), and the lines
with timestamps indicate no more than 2 seconds between them.


-- 
(\___(\___(\______          --=> 8-) EHM <=--          ______/)___/)___/)
 \BS (    |         ehem+sigmsg@xxxxxxx  PGP 87145445         |    )   /
  \_CS\   |  _____  -O #include <stddisclaimer.h> O-   _____  |   /  _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445





 

