Re: [BUG] x2apic broken with current AMD hardware
On 17.03.2023 18:26, Elliott Mitchell wrote:
> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
>> On 16.03.2023 23:03, Elliott Mitchell wrote:
>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>>>>>>
>>>>>> In any event you will want to collect a serial log at maximum verbosity.
>>>>>> It would also be of interest to know whether turning off the IOMMU avoids
>>>>>> the issue as well (on the assumption that your system has less than 255
>>>>>> CPUs).
>>>>>
>>>>> I think I might have figured out the situation in a different fashion.
>>>>>
>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
>>>>> a mention of a "Local APIC Mode" setting. Four values are listed
>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
>>>>>
>>>>> That is the sort of setting I likely left at "Auto" and that may well
>>>>> result in x2 functionality being disabled. Perhaps the x2APIC
>>>>> functionality on AMD is detecting whether the hardware is present, and
>>>>> failing to test whether it has been enabled? (could be useful to output
>>>>> a message suggesting enabling the hardware feature)
>>>>
>>>> Can we please move to a little more technical terms here? What is "present"
>>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
>>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
>>>> also left unclear what the four modes of BIOS operation evaluate to. Even
>>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
>>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
>>>> In "Auto" mode Xen likely should work - the only way I could interpret
>>>> the other modes are "xAPIC" meaning no x2APIC ACPI tables entries (and
>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode
>>>> pre-enabled by firmware, and "Auto" leaving it to the OS to select. Yet
>>>> that's speculation on my part ...
>>>
>>> I provided the information I had discovered. There is a setting for this
>>> motherboard (likely present on some similar motherboards) which /may/
>>> affect the issue. I doubt I've tried "compatibility", but none of the
>>> values I've tried have gotten the system to boot without "x2apic=false"
>>> on Xen's command-line.
>>>
>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
>>> I see the line "(XEN) - x2APIC". Later is the line
>>> "(XEN) x2APIC mode is already enabled by BIOS." I'll guess "Auto"
>>> leaves the x2APIC turned off since neither line is present.
>>
>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
>> mode. Are you sure that's the case when using "Auto"?
>
> grep -eAPIC\ driver -e-\ x2APIC:
>
> "Auto":
> (XEN) Using APIC driver default
> (XEN) Overriding APIC driver with bigsmp
> (XEN) Switched to APIC driver x2apic_cluster
>
> "x2APIC":
> (XEN) Using APIC driver x2apic_cluster
> (XEN) - x2APIC
>
> Yes, I'm sure.

Okay, this then means we're running in a mode we don't mean to run in:
When the IOMMU claims to not support x2APIC mode (which is odd in the
first place when at the same time the CPU reports x2APIC mode as
supported), amd_iommu_prepare() is intended to switch interrupt remapping
mode to "restricted" (which in turn would force x2APIC mode to "physical",
not "clustered"). I notice though that there are a number of error paths
in the function which bypass this setting. Could you add a couple of
printk()s to understand which path is taken (each time; the function can
be called more than once)?
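
To illustrate the kind of instrumentation meant here, a minimal stand-alone
sketch follows. It is not the actual amd_iommu_prepare(): the function
prepare_sketch(), the enum, the two globals and the TRACE_EXIT() macro are all
hypothetical, and printk() is mapped onto printf() so the fragment builds
outside the hypervisor. The idea is merely to tag every exit, so the log shows
which path each call took and whether the late setting was reached.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for printk() so the sketch builds outside the hypervisor. */
#define printk(...) printf(__VA_ARGS__)

/* Hypothetical failure causes, invented for this sketch. */
enum fail { FAIL_NONE, FAIL_NO_TABLE, FAIL_NO_FEATURE };

static unsigned int call_count; /* the function can be called more than once */
static int restricted_mode;     /* set only when the end of the function is reached */

/* Report which exit was taken, with the call number and the source line. */
#define TRACE_EXIT(what) \
    printk("prepare_sketch: call %u leaves via %s (line %d)\n", \
           call_count, (what), __LINE__)

/* Hypothetical function whose early exits bypass a setting made at the end. */
static int prepare_sketch(enum fail f)
{
    assert(f <= FAIL_NO_FEATURE);
    ++call_count;

    if ( f == FAIL_NO_TABLE )
    {
        TRACE_EXIT("the no-table error path");
        return -1;
    }

    if ( f == FAIL_NO_FEATURE )
    {
        TRACE_EXIT("the no-feature error path");
        return -2;
    }

    /* Only this path applies the setting the error paths skip. */
    restricted_mode = 1;
    TRACE_EXIT("the normal path");
    return 0;
}
```

Calling prepare_sketch(FAIL_NO_TABLE) followed by prepare_sketch(FAIL_NONE)
prints one line per call and leaves restricted_mode set only after the second
call. In the real function the same pattern would amount to one printk() ahead
of each return.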

>>> Both cases the line "(XEN) Switched to APIC driver x2apic_cluster" is
>>> present (so perhaps "Auto" merely doesn't activate it).
>>
>> Did you also try "x2apic_phys" on the Xen command line (just to be sure
>> this isn't a clustered-mode only issue)?
>
> No. In fact x2apic_cluster is mentioned in all failure cases.

Could you give physical mode a try, please?

>>> Appears error_interrupt() needs locking or some concurrency handling
>>> mechanism since the last error is jumbled. With the setting "x2APIC"
>>> I get a bunch of:
>>> "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
>>> (apparently one for each core)
>>> Followed by "Receive accept error, Receive accept error," (again,
>>> apparently one for each core). Then a bunch of newlines (same pattern).
>>
>> This is a known issue, but since the messages shouldn't appear in the
>> first place so far no-one has bothered to address this.
>
> I won't claim it is the best solution, but see other message.
>
> I'm tempted to propose allowing _Static_assert() since it is valuable
> functionality for preventing issues.

How does _Static_assert() come into play here? Also note that we already
use it when available ...

Jan
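
For reference, one way to avoid the jumbled error report quoted above without
adding a lock is to compose the whole line in a local buffer and hand it to
the console in a single call. The sketch below is an illustration under
stated assumptions, not the change from the other message: format_apic_error()
and append() are hypothetical names, the error strings merely follow the bit
order of the local APIC Error Status Register with illustrative wording, and
the approach only helps if one print call is not interleaved with output from
other CPUs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* ESR bit names, lowest bit first; the wording is illustrative. */
static const char *const esr_names[8] = {
    "Send CS error", "Receive CS error", "Send accept error",
    "Receive accept error", "Redirectable IPI", "Send illegal vector",
    "Received illegal vector", "Illegal register address",
};

/* Append s to buf, truncating if needed; returns the new string length. */
static size_t append(char *buf, size_t len, size_t used, const char *s)
{
    size_t room = len - used - 1;
    size_t n = strlen(s);

    if ( n > room )
        n = room;
    memcpy(buf + used, s, n);
    buf[used + n] = '\0';
    return used + n;
}

/*
 * Hypothetical helper: build the complete report in buf so that the caller
 * can emit it with one print call. Returns the length of the string.
 */
static size_t format_apic_error(char *buf, size_t len, unsigned int cpu,
                                unsigned int old_esr, unsigned int new_esr)
{
    char head[64];
    size_t used;
    unsigned int bit;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';

    snprintf(head, sizeof(head), "APIC error on CPU%u: %02x(%02x)",
             cpu, old_esr, new_esr);
    used = append(buf, len, 0, head);

    /* One ", <name>" per bit set in the newly read value. */
    for ( bit = 0; bit < 8; bit++ )
        if ( new_esr & (1u << bit) )
        {
            used = append(buf, len, used, ", ");
            used = append(buf, len, used, esr_names[bit]);
        }

    return append(buf, len, used, "\n");
}
```

A caller would fill a buffer on its own stack and then print it once, for
example with printk("%s", buf), so that the header, the error names and the
newline of one CPU's report stay together.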