
Re: [BUG] x2apic broken with current AMD hardware


  • To: Elliott Mitchell <ehem+xen@xxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 20 Mar 2023 09:14:17 +0100
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 20 Mar 2023 08:14:30 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 17.03.2023 18:26, Elliott Mitchell wrote:
> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
>> On 16.03.2023 23:03, Elliott Mitchell wrote:
>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>>>>>>
>>>>>> In any event you will want to collect a serial log at maximum verbosity.
>>>>>> It would also be of interest to know whether turning off the IOMMU avoids
>>>>>> the issue as well (on the assumption that your system has less than 255
>>>>>> CPUs).
>>>>>
>>>>> I think I might have figured out the situation in a different fashion.
>>>>>
>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
>>>>> a mention of a "Local APIC Mode" setting.  Four values are listed
>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
>>>>>
>>>>> That is the sort of setting I likely left at "Auto" and that may well
>>>>> result in x2 functionality being disabled.  Perhaps the x2APIC
>>>>> functionality on AMD is detecting whether the hardware is present, and
>>>>> failing to test whether it has been enabled?  (could be useful to output
>>>>> a message suggesting enabling the hardware feature)
>>>>
>>>> Can we please move to a little more technical terms here? What is "present"
>>>> and "enabled" in your view? I don't suppose you mean the CPUID bit (which
>>>> we check) and the x2APIC-mode-enable one (which we drive as needed). It's
>>>> also left unclear what the four modes of BIOS operation evaluate to. Even
>>>> if we knew that, overriding e.g. "Compatibility" (which likely means some
>>>> form of "disabled" / "hidden") isn't normally an appropriate thing to do.
>>>> In "Auto" mode Xen likely should work - the only way I could interpret
>>>> the other modes is "xAPIC" meaning no x2APIC ACPI tables entries (and
>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode pre-
>>>> enabled by firmware, and "Auto" leaving it to the OS to select. Yet that's
>>>> speculation on my part ...
>>>
>>> I provided the information I had discovered.  There is a setting for this
>>> motherboard (likely present on some similar motherboards) which /may/
>>> affect the issue.  I doubt I've tried "compatibility", but none of the
>>> values I've tried have gotten the system to boot without "x2apic=false"
>>> on Xen's command-line.
>>>
>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended Features:"
>>> I see the line "(XEN) - x2APIC".  Later is the line
>>> "(XEN) x2APIC mode is already enabled by BIOS."  I'll guess "Auto"
>>> leaves the x2APIC turned off since neither line is present.
>>
>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
>> mode. Are you sure that's the case when using "Auto"?
> 
> grep -eAPIC\ driver -e-\ x2APIC:
> 
> "Auto":
> (XEN) Using APIC driver default
> (XEN) Overriding APIC driver with bigsmp
> (XEN) Switched to APIC driver x2apic_cluster
> 
> "x2APIC":
> (XEN) Using APIC driver x2apic_cluster
> (XEN) - x2APIC
> 
> Yes, I'm sure.

Okay, this then means we're running in a mode we don't mean to run
in: When the IOMMU claims to not support x2APIC mode (which is odd in
the first place when at the same time the CPU reports x2APIC mode as
supported), amd_iommu_prepare() is intended to switch interrupt
remapping mode to "restricted" (which in turn would force x2APIC mode
to "physical", not "clustered"). I notice though that there are a
number of error paths in the function which bypass this setting. Could
you add a couple of printk()s to understand which path is taken (each
time; the function can be called more than once)?
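
To illustrate the kind of instrumentation meant here, a minimal
standalone sketch follows. It is not the actual amd_iommu_prepare()
code: the enum, the variable "mode", the function prepare() and its
parameters are all hypothetical, and fprintf() stands in for printk().
It only shows how early error returns can leave a setting untouched,
and how one message per exit reveals the path taken on each call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the interrupt remapping mode; not Xen's type. */
enum remap_mode { REMAP_UNSET, REMAP_RESTRICTED, REMAP_FULL };

enum remap_mode mode = REMAP_UNSET;

/* Xen code would use printk(); fprintf() keeps this sketch standalone. */
#define TRACE(msg) fprintf(stderr, "prepare:%d: %s\n", __LINE__, msg)

/*
 * Hypothetical prepare step, returning 0 on success and a negative value
 * on error. The two early returns model error paths which leave 'mode'
 * untouched. Each exit prints its own message, so a log shows which path
 * was taken on every call.
 */
int prepare(bool tables_ok, bool init_ok, bool iommu_x2apic)
{
    if ( !tables_ok )
    {
        TRACE("table lookup failed, mode not set");
        return -1;
    }

    if ( !init_ok )
    {
        TRACE("init failed, mode not set");
        return -2;
    }

    mode = iommu_x2apic ? REMAP_FULL : REMAP_RESTRICTED;
    TRACE(iommu_x2apic ? "mode set to full" : "mode set to restricted");

    /* Every successful exit must have chosen a mode. */
    assert(mode != REMAP_UNSET);
    return 0;
}
```

Because the function may run more than once, the messages need to be
read per call rather than only the last one.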

>>> Both cases the line "(XEN) Switched to APIC driver x2apic_cluster" is
>>> present (so perhaps "Auto" merely doesn't activate it).
>>
>> Did you also try "x2apic_phys" on the Xen command line (just to be sure
>> this isn't a clustered-mode only issue)?
> 
> No.  In fact x2apic_cluster is mentioned in all failure cases.

Could you give physical mode a try, please?

>>> Appears error_interrupt() needs locking or some concurrency handling
>>> mechanism since the last error is jumbled.  With the setting "x2APIC"
>>> I get a bunch of:
>>> "(XEN) APIC error on CPU#: 00(08)(XEN) APIC error on CPU#: 00(08)"
>>> (apparently one for each core)
>>> Followed by "Receive accept error, Receive accept error," (again,
>>> apparently one for each core).  Then a bunch of newlines (same pattern).
>>
>> This is a known issue, but since the messages shouldn't appear in the
>> first place, so far no-one has bothered to address this.
> 
> I won't claim it is the best solution, but see other message.
> 
> I'm tempted to propose allowing _Static_assert() since it is valuable
> functionality for preventing issues.
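
The jumbled output quoted above is what one would expect when a
message is printed in several pieces while other CPUs print at the
same time. One possible approach, sketched below, is to build the
whole line in a local buffer and emit it with a single call. This is
an illustration only, not Xen's error_interrupt(): the function names
and buffer size are hypothetical, fprintf() stands in for printk(),
and it assumes the output routine does not interleave the text of a
single call. The bit names follow the architectural APIC error status
register layout, where 0x08 is "Receive accept error".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* APIC error status register bits 0..7, lowest bit first. */
static const char *const esr_names[8] = {
    "Send CS error",
    "Receive CS error",
    "Send accept error",
    "Receive accept error",
    "Redirectable IPI",
    "Send illegal vector",
    "Received illegal vector",
    "Illegal register address",
};

/*
 * Format the complete message into buf (at most len bytes including the
 * terminating NUL). Returns the number of characters stored, not counting
 * the NUL; the text is truncated if buf is too small.
 */
size_t format_apic_error(char *buf, size_t len, unsigned int cpu,
                         unsigned int prev, unsigned int cur)
{
    size_t used;
    unsigned int i;
    int n;

    assert(buf != NULL);
    if ( len == 0 )
        return 0;

    n = snprintf(buf, len, "APIC error on CPU%u: %02x(%02x)", cpu, prev, cur);
    if ( n < 0 )
    {
        buf[0] = '\0';
        return 0;
    }
    used = (size_t)n < len ? (size_t)n : len - 1;

    /* Append one name per set bit while there is room left. */
    for ( i = 0; i < 8 && used < len - 1; i++ )
    {
        if ( !(cur & (1u << i)) )
            continue;
        n = snprintf(buf + used, len - used, ", %s", esr_names[i]);
        if ( n < 0 )
            break;
        used += (size_t)n < len - used ? (size_t)n : len - used - 1;
    }

    return used;
}

/* Emit the whole line with one call instead of one call per fragment. */
void report_apic_error(unsigned int cpu, unsigned int prev, unsigned int cur)
{
    char line[256];

    format_apic_error(line, sizeof(line), cpu, prev, cur);
    fprintf(stderr, "%s\n", line);
}
```

The cost is a small stack buffer per call; the benefit is that each
CPU's report stays on one line without needing a lock in the handler.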

How does _Static_assert() come into play here? Also note that we already
use it when available ...
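
For readers unfamiliar with the "use it when available" pattern, a
minimal sketch follows. It is not Xen's macro: the name
COMPILE_TIME_CHECK, the table and the function are hypothetical and
exist only for illustration. The macro uses _Static_assert() when the
compiler advertises C11, and otherwise falls back to an array type
that gets a negative size when the condition is false.

```c
#include <assert.h> /* C11 also offers static_assert via this header. */
#include <stddef.h>

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
/* C11 or later: let the compiler report the failed condition by name. */
# define COMPILE_TIME_CHECK(cond) \
    do { _Static_assert(cond, #cond); } while ( 0 )
#else
/* Older compilers: a false condition yields an invalid array size. */
# define COMPILE_TIME_CHECK(cond) \
    do { typedef char check_[(cond) ? 1 : -1]; (void)sizeof(check_); } while ( 0 )
#endif

/* Hypothetical table, using the four BIOS settings named in this thread. */
#define NUM_MODES 4
const char *const mode_names[] = {
    "Compatibility", "xAPIC", "x2APIC", "Auto",
};

/* Returns the number of modes; the build fails if the table drifts. */
size_t num_bios_modes(void)
{
    COMPILE_TIME_CHECK(sizeof(mode_names) / sizeof(mode_names[0]) == NUM_MODES);
    return NUM_MODES;
}
```

Either branch rejects the translation unit at build time, so the
check costs nothing at run time.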

Jan
