Re: [BUG] x2apic broken with current AMD hardware
On 30.04.2023 19:16, Elliott Mitchell wrote:
> On Mon, Mar 20, 2023 at 09:28:20AM +0100, Jan Beulich wrote:
>> On 20.03.2023 09:14, Jan Beulich wrote:
>>> On 17.03.2023 18:26, Elliott Mitchell wrote:
>>>> On Fri, Mar 17, 2023 at 09:22:09AM +0100, Jan Beulich wrote:
>>>>> On 16.03.2023 23:03, Elliott Mitchell wrote:
>>>>>> On Mon, Mar 13, 2023 at 08:01:02AM +0100, Jan Beulich wrote:
>>>>>>> On 11.03.2023 01:09, Elliott Mitchell wrote:
>>>>>>>> On Thu, Mar 09, 2023 at 10:03:23AM +0100, Jan Beulich wrote:
>>>>>>>>>
>>>>>>>>> In any event you will want to collect a serial log at maximum
>>>>>>>>> verbosity. It would also be of interest to know whether turning
>>>>>>>>> off the IOMMU avoids the issue as well (on the assumption that
>>>>>>>>> your system has less than 255 CPUs).
>>>>>>>>
>>>>>>>> I think I might have figured out the situation in a different fashion.
>>>>>>>>
>>>>>>>> I was taking a look at the BIOS manual for this motherboard and noticed
>>>>>>>> a mention of a "Local APIC Mode" setting. Four values are listed:
>>>>>>>> "Compatibility", "xAPIC", "x2APIC", and "Auto".
>>>>>>>>
>>>>>>>> That is the sort of setting I likely left at "Auto" and that may well
>>>>>>>> result in x2 functionality being disabled. Perhaps the x2APIC
>>>>>>>> functionality on AMD is detecting whether the hardware is present, and
>>>>>>>> failing to test whether it has been enabled? (could be useful to
>>>>>>>> output a message suggesting enabling the hardware feature)
>>>>>>>
>>>>>>> Can we please move to a little more technical terms here? What is
>>>>>>> "present" and "enabled" in your view? I don't suppose you mean the
>>>>>>> CPUID bit (which we check) and the x2APIC-mode-enable one (which we
>>>>>>> drive as needed). It's also left unclear what the four modes of BIOS
>>>>>>> operation evaluate to. Even if we knew that, overriding e.g.
>>>>>>> "Compatibility" (which likely means some form of "disabled" /
>>>>>>> "hidden") isn't normally an appropriate thing to do. In "Auto" mode
>>>>>>> Xen likely should work - the only way I could interpret the other
>>>>>>> modes are "xAPIC" meaning no x2APIC ACPI table entries (and
>>>>>>> presumably the CPUID bit also masked), "x2APIC" meaning x2APIC mode
>>>>>>> pre-enabled by firmware, and "Auto" leaving it to the OS to select.
>>>>>>> Yet that's speculation on my part ...
>>>>>>
>>>>>> I provided the information I had discovered. There is a setting for this
>>>>>> motherboard (likely present on some similar motherboards) which /may/
>>>>>> affect the issue. I doubt I've tried "compatibility", but none of the
>>>>>> values I've tried have gotten the system to boot without "x2apic=false"
>>>>>> on Xen's command-line.
>>>>>>
>>>>>> When setting to "x2APIC" just after "(XEN) AMD-Vi: IOMMU Extended
>>>>>> Features:" I see the line "(XEN) - x2APIC". Later is the line
>>>>>> "(XEN) x2APIC mode is already enabled by BIOS." I'll guess "Auto"
>>>>>> leaves the x2APIC turned off since neither line is present.
>>>>>
>>>>> When "(XEN) - x2APIC" is absent the IOMMU can't be switched into x2APIC
>>>>> mode. Are you sure that's the case when using "Auto"?
>>>>
>>>> grep -eAPIC\ driver -e-\ x2APIC:
>>>>
>>>> "Auto":
>>>> (XEN) Using APIC driver default
>>>> (XEN) Overriding APIC driver with bigsmp
>>>> (XEN) Switched to APIC driver x2apic_cluster
>>>>
>>>> "x2APIC":
>>>> (XEN) Using APIC driver x2apic_cluster
>>>> (XEN) - x2APIC
>>>>
>>>> Yes, I'm sure.
>>>
>>> Okay, this then means we're running in a mode we don't mean to run
>>> in: When the IOMMU claims to not support x2APIC mode (which is odd in
>>> the first place when at the same time the CPU reports x2APIC mode as
>>> supported), amd_iommu_prepare() is intended to switch interrupt
>>> remapping mode to "restricted" (which in turn would force x2APIC mode
>>> to "physical", not "clustered"). I notice though that there are a
>>> number of error paths in the function which bypass this setting. Could
>>> you add a couple of printk()s to understand which path is taken (each
>>> time; the function can be called more than once)?
>>
>> I think I've spotted at least one issue. Could you give the patch below
>> a try please? (Patch is fine for master and 4.17 but would need context
>> adjustment for 4.16.)
>
> Given the patch didn't fix the problem, that wasn't the issue. I did
> though manage to try another variant of BIOS settings for this
> motherboard. Setting "Local APIC Mode" to "x2APIC" in the BIOS neither
> breaks anything additional, nor fixes issues. What was in Xen's dmesg
> did change slightly and looks likely better for my purposes.
> Some more snippets from 4.17 Xen dmesg, with "x2apic_phys=true":
>
> (XEN) AMD-Vi: IOMMU Extended Features:
> (XEN) - Peripheral Page Service Request
> (XEN) - x2APIC
> (XEN) - NX bit
> (XEN) - Guest APIC Physical Processor Interrupt
> (XEN) - Invalidate All Command
> (XEN) - Guest APIC
> (XEN) - Performance Counters
> (XEN) - Host Address Translation Size: 0x2
> (XEN) - Guest Address Translation Size: 0
> (XEN) - Guest CR3 Root Table Level: 0x1
> (XEN) - Maximum PASID: 0xf
> (XEN) - SMI Filter Register: 0x1
> (XEN) - SMI Filter Register Count: 0x1
> (XEN) - Guest Virtual APIC Modes: 0x1
> (XEN) - Dual PPR Log: 0x2
> (XEN) - Dual Event Log: 0x2
> (XEN) - Secure ATS
> (XEN) - User / Supervisor Page Protection
> (XEN) - Device Table Segmentation: 0x3
> (XEN) - PPR Log Overflow Early Warning
> (XEN) - PPR Automatic Response
> (XEN) - Memory Access Routing and Control: 0x1
> (XEN) - Block StopMark Message
> (XEN) - Performance Optimization
> (XEN) - MSI Capability MMIO Access
> (XEN) - Guest I/O Protection
> (XEN) - Enhanced PPR Handling
> (XEN) - Invalidate IOTLB Type
> (XEN) - VM Table Size: 0x2
> (XEN) - Guest Access Bit Update Disable
> (XEN) AMD-Vi: Disabled HAP memory map sharing with IOMMU
> (XEN) AMD-Vi: IOMMU 0 Enabled.
>
> (XEN) I/O virtualisation enabled
> (XEN) - Dom0 mode: Relaxed
> (XEN) Interrupt remapping enabled
> (XEN) nr_sockets: 1
> (XEN) Enabled directed EOI with ioapic_ack_old on!
> (XEN) Enabling APIC mode: Physical. Using 2 I/O APICs
> (XEN) ENABLING IO-APIC IRQs
> (XEN) -> Using old ACK method
>
> (XEN) SVM: Supported advanced features:
> (XEN) - Nested Page Tables (NPT)
> (XEN) - Last Branch Record (LBR) Virtualisation
> (XEN) - Next-RIP Saved on #VMEXIT
> (XEN) - VMCB Clean Bits
> (XEN) - DecodeAssists
> (XEN) - Virtual VMLOAD/VMSAVE
> (XEN) - Virtual GIF
> (XEN) - Pause-Intercept Filter
> (XEN) - Pause-Intercept Filter Threshold
> (XEN) - TSC Rate MSR
> (XEN) - NPT Supervisor Shadow Stack
> (XEN) - MSR_SPEC_CTRL virtualisation
> (XEN) HVM: SVM enabled
>
> If I'm reading that correctly, everything is there for x2APIC. As such
> there seem to be 1 or 2 bugs:
>
> The definite bug is the x2apic_cluster APIC driver fails on recent AMD
> processors.
>
> I'm unsure whether selecting the x2apic_cluster APIC driver is correct or
> not. Capabilities you used to only find on multi-socket server
> motherboards are now being found on desktop motherboards. My
> understanding is this processor does NUMA on a single die, not merely a
> single-socket. As such it may well need the features of x2apic_cluster,
> perhaps the driver assumes nr_socket > 1 which is untrue here?

Just to answer this one (I don't think there's much more I can do at
this point, without further information): No, there certainly isn't
such an assumption. Iirc the x2APIC code also predates the existence
of the nr_sockets variable (and the respective log line) by quite a
bit.

Jan

> Does appear "x2apic_phys=true" plus "tsc_mode = 'always_emulate'" are
> adequate workarounds all the way back to 4.14. Now for the second
> correct bugfix.
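
The grep comparison earlier in the thread (APIC driver lines plus the IOMMU "- x2APIC" feature line) can be sketched as a small script for comparing boot logs across BIOS settings. This is only a minimal sketch: the function name summarize_x2apic and the returned dictionary keys are hypothetical, and the patterns are based solely on the log lines quoted in this thread, so real logs with different spacing or wording may need adjusted patterns.

```python
import re

# Patterns modelled on the "(XEN)" lines quoted in this thread; these are
# assumptions about the log format, not taken from Xen's source.
DRIVER_RE = re.compile(
    r"\(XEN\)\s+(?:Using APIC driver|Overriding APIC driver with|"
    r"Switched to APIC driver)\s+(\S+)"
)
IOMMU_X2APIC_RE = re.compile(r"\(XEN\)\s+-\s+x2APIC\s*$")
BIOS_ENABLED_TEXT = "x2APIC mode is already enabled by BIOS"


def summarize_x2apic(log_text):
    """Collect the x2APIC-related facts visible in a Xen boot log.

    Hypothetical helper: returns the sequence of APIC drivers selected,
    whether the IOMMU feature list contains "x2APIC", and whether the
    firmware pre-enabled x2APIC mode.
    """
    drivers = []
    iommu_x2apic = False
    bios_enabled = False
    for line in log_text.splitlines():
        match = DRIVER_RE.search(line)
        if match:
            drivers.append(match.group(1))
        elif IOMMU_X2APIC_RE.search(line):
            iommu_x2apic = True
        elif BIOS_ENABLED_TEXT in line:
            bios_enabled = True
    return {
        "apic_drivers": drivers,
        "iommu_x2apic": iommu_x2apic,
        "bios_enabled": bios_enabled,
    }


# The two grep outputs quoted in the thread, one per BIOS setting.
AUTO_LOG = """(XEN) Using APIC driver default
(XEN) Overriding APIC driver with bigsmp
(XEN) Switched to APIC driver x2apic_cluster"""

X2APIC_LOG = """(XEN) Using APIC driver x2apic_cluster
(XEN) - x2APIC"""

if __name__ == "__main__":
    print("Auto:  ", summarize_x2apic(AUTO_LOG))
    print("x2APIC:", summarize_x2apic(X2APIC_LOG))
```

With the two excerpts above, the "Auto" log ends on x2apic_cluster without the IOMMU reporting x2APIC, which is the combination described in the thread as a mode Xen does not mean to run in.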