Re: Assertion failed at arch/x86/genapic/x2apic.c:38 on S3 resume nested in KVM on AMD
On Thu, Aug 08, 2024 at 01:22:30PM +0200, Jan Beulich wrote:
> On 23.07.2024 16:28, Marek Marczykowski-Górecki wrote:
> > I'm observing a crash like the one below when trying to resume from S3.
> > It happens on Xen nested in KVM (QEMU 9.0, Linux 6.9.3) but only on AMD.
> > The very same software stack on Intel works just fine. QEMU is running
> > with "-cpu host,+svm,+invtsc -machine q35,kernel-irqchip=split -device
> > amd-iommu,intremap=on -smp 2" among others.
> >
> > (XEN) Preparing system for ACPI S3 state.
> > (XEN) Disabling non-boot CPUs ...
> > (XEN) Broke affinity for IRQ1, new: {0-1}
> > (XEN) Broke affinity for IRQ20, new: {0-1}
> > (XEN) Broke affinity for IRQ22, new: {0-1}
> > (XEN) Entering ACPI S3 state.
> > (XEN) Finishing wakeup from ACPI S3 state.
> > (XEN) Enabling non-boot CPUs ...
> > (XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus,
> > this_cpu))' failed at arch/x86/genapic/x2apic.c:38
> > (XEN) ----[ Xen-4.20 x86_64 debug=y Not tainted ]----
> > (XEN) CPU: 1
> > (XEN) RIP: e008:[<ffff82d04028862b>]
> > x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
> > (XEN) RFLAGS: 0000000000010096 CONTEXT: hypervisor
> > (XEN) rax: ffff830278a25f50 rbx: 0000000000000001 rcx:
> > ffff82d0405e1700
> > (XEN) rdx: 0000003233412000 rsi: ffff8302739da2d8 rdi:
> > 0000000000000000
> > (XEN) rbp: 00000000000000c8 rsp: ffff8302739d7e78 r8:
> > 0000000000000001
> > (XEN) r9: ffff8302739d7fa0 r10: 0000000000000001 r11:
> > 0000000000000000
> > (XEN) r12: 0000000000000001 r13: 0000000000000001 r14:
> > 0000000000000000
> > (XEN) r15: 0000000000000000 cr0: 000000008005003b cr4:
> > 00000000007506e0
> > (XEN) cr3: 000000007fa7a000 cr2: 0000000000000000
> > (XEN) fsb: 0000000000000000 gsb: 0000000000000000 gss:
> > 0000000000000000
> > (XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: 0000 cs: e008
> > (XEN) Xen code around <ffff82d04028862b>
> > (x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9):
> > (XEN) cf 82 ff ff eb b7 0f 0b <0f> 0b 48 8d 05 9c fc 33 00 48 8b 0d a5
> > 0a 35 00
> > (XEN) Xen stack trace from rsp=ffff8302739d7e78:
> > (XEN) 0000000000000000 00000000000000c8 0000000000000001
> > 0000000000000001
> > (XEN) 0000000000000000 ffff82d0402f1d83 ffff8302739d7fff
> > 00000000000000c8
> > (XEN) 0000000000000001 0000000000000001 ffff82d04031adb9
> > 0000000000000001
> > (XEN) 0000000000000000 0000000000000000 0000000000000000
> > ffff82d040276677
> > (XEN) 0000000000000000 0000000000000000 0000000000000000
> > 0000000000000000
> > (XEN) ffff88810037c000 0000000000000001 0000000000000246
> > deadbeefdeadf00d
> > (XEN) 0000000000000001 aaaaaaaaaaaaaaaa 0000000000000000
> > ffffffff811d130a
> > (XEN) deadbeefdeadf00d deadbeefdeadf00d deadbeefdeadf00d
> > 0000010000000000
> > (XEN) ffffffff811d130a 000000000000e033 0000000000000246
> > ffffc900400b3ef8
> > (XEN) 000000000000e02b 000000000000beef 000000000000beef
> > 000000000000beef
> > (XEN) 000000000000beef 0000e01000000001 ffff8302739de000
> > 0000003233412000
> > (XEN) 00000000007506e0 0000000000000000 0000000000000000
> > 0000000200000000
> > (XEN) 0000000000000002
> > (XEN) Xen call trace:
> > (XEN) [<ffff82d04028862b>] R
> > x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
> > (XEN) [<ffff82d0402f1d83>] S setup_local_APIC+0x26/0x449
> > (XEN) [<ffff82d04031adb9>] S start_secondary+0x1c4/0x37a
> > (XEN) [<ffff82d040276677>] S __high_start+0x87/0xd0
> > (XEN)
> > (XEN)
> > (XEN) ****************************************
> > (XEN) Panic on CPU 1:
> > (XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus,
> > this_cpu))' failed at arch/x86/genapic/x2apic.c:38
> > (XEN) ****************************************
>
> Would you mind giving the patch below a try?
Yes, this seems to fix the issue, thanks!
> Jan
>
> x86/x2APIC: correct cluster tracking upon CPUs going down for S3
>
> Downing CPUs for S3 is somewhat special: Since we can expect the system
> to come back up in exactly the same hardware configuration, per-CPU data
> for the secondary CPUs isn't de-allocated (and then cleared upon re-
> allocation when the CPUs are being brought back up). Therefore the
> cluster_cpus per-CPU pointer will retain its value for all CPUs other
> than the final one in a cluster (i.e. in particular for all CPUs in the
> same cluster as CPU0). That, however, is in conflict with the assertion
> early in init_apic_ldr_x2apic_cluster().
>
> Note that the issue is avoided on Intel hardware, where we park CPUs
> instead of bringing them down.
I wonder why I don't hit this issue on bare-metal AMD, only when nested.
Anyway,
Tested-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
> Extend the bypassing of the freeing to the suspend case, thus making
> suspend/resume also a tiny bit faster.
>
> Fixes: 2e6c8f182c9c ("x86: distinguish CPU offlining from CPU removal")
> Reported-by: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
>
> --- a/xen/arch/x86/genapic/x2apic.c
> +++ b/xen/arch/x86/genapic/x2apic.c
> @@ -228,7 +228,8 @@ static int cf_check update_clusterinfo(
>      case CPU_UP_CANCELED:
>      case CPU_DEAD:
>      case CPU_REMOVE:
> -        if ( park_offline_cpus == (action != CPU_REMOVE) )
> +        if ( park_offline_cpus == (action != CPU_REMOVE) ||
> +             system_state == SYS_STATE_suspend )
>              break;
>          if ( per_cpu(cluster_cpus, cpu) )
>          {
>
>
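For anyone following the thread, here is a rough toy model of the failure mode the commit message describes. It is only a sketch, not the Xen code: the names cluster0_mask, cpu_dead(), cpu_up(), suspend_resume_ok(), and the with_fix flag are made up for illustration, a plain array stands in for per_cpu(cluster_cpus, cpu), and a single bitmask stands in for a cpumask. It assumes two CPUs in one cluster and leaves out the park_offline_cpus condition, which corresponds to the non-parking case where the bug shows up.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 2

/* Toy stand-ins (assumptions, not Xen's real types). */
static unsigned int cluster0_mask;          /* bit n set => CPU n is in the cluster */
static unsigned int *cluster_cpus[NR_CPUS]; /* models per_cpu(cluster_cpus, cpu) */

enum sys_state { STATE_ACTIVE, STATE_SUSPEND };

/* Models the CPU_DEAD handling in update_clusterinfo(). */
static void cpu_dead(unsigned int cpu, enum sys_state state, bool with_fix)
{
    /* The patch skips the teardown entirely while suspending. */
    if ( with_fix && state == STATE_SUSPEND )
        return;

    if ( cluster_cpus[cpu] )
    {
        /* Remove this CPU from its cluster's mask ... */
        *cluster_cpus[cpu] &= ~(1u << cpu);
        /* ... but only drop the pointer if the cluster became empty. */
        if ( *cluster_cpus[cpu] == 0 )
            cluster_cpus[cpu] = NULL;
    }
}

/*
 * Models the start of init_apic_ldr_x2apic_cluster().
 * Returns false where the real code's ASSERT() would fire.
 */
static bool cpu_up(unsigned int cpu)
{
    /* A retained pointer is trusted to still have this CPU's bit set. */
    if ( cluster_cpus[cpu] )
        return (*cluster_cpus[cpu] & (1u << cpu)) != 0;

    /* First bring-up: join the (only) cluster. */
    cluster_cpus[cpu] = &cluster0_mask;
    cluster0_mask |= 1u << cpu;
    return true;
}

/* Boot both CPUs, take CPU1 down for S3, then bring it back up. */
static bool suspend_resume_ok(bool with_fix)
{
    cluster0_mask = 0;
    cluster_cpus[0] = cluster_cpus[1] = NULL;

    cpu_up(0);
    cpu_up(1);
    cpu_dead(1, STATE_SUSPEND, with_fix);

    return cpu_up(1);
}
```

In this model suspend_resume_ok(false) reports the assertion condition failing, because CPU1's bit was cleared while its pointer was kept (CPU0 keeps the mask non-empty), whereas suspend_resume_ok(true) passes since the mask is left untouched across suspend.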
--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab