
Re: [Xen-devel] Xen crash on S3 resume on 4.13 and unstable if any CPU is re-offlined



On 05.01.20 08:39, Marek Marczykowski-Górecki wrote:
On Sun, Jan 05, 2020 at 12:42:30AM +0000, Andrew Cooper wrote:
On 04/01/2020 15:30, Marek Marczykowski-Górecki wrote:
Hi,

I have a reliable crash on resume from S3. I can reproduce it on both
real hardware and nested within KVM, although call traces are different
between those platforms. In any case, it happens only if some CPU is to
be re-offlined after resume (smt=off and/or maxcpus=... options).

I think the crash from the real hardware gives more clues, but the one
from qemu may also be interesting, maybe it's even another bug?

The crash message (full console log attached):

(XEN) mce_intel.c:772: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
(XEN) CPU0 CMCI LVT vector (0xf2) already installed
(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs  ...
(XEN) ----[ Xen-4.14-unstable  x86_64  debug=y   Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d08023beb7>] schedule.c#cpu_schedule_callback+0xea/0x1a1
(XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff82d080453348   rcx: ffff82d080584020
(XEN) rdx: 000000339b66e000   rsi: 0000000000008005   rdi: ffff82d080453340
(XEN) rbp: ffff8300ca45fd68   rsp: ffff8300ca45fd68   r8:  0000000000000004
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 8000000000000000
(XEN) r12: ffff82d080453340   r13: ffff82d080453200   r14: 0000000000008005
(XEN) r15: 0000000000008000   cr0: 000000008005003b   cr4: 00000000000426e0
(XEN) cr3: 00000000ca44f000   cr2: 0000000000000008
(XEN) fsb: 000079d5e4f9e740   gsb: ffff888135600000   gss: 0000000000000000
(XEN) ds: 0018   es: 0010   fs: b800   gs: 0010   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d08023beb7> (schedule.c#cpu_schedule_callback+0xea/0x1a1):
(XEN)  48 8b 14 d1 48 8b 04 02 <48> 8b 48 08 48 85 c9 74 64 48 8b 05 b9 c3 32 00
(XEN) Xen stack trace from rsp=ffff8300ca45fd68:
(XEN)    ffff8300ca45fdb0 ffff82d080221289 ffff8300ca45fdd8 0000000000000001
(XEN)    0000000000000000 00000000ffffffef ffff8300ca45fe00 0000000000000001
(XEN)    0000000000000200 ffff8300ca45fdc8 ffff82d080203476 0000000000000001
(XEN)    ffff8300ca45fdf0 ffff82d080203550 0000000000000000 0000000000000001
(XEN)    0000000000000000 ffff8300ca45fe20 ffff82d080203999 ffff8300ca45fef8
(XEN)    0000000000000000 0000000000000003 00000000000426e0 ffff8300ca45fe58
(XEN)    ffff82d0802e4240 ffff83042896c5f0 ffff83041bb4d000 0000000000000000
(XEN)    0000000000000000 ffff83041bb73000 ffff8300ca45fe78 ffff82d08020828f
(XEN)    ffff83041bb4d1b8 ffff82d080567210 ffff8300ca45fe90 ffff82d08023fd39
(XEN)    ffff82d080567200 ffff8300ca45fec0 ffff82d08024001a 0000000000000000
(XEN)    ffff82d080567210 ffff82d08056d980 ffff82d080584020 ffff8300ca45fef0
(XEN)    ffff82d08027247a ffff83041bbb2000 ffff83041bb4d000 ffff83041bbb3000
(XEN)    0000000000000000 ffff8300ca45fd98 0000000000000003 ffffffff820ae496
(XEN)    0000000000000003 0000000000000000 0000000000002003 ffffffff822c6868
(XEN)    0000000000000246 0000000000003403 00000000ffff0000 0000000000000000
(XEN)    0000000000000000 ffffffff810010ea 0000000000002003 0000000000000010
(XEN)    deadbeefdeadf00d 0000010000000000 ffffffff810010ea 000000000000e033
(XEN)    0000000000000246 ffffc900011abbe8 000000000000e02b 003b4a890045ffe0
(XEN)    003b4ddf00098fa8 003b4e0300000001 003b499d0045ffe0 0000e01000000000
(XEN)    ffff83041bbb2000 0000000000000000 00000000000426e0 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82d08023beb7>] R schedule.c#cpu_schedule_callback+0xea/0x1a1
(XEN)    [<ffff82d080221289>] F notifier_call_chain+0x6b/0x96
(XEN)    [<ffff82d080203476>] F cpu.c#cpu_notifier_call_chain+0x1b/0x33
(XEN)    [<ffff82d080203550>] F cpu_down+0x5e/0x15c
(XEN)    [<ffff82d080203999>] F enable_nonboot_cpus+0x113/0x1fb
(XEN)    [<ffff82d0802e4240>] F power.c#enter_state_helper+0x107/0x51b
(XEN)    [<ffff82d08020828f>] F domain.c#continue_hypercall_tasklet_handler+0x8b/0xb7
(XEN)    [<ffff82d08023fd39>] F tasklet.c#do_tasklet_work+0x76/0xa9
(XEN)    [<ffff82d08024001a>] F do_tasklet+0x58/0x8a
(XEN)    [<ffff82d08027247a>] F domain.c#idle_loop+0x40/0x96
(XEN)
(XEN) Pagetable walk from 0000000000000008:
(XEN)  L4[0x000] = 000000041bbff063 ffffffffffffffff
(XEN)  L3[0x000] = 000000041bbfe063 ffffffffffffffff
(XEN)  L2[0x000] = 000000041bbfd063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000008
(XEN) ****************************************

And the one from qemu:

(XEN) mce_intel.c:772: MCA Capability: firstbank 1, extended MCE MSR 0, SER
(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs  ...
(XEN) Assertion 'c2rqd(ops, sched_unit_master(unit)) == svc->rqd' failed at sched_credit2.c:2137
(XEN) ----[ Xen-4.14-unstable  x86_64  debug=y   Not tainted ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82d08022fe1a>] sched_credit2.c#csched2_unit_wake+0x174/0x176
(XEN) RFLAGS: 0000000000010097   CONTEXT: hypervisor (d0v0)
(XEN) rax: ffff83013a7313e8   rbx: ffff83013a6bdf40   rcx: 0000000000000051
(XEN) rdx: ffff83013a731160   rsi: ffff83013a7310e0   rdi: 0000000000000003
(XEN) rbp: ffff83013a6f7d98   rsp: ffff83013a6f7d78   r8:  deadbeefdeadf00d
(XEN) r9:  deadbeefdeadf00d   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: ffff83013a6bc7e0   r13: ffff82d08043e720   r14: 0000000000000003
(XEN) r15: 00000003c5ffecac   cr0: 0000000080050033   cr4: 0000000000000660
(XEN) cr3: 000000004b005000   cr2: 0000000000000000
(XEN) fsb: 00007751649f4740   gsb: ffff888134a00000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen code around <ffff82d08022fe1a> (sched_credit2.c#csched2_unit_wake+0x174/0x176):
(XEN)  ef e8 1e c1 ff ff eb a7 <0f> 0b 55 48 89 e5 41 57 41 56 41 55 41 54 53 48
(XEN) Xen stack trace from rsp=ffff83013a6f7d78:
(XEN)    ffff83013a6a3000 ffff83013a6bdf40 ffff83013a6bdf40 ffff83013a7313e8
(XEN)    ffff83013a6f7de8 ffff82d0802391f8 0000000000000202 ffff83013a7313e8
(XEN)    ffff83013a6c1018 0000000000000001 0000000000000000 0000000000000000
(XEN)    ffff83013a6c1018 ffff83013a6a3000 ffff83013a6f7e58 ffff82d08020906c
(XEN)    ffff82d08035d3d4 ffff82d08035d3c8 ffff82d08035d3d4 ffff82d08035d3c8
(XEN)    ffff82d08035d3d4 ffff82d08035d3c8 ffff82d08035d3d4 ffff83013a6f7ef8
(XEN)    0000000000000180 ffff83013a6aa000 deadbeefdeadf00d 0000000000000003
(XEN)    ffff83013a6f7ee8 ffff82d0803570c7 0000000000000001 0000000000000001
(XEN)    0000000000000000 deadbeefdeadf00d deadbeefdeadf00d ffff82d08035d3c8
(XEN)    ffff82d08035d3d4 ffff82d08035d3c8 ffff82d08035d3d4 ffff82d08035d3c8
(XEN)    ffff82d08035d3d4 ffff83013a6aa000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 00007cfec59080e7 ffff82d08035d432
(XEN)    0000000000015120 0000000000000001 0000000000000000 ffff88813024a540
(XEN)    0000000000000000 0000000000000001 0000000000000246 0000000000140000
(XEN)    ffff8880bf7db000 ffffea0004be4508 0000000000000018 ffffffff8100130a
(XEN)    0000000000000000 0000000000000001 0000000000000001 0000010000000000
(XEN)    ffffffff8100130a 000000000000e033 0000000000000246 ffffc90000c97c98
(XEN)    000000000000e02b 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000e01000000001 ffff83013a6aa000 00000030ba196000
(XEN)    0000000000000660 0000000000000000 000000013a6e2000 0000040000000000
(XEN) Xen call trace:
(XEN)    [<ffff82d08022fe1a>] R sched_credit2.c#csched2_unit_wake+0x174/0x176
(XEN)    [<ffff82d0802391f8>] F vcpu_wake+0xea/0x4d8
(XEN)    [<ffff82d08020906c>] F do_vcpu_op+0x36f/0x687
(XEN)    [<ffff82d0803570c7>] F pv_hypercall+0x28f/0x57d
(XEN)    [<ffff82d08035d432>] F lstar_enter+0x112/0x120
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Assertion 'c2rqd(ops, sched_unit_master(unit)) == svc->rqd' failed at sched_credit2.c:2137
(XEN) ****************************************
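
[Editorial aside, not part of the original mails.] The first trace is consistent with a NULL pointer dereference at offset 8: the byte marked <48> in "Xen code around" starts a load through rax, rax is 0, and the faulting address is 0x8. The following is a minimal sketch in Python that decodes only that one instruction form and checks it against the register dump. The helper name and the restriction to a single encoding are assumptions made for illustration; it is not a general disassembler and is not taken from the Xen tree.

```python
# Values copied from the first crash dump (real hardware).
REGS = {"rax": 0x0, "rcx": 0xFFFF82D080584020, "rdx": 0x000000339B66E000}
CR2 = 0x8  # "Faulting linear address" reported by Xen

# The four bytes starting at the <48> marker in "Xen code around".
FAULTING_BYTES = bytes.fromhex("488b4808")

# Register numbers 0-7 as used by ModRM without REX.R/REX.B extension bits.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_load_disp8(code):
    """Decode 'REX.W 8B /r' with mod=01, i.e. mov disp8(%base),%dest.

    Hypothetical helper: only this single form is handled, anything
    else raises ValueError.
    """
    rex, opcode, modrm, disp = code[0], code[1], code[2], code[3]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a plain 64-bit MOV r64, r/m64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only [base + disp8] without a SIB byte is supported")
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REG64[reg], REG64[rm], disp


dest, base, disp = decode_mov_load_disp8(FAULTING_BYTES)
effective_address = (REGS[base] + disp) & 0xFFFFFFFFFFFFFFFF
print(f"mov {disp:#x}(%{base}),%{dest} -> address {effective_address:#x}")
```

The computed address matching cr2 only says that the pointer loaded just before the fault was NULL; it does not by itself say why that pointer was NULL.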

This looks very much like the core scheduling crash found on specific
machines in S5.  From my analysis, it was a use-after-free on a
scheduling resource.

Does switching back to thread mode (as opposed to core mode) make the
crash go away?

It is the thread mode (unless default has changed).

Does the attached patch fix it for you?


Juergen

Attachment: 0001-xen-sched-fix-resuming-from-S3-with-smt-0.patch
Description: Text Data
