[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer



On Mon, Aug 22, 2022 at 11:53:50AM +0200, Jan Beulich wrote:
> On 21.08.2022 18:14, Marek Marczykowski-Górecki wrote:
> > On Sat, Oct 09, 2021 at 06:28:17PM +0200, Marek Marczykowski-Górecki wrote:
> >> On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:
> >>> On Tue, Sep 29, 2020 at 05:27:48PM +0200, Jürgen Groß wrote:
> >>>> On 29.09.20 17:16, Marek Marczykowski-Górecki wrote:
> >>>>> On Tue, Sep 29, 2020 at 05:07:11PM +0200, Jürgen Groß wrote:
> >>>>>> On 29.09.20 16:27, Marek Marczykowski-Górecki wrote:
> >>>>>>> On Mon, Mar 23, 2020 at 01:09:49AM +0100, Marek Marczykowski-Górecki 
> >>>>>>> wrote:
> >>>>>>>> On Thu, Mar 19, 2020 at 01:28:10AM +0100, Dario Faggioli wrote:
> >>>>>>>>> [Adding Juergen]
> >>>>>>>>>
> >>>>>>>>> On Wed, 2020-03-18 at 23:10 +0100, Marek Marczykowski-Górecki wrote:
> >>>>>>>>>> On Wed, Mar 18, 2020 at 02:50:52PM +0000, Andrew Cooper wrote:
> >>>>>>>>>>> On 18/03/2020 14:16, Marek Marczykowski-Górecki wrote:
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> In my test setup (inside KVM with nested virt enabled), I rather
> >>>>>>>>>>>> frequently get Xen crash on resume from S3. Full message below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is Xen 4.13.0, with some patches, including "sched: fix
> >>>>>>>>>>>> resuming
> >>>>>>>>>>>> from S3 with smt=0".
> >>>>>>>>>>>>
> >>>>>>>>>>>> Contrary to the previous issue, this one does not happen always -
> >>>>>>>>>>>> I
> >>>>>>>>>>>> would say in about 40% cases on this setup, but very rarely on
> >>>>>>>>>>>> physical
> >>>>>>>>>>>> setup.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This is _without_ core scheduling enabled, and also with smt=off.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Do you think it would be any different on xen-unstable? I cat
> >>>>>>>>>>>> try, but
> >>>>>>>>>>>> it isn't trivial in this setup, so I'd ask first.
> >>>>>>>>>>>>
> >>>>>>>>> Well, Juergen has fixed quite a few issues.
> >>>>>>>>>
> >>>>>>>>> Most of them where triggering with core-scheduling enabled, and I 
> >>>>>>>>> don't
> >>>>>>>>> recall any of them which looked similar or related to this.
> >>>>>>>>>
> >>>>>>>>> Still, it's possible that the same issue causes different symptoms, 
> >>>>>>>>> and
> >>>>>>>>> hence that maybe one of the patches would fix this too.
> >>>>>>>>
> >>>>>>>> I've tested on master (d094e95fb7c), and reproduced exactly the same 
> >>>>>>>> crash
> >>>>>>>> (pasted below for the completeness).
> >>>>>>>> But there is more: additionally, in most (all?) cases after resume 
> >>>>>>>> I've got
> >>>>>>>> soft lockup in Linux dom0 in smp_call_function_single() - see below. 
> >>>>>>>> It
> >>>>>>>> didn't happened before and the only change was Xen 4.13 -> master.
> >>>>>>>>
> >>>>>>>> Xen crash:
> >>>>>>>>
> >>>>>>>> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed 
> >>>>>>>> at credit2.c:2133
> >>>>>>>
> >>>>>>> Juergen, any idea about this one? This is also happening on the 
> >>>>>>> current
> >>>>>>> stable-4.14 (28855ebcdbfa).
> >>>>>>>
> >>>>>>
> >>>>>> Oh, sorry I didn't come back to this issue.
> >>>>>>
> >>>>>> I suspect this is related to stop_machine_run() being called during
> >>>>>> suspend(), as I'm seeing very sporadic issues when offlining and then
> >>>>>> onlining cpus with core scheduling being active (it seems as if the
> >>>>>> dom0 vcpu doing the cpu online activity sometimes is using an old
> >>>>>> vcpu state).
> >>>>>
> >>>>> Note this is default Xen 4.14 start, so core scheduling is _not_ active:
> >>>>
> >>>> The similarity in the two failure cases is that multiple cpus are
> >>>> affected by the operations during stop_machine_run().
> >>>>
> >>>>>
> >>>>>      (XEN) Brought up 2 CPUs
> >>>>>      (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
> >>>>>      (XEN) Adding cpu 0 to runqueue 0
> >>>>>      (XEN)  First cpu on runqueue, activating
> >>>>>      (XEN) Adding cpu 1 to runqueue 1
> >>>>>      (XEN)  First cpu on runqueue, activating
> >>>>>
> >>>>>> I wasn't able to catch the real problem despite of having tried lots
> >>>>>> of approaches using debug patches.
> >>>>>>
> >>>>>> Recently I suspected the whole problem could be somehow related to
> >>>>>> RCU handling, as stop_machine_run() is relying on tasklets which are
> >>>>>> executing in idle context, and RCU handling is done in idle context,
> >>>>>> too. So there might be some kind of use after free scenario in case
> >>>>>> some memory is freed via RCU despite it still being used by a tasklet.
> >>>>>
> >>>>> That sounds plausible, even though I don't really know this area of Xen.
> >>>>>
> >>>>>> I "just" need to find some time to verify this suspicion. Any help 
> >>>>>> doing
> >>>>>> this would be appreciated. :-)
> >>>>>
> >>>>> I do have a setup where I can easily-ish reproduce the issue. If there
> >>>>> is some debug patch you'd like me to try, I can do that.
> >>>>
> >>>> Thanks. I might come back to that offer as you are seeing a crash which
> >>>> will be much easier to analyze. Catching my error case is much harder as
> >>>> it surfaces some time after the real problem in a non destructive way
> >>>> (usually I'm seeing a failure to load a library in the program which
> >>>> just did its job via exactly the library claiming not being loadable).
> >>>
> >>> Hi,
> >>>
> >>> I'm resurrecting this thread as it was recently mentioned elsewhere. I
> >>> can still reproduce the issue on the recent staging branch (9dc687f155).
> >>>
> >>> It fails after the first resume (not always, but frequent enough to
> >>> debug it). At least one guest needs to be running - with just (PV) dom0
> >>> the crash doesn't happen (at least for the ~8 times in a row I tried).
> >>> If the first resume works, the second (almost?) always will fail but
> >>> with a different symptoms - dom0 kernel lockups (at least some of its
> >>> vcpus). I haven't debugged this one yet at all.
> >>>
> >>> Any help will be appreciated, I can apply some debug patches, change
> >>> configuration etc.
> >>
> >> This still happens on 4.14.3. Maybe it is related to freeing percpu
> >> areas, as it caused other issues with suspend too? Just a thought...
> > 
> > I have reproduced this on current staging(*). And I can reproduce it
> > reliably. And also, I got (I believe) closely related crash with credit1
> > scheduler.
> > 
> > (*) It isn't plain staging, it's one with my xhci console patches on
> > top, including attempt to make it survive S3. I believe the only
> > relevant part there is sticking set_timer() into console resume path (or
> > just having a timer with rather short delay registered). The actual tree
> > at https://github.com/marmarek/xen/tree/master-xue2-debug, including
> > quite a lot of debug prints and debug hacks.
> > 
> > Specific crash with credit2:
> 
> Are you sure this is Credit2? Both ...
> 
> >     (XEN) Assertion 'sched_unit_master(currunit) == cpu' failed at 
> > common/sched/credit.c:928
> 
> ... here and ...
> 
> >     (XEN) ----[ Xen-4.17-unstable  x86_64  debug=y  Tainted:   C    ]----
> >     (XEN) CPU:    0
> >     (XEN) RIP:    e008:[<ffff82d0402434bf>] credit.c#csched_tick+0x2d4/0x494
> >     (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d0v4)
> >     (XEN) rax: ffff82d0405c4298   rbx: 0000000000000002   rcx: 
> > 0000000000000002
> >     (XEN) rdx: ffff8302517f64d0   rsi: ffff8302515c0fc0   rdi: 
> > 0000000000000002
> >     (XEN) rbp: ffff830256227e38   rsp: ffff830256227de0   r8:  
> > 0000000000000004
> >     (XEN) r9:  ffff8302517ac820   r10: ffff830251745068   r11: 
> > 00000088cb734887
> >     (XEN) r12: ffff83025174de50   r13: ffff8302515c0fa0   r14: 
> > ffff83025174df40
> >     (XEN) r15: ffff8302515c0cc0   cr0: 0000000080050033   cr4: 
> > 0000000000372660
> >     (XEN) cr3: 00000001bacbd000   cr2: 000077e5ec02a318
> >     (XEN) fsb: 000077e5fe533700   gsb: ffff888255700000   gss: 
> > 0000000000000000
> >     (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> >     (XEN) Xen code around <ffff82d0402434bf> 
> > (credit.c#csched_tick+0x2d4/0x494):
> >     (XEN)  01 00 00 e9 2a 01 00 00 <0f> 0b 0f 0b 0f 0b 48 8b 41 20 0f b7 00 
> > 89 45 cc
> >     (XEN) Xen stack trace from rsp=ffff830256227de0:
> >     (XEN)    ffff830256227fff 0000000000000000 0000000256227e10 
> > ffff82d04035be90
> >     (XEN)    ffff830256227ef8 ffff830251745000 ffff82d0405c3280 
> > ffff82d0402431eb
> >     (XEN)    0000000000000002 00000088c9ba9534 0000000000000000 
> > ffff830256227e60
> >     (XEN)    ffff82d04022ee53 ffff82d0405c3280 ffff8302963e1320 
> > ffff8302515c0fc0
> >     (XEN)    ffff830256227ea0 ffff82d04022f73f ffff830256227e80 
> > ffff82d0405c9f00
> >     (XEN)    ffffffffffffffff ffff82d0405c9f00 ffff830256227fff 
> > 0000000000000000
> >     (XEN)    ffff830256227ed8 ffff82d04022d26c ffff830251745000 
> > 0000000000000000
> >     (XEN)    0000000000000000 ffff830256227fff 0000000000000000 
> > ffff830256227ee8
> >     (XEN)    ffff82d04022d2ff 00007cfda9dd80e7 ffff82d0402f03c6 
> > ffff88810c005c00
> >     (XEN)    0000000000000031 0000000000000100 00000000fffffe00 
> > 0000000000000031
> >     (XEN)    0000000000000031 ffffffff82d45d28 0000000000000e2e 
> > 0000000000000000
> >     (XEN)    0000000000000032 00000000ffffef31 0000000000000000 
> > ffff88812244a700
> >     (XEN)    0000000000000005 ffff88812244a780 000000fa00000000 
> > ffffffff818db55f
> >     (XEN)    000000000000e033 0000000000000246 ffffc900409b7c50 
> > 000000000000e02b
> >     (XEN)    0000000000000000 0000000000000000 0000000000000000 
> > 0000000000000000
> >     (XEN)    0000e01000000000 ffff830251745000 0000000000000000 
> > 0000000000372660
> >     (XEN)    0000000000000000 800000025620b002 000e030300000001 
> > 0000000000000000
> >     (XEN) Xen call trace:
> >     (XEN)    [<ffff82d0402434bf>] R credit.c#csched_tick+0x2d4/0x494
> >     (XEN)    [<ffff82d04022ee53>] F timer.c#execute_timer+0x45/0x5c
> >     (XEN)    [<ffff82d04022f73f>] F timer.c#timer_softirq_action+0x71/0x278
> >     (XEN)    [<ffff82d04022d26c>] F softirq.c#__do_softirq+0x94/0xbe
> >     (XEN)    [<ffff82d04022d2ff>] F do_softirq+0x13/0x15
> >     (XEN)    [<ffff82d0402f03c6>] F x86_64/entry.S#process_softirqs+0x6/0x20
> 
> ... here the only references are to credit.c, i.e. Credit1 code.

Wrong paste, sorry, this is the credit2 one:

(XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at 
common/sched/credit2.c:2274
(XEN) ----[ Xen-4.17-unstable  x86_64  debug=y  Tainted:   C    ]----
(XEN) CPU:    10
(XEN) RIP:    e008:[<ffff82d040247a4d>] credit2.c#csched2_unit_wake+0x152/0x154
(XEN) RFLAGS: 0000000000010083   CONTEXT: hypervisor (d0v0)
(XEN) rax: ffff830251778230   rbx: ffff830251768cb0   rcx: 00000032111d6000
(XEN) rdx: ffff8302515c1eb0   rsi: 0000000000000006   rdi: ffff830251769000
(XEN) rbp: ffff8302515cfd90   rsp: ffff8302515cfd70   r8:  ffff830251769000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: ffff830251768dd0   r13: ffff8302515c1d00   r14: 0000000000000006
(XEN) r15: ffff82d0405ddb40   cr0: 0000000080050033   cr4: 0000000000372660
(XEN) cr3: 000000022f2a1000   cr2: ffff8881012738e0
(XEN) fsb: 0000744bf6a0db80   gsb: ffff888255600000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
(XEN) Xen code around <ffff82d040247a4d> 
(credit2.c#csched2_unit_wake+0x152/0x154):
(XEN)  df e8 6f bf ff ff eb ad <0f> 0b f3 0f 1e fa 55 48 89 e5 41 57 41 56 41 55
(XEN) Xen stack trace from rsp=ffff8302515cfd70:
(XEN)    ffff83025174b000 ffff830251768cb0 ffff830251778270 ffff82d0405c4298
(XEN)    ffff8302515cfdd8 ffff82d04024fcb8 0000000000000202 ffff830251778270
(XEN)    ffff83025174b000 0000000000000001 ffff830251769018 0000000000000000
(XEN)    0000000000000000 ffff8302515cfe48 ffff82d04020a8c9 ffff8882556aedc0
(XEN)    0000000000000003 00001910537e623e 0000000b988f78a6 0000000059d4a716
(XEN)    00001901f30fa41e 0000000217f96af6 0000000000000000 ffff83025174b000
(XEN)    ffff830251756000 0000000000000002 0000000000000001 ffff8302515cfe70
(XEN)    ffff82d0402f7968 ffff830251756000 ffff8302515cfef8 0000000000000018
(XEN)    ffff8302515cfee8 ffff82d0402ec6de 0000000000000000 ffffffff82f157e0
(XEN)    0000000000000000 0000000000000000 ffff8302515cfef8 0000000000000000
(XEN)    0000000000000000 ffff8302515cffff ffff830251756000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 00007cfdaea300e7
(XEN)    ffff82d0402012bd 0000000000000000 ffffffff82c51120 ffff88810036cf00
(XEN)    0000000000000002 000000000001e120 0000000000000002 0000000000000246
(XEN)    ffffffff82f157e0 0000000000000001 0000000000000000 0000000000000018
(XEN)    ffffffff81e4a30a 0000000000000000 0000000000000002 0000000000000001
(XEN)    0000010000000000 ffffffff81e4a30a 000000000000e033 0000000000000246
(XEN)    ffffc9004aef7c18 000000000000e02b fb5ee398d214b10c eb5ef398c214a10c
(XEN)    eb56f390c21ca104 ebd6f310c29ca184 0000e0100000000a ffff830251756000
(XEN)    0000003211016000 0000000000372660 0000000000000000 80000002963e1002
(XEN) Xen call trace:
(XEN)    [<ffff82d040247a4d>] R credit2.c#csched2_unit_wake+0x152/0x154
(XEN)    [<ffff82d04024fcb8>] F vcpu_wake+0xfd/0x267
(XEN)    [<ffff82d04020a8c9>] F common_vcpu_op+0x178/0x5d1
(XEN)    [<ffff82d0402f7968>] F do_vcpu_op+0x69/0x226
(XEN)    [<ffff82d0402ec6de>] F pv_hypercall+0x575/0x657
(XEN)    [<ffff82d0402012bd>] F lstar_enter+0x13d/0x150
(XEN)
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 10:
(XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at 
common/sched/credit2.c:2274
(XEN) ****************************************


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Attachment: signature.asc
Description: PGP signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.