Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer
On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:
> On Tue, Sep 29, 2020 at 05:27:48PM +0200, Jürgen Groß wrote:
> > On 29.09.20 17:16, Marek Marczykowski-Górecki wrote:
> > > On Tue, Sep 29, 2020 at 05:07:11PM +0200, Jürgen Groß wrote:
> > > > On 29.09.20 16:27, Marek Marczykowski-Górecki wrote:
> > > > > On Mon, Mar 23, 2020 at 01:09:49AM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > On Thu, Mar 19, 2020 at 01:28:10AM +0100, Dario Faggioli wrote:
> > > > > > > [Adding Juergen]
> > > > > > >
> > > > > > > On Wed, 2020-03-18 at 23:10 +0100, Marek Marczykowski-Górecki wrote:
> > > > > > > > On Wed, Mar 18, 2020 at 02:50:52PM +0000, Andrew Cooper wrote:
> > > > > > > > > On 18/03/2020 14:16, Marek Marczykowski-Górecki wrote:
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > In my test setup (inside KVM with nested virt enabled), I rather
> > > > > > > > > > frequently get a Xen crash on resume from S3. Full message below.
> > > > > > > > > >
> > > > > > > > > > This is Xen 4.13.0, with some patches, including "sched: fix resuming
> > > > > > > > > > from S3 with smt=0".
> > > > > > > > > >
> > > > > > > > > > Contrary to the previous issue, this one does not always happen - I
> > > > > > > > > > would say in about 40% of cases on this setup, but very rarely on a
> > > > > > > > > > physical setup.
> > > > > > > > > >
> > > > > > > > > > This is _without_ core scheduling enabled, and also with smt=off.
> > > > > > > > > >
> > > > > > > > > > Do you think it would be any different on xen-unstable? I can try, but
> > > > > > > > > > it isn't trivial in this setup, so I'd ask first.
> > > > > > > > > >
> > > > > > > > > Well, Juergen has fixed quite a few issues.
> > > > > > >
> > > > > > > Most of them were triggering with core-scheduling enabled, and I don't
> > > > > > > recall any of them which looked similar or related to this.
> > > > > > >
> > > > > > > Still, it's possible that the same issue causes different symptoms, and
> > > > > > > hence that maybe one of the patches would fix this too.
> > > > > >
> > > > > > I've tested on master (d094e95fb7c), and reproduced exactly the same crash
> > > > > > (pasted below for completeness).
> > > > > > But there is more: additionally, in most (all?) cases after resume I've got
> > > > > > a soft lockup in Linux dom0 in smp_call_function_single() - see below. It
> > > > > > didn't happen before and the only change was Xen 4.13 -> master.
> > > > > >
> > > > > > Xen crash:
> > > > > >
> > > > > > (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at credit2.c:2133
> > > > >
> > > > > Juergen, any idea about this one? This is also happening on the current
> > > > > stable-4.14 (28855ebcdbfa).
> > > > >
> > > >
> > > > Oh, sorry I didn't come back to this issue.
> > > >
> > > > I suspect this is related to stop_machine_run() being called during
> > > > suspend(), as I'm seeing very sporadic issues when offlining and then
> > > > onlining cpus with core scheduling being active (it seems as if the
> > > > dom0 vcpu doing the cpu online activity sometimes is using an old
> > > > vcpu state).
> > >
> > > Note this is a default Xen 4.14 start, so core scheduling is _not_ active:
> >
> > The similarity in the two failure cases is that multiple cpus are
> > affected by the operations during stop_machine_run().
> >
> > >
> > > (XEN) Brought up 2 CPUs
> > > (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
> > > (XEN) Adding cpu 0 to runqueue 0
> > > (XEN)  First cpu on runqueue, activating
> > > (XEN) Adding cpu 1 to runqueue 1
> > > (XEN)  First cpu on runqueue, activating
> > >
> > > > I wasn't able to catch the real problem despite having tried lots
> > > > of approaches using debug patches.
> > > >
> > > > Recently I suspected the whole problem could be somehow related to
> > > > RCU handling, as stop_machine_run() is relying on tasklets which are
> > > > executing in idle context, and RCU handling is done in idle context,
> > > > too. So there might be some kind of use after free scenario in case
> > > > some memory is freed via RCU despite it still being used by a tasklet.
> > >
> > > That sounds plausible, even though I don't really know this area of Xen.
> > >
> > > > I "just" need to find some time to verify this suspicion. Any help doing
> > > > this would be appreciated. :-)
> > >
> > > I do have a setup where I can easily-ish reproduce the issue. If there
> > > is some debug patch you'd like me to try, I can do that.
> >
> > Thanks. I might come back to that offer as you are seeing a crash which
> > will be much easier to analyze. Catching my error case is much harder as
> > it surfaces some time after the real problem in a non destructive way
> > (usually I'm seeing a failure to load a library in the program which
> > just did its job via exactly the library claiming not being loadable).
>
> Hi,
>
> I'm resurrecting this thread as it was recently mentioned elsewhere. I
> can still reproduce the issue on the recent staging branch (9dc687f155).
>
> It fails after the first resume (not always, but frequently enough to
> debug it). At least one guest needs to be running - with just (PV) dom0
> the crash doesn't happen (at least for the ~8 times in a row I tried).
> If the first resume works, the second will (almost?) always fail, but
> with different symptoms - dom0 kernel lockups (at least some of its
> vcpus). I haven't debugged this one yet at all.
>
> Any help will be appreciated, I can apply some debug patches, change
> configuration etc.

This still happens on 4.14.3. Maybe it is related to freeing percpu
areas, as it caused other issues with suspend too? Just a thought...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

Attachment: signature.asc