
Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13



On Tue, Sep 29, 2020 at 05:07:11PM +0200, Jürgen Groß wrote:
> On 29.09.20 16:27, Marek Marczykowski-Górecki wrote:
> > On Mon, Mar 23, 2020 at 01:09:49AM +0100, Marek Marczykowski-Górecki wrote:
> > > On Thu, Mar 19, 2020 at 01:28:10AM +0100, Dario Faggioli wrote:
> > > > [Adding Juergen]
> > > > 
> > > > On Wed, 2020-03-18 at 23:10 +0100, Marek Marczykowski-Górecki wrote:
> > > > > On Wed, Mar 18, 2020 at 02:50:52PM +0000, Andrew Cooper wrote:
> > > > > > On 18/03/2020 14:16, Marek Marczykowski-Górecki wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > > In my test setup (inside KVM with nested virt enabled), I rather
> > > > > > > > frequently get a Xen crash on resume from S3. Full message below.
> > > > > > > 
> > > > > > > > This is Xen 4.13.0, with some patches, including
> > > > > > > > "sched: fix resuming from S3 with smt=0".
> > > > > > > 
> > > > > > > > Contrary to the previous issue, this one does not always happen.
> > > > > > > > I would say it happens in about 40% of cases on this setup, but
> > > > > > > > very rarely on a physical setup.
> > > > > > > 
> > > > > > > This is _without_ core scheduling enabled, and also with smt=off.
> > > > > > > 
> > > > > > > > Do you think it would be any different on xen-unstable? I can
> > > > > > > > try, but it isn't trivial in this setup, so I'd ask first.
> > > > > > > 
> > > > Well, Juergen has fixed quite a few issues.
> > > > 
> > > > Most of them were triggering with core-scheduling enabled, and I don't
> > > > recall any of them which looked similar or related to this.
> > > > 
> > > > Still, it's possible that the same issue causes different symptoms, and
> > > > hence that maybe one of the patches would fix this too.
> > > 
> > > I've tested on master (d094e95fb7c), and reproduced exactly the same crash
> > > (pasted below for completeness).
> > > But there is more: additionally, in most (all?) cases after resume I got
> > > a soft lockup in Linux dom0 in smp_call_function_single() - see below. It
> > > didn't happen before and the only change was Xen 4.13 -> master.
> > > 
> > > Xen crash:
> > > 
> > > (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at credit2.c:2133
> > 
> > Juergen, any idea about this one? This is also happening on the current
> > stable-4.14 (28855ebcdbfa).
> > 
> 
> Oh, sorry I didn't come back to this issue.
> 
> I suspect this is related to stop_machine_run() being called during
> suspend(), as I'm seeing very sporadic issues when offlining and then
> onlining cpus with core scheduling being active (it seems as if the
> dom0 vcpu doing the cpu online activity sometimes is using an old
> vcpu state).

Note that this is a default Xen 4.14 boot, so core scheduling is _not_ active:

    (XEN) Brought up 2 CPUs
    (XEN) Scheduling granularity: cpu, 1 CPU per sched-resource
    (XEN) Adding cpu 0 to runqueue 0
    (XEN)  First cpu on runqueue, activating
    (XEN) Adding cpu 1 to runqueue 1
    (XEN)  First cpu on runqueue, activating

> I wasn't able to catch the real problem despite having tried lots
> of approaches using debug patches.
> 
> Recently I suspected the whole problem could be somehow related to
> RCU handling, as stop_machine_run() is relying on tasklets which are
> executing in idle context, and RCU handling is done in idle context,
> too. So there might be some kind of use after free scenario in case
> some memory is freed via RCU despite it still being used by a tasklet.

That sounds plausible, even though I don't really know this area of Xen.
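
If I read the suspicion correctly, the ordering in question can be modelled
with a toy sketch like the one below. This is plain C and not Xen code: the
names (struct obj, defer_free, rcu_idle_work, tasklet_work, simulate) are all
made up for illustration, and a "freed" flag stands in for the actual free so
that the model itself has no undefined behaviour. It only shows that, if both
the deferred free and the tasklet run from idle context, the outcome depends
on which of them runs first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy object standing in for some scheduler data; not a Xen structure. */
struct obj {
    bool freed;   /* set once the deferred free has run */
    int value;
};

/* Hypothetical one-slot deferred-free queue, standing in for an RCU callback. */
static struct obj *pending_free;

/* Owner gives up the object; the actual free is deferred to idle context. */
static void defer_free(struct obj *o)
{
    pending_free = o;
}

/* Idle-context work: process the deferred free, if any. */
static void rcu_idle_work(void)
{
    if (pending_free != NULL) {
        pending_free->freed = true;
        pending_free = NULL;
    }
}

/*
 * Tasklet body, also run from idle context, using a pointer captured when
 * the tasklet was scheduled. Returns true if the object was already freed.
 */
static bool tasklet_work(const struct obj *captured)
{
    assert(captured != NULL);
    return captured->freed;
}

/* Returns true if the tasklet observed an already-freed object. */
bool simulate(bool rcu_runs_first)
{
    struct obj o = { .freed = false, .value = 42 };
    struct obj *captured = &o;   /* pointer held by the pending tasklet */
    bool stale;

    defer_free(&o);

    if (rcu_runs_first) {
        rcu_idle_work();
        stale = tasklet_work(captured);
    } else {
        stale = tasklet_work(captured);
        rcu_idle_work();
    }
    return stale;
}
```

In this model simulate(true) reports a stale access and simulate(false) does
not, which would at least be consistent with a crash that shows up only some
of the time. Whether Xen can actually reach the first ordering is exactly the
part that still needs verifying.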

> I "just" need to find some time to verify this suspicion. Any help doing
> this would be appreciated. :-)

I do have a setup where I can easily-ish reproduce the issue. If there
is some debug patch you'd like me to try, I can do that.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
