Xen project Mailing List

Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13

To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Juergen Gross <jgross@xxxxxxx>

From: Jürgen Groß <jgross@xxxxxxxx>

Date: Tue, 29 Sep 2020 17:07:11 +0200

Cc: Dario Faggioli <dfaggioli@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Tue, 29 Sep 2020 15:07:39 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 29.09.20 16:27, Marek Marczykowski-Górecki wrote:

On Mon, Mar 23, 2020 at 01:09:49AM +0100, Marek Marczykowski-Górecki wrote:

On Thu, Mar 19, 2020 at 01:28:10AM +0100, Dario Faggioli wrote:

[Adding Juergen]

On Wed, 2020-03-18 at 23:10 +0100, Marek Marczykowski-Górecki wrote:

On Wed, Mar 18, 2020 at 02:50:52PM +0000, Andrew Cooper wrote:

On 18/03/2020 14:16, Marek Marczykowski-Górecki wrote:

Hi,

In my test setup (inside KVM with nested virt enabled), I rather
frequently get a Xen crash on resume from S3. Full message below.

This is Xen 4.13.0, with some patches, including "sched: fix resuming
from S3 with smt=0".

Contrary to the previous issue, this one does not always happen - I
would say in about 40% of cases on this setup, but very rarely on a
physical setup.

This is _without_ core scheduling enabled, and also with smt=off.

Do you think it would be any different on xen-unstable? I can try, but
it isn't trivial in this setup, so I'd ask first.

Well, Juergen has fixed quite a few issues.

Most of them were triggering with core-scheduling enabled, and I don't
recall any of them which looked similar or related to this.

Still, it's possible that the same issue causes different symptoms, and
hence that maybe one of the patches would fix this too.


I've tested on master (d094e95fb7c), and reproduced exactly the same crash
(pasted below for completeness).
But there is more: additionally, in most (all?) cases after resume I've got
a soft lockup in Linux dom0 in smp_call_function_single() - see below. It
didn't happen before and the only change was Xen 4.13 -> master.

Xen crash:

(XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at credit2.c:2133


Juergen, any idea about this one? This is also happening on the current
stable-4.14 (28855ebcdbfa).

Oh, sorry I didn't come back to this issue.

I suspect this is related to stop_machine_run() being called during
suspend(), as I'm seeing very sporadic issues when offlining and then
onlining cpus with core scheduling being active (it seems as if the
dom0 vcpu doing the cpu online activity sometimes is using an old
vcpu state).

I wasn't able to catch the real problem despite having tried lots of
approaches using debug patches.

Recently I suspected the whole problem could be somehow related to
RCU handling, as stop_machine_run() is relying on tasklets which are
executing in idle context, and RCU handling is done in idle context,
too. So there might be some kind of use-after-free scenario in case
some memory is freed via RCU despite it still being used by a tasklet.

I "just" need to find some time to verify this suspicion. Any help doing
this would be appreciated. :-)

Juergen
