
Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer


  • To: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Tue, 20 Sep 2022 16:30:41 +0200
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>, Jürgen Groß <jgross@xxxxxxxx>
  • Delivery-date: Tue, 20 Sep 2022 14:30:49 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 20.09.2022 12:22, Marek Marczykowski-Górecki wrote:
> On Mon, Aug 22, 2022 at 12:00:27PM +0200, Marek Marczykowski-Górecki wrote:
>> On Mon, Aug 22, 2022 at 11:53:50AM +0200, Jan Beulich wrote:
>>> On 21.08.2022 18:14, Marek Marczykowski-Górecki wrote:
>>>> On Sat, Oct 09, 2021 at 06:28:17PM +0200, Marek Marczykowski-Górecki wrote:
>>>>> On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:
>>>>>> I'm resurrecting this thread as it was recently mentioned elsewhere. I
>>>>>> can still reproduce the issue on the recent staging branch (9dc687f155).
>>>>>>
>>>>>> It fails after the first resume (not always, but frequently enough to
>>>>>> debug it). At least one guest needs to be running - with just (PV) dom0
>>>>>> the crash doesn't happen (at least not in the ~8 attempts in a row I tried).
>>>>>> If the first resume works, the second will (almost?) always fail, but
>>>>>> with different symptoms - dom0 kernel lockups (of at least some of its
>>>>>> vcpus). I haven't debugged this one at all yet.
>>>>>>
>>>>>> Any help will be appreciated; I can apply debug patches, change the
>>>>>> configuration, etc.
>>>>>
>>>>> This still happens on 4.14.3. Maybe it is related to freeing the percpu
>>>>> areas, as that has caused other issues with suspend too? Just a thought...
>>>>
>>>> I have reproduced this on current staging(*), and I can reproduce it
>>>> reliably. I also got what I believe is a closely related crash with the
>>>> credit1 scheduler.
>>>>
>>>> (*) It isn't plain staging, it's one with my xhci console patches on
>>>> top, including an attempt to make it survive S3. I believe the only
>>>> relevant part there is sticking a set_timer() call into the console
>>>> resume path (or just having a timer with a rather short delay
>>>> registered). The actual tree is at
>>>> https://github.com/marmarek/xen/tree/master-xue2-debug, including
>>>> quite a lot of debug prints and debug hacks.
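>>>>
>>>> Roughly, the timer part looks like this (a simplified sketch, not the
>>>> exact code from that tree; the xhci_* names are just placeholders):
>>>>
>>>>     #include <xen/time.h>
>>>>     #include <xen/timer.h>
>>>>
>>>>     static struct timer xhci_poll_timer;
>>>>
>>>>     /* init_timer(&xhci_poll_timer, xhci_poll_timer_fn, NULL, 0) is done
>>>>      * once during console setup. */
>>>>     static void xhci_poll_timer_fn(void *unused)
>>>>     {
>>>>         /* ... poll the console hardware ... */
>>>>         set_timer(&xhci_poll_timer, NOW() + MILLISECS(1));
>>>>     }
>>>>
>>>>     static void xhci_console_resume(void)
>>>>     {
>>>>         /* Re-arming a short-delay timer here is, I believe, the only part
>>>>          * relevant for triggering the crash. */
>>>>         set_timer(&xhci_poll_timer, NOW() + MILLISECS(1));
>>>>     }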
>>>>
>>>> Specific crash with credit2:
>>
>> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
>> (XEN) ----[ Xen-4.17-unstable  x86_64  debug=y  Tainted:   C    ]----
>> (XEN) CPU:    10
>> (XEN) RIP:    e008:[<ffff82d040247a4d>] credit2.c#csched2_unit_wake+0x152/0x154
>> (XEN) RFLAGS: 0000000000010083   CONTEXT: hypervisor (d0v0)
>> (XEN) rax: ffff830251778230   rbx: ffff830251768cb0   rcx: 00000032111d6000
>> (XEN) rdx: ffff8302515c1eb0   rsi: 0000000000000006   rdi: ffff830251769000
>> (XEN) rbp: ffff8302515cfd90   rsp: ffff8302515cfd70   r8:  ffff830251769000
>> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
>> (XEN) r12: ffff830251768dd0   r13: ffff8302515c1d00   r14: 0000000000000006
>> (XEN) r15: ffff82d0405ddb40   cr0: 0000000080050033   cr4: 0000000000372660
>> (XEN) cr3: 000000022f2a1000   cr2: ffff8881012738e0
>> (XEN) fsb: 0000744bf6a0db80   gsb: ffff888255600000   gss: 0000000000000000
>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>> (XEN) Xen code around <ffff82d040247a4d> (credit2.c#csched2_unit_wake+0x152/0x154):
>> (XEN)  df e8 6f bf ff ff eb ad <0f> 0b f3 0f 1e fa 55 48 89 e5 41 57 41 56 41 55
>> (XEN) Xen stack trace from rsp=ffff8302515cfd70:
>> (XEN)    ffff83025174b000 ffff830251768cb0 ffff830251778270 ffff82d0405c4298
>> (XEN)    ffff8302515cfdd8 ffff82d04024fcb8 0000000000000202 ffff830251778270
>> (XEN)    ffff83025174b000 0000000000000001 ffff830251769018 0000000000000000
>> (XEN)    0000000000000000 ffff8302515cfe48 ffff82d04020a8c9 ffff8882556aedc0
>> (XEN)    0000000000000003 00001910537e623e 0000000b988f78a6 0000000059d4a716
>> (XEN)    00001901f30fa41e 0000000217f96af6 0000000000000000 ffff83025174b000
>> (XEN)    ffff830251756000 0000000000000002 0000000000000001 ffff8302515cfe70
>> (XEN)    ffff82d0402f7968 ffff830251756000 ffff8302515cfef8 0000000000000018
>> (XEN)    ffff8302515cfee8 ffff82d0402ec6de 0000000000000000 ffffffff82f157e0
>> (XEN)    0000000000000000 0000000000000000 ffff8302515cfef8 0000000000000000
>> (XEN)    0000000000000000 ffff8302515cffff ffff830251756000 0000000000000000
>> (XEN)    0000000000000000 0000000000000000 0000000000000000 00007cfdaea300e7
>> (XEN)    ffff82d0402012bd 0000000000000000 ffffffff82c51120 ffff88810036cf00
>> (XEN)    0000000000000002 000000000001e120 0000000000000002 0000000000000246
>> (XEN)    ffffffff82f157e0 0000000000000001 0000000000000000 0000000000000018
>> (XEN)    ffffffff81e4a30a 0000000000000000 0000000000000002 0000000000000001
>> (XEN)    0000010000000000 ffffffff81e4a30a 000000000000e033 0000000000000246
>> (XEN)    ffffc9004aef7c18 000000000000e02b fb5ee398d214b10c eb5ef398c214a10c
>> (XEN)    eb56f390c21ca104 ebd6f310c29ca184 0000e0100000000a ffff830251756000
>> (XEN)    0000003211016000 0000000000372660 0000000000000000 80000002963e1002
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040247a4d>] R credit2.c#csched2_unit_wake+0x152/0x154
>> (XEN)    [<ffff82d04024fcb8>] F vcpu_wake+0xfd/0x267
>> (XEN)    [<ffff82d04020a8c9>] F common_vcpu_op+0x178/0x5d1
>> (XEN)    [<ffff82d0402f7968>] F do_vcpu_op+0x69/0x226
>> (XEN)    [<ffff82d0402ec6de>] F pv_hypercall+0x575/0x657
>> (XEN)    [<ffff82d0402012bd>] F lstar_enter+0x13d/0x150
>> (XEN)
>> (XEN)
>> (XEN) ****************************************
>> (XEN) Panic on CPU 10:
>> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
>> (XEN) ****************************************
> 
> Ok, I think I figured it out!
> 
> I added a function that verifies the runqueues of all the sched units, and
> called it from basically every other line on the resume path. The debug
> function (in case anybody is interested):
> 
>     void verify_sched_units(void)
>     {
>         struct domain *d;
>         const struct sched_unit *unit;
> 
>         for_each_domain ( d )
>         {
>             for_each_sched_unit ( d, unit )
>             {
>                 if ( c2rqd(sched_unit_master(unit)) != csched2_unit(unit)->rqd )
>                 {
>                     printk(XENLOG_WARNING "d%d sched unit %d: rq=%d, unit master %d, rq=%d\n",
>                            d->domain_id, unit->unit_id,
>                            csched2_unit(unit)->rqd ? csched2_unit(unit)->rqd->id : -1,
>                            sched_unit_master(unit),
>                            c2rqd(sched_unit_master(unit))->id);
>                     WARN_ON(1);
>                 }
>             }
>         }
>     }
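> 
> For reference, the checks were sprinkled along these lines (illustrative
> only; the function uses credit2-internal helpers, so calling it from
> common code needs a small hack to expose it):
> 
>     /* After each step of the resume path that touches sched resources: */
>     verify_sched_units();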
> 
> It appears that restore_vcpu_affinity() is responsible, specifically
> this part:
> 
> 1216         /*
> 1217          * Re-assign the initial processor as after resume we have no
> 1218          * guarantee the old processor has come back to life again.
> 1219          *
> 1220          * Therefore, here, before actually unpausing the domains, we should
> 1221          * set v->processor of each of their vCPUs to something that will
> 1222          * make sense for the scheduler of the cpupool in which they are in.
> 1223          */
> ...
> 1249         res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
> 1250         sched_set_res(unit, res);
> 1251 
> 1252         spin_unlock_irq(lock);
> 1253 
> 1254         /* v->processor might have changed, so reacquire the lock. */
> 1255         lock = unit_schedule_lock_irq(unit);
> 1256         res = sched_pick_resource(unit_scheduler(unit), unit);
> 1257         sched_set_res(unit, res);
> 1258         spin_unlock_irq(lock);
> 1259 
> 1260         if ( old_cpu != sched_unit_master(unit) )
> 1261             sched_move_irqs(unit);
> 
> It calls sched_set_res() directly, which assigns the sched resource, but
> does _not_ adjust runqueues (if the new pcpu happens to be assigned to a
> different runqueue than the previous pcpu's).
> 
> I have two (non-exclusive) ideas here:
> 1. If old_cpu is actually still available, do not move the unit at all.
> 2. Use sched_migrate() instead of sched_set_res().
> 
> Here is the patch that fixes it for me:
> ---8<---
> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> index 83455fbde1c8..dcf202d8b307 100644
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -1246,19 +1246,29 @@ void restore_vcpu_affinity(struct domain *d)
>              }
>          }
>  
> -        res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
> +        /* Prefer old cpu if available. */
> +        if ( cpumask_test_cpu(old_cpu, cpumask_scratch_cpu(cpu)) )
> +            res = get_sched_res(old_cpu);
> +        else
> +            res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
>          sched_set_res(unit, res);
>  
>          spin_unlock_irq(lock);
>  
> -        /* v->processor might have changed, so reacquire the lock. */
> -        lock = unit_schedule_lock_irq(unit);
> -        res = sched_pick_resource(unit_scheduler(unit), unit);
> -        sched_set_res(unit, res);
> -        spin_unlock_irq(lock);
> -
> +        /*
> +         * If a different cpu was chosen, it was a random pick, so let the
> +         * scheduler make a proper decision.
> +         */
>          if ( old_cpu != sched_unit_master(unit) )
> +        {
> +            /* v->processor might have changed, so reacquire the lock. */
> +            lock = unit_schedule_lock_irq(unit);
> +            res = sched_pick_resource(unit_scheduler(unit), unit);
> +            sched_migrate(unit_scheduler(unit), unit, res->master_cpu);
> +            spin_unlock_irq(lock);
> +
>              sched_move_irqs(unit);
> +        }
>      }
>  
>      rcu_read_unlock(&sched_res_rculock);
> ---8<---
> 
> I have several doubts here:
> 
> 1. If old_cpu is available, is sched_set_res() needed at all?
> 2. Should both calls be changed to sched_migrate()? Currently I changed
>    only the second one, in case the scheduler could get confused by
>    old_cpu not being available anymore.
> 3. Are there any extra locking requirements for sched_migrate() at this
>    stage? The long comment above sched_unit_migrate_start() suggests
>    there might be, but I'm not sure if that's really the case during
>    resume.
> 4. Related to the above - should thaw_domains() be modified to call
>    restore_vcpu_affinity() for all domains first, and only unpause them
>    later (rough sketch below)? That could reduce the locking requirements,
>    I guess.
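> 
> A rough sketch of what I mean in 4. (untested; this assumes thaw_domains()
> keeps its current shape of a single loop doing both steps per domain):
> 
>     static void thaw_domains(void)
>     {
>         struct domain *d;
> 
>         rcu_read_lock(&domlist_read_lock);
> 
>         /* First fix up scheduling resources for all domains... */
>         for_each_domain ( d )
>             restore_vcpu_affinity(d);
> 
>         /* ...and only then let any of them run again. */
>         for_each_domain ( d )
>             domain_unpause(d);
> 
>         rcu_read_unlock(&domlist_read_lock);
>     }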

All questions primarily to the scheduler maintainers - forwarding
accordingly.

Jan