
Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer



On Mon, Aug 22, 2022 at 12:00:27PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Aug 22, 2022 at 11:53:50AM +0200, Jan Beulich wrote:
> > On 21.08.2022 18:14, Marek Marczykowski-Górecki wrote:
> > > On Sat, Oct 09, 2021 at 06:28:17PM +0200, Marek Marczykowski-Górecki 
> > > wrote:
> > >> On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki 
> > >> wrote:
> > >>> I'm resurrecting this thread as it was recently mentioned elsewhere. I
> > >>> can still reproduce the issue on the recent staging branch (9dc687f155).
> > >>>
> > >>> It fails after the first resume (not always, but frequent enough to
> > >>> debug it). At least one guest needs to be running - with just (PV) dom0
> > >>> the crash doesn't happen (at least for the ~8 times in a row I tried).
> > >>> If the first resume works, the second (almost?) always fails, but
> > >>> with different symptoms - dom0 kernel lockups (of at least some of
> > >>> its vcpus). I haven't debugged this one at all yet.
> > >>>
> > >>> Any help will be appreciated, I can apply some debug patches, change
> > >>> configuration etc.
> > >>
> > >> This still happens on 4.14.3. Maybe it is related to freeing percpu
> > >> areas, as it caused other issues with suspend too? Just a thought...
> > > 
> > > I have reproduced this on current staging(*), and I can reproduce it
> > > reliably. I also got (I believe) a closely related crash with the
> > > credit1 scheduler.
> > > 
> > > (*) It isn't plain staging, it's one with my xhci console patches on
> > > top, including an attempt to make it survive S3. I believe the only
> > > relevant part there is sticking set_timer() into the console resume
> > > path (or just having a timer with a rather short delay registered).
> > > The actual tree is at https://github.com/marmarek/xen/tree/master-xue2-debug,
> > > including quite a lot of debug prints and debug hacks.
> > > 
> > > Specific crash with credit2:
> 
> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
> (XEN) ----[ Xen-4.17-unstable  x86_64  debug=y  Tainted:   C    ]----
> (XEN) CPU:    10
> (XEN) RIP:    e008:[<ffff82d040247a4d>] credit2.c#csched2_unit_wake+0x152/0x154
> (XEN) RFLAGS: 0000000000010083   CONTEXT: hypervisor (d0v0)
> (XEN) rax: ffff830251778230   rbx: ffff830251768cb0   rcx: 00000032111d6000
> (XEN) rdx: ffff8302515c1eb0   rsi: 0000000000000006   rdi: ffff830251769000
> (XEN) rbp: ffff8302515cfd90   rsp: ffff8302515cfd70   r8:  ffff830251769000
> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
> (XEN) r12: ffff830251768dd0   r13: ffff8302515c1d00   r14: 0000000000000006
> (XEN) r15: ffff82d0405ddb40   cr0: 0000000080050033   cr4: 0000000000372660
> (XEN) cr3: 000000022f2a1000   cr2: ffff8881012738e0
> (XEN) fsb: 0000744bf6a0db80   gsb: ffff888255600000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> (XEN) Xen code around <ffff82d040247a4d> (credit2.c#csched2_unit_wake+0x152/0x154):
> (XEN)  df e8 6f bf ff ff eb ad <0f> 0b f3 0f 1e fa 55 48 89 e5 41 57 41 56 41 55
> (XEN) Xen stack trace from rsp=ffff8302515cfd70:
> (XEN)    ffff83025174b000 ffff830251768cb0 ffff830251778270 ffff82d0405c4298
> (XEN)    ffff8302515cfdd8 ffff82d04024fcb8 0000000000000202 ffff830251778270
> (XEN)    ffff83025174b000 0000000000000001 ffff830251769018 0000000000000000
> (XEN)    0000000000000000 ffff8302515cfe48 ffff82d04020a8c9 ffff8882556aedc0
> (XEN)    0000000000000003 00001910537e623e 0000000b988f78a6 0000000059d4a716
> (XEN)    00001901f30fa41e 0000000217f96af6 0000000000000000 ffff83025174b000
> (XEN)    ffff830251756000 0000000000000002 0000000000000001 ffff8302515cfe70
> (XEN)    ffff82d0402f7968 ffff830251756000 ffff8302515cfef8 0000000000000018
> (XEN)    ffff8302515cfee8 ffff82d0402ec6de 0000000000000000 ffffffff82f157e0
> (XEN)    0000000000000000 0000000000000000 ffff8302515cfef8 0000000000000000
> (XEN)    0000000000000000 ffff8302515cffff ffff830251756000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 0000000000000000 00007cfdaea300e7
> (XEN)    ffff82d0402012bd 0000000000000000 ffffffff82c51120 ffff88810036cf00
> (XEN)    0000000000000002 000000000001e120 0000000000000002 0000000000000246
> (XEN)    ffffffff82f157e0 0000000000000001 0000000000000000 0000000000000018
> (XEN)    ffffffff81e4a30a 0000000000000000 0000000000000002 0000000000000001
> (XEN)    0000010000000000 ffffffff81e4a30a 000000000000e033 0000000000000246
> (XEN)    ffffc9004aef7c18 000000000000e02b fb5ee398d214b10c eb5ef398c214a10c
> (XEN)    eb56f390c21ca104 ebd6f310c29ca184 0000e0100000000a ffff830251756000
> (XEN)    0000003211016000 0000000000372660 0000000000000000 80000002963e1002
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040247a4d>] R credit2.c#csched2_unit_wake+0x152/0x154
> (XEN)    [<ffff82d04024fcb8>] F vcpu_wake+0xfd/0x267
> (XEN)    [<ffff82d04020a8c9>] F common_vcpu_op+0x178/0x5d1
> (XEN)    [<ffff82d0402f7968>] F do_vcpu_op+0x69/0x226
> (XEN)    [<ffff82d0402ec6de>] F pv_hypercall+0x575/0x657
> (XEN)    [<ffff82d0402012bd>] F lstar_enter+0x13d/0x150
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 10:
> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
> (XEN) ****************************************

Ok, I think I figured it out!

I added a function that verifies the run queues of all sched units, and
called it after basically every other line on the resume path. The debug
function (if anybody is interested):

    void verify_sched_units(void)
    {   
        struct domain *d;
        const struct sched_unit *unit;
        
        for_each_domain ( d )
        {
            for_each_sched_unit ( d, unit )
            {
                if ( c2rqd(sched_unit_master(unit)) != csched2_unit(unit)->rqd )
                {
                    printk(XENLOG_WARNING "d%d sched unit %d: rq=%d, unit master %d, rq=%d\n",
                            d->domain_id, unit->unit_id,
                            csched2_unit(unit)->rqd ? csched2_unit(unit)->rqd->id : -1,
                            sched_unit_master(unit),
                            c2rqd(sched_unit_master(unit))->id);
                    WARN_ON(1);
                }
            }
        }
    }

It appears that restore_vcpu_affinity() is responsible, specifically
this part:

1216         /*
1217          * Re-assign the initial processor as after resume we have no
1218          * guarantee the old processor has come back to life again.
1219          *
1220          * Therefore, here, before actually unpausing the domains, we should
1221          * set v->processor of each of their vCPUs to something that will
1222          * make sense for the scheduler of the cpupool in which they are in.
1223          */
...
1249         res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
1250         sched_set_res(unit, res);
1251 
1252         spin_unlock_irq(lock);
1253 
1254         /* v->processor might have changed, so reacquire the lock. */
1255         lock = unit_schedule_lock_irq(unit);
1256         res = sched_pick_resource(unit_scheduler(unit), unit);
1257         sched_set_res(unit, res);
1258         spin_unlock_irq(lock);
1259 
1260         if ( old_cpu != sched_unit_master(unit) )
1261             sched_move_irqs(unit);

It calls sched_set_res() directly, which assigns the sched resource, but
does _not_ adjust the runqueue (if the new pcpu happens to be assigned to
a different runqueue than the previous pcpu was).
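
To make the mismatch concrete, here is a minimal standalone sketch (not
Xen code; every name below is made up for illustration, only loosely
modelled on sched_unit_master(), csched2_unit()->rqd and c2rqd()) of the
two views of a unit's placement getting out of sync, which is essentially
what the credit2 ASSERT catches:

    /*
     * Standalone illustration only -- not Xen code.  Models a per-unit
     * cached runqueue (like csched2_unit(unit)->rqd) next to a runqueue
     * derived from the unit's cpu (like c2rqd(sched_unit_master(unit))),
     * and shows how updating only the cpu leaves the cache stale.
     */
    #include <assert.h>
    #include <stdio.h>

    #define NR_CPUS 4

    /* cpu -> runqueue id; stand-in for the scheduler's cpu-to-runqueue map */
    static const int cpu_to_rq[NR_CPUS] = { 0, 0, 1, 1 };

    struct unit {
        int master_cpu;   /* analogue of sched_unit_master(unit)  */
        int cached_rq;    /* analogue of csched2_unit(unit)->rqd  */
    };

    /* sched_set_res()-like: only reassigns the cpu */
    static void set_res_only(struct unit *u, int cpu)
    {
        u->master_cpu = cpu;
    }

    /* sched_migrate()-like: reassigns the cpu and refreshes the cached rq */
    static void migrate(struct unit *u, int cpu)
    {
        u->master_cpu = cpu;
        u->cached_rq = cpu_to_rq[cpu];
    }

    int main(void)
    {
        struct unit u = { .master_cpu = 0, .cached_rq = cpu_to_rq[0] };

        set_res_only(&u, 2);    /* cpu 2 sits on a different runqueue */
        printf("after set_res_only: derived rq=%d, cached rq=%d\n",
               cpu_to_rq[u.master_cpu], u.cached_rq);
        /* cpu_to_rq[u.master_cpu] == u.cached_rq no longer holds -- the
         * equivalent of the ASSERT firing in csched2_unit_wake() */

        migrate(&u, 2);         /* keeps both views consistent */
        assert(cpu_to_rq[u.master_cpu] == u.cached_rq);
        printf("after migrate:      derived rq=%d, cached rq=%d\n",
               cpu_to_rq[u.master_cpu], u.cached_rq);
        return 0;
    }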

I have two (non-exclusive) ideas here:
1. If old_cpu is actually still available, do not move the unit at all.
2. Use sched_migrate() instead of sched_set_res().

Here is the patch that fixes it for me:
---8<---
diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 83455fbde1c8..dcf202d8b307 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -1246,19 +1246,29 @@ void restore_vcpu_affinity(struct domain *d)
             }
         }
 
-        res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
+        /* Prefer old cpu if available. */
+        if ( cpumask_test_cpu(old_cpu, cpumask_scratch_cpu(cpu)) )
+            res = get_sched_res(old_cpu);
+        else
+            res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
         sched_set_res(unit, res);
 
         spin_unlock_irq(lock);
 
-        /* v->processor might have changed, so reacquire the lock. */
-        lock = unit_schedule_lock_irq(unit);
-        res = sched_pick_resource(unit_scheduler(unit), unit);
-        sched_set_res(unit, res);
-        spin_unlock_irq(lock);
-
+        /*
+         * If a different cpu was chosen, it was chosen at random; let the
+         * scheduler make a proper decision.
+         */
         if ( old_cpu != sched_unit_master(unit) )
+        {
+            /* v->processor might have changed, so reacquire the lock. */
+            lock = unit_schedule_lock_irq(unit);
+            res = sched_pick_resource(unit_scheduler(unit), unit);
+            sched_migrate(unit_scheduler(unit), unit, res->master_cpu);
+            spin_unlock_irq(lock);
+
             sched_move_irqs(unit);
+        }
     }
 
     rcu_read_unlock(&sched_res_rculock);
---8<---

I have several doubts here:

1. If old_cpu is available, is sched_set_res() needed at all?
2. Should both calls be changed to sched_migrate()? Currently I changed
   only the second one, in case the scheduler could be confused by
   old_cpu not being available anymore.
3. Are there any extra locking requirements for sched_migrate() at this
   stage? The long comment above sched_unit_migrate_start() suggests
   there might be, but I'm not sure if that's really the case during
   resume.
4. Related to the above - should thaw_domains() be modified to call
   restore_vcpu_affinity() for all domains first, and only unpause them
   later (rough sketch below)? That could reduce the locking
   requirements, I guess.
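
For doubt 4, what I have in mind is roughly the sketch below. It is based
on my reading of thaw_domains() in xen/arch/x86/acpi/power.c, so treat it
as an illustration of the idea rather than a tested patch; the point is
just splitting the loop into two passes so all affinities are restored
before any domain gets unpaused:

    static void thaw_domains(void)
    {
        struct domain *d;

        rcu_read_lock(&domlist_read_lock);

        /* Pass 1: fix up scheduling placement while everything is still paused. */
        for_each_domain ( d )
            restore_vcpu_affinity(d);

        /* Pass 2: only now let the domains run again. */
        for_each_domain ( d )
            domain_unpause(d);

        rcu_read_unlock(&domlist_read_lock);
    }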

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
