
Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer



On Tue, Sep 20, 2022 at 04:30:41PM +0200, Jan Beulich wrote:
> On 20.09.2022 12:22, Marek Marczykowski-Górecki wrote:
> > On Mon, Aug 22, 2022 at 12:00:27PM +0200, Marek Marczykowski-Górecki wrote:
> >> On Mon, Aug 22, 2022 at 11:53:50AM +0200, Jan Beulich wrote:
> >>> On 21.08.2022 18:14, Marek Marczykowski-Górecki wrote:
> >>>> On Sat, Oct 09, 2021 at 06:28:17PM +0200, Marek Marczykowski-Górecki wrote:
> >>>>> On Sun, Jan 31, 2021 at 03:15:30AM +0100, Marek Marczykowski-Górecki wrote:
> >>>>>> I'm resurrecting this thread as it was recently mentioned elsewhere. I
> >>>>>> can still reproduce the issue on the recent staging branch 
> >>>>>> (9dc687f155).
> >>>>>>
> >>>>>> It fails after the first resume (not always, but frequently enough to
> >>>>>> debug it). At least one guest needs to be running - with just (PV) dom0
> >>>>>> the crash doesn't happen (at least for the ~8 times in a row I tried).
> >>>>>> If the first resume works, the second will (almost?) always fail, but
> >>>>>> with different symptoms - dom0 kernel lockups (of at least some of its
> >>>>>> vcpus). I haven't debugged that one at all yet.
> >>>>>>
> >>>>>> Any help will be appreciated; I can apply debug patches, change the
> >>>>>> configuration, etc.
> >>>>>
> >>>>> This still happens on 4.14.3. Maybe it is related to freeing percpu
> >>>>> areas, as that caused other issues with suspend too? Just a thought...
> >>>>
> >>>> I have reproduced this on current staging(*), and I can reproduce it
> >>>> reliably. I also got a (I believe) closely related crash with the
> >>>> credit1 scheduler.
> >>>>
> >>>> (*) It isn't plain staging; it's one with my xhci console patches on
> >>>> top, including an attempt to make it survive S3. I believe the only
> >>>> relevant part there is sticking a set_timer() into the console resume
> >>>> path (or just having a timer with a rather short delay registered). The
> >>>> actual tree is at https://github.com/marmarek/xen/tree/master-xue2-debug,
> >>>> including quite a lot of debug prints and debug hacks.
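> >>>>
> >>>> Roughly what that boils down to - a simplified sketch with made-up names
> >>>> (apart from the standard init_timer()/set_timer()/NOW()/MILLISECS()),
> >>>> not the actual patch:
> >>>>
> >>>>     static struct timer xhci_poll_timer;   /* hypothetical */
> >>>>
> >>>>     static void xhci_poll_cb(void *data)
> >>>>     {
> >>>>         /* ... poll the controller ..., then re-arm with a short delay */
> >>>>         set_timer(&xhci_poll_timer, NOW() + MILLISECS(1));
> >>>>     }
> >>>>
> >>>>     /* hypothetical console resume hook */
> >>>>     static void xhci_console_resume(void)
> >>>>     {
> >>>>         init_timer(&xhci_poll_timer, xhci_poll_cb, NULL, 0);
> >>>>         set_timer(&xhci_poll_timer, NOW() + MILLISECS(1));
> >>>>     }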
> >>>>
> >>>> Specific crash with credit2:
> >>
> >> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
> >> (XEN) ----[ Xen-4.17-unstable  x86_64  debug=y  Tainted:   C    ]----
> >> (XEN) CPU:    10
> >> (XEN) RIP:    e008:[<ffff82d040247a4d>] credit2.c#csched2_unit_wake+0x152/0x154
> >> (XEN) RFLAGS: 0000000000010083   CONTEXT: hypervisor (d0v0)
> >> (XEN) rax: ffff830251778230   rbx: ffff830251768cb0   rcx: 00000032111d6000
> >> (XEN) rdx: ffff8302515c1eb0   rsi: 0000000000000006   rdi: ffff830251769000
> >> (XEN) rbp: ffff8302515cfd90   rsp: ffff8302515cfd70   r8:  ffff830251769000
> >> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
> >> (XEN) r12: ffff830251768dd0   r13: ffff8302515c1d00   r14: 0000000000000006
> >> (XEN) r15: ffff82d0405ddb40   cr0: 0000000080050033   cr4: 0000000000372660
> >> (XEN) cr3: 000000022f2a1000   cr2: ffff8881012738e0
> >> (XEN) fsb: 0000744bf6a0db80   gsb: ffff888255600000   gss: 0000000000000000
> >> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
> >> (XEN) Xen code around <ffff82d040247a4d> (credit2.c#csched2_unit_wake+0x152/0x154):
> >> (XEN)  df e8 6f bf ff ff eb ad <0f> 0b f3 0f 1e fa 55 48 89 e5 41 57 41 56 41 55
> >> (XEN) Xen stack trace from rsp=ffff8302515cfd70:
> >> (XEN)    ffff83025174b000 ffff830251768cb0 ffff830251778270 ffff82d0405c4298
> >> (XEN)    ffff8302515cfdd8 ffff82d04024fcb8 0000000000000202 ffff830251778270
> >> (XEN)    ffff83025174b000 0000000000000001 ffff830251769018 0000000000000000
> >> (XEN)    0000000000000000 ffff8302515cfe48 ffff82d04020a8c9 ffff8882556aedc0
> >> (XEN)    0000000000000003 00001910537e623e 0000000b988f78a6 0000000059d4a716
> >> (XEN)    00001901f30fa41e 0000000217f96af6 0000000000000000 ffff83025174b000
> >> (XEN)    ffff830251756000 0000000000000002 0000000000000001 ffff8302515cfe70
> >> (XEN)    ffff82d0402f7968 ffff830251756000 ffff8302515cfef8 0000000000000018
> >> (XEN)    ffff8302515cfee8 ffff82d0402ec6de 0000000000000000 ffffffff82f157e0
> >> (XEN)    0000000000000000 0000000000000000 ffff8302515cfef8 0000000000000000
> >> (XEN)    0000000000000000 ffff8302515cffff ffff830251756000 0000000000000000
> >> (XEN)    0000000000000000 0000000000000000 0000000000000000 00007cfdaea300e7
> >> (XEN)    ffff82d0402012bd 0000000000000000 ffffffff82c51120 ffff88810036cf00
> >> (XEN)    0000000000000002 000000000001e120 0000000000000002 0000000000000246
> >> (XEN)    ffffffff82f157e0 0000000000000001 0000000000000000 0000000000000018
> >> (XEN)    ffffffff81e4a30a 0000000000000000 0000000000000002 0000000000000001
> >> (XEN)    0000010000000000 ffffffff81e4a30a 000000000000e033 0000000000000246
> >> (XEN)    ffffc9004aef7c18 000000000000e02b fb5ee398d214b10c eb5ef398c214a10c
> >> (XEN)    eb56f390c21ca104 ebd6f310c29ca184 0000e0100000000a ffff830251756000
> >> (XEN)    0000003211016000 0000000000372660 0000000000000000 80000002963e1002
> >> (XEN) Xen call trace:
> >> (XEN)    [<ffff82d040247a4d>] R credit2.c#csched2_unit_wake+0x152/0x154
> >> (XEN)    [<ffff82d04024fcb8>] F vcpu_wake+0xfd/0x267
> >> (XEN)    [<ffff82d04020a8c9>] F common_vcpu_op+0x178/0x5d1
> >> (XEN)    [<ffff82d0402f7968>] F do_vcpu_op+0x69/0x226
> >> (XEN)    [<ffff82d0402ec6de>] F pv_hypercall+0x575/0x657
> >> (XEN)    [<ffff82d0402012bd>] F lstar_enter+0x13d/0x150
> >> (XEN)
> >> (XEN)
> >> (XEN) ****************************************
> >> (XEN) Panic on CPU 10:
> >> (XEN) Assertion 'c2rqd(sched_unit_master(unit)) == svc->rqd' failed at common/sched/credit2.c:2274
> >> (XEN) ****************************************
> > 
> > Ok, I think I figured it out!
> > 
> > I added a function that verifies the run queues of all the sched units,
> > and called it basically on every other line of the resume path. The debug
> > function (if anybody is interested):
> > 
> >     /* Check each unit's credit2 runqueue against its master pcpu's runqueue. */
> >     void verify_sched_units(void)
> >     {   
> >         struct domain *d;
> >         const struct sched_unit *unit;
> >         
> >         for_each_domain ( d )
> >         {
> >             for_each_sched_unit ( d, unit )
> >             {
> >                 if ( c2rqd(sched_unit_master(unit)) != csched2_unit(unit)->rqd )
> >                 {
> >                     printk(XENLOG_WARNING "d%d sched unit %d: rq=%d, unit master %d, rq=%d\n",
> >                             d->domain_id, unit->unit_id,
> >                             csched2_unit(unit)->rqd ? csched2_unit(unit)->rqd->id : -1,
> >                             sched_unit_master(unit),
> >                             c2rqd(sched_unit_master(unit))->id);
> >                     WARN_ON(1);
> >                 }
> >             }
> >         }
> >     }
> > 
> > It appears that restore_vcpu_affinity() is responsible, specifically
> > this part:
> > 
> > 1216         /*
> > 1217          * Re-assign the initial processor as after resume we have no
> > 1218          * guarantee the old processor has come back to life again.
> > 1219          *
> > 1220          * Therefore, here, before actually unpausing the domains, we should
> > 1221          * set v->processor of each of their vCPUs to something that will
> > 1222          * make sense for the scheduler of the cpupool in which they are in.
> > 1223          */
> > ...
> > 1249         res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
> > 1250         sched_set_res(unit, res);
> > 1251 
> > 1252         spin_unlock_irq(lock);
> > 1253 
> > 1254         /* v->processor might have changed, so reacquire the lock. */
> > 1255         lock = unit_schedule_lock_irq(unit);
> > 1256         res = sched_pick_resource(unit_scheduler(unit), unit);
> > 1257         sched_set_res(unit, res);
> > 1258         spin_unlock_irq(lock);
> > 1259 
> > 1260         if ( old_cpu != sched_unit_master(unit) )
> > 1261             sched_move_irqs(unit);
> > 
> > It calls sched_set_res() directly, which assigns sched resources, but
> > does _not_ adjust runqueues (if the new pcpu happens to be assigned to
> > a different runqueue than the previous pcpu's).
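> > 
> > For context, as far as I can tell sched_migrate() is just a thin wrapper
> > around the per-scheduler migrate hook (csched2_unit_migrate() for credit2,
> > which does update svc->rqd), falling back to a plain resource assignment
> > only when the scheduler has no such hook - roughly (from memory, see
> > common/sched/private.h for the real thing):
> > 
> >     static inline void sched_migrate(const struct scheduler *s,
> >                                      struct sched_unit *unit, unsigned int cpu)
> >     {
> >         if ( s->migrate )
> >             s->migrate(s, unit, cpu);   /* credit2: also moves svc->rqd */
> >         else
> >             sched_set_res(unit, get_sched_res(cpu));
> >     }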
> > 
> > I have two (non-exclusive) ideas here:
> > 1. If old_cpu is actually still available, do not move it at all.
> > 2. Use sched_migrate() instead of sched_set_res().
> > 
> > Here is the patch that fixes it for me:
> > ---8<---
> > diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> > index 83455fbde1c8..dcf202d8b307 100644
> > --- a/xen/common/sched/core.c
> > +++ b/xen/common/sched/core.c
> > @@ -1246,19 +1246,29 @@ void restore_vcpu_affinity(struct domain *d)
> >              }
> >          }
> >  
> > -        res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
> > +        /* Prefer old cpu if available. */
> > +        if ( cpumask_test_cpu(old_cpu, cpumask_scratch_cpu(cpu)) )
> > +            res = get_sched_res(old_cpu);
> > +        else
> > +            res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
> >          sched_set_res(unit, res);
> >  
> >          spin_unlock_irq(lock);
> >  
> > -        /* v->processor might have changed, so reacquire the lock. */
> > -        lock = unit_schedule_lock_irq(unit);
> > -        res = sched_pick_resource(unit_scheduler(unit), unit);
> > -        sched_set_res(unit, res);
> > -        spin_unlock_irq(lock);
> > -
> > +        /*
> > +         * If different cpu was chosen, it was random, let scheduler do proper
> > +         * decision.
> > +         */
> >          if ( old_cpu != sched_unit_master(unit) )
> > +        {
> > +            /* v->processor might have changed, so reacquire the lock. */
> > +            lock = unit_schedule_lock_irq(unit);
> > +            res = sched_pick_resource(unit_scheduler(unit), unit);
> > +            sched_migrate(unit_scheduler(unit), unit, res->master_cpu);
> > +            spin_unlock_irq(lock);
> > +
> >              sched_move_irqs(unit);
> > +        }
> >      }
> >  
> >      rcu_read_unlock(&sched_res_rculock);
> > ---8<---
> > 
> > I have several doubts here:
> > 
> > 1. If old_cpu is available, is sched_set_res() needed at all?
> > 2. Should both calls be changed to sched_migrate()? Currently I changed
> >    only the second one, in case the scheduler could be confused by
> >    old_cpu not being available anymore.
> > 3. Are there any extra locking requirements for sched_migrate() at this
> >    stage? The long comment above sched_unit_migrate_start() suggests
> >    there might be, but I'm not sure if that's really the case during
> >    resume.
> > 4. Related to the above - should thaw_domains() be modified to call
> >    restore_vcpu_affinity() for all domains first, and unpause only
> >    later? That could reduce locking requirements, I guess.
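> > 
> > To illustrate 4., a rough (untested) sketch based on the current
> > thaw_domains() in xen/arch/x86/acpi/power.c - restore affinities for all
> > domains first, and only unpause them afterwards:
> > 
> >     static void thaw_domains(void)
> >     {
> >         struct domain *d;
> > 
> >         rcu_read_lock(&domlist_read_lock);
> > 
> >         /* Fix up scheduling resources while everything is still paused... */
> >         for_each_domain ( d )
> >             restore_vcpu_affinity(d);
> > 
> >         /* ...and only then let the domains run again. */
> >         for_each_domain ( d )
> >             domain_unpause(d);
> > 
> >         rcu_read_unlock(&domlist_read_lock);
> >     }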
> 
> All questions primarily to the scheduler maintainers - forwarding
> accordingly.

Ping?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
