
Re: [Xen-devel] [PATCH] xen: sched: fix (ACPI S3) resume with cpupools with different schedulers.



On Fri, 2015-11-13 at 05:52 -0700, Jan Beulich wrote:
>>> On 13.11.15 at 13:30, <dario.faggioli@xxxxxxxxxx> wrote:

> > cpu_schedule_up() allocates, with the alloc_pdata hook, stuff that
> > was
> > freed by cpu_schedule_down (with the free_pdata hook) already.
> 
> If there is no change in the allocation pattern (i.e. including the
> order of frees and allocs), then I'm sorry for the noise. I
> understood
> the description of the change differently then.
> 
There are changes. So, here is the situation.

 Right now:                   After this patch:
 ---------                    -----------------
 * Boot / CPU switch pools:   * Boot / CPU switch pools:
    cpu_schedule_up():           cpu_schedule_up():
     v0 = alloc_vdata()           v0 = alloc_vdata()
     p0 = alloc_pdata()           p0 = alloc_pdata()

 * Suspend:                   * Suspend:
    cpu_schedule_down()          cpu_schedule_down():
     free_pdata(p0)               free_vdata(v0)
                                  free_pdata(p0)     [x]

 * Resume:                    * Resume:
    cpu_schedule_up()            cpu_schedule_up():
     p1 = alloc_pdata()           v1 = alloc_vdata()
                                  p1 = alloc_pdata() [xx]
   [*]
    schedule_cpu_switch()        schedule_cpu_switch():
     p2 = alloc_pdata()           p2 = alloc_pdata()
     v1 = alloc_vdata()           v2 = alloc_vdata()
     free_vdata(v0) [**]          free_vdata(v1)
     free_pdata(p1)               free_pdata(p1)

[*] BUG, the one described in the changelog, if scheduling happens at 
    this point (I can reproduce this with 100% reliability)

[**] LATENT BUG, as it is possible that the free_vdata() function
     called here is not the counterpart of the alloc_vdata() function
     used for allocating v0

So, there are allocations already in the resume path. With this patch,
there is one more free in the suspend path and one more alloc in the
resume path.
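
To illustrate, here is a minimal sketch, in C, of the symmetric
suspend/resume pairing the patch establishes (the names sched_cpu_*,
pdata and vdata are stand-ins for illustration, not the actual Xen
hooks):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical minimal model of the per-pCPU scheduler state: */
struct sched_cpu {
    void *pdata;  /* per-pCPU scheduler data  */
    void *vdata;  /* idle vCPU's private data */
};

/* Suspend: free both pieces, so no stale vdata survives (step [x]). */
static void sched_cpu_suspend(struct sched_cpu *c)
{
    free(c->vdata); c->vdata = NULL;
    free(c->pdata); c->pdata = NULL;
}

/* Resume: reallocate both, mirroring the suspend path (step [xx]).
 * Returns 0 on success, -1 (ENOMEM-like) on failure. */
static int sched_cpu_resume(struct sched_cpu *c)
{
    c->vdata = malloc(16);
    c->pdata = malloc(16);
    if ( !c->vdata || !c->pdata )
    {
        sched_cpu_suspend(c); /* roll back any partial allocation */
        return -1;
    }
    return 0;
}
```

The point of the sketch is only that suspend and resume touch the same
pair of objects, so no vdata/pdata inconsistency can survive across the
cycle.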

I can change the order of operations at [x] and/or [xx] (and send a
v2), e.g., make things look like this, if that helps:

* Suspend:
   cpu_schedule_down():
    free_pdata(p0)
    free_vdata(v0)

* Resume:
   cpu_schedule_up():
    v1 = alloc_vdata()
    p1 = alloc_pdata()

Would that work?
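
In trace form, that reordered variant would behave like this minimal
sketch (boot/suspend/resume and the op names are hypothetical stand-ins
recording the call order, not the real functions):

```c
#include <assert.h>
#include <string.h>

/* Record the order of (simulated) alloc/free calls: boot allocates
 * vdata then pdata, suspend frees in reverse (pdata, then vdata),
 * and resume reallocates in the original boot order. */
#define MAX_OPS 8
static const char *ops[MAX_OPS];
static int nops;

static void op(const char *name) { ops[nops++] = name; }

static void boot(void)       { op("alloc_vdata"); op("alloc_pdata"); }
static void do_suspend(void) { op("free_pdata");  op("free_vdata");  }
static void do_resume(void)  { op("alloc_vdata"); op("alloc_pdata"); }
```

I.e., frees happen in LIFO order with respect to the allocations, and
resume repeats the boot-time allocation pattern exactly.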

I don't think I can easily avoid the bug, or eliminate the explained
inconsistency, without that additional free-alloc pair.

> > In any case, if one of these allocations fails, it already returns
> > ENOMEM. What happens then is, of course, caller's call. If it
> > happens
> > during boot for pCPU 0, we BUG(). If it happens with non-boot pCPUs
> > (either at boot, during hotplug or during suspend/resume), we fail
> > the
> > bringup of that pCPU at the CPU_UP_PREPARE phase.
> 
> Except that after resume the system should get into the same state
> it was before suspend, i.e. in particular same number of CPUs.
>
And those same CPUs being in the same pools, which is where this all
springs from.

That being said, I totally agree. But if something unexpected happens
(which is what we are talking about), like running out of memory, is
there any value in "limiting the damage"? I think there is, and that is
why I looked at and reported the failure path in detail, when answering
your "what if the alloc fails" question.

If there is no point in that, why do we handle the error and rollback
the CPU bringup in cpu_up() at all? We could just, at least in case of
resume, BUG_ON() every time we don't get a NOTIFY_DONE result... 
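
As a sketch of what I mean by handling the error instead of BUG()ing
(NOTIFY_DONE/NOTIFY_BAD and cpu_up_prepare() here are stand-ins, not
the real Xen notifier API):

```c
#include <assert.h>
#include <stdlib.h>

#define NOTIFY_DONE 0
#define NOTIFY_BAD  (-1)

static int alloc_ok = 1; /* knob simulating memory availability */

/* On CPU_UP_PREPARE, try the per-pCPU allocation; if it fails, fail
 * (and let the caller roll back) the bring-up of that pCPU, rather
 * than crashing the host. */
static int cpu_up_prepare(unsigned int cpu)
{
    void *pdata = alloc_ok ? malloc(16) : NULL;

    (void)cpu;
    if ( pdata == NULL )
        return NOTIFY_BAD;   /* bring-up of this pCPU fails */
    free(pdata);             /* sketch only: drop it again  */
    return NOTIFY_DONE;
}
```

That is, the failure is confined to the one pCPU that could not be
brought up, which is the "limiting the damage" I am arguing for.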

>  Imo
> there simply shouldn't be any allocations on the resume path -
> everything should be possible to be picked up as it was left during
> suspend. 
>
Agreed again. I'm not sure to what extent we can apply this to the
scheduler's per-pCPU and per-vCPU data, though. For example, per-vCPU
data was not being deallocated (before my patch, I mean), but per-pCPU
data was.

I'm not sure why exactly this was the case in the first place, but I
guess it has to do with the fact that we, OTOH, want to actually
deallocate as much as possible in the case of CPU hot unplug, and
parts of this code are shared between the two.

But it is also, IMO, related to the above reasoning on error handling.
I mean, if we just don't free the per-pCPU data the scheduler keeps for
CPU x, and then, during resume, CPU x fails to come up, what happens?
We risk leaking that data or, worse, failing to make the scheduler
aware that it does not really have CPU x available any longer.

So, again, if we value "graceful" error compensation, I see some of
these re-allocations as necessary. Perhaps the work I'm doing in
another series, separating the scheduler's per-pCPU data allocation
from its initialization, can be helpful (not against the risk of
leaks, though).
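
Roughly, the split I have in mind looks like this (hypothetical names,
just to show the shape of the two phases):

```c
#include <assert.h>
#include <stdlib.h>

/* Allocation can happen (and fail gracefully) at CPU_UP_PREPARE time,
 * while initialization is deferred to CPU_STARTING, when the pCPU's
 * pool and scheduler are actually known. */
struct pdata {
    int initialised;
};

/* Phase 1: just get the memory; may return NULL, caller checks. */
static struct pdata *pdata_alloc(void)
{
    return calloc(1, sizeof(struct pdata));
}

/* Phase 2: scheduler-specific setup, once the pCPU's pool is known. */
static void pdata_init(struct pdata *p)
{
    p->initialised = 1;
}
```

With the two phases separate, the allocation failure path stays in
CPU_UP_PREPARE (where it can be rolled back), and only the cheap,
non-failing initialization is left for later.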

> > Fact is, there is an inconsistency between the pCPU's scheduler
> > (set in
> > cpu_schedule_up(), to Credit1, in the example) and the pCPU's idle
> > vCPU's private data (Credit2 data in the example), which, if left
> > in
> > place, could explode somewhere else, at some future point. In my
> > opinion, the best solution is to remove such inconsistency, and
> > that's
> > what the patch tries to do.
> 
> Removing the inconsistency is certainly a requirement. The question
> just is how to do so without risking system health in other ways.
> 
I had a look. For instance, in Credit2, the scheduler's proper
initialization happens only during CPU_STARTING (and needs to stay
there), so I can't attach the pCPU to a Credit2 instance during
CPU_UP_PREPARE and claim that it's ready for scheduling.

Avoiding the deallocation, leaving aside the above discussion on the
importance of error handling, would at least require treating the CPU
hotplug and suspend/resume cases differently, which doesn't look ideal.

Putting something together that would hold up until proper setup
happens (namely, during the time between cpu_schedule_up() and
schedule_cpu_switch()) is also less trivial than I thought. For
instance, I'd need to access per_cpu(cpupool,cpu) from
cpu_schedule_up(), which, from a quick test, it seems I can't. I don't
think it's worth putting much effort into this, considering that the
result would be a hack anyway!

So, all in all, the patch (maybe after reordering the free-s?) remains
the best way forward, IMO. It's possible that, in future, we'll be able
to improve things toward the 'no allocs in resume path' goal, and I'll
keep an eye out for that when touching related areas.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

