
Re: [Xen-devel] [PATCH] xen: sched: fix (ACPI S3) resume with cpupools with different schedulers.



On Fri, 2015-11-13 at 05:52 -0700, Jan Beulich wrote:
>>> On 13.11.15 at 13:30, <dario.faggioli@xxxxxxxxxx> wrote:

> > cpu_schedule_up() allocates, with the alloc_pdata hook, stuff that
> > was
> > freed by cpu_schedule_down (with the free_pdata hook) already.
> 
> If there is no change in the allocation pattern (i.e. including the
> order of frees and allocs), then I'm sorry for the noise. I
> understood
> the description of the change differently then.
> 
There are changes. So, here is the situation.

 Right now:                   After this patch:
 ---------                    -----------------
 * Boot / CPU switch pools:   * Boot / CPU switch pools:
    cpu_schedule_up():           cpu_schedule_up():
     v0 = alloc_vdata()           v0 = alloc_vdata()
     p0 = alloc_pdata()           p0 = alloc_pdata()

 * Suspend:                   * Suspend:
    cpu_schedule_down()          cpu_schedule_down():
     free_pdata(p0)               free_vdata(v0)
                                  free_pdata(p0)     [x]

 * Resume:                    * Resume:
    cpu_schedule_up()            cpu_schedule_up():
     p1 = alloc_pdata()           v1 = alloc_vdata()
                                  p1 = alloc_pdata() [xx]
   [*]
    schedule_cpu_switch()        schedule_cpu_switch():
     p2 = alloc_pdata()           p2 = alloc_pdata()
     v1 = alloc_vdata()           v2 = alloc_vdata()
     free_vdata(v0) [**]          free_vdata(v1)
     free_pdata(p1)               free_pdata(p1)

[*] BUG, the one described in the changelog, if scheduling happens at 
    this point (I can reproduce this with 100% reliability)

[**] LATENT BUG, as it is possible that the free_vdata() function
     called here is not the counterpart of the alloc_vdata() function
     used for allocating v0

So, there are allocations already in the resume path. With this patch,
there is one more free in the suspend path and one more alloc in the
resume path.
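
To illustrate, here is a minimal sketch, in C, of the symmetric
suspend/resume pairing the patch establishes (the names sched_cpu_*,
pdata and vdata are stand-ins for illustration, not the actual Xen
hooks):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical minimal model of the per-pCPU scheduler state: */
struct sched_cpu {
    void *pdata;  /* per-pCPU scheduler data  */
    void *vdata;  /* idle vCPU's private data */
};

/* Suspend: free both pieces, so no stale vdata survives (step [x]). */
static void sched_cpu_suspend(struct sched_cpu *c)
{
    free(c->vdata); c->vdata = NULL;
    free(c->pdata); c->pdata = NULL;
}

/* Resume: reallocate both, mirroring the suspend path (step [xx]).
 * Returns 0 on success, -1 (ENOMEM-like) on failure. */
static int sched_cpu_resume(struct sched_cpu *c)
{
    c->vdata = malloc(16);
    c->pdata = malloc(16);
    if ( !c->vdata || !c->pdata )
    {
        sched_cpu_suspend(c); /* roll back any partial allocation */
        return -1;
    }
    return 0;
}
```

The point of the sketch is only that suspend and resume touch the same
pair of objects, so no vdata/pdata inconsistency can survive across the
cycle.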

I can change the order of operations at [x] and/or [xx] (and send a
v2), e.g., make things look like this, if that helps:

* Suspend:
   cpu_schedule_down():
    free_pdata(p0)
    free_vdata(v0)

* Resume:
   cpu_schedule_up():
    v1 = alloc_vdata()
    p1 = alloc_pdata()

Would that work?
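
In trace form, that reordered variant would behave like this minimal
sketch (boot/suspend/resume and the op names are hypothetical stand-ins
recording the call order, not the real functions):

```c
#include <assert.h>
#include <string.h>

/* Record the order of (simulated) alloc/free calls: boot allocates
 * vdata then pdata, suspend frees in reverse (pdata, then vdata),
 * and resume reallocates in the original boot order. */
#define MAX_OPS 8
static const char *ops[MAX_OPS];
static int nops;

static void op(const char *name) { ops[nops++] = name; }

static void boot(void)       { op("alloc_vdata"); op("alloc_pdata"); }
static void do_suspend(void) { op("free_pdata");  op("free_vdata");  }
static void do_resume(void)  { op("alloc_vdata"); op("alloc_pdata"); }
```

I.e., frees happen in LIFO order with respect to the allocations, and
resume repeats the boot-time allocation pattern exactly.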

I don't think I can easily avoid the bug, or eliminate the explained
inconsistency, without that additional free-alloc pair.

> > In any case, if one of these allocations fails, it already returns
> > ENOMEM. What happens then is, of course, caller's call. If it
> > happens
> > during boot for pCPU 0, we BUG(). If it happens with non-boot pCPUs
> > (either at boot, during hotplug or during suspend/resume), we fail
> > the
> > bringup of that pCPU at the CPU_UP_PREPARE phase.
> 
> Except that after resume the system should get into the same state
> it was before suspend, i.e. in particular same number of CPUs.
>
And those same CPUs being in the same pools, which is where this all
springs from.

That being said, I totally agree. But if something unexpected happens
(which is what we are talking about), like running out of memory, is
there any value in "limiting the damage"? I think there is, and that is
why I looked at and reported the failure path in detail, when answering
your "what if the alloc fails" question.

If there is no point in that, why do we handle the error and rollback
the CPU bringup in cpu_up() at all? We could just, at least in case of
resume, BUG_ON() every time we don't get a NOTIFY_DONE result... 
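
As a sketch of what I mean by handling the error instead of BUG()ing
(NOTIFY_DONE/NOTIFY_BAD and cpu_up_prepare() here are stand-ins, not
the real Xen notifier API):

```c
#include <assert.h>
#include <stdlib.h>

#define NOTIFY_DONE 0
#define NOTIFY_BAD  (-1)

static int alloc_ok = 1; /* knob simulating memory availability */

/* On CPU_UP_PREPARE, try the per-pCPU allocation; if it fails, fail
 * (and let the caller roll back) the bring-up of that pCPU, rather
 * than crashing the host. */
static int cpu_up_prepare(unsigned int cpu)
{
    void *pdata = alloc_ok ? malloc(16) : NULL;

    (void)cpu;
    if ( pdata == NULL )
        return NOTIFY_BAD;   /* bring-up of this pCPU fails */
    free(pdata);             /* sketch only: drop it again  */
    return NOTIFY_DONE;
}
```

That is, the failure is confined to the one pCPU that could not be
brought up, which is the "limiting the damage" I am arguing for.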

>  Imo
> there simply shouldn't be any allocations on the resume path -
> everything should be possible to be picked up as it was left during
> suspend. 
>
Agreed again. I'm not sure to what extent we can apply this to the
scheduler's per-pCPU and per-vCPU data, though. For example, per-vCPU
data was not being deallocated (before my patch, I mean), but per-pCPU
data was.

I'm not sure why exactly this was the case in the first place, but I
guess it has to do with the fact that we, OTOH, want to actually
deallocate as much as possible in the case of CPU hot unplug, and
parts of this code are shared between the two.

But it is also, IMO, related to the above reasoning on error handling.
I mean, if we just don't free the per-pCPU data the scheduler keeps for
CPU x, and then, during resume, CPU x fails to come up, what happens?
We risk leaking that data or, worse, failing to make the scheduler
aware that it does not really have CPU x available any longer.

So, again, if we value "graceful" error compensation, I see some of
these re-allocations as necessary. Perhaps the work I'm doing in
another series, separating the scheduler's per-pCPU data allocation
from its initialization, can be helpful (not against the risk of
leaks, though).
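
Roughly, the split I have in mind looks like this (hypothetical names,
just to show the shape of the two phases):

```c
#include <assert.h>
#include <stdlib.h>

/* Allocation can happen (and fail gracefully) at CPU_UP_PREPARE time,
 * while initialization is deferred to CPU_STARTING, when the pCPU's
 * pool and scheduler are actually known. */
struct pdata {
    int initialised;
};

/* Phase 1: just get the memory; may return NULL, caller checks. */
static struct pdata *pdata_alloc(void)
{
    return calloc(1, sizeof(struct pdata));
}

/* Phase 2: scheduler-specific setup, once the pCPU's pool is known. */
static void pdata_init(struct pdata *p)
{
    p->initialised = 1;
}
```

With the two phases separate, the allocation failure path stays in
CPU_UP_PREPARE (where it can be rolled back), and only the cheap,
non-failing initialization is left for later.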

> > Fact is, there is an inconsistency between the pCPU's scheduler
> > (set in
> > cpu_schedule_up(), to Credit1, in the example) and the pCPU's idle
> > vCPU's private data (Credit2 data in the example), which, if left
> > in
> > place, could explode somewhere else, at some future point. In my
> > opinion, the best solution is to remove such inconsistency, and
> > that's
> > what the patch tries to do.
> 
> Removing the inconsistency is certainly a requirement. The question
> just is how to do so without risking system health in other ways.
> 
I had a look. For instance, in Credit2, the scheduler's proper
initialization happens only during CPU_STARTING (and needs to stay
there), so I can't attach the pCPU to a Credit2 instance during
CPU_UP_PREPARE and claim that it's ready for scheduling.

Avoiding the deallocation, leaving aside the above discussion on the
importance of error handling, would at least require treating the CPU
hotplug and suspend/resume cases differently, which doesn't look ideal.

Putting something together that would hold up until proper setup
happens (namely, during the time between cpu_schedule_up() and
schedule_cpu_switch()) is also less trivial than I thought. For
instance, I'd need to access per_cpu(cpupool,cpu) from
cpu_schedule_up(), which, from a quick test, it seems I can't. I don't
think it's worth putting much effort into this, considering that the
result would be a hack anyway!

So, all in all, the patch (maybe after reordering the free-s?) remains
the best way forward, IMO. It's possible that, in future, we'll be able
to improve things toward the 'no allocs in resume path' goal, and I'll
keep an eye out for that when touching related areas.

Thanks and Regards,
Dario
-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

