[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH-for-4.17 v2] xen/sched: migrate timers to correct cpus after suspend



On 02.11.22 15:07, Jan Beulich wrote:
On 28.10.2022 13:12, Juergen Gross wrote:
Today all timers are migrated to cpu 0 when the system is being
suspended. They are not migrated back after resuming the system again.

This results (at least) to problems with the credit scheduler, as the
timer isn't handled on the cpu it was expected to occur.

While you say "at least", this doesn't really make clear in how far all
four timers for which you change their handling are actually problematic,
or whether for some you make the adjustment "just in case". Looking at
core.c's s_timer I'm inclined to say that with s_timer_fn() raising the
schedule softirq things can't go well when this doesn't happen on the
correct CPU. Just that it won't be an obvious problem like the crash in
credit1 which had prompted the creation of this patch.

Add migrating the scheduling related timers of a specific cpu from cpu
0 back to its original cpu when that cpu has gone up when resuming the
system.

Signed-off-by: Juergen Gross <jgross@xxxxxxxx>

Is there any Fixes: tag possible? If not, could you add a respective
note in the post-commit-message area, to increase the chance of
recalling that this will want queuing for backport? (Or maybe the lack
of a reasonable Fixes: tag could actually justify the use of the
Backport: one.)

I'm not sure whether it is really correct, but I assume the most probable
candidate for a Fixes: tag would be commit 0763cd268789.


--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -1284,6 +1284,32 @@ static int cpu_disable_scheduler_check(unsigned int cpu)
      return 0;
  }
+/*
+ * Called after a cpu has come up again in a suspend/resume cycle.
+ * Note that on a system with smt=0 this will be called for the sibling cpus,
+ * too, so the case for no scheduling resource being available must be
+ * considered.

I think this part of the comment would better live ...

+ * Migrate all timers for this cpu (they have been migrated to cpu 0 when the
+ * cpu was going down).
+ * Note that only timers related to a physical cpu are migrated, not the ones
+ * related to a vcpu or domain.
+ */
+void sched_migrate_timers(unsigned int cpu)
+{
+    struct sched_resource *sr;
+
+    rcu_read_lock(&sched_res_rculock);
+
+    sr = get_sched_res(cpu);

... inbetween here, increasing the chance that there won't be someone
trying to remove the extra check ...

Okay.


+    if ( sr && sr->master_cpu == cpu )

... from here.

I further think that explicitly mentioning "smt=0" isn't very helpful.
Aiui a system with some CPUs otherwise soft-offlined would suffer the
same problem. And I further assume no problem would occur even with
"smt=0" on AMD hardware or Arm (where we don't park CPUs). At the very
least I'd therefore like to ask for "e.g." to be inserted; better
would be a more generic statement like "with some CPUs parked".

I'll go with the latter.


Juergen

Attachment: OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key

Attachment: OpenPGP_signature
Description: OpenPGP digital signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.