[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH-for-4.17 v2] xen/sched: migrate timers to correct cpus after suspend


  • To: Juergen Gross <jgross@xxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Wed, 2 Nov 2022 15:07:55 +0100
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=MNImGd2VfP1X7a4ep1HdEPAIBmU3R5jAKZYwcxXLpUY=; b=UYlCt+EDBoJhMYlJCyPcymyvtDV7maBdLMRQeCNox7oPr2Trdg54JWSamH0wZOSYVX/KcBKYMP5CwxDbHn7IOPij1sDc692/9YTiABDGkHaEyJkT/MOeEgng64iUMVHx4f4leGW7Kxr2HKy+p7Lf/93LIevpC2Nq1NhTU/sgFBDc++Siau6ZK7iNA0xJh4DA2akk7BS0dUrV//LQ1SbZrMuZ/5OQUt+e90BXjeYxWXQe+uc0kFlL4wYECBgBTn6mXGhk/gex0KFnTPYLEyl8OmVoiZ1EiQ8u+7XbrwcssaGDZsnbJ2mT51QRA3c9YHXX4YuvyTJqTlZhiWSM4DA//g==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=dzhIE70kjY+y+cuY46maKKB5pnsYXCbwcXN0hZ6Mz84WiezyubOx3FMIZgRsUm2/x31mBYNYQUSO5VptWrsZWQae8/7JR3ov7eFf4Q57EXIJHhlxbeyLMNnXQLMaLYv9xfMqbdF3pFnw0Ubo1DsvG1FoBntY3AaX0vgdGeZTMxyDfkpAQZ3gS4+ofObDgF9wQ10kPAiYGwydHTHa6ZsCv2j+gJt1HkDVR9bEFsmSmRq+fhc6GL3PxZAVHQuA4CkVlLgOEc07axYsJ3lhn/qGKGTOW8YTsVKWpX5CUeYrk+4OJPvGRFuRhr3EfBz482ZzX2oNZ1tQ0MJ10+mGUwt4vw==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: George Dunlap <george.dunlap@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, Meng Xu <mengxu@xxxxxxxxxxxxx>, Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Henry Wang <Henry.Wang@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Wed, 02 Nov 2022 14:08:17 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 28.10.2022 13:12, Juergen Gross wrote:
> Today all timers are migrated to cpu 0 when the system is being
> suspended. They are not migrated back after resuming the system again.
> 
> This results (at least) to problems with the credit scheduler, as the
> timer isn't handled on the cpu it was expected to occur.

While you say "at least", this doesn't really make clear in how far all
four timers for which you change their handling are actually problematic,
or whether for some you make the adjustment "just in case". Looking at
core.c's s_timer I'm inclined to say that with s_timer_fn() raising the
schedule softirq things can't go well when this doesn't happen on the
correct CPU. Just that it won't be an obvious problem like the crash in
credit1 which had prompted the creation of this patch.

> Add migrating the scheduling related timers of a specific cpu from cpu
> 0 back to its original cpu when that cpu has gone up when resuming the
> system.
> 
> Signed-off-by: Juergen Gross <jgross@xxxxxxxx>

Is there any Fixes: tag possible? If not, could you add a respective
note in the post-commit-message area, to increase the chance of
recalling that this will want queuing for backport? (Or maybe the lack
of a reasonable Fixes: tag could actually justify the use of the
Backport: one.)

> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -1284,6 +1284,32 @@ static int cpu_disable_scheduler_check(unsigned int 
> cpu)
>      return 0;
>  }
>  
> +/*
> + * Called after a cpu has come up again in a suspend/resume cycle.
> + * Note that on a system with smt=0 this will be called for the sibling cpus,
> + * too, so the case for no scheduling resource being available must be
> + * considered.

I think this part of the comment would better live ...

> + * Migrate all timers for this cpu (they have been migrated to cpu 0 when the
> + * cpu was going down).
> + * Note that only timers related to a physical cpu are migrated, not the ones
> + * related to a vcpu or domain.
> + */
> +void sched_migrate_timers(unsigned int cpu)
> +{
> +    struct sched_resource *sr;
> +
> +    rcu_read_lock(&sched_res_rculock);
> +
> +    sr = get_sched_res(cpu);

... inbetween here, increasing the chance that there won't be someone
trying to remove the extra check ...

> +    if ( sr && sr->master_cpu == cpu )

... from here.

I further think that explicitly mentioning "smt=0" isn't very helpful.
Aiui a system with some CPUs otherwise soft-offlined would suffer the
same problem. And I further assume no problem would occur even with
"smt=0" on AMD hardware or Arm (where we don't park CPUs). At the very
least I'd therefore like to ask for "e.g." to be inserted; better
would be a more generic statement like "with some CPUs parked".

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.