
Re: [PATCH v3 3/3] xen/sched: fix cpu hotplug


  • To: Juergen Gross <jgross@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <Andrew.Cooper3@xxxxxxxxxx>
  • Date: Wed, 31 Aug 2022 22:52:50 +0000
  • Cc: George Dunlap <George.Dunlap@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, Gao Ruifeng <ruifeng.gao@xxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>
  • Delivery-date: Wed, 31 Aug 2022 22:53:15 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 16/08/2022 11:13, Juergen Gross wrote:
> Cpu cpu unplugging is calling schedule_cpu_rm() via stop_machine_run()

Cpu cpu.

> with interrupts disabled, thus any memory allocation or freeing must
> be avoided.
>
> Since commit 5047cd1d5dea ("xen/common: Use enhanced
> ASSERT_ALLOC_CONTEXT in xmalloc()") this restriction is being enforced
> via an assertion, which will now fail.
>
> Before that commit cpu unplugging in normal configurations was working
> just by chance as only the cpu performing schedule_cpu_rm() was doing
> active work. With core scheduling enabled, however, failures could
> result from memory allocations not being properly propagated to other
> cpus' TLBs.

This isn't accurate, is it?  The problem with initiating a TLB flush
with IRQs disabled is that you can deadlock against a remote CPU which
is waiting for you to re-enable IRQs and take its TLB flush IPI.

How does a memory allocation out of the xenheap result in a TLB flush? 
Even with split heaps, you're only potentially allocating into a new
slot which was unused...

> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> index 228470ac41..ffb2d6202b 100644
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -3260,6 +3260,17 @@ static struct cpu_rm_data *schedule_cpu_rm_alloc(unsigned int cpu)
>      if ( !data )
>          goto out;
>  
> +    if ( aff_alloc )
> +    {
> +        if ( !update_node_aff_alloc(&data->affinity) )

I spent ages trying to figure out what this was doing, before realising
the problem is the function name.

alloc (as with free) is the critical piece of information and needs to
come first.  The fact we typically pass the result to
update_node_aff(inity) isn't relevant, and becomes actively wrong here
when we're nowhere near.

Patch 1 needs to name these helpers:

bool alloc_affinity_masks(struct affinity_masks *affinity);
void free_affinity_masks(struct affinity_masks *affinity);

and then patches 2 and 3 become far easier to follow.

Similarly in patch 2, the new helpers need to be
{alloc,free}_cpu_rm_data() to make sense.  These have nothing to do with
scheduling.

Also, you shouldn't introduce the helpers static in patch 2 and then
turn them non-static in patch 3.  That just adds unnecessary churn to
the complicated patch.

> +        {
> +            XFREE(data);
> +            goto out;
> +        }
> +    }
> +    else
> +        memset(&data->affinity, 0, sizeof(data->affinity));

I honestly don't think the xzalloc() -> xmalloc() optimisation is worth
the cognitive complexity of having this logic here.

> diff --git a/xen/common/sched/cpupool.c b/xen/common/sched/cpupool.c
> index 58e082eb4c..2506861e4f 100644
> --- a/xen/common/sched/cpupool.c
> +++ b/xen/common/sched/cpupool.c
> @@ -411,22 +411,28 @@ int cpupool_move_domain(struct domain *d, struct cpupool *c)
>  }
>  
>  /* Update affinities of all domains in a cpupool. */
> -static void cpupool_update_node_affinity(const struct cpupool *c)
> +static void cpupool_update_node_affinity(const struct cpupool *c,
> +                                         struct affinity_masks *masks)
>  {
> -    struct affinity_masks masks;
> +    struct affinity_masks local_masks;
>      struct domain *d;
>  
> -    if ( !update_node_aff_alloc(&masks) )
> -        return;
> +    if ( !masks )
> +    {
> +        if ( !update_node_aff_alloc(&local_masks) )
> +            return;
> +        masks = &local_masks;
> +    }
>  
>      rcu_read_lock(&domlist_read_lock);
>  
>      for_each_domain_in_cpupool(d, c)
> -        domain_update_node_aff(d, &masks);
> +        domain_update_node_aff(d, masks);
>  
>      rcu_read_unlock(&domlist_read_lock);
>  
> -    update_node_aff_free(&masks);
> +    if ( masks == &local_masks )
> +        update_node_aff_free(masks);
>  }
>  
>  /*

Why do we need this at all?  domain_update_node_aff() already knows what
to do when passed NULL, so this seems like an awfully complicated no-op.

> @@ -1008,10 +1016,21 @@ static int cf_check cpu_callback(
>  {
>      unsigned int cpu = (unsigned long)hcpu;
>      int rc = 0;
> +    static struct cpu_rm_data *mem;
>  
>      switch ( action )
>      {
>      case CPU_DOWN_FAILED:
> +        if ( system_state <= SYS_STATE_active )
> +        {
> +            if ( mem )
> +            {

So, this does compile (and indeed I've tested the result), but the
initialisation is subtler than it looks.

Nothing assigns to mem before it is first read at this point, and ...

> +                schedule_cpu_rm_free(mem, cpu);
> +                mem = NULL;
> +            }
> +            rc = cpupool_cpu_add(cpu);
> +        }
> +        break;
>      case CPU_ONLINE:
>          if ( system_state <= SYS_STATE_active )
>              rc = cpupool_cpu_add(cpu);
> @@ -1019,12 +1038,31 @@ static int cf_check cpu_callback(
>      case CPU_DOWN_PREPARE:
>          /* Suspend/Resume don't change assignments of cpus to cpupools. */
>          if ( system_state <= SYS_STATE_active )
> +        {
>              rc = cpupool_cpu_remove_prologue(cpu);
> +            if ( !rc )
> +            {
> +                ASSERT(!mem);

... here, and each subsequent assertion too.

Given that I tested the patch and it does fix the IRQ assertion, it can
only be working because mem has static storage duration and is
therefore guaranteed to be zero-initialised.  Relying on that
implicitly, without even a comment, is far too subtle.

~Andrew

 

