[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] xen: rework error handling in vcpu_create


  • To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
  • Date: Mon, 4 Aug 2025 19:36:24 -0400
  • Cc: Anthony PERARD <anthony.perard@xxxxxxxxxx>, Michal Orzel <michal.orzel@xxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>, Julien Grall <julien@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, George Dunlap <gwd@xxxxxxxxxxxxxx>
  • Delivery-date: Mon, 04 Aug 2025 23:36:44 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 8/3/25 09:19, Andrew Cooper wrote:
> On 01/08/2025 9:24 pm, Stewart Hildebrand wrote:
>> In vcpu_create after scheduler data is allocated, if
>> vmtrace_alloc_buffer fails, it will jump to the wrong cleanup label
>> resulting in a memory leak.
>>
>> Move sched_destroy_vcpu and destroy_waitqueue_vcpu to vcpu_teardown.
>> Make vcpu_teardown idempotent: deal with NULL unit.
>>
>> Fix vcpu_runstate_get (called during XEN_SYSCTL_getdomaininfolist post
>> vcpu_teardown) when v->sched_unit is NULL.
> 
> This is unfortunate.  It feels wrong to be updating stats on a domain
> that's in the process of being destroyed, especially as a side effect of
> a get call.
> 
> But, I wonder if we've uncovered an object lifecycle problem here. 
> Previously any vCPU which was addressable in the system (i.e. domid was
> addressable, a d->vcpu[x] was not NULL) had a unit.

Hm.

My understanding is that domain destruction happens in roughly two
parts. The first part finishes right away, i.e. before "xl destroy"
returns. The second part may be deferred and happens in
complete_domain_destroy. A non-negligible amount of time may pass
before complete_domain_destroy runs, giving an admin the opportunity to
invoke all sorts of hypercalls from dom0 in the meantime, while
half-destroyed domains still exist. E.g. see 54fe207f29f8 ("sched/null:
avoid another crash after failed domU creation").

Currently, sched_destroy_vcpu happens in complete_domain_destroy (i.e.
it's deferred), so half-destroyed domains still have their vcpu data.
The present patch moves sched_destroy_vcpu to the first stage of domain
destruction, where it is no longer deferred. A consequence of doing so
is that it opens up a non-negligible window during which an admin can
invoke all sorts of hypercalls while half-destroyed domains with freed
vcpu data exist.

This is getting to be a bit complex, and it appears there's potential
for more fallout. While I'd be happy to make the changes to address your
comments below, I wonder if we should consider taking v1 of the patch
for now.

>>
>> Fixes: 217dd79ee292 ("xen/domain: Add vmtrace_size domain creation parameter")
>> Signed-off-by: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
>> ---
>> v1->v2:
>> * move cleanup functions to vcpu_teardown
>> * renamed, was ("xen: fix memory leak on error in vcpu_create")
>> ---
>>  xen/common/domain.c     | 14 ++++++--------
>>  xen/common/sched/core.c |  5 ++++-
>>  2 files changed, 10 insertions(+), 9 deletions(-)
>>
>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>> index 5241a1629eeb..9c65c2974ea3 100644
>> --- a/xen/common/domain.c
>> +++ b/xen/common/domain.c
>> @@ -388,6 +388,8 @@ static int vmtrace_alloc_buffer(struct vcpu *v)
>>  static int vcpu_teardown(struct vcpu *v)
>>  {
>>      vmtrace_free_buffer(v);
>> +    sched_destroy_vcpu(v);
>> +    destroy_waitqueue_vcpu(v);
>>  
>>      return 0;
>>  }
> 
> Along with this, you want a matching:
> 
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 5241a1629eeb..3fd75a6d6784 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -1412,8 +1412,6 @@ static void cf_check complete_domain_destroy(struct rcu_head *head)
>              continue;
>          tasklet_kill(&v->continue_hypercall_tasklet);
>          arch_vcpu_destroy(v);
> -        sched_destroy_vcpu(v);
> -        destroy_waitqueue_vcpu(v);
>      }
>  
>      grant_table_destroy(d);
> 
> 

OK

>> @@ -448,13 +450,13 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
>>      }
>>  
>>      if ( sched_init_vcpu(v) != 0 )
>> -        goto fail_wq;
>> +        goto fail;
>>  
>>      if ( vmtrace_alloc_buffer(v) != 0 )
>> -        goto fail_wq;
>> +        goto fail;
>>  
>>      if ( arch_vcpu_create(v) != 0 )
>> -        goto fail_sched;
>> +        goto fail;
>>  
>>      d->vcpu[vcpu_id] = v;
>>      if ( vcpu_id != 0 )
>> @@ -472,11 +474,7 @@ struct vcpu *vcpu_create(struct domain *d, unsigned int vcpu_id)
>>  
>>      return v;
>>  
>> - fail_sched:
>> -    sched_destroy_vcpu(v);
>> - fail_wq:
>> -    destroy_waitqueue_vcpu(v);
>> -
>> + fail:
>>      /* Must not hit a continuation in this context. */
>>      if ( vcpu_teardown(v) )
>>          ASSERT_UNREACHABLE();
>> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
>> index 2ab4313517c3..fb7c99b05360 100644
>> --- a/xen/common/sched/core.c
>> +++ b/xen/common/sched/core.c
>> @@ -321,7 +321,7 @@ void vcpu_runstate_get(const struct vcpu *v,
>>       */
>>      unit = is_idle_vcpu(v) ? get_sched_res(v->processor)->sched_unit_idle
>>                             : v->sched_unit;
>> -    lock = likely(v == current) ? NULL : unit_schedule_lock_irq(unit);
>> -    lock = likely(v == current || !unit) ? NULL : unit_schedule_lock_irq(unit);
> 
> This is obfuscation for obfuscations sake.  The normal way of writing it
> would be:
> 
>     if ( v != current && unit )
>         lock = unit_schedule_lock_irq(unit);
> 
> and that is precisely what the compiler emits.
> 
> Moreover it also matches the pattern for how the lock is released, later
> in the function.

OK, but I'd need to initialize lock to NULL (or add "else lock = NULL;")
to avoid "error: 'lock' may be used uninitialized in this function".
Also, would we want the condition enclosed in unlikely(...)?

    if ( unlikely(v != current && unit) )
        lock = unit_schedule_lock_irq(unit);
    else
        lock = NULL;
