[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC PATCH] xen/sched/null: avoid crash after failed domU creation


  • To: Jan Beulich <jbeulich@xxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, "Dario Faggioli" <dfaggioli@xxxxxxxx>
  • From: Stewart Hildebrand <stewart.hildebrand@xxxxxxx>
  • Date: Wed, 26 Apr 2023 09:41:34 -0400
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass (sender ip is 165.204.84.17) smtp.rcpttodomain=suse.com smtp.mailfrom=amd.com; dmarc=pass (p=quarantine sp=quarantine pct=100) action=none header.from=amd.com; dkim=none (message not signed); arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=E1eEH5BwC384RmlKdS6D0BJCyf0JUhERXAlfmnSiq6M=; b=hcJV2+Cq7oEh0GyVRBOnSTXcfjyJLn5TXtSR8tSjMm7OoK5bhzAs0FmZSmhrKhRkM3Ejk2bbCGeGaN0brW6JkmdSZ0sNJP8LiyFGv2be20jWqtB16RmT+LyiwRIF5mpWkHel1pdH4buq+ZHMANzQO6WLajYkIG5AzbRosELOsDdJbtAslMaRISl0o6s4sj68Lxyjn1bNBFY/RwMSbDZfpeN78wVIv8EDfh6LPfDbCLahJkvkRp1NRK3+2k/Qjiuq1L8/BjiW+zs2QiFm609DHRXMNcbDivw6ox2uI8jrtsfieiPyy99Tq83m16GQ1rsp+pYzo5MJRNs8t/V2/o4zxw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=lZh086fySMqdxCA/R2H/CW2/ZsafL/9K56j15SNW7DRM9a3C1jEUnkFZK7ViBQsgLDJfvzNjkhdBW5r6QRhutHo8dLQ5ZuVtFgITtjPnK/YNAmpxCQvkYPEFj5QvIaeirByQrWfdmqxLXONfVmapVbbbDpB87XrefUIPvkKBhoZzzvniW9roeyOi6UUMHSfisGC3Gsp+aAQdfdZzvGq6HaVt0o4ITpj8xj62UDXJ51fwHR7Wpj22aLdt0OfEzht4ZmaCCTgCc7UvPS9VDD9sERtnQ6ewv8R6rPIZpNNhuQX21fLhssOIqPzBfv7PZaqD+oJJg2NcnTiXGNQ71BFLXw==
  • Cc: George Dunlap <george.dunlap@xxxxxxxxxx>, <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • Delivery-date: Wed, 26 Apr 2023 13:41:50 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 4/25/23 03:42, Jan Beulich wrote:
> On 25.04.2023 08:36, Juergen Gross wrote:
>> On 24.04.23 23:00, Stewart Hildebrand wrote:
>>> When creating a domU, but the creation fails, we may end up in a state
>>> where a vcpu has not yet been added to the null scheduler, but
>>> unit_deassign() is invoked.
>>
>> This is not really true. The vcpu has been added, but it was offline at
>> that time. This resulted in null_unit_insert() returning early and not
>> calling unit_assign().
>>
>> Later the vcpu was onlined during XEN_DOMCTL_setvcpucontext handling,
>> resulting in null_unit_remove() calling unit_deassign().

Makes sense. I'll reword the message in the next revision.

>>> In this case, when running a debug build of
>>> Xen, we will hit an ASSERT and crash Xen:
>>>
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379
>>> (XEN) ****************************************
>>>
>>> To work around this, remove the ASSERT and introduce a check for the
>>> case where npc->unit is NULL and simply return false from
>>> unit_deassign().
>>
>> I think the correct fix would be to call unit_deassign() from
>> null_unit_remove() only, if npc->unit isn't NULL. Dario might have a
>> different opinion, though. :-)

Yes, this seems cleaner to me, thanks for the suggestion. I did a quick test, 
and this approach works to avoid the crash too. I'll wait a few days in case 
anyone else wants to chime in, and if there aren't any more comments I'll send 
out a new patch following this suggestion.

> Furthermore, even if the proposed solution was (roughly) followed, ...
> 
>>> --- a/xen/common/sched/null.c
>>> +++ b/xen/common/sched/null.c
>>> @@ -376,7 +376,14 @@ static bool unit_deassign(struct null_private *prv, 
>>> const struct sched_unit *uni
>>>       struct null_pcpu *npc = get_sched_res(cpu)->sched_priv;
>>>
>>>       ASSERT(list_empty(&null_unit(unit)->waitq_elem));
>>> -    ASSERT(npc->unit == unit);
>>> +
>>> +    if ( !npc->unit )
>>> +    {
>>> +        dprintk(XENLOG_G_INFO, "%d <-- NULL (%pdv%d)\n", cpu, unit->domain,
>>> +                unit->unit_id);
>>> +        return false;
>>> +    }
>>> +
> 
> ... shouldn't the assertion be kept, with the new if() inserted ahead of
> it? Plus the log message probably better wouldn't print a unit ID like a
> vCPU one, but instead use e.g. %pdu%u?

Sure, although, with Juergen's suggested fix in null_unit_remove(), I think we 
could simply drop this snippet and leave unit_deassign() unmodified.

Your suggested print format is an improvement, but perhaps it would be better 
suited for a separate patch since there are several more instances throughout 
null.c that would also want to be changed.

Example with %pdv%d:
# xl create ...
(XEN) common/sched/null.c:355: 3 <-- d1v0
# xl destroy ...
(XEN) common/sched/null.c:385: 3 <-- NULL (d1v0)

Example with %pdu%u:
# xl create ...
(XEN) common/sched/null.c:355: 3 <-- d1u0
# xl destroy ...
(XEN) common/sched/null.c:385: 3 <-- NULL (d1u0)



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.