|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [RFC PATCH] xen/sched/null: avoid crash after failed domU creation
On 4/25/23 03:42, Jan Beulich wrote:
> On 25.04.2023 08:36, Juergen Gross wrote:
>> On 24.04.23 23:00, Stewart Hildebrand wrote:
>>> When creating a domU, but the creation fails, we may end up in a state
>>> where a vcpu has not yet been added to the null scheduler, but
>>> unit_deassign() is invoked.
>>
>> This is not really true. The vcpu has been added, but it was offline at
>> that time. This resulted in null_unit_insert() returning early and not
>> calling unit_assign().
>>
>> Later the vcpu was onlined during XEN_DOMCTL_setvcpucontext handling,
>> resulting in null_unit_remove() calling unit_deassign().
Makes sense. I'll reword the message in the next revision.
>>> In this case, when running a debug build of
>>> Xen, we will hit an ASSERT and crash Xen:
>>>
>>> (XEN) ****************************************
>>> (XEN) Panic on CPU 0:
>>> (XEN) Assertion 'npc->unit == unit' failed at common/sched/null.c:379
>>> (XEN) ****************************************
>>>
>>> To work around this, remove the ASSERT and introduce a check for the
>>> case where npc->unit is NULL and simply return false from
>>> unit_deassign().
>>
>> I think the correct fix would be to call unit_deassign() from
>> null_unit_remove() only, if npc->unit isn't NULL. Dario might have a
>> different opinion, though. :-)
Yes, this seems cleaner to me, thanks for the suggestion. I did a quick test,
and this approach works to avoid the crash too. I'll wait a few days in case
anyone else wants to chime in, and if there aren't any more comments I'll send
out a new patch following this suggestion.
> Furthermore, even if the proposed solution was (roughly) followed, ...
>
>>> --- a/xen/common/sched/null.c
>>> +++ b/xen/common/sched/null.c
>>> @@ -376,7 +376,14 @@ static bool unit_deassign(struct null_private *prv,
>>> const struct sched_unit *uni
>>> struct null_pcpu *npc = get_sched_res(cpu)->sched_priv;
>>>
>>> ASSERT(list_empty(&null_unit(unit)->waitq_elem));
>>> - ASSERT(npc->unit == unit);
>>> +
>>> + if ( !npc->unit )
>>> + {
>>> + dprintk(XENLOG_G_INFO, "%d <-- NULL (%pdv%d)\n", cpu, unit->domain,
>>> + unit->unit_id);
>>> + return false;
>>> + }
>>> +
>
> ... shouldn't the assertion be kept, with the new if() inserted ahead of
> it? Plus the log message probably better wouldn't print a unit ID like a
> vCPU one, but instead use e.g. %pdu%u?
Sure, although, with Juergen's suggested fix in null_unit_remove(), I think we
could simply drop this snippet and leave unit_deassign() unmodified.
Your suggested print format is an improvement, but perhaps it would be better
suited for a separate patch since there are several more instances throughout
null.c that would also want to be changed.
Example with %pdv%d:
# xl create ...
(XEN) common/sched/null.c:355: 3 <-- d1v0
# xl destroy ...
(XEN) common/sched/null.c:385: 3 <-- NULL (d1v0)
Example with %pdu%u:
# xl create ...
(XEN) common/sched/null.c:355: 3 <-- d1u0
# xl destroy ...
(XEN) common/sched/null.c:385: 3 <-- NULL (d1u0)
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |