[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 10/10] xen/arm: Enable errata for secondary CPU on hotplug after the boot



On Thu, May 10, 2018 at 1:57 PM, Mirela Simonovic
<mirela.simonovic@xxxxxxxxxx> wrote:
> Hi,
>
> +Dario
>
> On Wed, May 9, 2018 at 6:32 PM, Julien Grall <julien.grall@xxxxxxx> wrote:
>>
>>
>> On 09/05/18 16:48, Mirela Simonovic wrote:
>>>
>>> Hi Julien,
>>
>>
>> Hi Mirela,
>>
>>
>>> On Mon, Apr 30, 2018 at 6:09 PM, Julien Grall <julien.grall@xxxxxxx>
>>> wrote:
>>>>
>>>> Hi Mirela,
>>>>
>>>>
>>>> On 27/04/18 18:12, Mirela Simonovic wrote:
>>>>>
>>>>>
>>>>> On boot, enabling errata workarounds will be triggered by the boot CPU
>>>>> from start_xen(). On CPU hotplug (non-boot scenario) this would not be
>>>>> done. This patch adds the code required to enable errata workarounds
>>>>> for a CPU being hotplugged after the system boots. This is triggered
>>>>> using a notifier. If the CPU fails to enable the errata Xen will panic.
>>>>> This is done because it is assumed that the CPU which is hotplugged
>>>>> after the system/Xen boots, was initially hotplugged during the
>>>>> system/Xen boot. Therefore, enabling errata workarounds should never
>>>>> fail.
>>>>>
>>>>> Signed-off-by: Mirela Simonovic <mirela.simonovic@xxxxxxxxxx>
>>>>>
>>>>> ---
>>>>> CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
>>>>> CC: Julien Grall <julien.grall@xxxxxxx>
>>>>> ---
>>>>>    xen/arch/arm/cpuerrata.c         | 35
>>>>> +++++++++++++++++++++++++++++++++++
>>>>>    xen/arch/arm/cpufeature.c        | 23 +++++++++++++++++++++++
>>>>>    xen/include/asm-arm/cpufeature.h |  1 +
>>>>>    3 files changed, 59 insertions(+)
>>>>>
>>>>> diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c
>>>>> index 1baa20654b..4040f781ec 100644
>>>>> --- a/xen/arch/arm/cpuerrata.c
>>>>> +++ b/xen/arch/arm/cpuerrata.c
>>>>> @@ -5,6 +5,8 @@
>>>>>    #include <xen/spinlock.h>
>>>>>    #include <xen/vmap.h>
>>>>>    #include <xen/warning.h>
>>>>> +#include <xen/notifier.h>
>>>>> +#include <xen/cpu.h>
>>>>>    #include <asm/cpufeature.h>
>>>>>    #include <asm/cpuerrata.h>
>>>>>    #include <asm/psci.h>
>>>>> @@ -349,6 +351,39 @@ void __init enable_errata_workarounds(void)
>>>>>        enable_cpu_capabilities(arm_errata);
>>>>>    }
>>>>>    +static int cpu_errata_callback(
>>>>> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
>>>>> +{
>>>>> +    switch ( action )
>>>>> +    {
>>>>> +    case CPU_STARTING:
>>>>> +        enable_nonboot_cpu_caps(arm_errata);
>>>>> +        break;
>>>>> +    default:
>>>>> +        break;
>>>>> +    }
>>>>> +
>>>>> +    return NOTIFY_DONE;
>>>>> +}
>>>>> +
>>>>> +static struct notifier_block cpu_errata_nfb = {
>>>>> +    .notifier_call = cpu_errata_callback,
>>>>> +};
>>>>> +
>>>>> +static int __init cpu_errata_notifier_init(void)
>>>>> +{
>>>>> +    register_cpu_notifier(&cpu_errata_nfb);
>>>>> +    return 0;
>>>>> +}
>>>>> +/*
>>>>> + * Initialization has to be done at init rather than presmp_init phase
>>>>> because
>>>>> + * the callback should execute only after the secondary CPUs are
>>>>> initially
>>>>> + * booted (in hotplug scenarios when the system state is not boot). On
>>>>> boot,
>>>>> + * the enabling of errata workarounds will be triggered by the boot CPU
>>>>> from
>>>>> + * start_xen().
>>>>> + */
>>>>> +__initcall(cpu_errata_notifier_init);
>>>>> +
>>>>>    /*
>>>>>     * Local variables:
>>>>>     * mode: C
>>>>> diff --git a/xen/arch/arm/cpufeature.c b/xen/arch/arm/cpufeature.c
>>>>> index 525b45e22f..dd30f0d29c 100644
>>>>> --- a/xen/arch/arm/cpufeature.c
>>>>> +++ b/xen/arch/arm/cpufeature.c
>>>>> @@ -68,6 +68,29 @@ void __init enable_cpu_capabilities(const struct
>>>>> arm_cpu_capabilities *caps)
>>>>>        }
>>>>>    }
>>>>>    +/* Run through the enabled capabilities and enable() them on the
>>>>> calling CPU */
>>>>> +void enable_nonboot_cpu_caps(const struct arm_cpu_capabilities *caps)
>>>>> +{
>>>>> +    ASSERT(system_state != SYS_STATE_boot);
>>>>> +
>>>>> +    for ( ; caps->matches; caps++ )
>>>>> +    {
>>>>> +        if ( !cpus_have_cap(caps->capability) )
>>>>> +            continue;
>>>>> +
>>>>> +        if ( caps->enable )
>>>>> +        {
>>>>> +            /*
>>>>> +             * Since the CPU has enabled errata workarounds on boot, it
>>>>> should
>>>>
>>>>
>>>>
>>>> This function is not really about errata, it is about capabilities.
>>>> Errata
>>>> is just a sub-category of them.
>>>>
>>>
>>> I've fixed the comment, thanks.
>>>
>>>>> +             * never fail to enable them here.
>>>>> +             */
>>>>> +            if ( caps->enable((void *)caps) )
>>>>> +                panic("CPU%u failed to enable capability %u\n",
>>>>> +                      smp_processor_id(), caps->capability);
>>>>
>>>>
>>>>
>>>> We should really avoid to use panic(...) if this is something the system
>>>> can
>>>> survive. In that specific case, it would only affect the current CPU. So
>>>> it
>>>> would be better to return an error and let the caller decide what to do.
>>>>
>>>
>>> I need to emphasize two points:
>>> 1) I don't see how is this different compared to PSCI CPU OFF where we
>>> do panic. Essentially, in both cases the system will not be able to
>>> use that CPU and we already agreed that is a good reason to panic.
>>
>>
>> You can't compare PSCI CPU off and the enable callback failing. The *only*
>> reason PSCI CPU off can fail is because the Trusted OS is resident on that
>> CPU. If that ever happen it is a programming error on Xen, and it makes
>> sense to fail because you don't want that CPU to spin in Xen.
>>
>> Enabling a capability can fail because of a failure of allocating memory or
>> mapping (see spectre workaround). It is not a programming error but an
>> expected behavior and it is not a valid reason to assume we want to kill the
>> system.
>>
>>> As oppose to CPU_OFF which wasn't called on boot so we indeed have no
>>> idea whether it will pass on suspend, no matter how unlikely it could
>>> fail, in this scenario we are sure that enabling capability should
>>> pass because it already passed on boot. So if it doesn't pass, which I
>>> consider to be impossible, I believe we should panic.
>>> On the other hand, I understand how would this make a difference on
>>> big.LITTLE where you try to hotplug a CPU that was never booted.
>>> However, that scenario is out of this scope.
>>
>> While I agree that big.LITTLE is out of scope of your series, what I ask has
>> nothing to do with big.LITTLE. There are valid reason for the enable
>> callback to fail whether it is the case today or not.
>>
>>>
>>> 2) I still wanted to give a chance to your proposal and just convert
>>> panic into stop_cpu+printing error. The system cannot survive if
>>> enabling a capability fails. In order to test this I added a
>>> capability that will always fail after the boot. This is not realistic
>>> in my opinion, but I used it only for testing to check whether the
>>> system will survive. Instead of panic I printed an error and stopped
>>> the CPU. However, Xen crashed. The boot CPU properly concluded that
>>> the erroneous CPU will never become online, but later on credit
>>> scheduler's assertion fails.
>>
>>
>> Please provide more details.
>>
>>> I believe this is something that a person
>>> who adds big.LITTLE support should deal with.
>>
>>
>> If there is a bug in the scheduler it should be fixed rather trying to
>> workaround with a panic in the code. If you provide more details, we might
>> be able to help here.
>>
>
> This flow seems to have several bugs. Lets start from here:
>
> Please take a look at function cpu_schedule_callback in schedule.c.
> Within switch, case CPU_DEAD doesn't have a break, causing the bellow
> CPU_UP_CANCELED to execute as well when the CPU goes down. This looks
> wrong to me.
> Dario, could you please confirm that this is a bug? Otherwise could
> you please confirm the reasoning beyond?
>

Dario sorry, this looked wrong in my scenario but actually it is
correct. I understand the purpose of the missing break now.

I'll try to explain what was the issue when non-boot CPU being
hotplugged fails to enable capabilities. For those who are not
interested in details - the issue is fixed and will be submitted as
part of this patch in the v4 of this series.

For the curious ones (if any) here is detailed description: errata
notifier added in this patch had the same priority as scheduler
notifier. I though priority doesn't matter, but I was wrong. In this
particular scenario where a CPU fails to enable capabilities
(triggered by errata notifier added in this patch), scheduler callback
executed before the errata callback for CPU_STARTING event. In other
words, scheduler already called init_pdata before the errata callback
fired (and stopped the CPU).
Later on when errata callback fired, enabling of capabilities has
failed, so the erroneous non-boot CPU stopped itself and never
declared to be online.
Then CPU#0 fired new notification with CPU_UP_CANCELED event in order
to clean up for the job done on CPU_STARTING. However, this broke the
assumption (which is good) made in cpu_schedule_callback. The
assumption is that the sequence of steps should be
alloc_pdata-->init_pdata-->deinit_pdata-->free_pdata. In this
particular case deinit_pdata was not done because this would be done
only upon CPU_DEAD event which makes no sense in this scenario.
In order to avoid running into the invalid scenario described above,
the errata callback should fire before the scheduler callback. If
enabling capabilities fails, the scheduler callback for CPU_STARTING
will never execute afterwards, so the following CPU_UP_CANCELED event
triggered by the CPU#0 will do free_pdata, which is ok because
init_pdata was not executed and alloc_pdata-->free_pdata flow is also
valid. Congratulations to the reader who reached this point :)

I have tested the tuned scenario where enabling capabilities fails -
the erroneous CPU is stopped/powered down and the system continues to
work fine without it. Although I still don't understand why indeed I
needed to debug this I'll add the fix in v4.

Thanks,
Mirela

> Thanks,
> Mirela
>
>>>
>>> Do we have an agreement to keep panic?
>>
>>
>> I am afraid not, panic (and BUG*) should only be used when there are no way
>> to come back or it is a programming error to end up here. I don't think this
>> is the case with the information I have in hand.
>>
>> The two solutions I find acceptable would be:
>>         1) Log a warning and ignore the error. Likely your CPU will break
>> later on.
>>         2) Return an error and let the caller deal with it. The caller might
>> decide to kill the system, but that's not our business. This function should
>> only report an error.
>>
>> Cheers,
>>
>> --
>> Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.