Xen project Mailing List

Re: [Xen-devel] [PATCH v3 10/10] xen/arm: Enable errata for secondary CPU on hotplug after the boot

To: Julien Grall <julien.grall@xxxxxxx>, Dario Faggioli <dfaggioli@xxxxxxxx>

From: Mirela Simonovic <mirela.simonovic@xxxxxxxxxx>

Date: Thu, 10 May 2018 13:57:25 +0200

Cc: "Edgar E. Iglesias" <edgar.iglesias@xxxxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Davorin Mista <dm@xxxxxxxxxx>, Xen Devel <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Thu, 10 May 2018 11:57:39 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hi,

+Dario

On Wed, May 9, 2018 at 6:32 PM, Julien Grall <julien.grall@xxxxxxx> wrote:
>
> On 09/05/18 16:48, Mirela Simonovic wrote:
>>
>> Hi Julien,
>
> Hi Mirela,
>
>> On Mon, Apr 30, 2018 at 6:09 PM, Julien Grall <julien.grall@xxxxxxx>
>> wrote:
>>>
>>> Hi Mirela,
>>>
>>> On 27/04/18 18:12, Mirela Simonovic wrote:
>>>>
>>>> On boot, enabling errata workarounds will be triggered by the boot CPU
>>>> from start_xen(). On CPU hotplug (non-boot scenario) this would not be
>>>> done. This patch adds the code required to enable errata workarounds
>>>> for a CPU being hotplugged after the system boots. This is triggered
>>>> using a notifier. If the CPU fails to enable the errata Xen will panic.
>>>> This is done because it is assumed that the CPU which is hotplugged
>>>> after the system/Xen boots, was initially hotplugged during the
>>>> system/Xen boot. Therefore, enabling errata workarounds should never
>>>> fail.
>>>>
>>>> Signed-off-by: Mirela Simonovic <mirela.simonovic@xxxxxxxxxx>
>>>>
>>>> ---
>>>> CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
>>>> CC: Julien Grall <julien.grall@xxxxxxx>
>>>> ---
>>>>  xen/arch/arm/cpuerrata.c         | 35 +++++++++++++++++++++++++++++++++++
>>>>  xen/arch/arm/cpufeature.c        | 23 +++++++++++++++++++++++
>>>>  xen/include/asm-arm/cpufeature.h |  1 +
>>>>  3 files changed, 59 insertions(+)
>>>>
>>>> diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c
>>>> index 1baa20654b..4040f781ec 100644
>>>> --- a/xen/arch/arm/cpuerrata.c
>>>> +++ b/xen/arch/arm/cpuerrata.c
>>>> @@ -5,6 +5,8 @@
>>>>  #include <xen/spinlock.h>
>>>>  #include <xen/vmap.h>
>>>>  #include <xen/warning.h>
>>>> +#include <xen/notifier.h>
>>>> +#include <xen/cpu.h>
>>>>  #include <asm/cpufeature.h>
>>>>  #include <asm/cpuerrata.h>
>>>>  #include <asm/psci.h>
>>>> @@ -349,6 +351,39 @@ void __init enable_errata_workarounds(void)
>>>>      enable_cpu_capabilities(arm_errata);
>>>>  }
>>>>
>>>> +static int cpu_errata_callback(
>>>> +    struct notifier_block *nfb, unsigned long action, void *hcpu)
>>>> +{
>>>> +    switch ( action )
>>>> +    {
>>>> +    case CPU_STARTING:
>>>> +        enable_nonboot_cpu_caps(arm_errata);
>>>> +        break;
>>>> +    default:
>>>> +        break;
>>>> +    }
>>>> +
>>>> +    return NOTIFY_DONE;
>>>> +}
>>>> +
>>>> +static struct notifier_block cpu_errata_nfb = {
>>>> +    .notifier_call = cpu_errata_callback,
>>>> +};
>>>> +
>>>> +static int __init cpu_errata_notifier_init(void)
>>>> +{
>>>> +    register_cpu_notifier(&cpu_errata_nfb);
>>>> +    return 0;
>>>> +}
>>>> +/*
>>>> + * Initialization has to be done at init rather than presmp_init phase because
>>>> + * the callback should execute only after the secondary CPUs are initially
>>>> + * booted (in hotplug scenarios when the system state is not boot). On boot,
>>>> + * the enabling of errata workarounds will be triggered by the boot CPU from
>>>> + * start_xen().
>>>> + */
>>>> +__initcall(cpu_errata_notifier_init);
>>>> +
>>>>  /*
>>>>   * Local variables:
>>>>   * mode: C
>>>> diff --git a/xen/arch/arm/cpufeature.c b/xen/arch/arm/cpufeature.c
>>>> index 525b45e22f..dd30f0d29c 100644
>>>> --- a/xen/arch/arm/cpufeature.c
>>>> +++ b/xen/arch/arm/cpufeature.c
>>>> @@ -68,6 +68,29 @@ void __init enable_cpu_capabilities(const struct arm_cpu_capabilities *caps)
>>>>      }
>>>>  }
>>>>
>>>> +/* Run through the enabled capabilities and enable() them on the calling CPU */
>>>> +void enable_nonboot_cpu_caps(const struct arm_cpu_capabilities *caps)
>>>> +{
>>>> +    ASSERT(system_state != SYS_STATE_boot);
>>>> +
>>>> +    for ( ; caps->matches; caps++ )
>>>> +    {
>>>> +        if ( !cpus_have_cap(caps->capability) )
>>>> +            continue;
>>>> +
>>>> +        if ( caps->enable )
>>>> +        {
>>>> +            /*
>>>> +             * Since the CPU has enabled errata workarounds on boot, it should
>>>
>>> This function is not really about errata, it is about capabilities. Errata
>>> is just a sub-category of them.
>>>
>>
>> I've fixed the comment, thanks.
>>
>>>> +             * never fail to enable them here.
>>>> +             */
>>>> +            if ( caps->enable((void *)caps) )
>>>> +                panic("CPU%u failed to enable capability %u\n",
>>>> +                      smp_processor_id(), caps->capability);
>>>
>>> We should really avoid using panic(...) if this is something the system can
>>> survive. In that specific case, it would only affect the current CPU. So it
>>> would be better to return an error and let the caller decide what to do.
>>>
>>
>> I need to emphasize two points:
>> 1) I don't see how this is different compared to PSCI CPU OFF where we
>> do panic. Essentially, in both cases the system will not be able to
>> use that CPU and we already agreed that is a good reason to panic.
>
> You can't compare PSCI CPU off and the enable callback failing. The *only*
> reason PSCI CPU off can fail is because the Trusted OS is resident on that
> CPU. If that ever happens it is a programming error in Xen, and it makes
> sense to fail because you don't want that CPU to spin in Xen.
>
> Enabling a capability can fail because of a failure of allocating memory or
> mapping (see spectre workaround). It is not a programming error but an
> expected behavior and it is not a valid reason to assume we want to kill the
> system.
>
>> As opposed to CPU_OFF, which wasn't called on boot so we indeed have no
>> idea whether it will pass on suspend, no matter how unlikely it could
>> fail, in this scenario we are sure that enabling capability should
>> pass because it already passed on boot. So if it doesn't pass, which I
>> consider to be impossible, I believe we should panic.
>> On the other hand, I understand how this would make a difference on
>> big.LITTLE where you try to hotplug a CPU that was never booted.
>> However, that scenario is out of this scope.
>
> While I agree that big.LITTLE is out of scope of your series, what I ask has
> nothing to do with big.LITTLE. There are valid reasons for the enable
> callback to fail whether it is the case today or not.
>
>> 2) I still wanted to give a chance to your proposal and just convert
>> panic into stop_cpu+printing error. The system cannot survive if
>> enabling a capability fails. In order to test this I added a
>> capability that will always fail after the boot. This is not realistic
>> in my opinion, but I used it only for testing to check whether the
>> system will survive. Instead of panic I printed an error and stopped
>> the CPU. However, Xen crashed. The boot CPU properly concluded that
>> the erroneous CPU will never become online, but later on the credit
>> scheduler's assertion fails.
>
> Please provide more details.
>
>> I believe this is something that a person
>> who adds big.LITTLE support should deal with.
>
> If there is a bug in the scheduler it should be fixed rather than trying to
> work around it with a panic in the code. If you provide more details, we might
> be able to help here.
>

This flow seems to have several bugs. Let's start from here:
Please take a look at the function cpu_schedule_callback in schedule.c.
Within the switch, case CPU_DEAD doesn't have a break, causing the
CPU_UP_CANCELED case below it to execute as well when the CPU goes down.
This looks wrong to me. Dario, could you please confirm that this is a
bug? Otherwise, could you please explain the reasoning behind it?

Thanks,
Mirela

>>
>> Do we have an agreement to keep panic?
>
> I am afraid not, panic (and BUG*) should only be used when there is no way
> to come back or it is a programming error to end up here. I don't think this
> is the case with the information I have in hand.
>
> The two solutions I find acceptable would be:
>     1) Log a warning and ignore the error. Likely your CPU will break
> later on.
>     2) Return an error and let the caller deal with it. The caller might
> decide to kill the system, but that's not our business. This function should
> only report an error.
>
> Cheers,
>
> --
> Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
