[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v3 10/10] xen/arm: Enable errata for secondary CPU on hotplug after the boot





On 09/05/18 16:48, Mirela Simonovic wrote:
Hi Julien,

Hi Mirela,

On Mon, Apr 30, 2018 at 6:09 PM, Julien Grall <julien.grall@xxxxxxx> wrote:
Hi Mirela,


On 27/04/18 18:12, Mirela Simonovic wrote:

On boot, enabling errata workarounds will be triggered by the boot CPU
from start_xen(). On CPU hotplug (non-boot scenario) this would not be
done. This patch adds the code required to enable errata workarounds
for a CPU being hotplugged after the system boots. This is triggered
using a notifier. If the CPU fails to enable the errata Xen will panic.
This is done because it is assumed that the CPU which is hotplugged
after the system/Xen boots, was initially hotplugged during the
system/Xen boot. Therefore, enabling errata workarounds should never
fail.

Signed-off-by: Mirela Simonovic <mirela.simonovic@xxxxxxxxxx>

---
CC: Stefano Stabellini <sstabellini@xxxxxxxxxx>
CC: Julien Grall <julien.grall@xxxxxxx>
---
   xen/arch/arm/cpuerrata.c         | 35
+++++++++++++++++++++++++++++++++++
   xen/arch/arm/cpufeature.c        | 23 +++++++++++++++++++++++
   xen/include/asm-arm/cpufeature.h |  1 +
   3 files changed, 59 insertions(+)

diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c
index 1baa20654b..4040f781ec 100644
--- a/xen/arch/arm/cpuerrata.c
+++ b/xen/arch/arm/cpuerrata.c
@@ -5,6 +5,8 @@
   #include <xen/spinlock.h>
   #include <xen/vmap.h>
   #include <xen/warning.h>
+#include <xen/notifier.h>
+#include <xen/cpu.h>
   #include <asm/cpufeature.h>
   #include <asm/cpuerrata.h>
   #include <asm/psci.h>
@@ -349,6 +351,39 @@ void __init enable_errata_workarounds(void)
       enable_cpu_capabilities(arm_errata);
   }
   +static int cpu_errata_callback(
+    struct notifier_block *nfb, unsigned long action, void *hcpu)
+{
+    switch ( action )
+    {
+    case CPU_STARTING:
+        enable_nonboot_cpu_caps(arm_errata);
+        break;
+    default:
+        break;
+    }
+
+    return NOTIFY_DONE;
+}
+
+static struct notifier_block cpu_errata_nfb = {
+    .notifier_call = cpu_errata_callback,
+};
+
+static int __init cpu_errata_notifier_init(void)
+{
+    register_cpu_notifier(&cpu_errata_nfb);
+    return 0;
+}
+/*
+ * Initialization has to be done at init rather than presmp_init phase
because
+ * the callback should execute only after the secondary CPUs are
initially
+ * booted (in hotplug scenarios when the system state is not boot). On
boot,
+ * the enabling of errata workarounds will be triggered by the boot CPU
from
+ * start_xen().
+ */
+__initcall(cpu_errata_notifier_init);
+
   /*
    * Local variables:
    * mode: C
diff --git a/xen/arch/arm/cpufeature.c b/xen/arch/arm/cpufeature.c
index 525b45e22f..dd30f0d29c 100644
--- a/xen/arch/arm/cpufeature.c
+++ b/xen/arch/arm/cpufeature.c
@@ -68,6 +68,29 @@ void __init enable_cpu_capabilities(const struct
arm_cpu_capabilities *caps)
       }
   }
   +/* Run through the enabled capabilities and enable() them on the
calling CPU */
+void enable_nonboot_cpu_caps(const struct arm_cpu_capabilities *caps)
+{
+    ASSERT(system_state != SYS_STATE_boot);
+
+    for ( ; caps->matches; caps++ )
+    {
+        if ( !cpus_have_cap(caps->capability) )
+            continue;
+
+        if ( caps->enable )
+        {
+            /*
+             * Since the CPU has enabled errata workarounds on boot, it
should


This function is not really about errata, it is about capabilities. Errata
is just a sub-category of them.


I've fixed the comment, thanks.

+             * never fail to enable them here.
+             */
+            if ( caps->enable((void *)caps) )
+                panic("CPU%u failed to enable capability %u\n",
+                      smp_processor_id(), caps->capability);


We should really avoid to use panic(...) if this is something the system can
survive. In that specific case, it would only affect the current CPU. So it
would be better to return an error and let the caller decide what to do.


I need to emphasize two points:
1) I don't see how is this different compared to PSCI CPU OFF where we
do panic. Essentially, in both cases the system will not be able to
use that CPU and we already agreed that is a good reason to panic.

You can't compare PSCI CPU off and the enable callback failing. The *only* reason PSCI CPU off can fail is because the Trusted OS is resident on that CPU. If that ever happen it is a programming error on Xen, and it makes sense to fail because you don't want that CPU to spin in Xen.

Enabling a capability can fail because of a failure of allocating memory or mapping (see spectre workaround). It is not a programming error but an expected behavior and it is not a valid reason to assume we want to kill the system.

As oppose to CPU_OFF which wasn't called on boot so we indeed have no
idea whether it will pass on suspend, no matter how unlikely it could
fail, in this scenario we are sure that enabling capability should
pass because it already passed on boot. So if it doesn't pass, which I
consider to be impossible, I believe we should panic.
On the other hand, I understand how would this make a difference on
big.LITTLE where you try to hotplug a CPU that was never booted.
However, that scenario is out of this scope.
While I agree that big.LITTLE is out of scope of your series, what I ask has nothing to do with big.LITTLE. There are valid reason for the enable callback to fail whether it is the case today or not.


2) I still wanted to give a chance to your proposal and just convert
panic into stop_cpu+printing error. The system cannot survive if
enabling a capability fails. In order to test this I added a
capability that will always fail after the boot. This is not realistic
in my opinion, but I used it only for testing to check whether the
system will survive. Instead of panic I printed an error and stopped
the CPU. However, Xen crashed. The boot CPU properly concluded that
the erroneous CPU will never become online, but later on credit
scheduler's assertion fails.

Please provide more details.

I believe this is something that a person
who adds big.LITTLE support should deal with.

If there is a bug in the scheduler it should be fixed rather trying to workaround with a panic in the code. If you provide more details, we might be able to help here.


Do we have an agreement to keep panic?

I am afraid not, panic (and BUG*) should only be used when there are no way to come back or it is a programming error to end up here. I don't think this is the case with the information I have in hand.

The two solutions I find acceptable would be:
        1) Log a warning and ignore the error. Likely your CPU will break later 
on.
2) Return an error and let the caller deal with it. The caller might decide to kill the system, but that's not our business. This function should only report an error.

Cheers,

--
Julien Grall

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.