[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] xen: arm: Don't use stop_cpu() in halt_this_cpu()



On Sat, 25 Jun 2022, Julien Grall wrote:
> On 24/06/2022 22:31, Stefano Stabellini wrote:
> > On Fri, 24 Jun 2022, Julien Grall wrote:
> > > On 23/06/2022 23:07, Stefano Stabellini wrote:
> > > > On Thu, 23 Jun 2022, dmitry.semenets@xxxxxxxxx wrote:
> > > > > From: Dmytro Semenets <dmytro_semenets@xxxxxxxx>
> > > > So wouldn't it be better to remove the panic from the implementation of
> > > > call_psci_cpu_off?
> > > 
> > > I have asked to keep the panic() in call_psci_cpu_off(). If you remove the
> > > panic() then we will hide the fact that the CPU was not properly turned
> > > off
> > > and will consume more energy than expected.
> > > 
> > > The WFI loop is fine when shutting down or rebooting because we know this
> > > will
> > > only happen for a short period of time.
> > 
> > Yeah, I don't think we should hide that CPU_OFF failed. I think we
> > should print a warning. However, given that we know CPU_OFF can
> > reasonably fail in normal conditions returning DENIED when a Trusted OS
> > is present, then I think we should not panic.
> 
> We know how to detect this condition (see section 5.9 in DEN0022D.b). So I
> would argue we should fix it properly rather than removing the panic().
> 
> > 
> > If there was a way to distinguish a failure because a Trusted OS is
> > present (the "normal" failure) from other failures, I would suggest to:
> > - print a warning if failed due to a Trusted OS being present
> > - panic in other cases
> > 
> > Unfortunately it looks like in all cases the return code is DENIED :-(
> I am confused. Per the spec, the only reason CPU_OFF can return DENIED is
> because the Trusted OS is resident. So what other cases are you talking about?
> 
> > 
> > 
> > Given that, I would not panic and only print a warning in all cases. Or
> > we could ASSERT which at least goes away in !DEBUG builds.
> 
> ASSERT() is definitely not way to deal with external input. I could possibly
> accept a WARN(), but see above.
> 
> > > > The reason I am saying this is that stop_cpu is called in a number of
> > > > places beyond halt_this_cpu and as far as I can tell any of them could
> > > > trigger the panic. (I admit they are unlikely places but still.)
> > > 
> > > This is one of the example where the CPU will not be stopped for a short
> > > period of time. We should deal with them differently (i.e. migrating the
> > > trusted OS or else) so we give all the chance for the CPU to be fully
> > > powered.
> > > 
> > > IMHO, this is a different issue and hence why I didn't ask Dmitry to solve
> > > it.
> > 
> > I see your point now. I was seeing the two things as one.
> > 
> > I think it is true that the WFI loop is likely to work. Also it is true
> > that from a power perspective it makes no different on power down or
> > reboot.  From that point of view this patch is OK.
> > 
> > But even on shut-down/reboot, why not do that as a fallback in case
> > CPU_OFF didn't work? It is going to work most of the times anyway, why
> > change the default for the few cases that doesn't work?
> 
> Because we would not be consistent how we would turn off a CPU on a system
> supporting PSCI. I would prefer to use the same method everywhere so it is
> easier to reason about.
> 
> I am also not sure how you define "most of the time". Yes it is possible that
> the boards we aware of will not have this issue, but how about the one we
> don't know?
> 
> > 
> > Given that this patch would work, I don't want to insist on this and let
> > you decide.
> > 
> > 
> > But even if we don't want to remove the panic as part of this patch, I
> > think we should remove the panic in a separate patch anyway, at least
> > until someone investigates and thinks of a strategy how to migrate the
> > TrustedOS as you suggested.
> If we accept this patch, then we remove the immediate pain. The other uses of
> stop_cpu() are in:
>       1) idle_loop(), this is reachable when turning off a CPU after boot
> (not supported on Arm)
>         2) start_secondary(), this is only used during boot (CPU hot-unplug is
> not supported)
> 
> Even if it would be possible to trigger the panic() in 2), I am not aware of
> an immediate issue there. So I think it would be the wrong approach to remove
> the panic() first and then investigate.
> 
> The advantage of the panic() is it will remind us that some needs to be fixed.
> With a warning (or WARN()) people will tend to ignore it.

I know that this specific code path (cpu off) is probably not super
relevant for what I am about to say, but as we move closer to safety
certifiability we need to get away from using "panic" and BUG_ON as a
reminder that more work is needed to have a fully correct implementation
of something.

I also see your point and agree that ASSERT is not acceptable for
external input but from my point of view panic is the same (slightly
worse because it doesn't go away in production builds).

The return value of CPU_OFF is "external input" but this patch would
make that problem go away for halt_this_cpu, and the other two call
sites are only relevant during boot.

So, although this is not my preference, I don't want to block this
patch. (I also think it is a lot better to move faster as a project
even with not-ideal implementations.)

Julien if you are going to ack the patch feel free to go ahead.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.