Xen project Mailing List

Re: dom0 Linux 5.8-rc5 kernel failing to initialize cooling maps for Allwinner H6 SoC

From: Alejandro <alejandro.gonzalez.correo@xxxxxxxxx>

Date: Fri, 24 Jul 2020 13:20:10 +0200

Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Volodymyr Babchuk <Volodymyr_Babchuk@xxxxxxxx>, Andre Przywara <andre.przywara@xxxxxxx>

Delivery-date: Fri, 24 Jul 2020 11:20:34 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hello, and thanks for the response.

On Fri, 24 Jul 2020 at 12:45, Julien Grall (<julien@xxxxxxx>) wrote:
> > I'm trying Xen 4.13.1 on an Allwinner H6 SoC (more precisely a Pine H64
> > model B, with an ARM Cortex-A53 CPU).
> > I managed to get a dom0 Linux 5.8-rc5 kernel running fine, unpatched,
> > and I'm using the upstream device tree for my board. However, the dom0
> > kernel has trouble when reading some DT nodes that are related to the
> > CPUs, and it can't initialize the thermal subsystem properly, which is
> > a kind of showstopper for me, because I'm concerned that letting the
> > CPU run at the maximum frequency without watching its temperature may
> > cause overheating.
>
> I understand this concern, I am aware of some efforts to get CPUFreq
> working on Xen but I am not sure if there is anything available yet. I
> have CCed a couple more people that may be able to help here.

Thank you for the CCs. I hope they can bring some insight into this :)

> > The relevant kernel messages are:
> >
> > [ +0.001959] sun50i-cpufreq-nvmem: probe of sun50i-cpufreq-nvmem failed with error -2
> > ...
> > [ +0.003053] hw perfevents: failed to parse interrupt-affinity[0] for pmu
> > [ +0.000043] hw perfevents: /pmu: failed to register PMU devices!
> > [ +0.000037] armv8-pmu: probe of pmu failed with error -22
>
> I am not sure the PMU failure is related to the thermal failure below.

I'm not sure either, but after comparing the kernel messages for a boot
with and without Xen, those were the differences (excluding, of course,
the messages that report that the Xen hypervisor console is being used
and such). For the sake of completeness, I decided to mention it anyway.

> > [ +0.000163] OF: /thermal-zones/cpu-thermal/cooling-maps/map0: could not find phandle
> > [ +0.000063] thermal_sys: failed to build thermal zone cpu-thermal: -22
>
> Would it be possible to paste the device-tree node for
> /thermal-zones/cpu-thermal/cooling-maps? I suspect the issue is because
> we recreated /cpus from scratch.
>
> I don't know much about how the thermal subsystem works, but I suspect
> this will not be enough to get it working properly on Xen. For a
> workaround, you would need to create a dom0 with the same number of
> vCPUs as the number of pCPUs. They would also need to be pinned.
>
> I will leave the others to fill in more details.

I think I should mention that I've tried to hackily fix things by
removing the make_cpus_node call in handle_node
(https://github.com/xen-project/xen/blob/master/xen/arch/arm/domain_build.c#L1585),
after removing the /cpus node from the skip_matches array. This way, the
original /cpus node was passed through, without being recreated by Xen.
Of course, I made sure that dom0 used the same number of vCPUs as pCPUs,
because otherwise things would probably blow up; luckily, that was not a
compromise for me. The end result was that the aforementioned kernel
error messages were gone, and the thermal subsystem worked fine again.
However, this time the cpufreq-dt probe failed, with what I think was an
ENODEV error. This left the CPU locked at the boot frequency of less
than 1 GHz, compared to the maximum 1.8 GHz frequency that the SoC
supports, which has bad implications for performance.

Therefore, as it seems that passing more properties (like
#cooling-cells) is enough to get temperatures working, I suspect that
fixing the thermal issue is relatively easy, at least for my SoC. But
maybe I have just been lucky and that's not supposed to work anyway;
I'm not sure.

> > I've searched for issues, code or commits that may be related to this
> > issue. The most relevant things I found are:
> >
> > - A patch that blacklists the A53 PMU:
> >   https://patchwork.kernel.org/patch/10899881/
> > - The handle_node function in xen/arch/arm/domain_build.c:
> >   https://github.com/xen-project/xen/blob/master/xen/arch/arm/domain_build.c#L1427
>
> I remember this discussion. The problem was that the PMU is using
> per-CPU interrupts. Xen is not yet able to handle PPIs as they often
> require more context to be saved/restored (in this case the PMU context).
>
> There was a proposal to look if a device is using PPIs and just remove
> them from the Device-Tree. Unfortunately, I haven't seen any official
> submission for this patch.
>
> Did you have to apply the patch to boot up? If not, then the error above
> shouldn't be a concern. However, if you need PMU support for using the
> thermal devices then it is going to require some work.

No, I didn't apply any patch to Xen whatsoever. It worked fine out of
the box. As I mentioned above, with a more complete /cpus node
declaration, the thermal subsystem works. I guess the PMU worked fine
too, but I didn't test it in any way, so maybe it is just barely able
to probe successfully somehow.

> > I've thought about removing "/cpus" from the skip_matches array in the
> > handle_node function, but I'm not sure that would be a good fix.
>
> The node "/cpus" and its sub-nodes are recreated by Xen for Dom0. This is
> because Dom0 may have a different number of vCPUs and it doesn't see
> the pCPUs.
>
> If you don't skip "/cpus" from the host DT then you would end up with
> two "/cpus" paths in your dom0 DT. Most likely, Linux will not be happy
> with it.

Indeed, that is consistent with my observations of how the source code
works. Thanks for the confirmation :)

> I vaguely remember some discussions on how to deal with CPUFreq in Xen.
> IIRC we agreed that Dom0 should be part of the equation because it
> already contains all the drivers. However, I can't remember if we agreed
> how the dom0 would be made aware of the pCPUs.

That makes sense. Supporting every existing thermal and cpufreq method
in every ARM SoC seems like a lot of unneeded duplication of work, given
that Linux already has pretty good support for that. But, if that's the
case, I guess we should not mark the "dom0-kernel" cpufreq boot
parameter as deprecated in the documentation, at least for the ARM
platform:
http://xenbits.xen.org/docs/unstable/misc/xen-command-line.html#cpufreq
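
As a rough illustration of the "could not find phandle" error discussed above, here is a toy Python model of the suspected failure mode: when /cpus is rebuilt with only minimal properties, the references held by the cooling map no longer resolve. This is only a sketch under stated assumptions, not Xen or Linux code. The dict-based tree, the helper names (`recreate_cpus`, `dangling_phandle_refs`) and the phandle values are all hypothetical, and `cooling-device` is simplified to a plain list of phandles without its extra cells.

```python
# Toy model only: a device tree is a dict mapping node paths to property
# dicts, and a phandle is just an integer. All names here are hypothetical.

def dangling_phandle_refs(tree, ref_props=("cooling-device",)):
    """Return (path, prop, phandle) for references that no node defines."""
    defined = {props["phandle"] for props in tree.values() if "phandle" in props}
    missing = []
    for path, props in sorted(tree.items()):
        for prop in ref_props:
            for ref in props.get(prop, ()):
                if ref not in defined:
                    missing.append((path, prop, ref))
    return missing


def recreate_cpus(tree, n_vcpus):
    """Mimic the suspected behaviour: drop host /cpus, emit minimal vCPU nodes."""
    new_tree = {path: props for path, props in tree.items()
                if path != "/cpus" and not path.startswith("/cpus/")}
    new_tree["/cpus"] = {}
    for i in range(n_vcpus):
        # Assumption: only the bare minimum is emitted, so the original
        # phandle and #cooling-cells properties are lost.
        new_tree[f"/cpus/cpu@{i}"] = {"device_type": "cpu", "reg": i}
    return new_tree


MAP0 = "/thermal-zones/cpu-thermal/cooling-maps/map0"

host = {
    "/cpus": {},
    "/cpus/cpu@0": {"device_type": "cpu", "reg": 0, "phandle": 1, "#cooling-cells": 2},
    "/cpus/cpu@1": {"device_type": "cpu", "reg": 1, "phandle": 2, "#cooling-cells": 2},
    MAP0: {"cooling-device": [1, 2]},
}

dom0 = recreate_cpus(host, n_vcpus=2)

print(dangling_phandle_refs(host))  # no dangling references in the host tree
print(dangling_phandle_refs(dom0))  # map0 now points at phandles that are gone
```

In this model the host tree has no dangling references, while the rebuilt tree reports both `cooling-device` entries of map0 as unresolved. That would be consistent with the observation that passing through more of the original /cpus properties made the thermal errors disappear, although the real cause may differ.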
