 
	
Re: [Xen-devel] [RFC PATCH 00/31] CPUFreq on ARM
 Hi, On 14/11/17 20:46, Oleksandr Tyshchenko wrote: > On Tue, Nov 14, 2017 at 12:49 PM, Andre Przywara > <andre.przywara@xxxxxxxxxx> wrote: >> Hi, > Hi Andre > >> >> On 13/11/17 19:40, Oleksandr Tyshchenko wrote: >>> On Mon, Nov 13, 2017 at 5:21 PM, Andre Przywara >>> <andre.przywara@xxxxxxxxxx> wrote: >>>> Hi, >>> Hi Andre, >>> >>>> >>>> thanks very much for your work on this! >>> Thank you for your comments. >>> >>>> >>>> On 09/11/17 17:09, Oleksandr Tyshchenko wrote: >>>>> From: Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx> >>>>> >>>>> Hi, all. >>>>> >>>>> The purpose of this RFC patch series is to add CPUFreq support to Xen on >>>>> ARM. >>>>> Motivation of hypervisor based CPUFreq is to enable one of the main PM >>>>> use-cases in virtualized system powered by Xen hypervisor. Rationale >>>>> behind this activity is that CPU virtualization is done by hypervisor and >>>>> the guest OS doesn't actually know anything about physical CPUs because >>>>> it is running on virtual CPUs. It is quite clear that a decision about >>>>> frequency change should be taken by hypervisor as only it has information >>>>> about actual CPU load. >>>> >>>> Can you please sketch your usage scenario or workloads here? I can think >>>> of quite different scenarios (oversubscribed server vs. partitioning >>>> RTOS guests, for instance). The usefulness of CPUFreq and the trade-offs >>>> in the design are quite different between those. >>> We keep embedded use-cases in mind. For example, it is a system with >>> several domains, >>> where one domain has most critical SW running on and other domain(s) >>> are, let say, for entertainment purposes. >>> I think, the CPUFreq is useful where power consumption is a question. >> >> Does the SoC you use allow different frequencies for each core? Or is it >> one frequency for all cores? Most x86 CPU allow different frequencies >> for each core, AFAIK. 
Just having the same OPP for the whole SoC might >> limit the usefulness of this approach in general. > Good question. All cores in a cluster share the same clock. It is > impossible to set different frequencies on the cores inside one > cluster. > >> >>>> In general I doubt that a hypervisor scheduling vCPUs is in a good >>>> position to make a decision on the proper frequency physical CPUs should >>>> run with. From all I know it's already hard for an OS kernel to make >>>> that call. So I would actually expect that guests provide some input, >>>> for instance by signalling OPP change request up to the hypervisor. This >>>> could then decide to act on it - or not. >>> Each running guest sees only part of the picture, but hypervisor has >>> the whole picture, it knows all about CPU, measures CPU load and able >>> to choose required CPU frequency to run on. >> >> But based on what data? All Xen sees is a vCPU trapping on MMIO, a >> hypercall or on WFI, for that matter. It does not know much more about >> the guest, especially it's rather clueless about what the guest OS >> actually intended to do. >> For instance Linux can track the actual utilization of a core by keeping >> statistics of runnable processes and monitoring their time slice usage. >> It can see that a certain process exhibits periodical, but bursty CPU >> usage, which may hint that is could run at lower frequency. Xen does not >> see this fine granular information. >> >>> I am wondering, does Xen >>> need additional input from guests for make a decision? >> >> I very much believe so. The guest OS is in a much better position to >> make that call. >> >>> BTW, currently guest domain on ARM doesn't even know how many physical >>> CPUs the system has and what are these OPPs. When creating guest >>> domain Xen inserts only dummy CPU nodes. All CPU info, such as clocks, >>> OPPs, thermal, etc are not passed to guest. >> >> Sure, because this is what virtualization is about. 
And I am not asking >> for unconditionally allowing any guest to change frequency. >> But there could be certain use cases where this could be considered: >> Think about your "critical SW" mentioned above, which is probably some >> RTOS, also possibly running on pinned vCPUs. For that >> (latency-sensitive) guest it might be well suited to run at a lower >> frequency for some time, but how should Xen know about this? >> "Normally" the best strategy to save power is to run as fast as >> possible, finish all outstanding work, then put the core to sleep. >> Because not running at all consumes much less energy than running at a >> reduced frequency. But this may not be suitable for an RTOS. > Saying "one domain has most critical SW running on" I meant hardware > domain/driver domain or even other > domain which perform some important tasks (disk, net, display, camera, > whatever) which treated by the whole system as critical > and must never fail. Other domains, for example, it might be Android > as well, are not critical at all from the system point of view. > Being honest, I haven't considered yet using CPUFreq in system where > some RT guest is present. > I think it is something that should be *thoroughly* investigated and > then worked out. Yes, as mentioned before there are quite different use cases with quite different requirements when it comes to DVFS. I believe the best would be to define typical scenarios, then assess the usefulness of CPUFreq separately for each one of them. Based on this we then should be able to make a decision. > I am not familiar with RT system requirements, I suppose, but not > entirely sure, that CPUFreq should use const > frequency for all cores the RT system is running on, or RT system > parameters should be recalculated each time the CPU frequency is being > changed > (in such case guest needs some input from Xen). > > Anyway, I got your point about some guest input. 
Could you, please, > describe how you think it should look like: > 1. Xen doesn't have CPUFreq logic at all. It only collects OPP change > requests from all guests and make > a decision based on these requests and maybe some policy for > prioritizing requests. Then it sends OPP change request to SCP. > 2. Xen has CPUFreq logic. In addition it can collect OPP change > requests from all guests and make > a decision based on both: it's own view and guest requests. Then it > sends OPP change request to SCP. I am leaning towards 1) conceptually. But if there is some kind of reasonable implementation of 2) already in Xen (for x86), this might be feasible as well. > Both variant implies that something like PV CPUFreq should be involved > with frontend drivers are located in guests. Am I correct? And here the SMC mailbox comes into play again, but with a twist. For guests we create SCPI, mailbox and shmem DT nodes, and use the SMC mailbox with: method = "hvc"; Xen's HVC handler then redirects this to the CPUFreq code. This would be platform agnostic for the guests, while making all CPUFreq requests end up in Xen. So there is no need for an extra PV protocol. >> So I think we would need a combined approach: >> a) Let an administrator (via tools running in Dom0) tell Xen about power >> management strategies to use for certain guests. An RTOS could be >> treated differently (lower, but constant frequency) than an >> "entertainment" guest (varying frequency, based on guest OS input), also >> differently than some background guest doing logging, OTA update, etc. >> (constant high frequency, but putting cores to sleep instead as often as >> possible). >> b) Allow some guests (based on policy from (a)) to signal CPUFreq change >> requests to the hypervisor. Xen takes those into account, though it may >> decide to not act immediately on it, because it is going to schedule >> another vCPU, for instance. >> c) Have some way of actually realising certain OPPs. 
This could be via >> an SCPI client in Xen, or some other way. Might be an implementation detail. > > Just to clarify if I got the main idea correct: > 1. Guests have CPUFreq logic, they send OPP change requests to Xen. > 2. Xen has CPUFreq logic too, but in additional it can take into the account > OPP > change requests from guests. Xen sends final OPP change request. > Is my understanding correct? Yes, I think this sounds like the most flexible. Xen's CPUFreq logic could be quite simple, possibly starting with some static assignment based on administrator input, e.g. given at guest creation time. It might not involve further runtime decisions. > Also "Different power management strategies to use for certain guests" > means that it should be > hard vCPU->pCPU pinning for each guest together with possibility in > Xen to have different CPUFreq governors > running at the same time (each governor for each CPU pool)? That would need to be worked out, but I suspect that CPU pinning might be *one* option for a certain class of guests. This would probably be related to the CPUFreq policy. Without pinning the decision might become quite involved: If Xen wants to migrate a vCPU to a different pCPU, it needs to take the different P-states into account, including the cost to change the OPP. I am not sure the benefit justifies the effort. Some numbers would help here. >>>>> Although these required components (CPUFreq core, governors, etc) already >>>>> exist in Xen, it is worth to mention that they are ACPI specific. So, a >>>>> part of the current patch series makes them more generic in order to make >>>>> possible a CPUFreq usage on architectures without ACPI support in. >>>> >>>> Have you looked at how this is used on x86 these days? Can you briefly >>>> describe how this works and it's used there? >>> Xen supports CPUFreq feature on x86 [1]. I don't know how widely it is >>> used at the moment, but it is another question. 
So, there are two >>> possible modes: Domain0 based CPUFreq and Hypervisor based CPUFreq >>> [2]. As I understand, the second option is more popular. >>> Two different implementations of "Hypervisor based CPUFreq" are >>> present: ACPI Processor P-States Driver and AMD Architectural P-state >>> Driver. You can find both them in xen/arch/x86/acpi/cpufreq/ dir. >>> >>> [1] >>> https://wiki.xenproject.org/wiki/Xen_power_management#CPU_P-states_.28cpufreq.29 >>> [2] >>> https://wiki.xenproject.org/wiki/Xen_power_management#Hypervisor_based_cpufreq >> >> Thanks for the research and the pointers, will look at it later. >> >>>>> But, the main question we have to answer is about frequency changing >>>>> interface in virtualized system. The frequency changing interface and all >>>>> dependent components which needed CPUFreq to be functional on ARM are not >>>>> present in Xen these days. The list of required components is quite big >>>>> and may change across different ARM SoC vendors. As an example, the >>>>> following components are involved in DVFS on Renesas Salvator-X board >>>>> which has R-Car Gen3 SoC installed: generic clock, regulator and thermal >>>>> frameworks, Vendor’s CPG, PMIC, AVS, THS drivers, i2c support, etc. >>>>> >>>>> We were considering a few possible approaches of hypervisor based >>>>> CPUFreqs on ARM and came to conclusion to base this solution on popular >>>>> at the moment, already upstreamed to Linux, ARM System Control and Power >>>>> Interface(SCPI) protocol [1]. We chose SCPI protocol instead of newer ARM >>>>> System Control and Management Interface (SCMI) protocol [2] since it is >>>>> widely spread in Linux, there are good examples how to use it, the range >>>>> of capabilities it has is enough for implementing hypervisor based >>>>> CPUFreq and, what is more, upstream Linux support for SCMI is missed so >>>>> far, but SCMI could be used as well. 
>>>>> >>>>> Briefly speaking, the SCPI protocol is used between the System Control >>>>> Processor(SCP) and the Application Processors(AP). The mailbox feature >>>>> provides a mechanism for inter-processor communication between SCP and >>>>> AP. The main purpose of SCP is to offload different PM related tasks from >>>>> AP and one of the services that SCP provides is Dynamic voltage and >>>>> frequency scaling (DVFS), it is what we actually need for CPUFreq. I will >>>>> describe this approach in details down the text. >>>>> >>>>> Let me explain a bit more what these possible approaches are: >>>>> >>>>> 1. “Xen+hwdom” solution. >>>>> GlobalLogic team proposed split model [3], where “hwdom-cpufreq” frontend >>>>> driver in Xen interacts with the “xen-cpufreq” backend driver in Linux >>>>> hwdom (possibly dom0) in order to scale physical CPUs. This solution >>>>> hasn’t been accepted by Xen community yet and seems it is not going to be >>>>> accepted without taking into the account still unanswered major questions >>>>> and proving that “all-in-Xen” solution, which Xen community considered as >>>>> more architecturally cleaner option, would be unworkable in practice. >>>>> The other reasons why we decided not to stick to this approach are >>>>> complex communication interface between Xen and hwdom: event channel, >>>>> hypercalls, syscalls, passing CPU info via DT, etc and possible >>>>> synchronization issues with a proposed solution. >>>>> Although it is worth to mention that the beauty of this approach was that >>>>> there wouldn’t be a need to port a lot of things to Xen. All frequency >>>>> changing interface and all dependent components which needed CPUFreq to >>>>> be functional were already in place. >>>> >>>> Stefano, Julien and I were thinking about this: Wouldn't it be possible >>>> to come up with some hardware domain, solely dealing with CPUFreq >>>> changes? This could run a Linux kernel, but no or very little userland. 
>>>> All its vCPUs would be pinned to pCPUs and would normally not be >>>> scheduled by Xen. If Xen wants to change the frequency, it schedules the >>>> respective vCPU to the right pCPU and passes down the frequency change >>>> request. Sounds a bit involved, though, and probably doesn't solve the >>>> problem where this domain needs to share access to hardware with Dom0 >>>> (clocks come to mind). >>> Yes, another question is how to get this Linux kernel stuff (backend, >>> top level driver, etc) upstreamed. >> >> Well, the idea would be to use already upstream drivers to actually >> implement OPP changes (via Linux clock and regulator drivers), then use >> existing interfaces like the userspace governor, for instance, to >> trigger those. I don't think we need much extra kernel code for that. > I understand. Backend in userspace sets desired frequency by request > from frontend in Xen. Yeah, something like that. It was just an idea, not fully thought through yet. >>>>> Although this approach is not used, still I picked a few already acked >>>>> patches which made ACPI specific CPUFreq stuff more generic. >>>>> >>>>> 2. “all-in-Xen” solution. >>>>> This implies that all CPUFreq related stuff should be located in Xen. >>>>> Community considered this solution as more architecturally cleaner option >>>>> than “Xen+hwdom” one. No layering violation comparing with the previous >>>>> approach (letting guest OS manage one or more physical CPUs is more of a >>>>> layering violation). >>>>> This solution looks better, but to be honest, we are not in favor of this >>>>> solution as well. We expect enormous developing effort to get this >>>>> support in (the scope of required components looks unreal) and maintain >>>>> it. So, we decided not to stick to this approach as well. >>>> >>>> Yes, I even think it's not feasible to implement this. 
With a modern >>>> clock implementation there is one driver to control *all* clocks of an >>>> SoC, so you can't single out the CPU clock easily, for instance. One >>>> would probably run into synchronisation issues, at best. >>>> >>>>> 3. “Xen+SCP(ARM TF)” solution. >>>>> It is yet another solution based on ARM SCPI protocol. The generic idea >>>>> here is that there is a firmware, which being a server runs on some >>>>> dedicated IP core (server), provides different PM services (DVFS, >>>>> sensors, etc). On the other side there is a CPUFreq driver in Xen, which >>>>> is running on the AP (client), consumes these services. CPUFreq driver >>>>> neither changes the CPU frequency/voltage by itself nor cooperates with >>>>> Linux in order to do such job. It just communicates with SCP directly >>>>> using SCPI protocol. As I said before, some integrated into a SoC mailbox >>>>> IP need to be used for IPC (doorbell for triggering action and shared >>>>> memory region for commands). CPUFreq driver doesn’t even need to know >>>>> what should be physically changed for the new frequency to take effect. >>>>> It is a certainly SCP’s responsibility. This all avoid CPUFreq >>>>> infrastructure in Xen on ARM from diving into each supported SoC >>>>> internals and as the result having a lot of code. >>>>> >>>>> The possible issue here could be in SCP, the problem is that some >>>>> dedicated IP core may be absent at all or performs other than PM tasks. >>>>> Fortunately, there is a brilliant solution to teach firmware running in >>>>> the EL3 exception level (ARM TF) to perform SCP functions and use SMC >>>>> calls for communications [4]. Exactly this transport implementation I >>>>> want to bring to Xen the first. Such solution is going to be generic >>>>> across all ARM platforms that do have firmware running in the EL3 >>>>> exception level and don’t have candidate for being SCP. 
>>>> >>>> While I feel flattered that you like that idea as well ;-), you should >>>> mention that this requires actual firmware providing those services. >>> Yes, a some firmware, which provides these services, must be present >>> on the other end. >>> It is a firmware which runs on the dedicated IP core(s) in common case. >>> And it is a firmware which runs on the same core(s) as the hypervisor >>> in particular case. >>> >>>> I >>>> am not sure there is actually *any* implementation of this at the >>>> moment, apart from my PoC code for Allwinner. >>> Your PoC is a good example for writing firmware side. So, why don't >>> use it as a base for >>> other platform. >> >> Sure, but normally firmware is provided by the vendor. And until more >> vendors actually implement this, it's a bit weird to ask Xen users to >> install this hand-crafted home-brew firmware to use this feature. >> For a particular embedded use case like yours this might be feasible, >> though. > Agree. it is exactly for ARM SoCs with security extensions enabled, > but where SCP isn't available. > And these SoCs are exists. Sure, also it depends on the accessibility of firmware. Some SoCs only run signed firmware, or there is no source code for crucial firmware components (SoC setup, DRAM init), so changing the firmware might not be an option. >>>> And from a Xen point of view I am not sure we are in the position to >>>> force users to use this firmware. This may be feasible in a classic >>>> embedded scenario, where both firmware and software are provided by the >>>> same entity, but that should be clearly noted as a restriction. >>> Agree. >>> >>>> >>>>> Here we have completely synchronous case because of SMC calls nature. SMC >>>>> triggered mailbox driver emulates a mailbox which signals transmitted >>>>> data via Secure Monitor Call (SMC) instruction [5]. 
The mailbox receiver >>>>> is implemented in firmware and synchronously returns data when it returns >>>>> execution to the non-secure world again. This would allow us both to >>>>> trigger a request and transfer execution to the firmware code in a safe >>>>> and architected way. Like PSCI requests. >>>>> As you can see this method is free from synchronization issues. What is >>>>> more, this solution is more architecturally cleaner solution than split >>>>> model “Xen+hwdom” one. From the security point of view, I hope, >>>>> everything will be much more correct since the ARM TF, which we want to >>>>> see in charge of controlling CPU frequency/voltage, is a trusted SW >>>>> layer. Moreover, ARM TF is responsible for enabling/disabling CPU (PSCI) >>>>> and nobody complains about it, so let it do DVFS too. >>>> >>>> It should be noted that this synchronous nature of the communication can >>>> actually be a problem: a DVFS request usually involves regulator and PLL >>>> changes, which could take some time to settle in. Blocking all of this >>>> time (milliseconds?) in EL3 (probably busy-waiting) might not be desirable. >>> Agree. I haven't measured time yet to say how long is it, since I >>> don't have a working firmware at the moment, just an emulator, >>> but, yes, it will definitely take some time. The whole system won't be >>> blocked, only the CPU which performs SMC call. >>> But, if we ask hwdom to change frequency we will wait too? Or if Xen >>> manages PLL/regulator by itself, it will wait anyway? >> >> Normally this is done asynchronously. For instance the OS programs the >> regulator to change the voltage, then does other things until the >> regulator signals the change has been realised. The it re-programs the >> PLL, again executing other code, eventually being interrupted by a >> completion interrupt (or by periodically polling a bit). If we need to >> spend all of this time in EL3, the HV is blocked on this. 
This might or >> might not be a problem, but it should be noted. > Agree. > >> >>>>> I have to admit that I have checked this solution only due to a lack of >>>>> candidate for being SCP. But, I hope, that other ARM SoCs where dedicated >>>>> SCP is present (asynchronous case) will work too, but with some >>>>> limitations. The mailbox IPs for these ARM SoCs must have TX/RX-done >>>>> irqs. I have described in the corresponding patches why this limitation >>>>> is present. >>>>> >>>>> To be honest I have Renesas R-Car Gen3 SoCs in mind as our nearest >>>>> target, but I would like to make this solution as generic as possible. I >>>>> don’t treat proposed solution as world-wide generic, but I hope, this >>>>> solution may be suitable for other ARM SoCs which meet such requirements. >>>>> Anyway, having something which works, but doesn’t cover all cases is >>>>> better than having nothing. >>>>> >>>>> I would like to notice that the patches are POC state and I post them >>>>> just to illustrate in more detail of what I am talking about. Patch >>>>> series consist of the following parts: >>>>> 1. GL’s patches which make ACPI specific CPUFreq stuff more generic. >>>>> Although these patches has been already acked by Xen community and the >>>>> CPUFreq code base hasn’t changed in a last few years I drop all A-b. >>>>> 2. A bunch of device-tree helpers and macros. >>>>> 3. Direct ported SCPI protocol, mailbox infrastructure and the ARM SMC >>>>> triggered mailbox driver. All components except mailbox driver are in >>>>> mainline Linux. >>>> >>>> Why do you actually need this mailbox framework? Actually I just >>>> proposed the SMC driver the make it fit into the Linux framework. All we >>>> actually need for SCPI is to write a simple command into some memory and >>>> "press a button". 
I don't see a need to import the whole Linux >>>> framework, especially as our mailbox usage is actually just a corner >>>> case of the mailbox's capability (namely a "single-bit" doorbell). >>>> The SMC use case is trivial to implement, and I believe using the Juno >>>> mailbox is similarly simple, for instance. >>> I did a direct port for SCPI protocol. I think, it is something that >>> should be retained as much as possible. >> >> But the actual protocol is really simple. And we just need a subset of >> it, namely to query and trigger OPPs. > Yes. I think, that "Sensors service" is needed as well. I think that > CPUFreq is not completed without thermal feedback. Personally I think this should be handled by the SCPI firmware: if the requested OPP would violate thermal constraint, the firmware would just not set it. Also (secure) temperature alarm interrupts could lower the OPP. Doing this in firmware means it would just need to be implemented once, and I consider this system critical, so firmware is conceptually the better place for this code. >>> Protocol relies on mailbox feature, so I ported mailbox too. I think, >>> it would be much more easy for me to just add >>> a few required commands handling with issuing SMC call and without any >>> mailbox infrastructure involved. >>> But, I want to show what is going on and what place these things come from. >> >> I appreciate that, but I think we already have enough "bloated" Linux + >> glue code in Xen. And in particular the Linux mailbox framework is much >> more powerful than we need for SCPI, so we have a lot of unneeded >> functionality. >> If we just want to support CPUfreq using SCPI via SMC/Juno MHU/Rockchip >> mailbox, we can get away with a *much* simpler solution. > > Agree, but I am afraid that simplifying things now might lead to some > difficulties when there is a need > to integrate a little bit different mailbox IP. 
Also, we need to > recheck if SCMI, we might want to support as well, > have the similar interface with mailbox. > >> - We would need to port mailbox drivers one-by-one anyway, so we could >> as well implement the simple "press-the-button" subset for each mailbox >> separately. The interface between the SCPI code and the mailbox is >> probably just "signal_mailbox()". For SMC it's trivial, and for the Juno >> MHU it's also simple, I guess ([1], chapter 3.6). >> - The SCPI message assembly is easy as well. >> - The only other code needed is some DT parsing code to be compatible >> with the existing DTs describing the SCPI implementation. We would claim >> to have a mailbox driver for those compatibles, but cheat a bit since we >> only use it for SCPI and just need the single bit subset of the mailbox. > Yes, I think, we can optimize in a such way. > > Just to clarify: > Proposed "signal_mailbox" is intended for both actions: sending > request and receiving response? > So when it returns we will have either response or timeout error or > some callback will be needed anyway? > > I don't have any objections regarding optimizations, we need to > decide what mailboxes we should stick to (we can support) and in what > form we should keep > all this stuff in. > Also while making a decision, we need to keep in mind "direct ported > code" advantages: > - "direct ported code" (SCPI + mailbox) have had a thorough review by > the Linux community and Xen community > may rely on their review. > - As "direct ported code" wasn't changed heavily, I believe, it would > be easy to backport fixes/features to Xen. I understand that, but as I wrote in the other mail: This is a lean hypervisor, not a driver and subsystem dump site. The security aspect of just having much less code is crucial here. > So, let's decide. > >> >>> What is more, I don't want to restrict a usage of this CPUFreq by only >>> covering single scenario where a >>> firmware, which provides DVFS service, is in ARM TF. 
I hope, that this >>> solution will be suitable for ARM SoCs where a standalone SCP >>> is present and real mailbox IP, which has asynchronous nature, is used >>> for IPC. Of course, this mailbox must have TX/RX-done irqs. >>> This is a limitation at the moment. >> >> Sure, see above and the document [1] below. > Thank you for the link, it seems with MHU we have to poll for the > last_tx_done (where deasserted interrupt line in a status register is > a condition for) > after pressing the button. Or I missed something? It depends on whether we care. We could just treat this request in a fire-and-forget manner. I am not sure how far Xen really needs to know the actual OPP used and when it's ready. Cheers, Andre.
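
To make the "collect guest requests, then press the button" idea a bit more concrete, here is a minimal sketch in C. It is not code from the patch series or from Xen. All type and function names (guest_opp_req, cluster_pick_opp, cluster_apply_opp, signal_mailbox_fn) are hypothetical. It assumes that OPPs are indexed from 0 (slowest) upwards, that all cores of a cluster share one clock as stated in the thread, and that the highest eligible request wins, which is only one possible policy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-guest policy, e.g. set by the administrator from Dom0
 * at guest creation time. Names are illustrative only. */
enum cpufreq_guest_policy {
    POLICY_IGNORE,       /* guest has no influence on the OPP */
    POLICY_FIXED,        /* guest is held at a constant OPP */
    POLICY_GUEST_INPUT   /* guest OPP requests are taken into account */
};

struct guest_opp_req {
    enum cpufreq_guest_policy policy;
    unsigned int fixed_opp;      /* used with POLICY_FIXED */
    unsigned int requested_opp;  /* last request, used with POLICY_GUEST_INPUT */
};

/* Hypothetical "press the button" hook, one implementation per mailbox
 * type (SMC, MHU, ...). Returns 0 on success. */
typedef int (*signal_mailbox_fn)(unsigned int cluster, unsigned int opp);

/*
 * Pick one OPP index for a whole cluster. Assumption: a higher index means
 * a higher frequency, and the highest eligible request wins because all
 * cores in the cluster share the same clock. Requests are clamped to the
 * last valid index of the OPP table.
 */
unsigned int cluster_pick_opp(const struct guest_opp_req *reqs, size_t n_reqs,
                              unsigned int n_opps)
{
    unsigned int best = 0;
    size_t i;

    assert(n_opps > 0);

    for (i = 0; i < n_reqs; i++) {
        unsigned int want;

        switch (reqs[i].policy) {
        case POLICY_FIXED:
            want = reqs[i].fixed_opp;
            break;
        case POLICY_GUEST_INPUT:
            want = reqs[i].requested_opp;
            break;
        default:
            continue;   /* POLICY_IGNORE: skip this guest */
        }

        if (want > n_opps - 1)
            want = n_opps - 1;
        if (want > best)
            best = want;
    }

    return best;
}

/* Fire-and-forget: decide on an OPP and hand it to the mailbox hook. */
int cluster_apply_opp(unsigned int cluster,
                      const struct guest_opp_req *reqs, size_t n_reqs,
                      unsigned int n_opps, signal_mailbox_fn signal_mailbox)
{
    return signal_mailbox(cluster, cluster_pick_opp(reqs, n_reqs, n_opps));
}
```

With this split the SCPI side only needs one small hook per mailbox type, and the policy part stays independent of whether the request ends up in an SMC call or a real mailbox IP. Whether "highest request wins" is the right rule would depend on the usage scenarios discussed above.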
 
 