Xen project Mailing List

Re: [PATCH 04/13] cpufreq: Add Hardware P-State (HWP) driver

From: Jason Andryuk <jandryuk@xxxxxxxxx>

Date: Thu, 27 May 2021 14:50:29 -0400

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Thu, 27 May 2021 18:50:45 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Wed, May 26, 2021 at 11:00 AM Jan Beulich <jbeulich@xxxxxxxx> wrote: > > On 03.05.2021 21:28, Jason Andryuk wrote: > > If HWP Energy_Performance_Preference isn't supported, the code falls > > back to IA32_ENERGY_PERF_BIAS. Right now, we don't check > > CPUID.06H:ECX.SETBH[bit 3] before using that MSR. > > May I ask what problem there is doing so? > > > The SDM reads like > > it'll be available, and I assume it was available by the time Skylake > > introduced HWP. > > The SDM documents the MSR's presence back to at least Nehalem, but ties > it to the CPUID bit even there. Your point about inconsistent virtualized environments is something I hadn't considered. I will add a check, but I will make hwp disable in that case. While it could run just without an energy/performance preference, HWP doesn't seem useful in that scenario. > > --- a/docs/misc/xen-command-line.pandoc > > +++ b/docs/misc/xen-command-line.pandoc > > @@ -1355,6 +1355,15 @@ Specify whether guests are to be given access to > > physical port 80 > > (often used for debugging purposes), to override the DMI based > > detection of systems known to misbehave upon accesses to that port. > > > > +### hwp (x86) > > +> `= <boolean>` > > + > > +> Default: `true` > > + > > +Specifies whether Xen uses Hardware-Controlled Performance States (HWP) > > +on supported Intel hardware. HWP is a Skylake+ feature which provides > > +better CPU power management. > > Is there a particular reason giving this a top-level option rather > than a sub-option of cpufreq=? I will investigate. Below, I'm trimming everything where I will just make the changes according to your earlier email. > > + }; > > + union hwp_request curr_req; > > + uint16_t activity_window; > > + uint8_t minimum; > > + uint8_t maximum; > > + uint8_t desired; > > + uint8_t energy_perf; > > +}; > > +struct hwp_drv_data *hwp_drv_data[NR_CPUS]; > > New NR_CPUS-dimensioned arrays need explicit justification. From > what I get I can't see why this couldn't be per-CPU data instead. > > Also - static? I followed acpi-cpufreq with the NR_CPUS array. per-cpu makes sense and I'll investigate. > > + hwp_verbose("HWP: FAST_IA32_HWP_REQUEST %ssupported\n", > > + eax & CPUID6_EAX_FAST_HWP_MSR ? "" : "not "); > > + if ( eax & CPUID6_EAX_FAST_HWP_MSR ) > > + { > > + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY, val) ) > > + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CAPABILITY)\n"); > > + > > + hwp_verbose("HWP: MSR_FAST_UNCORE_MSRS_CAPABILITY: %016lx\n", val); > > Missing "else" above here? Are unbalanced braces acceptable or must they be balanced? Is this acceptable: if () one; else { two; three; } > > +static void hdc_set_pkg_hdc_ctl(bool val) > > +{ > > + uint64_t msr; > > + > > + if ( rdmsr_safe(MSR_IA32_PKG_HDC_CTL, msr) ) > > + { > > + hwp_err("error rdmsr_safe(MSR_IA32_PKG_HDC_CTL)\n"); > > I'm not convinced of the need of having such log messages after > failed rdmsr/wrmsr, but I'm definitely against them getting logged > unconditionally. In verbose mode, maybe. We should not fail the rdmsr if our earlier cpuid check passed. So in that respect these messages can be removed. The benefit here is that you can see which MSR failed. If you relied on extable_fixup, you would have to cross reference manually. How will administors know if hwp isn't working properly there are no messages? For HWP in general, the SDM says to check CPUID for the availability of MSRs. Therefore rdmsr/wrmsr shouldn't fail. During development, I saw wrmsr failures when with "Valid Maximum" and other "Valid" bits set. I think that was because I hadn't set up the Package Request MSR. That has been fixed by not using those bits. Anyway, rdmsr/wrmsr shouldn't fail, so how much code should be put into checking for those failures which shouldn't happen? My sample size is 2 models of processor, so verbose reporting of errors makes some sense during wider deployment and testing. > > + return; > > + } > > + > > + msr = val ? IA32_PKG_HDC_CTL_HDC_PKG_Enable : 0; > > If you don't use the prior value, why did you read it? But I > think you really mean to set/clear just bot 0. During development I printed the initial values . I removed the printing before submission but not the reading. In the SDM, It says "Bits 63:1 are reserved and must be zero", so I intended to only write 0 or 1. Below, you remark on not discarding reserved bits. So you want all of these to be read-modify-write operations? > > +static void hdc_set_pm_ctl1(bool val) > > +{ > > + uint64_t msr; > > + > > + if ( rdmsr_safe(MSR_IA32_PM_CTL1, msr) ) > > + { > > + hwp_err("error rdmsr_safe(MSR_IA32_PM_CTL1)\n"); > > + > > + return; > > + } > > + > > + msr = val ? IA32_PM_CTL1_HDC_Allow_Block : 0; > > Same here then, and ... > > > +static void hwp_fast_uncore_msrs_ctl(bool val) > > +{ > > + uint64_t msr; > > + > > + if ( rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL, msr) ) > > + hwp_err("error rdmsr_safe(MSR_FAST_UNCORE_MSRS_CTL)\n"); > > + > > + msr = val; > > ... here (where you imply bit 0 instead of using a proper > constant). > > Also for all three functions I'm not convinced their names are > in good sync with their parameters being boolean. Would you prefer something named closer to the bit names like: hdc_set_pkg_hdc_ctl -> hdc_set_pkg_enable hdc_set_pm_ctl1 -> hdc_set_allow_block hwp_fast_uncore_msrs_ctl -> hwp_fast_request_enable > > +static void hwp_read_capabilities(void *info) > > +{ > > + struct cpufreq_policy *policy = info; > > + struct hwp_drv_data *data = hwp_drv_data[policy->cpu]; > > + > > + if ( rdmsr_safe(MSR_IA32_HWP_CAPABILITIES, data->hwp_caps) ) > > + { > > + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_CAPABILITIES)\n", > > + policy->cpu); > > + > > + return; > > + } > > + > > + if ( rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw) ) > > + { > > + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_HWP_REQUEST)\n", > > policy->cpu); > > + > > + return; > > + } > > +} > > This function doesn't indicate failure to its caller(s), so am I > to understand that failure to read either ofthe MSRs is actually > benign to the driver? They really shouldn't fail since the CPUID check passed during initialization. If you can't read/write MSRs, something is broken and the driver cannot function. The machine would still run, but HWP would be uncontrollable. How much care should be given to "impossible" situations? > > + if ( rdmsr_safe(MSR_IA32_PM_ENABLE, val) ) > > + { > > + hwp_err("CPU%u: error rdmsr_safe(MSR_IA32_PM_ENABLE)\n", > > policy->cpu); > > + > > + return; > > + } > > + > > + hwp_verbose("CPU%u: MSR_IA32_PM_ENABLE: %016lx\n", policy->cpu, val); > > + if ( val != IA32_PM_ENABLE_HWP_ENABLE ) > > + { > > + val = IA32_PM_ENABLE_HWP_ENABLE; > > You should neither depend on reserved bits being zero, nor discard any > non-zero value here, I think. Ok. > > +static void hwp_write_request(void *info) > > +{ > > + struct cpufreq_policy *policy = info; > > + struct hwp_drv_data *data = hwp_drv_data[policy->cpu]; > > + union hwp_request hwp_req = data->curr_req; > > + > > + BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t)); > > ITYM > > BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw)); > > here? Originally, I wanted this to live next to the union definition. However, BUILD_BUG_ON needs to live in a function, so I placed it here where it is used. I'd prefer BUILD_BUG_ON(sizeof(hwp_req) != sizeof(uint64_t)) to make the comparison to 64bit, the size of the MSR, explicit. > > + if ( wrmsr_safe(MSR_IA32_HWP_REQUEST, hwp_req.raw) ) > > + { > > + hwp_err("CPU%u: error wrmsr_safe(MSR_IA32_HWP_REQUEST, %lx)\n", > > + policy->cpu, hwp_req.raw); > > + rdmsr_safe(MSR_IA32_HWP_REQUEST, data->curr_req.raw); > > What if this one fails, too? data->curr_req.raw then pretty certainly > ends up stale. It would be stale, but I think this is unlikely. rdmsr should not fail given our earlier CPUID checks. wrmsr could fail if you set something wrong. During testing I set some of the valid "Maximum/Minimum" bits (now unused) that cause wrmsr to fault when some other MSR (maybe IA32_HWP_REQUEST_PKG) wasn't set. > > + policy->governor = &hwp_cpufreq_governor; > > + > > + data = xzalloc(typeof(*data)); > > Commonly we specify the type explicitly in such cases, rather than using > typeof(). I will admit though that I'm not entirely certain which one's > better. But consistency across the code base is perhaps preferable for > the time being. I thought typeof(*data) is always preferable since it will always be the matching type. I can change it, but I feel like it's a step backwards. > > + if ( !data ) > > + return -ENOMEM; > > Is it correct to have set the governor before this error check? I will re-order. > > + hwp_drv_data[cpu] = data; > > + > > + on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1); > > + > > + data->minimum = data->hw_lowest; > > + data->maximum = data->hw_highest; > > + data->desired = 0; /* default to HW autonomous */ > > + if ( feature_hwp_energy_perf ) > > + data->energy_perf = 0x80; > > + else > > + data->energy_perf = 7; > > Where's this 7 coming from? (You do mention the 0x80 at least in the > description.) When HWP energy performance preference is unavailable, it falls back to IA32_ENERGY_PERF_BIAS with a 0-15 range. Per the SDM Vol3 14.3.4, "A value of 7 roughly translates into a hint to balance performance with energy consumption." I will add a comment. > > --- a/xen/include/asm-x86/msr-index.h > > +++ b/xen/include/asm-x86/msr-index.h > > +#define MSR_IA32_HWP_CAPABILITIES 0x00000771 > > +#define MSR_IA32_HWP_REQUEST 0x00000774 > > + > > #define MSR_PASID 0x00000d93 > > #define PASID_PASID_MASK 0x000fffff > > #define PASID_VALID (_AC(1, ULL) << 31) > > > > +#define MSR_IA32_PKG_HDC_CTL 0x00000db0 > > +#define IA32_PKG_HDC_CTL_HDC_PKG_Enable (_AC(1, ULL) << 0) > > I don't think "HDC" twice is helpful? I took the MSR name "IA32_PKG_HDC_CTL" and bit name "HDC_PKG_Enable" from the SDM. I intentionally left the case to match the SDM. > Also please use all upper case names (actually also for the > CPUID constants higher up). Again, I took them straight from the SDM. I will make them all upper case and switch IA32_PKG_HDC_CTL_HDC_PKG_Enable to IA32_PKG_HDC_CTL_PKG_Enable. Regards, Jason

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.