[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver
On 14.06.2023 20:02, Jason Andryuk wrote: > Falling back from cpufreq=hwp to cpufreq=xen is a more user-friendly > choice than disabling cpufreq when HWP is not available. Specifying > cpufreq=hwp indicates the user wants cpufreq, so, if HWP isn't > available, it makes sense to give them the cpufreq that can be > supported. i.e. I can't see a user only wanting cpufreq=hwp or > cpufreq=none, but not cpufreq=xen. I continue to disagree with this, and I'd like to ask for another maintainer's opinion. To me the described behavior is like getting pears when having asked for apples. I nevertheless agree that having said fallback as an option, but I'd see that done by giving meaning to something like "cpufreq=hwp,xen" (without having checked whether that doesn't have meaning already, i.e. just to give you an idea). > We can't use parse_boolean() since it requires a single name=val string > and cpufreq_handle_common_option is provided two strings. Use > parse_bool() and manual handle no-hwp. > > Write to disable the interrupt - the linux pstate driver does this. We > don't use the interrupts, so we can just turn them off. We aren't ready > to handle them, so we don't want any. Unclear if this is necessary. > SDM says it's default disabled. Part of this may want to move to the description? > --- /dev/null > +++ b/xen/arch/x86/acpi/cpufreq/hwp.c > @@ -0,0 +1,537 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* > + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP) > + * > + * Copyright (C) 2021 Jason Andryuk <jandryuk@xxxxxxxxx> > + */ > + > +#include <xen/cpumask.h> > +#include <xen/init.h> > +#include <xen/param.h> > +#include <xen/xmalloc.h> > +#include <asm/io.h> What do you need this one for? > +#include <asm/msr.h> > +#include <acpi/cpufreq/cpufreq.h> > + > +static bool __ro_after_init feature_hwp; > +static bool __ro_after_init feature_hwp_notification; > +static bool __ro_after_init feature_hwp_activity_window; > + > +static bool __ro_after_init feature_hdc; > + > +bool __initdata opt_cpufreq_hwp; > +static bool __ro_after_init opt_cpufreq_hdc = true; > + > +union hwp_request > +{ > + struct > + { > + unsigned int min_perf:8; > + unsigned int max_perf:8; > + unsigned int desired:8; > + unsigned int energy_perf:8; > + unsigned int activity_window:10; > + bool package_control:1; > + unsigned int reserved:16; Better leave this and ... > + bool activity_window_valid:1; > + bool energy_perf_valid:1; > + bool desired_valid:1; > + bool max_perf_valid:1; > + bool min_perf_valid:1; > + }; > + uint64_t raw; > +}; > + > +struct hwp_drv_data > +{ > + union > + { > + uint64_t hwp_caps; > + struct > + { > + unsigned int highest:8; > + unsigned int guaranteed:8; > + unsigned int most_efficient:8; > + unsigned int lowest:8; > + unsigned int reserved:32; ... this without a name? Hardware interfaces like this one are, in my understanding, one of the main applications of unnamed bitfields. > + } hw; > + }; > + union hwp_request curr_req; > + int ret; > + uint16_t activity_window; > + uint8_t minimum; > + uint8_t maximum; > + uint8_t desired; > + uint8_t energy_perf; > +}; > +DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data); static? > +static bool hwp_handle_option(const char *s, const char *end) > +{ > + int ret; > + > + if ( strncmp(s, "verbose", 7) == 0 ) > + { > + s += 7; > + if ( *s == '=' ) > + { > + s++; > + cpufreq_verbose = !!simple_strtoul(s, NULL, 0); > + > + return true; > + } > + > + if ( end == NULL || > + (end == s && (*end == '\0' || *end == ',')) ) > + { > + cpufreq_verbose = true; > + > + return true; > + } > + > + return false; > + } Would be nice if the handling of "verbose" didn't need duplicating here. However, if that's unavoidable, did you consider handling this as a proper boolean instead of the custom way cpufreq_handle_common_option() also uses? > +bool __init hwp_available(void) > +{ > + unsigned int eax; > + > + if ( !opt_cpufreq_hwp ) > + return false; > + > + if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF ) > + { > + hwp_verbose("cpuid_level (%#x) lacks HWP support\n", > + boot_cpu_data.cpuid_level); > + > + return false; > + } > + > + if ( boot_cpu_data.cpuid_level < 0x16 ) > + { > + hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq > info\n", > + boot_cpu_data.cpuid_level); > + > + return false; > + } > + > + eax = cpuid_eax(CPUID_PM_LEAF); > + > + hwp_verbose("%d notify: %d act-window: %d energy-perf: %d pkg-level: %d > peci: %d\n", > + !!(eax & CPUID6_EAX_HWP), > + !!(eax & CPUID6_EAX_HWP_NOTIFICATION), > + !!(eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW), > + !!(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE), > + !!(eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST), > + !!(eax & CPUID6_EAX_HWP_PECI)); > + > + if ( !(eax & CPUID6_EAX_HWP) ) > + return false; > + > + if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) ) > + { > + hwp_verbose("disabled: No energy/performance preference available"); > + > + return false; > + } > + > + feature_hwp = eax & CPUID6_EAX_HWP; > + feature_hwp_notification = eax & CPUID6_EAX_HWP_NOTIFICATION; > + feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW; > + feature_hdc = eax & CPUID6_EAX_HDC; Would you mind matching this line with the earlier three, padding-wise? > + hwp_verbose("Hardware Duty Cycling (HDC) %ssupported%s\n", > + feature_hdc ? "" : "not ", > + feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled" > + : ""); > + > + hwp_verbose("HW_FEEDBACK %ssupported\n", > + (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not "); > + > + cpufreq_governor_internal = feature_hwp; What use is feature_hwp? Already its setting is a little odd (you could use true there as much as you could here and for the return value below, because of the earlier CPUID6_EAX_HWP check), and then I haven't been able to find any further use of the variable. > + if ( feature_hwp ) > + hwp_info("Using HWP for cpufreq\n"); > + > + return feature_hwp; > +} > + > +static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val) > +{ > + uint64_t msr; > + > + if ( rdmsr_safe(MSR_PKG_HDC_CTL, msr) ) > + { > + hwp_err("rdmsr_safe(MSR_PKG_HDC_CTL)\n", cpu); > + > + return -1; Nit: While blank lines often help, and we even demand them ahead of a function's main return statement, here and ... > + } > + > + if ( val ) > + msr |= PKG_HDC_CTL_HDC_PKG_ENABLE; > + else > + msr &= ~PKG_HDC_CTL_HDC_PKG_ENABLE; > + > + if ( wrmsr_safe(MSR_PKG_HDC_CTL, msr) ) > + { > + hwp_err("wrmsr_safe(MSR_PKG_HDC_CTL): %016lx\n", cpu, msr); > + > + return -1; ... here (and then again below) I think they do more harm than good. > +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy) > +{ > + uint32_t base_khz, max_khz, bus_khz, edx; > + > + cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx); > + > + policy->cpuinfo.perf_freq = base_khz * 1000; > + policy->cpuinfo.min_freq = base_khz * 1000; > + policy->cpuinfo.max_freq = max_khz * 1000; > + policy->min = base_khz * 1000; > + policy->max = max_khz * 1000; What are the consequences (for the driver) when any of the values read is zero? While I haven't checked in combination with HWP, I know that CPUs may populate only some of the fields. > +static void cf_check hwp_init_msrs(void *info) > +{ > + struct cpufreq_policy *policy = info; > + struct hwp_drv_data *data = this_cpu(hwp_drv_data); > + uint64_t val; > + > + /* > + * Package level MSR, but we don't have a good idea of packages here, so > + * just do it everytime. > + */ > + if ( rdmsr_safe(MSR_PM_ENABLE, val) ) > + { > + hwp_err("rdmsr_safe(MSR_PM_ENABLE)\n", policy->cpu); > + data->curr_req.raw = -1; > + return; > + } > + > + /* Ensure we don't generate interrupts */ > + if ( feature_hwp_notification ) > + wrmsr_safe(MSR_HWP_INTERRUPT, 0); > + > + hwp_verbose("CPU%u: MSR_PM_ENABLE: %016lx\n", policy->cpu, val); > + if ( !(val & PM_ENABLE_HWP_ENABLE) ) > + { > + val |= PM_ENABLE_HWP_ENABLE; > + if ( wrmsr_safe(MSR_PM_ENABLE, val) ) > + { > + hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val); > + data->curr_req.raw = -1; > + return; > + } > + } > + > + if ( rdmsr_safe(MSR_HWP_CAPABILITIES, data->hwp_caps) ) > + { > + hwp_err("rdmsr_safe(MSR_HWP_CAPABILITIES)\n", policy->cpu); > + goto error; > + } > + > + if ( rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw) ) > + { > + hwp_err("rdmsr_safe(MSR_HWP_REQUEST)\n", policy->cpu); Having seen a number of hwp_err() by now, I think these are confusing: The format string as seen at call sites doesn't match the number of arguments. I see two possible solutions to this: Either you demand that calling functions maintain a "cpu" local variable, and you simply use that from the macro without passing it as argument. Or you flip parameters / arguments: hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_REQUEST)\n"); > + goto error; > + } > + > + /* > + * Check for APERF/MPERF support in hardware > + * also check for boost/turbo support > + */ > + intel_feature_detect(policy); > + > + if ( feature_hdc ) > + { > + if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) ) > + goto error; > + if ( hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) ) > + goto error; If either of these fails the very first time through (presumably for the BSP), wouldn't a better course of action be to clear feature_hdc? > + } > + > + hwp_get_cpu_speeds(policy); > + > + return; > + > + error: > + data->curr_req.raw = -1; > + val &= ~PM_ENABLE_HWP_ENABLE; > + if ( wrmsr_safe(MSR_PM_ENABLE, val) ) > + hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val); > + > + return; > +} I think in general we avoid "return" with no value at the end of functions. > +static void cf_check hwp_write_request(void *info) > +{ > + const struct cpufreq_policy *policy = info; > + struct hwp_drv_data *data = this_cpu(hwp_drv_data); > + union hwp_request hwp_req = data->curr_req; > + > + data->ret = 0; > + > + BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t)); Why uint64_t? Generally we try to avoid using types in sizeof() and alike, and instead refer to applicable variables. I.e. here: BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw)); > +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy, > + unsigned int target_freq, > + unsigned int relation) > +{ > + unsigned int cpu = policy->cpu; > + struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu); > + /* Zero everything to ensure reserved bits are zero... */ > + union hwp_request hwp_req = { .raw = 0 }; > + > + /* .. and update from there */ > + hwp_req.min_perf = data->minimum; > + hwp_req.max_perf = data->maximum; > + hwp_req.desired = data->desired; > + hwp_req.energy_perf = data->energy_perf; > + if ( feature_hwp_activity_window ) > + hwp_req.activity_window = data->activity_window; > + > + if ( hwp_req.raw == data->curr_req.raw ) > + return 0; > + > + data->curr_req = hwp_req; > + > + hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw); > + on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1); Is this at risk of being too verbose when the verbose option as given? > + return data->ret; > +} > + > +static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy) > +{ > + unsigned int cpu = policy->cpu; > + struct hwp_drv_data *data; > + > + data = xzalloc(struct hwp_drv_data); > + if ( !data ) > + return -ENOMEM; > + > + if ( cpufreq_opt_governor && strcmp(cpufreq_opt_governor->name, > + cpufreq_gov_hwp.name) ) Nit: I think this would better be if ( cpufreq_opt_governor && strcmp(cpufreq_opt_governor->name, cpufreq_gov_hwp.name) ) > + printk(XENLOG_WARNING > + "HWP: governor \"%s\" is incompatible with hwp. Using default > \"%s\"\n", > + cpufreq_opt_governor->name, cpufreq_gov_hwp.name); > + policy->governor = &cpufreq_gov_hwp; > + > + per_cpu(hwp_drv_data, cpu) = data; > + > + on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1); > + > + if ( data->curr_req.raw == -1 ) > + { > + hwp_err("Could not initialize HWP properly\n", cpu); > + XFREE(per_cpu(hwp_drv_data, cpu)); Is this completely safe even in the CPU online/hotplug case? It would seem better to me to go this way: per_cpu(hwp_drv_data, cpu) = NULL; xfree(data); Or even defer publishing "data" ... > + return -ENODEV; > + } > + > + data->minimum = data->curr_req.min_perf; > + data->maximum = data->curr_req.max_perf; > + data->desired = data->curr_req.desired; > + data->energy_perf = data->curr_req.energy_perf; > + data->activity_window = data->curr_req.activity_window; ... until after it was fully populated. (But I seem to vaguely recall that you need to use the field somewhere in the init process.) > + hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps); Isn't this expected (or even required) to be the same on all CPUs? If so, no need to log every time. > +static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy) > +{ > + XFREE(per_cpu(hwp_drv_data, policy->cpu)); Same remark as above, primarily because this is also used on an error path. > +/* > + * The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and > + * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at > least > + * with my HWP testing. MSR_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE > + * is what Linux uses and seems to work. > + */ "my testing" imo wants replacing here by saying what hardware was tested (not who did the testing). > +static int cf_check hwp_cpufreq_update(int cpuid, struct cpufreq_policy > *policy) > +{ > + struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpuid); > + on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1); Nit: Blank line please between declaration(s) and statement(s). Or alternatively drop the local variable ltogether, as it's used ... > + return data->ret; ... just once here. > +} > + > +static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver = Seeing __initconstrel here, just as a remark (not a request for you to do anything): I think we want to finish conversion to altcall, such that these can all become __initconst_cf_clobber. > --- a/xen/drivers/cpufreq/cpufreq.c > +++ b/xen/drivers/cpufreq/cpufreq.c > @@ -97,6 +97,15 @@ static int __init cf_check setup_cpufreq_option(const char > *str) > return cpufreq_cmdline_parse(arg + 1); > } > > + if ( choice < 0 && !cmdline_strcmp(str, "hwp") ) > + { > + xen_processor_pmbits |= XEN_PROCESSOR_PM_PX; > + cpufreq_controller = FREQCTL_xen; > + opt_cpufreq_hwp = true; > + if ( *arg && *(arg + 1) ) A pretty unusual way of writing arg[1]. > --- a/xen/include/acpi/cpufreq/cpufreq.h > +++ b/xen/include/acpi/cpufreq/cpufreq.h > @@ -246,4 +246,7 @@ void cpufreq_dbs_timer_resume(void); > > void intel_feature_detect(struct cpufreq_policy *policy); > > +extern bool __initdata opt_cpufreq_hwp; No __initdata or alike on declarations please. Section placement attributes (among others) only belong on the definition. > --- a/xen/include/acpi/cpufreq/processor_perf.h > +++ b/xen/include/acpi/cpufreq/processor_perf.h > @@ -7,6 +7,9 @@ > > #define XEN_PX_INIT 0x80000000 > > +bool hwp_available(void); > +int hwp_register_driver(void); > + > int powernow_cpufreq_init(void); > unsigned int powernow_register_driver(void); > unsigned int get_measured_perf(unsigned int cpu, unsigned int flag); The existing split between what lives in which header is pretty arbitrary from all I can tell. Is there a particular reason you can't keep together the four declarations you add? > --- a/xen/include/public/sysctl.h > +++ b/xen/include/public/sysctl.h > @@ -296,6 +296,7 @@ struct xen_ondemand { > uint32_t up_threshold; > }; > > +#define XEN_HWP_DRIVER "hwp" I think this wants some addition to the identifier which makes it expected that the expansion is a string literal. Perhaps XEN_HWP_DRIVER_NAME? > /* Nit: There also wants to be a blank line between #define and comment. Jan
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |