|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver
On 14.06.2023 20:02, Jason Andryuk wrote:
> Falling back from cpufreq=hwp to cpufreq=xen is a more user-friendly
> choice than disabling cpufreq when HWP is not available. Specifying
> cpufreq=hwp indicates the user wants cpufreq, so, if HWP isn't
> available, it makes sense to give them the cpufreq that can be
> supported. i.e. I can't see a user only wanting cpufreq=hwp or
> cpufreq=none, but not cpufreq=xen.
I continue to disagree with this, and I'd like to ask for another
maintainer's opinion. To me the described behavior is like getting
pears when having asked for apples. I nevertheless agree that
having said fallback as an option, but I'd see that done by giving
meaning to something like "cpufreq=hwp,xen" (without having checked
whether that doesn't have meaning already, i.e. just to give you an
idea).
> We can't use parse_boolean() since it requires a single name=val string
> and cpufreq_handle_common_option is provided two strings. Use
> parse_bool() and manual handle no-hwp.
>
> Write to disable the interrupt - the linux pstate driver does this. We
> don't use the interrupts, so we can just turn them off. We aren't ready
> to handle them, so we don't want any. Unclear if this is necessary.
> SDM says it's default disabled.
Part of this may want to move to the description?
> --- /dev/null
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -0,0 +1,537 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> + *
> + * Copyright (C) 2021 Jason Andryuk <jandryuk@xxxxxxxxx>
> + */
> +
> +#include <xen/cpumask.h>
> +#include <xen/init.h>
> +#include <xen/param.h>
> +#include <xen/xmalloc.h>
> +#include <asm/io.h>
What do you need this one for?
> +#include <asm/msr.h>
> +#include <acpi/cpufreq/cpufreq.h>
> +
> +static bool __ro_after_init feature_hwp;
> +static bool __ro_after_init feature_hwp_notification;
> +static bool __ro_after_init feature_hwp_activity_window;
> +
> +static bool __ro_after_init feature_hdc;
> +
> +bool __initdata opt_cpufreq_hwp;
> +static bool __ro_after_init opt_cpufreq_hdc = true;
> +
> +union hwp_request
> +{
> + struct
> + {
> + unsigned int min_perf:8;
> + unsigned int max_perf:8;
> + unsigned int desired:8;
> + unsigned int energy_perf:8;
> + unsigned int activity_window:10;
> + bool package_control:1;
> + unsigned int reserved:16;
Better leave this and ...
> + bool activity_window_valid:1;
> + bool energy_perf_valid:1;
> + bool desired_valid:1;
> + bool max_perf_valid:1;
> + bool min_perf_valid:1;
> + };
> + uint64_t raw;
> +};
> +
> +struct hwp_drv_data
> +{
> + union
> + {
> + uint64_t hwp_caps;
> + struct
> + {
> + unsigned int highest:8;
> + unsigned int guaranteed:8;
> + unsigned int most_efficient:8;
> + unsigned int lowest:8;
> + unsigned int reserved:32;
... this without a name? Hardware interfaces like this one are, in my
understanding, one of the main applications of unnamed bitfields.
> + } hw;
> + };
> + union hwp_request curr_req;
> + int ret;
> + uint16_t activity_window;
> + uint8_t minimum;
> + uint8_t maximum;
> + uint8_t desired;
> + uint8_t energy_perf;
> +};
> +DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data);
static?
> +static bool hwp_handle_option(const char *s, const char *end)
> +{
> + int ret;
> +
> + if ( strncmp(s, "verbose", 7) == 0 )
> + {
> + s += 7;
> + if ( *s == '=' )
> + {
> + s++;
> + cpufreq_verbose = !!simple_strtoul(s, NULL, 0);
> +
> + return true;
> + }
> +
> + if ( end == NULL ||
> + (end == s && (*end == '\0' || *end == ',')) )
> + {
> + cpufreq_verbose = true;
> +
> + return true;
> + }
> +
> + return false;
> + }
Would be nice if the handling of "verbose" didn't need duplicating here.
However, if that's unavoidable, did you consider handling this as a
proper boolean instead of the custom way cpufreq_handle_common_option()
also uses?
> +bool __init hwp_available(void)
> +{
> + unsigned int eax;
> +
> + if ( !opt_cpufreq_hwp )
> + return false;
> +
> + if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> + {
> + hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> + boot_cpu_data.cpuid_level);
> +
> + return false;
> + }
> +
> + if ( boot_cpu_data.cpuid_level < 0x16 )
> + {
> + hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq
> info\n",
> + boot_cpu_data.cpuid_level);
> +
> + return false;
> + }
> +
> + eax = cpuid_eax(CPUID_PM_LEAF);
> +
> + hwp_verbose("%d notify: %d act-window: %d energy-perf: %d pkg-level: %d
> peci: %d\n",
> + !!(eax & CPUID6_EAX_HWP),
> + !!(eax & CPUID6_EAX_HWP_NOTIFICATION),
> + !!(eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW),
> + !!(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE),
> + !!(eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST),
> + !!(eax & CPUID6_EAX_HWP_PECI));
> +
> + if ( !(eax & CPUID6_EAX_HWP) )
> + return false;
> +
> + if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) )
> + {
> + hwp_verbose("disabled: No energy/performance preference available");
> +
> + return false;
> + }
> +
> + feature_hwp = eax & CPUID6_EAX_HWP;
> + feature_hwp_notification = eax & CPUID6_EAX_HWP_NOTIFICATION;
> + feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> + feature_hdc = eax & CPUID6_EAX_HDC;
Would you mind matching this line with the earlier three, padding-wise?
> + hwp_verbose("Hardware Duty Cycling (HDC) %ssupported%s\n",
> + feature_hdc ? "" : "not ",
> + feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> + : "");
> +
> + hwp_verbose("HW_FEEDBACK %ssupported\n",
> + (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
> +
> + cpufreq_governor_internal = feature_hwp;
What use is feature_hwp? Already its setting is a little odd (you could
use true there as much as you could here and for the return value below,
because of the earlier CPUID6_EAX_HWP check), and then I haven't been
able to find any further use of the variable.
> + if ( feature_hwp )
> + hwp_info("Using HWP for cpufreq\n");
> +
> + return feature_hwp;
> +}
> +
> +static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val)
> +{
> + uint64_t msr;
> +
> + if ( rdmsr_safe(MSR_PKG_HDC_CTL, msr) )
> + {
> + hwp_err("rdmsr_safe(MSR_PKG_HDC_CTL)\n", cpu);
> +
> + return -1;
Nit: While blank lines often help, and we even demand them ahead of a
function's main return statement, here and ...
> + }
> +
> + if ( val )
> + msr |= PKG_HDC_CTL_HDC_PKG_ENABLE;
> + else
> + msr &= ~PKG_HDC_CTL_HDC_PKG_ENABLE;
> +
> + if ( wrmsr_safe(MSR_PKG_HDC_CTL, msr) )
> + {
> + hwp_err("wrmsr_safe(MSR_PKG_HDC_CTL): %016lx\n", cpu, msr);
> +
> + return -1;
... here (and then again below) I think they do more harm than good.
> +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
> +{
> + uint32_t base_khz, max_khz, bus_khz, edx;
> +
> + cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
> +
> + policy->cpuinfo.perf_freq = base_khz * 1000;
> + policy->cpuinfo.min_freq = base_khz * 1000;
> + policy->cpuinfo.max_freq = max_khz * 1000;
> + policy->min = base_khz * 1000;
> + policy->max = max_khz * 1000;
What are the consequences (for the driver) when any of the values read
is zero? While I haven't checked in combination with HWP, I know that
CPUs may populate only some of the fields.
> +static void cf_check hwp_init_msrs(void *info)
> +{
> + struct cpufreq_policy *policy = info;
> + struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> + uint64_t val;
> +
> + /*
> + * Package level MSR, but we don't have a good idea of packages here, so
> + * just do it everytime.
> + */
> + if ( rdmsr_safe(MSR_PM_ENABLE, val) )
> + {
> + hwp_err("rdmsr_safe(MSR_PM_ENABLE)\n", policy->cpu);
> + data->curr_req.raw = -1;
> + return;
> + }
> +
> + /* Ensure we don't generate interrupts */
> + if ( feature_hwp_notification )
> + wrmsr_safe(MSR_HWP_INTERRUPT, 0);
> +
> + hwp_verbose("CPU%u: MSR_PM_ENABLE: %016lx\n", policy->cpu, val);
> + if ( !(val & PM_ENABLE_HWP_ENABLE) )
> + {
> + val |= PM_ENABLE_HWP_ENABLE;
> + if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> + {
> + hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
> + data->curr_req.raw = -1;
> + return;
> + }
> + }
> +
> + if ( rdmsr_safe(MSR_HWP_CAPABILITIES, data->hwp_caps) )
> + {
> + hwp_err("rdmsr_safe(MSR_HWP_CAPABILITIES)\n", policy->cpu);
> + goto error;
> + }
> +
> + if ( rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw) )
> + {
> + hwp_err("rdmsr_safe(MSR_HWP_REQUEST)\n", policy->cpu);
Having seen a number of hwp_err() by now, I think these are confusing:
The format string as seen at call sites doesn't match the number of
arguments. I see two possible solutions to this: Either you demand
that calling functions maintain a "cpu" local variable, and you simply
use that from the macro without passing it as argument. Or you flip
parameters / arguments:
hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_REQUEST)\n");
> + goto error;
> + }
> +
> + /*
> + * Check for APERF/MPERF support in hardware
> + * also check for boost/turbo support
> + */
> + intel_feature_detect(policy);
> +
> + if ( feature_hdc )
> + {
> + if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) )
> + goto error;
> + if ( hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) )
> + goto error;
If either of these fails the very first time through (presumably for the
BSP), wouldn't a better course of action be to clear feature_hdc?
> + }
> +
> + hwp_get_cpu_speeds(policy);
> +
> + return;
> +
> + error:
> + data->curr_req.raw = -1;
> + val &= ~PM_ENABLE_HWP_ENABLE;
> + if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> + hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
> +
> + return;
> +}
I think in general we avoid "return" with no value at the end of functions.
> +static void cf_check hwp_write_request(void *info)
> +{
> + const struct cpufreq_policy *policy = info;
> + struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> + union hwp_request hwp_req = data->curr_req;
> +
> + data->ret = 0;
> +
> + BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));
Why uint64_t? Generally we try to avoid using types in sizeof() and
alike, and instead refer to applicable variables. I.e. here:
BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw));
> +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> + unsigned int target_freq,
> + unsigned int relation)
> +{
> + unsigned int cpu = policy->cpu;
> + struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> + /* Zero everything to ensure reserved bits are zero... */
> + union hwp_request hwp_req = { .raw = 0 };
> +
> + /* .. and update from there */
> + hwp_req.min_perf = data->minimum;
> + hwp_req.max_perf = data->maximum;
> + hwp_req.desired = data->desired;
> + hwp_req.energy_perf = data->energy_perf;
> + if ( feature_hwp_activity_window )
> + hwp_req.activity_window = data->activity_window;
> +
> + if ( hwp_req.raw == data->curr_req.raw )
> + return 0;
> +
> + data->curr_req = hwp_req;
> +
> + hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> + on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);
Is this at risk of being too verbose when the verbose option as given?
> + return data->ret;
> +}
> +
> +static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> + unsigned int cpu = policy->cpu;
> + struct hwp_drv_data *data;
> +
> + data = xzalloc(struct hwp_drv_data);
> + if ( !data )
> + return -ENOMEM;
> +
> + if ( cpufreq_opt_governor && strcmp(cpufreq_opt_governor->name,
> + cpufreq_gov_hwp.name) )
Nit: I think this would better be
if ( cpufreq_opt_governor &&
strcmp(cpufreq_opt_governor->name, cpufreq_gov_hwp.name) )
> + printk(XENLOG_WARNING
> + "HWP: governor \"%s\" is incompatible with hwp. Using default
> \"%s\"\n",
> + cpufreq_opt_governor->name, cpufreq_gov_hwp.name);
> + policy->governor = &cpufreq_gov_hwp;
> +
> + per_cpu(hwp_drv_data, cpu) = data;
> +
> + on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
> +
> + if ( data->curr_req.raw == -1 )
> + {
> + hwp_err("Could not initialize HWP properly\n", cpu);
> + XFREE(per_cpu(hwp_drv_data, cpu));
Is this completely safe even in the CPU online/hotplug case? It would
seem better to me to go this way:
per_cpu(hwp_drv_data, cpu) = NULL;
xfree(data);
Or even defer publishing "data" ...
> + return -ENODEV;
> + }
> +
> + data->minimum = data->curr_req.min_perf;
> + data->maximum = data->curr_req.max_perf;
> + data->desired = data->curr_req.desired;
> + data->energy_perf = data->curr_req.energy_perf;
> + data->activity_window = data->curr_req.activity_window;
... until after it was fully populated. (But I seem to vaguely recall
that you need to use the field somewhere in the init process.)
> + hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);
Isn't this expected (or even required) to be the same on all CPUs? If
so, no need to log every time.
> +static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> +{
> + XFREE(per_cpu(hwp_drv_data, policy->cpu));
Same remark as above, primarily because this is also used on an error
path.
> +/*
> + * The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and
> + * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at
> least
> + * with my HWP testing. MSR_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE
> + * is what Linux uses and seems to work.
> + */
"my testing" imo wants replacing here by saying what hardware was tested
(not who did the testing).
> +static int cf_check hwp_cpufreq_update(int cpuid, struct cpufreq_policy
> *policy)
> +{
> + struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpuid);
> + on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1);
Nit: Blank line please between declaration(s) and statement(s). Or
alternatively drop the local variable ltogether, as it's used ...
> + return data->ret;
... just once here.
> +}
> +
> +static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =
Seeing __initconstrel here, just as a remark (not a request for you
to do anything): I think we want to finish conversion to altcall, such
that these can all become __initconst_cf_clobber.
> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -97,6 +97,15 @@ static int __init cf_check setup_cpufreq_option(const char
> *str)
> return cpufreq_cmdline_parse(arg + 1);
> }
>
> + if ( choice < 0 && !cmdline_strcmp(str, "hwp") )
> + {
> + xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> + cpufreq_controller = FREQCTL_xen;
> + opt_cpufreq_hwp = true;
> + if ( *arg && *(arg + 1) )
A pretty unusual way of writing arg[1].
> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -246,4 +246,7 @@ void cpufreq_dbs_timer_resume(void);
>
> void intel_feature_detect(struct cpufreq_policy *policy);
>
> +extern bool __initdata opt_cpufreq_hwp;
No __initdata or alike on declarations please. Section placement
attributes (among others) only belong on the definition.
> --- a/xen/include/acpi/cpufreq/processor_perf.h
> +++ b/xen/include/acpi/cpufreq/processor_perf.h
> @@ -7,6 +7,9 @@
>
> #define XEN_PX_INIT 0x80000000
>
> +bool hwp_available(void);
> +int hwp_register_driver(void);
> +
> int powernow_cpufreq_init(void);
> unsigned int powernow_register_driver(void);
> unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);
The existing split between what lives in which header is pretty
arbitrary from all I can tell. Is there a particular reason you can't
keep together the four declarations you add?
> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -296,6 +296,7 @@ struct xen_ondemand {
> uint32_t up_threshold;
> };
>
> +#define XEN_HWP_DRIVER "hwp"
I think this wants some addition to the identifier which makes it expected
that the expansion is a string literal. Perhaps XEN_HWP_DRIVER_NAME?
> /*
Nit: There also wants to be a blank line between #define and comment.
Jan
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |