[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 06/15] cpufreq: Add Hardware P-State (HWP) driver


  • To: Jason Andryuk <jandryuk@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 19 Jun 2023 13:38:16 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=it3TDO5izUEHHb8E1V/IHsGVPvCmT2gLH0Ow3qA/DkA=; b=k7zh3UKkshZCSu6o6o+btPhtIVRyJb1xGwPNHiTPrP67NhtC8H+n8jkWUksbByP1qWn0DNFs9rntzKJ2Iov2dCb628lTueAqxauxIKeYeBNcYp8PEUw6p/kwGB4S5eqRFISyecWy0eNqnOhN6wNA0N547p5uYQxtzWRABBpHU1KLV3eKdc6iHI638cGaWaTD6o5RgWB138oP17z4k3vMPcuDN1A/+Xs8hH446HxM1HMaHL1iuWzTMo5MJ7iGB8IwYA0IqGL08fEZK8gwagwRSNmR0Sb/sMGL6Q+Tc3R+8VL8sJtx9rPDRqwspcjlGhOU4+6jEBC52jmivIuvSB8Qcw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=Ux5AOVhDnvb4w2AtLm+yQx6WyHaO5jaoJKES+3O+6o9Dn4r2tb+JppqBYF3rNKnKXYhL5GgRLroBZGQiy+jP0VJvsMp5NyXeLkrUbHD3lHnyzbfJCxstyvQqpZf65WDPlBcL3UXmiyxgikNMT5haGjke7i48f3z4DGIZrud3Dt2vQvIDt+zy62bplda/s52Vnf82UTvDcLcTYQabEyjfSD38KlSFs8CKISPrrCpokC6C2sEwvpZ9eynXHo62JaxA6XE0z4RUJp5j4lOlYYtGAOPKu62PGTIMMZRVwLayV68THxJpdsIw7vdGP+r19hXV0ry92sfuSL9353jr4rb3kg==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 19 Jun 2023 11:38:42 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 14.06.2023 20:02, Jason Andryuk wrote:
> Falling back from cpufreq=hwp to cpufreq=xen is a more user-friendly
> choice than disabling cpufreq when HWP is not available.  Specifying
> cpufreq=hwp indicates the user wants cpufreq, so, if HWP isn't
> available, it makes sense to give them the cpufreq that can be
> supported.  i.e. I can't see a user only wanting cpufreq=hwp or
> cpufreq=none, but not cpufreq=xen.

I continue to disagree with this, and I'd like to ask for another
maintainer's opinion. To me the described behavior is like getting
pears when having asked for apples. I nevertheless agree that
having said fallback as an option, but I'd see that done by giving
meaning to something like "cpufreq=hwp,xen" (without having checked
whether that doesn't have meaning already, i.e. just to give you an
idea).

> We can't use parse_boolean() since it requires a single name=val string
> and cpufreq_handle_common_option is provided two strings.  Use
> parse_bool() and manual handle no-hwp.
> 
> Write to disable the interrupt - the linux pstate driver does this.  We
> don't use the interrupts, so we can just turn them off.  We aren't ready
> to handle them, so we don't want any.  Unclear if this is necessary.
> SDM says it's default disabled.

Part of this may want to move to the description?

> --- /dev/null
> +++ b/xen/arch/x86/acpi/cpufreq/hwp.c
> @@ -0,0 +1,537 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * hwp.c cpufreq driver to run Intel Hardware P-States (HWP)
> + *
> + * Copyright (C) 2021 Jason Andryuk <jandryuk@xxxxxxxxx>
> + */
> +
> +#include <xen/cpumask.h>
> +#include <xen/init.h>
> +#include <xen/param.h>
> +#include <xen/xmalloc.h>
> +#include <asm/io.h>

What do you need this one for?

> +#include <asm/msr.h>
> +#include <acpi/cpufreq/cpufreq.h>
> +
> +static bool __ro_after_init feature_hwp;
> +static bool __ro_after_init feature_hwp_notification;
> +static bool __ro_after_init feature_hwp_activity_window;
> +
> +static bool __ro_after_init feature_hdc;
> +
> +bool __initdata opt_cpufreq_hwp;
> +static bool __ro_after_init opt_cpufreq_hdc = true;
> +
> +union hwp_request
> +{
> +    struct
> +    {
> +        unsigned int min_perf:8;
> +        unsigned int max_perf:8;
> +        unsigned int desired:8;
> +        unsigned int energy_perf:8;
> +        unsigned int activity_window:10;
> +        bool package_control:1;
> +        unsigned int reserved:16;

Better leave this and ...

> +        bool activity_window_valid:1;
> +        bool energy_perf_valid:1;
> +        bool desired_valid:1;
> +        bool max_perf_valid:1;
> +        bool min_perf_valid:1;
> +    };
> +    uint64_t raw;
> +};
> +
> +struct hwp_drv_data
> +{
> +    union
> +    {
> +        uint64_t hwp_caps;
> +        struct
> +        {
> +            unsigned int highest:8;
> +            unsigned int guaranteed:8;
> +            unsigned int most_efficient:8;
> +            unsigned int lowest:8;
> +            unsigned int reserved:32;

... this without a name? Hardware interfaces like this one are, in my
understanding, one of the main applications of unnamed bitfields.

> +        } hw;
> +    };
> +    union hwp_request curr_req;
> +    int ret;
> +    uint16_t activity_window;
> +    uint8_t minimum;
> +    uint8_t maximum;
> +    uint8_t desired;
> +    uint8_t energy_perf;
> +};
> +DEFINE_PER_CPU_READ_MOSTLY(struct hwp_drv_data *, hwp_drv_data);

static?

> +static bool hwp_handle_option(const char *s, const char *end)
> +{
> +    int ret;
> +
> +    if ( strncmp(s, "verbose", 7) == 0 )
> +    {
> +        s += 7;
> +        if ( *s == '=' )
> +        {
> +            s++;
> +            cpufreq_verbose = !!simple_strtoul(s, NULL, 0);
> +
> +            return true;
> +        }
> +
> +        if ( end == NULL ||
> +             (end == s && (*end == '\0' || *end == ',')) )
> +        {
> +            cpufreq_verbose = true;
> +
> +            return true;
> +        }
> +
> +        return false;
> +    }

Would be nice if the handling of "verbose" didn't need duplicating here.
However, if that's unavoidable, did you consider handling this as a
proper boolean instead of the custom way cpufreq_handle_common_option()
also uses?

> +bool __init hwp_available(void)
> +{
> +    unsigned int eax;
> +
> +    if ( !opt_cpufreq_hwp )
> +        return false;
> +
> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> +    {
> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> +                    boot_cpu_data.cpuid_level);
> +
> +        return false;
> +    }
> +
> +    if ( boot_cpu_data.cpuid_level < 0x16 )
> +    {
> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq 
> info\n",
> +                 boot_cpu_data.cpuid_level);
> +
> +        return false;
> +    }
> +
> +    eax = cpuid_eax(CPUID_PM_LEAF);
> +
> +    hwp_verbose("%d notify: %d act-window: %d energy-perf: %d pkg-level: %d 
> peci: %d\n",
> +                !!(eax & CPUID6_EAX_HWP),
> +                !!(eax & CPUID6_EAX_HWP_NOTIFICATION),
> +                !!(eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW),
> +                !!(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE),
> +                !!(eax & CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST),
> +                !!(eax & CPUID6_EAX_HWP_PECI));
> +
> +    if ( !(eax & CPUID6_EAX_HWP) )
> +        return false;
> +
> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) )
> +    {
> +        hwp_verbose("disabled: No energy/performance preference available");
> +
> +        return false;
> +    }
> +
> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> +    feature_hdc = eax & CPUID6_EAX_HDC;

Would you mind matching this line with the earlier three, padding-wise?

> +    hwp_verbose("Hardware Duty Cycling (HDC) %ssupported%s\n",
> +                feature_hdc ? "" : "not ",
> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", disabled"
> +                            : "");
> +
> +    hwp_verbose("HW_FEEDBACK %ssupported\n",
> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
> +
> +    cpufreq_governor_internal = feature_hwp;

What use is feature_hwp? Already its setting is a little odd (you could
use true there as much as you could here and for the return value below,
because of the earlier CPUID6_EAX_HWP check), and then I haven't been
able to find any further use of the variable.

> +    if ( feature_hwp )
> +        hwp_info("Using HWP for cpufreq\n");
> +
> +    return feature_hwp;
> +}
> +
> +static int hdc_set_pkg_hdc_ctl(unsigned int cpu, bool val)
> +{
> +    uint64_t msr;
> +
> +    if ( rdmsr_safe(MSR_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err("rdmsr_safe(MSR_PKG_HDC_CTL)\n", cpu);
> +
> +        return -1;

Nit: While blank lines often help, and we even demand them ahead of a
function's main return statement, here and ...

> +    }
> +
> +    if ( val )
> +        msr |= PKG_HDC_CTL_HDC_PKG_ENABLE;
> +    else
> +        msr &= ~PKG_HDC_CTL_HDC_PKG_ENABLE;
> +
> +    if ( wrmsr_safe(MSR_PKG_HDC_CTL, msr) )
> +    {
> +        hwp_err("wrmsr_safe(MSR_PKG_HDC_CTL): %016lx\n", cpu, msr);
> +
> +        return -1;

... here (and then again below) I think they do more harm than good.

> +static void hwp_get_cpu_speeds(struct cpufreq_policy *policy)
> +{
> +    uint32_t base_khz, max_khz, bus_khz, edx;
> +
> +    cpuid(0x16, &base_khz, &max_khz, &bus_khz, &edx);
> +
> +    policy->cpuinfo.perf_freq = base_khz * 1000;
> +    policy->cpuinfo.min_freq = base_khz * 1000;
> +    policy->cpuinfo.max_freq = max_khz * 1000;
> +    policy->min = base_khz * 1000;
> +    policy->max = max_khz * 1000;

What are the consequences (for the driver) when any of the values read
is zero? While I haven't checked in combination with HWP, I know that
CPUs may populate only some of the fields.

> +static void cf_check hwp_init_msrs(void *info)
> +{
> +    struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    uint64_t val;
> +
> +    /*
> +     * Package level MSR, but we don't have a good idea of packages here, so
> +     * just do it everytime.
> +     */
> +    if ( rdmsr_safe(MSR_PM_ENABLE, val) )
> +    {
> +        hwp_err("rdmsr_safe(MSR_PM_ENABLE)\n", policy->cpu);
> +        data->curr_req.raw = -1;
> +        return;
> +    }
> +
> +    /* Ensure we don't generate interrupts */
> +    if ( feature_hwp_notification )
> +        wrmsr_safe(MSR_HWP_INTERRUPT, 0);
> +
> +    hwp_verbose("CPU%u: MSR_PM_ENABLE: %016lx\n", policy->cpu, val);
> +    if ( !(val & PM_ENABLE_HWP_ENABLE) )
> +    {
> +        val |= PM_ENABLE_HWP_ENABLE;
> +        if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> +        {
> +            hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
> +            data->curr_req.raw = -1;
> +            return;
> +        }
> +    }
> +
> +    if ( rdmsr_safe(MSR_HWP_CAPABILITIES, data->hwp_caps) )
> +    {
> +        hwp_err("rdmsr_safe(MSR_HWP_CAPABILITIES)\n", policy->cpu);
> +        goto error;
> +    }
> +
> +    if ( rdmsr_safe(MSR_HWP_REQUEST, data->curr_req.raw) )
> +    {
> +        hwp_err("rdmsr_safe(MSR_HWP_REQUEST)\n", policy->cpu);

Having seen a number of hwp_err() by now, I think these are confusing:
The format string as seen at call sites doesn't match the number of
arguments. I see two possible solutions to this: Either you demand
that calling functions maintain a "cpu" local variable, and you simply
use that from the macro without passing it as argument. Or you flip
parameters / arguments:

        hwp_err(policy->cpu, "rdmsr_safe(MSR_HWP_REQUEST)\n");

> +        goto error;
> +    }
> +
> +    /*
> +     * Check for APERF/MPERF support in hardware
> +     * also check for boost/turbo support
> +     */
> +    intel_feature_detect(policy);
> +
> +    if ( feature_hdc )
> +    {
> +        if ( hdc_set_pkg_hdc_ctl(policy->cpu, opt_cpufreq_hdc) )
> +            goto error;
> +        if ( hdc_set_pm_ctl1(policy->cpu, opt_cpufreq_hdc) )
> +            goto error;

If either of these fails the very first time through (presumably for the
BSP), wouldn't a better course of action be to clear feature_hdc?

> +    }
> +
> +    hwp_get_cpu_speeds(policy);
> +
> +    return;
> +
> + error:
> +    data->curr_req.raw = -1;
> +    val &= ~PM_ENABLE_HWP_ENABLE;
> +    if ( wrmsr_safe(MSR_PM_ENABLE, val) )
> +        hwp_err("wrmsr_safe(MSR_PM_ENABLE, %lx)\n", policy->cpu, val);
> +
> +    return;
> +}

I think in general we avoid "return" with no value at the end of functions.

> +static void cf_check hwp_write_request(void *info)
> +{
> +    const struct cpufreq_policy *policy = info;
> +    struct hwp_drv_data *data = this_cpu(hwp_drv_data);
> +    union hwp_request hwp_req = data->curr_req;
> +
> +    data->ret = 0;
> +
> +    BUILD_BUG_ON(sizeof(union hwp_request) != sizeof(uint64_t));

Why uint64_t? Generally we try to avoid using types in sizeof() and
alike, and instead refer to applicable variables. I.e. here:

    BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw));

> +static int cf_check hwp_cpufreq_target(struct cpufreq_policy *policy,
> +                                       unsigned int target_freq,
> +                                       unsigned int relation)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpu);
> +    /* Zero everything to ensure reserved bits are zero... */
> +    union hwp_request hwp_req = { .raw = 0 };
> +
> +    /* .. and update from there */
> +    hwp_req.min_perf = data->minimum;
> +    hwp_req.max_perf = data->maximum;
> +    hwp_req.desired = data->desired;
> +    hwp_req.energy_perf = data->energy_perf;
> +    if ( feature_hwp_activity_window )
> +        hwp_req.activity_window = data->activity_window;
> +
> +    if ( hwp_req.raw == data->curr_req.raw )
> +        return 0;
> +
> +    data->curr_req = hwp_req;
> +
> +    hwp_verbose("CPU%u: wrmsr HWP_REQUEST %016lx\n", cpu, hwp_req.raw);
> +    on_selected_cpus(cpumask_of(cpu), hwp_write_request, policy, 1);

Is this at risk of being too verbose when the verbose option as given?

> +    return data->ret;
> +}
> +
> +static int cf_check hwp_cpufreq_cpu_init(struct cpufreq_policy *policy)
> +{
> +    unsigned int cpu = policy->cpu;
> +    struct hwp_drv_data *data;
> +
> +    data = xzalloc(struct hwp_drv_data);
> +    if ( !data )
> +        return -ENOMEM;
> +
> +    if ( cpufreq_opt_governor && strcmp(cpufreq_opt_governor->name,
> +                                        cpufreq_gov_hwp.name) )

Nit: I think this would better be

    if ( cpufreq_opt_governor &&
         strcmp(cpufreq_opt_governor->name, cpufreq_gov_hwp.name) )

> +        printk(XENLOG_WARNING
> +               "HWP: governor \"%s\" is incompatible with hwp. Using default 
> \"%s\"\n",
> +               cpufreq_opt_governor->name, cpufreq_gov_hwp.name);
> +    policy->governor = &cpufreq_gov_hwp;
> +
> +    per_cpu(hwp_drv_data, cpu) = data;
> +
> +    on_selected_cpus(cpumask_of(cpu), hwp_init_msrs, policy, 1);
> +
> +    if ( data->curr_req.raw == -1 )
> +    {
> +        hwp_err("Could not initialize HWP properly\n", cpu);
> +        XFREE(per_cpu(hwp_drv_data, cpu));

Is this completely safe even in the CPU online/hotplug case? It would
seem better to me to go this way:

        per_cpu(hwp_drv_data, cpu) = NULL;
        xfree(data);

Or even defer publishing "data" ...

> +        return -ENODEV;
> +    }
> +
> +    data->minimum = data->curr_req.min_perf;
> +    data->maximum = data->curr_req.max_perf;
> +    data->desired = data->curr_req.desired;
> +    data->energy_perf = data->curr_req.energy_perf;
> +    data->activity_window = data->curr_req.activity_window;

... until after it was fully populated. (But I seem to vaguely recall
that you need to use the field somewhere in the init process.)

> +    hwp_verbose("CPU%u: HWP_CAPABILITIES: %016lx\n", cpu, data->hwp_caps);

Isn't this expected (or even required) to be the same on all CPUs? If
so, no need to log every time.

> +static int cf_check hwp_cpufreq_cpu_exit(struct cpufreq_policy *policy)
> +{
> +    XFREE(per_cpu(hwp_drv_data, policy->cpu));

Same remark as above, primarily because this is also used on an error
path.

> +/*
> + * The SDM reads like turbo should be disabled with MSR_IA32_PERF_CTL and
> + * PERF_CTL_TURBO_DISENGAGE, but that does not seem to actually work, at 
> least
> + * with my HWP testing.  MSR_MISC_ENABLE and MISC_ENABLE_TURBO_DISENGAGE
> + * is what Linux uses and seems to work.
> + */

"my testing" imo wants replacing here by saying what hardware was tested
(not who did the testing).

> +static int cf_check hwp_cpufreq_update(int cpuid, struct cpufreq_policy 
> *policy)
> +{
> +    struct hwp_drv_data *data = per_cpu(hwp_drv_data, cpuid);
> +    on_selected_cpus(cpumask_of(cpuid), hwp_set_misc_turbo, policy, 1);

Nit: Blank line please between declaration(s) and statement(s). Or
alternatively drop the local variable ltogether, as it's used ...

> +    return data->ret;

... just once here.

> +}
> +
> +static const struct cpufreq_driver __initconstrel hwp_cpufreq_driver =

Seeing __initconstrel here, just as a remark (not a request for you
to do anything): I think we want to finish conversion to altcall, such
that these can all become __initconst_cf_clobber.

> --- a/xen/drivers/cpufreq/cpufreq.c
> +++ b/xen/drivers/cpufreq/cpufreq.c
> @@ -97,6 +97,15 @@ static int __init cf_check setup_cpufreq_option(const char 
> *str)
>              return cpufreq_cmdline_parse(arg + 1);
>      }
>  
> +    if ( choice < 0 && !cmdline_strcmp(str, "hwp") )
> +    {
> +        xen_processor_pmbits |= XEN_PROCESSOR_PM_PX;
> +        cpufreq_controller = FREQCTL_xen;
> +        opt_cpufreq_hwp = true;
> +        if ( *arg && *(arg + 1) )

A pretty unusual way of writing arg[1].

> --- a/xen/include/acpi/cpufreq/cpufreq.h
> +++ b/xen/include/acpi/cpufreq/cpufreq.h
> @@ -246,4 +246,7 @@ void cpufreq_dbs_timer_resume(void);
>  
>  void intel_feature_detect(struct cpufreq_policy *policy);
>  
> +extern bool __initdata opt_cpufreq_hwp;

No __initdata or alike on declarations please. Section placement
attributes (among others) only belong on the definition.

> --- a/xen/include/acpi/cpufreq/processor_perf.h
> +++ b/xen/include/acpi/cpufreq/processor_perf.h
> @@ -7,6 +7,9 @@
>  
>  #define XEN_PX_INIT 0x80000000
>  
> +bool hwp_available(void);
> +int hwp_register_driver(void);
> +
>  int powernow_cpufreq_init(void);
>  unsigned int powernow_register_driver(void);
>  unsigned int get_measured_perf(unsigned int cpu, unsigned int flag);

The existing split between what lives in which header is pretty
arbitrary from all I can tell. Is there a particular reason you can't
keep together the four declarations you add?

> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -296,6 +296,7 @@ struct xen_ondemand {
>      uint32_t up_threshold;
>  };
>  
> +#define XEN_HWP_DRIVER "hwp"

I think this wants some addition to the identifier which makes it expected
that the expansion is a string literal. Perhaps XEN_HWP_DRIVER_NAME?

>  /*

Nit: There also wants to be a blank line between #define and comment.

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.