[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3 04/14 RESEND] cpufreq: Add Hardware P-State (HWP) driver



On Fri, May 5, 2023 at 3:01 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>
> On 04.05.2023 18:56, Jason Andryuk wrote:
> > On Thu, May 4, 2023 at 9:11 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
> >> On 01.05.2023 21:30, Jason Andryuk wrote:
> >>> --- a/docs/misc/xen-command-line.pandoc
> >>> +++ b/docs/misc/xen-command-line.pandoc
> >>> @@ -499,7 +499,7 @@ If set, force use of the performance counters for 
> >>> oprofile, rather than detectin
> >>>  available support.
> >>>
> >>>  ### cpufreq
> >>> -> `= none | {{ <boolean> | xen } 
> >>> [:[powersave|performance|ondemand|userspace][,<maxfreq>][,[<minfreq>][,[verbose]]]]}
> >>>  | dom0-kernel`
> >>> +> `= none | {{ <boolean> | xen } 
> >>> [:[powersave|performance|ondemand|userspace][,<hdc>][,[<hwp>]][,[<maxfreq>]][,[<minfreq>]][,[verbose]]]}
> >>>  | dom0-kernel`
> >>
> >> Considering you use a special internal governor, the 4 governor 
> >> alternatives are
> >> meaningless for hwp. Hence at the command line level recognizing "hwp" as 
> >> if it
> >> was another governor name would seem better to me. This would then also 
> >> get rid
> >> of one of the two special "no-" prefix parsing cases (which I'm not overly
> >> happy about).
> >>
> >> Even if not done that way I'm puzzled by the way you spell out the 
> >> interaction
> >> of "hwp" and "hdc": As you say in the description, "hdc" is meaningful 
> >> only when
> >> "hwp" was specified, so even if not merged with the governors group "hwp" 
> >> should
> >> come first, and "hdc" ought to be rejected if "hwp" wasn't first 
> >> specified. (The
> >> way you've spelled it out it actually looks to be kind of the other way 
> >> around.)
> >
> > I placed them in alphabetical order, but, yes, it doesn't make sense.
> >
> >> Strictly speaking "maxfreq" and "minfreq" also should be objected to when 
> >> "hwp"
> >> was specified.
> >>
> >> Overall I'm getting the impression that beyond your "verbose" related 
> >> adjustment
> >> more is needed, if you're meaning to get things closer to how we parse the
> >> option (splitting across multiple lines to help see what I mean):
> >>
> >> `= none
> >>  | {{ <boolean> | xen } [:{powersave|performance|ondemand|userspace}
> >>                           
> >> [{,hwp[,hdc]|[,maxfreq=<maxfreq>[,minfreq=<minfreq>]}]
> >>                           [,verbose]]}
> >>  | dom0-kernel`
> >>
> >> (We're still parsing in a more relaxed way, e.g. minfreq may come ahead of
> >> maxfreq, but better be more tight in the doc than too relaxed.)
> >>
> >> Furthermore while max/min freq don't apply directly, there are still two 
> >> MSRs
> >> controlling bounds at the package and logical processor levels.
> >
> > Well, we only program the logical processor level MSRs because we
> > don't have a good idea of the packages to know when we can skip
> > writing an MSR.
> >
> > How about this:
> > `= none
> >  | {{ <boolean> | xen } {
> > [:{powersave|performance|ondemand|userspace}[,maxfreq=<maxfreq>[,minfreq=<minfreq>]]
> >                         | [:hwp[,hdc]] }
> >                           [,verbose]]}
> >  | dom0-kernel`
>
> Looks right, yes.

There is a wrinkle to using the hwp governor.  The hwp governor was
named "hwp-internal", so it needs to be renamed to "hwp" for use with
command line parsing.  That means the checking for "-internal" needs
to change to just "hwp" which removes the generality of the original
implementation.

The other issue is that if you select "hwp" as the governor, but HWP
hardware support is not available, then hwp_available() needs to reset
the governor back to the default.  This feels like a layering
violation.

I'm still investigating, but promoting hwp to a top level option -
cpufreq=hwp - might be a better arrangement.

> >>> +union hwp_request
> >>> +{
> >>> +    struct
> >>> +    {
> >>> +        uint64_t min_perf:8;
> >>> +        uint64_t max_perf:8;
> >>> +        uint64_t desired:8;
> >>> +        uint64_t energy_perf:8;
> >>> +        uint64_t activity_window:10;
> >>> +        uint64_t package_control:1;
> >>> +        uint64_t reserved:16;
> >>> +        uint64_t activity_window_valid:1;
> >>> +        uint64_t energy_perf_valid:1;
> >>> +        uint64_t desired_valid:1;
> >>> +        uint64_t max_perf_valid:1;
> >>> +        uint64_t min_perf_valid:1;
> >>
> >> The boolean fields here would probably better be of type "bool". I also
> >> don't see the need for using uint64_t for any of the other fields -
> >> unsigned int will be quite fine, I think. Only ...
> >
> > This is the hardware MSR format, so it seemed natural to use uint64_t
> > and the bit fields.  To me, uint64_t foo:$bits; better shows that we
> > are dividing up a single hardware register using bit fields.
> > Honestly, I'm unfamiliar with the finer points of laying out bitfields
> > with bool.  And the 10 bits of activity window throws off aligning to
> > standard types.
> >
> > This seems to have the correct layout:
> > struct
> > {
> >         unsigned char min_perf;
> >         unsigned char max_perf;
> >         unsigned char desired;
> >         unsigned char energy_perf;
> >         unsigned int activity_window:10;
> >         bool package_control:1;
> >         unsigned int reserved:16;
> >         bool activity_window_valid:1;
> >         bool energy_perf_valid:1;
> >         bool desired_valid:1;
> >         bool max_perf_valid:1;
> >         bool min_perf_valid:1;
> > } ;
> >
> > Or would you prefer the first 8 bit ones to be unsigned int
> > min_perf:8?
>
> Personally I think using bitfields uniformly would be better. What you
> definitely cannot use if not using a bitfield is "unsigned char", it
> ought to by uint8_t then. If using a bitfield, as said, I think it's
> best to stick to unsigned int and bool, unless field width goes
> beyond 32 bits or fields cross a 32-bit boundary.

Ok, thanks.

> >>> +bool __init hwp_available(void)
> >>> +{
> >>> +    unsigned int eax, ecx, unused;
> >>> +    bool use_hwp;
> >>> +
> >>> +    if ( boot_cpu_data.cpuid_level < CPUID_PM_LEAF )
> >>> +    {
> >>> +        hwp_verbose("cpuid_level (%#x) lacks HWP support\n",
> >>> +                    boot_cpu_data.cpuid_level);
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    if ( boot_cpu_data.cpuid_level < 0x16 )
> >>> +    {
> >>> +        hwp_info("HWP disabled: cpuid_level %#x < 0x16 lacks CPU freq 
> >>> info\n",
> >>> +                 boot_cpu_data.cpuid_level);
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    cpuid(CPUID_PM_LEAF, &eax, &unused, &ecx, &unused);
> >>> +
> >>> +    if ( !(eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE) &&
> >>> +         !(ecx & CPUID6_ECX_IA32_ENERGY_PERF_BIAS) )
> >>> +    {
> >>> +        hwp_verbose("HWP disabled: No energy/performance preference 
> >>> available");
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    feature_hwp                 = eax & CPUID6_EAX_HWP;
> >>> +    feature_hwp_notification    = eax & CPUID6_EAX_HWP_NOTIFICATION;
> >>> +    feature_hwp_activity_window = eax & CPUID6_EAX_HWP_ACTIVITY_WINDOW;
> >>> +    feature_hwp_energy_perf     =
> >>> +        eax & CPUID6_EAX_HWP_ENERGY_PERFORMANCE_PREFERENCE;
> >>> +    feature_hwp_pkg_level_ctl   = eax & 
> >>> CPUID6_EAX_HWP_PACKAGE_LEVEL_REQUEST;
> >>> +    feature_hwp_peci            = eax & CPUID6_EAX_HWP_PECI;
> >>> +
> >>> +    hwp_verbose("HWP: %d notify: %d act-window: %d energy-perf: %d 
> >>> pkg-level: %d peci: %d\n",
> >>> +                feature_hwp, feature_hwp_notification,
> >>> +                feature_hwp_activity_window, feature_hwp_energy_perf,
> >>> +                feature_hwp_pkg_level_ctl, feature_hwp_peci);
> >>> +
> >>> +    if ( !feature_hwp )
> >>> +        return false;
> >>> +
> >>> +    feature_hdc = eax & CPUID6_EAX_HDC;
> >>> +
> >>> +    hwp_verbose("HWP: Hardware Duty Cycling (HDC) %ssupported%s\n",
> >>> +                feature_hdc ? "" : "not ",
> >>> +                feature_hdc ? opt_cpufreq_hdc ? ", enabled" : ", 
> >>> disabled"
> >>> +                            : "");
> >>> +
> >>> +    feature_hdc = feature_hdc && opt_cpufreq_hdc;
> >>> +
> >>> +    hwp_verbose("HWP: HW_FEEDBACK %ssupported\n",
> >>> +                (eax & CPUID6_EAX_HW_FEEDBACK) ? "" : "not ");
> >>
> >> You report this, but you don't really use it?
> >
> > Correct.  I needed to know what capabilities my processors have.
> >
> > feature_hwp_pkg_level_ctl and feature_hwp_peci can also be dropped
> > since they aren't used beyond printing their values.  I'd still lean
> > toward keeping their printing under verbose since otherwise there
> > isn't a convenient way to know if they are available without
> > recompiling.
>
> That's fine, but wants mentioning in the description. Also respective
> variables would want to be __initdata then, be local to the function,
> or be dropped altogether. Plus you'd want to be consistent - either
> you use a helper variable for all print-only features, or you don't.

Got it, thanks.

> >>> +        if ( rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS, val) )
> >>> +        {
> >>> +            hwp_err("error rdmsr_safe(MSR_IA32_ENERGY_PERF_BIAS)\n");
> >>> +            data->curr_req.raw = -1;
> >>> +
> >>> +            return;
> >>> +        }
> >>> +
> >>> +        data->energy_perf = val & IA32_ENERGY_BIAS_MASK;
> >>> +    }
> >>
> >> In order to not need to undo the "enable" you've already done, maybe that
> >> should move down here?
> >
> > HWP needs to be enabled before the Capabilities and Request MSRs can
> > be read.
>
> I must have missed this aspect in the SDM. Do you have a pointer?

In 15.4.2 Enabling HWP
Additional MSRs associated with HWP may only be accessed after HWP is
enabled, with the exception of IA32_HWP_INTERRUPT and MSR_PPERF.
Accessing the IA32_HWP_INTERRUPT MSR requires only HWP is present as
enumerated by CPUID but does not require enabling HWP.

> >  Reading them shouldn't fail, but it seems safer to use
> > rdmsr_safe in case something goes wrong.
>
> Sure. But then the "enable" will need undoing in the unlikely event of
> failure.

Yes.

Regards,
Jason



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.