Re: [PATCH v2] amd: disable C6 after 1000 days on Zen2
On Fri, Jun 30, 2023 at 03:18:20PM +0200, Roger Pau Monne wrote:
> As specified on Errata 1474:
>
> "A core will fail to exit CC6 after about 1044 days after the last
> system reset. The time of failure may vary depending on the spread
> spectrum and REFCLK frequency."
>
> Detect when running on AMD Zen2 (family 17h models 30-3fh, 60-6fh or
> 70-7fh) and setup a timer to prevent entering C6 after 1000 days of
> uptime. Take into account the TSC value at boot in order to account
> for any time elapsed before Xen has been booted. Worst case we end
> up disabling C6 before strictly necessary, but that would still be
> safe, and it's better than not taking the TSC value into account and
> hanging.
>
> Disable C6 by updating the MSR listed in the revision guide, this
> avoids applying workarounds in the CPU idle drivers, as the processor
> won't be allowed to enter C6 by the hardware itself.
>
> Print a message once C6 is disabled in order to let the user know.
>
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> ---
> The current Revision Guide for Fam17h model 60-6Fh (Lucienne and
> Renoir) hasn't been updated to reflect the MSR workaround, but the PPR
> for those models lists the MSR and the bits as having the expected
> meaning, so I assume it's safe to apply the same workaround there.
>
> For all accounts this seems to affect all Zen2 models, and hence the
> workaround should be the same. Might also affect Hygon, albeit I
> think Hygon is strictly limited to Zen1.
> ---
> Changes since v1:
>  - Apply the workaround listed by AMD: toggle some MSR bits.
>  - Do not apply the workaround if virtualized.
>  - Check for STIBP feature instead of listing specific models.
>  - Implement the DAYS macro based on SECONDS.
> ---
>  xen/arch/x86/cpu/amd.c               | 70 ++++++++++++++++++++++++++++
>  xen/arch/x86/include/asm/msr-index.h |  5 ++
>  xen/include/xen/time.h               |  1 +
>  3 files changed, 76 insertions(+)
>
> diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
> index 0eaef82e5145..bdf45f8387e8 100644
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -51,6 +51,8 @@ bool __read_mostly amd_acpi_c1e_quirk;
>  bool __ro_after_init amd_legacy_ssbd;
>  bool __initdata amd_virt_spec_ctrl;
>
> +static bool __read_mostly c6_disabled;
> +
>  static inline int rdmsr_amd_safe(unsigned int msr, unsigned int *lo,
>  				 unsigned int *hi)
>  {
> @@ -905,6 +907,31 @@ void __init detect_zen2_null_seg_behaviour(void)
>
>  }
>
> +static void cf_check disable_c6(void *arg)
> +{
> +	uint64_t val;
> +
> +	if (!c6_disabled) {
> +		printk(XENLOG_WARNING
> +    "Disabling C6 after 1000 days apparent uptime due to AMD errata 1474\n");
> +		c6_disabled = true;
> +		smp_call_function(disable_c6, NULL, 0);

I've realized this is racy with CPU hotplug: a CPU being brought up
could miss seeing c6_disabled set and also not yet be set in
cpu_online_map when the call to smp_call_function() happens, so it
would never get C6 disabled. I will need to inhibit CPU hotplug around
the call to smp_call_function() to prevent that.

Thanks, Roger.
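
For reference, the uptime accounting described in the quoted commit
message (using the TSC value at boot to cover time elapsed before Xen
started) could be sketched roughly as below. This is only an
illustration, not the code from the patch: c6_timer_delay_ns() and
C6_LIMIT_MS are hypothetical names, the SECONDS/DAYS macros are local
stand-ins, and the sketch assumes the TSC started from zero at the last
system reset and ticks at a constant, known rate.

```c
#include <stdint.h>

/* Local stand-ins for nanosecond time helpers; DAYS is built on SECONDS. */
#define SECONDS(s) ((int64_t)(s) * 1000000000LL)
#define DAYS(d)    SECONDS((int64_t)(d) * 86400LL)

/* Threshold from the patch description (1000 days), in milliseconds. */
#define C6_LIMIT_MS ((uint64_t)(DAYS(1000) / 1000000LL))

/*
 * Hypothetical helper: nanoseconds left before 1000 days of apparent
 * uptime, given the TSC value read at boot and the TSC frequency in kHz.
 * Assumes the TSC counted from zero at the last system reset.
 * Returns 0 when C6 should be disabled straight away.
 */
static int64_t c6_timer_delay_ns(uint64_t boot_tsc, uint64_t tsc_khz)
{
    uint64_t elapsed_ms;

    /* Unknown frequency: err on the safe side and disable immediately. */
    if (tsc_khz == 0)
        return 0;

    /* ticks / (ticks per millisecond) = milliseconds since reset */
    elapsed_ms = boot_tsc / tsc_khz;

    /* Compare in milliseconds first so the conversion cannot overflow. */
    if (elapsed_ms >= C6_LIMIT_MS)
        return 0;

    return (int64_t)(C6_LIMIT_MS - elapsed_ms) * 1000000LL;
}
```

With a 3 GHz TSC (tsc_khz = 3000000) and a boot-time TSC of 3e9 ticks,
this gives one second less than the full 1000 days. Any error in the
estimate only makes C6 get disabled earlier than strictly needed, which
matches the "still safe" reasoning in the commit message.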