[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] amd: disable C6 after 1000 days on Zen2


  • To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Mon, 3 Jul 2023 11:28:31 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=CLP1TraNndqjN77FTbv4iykyD/Gr/u0nmg9JO9jv7ac=; b=jihuNKG2+0eELS5HZK6v1k4A3wPZGmrTRaMfcYv1F5m6E0aE+pK0wA0C8SeOuL2wNKctEJv/I3l1kdcKUoS8m0fp35cXO03CVbCsiYtlkVArxs26byFzkuwabvIvTd4OSbuM7ZCPHXIkD0r993UuNGsLD/kzKGOo3uuwI8nDAe4zMGOKjTp+Q4COMpEZUmAkiwJpDiMgBTZmofFtWO9qsJ4Zpb/IkVdFhIrTMYAnC3ZuhmJMPvBA2YCP6/Jo57l3GWVrH7AeTrHPmE7OJjfpwCj5SBVze0fmfZx1MpnWuysIiCrEQK9Ya/3N08zv+ZUSNfnimXtd1nghDHi8GLRSxQ==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=eFlFbuCbvjBD2ywW/gITMlAZkKXM7/mip5Z6R4k3w98CaKaRcq9im5J/3xEGn1HAZdH8F4WSnbNhfhMkM6PDvR46csewyylpFMoA5RxSXyyKgZlcFNyfHekb2siDL1seKqdzDZykikw6RcroRmz1Ii0yLIW5/a5xclSWmB7R/MfbtEbxlXZYLBR0L3Cv+J5RasNmWRYJMUYsUZz4bs6YxvvdtUOgF1VDsIyw0/g7Ew//peiu6LY9mZy0UdjMAdnD5dtNJZQDicnbkVlCekJO0K+W7UhxREXOaNPlG3Ty+gxwjJSE+A1gwLHO0vwPgDlE6bla6a8cHrfEraBeGIXtVA==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: xen-devel@xxxxxxxxxxxxxxxxxxxx, Jan Beulich <jbeulich@xxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • Delivery-date: Mon, 03 Jul 2023 09:28:53 +0000
  • Ironport-data: A9a23:jI79Mq67sAi+C6c+5ZcJywxRtPbGchMFZxGqfqrLsTDasY5as4F+v mZMCmuHb67fMWb3f9xyaovi8UgAsZ+Az4IyQAprpS0yHi5G8cbLO4+Ufxz6V8+wwm8vb2o8t plDNYOQRCwQZiWBzvt4GuG59RGQ7YnRGvynTraCYnsrLeNdYH9JoQp5nOIkiZJfj9G8Agec0 fv/uMSaM1K+s9JOGjt8B5mr9lU35ZwehBtC5gZlPa8T5geE/5UoJMl3yZ+ZfiOQrrZ8RoZWd 86bpJml82XQ+QsaC9/Nut4XpWVTH9Y+lSDX4pZnc/DKbipq/0Te4Y5iXBYoUm9Fii3hojxE4 I4lWapc6+seFvakdOw1C3G0GszlVEFM0OevzXOX6aR/w6BaGpdFLjoH4EweZOUlFuhL7W5m1 fkhCg1VREu6oPPn8bO3ZblBuv4KBZy+VG8fkikIITDxK98DGcyGZpqQoNhS0XE3m9xEGuvYa 4wBcz1zYR/cYhpJfFAKFJY5m+TujX76G9FagAvN+exrvC6Ml0ooj+GF3Nn9I7RmQe1PmUmVv CTe9nnRCRAGLt2PjzGC9xpAg8eWxH+rBdhJSOfQGvhC0QWO2EYXKicscgGVp6O6j36jZ/EcJ BlBksYphe1onKCxdfHtUhv9rHOasxo0X9tLD/Z8+AyL0rDT4QuSGi4DVDEpQN4sudIyRDcq/ kSUhN6vDjtq2JWKTVqN+7HSqim9UQAXMGsDaCksXQYDpd75r+kblQnTR9xuFKq0iNzdGjzqx T2O6i8kiN0uYdUj0qy6+RXCnGiqr52REgotvFyIBySi8x9zY5Oja8qw81/H4P1cLYGfCF6co HwDnMvY5+cLZX2QqBGwrCw2NOnBz5643Pf02DaDw7FJG+yRxkOe
  • Ironport-hdrordr: A9a23:o/wo4KmO3FlL3fhdnn1GbipWA4DpDfLo3DAbv31ZSRFFG/Fw9/ rCoB17726QtN91YhsdcL+7V5VoLUmzyXcX2/hyAV7BZmnbUQKTRekP0WKL+Vbd8kbFh41gPM lbEpSXCLfLfCJHZcSR2njELz73quP3jJxBho3lvghQpRkBUdAF0+/gYDzranGfQmN9dP0EPa vZ3OVrjRy6d08aa8yqb0N1JNQq97Xw5fTbiQdtPW9f1DWz
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Fri, Jun 30, 2023 at 06:10:10PM +0100, Andrew Cooper wrote:
> On 30/06/2023 2:18 pm, Roger Pau Monne wrote:
> > diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
> > index 0eaef82e5145..bdf45f8387e8 100644
> > --- a/xen/arch/x86/cpu/amd.c
> > +++ b/xen/arch/x86/cpu/amd.c
> > @@ -51,6 +51,8 @@ bool __read_mostly amd_acpi_c1e_quirk;
> >  bool __ro_after_init amd_legacy_ssbd;
> >  bool __initdata amd_virt_spec_ctrl;
> >  
> > +static bool __read_mostly c6_disabled;
> > +
> >  static inline int rdmsr_amd_safe(unsigned int msr, unsigned int *lo,
> >                              unsigned int *hi)
> >  {
> > @@ -905,6 +907,31 @@ void __init detect_zen2_null_seg_behaviour(void)
> >  
> >  }
> >  
> > +static void cf_check disable_c6(void *arg)
> > +{
> > +   uint64_t val;
> > +
> > +   if (!c6_disabled) {
> > +           printk(XENLOG_WARNING
> > +    "Disabling C6 after 1000 days apparent uptime due to AMD errata 
> > 1474\n");
> > +           c6_disabled = true;
> > +           smp_call_function(disable_c6, NULL, 0);
> > +   }
> > +
> > +   /* Update the MSR to disable C6, done on all threads. */
> > +   if (rdmsr_safe(MSR_AMD_CSTATE_CFG, val)) {
> > +error:
> > +           printk_once(XENLOG_ERR
> > +           "Unable to disable C6 on AMD system affected by errata 1474\n"
> > +           "Reboot recommended, otherwise system might hang\n");
> > +           return;
> > +   }
> > +   val &= ~(CSTATE_CFG_CCR0_CC6EN | CSTATE_CFG_CCR1_CC6EN |
> > +            CSTATE_CFG_CCR2_CC6EN);
> > +   if (wrmsr_safe(MSR_AMD_CSTATE_CFG, val))
> > +           goto error;
> 
> These MSRs don't fault, and we already excluded hypervisors previously,
> so the safe() doesn't really help.  The only way you'd spot a failure is
> by reading back and noticing that the update didn't take effect.

I was worried about people explicitly clearing the hypervisor flag in
CPUID and then Xen crashing on boot, so better safe than sorry.  I
don't see much value in crashing here if the MSR is somehow
unavailable.

> Independently, this really really makes me want to dust off my
> msr_{set,clear}_bits() helpers to reduce the boilerplate required for
> logic like this.
> 
> > +}
> > +
> >  static void cf_check init_amd(struct cpuinfo_x86 *c)
> >  {
> >     u32 l, h;
> > @@ -1171,6 +1198,9 @@ static void cf_check init_amd(struct cpuinfo_x86 *c)
> >     if ((smp_processor_id() == 1) && !cpu_has(c, X86_FEATURE_ITSC))
> >             disable_c1_ramping();
> >  
> > +   if (c6_disabled)
> > +           disable_c6(NULL);
> > +
> >     check_syscfg_dram_mod_en();
> >  
> >     amd_log_freq(c);
> > @@ -1180,3 +1210,43 @@ const struct cpu_dev amd_cpu_dev = {
> >     .c_early_init   = early_init_amd,
> >     .c_init         = init_amd,
> >  };
> > +
> > +static int __init cf_check c6_errata_check(void)
> > +{
> > +   /*
> > +    * Errata #1474: A Core May Hang After About 1044 Days
> > +    * Set up a timer to disable C6 after 1000 days uptime.
> > +    */
> > +   s_time_t delta;
> > +
> > +   /*
> > +    * Apply to all Zen2 models.  According to AMD revision guides at least
> > +    * Rome, Castle Peak, Renoir, Lucienne and Matisse are affected.
> 
> You probably want to replicate the spectral chicken comment about how
> identifying Zen2 is actually very hard.
> 
> That said, I don't see anything which limits this logic to Fam17h.

Oh, right, will add the missing fam17h check.

> > +    */
> > +   if (cpu_has_hypervisor || !boot_cpu_has(X86_FEATURE_AMD_STIBP))
> > +           return 0;
> > +
> > +   /*
> > +    * Deduct current TSC value, this would be relevant if kexec'ed for
> > +    * example.  Might not be accurate, but worst case we end up disabling
> > +    * C6 before strictly required, which would still be safe.
> > +    *
> > +    * NB: all affected models (Zen2) have invariant TSC and TSC adjust
> > +    * MSR, so early_time_init() will have already cleared any TSC offset.
> > +    */
> > +   delta = DAYS(1000) - tsc_ticks2ns(rdtsc());
> > +   if (delta > 0) {
> > +           static struct timer errata_c6;
> > +
> > +           init_timer(&errata_c6, disable_c6, NULL, 0);
> > +           set_timer(&errata_c6, NOW() + delta);
> > +   } else
> > +           disable_c6(NULL);
> 
> I doubt the smp_call_function() is going to be happy at presmp time.

I did take a look and forced it by setting delta = 0,  AFAICT it works
fine, cpu_online_map will be zeroed, so on_selected_cpus() is just a
noop.  Interrupts are enabled by that point, so I don't see a problem
with this approach.

> Do we really care if we're already beyond the timeout?

I would expect that C6 will already be disabled in that case, but I
don't see an issue with Xen also setting those bits.

> > +
> > +   return 0;
> > +}
> > +/*
> > + * Must be executed after early_time_init() for tsc_ticks2ns() to have been
> > + * calibrated.  That prevents us doing the check in init_amd().
> > + */
> > +presmp_initcall(c6_errata_check);
> > diff --git a/xen/arch/x86/include/asm/msr-index.h 
> > b/xen/arch/x86/include/asm/msr-index.h
> > index 2749e433d2a7..5df090fba791 100644
> > --- a/xen/arch/x86/include/asm/msr-index.h
> > +++ b/xen/arch/x86/include/asm/msr-index.h
> > @@ -211,6 +211,11 @@
> >  
> >  #define MSR_VIRT_SPEC_CTRL                  0xc001011f /* Layout matches 
> > MSR_SPEC_CTRL */
> >  
> > +#define MSR_AMD_CSTATE_CFG                  0xc0010296
> > +#define  CSTATE_CFG_CCR0_CC6EN              (_AC(1, ULL) <<  6)
> > +#define  CSTATE_CFG_CCR1_CC6EN              (_AC(1, ULL) << 14)
> > +#define  CSTATE_CFG_CCR2_CC6EN              (_AC(1, ULL) << 22)
> 
> While MSR_AMD_CSTATE_CFG is liable to stay stable, the CC6EN bits are
> uarch specific and not applicable to others.
> 
> I'd suggest keeping them local to c6_errata_check(), which probably
> ought to gain a zen2 somewhere in the name.
> 
> They definitely can't say in a global header with names that don't tie
> them to Zen2 specifically.

That MSR and bit definitions are also available on family 19h model
11h and model 70h (with even more bits added to the MSR), so it feels
weird to name them Zen2 specifically (as it seems AMD is attempting to
keep the current bits stable across families).

I can move the defines to inside the function, but I'm not convinced
those are uarch specific.

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.