[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] amd: disable C6 after 1000 days on Zen2


  • To: Roger Pau Monne <roger.pau@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Fri, 30 Jun 2023 18:10:10 +0100
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=WMSny4zNfL0TIdsEKzwcfZQFaB3Z4tE3ImcTlDqpMGA=; b=NQaNq0fsyJAcXE0vMEIcObUvFai6+ABMRfMvXtqUxAFWiOLADqRAp4yrueyb6Dx+aCBSoiF75WQ5qGCSmfWKdTfR2vgpBePuHQzhCaeUjKwQn+cCr1WJdIwy1Rans8Kfv3wZ95onXkMgSFiZ3lPsLLD6tY/If9d3sYSUeAEoBounej3LB88smWKzPE2KPBGh6qi+eLAejbpHL+C2s4Eq7hG4Qa4lZODLGn/Q5viKQoZljwWL9+erJqAR/5YarNjuh/pR6hebxjPGfUod85hgj21TN2I2cTZL4iAqaMK4MiANfVWOX4di6uc6ACbs0JP0c4bNy1oMxFp5MJT2eBWPvQ==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=lXFO2k1kxmaOFtLjkrOXHvt0Nca2cyjF+x2+BN+kEYOIt07lUNB78VvU8if/mGNvAf57BknxlEOLTgMCRvc2SK+xS5M2tsGZKuWOfk0Ba/acRvMGEGCVYwSDFTq9pZPyZ9iZhAJosEJ7trqZsLQ1g+cTiyKtf0t7a7GhLB0KH0XKzn6w6g2w00FFQik7roYE18ooPAGl551C16hJP7E6/w4au7jocR9dtPLQsAQW7ejfm1dbBced3Yk/sPuyb0PLWUYemNftNbYQtcMuSTdhHSCUwDyVxYgkh9UsdmQ8W0yJRXSvMqaV5V5Evxw2ik3yuwm/Cc2vn2HWdFLVts87/Q==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: Jan Beulich <jbeulich@xxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • Delivery-date: Fri, 30 Jun 2023 17:10:44 +0000
  • Ironport-data: A9a23:KGo5Eqq7ndtfuQSKWvVnHHBXiBVeBmLSZBIvgKrLsJaIsI4StFCzt garIBmHOKuOZmrxLtp1b4+z9R4D75+GmoUwGQE9qChjHi1H+JuZCYyVIHmrMnLJJKUvbq7FA +Y2MYCccZ9uHhcwgj/3b9ANeFEljfngqoLUUbKCYWYpA1c/Ek/NsDo788YhmIlknNOlNA2Ev NL2sqX3NUSsnjV5KQr40YrawP9UlKq04GpwUmAWP6gR5weAzCFNVvrzGInqR5fGatgMdgKFb 76rIIGRpgvx4xorA9W5pbf3GmVirmn6ZFXmZtJ+AsBOszAazsAA+v9T2Mk0MC+7vw6hjdFpo OihgLTrIesf0g8gr8xGO/VQO3kW0aSrY9YrK1Dn2SCY5xWun3cBX5yCpaz5VGEV0r8fPI1Ay RAXAGkucTmcwP+w+aOQFOZMj+IfNcvqJoxK7xmMzRmBZRonabbqZv2WoPV+jHI3jM0IGuvCb c0EbzYpdA7HfxBEJlYQDtQ5gfusgX78NTZfrTp5p4JuuzSVkFM3j+SrbIe9lt+iHK25mm6xo G7c8nu/KRYdLNGFkhKO8262h/+JliT+MG4XPOThq6c72QDIlwT/DjU4e2Omrue3knW+ePdxB 2EN8wMl9+sLoRnDot7VGkfQTGS/lg4RXZ9cHvM37CmJy7HI+ECJC24cVDlDZdc68sgsSlQC1 FWEgtfoDjxHq6CORDSW8bL8hSy2ETgYKykFfyBsZQkY59jupqkjgxSJScxseIaulcH8Ezz0x zGMrQA9iq8VgMpN0L+0lW0rmBqpr5nNCwQztgPeWzr56hsjPNL4IYu19VLc8PBMap6DSUWMt 2QFnM7Y6/0SCZaKl2qGR+Bl8KyV2stp+Qb02TZHd6TNPRz0k5J/Vei8OA1DGXo=
  • Ironport-hdrordr: A9a23:pS6wDam9CX7Oi7Nd/ArP8XSm2Z/pDfLc3DAbv31ZSRFFG/Fw9/ rCoB17726StN91YhsdcL+7V5VoLUmzyXcx2/hzAV7AZniDhILLFuFfBOLZqlWNJ8S9zJ8+6U 4JScND4bbLfD1HZKjBgTVRE7wbsaW6GKLDv5ag85+6JzsaFZ2J7G1Ce3em+lUdfnghOXKgfq DsnPauoVCbCA0qR/X+PFYpdc7ZqebGkZr3CCR2eiLOuGG1/EuVAKeRKWni4isj
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 30/06/2023 2:18 pm, Roger Pau Monne wrote:
> diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
> index 0eaef82e5145..bdf45f8387e8 100644
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -51,6 +51,8 @@ bool __read_mostly amd_acpi_c1e_quirk;
>  bool __ro_after_init amd_legacy_ssbd;
>  bool __initdata amd_virt_spec_ctrl;
>  
> +static bool __read_mostly c6_disabled;
> +
>  static inline int rdmsr_amd_safe(unsigned int msr, unsigned int *lo,
>                                unsigned int *hi)
>  {
> @@ -905,6 +907,31 @@ void __init detect_zen2_null_seg_behaviour(void)
>  
>  }
>  
> +static void cf_check disable_c6(void *arg)
> +{
> +     uint64_t val;
> +
> +     if (!c6_disabled) {
> +             printk(XENLOG_WARNING
> +    "Disabling C6 after 1000 days apparent uptime due to AMD errata 1474\n");
> +             c6_disabled = true;
> +             smp_call_function(disable_c6, NULL, 0);
> +     }
> +
> +     /* Update the MSR to disable C6, done on all threads. */
> +     if (rdmsr_safe(MSR_AMD_CSTATE_CFG, val)) {
> +error:
> +             printk_once(XENLOG_ERR
> +             "Unable to disable C6 on AMD system affected by errata 1474\n"
> +             "Reboot recommended, otherwise system might hang\n");
> +             return;
> +     }
> +     val &= ~(CSTATE_CFG_CCR0_CC6EN | CSTATE_CFG_CCR1_CC6EN |
> +              CSTATE_CFG_CCR2_CC6EN);
> +     if (wrmsr_safe(MSR_AMD_CSTATE_CFG, val))
> +             goto error;

These MSRs don't fault, and we already excluded hypervisors previously,
so the safe() doesn't really help.  The only way you'd spot a failure is
by reading back and noticing that the update didn't take effect.

Independently, this really really makes me want to dust off my
msr_{set,clear}_bits() helpers to reduce the boilerplate required for
logic like this.

> +}
> +
>  static void cf_check init_amd(struct cpuinfo_x86 *c)
>  {
>       u32 l, h;
> @@ -1171,6 +1198,9 @@ static void cf_check init_amd(struct cpuinfo_x86 *c)
>       if ((smp_processor_id() == 1) && !cpu_has(c, X86_FEATURE_ITSC))
>               disable_c1_ramping();
>  
> +     if (c6_disabled)
> +             disable_c6(NULL);
> +
>       check_syscfg_dram_mod_en();
>  
>       amd_log_freq(c);
> @@ -1180,3 +1210,43 @@ const struct cpu_dev amd_cpu_dev = {
>       .c_early_init   = early_init_amd,
>       .c_init         = init_amd,
>  };
> +
> +static int __init cf_check c6_errata_check(void)
> +{
> +     /*
> +      * Errata #1474: A Core May Hang After About 1044 Days
> +      * Set up a timer to disable C6 after 1000 days uptime.
> +      */
> +     s_time_t delta;
> +
> +     /*
> +      * Apply to all Zen2 models.  According to AMD revision guides at least
> +      * Rome, Castle Peak, Renoir, Lucienne and Matisse are affected.

You probably want to replicate the spectral chicken comment about how
identifying Zen2 is actually very hard.

That said, I don't see anything which limits this logic to Fam17h.

> +      */
> +     if (cpu_has_hypervisor || !boot_cpu_has(X86_FEATURE_AMD_STIBP))
> +             return 0;
> +
> +     /*
> +      * Deduct current TSC value, this would be relevant if kexec'ed for
> +      * example.  Might not be accurate, but worst case we end up disabling
> +      * C6 before strictly required, which would still be safe.
> +      *
> +      * NB: all affected models (Zen2) have invariant TSC and TSC adjust
> +      * MSR, so early_time_init() will have already cleared any TSC offset.
> +      */
> +     delta = DAYS(1000) - tsc_ticks2ns(rdtsc());
> +     if (delta > 0) {
> +             static struct timer errata_c6;
> +
> +             init_timer(&errata_c6, disable_c6, NULL, 0);
> +             set_timer(&errata_c6, NOW() + delta);
> +     } else
> +             disable_c6(NULL);

I doubt the smp_call_function() is going to be happy at presmp time.

Do we really care if we're already beyond the timeout?

> +
> +     return 0;
> +}
> +/*
> + * Must be executed after early_time_init() for tsc_ticks2ns() to have been
> + * calibrated.  That prevents us doing the check in init_amd().
> + */
> +presmp_initcall(c6_errata_check);
> diff --git a/xen/arch/x86/include/asm/msr-index.h 
> b/xen/arch/x86/include/asm/msr-index.h
> index 2749e433d2a7..5df090fba791 100644
> --- a/xen/arch/x86/include/asm/msr-index.h
> +++ b/xen/arch/x86/include/asm/msr-index.h
> @@ -211,6 +211,11 @@
>  
>  #define MSR_VIRT_SPEC_CTRL                  0xc001011f /* Layout matches 
> MSR_SPEC_CTRL */
>  
> +#define MSR_AMD_CSTATE_CFG                  0xc0010296
> +#define  CSTATE_CFG_CCR0_CC6EN              (_AC(1, ULL) <<  6)
> +#define  CSTATE_CFG_CCR1_CC6EN              (_AC(1, ULL) << 14)
> +#define  CSTATE_CFG_CCR2_CC6EN              (_AC(1, ULL) << 22)

While MSR_AMD_CSTATE_CFG is liable to stay stable, the CC6EN bits are
uarch specific and not applicable to others.

I'd suggest keeping them local to c6_errata_check(), which probably
ought to gain a zen2 somewhere in the name.

They definitely can't say in a global header with names that don't tie
them to Zen2 specifically.

~Andrew



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.