
Re: [PATCH v2] amd: disable C6 after 1000 days on Zen2


  • To: xen-devel@xxxxxxxxxxxxxxxxxxxx
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Mon, 3 Jul 2023 16:06:01 +0200
  • Cc: Jan Beulich <jbeulich@xxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>
  • Delivery-date: Mon, 03 Jul 2023 14:06:39 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Fri, Jun 30, 2023 at 03:18:20PM +0200, Roger Pau Monne wrote:
> As specified in Erratum 1474:
> 
> "A core will fail to exit CC6 after about 1044 days after the last
> system reset. The time of failure may vary depending on the spread
> spectrum and REFCLK frequency."
> 
> Detect when running on AMD Zen2 (family 17h models 30-3fh, 60-6fh or
> 70-7fh) and set up a timer to prevent entering C6 after 1000 days of
> uptime.  Take into account the TSC value at boot in order to account
> for any time elapsed before Xen was booted.  In the worst case we end
> up disabling C6 earlier than strictly necessary, but that would still
> be safe, and it's better than not taking the TSC value into account
> and hanging.
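
A minimal sketch of the deadline arithmetic described above, in standalone C. It assumes the TSC started counting from zero at the last system reset and ticks at a known constant rate (tsc_khz). The function c6_disable_delay_ns() and the SECONDS()/DAYS() macros here are illustrative stand-ins, not Xen's actual definitions (DAYS() is simply built on SECONDS(), as the v2 changelog describes). Using 1000 days rather than the ~1044 days quoted in the erratum leaves some margin for the spread-spectrum and REFCLK variation it mentions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins: time in nanoseconds as a signed 64-bit value. */
#define SECONDS(s) ((int64_t)(s) * 1000000000LL)
#define DAYS(d)    SECONDS((int64_t)(d) * 24 * 60 * 60)

/*
 * Hypothetical helper: nanoseconds from now until C6 should be disabled,
 * assuming the TSC was zero at system reset and ticks at tsc_khz.
 * Returns 0 if the 1000-day deadline has already passed before boot.
 */
static int64_t c6_disable_delay_ns(uint64_t tsc_at_boot, uint64_t tsc_khz)
{
    assert(tsc_khz != 0);

    /* ticks / kHz gives milliseconds; split to keep full precision. */
    int64_t elapsed =
        (int64_t)(tsc_at_boot / tsc_khz) * 1000000LL +
        (int64_t)((tsc_at_boot % tsc_khz) * 1000000ULL / tsc_khz);
    int64_t limit = DAYS(1000);

    /* Worst case we disable C6 immediately, which is still safe. */
    return elapsed >= limit ? 0 : limit - elapsed;
}
```

The clamp to 0 mirrors the commit message's reasoning: if firmware or a previous kernel already ran long enough, disabling C6 right away is preferable to risking a hang.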
> 
> Disable C6 by updating the MSR listed in the revision guide; this
> avoids applying workarounds in the CPU idle drivers, as the hardware
> itself won't allow the processor to enter C6.
> 
> Print a message once C6 is disabled in order to let the user know.
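
A rough sketch of the "read-modify-write the MSR, warn only once" behaviour described above, using a simulated per-CPU MSR array instead of real rdmsr/wrmsr. The mask CC6_EN_MASK_HYPOTHETICAL is a placeholder: the real MSR index and bit positions come from the revision guide and are not reproduced here. A plain loop stands in for the smp_call_function() broadcast used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 4

/* Placeholder for the "CC6 enable" bits; NOT the real bit layout. */
#define CC6_EN_MASK_HYPOTHETICAL 0x0fULL

/* Simulated per-CPU C-state configuration MSR, other bits also set. */
static uint64_t cstate_cfg_msr[NR_CPUS] = { 0xff, 0xff, 0xff, 0xff };
static bool c6_disabled;
static unsigned int warnings_printed;

/* Runs "on" cpu: clear only the CC6 enable bits, preserve the rest. */
static void disable_c6_on(unsigned int cpu)
{
    uint64_t val = cstate_cfg_msr[cpu];                     /* rdmsr */
    cstate_cfg_msr[cpu] = val & ~CC6_EN_MASK_HYPOTHETICAL;  /* wrmsr */
}

/* Timer callback: the first invocation warns, then every CPU is updated. */
static void disable_c6_all(void)
{
    if (!c6_disabled) {
        printf("Disabling C6 after 1000 days apparent uptime due to AMD erratum 1474\n");
        warnings_printed++;
        c6_disabled = true;
    }
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)  /* stands in for the IPI */
        disable_c6_on(cpu);
}
```

Because the flag is checked before printing, a repeated call only rewrites the MSRs (an idempotent operation) without emitting a second warning.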
> 
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>
> ---
> The current Revision Guide for Fam17h model 60-6Fh (Lucienne and
> Renoir) hasn't been updated to reflect the MSR workaround, but the PPR
> for those models lists the MSR and the bits as having the expected
> meaning, so I assume it's safe to apply the same workaround there.
> 
> By all accounts this seems to affect all Zen2 models, and hence the
> workaround should be the same.  It might also affect Hygon, although I
> think Hygon is limited to Zen1.
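
A sketch contrasting the two detection strategies mentioned in this mail: the explicit family 17h model ranges from the commit message, and the v2 changelog's STIBP check (presumably used to tell Zen2 apart from Zen1 within family 17h), combined with skipping the workaround when virtualized. The boolean parameters are hypothetical inputs standing in for CPUID feature checks; this is not Xen's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Model ranges quoted in the commit message (family 17h only). */
static bool zen2_by_model(unsigned int family, unsigned int model)
{
    if (family != 0x17)
        return false;
    return (model >= 0x30 && model <= 0x3f) ||
           (model >= 0x60 && model <= 0x6f) ||
           (model >= 0x70 && model <= 0x7f);
}

/*
 * v2-style check as described in the changelog: family 17h plus STIBP,
 * and never when running virtualized. has_stibp/virtualized are
 * hypothetical stand-ins for the real feature checks.
 */
static bool erratum_1474_applies(unsigned int family, bool has_stibp,
                                 bool virtualized)
{
    return family == 0x17 && has_stibp && !virtualized;
}
```

The feature-based check avoids having to keep a model list up to date, at the cost of relying on STIBP being a reliable Zen2 marker within family 17h.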
> ---
> Changes since v1:
>  - Apply the workaround listed by AMD: toggle some MSR bits.
>  - Do not apply the workaround if virtualized.
>  - Check for STIBP feature instead of listing specific models.
>  - Implement the DAYS macro based on SECONDS.
> ---
>  xen/arch/x86/cpu/amd.c               | 70 ++++++++++++++++++++++++++++
>  xen/arch/x86/include/asm/msr-index.h |  5 ++
>  xen/include/xen/time.h               |  1 +
>  3 files changed, 76 insertions(+)
> 
> diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c
> index 0eaef82e5145..bdf45f8387e8 100644
> --- a/xen/arch/x86/cpu/amd.c
> +++ b/xen/arch/x86/cpu/amd.c
> @@ -51,6 +51,8 @@ bool __read_mostly amd_acpi_c1e_quirk;
>  bool __ro_after_init amd_legacy_ssbd;
>  bool __initdata amd_virt_spec_ctrl;
>  
> +static bool __read_mostly c6_disabled;
> +
>  static inline int rdmsr_amd_safe(unsigned int msr, unsigned int *lo,
>                                unsigned int *hi)
>  {
> @@ -905,6 +907,31 @@ void __init detect_zen2_null_seg_behaviour(void)
>  
>  }
>  
> +static void cf_check disable_c6(void *arg)
> +{
> +     uint64_t val;
> +
> +     if (!c6_disabled) {
> +             printk(XENLOG_WARNING
> +    "Disabling C6 after 1000 days apparent uptime due to AMD errata 1474\n");
> +             c6_disabled = true;
> +             smp_call_function(disable_c6, NULL, 0);

I've realized this is racy with CPU hotplug, so I will need to inhibit
CPU hotplug around the call to smp_call_function().  Otherwise a CPU
being brought online concurrently could both miss c6_disabled being
set (so its own bring-up path doesn't disable C6) and be absent from
cpu_online_map when smp_call_function() runs (so it doesn't receive
the IPI either), leaving C6 enabled on that CPU.
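
A userspace sketch of the fix described above, assuming the CPU bring-up path checks the flag while holding the same lock. A pthread mutex stands in for Xen's CPU hotplug lock and a loop over online CPUs stands in for smp_call_function(); every name here is hypothetical. Because "set flag + broadcast" and "bring CPU online" are serialized, each CPU is either already online when the broadcast runs or observes the flag during its own bring-up.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Stand-in for the hypervisor's CPU hotplug lock (hypothetical). */
static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static bool online[NR_CPUS];
static bool c6_enabled[NR_CPUS];
static bool c6_disabled_flag;

/* Stands in for the per-CPU MSR write. */
static void disable_c6_local(unsigned int cpu)
{
    c6_enabled[cpu] = false;
}

/* Bring-up path: check the flag and join the online set under the lock. */
static void cpu_up(unsigned int cpu)
{
    pthread_mutex_lock(&hotplug_lock);
    c6_enabled[cpu] = true;          /* default state after reset */
    if (c6_disabled_flag)
        disable_c6_local(cpu);       /* late CPU applies the workaround itself */
    online[cpu] = true;
    pthread_mutex_unlock(&hotplug_lock);
}

/* Timer path: set the flag and "IPI" all online CPUs with hotplug held off. */
static void disable_c6_all(void)
{
    pthread_mutex_lock(&hotplug_lock);
    c6_disabled_flag = true;
    for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
        if (online[cpu])
            disable_c6_local(cpu);
    pthread_mutex_unlock(&hotplug_lock);
}
```

Without the lock, a CPU could read the flag as false in cpu_up() and only be marked online after the loop in disable_c6_all() has finished, which is exactly the window described in this reply.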

Thanks, Roger.
