[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v3] amd: disable C6 after 1000 days on Zen2


  • To: Roger Pau Monne <roger.pau@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 31 Jul 2023 12:37:15 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=MJpp6qr5ls6DFsVwk71i7NZrucn024HOf3H9y8hTnqs=; b=NvJwzo4ARtSvR55hLLiryMkvDvmH0W6ykUW/G5UbhLHgxa/pv0y9qHK/BN6HlsUK8i4OoaIpy/Mn7USMkK/zvVHuA9T4GZPKlU24oI2ya4AyGffs/JinJywXApnT/QW+jUQwtNV5p+9dF4wo3wyu2W94G+OuyEFUV5wZ7T3JX1LMSZ1/LjJvLubdaGwEBQa6t+oXX8AS2BQympS0+WtflOeGfqtMlU3K8ytU0PQOcH5cxkBsz1irJiCSNlka3zPTutZ0X1HSnNl4jzL1GBGCtaVCA3mysTMBg5IJzTvKaCVypgnyujiBH0oAZfL6glmy4VkCD83RvivazLgUA/jcIg==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=oO1w7LW19R2Z7lPSZ6ntbvR69i1X8VfsvfZf/CNDq7Eo+/eNP3073ccvR2ffGCXGgxo7p73vQYtKfzRvOHUHJJTwGs6d8zrIwzrgSFdemqfj2crqYvvp9Gi/U1M+wgyE0whUICRVIZFdJdHQJbT9fEQP7zZCQIc3hsbKltHCAP4+KwaDTFPWwhSO2KBndKdK1fWJxBBlfPTL4MQiDP2T49pYhjE94LZkpE0GueAcRRRD+JixV0m3wtbmPZ1+P3pOXuUF2kZywoIR1Cdc3Jj802PA0Cbg4tkXKBBbbvPmInEa80m8l9tRul+2ou4i/QKcXNtWPe65ECFjO8aKwlDpog==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 31 Jul 2023 10:37:21 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 28.07.2023 16:47, Roger Pau Monne wrote:
> As specified on Errata 1474:
> 
> "A core will fail to exit CC6 after about 1044 days after the last
> system reset. The time of failure may vary depending on the spread
> spectrum and REFCLK frequency."
> 
> Detect when running on AMD Zen2 and setup a timer to prevent entering
> C6 after 1000 days of uptime.  Take into account the TSC value at boot
> in order to account for any time elapsed before Xen has been booted.
> Worst case we end up disabling C6 before strictly necessary, but that
> would still be safe, and it's better than not taking the TSC value
> into account and hanging.
> 
> Disable C6 by updating the MSR listed in the revision guide, this
> avoids applying workarounds in the CPU idle drivers, as the processor
> won't be allowed to enter C6 by the hardware itself.
> 
> Print a message once C6 is disabled in order to let the user know.
> 
> Signed-off-by: Roger Pau Monné <roger.pau@xxxxxxxxxx>

Reviewed-by: Jan Beulich <jbeulich@xxxxxxxx>
with two remarks:

> @@ -1249,3 +1282,44 @@ const struct cpu_dev amd_cpu_dev = {
>       .c_early_init   = early_init_amd,
>       .c_init         = init_amd,
>  };
> +
> +static int __init cf_check zen2_c6_errata_check(void)
> +{
> +     /*
> +      * Errata #1474: A Core May Hang After About 1044 Days
> +      * Set up a timer to disable C6 after 1000 days uptime.
> +      */
> +     s_time_t delta;
> +
> +     /*
> +      * Zen1 vs Zen2 isn't a simple model number comparison, so use STIBP as
> +      * a heuristic to separate the two uarches in Fam17h.
> +      */
> +     if (cpu_has_hypervisor || boot_cpu_data.x86 != 0x17 ||
> +         !boot_cpu_has(X86_FEATURE_AMD_STIBP))
> +             return 0;
> +
> +     /*
> +      * Deduct current TSC value, this would be relevant if kexec'ed for
> +      * example.  Might not be accurate, but worst case we end up disabling
> +      * C6 before strictly required, which would still be safe.

I'm not really convinced of this being the worst case. TSC can be
written, and hence it could also have been altered in a way making
things look as if system wasn't running for a long time when really
it has been. But I'm okay to leave that special case aside right
now.

> +      * NB: all affected models (Zen2) have invariant TSC and TSC adjust
> +      * MSR, so early_time_init() will have already cleared any TSC offset.
> +      */
> +     delta = DAYS(1000) - tsc_ticks2ns(rdtsc());
> +     if (delta > 0) {
> +             static struct timer errata_c6;
> +
> +             init_timer(&errata_c6, zen2_disable_c6, NULL, 0);
> +             set_timer(&errata_c6, NOW() + delta);
> +     } else
> +             zen2_disable_c6(NULL);

Strictly speaking you don't need the if/else here, since timers set
in the past will simply have their handlers executed right away (and
if that wasn't the case, there would be a race here).

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.