[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] amd: disable C6 after 1000 days on Fam17h models 30-3fh


  • To: Roger Pau Monne <roger.pau@xxxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 5 Jun 2023 17:54:50 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=suse.com; dmarc=pass action=none header.from=suse.com; dkim=pass header.d=suse.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=quJwwjMdwNdpbM84s6w5Mms2bWeK+skP5Z0pWZBzW8c=; b=cW8VHoQSi7UQPjFhm9nhrJEGthMvnariRFYVZ/XcL/vH1qylxp/a94zZTHcicsFQ0D9jUL6yhxrma+lZKxKMpZfKo2AxnJ9DLjYdurxdfDfs9TnaKnKa8TvvnQn/vObJB+w/lD2xllH0827223YgPwAEeUtOexe+UmsFwhCuj5JTPAN0O+KCMqDf0wZpcc5H82OA1SL7ABqL9uYCyk0uOw3q1jpyMLFgzfpshNQQOZbHJDusttmCCEbTcG9TH6AfTfGDVlmhgdCIaLp7uXTgH5tgZ7tGAtqf1UTMmHF40OKRfDDkbSwmyh2ndYbWH5n6nqBkertckFwgm69swImghw==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=MTA/MGgkjrqyC9Ib4EQ7XOhsBHseNmpswm5KeLp0zW4DEA4eeXvzJiYFApKbaqF21NU7AF2eYIVMvQKHUuKZjIzqtIKY7SJHFXYAck55u0iSIoJq8x2D8nM6Zh91cbAKMQW8m/rzIDhPFu2iMqx/Osev4q+WhpCknJHLrGTIR0hXfoiVLkgMfT7+ryson+Jmr2t30mC6R4DdWGohZB9gKh919dS+lujA2NoaDFmzp1fRvlVhSSTvbmh+ciin4D8lYYqHBUgkFaT+ITb07SacLYU4aNQyW6rWaUJIhdBi4mbPWZQx9YSmucBOR53HNPowBbm3CBUfTj9sx4UilOobmA==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=suse.com;
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 05 Jun 2023 15:55:01 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 05.06.2023 17:10, Roger Pau Monne wrote:
> As specified on Errata 1474:
> 
> "A core will fail to exit CC6 after about 1044 days after the last
> system reset. The time of failure may vary depending on the spread
> spectrum and REFCLK frequency."
> 
> Detect when running on AMD Fam17h models 30h-3fh and setup a timer to
> prevent entering C6 after 1000 days have elapsed.  Take into account
> the TSC value at boot in order to account for any time elapsed before
> Xen has been booted.

Models 6x are also affected as per their RG. I have some trouble with
the site, so it's too slow going to actually try and fish out the RGs
for the other possible models.

Given more than one set of models is affected I of course also wonder
whether Hygon CPUs wouldn't be affected, too. But I realize we have
hardly any means to find out.

> @@ -1189,3 +1190,44 @@ const struct cpu_dev amd_cpu_dev = {
>       .c_early_init   = early_init_amd,
>       .c_init         = init_amd,
>  };
> +
> +static void cf_check disable_c6(void *arg)
> +{
> +     printk(XENLOG_WARNING
> +            "Disabling C6 after 1000 days uptime due to AMD errata 1474\n");
> +     amd_disable_c6 = true;
> +}
> +
> +static int __init cf_check amd_c6_errata(void)
> +{
> +     /*
> +      * Errata #1474: A Core May Hang After About 1044 Days
> +      * Set up a timer to disable C6 after 1000 days uptime.
> +      */
> +     s_time_t;
> +
> +     if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
> +         boot_cpu_data.x86 != 0x17 ||
> +         (boot_cpu_data.x86_model & 0xf0) != 0x30)

Perhaps better ... & ~0xf, just to be future-proof?

> +             return 0;
> +
> +     /*
> +      * Deduct current TSC value, this would be relevant if
> +      * kexec'ed for example.
> +      */
> +     delta = DAYS(1000) - tsc_ticks2ns(rdtsc());

Generally the TSC can be written (and we actually do so under certain
circumstances), so deriving any absolute time from the TSC value is of
at best questionable value.

Furthermore, perhaps better overall to suppress all of this logic when
we're running virtualized ourselves?

> +     if (delta > 0) {
> +             static struct timer errata_c6;
> +
> +             init_timer(&errata_c6, disable_c6, NULL, 0);
> +             set_timer(&errata_c6, NOW() + delta);
> +     } else
> +             disable_c6(NULL);

The log message is going to be misleading in this case. Maybe pass
that as the so far unused handler argument? Albeit I realize that this
will mean casting away const-ness, which isn't very nice.

> --- a/xen/include/xen/time.h
> +++ b/xen/include/xen/time.h
> @@ -53,6 +53,7 @@ struct tm wallclock_time(uint64_t *ns);
>  
>  #define SYSTEM_TIME_HZ  1000000000ULL
>  #define NOW()           ((s_time_t)get_s_time())
> +#define DAYS(_d)        ((s_time_t)((_d)  * 86400000000000ULL))
>  #define SECONDS(_s)     ((s_time_t)((_s)  * 1000000000ULL))
>  #define MILLISECS(_ms)  ((s_time_t)((_ms) * 1000000ULL))
>  #define MICROSECS(_us)  ((s_time_t)((_us) * 1000ULL))

While consistent with the other macros we have here, I think this would
be neater as

#define DAYS(_d)        SECONDS((_d) * 86400ULL))

especially if considering that yet later someone may want to add YEARS().

Jan



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.