[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] amd: disable C6 after 1000 days on Fam17h models 30-3fh


  • To: Jan Beulich <jbeulich@xxxxxxxx>
  • From: Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Date: Tue, 6 Jun 2023 14:47:53 +0200
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=YDwDyoZp+SxQZl1iVzj/KEpIuUlTHSAv97fZkXKuQNw=; b=Hf/hrdwQ2iKoHKy6ffRXdE3a076gpxx0e5hi/Ny1cQ0LOra/sNQT/dJTe3iRH0YSk+Qf/cOdnLz6EnfAmHocelnDiHUPwnypa+CqDzU5FZag9aFwl4SZ8vQa/lwGxETT51zXCbLhQvahzeq4SNStYAHhhHP08Yeqa3YtW6vVpShG0BXwCF0v6AVmoB2hVfxhITJzVX4k9FvUGfDWEq53NA23MdnDlix3eJ5Urp/vP7r028/yCC07g14EmmsGUwn9pCVLuFvw1q4VnIpiNrxTopQvKuOcvJMlYhlNBXxRg48/x5T2zyK+UTN6QdeiSWAE5lTw4jeH337sU/5ZThAyMA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=Z6DJ3HvpBHY03ciqfJ5kFG0kde9Hz67c1b5waNNEad9eWsdHfNIBTRH0YShMy1miac9tTWOqL7/4VErm3w+HYIRZH1pnRSbsEuNM8mSF6qeIxkHWaJ0kM0j9rYmNFd4H7NeySEON+P+kAHVZZ1tPteZaF07wSyCQhWOwnv2Ge9a4Ue6Iio6Oje68gKaSzopVmyYD8lqWXn+P1zSwDmeODrCpjOphRJ3ROcUrssSXq9ZdVKnioDO2cpaUI8kjzGlsyN/QEgUXmn0cFfOaZ70fS5TBZXVrrewAyUiY3ymNsU4P3CLbOeAR9MzQgRdpP9Vg1SXwXGKn/Hm87w0tVigRLg==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Tue, 06 Jun 2023 12:49:06 +0000
  • Ironport-data: A9a23:sFLXBquJ2sNKXTPKA5FXykwmS+fnVJhfMUV32f8akzHdYApBsoF/q tZmKWnTPa2OMTPwLd0jaIXjpxkB65fUzodjHQJkqng1RStA+JbJXdiXEBz9bniYRiHhoOCLz O1FM4Wdc5pkJpP4jk3wWlQ0hSAkjclkfpKlVKiffHg3HVQ+IMsYoUoLs/YjhYJ1isSODQqIu Nfjy+XSI1bg0DNvWo4uw/vrRChH4rKq4Fv0gnRkPaoQ5AGGySFPZH4iDfrZw0XQE9E88tGSH 44v/JnhlkvF8hEkDM+Sk7qTWiXmlZaLYGBiIlIPM0STqkAqSh4ai87XB9JFAatjsB2bnsgZ9 Tl4ncfYpTHFnEH7sL91vxFwS0mSNEDdkVPNCSDXXce7lyUqf5ZwqhnH4Y5f0YAwo45K7W9yG fMwNRooNiiZi+uPxpGFZ/Qz3PojK8jzBdZK0p1g5Wmx4fcOZ7nmGv2PyfoGmTA6i4ZJAOrUY NcfZXx3dhPcbhZTO1ARTpUjgOOvgXq5eDpdwL6XjfNvvy6Pk0osgP60aIS9lt+iHK25mm6Co W3L5SLhCwwyP92D0zuVtHmrg4cjmAuiAdtKROziq6MCbFu77UgPBjQMZ1GCn96Lk1CzZdBON HEN0397xUQ13AnxJjXnZDW6qnOZuh8XW/JLDvY3rgqKz8L8/AKxFmUCCDlbZ7QOttIyRDEs/ k+EmZXuHzMHmKaOVXuX+7OQrDWzESsYN2kPYWkDVwRty8nupsQ/gwzCSv5nEbWplZvlFDfo2 TeIoSMiwbIJgqY2O76T+FnGh3ego8bPRwtsvgHPBDv9sEV+eZKvYJGu5R7D9/FcIY2FT16H+ n8Zh8yZ6+NIBpaI/MCQfNgw8HiSz67tGFXhbZRHRfHNKxzFF6afQL1t
  • Ironport-hdrordr: A9a23:VDZ4vqhmHgkedZGLThdrBzy18XBQX7t23DAbv31ZSRFFG/FwyP rCoB1L73XJYWgqM03I+eruBEBPewK4yXdQ2/hoAV7CZniehILMFu1fBOTZowEIdxeOldK1kJ 0QCJSWa+eAcWSS7/yKhzVQeuxIqLfnzEnrv5a5854Ed3AWV0gK1XYcNu/0KDwVeOEQbqBJbq Z0q/A30QZJPh8sH7SGL0hAe9KGi8zAlZrgbxJDLxk76DOWhTftxK/mHwOe1hI+VSoK5bs562 DKnyHw+63m6piAu1Xh/l6Wy64TtMrqy9NFCsDJos8JKg/0ggLtSJV9V6aEtDUVpvjqzFoxit HDrzopIsw2wXLMeWOepwfrxmDboX0Twk6n7WXdrWrooMT/Sj5/I81dhbhBeh+cz0Y7ptlz3I 9Cwmrc7vNsfFv9tRW4w+KNewBhl0Kyr3ZnuekPj0ZHWY9bTLNKt4QQ8G5cDZ9FNiPn74IMFv VoEajnlb9rWGLfS0qcknhkwdSqUHh2NhCaQnIassjQ6DRSlGAR9Tps+OUv2lM7sL4tQZhN4O rJdo5ykqtVc8MQZaVhQM8cXMqeEAX2MFzxGVPXBW6iOLAMOnrLpZKyyq4y/vuWdJsBy4Z3sI jdUWlfqXU5dyvVeIKzNaVwg1DwqViGLHfQIpk03ek6hlS8fsumDcS7ciFuryP6yM9vR/EyWJ 6ISeBr6rHYXC/T8L1yrnzDsqlpWAYjufIuy6gGsnK107b2w97Rx5vmWceWAobROhAZfU66Kk c/fVHIVbZ9BwaQKzLFvCQ=
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Jun 05, 2023 at 05:54:50PM +0200, Jan Beulich wrote:
> On 05.06.2023 17:10, Roger Pau Monne wrote:
> > As specified on Errata 1474:
> > 
> > "A core will fail to exit CC6 after about 1044 days after the last
> > system reset. The time of failure may vary depending on the spread
> > spectrum and REFCLK frequency."
> > 
> > Detect when running on AMD Fam17h models 30h-3fh and setup a timer to
> > prevent entering C6 after 1000 days have elapsed.  Take into account
> > the TSC value at boot in order to account for any time elapsed before
> > Xen has been booted.
> 
> Models 6x are also affected as per their RG. I have some trouble with
> the site, so it's too slow going to actually try and fish out the RGs
> for the other possible models.
> 
> Given more than one set of models is affected I of course also wonder
> whether Hygon CPUs wouldn't be affected, too. But I realize we have
> hardly any means to find out.

I've considered Hygon, but as you say I have no way I know to figure
out what models are affected.

> > @@ -1189,3 +1190,44 @@ const struct cpu_dev amd_cpu_dev = {
> >     .c_early_init   = early_init_amd,
> >     .c_init         = init_amd,
> >  };
> > +
> > +static void cf_check disable_c6(void *arg)
> > +{
> > +   printk(XENLOG_WARNING
> > +          "Disabling C6 after 1000 days uptime due to AMD errata 1474\n");
> > +   amd_disable_c6 = true;
> > +}
> > +
> > +static int __init cf_check amd_c6_errata(void)
> > +{
> > +   /*
> > +    * Errata #1474: A Core May Hang After About 1044 Days
> > +    * Set up a timer to disable C6 after 1000 days uptime.
> > +    */
> > +   s_time_t;
> > +
> > +   if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
> > +       boot_cpu_data.x86 != 0x17 ||
> > +       (boot_cpu_data.x86_model & 0xf0) != 0x30)
> 
> Perhaps better ... & ~0xf, just to be future-proof?

Right, will need to change anyway to account for 0x60 models.  But
x86_model is a char, and hence the mask is 0xff only (8 bits).

> > +           return 0;
> > +
> > +   /*
> > +    * Deduct current TSC value, this would be relevant if
> > +    * kexec'ed for example.
> > +    */
> > +   delta = DAYS(1000) - tsc_ticks2ns(rdtsc());
> 
> Generally the TSC can be written (and we actually do so under certain
> circumstances), so deriving any absolute time from the TSC value is of
> at best questionable value.

It's IMO better than not accounting for the TSC at all. Worst
case is we end up disabling C6 before actually required, but that
would leave us in safe position anyway.

> Furthermore, perhaps better overall to suppress all of this logic when
> we're running virtualized ourselves?

Indeed, will do that.

> > +   if (delta > 0) {
> > +           static struct timer errata_c6;
> > +
> > +           init_timer(&errata_c6, disable_c6, NULL, 0);
> > +           set_timer(&errata_c6, NOW() + delta);
> > +   } else
> > +           disable_c6(NULL);
> 
> The log message is going to be misleading in this case. Maybe pass
> that as the so far unused handler argument? Albeit I realize that this
> will mean casting away const-ness, which isn't very nice.

I think we could use the same message in both cases, what about:

"Disabling C6 after 1000 days hardware uptime due to AMD errata 1474"

> > --- a/xen/include/xen/time.h
> > +++ b/xen/include/xen/time.h
> > @@ -53,6 +53,7 @@ struct tm wallclock_time(uint64_t *ns);
> >  
> >  #define SYSTEM_TIME_HZ  1000000000ULL
> >  #define NOW()           ((s_time_t)get_s_time())
> > +#define DAYS(_d)        ((s_time_t)((_d)  * 86400000000000ULL))
> >  #define SECONDS(_s)     ((s_time_t)((_s)  * 1000000000ULL))
> >  #define MILLISECS(_ms)  ((s_time_t)((_ms) * 1000000ULL))
> >  #define MICROSECS(_us)  ((s_time_t)((_us) * 1000ULL))
> 
> While consistent with the other macros we have here, I think this would
> be neater as
> 
> #define DAYS(_d)        SECONDS((_d) * 86400ULL))

Hm, it would make more sense to introduce the missing macros between
DAYS() and SECONDS() (MINUTES() and HOURS()) and use HOURS() in DAYS()?

Thanks, Roger.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.