
Re: [PATCH] amd: disable C6 after 1000 days on Fam17h models 30-3fh


  • To: Jan Beulich <jbeulich@xxxxxxxx>, Roger Pau Monne <roger.pau@xxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Mon, 5 Jun 2023 17:07:14 +0100
  • Cc: Wei Liu <wl@xxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 05 Jun 2023 16:08:28 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 05/06/2023 4:54 pm, Jan Beulich wrote:
> On 05.06.2023 17:10, Roger Pau Monne wrote:
>> As specified in Erratum 1474:
>>
>> "A core will fail to exit CC6 after about 1044 days after the last
>> system reset. The time of failure may vary depending on the spread
>> spectrum and REFCLK frequency."
>>
>> Detect when running on AMD Fam17h models 30h-3fh and set up a timer to
>> prevent entering C6 after 1000 days have elapsed.  Take into account
>> the TSC value at boot in order to account for any time elapsed before
>> Xen has been booted.
> Models 6x are also affected as per their RG. I have some trouble with
> the site, so it's too slow going to actually try and fish out the RGs
> for the other possible models.
>
> Given more than one set of models is affected I of course also wonder
> whether Hygon CPUs wouldn't be affected, too. But I realize we have
> hardly any means to find out.

I'd say it's more likely than unlikely, and ...

>> @@ -1189,3 +1190,44 @@ const struct cpu_dev amd_cpu_dev = {
>>      .c_early_init   = early_init_amd,
>>      .c_init         = init_amd,
>>  };
>> +
>> +static void cf_check disable_c6(void *arg)
>> +{
>> +    printk(XENLOG_WARNING
>> +           "Disabling C6 after 1000 days uptime due to AMD errata 1474\n");
>> +    amd_disable_c6 = true;
>> +}
>> +
>> +static int __init cf_check amd_c6_errata(void)
>> +{
>> +    /*
>> +     * Errata #1474: A Core May Hang After About 1044 Days
>> +     * Set up a timer to disable C6 after 1000 days uptime.
>> +     */
>> +    s_time_t;
>> +
>> +    if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD ||
>> +        boot_cpu_data.x86 != 0x17 ||
>> +        (boot_cpu_data.x86_model & 0xf0) != 0x30)
> Perhaps better ... & ~0xf, just to be future-proof?

... this wants to follow the same logic as for Branch Type Confusion. 
See amd_init_spectral_chicken() looking for STIBP.
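
To illustrate the difference between the two approaches, here is a minimal
standalone sketch.  It is not Xen code: struct cpu_id and both helpers are
hypothetical, and the idea that STIBP being present on Fam17h identifies Zen2
is an assumption taken from the check referenced above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few boot_cpu_data fields used here. */
struct cpu_id {
    unsigned int family;  /* e.g. 0x17 */
    unsigned int model;   /* e.g. 0x31 */
    bool has_stibp;       /* assumed: STIBP feature bit as seen via CPUID */
};

/* Model-range test in the style of the patch: only models 0x30-0x3f match. */
static bool match_by_model(const struct cpu_id *c)
{
    /* ~0xfu keeps every bit above the low nibble, unlike 0xf0. */
    return c->family == 0x17 && (c->model & ~0xfu) == 0x30;
}

/* Feature-based test: Fam17h with STIBP is assumed to mean Zen2. */
static bool match_by_feature(const struct cpu_id *c)
{
    return c->family == 0x17 && c->has_stibp;
}
```

Under these assumptions a model 0x60 part with STIBP is missed by the model
range but caught by the feature test, without having to enumerate model ranges.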

It's very likely all Zen2 models, given that it will have taken nearly 3
years to be discovered in the first place...
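
For reference, 1044 days is about 2.86 years.  The timer arithmetic described
in the commit message can be sketched as below.  This is illustrative only:
the helper names are made up, and it assumes the TSC started counting from
zero at system reset and ticks at a constant, known frequency.

```c
#include <stdint.h>

#define NS_PER_SEC UINT64_C(1000000000)
#define NS_PER_DAY (UINT64_C(86400) * NS_PER_SEC)

/* Convert TSC ticks to nanoseconds without overflowing 64 bits. */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t tsc_hz)
{
    /* Whole seconds first, then the sub-second remainder. */
    return (ticks / tsc_hz) * NS_PER_SEC +
           (ticks % tsc_hz) * NS_PER_SEC / tsc_hz;
}

/*
 * Nanoseconds left until 1000 days have passed since system reset,
 * given the TSC value sampled at boot.  Returns 0 if the limit has
 * already been reached, i.e. C6 should be disabled immediately.
 */
static uint64_t c6_disable_delay_ns(uint64_t boot_tsc, uint64_t tsc_hz)
{
    const uint64_t limit = 1000 * NS_PER_DAY;
    uint64_t elapsed = ticks_to_ns(boot_tsc, tsc_hz);

    return elapsed >= limit ? 0 : limit - elapsed;
}
```

The 44 day margin between the 1000 day limit and the documented 1044 days
leaves room for the variation the erratum text attributes to spread spectrum
and REFCLK frequency.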

~Andrew



 

