[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v5 7/8] x86/microcode: Synchronize late microcode loading



On Fri, Feb 08, 2019 at 09:29:32AM -0700, Jan Beulich wrote:
>>>> On 28.01.19 at 08:06, <chao.gao@xxxxxxxxx> wrote:
>> @@ -30,18 +31,25 @@
>>  #include <xen/smp.h>
>>  #include <xen/softirq.h>
>>  #include <xen/spinlock.h>
>> +#include <xen/stop_machine.h>
>>  #include <xen/tasklet.h>
>>  #include <xen/guest_access.h>
>>  #include <xen/earlycpio.h>
>> +#include <xen/watchdog.h>
>>  
>> +#include <asm/delay.h>
>>  #include <asm/msr.h>
>>  #include <asm/processor.h>
>>  #include <asm/setup.h>
>>  #include <asm/microcode.h>
>>  
>> +/* By default, wait for 30000us */
>> +#define MICROCODE_DEFAULT_TIMEOUT_US 30000
>
>This value deserves some explanation as to how it was chosen.

Will do.

>
>>  static module_t __initdata ucode_mod;
>>  static signed int __initdata ucode_mod_idx;
>>  static bool_t __initdata ucode_mod_forced;
>> +static unsigned int nr_cores;
>
>Could you see about avoiding such a static? You have ...
>
>> +static int do_microcode_update(void *unused)
>
>... an "unused" here after all, and it's the (indirect) caller of the
>function which calculates the value.

Will do.

>
>> +{
>> +    unsigned int cpu = smp_processor_id();
>> +    int ret;
>>  
>> -    error = microcode_update_cpu();
>> -    if ( error )
>> -        info->error = error;
>> +    ret = wait_for_cpus(&cpu_in, MICROCODE_DEFAULT_TIMEOUT_US);
>> +    if ( ret )
>> +        return ret;
>>  
>> -    info->cpu = cpumask_next(info->cpu, &cpu_online_map);
>> -    if ( info->cpu < nr_cpu_ids )
>> -        return continue_hypercall_on_cpu(info->cpu, do_microcode_update, 
>> info);
>> +    /*
>> +     * Initiate an update on all processors which don't have an online 
>> sibling
>> +     * thread with a lower thread id. Other sibling threads just await the
>> +     * completion of microcode update.
>> +     */
>> +    if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>> +        ret = microcode_update_cpu();
>> +    /*
>> +     * Increase the wait timeout to a safe value here since we're 
>> serializing
>> +     * the microcode update and that could take a while on a large number of
>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>> +     * the last CPU finished updating and thus cut short
>> +     */
>> +    if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * nr_cores) )
>> +        panic("Timeout when finishing updating microcode");
>
>While I expect this to go away again in the next patch, I'd still like to
>see this improved, in particular in case the patch here goes in
>independently of the next one. After all on a system with 100 cores
>the timeout totals to a whopping 3 seconds.

To be clear, the timeout remains the same in the next patch due to
the serial print clause in apply_microcode().

>
>Generally the time needed to wait scales by the number of CPUs still
>in need of doing the update. And if a timeout is really to occur, it's
>perhaps because of one bad core or socket, not because nothing
>works at all. Hence it would seem both nice and possible to scale the
>"remaining time to wait" by the (known) number of remaining
>processors to respond.

Basically, I think the benefit is we can recognize the failure earlier
if no core called in in a given interval (i.e. 30ms), and trigger a
panic. Considering for such case, even with this optimization, the
system needs reboot, which generally takes several minutes, what's the
value of this optimization?

>
>Additionally it would be confusing to see dozens of panics in case
>this actually triggers, because all CPUs would hit this at about the
>same time.

Will avoid this confusion.

Thanks
Chao

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.