[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH v5 7/8] x86/microcode: Synchronize late microcode loading
On Fri, Feb 08, 2019 at 09:29:32AM -0700, Jan Beulich wrote: >>>> On 28.01.19 at 08:06, <chao.gao@xxxxxxxxx> wrote: >> @@ -30,18 +31,25 @@ >> #include <xen/smp.h> >> #include <xen/softirq.h> >> #include <xen/spinlock.h> >> +#include <xen/stop_machine.h> >> #include <xen/tasklet.h> >> #include <xen/guest_access.h> >> #include <xen/earlycpio.h> >> +#include <xen/watchdog.h> >> >> +#include <asm/delay.h> >> #include <asm/msr.h> >> #include <asm/processor.h> >> #include <asm/setup.h> >> #include <asm/microcode.h> >> >> +/* By default, wait for 30000us */ >> +#define MICROCODE_DEFAULT_TIMEOUT_US 30000 > >This value deserves some explanation as to how it was chosen. Will do. > >> static module_t __initdata ucode_mod; >> static signed int __initdata ucode_mod_idx; >> static bool_t __initdata ucode_mod_forced; >> +static unsigned int nr_cores; > >Could you see about avoiding such a static? You have ... > >> +static int do_microcode_update(void *unused) > >... an "unused" here after all, and it's the (indirect) caller of the >function which calculates the value. Will do. > >> +{ >> + unsigned int cpu = smp_processor_id(); >> + int ret; >> >> - error = microcode_update_cpu(); >> - if ( error ) >> - info->error = error; >> + ret = wait_for_cpus(&cpu_in, MICROCODE_DEFAULT_TIMEOUT_US); >> + if ( ret ) >> + return ret; >> >> - info->cpu = cpumask_next(info->cpu, &cpu_online_map); >> - if ( info->cpu < nr_cpu_ids ) >> - return continue_hypercall_on_cpu(info->cpu, do_microcode_update, >> info); >> + /* >> + * Initiate an update on all processors which don't have an online >> sibling >> + * thread with a lower thread id. Other sibling threads just await the >> + * completion of microcode update. >> + */ >> + if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) ) >> + ret = microcode_update_cpu(); >> + /* >> + * Increase the wait timeout to a safe value here since we're >> serializing >> + * the microcode update and that could take a while on a large number of >> + * CPUs. And that is fine as the *actual* timeout will be determined by >> + * the last CPU finished updating and thus cut short >> + */ >> + if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * nr_cores) ) >> + panic("Timeout when finishing updating microcode"); > >While I expect this to go away again in the next patch, I'd still like to >see this improved, in particular in case the patch here goes in >independently of the next one. After all on a system with 100 cores >the timeout totals to a whopping 3 seconds. To be clear, the timeout remains the same in the next patch due to the serial print clause in apply_microcode(). > >Generally the time needed to wait scales by the number of CPUs still >in need of doing the update. And if a timeout is really to occur, it's >perhaps because of one bad core or socket, not because nothing >works at all. Hence it would seem both nice and possible to scale the >"remaining time to wait" by the (known) number of remaining >processors to respond. Basically, I think the benefit is we can recognize the failure earlier if no core called in in a given interval (i.e. 30ms), and trigger a panic. Considering for such case, even with this optimization, the system needs reboot, which generally takes several minutes, what's the value of this optimization? > >Additionally it would be confusing to see dozens of panics in case >this actually triggers, because all CPUs would hit this at about the >same time. Will avoid this confusion. Thanks Chao _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |