[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH v5 7/8] x86/microcode: Synchronize late microcode loading
>>> On 11.02.19 at 14:35, <jgross@xxxxxxxx> wrote: > On 11/02/2019 14:23, Jan Beulich wrote: >>>>> On 11.02.19 at 06:40, <chao.gao@xxxxxxxxx> wrote: >>> On Fri, Feb 08, 2019 at 09:29:32AM -0700, Jan Beulich wrote: >>>>>>> On 28.01.19 at 08:06, <chao.gao@xxxxxxxxx> wrote: >>>>> + /* >>>>> + * Initiate an update on all processors which don't have an online > sibling >>>>> + * thread with a lower thread id. Other sibling threads just await >>>>> the >>>>> + * completion of microcode update. >>>>> + */ >>>>> + if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) ) >>>>> + ret = microcode_update_cpu(); >>>>> + /* >>>>> + * Increase the wait timeout to a safe value here since we're > serializing >>>>> + * the microcode update and that could take a while on a large >>>>> number > of >>>>> + * CPUs. And that is fine as the *actual* timeout will be determined >>>>> by >>>>> + * the last CPU finished updating and thus cut short >>>>> + */ >>>>> + if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * >>>>> nr_cores) ) >>>>> + panic("Timeout when finishing updating microcode"); >>>> >>>> While I expect this to go away again in the next patch, I'd still like to >>>> see this improved, in particular in case the patch here goes in >>>> independently of the next one. After all on a system with 100 cores >>>> the timeout totals to a whopping 3 seconds. >>> >>> To be clear, the timeout remains the same in the next patch due to >>> the serial print clause in apply_microcode(). >>> >>>> >>>> Generally the time needed to wait scales by the number of CPUs still >>>> in need of doing the update. And if a timeout is really to occur, it's >>>> perhaps because of one bad core or socket, not because nothing >>>> works at all. Hence it would seem both nice and possible to scale the >>>> "remaining time to wait" by the (known) number of remaining >>>> processors to respond. >>> >>> Basically, I think the benefit is we can recognize the failure earlier >>> if no core called in in a given interval (i.e. 30ms), and trigger a >>> panic. Considering for such case, even with this optimization, the >>> system needs reboot, which generally takes several minutes, what's the >>> value of this optimization? >> >> Hmm, on one hand this is a fair point you make. Otoh, why do >> you add any timeout at all, if we say we're hosed anyway if the >> timeout expires? You could then as well log a message (say >> once a second) about how many (or which) CPUs still didn't >> respond. The admin can then still reboot the system if desired. > > That's not a data center friendly approach. > > The ability to do microcode update in an online system might by > risky, but in case of failure requiring access to the console or > power settings of the system isn't nice. I realize this, but I also wouldn't bet the system would reboot cleanly in such a case (i.e. a power cycle may be required anyway). > I think doing a panic() after some timeout is a sensible way to > handle a failure. I don't disagree; I'm just not convinced this is the only "sensible" way. > In case you'd like having a way to wait longer: we could allow the > "noreboot" parameter to be modified at runtime and do the panic only > if opt_noreboot isn't set. Why would runtime modification come into play here? Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |