Xen project Mailing List

Re: [Xen-devel] [PATCH v2] x86/microcode: Synchronize late microcode loading

Date: Tue, 1 May 2018 16:15:15 +0800

Cc: Kevin Tian <kevin.tian@xxxxxxxxx>, Ashok Raj <ashok.raj@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, xen-devel@xxxxxxxxxxxxx, Jun Nakajima <jun.nakajima@xxxxxxxxx>, tglx@xxxxxxxxxxxxx, Borislav Petkov <bp@xxxxxxx>

Delivery-date: Tue, 01 May 2018 08:20:51 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Mon, Apr 30, 2018 at 09:25:26AM -0600, Jan Beulich wrote: >>>> On 25.04.18 at 13:46, <chao.gao@xxxxxxxxx> wrote: >> @@ -281,24 +288,56 @@ static int microcode_update_cpu(const void *buf, >> size_t size) >> return err; >> } >> >> -static long do_microcode_update(void *_info) >> +/* Wait for all CPUs to rendezvous with a timeout (us) */ >> +static int wait_for_cpus(atomic_t *cnt, int timeout) > >unsigned int > >> +static int do_microcode_update(void *_info) >> +{ >> + struct microcode_info *info = _info; >> + unsigned int cpu = smp_processor_id(); >> + int ret; >> + >> + ret = wait_for_cpus(&info->cpu_in, MICROCODE_DEFAULT_TIMEOUT); >> + if ( ret ) >> + return ret; >> + /* >> + * Logical threads which set the first bit in cpu_sibling_mask can do >> + * the update. Other sibling threads just await the completion of >> + * microcode update. >> + */ >> + if ( cpumask_test_and_set_cpu( >> + cpumask_first(per_cpu(cpu_sibling_mask, cpu)), &info->cpus) >> ) >> + ret = microcode_update_cpu(info->buffer, info->buffer_size); > >Isn't the condition inverted (i.e. missing a ! )? Yes. > >Also I take it that you've confirmed that loading ucode in parallel on multiple >cores of the same socket is not a problem? The comment in the last hunk >suggests otherwise. No. In microcode_update_cpu(), microcode_mutex makes the update sequential. > >> + /* >> + * Increase the wait timeout to a safe value here since we're >> serializing >> + * the microcode update and that could take a while on a large number of >> + * CPUs. And that is fine as the *actual* timeout will be determined by >> + * the last CPU finished updating and thus cut short >> + */ >> + if ( wait_for_cpus(&info->cpu_out, MICROCODE_DEFAULT_TIMEOUT * >> + num_online_cpus()) ) >> + panic("Timeout when finishing updating microcode"); > >A 3s timeout (as an example for a system with 100 CPU threads) is still >absurdly high to me, but considering you panic() anyway if you hit the >timeout the question mainly is whether there's a slim chance for this to >complete a brief moment before the timeout expires. If all goes well, >you won't come close to even 1s, but as said before - there may be >guests running, and they may become utterly confused if they don't >get any time within a second or more. > >With you no longer doing things sequentially I don't, however, see why No. It is still sequential. And only one thread in a core will try to acquire the lock -- microcode_mutex. >you need to scale the timeout by CPU count. Maybe by the number of core. But I did't find an existing variable to count cores. > >> + >> + return ret; >> } > >You're losing this return value (once for every CPU making it into this >function). I don't understand this remark. This function is called by stop_machine_run(). stop_machine_run() could return error if any cpu failed during update. We don't care the specific CPU and how many CPUs failed to do the update. Thanks Chao > >> @@ -318,26 +357,52 @@ int >> microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len) >> >> ret = copy_from_guest(info->buffer, buf, len); >> if ( ret != 0 ) >> - { >> - xfree(info); >> - return ret; >> - } >> + goto free; >> >> info->buffer_size = len; >> - info->error = 0; >> - info->cpu = cpumask_first(&cpu_online_map); >> + >> + /* cpu_online_map must not change during update */ >> + if ( !get_cpu_maps() ) >> + { >> + ret = -EBUSY; >> + goto free; >> + } >> >> if ( microcode_ops->start_update ) >> { >> ret = microcode_ops->start_update(); >> if ( ret != 0 ) >> - { >> - xfree(info); >> - return ret; >> - } >> + goto put; >> } >> >> - return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info); >> + cpumask_empty(&info->cpus); > >DYM cpumask_clear()? > >> + atomic_set(&info->cpu_in, 0); >> + atomic_set(&info->cpu_out, 0); >> + >> + /* >> + * We intend to disable interrupt for long time, which may lead to >> + * watchdog timeout. >> + */ >> + watchdog_disable(); >> + /* >> + * Late loading dance. Why the heavy-handed stop_machine effort? >> + * >> + * -HT siblings must be idle and not execute other code while the other >> + * sibling is loading microcode in order to avoid any negative >> + * interactions cause by the loading. >> + * >> + * -In addition, microcode update on the cores must be serialized until >> + * this requirement can be relaxed in the feature. Right now, this is >> + * conservative and good. > >This is the comment I've referred to above. > >Jan > _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.