Xen project Mailing List

Re: [Xen-devel] [PATCH 1/2] x86/microcode: Synchronize late microcode loading

Date: Mon, 16 Apr 2018 14:20:16 +0800

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Kevin Tian <kevin.tian@xxxxxxxxx>, Ashok Raj <ashok.raj@xxxxxxxxx>, Jun Nakajima <jun.nakajima@xxxxxxxxx>, xen-devel@xxxxxxxxxxxxx

Delivery-date: Mon, 16 Apr 2018 06:25:23 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On Fri, Apr 13, 2018 at 09:49:17AM -0600, Jan Beulich wrote: >>>> On 30.03.18 at 08:59, <chao.gao@xxxxxxxxx> wrote: >> @@ -281,24 +287,52 @@ static int microcode_update_cpu(const void *buf, >> size_t size) >> return err; >> } >> >> -static long do_microcode_update(void *_info) >> +static int __wait_for_cpus(atomic_t *cnt, int timeout) > >No new double-underscore prefixed functions please. > >> { >> - struct microcode_info *info = _info; >> - int error; >> + int cpus = num_online_cpus(); > >unsigned int > >> - BUG_ON(info->cpu != smp_processor_id()); >> + atomic_inc(cnt); >> >> - error = microcode_update_cpu(info->buffer, info->buffer_size); >> - if ( error ) >> - info->error = error; >> + while (atomic_read(cnt) != cpus) > >There are a number of style issues in the patch, mostly (like here) missing >blanks. Will fix all issues pointed out by comments above. > >> + { >> + if ( timeout <= 0 ) >> + { >> + printk("Timeout when waiting for CPUs calling in\n"); >> + return -1; >> + } >> + udelay(1); >> + timeout--; >> + } >> >> - info->cpu = cpumask_next(info->cpu, &cpu_online_map); >> - if ( info->cpu < nr_cpu_ids ) >> - return continue_hypercall_on_cpu(info->cpu, do_microcode_update, >> info); >> + return 0; >> +} >> >> - error = info->error; >> - xfree(info); >> - return error; >> +static int do_microcode_update(void *_info) >> +{ >> + struct microcode_info *info = _info; >> + int error, ret = 0; >> + >> + error = __wait_for_cpus(&info->cpu_in, USEC_PER_SEC); > >Why this long a timeout here? I just use the same timeout as the patch on linux kernel side. As we know if the timeout is too small, updating microcode may be likely to failed even if other CPUs disabled interrupt temporally. If you object to such a long timeout (for Xen may need much smaller time to rendezvous all CPUs compared to linux kernel because Xen doesn't have device drivers which may malfunction), how about just use the default timeout, 30ms, used by live patching? if it is also not good enough, then we make it an option which comes from callers. > >> + if ( error ) >> + { >> + ret = -EBUSY; >> + return ret; >> + } >> + >> + error = microcode_update_cpu(info->buffer, info->buffer_size); >> + if ( error && !ret ) >> + ret = error; >> + /* >> + * Increase the wait timeout to a safe value here since we're >> serializing >> + * the microcode update and that could take a while on a large number of >> + * CPUs. And that is fine as the *actual* timeout will be determined by >> + * the last CPU finished updating and thus cut short >> + */ >> + error = __wait_for_cpus(&info->cpu_out, USEC_PER_SEC * >> num_online_cpus()); > >And this one's even worse, in particular on huge systems. I'm afraid such a >long >period of time in stop-machine context is going to confuse most of the running >domains (including Dom0). There's nothing inherently wrong with e.g. processing >the updates on distinct cores (and even more so on distinct sockets) in >parallel. I cannot say for sure. But the original patch does want updating microcode be performed one-by-one. >Therefore revising the locking in microcode_update_cpu() might be a necessary >prereq step. Do you mean changing it to a per-core or per-socket lock? >Or alternatively you may need to demand that no other running >domains exist besides Dom0 (and hope the best for Dom0 itself). > >I also don't think there's any point invoking the operation on all HT threads >on a >core, but I realize stop_machine_run() isn't flexible enough to allow such. Only one thread in a core will do the actual update. Other threads only check the microcode version and find the version is already update-to-date, then exit the critical region. Considering the check may don't need much time, I wonder whether it can significantly benefit the overall time to invoking the operation on all HT threads on a core? Thanks Chao _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.