[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v7 06/10] microcode: split out apply_microcode() from cpu_request_microcode()



>>> On 27.05.19 at 10:31, <chao.gao@xxxxxxxxx> wrote:
> During late microcode update, apply_microcode() is invoked in
> cpu_request_microcode(). To make late microcode update more reliable,
> we want to put the apply_microcode() into stop_machine context. So
> we split out it from cpu_request_microcode(). As a consequence,
> apply_microcode() should be invoked explicitly in the common code.
> 
> Previously, apply_microcode() gets the microcode patch to be applied from
> the microcode cache. Now, the patch is passed as a function argument and
> a patch is cached for cpu-hotplug and cpu resuming, only after it has
> been loaded to a cpu without any error. As a consequence, the
> 'match_cpu' check in microcode_update_cache is removed, which otherwise
> would fail.

The "only after it has been loaded to a cpu without any error" is a
problem, precisely for the case where ucode on the different cores
is not in sync initially. I would actually like to put up this question:
When a core has no ucode loaded at all yet and only strictly older
(than loaded on some other cores) ucode is found to be available,
whether then it wouldn't still be better to apply that ucode to
_at least_ the cores that have none loaded yet.

To get the system into "sane" state it may even be necessary to
downgrade ucode on the cores which did have it loaded already,
in such a situation.

> On AMD side, svm_host_osvw_init() is supposed to be called after
> microcode update. As apply_micrcode() won't be called by
> cpu_request_microcode() now, svm_host_osvw_init() is moved to the
> end of apply_microcode().

I guess this really ought to become a vendor hook as well, but I
wouldn't insist on you doing so here.

> --- a/xen/arch/x86/acpi/power.c
> +++ b/xen/arch/x86/acpi/power.c
> @@ -253,7 +253,7 @@ static int enter_state(u32 state)
>  
>      console_end_sync();
>  
> -    microcode_resume_cpu();
> +    early_microcode_update_cpu();

The use here, the (changed) use in start_secondary(), and the dropping
of its __init suggest to make an attempt to find a better name for the
function. Maybe microcode_update_one()?

> +/*
> + * Load a microcode update to current CPU.
> + *
> + * If no patch is provided, the cached patch will be loaded. Microcode update
> + * during APs bringup and CPU resuming falls into this case.
> + */
> +static int microcode_update_cpu(struct microcode_patch *patch)

const?

>  {
> -    int err;
> -    struct cpu_signature *sig = &this_cpu(cpu_sig);
> +    int ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
>  
> -    if ( !microcode_ops )
> -        return 0;
> +    if ( unlikely(ret) )
> +        return ret;
>  
>      spin_lock(&microcode_mutex);
>  
> -    err = microcode_ops->collect_cpu_info(sig);
> -    if ( likely(!err) )
> -        err = microcode_ops->apply_microcode();
> -    spin_unlock(&microcode_mutex);
> +    if ( patch )
> +    {
> +        /*
> +         * If a patch is specified, it should has newer revision than
> +         * that of the patch cached.
> +         */
> +        if ( microcode_cache &&
> +             microcode_ops->compare_patch(patch, microcode_cache) != 
> NEW_UCODE )
> +        {
> +            spin_unlock(&microcode_mutex);
> +            return -EINVAL;
> +        }
>  
> -    return err;
> -}
> +        ret = microcode_ops->apply_microcode(patch);

There's no printk() here but ...

> +    }
> +    else if ( microcode_cache )
> +    {
> +        ret = microcode_ops->apply_microcode(microcode_cache);
> +        if ( ret == -EIO )
> +            printk("Update failed. Reboot needed\n");

... you emit a log message here. Why the difference? And wouldn't
it be better to have just a single call to ->apply anyway, by simply
assigning "microcode_cache" to "patch" and moving the call a little
further down?

> +    }
> +    else
> +        /* No patch to update */
> +        ret = -EINVAL;

-ENOENT?

> @@ -247,46 +270,32 @@ bool microcode_update_cache(struct microcode_patch 
> *patch)
>      return true;
>  }
>  
> -static int microcode_update_cpu(const void *buf, size_t size)
> +static long do_microcode_update(void *patch)
>  {
> -    int err;
> -    unsigned int cpu = smp_processor_id();
> -    struct cpu_signature *sig = &per_cpu(cpu_sig, cpu);
> -
> -    spin_lock(&microcode_mutex);
> +    int error, cpu;

While "int" is fine for error codes, it almost certainly wants to be
"unsigned int" for "cpu". The more that it had been that was before.
I also don't see why you need to switch from "err" to "error" - oh,
...

> -    err = microcode_ops->collect_cpu_info(sig);
> -    if ( likely(!err) )
> -        err = microcode_ops->cpu_request_microcode(buf, size);
> -    spin_unlock(&microcode_mutex);
> -
> -    return err;
> -}
> -
> -static long do_microcode_update(void *_info)
> -{
> -    struct microcode_info *info = _info;
> -    int error;
> +    error = microcode_update_cpu(patch);

... there was a pre-existing variable of that name here.

> +    if ( error )
> +    {
> +        microcode_ops->free_patch(microcode_cache);

Does this also set "microcode_cache" to NULL? I didn't think so.
It's anyway not really clear why _all_ forms of errors should lead
to clearing of the cache. However - looking at the code further
down, don't you rather mean to free "patch" here anyway?

> +        return error;
> +    }
>  
> -    BUG_ON(info->cpu != smp_processor_id());
>  
> -    error = microcode_update_cpu(info->buffer, info->buffer_size);
> -    if ( error )
> -        info->error = error;
> +    cpu = cpumask_next(smp_processor_id(), &cpu_online_map);
> +    if ( cpu < nr_cpu_ids )
> +        return continue_hypercall_on_cpu(cpu, do_microcode_update, patch);
>  
> -    info->cpu = cpumask_next(info->cpu, &cpu_online_map);
> -    if ( info->cpu < nr_cpu_ids )
> -        return continue_hypercall_on_cpu(info->cpu, do_microcode_update, 
> info);
> +    microcode_update_cache(patch);

Independent of my remarks at the top I would think that updating of
the cache should happen after the first successful loading on a CPU,
not after all CPUs have been updated successfully. There would then
also not be any need to pass "patch" on to continue_hypercall_on_cpu()
a few lines up from here (albeit from a general logic perspective it may
indeed be easier to keep it that way).

> @@ -294,32 +303,49 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) 
> buf, unsigned long len)
>      if ( microcode_ops == NULL )
>          return -EINVAL;
>  
> -    info = xmalloc_bytes(sizeof(*info) + len);
> -    if ( info == NULL )
> -        return -ENOMEM;
> -
> -    ret = copy_from_guest(info->buffer, buf, len);
> -    if ( ret != 0 )
> +    buffer = xmalloc_bytes(len);
> +    if ( !buffer )
>      {
> -        xfree(info);
> -        return ret;
> +        ret = -ENOMEM;
> +        goto free;
>      }
>  
> -    info->buffer_size = len;
> -    info->error = 0;
> -    info->cpu = cpumask_first(&cpu_online_map);
> +    if ( copy_from_guest(buffer, buf, len) )
> +    {
> +        ret = -EFAULT;
> +        goto free;
> +    }
>  
>      if ( microcode_ops->start_update )
>      {
>          ret = microcode_ops->start_update();
>          if ( ret != 0 )
> -        {
> -            xfree(info);
> -            return ret;
> -        }
> +            goto free;
>      }
>  
> -    return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
> +    patch = microcode_parse_blob(buffer, len);
> +    if ( IS_ERR(patch) )
> +    {
> +        printk(XENLOG_ERR "Parsing microcode blob error %ld\n", 
> PTR_ERR(patch));
> +        ret = PTR_ERR(patch);

So I assume we would get here when the system already has the
newest (or even newer) ucode loaded. That's not an error, and
imo no log message suggesting so should be issued. Perhaps the
parsing code could return NULL to indicate so? Although, judging
by the code further down you already expect NULL potentially
coming back, but I can't seem to be able to figure the condition(s).

Also please switch the two lines around and use "ret" in the printk()
invocation.

> +        goto free;
> +    }
> +
> +    if ( !microcode_ops->match_cpu(patch) )
> +    {
> +        printk(XENLOG_ERR "No matching or newer ucode found. Update 
> aborted!\n");

I assume the "matching" here is meant to cover the CPU signature?
The wording is ambiguous this way, because it could also mean there
was no ucode found matching that which is already loaded (which, as
per above, may end up being relevant).

Furthermore - why this check, when microcode_parse_blob() already
looks for something that's newer than what is currently loaded (and
matches the CPU signature)?

> @@ -362,13 +397,41 @@ int __init early_microcode_update_cpu(bool start_update)
>      }
>      if ( data )
>      {
> -        if ( start_update && microcode_ops->start_update )
> +        struct microcode_patch *patch;
> +
> +        if ( microcode_ops->start_update )
>              rc = microcode_ops->start_update();
>  
>          if ( rc )
>              return rc;
>  
> -        return microcode_update_cpu(data, len);
> +        patch = microcode_parse_blob(data, len);
> +        if ( IS_ERR(patch) )
> +        {
> +            printk(XENLOG_ERR "Parsing microcode blob error %ld\n",
> +                   PTR_ERR(patch));
> +            return PTR_ERR(patch);
> +        }
> +
> +        if ( !microcode_ops->match_cpu(patch) )
> +        {
> +            printk(XENLOG_ERR "No matching or newer ucode found. Update 
> aborted!\n");
> +            if ( patch )
> +                microcode_ops->free_patch(patch);
> +            return -EINVAL;
> +        }

Same remarks here then.

> @@ -292,6 +291,10 @@ static int apply_microcode(void)
>  
>      sig->rev = rev;
>  
> +#ifdef CONFIG_HVM
> +    svm_host_osvw_init();
> +#endif
> +
>      return 0;
>  }

While this now sits on the success path only, ...

> @@ -592,17 +596,10 @@ static int cpu_request_microcode(const void *buf, 
> size_t bufsize)
>      }
>  
>    out:
> -#if CONFIG_HVM
> -    svm_host_osvw_init();
> -#endif
> +    if ( error && !patch )
> +        patch = ERR_PTR(error);
>  
> -    /*
> -     * In some cases we may return an error even if processor's microcode has
> -     * been updated. For example, the first patch in a container file is 
> loaded
> -     * successfully but subsequent container file processing encounters a
> -     * failure.
> -     */
> -    return error;
> +    return patch;
>  }

... previously it has also been invoked in the error case. See the
comment in start_update().

> --- a/xen/arch/x86/microcode_intel.c
> +++ b/xen/arch/x86/microcode_intel.c
> @@ -273,46 +273,27 @@ static enum microcode_match_result compare_patch(
>                                    old_header->pf, old_header->rev);
>  }
>  
> -/*
> - * return 0 - no update found
> - * return 1 - found update
> - * return < 0 - error
> - */
> -static int get_matching_microcode(const void *mc)
> +static struct microcode_patch *allow_microcode_patch(

Did you perhaps mean this to be alloc_microcode_patch()?

> @@ -388,26 +368,39 @@ static long get_next_ucode_from_buffer(void **mc, const 
> u8 *buf,
>      return offset + total_size;
>  }
>  
> -static int cpu_request_microcode(const void *buf, size_t size)
> +static struct microcode_patch *cpu_request_microcode(const void *buf,
> +                                                     size_t size)
>  {
>      long offset = 0;
>      int error = 0;
>      void *mc;
> +    struct microcode_patch *patch = NULL;
>  
>      while ( (offset = get_next_ucode_from_buffer(&mc, buf, size, offset)) > 
> 0 )
>      {
> +        struct microcode_patch *new_patch;
> +
>          error = microcode_sanity_check(mc);
>          if ( error )
>              break;
> -        error = get_matching_microcode(mc);
> -        if ( error < 0 )
> +
> +        new_patch = allow_microcode_patch(mc);
> +        if ( IS_ERR(new_patch) )
> +        {
> +            error = PTR_ERR(new_patch);
>              break;
> -        /*
> -         * It's possible the data file has multiple matching ucode,
> -         * lets keep searching till the latest version
> -         */
> -        if ( error == 1 )
> -            error = 0;
> +        }
> +
> +        /* Compare patches and store the one with higher revision */
> +        if ( !patch && match_cpu(new_patch) )
> +            patch = new_patch;
> +        else if ( patch && (compare_patch(new_patch, patch) == NEW_UCODE) )

At least the appearance of this is misleading, but unless I'm missing
something it's actually wrong: You seem to imply that from
match_cpu(patch1) returning true and compare_patch(patch2, patch1)
returning true it follows that also match_cpu(patch2) would return
true. But I don't think that's the case, because of how the "pf" field
works.

Even in case I've overlooked an implicit match_cpu(new_patch) I'd
like to ask for this to be clarified, either by changing the code
structure, or by attaching a suitable comment.

> --- a/xen/arch/x86/smpboot.c
> +++ b/xen/arch/x86/smpboot.c
> @@ -363,10 +363,7 @@ void start_secondary(void *unused)
>  
>      initialize_cpu_data(cpu);
>  
> -    if ( system_state <= SYS_STATE_smp_boot )
> -        early_microcode_update_cpu(false);
> -    else
> -        microcode_resume_cpu();
> +    early_microcode_update_cpu();

I'm struggling to understand how you get away without the "false"
argument that was passed here before. You look to now be calling
->start_update() unconditionally (so long as the hook is not NULL),
which I don't think is correct. This should be called only once by
the CPU _leading_ an update (the BSP during boot, and the CPU
the hypercall gets invoked on (or the first CPU an update gets
issued one) for a late update. Am I missing something?

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.