
Re: [Xen-devel] [PATCH v7 06/10] microcode: split out apply_microcode() from cpu_request_microcode()



On Tue, Jun 11, 2019 at 01:08:36AM -0600, Jan Beulich wrote:
>>>> On 11.06.19 at 05:32, <chao.gao@xxxxxxxxx> wrote:
>> On Wed, Jun 05, 2019 at 06:37:27AM -0600, Jan Beulich wrote:
>>>>>> On 27.05.19 at 10:31, <chao.gao@xxxxxxxxx> wrote:
>>>> During late microcode update, apply_microcode() is invoked in
>>>> cpu_request_microcode(). To make late microcode update more reliable,
>>>> we want to put apply_microcode() into a stop_machine context, so
>>>> we split it out of cpu_request_microcode(). As a consequence,
>>>> apply_microcode() must now be invoked explicitly from the common code.
>>>> 
>>>> Previously, apply_microcode() got the microcode patch to be applied from
>>>> the microcode cache. Now, the patch is passed as a function argument, and
>>>> a patch is cached for cpu-hotplug and cpu resuming only after it has
>>>> been loaded to a CPU without any error. As a consequence, the
>>>> 'match_cpu' check in microcode_update_cache is removed, as it would
>>>> otherwise fail.
>>>
>>>The "only after it has been loaded to a cpu without any error" is a
>>>problem, precisely for the case where ucode on the different cores
>>>is not in sync initially. I would actually like to put up this question:
>>>When a core has no ucode loaded at all yet and only strictly older
>>>(than loaded on some other cores) ucode is found to be available,
>>>whether then it wouldn't still be better to apply that ucode to
>>>_at least_ the cores that have none loaded yet.
>> 
>> Yes, it is better for this special case. And I agree to support this case.
>> 
>> In this v7, a patch is loaded only if its revision is newer than that
>> loaded on the current CPU. And it is stored only if it has been loaded
>> successfully. But, as you described, a broken BIOS might put the system
>> into an inconsistent state (multiple microcode revisions in the system),
>> and furthermore in this case, if no update or only an older microcode
>> update is provided, early loading cannot get the system into a sane
>> state. So for both early and late microcode loading, we could face a
>> situation where the patch to be loaded has a revision equal to or older
>> than the microcode on some CPUs.
>> 
>> Changes I plan to make in next version are:
>> 1. For early microcode, a patch would be stored if it covers the current
>> CPU's signature. All CPUs would try to load from the cache.
>> 2. For late microcode, a patch is loaded only if its revision is newer than
>> *the patch cached*. And it is stored only if it has been loaded without an
>> -EIO error.
>> 3. Cache replacement remains the same.
>
>Why the difference between early and late loading?

Storing a patch without loading it is problematic. We would need complex
logic to restore the old patch if the current one proves to be broken.
I really want to avoid going down that path. So for late microcode, we
still stick to the rule that a patch is stored only after it has been
loaded. For late loading, we could try to load a patch as long as it
covers the current CPU's signature, to avoid missing any possible
update. But thanks to early loading, the oldest microcode revision on
all online CPUs shouldn't be older than the cache. So as an optimization,
we initiate a system-wide update only if the patch's revision is newer
than the cache's.

For early loading, to avoid discarding a potentially useful patch, an
exception is made: the newest matching patch is stored without being
loaded first, and all CPUs then try to load that patch. One problem is
that if a broken patch with a very high revision is provided, any
subsequent attempt at late loading would fail. That is unlikely to
happen, so I plan to leave it aside. Otherwise, we could clean up the
cache in microcode_init() if no CPU has loaded the patch (we would need
a global variable to track the status).

>
>> But it is a temporary solution, especially for CSPs. A better way
>> might be getting the newest ucode, upgrading to the newest BIOS,
>> or even downgrading the BIOS to an older version which wouldn't put
>> the system into an "insane" state.
>
>On the quoted system, all BIOS versions I've ever been provided
>had the same odd behavior.
>
>>>> +static int microcode_update_cpu(struct microcode_patch *patch)
>>>>  {
>>>> -    int err;
>>>> -    struct cpu_signature *sig = &this_cpu(cpu_sig);
>>>> +    int ret = microcode_ops->collect_cpu_info(&this_cpu(cpu_sig));
>>>>  
>>>> -    if ( !microcode_ops )
>>>> -        return 0;
>>>> +    if ( unlikely(ret) )
>>>> +        return ret;
>>>>  
>>>>      spin_lock(&microcode_mutex);
>>>>  
>>>> -    err = microcode_ops->collect_cpu_info(sig);
>>>> -    if ( likely(!err) )
>>>> -        err = microcode_ops->apply_microcode();
>>>> -    spin_unlock(&microcode_mutex);
>>>> +    if ( patch )
>>>> +    {
>>>> +        /*
>>>> +         * If a patch is specified, it should have a newer revision than
>>>> +         * that of the patch cached.
>>>> +         */
>>>> +        if ( microcode_cache &&
>>>> +             microcode_ops->compare_patch(patch, microcode_cache) != NEW_UCODE )
>>>> +        {
>>>> +            spin_unlock(&microcode_mutex);
>>>> +            return -EINVAL;
>>>> +        }
>>>>  
>>>> -    return err;
>>>> -}
>>>> +        ret = microcode_ops->apply_microcode(patch);
>>>
>>>There's no printk() here but ...
>>>
>>>> +    }
>>>> +    else if ( microcode_cache )
>>>> +    {
>>>> +        ret = microcode_ops->apply_microcode(microcode_cache);
>>>> +        if ( ret == -EIO )
>>>> +            printk("Update failed. Reboot needed\n");
>>>
>>>... you emit a log message here. Why the difference? And wouldn't
>>>it be better to have just a single call to ->apply anyway, by simply
>>>assigning "microcode_cache" to "patch" and moving the call a little
>>>further down?
>> 
>> -EIO means we loaded the patch but the revision didn't change. This
>> error code indicates an error in the patch itself. It is unlikely to
>> happen. And in this v7, a patch is stored only after being loaded
>> successfully on a CPU. To simplify the handling of load failures and
>> avoid cleanup of the global cache (if a patch had a success rate, we
>> would need to consider which one to save when we have a new patch with
>> a 95% rate and an old one with a 100% rate; it would be really
>> complex), we assume that loading the cache (which was loaded
>> successfully on a CPU) shouldn't return -EIO. Otherwise, an error
>> message is printed.
>
>But then the log message itself needs to be more specific. That'll
>then either allow to understand why one is wanted here but not
>elsewhere, or additionally a comment should be attached.
>
>>>> --- a/xen/arch/x86/smpboot.c
>>>> +++ b/xen/arch/x86/smpboot.c
>>>> @@ -363,10 +363,7 @@ void start_secondary(void *unused)
>>>>  
>>>>      initialize_cpu_data(cpu);
>>>>  
>>>> -    if ( system_state <= SYS_STATE_smp_boot )
>>>> -        early_microcode_update_cpu(false);
>>>> -    else
>>>> -        microcode_resume_cpu();
>>>> +    early_microcode_update_cpu();
>>>
>>>I'm struggling to understand how you get away without the "false"
>>>argument that was passed here before. You look to now be calling
>>>->start_update() unconditionally (so long as the hook is not NULL),
>>>which I don't think is correct. This should be called only once by
>>>the CPU _leading_ an update (the BSP during boot, and the CPU
>the hypercall gets invoked on (or the first CPU an update gets
>issued on) for a late update. Am I missing something?
>> 
>> BSP and APs call different functions, early_microcode_parse_and_update_cpu()
>> and early_microcode_update_cpu() respectively. The latter won't call
>> ->start_update().
>
>Oh, I see - I'm sorry. I've been misled by the patch context
>indicators. But still, by their names there's not supposed to be
>such a difference. I wonder whether we wouldn't better stick
>to just one function (early_microcode_update_cpu()), having
>it take a boolean as is the case now. That would control both
>parsing and ->start_update() invocation.

Will do

Thanks
Chao
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
