[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v2] x86/microcode: Synchronize late microcode loading



On Mon, Apr 30, 2018 at 09:25:26AM -0600, Jan Beulich wrote:
>>>> On 25.04.18 at 13:46, <chao.gao@xxxxxxxxx> wrote:
>> @@ -281,24 +288,56 @@ static int microcode_update_cpu(const void *buf, 
>> size_t size)
>>      return err;
>>  }
>>  
>> -static long do_microcode_update(void *_info)
>> +/* Wait for all CPUs to rendezvous with a timeout (us) */
>> +static int wait_for_cpus(atomic_t *cnt, int timeout)
>
>unsigned int
>
>> +static int do_microcode_update(void *_info)
>> +{
>> +    struct microcode_info *info = _info;
>> +    unsigned int cpu = smp_processor_id();
>> +    int ret;
>> +
>> +    ret = wait_for_cpus(&info->cpu_in, MICROCODE_DEFAULT_TIMEOUT);
>> +    if ( ret )
>> +        return ret;
>> +    /*
>> +     * Logical threads which set the first bit in cpu_sibling_mask can do
>> +     * the update. Other sibling threads just await the completion of
>> +     * microcode update.
>> +     */
>> +    if ( cpumask_test_and_set_cpu(
>> +                cpumask_first(per_cpu(cpu_sibling_mask, cpu)), &info->cpus) 
>> )
>> +        ret = microcode_update_cpu(info->buffer, info->buffer_size);
>
>Isn't the condition inverted (i.e. missing a ! )?

Yes.

>
>Also I take it that you've confirmed that loading ucode in parallel on multiple
>cores of the same socket is not a problem? The comment in the last hunk
>suggests otherwise.

No. In microcode_update_cpu(), microcode_mutex makes the update
sequential.

>
>> +    /*
>> +     * Increase the wait timeout to a safe value here since we're 
>> serializing
>> +     * the microcode update and that could take a while on a large number of
>> +     * CPUs. And that is fine as the *actual* timeout will be determined by
>> +     * the last CPU finished updating and thus cut short
>> +     */
>> +    if ( wait_for_cpus(&info->cpu_out, MICROCODE_DEFAULT_TIMEOUT *
>> +                                       num_online_cpus()) )
>> +        panic("Timeout when finishing updating microcode");
>
>A 3s timeout (as an example for a system with 100 CPU threads) is still
>absurdly high to me, but considering you panic() anyway if you hit the
>timeout the question mainly is whether there's a slim chance for this to
>complete a brief moment before the timeout expires. If all goes well,
>you won't come close to even 1s, but as said before - there may be
>guests running, and they may become utterly confused if they don't
>get any time within a second or more.
>
>With you no longer doing things sequentially I don't, however, see why

No. It is still sequential. And only one thread in a core will try to
acquire the lock -- microcode_mutex.

>you need to scale the timeout by CPU count.

Maybe by the number of core. But I did't find an existing variable to
count cores.

>
>> +
>> +    return ret;
>>  }
>
>You're losing this return value (once for every CPU making it into this
>function).

I don't understand this remark. This function is called by
stop_machine_run(). stop_machine_run() could return error if
any cpu failed during update. We don't care the specific CPU and
how many CPUs failed to do the update.

Thanks
Chao

>
>> @@ -318,26 +357,52 @@ int 
>> microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>>  
>>      ret = copy_from_guest(info->buffer, buf, len);
>>      if ( ret != 0 )
>> -    {
>> -        xfree(info);
>> -        return ret;
>> -    }
>> +        goto free;
>>  
>>      info->buffer_size = len;
>> -    info->error = 0;
>> -    info->cpu = cpumask_first(&cpu_online_map);
>> +
>> +    /* cpu_online_map must not change during update */
>> +    if ( !get_cpu_maps() )
>> +    {
>> +        ret = -EBUSY;
>> +        goto free;
>> +    }
>>  
>>      if ( microcode_ops->start_update )
>>      {
>>          ret = microcode_ops->start_update();
>>          if ( ret != 0 )
>> -        {
>> -            xfree(info);
>> -            return ret;
>> -        }
>> +            goto put;
>>      }
>>  
>> -    return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
>> +    cpumask_empty(&info->cpus);
>
>DYM cpumask_clear()?
>
>> +    atomic_set(&info->cpu_in, 0);
>> +    atomic_set(&info->cpu_out, 0);
>> +
>> +    /*
>> +     * We intend to disable interrupt for long time, which may lead to
>> +     * watchdog timeout.
>> +     */
>> +    watchdog_disable();
>> +    /*
>> +     * Late loading dance. Why the heavy-handed stop_machine effort?
>> +     *
>> +     * -HT siblings must be idle and not execute other code while the other
>> +     *  sibling is loading microcode in order to avoid any negative
>> +     *  interactions cause by the loading.
>> +     *
>> +     * -In addition, microcode update on the cores must be serialized until
>> +     *  this requirement can be relaxed in the feature. Right now, this is
>> +     *  conservative and good.
>
>This is the comment I've referred to above.
>
>Jan
>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.