[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen



Hi Boris,

On 10/30/2017 09:34 PM, Boris Ostrovsky wrote:
> On 10/30/2017 04:03 AM, Dongli Zhang wrote:
>> After guest live migration on xen, steal time in /proc/stat
>> (cpustat[CPUTIME_STEAL]) might decrease because steal returned by
>> xen_steal_lock() might be less than this_rq()->prev_steal_time which is
>> derived from previous return value of xen_steal_clock().
>>
>> For instance, steal time of each vcpu is 335 before live migration.
>>
>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>> cpu0 38 0 81 50063 492 0 0 335 0 0
>> cpu1 65 0 97 49763 634 0 0 335 0 0
>> cpu2 38 0 81 50098 462 0 0 335 0 0
>> cpu3 56 0 107 50138 374 0 0 335 0 0
>>
>> After live migration, steal time is reduced to 312.
>>
>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>> cpu0 38 0 82 50123 500 0 0 312 0 0
>> cpu1 65 0 97 49832 634 0 0 312 0 0
>> cpu2 39 0 82 50167 462 0 0 312 0 0
>> cpu3 56 0 107 50207 374 0 0 312 0 0
>>
>> Since runstate times are cumulative and cleared during xen live migration
>> by xen hypervisor, the idea of this patch is to accumulate runstate times
>> to global percpu variables before live migration suspend. Once guest VM is
>> resumed, xen_get_runstate_snapshot_cpu() would always return the sum of new
>> runstate times and previously accumulated times stored in global percpu
>> variables.
>>
>> Similar and more severe issue would impact prior linux 4.8-4.10 as
>> discussed by Michael Las at
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>> which would overflow steal time and lead to 100% st usage in top command
>> for linux 4.8-4.10. A backport of this patch would fix that issue.
>>
>> References: 
>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
>>
>> ---
>> Changed since v1:
>>   * relocate modification to xen_get_runstate_snapshot_cpu
>>
>> Changed since v2:
>>   * accumulate runstate times before live migration
>>
>> Changed since v3:
>>   * do not accumulate times in the case of guest checkpointing
>>
>> Changed since v4:
>>   * allocate array of vcpu_runstate_info to reduce number of memory 
>> allocation
>>
>> ---
>>  drivers/xen/manage.c         |  2 ++
>>  drivers/xen/time.c           | 68 
>> ++++++++++++++++++++++++++++++++++++++++++--
>>  include/xen/interface/vcpu.h |  2 ++
>>  include/xen/xen-ops.h        |  1 +
>>  4 files changed, 71 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>> index c425d03..3dc085d 100644
>> --- a/drivers/xen/manage.c
>> +++ b/drivers/xen/manage.c
>> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>>      }
>>  
>>      gnttab_suspend();
>> +    xen_accumulate_runstate_time(-1);
>>      xen_arch_pre_suspend();
>>  
>>      /*
>> @@ -84,6 +85,7 @@ static int xen_suspend(void *data)
>>                                             : 0);
>>  
>>      xen_arch_post_suspend(si->cancelled);
>> +    xen_accumulate_runstate_time(si->cancelled);
> 
> I am not convinced that the comment above HYPERVISOR_suspend() is
> correct. The call can return an error code and so if it returns -EPERM
> (which AFAICS it can't now but might in the future) then
> xen_accumulate_runstate_time() will do wrong thing.

I would split xen_accumulate_runstate_time() into two functions to avoid the
-EPERM issue, as one is for saving and another is for accumulation, 
respectively.

Otherwise, can you use xen_accumulate_runstate_time(2) for saving before suspend
and xen_accumulate_runstate_time(si->cancelled) after resume?

> 
> 
>>      gnttab_resume();
>>  
>>      if (!si->cancelled) {
>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>> index ac5f23f..cf3afb9 100644
>> --- a/drivers/xen/time.c
>> +++ b/drivers/xen/time.c
>> @@ -19,6 +19,9 @@
>>  /* runstate info updated by Xen */
>>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>>  
>> +static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
>> +static struct vcpu_runstate_info *runstate_delta;
> 
> I'd move this inside xen_accumulate_runstate_time() since that's the

If we split xen_accumulate_runstate_time() into two functions, we would leave
runstate_delta as global static.

> only function that uses it. And why does it need to be
> vcpu_runstate_info and not u64[4]?

This was suggested by Juergen to avoid the allocation and reclaim of the second
dimensional array as in v4 of this patch?

Or would you like to allocate sizeof(u64[4]) * num_possible_cpus() and emulate
the 2d array with this 1d array and move the pointer forward sizeof(u64[4]) in
each iteration?

> 
>> +
>>  /* return an consistent snapshot of 64-bit time/counter value */
>>  static u64 get64(const u64 *p)
>>  {
>> @@ -47,8 +50,8 @@ static u64 get64(const u64 *p)
>>      return ret;
>>  }
>>  
>> -static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>> -                                      unsigned int cpu)
>> +static void xen_get_runstate_snapshot_cpu_delta(
>> +                    struct vcpu_runstate_info *res, unsigned int cpu)
>>  {
>>      u64 state_time;
>>      struct vcpu_runstate_info *state;
>> @@ -66,6 +69,67 @@ static void xen_get_runstate_snapshot_cpu(struct 
>> vcpu_runstate_info *res,
>>               (state_time & XEN_RUNSTATE_UPDATE));
>>  }
>>  
>> +static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>> +                                      unsigned int cpu)
>> +{
>> +    int i;
>> +
>> +    xen_get_runstate_snapshot_cpu_delta(res, cpu);
>> +
>> +    for (i = 0; i < RUNSTATE_max; i++)
>> +            res->time[i] += per_cpu(old_runstate_time, cpu)[i];
>> +}
>> +
>> +void xen_accumulate_runstate_time(int action)
>> +{
>> +    struct vcpu_runstate_info state;
>> +    int cpu, i;
>> +
>> +    switch (action) {
>> +    case -1: /* backup runstate time before suspend */
>> +            WARN_ON_ONCE(unlikely(runstate_delta));
> 
> pr_warn_once(), to be consistent with the rest of the file. And then
> should you return if this is true?

I would prefer to not return if it is true but just warn the administrator that
there is memory leakage issue while leaving runstate accumulation works 
normally.

> 
>> +
>> +            runstate_delta = kcalloc(num_possible_cpus(),
>> +                                     sizeof(*runstate_delta),
>> +                                     GFP_KERNEL);
>> +            if (unlikely(!runstate_delta)) {
>> +                    pr_alert("%s: failed to allocate runstate_delta\n",
>> +                                __func__);
> 
> pr_warn() should be sufficient. Below too.
> 
> Also, as a side question --- can we do kmalloc() at this point?

Yes. kmalloc_array() is better than kcalloc, unless we have 2 dimensional array
and we need to guarantee the value of first dimension is always 0.

> 
>> +                    return;
>> +            }
>> +
>> +            for_each_possible_cpu(cpu) {
>> +                    xen_get_runstate_snapshot_cpu_delta(&state, cpu);
>> +                    memcpy(runstate_delta[cpu].time, state.time,
>> +                          RUNSTATE_max * sizeof(*runstate_delta[cpu].time));
> 
> sizeof(runstate_delta[cpu].time).
> 
>> +            }
>> +
>> +            break;
>> +
>> +    case 0: /* backup runstate time after resume */
>> +            if (unlikely(!runstate_delta)) {
>> +                    pr_alert("%s: cannot accumulate runstate time as 
>> runstate_delta is NULL\n",
>> +                                __func__);
>> +                    return;
>> +            }
>> +
>> +            for_each_possible_cpu(cpu) {
>> +                    for (i = 0; i < RUNSTATE_max; i++)
>> +                            per_cpu(old_runstate_time, cpu)[i] +=
>> +                                    runstate_delta[cpu].time[i];
>> +            }
>> +            break;
>> +
>> +    default: /* do not accumulate runstate time for checkpointing */
>> +            break;
>> +    }
>> +
>> +    if (action != -1 && runstate_delta) {
>> +            kfree(runstate_delta);
>> +            runstate_delta = NULL;
>> +    }
>> +}
>> +
>>  /*
>>   * Runstate accounting
>>   */
>> diff --git a/include/xen/interface/vcpu.h b/include/xen/interface/vcpu.h
>> index 98188c8..85e81ce 100644
>> --- a/include/xen/interface/vcpu.h
>> +++ b/include/xen/interface/vcpu.h
>> @@ -110,6 +110,8 @@ DEFINE_GUEST_HANDLE_STRUCT(vcpu_runstate_info);
>>   */
>>  #define RUNSTATE_offline  3
>>  
>> +#define RUNSTATE_max      4
> 
> This file is part of Xen ABI. While this macro technically doesn't
> change anything I'd rather have those updates first appear in Xen code.
> 
> Besides, this change leaves vcpu_runstate_info.time[4] definition
> intact. I think all RUNSTATE_* macros would need to be moved above
> vcpu_runstate_info definition.
> 
> TBH, for purposes of this patch I'd use 4.
> 
> 
> -boris
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxx
> https://lists.xen.org/xen-devel
> 

Thank you very much for your suggestions!

Dongli Zhang

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.