Re: [Xen-devel] [PATCH v5 1/1] xen/time: do not decrease steal time after live migration on xen
Hi Boris,

On 10/31/2017 08:58 AM, Boris Ostrovsky wrote:
>
>
> On 10/30/2017 08:14 PM, Dongli Zhang wrote:
>> Hi Boris,
>>
>> On 10/30/2017 09:34 PM, Boris Ostrovsky wrote:
>>> On 10/30/2017 04:03 AM, Dongli Zhang wrote:
>>>> After guest live migration on xen, steal time in /proc/stat
>>>> (cpustat[CPUTIME_STEAL]) might decrease because the steal returned by
>>>> xen_steal_clock() might be less than this_rq()->prev_steal_time, which
>>>> is derived from the previous return value of xen_steal_clock().
>>>>
>>>> For instance, the steal time of each vcpu is 335 before live migration:
>>>>
>>>> cpu  198 0 368 200064 1962 0 0 1340 0 0
>>>> cpu0  38 0  81  50063  492 0 0  335 0 0
>>>> cpu1  65 0  97  49763  634 0 0  335 0 0
>>>> cpu2  38 0  81  50098  462 0 0  335 0 0
>>>> cpu3  56 0 107  50138  374 0 0  335 0 0
>>>>
>>>> After live migration, the steal time is reduced to 312:
>>>>
>>>> cpu  200 0 370 200330 1971 0 0 1248 0 0
>>>> cpu0  38 0  82  50123  500 0 0  312 0 0
>>>> cpu1  65 0  97  49832  634 0 0  312 0 0
>>>> cpu2  39 0  82  50167  462 0 0  312 0 0
>>>> cpu3  56 0 107  50207  374 0 0  312 0 0
>>>>
>>>> Since runstate times are cumulative and are cleared by the xen
>>>> hypervisor during live migration, the idea of this patch is to
>>>> accumulate the runstate times into global percpu variables before the
>>>> live migration suspend. Once the guest VM is resumed,
>>>> xen_get_runstate_snapshot_cpu() always returns the sum of the new
>>>> runstate times and the previously accumulated times stored in the
>>>> global percpu variables.
>>>>
>>>> A similar and more severe issue impacts Linux 4.8-4.10, as discussed
>>>> by Michael Las at
>>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest,
>>>> where the steal time overflows and leads to 100% st usage in the top
>>>> command on Linux 4.8-4.10. A backport of this patch would fix that
>>>> issue.
>>>>
>>>> References:
>>>> https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest
>>>>
>>>> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
>>>>
>>>> ---
>>>> Changed since v1:
>>>>   * relocate modification to xen_get_runstate_snapshot_cpu
>>>>
>>>> Changed since v2:
>>>>   * accumulate runstate times before live migration
>>>>
>>>> Changed since v3:
>>>>   * do not accumulate times in the case of guest checkpointing
>>>>
>>>> Changed since v4:
>>>>   * allocate array of vcpu_runstate_info to reduce the number of
>>>>     memory allocations
>>>>
>>>> ---
>>>>  drivers/xen/manage.c         |  2 ++
>>>>  drivers/xen/time.c           | 68 ++++++++++++++++++++++++++++++++++++++++++--
>>>>  include/xen/interface/vcpu.h |  2 ++
>>>>  include/xen/xen-ops.h        |  1 +
>>>>  4 files changed, 71 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/xen/manage.c b/drivers/xen/manage.c
>>>> index c425d03..3dc085d 100644
>>>> --- a/drivers/xen/manage.c
>>>> +++ b/drivers/xen/manage.c
>>>> @@ -72,6 +72,7 @@ static int xen_suspend(void *data)
>>>>      }
>>>>
>>>>      gnttab_suspend();
>>>> +    xen_accumulate_runstate_time(-1);
>>>>      xen_arch_pre_suspend();
>>>>
>>>>      /*
>>>> @@ -84,6 +85,7 @@ static int xen_suspend(void *data)
>>>>                             : 0);
>>>>
>>>>      xen_arch_post_suspend(si->cancelled);
>>>> +    xen_accumulate_runstate_time(si->cancelled);
>>>
>>> I am not convinced that the comment above HYPERVISOR_suspend() is
>>> correct. The call can return an error code, and so if it returns -EPERM
>>> (which AFAICS it can't now but might in the future) then
>>> xen_accumulate_runstate_time() will do the wrong thing.
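To make the concern concrete, here is a minimal sketch of the patched call sequence (an illustration only, assuming a hypothetical future HYPERVISOR_suspend() that can fail with a negative errno such as -EPERM):

    /*
     * Sketch: si->cancelled is fed straight from the hypercall's return
     * value.  Today that is 1 (suspend cancelled or domain checkpointed)
     * or 0 (resumed in a new domain), but a hypothetical error return
     * such as -EPERM is also nonzero and would be forwarded below as if
     * it meant "cancelled".
     */
    si->cancelled = HYPERVISOR_suspend(xen_pv_domain()
                                       ? virt_to_gfn(xen_start_info)
                                       : 0);

    xen_arch_post_suspend(si->cancelled);
    xen_accumulate_runstate_time(si->cancelled); /* -EPERM is nonzero, not a real cancel */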
>>
>> I would split xen_accumulate_runstate_time() into two functions to
>> avoid the -EPERM issue, one for saving and another for accumulation,
>> respectively.
>>
>> Otherwise, can you use xen_accumulate_runstate_time(2) for saving
>> before suspend and xen_accumulate_runstate_time(si->cancelled) after
>> resume?
>
>
> I'd probably just say something like
>
> si->cancelled = HYPERVISOR_suspend() ? 1 : 0;

As the call of HYPERVISOR_suspend() takes 3 lines, I would make it as
below, and I think gcc would optimize it:

-       /*
-        * This hypercall returns 1 if suspend was cancelled
-        * or the domain was merely checkpointed, and 0 if it
-        * is resuming in a new domain.
-        */
        si->cancelled = HYPERVISOR_suspend(xen_pv_domain()
                                           ? virt_to_gfn(xen_start_info)
                                           : 0);
+       si->cancelled = si->cancelled ? 1 : 0;

> and keep xen_accumulate_runstate_time() as is (maybe rename it to
> xen_manage_runstate_time()). And also remove the comment above the
> hypercall as it is incorrect (but please mention the reason in the
> commit message).
>
>>
>>>
>>>
>>>>      gnttab_resume();
>>>>
>>>>      if (!si->cancelled) {
>>>> diff --git a/drivers/xen/time.c b/drivers/xen/time.c
>>>> index ac5f23f..cf3afb9 100644
>>>> --- a/drivers/xen/time.c
>>>> +++ b/drivers/xen/time.c
>>>> @@ -19,6 +19,9 @@
>>>>
>>>>  /* runstate info updated by Xen */
>>>>  static DEFINE_PER_CPU(struct vcpu_runstate_info, xen_runstate);
>>>>
>>>> +static DEFINE_PER_CPU(u64[RUNSTATE_max], old_runstate_time);
>>>> +static struct vcpu_runstate_info *runstate_delta;
>>>
>>> I'd move this inside xen_accumulate_runstate_time() since that's the
>>
>> If we split xen_accumulate_runstate_time() into two functions, we
>> would leave runstate_delta as a global static.
>>
>>> only function that uses it. And why does it need to be
>>> vcpu_runstate_info and not u64[4]?
>>
>> This was suggested by Juergen, to avoid the allocation and reclaim of
>> the second-dimension arrays as in v4 of this patch.
>>
>> Or would you like to allocate sizeof(u64[4]) * num_possible_cpus() and
>> emulate the 2d array with this 1d array, moving the pointer forward by
>> sizeof(u64[4]) in each iteration?
>
>
> I was thinking of
>
> u64 **runstate_delta = (u64 **)kmalloc(sizeof(xen_runstate.time) *
>                                        num_possible_cpus())
>
> and then you should be able to access runstate_delta[cpu][RUNSTATE_*].

Would the above code work? You are allocating data only for the 1st
dimension of a 2d array. How can you access the 2nd dimension with
runstate_delta[cpu][RUNSTATE_*]? Is there any option in gcc that
supports this?
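As an aside, a single allocation can support that two-dimensional indexing if the pointer carries the row type. A minimal sketch (illustrative only, not from any posted version of the patch):

    /*
     * With u64 **, runstate_delta[cpu] reads an uninitialized *pointer*
     * out of the freshly kmalloc'ed block, so runstate_delta[cpu][i]
     * dereferences garbage.  A pointer to u64[RUNSTATE_max] keeps the
     * single allocation but tells the compiler the row size, so
     * runstate_delta[cpu] names a real row of the array:
     */
    struct vcpu_runstate_info state;
    u64 (*runstate_delta)[RUNSTATE_max];
    int cpu;

    runstate_delta = kmalloc_array(num_possible_cpus(),
                                   sizeof(*runstate_delta), GFP_ATOMIC);
    if (runstate_delta) {
            for_each_possible_cpu(cpu) {
                    xen_get_runstate_snapshot_cpu_delta(&state, cpu);
                    memcpy(runstate_delta[cpu], state.time,
                           sizeof(*runstate_delta));
            }
    }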
I have actually tested the u64 ** version with the below code in linux,
and the kernel panics with a page fault when doing memcpy:

+void xen_manage_runstate_time(int action)
+{
+       static u64 **runstate_delta;
+       struct vcpu_runstate_info state;
+       int cpu, i;
+
+       switch (action) {
+       case -1: /* backup runstate time before suspend */
+               WARN_ON_ONCE(unlikely(runstate_delta));
+
+               runstate_delta = (u64 **)kmalloc(sizeof(xen_runstate.time) *
+                                                num_possible_cpus(),
+                                                GFP_ATOMIC);
+               if (unlikely(!runstate_delta)) {
+                       pr_alert("%s: failed to allocate runstate_delta\n",
+                                __func__);
+                       return;
+               }
+
+               for_each_possible_cpu(cpu) {
+                       xen_get_runstate_snapshot_cpu_delta(&state, cpu);
+                       memcpy(runstate_delta[cpu], state.time,
+                              sizeof(xen_runstate.time));
+               }
+
+               break;

>
>>
>>>
>>>> +
>>>>  /* return an consistent snapshot of 64-bit time/counter value */
>>>>  static u64 get64(const u64 *p)
>>>>  {
>>>> @@ -47,8 +50,8 @@ static u64 get64(const u64 *p)
>>>>      return ret;
>>>>  }
>>>>
>>>> -static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>>>> -                                          unsigned int cpu)
>>>> +static void xen_get_runstate_snapshot_cpu_delta(
>>>> +                        struct vcpu_runstate_info *res, unsigned int cpu)
>>>>  {
>>>>      u64 state_time;
>>>>      struct vcpu_runstate_info *state;
>>>> @@ -66,6 +69,67 @@ static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>>>>             (state_time & XEN_RUNSTATE_UPDATE));
>>>>  }
>>>>
>>>> +static void xen_get_runstate_snapshot_cpu(struct vcpu_runstate_info *res,
>>>> +                                          unsigned int cpu)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    xen_get_runstate_snapshot_cpu_delta(res, cpu);
>>>> +
>>>> +    for (i = 0; i < RUNSTATE_max; i++)
>>>> +        res->time[i] += per_cpu(old_runstate_time, cpu)[i];
>>>> +}
>>>> +
>>>> +void xen_accumulate_runstate_time(int action)
>>>> +{
>>>> +    struct vcpu_runstate_info state;
>>>> +    int cpu, i;
>>>> +
>>>> +    switch (action) {
>>>> +    case -1: /* backup runstate time before suspend */
>>>> +        WARN_ON_ONCE(unlikely(runstate_delta));
>>>
>>> pr_warn_once(), to be consistent with the rest of the file. And then
>>> should you return if this is true?
>>
>> I would prefer not to return if it is true, but just to warn the
>> administrator that there is a memory leak, while runstate accumulation
>> keeps working normally.
>>
>>>
>>>> +
>>>> +        runstate_delta = kcalloc(num_possible_cpus(),
>>>> +                                 sizeof(*runstate_delta),
>>>> +                                 GFP_KERNEL);
>>>> +        if (unlikely(!runstate_delta)) {
>>>> +            pr_alert("%s: failed to allocate runstate_delta\n",
>>>> +                     __func__);
>>>
>>> pr_warn() should be sufficient. Below too.
>>>
>>> Also, as a side question --- can we do kmalloc() at this point?
>>
>> Yes. kmalloc_array() is better than kcalloc(), unless we have a
>> 2-dimensional array and we need to guarantee the values of the first
>> dimension are always 0.
>
>
> That's not what I was thinking about. GFP_KERNEL may sleep and I don't
> know how sleep is handled at this point. Everything is pretty much dead
> now. Perhaps GFP_ATOMIC might be a better choice.
>
> -boris
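For the save side, a minimal sketch of the allocation with GFP_ATOMIC (an illustration only; it assumes xen_suspend() runs under stop_machine() with interrupts disabled, where an allocation that may sleep is not safe):

    /*
     * xen_suspend() is invoked via stop_machine(), i.e. with interrupts
     * off, so the allocation must not sleep.  GFP_ATOMIC never sleeps;
     * it simply fails if no memory is immediately available.
     */
    runstate_delta = kcalloc(num_possible_cpus(), sizeof(*runstate_delta),
                             GFP_ATOMIC);
    if (unlikely(!runstate_delta)) {
            pr_warn("%s: failed to allocate runstate_delta\n", __func__);
            return;
    }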
Thank you very much for your suggestion!

Dongli Zhang

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
https://lists.xen.org/xen-devel