
Re: [Xen-devel] [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)



>>> +    unsigned long flags;
>>> +    u64 tsc;
>>> +
>>> +    local_irq_save(flags);
>>> +    ap_bringup_ref.master_stime = read_platform_stime();
>>> +    tsc = rdtsc();
>>> +    local_irq_restore(flags);
>>> +
>>> +    ap_bringup_ref.local_stime = get_s_time_fixed(tsc);
>>> +}
>>> +
>>>  void init_percpu_time(void)
>>>  {
>>>      struct cpu_time *t = &this_cpu(cpu_time);
>>>      unsigned long flags;
>>> +    u64 tsc;
>>>      s_time_t now;
>>>  
>>>      /* Initial estimate for TSC rate. */
>>>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
>>>  
>>>      local_irq_save(flags);
>>> -    t->local_tsc_stamp = rdtsc();
>>>      now = read_platform_stime();
>> Do we need to take another read from the platform timer here? We could
>> just reuse the one on ap_bringup_ref.master_stime. See also below.
> 
> Yes, for this model of approximation of the local stime delta by
> master stime delta we obviously need to read the master clock
> here again (or else we have no way to calculate the master
> delta, which we want to use as the correcting value).
Ah, correct.
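Right - so for APs the seeding effectively becomes (restating the arithmetic
with the names from the patch, just to make sure I follow the model): read the
platform timer a second time in init_percpu_time() to get master_now, and then

    local_now ~= ap_bringup_ref.local_stime
                 + (master_now - ap_bringup_ref.master_stime)

i.e. the master delta since the bringup snapshot is the correcting value
applied on top of the BSP-side local reference.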

>> So it would be possible that get_s_time(S0) on CPU0, immediately followed
>> by a get_s_time(S1) on CPU1 [S1 > S0], ends up returning a value backwards,
>> IOW L0 + scale(S0 - T0) is bigger than L1 + scale(S1 - T1). That is
>> independent of the deltas being added on every calibration.
>>
>> So how about we do the seeding in another way?
>>
>> 1) Relying on individual CPU TSC like you do on CPU 0:
>>
>>      t->stamp.master_stime = now;
>> +    t->stamp.local_tsc = 0;
>> +    t->stamp.local_stime = 0;
>> +
>>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>>      {
>>          now = get_s_time_fixed(tsc);
>>      }
> 
> As said before elsewhere, such a model can only work if all TSCs
> start ticking at the same time. The latest this assumption breaks 
> is when a CPU gets physically hotplugged.
Agreed.

>> Or 2) taking into account the skew between platform timer and tsc we take on
>> init_percpu_time. Diff below based on this series:
>>
>> @@ -1394,11 +1508,14 @@ void init_percpu_time(void)
>>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
>>
>>      local_irq_save(flags);
>> -    now = read_platform_stime();
>> +    now = ap_bringup_ref.master_stime;
>>      tsc = rdtsc_ordered();
>>      local_irq_restore(flags);
>>
>>      t->stamp.master_stime = now;
>> +    t->stamp.local_tsc = boot_tsc_stamp;
> 
> Again - this is invalid. It indeed works fine on a single socket system
> (where all TSCs start ticking at the same time), and gives really good
> results. But due to hotplug (and to a lesser degree multi-socket, but
> likely to a somewhat higher degree multi-node, systems) we just can't
> use boot_tsc_stamp as reference for APs.
It also gives really good results on a multi-socket system, at least on my old
Intel multi-socket box (Westmere-based; the invariant TSC was introduced on its
predecessor, Nehalem). Btw, Linux seems to treat Intel boxes as having
synchronized TSCs across cores and sockets [0][1]. CPU hotplug remains the
troublesome case, though.

[0] arch/x86/kernel/tsc.c#L1082
[1] arch/x86/kernel/cpu/intel.c#L85

> 
>> +    t->stamp.local_stime = 0;
>> +
>>
>>      /*
>>       * To avoid a discontinuity (TSC and platform clock can't be expected
>>       * to be in perfect sync), initialization here needs to match up with
>> @@ -1406,10 +1523,12 @@ void init_percpu_time(void)
>>       */
>>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>>      {
>> +        u64 stime = get_s_time_fixed(tsc);
>> +
>>          if ( system_state < SYS_STATE_smp_boot )
>> -            now = get_s_time_fixed(tsc);
>> +            now = stime;
>>          else
>> -            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
>> +            now += stime - ap_bringup_ref.master_stime;
> 
> That seems to contradict your desire to keep out of all calculations
> the skew between platform timer and TSC.
> 
That's indeed my desire (in my related series about using the TSC as clocksource
and the TSC stable bit), but being aware of the TSC's limitations, I was trying
to see whether there was common ground between this seeding with the platform
timer and accounting for the potentially unreliable TSC of individual CPUs. But
of course, for any of these solutions to have get_s_time return monotonically
increasing values, they can only work on a truly invariant-TSC box.

>>      }
>>
>> The second option follows a similar thinking to this patch, but either way
>> the local_stime will match the tsc in init_percpu_time, thus being a matched
>> pair. I have a test similar to check_tsc_warp but with get_s_time(), and I no
>> longer observe time going backwards. But without the above I still observe
>> it, even over short periods.
> 
> Right - I'm not claiming the series eliminates all backwards moves.
> But I simply can't see a model to fully eliminate them. 
Totally agree with you.
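
(For reference, the get_s_time() warp test I mentioned is roughly along the
lines below - only a sketch with made-up names, run concurrently on all online
CPUs, e.g. via smp_call_function(); a non-zero stime_max_warp means NOW() was
observed going backwards across CPUs.)

--
static DEFINE_SPINLOCK(stime_warp_lock);
static s_time_t stime_last;
static s_time_t stime_max_warp;

static void stime_warp_check(void *unused)
{
    unsigned int i;

    for ( i = 0; i < 100000; i++ )
    {
        s_time_t now;

        spin_lock(&stime_warp_lock);
        now = NOW();
        /* A sample older than one already seen on any CPU is a warp. */
        if ( now < stime_last )
            stime_max_warp = max(stime_max_warp, stime_last - now);
        else
            stime_last = now;
        spin_unlock(&stime_warp_lock);
    }
}
--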

> I.e. my plan was, once the backwards moves are small enough, to maybe
> indeed compensate them by maintaining a global variable tracking
> the most recently returned value. There are issues with such an
> approach too, though: HT effects can result in one hyperthread
> making it just past that check of the global, then hardware
> switching to the other hyperthread, NOW() producing a slightly
> larger value there, and hardware switching back to the first
> hyperthread only after the second one consumed the result of
> NOW(). Dario's use would be unaffected by this aiui, as his NOW()
> invocations are globally serialized through a spinlock, but arbitrary
> NOW() invocations on two hyperthreads can't be made such that
> the invoking party can be guaranteed to see strictly monotonic
> values.
> 
> And btw., similar considerations apply for two fully independent
> CPUs, if one runs at a much higher P-state than the other (i.e.
> the faster one could overtake the slower one between the
> monotonicity check in NOW() and the callers consuming the returned
> values). So in the end I'm not sure it's worth the performance hit
> such a global monotonicity check would incur, and therefore I didn't
> make a respective patch part of this series.
> 

Hm, guests' pvclock should have faced similar issues too, as their local stamps
for each vcpu diverge. Linux commit 489fb49 ("x86, paravirt: Add a global
synchronization point for pvclock") describes a fix for situations similar to
the scenarios you just described, which led to a global variable keeping track
of the most recent timestamp. One important chunk of that commit is pasted below
for convenience:

--
/*
 * Assumption here is that last_value, a global accumulator, always goes
 * forward. If we are less than that, we should not be much smaller.
 * We assume there is an error marging we're inside, and then the correction
 * does not sacrifice accuracy.
 *
 * For reads: global may have changed between test and return,
 * but this means someone else updated poked the clock at a later time.
 * We just need to make sure we are not seeing a backwards event.
 *
 * For updates: last_value = ret is not enough, since two vcpus could be
 * updating at the same time, and one of them could be slightly behind,
 * making the assumption that last_value always go forward fail to hold.
 */
 last = atomic64_read(&last_value);
 do {
     if (ret < last)
         return last;
     last = atomic64_cmpxchg(&last_value, last, ret);
 } while (unlikely(last != ret));
--
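
If we wanted to go that route in Xen despite the cost you mention, I guess the
equivalent would be something along the lines below - only a sketch, with a
hypothetical clamp_stime() helper applied to the value about to be returned by
get_s_time()/NOW(), and assuming the existing read_atomic()/cmpxchg() helpers:

--
static u64 stime_last_ret;

static s_time_t clamp_stime(s_time_t ret)
{
    u64 last = read_atomic(&stime_last_ret);

    for ( ; ; )
    {
        u64 old;

        /* A later (or equal) value was already handed out - return that. */
        if ( (u64)ret <= last )
            return last;

        old = cmpxchg(&stime_last_ret, last, (u64)ret);
        if ( old == last )
            return ret;      /* we advanced the global */

        last = old;          /* lost the race - retry against the new value */
    }
}
--

As you point out, though, this only makes the values handed out non-decreasing;
it doesn't help the hyperthread/P-state scenarios where the caller consumes the
value later, and it adds a shared cache line to every NOW() invocation.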

Though I am not sure what other options we would have with such a heavy reliance
on TSC right now.

Joao
