
Re: [Xen-devel] [PATCH 1/8] x86/time: improve cross-CPU clock monotonicity (and more)



>>> On 16.06.16 at 00:51, <joao.m.martins@xxxxxxxxxx> wrote:
> On 06/15/2016 11:26 AM, Jan Beulich wrote:
>> --- a/xen/arch/x86/time.c
>> +++ b/xen/arch/x86/time.c
>> @@ -1328,21 +1328,51 @@ static void time_calibration(void *unuse
>>                       &r, 1);
>>  }
>>  
>> +static struct {
>> +    s_time_t local_stime, master_stime;
>> +} ap_bringup_ref;
>> +
>> +void time_latch_stamps(void) {
>                                 ^ style: shouldn't this bracket be on a new line?

Oh, yes, of course.

>> +    unsigned long flags;
>> +    u64 tsc;
>> +
>> +    local_irq_save(flags);
>> +    ap_bringup_ref.master_stime = read_platform_stime();
>> +    tsc = rdtsc();
>> +    local_irq_restore(flags);
>> +
>> +    ap_bringup_ref.local_stime = get_s_time_fixed(tsc);
>> +}
>> +
>>  void init_percpu_time(void)
>>  {
>>      struct cpu_time *t = &this_cpu(cpu_time);
>>      unsigned long flags;
>> +    u64 tsc;
>>      s_time_t now;
>>  
>>      /* Initial estimate for TSC rate. */
>>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
>>  
>>      local_irq_save(flags);
>> -    t->local_tsc_stamp = rdtsc();
>>      now = read_platform_stime();
> Do we need to take another read from the platform timer here? We could
> just reuse the one in ap_bringup_ref.master_stime. See also below.

Yes, for this model of approximating the local stime delta by the
master stime delta we obviously need to read the master clock here
again (or else we have no way to calculate the master delta, which we
want to use as the correcting value).
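
Expressed as a (hypothetical, simplified) snippet, with "now" freshly
read from the platform timer on the AP and names taken from the diff:

    /* Approximate the local stime delta since the latch point by the
     * master stime delta, and apply it to the latched local stime. */
    now = ap_bringup_ref.local_stime             /* local stime at latch */
          + (now - ap_bringup_ref.master_stime); /* master delta since then */

which is just a rearrangement of the "now += ..." statement in the
hunk quoted below.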

>> +    tsc = rdtsc();
>>      local_irq_restore(flags);
>>  
>>      t->stime_master_stamp = now;
>> +    /*
>> +     * To avoid a discontinuity (TSC and platform clock can't be expected
>> +     * to be in perfect sync), initialization here needs to match up with
>> +     * local_time_calibration()'s decision whether to use its fast path.
>> +     */
>> +    if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>> +    {
>> +        if ( system_state < SYS_STATE_smp_boot )
>> +            now = get_s_time_fixed(tsc);
>> +        else
>> +            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
>> +    }
>> +    t->local_tsc_stamp    = tsc;
> 
> As far as my current experimentation goes (both v1 of this patch and
> the whole series on v2), the source of the remaining warps could be
> fixed with this seeding. Though I think this seeding might not yet be
> correct. Essentially, on the boot CPU you do get_s_time_fixed(tsc),
> which basically gives you a value strictly calculated with the TSC
> since boot_tsc_stamp. For the other CPUs you append the time skew
> between TSC and platform timer, calculated before AP bringup, to a
> value just read from the platform timer. Which I assume you take as an
> approximation of the skew between TSC and platform timer.

Correct.

> So, before this patch cpu_time is filled like this:
> 
> CPU 0: M0 , T0
> CPU 1: M1 , T1
> CPU 2: M2 , T2
> 
> With this patch it goes like:
> 
> CPU 0: L0 , T0
> CPU 1: L1 = (M1 + (L - M)), T1
> CPU 2: L2 = (M2 + (L - M)), T2
> 
> Given that:
> 
> 1) values separated by commas are, respectively, the local stime and
> tsc written into cpu_time,
> 2) M2 > M1 > M0 are values read from the platform timer,
> 3) L and M are values calculated solely on the CPU doing AP bringup.
> 
> After this seeding, local_time_calibration will increment the deltas
> between calibrations every second, based entirely on a monotonically
> increasing TSC. Although I see two issues here: 1) we append to a new
> read from the platform timer, which might already be offset from the
> one taken on the boot CPU, and 2) the skew you calculate doesn't
> account for the current tsc. Which leads to local stime and tsc still
> not being a valid pair for the secondary CPUs.

That's true if platform clock and TSC, even over this small time
period, diverge enough. The calculated local time indeed is only an
approximation matching the TSC just read.
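
To make that concrete with made-up numbers (scale = 1 ns/tick, all
values in ns): suppose at the latch point L = 5000 and M = 4970, i.e.
L - M = 30. If between the latch and the AP's seeding the TSC-derived
clock advances by 130 but the platform clock by only 100, then at
seeding time get_s_time_fixed(tsc) = 5130 while read_platform_stime()
= 5070, and the AP gets seeded with local_stime = 5070 + 30 = 5100
paired with that TSC value. NOW() on the AP hence starts some 30 ns
behind the boot CPU - exactly the kind of backwards move described
below.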

> So it would be possible that get_s_time(S0) on CPU0, immediately
> followed by get_s_time(S1) on CPU1 [S1 > S0], ends up returning a
> value that goes backwards, IOW L0 + scale(S0 - T0) is bigger than
> L1 + scale(S1 - T1). That is independent of the deltas being added
> on every calibration.
> 
> So how about we do the seeding in another way?
> 
> 1) Relying on individual CPU TSC like you do on CPU 0:
> 
>      t->stamp.master_stime = now;
> +    t->stamp.local_tsc = 0;
> +    t->stamp.local_stime = 0;
> +
>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>      {
>          now = get_s_time_fixed(tsc);
>      }

As said before elsewhere, such a model can only work if all TSCs
start ticking at the same time. This assumption breaks, at the latest,
when a CPU gets physically hotplugged.

> Or 2) taking into account the skew between platform timer and tsc we take on
> init_percpu_time. Diff below based on this series:
> 
> @@ -1394,11 +1508,14 @@ void init_percpu_time(void)
>      t->tsc_scale = per_cpu(cpu_time, 0).tsc_scale;
> 
>      local_irq_save(flags);
> -    now = read_platform_stime();
> +    now = ap_bringup_ref.master_stime;
>      tsc = rdtsc_ordered();
>      local_irq_restore(flags);
> 
>      t->stamp.master_stime = now;
> +    t->stamp.local_tsc = boot_tsc_stamp;

Again - this is invalid. It indeed works fine on a single-socket system
(where all TSCs start ticking at the same time), and gives really good
results. But due to hotplug (and, to a lesser degree, on multi-socket,
likely to a somewhat higher degree on multi-node, systems) we just
can't use boot_tsc_stamp as the reference for APs.

> +    t->stamp.local_stime = 0;
> +
> 
>      /*
>       * To avoid a discontinuity (TSC and platform clock can't be expected
>       * to be in perfect sync), initialization here needs to match up with
> @@ -1406,10 +1523,12 @@ void init_percpu_time(void)
>       */
>      if ( boot_cpu_has(X86_FEATURE_CONSTANT_TSC) )
>      {
> +        u64 stime = get_s_time_fixed(tsc);
> +
>          if ( system_state < SYS_STATE_smp_boot )
> -            now = get_s_time_fixed(tsc);
> +            now = stime;
>          else
> -            now += ap_bringup_ref.local_stime - ap_bringup_ref.master_stime;
> +            now += stime - ap_bringup_ref.master_stime;

That seems to contradict your desire to keep the skew between platform
timer and TSC out of all calculations.
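
(Spelling the arithmetic out: the correction term stime -
ap_bringup_ref.master_stime is precisely the TSC-vs-platform-timer
skew at that point, and since now was set to ap_bringup_ref.master_stime
in the hunk above, the sum collapses to now = get_s_time_fixed(tsc).)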

>      }
> 
> The second option follows thinking similar to this patch's, but either
> way the local_stime will match the tsc in init_percpu_time, thus being
> a matched pair. I have a test similar to check_tsc_warp, but with
> get_s_time(), and I no longer observe time going backwards. But
> without the above I still observe it, even over short periods.

Right - I'm not claiming the series eliminates all backwards moves.
But I simply can't see a model to fully eliminate them. I.e. my plan
was, once the backwards moves are small enough, to maybe
indeed compensate them by maintaining a global variable tracking
the most recently returned value. There are issues with such an
approach too, though: HT effects can result in one hyperthread
making it just past that check of the global, then hardware
switching to the other hyperthread, NOW() producing a slightly
larger value there, and hardware switching back to the first
hyperthread only after the second one consumed the result of
NOW(). Dario's use would be unaffected by this aiui, as his NOW()
invocations are globally serialized through a spinlock, but arbitrary
NOW() invocations on two hyperthreads can't be made such that
the invoking party can be guaranteed to see strictly monotonic
values.
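
For reference, tracking the most recently returned value might look
roughly like the sketch below - not part of this series, and merely a
sketch assuming Xen's cmpxchg() and read_atomic() helpers plus a
hypothetical wrapper name:

    /* Sketch only: hand out globally non-decreasing stime values. */
    static s_time_t last_stime;

    s_time_t monotonic_now(void)
    {
        s_time_t old, new = get_s_time();

        for ( ; ; )
        {
            old = read_atomic(&last_stime);
            if ( new <= old )
                return old;    /* clamp to the last value handed out */
            if ( cmpxchg(&last_stime, old, new) == old )
                return new;
        }
    }

Note this only makes the returned values non-decreasing; as explained
above, it cannot guarantee that two hyperthreads consume those values
in the order they were produced.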

And btw., similar considerations apply for two fully independent
CPUs, if one runs at a much higher P-state than the other (i.e.
the faster one could overtake the slower one between the
monotonicity check in NOW() and the callers consuming the returned
values). So in the end I'm not sure it's worth the performance hit
such a global monotonicity check would incur, and therefore I didn't
make a respective patch part of this series.

Jan
