
[Xen-devel] [PATCH v13] tolerate jitter in cpu_khz calculation to avoid TSC emulation



Improve the decision whether vTSC emulation is activated for a domU with
tsc_mode=default. The current approach is to compare the cpu_khz values
of the two physical hosts. Since this value is not accurate, it cannot be
used verbatim to decide if vTSC emulation needs to be enabled. Without
this change each TSC access from domU will be emulated after migration,
which causes a significant performance drop for workloads that make use
of rdtsc.

If a domU uses TSC as clocksource it also must sync with an external
clocksource in some way to avoid the potential drift that will most
likely happen, independent of any migration. The calculation of the
drift is based on the time returned by remote servers versus how fast
the local clock advances. NTP in Linux can handle a drift of up to
500 PPM, other OS and NTP implementations are known to handle a higher
drift. This means the local clocksource can run up to 500us per second
slower or faster. This calculation is based on the TSC frequency of the
host where the domU was started.

If the domU is migrated to another host of the same class, both hosts
may have a slightly different TSC frequency. The difference is small
enough and most likely within the drift range that NTP can handle.
For comparison, the upper drift limit of 500 PPM corresponds to 1 MHz
on a 2.0 GHz host.

Once a domU is migrated to a host of a different class, like from
"2.3GHz" to "2.4GHz", the TSC frequency change is significant. The domU
kernel may not recalibrate itself. As a result, the drift will be larger
and will be far outside of the 500 PPM range. In addition, the kernel
may notice the change in the speed at which the TSC advances and could
change the clocksource. This will impact the workload within the domU.
All this depends of course on the type of OS that is running in the
domU. This patch can do nothing for this case.

The formula to set the tolerance for this host calculates the ticks
within a timespan of 500 PPM, which is 500us per second. From this
number the assumed jitter in the TSC frequency measurement must be
subtracted because Xen itself cannot know if the estimated value in
cpu_khz is at the edge or in the middle of the range of possible
frequencies. Data collected during the incident which triggered this
change showed a jitter of up to 200 kHz across systems of the same
class. The resulting tolerance is larger than needed, and it is expected
to still cover the possible drift that NTP can handle.

To reiterate the second paragraph: if a domU uses TSC as primary clock
source, it is expected that it syncs with external clocksources to
compensate for the resulting drift. The same expectation exists for
bare metal. Therefore this change does not need a knob to turn it on or off.

Signed-off-by: Olaf Hering <olaf@xxxxxxxxx>
--

v13:
 - rename a variable to better describe its meaning
 - expand comments in the code
 - reword commit message
 - mention the tolerance in xen-tscmode(7)
v12:
 - rebase to 4deeaf2a3e
 - remove casts and trailing dot in early_time_init
 - add comments to explain how vtsc_tolerance_khz is calculated
 - adjust cast in ABS()
 - adjust comment in tsc_set_info
v11:
 - trim patch and use calculated tolerance value, no admin interaction
   required
v10:
 - rebase to ae01a8e315
 - remove changes for libxl and save/restore protocol, the feature has
   to be per host instead of per guest
 - add newline to tsc_set_info (Andrew)
 - add pointer to xen-tscmode(7) in xl.cfg(5)/vtsc_tolerance_khz (Andrew)
 - mention potential clock drift in the domU (Andrew)
 - reword the newly added paragraph in xen-tscmode(7) (Andrew),
   and also mention that it is about the measured/estimated TSC value
   rather than the real value. The latter is simply unknown.
 - use uint32 for internal representation of
   xen_domctl_tsc_info.vtsc_tolerance_khz and remove padding field
 - add math for real TSC frequency to xen-tscmode
v9:
 - extend commit msg, mention potential issues with xc_sr_rec_tsc_info._res1
v8:
 - adjust also python stream checker for added tolerance member
v7:
 - use uint16 in libxl_types.idl to match type used elsewhere in the patch
v6:
 - mention default value in xl.cfg
 - tsc_set_info: remove usage of __func__, use %d for domid
 - tsc_set_info: use ABS to calculate khz_diff
v5:
 - reduce functionality to allow setting of the tolerance value
   only at initial domU startup
v4:
 - add missing copyback in XEN_DOMCTL_set_vtsc_tolerance_khz
v3:
 - rename vtsc_khz_tolerance to vtsc_tolerance_khz
 - separate domctls to adjust values
 - more docs
 - update libxl.h
 - update python tests
 - flask check bound to tsc permissions
 - not runtime tested due to dlsym() build errors in staging
---
 docs/man/xen-tscmode.7.pod | 13 ++++++++++
 xen/arch/x86/time.c        | 63 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/docs/man/xen-tscmode.7.pod b/docs/man/xen-tscmode.7.pod
index 1d81a3fe18..51d5d378f7 100644
--- a/docs/man/xen-tscmode.7.pod
+++ b/docs/man/xen-tscmode.7.pod
@@ -213,6 +213,19 @@ is emulated.  Note that, though emulated, the "apparent" TSC frequency
 will be the TSC frequency of the initial physical machine, even after
 migration.
 
+Since the calibration of the TSC frequency isn't 100% accurate, the
+value measured by Xen will vary across reboots. This also means several
+otherwise identical systems can have a slightly different _measured_ TSC
+frequency. As a result TSC access will be emulated if a domU is migrated
+from one host to another, identical host. To avoid the performance
+impact of TSC emulation a certain tolerance of the measured host TSC
+frequency is allowed by Xen. If the measured "cpu_khz" value is within
+the tolerance range, TSC access remains native. Otherwise it will be
+emulated. This allows migrating domUs between identical hardware. If
+the domU will be migrated to a different kind of hardware, say from a
+"2.3GHz" to a "2.5GHz" system, TSC will be emulated to maintain the TSC
+frequency expected by the domU.
+
 Finally, tsc_mode==1 always enables TSC emulation, regardless of
 the underlying physical hardware. The "apparent" TSC frequency will
 be the TSC frequency of the initial physical machine, even after migration.
diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
index 9a6ea8ffcb..5cc38ed34d 100644
--- a/xen/arch/x86/time.c
+++ b/xen/arch/x86/time.c
@@ -43,6 +43,23 @@ static char __initdata opt_clocksource[10];
 string_param("clocksource", opt_clocksource);
 
 unsigned long __read_mostly cpu_khz;  /* CPU clock frequency in kHz. */
+/*
+ * NTP implementations running within the domU can handle a certain
+ * difference of the system clockspeed, compared to an external
+ * clocksource. This is usually described as "drift". How much drift an
+ * OS can handle is described in its documentation. For NTP in Linux the
+ * value is 500 PPM, which is the lowest compared to other OS.
+ */
+#define VTSC_NTP_PPM_TOLERANCE 500UL
+/*
+ * The measurement of cpu_khz is not accurate. Its accuracy depends on the
+ * hardware. A bunch of systems with supposedly identical frequencies will
+ * measure different frequencies, which will also vary across reboots.
+ * This value tries to cover a range of frequencies seen in the wild.
+ * The range is subtracted from the PPM value above.
+ */
+#define VTSC_MEASUREMENT_INACCURACY_RANGE_KHZ 200UL
+static unsigned int __read_mostly vtsc_tolerance_khz;
 DEFINE_SPINLOCK(rtc_lock);
 unsigned long pit0_ticks;
 
@@ -1885,6 +1902,27 @@ void __init early_time_init(void)
     printk("Detected %lu.%03lu MHz processor.\n", 
            cpu_khz / 1000, cpu_khz % 1000);
 
+    /*
+     * How many kHz (in other words: drift) is ntpd in domU expected to handle?
+     *    freq    tolerated freq difference
+     *  ------- = -------------------------
+     *  Million         Million + PPM      
+     */
+    tmp = 1000 * 1000;
+    tmp += VTSC_NTP_PPM_TOLERANCE;
+    tmp *= cpu_khz;
+    tmp /= 1000 * 1000;
+
+    tmp -= cpu_khz;
+
+    /*
+     * Reduce the theoretical upper limit by the assumed measuring inaccuracy.
+     */
+    if ( tmp >= VTSC_MEASUREMENT_INACCURACY_RANGE_KHZ )
+        tmp -= VTSC_MEASUREMENT_INACCURACY_RANGE_KHZ;
+    vtsc_tolerance_khz = tmp;
+    printk("Tolerating vtsc jitter for domUs: %u kHz\n", vtsc_tolerance_khz);
+
     setup_irq(0, 0, &irq0);
 }
 
@@ -2193,6 +2231,8 @@ int tsc_set_info(struct domain *d,
 
     switch ( tsc_mode )
     {
+        bool disable_vtsc;
+
     case TSC_MODE_DEFAULT:
     case TSC_MODE_ALWAYS_EMULATE:
         d->arch.vtsc_offset = get_s_time() - elapsed_nsec;
@@ -2201,13 +2241,30 @@ int tsc_set_info(struct domain *d,
 
         /*
          * In default mode use native TSC if the host has safe TSC and
-         * host and guest frequencies are the same (either "naturally" or
-         * - for HVM/PVH - via TSC scaling).
+         * host and guest frequencies are (almost) the same (either "naturally"
+         * or - for HVM/PVH - via TSC scaling).
          * When a guest is created, gtsc_khz is passed in as zero, making
          * d->arch.tsc_khz == cpu_khz. Thus no need to check incarnation.
          */
+        disable_vtsc = d->arch.tsc_khz == cpu_khz;
+
+        if ( tsc_mode == TSC_MODE_DEFAULT && gtsc_khz && vtsc_tolerance_khz )
+        {
+            long khz_diff;
+
+            khz_diff = ABS(((long)cpu_khz - gtsc_khz));
+            disable_vtsc = khz_diff <= vtsc_tolerance_khz;
+
+            printk(XENLOG_G_INFO "d%d: host has %lu kHz,"
+                   " domU expects %u kHz,"
+                   " difference of %ld is %s tolerance of %u\n",
+                   d->domain_id, cpu_khz, gtsc_khz, khz_diff,
+                   disable_vtsc ? "within" : "outside",
+                   vtsc_tolerance_khz);
+        }
+
         if ( tsc_mode == TSC_MODE_DEFAULT && host_tsc_is_safe() &&
-             (d->arch.tsc_khz == cpu_khz ||
+             (disable_vtsc ||
               (is_hvm_domain(d) &&
                hvm_get_tsc_scaling_ratio(d->arch.tsc_khz))) )
         {

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 

