Xen project Mailing List

Re: [Xen-devel] [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, Jan Beulich <jbeulich@xxxxxxxx>

From: Joao Martins <joao.m.martins@xxxxxxxxxx>

Date: Mon, 22 Feb 2016 21:14:14 +0000

Cc: Haozhong Zhang <haozhong.zhang@xxxxxxxxx>, Keir Fraser <keir@xxxxxxx>, xen-devel@xxxxxxxxxxxxx

Delivery-date: Mon, 22 Feb 2016 21:14:48 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On 12/29/2015 05:37 PM, Joao Martins wrote:
>
>
> On 12/29/2015 02:58 PM, Andrew Cooper wrote:
>> On 28/12/2015 16:59, Joao Martins wrote:
>>> Hey!
>>>
>>> I've been working on pvclock vdso support on Linux guests, and came
>>> across Xen lacking support for PVCLOCK_TSC_STABLE_BIT flag which is
>>> required for vdso as of the latest pvclock ABI shared with KVM.
>>
>> , and originally borrowed from Xen.
>>
>> Please be aware that all of this was originally the Xen ABI (c/s
>> 1271b793, 2005) and was added to Linux (c/s 7af192c, 2008) for joint use
>> with KVM. In particular, Linux c/s 424c32f1a (which introduces 'flags')
>> and every subsequent change in pvclock-abi.h violates the comment at the
>> top, reminding people that the hypervisors must be kept in sync.
>>
> OK, I indeed was aware that this came originally from Xen. Should I perhaps
> include a comment similar to what you state above?
>
>> By the looks of things, the structures are still compatible, and having
>> the two in sync is in everyone's best interest. The first steps here need
>> to be Linux upstreaming their local modifications, and further efforts
>> made to ensure that ABI changes don't go unnoticed as far as Xen is
>> concerned (entry in the maintainers file with xen-devel listed?)
> Good point. Fortunately the changed part was the padding region (which was
> always zeroed out on update_vcpu_system_time) so compatibility was kept.
>
>>
>>> In addition, I've found some problems which aren't necessarily visible
>>> to kernel as the pvclock algorithm in linux keeps the highest pvti
>>> time read among all cpus. But as is, a process using vdso gettimeofday
>>> observes a significant amount of warps (i.e. time going backwards) and
>>> it could be due to 1) time calibration skew in xen rendezvous
>>> algorithm, 2) xen clocksource not in sync with TSC and 3)
>>> situations where guests are unaware of VCPUS moving to a different PCPU.
>>> The warps are seen more frequently on PV guests (potentially because
>>> vcpu time infos are only updated when guest is in kernel mode, and
>>> perhaps lack of tsc_offset?), and on rare occasions on HVM guests. And
>>> it is worth noting that with guests VCPUs pinned, only PV guests see
>>> these warps. But on HVM guests specifically: such warps only occur
>>> when one of guest VCPUs is pinned to CPU0.
>>
>> These are all concerning findings (especially the pinning on cpu0).
>> Which version of Xen have you been developing on?
>>
> When I started working on this it was based on xen 4.5 (linux <= 4.4), but
> lately (~ 1 month) I have been working on xen-unstable.
>
>> Haozhong Zhang (CC'd) found and fixed several timing related bugs as
>> part of his VMX TSC Scaling support series (Message root at
>> <1449435529-12989-1-git-send-email-haozhong.zhang@xxxxxxxxx>). I would
>> be surprised if your identified bugs and his identified bugs didn't at
>> least partially overlap. (Note that most of the series has yet to be
>> applied).
> I am afraid this might be a different bug: I have been basing my patches on
> top of his work and also testing with a non-native vtsc_khz (as part of the
> testing requested from Boris). But I think these sorts of issues might not
> affect the kernel since pvclock in linux takes care of ensuring monotonicity
> even when pvti's aren't guaranteed to be.
>
>>
>> As to the specific options you identify, the time calibration rendezvous
>> is (or should be) a nop on modern boxes with constant/invariant/etc
>> TSCs. I have a low priority TODO item to investigate conditionally
>> disabling it, as it is a scalability concern on larger boxes. Option 2
>> seems more likely.
> Hm, from what I understood it's not a nop: it's just that there is a different
> rendezvous function depending on whether TSC is safe or not. And I believe
> that is where the problem lies.
> The tsc_timestamp written in the pvti that is
> used to calculate the delta on the guest takes into account the time that the
> slave CPU rendezvous with master, thus could mark the TSC at a later instant
> depending on the CPU. And this would lead to a situation such as: CPU 3
> writing an earlier tsc than say CPU 2 (thus leading to a smaller delta between
> both as seen by the guest, even though TSC is guaranteed to be monotonic) and
> guest would see a warp when doing a read on CPU 3 first right before CPU 2 and
> see time going backwards. Patch 5 also extends on that and I followed a similar
> approach to KVM.
>
> But your suggestion of basically being a nop could make things simpler (and
> perhaps cleaner?), and eliminates potential scalability issues when host has a
> large number of CPUs: on a system with a safe TSC every EPOCH it would read
> platform time and "request" the vCPU to update the pvti next time
> schedule() or continue_running() is called? This way we could remove the CPUS
> spinning for master to write stime and tsc_timestamp.
>
>>
>> As for option 3, on a modern system it shouldn't make any difference.
>> On an older system, it can presumably only be fixed by a guest
>> performing its rdtsc between the two version checks, to ensure that it
>> sees a consistent timestamp and scale, along with the hypervisor bumping
>> version on deschedule, and again on schedule. However, this might be
>> contrary to proposed plan to have one global pv wallclock.
>>
> Indeed, and vdso gettimeofday/clock_gettime already do that, i.e. version
> checks around rdtsc/pvti reads.
>>>
>>> PVCLOCK_TSC_STABLE_BIT is the flag telling the guest that the
>>> vcpu_time_info (pvti) are monotonic as seen by any CPU,
>>> and supporting it could help fixing the issue mentioned above.
>>
>> Surely fixing some of these bugs are prerequisites to supporting
>> TSC_STABLE_BIT ? Either way, we should see about doing both.
>>
> I think so too.
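A minimal sketch of the guest-side read discussed above, assuming the pvti field layout from Linux's pvclock ABI header: the TSC is sampled between two version checks, and the result is clamped against a global last value unless PVCLOCK_TSC_STABLE_BIT is set. The names pvti_sketch, fake_tsc, read_tsc, scale_delta, pvti_read_ns and monotonic_ns are hypothetical, and read_tsc is only a stand-in for rdtsc so the sketch builds anywhere. This is not the actual Linux or Xen code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define PVCLOCK_TSC_STABLE_BIT (1 << 0)

/* Simplified copy of the per-vCPU time info; the struct name is hypothetical. */
struct pvti_sketch {
    uint32_t version;            /* odd while the hypervisor is mid-update */
    uint32_t pad0;
    uint64_t tsc_timestamp;      /* TSC value at the last update */
    uint64_t system_time;        /* nanoseconds at the last update */
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point TSC -> ns multiplier */
    int8_t   tsc_shift;          /* pre-shift applied to the TSC delta */
    uint8_t  flags;              /* e.g. PVCLOCK_TSC_STABLE_BIT */
    uint8_t  pad[2];
};
static_assert(sizeof(struct pvti_sketch) == 32, "unexpected pvti layout");

/* Stand-in for rdtsc so the sketch builds on any machine (assumption). */
static uint64_t fake_tsc;
static uint64_t read_tsc(void) { return fake_tsc; }

/* ns = ((delta shifted by tsc_shift) * mul) >> 32, without 128-bit math. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (delta >> 32) * mul + (((delta & 0xffffffffu) * mul) >> 32);
}

/* Read one pvti: retry if an update was in progress or happened meanwhile. */
static uint64_t pvti_read_ns(const volatile struct pvti_sketch *ti,
                             uint8_t *flags_out)
{
    uint32_t v;
    uint64_t ns;

    do {
        v = ti->version;
        atomic_thread_fence(memory_order_acquire);
        /* The TSC read sits between the two version checks. */
        uint64_t delta = read_tsc() - ti->tsc_timestamp;
        ns = ti->system_time +
             scale_delta(delta, ti->tsc_to_system_mul, ti->tsc_shift);
        *flags_out = ti->flags;
        atomic_thread_fence(memory_order_acquire);
    } while ((v & 1) || v != ti->version);

    return ns;
}

/* Without the stable bit, never return less than the highest value seen. */
static _Atomic uint64_t last_ns;

static uint64_t monotonic_ns(uint64_t ns, uint8_t flags)
{
    if (flags & PVCLOCK_TSC_STABLE_BIT)
        return ns;               /* pvti promised monotonic across CPUs */

    uint64_t last = atomic_load(&last_ns);
    while (ns > last) {
        if (atomic_compare_exchange_weak(&last_ns, &last, ns))
            return ns;
    }
    return last;
}
```

A caller would chain the two, e.g. `ns = monotonic_ns(pvti_read_ns(ti, &flags), flags);`. The global clamp needs shared writable state, which is presumably why a vdso reader depends on the stable bit instead.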
>
>>> This
>>> series aims to propose a solution to that and it's divided as
>>> follows:
>>>
>>> * Patch 1: Adds the missing flag field to vcpu_time_info.
>>> * Patch 2: Adds a new clocksource based on TSC
>>> * Patch 3, 4: Cleanups for patch 5
>>> * Patch 5: Implements the PVCLOCK_TSC_STABLE_BIT.
>>> * Patch 6: Same as 5 before but for other platform timers
>>>
>>> I have some doubts on the correctness of Patch 6 but was the only
>>> solution I found for supporting PVCLOCK_TSC_STABLE_BIT when using
>>> other clocksources (i.e. != tsc). The test was running time-warp-test,
>>> that constantly calls clock_gettime/gettimeofday on every CPU. It
>>> measures a delta with the previous returned value and marks a warp if
>>> it's negative. I measured it during periods of 1h and 6h and checked how
>>> many warps and their values (alongside the amount of time skew).
>>> Measurements are included in individual patches.
>>>
>>> Note that most of the testing has been done with Linux 4.4 in which
>>> these warps/skew could be easily observed as vdso would use each vCPU
>>> pvti. Though Linux 4.5 will change this behaviour and use only the
>>> vCPU0 pvti though still requiring PVCLOCK_TSC_STABLE_BIT usage.
>>>
>>> I've been using it for a while in the past weeks with no issues so far
>>> though still requires testing on bad TSC machines. But I would like to
>>> get folks' comments/suggestions if this is the correct approach in
>>> implementing this flag.
>>
>> If you have some test instructions, I can probably find some bad TSC
>> machines amongst the selection of older hardware in the XenServer test pool.
> Thanks! I have a bad TSC machine for testing already, and these tests would be
> to make sure there aren't regressions, since with a bad TSC
> PVCLOCK_TSC_STABLE_BIT wouldn't be set. Most of my tests were on a Broadwell
> CPU single socket.

Ping: Or would you prefer resubmitting with the comments so far?
I just didn't do so because I didn't get comments on the main parts of this
series (Patch 2,5,6).

Thanks, and sorry for the noise!
Joao

>>
>> ~Andrew
>>

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
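The warp check described in the cover letter (call clock_gettime repeatedly, take the delta against the previously returned value, count it as a warp when negative) can be sketched roughly as below. This is a single-threaded simplification and not the actual time-warp-test source; warp_stats, now_ns and warp_check are hypothetical names, and the choice of CLOCK_MONOTONIC is an assumption.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Running totals for one observer; all names here are hypothetical. */
struct warp_stats {
    int      have_prev;  /* set once the first sample has been recorded */
    uint64_t prev_ns;    /* previously returned timestamp */
    uint64_t warps;      /* number of times time went backwards */
    uint64_t worst_ns;   /* largest backwards jump seen */
};

/* Current time in nanoseconds via clock_gettime (vdso-backed on Linux). */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Compare a new sample with the previous one and record any warp. */
static void warp_check(struct warp_stats *st, uint64_t now)
{
    assert(st != NULL);

    if (st->have_prev && now < st->prev_ns) {
        uint64_t back = st->prev_ns - now;   /* size of the negative delta */

        st->warps++;
        if (back > st->worst_ns)
            st->worst_ns = back;
    }
    st->prev_ns = now;
    st->have_prev = 1;
}
```

A test loop would call `warp_check(&st, now_ns())` for the chosen duration. To catch cross-CPU warps like the ones reported here, one such loop would need to run per CPU against a shared previous value, which this sketch leaves out.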
