[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH RFC 0/6] x86/time: PVCLOCK_TSC_STABLE_BIT support

On 12/29/2015 02:58 PM, Andrew Cooper wrote:
> On 28/12/2015 16:59, Joao Martins wrote:
>> Hey!
>> I've been working on pvclock vdso support on Linux guests, and came
>> across Xen lacking support for PVCLOCK_TSC_STABLE_BIT flag which is
>> required for vdso as of the latest pvclock ABI shared with KVM.
> , and originally borrowed from Xen.
> Please be aware that all of this was originally the Xen ABI (c/s 
> 1271b793, 2005) and was added to Linux (c/s 7af192c, 2008) for join use 
> with KVM.  In particular, Linux c/s 424c32f1a (which introduces 'flags') 
> and every subsequent change in pvclock-abi.h violates the comment at the 
> top, reminding people that the hypervisors must be kept in sync.
OK, I indeed was aware that this came originally from Xen. Should perhaps
include a comment similar to what you state above?

> By the looks of things, the structures are still compatible, and having 
> the two in sync is in everyones best interest. The first steps here need 
> to be Linux upstreaming their local modifications, and further efforts 
> made to ensuring that ABI changes don't go unnoticed as far as Xen is 
> concerned (entry in the maintainers file with xen-devel listed?)
Good point. Fortunately the changed part was the padding region (which was
always zeroed out on update_vcpu_system_time) so compatibility was kept.

>> In addition, I've found some problems which aren't necessarily visible
>> to kernel as the pvclock algorithm in linux keeps the highest pvti
>> time read among all cpus. But as is, a process using vdso gettimeofday
>> observes a significant amount of warps (i.e. time going backwards) and
>> it could be due to 1) time calibration skew in xen rendezvous
>> algorithm, 2) xen clocksource not in sync with TSC and 3) in
>> situations when guests unaware of VCPUS moving to a different PCPU.
>> The warps are seen more frequently on PV guests (potentially because
>> vcpu time infos are only updated when guest is in kernel mode, and
>> perhaps lack of tsc_offset?), and in rare ocasions on HVM guests. And
>> it is worth noting that with guests VCPUs pinned, only PV guests see
>> these warps. But on HVM guests specifically: such warps only occur
>> when one of guest VCPUs is pinned to CPU0.
> These are all concerning findings (especially the pinning on cpu0). 
> Which version of Xen have you been developing on?
When I started working on this it was based on xen 4.5 (linux <= 4.4), but
lately (~ 1 month) I have been working on xen-unstable.

> Haozhong Zhang (CC'd) found and fixed several timing related bugs as 
> part of his VMX TSC Scaling support series (Message root at 
> <1449435529-12989-1-git-send-email-haozhong.zhang@xxxxxxxxx>). I would 
> be surprised if your identified bugs and his identified bugs didn't at 
> least partially overlap.  (Note that most of the series has yet to be 
> applied).
I am afraid this might be a different bug: I have been basing my patches on top
of his work and also testing with a non-native vtsc_khz. (as part of the testing
requested from Boris). But I think these sort of issues might not affect in
kernel since pvclock in linux takes care of ensuring monotonicity even when
pvti's aren't guaranteed to be.

> As to the specific options you identify, the time calibration rendezvous 
> is (or should be) a nop on modern boxes with constant/invarient/etc 
> TSCs.  I have a low priority TODO item to investigating conditionally 
> disabling it, as it is a scalability concern on larger boxes.  Option 2 
> seems more likely.
Hm, from what I understood it's not a nop: it's just that there is a different
rendezvous function depending on whether TSC is safe or not. And I believe there
to be where it lies the problem. The tsc_timestamp written in the pvti that is
used to calculate the delta on the guest takes into account the time that the
slave CPU rendezvous with master, thus could mark the TSC at a later instant
depending on the CPU. And this would lead to a situation such as: CPU 3 writing
an earlier tsc than say CPU 2 (thus leading to a smaller delta between both as
seen by the guest, even though TSC is guaranteed to be monotonic) and guest
would see a warp when doing a read on CPU 3 first right before CPU 2 and see
time going backwards. Patch 5 also extends on that and I followed a similar
approach to KVM.

But you suggestion of basically being a nop could makes things simpler (and
perhaps cleaner?), and eliminates potential scalability issues when host has a
large number of CPUs: on a system with a safe TSC every EPOCH it would read
platform time and "request" the vCPU to update the pvti next time it's
schedule() or continue_running() is called? This way we could remove the CPUS
spinning for master to write stime and tsc_timestamp.

> As for option 3, on a modern system it shouldn't make any difference.  
> On an older system, it can presumably only be fixed by a guest 
> performing its rdtsc between the two version checks, to ensure that it 
> sees a consistent timestamp and scale, along with the hypervisor bumping 
> version on deschedule, and against on schedule. However, this might be 
> contrary to proposed plan to have one global pv wallclock.
Indeed, and vdso gettimeofday/clock_gettime already do that i.e. versions checks
around rdtsc/pvti reads.

>> PVCLOCK_TSC_STABLE_BIT is the flag telling the guest that the
>> vcpu_time_info (pvti) are monotonic as seen by any CPU,
>> and supporting it could help fixing the issue mentioned above.
> Surely fixing some of these bugs are prerequisites to supporting 
> TSC_STABLE_BIT ?  Either way, we should see about doing both.
I think so too.

>>   This
>> series aims to propose a solution to that and it's divided as
>> following:
>>      * Patch 1: Adds the missing flag field to vcpu_time_info.
>>      * Patch 2: Adds a new clocksource based on TSC
>>      * Patch 3, 4: Cleanups for patch 5
>>      * Patch 5: Implements the PVCLOCK_TSC_STABLE_BIT.
>>      * Patch 6: Same as 5 before but for other platform timers
>> I have some doubts on the correctness of Patch 6 but was the only
>> solution I found for supporting PVCLOCK_TSC_STABLE_BIT when using
>> other clocksources (i.e. != tsc). The test was running time-warp-test,
>> that constantly calls clock_gettime/gettimeofday on every CPU. It
>> measures a delta with the previous returned value and mark a warp if
>> it's negative. I measured it during periods of 1h and 6h and check how
>> many warps and their values (alongside the amount of time skew).
>> Measurements are included in individual patches.
>> Note that most of the testing has been done with Linux 4.4 in which
>> these warps/skew could be easily observed as vdso would use each vCPU
>> pvti. Though Linux 4.5 will change this behaviour and use only the
>> vCPU0 pvti though still requiring PVCLOCK_TSC_STABLE_BIT usage.
>> I've been using it for a while in the past weeks with no issues so far
>> though still requires testing on bad TSC machines. But I would like to
>> get folks comments/suggestions if this is the correct approach in
>> implementing this flag.
> If you have some test instructions, I can probably find some bad TSC 
> machines amongst the selection of older hardware in the XenServer test pool.
Thanks! I have a bad TSC machine for testing already, and these tests would be
to make sure there aren't regressions, since with a bad TSC
PVCLOCK_TSC_STABLE_BIT wouldn't be set. Most of my tests were on a Broadwell CPU
single socket.


> ~Andrew

Xen-devel mailing list



Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.