
RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy



Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option
for hvm guests (let's call it hwhpet=1 for shorthand), I am very
surprised by your preliminary results; the most obvious conclusion
is that Xen system time is losing time at the rate of 1000 PPM,
though it's possible there's a bug somewhere else in the "time
stack".  Your Windows result is jaw-dropping and inexplicable,
though I have to admit ignorance of how Windows manages time.
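
For reference, the arithmetic behind the figures in this thread can be
written down as a minimal Python sketch. The function names are
illustrative only and are not part of any Xen code:

```python
SECONDS_PER_DAY = 24 * 60 * 60

def percent_to_ppm(percent):
    """Convert a clock error given in percent to parts per million."""
    return percent * 10_000

def drift_seconds_per_day(percent):
    """Seconds gained or lost per day at the given percent error."""
    return SECONDS_PER_DAY * percent / 100

# Figures quoted in this thread: the .1% error seen on xen-unstable,
# the .05% ntp limit, and the .0012% measured with the hpet patch.
for label, pct in [("xen-unstable hpet (Linux)", 0.1),
                   ("ntp limit", 0.05),
                   ("hpet patch (Linux)", 0.0012)]:
    print(f"{label}: {percent_to_ppm(pct):.0f} PPM, "
          f"{drift_seconds_per_day(pct):.2f} s/day")
```

So a .1% error is 1000 PPM, twice the 500 PPM that ntp can correct.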

I think with my recent patch and hpet=1 (essentially the same as
your emulated hpet), hvm guest time should track Xen system time.
I wonder if domain0 (which if I understand correctly is directly
using Xen system time) is also seeing an error of .1%?  Also,
I wonder, for the skew you are seeing (in both hvm guests and
domain0), is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may
be unavailable on some systems and may cause significant performance
issues on others.  So I think we will still need to track down
the poor accuracy when hwhpet=0.  And if for some reason
Xen system time can't be made accurate enough (< 0.05%), then
I think we should consider building Xen system time itself on
top of hardware hpet instead of TSC... at least when Xen discovers
a capable hpet.

One more thought... do you know the accuracy of the TSC crystals
on your test systems?  I posted a patch a while ago that was
intended to test that, though I guess it was only testing skew
of different TSCs on the same system, not TSCs against an
external time source.

Or maybe there's a computation error somewhere in the hvm hpet
scaling code?  Hmmm...

Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> 
> 
> Dan, Keir:
> 
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%.
> Tests on the patch we just submitted for hpet have indicated errors of
> .0012% on this platform under similar test conditions and .03% on
> other platforms.
> 
> Windows vista64 has an error of 11% using hpet with the 
> xen-unstable bits.
> In an overnight test with our hpet patch, the Windows vista 
> error was .008%.
> 
> The tests are with two or three guests on a physical node, all under
> load, and with the ratio of vcpus to phys cpus > 1.
> 
> I will continue to run tests over the next few days.
> 
> thanks,
> Dave
> 
> 
> Dan Magenheimer wrote:
> 
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please ensure
> > that hpet=1 is set in the hvm config. Also, I think that when hpet
> > is the clocksource on RHEL4-32, the clock IS resilient to missed ticks,
> > so timer_mode should be 2 (whereas when pit is the clocksource on
> > RHEL4-32, all clock ticks must be delivered and so timer_mode should
> > be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> >     -----Original Message-----
> >     *From:* xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> >     [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] *On Behalf Of* Dave
> >     Winchell
> >     *Sent:* Friday, June 06, 2008 4:46 AM
> >     *To:* Keir Fraser; Ben Guthro; xen-devel
> >     *Cc:* dan.magenheimer@xxxxxxxxxx; Dave Winchell
> >     *Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Keir,
> >
> >     I think the changes are required. We'll run some tests today so
> >     that we have some data to talk about.
> >
> >     -Dave
> >
> >
> >     -----Original Message-----
> >     From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Keir Fraser
> >     Sent: Fri 6/6/2008 4:58 AM
> >     To: Ben Guthro; xen-devel
> >     Cc: dan.magenheimer@xxxxxxxxxx
> >     Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> >     Are these patches needed now the timers are built on Xen system
> >     time rather than host TSC? Dan has reported much better
> >     time-keeping with his patch checked in, and it's for sure a lot
> >     less invasive than this patchset.
> >
> >
> >      -- Keir
> >
> >     On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:
> >
> >     >
> >     > 1. Introduction
> >     >
> >     > This patch improves the hpet based guest clock in terms of drift and
> >     > monotonicity. Prior to this work the drift with hpet was greater than
> >     > 2%, far above the .05% limit for ntp to synchronize. With this code,
> >     > the drift ranges from .001% to .0033% depending on guest and physical
> >     > platform.
> >     >
> >     > Using hpet allows guest operating systems to provide monotonic
> >     time to their
> >     > applications. Time sources other than hpet are not 
> monotonic because
> >     > of their reliance on tsc, which is not synchronized 
> across physical
> >     > processors.
> >     >
> >     > Windows 2k864 and many Linux guests are supported with two
> >     policies, one for
> >     > guests
> >     > that handle missed clock interrupts and the other for guests
> >     that require the
> >     > correct number of interrupts.
> >     >
> >     > Guests may use hpet for the timing source even if the physical
> >     platform has no
> >     > visible
> >     > hpet. Migration is supported between physical machines which
> >     differ in
> >     > physical
> >     > hpet visibility.
> >     >
> >     > Most of the changes are in hpet.c. Two general facilities are
> >     added to track
> >     > interrupt
> >     > progress. The ideas here and the facilities would be useful in
> >     vpt.c, for
> >     > other time
> >     > sources, though no attempt is made here to improve vpt.c.
> >     >
> >     > The following sections discuss hpet dependencies, interrupt
> >     delivery policies,
> >     > live migration,
> >     > test results, and relation to recent work with monotonic time.
> >     >
> >     >
> >     > 2. Virtual Hpet dependencies
> >     >
> >     > The virtual hpet depends on the ability to read the 
> physical or
> >     simulated
> >     > (see discussion below) hpet.  For timekeeping, the 
> virtual hpet
> >     also depends
> >     > on two new interrupt notification facilities to implement its
> >     policies for
> >     > interrupt delivery.
> >     >
> >     > 2.1. Two modes of low-level hpet main counter reads.
> >     >
> >     > In this implementation, the virtual hpet reads with
> >     read_64_main_counter(),
> >     > exported by
> >     > time.c, either the real physical hpet main counter register
> >     directly or a
> >     > "simulated"
> >     > hpet main counter.
> >     >
> >     > The simulated mode uses a monotonic version of get_s_time() (NOW()),
> >     > where the last time value is returned whenever the current time value
> >     > is less than the last time value. In simulated mode, since it is
> >     > layered on s_time, the underlying hardware can be hpet or some other
> >     > device. The frequency of the main counter in simulated mode is the
> >     > same as the standard physical hpet frequency, allowing live migration
> >     > between nodes that are configured differently.
> >     >
> >     > If the physical platform does not have an hpet 
> device, or if xen
> >     is configured
> >     > not
> >     > to use the device, then the simulated method is used. If there
> >     is a physical
> >     > hpet device,
> >     > and xen has initialized it, then either simulated or physical
> >     mode can be
> >     > used.
> >     > This is governed by a boot time option, hpet-avoid. 
> Setting this
> >     option to 1
> >     > gives the
> >     > simulated mode and 0 the physical mode. The default 
> is physical
> >     mode.
> >     >
> >     > A disadvantage of the physical mode is that it may take longer to
> >     > read the device than in simulated mode. On some platforms the cost is
> >     > about the same (less than 250 nsec) for physical and simulated modes,
> >     > while on others physical cost is much higher than simulated.
> >     > A disadvantage of the simulated mode is that it can return the same
> >     > value for the counter in consecutive calls.
> >     >
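
A minimal sketch of the clamping that section 2.1 describes for the
simulated mode, in Python for brevity. The class, its parameters, and the
integer tick length are hypothetical; this is not the patch's
read_64_main_counter(), and it ignores locking between cpus:

```python
class SimulatedMainCounter:
    """Toy model of a monotonic counter layered on a system time source."""

    def __init__(self, read_system_time_ns, ns_per_tick):
        # read_system_time_ns: callable returning system time in ns
        # (stands in for get_s_time()); ns_per_tick: assumed tick length.
        self._read_ns = read_system_time_ns
        self._ns_per_tick = ns_per_tick
        self._last_ns = 0

    def read(self):
        now = self._read_ns()
        # If time appears to step backwards, return the last value instead.
        if now < self._last_ns:
            now = self._last_ns
        self._last_ns = now
        return now // self._ns_per_tick


# The third sample steps backwards, so the counter repeats its last value.
samples = iter([100, 250, 240, 400])
counter = SimulatedMainCounter(lambda: next(samples), ns_per_tick=10)
print([counter.read() for _ in range(4)])  # [10, 25, 25, 40]
```

The repeated 25 is the disadvantage noted above: consecutive reads can
return the same value.
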
> >     > 2.2. Interrupt notification facilities.
> >     >
> >     > Two interrupt notification facilities are introduced, one is
> >     > hvm_isa_irq_assert_cb()
> >     > and the other hvm_register_intr_en_notif().
> >     >
> >     > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
> >     the vioapic.
> >     > hvm_isa_irq_assert_cb allows a callback to be passed along to
> >     > vioapic_deliver()
> >     > and this callback is called with a mask of the vcpus 
> which will
> >     get the
> >     > interrupt. This callback is made before any vcpus receive an
> >     interrupt.
> >     >
> >     > Vhpet uses hvm_register_intr_en_notif() to register a handler
> >     for a particular
> >     > vector that will be called when that vector is injected in
> >     > [vmx,svm]_intr_assist()
> >     > and also when the guest finishes handling the interrupt. Here
> >     finished is
> >     > defined
> >     > as the point when the guest re-enables interrupts or 
> lowers the
> >     tpr value.
> >     > EOI is not used as the end of interrupt as this is sometimes
> >     returned before
> >     > the interrupt handler has done its work. A flag is 
> passed to the
> >     handler
> >     > indicating
> >     > whether this is the injection point (post = 1) or the 
> interrupt
> >     finished (post
> >     > = 0) point.
> >     > The need for the finished point callback is discussed in the
> >     missed ticks
> >     > policy section.
> >     >
> >     > To prevent a possible early trigger of the finished callback,
> >     intr_en_notif
> >     > logic
> >     > has a two stage arm, the first at injection
> >     (hvm_intr_en_notif_arm()) and the
> >     > second when
> >     > interrupts are seen to be disabled 
> (hvm_intr_en_notif_disarm()).
> >     Once fully
> >     > armed, re-enabling
> >     > interrupts will cause hvm_intr_en_notif_disarm() to 
> make the end
> >     of interrupt
> >     > callback. hvm_intr_en_notif_arm() and 
> hvm_intr_en_notif_disarm()
> >     are called by
> >     > [vmx,svm]_intr_assist().
> >     >
> >     > 3. Interrupt delivery policies
> >     >
> >     > The existing hpet interrupt delivery is preserved. 
> This includes
> >     > vcpu round robin delivery used by Linux and broadcast delivery
> >     used by
> >     > Windows.
> >     >
> >     > There are two policies for interrupt delivery, one for Windows
> >     2k8-64 and the
> >     > other
> >     > for Linux. The Linux policy takes advantage of the 
> (guest) Linux
> >     missed tick
> >     > and offset
> >     > calculations and does not attempt to deliver the 
> right number of
> >     interrupts.
> >     > The Windows policy delivers the correct number of interrupts,
> >     even if
> >     > sometimes much
> >     > closer to each other than the period. The policies are similar
> >     to those in
> >     > vpt.c, though
> >     > there are some important differences.
> >     >
> >     > Policies are selected with an HVMOP_set_param 
> hypercall with index
> >     > HVM_PARAM_TIMER_MODE.
> >     > Two new values are added, 
> HVM_HPET_guest_computes_missed_ticks and
> >     > HVM_HPET_guest_does_not_compute_missed_ticks.  The reason that
> >     two new ones
> >     > are added is that
> >     > in some guests (32bit Linux) a no-missed policy is needed for
> >     clock sources
> >     > other than hpet
> >     > and a missed ticks policy for hpet. It was felt that 
> there would
> >     be less
> >     > confusion by simply
> >     > introducing the two hpet policies.
> >     >
> >     > 3.1. The missed ticks policy
> >     >
> >     > The Linux clock interrupt handler for hpet calculates missed
> >     ticks and offset
> >     > using the hpet
> >     > main counter. The algorithm works well when the time since the
> >     last interrupt
> >     > is greater than
> >     > or equal to a period and poorly otherwise.
> >     >
> >     > The missed ticks policy ensures that no two clock interrupts are
> >     > delivered to the guest at a time interval less than a period. A time
> >     > stamp (hpet main counter value) is recorded (by a callback registered
> >     > with hvm_register_intr_en_notif) when Linux finishes handling the
> >     > clock interrupt. Then, ensuing interrupts are delivered to the vioapic
> >     > only if the current main counter value is at least a period greater
> >     > than the value recorded when the last interrupt was handled.
> >     >
> >     > Tests showed a significant improvement in clock drift with end of
> >     > interrupt time stamps versus beginning of interrupt[1]. It is believed
> >     > that the reason for the improvement is that the clock interrupt
> >     > handler goes for a spinlock and can therefore be delayed in its
> >     > processing. Furthermore, the main counter is read by the guest under
> >     > the lock. The net effect is that if we time stamp injection, we can
> >     > get the difference in time between successive interrupt handler lock
> >     > acquisitions to be less than the period.
> >     >
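
The delivery test in section 3.1 can be sketched as follows. This is a toy
Python model with hypothetical names, assuming only what the description
above states (a time stamp taken when the guest finishes the interrupt,
and a comparison against one period); it is not the code in hpet.c:

```python
class MissedTicksGate:
    """Toy model: deliver only when a full period has passed since the
    guest finished handling the previous clock interrupt."""

    def __init__(self, period):
        self.period = period        # timer period in main counter ticks
        self.last_finished = None   # counter value at last "finished" point

    def on_interrupt_finished(self, counter_now):
        # Models the post = 0 callback: stamp the end of interrupt handling.
        self.last_finished = counter_now

    def should_deliver(self, counter_now):
        if self.last_finished is None:
            return True             # nothing delivered yet
        return counter_now - self.last_finished >= self.period


gate = MissedTicksGate(period=100)
print(gate.should_deliver(0))       # first interrupt is delivered
gate.on_interrupt_finished(130)     # guest was slow: finished at 130
print(gate.should_deliver(200))     # only 70 ticks later: held back
print(gate.should_deliver(230))     # a full period later: delivered
```
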
> >     > 3.2. The no-missed ticks policy
> >     >
> >     > Windows 2k864 keeps very poor time with the missed 
> ticks policy.
> >     So the
> >     > no-missed ticks policy
> >     > was developed. In the no-missed ticks policy we deliver the
> >     correct number of
> >     > interrupts,
> >     > even if they are spaced less than a period apart 
> (when catching up).
> >     >
> >     > Windows 2k864 uses a broadcast mode in the interrupt routing such
> >     > that all vcpus get the clock interrupt. The best Windows drift
> >     > performance was achieved when the policy code ensured that all the
> >     > previous interrupts (on the various vcpus) had been injected before
> >     > injecting the next interrupt to the vioapic.
> >     >
> >     > The policy code works as follows. It uses the hvm_isa_irq_assert_cb()
> >     > to record the vcpus to be interrupted in h->hpet.pending_mask. Then,
> >     > in the callback registered with hvm_register_intr_en_notif() at post=1
> >     > time it clears the current vcpu in the pending_mask. When the
> >     > pending_mask is clear it decrements hpet.intr_pending_nr and if
> >     > intr_pending_nr is still non-zero posts another interrupt to the
> >     > ioapic with hvm_isa_irq_assert_cb(). Intr_pending_nr is incremented
> >     > in hpet_route_decision_not_missed_ticks().
> >     >
> >     > The missed ticks policy intr_en_notif callback also uses the
> >     pending_mask
> >     > method. So even though
> >     > Linux does not broadcast its interrupts, the code could handle
> >     it if it did.
> >     > In this case the end of interrupt time stamp is made when the
> >     pending_mask is
> >     > clear.
> >     >
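
The bookkeeping in section 3.2 can be modelled roughly as below. This is a
Python sketch under the assumption that intr_pending_nr counts ticks still
owed to the guest, including the one in flight; the class and method names
are hypothetical, and only pending_mask and intr_pending_nr are taken from
the description:

```python
class NoMissedTicksPolicy:
    """Toy model: every timer tick is eventually delivered, and a new
    interrupt is posted only after all vcpus took the previous one."""

    def __init__(self, vcpus):
        self.vcpus = frozenset(vcpus)
        self.pending_mask = set()   # vcpus that still have to be injected
        self.intr_pending_nr = 0    # ticks owed, including one in flight
        self.posted = 0             # interrupts posted to the modelled ioapic

    def _post(self):
        # Stands in for posting via hvm_isa_irq_assert_cb(): the callback
        # records which vcpus will receive this interrupt.
        self.pending_mask = set(self.vcpus)
        self.posted += 1

    def timer_tick(self):
        self.intr_pending_nr += 1
        if self.intr_pending_nr == 1:   # nothing in flight, post right away
            self._post()

    def on_injected(self, vcpu):
        # Models the post = 1 callback for one vcpu.
        if vcpu not in self.pending_mask:
            return
        self.pending_mask.discard(vcpu)
        if not self.pending_mask:       # every vcpu has been injected
            self.intr_pending_nr -= 1
            if self.intr_pending_nr > 0:
                self._post()            # catch up with the next owed tick


policy = NoMissedTicksPolicy(vcpus=[0, 1])
for _ in range(3):                      # three ticks arrive back to back
    policy.timer_tick()
for _ in range(3):                      # guest takes them one at a time
    policy.on_injected(0)
    policy.on_injected(1)
print(policy.posted, policy.intr_pending_nr)
```

In this model three ticks result in three posted interrupts, however
closely they are spaced while catching up.
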
> >     > 4. Live Migration
> >     >
> >     > Live migration with hpet preserves the current offset of the
> >     guest clock with
> >     > respect
> >     > to ntp. This is accomplished by migrating all of the state in
> >     the h->hpet data
> >     > structure
> >     > in the usual way. The hp->mc_offset is recalculated on the
> >     receiving node so
> >     > that the
> >     > guest sees a continuous hpet main counter.
> >     >
> >     > Code has been added to xc_domain_save.c to send a small message after
> >     > the domain context is sent. The content of the message is the
> >     > physical tsc timestamp, last_tsc, read just before the message is
> >     > sent. When the last_tsc message is received in xc_domain_restore.c,
> >     > another physical tsc timestamp, cur_tsc, is read. The two timestamps
> >     > are loaded into the domain structure as last_tsc_sender and
> >     > first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext
> >     > is called so that hpet_load has access to these time stamps.
> >     > Hpet_load uses the timestamps to account for the time spent saving
> >     > and loading the domain context. With this technique, the only
> >     > neglected time is the time spent sending a small network message.
> >     >
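
One way to read the accounting in section 4 is sketched below in Python.
This is a guess at the arithmetic, not the patch's hpet_load(): it assumes
that the tsc value at save time travels with the saved hpet state, that
each tsc difference is taken on a single machine, and that mc_offset is
added to the host counter to form the guest counter. All function and
parameter names other than last_tsc_sender, first_tsc_receiver, and
mc_offset are hypothetical:

```python
def tsc_to_hpet_ticks(tsc_delta, tsc_hz, hpet_hz):
    """Convert a tsc interval to main counter ticks (integer arithmetic)."""
    return tsc_delta * hpet_hz // tsc_hz


def recompute_mc_offset(saved_guest_mc, save_tsc, last_tsc_sender,
                        sender_tsc_hz, first_tsc_receiver, load_tsc,
                        receiver_tsc_hz, receiver_host_mc, hpet_hz):
    # Time spent saving, measured entirely with the sender's tsc.
    saving = tsc_to_hpet_ticks(last_tsc_sender - save_tsc,
                               sender_tsc_hz, hpet_hz)
    # Time spent loading, measured entirely with the receiver's tsc.
    loading = tsc_to_hpet_ticks(load_tsc - first_tsc_receiver,
                                receiver_tsc_hz, hpet_hz)
    # What the guest should read now; the transit time of the small
    # message between the two timestamps is the part that is neglected.
    guest_mc_now = saved_guest_mc + saving + loading
    # Assumed relation: guest counter = host counter + mc_offset.
    return guest_mc_now - receiver_host_mc


offset = recompute_mc_offset(
    saved_guest_mc=1000, save_tsc=0, last_tsc_sender=2000,
    sender_tsc_hz=1000, first_tsc_receiver=5000, load_tsc=6000,
    receiver_tsc_hz=1000, receiver_host_mc=400, hpet_hz=10)
print(offset, 400 + offset)  # 630 1030
```
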
> >     > 5. Test Results
> >     >
> >     > Some recent test results are:
> >     >
> >     > 5.1 Linux 4u664 and Windows 2k864 load test.
> >     >       Duration: 70 hours.
> >     >       Test date: 6/2/08
> >     >       Loads: usex -b48 on Linux; burn-in on Windows
> >     >       Guest vcpus: 8 for Linux; 2 for Windows
> >     >       Hardware: 8 physical cpu AMD
> >     >       Clock drift : Linux: .0012% Windows: .009%
> >     >
> >     > 5.2 Linux 4u664, Linux 4u464 , and Windows 2k864 no-load test
> >     >       Duration: 23 hours.
> >     >       Test date: 6/3/08
> >     >       Loads: none
> >     >       Guest vcpus: 8 for each Linux; 2 for Windows
> >     >       Hardware: 4 physical cpu AMD
> >     >       Clock drift : Linux: .033% Windows: .019%
> >     >
> >     > 6. Relation to recent work in xen-unstable
> >     >
> >     > There is a similarity between hvm_get_guest_time() in
> >     xen-unstable and
> >     > read_64_main_counter()
> >     > in this code. However, read_64_main_counter() is more tuned to
> >     the needs of
> >     > hpet.c. It has no
> >     > "set" operation, only the get. It isolates the mode, 
> physical or
> >     simulated, in
> >     > read_64_main_counter()
> >     > itself. It uses no vcpu or domain state as it is a physical
> >     entity, in either
> >     > mode. And it provides a real
> >     > physical mode for every read for those applications 
> that desire
> >     this.
> >     >
> >     > 7. Conclusion
> >     >
> >     > The virtual hpet is improved by this patch in terms 
> of accuracy and
> >     > monotonicity.
> >     > Tests performed to date verify this and more testing 
> is under way.
> >     >
> >     > 8. Future Work
> >     >
> >     > Testing with Windows Vista will be performed soon. The reason
> >     for accuracy
> >     > variations
> >     > on different platforms using the physical hpet device will be
> >     investigated.
> >     > Additional overhead measurements on simulated vs physical hpet
> >     mode will be
> >     > made.
> >     >
> >     > Footnotes:
> >     >
> >     > 1. I don't recall the accuracy improvement with end of interrupt
> >     > stamping, but it was significant, perhaps better than two to one
> >     > improvement. It would be a very simple matter to re-measure the
> >     > improvement as the facility can call back at injection time as well.
> >     >
> >     >
> >     > Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
> >     > Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
> >     >
> >     >
> >     > _______________________________________________
> >     > Xen-devel mailing list
> >     > Xen-devel@xxxxxxxxxxxxxxxxxxx
> >     > http://lists.xensource.com/xen-devel
> >
> >
> >
> 
>




 

