RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> A disadvantage of the simulated mode is that it can return the same
> value for the counter in consecutive calls.

It also occurs to me that if the granularity is good enough, an easy fix
to this problem might be to always increment the returned value by at
least one. Then time is always at least increasing rather than stopped.
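
A rough sketch of that idea (illustrative only; read_64_main_counter() is
the simulated-mode read routine described in the write-up below, while
last_returned is an invented name and locking is omitted):

    static uint64_t last_returned;

    uint64_t read_64_main_counter_monotonic(void)
    {
        uint64_t v = read_64_main_counter();    /* simulated-mode read */

        /* If the underlying read has not advanced, advance it ourselves
         * by one tick so that guest-observed time is strictly increasing
         * rather than stopped. */
        if ( v <= last_returned )
            v = last_returned + 1;
        last_returned = v;

        return v;
    }
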
Hi Dan,
> While I am fully supportive of offering hardware hpet as an option
> for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> surprised by your preliminary results; the most obvious conclusion
> is that Xen system time is losing time at the rate of 1000 PPM
> though it's possible there's a bug somewhere else in the "time
> stack". Your Windows result is jaw-dropping and inexplicable,
> though I have to admit ignorance of how Windows manages time.
I think Xen system time is fine. You have to add the interrupt delivery
policies described in the write-up for the patch to get accurate
timekeeping in the guest. The Windows policy is obvious and results in a
large improvement in accuracy. The Linux policy is more subtle, but is
required to go from .1% to .03%.
> I think with my recent patch and hpet=1 (essentially the same as
> your emulated hpet), hvm guest time should track Xen system time.
> I wonder if domain0 (which if I understand correctly is directly
> using Xen system time) is also seeing an error of .1%? Also I
> wonder for the skew you are seeing (in both hvm guests and domain0)
> is time moving too fast or too slow?
I don't recall the direction. I can look it up in my notes at
work tomorrow.
> Although hwhpet=1 is a fine alternative in many cases, it may be
> unavailable on some systems and may cause significant performance
> issues on others. So I think we will still need to track down the
> poor accuracy when hwhpet=0.
Our patch is
accurate to < .03% using the physical hpet mode or the simulated
mode.
> And if for some reason Xen system time can't be made accurate
> enough (< 0.05%), then I think we should consider building Xen
> system time itself on top of hardware hpet instead of TSC... at
> least when Xen discovers a capable hpet.
In our experience, Xen
system time is accurate enough now.
> One more thought... do you know the accuracy of the TSC crystals
> on your test systems? I posted a patch awhile ago that was
> intended to test that, though I guess it was only testing skew
> of different TSCs on the same system, not TSCs against an
> external time source.
I do not know the tsc
accuracy.
> Or maybe there's a computation error somewhere in the hvm hpet
> scaling code?
Hmmm...
Regards, Dave
-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
Sent: Fri 6/6/2008 4:29 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
Dave --
Thanks much for posting the preliminary
results!
While I am fully supportive of offering hardware hpet as an
option for hvm guests (let's call it hwhpet=1 for shorthand), I am
very surprised by your preliminary results; the most obvious
conclusion is that Xen system time is losing time at the rate of 1000
PPM though it's possible there's a bug somewhere else in the
"time stack". Your Windows result is jaw-dropping and
inexplicable, though I have to admit ignorance of how Windows manages
time.
I think with my recent patch and hpet=1 (essentially the same
as your emulated hpet), hvm guest time should track Xen system time. I
wonder if domain0 (which if I understand correctly is directly using Xen
system time) is also seeing an error of .1%? Also I wonder for the
skew you are seeing (in both hvm guests and domain0) is time moving too
fast or too slow?
Although hwhpet=1 is a fine alternative in many
cases, it may be unavailable on some systems and may cause significant
performance issues on others. So I think we will still need to track
down the poor accuracy when hwhpet=0. And if for some reason Xen
system time can't be made accurate enough (< 0.05%), then I think we
should consider building Xen system time itself on top of hardware hpet
instead of TSC... at least when Xen discovers a capable hpet.
One
more thought... do you know the accuracy of the TSC crystals on your test
systems? I posted a patch awhile ago that was intended to test that,
though I guess it was only testing skew of different TSCs on the same
system, not TSCs against an external time source.
Or maybe there's a
computation error somewhere in the hvm hpet scaling code?
Hmmm...
Thanks, Dan
> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%. Tests on the patch
> we just submitted for hpet have indicated errors of .0012% on this
> platform under similar test conditions and .03% on other platforms.
>
> Windows vista64 has an error of 11% using hpet with the xen-unstable
> bits. In an overnight test with our hpet patch, the Windows vista
> error was .008%.
>
> The tests are with two or three guests on a physical node, all under
> load, and with the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please
> > ensure that hpet=1 is set in the hvm config and also I think that
> > when hpet is the clocksource on RHEL4-32, the clock IS resilient to
> > missed ticks so timer_mode should be 2 (vs when pit is the
> > clocksource on RHEL4-32, all clock ticks must be delivered and so
> > timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
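
For reference, the settings Dan describes would look something like the
following in an hvm guest config file of that era (illustrative fragment
only; exact option spelling depends on the toolstack):

    hpet = 1          # expose the virtual hpet to the hvm guest
    timer_mode = 2    # clocksource is resilient to missed ticks

with timer_mode = 0 instead when pit is the clocksource and every tick
must be delivered.
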
> > -----Original Message-----
> > From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> > [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Dave Winchell
> > Sent: Friday, June 06, 2008 4:46 AM
> > To: Keir Fraser; Ben Guthro; xen-devel
> > Cc: dan.magenheimer@xxxxxxxxxx; Dave Winchell
> > Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> > Keir,
> >
> > I think the changes are required. We'll run some tests today so that
> > we have some data to talk about.
> >
> > -Dave
> > -----Original Message-----
> > From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Keir Fraser
> > Sent: Fri 6/6/2008 4:58 AM
> > To: Ben Guthro; xen-devel
> > Cc: dan.magenheimer@xxxxxxxxxx
> > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Are these patches needed now the timers are built on Xen system time
> > rather than host TSC? Dan has reported much better time-keeping with
> > his patch checked in, and it's for sure a lot less invasive than this
> > patchset.
> >
> > -- Keir
> >
> > On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:
> > > 1. Introduction
> > >
> > > This patch improves the hpet based guest clock in terms of drift and
> > > monotonicity. Prior to this work the drift with hpet was greater
> > > than 2%, far above the .05% limit for ntp to synchronize. With this
> > > code, the drift ranges from .001% to .0033% depending on guest and
> > > physical platform.
> > >
> > > Using hpet allows guest operating systems to provide monotonic time
> > > to their applications. Time sources other than hpet are not
> > > monotonic because of their reliance on tsc, which is not
> > > synchronized across physical processors.
> > >
> > > Windows 2k864 and many Linux guests are supported with two policies,
> > > one for guests that handle missed clock interrupts and the other for
> > > guests that require the correct number of interrupts.
> > >
> > > Guests may use hpet for the timing source even if the physical
> > > platform has no visible hpet. Migration is supported between
> > > physical machines which differ in physical hpet visibility.
> > >
> > > Most of the changes are in hpet.c. Two general facilities are added
> > > to track interrupt progress. The ideas here and the facilities would
> > > be useful in vpt.c, for other time sources, though no attempt is
> > > made here to improve vpt.c.
> > >
> > > The following sections discuss hpet dependencies, interrupt delivery
> > > policies, live migration, test results, and relation to recent work
> > > with monotonic time.
> > > 2. Virtual Hpet dependencies
> > >
> > > The virtual hpet depends on the ability to read the physical or
> > > simulated (see discussion below) hpet. For timekeeping, the virtual
> > > hpet also depends on two new interrupt notification facilities to
> > > implement its policies for interrupt delivery.
> > >
> > > 2.1. Two modes of low-level hpet main counter reads.
> > >
> > > In this implementation, the virtual hpet reads with
> > > read_64_main_counter(), exported by time.c, either the real physical
> > > hpet main counter register directly or a "simulated" hpet main
> > > counter.
> > >
> > > The simulated mode uses a monotonic version of get_s_time() (NOW()),
> > > where the last time value is returned whenever the current time
> > > value is less than the last time value. In simulated mode, since it
> > > is layered on s_time, the underlying hardware can be hpet or some
> > > other device. The frequency of the main counter in simulated mode is
> > > the same as the standard physical hpet frequency, allowing live
> > > migration between nodes that are configured differently.
> > >
> > > If the physical platform does not have an hpet device, or if xen is
> > > configured not to use the device, then the simulated method is used.
> > > If there is a physical hpet device, and xen has initialized it, then
> > > either simulated or physical mode can be used. This is governed by a
> > > boot time option, hpet-avoid. Setting this option to 1 gives the
> > > simulated mode and 0 the physical mode. The default is physical
> > > mode.
> > >
> > > A disadvantage of the physical mode is that it may take longer to
> > > read the device than in simulated mode. On some platforms the cost
> > > is about the same (less than 250 nsec) for physical and simulated
> > > modes, while on others the physical cost is much higher than
> > > simulated. A disadvantage of the simulated mode is that it can
> > > return the same value for the counter in consecutive calls.
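
To make the two modes concrete, a read_64_main_counter() along the lines
described above might look roughly like this (a sketch, not the patch's
actual code; hpet_read_counter(), opt_hpet_avoid and the femtosecond
constant are assumptions, and locking/overflow handling are omitted):

    /* Standard hpet frequency: 14.31818 MHz, i.e. 69841279 fs per tick. */
    #define HPET_TICK_FS 69841279ULL

    static uint64_t simulated_last;

    uint64_t read_64_main_counter(void)
    {
        uint64_t now, ticks;

        if ( !opt_hpet_avoid )              /* physical mode */
            return hpet_read_counter();     /* read the real main counter */

        /* Simulated mode: derive a tick count at the standard hpet
         * frequency from system time. */
        now = NOW();                                    /* ns */
        ticks = (now * 1000ULL) / (HPET_TICK_FS / 1000ULL);

        if ( ticks < simulated_last )       /* never step backwards */
            ticks = simulated_last;
        simulated_last = ticks;

        return ticks;
    }
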
> > > 2.2. Interrupt notification facilities.
> > >
> > > Two interrupt notification facilities are introduced, one is
> > > hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().
> > >
> > > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the
> > > vioapic. hvm_isa_irq_assert_cb allows a callback to be passed along
> > > to vioapic_deliver() and this callback is called with a mask of the
> > > vcpus which will get the interrupt. This callback is made before any
> > > vcpus receive an interrupt.
> > >
> > > Vhpet uses hvm_register_intr_en_notif() to register a handler for a
> > > particular vector that will be called when that vector is injected
> > > in [vmx,svm]_intr_assist() and also when the guest finishes handling
> > > the interrupt. Here finished is defined as the point when the guest
> > > re-enables interrupts or lowers the tpr value. EOI is not used as
> > > the end of interrupt as this is sometimes returned before the
> > > interrupt handler has done its work. A flag is passed to the handler
> > > indicating whether this is the injection point (post = 1) or the
> > > interrupt finished (post = 0) point. The need for the finished point
> > > callback is discussed in the missed ticks policy section.
> > >
> > > To prevent a possible early trigger of the finished callback,
> > > intr_en_notif logic has a two stage arm, the first at injection
> > > (hvm_intr_en_notif_arm()) and the second when interrupts are seen to
> > > be disabled (hvm_intr_en_notif_disarm()). Once fully armed,
> > > re-enabling interrupts will cause hvm_intr_en_notif_disarm() to make
> > > the end of interrupt callback. hvm_intr_en_notif_arm() and
> > > hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
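
The two stage arm can be pictured roughly as follows (a sketch of the
mechanism described above; the state field, guest_irqs_enabled() and
notif_cb() are invented names, not the patch's actual code):

    enum notif_state { NOTIF_IDLE, NOTIF_ARMED_1, NOTIF_ARMED_2 };

    /* Stage 1: called from [vmx,svm]_intr_assist() when the tracked
     * vector is injected; the post = 1 callback is also made here. */
    static void hvm_intr_en_notif_arm(struct vcpu *v)
    {
        v->arch.hvm_vcpu.notif_state = NOTIF_ARMED_1;
        notif_cb(v, 1);
    }

    /* Called on subsequent passes through [vmx,svm]_intr_assist(). */
    static void hvm_intr_en_notif_disarm(struct vcpu *v)
    {
        switch ( v->arch.hvm_vcpu.notif_state )
        {
        case NOTIF_ARMED_1:
            /* Stage 2: guest observed with interrupts disabled, so it
             * really is inside the handler now. */
            if ( !guest_irqs_enabled(v) )
                v->arch.hvm_vcpu.notif_state = NOTIF_ARMED_2;
            break;
        case NOTIF_ARMED_2:
            /* Fully armed: when interrupts come back on (or the tpr is
             * lowered), the guest has finished handling the interrupt. */
            if ( guest_irqs_enabled(v) )
            {
                v->arch.hvm_vcpu.notif_state = NOTIF_IDLE;
                notif_cb(v, 0);             /* the post = 0 callback */
            }
            break;
        default:
            break;
        }
    }
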
> > > 3. Interrupt delivery policies
> > >
> > > The existing hpet interrupt delivery is preserved. This includes
> > > vcpu round robin delivery used by Linux and broadcast delivery used
> > > by Windows.
> > >
> > > There are two policies for interrupt delivery, one for Windows
> > > 2k8-64 and the other for Linux. The Linux policy takes advantage of
> > > the (guest) Linux missed tick and offset calculations and does not
> > > attempt to deliver the right number of interrupts. The Windows
> > > policy delivers the correct number of interrupts, even if sometimes
> > > much closer to each other than the period. The policies are similar
> > > to those in vpt.c, though there are some important differences.
> > >
> > > Policies are selected with an HVMOP_set_param hypercall with index
> > > HVM_PARAM_TIMER_MODE. Two new values are added,
> > > HVM_HPET_guest_computes_missed_ticks and
> > > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that two
> > > new ones are added is that in some guests (32bit Linux) a no-missed
> > > policy is needed for clock sources other than hpet and a missed
> > > ticks policy for hpet. It was felt that there would be less
> > > confusion by simply introducing the two hpet policies.
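
As an illustration of the selection mechanism (not code from the patch;
the numeric values of the two new modes are invented here and only the
names come from the text above), a toolstack could do:

    #include <xenctrl.h>
    #include <xen/hvm/params.h>    /* HVM_PARAM_TIMER_MODE */

    /* values assumed for illustration only */
    #define HVM_HPET_guest_computes_missed_ticks          2
    #define HVM_HPET_guest_does_not_compute_missed_ticks  3

    static int set_hpet_policy(int xc_handle, uint32_t domid,
                               int windows_guest)
    {
        unsigned long mode = windows_guest
            ? HVM_HPET_guest_does_not_compute_missed_ticks /* no-missed   */
            : HVM_HPET_guest_computes_missed_ticks;        /* missed ticks */

        /* xc_set_hvm_param() issues the HVMOP_set_param hypercall. */
        return xc_set_hvm_param(xc_handle, domid, HVM_PARAM_TIMER_MODE, mode);
    }
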
> > > 3.1. The missed ticks policy
> > >
> > > The Linux clock interrupt handler for hpet calculates missed ticks
> > > and offset using the hpet main counter. The algorithm works well
> > > when the time since the last interrupt is greater than or equal to a
> > > period and poorly otherwise.
> > >
> > > The missed ticks policy ensures that no two clock interrupts are
> > > delivered to the guest at a time interval less than a period. A time
> > > stamp (hpet main counter value) is recorded (by a callback
> > > registered with hvm_register_intr_en_notif) when Linux finishes
> > > handling the clock interrupt. Then, ensuing interrupts are delivered
> > > to the vioapic only if the current main counter value is a period
> > > greater than when the last interrupt was handled.
> > >
> > > Tests showed a significant improvement in clock drift with end of
> > > interrupt time stamps versus beginning of interrupt[1]. It is
> > > believed that the reason for the improvement is that the clock
> > > interrupt handler goes for a spinlock and can therefore be delayed
> > > in its processing. Furthermore, the main counter is read by the
> > > guest under the lock. The net effect is that if we time stamp
> > > injection, we can get the difference in time between successive
> > > interrupt handler lock acquisitions to be less than the period.
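
In outline the delivery gate is (sketch only; last_eoi_mc, period_ticks
and the callback argument are placeholder names, not the patch's code):

    static void hpet_timer_fired(struct HPETState *h, unsigned int tn)
    {
        uint64_t now = read_64_main_counter();

        /* last_eoi_mc is the main counter value recorded by the
         * intr_en_notif callback when the guest finished handling the
         * previous tick (the post = 0 point). */
        if ( now - h->last_eoi_mc < h->period_ticks[tn] )
            return;    /* too soon after the last handled tick; skip it --
                          the guest computes missed ticks, so longer gaps
                          are accounted for by the guest itself */

        /* otherwise route the interrupt to the vioapic */
        hvm_isa_irq_assert_cb(h->domain, h->irq[tn], hpet_pending_mask_cb);
    }
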
> > > 3.2. The no-missed ticks policy
> > >
> > > Windows 2k864 keeps very poor time with the missed ticks policy. So
> > > the no-missed ticks policy was developed. In the no-missed ticks
> > > policy we deliver the correct number of interrupts, even if they are
> > > spaced less than a period apart (when catching up).
> > >
> > > Windows 2k864 uses a broadcast mode in the interrupt routing such
> > > that all vcpus get the clock interrupt. The best Windows drift
> > > performance was achieved when the policy code ensured that all the
> > > previous interrupts (on the various vcpus) had been injected before
> > > injecting the next interrupt to the vioapic.
> > >
> > > The policy code works as follows. It uses hvm_isa_irq_assert_cb()
> > > to record the vcpus to be interrupted in h->hpet.pending_mask.
> > > Then, in the callback registered with hvm_register_intr_en_notif()
> > > at post=1 time it clears the current vcpu in the pending_mask. When
> > > the pending_mask is clear it decrements hpet.intr_pending_nr and if
> > > intr_pending_nr is still non-zero posts another interrupt to the
> > > ioapic with hvm_isa_irq_assert_cb(). Intr_pending_nr is incremented
> > > in hpet_route_decision_not_missed_ticks().
> > >
> > > The missed ticks policy intr_en_notif callback also uses the
> > > pending_mask method. So even though Linux does not broadcast its
> > > interrupts, the code could handle it if it did. In this case the end
> > > of interrupt time stamp is made when the pending_mask is clear.
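
A sketch of that bookkeeping (field names and locking are simplified
guesses, not the actual patch code):

    /* post = 1 (injection point) callback registered with
     * hvm_register_intr_en_notif() for the hpet vector. */
    static void hpet_intr_en_notif(struct vcpu *v, int post)
    {
        struct HPETState *h = vcpu_vhpet(v);   /* placeholder accessor */

        if ( post != 1 )
            return;

        /* this vcpu has now had the current tick injected */
        clear_bit(v->vcpu_id, &h->hpet.pending_mask);
        if ( h->hpet.pending_mask != 0 )
            return;                            /* other vcpus still pending */

        /* All vcpus injected.  If more ticks are owed (the guest fell
         * behind), post the next one right away, even though it is less
         * than a period after the previous one. */
        if ( --h->hpet.intr_pending_nr != 0 )
            hvm_isa_irq_assert_cb(h->domain, h->irq, hpet_pending_mask_cb);
    }
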
> > > 4. Live Migration
> > >
> > > Live migration with hpet preserves the current offset of the guest
> > > clock with respect to ntp. This is accomplished by migrating all of
> > > the state in the h->hpet data structure in the usual way. The
> > > hp->mc_offset is recalculated on the receiving node so that the
> > > guest sees a continuous hpet main counter.
> > >
> > > Code has been added to xc_domain_save.c to send a small message
> > > after the domain context is sent. The message contains the physical
> > > tsc timestamp, last_tsc, read just before the message is sent. When
> > > the last_tsc message is received in xc_domain_restore.c, another
> > > physical tsc timestamp, cur_tsc, is read. The two timestamps are
> > > loaded into the domain structure as last_tsc_sender and
> > > first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is
> > > called so that hpet_load has access to these time stamps. Hpet_load
> > > uses the timestamps to account for the time spent saving and loading
> > > the domain context. With this technique, the only neglected time is
> > > the time spent sending a small network message.
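
The offset recalculation can be summarised as follows (sketch only; how
the elapsed save/restore time is derived from last_tsc_sender and
first_tsc_receiver is glossed over here, since the two stamps come from
different hosts and must be scaled by each host's tsc rate):

    /* Called from hpet_load() on the receiving node. */
    static void hpet_recalc_offset(struct HPETState *h, uint64_t elapsed_ticks)
    {
        uint64_t guest_mc_at_save = h->hpet.mc64;   /* from migrated state */

        /* Pick mc_offset so that (host main counter + mc_offset) carries
         * on from the saved guest value, advanced by the migration time. */
        h->mc_offset = (guest_mc_at_save + elapsed_ticks)
                       - read_64_main_counter();
    }
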
> > > 5. Test Results
> > >
> > > Some recent test results are:
> > >
> > > 5.1 Linux 4u664 and Windows 2k864 load test
> > >     Duration: 70 hours.
> > >     Test date: 6/2/08
> > >     Loads: usex -b48 on Linux; burn-in on Windows
> > >     Guest vcpus: 8 for Linux; 2 for Windows
> > >     Hardware: 8 physical cpu AMD
> > >     Clock drift: Linux: .0012%  Windows: .009%
> > >
> > > 5.2 Linux 4u664, Linux 4u464, and Windows 2k864 no-load test
> > >     Duration: 23 hours.
> > >     Test date: 6/3/08
> > >     Loads: none
> > >     Guest vcpus: 8 for each Linux; 2 for Windows
> > >     Hardware: 4 physical cpu AMD
> > >     Clock drift: Linux: .033%  Windows: .019%
> > > 6. Relation to recent work in xen-unstable
> > >
> > > There is a similarity between hvm_get_guest_time() in xen-unstable
> > > and read_64_main_counter() in this code. However,
> > > read_64_main_counter() is more tuned to the needs of hpet.c. It has
> > > no "set" operation, only the get. It isolates the mode, physical or
> > > simulated, in read_64_main_counter() itself. It uses no vcpu or
> > > domain state as it is a physical entity, in either mode. And it
> > > provides a real physical mode for every read for those applications
> > > that desire this.
> > > 7. Conclusion
> > >
> > > The virtual hpet is improved by this patch in terms of accuracy and
> > > monotonicity. Tests performed to date verify this and more testing
> > > is under way.
> > > 8. Future Work
> > >
> > > Testing with Windows Vista will be performed soon. The reason for
> > > accuracy variations on different platforms using the physical hpet
> > > device will be investigated. Additional overhead measurements on
> > > simulated vs physical hpet mode will be made.
> > > Footnotes:
> > >
> > > 1. I don't recall the accuracy improvement with end of interrupt
> > >    stamping, but it was significant, perhaps better than a two to
> > >    one improvement. It would be a very simple matter to re-measure
> > >    the improvement as the facility can call back at injection time
> > >    as well.
> > > Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
> > > Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel