Xen project Mailing List

RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

To: "Dave Winchell" <dwinchell@xxxxxxxxxxxxxxx>, "Keir Fraser" <keir.fraser@xxxxxxxxxxxxx>

From: "Dan Magenheimer" <dan.magenheimer@xxxxxxxxxx>

Date: Fri, 6 Jun 2008 14:29:55 -0600

Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxx>, Ben Guthro <bguthro@xxxxxxxxxxxxxxx>

Delivery-date: Fri, 06 Jun 2008 13:31:33 -0700

List-id: Xen developer discussion <xen-devel.lists.xensource.com>

Thread-index: AcjIFBNOSTfhtJHySxCkafUaaZcKSg==

Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option for hvm guests (let's call it hwhpet=1 for shorthand), I am very surprised by your preliminary results; the most obvious conclusion is that Xen system time is losing time at the rate of 1000 PPM, though it's possible there's a bug somewhere else in the "time stack". Your Windows result is jaw-dropping and inexplicable, though I have to admit ignorance of how Windows manages time.

I think with my recent patch and hpet=1 (essentially the same as your emulated hpet), hvm guest time should track Xen system time. I wonder if domain0 (which, if I understand correctly, is directly using Xen system time) is also seeing an error of .1%? Also, for the skew you are seeing (in both hvm guests and domain0), is time moving too fast or too slow?

Although hwhpet=1 is a fine alternative in many cases, it may be unavailable on some systems and may cause significant performance issues on others. So I think we will still need to track down the poor accuracy when hwhpet=0. And if for some reason Xen system time can't be made accurate enough (< 0.05%), then I think we should consider building Xen system time itself on top of hardware hpet instead of TSC... at least when Xen discovers a capable hpet.

One more thought... do you know the accuracy of the TSC crystals on your test systems? I posted a patch a while ago that was intended to test that, though I guess it was only testing skew of different TSCs on the same system, not TSCs against an external time source. Or maybe there's a computation error somewhere in the hvm hpet scaling code? Hmmm...
Thanks,
Dan

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
> Sent: Friday, June 06, 2008 1:33 PM
> To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
> Cc: Ben Guthro; xen-devel; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Dan, Keir:
>
> Preliminary test results indicate an error of .1% for Linux 64 bit
> guests configured for hpet with xen-unstable as is. As we have
> discussed many times, the ntp requirement is .05%. Tests on the patch
> we just submitted for hpet have indicated errors of .0012% on this
> platform under similar test conditions and .03% on other platforms.
>
> Windows vista64 has an error of 11% using hpet with the xen-unstable
> bits. In an overnight test with our hpet patch, the Windows vista
> error was .008%.
>
> The tests are with two or three guests on a physical node, all under
> load, and with the ratio of vcpus to phys cpus > 1.
>
> I will continue to run tests over the next few days.
>
> thanks,
> Dave
>
> Dan Magenheimer wrote:
>
> > Hi Dave and Ben --
> >
> > When running tests on xen-unstable (without your patch), please
> > ensure that hpet=1 is set in the hvm config and also I think that
> > when hpet is the clocksource on RHEL4-32, the clock IS resilient to
> > missed ticks so timer_mode should be 2 (vs when pit is the
> > clocksource on RHEL4-32, all clock ticks must be delivered and so
> > timer_mode should be 0).
> >
> > Per
> > http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> > it's my intent to clean this up, but I won't get to it until next week.
> >
> > Thanks,
> > Dan
> >
> > -----Original Message-----
> > From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> > [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Dave
> > Winchell
> > Sent: Friday, June 06, 2008 4:46 AM
> > To: Keir Fraser; Ben Guthro; xen-devel
> > Cc: dan.magenheimer@xxxxxxxxxx; Dave Winchell
> > Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Keir,
> >
> > I think the changes are required. We'll run some tests today so
> > that we have some data to talk about.
> >
> > -Dave
> >
> >
> > -----Original Message-----
> > From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Keir Fraser
> > Sent: Fri 6/6/2008 4:58 AM
> > To: Ben Guthro; xen-devel
> > Cc: dan.magenheimer@xxxxxxxxxx
> > Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >
> > Are these patches needed now the timers are built on Xen system
> > time rather than host TSC? Dan has reported much better
> > time-keeping with his patch checked in, and it's for sure a lot
> > less invasive than this patchset.
> >
> > -- Keir
> >
> > On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:
> >
> > > 1. Introduction
> > >
> > > This patch improves the hpet based guest clock in terms of drift
> > > and monotonicity. Prior to this work the drift with hpet was
> > > greater than 2%, far above the .05% limit for ntp to synchronize.
> > > With this code, the drift ranges from .001% to .0033% depending on
> > > guest and physical platform.
> > >
> > > Using hpet allows guest operating systems to provide monotonic
> > > time to their applications. Time sources other than hpet are not
> > > monotonic because of their reliance on tsc, which is not
> > > synchronized across physical processors.
> > >
> > > Windows 2k8-64 and many Linux guests are supported with two
> > > policies, one for guests that handle missed clock interrupts and
> > > the other for guests that require the correct number of
> > > interrupts.
> > >
> > > Guests may use hpet for the timing source even if the physical
> > > platform has no visible hpet. Migration is supported between
> > > physical machines which differ in physical hpet visibility.
> > >
> > > Most of the changes are in hpet.c. Two general facilities are
> > > added to track interrupt progress. The ideas here and the
> > > facilities would be useful in vpt.c, for other time sources,
> > > though no attempt is made here to improve vpt.c.
> > >
> > > The following sections discuss hpet dependencies, interrupt
> > > delivery policies, live migration, test results, and relation to
> > > recent work with monotonic time.
> > >
> > >
> > > 2. Virtual Hpet dependencies
> > >
> > > The virtual hpet depends on the ability to read the physical or
> > > simulated (see discussion below) hpet. For timekeeping, the
> > > virtual hpet also depends on two new interrupt notification
> > > facilities to implement its policies for interrupt delivery.
> > >
> > > 2.1. Two modes of low-level hpet main counter reads.
> > >
> > > In this implementation, the virtual hpet reads with
> > > read_64_main_counter(), exported by time.c, either the real
> > > physical hpet main counter register directly or a "simulated"
> > > hpet main counter.
> > >
> > > The simulated mode uses a monotonic version of get_s_time()
> > > (NOW()), where the last time value is returned whenever the
> > > current time value is less than the last time value. In simulated
> > > mode, since it is layered on s_time, the underlying hardware can
> > > be hpet or some other device. The frequency of the main counter
> > > in simulated mode is the same as the standard physical hpet
> > > frequency, allowing live migration between nodes that are
> > > configured differently.
> > >
> > > If the physical platform does not have an hpet device, or if xen
> > > is configured not to use the device, then the simulated method is
> > > used. If there is a physical hpet device, and xen has initialized
> > > it, then either simulated or physical mode can be used. This is
> > > governed by a boot time option, hpet-avoid. Setting this option
> > > to 1 gives the simulated mode and 0 the physical mode. The
> > > default is physical mode.
> > >
> > > A disadvantage of the physical mode is that it may take longer to
> > > read the device than in simulated mode. On some platforms the
> > > cost is about the same (less than 250 nsec) for physical and
> > > simulated modes, while on others physical cost is much higher
> > > than simulated. A disadvantage of the simulated mode is that it
> > > can return the same value for the counter in consecutive calls.
> > >
> > > 2.2. Interrupt notification facilities.
> > >
> > > Two interrupt notification facilities are introduced, one is
> > > hvm_isa_irq_assert_cb() and the other
> > > hvm_register_intr_en_notif().
> > >
> > > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the
> > > vioapic. hvm_isa_irq_assert_cb allows a callback to be passed
> > > along to vioapic_deliver() and this callback is called with a
> > > mask of the vcpus which will get the interrupt. This callback is
> > > made before any vcpus receive an interrupt.
> > >
> > > Vhpet uses hvm_register_intr_en_notif() to register a handler for
> > > a particular vector that will be called when that vector is
> > > injected in [vmx,svm]_intr_assist() and also when the guest
> > > finishes handling the interrupt. Here finished is defined as the
> > > point when the guest re-enables interrupts or lowers the tpr
> > > value. EOI is not used as the end of interrupt as this is
> > > sometimes returned before the interrupt handler has done its
> > > work. A flag is passed to the handler indicating whether this is
> > > the injection point (post = 1) or the interrupt finished
> > > (post = 0) point. The need for the finished point callback is
> > > discussed in the missed ticks policy section.
> > >
> > > To prevent a possible early trigger of the finished callback,
> > > intr_en_notif logic has a two stage arm, the first at injection
> > > (hvm_intr_en_notif_arm()) and the second when interrupts are seen
> > > to be disabled (hvm_intr_en_notif_disarm()). Once fully armed,
> > > re-enabling interrupts will cause hvm_intr_en_notif_disarm() to
> > > make the end of interrupt callback. hvm_intr_en_notif_arm() and
> > > hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
> > >
> > > 3. Interrupt delivery policies
> > >
> > > The existing hpet interrupt delivery is preserved. This includes
> > > vcpu round robin delivery used by Linux and broadcast delivery
> > > used by Windows.
> > >
> > > There are two policies for interrupt delivery, one for Windows
> > > 2k8-64 and the other for Linux. The Linux policy takes advantage
> > > of the (guest) Linux missed tick and offset calculations and does
> > > not attempt to deliver the right number of interrupts. The
> > > Windows policy delivers the correct number of interrupts, even if
> > > sometimes much closer to each other than the period. The policies
> > > are similar to those in vpt.c, though there are some important
> > > differences.
> > >
> > > Policies are selected with an HVMOP_set_param hypercall with
> > > index HVM_PARAM_TIMER_MODE. Two new values are added,
> > > HVM_HPET_guest_computes_missed_ticks and
> > > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that two
> > > new ones are added is that in some guests (32bit Linux) a
> > > no-missed policy is needed for clock sources other than hpet and
> > > a missed ticks policy for hpet. It was felt that there would be
> > > less confusion by simply introducing the two hpet policies.
> > >
> > > 3.1. The missed ticks policy
> > >
> > > The Linux clock interrupt handler for hpet calculates missed
> > > ticks and offset using the hpet main counter. The algorithm works
> > > well when the time since the last interrupt is greater than or
> > > equal to a period and poorly otherwise.
> > >
> > > The missed ticks policy ensures that no two clock interrupts are
> > > delivered to the guest at a time interval less than a period. A
> > > time stamp (hpet main counter value) is recorded (by a callback
> > > registered with hvm_register_intr_en_notif) when Linux finishes
> > > handling the clock interrupt. Then, ensuing interrupts are
> > > delivered to the vioapic only if the current main counter value
> > > is a period greater than when the last interrupt was handled.
> > >
> > > Tests showed a significant improvement in clock drift with end of
> > > interrupt time stamps versus beginning of interrupt[1]. It is
> > > believed that the reason for the improvement is that the clock
> > > interrupt handler goes for a spinlock and can therefore be
> > > delayed in its processing. Furthermore, the main counter is read
> > > by the guest under the lock. The net effect is that if we time
> > > stamp injection, we can get the difference in time between
> > > successive interrupt handler lock acquisitions to be less than
> > > the period.
> > >
> > > 3.2. The no-missed ticks policy
> > >
> > > Windows 2k8-64 keeps very poor time with the missed ticks policy.
> > > So the no-missed ticks policy was developed. In the no-missed
> > > ticks policy we deliver the correct number of interrupts, even if
> > > they are spaced less than a period apart (when catching up).
> > >
> > > Windows 2k8-64 uses a broadcast mode in the interrupt routing
> > > such that all vcpus get the clock interrupt. The best Windows
> > > drift performance was achieved when the policy code ensured that
> > > all the previous interrupts (on the various vcpus) had been
> > > injected before injecting the next interrupt to the vioapic.
> > >
> > > The policy code works as follows. It uses the
> > > hvm_isa_irq_assert_cb() to record the vcpus to be interrupted in
> > > h->hpet.pending_mask. Then, in the callback registered with
> > > hvm_register_intr_en_notif() at post=1 time it clears the current
> > > vcpu in the pending_mask. When the pending_mask is clear it
> > > decrements hpet.intr_pending_nr and if intr_pending_nr is still
> > > non-zero posts another interrupt to the ioapic with
> > > hvm_isa_irq_assert_cb(). Intr_pending_nr is incremented in
> > > hpet_route_decision_not_missed_ticks().
> > >
> > > The missed ticks policy intr_en_notif callback also uses the
> > > pending_mask method. So even though Linux does not broadcast its
> > > interrupts, the code could handle it if it did. In this case the
> > > end of interrupt time stamp is made when the pending_mask is
> > > clear.
> > >
> > > 4. Live Migration
> > >
> > > Live migration with hpet preserves the current offset of the
> > > guest clock with respect to ntp. This is accomplished by
> > > migrating all of the state in the h->hpet data structure in the
> > > usual way. The hp->mc_offset is recalculated on the receiving
> > > node so that the guest sees a continuous hpet main counter.
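
The offset recalculation described in the paragraph just above can be modelled roughly as follows. This is a minimal Python sketch of the arithmetic, not code from hpet.c. It assumes the guest-visible main counter is the host counter plus mc_offset, taken modulo 2**64, and both function names are hypothetical.

```python
MASK64 = (1 << 64) - 1  # assumption: the main counter is treated as a 64-bit register


def guest_main_counter(host_counter: int, mc_offset: int) -> int:
    """Guest-visible counter: host counter plus a per-domain offset, modulo 2**64."""
    return (host_counter + mc_offset) & MASK64


def recalc_mc_offset(saved_guest_counter: int, receiver_host_counter: int) -> int:
    """Offset that makes the receiver's first read equal the saved guest value."""
    return (saved_guest_counter - receiver_host_counter) & MASK64
```

With the offset chosen this way, the first guest read on the receiving node returns the value saved on the sender, and later reads advance with the receiver's own counter, whatever absolute value that counter happens to hold.
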
> > >
> > > Code has been added to xc_domain_save.c to send a small message
> > > after the domain context is sent. The content of the message is
> > > the physical tsc timestamp, last_tsc, read just before the
> > > message is sent. When the last_tsc message is received in
> > > xc_domain_restore.c, another physical tsc timestamp, cur_tsc, is
> > > read. The two timestamps are loaded into the domain structure as
> > > last_tsc_sender and first_tsc_receiver with hypercalls. Then
> > > xc_domain_hvm_setcontext is called so that hpet_load has access
> > > to these time stamps. Hpet_load uses the timestamps to account
> > > for the time spent saving and loading the domain context. With
> > > this technique, the only neglected time is the time spent sending
> > > a small network message.
> > >
> > > 5. Test Results
> > >
> > > Some recent test results are:
> > >
> > > 5.1 Linux 4u664 and Windows 2k8-64 load test.
> > >     Duration: 70 hours.
> > >     Test date: 6/2/08
> > >     Loads: usex -b48 on Linux; burn-in on Windows
> > >     Guest vcpus: 8 for Linux; 2 for Windows
> > >     Hardware: 8 physical cpu AMD
> > >     Clock drift: Linux: .0012% Windows: .009%
> > >
> > > 5.2 Linux 4u664, Linux 4u464, and Windows 2k8-64 no-load test
> > >     Duration: 23 hours.
> > >     Test date: 6/3/08
> > >     Loads: none
> > >     Guest vcpus: 8 for each Linux; 2 for Windows
> > >     Hardware: 4 physical cpu AMD
> > >     Clock drift: Linux: .033% Windows: .019%
> > >
> > > 6. Relation to recent work in xen-unstable
> > >
> > > There is a similarity between hvm_get_guest_time() in
> > > xen-unstable and read_64_main_counter() in this code. However,
> > > read_64_main_counter() is more tuned to the needs of hpet.c. It
> > > has no "set" operation, only the get. It isolates the mode,
> > > physical or simulated, in read_64_main_counter() itself. It uses
> > > no vcpu or domain state as it is a physical entity, in either
> > > mode. And it provides a real physical mode for every read for
> > > those applications that desire this.
> > >
> > > 7. Conclusion
> > >
> > > The virtual hpet is improved by this patch in terms of accuracy
> > > and monotonicity. Tests performed to date verify this and more
> > > testing is under way.
> > >
> > > 8. Future Work
> > >
> > > Testing with Windows Vista will be performed soon. The reason for
> > > accuracy variations on different platforms using the physical
> > > hpet device will be investigated. Additional overhead
> > > measurements on simulated vs physical hpet mode will be made.
> > >
> > > Footnotes:
> > >
> > > 1. I don't recall the accuracy improvement with end of interrupt
> > >    stamping, but it was significant, perhaps better than two to
> > >    one improvement. It would be a very simple matter to
> > >    re-measure the improvement as the facility can call back at
> > >    injection time as well.
> > >
> > >
> > > Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
> > > Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
> > >
> > >
> > > _______________________________________________
> > > Xen-devel mailing list
> > > Xen-devel@xxxxxxxxxxxxxxxxxxx
> > > http://lists.xensource.com/xen-devel
> > >
> >

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
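
The drift figures quoted in this thread are given as percentages, while the top message also speaks in PPM. The conversion is simple arithmetic, sketched here in Python under stated assumptions: the function names are made up for illustration, nothing here is taken from the Xen tree, and the 0.05% limit is the ntp figure cited in the messages above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# ntp synchronization limit as quoted in the thread (0.05%, i.e. 500 PPM)
NTP_LIMIT_PERCENT = 0.05


def percent_to_ppm(drift_percent: float) -> float:
    """Convert a drift percentage to parts per million (1% = 10,000 PPM)."""
    return drift_percent * 10_000


def seconds_per_day(drift_percent: float) -> float:
    """Seconds gained or lost per day at a constant drift rate."""
    return SECONDS_PER_DAY * drift_percent / 100


def within_ntp_limit(drift_percent: float) -> bool:
    """True if the magnitude of the drift is below the quoted ntp limit."""
    return abs(drift_percent) < NTP_LIMIT_PERCENT
```

By this arithmetic the .1% Linux figure is the 1000 PPM mentioned at the top of the thread, or roughly 86 seconds per day, and the 11% Vista figure is more than two and a half hours per day. The .0012% and .008% results reported with the patch fall inside the quoted ntp limit.
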
