Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
Dan Magenheimer wrote:

In EL5u1-32 however it looks like the fractions are accounted for. Indeed the EL5u1-32 "lost tick handling" code resembles the Linux/ia64 code which is what I've always assumed was the "missed tick" model. In this case, I think no policy is necessary and the measured skew should be identical to any physical hpet skew. I'll have to test this hypothesis though.

I've tested this hypothesis and it seems to hold true. This means the existing (unpatched) hpet code works fine on EL5-32bit (vcpus=1) when hpet is the clocksource, even when the machine is overcommitted. A second hypothesis still needs to be tested that Dave's patch will not make this worse.

Interesting, thanks for pointing this out and confirming.

(Note that per previous discussion, my EL5u1-32bit guest running on an Intel dual-core physical box chose tsc as the best clocksource and I had to override it with clock=hpet in the kernel command line.)

Is there one setting for all Linux guests that makes them choose hpet? Is it "clock=hpet clocksource=hpet"? I know you wrote at length about this before.

Yes, that makes sense and concurs with what I remember from the EL4u5-32 code. If this is true, one would expect the default "no missed tick" policy to see time moving faster than an external source -- the first missed tick delivered after a long sleep would "catch up" and then the remainder would each add another tick.

Indeed with the existing (unpatched) hpet code, time is running faster on EL4u5-32 (vcpus=1, when overcommitted). So Dave's patch is definitely needed here.

It's good to get the verification of this.

thanks,
Dave

Will try 64-bit next.

Dan

-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
Sent: Monday, June 09, 2008 9:21 PM
To: 'Dave Winchell'; 'Keir Fraser'
Cc: 'xen-devel'; 'Ben Guthro'
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

I'll tell you what I recall about this. Tomorrow I'll check the guest code to verify.
I think that Linux declares a full tick, even if the interrupt is early. That's the problem.

Yes, that makes sense and concurs with what I remember from the EL4u5-32 code. If this is true, one would expect the default "no missed tick" policy to see time moving faster than an external source -- the first missed tick delivered after a long sleep would "catch up" and then the remainder would each add another tick.

On the other hand, if the interrupt is late it in effect declares a tick plus fraction. If it just declared the fraction in the first place, we could deliver the interrupts whenever we wanted.

My read of the EL4u5-32 code is that the fraction is discarded and a new tick period commences at "now", so the fractions eventually accumulate as lost time.

In EL5u1-32 however it looks like the fractions are accounted for. Indeed the EL5u1-32 "lost tick handling" code resembles the Linux/ia64 code which is what I've always assumed was the "missed tick" model. In this case, I think no policy is necessary and the measured skew should be identical to any physical hpet skew. I'll have to test this hypothesis though.

-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Dave Winchell
Sent: Monday, June 09, 2008 5:35 PM
To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
Cc: Dave Winchell; xen-devel; Ben Guthro
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

The Linux policy is more subtle, but is required to go from .1% to .03%.

Thanks for the good documentation which I hadn't thoroughly read until now. I now understand that the essence of your hpet missed ticks policy is to ensure that ticks are never delivered too close together. But I'm trying to understand WHY your patch works, in other words, what problem it is countering.

I'll tell you what I recall about this. Tomorrow I'll check the guest code to verify.

I think that Linux declares a full tick, even if the interrupt is early. That's the problem.
On the other hand, if the interrupt is late it in effect declares a tick plus fraction. If it just declared the fraction in the first place, we could deliver the interrupts whenever we wanted.

It's really not that different than the missed ticks policy in vpt.c except that there the period in vpt.c is based on start of interrupt and I have improved that with end-of-interrupt as described in the patch note. I don't recall what prompted me to try end-of-interrupt, but I saw a significant improvement. I may have been running a monotonicity test at the same time to explain the lock contention mentioned in the write-up.

I care about this for more reasons than just because it is interesting: (1) I'd like to feel confident that it is fixing a bug rather than just a symptom of a bug; and (2) I wonder how universally it is applicable.

It's worked well with my small set of guests. You and our QA are going to tell us about the wider set. It doesn't matter if guest A handles interrupts closely spaced or not, just whether it handles them far apart. So it should be pretty universal with guests that really handle missed ticks. I think it's interesting that some 32bit Linux guests handle missed ticks for hpet.

I see from code examination in mark_offset_hpet() in RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that the correction for lost ticks is just plain wrong in a virtual environment. (Suppose for example that a virtual tick was delivered every 1.999*hpet_tick... I think the clock would be off by 50%!) Is this the bug that is being "countered" by your policy?

I haven't looked at that code, perhaps. I'll check it tomorrow.

However, the lost tick handling in RHEL5u1/kernel/timer.c (which I think is used also for hpet) is much better so I am eager to find out if your policy works there too. If the hpet missed tick policy works for both, though, I should be happy, though I wonder about upstream kernels (e.g. the trend toward tickless).

I wasn't aware of this trend.
If it's robust, however, it should handle late interrupts ...

That said, I'd rather see this get into Xen 3.3 and worry about upstream kernels later :-)

Regards,
Dave

-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
Sent: Mon 6/9/2008 6:02 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

The Linux policy is more subtle, but is required to go from .1% to .03%.

Thanks for the good documentation which I hadn't thoroughly read until now. I now understand that the essence of your hpet missed ticks policy is to ensure that ticks are never delivered too close together. But I'm trying to understand WHY your patch works, in other words, what problem it is countering.

I care about this for more reasons than just because it is interesting: (1) I'd like to feel confident that it is fixing a bug rather than just a symptom of a bug; and (2) I wonder how universally it is applicable.

I see from code examination in mark_offset_hpet() in RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that the correction for lost ticks is just plain wrong in a virtual environment. (Suppose for example that a virtual tick was delivered every 1.999*hpet_tick... I think the clock would be off by 50%!) Is this the bug that is being "countered" by your policy?

However, the lost tick handling in RHEL5u1/kernel/timer.c (which I think is used also for hpet) is much better so I am eager to find out if your policy works there too. If the hpet missed tick policy works for both, though, I should be happy, though I wonder about upstream kernels (e.g. the trend toward tickless).
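
To make the accounting argument in this exchange concrete, here is a toy model (not the RHEL kernel code) of three ways a guest tick handler might account for time. The function name guest_time and the three model names are hypothetical. "discard_fraction" follows the reading of the EL4u5-32 code given in this thread (whole ticks are counted, the fraction is dropped, and an early interrupt still counts as a full tick); "carry_fraction" follows the reading of EL5u1-32 (the fraction is carried into the next interval).

```python
from fractions import Fraction


def guest_time(arrivals, tick, model):
    """Return the time a guest would account for, given interrupt arrival times.

    arrivals: increasing interrupt arrival times (same unit as tick)
    model:    one of the three illustrative accounting models below
    """
    last = Fraction(0)        # guest's idea of when the current tick period began
    accounted = Fraction(0)   # total time the guest has credited to its clock
    for now in arrivals:
        elapsed = now - last
        if model == "one_tick_per_interrupt":
            # No lost-tick handling at all: every interrupt is worth one tick.
            accounted += tick
            last = now
        elif model == "discard_fraction":
            # Credit whole ticks (at least one, even if early), then restart
            # the period at "now", dropping the leftover fraction.
            whole = max(1, elapsed // tick)
            accounted += whole * tick
            last = now
        elif model == "carry_fraction":
            # Credit whole ticks only and keep the remainder for next time.
            whole = elapsed // tick
            accounted += whole * tick
            last += whole * tick
        else:
            raise ValueError(f"unknown model: {model}")
    return accounted


tick = Fraction(1)

# A virtual tick delivered every 1.999 * tick, 1000 times (1999 ticks of real time).
late = [Fraction(1999, 1000) * k for k in range(1, 1001)]
print(guest_time(late, tick, "discard_fraction"))  # 1000
print(guest_time(late, tick, "carry_fraction"))    # 1999

# Ten ticks delivered in a burst after the guest was descheduled for 10 ticks.
burst = [Fraction(10) + Fraction(k, 100) for k in range(10)]
print(guest_time(burst, tick, "discard_fraction"))  # 19
print(guest_time(burst, tick, "carry_fraction"))    # 10
```

Under these assumptions the 1.999*hpet_tick case credits about half of real time with discard_fraction, and the burst case shows time running fast when every missed tick is replayed to a guest that already caught up on the first one.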
That said, I'd rather see this get into Xen 3.3 and worry about upstream kernels later :-)

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
Sent: Sunday, June 08, 2008 2:32 PM
To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Hi Dan,

While I am fully supportive of offering hardware hpet as an option for hvm guests (let's call it hwhpet=1 for shorthand), I am very surprised by your preliminary results; the most obvious conclusion is that Xen system time is losing time at the rate of 1000 PPM though it's possible there's a bug somewhere else in the "time stack". Your Windows result is jaw-dropping and inexplicable, though I have to admit ignorance of how Windows manages time.

I think xen system time is fine. You have to add the interrupt delivery policies described in the write-up for the patch to get accurate timekeeping in the guest. The Windows policy is obvious and results in a large improvement in accuracy. The Linux policy is more subtle, but is required to go from .1% to .03%.

I think with my recent patch and hpet=1 (essentially the same as your emulated hpet), hvm guest time should track Xen system time. I wonder if domain0 (which if I understand correctly is directly using Xen system time) is also seeing an error of .1%? Also I wonder for the skew you are seeing (in both hvm guests and domain0) is time moving too fast or too slow?

I don't recall the direction. I can look it up in my notes at work tomorrow.

Although hwhpet=1 is a fine alternative in many cases, it may be unavailable on some systems and may cause significant performance issues on others.
So I think we will still need to track down the poor accuracy when hwhpet=0.

Our patch is accurate to < .03% using the physical hpet mode or the simulated mode.

And if for some reason Xen system time can't be made accurate enough (< 0.05%), then I think we should consider building Xen system time itself on top of hardware hpet instead of TSC... at least when Xen discovers a capable hpet.

In our experience, Xen system time is accurate enough now.

One more thought... do you know the accuracy of the TSC crystals on your test systems? I posted a patch awhile ago that was intended to test that, though I guess it was only testing skew of different TSCs on the same system, not TSCs against an external time source.

I do not know the tsc accuracy.

Or maybe there's a computation error somewhere in the hvm hpet scaling code? Hmmm...

Regards,
Dave

-----Original Message-----
From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
Sent: Fri 6/6/2008 4:29 PM
To: Dave Winchell; Keir Fraser
Cc: Ben Guthro; xen-devel
Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dave --

Thanks much for posting the preliminary results!

While I am fully supportive of offering hardware hpet as an option for hvm guests (let's call it hwhpet=1 for shorthand), I am very surprised by your preliminary results; the most obvious conclusion is that Xen system time is losing time at the rate of 1000 PPM though it's possible there's a bug somewhere else in the "time stack". Your Windows result is jaw-dropping and inexplicable, though I have to admit ignorance of how Windows manages time.

I think with my recent patch and hpet=1 (essentially the same as your emulated hpet), hvm guest time should track Xen system time. I wonder if domain0 (which if I understand correctly is directly using Xen system time) is also seeing an error of .1%? Also I wonder for the skew you are seeing (in both hvm guests and domain0) is time moving too fast or too slow?
Although hwhpet=1 is a fine alternative in many cases, it may be unavailable on some systems and may cause significant performance issues on others. So I think we will still need to track down the poor accuracy when hwhpet=0.

And if for some reason Xen system time can't be made accurate enough (< 0.05%), then I think we should consider building Xen system time itself on top of hardware hpet instead of TSC... at least when Xen discovers a capable hpet.

One more thought... do you know the accuracy of the TSC crystals on your test systems? I posted a patch awhile ago that was intended to test that, though I guess it was only testing skew of different TSCs on the same system, not TSCs against an external time source.

Or maybe there's a computation error somewhere in the hvm hpet scaling code? Hmmm...

Thanks,
Dan

-----Original Message-----
From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
Sent: Friday, June 06, 2008 1:33 PM
To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
Cc: Ben Guthro; xen-devel; Dave Winchell
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Dan, Keir:

Preliminary test results indicate an error of .1% for Linux 64 bit guests configured for hpet with xen-unstable as is. As we have discussed many times, the ntp requirement is .05%. Tests on the patch we just submitted for hpet have indicated errors of .0012% on this platform under similar test conditions and .03% on other platforms.

Windows Vista64 has an error of 11% using hpet with the xen-unstable bits. In an overnight test with our hpet patch, the Windows Vista error was .008%.

The tests are with two or three guests on a physical node, all under load, and with the ratio of vcpus to phys cpus > 1.

I will continue to run tests over the next few days.
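
For scale, the percentages quoted in this message convert to parts per million and to seconds of drift per day as shown below. This is plain unit arithmetic rather than anything from the patch; the helper names are made up for illustration, and Fraction is used only to keep the decimal values exact.

```python
from fractions import Fraction

SECONDS_PER_DAY = 86_400


def percent_to_ppm(percent):
    """1% is 10,000 parts per million."""
    return percent * 10_000


def drift_seconds_per_day(percent):
    """Seconds gained or lost per day at a constant relative error."""
    return SECONDS_PER_DAY * percent / 100


# Error figures as reported in the message above.
reported = {
    "Linux 64, xen-unstable as is": Fraction("0.1"),
    "ntp requirement": Fraction("0.05"),
    "Linux with patch, this platform": Fraction("0.0012"),
    "Linux with patch, other platforms": Fraction("0.03"),
    "Vista64, xen-unstable as is": Fraction("11"),
    "Vista64 with patch": Fraction("0.008"),
}

for label, pct in reported.items():
    ppm = float(percent_to_ppm(pct))
    per_day = float(drift_seconds_per_day(pct))
    print(f"{label}: {ppm:g} PPM, {per_day:g} s/day")
```

So the .1% figure is the 1000 PPM mentioned earlier in the thread, or about 86 seconds per day, while the .05% limit corresponds to 500 PPM.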
thanks,
Dave

Dan Magenheimer wrote:

Hi Dave and Ben --

When running tests on xen-unstable (without your patch), please ensure that hpet=1 is set in the hvm config and also I think that when hpet is the clocksource on RHEL4-32, the clock IS resilient to missed ticks so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32, all clock ticks must be delivered and so timer_mode should be 0). Per http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html it's my intent to clean this up, but I won't get to it until next week.

Thanks,
Dan

-----Original Message-----
*From:* xen-devel-bounces@xxxxxxxxxxxxxxxxxxx [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx] *On Behalf Of* Dave Winchell
*Sent:* Friday, June 06, 2008 4:46 AM
*To:* Keir Fraser; Ben Guthro; xen-devel
*Cc:* dan.magenheimer@xxxxxxxxxx; Dave Winchell
*Subject:* RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Keir,

I think the changes are required. We'll run some tests today so that we have some data to talk about.

-Dave

-----Original Message-----
From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Keir Fraser
Sent: Fri 6/6/2008 4:58 AM
To: Ben Guthro; xen-devel
Cc: dan.magenheimer@xxxxxxxxxx
Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy

Are these patches needed now the timers are built on Xen system time rather than host TSC? Dan has reported much better time-keeping with his patch checked in, and it's for sure a lot less invasive than this patchset.

-- Keir

On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:

>
> 1. Introduction
>
> This patch improves the hpet based guest clock in terms of drift and monotonicity. Prior to this work the drift with hpet was greater than 2%, far above the .05% limit for ntp to synchronize. With this code, the drift ranges from .001% to .0033% depending on guest and physical platform.
>
> Using hpet allows guest operating systems to provide monotonic time to their applications.
> Time sources other than hpet are not monotonic because of their reliance on tsc, which is not synchronized across physical processors.
>
> Windows 2k8-64 and many Linux guests are supported with two policies, one for guests that handle missed clock interrupts and the other for guests that require the correct number of interrupts.
>
> Guests may use hpet for the timing source even if the physical platform has no visible hpet. Migration is supported between physical machines which differ in physical hpet visibility.
>
> Most of the changes are in hpet.c. Two general facilities are added to track interrupt progress. The ideas here and the facilities would be useful in vpt.c, for other time sources, though no attempt is made here to improve vpt.c.
>
> The following sections discuss hpet dependencies, interrupt delivery policies, live migration, test results, and relation to recent work with monotonic time.
>
>
> 2. Virtual Hpet dependencies
>
> The virtual hpet depends on the ability to read the physical or simulated (see discussion below) hpet. For timekeeping, the virtual hpet also depends on two new interrupt notification facilities to implement its policies for interrupt delivery.
>
> 2.1. Two modes of low-level hpet main counter reads.
>
> In this implementation, the virtual hpet reads with read_64_main_counter(), exported by time.c, either the real physical hpet main counter register directly or a "simulated" hpet main counter.
>
> The simulated mode uses a monotonic version of get_s_time() (NOW()), where the last time value is returned whenever the current time value is less than the last time value. In simulated mode, since it is layered on s_time, the underlying hardware can be hpet or some other device. The frequency of the main counter in simulated mode is the same as the standard physical hpet frequency, allowing live migration between nodes that are configured differently.
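
A rough sketch of the simulated-mode read described in section 2.1, in Python rather than the patch's C. The class name, the read_ns callable standing in for get_s_time(), and the 14.31818 MHz frequency are assumptions for illustration only; the write-up says just that the simulated counter runs at "the standard physical hpet frequency" and does not show the actual read_64_main_counter() code.

```python
class SimulatedMainCounter:
    """Monotonic counter layered on a nanosecond time source (illustrative)."""

    def __init__(self, read_ns, freq_hz=14_318_180):
        self._read_ns = read_ns    # stand-in for get_s_time(): returns nanoseconds
        self._freq_hz = freq_hz    # assumed counter frequency, not from the patch
        self._last_ns = 0          # last time value handed out

    def read(self):
        now = self._read_ns()
        # If the time source appears to step backwards, return the previous
        # value instead, so callers never see the counter decrease.
        if now < self._last_ns:
            now = self._last_ns
        self._last_ns = now
        # Scale nanoseconds to counter ticks with integer arithmetic.
        return now * self._freq_hz // 1_000_000_000


# The second sample steps backwards, so the counter repeats its previous value.
samples = iter([1000, 900, 2000])
counter = SimulatedMainCounter(lambda: next(samples))
print([counter.read() for _ in range(3)])  # [14, 14, 28]
```

The repeated value in the middle is the price of clamping: consecutive reads can return the same count whenever the underlying time source goes backwards or has not advanced by a full counter tick.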
>
> If the physical platform does not have an hpet device, or if xen is configured not to use the device, then the simulated method is used. If there is a physical hpet device, and xen has initialized it, then either simulated or physical mode can be used. This is governed by a boot time option, hpet-avoid. Setting this option to 1 gives the simulated mode and 0 the physical mode. The default is physical mode.
>
> A disadvantage of the physical mode is that it may take longer to read the device than in simulated mode. On some platforms the cost is about the same (less than 250 nsec) for physical and simulated modes, while on others physical cost is much higher than simulated. A disadvantage of the simulated mode is that it can return the same value for the counter in consecutive calls.
>
> 2.2. Interrupt notification facilities.
>
> Two interrupt notification facilities are introduced, one is hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().
>
> The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the vioapic. hvm_isa_irq_assert_cb allows a callback to be passed along to vioapic_deliver() and this callback is called with a mask of the vcpus which will get the interrupt. This callback is made before any vcpus receive an interrupt.
>
> Vhpet uses hvm_register_intr_en_notif() to register a handler for a particular vector that will be called when that vector is injected in [vmx,svm]_intr_assist() and also when the guest finishes handling the interrupt. Here finished is defined as the point when the guest re-enables interrupts or lowers the tpr value. EOI is not used as the end of interrupt as this is sometimes returned before the interrupt handler has done its work. A flag is passed to the handler indicating whether this is the injection point (post = 1) or the interrupt finished (post = 0) point. The need for the finished point callback is discussed in the missed ticks policy section.
>
> To prevent a possible early trigger of the finished callback, intr_en_notif logic has a two stage arm, the first at injection (hvm_intr_en_notif_arm()) and the second when interrupts are seen to be disabled (hvm_intr_en_notif_disarm()). Once fully armed, re-enabling interrupts will cause hvm_intr_en_notif_disarm() to make the end of interrupt callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
>
> 3. Interrupt delivery policies
>
> The existing hpet interrupt delivery is preserved. This includes vcpu round robin delivery used by Linux and broadcast delivery used by Windows.
>
> There are two policies for interrupt delivery, one for Windows 2k8-64 and the other for Linux. The Linux policy takes advantage of the (guest) Linux missed tick and offset calculations and does not attempt to deliver the right number of interrupts. The Windows policy delivers the correct number of interrupts, even if sometimes much closer to each other than the period. The policies are similar to those in vpt.c, though there are some important differences.
>
> Policies are selected with an HVMOP_set_param hypercall with index HVM_PARAM_TIMER_MODE. Two new values are added, HVM_HPET_guest_computes_missed_ticks and HVM_HPET_guest_does_not_compute_missed_ticks. The reason that two new ones are added is that in some guests (32bit Linux) a no-missed policy is needed for clock sources other than hpet and a missed ticks policy for hpet. It was felt that there would be less confusion by simply introducing the two hpet policies.
>
> 3.1. The missed ticks policy
>
> The Linux clock interrupt handler for hpet calculates missed ticks and offset using the hpet main counter. The algorithm works well when the time since the last interrupt is greater than or equal to a period and poorly otherwise.
>
> The missed ticks policy ensures that no two clock interrupts are delivered to the guest at a time interval less than a period.
> A time stamp (hpet main counter value) is recorded (by a callback registered with hvm_register_intr_en_notif) when Linux finishes handling the clock interrupt. Then, ensuing interrupts are delivered to the vioapic only if the current main counter value is a period greater than when the last interrupt was handled.
>
> Tests showed a significant improvement in clock drift with end of interrupt time stamps versus beginning of interrupt[1]. It is believed that the reason for the improvement is that the clock interrupt handler goes for a spinlock and can be therefore delayed in its processing. Furthermore, the main counter is read by the guest under the lock. The net effect is that if we time stamp injection, we can get the difference in time between successive interrupt handler lock acquisitions to be less than the period.
>
> 3.2. The no-missed ticks policy
>
> Windows 2k8-64 keeps very poor time with the missed ticks policy. So the no-missed ticks policy was developed. In the no-missed ticks policy we deliver the correct number of interrupts, even if they are spaced less than a period apart (when catching up).
>
> Windows 2k8-64 uses a broadcast mode in the interrupt routing such that all vcpus get the clock interrupt. The best Windows drift performance was achieved when the policy code ensured that all the previous interrupts (on the various vcpus) had been injected before injecting the next interrupt to the vioapic.
>
> The policy code works as follows. It uses the hvm_isa_irq_assert_cb() to record the vcpus to be interrupted in h->hpet.pending_mask. Then, in the callback registered with hvm_register_intr_en_notif() at post=1 time it clears the current vcpu in the pending_mask. When the pending_mask is clear it decrements hpet.intr_pending_nr and if intr_pending_nr is still non-zero posts another interrupt to the ioapic with hvm_isa_irq_assert_cb(). Intr_pending_nr is incremented in hpet_route_decision_not_missed_ticks().
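
The two policies in sections 3.1 and 3.2 can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the code in hpet.c: the class and method names are hypothetical, only pending_mask and intr_pending_nr mirror fields named in the write-up, and the rule that the first interrupt is posted when nothing is in flight is my assumption about how the counter gets started.

```python
class MissedTicksGate:
    """Missed ticks policy: never deliver two ticks less than a period apart."""

    def __init__(self, period):
        self.period = period        # tick period in main-counter units
        self.last_handled = None    # counter value when the guest last finished

    def handler_finished(self, counter):
        # Models the post=0 callback: stamp the end of interrupt handling.
        self.last_handled = counter

    def may_deliver(self, counter):
        # Deliver only if at least one period has passed since that stamp.
        if self.last_handled is None:
            return True
        return counter - self.last_handled >= self.period


class NoMissedTicksPolicy:
    """No-missed ticks policy: deliver every tick, one broadcast at a time."""

    def __init__(self, vcpu_mask):
        self.vcpu_mask = vcpu_mask  # vcpus that receive each broadcast
        self.pending_mask = 0       # vcpus still waiting for injection
        self.intr_pending_nr = 0    # ticks owed to the guest
        self.posted = 0             # interrupts posted so far (for inspection)

    def _post(self):
        # Models posting to the ioapic and recording the target vcpus.
        self.pending_mask = self.vcpu_mask
        self.posted += 1

    def period_elapsed(self):
        self.intr_pending_nr += 1
        if self.intr_pending_nr == 1:   # nothing in flight, so post now
            self._post()

    def injected(self, vcpu):
        # Models the post=1 callback for one vcpu.
        bit = 1 << vcpu
        if not self.pending_mask & bit:
            return                      # nothing outstanding for this vcpu
        self.pending_mask &= ~bit
        if self.pending_mask == 0:      # every vcpu has taken this tick
            self.intr_pending_nr -= 1
            if self.intr_pending_nr > 0:
                self._post()            # catch up with the next owed tick


# Three periods pass while a 2-vcpu guest is descheduled, then it runs.
policy = NoMissedTicksPolicy(vcpu_mask=0b11)
for _ in range(3):
    policy.period_elapsed()
for _ in range(3):
    policy.injected(0)
    policy.injected(1)
print(policy.posted, policy.intr_pending_nr)  # 3 0
```

In this model the next catch-up interrupt is posted only after both vcpus have taken the previous one, which is the ordering the write-up credits with the best Windows drift results.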
>
> The missed ticks policy intr_en_notif callback also uses the pending_mask method. So even though Linux does not broadcast its interrupts, the code could handle it if it did. In this case the end of interrupt time stamp is made when the pending_mask is clear.
>
> 4. Live Migration
>
> Live migration with hpet preserves the current offset of the guest clock with respect to ntp. This is accomplished by migrating all of the state in the h->hpet data structure in the usual way. The hp->mc_offset is recalculated on the receiving node so that the guest sees a continuous hpet main counter.
>
> Code has been added to xc_domain_save.c to send a small message after the domain context is sent. The content of the message is the physical tsc timestamp, last_tsc, read just before the message is sent. When the last_tsc message is received in xc_domain_restore.c, another physical tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the domain structure as last_tsc_sender and first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext is called so that hpet_load has access to these time stamps. Hpet_load uses the timestamps to account for the time spent saving and loading the domain context. With this technique, the only neglected time is the time spent sending a small network message.
>
> 5. Test Results
>
> Some recent test results are:
>
> 5.1 Linux 4u6-64 and Windows 2k8-64 load test.
> Duration: 70 hours.
> Test date: 6/2/08
> Loads: usex -b48 on Linux; burn-in on Windows
> Guest vcpus: 8 for Linux; 2 for Windows
> Hardware: 8 physical cpu AMD
> Clock drift: Linux: .0012% Windows: .009%
>
> 5.2 Linux 4u6-64, Linux 4u4-64, and Windows 2k8-64 no-load test
> Duration: 23 hours.
> Test date: 6/3/08
> Loads: none
> Guest vcpus: 8 for each Linux; 2 for Windows
> Hardware: 4 physical cpu AMD
> Clock drift: Linux: .033% Windows: .019%
>
> 6. Relation to recent work in xen-unstable
>
> There is a similarity between hvm_get_guest_time() in xen-unstable and read_64_main_counter() in this code. However, read_64_main_counter() is more tuned to the needs of hpet.c. It has no "set" operation, only the get. It isolates the mode, physical or simulated, in read_64_main_counter() itself. It uses no vcpu or domain state as it is a physical entity, in either mode. And it provides a real physical mode for every read for those applications that desire this.
>
> 7. Conclusion
>
> The virtual hpet is improved by this patch in terms of accuracy and monotonicity. Tests performed to date verify this and more testing is under way.
>
> 8. Future Work
>
> Testing with Windows Vista will be performed soon. The reason for accuracy variations on different platforms using the physical hpet device will be investigated. Additional overhead measurements on simulated vs physical hpet mode will be made.
>
> Footnotes:
>
> 1. I don't recall the accuracy improvement with end of interrupt stamping, but it was significant, perhaps better than two to one improvement. It would be a very simple matter to re-measure the improvement as the facility can call back at injection time as well.
>
>
> Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
> Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel