RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
OK, I can confirm that without Dave's patch RHEL4- and RHEL5-based
64-bit uni-p kernels gain time when hpet is the clocksource.
But WHOA with vcpus=2, el5u1-32 time suddenly goes crazy when
domains are added whereas it seems fine when vcpus=1.
All my testing so far has been on 3.1.3, so I am going to redo it
on xen-unstable, first without Dave's patch then with.

> Is there one setting for all Linux guests that makes them
> choose hpet? Is it "clock=hpet clocksource=hpet"?
> I know you wrote at length about this before.

In the hvm configuration file:

hpet=1
acpi=1

(Note, acpi unspecified works too as 1 appears to be the default;
but hpet=1 is ignored if acpi=0)

In the kernel command line of the hvm domain (e.g. in grub.conf):

clock=hpet notsc nopmtimer

(Note, a different set of kernel parameters is necessary for
each kernel but because the kernel either ignores or gives
harmless warnings for invalid parameters, this set should always
result in hpet being selected as the clocksource, at least
on all RHEL4- and RHEL5-based kernels I've tested.)

> -----Original Message-----
> From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
> Sent: Wednesday, June 11, 2008 7:58 AM
> To: dan.magenheimer@xxxxxxxxxx
> Cc: Keir Fraser; xen-devel; Ben Guthro; Dave Winchell
> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
>
> Dan Magenheimer wrote:
>
> >>In EL5u1-32 however it looks like the fractions are accounted
> >>for. Indeed the EL5u1-32 "lost tick handling" code resembles
> >>the Linux/ia64 code which is what I've always assumed was
> >>the "missed tick" model. In this case, I think no policy
> >>is necessary and the measured skew should be identical to
> >>any physical hpet skew. I'll have to test this hypothesis though.
> >
> >I've tested this hypothesis and it seems to hold true.
> >This means the existing (unpatched) hpet code works fine
> >on EL5-32bit (vcpus=1) when hpet is the clocksource,
> >even when the machine is overcommitted.
> >A second hypothesis still needs to be tested that Dave's patch
> >will not make this worse.
>
> Interesting, thanks for pointing this out and confirming.
>
> >(Note that per previous discussion, my EL5u1-32bit guest
> >running on an Intel dual-core physical box chose tsc as
> >the best clocksource and I had to override it with
> >clock=hpet in the kernel command line.)
>
> Is there one setting for all Linux guests that makes them
> choose hpet? Is it "clock=hpet clocksource=hpet"?
> I know you wrote at length about this before.
>
> >>Yes, that makes sense and concurs with what I remember from
> >>the EL4u5-32 code. If this is true, one would expect the
> >>default "no missed tick" policy to see time moving faster
> >>than an external source -- the first missed tick delivered
> >>after a long sleep would "catch up" and then the remainder
> >>would each add another tick.
> >
> >Indeed with the existing (unpatched) hpet code, time is
> >running faster on EL4u5-32 (vcpus=1, when overcommitted).
> >So Dave's patch is definitely needed here.
>
> It's good to get the verification of this.
>
> thanks,
> Dave
>
> >Will try 64-bit next.
> >
> >Dan
> >
> >>-----Original Message-----
> >>From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
> >>Sent: Monday, June 09, 2008 9:21 PM
> >>To: 'Dave Winchell'; 'Keir Fraser'
> >>Cc: 'xen-devel'; 'Ben Guthro'
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>>I'll tell you what I recall about this. Tomorrow I'll check the
> >>>guest code to verify. I think that Linux declares a full tick,
> >>>even if the interrupt is early. That's the problem.
> >>
> >>Yes, that makes sense and concurs with what I remember from
> >>the EL4u5-32 code.
> >>If this is true, one would expect the
> >>default "no missed tick" policy to see time moving faster
> >>than an external source -- the first missed tick delivered
> >>after a long sleep would "catch up" and then the remainder
> >>would each add another tick.
> >>
> >>>On the other hand, if the interrupt is late it in effect declares
> >>>a tick plus fraction. If it just declared the fraction in the
> >>>first place,
> >>>we could deliver the interrupts whenever we wanted.
> >>
> >>My read of the EL4u5-32 code is that the fraction is discarded
> >>and a new tick period commences at "now", so the fractions
> >>eventually accumulate as lost time.
> >>
> >>In EL5u1-32 however it looks like the fractions are accounted
> >>for. Indeed the EL5u1-32 "lost tick handling" code resembles
> >>the Linux/ia64 code which is what I've always assumed was
> >>the "missed tick" model. In this case, I think no policy
> >>is necessary and the measured skew should be identical to
> >>any physical hpet skew. I'll have to test this hypothesis though.
> >>
> >>-----Original Message-----
> >>From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> >>[mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx]On Behalf Of
> >>Dave Winchell
> >>Sent: Monday, June 09, 2008 5:35 PM
> >>To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
> >>Cc: Dave Winchell; xen-devel; Ben Guthro
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>>>The Linux policy is more subtle, but is required to go
> >>>>from .1% to .03%.
> >>>
> >>>Thanks for the good documentation which I hadn't thoroughly
> >>>read until now.
> >>>I now understand that the essence of your
> >>>hpet missed ticks policy is to ensure that ticks are never
> >>>delivered too close together. But I'm trying to understand
> >>>WHY your patch works, in other words, what problem it is
> >>>countering.
> >>
> >>I'll tell you what I recall about this.
> >>Tomorrow I'll check the
> >>guest code to verify. I think that Linux declares a full tick,
> >>even if the interrupt is early. That's the problem.
> >>On the other hand, if the interrupt is late it in effect declares
> >>a tick plus fraction. If it just declared the fraction in the
> >>first place,
> >>we could deliver the interrupts whenever we wanted.
> >>
> >>It's really not that different than the missed ticks policy in vpt.c
> >>except that there the period in vpt.c is based on start of interrupt
> >>and I have improved that with end-of-interrupt as described
> >>in the patch note.
> >>
> >>I don't recall what prompted me to try end-of-interrupt,
> >>but I saw a significant improvement. I may have been running
> >>a monotonicity test at the same time to explain the lock
> >>contention mentioned in the write-up.
> >>
> >>> I care about this for more reasons than just
> >>>because it is interesting: (1) I'd like to feel confident that
> >>>it is fixing a bug rather than just a symptom of a bug;
> >>>and (2) I wonder how universally it is applicable.
> >>
> >>It's worked well with my small set of guests. You and our
> >>QA are going to tell us about the wider set. It doesn't
> >>matter if guest A handles interrupts closely spaced or not,
> >>just whether it handles them far apart. So it should be pretty
> >>universal with guests that really handle missed ticks.
> >>I think it's interesting that some 32bit Linux guests handle
> >>missed ticks for hpet.
> >>
> >>>I see from code examination in mark_offset_hpet() in
> >>>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> >>>the correction for lost ticks is just plain wrong in
> >>>a virtual environment. (Suppose for example that a virtual
> >>>tick was delivered every 1.999*hpet_tick... I think
> >>>the clock would be off by 50%!) Is this the bug that
> >>>is being "countered" by your policy?
> >>
> >>I haven't looked at that code, perhaps.
> >>I'll check it tomorrow.
> >>
> >>>However, the lost tick handling in RHEL5u1/kernel/timer.c
> >>>(which I think is used also for hpet) is much better
> >>>so I am eager to find out if your policy works there
> >>>too.
> >>>If the hpet missed tick policy works for both, though,
> >>>I should be happy, though I wonder about upstream kernels
> >>>(e.g. the trend toward tickless).
> >>
> >>I wasn't aware of this trend. If it's robust, however, it should
> >>handle late interrupts ...
> >>
> >>>That said, I'd rather
> >>>see this get into Xen 3.3 and worry about upstream kernels
> >>>later :-)
> >>
> >>Regards,
> >>Dave
> >>
> >>-----Original Message-----
> >>From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
> >>Sent: Mon 6/9/2008 6:02 PM
> >>To: Dave Winchell; Keir Fraser
> >>Cc: Ben Guthro; xen-devel
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>>The Linux policy is more subtle, but is required to go
> >>>from .1% to .03%.
> >>
> >>Thanks for the good documentation which I hadn't thoroughly
> >>read until now. I now understand that the essence of your
> >>hpet missed ticks policy is to ensure that ticks are never
> >>delivered too close together. But I'm trying to understand
> >>WHY your patch works, in other words, what problem it is
> >>countering. I care about this for more reasons than just
> >>because it is interesting: (1) I'd like to feel confident that
> >>it is fixing a bug rather than just a symptom of a bug;
> >>and (2) I wonder how universally it is applicable.
> >>
> >>I see from code examination in mark_offset_hpet() in
> >>RHEL4u5/arch/i386/kernel/timers/timer_hpet.c, that
> >>the correction for lost ticks is just plain wrong in
> >>a virtual environment. (Suppose for example that a virtual
> >>tick was delivered every 1.999*hpet_tick... I think
> >>the clock would be off by 50%!) Is this the bug that
> >>is being "countered" by your policy?
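The 50% figure above can be reproduced with a toy model. This is only a sketch under stated assumptions, not the RHEL4 mark_offset_hpet() source: it assumes the handler credits one tick per interrupt plus (whole periods elapsed - 1) lost ticks, and that the fractional remainder is discarded because the next period starts at "now" (the reading of the EL4u5-32 code given earlier in this thread). All function names are hypothetical.

```python
def guest_ticks_credited(interval_in_ticks: float, interrupts: int) -> int:
    """Ticks the guest credits under the assumed lost-tick model."""
    credited = 0
    for _ in range(interrupts):
        whole_periods = int(interval_in_ticks)  # fraction is discarded
        lost = max(whole_periods - 1, 0)        # assumed lost-tick correction
        credited += 1 + lost                    # a full tick is always declared
    return credited


def relative_error(interval_in_ticks: float, interrupts: int = 1000) -> float:
    """(guest time - real time) / real time, measured in hpet_tick units."""
    real = interval_in_ticks * interrupts
    credited = guest_ticks_credited(interval_in_ticks, interrupts)
    return (credited - real) / real


if __name__ == "__main__":
    for interval in (0.5, 1.0, 1.999, 2.0):
        print(f"interval = {interval} ticks, error = {relative_error(interval):+.2%}")
```

Under these assumptions an interrupt every 1.999 periods loses about half of real time, while early interrupts (every 0.5 period) make the guest clock run fast, which matches the "Linux declares a full tick, even if the interrupt is early" description.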
> >>
> >>However, the lost tick handling in RHEL5u1/kernel/timer.c
> >>(which I think is used also for hpet) is much better
> >>so I am eager to find out if your policy works there
> >>too.
> >>
> >>If the hpet missed tick policy works for both, though,
> >>I should be happy, though I wonder about upstream kernels
> >>(e.g. the trend toward tickless). That said, I'd rather
> >>see this get into Xen 3.3 and worry about upstream kernels
> >>later :-)
> >>
> >>-----Original Message-----
> >>From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
> >>Sent: Sunday, June 08, 2008 2:32 PM
> >>To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
> >>Cc: Ben Guthro; xen-devel; Dave Winchell
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>Hi Dan,
> >>
> >>>While I am fully supportive of offering hardware hpet as an option
> >>>for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> >>>surprised by your preliminary results; the most obvious conclusion
> >>>is that Xen system time is losing time at the rate of 1000 PPM
> >>>though it's possible there's a bug somewhere else in the "time
> >>>stack". Your Windows result is jaw-dropping and inexplicable,
> >>>though I have to admit ignorance of how Windows manages time.
> >>
> >>I think xen system time is fine. You have to add the interrupt
> >>delivery policies described in the write-up for the patch to get
> >>accurate timekeeping in the guest.
> >>
> >>The windows policy is obvious and results in a large improvement
> >>in accuracy. The Linux policy is more subtle, but is required to go
> >>from .1% to .03%.
> >>
> >>>I think with my recent patch and hpet=1 (essentially the same as
> >>>your emulated hpet), hvm guest time should track Xen system time.
> >>>I wonder if domain0 (which if I understand correctly is directly
> >>>using Xen system time) is also seeing an error of .1%?
> >>>Also
> >>>I wonder for the skew you are seeing (in both hvm guests and
> >>>domain0) is time moving too fast or too slow?
> >>
> >>I don't recall the direction. I can look it up in my notes at work
> >>tomorrow.
> >>
> >>>Although hwhpet=1 is a fine alternative in many cases, it may
> >>>be unavailable on some systems and may cause significant performance
> >>>issues on others. So I think we will still need to track down
> >>>the poor accuracy when hwhpet=0.
> >>
> >>Our patch is accurate to < .03% using the physical hpet mode or
> >>the simulated mode.
> >>
> >>>And if for some reason
> >>>Xen system time can't be made accurate enough (< 0.05%), then
> >>>I think we should consider building Xen system time itself on
> >>>top of hardware hpet instead of TSC... at least when Xen discovers
> >>>a capable hpet.
> >>
> >>In our experience, Xen system time is accurate enough now.
> >>
> >>>One more thought... do you know the accuracy of the TSC crystals
> >>>on your test systems? I posted a patch awhile ago that was
> >>>intended to test that, though I guess it was only testing skew
> >>>of different TSCs on the same system, not TSCs against an
> >>>external time source.
> >>
> >>I do not know the tsc accuracy.
> >>
> >>>Or maybe there's a computation error somewhere in the hvm hpet
> >>>scaling code? Hmmm...
> >>
> >>Regards,
> >>Dave
> >>
> >>-----Original Message-----
> >>From: Dan Magenheimer [mailto:dan.magenheimer@xxxxxxxxxx]
> >>Sent: Fri 6/6/2008 4:29 PM
> >>To: Dave Winchell; Keir Fraser
> >>Cc: Ben Guthro; xen-devel
> >>Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>
> >>Dave --
> >>
> >>Thanks much for posting the preliminary results!
> >>
> >>While I am fully supportive of offering hardware hpet as an option
> >>for hvm guests (let's call it hwhpet=1 for shorthand), I am very
> >>surprised by your preliminary results; the most obvious conclusion
> >>is that Xen system time is losing time at the rate of 1000 PPM
> >>though it's possible there's a bug somewhere else in the "time
> >>stack". Your Windows result is jaw-dropping and inexplicable,
> >>though I have to admit ignorance of how Windows manages time.
> >>
> >>I think with my recent patch and hpet=1 (essentially the same as
> >>your emulated hpet), hvm guest time should track Xen system time.
> >>I wonder if domain0 (which if I understand correctly is directly
> >>using Xen system time) is also seeing an error of .1%? Also
> >>I wonder for the skew you are seeing (in both hvm guests and
> >>domain0) is time moving too fast or too slow?
> >>
> >>Although hwhpet=1 is a fine alternative in many cases, it may
> >>be unavailable on some systems and may cause significant performance
> >>issues on others. So I think we will still need to track down
> >>the poor accuracy when hwhpet=0. And if for some reason
> >>Xen system time can't be made accurate enough (< 0.05%), then
> >>I think we should consider building Xen system time itself on
> >>top of hardware hpet instead of TSC... at least when Xen discovers
> >>a capable hpet.
> >>
> >>One more thought... do you know the accuracy of the TSC crystals
> >>on your test systems? I posted a patch awhile ago that was
> >>intended to test that, though I guess it was only testing skew
> >>of different TSCs on the same system, not TSCs against an
> >>external time source.
> >>
> >>Or maybe there's a computation error somewhere in the hvm hpet
> >>scaling code? Hmmm...
> >>
> >>Thanks,
> >>Dan
> >>
> >>>-----Original Message-----
> >>>From: Dave Winchell [mailto:dwinchell@xxxxxxxxxxxxxxx]
> >>>Sent: Friday, June 06, 2008 1:33 PM
> >>>To: dan.magenheimer@xxxxxxxxxx; Keir Fraser
> >>>Cc: Ben Guthro; xen-devel; Dave Winchell
> >>>Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>>
> >>>Dan, Keir:
> >>>
> >>>Preliminary test results indicate an error of .1% for Linux 64 bit
> >>>guests configured for hpet with xen-unstable as is. As we have
> >>>discussed many times, the ntp requirement is .05%.
> >>>Tests on the patch we just submitted for hpet have indicated errors
> >>>of .0012% on this platform under similar test conditions and .03% on
> >>>other platforms.
> >>>
> >>>Windows vista64 has an error of 11% using hpet with the
> >>>xen-unstable bits.
> >>>In an overnight test with our hpet patch, the Windows vista
> >>>error was .008%.
> >>>
> >>>The tests are with two or three guests on a physical node, all under
> >>>load, and with the ratio of vcpus to phys cpus > 1.
> >>>
> >>>I will continue to run tests over the next few days.
> >>>
> >>>thanks,
> >>>Dave
> >>>
> >>>Dan Magenheimer wrote:
> >>>
> >>>>Hi Dave and Ben --
> >>>>
> >>>>When running tests on xen-unstable (without your patch), please ensure
> >>>>that hpet=1 is set in the hvm config and also I think that when hpet
> >>>>is the clocksource on RHEL4-32, the clock IS resilient to missed ticks
> >>>>so timer_mode should be 2 (vs when pit is the clocksource on RHEL4-32,
> >>>>all clock ticks must be delivered and so timer_mode should be 0).
> >>>>
> >>>>Per
> >>>>http://lists.xensource.com/archives/html/xen-devel/2008-06/msg00098.html
> >>>>it's my intent to clean this up, but I won't get to it until next week.
> >>>>
> >>>>Thanks,
> >>>>Dan
> >>>>
> >>>> -----Original Message-----
> >>>> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx
> >>>> [mailto:xen-devel-bounces@xxxxxxxxxxxxxxxxxxx]On Behalf Of Dave Winchell
> >>>> Sent: Friday, June 06, 2008 4:46 AM
> >>>> To: Keir Fraser; Ben Guthro; xen-devel
> >>>> Cc: dan.magenheimer@xxxxxxxxxx; Dave Winchell
> >>>> Subject: RE: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>>>
> >>>> Keir,
> >>>>
> >>>> I think the changes are required. We'll run some tests today so
> >>>> that we have some data to talk about.
> >>>>
> >>>> -Dave
> >>>>
> >>>> -----Original Message-----
> >>>> From: xen-devel-bounces@xxxxxxxxxxxxxxxxxxx on behalf of Keir Fraser
> >>>> Sent: Fri 6/6/2008 4:58 AM
> >>>> To: Ben Guthro; xen-devel
> >>>> Cc: dan.magenheimer@xxxxxxxxxx
> >>>> Subject: Re: [Xen-devel] [PATCH 0/2] Improve hpet accuracy
> >>>>
> >>>> Are these patches needed now the timers are built on Xen system
> >>>> time rather than host TSC? Dan has reported much better time-keeping
> >>>> with his patch checked in, and it's for sure a lot less invasive
> >>>> than this patchset.
> >>>>
> >>>> -- Keir
> >>>>
> >>>> On 5/6/08 15:59, "Ben Guthro" <bguthro@xxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> > 1. Introduction
> >>>> >
> >>>> > This patch improves the hpet based guest clock in terms of drift
> >>>> > and monotonicity.
> >>>> > Prior to this work the drift with hpet was greater than 2%, far
> >>>> > above the .05% limit for ntp to synchronize.
> >>>> > With this code, the drift ranges from .001% to .0033%
> >>>> > depending on guest and physical platform.
> >>>> >
> >>>> > Using hpet allows guest operating systems to provide monotonic
> >>>> > time to their applications. Time sources other than hpet are not
> >>>> > monotonic because of their reliance on tsc, which is not
> >>>> > synchronized across physical processors.
> >>>> >
> >>>> > Windows 2k864 and many Linux guests are supported with two
> >>>> > policies, one for guests that handle missed clock interrupts and
> >>>> > the other for guests that require the correct number of interrupts.
> >>>> >
> >>>> > Guests may use hpet for the timing source even if the physical
> >>>> > platform has no visible hpet. Migration is supported between
> >>>> > physical machines which differ in physical hpet visibility.
> >>>> >
> >>>> > Most of the changes are in hpet.c. Two general facilities are
> >>>> > added to track interrupt progress. The ideas here and the
> >>>> > facilities would be useful in vpt.c, for other time sources,
> >>>> > though no attempt is made here to improve vpt.c.
> >>>> >
> >>>> > The following sections discuss hpet dependencies, interrupt
> >>>> > delivery policies, live migration, test results, and relation to
> >>>> > recent work with monotonic time.
> >>>> >
> >>>> >
> >>>> > 2. Virtual Hpet dependencies
> >>>> >
> >>>> > The virtual hpet depends on the ability to read the physical or
> >>>> > simulated (see discussion below) hpet.
> >>>> > For timekeeping, the virtual hpet also depends on two new
> >>>> > interrupt notification facilities to implement its policies for
> >>>> > interrupt delivery.
> >>>> >
> >>>> > 2.1. Two modes of low-level hpet main counter reads.
> >>>> >
> >>>> > In this implementation, the virtual hpet reads with
> >>>> > read_64_main_counter(), exported by time.c, either the real
> >>>> > physical hpet main counter register directly or a "simulated"
> >>>> > hpet main counter.
> >>>> >
> >>>> > The simulated mode uses a monotonic version of get_s_time()
> >>>> > (NOW()), where the last time value is returned whenever the
> >>>> > current time value is less than the last time value. In simulated
> >>>> > mode, since it is layered on s_time, the underlying hardware
> >>>> > can be hpet or some other device. The frequency of the main
> >>>> > counter in simulated mode is the same as the standard physical
> >>>> > hpet frequency, allowing live migration between nodes that are
> >>>> > configured differently.
> >>>> >
> >>>> > If the physical platform does not have an hpet device, or if xen
> >>>> > is configured not to use the device, then the simulated method is
> >>>> > used. If there is a physical hpet device, and xen has initialized
> >>>> > it, then either simulated or physical mode can be used.
> >>>> > This is governed by a boot time option, hpet-avoid. Setting this
> >>>> > option to 1 gives the simulated mode and 0 the physical mode.
> >>>> > The default is physical mode.
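The simulated mode described above amounts to clamping a clock so that it never runs backwards and then scaling it to hpet ticks. A minimal sketch of that idea follows; it is not the patch's read_64_main_counter(). The class and method names are hypothetical, the time source is assumed to return nanoseconds, and the 14.318180 MHz rate is an assumption (the write-up only says "the standard physical hpet frequency").

```python
from typing import Callable


class SimulatedMainCounter:
    """Monotonic counter layered on an arbitrary nanosecond clock."""

    HPET_HZ = 14_318_180  # assumed counter frequency, not taken from the patch

    def __init__(self, now_ns: Callable[[], int]) -> None:
        self._now_ns = now_ns
        self._last_ns = now_ns()

    def read(self) -> int:
        t = self._now_ns()
        if t < self._last_ns:
            # The clock went backwards: return the last value instead.
            t = self._last_ns
        self._last_ns = t
        # Convert nanoseconds to counter ticks with integer arithmetic.
        return t * self.HPET_HZ // 1_000_000_000


if __name__ == "__main__":
    # A clock that steps backwards once (2_000 ns -> 1_500 ns).
    samples = iter([1_000, 2_000, 1_500, 3_000])
    counter = SimulatedMainCounter(lambda: next(samples))
    print([counter.read() for _ in range(3)])
```

With this clamp, a backwards step in the underlying clock shows up as a repeated counter value rather than a decrease.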
> >>>> >
> >>>> > A disadvantage of the physical mode is that it may take longer to
> >>>> > read the device than in simulated mode. On some platforms the
> >>>> > cost is about the same (less than 250 nsec) for physical and
> >>>> > simulated modes, while on others physical cost is much higher
> >>>> > than simulated.
> >>>> > A disadvantage of the simulated mode is that it can return the
> >>>> > same value for the counter in consecutive calls.
> >>>> >
> >>>> > 2.2. Interrupt notification facilities.
> >>>> >
> >>>> > Two interrupt notification facilities are introduced, one is
> >>>> > hvm_isa_irq_assert_cb() and the other hvm_register_intr_en_notif().
> >>>> >
> >>>> > The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to
> >>>> > the vioapic. hvm_isa_irq_assert_cb allows a callback to be passed
> >>>> > along to vioapic_deliver() and this callback is called with a
> >>>> > mask of the vcpus which will get the interrupt. This callback is
> >>>> > made before any vcpus receive an interrupt.
> >>>> >
> >>>> > Vhpet uses hvm_register_intr_en_notif() to register a handler
> >>>> > for a particular vector that will be called when that vector is
> >>>> > injected in [vmx,svm]_intr_assist() and also when the guest
> >>>> > finishes handling the interrupt. Here finished is defined
> >>>> > as the point when the guest re-enables interrupts or lowers the
> >>>> > tpr value. EOI is not used as the end of interrupt as this is
> >>>> > sometimes returned before the interrupt handler has done its work.
> >>>> > A flag is passed to the handler indicating whether this is the
> >>>> > injection point (post = 1) or the interrupt finished (post = 0)
> >>>> > point.
> >>>> > The need for the finished point callback is discussed in the
> >>>> > missed ticks policy section.
> >>>> >
> >>>> > To prevent a possible early trigger of the finished callback,
> >>>> > intr_en_notif logic has a two stage arm, the first at injection
> >>>> > (hvm_intr_en_notif_arm()) and the second when interrupts are seen
> >>>> > to be disabled (hvm_intr_en_notif_disarm()). Once fully armed,
> >>>> > re-enabling interrupts will cause hvm_intr_en_notif_disarm() to
> >>>> > make the end of interrupt callback. hvm_intr_en_notif_arm() and
> >>>> > hvm_intr_en_notif_disarm() are called by [vmx,svm]_intr_assist().
> >>>> >
> >>>> > 3. Interrupt delivery policies
> >>>> >
> >>>> > The existing hpet interrupt delivery is preserved. This includes
> >>>> > vcpu round robin delivery used by Linux and broadcast delivery
> >>>> > used by Windows.
> >>>> >
> >>>> > There are two policies for interrupt delivery, one for Windows
> >>>> > 2k8-64 and the other for Linux. The Linux policy takes advantage
> >>>> > of the (guest) Linux missed tick and offset calculations and does
> >>>> > not attempt to deliver the right number of interrupts.
> >>>> > The Windows policy delivers the correct number of interrupts,
> >>>> > even if sometimes much closer to each other than the period.
> >>>> > The policies are similar to those in vpt.c, though
> >>>> > there are some important differences.
> >>>> >
> >>>> > Policies are selected with an HVMOP_set_param hypercall with index
> >>>> > HVM_PARAM_TIMER_MODE.
> >>>> > Two new values are added, HVM_HPET_guest_computes_missed_ticks and
> >>>> > HVM_HPET_guest_does_not_compute_missed_ticks. The reason that
> >>>> > two new ones are added is that in some guests (32bit Linux) a
> >>>> > no-missed policy is needed for clock sources other than hpet
> >>>> > and a missed ticks policy for hpet. It was felt that there would
> >>>> > be less confusion by simply introducing the two hpet policies.
> >>>> >
> >>>> > 3.1. The missed ticks policy
> >>>> >
> >>>> > The Linux clock interrupt handler for hpet calculates missed
> >>>> > ticks and offset using the hpet main counter. The algorithm works
> >>>> > well when the time since the last interrupt is greater than
> >>>> > or equal to a period and poorly otherwise.
> >>>> >
> >>>> > The missed ticks policy ensures that no two clock interrupts are
> >>>> > delivered to the guest at a time interval less than a period.
> >>>> > A time stamp (hpet main counter value) is recorded (by a
> >>>> > callback registered with hvm_register_intr_en_notif) when Linux
> >>>> > finishes handling the clock interrupt. Then, ensuing interrupts
> >>>> > are delivered to the vioapic only if the current main
> >>>> > counter value is a period greater than when the last interrupt
> >>>> > was handled.
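The gating rule above can be sketched as a small state machine. This is an illustration only, not code from hpet.c: the class and method names are hypothetical, "a period greater" is read here as "at least one period later", and an interrupt that has been delivered but not yet finished is not modelled.

```python
from typing import Optional


class MissedTicksGate:
    """Deliver a tick only if a full period has passed since the guest
    finished handling the previous one (end-of-interrupt time stamp)."""

    def __init__(self, period: int) -> None:
        self.period = period                      # period in main counter ticks
        self.last_handled: Optional[int] = None   # stamp taken at "finished"

    def should_deliver(self, main_counter: int) -> bool:
        if self.last_handled is None:
            return True  # nothing handled yet, so nothing to space against
        return main_counter - self.last_handled >= self.period

    def on_interrupt_finished(self, main_counter: int) -> None:
        # Called when the guest re-enables interrupts or lowers the tpr.
        self.last_handled = main_counter


if __name__ == "__main__":
    gate = MissedTicksGate(period=100)
    print(gate.should_deliver(0))      # first tick
    gate.on_interrupt_finished(30)     # guest was slow to finish
    print(gate.should_deliver(100))    # only 70 ticks after the stamp
    print(gate.should_deliver(130))    # a full period after the stamp
```

Stamping at the finished point rather than at injection is what keeps a slow handler from seeing two interrupts less than a period apart.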
> >>>> >
> >>>> > Tests showed a significant improvement in clock drift with end
> >>>> > of interrupt time stamps versus beginning of interrupt[1]. It is
> >>>> > believed that the reason for the improvement is that the clock
> >>>> > interrupt handler goes for a spinlock and can therefore be
> >>>> > delayed in its processing. Furthermore, the main counter is read
> >>>> > by the guest under the lock. The net effect is that if we time
> >>>> > stamp injection, we can get the difference in time between
> >>>> > successive interrupt handler lock acquisitions to be less than
> >>>> > the period.
> >>>> >
> >>>> > 3.2. The no-missed ticks policy
> >>>> >
> >>>> > Windows 2k864 keeps very poor time with the missed ticks policy.
> >>>> > So the no-missed ticks policy was developed. In the no-missed
> >>>> > ticks policy we deliver the correct number of interrupts,
> >>>> > even if they are spaced less than a period apart (when catching up).
> >>>> >
> >>>> > Windows 2k864 uses a broadcast mode in the interrupt routing
> >>>> > such that all vcpus get the clock interrupt. The best Windows
> >>>> > drift performance was achieved when the policy code ensured that
> >>>> > all the previous interrupts (on the various vcpus) had been
> >>>> > injected before injecting the next interrupt to the vioapic.
> >>>> >
> >>>> > The policy code works as follows. It uses the
> >>>> > hvm_isa_irq_assert_cb() to record the vcpus to be interrupted in
> >>>> > h->hpet.pending_mask.
> >>>> > Then, in the callback registered with
> >>>> > hvm_register_intr_en_notif() at post=1 time it clears the
> >>>> > current vcpu in the pending_mask.
> >>>> > When the pending_mask is clear it decrements
> >>>> > hpet.intr_pending_nr and if intr_pending_nr is still
> >>>> > non-zero posts another interrupt to the ioapic with
> >>>> > hvm_isa_irq_assert_cb().
> >>>> > Intr_pending_nr is incremented in
> >>>> > hpet_route_decision_not_missed_ticks().
> >>>> >
> >>>> > The missed ticks policy intr_en_notif callback also uses the
> >>>> > pending_mask method. So even though Linux does not broadcast its
> >>>> > interrupts, the code could handle it if it did.
> >>>> > In this case the end of interrupt time stamp is made when the
> >>>> > pending_mask is clear.
> >>>> >
> >>>> > 4. Live Migration
> >>>> >
> >>>> > Live migration with hpet preserves the current offset of the
> >>>> > guest clock with respect to ntp. This is accomplished by
> >>>> > migrating all of the state in the h->hpet data structure
> >>>> > in the usual way. The hp->mc_offset is recalculated on the
> >>>> > receiving node so that the guest sees a continuous hpet main
> >>>> > counter.
> >>>> >
> >>>> > Code has been added to xc_domain_save.c to send a small message
> >>>> > after the domain context is sent. The content of the message is
> >>>> > the physical tsc timestamp, last_tsc, read just before the
> >>>> > message is sent. When the last_tsc message is received in
> >>>> > xc_domain_restore.c, another physical tsc timestamp, cur_tsc, is
> >>>> > read. The two timestamps are loaded into the domain structure as
> >>>> > last_tsc_sender and first_tsc_receiver with hypercalls.
> >>>> > Then xc_domain_hvm_setcontext is called so that hpet_load has
> >>>> > access to these time stamps. Hpet_load uses the timestamps to
> >>>> > account for the time spent saving and loading the domain context.
> >>>> > With this technique, the only neglected time is the time spent
> >>>> > sending a small network message.
> >>>> >
> >>>> > 5. Test Results
> >>>> >
> >>>> > Some recent test results are:
> >>>> >
> >>>> > 5.1 Linux 4u664 and Windows 2k864 load test.
> >>>> >     Duration:    70 hours.
> >>>> >     Test date:   6/2/08
> >>>> >     Loads:       usex -b48 on Linux; burn-in on Windows
> >>>> >     Guest vcpus: 8 for Linux; 2 for Windows
> >>>> >     Hardware:    8 physical cpu AMD
> >>>> >     Clock drift: Linux: .0012%  Windows: .009%
> >>>> >
> >>>> > 5.2 Linux 4u664, Linux 4u464, and Windows 2k864 no-load test
> >>>> >     Duration:    23 hours.
> >>>> >     Test date:   6/3/08
> >>>> >     Loads:       none
> >>>> >     Guest vcpus: 8 for each Linux; 2 for Windows
> >>>> >     Hardware:    4 physical cpu AMD
> >>>> >     Clock drift: Linux: .033%  Windows: .019%
> >>>> >
> >>>> > 6. Relation to recent work in xen-unstable
> >>>> >
> >>>> > There is a similarity between hvm_get_guest_time() in xen-unstable
> >>>> > and read_64_main_counter() in this code. However,
> >>>> > read_64_main_counter() is more tuned to the needs of hpet.c. It has
> >>>> > no "set" operation, only the get. It isolates the mode, physical or
> >>>> > simulated, in read_64_main_counter() itself. It uses no vcpu or
> >>>> > domain state as it is a physical entity, in either mode. And it
> >>>> > provides a real physical mode for every read for those applications
> >>>> > that desire this.
> >>>> >
> >>>> > 7. Conclusion
> >>>> >
> >>>> > The virtual hpet is improved by this patch in terms of accuracy and
> >>>> > monotonicity.
> >>>> > Tests performed to date verify this and more testing is under way.
> >>>> >
> >>>> > 8. Future Work
> >>>> >
> >>>> > Testing with Windows Vista will be performed soon. The reason for
> >>>> > accuracy variations on different platforms using the physical hpet
> >>>> > device will be investigated. Additional overhead measurements on
> >>>> > simulated vs physical hpet mode will be made.
> >>>> >
> >>>> > Footnotes:
> >>>> >
> >>>> > 1. I don't recall the accuracy improvement with end of interrupt
> >>>> >    stamping, but it was significant, perhaps better than two to one
> >>>> >    improvement. It would be a very simple matter to re-measure the
> >>>> >    improvement as the facility can call back at injection time as
> >>>> >    well.
> >>>> >
> >>>> > Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
> >>>> > Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>
> >>>> >
> >>>> > _______________________________________________
> >>>> > Xen-devel mailing list
> >>>> > Xen-devel@xxxxxxxxxxxxxxxxxxx
> >>>> > http://lists.xensource.com/xen-devel
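
For anyone following section 3.2 of the quoted writeup, the pending_mask / intr_pending_nr bookkeeping can be sketched roughly as below. This is a minimal stand-alone illustration, not the code from the patch: the struct and every sketch_* name are hypothetical, the real hvm_isa_irq_assert_cb() and hvm_register_intr_en_notif() callbacks are only mimicked by plain function calls, and the rule that the first interrupt is posted when intr_pending_nr goes from 0 to 1 is my assumption about how the queue gets started.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the h->hpet state described in section 3.2. */
struct sketch_hpet {
    uint64_t pending_mask;     /* vcpus that have not yet taken the current tick */
    unsigned intr_pending_nr;  /* ticks queued and not yet fully delivered */
    uint64_t broadcast_mask;   /* vcpus targeted by each clock interrupt */
    unsigned injected;         /* interrupts posted so far (illustration only) */
};

/* Mimics posting to the ioapic: remember which vcpus must be notified. */
static void sketch_assert_irq(struct sketch_hpet *h)
{
    assert(h->pending_mask == 0);  /* the previous tick was fully delivered */
    h->pending_mask = h->broadcast_mask;
    h->injected++;
}

/* Mimics the routing decision for the no-missed-ticks policy: queue one
 * tick and post it right away only if nothing is in flight (assumption). */
static void sketch_tick(struct sketch_hpet *h)
{
    h->intr_pending_nr++;
    if (h->intr_pending_nr == 1)
        sketch_assert_irq(h);
}

/* Mimics the post=1 notification for one vcpu (vcpu must be below 64). */
static void sketch_eoi_notif(struct sketch_hpet *h, unsigned vcpu)
{
    uint64_t bit = UINT64_C(1) << vcpu;

    if (!(h->pending_mask & bit))
        return;                    /* not waiting on this vcpu */
    h->pending_mask &= ~bit;
    if (h->pending_mask != 0)
        return;                    /* other vcpus still outstanding */

    /* Every targeted vcpu has taken the tick: retire it and, if more
     * ticks are queued, post the next one immediately (catching up). */
    h->intr_pending_nr--;
    if (h->intr_pending_nr != 0)
        sketch_assert_irq(h);
}
```

With two vcpus in broadcast_mask and three queued ticks, the sketch posts exactly three interrupts, and never posts the next one until both vcpus have been notified for the previous one, which is the ordering the writeup credits for the best Windows drift numbers.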