Re: [Xen-devel] [PATCH] per-cpu timer changes
On Tue, May 24, 2005 at 02:20:36AM +0100, Ian Pratt wrote:
> Don,
>
> This is looking good.  To help other people review the patch, it might be
> a good idea to post some of the design discussion we had off list as I
> think the approach will be new to most people.  (Perhaps put some of the
> text in a comment in the hypervisor interface).
>
> As regards the time going backwards messages, if you're seeing small -ve
> deltas, I'm not surprised -- you need to round to some precision as we
> won't be nanosecond accurate.  Experience suggests we'll be good for a
> few 10's of ns with any kind of decent crystal.  We could round to e.g.
> 512ns or 1024ns to make sure.
>
> Best,
> Ian

I am including the email that we exchanged off-list.  I started to edit it,
but decided that something I thought unimportant might be exactly what
others find vital, so I am including all of the email.

The time going backwards happened only occasionally, and it was a BIG jump
backwards.  I tracked it down yesterday to a problem with doing 32-bit
arithmetic in Linux on the tsc values.  For some reason, every 5-20 minutes
xen seems to pause for about 5 seconds.  This causes the tsc to wrap if only
32 bits are used, and the 'time went backwards' message is printed.  I
changed to use 64-bit tsc deltas and have been running since yesterday
afternoon without any 'time went backwards' messages.

I want to do some more cleanup (remove my debugging code) and will post all
my changes to the list this afternoon.
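As a rough illustration of the failure Don describes (assumed names and
numbers, not the actual Linux/Xen code): a TSC delta truncated to 32 bits
cannot hold the roughly 1.5e10 ticks a ~3GHz TSC accumulates during a 5
second pause, so the extrapolated time can fall behind what was previously
returned and the 'time went backwards' check fires; keeping the delta in 64
bits avoids the wrap entirely.

/*
 * Sketch only: show how truncating a TSC delta to 32 bits wraps during a
 * multi-second pause, while a 64-bit delta stays correct.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t last_tsc = 1000000000ULL;              /* TSC at last update      */
    uint64_t now_tsc  = last_tsc + 15000000000ULL;  /* ~5s later on a 3GHz TSC */

    uint32_t delta32 = (uint32_t)(now_tsc - last_tsc);  /* truncated: wraps    */
    uint64_t delta64 = now_tsc - last_tsc;              /* correct             */

    printf("32-bit delta: %u ticks (wrapped)\n", (unsigned)delta32);
    printf("64-bit delta: %llu ticks\n", (unsigned long long)delta64);
    return 0;
}

The small negative deltas Ian mentions are a separate issue; rounding the
returned time down to a power-of-two precision (e.g. t &= ~1023ULL for
1024ns) hides them.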
----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

Bruce Jones/Beaverton/IBM wrote on 04/21/2005 09:07:26 AM:
> John, can you provide some additional technical guidance here?
>
> Ian, Keir: John is the implementor of our Linux changes for Summit
> and understands these issues better than anyone.
>
> I've added Don to the cc: list but he's on vacation this week and
> not reading email.
>
> -- brucej
>
> Ian Pratt <Ian.Pratt@xxxxxxxxxxxx> wrote on 04/20/2005 05:42:47 PM:
> >
> > "Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/20/2005 04:47:44 PM:
> > > > Please could Don write a paragraph explaining why cyclone timer support
> > > > is useful.  Do summit systems have different frequency CPUs in the same
> > > > box?
> > >
> > > Bruce writes:
> > > I can write that paragraph myself.  IBM's high end xSeries systems are
> > > NUMA systems; each node is a separate machine with its own front side
> > > bus, I/O buses, etc...  The chipset provides a cache-coherent interconnect
> > > to allow them to be cabled together into one big system.
> >
> > OK, so even the FSB clocks come from different crystals.
>
> Yes, and the hardware intentionally skews their frequencies, for reasons
> only the chipset designers understand. :)
>
> > > We had a boatload of problems with Linux when we first shipped it, with
> > > time moving around forward and backward for applications.  The processors
> > > in the various nodes run at different frequencies and the on-processor
> > > timers do not run in sync.  We needed to modify Linux to use a system-wide
> > > timer.  Our chipset (code-named Cyclone) provides one; for newer systems
> > > Intel has defined the HPET that we can use.  We need to make similar
> > > changes to Xen.
> >
> > This needs some agreement on the design.
> >
> > My gut feeling is that it should still be possible for guests to use
> > the TSC to calculate the time offset relative to the published Xen
> > system time record (which is updated every couple of seconds).  The TSC
> > calibration should be good enough to mean that the relative drift over
> > the period between records is tiny (and errors can't accumulate beyond
> > the period).
>
> My gut feeling is that your gut feeling is wrong.  We can't ever
> use the TSC on these systems - even a tiny amount of relative drift
> causes problems.
>
> But I'm no expert.  John, this is your cue.  Please join in.
>
> > The 'TSC when time record created' and 'TSC frequency' will have
> > to be per VCPU and updated to reflect the real CPU that the VCPU
> > is running on.
>
> As long as these are virtual and not read using the readTSC instruction,
> we may be OK.
>
> > Ian

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/21/2005 09:24:54 AM:
> > Yes, and the hardware intentionally skews their frequencies,
> > for reasons only the chipset designers understand. :)
>
> It's to be sneaky as regards FCC EMC emissions regulations.
> Some systems even modulate the PCI bus frequency.
>
> > > My gut feeling is that it should still be possible for guests to use
> > > the TSC to calculate the time offset relative to the published Xen
> > > system time record (which is updated every couple of seconds).  The TSC
> > > calibration should be good enough to mean that the relative drift over
> > > the period between records is tiny (and errors can't accumulate beyond
> > > the period).
> >
> > My gut feeling is that your gut feeling is wrong.  We can't
> > ever use the TSC on these systems - even a tiny amount of
> > relative drift causes problems.
>
> It depends on the crystal stability, the accuracy with which the
> calibration is done, and the frequency of publishing new absolute time
> records.
> The latter can be made quite frequent if need be.
>
> I'd much prefer avoiding having to expose linux to the HPET/cyclone by
> hiding it in Xen, and having the guest use TSC extrapolation from the
> time record published by Xen.
> We'd just need to update the current interface to have per-CPU records
> (and TSC frequency calibration).
>
> > But I'm no expert.  John, this is your cue.  Please join in.
>
> > > The 'TSC when time record created' and 'TSC frequency' will have to be
> > > per VCPU and updated to reflect the real CPU that the VCPU is running
> > > on.
> >
> > As long as these are virtual and not read using the readTSC
> > instruction, we may be OK.
>
> Using readTSC should be fine, since we're only using it to extrapolate
> from the last Xen supplied time record, and we've calibrated the
> frequency of the particular CPU we're running on.  We only have to worry
> about rapid clock drift due to sudden temperature changes etc, but even
> then we can just update the calibration frequency periodically.  Using
> this approach we get to keep gettimeofday very fast, and avoid
> complicating the hypervisor API -- it's exactly what we need for
> migrating a domain between physical servers with different frequency
> CPUs.
>
> Ian
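To make the start-of-day calibration Ian describes concrete, here is a
minimal sketch (assumed names throughout; read_platform_ns() stands in for
whatever shared HPET/Cyclone/PIT read the hypervisor would use): each
physical CPU measures how many nanoseconds of shared-timer time elapse per
2^20 of its own TSC ticks, and that ratio is later used to scale TSC deltas.
Re-running it every 30s or so would track slow crystal/temperature drift,
as Ian suggests.

#include <stdint.h>

/* Hypothetical: read the shared platform timer (HPET/Cyclone/PIT) in ns. */
extern uint64_t read_platform_ns(void);

/* Read the full 64-bit TSC on the current CPU (x86 rdtsc). */
static inline uint64_t rdtsc64(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

struct cpu_calibration {
    uint64_t ns_per_2_20_ticks;   /* ns elapsed per 2^20 TSC ticks on this CPU */
};

/* Run on each physical CPU, e.g. calibrate_this_cpu(&cal, 10000000) for 10ms. */
void calibrate_this_cpu(struct cpu_calibration *cal, uint64_t sample_ns)
{
    uint64_t t0   = read_platform_ns();
    uint64_t tsc0 = rdtsc64();

    while (read_platform_ns() - t0 < sample_ns)
        ;   /* spin for the sample interval, measured on the shared timer */

    uint64_t ns  = read_platform_ns() - t0;
    uint64_t tsc = rdtsc64() - tsc0;

    /* Fixed-point ratio: ns per 2^20 ticks, avoiding floating point. */
    cal->ns_per_2_20_ticks = (ns << 20) / tsc;
}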
----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/21/2005 01:12:51 PM:
> > First, forgive my lack of knowledge about Xen.  Since I don't
> > know the details of what you're proposing, let me make a
> > straw-man and let you correct my assumptions.
> >
> > Let's say you're proposing that time be calculated with the
> > following formula:
> >
> >   timeofday = xen_time_base + rdtsc() - xen_last_tsc[CPUNUM]
> >
> > Given a guest domain with two cpus, the issue is managing
> > xen_last_tsc[] and xen_time_base.  For the equation to work,
> > xen_last_tsc[0] must hold the TSC value from CPU0 at exactly
> > the time stored in xen_time_base.  Additionally the same is
> > true of xen_last_tsc[1].
>
> I'm proposing:
>
>   timeofday = round_to_precision( last_xen_time_base[VCPU] +
>                 ( rdtsc() - last_xen_tsc[VCPU] ) * xen_tsc_calibrate[VCPU] )
>
> We update last_xen_time_base and last_xen_tsc on each CPU every second
> or so.
> xen_tsc_calibrate is calculated for each CPU at start of day.  For
> completeness, we could recalculate the calibration every 30s or so to
> cope with crystal temperature drift if we wanted ultimate precision.
>
> > The difficult question is how do you ensure that the two
> > values in xen_last_tsc[] are linked with the time in
> > xen_time_base?  This requires reading the TSC on two cpus at
> > the exact same time.  Additionally, this sync point must
> > happen frequently enough so that the continuing drift between
> > cpus isn't noticed.
>
> Nope, we would set the time_base on each CPU independently, but relative
> to the same timer.
> This could be the cyclone, HPET, or even the PIT if it's possible to read
> the same PIT from any node (though I'm guessing you probably have a PIT
> per node and can't read the remote one).
>
> > Then you'll have to weigh that solution against just using an
> > alternate global timesource like HPET/Cyclone.
>
> I'd prefer to avoid this as it would mean that there'd be a different
> hypervisor API for guests on cyclone/hpet systems vs. normal synchronous
> CPU systems.
> Using the TSC will probably give a lower cost gettimeofday, and we can
> also trap it and emulate if we want to lie to guests about the progress
> of time.
>
> Best,
> Ian
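Rendered as guest-side C for concreteness, Ian's formula might look like the
sketch below.  This is only an illustration: the array names mirror his
pseudo-code, the choice of xen_tsc_calibrate as a fixed-point "ns per 2^20
TSC ticks" ratio and the 1024ns rounding are assumptions, and none of it is
the actual Xen shared-info layout.

#include <stdint.h>

#define NR_VCPUS   4
#define PRECISION  1024ULL               /* round_to_precision() granularity */

/* Per-VCPU records, refreshed by Xen every second or so. */
uint64_t last_xen_time_base[NR_VCPUS];   /* system time in ns at the snapshot */
uint64_t last_xen_tsc[NR_VCPUS];         /* TSC on the underlying CPU then    */
uint64_t xen_tsc_calibrate[NR_VCPUS];    /* ns per 2^20 ticks for that CPU    */

/* Read the full 64-bit TSC on the current CPU (x86 rdtsc). */
static inline uint64_t rdtsc64(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t timeofday_ns(unsigned int vcpu)
{
    uint64_t delta = rdtsc64() - last_xen_tsc[vcpu];             /* 64-bit */
    uint64_t ns = last_xen_time_base[vcpu] +
                  ((delta * xen_tsc_calibrate[vcpu]) >> 20);     /* scale  */
    return ns & ~(PRECISION - 1);                                /* round  */
}

A real guest would also have to guard against being preempted and moved to
another CPU between picking the VCPU index and reading its record, a point
John raises below.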
----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

John Stultz/Beaverton/IBM wrote on 04/21/2005 01:49:54 PM:
> I'm just resending this with proper addresses as something got futzed up
> in the CC list on that last mail.
>
> "Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/21/2005 01:12:51 PM:
> > > First, forgive my lack of knowledge about Xen.  Since I don't
> > > know the details of what you're proposing, let me make a
> > > straw-man and let you correct my assumptions.
> > >
> > > Let's say you're proposing that time be calculated with the
> > > following formula:
> > >
> > >   timeofday = xen_time_base + rdtsc() - xen_last_tsc[CPUNUM]
> > >
> > > Given a guest domain with two cpus, the issue is managing
> > > xen_last_tsc[] and xen_time_base.  For the equation to work,
> > > xen_last_tsc[0] must hold the TSC value from CPU0 at exactly
> > > the time stored in xen_time_base.  Additionally the same is
> > > true of xen_last_tsc[1].
> >
> > I'm proposing:
> >
> >   timeofday = round_to_precision( last_xen_time_base[VCPU] +
> >                 ( rdtsc() - last_xen_tsc[VCPU] ) * xen_tsc_calibrate[VCPU] )
> >
> > We update last_xen_time_base and last_xen_tsc on each CPU every second
> > or so.
>
> Or possibly more frequently, as on a 4GHz cpu the 32-bit TSC will wrap
> each second.  Alternatively you could use the full 64 bits.
>
> > xen_tsc_calibrate is calculated for each CPU at start of day.  For
> > completeness, we could recalculate the calibration every 30s or so to
> > cope with crystal temperature drift if we wanted ultimate precision.
> >
> > > The difficult question is how do you ensure that the two
> > > values in xen_last_tsc[] are linked with the time in
> > > xen_time_base?  This requires reading the TSC on two cpus at
> > > the exact same time.  Additionally, this sync point must
> > > happen frequently enough so that the continuing drift between
> > > cpus isn't noticed.
> >
> > Nope, we would set the time_base on each CPU independently, but relative
> > to the same timer.
>
> Hmmm.  That sounds like it could work.  Just be sure that preempt won't
> bite you in the timeofday calculation.  The bit about still using the
> cyclone/HPET to sync the different xen_time_base[] values is the real key.
>
> > This could be the cyclone, HPET, or even the PIT if it's possible to read
> > the same PIT from any node (though I'm guessing you probably have a PIT
> > per node and can't read the remote one).
>
> The ioport space is unified by the BIOS so there is one global PIT shared
> by all nodes.  Although, as you'll need a continuous timesource that
> doesn't loop between xen_time_base updates, the PIT would not work.
>
> thanks
> -john

----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/28/2005 07:08:05 PM:
> > First I apologize for not being involved in this email
> > exchange last week.
> > I am also just learning about Xen so my questions may be
> > obvious to others.
> >
> > What is the last_xen_time_base referred to in Ian's email?  Is
> > this the stime_irq or wc_sec,wc_usec or something else?
>
> I was referring to the wc_ wall clock and system time values.
> We'll need to make these per VCPU, or perhaps slightly more cleanly,
> store an offset in ns for each VCPU.
>
> > When would the last_xen_tsc[VCPU] values be captured by Xen?
> > Currently, the tsc for cpu 0 is obtained during
> > timer_interrupt as full_tsc_irq.
>
> These just need to be captured periodically on each real CPU -- every
> couple of seconds would do it, though more frequently wouldn't hurt.
>
> > When updating the domain's shared_info structure, mapping the
> > physical CPU to the domain's view of the CPU would need to be
> > done.  For example, if domain2 was running on CPU3 and CPU2 and
> > the domain's view was cpu0 and cpu1, the saved tsc value for
> > CPU3 would be copied to last_xen_tsc[0] and CPU2 to
> > last_xen_tsc[1] before sending the interrupt to the domain.
>
> Yep, this shouldn't be hard -- there's already some code to spot when
> they need to be updated.
>
> > From the last algorithm from Ian, I don't see anything that
> > refers to the Cyclone/HPET value directly.  Is that because
> > Xen is the only thing that reads the Cyclone/HPET counter and
> > the domain just uses the TSC?
>
> Yep, we don't want to expose the cyclone/hpet to guests.  There's no
> need, and it would have implications for migrating VMs between different
> systems.
>
> Strictly speaking, Xen wouldn't even need support for the hpet/cyclone
> as it could just use the shared PIT, though I have no objection to
> adding such support.
>
> Are you happy with this design?  It's a little more work, but I believe
> it's better in the long run.  We need to get the hypervisor interface
> change incorporated ASAP.
>
> Cheers,
> Ian
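A sketch of the bookkeeping discussed above (hypothetical structures, not
the real shared_info): the hypervisor captures a (TSC, system time,
calibration) snapshot on each physical CPU every second or two, and before
notifying a guest it copies the snapshot of whichever physical CPU each
VCPU is running on into that VCPU's guest-visible slot -- so, in Don's
example, domain2's slot 0 gets CPU3's snapshot and slot 1 gets CPU2's.

#include <stdint.h>

#define MAX_PCPUS  8
#define MAX_VCPUS  4

struct time_snapshot {
    uint64_t tsc;                 /* TSC read on that physical CPU          */
    uint64_t system_time_ns;      /* shared-timer time at the same instant  */
    uint64_t ns_per_2_20_ticks;   /* calibration for that physical CPU      */
};

/* Captured periodically on each physical CPU (e.g. from its timer tick). */
struct time_snapshot pcpu_snapshot[MAX_PCPUS];

/* Guest-visible per-VCPU records (stand-in for the shared info page). */
struct time_snapshot guest_vcpu_time[MAX_VCPUS];

/*
 * vcpu_to_pcpu[v] is the physical CPU that the guest's VCPU v currently
 * runs on; copy each VCPU's snapshot before sending the timer event.
 */
void refresh_guest_time(const unsigned int *vcpu_to_pcpu, unsigned int nr_vcpus)
{
    for (unsigned int v = 0; v < nr_vcpus; v++)
        guest_vcpu_time[v] = pcpu_snapshot[vcpu_to_pcpu[v]];
}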
----- Forwarded by Don Fry/Beaverton/IBM on 05/26/2005 09:29 AM -----

"Ian Pratt" <m+Ian.Pratt@xxxxxxxxxxxx> wrote on 04/30/2005 12:04:57 AM:
> > It sounds like the per-cpu changes should be sufficient.
> >
> > Having a time base and ns deltas for each CPU sounds
> > interesting, but wouldn't you have to do a subtraction to
> > generate the delta in Xen, and then add it back in, in the
> > domain?  Just saving the per-cpu value would save the extra
> > add and subtract.
>
> Sure, but the add/subtract won't cost much, and it saves some space in
> the shared info page, which might be an issue if we have a lot of VCPUs.
> Not a big deal either way.
>
> > The bottom line is that it can all be done with the TSC,
> > without needing to use the Cyclone or HPET hardware, which
> > isn't available on all systems the way the TSC is.
>
> Great, we're in agreement.  I think the first stage is just to do the per
> [V]CPU calibration and time vals.  Could you work something up?
>
> Thanks,
> Ian

--
Don Fry
brazilnut@xxxxxxxxxx

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel