Xen project Mailing List

This patch improves the hpet-based guest clock in terms of drift and monotonicity.
Prior to this work the drift with hpet was greater than 2%, far above the .05% limit
for ntp to synchronize. With this code, the drift ranges from .001% to .0033% depending
on guest and physical platform.

Using hpet allows guest operating systems to provide monotonic time to their
applications. Time sources other than hpet are not monotonic because
of their reliance on tsc, which is not synchronized across physical processors.

Windows 2k8-64 and many Linux guests are supported with two policies, one for guests
that handle missed clock interrupts and the other for guests that require the
correct number of interrupts.

Guests may use hpet for the timing source even if the physical platform has no visible
hpet. Migration is supported between physical machines which differ in physical
hpet visibility.

Most of the changes are in hpet.c. Two general facilities are added to track interrupt
progress. The ideas and the facilities would also be useful in vpt.c for other time
sources, though no attempt is made here to improve vpt.c.

The following sections discuss hpet dependencies, interrupt delivery policies, live migration,
test results, and relation to recent work with monotonic time.

2. Virtual Hpet dependencies

The virtual hpet depends on the ability to read the physical or simulated
(see discussion below) hpet. For timekeeping, the virtual hpet also depends
on two new interrupt notification facilities to implement its policies for
interrupt delivery.

2.1. Two modes of low-level hpet main counter reads.

In this implementation, the virtual hpet calls read_64_main_counter(), exported by
time.c, to read either the physical hpet main counter register directly or a "simulated"
hpet main counter.

The simulated mode uses a monotonic version of get_s_time() (NOW()), where the last
time value is returned whenever the current time value is less than the last time
value. In simulated mode, since it is layered on s_time, the underlying hardware
can be hpet or some other device. The frequency of the main counter in simulated
mode is the same as the standard physical hpet frequency, allowing live migration
between nodes that are configured differently.
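
To illustrate the simulated mode, here is a minimal sketch of the clamp and of the
conversion to main counter ticks. It is not the code in time.c: the names monotonic_ns(),
ns_to_hpet_ticks() and SIM_HPET_HZ are hypothetical, the 14.31818 MHz rate is an
assumption about what "standard physical hpet frequency" means, and the locking a real
hypervisor would need around last_ns is left out.

```c
#include <stdint.h>

/* Assumed rate; the text only says "standard physical hpet frequency". */
#define SIM_HPET_HZ 14318180ULL

/* Last value handed out (a real implementation would protect this with a lock). */
static uint64_t last_ns;

/* Clamp a raw nanosecond reading so the result never goes backwards. */
static uint64_t monotonic_ns(uint64_t raw_ns)
{
    if (raw_ns < last_ns)
        return last_ns;      /* time appeared to step back: repeat the last value */
    last_ns = raw_ns;
    return raw_ns;
}

/* Convert nanoseconds to simulated main counter ticks. */
static uint64_t ns_to_hpet_ticks(uint64_t ns)
{
    /* Handle whole seconds and the remainder separately so that the
     * multiplication by SIM_HPET_HZ cannot overflow 64 bits. */
    uint64_t sec = ns / 1000000000ULL;
    uint64_t rem = ns % 1000000000ULL;

    return sec * SIM_HPET_HZ + (rem * SIM_HPET_HZ) / 1000000000ULL;
}
```

The clamp is also why two consecutive reads in simulated mode can return the same value.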

If the physical platform does not have an hpet device, or if xen is configured not
to use the device, then the simulated method is used. If there is a physical hpet device,
and xen has initialized it, then either simulated or physical mode can be used.
This is governed by a boot time option, hpet-avoid. Setting this option to 1 gives the
simulated mode and 0 the physical mode. The default is physical mode.
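
The selection rule above can be written as a small decision function. This is only a
sketch of the rule as described; choose_mc_mode() and the enum are hypothetical names,
and only the hpet-avoid option comes from the patch.

```c
#include <stdbool.h>

enum mc_mode { MC_PHYSICAL, MC_SIMULATED };

/* Pick the main counter read mode from platform state and the hpet-avoid option. */
static enum mc_mode choose_mc_mode(bool hpet_present, bool hpet_initialized,
                                   bool hpet_avoid)
{
    /* No usable physical hpet: simulated mode is the only choice. */
    if (!hpet_present || !hpet_initialized)
        return MC_SIMULATED;

    /* Usable physical hpet: hpet-avoid=1 forces simulated, default is physical. */
    return hpet_avoid ? MC_SIMULATED : MC_PHYSICAL;
}
```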

A disadvantage of the physical mode is that it may take longer to read the device
than in simulated mode. On some platforms the cost is about the same (less than 250 nsec) for
physical and simulated modes, while on others the physical cost is much higher than the simulated.
A disadvantage of the simulated mode is that it can return the same value
for the counter in consecutive calls.

2.2. Interrupt notification facilities.

Two interrupt notification facilities are introduced: hvm_isa_irq_assert_cb()
and hvm_register_intr_en_notif().

The vhpet uses hvm_isa_irq_assert_cb to deliver interrupts to the vioapic.
hvm_isa_irq_assert_cb allows a callback to be passed along to vioapic_deliver()
and this callback is called with a mask of the vcpus which will get the
interrupt. This callback is made before any vcpus receive an interrupt.
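
The ordering guarantee (callback first, delivery second) can be shown with a mock. The
sketch below does not use the real vioapic code or the real signature of
hvm_isa_irq_assert_cb(); mock_deliver(), record_targets() and both structs are
hypothetical stand-ins.

```c
#include <stdint.h>
#include <stddef.h>

/* Callback type: receives a mask with one bit per vcpu that will be interrupted. */
typedef void (*assert_cb_t)(void *data, uint64_t vcpu_mask);

struct mock_domain {
    uint64_t irq_pending;   /* bit n set: vcpu n has the interrupt pending */
};

/* Mock of the delivery step: report the targets first, then mark them pending. */
static void mock_deliver(struct mock_domain *d, uint64_t targets,
                         assert_cb_t cb, void *cb_data)
{
    if (cb != NULL)
        cb(cb_data, targets);   /* runs before any vcpu receives the interrupt */
    d->irq_pending |= targets;
}

/* Example callback: remember which vcpus still have to take the interrupt. */
struct hpet_track {
    uint64_t pending_mask;
};

static void record_targets(void *data, uint64_t vcpu_mask)
{
    struct hpet_track *t = data;

    t->pending_mask = vcpu_mask;
}
```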

Vhpet uses hvm_register_intr_en_notif() to register a handler for a particular
vector that will be called when that vector is injected in [vmx,svm]_intr_assist()
and also when the guest finishes handling the interrupt. Here finished is defined
as the point when the guest re-enables interrupts or lowers the tpr value.
EOI is not used as the end of interrupt as this is sometimes returned before
the interrupt handler has done its work. A flag is passed to the handler indicating
whether this is the injection point (post = 1) or the interrupt finished (post = 0) point.
The need for the finished point callback is discussed in the missed ticks policy section.

To prevent a possible early trigger of the finished callback, intr_en_notif logic
has a two stage arm, the first at injection (hvm_intr_en_notif_arm()) and the second when
interrupts are seen to be disabled (hvm_intr_en_notif_disarm()). Once fully armed, re-enabling
interrupts will cause hvm_intr_en_notif_disarm() to make the end of interrupt
callback. hvm_intr_en_notif_arm() and hvm_intr_en_notif_disarm() are called by
[vmx,svm]_intr_assist().
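
The two stage arm can be viewed as a small state machine. The sketch below is a
simplified model, not the patch code: the type and function names are hypothetical, the
check is driven by a single "guest interrupts enabled" flag, and the tpr condition is
left out.

```c
#include <stdbool.h>

enum notif_state { NOTIF_IDLE, NOTIF_INJECTED, NOTIF_FULLY_ARMED };

/* Handler registered for a vector: post = 1 at injection, post = 0 when finished. */
typedef void (*notif_handler_t)(void *data, int post);

struct intr_notif {
    enum notif_state state;
    notif_handler_t handler;
    void *data;
};

/* Stage one: the vector has just been injected. */
static void notif_on_inject(struct intr_notif *n)
{
    n->state = NOTIF_INJECTED;
    n->handler(n->data, 1);
}

/* Called on each later pass with the guest's current interrupt-enable state. */
static void notif_on_check(struct intr_notif *n, bool guest_intr_enabled)
{
    switch (n->state) {
    case NOTIF_INJECTED:
        /* Stage two: only arm fully once interrupts are seen disabled, so that
         * interrupts still enabled right after injection cannot trigger early. */
        if (!guest_intr_enabled)
            n->state = NOTIF_FULLY_ARMED;
        break;
    case NOTIF_FULLY_ARMED:
        if (guest_intr_enabled) {
            n->state = NOTIF_IDLE;
            n->handler(n->data, 0);   /* end of interrupt callback */
        }
        break;
    case NOTIF_IDLE:
        break;
    }
}

/* Example handler that counts both kinds of notification. */
struct notif_counts {
    int injected;
    int finished;
};

static void count_events(void *data, int post)
{
    struct notif_counts *c = data;

    if (post)
        c->injected++;
    else
        c->finished++;
}
```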

3. Interrupt delivery policies

The existing hpet interrupt delivery is preserved. This includes
vcpu round robin delivery used by Linux and broadcast delivery used by Windows.

There are two policies for interrupt delivery, one for Windows 2k8-64 and the other
for Linux. The Linux policy takes advantage of the (guest) Linux missed tick and offset
calculations and does not attempt to deliver the right number of interrupts.
The Windows policy delivers the correct number of interrupts, even if sometimes much
closer to each other than the period. The policies are similar to those in vpt.c, though
there are some important differences.

Policies are selected with an HVMOP_set_param hypercall with index HVM_PARAM_TIMER_MODE.
Two new values are added, HVM_HPET_guest_computes_missed_ticks and
HVM_HPET_guest_does_not_compute_missed_ticks. Two new values are added because
some guests (32-bit Linux) need a no-missed policy for clock sources other than hpet
and a missed ticks policy for hpet. It was felt that simply introducing the two hpet
policies would cause less confusion.

3.1. The missed ticks policy

The Linux clock interrupt handler for hpet calculates missed ticks and offset using the hpet
main counter. The algorithm works well when the time since the last interrupt is greater than
or equal to a period and poorly otherwise.

The missed ticks policy ensures that no two clock interrupts are delivered to the guest at
a time interval less than a period. A time stamp (hpet main counter value) is recorded (by a
callback registered with hvm_register_intr_en_notif) when Linux finishes handling the clock
interrupt. Then, ensuing interrupts are delivered to the vioapic only if the current main
counter value is at least one period greater than the value recorded when the last
interrupt was handled.
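
The delivery test of the missed ticks policy amounts to one comparison against the end
of interrupt time stamp. A minimal sketch follows; the struct and function names are
hypothetical and the real code in hpet.c keeps more state than this.

```c
#include <stdbool.h>
#include <stdint.h>

struct missed_ticks_state {
    uint64_t period;        /* timer period in main counter ticks */
    uint64_t last_end_mc;   /* main counter when the guest finished the last tick */
};

/* Called from the "finished" (post = 0) notification: record the time stamp. */
static void on_interrupt_finished(struct missed_ticks_state *s, uint64_t now_mc)
{
    s->last_end_mc = now_mc;
}

/* Deliver only if at least one full period has passed since the last finish.
 * Unsigned subtraction keeps the test correct across counter wrap. */
static bool should_deliver(const struct missed_ticks_state *s, uint64_t now_mc)
{
    return (now_mc - s->last_end_mc) >= s->period;
}
```

With a period of 100 ticks and a handler that finished at 1130, a timer expiry at 1200
is held back and the next interrupt goes out no earlier than 1230.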

Tests showed a significant improvement in clock drift with end of interrupt time stamps
versus beginning of interrupt[1]. It is believed that the reason for the improvement
is that the clock interrupt handler takes a spinlock and can therefore be delayed in its
processing. Furthermore, the main counter is read by the guest under the lock. The net
effect is that if we time stamp injection, the difference in time between successive
interrupt handler lock acquisitions can be less than the period.

3.2. The no-missed ticks policy

Windows 2k8-64 keeps very poor time with the missed ticks policy, so the no-missed ticks
policy was developed. In the no-missed ticks policy we deliver the correct number of interrupts,
even if they are spaced less than a period apart (when catching up).

Windows 2k8-64 uses a broadcast mode in the interrupt routing such that
all vcpus get the clock interrupt. The best Windows drift performance was achieved when the
policy code ensured that all the previous interrupts (on the various vcpus) had been injected
before injecting the next interrupt to the vioapic.

The policy code works as follows. It uses the hvm_isa_irq_assert_cb() to record
the vcpus to be interrupted in h->hpet.pending_mask. Then, in the callback registered
with hvm_register_intr_en_notif() at post=1 time it clears the current vcpu in the pending_mask.
When the pending_mask is clear, it decrements hpet.intr_pending_nr and, if intr_pending_nr is
still non-zero, posts another interrupt to the ioapic with hvm_isa_irq_assert_cb().
intr_pending_nr is incremented in hpet_route_decision_not_missed_ticks().
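
The bookkeeping can be modelled as follows. This is a sketch, not the code in hpet.c:
only the field names pending_mask and intr_pending_nr come from the patch, the function
names are hypothetical, and the rule that a new tick is asserted immediately only when
nothing else is owed is an assumption.

```c
#include <stdint.h>

struct no_missed_state {
    uint64_t pending_mask;       /* vcpus that have not yet had the tick injected */
    unsigned intr_pending_nr;    /* ticks owed to the guest */
    unsigned asserts;            /* times the (mock) irq line was asserted */
};

/* Mock assert: the assert callback would record the target vcpus here. */
static void mock_assert_irq(struct no_missed_state *s, uint64_t targets)
{
    s->pending_mask = targets;
    s->asserts++;
}

/* A timer period elapsed: one more interrupt is owed. */
static void tick_due(struct no_missed_state *s, uint64_t targets)
{
    s->intr_pending_nr++;
    if (s->intr_pending_nr == 1)         /* assumption: nothing in flight */
        mock_assert_irq(s, targets);
}

/* post = 1 notification: the tick was injected on one vcpu (vcpu < 64). */
static void injected_on_vcpu(struct no_missed_state *s, unsigned vcpu,
                             uint64_t targets)
{
    s->pending_mask &= ~(1ULL << vcpu);
    if (s->pending_mask != 0)
        return;                          /* other vcpus still outstanding */

    if (s->intr_pending_nr > 0)
        s->intr_pending_nr--;
    if (s->intr_pending_nr > 0)
        mock_assert_irq(s, targets);     /* catch up with the next owed tick */
}
```

With two vcpus and two ticks owed, the second assert happens only after both vcpus have
had the first tick injected, which matches the ordering described above.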

The missed ticks policy intr_en_notif callback also uses the pending_mask method. So even though
Linux does not broadcast its interrupts, the code could handle it if it did.
In this case the end of interrupt time stamp is made when the pending_mask is clear.

4. Live Migration

Live migration with hpet preserves the current offset of the guest clock with respect
to ntp. This is accomplished by migrating all of the state in the h->hpet data structure
in the usual way. The hp->mc_offset is recalculated on the receiving node so that the
guest sees a continuous hpet main counter.

Code has been added to xc_domain_save.c to send a small message after the
domain context is sent. The content of the message is the physical tsc timestamp, last_tsc,
read just before the message is sent. When the last_tsc message is received in xc_domain_restore.c,
another physical tsc timestamp, cur_tsc, is read. The two timestamps are loaded into the domain
structure as last_tsc_sender and first_tsc_receiver with hypercalls. Then xc_domain_hvm_setcontext
is called so that hpet_load has access to these timestamps. hpet_load uses the timestamps
to account for the time spent saving and loading the domain context. With this technique,
the only neglected time is the time spent sending a small network message.
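
One way to picture the offset recalculation is sketched below. It is an illustration
under assumptions, not the hpet_load code: it assumes the guest counter is the host
counter plus mc_offset, that the elapsed save and load time has already been obtained
as tsc deltas measured on each node separately, and that each node knows its own tsc
frequency. All function and parameter names are hypothetical.

```c
#include <stdint.h>

/* Convert a tsc delta measured on one node into hpet main counter ticks. */
static uint64_t tsc_delta_to_hpet_ticks(uint64_t tsc_delta, uint64_t tsc_hz,
                                        uint64_t hpet_hz)
{
    /* Whole seconds and remainder are handled separately to limit overflow. */
    uint64_t sec = tsc_delta / tsc_hz;
    uint64_t rem = tsc_delta % tsc_hz;

    return sec * hpet_hz + (rem * hpet_hz) / tsc_hz;
}

/* Guest view of the main counter on a given host. */
static uint64_t guest_mc(uint64_t host_mc, uint64_t mc_offset)
{
    return host_mc + mc_offset;          /* arithmetic is modulo 2^64 */
}

/* New offset on the receiver so that the guest counter continues from the
 * saved value plus the ticks that elapsed during save and load. */
static uint64_t recompute_mc_offset(uint64_t saved_guest_mc,
                                    uint64_t elapsed_ticks,
                                    uint64_t host_mc_now)
{
    return saved_guest_mc + elapsed_ticks - host_mc_now;
}
```

Because the arithmetic is unsigned, the result is still correct when the receiver's
host counter is larger than the saved guest value.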

5. Test Results

Some recent test results are:

5.1. Linux 4u6-64 and Windows 2k8-64 load test
Duration: 70 hours
Test date: 6/2/08
Loads: usex -b48 on Linux; burn-in on Windows
Guest vcpus: 8 for Linux; 2 for Windows
Hardware: 8 physical cpu AMD
Clock drift: Linux: .0012% Windows: .009%

5.2. Linux 4u6-64, Linux 4u4-64, and Windows 2k8-64 no-load test
Duration: 23 hours
Test date: 6/3/08
Loads: none
Guest vcpus: 8 for each Linux; 2 for Windows
Hardware: 4 physical cpu AMD
Clock drift: Linux: .033% Windows: .019%

6. Relation to recent work in xen-unstable

There is a similarity between hvm_get_guest_time() in xen-unstable and read_64_main_counter()
in this code. However, read_64_main_counter() is more closely tuned to the needs of hpet.c. It has
no "set" operation, only the get. It isolates the mode, physical or simulated, within
read_64_main_counter() itself. It uses no vcpu or domain state, since in either mode it represents
a physical entity. And it offers a true physical read on every call for those applications that want one.

7. Conclusion

The virtual hpet is improved by this patch in terms of accuracy and monotonicity.
Tests performed to date verify this and more testing is under way.

8. Future Work

Testing with Windows Vista will be performed soon. The reason for accuracy variations
on different platforms using the physical hpet device will be investigated.
Additional overhead measurements on simulated vs physical hpet mode will be made.

Footnotes:

1. I don't recall the exact accuracy improvement with end of interrupt stamping, but it was
significant, perhaps better than two to one improvement. It would be a very simple matter
to re-measure the improvement as the facility can call back at injection time as well.

Signed-off-by: Dave Winchell <dwinchell@xxxxxxxxxxxxxxx>
Signed-off-by: Ben Guthro <bguthro@xxxxxxxxxxxxxxx>

[Xen-devel] [PATCH 0/2] Improve hpet accuracy