[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH 1/2] cpu steal time accounting
On 22 Feb 2006, at 17:11, Rik van Riel wrote:If the domain is unrunnable, surely there won't be a process on the virtual cpu that is runnable? Or am I overlooking something here?Oh, I see, this is dealt with inside account_steal_time(). No problem then. But, the guest will not be able to properly partition this time correctly between the idle time stat and the steal time stat. When the vcpu enters its idle loop and blocks, at some point it will become ready to run again. But, it will not necessarily run right away. Instead, it will sit in the hypervisors runqueue for some amount of time. While in the runqueue, this time is "stolen", since the vcpu wants to run but isn't (at this point it is involuntarily waiting). Under a heavily overcommitted system, the amount of time in the runqueue following the block may be nontrivial. But, the guest (the code in account_steal_time()) cannot determine the time at which the vcpu was requeued into the hypervisor's runqueue. It will account all of this time toward the "idle time" stat, rather than partitioning it between "idle time" and "steal time". To solve this, it may be best to have the hypervisor interface expose per-vcpu stolen time directly, rather than vcpu_time. Then the guest does not need to try to guess whether to charge (system_time - vcpu_time) against idle or steal. 1. What if a guest gets preempted for lots of short time periods (less than a jiffy). Then some arbitrary time in the future is preempted for long enough to activate you stolen-time logic. Won't you end up incorrectly accounting the accumulated short time periods?This is true. I'm not sure we'd want to get the vcpu info at every timer interrupt though, that could end up being too expensive...Having to call down to Xen to get that information is unfortunate. Perhaps we can export it in shared_info, or have the guest register a virtual address it would like the info written to. If you do change how this information is passed through the interface, then maybe this would be a good time to define the interface to export "stolen time" computed by the hypervisor, rather than "vcpu time". On a topic relating to these patches, the Linux scheduler also uses the routine sched_clock() to calculate the run-time (and sleep-time) of processes. With the current code, these calculations will include stolen time in the total run-time of a process. Perhaps sched_clock() should be a clock that does not advance when time is stolen by the hypervisor? We've been thinking about these issues also. I've attached a document that describes our current thoughts. The document describes the portion of the VMI that deals with time and describes the changes to Linux to accommodate VMI Time (Rik, this is an updated draft of the document I sent you before). Comments are welcome. Thanks, Dan VMI - Time Interface; VMware, Inc. In a virtualized environment, virtual machines (VM) will time share the system with each other and with other processes running on the host system. Therefore, a VM's virtual cpus (vcpus) will be executing on the host's physical cpus (pcpus) for only some portion of time. This section of the VMI exposes a paravirtual view of time to the guest operating systems so that they may operate more effectively in a virtual environment. The interface also provides a way for the vcpus to set alarms in this paravirtual view of time. ------------------------------------------------------------------------- Time Domains: a) Wallclock Time: Wallclock time exposed to the VM through this interface indicates the number of nanoseconds since epoch, 1970-01-01T00:00:00Z (ISO 8601 date format). If the host's wallclock time changes (say, when an error in the host's clock is corrected), so does the wallclock time as viewed through this interface. b) Real Time: Another view of time accessible through this interface is real time. Real time always progresses except for when the VM is stopped or suspended. Real time is presented to the guest as a counter which increments at a constant rate defined (and presented) by the hypervisor. All the vcpus of a VM share the same real time counter. The unit of the counter is called "cycles". The unit and initial value (corresponding to the time the VM enters para-virtual mode) are chosen by the hypervisor so that the real time counter will not rollover in any practical length of time. It is expected that the frequency (cycles per second) is chosen such that this clock provides a "high-resolution" view of time. The unit can only change when the VM (re)enters paravirtual mode. c) Stolen time and Available time: A vcpu is always in one of three states: running, halted, or ready. The vcpu is in the 'running' state if it is executing. When the vcpu executes the VMI_Halt interface, the vcpu enters the 'halted' state and remains halted until there is some work pending for the vcpu (e.g. an alarm expires, host I/O completes on behalf of virtual I/O). At this point, the vcpu enters the 'ready' state (waiting for the hypervisor to reschedule it). Finally, at any time when the vcpu is not in the 'running' state nor the 'halted' state, it is in the 'ready' state. For example, consider the following sequence of events, with times given in real time: (Example 1) At 0 ms, VCPU executing guest code. At 1 ms, VCPU requests virtual I/O. At 2 ms, Host performs I/O for virtual I/0. At 3 ms, VCPU executes VMI_Halt. At 4 ms, Host completes I/O for virtual I/O request. At 5 ms, VCPU begins executing guest code, vectoring to the interrupt handler for the device initiating the virtual I/O. At 6 ms, VCPU preempted by hypervisor. At 9 ms, VCPU begins executing guest code. >From 0 ms to 3 ms, VCPU is in the 'running' state. At 3 ms, VCPU enters the 'halted' state and remains in this state until the 4 ms mark. From 4 ms to 5 ms, the VCPU is in the 'ready' state. At 5 ms, the VCPU re-enters the 'running' state until it is preempted by the hypervisor at the 6 ms mark. From 6 ms to 9 ms, VCPU is again in the 'ready' state, and finally 'running' again after 9 ms. Stolen time is defined per vcpu to progress at the rate of real time when the vcpu is in the 'ready' state, and does not progress otherwise. Available time is defined per vcpu to progress at the rate of real time when the vcpu is in the 'running' and 'halted' states, and does not progress when the vcpu is in the 'ready' state. So, for the above example, the following table indicates these time values for the vcpu at each ms boundary: Real time Stolen time Available time 0 0 0 1 0 1 2 0 2 3 0 3 4 0 4 5 1 4 6 1 5 7 2 5 8 3 5 9 4 5 10 4 6 Notice that at any point: real_time == stolen_time + available_time Stolen time and available time are also presented as counters in "cycles" units. The initial value of the stolen time counter is 0. This implies the initial value of the available time counter is the same as the real time counter. Alarms: Alarms can be set (armed) against the real time counter or the available time counter. Alarms can be programmed to expire once (one-shot) or on a regular period (periodic). They are armed by indicating an absolute counter value expiry, and in the case of a periodic alarm, a non-zero relative period counter value. [TBD: The method of wiring the alarms to an interrupt vector is dependent upon the virtual interrupt controller portion of the interface. Currently, the alarms may be wired as if they are attached to IRQ0 or the vector in the local APIC LVTT. This way, the alarms can be used as drop in replacements for the PIT or local APIC timer.] Alarms are per-vcpu mechanisms. An alarm set by vcpu0 will fire only on vcpu0, while an alarm set by vcpu1 will only fire on vcpu1. If an alarm is set relative to available time, its expiry is a value relative to the available time counter of the vcpu that set it. The interface includes a method to cancel (disarm) an alarm. On each vcpu, one alarm can be set against each of the two counters (real time and available time). A vcpu in the 'halted' state becomes 'ready' when any of its alarm's counters reaches the expiry. An alarm "fires" by signaling the virtual interrupt controller. An alarm will fire as soon as possible after the counter value is greater than or equal to the alarm's current expiry. However, an alarm can fire only when its vcpu is in the 'running' state. If the alarm is periodic, a sequence of expiry values, E(i) = e0 + p * i , i = 0, 1, 2, 3, ... where 'e0' is the expiry specified when setting the alarm and 'p' is the period of the alarm, is used to arm the alarm. Initially, E(0) is used as the expiry. When the alarm fires, the next expiry value in the sequence that is greater than the current value of the counter is used as the alarm's new expiry. One-shot alarms have only one expiry. When a one-shot alarm fires, it is automatically disarmed. Suppose an alarm is set relative to real time with expiry at the 3 ms mark and a period of 2 ms. It will expire on these real time marks: 3, 5, 7, 9. Note that even if the alarm does not fire during the 5 ms to 7 ms interval, the alarm can fire at most once during the 7 ms to 9 ms interval (unless, of course, it is reprogrammed). If an alarm is set relative to available time with expiry at the 1 ms mark (in available time) and with a period of 2 ms, then it will expire on these available time marks: 1, 3, 5. In the scenario described in example 1, those available time values correspond to these values in real time: 1, 3, 6. ------------------------------------------------------------------------- Interface Details: /* The cycle counters. */ #define VMI_CYCLES_REAL 0 #define VMI_CYCLES_AVAILABLE 1 #define VMI_CYCLES_STOLEN 2 /* Predefined rate of the wallclock. */ #define VMI_WALLCLOCK_HZ 1000000000 /* The alarm interface 'flags' bits. [TBD: exact format of 'flags'] */ #define VMI_ALARM_COUNTER_MASK 0x000000ff #define VMI_ALARM_WIRED_IRQ0 0x00000000 #define VMI_ALARM_WIRED_LVTT 0x00010000 #define VMI_ALARM_IS_ONESHOT 0x00000000 #define VMI_ALARM_IS_PERIODIC 0x00000100 /* The time types. */ typedef unsigned char VMI_BOOL; /* FALSE==0, TRUE==!FALSE. */ typedef uint64 VMI_CYCLES; typedef uint64 VMI_NANOSECS; typedef sint64 VMI_NANOSECSDIFF; /* The time functions. */ VMI_NANOSECS VMI_GetWallclockTime(void); VMI_BOOL VMI_WallclockUpdated(void); VMI_CYCLES VMI_GetCycleFrequency(void); VMI_CYCLES VMI_GetCycleCounter(uint32 whichCounter); void VMI_SetAlarm(uint32 flags, VMI_CYCLES expiry, VMI_CYCLES period); VMI_BOOL VMI_CancelAlarm(uint32 flags); VMI_GetWallclockTime returns the current wallclock time as the number of nanoseconds since the epoch. Nanosecond resolution along with the 64-bit unsigned type provide over 580 years from epoch until rollover. The wallclock time is relative to the host's wallclock time. VMI_WallclockUpdated returns TRUE if the wallclock time has changed relative to the real cycle counter since the previous time that VMI_WallclockUpdated was polled. For example, while a VM is suspended, the real cycle counter will halt, but wallclock time will continue to advance. Upon resuming the VM, the first call to VMI_WallclockUpdated will return TRUE. VMI_GetCycleFrequency returns the number of cycles in one second. This value can be used by the guest to convert between cycles and other time units. VMI_GetCycleCounter returns the current value, in cycles units, of the counter corresponding to 'whichCounter' if it is one of VMI_CYCLES_REAL, VMI_CYCLES_AVAILABLE or VMI_CYCLES_STOLEN. VMI_GetCycleCounter returns 0 for any other value of 'whichCounter'. VMI_SetAlarm is used to arm the vcpu's alarms. The 'flags' parameter is used to specify which counter's alarm is being set (VMI_CYCLES_REAL or VMI_CYCLES_AVAILABLE), how to deliver the alarm to the vcpu (VMI_ALARM_WIRED_IRQ0 or VMI_ALARM_WIRED_LVTT), and the mode (VMI_ALARM_IS_ONESHOT or VMI_ALARM_IS_PERIODIC). If the alarm is set against the VMI_ALARM_STOLEN counter or an undefined counter number, the call is a nop. The 'expiry' parameter indicates the expiry of the alarm, and for periodic alarms, the 'period' parameter indicates the period of the alarm. If the value of 'period' is zero, the alarm is armed as a one-shot alarm regardless of the mode specified by 'flags'. Finally, a call to VMI_SetAlarm for an alarm that is already armed is equivalent to first calling VMI_CancelAlarm and then calling VMI_SetAlarm, except that the value returned by VMI_CancelAlarm is not accessible. VMI_CancelAlarm is used to disarm an alarm. The 'flags' parameter indicates which alarm to cancel (VMI_CYCLES_REAL or VMI_CYCLES_AVAILABLE). The return value indicates whether or not the cancel succeeded. A return value of FALSE indicates that the alarm was already disarmed either because a) the alarm was never set or b) it was a one-shot alarm and has already fired (though perhaps not yet delivered to the guest). TRUE indicates that the alarm was armed and either a) the alarm was one-shot and has not yet fired (and will no longer fire until it is rearmed) or b) the alarm was periodic. ------------------------------------------------------------------------- Further discussion regarding the proposed interface: 1) Mechanism to atomically read the set of time values. The interface does not provide a way for a vcpu to atomically read the entire set of time values { wallclock time, real time counter, available time counter, stolen time counter }. While it seems that such a mechanism might be "nice" to have in theory, it seems unnecessary in practice. Indeed, real hardware rarely provides this functionality. One nice side effect of having this feature is that the explicit stolen time counter (or available time counter) can be dropped entirely from the interface, since its value can be inferred from the real time counter and available time counter (or stolen time counter). Two potential proposals that provide this feature are: a) "Struct" return value style interface: typedef struct VMITimeValues { VMI_NANOSECS wallTime; VMI_CYCLES realCounter; VMI_CYCLES availableCounter; } VMITimeValues; VMITimeValues VMI_GetTimes(void); or, similarly, void VMI_GetTimes(VMITimeValues *t); Pros: Easy to understand interface. Implementation can provide atomicity by returning all time values as a complete set. Cons: Struct return value (indirect return) less efficient, especially when caller only needs to use one of the time values. More difficult to extend interface and maintain compatibility. b) "Caching" mechanism added to current interface. #define VMI_CYCLES_REAL 0 #define VMI_CYCLES_AVAILABLE 1 #define VMI_CYCLES_CACHED 0x80 VMI_NANOSECS VMI_GetWallclockTime(uint32 flags); VMI_CYCLES VMI_GetCycleCounter(uint32 flags); When the VMI_CYCLES_CACHED bit of the 'flags' parameter is not set, the routines atomically cache the entire set of time values into the "time cache". The requested time value is returned. When the VMI_CYCLES_CACHED bit of the 'flags' parameters is set, the routine returns the requested time value from the time cache. Using this interface, along with guest provided synchronization, the entire set of time values can be retrieved atomically. Pros: Efficient calling convention, especially when only one time value is desired. Easier to extend the interface and maintain compatibility. Cons: Interface is perhaps more difficult to understand and use correctly. Requires guest synchronization to ensure the cached values remain consistent between VMI calls. 2) Mechanism to read the available time counter and stolen time counter for another vcpu. VCPU0 cannot read VCPU1's per-vcpu counters, and vice versa. This feature is not required by Linux. Might there be some guest that requires this ability? The VMI_GetCycleCounter interface could be extended to index by vcpu ID in order to provide this functionality. 3) Notification mechanism for relative change between wallclock time and real cycle counter. It is expected that the guest will usually read the wallclock time at boot, and then apply an adjustment to this time using the real cycle counter to calculate the current wallclock time. However, the wallclock time and real cycle counter may become out of sync for a variety of reasons. For example, when a VM is suspended, the real cycle counter will stop advancing while the wallclock time continues to advance. Additionally, when the host wallclock is corrected the real cycle counter is not affected. Therefore, it is convenient for the guest to receive notification that the wallclock time has changed relative to the real cycle counter. Currently, a polling interface is provided for this: VMI_WallclockUpdated. Alternatively, an interrupt interface may be provided. An interrupt would be sent to the guest to notify it of a change to the wallclock time relative to the real cycle counter. ------------------------------------------------------------------------- Use of VMI Time in Linux: To paravirtualize time in the Linux kernel, we introduce a new timer device module, timer_vmi.c (the VMI Time module), which parallels the existing modules, e.g. arch/i386/kernel/timers/{timer_pit.c, timer_tsc.c, timer_cyclone.c, timer_hpet.c, timer_pm.c}. The timer_vmi.c module implements the low-level Linux kernel time routines using the VMI Time interface. The new VMI Time module is used only when a hypervisor is detected at boot time. Otherwise, we fall back to one of the traditional modules listed above. This provides "transparency" -- the same kernel binary can run with the VMI Time device or with one of the traditional time devices. The Linux xtime variable is initialized to the wallclock time exposed by the VMI. The VMI Time module contains a timer interrupt that is driven by a periodic VMI alarm set against the available time counter. The interrupt calls the do_timer() routine every time the real time counter advances cycles_per_second/HZ cycles (call this quantity cycles_per_jiffy), as detected by any vcpu. do_timer() updates the jiffies count and xtime count. The timer interrupt additionally calls the update_process_times() routine for every time the available time counter for the current vcpu advances cycles_per_jiffy cycles. update_process_times() accounts user time stats for the process and vcpu, accounts system time stats for the process and cpu, runs the soft timers for the cpu, performs some RCU work, calls the scheduler back via scheduler_tick() and runs posix timers for the current task. The scheduler_tick() callback decrements the current process' timeslice and makes scheduling decisions. The profile_tick() routine is also called at each cycles_per_jiffy increment of the available time counter on each vcpu. Finally, the timer interrupt accounts the difference between the real time counter and the available time counter (essentially the stolen time counter) to the steal cpustat. The sched_clock() scheduler callback is also modified to calculate time using the available time counter. The effect of this is to compute process' sleep_avg using available time. sleep_avg is used primarily to compute the "effective priority" of a task. Additionally, the i386 timer modules are called back via the active timer_opts struct to get additional time information. arch/i386's do_gettimeofday() and do_settimeofday() use the get_offset() callback to offset the time from xtime. The VMI Time module implements this callback using the real time counter. The monotonic clock callback is also implemented using the real time counter. The __delay() routine is implemented via a timer_opt callback in the VMI Time module. When running under a hypervisor, delays are not necessary when communicating with virtual devices. These delays are nops when running under a hypervisor. However, the smpboot.c bootup sequence does require delays, and these delays are implemented using the real time counter. The effect of these changes are that Linux's real time variables and timers continue to run on real time. These include jiffies, xtime, ITIMER_REAL timers, SIGALARM timers, posix-timers. Additionally, the notion of process times remains consistent in the paravirtual environment as compared with native. Since process time accounting is done using available time, time that is stolen from the vcpu is not accounted toward process times. This keeps the cpustats faithful and virtual timers consistent (ITIMER_VIRT, SIGVTALARM, ITIMER_PROF, SIGPROF). Additionally, the scheduler essentially runs in available time, and so time stolen from the vcpu does not steal time from process' timeslices. Finally, with process time accounting performed using available time and with steal time accounting, the view of time (for example, using the utilities 'top', 'time', etc.) from within paravirtualized Linux is consistent with time viewed on the host and from within other paravirtualized guests on the same host. Some noteworthy characteristics of the VMI Time module are: 1) Cycle counter frequency is indicated by the VMI rather than needing to be calculated by comparing the counter rate to a known interrupt rate. On native hardware, the TSC frequency is calculated by comparing the rate at which the TSC advances with the (known) rate of a timer interrupt. This method for determining a cycle counter frequency is unreliable on virtual hardware since the vcpu may be preempted by the hypervisor during this calculation. Instead, the VMI Time device determines the frequency of the VMI cycle counters by querying the VMI. 2) The timer_irq_works() algorithm is problematic in a virtual environment for various reasons, and is avoided. When running on a hypervisor, the kernel can assume that the alarm will be wired up correctly to the interrupt controller and this logic is avoided altogether. 3) Rather than keeping kernel time state (e.g. jiffies, xtime, time slices) up to date by counting interrupts, the state is updated based on the values of the VMI cycle counters. Therefore, a vcpu only needs to receive timer interrupts while it is running. This leads to better scaling of the number of virtual machines that can be run on a single host. 4) As a consequence of #3, the interrupt rate of the alarm can be dropped lower than HZ. It may be beneficial for the guest to request a lower alarm rate to decrease the overhead of delivering virtual interrupts. With #3, this is possible without changing the HZ constant (which is important in order to allow for transparency with native hardware using the normal HZ rate). 5) System wide time state (e.g. jiffies, xtime) is updated by all vcpus. With the native x86 timer interrupts, only the boot processor's PIT interrupt updates jiffies and xtime. With the VMI Time device, jiffies and xtime are updated by all the vcpus of a VM. This is important since the vcpus may be independently scheduled. Also, it allows for a simpler NO_IDLE_HZ implementation. 6) The xtime counter is kept up with the wallclock time. 7) The jiffies counter is kept up with the real time cycle counter. 8) Scheduler time slices and sched_clock() are kept up with the available time cycle counter. When using the VMI Time device as the time source, the scheduler runs in available time. Otherwise, processes would be charged for time stolen from the vcpu (time in which the vcpu isn't even running). The scheduler_tick() interface is called only for every tick of available time. Also, sched_clock() is implemented using the available time counter so that the scheduler's calculations of a process' run-time does not charge stolen time. 9) Stolen time is computed by the hypervisor, not the Linux guest. Only the hypervisor can accurately compute stolen time. On some hypervisors, stolen time is computed by the guest by subtracting the amount of time a vcpu has run on a pcpu from the amount of real time that has passed. Then, the routine account_steal_time() is used to charge this time difference to a) steal time, if the current process is not the idle process, or b) idle time, if the current process is the idle process. The problem with that calculation is that when a vcpu's halt ends, it will not necessarily run immediately. Instead, it will be transitioned from the 'halted' state to the 'ready' state (i.e. added to the hypervisor's runqueue) until the hypervisor chooses to run it. The time the vcpu is 'ready' is not idle time, but instead stolen time. But, implementations using account_steal_time() will account all of this time to idle time, rather than partitioning the time such that the time spent in the 'halted' state is idle time and the time spent in the 'ready' state is stolen time. Only the hypervisor can determine when the transition from 'halted' to 'ready' occurred (due to work becoming pending for the vcpu). So, the VMI Time device queries the stolen time from the VMI rather than using the faulty account_steal_time() algorithm. 10) A NO_IDLE_HZ implementation is provided. When a vcpu enters its idle loop, it disables its periodic alarm and sets up a one shot alarm for the next time event. That way, it does not become ready to run just to service the periodic alarm interrupt. Instead, it can remain halted until there is some real work pending for it. This allows the hypervisor to use the physical resources more effectively since idle vcpus will have lower overhead. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxx http://lists.xensource.com/xen-devel
|
![]() |
Lists.xenproject.org is hosted with RackSpace, monitoring our |