RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)
OK, here's the long version (/me crosses fingers and hopes to get away from this for at least some of the weekend)...

Proposal ("pv rdtscp"):

The rdtscp instruction was added to the x86 architecture by AMD a couple of years ago, and Intel added it starting with Nehalem. It is essentially the same as rdtsc, except that it also copies the value of a privileged MSR, "TSC_AUX", into the ECX register. There is a CPUID bit that can be checked to determine whether the processor supports the rdtscp instruction.

Xen currently does not expose hardware support for rdtscp to guests. I propose to paravirtualize support for rdtscp as follows:

If guest vm.cfg has vrdtscp=0 (default):
    rdtscp is emulated and returns nsec since guest boot (same as emulated rdtsc); value returned for TSC_AUX is -1
If guest vm.cfg has vrdtscp=1:
    If underlying hardware has rdtscp support:
        rdtscp is directly executed by hardware; value returned for TSC_AUX is non-zero (see below)
    Else (no hardware rdtscp support):
        rdtscp is emulated and returns nsec since guest boot; value returned for TSC_AUX is 0

How it works from the app point of view:

The guest app must have some way of getting 64-bit pvclock parameters directly from Xen without OS changes, e.g. emulated userland wrmsr, userland hypercall, or userland mapped shared page. (This will be done rarely, so it need not be fast! But it does create a new userland<->Xen ABI that must be kept compatible.)

On the first rdtscp, the app records the returned TSC_AUX value, verifies that it is neither 0 nor -1, fetches pvclock parameters from Xen, and executes another rdtscp. If TSC_AUX matches the previous value, the app applies the pvclock algorithm to the tsc value to obtain nsec since guest boot. If TSC_AUX is zero or -1, the tsc value IS nsec since guest boot. If TSC_AUX differs from the last recorded value, the app fetches pvclock parameters from Xen again.
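To make the app-side steps above concrete, here is a rough sketch in C of how an app might process each rdtscp result. It is only an illustration under assumptions: the struct layouts, the field names (loosely modelled on pvclock's multiplier and shift), and the helper names app_feed, app_set_params and scale_tsc are all hypothetical, and the userland<->Xen call that would fetch the parameters is deliberately left to the caller because that ABI does not exist yet.

```c
#include <stdint.h>

/* Hypothetical parameter block an app might fetch from Xen.
 * Field names are assumptions, loosely modelled on pvclock. */
struct pv_params {
    uint64_t tsc_base;  /* TSC value at which ns_base was sampled */
    uint64_t ns_base;   /* nsec since guest boot at tsc_base */
    uint32_t mul_frac;  /* 0.32 fixed-point nsec per (shifted) tick */
    int8_t   shift;     /* left (>0) or right (<0) shift of the delta */
};

/* One rdtscp result: the counter plus the TSC_AUX value. */
struct tsc_sample {
    uint64_t tsc;
    uint32_t aux;
};

/* Per-app state: last TSC_AUX seen and the parameters fetched for it. */
struct app_clock {
    uint32_t last_aux;
    int have_params;
    struct pv_params p;
};

enum app_action { APP_HAVE_TIME, APP_NEED_PARAMS };

/* Convert a raw TSC value to nsec since guest boot. */
static uint64_t scale_tsc(const struct pv_params *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_base;
    if (p->shift >= 0)
        delta <<= p->shift;
    else
        delta >>= -p->shift;
    /* (delta * mul_frac) >> 32, done in two halves to avoid overflow */
    uint64_t hi = (delta >> 32) * p->mul_frac;
    uint64_t lo = ((delta & 0xffffffffu) * p->mul_frac) >> 32;
    return p->ns_base + hi + lo;
}

/* Called after the app has fetched fresh parameters from Xen. */
static void app_set_params(struct app_clock *c, const struct pv_params *p)
{
    c->p = *p;
    c->have_params = 1;
}

/* Feed one rdtscp result. Returns APP_HAVE_TIME with *ns filled in, or
 * APP_NEED_PARAMS, meaning: fetch parameters, call app_set_params(),
 * execute rdtscp again and feed the new sample. */
static enum app_action app_feed(struct app_clock *c, struct tsc_sample s,
                                uint64_t *ns)
{
    if (s.aux == 0 || s.aux == UINT32_MAX) {
        *ns = s.tsc;            /* emulated: value already is nsec */
        return APP_HAVE_TIME;
    }
    if (c->have_params && s.aux == c->last_aux) {
        *ns = scale_tsc(&c->p, s.tsc);
        return APP_HAVE_TIME;
    }
    c->last_aux = s.aux;        /* first read, or version changed */
    c->have_params = 0;
    return APP_NEED_PARAMS;
}
```

On real hardware the sample would come from the rdtscp instruction itself (with GCC or Clang, the __rdtscp() intrinsic from <x86intrin.h> returns the counter and stores TSC_AUX through its pointer argument). Keeping the sample as a plain argument means the logic can be exercised without the instruction being available.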
On subsequent rdtscp's, app compares returned TSC_AUX against the previous one, and fetches pvclock parameters from Xen only if it differs (which should be rare).

What Xen needs to do:

Xen must record the setting for each guest's vrdtscp config variable and ensure that it persists across save/restore and migration. If the guest has vrdtscp=1, a vrdtscp "version" number is also part of the guest's state and must persist across save/restore/migration.

Xen must know whether or not it is running on a machine where TSC is reliable.

If TSC is NOT reliable AND rdtscp is supported by hardware, Xen must ensure that TSC_AUX is -1 on all pcpu's that are running a guest with vrdtscp=0, and 0 on all pcpu's that are running a guest where vrdtscp=1 (and must enable CR4.TSD on those pcpus if it wasn't already).

If TSC is NOT reliable AND rdtscp is NOT supported by hardware, Xen must emulate rdtscp (e.g. return Xen system time) and emulate the same behavior for TSC_AUX.

If TSC IS reliable, Xen sets TSC_AUX to the guest's vrdtscp version number on all pcpu's that are running the guest.

Finally, when a guest transitions from one "TSC domain" to another (restore/migrate/NUMA) it increments the vrdtscp version number.

I think this will work even for a NUMA machine provided Xen always schedules all the vcpus for one guest on pcpus in the same NUMA node, and increments the version number when the guest is rescheduled from one NUMA node to another (assuming TSC on each node is reliable).

I think this pv-rdtscp mechanism will work for both PV and HVM (with minor additional work in Xen for HVM); it will be very fast on any hardware that supports rdtscp in hardware (which for Intel only includes Nehalem+ but that provides even more incentive for customers to upgrade). Apps that currently use rdtscp will continue to work (as long as they don't have some wild use model that I don't know about).
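The Xen-side rules above can be condensed into one small decision function. The following C sketch is only one reading of the proposal, not Xen code: the type and function names (guest_tsc_cfg, rdtscp_policy, pick_policy, bump_version) are made up for illustration. It assumes that a vrdtscp=1 guest gets native rdtscp only when the hardware supports it AND the TSC is reliable, and that a version bump should skip the reserved values 0 and -1 (the proposal does not spell out the wrap-around case).

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-guest state that must survive save/restore/migration
 * (hypothetical layout). */
struct guest_tsc_cfg {
    bool vrdtscp;       /* vm.cfg vrdtscp=0/1 */
    uint32_t version;   /* vrdtscp version; never 0 or -1 */
};

/* What Xen would arrange on a pcpu before running the guest. */
struct rdtscp_policy {
    bool emulate;       /* true: rdtscp traps and returns nsec */
    uint32_t tsc_aux;   /* value the guest observes in TSC_AUX */
};

static struct rdtscp_policy pick_policy(const struct guest_tsc_cfg *g,
                                        bool hw_rdtscp, bool tsc_reliable)
{
    struct rdtscp_policy p;

    if (!g->vrdtscp) {
        /* Default: emulated, TSC_AUX reads as -1. */
        p.emulate = true;
        p.tsc_aux = UINT32_MAX;
    } else if (hw_rdtscp && tsc_reliable) {
        /* Fast path: native rdtscp, TSC_AUX carries the version. */
        p.emulate = false;
        p.tsc_aux = g->version;
    } else {
        /* Opted in, but no hardware support or TSC not reliable. */
        p.emulate = true;
        p.tsc_aux = 0;
    }
    return p;
}

/* Called when the guest moves to a new "TSC domain"
 * (restore/migrate/NUMA). Skips the reserved values 0 and -1. */
static void bump_version(struct guest_tsc_cfg *g)
{
    do {
        g->version++;
    } while (g->version == 0 || g->version == UINT32_MAX);
}
```

For example, a vrdtscp=1 guest with version 7 on hardware with rdtscp and a reliable TSC would run natively and see TSC_AUX=7; the same guest on a machine with an unreliable TSC would be emulated and see TSC_AUX=0.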
The pvclock algorithm in the OS would need to be changed to use rdtscp (instead of rdtsc) and check for TSC_AUX=0 to do the right thing. If not changed, it will continue to work, but more slowly. It works whether or not rdtsc is emulated because, when emulated, rdtsc returns the hardware TSC if the instruction was attempted in kernel mode.

The only problem I can see is that when vrdtscp==1, other apps running on that guest that use rdtsc (no p) directly (i.e. haven't been modified to use pv-rdtscp) will continue to have the same kinds of failure on save/restore/migration. But this is true of all the solutions proposed so far: Xen can only turn on emulation guest-wide, not per-app.

Also, even on machines where TSC is reliable, there is a small chance that consecutive TSC values will be read from different processors, so TSC might appear to go backwards by some small amount. So apps must still put raw TSC values through a "monotonicity filter". (Xen already does this for emulated reads of TSC.)

Comments?

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Friday, September 18, 2009 10:30 AM
> To: Xen-Devel (E-mail)
> Subject: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've
> been looking for)
>
>
> Xen doesn't appear to support the rdtscp instruction.
> Should it? (And specifically I'm wondering whether
> it should be emulated whenever rdtsc is emulated
> but see below for another intriguing possibility.)
>
> Rdtscp is unprivileged and we have apps that are using it
> on bare metal, after validating that the CPU supports it.
> The instruction is available on most (all?) recent AMD
> CPUs and Intel's Nehalem supports it.
>
> For an OS to support rdtscp properly, the OS must (once at boot)
> wrmsr a different value for each cpu to a "TSC_AUX" register
> and this register is read along with the TSC when the rdtscp
> instruction is executed. This allows an app to determine
> if two consecutive rdtscp's are (or are not) executed on the
> same CPU.
>
> It appears that all recent RHEL kernels write to TSC_AUX if
> the CPU supports rdtscp. I'm told Windows 2008 notably does
> not. Don't know about SLES or other Windoze.
>
> It's not clear to me if/how rdtscp can/should be virtualized.
> To do it properly, the value written to the TSC_AUX msr
> would become part of the vcpu's state, and would need to
> be changed whenever a vcpu->pcpu mapping changes. To meet
> only the current use model of the instruction, Xen could write
> TSC_AUX for each pcpu on Xen boot and always ignore guest
> OS writes to TSC_AUX. (This assumes that no OS ever reads
> TSC_AUX and attempts to match it with the value that it
> thought it wrote to TSC_AUX; and assumes that
>
> One solution is for Xen to deny the existence of rdtscp even
> when Xen is running on hardware that supports it. Is that
> exactly what is happening?
>
> Now thinking creatively, could TSC_AUX be used similarly
> to the pvclock version number... Xen bumps it whenever a
> migration occurs which would prompt an app to go out
> and reread new values for scaling and offset (possibly
> via specially-handled-by-Xen usermode rdmsr)? Hmmm...
> I think it might be the answer I've been looking for!
> (Go ahead, shoot me down :-)
>
> Dan
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel