RE: [Xen-devel] rdtscP and xen (and maybe the app-tsc answer I've been looking for)



> > However, I do need one special case to indicate
> > emulation vs non-emulation, so wraparound is
> > still a problem.
> 
> I was assuming you'd just repurpose the existing version number scheme
> which is always even, and therefore can never equal -1.

That wasn't my plan but if it can be made to work (see
below), it probably saves code in Xen.

> What's the full algorithm for detecting this feature?  Usermode has to
> establish:
> 
>    1. It is running under Xen (or not, if you expect this to be
>       implemented on multiple hypervisors)
>    2. rdtscp is available
>    3. the ABI is actually being implemented, ie:
>          1. the tsc_aux value actually has the correct meaning
>          2. it has a working mechanism for getting the tsc scaling
>             parameters
>          3. (accommodate ways to evolve the ABI in a back-compatible way)
> before it can do anything else.

Yes, that's what I was thinking.  I was planning on prototyping
these checks with "userland-rdmsr", but "userland-hypercall" or
"userland-shared-page" could also work.
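As a rough illustration of the first two checks in the list above, plus a placeholder for the third, here is a minimal C sketch that works on raw CPUID register values instead of executing CPUID, so the logic can be exercised anywhere. The Xen signature ("XenVMMXenVMM" at leaf 0x40000000) and the RDTSCP bit (leaf 0x80000001, EDX bit 27) are architectural; `abi_version`, `PVRDTSCP_ABI_MIN` and `pvrdtscp_usable()` are hypothetical stand-ins for whichever query mechanism (userland-rdmsr, hypercall or shared page) is eventually chosen. If rdtscp is hidden from the guest cpuid, as discussed below, step 2 would have to move into that query too.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Copy a CPUID register into 4 chars in x86 (little-endian) byte order. */
static void reg_to_chars(uint32_t reg, char out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (char)((reg >> (8 * i)) & 0xff);
}

/* Step 1: CPUID leaf 0x40000000 returns the hypervisor signature in
 * EBX:ECX:EDX; Xen reports "XenVMMXenVMM". */
static bool running_on_xen(uint32_t ebx, uint32_t ecx, uint32_t edx)
{
    char sig[13];
    reg_to_chars(ebx, sig);
    reg_to_chars(ecx, sig + 4);
    reg_to_chars(edx, sig + 8);
    sig[12] = '\0';
    return strcmp(sig, "XenVMMXenVMM") == 0;
}

/* Step 2: CPUID leaf 0x80000001, EDX bit 27 advertises RDTSCP. */
static bool cpu_has_rdtscp(uint32_t edx_80000001)
{
    return (edx_80000001 >> 27) & 1u;
}

/* Step 3 (HYPOTHETICAL): abi_version would come from the still-undecided
 * query mechanism; 0 means "not implemented". Any version >= the one we
 * understand is accepted, assuming later versions stay back-compatible. */
#define PVRDTSCP_ABI_MIN 1u

static bool pvrdtscp_usable(uint32_t ebx, uint32_t ecx, uint32_t edx,
                            uint32_t edx_80000001, uint32_t abi_version)
{
    return running_on_xen(ebx, ecx, edx) &&
           cpu_has_rdtscp(edx_80000001) &&
           abi_version >= PVRDTSCP_ABI_MIN;
}
```

In real code the register values would come from executing CPUID (for example via a compiler intrinsic); keeping the checks as pure functions just makes each step easy to test in isolation.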

> If nothing else, it's probably worth removing the rdtscp feature from the
> logical guest cpuid, so that nothing else tries to use it for its own
> purposes; in other words, you're exclusively claiming rdtscp for this
> ABI.  Or you could disable this ABI if a guest kernel tries to set TSC_AUX.

I was thinking that setting pvrdtscp=1 would override
any kernel use of rdtscp/TSC_AUX, but clearing the
has_rdtscp cpuid flag and having userland detect the
feature some other way (rather than by checking cpuid
for has_rdtscp) would be a better way to avoid possible
conflicts.

> > I've restricted the scheme to constant_tsc as I think
> > it breaks down due to nasty races if running on a
> > machine where the pvclock parameters differ across
> > different pcpus.  I think the races can only be
> > avoided if Xen sets the TSC_AUX for all of the
> > pcpus running a pvrdtscp domain while all are idle.
> >
> > Is there a scheme that avoids the races? 
> 
> rdtscp makes it quite easy to avoid races because you get the tsc and
> metadata about the tsc atomically.  You just need to encode enough info
> in the metadata to do the conversion.

Yes, but I don't think there are enough bits to encode
it all (TSC_AUX is only 32 bits, right?).

> The obvious thing to do is to pack a version number and pcpu number into
> TSC_AUX.  Usermode would maintain an array of pv_clock parameters, one
> for each pcpu.  If the version number matches, then it uses the
> parameters it has; if not it fetches new parameters and repeats the
> rdtscp.  There's no need to worry about either thread or vcpu context
> switches because you get the (tsc,params) tuple atomically, which is the
> tricky bit without rdtscp.
> 
> (The version number would be truncated wrt the normal pvclock version
> number, but it just needs to be large enough to avoid aliasing from
> wrapping; I'm assuming something like 24 bits version and 8 bits cpu
> number.)
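A minimal C sketch of the scheme quoted above, assuming the 24-bit version / 8-bit pcpu split and pvclock-style scaling fields (`tsc_timestamp`, `system_time`, `tsc_to_system_mul`, `tsc_shift`). The `read_tscp_fn` and `fetch_params_fn` hooks are hypothetical: one would wrap the rdtscp instruction, the other whatever mechanism delivers a given pcpu's parameters. Cache entries are assumed to start with an odd version so nothing matches before the first fetch.

```c
#include <stdint.h>

/* Assumed TSC_AUX layout: bits 31..8 = truncated version, bits 7..0 = pcpu. */
#define AUX_CPU_BITS 8
#define AUX_CPU_MASK 0xffu
#define AUX_VER_MASK 0xffffffu

static uint32_t aux_pack(uint32_t version, uint32_t pcpu)
{
    return ((version & AUX_VER_MASK) << AUX_CPU_BITS) | (pcpu & AUX_CPU_MASK);
}

static uint32_t aux_version(uint32_t aux) { return aux >> AUX_CPU_BITS; }
static uint32_t aux_pcpu(uint32_t aux) { return aux & AUX_CPU_MASK; }

/* Per-pcpu scaling parameters, modelled on pvclock's vcpu time info. */
struct tsc_params {
    uint32_t version;            /* full version; only low 24 bits compared */
    uint64_t tsc_timestamp;      /* TSC value at system_time */
    uint64_t system_time;        /* nanoseconds */
    uint32_t tsc_to_system_mul;  /* fraction scaled by 2^32 */
    int8_t   tsc_shift;
};

/* pvclock-style conversion: shift the delta, then multiply by mul / 2^32.
 * The 64x32 multiply is split into halves so no 128-bit type is needed. */
static uint64_t scale(const struct tsc_params *p, uint64_t tsc)
{
    uint64_t delta = tsc - p->tsc_timestamp;
    if (p->tsc_shift >= 0)
        delta <<= p->tsc_shift;
    else
        delta >>= -p->tsc_shift;
    uint64_t hi = (delta >> 32) * p->tsc_to_system_mul;
    uint64_t lo = ((delta & 0xffffffffu) * p->tsc_to_system_mul) >> 32;
    return p->system_time + hi + lo;
}

/* HYPOTHETICAL hooks: how the TSC and parameters are obtained is still open. */
typedef uint64_t (*read_tscp_fn)(uint32_t *aux);              /* wraps rdtscp */
typedef void (*fetch_params_fn)(uint32_t pcpu, struct tsc_params *out);

/* One cache slot per possible 8-bit pcpu number. */
static uint64_t read_ns(struct tsc_params cache[256],
                        read_tscp_fn read_tscp, fetch_params_fn fetch)
{
    for (;;) {
        uint32_t aux;
        uint64_t tsc = read_tscp(&aux);   /* tsc and aux come from one rdtscp */
        struct tsc_params *p = &cache[aux_pcpu(aux)];
        if ((p->version & AUX_VER_MASK) == aux_version(aux))
            return scale(p, tsc);
        fetch(aux_pcpu(aux), p);          /* stale entry: refresh, then retry */
    }
}
```

The loop only trusts cached parameters when the version in TSC_AUX, obtained by the same instruction as the TSC, matches the cached one; otherwise it refreshes that pcpu's slot and reads again.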

I think a race occurs if the vcpu switches pcpus TWICE,
from pcpu-A to pcpu-B and back to pcpu-A, doing rdtscp
each time on pcpu-A but reading one or more pvclock
parameters (those too big to be encoded in TSC_AUX)
while on pcpu-B.  If Xen can atomically bump/change
TSC_AUX on *all* pcpus running a guest vcpu, the race
can be avoided.  But I suspect that is too expensive (some
kind of rendezvous would be required for each bump on
any processor).
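The interleaving described above can be replayed deterministically (this is a step-by-step replay, not a real concurrent test). The sketch assumes two things the scheme itself leaves open: the fetch returns the parameters of whichever pcpu the vcpu happens to be on at that moment, and per-pcpu version counters can hold the same value. Under those assumptions the version check on the second rdtscp passes even though the cached entry for pcpu-A now holds pcpu-B's multiplier. The names `params`, `pcpu_params` and `aba_race_hits()` are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Per-pcpu parameters, reduced to the fields needed for the illustration. */
struct params { uint32_t version; uint32_t mul; };

/* Hypothetical state: pcpu-A (0) and pcpu-B (1) have each bumped their own
 * version counter once, so both read 2, but their scaling differs. */
static const struct params pcpu_params[2] = { {2, 100}, {2, 200} };

/* Returns true if the version check passes while the cache entry for the
 * pcpu named in TSC_AUX holds another pcpu's parameters. */
static bool aba_race_hits(void)
{
    struct params cache[2] = { {1, 0}, {1, 0} };  /* odd version: never valid */
    int cur = 0;                                  /* vcpu starts on pcpu-A */

    /* 1. rdtscp on A: TSC_AUX reports (version 2, pcpu A). */
    uint32_t aux_ver = pcpu_params[cur].version;
    int aux_cpu = cur;
    if (cache[aux_cpu].version != aux_ver) {
        cur = 1;                                  /* 2. migrated to B, so the  */
        cache[aux_cpu] = pcpu_params[cur];        /*    fetch reads B's params */
        cur = 0;                                  /* 3. migrated back to A     */
    }

    /* 4. rdtscp on A again: the version matches, so the cache is trusted. */
    aux_ver = pcpu_params[cur].version;
    aux_cpu = cur;
    return cache[aux_cpu].version == aux_ver &&
           cache[aux_cpu].mul != pcpu_params[aux_cpu].mul;
}
```

If instead the fetch were defined to return the parameters of the pcpu named in TSC_AUX, together with that pcpu's own version, step 2 would store pcpu-A's values and the replay would not hit.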

> > Fortunately, this also has the effect of greatly
> > reducing the version increase frequency.
> 
> I don't think that's going to be a huge issue; fetching time parameters
> with a syscall/hypercall would be on the same order as doing an emulated
> rdtsc, and would only need to happen, say, once per timeslice (100Hz?)
> at the outside.

Even if my assumption about the race (above) is incorrect,
32 bits is not very much time at 100Hz.  But the version
bump needs to occur synchronously with every P/C-state
transition for pvclock to work on non_constant_tsc
machines, doesn't it?  How frequently can those
transitions occur?
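To put numbers on the wraparound question, here is a small helper computing how long a version field lasts before it wraps. The name `seconds_to_wrap` and its `step` parameter are illustrative; `step` is 2 if only even versions are ever visible, as with the pvclock convention mentioned earlier in the thread.

```c
#include <stdint.h>

/* Seconds until a `bits`-wide version field wraps, if each update advances
 * it by `step` and updates happen `updates_per_sec` times per second.
 * Requires bits < 64. */
static double seconds_to_wrap(unsigned bits, unsigned step,
                              double updates_per_sec)
{
    double values = (double)(UINT64_C(1) << bits);
    return values / ((double)step * updates_per_sec);
}
```

At 100Hz this gives about 46.6 hours for a 24-bit field (about 23.3 hours if only even values are used) and about 497 days for a full 32-bit field; tying bumps to P/C-state transitions would shorten these in proportion to how often those transitions actually occur.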
 
> > The rate is synced but the values may not be.  Since
> > software (BIOS or Xen) sets tsc on each processor
> > it is essentially impossible to ensure they are
> > identical.  The rendezvous algorithm should be able
> > to set them so that they are "unobservably" different,
> > but I keep hearing "within 2usec".  (It would be
> > interesting to measure this across a broad set
> > of machines.)  So it's probably prudent to recommend
> > that apps be prepared for the possibility even if
> > it never happens.
> 
> You don't need to guarantee anything stronger than they'd see on bare
> hardware.  You also need to be more precise about exactly what you're
> guaranteeing.
> 
> Are you saying that a single thread will never see regressing tscs?
> That just requires making sure that Xen gets the tscs synced closer than
> the context switch time of a thread between cpus, which should be possible.
> 
> Or are you making the stronger guarantee that two threads running
> concurrently on different cpus doing rdtsc will see monotonically
> increasing tscs with respect to the ordering of all their operations?
> That would require arbitrarily close syncing (well, within the time it
> takes a cacheline to bounce, I guess).

I guess this all depends on what Xen is capable of
guaranteeing.  If Xen can provide a "cacheline
bounce guarantee", the app shouldn't have to care.

Linux now seems to provide a cacheline bounce guarantee for
itself, but AFAIK it has no way to communicate that to an app
using raw rdtsc{,p}, and the relevant syscalls all have a
monotonicity option and/or too coarse a resolution for it
to matter.
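For apps that only need the weaker per-thread guarantee discussed above, one defensive option, in the spirit of the earlier suggestion that apps be prepared for small cross-pcpu differences, is to clamp raw TSC readings so a thread never sees time go backwards, for example after migrating to a pcpu whose TSC is slightly behind. A minimal sketch with a hypothetical helper `monotonic_tsc()`; it does nothing for the stronger cross-thread ordering.

```c
#include <stdint.h>

/* Largest value this thread has returned so far (C11 thread-local). */
static _Thread_local uint64_t last_tsc;

/* Clamp a raw TSC reading so that successive calls on the same thread
 * never go backwards. */
static uint64_t monotonic_tsc(uint64_t raw)
{
    if (raw < last_tsc)
        return last_tsc;   /* small backwards step, e.g. cross-pcpu skew */
    last_tsc = raw;
    return raw;
}
```

Clamping hides small backwards steps at the cost of briefly returning a repeated value; it cannot repair a large offset between pcpus.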

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

