[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 5/5] x86: Rework MSR_TSC_AUX handling from scratch.



On Tue, Feb 20, 2018 at 06:28:15PM +0000, Andrew Cooper wrote:
> On 20/02/18 17:35, Roger Pau Monné wrote:
> > On Tue, Feb 20, 2018 at 11:58:43AM +0000, Andrew Cooper wrote:
> >> There are many problems with MSR_TSC_AUX handling.
> >>
> >> To being with, the RDPID instruction reads MSR_TSC_AUX, but it is only the
> >> RDTSCP feature which enumerates the MSR.  Therefore, RDPID functionally
> >> depends on RDTSCP.
> >>
> >> For PV guests, we hide RDTSCP but advertise RDPID.  We also silently drop
> >> writes to MSR_TSC_AUX, which is very antisocial.  Therefore, enable RDTSCP 
> >> for
> >> PV guests, which in turn allows RDPID to work.
> >>
> >> To support RDTSCP properly for PV guests, the MSR_TSC_AUX handling is moved
> >> into the generic MSR policy infrastructure, and becomes common.  One
> >> improvement is that we will now reject invalid values, rather than silently
> >> truncating an accepting them.  This also causes the emulator to reject 
> >> RDTSCP
> >> for guests without the features.
> >>
> >> One complication is TSC_MODE_PVRDTSCP, in which Xen takes control of
> >> MSR_TSC_AUX and the reported value is actually the migration incarnation.  
> >> The
> >> previous behaviour of this mode was to silently drop writes, but as it is a
> >> break in the x86 ABI to start with, switch the semantics to be more sane, 
> >> and
> >> have TSC_MODE_PVRDTSCP make the MSR properly read-only.
> >>
> >> With PV guests getting to use MSR_TSC_AUX properly now, the MSR needs
> >> migrating.  Cope with it the common migration logic.  Care must be taken
> >> however to avoid sending the MSR if TSC_MODE_PVRDTSCP is active, as the
> >> receiving side will reject the guest_wrmsr().
> >>
> >> What remains is that tsc_set_info() need to broadcast d->arch.incarnation 
> >> to
> >> all vCPUs MSR block if in TSC_MODE_PVRDTSCP, so the context switching and
> >> emulation code functions correctly.
> > I have one likely stupid question about the PVRDTSCP, how does the
> > guest know it's actually using it? I'm not able to find any cpuid or
> > xenfeat to signal the guest it's running in PVRDTSCP mode, and thus
> > that MSR_TSC_AUX contains this magic incarnation value?
> 
> 40000x03[0].b
> 
> Which I notice now isn't a constant in the public API. :(

Right, looks like TSC_MODE_* should be in cpuid.h, this is part of the
ABI. The values are already listed in the comment, but it's very easy
for this to get out of sync.

> (Oh - another misfeature this feature is responsible for is the fact
> that there is a time subleaf of the hypervisor CPUID leaves, without any
> ability to calculate the number of valid subleaves.)

The same applies to the HVM leaf AFAICT, although eax on sub-leaf 0 can
be used to signal the presence of other sub-leaves in that case.

> I've never seen any guest-side code to cope with feature, even in the
> classic-linux fork.  I'll defer to Konrad on the topic, but I'm fairly
> sure it is vestigial Oracle-ism.
> 
> >
> > Which TBH seems quite pointless, the guest should be perfectly capable
> > of maintaining it's own count of migrations.
> 
> The only useful piece of information is whether Xen is having to
> virtualise your time values because the frequency is now different to
> how it was previously advertised.
> 
> Oh - we also include support for emulating RDTSCP for PV guests only, on
> incapable hardware (via emul-invl-op).  I was considering restricting
> this to PVRDTSC mode only.

Seems sensible.

> >
> >> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> >> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> >> index 0539551..ab24f87 100644
> >> --- a/xen/arch/x86/hvm/hvm.c
> >> +++ b/xen/arch/x86/hvm/hvm.c
> >> @@ -792,7 +792,7 @@ static int hvm_save_cpu_ctxt(struct domain *d, 
> >> hvm_domain_context_t *h)
> >>  
> >>          ctxt.tsc = hvm_get_guest_tsc_fixed(v, 
> >> d->arch.hvm_domain.sync_tsc);
> >>  
> >> -        ctxt.msr_tsc_aux = hvm_msr_tsc_aux(v);
> >> +        /* ctxt.msr_tsc_aux omitted - now sent via generic MSR record. */
> >>  
> >>          hvm_get_segment_register(v, x86_seg_idtr, &seg);
> >>          ctxt.idtr_limit = seg.limit;
> >> @@ -1046,7 +1046,24 @@ static int hvm_load_cpu_ctxt(struct domain *d, 
> >> hvm_domain_context_t *h)
> >>      if ( hvm_funcs.tsc_scaling.setup )
> >>          hvm_funcs.tsc_scaling.setup(v);
> >>  
> >> -    v->arch.hvm_vcpu.msr_tsc_aux = ctxt.msr_tsc_aux;
> >> +    /*
> >> +     * Backwards compatibility.  MSR_TSC_AUX contains different 
> >> information
> >> +     * depending on whether TSC_MODE_PVRDTSCP is enabled.
> >> +     *
> >> +     * Before Xen 4.11, ctxt.msr_tsc_aux was sent unconditionally and 
> >> restored
> >> +     * here, but the value was otherwise ignored in TSC_MODE_PVRDTSCP.
> >> +     *
> >> +     * In 4.11, the logic was changed to send MSR_TSC_AUX via the generic 
> >> MSR
> >> +     * mechanism if tsc_mode != TSC_MODE_PVRDTSCP.  If tsc_mode ==
> >> +     * TSC_MODE_PVRDTSCP, the tsc logic is responsibile for setting the
> >> +     * correct value.
> >> +     *
> >> +     * For compatibility with migration streams from before 4.11, we 
> >> restore
> >> +     * from ctxt.msr_tsc_aux if the TSC code hasn't/isn't in charge, and 
> >> we've
> >> +     * not seen a value arrive in the generic MSR record.
> >> +     */
> >> +    if ( d->arch.tsc_mode != TSC_MODE_PVRDTSCP && !v->arch.msr->tsc_aux )
> >> +        v->arch.msr->tsc_aux = ctxt.msr_tsc_aux;
> > I'm having a hard time figuring out whether the MSRs have been loaded
> > at this point, I assume there's some guarantee that MSR will be loaded
> > before loading the CPU context?
> 
> Actually, this is subtly broken, but for different reasons.  It will
> break systems Remus/COLO (if anyone still cares).
> 
> TSC_INFO is ahead of HVM_CONTEXT in the migration stream, but this is
> "because it always was like that", not because there is a specified
> dependency.
> 
> The HVM_CONTEXT record contains both this CPU record, and the MSR
> record.  There is no particular order of records in the blob, but they
> appear in age order meaning that MSR record is later in the stream than
> the CPU record.

So if the MSR record comes later in the stream, the comment is
wrong:

"and we've not seen a value arrive in the generic MSR record."

Will always be true when processing the CPU record, because the MSR
record hasn't been processed yet.

I would rewrite that as:

"For compatibility with migration streams from before 4.11, we restore
from ctxt.msr_tsc_aux if the TSC code hasn't/isn't in charge. Check
whether tsc_aux already contains a value, in case the MSR record was
restored before the CPU one. If not just set it to the value found in
the CPU record, a later MSR record can always overwrite it."

Which seems clearer to me, but maybe that's because I'm not very
familiar with the code itself.

> The TSC mode itself (unlike its parameters) really needs to be
> create-time static configuration (along with a load of other
> parameters), and there is a protoplan to make this happen.
> 
> However, I think we can rely on TSC_MODE_PVRDTSCP being known before
> considering whether to restore.  Restoring data will, in practice, come
> either from ctxt.msr_tsc_aux, or from a later MSR record, but not both. 
> However, in the case that ctxt.msr_tsc_aux is the source of information,
> we should (re)load it when restoring into a non-clean domain, which is
> the Remus/COLO case.

Hm, maybe an 'easy' way to solve this would be to just zero all the
relevant internal structures before attempting a restore?

> The other option is to leave the HVM migration side using this value,
> and only move the PV side in the common MSR path.  This is probably a
> better option until I get around to migration-v2 for the hypervisor
> data, and unifying the PV and HVM state handling.

But the HVM code as-is seems wrong to me for the PVRDTSCP case because
it overwrites the new incarnation with the previous one from the CPU
record on restore? (because the TSC mode is restored before the CPU
record).

Maybe we could get rid of the PVRDTSCP mode completely and forget
about all that?

We just need to reserve that TSC mode in the public ABI in order to
prevent anyone from reusing it, but I'm not sure we need to continue
to implement it.

Thanks, Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.