[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 5/5] x86: Rework MSR_TSC_AUX handling from scratch.



On 20/02/18 17:35, Roger Pau Monné wrote:
> On Tue, Feb 20, 2018 at 11:58:43AM +0000, Andrew Cooper wrote:
>> There are many problems with MSR_TSC_AUX handling.
>>
>> To being with, the RDPID instruction reads MSR_TSC_AUX, but it is only the
>> RDTSCP feature which enumerates the MSR.  Therefore, RDPID functionally
>> depends on RDTSCP.
>>
>> For PV guests, we hide RDTSCP but advertise RDPID.  We also silently drop
>> writes to MSR_TSC_AUX, which is very antisocial.  Therefore, enable RDTSCP 
>> for
>> PV guests, which in turn allows RDPID to work.
>>
>> To support RDTSCP properly for PV guests, the MSR_TSC_AUX handling is moved
>> into the generic MSR policy infrastructure, and becomes common.  One
>> improvement is that we will now reject invalid values, rather than silently
>> truncating an accepting them.  This also causes the emulator to reject RDTSCP
>> for guests without the features.
>>
>> One complication is TSC_MODE_PVRDTSCP, in which Xen takes control of
>> MSR_TSC_AUX and the reported value is actually the migration incarnation.  
>> The
>> previous behaviour of this mode was to silently drop writes, but as it is a
>> break in the x86 ABI to start with, switch the semantics to be more sane, and
>> have TSC_MODE_PVRDTSCP make the MSR properly read-only.
>>
>> With PV guests getting to use MSR_TSC_AUX properly now, the MSR needs
>> migrating.  Cope with it the common migration logic.  Care must be taken
>> however to avoid sending the MSR if TSC_MODE_PVRDTSCP is active, as the
>> receiving side will reject the guest_wrmsr().
>>
>> What remains is that tsc_set_info() need to broadcast d->arch.incarnation to
>> all vCPUs MSR block if in TSC_MODE_PVRDTSCP, so the context switching and
>> emulation code functions correctly.
> I have one likely stupid question about the PVRDTSCP, how does the
> guest know it's actually using it? I'm not able to find any cpuid or
> xenfeat to signal the guest it's running in PVRDTSCP mode, and thus
> that MSR_TSC_AUX contains this magic incarnation value?

40000x03[0].b

Which I notice now isn't a constant in the public API. :(

(Oh - another misfeature this feature is responsible for is the fact
that there is a time subleaf of the hypervisor CPUID leaves, without any
ability to calculate the number of valid subleaves.)

I've never seen any guest-side code to cope with feature, even in the
classic-linux fork.  I'll defer to Konrad on the topic, but I'm fairly
sure it is vestigial Oracle-ism.

>
> Which TBH seems quite pointless, the guest should be perfectly capable
> of maintaining it's own count of migrations.

The only useful piece of information is whether Xen is having to
virtualise your time values because the frequency is now different to
how it was previously advertised.

Oh - we also include support for emulating RDTSCP for PV guests only, on
incapable hardware (via emul-invl-op).  I was considering restricting
this to PVRDTSC mode only.

>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
>> diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
>> index 0539551..ab24f87 100644
>> --- a/xen/arch/x86/hvm/hvm.c
>> +++ b/xen/arch/x86/hvm/hvm.c
>> @@ -792,7 +792,7 @@ static int hvm_save_cpu_ctxt(struct domain *d, 
>> hvm_domain_context_t *h)
>>  
>>          ctxt.tsc = hvm_get_guest_tsc_fixed(v, d->arch.hvm_domain.sync_tsc);
>>  
>> -        ctxt.msr_tsc_aux = hvm_msr_tsc_aux(v);
>> +        /* ctxt.msr_tsc_aux omitted - now sent via generic MSR record. */
>>  
>>          hvm_get_segment_register(v, x86_seg_idtr, &seg);
>>          ctxt.idtr_limit = seg.limit;
>> @@ -1046,7 +1046,24 @@ static int hvm_load_cpu_ctxt(struct domain *d, 
>> hvm_domain_context_t *h)
>>      if ( hvm_funcs.tsc_scaling.setup )
>>          hvm_funcs.tsc_scaling.setup(v);
>>  
>> -    v->arch.hvm_vcpu.msr_tsc_aux = ctxt.msr_tsc_aux;
>> +    /*
>> +     * Backwards compatibility.  MSR_TSC_AUX contains different information
>> +     * depending on whether TSC_MODE_PVRDTSCP is enabled.
>> +     *
>> +     * Before Xen 4.11, ctxt.msr_tsc_aux was sent unconditionally and 
>> restored
>> +     * here, but the value was otherwise ignored in TSC_MODE_PVRDTSCP.
>> +     *
>> +     * In 4.11, the logic was changed to send MSR_TSC_AUX via the generic 
>> MSR
>> +     * mechanism if tsc_mode != TSC_MODE_PVRDTSCP.  If tsc_mode ==
>> +     * TSC_MODE_PVRDTSCP, the tsc logic is responsibile for setting the
>> +     * correct value.
>> +     *
>> +     * For compatibility with migration streams from before 4.11, we restore
>> +     * from ctxt.msr_tsc_aux if the TSC code hasn't/isn't in charge, and 
>> we've
>> +     * not seen a value arrive in the generic MSR record.
>> +     */
>> +    if ( d->arch.tsc_mode != TSC_MODE_PVRDTSCP && !v->arch.msr->tsc_aux )
>> +        v->arch.msr->tsc_aux = ctxt.msr_tsc_aux;
> I'm having a hard time figuring out whether the MSRs have been loaded
> at this point, I assume there's some guarantee that MSR will be loaded
> before loading the CPU context?

Actually, this is subtly broken, but for different reasons.  It will
break systems Remus/COLO (if anyone still cares).

TSC_INFO is ahead of HVM_CONTEXT in the migration stream, but this is
"because it always was like that", not because there is a specified
dependency.

The HVM_CONTEXT record contains both this CPU record, and the MSR
record.  There is no particular order of records in the blob, but they
appear in age order meaning that MSR record is later in the stream than
the CPU record.

The TSC mode itself (unlike its parameters) really needs to be
create-time static configuration (along with a load of other
parameters), and there is a protoplan to make this happen.

However, I think we can rely on TSC_MODE_PVRDTSCP being known before
considering whether to restore.  Restoring data will, in practice, come
either from ctxt.msr_tsc_aux, or from a later MSR record, but not both. 
However, in the case that ctxt.msr_tsc_aux is the source of information,
we should (re)load it when restoring into a non-clean domain, which is
the Remus/COLO case.

The other option is to leave the HVM migration side using this value,
and only move the PV side in the common MSR path.  This is probably a
better option until I get around to migration-v2 for the hypervisor
data, and unifying the PV and HVM state handling.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.