[Xen-devel] [PATCH v5 06/21] xen/x86: Improvements to in-hypervisor cpuid sanity checks
Currently, {pv,hvm}_cpuid() has a large quantity of essentially-static logic
for modifying the features visible to a guest.  A lot of this can be subsumed
by {pv,hvm}_featuremask, which identify the features available on this
hardware which could be given to a PV or HVM guest.

This is a step in the direction of full per-domain cpuid policies, but lots
more development is needed for that.  As a result, the static checks are
simplified, but the dynamic checks need to remain for now.

As a side effect, some of the logic for special features can be improved.
OSXSAVE and OSPKE will be automatically cleared because of being absent in
the featuremask.  This allows the fast-forward logic to be more simple.

In addition, there are some corrections to the existing logic:

 * Hiding PSE36 out of PAE mode is architecturally wrong.  It turns out that
   it was a bugfix for running HyperV under Xen, which wanted to see PSE36
   even after choosing to use PAE paging.  PSE36 is not supported by shadow
   paging, so is hidden from non-HAP guests, but is still visible for HAP
   guests.  It is also leaked into non-HAP guests when the guest is already
   running in PAE mode.
 * Changing the visibility of RDTSCP based on host TSC stability or virtual
   TSC mode is bogus, so dropped.
 * When emulating Intel to a guest, the common features in e1d should be
   cleared.
 * The APIC bit in e1d (on non-Intel) is also a fast-forward from the
   APIC_BASE MSR.

As a small improvement, use compiler-visible &'s and |'s, rather than
{clear,set}_bit().

Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
---
CC: Jan Beulich <JBeulich@xxxxxxxx>

v2:
 * Reinstate some of the dynamic checks for now.  Future development work
   will instate a complete per-domain policy.
 * Fix OSXSAVE handling for PV guests.

v3:
 * Better handling of the cross-vendor case.
 * Improvements to the handling of special features.
 * Correct PSE36 to being a HAP-only feature.
 * Yet more OSXSAVE fixes for PV guests.
v4:
 * Leak PSE36 into shadow guests to fix buggy versions of Hyper-V.
 * Leak MTRR into the hardware domain to fix Xenolinux dom0.
 * Change cross-vendor 1D disabling logic.
 * Avoid reading arch.pv_vcpu for PVH guests.

v5:
 * Clarify the commit message regarding PSE36.
 * Drop redundant is_pv_domain().
 * Fix deep-C and P states with modern Linux PVOps on faulting-capable
   hardware.
---
 xen/arch/x86/hvm/hvm.c           | 125 ++++++++++++-------
 xen/arch/x86/traps.c             | 254 +++++++++++++++++++++++++++------------
 xen/include/asm-x86/cpufeature.h |   2 +
 3 files changed, 263 insertions(+), 118 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index b239f74..9ce61cc 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -72,6 +72,7 @@
 #include <public/memory.h>
 #include <public/vm_event.h>
 #include <public/arch-x86/cpuid.h>
+#include <asm/cpuid.h>
 
 bool_t __read_mostly hvm_enabled;
 
@@ -3358,62 +3359,71 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
         /* Fix up VLAPIC details. */
         *ebx &= 0x00FFFFFFu;
         *ebx |= (v->vcpu_id * 2) << 24;
+
+        *ecx &= hvm_featureset[FEATURESET_1c];
+        *edx &= hvm_featureset[FEATURESET_1d];
+
+        /* APIC exposed to guests, but Fast-forward MSR_APIC_BASE.EN back in. */
         if ( vlapic_hw_disabled(vcpu_vlapic(v)) )
-            __clear_bit(X86_FEATURE_APIC & 31, edx);
+            *edx &= ~cpufeat_bit(X86_FEATURE_APIC);
 
-        /* Fix up OSXSAVE. */
-        if ( *ecx & cpufeat_mask(X86_FEATURE_XSAVE) &&
-             (v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_OSXSAVE) )
+        /* OSXSAVE cleared by hvm_featureset.  Fast-forward CR4 back in. */
+        if ( v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_OSXSAVE )
             *ecx |= cpufeat_mask(X86_FEATURE_OSXSAVE);
-        else
-            *ecx &= ~cpufeat_mask(X86_FEATURE_OSXSAVE);
 
-        /* Don't expose PCID to non-hap hvm. */
+        /* Don't expose HAP-only features to non-hap guests. */
         if ( !hap_enabled(d) )
+        {
             *ecx &= ~cpufeat_mask(X86_FEATURE_PCID);
 
-        /* Only provide PSE36 when guest runs in 32bit PAE or in long mode */
-        if ( !(hvm_pae_enabled(v) || hvm_long_mode_enabled(v)) )
-            *edx &= ~cpufeat_mask(X86_FEATURE_PSE36);
+            /*
+             * PSE36 is not supported in shadow mode.  This bit should be
+             * unilaterally cleared.
+             *
+             * However, an unspecified version of Hyper-V from 2011 refuses
+             * to start as the "cpu does not provide required hw features" if
+             * it can't see PSE36.
+             *
+             * As a workaround, leak the toolstack-provided PSE36 value into a
+             * shadow guest if the guest is already using PAE paging (and
+             * won't care about reverting back to PSE paging).  Otherwise,
+             * knoble it, so a 32bit guest doesn't get the impression that it
+             * could try to use PSE36 paging.
+             */
+            if ( !(hvm_pae_enabled(v) || hvm_long_mode_enabled(v)) )
+                *edx &= ~cpufeat_mask(X86_FEATURE_PSE36);
+        }
         break;
+
     case 0x7:
         if ( count == 0 )
         {
-            if ( !cpu_has_smep )
-                *ebx &= ~cpufeat_mask(X86_FEATURE_SMEP);
-
-            if ( !cpu_has_smap )
-                *ebx &= ~cpufeat_mask(X86_FEATURE_SMAP);
+            /* Fold host's FDP_EXCP_ONLY and NO_FPU_SEL into guest's view. */
+            *ebx &= (hvm_featureset[FEATURESET_7b0] &
+                     ~special_features[FEATURESET_7b0]);
+            *ebx |= (host_featureset[FEATURESET_7b0] &
+                     special_features[FEATURESET_7b0]);
 
-            /* Don't expose MPX to hvm when VMX support is not available. */
-            if ( !(vmx_vmexit_control & VM_EXIT_CLEAR_BNDCFGS) ||
-                 !(vmx_vmentry_control & VM_ENTRY_LOAD_BNDCFGS) )
-                *ebx &= ~cpufeat_mask(X86_FEATURE_MPX);
+            *ecx &= hvm_featureset[FEATURESET_7c0];
 
+            /* Don't expose HAP-only features to non-hap guests. */
             if ( !hap_enabled(d) )
             {
-                /* Don't expose INVPCID to non-hap hvm. */
                 *ebx &= ~cpufeat_mask(X86_FEATURE_INVPCID);
-                /* X86_FEATURE_PKU is not yet implemented for shadow paging. */
                 *ecx &= ~cpufeat_mask(X86_FEATURE_PKU);
             }
 
-            if ( (*ecx & cpufeat_mask(X86_FEATURE_PKU)) &&
-                 (v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_PKE) )
+            /* OSPKE cleared by hvm_featureset.  Fast-forward CR4 back in. */
+            if ( v->arch.hvm_vcpu.guest_cr[4] & X86_CR4_PKE )
                 *ecx |= cpufeat_mask(X86_FEATURE_OSPKE);
-            else
-                *ecx &= ~cpufeat_mask(X86_FEATURE_OSPKE);
-
-            /* Don't expose PCOMMIT to hvm when VMX support is not available. */
-            if ( !cpu_has_vmx_pcommit )
-                *ebx &= ~cpufeat_mask(X86_FEATURE_PCOMMIT);
         }
-
         break;
+
     case 0xb:
         /* Fix the x2APIC identifier. */
         *edx = v->vcpu_id * 2;
         break;
+
     case 0xd:
         /* EBX value of main leaf 0 depends on enabled xsave features */
         if ( count == 0 && v->arch.xcr0 )
@@ -3430,9 +3440,12 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
                     *ebx = _eax + _ebx;
             }
         }
+
         if ( count == 1 )
         {
-            if ( cpu_has_xsaves && cpu_has_vmx_xsaves )
+            *eax &= hvm_featureset[FEATURESET_Da1];
+
+            if ( *eax & cpufeat_mask(X86_FEATURE_XSAVES) )
             {
                 *ebx = XSTATE_AREA_MIN_SIZE;
                 if ( v->arch.xcr0 | v->arch.hvm_vcpu.msr_xss )
@@ -3447,20 +3460,42 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
         break;
 
     case 0x80000001:
-        /* We expose RDTSCP feature to guest only when
-           tsc_mode == TSC_MODE_DEFAULT and host_tsc_is_safe() returns 1 */
-        if ( d->arch.tsc_mode != TSC_MODE_DEFAULT ||
-             !host_tsc_is_safe() )
-            *edx &= ~cpufeat_mask(X86_FEATURE_RDTSCP);
-        /* Hide 1GB-superpage feature if we can't emulate it. */
-        if (!hvm_pse1gb_supported(d))
+        *ecx &= hvm_featureset[FEATURESET_e1c];
+        *edx &= hvm_featureset[FEATURESET_e1d];
+
+        /* If not emulating AMD, clear the duplicated features in e1d. */
+        if ( d->arch.x86_vendor != X86_VENDOR_AMD )
+            *edx &= ~CPUID_COMMON_1D_FEATURES;
+        /* fast-forward MSR_APIC_BASE.EN if it hasn't already been clobbered. */
+        else if ( vlapic_hw_disabled(vcpu_vlapic(v)) )
+            *edx &= ~cpufeat_bit(X86_FEATURE_APIC);
+
+        /* Don't expose HAP-only features to non-hap guests. */
+        if ( !hap_enabled(d) )
+        {
             *edx &= ~cpufeat_mask(X86_FEATURE_PAGE1GB);
-        /* Only provide PSE36 when guest runs in 32bit PAE or in long mode */
-        if ( !(hvm_pae_enabled(v) || hvm_long_mode_enabled(v)) )
-            *edx &= ~cpufeat_mask(X86_FEATURE_PSE36);
-        /* Hide data breakpoint extensions if the hardware has no support. */
-        if ( !boot_cpu_has(X86_FEATURE_DBEXT) )
-            *ecx &= ~cpufeat_mask(X86_FEATURE_DBEXT);
+
+            /*
+             * PSE36 is not supported in shadow mode.  This bit should be
+             * unilaterally cleared.
+             *
+             * However, an unspecified version of Hyper-V from 2011 refuses
+             * to start as the "cpu does not provide required hw features" if
+             * it can't see PSE36.
+             *
+             * As a workaround, leak the toolstack-provided PSE36 value into a
+             * shadow guest if the guest is already using PAE paging (and
+             * won't care about reverting back to PSE paging).  Otherwise,
+             * knoble it, so a 32bit guest doesn't get the impression that it
+             * could try to use PSE36 paging.
+             */
+            if ( !(hvm_pae_enabled(v) || hvm_long_mode_enabled(v)) )
+                *edx &= ~cpufeat_mask(X86_FEATURE_PSE36);
+        }
+        break;
+
+    case 0x80000007:
+        *edx &= hvm_featureset[FEATURESET_e7d];
         break;
 
     case 0x80000008:
@@ -3478,6 +3513,8 @@ void hvm_cpuid(unsigned int input, unsigned int *eax, unsigned int *ebx,
         hvm_cpuid(0x80000001, NULL, NULL, NULL, &_edx);
         *eax = (*eax & ~0xffff00) |
                (_edx & cpufeat_mask(X86_FEATURE_LM) ? 0x3000 : 0x2000);
+
+        *ebx &= hvm_featureset[FEATURESET_e8b];
         break;
     }
 }
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 6fbb1cf..bd2f67b 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -73,6 +73,7 @@
 #include <asm/hpet.h>
 #include <asm/vpmu.h>
 #include <public/arch-x86/cpuid.h>
+#include <asm/cpuid.h>
 #include <xsm/xsm.h>
 
 /*
@@ -932,69 +933,162 @@ void pv_cpuid(struct cpu_user_regs *regs)
     else
         cpuid_count(leaf, subleaf, &a, &b, &c, &d);
 
-    if ( (leaf & 0x7fffffff) == 0x00000001 )
-    {
-        /* Modify Feature Information. */
-        if ( !cpu_has_apic )
-            __clear_bit(X86_FEATURE_APIC, &d);
-
-        if ( !is_pvh_domain(currd) )
-        {
-            __clear_bit(X86_FEATURE_PSE, &d);
-            __clear_bit(X86_FEATURE_PGE, &d);
-            __clear_bit(X86_FEATURE_PSE36, &d);
-            __clear_bit(X86_FEATURE_VME, &d);
-        }
-    }
-
     switch ( leaf )
     {
     case 0x00000001:
-        /* Modify Feature Information. */
-        if ( !cpu_has_sep )
-            __clear_bit(X86_FEATURE_SEP, &d);
-        __clear_bit(X86_FEATURE_DS, &d);
-        __clear_bit(X86_FEATURE_TM1, &d);
-        __clear_bit(X86_FEATURE_PBE, &d);
-        if ( is_pvh_domain(currd) )
-            __clear_bit(X86_FEATURE_MTRR, &d);
-
-        __clear_bit(X86_FEATURE_DTES64 % 32, &c);
-        __clear_bit(X86_FEATURE_MONITOR % 32, &c);
-        __clear_bit(X86_FEATURE_DSCPL % 32, &c);
-        __clear_bit(X86_FEATURE_VMX % 32, &c);
-        __clear_bit(X86_FEATURE_SMX % 32, &c);
-        __clear_bit(X86_FEATURE_TM2 % 32, &c);
+        c &= pv_featureset[FEATURESET_1c];
+        d &= pv_featureset[FEATURESET_1d];
+
         if ( is_pv_32bit_domain(currd) )
-            __clear_bit(X86_FEATURE_CX16 % 32, &c);
-        __clear_bit(X86_FEATURE_XTPR % 32, &c);
-        __clear_bit(X86_FEATURE_PDCM % 32, &c);
-        __clear_bit(X86_FEATURE_PCID % 32, &c);
-        __clear_bit(X86_FEATURE_DCA % 32, &c);
-        if ( !cpu_has_xsave )
+            c &= ~cpufeat_mask(X86_FEATURE_CX16);
+
+        if ( !is_pvh_domain(currd) )
         {
-            __clear_bit(X86_FEATURE_XSAVE % 32, &c);
-            __clear_bit(X86_FEATURE_AVX % 32, &c);
+            /*
+             * Delete the PVH condition when HVMLite formally replaces PVH,
+             * and HVM guests no longer enter a PV codepath.
+             */
+
+            /*
+             * !!! OSXSAVE handling for PV guests is non-architectural !!!
+             *
+             * Architecturally, the correct code here is simply:
+             *
+             *   if ( curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_OSXSAVE )
+             *       c |= cpufeat_mask(X86_FEATURE_OSXSAVE);
+             *
+             * However because of bugs in Xen (before c/s bd19080b, Nov 2010,
+             * the XSAVE cpuid flag leaked into guests despite the feature
+             * not being available for use), buggy workarounds where
+             * introduced to Linux (c/s 947ccf9c, also Nov 2010) which
+             * relied on the fact that Xen also incorrectly leaked OSXSAVE
+             * into the guest.
+             *
+             * Furthermore, providing architectural OSXSAVE behaviour to a
+             * many Linux PV guests triggered a further kernel bug when the
+             * fpu code observes that XSAVEOPT is available, assumes that
+             * xsave state had been set up for the task, and follows a wild
+             * pointer.
+             *
+             * Older Linux PVOPS kernels however do require architectural
+             * behaviour.  They observe Xen's leaked OSXSAVE and assume they
+             * can already use XSETBV, dying with a #UD because the shadowed
+             * CR4.OSXSAVE is clear.  This behaviour has been adjusted in
+             * all observed cases via stable backports of the above
+             * changeset.
+             *
+             * Therefore, the leaking of Xen's OSXSAVE setting has become a
+             * defacto part of the PV ABI and can't reasonably be corrected.
+             *
+             * The following situations and logic now applies:
+             *
+             * - Hardware without CPUID faulting support and native CPUID:
+             *    There is nothing Xen can do here.  The hosts XSAVE flag
+             *    will leak through and Xen's OSXSAVE choice will leak
+             *    through.
+             *
+             *    In the case that the guest kernel has not set up OSXSAVE,
+             *    only SSE will be set in xcr0, and guest userspace can't do
+             *    too much damage itself.
+             *
+             * - Enlightened CPUID or CPUID faulting available:
+             *    Xen can fully control what is seen here.  Guest kernels
+             *    need to see the leaked OSXSAVE, but guest userspace is
+             *    given architectural behaviour, to reflect the guest
+             *    kernels intentions.
+             */
+            /* OSXSAVE cleared by pv_featureset.  Fast-forward CR4 back in. */
+            if ( (guest_kernel_mode(curr, regs) &&
+                  (read_cr4() & X86_CR4_OSXSAVE)) ||
+                 (curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_OSXSAVE) )
+                c |= cpufeat_mask(X86_FEATURE_OSXSAVE);
+
+            /*
+             * At the time of writing, a PV domain is the only viable option
+             * for Dom0.  Several interactions between dom0 and Xen for real
+             * hardware setup have unfortunately been implemented based on
+             * state which incorrectly leaked into dom0.
+             *
+             * These leaks are retained for backwards compatibility, but
+             * restricted to the hardware domains kernel only.
+             */
+            if ( is_hardware_domain(currd) && guest_kernel_mode(curr, regs) )
+            {
+                /*
+                 * MTRR used to unconditionally leak into PV guests.  They
+                 * cannot MTRR infrastructure at all, and shouldn't be able
+                 * to see the feature.
+                 *
+                 * Modern PVOPS Linux self-clobbers the MTRR feature, to
+                 * avoid trying to use the associated MSRs.  Xenolinux-based
+                 * PV dom0's however use the MTRR feature as an indication
+                 * of the presence of the XENPF_{add,del,read}_memtype
+                 * hypercalls.
+                 */
+                if ( cpu_has_mtrr )
+                    d |= cpufeat_mask(X86_FEATURE_MTRR);
+
+                /*
+                 * MONITOR never leaked into PV guests, as PV guests cannot
+                 * use the MONITOR/MWAIT instructions.  As such, they
+                 * require the feature to not being present in emulated
+                 * CPUID.
+                 *
+                 * Modern PVOPS Linux try to be cunning and use native CPUID
+                 * to see if the hardware actually supports MONITOR, and by
+                 * extension, deep C states.
+                 *
+                 * If the feature is seen, deep-C state information is
+                 * obtained from the DSDT and handed back to Xen via the
+                 * XENPF_set_processor_pminfo hypercall.
+                 *
+                 * This mechanism is incompatible with an HVM-based hardware
+                 * domain, and also with CPUID Faulting.
+                 *
+                 * Luckily, Xen can be just as 'cunning', and distinguish an
+                 * emulated CPUID from a faulted CPUID by whether a #UD or
+                 * #GP fault is currently being serviced.  Yuck...
+                 */
+                if ( cpu_has_monitor && regs->entry_vector == TRAP_gp_fault )
+                    c |= cpufeat_mask(X86_FEATURE_MONITOR);
+
+                /*
+                 * While MONITOR never leaked into PV guests, EIST always
+                 * used to.
+                 *
+                 * Modern PVOPS will only parse P state information from the
+                 * DSDT and return it to Xen if EIST is seen in the emulated
+                 * CPUID information.
+                 */
+                if ( cpu_has_eist )
+                    c |= cpufeat_mask(X86_FEATURE_EIST);
+            }
         }
-        if ( !cpu_has_apic )
-            __clear_bit(X86_FEATURE_X2APIC % 32, &c);
-        __set_bit(X86_FEATURE_HYPERVISOR % 32, &c);
+
+        c |= cpufeat_mask(X86_FEATURE_HYPERVISOR);
         break;
 
     case 0x00000007:
         if ( subleaf == 0 )
-            b &= (cpufeat_mask(X86_FEATURE_BMI1) |
-                  cpufeat_mask(X86_FEATURE_HLE) |
-                  cpufeat_mask(X86_FEATURE_AVX2) |
-                  cpufeat_mask(X86_FEATURE_BMI2) |
-                  cpufeat_mask(X86_FEATURE_ERMS) |
-                  cpufeat_mask(X86_FEATURE_RTM) |
-                  cpufeat_mask(X86_FEATURE_RDSEED) |
-                  cpufeat_mask(X86_FEATURE_ADX) |
-                  cpufeat_mask(X86_FEATURE_FSGSBASE));
+        {
+            /* Fold host's FDP_EXCP_ONLY and NO_FPU_SEL into guest's view. */
+            b &= (pv_featureset[FEATURESET_7b0] &
+                  ~special_features[FEATURESET_7b0]);
+            b |= (host_featureset[FEATURESET_7b0] &
+                  special_features[FEATURESET_7b0]);
+
+            c &= pv_featureset[FEATURESET_7c0];
+
+            if ( !is_pvh_domain(currd) )
+            {
+                /*
+                 * Delete the PVH condition when HVMLite formally replaces
+                 * PVH, and HVM guests no longer enter a PV codepath.
+                 */
+
+                /* OSPKE cleared by pv_featureset.  Fast-forward CR4 back in. */
+                if ( curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_PKE )
+                    c |= cpufeat_mask(X86_FEATURE_OSPKE);
+            }
+        }
         else
-            b = 0;
-        a = c = d = 0;
+            b = c = 0;
+        a = d = 0;
         break;
 
     case XSTATE_CPUID:
@@ -1017,37 +1111,49 @@ void pv_cpuid(struct cpu_user_regs *regs)
         }
 
     case 1:
-        a &= (boot_cpu_data.x86_capability[cpufeat_word(X86_FEATURE_XSAVEOPT)] &
-              ~cpufeat_mask(X86_FEATURE_XSAVES));
+        a &= pv_featureset[FEATURESET_Da1];
         b = c = d = 0;
         break;
     }
     break;
 
     case 0x80000001:
-        /* Modify Feature Information. */
+        c &= pv_featureset[FEATURESET_e1c];
+        d &= pv_featureset[FEATURESET_e1d];
+
+        /* If not emulating AMD, clear the duplicated features in e1d. */
+        if ( currd->arch.x86_vendor != X86_VENDOR_AMD )
+            d &= ~CPUID_COMMON_1D_FEATURES;
+
+        /*
+         * MTRR used to unconditionally leak into PV guests.  They cannot
+         * MTRR infrastructure at all, and shouldn't be able to see the
+         * feature.
+         *
+         * Modern PVOPS Linux self-clobbers the MTRR feature, to avoid
+         * trying to use the associated MSRs.  Xenolinux-based PV dom0's
+         * however use the MTRR feature as an indication of the presence of
+         * the XENPF_{add,del,read}_memtype hypercalls.
+         */
+        if ( is_hardware_domain(currd) && guest_kernel_mode(curr, regs) &&
+             cpu_has_mtrr )
+            d |= cpufeat_mask(X86_FEATURE_MTRR);
+
         if ( is_pv_32bit_domain(currd) )
         {
-            __clear_bit(X86_FEATURE_LM % 32, &d);
-            __clear_bit(X86_FEATURE_LAHF_LM % 32, &c);
+            d &= ~cpufeat_mask(X86_FEATURE_LM);
+            c &= ~cpufeat_mask(X86_FEATURE_LAHF_LM);
+
+            if ( boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
+                d &= ~cpufeat_mask(X86_FEATURE_SYSCALL);
         }
-        if ( is_pv_32bit_domain(currd) &&
-             boot_cpu_data.x86_vendor != X86_VENDOR_AMD )
-            __clear_bit(X86_FEATURE_SYSCALL % 32, &d);
-        __clear_bit(X86_FEATURE_PAGE1GB % 32, &d);
-        __clear_bit(X86_FEATURE_RDTSCP % 32, &d);
-
-        __clear_bit(X86_FEATURE_SVM % 32, &c);
-        if ( !cpu_has_apic )
-            __clear_bit(X86_FEATURE_EXTAPIC % 32, &c);
-        __clear_bit(X86_FEATURE_OSVW % 32, &c);
-        __clear_bit(X86_FEATURE_IBS % 32, &c);
-        __clear_bit(X86_FEATURE_SKINIT % 32, &c);
-        __clear_bit(X86_FEATURE_WDT % 32, &c);
-        __clear_bit(X86_FEATURE_LWP % 32, &c);
-        __clear_bit(X86_FEATURE_NODEID_MSR % 32, &c);
-        __clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
-        __clear_bit(X86_FEATURE_MONITORX % 32, &c);
+        break;
+
+    case 0x80000007:
+        d &= pv_featureset[FEATURESET_e7d];
+        break;
+
+    case 0x80000008:
+        b &= pv_featureset[FEATURESET_e8b];
         break;
 
     case 0x0000000a: /* Architectural Performance Monitor Features (Intel) */
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index e29b024..6a08579 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -81,6 +81,8 @@
 #define cpu_has_xsavec          boot_cpu_has(X86_FEATURE_XSAVEC)
 #define cpu_has_xgetbv1         boot_cpu_has(X86_FEATURE_XGETBV1)
 #define cpu_has_xsaves          boot_cpu_has(X86_FEATURE_XSAVES)
+#define cpu_has_monitor         boot_cpu_has(X86_FEATURE_MONITOR)
+#define cpu_has_eist            boot_cpu_has(X86_FEATURE_EIST)
 
 enum _cache_type {
     CACHE_TYPE_NULL = 0,
-- 
2.1.4

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
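For readers following the series, the core pattern the patch applies everywhere is: mask the raw leaf against a precomputed featureset, then "fast-forward" the OS-controlled bits (OSXSAVE, OSPKE) back in from the guest's CR4 rather than from hardware. The sketch below is illustrative only and not Xen code: the featureset value, bit positions and `guest_1c()` helper are stand-ins for the real `hvm_featureset[FEATURESET_1c]` machinery.

```c
#include <stdint.h>

/* Hypothetical stand-ins for Xen's cpufeat_mask()/featureset values. */
#define CR4_OSXSAVE      (1u << 18)   /* CR4.OSXSAVE */
#define FEAT_1C_OSXSAVE  (1u << 27)   /* CPUID.1:ECX.OSXSAVE */

/* Example featuremask with the OSXSAVE bit pre-cleared, as the special
 * feature handling in the patch arranges. */
static const uint32_t featureset_1c = 0xffffffffu & ~FEAT_1C_OSXSAVE;

/* Mask the host-reported leaf, then fast-forward OSXSAVE from guest CR4. */
static uint32_t guest_1c(uint32_t host_ecx, uint32_t guest_cr4)
{
    uint32_t ecx = host_ecx & featureset_1c;

    if ( guest_cr4 & CR4_OSXSAVE )
        ecx |= FEAT_1C_OSXSAVE;

    return ecx;
}
```

Because the featuremask unconditionally clears OSXSAVE, the guest-visible bit depends only on the guest's own CR4, never on the host's setting, which is what lets the patch drop the old conditional clear/set pairs.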