Xen project Mailing List

Re: [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Mon, 3 May 2021 16:22:38 +0200

Cc: George Dunlap <george.dunlap@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Mon, 03 May 2021 14:22:58 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 03.05.2021 15:57, Andrew Cooper wrote: > On 22/04/2021 15:44, Jan Beulich wrote: >> vCPU-s get maximum size areas allocated initially. Hidden (and in >> particular default-off) features may allow for a smaller size area to >> suffice. >> >> Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx> >> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx> >> --- >> v2: Use 1ul instead of 1ull. Re-base. >> --- >> This could be further shrunk if we used XSAVEC / if we really used >> XSAVES, as then we don't need to also cover the holes. But since we >> currently use neither of the two in reality, this would require more >> work than just adding the alternative size calculation here. >> >> Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called >> from arch_vcpu_create(), I'm not sure we really need this two-stage >> approach - the slightly longer period of time during which >> v->arch.xsave_area would remain NULL doesn't look all that problematic. >> But since xstate_alloc_save_area() gets called for idle vCPU-s, it has >> to stay anyway in some form, so the extra code churn may not be worth >> it. >> >> Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the >> interface to use here. But it remains to be determined whether the >> xcr0_accum field is meant to be inclusive of XSS (in which case it would >> better be renamed) or exclusive. Right now there's no difference as we >> don't support any XSS-controlled features. > > I've been figuring out what we need to for supervisors states. The > current code is not in a good shape, but I also think some of the > changes in this series take us in an unhelpful direction. >From reading through the rest your reply I'm not sure I see what you mean. ORing in host_xss at certain points shouldn't be a big deal. > I've got a cleanup series which I will post shortly. It interacts > texturally although not fundamentally with this series, but does fix > several issues. > > For supervisor states, we need use XSAVES unilaterally, even for PV. > This is because XSS_CET_S needs to be the HVM kernel's context, or Xen's > in PV context (specifically, MSR_PL0_SSP which is the shstk equivalent > of TSS.RSP0). > > > A consequence is that Xen's data handling shall use the compressed > format, and include supervisor states. (While in principle we could > manage CET_S, CET_U, and potentially PT when vmtrace gets expanded, each > WRMSR there is a similar order of magnitude to an XSAVES/XRSTORS > instruction.) I agree. > I'm planning a host xss setting, similar to mmu_cr4_features, which > shall be the setting in context for everything other than HVM vcpus > (which need the guest setting in context, and/or the VT-x bodge to > support host-only states). Amongst other things, all context switch > paths, including from-HVM, need to step XSS up to the host setting to > let XSAVES function correctly. > > However, a consequence of this is that the size of the xsave area needs > deriving from host, as well as guest-max state. i.e. even if some VMs > aren't using CET, we still need space in the xsave areas to function > correctly when a single VM is using CET. Right - as said above, taking this into consideration here shouldn't be overly problematic. > Another consequence is that we need to rethink our hypercall behaviour. > There is no such thing as supervisor states in an uncompressed XSAVE > image, which means we can't continue with that being the ABI. I don't think the hypercall input / output blob needs to follow any specific hardware layout. > I've also found some substantial issues with how we handle > xcr0/xcr0_accum and plan to address these. There is no such thing as > xcr0 without the bottom bit set, ever, and xcr0_accum needs to default > to X87|SSE seeing as that's how we use it anyway. However, in a context > switch, I expect we'll still be using xcr0_accum | host_xss when it > comes to the context switch path. Right, and to avoid confusion I think we also want to move from xcr0_accum to e.g. xstate_accum, covering both XCR0 and XSS parts all in one go. > In terms of actual context switching, we want to be using XSAVES/XRSTORS > whenever it is available, even if we're not using supervisor states. > XSAVES has both the inuse and modified optimisations, without the broken > consequence of XSAVEOPT (which is firmly in the "don't ever use this" > bucket now). The XSAVEOPT anomaly is affecting user mode only, isn't it? Or are you talking of something I have forgot about? > There's no point ever using XSAVEC. There is no hardware where it > exists in the absence of XSAVES, and can't even in theoretical > circumstances due to (perhaps unintentional) linkage of the CPUID data. > XSAVEC also doesn't use the modified optimisation, and is therefore > strictly worse than XSAVES, even when MSR_XSS is 0. > > Therefore, our choice of instruction wants to be XSAVES, or XSAVE, or > FXSAVE, depending on hardware capability. Makes sense to me (perhaps - see above - minus your omission of XSAVEOPT here). Jan

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.