
Re: [PATCH v3 03/22] x86/xstate: re-size save area when CPUID policy changes

  • To: Jan Beulich <jbeulich@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Mon, 3 May 2021 14:57:15 +0100
  • Cc: George Dunlap <george.dunlap@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Mon, 03 May 2021 13:57:29 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 22/04/2021 15:44, Jan Beulich wrote:
> vCPU-s get maximum size areas allocated initially. Hidden (and in
> particular default-off) features may allow for a smaller size area to
> suffice.
> Suggested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> v2: Use 1ul instead of 1ull. Re-base.
> ---
> This could be further shrunk if we used XSAVEC / if we really used
> XSAVES, as then we don't need to also cover the holes. But since we
> currently use neither of the two in reality, this would require more
> work than just adding the alternative size calculation here.
> Seeing that both vcpu_init_fpu() and cpuid_policy_updated() get called
> from arch_vcpu_create(), I'm not sure we really need this two-stage
> approach - the slightly longer period of time during which
> v->arch.xsave_area would remain NULL doesn't look all that problematic.
> But since xstate_alloc_save_area() gets called for idle vCPU-s, it has
> to stay anyway in some form, so the extra code churn may not be worth
> it.
> Instead of cpuid_policy_xcr0_max(), cpuid_policy_xstates() may be the
> interface to use here. But it remains to be determined whether the
> xcr0_accum field is meant to be inclusive of XSS (in which case it would
> better be renamed) or exclusive. Right now there's no difference as we
> don't support any XSS-controlled features.

I've been figuring out what we need to do for supervisor states.  The
current code is not in a good shape, but I also think some of the
changes in this series take us in an unhelpful direction.

I've got a cleanup series which I will post shortly.  It interacts
textually although not fundamentally with this series, but does fix
several issues.

For supervisor states, we need to use XSAVES unconditionally, even for PV.
This is because XSS_CET_S needs to be the HVM kernel's context, or Xen's
in PV context (specifically, MSR_PL0_SSP which is the shstk equivalent
of TSS.RSP0).

A consequence is that Xen's data handling shall use the compressed
format, and include supervisor states.  (While in principle we could
manage CET_S, CET_U, and potentially PT when vmtrace gets expanded, each
WRMSR there is a similar order of magnitude to an XSAVES/XRSTORS
instruction.)

I'm planning a host xss setting, similar to mmu_cr4_features, which
shall be the setting in context for everything other than HVM vcpus
(which need the guest setting in context, and/or the VT-x bodge to
support host-only states).  Amongst other things, all context switch
paths, including from-HVM, need to step XSS up to the host setting to
let XSAVES function correctly.

However, a consequence of this is that the size of the xsave area needs
deriving from host, as well as guest-max state.  i.e. even if some VMs
aren't using CET, we still need space in the xsave areas to function
correctly when a single VM is using CET.
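
As a rough sketch of that sizing rule (illustrative only, not existing
Xen code): the compacted format has a 512-byte legacy region and a
64-byte header, followed by each enabled component from index 2 upwards,
optionally aligned to 64 bytes.  The struct, the table passed in, and
both function names below are hypothetical; real code would fill the
table from CPUID leaf 0xD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define XSAVE_LEGACY_SIZE 512u  /* FXSAVE-compatible legacy region */
#define XSAVE_HDR_SIZE     64u  /* XSAVE header (XSTATE_BV, XCOMP_BV, ...) */

/* Hypothetical per-component description, as CPUID leaf 0xD reports it. */
struct xstate_comp {
    uint32_t size;    /* CPUID.(0xD, i).EAX: size of component i */
    bool     align64; /* CPUID.(0xD, i).ECX bit 1: 64-byte alignment */
};

/* Size of a compacted image holding every component set in 'mask'. */
static size_t compacted_size(uint64_t mask, const struct xstate_comp comp[64])
{
    size_t size = XSAVE_LEGACY_SIZE + XSAVE_HDR_SIZE;

    assert(comp);

    /* Components 0 and 1 (x87, SSE) live in the legacy region. */
    for ( unsigned int i = 2; i < 64; i++ )
    {
        if ( !(mask & (UINT64_C(1) << i)) )
            continue;
        if ( comp[i].align64 )
            size = (size + 63) & ~(size_t)63;
        size += comp[i].size;
    }

    return size;
}

/* Hypothetical helper: cover guest-max user states plus host XSS states. */
static size_t save_area_size(uint64_t guest_max_xcr0, uint64_t host_xss,
                             const struct xstate_comp comp[64])
{
    return compacted_size(guest_max_xcr0 | host_xss, comp);
}
```

The point of save_area_size() is that the host's supervisor states are
always part of the mask, so a VM which hides CET still gets an area
large enough for XSAVES to write CET_S into.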

Another consequence is that we need to rethink our hypercall behaviour. 
There is no such thing as supervisor states in an uncompressed XSAVE
image, which means we can't continue with that being the ABI.

I've also found some substantial issues with how we handle
xcr0/xcr0_accum and plan to address these.  There is no such thing as
xcr0 without the bottom bit set, ever, and xcr0_accum needs to default
to X87|SSE seeing as that's how we use it anyway.  However, I expect
we'll still be using xcr0_accum | host_xss when it comes to the context
switch path.
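
Expressed as code, the invariants I have in mind look roughly like the
following.  This is a sketch only: the macro and function names are
invented for illustration and are not what is in the tree today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names for XCR0 bits 0 and 1. */
#define XCR0_X87 (UINT64_C(1) << 0)
#define XCR0_SSE (UINT64_C(1) << 1)

/* XCR0 with bit 0 clear can never exist; hardware raises #GP on XSETBV. */
static bool xcr0_has_x87(uint64_t xcr0)
{
    return xcr0 & XCR0_X87;
}

/* Proposed default: xcr0_accum starts at X87|SSE rather than 0. */
static uint64_t xcr0_accum_default(void)
{
    return XCR0_X87 | XCR0_SSE;
}

/* Component mask handed to XSAVES/XRSTORS on the context switch path. */
static uint64_t ctxt_switch_mask(uint64_t xcr0_accum, uint64_t host_xss)
{
    assert(xcr0_has_x87(xcr0_accum));
    return xcr0_accum | host_xss;
}
```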

In terms of actual context switching, we want to be using XSAVES/XRSTORS
whenever it is available, even if we're not using supervisor states. 
XSAVES has both the inuse and modified optimisations, without the broken
consequence of XSAVEOPT (which is firmly in the "don't ever use this"
bucket now).

There's no point ever using XSAVEC.  There is no hardware where it
exists in the absence of XSAVES, and there can't be even in theory, due
to (perhaps unintentional) linkage of the CPUID data.
XSAVEC also doesn't use the modified optimisation, and is therefore
strictly worse than XSAVES, even when MSR_XSS is 0.

Therefore, our choice of instruction wants to be XSAVES, or XSAVE, or
FXSAVE, depending on hardware capability.
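
A minimal sketch of that selection, assuming the capability bits have
already been read from CPUID (XSAVE is CPUID.1:ECX[26], XSAVES is
CPUID.(0xD,1):EAX[3]).  The enum, struct and function names are made up
for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical identifiers for the three save instructions considered. */
enum save_insn { SAVE_FXSAVE, SAVE_XSAVE, SAVE_XSAVES };

/* Hypothetical capability summary, filled in from CPUID at boot. */
struct cpu_caps {
    bool xsave;   /* CPUID.1:ECX[26] */
    bool xsaves;  /* CPUID.(0xD,1):EAX[3] */
};

/* Prefer XSAVES, then XSAVE, then FXSAVE.  XSAVEC and XSAVEOPT never win. */
static enum save_insn pick_save_insn(const struct cpu_caps *caps)
{
    assert(caps);

    if ( caps->xsave && caps->xsaves )
        return SAVE_XSAVES;
    if ( caps->xsave )
        return SAVE_XSAVE;

    return SAVE_FXSAVE;
}
```

XSAVES is only chosen when XSAVE itself is present, and there is
deliberately no branch for XSAVEC or XSAVEOPT.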



