[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH 1/9] x86emul: support LKGS


  • To: Jan Beulich <jbeulich@xxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>
  • From: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
  • Date: Tue, 4 Apr 2023 22:54:04 +0100
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=citrix.com; dmarc=pass action=none header.from=citrix.com; dkim=pass header.d=citrix.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=93vfan2gsno1DCeqXREVT5h6NSIC8R4Twl29c7aP6Js=; b=Cm8dYvRAb3/24djSiEWENn3XfuoXr1ty5Xo7TvmGLTv8wQIeW+CxZ4itd50W9ElinYgrY1Bf3/St7r3sABdEGV2sdQsyE6BMRZTCosPBNcSqmq5ZVxj8Z9aJHwZdTQAAn/SA5efshvhTydxEjjzdovpVZ1tIrakbLQ78jbBJ8pux8JVjv7ickzqfkSJWOFhcuFNI+aBAlK1UH+BK8lQ6OUP0IUJdJ1zh/pDs9ZuWBnPvi74DCydBierEBEAx4qgsgAYRwZNpPVbVQ+3pyu16w6WgFZAb+2ZsbyntqaX/44Ry8ntQJqRV38QVUw/OFI4YW+Lj82ih3AStLd4e6WySsA==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=VfH9SsgOx/ZeREsVz8Hab5u5UhAcj+Z2iq5n/jncm5e17uW0Dq6ueRdrOBL+Diyx+kLU7wL1Upb3YgVhlMe4Eh3CYHyLSWqW5E97bRRNzyMV4sE4Am6KR8jKSiTuABcDygtQ3mRmx+dBRC3oNjlM3GfcxOzlgTIsFuCtL8uU4VOHTVnKtTQeL6xizUCC8fkgXZcOtJ/qiWyZPcWQ6k7HD9plMQGeHk8QPRBVUY/Bp8uvAIzB3boPZz/QOjUr/Uys1tgd+sCRSQHSb9jqbdWrt0CN2QWl4C7+T//6VBxZ3pawpSceQhwP0w14JENuC1XNN58uy3NLOT/MzUyMMAi+Og==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=citrix.com;
  • Cc: Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Tue, 04 Apr 2023 21:54:49 +0000
  • Ironport-data: A9a23:aGnFs6OJgx50ehfvrR2IlsFynXyQoLVcMsEvi/4bfWQNrUoj1WYHz WBJXTyPaa3bZDemLt4kPYTk8U5UupXVmNAwHQto+SlhQUwRpJueD7x1DKtS0wC6dZSfER09v 63yTvGacajYm1eF/k/F3oDJ9CU6jufQAOKnUoYoAwgpLSd8UiAtlBl/rOAwh49skLCRDhiE/ Nj/uKUzAnf8s9JPGj9SuvLrRC9H5qyo42tE5gBmPpingXeF/5UrJMNHTU2OByOQrrl8RoaSW +vFxbelyWLVlz9F5gSNy+uTnuUiG9Y+DCDW4pZkc/HKbitq/0Te5p0TJvsEAXq7vh3S9zxHJ HehgrTrIeshFvWkdO3wyHC0GQkmVUFN0OevzXRSLaV/ZqAJGpfh66wGMa04AWEX0sUuOk9Pp e0fFDQQdCKpg+Cz4rj4UMA506zPLOGzVG8ekldJ6GmFSNoDH9XESaiM4sJE1jAtgMwIBezZe 8cSdTtoalLHfgFLPVAUTpk5mY9EhFGmK2Ee9A3T+vFxvzC7IA9ZidABNPL8fNCQSNoTtUGfv m/cpEzyAw0ANczZwj2Amp6prraXwXynBtxKTdVU8NZQnnax4DcQDSQGVHS0p/7ktErmRJF2f hl8Fi0G6PJaGFaQZtv3UgC8oXWElgUBQNcWGOo/gCmdx6yR7wuHC2wsSj9adMdgpMIwXSYt1 FKCg5XuHzMHmKKRYWKQ8PGTtzzaBMQOBWoLZCtBQQ5b5dDm+N03lkiXEo4lF7OphNroHz222 yqNsCU1m7QUi4gMyrm/+lfExTmro/AlUzII2+keZUr9hisRWWJvT9bzgbQHxZ6s9Lqkc2Q=
  • Ironport-hdrordr: A9a23:J6Ok6K0dEx/5TNQAXjYPgwqjBIIkLtp133Aq2lEZdPUzSKClfq GV88jzsCWetN9/Yh8dcLy7WZVoI0mslqKdkLNwAV7KZmCP0gaVxedZnO7fKlbbak/DH4BmpM FdWpk7JNrsDUVryebWiTPIaurIGeP3lJxAU92uqEtQcQ==
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 04/04/2023 3:49 pm, Jan Beulich wrote:
> Provide support for this insn, which is a prereq to FRED. CPUID-wise
> introduce both its and FRED's bit at this occasion, thus allowing to
> also express the dependency right away.
>
> While adding a testcase, also add a SWAPGS one. In order to not affect
> the behavior of pre-existing tests, install write_{segment,msr} hooks
> only transiently.

IMO, the emulator is already complicated enough without us having
fallback logic to cope with callers that don't set up all the hooks.

Nor do I think making these hooks transient in the test harness is a
clever idea.

> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> Instead of ->read_segment() we could of course also use ->read_msr() to
> fetch the original GS base. I don't think I can see a clear advantage of
> either approach; the way it's done it matches how we handle SWAPGS.

read_segment() is a much shorter logic chain under the hood, so will be
marginally faster, but it will be a couple of unnecessary VMREADs (on
Intel at least).

We could expose the get/set reg paths for cases where we know we're not
going to need sanity checks, but I'm not sure it's worth it in this case.

> For PV save_segments() would need adjustment, but the insn being
> restricted to ring 0 means PV guests can't use it anyway (unless we
> wanted to emulate it as another privileged insn).

I know, it's on the list.

What is rather irritating is that, depending on FRED or not, it's
opposite whether the guest user or guest kernel GS base is in context. 
Sadly Intel refused my request for a control knob to turn off FRED's
auto-SWAPGS, but I didn't really push them on it because for practically
all other circumstances, it would just be a way for OSes to shoot
themselves in the foot.

For PV guests, our regular ABI is half-way to FRED anyway.  I suspect we
can get most of the interesting rest of the functionality by adding an
ERET bit to the HYPERCALL_iret flags.  I'm not sure yet if we ought to
bother exposing CSL in the pvFRED ABI or not, but doing so would reduce
the divergence from native even further.

> --- a/xen/arch/x86/x86_emulate/private.h
> +++ b/xen/arch/x86/x86_emulate/private.h
> @@ -594,6 +594,7 @@ amd_like(const struct x86_emulate_ctxt *
>  #define vcpu_has_tsxldtrk()    (ctxt->cpuid->feat.tsxldtrk)
>  #define vcpu_has_avx_vnni()    (ctxt->cpuid->feat.avx_vnni)
>  #define vcpu_has_avx512_bf16() (ctxt->cpuid->feat.avx512_bf16)
> +#define vcpu_has_lkgs()        (ctxt->cpuid->feat.lkgs)
>  
>  #define vcpu_must_have(feat) \
>      generate_exception_if(!vcpu_has_##feat(), X86_EXC_UD)
> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
> @@ -2886,8 +2886,31 @@ x86_emulate(
>                  break;
>              }
>              break;
> -        default:
> -            generate_exception_if(true, EXC_UD);
> +        case 6: /* lkgs */
> +            generate_exception_if((modrm_reg & 1) || vex.pfx != vex_f2, 
> EXC_UD);
> +            generate_exception_if(!mode_64bit() || !mode_ring0(), EXC_UD);

Can we switch to X86_* please.  Alternatively, I've got such a patch
which I've just rebased over all your emulator changes anyway, if we're
happy to fix this in one fell swoop.

(Sadly, you did move some TRAP_* names into util-xen.c which I fixed up
in my other tree-wide exception constant patch.)

> +            vcpu_must_have(lkgs);
> +            fail_if(!ops->read_segment || !ops->read_msr ||
> +                    !ops->write_segment || !ops->write_msr);
> +            if ( (rc = ops->read_msr(MSR_SHADOW_GS_BASE, &msr_val,
> +                                     ctxt)) != X86EMUL_OKAY ||
> +                 (rc = ops->read_segment(x86_seg_gs, &sreg,
> +                                         ctxt)) != X86EMUL_OKAY )
> +                goto done;
> +            dst.orig_val = sreg.base;
> +            if ( (rc = protmode_load_seg(x86_seg_gs, src.val, false, &sreg,
> +                                         ctxt, ops)) != X86EMUL_OKAY ||
> +                 (rc = ops->write_msr(MSR_SHADOW_GS_BASE, sreg.base,
> +                                      ctxt)) != X86EMUL_OKAY )
> +                goto done;
> +            sreg.base = dst.orig_val;

Honestly, I think a comment is needed here, because I'm struggling to
work out if this is correct or not.

There is a 64->32 bit truncation of base with LGKS, just as there is
with MOV GS.

Which I think does happen as a side effect of protmode_load_seg() only
filling in the lower half of sreg.base, but I think it would be nicer to
have:

+            dst.orig_val = sreg.base; /* Preserve full GS Base */
+            if ( (rc = protmode_load_seg(x86_seg_gs, src.val, false, &sreg,
+                                         ctxt, ops)) != X86EMUL_OKAY ||
+                 /* Write truncated base into GS_SHADOW */
+                 (rc = ops->write_msr(MSR_SHADOW_GS_BASE, sreg.base,
+                                      ctxt)) != X86EMUL_OKAY )
+                goto done;
+            sreg.base = dst.orig_val; /* Reinstate full GS Base */

Or so, because it's weird not to see a (uint32_t) somewhere in this logic.

> +            if ( (rc = ops->write_segment(x86_seg_gs, &sreg,
> +                                          ctxt)) != X86EMUL_OKAY )
> +            {
> +                /* Best effort unwind (i.e. no error checking). */
> +                ops->write_msr(MSR_SHADOW_GS_BASE, msr_val, ctxt);

write_segment() can't fail.  (The sanity checks are actually deferred
until after emulation is complete, and I'm not sure if that's behaviour
we want...)

However, more importantly, if we actually take this error path (for some
future reason) then we've created a security vulnerability in the guest.

It will be strictly better to crash the domain in this case, than to try
and let it continue in this state.

> +                goto done;
> +            }
>              break;
>          }
>          break;
> --- a/xen/include/public/arch-x86/cpufeatureset.h
> +++ b/xen/include/public/arch-x86/cpufeatureset.h
> @@ -281,6 +281,8 @@ XEN_CPUFEATURE(AVX512_BF16,  10*32+ 5) /
>  XEN_CPUFEATURE(FZRM,         10*32+10) /*A  Fast Zero-length REP MOVSB */
>  XEN_CPUFEATURE(FSRS,         10*32+11) /*A  Fast Short REP STOSB */
>  XEN_CPUFEATURE(FSRCS,        10*32+12) /*A  Fast Short REP CMPSB/SCASB */
> +XEN_CPUFEATURE(FRED,         10*32+17) /*   Flexible Return and Event 
> Delivery */
> +XEN_CPUFEATURE(LKGS,         10*32+18) /*S  Load Kernel GS Base */
>  XEN_CPUFEATURE(WRMSRNS,      10*32+19) /*   WRMSR Non-Serialising */
>  
>  /* AMD-defined CPU features, CPUID level 0x80000021.eax, word 11 */
> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -295,6 +295,9 @@ def crunch_numbers(state):
>  
>          # In principle the TSXLDTRK insns could also be considered 
> independent.
>          RTM: [TSXLDTRK],
> +
> +        # FRED builds on the LKGS instruction.
> +        LKGS: [FRED],

Hmm...  This is the first case (I think) we've got where a dependency
that goes back numerically in terms of feature number.

Obviously we need to support it, but I'm not sure if the deep_deps loop
will cope in its current form.

~Andrew



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.