Xen project Mailing List

Re: [PATCH v5 1/7] x86emul: support LKGS

To: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

Date: Thu, 5 Sep 2024 16:45:45 +0200

Autocrypt: addr=jbeulich@xxxxxxxx; keydata= xsDiBFk3nEQRBADAEaSw6zC/EJkiwGPXbWtPxl2xCdSoeepS07jW8UgcHNurfHvUzogEq5xk hu507c3BarVjyWCJOylMNR98Yd8VqD9UfmX0Hb8/BrA+Hl6/DB/eqGptrf4BSRwcZQM32aZK 7Pj2XbGWIUrZrd70x1eAP9QE3P79Y2oLrsCgbZJfEwCgvz9JjGmQqQkRiTVzlZVCJYcyGGsD /0tbFCzD2h20ahe8rC1gbb3K3qk+LpBtvjBu1RY9drYk0NymiGbJWZgab6t1jM7sk2vuf0Py O9Hf9XBmK0uE9IgMaiCpc32XV9oASz6UJebwkX+zF2jG5I1BfnO9g7KlotcA/v5ClMjgo6Gl MDY4HxoSRu3i1cqqSDtVlt+AOVBJBACrZcnHAUSuCXBPy0jOlBhxPqRWv6ND4c9PH1xjQ3NP nxJuMBS8rnNg22uyfAgmBKNLpLgAGVRMZGaGoJObGf72s6TeIqKJo/LtggAS9qAUiuKVnygo 3wjfkS9A3DRO+SpU7JqWdsveeIQyeyEJ/8PTowmSQLakF+3fote9ybzd880fSmFuIEJldWxp Y2ggPGpiZXVsaWNoQHN1c2UuY29tPsJgBBMRAgAgBQJZN5xEAhsDBgsJCAcDAgQVAggDBBYC AwECHgECF4AACgkQoDSui/t3IH4J+wCfQ5jHdEjCRHj23O/5ttg9r9OIruwAn3103WUITZee e7Sbg12UgcQ5lv7SzsFNBFk3nEQQCACCuTjCjFOUdi5Nm244F+78kLghRcin/awv+IrTcIWF hUpSs1Y91iQQ7KItirz5uwCPlwejSJDQJLIS+QtJHaXDXeV6NI0Uef1hP20+y8qydDiVkv6l IreXjTb7DvksRgJNvCkWtYnlS3mYvQ9NzS9PhyALWbXnH6sIJd2O9lKS1Mrfq+y0IXCP10eS FFGg+Av3IQeFatkJAyju0PPthyTqxSI4lZYuJVPknzgaeuJv/2NccrPvmeDg6Coe7ZIeQ8Yj t0ARxu2xytAkkLCel1Lz1WLmwLstV30g80nkgZf/wr+/BXJW/oIvRlonUkxv+IbBM3dX2OV8 AmRv1ySWPTP7AAMFB/9PQK/VtlNUJvg8GXj9ootzrteGfVZVVT4XBJkfwBcpC/XcPzldjv+3 HYudvpdNK3lLujXeA5fLOH+Z/G9WBc5pFVSMocI71I8bT8lIAzreg0WvkWg5V2WZsUMlnDL9 mpwIGFhlbM3gfDMs7MPMu8YQRFVdUvtSpaAs8OFfGQ0ia3LGZcjA6Ik2+xcqscEJzNH+qh8V m5jjp28yZgaqTaRbg3M/+MTbMpicpZuqF4rnB0AQD12/3BNWDR6bmh+EkYSMcEIpQmBM51qM EKYTQGybRCjpnKHGOxG0rfFY1085mBDZCH5Kx0cl0HVJuQKC+dV2ZY5AqjcKwAxpE75MLFkr wkkEGBECAAkFAlk3nEQCGwwACgkQoDSui/t3IH7nnwCfcJWUDUFKdCsBH/E5d+0ZnMQi+G0A nAuWpQkjM1ASeQwSHEeAWPgskBQL

Cc: Wei Liu <wl@xxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Thu, 05 Sep 2024 14:46:14 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 04.09.2024 16:24, Andrew Cooper wrote:
> On 04/09/2024 1:28 pm, Jan Beulich wrote:
>> ---
>> Instead of ->read_segment() we could of course also use ->read_msr() to
>> fetch the original GS base. I don't think I can see a clear advantage of
>> either approach; the way it's done it matches how we handle SWAPGS.
>
> It turns out this is littered with broken corners.  See below.

I'm afraid it hasn't become clear to me which of your further comments are
the "broken corners".

>> --- a/tools/tests/x86_emulator/test_x86_emulator.c
>> +++ b/tools/tests/x86_emulator/test_x86_emulator.c
>> @@ -693,6 +719,20 @@ static int read_msr(
>>          *val = ctxt->addr_size > 32 ? 0x500 /* LME|LMA */ : 0;
>>          return X86EMUL_OKAY;
>>
>> +#ifdef __x86_64__
>> +    case 0xc0000101: /* GS_BASE */
>
> It's only just occurred to me, but given x86-defns.h, isn't msr-index.h
> suitably usable too ?

We are doing so already. Just not in this function. And since there were hex
numbers with comments here, I (blindly) added more. I'll submit a cleanup
patch to change the pre-existing ones, and I've already switched over this
and further patches to use the named constants instead.

>> @@ -1335,6 +1400,41 @@ int main(int argc, char **argv)
>>          printf("%u bytes read - ", bytes_read);
>>          goto fail;
>>      }
>> +    printf("okay\n");
>> +
>> +    emulops.write_segment = write_segment;
>> +    emulops.write_msr = write_msr;
>> +
>> +    printf("%-40s", "Testing swapgs...");
>> +    instr[0] = 0x0f; instr[1] = 0x01; instr[2] = 0xf8;
>> +    regs.eip = (unsigned long)&instr[0];
>> +    gs_base = 0xffffeeeecccc8888UL;
>> +    gs_base_shadow = 0x0000111122224444UL;
>> +    rc = x86_emulate(&ctxt, &emulops);
>> +    if ( (rc != X86EMUL_OKAY) ||
>> +         (regs.eip != (unsigned long)&instr[3]) ||
>> +         (gs_base != 0x0000111122224444UL) ||
>> +         (gs_base_shadow != 0xffffeeeecccc8888UL) )
>> +        goto fail;
>> +    printf("okay\n");
>> +
>> +    printf("%-40s", "Testing lkgs 2(%rdx)...");
>> +    instr[0] = 0xf2; instr[1] = 0x0f; instr[2] = 0x00; instr[3] = 0x72; instr[4] = 0x02;
>> +    regs.eip = (unsigned long)&instr[0];
>> +    regs.edx = (unsigned long)res;
>> +    res[0] = 0x00004444;
>> +    res[1] = 0x8888cccc;
>> +    i = cp.extd.nscb; cp.extd.nscb = true; /* for AMD */
>> +    rc = x86_emulate(&ctxt, &emulops);
>> +    if ( (rc != X86EMUL_OKAY) ||
>> +         (regs.eip != (unsigned long)&instr[5]) ||
>> +         (gs_base != 0x0000111122224444UL) ||
>> +         gs_base_shadow )
>> +        goto fail;
>> +
>> +    cp.extd.nscb = i;
>
> I think I acked the patches to rename this?
>
> I'd suggest putting those in now, rather than creating more (re)work later.

That was sitting on top, and I was kind of hoping that I could avoid the
re-basing ahead. But I've meanwhile done so, including the committing of the
result, as you've probably seen.

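For readers following the "lkgs 2(%rdx)" test above, here is a rough sketch (not part of the patch) of how its operand can be worked out by hand. The ModRM byte 0x72 splits into mod=1, reg=6, rm=2, i.e. a /6 encoding addressing 8-bit displacement plus %rdx. Assuming res[] holds 32-bit elements laid out little-endian, the 16-bit value at byte offset 2 is zero, which would be consistent with the test expecting gs_base_shadow to end up as zero. All helper names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: split a ModRM byte into its three fields. */
struct modrm_fields { unsigned int mod, reg, rm; };

static struct modrm_fields split_modrm(uint8_t modrm)
{
    struct modrm_fields f = {
        .mod = modrm >> 6,
        .reg = (modrm >> 3) & 7,
        .rm  = modrm & 7,
    };

    return f;
}

/* Hypothetical helper: store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *p, uint32_t v)
{
    for ( unsigned int i = 0; i < 4; i++ )
        p[i] = (uint8_t)(v >> (8 * i));
}

/*
 * Hypothetical helper: the 16-bit operand that "disp(%rdx)" would fetch when
 * %rdx points at the two values the test stores.  disp must be at most 6.
 */
static uint16_t lkgs_test_selector(unsigned int disp)
{
    uint8_t mem[8];

    put_le32(&mem[0], 0x00004444); /* res[0] in the test */
    put_le32(&mem[4], 0x8888cccc); /* res[1] in the test */

    return (uint16_t)(mem[disp] | (mem[disp + 1] << 8));
}
```

With these, split_modrm(0x72) gives mod=1, reg=6, rm=2, and lkgs_test_selector(2) gives 0, i.e. a null selector.
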
>> --- a/xen/arch/x86/x86_emulate/decode.c
>> +++ b/xen/arch/x86/x86_emulate/decode.c
>> @@ -743,8 +743,12 @@ decode_twobyte(struct x86_emulate_state
>>              case 0:
>>                  s->desc |= DstMem | SrcImplicit | Mov;
>>                  break;
>> +            case 6:
>> +                if ( !(s->modrm_reg & 1) && mode_64bit() )
>> +                {
>>              case 2: case 4:
>> -                s->desc |= SrcMem16;
>> +                    s->desc |= SrcMem16;
>> +                }
>
> Well - not something I was expecting, but I've just had to go and find
> the Itanium instruction manuals...
>
> Do we really need this complexity?  JMPE is a decoding wrinkle of
> Itanium processors, which I think we can reasonably ignore.
>
> IMO we should treat Grp6 as uniformly Reg/Mem16, and rely on the
> !mode_64bit() to exclude the encodings commonly used as JMPE.

We already handle modrm_reg 0 and 1 differently. I'm not convinced of making
7 match 6 without need. We can't predict what Intel will put there - JMPE
(which I'm not really concerned about here, and which the logic being added
also doesn't exclude) already didn't match the reg/mem16 pattern.

>> --- a/xen/arch/x86/x86_emulate/x86_emulate.c
>> +++ b/xen/arch/x86/x86_emulate/x86_emulate.c
>> @@ -2870,8 +2870,35 @@ x86_emulate(
>>                  break;
>>              }
>>              break;
>> -        default:
>> -            generate_exception_if(true, X86_EXC_UD);
>> +        case 6: /* lkgs */
>> +            generate_exception_if((modrm_reg & 1) || vex.pfx != vex_f2,
>> +                                  X86_EXC_UD);
>> +            generate_exception_if(!mode_64bit() || !mode_ring0(), X86_EXC_UD);
>> +            vcpu_must_have(lkgs);
>> +            fail_if(!ops->read_segment || !ops->read_msr ||
>> +                    !ops->write_segment || !ops->write_msr);
>> +            if ( (rc = ops->read_msr(MSR_SHADOW_GS_BASE, &msr_val,
>> +                                     ctxt)) != X86EMUL_OKAY ||
>> +                 (rc = ops->read_segment(x86_seg_gs, &sreg,
>> +                                         ctxt)) != X86EMUL_OKAY )
>> +                goto done;
>> +            dst.orig_val = sreg.base; /* Preserve full GS Base. */
>> +            if ( (rc = protmode_load_seg(x86_seg_gs, src.val, false, &sreg,
>> +                                         ctxt, ops)) != X86EMUL_OKAY ||
>> +                 /* Write (32-bit) base into SHADOW_GS. */
>> +                 (rc = ops->write_msr(MSR_SHADOW_GS_BASE, sreg.base,
>
> The comment says 32-bit, but that's the full base, isn't it?

The function writes the full base, but what we retrieved via
protmode_load_seg() is only 32 bits wide. Hence the parenthesization in the
comment. I can add e.g. "zero-extended" if you think that makes things more
clear?

>> +                                      ctxt)) != X86EMUL_OKAY )
>> +                goto done;
>> +            sreg.base = dst.orig_val; /* Reinstate full GS Base. */
>> +            if ( (rc = ops->write_segment(x86_seg_gs, &sreg,
>> +                                          ctxt)) != X86EMUL_OKAY )
>> +            {
>> +                /* Best effort unwind (i.e. no real error checking). */
>> +                if ( ops->write_msr(MSR_SHADOW_GS_BASE, msr_val,
>> +                                    ctxt) == X86EMUL_EXCEPTION )
>> +                    x86_emul_reset_event(ctxt);
>> +                goto done;
>> +            }
>
> Do we need all of this?
>
> Either protmode_load_seg() fails and there's nothing to unwind, or
> write_msr() fails and we only have to unwind GS.
>
> I think?

Since you say "all" I can only assume you mean both the write_segment() and
the write_msr(). We need the former, as we replaced the segment base if
protmode_load_seg() succeeded. It's only the write_msr() which is debatable,
yet as indicated that matches SWAPGS handling. I'd like to keep the two as
similar as possible.

> This is actually a good example of where pipeline microcode has a much
> easier time than we do.  Inside the pipeline, there's no such thing as
> "can't store to gs & GS_KERN once the checks are done".

Indeed.

> Although it does make me wonder.  Would LKGS trigger the MSR
> intercepts?  Architecturally, it writes MSR_GS_KERN, so ought to trigger
> the Write intercept.
>
> However, version 7 of the FRED spec says:
>
> "Because the base address in the descriptor is only 32 bits, LKGS clears
> the upper 32 bits of the 64-bit IA32_KERNEL_GS_BASE MSR."
>
> so I suspect it does not architecturally read MSR_GS_KERN, so would not
> trigger the Read intercept (or introspection for that matter.)

Well, I'm looking at this differently anyway: The MSR is merely an alias for
the segment base. Just like LFS/LGS won't trigger respective MSR intercepts,
LKGS shouldn't either.

> AFAICT, we're only performing the read in order to do the best-effort
> unwind, so I think that path needs dropping.

No, as said - we need to put back the correct base of the "real" GS.

>> --- a/xen/include/public/arch-x86/cpufeatureset.h
>> +++ b/xen/include/public/arch-x86/cpufeatureset.h
>> @@ -296,6 +296,8 @@ XEN_CPUFEATURE(AVX512_BF16, 10*32+ 5) /
>>  XEN_CPUFEATURE(FZRM, 10*32+10) /*A Fast Zero-length REP MOVSB */
>>  XEN_CPUFEATURE(FSRS, 10*32+11) /*A Fast Short REP STOSB */
>>  XEN_CPUFEATURE(FSRCS, 10*32+12) /*A Fast Short REP CMPSB/SCASB */
>> +XEN_CPUFEATURE(FRED, 10*32+17) /* Flexible Return and Event Delivery */
>> +XEN_CPUFEATURE(LKGS, 10*32+18) /*S Load Kernel GS Base */
>
> Can we please keep this 's' until we've had a play on real hardware?

Sure.

> Also, as we're going for CPUID bits more generally these days, bit 20 is
> NMI_SRC also from the FRED spec.

I can add that, sure. It just seemed unrelated to me. I wanted to have FRED
to put in place the dependency in gen-cpuid.py. What isn't quite clear to me
is whether there should then also be a dependency recorded between FRED and
NMI_SRC.

>> @@ -338,6 +338,9 @@ def crunch_numbers(state):
>>
>>          # The behaviour described by RRSBA depend on eIBRS being active.
>>          EIBRS: [RRSBA],
>> +
>> +        # FRED builds on the LKGS instruction.
>> +        LKGS: [FRED],
>
> I'd be tempted to justify this differently.
>
> It is intentional that LKGS is usable with CR4.FRED=0, for the benefit
> of FRED-aware-but-not-active OSes running on FRED-capable hardware.
>
> However, FRED=1 systems cannot operate without LKGS.

This is what I'm meaning to say with the comment. Whereas ...

> So, perhaps "There is no hard dependency, but the spec requires that
> LKGS is available in FRED systems" ?

... this is weaker than what I think is wanted/needed.

Jan
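
The ordering discussed in this thread (save the old values, write the zero-extended 32-bit descriptor base into the shadow GS base, then reinstate the active GS base and unwind if the segment write fails) can be sketched with a small stand-alone model. This is an illustration under simplifying assumptions, not the emulator code: struct model_cpu, model_lkgs() and the fail_segment_write knob are hypothetical, and the descriptor lookup is replaced by a desc_base parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified state; not Xen's types or hook signatures. */
struct model_cpu {
    uint16_t gs_sel;         /* visible GS selector */
    uint64_t gs_base;        /* active GS base */
    uint64_t gs_base_shadow; /* stands in for MSR_SHADOW_GS_BASE */
    bool fail_segment_write; /* test knob: make the final GS write fail */
};

/*
 * Model of the sequence in the patch.  desc_base stands in for the base a
 * descriptor lookup would yield, which is only 32 bits wide.
 */
static bool model_lkgs(struct model_cpu *c, uint16_t sel, uint32_t desc_base)
{
    uint64_t old_shadow = c->gs_base_shadow; /* kept for the unwind */
    uint64_t old_base = c->gs_base;          /* preserve full GS base */

    /* The (32-bit) descriptor base is zero-extended into the shadow base. */
    c->gs_base_shadow = desc_base;

    if ( c->fail_segment_write )
    {
        /* Best effort unwind: put the previous shadow base back. */
        c->gs_base_shadow = old_shadow;
        return false;
    }

    /* The selector is replaced, while the active base is reinstated. */
    c->gs_sel = sel;
    c->gs_base = old_base;
    assert(c->gs_base_shadow <= UINT32_MAX);

    return true;
}
```

In this model a successful call leaves gs_base untouched and the upper 32 bits of gs_base_shadow clear, while a failed segment write leaves all three fields as they were before the call.
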
