Xen project Mailing List

Re: [Xen-devel] [PATCH 1/2] x86: improve MSR_SHADOW_GS accesses

To: "Andrew Cooper" <andrew.cooper3@xxxxxxxxxx>

From: "Jan Beulich" <JBeulich@xxxxxxxx>

Date: Thu, 01 Mar 2018 10:00:50 -0700

Cc: xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>

Delivery-date: Thu, 01 Mar 2018 17:01:10 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

>>> On 01.03.18 at 16:47, <andrew.cooper3@xxxxxxxxxx> wrote: > On 07/12/17 10:45, Jan Beulich wrote: >>>>> On 06.12.17 at 20:34, <andrew.cooper3@xxxxxxxxxx> wrote: >>> On 06/12/17 16:37, Jan Beulich wrote: >>>> @@ -196,6 +215,19 @@ static inline void wrgsbase(unsigned lon >>>> wrmsrl(MSR_GS_BASE, base); >>>> } >>>> >>>> +static inline void wrgsshadow(unsigned long base) >>>> +{ >>>> + alternative_input("mov %[msr], %%ecx\n\t" >>>> + "shld $32, %[val], %%rdx\n\t" >>> This is a vector path instruction and specifically called out to be >>> avoided in the AMD optimisation guide. On all hardware (according to >>> Agner's latency mesurements) it alone has a longer latency to execute >>> that the mov/shl pair you've replaced, and that is before accounting for >>> mov elimination. >> For one I doubt the latency of the SHLD will be noticable with >> the latency of the following WRMSR. And then I think the main >> goal should be to have optimal performance on modern CPUs. >> That means size-optimizing original code, to reduce the amount >> of NOPs needed when the alternative is being installed. > > I'm sorry if this comes across as blunt, but you are too focused on the > wrong details. > > I agree that `swap;{rd,wr}gsbase;swap` is better than wrmsr, and we > should be using it when available. > > However, it is simply not the case that squeezing every possible byte > out of .text makes the best result. Just as with inc and dec, there is > a very good reason why compilers don't emit shld, and that is because > the ORM's recommend against their use. In this specific case, a mov/shl > pair is better than shld on all hardware, even in older pipelines which > don't do mov elimination. Well, okay I'll switch away from using SHLD. It'll be a single padding NOP only anyway, just a slightly larger one. And note that this wasn't about size optimization, but about NOP padding reduction. > You are going to need a more convincing argument than currently provided > as to why the helpers aren't implemented like this: > > static inline unsigned long rdgsshadow(void) > { > unsigned long base; > > if ( cpu_has_fsgsbase ) > asm volatile ( "swapgs\n\t" > #ifdef HAVE_GAS_FSGSBASE > "rdgsbase %0\n\t" > #else > ".byte 0xf3, 0x48, 0x0f, 0xae, 0xc8\n\t" > #endif > "swapgs\n\t" > : > #ifdef HAVE_GAS_FSGSBASE > "=r" (base) > #else > "=a" (base) > #endif > ); > else > rdmsrl(MSR_SHADOW_GS_BASE, base); > > return base; > } > > static inline void wrgsshadow(unsigned long base) > { > if ( cpu_has_fsgsbase ) > asm volatile ( "swapgs\n\t" > #ifdef HAVE_GAS_FSGSBASE > "wrgsbase %0\n\t" > #else > ".byte 0xf3, 0x48, 0x0f, 0xae, 0xd8\n\t" > #endif > "swapgs\n\t" > :: > #ifdef HAVE_GAS_FSGSBASE > "r" (base) > #else > "a" (base) > #endif > ); > else > wrmsrl(MSR_SHADOW_GS_BASE, base); > } > > In particular: > 1) This is consistent with existing {rd,wr}{fs,gs}base helpers. Which, in due course, I would have meant to convert to alternatives as well. > 2) cpu_has_fsgsbase is constant after boot, so the branch will be > predicted correctly. If it has an entry in the BTB in the first place. I can certainly also switch away from using alternatives here, but considering especially my response to 1) I would do this quite hesitantly. Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.