[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH 1/2] x86: improve MSR_SHADOW_GS accesses
>>> On 06.12.17 at 20:34, <andrew.cooper3@xxxxxxxxxx> wrote: > On 06/12/17 16:37, Jan Beulich wrote: >> @@ -172,6 +173,24 @@ static inline unsigned long rdgsbase(voi >> return base; >> } >> >> +static inline unsigned long rdgsshadow(void) >> +{ >> + unsigned long base; >> + >> + alternative_io("mov %[msr], %%ecx\n\t" >> + "rdmsr\n\t" >> + "shl $32, %%rdx\n\t" >> + "or %%rdx, %[res]", > > There needs to be some clearer distinction between the two > alternatives. It took a while for me to spot this comma. Any suggestion? I've been noticing the issue of the split being hard to spot in other places as well, so I'd like to do this in a generally applicable and sufficiently uniform way. >> + "swapgs\n\t" >> + ".byte 0xf3, 0x48, 0x0f, 0xae, 0xc8\n\t" /* rdgsbase rax >> */ >> + "swapgs", >> + X86_FEATURE_FSGSBASE, >> + [res] "=&a" (base), >> + [msr] "i" (MSR_SHADOW_GS_BASE) : "rcx", "rdx"); > > MSR should be a "c" constraint, and the clobber dropped. It will result > in better code in most of the callsites, by avoiding a reload of ecx, > and merging of the higher constant with the other MSR accesses. That's one view to take. The other is that the alternative doesn't require this input register at all. Furthermore, if my counting is correct, I'd need to add NOPs to the original insn sequence in order to fit the replacement one. The goal, however, is to keep the two as similar in size as possible. I also don't buy the "avoid reload of ecx" argument: There are no other uses of MSR_SHADOW_GS_BASE without intermediate other MSR accesses afaics. > I'm not entirely sure the alternative is justified here. For > consistency alone, these helpers should match their companions, and in > the unlikely case that the runtime feature test does make a measurable > difference, wouldn't a static key be a better option anyway? Static key? The main reason I dislike making {rd,wr}{fs,gs}base use alternatives is that the original code would be much larger than the replacement code. IOW of the options to make things consistent, I'd prefer using alternatives for the other inline functions too. But I think the choice here should be what fits best. >> @@ -196,6 +215,19 @@ static inline void wrgsbase(unsigned lon >> wrmsrl(MSR_GS_BASE, base); >> } >> >> +static inline void wrgsshadow(unsigned long base) >> +{ >> + alternative_input("mov %[msr], %%ecx\n\t" >> + "shld $32, %[val], %%rdx\n\t" > > This is a vector path instruction and specifically called out to be > avoided in the AMD optimisation guide. On all hardware (according to > Agner's latency mesurements) it alone has a longer latency to execute > that the mov/shl pair you've replaced, and that is before accounting for > mov elimination. For one I doubt the latency of the SHLD will be noticable with the latency of the following WRMSR. And then I think the main goal should be to have optimal performance on modern CPUs. That means size-optimizing original code, to reduce the amount of NOPs needed when the alternative is being installed. Jan _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |