[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 1/2] x86: improve MSR_SHADOW_GS accesses



>>> On 06.12.17 at 20:34, <andrew.cooper3@xxxxxxxxxx> wrote:
> On 06/12/17 16:37, Jan Beulich wrote:
>> @@ -172,6 +173,24 @@ static inline unsigned long rdgsbase(voi
>>      return base;
>>  }
>>  
>> +static inline unsigned long rdgsshadow(void)
>> +{
>> +    unsigned long base;
>> +
>> +    alternative_io("mov %[msr], %%ecx\n\t"
>> +                   "rdmsr\n\t"
>> +                   "shl $32, %%rdx\n\t"
>> +                   "or %%rdx, %[res]",
> 
> There needs to be some clearer distinction between the two
> alternatives.  It took a while for me to spot this comma.

Any suggestion? I've been noticing the issue of the split being
hard to spot in other places as well, so I'd like to do this in a
generally applicable and sufficiently uniform way.

>> +                   "swapgs\n\t"
>> +                   ".byte 0xf3, 0x48, 0x0f, 0xae, 0xc8\n\t" /* rdgsbase rax 
>> */
>> +                   "swapgs",
>> +                   X86_FEATURE_FSGSBASE,
>> +                   [res] "=&a" (base),
>> +                   [msr] "i" (MSR_SHADOW_GS_BASE) : "rcx", "rdx");
> 
> MSR should be a "c" constraint, and the clobber dropped.  It will result
> in better code in most of the callsites, by avoiding a reload of ecx,
> and merging of the higher constant with the other MSR accesses.

That's one view to take. The other is that the alternative doesn't
require this input register at all. Furthermore, if my counting is
correct, I'd need to add NOPs to the original insn sequence in
order to fit the replacement one. The goal, however, is to keep
the two as similar in size as possible.

I also don't buy the "avoid reload of ecx" argument: There are
no other uses of MSR_SHADOW_GS_BASE without intermediate
other MSR accesses afaics.

> I'm not entirely sure the alternative is justified here.  For
> consistency alone, these helpers should match their companions, and in
> the unlikely case that the runtime feature test does make a measurable
> difference, wouldn't a static key be a better option anyway?

Static key? The main reason I dislike making {rd,wr}{fs,gs}base
use alternatives is that the original code would be much larger
than the replacement code. IOW of the options to make things
consistent, I'd prefer using alternatives for the other inline
functions too. But I think the choice here should be what fits
best.

>> @@ -196,6 +215,19 @@ static inline void wrgsbase(unsigned lon
>>          wrmsrl(MSR_GS_BASE, base);
>>  }
>>  
>> +static inline void wrgsshadow(unsigned long base)
>> +{
>> +    alternative_input("mov %[msr], %%ecx\n\t"
>> +                      "shld $32, %[val], %%rdx\n\t"
> 
> This is a vector path instruction and specifically called out to be
> avoided in the AMD optimisation guide.  On all hardware (according to
> Agner's latency mesurements) it alone has a longer latency to execute
> that the mov/shl pair you've replaced, and that is before accounting for
> mov elimination.

For one I doubt the latency of the SHLD will be noticable with
the latency of the following WRMSR. And then I think the main
goal should be to have optimal performance on modern CPUs.
That means size-optimizing original code, to reduce the amount
of NOPs needed when the alternative is being installed.

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.