[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH 1/2] x86: improve MSR_SHADOW_GS accesses



On 07/12/17 10:45, Jan Beulich wrote:
>>>> On 06.12.17 at 20:34, <andrew.cooper3@xxxxxxxxxx> wrote:
>> On 06/12/17 16:37, Jan Beulich wrote:
>>> @@ -172,6 +173,24 @@ static inline unsigned long rdgsbase(voi
>>>      return base;
>>>  }
>>>  
>>> +static inline unsigned long rdgsshadow(void)
>>> +{
>>> +    unsigned long base;
>>> +
>>> +    alternative_io("mov %[msr], %%ecx\n\t"
>>> +                   "rdmsr\n\t"
>>> +                   "shl $32, %%rdx\n\t"
>>> +                   "or %%rdx, %[res]",
>> There needs to be some clearer distinction between the two
>> alternatives.  It took a while for me to spot this comma.
> Any suggestion? I've been noticing the issue of the split being
> hard to spot in other places as well, so I'd like to do this in a
> generally applicable and sufficiently uniform way.

The best I can think of is something like this:

"or %%rdx, %[res]",
/* Use a double swapgs and rdgsbase if available. */
"swapgs\n\t"


>
>> I'm not entirely sure the alternative is justified here.  For
>> consistency alone, these helpers should match their companions, and in
>> the unlikely case that the runtime feature test does make a measurable
>> difference, wouldn't a static key be a better option anyway?
> Static key? The main reason I dislike making {rd,wr}{fs,gs}base
> use alternatives is that the original code would be much larger
> than the replacement code. IOW of the options to make things
> consistent, I'd prefer using alternatives for the other inline
> functions too. But I think the choice here should be what fits
> best.
>
>>> @@ -196,6 +215,19 @@ static inline void wrgsbase(unsigned lon
>>>          wrmsrl(MSR_GS_BASE, base);
>>>  }
>>>  
>>> +static inline void wrgsshadow(unsigned long base)
>>> +{
>>> +    alternative_input("mov %[msr], %%ecx\n\t"
>>> +                      "shld $32, %[val], %%rdx\n\t"
>> This is a vector path instruction and specifically called out to be
>> avoided in the AMD optimisation guide.  On all hardware (according to
>> Agner's latency mesurements) it alone has a longer latency to execute
>> that the mov/shl pair you've replaced, and that is before accounting for
>> mov elimination.
> For one I doubt the latency of the SHLD will be noticable with
> the latency of the following WRMSR. And then I think the main
> goal should be to have optimal performance on modern CPUs.
> That means size-optimizing original code, to reduce the amount
> of NOPs needed when the alternative is being installed.

I'm sorry if this comes across as blunt, but you are too focused on the
wrong details.

I agree that `swap;{rd,wr}gsbase;swap` is better than wrmsr, and we
should be using it when available.

However, it is simply not the case that squeezing every possible byte
out of .text makes the best result.  Just as with inc and dec, there is
a very good reason why compilers don't emit shld, and that is because
the ORM's recommend against their use.  In this specific case, a mov/shl
pair is better than shld on all hardware, even in older pipelines which
don't do mov elimination.

You are going to need a more convincing argument than currently provided
as to why the helpers aren't implemented like this:

static inline unsigned long rdgsshadow(void)
{
    unsigned long base;

    if ( cpu_has_fsgsbase )
        asm volatile ( "swapgs\n\t"
#ifdef HAVE_GAS_FSGSBASE
                       "rdgsbase %0\n\t"
#else
                    ".byte 0xf3, 0x48, 0x0f, 0xae, 0xc8\n\t"
#endif
                        "swapgs\n\t"
                       :
#ifdef HAVE_GAS_FSGSBASE
                       "=r" (base)
#else
                       "=a" (base)
#endif
            );
    else
        rdmsrl(MSR_SHADOW_GS_BASE, base);

    return base;
}

static inline void wrgsshadow(unsigned long base)
{
    if ( cpu_has_fsgsbase )
        asm volatile ( "swapgs\n\t"
#ifdef HAVE_GAS_FSGSBASE
                       "wrgsbase %0\n\t"
#else
                       ".byte 0xf3, 0x48, 0x0f, 0xae, 0xd8\n\t"
#endif
                        "swapgs\n\t"
                       ::
#ifdef HAVE_GAS_FSGSBASE
                       "r" (base)
#else
                       "a" (base)
#endif
            );
    else
        wrmsrl(MSR_SHADOW_GS_BASE, base);
}

In particular:
1) This is consistent with existing {rd,wr}{fs,gs}base helpers.
2) cpu_has_fsgsbase is constant after boot, so the branch will be
predicted correctly.
3) It doesn't use instructions recommended against by the ORMs.

I don't wish to suggest that the above code is better than any and all
possible other combinations of performing the same operation, but the
more you get into the minutia of microoptimisation, the higher the
barrier for inclusion gets, because it is important to demonstrate that
the benefits genuinely outweigh the associated costs.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.