[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-devel] [PATCH] x86/build: Use new .nops directive when available



On Thu, Aug 16, 2018 at 04:56:04PM +0100, Andrew Cooper wrote:
> On 16/08/18 15:31, Roger Pau Monné wrote:
> > On Thu, Aug 16, 2018 at 11:42:56AM +0100, Andrew Cooper wrote:
> >> On 16/08/18 10:55, Roger Pau Monné wrote:
> >>> On Wed, Aug 15, 2018 at 06:57:38PM +0100, Andrew Cooper wrote:
> >>>> diff --git a/xen/arch/x86/Rules.mk b/xen/arch/x86/Rules.mk
> >>>> index ac585a3..c84ed20 100644
> >>>> --- a/xen/arch/x86/Rules.mk
> >>>> +++ b/xen/arch/x86/Rules.mk
> >>>> @@ -29,6 +29,10 @@ $(call as-option-add,CFLAGS,CC,"invpcid 
> >>>> (%rax)$$(comma)%rax",-DHAVE_AS_INVPCID)
> >>>>  $(call as-option-add,CFLAGS,CC,\
> >>>>      ".if ((1 > 0) < 0); .error \"\";.endif",,-DHAVE_AS_NEGATIVE_TRUE)
> >>>>  
> >>>> +# Check to see whether the assmbler supports the .nop directive.
> >>>> +$(call as-option-add,CFLAGS,CC,\
> >>>> +    ".L1: .L2: .nops (.L2 - .L1)$$(comma)9",-DHAVE_AS_NOP_DIRECTIVE)
> >>> I think I remember commenting on an earlier version of this about the
> >>> usage of the CONTROL parameter. I would expect the assembler to
> >>> use the most optimized version by default, is that not the case?
> >> Again, I don't understand what you're trying to say.
> >>
> >> This expression is like this, because that's how we actually use it.
> > As mentioned in another email, I was wondering why we choose to not
> > use nop instructions > 9 bytes. The assembler will by default use
> > nop instructions up to 11 bytes for 64bit code.
> 
> Using more than 9 bytes is suboptimal on some hardware.

OK. But using the same argument isn't it also suboptimal to use 9
bytes on some hardware then also?

What I mean by this is that it would be good to add a comment
somewhere that notes why nop instructions are limited to 9 bytes,
because that's likely to generate the more optimized code on a wide
variety of hardware.

At least it would have helped me.

> Toolchains use long nops (exclusively?) for basic block alignment,
> whereas we use use them for executing through because its still faster
> than a branch.
> 
> >
> >>>> +
> >>>>  CFLAGS += -mno-red-zone -fpic -fno-asynchronous-unwind-tables
> >>>>  
> >>>>  # Xen doesn't use SSE interally.  If the compiler supports it, also 
> >>>> skip the
> >>>> diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c
> >>>> index 0ef7a8b..2c844d6 100644
> >>>> --- a/xen/arch/x86/alternative.c
> >>>> +++ b/xen/arch/x86/alternative.c
> >>>> @@ -84,6 +84,19 @@ static const unsigned char * const 
> >>>> p6_nops[ASM_NOP_MAX+1] init_or_livepatch_cons
> >>>>  
> >>>>  static const unsigned char * const *ideal_nops init_or_livepatch_data = 
> >>>> p6_nops;
> >>>>  
> >>>> +#ifdef HAVE_AS_NOP_DIRECTIVE
> >>>> +
> >>>> +/* Nops in .init.rodata to compare against the runtime ideal nops. */
> >>>> +asm ( ".pushsection .init.rodata, \"a\", @progbits\n\t"
> >>>> +      "toolchain_nops: .nops " __stringify(ASM_NOP_MAX) "\n\t"
> >>>> +      ".popsection\n\t");
> >>>> +extern char toolchain_nops[ASM_NOP_MAX];
> >>>> +static bool __read_mostly toolchain_nops_are_ideal;
> >>>> +
> >>>> +#else
> >>>> +# define toolchain_nops_are_ideal false
> >>>> +#endif
> >>>> +
> >>>>  static void __init arch_init_ideal_nops(void)
> >>>>  {
> >>>>      switch ( boot_cpu_data.x86_vendor )
> >>>> @@ -112,6 +125,11 @@ static void __init arch_init_ideal_nops(void)
> >>>>              ideal_nops = k8_nops;
> >>>>          break;
> >>>>      }
> >>>> +
> >>>> +#ifdef HAVE_AS_NOP_DIRECTIVE
> >>>> +    if ( memcmp(ideal_nops[ASM_NOP_MAX], toolchain_nops, ASM_NOP_MAX) 
> >>>> == 0 )
> >>>> +        toolchain_nops_are_ideal = true;
> >>>> +#endif
> >>> You are only comparing that the biggest nop instruction (9 bytes
> >>> AFAICT) generated by the assembler is what Xen believes to be the more
> >>> optimized version. What about shorter nops?
> >> They are all variations on a theme.
> >>
> >> For P6 nops, its the 0f 1f root which is important, which takes a modrm
> >> byte.  Traditionally, its always encoded with eax and uses redundant
> >> memory encodings for longer instructions.
> >>
> >> I can't think of any way of detecting if the optimised nops if the
> >> toolchain starts using alternative registers in the encoding, but I
> >> expect this case won't happen in practice.
> > I would rather do:
> >
> > toolchain_nops:
> >     .nops 1
> >     .nops 2
> >     .nops 3
> >     ...
> >     .nops 9
> >
> > And then build an assembler_nops[ASM_NOP_MAX+1] like it's done for the
> > other nops. Then you could do:
> >
> > toolchain_nops_are_ideal = true;
> > for ( i = 1; i < ASM_NOP_MAX+1; i++ )
> >     if ( memcmp(ideal_nops[i], assembler_nops[i], i) )
> >     {
> >             toolchain_nops_are_ideal = false;
> >             break;
> >     }
> >
> > In order to make sure all the possible nop sizes are using the
> > expected optimized version.
> 
> What is the point?  Other than the 0f 1f, it really doesn't matter, and
> small variations won't invalidate them as ideal nops.
> 
> This check needs to be just good enough to tell whether the toolchain
> used P6 or K8 nops, and unless you explicitly built with -march=k8, the
> answer is going to be P6.
> 
> It does not need to check every variation byte for byte.  Going back to
> my original argument for not even doing this basic check, if we happen
> to be wrong and the toolchain wrote the other kind of long nops, you
> probably won't be able to measure the difference.

I have to admit I don't know that much about nop instruction length or
encoding, but again I think it would be nice to add the reasoning
above to the commit as a comment on why only instruction of length 9
is tested.

Roger.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.