[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] [PATCH] x86/build: Use new .nops directive when available
On Thu, Aug 16, 2018 at 04:56:04PM +0100, Andrew Cooper wrote: > On 16/08/18 15:31, Roger Pau Monné wrote: > > On Thu, Aug 16, 2018 at 11:42:56AM +0100, Andrew Cooper wrote: > >> On 16/08/18 10:55, Roger Pau Monné wrote: > >>> On Wed, Aug 15, 2018 at 06:57:38PM +0100, Andrew Cooper wrote: > >>>> diff --git a/xen/arch/x86/Rules.mk b/xen/arch/x86/Rules.mk > >>>> index ac585a3..c84ed20 100644 > >>>> --- a/xen/arch/x86/Rules.mk > >>>> +++ b/xen/arch/x86/Rules.mk > >>>> @@ -29,6 +29,10 @@ $(call as-option-add,CFLAGS,CC,"invpcid > >>>> (%rax)$$(comma)%rax",-DHAVE_AS_INVPCID) > >>>> $(call as-option-add,CFLAGS,CC,\ > >>>> ".if ((1 > 0) < 0); .error \"\";.endif",,-DHAVE_AS_NEGATIVE_TRUE) > >>>> > >>>> +# Check to see whether the assmbler supports the .nop directive. > >>>> +$(call as-option-add,CFLAGS,CC,\ > >>>> + ".L1: .L2: .nops (.L2 - .L1)$$(comma)9",-DHAVE_AS_NOP_DIRECTIVE) > >>> I think I remember commenting on an earlier version of this about the > >>> usage of the CONTROL parameter. I would expect the assembler to > >>> use the most optimized version by default, is that not the case? > >> Again, I don't understand what you're trying to say. > >> > >> This expression is like this, because that's how we actually use it. > > As mentioned in another email, I was wondering why we choose to not > > use nop instructions > 9 bytes. The assembler will by default use > > nop instructions up to 11 bytes for 64bit code. > > Using more than 9 bytes is suboptimal on some hardware. OK. But using the same argument isn't it also suboptimal to use 9 bytes on some hardware then also? What I mean by this is that it would be good to add a comment somewhere that notes why nop instructions are limited to 9 bytes, because that's likely to generate the more optimized code on a wide variety of hardware. At least it would have helped me. > Toolchains use long nops (exclusively?) for basic block alignment, > whereas we use use them for executing through because its still faster > than a branch. > > > > >>>> + > >>>> CFLAGS += -mno-red-zone -fpic -fno-asynchronous-unwind-tables > >>>> > >>>> # Xen doesn't use SSE interally. If the compiler supports it, also > >>>> skip the > >>>> diff --git a/xen/arch/x86/alternative.c b/xen/arch/x86/alternative.c > >>>> index 0ef7a8b..2c844d6 100644 > >>>> --- a/xen/arch/x86/alternative.c > >>>> +++ b/xen/arch/x86/alternative.c > >>>> @@ -84,6 +84,19 @@ static const unsigned char * const > >>>> p6_nops[ASM_NOP_MAX+1] init_or_livepatch_cons > >>>> > >>>> static const unsigned char * const *ideal_nops init_or_livepatch_data = > >>>> p6_nops; > >>>> > >>>> +#ifdef HAVE_AS_NOP_DIRECTIVE > >>>> + > >>>> +/* Nops in .init.rodata to compare against the runtime ideal nops. */ > >>>> +asm ( ".pushsection .init.rodata, \"a\", @progbits\n\t" > >>>> + "toolchain_nops: .nops " __stringify(ASM_NOP_MAX) "\n\t" > >>>> + ".popsection\n\t"); > >>>> +extern char toolchain_nops[ASM_NOP_MAX]; > >>>> +static bool __read_mostly toolchain_nops_are_ideal; > >>>> + > >>>> +#else > >>>> +# define toolchain_nops_are_ideal false > >>>> +#endif > >>>> + > >>>> static void __init arch_init_ideal_nops(void) > >>>> { > >>>> switch ( boot_cpu_data.x86_vendor ) > >>>> @@ -112,6 +125,11 @@ static void __init arch_init_ideal_nops(void) > >>>> ideal_nops = k8_nops; > >>>> break; > >>>> } > >>>> + > >>>> +#ifdef HAVE_AS_NOP_DIRECTIVE > >>>> + if ( memcmp(ideal_nops[ASM_NOP_MAX], toolchain_nops, ASM_NOP_MAX) > >>>> == 0 ) > >>>> + toolchain_nops_are_ideal = true; > >>>> +#endif > >>> You are only comparing that the biggest nop instruction (9 bytes > >>> AFAICT) generated by the assembler is what Xen believes to be the more > >>> optimized version. What about shorter nops? > >> They are all variations on a theme. > >> > >> For P6 nops, its the 0f 1f root which is important, which takes a modrm > >> byte. Traditionally, its always encoded with eax and uses redundant > >> memory encodings for longer instructions. > >> > >> I can't think of any way of detecting if the optimised nops if the > >> toolchain starts using alternative registers in the encoding, but I > >> expect this case won't happen in practice. > > I would rather do: > > > > toolchain_nops: > > .nops 1 > > .nops 2 > > .nops 3 > > ... > > .nops 9 > > > > And then build an assembler_nops[ASM_NOP_MAX+1] like it's done for the > > other nops. Then you could do: > > > > toolchain_nops_are_ideal = true; > > for ( i = 1; i < ASM_NOP_MAX+1; i++ ) > > if ( memcmp(ideal_nops[i], assembler_nops[i], i) ) > > { > > toolchain_nops_are_ideal = false; > > break; > > } > > > > In order to make sure all the possible nop sizes are using the > > expected optimized version. > > What is the point? Other than the 0f 1f, it really doesn't matter, and > small variations won't invalidate them as ideal nops. > > This check needs to be just good enough to tell whether the toolchain > used P6 or K8 nops, and unless you explicitly built with -march=k8, the > answer is going to be P6. > > It does not need to check every variation byte for byte. Going back to > my original argument for not even doing this basic check, if we happen > to be wrong and the toolchain wrote the other kind of long nops, you > probably won't be able to measure the difference. I have to admit I don't know that much about nop instruction length or encoding, but again I think it would be nice to add the reasoning above to the commit as a comment on why only instruction of length 9 is tested. Roger. _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxxxxxxxxx https://lists.xenproject.org/mailman/listinfo/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |