Re: [PATCH v3 4/7] x86: control memset() and memcpy() inlining
On 25/11/2024 2:29 pm, Jan Beulich wrote:
> Stop the compiler from inlining non-trivial memset() and memcpy() (for
> memset() see e.g. map_vcpu_info() or kimage_load_segments() for
> examples). This way we even keep the compiler from using REP STOSQ /
> REP MOVSQ when we'd prefer REP STOSB / REP MOVSB (when ERMS is
> available).
>
> With gcc10 this yields a modest .text size reduction (release build) of
> around 2k.
>
> Unfortunately these options aren't understood by the clang versions I
> have readily available for testing with; I'm unaware of equivalents.
>
> Note also that using cc-option-add is not an option here, or at least I
> couldn't make things work with it (in case the option was not supported
> by the compiler): The embedded comma in the option looks to be getting
> in the way.
>
> Requested-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>
> Signed-off-by: Jan Beulich <jbeulich@xxxxxxxx>
> ---
> v3: Re-base.
> v2: New.
> ---
> The boundary values are of course up for discussion - I wasn't really
> certain whether to use 16 or 32; I'd be less certain about using yet
> larger values.
>
> Similarly whether to permit the compiler to emit REP STOSQ / REP MOVSQ
> for known size, properly aligned blocks is up for discussion.

I didn't realise there were any options like this.  (For anyone else
unfamiliar with them, a sketch of the flags in question is appended at
the end of this mail.)

The result is very different on GCC-12, with the following extremes:

add/remove: 0/0 grow/shrink: 83/71 up/down: 8764/-3913 (4851)
Function                         old     new   delta
x86_emulate                   136966  139990   +3024
ptwr_emulated_cmpxchg            555    1058    +503
hvm_emulate_cmpxchg             1178    1648    +470
hvmemul_do_io                   1605    2059    +454
hvmemul_linear_mmio_access      1060    1324    +264
hvmemul_write_cache              655     890    +235
...
do_console_io                   1293    1170    -123
arch_get_info_guest             2200    2072    -128
avtab_read_item                  821     692    -129
acpi_tb_create_local_fadt        866     714    -152
xz_dec_lzma2_run                2573    2272    -301
__hvm_copy                      1085     737    -348
Total: Before=3902769, After=3907620, chg +0.12%

So there is a mix, but it's in a distinctly upward direction.

As a possibly-related tangent, something I did notice when playing with
-fanalyzer was that even attr(alloc_size/align) helped the code
generation for an inlined memcpy().  E.g. with _xmalloc() only getting
__attribute__((alloc_size(1),alloc_align(2))), functions like
init_domain_cpu_policy() go from:

  48 8b 13                mov    (%rbx),%rdx
  48 8d 78 08             lea    0x8(%rax),%rdi
  48 89 c1                mov    %rax,%rcx
  48 89 de                mov    %rbx,%rsi
  48 83 e7 f8             and    $0xfffffffffffffff8,%rdi
  48 89 10                mov    %rdx,(%rax)
  48 29 f9                sub    %rdi,%rcx
  48 8b 93 b0 07 00 00    mov    0x7b0(%rbx),%rdx
  48 29 ce                sub    %rcx,%rsi
  81 c1 b8 07 00 00       add    $0x7b8,%ecx
  48 89 90 b0 07 00 00    mov    %rdx,0x7b0(%rax)
  c1 e9 03                shr    $0x3,%ecx
  f3 48 a5                rep movsq %ds:(%rsi),%es:(%rdi)

down to simply:

  48 89 c7                mov    %rax,%rdi
  b9 f7 00 00 00          mov    $0xf7,%ecx
  48 89 ee                mov    %rbp,%rsi
  f3 48 a5                rep movsq %ds:(%rsi),%es:(%rdi)

i.e. it removes the logic for coping with a possibly-misaligned
destination pointer.  (An illustrative annotated declaration is also
appended below.)

As a possibly unrelated tangent, even __attribute__((malloc)) seems to
cause some code-gen changes.  In xenctl_bitmap_to_cpumask(), the change
is simply to not align the -ENOMEM basic block, saving 8 bytes.  This is
quite reasonable, because xmalloc() genuinely failing is 0% of the time
to many significant figures.

Mostly though, it's just basic-block churn, which seems to come from the
compiler treating the return value as "likely not NULL" and therefore
shuffling the error paths.

~Andrew
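The flags in question are GCC's x86-specific -mmemcpy-strategy= and
-mmemset-strategy= options, which take comma-separated
alg:max_size:dest_align triplets, the final triplet using -1 as its size
bound.  As a hedged sketch only -- the algorithm choices and the 16/32
thresholds below are illustrative, being exactly what the patch leaves
open for discussion -- the Makefile additions would look roughly like:

  # Inline small, known-size blocks as an unrolled loop without extra
  # destination-alignment code; fall back to the out-of-line
  # memcpy()/memset() (and hence REP MOVSB / REP STOSB on ERMS hardware)
  # for everything larger.
  CFLAGS += -mmemcpy-strategy=unrolled_loop:16:noalign,libcall:-1:noalign
  CFLAGS += -mmemset-strategy=unrolled_loop:32:noalign,libcall:-1:noalign

The embedded commas in these values are also what gets in the way of
cc-option-add, as noted in the patch description.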
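For the attribute experiment, a minimal sketch of the kind of annotation
involved, assuming a declaration shaped like Xen's _xmalloc() (the exact
prototype here is illustrative):

  /*
   * alloc_size(1) tells the compiler the returned block is 'size' bytes
   * long, alloc_align(2) that it is aligned to the 'align' argument, and
   * malloc that the returned pointer aliases nothing else.  The first
   * two are what allow GCC to drop the misaligned-destination handling
   * around the inlined memcpy() shown above.
   */
  void *_xmalloc(unsigned long size, unsigned long align)
      __attribute__((alloc_size(1), alloc_align(2), malloc));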