
Re: [PATCH] x86: Use LOCK ADD instead of MFENCE for smp_mb()



On 21.09.2020 15:04, Andrew Cooper wrote:
> MFENCE is overly heavyweight for SMP semantics on WB memory, because it also
> orders weaker cached writes, and flushes the WC buffers.
> 
> This technique was used as an optimisation in Java[1], and later adopted by
> Linux[2] where it was measured to have a 60% performance improvement in VirtIO
> benchmarks.
> 
> The stack is used because it is hot in the L1 cache, and a -4 offset is used
> to avoid creating a false data dependency on live data.  (For 64bit userspace,
> the offset needs to be under the red zone to avoid false dependences).
> 
> Fix up the 32 bit definitions in HVMLoader and libxc to avoid a false data
> dependency.
> 
> [1] https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> [2] https://git.kernel.org/torvalds/c/450cbdd0125cfa5d7bbf9e2a6b6961cc48d29730
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>

For the hypervisor and hvmloader part:
Reviewed-by: Jan Beulich <jbeulich@xxxxxxxx>

> --- a/tools/libs/ctrl/include/xenctrl.h
> +++ b/tools/libs/ctrl/include/xenctrl.h
> @@ -68,11 +68,11 @@
>  #define xen_barrier() asm volatile ( "" : : : "memory")
>  
>  #if defined(__i386__)
> -#define xen_mb()  asm volatile ( "lock; addl $0,0(%%esp)" : : : "memory" )

If this causes a false dependency (which I agree it does), how
come ...

> +#define xen_mb()  asm volatile ( "lock addl $0, -4(%%esp)" ::: "memory" )
>  #define xen_rmb() xen_barrier()
>  #define xen_wmb() xen_barrier()
>  #elif defined(__x86_64__)
> -#define xen_mb()  asm volatile ( "mfence" : : : "memory")
> +#define xen_mb()  asm volatile ( "lock addl $0, -128(%%rsp)" ::: "memory" )

... this one doesn't? It accesses the bottom 4 bytes of the red
zone, doesn't it?

As a minor other thought for all of its incarnations: Is a 32-bit
memory access really the best choice? Wouldn't an 8-bit access
further reduce (albeit not eliminate) the risk of unnecessary
dependencies between this memory access and others in functions
called from the users of this barrier?

Or actually, in the hypervisor case, since the stack slot used
would typically hold the return address of the next level of
called functions, wouldn't a 64-bit access, or one further away
from the current stack pointer, help avoid partial dependencies?

And finally, ever since Linux used this for just 32-bit I've
wondered why they bother preserving the contents of this piece of
memory. How about using NOT (saving the immediate byte) or XCHG
(requiring a dead register, but saving the arithmetic, the
immediate byte, and the lock prefix)?

Jan



 

