
Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h


  • To: Oleksii <oleksii.kurochko@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 19 Feb 2024 12:22:32 +0100
  • Cc: Alistair Francis <alistair.francis@xxxxxxx>, Bob Eshleman <bobbyeshleman@xxxxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 19 Feb 2024 11:22:42 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 15.02.2024 14:41, Oleksii wrote:
>>> +        : "=r" (ret), "+A" (*ptr) \
>>> +        : "r" (new) \
>>> +        : "memory" ); \
>>> +})
>>> +
>>> +#define emulate_xchg_1_2(ptr, new, ret, release_barrier, acquire_barrier) \
>>> +({ \
>>> +    uint32_t *ptr_32b_aligned = (uint32_t *)ALIGN_DOWN((unsigned long)ptr, 4); \
>>
>> You now appear to assume that this macro is only used with inputs not
>> crossing word boundaries. That's okay as long as suitably guaranteed
>> at the use sites, but imo wants saying in a comment.
>>
>>> +    uint8_t mask_l = ((unsigned long)(ptr) & (0x8 - sizeof(*ptr))) * BITS_PER_BYTE; \
>>
>> Why 0x8 (i.e. spanning 64 bits), not 4 (matching the uint32_t use
>> above)?
> The idea of reading 8 bytes was to deal with crossing a word boundary.
> So if our address is 0x3 and we have to xchg() 2 bytes, the access will
> cross the 4-byte boundary. Instead we align down the address 0x3, so it
> becomes 0x0, and then always work with 8 bytes.

Then what if my 2-byte access crosses a dword boundary? A cache line
one? A page one?

>>> +    unsigned long new_ = (unsigned long)(new) << mask_l; \
>>> +    unsigned long ret_; \
>>> +    unsigned long rc; \
>>
>> Similarly, why unsigned long here?
> sizeof(unsigned long) is 8 bytes, and it was chosen because we are
> working with lr/sc.d, which operate on 8 bytes.
> 
>>
>> I also wonder about the mix of underscore suffixed (or not) variable
>> names here.
> If the question is about ret_, then as before: the size of the ret
> argument of the macro will be 1 or 2, but {lr/sc}.d are expected to
> work with 8 bytes.

Then what's the uint32_t * about?

>>> +        release_barrier \
>>> +        "0: lr.d %0, %2\n" \
>>
>> Even here it's an 8-byte access. Even if - didn't check - the insn was
>> okay to use with just a 4-byte aligned pointer, wouldn't it make sense
>> then to 8-byte align it, and be consistent throughout this macro wrt
>> the base unit acted upon? Alternatively, why not use lr.w here, thus
>> reducing possible collisions between multiple CPUs accessing the same
>> cache line?
> According to the docs:
> LR and SC operate on naturally-aligned 64-bit (RV64 only) or 32-bit
> words in memory. Misaligned
> addresses will generate misaligned address exceptions.
> 
> My intention was to deal with crossing a 4-byte boundary. So if ptr is
> 4-byte aligned, then by reading 8 bytes we shouldn't need to care about
> boundary crossing, if I am not missing something.

If a ptr is 4-byte aligned, there's no point reading more than 4 bytes.

>>> +        "   and  %1, %0, %z4\n" \
>>> +        "   or   %1, %1, %z3\n" \
>>> +        "   sc.d %1, %1, %2\n" \
>>> +        "   bnez %1, 0b\n" \
>>> +        acquire_barrier \
>>> +        : "=&r" (ret_), "=&r" (rc), "+A" (*ptr_32b_aligned) \
>>> +        : "rJ" (new_), "rJ" (~mask) \
>>
>> I think that as soon as there are more than 2 or maybe 3 operands,
>> legibility is vastly improved by using named asm() operands.
> Just to clarify: you mean that it would be better to use names instead
> of %0?

Yes. Just like you have it in one of the other patches that I looked at
later.

>>> +        : "memory"); \
>>
>> Nit: Missing blank before closing parenthesis.
>>
>>> +    \
>>> +    ret = (__typeof__(*(ptr)))((ret_ & mask) >> mask_l); \
>>> +})
>>
>> Why does "ret" need to be a macro argument? If you had only the
>> expression here, not the assignment, ...
>>
>>> +#define __xchg_generic(ptr, new, size, sfx, release_barrier, acquire_barrier) \
>>> +({ \
>>> +    __typeof__(ptr) ptr__ = (ptr); \
>>
>> Is this local variable really needed? Can't you use "ptr" directly
>> in the three macro invocations?
>>
>>> +    __typeof__(*(ptr)) new__ = (new); \
>>> +    __typeof__(*(ptr)) ret__; \
>>> +    switch (size) \
>>> +    { \
>>> +    case 1: \
>>> +    case 2: \
>>> +        emulate_xchg_1_2(ptr__, new__, ret__, release_barrier, acquire_barrier); \
>>
>> ... this would become
>>
>>         ret__ = emulate_xchg_1_2(ptr__, new__, release_barrier, acquire_barrier); \
>>
>> But, unlike assumed above, there's no enforcement here that a 2-byte
>> quantity won't cross a word, double-word, cache line, or even page
>> boundary. That might be okay if then the code would simply crash (like
>> the AMO insns emitted further down would), but aiui silent misbehavior
>> would result.
> As I mentioned above, with 4-byte alignment and then reading and
> working with 8 bytes, crossing a word or double-word boundary shouldn't
> be an issue.
> 
> I am not sure that I know how to check whether we are crossing a cache
> line boundary.
> 
> Regarding a page boundary: if the next page is mapped then all should
> work fine; otherwise it will be an exception.

Are you sure lr.d / sc.d are happy to access across such a boundary,
when both pages are mapped?

To me it seems pretty clear that for atomic accesses you want to
demand natural alignment, i.e. 2-byte alignment for 2-byte accesses.
This way you can be sure no potentially problematic boundaries will
be crossed.

>>> +        break; \
>>> +    case 4: \
>>> +        __amoswap_generic(ptr__, new__, ret__,\
>>> +                          ".w" sfx,  release_barrier, acquire_barrier); \
>>> +        break; \
>>> +    case 8: \
>>> +        __amoswap_generic(ptr__, new__, ret__,\
>>> +                          ".d" sfx,  release_barrier, acquire_barrier); \
>>> +        break; \
>>> +    default: \
>>> +        STATIC_ASSERT_UNREACHABLE(); \
>>> +    } \
>>> +    ret__; \
>>> +})
>>> +
>>> +#define xchg_relaxed(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), "", "", ""); \
>>> +})
>>> +
>>> +#define xchg_acquire(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
>>> +                                       "", "",
>>> RISCV_ACQUIRE_BARRIER); \
>>> +})
>>> +
>>> +#define xchg_release(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)),\
>>> +                                       "", RISCV_RELEASE_BARRIER,
>>> ""); \
>>> +})
>>> +
>>> +#define xchg(ptr,x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) ret__; \
>>> +    ret__ = (__typeof__(*(ptr))) \
>>> +            __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \
>>> +                           ".aqrl", "", ""); \
>>
>> The .aqrl doesn't look to affect the (emulated) 1- and 2-byte cases.
>>
>> Further, amoswap also exists in release-only and acquire-only forms.
>> Why do you prefer explicit barrier insns over those? (Looks to
>> similarly apply to the emulation path as well as to the cmpxchg
>> machinery then, as both lr and sc also come in all four possible
>> acquire/release forms. Perhaps for the emulation path using
>> explicit barriers is better, in case the acquire/release forms of
>> lr/sc - being used inside the loop - might perform worse.)
> As the 1- and 2-byte cases are emulated, I decided not to provide an
> sfx argument for the emulation macros, as it would not have much effect
> on the emulated types and would just cost performance with the acquire
> and release versions of the lr/sc instructions.

Question is whether the common case (4- and 8-byte accesses) shouldn't
be valued higher, with 1- and 2-byte emulation being there just to
allow things to not break altogether.

>> No RISCV_..._BARRIER for use here and ...
>>
>>> +    ret__; \
>>> +})
>>> +
>>> +#define __cmpxchg(ptr, o, n, s) \
>>> +({ \
>>> +    __typeof__(*(ptr)) ret__; \
>>> +    ret__ = (__typeof__(*(ptr))) \
>>> +            __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
>>> +                              s, ".rl", "", " fence rw, rw\n"); \
>>
>> ... here? And anyway, wouldn't it make sense to have
>>
>> #define cmpxchg(ptr, o, n) __cmpxchg(ptr, o, n, sizeof(*(ptr)))
>>
>> to limit redundancy?
>>
>> Plus wouldn't
>>
>> #define __cmpxchg(ptr, o, n, s) \
>>     ((__typeof__(*(ptr))) \
>>      __cmpxchg_generic(ptr, (unsigned long)(o), (unsigned long)(n), \
>>                        s, ".rl", "", " fence rw, rw\n"))
>>
>> be shorter and thus easier to follow as well? As I notice only now,
>> this would apparently apply further up as well.
> I understand your point about "#define cmpxchg(ptr, o, n) __cmpxchg(",
> but I can't understand how the definition of __cmpxchg could be made
> shorter. Could you please clarify that?

You did notice that in my form there's no local variable, and hence
also no macro-wide scope ( ({ ... }) )?

Jan



 

