Re: [PATCH v5 11/23] xen/riscv: introduce cmpxchg.h
On 07.03.2024 11:35, Oleksii wrote:
> On Wed, 2024-03-06 at 15:56 +0100, Jan Beulich wrote:
>> On 26.02.2024 18:38, Oleksii Kurochko wrote:
>>> The header was taken from Linux kernel 6.4.0-rc1.
>>>
>>> Additionally, the following were updated:
>>> * add emulation of {cmp}xchg for 1/2 byte types using 32-bit atomic
>>>   access.
>>> * replace tabs with spaces
>>> * replace __* variables with *__
>>> * introduce generic versions of xchg_* and cmpxchg_*.
>>>
>>> The implementation of the 4- and 8-byte cases was left as it is done
>>> in the Linux kernel, as per the RISC-V spec:
>>> ```
>>> Table A.5 ( only part of the table was copied here )
>>>
>>> Linux Construct      RVWMO AMO Mapping
>>> atomic <op> relaxed  amo<op>.{w|d}
>>> atomic <op> acquire  amo<op>.{w|d}.aq
>>> atomic <op> release  amo<op>.{w|d}.rl
>>> atomic <op>          amo<op>.{w|d}.aqrl
>>>
>>> Linux Construct      RVWMO LR/SC Mapping
>>> atomic <op> relaxed  loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
>>> atomic <op> acquire  loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
>>> atomic <op> release  loop: lr.{w|d}; <op>; sc.{w|d}.aqrl* ; bnez loop
>>>                      OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}* ;
>>>                      bnez loop
>>> atomic <op>          loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop
>>>
>>> The Linux mappings for release operations may seem stronger than
>>> necessary, but these mappings are needed to cover some cases in which
>>> Linux requires stronger orderings than the more intuitive mappings
>>> would provide. In particular, as of the time this text is being
>>> written, Linux is actively debating whether to require load-load,
>>> load-store, and store-store orderings between accesses in one
>>> critical section and accesses in a subsequent critical section in the
>>> same hart and protected by the same synchronization object. Not all
>>> combinations of FENCE RW,W/FENCE R,RW mappings with .aq/.rl mappings
>>> combine to provide such orderings. There are a few ways around this
>>> problem, including:
>>> 1. Always use FENCE RW,W/FENCE R,RW, and never use .aq/.rl. This
>>>    suffices but is undesirable, as it defeats the purpose of the
>>>    .aq/.rl modifiers.
>>> 2. Always use .aq/.rl, and never use FENCE RW,W/FENCE R,RW. This
>>>    does not currently work due to the lack of load and store opcodes
>>>    with .aq and .rl modifiers.
>>
>> As before I don't understand this point. Can you give an example of
>> what sort of opcode / instruction is missing?
> If I understand the spec correctly then the l{b|h|w|d} and s{b|h|w|d}
> instructions don't have an aq or rl annotation.

How would load insns other than LR and store insns other than SC come
into play here?

>>> 3. Strengthen the mappings of release operations such that they
>>>    would enforce sufficient orderings in the presence of either type
>>>    of acquire mapping. This is the currently-recommended solution,
>>>    and the one shown in Table A.5.
>>> ```
>>>
>>> But in the Linux kernel the atomics were strengthened with fences:
>>> ```
>>> Atomics present the same issue with locking: release and acquire
>>> variants need to be strengthened to meet the constraints defined
>>> by the Linux-kernel memory consistency model [1].
>>>
>>> Atomics present a further issue: implementations of atomics such
>>> as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
>>> which do not give full-ordering with .aqrl; for example, current
>>> implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
>>> below to end up with the state indicated in the "exists" clause.
>>>
>>> In order to "synchronize" LKMM and RISC-V's implementation, this
>>> commit strengthens the implementations of the atomics operations
>>> by replacing .rl and .aq with the use of ("lightweight") fences,
>>> and by replacing .aqrl LR/SC pairs in sequences such as:
>>>
>>> 0:      lr.w.aqrl  %0, %addr
>>>         bne        %0, %old, 1f
>>>         ...
>>>         sc.w.aqrl  %1, %new, %addr
>>>         bnez       %1, 0b
>>> 1:
>>>
>>> with sequences of the form:
>>>
>>> 0:      lr.w       %0, %addr
>>>         bne        %0, %old, 1f
>>>         ...
>>>         sc.w.rl    %1, %new, %addr   /* SC-release   */
>>>         bnez       %1, 0b
>>>         fence      rw, rw            /* "full" fence */
>>> 1:
>>>
>>> following Daniel's suggestion.
>>>
>>> These modifications were validated with simulation of the RISC-V
>>> memory consistency model.
>>>
>>> C lr-sc-aqrl-pair-vs-full-barrier
>>>
>>> {}
>>>
>>> P0(int *x, int *y, atomic_t *u)
>>> {
>>>     int r0;
>>>     int r1;
>>>
>>>     WRITE_ONCE(*x, 1);
>>>     r0 = atomic_cmpxchg(u, 0, 1);
>>>     r1 = READ_ONCE(*y);
>>> }
>>>
>>> P1(int *x, int *y, atomic_t *v)
>>> {
>>>     int r0;
>>>     int r1;
>>>
>>>     WRITE_ONCE(*y, 1);
>>>     r0 = atomic_cmpxchg(v, 0, 1);
>>>     r1 = READ_ONCE(*x);
>>> }
>>>
>>> exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
>>
>> While I'm entirely willing to trust this can happen, I can't bring
>> this in line with the A extension spec.
>>
>> Additionally it's not clear to me in how far all of this applies when
>> you don't really use LR/SC in the 4- and 8-byte cases (and going
>> forward likely also not in the 1- and 2-byte case, utilizing Zabha
>> when available).
> It just explains what combinations of fences, lr/sc, amoswap, and
> .aq/.rl annotations can be used, and why the combinations introduced
> in this patch were chosen.

Except that I don't understand that explanation, in other words why
said combination of values could be observed even when using the
suffixes properly.

>>> +    uint8_t new_val_pos =
>>> +        ((unsigned long)(ptr) & (0x4 - sizeof(*ptr))) * BITS_PER_BYTE; \
>>
>> Why uint8_t?
> It is enough to cover the possible start bit positions of the value
> that should be updated, so I decided to use uint8_t.

Please take a look at the "Types" section in ./CODING_STYLE.
>>> +    { \
>>> +    case 1: \
>>> +    case 2: \
>>> +        ret__ = emulate_xchg_1_2(ptr, new__, sfx, pre, post); \
>>> +        break; \
>>> +    case 4: \
>>> +        __amoswap_generic(ptr, new__, ret__,\
>>> +                          ".w" sfx,  pre, post); \
>>> +        break; \
>>> +    case 8: \
>>> +        __amoswap_generic(ptr, new__, ret__,\
>>> +                          ".d" sfx,  pre, post); \
>>> +        break; \
>>
>> In io.h you make sure to avoid rv64-only insns. Here you don't. The
>> build would fail either way, but this still looks inconsistent.
>>
>> Also nit: Stray double blanks (twice) ahead of "pre". Plus with this
>> style of line continuation you want to consistently have exactly one
>> blank ahead of each backslash.
>>
>>> +    default: \
>>> +        STATIC_ASSERT_UNREACHABLE(); \
>>> +    } \
>>> +    ret__; \
>>> +})
>>> +
>>> +#define xchg_relaxed(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>
>> What is the purpose of this, when __xchg_generic() already does this
>> same type conversion?
>>
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
>>> +                                       "", "", ""); \
>>> +})
>>> +
>>> +#define xchg_acquire(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
>>> +                                       "", "", RISCV_ACQUIRE_BARRIER); \
>>> +})
>>> +
>>> +#define xchg_release(ptr, x) \
>>> +({ \
>>> +    __typeof__(*(ptr)) x_ = (x); \
>>> +    (__typeof__(*(ptr)))__xchg_generic(ptr, x_, sizeof(*(ptr)), \
>>> +                                       "", RISCV_RELEASE_BARRIER, ""); \
>>> +})
>>
>> As asked before: Are there going to be any uses of these three?
>> Common code doesn't require them. And not needing to provide them
>> would simplify things quite a bit, it seems.
> I checked my private branches and it looks to me that I introduced
> them only for the corresponding atomic operations ( which were copied
> from the Linux kernel ), which are not used either.
>
> So we could definitely drop these macros for now, but should
> xchg_generic() be updated as well?
> Looking at:
>     #define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \
>                                         ".aqrl", "", "")
> the last two arguments become unneeded, but I wanted to leave them in
> case someone needs to bring back xchg_{release,acquire,...}. Does that
> make any sense?

It all depends on how it's justified in the description.

>>> +#define xchg(ptr, x) __xchg_generic(ptr, (unsigned long)(x), sizeof(*(ptr)), \
>>> +                                    ".aqrl", "", "")
>>
>> According to the earlier comment (where I don't follow the example
>> given), is .aqrl sufficient here? And even if it was for the 4- and
>> 8-byte cases, is it sufficient in the 1- and 2-byte emulation case
>> (where it then is appended to just the SC)?
> If I understand your question correctly then, according to the spec,
> .aqrl is enough for amo<op>.{w|d} instructions:
>     Linux Construct      RVWMO AMO Mapping
>     atomic <op> relaxed  amo<op>.{w|d}
>     atomic <op> acquire  amo<op>.{w|d}.aq
>     atomic <op> release  amo<op>.{w|d}.rl
>     atomic <op>          amo<op>.{w|d}.aqrl
> but in the case of lr/sc you are right, sc requires a suffix too:
>     Linux Construct      RVWMO LR/SC Mapping
>     atomic <op> relaxed  loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
>     atomic <op> acquire  loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
>     atomic <op> release  loop: lr.{w|d}; <op>; sc.{w|d}.aqrl* ; bnez loop
>                          OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}* ;
>                          bnez loop
>     atomic <op>          loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop
>
> I will add sc_sfx to emulate_xchg_1_2(). The only question left is
> whether __xchg_generic(ptr, new, size, sfx, pre, post) should be
> changed to __xchg_generic(ptr, new, size, sfx1, sfx2, pre, post) to
> cover both cases, amo<op>.{w|d}.sfx1 and lr.{w|d}.sfx1 ...
> sc.{w|d}.sfx2?

I expect that's going to be necessary. In the end you'll see what's
needed when making the code adjustment.

Jan