Xen project Mailing List

Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h

On Mon, 2024-02-26 at 15:20 +0100, Jan Beulich wrote: > On 26.02.2024 13:58, Oleksii wrote: > > On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote: > > > On 26.02.2024 12:18, Oleksii wrote: > > > > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: > > > > > On 23.02.2024 13:23, Oleksii wrote: > > > > > > > > > > > > > > > > > As 1- and 2-byte cases are emulated I decided that > > > > > > > > > > is > > > > > > > > > > not > > > > > > > > > > to > > > > > > > > > > provide > > > > > > > > > > sfx argument for emulation macros as it will not > > > > > > > > > > have > > > > > > > > > > to > > > > > > > > > > much > > > > > > > > > > affect on > > > > > > > > > > emulated types and just consume more performance on > > > > > > > > > > acquire > > > > > > > > > > and > > > > > > > > > > release > > > > > > > > > > version of sc/ld instructions. > > > > > > > > > > > > > > > > > > Question is whether the common case (4- and 8-byte > > > > > > > > > accesses) > > > > > > > > > shouldn't > > > > > > > > > be valued higher, with 1- and 2-byte emulation being > > > > > > > > > there > > > > > > > > > just > > > > > > > > > to > > > > > > > > > allow things to not break altogether. > > > > > > > > If I understand you correctly, it would make sense to > > > > > > > > add > > > > > > > > the > > > > > > > > 'sfx' > > > > > > > > argument for the 1/2-byte access case, ensuring that > > > > > > > > all > > > > > > > > options > > > > > > > > are > > > > > > > > available for 1/2-byte access case as well. > > > > > > > > > > > > > > That's one of the possibilities. As said, I'm not overly > > > > > > > worried > > > > > > > about > > > > > > > the emulated cases. For the initial implementation I'd > > > > > > > recommend > > > > > > > going > > > > > > > with what is easiest there, yielding the best possible > > > > > > > result > > > > > > > for > > > > > > > the > > > > > > > 4- and 8-byte cases. If later it turns out repeated > > > > > > > acquire/release > > > > > > > accesses are a problem in the emulation loop, things can > > > > > > > be > > > > > > > changed > > > > > > > to explicit barriers, without touching the 4- and 8-byte > > > > > > > cases. > > > > > > I am confused then a little bit if emulated case is not an > > > > > > issue. > > > > > > > > > > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed > > > > > > and > > > > > > aqcuire > > > > > > version of xchg barries are used. > > > > > > > > > > > > The similar is done for cmpxchg. > > > > > > > > > > > > If something will be needed to change in emulation loop it > > > > > > won't > > > > > > require to change 4- and 8-byte cases. > > > > > > > > > > I'm afraid I don't understand your reply. > > > > IIUC, emulated cases it is implemented correctly in terms of > > > > usage > > > > barriers. And it also OK not to use sfx for lr/sc instructions > > > > and > > > > use > > > > only barriers. > > > > > > > > For 4- and 8-byte cases are used sfx + barrier depending on the > > > > specific case ( relaxed, acquire, release, generic xchg/cmpxchg > > > > ). > > > > What also looks to me correct. But you suggested to provide the > > > > best > > > > possible result for 4- and 8-byte cases. > > > > > > > > So I don't understand what the best possible result is as the > > > > current > > > > one usage of __{cmp}xchg_generic for each specific case ( > > > > relaxed, > > > > acquire, release, generic xchg/cmpxchg ) looks correct to me: > > > > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without > > > > barriers. > > > > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only > > > > release > > > > barrier > > > > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only > > > > acquire > > > > barrier > > > > xchg_relaxed -> (..., "", "", "") - no barries, no sfx > > > > > > So first: While explicit barriers are technically okay, I don't > > > follow why > > > you insist on using them when you can achieve the same by > > > suitably > > > tagging > > > the actual insn doing the exchange. Then second: It's somewhat > > > hard > > > for me > > > to see the final effect on the emulation paths without you > > > actually > > > having > > > done the switch. Maybe no special handling is necessary there > > > anymore > > > then. And as said, it may actually be acceptable for the > > > emulation > > > paths > > > to "only" be correct, but not be ideal in terms of performance. > > > After > > > all, > > > if you use the normal 4-byte primitive in there, more (non- > > > explicit) > > > barriers than needed would occur if the involved loop has to take > > > more > > > than one iteration. Which could (but imo doesn't need to be) > > > avoided > > > by > > > using a more relaxed 4-byte primitive there and an explicit > > > barrier > > > outside of the loop. > > > > According to the spec: > > Table A.5 ( part of the table only I copied here ) > > > > Linux Construct RVWMO Mapping > > atomic <op> relaxed amo<op>.{w|d} > > atomic <op> acquire amo<op>.{w|d}.aq > > atomic <op> release amo<op>.{w|d}.rl > > atomic <op> amo<op>.{w|d}.aqrl > > > > Linux Construct RVWMO LR/SC Mapping > > atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop > > atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez > > loop > > atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; > > bnez > > loop OR > > fence.tso; loop: lr.{w|d}; <op>; > > sc.{w|d}∗ ; > > bnez loop > > atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; > > bnez > > loop > > In your consideration what to implement you will want to limit > things to constructs we actually use. I can't find any use of the > relaxed, acquire, or release forms of atomics as mentioned above. > > > The Linux mappings for release operations may seem stronger than > > necessary, but these mappings > > are needed to cover some cases in which Linux requires stronger > > orderings than the more intuitive > > mappings would provide. In particular, as of the time this text is > > being written, Linux is actively > > debating whether to require load-load, load-store, and store-store > > orderings between accesses in one > > critical section and accesses in a subsequent critical section in > > the > > same hart and protected by the > > same synchronization object. Not all combinations of FENCE > > RW,W/FENCE > > R,RW mappings > > with aq/rl mappings combine to provide such orderings. There are a > > few > > ways around this problem, > > including: > > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This > > suffices > > but is undesir- > > able, as it defeats the purpose of the aq/rl modifiers. > > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does > > not > > currently work > > due to the lack of load and store opcodes with aq and rl modifiers. > > I don't understand this point: Which specific load and/or store forms > are missing? According to my reading of the A extension spec all > combination of aq/rl exist with both lr and sc. I think this is not about lr and sc instructions. It is about l{b|h|w|d} and s{b|h|w|d}, which should be used with fence in case of acquire and seq_cst. > > > 3. Strengthen the mappings of release operations such that they > > would > > enforce sufficient order- > > ings in the presence of either type of acquire mapping. This is the > > currently-recommended > > solution, and the one shown in Table A.5. > > > > > > Based on this it is enough in our case use only suffixed > > istructions > > (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }. > > > > > > But as far as I understand in Linux atomics were strengthen with > > fences: > > Atomics present the same issue with locking: release and > > acquire > > variants need to be strengthened to meet the constraints > > defined > > by the Linux-kernel memory consistency model [1]. > > > > Atomics present a further issue: implementations of atomics > > such > > as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC > > pairs, > > which do not give full-ordering with .aqrl; for example, > > current > > implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" > > test > > below to end up with the state indicated in the "exists" > > clause. > > > > In order to "synchronize" LKMM and RISC-V's implementation, > > this > > commit strengthens the implementations of the atomics > > operations > > by replacing .rl and .aq with the use of ("lightweigth") > > fences, > > and by replacing .aqrl LR/SC pairs in sequences such as: > > > > 0: lr.w.aqrl %0, %addr > > bne %0, %old, 1f > > ... > > sc.w.aqrl %1, %new, %addr > > bnez %1, 0b > > 1: > > > > with sequences of the form: > > > > 0: lr.w %0, %addr > > bne %0, %old, 1f > > ... > > sc.w.rl %1, %new, %addr /* SC-release */ > > bnez %1, 0b > > fence rw, rw /* "full" fence */ > > 1: > > > > following Daniel's suggestion. > > I'm likely missing something, yet as it looks it does help that the > code fragment above appears to be ... > > > These modifications were validated with simulation of the RISC- > > V > > with sequences of the form: > > > > 0: lr.w %0, %addr > > bne %0, %old, 1f > > ... > > sc.w.rl %1, %new, %addr /* SC-release */ > > bnez %1, 0b > > fence rw, rw /* "full" fence */ > > 1: > > > > following Daniel's suggestion. > > ... entirely the same as this one. Yet there's presumably a reason > for quoting it twice? I think it was done by accident ~ Oleksii > > > These modifications were validated with simulation of the RISC- > > V > > memory consistency model. > > > > C lr-sc-aqrl-pair-vs-full-barrier > > > > {} > > > > P0(int *x, int *y, atomic_t *u) > > { > > int r0; > > int r1; > > > > WRITE_ONCE(*x, 1); > > r0 = atomic_cmpxchg(u, 0, 1); > > r1 = READ_ONCE(*y); > > } > > > > P1(int *x, int *y, atomic_t *v) > > { > > int r0; > > int r1; > > > > WRITE_ONCE(*y, 1); > > r0 = atomic_cmpxchg(v, 0, 1); > > r1 = READ_ONCE(*x); > > } > > > > exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0) > > > > [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2 > > > > https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM > > https://marc.info/?l=linux-kernel&m=151633436614259&w=2 > > > > > > Thereby Linux kernel implementation seems to me more safe and it is > > a > > reason why I want/wanted to be aligned with it. > > Which may end up being okay. I hope you realize though that there's a > lot more explanation needed in the respective commits then compared > to > what you've had so far. As a minimum, absolutely anything remotely > unexpected needs to be explained. > > Jan

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.