[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h
On 26.02.2024 13:58, Oleksii wrote: > On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote: >> On 26.02.2024 12:18, Oleksii wrote: >>> On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote: >>>> On 23.02.2024 13:23, Oleksii wrote: >>>>>> >>>>>>>>> As 1- and 2-byte cases are emulated I decided that is >>>>>>>>> not >>>>>>>>> to >>>>>>>>> provide >>>>>>>>> sfx argument for emulation macros as it will not have >>>>>>>>> to >>>>>>>>> much >>>>>>>>> affect on >>>>>>>>> emulated types and just consume more performance on >>>>>>>>> acquire >>>>>>>>> and >>>>>>>>> release >>>>>>>>> version of sc/ld instructions. >>>>>>>> >>>>>>>> Question is whether the common case (4- and 8-byte >>>>>>>> accesses) >>>>>>>> shouldn't >>>>>>>> be valued higher, with 1- and 2-byte emulation being >>>>>>>> there >>>>>>>> just >>>>>>>> to >>>>>>>> allow things to not break altogether. >>>>>>> If I understand you correctly, it would make sense to add >>>>>>> the >>>>>>> 'sfx' >>>>>>> argument for the 1/2-byte access case, ensuring that all >>>>>>> options >>>>>>> are >>>>>>> available for 1/2-byte access case as well. >>>>>> >>>>>> That's one of the possibilities. As said, I'm not overly >>>>>> worried >>>>>> about >>>>>> the emulated cases. For the initial implementation I'd >>>>>> recommend >>>>>> going >>>>>> with what is easiest there, yielding the best possible result >>>>>> for >>>>>> the >>>>>> 4- and 8-byte cases. If later it turns out repeated >>>>>> acquire/release >>>>>> accesses are a problem in the emulation loop, things can be >>>>>> changed >>>>>> to explicit barriers, without touching the 4- and 8-byte >>>>>> cases. >>>>> I am confused then a little bit if emulated case is not an >>>>> issue. >>>>> >>>>> For 4- and 8-byte cases for xchg .aqrl is used, for relaxed and >>>>> aqcuire >>>>> version of xchg barries are used. >>>>> >>>>> The similar is done for cmpxchg. >>>>> >>>>> If something will be needed to change in emulation loop it >>>>> won't >>>>> require to change 4- and 8-byte cases. >>>> >>>> I'm afraid I don't understand your reply. >>> IIUC, emulated cases it is implemented correctly in terms of usage >>> barriers. And it also OK not to use sfx for lr/sc instructions and >>> use >>> only barriers. >>> >>> For 4- and 8-byte cases are used sfx + barrier depending on the >>> specific case ( relaxed, acquire, release, generic xchg/cmpxchg ). >>> What also looks to me correct. But you suggested to provide the >>> best >>> possible result for 4- and 8-byte cases. >>> >>> So I don't understand what the best possible result is as the >>> current >>> one usage of __{cmp}xchg_generic for each specific case ( relaxed, >>> acquire, release, generic xchg/cmpxchg ) looks correct to me: >>> xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without >>> barriers. >>> xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only >>> release >>> barrier >>> xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only acquire >>> barrier >>> xchg_relaxed -> (..., "", "", "") - no barries, no sfx >> >> So first: While explicit barriers are technically okay, I don't >> follow why >> you insist on using them when you can achieve the same by suitably >> tagging >> the actual insn doing the exchange. Then second: It's somewhat hard >> for me >> to see the final effect on the emulation paths without you actually >> having >> done the switch. Maybe no special handling is necessary there anymore >> then. And as said, it may actually be acceptable for the emulation >> paths >> to "only" be correct, but not be ideal in terms of performance. After >> all, >> if you use the normal 4-byte primitive in there, more (non-explicit) >> barriers than needed would occur if the involved loop has to take >> more >> than one iteration. Which could (but imo doesn't need to be) avoided >> by >> using a more relaxed 4-byte primitive there and an explicit barrier >> outside of the loop. > > According to the spec: > Table A.5 ( part of the table only I copied here ) > > Linux Construct RVWMO Mapping > atomic <op> relaxed amo<op>.{w|d} > atomic <op> acquire amo<op>.{w|d}.aq > atomic <op> release amo<op>.{w|d}.rl > atomic <op> amo<op>.{w|d}.aqrl > > Linux Construct RVWMO LR/SC Mapping > atomic <op> relaxed loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop > atomic <op> acquire loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop > atomic <op> release loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ; bnez > loop OR > fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}∗ ; > bnez loop > atomic <op> loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez > loop In your consideration what to implement you will want to limit things to constructs we actually use. I can't find any use of the relaxed, acquire, or release forms of atomics as mentioned above. > The Linux mappings for release operations may seem stronger than > necessary, but these mappings > are needed to cover some cases in which Linux requires stronger > orderings than the more intuitive > mappings would provide. In particular, as of the time this text is > being written, Linux is actively > debating whether to require load-load, load-store, and store-store > orderings between accesses in one > critical section and accesses in a subsequent critical section in the > same hart and protected by the > same synchronization object. Not all combinations of FENCE RW,W/FENCE > R,RW mappings > with aq/rl mappings combine to provide such orderings. There are a few > ways around this problem, > including: > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This suffices > but is undesir- > able, as it defeats the purpose of the aq/rl modifiers. > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does not > currently work > due to the lack of load and store opcodes with aq and rl modifiers. I don't understand this point: Which specific load and/or store forms are missing? According to my reading of the A extension spec all combination of aq/rl exist with both lr and sc. > 3. Strengthen the mappings of release operations such that they would > enforce sufficient order- > ings in the presence of either type of acquire mapping. This is the > currently-recommended > solution, and the one shown in Table A.5. > > > Based on this it is enough in our case use only suffixed istructions > (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }. > > > But as far as I understand in Linux atomics were strengthen with > fences: > Atomics present the same issue with locking: release and acquire > variants need to be strengthened to meet the constraints defined > by the Linux-kernel memory consistency model [1]. > > Atomics present a further issue: implementations of atomics such > as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs, > which do not give full-ordering with .aqrl; for example, current > implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test > below to end up with the state indicated in the "exists" clause. > > In order to "synchronize" LKMM and RISC-V's implementation, this > commit strengthens the implementations of the atomics operations > by replacing .rl and .aq with the use of ("lightweigth") fences, > and by replacing .aqrl LR/SC pairs in sequences such as: > > 0: lr.w.aqrl %0, %addr > bne %0, %old, 1f > ... > sc.w.aqrl %1, %new, %addr > bnez %1, 0b > 1: > > with sequences of the form: > > 0: lr.w %0, %addr > bne %0, %old, 1f > ... > sc.w.rl %1, %new, %addr /* SC-release */ > bnez %1, 0b > fence rw, rw /* "full" fence */ > 1: > > following Daniel's suggestion. I'm likely missing something, yet as it looks it does help that the code fragment above appears to be ... > These modifications were validated with simulation of the RISC-V > with sequences of the form: > > 0: lr.w %0, %addr > bne %0, %old, 1f > ... > sc.w.rl %1, %new, %addr /* SC-release */ > bnez %1, 0b > fence rw, rw /* "full" fence */ > 1: > > following Daniel's suggestion. ... entirely the same as this one. Yet there's presumably a reason for quoting it twice? > These modifications were validated with simulation of the RISC-V > memory consistency model. > > C lr-sc-aqrl-pair-vs-full-barrier > > {} > > P0(int *x, int *y, atomic_t *u) > { > int r0; > int r1; > > WRITE_ONCE(*x, 1); > r0 = atomic_cmpxchg(u, 0, 1); > r1 = READ_ONCE(*y); > } > > P1(int *x, int *y, atomic_t *v) > { > int r0; > int r1; > > WRITE_ONCE(*y, 1); > r0 = atomic_cmpxchg(v, 0, 1); > r1 = READ_ONCE(*x); > } > > exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0) > > [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2 > > https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM > https://marc.info/?l=linux-kernel&m=151633436614259&w=2 > > > Thereby Linux kernel implementation seems to me more safe and it is a > reason why I want/wanted to be aligned with it. Which may end up being okay. I hope you realize though that there's a lot more explanation needed in the respective commits then compared to what you've had so far. As a minimum, absolutely anything remotely unexpected needs to be explained. Jan
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |