[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h



On Mon, 2024-02-26 at 15:20 +0100, Jan Beulich wrote:
> On 26.02.2024 13:58, Oleksii wrote:
> > On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote:
> > > On 26.02.2024 12:18, Oleksii wrote:
> > > > On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote:
> > > > > On 23.02.2024 13:23, Oleksii wrote:
> > > > > > > 
> > > > > > > > > > As 1- and 2-byte cases are emulated I decided that
> > > > > > > > > > is
> > > > > > > > > > not
> > > > > > > > > > to
> > > > > > > > > > provide
> > > > > > > > > > sfx argument for emulation macros as it will not
> > > > > > > > > > have
> > > > > > > > > > to
> > > > > > > > > > much
> > > > > > > > > > affect on
> > > > > > > > > > emulated types and just consume more performance on
> > > > > > > > > > acquire
> > > > > > > > > > and
> > > > > > > > > > release
> > > > > > > > > > version of sc/ld instructions.
> > > > > > > > > 
> > > > > > > > > Question is whether the common case (4- and 8-byte
> > > > > > > > > accesses)
> > > > > > > > > shouldn't
> > > > > > > > > be valued higher, with 1- and 2-byte emulation being
> > > > > > > > > there
> > > > > > > > > just
> > > > > > > > > to
> > > > > > > > > allow things to not break altogether.
> > > > > > > > If I understand you correctly, it would make sense to
> > > > > > > > add
> > > > > > > > the
> > > > > > > > 'sfx'
> > > > > > > > argument for the 1/2-byte access case, ensuring that
> > > > > > > > all
> > > > > > > > options
> > > > > > > > are
> > > > > > > > available for 1/2-byte access case as well.
> > > > > > > 
> > > > > > > That's one of the possibilities. As said, I'm not overly
> > > > > > > worried
> > > > > > > about
> > > > > > > the emulated cases. For the initial implementation I'd
> > > > > > > recommend
> > > > > > > going
> > > > > > > with what is easiest there, yielding the best possible
> > > > > > > result
> > > > > > > for
> > > > > > > the
> > > > > > > 4- and 8-byte cases. If later it turns out repeated
> > > > > > > acquire/release
> > > > > > > accesses are a problem in the emulation loop, things can
> > > > > > > be
> > > > > > > changed
> > > > > > > to explicit barriers, without touching the 4- and 8-byte
> > > > > > > cases.
> > > > > > I am confused then a little bit if emulated case is not an
> > > > > > issue.
> > > > > > 
> > > > > > For 4- and 8-byte cases for xchg .aqrl is used, for relaxed
> > > > > > and
> > > > > > aqcuire
> > > > > > version of xchg barries are used.
> > > > > > 
> > > > > > The similar is done for cmpxchg.
> > > > > > 
> > > > > > If something will be needed to change in emulation loop it
> > > > > > won't
> > > > > > require to change 4- and 8-byte cases.
> > > > > 
> > > > > I'm afraid I don't understand your reply.
> > > > IIUC, emulated cases it is implemented correctly in terms of
> > > > usage
> > > > barriers. And it also OK not to use sfx for lr/sc instructions
> > > > and
> > > > use
> > > > only barriers.
> > > > 
> > > > For 4- and 8-byte cases are used sfx + barrier depending on the
> > > > specific case ( relaxed, acquire, release, generic xchg/cmpxchg
> > > > ).
> > > > What also looks to me correct. But you suggested to provide the
> > > > best
> > > > possible result for 4- and 8-byte cases. 
> > > > 
> > > > So I don't understand what the best possible result is as the
> > > > current
> > > > one usage of __{cmp}xchg_generic for each specific case  (
> > > > relaxed,
> > > > acquire, release, generic xchg/cmpxchg ) looks correct to me:
> > > > xchg -> (..., ".aqrl", "", "") just suffix .aqrl suffix without
> > > > barriers.
> > > > xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "" ) use only
> > > > release
> > > > barrier
> > > > xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER ), only
> > > > acquire
> > > > barrier
> > > > xchg_relaxed ->  (..., "", "", "") - no barries, no sfx
> > > 
> > > So first: While explicit barriers are technically okay, I don't
> > > follow why
> > > you insist on using them when you can achieve the same by
> > > suitably
> > > tagging
> > > the actual insn doing the exchange. Then second: It's somewhat
> > > hard
> > > for me
> > > to see the final effect on the emulation paths without you
> > > actually
> > > having
> > > done the switch. Maybe no special handling is necessary there
> > > anymore
> > > then. And as said, it may actually be acceptable for the
> > > emulation
> > > paths
> > > to "only" be correct, but not be ideal in terms of performance.
> > > After
> > > all,
> > > if you use the normal 4-byte primitive in there, more (non-
> > > explicit)
> > > barriers than needed would occur if the involved loop has to take
> > > more
> > > than one iteration. Which could (but imo doesn't need to be)
> > > avoided
> > > by
> > > using a more relaxed 4-byte primitive there and an explicit
> > > barrier
> > > outside of the loop.
> > 
> > According to the spec:
> > Table A.5 ( part of the table only I copied here )
> > 
> > Linux Construct          RVWMO Mapping
> > atomic <op> relaxed           amo<op>.{w|d}
> > atomic <op> acquire           amo<op>.{w|d}.aq
> > atomic <op> release           amo<op>.{w|d}.rl
> > atomic <op>                   amo<op>.{w|d}.aqrl
> > 
> > Linux Construct          RVWMO LR/SC Mapping
> > atomic <op> relaxed       loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
> > atomic <op> acquire       loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez
> > loop
> > atomic <op> release       loop: lr.{w|d}; <op>; sc.{w|d}.aqrl∗ ;
> > bnez 
> > loop OR
> >                           fence.tso; loop: lr.{w|d}; <op>;
> > sc.{w|d}∗ ;
> > bnez loop
> > atomic <op>               loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl;
> > bnez
> > loop
> 
> In your consideration what to implement you will want to limit
> things to constructs we actually use. I can't find any use of the
> relaxed, acquire, or release forms of atomics as mentioned above.
> 
> > The Linux mappings for release operations may seem stronger than
> > necessary, but these mappings
> > are needed to cover some cases in which Linux requires stronger
> > orderings than the more intuitive
> > mappings would provide. In particular, as of the time this text is
> > being written, Linux is actively
> > debating whether to require load-load, load-store, and store-store
> > orderings between accesses in one
> > critical section and accesses in a subsequent critical section in
> > the
> > same hart and protected by the
> > same synchronization object. Not all combinations of FENCE
> > RW,W/FENCE
> > R,RW mappings
> > with aq/rl mappings combine to provide such orderings. There are a
> > few
> > ways around this problem,
> > including:
> > 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This
> > suffices
> > but is undesir-
> > able, as it defeats the purpose of the aq/rl modifiers.
> > 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does
> > not
> > currently work
> > due to the lack of load and store opcodes with aq and rl modifiers.
> 
> I don't understand this point: Which specific load and/or store forms
> are missing? According to my reading of the A extension spec all
> combination of aq/rl exist with both lr and sc.
I think this is not about lr and sc instructions.
It is about l{b|h|w|d} and s{b|h|w|d}, which should be used with fence
in case of acquire and seq_cst.

> 
> > 3. Strengthen the mappings of release operations such that they
> > would
> > enforce sufficient order-
> > ings in the presence of either type of acquire mapping. This is the
> > currently-recommended
> > solution, and the one shown in Table A.5.
> > 
> > 
> > Based on this it is enough in our case use only suffixed
> > istructions
> > (amo<op>.{w|d}{.aq, .rl, .aqrl, .aqrl }, lr.{w|d}.{.aq, .aqrl }.
> > 
> > 
> > But as far as I understand in Linux atomics were strengthen with
> > fences:
> >     Atomics present the same issue with locking: release and
> > acquire
> >     variants need to be strengthened to meet the constraints
> > defined
> >     by the Linux-kernel memory consistency model [1].
> >     
> >     Atomics present a further issue: implementations of atomics
> > such
> >     as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC
> > pairs,
> >     which do not give full-ordering with .aqrl; for example,
> > current
> >     implementations allow the "lr-sc-aqrl-pair-vs-full-barrier"
> > test
> >     below to end up with the state indicated in the "exists"
> > clause.
> >     
> >     In order to "synchronize" LKMM and RISC-V's implementation,
> > this
> >     commit strengthens the implementations of the atomics
> > operations
> >     by replacing .rl and .aq with the use of ("lightweigth")
> > fences,
> >     and by replacing .aqrl LR/SC pairs in sequences such as:
> >     
> >       0:      lr.w.aqrl  %0, %addr
> >               bne        %0, %old, 1f
> >               ...
> >               sc.w.aqrl  %1, %new, %addr
> >               bnez       %1, 0b
> >       1:
> >     
> >     with sequences of the form:
> >     
> >       0:      lr.w       %0, %addr
> >               bne        %0, %old, 1f
> >               ...
> >               sc.w.rl    %1, %new, %addr   /* SC-release   */
> >               bnez       %1, 0b
> >               fence      rw, rw            /* "full" fence */
> >       1:
> >     
> >     following Daniel's suggestion.
> 
> I'm likely missing something, yet as it looks it does help that the
> code fragment above appears to be ...
> 
> >     These modifications were validated with simulation of the RISC-
> > V
> >     with sequences of the form:
> >     
> >       0:      lr.w       %0, %addr
> >               bne        %0, %old, 1f
> >               ...
> >               sc.w.rl    %1, %new, %addr   /* SC-release   */
> >               bnez       %1, 0b
> >               fence      rw, rw            /* "full" fence */
> >       1:
> >     
> >     following Daniel's suggestion.
> 
> ... entirely the same as this one. Yet there's presumably a reason
> for quoting it twice?
I think it was done by accident

~ Oleksii
> 
> >     These modifications were validated with simulation of the RISC-
> > V
> >     memory consistency model.
> >     
> >     C lr-sc-aqrl-pair-vs-full-barrier
> >     
> >     {}
> >     
> >     P0(int *x, int *y, atomic_t *u)
> >     {
> >             int r0;
> >             int r1;
> >     
> >             WRITE_ONCE(*x, 1);
> >             r0 = atomic_cmpxchg(u, 0, 1);
> >             r1 = READ_ONCE(*y);
> >     }
> >     
> >     P1(int *x, int *y, atomic_t *v)
> >     {
> >             int r0;
> >             int r1;
> >     
> >             WRITE_ONCE(*y, 1);
> >             r0 = atomic_cmpxchg(v, 0, 1);
> >             r1 = READ_ONCE(*x);
> >     }
> >     
> >     exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
> >     
> >     [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2
> >      
> > https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM
> >         https://marc.info/?l=linux-kernel&m=151633436614259&w=2
> > 
> > 
> > Thereby Linux kernel implementation seems to me more safe and it is
> > a
> > reason why I want/wanted to be aligned with it.
> 
> Which may end up being okay. I hope you realize though that there's a
> lot more explanation needed in the respective commits then compared
> to
> what you've had so far. As a minimum, absolutely anything remotely
> unexpected needs to be explained.
> 
> Jan




 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.