
Re: [PATCH v4 12/30] xen/riscv: introduce cmpxchg.h


  • To: Oleksii <oleksii.kurochko@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Mon, 26 Feb 2024 15:20:14 +0100
  • Cc: Alistair Francis <alistair.francis@xxxxxxx>, Bob Eshleman <bobbyeshleman@xxxxxxxxx>, Connor Davis <connojdavis@xxxxxxxxx>, Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx
  • Delivery-date: Mon, 26 Feb 2024 14:20:24 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 26.02.2024 13:58, Oleksii wrote:
> On Mon, 2024-02-26 at 12:28 +0100, Jan Beulich wrote:
>> On 26.02.2024 12:18, Oleksii wrote:
>>> On Mon, 2024-02-26 at 10:45 +0100, Jan Beulich wrote:
>>>> On 23.02.2024 13:23, Oleksii wrote:
>>>>>>
>>>>>>>>> As the 1- and 2-byte cases are emulated, I decided not to
>>>>>>>>> provide the sfx argument for the emulation macros, as it
>>>>>>>>> would not have much effect on the emulated types and would
>>>>>>>>> just cost performance on the acquire and release versions
>>>>>>>>> of the lr/sc instructions.
>>>>>>>>
>>>>>>>> Question is whether the common case (4- and 8-byte accesses)
>>>>>>>> shouldn't be valued higher, with 1- and 2-byte emulation
>>>>>>>> being there just to allow things to not break altogether.
>>>>>>> If I understand you correctly, it would make sense to add the
>>>>>>> 'sfx' argument for the 1/2-byte access case, ensuring that
>>>>>>> all options are available for the 1/2-byte access case as
>>>>>>> well.
>>>>>>
>>>>>> That's one of the possibilities. As said, I'm not overly
>>>>>> worried about the emulated cases. For the initial
>>>>>> implementation I'd recommend going with what is easiest there,
>>>>>> yielding the best possible result for the 4- and 8-byte cases.
>>>>>> If later it turns out repeated acquire/release accesses are a
>>>>>> problem in the emulation loop, things can be changed to
>>>>>> explicit barriers, without touching the 4- and 8-byte cases.
>>>>> Then I am a little confused, if the emulated case is not an
>>>>> issue.
>>>>>
>>>>> For the 4- and 8-byte cases of xchg, .aqrl is used; for the
>>>>> relaxed and acquire versions of xchg, barriers are used.
>>>>>
>>>>> The same is done for cmpxchg.
>>>>>
>>>>> If something needs to be changed in the emulation loop, it
>>>>> won't require changing the 4- and 8-byte cases.
>>>>
>>>> I'm afraid I don't understand your reply.
>>> IIUC, the emulated cases are implemented correctly in terms of
>>> barrier usage. And it is also OK not to use sfx for the lr/sc
>>> instructions and to use only barriers.
>>>
>>> For the 4- and 8-byte cases, sfx + barrier are used depending on
>>> the specific case (relaxed, acquire, release, generic
>>> xchg/cmpxchg), which also looks correct to me. But you suggested
>>> providing the best possible result for the 4- and 8-byte cases.
>>>
>>> So I don't understand what the best possible result is, as the
>>> current usage of __{cmp}xchg_generic for each specific case
>>> (relaxed, acquire, release, generic xchg/cmpxchg) looks correct
>>> to me:
>>> xchg -> (..., ".aqrl", "", "") - just the .aqrl suffix, no
>>> barriers
>>> xchg_release -> (..., "", RISCV_RELEASE_BARRIER, "") - only the
>>> release barrier
>>> xchg_acquire -> (..., "", "", RISCV_ACQUIRE_BARRIER) - only the
>>> acquire barrier
>>> xchg_relaxed -> (..., "", "", "") - no barriers, no sfx
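
For reference, the four variants above amount to parameterizations of
a single macro along these lines. This is only a sketch of my reading
of the patch: the asm constraints and type handling here are
assumptions, not the actual patch code.

    #define __xchg_generic(ptr, new, size, sfx, pre, post)         \
    ({                                                              \
        __typeof__(*(ptr)) ret__;                                   \
        switch ( size )                                             \
        {                                                           \
        case 4:                                                     \
            asm volatile ( pre                                      \
                           " amoswap.w" sfx " %0, %2, %1\n"         \
                           post                                     \
                           : "=r" (ret__), "+A" (*(ptr))            \
                           : "r" (new)                              \
                           : "memory" );                            \
            break;                                                  \
        /* case 8: amoswap.d; 1/2-byte cases: emulation loop */     \
        }                                                           \
        ret__;                                                      \
    })

xchg() would then pass ".aqrl" as sfx with empty pre/post, while e.g.
xchg_acquire() would pass an empty sfx and RISCV_ACQUIRE_BARRIER as
post.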
>>
>> So first: While explicit barriers are technically okay, I don't
>> follow why
>> you insist on using them when you can achieve the same by suitably
>> tagging
>> the actual insn doing the exchange. Then second: It's somewhat hard
>> for me
>> to see the final effect on the emulation paths without you actually
>> having
>> done the switch. Maybe no special handling is necessary there anymore
>> then. And as said, it may actually be acceptable for the emulation
>> paths
>> to "only" be correct, but not be ideal in terms of performance. After
>> all,
>> if you use the normal 4-byte primitive in there, more (non-explicit)
>> barriers than needed would occur if the involved loop has to take
>> more
>> than one iteration. Which could (but imo doesn't need to be) avoided
>> by
>> using a more relaxed 4-byte primitive there and an explicit barrier
>> outside of the loop.
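
To make that suggestion concrete: a 1-byte exchange could use a
relaxed 4-byte primitive inside the loop, with the ordering applied
once outside of it. A sketch only - the helper name emulate_xchg_1()
and the cmpxchg_relaxed() primitive are made up for illustration, and
whether the two fences shown suffice for the fully-ordered variant is
exactly what the spec text further down is concerned with:

    static inline uint8_t emulate_xchg_1(volatile uint8_t *ptr,
                                         uint8_t new)
    {
        volatile uint32_t *aligned =
            (volatile uint32_t *)((unsigned long)ptr & ~3UL);
        unsigned int shift = ((unsigned long)ptr & 3) * 8;
        uint32_t mask = 0xffU << shift, old, val;

        asm volatile ( "fence rw, w" ::: "memory" ); /* release half */

        do {
            old = *aligned;
            val = (old & ~mask) | ((uint32_t)new << shift);
            /* relaxed CAS: no .aq/.rl suffixes, no fences inside */
        } while ( cmpxchg_relaxed(aligned, old, val) != old );

        asm volatile ( "fence r, rw" ::: "memory" ); /* acquire half */

        return (old & mask) >> shift;
    }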
> 
> According to the spec:
> Table A.5 (only part of the table is copied here)
> 
> Linux Construct       RVWMO Mapping
> atomic <op> relaxed   amo<op>.{w|d}
> atomic <op> acquire   amo<op>.{w|d}.aq
> atomic <op> release   amo<op>.{w|d}.rl
> atomic <op>           amo<op>.{w|d}.aqrl
> 
> Linux Construct       RVWMO LR/SC Mapping
> atomic <op> relaxed   loop: lr.{w|d}; <op>; sc.{w|d}; bnez loop
> atomic <op> acquire   loop: lr.{w|d}.aq; <op>; sc.{w|d}; bnez loop
> atomic <op> release   loop: lr.{w|d}; <op>; sc.{w|d}.aqrl*; bnez loop
>                       OR fence.tso; loop: lr.{w|d}; <op>; sc.{w|d}*;
>                       bnez loop
> atomic <op>           loop: lr.{w|d}.aq; <op>; sc.{w|d}.aqrl; bnez loop

When considering what to implement, you will want to limit things to
constructs we actually use. I can't find any use of the relaxed,
acquire, or release forms of atomics as mentioned above.

> The Linux mappings for release operations may seem stronger than
> necessary, but these mappings are needed to cover some cases in
> which Linux requires stronger orderings than the more intuitive
> mappings would provide. In particular, as of the time this text is
> being written, Linux is actively debating whether to require
> load-load, load-store, and store-store orderings between accesses
> in one critical section and accesses in a subsequent critical
> section in the same hart and protected by the same synchronization
> object. Not all combinations of FENCE RW,W/FENCE R,RW mappings with
> aq/rl mappings combine to provide such orderings. There are a few
> ways around this problem, including:
> 1. Always use FENCE RW,W/FENCE R,RW, and never use aq/rl. This
> suffices but is undesirable, as it defeats the purpose of the aq/rl
> modifiers.
> 2. Always use aq/rl, and never use FENCE RW,W/FENCE R,RW. This does
> not currently work due to the lack of load and store opcodes with
> aq and rl modifiers.

I don't understand this point: Which specific load and/or store forms
are missing? According to my reading of the A extension spec, all
combinations of aq/rl exist for both lr and sc.

> 3. Strengthen the mappings of release operations such that they
> would enforce sufficient orderings in the presence of either type
> of acquire mapping. This is the currently-recommended solution, and
> the one shown in Table A.5.
> 
> 
> Based on this, it is enough in our case to use only the suffixed
> instructions (amo<op>.{w|d}{.aq,.rl,.aqrl}, lr.{w|d}{.aq,.aqrl}).
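
On that reading, the fully-ordered 4-byte exchange would collapse to
a single suffixed instruction, with no separate fence instructions at
all. Again only a sketch with assumed constraints, not actual patch
code:

    static inline uint32_t xchg32(volatile uint32_t *ptr, uint32_t new)
    {
        uint32_t ret;

        /* All ordering comes from the .aqrl suffix on the AMO. */
        asm volatile ( "amoswap.w.aqrl %0, %2, %1"
                       : "=r" (ret), "+A" (*ptr)
                       : "r" (new)
                       : "memory" );

        return ret;
    }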
> 
> 
> But as far as I understand, in Linux the atomics were strengthened
> with fences:
>     Atomics present the same issue with locking: release and acquire
>     variants need to be strengthened to meet the constraints defined
>     by the Linux-kernel memory consistency model [1].
>     
>     Atomics present a further issue: implementations of atomics such
>     as atomic_cmpxchg() and atomic_add_unless() rely on LR/SC pairs,
>     which do not give full-ordering with .aqrl; for example, current
>     implementations allow the "lr-sc-aqrl-pair-vs-full-barrier" test
>     below to end up with the state indicated in the "exists" clause.
>     
>     In order to "synchronize" LKMM and RISC-V's implementation, this
>     commit strengthens the implementations of the atomics operations
>     by replacing .rl and .aq with the use of ("lightweight") fences,
>     and by replacing .aqrl LR/SC pairs in sequences such as:
>     
>       0:      lr.w.aqrl  %0, %addr
>               bne        %0, %old, 1f
>               ...
>               sc.w.aqrl  %1, %new, %addr
>               bnez       %1, 0b
>       1:
>     
>     with sequences of the form:
>     
>       0:      lr.w       %0, %addr
>               bne        %0, %old, 1f
>               ...
>               sc.w.rl    %1, %new, %addr   /* SC-release   */
>               bnez       %1, 0b
>               fence      rw, rw            /* "full" fence */
>       1:
>     
>     following Daniel's suggestion.

I'm likely missing something, yet as it looks it doesn't help that
the code fragment above appears to be ...
>     These modifications were validated with simulation of the RISC-V
>     with sequences of the form:
>     
>       0:      lr.w       %0, %addr
>               bne        %0, %old, 1f
>               ...
>               sc.w.rl    %1, %new, %addr   /* SC-release   */
>               bnez       %1, 0b
>               fence      rw, rw            /* "full" fence */
>       1:
>     
>     following Daniel's suggestion.

... entirely the same as this one. Yet there's presumably a reason
for quoting it twice?

>     These modifications were validated with simulation of the RISC-V
>     memory consistency model.
>     
>     C lr-sc-aqrl-pair-vs-full-barrier
>     
>     {}
>     
>     P0(int *x, int *y, atomic_t *u)
>     {
>             int r0;
>             int r1;
>     
>             WRITE_ONCE(*x, 1);
>             r0 = atomic_cmpxchg(u, 0, 1);
>             r1 = READ_ONCE(*y);
>     }
>     
>     P1(int *x, int *y, atomic_t *v)
>     {
>             int r0;
>             int r1;
>     
>             WRITE_ONCE(*y, 1);
>             r0 = atomic_cmpxchg(v, 0, 1);
>             r1 = READ_ONCE(*x);
>     }
>     
>     exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
>     
>     [1] https://marc.info/?l=linux-kernel&m=151930201102853&w=2
>         https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM
>         https://marc.info/?l=linux-kernel&m=151633436614259&w=2
> 
> 
> Hence the Linux kernel implementation seems safer to me, and that
> is the reason why I want/wanted to be aligned with it.

Which may end up being okay. I hope you realize though that there's a
lot more explanation needed in the respective commits than what
you've had so far. As a minimum, absolutely anything remotely
unexpected needs to be explained.

Jan



 

