[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [RFC PATCH v2 00/15] xen/arm: port Linux LL/SC and LSE atomics helpers to Xen
On 17/12/2020 15:37, Ash Wilding wrote: Hi Julien, Hi, First of all, apologies for the very late reply. Thanks for taking a look at the patches and providing feedback. I've seen your other comments and will reply to those separately when I get a chance (maybe at the weekend or over the Christmas break). RE the differences in ordering semantics between Xen's and Linux's atomics helpers, please find my notes below. Thoughts? Thank you for the very detailed answer, it made a lot easier to understand the differences! I think it would be good to have some (if not all) the content in Documents for future reference. [...] The tables below use format AAA/BBB/CCC/DDD/EEE, where: - AAA is the memory barrier before the operation - BBB is the acquire semantics of the atomic operation - CCC is the release semantics of the atomic operation - DDD is whether the asm() block clobbers memory - EEE is the memory barrier after the operation For example, ---/---/rel/mem/dmb would mean: - No memory barrier before the operation - The atomic does *not* have acquire semantics - The atomic *does* have release semantics - The asm() block clobbers memory - There is a DMB memory barrier after the atomic operation arm64 LL/SC =========== Xen Function Xen Linux Inconsistent ============ === ===== ============ atomic_add ---/---/---/---/--- ---/---/---/---/--- --- atomic_add_return ---/---/rel/mem/dmb ---/---/rel/mem/dmb --- (1) atomic_sub ---/---/---/---/--- ---/---/---/---/--- --- atomic_sub_return ---/---/rel/mem/dmb ---/---/rel/mem/dmb --- (1) atomic_and ---/---/---/---/--- ---/---/---/---/--- --- atomic_cmpxchg dmb/---/---/---/dmb ---/---/rel/mem/--- YES (2) If I am not mistaken, Linux is implementing atomic_cmpxchg() with the *_mb() version. So the semantic would be: ---/---/rel/mem/dmb atomic_xchg ---/---/rel/mem/dmb ---/acq/rel/mem/dmb YES (3) From Linux:#define __XCHG_CASE(w, sfx, name, sz, mb, nop_lse, acq, acq_lse, rel, cl) \ [...]/* LL/SC */ \ " prfm pstl1strm, %2\n" \ "1: ld" #acq "xr" #sfx "\t%" #w "0, %2\n" \ " st" #rel "xr" #sfx "\t%w1, %" #w "3, %2\n" \ " cbnz %w1, 1b\n" \ " " #mb, \ [...] __XCHG_CASE(w, , mb_, 32, dmb ish, nop, , a, l, "memory") So I think the Linux semantic would be: ---/---/rel/mem/dmb Therefore there would be no inconsistency between Xen and Linux. (1) It's actually interesting to me that Linux does it this way. As with the LSE atomics below, I'd have expected acq/rel semantics and ditch the DMB. I have done some digging. The original implementation of atomic_{sub, add}_return was actually using acq/rel semantics up until Linux 3.14. But this was reworked by 8e86f0b409a4 "arm64: atomics: fix use of acquire + release for full barrier semantics". Unless I'm missing something where there is a concern around taking an IRQ between the LDAXR and the STLXR, which can't happen in the LSE atomic case since it's a single instruction. But the exclusive monitor is cleared on exception return in AArch64 so I'm struggling to see what that potential issue may be. Regardless, Linux and Xen are consistent so we're OK ;-) The commit I pointed above contains a lot of details on the issue. For convenience, I copied the most relevant bits below: " On arm64, these operations have been incorrectly implemented as follows: // A, B, C are independent memory locations <Access [A]> // atomic_op (B) 1: ldaxr x0, [B] // Exclusive load with acquire <op(B)> stlxr w1, x0, [B] // Exclusive store with release cbnz w1, 1b <Access [C]> The assumption here being that two half barriers are equivalent to a full barrier, so the only permitted ordering would be A -> B -> C (where B is the atomic operation involving both a load and a store). Unfortunately, this is not the case by the letter of the architecture and, in fact, the accesses to A and C are permitted to pass their nearest half barrier resulting in orderings such as Bl -> A -> C -> Bs or Bl -> C -> A -> Bs (where Bl is the load-acquire on B and Bs is the store-release on B). This is a clear violation of the full barrier requirement. " (2) The Linux version uses either STLXR with rel semantics if the comparison passes, or DMB if the comparison fails. I think the DMB is only on the success path and there is no barrier on the failure path. The commit 4e39715f4b5c "arm64: cmpxchg: avoid memory barrier on comparison failure" seems to corroborate that. This is weaker than Xen's version, which is quite blunt in always wrapping the operation between two DMBs. This may be a holdover from Xen's arm32 versions being ported to arm64, as we didn't support acq/rel semantics on LDREX and STREX in Armv7-A? Regardless, The atomic helpers used in Xen were originally taken from Linux 3.14. Back then, atomic_cmpxchg() were using the two full barriers. This was introduced by the commit I pointed out in (1). This was then optimized with commit 4e39715f4b5c "arm64: cmpxchg: avoid memory barrier on comparison failure". this is quite a big discrepancy and I've not yet given it enough thought to determine whether it would actually cause an issue. My feeling is that the Linux LL/SC atomic_cmpxchg() should have have acq semantics on the LL, but like you said these helpers are well tested so I'd be surprised if there is a bug. See (5) below though, where the Linux LSE atomic_cmpxchg() *does* have acq semantics. If my understanding is correct the semantics would be (xen vs Linux): - failure: dmb/---/---/---/dmb ---/---/rel/mem/--- - success: dmb/---/---/---/dmb ---/---/rel/mem/dmbI think the success path is not going to be a problem. But we would need to check if all the callers expect a full barrier for the failure path (I would hope not). IIUC, the LSE implementation would be using a single instruction that has both the acquire and release semantics. Therefore, it would act as a full barrier.(3) The Linux version just adds acq semantics to the LL, so we're OK here. arm64 LSE (comparison to Xen's LL/SC) ===================================== Xen Function Xen Linux Inconsistent ============ === ===== ============ atomic_add ---/---/---/---/--- ---/---/---/---/--- --- atomic_add_return ---/---/rel/mem/dmb ---/acq/rel/mem/--- YES (4) atomic_sub ---/---/---/---/--- ---/---/---/---/--- --- atomic_sub_return ---/---/rel/mem/dmb ---/acq/rel/mem/--- YES (4) atomic_and ---/---/---/---/--- ---/---/---/---/--- --- atomic_cmpxchg dmb/---/---/---/dmb ---/acq/rel/mem/--- YES (5) atomic_xchg ---/---/rel/mem/dmb ---/acq/rel/mem/--- YES (4) (4) As noted in (1), this is how I would have expected Linux's LL/SC atomics to work too. I don't think this discrepancy will cause any issues. On the other hand, in the LL/SC implementation, the acquire and release semantics is happening with two different instruction. Therefore, they don't act as a full barrier. So I think the memory ordering is going to be equivalent between Xen and the Linux LSE implementation. Are they really different? In both cases, the helper will act as a full barrier. The main difference is how this ordering is achieved.(5) As with (2) above, this is quite a big discrepancy to Xen. However at least this version has acq semantics unlike the LL/SC version in (2), so I'm more confident that there won't be regressions going from Xen LL/SC to Linux LSE version of atomic_cmpxchg(). arm32 LL/SC =========== Xen Function Xen Linux Inconsistent ============ === ===== ============ atomic_add ---/---/---/---/--- ---/---/---/---/--- --- atomic_add_return dmb/---/---/---/dmb XXX/XXX/XXX/XXX/XXX YES (6) atomic_sub ---/---/---/---/--- ---/---/---/---/--- --- atomic_sub_return dmb/---/---/---/dmb XXX/XXX/XXX/XXX/XXX YES (6) atomic_and ---/---/---/---/--- ---/---/---/---/--- --- atomic_cmpxchg dmb/---/---/---/dmb XXX/XXX/XXX/XXX/XXX YES (6) atomic_xchg dmb/---/---/---/dmb XXX/XXX/XXX/XXX/XXX YES (6) (6) Linux only provides relaxed variants of these functions, such as atomic_add_return_relaxed() and atomic_xchg_relaxed(). Patches #13 and #14 in the series add the stricter versions expected by Xen, wrapping calls to Linux's relaxed variants inbetween two calls to smb_mb(). This makes them consistent with Xen's existing helpers, though is quite blunt. Linux will do the same when the fully ordered version is not implemented (see include/linux/atomic-fallback.h). It is worth noting that Armv8-A AArch32 does support acq/rel semantics on exclusive accesses, with LDAEX and STLEX, so I could imagine us introducing a new arm32 hwcap to detect whether we're on actual Armv7-A hardware or Armv8-A AArch32, then swap to lighterweight STLEX versions of these helpers rather than the heavyweight double DMB versions. Whether that would actually give measurable performance improvements is another story! That's good to know! So far, I haven't heard anyone using Xen 32-bit on Armv8. I would wait until there is a request before introducing a 3rd (4th if counting the ll/sc as 2) implementation for the atomics helpers. That said, the 32-bit port is meant to only be supported on Armv7. It should boot on Armv8, but there is no promise it will be fully functional nor stable. Overall, it looks like to me that re-syncing the atomic implementation with Linux should not be a major problem. We are in the middle of 4.15 freeze, I will try to go through the series ASAP. Cheers, -- Julien Grall
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |