Xen project Mailing List

Re: [PATCH v7 3/3] xen/events: rework fifo queue locking

Date: Wed, 25 Nov 2020 09:25:00 +0100

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Wed, 25 Nov 2020 08:25:22 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 25.11.2020 09:08, Jürgen Groß wrote: > On 25.11.20 08:42, Jan Beulich wrote: >> On 25.11.2020 06:23, Jürgen Groß wrote: >>> On 24.11.20 17:32, Jan Beulich wrote: >>>> On 24.11.2020 15:49, Jürgen Groß wrote: >>>>> On 24.11.20 15:02, Jan Beulich wrote: >>>>>> On 24.11.2020 08:01, Juergen Gross wrote: >>>>>>> Two cpus entering evtchn_fifo_set_pending() for the same event channel >>>>>>> can race in case the first one gets interrupted after setting >>>>>>> EVTCHN_FIFO_PENDING and when the other one manages to set >>>>>>> EVTCHN_FIFO_LINKED before the first one is testing that bit. This can >>>>>>> lead to evtchn_check_pollers() being called before the event is put >>>>>>> properly into the queue, resulting eventually in the guest not seeing >>>>>>> the event pending and thus blocking forever afterwards. >>>>>>> >>>>>>> Note that commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel >>>>>>> lock") made the race just more obvious, while the fifo event channel >>>>>>> implementation had this race from the beginning when an unmask operation >>>>>>> was running in parallel with an event channel send operation. >>>>>> >>>>>> Ah yes, but then also only for inter-domain channels, as it was >>>>>> only in that case that the "wrong" domain's event lock was held. >>>>>> IOW there was a much earlier change already where this issue >>>>>> got widened (when the per-channel locking got introduced). This >>>>>> then got reduced to the original scope by XSA-343's adding of >>>>>> locking to evtchn_unmask(). (Not sure how much of this history >>>>>> wants actually adding here. I'm writing it down not the least to >>>>>> make sure I have a complete enough picture.) >>>>> >>>>> I think we both agree that this race was possible for quite some time. >>>>> And I even think one customer bug I've been looking into recently >>>>> might be exactly this problem (a dom0 was occasionally hanging in >>>>> cross-cpu function calls, but switching to 2-level events made the >>>>> problem disappear). >>>> >>>> IPIs weren't affected earlier on (i.e. in any released version), >>>> if my analysis above is correct. >>> >>> I don't think it is correct. >>> >>> An unmask operation in parallel with set_pending will have had the >>> same race for IPIs. >> >> Why? When FIFO locks were introduced, the event lock got acquired >> around the call to evtchn_unmask(), and IPIs got sent with that >> lock similarly held. Likewise after XSA-343 evtchn_unmask() as >> well as the sending of IPIs acquire the per-channel lock (which at >> that point was still an ordinary spin lock). > > Oh, I think we are talking about different paths. > > I'm talking about EVTCHNOP_unmask. There is no lock involved when > calling evtchn_unmask(). Above I said "When FIFO locks were introduced, the event lock got acquired around the call to evtchn_unmask()" and further "Likewise after XSA-343 evtchn_unmask() ..." I can't see how we're talking of different paths here. The situation has changed from back then (lock in the callers of evtchn_unmask()) to after XSA-343 (lock in evtchn_unmask()), but there was suitable locking. There was a (large) window in time prior to XSA-343 where there was indeed no locking, but that was introduced by the conversion to per-channel locks and addressed by one of the XSA-343 changes. The issue then got re-introduced by converting spin_lock() to read_lock(). Jan

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.