Xen project Mailing List

Re: [PATCH v7 3/3] xen/events: rework fifo queue locking

Date: Wed, 25 Nov 2020 08:42:22 +0100

Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <george.dunlap@xxxxxxxxxx>, Ian Jackson <iwj@xxxxxxxxxxxxxx>, Julien Grall <julien@xxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Wei Liu <wl@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx

Delivery-date: Wed, 25 Nov 2020 07:42:38 +0000

List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 25.11.2020 06:23, Jürgen Groß wrote: > On 24.11.20 17:32, Jan Beulich wrote: >> On 24.11.2020 15:49, Jürgen Groß wrote: >>> On 24.11.20 15:02, Jan Beulich wrote: >>>> On 24.11.2020 08:01, Juergen Gross wrote: >>>>> Two cpus entering evtchn_fifo_set_pending() for the same event channel >>>>> can race in case the first one gets interrupted after setting >>>>> EVTCHN_FIFO_PENDING and when the other one manages to set >>>>> EVTCHN_FIFO_LINKED before the first one is testing that bit. This can >>>>> lead to evtchn_check_pollers() being called before the event is put >>>>> properly into the queue, resulting eventually in the guest not seeing >>>>> the event pending and thus blocking forever afterwards. >>>>> >>>>> Note that commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel >>>>> lock") made the race just more obvious, while the fifo event channel >>>>> implementation had this race from the beginning when an unmask operation >>>>> was running in parallel with an event channel send operation. >>>> >>>> Ah yes, but then also only for inter-domain channels, as it was >>>> only in that case that the "wrong" domain's event lock was held. >>>> IOW there was a much earlier change already where this issue >>>> got widened (when the per-channel locking got introduced). This >>>> then got reduced to the original scope by XSA-343's adding of >>>> locking to evtchn_unmask(). (Not sure how much of this history >>>> wants actually adding here. I'm writing it down not the least to >>>> make sure I have a complete enough picture.) >>> >>> I think we both agree that this race was possible for quite some time. >>> And I even think one customer bug I've been looking into recently >>> might be exactly this problem (a dom0 was occasionally hanging in >>> cross-cpu function calls, but switching to 2-level events made the >>> problem disappear). >> >> IPIs weren't affected earlier on (i.e. in any released version), >> if my analysis above is correct. > > I don't think it is correct. > > An unmask operation in parallel with set_pending will have had the > same race for IPIs. Why? When FIFO locks were introduced, the event lock got acquired around the call to evtchn_unmask(), and IPIs got sent with that lock similarly held. Likewise after XSA-343 evtchn_unmask() as well as the sending of IPIs acquire the per-channel lock (which at that point was still an ordinary spin lock). >>>>> Additionally when an >>>>> event channel needs to change queues both queues need to be locked >>>>> initially. >>>> >>>> Since this was (afaict) intentionally not the case before, I >>>> think I would want to see a word spent on the "why", perhaps >>>> better in a code comment than here. Even more so that you >>>> delete a respective comment justifying the possible race as >>>> permissible. And I have to admit right now I'm still uncertain >>>> both ways, i.e. I neither have a clear understanding of why it >>>> would have been considered fine the other way around before, >>>> nor why the double locking is strictly needed. >>> >>> I need the double locking to avoid someone entering the locked region >>> when dropping the lock for the old queue and taking the one for the >>> new queue, as this would open the same race window again. >> >> Well, that's what have already said. Thing is that the code >> prior to your change gives the impression as if this race was >> benign. > > The race regarding a queue change, yes. But not the race I'm fixing with > this patch. I need to make sure that only one caller is inside the big > if clause for a specific event. And dropping the lock inside this clause > would violate that assumption. IOW the presumed wrong assumption back then was that the function would always be called with a lock already held which excludes the region to be entered twice for the same channel. But - was this a wrong assumption at the time? Thinking about this again I now actually come to the conclusion that my analysis above was wrong in the other direction: Even inter-domain channels did have consistent locking (of the other side's event lock), preventing any such race there. Which implies that imo one of the Fixes: tags wants dropping, as the race became possible only when "downgrading" some of the involved locks to rw ones. Obviously my "evtchn: convert vIRQ lock to an r/w one" then extends this race to vIRQ-s. Jan

©2013 Xen Project, A Linux Foundation Collaborative Project. All Rights Reserved.
Linux Foundation is a registered trademark of The Linux Foundation.
Xen Project is a trademark of The Linux Foundation.