[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v7 3/3] xen/events: rework fifo queue locking



On 25.11.20 09:25, Jan Beulich wrote:
On 25.11.2020 09:08, Jürgen Groß wrote:
On 25.11.20 08:42, Jan Beulich wrote:
On 25.11.2020 06:23, Jürgen Groß wrote:
On 24.11.20 17:32, Jan Beulich wrote:
On 24.11.2020 15:49, Jürgen Groß wrote:
On 24.11.20 15:02, Jan Beulich wrote:
On 24.11.2020 08:01, Juergen Gross wrote:
Two cpus entering evtchn_fifo_set_pending() for the same event channel
can race in case the first one gets interrupted after setting
EVTCHN_FIFO_PENDING and when the other one manages to set
EVTCHN_FIFO_LINKED before the first one is testing that bit. This can
lead to evtchn_check_pollers() being called before the event is put
properly into the queue, resulting eventually in the guest not seeing
the event pending and thus blocking forever afterwards.

Note that commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel
lock") made the race just more obvious, while the fifo event channel
implementation had this race from the beginning when an unmask operation
was running in parallel with an event channel send operation.

Ah yes, but then also only for inter-domain channels, as it was
only in that case that the "wrong" domain's event lock was held.
IOW there was a much earlier change already where this issue
got widened (when the per-channel locking got introduced). This
then got reduced to the original scope by XSA-343's adding of
locking to evtchn_unmask(). (Not sure how much of this history
wants actually adding here. I'm writing it down not the least to
make sure I have a complete enough picture.)

I think we both agree that this race was possible for quite some time.
And I even think one customer bug I've been looking into recently
might be exactly this problem (a dom0 was occasionally hanging in
cross-cpu function calls, but switching to 2-level events made the
problem disappear).

IPIs weren't affected earlier on (i.e. in any released version),
if my analysis above is correct.

I don't think it is correct.

An unmask operation in parallel with set_pending will have had the
same race for IPIs.

Why? When FIFO locks were introduced, the event lock got acquired
around the call to evtchn_unmask(), and IPIs got sent with that
lock similarly held. Likewise after XSA-343 evtchn_unmask() as
well as the sending of IPIs acquire the per-channel lock (which at
that point was still an ordinary spin lock).

Oh, I think we are talking about different paths.

I'm talking about EVTCHNOP_unmask. There is no lock involved when
calling evtchn_unmask().

Above I said "When FIFO locks were introduced, the event lock got
acquired around the call to evtchn_unmask()" and further "Likewise
after XSA-343 evtchn_unmask() ..." I can't see how we're talking
of different paths here. The situation has changed from back then
(lock in the callers of evtchn_unmask()) to after XSA-343 (lock in
evtchn_unmask()), but there was suitable locking. There was a
(large) window in time prior to XSA-343 where there was indeed no
locking, but that was introduced by the conversion to per-channel
locks and addressed by one of the XSA-343 changes. The issue then
got re-introduced by converting spin_lock() to read_lock().

Okay, then I misunderstood your claim. I thought you meant there was
always suitable locking before the rwlock change. So I need to modify
the Fixes: tag.


Juergen

Attachment: OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys

Attachment: OpenPGP_signature
Description: OpenPGP digital signature


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.