Re: [PATCH v2 1/2] xen/events: access last_priority and last_vcpu_id together
On 20.10.2020 11:25, Julien Grall wrote:
> Hi Jan,
>
> On 16/10/2020 13:09, Jan Beulich wrote:
>> On 16.10.2020 11:36, Julien Grall wrote:
>>> On 15/10/2020 13:07, Jan Beulich wrote:
>>>> On 14.10.2020 13:40, Julien Grall wrote:
>>>>> On 13/10/2020 15:26, Jan Beulich wrote:
>>>>>> On 13.10.2020 16:20, Jürgen Groß wrote:
>>>>>>> Especially Julien was rather worried by the current situation. In
>>>>>>> case you can convince him the current handling is fine, we can
>>>>>>> easily drop this patch.
>>>>>>
>>>>>> Julien, in the light of the above - can you clarify the specific
>>>>>> concerns you (still) have?
>>>>>
>>>>> Let me start with the assumption that evtchn->lock is not held when
>>>>> evtchn_fifo_set_pending() is called. If it is held, then my comment is
>>>>> moot.
>>>>
>>>> But this isn't interesting - we know there are paths where it is
>>>> held, and ones (interdomain sending) where it's the remote port's
>>>> lock instead which is held. What's important here is that a
>>>> _consistent_ lock be held (but it doesn't need to be evtchn's).
>>>
>>> Yes, a _consistent_ lock *should* be sufficient. But it is better to use
>>> the same lock everywhere so it is easier to reason about (see more below).
>>
>> But that's already not the case, due to the way interdomain channels
>> have events sent. You did suggest acquiring both locks, but as
>> indicated at the time I think this goes too far. As far as the doc
>> aspect goes - we can improve the situation. Iirc it was you who made me
>> add the respective comment ahead of struct evtchn_port_ops.
>>
>>>>> From my understanding, the goal of lock_old_queue() is to return the
>>>>> old queue used.
>>>>>
>>>>> last_priority and last_vcpu_id may be updated separately and I could not
>>>>> convince myself that it would not be possible to return a queue that is
>>>>> neither the current one nor the old one.
>>>>>
>>>>> The following could happen if evtchn->priority and
>>>>> evtchn->notify_vcpu_id keep changing between calls.
>>>>>
>>>>> pCPU0                           | pCPU1
>>>>>                                 |
>>>>> evtchn_fifo_set_pending(v0,...) |
>>>>>                                 | evtchn_fifo_set_pending(v1, ...)
>>>>> [...]                           |
>>>>> /* Queue has changed */         |
>>>>> evtchn->last_vcpu_id = v0       |
>>>>>                                 | -> lock_old_queue()
>>>>>                                 |    v = d->vcpu[evtchn->last_vcpu_id];
>>>>>                                 |    old_q = ...
>>>>>                                 |    spin_lock(old_q->...)
>>>>>                                 |    v = ...
>>>>>                                 |    q = ...
>>>>>                                 |    /* q and old_q would be the same */
>>>>>                                 |
>>>>> evtchn->last_priority = priority|
>>>>>
>>>>> If my diagram is correct, then pCPU1 would return a queue that is
>>>>> neither the current nor the old one.
>>>>
>>>> I think I agree.
>>>>
>>>>> In which case, I think it would at least be possible to corrupt the
>>>>> queue. From evtchn_fifo_set_pending():
>>>>>
>>>>>      /*
>>>>>       * If this event was a tail, the old queue is now empty and
>>>>>       * its tail must be invalidated to prevent adding an event to
>>>>>       * the old queue from corrupting the new queue.
>>>>>       */
>>>>>      if ( old_q->tail == port )
>>>>>          old_q->tail = 0;
>>>>>
>>>>> Did I miss anything?
>>>>
>>>> I don't think you did. The important point though is that a consistent
>>>> lock is being held whenever we come here, so two racing set_pending()
>>>> aren't possible for one and the same evtchn. As a result I don't think
>>>> the patch here is actually needed.
>>>
>>> I haven't yet read the rest of the patches in full detail to say
>>> whether this is necessary or not. However, at a first glance, I think
>>> it is not sane to rely on different locks to protect us. And don't
>>> get me started on the lack of documentation...
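
The separate updates that Julien's diagram boils down to look roughly like the
sketch below. This is a simplified paraphrase of the queue-change path in
evtchn_fifo_set_pending(), reconstructed only from what is quoted in this
thread; the exact code in xen/common/event_fifo.c may differ in names and
surrounding logic.

    /* Sketch only - paraphrased from the discussion above, not verbatim. */
    if ( old_q != q )
    {
        /*
         * Two independent stores: a reader running lock_old_queue() on
         * another pCPU between them can pair the new last_vcpu_id with the
         * stale last_priority (or vice versa), which is exactly the window
         * shown in the diagram.
         */
        evtchn->last_vcpu_id = v->vcpu_id;
        evtchn->last_priority = q->priority;

        spin_unlock_irqrestore(&old_q->lock, flags);
        spin_lock_irqsave(&q->lock, flags);
    }
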
>>>
>>> Furthermore, the implementation of lock_old_queue() suggests that the
>>> code was planned to be lockless. Why would you need the loop otherwise?
>>
>> The lock-less aspect of this affects multiple accesses to e.g.
>> the same queue, I think.
>
> I don't think we are talking about the same thing. What I was referring
> to is the following code:
>
> static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d,
>                                                 struct evtchn *evtchn,
>                                                 unsigned long *flags)
> {
>     struct vcpu *v;
>     struct evtchn_fifo_queue *q, *old_q;
>     unsigned int try;
>
>     for ( try = 0; try < 3; try++ )
>     {
>         v = d->vcpu[evtchn->last_vcpu_id];
>         old_q = &v->evtchn_fifo->queue[evtchn->last_priority];
>
>         spin_lock_irqsave(&old_q->lock, *flags);
>
>         v = d->vcpu[evtchn->last_vcpu_id];
>         q = &v->evtchn_fifo->queue[evtchn->last_priority];
>
>         if ( old_q == q )
>             return old_q;
>
>         spin_unlock_irqrestore(&old_q->lock, *flags);
>     }
>
>     gprintk(XENLOG_WARNING,
>             "dom%d port %d lost event (too many queue changes)\n",
>             d->domain_id, evtchn->port);
>     return NULL;
> }
>
> Given that evtchn->last_vcpu_id and evtchn->last_priority can only be
> modified in evtchn_fifo_set_pending(), this suggests that it is expected
> for the function to be called multiple times concurrently on the same
> event channel.
>
>> I'm unconvinced it was really considered
>> whether racing sending on the same channel is also safe this way.
>
> How would you explain the 3 tries in lock_old_queue() then?

Queue changes (as said by the gprintk()) can't result from sending alone,
but require re-binding to a different vCPU or altering the priority. I'm
simply unconvinced that the code indeed fully reflects the original
intentions. IOW I'm unsure whether we talk about the same thing ...

Jan
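
For context, the subject line ("access last_priority and last_vcpu_id
together") suggests closing the window above by making the pair observable
only as a whole, independently of which lock the sender happens to hold. A
minimal sketch of one way to do that follows; the union name, field layout,
the fifo_lastq member, and the use of ACCESS_ONCE() are illustrative
assumptions and need not match what the v2 patch actually does.

    /* Illustrative only - not necessarily the shape of the v2 patch. */
    union evtchn_fifo_lastq {
        uint32_t raw;
        struct {
            uint8_t  last_priority;
            uint16_t last_vcpu_id;
        };
    };

    /* Writer side (queue change in evtchn_fifo_set_pending()): */
    union evtchn_fifo_lastq lastq = {
        .last_priority = q->priority,
        .last_vcpu_id  = v->vcpu_id,
    };
    ACCESS_ONCE(evtchn->fifo_lastq) = lastq.raw;   /* single 32-bit store */

    /* Reader side (lock_old_queue()): */
    union evtchn_fifo_lastq lastq = {
        .raw = ACCESS_ONCE(evtchn->fifo_lastq),    /* single 32-bit load */
    };
    v = d->vcpu[lastq.last_vcpu_id];
    old_q = &v->evtchn_fifo->queue[lastq.last_priority];

With such a scheme lock_old_queue() can never observe a
last_vcpu_id/last_priority pair that was not written together, which is the
property being argued about in this thread.
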