
Re: [PATCH v3 5/5] evtchn: don't call Xen consumer callback with per-channel lock held



On Fri, Dec 4, 2020 at 2:22 PM Julien Grall <julien.grall.oss@xxxxxxxxx> wrote:
>
> On Fri, 4 Dec 2020 at 19:15, Tamas K Lengyel <tamas.k.lengyel@xxxxxxxxx> 
> wrote:
> >
> > On Fri, Dec 4, 2020 at 10:29 AM Julien Grall <julien@xxxxxxx> wrote:
> > >
> > >
> > >
> > > On 04/12/2020 15:21, Tamas K Lengyel wrote:
> > > > On Fri, Dec 4, 2020 at 6:29 AM Julien Grall <julien@xxxxxxx> wrote:
> > > >>
> > > >> Hi Jan,
> > > >>
> > > >> On 03/12/2020 10:09, Jan Beulich wrote:
> > > >>> On 02.12.2020 22:10, Julien Grall wrote:
> > > >>>> On 23/11/2020 13:30, Jan Beulich wrote:
> > > >>>>> While there don't look to be any problems with this right now, the lock
> > > >>>>> order implications from holding the lock can be very difficult to follow
> > > >>>>> (and may be easy to violate unknowingly). The present callbacks don't
> > > >>>>> (and no such callback should) have any need for the lock to be held.
> > > >>>>>
> > > >>>>> However, vm_event_disable() frees the structures used by respective
> > > >>>>> callbacks and isn't otherwise synchronized with invocations of these
> > > >>>>> callbacks, so maintain a count of in-progress calls, for evtchn_close()
> > > >>>>> to wait for it to drop to zero before freeing the port (and dropping
> > > >>>>> the lock).
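
To make the counting idea above concrete, here is a minimal, self-contained
sketch of it. Every name below (struct channel, consumer_calls,
notify_consumer, close_channel) is invented for illustration; this shows the
shape of the idea, not the patch itself.

#include <stdatomic.h>

/* Illustrative sketch only, not the actual Xen code. */
struct channel {
    atomic_int consumer_calls;   /* callbacks currently in flight */
    int port;
};

/* Invoke the consumer callback without holding the per-channel lock,
 * but record that a call is in progress. */
static void notify_consumer(struct channel *ch, void (*cb)(int port))
{
    atomic_fetch_add(&ch->consumer_calls, 1);
    /* ... the per-channel lock would be dropped here ... */
    cb(ch->port);
    atomic_fetch_sub(&ch->consumer_calls, 1);
}

/* Close path: wait for every in-flight callback to return before the
 * port is freed (and the lock finally dropped). */
static void close_channel(struct channel *ch)
{
    while (atomic_load(&ch->consumer_calls))
        ;   /* spin until the count drops to zero */
    /* ... now safe to free the port ... */
}

The point is that the callback itself runs without the per-channel lock,
while the close path still has a well-defined moment at which freeing the
port is safe.
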
> > > >>>>
> > > >>>> AFAICT, this callback is not the only place where the synchronization is
> > > >>>> missing in the VM event code.
> > > >>>>
> > > >>>> For instance, vm_event_put_request() can also race against
> > > >>>> vm_event_disable().
> > > >>>>
> > > >>>> So shouldn't we handle this issue properly in VM event?
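
The race being pointed at here has roughly the following shape. The names
below are invented stand-ins, not the real vm_event code, which is more
involved than this sketch.

#include <stdlib.h>

/* Illustrative only. */
struct ring { int slots[64]; };

struct vm_event_domain {
    struct ring *ring;   /* freed by the disable path */
};

/* Producer side (roughly the role of vm_event_put_request()):
 * an unsynchronized check-then-use of the ring. */
static void put_request(struct vm_event_domain *ved, int req)
{
    if (!ved->ring)
        return;
    /* the disable path may free ved->ring at exactly this point */
    ved->ring->slots[0] = req;   /* use-after-free if it did */
}

/* Teardown side (roughly the role of vm_event_disable()): frees the ring
 * without waiting for, or excluding, in-flight producers. */
static void disable_ring(struct vm_event_domain *ved)
{
    free(ved->ring);
    ved->ring = NULL;
}
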
> > > >>>
> > > >>> I suppose that's a question to the VM event folks rather than me?
> > > >>
> > > >> Yes. From my understanding of Tamas's e-mail, they are relying on the
> > > >> monitoring software to do the right thing.
> > > >>
> > > >> I will refrain from commenting on this approach. However, given the race is
> > > >> much wider than the event channel, I would recommend not adding more
> > > >> code in the event channel to deal with such a problem.
> > > >>
> > > >> Instead, this should be fixed in the VM event code when someone has time
> > > >> to harden the subsystem.
> > > >
> > > > I double-checked and the disable route is actually more robust: we
> > > > don't just rely on the toolstack doing the right thing. The domain
> > > > gets paused before any call to vm_event_disable. So I don't think
> > > > there is really a race condition here.
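
For reference, the ordering described here looks roughly like the sketch
below. The structures and functions are illustrative stubs standing in for
the real Xen paths, not actual code.

#include <stdbool.h>

/* Illustrative stand-ins only. */
struct domain {
    bool paused;
    bool vm_event_active;
};

static void domain_pause(struct domain *d)     { d->paused = true; }
static void domain_unpause(struct domain *d)   { d->paused = false; }
static void vm_event_disable(struct domain *d) { d->vm_event_active = false; }

/* Pause first, so the monitored domain cannot generate new requests
 * while the vm_event state is torn down. */
static void vm_event_teardown(struct domain *d)
{
    domain_pause(d);
    vm_event_disable(d);
    domain_unpause(d);
}
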
> > >
> > > The code will *only* pause the monitored domain. I can see two issues:
> > >     1) The toolstack is still sending events while destroy is happening.
> > > This is the race discussed here.
> > >     2) The implementation of vm_event_put_request() suggests that it can be
> > > called with a non-current domain.
> > >
> > > I don't see how just pausing the monitored domain is enough here.
> >
> > Requests only get generated by the monitored domain.
>
> If that's the case, then why is vm_event_put_request() able to
> deal with a non-current domain?
>
> I understand that you are possibly trusting whoever may call it, but this
> looks quite fragile.

I didn't write the system. You probably want to ask that question of
the original author.

Tamas