[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2] libxl: Fix stubdom PCI passthrough

Jason Andryuk writes ("[PATCH v2] libxl: Fix stubdom PCI passthrough"):
> commit 0fdb48ffe7a1 "libxl: Make sure devices added by pci-attach are
> reflected in the config" broken stubdom PCI passthrough when it moved
> libxl__create_pci_backend later in the function.  xl pci-attach for a
> running PV domain may also have been broken, but that was not verified.
> The stubdomain is running (!starting) and PV, so it calls
> libxl__wait_for_backend.  With the new placement of
> libxl__create_pci_backend, the path does not exist and the call
> immediately fails.
> The wait is only relevant when the backend is already present.  num_devs
> is already used to determine if the backend needs to be created.  Re-use
> num_devs to determine if the backend wait is necessary.  The wait is
> necessary to avoid racing with another PCI attachment reconfiguring the
> front/back. If we are creating the backend, then we don't have to worry
> about a racing reconfigure.

Thanks for working on this.  Sorry it's taken a while for me to look
at this properly.  It seems very complicated.  I confess I am
confused.  I wonder if I actually understand properly how the code in
tree is supposed to work.  So if I seem not to be making sense, please
explain :-).

AFAICT what you are saying is that:

  In 0fdb48ffe7a1, pci-attach was moved later in the setup, until a time
  after the stubdomain is running.  So that code now always runs with
  !starting, when previously it would run with !!starting.

  libxl__wait_for_backend fails if the backend path does not exist;
  previously, when the domain is being created, the wait would be
  skipped.  Now because !starting, the wait is done, and fails because
  the backend path is missing.

  The purpose of the wait is to make sure the frontend is ready to
  accept the instructions, mostly to prevent multiple pci attach
  happening simultaneously.

You are using num_devs to see whether the backend exists already, so
as to skip the failing check.  I don't think that is right.  But I'm
not sure the old code is right either.

If you are right about the reason for the wait, it does not seem
correctly placed.  There is surely a TOCTOU race: first, we do the
wait, and then we write, non-transactionally, to xenstore.  If two of
these processes run at once, they could both decide not to wait,
then both try to create the backend and trample on each other.

This kind of thing is usually supposed to be dealt with by a
combination of the userdata lock (for the config) and xenstore
transaction but the code here doesn't seem to do that correctly.

Shouldn't all of this looking at xenstore occur within the transaction
loop ?  What this code seems to do is read some things
nontransactionally, then enter a transaction, and then make some
writes based on a combination of the pre-transaction and
within-transaction data.  In particular `num_devs` is read
nontransactionally and then written within the transaction, without, I
think being checked for subsequent modification.

Also, I think the purpose of `starting` is to avoid waiting for the
backend to be stable before the frontend is actaully unpaused, in
which case presumably the backend would never be Connected and we
would deadlock (and eventually time out).  In general when we are
hot-adding we must wait for the frontend and backend to be stable
before saying we're done, whereas with cold-adding we set up the
information and simply hope (expect) it all to sort itself out while
the domain boots.  So, I would be expecting to see the wait to happen
*after* the add.

There is also the wrinkle that the non-pci code is differently
structured, because it must not use libxl__wait_for_backend at all.
libxl__wait_for_backend is synchronous and blocking the whole libxl
process for an extended time is not allowed.  But AIUI we have made an
exception for pci because the pci backend is always in dom0 and

I looked through the git history.

Very relevant seems
   libxl: Multi-device passthrough coldplug: do not wait for unstarted guest
which has some explanation from me in the commit message.  I'm not
sure why I didn't spot the anomalous transaction use problem.

Also I found
   libxl: fix cold plugged PCI device with stubdomain
and, would you believe it
   libxl: fix cold plugged PCI devices with stubdomains
which seems at least tangentially relevant, showing that this seems
persistently to break :-(.  This suggests quite strongly that we
should add some tests for pci passthrough including at least one for
stubdom coldplug.

   libxl: add all pci devices to xenstore at once (during VM create)
which seems OK.

There has been a lot of refactoring, but much of it hasn't really
changed the structure of this function.

The issue I identify above, with the inconsistent use of transactions,
seems to have existed since the very beginning.  In
   libxenlight: implement pci passthrough
the `retry_transaction:` label seems to me to be in the wrong place.

I have CC'd Paul and Oleksandr (committer/reviewer of 0fdb48ffe7a1),
Marek (seems to have touched a lot of the stubdom code here) and
Stefano (original author of the pci passthrough code here)
in case they would like to comment...




Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.