[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH v2 3/3] x86/msi: clear initial MSI-X state on boot



On Tue, Mar 28, 2023 at 9:04 AM Marek Marczykowski-Górecki
<marmarek@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Mar 28, 2023 at 02:54:38PM +0200, Jan Beulich wrote:
> > On 25.03.2023 03:49, Marek Marczykowski-Górecki wrote:
> > > Some firmware/devices are found to not reset MSI-X properly, leaving
> > > MASKALL set. Xen relies on initial state being both disabled.
> > > Especially, pci_reset_msix_state() assumes if MASKALL is set, it was Xen
> > > setting it due to msix->host_maskall or msix->guest_maskall. Clearing
> > > just MASKALL might be unsafe if ENABLE is set, so clear them both.
> >
> > But pci_reset_msix_state() comes into play only when assigning a device
> > to a DomU. If the tool stack doing a reset doesn't properly clear the
> > bit, how would it be cleared the next time round (i.e. after the guest
> > stopped and then possibly was started again)? It feels like the issue
> > wants dealing with elsewhere, possibly in the tool stack.
>
> I may be misremembering some details, but AFAIR Xen intercepts
> toolstack's (or more generally: accesses from dom0) attempt to clean
> this up and once it enters an inconsistent state (or rather: starts with
> such at the start of the day), there was no way to clean it up.
>
> I have considered changing pci_reset_msix_state() to not choke on
> MASKALL being set, but I'm a bit afraid of doing it, as there it seems
> there is a lot of assumptions all over the place and I may miss some.

Hi, Marek and Jan,

Marek, thank you for working on MSI-X support.

As Jan says, the clearing here works during system boot.  However, I
have found that Xen itself is setting MASKALL in __pci_disable_msix()
when shutting down a domU.  When that is called, memory_decoded(dev)
returns false, and Xen prints "cannot disable IRQ 137: masking MSI-X
on 0000:00:14.3".  That makes the device unavailable for subsequent
domU assignment.  I haven't investigated where and why memory decoding
gets disabled for the device.

Testing was with this v2 patchset integrated into OpenXT w/ Xen 4.16.
We have some device reset changes, so I'll have to look at them again.
Hmmm, they move the libxl device reseting from pci_remove_detached()
to libxl__destroy_domid() to ensure all devices are de-assign after
the domain is destroyed.  A kernel patch implements a "more thorough
reset" which could do a slot or bus level reset, and the desire is to
have all devices deassigned before that.  Maybe the shift later is
throwing off Xen's expectations?

As Marek says, the MASKALL handling seems tricky.  Xen seems to use it
to ensure there are no interrupts, so clearing it makes one worry
about breaking something else.

Regards,
Jason



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.