[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] PCI/MSI: Fix masking MSI/MSI-X on Xen PV



On 10/25/21 18:46, Josef Johansson wrote:
> On 10/25/21 14:27, Jason Andryuk wrote:
>> On Sun, Oct 24, 2021 at 9:26 PM Jason Andryuk <jandryuk@xxxxxxxxx> wrote:
>>> commit fcacdfbef5a1 ("PCI/MSI: Provide a new set of mask and unmask
>>> functions") introduce functions pci_msi_update_mask() and
>>> pci_msix_write_vector_ctrl() that is missing checks for
>>> pci_msi_ignore_mask that exists in commit 446a98b19fd6 ("PCI/MSI: Use
>>> new mask/unmask functions").  The checks are in place at the high level
>>> __pci_msi_mask_desc()/__pci_msi_unmask_desc(), but some functions call
>>> directly to the helpers.
>>>
>>> Push the pci_msi_ignore_mask check down to the functions that make
>>> the actual writes.  This keeps the logic local to the writes that need
>>> to be bypassed.
>>>
>>> With Xen PV, the hypervisor is responsible for masking and unmasking the
>>> interrupts, which pci_msi_ignore_mask is used to indicate.
>>>
>>> This change avoids lockups in amdgpu drivers under Xen during boot.
>>>
>>> Fixes: commit 446a98b19fd6 ("PCI/MSI: Use new mask/unmask functions")
>>> Reported-by: Josef Johansson <josef@xxxxxxxxxxx>
>>> Signed-off-by: Jason Andryuk <jandryuk@xxxxxxxxx>
>>> ---
>> I should have written that this is untested.  If this is the desired
>> approach, Josef should test that it solves his boot hangs.
>>
>> Regards,
>> Jason
> I've tested this today, both the above patch, but also my own below
> where I'm patching inside __pci_write_msi_msg,
> which is the outcome of the patch above.
I tested a lot of kernels today. To create a good baseline I compiled
without any of our patches here
and with my config flags set.

CONFIG_AMD_PMC=y
# CONFIG_HSA_AMD is not set
# CONFIG_CRYPTO_DEV_CCP is not set

The kernel stopped as before, and hung.

Test number 2 was to boot with amdgpu.msi=0.
This still resulted in a bad boot since all the xhcd drivers complained.
We can be sure that it's not amdgpu per se.

Test number 3 was with Jason's patch. It worked, but suspend/resume is
not working well.
Generally it's not behaving like other kernels do, which makes it
actually change the behavior.
Now with test 4 I tried that thought, maybe this is still a good change?
I'm deprived of a good baseline in all this, so it's very hard to
navigate between all the variables.

Test number 4 was with Jason's patch plus the amdgpu-patch below.
It worked, even suspend/resume, 2 times, but then it all crashed and
burn with quite interesting stacktraces. Are amdgpu doing it wrong here
or is it just me nitpicking?

index cc2e0c9cfe0a..f125597eb991 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -279,17 +279,8 @@ static bool amdgpu_msi_ok(struct amdgpu_device *adev)
 
 static void amdgpu_restore_msix(struct amdgpu_device *adev)
 {
-       u16 ctrl;
-
-       pci_read_config_word(adev->pdev, adev->pdev->msix_cap + PCI_MSIX_FLAGS, 
&ctrl);
-       if (!(ctrl & PCI_MSIX_FLAGS_ENABLE))
-               return;
-
-       /* VF FLR */
-       ctrl &= ~PCI_MSIX_FLAGS_ENABLE;
-       pci_write_config_word(adev->pdev, adev->pdev->msix_cap + 
PCI_MSIX_FLAGS, ctrl);
-       ctrl |= PCI_MSIX_FLAGS_ENABLE;
-       pci_write_config_word(adev->pdev, adev->pdev->msix_cap + 
PCI_MSIX_FLAGS, ctrl);
+       // checks if msix is enabled also
+       pci_restore_msi_state(adev->pdev);
 }
 
 /**

During the tests I fiddled with dpm settings, and it does have an effect
one the graphical output during suspend/resume. So maybe there's
hardware problems at play here as well.

I also looked through the code before and after Thomas' changes, and I
can't see that
this patch should make any functional difference compared to before the
MSI series of patches.
It's even such that is_virtual should be checked withing vector_ctrl. I
find Jason's patch
quite nice since it really places the checks on few places making it
easier not to slip.
Compared to my attempt that even failed because I forgot one more place
to put the checks.

With that said I would really like some more tests on this with
different chipsets, on Xen.
Any takers?

What I'm seeing is that there's no readl() in pci_msix_unmask(), it was
one in the code path before.
I'm very much unsure if there should be one there though.

We can really do a better job at the documentation for
pci_msi_ignore_mask, at least in msi.c,
maybe that should be a different patch adding some comments such that
driver folks really see
the benefits of using the built in restore functions e.g.

This became so much bigger project than I thought, thanks all for
chiming in on it.



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.