
Re: [PATCH v2 3/3] x86/msi: clear initial MSI-X state on boot


  • To: Jason Andryuk <jandryuk@xxxxxxxxx>
  • From: Jan Beulich <jbeulich@xxxxxxxx>
  • Date: Tue, 28 Mar 2023 17:39:41 +0200
  • Cc: Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx>, Paul Durrant <paul@xxxxxxx>, xen-devel@xxxxxxxxxxxxxxxxxxxx, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • Delivery-date: Tue, 28 Mar 2023 15:39:53 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

On 28.03.2023 17:08, Jason Andryuk wrote:
> On Tue, Mar 28, 2023 at 9:54 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>
>> On 28.03.2023 15:43, Jason Andryuk wrote:
>>> On Tue, Mar 28, 2023 at 9:35 AM Jan Beulich <jbeulich@xxxxxxxx> wrote:
>>>>
>>>> On 28.03.2023 15:32, Jason Andryuk wrote:
>>>>> On Tue, Mar 28, 2023 at 9:28 AM Roger Pau Monné <roger.pau@xxxxxxxxxx> wrote:
>>>>>> On Tue, Mar 28, 2023 at 03:23:56PM +0200, Jan Beulich wrote:
>>>>>>> On 28.03.2023 15:04, Marek Marczykowski-Górecki wrote:
>>>>>>>> On Tue, Mar 28, 2023 at 02:54:38PM +0200, Jan Beulich wrote:
>>>>>>>>> On 25.03.2023 03:49, Marek Marczykowski-Górecki wrote:
>>>>>>>>>> Some firmware/devices are found to not reset MSI-X properly, leaving
>>>>>>>>>> MASKALL set. Xen relies on initial state being both disabled.
>>>>>>>>>> Especially, pci_reset_msix_state() assumes if MASKALL is set, it was Xen
>>>>>>>>>> setting it due to msix->host_maskall or msix->guest_maskall. Clearing
>>>>>>>>>> just MASKALL might be unsafe if ENABLE is set, so clear them both.
>>>>>>>>>
>>>>>>>>> But pci_reset_msix_state() comes into play only when assigning a device
>>>>>>>>> to a DomU. If the tool stack doing a reset doesn't properly clear the
>>>>>>>>> bit, how would it be cleared the next time round (i.e. after the guest
>>>>>>>>> stopped and then possibly was started again)? It feels like the issue
>>>>>>>>> wants dealing with elsewhere, possibly in the tool stack.
>>>>>>>>
>>>>>>>> I may be misremembering some details, but AFAIR Xen intercepts
>>>>>>>> toolstack's (or more generally: accesses from dom0) attempt to clean
>>>>>>>> this up and once it enters an inconsistent state (or rather: starts with
>>>>>>>> such at the start of the day), there was no way to clean it up.
>>>>>>>
>>>>>>> Iirc Roger and you already discussed that there needs to be an
>>>>>>> indication of device reset having happened, so that Xen can resync
>>>>>>> from this "behind its back" operation. That would look to be the
>>>>>>> point/place where such inconsistencies should be eliminated.
>>>>>>
>>>>>> I think that was a different conversation with Huang Rui related to
>>>>>> the AMD GPU work, see:
>>>>>>
>>>>>> https://lore.kernel.org/xen-devel/ZBwtaceTNvCYksmR@Air-de-Roger/
>>>>>>
>>>>>> I understood the problem Marek was trying to solve was that some
>>>>>> devices were initialized with the MASKALL bit set (likely by the
>>>>>> firmware?) and that prevented Xen from using them.  But now seeing the
>>>>>> further replies on this patch I'm unsure whether that's the case.
>>>>>
>>>>> In my case, Xen's setting of MASKALL persists through a warm reboot,
>>>>
>>>> And does this get in the way of Dom0 using the device? (Before a DomU
>>>> gets to use it, things should be properly reset anyway.)
>>>
>>> Dom0 doesn't have drivers for the device, so I am not sure.  I don't
>>> seem to have the logs around, but I believe when MASKALL is set, the
>>> initial quarantine of the device fails.  Yes, some notes I have
>>> mention:
>>>
>>> It's getting -EBUSY from pdev_msix_assign() which means
>>> pci_reset_msix_state() is failing:
>>>     if ( pci_conf_read16(pdev->sbdf, msix_control_reg(pos)) &
>>>          PCI_MSIX_FLAGS_MASKALL )
>>>         return -EBUSY;
>>
>> Arguably this check may want skipping when moving to quarantine. I'd
>> still be curious to know whether the device works in Dom0, and
>> confirmation of device reset's effect on the bit would also be helpful.
> 
> echo 1 > /sys/.../reset does not clear MASKALL.

Well, I think that - as proposed elsewhere - we need the kernel to tell
us about resets, and then we could clear the bit from there. (I didn't
check whether the spec permits the bit to remain set, or whether this
is a device erratum.)
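
To make the idea concrete, here is a rough sketch only, not actual Xen
code. The struct and both function names are made up for illustration,
and the register is simulated in memory rather than read from config
space. The two flag values are the standard MSI-X Message Control bits.
It assumes the kernel has some way to notify us of a reset, which does
not exist yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard bits in the MSI-X Message Control word. */
#define PCI_MSIX_FLAGS_ENABLE  0x8000
#define PCI_MSIX_FLAGS_MASKALL 0x4000

/* Hypothetical stand-in for a device's MSI-X Message Control register. */
struct fake_dev {
    uint16_t msix_ctrl;
};

/*
 * Hypothetical hook, run when the kernel reports that a device reset
 * happened. Both bits are cleared together, because clearing MASKALL
 * alone while ENABLE is still set might be unsafe.
 */
void on_device_reset(struct fake_dev *dev)
{
    assert(dev != NULL);
    dev->msix_ctrl &= (uint16_t)~(PCI_MSIX_FLAGS_ENABLE |
                                  PCI_MSIX_FLAGS_MASKALL);
}

/* Same condition as the quoted check: a set MASKALL blocks assignment. */
int can_assign(const struct fake_dev *dev)
{
    assert(dev != NULL);
    return !(dev->msix_ctrl & PCI_MSIX_FLAGS_MASKALL);
}
```

With MASKALL left set across a reset, can_assign() fails the same way
the quoted -EBUSY path does; after on_device_reset() it succeeds and the
remaining bits of the control word are untouched.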

Jan
