[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [RFC KERNEL PATCH v4 3/3] PCI/sysfs: Add gsi sysfs for pci_dev


  • To: Bjorn Helgaas <helgaas@xxxxxxxxxx>, Roger Pau Monné <roger.pau@xxxxxxxxxx>
  • From: "Chen, Jiqian" <Jiqian.Chen@xxxxxxx>
  • Date: Fri, 1 Mar 2024 07:57:05 +0000
  • Accept-language: en-US
  • Arc-authentication-results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=amd.com; dmarc=pass action=none header.from=amd.com; dkim=pass header.d=amd.com; arc=none
  • Arc-message-signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=M7fIs2L+73nk4IzuloKkcWwVd5SRX1Wtu54FX+9pUBo=; b=NMECgJ1F87IF2EsSWLXBhkDNpFV9qu9rvuOlQjcENFC8lHPUfPfxkxgtdeaeOU9owVwqFfjHSw38IqyafDmCdrch5wtsnhjchN0RW6Mr29tuZ99UlI3ACaTKi6cfsVsf/vrg6iuCAn8fjk9UWmC2jz/AKVMXsCqVFEKbQ2ppcFXwUZ2CRnlGSVly92bqjRo9xxBE2Y2E0+YOJheJnR4AIbmzQEaDWOODvvILttZTODTSsjBPPTi+EX+Hcd4XelFLzsZzRdj9P4JtNEsEbcWFt27hN/034ZLB4P8aDD0y6fsBN2emZLSkuo9Oca8c5tnqKTgQi0XRgCmpv9x2zm1JxQ==
  • Arc-seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; b=aa99U4KW51WlpvtYsWbhu85L79ykG6MMQsX5gdPQfqezShx8VvqmDzvD7zlv4O/9/iHF6dROj+3X8y33xrDiRJOzLG2R9VgPDUVc6XtUPJq4DSLQE+UggThpWo70ngxEqoqZuHO25Aj6aX7ezpsrXqp3JOOgFakp5jYEhlRhgBW7ad2Syo0lqZzo6cMyeJUAD7w3mMk5ZX3K5Wn0un/XLuIf6fYddWxU3QQbrjlfXPZhds8wHxk2hnWFqL/0U6tX+EKYguaiEgkyFkVIhCkrEMKdRNa0Q0zPCkk+5HJyMsVWz9YSFjL8RbD/6Zi3vftSZOWVzMqDnG5YiGIyN3vwqw==
  • Authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=amd.com;
  • Cc: "Chen, Jiqian" <Jiqian.Chen@xxxxxxx>, "Rafael J . Wysocki" <rafael@xxxxxxxxxx>, Len Brown <lenb@xxxxxxxxxx>, Juergen Gross <jgross@xxxxxxxx>, Stefano Stabellini <sstabellini@xxxxxxxxxx>, Oleksandr Tyshchenko <oleksandr_tyshchenko@xxxxxxxx>, Boris Ostrovsky <boris.ostrovsky@xxxxxxxxxx>, Bjorn Helgaas <bhelgaas@xxxxxxxxxx>, "xen-devel@xxxxxxxxxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxxxxxxxxx>, "linux-pci@xxxxxxxxxxxxxxx" <linux-pci@xxxxxxxxxxxxxxx>, "linux-kernel@xxxxxxxxxxxxxxx" <linux-kernel@xxxxxxxxxxxxxxx>, "linux-acpi@xxxxxxxxxxxxxxx" <linux-acpi@xxxxxxxxxxxxxxx>, "Hildebrand, Stewart" <Stewart.Hildebrand@xxxxxxx>, "Huang, Ray" <Ray.Huang@xxxxxxx>, "Ragiadakou, Xenia" <Xenia.Ragiadakou@xxxxxxx>
  • Delivery-date: Fri, 01 Mar 2024 07:57:15 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xenproject.org>
  • Thread-index: AQHaP5+doYjSbLzTtkiUcLxinuNXV7DmmK+AgAEHcQCAAAvdAIABnLmAgAg1Y4CAALowAIAAwpaAgADNJ4CAAKgsAIAA5P2AgA1jFYCAA+/3AIAAqS0AgAQDqwCAF+s/gA==
  • Thread-topic: [RFC KERNEL PATCH v4 3/3] PCI/sysfs: Add gsi sysfs for pci_dev

Hi Bjorn,
Looking forward to getting your more inputs and suggestions.
It seems /sys/bus/acpi/devices/PNP0A03:00/ is not a good place to create gsi 
sysfs.

On 2024/2/15 16:37, Roger Pau Monné wrote:
> On Mon, Feb 12, 2024 at 01:18:58PM -0600, Bjorn Helgaas wrote:
>> On Mon, Feb 12, 2024 at 10:13:28AM +0100, Roger Pau Monné wrote:
>>> On Fri, Feb 09, 2024 at 03:05:49PM -0600, Bjorn Helgaas wrote:
>>>> On Thu, Feb 01, 2024 at 09:39:49AM +0100, Roger Pau Monné wrote:
>>>>> On Wed, Jan 31, 2024 at 01:00:14PM -0600, Bjorn Helgaas wrote:
>>>>>> On Wed, Jan 31, 2024 at 09:58:19AM +0100, Roger Pau Monné wrote:
>>>>>>> On Tue, Jan 30, 2024 at 02:44:03PM -0600, Bjorn Helgaas wrote:
>>>>>>>> On Tue, Jan 30, 2024 at 10:07:36AM +0100, Roger Pau Monné wrote:
>>>>>>>>> On Mon, Jan 29, 2024 at 04:01:13PM -0600, Bjorn Helgaas wrote:
>>>>>>>>>> On Thu, Jan 25, 2024 at 07:17:24AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>> On 2024/1/24 00:02, Bjorn Helgaas wrote:
>>>>>>>>>>>> On Tue, Jan 23, 2024 at 10:13:52AM +0000, Chen, Jiqian wrote:
>>>>>>>>>>>>> On 2024/1/23 07:37, Bjorn Helgaas wrote:
>>>>>>>>>>>>>> On Fri, Jan 05, 2024 at 02:22:17PM +0800, Jiqian Chen wrote:
>>>>>>>>>>>>>>> There is a need for some scenarios to use gsi sysfs.
>>>>>>>>>>>>>>> For example, when xen passthrough a device to dumU, it will
>>>>>>>>>>>>>>> use gsi to map pirq, but currently userspace can't get gsi
>>>>>>>>>>>>>>> number.
>>>>>>>>>>>>>>> So, add gsi sysfs for that and for other potential scenarios.
>>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't know enough about Xen to know why it needs the GSI in
>>>>>>>>>>>>>> userspace.  Is this passthrough brand new functionality that 
>>>>>>>>>>>>>> can't be
>>>>>>>>>>>>>> done today because we don't expose the GSI yet?
>>>>>>>>>>
>>>>>>>>>> I assume this must be new functionality, i.e., this kind of
>>>>>>>>>> passthrough does not work today, right?
>>>>>>>>>>
>>>>>>>>>>>>> has ACPI support and is responsible for detecting and controlling
>>>>>>>>>>>>> the hardware, also it performs privileged operations such as the
>>>>>>>>>>>>> creation of normal (unprivileged) domains DomUs. When we give to a
>>>>>>>>>>>>> DomU direct access to a device, we need also to route the physical
>>>>>>>>>>>>> interrupts to the DomU. In order to do so Xen needs to setup and 
>>>>>>>>>>>>> map
>>>>>>>>>>>>> the interrupts appropriately.
>>>>>>>>>>>>
>>>>>>>>>>>> What kernel interfaces are used for this setup and mapping?
>>>>>>>>>>>
>>>>>>>>>>> For passthrough devices, the setup and mapping of routing physical
>>>>>>>>>>> interrupts to DomU are done on Xen hypervisor side, hypervisor only
>>>>>>>>>>> need userspace to provide the GSI info, see Xen code:
>>>>>>>>>>> xc_physdev_map_pirq require GSI and then will call hypercall to pass
>>>>>>>>>>> GSI into hypervisor and then hypervisor will do the mapping and
>>>>>>>>>>> routing, kernel doesn't do the setup and mapping.
>>>>>>>>>>
>>>>>>>>>> So we have to expose the GSI to userspace not because userspace 
>>>>>>>>>> itself
>>>>>>>>>> uses it, but so userspace can turn around and pass it back into the
>>>>>>>>>> kernel?
>>>>>>>>>
>>>>>>>>> No, the point is to pass it back to Xen, which doesn't know the
>>>>>>>>> mapping between GSIs and PCI devices because it can't execute the ACPI
>>>>>>>>> AML resource methods that provide such information.
>>>>>>>>>
>>>>>>>>> The (Linux) kernel is just a proxy that forwards the hypercalls from
>>>>>>>>> user-space tools into Xen.
>>>>>>>>
>>>>>>>> But I guess Xen knows how to interpret a GSI even though it doesn't
>>>>>>>> have access to AML?
>>>>>>>
>>>>>>> On x86 Xen does know how to map a GSI into an IO-APIC pin, in order
>>>>>>> configure the RTE as requested.
>>>>>>
>>>>>> IIUC, mapping a GSI to an IO-APIC pin requires information from the
>>>>>> MADT.  So I guess Xen does use the static ACPI tables, but not the AML
>>>>>> _PRT methods that would connect a GSI with a PCI device?
>>>>>
>>>>> Yes, Xen can parse the static tables, and knows the base GSI of
>>>>> IO-APICs from the MADT.
>>>>>
>>>>>> I guess this means Xen would not be able to deal with _MAT methods,
>>>>>> which also contains MADT entries?  I don't know the implications of
>>>>>> this -- maybe it means Xen might not be able to use with hot-added
>>>>>> devices?
>>>>>
>>>>> It's my understanding _MAT will only be present on some very specific
>>>>> devices (IO-APIC or CPU objects).  Xen doesn't support hotplug of
>>>>> IO-APICs, but hotplug of CPUs should in principle be supported with
>>>>> cooperation from the control domain OS (albeit it's not something that
>>>>> we tests on our CI).  I don't expect however that a CPU object _MAT
>>>>> method will return IO APIC entries.
>>>>>
>>>>>> The tables (including DSDT and SSDTS that contain the AML) are exposed
>>>>>> to userspace via /sys/firmware/acpi/tables/, but of course that
>>>>>> doesn't mean Xen knows how to interpret the AML, and even if it did,
>>>>>> Xen probably wouldn't be able to *evaluate* it since that could
>>>>>> conflict with the host kernel's use of AML.
>>>>>
>>>>> Indeed, there can only be a single OSPM, and that's the dom0 OS (Linux
>>>>> in our context).
>>>>>
>>>>> Getting back to our context though, what would be a suitable place for
>>>>> exposing the GSI assigned to each device?
>>>>
>>>> IIUC, the Xen hypervisor:
>>>>
>>>>   - Interprets /sys/firmware/acpi/tables/APIC (or gets this via
>>>>     something running on the Dom0 kernel) to find the physical base
>>>>     address and GSI base, e.g., from I/O APIC, I/O SAPIC.
>>>
>>> No, Xen parses the MADT directly from memory, before stating dom0.
>>> That's a static table so it's fine for Xen to parse it and obtain the
>>> I/O APIC GSI base.
>>
>> It's an interesting split to consume ACPI static tables directly but
>> put the AML interpreter elsewhere.
> 
> Well, static tables can be consumed by Xen, because thye don't require
> an AML parser (obviously), and parsing them doesn't have any
> side-effects that would prevent dom0 from being the OSPM (no methods
> or similar are evaluated).
> 
>> I doubt the ACPI spec envisioned
>> that, which makes me wonder what other things we could trip over, but
>> that's just a tangent.
> 
> Indeed, ACPI is not be best interface for the Xen/dom0 split model.
> 
>>>>   - Needs the GSI to locate the APIC and pin within the APIC.  The
>>>>     Dom0 kernel is the OSPM, so only it can evaluate the AML _PRT to
>>>>     learn the PCI device -> GSI mapping.
>>>
>>> Yes, Xen doesn't know the PCI device -> GSI mapping.  Dom0 needs to
>>> parse the ACPI methods and signal Xen to configure a GSI with a
>>> given trigger and polarity.
>>>
>>>>   - Has direct access to the APIC physical base address to program the
>>>>     Redirection Table.
>>>
>>> Yes, the hardware (native) I/O APIC is owned by Xen, and not directly
>>> accessible by dom0.
>>>
>>>> The patch seems a little messy to me because the PCI core has to keep
>>>> track of the GSI even though it doesn't need it itself.  And the
>>>> current patch exposes it on all arches, even non-ACPI ones or when
>>>> ACPI is disabled (easily fixable).
>>>>
>>>> We only call acpi_pci_irq_enable() in the pci_enable_device() path, so
>>>> we don't know the GSI unless a Dom0 driver has claimed the device and
>>>> called pci_enable_device() for it, which seems like it might not be
>>>> desirable.
>>>
>>> I think that's always the case, as on dom0 devices to be passed
>>> through are handled by pciback which does enable them.
>>
>> pcistub_init_device() labels the pci_enable_device() as a "HACK"
>> related to determining the IRQ, which makes me think there's not
>> really a requirement for the device to be *enabled* (BAR decoding
>> enabled) by dom0.
> 
> No, there's no need for memory decoding to be enabled for getting the
> GSI from the ACPI method I would assume.  I'm confused by that
> pci_enable_device() call.  Is maybe the purpose to make sure the
> device is powered up so that reading the PCI header Interrupt Line and
> Pin fields returns valid values?  No idea whether reading those fields
> requires the device to be in certain (active) power states.
> 
>>> I agree it might be best to not tie exposing the node to
>>> pci_enable_device() having been called.  Is _PRT only evaluated as
>>> part of acpi_pci_irq_enable()? (or pci_enable_device()).
>>
>> Yes.  AFAICT, acpi_pci_irq_enable() is the only path that evaluates
>> _PRT (except for a debugger interface).  I don't think it *needs* to
>> be that way, and the fact that we do it per-device like that means we
>> evaluate _PRT many times even though I think the results never change.
>>
>> I could imagine evaluating _PRT once as part of enumerating a PCI host
>> bridge (and maybe PCI-PCI bridge, per acpi_pci_irq_find_prt_entry()
>> comment), but that looks like a fair bit of work to implement.  And of
>> course it doesn't really affect the question of how to expose the
>> result, although it does suggest /sys/bus/acpi/devices/PNP0A03:00/ as
>> a possible location.
> 
> So you suggest exposing the GSI as part of the PCI host bridge?  I'm
> afraid I'm not following how we could then map PCI SBDFs from devices
> to their assigned GSI.
> 
> Thanks, Roger.

-- 
Best regards,
Jiqian Chen.

 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.