
Re: [RFC KERNEL PATCH v2 2/3] xen/pvh: Unmask irq for passthrough device in PVH dom0



On 2023/12/5 17:19, Roger Pau Monné wrote:
> On Mon, Dec 04, 2023 at 02:19:33PM -0800, Stefano Stabellini wrote:
>> On Mon, 4 Dec 2023, Roger Pau Monné wrote:
>>> On Fri, Dec 01, 2023 at 07:37:55PM -0800, Stefano Stabellini wrote:
>>>> On Fri, 1 Dec 2023, Roger Pau Monné wrote:
>>>>> On Thu, Nov 30, 2023 at 07:15:17PM -0800, Stefano Stabellini wrote:
>>>>>> On Thu, 30 Nov 2023, Roger Pau Monné wrote:
>>>>>>> On Wed, Nov 29, 2023 at 07:53:59PM -0800, Stefano Stabellini wrote:
>>>>>>>> On Fri, 24 Nov 2023, Jiqian Chen wrote:
>>>>>>>>> This patch solves two problems we encountered when trying to
>>>>>>>>> pass through a device to an HVM domU on top of a Xen PVH dom0.
>>>>>>>>>
>>>>>>>>> First, the HVM guest allocates a pirq and irq for a passthrough
>>>>>>>>> device by using its gsi. Before that, the gsi must have a mapping in
>>>>>>>>> dom0: see Xen code pci_add_dm_done->xc_domain_irq_permission, which
>>>>>>>>> calls into Xen and checks whether dom0 has the mapping. See
>>>>>>>>> XEN_DOMCTL_irq_permission->pirq_access_permitted: "current" is the
>>>>>>>>> PVH dom0, the returned irq is 0, and the domctl then returns -EPERM.
>>>>>>>>> This is because PHYSDEVOP_map_pirq is not done for passthrough
>>>>>>>>> devices when they are enabled.
>>>>>>>>>
>>>>>>>>> Second, in PVH dom0, the gsi of a passthrough device doesn't get
>>>>>>>>> registered, but gsi must be configured for it to be able to be
>>>>>>>>> mapped into a domU.
>>>>>>>>>
>>>>>>>>> After searching the code, we found that map_pirq and register_gsi
>>>>>>>>> are done in vioapic_write_redirent->vioapic_hwdom_map_gsi when the
>>>>>>>>> gsi (aka the ioapic's pin) is unmasked in PVH dom0. So both problems
>>>>>>>>> come down to the gsi of a passthrough device never being unmasked.
>>>>>>>>>
>>>>>>>>> To solve the unmask problem, this patch calls unmask_irq when we
>>>>>>>>> assign a device for passthrough, so that the gsi gets registered
>>>>>>>>> and mapped in PVH dom0.
>>>>>>>>
>>>>>>>>
>>>>>>>> Roger, this seems to be more of a Xen issue than a Linux issue. Why do
>>>>>>>> we need the unmask check in Xen? Couldn't we just do:
>>>>>>>>
>>>>>>>>
>>>>>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> index 4e40d3609a..df262a4a18 100644
>>>>>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>>>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>>>>>> @@ -287,7 +287,7 @@ static void vioapic_write_redirent(
>>>>>>>>              hvm_dpci_eoi(d, gsi);
>>>>>>>>      }
>>>>>>>>  
>>>>>>>> -    if ( is_hardware_domain(d) && unmasked )
>>>>>>>> +    if ( is_hardware_domain(d) )
>>>>>>>>      {
>>>>>>>>          /*
>>>>>>>>           * NB: don't call vioapic_hwdom_map_gsi while holding hvm.irq_lock
>>>>>>>
>>>>>>> There are some issues with this approach.
>>>>>>>
>>>>>>> mp_register_gsi() will only set up the trigger and polarity of the
>>>>>>> IO-APIC pin once, so we do so once the guest unmasks the pin in order
>>>>>>> to assert that the configuration is the intended one.  A guest is
>>>>>>> allowed to write all kinds of nonsense to the IO-APIC RTE, but
>>>>>>> that doesn't take effect unless the pin is unmasked.
>>>>>>>
>>>>>>> Overall the question would be whether we have any guarantees that
>>>>>>> the hardware domain has properly configured the pin, even if it's not
>>>>>>> using it itself (as it hasn't been unmasked).
>>>>>>>
>>>>>>> IIRC PCI legacy interrupts are level triggered and low polarity, so we
>>>>>>> could configure any pins that are not set up at bind time?
>>>>>>
>>>>>> That could work.
>>>>>>
>>>>>> Another idea is to move only the call to allocate_and_map_gsi_pirq at
>>>>>> bind time? That might be enough to pass a pirq_access_permitted check.
>>>>>
>>>>> Maybe, albeit that would change the behavior of XEN_DOMCTL_bind_pt_irq
>>>>> just for PT_IRQ_TYPE_PCI and only when called from a PVH dom0 (as the
>>>>> parameter would be a GSI instead of a previously mapped IRQ).  Such
>>>>> difference just for PT_IRQ_TYPE_PCI is slightly weird - if we go that
>>>>> route I would recommend that we instead introduce a new dmop that has
>>>>> this syntax regardless of the domain type it's called from.
>>>>
>>>> Looking at the code it is certainly a bit confusing. My point was that
>>>> we don't need to wait until polarity and trigger are set appropriately
>>>> to allow Dom0 to pass successfully a pirq_access_permitted() check. Xen
>>>> should be able to figure out that Dom0 is permitted pirq access.
>>>
>>> The logic is certainly not straightforward, and it could benefit from
>>> some comments.
>>>
>>> The irq permissions are a bit special, in that they get set up when the
>>> IRQ is mapped.
>>>
>>> The problem however is not so much with IRQ permissions, that we can
>>> indeed sort out internally in Xen.  Such check in dom0 has the side
>>> effect of preventing the IRQ from being assigned to a domU without the
>>> hardware source being properly configured AFAICT.
>>
>> Now I understand why you made a comment previously about Xen having to
>> configure trigger and polarity for these interrupts on its own.
>>
>>
>>>> So the idea was to move the call to allocate_and_map_gsi_pirq() earlier
>>>> somewhere because allocate_and_map_gsi_pirq doesn't require trigger or
>>>> polarity to be configured to work. But the suggestion of doing it at
>>>> "bind time" (meaning: XEN_DOMCTL_bind_pt_irq) was a bad idea.
>>>>
>>>> But maybe we can find another location, maybe within
>>>> xen/arch/x86/hvm/vioapic.c, to call allocate_and_map_gsi_pirq() before
>>>> trigger and polarity are set and before the interrupt is unmasked.
>>>>
>>>> Then we change the implementation of vioapic_hwdom_map_gsi to skip the
>>>> call to allocate_and_map_gsi_pirq, because by the time
>>>> vioapic_hwdom_map_gsi is called we assume that
>>>> allocate_and_map_gsi_pirq has already been done.
>>>
>>> But then we would end up in a situation where the
>>> pirq_access_permitted() check will pass, but the IO-APIC pin won't be
>>> configured, which I think is not what we want.
>>>
>>> One option would be to allow mp_register_gsi() to be called multiple
>>> times, and update the IO-APIC pin configuration as long as the pin is
>>> not unmasked.  That would propagate each dom0 RTE update to the
>>> underlying IO-APIC.  However, such an approach relies on dom0 configuring
>>> all possible IO-APIC pins, even if no device on dom0 is using them, so I
>>> think it's not a very reliable option.
>>>
>>> Another option would be to modify the toolstack to set up the GSI
>>> itself using the PHYSDEVOP_setup_gsi hypercall.  As said in a previous
>>> email, since we only care about PCI device passthrough the legacy INTx
>>> should always be level triggered and low polarity.
>>>
>>>> I am not familiar with vioapic.c but to give you an idea of what I was
>>>> thinking:
>>>>
>>>>
>>>> diff --git a/xen/arch/x86/hvm/vioapic.c b/xen/arch/x86/hvm/vioapic.c
>>>> index 4e40d3609a..16d56fe851 100644
>>>> --- a/xen/arch/x86/hvm/vioapic.c
>>>> +++ b/xen/arch/x86/hvm/vioapic.c
>>>> @@ -189,14 +189,6 @@ static int vioapic_hwdom_map_gsi(unsigned int gsi, unsigned int trig,
>>>>          return ret;
>>>>      }
>>>>  
>>>> -    ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> -    if ( ret )
>>>> -    {
>>>> -        gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> -                 gsi, ret);
>>>> -        return ret;
>>>> -    }
>>>> -
>>>>      pcidevs_lock();
>>>>      ret = pt_irq_create_bind(currd, &pt_irq_bind);
>>>>      if ( ret )
>>>> @@ -287,6 +279,17 @@ static void vioapic_write_redirent(
>>>>              hvm_dpci_eoi(d, gsi);
>>>>      }
>>>>  
>>>> +    if ( is_hardware_domain(d) ) 
>>>> +    {
>>>> +        int pirq = gsi, ret;
>>>> +        ret = allocate_and_map_gsi_pirq(currd, pirq, &pirq);
>>>> +        if ( ret )
>>>> +        {
>>>> +            gprintk(XENLOG_WARNING, "vioapic: error mapping GSI %u: %d\n",
>>>> +                    gsi, ret);
>>>> +            return ret;
>>>> +        }
>>>> +    }
>>>>      if ( is_hardware_domain(d) && unmasked )
>>>>      {
>>>>          /*
>>>
>>> As said above, such an approach relies on dom0 writing to the IO-APIC RTE
>>> of likely each IO-APIC pin, which is IMO not quite reliable.  There
>>> are two different issues here that need to be fixed for PVH dom0:
>>>
>>>  - Fix the XEN_DOMCTL_irq_permission pirq_access_permitted() call to
>>>    succeed for a PVH dom0, even if dom0 is not using the GSI itself.
>>
>> Yes makes sense
>>
>>
>>>  - Configure IO-APIC pins for PCI interrupts even if dom0 is not using
>>>    the IO-APIC pin itself.
>>>
>>> First one needs to be fixed internally in Xen, second one will require
>>> the toolstack to issue an extra hypercall in order to ensure the
>>> IO-APIC pin is properly configured.
>>  
>> On ARM, Xen doesn't need to wait for dom0 to configure interrupts
>> correctly. Xen configures them all on its own at boot based on Device
>> Tree information. I guess it is not possible to do the same on x86?
> 
> No, not exactly.  There's some interrupt information in the ACPI MADT,
> but that's just for very specific sources (Interrupt Source Override
> Structures).
> 
> Then in AML, devices can have resource descriptors that contain
> information about how interrupts are set up.  However Xen is not able
> to read any of this information from AML.
> 
> Legacy PCI interrupts are (always?) level triggered and low polarity,
> because it's assumed that an interrupt source can be shared between
> multiple devices.
> 
> I'm however not able to find any reference to this in the PCI spec,
> hence I'm reluctant to take this for granted in Xen, and default all
> GSIs >= 16 to such mode.
> 
> OTOH legacy PCI interrupts are not that used anymore, as almost all
> devices will support MSI(-X) (because PCIe mandates it) and OSes
> should prefer the latter.  SR-IOV VFs don't even support legacy PCI
> interrupts anymore.
> 
>> If
>> not, then I can see why we would need 1 extra toolstack hypercall for
>> that (or to bundle the operation of configuring IO-APIC pins together
>> with an existing toolstack hypercall).
> 
> One suitable compromise would be to default unconfigured GSIs >= 16 to
> level-triggered and low-polarity, as I would expect that to work in
> almost all cases.  We can always introduce the usage of
> PHYSDEVOP_setup_gsi later if required.
> 
> Maybe Jan has more input here, would you agree to defaulting non-ISA
> GSIs to level-triggered, low-polarity in the absence of a specific
> setup provided by dom0?
> 
> Thanks, Roger.

Apologies if I am mistaken, just a small input. When a PVH dom0 enables a
device, it calls acpi_pci_irq_enable, and in that function the default
trigger is level and the default polarity is low for PCI interrupts.
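
To make the defaulting idea discussed above concrete, here is a minimal
standalone sketch in C. It is not Xen or Linux code: the table, the type
names and gsi_ensure_configured() are all hypothetical and only model the
policy "an unconfigured GSI >= 16 falls back to level-triggered, low
polarity, while anything dom0 already configured is left alone".

```c
#include <stdbool.h>

/* Hypothetical model for illustration only; none of this is Xen code. */
#define NR_ISA_GSIS 16u
#define NR_GSIS     24u   /* arbitrary table size chosen for the sketch */

enum trigger  { TRIG_EDGE, TRIG_LEVEL };
enum polarity { POL_HIGH, POL_LOW };

struct gsi_cfg {
    bool configured;      /* true once dom0 (or the fallback) set the pin up */
    enum trigger trig;
    enum polarity pol;
};

/* Zero-initialised: every GSI starts out unconfigured. */
static struct gsi_cfg gsi_table[NR_GSIS];

/*
 * Apply the proposed fallback. A non-ISA GSI that was never configured is
 * set to level-triggered, low polarity. A GSI that already has a
 * configuration is left untouched. Returns 0 on success, -1 if the GSI is
 * out of range or is an unconfigured ISA GSI (no default is assumed there).
 */
static int gsi_ensure_configured(unsigned int gsi)
{
    if (gsi >= NR_GSIS)
        return -1;

    if (gsi_table[gsi].configured)
        return 0;

    if (gsi < NR_ISA_GSIS)
        return -1;

    gsi_table[gsi] = (struct gsi_cfg){ true, TRIG_LEVEL, POL_LOW };
    return 0;
}
```

In this model, calling gsi_ensure_configured() at assignment time would
succeed for a GSI such as 18 that dom0 never touched, and would not
override a pin that dom0 had already programmed differently.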

-- 
Best regards,
Jiqian Chen.

 

