Re: [PATCH v3 18/23] xen/riscv: implement IRQ routing for device passthrough
On 25.06.2026 17:54, Oleksii Kurochko wrote:
>
>
> On 6/25/26 1:14 PM, Jan Beulich wrote:
>> On 25.06.2026 11:48, Oleksii Kurochko wrote:
>>> On 6/25/26 8:08 AM, Jan Beulich wrote:
>>>> On 24.06.2026 17:21, Oleksii Kurochko wrote:
>>>>> On 6/22/26 5:57 PM, Jan Beulich wrote:
>>>>>> On 17.06.2026 13:17, Oleksii Kurochko wrote:
>>>>>>> --- a/xen/arch/riscv/include/asm/intc.h
>>>>>>> +++ b/xen/arch/riscv/include/asm/intc.h
>>>>>>> @@ -13,6 +13,7 @@ enum intc_version {
>>>>>>> };
>>>>>>>
>>>>>>> struct cpu_user_regs;
>>>>>>> +struct domain;
>>>>>>> struct irq_desc;
>>>>>>> struct kernel_info;
>>>>>>> struct vcpu;
>>>>>>> @@ -32,6 +33,9 @@ struct intc_hw_operations {
>>>>>>> /* hw_irq_controller to enable/disable/eoi host irq */
>>>>>>> const struct hw_interrupt_type *host_irq_type;
>>>>>>>
>>>>>>> + /* hw_irq_controller to enable/disable/eoi guest irq */
>>>>>>> + const struct hw_interrupt_type *guest_irq_type;
>>>>>>
>>>>>> It's likely down to my limited RISC-V knowledge that I find this extremely odd:
>>>>>> Separate struct hw_interrupt_type-s for host and guest?
>>>>>
>>>>> The guest and host interrupt controllers may handle some
>>>>> hw_irq_controller operations differently, even though the operations
>>>>> themselves are conceptually the same. The hw_irq_controller interface
>>>>> provides fairly abstract interrupt controller operations, but the
>>>>> underlying implementation may differ depending on whether the controller
>>>>> is used by the host or a guest.
>>>>>
>>>>> As an example, the Arm code already follows this approach:
>>>>>
>>>>> /* XXX different for level vs edge */
>>>>> static hw_irq_controller gicv2_host_irq_type = {
>>>>> .typename = "gic-v2",
>>>>> .startup = gicv2_irq_startup,
>>>>> .shutdown = gicv2_irq_shutdown,
>>>>> .enable = gicv2_irq_enable,
>>>>> .disable = gicv2_irq_disable,
>>>>> .ack = gicv2_irq_ack,
>>>>> .end = gicv2_host_irq_end,
>>>>> .set_affinity = gicv2_irq_set_affinity,
>>>>> };
>>>>>
>>>>> static hw_irq_controller gicv2_guest_irq_type = {
>>>>> .typename = "gic-v2",
>>>>> .startup = gicv2_irq_startup,
>>>>> .shutdown = gicv2_irq_shutdown,
>>>>> .enable = gicv2_irq_enable,
>>>>> .disable = gicv2_irq_disable,
>>>>> .ack = gicv2_irq_ack,
>>>>> .end = gicv2_guest_irq_end,
>>>>> .set_affinity = gicv2_irq_set_affinity,
>>>>> };
>>>>>
>>>>> These implementations reuse almost all interrupt controller operations,
>>>>> differing only in the .end callback.
>>>>
>>>> Which I'm having trouble with as well. Interrupts are handled by Xen. What
>>>> guests get to see are virtualized interrupts (no matter how much HW
>>>> acceleration may be in use). Hence I'm having difficulty seeing such a
>>>> split justified.
>>>
>>> I think that I don't fully understand what is wrong with splitting. If
>>> cases exist where I need such separation for virtual interrupt
>>> controller operations, then it looks fine to introduce such separation,
>>> right?
>>>
>>> Let's take the PLIC as an example.
>>>
>>> For each source the PLIC has a "gateway":
>>> 1. Claim (read CONTEXT_CLAIM): returns the pending IRQ id and closes the
>>> gateway for that source, it will not forward that source to any context
>>> again until completed.
>>> 2. Complete (write the id back to CONTEXT_CLAIM): reopens the gateway.
>>> If the device line is still asserted (level high), the PLIC immediately
>>> re-marks it pending and delivers it again.
>>>
>>> The "closed gateway" between claim and complete is effectively the
>>> hardware masking the source while it's being serviced.
>>>
>>> Then suppose we handle a guest interrupt in the following way:
>>> 1. Passthrough device asserts its line (level stays high).
>>> 2. Xen takes the physical IRQ, claims (gateway closes), completes
>>> (gateway reopens), injects a virtual IRQ into the guest's vPLIC.
>>> 3. The guest hasn't run yet, it hasn't touched the device's registers,
>>> so the device line is still high.
>>> 4. The PLIC sees the source still asserted with an open gateway -> marks
>>> pending -> fires another physical interrupt into Xen -> ... -> repeat.
>>>
>>> So we get a storm of physical interrupts for a device the guest hasn't
>>> even begun servicing. The device line only drops when the guest driver
>>> writes the device's own registers, which happens long after, and on the
>>> guest's schedule.
>>>
>>> So the solution is that the physical complete must wait until the guest
>>> has actually quiesced the device. The only signal Xen gets for "guest is
>>> done" is the guest writing its virtual complete to the emulated vPLIC. So:
>>> 1. guest_irq->ack: the claim already happened (the readl(CONTEXT_CLAIM)
>>> in plic_handle_interrupt); ack just records which context claimed it.
>>> The gateway stays closed - good, the source is masked while the guest works.
>>> 2. inject vIRQ → guest services the device (line drops) -> guest writes
>>> vPLIC complete.
>>> 3. guest_irq->end: now do the physical complete, reopening the gateway.
>>> Device is quiet -> no spurious re-trigger; if it's a new legitimate
>>> assertion, it fires once, correctly.
>>>
>>> Is it clear enough now?
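
To make the sequence quoted above concrete, here is a minimal toy model in C of a single level-triggered source behind a gateway, following the claim/complete description given in the quoted text. It is not Xen or PLIC driver code: `toy_gateway`, `toy_claim()`, `toy_complete()` and `toy_count_irqs()` are hypothetical names invented for this sketch, and `guest_delay` is an assumed number of delivery opportunities that pass before the guest driver quiesces the device.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one interrupt source and its gateway (hypothetical). */
struct toy_gateway {
    bool line_high;    /* level-triggered device line is asserted */
    bool gate_closed;  /* closed between claim and complete */
    bool pending;      /* request forwarded to the interrupt controller */
};

/* Claim: returns true if an interrupt is delivered; closes the gateway. */
static bool toy_claim(struct toy_gateway *g)
{
    assert(g);

    /* An asserted line becomes pending only while the gateway is open. */
    if ( g->line_high && !g->gate_closed )
        g->pending = true;

    if ( !g->pending )
        return false;

    g->pending = false;
    g->gate_closed = true;
    return true;
}

/* Complete: reopens the gateway. */
static void toy_complete(struct toy_gateway *g)
{
    assert(g);
    g->gate_closed = false;
}

/*
 * Count the physical interrupts taken for one device assertion.
 * defer_complete == false: complete right after claim (host-style end).
 * defer_complete == true:  complete only once the guest is done.
 */
static unsigned int toy_count_irqs(bool defer_complete,
                                   unsigned int guest_delay)
{
    struct toy_gateway g = { .line_high = true };
    unsigned int taken = 0;

    /* Each iteration is one opportunity to deliver an interrupt. */
    for ( unsigned int step = 0; step < guest_delay; step++ )
    {
        if ( !toy_claim(&g) )
            continue;           /* gateway closed: nothing delivered */

        taken++;                /* the virtual IRQ would be injected here */

        if ( !defer_complete )
            toy_complete(&g);   /* gateway reopens while the line is high */
    }

    /* The guest finally runs: it quiesces the device, then completes. */
    g.line_high = false;
    if ( defer_complete )
        toy_complete(&g);

    /* A quiet line must not trigger again. */
    if ( toy_claim(&g) )
        taken++;

    return taken;
}
```

In this model, calling `toy_count_irqs(false, 5)` takes one interrupt per delivery opportunity (the storm described above), while `toy_count_irqs(true, 5)` takes exactly one, because the gateway stays closed until the guest has serviced the device.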
>>
>> Well, yes and no. On x86 we have to deal with the situation you describe as
>> problematic anyway, as IRQs have priorities associated with them, and higher
>> prio ones block equal/lower prio ones until they are "completed" (in the
>> terminology you use).
>
> Just for my understanding: is the problem here that, until "completed"
> is done for this high-priority interrupt, all others will just wait, so
> basically responsiveness of the system in general will be bad?

Yes, one guest can affect other guests or the host.

>> If you don't have anything similar in RISC-V, then
>> you may indeed get somewhat simpler code overall with such a split.
>
> IIUC, if the word "block" above is used correctly, I would say that the
> behavior on RISC-V is different, at least for the PLIC. Say we have
> three IRQs, and `irq1` has the highest priority.
>
> `irq2` and `irq3` may become pending in the PLIC core, but they will not
> be visible to the CPU until `irq1` is CLAIMed, even if `irq1` is never
> completed (i.e., if you fail to write back to the CLAIM/COMPLETE register).
>
> When the hart reads the CLAIM/COMPLETE register, the PLIC core
> atomically retrieves the ID of the highest-priority pending interrupt
> (`irq1`) and clears its Interrupt Pending (IP) bit in the PLIC core.
>
> Once the IP bit for `irq1` is cleared, the PLIC core immediately
> re-evaluates all remaining pending interrupts. If `irq2` and `irq3` are
> pending, `irq2` (the next-highest-priority interrupt) becomes the
> highest-priority pending interrupt.
>
> The PLIC core will continue to signal the hart (by asserting the `MEIP`
> or `SEIP` bits) as long as there is any pending and enabled interrupt
> whose priority exceeds the hart's threshold.
>
> So the IRQ handler can run for irq2 and irq3 before irq1 is COMPLETED.
>
> So irqs are blocked only until they are claimed.
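
The claim behaviour described in the quoted paragraphs can be sketched with a small toy model in C. This is an illustration only, not PLIC driver code: `toy_plic`, `toy_plic_best()`, `toy_plic_signals_hart()` and `toy_plic_claim()` are hypothetical names. The sketch assumes that ID 0 means "no interrupt", that priority 0 never interrupts, that ties go to the lowest ID, and it leaves out per-context enable bits and the completion step for brevity.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_IRQS 4   /* IDs 1..3 are usable; ID 0 means "no interrupt" */

/* Toy model of the PLIC state seen by one context (hypothetical). */
struct toy_plic {
    bool pending[TOY_NR_IRQS];      /* Interrupt Pending (IP) bits */
    unsigned int prio[TOY_NR_IRQS]; /* per-source priority */
    unsigned int threshold;         /* priority threshold of the context */
};

/*
 * Highest-priority pending source whose priority exceeds `floor`,
 * or 0 if there is none. Ties go to the lowest ID.
 */
static unsigned int toy_plic_best(const struct toy_plic *p,
                                  unsigned int floor)
{
    unsigned int best = 0;

    assert(p);

    for ( unsigned int i = 1; i < TOY_NR_IRQS; i++ )
        if ( p->pending[i] && p->prio[i] > floor &&
             (best == 0 || p->prio[i] > p->prio[best]) )
            best = i;

    return best;
}

/* Is the external interrupt line to the hart asserted? */
static bool toy_plic_signals_hart(const struct toy_plic *p)
{
    return toy_plic_best(p, p->threshold) != 0;
}

/* Claim: return the best pending ID and clear its pending bit. */
static unsigned int toy_plic_claim(struct toy_plic *p)
{
    unsigned int id = toy_plic_best(p, 0);

    if ( id )
        p->pending[id] = false;

    return id;
}
```

With three pending sources where `irq1` has the highest priority, successive calls to `toy_plic_claim()` return 1, then 2, then 3, without any completion in between, which matches the point that IRQs are held back only until the higher-priority ones have been claimed.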
>
>> Yet if
>> there's nothing like that in RISC-V, you can get (almost) arbitrarily deeply
>> nested interrupts, which in turn would be a problem you need to deal with.
>> IOW I suspect the architecture has something to limit nesting depth.
>
> The trap handler, where the IRQ handler is called, starts with
> interrupts disabled, so nested interrupts cannot really occur at that point.

Same on x86. Yet then in do_IRQ(), around invoking the handler, we re-enable
interrupts. There have been discussions whether this is a good idea, but
fundamentally the thought behind this is to prevent higher priority IRQs from
remaining blocked for overly long periods of time. I.e. again a responsiveness
concern, all the more so since some of the IPIs are hi-prio ones in order for
them to be serviced quickly, to prevent blocking the CPU issuing the IPI (plus
perhaps further CPUs).

Jan