
Re: IRQ affinity not working on Xen pci-platform device^W^W^W QEMU split-irqchip I/O APIC.



On Sat, 2023-03-04 at 01:28 +0100, Thomas Gleixner wrote:
> David!
> 
> On Fri, Mar 03 2023 at 16:54, David Woodhouse wrote:
> > On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote:
> > > > 
> > > > [    0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > > > [    0.578149] The affinity mask was 0-3
> > > > [    0.579081] The affinity mask is 0-3 and the handler is on 2
> > > > [    0.580288] The affinity mask is 0 and the handler is on 2
> > > 
> > > What happens is that once the interrupt is requested, the affinity
> > > setting is deferred to the first interrupt. See the marvelous dance in
> > > arch/x86/kernel/apic/msi.c::msi_set_affinity().
> > > 
> > > If you do the setting before request_irq() then the startup will assign
> > > it to the target mask right away.
> > > 
> > > Btw, you are using irq_get_affinity_mask(), which gives you the desired
> > > target mask. irq_get_effective_affinity_mask() gives you the real one.
> > > 
> > > Can you verify that the thing moves over after the first interrupt or is
> > > that too late already?
> > 
> > It doesn't seem to move. The hack to just return IRQ_NONE if invoked on
> > CPU != 0 was intended to do just that. It's a level-triggered interrupt
> > so when the handler does nothing on the "wrong" CPU, it ought to get
> > invoked again on the *correct* CPU and actually work that time.
> 
> So much for the theory. This is virt after all so it does not
> necessarily behave like real hardware.

I think you're right. This looks like a QEMU bug with the "split
irqchip" I/OAPIC.

For reasons I'm unclear about, and which lack a comment in the code,
QEMU still injects I/OAPIC events into the kernel with kvm_set_irq().
(I think it's to do with caching, because QEMU doesn't cache interrupt-
remapping translations anywhere *except* in the KVM IRQ routing table,
so if it just synthesised an MSI message every time it'd have to
retranslate it every time?)

Tracing the behaviour here shows:

 • First interrupt happens on CPU2.
 • Linux updates the I/OAPIC RTE to point to CPU0, but QEMU doesn't
   update the KVM IRQ routing table yet.
 • QEMU retriggers the (still-high, level-triggered) IRQ.
 • QEMU calls kvm_set_irq(11), delivering it to CPU2 again.
 • QEMU *finally* calls ioapic_update_kvm_routes().
 • Linux sees the interrupt on CPU2 again.

  $ qemu-system-x86_64 -display none -serial mon:stdio \
     -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
     -kernel ~/git/linux/arch/x86/boot//bzImage \
     -append "console=ttyS0,115200 xen_no_vector_callback" \
     -smp 4 --trace ioapic\* --trace xenstore\*


...

xenstore_read tx 0 path control/platform-feature-xs_reset_watches
ioapic_set_irq vector: 11 level: 1
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
[    0.523627] The affinity mask was 0-3 and the handler is on 2
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee02000 8021
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
xenstore_reset_watches 
ioapic_set_irq vector: 11 level: 1
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x1c021
[    0.524569] ioapic_ack_level IRQ 11 moveit = 1
ioapic_eoi_broadcast EOI broadcast for vector 33
ioapic_clear_remote_irr clear remote irr for pin 11 vector 33
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x18021
[    0.525235] ioapic_finish_move IRQ 11 calls irq_move_masked_irq()
[    0.526147] irq_do_set_affinity for IRQ 11, 0
[    0.526732] ioapic_set_affinity for IRQ 11, 0
[    0.527330] ioapic_setup_msg_from_msi for IRQ11 target 0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x27
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x27 size 0x4 val 0x0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
[    0.527623] ioapic_set_affinity returns 0
[    0.527623] ioapic_finish_move IRQ 11 calls unmask_ioapic_irq()
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x8021
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee00000 8021
[    0.529571] The affinity mask was 0 and the handler is on 2
xenstore_watch path memory/target token FFFFFFFF92847D40
xenstore_watch_event path memory/target token FFFFFFFF92847D40
ioapic_set_irq vector: 11 level: 1
[    0.530486] ioapic_ack_level IRQ 11 moveit = 0


This is with Linux doing basically nothing when the handler is invoked
on the 'wrong' CPU, and just waiting for it to be right.
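
For anyone reading the trace, here is a rough decoder for the values in
it. It is only a sketch: the helper names are made up for illustration,
the bit positions are the standard I/O APIC redirection-entry and x86
MSI address layouts, and it assumes the plain 8-bit destination ID (no
extended destination bits). It shows that 0x18021 is a masked,
level-triggered entry for vector 33, that 0x1c021 is the same entry
with Remote IRR set, and that the route moves from destination 2
(fee02000) to destination 0 (fee00000).

```python
def decode_rte_low(val):
    """Decode the low 32 bits of an I/O APIC redirection table entry."""
    return {
        "vector": val & 0xFF,                      # bits 0-7
        "delivery_mode": (val >> 8) & 0x7,         # bits 8-10
        "logical_dest": bool(val & (1 << 11)),     # bit 11
        "remote_irr": bool(val & (1 << 14)),       # bit 14
        "level_triggered": bool(val & (1 << 15)),  # bit 15
        "masked": bool(val & (1 << 16)),           # bit 16
    }


def msi_dest_id(addr):
    """Destination ID from an MSI address (bits 12-19, 8-bit form only)."""
    return (addr >> 12) & 0xFF


def regsel_for_pin(pin):
    """IOREGSEL index of the low dword of a pin's RTE (high dword is +1)."""
    return 0x10 + 2 * pin


# Values copied from the trace above.
for val in (0x18021, 0x1C021, 0x8021):
    print(hex(val), decode_rte_low(val))
for addr in (0xFEE02000, 0xFEE00000):
    print(hex(addr), "destination", msi_dest_id(addr))
print("pin 11 low dword regsel:", hex(regsel_for_pin(11)))
```

The regsel values 0x26 and 0x27 in the ioapic_mem_write lines are the
low and high dwords of the RTE for pin 11.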

Commenting out the kvm_set_irq() calls in ioapic_service() and letting
QEMU synthesise an MSI every time works. Better still, so does this,
making it update the routing table *before* retriggering the IRQ when
the guest updates the RTE:

--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -405,6 +409,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
                 s->ioredtbl[index] |= ro_bits;
                 s->irq_eoi[index] = 0;
                 ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
+                ioapic_update_kvm_routes(s);
                 ioapic_service(s);
             }
         }
@@ -418,7 +423,6 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
         break;
     }
 
-    ioapic_update_kvm_routes(s);
 }
 
 static const MemoryRegionOps ioapic_io_ops = {
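
To illustrate why the ordering matters, here is a toy model of the
sequence in the trace. It is not QEMU code: the class and method names
are invented for the example, and it assumes (as the trace suggests)
that delivery via kvm_set_irq() follows whatever is currently in the
cached KVM route rather than the RTE the guest just wrote.

```python
class SplitIrqchipModel:
    """Toy model of one I/O APIC pin with a cached KVM route (hypothetical)."""

    def __init__(self, update_routes_first):
        self.update_routes_first = update_routes_first
        self.rte_dest = 2        # CPU the guest has programmed into the RTE
        self.kvm_route_dest = 2  # CPU held in the cached KVM routing entry
        self.level_high = True   # the handler never clears the line
        self.deliveries = []     # CPUs that received the interrupt, in order

    def service(self):
        # Stands in for ioapic_service() calling kvm_set_irq():
        # delivery uses the cached route, not the RTE.
        if self.level_high:
            self.deliveries.append(self.kvm_route_dest)

    def update_kvm_routes(self):
        # Stands in for ioapic_update_kvm_routes().
        self.kvm_route_dest = self.rte_dest

    def guest_writes_rte(self, new_dest):
        # Guest retargets the pin and unmasks it while the level is high.
        self.rte_dest = new_dest
        if self.update_routes_first:
            self.update_kvm_routes()
            self.service()
        else:
            self.service()
            self.update_kvm_routes()


def run(update_routes_first):
    model = SplitIrqchipModel(update_routes_first)
    model.service()            # first interrupt, routed to CPU2
    model.guest_writes_rte(0)  # Linux moves the IRQ to CPU0
    return model.deliveries


print("service before route update:", run(False))  # [2, 2]
print("route update before service:", run(True))   # [2, 0]
```

With the original ordering the retriggered interrupt lands on CPU2
again; with the route update moved ahead of the service call it goes to
CPU0.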



Now, I don't quite see why we don't get a *third* interrupt, since
Linux did nothing to clear the level of IRQ 11 and the last trace I see
from QEMU's ioapic_set_irq confirms it's still set. But I've exceeded
my screen time for the day, so I'll have to frown at that part some
more later. I wonder if the EOI is going missing because it's coming
from the wrong CPU? Note no 'EOI broadcast' after the last line in the
log I showed above; it isn't just that I trimmed it there.

I don't think we need to do anything in Linux; if the handler gets
invoked on the wrong CPU it'll basically find no events pending for
that CPU and return having done nothing... and *hopefully* should be
re-invoked on the correct CPU shortly thereafter.

