Re: IRQ affinity not working on Xen pci-platform device^W^W^W QEMU split-irqchip I/O APIC
On Sat, 2023-03-04 at 01:28 +0100, Thomas Gleixner wrote:
> David!
>
> On Fri, Mar 03 2023 at 16:54, David Woodhouse wrote:
> > On Fri, 2023-03-03 at 17:51 +0100, Thomas Gleixner wrote:
> > > >
> > > > [ 0.577173] ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > > > [ 0.578149] The affinity mask was 0-3
> > > > [ 0.579081] The affinity mask is 0-3 and the handler is on 2
> > > > [ 0.580288] The affinity mask is 0 and the handler is on 2
> > >
> > > What happens is that once the interrupt is requested, the affinity
> > > setting is deferred to the first interrupt. See the marvelous dance in
> > > arch/x86/kernel/apic/msi.c::msi_set_affinity().
> > >
> > > If you do the setting before request_irq() then the startup will assign
> > > it to the target mask right away.
> > >
> > > Btw, you are using irq_get_affinity_mask(), which gives you the desired
> > > target mask. irq_get_effective_affinity_mask() gives you the real one.
> > >
> > > Can you verify that the thing moves over after the first interrupt or is
> > > that too late already?
> >
> > It doesn't seem to move. The hack to just return IRQ_NONE if invoked on
> > CPU != 0 was intended to do just that. It's a level-triggered interrupt
> > so when the handler does nothing on the "wrong" CPU, it ought to get
> > invoked again on the *correct* CPU and actually work that time.
>
> So much for the theory. This is virt after all so it does not
> necessarily behave like real hardware.

I think you're right. This looks like a QEMU bug with the "split
irqchip" I/OAPIC.

For reasons I'm unclear about, and which lack a comment in the code,
QEMU still injects I/OAPIC events into the kernel with kvm_set_irq().
(I think it's to do with caching: QEMU doesn't cache interrupt-
remapping translations anywhere *except* in the KVM IRQ routing table,
so if it just synthesised an MSI message every time, it'd have to
retranslate it every time?)

Tracing the behaviour here shows:

 • First interrupt happens on CPU2.
 • Linux updates the I/OAPIC RTE to point to CPU0, but QEMU doesn't
   update the KVM IRQ routing table yet.
 • QEMU retriggers the (still-high, level-triggered) IRQ.
 • QEMU calls kvm_set_irq(11), delivering it to CPU2 again.
 • QEMU *finally* calls ioapic_update_kvm_routes().
 • Linux sees the interrupt on CPU2 again.

$ qemu-system-x86_64 -display none -serial mon:stdio \
   -accel kvm,xen-version=0x4000a,kernel-irqchip=split \
   -kernel ~/git/linux/arch/x86/boot//bzImage \
   -append "console=ttyS0,115200 xen_no_vector_callback" \
   -smp 4 --trace ioapic\* --trace xenstore\*
...
xenstore_read tx 0 path control/platform-feature-xs_reset_watches
ioapic_set_irq vector: 11 level: 1
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
[ 0.523627] The affinity mask was 0-3 and the handler is on 2
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee02000 8021
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
xenstore_reset_watches
ioapic_set_irq vector: 11 level: 1
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x1c021
[ 0.524569] ioapic_ack_level IRQ 11 moveit = 1
ioapic_eoi_broadcast EOI broadcast for vector 33
ioapic_clear_remote_irr clear remote irr for pin 11 vector 33
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_read ioapic mem read addr 0x10 regsel: 0x26 size 0x4 retval 0x18021
[ 0.525235] ioapic_finish_move IRQ 11 calls irq_move_masked_irq()
[ 0.526147] irq_do_set_affinity for IRQ 11, 0
[ 0.526732] ioapic_set_affinity for IRQ 11, 0
[ 0.527330] ioapic_setup_msg_from_msi for IRQ11 target 0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x27
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x27 size 0x4 val 0x0
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x27 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x18021
[ 0.527623] ioapic_set_affinity returns 0
[ 0.527623] ioapic_finish_move IRQ 11 calls unmask_ioapic_irq()
ioapic_mem_write ioapic mem write addr 0x0 regsel: 0x26 size 0x4 val 0x26
ioapic_mem_write ioapic mem write addr 0x10 regsel: 0x26 size 0x4 val 0x8021
ioapic_set_remote_irr set remote irr for pin 11
ioapic_service: trigger KVM IRQ 11
ioapic_update_kvm_routes: update KVM route for IRQ 11: fee00000 8021
[ 0.529571] The affinity mask was 0 and the handler is on 2
xenstore_watch path memory/target token FFFFFFFF92847D40
xenstore_watch_event path memory/target token FFFFFFFF92847D40
ioapic_set_irq vector: 11 level: 1
[ 0.530486] ioapic_ack_level IRQ 11 moveit = 0

This is with Linux doing basically nothing when the handler is invoked
on the 'wrong' CPU, and just waiting for it to be right.

Commenting out the kvm_set_irq() calls in ioapic_service() and letting
QEMU synthesise an MSI every time works. Better still, so does this,
making it update the routing table *before* retriggering the IRQ when
the guest updates the RTE:

--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -405,6 +409,7 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
             s->ioredtbl[index] |= ro_bits;
             s->irq_eoi[index] = 0;
             ioapic_fix_edge_remote_irr(&s->ioredtbl[index]);
+            ioapic_update_kvm_routes(s);
             ioapic_service(s);
         }
     }
@@ -418,7 +423,6 @@ ioapic_mem_write(void *opaque, hwaddr addr, uint64_t val,
         break;
     }
 
-    ioapic_update_kvm_routes(s);
 }
 
 static const MemoryRegionOps ioapic_io_ops = {

Now, I don't quite see why we don't get a *third* interrupt, since
Linux did nothing to clear the level of IRQ 11 and the last trace I
see from QEMU's ioapic_set_irq confirms it's still set. But I've
exceeded my screen time for the day, so I'll have to frown at that
part some more later. I wonder if the EOI is going missing because
it's coming from the wrong CPU? Note there is no 'EOI broadcast' after
the last line in the log shown above; it isn't just that I trimmed it
there.
I don't think we need to do anything in Linux; if the handler gets
invoked on the wrong CPU it'll basically find no events pending for
that CPU and return having done nothing... and *hopefully* should be
re-invoked on the correct CPU shortly thereafter.