
Re: NetBSD dom0 PVH: hardware interrupts stalls



On 19.11.2020 18:57, Manuel Bouyer wrote:
> I added an ASSERT() after the printf to get a stack trace, and got:
> db{0}> call ioapic_dump_raw
> Register dump of ioapic0
> [  13.0193374] 00 08000000 00170011 08000000(XEN) vioapic.c:141:d0v0 apic_mem_readl:undefined ioregsel 3
> (XEN) vioapic.c:512:vioapic_irq_positive_edge: vioapic_deliver 2
> (XEN) Assertion '!print' failed at vioapic.c:512
> (XEN) ----[ Xen-4.15-unstable  x86_64  debug=y   Tainted:   C   ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82d0402c4164>] vioapic_irq_positive_edge+0x14e/0x150
> (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d0v0)
> (XEN) rax: ffff82d0405c806c   rbx: ffff830836650580   rcx: 0000000000000000
> (XEN) rdx: ffff8300688bffff   rsi: 000000000000000a   rdi: ffff82d0404b36b8
> (XEN) rbp: ffff8300688bfde0   rsp: ffff8300688bfdc0   r8:  0000000000000004
> (XEN) r9:  0000000000000032   r10: 0000000000000000   r11: 00000000fffffffd
> (XEN) r12: ffff8308366dc000   r13: 0000000000000022   r14: ffff8308366dc31c
> (XEN) r15: ffff8308366d1d80   cr0: 0000000080050033   cr4: 00000000003526e0
> (XEN) cr3: 00000008366c9000   cr2: 0000000000000000
> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen code around <ffff82d0402c4164> (vioapic_irq_positive_edge+0x14e/0x150):
> (XEN)  3d 10 be 1d 00 00 74 c2 <0f> 0b 55 48 89 e5 41 57 41 56 41 55 41 54 53 48
> (XEN) Xen stack trace from rsp=ffff8300688bfdc0:
> (XEN)    0000000200000086 ffff8308366dc000 0000000000000022 0000000000000000
> (XEN)    ffff8300688bfe08 ffff82d0402bcc33 ffff8308366dc000 0000000000000022
> (XEN)    0000000000000001 ffff8300688bfe40 ffff82d0402bd18f ffff830835a7eb98
> (XEN)    ffff8308366dc000 ffff830835a7eb40 ffff8300688bfe68 0100100100100100
> (XEN)    ffff8300688bfea0 ffff82d04026f6e1 ffff830835a7eb30 ffff8308366dc0f4
> (XEN)    ffff830835a7eb40 ffff8300688bfe68 ffff8300688bfe68 ffff82d0405cec80
> (XEN)    ffffffffffffffff ffff82d0405cec80 0000000000000000 ffff82d0405d6c80
> (XEN)    ffff8300688bfed8 ffff82d04022b6fa ffff83083663f000 ffff83083663f000
> (XEN)    0000000000000000 0000000000000000 0000000a7c62165b ffff8300688bfee8
> (XEN)    ffff82d04022b798 ffff8300688bfe08 ffff82d0402a4bcb 0000000000000000
> (XEN)    0000000000000206 ffff8316da86e61c ffff8316da86e600 ffff938031fd47c0
> (XEN)    0000000000000003 0000000000000400 ff889e8da08f928a 0000000000000000
> (XEN)    0000000000000002 0000000000000100 000000000000b86e ffff93803237f010
> (XEN)    0000000000000000 ffff8316da86e61c 0000beef0000beef ffffffff80555918
> (XEN)    000000bf0000beef 0000000000000046 ffff938031fd4790 000000000000beef
> (XEN)    000000000000beef 000000000000beef 000000000000beef 000000000000beef
> (XEN)    0000e01000000000 ffff83083663f000 0000000000000000 00000000003526e0
> (XEN)    0000000000000000 0000000000000000 0000060100000001 0000000000000000
> (XEN) Xen call trace:
> (XEN)    [<ffff82d0402c4164>] R vioapic_irq_positive_edge+0x14e/0x150
> (XEN)    [<ffff82d0402bcc33>] F arch/x86/hvm/irq.c#assert_gsi+0x5e/0x7b
> (XEN)    [<ffff82d0402bd18f>] F hvm_gsi_assert+0x62/0x77
> (XEN)    [<ffff82d04026f6e1>] F drivers/passthrough/io.c#dpci_softirq+0x261/0x29e
> (XEN)    [<ffff82d04022b6fa>] F common/softirq.c#__do_softirq+0x8a/0xbf
> (XEN)    [<ffff82d04022b798>] F do_softirq+0x13/0x15
> (XEN)    [<ffff82d0402a4bcb>] F vmx_asm_do_vmentry+0x2b/0x30
> (XEN) 
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) Assertion '!print' failed at vioapic.c:512
> (XEN) ****************************************

Right, this is the path I expected given what you sent earlier. It
turned my attention back to the 'i' debug key output you had sent
the other day. There we have

(XEN)    IRQ:  34 vec:51 IO-APIC-level   status=010 aff:{0}/{0-7} in-flight=1 d0: 34(-MM)

i.e. at that point we're waiting for Dom0 to signal that it's done
handling the IRQ. There is, however, a timer associated with this,
whose purpose is to prevent the system from getting stuck: the
"in-flight" state ought to clear 1ms later (when that timer expires).
Hence a dump ought to be pretty unlikely to catch it non-zero while
something is _also_ actually stuck.

So for the softirq to get Dom0 out of its stuck state, there has got to
be yet some other event. Nevertheless it may be worthwhile
instrumenting irq_guest_eoi_timer_fn() to prove we actually take this
path, i.e. that Xen is trying to "clean up" after Dom0 took too long to
service an IRQ. In normal operation this path shouldn't be taken, so I
wouldn't rule out that something got broken in that logic. (Orthogonal
to this, it may also be worth seeing whether increasing the timeout
would actually help. This wouldn't be a solution, but it would be
another data point hinting that something's wrong on this code path.)
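
As a rough illustration of the kind of instrumentation meant here, the
following is a standalone toy model in C, not Xen code. The struct, the
function name eoi_timer_model() and the message format are all made up
for the sketch; the real irq_guest_eoi_timer_fn() operates on Xen's own
IRQ descriptors. The point is only to count and log every time the
clean-up timer finds the IRQ still in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy model only: every name in this block is hypothetical. */
struct irq_model {
    unsigned int irq;
    unsigned int in_flight;      /* guests that have not yet signalled EOI */
    unsigned long timeout_hits;  /* times the clean-up timer had to act */
};

/*
 * Stand-in for the clean-up timer callback. Returns true if it had to
 * force-complete an IRQ the guest had not acknowledged in time.
 */
static bool eoi_timer_model(struct irq_model *m)
{
    assert(m != NULL);

    if (m->in_flight == 0)
        return false;            /* guest was on time, nothing to do */

    m->timeout_hits++;
    /* The instrumentation itself: one line per forced clean-up. */
    printf("irq %u: EOI timeout, in-flight=%u, hits=%lu\n",
           m->irq, m->in_flight, m->timeout_hits);
    m->in_flight = 0;
    return true;
}
```

Calling eoi_timer_model() on a model with in_flight == 1 logs one line
and bumps the counter; calling it again on the now idle model stays
silent. A counter that keeps growing would suggest the timeout path is
being taken regularly rather than as a rare exception.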

Roger, I'm also somewhat puzzled by the trailing (-MM): Is PVH using
event channels for delivering pIRQ-s? I thought that's purely vIO-APIC
and vMSI? I wonder whether we misleadingly dump info from evtchn 0
here, in which case only the 2nd of the M-s would be meaningful (and
would be in line with non-zero in-flight).
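
For anyone wanting to check many such 'i' debug key lines at once, here
is a small hedged sketch in C that pulls out the in-flight count and the
three flag characters. The helper names are made up. It assumes, in line
with the reading above, that the first two characters are derived from
the event channel and the third reflects the pIRQ being masked; that
interpretation is an assumption, not something the log itself states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for the fields of interest from one dump line. */
struct irq_line {
    long in_flight;
    char flags[4];   /* the three characters between '(' and ')' */
};

/* Extract "in-flight=N" and the "(xyz)" flags; false if not found. */
static bool parse_irq_line(const char *line, struct irq_line *out)
{
    assert(line != NULL && out != NULL);

    const char *p = strstr(line, "in-flight=");
    if (p == NULL)
        return false;
    out->in_flight = strtol(p + strlen("in-flight="), NULL, 10);

    p = strchr(p, '(');
    if (p == NULL || strlen(p) < 5 || p[4] != ')')
        return false;
    memcpy(out->flags, p + 1, 3);
    out->flags[3] = '\0';
    return true;
}

/* Assumed meaning: third flag 'M' means the pIRQ itself is masked. */
static bool pirq_masked_and_in_flight(const struct irq_line *l)
{
    return l->in_flight > 0 && l->flags[2] == 'M';
}
```

Fed the line quoted earlier, this yields in_flight 1 and flags "-MM",
so the pIRQ would count as masked with an EOI still outstanding.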

Jan