[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index] Re: [Xen-devel] Xen-unstable: xen panic RIP: dpci_softirq
Monday, November 17, 2014, 5:34:16 PM, you wrote: > On Fri, Nov 14, 2014 at 11:09:58PM +0100, Sander Eikelenboom wrote: >> >> Friday, November 14, 2014, 9:25:13 PM, you wrote: >> >> > On Fri, Nov 14, 2014 at 05:59:23PM +0100, Sander Eikelenboom wrote: >> >> >> >> Friday, November 14, 2014, 4:43:58 PM, you wrote: >> >> >> >> >>>> On 14.11.14 at 16:20, <linux@xxxxxxxxxxxxxx> wrote: >> >> >> If it still helps i could try Andrews suggestion and try out with only >> >> >> commit aeeea485 .. >> >> >> >> > Yes, even if it's pretty certain it's the second of the commits, >> >> > verifying >> >> > this would be helpful (or if the assumption is wrong, the pattern it's >> >> > dying with would change and hence perhaps provide further clues). >> >> >> >> > Jan >> >> >> >> >> >> Ok with a revert of f6dd295 .. it survived cooking and eating a nice bowl >> >> of >> >> pasta without a panic. So it would probably be indeed that specific >> >> commit. >> >> > Could you try running with these two patches while you enjoy an beer in >> > the evening? >> >> Hmm i didn't expect it not to panic and reboot anymore :-) > I should have also asked for your to run with 'iommu=verbose,debug', but > that can be done later.. I was running with iommu=on,verbose,amd-iommu-debug .. > The guest d16 looks to have two PCI passthrough devices: > XEN) [2014-11-14 21:31:26.569] io.c:550: d16: bind: m_gsi=37 g_gsi=36 > dev=00.00.5 intx=0 > XEN) [2014-11-14 21:31:28.095] io.c:550: d16: bind: m_gsi=47 g_gsi=40 > dev=00.00.6 intx=0 > And one of them uses just the GSI while the other uses four MSI-X, is > that about right? Yes guest 16 has 1 USB controller(guest side 00:05.0) which has MSI-X enabled, and 1 conexant video-grabber (guest side 00:06.0) which should be MSI capable, but is is not enabled (probably by the driver) so using legacy interrupts. > I tried to reproduce that on my AMD box with two NICs: > # lspci > 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02) > 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] > 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II] > 00:01.2 USB Controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton > II] (rev 01) > 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01) > 00:02.0 VGA compatible controller: Technical Corp. Device 1111 > 00:03.0 Class ff80: XenSource, Inc. Xen Platform Device (rev 01) > 00:04.0 Ethernet controller: Intel Corporation 82576 Gigabit Network > Connection (rev 01) > 00:05.0 Ethernet controller: Intel Corporation 82541PI Gigabit Ethernet > Controller (rev 05) > # cat /proc/interrupts |grep eth > 36: 384183 0 xen-pirq-ioapic-level eth0 > 63: 1 0 xen-pirq-msi-x eth1 > 64: 24 661961 xen-pirq-msi-x eth1-rx-0 > 65: 205 0 xen-pirq-msi-x eth1-rx-1 > 66: 162 0 xen-pirq-msi-x eth1-tx-0 > 67: 190 0 xen-pirq-msi-x eth1-tx-1 > Is that a similar distribution of IRQ/MSIx you end up having? These are when they are still active and assigned to dom0 (and not owned by pci-back) or in the guest ? I attached my /proc/interrupts for both dom0 as guest 16 with all guests running (on a Xen from before the dpci changes). With the devices passed through I only see one line with the IRQ of a PCI soundcard passed through to a PV guest: 22: 38959 0 0 0 0 0 xen-pirq-ioapic-level xen-pciback[0000:03:06.0] All the other devices passed through (to HVM guests) are not visible in /proc/interrupts of dom0. In the guest i do get these: 23: 35 0 0 0 xen-pirq-ioapic-level uhci_hcd:usb3 40: 13440077 0 0 0 xen-pirq-ioapic-level cx25821[1], cx25821[1] 84: 2956369 0 0 0 xen-pirq-msi-x xhci_hcd 85: 0 0 0 0 xen-pirq-msi-x xhci_hcd 86: 0 0 0 0 xen-pirq-msi-x xhci_hcd 87: 0 0 0 0 xen-pirq-msi-x xhci_hcd 88: 0 0 0 0 xen-pirq-msi-x xhci_hcd >> >> However xl dmesg (complete one attached) showed it would have: >> >> (XEN) [2014-11-14 21:35:50.646] --MARK-- >> (XEN) [2014-11-14 21:35:56.861] grant_table.c:305:d0v0 Increased maptrack >> size to 9 frames >> (XEN) [2014-11-14 21:36:00.647] --MARK-- >> (XEN) [2014-11-14 21:36:10.410] grant_table.c:1299:d16v1 Expanding dom (16) >> grant table from (5) to (6) frames. >> (XEN) [2014-11-14 21:36:10.820] --MARK-- >> (XEN) [2014-11-14 21:36:20.820] --MARK-- >> (XEN) [2014-11-14 21:36:30.820] --MARK-- >> (XEN) [2014-11-14 21:36:40.821] --MARK-- >> (XEN) [2014-11-14 21:36:50.821] --MARK-- >> (XEN) [2014-11-14 21:37:00.388] CPU00: >> (XEN) [2014-11-14 21:37:00.399] CPU01: >> (XEN) [2014-11-14 21:37:00.410] d16 OK-softirq 20msec ago, state:1, 41220 >> count, [prev:ffff83054ef5e3e0, next:ffff83054ef5e3e0] PIRQ:0 >> (XEN) [2014-11-14 21:37:00.445] d16 OK-raise 46msec ago, state:1, 41223 >> count, [prev:0000000000200200, next:0000000000100100] PIRQ:0 >> (XEN) [2014-11-14 21:37:00.481] d16 ERR-poison 92msec ago, state:0, 1 count, >> [prev:0000000000200200, next:0000000000100100] PIRQ:0 >> (XEN) [2014-11-14 21:37:00.515] d16 Z-softirq 28853msec ago, state:2, 1 >> count, [prev:0000000000200200, next:0000000000100100] PIRQ:0 > The PIRQ:0 would imply that this is the legacy interrupt - which would be you > 0a:00.0 device > (Conexant Systems, Inc. Device 8210). Correct. > And it is pounding on this CPU - and the issue is that the > 'test_and_clear_bit' ends > up returning 0 - which means it was not able to set STATE_SCHED: > (!?) > if ( test_and_clear_bit(STATE_SCHED, &pirq_dpci->state) ) > { > > hvm_dirq_assist(d, pirq_dpci); > > put_domain(d); > > } > > else > > { > > _record(&debug->zombie_softirq, pirq_dpci); > which causes us to record it [Z-softirq], which says we we are in state 2 > (1<<STATE_RUN). > reset = 1; > > } > .. eons ago (28853msec). > Hmm. There is something fishy there but the only theory I have is that > we end up doing 'list_del' twice on different CPUs on the same structure. The pounding would be correct .. since it's a videograbber ... wouldn't be fun not stretching the limits ;-) (however it's running fine for about 2 or 3 years) > That should not be possible, but then this check - 'test_and_clear_bit' > returned > 0 which means that somebody had cleared it (or it failed to clear it?) > But the only other path for 'clearing' it is via the error paths and you are > not hitting any of them. > In the mean-time, could you try this patch. It adds a bit more debug to help > me figure this out. Ok will do this evening, thx ! > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c > index 23e5ed1..443975c 100644 > --- a/xen/drivers/passthrough/io.c > +++ b/xen/drivers/passthrough/io.c > @@ -126,17 +126,17 @@ static void dump_record(struct _debug_f *d, unsigned > int type) > BUG(); > > now = NOW(); > - printk("d%d %s %lumsec ago, state:%x, %ld count, [prev:%p, next:%p] ", > + printk("d%d %s %lumsec ago, state:%x, %ld count, [prev:%p, next:%p] %p", > d->domid, names[type], > (unsigned long)((now - d->last) / MILLISECS(1)), - d->>state, d->count, d->list.prev, d->list.next); + d->>state, d->count, d->list.prev, d->list.next, d->dpci); > > if ( d->dpci ) > { > struct hvm_pirq_dpci *pirq_dpci = d->dpci; > > for ( i = 0; i <= _HVM_IRQ_DPCI_GUEST_MSI_SHIFT; i++ ) > - if ( pirq_dpci->flags & 1 << _HVM_IRQ_DPCI_TRANSLATE_SHIFT ) > + if ( pirq_dpci->flags & (1 << i) ) > printk("%s ", names_flag[i]); > > printk(" PIRQ:%d", pirq_dpci->pirq); Attachment:
proc-interrupts-dom0.txt Attachment:
proc-interrupts-guest.txt _______________________________________________ Xen-devel mailing list Xen-devel@xxxxxxxxxxxxx http://lists.xen.org/xen-devel
|
Lists.xenproject.org is hosted with RackSpace, monitoring our |