
Re: [Xen-users] Fatal crash: ACPI assigning duplicate physical IRQ's to second DomU



More info - here are the lspci output and the contents of /proc/interrupts for each domain.

Dom0
####
[root@dns ~]# cat /proc/interrupts
          CPU0
 1:          8        Phys-irq  i8042
 3:          2        Phys-irq  ehci_hcd:usb3
 4:          0        Phys-irq  ohci_hcd:usb2
 8:          1        Phys-irq  rtc
10:        303        Phys-irq  eth2
11:       4347        Phys-irq  ohci_hcd:usb1, libata
12:        113        Phys-irq  i8042
14:       2112        Phys-irq  ide0
15:        416        Phys-irq  ide1
256:       5804     Dynamic-irq  timer0
257:          0     Dynamic-irq  resched0
258:          0     Dynamic-irq  callfunc0
259:          0     Dynamic-irq  xenbus
260:          0     Dynamic-irq  console
NMI:          0
LOC:          0
ERR:          0
MIS:          0
####

DomU - this has the problem (see IRQ 11)
####
[root@intranet ~]# cat /proc/interrupts
          CPU0
11:       1393        Phys-irq  eth1
256:       3073     Dynamic-irq  timer0
257:          0     Dynamic-irq  resched0
258:          0     Dynamic-irq  callfunc0
259:        104     Dynamic-irq  xenbus
260:        147     Dynamic-irq  xencons
261:       1799     Dynamic-irq  blkif
262:        103     Dynamic-irq  eth0
NMI:          0
LOC:          0
ERR:          0
MIS:          0
####

DomU (this is the other DomU that works fine)
####
[root@gateway ~]# cat /proc/interrupts
          CPU0
 5:         17        Phys-irq  eth1
256:       3111     Dynamic-irq  timer0
257:          0     Dynamic-irq  resched0
258:          0     Dynamic-irq  callfunc0
259:        104     Dynamic-irq  xenbus
260:        170     Dynamic-irq  xencons
261:       1859     Dynamic-irq  blkif
262:        130     Dynamic-irq  eth0
NMI:          0
LOC:          0
ERR:          0
MIS:          0
####

Dom0 PCI (using pciback to reserve 01:09.0 and 01:0a.0; 01:0a.0 is the one getting the problem IRQ)
####
[root@dns ~]# lspci
00:00.0 Host bridge: nVidia Corporation nForce2 AGP (different version?) (rev c1)
00:00.1 RAM memory: nVidia Corporation nForce2 Memory Controller 1 (rev c1)
00:00.2 RAM memory: nVidia Corporation nForce2 Memory Controller 4 (rev c1)
00:00.3 RAM memory: nVidia Corporation nForce2 Memory Controller 3 (rev c1)
00:00.4 RAM memory: nVidia Corporation nForce2 Memory Controller 2 (rev c1)
00:00.5 RAM memory: nVidia Corporation nForce2 Memory Controller 5 (rev c1)
00:01.0 ISA bridge: nVidia Corporation nForce2 ISA Bridge (rev a3)
00:01.1 SMBus: nVidia Corporation nForce2 SMBus (MCP) (rev a2)
00:02.0 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:02.1 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:02.2 USB Controller: nVidia Corporation nForce2 USB Controller (rev a3)
00:08.0 PCI bridge: nVidia Corporation nForce2 External PCI Bridge (rev a3)
00:09.0 IDE interface: nVidia Corporation nForce2 IDE (rev a2)
00:1e.0 PCI bridge: nVidia Corporation nForce2 AGP (rev c1)
01:08.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8169 Gigabit Ethernet (rev 10)
01:09.0 Ethernet controller: Lite-On Communications Inc LNE100TX (rev 20)
01:0a.0 Ethernet controller: Lite-On Communications Inc LNE100TX (rev 20)
01:0b.0 RAID bus controller: Silicon Image, Inc. SiI 3112 [SATALink/SATARaid] Serial ATA Controller (rev 02)
02:00.0 VGA compatible controller: ATI Technologies Inc Radeon R300 ND [Radeon 9700 Pro]
02:00.1 Display controller: ATI Technologies Inc Radeon R300 [Radeon 9700 Pro] (Secondary)
####
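
The overlap in the listings above can also be checked mechanically. The following is only a rough sketch, not part of the Xen tools: it assumes the /proc/interrupts text from each domain has been saved into a string, and the helper names (parse_phys_irqs, find_conflicts) are made up for illustration. It reports every physical IRQ that shows up in more than one domain.

```python
import re
from collections import defaultdict

# Matches lines such as "11:       4347        Phys-irq  ohci_hcd:usb1, libata".
# One or more per-CPU counter columns are allowed before the "Phys-irq" type.
PHYS_IRQ_RE = re.compile(r"\s*(\d+):\s+(?:\d+\s+)+Phys-irq\s+(.+)$")


def parse_phys_irqs(text):
    """Return {irq_number: [handler names]} for the Phys-irq lines only."""
    irqs = {}
    for line in text.splitlines():
        m = PHYS_IRQ_RE.match(line)
        if m:
            irqs[int(m.group(1))] = [d.strip() for d in m.group(2).split(",")]
    return irqs


def find_conflicts(domains):
    """domains maps a domain name to its /proc/interrupts text.

    Returns {irq: [(domain, handlers), ...]} for every physical IRQ
    that is bound in more than one domain.
    """
    owners = defaultdict(list)
    for name, text in domains.items():
        for irq, handlers in parse_phys_irqs(text).items():
            owners[irq].append((name, handlers))
    return {irq: who for irq, who in owners.items() if len(who) > 1}


if __name__ == "__main__":
    # Excerpts of the listings posted above (hostnames used as domain names).
    domains = {
        "dns": "10:        303        Phys-irq  eth2\n"
               "11:       4347        Phys-irq  ohci_hcd:usb1, libata\n",
        "intranet": "11:       1393        Phys-irq  eth1\n",
        "gateway": " 5:         17        Phys-irq  eth1\n",
    }
    for irq, who in find_conflicts(domains).items():
        print("IRQ", irq, "is bound in:", who)
```

With the excerpts above, only IRQ 11 is flagged: it is bound to ohci_hcd:usb1 and libata in the Dom0 and to eth1 in the problem DomU.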


Hilton Day wrote:
Hi,

I've got a problem with ACPI assigning duplicate physical IRQs to one
of my DomUs that I'm passing a PCI NIC to.  Can anyone shed some light
on ways I can avoid this problem with IRQ allocation?  I can see the
IRQ allocations using /proc/interrupts and see the conflict.

In my Dom0 I have 3 network cards.  eth0 and eth1 are identical
tulip-based 100 Mbit cards, and eth2 is a Realtek gigabit card that I'm
using as the xen-bridge. I have this problem with a variety of different
kernels - currently running kernel-xen-2.6.18-1 (Fedora Core 6
development tree) on all hosts, with xen-3.0.2.

Pass-through always works just fine for one of my DomUs, and ACPI
allocates an unused physical IRQ with no problems.  However, in a second
DomU, it consistently allocates the same IRQ as is used by my onboard
SATA controller (libata).

When the second DomU is running, I get a fatal crash that also destroys
my RAID volume info, and severely damages the filesystem.  I've had to
manually rebuild the RAID each time this happens, so that I can try a
new alternative solution.  The crash typically happens within a few
minutes of booting.

After reading the archives of this list, as well as other lists, I've
tried putting "noirqdebug" as a kernel parameter in both the Dom0 and
DomU, and also made use of "noapic" and "acpi=off", as well as disabling
ACPI in my motherboard's BIOS (system is an Athlon XP running on an
nForce2 motherboard with 2 GB of RAM).  None of them resolves the
conflict - it appears to be a bug that affects pass-through of PCI
devices and IRQ allocation?

I've also tried a variety of other ethernet devices (the forcedeth
driver for the nForce2 onboard NIC, and also the natsemi driver for a
Netgear FA311) to pass through to the second DomU, with the same result.
Moving the PCI card to a different PCI bus address/slot doesn't resolve
the problem either.

I managed to grab dmesg outputs from the Dom0 and the problem DomU last
time it crashed -

The message I'm getting to console in the domU is:
####
irq 11: nobody cared (try booting with the "irqpoll" option)
[<c040569e>] dump_trace+0x69/0x1af
[<c04057fc>] show_trace_log_lvl+0x18/0x2c
[<c0405d9c>] show_trace+0xf/0x11
[<c0405dcb>] dump_stack+0x15/0x17
[<c044636e>] __report_bad_irq+0x36/0x7d
[<c044655b>] note_interrupt+0x1a6/0x1e3
[<c0445bda>] __do_IRQ+0xba/0xf2
[<c0406c2c>] do_IRQ+0x9e/0xbc
=======================
handlers:
[<d10636e8>] (tulip_interrupt+0x0/0xdb8 [tulip])
Disabling IRQ #11
end_request: I/O error, dev xvda, sector 42806344
Buffer I/O error on device xvda3, logical block 5061623
lost page write due to I/O error on xvda3
####

The dmesg output from the Dom0 following booting of the second DomU is:

####
PCI: Enabling device 0000:01:0a.0 (0000 -> 0003)
ACPI: PCI Interrupt 0000:01:0a.0[A] -> Link [LNK3] -> GSI 11 (level, low) -> IRQ 11
ADDRCONF(NETDEV_CHANGE): vif4.0: link becomes ready
xenbr0: port 4(vif4.0) entering learning state
xenbr0: topology change detected, propagating
xenbr0: port 4(vif4.0) entering forwarding state
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata1.00: (BMDMA stat 0x64)
ata1.00: tag 0 cmd 0x25 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata2.00: (BMDMA stat 0x64)
ata2.00: tag 0 cmd 0xca Emask 0x4 stat 0x40 err 0x0 (timeout)
ata1: soft resetting port
ata2: soft resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata2.00: revalidation failed (errno=-5)
ata2: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata2: hard resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: failed to recover some devices, retrying in 5 secs
ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata2.00: revalidation failed (errno=-5)
ata2: failed to recover some devices, retrying in 5 secs
ata1: hard resetting port
ata2: hard resetting port
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata2.00: revalidation failed (errno=-5)
ata2.00: disabled
ata1: EH complete
ata2: EH complete
sd 0:0:0:0: SCSI error: return code = 0x00040000
end_request: I/O error, dev sda, sector 58152585
raid5:md3: read error not correctable (sector 46682112 on sda5).
raid5: Disk failure on sda5, disabling device. Operation continuing on 1 devices
raid5:md3: read error not correctable (sector 46682120 on sda5).
####
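
To see which IRQ the Dom0 routes each passed-through device to, the "ACPI: PCI Interrupt" lines in the log above can be pulled out with a small script. This is only a sketch under assumptions: the function name parse_routes is hypothetical, and the pattern matches just the line format seen in this log (device, pin, link, GSI, trigger, polarity, IRQ); other kernels may print the routing differently.

```python
import re

# Matches the routing line as printed in the log above, e.g.
# "ACPI: PCI Interrupt 0000:01:0a.0[A] -> Link [LNK3] -> GSI 11 (level, low) -> IRQ 11"
ROUTE_RE = re.compile(
    r"ACPI: PCI Interrupt "
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\[(?P<pin>[A-D])\]"
    r" -> Link \[(?P<link>\w+)\]"
    r" -> GSI (?P<gsi>\d+) \((?P<trigger>\w+),\s*(?P<polarity>\w+)\)"
    r" -> IRQ (?P<irq>\d+)"
)


def parse_routes(dmesg_text):
    """Return one dict per ACPI PCI interrupt routing line found in the text."""
    # Mail clients may wrap long log lines, so collapse line breaks first.
    flat = re.sub(r"\s*\n\s*", " ", dmesg_text)
    routes = []
    for m in ROUTE_RE.finditer(flat):
        route = m.groupdict()
        route["gsi"] = int(route["gsi"])
        route["irq"] = int(route["irq"])
        routes.append(route)
    return routes


if __name__ == "__main__":
    sample = (
        "PCI: Enabling device 0000:01:0a.0 (0000 -> 0003)\n"
        "ACPI: PCI Interrupt 0000:01:0a.0[A] -> Link [LNK3] -> GSI 11 (level,\n"
        "low) -> IRQ 11\n"
        "ADDRCONF(NETDEV_CHANGE): vif4.0: link becomes ready\n"
    )
    for r in parse_routes(sample):
        print(r["dev"], "pin", r["pin"], "via", r["link"], "-> IRQ", r["irq"])
```

The device address and IRQ from each match can then be compared against the IRQs already bound in the Dom0's /proc/interrupts.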


Please, any help in resolving this would be appreciated - I'd like to
get this host up and running!

Thanks,

Hilton.


_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users

