
Re: [Xen-users] Fatal Trap 18 (convincing hardware engineer)

No one else seems to have taken the bite, so I will even
though I may not be best qualified to do so.

Matthew Baker wrote:
> Hi all,
> I have 2 servers with identical hardware (lspci at the bottom of this
> email).
Having two identical servers helps. But it wasn't clear to me from your
description whether both of them show the fault.

If they behave differently, one of the machines may have a
substandard component. Record the serial numbers of all the
components, or label them yourself, then begin swapping them
between the machines. If you can get the fault to move from one
machine to the other, you may be able to pin it on one component.

Your hardware guy may have already tried the above. If you have two
machines and they both show the fault, that's trickier.

> An Extra Intel PRO/1000 MT Dual Port Server Adapter[1] has been
> connected into the second slot on a pci-x capable riser (the first slot
> taken by the SAS Raid controller).
> When this nic *is* connected *and* the boxes boot a Xen kernel (debian
> 4.0 2.6.18-5-xen and using Xen HyperVisor(PAE) 3.0.3-0-4) after about 2
> days I get this error on the console:
> (XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
> (XEN) ----[ Xen-3.0.3-1 x86_32p debug=n Not tainted ]----
> (XEN) CPU: 1
> (XEN) EIP: e008:[<ff1193be>]CPU: 3
> (XEN) EIP: e008:[<ff1193be>] idle_loop+0x4e/0x60 idle_loop+0x4e/0x60
> (XEN) EFLAGS: 00000246 CONTEXT: hypervisor
> (XEN) eax: 00000000 ebx: ffbeffb4 ecx: 00000001 edx: 00000000
> (XEN) esi: ffbeffb4 edi: ffbf6080 ebp: 000090dc esp: ffbeffa8
> (XEN) cr0: 8005003b cr4: 000006f0 cr3: a3363000 cr2: b7f2c260
> (XEN)
> (XEN) EFLAGS: 00000246 CONTEXT: hypervisor
> (XEN) ds: e010 es: e010 fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) eax: 00000000 ebx: ffbe3fb4 ecx: 096a03ba edx: ff18c080
> (XEN) Xen stack trace from esp=ffbeffa8:
> (XEN) esi: ffbf0080 edi: 07a0403a ebp: 000090dc esp: ffbe3fa8
> (XEN) 00000001cr0: 8005003b cr4: 000006f0 cr3: a1b80000 cr2: b7edd260
> (XEN) 00000001 00001000 00000001 00000000 00000000 00000001
> 00000001ds: e010 8(XEN)
> (XEN) 00000000Xen stack trace from esp=ffbe3fa8:
> (XEN) 00000000 00000001 00f90000 00000003 c01013a7 ffbf0080
> 00000061 00000001(XEN) 0000007b 0000007b 00000000 00000000 00000001
> ffbf6080 00000003
> (XEN) Xen call trace:
> (XEN) [<ff1193be>]
> (XEN) idle_loop+0x4e/0x60
> (XEN) 00000000
> (XEN) ************************************
> (XEN) 00000000CPU1 FATAL TRAP 18 (machine check), ERROR_CODE 0000.

I had a brief look to see if I could find the place in the source
code where this is being printed out, but I drew a blank. I must
not be looking at the right version of the source tree.
File .xen/arch/x86/traps.c looks like a good candidate.

> (XEN) System shutting down -- need manual reset.
> (XEN) ************************************
> The machine obviously hangs.
> If I remove the PCI NIC the machine stays up. If I boot into a vanilla
> kernel with the NIC in the box it stays up.
> I have NICs like these bought in batch running in other machines that
> are also running Xen. The machines aren't really used a great deal (at
> the moment although need to be soon) and as far as i can tell there's no
> other issue with respect to the system that is failing, i.e the obvious
> stuff like disk space running out or exhaustive cronjobs). There are no
> logs other than the one to the console suggesting a failure elsewhere.
> Our hardware engineer is convinced it's either a Xen or driver issue.

I can see why he might think so or want to say so.

> I've seen the thread at
> http://lists.xensource.com/archives/html/xen-users/2006-08/msg00792.html
> and have directed the engineer at this.
> My questions to the list are:
> 1. Can this be caused by anything else (other than hardware)?
> 2. Is there anything I can do to debug this further to confirm what part
> of the system is failing (e.g. either CPU/RAM or PCI/BUS timeout)?

Grasping at straws: could you try running a memory test program, e.g. memtest86?

Is this a server-class machine with ECC memory? If so, is it possible
to get the Linux kernel to report any soft memory errors that get corrected
by the ECC hardware?

Is there anything in linux/Documentation/drivers/edac/edac.txt
that might help? (I have not used this myself.) There may be
non-fatal errors happening before the fatal one.
Those might give you or your hardware engineer a clue as to
where else to look.
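
If the EDAC driver for your chipset is loaded, the corrected and
uncorrected error counters should show up under sysfs. Here is a rough
sketch of how you might poll them. The sysfs layout
(/sys/devices/system/edac/mc/mcN/ce_count and ue_count) is the one the
kernel EDAC documentation describes; whether your 2.6.18 kernel, or a Xen
dom0, actually exposes it on this hardware is an assumption I have not
checked. The function name read_edac_counts is my own.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}}.

    ce_count = corrected errors, ue_count = uncorrected errors.
    Returns an empty dict if the EDAC directory does not exist
    (driver not loaded, or not supported on this kernel).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    # One directory per memory controller: mc0, mc1, ...
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, values in read_edac_counts().items():
        print(controller, values)
```

Running this from cron every few minutes and logging the output would show
whether ce_count climbs in the hours before the crash. A rising count on
one controller would point at memory rather than at the NIC or Xen.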

How about building a Linux kernel with some form of debugging
turned on? This might help you to see if something is
scribbling on memory when it shouldn't be.

I don't really know the answer, but good link anyway.

> Any help on this would be greatly appreciated.
> Many thanks,
> Matt

Xen-users mailing list


