
Re: [Xen-users] Severe megasas_raid issues when using Xen dom0 linux kernels



Have you tried using the MegaRAID monitor to see whether you can
diagnose a hardware problem with the RAID? There is one
you can download and run on the Linux dom0, and there should be a monitor
you can get to from the BIOS as well. Those error messages look very
much like an actual hardware fault on the RAID array.

I have a lot of megasas RAID machines under both SL5 and SL6 and have used them
as Xen dom0 and KVM VM hosts without problems, across several different versions
of Xen.

Steve Timm



On Tue, 18 Oct 2011, David Della Vecchia wrote:

I've tried Debian stable and testing, and CentOS 5 and 6, with Xen 3.1-4.1 (about
5 different versions in between). I'm currently running the Xen 4.1.1 release on
CentOS 6 with M.A.Young's centos6 xen dom0 kernel. For some reason the RAID
array freaks out and the entire virtual device that the hardware RAID array
provides swaps to read-only mode. I've tried both RAID 0 and RAID 1 (2 1TB
SCSI drives). I've had this issue in every Xen install I've tried on this
box, no matter what kernel version (tried as new as 3.0.1 in Debian wheezy)
or Xen version (compiled and installed the unstable branch to test) I use.
The server was running stable and fine for about a week this time before
this:


[root@gibson ~]# df -h
-bash: /bin/df: Input/output error
[root@gibson ~]# w
-bash: /usr/bin/w: Input/output error
[root@gibson ~]# modinfo megasas_raid
-bash: /sbin/modinfo: Input/output error

Part of /var/log/messages:

Oct 17 13:21:09 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:10 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:10 gibson kernel: megasas: reset successful
Oct 17 13:21:20 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:20 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:21 gibson kernel: megasas: reset successful
Oct 17 13:21:21 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
Oct 17 13:21:21 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:21 gibson kernel: megasas: reset successful
Oct 17 13:21:41 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:21:41 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:42 gibson kernel: megasas: reset successful
Oct 17 13:21:42 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=2a retries=0
Oct 17 13:21:42 gibson kernel: megaraid_sas: no pending cmds after reset
Oct 17 13:21:42 gibson kernel: megasas: reset successful
Oct 17 13:22:02 gibson kernel: sd 0:2:0:0: [sda] megasas: RESET -85512 cmd=0 retries=0
Oct 17 13:22:02 gibson kernel: megasas: [ 0]waiting for 1 commands to complete
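
The reset pattern in this excerpt can be tallied mechanically. What follows is a minimal Python sketch and not part of the original report: the function names `unwrap` and `summarize_resets` are hypothetical, and the regular expressions assume kernel log lines in exactly the format quoted here. Because log excerpts pasted into mail are often re-wrapped, it first rejoins continuation lines, then counts RESET messages per device and SCSI opcode. In the standard SCSI command set, opcode 0x00 is TEST UNIT READY and 0x2a is WRITE(10).

```python
import re
from collections import Counter

# Syslog prefix as it appears in the excerpt, e.g.
# "Oct 17 13:21:20 gibson kernel: " (assumed format).
PREFIX = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \S+ kernel: ")

# RESET line emitted by the driver, as quoted in the excerpt.
RESET = re.compile(
    r"\[(?P<dev>\w+)\] megasas: RESET (?P<serial>-?\d+) "
    r"cmd=(?P<cmd>[0-9a-fA-F]+) retries=(?P<retries>\d+)"
)


def unwrap(lines):
    """Join mail-wrapped continuation lines onto the preceding log entry."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if PREFIX.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def summarize_resets(lines):
    """Count RESET messages, keyed by (device name, SCSI opcode as int)."""
    counts = Counter()
    for entry in unwrap(lines):
        match = RESET.search(entry)
        if match:
            counts[(match.group("dev"), int(match.group("cmd"), 16))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical usage: summarize the system log on the affected host.
    with open("/var/log/messages", errors="replace") as log:
        for (dev, opcode), n in sorted(summarize_resets(log).items()):
            print(f"{dev} opcode=0x{opcode:02x} resets={n}")
```

A steady alternation between the two opcodes, as in the excerpt, would suggest that both ordinary writes and the error handler's own readiness probes are timing out, which is consistent with either a controller fault or a driver problem and does not by itself distinguish between them.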


[root@gibson ~]# ls -al /bin/
ls: cannot access /bin/ntfs-3g.secaudit: Input/output error
ls: cannot access /bin/ntfstruncate: Input/output error
ls: cannot access /bin/ntfsdump_logfile: Input/output error
ls: cannot access /bin/ntfsls: Input/output error
ls: cannot access /bin/ntfsdecrypt: Input/output error
ls: cannot access /bin/ntfs-3g.usermap: Input/output error
ls: cannot access /bin/ntfsmount: Input/output error
ls: cannot access /bin/ntfsfix: Input/output error
ls: cannot access /bin/ntfscluster: Input/output error
total 8192
dr-xr-xr-x.  2 root root   4096 Oct 15 14:49 .
drwxr-xr-x. 29 root root   4096 Oct 17 12:34 ..
-rwxr-xr-x.  1 root root    123 Nov 10  2010 alsaunmute
-rwxr-xr-x   1 root root  27808 May 30 10:55 arch
lrwxrwxrwx.  1 root root      4 Oct 13 10:36 awk -> gawk
-rwxr-xr-x   1 root root  26264 May 30 10:55 basename
-rwxr-xr-x   1 root root 943248 May 30 11:46 bash
-rwxr-xr-x   1 root root  51344 May 30 10:55 cat
-rwxr-xr-x   1 root root  12200 Jun 25 05:02 cgclassify
-rwxr-xr-x   1 root root  12352 Jun 25 05:02 cgcreate
-rwxr-xr-x   1 root root  11528 Jun 25 05:02 cgdelete
-rwsr-xr-x   1 root root  12136 Jun 25 05:02 cgexec
-rwxr-xr-x   1 root root  15760 Jun 25 05:02 cgget
-rwxr-xr-x   1 root root  13160 Jun 25 05:02 cgset
-rwxr-xr-x   1 root root  55472 May 30 10:55 chgrp
-rwxr-xr-x   1 root root  52472 May 30 10:55 chmod
-rwxr-xr-x   1 root root  57496 May 30 10:55 chown
-rwxr-xr-x   1 root root 122344 May 30 10:55 cp
-rwxr-xr-x   1 root root 136096 Nov 10  2010 cpio
lrwxrwxrwx.  1 root root      4 Oct 13 11:00 csh -> tcsh
-rwxr-xr-x   1 root root  45472 May 30 10:55 cut
-rwxr-xr-x   1 root root 109896 Aug 18  2010 dash
-rwxr-xr-x   1 root root  59552 May 30 10:55 date
-rwxr-xr-x   1 root root  12552 Jun 25 06:47 dbus-cleanup-sockets
-rwxr-xr-x.  1 root root 339048 Jun 25 06:47 dbus-daemon
-rwxr-xr-x   1 root root  18464 Jun 25 06:47 dbus-monitor
-rwxr-xr-x   1 root root  22376 Jun 25 06:47 dbus-send
-rwxr-xr-x   1 root root  10912 Jun 25 06:47 dbus-uuidgen
-rwxr-xr-x   1 root root  54040 May 30 10:55 dd
-rwxr-xr-x   1 root root  70256 May 30 10:55 df
-rwxr-xr-x   1 root root   9896 Jun 25 02:46 dmesg
lrwxrwxrwx.  1 root root      8 Oct 13 10:36 dnsdomainname -> hostname
lrwxrwxrwx.  1 root root      8 Oct 13 10:36 domainname -> hostname
-rwxr-xr-x   1 root root  81120 Nov 11  2010 dumpkeys
-rwxr-xr-x   1 root root  27648 May 30 10:55 echo
-rwxr-xr-x   2 root root  53352 Nov 11  2010 ed
-rwxr-xr-x   1 root root 106528 Aug 25  2010 egrep
-rwxr-xr-x   1 root root  26368 May 30 10:55 env
lrwxrwxrwx.  1 root root      2 Oct 13 10:59 ex -> vi
-rwxr-xr-x   1 root root  24592 May 30 10:55 false
-rwxr-xr-x   1 root root  71328 Aug 25  2010 fgrep
-rwxr-xr-x   1 root root 238640 Nov 11  2010 find
-rwxr-xr-x   1 root root 382456 Nov 11  2010 gawk
-rwxr-xr-x   1 root root  33416 Nov 11  2010 gettext
-rwxr-xr-x   1 root root 110160 Aug 25  2010 grep
lrwxrwxrwx.  1 root root      3 Oct 13 10:36 gtar -> tar
-rwxr-xr-x.  1 root root     61 Nov 11  2010 gunzip
-rwxr-xr-x   1 root root  68544 Nov 11  2010 gzip
-rwxr-xr-x   1 root root  16192 Aug 24  2010 hostname
-rwxr-xr-x   1 root root  14872 Jun 25 00:09 ipcalc
lrwxrwxrwx.  1 root root     20 Oct 13 10:36 iptables-xml -> /sbin/iptables-multi
-rwxr-xr-x   1 root root  11248 Nov 11  2010 kbd_mode
-rwxr-xr-x   1 root root  24648 Aug 22  2010 keyctl
-rwxr-xr-x   1 root root  15128 Jun 25 02:46 kill
-rwxr-xr-x   1 root root  26256 May 30 10:55 link
-rwxr-xr-x   1 root root  49568 May 30 10:55 ln
-rwxr-xr-x   1 root root 112136 Nov 11  2010 loadkeys
-rwxr-xr-x   1 root root  30992 Jun 25 02:46 login
-rwxr-xr-x   1 root root  58368 Sep 12 13:32 lowntfs-3g
-rwxr-xr-x   1 root root 111744 May 30 10:55 ls
-rwxr-xr-x   1 root root  14008 Jun 25 05:02 lscgroup
-rwxr-xr-x   1 root root  12488 Jun 25 05:02 lssubsys
lrwxrwxrwx.  1 root root      5 Oct 13 10:37 mail -> mailx
-rwxr-xr-x   1 root root 390360 Aug 22  2010 mailx
-rwxr-xr-x   1 root root  48544 May 30 10:55 mkdir
-rwxr-xr-x   1 root root  32352 May 30 10:55 mknod
-rwxr-xr-x   1 root root  37352 May 30 10:55 mktemp
-rwxr-xr-x   1 root root  41144 Jun 25 02:46 more
-rwsr-xr-x.  1 root root  74712 Jun 25 02:46 mount
-rwxr-xr-x   1 root root   9800 Aug 24  2010 mountpoint
-rwxr-xr-x   1 root root 111536 May 30 10:55 mv
-rwxr-xr-x   1 root root 177360 Nov 12  2010 nano
-rwxr-xr-x   1 root root 127816 Aug 24  2010 netstat
-rwxr-xr-x   1 root root  28816 May 30 10:55 nice
lrwxrwxrwx.  1 root root      8 Oct 13 10:36 nisdomainname -> hostname
-rwxr-xr-x   1 root root  53576 Sep 12 13:32 ntfs-3g
-rwxr-xr-x   1 root root  11016 Sep 12 13:32 ntfs-3g.probe
-??????????  ? ?    ?         ?            ? ntfs-3g.secaudit
-??????????  ? ?    ?         ?            ? ntfs-3g.usermap
-rwxr-xr-x   1 root root  29896 Sep 12 13:32 ntfscat
-rwxr-xr-x   1 root root  32992 Sep 12 13:32 ntfsck
-??????????  ? ?    ?         ?            ? ntfscluster
-rwxr-xr-x   1 root root  36320 Sep 12 13:32 ntfscmp
-??????????  ? ?    ?         ?            ? ntfsdecrypt
-??????????  ? ?    ?         ?            ? ntfsdump_logfile
-??????????  ? ?    ?         ?            ? ntfsfix
-rwxr-xr-x   1 root root  57240 Sep 12 13:32 ntfsinfo
-??????????  ? ?    ?         ?            ? ntfsls
-rwxr-xr-x   1 root root  30448 Sep 12 13:32 ntfsmftalloc
l??????????  ? ?    ?         ?            ? ntfsmount
-rwxr-xr-x   1 root root  34000 Sep 12 13:32 ntfsmove
-??????????  ? ?    ?         ?            ? ntfstruncate
-rwxr-xr-x   1 root root  42240 Sep 12 13:32 ntfswipe
-rwsr-xr-x   1 root root  41432 Nov 11  2010 ping
-rwsr-xr-x   1 root root  36256 Nov 11  2010 ping6
-rwxr-xr-x   1 root root  35640 Oct 31  2010 plymouth
-rwxr-xr-x   1 root root  86776 Nov 11  2010 ps
-rwxr-xr-x   1 root root  31656 May 30 10:55 pwd
-rwxr-xr-x   1 root root  11528 Jun 25 02:46 raw
-rwxr-xr-x   1 root root  40056 May 30 10:55 readlink
-rwxr-xr-x   2 root root  53352 Nov 11  2010 red
-rwxr-xr-x.  1 root root    576 Apr 16  2008 redhat_lsb_init
-rwxr-xr-x   1 root root  57504 May 30 10:55 rm
-rwxr-xr-x   1 root root  40544 May 30 10:55 rmdir
lrwxrwxrwx.  1 root root      4 Oct 13 10:39 rnano -> nano
-rwxr-xr-x   1 root root  29904 Nov 11  2010 rpm
lrwxrwxrwx.  1 root root      2 Oct 13 10:59 rvi -> vi
lrwxrwxrwx.  1 root root      2 Oct 13 10:59 rview -> vi
-rwxr-xr-x   1 root root  72248 Aug 22  2010 sed
-rwxr-xr-x   1 root root  42312 Nov 11  2010 setfont
-rwxr-xr-x   1 root root  23600 Aug 22  2010 setserial
lrwxrwxrwx.  1 root root      4 Oct 13 10:36 sh -> bash
-rwxr-xr-x   1 root root  27880 May 30 10:55 sleep
-rwxr-xr-x   1 root root  99000 May 30 10:55 sort
-rwxr-xr-x   1 root root  65864 May 30 10:55 stty
-rwsr-xr-x   1 root root  36440 May 30 10:55 su
-rwxr-xr-x   1 root root  25464 May 30 10:55 sync
-rwxr-xr-x   1 root root 384920 Nov 11  2010 tar
-rwxr-xr-x   1 root root  14808 Jun 25 02:46 taskset
-rwxr-xr-x   1 root root 391288 Jun 25 02:05 tcsh
-rwxr-xr-x   1 root root  51952 May 30 10:55 touch
-rwxr-xr-x.  1 root root  11392 Nov 11  2010 tracepath
-rwxr-xr-x.  1 root root  12304 Nov 11  2010 tracepath6
-rwxr-xr-x   1 root root  57384 Nov 11  2010 traceroute
lrwxrwxrwx.  1 root root     10 Oct 13 10:39 traceroute6 -> traceroute
-rwxr-xr-x   1 root root  24592 May 30 10:55 true
-rwsr-xr-x.  1 root root  49280 Jun 25 02:46 umount
-rwxr-xr-x   1 root root  27808 May 30 10:55 uname
-rwxr-xr-x.  1 root root   2555 Nov 11  2010 unicode_start
-rwxr-xr-x.  1 root root    363 Nov 11  2010 unicode_stop
-rwxr-xr-x   1 root root  26264 May 30 10:55 unlink
-rwxr-xr-x   1 root root  10208 Jun 25 00:09 usleep
-rwxr-xr-x   1 root root 771800 Jun 25 04:43 vi
lrwxrwxrwx.  1 root root      2 Oct 13 10:59 view -> vi
lrwxrwxrwx.  1 root root      8 Oct 13 10:36 ypdomainname -> hostname
-rwxr-xr-x.  1 root root     62 Nov 11  2010 zcat
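
The nine entries shown with `?` in every metadata column are the ones `ls` could not stat, matching the nine Input/output errors at the top of the listing. As a minimal Python sketch (the function name `unreadable_entries` is hypothetical, and it assumes `ls -al` output in the format shown here), those names can be pulled out of a saved listing like this:

```python
def unreadable_entries(listing):
    """Return names whose mode field is all '?' after the file-type character."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # "total N" line, "ls: cannot access" errors, blank lines
        mode = fields[0]
        # A failed stat is printed as a type character followed only by '?'.
        if len(mode) > 1 and set(mode[1:]) == {"?"}:
            names.append(fields[-1])
    return names
```

Comparing the returned names across several incidents would show whether the same files fail each time or whether the damage moves around.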

Here is the rough partition information for my main drive:

/boot primary ext3 1gb /dev/sda1
/dev/sda2 extended lvm pv 925gb
vg_gibson lvm-volumegroup 925gb
/ lv_root ext3 36gb
swap lv_swap 2gb

Server Specs:

Dell PowerEdge R710
32GB ECC Unbuffered RAM
2x Intel Xeon Quad Core HT 2.3GHz (16 "cores" total)
2x 1TB WD SCSI Drives in RAID-1

Drive Nitty Gritty:
Product ID: WDC WD1002FBYS-0
Revision: 0C06
Size: 953344MB

Here's some more information about the RAID controller, also obtained from the
RAID controller config utility:

Product Name: PERC 6/i
Package: 6.2.0-0013
FW Version: 1.22.02-0612
BIOS Version: 2.04.00
CtrlR Version: 1.02-015B
Boot Block: 1.00.00.01-0011

Application & OS Specs:
CentOS 6 w/2.6.32-131 M.A.Young centos6 xen dom0 kernel

Diagnostic Attempts and Results:

I've run a consistency check on the RAID array and everything comes back as
clean and optimal. I've run checks for bad blocks, partition table corruption,
and MBR corruption, everything I can think of. It all comes back as clean and
working fine. Because of these results I have not been able to force my
dedicated hosting company to replace any of the hardware. They are upgrading
the RAID controller software, as it's about 1 minor version out of date, just
to see if that could be the issue. I'll report back if that mysteriously
fixes it, but I'm not holding my breath.

I've read somewhere that the 2.6.x kernels have an old version of the
megaraid_sas module that will cause problems, but the version included in the
M.A.Young centos6 kernel is version 5.3, which is far beyond the 4.3 version
that article recommends upgrading to, so I'm really at a loss. Besides the
version being so new, the problem described in that article (the kernel not
finding the drive at all on boot) is not the issue I'm having. It just
freaks out randomly (I'm sure it's not really random, it just appears that
way), the OS swaps to read-only mode, and the only way to reboot is
basically to push the button on the front of the box.

Please, if anyone can direct me towards a solution, or at least down a path I
have yet to try, I would greatly appreciate it. I'm at my wits' end. I've been
fighting this mysterious monster for over a month now, and it always seems to
strike right before I'm about to go live with my services (the first time it
happened was right after I started adding customers to the box).

Thanks in advance,
David


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users


 

