[Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2

Hello xen-devel,

As one of the people using Dell servers, I am aware that the LSI MegaRAID drivers are quite old in the current 2.6.32 pvops tree, but it seems that, once again, I have run into a problem that is rarer than the usual "can't find disk" issues (of which I have never had any).

The situation:
I have 2 dom0 kernels, and 3.0.1 that work fine when booted bare-metal. I can run stress -m 40 -d 4 -i 1 for hours on end without any error occurring.
The kernels use version megasas modules.

When I boot these kernels on my R610 servers under Xen (4.1 and 4.2), they work fine too. I create 10 virtual machines, each running 4 "stress -m 40", and can do disk I/O on my local storage as much as I want.

But on my Dell R710 system things don't look so good.
Booted bare-metal, both kernels work fine.
When I boot them as dom0 under Xen, everything seems to be okay at first.
Then I create my 10 virtual machines that put some load on the memory.
As soon as I do I/O to the local disk (even an "ls /usr/src/" can suffice), I/O freezes and the system stops responding to anything that requires disk access.
After a while the kernel starts spewing out error messages:

#### lots of these
sd 0:2:0:0: [sda] megasas: RESET -83318 cmd=2a retries=0
megaraid_sas: HBA reset handler invoked without an internal reset condition.
megasas: [ 0]waiting for 16 commands to complete
megaraid_sas: no more pending commands remain after reset handling.
megasas: reset successful

### then some of these
sd 0:2:0:0: Device offlined - not ready after error recovery

### goes on to
sd 0:2:0:0: [sda] Unhandled error code
sd 0:2:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
sd 0:2:0:0: [sda] CDB: Write(10): 2a 00 08 45 6f 00 00 01 88 00
end_request: I/O error, dev sda, sector 138768128
Buffer I/O error on device sda2, logical block 5138912
lost page write due to I/O error on sda2
Buffer I/O error on device sda2, logical block 5138913

### and finally these, as often as one tries to access the disk
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
sd 0:2:0:0: rejecting I/O to offline device
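
As a sanity check on the log above, the failing CDB can be decoded by hand. The following is a minimal Python sketch, assuming the standard SCSI WRITE(10) layout (opcode 0x2a, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8); the helper name decode_write10 is only illustrative.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Assumes the standard 10-byte WRITE(10) layout.
    """
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # transfer length in blocks
    return lba, blocks


# CDB copied from the "[sda] CDB: Write(10)" log line
lba, blocks = decode_write10("2a 00 08 45 6f 00 00 01 88 00")
print(lba, blocks)  # prints: 138768128 392
```

The decoded LBA matches the sector reported by end_request, and the opcode matches cmd=2a in the reset messages, so the command that timed out appears to be an ordinary 392-block write rather than anything exotic.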

If a kernel works fine on one set of servers (Dell R610 with LSI Logic / Symbios Logic LSI MegaSAS 9260 (rev 05) RAID controllers) and crashes on another server (Dell R710 with an LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04) RAID controller), it would seem logical to assume that the kernel does not support the hardware properly.
But when run bare-metal, no errors occur.

I for one have run out of things to try. The R710 worked fine before I upgraded its firmware to the most current versions and went from Xen 4.0.1 to Xen 4.1/4.2.

So I put it to you, fine sirs of xen-devel:
Is it:
A.) a hardware problem, because the software works on different hardware
B.) a xen problem, because the hardware runs fine in a non-virtualized scenario with the same kernel

Or is it something else entirely?

Help, input, questions and suggestions are, as always, greatly appreciated.

With best regards

Andreas Olsowski
Leuphana Universität Lüneburg
Rechen- und Medienzentrum
Scharnhorststraße 1, C7.015
21335 Lüneburg

Tel: ++49 4131 677 1309

