
Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2



On 24/08/11 13:06, Andrew Cooper wrote:
> On 22/08/11 10:05, Andrew Cooper wrote:
>> On 19/08/11 19:10, Andreas Olsowski wrote:
>>> On 19.08.2011 18:49, Andrew Cooper wrote:
>>>
>>>> The only change you need to make is in megasas_probe_one() in
>>>> megaraid_sas_base.c
>>>>
>>>> Add a call to pci_enable_msi(pdev) immediately after the current call to
>>>> pci_set_master(pdev);
>>>>
>>>> ~Andrew
>>>>
>>> Yep, that works fine. Removed the module option as well.
>>>
>>> root@tarballerina:~# cat /proc/interrupts  |grep mega
>>> 2236:      69010          0          0          0          0          0          0          0  xen-pirq-msi       megasas
>>>
>>> The same procedure that would previously have led to almost instant
>>> errors has not made them reappear.
>>>
>> Good.  This is what we are seeing as well.  I am still awaiting a reply
>> from LSI on this topic.
>>
>> Unfortunately, this does point to a regression in the way Xen deals with
>> legacy interrupts.
> Out of interest, on all 3 of your boxes with the megaraid_sas cards,
> could you gather the io_apic information?
>
> It is the 'z' Xen debug key on the serial console (alternatively, put
> apic_verbosity=debug on the Xen command line and the information gets
> dumped into dmesg).

You can ignore this - it is not relevant.

I have narrowed the problem to a bug in the interrupt migration code.

The bug occurs when the move pending flag is set, and somehow another
interrupt comes in on the old pcpu without triggering the move
completion code.  This leaves the IO_APIC with an ack'd but not EOI'd
interrupt from the megaraid_sas device.

This basically locks the server until something (as yet undetermined)
triggers the move completion code, at which point the server unlocks itself.
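
To make the sequence above concrete, here is a toy userspace model of it.
It is only a sketch under my own assumptions: the struct and the toy_*
functions are hypothetical names, not Xen's data structures or code, and
it assumes that the EOI for an interrupt taken on the old pcpu is tied to
the move completion path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not Xen's real structures. */
struct toy_irq {
    int  cur_cpu;       /* pcpu the interrupt is currently routed to */
    int  new_cpu;       /* migration target while a move is pending */
    bool move_pending;  /* migration requested but not yet completed */
    bool in_service;    /* ack'd at the IO-APIC, EOI not yet sent */
};

/* Request that the interrupt be migrated to 'target'. */
void toy_request_move(struct toy_irq *irq, int target)
{
    irq->new_cpu = target;
    irq->move_pending = true;
}

/* Finish a pending move and send the EOI that was left outstanding. */
void toy_complete_move(struct toy_irq *irq)
{
    if (!irq->move_pending)
        return;
    irq->cur_cpu = irq->new_cpu;
    irq->move_pending = false;
    irq->in_service = false;
}

/*
 * Deliver one interrupt on 'cpu'. 'run_completion' models whether the
 * move completion code gets triggered for this interrupt.
 * Returns true if the interrupt was handled and EOI'd.
 */
bool toy_deliver(struct toy_irq *irq, int cpu, bool run_completion)
{
    if (irq->in_service)
        return false;               /* line still blocked: nothing delivered */

    irq->in_service = true;         /* ack */

    if (irq->move_pending && cpu == irq->cur_cpu && !run_completion)
        return false;               /* stuck: ack'd but never EOI'd */

    if (irq->move_pending)
        toy_complete_move(irq);     /* normal path: move completes, EOI sent */
    else
        irq->in_service = false;    /* no move pending: plain EOI */

    return true;
}
```

In this model, calling toy_request_move() and then toy_deliver() on the
old pcpu with run_completion == false leaves in_service set, and every
later toy_deliver() fails until toy_complete_move() runs, which is the
"locked until something triggers the move completion code" behaviour.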

When this locked state lasts for more than 2 minutes, the scsi subsystem
decides to kill the megaraid_sas driver, from which dom0 cannot recover.
I think (although I am not certain) that the megaraid_sas device gets
reset by the driver after each of these locked states, causing further
I/O problems for dom0.

I believe this issue to be some sort of race condition, because I have
noticed that my debugging printfs significantly alter how often the
problem occurs.


To make matters worse, it appears that certain OEM firmware causes a
deadlock in the megaraid_sas probe function if you try to enable MSI
interrupts, which possibly explains why the driver never tries to enable
them in the first place (I have still not had any response from LSI).
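
For reference, the shape of the MSI workaround discussed earlier in the
thread, with an opt-out for boards whose firmware misbehaves, could look
roughly like the sketch below. This is a userspace illustration, not a
patch: struct pci_dev and the two pci_* functions here are minimal
stand-ins for the kernel's, and toy_probe() and its allow_msi parameter
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's struct pci_dev (illustration only). */
struct pci_dev {
    bool bus_master;     /* set by pci_set_master() */
    bool msi_supported;  /* whether this toy device can do MSI */
    bool msi_enabled;    /* set by a successful pci_enable_msi() */
};

/* Stand-in: the real kernel function enables bus mastering on the device. */
void pci_set_master(struct pci_dev *pdev)
{
    pdev->bus_master = true;
}

/* Stand-in: like the kernel call, returns 0 on success, negative on failure. */
int pci_enable_msi(struct pci_dev *pdev)
{
    if (!pdev->msi_supported)
        return -1;
    pdev->msi_enabled = true;
    return 0;
}

/*
 * Hypothetical probe fragment: try MSI right after pci_set_master(),
 * unless the caller has opted out (e.g. for firmware known to hang).
 * Returns true if MSI is in use, false if the device stays on the
 * legacy interrupt line.
 */
bool toy_probe(struct pci_dev *pdev, bool allow_msi)
{
    pci_set_master(pdev);

    if (allow_msi && pci_enable_msi(pdev) == 0)
        return true;

    return false;
}
```

Note that an opt-out like this only helps if the affected firmware can be
identified in advance; a deadlock inside the probe function cannot be
detected from a return value.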

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

