Re: [Xen-devel] megasas stops I/O when running kernel as dom0 under xen4.1/4.2

On 26/08/11 19:16, Andrew Cooper wrote:
> On 24/08/11 18:20, Andrew Cooper wrote:
>> On 24/08/11 18:09, Konrad Rzeszutek Wilk wrote:
>>> On Wed, Aug 24, 2011 at 05:57:06PM +0100, Andrew Cooper wrote:
>>>> On 24/08/11 13:06, Andrew Cooper wrote:
>>>>> On 22/08/11 10:05, Andrew Cooper wrote:
>>>>>> On 19/08/11 19:10, Andreas Olsowski wrote:
>>>>>>> On 19.08.2011 18:49, Andrew Cooper wrote:
>>>>>>>> The only change you need to make is in megasas_probe_one() in
>>>>>>>> megaraid_sas_base.c
>>>>>>>> Add a call to pci_enable_msi(pdev) immediately after the current call to
>>>>>>>> pci_set_master(pdev);
>>>>>>>> ~Andrew
>>>>>>> Yep, that works fine. Removed the module option as well.
>>>>>>> root@tarballerina:~# cat /proc/interrupts  |grep mega
>>>>>>> 2236:      69010          0          0          0          0          0          0          0  xen-pirq-msi       megasas
>>>>>>> The same procedure that would have led to almost instant errors has
>>>>>>> not made them appear again.
>>>>>> Good.  This is what we are seeing as well.  I am still awaiting a reply
>>>>>> from LSI on this topic.
>>>>>> Unfortunately, this does point to a regression in the way Xen deals with
>>>>>> legacy interrupts.
>>>>> Out of interest, on all 3 of your boxes with the megaraid_sas cards,
>>>>> could you gather the io_apic information?
>>>>> It is the z xen debug key on the serial console (or alternatively put
>>>>> apic_verbosity=debug on the xen commandline and the information gets
>>>>> dumped into the dmesg)
>>>> You can ignore this - it is not relevant.
>>>> I have narrowed the problem to a bug in the interrupt migration code.
>>> Goodies!
>>>> The bug occurs when the move pending flag is set, and somehow another
>>>> interrupt comes in on the old pcpu without triggering the move
>>>> completion code.  This leaves the IO_APIC with an ack'd but not EOI'd
>>>> interrupt from the megaraid_sas device.
>>> Ah, so the interrupt is delivered to Dom0 on the old per_cpu
>>> event which is ignored. Ignored b/c we have rebound the event channel
>>> to the other CPU, right?
>> The interrupt is not ignored - it seems to be being serviced by the
>> device driver in dom0.   I will admit that my debugging code may be a
>> bit flaky - I started by trying to match IRQ35 (which is always claimed
>> by PCI INTA on this server - very useful for debugging) between do_IRQ
>> and its related PHYSDEVOP_eoi.
>> I am currently trying to track the exact order of events around this
>> interrupt which misses the move completion code.
>>> Is there any code in the Hypervisor to turn off interrupt migration?
>> Not that I have found, although playing around with vcpu and task
>> pinning should work.  My debugging shows that Xen-4.1.1 is migrating
>> this interrupt between PCPUs on average once every 4 real interrupts
>> when dom0 is under any load whatsoever.
> Please try attached patch.  It is a hack, but it works as far as I can test.
> (Patch is taken against xen-4.1.1 but should be trivial to port if it
> doesn't apply cleanly)
> ~Andrew

Apologies - the previous patch fails to compile (I forgot to hg qrefresh
before sending - it has been a very long day).  Try this patch.
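
For reference, the driver-side workaround quoted earlier in this thread
(calling pci_enable_msi(pdev) immediately after pci_set_master(pdev) in
megasas_probe_one()) can be sketched roughly as below.  This is only a
standalone illustration of the call ordering and is not the attached
patch, which addresses the Xen side.  struct pci_dev and the two pci_*
functions here are stubs standing in for the kernel ones, and
probe_one_sketch() is a hypothetical name; none of the rest of the real
probe function is shown.

```c
#include <stdio.h>

/* Hypothetical stand-in for the kernel's struct pci_dev; only the two
 * bits of state this sketch cares about are modelled. */
struct pci_dev {
    int bus_master;   /* set by pci_set_master() */
    int msi_enabled;  /* set by pci_enable_msi() */
};

/* Stub with the same shape as the kernel call: enables bus mastering. */
static void pci_set_master(struct pci_dev *pdev)
{
    pdev->bus_master = 1;
}

/* Stub with the same shape as the kernel call: returns 0 on success.
 * The real call can fail, in which case the device stays on its legacy
 * (INTx) interrupt. */
static int pci_enable_msi(struct pci_dev *pdev)
{
    pdev->msi_enabled = 1;
    return 0;
}

/* Hypothetical name: mirrors only the ordering suggested for
 * megasas_probe_one(), not the rest of that function. */
static int probe_one_sketch(struct pci_dev *pdev)
{
    pci_set_master(pdev);

    /* The suggested addition: request MSI right after bus mastering. */
    if (pci_enable_msi(pdev) != 0)
        fprintf(stderr, "MSI unavailable, staying on legacy interrupt\n");

    return 0;
}
```

With the real driver built this way, /proc/interrupts should list the
megasas line as xen-pirq-msi, as in the output quoted above.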


Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

Attachment: CA-65000-manually-eoi-migrating-irqs.patch
Description: Text Data
