Re: [Xen-devel] [Xen-users] Network and SATA Instability on Xen 4.6/4.8

On 12/21/2017 03:38 AM, Jan Beulich wrote:
> (dropping xen-users, to avoid cross posting)
>>>> On 20.12.17 at 18:40, <dunlapg@xxxxxxxxx> wrote:
>> On Fri, Dec 8, 2017 at 9:17 PM, Kevin Stange <kevin@xxxxxxxxxxxxx> wrote:
>>> Hi,
>>> I've been running Xen 4.4 stably for some time under kernel 4.9 in dom0
>>> on CentOS 6 and have been trying to finally move my environment up to
>>> Xen 4.6 or 4.8 using CentOS 7.  Since I've built out my test server with
>>> Xen 4.6, I've been having issues where the Intel NICs begin flapping
>>> repeatedly and the SATA disk interfaces go down and will not come back
>>> up until I reboot the server.  Even sending the bus rescan command
>>> doesn't bring the drives back.  The issue seems to be triggered by
>>> activity, so something like an mdraid resync is more likely to cause
>>> it, but it doesn't reproduce within a consistent amount of time,
>>> which makes it hard to tell whether a particular change has
>>> definitely fixed it.
>>> This is reminiscent of a problem I had been experiencing while running
>>> kernel 3.18 and Xen 4.4 on CentOS 6, but the problem resolved itself
>>> upon upgrading to kernel 4.4 and later 4.9, so I chalked that up to
>>> something bad with PCIe management in kernel 3.18 and thought nothing
>>> more of it until now.
>>> The initial test environment where the issue occurred was kernel 4.9.58
>>> and Xen 4.6.6-7 (with security patches from CentOS).  I then tried
>>> upgrading to kernel 4.9.63 and Xen 4.8.2-5, which didn't result in any
>>> improvements.
>>> I tried pcie_aspm=off on the kernel line, which has helped in the past
>>> with similar issues, but that didn't help here.
>>> I tried booting without Xen (just kernel 4.9.63) and it seems like that
>>> made the issue go away, which led me to believe the issue only happens
>>> with hardware accessed from dom0.  I dug through Xen command line
>>> options and tried booting with msi=off and that now seems to have
>>> resulted in the problem going away, or at least, the system hasn't
>>> exhibited the issue since last week.  Previously, the issue would tend
>>> to manifest after less than 24 hours.
>>> My hardware is Supermicro X8DT3-F with Dual Intel Xeon E5620 CPUs.
>>> Disk issues begin with a kernel message like this followed by continuous
>>> ATA command failures:
>>> ata2.00: exception emask 0x0 sact 0x7c01ffff serr 0x50000 action 0x6 frozen
>>> NIC issues begin with a message like:
>>> igb 0000:04:00.1: enp4s0f1: Reset adapter unexpectedly
>>> NICs do recover almost immediately but continue to flap periodically
>>> until reboot.
>>> I don't know if this is a bug in Xen or something else at play, but I
>>> could really use some help figuring out what's going on, why msi=off
>>> seems to fix it, and if there are any better ways to resolve this.
>> Jan / Andy,
>> Any idea why Kevin might be seeing stability issues under 4.6 / 4.8
>> that are solved by adding 'msi=off'?
> Nothing I've ever heard of, and without at least full logs also very
> difficult to consider possible options. While I don't recall any
> significant bug fixes in this area since 4.8, trying with 4.10 (and
> perhaps also a more up-to-date Dom0 kernel) would certainly be
> worthwhile. With the information at hand it's not even possible
> to tell whether Xen or the Dom0 kernel is the problematic part
> here (the fact that Linux works fine natively doesn't mean much
> here, as MSI handling is quite a bit different when running on
> Xen).

I'll see if there's a build of Xen 4.10 that I can try floating around.

> What I suspect first of all is that some interrupt is not making it
> through to its handler. It may be possible to see something from
> debug key output ('M' and 'i') once the system is in that state.

Given the state gets me to a point where I can't log into the server
directly any longer, what's the best way to obtain the debug key output
at that point?  I have access to IPMI with video and serial outputs.

> Just to be sure - use of an IOMMU does not affect the behavior?

I don't know how to determine "use" of IOMMU, but I did try the option
iommu=off along with msi=off, and the state of the iommu option did not
seem to affect whether the issue happened; only msi=off seemed to help.
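
Because the failure does not appear after a consistent amount of time,
one way to compare boot options is to count the two failure signatures
quoted earlier in this thread in a saved dom0 kernel log.  The following
is only a minimal sketch: the function name and the pattern names are
made up for illustration, and the patterns are derived solely from the
two messages quoted above, so other failure messages would be missed.

```python
import re

# Patterns based only on the two kernel messages quoted in this thread.
# The names "ata_exception" and "igb_reset" are illustrative labels.
SIGNATURES = {
    "ata_exception": re.compile(
        r"\bata\d+(?:\.\d+)?: exception emask", re.IGNORECASE
    ),
    "igb_reset": re.compile(
        r"\bigb [0-9a-f:.]+: \S+: Reset adapter unexpectedly", re.IGNORECASE
    ),
}


def count_signatures(lines):
    """Return how many log lines match each known failure signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Passing the lines of a log captured with and without msi=off to
count_signatures() gives a rough per-boot comparison.  A count of zero
only shows that these two exact messages were absent, not that the
system was healthy.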

Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin@xxxxxxxxxxxxx | www.steadfast.net