
Re: [Xen-users] Network and SATA Instability on Xen 4.6/4.8

  • To: xen-users@xxxxxxxxxxxxxxxxxxxx,Kevin Stange <kevin@xxxxxxxxxxxxx>
  • From: Nathan March <nathan@xxxxxx>
  • Date: Fri, 08 Dec 2017 13:40:20 -0800
  • Delivery-date: Fri, 08 Dec 2017 21:40:58 +0000
  • List-id: Xen user discussion <xen-users.lists.xenproject.org>

I've seen this same Intel behavior on my systems and have had no luck identifying a cause. It happens on my bonded, tagged X540 NICs, but not on my similarly configured 1G Intel NICs. I'm currently testing 4.8 in the hope that it doesn't exhibit this behavior.

I'm on a mix of Supermicro and Dell hardware, and both have the issue. This started happening after a major dom0 kernel upgrade, but I don't have the version details handy.

I don't see the same SATA instability, but I'm not using local storage (NFS over those 10G links).


On December 8, 2017 1:17:30 PM PST, Kevin Stange <kevin@xxxxxxxxxxxxx> wrote:

I've been running Xen 4.4 stably for some time under kernel 4.9 in dom0
on CentOS 6 and have been trying to finally move my environment up to
Xen 4.6 or 4.8 using CentOS 7. Since I've built out my test server with
Xen 4.6, I've been having issues where the Intel NICs begin flapping
repeatedly and the SATA disk interfaces go down and will not come back
up until I reboot the server. Even sending the bus rescan command
doesn't bring the drives back. The issue seems to be triggered by
activity, so something like an mdraid resync is more likely to cause
it, but it doesn't reproduce in a consistent amount of time, which
makes it hard to tell whether a particular change has definitely
fixed it.

This is reminiscent of a problem I had been experiencing while running
kernel 3.18 and Xen 4.4 on CentOS 6, but the problem resolved itself
upon upgrading to kernel 4.4 and later 4.9, so I chalked that up to
something bad with PCIe management in kernel 3.18 and thought nothing
more of it until now.

The initial test environment where the issue occurred was kernel 4.9.58
and Xen 4.6.6-7 (with security patches from CentOS). I then tried
upgrading to kernel 4.9.63 and Xen 4.8.2-5, which didn't result in any

I tried pcie_aspm=off on the kernel line, which has helped in the past
with similar issues, but that didn't help here.

I tried booting without Xen (just kernel 4.9.63) and it seems like that
made the issue go away, which led me to believe the issue only happens
with hardware accessed from dom0. I dug through Xen command line
options and tried booting with msi=off and that now seems to have
resulted in the problem going away, or at least, the system hasn't
exhibited the issue since last week. Previously, the issue would tend
to manifest after less than 24 hours.
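
One way to sanity-check that msi=off actually took effect in dom0 is to look for interrupt lines that are still MSI-backed. The sketch below is only illustrative: it assumes MSI interrupts carry "msi" somewhere in their /proc/interrupts description (as with PCI-MSI on bare metal; the exact label under Xen may differ by kernel version), and the function names are hypothetical.

```python
from pathlib import Path


def msi_interrupt_lines(text):
    """Return the lines of /proc/interrupts-style text that mention MSI.

    Assumption: MSI-backed interrupts have "msi" in their description
    column (e.g. "PCI-MSI"). The label used under Xen may differ.
    """
    lines = text.splitlines()[1:]  # skip the CPU0/CPU1/... header row
    return [line for line in lines if "msi" in line.lower()]


def read_interrupts(path="/proc/interrupts"):
    """Read the interrupt table; the default path is the usual Linux one."""
    return Path(path).read_text()
```

Running `msi_interrupt_lines(read_interrupts())` in dom0 before and after adding msi=off and comparing the two lists should show whether the NIC and SATA controller interrupts changed type.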

My hardware is Supermicro X8DT3-F with Dual Intel Xeon E5620 CPUs.

Disk issues begin with a kernel message like this followed by continuous
ATA command failures:

ata2.00: exception emask 0x0 sact 0x7c01ffff serr 0x50000 action 0x6 frozen

NIC issues begin with a message like:

igb 0000:04:00.1: enp4s0f1: Reset adapter unexpectedly

NICs do recover almost immediately but continue to flap periodically
until reboot.
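
Because the problem does not reproduce on a fixed schedule, counting these two messages over a test window can make it easier to compare one boot configuration against another. The following is a minimal sketch, not a definitive tool: the patterns are derived only from the two example messages quoted above, other drivers or kernel versions may word them differently, and `count_events` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns based on the two example messages quoted above (assumption:
# other hardware or kernel versions may phrase these differently).
PATTERNS = {
    "ata_exception": re.compile(r"ata\d+(?:\.\d+)?: exception Emask", re.IGNORECASE),
    "nic_reset": re.compile(r"igb \S+: \S+: Reset adapter unexpectedly"),
}


def count_events(lines):
    """Count kernel log lines matching each known failure pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Feeding it the lines of a saved kernel log (for example `count_events(open("dmesg.txt"))`, where dmesg.txt is a placeholder file name) gives a per-boot tally that can be compared between runs with and without msi=off.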

I don't know if this is a bug in Xen or something else at play, but I
could really use some help figuring out what's going on, why msi=off
seems to fix it, and if there are any better ways to resolve this.
