
Re: [Xen-users] Network and SATA Instability on Xen 4.6/4.8



Hi,

We've been experiencing the same errors on very similar hardware.
Just as Kevin described: all SATA interfaces go down and the NICs start to
flap in Dom0, and the only fix is a reboot.
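
For anyone comparing logs across machines, here is a rough Python sketch that
counts the two kinds of kernel message Kevin quotes further down (the ata
"exception emask" line and the igb "Reset adapter unexpectedly" line). The
patterns are modelled only on those two quoted lines, and the function names
are made up for illustration; other drivers or kernel versions may word the
messages differently.

```python
import re

# Patterns modelled on the two kernel messages quoted in this thread.
# Case-insensitive, since capitalisation may differ between kernels.
ATA_RE = re.compile(r"\bata\d+(?:\.\d+)?: exception emask", re.IGNORECASE)
NIC_RE = re.compile(
    r"\bigb [0-9a-f:.]+: \S+: Reset adapter unexpectedly", re.IGNORECASE
)


def classify(line):
    """Return 'sata', 'nic', or None for a single kernel log line."""
    if ATA_RE.search(line):
        return "sata"
    if NIC_RE.search(line):
        return "nic"
    return None


def count_events(lines):
    """Count SATA and NIC failure messages in an iterable of log lines."""
    counts = {"sata": 0, "nic": 0}
    for line in lines:
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

Passing it the lines of a saved dmesg output from each host should give a
quick per-machine tally of the two failure types.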

Unlike Kevin, I was unable to find any pattern in system activity that might
trigger these errors; they seem completely random.
Sometimes they occur under high load, sometimes when load is very low (both
I/O and CPU), sometimes twice a week, sometimes not at all for months...

We have three identical machines (Supermicro X8DTT-HIBQF+ boards with X5670
CPUs), and all three behave like this.
I think they have the same chipset as Kevin's board.

xen_version            : 4.6.1
xen_commandline        : dom0_mem=1024M loglvl=all guest_loglvl=all
cc_compiler            : gcc (Debian 4.7.2-5) 4.7.2

Dom0 kernel version is 3.14.61.

I also tried Xen 4.8 and newer Dom0 kernels (4.4.2), but that did not help.

I've tried modifying the power-management settings in the BIOS setup, but
these had no effect on the issue.
ASPM was implicitly disabled by the kernel from the beginning:
[    8.601606] acpi PNP0A08:00: _OSC failed (AE_NOT_FOUND); disabling ASPM

Now I've disabled MSI in the Dom0 kernel with pci=nomsi, and also explicitly
disabled ASPM with pcie_aspm=off.
Based on /proc/interrupts, lspci and dmesg, MSI/MSI-X is no longer in use.
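
The /proc/interrupts part of that check can be sketched in Python roughly as
follows. It simply lists every interrupt line whose text mentions "msi"; the
function names are hypothetical, and the substring match is an assumption
about how the kernel labels MSI interrupts, so a device whose name happens to
contain "msi" would show up as a false positive.

```python
def msi_interrupt_lines(text):
    """Return the lines of /proc/interrupts-style text that mention MSI/MSI-X."""
    lines = text.splitlines()
    # The first line is the per-CPU column header, so skip it.
    return [ln for ln in lines[1:] if "msi" in ln.lower()]


def read_proc_interrupts(path="/proc/interrupts"):
    """Read the interrupt table from the running kernel."""
    with open(path) as f:
        return f.read()
```

An empty result from msi_interrupt_lines(read_proc_interrupts()) is
consistent with MSI being off, but it is worth cross-checking against lspci
and dmesg as well.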
We will see whether this cures it.
But as the errors emerge randomly, it doesn't really prove anything if I
don't see them again with MSI disabled...?
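
One rough way to put a number on that, assuming (and it is only an
assumption) that the failures arrive independently at a constant average
rate, i.e. as a Poisson process: the chance of seeing no failure at all in t
days purely by luck is exp(-rate * t). A small Python sketch, with made-up
function names:

```python
import math


def p_quiet_by_chance(failures_per_day, quiet_days):
    """Chance of zero failures in quiet_days if nothing was actually fixed
    (assumes a Poisson process with a constant rate)."""
    return math.exp(-failures_per_day * quiet_days)


def days_needed(failures_per_day, alpha=0.05):
    """Quiet days required before luck alone explains the silence
    with probability below alpha (same Poisson assumption)."""
    return math.log(1.0 / alpha) / failures_per_day
```

With roughly one failure per day, a quiet week would happen by chance less
than 0.1% of the time. With an illustrative rate of one failure per 30 days,
it takes about 90 quiet days to get below 5%. Since the failures here look
bursty rather than evenly spread, the real waiting time is probably longer
than this estimate.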

Any suggestions?

Thank you!

-David

On Wed, Dec 20, 2017 at 05:40:16PM +0000, George Dunlap wrote:
> On Fri, Dec 8, 2017 at 9:17 PM, Kevin Stange <kevin@xxxxxxxxxxxxx> wrote:
> > Hi,
> >
> > I've been running Xen 4.4 stably for some time under kernel 4.9 in dom0
> > on CentOS 6 and have been trying to finally move my environment up to
> > Xen 4.6 or 4.8 using CentOS 7.  Since I've built out my test server with
> > Xen 4.6, I've been having issues where the Intel NICs begin flapping
> > repeatedly and the SATA disk interfaces go down and will not come back
> > up until I reboot the server.  Even sending the bus rescan command
> > doesn't bring the drives back.  The issue seems to trigger based on
> > activity, so during something like an mdraid resync is more likely to
> > cause the issue, but it's not reproducible in a consistent amount of
> > time, which makes it hard to tell if a particular change has definitely
> > fixed it.
> >
> > This is reminiscent of a problem I had been experiencing while running
> > kernel 3.18 and Xen 4.4 on CentOS 6, but the problem resolved itself
> > upon upgrading to kernel 4.4 and later 4.9, so I chalked that up to
> > something bad with PCIe management in kernel 3.18 and thought nothing
> > more of it until now.
> >
> > The initial test environment where the issue occurred was kernel 4.9.58
> > and Xen 4.6.6-7 (with security patches from CentOS).  I then tried
> > upgrading to kernel 4.9.63 and Xen 4.8.2-5, which didn't result in any
> > improvements.
> >
> > I tried pcie_aspm=off on the kernel line, which has helped in the past
> > with similar issues, but that didn't help here.
> >
> > I tried booting without Xen (just kernel 4.9.63) and it seems like that
> > made the issue go away, which led me to believe the issue only happens
> > with hardware accessed from dom0.  I dug through Xen command line
> > options and tried booting with msi=off and that now seems to have
> > resulted in the problem going away, or at least, the system hasn't
> > exhibited the issue since last week.  Previously, the issue would tend
> > to manifest after less than 24 hours.
> >
> > My hardware is Supermicro X8DT3-F with Dual Intel Xeon E5620 CPUs.
> >
> > Disk issues begin with a kernel message like this followed by continuous
> > ATA command failures:
> >
> > ata2.00: exception emask 0x0 sact 0x7c01ffff serr 0x50000 action 0x6 frozen
> >
> > NIC issues begin with a message like:
> >
> > igb 0000:04:00.1: enp4s0f1: Reset adapter unexpectedly
> >
> > NICs do recover almost immediately but continue to flap periodically
> > until reboot.
> >
> > I don't know if this is a bug in Xen or something else at play, but I
> > could really use some help figuring out what's going on, why msi=off
> > seems to fix it, and if there are any better ways to resolve this.
> 
> Jan / Andy,
> 
> Any idea why Kevin might be seeing stability issues under 4.6 / 4.8
> that are solved by adding 'msi=off'?
> 
>  -George
> 
> _______________________________________________
> Xen-users mailing list
> Xen-users@xxxxxxxxxxxxxxxxxxxx
> https://lists.xenproject.org/mailman/listinfo/xen-users
