
Re: [Xen-users] Network and SATA Instability on Xen 4.6/4.8



On Friday, December 8, 2017 10:17:30 PM CET Kevin Stange wrote:
> Hi,
> 
> I've been running Xen 4.4 stably for some time under kernel 4.9 in dom0
> on CentOS 6 and have been trying to finally move my environment up to
> Xen 4.6 or 4.8 using CentOS 7.  Since I've built out my test server with
> Xen 4.6, I've been having issues where the Intel NICs begin flapping
> repeatedly and the SATA disk interfaces go down and will not come back
> up until I reboot the server.  Even sending the bus rescan command
> doesn't bring the drives back.  The issue seems to be triggered by
> activity, so something like an mdraid resync is more likely to
> cause it, but it's not reproducible in a consistent amount of
> time, which makes it hard to tell if a particular change has definitely
> fixed it.
> 
> This is reminiscent of a problem I had been experiencing while running
> kernel 3.18 and Xen 4.4 on CentOS 6, but the problem resolved itself
> upon upgrading to kernel 4.4 and later 4.9, so I chalked that up to
> something bad with PCIe management in kernel 3.18 and thought nothing
> more of it until now.
> 
> The initial test environment where the issue occurred was kernel 4.9.58
> and Xen 4.6.6-7 (with security patches from CentOS).  I then tried
> upgrading to kernel 4.9.63 and Xen 4.8.2-5, which didn't result in any
> improvements.
> 
> I tried pcie_aspm=off on the kernel line, which has helped in the past
> with similar issues, but that didn't help here.
> 
> I tried booting without Xen (just kernel 4.9.63) and it seems like that
> made the issue go away, which led me to believe the issue only happens
> with hardware accessed from dom0.  I dug through Xen command line
> options and tried booting with msi=off and that now seems to have
> resulted in the problem going away, or at least, the system hasn't
> exhibited the issue since last week.  Previously, the issue would tend
> to manifest after less than 24 hours.
> 
> My hardware is Supermicro X8DT3-F with Dual Intel Xeon E5620 CPUs.
> 
> Disk issues begin with a kernel message like this followed by continuous
> ATA command failures:
> 
> ata2.00: exception emask 0x0 sact 0x7c01ffff serr 0x50000 action 0x6 frozen
> 
> NIC issues begin with a message like:
> 
> igb 0000:04:00.1: enp4s0f1: Reset adapter unexpectedly
> 
> NICs do recover almost immediately but continue to flap periodically
> until reboot.
> 
> I don't know if this is a bug in Xen or something else at play, but I
> could really use some help figuring out what's going on, why msi=off
> seems to fix it, and if there are any better ways to resolve this.
> 
> Thanks.

I have not seen anything like this on any of the servers I currently run,
which are a mix of Tyan and Supermicro boards. (I am switching away from Tyan
for unrelated reasons.)

# xl info | grep command
xen_commandline        : dom0_mem=24GB,max:24GB console=vga dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256

# cat /proc/cmdline 
root=zhost/host/root by=id elevator=noop logo.nologo triggers=zfs quiet refresh softlevel=prexen
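
To compare boot options between a working host and an affected one, the command lines can be split into key/value pairs and diffed. The following is only a rough sketch: the function names `parse_cmdline` and `missing_options` are made up for illustration, and the "affected" command line is a hypothetical stand-in that contains nothing but the msi=off option mentioned in the original report.

```python
def parse_cmdline(cmdline):
    """Split a boot command line into {option: value}; bare flags map to None."""
    options = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


def missing_options(mine, theirs):
    """Return the options present in `theirs` but absent from `mine`."""
    return {k: v for k, v in theirs.items() if k not in mine}


# Xen command line taken from the `xl info` output above.
working = parse_cmdline(
    "dom0_mem=24GB,max:24GB console=vga dom0_max_vcpus=4 "
    "dom0_vcpus_pin gnttab_max_frames=256"
)

# Hypothetical command line for the affected host (assumption, not real output).
affected = parse_cmdline("msi=off")

# Options set on the affected host that the working host does not use.
only_on_affected = missing_options(working, affected)
```

Here `only_on_affected` lists msi as the one option the working host boots without, which matches the xl info output above.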

FYI: I use ZFS and some of the VMs are using 2 SSDs that are maintained by the 
host.
The majority of the storage is handled by a storage domain which has the HBA 
assigned to it directly.

I have four 10GbE ports that are bonded and VLAN tagged to provide
connectivity to other hosts.

Mainboard:
Supermicro X10DRI-T4i

The hardware is occasionally stressed both on the SSDs (connected via SATA) 
and the network.

I am running a 4.9.49 kernel with Xen 4.8.2 and ZoL 0.7.3.
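
Since the failures in the original report can take up to a day to appear, counting the two quoted kernel messages in the log may make it easier to judge whether a change such as msi=off really helped. Below is a minimal sketch, not a tested monitoring tool: the patterns are modeled only on the two messages quoted above, and the names `PATTERNS` and `count_failures` are invented for this example.

```python
import re
from collections import Counter

# Patterns modeled on the two kernel messages quoted in the original report.
# Other drivers or kernel versions may word these messages differently.
PATTERNS = {
    "ata_exception": re.compile(r"ata\d+(?:\.\d+)?: exception emask", re.IGNORECASE),
    "nic_reset": re.compile(r"igb .*Reset adapter unexpectedly"),
}


def count_failures(lines):
    """Count how many log lines match each failure pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample input built from the messages in the report (the NIC line is repeated
# to mimic flapping).
sample = [
    "ata2.00: exception emask 0x0 sact 0x7c01ffff serr 0x50000 action 0x6 frozen",
    "igb 0000:04:00.1: enp4s0f1: Reset adapter unexpectedly",
    "igb 0000:04:00.1: enp4s0f1: Reset adapter unexpectedly",
]
counts = count_failures(sample)
```

In practice the lines would come from the kernel log (for example the saved output of dmesg) rather than a hard-coded list, and the counts could be compared before and after changing a boot option.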

--
Joost



 

