
Re: [Xen-devel] [Xen-users] Network and SATA Instability on Xen 4.6/4.8

To: Kevin Stange <kevin@xxxxxxxxxxxxx>
From: Alexander Dubinin <alexander.dubinin@xxxxxxxxx>
Date: Thu, 21 Dec 2017 21:28:57 +0300
Cc: Andrew Cooper <andrew.cooper3@xxxxxxxxxx>, George Dunlap <dunlapg@xxxxxxxxx>, Jan Beulich <JBeulich@xxxxxxxx>, xen-devel <xen-devel@xxxxxxxxxxxxxxxxxxxx>
Delivery-date: Thu, 21 Dec 2017 18:29:23 +0000
List-id: Xen developer discussion <xen-devel.lists.xenproject.org>

Hello, Kevin, all,

> Given the state gets me to a point where I can't log into the server
> directly any longer, what's the best way to obtain the debug key output
> at that point? I have access to IPMI with video and serial outputs.

IPMI is good: you can use SOL (Serial over LAN) with it.

You can use the following (these are my server's options as an example; adapt as you need):

1. On the Xen command line on the server (the default SOL port on Supermicro is COM2), add:

conring_size=51200K loglvl=all guest_loglvl=all iommu=1,verbose com2=115200,8n1 console=com2,vga console_timestamps=datems

This is especially useful if you compiled Xen with "debug=y". "datems" for the console timestamps is also good, as it makes the logs more human-readable ;)


2. To the Linux kernel command line, add:

earlyprintk=xen console=hvc0

That puts both the Linux kernel messages and the console on the same SOL session (enable hvc0 in securetty first, if it's not there by default in your distro).


3. From another Linux system, run ipmitool like this:

ipmitool -I lanplus -H 192.168.100.32 -U ADMIN -P ADMIN sol activate

This is an example for Supermicro. For ASUS/Gigabyte/Intel IPMI implementations, you may need to adjust the way ipmitool is started. ADMIN/ADMIN is the default Supermicro login/password and 192.168.100.32 is the IP address of my server's BMC (IPMI); use yours instead.


ipmitool is usually provided as a package of the same name in many Linux distros.
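
To make steps 1 and 2 persistent on a GRUB 2 based dom0, something like the following may work. It is only a sketch under assumptions that are not from this thread: that your distro's GRUB 2 Xen script reads GRUB_CMDLINE_XEN_DEFAULT and GRUB_CMDLINE_LINUX from /etc/default/grub, and that root logins are gated by /etc/securetty. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, file locations are assumptions.

# Options from step 1 (Xen) and step 2 (dom0 Linux kernel).
XEN_OPTS='conring_size=51200K loglvl=all guest_loglvl=all iommu=1,verbose com2=115200,8n1 console=com2,vga console_timestamps=datems'
LINUX_OPTS='earlyprintk=xen console=hvc0'

# Print the lines to merge by hand into /etc/default/grub (assumed location).
# Merge with whatever values are already there rather than replacing them.
grub_defaults() {
    printf 'GRUB_CMDLINE_XEN_DEFAULT="%s"\n' "$XEN_OPTS"
    printf 'GRUB_CMDLINE_LINUX="%s"\n' "$LINUX_OPTS"
}

# Append a tty name to a securetty-style file unless it is already listed.
ensure_securetty() {
    file=$1
    tty=$2
    grep -qxF "$tty" "$file" 2>/dev/null || printf '%s\n' "$tty" >> "$file"
}
```

Usage would be along the lines of running grub_defaults to see the two lines, running "ensure_securetty /etc/securetty hvc0" as root, and then regenerating grub.cfg with your distro's tool (grub2-mkconfig on CentOS 7).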

While ipmitool is running, reboot the server. You will be able to use the BIOS setup, the GRUB menu (if it is set to use the console rather than graphics) and the Linux console from SOL. And the perfect thing is that you have all the logs as text, which can be selected in the terminal and copied to your favorite text editor for further processing.
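
For the debug keys Jan asked for ('M' and 'i'), the same SOL session can be used once the machine is in the bad state. A hedged sketch follows: the function name and the log file name are placeholders, -E makes ipmitool read the password from the IPMI_PASSWORD environment variable instead of passing it with -P, and script(1) from util-linux is just one way to keep a transcript. The Ctrl-A switch assumes Xen's default conswitch setting.

```shell
#!/bin/sh
# Hypothetical helper: print the SOL command for a given BMC address and user.
sol_cmd() {
    printf 'ipmitool -I lanplus -H %s -U %s -E sol activate\n' "$1" "$2"
}

# Example (not run here, use your own BMC address and credentials):
#   export IPMI_PASSWORD=ADMIN
#   script -c "$(sol_cmd 192.168.100.32 ADMIN)" sol-session.log
#
# Inside the session:
#   Ctrl-A pressed three times   switches serial input from dom0 to Xen
#   M                            dumps MSI state
#   i                            dumps interrupt bindings
#   Ctrl-A three times again     switches input back to dom0
#   ~.                           terminates the SOL session
```

The transcript file then holds the 'M' and 'i' output as plain text, ready to attach to a reply.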


--
Regards,
  Alexander Dubinin

PS: Sorry for possible multiple mails - something is broken with my other mail's SPF settings in Google Apps...


On 21.12.2017 19:43, Kevin Stange wrote:

On 12/21/2017 03:38 AM, Jan Beulich wrote:

(dropping xen-users, to avoid cross posting)

On 20.12.17 at 18:40, <dunlapg@xxxxxxxxx> wrote:

On Fri, Dec 8, 2017 at 9:17 PM, Kevin Stange <kevin@xxxxxxxxxxxxx> wrote:

Hi,

I've been running Xen 4.4 stably for some time under kernel 4.9 in dom0
on CentOS 6 and have been trying to finally move my environment up to
Xen 4.6 or 4.8 using CentOS 7.  Since I've built out my test server with
Xen 4.6, I've been having issues where the Intel NICs begin flapping
repeatedly and the SATA disk interfaces go down and will not come back
up until I reboot the server.  Even sending the bus rescan command
doesn't bring the drives back.  The issue seems to trigger based on
activity, so during something like an mdraid resync is more likely to
cause the issue, but it's not reproducible in a consistent amount of
time, which makes it hard to tell if a particular change has definitely
fixed it.

This is reminiscent of a problem I had been experiencing while running
kernel 3.18 and Xen 4.4 on CentOS 6, but the problem resolved itself
upon upgrading to kernel 4.4 and later 4.9, so I chalked that up to
something bad with PCIe management in kernel 3.18 and thought nothing
more of it until now.

The initial test environment where the issue occurred was kernel 4.9.58
and Xen 4.6.6-7 (with security patches from CentOS).  I then tried
upgrading to kernel 4.9.63 and Xen 4.8.2-5, which didn't result in any
improvements.

I tried pcie_aspm=off on the kernel line, which has helped in the past
with similar issues, but that didn't help here.

I tried booting without Xen (just kernel 4.9.63) and it seems like that
made the issue go away, which led me to believe the issue only happens
with hardware accessed from dom0.  I dug through Xen command line
options and tried booting with msi=off and that now seems to have
resulted in the problem going away, or at least, the system hasn't
exhibited the issue since last week.  Previously, the issue would tend
to manifest after less than 24 hours.

My hardware is Supermicro X8DT3-F with Dual Intel Xeon E5620 CPUs.

Disk issues begin with a kernel message like this followed by continuous
ATA command failures:

ata2.00: exception emask 0x0 sact 0x7c01ffff serr 0x50000 action 0x6 frozen

NIC issues begin with a message like:

igb 0000:04:00.1: enp4s0f1: Reset adapter unexpectedly

NICs do recover almost immediately but continue to flap periodically
until reboot.

I don't know if this is a bug in Xen or something else at play, but I
could really use some help figuring out what's going on, why msi=off
seems to fix it, and if there are any better ways to resolve this.

Jan / Andy,

Any idea why Kevin might be seeing stability issues under 4.6 / 4.8
that is solved by adding 'msi=off'?

Nothing I've ever heard of, and without at least full logs also very
difficult to consider possible options. While I don't recall any
significant bug fixes in this area since 4.8, trying with 4.10 (and
perhaps also a more up-to-date Dom0 kernel) would certainly be
worthwhile. With the information at hand it's not even possible
to tell whether Xen or the Dom0 kernel is the problematic part
here (the fact that Linux works fine natively doesn't mean much
here, as MSI handling is quite a bit different when running on
Xen).

I'll see if there's a build of Xen 4.10 that I can try floating around.

What I suspect first of all is that some interrupt is not making it
through to its handler. It may be possible to see something from
debug key output ('M' and 'i') once the system is in that state.

Given the state gets me to a point where I can't log into the server
directly any longer, what's the best way to obtain the debug key output
at that point?  I have access to IPMI with video and serial outputs.

Just to be sure - use of an IOMMU does not affect the behavior?

I don't know how to determine "use" of IOMMU, but I did try the option
iommu=off along with msi=off and the state of the iommu option did not
seem to impact whether the issue happened, only msi=off seemed to help.




