Re: [Xen-devel] [Xen-users] Network and SATA Instability on Xen 4.6/4.8
On 12/21/2017 03:38 AM, Jan Beulich wrote:
> (dropping xen-users, to avoid cross posting)
>
>>>> On 20.12.17 at 18:40, <dunlapg@xxxxxxxxx> wrote:
>> On Fri, Dec 8, 2017 at 9:17 PM, Kevin Stange <kevin@xxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> I've been running Xen 4.4 stably for some time under kernel 4.9 in dom0
>>> on CentOS 6 and have been trying to finally move my environment up to
>>> Xen 4.6 or 4.8 using CentOS 7. Since I've built out my test server with
>>> Xen 4.6, I've been having issues where the Intel NICs begin flapping
>>> repeatedly and the SATA disk interfaces go down and will not come back
>>> up until I reboot the server. Even sending the bus rescan command
>>> doesn't bring the drives back. The issue seems to trigger based on
>>> activity, so during something like an mdraid resync is more likely to
>>> cause the issue, but it's not reproducible in a consistent amount of
>>> time, which makes it hard to tell if a particular change has definitely
>>> fixed it.
>>>
>>> This is reminiscent of a problem I had been experiencing while running
>>> kernel 3.18 and Xen 4.4 on CentOS 6, but the problem resolved itself
>>> upon upgrading to kernel 4.4 and later 4.9, so I chalked that up to
>>> something bad with PCIe management in kernel 3.18 and thought nothing
>>> more of it until now.
>>>
>>> The initial test environment where the issue occurred was kernel 4.9.58
>>> and Xen 4.6.6-7 (with security patches from CentOS). I then tried
>>> upgrading to kernel 4.9.63 and Xen 4.8.2-5, which didn't result in any
>>> improvements.
>>>
>>> I tried pcie_aspm=off on the kernel line, which has helped in the past
>>> with similar issues, but that didn't help here.
>>>
>>> I tried booting without Xen (just kernel 4.9.63) and it seems like that
>>> made the issue go away, which led me to believe the issue only happens
>>> with hardware accessed from dom0.
>>> I dug through Xen command line
>>> options and tried booting with msi=off and that now seems to have
>>> resulted in the problem going away, or at least, the system hasn't
>>> exhibited the issue since last week. Previously, the issue would tend
>>> to manifest after less than 24 hours.
>>>
>>> My hardware is Supermicro X8DT3-F with Dual Intel Xeon E5620 CPUs.
>>>
>>> Disk issues begin with a kernel message like this followed by continuous
>>> ATA command failures:
>>>
>>> ata2.00: exception emask 0x0 sact 0x7c01ffff serr 0x50000 action 0x6 frozen
>>>
>>> NIC issues begin with a message like:
>>>
>>> igb 0000:04:00.1: enp4s0f1: Reset adapter unexpectedly
>>>
>>> NICs do recover almost immediately but continue to flap periodically
>>> until reboot.
>>>
>>> I don't know if this is a bug in Xen or something else at play, but I
>>> could really use some help figuring out what's going on, why msi=off
>>> seems to fix it, and if there are any better ways to resolve this.
>>
>> Jan / Andy,
>>
>> Any idea why Kevin might be seeing stability issues under 4.6 / 4.8
>> that is solved by adding 'msi=off'?
>
> Nothing I've ever heard of, and without at least full logs also very
> difficult to consider possible options. While I don't recall any
> significant bug fixes in this area since 4.8, trying with 4.10 (and
> perhaps also a more up-to-date Dom0 kernel) would certainly be
> worthwhile. With the information at hand it's not even possible
> to tell whether Xen or the Dom0 kernel is the problematic part
> here (the fact that Linux works fine natively doesn't mean much
> here, as MSI handling is quite a bit different when running on
> Xen).

I'll see if there's a build of Xen 4.10 that I can try floating around.

> What I suspect first of all is that some interrupt is not making it
> through to its handler. It may be possible to see something from
> debug key output ('M' and 'i') once the system is in that state.

Given the state gets me to a point where I can't log into the server
directly any longer, what's the best way to obtain the debug key output
at that point? I have access to IPMI with video and serial outputs.

> Just to be sure - use of an IOMMU does not affect the behavior?

I don't know how to determine "use" of IOMMU, but I did try the option
iommu=off along with msi=off and the state of the iommu option did not
seem to impact whether the issue happened, only msi=off seemed to help.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin@xxxxxxxxxxxxx | www.steadfast.net

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxxx
https://lists.xenproject.org/mailman/listinfo/xen-devel
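
Since the failure is intermittent and it is hard to tell whether a given
change (such as msi=off) has really fixed it, it may help to count how often
the two kernel messages quoted in this thread appear in a saved dom0 log.
The following is only a minimal sketch: the function name scan_log and the
event labels are made up for illustration, the patterns are modelled solely
on the two example messages above, and other drivers or kernel versions may
word these messages differently.

```python
import re
from collections import Counter

# Patterns modelled on the two messages quoted in this thread (assumption:
# real logs follow the same wording). Matching is case-insensitive because
# the archived copy may not preserve the kernel's original capitalisation.
ATA_FREEZE = re.compile(
    r"\b(ata\d+(?:\.\d+)?): exception emask .* frozen", re.IGNORECASE
)
IGB_RESET = re.compile(
    r"\bigb (\S+): (\S+): Reset adapter unexpectedly", re.IGNORECASE
)


def scan_log(lines):
    """Count failure events per device in an iterable of log lines.

    Returns a Counter keyed by (kind, device), where kind is the
    hypothetical label "ata-freeze" or "igb-reset".
    """
    events = Counter()
    for line in lines:
        match = ATA_FREEZE.search(line)
        if match:
            # group(1) is the ATA port/device, e.g. "ata2.00"
            events[("ata-freeze", match.group(1))] += 1
            continue
        match = IGB_RESET.search(line)
        if match:
            # group(2) is the interface name, e.g. "enp4s0f1"
            events[("igb-reset", match.group(2))] += 1
    return events
```

One way to use it would be to pass an open log file, for example
scan_log(open(path, errors="replace")), and compare the counts between a
boot with msi=off and a boot without it over a similar period of load.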