
Re: [Xen-users] Strange lockup

To: Sarah Newman <srn@xxxxxxxxx>
From: Richard Kojedzinszky <krichy@xxxxxxxxxx>
Date: Tue, 05 May 2015 04:01:54 +0200
Cc: xen-users@xxxxxxxxxxxxxxxxxxxx
Delivery-date: Tue, 05 May 2015 02:02:02 +0000
List-id: Xen user discussion <xen-users.lists.xen.org>

Sarah,

We are using Debian jessie. Here are the versions:
# uname -a
Linux node2-1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1 (2015-04-24) x86_64 GNU/Linux

# cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: 1A9F77B1CA5FF92235C2213
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:312928 nr:0 dw:251580 dr:146213 al:53 bm:21 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:2089212 dw:2089212 dr:0 al:0 bm:263 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:7552532 dw:7552532 dr:0 al:0 bm:1107 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1549828 dw:1549828 dr:0 al:0 bm:217 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:302408 nr:0 dw:8807800 dr:170000 al:408 bm:586 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 5: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:364820 nr:0 dw:8875864 dr:4076 al:60 bm:535 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
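
In the output above, the first role after "ro:" is the local one, so this node is currently Secondary for minors 1, 2 and 3. A minimal sketch of how that check could be scripted before rebooting a node, assuming the DRBD 8.4 /proc/drbd layout shown here; the function name drbd_secondary_minors is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the DRBD minor numbers for which the local node is
# Secondary. Reads text in the DRBD 8.4 /proc/drbd format from stdin.
# drbd_secondary_minors is a hypothetical helper, not part of drbd-utils.

drbd_secondary_minors() {
    awk '
        # Device lines start with "<minor>:", e.g.
        # " 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----"
        $1 ~ /^[0-9]+:$/ {
            minor = $1
            sub(/:$/, "", minor)
            for (i = 2; i <= NF; i++) {
                # Field format is ro:<local role>/<peer role>
                if (index($i, "ro:Secondary/") == 1) {
                    print minor
                }
            }
        }
    '
}

# Usage on a live node:
#   drbd_secondary_minors < /proc/drbd
```

An empty result would mean the node holds no Secondary role and no peer depends on it as a mirror target.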

The link you sent may be related. Nice to see that we have been having an issue with xen/drbd for 6 years now.


Regards,


On 2015-05-05 00:05, Sarah Newman wrote:

What versions of the kernel and drbd are you using? Could this be
related?
http://lists.linbit.com/pipermail/drbd-user/2009-April/011884.html
https://bugzilla.redhat.com/show_bug.cgi?id=666005

On 05/04/2015 12:22 PM, Richard Kojedzinszky wrote:
Dear friends,
After some testing, it turned out that disabling scatter-gather on the affected NICs eliminates the issue. Whether the NIC is an ixgbe or an e1000e, the problem disappeared. I don't actually know whether it is related to xen or is just a driver issue, but in our environment this solved the problem. I think drbd has nothing to do with it; it just helped to uncover the issue.
Maybe some kernel hacker knows what exactly scatter-gather is and how it may affect xen.
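
For reference, scatter-gather is one of the per-interface offload settings that ethtool can list (-k) and change (-K). A minimal sketch of the workaround, assuming ethtool is installed and the backbone interface is called eth1; that name is a placeholder, since the thread does not give the interface names, and disable_sg is a made-up helper. Passing "dry-run" as the second argument only prints the command.

```shell
#!/bin/sh
# Sketch: turn scatter-gather off on one interface with ethtool.
# "eth1" is a placeholder interface name; disable_sg is a hypothetical helper.

disable_sg() {
    iface="$1"
    mode="${2:-apply}"
    if [ -z "$iface" ]; then
        echo "usage: disable_sg IFACE [dry-run]" >&2
        return 2
    fi
    if [ "$mode" = "dry-run" ]; then
        # Only print what would be executed; nothing is changed.
        echo "ethtool -K $iface sg off"
        return 0
    fi
    # Needs root. -K changes offload settings, -k lists them.
    ethtool -K "$iface" sg off || return 1
    ethtool -k "$iface" | grep 'scatter-gather'
}

# Example (dry run):
disable_sg eth1 dry-run
```

A setting changed this way does not survive a reboot, so it would also have to be applied from the interface configuration at boot.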
Regards,

On 2015-05-04 21:00, Sarah Newman wrote:
It sounds like you are having issues related to DRBD. Ganeti handles
all the configuration of DRBD. The xen project is not really
associated with
DRBD, though they may ship a script to use it.

I also plan to use DRBD in production, but the last time I tried I
wasn't happy with some of the error handling. It could very well have
been the same
issue you're running into since part of the testing I did was to power
off nodes. FYI I was testing with nested xen HVM nodes running xen PV
guests.

I'm working off of an internal fork of ganeti 2.9.6 and plan to check
more recent versions for changes to the DRBD code to see what
improvements have
been made. But the problems could have also been related to kernel
versions. At the time I was using ubuntu 12.04 and we've switched to
Xen4CentOS so
I'll need to retest.

Regards, Sarah

On 05/03/2015 11:52 PM, Richard Kojedzinszky wrote:
Dear Sarah,

Thanks for your reply.
First of all, I reported my issue here because I think ganeti has nothing to do with it; it is just a user-space application to control xen installations, nothing more. It has nothing to do with hardware.
I will experiment with the hardware without xen somehow to find out whether it is a driver issue or something else.
Thanks again,

Kojedzinszky Richard

On Sun, 3 May 2015, Sarah Newman wrote:
On 05/03/2015 05:56 AM, Richard Kojedzinszky wrote:
Dear users,
We have a ganeti cluster of 3 Supermicro X9SCL/X9SCM servers, exactly the same hardware. In one we have an additional Intel 10G network card. The hosts have a backbone network which is used for drbd and ganeti's shared-file-storage nfs share.
For some domUs (instances) we use drbd mirrors. We have an issue with planned maintenance:
I migrate all domUs off a node which is to be upgraded, yet it is still a slave for some domUs' disks.
IIRC migrating the VM causes primary and secondary to switch and it doesn't automatically pick a new secondary.
When the host has no more running domUs, I issue a reboot on it. After that, on the other node, the network card stops working, the kernel shows 'tx hangs', and effectively I cannot recover that dom0 either without a reboot.
I've attached a syslog from a node which has the 10G nic after a reboot has been initiated on another node.
The strange thing is that if I do it the other way around, the same happens, but with the e1000e nics.
What kind of bug is this? Maybe when the drbd slave disappears, drbd puts a high load on the nic? I don't know of any other direct traffic between the two hosts on that dedicated network.

Any thoughts?
It's highly unlikely to be related to xen. Probably there is some sort of deadlock in the kernel but I don't know enough about DRBD to say how.
If you haven't asked on the ganeti or drbd mailing lists that's where you should start. You should include your kernel version including distribution if applicable, the version of drbd driver and userspace tools you're using, and (if asking on the ganeti mailing list) the version of ganeti you're using.
A potential workaround is to switch the secondary node to something which is not getting rebooted before doing the reboot. Something else I might try, if you're using different NICs for the domU networking and the ganeti network, is to bring down the ganeti interface on the now-primary node before doing the reboot of the secondary node. I'm assuming you have serial console access to do this from.
If you don't mind, let me know if you figure out what the issue is.


--
Richard Kojedzinszky

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users
