
Re: [Xen-users] Strange lockup



Sarah,

We are using debian jessie, here are the versions:
# uname -a
Linux node2-1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-3~deb8u1 (2015-04-24) x86_64 GNU/Linux
# cat /proc/drbd
version: 8.4.3 (api:1/proto:86-101)
srcversion: 1A9F77B1CA5FF92235C2213
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:312928 nr:0 dw:251580 dr:146213 al:53 bm:21 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:2089212 dw:2089212 dr:0 al:0 bm:263 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:7552532 dw:7552532 dr:0 al:0 bm:1107 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
ns:0 nr:1549828 dw:1549828 dr:0 al:0 bm:217 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:302408 nr:0 dw:8807800 dr:170000 al:408 bm:586 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 5: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:364820 nr:0 dw:8875864 dr:4076 al:60 bm:535 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
#
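A status dump like the one above can be checked mechanically before a maintenance window. The following is only a sketch: the function name `drbd_unhealthy` is made up for illustration, and it assumes the DRBD 8.4 `/proc/drbd` line layout shown above (`cs:` and `ds:` fields on the per-minor line). It prints every minor that is not `Connected` with `UpToDate/UpToDate` disks.

```shell
#!/bin/sh
# Sketch: list DRBD minors that are not fully connected and up to date.
# Reads /proc/drbd-style text from the file given as $1 (default: /proc/drbd).
# drbd_unhealthy is a hypothetical helper name, not part of DRBD.
drbd_unhealthy() {
    awk '
        /^ *[0-9]+: cs:/ {
            minor = $1
            sub(/:$/, "", minor)            # "0:" -> "0"
            cs = ""; ds = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^cs:/) cs = substr($i, 4)   # connection state
                if ($i ~ /^ds:/) ds = substr($i, 4)   # disk states local/peer
            }
            if (cs != "Connected" || ds != "UpToDate/UpToDate")
                print minor, cs, ds
        }
    ' "${1:-/proc/drbd}"
}
```

Calling `drbd_unhealthy` with no argument reads the live `/proc/drbd`; empty output means every minor looks healthy by this simple test.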

The link you sent may be related. Nice to see that we have been having an issue with Xen/DRBD for six years now.

Regards,


On 2015-05-05 00:05, Sarah Newman wrote:
What versions of the kernel and DRBD are you using? Could this be
related?
http://lists.linbit.com/pipermail/drbd-user/2009-April/011884.html
https://bugzilla.redhat.com/show_bug.cgi?id=666005

On 05/04/2015 12:22 PM, Richard Kojedzinszky wrote:
Dear friends,

After some testing, it turned out that disabling scatter-gather on the affected NICs eliminates the issue. Whether the NIC is an ixgbe or an e1000e, the problem disappeared. I don't actually know whether it is related to Xen or is just a driver issue, but in our environment this solved the problem. I think DRBD has nothing to do with it; it just helped to uncover the issue.

Maybe some kernel hacker knows what exactly scatter-gather is and how it may affect Xen.
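The message does not say how scatter-gather was turned off. On Linux it is commonly done with ethtool, so the following is a sketch under that assumption. The interface name `eth1`, the helper names `run` and `disable_sg`, and the `DRY_RUN` switch are all placeholders for illustration. Note that a setting changed this way does not survive a reboot or driver reload unless it is also added to the network configuration.

```shell
#!/bin/sh
# Sketch: turn off scatter-gather on one interface with ethtool.
# Assumption: ethtool is the tool used; the thread does not say so.
# With DRY_RUN=1 the commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

disable_sg() {
    iface="$1"                        # e.g. eth1 (placeholder name)
    # Lowercase -k only reads the current offload settings.
    run ethtool -k "$iface"
    # Uppercase -K changes a setting; "sg" is scatter-gather.
    run ethtool -K "$iface" sg off
}

# Example (prints the commands without touching the NIC):
#   DRY_RUN=1; disable_sg eth1
```

Running with `DRY_RUN=1` first is a cheap way to confirm the interface name before changing anything on a production dom0.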

Regards,

On 2015-05-04 21:00, Sarah Newman wrote:
It sounds like you are having issues related to DRBD. Ganeti handles
all the configuration of DRBD. The Xen project is not really
associated with DRBD, though they may ship a script to use it.

I also plan to use DRBD in production, but the last time I tried I
wasn't happy with some of the error handling. It could very well have
been the same issue you're running into, since part of the testing I
did was to power off nodes. FYI, I was testing with nested Xen HVM
nodes running Xen PV guests.

I'm working off of an internal fork of ganeti 2.9.6 and plan to check
more recent versions for changes to the DRBD code to see what
improvements have been made. But the problems could also have been
related to kernel versions. At the time I was using Ubuntu 12.04 and
we've switched to Xen4CentOS, so I'll need to retest.

Regards, Sarah

On 05/03/2015 11:52 PM, Richard Kojedzinszky wrote:
Dear Sarah,

Thanks for your reply.

First of all, I reported my issue here because I think ganeti has nothing to do with it; it is just a user-space application to control Xen installations, nothing more. It has nothing to do with hardware.

I will experiment with the hardware without Xen to find out whether it is a driver issue or something else.

Thanks again,

Kojedzinszky Richard

On Sun, 3 May 2015, Sarah Newman wrote:

On 05/03/2015 05:56 AM, Richard Kojedzinszky wrote:
Dear users,

We have a ganeti cluster of 3 supermicro X9SCL/X9SCM servers, exactly the same hardware. In one we have an additional Intel 10G network card. The hosts have a backbone network which is used for drbd and ganeti's shared-file-storage nfs share.

For some domUs (instances) we use drbd mirrors. We have an issue with planned maintenance:

I migrate all domUs off a node which is to be upgraded, yet it is still the DRBD secondary for some domUs' disks.

IIRC migrating the VM causes primary and secondary to switch and it doesn't automatically pick a new secondary.


When the host has no more running domUs, I issue a reboot on it. After that, on the other node, the network card stops working, the kernel shows 'tx hangs', and effectively I cannot recover that dom0 without a reboot either.

I've attached a syslog from a node which has the 10G nic after a reboot has been initiated on another node.

The strange thing is that if I do it the other way around, the same happens, but with the e1000e NICs.

What kind of bug is this? Maybe when the drbd secondary disappears, drbd puts a high load on the NIC? I don't know of any other direct traffic between the two hosts on that dedicated network.

Any thoughts?

It's highly unlikely to be related to xen. Probably there is some sort of deadlock in the kernel but I don't know enough about DRBD to say how.

If you haven't asked on the ganeti or drbd mailing lists that's where you should start. You should include your kernel version including distribution if applicable, the version of drbd driver and userspace tools you're using, and (if asking on the ganeti mailing list) the version of ganeti you're using.

A potential workaround is to switch the secondary node to something which is not getting rebooted before doing the reboot. Something else I might try, if you're using different NICs for the domU networking and the ganeti network, is to bring down the ganeti interface on the now-primary node before doing the reboot of the secondary node. I'm assuming you have serial console access to do this from.
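The second workaround above could be sketched roughly as follows, to be run on the now-primary node from the serial console. This is only an illustration: `eth1` stands in for whatever the backbone (ganeti/DRBD) interface is really called, and `backbone_down` and `DRY_RUN` are made-up names. Bringing the interface down will also disconnect DRBD replication on that link until it is brought back up.

```shell
#!/bin/sh
# Sketch: take the dedicated backbone interface down before the peer reboots.
# Assumption: the backbone uses its own NIC, separate from domU networking.
# With DRY_RUN=1 the command is printed instead of executed.
backbone_down() {
    iface="$1"                        # e.g. eth1 (placeholder name)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "ip link set dev $iface down"
    else
        ip link set dev "$iface" down
    fi
}

# Example (prints the command only):
#   DRY_RUN=1; backbone_down eth1
```

After the peer is back, `ip link set dev eth1 up` (same placeholder name) restores the link.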

If you don't mind, let me know if you figure out what the issue is.



--
Richard Kojedzinszky

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users

 

