Dear users,

We have a Ganeti cluster of 3 Supermicro X9SCL/X9SCM servers with identical hardware. One of them has an additional Intel 10G network card. The hosts share a backbone network that carries DRBD replication traffic and Ganeti's shared-file-storage NFS share.

For some domUs (instances) we use DRBD mirrors. We have an issue with planned maintenance:

I migrate all domUs off the node that is to be upgraded, but that node is still the DRBD secondary (slave) for some domUs' disks. Once the host has no more running domUs, I reboot it. After that, the network card on the other node stops working, the kernel logs 'tx hang' messages, and I cannot recover that dom0 without rebooting it as well.

I've attached a syslog from the node with the 10G NIC, captured after a reboot was initiated on another node.

The strange thing is that if I do it the other way around, the same thing happens, but with the e1000e NICs.

What kind of bug is this? Could it be that when the DRBD secondary disappears, DRBD puts a high load on the NIC? I don't know of any other direct traffic between the two hosts on that dedicated network.

Any thoughts?

Richard Kojedzinszky

