Xen project Mailing List

Hi all, this is going to be a long explanation, I beg your pardon.

I'm a happy ... I mean REALLY happy, Xen user, and have been for about a year.

I have two production servers running some 8-10 VMs.
The two hosts run Debian as Dom0, whereas the DomUs are an assortment of Linux distributions and Windows. No issues with the DomUs.

The two hosts share an iSCSI SAN where the DomU images are stored.

In this configuration, the two hosts allow hot/live, warm and cold VM migration between each other, just great!!!

Now, a few days ago, one of the two servers crashed. It rebooted with no noticeable problems: no events in the system and Xen log files, no issues with iSCSI and LVM, no data corruption, all VMs running happily.

Nevertheless, since then it's been the end of VM world as we know it.

What happens is that the networking subsystem appears to be badly damaged: ping latency to xenbr0 from the LAN has increased by several orders of magnitude, from a normal 0.2-0.3 ms up to 300 ms.
Given this latency, DomU network performance and accessibility from the LAN/WAN are degraded to an unacceptable level; a simple file transfer pushes latency to the order of THOUSANDS of ms.
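
For anyone who wants to reproduce the measurement, the latency figures above can be collected with a small script along these lines. This is only a sketch: it assumes the summary line format printed by the Linux iputils ping, and the helper names (parse_rtt, measure) are made up for illustration.

```python
import re
import subprocess

# Matches the summary line printed by Linux iputils ping, e.g.
# "rtt min/avg/max/mdev = 0.211/0.254/0.301/0.032 ms"
# (the "round-trip ... stddev" variant is accepted as well).
RTT_RE = re.compile(
    r"(?:rtt|round-trip) min/avg/max/(?:mdev|stddev) = "
    r"([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms"
)


def parse_rtt(ping_output):
    """Return (min, avg, max, mdev) in ms, or None if no summary line is found."""
    match = RTT_RE.search(ping_output)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def measure(host, count=5):
    """Ping host `count` times and return the parsed summary (None on failure)."""
    result = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return parse_rtt(result.stdout)


if __name__ == "__main__":
    # Assumes both host names resolve from the machine running the script.
    for host in ("villano", "rocciamelone"):
        print(host, measure(host))
```

Run from a machine on the LAN, it prints one (min, avg, max, mdev) tuple per host, which makes it easy to log which host is degraded at any given time.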

I could imagine that some piece of HW, such as the NIC, broke during the crash, but this is not the case. Hold on, the best is yet to come.

This degradation is not permanent, but it follows a predictable rule. Here is what I discovered:

Let me first state the names of the two servers, for clarity's sake: one is called "villano" (the one that crashed), the other is "rocciamelone" ... I chose the names after two conspicuous mountains in my valley :-)

latency on villano is degraded, so I move all domUs to rocciamelone, whose latency is OK
rocciamelone latency OK, villano KO, domUs net OK
shut down villano
rocciamelone latency still OK
reboot villano, domUs still on rocciamelone
villano latency OK
rocciamelone latency KO !!!!!! useless domUs :-(
move domUs to villano
villano latency OK, rocciamelone latency still KO (as expected)
reboot rocciamelone, domUs still on villano
rocciamelone latency OK after reboot
villano latency KO after rocciamelone reboot, so long to domUs

The pattern is clear, and it drives me mad: it seems like every time one of the two hosts reboots, it takes good latency away from the other and renders it useless.
So now I have to keep one host disconnected in order for the other to be operational. Forget about fancy hot and warm standby, we're out in the cold.

Diagnostics: nothing wrong in syslog and xend.log, and ethtool on the NIC reports OK too.
LAN and SAN use different NICs and subnets, and there is no issue on the SAN network.
So I tried some analysis of system behavior, like monitoring traffic on NICs and vifs ... and here something weird happens.

On the healthy system, I scripted a periodic "netstat" check (every 5 s) to keep track of exchanged data volumes. It looks like this probe is rather intrusive (it shouldn't be): while my scripts are running, I notice a partial degradation of latency, up to around 2-8 ms.
This degradation is reversible: it disappears as soon as I kill the scripts.
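
To give an idea of what such a probe does, here is a rough sketch (not the actual scripts). As an assumption for illustration, it reads the per-interface byte counters from /proc/net/dev directly instead of spawning netstat every 5 s; the function names are hypothetical.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def deltas(before, after):
    """Bytes received/sent per interface between two snapshots."""
    return {
        iface: (after[iface][0] - before[iface][0],
                after[iface][1] - before[iface][1])
        for iface in after
        if iface in before
    }


def read_counters(path="/proc/net/dev"):
    """Take one snapshot of the kernel's interface counters."""
    with open(path) as handle:
        return parse_net_dev(handle.read())


if __name__ == "__main__":
    INTERVAL = 5  # seconds, the same period as the netstat check
    previous = read_counters()
    while True:
        time.sleep(INTERVAL)
        current = read_counters()
        for iface, (rx, tx) in sorted(deltas(previous, current).items()):
            print(f"{iface}: rx {rx} B, tx {tx} B in {INTERVAL} s")
        previous = current
```

A single file read every few seconds should cost next to nothing, so if latency still rises while something like this is running, the probe itself is unlikely to be the real cause.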

Any idea? Anything I could check/troubleshoot?

Help will be GREATLY appreciated,

Ezio

--
Ezio Ostorero, Catania
Seltzer and lemon with salt. Stirred, not shaken

[Xen-users] Xen networking degrade