Xen project Mailing List

[Xen-users] Xen related networking issue

From: Adam Goryachev <mailinglists@xxxxxxxxxxxxxxxxxxxxxx>

Date: Fri, 17 May 2013 01:27:09 +1000

Delivery-date: Thu, 16 May 2013 15:28:30 +0000

List-id: Xen user discussion <xen-users.lists.xen.org>

I have a relatively complicated network and Xen setup, but I'll start with the problem and provide more details below.

From time to time (approximately 1 to 3 times per day), one or more servers (usually one at a time) will stop communicating with the network for anything between a few seconds and a minute (usually around 20 to 40 seconds).

I have 8 physical machines (dom0s), each of which runs one domU VM (except one, which runs two VMs). The VMs are primarily MS Win 2003 R2; one is MS Win XP Pro SP3 and one is MS Win 2008 R2.

The problem seems to be load related (although simply generating network traffic doesn't trigger it); it usually coincides with busy user times (start of day and end of day). It seems to be restricted to the MS Win 2003 R2 servers (which are Terminal Servers), and generally to the busiest machines as far as CPU/disk/network go, except for the domain controller, which would do more network and disk but doesn't have this issue.

So far, I've replaced all the cables and the switch (4 different switches, different models, different manufacturers, etc).

I'm using current Debian Stable packages for Xen:

ii  libxen-4.1                      4.1.4-3+deb7u1  amd64  Public libs for Xen
ii  libxenstore3.0                  4.1.4-3+deb7u1  amd64  Xenstore communications library for Xen
ii  xen-hypervisor-4.1-amd64        4.1.4-3+deb7u1  amd64  Xen Hypervisor on AMD64
ii  xen-linux-system-3.2.0-4-amd64  3.2.41-2        amd64  Xen system with Linux 3.2 on 64-bit PCs (meta-package)
ii  xen-linux-system-amd64          3.2+46          amd64  Xen system with Linux for 64-bit PCs (meta-package)
ii  xen-system-amd64                4.1.4-3+deb7u1  amd64  Xen System on AMD64 (meta-package)
ii  xen-utils-4.1                   4.1.4-3+deb7u1  amd64  XEN administrative tools
ii  xen-utils-common                4.1.4-3+deb7u1  all    Xen administrative tools - common files
ii  xenstore-utils                  4.1.4-3+deb7u1  amd64  Xenstore utilities for Xen

I'm using a simple bridge:

ifconfig -a
eth0      Link encap:Ethernet  HWaddr f4:6d:04:ef:e4:d7
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:59742510 errors:0 dropped:0 overruns:0 frame:0
          TX packets:63945509 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:17925182347 (16.6 GiB)  TX bytes:28533598982 (26.5 GiB)
          Interrupt:39 Base address:0x6000

eth1      Link encap:Ethernet  HWaddr a0:36:9f:19:25:af
          inet addr:10.30.16.31  Bcast:10.30.16.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe19:25af/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:10450615 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11577187 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:31270341044 (29.1 GiB)  TX bytes:15185740522 (14.1 GiB)
          Memory:fe800000-fe900000

eth2      Link encap:Ethernet  HWaddr a0:36:9f:19:25:ae
          inet addr:10.30.16.41  Bcast:10.30.16.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe19:25ae/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:10413185 errors:0 dropped:0 overruns:0 frame:0
          TX packets:11576680 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:31259680024 (29.1 GiB)  TX bytes:15194685404 (14.1 GiB)
          Memory:fea00000-feb00000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:3152 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3152 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:474856 (463.7 KiB)  TX bytes:474856 (463.7 KiB)

vif1.0    Link encap:Ethernet  HWaddr fe:ff:ff:ff:ff:ff
          inet6 addr: fe80::fcff:ffff:feff:ffff/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:63234371 errors:0 dropped:0 overruns:0 frame:0
          TX packets:58998093 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:32
          RX bytes:27401913081 (25.5 GiB)  TX bytes:17707566787 (16.4 GiB)

xenbr0    Link encap:Ethernet  HWaddr f4:6d:04:ef:e4:d7
          inet addr:10.10.10.31  Bcast:10.30.15.255  Mask:255.255.240.0
          inet6 addr: fe80::f66d:4ff:feef:e4d7/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1021860 errors:0 dropped:31621 overruns:0 frame:0
          TX packets:711501 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:199158280 (189.9 MiB)  TX bytes:246436545 (235.0 MiB)

eth1 and eth2 are connected to the iSCSI server (different VLAN, different network); eth0 is on the bridge xenbr0:

brctl show
bridge name     bridge id               STP enabled     interfaces
xenbr0          8000.f46d04efe4d7       no              eth0
                                                        vif1.0

I don't see anything in dmesg or xm dmesg at the time of the problem.

I do see regular single packet drops across various parts of the network (ie, a dozen times a day or more), but I don't think this is an issue. The problem is dropping almost all packets for a period of 10+ seconds.

Note, tcpdump on the dom0, and then examining the capture in wireshark, showed about a dozen packets being sent/received during the outage; some packets were retransmissions, some were ping requests/replies, but 99.9% of the normal network load was missing.

ie, during the outage, one ping packet was not received (the sender's tcpdump showed it had been sent); the next ping packet was received, and dom0 showed the reply was sent as well (reply from the domU), but the other machine never received the reply (missing in tcpdump at the other end).

The only way I see this is that I have two machines which ping every IP 60 times (once per second) every minute, and record the results with the date/time. I can then process the logs, and both machines show the same outage on the same destination machine at the same time.

The switch is currently a Cisco 3560; previously I've used a Netgear GS716Tv2, a Netgear GS748Tv4, and a Netgear unmanaged 16 port gigabit switch.

I'm using the onboard network card at the moment, but had the same issue with an Intel server PCI card.
04:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller (rev 06)

eth1/eth2 are a dual port Intel gigabit ethernet card:
02:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)

I'm using the GPLPV drivers:
Driver Date: 10/22/2012
Driver Version: 0.11.0.372 (Not signed)

In Windows, the following advanced settings are configured:
Check checksum on RX packets: Enabled
Checksum offload: Enabled
Don't fix the blank checksum on offload: Disabled
Large send offload: 61440
Locally Administered Address: (blank)
MTU: 1500
Scatter/Gather: Enabled

The domU config file has networking configured like this:
vif = ['bridge=xenbr0, mac=00:16:3e:39:26:ac']

The domU doesn't record anything in the event viewer, the switch doesn't record anything in its logs, DoS options are disabled on the switch, and the dom0 will respond to pings while the domU doesn't.

For a long time, I thought this was happening to other physical machines as well, but either it isn't anymore, or never was.

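The ping log processing described above could be sketched roughly as follows. This is only an illustration: the log line format ("<ISO timestamp> <target IP> <ok|lost>"), the function names, and the sample address are assumptions, not the actual scripts or data in use here. It reports runs of 5 or more consecutive lost pings per target, and keeps the outages that both monitoring machines saw at overlapping times.

```python
from datetime import datetime, timedelta
from itertools import groupby


def parse_line(line):
    """Parse one line of the assumed format: '<ISO timestamp> <target IP> <ok|lost>'."""
    stamp, target, status = line.split()
    return datetime.fromisoformat(stamp), target, status == "ok"


def find_outages(lines, min_lost=5):
    """Return (target, first_lost, last_lost, count) for every run of at
    least min_lost consecutive lost pings to the same target."""
    by_target = {}
    for line in lines:
        if line.strip():
            when, target, ok = parse_line(line)
            by_target.setdefault(target, []).append((when, ok))

    outages = []
    for target, samples in by_target.items():
        samples.sort()  # order each target's samples by timestamp
        for ok, run in groupby(samples, key=lambda s: s[1]):
            run = list(run)
            if not ok and len(run) >= min_lost:
                outages.append((target, run[0][0], run[-1][0], len(run)))
    return sorted(outages, key=lambda o: (o[1], o[0]))


def seen_by_both(outages_a, outages_b):
    """Keep outages from monitor A that overlap in time with an outage on
    the same target reported by monitor B."""
    return [
        a for a in outages_a
        if any(a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2] for b in outages_b)
    ]


# Made-up sample: ten one-second pings to a hypothetical target, six of them lost.
base = datetime(2013, 5, 16, 9, 0, 0)
sample = [
    f"{(base + timedelta(seconds=i)).isoformat()} 10.10.10.41 "
    f"{'lost' if 1 <= i <= 6 else 'ok'}"
    for i in range(10)
]
for outage in find_outages(sample):
    print(outage)
```

Running find_outages over each monitoring machine's log and passing both results to seen_by_both would leave only the outages that both machines observed, which helps rule out a fault on the monitoring side.
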
At least the last 4 weeks of ping stats I have show that only the domUs, and only the terminal servers, will lose more than 4 consecutive pings (aside from outages caused by changing hardware/etc).

Any hints on what to look at, additional information needed, how to diagnose, or any options other than continued hair-pulling would be immensely appreciated.

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users
