
Re: [Xen-users] domU network has sleeping sickness



I've seen the same problem with my Xen 3.1.0 setup.  What
the Xen gurus are telling us is that this is a symptom of dom0
being busy and not servicing the domUs' network interrupts promptly.
Their advice to us was to move an application that had been running
on dom0 to another Xen instance to see if that would help.  We are
in the process of implementing that solution now.

By the way, my system (Dell PowerEdge 2950) has built-in Broadcom
network cards, not Intel E1000, so it is unlikely that this is a
network-driver-specific issue.

During these episodes of lost network connectivity, it was not
unusual to see the following kernel call trace in dom0:

2008-02-05T18:35:16-06:00 s_sys@xxxxxxxxxxxxxxxxxxx kernel: Call Trace:
2008-02-05T18:35:16-06:00 s_sys@xxxxxxxxxxxxxxxxxxx kernel: <IRQ> [<ffffffff80258269>] softlockup_tick+0xcc/0xde
2008-02-05T18:35:16-06:00 s_sys@xxxxxxxxxxxxxxxxxxx kernel: [<ffffffff8020e84d>] timer_interrupt+0x3a3/0x401
2008-02-05T18:35:16-06:00 s_sys@xxxxxxxxxxxxxxxxxxx kernel: [<ffffffff80258898>] handle_IRQ_event+0x4b/0x93
2008-02-05T18:35:16-06:00 s_sys@xxxxxxxxxxxxxxxxxxx kernel: [<ffffffff8025897e>] __do_IRQ+0x9e/0x100
2008-02-05T18:35:16-06:00 s_sys@xxxxxxxxxxxxxxxxxxx kernel: [<ffffffff8020cc97>] do_IRQ+0x63/0x71
2008-02-05T18:35:16-06:00 s_sys@xxxxxxxxxxxxxxxxxxx kernel: [<ffffffff8034b347>] evtchn_do_upcall+0xee/0x165
2008-02-05T18:35:16-06:00 s_sys@xxxxxxxxxxxxxxxxxxx kernel: [<ffffffff8020abca>] do_hypervisor_callback+0x1e/0x2c
2008-02-05T18:35:16-06:00 s_sys@xxxxxxxxxxxxxxxxxxx kernel: <EOI>

or

Feb 25 10:32:39 fermigrid6 kernel: BUG: soft lockup detected on CPU#0!
Feb 25 10:32:39 fermigrid6 kernel:
Feb 25 10:32:39 fermigrid6 kernel: Call Trace:
Feb 25 10:32:39 fermigrid6 kernel: <IRQ> [<ffffffff80258269>] softlockup_tick+0xcc/0xde
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8020e84d>] timer_interrupt+0x3a3/0x401
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff80258898>] handle_IRQ_event+0x4b/0x93
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8025897e>] __do_IRQ+0x9e/0x100
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8020cc97>] do_IRQ+0x63/0x71
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8034b347>] evtchn_do_upcall+0xee/0x165
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8020abca>] do_hypervisor_callback+0x1e/0x2c
Feb 25 10:32:39 fermigrid6 kernel: <EOI> [<ffffffff8020622a>] hypercall_page+0x22a/0x1000
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8020622a>] hypercall_page+0x22a/0x1000
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8034b258>] force_evtchn_callback+0xa/0xb
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff803f2272>] thread_return+0xdf/0x119
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8020622a>] hypercall_page+0x22a/0x1000
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff80228a25>] __cond_resched+0x1c/0x44
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff803f25df>] cond_resched+0x37/0x42
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff802343c4>] ksoftirqd+0x0/0xbf
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff80234432>] ksoftirqd+0x6e/0xbf
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff802422d7>] kthread+0xc8/0xf1
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8020ae1c>] child_rip+0xa/0x12
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8024220f>] kthread+0x0/0xf1
Feb 25 10:32:39 fermigrid6 kernel:  [<ffffffff8020ae12>] child_rip+0x0/0x12
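
To see how often these lockups occur, and on which CPU, a syslog file
can be scanned for the "soft lockup detected" message. The following is
a minimal Python sketch, assuming the message format of the
"BUG: soft lockup detected on CPU#0!" line above; the function name
count_soft_lockups and the regular expression are illustrative
assumptions, and other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Assumed format (taken from the trace above):
#   Feb 25 10:32:39 fermigrid6 kernel: BUG: soft lockup detected on CPU#0!
# The token directly before " kernel:" is treated as the host name.
LOCKUP_RE = re.compile(r"(\S+) kernel: BUG: soft lockup detected on CPU#(\d+)!")


def count_soft_lockups(lines):
    """Count soft-lockup reports per (host, cpu) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            host, cpu = match.group(1), int(match.group(2))
            counts[(host, cpu)] += 1
    return counts
```

To run it against a real log, pass an open file object, for example
count_soft_lockups(open("/var/log/messages")); that path is an
assumption and varies by distribution.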

----------------

One of our dom0s was running an LVS server; the other one, on
identical hardware, was not. We moved the LVS server from one to the
other and the network problems and kernel panics followed it.

Steve Timm

On Mon, 3 Mar 2008, Marc Teichgraeber wrote:

Hi all,

I have a strange network problem with some domUs on three Xen hosts.
They are losing their network connectivity. I use bridged networking.
  * It happens randomly, either right after the domU boots or at any
time later.
  * The domU is not reachable from another host on the LAN.
  * The domU is always reachable from the dom0 (ssh, ping).
  * I can 'repair' the connection by attaching to the console and
pinging out from the domU. At first nothing happens, then the machine
gets its network back. (That is also my current workaround: pinging all
the time from the console.)
  * Pinging from another host at the same time helps too.
  * Sometimes I can ping continuously from one host while another
host gets only every 10th packet or so back.
  * The interfaces can also wake up from their sleep by themselves.
  * When the network has fallen asleep, ssh to the domU from another
host hangs; it does not come back with "no route to host" or similar.
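
To put a number on the "only every 10th packet" symptom, captured ping
output can be compared with the number of requests sent. The following
is a minimal Python sketch, assuming reply lines contain "icmp_seq=N" as
printed by the common Linux ping; the function name reply_stats is
hypothetical and the number of requests sent has to be supplied by the
caller.

```python
import re

# Assumption: each reply line carries its sequence number as "icmp_seq=N".
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def reply_stats(ping_output, sent):
    """Return (replies_received, fraction_answered) for `sent` echo requests.

    Duplicate replies for the same sequence number are counted once.
    """
    seqs = {int(m.group(1)) for m in SEQ_RE.finditer(ping_output)}
    received = len(seqs)
    fraction = (received / sent) if sent else 0.0
    return received, fraction
```

Running this on output captured from two LAN hosts at the same time
would show whether one of them is answered far less often than the
other.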

I'm suspicious of the network controllers; they are the same on all
hosts: "Intel Corporation 80003ES2LAN Gigabit Ethernet Controller
(Copper)" (lspci), some kind of "Intel® PRO/1000 EB Network Connection
with I/O Acceleration" (Intel website). I've tried the latest e1000
driver from Intel but it didn't help.
I've checked all MAC addresses and IP addresses; they are unique.

Any ideas are welcome :)

-------------------------------------------------------------------------
"xm info" from host1,  openSUSE 10.2 (X86-64):

release                : 2.6.18.8-0.9-xen
version                : #1 SMP Sun Feb 10 22:48:05 UTC 2008
machine                : x86_64
nr_cpus                : 4
nr_nodes               : 1
sockets_per_node       : 2
cores_per_socket       : 2
threads_per_core       : 1
cpu_mhz                : 2327
hw_caps                : bfebfbff:20100800:00000000:00000140:0004e3bd:00000000:00000001
total_memory           : 32766
free_memory            : 21607
max_free_memory        : 21607
max_para_memory        : 21603
max_hvm_memory         : 21544
xen_major              : 3
xen_minor              : 0
xen_extra              : .3_11774-23
xen_caps               : xen-3.0-x86_64
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : 11774
cc_compiler            : gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)
cc_compile_by          : abuild
cc_compile_domain      : suse.de
cc_compile_date        : Thu Jan 10 21:22:54 UTC 2008
xend_config_format     : 2
-------------------------------------------------------------------------
"xm info" output on host2, openSUSE 10.3 (X86-64)

release                : 2.6.22.13-0.3-xen
version                : #1 SMP 2007/11/19 15:02:58 UTC
machine                : x86_64
nr_cpus                : 8
nr_nodes               : 1
sockets_per_node       : 2
cores_per_socket       : 4
threads_per_core       : 1
cpu_mhz                : 3000
hw_caps                : bfebfbff:20100800:00000000:00000140:0004e3bd:00000000:00000001
total_memory           : 16382
free_memory            : 591
max_free_memory        : 591
max_para_memory        : 587
max_hvm_memory         : 577
xen_major              : 3
xen_minor              : 1
xen_extra              : .0_15042-51
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : 15042
cc_compiler            : gcc version 4.2.1 (SUSE Linux)
cc_compile_by          : abuild
cc_compile_domain      : suse.de
cc_compile_date        : Tue Sep 25 21:16:06 UTC 2007
xend_config_format     : 4
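
When comparing hosts, it can help to turn "xm info" listings like the
two above into dictionaries. The following is a minimal Python sketch,
assuming the "key : value" layout shown here; it folds any wrapped
continuation line into the preceding key. The helper names
parse_xm_info and xen_version are hypothetical, and the input should
contain only the "xm info" output, without separator lines.

```python
import re

# A key is a single word followed by whitespace and a colon, e.g.
# "xen_major              : 3". Lines that do not match are treated
# as the continuation of the previous value.
KEY_VALUE_RE = re.compile(r"^(\w+)\s+:\s*(.*)$")


def parse_xm_info(text):
    """Parse "xm info" style output into a dict of strings."""
    info = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = KEY_VALUE_RE.match(line)
        if match:
            last_key = match.group(1)
            info[last_key] = match.group(2)
        elif last_key is not None:
            # Wrapped line: append it to the value of the previous key.
            info[last_key] = (info[last_key] + " " + line).strip()
    return info


def xen_version(info):
    """Join xen_major, xen_minor and xen_extra into one version string."""
    return "{}.{}{}".format(info["xen_major"], info["xen_minor"], info["xen_extra"])
```

Parsing both listings and comparing the resulting dictionaries key by
key makes the differences between the two hosts (Xen version, kernel
release, free memory) easy to list.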



--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
