[Xen-users] xen + dom0 + e1000e = Detected Hardware Unit Hang
I manage a Fujitsu Primergy TX100 S2 server, which has this integrated NIC:

00:19.0 Ethernet controller [0200]: Intel Corporation 82578DM Gigabit Network Connection [8086:10ef] (rev 05)
        Subsystem: Fujitsu Technology Solutions Device [1734:11a6]
        Kernel driver in use: e1000e

The dom0 kernel is 3.17.7, running on xen 4.5.0.

If the NIC is connected to a 100Mb/s link, or forced to negotiate 100Mb/s (via ethtool's "advertise 0x008"; the exact command is shown below the log excerpts), it works properly even under xen. The kernel dmesg shows this:

[ 8.868197] e1000e: Intel(R) PRO/1000 Network Driver - 3.1.0.2-NAPI
[ 8.868199] e1000e: Copyright(c) 1999 - 2014 Intel Corporation.
[ 8.868412] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[ 9.115322] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:19:99:a7:f8:51
[ 9.115325] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[ 9.115359] e1000e 0000:00:19.0 eth0: MAC: 9, PHY: 9, PBA No: 313130-031
[ 9.131477] e1000e 0000:00:19.0 enp0s25: renamed from eth0
[ 29.669862] e1000e: enp0s25 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx
[ 29.669972] e1000e 0000:00:19.0 enp0s25: 10/100 speed: disabling TSO

However, if it is connected to a 1Gb/s device, these messages appear first:

[ 82.149105] e1000e: Intel(R) PRO/1000 Network Driver - 3.1.0.2-NAPI
[ 82.149108] e1000e: Copyright(c) 1999 - 2014 Intel Corporation.
[ 82.149312] e1000e 0000:00:19.0: Interrupt Throttling Rate (ints/sec) set to dynamic conservative mode
[ 82.396102] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width x1) 00:19:99:a7:f8:51
[ 82.396105] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[ 82.396139] e1000e 0000:00:19.0 eth0: MAC: 9, PHY: 9, PBA No: 313130-031
[ 82.414029] e1000e 0000:00:19.0 enp0s25: renamed from eth0
[ 93.410124] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx

Now, if I download something with high throughput (inbound traffic, say 120Mb/s), it still works properly.
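For reference, the command I use to force 100Mb/s (mentioned above) is roughly the following; enp0s25 is the interface name shown in dmesg:

    # advertise only 100baseT/Full (mask 0x008) so the link comes up at 100 Mb/s
    ethtool -s enp0s25 advertise 0x008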
However, if I try to upload something with high throughput (outbound traffic; the receiving end is willing to accept around 25Mb/s), then under xen the connection hangs for a few seconds and these messages appear in dmesg:

[ 155.601937] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
                TDH <86>  TDT <a0>  next_to_use <a0>  next_to_clean <86>
                buffer_info[next_to_clean]: time_stamp <fffdb4c4>  next_to_watch <86>  jiffies <fffdbcbc>  next_to_watch.status <0>
                MAC Status <40080083>  PHY Status <796d>  PHY 1000BASE-T Status <7800>  PHY Extended Status <2000>  PCI Status <10>
[ 157.602036] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
                TDH <86>  TDT <a0>  next_to_use <a0>  next_to_clean <86>
                buffer_info[next_to_clean]: time_stamp <fffdb4c4>  next_to_watch <86>  jiffies <fffdc48c>  next_to_watch.status <0>
                MAC Status <40080083>  PHY Status <796d>  PHY 1000BASE-T Status <7800>  PHY Extended Status <2000>  PCI Status <10>
[ 159.601880] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
                TDH <86>  TDT <a0>  next_to_use <a0>  next_to_clean <86>
                buffer_info[next_to_clean]: time_stamp <fffdb4c4>  next_to_watch <86>  jiffies <fffdcc5c>  next_to_watch.status <0>
                MAC Status <40080083>  PHY Status <796d>  PHY 1000BASE-T Status <7800>  PHY Extended Status <2000>  PCI Status <10>
[ 161.601989] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
                TDH <86>  TDT <a0>  next_to_use <a0>  next_to_clean <86>
                buffer_info[next_to_clean]: time_stamp <fffdb4c4>  next_to_watch <86>  jiffies <fffdd42c>  next_to_watch.status <0>
                MAC Status <40080083>  PHY Status <796d>  PHY 1000BASE-T Status <7800>  PHY Extended Status <2000>  PCI Status <10>
[ 163.602096] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
                TDH <86>  TDT <a0>  next_to_use <a0>  next_to_clean <86>
                buffer_info[next_to_clean]: time_stamp <fffdb4c4>  next_to_watch <86>  jiffies <fffddbfc>  next_to_watch.status <0>
                MAC Status <40080083>  PHY Status <796d>  PHY 1000BASE-T Status <7800>  PHY Extended Status <2000>  PCI Status <10>
[ 163.605665] ------------[ cut here ]------------
[ 163.605677] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x22c/0x240()
[ 163.605680] NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0 timed out
[ 163.605682] Modules linked in: e1000e(O)
[ 163.605690] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G O 3.17.7-hardened-r1-uther #3
[ 163.605693] Hardware name: FUJITSU PRIMERGY TX100 S2 /D2779, BIOS 6.00 Rev. 1.07.2779.A1 04/29/2011
[ 163.605695] 0000000000000009 ffffffff8183f2a9 ffff88016f203e60 ffffffff8106bb2d
[ 163.605699] 0000000000000000 ffff88016f203eb0 0000000000000001 0000000000000000
[ 163.605703] 0000000000000000 ffffffff8106bb97 ffffffff81ac7c18 0000000000000030
[ 163.605707] Call Trace:
[ 163.605710] <IRQ> [<ffffffff8183f2a9>] ? dump_stack+0x41/0x51
[ 163.605728] [<ffffffff8106bb2d>] ? warn_slowpath_common+0x6d/0x90
[ 163.605730] [<ffffffff8106bb97>] ? warn_slowpath_fmt+0x47/0x50
[ 163.605733] [<ffffffff81505ed2>] ? add_interrupt_randomness+0x32/0x1e0
[ 163.605735] [<ffffffff816d120c>] ? dev_watchdog+0x22c/0x240
[ 163.605737] [<ffffffff816d0fe0>] ? dev_graft_qdisc+0x70/0x70
[ 163.605741] [<ffffffff810ac232>] ? call_timer_fn.isra.36+0x12/0x70
[ 163.605744] [<ffffffff810ac440>] ? run_timer_softirq+0x1b0/0x240
[ 163.605746] [<ffffffff8106eb3b>] ? __do_softirq+0xdb/0x200
[ 163.605748] [<ffffffff8106ee3d>] ? irq_exit+0x4d/0x60
[ 163.605752] [<ffffffff814d5cef>] ? xen_evtchn_do_upcall+0x2f/0x40
[ 163.605755] [<ffffffff818478fe>] ? xen_do_hypervisor_callback+0x1e/0x30
[ 163.605756] <EOI> [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 163.605761] [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[ 163.605764] [<ffffffff8100764c>] ? xen_safe_halt+0xc/0x20
[ 163.605767] [<ffffffff81014775>] ? default_idle+0x5/0x10
[ 163.605770] [<ffffffff810989e7>] ? cpu_startup_entry+0x217/0x260
[ 163.605772] [<ffffffff81bdcea5>] ? 0xffffffff81bdcea5
[ 163.605774] [<ffffffff81bdc8c8>] ? 0xffffffff81bdc8c8
[ 163.605775] [<ffffffff81be011e>] ? 0xffffffff81be011e
[ 163.605777] ---[ end trace 7d81642d805c09bf ]---
[ 163.605785] e1000e 0000:00:19.0 enp0s25: Reset adapter unexpectedly
[ 166.432861] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx

This happens both with the vanilla e1000e driver bundled with 3.17.7 and with the 3.1.0.2 driver downloaded from intel.com, the latter built both with and without CFLAGS_EXTRA=-DDISABLE_PCI_MSI. The same problem existed with the 3.10 and 3.8 kernels. I don't know whether there was ever a working driver/xen combination, because the problem only showed up when we upgraded our switch to a 1Gb/s one, and at that point we were already running the 3.8 kernel and xen 4.3. I was hoping newer releases would eventually fix this, since I found some similar reports on the net, but no luck yet.

I also tried letting the NIC negotiate 1Gb/s with GSO, GRO and TSO disabled via ethtool (exact commands at the end of this mail). The kernel log then stays silent, but other problems appear: reaching the server over this NIC with ssh from the outside shows a lot of latency (up to 5000ms), although when I ping the server periodically (at 1s intervals) the response improves (~1000ms). When the server itself downloads something, this problem does not appear (download speed reaches 120Mb/s from the Internet, our ISP's cap).

Should I attach my kernel config? There is nothing fancy in it. MSI support is compiled into the kernel, and only the E1000E driver is enabled (as a module); the other Intel NIC modules are disabled.

The problem does not appear at all when the very same kernel is booted without xen. I also have a different (non-Intel) NIC attached to the PCIe bus, and it communicates at 1Gb/s properly even under xen, without problems.
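For completeness, the commands mentioned above are roughly the following (enp0s25 is the interface name from dmesg; the make line applies to the out-of-tree e1000e sources from intel.com):

    # disable segmentation/receive offloads on the dom0 NIC
    ethtool -K enp0s25 gso off gro off tso off

    # build the intel.com 3.1.0.2 driver with MSI disabled (run in the driver's src/ directory)
    make CFLAGS_EXTRA=-DDISABLE_PCI_MSI install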
_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users