
[Xen-users] xen + dom0 + e1000e = Detected Hardware Unit Hang


  • To: xen-users@xxxxxxxxxxxxx
  • From: Zoltán Halassy <cf0hay@xxxxxxxxx>
  • Date: Wed, 25 Mar 2015 10:18:46 +0100
  • Delivery-date: Wed, 25 Mar 2015 09:19:53 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

I manage a Fujitsu Primergy TX100 S2 server, which has this integrated NIC:

00:19.0 Ethernet controller [0200]: Intel Corporation 82578DM Gigabit
Network Connection [8086:10ef] (rev 05)
        Subsystem: Fujitsu Technology Solutions Device [1734:11a6]
        Kernel driver in use: e1000e

The dom0 kernel is 3.17.7, running on Xen 4.5.0.

If the NIC is connected to a 100Mb/s link, or forced to negotiate
100Mb/s (via ethtool advertise 0x008), it works properly even under
Xen. The kernel dmesg shows this:

[    8.868197] e1000e: Intel(R) PRO/1000 Network Driver - 3.1.0.2-NAPI
[    8.868199] e1000e: Copyright(c) 1999 - 2014 Intel Corporation.
[    8.868412] e1000e 0000:00:19.0: Interrupt Throttling Rate
(ints/sec) set to dynamic conservative mode
[    9.115322] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width
x1) 00:19:99:a7:f8:51
[    9.115325] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[    9.115359] e1000e 0000:00:19.0 eth0: MAC: 9, PHY: 9, PBA No: 313130-031
[    9.131477] e1000e 0000:00:19.0 enp0s25: renamed from eth0
[   29.669862] e1000e: enp0s25 NIC Link is Up 100 Mbps Full Duplex,
Flow Control: Rx
[   29.669972] e1000e 0000:00:19.0 enp0s25: 10/100 speed: disabling TSO

However, if it's connected to a 1Gb/s device, these messages appear first:

[   82.149105] e1000e: Intel(R) PRO/1000 Network Driver - 3.1.0.2-NAPI
[   82.149108] e1000e: Copyright(c) 1999 - 2014 Intel Corporation.
[   82.149312] e1000e 0000:00:19.0: Interrupt Throttling Rate
(ints/sec) set to dynamic conservative mode
[   82.396102] e1000e 0000:00:19.0 eth0: (PCI Express:2.5GT/s:Width
x1) 00:19:99:a7:f8:51
[   82.396105] e1000e 0000:00:19.0 eth0: Intel(R) PRO/1000 Network Connection
[   82.396139] e1000e 0000:00:19.0 eth0: MAC: 9, PHY: 9, PBA No: 313130-031
[   82.414029] e1000e 0000:00:19.0 enp0s25: renamed from eth0
[   93.410124] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: Rx

Now, if I download something at high throughput (inbound traffic,
say, 120Mb/s), it works properly too. However, if I upload something
at high throughput under Xen (outbound traffic; the receiving end is
willing to accept around 25Mb/s), the connection hangs for a few
seconds and these messages appear in dmesg:

[  155.601937] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdbcbc>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  157.602036] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdc48c>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  159.601880] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdcc5c>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  161.601989] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdd42c>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  163.602096] e1000e 0000:00:19.0 enp0s25: Detected Hardware Unit Hang:
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffddbfc>
  next_to_watch.status <0>
MAC Status             <40080083>
PHY Status             <796d>
PHY 1000BASE-T Status  <7800>
PHY Extended Status    <2000>
PCI Status             <10>
[  163.605665] ------------[ cut here ]------------
[  163.605677] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264
dev_watchdog+0x22c/0x240()
[  163.605680] NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0 timed out
[  163.605682] Modules linked in: e1000e(O)
[  163.605690] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G           O
3.17.7-hardened-r1-uther #3
[  163.605693] Hardware name: FUJITSU
PRIMERGY TX100 S2             /D2779, BIOS 6.00 Rev. 1.07.2779.A1
     04/29/2011
[  163.605695]  0000000000000009 ffffffff8183f2a9 ffff88016f203e60
ffffffff8106bb2d
[  163.605699]  0000000000000000 ffff88016f203eb0 0000000000000001
0000000000000000
[  163.605703]  0000000000000000 ffffffff8106bb97 ffffffff81ac7c18
0000000000000030
[  163.605707] Call Trace:
[  163.605710]  <IRQ>  [<ffffffff8183f2a9>] ? dump_stack+0x41/0x51
[  163.605728]  [<ffffffff8106bb2d>] ? warn_slowpath_common+0x6d/0x90
[  163.605730]  [<ffffffff8106bb97>] ? warn_slowpath_fmt+0x47/0x50
[  163.605733]  [<ffffffff81505ed2>] ? add_interrupt_randomness+0x32/0x1e0
[  163.605735]  [<ffffffff816d120c>] ? dev_watchdog+0x22c/0x240
[  163.605737]  [<ffffffff816d0fe0>] ? dev_graft_qdisc+0x70/0x70
[  163.605741]  [<ffffffff810ac232>] ? call_timer_fn.isra.36+0x12/0x70
[  163.605744]  [<ffffffff810ac440>] ? run_timer_softirq+0x1b0/0x240
[  163.605746]  [<ffffffff8106eb3b>] ? __do_softirq+0xdb/0x200
[  163.605748]  [<ffffffff8106ee3d>] ? irq_exit+0x4d/0x60
[  163.605752]  [<ffffffff814d5cef>] ? xen_evtchn_do_upcall+0x2f/0x40
[  163.605755]  [<ffffffff818478fe>] ? xen_do_hypervisor_callback+0x1e/0x30
[  163.605756]  <EOI>  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[  163.605761]  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[  163.605764]  [<ffffffff8100764c>] ? xen_safe_halt+0xc/0x20
[  163.605767]  [<ffffffff81014775>] ? default_idle+0x5/0x10
[  163.605770]  [<ffffffff810989e7>] ? cpu_startup_entry+0x217/0x260
[  163.605772]  [<ffffffff81bdcea5>] ? 0xffffffff81bdcea5
[  163.605774]  [<ffffffff81bdc8c8>] ? 0xffffffff81bdc8c8
[  163.605775]  [<ffffffff81be011e>] ? 0xffffffff81be011e
[  163.605777] ---[ end trace 7d81642d805c09bf ]---
[  163.605785] e1000e 0000:00:19.0 enp0s25: Reset adapter unexpectedly
[  166.432861] e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: Rx
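
To make the hang dumps above easier to read, here is a minimal Python
sketch that parses one dump and works out how many TX descriptors are
outstanding and how long the oldest buffer has been waiting. The helper
name parse_hang_dump is made up for illustration, and the ring size of
256 descriptors is an assumption, not something taken from this system.

```python
import re

# One hang dump copied from the dmesg output above (first occurrence).
HANG_DUMP = """\
  TDH                  <86>
  TDT                  <a0>
  next_to_use          <a0>
  next_to_clean        <86>
buffer_info[next_to_clean]:
  time_stamp           <fffdb4c4>
  next_to_watch        <86>
  jiffies              <fffdbcbc>
  next_to_watch.status <0>
"""

# Assumption: 256 TX descriptors; adjust to the ring size actually configured.
TX_RING_SIZE = 256


def parse_hang_dump(text):
    """Collect every 'name <hex>' line of a dump into {name: int}."""
    pattern = re.compile(r"^[ \t]*(\S+)[ \t]+<([0-9a-fA-F]+)>[ \t]*$", re.M)
    return {name: int(value, 16) for name, value in pattern.findall(text)}


fields = parse_hang_dump(HANG_DUMP)

# Descriptors queued by the driver (tail) but not yet consumed by the NIC (head).
pending = (fields["TDT"] - fields["TDH"]) % TX_RING_SIZE

# Age of the oldest unfinished buffer in jiffies; the mask keeps the result
# sane if the 32-bit values shown in the dump wrapped in between.
stalled = (fields["jiffies"] - fields["time_stamp"]) & 0xFFFFFFFF

# Two consecutive dumps: jiffies delta divided by the dmesg timestamp delta.
hz_estimate = (0xFFFDC48C - 0xFFFDBCBC) / (157.602036 - 155.601937)

print(f"pending descriptors: {pending}")
print(f"stalled for {stalled} jiffies (~{stalled / hz_estimate:.2f} s)")
```

For the first dump this gives 26 pending descriptors and a buffer that
has been waiting 2040 jiffies. Consecutive dumps are 2000 jiffies and
about 2 seconds apart, which suggests HZ=1000. TDH and time_stamp are
identical in all five dumps, so the ring does not advance at all until
the watchdog resets the adapter.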

This happens with the vanilla e1000e driver bundled with 3.17.7 and
with the 3.1.0.2 driver downloaded from intel.com, both with and
without CFLAGS_EXTRA=-DDISABLE_PCI_MSI. The 3.10 and 3.8 kernels had
the same problem. I don't know whether there was ever a working
driver/Xen combination, because the problem appeared only when we
upgraded our switch to a 1Gb/s one, and by then we were already
running the 3.8 kernel and Xen 4.3. I was hoping newer releases would
fix this eventually, as I found some similar problems on the net, but
no luck yet.

I also let the NIC negotiate 1Gb/s and tried "ethtool gso off gro off
tso off". The kernel log then remains silent, but other problems appear:
reaching the server with ssh over this NIC from the outside shows a lot
of latency (up to 5000ms), although when I ping the server periodically
(1s intervals), the response time improves (~1000ms). When the server
downloads something, this problem does not appear (download speed
reaches 120Mb/s from the Internet, the cap of the ISP).
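
For reference, the two ethtool settings mentioned in this mail can be
spelled out as full command lines. The following is only a sketch: the
helper names advertise_cmd and offload_cmd are made up for illustration,
the link-mode bit values are the ones listed for "advertise" in the
ethtool(8) man page, and the code just builds and prints the commands
without running them.

```python
import shlex

# Link-mode bits for "ethtool -s <iface> advertise <mask>" (ethtool(8)).
ADVERTISE_BITS = {
    "10baseT/Half": 0x001,
    "10baseT/Full": 0x002,
    "100baseT/Half": 0x004,
    "100baseT/Full": 0x008,
    "1000baseT/Full": 0x020,
}


def advertise_cmd(iface, modes):
    """Build the ethtool command that restricts advertised link modes."""
    mask = 0
    for mode in modes:
        mask |= ADVERTISE_BITS[mode]
    return ["ethtool", "-s", iface, "advertise", f"0x{mask:03x}"]


def offload_cmd(iface, features, state):
    """Build the ethtool command that switches offload features on or off."""
    cmd = ["ethtool", "-K", iface]
    for feature in features:
        cmd += [feature, state]
    return cmd


# Force 100Mb/s full duplex only (the configuration that works under Xen).
force_100 = advertise_cmd("enp0s25", ["100baseT/Full"])

# Disable the segmentation/receive offloads tried at 1Gb/s.
no_offload = offload_cmd("enp0s25", ["gso", "gro", "tso"], "off")

# Printed rather than executed; run them as root to apply the settings.
print(shlex.join(force_100))
print(shlex.join(no_offload))
```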

Should I attach my kernel config? Nothing fancy there. MSI support is
compiled into the kernel. Only the E1000E driver is built as a module;
the other Intel modules are disabled.

The problem doesn't appear at all when the very same kernel is booted
without xen.

I also have a different (non-Intel) NIC attached to the PCIe bus; it
runs at 1Gb/s under Xen without any problems.



 

