
RE: [Xen-devel] Broadcom BCM5709 (bnx2) on Dell PowerEdge R610, Issues



cc'ing Guru...

> -----Original Message-----
> From: Joshua West [mailto:jwest@xxxxxxxxxxxx]
> Sent: Friday, March 18, 2011 1:25 PM
> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> Subject: Re: [Xen-devel] Broadcom BCM5709 (bnx2) on Dell PowerEdge
> R610, Issues
> 
> Hi Guru,
> 
> Awesome, thanks for the tip.
> 
> I'll test out disabling cstates in the BIOS, as I don't believe Xen 3.4.x
> lets you set max_cstate as an argument to xen.gz in grub.conf.
> 
> The patch in the changeset you mention applies to Xen 3.4.3 code.  Do
> you have any experience with that patch working with Xen 3.4.x?  And if
> so, do you think it will end up as part of Xen 3.4.4 (if that ever gets
> tagged/released)?  Assuming disabling cstates in the BIOS alleviates my
> problem, I'll probably give that patch a whirl with cstates enabled and
> see if the issue comes back.  Just wondering if anybody else has used
> that patch with Xen 3.4.3 and found success.
> 
> Thanks.
> 
> On 03/18/11 11:53, Guru Anbalagane wrote:
> > This is likely related to Xen losing interrupts while certain CPUs go
> > into the C6 state.
> > The patch below addresses an issue in this area:
> > http://xenbits.xen.org/hg/xen-unstable.hg/rev/1087f9a03ab6
> > An easy workaround would be to turn off C-states in the BIOS or to limit
> > the maximum C-state in Xen.
> >
> > Hope this helps.
> > Thanks
> > Guru
> >> Message: 5
> >> Date: Fri, 18 Mar 2011 11:39:07 -0400
> >> From: Joshua West<jwest@xxxxxxxxxxxx>
> >> Subject: [Xen-devel] Broadcom BCM5709 (bnx2) on Dell PowerEdge R610
> >>     Issues
> >> To: xen-devel@xxxxxxxxxxxxxxxxxxx
> >> Message-ID:<4D837C9B.6030107@xxxxxxxxxxxx>
> >> Content-Type: text/plain; charset="iso-8859-1"
> >>
> >> Hey folks,
> >>
> >> Unfortunately, ever since we went live with Xen on Dell PowerEdge
> >> R610s, we've been having some odd and aggravating issues.  The NICs
> >> tend to drop out under heavy traffic after 1-7 days of uptime
> >> (random, difficult to reproduce).  But before I get into the issue's
> >> specifics, here's some information about our setup:
> >>
> >>     * Dell PowerEdge R610s w/ 4 onboard Broadcom BCM5709 1-GbE NICs.
> >>     * RHEL 5.6.
> >>     * Xen 3.4.3 (from xen.org; our own compile)
> >>     * Kernel 2.6.18.8 (http://xenbits.xensource.com/linux-2.6.18-xen.hg),
> >>       checkout 1073.
> >>     * bnx2 driver 2.0.18c from Broadcom's netxtreme2-6.0.53 package.
> >>       * bnx2 that ships with 2.6.18.8 doesn't support BCM5709s.
> >>       * Had to use the driver package from broadcom.com in order to get
> >>         networking.
> >>     * NIC bonding in pairs (eth0 + eth1, etc.), with options "mode=4
> >>       lacp_rate=fast miimon=100 use_carrier=1".
> >>
> >> What occurs is that one of the NICs in the bond suddenly stops
> >> responding.  It gets stuck on transmitting, from what I understand.
> >> The kernel logs below include extra debug information: the developers
> >> from Broadcom (Michael Chan and Benjamin Li) were assisting in
> >> troubleshooting and gave me a version of bnx2 2.0.18c that prints out
> >> extra debug information upon NIC crash:
> >>
> >> Mar 18 01:40:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth0: transmit timed out
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2:<--- start FTQ dump on eth0 --->
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_PFTQ_CTL 10000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_TFTQ_CTL 20000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RV2P_MFTQ_CTL 4000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TBDR_FTQ_CTL 4002
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TDMA_FTQ_CTL 10002
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TXP_FTQ_CTL 10002
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_TPAT_FTQ_CTL 10000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_CFTQ_CTL 8000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_RXP_FTQ_CTL 100000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMXQ_FTQ_CTL 10000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMTQ_FTQ_CTL 20000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_COM_COMQ_FTQ_CTL 10000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: BNX2_CP_CPQ_FTQ_CTL 4000
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TXP mode b84c state 80001000 evt_mask 500 pc 8001284 pc 8001284 instr 1440fffc
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: TPAT mode b84c state 80001000 evt_mask 500 pc 8000a50 pc 8000a4c instr 38420001
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: RXP mode b84c state 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: COM mode b8cc state 80008000 evt_mask 500 pc 8000a98 pc 8000a8c instr 8821
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0: CP mode b8cc state 80000000 evt_mask 500 pc 8000c7c pc 8000928 instr 8ce800e8
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2:<--- end FTQ dump on eth0 --->
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0]
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: intr_sem[0] PCI_CMD[00100406]
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> >> Mar 18 01:40:26 xen-san-gb1 kernel: bnx2: eth0 RPM_MGMT_PKT_CTRL[40000088]
> >> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
> >> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 DEBUG: HC_STATS_INTERRUPT_STATUS[01fe0001]
> >> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> >> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 7
> >> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx 307c irq jiffies 100759890
> >> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
> >> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
> >> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
> >> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx 1008f41e2 poll 100759890
> >> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start jiffies 0
> >> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
> >> Mar 18 01:40:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> >> Mar 18 01:40:27 xen-san-gb1 kernel: netdev state 77
> >> Mar 18 01:40:27 xen-san-gb1 kernel: hw status idx 3267 last status idx 307c irq jiffies 100759890
> >> Mar 18 01:40:27 xen-san-gb1 kernel: hw tx cons a669 hw rx cons 103c
> >> Mar 18 01:40:27 xen-san-gb1 kernel: sw tx cons a57c a57c prod a669
> >> Mar 18 01:40:27 xen-san-gb1 kernel: sw rx cons f3c prod 103c
> >> Mar 18 01:40:27 xen-san-gb1 kernel: Current jiffies 1008f4741 HZ fa tx 1008f41e2 poll 100759890
> >> Mar 18 01:40:27 xen-san-gb1 kernel: tx stop jiffies 1008f41e2 tx start jiffies 0
> >> Mar 18 01:40:27 xen-san-gb1 kernel: irq_event c68c36 napi_event c68c37
> >> Mar 18 01:40:27 xen-san-gb1 kernel: bnx2: eth0 NIC Copper Link is Down
> >> Mar 18 01:40:27 xen-san-gb1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
> >>
> >> This was then followed rather quickly by a failure of the second NIC
> >> (eth1) in the bond:
> >>
> >> Mar 18 01:42:26 xen-san-gb1 kernel: NETDEV WATCHDOG: eth1: transmit timed out
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2:<--- start FTQ dump on eth1 --->
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_PFTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_TFTQ_CTL 20000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RV2P_MFTQ_CTL 4000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TBDR_FTQ_CTL 4002
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TDMA_FTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TXP_FTQ_CTL 10002
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_TPAT_FTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_CFTQ_CTL 8000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_RXP_FTQ_CTL 100000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMXQ_FTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMTQ_FTQ_CTL 20000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_COM_COMQ_FTQ_CTL 10000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: BNX2_CP_CPQ_FTQ_CTL 4000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TXP mode b84c state 80005000 evt_mask 500 pc 8001294 pc 8001284 instr 38640001
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: TPAT mode b84c state 80001000 evt_mask 500 pc 8000a58 pc 8000a5c instr 8f820014
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: RXP mode b84c state 80001000 evt_mask 500 pc 8004ad0 pc 8004adc instr 14e0005d
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: COM mode b8cc state 80000000 evt_mask 500 pc 8000a9c pc 8000a94 instr 3c028000
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1: CP mode b8cc state 80008000 evt_mask 500 pc 8000c58 pc 8000c6c instr 27bdffe8
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2:<--- end FTQ dump on eth1 --->
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0]
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: intr_sem[0] PCI_CMD[00100406]
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: PCI_PM[19002008] PCI_MISC_CFG[92000088]
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 DEBUG: EMAC_TX_STATUS[00000008] EMAC_RX_STATUS[00000000]
> >> Mar 18 01:42:26 xen-san-gb1 kernel: bnx2: eth1 RPM_MGMT_PKT_CTRL[40000088]
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG: MCP_STATE_P0[0003610e] MCP_STATE_P1[0003610e]
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 DEBUG: HC_STATS_INTERRUPT_STATUS[01fe0001]
> >> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> >> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 7
> >> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx 29c4 irq jiffies 100759898
> >> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
> >> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
> >> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
> >> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc71 HZ fa tx 1008fb744 poll 100759898
> >> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start jiffies 100239dfd
> >> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
> >> Mar 18 01:42:27 xen-san-gb1 kernel: Ring state for ring 0 napi state 12
> >> Mar 18 01:42:27 xen-san-gb1 kernel: netdev state 77
> >> Mar 18 01:42:27 xen-san-gb1 kernel: hw status idx 2bb0 last status idx 29c4 irq jiffies 100759898
> >> Mar 18 01:42:27 xen-san-gb1 kernel: hw tx cons e421 hw rx cons a8ce
> >> Mar 18 01:42:27 xen-san-gb1 kernel: sw tx cons e334 e334 prod e421
> >> Mar 18 01:42:27 xen-san-gb1 kernel: sw rx cons a7ce prod a8ce
> >> Mar 18 01:42:27 xen-san-gb1 kernel: Current jiffies 1008fbc72 HZ fa tx 1008fb744 poll 100759898
> >> Mar 18 01:42:27 xen-san-gb1 kernel: tx stop jiffies 1008fb744 tx start jiffies 100239dfd
> >> Mar 18 01:42:27 xen-san-gb1 kernel: irq_event ab2e13 napi_event ab2e14
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bnx2: eth1 NIC Copper Link is Down
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: link status definitely down for interface eth1, disabling it
> >> Mar 18 01:42:27 xen-san-gb1 kernel: bonding: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
> >>
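The bonding messages above report each slave being disabled in turn. As a rough sketch of how the same state could be read back from the bonding driver's status file (normally /proc/net/bonding/bond0), the Python below extracts the MII status of each slave. The SAMPLE text is a trimmed, illustrative excerpt and not output captured from the system in this thread; slave_mii_status is a hypothetical helper name.

```python
# Illustrative excerpt in the layout of /proc/net/bonding/bond0.
# This text is an assumption for demonstration, not data from the thread.
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: down
MII Polling Interval (ms): 100

Slave Interface: eth0
MII Status: down
Link Failure Count: 1

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
"""


def slave_mii_status(text):
    """Map each slave interface name to its reported MII status."""
    status = {}
    current = None  # slave whose section is being read
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or line without "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" appears before any slave
            # section and is skipped because current is still None.
            status[current] = value
    return status


if __name__ == "__main__":
    for name, state in slave_mii_status(SAMPLE).items():
        print(name, state)
```

On a live host the text would come from open("/proc/net/bonding/bond0").read() instead of SAMPLE.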
> >> Onto more technical details...
> >>
> >> The kernel we were running (2.6.18.8 from xenbits) was originally
> >> compiled without support for MSI/MSI-X.  So, we were experiencing these
> >> problems with plain standard IRQs.  Michael Chan @ Broadcom, the author
> >> of bnx2 according to modinfo, has told me via email:
> >>
> >>     * "The logs show that we haven't had an interrupt for a very long
> >>       time. It's not clear how that interrupt was lost."
> >>     * "So far the logs don't show any inconsistent state in the hardware
> >>       or software. It is possible that the Xen kernel is missing an
> >>       interrupt and not delivering to the driver. Normally, in INTA mode,
> >>       the IRQ is level triggered and should remain asserted until it is
> >>       seen by the driver and de-asserted by the driver."
> >>
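The remark that no interrupt had been seen "for a very long time" can be checked with a little arithmetic on the eth0 dump. The sketch below converts the hexadecimal jiffies values into seconds. It assumes that "HZ fa" is the tick rate in hex, that "irq jiffies" is the timestamp of the last interrupt the driver saw, and that the ring indices are 16-bit counters; those readings of the debug build's labels are assumptions, and the helper names are hypothetical.

```python
# Values copied from the eth0 dump; the driver prints them in hexadecimal.
HZ = int("fa", 16)                      # timer ticks per second (assumed meaning)
current_jiffies = int("1008f4741", 16)  # "Current jiffies"
irq_jiffies = int("100759890", 16)      # "irq jiffies" (same value as "poll")
tx_stop_jiffies = int("1008f41e2", 16)  # "tx stop jiffies"


def elapsed_seconds(later, earlier, hz=HZ):
    """Convert a difference between two jiffies values into seconds."""
    return (later - earlier) / hz


def ring_backlog(hw_index, sw_index):
    """Entries between the software and hardware consumer indices.

    Assumes 16-bit ring indices, so the subtraction wraps modulo 65536.
    """
    return (hw_index - sw_index) & 0xFFFF


since_irq = elapsed_seconds(current_jiffies, irq_jiffies)
since_tx_stop = elapsed_seconds(current_jiffies, tx_stop_jiffies)
tx_backlog = ring_backlog(int("a669", 16), int("a57c", 16))

if __name__ == "__main__":
    print(f"HZ                      : {HZ}")
    print(f"seconds since irq       : {since_irq:.1f}")
    print(f"seconds since tx stop   : {since_tx_stop:.1f}")
    print(f"tx completions unreaped : {tx_backlog}")
```

With the eth0 values this works out to roughly 6,732 seconds (about 112 minutes) since the recorded interrupt timestamp, 5.5 seconds since the transmit queue stopped, and 237 transmit completions that the hardware has finished but the driver has not yet processed.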
> >> But, just in case, I compiled 2.6.18.8 with support for MSI/MSI-X and
> >> was able to confirm (via dmesg and lspci -vv) that the NICs began to
> >> use MSI for interrupts.  Unfortunately, the NIC crash happened anyway
> >> (the above kernel logs are actually from a run with MSI).
> >>
> >> Here's what's really bugging me.  We have a Dell PowerEdge R610, running
> >> Xen along with the bnx2 drivers from Broadcom, that's been online for
> >> ~220 days without a failure.  The only difference is that this system is
> >> not making use of bonding.  It has just one NIC connected to the network,
> >> with no VLANs trunked down, etc.
> >>
> >> It looks like I'm not alone out there, as there's a Red Hat bugzilla
> >> report for this issue:
> >>
> >> https://bugzilla.redhat.com/show_bug.cgi?id=520888
> >>
> >> ^^ The above has a Status of CLOSED DUPLICATE of bug 511368
> >> (https://bugzilla.redhat.com/show_bug.cgi?id=511368), but it looks like
> >> I don't have access to view 511368. Grrr.
> >>
> >> Anyways...
> >>
> >> 1) Has anybody else experienced this issue?
> >> 2) Any developers care to comment on possible causes of this problem?
> >> 3) Anybody know of a solution?
> >> 4) What can I do to troubleshoot further, and get developers the
> >>    necessary information?
> >>
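On the troubleshooting question, one low-effort check is to compare two snapshots of /proc/interrupts taken a few seconds apart and see whether a NIC's interrupt count is still advancing while traffic is flowing. The sketch below is a minimal, hypothetical helper along those lines. The two sample snapshots are invented for illustration (including the "Phys-irq" type label) and are not taken from the system in this thread.

```python
from itertools import takewhile

# Invented snapshots in the layout of /proc/interrupts (illustration only).
SNAPSHOT_T0 = """\
           CPU0       CPU1
 16:       1200        300   Phys-irq  eth0
 17:        500         10   Phys-irq  eth1
"""
SNAPSHOT_T1 = """\
           CPU0       CPU1
 16:       1200        300   Phys-irq  eth0
 17:        650         10   Phys-irq  eth1
"""


def irq_total(text, device):
    """Sum the per-CPU counts of every IRQ row that lists the device."""
    total = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue  # header row or unrelated line
        # Shared IRQs list several devices separated by commas.
        names = [field.rstrip(",") for field in fields]
        if device not in names:
            continue
        # The per-CPU counters are the leading run of numeric columns.
        counts = takewhile(str.isdigit, fields[1:])
        total += sum(int(count) for count in counts)
    return total


def stalled(before, after, device):
    """True when the device's interrupt count did not change."""
    return irq_total(after, device) == irq_total(before, device)


if __name__ == "__main__":
    for nic in ("eth0", "eth1"):
        print(nic, "stalled" if stalled(SNAPSHOT_T0, SNAPSHOT_T1, nic) else "ok")
```

On a live host the two snapshots would be read from /proc/interrupts with a short sleep in between. A flat counter on an interface that is carrying traffic would be consistent with the lost-interrupt theory, though it would not by itself show where the interrupt was lost.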
> >> Lastly...
> >>
> >> 5) Is anybody running Intel NICs within Dell PowerEdge R610s, using
> >> bonding + Xen 3.4.3 + 2.6.18.8, and can safely report success?  I may
> >> switch to Intel...
> >>
> >> Thanks!
> >>
> >
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@xxxxxxxxxxxxxxxxxxx
> > http://lists.xensource.com/xen-devel
> 
> 
> --
> Joshua West
> Senior Systems Engineer
> Brandeis University
> http://www.brandeis.edu
> 
> 



 

