
RE: [Xen-devel] Re: blocking Xen 3.X production use: soft lockup bugs



> Here are some examples of the sort of soft lockups I'm seeing -- I
> can't say right now if they've all been showing the same stack trace,
> but I'll keep an eye on that from now on.  I know they haven't all
> been on the same CPU.  Anything else anyone needs, just let me know --
> and I'd like to reaffirm my earlier offer of access to one of these
> machines.
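
One way to tell whether the reports share a stack trace is to tally them by CPU, command name, and EIP location. The following is a minimal Python sketch, assuming the guest console output has been saved as text in the format shown in the log further down; the function name `summarize_lockups` is illustrative and not part of any Xen tool.

```python
import re
from collections import Counter

# Patterns match the report format seen in the quoted console log.
LOCKUP = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")
PID = re.compile(r"Pid:\s*(\d+),\s*comm:\s*(\S+)")
EIP = re.compile(r"EIP is at (\S+)")


def summarize_lockups(console_text):
    """Count soft lockup reports, keyed by (cpu, comm, eip_location)."""
    reports = Counter()
    current = None  # [cpu, comm] of the report being parsed, if any
    for raw in console_text.splitlines():
        # Tolerate mail quoting ("> ") in front of each console line.
        line = raw.lstrip("> ").strip()
        match = LOCKUP.search(line)
        if match:
            current = [int(match.group(1)), None]
            continue
        if current is None:
            continue
        match = PID.search(line)
        if match:
            current[1] = match.group(2)
            continue
        match = EIP.search(line)
        if match:
            reports[(current[0], current[1], match.group(1))] += 1
            current = None
    return reports
```

Calling `summarize_lockups(open("console.log").read())` (the file name is hypothetical) returns a `Counter`; many reports under one key suggest a common code path, while scattered keys suggest the warnings are landing wherever the guest happened to be running.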

So, you're seeing this on a 1-CPU, 32-bit, non-PAE guest. Could you
please answer the other questions from my previous email? How many other
guests were running, and what else was the system doing? According to
'xm list', is the guest burning CPU? What about dom0?
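
To answer the CPU question, two 'xm list' samples taken a known interval apart can be compared. The sketch below is only an illustration: it assumes the usual column layout (Name, ID, Mem(MiB), VCPUs, State, Time(s)) with cumulative CPU seconds in the last column, and the helper names `parse_xm_list` and `cpu_share` are made up for this example.

```python
def parse_xm_list(text):
    """Map domain name -> cumulative CPU seconds from `xm list` output."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the header row.
        if len(fields) < 6 or fields[0] == "Name":
            continue
        try:
            times[fields[0]] = float(fields[-1])
        except ValueError:
            continue  # last column was not a number; ignore the line
    return times


def cpu_share(before, after, interval_s):
    """CPU used by each domain between two samples, as a fraction of one CPU."""
    return {
        name: (after[name] - before[name]) / interval_s
        for name in after
        if name in before
    }
```

Feed it the text of two runs of 'xm list' taken, say, 10 seconds apart. A value near 1.0 for a single-VCPU guest means it is burning a full CPU; a value near 0 for a guest that is printing lockup warnings points toward the VCPU not being scheduled.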

The soft lockup messages appear to be benign, in that the domain seems
to continue quite happily after printing them. It's quite possible
that the system was busy enough that the domain's VCPU just didn't
get scheduled for a while, which triggered the warning. Are you sure
they're actually related to the more serious problem you're
experiencing?
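
The false-alarm mechanism can be pictured with a toy model. This is a simplified sketch, not the kernel's code: it assumes the watchdog warns when more than 10 seconds of guest time pass without its per-CPU thread running, and that guest time jumps forward in one step when a descheduled VCPU resumes. The threshold value and the names used are assumptions made for illustration.

```python
THRESHOLD_S = 10  # assumed warning threshold, in seconds


def lockup_warnings(events, threshold_s=THRESHOLD_S):
    """Return guest-clock times (seconds) at which a warning would fire.

    events is a list of (kind, seconds) pairs:
      "run"    - the VCPU is scheduled; the watchdog thread runs regularly
      "stolen" - the VCPU is descheduled; guest time jumps forward on resume
    """
    now = 0
    last_touch = 0  # last time the watchdog thread refreshed its timestamp
    warnings = []
    for kind, seconds in events:
        if kind == "run":
            # The watchdog thread keeps refreshing, so no gap builds up.
            now += seconds
            last_touch = now
        elif kind == "stolen":
            # The clock catches up before the watchdog thread has run again.
            now += seconds
            if now - last_touch > threshold_s:
                warnings.append(now)
                last_touch = now  # count each gap once (a simplification)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return warnings
```

In this model a guest that loses the CPU for 12 seconds reports a lockup even though nothing inside it was stuck, while a 3-second gap passes silently.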

Have you tried -unstable, and hence Xen's new scheduler? It is
less likely to provoke soft lockup false alarms.

Ian 

 
> I'm also starting to think a XenSource wiki page "how to
> report/workaround soft lockups" might be in order; I suspect many of
> the bug reports (including my own) haven't been detailed enough to
> differentiate between the various things that can cause soft lockups.
> 
> This was on an IBM x330.
> 
> Steve
> 
> n4h34:~# xm create -c /etc/xen/auto/build2.t7a.org
> Using config file "/etc/xen/auto/build2.t7a.org".
> Started domain build2.t7a.org
> Linux version 2.6.16.13-xen (root@n4h33) (gcc version 3.3.5 (Debian 1:3.3.5-12)) #2 SMP Sun Jun 11 14:25:16 PDT 2006
> BIOS-provided physical RAM map:
>  Xen: 0000000000000000 - 0000000008000000 (usable)
> 0MB HIGHMEM available.
> 136MB LOWMEM available.
> ACPI in unprivileged domain disabled
> IRQ lockup detection disabled
> Built 1 zonelists
> Kernel command line:  root=/dev/sda1 2
> Enabling fast FPU save and restore... done.
> Enabling unmasked SIMD FPU exception support... done.
> Initializing CPU#0
> PID hash table entries: 1024 (order: 10, 16384 bytes)
> Xen reported: 1130.113 MHz processor.
> Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
> Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
> Software IO TLB disabled
> vmalloc area: c9000000-fb7fe000, maxmem 33ffe000
> Memory: 114612k/139264k available (3368k kernel code, 16308k reserved, 1033k data, 196k init, 0k highmem)
> Checking if this processor honours the WP bit even in supervisor mode... Ok.
> Calibrating delay using timer specific routine.. 2261.96 BogoMIPS (lpj=11309833)
> Security Framework v1.0.0 initialized
> Capability LSM initialized
> Mount-cache hash table entries: 512
> CPU: L1 I cache: 16K, L1 D cache: 16K
> CPU: L2 cache: 512K
> Checking 'hlt' instruction... OK.
> Brought up 1 CPUs
> migration_cost=0
> checking if image is initramfs... it is
> Freeing initrd memory: 9535k freed
> Grant table initialized
> NET: Registered protocol family 16
> Brought up 1 CPUs
> PCI: setting up Xen PCI frontend stub
> ACPI: Subsystem revision 20060127
> ACPI: Interpreter disabled.
> Linux Plug and Play Support v0.97 (c) Adam Belay
> xen_mem: Initialising balloon driver.
> SCSI subsystem initialized
> usbcore: registered new driver usbfs
> usbcore: registered new driver hub
> PCI: System does not support PCI
> PCI: System does not support PCI
> IA-32 Microcode Update Driver: v1.14-xen <tigran@xxxxxxxxxxx>
> VFS: Disk quotas dquot_6.5.1
> Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
> JFS: nTxBlock = 1024, nTxLock = 8192
> SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
> Initializing Cryptographic API
> io scheduler noop registered
> io scheduler anticipatory registered (default)
> io scheduler deadline registered
> io scheduler cfq registered
> PNP: No PS/2 controller found. Probing ports directly.
> i8042.c: No controller found.
> RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
> Xen virtual console successfully installed as tty1
> Event-channel device installed.
> blkif_init: reqs=64, pages=704, mmap_vstart=0xc7400000
> netfront: Initialising virtual ethernet driver.
> Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
> ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx
> Registering block device major 8
> ide-floppy driver 0.99.newide
> Fusion MPT base driver 3.03.07
> Copyright (c) 1999-2005 LSI Logic Corporation
> Fusion MPT SPI Host driver 3.03.07
> Fusion MPT misc device (ioctl) driver 3.03.07
> mptctl: Registered with Fusion MPT base driver
> mptctl: /dev/mptctl @ (major,minor=10,220)
> usbmon: debugfs is not available
> usbcore: registered new driver libusual
> mice: PS/2 mouse device common for all mice
> md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
> md: bitmap version 4.39
> NET: Registered protocol family 2
> IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
> TCP established hash table entries: 8192 (order: 4, 65536 bytes)
> TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
> TCP: Hash tables configured (established 8192 bind 8192)
> TCP reno registered
> Initializing IPsec netlink socket
> NET: Registered protocol family 1
> NET: Registered protocol family 17
> NET: Registered protocol family 8
> NET: Registered protocol family 20
> Using IPI No-Shortcut mode
> Freeing unused kernel memory: 196k freed
> Loading, please wait...
> Begin: Loading essential drivers... ...
> tg3: no version for "struct_module" found: kernel tainted.
> eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html
> eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@xxxxxxxxxxxxx> and others
> Intel(R) PRO/1000 Network Driver - version 6.3.9-k4
> Copyright (c) 1999-2005 Intel Corporation.
> Done.
> Begin: Running /scripts/init-premount ...
> FATAL: Error inserting fan (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/fan.ko): No such device
> FATAL: Error inserting thermal (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/thermal.ko): No such device
> Done.
> Begin: Mounting root file system... ...
> Begin: Running /scripts/local-top ...
> Done.
> Begin: Running /scripts/local-premount ...
> Done.
> kjournald starting.  Commit interval 5 seconds
> EXT3-fs: mounted filesystem with ordered data mode.
> Begin: Running /scripts/log-bottom ...
> Done.
> Done.
> Begin: Running /scripts/init-bottom ...
> Done.
> mount: Mounting /sys on /root/sys failed: No such file or directory
> INIT: version 2.85 booting
> Activating swap.
> Checking root file system...
> fsck 1.39 (29-May-2006)
> /dev/sda1: clean, 21526/917504 files, 245920/1835007 blocks
> EXT3 FS on sda1, internal journal
> System time was Wed Aug  2 22:17:34 UTC 2006.
> Setting the System Clock using the Hardware Clock as reference...
> System Clock set. System local time is now Wed Aug  2 22:17:37 UTC 2006.
> Loading device-mapper support.
> Checking all file systems...
> fsck 1.39 (29-May-2006)
> Setting kernel variables..
> Mounting local filesystems...
> Adding 524280k swap on /swap00.  Priority:-1 extents:134 across:533176k
> Cleaning /tmp /var/run /var/lock.
> Running 0dns-down to make sure resolv.conf is ok...done.
> Cleaning: /etc/network/ifstate.
> Setting up IP spoofing protection: rp_filter.
> Configuring network interfaces...done.
> Loading the saved-state of the serial devices...
> /dev/ttyS0: No such file or directory
> /dev/ttyS0: No such file or directory
> /dev/ttyS1: No such file or directory
> /dev/ttyS1: No such file or directory
> Not setting System Clock
> Initializing random number generator...done.
> Recovering nvi editor sessions... done.
> INIT: Entering runlevel: 2
> Starting isconf daemonRunning isconf updateisconf: info: build2.t7a.org is on guest-1 branch
> isconf: info: may reboot...
> isconf: info: checking for updates
> isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.911958506882
> isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.999292957677
> isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.239902520967
> BUG: soft lockup detected on CPU#0!
> 
> Pid: 2383, comm:               isconf
> EIP: 0073:[<080c9763>] CPU: 0
> EIP is at 0x80c9763
>  ESP: 007b:bfcc962c EFLAGS: 00200282    Tainted: GF      (2.6.16.13-xen #2)
> EAX: 00000001 EBX: 0000003a ECX: bfcc9624 EDX: 00000000
> ESI: 08137cb4 EDI: 00000001 EBP: bfcc9638 DS: 007b ES: 007b
> CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
> isconf: info: fetching http://10.27.4.34:65028/t7a.org/block/ff1/ff1276f7811aeeade18d54a6c3578261ff36ecbb-4fb47b36cda57ae95af56372f03bb2ca-1?challenge=0.265409462016
> isconf: info: updated /etc/ldap/ldap.conf
> BUG: soft lockup detected on CPU#0!
> 
> Pid: 2383, comm:               isconf
> EIP: 0073:[<080af84d>] CPU: 0
> EIP is at 0x80af84d
>  ESP: 007b:bfcc96d0 EFLAGS: 00200246    Tainted: GF      (2.6.16.13-xen #2)
> EAX: 00000001 EBX: 082031fe ECX: 082031fe EDX: b7af1f8c
> ESI: 00000000 EDI: 082030ec EBP: bfcc9838 DS: 007b ES: 007b
> CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
> isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/c0e/c0e10bc50572deb89da6e9d96ac5971a39fddc65-fc3558eaffc90497248f97f9b0e3a924-1?challenge=0.130730726051
> isconf: info: updated /etc/ca-certificates.conf
> isconf: info: running ['update-ca-certificates']
> Updating certificates in /etc/ssl/certs....done.
> isconf: info: updated /etc/ldap/ldap.conf
> BUG: soft lockup detected on CPU#0!
> 
> Pid: 1, comm:                 init
> EIP: 0061:[<c0322fe1>] CPU: 0
> EIP is at netif_poll+0x101/0x810
>  EFLAGS: 00000216    Tainted: GF      (2.6.16.13-xen #2)
> EAX: 00000037 EBX: c0945180 ECX: 0001134e EDX: c0945000
> ESI: c0f48280 EDI: c0f499e8 EBP: c09451c0 DS: 007b ES: 007b
> CR0: 8005003b CR2: b7d579e0 CR3: 0057e000 CR4: 00000640
>  [<c03d891a>] net_rx_action+0xea/0x230
>  [<c0124cb5>] __do_softirq+0xf5/0x120
>  [<c0124d75>] do_softirq+0x95/0xa0
>  [<c0106c0f>] do_IRQ+0x1f/0x30
>  [<c0312f58>] evtchn_do_upcall+0xa8/0xf0
>  [<c0105178>] hypervisor_callback+0x2c/0x34
>  [<c02c2081>] __copy_user_intel+0x31/0xb0
>  [<c02c2220>] __copy_to_user_ll+0x70/0x80
>  [<c02c22f2>] copy_to_user+0x42/0x60
>  [<c0171068>] cp_new_stat64+0xf8/0x110
>  [<c01710b7>] sys_stat64+0x37/0x40
>  [<c0104fb5>] syscall_call+0x7/0xb
> isconf: warning: clierr:  Connection reset by peer
> Starting system log daemon: syslogd.
> Starting kernel log daemon: klogd.
> No configuration file was found for slapd at /etc/ldap/slapd.conf.
> If you have moved the slapd configuration file please modify
> /etc/default/slapd to reflect this.  If you chose to not
> configure slapd during installation then you need to do so
> prior to attempting to start slapd.
> An example slapd.conf is in /usr/share/slapd
> Starting Heimdal KDC: heimdal-kdc.
> Starting Heimdal password server: kpasswdd.
> Starting internet superserver: inetd.
> Starting PCMCIA services: module directory /lib/modules/2.6.16.13-xen/pcmcia not found.
> Starting OpenBSD Secure Shell server: sshd.
> Starting deferred execution scheduler: atd.
> Starting periodic command scheduler: cron.
> 
> Debian GNU/Linux testing/unstable build2.t7a.org tty1
> 
> build2.t7a.org login:
> 
> On Wed, Aug 02, 2006 at 01:54:49PM -0700, Steve Traugott wrote:
> > Hi All,
> >
> > I hate to say it, but it's starting to look like soft lockup bug(s)
> > are turning into a serious roadblock for general production use of
> > Xen 3.X, on a wide range of hardware.  I've been using Xen since the
> > 1.0 days, and I have to say that this is the most serious showstopper
> > bug I've ever hit -- it usually manifests itself during the first
> > significant network and/or disk I/O after starting a second or third
> > domU on the same box, and is the only bug I've ever hit that has
> > caused permanent damage -- it tends to corrupt guest filesystems.
> > In my case it's stopped a deployment dead in its tracks, and our
> > only options at this point are to go back to Xen 2.X or (horrors) to
> > native Linux kernels.
> >
> > The problem (or something that looks identical) is described in
> > several tickets, status currently NEW or REOPENED, no clear
> > resolution:
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=543
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=690
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=697
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=705
> >
> > In our own shop, we consistently hit soft lockups while running on
> > both IBM x330's and older Netengines (similar to an IBM 4000R).
> > We've found no workaround.  We're on xen-3.0-testing, changeset 9732,
> > kernel 2.6.16.13.  On April 6th, Keir posted a note saying this was
> > fixed as of a blkif_schedule() fix, which we already have because
> > that was way back in changeset 9587...
> > http://lists.xensource.com/archives/html/xen-devel/2006-04/msg00121.html
> >
> > The most recent devel list traffic I've found which covers this is
> > July 7th:
> > http://lists.xensource.com/archives/html/xen-users/2006-07/msg00134.html
> > ...this message referred back to Keir's comment as describing a fix,
> > but it doesn't look true; while Keir's 9587 checkin may have fixed a
> > soft lockup problem, there appear to be more out there, or else
> > there's been a regression.
> >
> > Do we have any consensus that this bug is fixed at all in
> > xen-3.0-testing, or even unstable?  Is anyone who was hitting soft
> > lockups in testing *not* hitting them any more on the same hardware?
> > If so, what changeset are you on now?
> >
> > If anyone needs any more information, just let me know.  As usual,
> > if anyone wants login and console server access to one of these
> > boxes to chase this down, I'm more than happy to provide that.
> >
> > Thanks,
> >
> > Steve
> > --
> > Stephen G. Traugott  (KG6HDQ)
> > UNIX/Linux Infrastructure Architect, TerraLuna LLC
> > stevegt@xxxxxxxxxxxxx
> > http://www.stevegt.com -- http://Infrastructures.Org
> 
> --
> Stephen G. Traugott  (KG6HDQ)
> UNIX/Linux Infrastructure Architect, TerraLuna LLC
> stevegt@xxxxxxxxxxxxx
> http://www.stevegt.com -- http://Infrastructures.Org
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@xxxxxxxxxxxxxxxxxxx
> http://lists.xensource.com/xen-devel
