Re: [Xen-devel] blocking Xen 3.X production use: soft lockup bugs
Hi Ian,

Thanks for your patience...

On Wed, Aug 02, 2006 at 11:25:45PM +0100, Ian Pratt wrote:
> > The problem (or something that looks identical) is described in
> > several tickets, status currently NEW or REOPENED, no clear
> > resolution:
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=543
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=690
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=697
> > http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=705
>
> There's very little to go on here. Two of the bugs are actually the same
> guy. One of the others is x86_64 the other two are 32b.

That's what I was starting to realize -- a lot of folks (including me)
have been classing all soft lockups together, without digging deeper.

> The only thing in common about the stack traces is that networking
> functions seem to feature.

Might be the same in my case; see the stack trace in my message in this
thread a few minutes ago, copied below. The 'isconf' process you see
there does a lot of UDP and TCP traffic for file transfers, as well as
moderate disk I/O.

> Taking a wild guess, are you doing some kind of unusual networking setup
> involving iptables rules?

Nope. Right now I can't think of anything I'm doing that's not the
standard Xen bridging setup.

> > Do we have any consensus that this bug is fixed at all in
> > xen-3.0-testing, or even unstable? Is anyone who was hitting soft
> > lockups in testing *not* hitting them any more on the same hardware?
> > If so, what changeset are you on now?
>
> Soft lockups could be due to a huge variety of causes. It's unlikely to
> be a hardware issue, and since the problems seem to be experienced by a
> very small number of users my guess would be that it's configuration
> dependent, most likely networking.
>
> > If anyone needs any more information, just let me know. As usual, if
> > anyone wants login and console server access to one of these boxes to
> > chase this down, I'm more than happy to provide that.
>
> Having a really detailed bug report would really be the best way of
> proceeding.

This is why I was thinking about starting a "how to report soft lockups"
wiki page; I think we haven't been giving you enough. Is there already
a more generic Xen bug reporting howto somewhere, or should I have at
it, using your questions below as a start?

> When this happens, does it just affect one guest?

We typically see error messages on only one guest's console, but other
guests and dom0 tend to lock up for ~30 seconds as well.

> What's the stack trace?

See the dmesg below (this is the same one I just posted a few minutes
ago, in my previous message, copying here for reference).

> How many VCPUs has the guest got?

One. So far I've seen soft lockups with and without nosmp on the Xen
command line on our Netengines, but can't yet tell you if they were the
same stack trace. Haven't tried nosmp on the x330's yet, am about to.

> Is the guest completely hosed or is it still pingable?

Tends to be unpingable for ~30 seconds, usually recovers, but sometimes
corrupts filesystem to a state which is unrecoverable (first machine it
destroyed was our primary KDC, ouch...)

> What about guest console echo?

Works, but latency is on the order of several seconds or more for ~30
seconds.

> What about 'xm sysrq'?

Unable to answer this yet due to that high dom0 latency.

> Looking in dom0, are you still seeing packets go to/from the
> associated VIF?

Unable to answer this yet due to that high dom0 latency.

> How many network interfaces has the guest got?

Only eth0 and lo.

> What's the precise networking setup in dom0?

Standard Xen bridging config:

n4h34:~# ifconfig | grep encap
eth0      Link encap:Ethernet  HWaddr 00:02:55:C7:CA:D8
lo        Link encap:Local Loopback
peth0     Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
vif0.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
vif1.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
vif5.0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
xenbr0    Link encap:Ethernet  HWaddr FE:FF:FF:FF:FF:FF
n4h34:~# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
n4h34:~# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
10.27.4.0       *               255.255.255.0   U     0      0        0 eth0
default         10.27.4.254     0.0.0.0         UG    0      0        0 eth0

> Can you come up with a recipe for reproduction, ideally with a
> single guest?

It looks like it can be reliably produced by starting a second guest and
doing a mix of steady network and disk I/O -- isconf, for instance, runs
during rc and updates the local disk image by pulling new packages over
the network and installing them on the fly
(http://trac.t7a.org/isconf/), so it is usually the first to trigger the
bug in our environment.

I haven't seen the bug as often with only one guest. For example, I
built an AFS server in domain 1 on this same x330, generating lots of
disk and network I/O in the process, ran it for days with no problems,
then tried to start a copy of the same base image up as domain 2 on the
same box and got the dmesg you see here; only the hostname, IP, MAC etc.
were different.

I'll see if I can come up with a simple python script or something which
can trigger it.

The only other "unusual" thing I can think of about this configuration
is that it's using DRBD on top of EVMS in dom0 for the guest volumes;
this would also increase dom0 network traffic during any guest disk I/O.
I hope to heck this doesn't turn out to be a DRBD incompatibility; we've
used DRBD with Xen since the early 2.X days, and it's been solid. I'll
have to do some testing to see if I can eliminate DRBD as a factor.

Steve

n4h34:~# xm create -c /etc/xen/auto/build2.t7a.org
Using config file "/etc/xen/auto/build2.t7a.org".
Started domain build2.t7a.org
Linux version 2.6.16.13-xen (root@n4h33) (gcc version 3.3.5 (Debian 1:3.3.5-12)) #2 SMP Sun Jun 11 14:25:16 PDT 2006
BIOS-provided physical RAM map:
 Xen: 0000000000000000 - 0000000008000000 (usable)
0MB HIGHMEM available.
136MB LOWMEM available.
ACPI in unprivileged domain disabled
IRQ lockup detection disabled
Built 1 zonelists
Kernel command line: root=/dev/sda1 2
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 1024 (order: 10, 16384 bytes)
Xen reported: 1130.113 MHz processor.
Dentry cache hash table entries: 32768 (order: 5, 131072 bytes)
Inode-cache hash table entries: 16384 (order: 4, 65536 bytes)
Software IO TLB disabled
vmalloc area: c9000000-fb7fe000, maxmem 33ffe000
Memory: 114612k/139264k available (3368k kernel code, 16308k reserved, 1033k data, 196k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 2261.96 BogoMIPS (lpj=11309833)
Security Framework v1.0.0 initialized
Capability LSM initialized
Mount-cache hash table entries: 512
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 512K
Checking 'hlt' instruction... OK.
Brought up 1 CPUs
migration_cost=0
checking if image is initramfs... it is
Freeing initrd memory: 9535k freed
Grant table initialized
NET: Registered protocol family 16
Brought up 1 CPUs
PCI: setting up Xen PCI frontend stub
ACPI: Subsystem revision 20060127
ACPI: Interpreter disabled.
Linux Plug and Play Support v0.97 (c) Adam Belay
xen_mem: Initialising balloon driver.
SCSI subsystem initialized
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: System does not support PCI
PCI: System does not support PCI
IA-32 Microcode Update Driver: v1.14-xen <tigran@xxxxxxxxxxx>
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
JFS: nTxBlock = 1024, nTxLock = 8192
SGI XFS with ACLs, security attributes, realtime, large block numbers, no debug enabled
Initializing Cryptographic API
io scheduler noop registered
io scheduler anticipatory registered (default)
io scheduler deadline registered
io scheduler cfq registered
PNP: No PS/2 controller found. Probing ports directly.
i8042.c: No controller found.
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Xen virtual console successfully installed as tty1
Event-channel device installed.
blkif_init: reqs=64, pages=704, mmap_vstart=0xc7400000
netfront: Initialising virtual ethernet driver.
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 50MHz system bus speed for PIO modes; override with idebus=xx
Registering block device major 8
ide-floppy driver 0.99.newide
Fusion MPT base driver 3.03.07
Copyright (c) 1999-2005 LSI Logic Corporation
Fusion MPT SPI Host driver 3.03.07
Fusion MPT misc device (ioctl) driver 3.03.07
mptctl: Registered with Fusion MPT base driver
mptctl: /dev/mptctl @ (major,minor=10,220)
usbmon: debugfs is not available
usbcore: registered new driver libusual
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
NET: Registered protocol family 2
IP route cache hash table entries: 2048 (order: 1, 8192 bytes)
TCP established hash table entries: 8192 (order: 4, 65536 bytes)
TCP bind hash table entries: 8192 (order: 4, 65536 bytes)
TCP: Hash tables configured (established 8192 bind 8192)
TCP reno registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
NET: Registered protocol family 8
NET: Registered protocol family 20
Using IPI No-Shortcut mode
Freeing unused kernel memory: 196k freed
Loading, please wait...
Begin: Loading essential drivers... ...
tg3: no version for "struct_module" found: kernel tainted.
eepro100.c:v1.09j-t 9/29/99 Donald Becker http://www.scyld.com/network/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <saw@xxxxxxxxxxxxx> and others
Intel(R) PRO/1000 Network Driver - version 6.3.9-k4
Copyright (c) 1999-2005 Intel Corporation.
Done.
Begin: Running /scripts/init-premount ...
FATAL: Error inserting fan (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/fan.ko): No such device
FATAL: Error inserting thermal (/lib/modules/2.6.16.13-xen/kernel/drivers/acpi/thermal.ko): No such device
Done.
Begin: Mounting root file system... ...
Begin: Running /scripts/local-top ...
Done.
Begin: Running /scripts/local-premount ...
Done.
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
Begin: Running /scripts/log-bottom ...
Done.
Done.
Begin: Running /scripts/init-bottom ...
Done.
mount: Mounting /sys on /root/sys failed: No such file or directory
INIT: version 2.85 booting
Activating swap.
Checking root file system...
fsck 1.39 (29-May-2006)
/dev/sda1: clean, 21526/917504 files, 245920/1835007 blocks
EXT3 FS on sda1, internal journal
System time was Wed Aug 2 22:17:34 UTC 2006.
Setting the System Clock using the Hardware Clock as reference...
System Clock set. System local time is now Wed Aug 2 22:17:37 UTC 2006.
Loading device-mapper support.
Checking all file systems...
fsck 1.39 (29-May-2006)
Setting kernel variables..
Mounting local filesystems...
Adding 524280k swap on /swap00. Priority:-1 extents:134 across:533176k
Cleaning /tmp /var/run /var/lock.
Running 0dns-down to make sure resolv.conf is ok...done.
Cleaning: /etc/network/ifstate.
Setting up IP spoofing protection: rp_filter.
Configuring network interfaces...done.
Loading the saved-state of the serial devices...
/dev/ttyS0: No such file or directory
/dev/ttyS0: No such file or directory
/dev/ttyS1: No such file or directory
/dev/ttyS1: No such file or directory
Not setting System Clock
Initializing random number generator...done.
Recovering nvi editor sessions... done.
INIT: Entering runlevel: 2
Starting isconf daemonRunning isconf updateisconf: info: build2.t7a.org is on guest-1 branch
isconf: info: may reboot...
isconf: info: checking for updates
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.911958506882
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.999292957677
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/fb2/fb2e8177e647be52a1c64e21fcb913455c71e731-8b3a10ecde5fc43984807e34550a2ebd-1?challenge=0.239902520967
BUG: soft lockup detected on CPU#0!
Pid: 2383, comm: isconf
EIP: 0073:[<080c9763>] CPU: 0
EIP is at 0x80c9763
 ESP: 007b:bfcc962c EFLAGS: 00200282 Tainted: GF (2.6.16.13-xen #2)
EAX: 00000001 EBX: 0000003a ECX: bfcc9624 EDX: 00000000
ESI: 08137cb4 EDI: 00000001 EBP: bfcc9638 DS: 007b ES: 007b
CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
isconf: info: fetching http://10.27.4.34:65028/t7a.org/block/ff1/ff1276f7811aeeade18d54a6c3578261ff36ecbb-4fb47b36cda57ae95af56372f03bb2ca-1?challenge=0.265409462016
isconf: info: updated /etc/ldap/ldap.conf
BUG: soft lockup detected on CPU#0!
Pid: 2383, comm: isconf
EIP: 0073:[<080af84d>] CPU: 0
EIP is at 0x80af84d
 ESP: 007b:bfcc96d0 EFLAGS: 00200246 Tainted: GF (2.6.16.13-xen #2)
EAX: 00000001 EBX: 082031fe ECX: 082031fe EDX: b7af1f8c
ESI: 00000000 EDI: 082030ec EBP: bfcc9838 DS: 007b ES: 007b
CR0: 80050033 CR2: b7b97000 CR3: 0055e000 CR4: 00000640
isconf: info: fetching http://10.27.4.7:65028/t7a.org/block/c0e/c0e10bc50572deb89da6e9d96ac5971a39fddc65-fc3558eaffc90497248f97f9b0e3a924-1?challenge=0.130730726051
isconf: info: updated /etc/ca-certificates.conf
isconf: info: running ['update-ca-certificates']
Updating certificates in /etc/ssl/certs....done.
isconf: info: updated /etc/ldap/ldap.conf
BUG: soft lockup detected on CPU#0!
Pid: 1, comm: init
EIP: 0061:[<c0322fe1>] CPU: 0
EIP is at netif_poll+0x101/0x810
 EFLAGS: 00000216 Tainted: GF (2.6.16.13-xen #2)
EAX: 00000037 EBX: c0945180 ECX: 0001134e EDX: c0945000
ESI: c0f48280 EDI: c0f499e8 EBP: c09451c0 DS: 007b ES: 007b
CR0: 8005003b CR2: b7d579e0 CR3: 0057e000 CR4: 00000640
 [<c03d891a>] net_rx_action+0xea/0x230
 [<c0124cb5>] __do_softirq+0xf5/0x120
 [<c0124d75>] do_softirq+0x95/0xa0
 [<c0106c0f>] do_IRQ+0x1f/0x30
 [<c0312f58>] evtchn_do_upcall+0xa8/0xf0
 [<c0105178>] hypervisor_callback+0x2c/0x34
 [<c02c2081>] __copy_user_intel+0x31/0xb0
 [<c02c2220>] __copy_to_user_ll+0x70/0x80
 [<c02c22f2>] copy_to_user+0x42/0x60
 [<c0171068>] cp_new_stat64+0xf8/0x110
 [<c01710b7>] sys_stat64+0x37/0x40
 [<c0104fb5>] syscall_call+0x7/0xb
isconf: warning: clierr: Connection reset by peer
Starting system log daemon: syslogd.
Starting kernel log daemon: klogd.
No configuration file was found for slapd at /etc/ldap/slapd.conf.
If you have moved the slapd configuration file please modify /etc/default/slapd to reflect this.
If you chose to not configure slapd during installation then you need to do so prior to attempting to start slapd.
An example slapd.conf is in /usr/share/slapd
Starting Heimdal KDC: heimdal-kdc.
Starting Heimdal password server: kpasswdd.
Starting internet superserver: inetd.
Starting PCMCIA services: module directory /lib/modules/2.6.16.13-xen/pcmcia not found.
Starting OpenBSD Secure Shell server: sshd.
Starting deferred execution scheduler: atd.
Starting periodic command scheduler: cron.

Debian GNU/Linux testing/unstable build2.t7a.org tty1

build2.t7a.org login:

--
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@xxxxxxxxxxxxx
http://www.stevegt.com -- http://Infrastructures.Org

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel
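
Since the message points out that soft lockups have been lumped together without digging deeper, here is a rough Python sketch of one way to pull the individual "BUG: soft lockup" reports out of a console capture like the one above and group them by process and faulting location. It is not part of the original post. The function names (parse_soft_lockups, summarize) are hypothetical, and the regular expressions assume the report layout printed by the 2.6.16.13-xen kernel in this log; other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Patterns assume the report format seen in this message's console log.
LOCKUP_RE = re.compile(r"BUG: soft lockup detected on CPU#(\d+)!")
PID_RE = re.compile(r"Pid:\s*(\d+),\s*comm:\s*(\S+)")
EIP_RE = re.compile(r"EIP is at (\S+)")
# Stack frames look like "[<c03d891a>] net_rx_action+0xea/0x230".
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_soft_lockups(text):
    """Return one dict per soft lockup report found in text."""
    headers = list(LOCKUP_RE.finditer(text))
    reports = []
    for i, header in enumerate(headers):
        # A report's body runs until the next report header (or end of log).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():end]
        pid = PID_RE.search(body)
        eip = EIP_RE.search(body)
        location = eip.group(1) if eip else None
        reports.append({
            "cpu": int(header.group(1)),
            "pid": int(pid.group(1)) if pid else None,
            "comm": pid.group(2) if pid else None,
            "eip": location,
            # Heuristic (an assumption): a bare hex EIP means the kernel
            # printed no symbol for the faulting address.
            "resolved": bool(location) and not location.startswith("0x"),
            "frames": FRAME_RE.findall(body),
        })
    return reports


def summarize(reports):
    """Count reports per (comm, function at EIP) signature."""
    return Counter(
        (r["comm"], r["eip"].split("+")[0] if r["resolved"] else "<no symbol>")
        for r in reports
    )
```

Run against the log above, this should separate the two isconf reports, which carry only raw EIP addresses and no stack trace, from the init report caught in netif_poll. That is the kind of distinction a "how to report soft lockups" page might ask reporters to include.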