
Re: [Xen-devel] Questions about GPLPV stability tests



> Today, yet another of my machines went down. To explain the setup: I'm
> running tests across a couple of similar machines. On some of them I
> updated to the latest 2.6.32.x and Xen 4.1.x (and I upgrade every
> machine that goes down). The machine which just went down wasn't
> running the latest versions yet, but it did have some extra logging
> enabled in 'dmesg'.
> 
> This box still had Xen 4.1.2 and 2.6.32.37. I'm not sure how useful
> the logging is yet, but it may be of some use. The log output is
> mostly from the serial console. I also looked at some log files on
> disk; some give clues, but some important log files didn't get synced
> to disk before bad things happened. The last timestamp I was able to
> find was '16:13:55', when some of our own software wrote a line to
> its logs.
> 
> The few log files which did get synced to disk show that the VM
> started around 16:13:02, which matches the PCIe device being handed
> over to the VM:
> 
> (XEN) [2011-12-10 16:13:03] [VT-D]iommu.c:1501: d0:PCIe: unmap bdf = 3:0.0
> (XEN) [2011-12-10 16:13:03] [VT-D]iommu.c:1374: d1109:PCIe: map bdf = 3:0.0
> (XEN) [2011-12-10 16:13:03] [VT-D]io.c:304: d1109: bind: m_gsi=55
> g_gsi=32 device=4 intx=0
> (XEN) [2011-12-10 16:13:23] [VT-D]io.c:328: d1109: unbind: m_gsi=55
> g_gsi=32 device=4 intx=0
> (XEN) [2011-12-10 16:13:23] [VT-D]io.c:386: d1109 unmap: m_irq=55
> device=4 intx=0
> (XEN) [2011-12-10 16:13:23] [VT-D]io.c:304: d1109: bind: m_gsi=16
> g_gsi=32 device=4 intx=0
> 
> 
> We are using the tapdisk driver for disk 'xvdc', which is referred to
> as 'tdc' I guess. It is a copy-on-write

tdc would be what the disk is called in dom0.
> VHD disk which uses a raw disk image as its backing store. Apparently
> there are all sorts of read errors; I'm not sure how bad they are.
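
Since the VHD sits on top of a raw backing image, it may also be worth
checking the chain itself before blaming the driver. A sketch, assuming
the blktap2 vhd-util tool is available (the path is a placeholder for
your image):

    # check the VHD metadata for consistency
    vhd-util check -n /path/to/disk.vhd
    # print the parent (the raw backing store) of the VHD
    vhd-util query -n /path/to/disk.vhd -p

If either of those complains, the read errors below may come from the
VHD itself rather than from tapdisk.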
> 
> [905276.510737] blkback: ring-ref 15658, event-channel 13, protocol 1
> (unspecified, assuming native)
> [905276.512128] blkback: ring-ref 15658, event-channel 13, protocol 1
> (unspecified, assuming native)
> [905277.113215] block tdc: sector-size: 512 capacity: 118785
> [905277.215872] blkback: ring-ref 15775, event-channel 14, protocol 1
> (unspecified, assuming native)
> [905680.196850] end_request: I/O error, dev tdc, sector 45689
> [905680.197293] end_request: I/O error, dev tdc, sector 45689
> [905680.197624] end_request: I/O error, dev tdc, sector 45688
> [905680.197993] end_request: I/O error, dev tdc, sector 45696
> 
> More blktap issues hours later, right? The real problem must have
> happened much earlier, then.

Any chance you can eliminate blktap from the picture? Can you use
blkback?
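
For example, you could switch the disk line in the domain config from
the blktap-served VHD to a blkback-served device. The exact syntax and
paths below are just an assumption about your setup:

    # before: VHD served by tapdisk2 (blktap)
    disk = [ 'tap2:vhd:/path/to/disk.vhd,xvdc,w' ]

    # after: a raw block device served directly by blkback
    disk = [ 'phy:/dev/vg0/guest-xvdc,xvdc,w' ]

You lose the copy-on-write layer that VHD gives you, but if the
crashes go away, that points squarely at blktap.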

> 
> [912393.552881] blkback: ring-ref 15810, event-channel 13, protocol 1
> (unspecified, assuming native)
> [912394.101667] block tdc: sector-size: 512 capacity: 118785
> [912394.384715] blkback: ring-ref 15660, event-channel 14, protocol 1
> (unspecified, assuming native)
> [912402.433492] BUG: unable to handle kernel NULL pointer dereference
> at 00000000000002f0
> [912402.434077] IP: [<ffffffff81249ac2>] blktap_device_end_request+0x4e/0x68
> [912402.434384] PGD 384bb067 PUD 3a55f067 PMD 0
> [912402.434752] Oops: 0000 [#1] SMP
> [912402.435096] last sysfs file: /sys/devices/virtual/block/tdc/removable
> [912402.435363] CPU 2
> [912402.435653] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
> [912402.436092] Pid: 16994, comm: tapdisk2 Not tainted 2.6.32.37 #1 X8STi
> [912402.436595] RIP: e030:[<ffffffff81249ac2>]  [<ffffffff81249ac2>]
> blktap_device_end_request+0x4e/0x68
> [912402.437117] RSP: e02b:ffff8800616f9d08  EFLAGS: 00010046
> [912402.437380] RAX: 0000000000000000 RBX: ffff880014ef29b0 RCX:
> 0000000000000018
> [912402.437884] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff88007fcebd58
> [912402.438368] RBP: 0000000000000000 R08: ffffffff816a9488 R09:
> ffff8800123af040
> [912402.438855] R10: 00000001365d23c7 R11: ffffffff810861e1 R12:
> ffff88007bea4e70
> [912402.439358] R13: ffff88007b8aa800 R14: 0000000000000000 R15:
> ffff880014ef29b0
> [912402.439848] FS:  00007fde3bbe2730(0000) GS:ffff880028070000(0000)
> knlGS:0000000000000000
> [912402.440341] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [912402.440621] CR2: 00000000000002f0 CR3: 000000001f721000 CR4:
> 0000000000002660
> [912402.441102] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [912402.441583] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [912402.442083] Process tapdisk2 (pid: 16994, threadinfo
> ffff8800616f8000, task ffff88007bab4ec0)
> [912402.442569] Stack:
> [912402.442825]  ffffea000185d038 0000000000000001 ffff88007bab4ec0
> 0000000000000000
> [912402.443199] <0> ffff88007b8aa800 ffffffff81248c3b 00000000000001ff
> 0000000000000000
> [912402.443871] <0> 00033e583f482159 000000000000e030 ffff88006f6da480
> 0000039f000003a0
> [912402.444793] Call Trace:
> [912402.445054]  [<ffffffff81248c3b>] ? blktap_ring_ioctl+0x137/0x228
> [912402.445326]  [<ffffffff810bea63>] ? core_sys_select+0x1ee/0x21e
> [912402.445599]  [<ffffffff810dc64f>] ? really_put_req+0x62/0x8f
> [912402.445871]  [<ffffffff8100e865>] ? xen_force_evtchn_callback+0x9/0xa
> [912402.446159]  [<ffffffff8100eef2>] ? check_events+0x12/0x20
> [912402.446434]  [<ffffffff810dc4e4>] ? aio_read_evt+0x26/0xe4
> [912402.446704]  [<ffffffff810dd71a>] ? sys_io_getevents+0x10e/0x356
> [912402.446984]  [<ffffffff810bcc47>] ? vfs_ioctl+0x56/0x6c
> [912402.447255]  [<ffffffff810bd133>] ? do_vfs_ioctl+0x460/0x49e
> [912402.447543]  [<ffffffff81059699>] ? getnstimeofday+0x55/0xaf
> [912402.447814]  [<ffffffff810bd1c2>] ? sys_ioctl+0x51/0x70
> [912402.448083]  [<ffffffff81011a02>] ? system_call_fastpath+0x16/0x1b
> [912402.448351] Code: e8 a5 f5 ff ff 49 8b 44 24 40 48 8b b8 f0 02 00
> 00 e8 a5 37 27 00 41 8b 54 24 60 44 89 f6 4c 89 e7 e8 85 13 f8 ff 49
> 8b 44 24 40 <48> 8b 80 f0 02 00 00 fe 00 ff 14 25 28 0d 6a 81 5b 5b 41
> 5c 41
> [912402.451398] RIP  [<ffffffff81249ac2>] blktap_device_end_request+0x4e/0x68
> [912402.451733]  RSP <ffff8800616f9d08>
> [912402.451997] CR2: 00000000000002f0
> [912402.452265] ---[ end trace f20c5f7e976fe78b ]---
> 
> Half a minute later I get tons of these:
> [912432.458625] ata6.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 
> frozen

.. frozen..
> [912432.458893] ata6.00: failed command: READ FPDMA QUEUED
> [912432.459172] ata6.00: cmd 60/08:00:f2:13:e3/00:00:0d:00:00/40 tag 0
> ncq 4096 in
> 
> And tons of these:
> [912478.690536] sd 5:0:0:0: [sda] Unhandled error code
> [912478.690798] sd 5:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET
> driverbyte=DRIVER_OK
> [912478.691306] sd 5:0:0:0: [sda] CDB: Read(10): 28 00 00 e6 95 05 00 00 68 00
> 
> I wonder if we are having real SATA issues or whether this is somehow

That really looks like it's what caused blktap to die when the command failed.


> caused by the 'real' problem, which is tapdisk?
> 
> Finally I also get timeouts in the network code. I think these are
> just a consequence of the bad things already happening.
> [912737.470673] WARNING: at net/sched/sch_generic.c:261
> dev_watchdog+0xd2/0x16e()
> [912737.471170] Hardware name: X8STi
> [912737.471426] NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out

.. and then this timeout ..

> [912737.471686] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
> [912737.472106] Pid: 0, comm: swapper Tainted: G      D    2.6.32.37 #1
> [912737.472586] Call Trace:
> [912737.472838]  <IRQ>  [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
> [912737.473130]  [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
> [912737.473391]  [<ffffffff81040fb1>] ? warn_slowpath_common+0x77/0xa3
> [912737.473668]  [<ffffffff813c8d40>] ? dev_watchdog+0x0/0x16e
> [912737.473928]  [<ffffffff81041039 > ] ? warn_slowpk
> _dev_program_evt+ 0xe3/0x18c
> [90/0x60
> [912737.474988]  [<ffffffff8100e865>] ? xen_force_evtchn_callback+0x9/0xa
> [912737.480706]  [<ffffffff8100eef2>] ? check_events+0x12/0x20
> [912737.480965]  [<ffffffff813c8d15>] ? netif_tx_lock+0x3d/0x68
> [912737.481222]  [<ffffffff813c8d40>] ? dev_watchdog+0x0/0x16e
> [912737.481481]  [<ffffffff813b83a0>] ? netdev_drivername+0x3b/0x40
> [912737.481741]  [<ffffffff813c8e12>] ? dev_watchdog+0xd2/0x16e
> [912737.482015]  [<ffffffff81049614>] ? run_timer_softirq+0x16c/0x1d9
> [912737.482274]  [<ffffffff81078fb8>] ? handle_percpu_irq+0x4e/0x6a
> [912737.482534]  [<ffffffff81045bc8>] ? __do_softirq+0x91/0x116
> [912737.482794]  [<ffffffff81012b6c>] ? call_softirq+0x1c/0x30
> [912737.483052]  [<ffffffff810140e7>] ? do_softirq+0x3f/0x7c
> [912737.483324]  [<ffffffff81045a64>] ? irq_exit+0x36/0x76
> [912737.483584]  [<ffffffff8123b6ad>] ? xen_evtchn_do_upcall+0x33/0x42
> [912737.483843]  [<ffffffff81012bbe>] ? xen_do_hypervisor_callback+0x1e/0x30
> [912737.484101]  <EOI>  [<ffffffff8100eedf>] ? 
> xen_restore_fl_direct_end+0x0/0x1
> [912737.484394]  [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1001
> [912737.484670]  [<ffffffff810093aa>] ? hypercall_page+0x3aa/0x1001
> [912737.484930]  [<ffffffff8100e8e3>] ? xen_safe_halt+0xc/0x15
> [912737.485188]  [<ffffffff81017f07>] ? default_idle+0x21/0x3d
> [912737.485447]  [<ffffffff81010df1>] ? cpu_idle+0x59/0x91
> [912737.485704] ---[ end trace f20c5f7e976fe78c ]---
> 
> 
> Does this feel like blktap issues? Or is it more likely that something
> else happened which caused blktap and the other things to fail?

It looks like the interrupts stopped. Perhaps you can run the C code
(attached) with its output going to the serial console and see _what_
(or perhaps _when_) the interrupts stop (and also verify that the
timeouts and 'frozen' happen because no interrupts are being received).
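
For reference, here is a minimal sketch of that sort of monitor (this
is not the actual attachment, just an assumption of what such a tool
looks like): it snapshots /proc/interrupts once a second, so the
serial log shows which IRQ counter stops moving, and when:

    /*
     * Dump /proc/interrupts once a second. Run with stdout on the
     * serial console; diff successive snapshots to see which IRQ
     * counter stops incrementing.
     */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        size_t n;

        for (;;) {
            FILE *f = fopen("/proc/interrupts", "r");

            if (!f) {
                perror("/proc/interrupts");
                return 1;
            }
            printf("==== interrupts ====\n");
            while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
                fwrite(buf, 1, n, stdout);
            fclose(f);
            fflush(stdout);
            sleep(1);
        }
    }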

There are a couple of bugs that were discovered in the interrupt code
(they had been there since 2.6.27!) that went into the stable tree, so
they should have been backported to the 2.6.32.x tree, but I honestly
don't recall the names. I can look them up tomorrow.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel


 

