Re: [Xen-devel] Questions about GPLPV stability tests
I have finally got some results (the tests are still running). Let me first summarize what tests were running. In total I had 4 machines, all running at least Xen 4.1.2 + Linux 2.6.32.48 (one had a Xen 4.1.3 snapshot). I divided the systems into 2 groups and each group ran a different test. There are 2 differences between the groups:

1) One group used blktap, the other did not. The main reason for keeping the blktap systems in, even though they add noise, is that it kept some of the tests close to what they previously were.

2) The non-blktap systems used a different CPU pinning configuration. Previously Dom0 used CPUs 0-3 and DomU CPUs 4-7, but due to hyperthreading both VMs ended up on the same physical cores (CPUs 0 and 4 are siblings on the same physical core).

Now for the initial results:

- So far each machine has been up for 10 days.

- One machine failed early on due to a PCI device assignment issue. I suspect that at some point, due to a race condition, ownership of the PCI device wasn't transferred back correctly from DomU to Dom0 on VM shutdown. Both xm and xl fail with 'Error: failed to assign device 03:00.0: maybe it has already been assigned to other domain, or maybe it doesn't exist.' From a quick glance at the logs it looks like on shutdown of a previous VM the 'unmap bdf' never happened. I'm not sure why, but my guess is a toolstack bug.

- One of the non-blktap machines had a kernel Oops. It happened on VM shutdown, and I wouldn't be surprised if it is similar to the crashes we wanted to debug. The system is still in a usable state though, but that may be due to the use of Linux 2.6.32.48 or the different CPU pinning configuration.
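As a side note on the pinning change: a layout that keeps Dom0 and the DomU on disjoint physical cores looks roughly like the sketch below. The sibling pairing (0/4, 1/5, 2/6, 3/7) matches what I described above but should be verified per box with 'xl info -n'; the exact cpu lists are illustrative, not necessarily what the test systems use.

```shell
# Assumed topology: hyperthread siblings are (0,4) (1,5) (2,6) (3,7).
# Verify with 'xl info -n' (or /proc/cpuinfo) before copying any of this.

# Pin all of Dom0's vcpus onto physical cores 0 and 1 (both threads of each):
xl vcpu-pin Domain-0 all 0,1,4,5

# DomU config file: confine the guest to physical cores 2 and 3, so the
# two domains never share a physical core (and hence no HT sibling):
#   vcpus = 4
#   cpus  = "2,3,6,7"
```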
Some of the serial output:

Thu Dec 22 23:17:38 2011 - 1232 0: 87 113 365 blkif-backend 525325sec
Thu Dec 22 23:30:07 2011 - 1259 0: 12 6 222 xenbus 526065sec
Thu Dec 22 23:30:07 2011 - 1250 0: 62 42 1461 ahci 526065sec
Thu Dec 22 23:30:07 2011 - 1249 0: 37 28 75 eth0-rx-0 526065sec
Thu Dec 22 23:30:07 2011 - 1248 0: 71 305 933 eth0-tx-0 526065sec
Thu Dec 22 23:30:07 2011 - 1243 0: 6 3 134 evtchn:xenstored 526065sec
Thu Dec 22 23:30:07 2011 - 1241 0: 6 3 87582 evtchn:xenstored 526065sec
Thu Dec 22 23:30:07 2011 - 1240 0: 256 261 54162 evtchn:qemu-dm 526065sec
Thu Dec 22 23:30:07 2011 - 1239 0: 244 251 7219 evtchn:qemu-dm 526065sec
Thu Dec 22 23:30:07 2011 - 1238 0: 289 273 5931 evtchn:qemu-dm 526065sec
Thu Dec 22 23:30:07 2011 - 1237 0: 248 245 4279 evtchn:qemu-dm 526065sec
Thu Dec 22 23:30:07 2011 - 1236 0: 12 7 315 blkif-backend 526065sec
Thu Dec 22 23:30:07 2011 - 1234 0: 7 4 43 veth1 526065sec
Thu Dec 22 23:30:07 2011 - 1232 CPU0 going backwards (-3304)!
Thu Dec 22 23:30:07 2011 - 19 CPU0 going backwards (-212)!
Thu Dec 22 23:30:07 2011 - 18 0: 35 17 35 ehci_hcd:usb1, uhci_hcd:usb8 526065sec
Thu Dec 22 23:30:07 2011 - 16 CPU0 going backwards (-176)!
Thu Dec 22 23:30:07 2011 - 12 CPU0 going backwards (-4)!
Thu Dec 22 23:30:07 2011 - 4 CPU0 going backwards (-1)!
Thu Dec 22 23:30:12 2011 - 1250 0: 384 213 1461 ahci 526070sec
Thu Dec 22 23:30:12 2011 - 1249 0: 14 21 75 eth0-rx-0 526070sec
Thu Dec 22 23:30:12 2011 - 1240 0: 260 260 54162 evtchn:qemu-dm 526070sec
Thu Dec 22 23:30:12 2011 - 1239 0: 279 265 7219 evtchn:qemu-dm 526070sec
Thu Dec 22 23:30:13 2011 - 1238 0: 271 272 5931 evtchn:qemu-dm 526070sec
Thu Dec 22 23:30:13 2011 - 1237 0: 261 253 4279 evtchn:qemu-dm 526070sec
Thu Dec 22 23:30:13 2011 - 1236 0: 145 76 315 blkif-backend 526070sec
Thu Dec 22 23:30:18 2011 - 1250 0: 372 293 1461 ahci 526075sec
Thu Dec 22 23:30:18 2011 - 1249 0: 26 24 75 eth0-rx-0 526075sec
Thu Dec 22 23:30:18 2011 - 1248 0: 16 86 933 eth0-tx-0 526075sec
Thu Dec 22 23:30:18 2011 - 1240 0: 234 247 54162 evtchn:qemu-dm 526075sec
Thu Dec 22 23:30:18 2011 - 1239 0: 234 249 7219 evtchn:qemu-dm 526075sec
Thu Dec 22 23:30:18 2011 - 1238 0: 241 256 5931 evtchn:qemu-dm 526075sec
Thu Dec 22 23:30:18 2011 - 1237 0: 224 239 4279 evtchn:qemu-dm 526075sec
Thu Dec 22 23:30:18 2011 - 1236 0: 100 88 315 blkif-backend 526075sec
Thu Dec 22 23:30:23 2011 - 1249 0: 16 20 75 eth0-rx-0 526080sec
Thu Dec 22 23:30:23 2011 - 1248 0: 7 46 933 eth0-tx-0 526080sec
Thu Dec 22 23:30:23 2011 - 1240 0: 8 128 54162 evtchn:qemu-dm 526080sec
Thu Dec 22 23:30:23 2011 - 1239 0: 16 133 7219 evtchn:qemu-dm 526080sec
Thu Dec 22 23:30:23 2011 - 1238 0: 28 142 5931 evtchn:qemu-dm 526080sec
Thu Dec 22 23:30:23 2011 - 1233 CPU0 going backwards (-12648)!
Thu Dec 22 23:30:23 2011 - 19 CPU0 going backwards (-212)!
Thu Dec 22 23:30:23 2011 - 18 0: 35 17 42 ehci_hcd:usb1, uhci_hcd:usb8 526080sec
Thu Dec 22 23:30:23 2011 - 16 CPU0 going backwards (-176)!
Thu Dec 22 23:30:23 2011 - 12 CPU0 going backwards (-4)!
Thu Dec 22 23:30:23 2011 - 4 CPU0 going backwards (-1)!
[533804.785589] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[533804.793913] IP: [<ffffffff814bdba9>] _spin_lock+0x5/0x15
[533804.799556] PGD 0
[533804.801896] Oops: 0002 [#1] SMP
[533804.805448] last sysfs file: /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/class
[533804.813736] CPU 0
[533804.816066] Modules linked in: dm_snapshot dm_mod ipmi_si ipmi_devintf
[533804.822914] Pid: 21632, comm: qemu-dm Not tainted 2.6.32.48 #1 X8STi
[533804.829595] RIP: e030:[<ffffffff814bdba9>] [<ffffffff814bdba9>] _spin_lock+0x5/0x15
[533804.837873] RSP: e02b:ffff88005f187c50 EFLAGS: 00010202
[533804.843513] RAX: 0000000000000100 RBX: ffff88007d6923c0 RCX: 0000000000000003
[533804.851192] RDX: ffff88007deb3180 RSI: ffff88007d6923c0 RDI: 0000000000000008
[533804.858871] RBP: ffff88007e260128 R08: 0000000000000000 R09: 0000000000000000
[533804.866552] R10: ffff88007121eb40 R11: ffffffff811b862b R12: ffff88007fac1e40
[533804.874541] R13: ffff88007e3c3340 R14: ffff88007e3c3340 R15: ffff88007fdfbc00
[533804.882243] FS: 00007f5cdcefe6f0(0000) GS:ffff880028038000(0000) knlGS:0000000000000000
[533804.890874] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[533804.896948] CR2: 0000000000000008 CR3: 0000000001001000 CR4: 0000000000002660
[533804.904627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[533804.912306] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[533804.919985] Process qemu-dm (pid: 21632, threadinfo ffff88005f186000, task ffff880074f4e270)
[533804.928971] Stack:
[533804.931312]  ffffffff810a9ad0 ffff88007deb3180 ffff88007e260100 ffff88007e260100
[533804.938762] <0> ffffffff8124222d 0000000000000008 0000000000000008 ffff88007deb3180
[533804.946900] <0> ffffffff810b245e ffff88007fac1e40 ffff88007deb3180 0000000000000000
[533804.955465] Call Trace:
[533804.958244]  [<ffffffff810a9ad0>] ? mmu_notifier_unregister+0x27/0xa5
[533804.965019]  [<ffffffff8124222d>] ? gntdev_release+0xc3/0xd1
[533804.971007]  [<ffffffff810b245e>] ? __fput+0x100/0x1af
[533804.976477]  [<ffffffff810af998>] ? filp_close+0x5b/0x62
[533804.982119]  [<ffffffff81042989>] ? put_files_struct+0x64/0xc1
[533804.988280]  [<ffffffff810441f0>] ? do_exit+0x209/0x68d
[533804.993836]  [<ffffffff810441f8>] ? do_exit+0x211/0x68d
[533804.999390]  [<ffffffff810446e9>] ? do_group_exit+0x75/0x9c
[533805.005294]  [<ffffffff8104cf1d>] ? get_signal_to_deliver+0x30d/0x338
[533805.012063]  [<ffffffff81010f7a>] ? do_notify_resume+0x8a/0x6b1
[533805.018310]  [<ffffffff810bdb3a>] ? poll_select_copy_remaining+0xd0/0xf3
[533805.025339]  [<ffffffff81011c8e>] ? int_signal+0x12/0x17
[533805.030980] Code: 00 00 00 01 74 05 e8 67 1c d2 ff 48 89 d0 5e c3 ff 14 25 20 2d 6a 81 f0 81 2f 00 00 00 01 74 05 e8 4d 1c d2 ff c3 b8 00 01 00 00 <f0> 66 0f c1 07 38 e0 74 06 f3 90 8a 07 eb f6 c3 f0 81 2f 00 00
[533805.050454] RIP [<ffffffff814bdba9>] _spin_lock+0x5/0x15
[533805.056182]  RSP <ffff88005f187c50>
[533805.059997] CR2: 0000000000000008
[533805.063638] ---[ end trace 85ee7cbec9ce72ac ]---
[533805.068584] Fixing recursive fault but reboot is needed!

If I had to guess, some of the gnttab code in qemu triggered a bug in the gntdev code? The faulting address (CR2 = 0x8) suggests _spin_lock was handed a pointer into a NULL structure, with the lock field 8 bytes in, and the trace shows it happening in mmu_notifier_unregister called from gntdev_release during qemu-dm's exit. I have some experience with gnttab/gntdev, but I don't know its internals very well.

Roderick

On Fri, Dec 16, 2011 at 2:25 AM, Roderick Colenbrander <thunderbird2k@xxxxxxxxx> wrote:
> Thanks. On Tuesday I set up your monitoring tool on the already
> running tests. Some of the systems were on the verge of showing
> results, but we just had a power outage and I have to restart the
> tests. It will probably take until the Holidays before there are any
> results :(
>
> Roderick
>
> On Tue, Dec 13, 2011 at 9:20 PM, Konrad Rzeszutek Wilk
> <konrad@xxxxxxxxxx> wrote:
>>> > Does this feel like blktap issues? Or is it more likely that something
>>> > else happened which causes blktap and the other things to fail?
>>>
>>> It looks like the interrupts stopped.
>>> Perhaps you can run the C code
>>> (attached) on the serial console and see _what_ (or perhaps _when_)
>>> the interrupts stop (and also verify that the timeouts and 'frozen'
>>> happen due to no interrupts being received).
>>
>> The C code is http://darnok.org/xen/read_interrupts.c
>>
>>>
>>> There are a couple of bugs that were discovered in the interrupt code
>>> (that had been there since 2.6.27!) that went to the stable tree - so
>>> they should have been backported to the 2.6.32.x tree - but I honestly
>>> don't recall the names. I can look them up tomorrow.
>>
>> xen: x86_32: do not enable iterrupts when returning from exception in
>> interrupt context (git commit d198d499148a0c64a41b3aba9e7dd43772832b91)

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-devel