[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Xen-users] Advice/hints on debugging a xen/auditd kernel issue.


  • To: xen-users@xxxxxxxxxxxxx
  • From: Peter Moody <pmoody@xxxxxxxxxx>
  • Date: Mon, 23 Jul 2012 08:26:21 -0700
  • Delivery-date: Mon, 23 Jul 2012 15:28:02 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

Hey folks,

I'm looking for some help debugging a kernel oops that only seems to
manifest itself on xen instances.

We have a small number of machines that seem to be having a weird
interaction between auditd and xen. When under a high-ish amount of
load, with auditd running, these machines crash almost 100% of the
time.

Here are two separate OOPs traces that appear to be from the same bug:

from kernel 2.6.38.8

[32581.835885] ------------[ cut here ]------------
[32581.835894] kernel BUG at fs/buffer.c:1263!
[32581.835898] invalid opcode: 0000 [#1] SMP
[32581.835903] last sysfs file: /sys/devices/system/cpu/sched_mc_power_savings
[32581.835906] CPU 0
[32581.835910] Pid: 14899, comm: python2.6 Not tainted
2.6.38.8-gg868-ganetixenu #1
[32581.835917] RIP: e030:[<ffffffff81153853>]  [<ffffffff81153853>]
__find_get_block+0x1f3/0x200
[32581.835927] RSP: e02b:ffff88067cfcdc78  EFLAGS: 00010046
[32581.835931] RAX: ffff8807be6b0000 RBX: ffff8807740c50f0 RCX: 00000000007e980a
[32581.835935] RDX: 0000000000001000 RSI: 00000000007e980a RDI: ffff8807b8cad380
[32581.835939] RBP: ffff88067cfcdcd8 R08: 0000000000000001 R09: ffff8807740c5018
[32581.835944] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8807740c501c
[32581.835949] R13: ffff8807740c5048 R14: ffff8807fef36210 R15: 00000000007e980a
[32581.835955] FS:  00007f37fd752700(0000) GS:ffff8807fff26000(0063)
knlGS:0000000000000000
[32581.835960] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
[32581.835965] CR2: 00000000ea515000 CR3: 000000068522b000 CR4: 0000000000002660
[32581.835970] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[32581.835975] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[32581.835979] Process python2.6 (pid: 14899, threadinfo
ffff88067cfcc000, task ffff8806e772da80)
[32581.835984] Stack:
[32581.835986]  ffff88067cfcdc98 ffffffff81654cd1 ffff88067cfcdca8
ffff8807740c5a98
[32581.835994]  ffff88067cfcdd08 ffffffff811c9294 ffff8806ffffffc3
0000000000000014
[32581.836001]  ffff8807740c50f0 ffff8807740c501c ffff8807740c5048
ffff8807fef36210
[32581.836009] Call Trace:
[32581.836015]  [<ffffffff81654cd1>] ? down_read+0x11/0x30
[32581.836021]  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
[32581.836027]  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
[32581.836032]  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
[32581.836037]  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
[32581.836042]  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
[32581.836047]  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
[32581.836052]  [<ffffffff8114065f>] evict+0x1f/0xb0
[32581.836058]  [<ffffffff81006d52>] ? check_events+0x12/0x20
[32581.836063]  [<ffffffff81140c14>] iput+0x1a4/0x290
[32581.836068]  [<ffffffff8113ed05>] dput+0x265/0x310
[32581.836072]  [<ffffffff81132435>] path_put+0x15/0x30
[32581.836078]  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
[32581.836084]  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
[32581.836088] Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05
01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff
0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57
49 89
[32581.836141] RIP  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
[32581.836146]  RSP <ffff88067cfcdc78>
[32581.836157] ---[ end trace 0658a2308b35c81e ]---

and from kernel 3.2.5

[12109.709499] ------------[ cut here ]------------
[12109.709508] kernel BUG at fs/buffer.c:1265!
[12109.709512] invalid opcode: 0000 [#1] SMP
[12109.709517] CPU 0
[12109.709521] Pid: 21385, comm: python2.6 Not tainted 3.2.5-gg886-ganetixenu #1
[12109.709527] RIP: e030:[<ffffffff81166473>]  [<ffffffff81166473>]
__find_get_block+0x1f3/0x200
[12109.709536] RSP: e02b:ffff88061fa99c68  EFLAGS: 00010046
[12109.709540] RAX: ffff8807c91e0000 RBX: ffff88028e268cf8 RCX: 0000000002523819
[12109.709544] RDX: 0000000000001000 RSI: 0000000002523819 RDI: ffff8807dede2080
[12109.709549] RBP: ffff88061fa99cc8 R08: 0000000000000001 R09: ffff88028e268c20
[12109.709553] R10: ffff8806cf8bc000 R11: ffffffff81c30e08 R12: ffff88028e268c24
[12109.709557] R13: ffff88028e268c50 R14: ffff8807defe8528 R15: 0000000002523819
[12109.709563] FS:  00007fa1f4026700(0000) GS:ffff8807fff2b000(0063)
knlGS:0000000000000000
[12109.709567] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
[12109.709571] CR2: 0000000002bb6000 CR3: 00000006c8730000 CR4: 0000000000002660
[12109.709575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12109.709580] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[12109.709584] Process python2.6 (pid: 21385, threadinfo
ffff88061fa98000, task ffff8806e37796a0)
[12109.709589] Stack:
[12109.709591]  ffff88061fa99cb8 ffffffff81143001 ffff88061fa99c98
ffffffff81699f91
[12109.709599]  ffff88061fa99d28 ffff8806fcb79610 ffff88061fa99d08
ffffffff811e5d34
[12109.709606]  ffff88028e268cf8 ffff88028e268c24 ffff88028e268c50
ffff8807defe8528
[12109.709613] Call Trace:
[12109.709619]  [<ffffffff81143001>] ? generic_permission+0x101/0x250
[12109.709625]  [<ffffffff81699f91>] ? down_read+0x11/0x30
[12109.709631]  [<ffffffff811e5d34>] ? ext3_xattr_get+0xf4/0x2b0
[12109.709635]  [<ffffffff811d0ac8>] ext3_clear_blocks+0x128/0x190
[12109.709640]  [<ffffffff811d0c44>] ext3_free_data+0x114/0x160
[12109.709644]  [<ffffffff811d13da>] ext3_truncate+0x50a/0x610
[12109.709649]  [<ffffffff811465c6>] ? path_lookupat+0x66/0x690
[12109.709654]  [<ffffffff8112b5d7>] ? kmem_cache_alloc+0x37/0x120
[12109.709659]  [<ffffffff812393b5>] ? journal_start+0xb5/0x100
[12109.709664]  [<ffffffff811d2638>] ext3_evict_inode+0x218/0x280
[12109.709669]  [<ffffffff81151051>] evict+0xa1/0x1a0
[12109.709673]  [<ffffffff81151258>] iput+0x108/0x210
[12109.709677]  [<ffffffff8114f385>] dput+0x285/0x330
[12109.709681]  [<ffffffff81142425>] path_put+0x15/0x30
[12109.709686]  [<ffffffff810afb11>] audit_syscall_exit+0x171/0x260
[12109.709692]  [<ffffffff8169df6a>] sysexit_audit+0x21/0x5f
[12109.709695] Code: 82 a0 ff 00 00 85 c0 75 de 65 48 89 1c 25 a0 ff
00 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff
0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57
49 89
[12109.709752] RIP  [<ffffffff81166473>] __find_get_block+0x1f3/0x200
[12109.709757]  RSP <ffff88061fa99c68>
[12109.709763] ---[ end trace 842635a4dd3e3b5f ]---

That BUG() line is:

static inline void check_irqs_on(void)
{
#ifdef irqs_disabled
        BUG_ON(irqs_disabled());
#endif
}

so it seems like something about the load is either causing
__find_get_block to be called while irqs are disabled or there's
something mucking with the irq flags.

Finally, here's the relevant bits from xm info on one of the affected nodes:

# xm info
host                   :
release                : 3.2.13-ganeti-rx6-xen0
version                : #1 SMP Thu Jun 7 12:59:40 CEST 2012
machine                : x86_64
nr_cpus                : 24
nr_nodes               : 2
cores_per_socket       : 6
threads_per_core       : 2
cpu_mhz                : 3200
hw_caps                :
bfebfbff:2c100800:00000000:00001f40:029ee3ff:00000000:00000001:00000000
virt_caps              :
total_memory           : 98292
free_memory            : 82906
node_to_cpu            : node0:0-5,12-17
                         node1:6-11,18-23
node_to_memory         : node0:40810
                         node1:42095
node_to_dma32_mem      : node0:1969
                         node1:0
max_node_id            : 1
xen_major              : 4
xen_minor              : 0
xen_extra              : .1
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : unavailable
xen_commandline        : placeholder dom0_mem=2048M loglvl=all
com1=115200,8n1 console=com1 iommu=0
cc_compiler            : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
cc_compile_by          :
cc_compile_domain      : google.com
cc_compile_date        : Wed Mar 16 15:24:06 UTC 2011
xend_config_format     : 4

I should be getting access to one of these machines shortly, but I'm
new to xen-debugging, so I was hoping that someone might be able to
give me some hints on debugging xen-specific kernel issues.

Cheers,
peter

-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.