[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Xen-users] Advice/hints on debugging a xen/auditd kernel issue.


  • To: xen-users@xxxxxxxxxxxxx
  • From: Peter Moody <pmoody@xxxxxxxxxx>
  • Date: Wed, 25 Jul 2012 10:32:36 -0700
  • Delivery-date: Wed, 25 Jul 2012 17:34:56 +0000
  • List-id: Xen user discussion <xen-users.lists.xen.org>

Is there a more suitable list for this question? perhaps xen-devel?

On Mon, Jul 23, 2012 at 8:26 AM, Peter Moody <pmoody@xxxxxxxxxx> wrote:
> Hey folks,
>
> I'm looking for some help debugging a kernel oops that only seems to
> manifest itself on xen instances.
>
> We have a small number of machines that seem to be having a weird
> interaction between auditd and xen. When under a high-ish amount of
> load, with auditd running, these machines crash almost 100% of the
> time.
>
> Here are two separate OOPs traces that appear to be from the same bug:
>
> from kernel 2.6.38.8
>
> [32581.835885] ------------[ cut here ]------------
> [32581.835894] kernel BUG at fs/buffer.c:1263!
> [32581.835898] invalid opcode: 0000 [#1] SMP
> [32581.835903] last sysfs file: /sys/devices/system/cpu/sched_mc_power_savings
> [32581.835906] CPU 0
> [32581.835910] Pid: 14899, comm: python2.6 Not tainted
> 2.6.38.8-gg868-ganetixenu #1
> [32581.835917] RIP: e030:[<ffffffff81153853>]  [<ffffffff81153853>]
> __find_get_block+0x1f3/0x200
> [32581.835927] RSP: e02b:ffff88067cfcdc78  EFLAGS: 00010046
> [32581.835931] RAX: ffff8807be6b0000 RBX: ffff8807740c50f0 RCX: 
> 00000000007e980a
> [32581.835935] RDX: 0000000000001000 RSI: 00000000007e980a RDI: 
> ffff8807b8cad380
> [32581.835939] RBP: ffff88067cfcdcd8 R08: 0000000000000001 R09: 
> ffff8807740c5018
> [32581.835944] R10: 0000000000000000 R11: 0000000000000000 R12: 
> ffff8807740c501c
> [32581.835949] R13: ffff8807740c5048 R14: ffff8807fef36210 R15: 
> 00000000007e980a
> [32581.835955] FS:  00007f37fd752700(0000) GS:ffff8807fff26000(0063)
> knlGS:0000000000000000
> [32581.835960] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
> [32581.835965] CR2: 00000000ea515000 CR3: 000000068522b000 CR4: 
> 0000000000002660
> [32581.835970] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [32581.835975] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
> 0000000000000400
> [32581.835979] Process python2.6 (pid: 14899, threadinfo
> ffff88067cfcc000, task ffff8806e772da80)
> [32581.835984] Stack:
> [32581.835986]  ffff88067cfcdc98 ffffffff81654cd1 ffff88067cfcdca8
> ffff8807740c5a98
> [32581.835994]  ffff88067cfcdd08 ffffffff811c9294 ffff8806ffffffc3
> 0000000000000014
> [32581.836001]  ffff8807740c50f0 ffff8807740c501c ffff8807740c5048
> ffff8807fef36210
> [32581.836009] Call Trace:
> [32581.836015]  [<ffffffff81654cd1>] ? down_read+0x11/0x30
> [32581.836021]  [<ffffffff811c9294>] ? ext3_xattr_get+0xf4/0x2b0
> [32581.836027]  [<ffffffff811baf88>] ext3_clear_blocks+0x128/0x190
> [32581.836032]  [<ffffffff811bb104>] ext3_free_data+0x114/0x160
> [32581.836037]  [<ffffffff811bbc0a>] ext3_truncate+0x87a/0x950
> [32581.836042]  [<ffffffff812133f5>] ? journal_start+0xb5/0x100
> [32581.836047]  [<ffffffff811bc840>] ext3_evict_inode+0x180/0x1a0
> [32581.836052]  [<ffffffff8114065f>] evict+0x1f/0xb0
> [32581.836058]  [<ffffffff81006d52>] ? check_events+0x12/0x20
> [32581.836063]  [<ffffffff81140c14>] iput+0x1a4/0x290
> [32581.836068]  [<ffffffff8113ed05>] dput+0x265/0x310
> [32581.836072]  [<ffffffff81132435>] path_put+0x15/0x30
> [32581.836078]  [<ffffffff810a5d31>] audit_syscall_exit+0x171/0x260
> [32581.836084]  [<ffffffff8103ed9a>] sysexit_audit+0x21/0x5f
> [32581.836088] Code: 82 00 05 01 00 85 c0 75 de 65 48 89 1c 25 00 05
> 01 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff
> 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57
> 49 89
> [32581.836141] RIP  [<ffffffff81153853>] __find_get_block+0x1f3/0x200
> [32581.836146]  RSP <ffff88067cfcdc78>
> [32581.836157] ---[ end trace 0658a2308b35c81e ]---
>
> and from kernel 3.2.5
>
> [12109.709499] ------------[ cut here ]------------
> [12109.709508] kernel BUG at fs/buffer.c:1265!
> [12109.709512] invalid opcode: 0000 [#1] SMP
> [12109.709517] CPU 0
> [12109.709521] Pid: 21385, comm: python2.6 Not tainted 3.2.5-gg886-ganetixenu 
> #1
> [12109.709527] RIP: e030:[<ffffffff81166473>]  [<ffffffff81166473>]
> __find_get_block+0x1f3/0x200
> [12109.709536] RSP: e02b:ffff88061fa99c68  EFLAGS: 00010046
> [12109.709540] RAX: ffff8807c91e0000 RBX: ffff88028e268cf8 RCX: 
> 0000000002523819
> [12109.709544] RDX: 0000000000001000 RSI: 0000000002523819 RDI: 
> ffff8807dede2080
> [12109.709549] RBP: ffff88061fa99cc8 R08: 0000000000000001 R09: 
> ffff88028e268c20
> [12109.709553] R10: ffff8806cf8bc000 R11: ffffffff81c30e08 R12: 
> ffff88028e268c24
> [12109.709557] R13: ffff88028e268c50 R14: ffff8807defe8528 R15: 
> 0000000002523819
> [12109.709563] FS:  00007fa1f4026700(0000) GS:ffff8807fff2b000(0063)
> knlGS:0000000000000000
> [12109.709567] CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
> [12109.709571] CR2: 0000000002bb6000 CR3: 00000006c8730000 CR4: 
> 0000000000002660
> [12109.709575] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [12109.709580] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
> 0000000000000400
> [12109.709584] Process python2.6 (pid: 21385, threadinfo
> ffff88061fa98000, task ffff8806e37796a0)
> [12109.709589] Stack:
> [12109.709591]  ffff88061fa99cb8 ffffffff81143001 ffff88061fa99c98
> ffffffff81699f91
> [12109.709599]  ffff88061fa99d28 ffff8806fcb79610 ffff88061fa99d08
> ffffffff811e5d34
> [12109.709606]  ffff88028e268cf8 ffff88028e268c24 ffff88028e268c50
> ffff8807defe8528
> [12109.709613] Call Trace:
> [12109.709619]  [<ffffffff81143001>] ? generic_permission+0x101/0x250
> [12109.709625]  [<ffffffff81699f91>] ? down_read+0x11/0x30
> [12109.709631]  [<ffffffff811e5d34>] ? ext3_xattr_get+0xf4/0x2b0
> [12109.709635]  [<ffffffff811d0ac8>] ext3_clear_blocks+0x128/0x190
> [12109.709640]  [<ffffffff811d0c44>] ext3_free_data+0x114/0x160
> [12109.709644]  [<ffffffff811d13da>] ext3_truncate+0x50a/0x610
> [12109.709649]  [<ffffffff811465c6>] ? path_lookupat+0x66/0x690
> [12109.709654]  [<ffffffff8112b5d7>] ? kmem_cache_alloc+0x37/0x120
> [12109.709659]  [<ffffffff812393b5>] ? journal_start+0xb5/0x100
> [12109.709664]  [<ffffffff811d2638>] ext3_evict_inode+0x218/0x280
> [12109.709669]  [<ffffffff81151051>] evict+0xa1/0x1a0
> [12109.709673]  [<ffffffff81151258>] iput+0x108/0x210
> [12109.709677]  [<ffffffff8114f385>] dput+0x285/0x330
> [12109.709681]  [<ffffffff81142425>] path_put+0x15/0x30
> [12109.709686]  [<ffffffff810afb11>] audit_syscall_exit+0x171/0x260
> [12109.709692]  [<ffffffff8169df6a>] sysexit_audit+0x21/0x5f
> [12109.709695] Code: 82 a0 ff 00 00 85 c0 75 de 65 48 89 1c 25 a0 ff
> 00 00 e9 87 fe ff ff 48 89 df e8 e9 fc ff ff 4c 89 f7 e9 02 ff ff ff
> 0f 0b eb fe <0f> 0b eb fe 0f 0b eb fe 0f 1f 44 00 00 55 48 89 e5 41 57
> 49 89
> [12109.709752] RIP  [<ffffffff81166473>] __find_get_block+0x1f3/0x200
> [12109.709757]  RSP <ffff88061fa99c68>
> [12109.709763] ---[ end trace 842635a4dd3e3b5f ]---
>
> That BUG() line is:
>
> static inline void check_irqs_on(void)
> {
> #ifdef irqs_disabled
>         BUG_ON(irqs_disabled());
> #endif
> }
>
> so it seems like something about the load is either causing
> __find_get_block to be called while irqs are disabled or there's
> something mucking with the irq flags.
>
> Finally, here's the relevant bits from xm info on one of the affected nodes:
>
> # xm info
> host                   :
> release                : 3.2.13-ganeti-rx6-xen0
> version                : #1 SMP Thu Jun 7 12:59:40 CEST 2012
> machine                : x86_64
> nr_cpus                : 24
> nr_nodes               : 2
> cores_per_socket       : 6
> threads_per_core       : 2
> cpu_mhz                : 3200
> hw_caps                :
> bfebfbff:2c100800:00000000:00001f40:029ee3ff:00000000:00000001:00000000
> virt_caps              :
> total_memory           : 98292
> free_memory            : 82906
> node_to_cpu            : node0:0-5,12-17
>                          node1:6-11,18-23
> node_to_memory         : node0:40810
>                          node1:42095
> node_to_dma32_mem      : node0:1969
>                          node1:0
> max_node_id            : 1
> xen_major              : 4
> xen_minor              : 0
> xen_extra              : .1
> xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p
> xen_scheduler          : credit
> xen_pagesize           : 4096
> platform_params        : virt_start=0xffff800000000000
> xen_changeset          : unavailable
> xen_commandline        : placeholder dom0_mem=2048M loglvl=all
> com1=115200,8n1 console=com1 iommu=0
> cc_compiler            : gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5)
> cc_compile_by          :
> cc_compile_domain      : google.com
> cc_compile_date        : Wed Mar 16 15:24:06 UTC 2011
> xend_config_format     : 4
>
> I should be getting access to one of these machines shortly, but I'm
> new to xen-debugging, so I was hoping that someone might be able to
> give me some hints on debugging xen-specific kernel issues.
>
> Cheers,
> peter
>
> --
> Peter Moody      Google    1.650.253.7306
> Security Engineer  pgp:0xC3410038



-- 
Peter Moody      Google    1.650.253.7306
Security Engineer  pgp:0xC3410038

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users


 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.