Xen project Mailing List

Re: [Xen-devel] netback Oops then xenwatch stuck in D state

From: Ian Campbell <Ian.Campbell@xxxxxxxxxx>

Date: Tue, 12 Feb 2013 09:58:53 +0000

Cc: "xen-devel@xxxxxxxxxxxxx" <xen-devel@xxxxxxxxxxxxx>

Delivery-date: Tue, 12 Feb 2013 09:59:13 +0000

List-id: Xen developer discussion <xen-devel.lists.xen.org>

On Mon, 2013-02-11 at 11:45 +0000, Wei Liu wrote:
> On Sun, 2013-02-10 at 22:03 +0000, Christopher S. Aker wrote:
> > And another this afternoon on a different machine:
> >
> > BUG: unable to handle kernel NULL pointer dereference at 00000000000008b8
>
> OK, so the guest is faulting at different offset now. It is very likely
> that there is OOM / race condition in other places. And judging from
> your two emails, I presume this bug can be reproduce steadily.
>
> > IP: [<ffffffff81011dda>] xen_spin_lock_flags+0x3a/0x80
> > PGD 0
> > Oops: 0002 [#1] SMP
> > Modules linked in: ebt_comment ebt_arp ebt_set ebt_limit ebt_ip6 ebt_ip
> > ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding ebtable_filter e1000e
> > CPU 5
> > Pid: 1550, comm: netback/5 Not tainted 3.7.6-1-x86_64 #1 Supermicro
> > X8DT6/X8DT6
> > RIP: e030:[<ffffffff81011dda>] [<ffffffff81011dda>]
> > xen_spin_lock_flags+0x3a/0x80
> > RSP: e02b:ffff8800836e7b58 EFLAGS: 00010006
> > RAX: 0000000000000400 RBX: 00000000000008b8 RCX: 000000000045de5d
> > RDX: 0000000000000001 RSI: 0000000000000211 RDI: 00000000000008b8
> > RBP: ffff8800836e7b78 R08: 0000000000000068 R09: 0000000000000000
> > R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
> > R13: 0000000000000200 R14: 0000000000000400 R15: 000000000045de5d
> > FS: 00007f474a465700(0000) GS:ffff880100740000(0000) knlGS:0000000000000000
> > CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00000000000008b8 CR3: 0000000001c0b000 CR4: 0000000000002660
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process netback/5 (pid: 1550, threadinfo ffff8800836e6000, task
> > ffff880084510000)
> > Stack:
> > 0000000000000211 00000000000008b8 ffff8800771e5700 ffff8800771e57d8
> > ffff8800836e7b98 ffffffff817605da 0000000000000000 00000000000008b8
> > ffff8800836e7bd8 ffffffff8154446f ffff8800771e5000 0000000000000000
> > Call Trace:
> > [<ffffffff817605da>] _raw_spin_lock_irqsave+0x2a/0x40
> > [<ffffffff8154446f>] xen_netbk_schedule_xenvif+0x8f/0x100
> > [<ffffffff81544505>] xen_netbk_check_rx_xenvif+0x25/0x60
> > [<ffffffff815445eb>] netbk_tx_err+0x5b/0x70
> > [<ffffffff8154518c>] xen_netbk_tx_build_gops+0xb8c/0xbc0
> > [<ffffffff81012880>] ? __switch_to+0x160/0x4f0
> > [<ffffffff810891b8>] ? idle_balance+0xf8/0x150
> > [<ffffffff81080150>] ? finish_task_switch+0x60/0xd0
> > [<ffffffff8175f7b4>] ? __schedule+0x394/0x750
> > [<ffffffff815452af>] xen_netbk_kthread+0xef/0x9d0
> > [<ffffffff81080150>] ? finish_task_switch+0x60/0xd0
> > [<ffffffff810720c0>] ? wake_up_bit+0x40/0x40
> > [<ffffffff815451c0>] ? xen_netbk_tx_build_gops+0xbc0/0xbc0
> > [<ffffffff81071a06>] kthread+0xc6/0xd0
> > [<ffffffff810037b9>] ? xen_end_context_switch+0x19/0x20
> > [<ffffffff81071940>] ? kthread_freezable_should_stop+0x70/0x70
> > [<ffffffff8176847c>] ret_from_fork+0x7c/0xb0
> > [<ffffffff81071940>] ? kthread_freezable_should_stop+0x70/0x70
> [snip]
> >
> > We're not so good at this, but it looks like xl->lock deref is what we
> > hit? The lock was gone?
> >
>
> A quick check on the xen_spinlock struct, its offset should not be
> 0x8b8.

This originally came from "&netbk->net_schedule_list_lock" in
xen_netbk_schedule_xenvif so I guess most of the 0x8b8 came from the
offset of net_schedule_list_lock.

> Reading the backtrace suggests that it is the netbk struct is
> gone.

Yes. It would be interesting to add

    if (!netbk)
        netdev_err(vif->dev, "vif has no associated netbk!\n");

and also to add prints to xen_netbk_add_xenvif() and
xen_netbk_remove_xenvif() to track the setup and teardown of the
vif<->netbk relationships (these are infrequent, only when a vif is
opened/closed, so it might be that dumping a stack trace is
plausible/useful, especially on the teardown).

It would also be useful to confirm that the netbk selected in
xen_netbk_add_xenvif is non-NULL and that its index relates sanely to
xen_netbk_group_nr.
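To make the offset argument concrete, here is a minimal userspace sketch. Everything in it is an assumption for illustration: `struct fake_netbk` and `struct fake_vif` are hypothetical stand-ins, not the real netback structures, and the padding is chosen only so that the lock member lands at 0x8b8. The point is that taking `&netbk->net_schedule_list_lock` through a NULL `netbk` produces a small address equal to the member's offset, which is the shape of the faulting address (CR2/RDI) in the Oops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * HYPOTHETICAL stand-ins, not the real struct xen_netbk / struct xenvif.
 * The padding size is an assumption picked so the lock sits at 0x8b8.
 */
struct fake_netbk {
    unsigned char earlier_fields[0x8b8]; /* placeholder for preceding members */
    int net_schedule_list_lock;          /* stands in for spinlock_t */
};

struct fake_vif {
    struct fake_netbk *netbk;
};

/* Compile-time check of the layout assumption above. */
static_assert(offsetof(struct fake_netbk, net_schedule_list_lock) == 0x8b8,
              "assumed layout: lock at offset 0x8b8");

/*
 * Address that &netbk->net_schedule_list_lock would evaluate to for a
 * given base pointer. Done in integer arithmetic so that no invalid
 * pointer is ever formed or dereferenced.
 */
static uintptr_t lock_address(uintptr_t netbk_base)
{
    return netbk_base + offsetof(struct fake_netbk, net_schedule_list_lock);
}

/*
 * Userspace analogue of the suggested debug checks.
 * Returns 0 if the association is sane, 1 if the vif has no netbk,
 * 2 if it points at a different netbk than expected.
 */
static int check_vif_netbk(const struct fake_vif *vif,
                           const struct fake_netbk *expected)
{
    if (!vif->netbk) {
        fprintf(stderr, "vif has no associated netbk!\n");
        return 1;
    }
    if (vif->netbk != expected) {
        fprintf(stderr, "has netbk %p should be %p!\n",
                (const void *)vif->netbk, (const void *)expected);
        return 2;
    }
    return 0;
}
```

Under the assumed layout, `lock_address(0)` gives 0x8b8, so a fault at that address is what a NULL netbk would look like. The real offset depends on the kernel's actual struct layout and config, which this sketch does not reproduce.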
There should be no way for a vif to get on the schedule list without
being associated with a non-NULL netbk.

Here the call chain is through xen_netbk_tx_build_gops -> netbk_tx_err
-> xen_netbk_check_rx_xenvif. However the netbk variable in
xen_netbk_tx_build_gops has been used several times before we even get
near any call to netbk_tx_err. I suppose adding a check

    if (vif->netbk != netbk)
        netdev_err(vif->dev, "has netbk %p should be %p!\n",
                   vif->netbk, netbk);

right after the !vif check at the top of the loop would also be
interesting.

Have you applied the XSA-39 fixes to this kernel? Every invocation of
netbk_tx_err *should* have an associated error print, I think, at least
after that change. If your kernel predates it, it would be worth
checking. Either way you'll need to turn on debugging (or
s/pr_dbg/pr_err/ in netback.c). Knowing which call to tx_err occurred
might yield a clue.

Ian.

_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
