
Re: [Xen-devel] netback Oops then xenwatch stuck in D state



On Mon, 2013-02-11 at 11:45 +0000, Wei Liu wrote:
> On Sun, 2013-02-10 at 22:03 +0000, Christopher S. Aker wrote:
> > And another this afternoon on a different machine:
> > 
> > BUG: unable to handle kernel NULL pointer dereference at 00000000000008b8
> 
> OK, so the guest is faulting at a different offset now. It is very
> likely that there is an OOM / race condition in other places. And
> judging from your two emails, I presume this bug can be reproduced
> reliably.
> 
> > IP: [<ffffffff81011dda>] xen_spin_lock_flags+0x3a/0x80
> > PGD 0
> > Oops: 0002 [#1] SMP
> > Modules linked in: ebt_comment ebt_arp ebt_set ebt_limit ebt_ip6 ebt_ip 
> > ip_set_hash_net ip_set ebtable_nat xen_gntdev bonding ebtable_filter e1000e
> > CPU 5
> > Pid: 1550, comm: netback/5 Not tainted 3.7.6-1-x86_64 #1 Supermicro 
> > X8DT6/X8DT6
> > RIP: e030:[<ffffffff81011dda>]  [<ffffffff81011dda>] 
> > xen_spin_lock_flags+0x3a/0x80
> > RSP: e02b:ffff8800836e7b58  EFLAGS: 00010006
> > RAX: 0000000000000400 RBX: 00000000000008b8 RCX: 000000000045de5d
> > RDX: 0000000000000001 RSI: 0000000000000211 RDI: 00000000000008b8
> > RBP: ffff8800836e7b78 R08: 0000000000000068 R09: 0000000000000000
> > R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000001
> > R13: 0000000000000200 R14: 0000000000000400 R15: 000000000045de5d
> > FS:  00007f474a465700(0000) GS:ffff880100740000(0000) knlGS:0000000000000000
> > CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> > CR2: 00000000000008b8 CR3: 0000000001c0b000 CR4: 0000000000002660
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Process netback/5 (pid: 1550, threadinfo ffff8800836e6000, task 
> > ffff880084510000)
> > Stack:
> >   0000000000000211 00000000000008b8 ffff8800771e5700 ffff8800771e57d8
> >   ffff8800836e7b98 ffffffff817605da 0000000000000000 00000000000008b8
> >   ffff8800836e7bd8 ffffffff8154446f ffff8800771e5000 0000000000000000
> > Call Trace:
> >   [<ffffffff817605da>] _raw_spin_lock_irqsave+0x2a/0x40
> >   [<ffffffff8154446f>] xen_netbk_schedule_xenvif+0x8f/0x100
> >   [<ffffffff81544505>] xen_netbk_check_rx_xenvif+0x25/0x60
> >   [<ffffffff815445eb>] netbk_tx_err+0x5b/0x70
> >   [<ffffffff8154518c>] xen_netbk_tx_build_gops+0xb8c/0xbc0
> >   [<ffffffff81012880>] ? __switch_to+0x160/0x4f0
> >   [<ffffffff810891b8>] ? idle_balance+0xf8/0x150
> >   [<ffffffff81080150>] ? finish_task_switch+0x60/0xd0
> >   [<ffffffff8175f7b4>] ? __schedule+0x394/0x750
> >   [<ffffffff815452af>] xen_netbk_kthread+0xef/0x9d0
> >   [<ffffffff81080150>] ? finish_task_switch+0x60/0xd0
> >   [<ffffffff810720c0>] ? wake_up_bit+0x40/0x40
> >   [<ffffffff815451c0>] ? xen_netbk_tx_build_gops+0xbc0/0xbc0
> >   [<ffffffff81071a06>] kthread+0xc6/0xd0
> >   [<ffffffff810037b9>] ? xen_end_context_switch+0x19/0x20
> >   [<ffffffff81071940>] ? kthread_freezable_should_stop+0x70/0x70
> >   [<ffffffff8176847c>] ret_from_fork+0x7c/0xb0
> >   [<ffffffff81071940>] ? kthread_freezable_should_stop+0x70/0x70
> [snip]
> > 
> > We're not so good at this, but it looks like xl->lock deref is what we 
> > hit?  The lock was gone?
> > 
> 
> A quick check on the xen_spinlock struct, its offset should not be
> 0x8b8.

This originally came from "&netbk->net_schedule_list_lock" in
xen_netbk_schedule_xenvif so I guess most of the 0x8b8 came from the
offset of net_schedule_list_lock.
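
To illustrate the reasoning: taking the address of a member through a
NULL base pointer yields just the member's offset, which is why a NULL
netbk would fault at a small non-zero address rather than at 0. The
sketch below is plain userspace C, not kernel code. struct fake_netbk
and its padding are made up so that the lock lands at 0x8b8; the real
struct xen_netbk layout depends on the kernel version and config.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xen_netbk; only the offset matters. */
struct fake_netbk {
    unsigned char earlier_fields[0x8b8]; /* assumed padding, not the real layout */
    int net_schedule_list_lock;          /* stand-in for the spinlock */
};

/* Compile-time check that the made-up layout puts the lock at 0x8b8. */
static_assert(offsetof(struct fake_netbk, net_schedule_list_lock) == 0x8b8,
              "lock is expected at offset 0x8b8 in this illustration");

/* Address the lock code would touch for a given netbk base address. */
static uintptr_t lock_address(uintptr_t netbk_base)
{
    return netbk_base + offsetof(struct fake_netbk, net_schedule_list_lock);
}
```

With a base of 0 (a NULL netbk), lock_address() returns the offset
itself, which would match a faulting address (CR2) equal to the
member's offset.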


> Reading the backtrace suggests that the netbk struct is gone.

Yes. It would be interesting to add
        if (!netbk)
                netdev_err(vif->dev, "vif has no associated netbk!\n");
and also to add prints to xen_netbk_add_xenvif() and
xen_netbk_remove_xenvif() to track the setup and teardown of the
vif<->netbk relationships (these are infrequent, happening only when a
vif is opened/closed, so dumping a stack trace may be plausible/useful,
especially on the teardown).

It would also be useful to confirm that the netbk selected in
xen_netbk_add_xenvif is non-NULL and that its index relates sanely to
xen_netbk_group_nr.
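
As a rough idea of what "relates sanely" could mean, here is a
userspace model of the check: the pointer must be non-NULL, must point
exactly at an element of the group array, and its index must be below
the group count. struct model_netbk and netbk_index() are hypothetical
names for illustration only; in the kernel the array is xen_netbk[] and
the count is xen_netbk_group_nr.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the xen_netbk[] group array. */
struct model_netbk {
    int netfront_count;
};

/*
 * Return the index of netbk within groups[0..group_nr), or -1 if the
 * pointer is NULL, lies outside the array, or does not point at the
 * start of an element.
 */
static long netbk_index(const struct model_netbk *groups, size_t group_nr,
                        const struct model_netbk *netbk)
{
    uintptr_t base = (uintptr_t)groups;
    uintptr_t addr = (uintptr_t)netbk;
    uintptr_t delta;
    size_t idx;

    if (netbk == NULL || addr < base)
        return -1;

    delta = addr - base;
    if (delta % sizeof(*groups) != 0)
        return -1;              /* not aligned to an element boundary */

    idx = delta / sizeof(*groups);
    return idx < group_nr ? (long)idx : -1;
}
```

A debug print in xen_netbk_add_xenvif() could log the equivalent of
this index next to the vif name, so that a bogus association shows up
at setup time rather than at the eventual crash.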

There should be no way for a vif to get on the schedule list without
being associated with a non-NULL netbk. Here the call chain is through
xen_netbk_tx_build_gops -> netbk_tx_err -> xen_netbk_check_rx_xenvif.
However the netbk variable in xen_netbk_tx_build_gops has been used
several times before we even get near any call to netbk_tx_err. I
suppose adding a check
        if (vif->netbk != netbk)
                netdev_err(vif->dev, "has netbk %p should be %p!\n",
                           vif->netbk, netbk);
right after the !vif check at the top of the loop would also be
interesting.

Have you applied the XSA-39 fixes to this kernel? Every invocation of
netbk_tx_err *should* have an associated error print, I think, at least
after that change. If your kernel predates it, it would be worth
checking that. Either way you'll need to turn on debugging (or
s/pr_dbg/pr_err/ in netback.c). Knowing which call to netbk_tx_err
occurred might yield a clue.

Ian.





_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel


 

