
Re: [Xen-devel] netback Oops then xenwatch stuck in D state



>>> On 13.02.13 at 03:51, "Christopher S. Aker" <caker@xxxxxxxxxxxx> wrote:
> Feb 12 20:34:12: vif vif-21-0 vif21.0: Frag is bigger than frame.
> Feb 12 20:34:12: vif vif-21-0 vif21.0: fatal error; disabling device <--------------
> Feb 12 20:34:12: BUG: unable to handle kernel NULL pointer dereference at 00000000000008b8
>...
> Feb 12 20:34:12: Call Trace:
> Feb 12 20:34:12: [<ffffffff817605da>] _raw_spin_lock_irqsave+0x2a/0x40
> Feb 12 20:34:12: [<ffffffff8154446f>] xen_netbk_schedule_xenvif+0x8f/0x100
> Feb 12 20:34:12: [<ffffffff81544505>] xen_netbk_check_rx_xenvif+0x25/0x60
> Feb 12 20:34:12: [<ffffffff815445eb>] netbk_tx_err+0x5b/0x70
> Feb 12 20:34:12: [<ffffffff8154518c>] xen_netbk_tx_build_gops+0xb8c/0xbc0
> Feb 12 20:34:12: [<ffffffff81012880>] ? __switch_to+0x160/0x4f0
> Feb 12 20:34:12: [<ffffffff810891b8>] ? idle_balance+0xf8/0x150
> Feb 12 20:34:12: [<ffffffff81080150>] ? finish_task_switch+0x60/0xd0
> Feb 12 20:34:12: [<ffffffff8175f7b4>] ? __schedule+0x394/0x750
> Feb 12 20:34:12: [<ffffffff815452af>] xen_netbk_kthread+0xef/0x9d0
> Feb 12 20:34:12: [<ffffffff81080150>] ? finish_task_switch+0x60/0xd0
> Feb 12 20:34:12: [<ffffffff810720c0>] ? wake_up_bit+0x40/0x40
> Feb 12 20:34:12: [<ffffffff815451c0>] ? xen_netbk_tx_build_gops+0xbc0/0xbc0
> Feb 12 20:34:12: [<ffffffff81071a06>] kthread+0xc6/0xd0
> Feb 12 20:34:12: [<ffffffff810037b9>] ? xen_end_context_switch+0x19/0x20
> Feb 12 20:34:12: [<ffffffff81071940>] ? kthread_freezable_should_stop+0x70/0x70
> Feb 12 20:34:12: [<ffffffff8176847c>] ret_from_fork+0x7c/0xb0
> Feb 12 20:34:12: [<ffffffff81071940>] ? kthread_freezable_should_stop+0x70/0x70

I think the root cause is the same as for the problem reported on
the "classic" kernels - we should not blindly shut down everything
on a fatal error. Instead I think we ought to set a flag on the
xenvif and disassociate the two in a more controlled manner. On
the pv-ops tree, that would likely be just at the bottom of the
main loop in xen_netbk_kthread(), with the caveat that there
needs to be a way to identify the busted xenvif(s).
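
To illustrate the idea (this is only a user-space sketch, not netback
code): the error path merely marks the interface, and the actual
detach happens at the bottom of the worker's loop iteration, once
nothing is in the middle of using it any more. All names here
(toy_vif, toy_netbk, toy_fatal_tx_err, toy_reap_fatal, ...) are
hypothetical, locking is left out entirely, and the "identify the
busted xenvif(s)" problem is side-stepped by assuming the worker can
simply scan its own short list of interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VIFS 4

/* Hypothetical stand-in for a xenvif; not the kernel structure. */
struct toy_vif {
    int id;
    bool attached;      /* still associated with the backend worker */
    bool fatal_pending; /* set on a fatal error, acted upon later */
    int bad_request;    /* nonzero simulates a malformed tx request */
};

/* Hypothetical stand-in for one netback worker and its interfaces. */
struct toy_netbk {
    struct toy_vif *vifs[MAX_VIFS];
    size_t nr_vifs;
};

/* Fatal error path: only set the flag. Nothing is torn down here,
 * since the caller is still in the middle of tx processing. */
static void toy_fatal_tx_err(struct toy_vif *vif)
{
    vif->fatal_pending = true;
}

/* One pass of tx processing. Detached or already flagged interfaces
 * are skipped, so a busted one is not touched again. */
static void toy_tx_action(struct toy_netbk *netbk)
{
    for (size_t i = 0; i < netbk->nr_vifs; i++) {
        struct toy_vif *vif = netbk->vifs[i];

        if (!vif->attached || vif->fatal_pending)
            continue;
        if (vif->bad_request)
            toy_fatal_tx_err(vif);
    }
}

/* Bottom of the main loop: detach every flagged interface in one
 * controlled place. Returns how many were detached. */
static int toy_reap_fatal(struct toy_netbk *netbk)
{
    int reaped = 0;

    assert(netbk->nr_vifs <= MAX_VIFS);
    for (size_t i = 0; i < netbk->nr_vifs; i++) {
        struct toy_vif *vif = netbk->vifs[i];

        if (vif->attached && vif->fatal_pending) {
            vif->attached = false;
            vif->fatal_pending = false;
            reaped++;
        }
    }
    return reaped;
}

/* One iteration of the worker loop: process first, clean up last. */
static int toy_kthread_iteration(struct toy_netbk *netbk)
{
    toy_tx_action(netbk);
    return toy_reap_fatal(netbk);
}
```

With one good and one bad interface, a call to toy_tx_action() leaves
the bad one flagged but still attached; only the following
toy_reap_fatal() detaches it, and the good one is never touched.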

On the classic tree, this apparently could be done directly in
net_tx_action() (and hence can be done in netbk_fatal_tx_err()
in place of the call to xenvif_carrier_off()), but the scheduled
piece of code would then need to sync with both tasklets. Of
course there's nothing preventing the pv-ops solution from being
similar to this (allowing easier adding back of tasklet support,
which - as I already told you elsewhere - appears to address
throughput and/or CPU utilization problems people reported to
us with the kthreads approach).

Jan


_______________________________________________
Xen-devel mailing list
Xen-devel@xxxxxxxxxxxxx
http://lists.xen.org/xen-devel
