
Re: [Xen-devel] [BUG] kernel bug encountered at drivers/net/xen-netback/netback.c:430!



> -----Original Message-----
> From: Alex Braunegg [mailto:alex.braunegg@xxxxxxxxx]
> Sent: 28 December 2017 19:32
> To: 'Michael Collins' <mike@xxxxxxxxxxx>; 'Juergen Gross'
> <jgross@xxxxxxxx>; xen-devel@xxxxxxxxxxxxxxxxxxxx
> Cc: Paul Durrant <Paul.Durrant@xxxxxxxxxx>; Wei Liu <wei.liu2@xxxxxxxxxx>
> Subject: RE: [Xen-devel] [BUG] kernel bug encountered at
> drivers/net/xen-netback/netback.c:430!
> 
> Hi Mike,
> 
> Thanks for the confirmation on that. After the last crash I reported, I was
> seeing them daily until I downgraded to kernel 4.4 and Xen 4.6, where
> stability resumed. Zero crashes since 24th December.
> 
> @Paul, Wei,
> 
> Can we get this investigated? It appears that this is a stability blocker for
> Xen releases on newer kernels.

The only mildly suspicious thing I can see in netback is:

commit cc8737a5fe9051b7fa052b08c57ddb9f539c389a
Author: Willem de Bruijn <willemb@xxxxxxxxxx>
Date:   Fri Aug 25 13:10:43 2017 -0400

    xen-netback: update ubuf_info initialization to anonymous union

    The xen driver initializes struct ubuf_info fields using designated
    initializers. I recently moved these fields inside a nested anonymous
    struct inside an anonymous union. I had missed this use case.

    This breaks compilation of xen-netback with older compilers.
    From kbuild bot with gcc-4.4.7:

       drivers/net//xen-netback/interface.c: In function 'xenvif_init_queue':
       >> drivers/net//xen-netback/interface.c:554: error: unknown field 'ctx' specified in initializer
       >> drivers/net//xen-netback/interface.c:554: warning: missing braces around initializer
          drivers/net//xen-netback/interface.c:554: warning: (near initialization for '(anonymous).<anonymous>')
       >> drivers/net//xen-netback/interface.c:554: warning: initialization makes integer from pointer without a cast
       >> drivers/net//xen-netback/interface.c:555: error: unknown field 'desc' specified in initializer

    Add double braces around the designated initializers to match their
    nested position in the struct. After this, compilation succeeds again.

    Fixes: 4ab6c99d99bb ("sock: MSG_ZEROCOPY notification coalescing")
    Reported-by: kbuild bot <lpk@xxxxxxxxx>
    Signed-off-by: Willem de Bruijn <willemb@xxxxxxxxxx>
    Acked-by: Wei Liu <wei.liu2@xxxxxxxxxx>
    Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx>

...and it's only mildly suspicious because netback stores the pending_idx
value in the desc field of the ubuf_info structure, and that value is what
xenvif_grant_handle_reset() (the function calling BUG()) ends up using. I
can't spot anything wrong with the patch as such, so it could be that the
cause is external to netback.
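
For reference, here is a user-space sketch of the pattern in question. It
uses a cut-down stand-in for struct ubuf_info: the ubuf_model type and the
model_callback(), make_ubuf() and ubuf_to_pending_idx() helpers are made up
for illustration and are not kernel code. It shows the double-brace form of
the designated initializer that the commit describes, and a pending_idx
making the round trip through the desc field:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct ubuf_info; NOT the kernel definition. */
struct ubuf_model {
    void (*callback)(struct ubuf_model *ubuf, int zerocopy_success);
    union {
        struct {
            unsigned long desc;   /* netback keeps pending_idx here */
            void *ctx;
        };
        struct {
            uint32_t id;
            uint16_t len;
        };
    };
};

/* Hypothetical no-op callback, only here so the initializer has a target. */
static void model_callback(struct ubuf_model *ubuf, int zerocopy_success)
{
    (void)ubuf;
    (void)zerocopy_success;
}

/*
 * Build the per-slot structure. The outer braces open the anonymous union,
 * the inner braces open the anonymous struct nested inside it, which is the
 * form older compilers need.
 */
static struct ubuf_model make_ubuf(uint16_t pending_idx)
{
    struct ubuf_model ubuf = {
        .callback = model_callback,
        { { .ctx = NULL,
            .desc = pending_idx } },
    };
    return ubuf;
}

/* Recover the index that was stashed in desc. */
static uint16_t ubuf_to_pending_idx(const struct ubuf_model *ubuf)
{
    assert(ubuf->desc <= UINT16_MAX);
    return (uint16_t)ubuf->desc;
}
```

Calling make_ubuf(0x3a) and then ubuf_to_pending_idx() on the result gives
back 0x3a in this model. If an initializer ever put the index somewhere other
than desc, the value read back later would not match the slot that was
actually mapped.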

How easy is it to trigger this? I'm assuming, from the original description, 
that I can probably trigger it by forcibly terminating a running domain and 
then trying to restart it.
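
When reproducing, the condition to look for is an unmap of a pending_idx
whose handle slot is already marked invalid. A minimal user-space model of
that check follows. It is a sketch, not netback code: queue_model, the
MODEL_* constants and the model_* helpers are hypothetical names, and the
function returns false at the point where the kernel would call BUG():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_PENDING    256      /* assumed slot count for the model */
#define MODEL_INVALID_HANDLE 0xFFFFu  /* assumed "no grant mapped" marker */

/* Hypothetical stand-in for the per-queue handle table. */
struct queue_model {
    uint16_t grant_tx_handle[MODEL_MAX_PENDING];
};

/* Start with every slot unmapped. */
static void queue_model_init(struct queue_model *q)
{
    for (int i = 0; i < MODEL_MAX_PENDING; i++)
        q->grant_tx_handle[i] = MODEL_INVALID_HANDLE;
}

/* Record the handle obtained when the slot's grant is mapped. */
static void model_handle_set(struct queue_model *q, uint16_t pending_idx,
                             uint16_t handle)
{
    assert(pending_idx < MODEL_MAX_PENDING);
    q->grant_tx_handle[pending_idx] = handle;
}

/*
 * Mirror of the check in xenvif_grant_handle_reset(): resetting a slot that
 * is already invalid is the error case. Returns false where the kernel
 * would log "Trying to unmap invalid handle!" and call BUG().
 */
static bool model_handle_reset(struct queue_model *q, uint16_t pending_idx)
{
    assert(pending_idx < MODEL_MAX_PENDING);
    if (q->grant_tx_handle[pending_idx] == MODEL_INVALID_HANDLE)
        return false;
    q->grant_tx_handle[pending_idx] = MODEL_INVALID_HANDLE;
    return true;
}
```

In this model the check fails in two ways: the same pending_idx is reset
twice, or an index is reset that was never mapped (for example because the
value read back from desc is not the one that was stored). Which of those,
if either, is happening in the reported crashes is not something I can tell
from the traces.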

  Paul

> 
> Best regards,
> 
> Alex
> 
> -----Original Message-----
> From: Michael Collins [mailto:mike@xxxxxxxxxxx]
> Sent: Friday, 29 December 2017 5:05 AM
> To: Alex Braunegg; 'Juergen Gross'; xen-devel@xxxxxxxxxxxxxxxxxxxx
> Cc: 'Paul Durrant'; 'Wei Liu'
> Subject: Re: [Xen-devel] [BUG] kernel bug encountered at
> drivers/net/xen-netback/netback.c:430!
> 
> Alex,
> 
> I saw this same issue when running a 4.13+ kernel; I switched back to 4.11
> and the problem has not resurfaced.  I would like to understand the root
> cause of this issue.
> 
> Mike
> 
> 
> On 12/22/2017 3:35 PM, Alex Braunegg wrote:
> > Hi all,
> >
> > Another crash this morning:
> >
> > vif vif-2-0 vif2.0: Trying to unmap invalid handle! pending_idx: 0x3a
> > ------------[ cut here ]------------
> > kernel BUG at drivers/net/xen-netback/netback.c:430!
> > invalid opcode: 0000 [#1] SMP
> > Modules linked in: xt_physdev(E) iptable_filter(E) ip_tables(E) xen_netback(E) nfsd(E) lockd(E) grace(E) nfs_acl(E) auth_rpcgss(E) sunrpc(E) ipmi_si(E) ipmi_msghandler(E) k10temp(E) zfs(POE) zcommon(POE) znvpair(POE) icp(POE) spl(OE) zavl(POE) zunicode(POE) tpm_infineon(E) sp5100_tco(E) i2c_piix4(E) i2c_core(E) ohci_pci(E) ohci_hcd(E) tg3(E) ptp(E) pps_core(E) sg(E) raid1(E) sd_mod(E) ata_generic(E) pata_acpi(E) pata_atiixp(E) ahci(E) libahci(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) dax(E)
> > CPU: 0 PID: 14238 Comm: vif2.0-q0-deall Tainted: P           OE   4.14.6-1.el6.x86_64 #1
> > Hardware name: HP ProLiant MicroServer, BIOS O41     10/01/2013
> > task: ffff880059e255c0 task.stack: ffffc90001f64000
> > RIP: e030:xenvif_tx_dealloc_action+0x1bb/0x230 [xen_netback]
> > RSP: e02b:ffffc90001f67c68 EFLAGS: 00010292
> > RAX: 0000000000000045 RBX: ffffc90001f55000 RCX: 0000000000000000
> > RDX: ffff88007f4146e8 RSI: ffff88007f40db38 RDI: ffff88007f40db38
> > RBP: ffffc90001f67e98 R08: 0000000000000372 R09: 0000000000000373
> > R10: 0000000000000001 R11: 0000000000000000 R12: ffffc90001f5e730
> > R13: 0000160000000000 R14: aaaaaaaaaaaaaaab R15: ffffc9000099bbe8
> > FS:  00007f92865d29a0(0000) GS:ffff88007f400000(0000) knlGS:0000000000000000
> > CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: ffffffffff600400 CR3: 000000006209c000 CR4: 0000000000000660
> > Call Trace:
> >   ? _raw_spin_unlock_irqrestore+0x11/0x20
> >   ? error_exit+0x5/0x20
> >   ? __update_load_avg_cfs_rq+0x176/0x180
> >   ? xen_mc_flush+0x87/0x120
> >   ? xen_load_sp0+0x84/0xa0
> >   ? __switch_to+0x1c1/0x360
> >   ? finish_task_switch+0x78/0x240
> >   ? __schedule+0x192/0x496
> >   ? _raw_spin_lock_irqsave+0x1a/0x3c
> >   ? _raw_spin_lock_irqsave+0x1a/0x3c
> >   ? _raw_spin_unlock_irqrestore+0x11/0x20
> >   xenvif_dealloc_kthread+0x68/0xf0 [xen_netback]
> >   ? do_wait_intr+0x80/0x80
> >   ? xenvif_map_frontend_data_rings+0xe0/0xe0 [xen_netback]
> >   kthread+0x106/0x140
> >   ? kthread_destroy_worker+0x60/0x60
> >   ret_from_fork+0x25/0x30
> > Code: 89 df 49 83 c4 02 e8 e5 f5 ff ff 4d 39 ec 75 e8 eb a2 48 8b 43 20 48 c7 c6 10 2b 55 a0 48 8b b8 20 03 00 00 31 c0 e8 85 c9 06 e1 <0f> 0b 0f 0b 48 8b 53 20 89 c1 48 c7 c6 48 2b 55 a0 31 c0 45 31
> > RIP: xenvif_tx_dealloc_action+0x1bb/0x230 [xen_netback] RSP: ffffc90001f67c68
> > ---[ end trace 130de0b7e39d0eea ]---
> >
> > Best regards,
> >
> > Alex
> >
> >
> >
> > -----Original Message-----
> > From: Juergen Gross [mailto:jgross@xxxxxxxx]
> > Sent: Friday, 22 December 2017 5:47 PM
> > To: Alex Braunegg; xen-devel@xxxxxxxxxxxxxxxxxxxx
> > Cc: Wei Liu; Paul Durrant
> > Subject: Re: [Xen-devel] [BUG] kernel bug encountered at
> > drivers/net/xen-netback/netback.c:430!
> >
> > On 22/12/17 07:40, Alex Braunegg wrote:
> >> Hi all,
> >>
> >> Experienced the same issue again today:
> > Ccing the maintainers.
> >
> >
> > Juergen
> >
> >>
> >> =====================================================================================
> >>
> >> vif vif-2-0 vif2.0: Trying to unmap invalid handle! pending_idx: 0x2f
> >> ------------[ cut here ]------------
> >> kernel BUG at drivers/net/xen-netback/netback.c:430!
> >> invalid opcode: 0000 [#1] SMP
> >> Modules linked in: xt_physdev(E) iptable_filter(E) ip_tables(E) xen_netback(E) nfsd(E) lockd(E) grace(E) nfs_acl(E) auth_rpcgss(E) sunrpc(E) ipmi_si(E) ipmi_msghandler(E) k10temp(E) zfs(POE) zcommon(POE) znvpair(POE) icp(POE) spl(OE) zavl(POE) zunicode(POE) tpm_infineon(E) sp5100_tco(E) i2c_piix4(E) i2c_core(E) ohci_pci(E) ohci_hcd(E) tg3(E) ptp(E) pps_core(E) sg(E) raid1(E) sd_mod(E) ata_generic(E) pata_acpi(E) pata_atiixp(E) ahci(E) libahci(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) dax(E)
> >> CPU: 0 PID: 12636 Comm: vif2.0-q0-deall Tainted: P           OE   4.14.6-1.el6.x86_64 #1
> >> Hardware name: HP ProLiant MicroServer, BIOS O41     10/01/2013
> >> task: ffff880062518000 task.stack: ffffc90004f88000
> >> RIP: e030:xenvif_tx_dealloc_action+0x1bb/0x230 [xen_netback]
> >> RSP: e02b:ffffc90004f8bc68 EFLAGS: 00010292
> >> RAX: 0000000000000045 RBX: ffffc90000fcd000 RCX: 0000000000000000
> >> RDX: ffff88007f4146e8 RSI: ffff88007f40db38 RDI: ffff88007f40db38
> >> RBP: ffffc90004f8be98 R08: 000000000000037d R09: 000000000000037e
> >> R10: 0000000000000001 R11: 0000000000000000 R12: ffffc90000fd6730
> >> R13: 0000160000000000 R14: aaaaaaaaaaaaaaab R15: ffffc9000099bbe8
> >> FS:  00007f40c63639a0(0000) GS:ffff88007f400000(0000) knlGS:0000000000000000
> >> CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: ffffffffff600400 CR3: 000000006375f000 CR4: 0000000000000660
> >> Call Trace:
> >>   ? error_exit+0x5/0x20
> >>   ? __update_load_avg_cfs_rq+0x176/0x180
> >>   ? xen_mc_flush+0x87/0x120
> >>   ? xen_load_sp0+0x84/0xa0
> >>   ? __switch_to+0x1c1/0x360
> >>   ? finish_task_switch+0x78/0x240
> >>   ? __schedule+0x192/0x496
> >>   ? _raw_spin_lock_irqsave+0x1a/0x3c
> >>   ? _raw_spin_lock_irqsave+0x1a/0x3c
> >>   ? _raw_spin_unlock_irqrestore+0x11/0x20
> >>   xenvif_dealloc_kthread+0x68/0xf0 [xen_netback]
> >>   ? do_wait_intr+0x80/0x80
> >>   ? xenvif_map_frontend_data_rings+0xe0/0xe0 [xen_netback]
> >>   kthread+0x106/0x140
> >>   ? kthread_destroy_worker+0x60/0x60
> >>   ? kthread_destroy_worker+0x60/0x60
> >>   ret_from_fork+0x25/0x30
> >> Code: 89 df 49 83 c4 02 e8 e5 f5 ff ff 4d 39 ec 75 e8 eb a2 48 8b 43 20 48 c7 c6 10 5b 55 a0 48 8b b8 20 03 00 00 31 c0 e8 85 99 06 e1 <0f> 0b 0f 0b 48 8b 53 20 89 c1 48 c7 c6 48 5b 55 a0 31 c0 45 31
> >> RIP: xenvif_tx_dealloc_action+0x1bb/0x230 [xen_netback] RSP: ffffc90004f8bc68
> >> ---[ end trace 010682c76619a1bd ]---
> >>
> >>
> >> =====================================================================================
> >>
> >> Best regards,
> >>
> >> Alex
> >>
> >> -----Original Message-----
> >> From: Alex Braunegg [mailto:alex.braunegg@xxxxxxxxx]
> >> Sent: Thursday, 21 December 2017 8:04 AM
> >> To: 'xen-devel@xxxxxxxxxxxxxxxxxxxx'
> >> Subject: [BUG] kernel bug encountered at
> >> drivers/net/xen-netback/netback.c:430!
> >>
> >> Hi all,
> >>
> >> I experienced the following bug whilst using a Xen VM. What happened was
> >> that this morning a single Xen VM suddenly terminated without cause with
> >> the following being logged in dmesg.
> >>
> >> Only 1 VM experienced an issue (out of 2 which were running), the other
> >> remained up and fully functional until I attempted to restart the crashed
> >> VM which triggered the kernel bug.
> >>
> >> Kernel:    4.14.6
> >> Xen:               4.8.2
> >>
> >>
> >> =====================================================================================
> >>
> >> vif vif-2-0 vif2.0: Trying to unmap invalid handle! pending_idx: 0x3f
> >> ------------[ cut here ]------------
> >> kernel BUG at drivers/net/xen-netback/netback.c:430!
> >> invalid opcode: 0000 [#1] SMP
> >> Modules linked in: xt_physdev(E) iptable_filter(E) ip_tables(E) xen_netback(E) nfsd(E) lockd(E) grace(E) nfs_acl(E) auth_rpcgss(E) sunrpc(E) ipmi_si(E) ipmi_msghandler(E) zfs(POE) zcommon(POE) znvpair(POE) icp(POE) spl(OE) zavl(POE) zunicode(POE) k10temp(E) tpm_infineon(E) sp5100_tco(E) i2c_piix4(E) i2c_core(E) ohci_pci(E) ohci_hcd(E) tg3(E) ptp(E) pps_core(E) sg(E) raid1(E) sd_mod(E) ata_generic(E) pata_acpi(E) pata_atiixp(E) ahci(E) libahci(E) dm_mirror(E) dm_region_hash(E) dm_log(E) dm_mod(E) dax(E)
> >> CPU: 0 PID: 13163 Comm: vif2.0-q0-deall Tainted: P           OE   4.14.6-1.el6.x86_64 #1
> >> Hardware name: HP ProLiant MicroServer, BIOS O41     10/01/2013
> >> task: ffff8800595cc980 task.stack: ffffc900028e0000
> >> RIP: e030:xenvif_tx_dealloc_action+0x1bb/0x230 [xen_netback]
> >> RSP: e02b:ffffc900028e3c68 EFLAGS: 00010292
> >> RAX: 0000000000000045 RBX: ffffc90002969000 RCX: 0000000000000000
> >> RDX: ffff88007f4146e8 RSI: ffff88007f40db38 RDI: ffff88007f40db38
> >> RBP: ffffc900028e3e98 R08: 000000000000037b R09: 000000000000037c
> >> R10: 0000000000000001 R11: 0000000000000000 R12: ffffc90002972730
> >> R13: 0000160000000000 R14: aaaaaaaaaaaaaaab R15: ffffc9000099bbe8
> >> FS:  00007fee260ff9a0(0000) GS:ffff88007f400000(0000) knlGS:0000000000000000
> >> CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: ffffffffff600400 CR3: 0000000062815000 CR4: 0000000000000660
> >> Call Trace:
> >>   ? error_exit+0x5/0x20
> >>   ? __update_load_avg_cfs_rq+0x176/0x180
> >>   ? xen_mc_flush+0x87/0x120
> >>   ? xen_load_sp0+0x84/0xa0
> >>   ? __switch_to+0x1c1/0x360
> >>   ? finish_task_switch+0x78/0x240
> >>   ? __schedule+0x192/0x496
> >>   ? _raw_spin_lock_irqsave+0x1a/0x3c
> >>   ? _raw_spin_lock_irqsave+0x1a/0x3c
> >>   ? _raw_spin_unlock_irqrestore+0x11/0x20
> >>   xenvif_dealloc_kthread+0x68/0xf0 [xen_netback]
> >>   ? do_wait_intr+0x80/0x80
> >>   ? xenvif_map_frontend_data_rings+0xe0/0xe0 [xen_netback]
> >>   kthread+0x106/0x140
> >>   ? kthread_destroy_worker+0x60/0x60
> >>   ? kthread_destroy_worker+0x60/0x60
> >>   ret_from_fork+0x25/0x30
> >> Code: 89 df 49 83 c4 02 e8 e5 f5 ff ff 4d 39 ec 75 e8 eb a2 48 8b 43 20 48 c7 c6 10 3b 55 a0 48 8b b8 20 03 00 00 31 c0 e8 85 b9 06 e1 <0f> 0b 0f 0b 48 8b 53 20 89 c1 48 c7 c6 48 3b 55 a0 31 c0 45 31
> >> RIP: xenvif_tx_dealloc_action+0x1bb/0x230 [xen_netback] RSP: ffffc900028e3c68
> >> ---[ end trace 7d827dae67002ffc ]---
> >>
> >>
> >> =====================================================================================
> >>
> >> The section of relevant kernel code is:
> >>
> >>
> >> =====================================================================================
> >>
> >> static inline void xenvif_grant_handle_reset(struct xenvif_queue *queue,
> >>                                               u16 pending_idx)
> >> {
> >>          if (unlikely(queue->grant_tx_handle[pending_idx] ==
> >>                       NETBACK_INVALID_HANDLE)) {
> >>                  netdev_err(queue->vif->dev,
> >>                             "Trying to unmap invalid handle! pending_idx: 0x%x\n",
> >>                             pending_idx);
> >>                  BUG();
> >>          }
> >>          queue->grant_tx_handle[pending_idx] = NETBACK_INVALID_HANDLE;
> >> }
> >>
> >>
> >> =====================================================================================
> >>
> >> In an attempt to recover from this situation I restarted / destroyed (xl
> >> restart <vmname> / xl destroy <vmname>) the VM to recover its state, and
> >> the following error messages were logged at the console:
> >>
> >>
> >> =====================================================================================
> >>
> >> libxl: error: libxl_exec.c:129:libxl_report_child_exitstatus: /etc/xen/scripts/block remove [25271] died due to fatal signal Segmentation fault
> >> libxl: error: libxl_device.c:1080:device_backend_callback: unable to remove device with path /local/domain/0/backend/vif/2/0
> >> libxl: error: libxl.c:1647:devices_destroy_cb: libxl__devices_destroy failed for 2
> >>
> >>
> >> =====================================================================================
> >>
> >> After this the physical system hung and then restarted with nothing else
> >> logged; everything came back OK & operational, including the VM that
> >> crashed.
> >>
> >> Further details (xl dmesg, xl info) attached.
> >>
> >> Best regards,
> >>
> >> Alex Braunegg
> >>
> >>
> >> _______________________________________________
> >> Xen-devel mailing list
> >> Xen-devel@xxxxxxxxxxxxxxxxxxxx
> >> https://lists.xenproject.org/mailman/listinfo/xen-devel
> >>
> >

