
Re: repeated Kernel oops need help to debug



We have had crashes on one machine.  I do not have the details at hand because I do not have broadband at the moment.

Some things stuck out to me:
The inability to sync would make me wonder if there is a filesystem issue.
I am also wondering whether you installed the netdata package, since some of the errors seem to be related to statistics.  Our crashes started after installing netdata, but they are only occurring on one machine.

Hypothesis:  netdata reads and writes the filesystem more heavily, which has exposed existing filesystem corruption.

What filesystem is it?   extX, LVM/extX, or something else?

Read out the fs metadata and fs check configuration with:
tune2fs -l /dev/sdaX
Replace X with the partition number (1, 2, 3 ... whatever you may have).
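
For example, to pull out just the fields that matter here (a sketch; the device name /dev/sda1 is an assumption, adjust it to your layout):

tune2fs -l /dev/sda1 | grep -Ei 'filesystem state|mount count|check'

The "Filesystem state" line should read "clean"; comparing "Mount count" against "Maximum mount count", and "Check interval" against "Next check after", tells you when the next forced fsck would run.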

Then triple-check these parameters yourself, as I am far from fast internet at the moment.   I have done the following plenty of times on normal Linux machines, but not so much with a Xen kernel.  Is the dom0 VM's filesystem the same as the underlying kernel's?   In any case, set the fs check interval and mount count with:

tune2fs -i 1d /dev/sdaX
tune2fs -c 1 /dev/sdaX

e2fsck -c -c -C0 -D /dev/sdaX
can be used to run a non-destructive read-write bad-block test (-c -c), show progress on stdout (-C0), and optimize the directory layout (-D).  But this would require booting from an install ISO or live media, since the filesystem must not be mounted while it is checked.
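
If there are several ext partitions, a small loop saves retyping (again just a sketch; the device list is an assumption, and on LVM the devices would be /dev/mapper/... volumes instead):

for part in /dev/sda1 /dev/sda2; do
    tune2fs -i 1d "$part"   # check at least once per day
    tune2fs -c 1 "$part"    # check on every mount
done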


On Sun, Jul 26, 2020 at 9:51 AM moftah moftah <mofta7y@xxxxxxxxx> wrote:
Hi All,
We have a problem that has been ongoing for more than a month.

We have several servers running xcp-ng, and we are facing kernel oopses that crash the server.

My skill is not enough to debug the issue, so I need someone to point me in the right direction.
The issue is not hardware related:
it occurred on servers with different processors, NICs, and even kernel versions (all under 4.19).

The stack trace looks like this:

[2399526.430672]  ALERT: BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[2399526.430695]   INFO: PGD 447268067 P4D 447268067 PUD 44775f067 PMD 0
[2399526.430710]   WARN: Oops: 0000 [#1] SMP NOPTI
[2399526.430720]   WARN: CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted 4.19.108 #1
[2399526.430728]   WARN: Hardware name: HP ProLiant SL230s Gen8   /, BIOS P75 05/24/2019
[2399526.430745]   WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
[2399526.430753]   WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b 57 50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88 d0 00 00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01 0a f0 ff
[2399526.430773]   WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
[2399526.430780]   WARN: RAX: ffff88842087b900 RBX: 0000000000000001 RCX: 0000000000000000
[2399526.430789]   WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001 RDI: ffff8883de0b9c00
[2399526.430801]   WARN: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[2399526.430811]   WARN: R10: 0000000000000000 R11: ffff8883de0b9d40 R12: 0000000000000001
[2399526.430823]   WARN: R13: ffff8883db210a00 R14: 0000000000000002 R15: ffff8883de0b9c00
[2399526.430852]   WARN: FS:  00007ffac43fe700(0000) GS:ffff888451240000(0000) knlGS:0000000000000000
[2399526.430868]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[2399526.430879]   WARN: CR2: 0000000000000004 CR3: 000000044ad58000 CR4: 0000000000040660
[2399526.430899]   WARN: Call Trace:
[2399526.430914]   WARN:  __qdisc_run+0xa2/0x4f0
[2399526.430928]   WARN:  ? __switch_to_asm+0x41/0x70
[2399526.430940]   WARN:  net_tx_action+0x148/0x230
[2399526.430949]   WARN:  __do_softirq+0xd1/0x28c
[2399526.430966]   WARN:  run_ksoftirqd+0x26/0x40
[2399526.430980]   WARN:  smpboot_thread_fn+0x10e/0x160
[2399526.430993]   WARN:  kthread+0xf8/0x130
[2399526.431004]   WARN:  ? sort_range+0x20/0x20
[2399526.431010]   WARN:  ? kthread_bind+0x10/0x10
[2399526.431017]   WARN:  ret_from_fork+0x35/0x40
[2399526.431027]   WARN: Modules linked in: act_police cls_basic sch_ingress sch_tbf tun rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 dm_multipath xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter sunrpc hid_generic sb_edac intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_rapl_perf psmouse lpc_ich usbhid hid sg hpilo ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables x_tables raid1 md_mod sd_mod serio_raw uhci_hcd ahci libahci igb libata ehci_pci ehci_hcd bnx2x mdio libcrc32c mpt3sas
[2399526.431154]   WARN:  raid_class scsi_transport_sas scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
[2399526.431177]   WARN: CR2: 0000000000000004
[2399526.431189]   WARN: ---[ end trace 32a268c3653eb10c ]---
[2399526.431201]   WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
[2399526.431212]   WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b 57 50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88 d0 00 00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01 0a f0 ff
[2399526.431238]   WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
[2399526.431247]   WARN: RAX: ffff88842087b900 RBX: 0000000000000001 RCX: 0000000000000000
[2399526.431260]   WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001 RDI: ffff8883de0b9c00
[2399526.431270]   WARN: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000020
[2399526.431280]   WARN: R10: 0000000000000000 R11: ffff8883de0b9d40 R12: 0000000000000001
[2399526.431289]   WARN: R13: ffff8883db210a00 R14: 0000000000000002 R15: ffff8883de0b9c00
[2399526.431307]   WARN: FS:  00007ffac43fe700(0000) GS:ffff888451240000(0000) knlGS:0000000000000000
[2399526.431319]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[2399526.431331]   WARN: CR2: 0000000000000004 CR3: 000000044ad58000 CR4: 0000000000040660
[2399526.431355]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt


The xen crash analyzer generates many other files as well:
dmesg.kexec.log
dom0.log ( for each dom )
dom0.structures.log ( for each dom )
....
lspci-tv.out
lspci-vv.out
lspci-vvxxxx.out
readelf-Wl.out
readelf-Wn.out
time-v.out
xen.log
xen.pcpu0.stack.log ( for each pcpu) 
...
xen-crashdump-analyser.log

The call trace can be seen in the xen.log file as:
 Call Trace:
[ffffffff810014aa] xen_hypercall_kexec_op+0xa/0x20
 ffffffff81071f85  panic+0x111/0x27c
 ffffffff81027a7f  oops_end+0xcf/0xd0
 ffffffff8105da63  no_context+0x1b3/0x3c0
 ffffffff816c0223  inet_gro_receive+0x213/0x2b0
 ffffffff8105e32a  __do_page_fault+0xaa/0x4f0
 ffffffff8162cd44  netif_receive_skb_internal+0x34/0xe0
 ffffffff81800f6e  page_fault+0x1e/0x30
 ffffffff81663ac9  pfifo_fast_dequeue+0xc9/0x140
 ffffffff81663f38  __qdisc_run+0xa8/0x4e0
 ffffffff816290c8  net_tx_action+0x148/0x220
 ffffffff81a000d1  __softirqentry_text_start+0xd1/0x28c
 ffffffff81077ff6  run_ksoftirqd+0x26/0x40
 ffffffff8109763e  smpboot_thread_fn+0x10e/0x160
 ffffffff81093b68  kthread+0xf8/0x130
 ffffffff81097530  smpboot_thread_fn+0/0x160
 ffffffff81093a70  kthread+0/0x130
 ffffffff81800215  ret_from_fork+0x35/0x40

I used a tool to trace the source location where the issue occurred:
./decode_stacktrace.sh /usr/lib/debug/lib/modules/4.19.108/vmlinux /usr/lib/debug/lib/modules/4.19.108/ < ./trace2 > out3

and this is the output:

[ffffffff810014aa] xen_hypercall_kexec_op (arch/x86/kernel/.tmp_head_64.o:?)
ffffffff81071f85 panic (/usr/src/debug/kernel-4.19.19/kernel/panic.c:209)
ffffffff81027a7f oops_end (/usr/src/debug/kernel-4.19.19/arch/x86/kernel/dumpstack.c:352)
ffffffff8105da63 no_context (/usr/src/debug/kernel-4.19.19/arch/x86/mm/fault.c:808)
ffffffff816c0223 inet_gro_receive (/usr/src/debug/kernel-4.19.19/include/linux/skbuff.h:2350 /usr/src/debug/kernel-4.19.19/net/ipv4/af_inet.c:1495)
ffffffff8105e32a __do_page_fault (/usr/src/debug/kernel-4.19.19/arch/x86/mm/fault.c:1435)
ffffffff8162cd44 netif_receive_skb_internal (/usr/src/debug/kernel-4.19.19/net/core/dev.c:5152)
ffffffff81800f6e page_fault (/usr/src/debug////////kernel-4.19.19/arch/x86/entry/entry_64.S:1204)
ffffffff81663ac9 pfifo_fast_dequeue (/usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:723 /usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:740 /usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:747 /usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:677)
ffffffff81663f38 __qdisc_run (/usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:283 /usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:385 /usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:403)
ffffffff816290c8 net_tx_action (/usr/src/debug/kernel-4.19.19/include/linux/seqlock.h:235 /usr/src/debug/kernel-4.19.19/include/linux/seqlock.h:388 /usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:145 /usr/src/debug/kernel-4.19.19/include/net/pkt_sched.h:121 /usr/src/debug/kernel-4.19.19/net/core/dev.c:4595)
ffffffff81a000d1 __softirqentry_text_start (/usr/src/debug/kernel-4.19.19/kernel/softirq.c:292 /usr/src/debug/kernel-4.19.19/include/linux/jump_label.h:138 /usr/src/debug/kernel-4.19.19/include/trace/events/irq.h:142 /usr/src/debug/kernel-4.19.19/kernel/softirq.c:293)
ffffffff81077ff6 run_ksoftirqd (/usr/src/debug/kernel-4.19.19/arch/x86/include/asm/paravirt.h:799 /usr/src/debug/kernel-4.19.19/kernel/softirq.c:654)
ffffffff8109763e smpboot_thread_fn (/usr/src/debug/kernel-4.19.19/kernel/smpboot.c:164)
ffffffff81093b68 kthread (/usr/src/debug/kernel-4.19.19/kernel/kthread.c:246)
ffffffff81097530  smpboot_thread_fn+0/0x160
ffffffff81093a70  kthread+0/0x130
ffffffff81800215 ret_from_fork (/usr/src/debug////////kernel-4.19.19/arch/x86/entry/entry_64.S:421)

Based on that, the fault occurs in pfifo_fast_dequeue() (the faulting RIP), called from __qdisc_run().

That's as far as I can get; I am not sure how to debug further to find the root cause and fix it.
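
A possible next step (just a sketch, assuming the same vmlinux debuginfo used with decode_stacktrace.sh above) might be to look at the exact faulting instruction in gdb:

gdb /usr/lib/debug/lib/modules/4.19.108/vmlinux
(gdb) list *(pfifo_fast_dequeue+0xc9)
(gdb) disassemble pfifo_fast_dequeue

list *(symbol+offset) shows the source line behind the faulting RIP, and the disassembly can be compared against the "Code:" bytes in the oops, but I am not sure how to interpret the result.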



 

