
Re: repeated Kernel oops need help to debug



This is a bit weird.
I decided to debug the issue further before patching the kernel,
so on one server I changed the qdisc on all network interfaces (not the guest interfaces) from pfifo_fast to fifo.
To my shock, I got the same panic in pfifo_fast_dequeue!
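
One thing worth ruling out after such a change is whether any interface still has pfifo_fast attached, for example a guest vif or the per-queue children under an mq root. The sketch below parses text in the format printed by `tc qdisc show` and lists the devices that still use a given qdisc kind. The sample output and the interface names in it are hypothetical, not taken from the affected servers.

```python
from collections import defaultdict


def qdiscs_by_device(tc_output):
    """Map each device name to the set of qdisc kinds attached to it."""
    kinds = defaultdict(set)
    for line in tc_output.splitlines():
        tokens = line.split()
        # Expected shape: "qdisc <kind> <handle> dev <name> ..."
        if len(tokens) < 3 or tokens[0] != "qdisc" or "dev" not in tokens:
            continue
        dev_index = tokens.index("dev")
        if dev_index + 1 < len(tokens):
            kinds[tokens[dev_index + 1]].add(tokens[1])
    return dict(kinds)


def devices_using(tc_output, kind="pfifo_fast"):
    """Return the sorted device names that have a qdisc of this kind."""
    by_dev = qdiscs_by_device(tc_output)
    return sorted(dev for dev, found in by_dev.items() if kind in found)


# Hypothetical sample in the style of `tc qdisc show`; not real server output.
SAMPLE = """\
qdisc pfifo 8001: dev eth0 root refcnt 2 limit 1000p
qdisc mq 0: dev eth1 root
qdisc pfifo_fast 0: dev eth1 parent :1 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
qdisc pfifo_fast 0: dev vif1.0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
"""

if __name__ == "__main__":
    print(devices_using(SAMPLE))
```

If any device shows up in that list, a crash in pfifo_fast_dequeue is still possible even though the physical interfaces were switched to fifo.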

Another thing I noticed: whenever the issue occurs on any server, I check the stack of the CPU that panicked,
and there are always two stacks, one for dom0 and one for the hypervisor.

On the panicking CPU, the hypervisor stack always contains this:
  ffff832027be7d20: ffff82d08021831d kexec_crash+0x4d/0x50
  ffff832027be7d28: 00000000fffffffe
  ffff832027be7d30: ffff82d080218c9d do_kexec_op_internal+0x44d/0x710
  ffff832027be7d38: 0000000000040660
  ffff832027be7d40: 0000000000000000
  ffff832027be7d48: 0000000000000000
  ffff832027be7d50: 000000000000000c
  ffff832027be7d58: 000000000000000c
  ffff832027be7d60: ffff83202780f000
  ffff832027be7d68: ffff832027be7da8 .+64
  ffff832027be7d70: 000000000000000c
  ffff832027be7d78: ffff82d080266750 vga_noop_puts+0/0x10
  ffff832027be7d80: ffff82d080249f3a do_console_io+0x41a/0x460
  ffff832027be7d88: ffffc90000000001
  ffff832027be7d90: ffff83202780fa24
  ffff832027be7d98: 000000000000e033
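
To read a dump like this more easily, the stack words that resolve to a symbol can be pulled out mechanically. The sketch below is a minimal parser, assuming the `address: value symbol+offset/size` layout shown above. A resolved word is not necessarily a live return address (it can be stale data left on the stack), so the result is only a hint. Seeing kexec_crash and do_kexec_op_internal here would be consistent with dom0 having panicked first and then asking Xen to kexec into the crash kernel, but that reading is an assumption and is not confirmed in this thread.

```python
import re

# One stack word per line: "<stack address>: <value> <symbol>+<offset>/<size>"
SYMBOL = re.compile(
    r"^\s*[0-9a-f]{16}:\s+[0-9a-f]{16}\s+([A-Za-z_]\w*)\+(\w+)/(\w+)\s*$"
)


def symbols_in_stack(dump):
    """Return (symbol, offset, size) for every stack word that resolved."""
    found = []
    for line in dump.splitlines():
        match = SYMBOL.match(line)
        if match:
            name, offset, size = match.groups()
            # int(..., 16) accepts both "0x4d" and a bare "0".
            found.append((name, int(offset, 16), int(size, 16)))
    return found


# A subset of the lines quoted above.
DUMP = """\
  ffff832027be7d20: ffff82d08021831d kexec_crash+0x4d/0x50
  ffff832027be7d28: 00000000fffffffe
  ffff832027be7d30: ffff82d080218c9d do_kexec_op_internal+0x44d/0x710
  ffff832027be7d68: ffff832027be7da8 .+64
  ffff832027be7d78: ffff82d080266750 vga_noop_puts+0/0x10
  ffff832027be7d80: ffff82d080249f3a do_console_io+0x41a/0x460
"""

if __name__ == "__main__":
    for name, offset, size in symbols_in_stack(DUMP):
        print(f"{name} +{offset:#x} of {size:#x}")
```

Lines without a symbol, and the `.+64` entry that points back into the stack itself, are skipped on purpose.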

Could the issue be on the hypervisor side, with the dom0 kernel panic message just being misleading?


On Wed, Jul 29, 2020 at 11:44 AM moftah moftah <mofta7y@xxxxxxxxx> wrote:
Hi Juergen,

This seems closely related to my issue,
but I wonder why the fix was not backported to all versions: why was it only added to the 4.18 and 5.4 branches while the 4.19 branch was ignored?

I could try to apply the patch manually to the 4.19 branch (if it works) and test it locally.

On Mon, Jul 27, 2020 at 3:27 AM Jürgen Groß <jgross@xxxxxxxx> wrote:
On 26.07.20 17:47, moftah moftah wrote:
> Hi All,
> We have had an ongoing problem for more than a month.
>
> We have several servers running xcp-ng, and we are facing kernel oopses
> that crash the servers.
>
> My skills are not enough to debug the issue, so I need someone to point me
> in the right direction.
> The issue is not hardware related:
> it has occurred on servers with different processors, NICs, and even
> kernel versions (all under 4.19).
>
> The stack trace looks like this:
>
> [2399526.430672]  ALERT: BUG: unable to handle kernel NULL pointer
> dereference at 0000000000000004
> [2399526.430695]   INFO: PGD 447268067 P4D 447268067 PUD 44775f067 PMD 0
> [2399526.430710]   WARN: Oops: 0000 [#1] SMP NOPTI
> [2399526.430720]   WARN: CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted
> 4.19.108 #1
> [2399526.430728]   WARN: Hardware name: HP ProLiant SL230s Gen8   /,
> BIOS P75 05/24/2019
> [2399526.430745]   WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
> [2399526.430753]   WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b
> 57 50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88
> d0 00 00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01
> 0a f0 ff
> [2399526.430773]   WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
> [2399526.430780]   WARN: RAX: ffff88842087b900 RBX: 0000000000000001
> RCX: 0000000000000000
> [2399526.430789]   WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001
> RDI: ffff8883de0b9c00
> [2399526.430801]   WARN: RBP: 0000000000000000 R08: 0000000000000000
> R09: 0000000000000020
> [2399526.430811]   WARN: R10: 0000000000000000 R11: ffff8883de0b9d40
> R12: 0000000000000001
> [2399526.430823]   WARN: R13: ffff8883db210a00 R14: 0000000000000002
> R15: ffff8883de0b9c00
> [2399526.430852]   WARN: FS:  00007ffac43fe700(0000)
> GS:ffff888451240000(0000) knlGS:0000000000000000
> [2399526.430868]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2399526.430879]   WARN: CR2: 0000000000000004 CR3: 000000044ad58000
> CR4: 0000000000040660
> [2399526.430899]   WARN: Call Trace:
> [2399526.430914]   WARN:  __qdisc_run+0xa2/0x4f0
> [2399526.430928]   WARN:  ? __switch_to_asm+0x41/0x70
> [2399526.430940]   WARN:  net_tx_action+0x148/0x230
> [2399526.430949]   WARN:  __do_softirq+0xd1/0x28c
> [2399526.430966]   WARN:  run_ksoftirqd+0x26/0x40
> [2399526.430980]   WARN:  smpboot_thread_fn+0x10e/0x160
> [2399526.430993]   WARN:  kthread+0xf8/0x130
> [2399526.431004]   WARN:  ? sort_range+0x20/0x20
> [2399526.431010]   WARN:  ? kthread_bind+0x10/0x10
> [2399526.431017]   WARN:  ret_from_fork+0x35/0x40
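
The oops above can be cross-checked against its own `Code:` line. The byte in angle brackets marks the start of the faulting instruction, here `66 83 79 04 00`. The sketch below decodes only that one x86-64 encoding form (opcode 0x83 with a base register plus an 8-bit displacement) and is not a general disassembler. It shows that the instruction compares a 16-bit value at RCX+4 with zero, which, with RCX = 0, matches the reported fault address of 4 in CR2.

```python
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def decode_group1_imm8(code):
    """Decode `[66] 83 /r disp8 imm8` with a plain base register, no REX."""
    data = bytes.fromhex(code)
    pos = 0
    size = "dword"          # default operand size without REX.W
    if data[pos] == 0x66:   # operand-size prefix selects 16 bits
        size = "word"
        pos += 1
    if data[pos] != 0x83:
        raise ValueError("only opcode 0x83 is handled")
    modrm = data[pos + 1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only [reg + disp8] addressing is handled")
    disp = int.from_bytes(data[pos + 2:pos + 3], "little", signed=True)
    imm = int.from_bytes(data[pos + 3:pos + 4], "little", signed=True)
    return GROUP1[reg], size, REG64[rm], disp, imm


# Values copied from the oops: the faulting bytes, RCX, and CR2.
FAULTING_BYTES = "66 83 79 04 00"
RCX = 0x0000000000000000
CR2 = 0x0000000000000004

if __name__ == "__main__":
    op, size, base, disp, imm = decode_group1_imm8(FAULTING_BYTES)
    print(f"{op} {size} ptr [{base}+{disp:#x}], {imm:#x}")
    print("fault address matches CR2:", RCX + disp == CR2)
```

So the crash is a read through a pointer that was NULL in dom0 kernel context, inside pfifo_fast_dequeue itself.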

I wonder whether you are missing all fixes for commit 021a17ed796b
which went into kernel 4.18. It needs the following fixes on top:

d518d2ed8640 (went into 5.4), 90b2be27bb0e (went into 5.5).
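
Whether a given stable tree already carries these fixes can be checked from its commit messages, because stable backports get new hashes and normally quote the mainline one as `commit <sha> upstream.` or `[ Upstream commit <sha> ]`. The sketch below scans commit-message text for those markers. The input is assumed to come from something like `git log --format=%B <tag>`, and the sample log in the code is hypothetical (the full hash is a zero-padded placeholder), so it says nothing about the real state of 4.19.

```python
import re

# The two marker styles commonly used in stable backport commit messages.
UPSTREAM = re.compile(
    r"(?:^\s*commit ([0-9a-f]{12,40}) upstream\.?\s*$"
    r"|^\s*\[ Upstream commit ([0-9a-f]{12,40}) \]\s*$)",
    re.MULTILINE,
)


def backported(log_text, short_ids):
    """Report, per abbreviated mainline id, whether a backport marker exists."""
    seen = [first or second for first, second in UPSTREAM.findall(log_text)]
    return {
        sid: any(full.startswith(sid) for full in seen)
        for sid in short_ids
    }


# Hypothetical log text; the hash is padded with zeros, not the real full id.
PLACEHOLDER = "d518d2ed8640" + "0" * 28
SAMPLE_LOG = f"""\
net: example subject line of a backported fix

[ Upstream commit {PLACEHOLDER} ]

Body text of the commit message.
"""

if __name__ == "__main__":
    print(backported(SAMPLE_LOG, ["d518d2ed8640", "90b2be27bb0e"]))
```

A fix that was applied by hand without an upstream marker would not be detected this way, so a negative result still needs a look at the source.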

From the backtrace I really doubt this is a Xen problem, BTW. Maybe
running under Xen makes the problem more likely due to different
timing.


Juergen

 

