Xen project Mailing List

This is a bit weird.

I decided to debug the issue further before patching the kernel.

On one server I changed the qdisc on all network interfaces (not the guest interfaces) from pfifo_fast to fifo.

To my shock, I got the same panic in pfifo_fast_dequeue!
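
To check which interfaces still use pfifo_fast after a change like this, here is a minimal Python sketch. It assumes the usual text layout of `tc qdisc show` (`qdisc <kind> <handle> dev <name> ...`); the function names are made up for illustration. Since only the non-guest interfaces were changed, guest vifs (and any per-queue children left under an mq root) could still be running pfifo_fast, which is one possible, unconfirmed reason for the panic to land in the same function.

```python
import re
import subprocess

# Assumed layout of one line of `tc qdisc show` output:
#   qdisc <kind> <handle> dev <name> <root | parent X> ...
QDISC_LINE = re.compile(
    r"^qdisc\s+(?P<kind>\S+)\s+(?P<handle>\S+)\s+dev\s+(?P<dev>\S+)\s*(?P<rest>.*)$"
)


def pfifo_fast_users(tc_output):
    """Return (device, attachment) pairs for every pfifo_fast qdisc listed."""
    hits = []
    for line in tc_output.splitlines():
        m = QDISC_LINE.match(line.strip())
        if m and m.group("kind") == "pfifo_fast":
            rest = m.group("rest").split()
            # Either "root" or "parent <handle>" follows the device name.
            where = "root" if rest[:1] == ["root"] else " ".join(rest[:2])
            hits.append((m.group("dev"), where))
    return hits


def live_pfifo_fast_users():
    """Run `tc qdisc show` on this host (hypothetical helper, needs iproute2)."""
    out = subprocess.run(
        ["tc", "qdisc", "show"], capture_output=True, text=True, check=True
    ).stdout
    return pfifo_fast_users(out)
```

Calling `live_pfifo_fast_users()` in dom0 after the change would list every device that is still on pfifo_fast; an empty list would mean the panic cannot be explained by a leftover interface.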

Another thing I noticed: whenever the issue occurs on any server, I check the stack of the CPU that panicked.

There are always two stacks, one for dom0 and one for the hypervisor.

On the panicking CPU, the hypervisor stack always contains this:

ffff832027be7d20: ffff82d08021831d kexec_crash+0x4d/0x50
ffff832027be7d28: 00000000fffffffe
ffff832027be7d30: ffff82d080218c9d do_kexec_op_internal+0x44d/0x710
ffff832027be7d38: 0000000000040660
ffff832027be7d40: 0000000000000000
ffff832027be7d48: 0000000000000000
ffff832027be7d50: 000000000000000c
ffff832027be7d58: 000000000000000c
ffff832027be7d60: ffff83202780f000
ffff832027be7d68: ffff832027be7da8 .+64
ffff832027be7d70: 000000000000000c
ffff832027be7d78: ffff82d080266750 vga_noop_puts+0/0x10
ffff832027be7d80: ffff82d080249f3a do_console_io+0x41a/0x460
ffff832027be7d88: ffffc90000000001
ffff832027be7d90: ffff83202780fa24
ffff832027be7d98: 000000000000e033

Could the issue be on the hypervisor side, with the dom0 kernel panic message just being misleading?
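
To compare the hypervisor stack across crashes on different servers, a small Python sketch like the following can pull out only the symbolized words from a dump in the format shown above. The regular expression and function name are assumptions made for illustration. Note that a raw stack dump can contain stale values, so the extracted symbols are not necessarily a call chain.

```python
import re

# One dump line: "<stack address>: <value> [symbol+offset/size]".
# Entries without a "symbol+offset/size" annotation (or with the ".+64"
# style annotation) are matched but reported as having no symbol.
FRAME = re.compile(
    r"^(?P<addr>[0-9a-f]{16}):\s+(?P<value>[0-9a-f]{16})"
    r"(?:\s+(?P<sym>[A-Za-z_][\w.]*)\+(?P<off>0x[0-9a-f]+|0)/(?P<size>0x[0-9a-f]+))?"
)


def symbolized_entries(dump):
    """Return (symbol, offset, size) for each stack word that carries a symbol."""
    entries = []
    for line in dump.splitlines():
        m = FRAME.match(line.strip())
        if m and m.group("sym"):
            entries.append(
                (m.group("sym"), int(m.group("off"), 16), int(m.group("size"), 16))
            )
    return entries
```

Running this over the dump from each crashed server and diffing the resulting lists would show whether the hypervisor side really looks identical every time.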

On Wed, Jul 29, 2020 at 11:44 AM moftah moftah <mofta7y@xxxxxxxxx> wrote:

Hi Juergen,

This seems closely related to my issue,
but I wonder why the fix was not backported to all versions. Why was it only added to the 4.18 and 5.4 branches, while the 4.19 branch was skipped?

I could try applying the patch manually to the 4.19 branch (if it applies) and test it locally.

On Mon, Jul 27, 2020 at 3:27 AM Jürgen Groß <jgross@xxxxxxxx> wrote:
On 26.07.20 17:47, moftah moftah wrote:
> Hi All,
> We have a problem that has been ongoing for more than a month.
>
> We have several servers running xcp-ng, and we are facing kernel oopses
> that crash the server.
>
> My skills are not enough to debug the issue, so I need someone to point me
> in the right direction.
> The issue is not hardware related:
> it occurred on servers with different processors, NICs, and even
> kernel versions (all in the 4.19 series).
>
> The stack trace looks like this:
>
> [2399526.430672] ALERT: BUG: unable to handle kernel NULL pointer
> dereference at 0000000000000004
> [2399526.430695] INFO: PGD 447268067 P4D 447268067 PUD 44775f067 PMD 0
> [2399526.430710] WARN: Oops: 0000 [#1] SMP NOPTI
> [2399526.430720] WARN: CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted
> 4.19.108 #1
> [2399526.430728] WARN: Hardware name: HP ProLiant SL230s Gen8 /,
> BIOS P75 05/24/2019
> [2399526.430745] WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
> [2399526.430753] WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b
> 57 50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88
> d0 00 00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01
> 0a f0 ff
> [2399526.430773] WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
> [2399526.430780] WARN: RAX: ffff88842087b900 RBX: 0000000000000001
> RCX: 0000000000000000
> [2399526.430789] WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001
> RDI: ffff8883de0b9c00
> [2399526.430801] WARN: RBP: 0000000000000000 R08: 0000000000000000
> R09: 0000000000000020
> [2399526.430811] WARN: R10: 0000000000000000 R11: ffff8883de0b9d40
> R12: 0000000000000001
> [2399526.430823] WARN: R13: ffff8883db210a00 R14: 0000000000000002
> R15: ffff8883de0b9c00
> [2399526.430852] WARN: FS: 00007ffac43fe700(0000)
> GS:ffff888451240000(0000) knlGS:0000000000000000
> [2399526.430868] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2399526.430879] WARN: CR2: 0000000000000004 CR3: 000000044ad58000
> CR4: 0000000000040660
> [2399526.430899] WARN: Call Trace:
> [2399526.430914] WARN: __qdisc_run+0xa2/0x4f0
> [2399526.430928] WARN: ? __switch_to_asm+0x41/0x70
> [2399526.430940] WARN: net_tx_action+0x148/0x230
> [2399526.430949] WARN: __do_softirq+0xd1/0x28c
> [2399526.430966] WARN: run_ksoftirqd+0x26/0x40
> [2399526.430980] WARN: smpboot_thread_fn+0x10e/0x160
> [2399526.430993] WARN: kthread+0xf8/0x130
> [2399526.431004] WARN: ? sort_range+0x20/0x20
> [2399526.431010] WARN: ? kthread_bind+0x10/0x10
> [2399526.431017] WARN: ret_from_fork+0x35/0x40
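
The faulting instruction in the oops above can be cross-checked against the register dump. The following Python sketch, under stated assumptions, decodes only the one x86 encoding that appears at the `<66>` marker (opcode 0x83 with a register base and an 8-bit displacement, no REX prefix, no SIB byte); it is not a general disassembler, and all names in it are made up for illustration. For these bytes it yields a 16-bit `cmp` against memory at RCX+4, and with RCX = 0 that is address 4, the same value as CR2. Which kernel structure field RCX was supposed to point at depends on the kernel build and is not established here.

```python
import re

# Excerpt of the oops quoted above (e-mail line wraps kept as they are).
OOPS = """\
WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b
57 50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88
d0 00 00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01
0a f0 ff
WARN: RAX: ffff88842087b900 RBX: 0000000000000001
RCX: 0000000000000000
WARN: CR2: 0000000000000004 CR3: 000000044ad58000
"""

GPR = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]
GROUP1 = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def parse_oops(text):
    """Return (registers, code bytes starting at the <xx> fault marker)."""
    # Drop quote markers and re-join lines wrapped by the mail client.
    flat = " ".join(line.lstrip("> ").strip() for line in text.splitlines())
    regs = {
        name: int(value, 16)
        for name, value in re.findall(
            r"\b(R[A-Z0-9]{2}|CR2):\s*([0-9a-f]{16})\b", flat
        )
    }
    m = re.search(r"Code:\s*((?:<?\b[0-9a-f]{2}\b>?\s*)+)", flat)
    tokens = m.group(1).split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return regs, [int(t.strip("<>"), 16) for t in tokens[start:]]


def decode_group1_imm8(code):
    """Decode `[66] 83 /r disp8 imm8`; returns (op, bits, base, disp, imm)."""
    i, bits = 0, 32
    if code[i] == 0x66:          # operand-size prefix selects 16-bit operands
        bits, i = 16, i + 1
    if code[i] != 0x83:
        raise ValueError("not an 0x83 group-1 instruction")
    modrm = code[i + 1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 1 or rm == 4:
        raise ValueError("only [reg + disp8] without SIB is handled")
    disp = code[i + 2] - 256 if code[i + 2] >= 128 else code[i + 2]
    return GROUP1[reg], bits, GPR[rm], disp, code[i + 3]


regs, code = parse_oops(OOPS)
op, bits, base, disp, imm = decode_group1_imm8(code)
fault_addr = regs[base] + disp   # effective address of the memory operand
print(f"{op}{bits} ${imm:#x}, {disp:#x}({base}) -> address {fault_addr:#x}")
```

The two instructions just before the marker (`8b 88 cc 00 00 00` and `48 03 88 d0 00 00 00`) load RCX from two fields of the object that RAX points to, so the NULL comes from that object's contents rather than from RAX itself.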

I wonder whether you are missing the fixes for commit 021a17ed796b,
which went into kernel 4.18. It needs the following fixes on top:

d518d2ed8640 (went into 5.4), 90b2be27bb0e (went into 5.5).
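
Whether a given tree already carries these fixes can be checked with a short Python sketch like the one below. It relies on the stable-tree convention of quoting the upstream commit ID in the backport's commit message, so it searches messages rather than looking up the hash directly. The repository path and the revision name in the usage note are assumptions, as are the function names. Commits that merely mention the ID (for example in a `Fixes:` tag) will also match, so each hit still needs a manual look.

```python
import subprocess


def build_grep_command(upstream_commit, rev):
    """Build the `git log` call that searches commit messages reachable from rev."""
    return ["git", "log", "--oneline", f"--grep={upstream_commit}", rev]


def backport_candidates(repo_dir, upstream_commit, rev):
    """Return one-line summaries of commits whose message mentions the ID."""
    out = subprocess.run(
        build_grep_command(upstream_commit, rev),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return out.splitlines()
```

For example, `backport_candidates("/path/to/linux-stable", "d518d2ed8640", "v4.19.108")` (path hypothetical) would return an empty list if nothing reachable from that revision references the fix.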

From the backtrace I really doubt this is a Xen problem, BTW. Maybe
running under Xen makes the problem more likely due to different
timing.

Juergen

Re: repeated Kernel oops need help to debug