
Re: repeated Kernel oops need help to debug



Hi Juergen,

This seems closely related to my issue,
but I wonder why the fix was not backported to all versions. Why was it only added to the 4.18 and 5.4 branches, while the 4.19 branch was skipped?

I could try to apply the patch manually to the 4.19 branch (if that works) and test it locally.
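
Before patching by hand, it may be worth checking whether a given 4.19.y tree already carries backports of the two fixes named in the quoted reply. The following is only a rough sketch under assumptions: it expects a local clone of the stable kernel tree with tags such as v4.19 and v4.19.108, and it relies on the usual stable-tree convention of marking backports with a "commit <sha> upstream." or "[ Upstream commit <sha> ]" line. The helper names are hypothetical.

```python
import re
import subprocess

# Upstream commit IDs taken from the quoted reply in this thread.
UPSTREAM_FIXES = ["d518d2ed8640", "90b2be27bb0e"]

# Matches the two common backport markers used in stable commit messages.
# A "Fixes: <sha>" tag is deliberately NOT treated as a backport marker.
UPSTREAM_RE = re.compile(
    r"^\s*(?:\[\s*upstream commit ([0-9a-f]{12,40})\s*\]"
    r"|commit ([0-9a-f]{12,40}) upstream\.?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def missing_backports(log_text, upstream_ids):
    """Return the upstream IDs with no backport marker in log_text."""
    found = [
        (m.group(1) or m.group(2)).lower()
        for m in UPSTREAM_RE.finditer(log_text)
    ]
    return [
        cid for cid in upstream_ids
        if not any(sha.startswith(cid.lower()) for sha in found)
    ]


def git_log_text(repo, base, rev):
    """Commit messages in base..rev (assumes git is installed and both
    revisions exist in the clone at `repo`)."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--format=%B", f"{base}..{rev}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Example (paths and tags are assumptions about the local setup):
#   text = git_log_text("/path/to/linux-stable", "v4.19", "v4.19.108")
#   print(missing_backports(text, UPSTREAM_FIXES))
```

An empty result would suggest both fixes are already present in that range. A non-empty result only means no marker was found, so the source should still be inspected before concluding that a fix is missing.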

On Mon, Jul 27, 2020 at 3:27 AM Jürgen Groß <jgross@xxxxxxxx> wrote:
On 26.07.20 17:47, moftah moftah wrote:
> Hi All,
> We have a problem that has been ongoing for more than a month.
>
> We have several servers running xcp-ng, and we are facing kernel oopses
> that crash the servers.
>
> My skills are not enough to debug the issue, so I need someone to point me in
> the right direction.
> The issue is not hardware related:
> it occurred on servers with different processors, NICs, and even
> kernel versions (all under 4.19).
>
> The stack trace looks like this:
>
> [2399526.430672]  ALERT: BUG: unable to handle kernel NULL pointer
> dereference at 0000000000000004
> [2399526.430695]   INFO: PGD 447268067 P4D 447268067 PUD 44775f067 PMD 0
> [2399526.430710]   WARN: Oops: 0000 [#1] SMP NOPTI
> [2399526.430720]   WARN: CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted
> 4.19.108 #1
> [2399526.430728]   WARN: Hardware name: HP ProLiant SL230s Gen8   /,
> BIOS P75 05/24/2019
> [2399526.430745]   WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
> [2399526.430753]   WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b
> 57 50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88
> d0 00 00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01
> 0a f0 ff
> [2399526.430773]   WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
> [2399526.430780]   WARN: RAX: ffff88842087b900 RBX: 0000000000000001
> RCX: 0000000000000000
> [2399526.430789]   WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001
> RDI: ffff8883de0b9c00
> [2399526.430801]   WARN: RBP: 0000000000000000 R08: 0000000000000000
> R09: 0000000000000020
> [2399526.430811]   WARN: R10: 0000000000000000 R11: ffff8883de0b9d40
> R12: 0000000000000001
> [2399526.430823]   WARN: R13: ffff8883db210a00 R14: 0000000000000002
> R15: ffff8883de0b9c00
> [2399526.430852]   WARN: FS:  00007ffac43fe700(0000)
> GS:ffff888451240000(0000) knlGS:0000000000000000
> [2399526.430868]   WARN: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [2399526.430879]   WARN: CR2: 0000000000000004 CR3: 000000044ad58000
> CR4: 0000000000040660
> [2399526.430899]   WARN: Call Trace:
> [2399526.430914]   WARN:  __qdisc_run+0xa2/0x4f0
> [2399526.430928]   WARN:  ? __switch_to_asm+0x41/0x70
> [2399526.430940]   WARN:  net_tx_action+0x148/0x230
> [2399526.430949]   WARN:  __do_softirq+0xd1/0x28c
> [2399526.430966]   WARN:  run_ksoftirqd+0x26/0x40
> [2399526.430980]   WARN:  smpboot_thread_fn+0x10e/0x160
> [2399526.430993]   WARN:  kthread+0xf8/0x130
> [2399526.431004]   WARN:  ? sort_range+0x20/0x20
> [2399526.431010]   WARN:  ? kthread_bind+0x10/0x10
> [2399526.431017]   WARN:  ret_from_fork+0x35/0x40

I wonder whether you are missing all fixes for commit 021a17ed796b,
which went into kernel 4.18. It needs the following fixes on top:

d518d2ed8640 (went into 5.4), 90b2be27bb0e (went into 5.5).

From the backtrace I really doubt this is a Xen problem, BTW. Maybe
running under Xen makes the problem more likely due to different
timing.


Juergen

 

