
[Xen-devel] kexec+kdump troubles on xen 4.5-unstable, centos 7, x86_64 (need to get a crash dump)


  • To: Xen <xen-devel@xxxxxxxxxxxxx>
  • From: Grigory Ptashko <grigory.ptashko@xxxxxxxxx>
  • Date: Fri, 17 Oct 2014 22:17:14 +0400
  • Delivery-date: Fri, 17 Oct 2014 18:18:02 +0000
  • List-id: Xen developer discussion <xen-devel.lists.xen.org>

Hello.

Here is the full story. I'm running CentOS 7 with a custom-built kernel
on x86_64, and I'm trying to pass different GPUs through to Xen guests.
I have a problem with the AMD FirePro W9100: a Windows HVM guest starts with
the GPU, and a 3D benchmark even runs fine, but after some time both the domU
and dom0 freeze.
I monitor the serial console for kernel panics, but none ever appear.
So I decided to capture a crash dump of the dom0 kernel to see what is going
on, and it turns out that I cannot do even that.
I've tried specifying the crashkernel parameter both for xen.gz and for
my dom0 kernel (bzImage).


1. The first case: crashkernel=256M for dom0 cmdline:

bzImage crashkernel=256M

[root@kvmxen-centos7-test1-nb ~]# systemctl status kdump.service
kdump.service - Crash recovery kernel arming
...
Oct 17 21:19:38 kvmxen-centos7-test1-nb kdumpctl[1506]: kexec: loaded kdump kernel
...

[root@kvmxen-centos7-test1-nb ~]# cat /sys/kernel/kexec_crash_loaded
1

Here we see that kexec, run by kdump.service, worked: it appears to have
loaded the dump capture kernel.
Now let's trigger a panic:

[root@kvmxen-centos7-test1-nb ~]# echo c > /proc/sysrq-trigger

In the console we see:

[  421.673471] SysRq : Trigger a crash
[  421.677110] BUG: unable to handle kernel NULL pointer dereference at      (null)
[  421.685021] IP: [<ffffffff81484486>] sysrq_handle_crash+0x16/0x20
[  421.691172] PGD 2d11e58067 PUD 2c95d3c067 PMD 0
[  421.695900] Oops: 0002 [#1] SMP
[  421.699210] Modules linked in: ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables sg rpcsec_gss_krb5 nls_utf8 iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw gf128mul sb_edac glue_helper ablk_helper ipmi_si lpc_ich edac_core cryptd i2c_i801 pcspkr mfd_core ipmi_msghandler mei_me ioatdma wmi mei shpchp dca nfsd binfmt_misc mgag200 drm_kms_helper ttm drm ahci mlx4_core libahci libata
[  421.745725] CPU: 9 PID: 11422 Comm: bash Not tainted 3.17.0 #3
[  421.751562] Hardware name: Supermicro X9DRFF-iG+/-7G+/-iTG+/-7TG+/X9DRFF-iG+/-7G+/-iTG+/-7TG+, BIOS 3.0 07/29/2013
[  421.761910] task: ffff882e94383640 ti: ffff882c71758000 task.ti: ffff882c71758000
[  421.769398] RIP: e030:[<ffffffff81484486>]  [<ffffffff81484486>] sysrq_handle_crash+0x16/0x20
[  421.777961] RSP: e02b:ffff882c7175be88  EFLAGS: 00010246
[  421.783276] RAX: 000000000000000f RBX: ffffffff81d2d780 RCX: 0000000000000000
[  421.790416] RDX: 0000000000000000 RSI: ffff882eea52e5b8 RDI: 0000000000000063
[  421.797557] RBP: ffff882c7175be88 R08: 0000000000000002 R09: ffffffff82034afc
[  421.804708] R10: 00000000000004a7 R11: 00000000000004a6 R12: 0000000000000063
[  421.811839] R13: 0000000000000000 R14: 0000000000000007 R15: 0000000000000000
[  421.818992] FS:  00007f1c0205b740(0000) GS:ffff882eea520000(0000) knlGS:0000000000000000
[  421.827075] CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[  421.832821] CR2: 0000000000000000 CR3: 0000002c2a879000 CR4: 0000000000042660
[  421.839972] Stack:
[  421.841998]  ffff882c7175beb8 ffffffff81484cd7 0000000000000002 00007f1c0207f000
[  421.849494]  0000000000000002 ffff882c7175bf48 ffff882c7175bed0 ffffffff8148517f
[  421.857019]  ffff882e94765380 ffff882c7175bef0 ffffffff81251afd ffff882c7175bf48
[  421.864514] Call Trace:
[  421.866981]  [<ffffffff81484cd7>] __handle_sysrq+0x107/0x170
[  421.872645]  [<ffffffff8148517f>] write_sysrq_trigger+0x2f/0x40
[  421.878575]  [<ffffffff81251afd>] proc_reg_write+0x3d/0x80
[  421.884069]  [<ffffffff811eaef7>] vfs_write+0xb7/0x1f0
[  421.889209]  [<ffffffff811ebb15>] SyS_write+0x55/0xd0
[  421.894294]  [<ffffffff8183fc29>] system_call_fastpath+0x16/0x1b
[  421.900300] Code: 65 34 75 e5 4c 89 ef e8 d9 f7 ff ff eb db 0f 1f 80 00 00 00 00 66 66 66 66 90 55 c7 05 88 43 7f 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 5d c3 66 66 66 66 90 55 31 c0 c7 05 2e
[  421.920596] RIP  [<ffffffff81484486>] sysrq_handle_crash+0x16/0x20
[  421.926803]  RSP <ffff882c7175be88>
[  421.930302] CR2: 0000000000000000

And that's it. The dump capture kernel is never started. After this kernel
panic my server just reboots.


2. The second case: crashkernel=256M in xen.gz cmdline.

xen.gz crashkernel=256M

[root@kvmxen-centos7-test1-nb ~]# systemctl status kdump.service
kdump.service - Crash recovery kernel arming
...
   Active: failed (Result: exit-code) since Fri 2014-10-17 19:56:57 MSK; 1h 9min ago
...
Oct 17 19:56:57 kvmxen-centos7-test1-nb kdumpctl[1536]: No memory reserved for crash kernel.
Oct 17 19:56:57 kvmxen-centos7-test1-nb kdumpctl[1536]: Starting kdump: [FAILED]
....

As we can see, kdump.service cannot load the dump capture kernel; it fails
with 'No memory reserved for crash kernel'.
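
For reference, the manual checks from both cases can be collected in one
script. This is only a minimal sketch, not something kdump or Xen ships: the
helper names (parse_crashkernel, read_text, report) are made up for
illustration, and it assumes that /proc/cmdline, /sys/kernel/kexec_crash_loaded
and /sys/kernel/kexec_crash_size are the right places to look from dom0, which
may not hold when running under Xen.

```python
import re
from typing import Optional


def parse_crashkernel(cmdline: str) -> Optional[str]:
    """Return the value of the crashkernel= parameter, or None if absent."""
    match = re.search(r"(?:^|\s)crashkernel=(\S+)", cmdline)
    return match.group(1) if match else None


def read_text(path: str) -> Optional[str]:
    """Read a small text file, returning None if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def report(cmdline: Optional[str], loaded: Optional[str], size: Optional[str]) -> str:
    """Summarize the kdump-related state as a few human-readable lines."""
    reserved = parse_crashkernel(cmdline) if cmdline is not None else None
    lines = [
        "crashkernel= on kernel cmdline: %s" % (reserved or "not set"),
        "kexec_crash_loaded: %s" % (loaded if loaded is not None else "unavailable"),
        "kexec_crash_size (bytes): %s" % (size if size is not None else "unavailable"),
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Paths are the standard Linux locations; unreadable files are reported
    # as "unavailable" instead of raising.
    print(report(
        read_text("/proc/cmdline"),
        read_text("/sys/kernel/kexec_crash_loaded"),
        read_text("/sys/kernel/kexec_crash_size"),
    ))
```

Running it once in each of the two configurations above should make it easier
to compare what the dom0 kernel itself reports about the reservation.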


So the questions are:

1. How can I make crash dumps of the hypervisor and the dom0?

2. How am I supposed to diagnose what causes these dom0 freezes?
I thought that reporting the freezes on the list without any logs or crash
dumps would be a waste of time, but I cannot even produce them.

I really want to contribute by testing Xen and submitting bugs, but I'd like
to give the developers more material to work with.


Thank you,
Grigory.


--
Best regards,
Grigory Ptashko

+7 (916) 1489766
grigory.ptashko@xxxxxxxxx
skype grigory_ptashko

