
[Xen-users] InfiniBand RDMA latency test on Xen's dom0 crashes



Hello.

The short story: while setting up an InfiniBand connection between two servers, one of which is a Xen dom0, I cannot complete the RDMA latency test. The test crashes and even breaks the ssh connection to the dom0. I decided to ask here in the hope that the Xen gurus will notice something that is actually related to Xen.

The long story: the first server runs Xen 4.4 with Ubuntu 14.04 as dom0 (hostname xen). The second is an ordinary server running Ubuntu 14.04 (hostname node3). Both have Mellanox MT25208 HCAs connected through an IB switch, and both have all the kernel modules loaded and OpenSM installed. IPoIB works fine. A plain ibping works in both directions, xen -> node3 and node3 -> xen. The problem occurs when I run the ib_rdma_lat test. Here are the steps that lead to the crash of ib_rdma_lat, and then of sshd, on xen.

1. On xen I run ib_rdma_lat.
2. On node3 I run ib_rdma_lat xen.
3. The ssh connection to xen closes.
4. This is the output on xen just before the ssh connection closes.

root@xen:~/tmp/22# ib_rdma_lat
local address: LID 0x03 QPN 0x10406 PSN 0x9f903b RKey 0x40004000 VAddr 0x000000017e4001
remote address: LID 0x01 QPN 0x10406 PSN 0xd8c16e RKey 0x20004000 VAddr 0x000000013fd001
Connection to xen closed by remote host.
Connection to xen closed.


I googled, and the only thing I found to try was tuning the ib_mthca module parameters num_mtt and log_mtts_per_seg, as described in the article http://community.mellanox.com/docs/DOC-1120. I set them on both servers to num_mtt=4194304 and log_mtts_per_seg=4. I arrived at these values by experimenting until the ib_mthca module loaded correctly.
But this didn't help: ib_rdma_lat still crashes on xen. Here's the log:


Aug  4 00:12:52 localhost kernel: [ 4011.170180] ib_rdma_lat invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
Aug  4 00:12:52 localhost kernel: [ 4011.170189] ib_rdma_lat cpuset=/ mems_allowed=0
Aug  4 00:12:52 localhost kernel: [ 4011.170195] CPU: 0 PID: 2889 Comm: ib_rdma_lat Tainted: G    B   W    3.13.0-32-generic #57-Ubuntu
Aug  4 00:12:52 localhost kernel: [ 4011.170198] Hardware name: Supermicro X9DRFF-iG+/-7G+/-iTG+/-7TG+/X9DRFF-iG+/-7G+/-iTG+/-7TG+, BIOS 3.0 07/29/2013
Aug  4 00:12:52 localhost kernel: [ 4011.170202]  0000000000000000 ffff880f175ebc68 ffffffff8171bcb4 ffff880f1ae02fe0
Aug  4 00:12:52 localhost kernel: [ 4011.170209]  ffff880f175ebcf0 ffffffff817165ef ffff880f1a96afe0 0000000000000000
Aug  4 00:12:52 localhost kernel: [ 4011.170213]  00000000016ad5c1 ffff880f1a96afe0 ffffffff817246aa ffffffff8172417b
Aug  4 00:12:52 localhost kernel: [ 4011.170217] Call Trace:
Aug  4 00:12:52 localhost kernel: [ 4011.170236]  [<ffffffff8171bcb4>] dump_stack+0x45/0x56
Aug  4 00:12:52 localhost kernel: [ 4011.170242]  [<ffffffff817165ef>] dump_header+0x7f/0x1f1
Aug  4 00:12:52 localhost kernel: [ 4011.170248]  [<ffffffff817246aa>] ? error_exit+0x2a/0x60
Aug  4 00:12:52 localhost kernel: [ 4011.170253]  [<ffffffff8172417b>] ? retint_restore_args+0x5/0x6
Aug  4 00:12:52 localhost kernel: [ 4011.170260]  [<ffffffff81151bfe>] oom_kill_process+0x1ce/0x330
Aug  4 00:12:52 localhost kernel: [ 4011.170269]  [<ffffffff812d3ac5>] ? security_capable_noaudit+0x15/0x20
Aug  4 00:12:52 localhost kernel: [ 4011.170273]  [<ffffffff81152334>] out_of_memory+0x414/0x450
Aug  4 00:12:52 localhost kernel: [ 4011.170278]  [<ffffffff811523df>] pagefault_out_of_memory+0x6f/0x80
Aug  4 00:12:52 localhost kernel: [ 4011.170284]  [<ffffffff81714c38>] mm_fault_error+0x8e/0x180
Aug  4 00:12:52 localhost kernel: [ 4011.170289]  [<ffffffff81727f01>] __do_page_fault+0x4a1/0x560
Aug  4 00:12:52 localhost kernel: [ 4011.170299]  [<ffffffff81111116>] ? __acct_update_integrals+0x76/0xe0
Aug  4 00:12:52 localhost kernel: [ 4011.170305]  [<ffffffff8111155c>] ? acct_account_cputime+0x1c/0x20
Aug  4 00:12:52 localhost kernel: [ 4011.170312]  [<ffffffff8109d7db>] ? account_user_time+0x8b/0xa0
Aug  4 00:12:52 localhost kernel: [ 4011.170316]  [<ffffffff8109ddf4>] ? vtime_account_user+0x54/0x60
Aug  4 00:12:52 localhost kernel: [ 4011.170320]  [<ffffffff81727fda>] do_page_fault+0x1a/0x70
Aug  4 00:12:52 localhost kernel: [ 4011.170324]  [<ffffffff81724448>] page_fault+0x28/0x30
Aug  4 00:12:52 localhost kernel: [ 4011.170326] Mem-Info:
Aug  4 00:12:52 localhost kernel: [ 4011.170329] Node 0 DMA per-cpu:
Aug  4 00:12:52 localhost kernel: [ 4011.170334] CPU    0: hi:    0, btch:   1 usd:   0
Aug  4 00:12:52 localhost kernel: [ 4011.170336] Node 0 DMA32 per-cpu:
Aug  4 00:12:52 localhost kernel: [ 4011.170339] CPU    0: hi:  186, btch:  31 usd: 135
Aug  4 00:12:52 localhost kernel: [ 4011.170341] Node 0 Normal per-cpu:
Aug  4 00:12:52 localhost kernel: [ 4011.170344] CPU    0: hi:  186, btch:  31 usd: 124
Aug  4 00:12:52 localhost kernel: [ 4011.170351] active_anon:7920 inactive_anon:23 isolated_anon:0
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  active_file:20177 inactive_file:37521 isolated_file:0
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  unevictable:8 dirty:0 writeback:0 unstable:0
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  free:15211440 slab_reclaimable:4583 slab_unreclaimable:8427
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  mapped:4644 shmem:408 pagetables:993 bounce:0
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  free_cma:0
Aug  4 00:12:52 localhost kernel: [ 4011.170358] Node 0 DMA free:15888kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Aug  4 00:12:52 localhost kernel: [ 4011.170367] lowmem_reserve[]: 0 1980 60135 60135
Aug  4 00:12:52 localhost kernel: [ 4011.170372] Node 0 DMA32 free:2017364kB min:1032kB low:1288kB high:1548kB active_anon:992kB inactive_anon:4kB active_file:2596kB inactive_file:5756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2045472kB managed:2031128kB mlocked:0kB dirty:0kB writeback:0kB mapped:692kB shmem:32kB slab_reclaimable:428kB slab_unreclaimable:472kB kernel_stack:40kB pagetables:132kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Aug  4 00:12:52 localhost kernel: [ 4011.170381] lowmem_reserve[]: 0 0 58154 58154
Aug  4 00:12:52 localhost kernel: [ 4011.170386] Node 0 Normal free:58812508kB min:30348kB low:37932kB high:45520kB active_anon:30688kB inactive_anon:88kB active_file:78112kB inactive_file:144328kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:60853112kB managed:59550432kB mlocked:32kB dirty:0kB writeback:0kB mapped:17884kB shmem:1600kB slab_reclaimable:17904kB slab_unreclaimable:33236kB kernel_stack:1704kB pagetables:3840kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Aug  4 00:12:52 localhost kernel: [ 4011.170394] lowmem_reserve[]: 0 0 0 0
Aug  4 00:12:52 localhost kernel: [ 4011.170398] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15888kB
Aug  4 00:12:52 localhost kernel: [ 4011.170416] Node 0 DMA32: 1*4kB (M) 12*8kB (UEM) 7*16kB (UE) 2*32kB (UM) 1*64kB (U) 2*128kB (UM) 0*256kB 1*512kB (E) 1*1024kB (E) 2*2048kB (ER) 491*4096kB (M) = 2017364kB
Aug  4 00:12:52 localhost kernel: [ 4011.170434] Node 0 Normal: 67*4kB (UM) 34*8kB (UEM) 16*16kB (UEM) 38*32kB (UM) 26*64kB (UM) 22*128kB (UEM) 15*256kB (UEM) 2*512kB (M) 1*1024kB (M) 3*2048kB (UEM) 14354*4096kB (MR) = 58812508kB
Aug  4 00:12:52 localhost kernel: [ 4011.170468] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug  4 00:12:52 localhost kernel: [ 4011.170470] 58105 total pagecache pages
Aug  4 00:12:52 localhost kernel: [ 4011.170473] 0 pages in swap cache
Aug  4 00:12:52 localhost kernel: [ 4011.170476] Swap cache stats: add 0, delete 0, find 0/189
Aug  4 00:12:52 localhost kernel: [ 4011.170478] Free swap  = 33517564kB
Aug  4 00:12:52 localhost kernel: [ 4011.170480] Total swap = 33517564kB
Aug  4 00:12:52 localhost kernel: [ 4011.170482] 15728639 pages RAM
Aug  4 00:12:52 localhost kernel: [ 4011.170483] 0 pages HighMem/MovableOnly
Aug  4 00:12:52 localhost kernel: [ 4011.170485] 325670 pages reserved
Aug  4 00:12:52 localhost kernel: [ 4011.170487] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug  4 00:12:52 localhost kernel: [ 4011.170496] [  375]     0   375     4935      228      14        0             0 upstart-udev-br
Aug  4 00:12:52 localhost kernel: [ 4011.170501] [  384]     0   384    12927      485      28        0         -1000 systemd-udevd
Aug  4 00:12:52 localhost kernel: [ 4011.170505] [  571]   102   571     9887      391      23        0             0 dbus-daemon
Aug  4 00:12:52 localhost kernel: [ 4011.170509] [  590]   101   590    63961      318      27        0             0 rsyslogd
Aug  4 00:12:52 localhost kernel: [ 4011.170513] [  596]     0   596     4823      373      14        0             0 bluetoothd
Aug  4 00:12:52 localhost kernel: [ 4011.170516] [  606]     0   606    18680      893      40        0             0 cupsd
Aug  4 00:12:52 localhost kernel: [ 4011.170520] [  614]     0   614     5870      106      16        0             0 rpc.idmapd
Aug  4 00:12:52 localhost kernel: [ 4011.170523] [  622]     0   622    10863      454      26        0             0 systemd-logind
Aug  4 00:12:52 localhost kernel: [ 4011.170528] [  702]     0   702     3984      308      13        0             0 upstart-file-br
Aug  4 00:12:52 localhost kernel: [ 4011.170531] [  877]     0   877     5855      275      18        0             0 rpcbind
Aug  4 00:12:52 localhost kernel: [ 4011.170534] [  898]   111   898     5386      347      15        0             0 rpc.statd
Aug  4 00:12:52 localhost kernel: [ 4011.170538] [  901]     0   901     3848      184      13        0             0 upstart-socket-
Aug  4 00:12:52 localhost kernel: [ 4011.170541] [ 1300]   105  1300     7861      513      21        0             0 ntpd
Aug  4 00:12:52 localhost kernel: [ 4011.170545] [ 1374]     0  1374     5268      237      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170548] [ 1378]     0  1378     5268      235      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170551] [ 1384]     0  1384     5268      237      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170555] [ 1385]     0  1385     5268      238      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170558] [ 1388]     0  1388     5268      238      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170561] [ 1427]     0  1427    15341      762      33        0         -1000 sshd
Aug  4 00:12:52 localhost kernel: [ 4011.170564] [ 1443]     0  1443     5914      257      17        0             0 cron
Aug  4 00:12:52 localhost kernel: [ 4011.170568] [ 1554]     0  1554     2750      242      11        0             0 xenstored
Aug  4 00:12:52 localhost kernel: [ 4011.170571] [ 1566]     0  1566    22752      261      19        0             0 xenconsoled
Aug  4 00:12:52 localhost kernel: [ 4011.170575] [ 1613]     0  1613    73631     1045      48        0             0 polkitd
Aug  4 00:12:52 localhost kernel: [ 4011.170578] [ 1885]   113  1885     7052      249      18        0             0 dnsmasq
Aug  4 00:12:52 localhost kernel: [ 4011.170581] [ 2004]     0  2004   148275      997      39        0             0 console-kit-dae
Aug  4 00:12:52 localhost kernel: [ 4011.170585] [ 2166]     0  2166    23985      237      21        0             0 xl
Aug  4 00:12:52 localhost kernel: [ 4011.170589] [ 2303]     0  2303     5268      237      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170592] [ 2378]     0  2378    82712      784      23        0             0 opensm
Aug  4 00:12:52 localhost kernel: [ 4011.170595] [ 2379]     0  2379    65942      358      22        0             0 opensm
Aug  4 00:12:52 localhost kernel: [ 4011.170598] [ 2450]   106  2450    91259     1269      74        0             0 whoopsie
Aug  4 00:12:52 localhost kernel: [ 4011.170602] [ 2453]     0  2453    93762     3220     114        0             0 libvirtd
Aug  4 00:12:52 localhost kernel: [ 4011.170605] [ 2634]     0  2634    26407     1058      54        0             0 sshd
Aug  4 00:12:52 localhost kernel: [ 4011.170608] [ 2671]  1000  2671    26407      501      52        0             0 sshd
Aug  4 00:12:52 localhost kernel: [ 4011.170612] [ 2672]  1000  2672     7041     1040      17        0             0 bash
Aug  4 00:12:52 localhost kernel: [ 4011.170615] [ 2749]     0  2749    17566      547      36        0             0 sudo
Aug  4 00:12:52 localhost kernel: [ 4011.170618] [ 2750]     0  2750     7063     1074      16        0             0 bash
Aug  4 00:12:52 localhost kernel: [ 4011.170622] [ 2889]     0  2889     3732      213      12        0             0 ib_rdma_lat
Aug  4 00:12:52 localhost kernel: [ 4011.170625] Out of memory: Kill process 2453 (libvirtd) score 0 or sacrifice child
Aug  4 00:12:52 localhost kernel: [ 4011.170729] Killed process 2453 (libvirtd) total-vm:375048kB, anon-rss:4748kB, file-rss:8132kB
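
As a cross-check of the Mem-Info report above, the per-zone free values can be summed and compared with the global free page counter. This is only a minimal Python sketch: the numbers are copied from the log, and the 4 kB page size is an assumption (it is the usual x86-64 value).

```python
# Per-zone "free:" values copied from the "Node 0 <zone> free:" lines (in kB).
zone_free_kb = {"DMA": 15888, "DMA32": 2017364, "Normal": 58812508}

# Global "free:" counter from the same report, in pages.
free_pages = 15211440

# Assumption: 4 kB pages.
PAGE_KB = 4

total_free_kb = sum(zone_free_kb.values())
free_gib = total_free_kb / (1024 * 1024)

# The zone totals should match the page counter if the page size guess is right.
consistent = total_free_kb == free_pages * PAGE_KB

print(f"free: {total_free_kb} kB ({free_gib:.1f} GiB), matches page counter: {consistent}")
```

With these numbers the sketch reports about 58 GiB free at the moment the oom-killer was invoked.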



The xen server (dom0) has 60GB of RAM, and node3 has 180GB of RAM.
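
To relate the num_mtt and log_mtts_per_seg values mentioned earlier to these RAM sizes, here is a rough Python sketch of the commonly cited rule of thumb for Mellanox HCAs: registerable memory is about num_mtt * 2^log_mtts_per_seg * page size, and it is usually recommended to be at least twice the physical RAM. Whether this formula applies to ib_mthca exactly as written is an assumption, as is the 4 kB page size; the function name is made up for illustration.

```python
GIB = 1 << 30


def max_registerable_bytes(num_mtt, log_mtts_per_seg, page_size=4096):
    """Rule-of-thumb bound on registerable memory (assumed formula, 4 kB pages)."""
    return num_mtt * (1 << log_mtts_per_seg) * page_size


# Values set on both servers.
limit = max_registerable_bytes(num_mtt=4194304, log_mtts_per_seg=4)

# RAM sizes of the two hosts, treated as GiB for this estimate.
ram_gib = {"xen": 60, "node3": 180}

# True when the limit covers at least twice the host's physical RAM.
covers_twice_ram = {host: limit >= 2 * gib * GIB for host, gib in ram_gib.items()}

print(f"limit: {limit // GIB} GiB")
print(covers_twice_ram)
```

Under these assumptions the limit comes to 256 GiB, which covers twice the RAM of xen (120 GiB) but not twice the RAM of node3 (360 GiB).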
Here are some logs and command outputs that I collected while diagnosing the problem.

9. An excerpt from /var/log/syslog after the ib_rdma_lat crash on xen: https://dl.dropboxusercontent.com/u/8057759/ib_mthca/syslog.xen


Can anyone give me any advice, please?


--
Best regards,
Grigory Ptashko

+7 (916) 1489766
grigory.ptashko@xxxxxxxxx
