[Xen-users] InfiniBand RDMA latency test on Xen's dom0 crashes
Hello. The short story: while setting up an InfiniBand connection between two servers, one of which is a Xen dom0, I cannot complete the RDMA latency test. It crashes and even breaks the ssh connection to the Xen dom0. I am asking here in the hope that the Xen gurus will notice something that is actually related to Xen.

The long story. The first server runs Xen 4.4 with Ubuntu 14.04 as dom0 (hostname xen). The second is an ordinary server running Ubuntu 14.04 (hostname node3). Both have Mellanox MT25208 HCAs connected through an IB switch, and both have all the kernel modules loaded and OpenSM installed. IPoIB works fine. Plain ibping works in both directions, xen -> node3 and node3 -> xen. The problem occurs when I run the ib_rdma_lat test. Here are the steps that lead to the ib_rdma_lat crash and then the sshd crash on xen.
1. On xen I run ib_rdma_lat.
2. On node3 I run ib_rdma_lat xen.
3. The ssh connection to xen closes.
4. This is the output on xen before the ssh connection closes:

root@xen:~/tmp/22# ib_rdma_lat
local address: LID 0x03 QPN 0x10406 PSN 0x9f903b RKey 0x40004000 VAddr 0x000000017e4001
remote address: LID 0x01 QPN 0x10406 PSN 0xd8c16e RKey 0x20004000 VAddr 0x000000013fd001
Connection to xen closed by remote host.
Connection to xen closed.

I googled, and the only thing I found to try was tuning the ib_mthca module parameters num_mtt and log_mtts_per_seg, as described in the article http://community.mellanox.com/docs/DOC-1120. I set them on both servers to num_mtt=4194304 and log_mtts_per_seg=4. I arrived at these values by experimenting until the ib_mthca module loaded correctly.
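For reference, here is a rough sketch of how much memory those two parameters should allow the HCA to register. It assumes the commonly quoted relation (MTT segments x entries per segment x page size) and a 4096-byte page; both the relation as applied to ib_mthca and the helper name max_registrable_bytes are assumptions for illustration, not something taken from the driver documentation.

```python
PAGE_SIZE = 4096  # bytes; assumed default page size on x86-64


def max_registrable_bytes(num_mtt, log_mtts_per_seg, page_size=PAGE_SIZE):
    """Assumed relation: segments * (entries per segment) * page size."""
    return num_mtt * (1 << log_mtts_per_seg) * page_size


# Values set on both servers in this report.
limit = max_registrable_bytes(num_mtt=4194304, log_mtts_per_seg=4)
print(limit // 2**30, "GiB")  # prints "256 GiB"
```

If that relation holds, the limit is well above the RAM of either machine, so the MTT settings alone would not be expected to cap registration.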
But this didn't help. ib_rdma_lat still crashes on xen. Here's the log:

Aug  4 00:12:52 localhost kernel: [ 4011.170180] ib_rdma_lat invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
Aug  4 00:12:52 localhost kernel: [ 4011.170189] ib_rdma_lat cpuset=/ mems_allowed=0
Aug  4 00:12:52 localhost kernel: [ 4011.170195] CPU: 0 PID: 2889 Comm: ib_rdma_lat Tainted: G    B   W    3.13.0-32-generic #57-Ubuntu
Aug  4 00:12:52 localhost kernel: [ 4011.170198] Hardware name: Supermicro X9DRFF-iG+/-7G+/-iTG+/-7TG+/X9DRFF-iG+/-7G+/-iTG+/-7TG+, BIOS 3.0 07/29/2013
Aug  4 00:12:52 localhost kernel: [ 4011.170202]  0000000000000000 ffff880f175ebc68 ffffffff8171bcb4 ffff880f1ae02fe0
Aug  4 00:12:52 localhost kernel: [ 4011.170209]  ffff880f175ebcf0 ffffffff817165ef ffff880f1a96afe0 0000000000000000
Aug  4 00:12:52 localhost kernel: [ 4011.170213]  00000000016ad5c1 ffff880f1a96afe0 ffffffff817246aa ffffffff8172417b
Aug  4 00:12:52 localhost kernel: [ 4011.170217] Call Trace:
Aug  4 00:12:52 localhost kernel: [ 4011.170236]  [<ffffffff8171bcb4>] dump_stack+0x45/0x56
Aug  4 00:12:52 localhost kernel: [ 4011.170242]  [<ffffffff817165ef>] dump_header+0x7f/0x1f1
Aug  4 00:12:52 localhost kernel: [ 4011.170248]  [<ffffffff817246aa>] ? error_exit+0x2a/0x60
Aug  4 00:12:52 localhost kernel: [ 4011.170253]  [<ffffffff8172417b>] ? retint_restore_args+0x5/0x6
Aug  4 00:12:52 localhost kernel: [ 4011.170260]  [<ffffffff81151bfe>] oom_kill_process+0x1ce/0x330
Aug  4 00:12:52 localhost kernel: [ 4011.170269]  [<ffffffff812d3ac5>] ? security_capable_noaudit+0x15/0x20
Aug  4 00:12:52 localhost kernel: [ 4011.170273]  [<ffffffff81152334>] out_of_memory+0x414/0x450
Aug  4 00:12:52 localhost kernel: [ 4011.170278]  [<ffffffff811523df>] pagefault_out_of_memory+0x6f/0x80
Aug  4 00:12:52 localhost kernel: [ 4011.170284]  [<ffffffff81714c38>] mm_fault_error+0x8e/0x180
Aug  4 00:12:52 localhost kernel: [ 4011.170289]  [<ffffffff81727f01>] __do_page_fault+0x4a1/0x560
Aug  4 00:12:52 localhost kernel: [ 4011.170299]  [<ffffffff81111116>] ? __acct_update_integrals+0x76/0xe0
Aug  4 00:12:52 localhost kernel: [ 4011.170305]  [<ffffffff8111155c>] ? acct_account_cputime+0x1c/0x20
Aug  4 00:12:52 localhost kernel: [ 4011.170312]  [<ffffffff8109d7db>] ? account_user_time+0x8b/0xa0
Aug  4 00:12:52 localhost kernel: [ 4011.170316]  [<ffffffff8109ddf4>] ? vtime_account_user+0x54/0x60
Aug  4 00:12:52 localhost kernel: [ 4011.170320]  [<ffffffff81727fda>] do_page_fault+0x1a/0x70
Aug  4 00:12:52 localhost kernel: [ 4011.170324]  [<ffffffff81724448>] page_fault+0x28/0x30
Aug  4 00:12:52 localhost kernel: [ 4011.170326] Mem-Info:
Aug  4 00:12:52 localhost kernel: [ 4011.170329] Node 0 DMA per-cpu:
Aug  4 00:12:52 localhost kernel: [ 4011.170334] CPU    0: hi:    0, btch:   1 usd:   0
Aug  4 00:12:52 localhost kernel: [ 4011.170336] Node 0 DMA32 per-cpu:
Aug  4 00:12:52 localhost kernel: [ 4011.170339] CPU    0: hi:  186, btch:  31 usd: 135
Aug  4 00:12:52 localhost kernel: [ 4011.170341] Node 0 Normal per-cpu:
Aug  4 00:12:52 localhost kernel: [ 4011.170344] CPU    0: hi:  186, btch:  31 usd: 124
Aug  4 00:12:52 localhost kernel: [ 4011.170351] active_anon:7920 inactive_anon:23 isolated_anon:0
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  active_file:20177 inactive_file:37521 isolated_file:0
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  unevictable:8 dirty:0 writeback:0 unstable:0
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  free:15211440 slab_reclaimable:4583 slab_unreclaimable:8427
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  mapped:4644 shmem:408 pagetables:993 bounce:0
Aug  4 00:12:52 localhost kernel: [ 4011.170351]  free_cma:0
Aug  4 00:12:52 localhost kernel: [ 4011.170358] Node 0 DMA free:15888kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15972kB managed:15888kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Aug  4 00:12:52 localhost kernel: [ 4011.170367] lowmem_reserve[]: 0 1980 60135 60135
Aug  4 00:12:52 localhost kernel: [ 4011.170372] Node 0 DMA32 free:2017364kB min:1032kB low:1288kB high:1548kB active_anon:992kB inactive_anon:4kB active_file:2596kB inactive_file:5756kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2045472kB managed:2031128kB mlocked:0kB dirty:0kB writeback:0kB mapped:692kB shmem:32kB slab_reclaimable:428kB slab_unreclaimable:472kB kernel_stack:40kB pagetables:132kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Aug  4 00:12:52 localhost kernel: [ 4011.170381] lowmem_reserve[]: 0 0 58154 58154
Aug  4 00:12:52 localhost kernel: [ 4011.170386] Node 0 Normal free:58812508kB min:30348kB low:37932kB high:45520kB active_anon:30688kB inactive_anon:88kB active_file:78112kB inactive_file:144328kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:60853112kB managed:59550432kB mlocked:32kB dirty:0kB writeback:0kB mapped:17884kB shmem:1600kB slab_reclaimable:17904kB slab_unreclaimable:33236kB kernel_stack:1704kB pagetables:3840kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Aug  4 00:12:52 localhost kernel: [ 4011.170394] lowmem_reserve[]: 0 0 0 0
Aug  4 00:12:52 localhost kernel: [ 4011.170398] Node 0 DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (R) 3*4096kB (M) = 15888kB
Aug  4 00:12:52 localhost kernel: [ 4011.170416] Node 0 DMA32: 1*4kB (M) 12*8kB (UEM) 7*16kB (UE) 2*32kB (UM) 1*64kB (U) 2*128kB (UM) 0*256kB 1*512kB (E) 1*1024kB (E) 2*2048kB (ER) 491*4096kB (M) = 2017364kB
Aug  4 00:12:52 localhost kernel: [ 4011.170434] Node 0 Normal: 67*4kB (UM) 34*8kB (UEM) 16*16kB (UEM) 38*32kB (UM) 26*64kB (UM) 22*128kB (UEM) 15*256kB (UEM) 2*512kB (M) 1*1024kB (M) 3*2048kB (UEM) 14354*4096kB (MR) = 58812508kB
Aug  4 00:12:52 localhost kernel: [ 4011.170468] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Aug  4 00:12:52 localhost kernel: [ 4011.170470] 58105 total pagecache pages
Aug  4 00:12:52 localhost kernel: [ 4011.170473] 0 pages in swap cache
Aug  4 00:12:52 localhost kernel: [ 4011.170476] Swap cache stats: add 0, delete 0, find 0/189
Aug  4 00:12:52 localhost kernel: [ 4011.170478] Free swap  = 33517564kB
Aug  4 00:12:52 localhost kernel: [ 4011.170480] Total swap = 33517564kB
Aug  4 00:12:52 localhost kernel: [ 4011.170482] 15728639 pages RAM
Aug  4 00:12:52 localhost kernel: [ 4011.170483] 0 pages HighMem/MovableOnly
Aug  4 00:12:52 localhost kernel: [ 4011.170485] 325670 pages reserved
Aug  4 00:12:52 localhost kernel: [ 4011.170487] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Aug  4 00:12:52 localhost kernel: [ 4011.170496] [  375]     0   375     4935      228      14        0             0 upstart-udev-br
Aug  4 00:12:52 localhost kernel: [ 4011.170501] [  384]     0   384    12927      485      28        0         -1000 systemd-udevd
Aug  4 00:12:52 localhost kernel: [ 4011.170505] [  571]   102   571     9887      391      23        0             0 dbus-daemon
Aug  4 00:12:52 localhost kernel: [ 4011.170509] [  590]   101   590    63961      318      27        0             0 rsyslogd
Aug  4 00:12:52 localhost kernel: [ 4011.170513] [  596]     0   596     4823      373      14        0             0 bluetoothd
Aug  4 00:12:52 localhost kernel: [ 4011.170516] [  606]     0   606    18680      893      40        0             0 cupsd
Aug  4 00:12:52 localhost kernel: [ 4011.170520] [  614]     0   614     5870      106      16        0             0 rpc.idmapd
Aug  4 00:12:52 localhost kernel: [ 4011.170523] [  622]     0   622    10863      454      26        0             0 systemd-logind
Aug  4 00:12:52 localhost kernel: [ 4011.170528] [  702]     0   702     3984      308      13        0             0 upstart-file-br
Aug  4 00:12:52 localhost kernel: [ 4011.170531] [  877]     0   877     5855      275      18        0             0 rpcbind
Aug  4 00:12:52 localhost kernel: [ 4011.170534] [  898]   111   898     5386      347      15        0             0 rpc.statd
Aug  4 00:12:52 localhost kernel: [ 4011.170538] [  901]     0   901     3848      184      13        0             0 upstart-socket-
Aug  4 00:12:52 localhost kernel: [ 4011.170541] [ 1300]   105  1300     7861      513      21        0             0 ntpd
Aug  4 00:12:52 localhost kernel: [ 4011.170545] [ 1374]     0  1374     5268      237      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170548] [ 1378]     0  1378     5268      235      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170551] [ 1384]     0  1384     5268      237      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170555] [ 1385]     0  1385     5268      238      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170558] [ 1388]     0  1388     5268      238      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170561] [ 1427]     0  1427    15341      762      33        0         -1000 sshd
Aug  4 00:12:52 localhost kernel: [ 4011.170564] [ 1443]     0  1443     5914      257      17        0             0 cron
Aug  4 00:12:52 localhost kernel: [ 4011.170568] [ 1554]     0  1554     2750      242      11        0             0 xenstored
Aug  4 00:12:52 localhost kernel: [ 4011.170571] [ 1566]     0  1566    22752      261      19        0             0 xenconsoled
Aug  4 00:12:52 localhost kernel: [ 4011.170575] [ 1613]     0  1613    73631     1045      48        0             0 polkitd
Aug  4 00:12:52 localhost kernel: [ 4011.170578] [ 1885]   113  1885     7052      249      18        0             0 dnsmasq
Aug  4 00:12:52 localhost kernel: [ 4011.170581] [ 2004]     0  2004   148275      997      39        0             0 console-kit-dae
Aug  4 00:12:52 localhost kernel: [ 4011.170585] [ 2166]     0  2166    23985      237      21        0             0 xl
Aug  4 00:12:52 localhost kernel: [ 4011.170589] [ 2303]     0  2303     5268      237      13        0             0 getty
Aug  4 00:12:52 localhost kernel: [ 4011.170592] [ 2378]     0  2378    82712      784      23        0             0 opensm
Aug  4 00:12:52 localhost kernel: [ 4011.170595] [ 2379]     0  2379    65942      358      22        0             0 opensm
Aug  4 00:12:52 localhost kernel: [ 4011.170598] [ 2450]   106  2450    91259     1269      74        0             0 whoopsie
Aug  4 00:12:52 localhost kernel: [ 4011.170602] [ 2453]     0  2453    93762     3220     114        0             0 libvirtd
Aug  4 00:12:52 localhost kernel: [ 4011.170605] [ 2634]     0  2634    26407     1058      54        0             0 sshd
Aug  4 00:12:52 localhost kernel: [ 4011.170608] [ 2671]  1000  2671    26407      501      52        0             0 sshd
Aug  4 00:12:52 localhost kernel: [ 4011.170612] [ 2672]  1000  2672     7041     1040      17        0             0 bash
Aug  4 00:12:52 localhost kernel: [ 4011.170615] [ 2749]     0  2749    17566      547      36        0             0 sudo
Aug  4 00:12:52 localhost kernel: [ 4011.170618] [ 2750]     0  2750     7063     1074      16        0             0 bash
Aug  4 00:12:52 localhost kernel: [ 4011.170622] [ 2889]     0  2889     3732      213      12        0             0 ib_rdma_lat
Aug  4 00:12:52 localhost kernel: [ 4011.170625] Out of memory: Kill process 2453 (libvirtd) score 0 or sacrifice child
Aug  4 00:12:52 localhost kernel: [ 4011.170729] Killed process 2453 (libvirtd) total-vm:375048kB, anon-rss:4748kB, file-rss:8132kB
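For scale, the free: counter in the Mem-Info block of the log above is in pages, not kB. Here is a minimal sketch of the conversion, assuming 4 kB pages (which matches the per-zone free figures in the same log); the helper name pages_to_gib is made up for illustration.

```python
import re

PAGE_KB = 4  # assumed page size in kB

# Line copied from the Mem-Info block of the log above.
mem_info = "free:15211440 slab_reclaimable:4583 slab_unreclaimable:8427"


def pages_to_gib(pages, page_kb=PAGE_KB):
    """Convert a page count to GiB."""
    return pages * page_kb / 2**20


free_pages = int(re.search(r"\bfree:(\d+)", mem_info).group(1))

# Cross-check against the per-zone free values (DMA + DMA32 + Normal), in kB.
zones_kb = 15888 + 2017364 + 58812508
print(free_pages * PAGE_KB == zones_kb)     # prints True
print(round(pages_to_gib(free_pages), 1))   # prints 58.0
```

So the oom-killer was invoked while roughly 58 GiB were reported free and no swap was in use.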
xen (dom0) has 60GB of RAM, and node3 has 180GB of RAM. Here are some logs and command outputs that I collected while diagnosing the problem.

2. xl dmesg on xen
https://dl.dropboxusercontent.com/u/8057759/ib_mthca/xl-dmesg.xen.log
3. parameters of the loaded ib_mthca on xen
https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ib_mthca.xen.log
4. ibhosts on xen
https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ibhosts.xen.log
5. ibstat on xen
https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ibstat.xen.log
6. ibstatus on xen
https://dl.dropboxusercontent.com/u/8057759/ib_mthca/ibstatus.xen.log
7. lsmod | grep rdma on xen
https://dl.dropboxusercontent.com/u/8057759/ib_mthca/lsmod-rdma.xen.log
8. lspci -s 04:00.0 -k on xen
https://dl.dropboxusercontent.com/u/8057759/ib_mthca/lspci.xen.log
9. a cut from /var/log/syslog after the ib_rdma_lat crash on xen
https://dl.dropboxusercontent.com/u/8057759/ib_mthca/syslog.xen

Can anyone give me any advice, please?

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxx
http://lists.xen.org/xen-users