Re: [Xen-users] Re: Using Xen Virtualization Environment for Development and Testing of Supercomputing and High Performance Computing (HPC) Cluster MPICH2 MPI-2 Applications
- To: xen-devel@xxxxxxxxxxxxxxxxxxx, xen-users@xxxxxxxxxxxxxxxxxxx, "Mr. Teo En Ming \(Zhang Enming\)" <space.time.universe@xxxxxxxxx>
- From: Boris Derzhavets <bderzhavets@xxxxxxxxx>
- Date: Fri, 30 Oct 2009 01:41:57 -0700 (PDT)
- Cc: space.time.universe@xxxxxxxxx
- Delivery-date: Fri, 30 Oct 2009 01:43:06 -0700
- List-id: Xen user discussion <xen-users.lists.xensource.com>
What kind of tcpdump reports, obtained on Dom0 or on some other box on the LAN, bring you to this idea?
Wrong checksum offloading at the DomU front-end network driver does happen (in my experience with the RTL PCI Gigabit Ethernet 8110SC/8169 on SNV and OSOL; the RTL PCI-E Ethernet 8111SC, however, works fine), but not in every case.
> Virtualization Tip: Always disable checksumming on virtual ethernet devices
Why always?
Boris.
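For readers who would rather test the checksum-offload theory than disable offloading blindly, here is a minimal sketch, not part of the original posts. It only builds the two ethtool command lines involved: `ethtool -k` to show the current offload settings and `ethtool -K ... tx off` to turn off TX checksum offload. The interface name eth0 is taken from the thread; the helper names and the dry-run default are assumptions made for illustration.

```python
import subprocess

def show_offload_cmd(iface):
    # Lower-case -k lists the interface's current offload settings.
    return ["ethtool", "-k", iface]

def disable_tx_checksum_cmd(iface):
    # Upper-case -K changes settings; "tx off" disables TX checksum offload.
    return ["ethtool", "-K", iface, "tx", "off"]

def run(cmd, dry_run=True):
    # By default only return the command line, so nothing is changed by accident.
    if dry_run:
        return " ".join(cmd)
    # Requires root and the ethtool binary on the guest.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

print(run(show_offload_cmd("eth0")))
print(run(disable_tx_checksum_cmd("eth0")))
```

Comparing a tcpdump capture before and after the change is one way to see whether offloading is actually involved; if the errors stay the same, the cause is probably elsewhere.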
--- On Fri, 10/30/09, Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx> wrote:
From: Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx>
Subject: [Xen-users] Re: Using Xen Virtualization Environment for Development and Testing of Supercomputing and High Performance Computing (HPC) Cluster MPICH2 MPI-2 Applications
To: xen-devel@xxxxxxxxxxxxxxxxxxx, xen-users@xxxxxxxxxxxxxxxxxxx
Cc: space.time.universe@xxxxxxxxx
Date: Friday, October 30, 2009, 4:12 AM
Dear All,
I have googled something which may help to solve my problem.
[Xen-devel] Network drop on domU (netfront: rx->offset: 0, size: 4294967295)
http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01274.html
Virtualization Tip: Always disable checksumming on virtual ethernet devices
http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices
Let me try to work on it first.
--
Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics) BEng(Hons)(Mechanical Engineering)
Alma Maters: (1) Singapore Polytechnic (2) National University of Singapore
My blog URL: http://teo-en-ming-aka-zhang-enming.blogspot.com
My Youtube videos: http://www.youtube.com/user/enmingteo
Email: space.time.universe@xxxxxxxxx
MSN: teoenming@xxxxxxxxxxx
Mobile Phone (SingTel): +65-9648-9798
Mobile Phone (Starhub Prepaid): +65-8369-2618
Age: 31 (as at 30 Oct 2009) Height: 1.78 meters Race: Chinese Dialect: Hokkien Street: Bedok Reservoir Road Country: Singapore
On Fri, Oct 30, 2009 at 3:53 PM, Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx> wrote:
Hi,
I have reverted to the 2-node troubleshooting scenario. I have started node 1 and node 2.
On node 1, I will try to bring up the ring of mpd for the 2 nodes using mpdboot and try to execute mpiexec. On node 2, I will capture the tcpdump messages on virtual network interface eth0.
Please see attached PNG screenshots. They are numbered in sequence.
Please check if there are any problems.
Thank you.
--
On Fri, Oct 30, 2009 at 2:53 PM, Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx> wrote:
Dear All,
Here are more virtual network interface eth0 kernel messages. Notice the "net eth0: rx->offset: 0" messages. Are they of significance?
Node 1
Oct 30 22:40:34 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.253:1009 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:40:56 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.252:877 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:41:19 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.251:1000 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:41:41 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.250:882 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:42:04 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.249:953 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd starting; no mpdid yet
Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd has mpdid=enming-f11-pv-hpc-node0001_48545 (port=48545)
Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:40 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: __ratelimit: 12 callbacks suppressed
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Node 6
Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd starting; no mpdid yet
Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd has mpdid=enming-f11-pv-hpc-node0006_52805 (port=52805)
Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
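A note on the "size: 4294967295" value in the kernel messages above: it equals 2^32 - 1, which is what -1 looks like when a 32-bit field is printed as unsigned. That suggests, as an assumption not confirmed anywhere in this thread, that the driver is logging a negative error status rather than a real packet size. The arithmetic can be checked with a short sketch:

```python
import struct

size = 4294967295  # value copied from the "net eth0: rx->offset" messages

# The value is the largest number an unsigned 32-bit field can hold.
is_max_u32 = (size == 2**32 - 1)

# Reinterpret the same 32 bits as a signed integer.
signed = struct.unpack("<i", struct.pack("<I", size))[0]

print(is_max_u32, signed)
```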
Node 1 NFS Server Configuration
[root@enming-f11-pv-hpc-node0001 ~]# cat /etc/exports
/home/enming/mpich2-install/bin 192.168.1.0/24(ro)
Node 2 /etc/fstab Configuration Entry for NFS Client
192.168.1.254:/home/enming/mpich2-install/bin /home/enming/mpich2-install/bin nfs rsize=8192,wsize=8192,timeo=14,intr
On Fri, Oct 30, 2009 at 2:37 PM, Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx> wrote:
Dear All,
I have created a virtual high performance computing (HPC) cluster of 6 compute nodes with MPICH2 using Xen-based Fedora 11 Linux 64-bit paravirtualized (PV) domU guests. Dom0 is Fedora 11 Linux 64-bit. My Intel Desktop Board DQ45CB has a single onboard Gigabit LAN network adapter.
I am able to bring up the ring of mpd on the set of 6 compute nodes. However, I am consistently encountering the "(mpiexec 392): no msg recvd from mpd when expecting ack of request" error.
After much troubleshooting, I have found that there are Receive Errors (RX-ERR) in the virtual network interface eth0 of all the six compute nodes. All the 6 compute nodes are identical F11 linux 64-bit PV virtual machines.
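The receive-error counters mentioned above can also be read directly from /proc/net/dev on each node. The following is a minimal sketch, not from the original post: the function name is hypothetical and the sample text is made-up illustration data, not output from these machines. It relies only on the standard /proc/net/dev layout, in which the third receive column is the error count.

```python
def parse_rx_errors(proc_net_dev_text):
    """Return {interface: rx_errs} from text in /proc/net/dev format."""
    errors = {}
    # The first two lines are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, counters = line.split(":", 1)
        fields = counters.split()
        if len(fields) < 3:
            continue
        # Receive columns begin: bytes, packets, errs, ...
        errors[name.strip()] = int(fields[2])
    return errors

# Made-up sample in /proc/net/dev format, for illustration only.
SAMPLE = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  500000    4000   37    0    0     0          0         0   300000    3500    0    0    0     0       0          0
"""

print(parse_rx_errors(SAMPLE))
```

On a live guest the same function could be called as parse_rx_errors(open("/proc/net/dev").read()), once before and once after running mpiexec, to see whether the eth0 count grows during the failure.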
Here is my PV guest configuration for node 1:
[enming@fedora11-x86-64-host xen]$ cat enming-f11-pv-hpc-node0001
name="enming-f11-pv-hpc-node0001"
memory=512
disk = ['phy:/dev/virtualmachines/f11-pv-hpc-node0001,xvda,w' ]
vif = [ 'mac=00:16:3E:69:E9:11,bridge=eth0' ]
vfb = [ 'vnc=1,vncunused=1,vncdisplay=0,vnclisten=127.0.0.1,vncpasswd=' ]
vncconsole=1
bootloader = "/usr/bin/pygrub"
#kernel = "/home/enming/fedora11/vmlinuz"
#ramdisk = "/home/enming/fedora11/initrd.img"
vcpus=2
Will there be any problems with Xen networking for MPICH2 applications, or is this just a matter of fine-tuning Xen networking? I am using PV guests because PV guests have much higher performance than HVM guests.
Here are my mpich-discuss mailing list threads:
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005883.html
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005887.html
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005889.html
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005890.html
http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005891.html
Please advise on the RX-ERR.
Thank you very much.
_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users