
Re: [Xen-users] Re: Using Xen Virtualization Environment for Development and Testing of Supercomputing and High Performance Computing (HPC) Cluster MPICH2 MPI-2 Applications


  • To: xen-devel@xxxxxxxxxxxxxxxxxxx, xen-users@xxxxxxxxxxxxxxxxxxx, "Mr. Teo En Ming \(Zhang Enming\)" <space.time.universe@xxxxxxxxx>
  • From: Boris Derzhavets <bderzhavets@xxxxxxxxx>
  • Date: Fri, 30 Oct 2009 01:41:57 -0700 (PDT)
  • Cc: space.time.universe@xxxxxxxxx
  • Delivery-date: Fri, 30 Oct 2009 01:43:06 -0700
  • List-id: Xen user discussion <xen-users.lists.xensource.com>

What kind of tcpdump reports, obtained on Dom0 or on some other box on the LAN,
bring you to this idea?

Wrong checksum offloading at the DomU front-end network driver does happen (in my experience with the RTL PCI Gigabit Ethernet 8110SC/8169 on SNV and OSOL;
the RTL PCI-E Ethernet 8111SC, however, works fine), but not necessarily.

> Virtualization Tip: Always disable checksumming on virtual ethernet devices

Why always?

Boris.


--- On Fri, 10/30/09, Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx> wrote:

From: Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx>
Subject: [Xen-users] Re: Using Xen Virtualization Environment for Development and Testing of Supercomputing and High Performance Computing (HPC) Cluster MPICH2 MPI-2 Applications
To: xen-devel@xxxxxxxxxxxxxxxxxxx, xen-users@xxxxxxxxxxxxxxxxxxx
Cc: space.time.universe@xxxxxxxxx
Date: Friday, October 30, 2009, 4:12 AM

Dear All,

I have googled something which may help to solve my problem.

[Xen-devel] Network drop on domU (netfront: rx->offset: 0, size: 4294967295)

http://lists.xensource.com/archives/html/xen-devel/2009-05/msg01274.html

Virtualization Tip: Always disable checksumming on virtual ethernet devices


http://hightechsorcery.com/2008/03/virtualization-tip-always-disable-checksumming-virtual-ethernet-devices


Let me try to work on it first.
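
A minimal sketch of what trying the tip might look like inside a domU, assuming the tip refers to TX checksum offload toggled with ethtool (the linked post may describe it differently) and that the guest interface is eth0. The helper name csum_offload_cmds is hypothetical; by default it only prints the commands rather than running them.

```shell
# Hypothetical helper: show, then disable, TX checksum offload on an interface.
# Dry run by default (commands are echoed); pass "env" or "sudo" as the
# second argument to actually execute them.
csum_offload_cmds() {
    iface="${1:-eth0}"
    run="${2:-echo}"
    $run ethtool -k "$iface"          # list current offload settings
    $run ethtool -K "$iface" tx off   # turn TX checksum offload off
}

csum_offload_cmds eth0
```

A change made with ethtool -K does not survive a reboot, so it is easy to revert if it makes no difference.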

--
Mr. Teo En Ming (Zhang Enming) Dip(Mechatronics) BEng(Hons)(Mechanical Engineering)
Alma Maters:
(1) Singapore Polytechnic
(2) National University of Singapore
My blog URL: http://teo-en-ming-aka-zhang-enming.blogspot.com
My Youtube videos: http://www.youtube.com/user/enmingteo
Email: space.time.universe@xxxxxxxxx
MSN: teoenming@xxxxxxxxxxx
Mobile Phone (SingTel): +65-9648-9798
Mobile Phone (Starhub Prepaid): +65-8369-2618
Age: 31 (as at 30 Oct 2009)
Height: 1.78 meters
Race: Chinese
Dialect: Hokkien
Street: Bedok Reservoir Road
Country: Singapore

On Fri, Oct 30, 2009 at 3:53 PM, Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx> wrote:
Hi,

I have reverted to the 2-node troubleshooting scenario. I have started node 1 and node 2.

On node 1, I will try to bring up the ring of mpd for the 2 nodes using mpdboot and try to execute mpiexec. On node 2, I will capture the tcpdump messages on virtual network interface eth0.
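
The two-node test described above could be driven roughly as follows. This is only a sketch: the hosts file name mpd.hosts and both helper names are assumptions, and the helpers print the commands unless told to run them.

```shell
# Hypothetical dry-run helpers for the 2-node test.
# mpd.hosts is an assumed file name (one hostname per line).
node1_cmds() {
    run="${1:-echo}"
    $run mpdboot -n 2 -f mpd.hosts   # start the mpd ring on 2 hosts
    $run mpdtrace                    # list the mpds that joined the ring
    $run mpiexec -n 2 hostname       # launch a trivial 2-process job
}

node2_cmds() {
    run="${1:-echo}"
    $run tcpdump -i eth0 -nn -vv     # verbose capture, no name resolution
}
```

With -vv, tcpdump also reports whether each checksum it sees is correct. Checksums flagged as incorrect on packets captured at the sending side can be a harmless side effect of TX offload, so a capture taken on the receiving node or on Dom0 is more telling.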

Please see attached PNG screenshots. They are numbered in sequence.

Please check if there are any problems.

Thank you.


On Fri, Oct 30, 2009 at 2:53 PM, Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx> wrote:
Dear All,

Here are more virtual network interface eth0 kernel messages. Notice the "net eth0: rx->offset: 0" messages. Are they of significance?

Node 1

Oct 30 22:40:34 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.253:1009 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:40:56 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.252:877 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:41:19 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.251:1000 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:41:41 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.250:882 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:42:04 enming-f11-pv-hpc-node0001 mountd[1304]: authenticated mount request from 192.168.1.249:953 for /home/enming/mpich2-install/bin (/home/enming/mpich2-install/bin)
Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd starting; no mpdid yet
Oct 30 22:42:34 enming-f11-pv-hpc-node0001 mpd: mpd has mpdid=enming-f11-pv-hpc-node0001_48545 (port=48545)
Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:37 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:38 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:39 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:40 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: __ratelimit: 12 callbacks suppressed
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:46 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:47 enming-f11-pv-hpc-node0001 kernel: net eth0: rx->offset: 0, size: 4294967295

Node 6

Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:44 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd starting; no mpdid yet
Oct 30 22:42:48 enming-f11-pv-hpc-node0006 mpd: mpd has mpdid=enming-f11-pv-hpc-node0006_52805 (port=52805)
Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
Oct 30 22:46:00 enming-f11-pv-hpc-node0006 kernel: net eth0: rx->offset: 0, size: 4294967295
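
One observation on the messages above: the reported size, 4294967295, equals 2^32 - 1, which is how -1 appears when printed as an unsigned 32-bit integer. That would be consistent with the driver logging a negative status value, though this has not been checked against the netfront source here. A small sketch for counting these messages per host, assuming the syslog lines have been saved to a file (the function names are hypothetical):

```shell
# Count "rx->offset" kernel messages per host in a syslog-format file.
# In lines like the ones above, field 4 is the hostname.
count_rx_offset() {
    grep -F 'rx->offset' "$1" |
        awk '{ n[$4]++ } END { for (h in n) print h, n[h] }' |
        sort
}

# 2^32 - 1: the value of -1 when read as an unsigned 32-bit integer.
max_u32() {
    echo $(( (1 << 32) - 1 ))
}
```

For example, count_rx_offset /var/log/messages prints one line per host with the number of matching messages.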

Node 1 NFS Server Configuration

[root@enming-f11-pv-hpc-node0001 ~]# cat /etc/exports
/home/enming/mpich2-install/bin        192.168.1.0/24(ro)

Node 2 /etc/fstab Configuration Entry for NFS Client

192.168.1.254:/home/enming/mpich2-install/bin    /home/enming/mpich2-install/bin    nfs    rsize=8192,wsize=8192,timeo=14,intr



On Fri, Oct 30, 2009 at 2:37 PM, Mr. Teo En Ming (Zhang Enming) <space.time.universe@xxxxxxxxx> wrote:
Dear All,

I have created a virtual high performance computing (HPC) cluster of 6 compute nodes with MPICH2 using Xen-based Fedora 11 Linux 64-bit paravirtualized (PV) domU guests. Dom0 is Fedora 11 Linux 64-bit. My Intel Desktop Board DQ45CB has a single onboard Gigabit LAN network adapter.

I am able to bring up the ring of mpd on the set of 6 compute nodes. However, I am consistently encountering the "(mpiexec 392): no msg recvd from mpd when expecting ack of request" error.

After much troubleshooting, I have found Receive Errors (RX-ERR) on the virtual network interface eth0 of all six compute nodes. All six compute nodes are identical Fedora 11 Linux 64-bit PV virtual machines.
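
For reference, a minimal sketch of reading the receive-error counter directly from sysfs on a Linux guest, assuming the kernel exposes /sys/class/net/<iface>/statistics/rx_errors. The function name rx_errors and the SYSFS_NET override are hypothetical; the override exists only so the function can be tried against a test directory.

```shell
# Print the receive-error counter for an interface (default: eth0).
# SYSFS_NET may be set to point at a different base directory for testing.
rx_errors() {
    iface="${1:-eth0}"
    cat "${SYSFS_NET:-/sys/class/net}/$iface/statistics/rx_errors"
}
```

Running rx_errors eth0 before and after an mpiexec attempt shows whether the counter grows during the failing step.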

Here is my PV guest configuration for node 1:

[enming@fedora11-x86-64-host xen]$ cat enming-f11-pv-hpc-node0001
name="enming-f11-pv-hpc-node0001"
memory=512
disk = ['phy:/dev/virtualmachines/f11-pv-hpc-node0001,xvda,w' ]
vif = [ 'mac=00:16:3E:69:E9:11,bridge=eth0' ]
vfb = [ 'vnc=1,vncunused=1,vncdisplay=0,vnclisten=127.0.0.1,vncpasswd=' ]
vncconsole=1
bootloader = "/usr/bin/pygrub"
#kernel = "/home/enming/fedora11/vmlinuz"
#ramdisk = "/home/enming/fedora11/initrd.img"
vcpus=2



Will there be any problems with Xen networking for MPICH2 applications, or is this just a matter of fine-tuning Xen networking? I am using PV guests because PV guests have much higher performance than HVM guests.

Here are my mpich-discuss mailing list threads:

http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005883.html

http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005887.html

http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005889.html

http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005890.html

http://lists.mcs.anl.gov/pipermail/mpich-discuss/2009-October/005891.html

Please advise on the RX-ERR.

Thank you very much.

_______________________________________________
Xen-users mailing list
Xen-users@xxxxxxxxxxxxxxxxxxx
http://lists.xensource.com/xen-users
