Cluster Quick Start (DRAFT)
The following document is incomplete, but it contains some commonly
sought-after information about getting a cluster working.
Cluster Quick Start Version 0.1
===============================
Prepared by: Douglas Eadline
(deadline@plogic.com)
Update: 2/2/99
This document is distributed under GNU GENERAL PUBLIC LICENSE
Version 2, June 1991. Copies of the license can be found at
http://www.fsf.org/copyleft/gpl.html
Copyright (C) 1999 Paralogic, Inc., 130 Webster St, Bethlehem PA 18015
(http://www.plogic.com)
READ FIRST:
===========
Check: http://www.xtreme-machines.com/x-cluster-qs.html for updates!
This document assumes all machines have been installed with the standard
RedHat 5.1 or 5.2 release. Furthermore, all systems are assumed to boot from
their own hard drive and have a Fast Ethernet connection to a switch
controlling the private cluster network. One of the machines is assumed to be a
gateway node which normally attaches to a local area network.
There are other ways to boot and configure the hardware
(bootp and disk-less nodes), but this document does not cover these methods.
There are many ways to configure a cluster. This document describes one way.
This guide is not complete; it is really just a series of steps or issues
that need to be addressed before a cluster is usable.
Indeed, it is very difficult to create a "step by step"
set of instructions because every cluster seems to
be different.
I am sure I have missed many things. Please submit questions to
deadline@plogic.com with the subject "cluster quick start". I will either
incorporate the answer in the text or add it to the FAQ at the end.
Some things that are intentionally not included: building and installing
kernels, hardware selection, Linux installation, script or kick start
node builds, and packaging.
Comments and suggestions are welcome. Please send any CONSTRUCTIVE comments to
deadline@plogic.com with the subject "cluster quick start".
The Original Extreme Linux CD is now considered OBSOLETE. Other than
documentation, the RPMs on the CD should not be used.
See the Beowulf HOWTO for other important information and links to other
sources of information. You may also want to check the Beowulf mailing
list archives http://www.beowulf.org/listarchives/beowulf/ for more information.
Finally, we plan to prepare current RPMs for all software mentioned in
this document.
I should also mention that for those who need to buy a
Beowulf cluster, roll it out of the box, and turn it on, please see
http://www.xtreme-machines.com (i.e. if you do not want to read
and perform all the tasks in this document).
1.0 Introduction:
=================
1.1 What makes a cluster different from a network of workstations?
------------------------------------------------------------------
Aside from the philosophical issues, the main differences revolve around
security, application software, administration, booting, and file
systems. As delivered, most Linux distributions have very high security
enabled (for good reasons). This security often "gets in the way" of
cluster computing.
Application software uses underlying message passing systems like MPI and PVM.
There are many ways to express parallelism, but message passing is the most
popular and appropriate in many cases.
UNLESS AN APPLICATION WAS SPECIFICALLY WRITTEN FOR A CLUSTER ENVIRONMENT, IT
WILL ONLY RUN ON ONE CPU. YOU CANNOT TAKE APACHE OR MYSQL AND RUN THEM ACROSS
MULTIPLE CPUS IN A CLUSTER UNLESS SPECIFIC PARALLEL SOURCE CODE VERSIONS
HAVE BEEN CREATED.
Fortunately there are tools to help administer a cluster. For instance,
rebooting 32 nodes by hand is not where you want to be spending your time. Part
of administration means relaxing some security within the cluster. Allowing root
to rlogin to any machine is very valuable for a cluster, but very dangerous
for an open network.
Although disk-less booting from a host node is appealing, it requires a bit more
configuration than letting each node boot off its own hard drive. Disk-less
booting will not be covered at the present time.
Finally, file systems should be set up so that at a minimum /home is shared
across all the nodes. Other file systems can be shared as required by the user.
1.2 General Setup
=================
As mentioned, this document assumes a very simple configuration.
The cluster consists of two or more Linux boxes connected through
a Fast Ethernet switch (a hub will also work, but it is slower).
One node is usually a gateway that connects to a local LAN.
The gateway node is often the "login node", which has a keyboard,
monitor, and mouse. The rest of the nodes are "headless" (no
keyboard, mouse, or monitor) and are often referred to as "compute nodes".
The following diagram describes the general setup.
                         -------------------
                         |     SWITCH      |
                         -------------------
                           |    |    |    |
         +-----------------+    |    |    |
         |        +-------------+    |    |
         |        |        +---------+    |
         |        |        |              |
      --------  --------  --------  ----------------
      | node |  | node |  | node |  |  HOST NODE   |--(gateway)-- LAN ---->
      | four |  | three|  | two  |  | (login node) |
      --------  --------  --------  ----------------
                                        node one
                                           |
                                 Keyboard/Mouse/Monitor
1.3 Assumed Competence
======================
Setting up and configuring a cluster requires certain Linux administration skills.
An introduction to these skills is far beyond the scope of this document.
Basically, knowledge of networking, addressing, booting, building kernels,
Ethernet drivers, NFS, package installation, compiling, and a few other areas
will be required for successful administration. See the LDP (Linux Documentation
Project http://metalab.unc.edu/LDP/) for excellent documentation.
You may also look at the Beowulf HOWTO for other links (http://metalab.unc.edu/LDP/HOWTO/Beowulf-HOWTO-5.html#ss5.6).
2.0 Hardware Issues:
====================
In general, this document only covers hardware issues that are important to
getting the cluster up and running.
2.1 Network Interface Cards:
----------------------------
For best performance, Intel or DEC tulip-based (2114X) NICs should be used.
These NICs provide "wire speed" performance. Driver availability can be
a big issue, particularly for tulip NICs.
Please consult the Linux-Tulip Information page
(http://maximus.bmen.tulane.edu/~siekas/tulip.html) for complete instructions
on how to install the latest tulip driver. Also, please see Don Becker's
"Linux and the DEC Tulip Chip" page for additional information
(http://cesdis.gsfc.nasa.gov/linux/drivers/tulip.html).
2.2 Addressing:
---------------
Within a cluster, nodes should be addressed using reserved private IP addresses.
These addresses should never be made "visible to the Internet" and should
use numbers in the range 192.168.X.X.
2.3 Gateway Node:
-----------------
It is also assumed that one of the nodes in the cluster will act as a gateway
to a local area network. This node will have two network connections
(probably two NICs). One NIC will be addressed as 192.168.X.X, while the
other will have an IP address on your internal LAN. It is possible to route
through the gateway (i.e. to connect to a file server).
2.4 Switch Configuration:
-------------------------
Most switches can be managed by assigning them an IP address and then
administering them by telneting to the switch. You can use something like
192.168.0.254 if you assign your nodes addresses like
192.168.0.1, 192.168.0.2, ... Switches should have any spanning tree
features disabled and all ports should come up at 100BTx Full Duplex.
If this is not the case, check that the NICs and the switch are
negotiating correctly. See "NIC and Switch Problems" below.
2.5 Direct Access:
------------------
At some point you will want or need to access a node directly through an
attached monitor and keyboard, usually due to a network problem.
This can be done the old-fashioned way, using an extra monitor and keyboard
and moving them to the desired node. Another way is to buy a KVM
(keyboard/video/mouse) switch that allows you to share a single keyboard,
mouse, and monitor between all systems. With either a push button
or a hot-key you can quickly move between machines
and fix problems.
3.0 Security Issues:
====================
It is recommended that these security changes be made ONLY on the non-gateway
nodes. In this way the security of the gateway is not compromised.
3.1 .rhosts versus hosts.equiv
------------------------------
There are two ways to eliminate passwords within the cluster.
You can either add an entry to the /etc/hosts.equiv file or
add a .rhosts file in each user's home directory.
The .rhosts file is preferable because, with /home NFS-mounted across the
cluster (see Section 5), there is only one copy per user to maintain.
/etc/hosts.equiv must be maintained on each node of the
cluster, which may lead to an administration headache when users are
added or deleted.
The format of the .rhosts file is simply a list of hosts:
# .rhosts file for coyote cluster
# must be read/writable by user only!
coyote1
coyote2
coyote3
coyote4
The format of the hosts.equiv file is:
#hosts.equiv file for coyote cluster
#node name user name
coyote1 deadline
coyote2 deadline
coyote3 deadline
coyote4 deadline
coyote1 wgates
coyote2 wgates
coyote3 wgates
coyote4 wgates
3.2 root rlogin Access:
-----------------------
To allow root to rlogin to any node in the cluster, add
a .rhosts file to root's home directory (/root) on each node. The .rhosts file
should list all the nodes in the cluster. IMPORTANT: The .rhosts file must
be readable/writable only by the owner ("chmod go-rwx .rhosts").
Again, this should not be done on the gateway node.
In addition, swap the first two lines of /etc/pam.d/rlogin:
#original /etc/pam.d/rlogin
auth required /lib/security/pam_securetty.so
auth sufficient /lib/security/pam_rhosts_auth.so
auth required /lib/security/pam_pwdb.so shadow nullok
auth required /lib/security/pam_nologin.so
account required /lib/security/pam_pwdb.so
password required /lib/security/pam_cracklib.so
password required /lib/security/pam_pwdb.so shadow nullok use_authtok
session required /lib/security/pam_pwdb.so
#modified /etc/pam.d/rlogin (first two lines swapped)
auth sufficient /lib/security/pam_rhosts_auth.so
auth required /lib/security/pam_securetty.so
auth required /lib/security/pam_pwdb.so shadow nullok
auth required /lib/security/pam_nologin.so
account required /lib/security/pam_pwdb.so
password required /lib/security/pam_cracklib.so
password required /lib/security/pam_pwdb.so shadow nullok use_authtok
session required /lib/security/pam_pwdb.so
NOTE: I do not know if there is a better way to do this, but it seems to work.
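To create the /root/.rhosts file described above, a minimal sketch is shown
here; the node names follow the sample /etc/hosts in Section 4, so adjust them
to your own cluster. Run it on each non-gateway node as root:
# create /root/.rhosts listing every node, then restrict its permissions
cat > /root/.rhosts << EOF
node1
node2
node3
node4
EOF
chmod go-rwx /root/.rhosts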
3.3 root telnet Access:
-----------------------
On every node except the gateway, add the following entries to
the /etc/securetty file:
ttyp0
ttyp1
ttyp2
ttyp3
ttyp4
This change will allow root to log in via telnet to any node in the cluster.
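If root rsh access (Section 3.2) is already working, a hedged sketch of making
this change on all compute nodes at once (the node names are the examples from
Section 4):
# append the pseudo-terminal entries to /etc/securetty on each compute node
for n in node2 node3 node4 ; do
  rsh $n 'for t in ttyp0 ttyp1 ttyp2 ttyp3 ttyp4 ; do echo $t >> /etc/securetty ; done'
done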
3.4 root ftp Access:
--------------------
On any system that needs root ftp access, the /etc/ftpusers file must have
the entry for root commented out:
# root is commented out to allow ftp access as root to this system
#root
bin
daemon
adm
lp
sync
shutdown
halt
mail
news
uucp
operator
games
nobody
4.0 Sample hosts file
=====================
The following is a sample /etc/hosts file for an 8-node
cluster with an addressable switch. It is assumed
that each node has the correct IP address assigned
when Linux was installed or set using the network
administration tool.
#sample /etc/hosts
127.0.0.1 localhost localhost.cluster
192.168.0.1 node1 node1.cluster
192.168.0.2 node2 node2.cluster
192.168.0.3 node3 node3.cluster
192.168.0.4 node4 node4.cluster
192.168.0.5 node5 node5.cluster
192.168.0.6 node6 node6.cluster
192.168.0.7 node7 node7.cluster
192.168.0.8 node8 node8.cluster
192.168.0.254 switch
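Once the file is correct on the host node, you can push it to the other nodes.
A hedged sketch, assuming root rsh/rcp access (Section 3.2) is already working
and the node names above:
# copy /etc/hosts from the host node to the other nodes
for n in node2 node3 node4 node5 node6 node7 node8 ; do
  rcp /etc/hosts ${n}:/etc/hosts
done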
5.0 User Accounts and File Systems:
===================================
Each user needs an account on every node they will want
to use. To ease administration problems, /home from
a host node (usually the gateway or login node) is
mounted on every node using NFS.
5.1 Mounting /home across the cluster
-------------------------------------
The /home directory needs to be cross-mounted throughout
the cluster. All nodes other than the host (presumably
the gateway) should have nothing in /home. (This means
you should not add users on the nodes; see Section 5.2 below.)
To mount /home, make the following entry in /etc/fstab on all
the nodes (not the host):
hostnode:/home /home nfs bg,rw,intr 0 0
where "hostnode: is the name of your host node.
While you are at, if the host has a CDROM you can add the following
line to access the CD-ROM via NFS (make sure each node has a /mnt/cdrom
directory)
hostnode:/mnt/cdrom /mnt/cdrom nfs noauto,ro,soft 0 0
Next you will need to modify /etc/exports on the host node.
You will need two entries that look like this:
#allow nodes to mount /home and read CDROM
/home node1(rw) node2(rw) node3(rw)
/mnt/cdrom node1(ro) node2(ro) node3(ro)
Restart NFS and mountd. (Use ps to find the PIDs of rpc.nfsd and rpc.mountd and
then enter "kill -HUP pid" for each PID.)
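A minimal sketch of the same thing on the host node; it assumes the stock
RedHat user-space NFS daemons and the pidof utility:
# signal rpc.nfsd and rpc.mountd to re-read /etc/exports
kill -HUP `pidof rpc.nfsd` `pidof rpc.mountd`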
If all goes well, you can issue a "mount /home" on any node
and /home should be mounted. If not, check /var/log/messages
for any errors and see the mount man page for more information.
When the system boots it will automatically mount /home, but
not the CDROM. To mount the CDROM on a node, just enter "mount /mnt/cdrom"
and the contents of the CDROM will be visible under /mnt/cdrom.
If you have problems, check /var/log/messages and the man pages for
mount and nfs.
5.2 Add user accounts
---------------------
Add user accounts on the host node as you would on a single
Linux workstation. A quick way to add users to
each node is to copy the entry for the user from
/etc/passwd on the host node and paste it into the /etc/passwd
file on each node. Make sure that the user and group IDs
are the same for each user throughout the cluster.
The user should now have login privileges throughout the cluster.
Another way is to use NIS (to be completed)
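As an illustration of the copy-and-paste approach, here is a hedged one-liner;
the user "deadline" and the node "node2" are only examples, and root rsh access
(Section 3.2) is assumed:
# append one user's passwd entry from the host node to a compute node
grep '^deadline:' /etc/passwd | rsh node2 'cat >> /etc/passwd'
Repeat for each node (and copy the matching /etc/group entry the same way if
the user's group does not already exist there).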
6.0 Administration: CMS
=======================
There is a package called CMS (Cluster Management System). It is available
from http://smile.cpe.ku.ac.th/software/scms/index.html. This version is new,
and we have not had time to test it. The previous version worked
well except for the remote (real time) monitoring. It does include
a system reboot and shutdown feature.
7.0 Compilers:
==============
We recommend using the egcs compilers (including g77) for compiling codes.
Source: http://egcs.cygnus.com/
Version: egcs-1.1.1 gzip format
Once compiled and installed, the egcs compilers should reside
in /usr/local. In this way users can set their path
to point to the appropriate versions (i.e. the standard gcc is
in /usr/bin while the egcs gcc is in /usr/local/bin)
Note: use gcc (not egcs gcc) to build kernels.
"gcc -v" and "which gcc" can tell you which versions you are using.
g77 is the egcs FORTRAN compiler.
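For reference, here is a hedged sketch of the usual GNU-style build with
/usr/local as the prefix; the archive and directory names are assumptions based
on the version listed above:
# unpack, configure, and build egcs in a separate build directory
tar xzf egcs-1.1.1.tar.gz
mkdir egcs-build
cd egcs-build
../egcs-1.1.1/configure --prefix=/usr/local
make bootstrap
make install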
8.0 Communication Packages
==========================
This is not an exhaustive list. These are the most common packages.
A cluster is a collection of local memory machines. The only way
for node A to communicate to node B is through the network. Software
built on top of this architecture "passes messages" between nodes.
While message passing codes are conceptually simple, their operation
and debugging can be quite complex.
There are two popular message passing libraries that are used: PVM and MPI.
8.1 PVM versus MPI
------------------
Both PVM and MPI provide a portable software API that supports message passing.
From a historical standpoint, PVM (Parallel Virtual Machine) came first and
was designed to work on networks of workstations. It has since been
adapted to many parallel supercomputers (that may or may not use distributed
memory). Control of PVM is primarily with its authors.
MPI on the other hand is a standard that is supported by many hardware vendors.
It provides a bit more functionality than PVM and has versions for
networks of workstations (clusters). Control of MPI is with the standards
committee.
For many cases there does not seem to be a compelling reason to
choose PVM over MPI. Many people choose MPI because it is
a standard, but the PVM legacy lives on. We offer the source
for each in this document.
MPI:
----
There are two freely available versions of MPI (Message Passing Interface).
MPICH:
Source: http://www-unix.mcs.anl.gov/mpi/mpich/
Version: mpich.tar.gz
Notes: People (including us) have noted some problems with the Linux version.
(A minimal compile-and-run sketch appears at the end of this section.)
LAM-MPI:
Source: http://www.mpi.nd.edu/lam/
Version: lam61.tar.gz
Notes: install the patches (lam61-patch.tar). Performance of LAM
is quite good using the -c2c mode.
PVM:
----
Version: pvm3/pvm3.4.beta7.tgz
Source: http://www.epm.ornl.gov/pvm/
Notes: There are a lot of PVM codes and examples out there.
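As promised above, a minimal compile-and-run sketch for MPICH; mpicc and
mpirun are MPICH's standard wrapper and launcher, but the paths, the process
count, and "program.c" are assumptions for illustration:
# compile an MPI program and start four processes (uses MPICH's default machines file)
mpicc -o program program.c
mpirun -np 4 program
(LAM uses the same mpirun name with different options; see Section 13.)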
9.0 Conversion software:
========================
Converting existing software to parallel form is a time-consuming task.
Automatic conversion is very difficult. Tools are available for
FORTRAN conversion. Converting C is much more difficult due
to pointers.
One tool for converting FORTRAN codes (made by us) is called BERT
and works on Linux systems. A free Lite version is available from
http://www.plogic.com/bert.html
10.0 Sample .cshrc
===================
#Assumes LAM-MPI, PVM and MPICH are installed
setenv LAMHOME /usr/local/lam61
setenv PVM_ROOT /usr/local/pvm3
setenv PVM_ARCH LINUX
setenv MPIR_HOME /usr/local/mpich
set path = (. $path)
# use egcs compilers first
set path = (/usr/local/bin $path)
set path = ($path /usr/local/pvm3/lib/LINUX)
set path = ($path /usr/local/lam61/bin)
set path = ($path /usr/local/mpich/lib/LINUX/ch_p4)
11. Benchmarking and Testing Your Systems
=========================================
11.1 Network performance: netperf
---------------------------------
Source: http://www.netperf.org/netperf/NetperfPage.html
Run Script:
./netperf -t UDP_STREAM -p 12865 -n 2 -l 60 -H NODE -- -s 65535 -m 1472
./netperf -t TCP_STREAM -p 12865 -n 2 -l 60 -H NODE
NODE is the remote node name.
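Note that netperf talks to a netserver process on the remote node (port 12865
is its default). A hedged way to start one by hand, assuming netserver is on
the remote path:
# start the netperf server on the remote node
rsh NODE netserver -p 12865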
11.2: Network Performance: netpipe
----------------------------------
Source: http://www.scl.ameslab.gov/Projects/Netpipe/
11.3 Parallel Performance: NAS Parallel Benchmarks
----------------------------------------------------
Source: http://www.nas.nasa.gov/NAS/NPB/
12. Ethernet Channel Bonding:
============================
Background on Channel Bonding is available from:
http://www.beowulf.org/software/software.html
Requirements:
TWO Ethernet NICs per system.
TWO hubs (one for each channel) OR two switches (one for each channel)
OR a switch that can be segmented into virtual LANs.
Steps (for kernel 2.0.36):
1. Download and build the ifenslave.c program:
(http://beowulf.gsfc.nasa.gov/software/bonding.html)
Comment out line 35 "#include " and compile using
"gcc -Wall -Wstrict-prototypes -O ifenslave.c -o ifenslave"
2. Apply the kernel patch (get linux-2.0.36-channel-bonding.patch from
ftp://ftp.plogic.com), run xconfig and enable Beowulf Channel bonding.
3. Rebuild and install the kernel.
Each channel must be on a separate switch or hub (or a segmented switch).
There is no need to assign an IP number to the second interface, although
using it as a separate network (without channel bonding) may have advantages
for some applications.
To channel bond, login to each system as root and issue the following
command on each system:
./ifenslave -v eth0 eth1
This will bond eth1 to eth0. This assumes:
eth0 is already configured and used as your cluster network. eth1 is
the second Ethernet card detected by the OS at boot.
You can do this from the host node by enslaving
all nodes BEFORE the host (order is important: node 2 must be enslaved before the
host, node 1). For each node, do steps a, b, and c.
a. open a window
b. login to node 2
c. enter (as root) the above command.
d. in a separate window enter (as root) the above command for node 1.
Your cluster should now be "channel bonded". You can test this by
running netperf or a similar benchmark.
Channel bonding shutdown is not as simple. We are investigating this and will
provide command line tools that execute channel bonding set-up and tear down
automatically. In the meantime, the safest way to restore single channel
performance is to either reboot each system or use the network manager (part of
the control panel) to shutdown and restart each interface.
REMEMBER: communication between a channel bonded node and a non channel
bonded node is very slow, if not impossible. Therefore the whole cluster
must be channel bonded.
13.0 Quick Primer on LAM:
==========================
LAM is implemented for networks of workstations.
In order to run LAM, a LAM daemon must be started on each node.
The daemon is very powerful for test and debugging purposes
(LAM can provide real-time debugging information, including
deadlock conditions). Once code is working, however,
programs can be run using a standard socket interface
for maximum speed. The daemon is still used for start-up
and tear-down.
13.1 Starting the LAM daemons.
----------------------------
From your home directory enter:
lamboot -v lamhosts
This boots the daemons based on the "lamhosts" file (which
is just a list of your machines; a sample is shown below). The "-v" option
prints debug information - it is nice to see what it is
doing. NOTE: the LAM daemons will stay "resident"
even when you logout.
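A sample lamhosts file might look like this (the node names are the examples
from Section 4):
# lamhosts - one machine per line
node1
node2
node3
node4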
13.2 To see if the daemons are running:
-------------------------------------
Either:
"ps auxww|grep lam"
which will show all the lamd (LAM daemon) processes running. NOTE:
each user can have their own LAM daemons running (a nice feature).
Or:
"mpitask"
Which will print:
TASK (G/L) FUNCTION PEER|ROOT TAG COMM COUNT
if no jobs are running. If a job is running you will see
jobs listed here.
13.3 Cleaning up LAM:
---------------------
If you are concerned about the "state" of LAM, possibly due
to a terminated job, you can "clean up" your daemons
by entering the "lamclean" command. You can shut down
your LAM daemons by issuing a "wipe lamhosts", where
lamhosts is the host file you used to boot LAM.
13.4 Running programs
---------------------
Running programs uses the mpirun command, as with MPICH,
but there are some options that should be used:
mpirun -O -c 2 -s h -c2c program
-O = assume homogeneous environment (no special encoding)
-c = how many copies to run (in this case 2). NOTE:
The -c option assigns programs "round robin"
using the order specified in the host file.
If "-c" is greater than the number of machines in your host
file, LAM will start overloading the machines
with jobs in the order specified in your host file.
-s = source of executable, this is handy, but with NFS
not really necessary. The "h" means get the executable
from the host (node where you started LAM)
-c2c = use the client-to-client "socket" mode. This makes
LAM go fast, but the daemons are not used for communication
and therefore cannot provide run time debugging or trace
information - which is fine once you have your application
running.
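For completeness, a hedged sketch of compiling a program with LAM's hcc
wrapper before launching it as above; the wrapper name and the -lmpi flag
reflect LAM 6.1, and "program.c" is a placeholder:
# compile with LAM's wrapper, then start two copies in c2c mode
hcc -O -o program program.c -lmpi
mpirun -O -c 2 -s h -c2c program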
13.5 Other information:
-----------------------
You can "man" the following topics: mpirun, wipe, lamclean, lamboot.
You also may want to consult:
/usr/local/src/lam61/doc/mpi-quick-tut/lam_ezstart.tut
and:
http://www.mpi.nd.edu/lam/
for more information.
14.0 Quick Primer on PVM
=========================
check out: http://netlib.org/pvm3/book/pvm-book.html
15.0 Configuring switches:
==========================
As an example, here is the main menu of a BayStack 350T switch:
BayStack 350T Main Menu
IP Configuration...
SNMP Configuration...
System Characteristics...
Switch Configuration...
Console/Service Port Configuration...
Spanning Tree Configuration...
TELNET Configuration...
Software Download...
Display Event Log
Reset
Reset to Default Settings
Logout
Use arrow keys to highlight option, press <Return> or <Enter> to select option.
For cluster use, the menu items of most interest are IP Configuration (so the
switch can be administered by telnet) and Spanning Tree Configuration (to
disable spanning tree, as recommended in Section 2.4).
16.0 Clock Synchronization
--------------------------
There are some problems with the 2.0.X SMP kernels and
clock drift. It is due to an interrupt problem. The best solution
is to use xntp and synchronize to an external source. In any case,
it is best to have your cluster's clocks synchronized. Here is how to
use xntp.
1. set all system clocks to the current time.
2. write the time to the CMOS RTC (Real Time Clock) using the
command "clock -w"
3. mount the cdrom on each system (mount /mnt/cdrom, see Sec. 5 if it
does not work)
4. go to /mnt/cdrom/RedHat/RPMS
5. as root, enter "rpm -i xntp3-5.93-2.i386.rpm"
6. now edit the file /etc/ntp.conf
ON ALL SYSTEMS
comment out the lines (as shown):
#multicastclient # listen on default 224.0.1.1
#broadcastdelay 0.008
ON NON-HOST SYSTEMS (every one but the host)
Edit the lines:
server HOSTNODE # local clock
#fudge 127.127.1.0 stratum 0
where HOSTNODE is the name of the host node.
Save the /etc/ntp.conf file on each node.
7. Start xntpd on all systems: "/sbin/xntpd"
You can start this each time you boot by adding
this to your /etc/rc.d/rc.local file.
It will take some time for synchronization, but you can see
messages from xntpd in /var/log/messages.
What you have just done is tell the host node to run
xntp and use its local system clock as the reference.
In addition, all the other nodes in the cluster will get their
time from the host.
Although xntpd is supposed to keep the system clock and the RTC
in sync, it is probably a good idea to sync them once a day.
You can do this by (as root) going to /etc/cron.daily
and creating a file called "sync_clocks" that contains the following:
# Assumes ntp is running, so sync the CMOS RTC to OS system clock
/sbin/clock -w
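The file also needs to be executable so that cron's run-parts will pick it up
(a small assumption about the stock cron setup):
chmod +x /etc/cron.daily/sync_clocks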
Now all the clocks in your cluster should be sync'ed and use
the host as a reference. If you wish to use an external reference,
consult the xntpd documentation.
17.0 NIC and Switch Problems
----------------------------
(to be completed)
18.0 BEOWULF FAQ:
-----------------
1. What is the difference between the Extreme Linux CD and a standard
Red Hat distribution?
2. Can I build a cluster with Extreme Linux and run Oracle 8 (or whatever)
across the cluster?
3.
--------------------
This page, and all contents, are
Copyright (c) 1999 by Paralogic Inc., Bethlehem, PA, USA, All Rights
Reserved. This notice must appear on all copies (electronic, paper, or
otherwise) of this document.