This is not an exhaustive list. These are the most common packages. A cluster is a collection of local-memory machines. The only way for node A to communicate with node B is through the network. Software built on top of this architecture "passes messages" between nodes. While message passing codes are conceptually simple, their operation and debugging can be quite complex.
There are two popular message passing libraries in use: PVM and MPI.
Both PVM and MPI provide a portable software API that supports message passing. From a historical standpoint, PVM (Parallel Virtual Machine) came first and was designed to work on networks of workstations. It has since been adapted to many parallel supercomputers (which may or may not use distributed memory). Control of PVM rests primarily with its authors.
MPI, on the other hand, is a standard that is supported by many hardware vendors. It provides a bit more functionality than PVM and has versions for networks of workstations (clusters). Control of MPI rests with the standards committee.
For many applications there does not seem to be a compelling reason to choose PVM over MPI. Many people choose MPI because it is a standard, but the PVM legacy lives on. We provide sources for each in this document.
There are two freely available versions of MPI (Message Passing Interface).
MPICH:
LAM-MPI:
LAM is implemented for networks of workstations. To run LAM, a LAM daemon must be started on each node. The daemon is very useful for testing and debugging (LAM can provide real-time debugging information, including deadlock conditions). Once the code is working, however, programs can be run over a standard socket interface for maximum speed. The daemon is still used for start-up and tear-down.
From your home directory enter:
lamboot -v lamhosts
This boots the daemons on the machines listed in the "lamhosts" file (which is just a list of your machines). The "-v" option prints debug information, which is useful for seeing what it is doing. NOTE: the LAM daemons stay resident even after you log out.
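For reference, a lamhosts file is nothing more than one machine name per line. The hostnames below are made-up placeholders; substitute the names of your own nodes:

```
# hypothetical lamhosts file: one machine per line
node1.example.com
node2.example.com
node3.example.com
```

The order of the lines matters, because LAM assigns program copies to machines in the order they appear here.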
To check that LAM is running, either enter:
ps auxw | grep lam
which will show all the lamd (LAM daemon) processes that are running. NOTE: each user can have their own LAM daemons running (a nice feature).
Or:
mpitask
which, if no jobs are running, will print only the header:
TASK (G/L) FUNCTION PEER|ROOT TAG COMM COUNT
If a job is running, you will see it listed here.
If you are concerned about the "state" of LAM (perhaps due to a terminated job), you can clean your daemons up by entering the "lamclean" command. You can shut down your LAM daemons by issuing "wipe lamhosts", where lamhosts is the host file you used to boot LAM.
Programs are run with the mpirun command, as with MPICH, but there are some options that should be used:
mpirun -O -c 2 -s h -c2c program
-O = assume a homogeneous environment (no special encoding)
-c = how many copies to run (in this case 2). NOTE: the -c option assigns programs "round robin" using the order specified in the host file. If -c is greater than the number of machines in your host file, LAM will start doubling up jobs on the machines, in the order specified in your host file.
-s = source of the executable; this is handy, but with NFS not really necessary. The "h" means get the executable from the host (the node where you started LAM).
-c2c = use the client-to-client "socket" mode. This makes LAM fast, but the daemons are not used for communication and therefore cannot provide run-time debugging or trace information, which is fine once you have your application running.
You can "man" the following topics: mpirun, wipe, lamclean, lamboot. You also may want to consult:
/usr/local/src/lam61/doc/mpi-quick-tut/lam_ezstart.tut
and:
for more information.
Version: pvm3/pvm3.4.beta7.tgz
Source: http://www.epm.ornl.gov/pvm/
Notes: There are a lot of PVM codes and examples out there.
Check out: http://netlib.org/pvm3/book/pvm-book.html