This release of the Scyld Beowulf Scalable Computing Distribution
contains all the software required for configuring, administering,
running and maintaining a Beowulf cluster.
Advances provided by Scyld Beowulf include:
Scyld Beowulf Overview
For an overview of the main portions
of the Scyld Beowulf Scalable Computing Distribution, see section `Scyld Beowulf System Overview' in Scyld Beowulf Installation Guide. Additionally, use the Table of
Contents and the Index of this manual to find other information of
interest.
Hardware Recommendations
Hardware recommendations for
building a Scyld Beowulf are contained in section `Scyld Beowulf System Overview' in Scyld Beowulf Installation Guide.
Starting the Installation
To launch the "quick start"
installation, boot the cluster's front-end machine from the Scyld
CD-ROM. See section `Quick Start' in Scyld Beowulf Installation Guide. Alternatively, install Scyld Beowulf from RPM
packages. See section `Scyld Beowulf Installation from RPMs' in Scyld Beowulf Installation Guide.
The beosetup program is a graphical front-end for controlling a Beowulf cluster using the BProc system. It is intended to be used by the cluster system administrator; configuration file write permission is required for most actions.
The main window contains three lists of Ethernet hardware addresses. The first list contains unknown addresses, those not yet assigned to either of the other two lists. The second list contains nodes that are to be active in the cluster; they are ordered by node number (ID). The third list contains nodes or other machines that are to be ignored, even though they produce RARP (Reverse Address Resolution Protocol) requests.
Addresses may be moved between lists by dragging an address with the left (first) mouse button or by right (third button) clicking on the address with the mouse and choosing the appropriate pop-up menu item.
After moving addresses between lists, the Apply button must be clicked for changes to take effect. Clicking on the Apply button saves the changes to the configuration file and signals the Beowulf daemons to re-read the configuration file.
Revert will re-read the existing Beowulf configuration file. This has the effect of undoing any undesired changes that have not yet been applied, or of synchronizing beosetup with any changes that have been made to the configuration file by an external editor.
Next to the Apply and Revert buttons are two short-cut buttons for generating a Node Floppy ("slave node boot floppy") and setting Preferences. These items are also accessed through the File Menu and Settings Menu, respectively.
Each list has a pop-up menu associated with it that can be accessed by right clicking on a list item. Insert a new address by choosing Insert from the pop-up menu on the active node (middle) list. Delete (forget about) addresses by selecting Delete in the pop-up menu on the active node list.
Any active hardware address may be edited by choosing Edit from the pop-up menu.
This section explains the functionality of the menu items in beosetup.
Boot Configuration File and Configuration File allow non-default filenames to be used for the output configuration files. The boot configuration file is used for the beoboot floppy. The configuration file must be the same one that the beoserv daemon is currently reading for the Beowulf Server software to work properly with beosetup.
Create Node Boot Floppy creates a beoboot floppy disk (or image) for booting a node in the cluster. Create BeoBoot file creates the network boot file, which is downloaded from the server to each node during node boot. This beoboot file contains the kernel image, kernel flags, and ramdisk image that start each node.

Exit will quit the beosetup program (not the beoserv daemon).
Choosing Preferences from the Settings menu brings up the beosetup configuration dialog box. PCI Table brings up the PCI table dialog. Restart Daemons sends a signal to the Beowulf daemons to re-read the configuration file. It doesn't actually kill the daemons.
The first tab of the Configuration dialog box contains network configuration items that appear in the configuration file:
The second tab accesses the settings for the following GUI options:
Apply button in the main window).
The third tab contains file system options for the later stages of booting. During a normal boot, the server will attempt to configure the filesystems on the node by running some combination of a filesystem check and a filesystem create. The radio buttons in this tab determine the default global policy:
The PCI Table dialog is used to add PCI vendor/device/driver entries to the boot configuration file. Use it when you know that a new version of an old card is supported by a certain driver, but is not in the Beowulf PCI table (thus not getting recognized and loaded properly).
BeoBoot is a set of utilities to simplify booting of slave nodes in a Beowulf cluster. BeoBoot generates initial boot images which allow a slave node to boot and download its kernel over the network, from the cluster master node.
BeoBoot:
BeoBoot is a collection of programs and scripts which allows easy booting of slave nodes in the Scyld Beowulf cluster. On the master node, there is a boot server daemon and a collection of scripts for setting up slave nodes.
The following events occur while booting a slave node with BeoBoot.
beofdisk. See section `Disk Partitioning' in Scyld Beowulf Installation Guide.
There are two sets of boot images involved in booting a slave node with BeoBoot. The first set is copied onto the slave node boot floppy disk and into the BeoBoot partition of the slave node hard disk, if using the Scyld default partitioning scheme (see section `Disk Partitioning' in Scyld Beowulf Installation Guide). These are known as the phase 1 or initial images, composed of a minimal kernel image and an initial ramdisk image. They are generated from kernels and modules that are included with the Scyld Beowulf BeoBoot distribution. To add a network driver to a slave node boot floppy image, you must compile the driver against the kernel headers which match the BeoBoot kernels. See section Adding A New Network Driver.
The second boot image contains the final kernel and modules that the slave node will use. This image is usually generated from the kernel images that the master node is running.
You should never have to regenerate the BeoBoot initial image unless you make some kind of hardware change to the cluster or you have some other kind of problem which forces you to make a change.
The second phase boot image should be updated whenever you upgrade the kernel or any other modules on the front end. Running the same kernel on the master node and the slave nodes is highly recommended.
The file system table for slave nodes is stored in `/etc/beowulf/fstab'.
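That file follows the standard fstab layout. A minimal sketch, assuming ordinary fstab syntax; the device names and mount points below are placeholders for illustration only, not Scyld defaults:

/dev/hda2   swap       swap   defaults   0 0
/dev/hda3   /scratch   ext2   defaults   0 0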
This section contains basic usage information for the binaries that are included with BeoBoot.
beoboot [ -o outputfile ] -1
beoboot [ -o outputfile ] -2 [-k kernelimage] [ -c commandline ]
beoboot generates Beowulf boot images. There are two sets of images: phase 1 and phase 2. Phase 1 images are placed on the hard disk or a floppy disk and are used to boot the machine. The phase 2 image is downloaded from the cluster front end by the phase 1 image. The phase 2 image is placed on the front end in a place where beoserv can find it.
In the -2 mode, beoboot will detect the version of the kernel given as its argument and look for the matching modules in `/lib/modules/kernelversion'.
Options:
Options for phase 2:
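For example, a phase 2 image might be generated like this (the output and kernel paths shown are placeholders, not defaults):

# beoboot -2 -o /var/beowulf/boot.img -k /boot/vmlinuz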
beoboot-install -h
beoboot-install -v
beoboot-install node device
beoboot-install -a device
beoboot-install installs the beoboot initial slave node boot image onto the hard disk of a cluster node. This allows booting the node without using a slave node boot floppy disk or CD-ROM.
Options:
Requirement: a small partition (minimum 2MB) must be set aside for beoboot on the hard disk. This partition should be tagged as type 89, and it should exist near the beginning of the disk to avoid problems with large disks. See section `Disk Partitioning' in Scyld Beowulf Installation Guide.
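A hedged usage example, installing the initial image onto node 0's first IDE disk (the node number and device name are placeholders):

# beoboot-install 0 /dev/hda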
beoserv -h
beoserv -v
beoserv [ -f file ] [ -n file ]
beoserv is the BeoBoot boot server. It responds to RARP requests from slave nodes in a cluster and also serves a boot image (via TCP) to the nodes.
Options:
Configuration information is normally read from `/etc/beowulf/config'. Beoserv will listen on the interface specified by the interface line. The range of IP addresses available for assignment to slave nodes is defined by the iprange directive.

Beoserv will respond to addresses given on the node lines. IP addresses are assigned to slave nodes in the order that these node lines appear in the configuration file.

The server will ignore requests from addresses that are listed on ignore lines.

When a request comes in from an unknown address, the server will append an unknown line to the configuration file. This allows the setup tools to see new nodes as they appear on the network.
Sending a HUP signal to the daemon will cause it to re-read its configuration file, thus implementing any updates to the file.
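For example (assuming the daemon appears as beoserv in the process table):

# killall -HUP beoserv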
It is possible to build the BeoBoot kernel (and generate the slave node boot floppy) for hardware which is not supported by the BeoBoot system as shipped. You must have the driver for the hardware (this section does not include instructions on how to build kernel modules).
The Linux kernel include files to build against are located in `/usr/lib/beoboot/include'. Use `/usr/lib/beoboot' as the location of the Linux source.
After building the module, place the resulting kernel module binary in `/usr/lib/beoboot/kernel/module_binary_name'. The next time you generate a BeoBoot image it will be included.
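A sketch of those two steps for a hypothetical driver called mydriver (the compiler flags shown are typical for building a module against an external header directory; consult your driver's own build instructions):

# gcc -D__KERNEL__ -DMODULE -O2 -I/usr/lib/beoboot/include -c mydriver.c
# cp mydriver.o /usr/lib/beoboot/kernel/mydriver.o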
If the driver is for new hardware, the vendor and device IDs for the hardware should be included in the driver list. The driver list is stored in `/etc/beowulf/config.boot'. If your driver is composed of multiple modules, dependencies are automatically generated via `depmod'. If the driver merely replaces an old driver and doesn't add support for new hardware, this step may be skipped.
After these steps are completed, re-run BeoBoot to generate a new slave node boot floppy image.
Phase 1 is the initial boot up of the machine from the initial (floppy) image. This image may be stored either on a floppy disk or in the BeoBoot partition of the node's hard drive. See section `Disk Partitioning' in Scyld Beowulf Installation Guide. First, the BIOS loads a sector from the slave node boot image. Next, the boot loader on the floppy (or hard disk) takes over and loads the rest of the data stored in the initial image.
The slave node initial boot image contains a minimal kernel image and an initial ramdisk image. These images probe the PCI bus for network hardware, configure the network interfaces, and download the final kernel image and ramdisk that the machine will run.
The final image and ramdisk will be started via a `Two Kernel Monte'.
In phase 2, the node is running the final kernel image, which was downloaded in phase 1.
The root file system is the ramdisk image downloaded during phase 1. This image contains all the kernel modules for this final kernel. The PCI probe will load all relevant drivers at this time.
The ramdisk image contains a smaller image which will be used as the permanent root file system. The boot program for this phase takes this smaller ramdisk image and copies it into one of the /dev/ramX devices.
In phase 3, the linuxrc has exited and the new smaller root file system has been mounted. The init program used is "boot". In this capacity, it starts the BProc slave daemon and waits for it to exit. If the slave daemon dies for any reason, the init program will reboot the system.
The Scyld Beowulf Distributed Process Space (BProc) is a set of kernel modifications, utilities, and libraries which allow a user to start processes on other machines in a Beowulf-style cluster. Remote processes started with this mechanism appear in the process table of the front end machine in a cluster. This allows remote process management using the normal UNIX process control facilities. Signals are transparently forwarded to remote processes, and exit status is received using the usual wait() mechanisms.
BProc also provides process migration mechanisms for the creation of remote processes. These mechanisms remove the need for most binaries on the remote nodes.
BProc requires a number of kernel modifications and modules to be installed. It is much simpler to install pre-built kernel packages rather than build kernel images from scratch. To simplify managing the nodes in a BProc style cluster, use of the BeoBoot cluster management package is highly recommended.
RPMs for Scyld Beowulf are available via FTP from: ftp://ftp.scyld.com/pub/beowulf.
Note that you may have to modify `/etc/lilo.conf' to point to the new kernel. Re-run lilo to make these changes take effect.
Building BProc from scratch means building a kernel that includes the BProc modifications. Apply the bproc patch to your kernel. When configuring the new kernel, select "Yes" to `Beowulf Distributed Process Space'. See the documentation included with the Linux kernel for more information about configuring and compiling Linux kernels.

After patching the kernel, it is possible to build the rest of the BProc package by running `make' in the top level bproc directory. The Makefile presumes that the kernel tree to build against resides in `/usr/src/linux'. If this is not accurate, provide make with the `LINUX=/path/to/linux' argument.
See the instructions with the Linux kernel or your Linux distribution for instructions on how to install a new kernel.
First, install the BProc kernel modules. There are three modules which must be loaded in the following order: ksyscall.o, vmadump.o, and bproc.o. After running depmod, `modprobe bproc' should load them all. These modules must be loaded on both the front end and the slave nodes.
If using pre-built kernel packages, run the following to install all the programs and modules to their proper locations.
# make install
# depmod -a
# modprobe bproc
Note: BProc daemons require `/dev/bproc' to communicate with the kernel layer. This is a character device with major number 10, minor number 226.
The master daemon, bpmaster, is the central part of the BProc system. It runs on the front end machine. Once it is running, the slave nodes run the slave daemon, bpslave, to connect to the front end machine.
bpmaster runs on the front end machine and handles all the details of running BProc.
# bpmaster
down
unavailable
up
reboot
halt
pwroff
Node states may be viewed and manually manipulated using the bpctl program.
VMADump is the system used by BProc to take a running process and copy it to a remote node. VMADump saves or restores a process's memory space to or from a stream. In the case of BProc, the stream is a TCP socket to the remote machine. VMADump implements an optimization which greatly reduces the size of the memory space.
Most programs on the system are dynamically linked. At run time, they will use mmap to get copies of various libraries in their memory spaces. Since they are demand paged, the entire library is always mapped even if most of it will never be used. These regions must be included when copying a process's memory space and again when the process is restored. This is expensive, since the C library dwarfs most programs in size.
Here is an example memory space for the program sleep. This is taken directly from `/proc/pid/maps'.
08048000-08049000 r-xp 00000000 03:01 288816    /bin/sleep
08049000-0804a000 rw-p 00000000 03:01 288816    /bin/sleep
40000000-40012000 r-xp 00000000 03:01 911381    /lib/ld-2.1.2.so
40012000-40013000 rw-p 00012000 03:01 911381    /lib/ld-2.1.2.so
40017000-40102000 r-xp 00000000 03:01 911434    /lib/libc-2.1.2.so
40102000-40106000 rw-p 000ea000 03:01 911434    /lib/libc-2.1.2.so
40106000-4010a000 rw-p 00000000 00:00 0
bfffe000-c0000000 rwxp fffff000 00:00 0
The total size of the memory space for this trivial program is 1089536 bytes. All but 32K of that comes from shared libraries. VMADump takes advantage of this: instead of storing the data contained in each of these regions, it stores a reference to the regions. When the image is restored, those files will be mmaped to the same memory locations.
In order for this optimization to work, VMADump must know which files it can expect to find in the location where they are restored. VMADump has a list of files which it presumes are present on remote systems. The vmadlib utility exists to manage this list. See section vmadlib.
Note that VMADump will correctly handle regions mapped with MAP_PRIVATE that have been written to.
VMADump does not specially handle shared memory regions. A copy of the data within the region will be included in the dump. No attempt to re-share the region will be made at restoration time. The process will get a private copy.
VMADump does not save or restore any information about file descriptors.
VMADump will only dump a single thread of a multi-threaded program. There is currently no way to dump a multi-threaded program in a single dump.
This section contains basic usage information for the binaries that are included with BProc.
bpmaster -h
bpmaster -v
bpmaster [ -c c_file ] [ -m m_file ]
bpmaster is the BProc master daemon. It runs on the front end machine of a cluster running BProc. It listens on a TCP port and accepts connections from slave daemons. Configuration information comes from the Beowulf configuration file. The BProc master daemon reads the interface, iprange, bprocport, allowinsecureports, and logfacility directives. See section Scyld Beowulf Configuration File Reference.
Options:
bpslave -h
bpslave -v
bpslave [ -l facility ] [ -r ] [ -m m_file ] masterhostname port
bpslave is the BProc slave daemon. It runs on slave nodes in a cluster and connects to the front end machine (masterhostname) on the given port to accept jobs.
Options:
bpstat [ -h ] [ -v ] [ -n ] [ -u ] [ -a nodenum ] [ -s nodenum ] [-m] [-p] [-P]
bpstat displays various pieces of status information about a BProc cluster. This program also includes a number of options intended to be useful for scripts.
Options:
(Note that node counts are based on the iprange directive, not on the number of nodes that are up.)
bpctl -h
bpctl -v
bpctl -M [ -a ]
bpctl -S node [ -a ] [ -r dir ] [ -s state ]
bpctl is the BProc control utility. It is used to apply commands to the referenced nodes.
Options:
The -r dir option causes the node to chroot() to dir. After doing this, all processes started on a node via BProc will see dir as their root directory. This command is only usable on slave nodes.
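For example, to ask node 4 to reboot (the node number here is only an illustration):

# bpctl -S 4 -s reboot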
bpsh [-n] nodenumber command
bpsh -a [-n] command
bpsh -A [-n] command
bpsh is an rsh replacement. It runs command on the given node.
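For example, to run uname -r on node 3 (the node number is chosen for illustration):

> bpsh 3 uname -r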
Options:
bpcp [ -p ] f1 f2
bpcp [ -r ] [ -p ] f1 ... fn dir
bpcp copies files between machines. Each file or directory argument is either a remote file name of the form node:path, or a local file name (containing no `:' characters).
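For example, to copy a local file to node 3 (paths and node number are illustrative):

> bpcp data.in 3:/tmp/data.in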
Options:
vmadlib -c
vmadlib -a [ libs ... ]
vmadlib -d [ libs ... ]
vmadlib -l
This program is a utility to manage the VMADump in-kernel library list.
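For example, to add a library to the in-kernel list (the library path is illustrative):

# vmadlib -a /lib/libm-2.1.2.so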
Options:
BProc currently includes a C Library interface only.
BProc provides a number of mechanisms for creating processes on remote nodes. It is instructive to think of these mechanisms as moving processes from the front end to the remote node. The rexec mechanism is like doing a move then exec with lower overhead. The rfork mechanism is implemented as an ordinary fork on the front end followed by a move to the remote node before the system call returns. Execmove does an exec and then a move before the exec returns to the new process.
Movement to another machine on the system is voluntary and is not transparent. Once a process has been moved, all its open files are lost except for STDOUT and STDERR. These two are replaced with a single socket (their outputs are combined). An I/O daemon forwards data from the other end of that connection to whatever the original STDOUT was connected to. No pseudo-tty operations are done.
The move is completely visible to the process after it has moved, except for process ID space operations. Process ID space operations include fork, wait, kill, etc. All file operations will operate on files local to the node to which the process has been moved. Memory that was shared on the front end will no longer be shared.
Programs that use the BProc library should contain the line

#include <sys/bproc.h>

and be linked against the BProc library by adding -lbproc to the linker command line.
The BProc library provides the following interfaces for finding information about the configuration of the machine. These interfaces may be used from any node on the cluster.
int bproc_numnodes(void)
int bproc_currnode(void)
int bproc_nodestatus(int node)
bproc_node_down
bproc_node_unavailable
bproc_node_error
bproc_node_up
int bproc_nodeaddr(int node, struct sockaddr *addr, int *size)
int bproc_masteraddr(struct sockaddr *addr, int *size)
This is equivalent to calling bproc_nodeaddr(-1, addr, size).
int bproc_rexec(int node, char *cmd, char **argv, char **envp)
This call is similar to execve. It replaces the current process with a new one. The new process is created on node and the local process becomes the ghost representing it. All arguments are interpreted on the remote machine. The binary and all libraries it needs must be present on the remote machine. Currently, if remote process creation is successful but the exec fails, the process will just exit with status 1. If remote process creation fails, the function will return -1.
int bproc_move(int node)
int bproc_rfork(int node)
This call is similar to fork except that the child process created will end up on the node given by the node argument. The process forks a child and that child performs a bproc_move to move itself to the remote node. Combining these two operations in a single system call prevents zombies and SIGCHLDs in the case that the fork is successful but the move is not.

On success, this function returns the process ID of the new child process to the parent and zero to the child. On failure it returns -1.
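As an illustration (a minimal sketch, not taken from the Scyld sources), the calls described above can be combined to start one child process on every node that is currently up. The program is assumed to be linked with -lbproc, and error handling is kept to a minimum:

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sys/bproc.h>

int main(void)
{
    int node, pid;
    int nnodes = bproc_numnodes();

    for (node = 0; node < nnodes; node++) {
        if (bproc_nodestatus(node) != bproc_node_up)
            continue;                   /* skip nodes that are not up */
        pid = bproc_rfork(node);
        if (pid < 0) {
            perror("bproc_rfork");
            continue;
        }
        if (pid == 0) {                 /* child: now running on the remote node */
            printf("hello from node %d\n", bproc_currnode());
            exit(0);
        }
    }

    /* parent: reap the children with the usual wait() mechanism */
    while (wait(NULL) > 0)
        ;
    return 0;
}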
int bproc_execmove(int node, char *cmd, char **argv, char **envp)
The system management calls are made by programs like bpctl to control the machine state. These calls are privileged and not useful to normal applications.
int bproc_slave_chroot(int node, char *path)
This call causes the slave daemon on node to perform a chroot to path. This call returns 0 on success and -1 on failure.
int bproc_setnodestatus(int node, int status)
See bproc_nodestatus for information regarding permissible node states. It is not possible to change the status of a node which is marked as down.
MPI, or Message Passing Interface, is a de facto standard interface for message-based parallel computing that is maintained by a forum of members drawn from academia and the remnants of the traditional supercomputing industry.
The MPI forum was self-tasked with creating a standard that could loosely accommodate the existing systems for message-passing on multi-computers in a way that could be implemented on contemporary machines with reasonable performance.
MPI, unlike earlier systems such as PVM, was to be a standard instead of software itself. Furthermore, MPI was to be an API standard. This meant that implementors were granted wide latitude to implement MPI in ways that need not have runtime interoperability with other platforms or implementations.
At the present time, there are at least a dozen such implementations of MPI under active maintenance -- the Scyld Computing implementation, BeoMPI, is one of them.
More information about MPI is available from Argonne National Lab at http://www-unix.mcs.anl.gov/mpi.
Scyld distributes BeoMPI, an implementation of MPI drawn directly from the MPICH project at Argonne National Laboratory. Scyld has made only those changes necessary to allow MPICH to take advantage of the special system features provided by our Beowulf system software (notably the features provided by the BProc system).
In general, if you have an application which can take advantage of MPI, you can make it run on Beowulf. In particular, applications which already run on MPICH should have no problems on Beowulf.
Scyld has simplified the deployment of MPI applications in a number of ways -- applications which take advantage of these simplifications may experience porting pains when backporting to more primitive systems. Fortunately, our improvements to the system are not provided at the expense of compliance with the MPI standard.
More information about MPICH is available from Argonne National Lab at http://info.mcs.anl.gov/pub/mpi.
BeoMPI is built against the Scyld BProc system. Your system must have the BProc dynamic libraries installed to install BeoMPI. Additionally, your system must have the BProc header files installed to successfully build BeoMPI. NOTE: You do not need to have a BProc-enabled kernel to build, install, or run BeoMPI, but you will not be able to take advantage of many of the multiprocessing features of a Beowulf system.
RPMs of BeoMPI are available via FTP from: ftp://ftp.scyld.com/pub/beowulf.
Make BeoMPI from scratch by running `make' in the top level `beompi' directory.
Install BeoMPI by running `make install' in the top level `beompi' directory.
After installing, run `ldconfig'.
NOTE: As beompi is designed for installation as a system-wide MPI resource for Beowulf systems, the beompi installation process creates a number of files which may collide with other MPI implementations you may intend to install. In particular, you should be aware of:

files in /usr/man
files in /usr/include (including mpi.h)
files in /usr/lib (including libmpi.so and libmpi.a)
mpirun in /usr/bin

(A complete list of files is available through the rpm system.)
You should try to install alternate MPI implementations in non-conflicting locations, as some Beowulf utilities may depend on features present in Scyld's BeoMPI.
If you wish to install BeoMPI on an existing system, you may specify alternate file locations when installing a scratch-built system. Do this by running `USRDIR=/usr/beowulf make -e install' in the top level `beompi' directory (where `/usr/beowulf' is your intended target path).
There are no configuration files or daemons which require configuration to use the BeoMPI subsystem for Beowulf. Information about the state of the system and the nodes is gathered from the BProc system at runtime.
Instructions on running BeoMPI therefore relate only to starting MPI-enabled applications on a Beowulf system.
Simply preparing a job for execution has long been a weak point on loosely-coupled MPPs. It has typically been a multi-stage process that required careful system configuration by a skilled administrator.
Given the features offered by the BProc system, installing and running a parallel program can be as simple as running a serial one.
The MPI standard does not extend to job creation (exception: see MPI_Comm_spawn() in MPI-2). However, a convention does exist: most MPI implementations support an external program called `mpirun' that is responsible for running an MPI application.
While beompi does not require the use of such an external program, beompi makes it available for those applications which expect it.
mpirun --mpi-help
mpirun --mpi-version
mpirun [options] <command> [command options]
Options:
In addition to the above command-line options, mpirun responds to several environment variables:
Variables:
Command-line arguments override conflicting values supplied by the environment.
Instead of relying on an external program to spawn MPI jobs, beompi makes an inline interface available to applications which link dynamically against the MPI library. Users may supply any of the command-line arguments, environment variables, or compile-time hints accepted by mpirun directly to the MPI-enabled application.

These arguments are processed and a job schedule is created before the application's main() function is even called. This feature allows for the construction of a parallelized application that behaves, and can be invoked, transparently from the user's point of view.
The inline mpirun features may be accessed with the same command-line options and environment variables as the stand-alone version of mpirun; however, mpirun arguments may now be mixed freely with options belonging to the command. For example:
> mpifrob --mode=deathray --np 16 --outputfile=/dev/null
may be used in place of
> mpirun --np 16 mpifrob --mode=deathray --outputfile=/dev/null
The inline mpirun may be disabled by setting MPIRUN_INLINE to 0, or by setting NO_INLINE_MPIRUN to non-empty.
beompi supports one other model of MPI job creation to address the special needs of applications with defined dynamic-link interfaces to executable `plug-ins'. beompi's in-place job creation system allows an application of this type to run an MPI-enabled plug-in without itself having to be MPI-aware. A fragment of an MPI-aware plug-in is provided below.
Note that mpirun() will generate an argc, argv pair for you that contains the arguments needed by MPI_Init() -- even if you were not passed an argc, argv pair as part of your plug-in API.
#include <mpi.h>
#include <mpirun.h>

int plugin_init()
{
    int retval;
    int module_argc;
    char **module_argv;
    int rank, size;

    /* schedule this job -- ask for size==8 */
    retval = mpirun(&module_argc, &module_argv, MSH_SIZE, 8, MSH_END);
    MPI_Init(&module_argc, &module_argv);

    /* From here, all of the jobs are running from this
     * point in the code -- no need for them to go through
     * the body of the parent application to get here. */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Do parallel processing here */

    MPI_Finalize();

    /*
     * Children should never exit back into the parent application
     */
    if (rank != 0)
        exit(0);
    else
        return 0;
}
BeoMPI features language bindings for C, C++, and Fortran.
beompi places the MPI header files and libraries in standard locations. Compiling and linking an MPI application is often as simple as:
> cc -lmpi foo.c -o foo
To compile a Fortran code, try:

> f77 -lmpif foo.f -o foo
Notice that the MPI library for Fortran is 'mpif'. In the future, these libraries may be merged -- in which case the 'mpif' library will be maintained for backwards compatibility with beompi and with other MPI implementations.
While beompi supports the de facto mpirun interface for scheduling and spawning MPI-enabled programs, Scyld has created an extra mechanism for an application to directly provide scheduler cues to the system without needing external `schema' files or enormous mpirun command lines.
This `hinting' technique involves placing harmless macro calls inside an MPI-enabled application (as shown below) that generate specially-named common symbols in the resulting application. These symbols are available both to the beompi MPI library and to external programs which process the application's symbol table.
An example:
#include <mpi.h>     /* generally necessary for mpi applications */
#include <mpirun.h>  /* necessary to use the library interface to mpirun */

#ifdef MPIRUN_GLOBAL_HINT
MPIRUN_GLOBAL_HINT(MPIRUN_NP, 16)  /* this code likes at least 16 jobs */
#endif

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /* do parallel processing */
    MPI_Finalize();
    return 0;
}
In the above example, the application hints that it wants to run as a 16-way job. These hints may be overridden by both command-line arguments and environment variables, but may be convenient for applications that have particular knowledge about the way they perform.
A number of hints are defined:
MPIRUN_INLINE <flag>
MPIRUN_NP <int>
MPIRUN_NODES <int>
MPIRUN_CPUS <int>
MPIRUN_LOCAL <flag>
strace and other ptrace()-based tools are not currently well supported under the BProc system when running on multiple machines. These tools may be used, however, if the target MPI application is run as a `local' job. Example:
> LOCAL=true strace -f mpi-application
strace and ltrace both accept -f, which instructs them to follow fork() calls and print calls for children. You must supply this option to see the system calls for the entire MPI application.
mpirun contains a built-in facility for logging and debugging. You can access this facility by supplying the MFT_LOG_THRESH environment variable to any of the mpirun forms described here. MFT_LOG_THRESH may take on one of the following values:
none
fatal
error
info
branch
progress
entryexit
Logging levels are cumulative. Setting MFT_LOG_THRESH to info will cause log messages for the error and fatal levels to also be emitted.
beompi is constructed from MPICH on P4. MPI applications built on top of beompi may use the debugging features built into P4. Example:
> mpi-application -p4dbg 100
-p4dbg accepts an integer from 0 to 100; 100 is maximum logging.
The Pentium Pro Performance counter package adds support for the hardware performance counters present in the Intel Pentium Pro, Pentium II, Celeron, and Pentium III CPUs. The Pentium Pro provides two counters which can be programmed to count a wide variety of system events (see Countable Events, below).
The counters are virtualized so many different processes can safely use the counters at the same time. Processes will only count when they are scheduled. Since the counter values and configurations are saved and restored at context switch time, the counters are safe to use on SMP machines where processes may move from one CPU to another. When counting in the system-wide mode on an SMP machine, individual counts are returned for each CPU in the system.
The C language interface is provided via `libperf.a'. The included header file (`perf.h') defines the following interfaces. Note that this requires `asm/perf.h' from the kernel source to be present at compile time.
PERF_COUNTERS
PERF_COUNTERS is the number of performance counters supported by this performance counter library. Currently, 2 counters are supported.
int perf_reset(void);
The perf_reset function clears the configuration and counter registers. If counting was started, it will be stopped.
int perf_get_config(int counter, int *config);
The perf_get_config function reads back counter configurations. counter is the counter whose configuration is to be read and config points to the location where the value will be stored. The value read back may not always be the same as the value that was written. (See PERF_OS and PERF_USR.)
int perf_set_config(int counter, int config);
The perf_set_config function is used to select which events will be counted in a counter. The config argument is one of the countable events (see below) and may be OR'ed with zero or more flags. Note that some values can only be counted in certain counters. This function has the side effect of stopping the counters and resetting them back to zero.
int perf_start(void);
int perf_stop(void);
The perf_start and perf_stop functions start and stop the counters. These should be used after configuring the counters. Note that these functions start and stop all the counters.
int perf_read(int counter, unsigned long long *dest);
The perf_read function reads the value of a single performance counter. counter is the counter to be read and the value will be stored in the memory location pointed to by dest.
int perf_write(int counter, unsigned long long *src);
The perf_write function writes the value of a single performance counter. counter is the counter to be written and the value will be read from the memory location pointed to by src.
int perf_wait(pid_t pid, int *status, int options, struct rusage *ru, unsigned long long *counts);
The perf_wait function is an extension of the wait(2) function. Its operation is identical except that it can also return the values of the performance counters at the time that the process exited. The counts argument should be an array of length PERF_COUNTERS.
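A minimal per-process sketch using the calls above. The event and flag chosen here, PERF_FLOPS and PERF_USR, are taken from the lists later in this chapter; whether a given event is legal in counter 0 depends on the hardware, so treat this only as an illustration. Compile with something like cc perf_demo.c -lperf:

#include <stdio.h>
#include <perf.h>

int main(void)
{
    unsigned long long count;
    double x = 1.0;
    int i;

    perf_reset();                               /* clear configuration and counters */
    perf_set_config(0, PERF_FLOPS | PERF_USR);  /* count user-level FP ops in counter 0 */
    perf_start();

    for (i = 0; i < 1000000; i++)               /* some floating point work to count */
        x = x * 1.0000001;

    perf_stop();
    perf_read(0, &count);
    printf("counter 0: %llu (x = %g)\n", count, x);
    return 0;
}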
There are versions of these functions that may be used for system wide counting. Normally, the counter configurations are switched at task switch time so that each process appears to have its own set of counters. Counters can also be used on a system-wide basis. In this mode, counting is unaffected by task switches. Every CPU also produces its own counting results.
The system-wide counters are only available to the super user. While using the system-wide counters, users will receive an EBUSY error if they attempt to use `per-process' counters. Calling any of the perf_sys functions (except perf_sys_reset) will cause system-wide counting to start. System-wide counting will not stop until perf_sys_reset is called. Note that system-wide counting does NOT stop if the process that started system-wide counting terminates.
int perf_sys_reset(void);
The perf_sys_reset function clears the counter configuration and frees the performance counters for per-process use.
int perf_sys_set_config(int cpu, int counter, int event, int flags);
int perf_sys_get_config(int cpu, int counter, int *event, int *flags);
int perf_sys_start(void);
int perf_sys_stop(void);
int perf_sys_read(int cpu, int counter, unsigned long long *dest);
int perf_sys_write(int cpu, int counter, unsigned long long *src);
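A sketch of system-wide counting on CPU 0 using the calls listed above (this must be run as root; the event, PERF_DATA_MEM_REFS, and the ten-second interval are arbitrary choices for illustration):

#include <stdio.h>
#include <unistd.h>
#include <perf.h>

int main(void)
{
    unsigned long long count;

    perf_sys_set_config(0, 0, PERF_DATA_MEM_REFS, PERF_OS);  /* CPU 0, counter 0 */
    perf_sys_start();
    sleep(10);                      /* let the whole system run for ten seconds */
    perf_sys_stop();

    perf_sys_read(0, 0, &count);
    printf("CPU 0, counter 0: %llu\n", count);

    perf_sys_reset();               /* free the counters for per-process use again */
    return 0;
}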
All of the preceding functions return 0 on success and -1 on failure. On failure, errno will also be set.
The perf syscalls can produce the following errors:
EBUSY
EPERM
EFAULT
In general, the sys_perf system call is the only system call that will affect counter configurations. The exceptions are:

fork
exec
perf_wait()
Counter configurations are stored in integers. Valid configurations are generated by picking one of the countable events and doing a bitwise OR with zero or more of the counter flags.
PERF_DATA_MEM_REFS
PERF_DCU_LINES_IN
PERF_DCU_M_LINES_IN
PERF_DCU_M_LINES_OUT
PERF_DCU_MISS_STANDING
PERF_IFU_IFETCH
PERF_IFU_IFETCH_MISS
PERF_ITLB_MISS
PERF_IFU_MEM_STALL
PERF_ILD_STALL
PERF_L2_IFETCH
PERF_L2_LD
PERF_L2_ST
PERF_L2_LINES_IN
PERF_L2_LINES_OUT
PERF_L2_LINES_INM
PERF_L2_LINES_OUTM
PERF_L2_RQSTS
PERF_L2_ADS
PERF_L2_DBUS_BUSY
PERF_L2_DBUS_BUSY_RD
PERF_BUS_DRDY_CLOCKS
PERF_BUS_LOCK_CLOCKS
PERF_BUS_REQ_OUTSTANDING
PERF_BUS_TRAN_BRD
PERF_BUS_TRAN_RFO
PERF_BUS_TRANS_WB
PERF_BUS_TRAN_IFETCH
PERF_BUS_TRAN_INVAL
PERF_BUS_TRAN_PWR
PERF_BUS_TRAN_P
PERF_BUS_TRANS_IO
PERF_BUS_TRAN_DEF
PERF_BUS_TRAN_BURST
PERF_BUS_TRAN_ANY
PERF_BUS_TRAN_MEM
PERF_BUS_DATA_RCV
PERF_BUS_BNR_DRV
PERF_BUS_HIT_DRV
PERF_BUS_HITM_DRV
PERF_BUS_SNOOP_STALL
PERF_FLOPS
PERF_FP_COMP_OPS_EXE
PERF_FP_ASSIST
PERF_MUL
PERF_DIV
PERF_CYCLES_DIV_BUSY
PERF_LD_BLOCK
PERF_SB_DRAINS
PERF_MISALIGN_MEM_REF
PERF_INST_RETIRED
PERF_UOPS_RETIRED
PERF_INST_DECODER
PERF_HW_INT_RX
PERF_CYCLES_INST_MASKED
PERF_CYCLES_INT_PENDING_AND_MASKED
PERF_BR_INST_RETIRED
PERF_BR_MISS_PRED_RETIRED
PERF_BR_TAKEN_RETIRED
PERF_BR_MISS_PRED_TAKEN_RET
PERF_BR_INST_DECODED
PERF_BR_BTB_MISSES
PERF_BR_BOGUS
PERF_BACLEARS
PERF_RESOURCE_STALLS
PERF_PARTIAL_RAT_STALLS
PERF_SEGMENT_REG_LOADS
PERF_CPU_CLK_UNHALTED
Many of the external bus logic events can be further qualified with either the PERF_SELF or PERF_ANY flag.
PERF_SELF
PERF_ANY
Many of the L2 cache events to be counted can be further qualified with the following flags. These flags can be OR'ed together to count more than one cache state.
PERF_CACHE_M
PERF_CACHE_E
PERF_CACHE_S
PERF_CACHE_I
PERF_CACHE_ALL
The PERF_OS and PERF_USR flags allow you to control when counting should occur. These two flags can be combined. The default (when no flag is specified) for per-process counting is PERF_USR only, and the default for system-wide counting is PERF_OS only.
PERF_OS
PERF_USR
When system-wide counting is used, the other processes get no indication that their monitoring has been corrupted.
Most of the information provided here is derived from the Intel architecture manuals available at http://developer.intel.com.
See also the wait(2) manual page for more information on wait semantics.
BeoStatus is the Scyld Beowulf Status program. It displays CPU usage, memory usage, swap usage, and root partition disk usage. These outputs may be displayed in four different formats: two GTK+ formats, Curses format, and line output format. BeoStatus works on either a Beowulf BProc system or on a simple cluster of Linux machines using rsh.
This is an overview of available options to BeoStatus:
(shorthand, explicit)
-r, --rsh
-s, --ssh
-b, --bpsh
-c, --curses
-t, --text
-d, --dots
-u, --update=secs
-v, --version
There are three different methods for communicating with the nodes in the cluster: ssh, rsh, and Beowulf/BProc. The default communication method is currently ssh. Rsh mode is selected with the -r or --rsh option; Beowulf/BProc mode is selected with the -b or --bpsh option. Only one of these should be specified at a time. While ssh and rsh modes use machine names, Beowulf/BProc mode uses node numbers (or, if no numbers are specified, all nodes defined by the IP address range are implied).
beostatus -b 0 1 2 3
In Beowulf mode, the up and available flags correspond directly to the BProc states of the same name. In rsh or ssh modes, up means that beostatus is successfully pinging the machine with ICMP packets; available means that beostatus is receiving status packets from that host.
If, while running in rsh or ssh mode, node status is up but not available, manually use rsh or ssh to transfer and run the grabstats program on the remote machine. In order to avoid the password challenge on the remote machine, you must list your local machine in the `.rhosts' file (rsh) or the `.ssh/authorized_keys' file (ssh) on the remote machine.
There are currently four presentation modes. The default mode is GTK+ mode, which uses a progress bar to represent usage.

Dots mode is a compact GTK+ format which uses colored dots to represent each node's status. The dot color represents the status. The default color scheme is as follows:

unavailable).
Curses mode should be used when an X server connection is not available for beostatus. It is selected automatically if the DISPLAY environment variable is not set, or it may be selected manually with the --curses flag.
There is also a line output mode, selected with the --text flag, in case a terminal doesn't support Curses control characters.
The Beowulf configuration file is used by all the Beowulf daemons and normally resides in `/etc/beowulf/config' on the front end machine of the Beowulf cluster.
address ipaddress
allowinsecureports
bootfile file
bootport port
bprocport port
fsck policy
(See also mkfs.)
ignore macaddress
interface eth
iprange w.x.y.z w.x.y.z
libraries library ...
If a directory is given, all shared libraries (.so files) in that directory will be copied.
logfacility facilityname
mkfs policy
(See also fsck.)
netmask mask
netmask sets the netmask on the internal cluster network.
node macaddress
unknown macaddress
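A sketch of what such a file might look like, using only the directives listed above (every address, MAC address, and path below is a placeholder for illustration, not a Scyld default):

interface eth1
iprange 192.168.1.100 192.168.1.131
netmask 255.255.255.0
bootfile /var/beowulf/boot.img
libraries /lib /usr/lib
node 00:A0:C9:12:34:56
node 00:A0:C9:12:34:57
ignore 00:A0:C9:AA:BB:CC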
The Scyld Beowulf boot configuration file is used by the `beoboot' script when creating new boot images. A copy of this configuration file is included in the boot images and actually used at boot time. This file is located in `/etc/beowulf/config.boot' on the front end machine of the Scyld Beowulf cluster.
bootmodule modules ...
bootport port
bprocport port
insmod module args...
modarg module args...
moddep module dependencies...
modprobe module args...
pci vendor device driver
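A sketch of a config.boot fragment using the directives above (the driver names, vendor/device IDs, and module argument are illustrative only):

bootmodule eepro100 3c59x
pci 0x8086 0x1229 eepro100
modarg eepro100 debug=1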
Version 2, June 1991
Copyright (C) 1989, 1991 Free Software Foundation, Inc. 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software.
Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations.
Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all.
The precise terms and conditions for copying, distribution and modification follow.
If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.> Copyright (C) 19yy <name of author>
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Also add information on how to contact you by electronic and paper mail.
If the program is interactive, make it output a short notice like this when it starts in an interactive mode:
Gnomovision version 69, Copyright (C) 19yy name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989 Ty Coon, President of Vice
This General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Library General Public License instead of this License.