Beowulf Installation and Administration HOWTO: Setting up clients

6. Setting up clients

There are three main methods of installing the client nodes. The first is cloning the nodes using the dd command. The second, which I used in the first stage of our topcat system, is installing the operating system on each client separately and then running a configuration script on the server which performs the rest of the setup. The third is to use disk-less clients, in which case all installation and configuration is done on the server. I shall describe the last two methods in detail because this is how I configured our topcat system.

6.1 Cloning clients

The basic concept of cloning is making an exact copy, sector by sector, of a partition on one drive onto a partition on another drive. You can install one client, configure it, and make an exact copy of its disk. You can then use this disk image on the other clients, and you should only have to change a few settings such as the IP address and hostname. If your clients have their own disks holding the operating system, this is the easiest way of setting them up. Cloning is described in more detail by Jan Lindheim in Building a Beowulf System http://www.cacr.caltech.edu/beowulf/tutorial/beosoft/.
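As a rough illustration of such a copy, here is a minimal sketch using dd. It is not taken from the tutorial mentioned above. The function name clone_partition and the device names in the comments are my own hypothetical examples, and the demonstration runs on scratch files so that no real disk is touched.

```shell
#!/bin/bash
# clone_partition SRC DST: copy SRC onto DST block by block with dd,
# then compare the two to confirm the copy is identical.
# SRC and DST would normally be partitions (for example /dev/hda1 and
# /dev/hdb1, hypothetical names); regular files work the same way.
# The comparison assumes both sides have the same size.
clone_partition() {
    local src=$1 dst=$2
    dd if="$src" of="$dst" bs=1024 2>/dev/null || return 1
    cmp -s "$src" "$dst"
}

# Safe demonstration on scratch files instead of real disks.
src_img=$(mktemp)
dst_img=$(mktemp)
dd if=/dev/urandom of="$src_img" bs=1024 count=64 2>/dev/null
if clone_partition "$src_img" "$dst_img"; then
    echo "clone verified"
else
    echo "clone failed"
fi
```

On real hardware, double-check which device is the source and which is the target before running dd, because the target is overwritten without any confirmation.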

6.2 Configuring disk-less clients

This method differs from the other two because all client configuration is done on the server: the clients have no physical disk of their own, and all their files are stored on the server node. If you want more information about booting a disk-less client you should read both the NFS Root mini howto http://metalab.unc.edu/LDP/HOWTO/mini/NFS-Root.html and the NFS Root Client HOWTO.

Because all system files of a disk-less client actually live on the server, this is where the client configuration is done. I followed the NFS-root howto when configuring our system, with minor modifications.

First of all you will need a floppy disk with a kernel for each of the clients. I have only tried this with a monolithic kernel, but I can't see why a modular kernel would not work. One thing you will have to remember is to compile the support for your network card into the kernel. The kernel will need this driver before it mounts any file systems, i.e. before any modules are available.
First compile the kernel which you will use on the clients. Start by configuring it:
```
make menuconfig
```
Make sure you compile in support for NFS-root: CONFIG_ROOT_NFS, CONFIG_RNFS_BOOTP, CONFIG_RNFS_RARP.
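If you want to double-check the result before compiling, the following sketch looks for these three options in the .config file that the kernel configuration step writes. The function name check_nfsroot_config is my own, and the path in the usage comment assumes the kernel source lives in /usr/src/linux.

```shell
#!/bin/bash
# check_nfsroot_config FILE: report which of the three NFS-root options
# are not set to "y" in a kernel configuration file.
# Returns 0 when all of them are enabled, 1 otherwise.
check_nfsroot_config() {
    local config=$1 opt missing=0
    for opt in CONFIG_ROOT_NFS CONFIG_RNFS_BOOTP CONFIG_RNFS_RARP; do
        if ! grep -q "^${opt}=y\$" "$config"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return $missing
}

# Typical use, assuming the kernel source lives in /usr/src/linux:
#   check_nfsroot_config /usr/src/linux/.config
```

Any option reported as missing can be switched on by running the configuration step again.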
After you have configured all the options in the kernel you can start compiling it. Issue the following commands:
```
make dep && make clean && make zImage
```
Now you will have to change the root device of the kernel to NFS-root. I adopted this trick of making a dummy device from the NFS-root Mini-Howto:
```
mknod /dev/nfsroot b 0 255
cd /usr/src/linux/arch/i386/boot
rdev zImage /dev/nfsroot
```
All there is to do now is to copy the kernel image onto a floppy disk.
```
dd if=zImage of=/dev/fd0
```
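Floppies are not the most reliable media, so it can be worth reading the image back and comparing it with the original. The sketch below is one possible way of doing that. The function name verify_copy is my own, and it compares only as many bytes as the image contains, since the floppy is normally larger than the kernel image.

```shell
#!/bin/bash
# verify_copy IMAGE DEVICE: succeed if the first bytes of DEVICE match
# IMAGE exactly. Only the length of IMAGE is compared, because a floppy
# is normally larger than the kernel image written to it.
verify_copy() {
    local image=$1 device=$2 size
    # Arithmetic expansion strips any padding wc may print.
    size=$(( $(wc -c < "$image") ))
    head -c "$size" "$device" | cmp -s - "$image"
}

# Typical use after the dd command above:
#   verify_copy zImage /dev/fd0 && echo "floppy matches zImage"
```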
If all your clients are the same you will be able to use the same image to boot all systems. In my case I had to create two different floppies, one for single CPU systems and one for SMP machines.
The next step after creating a boot disk for the client is setting up a template which will be used to create the root directories of the clients. It is a good idea to set up this template right after installing the server and the operating system patches, and before you modify any files in /var and /etc. Simply cut and paste the sdct script into a file and run it. The script will create all necessary directories and copy all needed files. Note that this script does NOT create a root directory for any of the clients; it only creates a template which another script, adcn, uses to create these root directories.
After creating the NFS-root directory template you should create an NFS-root file system for each of the clients. This can be achieved by running the adcn script which will build the file system under /tftpboot. The most common way of running this script is:
```
adcn -n node2 -i 10.0.0.2 -d beowulf.my.domain -l -D eth1
```
Let us look at the command line options:
- -n node2 specifies the first name of the client. This must not be a fully qualified domain name.
- -i 10.0.0.2 specifies the IP address of the client
- -d beowulf.my.domain is the DNS domain of the cluster. If this option is not specified, the server's DNS domain (/bin/dnsdomainname) will be used. You should only have to use this if the server's domain is different from the cluster's domain. In our case, the client's full name would be node2.beowulf.my.domain
- -l means listen for RARP requests. When this option is used, adcn will listen for RARP requests on the interface specified with the -D option (see the next item) and use the MAC address from the first "sniffed" RARP request as the client's hardware address. This option uses tcpdump to sniff the MAC address, so please make sure you have it installed.
- -D specifies the device connected to the cluster. If you have more than one device connected to your cluster (that is, the cluster is divided into more than one subnet), use the interface directly connected to the network on which the disk-less client sits. This option reads the device information from /etc/sysconfig/network-scripts/ifcfg-* to find out the network, broadcast, netmask, and gateway for the cluster (the server's IP will be used as the gateway). The device information is also used by the -l option, telling tcpdump which device to "sniff" on.
If -D is not specified then the script's default values will be used. There are a number of other command line options to override these defaults. Run adcn -h for more information. In most cases the example usage shown above will be what you need. You can put multiple commands in a script and set up the whole disk-less client cluster using one command. For example, to set up a 16 node disk-less client cluster with eth1 being the server's interface connected to the cluster, you could run this script:
```
#!/bin/bash
adcn -n node2 -i 10.0.0.2 -d beowulf.my.domain -l -D eth1
adcn -n node3 -i 10.0.0.3 -d beowulf.my.domain -l -D eth1
adcn -n node4 -i 10.0.0.4 -d beowulf.my.domain -l -D eth1
adcn -n node5 -i 10.0.0.5 -d beowulf.my.domain -l -D eth1
adcn -n node6 -i 10.0.0.6 -d beowulf.my.domain -l -D eth1
adcn -n node7 -i 10.0.0.7 -d beowulf.my.domain -l -D eth1
adcn -n node8 -i 10.0.0.8 -d beowulf.my.domain -l -D eth1
adcn -n node9 -i 10.0.0.9 -d beowulf.my.domain -l -D eth1
adcn -n node10 -i 10.0.0.10 -d beowulf.my.domain -l -D eth1
adcn -n node11 -i 10.0.0.11 -d beowulf.my.domain -l -D eth1
adcn -n node12 -i 10.0.0.12 -d beowulf.my.domain -l -D eth1
adcn -n node13 -i 10.0.0.13 -d beowulf.my.domain -l -D eth1
adcn -n node14 -i 10.0.0.14 -d beowulf.my.domain -l -D eth1
adcn -n node15 -i 10.0.0.15 -d beowulf.my.domain -l -D eth1
adcn -n node16 -i 10.0.0.16 -d beowulf.my.domain -l -D eth1
```
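The same listing can also be written as a loop, which is easier to adapt when the number of nodes changes. The following is only a sketch: the function name setup_clients and the ADCN variable are my own additions, and by default the script just prints the adcn commands so you can inspect them before running anything.

```shell
#!/bin/bash
# Print (or run) one adcn command per client, node2 through node16,
# using the same options as the listing above.
# ADCN defaults to "echo adcn" so the commands are only printed;
# set ADCN=adcn in the environment to actually run them.
ADCN=${ADCN:-echo adcn}

setup_clients() {
    local first=$1 last=$2 n
    for ((n = first; n <= last; n++)); do
        # $ADCN is left unquoted on purpose so "echo adcn" splits in two.
        $ADCN -n "node$n" -i "10.0.0.$n" -d beowulf.my.domain -l -D eth1 || return 1
    done
}

setup_clients 2 16
```

Because each adcn call with -l waits for a RARP request, the clients still have to be booted one at a time, in the same order as the loop.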

Troubleshooting ideas

Disk-less client doesn't get a RARP reply from server. If you boot your disk-less client but it 'hangs' with a message on the screen saying "Sending BOOTP and RARP requests ...", then you should check the following:
- Check your network cables, switch configuration; make sure that the interface on the server is correctly configured.
- Make sure that rarp is supported by the server's kernel.
- Make sure that the server has a rarp entry for the problematic client. This can be checked with 'rarp -a'. Make sure that the hardware address is correct for the client.
- Run 'tcpdump -i eth1 rarp' on the server and boot the disk-less client (assuming eth1 is the interface connected to the cluster). When the client boots and broadcasts its rarp request, you should see this in the tcpdump output. If everything is set up correctly, you will also see the server's rarp reply. If you can't see the request packet, then the most probable cause of the problem is a faulty connection; this could be a cable, a switch, or a NIC. If you can see the client's rarp request but the server does not reply, then the most probable cause is an incorrect or missing rarp entry.
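
The rarp entry check from the list above can be scripted along the following lines. This is only a sketch: the function name has_rarp_entry and the MAC address in the usage comment are made up, and the function makes no assumption about the column layout of the 'rarp -a' output; it only looks for the hardware address anywhere in the text it is given.

```shell
#!/bin/bash
# has_rarp_entry MAC: read a RARP table listing on standard input (for
# example the output of 'rarp -a') and succeed if MAC appears in it.
# The match is case-insensitive because hardware addresses may be
# printed in either case. No particular column layout is assumed.
has_rarp_entry() {
    local mac=$1
    grep -qiF -- "$mac"
}

# Typical use on the server (the MAC address is a made-up example):
#   rarp -a | has_rarp_entry 00:A0:C9:12:34:56 || echo "no rarp entry"
```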

6.3 How to access clients' consoles?

Because your clients do not have a video card or a keyboard attached to them, you cannot access them directly as you can with the server. There might be a time (especially during changes of configuration) when there is a problem with the network and you cannot telnet or rlogin to the clients, so you must access them some other way. There are basically two methods of accessing clients' consoles. The first is using monitor and keyboard switches as described by Jan Lindheim in Building a Beowulf System http://www.cacr.caltech.edu/beowulf/tutorial/building.html, and the other is using a serial terminal.

6.4 Installing the operating system on each client separately

If you are installing off a CD-ROM and only have one drive for the whole system, you will have to move the CD-ROM drive from client to client after each install, or do an NFS install. If you have only one floppy disk drive you will have to move it as well. In my case I installed all the nodes from our local ftp server, so I only had to move the floppy drive. To cut down on the installation time I recommend installing the full distribution. Selecting packages to install is a real pain, and it is even worse if you have 16 nodes to install. These days the smallest hard disks you can buy are well over 2 GB, so you should not have to worry about running short of disk space.
