Cluster Quick Start (DRAFT)
Cluster Quick Start Version 0.1
===============================

Prepared by: Douglas Eadline (deadline@plogic.com)
Update: 2/2/99

This document is distributed under the GNU GENERAL PUBLIC LICENSE Version 2,
June 1991. Copies of the license can be found at
http://www.fsf.org/copyleft/gpl.html

Copyright (C) 1999 Paralogic, Inc., 130 Webster St, Bethlehem PA 18015
(http://www.plogic.com)


READ FIRST:
===========

Check http://www.xtreme-machines.com/x-cluster-qs.html for updates!

This document assumes all machines have been installed with the standard
RedHat 5.1 or 5.2 release. Furthermore, all systems are assumed to boot from
their own hard drive and to have a Fast Ethernet connection to a switch
controlling the private cluster network. One of the machines is assumed to be
a gateway node which normally attaches to a local area network. There are
other ways to boot and configure the hardware (bootp and disk-less nodes),
but this document does not cover these methods.

There are many ways to configure a cluster; this document describes one way.
This guide is not complete. It is really just a series of steps or issues
that need to be addressed before a cluster can be usable. Indeed, it is very
difficult to create a "step by step" set of instructions because every
cluster seems to be different. I am sure I have missed many things. Please
submit questions to deadline@plogic.com with the subject "cluster quick
start". I will either incorporate the answer in the text or add it to the
FAQ at the end.

Some things that are intentionally not included: building and installing
kernels, hardware selection, Linux installation, script or kick start node
builds, and packaging.

Comments and suggestions are welcome. Please send any CONSTRUCTIVE comments
to deadline@plogic.com with the subject "cluster quick start".

The original Extreme Linux CD is now considered OBSOLETE. Other than the
documentation, the RPMs on the CD should not be used.

See the Beowulf HOWTO for other important information and links to other
sources of information. You may also want to check the Beowulf mailing list
archives (http://www.beowulf.org/listarchives/beowulf/) for more information.
Finally, we plan to prepare current RPMs for all software mentioned in this
document.

I should also mention that for those who need to buy a Beowulf cluster, roll
it out of the box, and turn it on (i.e. you do not want to have to read and
perform all the tasks in this document), please see
http://www.xtreme-machines.com.


1.0 Introduction:
=================

1.1 What makes a cluster different than a network of workstations?
-------------------------------------------------------------------

Aside from the philosophical issues, the main differences revolve around
security, application software, administration, booting, and file systems.

As delivered, most Linux distributions have very high security enabled (for
good reasons). This security often "gets in the way" of cluster computing.

Application software uses underlying message passing systems like MPI and
PVM. There are many ways to express parallelism, but message passing is the
most popular and appropriate in many cases. UNLESS AN APPLICATION WAS
SPECIFICALLY WRITTEN FOR A CLUSTER ENVIRONMENT IT WILL ONLY WORK ON ONE CPU.
YOU CAN NOT TAKE APACHE OR MYSQL AND RUN THEM OVER MULTIPLE CPUS IN A CLUSTER
UNLESS SPECIFIC PARALLEL SOURCE CODE VERSIONS HAVE BEEN CREATED.

Fortunately, there are tools to help administer a cluster.
For instance, rebooting 32 nodes by hand is not where you want to be spending
your time. Part of administering a cluster means relaxing some security
within the cluster. Allowing root to rlogin to any machine is very valuable
on a cluster, but very dangerous on an open network.

Although disk-less booting from a host node is appealing, it requires a bit
more configuration than allowing each node to boot off its own hard drive.
Disk-less booting will not be covered at the present time.

Finally, file systems should be set up so that, at a minimum, /home is shared
across all the nodes. Other file systems can be shared as required by the
users.

1.2 General Setup
=================

As mentioned, this document assumes a very simple configuration. The cluster
consists of two or more Linux boxes connected through a Fast Ethernet switch
(a hub will also work, but it is slower). One node is usually a gateway that
connects to a local LAN. The gateway node is often the "login node" and has a
keyboard, monitor, and mouse. The rest of the nodes are "headless" (no
keyboard, mouse, or monitor) and are often referred to as "compute nodes".
The following diagram describes the general setup.

                -----------------------------
                |           SWITCH          |
                -----------------------------
                  |      |     |     |     |
        __________|      |     |     |     |
        |                |     |     |     |
        | (gateway)    ----- ----- ----- -----
LAN --->|              |   | |   | |   | |   |
  ----------------     ----- ----- ----- -----
  |  HOST NODE   |     node  node  node  node
  | (login node) |     four  three two   one
  ----------------
      |       |
  keyboard  monitor
  and mouse

1.3 Assumed Competence
======================

Setting up and configuring a cluster requires certain Linux administration
skills. An introduction to these skills is far beyond the scope of this
document. Basically, knowledge of networking, addressing, booting, building
kernels, Ethernet drivers, NFS, package installation, compiling, and some
other areas will be required for successful administration. See the LDP
(Linux Documentation Project, http://metalab.unc.edu/LDP/) for excellent
documentation. You may also look at the Beowulf HOWTO for other links
(http://metalab.unc.edu/LDP/HOWTO/Beowulf-HOWTO-5.html#ss5.6).


2.0 Hardware Issues:
====================

In general, this document only covers hardware issues that are important to
getting the cluster up and running.

2.1 Network Interface Cards:
----------------------------

For best performance, Intel or DEC tulip based (2114X) NICs should be used.
These NICs provide "wire speed" performance. Driver availability can be a big
issue, particularly for tulip NICs. Please consult the Linux-Tulip
Information page (http://maximus.bmen.tulane.edu/~siekas/tulip.html) for
complete instructions on how to install the latest tulip driver. Also, please
see Don Becker's "Linux and the DEC Tulip Chip" page for additional
information (http://cesdis.gsfc.nasa.gov/linux/drivers/tulip.html).

2.2 Addressing:
---------------

Within the cluster, nodes should be addressed using reserved (private) IP
addresses. These addresses should never be made "visible to the Internet" and
should use numbers in the range 192.168.X.X.

2.3 Gateway Node:
-----------------

It is also assumed one of the nodes in the cluster will act as a gateway to a
local area network. This node will have two network connections (probably two
NICs). One NIC will be addressed as 192.168.X.X while the other will have an
IP address on your internal (house) network. It is possible to route through
the gateway (i.e. to connect to a file server).
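For reference, on a stock RedHat install each interface is described by a
file under /etc/sysconfig/network-scripts (the same settings can be made with
the network administration tool). A minimal sketch for the gateway node
follows; the 192.168.0.1 address and the LAN-side 10.0.0.5 address are
examples only - use whatever addresses your site actually assigns.

# /etc/sysconfig/network-scripts/ifcfg-eth0   (cluster side)
DEVICE=eth0
IPADDR=192.168.0.1
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth1   (LAN side - example address only)
DEVICE=eth1
IPADDR=10.0.0.5
NETMASK=255.255.255.0
ONBOOT=yes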
2.4 Switch Configuration:
-------------------------

Most switches can be administered by assigning them an IP number and then
telneting to the switch. You can use something like 192.168.0.254 for the
switch if you assign your nodes addresses like 192.168.0.1, 192.168.0.2, and
so on. Switches should have any spanning tree features disabled, and all
ports should come up at 100BTx-Full Duplex. If this is not the case, check
that the NICs and the switch are negotiating correctly. See Section 17.0,
"NIC and Switch Problems", below.

2.5 Direct Access:
------------------

At some point you will want or need to access a node directly through an
attached monitor and keyboard, usually due to a network problem. This can be
done the old fashioned way, using an extra monitor and keyboard and moving
them to the desired node. Another way is to buy a KVM (keyboard/video/mouse)
device that allows you to share a single keyboard, mouse, and monitor between
all the systems. With either a push button or a hot-key you can quickly move
between machines and fix problems.


3.0 Security Issues:
====================

It is recommended that these security changes be made ONLY on the non-gateway
nodes. In this way the security of the gateway is not compromised.

3.1 .rhosts versus hosts.equiv
------------------------------

There are two ways to eliminate passwords within the cluster. You can either
add an entry to the /etc/hosts.equiv file or add a .rhosts file in each
user's home directory. The .rhosts method is preferable because there is only
one copy, in each user's home directory. /etc/hosts.equiv must be maintained
on each node of the cluster, which may lead to an administration headache
when users are added or deleted.

The format of a .rhosts file is simply a list of hosts:

# .rhosts file for coyote cluster
# must be read/writable by user only!
coyote1
coyote2
coyote3
coyote4

The format of the hosts.equiv file is:

#hosts.equiv file for coyote cluster
#node name      user name
coyote1         deadline
coyote2         deadline
coyote3         deadline
coyote4         deadline
coyote1         wgates
coyote2         wgates
coyote3         wgates
coyote4         wgates

3.2 root rlogin Access:
-----------------------

To allow root to rlogin to any node in the cluster, add a .rhosts file to
root's home directory (/root) on each node. The .rhosts file should list all
the nodes in the cluster.

IMPORTANT: The .rhosts file must be read/writable only by the owner
("chmod go-rwx .rhosts"). Again, this should not be done on the gateway node.

In addition, swap the first two lines of /etc/pam.d/rlogin:

#original /etc/pam.d/rlogin
auth       required     /lib/security/pam_securetty.so
auth       sufficient   /lib/security/pam_rhosts_auth.so
auth       required     /lib/security/pam_pwdb.so shadow nullok
auth       required     /lib/security/pam_nologin.so
account    required     /lib/security/pam_pwdb.so
password   required     /lib/security/pam_cracklib.so
password   required     /lib/security/pam_pwdb.so shadow nullok use_authtok
session    required     /lib/security/pam_pwdb.so

#/etc/pam.d/rlogin with the first two lines swapped
auth       sufficient   /lib/security/pam_rhosts_auth.so
auth       required     /lib/security/pam_securetty.so
auth       required     /lib/security/pam_pwdb.so shadow nullok
auth       required     /lib/security/pam_nologin.so
account    required     /lib/security/pam_pwdb.so
password   required     /lib/security/pam_cracklib.so
password   required     /lib/security/pam_pwdb.so shadow nullok use_authtok
session    required     /lib/security/pam_pwdb.so

NOTE: I do not know if there is a better way to do this, but it seems to
work.
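A sketch of what this looks like on one compute node, using the node names
from the sample /etc/hosts in Section 4.0 (Bourne shell syntax; remember, do
not do this on the gateway):

# run as root on each compute node
cat > /root/.rhosts << EOF
node1
node2
node3
node4
EOF
chmod go-rwx /root/.rhosts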
3.3 root telnet Access:
-----------------------

On every node except the gateway, add the following to the /etc/securetty
file:

ttyp0
ttyp1
ttyp2
ttyp3
ttyp4

This change will allow root to telnet into any node in the cluster.

3.4 root ftp Access:
--------------------

On any system that needs root ftp access, the /etc/ftpusers file has to have
the entry for root commented out:

# Comment out root to allow other systems ftp access as root
#root
bin
daemon
adm
lp
sync
shutdown
halt
mail
news
uucp
operator
games
nobody


4.0 Sample hosts file
=====================

The following is a sample /etc/hosts file for an 8 node cluster with an
addressable switch. It is assumed that each node had the correct IP address
assigned when Linux was installed, or had it set later using the network
administration tool.

#sample /etc/hosts
127.0.0.1       localhost localhost.cluster
192.168.0.1     node1 node1.cluster
192.168.0.2     node2 node2.cluster
192.168.0.3     node3 node3.cluster
192.168.0.4     node4 node4.cluster
192.168.0.5     node5 node5.cluster
192.168.0.6     node6 node6.cluster
192.168.0.7     node7 node7.cluster
192.168.0.8     node8 node8.cluster
192.168.0.254   switch


5.0 User Accounts and File Systems:
===================================

Each user needs an account on every node they want to use. To ease
administration, /home from a host node (usually the gateway or login node) is
mounted on every node using NFS.

5.1 Mounting /home across the cluster
-------------------------------------

The /home directory needs to be cross-mounted throughout the cluster. All
nodes other than the host (presumably the gateway) should have nothing in
/home. (This means you should not add users on the nodes; see below.) To
mount /home, make the following entry in /etc/fstab on all the nodes (not the
host):

hostnode:/home       /home        nfs   bg,rw,intr      0 0

where "hostnode" is the name of your host node. While you are at it, if the
host has a CD-ROM drive you can add the following line to access the CD-ROM
via NFS (make sure each node has a /mnt/cdrom directory):

hostnode:/mnt/cdrom  /mnt/cdrom   nfs   noauto,ro,soft  0 0

Next you will need to modify /etc/exports on the host node. You will need two
entries that look like this (list all of the nodes in your cluster):

#allow nodes to mount /home and read the CDROM
/home        node1(rw) node2(rw) node3(rw)
/mnt/cdrom   node1(ro) node2(ro) node3(ro)

Restart NFS and mountd (use ps to find the pids of rpc.nfsd and rpc.mountd,
then enter "kill -HUP pid" for each pid). If all goes well, you can issue a
"mount /home" on any node and /home should be mounted. If not, check
/var/log/messages for errors and see the mount man page for more information.

When a node boots it will automatically mount /home, but not the CD-ROM. To
mount the CD-ROM on a node, just enter "mount /mnt/cdrom" and the contents of
the CD-ROM will be visible under /mnt/cdrom. If you have problems, check
/var/log/messages and the manual pages for mount and nfs.

5.2 Add user accounts
---------------------

Add user accounts on the host node as you would on a single Linux
workstation. A quick way to add users to each node is to copy the entry for
the user from /etc/passwd on the host node and paste it into the /etc/passwd
file on each node. Make sure that the user and group ids are the same for
each user throughout the cluster. The user should then have login privileges
throughout the cluster. Another way is to use NIS (to be completed).
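If root rsh access to the nodes has already been set up as in Section 3, this
copy step can be scripted. The following is only a sketch (Bourne shell
syntax) for a hypothetical user "deadline" and the node names from Section
4.0; it blindly appends, so skip any node that already has the entry, and
give /etc/shadow the same treatment if shadow passwords are in use.

# run as root on the host node
NEWUSER=deadline
for n in node2 node3 node4
do
    # append the user's passwd and group entries on each node
    grep "^${NEWUSER}:" /etc/passwd | rsh $n "cat >> /etc/passwd"
    grep "^${NEWUSER}:" /etc/group  | rsh $n "cat >> /etc/group"
done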
6.0 Administration: CMS
=======================

There is a package called CMS (Cluster Management System). It is available
from http://smile.cpe.ku.ac.th/software/scms/index.html. This version is new,
and we have not had time to test it. The previous version worked well except
for the remote (real time) monitoring. It does include a system reboot and
shutdown feature.


7.0 Compilers:
==============

We recommend using the egcs compilers (including g77) for compiling codes.

Source:  http://egcs.cygnus.com/
Version: egcs-1.1.1 (gzip format)

Once compiled and installed, the egcs compilers should reside in /usr/local.
In this way users can set their path to point to the appropriate versions
(i.e. the standard gcc is in /usr/bin while the egcs gcc is in
/usr/local/bin).

Note: use the standard gcc (not the egcs gcc) to build kernels. "gcc -v" and
"which gcc" can tell you which version you are using. g77 is the egcs FORTRAN
compiler.


8.0 Communication Packages
==========================

This is not an exhaustive list; these are the most common packages.

A cluster is a collection of local memory machines. The only way for node A
to communicate with node B is through the network. Software built on top of
this architecture "passes messages" between nodes. While message passing
codes are conceptually simple, their operation and debugging can be quite
complex. There are two popular message passing libraries in use: PVM and MPI.

8.1 PVM versus MPI
------------------

Both PVM and MPI provide a portable software API that supports message
passing. From a historical standpoint, PVM (Parallel Virtual Machine) came
first and was designed to work on networks of workstations. It has since been
adapted to many parallel supercomputers (that may or may not use distributed
memory). Control of PVM rests primarily with its authors. MPI, on the other
hand, is a standard that is supported by many hardware vendors. It provides a
bit more functionality than PVM and has versions for networks of workstations
(clusters). Control of MPI rests with the standards committee. In many cases
there does not seem to be a compelling reason to choose PVM over MPI. Many
people choose MPI because it is a standard, but the PVM legacy lives on. We
offer the source for each in this document.

MPI:
----

There are two freely available versions of MPI (Message Passing Interface).

MPICH:
  Source:  http://www-unix.mcs.anl.gov/mpi/mpich/
  Version: mpich.tar.gz
  Notes:   People (including us) have noted some problems with the Linux
           version.

LAM-MPI:
  Source:  http://www.mpi.nd.edu/lam/
  Version: lam61.tar.gz
  Notes:   Install the patches (lam61-patch.tar). Performance of LAM is quite
           good using the -c2c mode.

PVM:
----

  Source:  http://www.epm.ornl.gov/pvm/
  Version: pvm3/pvm3.4.beta7.tgz
  Notes:   There are a lot of PVM codes and examples out there.


9.0 Conversion software:
========================

Converting existing software to parallel form is a time consuming task, and
automatic conversion is very difficult. Tools are available for FORTRAN
conversion; converting C is much more difficult due to pointers. One tool for
converting FORTRAN codes (made by us) is called BERT and runs on Linux
systems. A free Lite version is available from
http://www.plogic.com/bert.html


10.0 Sample .cshrc
==================

#Assumes LAM-MPI, PVM and MPICH are installed
setenv LAMHOME /usr/local/lam61
setenv PVM_ROOT /usr/local/pvm3
setenv PVM_ARCH LINUX
setenv MPIR_HOME /usr/local/mpich

set path = (. $path)
# use egcs compilers first
set path = (/usr/local/bin $path)
set path = ($path /usr/local/pvm3/lib/LINUX)
set path = ($path /usr/local/lam61/bin)
set path = ($path /usr/local/mpich/lib/LINUX/ch_p4)
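After the .cshrc above is in place (log in again or source it), a quick
sanity check that the intended tools are being found might look like the
following. Which mpirun you get (LAM or MPICH) depends on the path order, so
verify it is the one you expect.

which gcc        # should be /usr/local/bin/gcc (egcs), not /usr/bin/gcc
gcc -v           # show the compiler version
which g77        # the egcs FORTRAN compiler
which mpirun     # LAM or MPICH copy, depending on path order
echo $PVM_ROOT $LAMHOME $MPIR_HOME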
11.0 Benchmarking and Testing Your Systems
==========================================

11.1 Network performance: netperf
---------------------------------

Source: http://www.netperf.org/netperf/NetperfPage.html

Run script:

./netperf -t UDP_STREAM -p 12865 -n 2 -l 60 -H NODE -- -s 65535 -m 1472
./netperf -t TCP_STREAM -p 12865 -n 2 -l 60 -H NODE

NODE is the remote node name.

11.2 Network Performance: netpipe
---------------------------------

Source: http://www.scl.ameslab.gov/Projects/Netpipe/

11.3 Parallel Performance: NAS Parallel Benchmarks
--------------------------------------------------

Source: http://www.nas.nasa.gov/NAS/NPB/


12.0 Ethernet Channel Bonding:
==============================

Background on channel bonding is available from:
http://www.beowulf.org/software/software.html

Requirements: TWO Ethernet NICs per system, and TWO hubs (one for each
channel), OR two switches (one for each channel), OR a switch that can be
segmented into virtual LANs.

Steps (for kernel 2.0.36):

1. Download and build the ifenslave.c program
   (http://beowulf.gsfc.nasa.gov/software/bonding.html). Comment out line 35
   (the "#include" line) and compile using:

   gcc -Wall -Wstrict-prototypes -O ifenslave.c -o ifenslave

2. Apply the kernel patch (get linux-2.0.36-channel-bonding.patch via ftp
   from ftp.plogic.com), run xconfig, and enable Beowulf channel bonding.

3. Rebuild and install the kernel.

Each channel must be on a separate switch or hub (or a segmented switch).
There is no need to assign an IP number to the second interface, although
using it as a separate network (without channel bonding) may have advantages
for some applications.

To channel bond, login to each system as root and issue the following command
on each system:

./ifenslave -v eth0 eth1

This will bond eth1 to eth0. It assumes that eth0 is already configured and
used as your cluster network, and that eth1 is the second Ethernet card
detected by the OS at boot.

You can do this from the host node by enslaving all nodes BEFORE the host.
(Order is important: node 2 must be enslaved before the host, node 1.) For
each node, do steps a, b, and c:

a. Open a window.
b. Login to the node (e.g. node 2).
c. Enter (as root) the above command.
d. In a separate window, enter (as root) the above command for node 1 (the
   host).

Your cluster should now be "channel bonded". You can test this by running
netperf or a similar benchmark.

Channel bonding shutdown is not as simple. We are investigating this and will
provide command line tools that execute channel bonding set-up and tear-down
automatically. In the mean time, the safest way to restore single channel
performance is to either reboot each system or use the network manager (part
of the control panel) to shut down and restart each interface.

REMEMBER: communication between a channel bonded node and a non channel
bonded node is very slow if not impossible. Therefore the whole cluster must
be channel bonded.


13.0 Quick Primer on LAM:
=========================

LAM is implemented for networks of workstations. In order to run LAM, a LAM
daemon must be started on each node. The daemon is very useful for testing
and debugging purposes (LAM can provide real time debugging information,
including deadlock conditions). Once code is working, however, programs can
be run using a standard socket interface for maximum speed. The daemon is
still used for start-up and tear-down, however.
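The "lamhosts" file used below is nothing more than a list of machine names,
one per line. For example, using the node names from the sample /etc/hosts in
Section 4.0, a lamhosts file might contain:

node1
node2
node3
node4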
13.1 Starting the LAM daemons
-----------------------------

From your home directory enter:

lamboot -v lamhosts

This boots the daemons based on the "lamhosts" file (which is just a list of
your machines). The "-v" option prints debug information - it is nice to see
what it is doing.

NOTE: the LAM daemons will stay "resident" even when you logout.

13.2 To see if the daemons are running:
---------------------------------------

Either enter:

ps auxww | grep lam

which will show all the lamd (LAM daemons) running. NOTE: each user can have
their own LAM daemon running (a nice feature).

Or enter:

mpitask

which will print:

TASK (G/L)      FUNCTION      PEER|ROOT      TAG      COMM      COUNT

if no jobs are running. If a job is running, you will see it listed here.

13.3 Cleaning up LAM:
---------------------

If you are concerned about the "state" of LAM, possibly due to a terminated
job, you can "clean up" your daemons by entering the "lamclean" command. You
can shut down your LAM daemons by issuing a "wipe lamhosts", where lamhosts
is the host file you used to boot LAM.

13.4 Running programs
---------------------

Running a program uses the mpirun command, as with MPICH, but there are some
options that should be used:

mpirun -O -c 2 -s h -c2c program

-O   = assume a homogeneous environment (no special encoding)

-c   = how many copies to run (in this case 2). NOTE: the -c option assigns
       programs "round robin" using the order specified in the host file. If
       "-c" is greater than the number of machines in your host file, LAM
       will start overloading the machines with jobs in the order specified
       in your host file.

-s h = source of the executable. This is handy, but with NFS not really
       necessary. The "h" means get the executable from the host (the node
       where you started LAM).

-c2c = use the client to client "socket" mode. This makes LAM go fast, but
       the daemons are not used for communication and therefore cannot
       provide run time debugging or trace information - which is fine once
       you have your application running.

13.5 Other information:
-----------------------

You can "man" the following topics: mpirun, wipe, lamclean, lamboot. You may
also want to consult:

/usr/local/src/lam61/doc/mpi-quick-tut/lam_ezstart.tut

and:

http://www.mpi.nd.edu/lam/

for more information.


14.0 Quick Primer on PVM
========================

Check out: http://netlib.org/pvm3/book/pvm-book.html
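Beyond the book, a quick way to get started (assuming PVM is installed under
/usr/local/pvm3 with the environment variables from the sample .cshrc in
Section 10.0, and that passwordless rsh works as in Section 3.1) is the PVM
console. A short sketch using the node names from Section 4.0:

pvm                  # start the console; this also starts the local pvmd
pvm> add node2       # add hosts to the virtual machine
pvm> add node3
pvm> conf            # show the hosts in the virtual machine
pvm> ps -a           # list PVM tasks on all hosts
pvm> quit            # leave the console; the daemons keep running
pvm> halt            # (instead of quit) shut the virtual machine down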
15.0 Configuring switches:
==========================

As an example, here is the main menu of a BayStack 350T switch, reached by
telneting to the IP address assigned to the switch (see Section 2.4):

BayStack 350T Main Menu

        IP Configuration...
        SNMP Configuration...
        System Characteristics...
        Switch Configuration...
        Console/Service Port Configuration...
        Spanning Tree Configuration...
        TELNET Configuration...
        Software Download...
        Display Event Log
        Reset
        Reset to Default Settings
        Logout

Use arrow keys to highlight option, press <Return> or <Enter> to select
option.


16.0 Clock Synchronization
==========================

There are some problems with the 2.0.X SMP kernels and clock drift, due to an
interrupt problem. The best solution is to use xntp and synchronize to an
external source. In any case, it is best to have your cluster's clocks
synchronized. Here is how to use xntp.

1. Set all system clocks to the current time.

2. Write the time to the CMOS RTC (Real Time Clock) using the command
   "clock -w".

3. Mount the CD-ROM on each system ("mount /mnt/cdrom"; see Section 5 if it
   does not work).

4. Go to /mnt/cdrom/RedHat/RPMS.

5. As root, enter "rpm -i xntp3-5.93-2.i386.rpm".

6. Now edit the file /etc/ntp.conf.

   ON ALL SYSTEMS, comment out the lines (as shown):

   #multicastclient                # listen on default 224.0.1.1
   #broadcastdelay 0.008

   ON NON-HOST SYSTEMS (everyone but the host), edit the lines:

   server HOSTNODE                 # local clock
   #fudge 127.127.1.0 stratum 0

   where HOSTNODE is the name of the host node. Close the /etc/ntp.conf file
   on each node.

7. Start xntpd on all systems: "/sbin/xntpd". You can start it at each boot
   by adding this command to your /etc/rc.d/rc.local file.

It will take some time for synchronization, but you can see messages from
xntpd in /var/log/messages.

What you have just done is tell the host node to run xntp and use its local
system clock as the reference. In addition, all the other nodes in the
cluster will get their time from the host.

Although xntpd is supposed to keep the system clock and the RTC in sync, it
is probably a good idea to sync them once a day. You can do this by (as root)
going to /etc/cron.daily and creating an executable file (chmod +x) called
"sync_clocks" that contains the following:

#!/bin/sh
# Assumes ntp is running, so sync the CMOS RTC to the OS system clock
/sbin/clock -w

Now all the clocks in your cluster should be sync'ed, using the host as a
reference. If you wish to use an external reference, consult the xntpd
documentation.


17.0 NIC and Switch Problems
============================

(to be completed)


18.0 BEOWULF FAQ:
=================

1. What is the difference between the Extreme Linux CD and a standard Red Hat
   distribution?

2. Can I build a cluster with Extreme Linux and run Oracle 8 (or whatever)
   across the cluster?

3.

--------------------

This document is distributed under the GNU GENERAL PUBLIC LICENSE Version 2,
June 1991. Copies of the license can be found at
http://www.fsf.org/copyleft/gpl.html

Copyright (C) 1999 Paralogic, Inc., 130 Webster St, Bethlehem PA 18015
(http://www.plogic.com)
This page, and all contents, are Copyright (c) 1999 by Paralogic Inc., Bethlehem, PA, USA, All Rights Reserved. This notice must appear on all copies (electronic, paper, or otherwise) of this document.