Bccd-ng - bccd-3.3.1-rc - r4486 - Aaron's Log

Purpose

To determine whether the BCCD 3.3.1 Release Candidate (RC) ISOs are suitable for release.

Materials

Procedure

Preparation

Make bootable USB

  1. Boot up the cluster.
  2. Become the root user:
    $ sudo su - root
  3. Go to the directory with the script for making bootable USBs:
    # cd /root
  4. Download the ISO (this example uses amd64; the URL differs for i386, but the process is the same):
    # wget http://cluster.earlham.edu/bccd-ng/testing/amweeden06/bccd-3.3.1-rc.amd64.iso
  5. Download the md5 checksum for the ISO:
    # wget http://cluster.earlham.edu/bccd-ng/testing/amweeden06/bccd-3.3.1-rc.amd64.iso.md5
  6. Verify the downloaded checksum matches the generated checksum:
    # md5sum bccd-3.3.1-rc.amd64.iso > tmp; diff tmp bccd-3.3.1-rc.amd64.iso.md5; rm tmp
    • If this prints anything, the checksums do not match, which probably means the ISO did not download properly. If this is the case, go back to the "Download the ISO" step and try again.
    • Second USB Note: If you have a second USB stick, the following steps will be repeated, starting with this step:
  7. Check which devices are currently present on the machine:
    # ls /dev/sd*
  8. Insert the USB stick.
  9. Check which devices are now present on the machine:
    # ls /dev/sd*
  10. Note the device name of the USB stick (it will be something like /dev/sdb).
  11. Start the build script, replacing /dev/sdb with the device name of your USB stick, if it differs:
    # ./build_bootable_USB.sh -d /dev/sdb -i bccd-3.3.1-rc.amd64.iso -m mac
    • I got the following message at the end. This also occurred for the second USB stick. However, this did not affect the rest of the testing:
      No device specified!
      Usage: mkdosfs [-a][-A][-c][-C][-v][-I][-l bad-block-file][-b backup-boot-sector]
      [-m boot-msg-file][-n volume-name][-i volume-id]
      [-s sectors-per-cluster][-S logical-sector-size][-f number-of-FATs]
      [-h hidden-sectors][-F fat-size][-r root-dir-entries][-R reserved-sectors]
      /dev/name [blocks]
      Failed to format /dev/sdb2
  12. If you have a second USB stick, go back to step 7 (see the Second USB Note above), repeat the steps through step 11, then remove the second USB stick before continuing to the next step.
  13. Reboot the machine:
    # reboot
  14. Enter the BIOS of the machine as it boots:
    • This will depend on your machine, but for LittleFe, press the Delete key.
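The checksum comparison in step 6 can also be done by comparing only the hash fields, which avoids a false mismatch if the published .md5 file records a different path for the ISO. This is a minimal sketch, assuming the .md5 file is in the usual md5sum format with the hash as the first field; verify_md5 is a hypothetical helper name, not part of the BCCD.

```shell
# Sketch only: verify_md5 is a hypothetical helper, not a BCCD tool.
# Assumes the .md5 file has the hash as its first whitespace-separated field.
verify_md5() {
    iso="$1"
    md5file="$2"
    # Hash published alongside the ISO (first field of the first line).
    expected=$(awk '{print $1; exit}' "$md5file")
    # Hash computed from the file that was actually downloaded.
    actual=$(md5sum "$iso" | awk '{print $1}')
    if [ "$expected" = "$actual" ]; then
        echo "OK: checksums match"
    else
        echo "MISMATCH: download the ISO again" >&2
        return 1
    fi
}

# Usage (as root, in the download directory):
#   verify_md5 bccd-3.3.1-rc.amd64.iso bccd-3.3.1-rc.amd64.iso.md5
```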

Enable bootable USB in the BIOS

These instructions are LittleFe-specific. If you aren't using LittleFe, check documentation for your own machine's BIOS, or use trial and error to figure out how to emulate these steps.

  1. Select Advanced BIOS Features (use the arrow keys to highlight it and press Enter).
  2. Select the 1st Boot Device and select the USB (in my case USB:Kingston DataTraveler).
  3. Save the changes and exit by pressing the F10 button and selecting OK.
  4. If you have a second USB stick, repeat these steps on the compute node.

Boot into BCCD

  1. Make sure you are booting into the BCCD via the USB, not the hard drive. You should see a boot prompt, as opposed to a GRUB screen. Press Enter to begin the boot.
  2. You should eventually see a place to enter your password (the text will stop scrolling and say "Please set a password for the default user"). The canonical choice for this is bccd.
  3. A blue screen will ask you about network information. If it says "No DHCP for eth0", you can tell it to skip by pressing Enter.
    • Note: For the second USB stick, this blue screen should not appear. Instead, the BCCD should boot into a hostname of "node009".
  4. You will be greeted by two terminal windows. In one of them, confirm you are running the proper ISO:
    $ bccd-version
    • This should return Revision: 3.3.1.4486. If it does not, you are running the wrong ISO.
    $ cat /etc/bccd-stage
    • This should return LIVE. If it instead says LIBERATED, it did not successfully boot from the USB.
  5. If you have a second USB stick, repeat these steps (note that the second node may already have passed the boot screen).
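The two confirmations in step 4 can be combined into one check. This is a sketch under stated assumptions: check_boot is a hypothetical helper, and it matches only on the revision number (4486, from the page title) rather than on the full version string.

```shell
# Sketch only: check_boot is a hypothetical helper, not a BCCD tool.
# Arguments: version output, stage string, wanted revision, wanted stage.
check_boot() {
    version_output="$1"
    stage="$2"
    want_rev="$3"
    want_stage="$4"
    # The revision number must appear somewhere in the version output.
    case "$version_output" in
        *"$want_rev"*) ;;
        *) echo "wrong ISO: expected revision $want_rev" >&2; return 1 ;;
    esac
    # The stage must match exactly (LIVE for USB boot, LIBERATED for disk).
    if [ "$stage" != "$want_stage" ]; then
        echo "wrong stage: got $stage, expected $want_stage" >&2
        return 1
    fi
    echo "boot check passed"
}

# Usage on a node booted from the USB stick:
#   check_boot "$(bccd-version)" "$(cat /etc/bccd-stage)" 4486 LIVE
```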

Testing

Background image test

  1. Confirm the background image ("Earlham College, BCCD, Shodor") is clear with no text cut off.

Download the automated testing script

Skip these instructions for LIBERATED mode.

  1. Change to the bin directory:
    $ cd /bin
  2. Download the script:
    $ sudo wget http://cluster.earlham.edu/svn/bccd-ng/trunk/trees/bin/bccd-test-suite
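  3. Replace the existing script with the downloaded copy (wget saves the download as bccd-test-suite.1 when a file named bccd-test-suite already exists), then fix the permissions so that the bccd user can execute the script: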
    $ sudo mv bccd-test-suite.1 bccd-test-suite
    $ sudo chmod go+x bccd-test-suite
  4. Change back to the home directory:
    $ cd
    • Note: if this step is skipped, test #788 of bccd-test-suite fails, because it creates a temporary file in the current directory, and the bccd user does not have write permission in the /bin directory.

Run the automated testing script

  1. Run the script:
    $ bccd-test-suite
  2. If you have a second node, repeat the above step on that node.
  3. Record the Results below.

GalaxSee MPI test

  1. Snarf the hosts on the head node:
    $ bccd-snarfhosts -v
  2. If you have 2 nodes running in your cluster, confirm node009 (or node011 for LIBERATED mode) is listed first, followed by node000.
  3. Change to the GalaxSee directory:
    $ cd GalaxSee
  4. Build the executable (skip this step in LIBERATED mode):
    $ make
  5. Sync the directory across the cluster (skip this step in LIBERATED mode):
    $ bccd-syncdir . ~/machines-openmpi
  6. Change to the sync'd directory (skip this step in LIBERATED mode):
    $ cd /tmp/node000-bccd
  7. Run a job across all nodes in the cluster (replace 2 with the number of nodes you have running in the cluster):
    $ mpirun -np 2 --bynode -machinefile ~/machines-openmpi GalaxSee.cxx-mpi
  8. In the other window, confirm only 1 process for GalaxSee shows up in the table of processes (re-run GalaxSee in the other window if it finishes before you run this):
    $ top
  9. Quit top by typing q in that window.
  10. Change back to the home directory in the other window:
    $ cd
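The host ordering checked in step 2 can be read straight from the machines file. This is a minimal sketch, assuming the file written by bccd-snarfhosts has one host per line with the hostname as the first field (the page does not show the file format); host_order is a hypothetical helper name.

```shell
# Sketch only: host_order is a hypothetical helper, not a BCCD tool.
# Assumes one host per line, hostname first, '#' starting a comment line.
host_order() {
    # Print the hostname field of every non-blank, non-comment line, in order.
    awk 'NF && $1 !~ /^#/ {print $1}' "$1"
}

# Usage: the first line printed should name node009
# (or node011 in LIBERATED mode), followed by node000:
#   host_order ~/machines-openmpi
```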

Wi-Fi test

For these instructions, you will need a Wi-Fi network that you can connect to. If you have no Wi-Fi network, but you do have a Mac laptop connected to the Internet via Ethernet, follow the instructions immediately below to share its connection. Otherwise, skip to the next set of instructions:

  1. On the laptop, make sure Ethernet is connected and you can reach the Internet with it.
  2. Go to System Preferences -> Sharing
  3. Click Internet Sharing
  4. Share your connection from -> Ethernet
  5. To computers using -> Wi-Fi
  6. Click the checkbox next to Internet Sharing
  7. Click Start.
  8. If you have OS X Mountain Lion, you can configure a WPA2 personal network under Wi-Fi Options -> Security.

Look for and connect to the Wi-Fi network

  1. Become the root user:
    $ sudo su - root
  2. Enable the Wi-Fi interface (you will need the interface name -- for LittleFe it is wlan0):
    # ifconfig wlan0 up
  3. Do a scan for Wi-Fi networks:
    # iwlist wlan0 scan | grep -i essid
  4. If there is a network nearby, you should see something like the following, with the network name in quotes:
    ESSID:"littlefe"
  5. Attempt to connect to the network (replace littlefe with the name of your network):
    # iwconfig wlan0 essid littlefe
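The scan in step 3 prints lines such as ESSID:"littlefe". The sketch below strips them down to bare network names so a specific network can be tested for before connecting. It assumes the scan output keeps that ESSID:"name" form; list_essids and has_essid are hypothetical helper names.

```shell
# Sketch only: list_essids and has_essid are hypothetical helpers.
# Both read iwlist scan output on standard input.
list_essids() {
    # Keep only lines containing ESSID:"..." and print the quoted name.
    sed -n 's/.*ESSID:"\(.*\)".*/\1/p'
}

has_essid() {
    # Succeed only if a scanned name equals the wanted name exactly.
    list_essids | grep -Fxq -- "$1"
}

# Usage (as root):
#   iwlist wlan0 scan | has_essid littlefe && iwconfig wlan0 essid littlefe
```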

If the network uses WPA, use the following instructions.

  1. Create a directory for the pre-shared key (PSK) (skip this step for LIBERATED mode):
    # mkdir /UNIONFS/home/bccd/wpa
  2. Generate the PSK, replacing bccdbccd with the password of the network (for the instructions below in LIBERATED mode, replace UNIONFS with bccd):
    # wpa_passphrase littlefe bccdbccd > /UNIONFS/home/bccd/wpa/SSID.psk
  3. Attempt to connect to the network (replace littlefe with the name of your network):
    # iwconfig wlan0 essid littlefe
    # wpa_supplicant -B -i wlan0 -c /UNIONFS/home/bccd/wpa/SSID.psk -Dwext
    # sleep 3

Get an IP and test the connection:

  1. Request an IP address over DHCP:
    # dhclient -v wlan0
  2. Make sure the interface has an IP:
    # ifconfig wlan0
  3. Make sure you can ping over the interface:
    # ping -I wlan0 bccd.net
  4. Press Ctrl-D to exit as root.

/proc test

These instructions require 2 or more nodes in the cluster.

  1. ssh to the client node:
    $ ssh node009
    • Note: Use node011 for LIBERATED mode
  2. Run ps:
    node009$ ps
  3. Use /tmp:
    node009$ ls /tmp
  4. Confirm ps still works:
    node009$ ps
  5. Press Ctrl-D to log out of the client node.

R test

  1. Load the R module:
    $ module load R
  2. Start R and confirm it says version 3.0.1:
    $ R
    R version 3.0.1 (2013-05-16) -- "Good Sport"
  3. Run a line of R code and confirm it works (e.g. make sure Inf - Inf = NaN):
    > Inf - Inf
    [1] NaN
  4. Press Ctrl-D to quit R:
    Save workspace image? [y/n/c]:
  5. Enter n to not save a workspace image.
  6. Unload the R module:
    $ module unload R

Liberation test

These instructions require a hard drive that can be erased and on which the BCCD can be installed.

  1. Run the liberation script, which erases the hard drive and installs BCCD. Replace /dev/sda with the device name of your hard drive. Make sure this is the correct device, otherwise you will erase something else!
    $ sudo perl /root/liberate.pl --libdev /dev/sda
  2. You will be prompted whether you are sure you want to erase what is on the drive. Enter y to approve.
  3. Reboot:
    $ sudo reboot
  4. Disconnect the USB stick so that it boots from the hard drive.
  5. Check the version number.
    $ bccd-version
    • This should return Revision: 3.3.1-rc.4486. If it does not, you are running the wrong ISO.
    $ cat /etc/bccd-stage
    • This should return LIBERATED. If it instead says LIVE, it booted from the USB. Reboot and try again with the USB stick detached.
  6. Set up PXE booting:
    $ sudo bccd-reset-network
    1. When it asks if you want to skip eth0, choose No
    2. When it asks if you want to make eth0 PXE capable, choose Yes
    3. When it asks if you want to skip eth0:1, choose No
  7. Remove the BCCD drive from the second node and boot it up. It should pick up the BCCD through PXE over the network (if it doesn't, make sure the network is properly connected and that network booting is enabled in the BIOS of the machine).
  8. Once the cluster is running in liberated mode, repeat the steps in the Testing section (except the liberation test, of course), and then proceed to the tests after this one.
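Because step 1 erases whatever device it is given, a small guard before the liberation script can catch a mistyped device name. This is a sketch only: guard_libdev is a hypothetical helper, and it confirms only that the path is a block device, not that it is the intended disk.

```shell
# Sketch only: guard_libdev is a hypothetical helper, not part of liberate.pl.
guard_libdev() {
    dev="$1"
    # [ -b ] is true only for an existing block device node.
    if [ ! -b "$dev" ]; then
        echo "$dev is not a block device; refusing to continue" >&2
        return 1
    fi
    echo "$dev is a block device; double-check that it is the right disk"
}

# Usage:
#   guard_libdev /dev/sda && sudo perl /root/liberate.pl --libdev /dev/sda
```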

cpumon test

This test requires 6 nodes in the cluster. The instructions below assume LIBERATED mode.

  1. Boot up all 6 nodes.
  2. Generate the machines file:
    $ bccd-snarfhosts -v
  3. Make sure all 6 nodes appear. If they do not, make sure they all booted properly and that the head node can reach them over the network.
  4. Change to the cpumon directory:
    $ cd Community-Modules/UMW/cpumon
  5. Create the cpumonitor executable:
    $ make
  6. Start cpumonitor with one process on each of the 6 nodes:
    $ mpirun -np 6 --bynode -machinefile ~/machines-openmpi ./cpumonitor
  7. In the other terminal window, change to the GalaxSee directory:
    $ cd GalaxSee
  8. Compile the software (it may say "Nothing to be done" if it has already been compiled):
    $ make
  9. Run GalaxSee with 1 process on each node for 100,000 time steps:
    $ mpirun -np 6 --bynode -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500 500 100000
  10. Observe the core values, confirming that one core on each node is running at close to 100%, and that both cores on the head node are running at close to 100%.
  11. Run GalaxSee with 2 processes on each node for 100,000 time steps (same line as above, but change 6 to 12 and delete the --bynode option):
    $ mpirun -np 12 -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500 500 100000
  12. Observe the core values, confirming that both cores on each node are running at close to 100%. The last node may have less load.
  13. Press the "X" at the top-right corner of the cpumonitor window, and confirm it closes the program.

bccd-shutdown test

  1. Confirm bccd-shutdown shuts down the entire cluster:
    $ bccd-shutdown

Results

Conclusions

Testing problems

Major problems

Minor problems
