Bccd-ng - bccd-3.3.4-rc - r5989- Shawn's Log

Purpose

To determine whether the BCCD 3.3.4 Release Candidate (RC) ISO is suitable for release.

Materials

Procedure

Preparation

Make bootable USB

  1. Boot up the cluster.
  2. Become the root user:
    $ sudo su - root
  3. Go to the directory with the script for making bootable USBs:
    # cd
  4. Set a variable for the ISO filename:
    # ISO=bccd-3.3.4-skylar.5989.amd64.iso
  5. Download the ISO:
    # wget http://cluster.earlham.edu/bccd-ng/testing/skylar/${ISO}
  6. Download the md5 checksum for the ISO:
    # wget http://cluster.earlham.edu/bccd-ng/testing/skylar/${ISO}.md5
  7. Verify the downloaded checksum matches the generated checksum:
    # md5sum ${ISO} | awk '{print $1}' > tmp; awk '{print $4}' ${ISO}.md5 | diff tmp -; rm tmp
    • If this prints anything, the checksums do not match, which probably means the ISO did not download properly. In that case, go back to the "Download the ISO" step and try again.
  8. Check which devices are currently present on the machine:
    # ls /dev/sd*
  9. Insert the USB stick.
  10. Check which devices are now present on the machine:
    # ls /dev/sd*
    • Second USB Note: If you have a second USB stick, repeat the following steps for it, starting with this step.
  11. Note the device name of the USB stick (it will be something like /dev/sdb).
  12. Downgrade syslinux so the bootable USB does not have a kernel panic on boot:
    # wget https://mirrors.mediatemple.net/debian-archive/debian-backports/pool/main/s/syslinux/syslinux-common_4.05%2bdfsg-2~bpo60%2b1_all.deb
    # wget https://mirrors.mediatemple.net/debian-archive/debian-backports/pool/main/s/syslinux/syslinux_4.05%2bdfsg-2~bpo60%2b1_amd64.deb
    # dpkg -i syslinux-common_4.05+dfsg-2~bpo60+1_all.deb syslinux_4.05+dfsg-2~bpo60+1_amd64.deb
    # syslinux -version
    • Confirm version 4.05
  13. Start the build script, replacing /dev/sdb with the device name of your USB stick, if it differs:
    # ./build_bootable_USB.sh -d /dev/sdb -i ${ISO} -m pc
  14. If you have a second USB stick, go back to the first step that is repeated (see the Second USB Note above).
  15. Reboot the machine:
    # reboot
  16. Enter the BIOS of the machine as it boots:
    • This will depend on your machine, but for LittleFe v4d, press the Delete key.
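
The checksum comparison in the "Verify the downloaded checksum" step can also be wrapped in a small function. The following is only a sketch: verify_md5 is a hypothetical helper that is not part of the BCCD, and it assumes the checksum file contains exactly one 32-character hexadecimal field (the digest), whichever column that field is in.

```shell
# Hypothetical helper (not shipped with the BCCD): compare the MD5 digest of
# an ISO against the digest stored in a checksum file.
# Assumption: the digest is the first 32-character hex field in that file.
verify_md5() {
    iso="$1"
    sumfile="$2"
    # Digest of the downloaded file.
    actual=$(md5sum "$iso" | awk '{print $1}')
    # First field in the checksum file that looks like an MD5 digest.
    expected=$(awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[0-9a-fA-F]+$/ && length($i) == 32) {
                print tolower($i)
                exit
            }
    }' "$sumfile")
    if [ -n "$expected" ] && [ "$actual" = "$expected" ]; then
        echo "OK: checksums match"
        return 0
    fi
    echo "MISMATCH: expected '$expected', got '$actual'" >&2
    return 1
}
```

Example use, after the two downloads: verify_md5 ${ISO} ${ISO}.md5. A non-zero exit status means the ISO should be downloaded again.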

Enable bootable USB in the BIOS

These instructions are LittleFe-v4d-specific. If you aren't using this system, check documentation for your own machine's BIOS, or use trial and error to figure out how to emulate these steps.

  1. Select Advanced BIOS Features (use the arrow keys to highlight it and press Enter).
  2. Select the 1st Boot Device and select the USB.
  3. Save the changes and exit by pressing the F10 button and selecting OK.
  4. If you have a second USB stick, repeat these steps on the compute node.

Boot into BCCD

  1. Make sure you are booting into the BCCD via the USB, not the hard drive. You should see a boot prompt, as opposed to a GRUB screen. Press Enter to begin the boot.
  2. You should eventually see a place to enter your password (the text will stop scrolling and say Please set a password for the default user). The canonical choice for this is bccd.
  3. A blue screen will ask you about network information. If it says "No DHCP for eth0", you can tell it to skip by pressing Enter.
    • Note: For the second USB stick, this blue screen should not appear. Instead, the BCCD should boot into a hostname of "node009".
  4. You will be greeted by two terminal windows. In one of them, confirm you are running the proper ISO:
    $ bccd-version
    • This should return Revision: 3.3.4-skylar.5989 and Build date: 2018-1-25. If it does not, you are running the wrong ISO.
    $ cat /etc/bccd-stage
    • This should return LIVE. If it instead says LIBERATED, it did not successfully boot from the USB.
  5. If you have a second USB stick, repeat these steps.
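
The stage check above can be scripted so that it fails loudly. This is a sketch only: check_stage is a hypothetical helper, and it assumes /etc/bccd-stage contains nothing but the stage word, as the cat output described above suggests.

```shell
# Hypothetical helper (not shipped with the BCCD): confirm the stage file
# holds the expected stage word.
# The file path is a parameter so the function can be tried on any file;
# it defaults to /etc/bccd-stage.
check_stage() {
    expected="$1"
    stage_file="${2:-/etc/bccd-stage}"
    # Strip whitespace and the trailing newline before comparing.
    actual=$(tr -d '[:space:]' < "$stage_file")
    if [ "$actual" = "$expected" ]; then
        echo "stage OK: $actual"
    else
        echo "stage is '$actual', expected '$expected'" >&2
        return 1
    fi
}
```

Example use on a node booted from the USB stick: check_stage LIVE.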

Testing

Background image test

  1. Confirm the background image ("Earlham College, BCCD, Shodor") is clear with no text cut off.

Run the automated testing script

  1. Run the test suite:
    $ bccd-test-suite
  2. Record the results.

GalaxSee MPI test

  1. Snarf the hosts on the head node:
    $ bccd-snarfhosts -v
  2. If you have 2 nodes running in your cluster, confirm node009 is listed first, followed by node000.
  3. Change to the GalaxSee directory:
    $ cd GalaxSee
  4. Build the executable:
    $ make
  5. Sync the directory across the cluster:
    $ bccd-syncdir . ~/machines-openmpi
  6. Change to the sync'd directory:
    $ cd /tmp/node000-bccd
  7. Run a job across all nodes in the cluster (replace 2 with the number of nodes you have running in the cluster):
    $ mpirun -np 2 --map-by node -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi
  8. In the other window, confirm only 1 process for GalaxSee shows up in the table of processes (re-run GalaxSee in the other window if it finishes before you run this):
    $ top

Wi-Fi test

For these instructions, you will need a Wi-Fi network that you can connect to. If you have no Wi-Fi, but you do have a Mac laptop connected to the Internet via Ethernet, follow the instructions immediately below to share its connection. Otherwise, if you have Wi-Fi, skip to the next set of instructions:

  1. On the laptop, make sure Ethernet is connected and you can reach the Internet with it.
  2. Go to System Preferences -> Sharing
  3. Click Internet Sharing
  4. Share your connection from -> Ethernet
  5. To computers using -> Wi-Fi
  6. Click the checkbox next to Internet Sharing
  7. Click Start.

Look for and connect to the Wi-Fi network

  1. Enable the Wi-Fi interface (you will need the interface name -- for LittleFe it is wlan0):
    # ifconfig wlan0 up
  2. Do a scan for Wi-Fi networks:
    # iwlist wlan0 scan | grep -i essid
  3. If there is a network nearby, you should see something like the following, with the network name in quotes:
    ESSID:"littlefe"
  4. Attempt to connect to the network (replace littlefe with the name of your network):
    # iwconfig wlan0 essid littlefe
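
When many networks are in range, the scan output can be reduced to a plain list of names. The function below is a sketch: list_essids is a hypothetical helper, and it assumes each network appears on a line of the form ESSID:"name", as in the example output above. Hidden networks (empty names) are skipped.

```shell
# Hypothetical helper: read `iwlist <interface> scan` output on standard
# input and print each distinct, non-empty network name on its own line.
list_essids() {
    # Keep only the text between the quotes on ESSID lines, then de-duplicate.
    sed -n 's/^[[:space:]]*ESSID:"\(..*\)"[[:space:]]*$/\1/p' | sort -u
}
```

Example use: iwlist wlan0 scan | list_essids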

If the network uses WPA, use the following instructions.

  1. Become the root user:
    $ sudo su - root
  2. Install wpasupplicant:
    # apt-get update
    # apt-get install wpasupplicant
    Do you want to continue [Y/n]? Y

    WARNING: The following packages cannot be authenticated!
    libpcsclite1 wpasupplicant
    Install these packages without verification [y/N]? y
  3. Generate a pre-shared key (PSK) file, replacing littlefe with the name of your network and XXXX with its password:
    # wpa_passphrase littlefe XXXX > /bccd/home/bccd/wpa/SSID.psk
  4. Attempt to connect to the network (replace littlefe with the name of your network):
    # iwconfig wlan0 essid littlefe
    # wpa_supplicant -B -i wlan0 -c /bccd/home/bccd/wpa/SSID.psk -Dwext
    # sleep 3
    # dhclient wlan0
  5. Make sure the interface has an IP:
    # ifconfig wlan0
  6. Make sure you can ping over the interface:
    # ping -I wlan0 google.com
  7. Press Ctrl-D to exit the root shell.

/proc test

These instructions require 2 or more nodes in the cluster.

  1. ssh to the client node:
    $ ssh node009
  2. Run ps:
    node009$ ps
  3. Use /tmp:
    node009$ ls /tmp
  4. Confirm ps still works:
    node009$ ps

R test

  1. Load the R module:
    $ module load R
  2. Start R and confirm it says version 3.2.1:
    $ R
    R version 3.2.1 (2015-06-18) -- "World-Famous Astronaut"
  3. Run a line of R code and confirm it works (e.g. make sure Inf - Inf = NaN):
    > Inf - Inf
    [1] NaN
  4. Quit R:
    > q()
    Save workspace image? [y/n/c]:
  5. Enter n to not save a workspace image.
  6. Unload the R module:
    $ module unload R

Liberation test

These instructions require a hard drive that can be erased and onto which the BCCD can be installed.

  1. Run the liberation script, which erases the hard drive and installs BCCD. Replace /dev/sda with the device name of your hard drive. Make sure this is the correct device, otherwise you will erase something else!
    $ sudo perl /root/liberate.pl --libdev /dev/sda
    • Note that it will prompt you (possibly twice) whether you are sure you want to erase what is on the drive. Enter y to approve.
  2. Reboot:
    $ sudo reboot
  3. Disconnect the USB stick so that it boots from the hard drive.
  4. Check the version number.
    $ bccd-version
    • This should return the same values as in the live boot, Revision: 3.3.4-skylar.5989 and Build date: 2018-1-25. If it does not, you are running the wrong ISO.
    $ cat /etc/bccd-stage
    • This should return LIBERATED. If it instead says LIVE, it booted from the USB. Reboot and try again with the USB stick detached.
  5. Set up PXE booting:
    $ sudo bccd-reset-network
    1. When it asks if you want to skip eth0, choose No
    2. When it asks if you want to make eth0 PXE capable, choose Yes
  6. Remove the BCCD drive from the second node and boot it up. It should pick up the BCCD through PXE over the network. If it doesn't, make sure the network is properly connected and that network booting is enabled in the BIOS of the machine.
  7. Once the cluster is running in liberated mode, repeat the steps in the Testing section (except the liberation test, of course).

cpumon test

This test requires 6 nodes in the cluster. The instructions below assume LIBERATED mode.

  1. Boot up all 6 nodes.
  2. Generate the machines file:
    $ bccd-snarfhosts -v
  3. Make sure all 6 nodes appear. If they do not, make sure they all booted properly and that the head node can reach them over the network.
  4. Change to the cpumon directory:
    $ cd Community-Modules/UMW/cpumon/
  5. Create the cpumonitor executable:
    $ make
  6. Start cpumonitor with one process on each of the 6 nodes:
    $ mpirun -np 6 --map-by node -machinefile ~/machines-openmpi ./cpumonitor
  7. In the other terminal window, change to the GalaxSee directory:
    $ cd GalaxSee
  8. Compile the software (it may say "Nothing to be done" if it has already been compiled):
    $ make
  9. Run GalaxSee with 1 process on each node for 100,000 time steps:
    $ mpirun -np 6 --map-by node -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500 500 100000
  10. Observe the core values, confirming that one core on each node is running at close to 100%.
  11. Press the "X" at the top-right corner of the cpumonitor window, and confirm it closes the program.
  12. Run GalaxSee with 2 processes on each node for 100,000 time steps:
    $ mpirun -np 12 -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500 500 100000
  13. Observe the core values, confirming that both cores on each node are running at close to 100%.
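
The -np values used above (6 and 12) are the node count multiplied by the number of processes per node. On a cluster of a different size, that number can be derived from the machine file. This is a sketch only: np_for is a hypothetical helper, and it assumes the machine file lists one host per non-blank, non-comment line. The actual format of ~/machines-openmpi is not shown here, so check it before relying on this.

```shell
# Hypothetical helper: print the process count for an mpirun invocation,
# given a machine file and the number of processes wanted per node.
# Assumption: one host per non-blank line; lines starting with # are ignored.
np_for() {
    machinefile="$1"
    per_node="${2:-1}"
    # Count the lines that are neither blank nor comments.
    nodes=$(grep -c -v -E '^[[:space:]]*(#|$)' "$machinefile")
    echo $((nodes * per_node))
}
```

Example use for 2 processes per node: mpirun -np "$(np_for ~/machines-openmpi 2)" -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500 500 100000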

bccd-shutdown test

  1. Confirm bccd-shutdown shuts down the entire cluster:
    $ bccd-shutdown

Results

Conclusions
