Bccd-ng - bccd-3.3.4-skylar - r5821 - Albert's Log

From BCCD 3.0

THIS PAGE IS AN UNFINISHED WORK IN PROGRESS!!! PLEASE IGNORE FOR NOW!!!

Purpose

To determine whether the BCCD 3.3.4 Release Candidate (RC) ISO is suitable for release.

Materials

Procedure

Preparation

Make bootable USB

  1. Boot up the cluster.
  2. Become the root user:
    $ sudo su - root
  3. Go to the directory with the script for making bootable USBs:
    # cd
  4. Set some helpful variables
    # ISO=bccd-3.3.4-skylar.5821.amd64.iso
    # URL=http://cluster.earlham.edu/bccd-ng/testing/skylar
  5. Download the ISO:
    # wget ${URL}/${ISO}
  6. Download the md5 checksum for the ISO:
    # wget ${URL}/${ISO}.md5
  7. Verify the downloaded checksum matches the generated checksum:
    # md5sum ${ISO} | awk '{print $1}' > tmp; cat ${ISO}.md5 | awk '{print $4}' > tmp2; diff tmp tmp2; rm tmp tmp2
    • If this prints anything, the checksums do not match, which usually means the ISO did not download properly. In that case, go back to the "Download the ISO" step and download it again.
  8. Check which devices are currently present on the machine:
    # ls /dev/sd*
  9. Insert the USB stick.
  10. Check which devices are now present on the machine:
    # ls /dev/sd*
    • Second USB Note: If you have a second USB stick, you will repeat the following steps for it, starting with this step.
  11. Note the device name of the USB stick (it will be something like /dev/sdb).
  12. Downgrade syslinux so the bootable USB does not have a kernel panic on boot:
    # wget http://cluster.earlham.edu/bccd-ng/syslinux/syslinux-common_4.05+dfsg-6+deb7u1_all.deb
    # wget http://cluster.earlham.edu/bccd-ng/syslinux/syslinux_4.05+dfsg-6+deb7u1_amd64.deb
    # dpkg --auto-deconfigure -i syslinux-common_4.05+dfsg-6+deb7u1_all.deb syslinux_4.05+dfsg-6+deb7u1_amd64.deb
    # syslinux -version
    • Confirm version 4.05
  13. Start the build script, replacing /dev/sdb with the device name of your USB stick, if it differs:
    # ./build_bootable_USB.sh -d /dev/sdb -i ${ISO} -m pc
  14. If you have a second USB stick, go back to the first step that is repeated (see the Second USB Note above).
  15. Reboot the machine:
    # reboot
  16. Enter the BIOS of the machine as it boots:
    • This will depend on your machine, but for LittleFe v4d, press the Delete key.
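
The checksum comparison and the before/after device listing in the steps above can be sketched as two small shell helpers. This is an illustration, not part of the BCCD tooling: the function names verify_md5 and new_devices are hypothetical, and verify_md5 assumes the .md5 file contains exactly one 32-digit hexadecimal checksum somewhere in it, whichever field it sits in.

```shell
# verify_md5 FILE SUMFILE
# Succeeds only if the MD5 of FILE equals the first 32-hex-digit token
# found in SUMFILE (assumption: the checksum file holds exactly one).
verify_md5() {
    file=$1
    sumfile=$2
    actual=$(md5sum "$file" | awk '{print $1}')
    expected=$(grep -Eo '[0-9a-fA-F]{32}' "$sumfile" | head -n 1 | tr 'A-F' 'a-f')
    [ -n "$expected" ] && [ "$actual" = "$expected" ]
}

# new_devices BEFORE_LIST AFTER_LIST
# Prints every line of AFTER_LIST that does not appear in BEFORE_LIST,
# i.e. the device names that showed up after the USB stick was inserted.
new_devices() {
    grep -vxFf "$1" "$2" || true
}
```

A possible use: run "ls /dev/sd* > before.txt" before inserting the stick, "ls /dev/sd* > after.txt" afterwards, and then "new_devices before.txt after.txt"; likewise "verify_md5 ${ISO} ${ISO}.md5 || echo 'checksum mismatch'".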

Enable bootable USB in the BIOS

These instructions are LittleFe-v4d-specific. If you aren't using this system, check documentation for your own machine's BIOS, or use trial and error to figure out how to emulate these steps.

  1. Select (Use the arrow keys to highlight and press Enter) Advanced BIOS Features.
  2. Select the 1st Boot Device and select the USB.
  3. Save the changes and exit by pressing the F10 button and selecting OK.
  4. If you have a second USB stick, repeat these steps on the compute node.

Boot into BCCD

  1. Make sure you are booting into the BCCD via the USB, not the hard drive. You should see a boot prompt, as opposed to a GRUB screen. Press Enter to begin the boot.
  2. You should eventually see a place to enter your password (the text will stop scrolling and say "Please set a password for the default user"). The canonical choice for this is bccd.
  3. A blue screen will ask you about network information. If it says "No DHCP for eth0", you can tell it to skip by pressing Enter.
    • Note: For the second USB stick, this blue screen should not appear. Instead, the BCCD should boot into a hostname of "node009".
  4. You will be greeted by two terminal windows. In one of them, confirm you are running the proper ISO:
    $ bccd-version
    • This should return Revision: 3.3.4-skylar.5821 and Build date: 2016-10-15. If it does not, you are running the wrong ISO.
    $ cat /etc/bccd-stage
    • This should return LIVE. If it instead says LIBERATED, it did not successfully boot from the USB.
  5. If you have a second USB stick, repeat these steps on the second node. (Keep the first node on while the second node boots.)
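
The stage check above can be wrapped in a small helper so the same command serves both the LIVE and the LIBERATED runs. This is a sketch under assumptions: check_stage is a hypothetical name, and it assumes /etc/bccd-stage holds only the stage word, as the expected output above suggests.

```shell
# check_stage EXPECTED [STAGE_FILE]
# Compares the contents of the stage file (default: /etc/bccd-stage)
# with EXPECTED, ignoring whitespace. Returns non-zero on a mismatch.
check_stage() {
    expected=$1
    stage_file=${2:-/etc/bccd-stage}
    actual=$(tr -d '[:space:]' < "$stage_file")
    if [ "$actual" = "$expected" ]; then
        echo "OK: stage is $actual"
    else
        echo "MISMATCH: expected $expected, found $actual" >&2
        return 1
    fi
}
```

For example, "check_stage LIVE" after booting from the USB stick, and "check_stage LIBERATED" after booting from the hard drive.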

Testing

Background image test

  1. Confirm the background image ("Earlham College, BCCD, Shodor") is clear with no text cut off.

Run the automated testing script

  1. Run the test suite:
    $ bccd-test-suite
  2. Record the results.
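
When recording the results, a saved copy of the suite output can be summarized with a short filter such as the one below. It is only a sketch: summarize_suite is a hypothetical helper, and it assumes each result line contains the word "passed" or "FAILED" with exactly that capitalization, as in the logs in the Results section.

```shell
# summarize_suite
# Reads bccd-test-suite output on stdin, counts result lines, and lists
# the failing ones. Exits non-zero if any line reports FAILED.
summarize_suite() {
    awk '
        / passed/ { passed++ }
        / FAILED/ { failed++; failures = failures "\n  " $0 }
        END {
            printf "passed: %d\nfailed: %d\n", passed, failed
            if (failed > 0) printf "failing lines:%s\n", failures
            exit (failed > 0)
        }
    '
}
```

One way to use it is "bccd-test-suite 2>&1 | tee suite.log" followed by "summarize_suite < suite.log". The test numbers themselves (for example #818) still need to be looked up by hand.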

GalaxSee MPI test

  1. Snarf the hosts on the head node:
    $ bccd-snarfhosts -v
  2. If you have 2 nodes running in your cluster, confirm node009 is listed first, followed by node000.
  3. Change to the GalaxSee directory:
    $ cd GalaxSee
  4. Build the executable:
    $ make
  5. Sync the directory across the cluster:
    $ bccd-syncdir . ~/machines-openmpi
  6. Change to the sync'd directory:
    $ cd /tmp/node000-bccd
  7. Run a job across all nodes in the cluster (replace 2 with the number of nodes you have running in the cluster):
    $ mpirun -np 2 --map-by node -machinefile ~/machines-openmpi GalaxSee.cxx-mpi
  8. In the other window, confirm only 1 process for GalaxSee shows up in the table of processes (re-run GalaxSee in the other window if it finishes before you run this):
    $ top
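
The host-order check in step 2 can also be done against the generated machines file. The helper below is a sketch, with host_order as a hypothetical name; it assumes ~/machines-openmpi follows the usual Open MPI hostfile layout of one host per line, optionally followed by settings such as slots=N.

```shell
# host_order HOSTFILE
# Prints the short hostnames from HOSTFILE in file order on one line,
# skipping blank lines and comment lines, and dropping any domain part.
host_order() {
    awk 'NF && $1 !~ /^#/ { sub(/\..*/, "", $1); print $1 }' "$1" |
        tr '\n' ' ' | sed 's/ $//'
}
```

With two nodes running, "host_order ~/machines-openmpi" would be expected to print the two node names in the order described in step 2.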

Wi-Fi test

For these instructions, you will need a Wi-Fi network that you can connect to. If you have no Wi-Fi, but you do have a Mac laptop connected to the Internet via Ethernet, follow the instructions immediately below to share its connection over Wi-Fi. Otherwise, if you already have Wi-Fi, skip to the next set of instructions:

  1. On the laptop, make sure Ethernet is connected and you can reach the Internet with it.
  2. Go to System Preferences -> Sharing
  3. Click Internet Sharing
  4. Share your connection from -> Ethernet
  5. To computers using -> Wi-Fi
  6. Click the checkbox next to Internet Sharing
  7. Click Start.

Look for and connect to the Wi-Fi network

  1. Enable the Wi-Fi interface (you will need the interface name -- for LittleFe it is wlan0):
    # ifconfig wlan0 up
  2. Do a scan for Wi-Fi networks:
    # iwlist wlan0 scan | grep -i essid
  3. If there is a network nearby, you should see something like the following, with the network name in quotes:
    ESSID:"littlefe"
  4. Attempt to connect to the network (replace littlefe with the name of your network):
    # iwconfig wlan0 essid littlefe
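
If the scan returns many cells, the network names can be reduced to a unique list. The filter below is a minimal sketch (list_essids is a hypothetical name) that relies only on the ESSID:"name" line format shown in step 3; hidden networks, which report an empty name, are skipped.

```shell
# list_essids
# Reads `iwlist <interface> scan` output on stdin and prints each
# distinct, non-empty network name once, sorted alphabetically.
list_essids() {
    sed -n 's/.*ESSID:"\(..*\)".*/\1/p' | sort -u
}
```

For example: "iwlist wlan0 scan | list_essids".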

If the network uses WPA, use the following instructions.

  1. Become the root user:
    $ sudo su - root
  2. Install wpasupplicant:
    # apt-get update
    # apt-get install wpasupplicant
    Do you want to continue [Y/n]? Y

    WARNING: The following packages cannot be authenticated!
    libpcsclite1 wpasupplicant
    Install these packages without verification [y/N]? y
  3. Generate a pre-shared key (PSK), replacing littlefe with the name of your network and XXXX with the password of the network:
    # wpa_passphrase littlefe XXXX > /bccd/home/bccd/wpa/SSID.psk
  4. Attempt to connect to the network (replace littlefe with the name of your network):
    # iwconfig wlan0 essid littlefe
    # wpa_supplicant -B -i wlan0 -c /bccd/home/bccd/wpa/SSID.psk -Dwext
    # sleep 3
    # dhclient wlan0
  5. Make sure the interface has an IP:
    # ifconfig wlan0
  6. Make sure you can ping over the interface:
    # ping -I wlan0 google.com
  7. Press Ctrl-D to exit as root.
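
The "make sure the interface has an IP" step can be checked mechanically by pulling the IPv4 address out of the ifconfig output. This is a sketch only: ipv4_of is a hypothetical helper, and it assumes the address appears either as "inet addr:A.B.C.D" (older net-tools output) or as "inet A.B.C.D" (newer output).

```shell
# ipv4_of
# Reads ifconfig output for one interface on stdin and prints the first
# IPv4 address it finds. Prints nothing if the interface has none.
ipv4_of() {
    sed -n 's/.*inet \(addr:\)\{0,1\}\([0-9][0-9.]*\).*/\2/p' | head -n 1
}
```

For example: "ifconfig wlan0 | ipv4_of". Empty output suggests that dhclient did not obtain a lease.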

/proc test

These instructions require 2 or more nodes in the cluster.

  1. ssh to the client node:
    $ ssh node009
  2. Run ps:
    node009$ ps
  3. List the contents of /tmp:
    node009$ ls /tmp
  4. Confirm ps still works:
    node009$ ps

R test

  1. Load the R module:
    $ module load R
  2. Start R and confirm it says version 3.3.1:
    $ R
    R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
  3. Run a line of R code and confirm it works (e.g. make sure Inf - Inf = NaN):
    > Inf - Inf
    [1] NaN
  4. Quit R:
    > q()
    Save workspace image? [y/n/c]:
  5. Enter n to not save a workspace image.
  6. Unload the R module:
    $ module unload R
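
The version check in step 2 can be scripted by extracting the version number from the first banner line. A minimal sketch, assuming the banner begins with "R version X.Y.Z" as shown above; r_version is a hypothetical helper name.

```shell
# r_version
# Reads R's startup banner (or `R --version` output) on stdin and prints
# only the version number from the first "R version ..." line.
r_version() {
    sed -n 's/^R version \([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

For example, with the R module loaded: test "$(R --version | r_version)" = "3.3.1" && echo "R version OK".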

Liberation test

These instructions require a hard drive that can be erased and on which the BCCD can be installed.

  1. Run the liberation script, which erases the hard drive and installs BCCD. Replace /dev/sda with the device name of your hard drive. Make sure this is the correct device, otherwise you will erase something else!
    $ sudo perl /root/liberate.pl --libdev /dev/sda
    • Note that it will prompt you (possibly twice) whether you are sure you want to erase what is on the drive. Enter y to approve.
  2. Reboot:
    $ sudo reboot
  3. Disconnect the USB stick so that it boots from the hard drive.
  4. Check the version number.
    $ bccd-version
    • This should return Revision: 3.3.4-skylar.5821 and Build date: 2016-10-15. If it does not, you are running the wrong ISO.
    $ cat /etc/bccd-stage
    • This should return LIBERATED. If it instead says LIVE, it booted from the USB. Reboot and try again with the USB stick detached.
  5. Set up PXE booting:
    $ sudo bccd-reset-network
    1. When it asks if you want to skip eth0, choose No
    2. When it asks if you want to make eth0 PXE capable, choose Yes
  6. Remove the BCCD drive from the second node and boot it up. It should pick up the BCCD through PXE over the network (if it doesn't, make sure the network is properly connected and that network booting is enabled in the BIOS of the machine).
  7. Once the cluster is running in liberated mode, repeat the steps in the Testing section (except the liberation test, of course).
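
Because the liberation script erases its target, it may be worth a quick sanity check on the device name before running step 1. The helper below is a sketch: is_whole_disk is a hypothetical name, it assumes SCSI/SATA-style names such as /dev/sda as used in this procedure, and it only inspects the name, not what is actually on the disk.

```shell
# is_whole_disk DEVICE
# Succeeds if DEVICE looks like a whole-disk name such as /dev/sda.
# Fails for partitions (/dev/sda1) and for anything else.
is_whole_disk() {
    case $1 in
        /dev/sd[a-z]) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: is_whole_disk /dev/sda && sudo perl /root/liberate.pl --libdev /dev/sda. It is still necessary to confirm by hand that the name refers to the hard drive and not to the USB stick.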

cpumon test

This test requires 6 nodes in the cluster. The instructions below assume LIBERATED mode.

  1. Boot up all 6 nodes.
  2. Generate the machines file:
    $ bccd-snarfhosts -v
  3. Make sure all 6 nodes appear. If they do not, make sure they all booted properly and that the head node can reach them over the network.
  4. Change to the cpumon directory:
    $ cd Community-Modules/UMW/cpumon/
  5. Create the cpumonitor executable:
    $ make
  6. Start cpumonitor with one process on each of the 6 nodes:
    $ mpirun -np 6 -machinefile ~/machines-openmpi -pernode ./cpumonitor
  7. In the other terminal window, change to the GalaxSee directory:
    $ cd GalaxSee
  8. Compile the software (it may say "Nothing to be done" if it has already been compiled):
    $ make
  9. Run GalaxSee with 1 process on each node for 100,000 time steps:
    $ mpirun -np 6 -machinefile ~/machines-openmpi -pernode ./GalaxSee.cxx-mpi 500 500 100000
  10. Observe the core values, confirming that one core on each node is running at close to 100%.
  11. Press the "X" at the top-right corner of the cpumonitor window, and confirm it closes the program.
  12. Run GalaxSee with 2 processes on each node for 100,000 time steps:
    $ mpirun -np 12 -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500 500 100000
  13. Observe the core values, confirming that both cores on each node are running at close to 100%.
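
The -np values used above are the node count multiplied by the number of processes per node: 6 x 1 = 6 in step 9 and 6 x 2 = 12 in step 12. For a cluster of a different size, the same arithmetic can be sketched as follows (total_ranks is a hypothetical helper name):

```shell
# total_ranks NODES PROCS_PER_NODE
# Prints the value to pass to `mpirun -np` when running
# PROCS_PER_NODE processes on each of NODES nodes.
total_ranks() {
    echo $(( $1 * $2 ))
}
```

For example, "total_ranks 6 2" gives the 12 used in step 12.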

bccd-shutdown test

  1. Confirm bccd-shutdown shuts down the entire cluster:
    $ bccd-shutdown

Results

Background image test

Background image test passed

Automated test script

Live mode bccd-test-suite results:

bccd@node000:~$ bccd-test-suite
#567 passed
#651 passed
#658 passed
#666 passed
#771 passed
#780 passed - /etc/vim files exist

We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:

    #1) Respect the privacy of others.
    #2) Think before you type.
    #3) With great power comes great responsibility.

[sudo] password for bccd: 
#781 passed
#786 passed - bccd user in video group
#788 passed - R
#790 passed - pypar
#790 passed - mpi4py
#795 passed
#794 passed
#802 passed
#803 passed
#806 passed
#811 passed
#813 passed
#818 FAILED
#819 passed
#822 passed
#823 passed
#825 passed
#828 passed
grep: /var/mail/bccd: No such file or directory
#829 passed
#830 passed
#831 passed
#832 passed
#798 passed
#801 passed
CUDA passed
ASROCK USB support passed
sudoers mail passed
cron email disabled passed
#884 passed
#897 passed

Liberation bccd-test-suite results:

bccd@node000:~$ bccd-test-suite
Built /home/bccd/GalaxSee w/ mpich2
Built /home/bccd/Life w/ mpich2
Built /home/bccd/GalaxSee w/ openmpi
Built /home/bccd/Life w/ openmpi
#947 passed

GalaxSee MPI test

Wi-Fi test

Failed

/proc test

Live mode /proc test results:

bccd@node009:~$ ps
  PID TTY          TIME CMD
22419 pts/2    00:00:00 bash
31077 pts/2    00:00:00 ps
bccd@node009:~$ ls /tmp/
bccd                 ifconfig.rAn2ZRix  orbit-bccd
bccd_x.log           mpd2.console_root  serverauth.w3iwn7D8o1
blueman-applet-1000  mpd2.logfile_root
ifconfig.PMyh2blv    node009-bccd
bccd@node009:~$ ps
  PID TTY          TIME CMD
22419 pts/2    00:00:00 bash
31116 pts/2    00:00:00 ps

Liberation /proc test results:

bccd@node000:~$ ssh node011
Warning: Permanently added 'node011' (RSA) to the list of known hosts.
Linux node000.bccd.net 3.14.0bccd-gbb71e91 #2 SMP PREEMPT Wed Sep 17 20:50:06 EDT 2014 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Tue Jul 18 15:40:54 2017
bccd@node011:~$ ps
  PID TTY          TIME CMD
 6232 pts/1    00:00:00 bash
 6399 pts/1    00:00:00 ps
bccd@node011:~$ ls /tmp/
bccd               ifconfig.hcam8leC  mpd2.logfile_root
ifconfig.ffZAQbbQ  mpd2.console_root
bccd@node011:~$ ps
  PID TTY          TIME CMD
 6232 pts/1    00:00:00 bash
 6474 pts/1    00:00:00 ps

R test

bccd@node009:~$ module load R
bccd@node009:~$ R

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> Inf - Inf
[1] NaN
> q()
Save workspace image? [y/n/c]: n
bccd@node009:~$ module unload R

Conclusions
