Testing the BCCD

From BCCD 3.0

Quick & dirty testing

The automated test suite (below) unfortunately has fallen into disuse. For now, this section contains the preferred testing methodology, based on Aaron's test log.

You can use the procedure below as a template; it should be updated as changes are made to BCCD.

To track your actual test results, create your own user page at Testing. Your individual test pages should have a format something like "Notes from BCCD Revision revision (optional version) date". As you iterate over the tests and enhance the ISO, keep updating the page until every test that was failing passes.

The core of this testing is checking exit codes in /bin/bccd-test-suite. Before settling on a manual test, think of some way to test an exit code instead.
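
As a minimal sketch of what an exit-code-based check can look like, the helper below runs a command and reports PASS or FAIL from its exit status alone. The name run_check is hypothetical and is not part of /bin/bccd-test-suite; it only illustrates the idea.

```shell
#!/bin/sh
# run_check is a hypothetical helper, not part of bccd-test-suite.
# Usage: run_check "description" command [args...]
run_check() {
    desc=$1
    shift
    # Discard the command's output; only the exit status decides the result.
    if "$@" >/dev/null 2>&1; then
        echo "PASS: $desc"
        return 0
    else
        rc=$?
        echo "FAIL: $desc (exit code $rc)"
        return 1
    fi
}
```

For example, run_check "hosts snarfed" bccd-snarfhosts -v would report PASS or FAIL, assuming the command signals failure through a non-zero exit code.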

Materials

Procedure

Preparation

Make bootable USB

  1. Boot up the cluster.
  2. Become the root user:
    $ sudo su - root
  3. Go to the directory with the script for making bootable USBs:
    # cd /root
  4. Download the ISO referenced in the email.
  5. Download the MD5 checksum file for the ISO.
  6. Verify the downloaded checksum matches the generated checksum:
    # md5sum -c bccd-3.3.1-rc.amd64.iso.md5
    • Second USB Note: If you have a second USB stick, the following steps will be repeated, starting with this step:
  7. Check which devices are currently present on the machine:
    # ls /dev/sd*
  8. Insert the USB stick.
  9. Check which devices are now present on the machine:
    # ls /dev/sd*
  10. Note the device name of the USB stick (it will be something like /dev/sdb).
  11. Start the build script, replacing /dev/sdb with the device name of your USB stick, if it differs:
    # ./build_bootable_USB.sh -d /dev/sdb -i bccd-3.3.1-rc.amd64.iso -m mac
    • I got the following message at the end. This also occurred for the second USB stick. However, this did not affect the rest of the testing:
      No device specified!
      Usage: mkdosfs [-a][-A][-c][-C][-v][-I][-l bad-block-file][-b backup-boot-sector]
      [-m boot-msg-file][-n volume-name][-i volume-id]
      [-s sectors-per-cluster][-S logical-sector-size][-f number-of-FATs]
      [-h hidden-sectors][-F fat-size][-r root-dir-entries][-R reserved-sectors]
      /dev/name [blocks]
      Failed to format /dev/sdb2
  12. If you have a second USB stick, go back to the Second USB Note above, repeat the steps from there, then remove the second USB stick before continuing to the next step.
  13. Reboot the machine:
    # reboot
  14. Enter the BIOS of the machine as it boots:
    • This will depend on your machine, but for LittleFe, press the Delete key.
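
The before-and-after comparison of ls /dev/sd* can also be done mechanically. The sketch below uses a hypothetical helper, new_devices, that prints the names present only in the second listing. Partitions such as /dev/sdb1 will show up too; the build script's -d option takes the whole-disk name.

```shell
#!/bin/sh
# new_devices is a hypothetical helper: given two whitespace-separated device
# lists (before and after inserting the stick), print the names that appear
# only in the second list.
new_devices() {
    # Normalize the first list to " name name ... " so whole names can be matched.
    before=" $(printf '%s ' $1)"
    for dev in $2; do
        case "$before" in
            *" $dev "*) ;;        # already present before insertion
            *) echo "$dev" ;;     # new since the first listing
        esac
    done
}

# Intended use on the cluster:
#   before=$(ls /dev/sd*)
#   (insert the USB stick)
#   after=$(ls /dev/sd*)
#   new_devices "$before" "$after"
```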

Enable bootable USB in the BIOS

These instructions are LittleFe-specific. If you aren't using LittleFe, check documentation for your own machine's BIOS, or use trial and error to figure out how to emulate these steps.

  1. Select (Use the arrow keys to highlight and press Enter) Advanced BIOS Features.
  2. Select the 1st Boot Device and select the USB (in my case USB:Kingston DataTraveler).
  3. Save the changes and exit by pressing the F10 button and selecting OK.
  4. If you have a second USB stick, repeat these steps on the compute node.

Boot into BCCD

  1. Make sure you are booting into the BCCD via the USB, not the hard drive. You should see a boot prompt, as opposed to a GRUB screen. Press Enter to begin the boot.
  2. You should eventually see a place to enter your password (the text will stop scrolling and say "Please set a password for the default user"). The canonical choice for this is bccd.
  3. A blue screen will ask you about network information. If it says "No DHCP for eth0", you can tell it to skip by pressing Enter.
    • Note: For the second USB stick, this blue screen should not appear. Instead, the BCCD should boot into a hostname of "node009".
  4. You will be greeted by two terminal windows. In one of them, confirm you are running the proper ISO:
    $ bccd-version
    • This should return something like Revision: 3.3.1.4486. If it does not, you are running the wrong ISO.
    $ cat /etc/bccd-stage
    • This should return LIVE. If it instead says LIBERATED, it did not successfully boot from the USB.
  5. If you have a second USB stick, repeat these steps (note that the second node may already have passed the boot screen).
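
The stage check can be turned into an exit-code test. This sketch assumes /etc/bccd-stage contains only the stage word (LIVE or LIBERATED), as the cat output above suggests; check_stage is a hypothetical name.

```shell
#!/bin/sh
# check_stage is a hypothetical helper. It compares the contents of the stage
# file (default: /etc/bccd-stage) with the expected stage, LIVE or LIBERATED.
check_stage() {
    expected=$1
    stage_file=${2:-/etc/bccd-stage}
    if [ ! -r "$stage_file" ]; then
        echo "cannot read $stage_file"
        return 2
    fi
    actual=$(cat "$stage_file")
    if [ "$actual" = "$expected" ]; then
        echo "stage OK: $actual"
        return 0
    fi
    echo "stage mismatch: expected $expected, found $actual"
    return 1
}
```

After booting from the USB stick, check_stage LIVE should succeed; after liberation, check_stage LIBERATED should.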

Testing

Background image test

  1. Confirm the background image ("Earlham College, BCCD, Shodor") is clear with no text cut off.

Run the automated testing script

  1. Run the script:
    $ bccd-test-suite
  2. If you have a second node, repeat the above step on that node.
  3. Record the Results below.
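
To make recording results easier, the output and exit code of a run can be captured in a file. The helper below is a sketch with a hypothetical name (run_and_log); it saves the combined output, echoes it, and returns the command's own exit code.

```shell
#!/bin/sh
# run_and_log is a hypothetical helper: run a command, save its combined
# output to a log file, echo that output, and return the command's exit code.
run_and_log() {
    log=$1
    shift
    "$@" >"$log" 2>&1
    rc=$?
    cat "$log"
    # Append the exit code so the log is complete on its own.
    echo "exit code: $rc" >>"$log"
    return "$rc"
}
```

One possible invocation is run_and_log "$HOME/bccd-test-suite.log" bccd-test-suite; the log file name is only an example.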

GalaxSee MPI test

  1. Snarf the hosts on the head node:
    $ bccd-snarfhosts -v
  2. If you have 2 nodes running in your cluster, confirm node009 (or node011 for LIBERATED mode) is listed first, followed by node000.
  3. Change to the GalaxSee directory:
    $ cd GalaxSee
  4. Build the executable (skip this step in LIBERATED mode):
    $ make
  5. Sync the directory across the cluster (skip this step in LIBERATED mode):
    $ bccd-syncdir . ~/machines-openmpi
  6. Change to the sync'd directory (skip this step in LIBERATED mode):
    $ cd /tmp/node000-bccd
  7. Run a job across all nodes in the cluster (replace 2 with the number of nodes you have running in the cluster):
    $ mpirun -np 2 --bynode -machinefile ~/machines-openmpi GalaxSee.cxx-mpi
  8. In the other window, confirm only 1 process for GalaxSee shows up in the table of processes (re-run GalaxSee in the other window if it finishes before you run this):
    $ top
  9. Quit top by typing q in that window.
  10. Change back to the home directory in the other window:
    $ cd
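
The host-order check in step 2 can be scripted. This sketch assumes ~/machines-openmpi follows the usual Open MPI hostfile layout (one host per line, optionally followed by slots=N, with # comments); the exact output of bccd-snarfhosts is not confirmed here, and both function names are hypothetical.

```shell
#!/bin/sh
# host_order is a hypothetical helper: print the host names from an Open MPI
# style machine file in order, one per line, skipping blanks and # comments.
host_order() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# first_host_is EXPECTED FILE: succeed if the first host in FILE is EXPECTED.
first_host_is() {
    expected=$1
    first=$(host_order "$2" | head -n 1)
    # Compare only the short name in case the file lists fully qualified names.
    [ "${first%%.*}" = "$expected" ]
}
```

In LIVE mode, first_host_is node009 ~/machines-openmpi would then succeed only if node009 is listed first.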

Wi-Fi test

For these instructions, you will need a Wi-Fi network that you can connect to. If you have no Wi-Fi, but you do have a Mac laptop connected to the Internet via Ethernet, follow the instructions immediately below to share its connection. Otherwise, skip to the next set of instructions:

  1. On the laptop, make sure Ethernet is connected and you can reach the Internet with it.
  2. Go to System Preferences -> Sharing
  3. Click Internet Sharing
  4. Share your connection from -> Ethernet
  5. To computers using -> Wi-Fi
  6. Click the checkbox next to Internet Sharing
  7. Click Start.
  8. If you have OS X Mountain Lion, you can configure a WPA2 personal network under Wi-Fi Options -> Security.

Look for and connect to the Wi-Fi network

  1. Become the root user:
    $ sudo su - root
  2. Enable the Wi-Fi interface (you will need the interface name -- for LittleFe it is wlan0):
    # ifconfig wlan0 up
  3. Do a scan for Wi-Fi networks:
    # iwlist wlan0 scan | grep -i essid
  4. If there is a network nearby, you should see something like the following, with the network name in quotes:
    ESSID:"littlefe"
  5. Attempt to connect to the network (replace littlefe with the name of your network):
    # iwconfig wlan0 essid littlefe
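
The scan output can be reduced to bare network names, which makes it easy to test for a specific network. The helper name list_essids is hypothetical; it relies only on the ESSID:"..." format shown above.

```shell
#!/bin/sh
# list_essids is a hypothetical helper: read `iwlist <iface> scan` output on
# stdin and print each network name without the ESSID:"..." wrapper.
list_essids() {
    sed -n 's/.*ESSID:"\(.*\)".*/\1/p'
}
```

For example, iwlist wlan0 scan | list_essids | grep -qx littlefe exits with 0 only if a network named littlefe was seen.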

If the network uses WPA, use the following instructions.

  1. Create a directory for the pre-shared key (PSK) (skip this step for LIBERATED mode):
    # mkdir /UNIONFS/home/bccd/wpa
  2. Generate the PSK, replacing bccdbccd with the password of the network (for the instructions below in LIBERATED mode, replace UNIONFS with bccd):
    # wpa_passphrase littlefe bccdbccd > /UNIONFS/home/bccd/wpa/SSID.psk
  3. Attempt to connect to the network (replace littlefe with the name of your network):
    # iwconfig wlan0 essid littlefe
    # wpa_supplicant -B -i wlan0 -c /UNIONFS/home/bccd/wpa/SSID.psk -Dwext
    # sleep 3

Get an IP and test the connection:

  1. Request an IP address via DHCP:
    # dhclient -v wlan0
  2. Make sure the interface has an IP:
    # ifconfig wlan0
  3. Make sure you can ping over the interface:
    # ping -I wlan0 bccd.net
  4. Press Ctrl-D to exit as root.
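
The "has an IP" check can also be expressed as an exit code. The sketch below assumes ifconfig prints the address either as "inet addr:a.b.c.d" (older net-tools) or "inet a.b.c.d" (newer net-tools); has_ipv4 is a hypothetical name.

```shell
#!/bin/sh
# has_ipv4 is a hypothetical helper: read `ifconfig <iface>` output on stdin
# and succeed if it contains an IPv4 address line. Both the older
# "inet addr:a.b.c.d" and the newer "inet a.b.c.d" layouts are accepted.
has_ipv4() {
    grep -Eq 'inet (addr:)?[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'
}
```

A combined check could then be ifconfig wlan0 | has_ipv4 && ping -c 3 -I wlan0 bccd.net, where -c 3 makes ping stop after three packets and return an exit code instead of running until interrupted.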

/proc test

These instructions require 2 or more nodes in the cluster.

  1. ssh to the client node:
    $ ssh node009
    • Note: Use node011 for LIBERATED mode
  2. Run ps:
    node009$ ps
  3. Use /tmp:
    node009$ ls /tmp
  4. Confirm ps still works:
    node009$ ps
  5. Press Ctrl-D to log out of the client node.

R test

  1. Load the R module:
    $ module load R
  2. Start R and confirm it says version 3.0.1:
    $ R
    R version 3.0.1 (2013-05-16) -- "Good Sport"
  3. Run a line of R code and confirm it works (e.g. make sure Inf - Inf = NaN):
    > Inf - Inf
    [1] NaN
  4. Press Ctrl-D to quit R:
    Save workspace image? [y/n/c]:
  5. Enter n to not save a workspace image.
  6. Unload the R module:
    $ module unload R
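
The version check lends itself to a non-interactive test. This sketch matches the banner line shown above; r_version_is is a hypothetical helper name.

```shell
#!/bin/sh
# r_version_is is a hypothetical helper: read R's startup banner (or
# `R --version` output) on stdin and succeed if it reports the given version.
r_version_is() {
    # The trailing space keeps 3.0.1 from also matching 3.0.10.
    grep -Fq "R version $1 "
}
```

With the R module loaded, R --version | r_version_is 3.0.1 covers step 2, and Rscript -e 'stopifnot(is.nan(Inf - Inf))' covers step 3, since Rscript exits with a non-zero code when stopifnot fails.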

Liberation test

Manual liberation

These instructions require a hard drive that can be erased and onto which the BCCD can be installed.

  1. Run the liberation script, which erases the hard drive and installs BCCD. Replace /dev/sda with the device name of your hard drive. Make sure this is the correct device, otherwise you will erase something else!
    $ sudo perl /root/liberate.pl --libdev /dev/sda
  2. You will be prompted whether you are sure you want to erase what is on the drive. Enter y to approve.
  3. Reboot:
    $ sudo reboot
  4. Disconnect the USB stick so that it boots from the hard drive.
  5. Check the version number.
    $ bccd-version
    • This should return Revision: 3.3.1-rc.4486. If it does not, you are running the wrong ISO.
    $ cat /etc/bccd-stage
    • This should return LIBERATED. If it instead says LIVE, it booted from the USB. Reboot and try again with the USB stick detached.
  6. Set up PXE booting:
    $ sudo bccd-reset-network
    1. When it asks if you want to skip eth0, choose No
    2. When it asks if you want to make eth0 PXE capable, choose Yes
    3. When it asks if you want to skip eth0:1, choose No
  7. Remove the BCCD drive from the second node and boot it up. It should pick up the BCCD through PXE over the network (if it doesn't, make sure the network is properly connected and that network booting is enabled in the BIOS of the machine).
  8. Once the cluster is running in liberated mode, repeat the steps in the Testing section (except the liberation test, of course), and then proceed to the tests after this one.
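
Because liberation erases the target drive, a quick pre-flight check on the device name may be worthwhile. The sketch below is not part of liberate.pl; safe_libdev is a hypothetical helper that only rejects names that are not block devices or that appear as mounted in /proc/mounts. Whether liberate.pl itself tolerates a mounted target is not covered here.

```shell
#!/bin/sh
# safe_libdev is a hypothetical pre-flight check, not part of liberate.pl.
# It fails unless the argument is a block device that is not currently mounted.
safe_libdev() {
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "$dev is not a block device"
        return 1
    fi
    # /proc/mounts lists mounted devices in its first column (Linux only).
    if grep -q "^$dev" /proc/mounts 2>/dev/null; then
        echo "$dev or one of its partitions is mounted"
        return 1
    fi
    echo "$dev looks usable"
    return 0
}
```

It could be chained in front of the liberation command, for example safe_libdev /dev/sda && sudo perl /root/liberate.pl --libdev /dev/sda.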

Automatic liberation

  1. Boot up a live CD or USB.
  2. Supply linux libdev=/dev/sda as boot options
  3. Enter password.
  4. Configure networking.
  5. Let system reboot, remove live media, and boot off hard drive.

cpumon test

This test requires 6 nodes in the cluster. The instructions below assume LIBERATED mode.

  1. Boot up all 6 nodes.
  2. Generate the machines file:
    $ bccd-snarfhosts -v
  3. Make sure all 6 nodes appear. If they do not, make sure they all booted properly and that the head node can reach them over the network.
  4. Change to the cpumon directory:
    $ cd Community-Modules/UMW/cpumon
  5. Create the cpumonitor executable:
    $ make
  6. Start cpumonitor with one process on each of the 6 nodes:
    $ mpirun -np 6 --bynode -machinefile ~/machines-openmpi ./cpumonitor
  7. In the other terminal window, change to the GalaxSee directory:
    $ cd GalaxSee
  8. Compile the software (it may say "Nothing to be done" if it has already been compiled):
    $ make
  9. Run GalaxSee with 1 process on each node for 100,000 time steps:
    $ mpirun -np 6 --bynode -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500 500 100000
  10. Observe the core values, confirming that one core on each node is running at close to 100%, and that both cores on the head node are running at close to 100%.
  11. Run GalaxSee with 2 processes on each node for 100,000 time steps (same line as above, but change 6 to 12 and delete the --bynode option):
    $ mpirun -np 12 -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500 500 100000
  12. Observe the core values, confirming that both cores on each node are running at close to 100%. The last node may have less load.
  13. Press the "X" at the top-right corner of the cpumonitor window, and confirm it closes the program.
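
The -np values used above (6 and 12) are the number of nodes times the number of processes per node. The sketch below derives them from the machine file, again assuming the usual Open MPI hostfile layout of one host per line; count_hosts and total_procs are hypothetical names.

```shell
#!/bin/sh
# count_hosts and total_procs are hypothetical helpers.
# count_hosts FILE: number of host lines in an Open MPI style machine file,
# ignoring blank lines and # comments.
count_hosts() {
    awk 'NF && $1 !~ /^#/ { n++ } END { print n + 0 }' "$1"
}

# total_procs FILE PER_NODE: value for mpirun -np given processes per node.
total_procs() {
    echo $(( $(count_hosts "$1") * $2 ))
}
```

With 6 nodes listed, total_procs ~/machines-openmpi 1 prints 6 and total_procs ~/machines-openmpi 2 prints 12; count_hosts also gives a quick way to confirm that all 6 nodes were found.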

bccd-shutdown test

  1. Confirm bccd-shutdown shuts down the entire cluster:
    $ bccd-shutdown

Test conclusion

Report your results on your Testing page. Follow up on any failures by filing a ticket in Trac and notifying the release engineer.

Then rinse, lather, repeat!

Automatic test suite

NOTE: The automated test suite is defunct (for now), but is a long-term project to improve. See the

The first version of the BCCD test suite is now present in recent revisions of the BCCD. It resides in the bccd user home directory under tests.

Automatically Testing the BCCD

  1. Start up whatever setup you wish to test
  2. Log in as bccd and cd to tests
  3. Make sure there's a control directory. If you don't see one, see the section "Generating a Control Directory" below.
  4. Run startx
  5. Run bccd_test_suite.pl --mailto (your email)
  6. You should shortly receive a message in your email inbox declaring success or describing what tests had errors and how many.
    • In the latter case, the details of the test will be included as an attachment.

Generating a Control Directory

The BCCD Test Suite uses comparative testing: it runs its tests and compares the results against those from a distribution or environment already known to work. These known-good results are referred to as the "control." End users should already have a "control" directory with the control results in their ~bccd/tests folder. If you do not, you can generate one on a known-good machine with the -c option:

  1. Start up a network of at least two machines with the latest ISO
    • If you mean to test in a single-node setting, boot up only one machine.
  2. Log in as bccd and cd into tests
  3. Run startx
  4. In either terminal, run perl bccd_test_suite.pl -c --mail --mailto you@yourdomain.com
  5. Presuming that your machine is connected to the network, eventually you'll receive an email with the control tarball.
  6. Check it for errors before using it. A test isn't much good if the control set is wrong.
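
To illustrate the idea of comparative testing, the sketch below compares every file in a control directory against the file of the same name in a results directory. It is not how bccd_test_suite.pl is implemented; the function name and the one-file-per-test layout are assumptions made for the example.

```shell
#!/bin/sh
# compare_to_control is a hypothetical illustration of comparative testing; it
# is not how bccd_test_suite.pl is implemented. For every file in the control
# directory it checks that the results directory holds an identical file.
compare_to_control() {
    control=$1
    results=$2
    failures=0
    for ctl in "$control"/*; do
        [ -f "$ctl" ] || continue
        name=${ctl##*/}
        # cmp -s is silent and fails if the files differ or one is missing.
        if cmp -s "$ctl" "$results/$name"; then
            echo "ok: $name"
        else
            echo "DIFFERS: $name"
            failures=$((failures + 1))
        fi
    done
    echo "$failures test(s) differ from control"
    [ "$failures" -eq 0 ]
}
```

A comparison like this is only as good as the control set, which is why the control results need to be checked for errors before use.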

Adding new tests

See a list of tests to be added to the test_suite.

Adding new tests has never been simpler.

  1. Write a script in your favorite language
    • Make sure it knows what parser to use when executed (get to know #! ("shebang" if you want to Google it) if you're not already acquainted)
    • Take a look to see if a similar test already exists to avoid duplicated effort (see Note)
  2. Save it in ~bccd/tests/scripts
  3. If your script is not a test, but a diagnostic for relevant system data, drop it in the "system" subdirectory
    • "system" tests are not included in control generation, and are always included in reports, even reports of success

Note: The scripts directory contains a hidden subdirectory, .sharedcode, which is where code shared by multiple tests belongs. For instance, the MPI tests all use the same general format and are therefore mostly handled by the perlshared.pm module file in the .sharedcode directory.
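
As a rough sketch of what a new test script might look like, the example below starts with a shebang line and reports its result through both its output and its exit code. The check itself (whether /tmp is writable) and the function name are made up for illustration. The output is kept identical from run to run on the assumption that the suite compares it against the control.

```shell
#!/bin/sh
# Hypothetical example test for ~bccd/tests/scripts; the check itself is
# illustrative. The shebang on the first line tells the system to run it with sh.
# Output is kept identical from run to run so it can be compared to a control.
check_tmp_writable() {
    probe="/tmp/bccd-test-probe.$$"
    if echo probe >"$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "tmp writable: yes"
        return 0
    fi
    echo "tmp writable: no"
    return 1
}

# The status of the last command becomes the script's exit status.
check_tmp_writable
```

Remember to make the script executable (chmod +x) so the shebang line takes effect.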

Scripts and Software combinations

Test         Script      Notes             Test Results
Gromacs      gromacs.pl  TestGromacsNotes  GromacsTestLog
R            r.pl        TestRNotes        RTestLog
GalaxSee     mpitest.sh  TestGalNotes      GalaxeeTestLog
Life         mpitest.sh  TestLifeNotes     LifeTestLog
Param Space  mpitest.sh  TestPSNotes       ParamSpaceTestLog
PSC_DX       NA          TestPSCDX         PSC_DXTestLog

MPI/Compiler Binding Matrix

Application  OpenMPI + GCC      OpenMPI + ICC  MPICH2 + GCC       MPICH2 + ICC
GalaxSee     Yes (as of r2297)  Unknown        Yes (as of r2297)  Unknown
Life         Yes (as of r2297)  Unknown        Yes (as of r2297)  Unknown
Param Space  Yes (as of r2297)  Unknown        Yes (as of r2297)  Unknown

Run states

Live CD

Either

Liberated

  1. Liberate: sudo perl /root/liberate.pl --libdev /device/to/liberate/to
  2. Reboot.
  3. Start with DHCP services: linux startdhcp startnfs allowpxe
  4. Boot other cluster nodes after boot up.


  1. As root, add a new user
  2. Do the necessary setup for that user
  3. Attempt to run the parallel software as listed above.

Liberated twins

This is in response to #886 from Karl Frinkle. It involves having two liberated systems, with one system running a DHCP server for the other.

  1. Liberate both systems, with no networking between them: perl liberate.pl --libdev=/dev/sda
  2. Boot up one system.
    1. sudo /bin/bccd-reset-network
    2. Respond Yes to every prompt (no PXE networking!)
  3. Boot up the second system.
    1. sudo /bin/bccd-reset-network
    2. Respond Yes to any prompts

Troubleshooting

  1. If zlib compression failures swamp the console when you try to liberate, this probably means you're using a scratched livecd disc.
  2. Wipe the disc clean and try again.