The High Performance Linpack (HPL) benchmark is the standard tool for measuring the floating-point performance of a distributed-memory computer. It is the software package from which the numbers on the TOP500 list are derived.
The basic BCCD image is not distributed with HPL or the requisite linear algebra libraries. This page will describe the process for compiling and running the HPL benchmark with the BCCD.
HPL as pre-loaded on the BCCD
As of March 25, 2010, the HPL source and build scripts tailored to the BCCD are included in the bccd user's home directory. To build and run, follow these simple steps:
$ cd ~/hpl
$ make
$ /bin/bash hpl.run
Note, however, that this will run HPL over all of the nodes in the current BCCD cluster. So if you're in a lab with other students, be kind to them and edit the hpl.run script to reduce the number of processes in use; a sketch of one way to do that follows.
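One low-impact approach, sketched here under the assumption that hpl.run ultimately calls mpirun with the cluster's node list (the same ~/machines file used later on this page), is to point the run at a trimmed copy of that list and a smaller process count:

$ head -2 ~/machines > ~/machines-small       # keep only the first two nodes
$ # then edit hpl.run so its mpirun invocation uses --hostfile ~/machines-small
$ # and a correspondingly smaller -np value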
HPL from scratch
You will need:
- The HPL source
  - As of this writing, version 2.0 (September 10, 2008) was the most recent stable version, and was used below.
- A BLAS (Basic Linear Algebra Subprograms) implementation such as ATLAS
  - As of this writing, version 3.8.3 (February 18, 2009) was the most recent stable version, and was used below.
- The BCCD
The compiling stage can take a very long time, depending on your hardware. The ATLAS configure/compile scripts run a large suite of tests to determine the best configuration for your system. On a 1.6GHz dual-core Atom with 2GB of RAM, this stage took a number of hours.
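Because those tests are timing runs, ATLAS's tuning is only meaningful if the CPU clock stays stable; active frequency scaling (throttling) can skew the results, which is why the -Si cputhrchk 0 flag shows up in the configure step below. A quick, hedged check on a stock Linux kernel with cpufreq support:

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # "performance" means no scaling surprises
$ grep MHz /proc/cpuinfo                                      # current clock speed of each core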
- Create a directory to work in, and unpack the software
$ mkdir hpl
$ cd hpl                     # later steps assume everything lives under ~/hpl
$ tar xf atlas3.8.3.tar.gz
$ tar xf hpl-2.0.tar.gz
- We'll compile ATLAS first, since HPL will need to link its binaries to those libraries.
$ cd ~/hpl/ATLAS
- The ATLAS Installation Guide is an excellent source of information.
- Create a build directory (required) and run the configure script
$ mkdir Linux_Atom330                # Typically this is <OS>_<Architecture>
$ cd Linux_Atom330
$ ../configure -b 32 \               # Currently the BCCD only supports 32-bit
    -t -1 \                          # -1 tells ATLAS to try to autodetect the number of threads to use
    -Si cputhrchk 0 \                # Do not check for CPU throttling
    --prefix=$HOME/hpl/atlas \       # Could be anywhere, but note this path; we'll use it later
    --nof77 \                        # Don't worry about FORTRAN
    --cc=/usr/bin/gcc \              # Use gcc
    -C ic /usr/bin/gcc               # Really, use gcc (see the ATLAS docs for an explanation)
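Because a backslash continuation followed by a comment doesn't survive copy-and-paste into a shell, here is the same invocation, unannotated:

$ ../configure -b 32 -t -1 -Si cputhrchk 0 --prefix=$HOME/hpl/atlas \
      --nof77 --cc=/usr/bin/gcc -C ic /usr/bin/gcc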
- The configure script shouldn't take too long to run. Once it's done, build, test, time, and install the libraries
$ make build && make check && make time && make install
- Go home, have lunch, drink a pot of coffee, find something else to do. This part takes a while.
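If you would rather not keep a terminal open for the duration, a minimal sketch using nohup (the log file name here is arbitrary):

$ nohup sh -c 'make build && make check && make time && make install' > atlas-build.log 2>&1 &
$ tail -f atlas-build.log      # check on it whenever you wander back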
- Assuming the makes finish without error, check that the libraries ended up where we told them to go, and then move on to HPL
$ ls ~/hpl/atlas/lib ~/hpl/atlas/include
/bccd/home/bccd/hpl/atlas/include:
atlas  cblas.h  clapack.h

/bccd/home/bccd/hpl/atlas/lib:
libatlas.a  libcblas.a  libf77blas.a  liblapack.a  libptcblas.a  libptf77blas.a

$ cd ~/hpl/hpl-2.0
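If you want extra assurance that the static libraries actually contain the BLAS routine HPL spends most of its time in, nm can confirm it. This assumes the ATLAS build above; cblas_dgemm should appear as a defined (T) symbol:

$ nm ~/hpl/atlas/lib/libcblas.a | grep cblas_dgemm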
- Copy a pre-written Makefile template. I've found that the Linux_PII_CBLAS template is generic enough to work most of the time.
$ cp setup/Make.Linux_PII_CBLAS .
- Regardless, there are some changes that need to be made. As of BCCD r2086 from November 15, 2009, these lines must be set as follows:
TOPdir       = $(HOME)/hpl/hpl-2.0
MPdir        = /bccd/software/openmpi-1.2.9
MPlib        = $(MPdir)/lib/libmpi.so
LAdir        = $(HOME)/hpl/atlas
LAinc        = $(LAdir)/include
LAlib        = $(LAdir)/lib/libcblas.a $(LAdir)/lib/libatlas.a
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) -I$(LAinc) $(MPinc)
LINKER       = /usr/bin/gcc
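Before building, it's worth a quick check that the paths these variables point to actually exist on your image; if either ls fails, adjust MPdir or LAdir accordingly:

$ ls /bccd/software/openmpi-1.2.9/lib/libmpi.so
$ ls ~/hpl/atlas/lib ~/hpl/atlas/include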
- Build the binaries, fix any errors that arise (almost always involves updating Make.Linux_PII_CBLAS):
$ make arch=Linux_PII_CBLAS
- There should now be a binary in bin/Linux_PII_CBLAS along with an HPL.dat
$ cd bin/Linux_PII_CBLAS
$ ls
HPL.dat  xhpl
- HPL.dat is the file that controls how HPL conducts its tests. It can take quite a bit of experimentation to find the optimal settings. The HPL.dat tuning page from ACT can help quite a bit, as can the HPL FAQ and the HPL Tuning guide.
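As a starting point for Ns, the HPL documentation suggests sizing the problem so the N x N matrix of doubles fills a large fraction (roughly 80%) of the cluster's total memory. A rough back-of-the-envelope calculation, using the 6-node, 2 GB/node LittleFe described below as the example:

$ echo 'sqrt(0.8 * 6 * 2 * 1024^3 / 8)' | bc -l    # 80% of total RAM in bytes, 8 bytes per double
# prints roughly 35900, so Ns approaching the low 30000s would use most of the memory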
- Below is an example file, written for a Dual-core Atom LittleFe (6 nodes, 12 cores, 2GB RAM/node)
- As of this writing, this file is still being tuned
- Also note that, as is, this configuration results in 1152 tests being run, consuming a few days' worth of wall time; the arithmetic behind that number is shown after the file.
HPLinpack benchmark input file
Cluster Computing Group, Earlham College
HPL.out         output file name (if any)
1               device out (6=stdout,7=stderr,file)
4               # of problems sizes (N)
5000 10000 15000 20000  Ns
8               # of NBs
32 64 96 128 160 192 224 256  NBs
0               PMAP process mapping (0=Row-,1=Column-major)
2               # of process grids (P x Q)
2 1             Ps
6 6             Qs
16.0            threshold
3               # of panel fact
0 1 2           PFACTs (0=left, 1=Crout, 2=Right)
2               # of recursive stopping criterium
2 4             NBMINs (>= 1)
1               # of panels in recursion
2               NDIVs
3               # of recursive panel fact.
0 1 2           RFACTs (0=left, 1=Crout, 2=Right)
1               # of broadcast
0               BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1               # of lookahead depth
0               DEPTHs (>=0)
2               SWAP (0=bin-exch,1=long,2=mix)
64              swapping threshold
0               L1 in (0=transposed,1=no-transposed) form
0               U in (0=transposed,1=no-transposed) form
1               Equilibration (0=no,1=yes)
8               memory alignment in double (> 0)
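The 1152 figure mentioned above is just the product of the parameter counts in this file; HPL runs every combination of problem size, block size, process grid, panel factorization, recursive stopping criterion, and recursive factorization:

$ echo $(( 4 * 8 * 2 * 3 * 2 * 3 ))    # Ns * NBs * grids * PFACTs * NBMINs * RFACTs
1152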
- Run xhpl. The number of processes started must be equal to the maximum value of P*Q. In the example given above, this is 12.
- The mpirun here must be from the same MPI implementation specified by MPdir in Make.Linux_PII_CBLAS; here that's Open MPI.
$ mpirun -np 12 --hostfile ~/machines ./xhpl
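A quick way to confirm that the mpirun on your PATH really is the Open MPI build pointed to by MPdir (ompi_info only exists for Open MPI):

$ which mpirun                 # should resolve under /bccd/software/openmpi-1.2.9
$ ompi_info | head -3          # reports the Open MPI version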
- Assuming all went well, you should see a file, HPL.out, that contains lines that look something like:
- See the beginning page or two of HPL.out for a description of these lines.
T/V                N    NB     P     Q               Time             Gflops
--------------------------------------------------------------------------------
WR00R2L4       15000    64     2     6             809.16          2.781e+00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0034938 ...... PASSED
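With 1152 results in HPL.out, a little shell can pull out the headline numbers. In this configuration each result line starts with the T/V code (WR...), and Gflops is the seventh field, so, as a rough sketch:

$ grep -c PASSED HPL.out                                       # how many runs passed the residual check
$ grep '^WR' HPL.out | awk '{print $7}' | sort -g | tail -1    # best Gflops figure seen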