GalaxSee

This tutorial will walk you through an investigation of N-body parallel computing problems using the program GalaxSee. For instructions on how to compile and run MPI programs, please see Compiling and Running. This tutorial assumes that you have booted the BCCD and have X running (to do this, enter startx). It uses mpirun, which means you need an implementation of MPI installed; the BCCD comes with both OpenMPI and MPICH2, with OpenMPI running by default. For more information, see Running MPICH and Running OpenMPI.

GalaxSee: N-Body Physics

[Image: Nbody.png]

One of the grand challenge problems in astronomy is understanding the evolution and structure of the universe and of galaxies. On large scales the universe shows a structure of sheets and voids, and galaxies often show a spiral structure that is difficult to explain. Space is not occupied by a homogeneous fluid, but by discrete particles that interact through gravity over long ranges.

This is often modeled as a set of discrete bodies interacting through gravity. The gravitational force between two bodies is Fg = G·M₁M₂/D², where M₁ and M₂ are the two objects' masses and D is the distance between them. The acceleration of an object is the sum of the forces acting on it divided by its mass, a = ΣF/M. If you know the acceleration of each mass, you can calculate its change in velocity, since a = Δv/Δt; if you know the velocity, you can calculate the change in position, since v = Δx/Δt.

The algorithm can be loosely described as NEW = OLD + CHANGE

This update is applied to each particle. To apply it, you need each particle's acceleration, and the acceleration is determined by the sum of the forces from all of the other particles.

What this means is that the more objects you have, the more forces you need to calculate, and every object needs to know about every other object: for N bodies, each time step requires on the order of N² force calculations.
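
To make the structure of the computation concrete, here is a minimal serial sketch of one time step in C++. This is not the GalaxSee source; the Body struct, G, and dt are names chosen for the sketch. It shows the N² pair loop and the NEW = OLD + CHANGE update:

// Minimal serial sketch of one N-body time step (not the GalaxSee source).
// Illustrates the O(N^2) force loop and the NEW = OLD + CHANGE update.
#include <cmath>
#include <vector>

struct Body {
    double x, y, z;    // position
    double vx, vy, vz; // velocity
    double mass;
};

void step(std::vector<Body>& bodies, double G, double dt) {
    const std::size_t n = bodies.size();
    std::vector<double> ax(n, 0.0), ay(n, 0.0), az(n, 0.0);

    // Every body needs the force from every other body: N*(N-1) pair terms.
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            if (i == j) continue;
            double dx = bodies[j].x - bodies[i].x;
            double dy = bodies[j].y - bodies[i].y;
            double dz = bodies[j].z - bodies[i].z;
            double d2 = dx*dx + dy*dy + dz*dz;
            double d  = std::sqrt(d2);
            double a  = G * bodies[j].mass / d2;  // a_i += G*m_j/d^2, toward body j
            ax[i] += a * dx / d;
            ay[i] += a * dy / d;
            az[i] += a * dz / d;
        }
    }

    // NEW = OLD + CHANGE: update velocity from acceleration, position from velocity.
    for (std::size_t i = 0; i < n; ++i) {
        bodies[i].vx += ax[i] * dt;  bodies[i].x += bodies[i].vx * dt;
        bodies[i].vy += ay[i] * dt;  bodies[i].y += bodies[i].vy * dt;
        bodies[i].vz += az[i] * dt;  bodies[i].z += bodies[i].vz * dt;
    }
}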

The GalaxSee code uses a simple parallelization strategy. Since most of the time in a given N-body model is spent calculating the forces, only that part of the code is parallelized. “Client” processes that just calculate accelerations are fed every particle's information, along with a list of which particles each client should compute. A “server” runs the main program, and sends out requests and collects results during the force calculation.
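
The sketch below shows the same division of labor in MPI. It is not the actual GalaxSee source: it uses collective operations (MPI_Bcast to hand every rank every particle, MPI_Gatherv to collect each rank's block of accelerations) instead of GalaxSee's explicit server/client messages, and the function and variable names are chosen for the sketch:

// Sketch of GalaxSee's parallelization pattern (not the actual GalaxSee source):
// every rank is given every particle, each rank computes accelerations for its
// own block of particles, and the blocks are gathered back on rank 0.
#include <mpi.h>
#include <algorithm>
#include <cmath>
#include <vector>

// Acceleration on particle i from all other particles (G taken as 1 for brevity).
static void compute_acceleration(const std::vector<double>& pos,
                                 const std::vector<double>& mass,
                                 int n, int i, double* a) {
    a[0] = a[1] = a[2] = 0.0;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        double dx = pos[3*j]   - pos[3*i];
        double dy = pos[3*j+1] - pos[3*i+1];
        double dz = pos[3*j+2] - pos[3*i+2];
        double d2 = dx*dx + dy*dy + dz*dz;
        double d  = std::sqrt(d2);
        a[0] += mass[j] * dx / (d2 * d);  // (m_j / d^2) * (dx / d)
        a[1] += mass[j] * dy / (d2 * d);
        a[2] += mass[j] * dz / (d2 * d);
    }
}

// Called every time step; only the force calculation is parallelized.
// Buffers are sized 3*n (pos, acc) and n (mass) on every rank; contents are
// valid on rank 0 before the call, and acc is filled on rank 0 afterward.
void parallel_forces(std::vector<double>& pos, std::vector<double>& mass,
                     std::vector<double>& acc, int n, MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // The "server" hands every particle's information to every "client".
    MPI_Bcast(pos.data(),  3 * n, MPI_DOUBLE, 0, comm);
    MPI_Bcast(mass.data(), n,     MPI_DOUBLE, 0, comm);

    // Each rank takes a contiguous block of particles.
    int base = n / size, rem = n % size;
    int my_count = base + (rank < rem ? 1 : 0);
    int my_start = rank * base + std::min(rank, rem);

    std::vector<double> my_acc(3 * my_count);
    for (int k = 0; k < my_count; ++k)
        compute_acceleration(pos, mass, n, my_start + k, &my_acc[3 * k]);

    // The results are collected back on the rank running the main program.
    std::vector<int> counts(size), displs(size);
    for (int r = 0; r < size; ++r) {
        counts[r] = 3 * (base + (r < rem ? 1 : 0));
        displs[r] = 3 * (r * base + std::min(r, rem));
    }
    MPI_Gatherv(my_acc.data(), 3 * my_count, MPI_DOUBLE,
                acc.data(), counts.data(), displs.data(), MPI_DOUBLE,
                0, comm);
}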

You could think of the total running time in the following way:

As long as you run the model for enough time steps that not much time is spent “setting up” the program, a reasonable model for how long the GalaxSee program will take to run on a given cluster is

time = A*N*(P-1) + B*N*N/P

For your cluster, what would be the coefficients of this equation?
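
One way to estimate them: time one run on a single process and one run on P processes with the same N. With P = 1 the communication term A*N*(P-1) drops out, which gives B directly; substituting B back into the P-process measurement gives A. Here is a small C++ sketch of that arithmetic (the timings in it are made-up placeholders, not measurements):

// Rough way to estimate the coefficients A (communication) and B (computation)
// from two timed runs with the same N, assuming time = A*N*(P-1) + B*N*N/P.
#include <cstdio>

int main() {
    double N  = 500.0;   // number of bodies used in both runs (assumed)
    double t1 = 30.0;    // measured wall time on 1 process, seconds (placeholder)
    double P  = 4.0;     // process count of the second run
    double tP = 12.0;    // measured wall time on P processes, seconds (placeholder)

    // With P = 1 the communication term vanishes: t1 = B*N*N, so
    double B = t1 / (N * N);

    // Plug B back into the P-process run to solve for A:
    double A = (tP - B * N * N / P) / (N * (P - 1.0));

    std::printf("A = %g s, B = %g s\n", A, B);

    // Predict a different run, e.g. N = 1000 bodies on 8 processes:
    double Np = 1000.0, Pp = 8.0;
    std::printf("predicted time: %g s\n", A * Np * (Pp - 1.0) + B * Np * Np / Pp);
    return 0;
}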

Running the program from Live-BCCD

To run the program, first move into the GalaxSee directory by executing cd ~/GalaxSee. Next, the executable needs to be built by running make. This will create the executable GalaxSee.cxx-mpi.

Next we need to copy this executable to all of the nodes that will be running it. (If you have not yet set up your nodes for remote access, make sure you have logged into each machine as bccd, started the heartbeat (pkbcast) program, run bccd-allowall, and run bccd-snarfhosts. You must do this before continuing.) An automated script, bccd-syncdir, copies executables across BCCD nodes without interfering with other users' runs. For GalaxSee, it is run with the following command:

bccd-syncdir ~/GalaxSee ~/machines-openmpi

where ~/GalaxSee is the directory which holds the executable, and ~/machines-openmpi is the machinefile created previously with 'bccd-snarfhosts', containing a list of all the nodes in your cluster formatted for use with OpenMPI. This creates a directory in /tmp on every node that holds a copy of your executable directory. The name of this directory follows the pattern /tmp/hostname-user (so yours could be /tmp/node009-bccd). Make note of this directory and move into it with cd <your directory>.

To run the program on one node of your cluster, enter the following command

time mpirun -np 1 -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500      400      1000.0
                #cpus                                                #bodies  #mass    #final time

(Run the following models without a display, and record the wall time for each model. To turn off the display, add a "0" to the end of the command line, e.g. ./GalaxSee.cxx-mpi 500 400 1000.0 0)

Troubleshooting: If you do not get a display back on BCCD v3.4.0

If GalaxSee invokes MPI_ABORT without producing a display, try supplying the DISPLAY variable as the final parameter:

time mpirun -np 1 -machinefile ~/machines-openmpi ./GalaxSee.cxx-mpi 500 400 1000.0 ${DISPLAY}

Running the program from Liberated-BCCD

To run the program, first move into the GalaxSee directory by executing cd ~/GalaxSee. Next, the executable needs to be built by running make. This will create the executable GalaxSee.cxx-mpi.

Next we need to copy this executable to all of the nodes that will be running it. To do so, first run bccd-allowall in your terminal, and then run bccd-snarfhosts.

To run the program on one node of your cluster, enter the following command

time mpirun -np 1 -machinefile ~/machines ./GalaxSee.cxx-mpi 500      400      1000.0
                #cpus                                        #bodies  #mass    #final time

To run it on multiple processors, change the number of processes given to -np and follow the instructions above.

“Guesstimate” the coefficients, and try the model out for a few different runs other than the ones above. Does this “model of the model” work?

What happens to your efficiency as you add processors? What would happen if you went from 4 to 8 CPUs? 8 to 16? 16 to 32? 32 to 64?

What if the model were bigger? Real “production” N-Body codes use millions of particles. How long would it take to run a million particle code for the lifetime of the universe (14,000 Myrs with an 8 Myr timestep) on your cluster? What if you had 16 CPUs? 32? 256? 512? 2048?
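
Using the same model, the question reduces to arithmetic: 14,000 Myr at 8 Myr per step is 14000/8 = 1750 timesteps, and each step costs roughly A*N*(P-1) + B*N*N/P if you treat A and B as per-timestep coefficients (an assumption; the text fits them to whole runs). A small sketch of the extrapolation, where the A and B values are placeholders to be replaced with the coefficients measured on your own cluster:

// Back-of-the-envelope extrapolation of a million-particle run using the model
// time-per-step = A*N*(P-1) + B*N*N/P, over 14000/8 = 1750 timesteps.
#include <cstdio>

int main() {
    const double A = 1.0e-6;            // placeholder: seconds per particle per extra process
    const double B = 1.0e-7;            // placeholder: seconds per force pair
    const double N = 1.0e6;             // one million particles
    const double steps = 14000.0 / 8.0; // 1750 timesteps

    const int cpus[] = {16, 32, 256, 512, 2048};
    for (int P : cpus) {
        double seconds = steps * (A * N * (P - 1) + B * N * N / P);
        std::printf("P = %4d: %.3g seconds (%.3g days)\n",
                    P, seconds, seconds / 86400.0);
    }
    return 0;
}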

Test suite

This is a manual test due to the X11 dependency. Running GalaxSee.cxx-mpi in a script fails mysteriously with an assertion failure on dpy.

Run as the bccd user:

OpenMPI test

BCCD v3.3.3 & v3.4.0

  1. pushd ~/GalaxSee
  2. module unload openmpi mpich2
  3. module load openmpi
  4. make clean && make
  5. bccd-snarfhosts -s # Use -s to make sure head node goes first so X11 works
  6. Determine how many CPUs are available by examining ~/machines-openmpi (created by bccd-snarfhosts)
  7. bccd-syncdir --ni ~/GalaxSee ~/machines-openmpi
  8. mpirun -np $NSLOTS -machinefile ~/machines-openmpi /tmp/node000-bccd/GalaxSee.cxx-mpi 1000 500 400
  9. Make sure X11 window opens.

MPICH2 test

MPICH2 - BCCD v3.3.3

MPICH2 prior to BCCD v3.4.0 uses the mpd start method.

  1. module switch openmpi mpich2
  2. make clean && make
  3. bccd-snarfhosts
  4. cp ~/machines-mpich2 ~/GalaxSee/machines
  5. bccd-syncdir --ni ~/GalaxSee ~/machines-mpich2
  6. mpirun -np $NSLOTS /tmp/node000-bccd/GalaxSee.cxx-mpi 1000 500 400
  7. Make sure X11 window opens

MPICH2 - BCCD v3.4.0

MPICH2 starting with BCCD v3.4.0 uses the Hydra start method.

  1. module switch openmpi mpich2
  2. make clean && make
  3. bccd-snarfhosts -s # Use -s to make sure head node goes first
  4. bccd-syncdir --ni ~/GalaxSee ~/machines-mpich2
  5. mpirun -f ~/machines-mpich2 -enable-x -np $NSLOTS /tmp/node000-bccd/GalaxSee.cxx-mpi 1000 500 400
  6. Make sure X11 window opens
