Checkpointing OpenMPI applications

From ComputeMode


Installing and configuring the nodes

TODO: create a kameleon recipe

  • Retrieve the fixed openmpi-checkpoint package and copy it to the node image directory:
 shell# wget -O /cm/debian/orig/tmp/openmpi-checkpoint_1.4.2-4_amd64.deb
  • Install the required packages in the node image:
 shell# chroot /cm/debian/orig
 shell# apt-get install openmpi-bin blcr-dkms blcr-util
 shell# dpkg -i /tmp/openmpi-checkpoint_1.4.2-4_amd64.deb
 shell# rm /tmp/openmpi-checkpoint_1.4.2-4_amd64.deb
 shell# exit
  • Configure the blcr module to be loaded at boot by the nodes:
    • Make sure that there is a "blcr" line in /cm/debian/patch/etc/modules :
 # /etc/modules: kernel modules to load at boot time.
 # This file should contain the names of kernel modules that are
 # to be loaded at boot time, one per line.  Comments begin with
 # a "#", and everything on the line after them are ignored.
 blcr
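
The check above can be scripted so that it is safe to run more than once. The following is a minimal sketch and not part of the ComputeMode tooling: the function name ensure_module_line is hypothetical, and the path in the usage comment is the one given above.

```shell
#!/bin/sh
# ensure_module_line MODULE FILE
# Hypothetical helper: append MODULE to FILE unless an identical line
# is already present, so repeated runs never create duplicates.
ensure_module_line() {
    module=$1
    file=$2
    # -x matches whole lines only, -F takes the name literally, -q stays silent
    if ! grep -qxF -- "$module" "$file" 2>/dev/null; then
        printf '%s\n' "$module" >> "$file"
    fi
}

# Usage (as root on the ComputeMode server):
#   ensure_module_line blcr /cm/debian/patch/etc/modules
```

The whole-line match matters here: a commented-out "# blcr" line or a module whose name merely contains "blcr" does not count as already configured.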

User configuration

  • Create two directories to store the checkpoints in the NFS shared home:
 shell$ mkdir $HOME/checkpoints
  • Create or edit the OpenMPI MCA configuration file $HOME/.openmpi/mca-params.conf with the following options:
    • OpenMPI < 1.5.1 (debian stable package)
 # Remote snapshot directory (globally mounted file system)
    • OpenMPI >= 1.5.1
 # Remote snapshot directory (globally mounted file system)
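
As an illustration only, the sketch below writes a minimal configuration for the older (< 1.5.1) series. The parameter name snapc_base_global_snapshot_dir comes from the Open MPI 1.3/1.4 checkpoint/restart documentation rather than from this page, so treat it as an assumption and verify it with ompi_info on your installation. The function name write_mca_params is hypothetical. Nothing is sketched for >= 1.5.1, where the parameter names differ.

```shell
#!/bin/sh
# write_mca_params SNAPSHOT_DIR CONF_FILE
# Hypothetical helper: create CONF_FILE (overwriting any existing file)
# with the global snapshot directory set to SNAPSHOT_DIR.
# ASSUMPTION: snapc_base_global_snapshot_dir is the relevant parameter
# for Open MPI < 1.5.1; it does not appear in the original page.
write_mca_params() {
    snapshot_dir=$1
    conf_file=$2
    # Make sure the directory holding the configuration file exists
    mkdir -p "$(dirname "$conf_file")"
    cat > "$conf_file" <<EOF
# Remote snapshot directory (globally mounted file system)
snapc_base_global_snapshot_dir=$snapshot_dir
EOF
}

# Usage:
#   write_mca_params "$HOME/checkpoints" "$HOME/.openmpi/mca-params.conf"
```

Because the helper overwrites the file, back up any existing mca-params.conf before trying it.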

Simple usage

The '-am ft-enable-cr' option must be passed to mpirun for checkpointing to work:

 shell$ mpirun -am ft-enable-cr my-app <args>

At any time, call ompi-checkpoint with the PID of the running mpirun process to checkpoint the job:

 shell$ ompi-checkpoint 2405

To restart a saved process, call ompi-restart with the basename (not the path!) of the dump:

 shell$ ompi-restart ompi_global_snapshot_2405.ckpt
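
The checkpoint and restart commands above can be tied together in a small script. This is a sketch under assumptions: it presumes the snapshot is always named ompi_global_snapshot_<PID>.ckpt, as in the example above, and both function names are hypothetical.

```shell
#!/bin/sh
# snapshot_name_for_pid PID
# Print the snapshot basename expected for the PID given to ompi-checkpoint.
# ASSUMPTION: the name follows the pattern seen in the example above.
snapshot_name_for_pid() {
    printf 'ompi_global_snapshot_%s.ckpt\n' "$1"
}

# checkpoint_job PID
# Checkpoint the job, then print the command needed to restart it later.
# Returns non-zero without printing anything if ompi-checkpoint fails.
checkpoint_job() {
    pid=$1
    ompi-checkpoint "$pid" || return 1
    echo "restart with: ompi-restart $(snapshot_name_for_pid "$pid")"
}

# Usage:
#   checkpoint_job 2405
```

Printing the restart command instead of running it keeps the script harmless while the job is still alive.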

Using the wrapper script

