
Checkpoint Restart

Checkpoint/Restore In Userspace, or CRIU (pronounced kree-oo, IPA: /krɪʊ/, Russian: криу), is a Linux software application. It can freeze a running container (or an individual application) and checkpoint its state to disk. The saved data can be used to restore the application and run it exactly as it was at the time of the freeze. This functionality makes application or container live migration, snapshots, remote debugging, and many other things possible.

On Minerva, CRIU is installed on all compute and interactive nodes. It is not installed on the login nodes. It is integrated with LSF, which enables the -k option of bsub and, subsequently, the bchkpnt and brestart commands.
Unlike most other checkpoint/restart programs, CRIU does not require the user to preload any code in order to take advantage of the capability. This allows users, with some care, to checkpoint programs running under LSF without using the built-in LSF features. Nevertheless, using the LSF checkpointing facilities is considerably easier.

NOTE: The current version of CRIU does not restore file positions upon restart. If a file's size at restart differs from its size when the checkpoint was taken, the restart will fail. Hence, we do not recommend using the periodic checkpoint feature of LSF.
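Because a restart fails when file sizes have changed, it can help to record the sizes of a job's output files when the checkpoint is taken and compare them before restarting. The following is a minimal sketch of that idea, not part of CRIU or LSF; the function names record_sizes and verify_sizes and the manifest file are hypothetical, and file names containing newlines are not handled.

```shell
#!/bin/sh
# Hypothetical helpers (not CRIU or LSF commands).

# record_sizes MANIFEST FILE...
# Write one "size path" line per file into MANIFEST.
record_sizes() (
    manifest=$1
    shift
    : > "$manifest"
    for f in "$@"; do
        size=$(wc -c < "$f" | tr -d '[:space:]')
        printf '%s %s\n' "$size" "$f" >> "$manifest"
    done
)

# verify_sizes MANIFEST
# Exit status 0 if every listed file still has its recorded size, 1 otherwise.
verify_sizes() (
    status=0
    while read -r size path; do
        if [ ! -f "$path" ]; then
            echo "missing: $path" >&2
            status=1
        elif [ "$(wc -c < "$path" | tr -d '[:space:]')" != "$size" ]; then
            echo "size changed: $path" >&2
            status=1
        fi
    done < "$1"
    exit "$status"
)
```

For example, run record_sizes chkpnt.sizes myout.txt right after the checkpoint is taken, and verify_sizes chkpnt.sizes before restarting.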

CRIU Checkpoint/Restart Using LSF

To use the standard LSF method of specifying checkpoint, use the -k option of bsub:

bsub -k "checkpoint_dir method=criu"

Example: bsub -k "chkpntDir method=criu" < myScript.lsf

  • Specify a relative or absolute path name for chkpntDir. A relative path name is interpreted relative to the submission folder.
  • The quotes (") are required because you must specify a custom checkpoint and restart method name.

The job ID and job file name are concatenated to the checkpoint directory when creating a checkpoint file.
The checkpoint directory is used for restarting the job (see brestart(1)). The checkpoint directory can be any valid path.

The method name must be criu, as that is the only checkpointing method supported on Minerva.

A checkpoint must be initiated explicitly using bchkpnt:

bchkpnt [-k] jobid

Note: The optional -k specifies that the job is to be killed after the checkpoint is taken. Because file position information is not saved or used on restart, it is highly recommended that the -k option be specified.

To restart the job, use the brestart command:

brestart [options] checkpointFolder jobid

Example: brestart -W 4:00 chkpnt 193876

An LSF script example:

#BSUB -q test
#BSUB -P acc_hpcstaff
#BSUB -W 2:00
#BSUB -n 1
#BSUB -k "chkpnt method=criu"
#BSUB -oo %J.out
/hpc/users/fludee01/TEST_PGMS/checkpoint/serial/serialcount 0 myout.txt
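The submit, checkpoint, and restart steps described above can be tied together with a few small wrappers. This is only a sketch: run, parse_job_id, checkpoint_and_kill, and restart_job are hypothetical names, and parse_job_id assumes the usual bsub confirmation line of the form "Job <193876> is submitted to queue <test>." The bchkpnt and brestart invocations are the ones shown in this section.

```shell
#!/bin/sh
# Hypothetical wrappers; only bsub, bchkpnt, and brestart are real LSF commands.

# Set DRY_RUN=1 to print each LSF command instead of executing it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Read bsub's confirmation line on stdin and print the numeric job ID.
parse_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# checkpoint_and_kill JOBID
# Take a checkpoint and kill the job once it has been written.
checkpoint_and_kill() {
    run bchkpnt -k "$1"
}

# restart_job CHECKPOINT_DIR JOBID WALLTIME
restart_job() {
    run brestart -W "$3" "$1" "$2"
}

# Typical sequence on the cluster (left commented out so that sourcing
# this file submits nothing):
#   jobid=$(bsub -k "chkpnt method=criu" < myScript.lsf | parse_job_id)
#   checkpoint_and_kill "$jobid"
#   restart_job chkpnt "$jobid" 4:00
```

Setting DRY_RUN=1 first is a cheap way to check the exact command lines before anything is sent to LSF.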

 

Checkpointing in the absence of the -k option

If a job has been submitted without the -k option and there is a need to checkpoint it, it can be done manually.

The first thing to do is to find out the job number and the execution host. Both are shown in the output of the bjobs command, in the JOBID and EXEC_HOST columns.
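For a multi-slot job, the EXEC_HOST value has the form 4*nodeA:2*nodeB rather than a bare host name. The small helper below, a sketch with the hypothetical name exec_hosts, reduces such a value to plain host names that can be passed to ssh. The bjobs invocation in the comment assumes an LSF version whose bjobs supports the -o and -noheader options; nodeA and nodeB are placeholder host names.

```shell
#!/bin/sh
# exec_hosts VALUE
# Turn an EXEC_HOST value such as "4*nodeA:2*nodeB" into one host name per line.
exec_hosts() {
    printf '%s\n' "$1" | tr ':' '\n' | sed 's/^[0-9][0-9]*\*//' | sort -u
}

# On the cluster (assumes bjobs supports -o and -noheader):
#   exec_hosts "$(bjobs -noheader -o exec_host 193876)"
```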

Next, ssh to the execution host. Once there, issue the command:

pstree -a -u your_user_id -p

Look for the res process. This is the LSF process that reserves the resources on the node for your job. There may be more than one if you have several jobs running on the node. Identify the one you want by looking at its first child process and noting the last part of the command name, which should be the job number you are looking for (yellow highlight). This child process is your job file.

The next child process is the LSF job script. It, too, has the jobid as the last field in the file name.

The next child should be the one you are interested in. It is the process started by the job script that is running your calculation. Note the process ID (peach highlighted). In this case, it is 331686 and happens to be the first process in a parallel job that has 3 other processes.

Issue the command:

criu dump -t processid -D checkpointFolder --shell-job 
E.g.: criu dump -t 331686 -D chkpnt --shell-job
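Before dumping, it is worth confirming that the process ID is valid and that the checkpoint folder exists. The wrapper below is a sketch under those assumptions; manual_dump is a hypothetical name, and the criu command it issues is exactly the one shown above.

```shell
#!/bin/sh
# manual_dump PID DIR
# Hypothetical wrapper around "criu dump -t PID -D DIR --shell-job".
# With DRY_RUN=1 the criu command is printed instead of executed.
manual_dump() {
    pid=$1
    dir=$2
    # Reject anything that is not a string of digits.
    case $pid in
        ''|*[!0-9]*)
            echo "not a process ID: $pid" >&2
            return 2
            ;;
    esac
    # kill -0 sends no signal; it only checks that the process exists
    # and that you are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "no such process (or not yours): $pid" >&2
        return 1
    fi
    # The checkpoint folder is created even in dry-run mode.
    mkdir -p "$dir" || return 1
    if [ -n "${DRY_RUN:-}" ]; then
        echo "criu dump -t $pid -D $dir --shell-job"
    else
        criu dump -t "$pid" -D "$dir" --shell-job
    fi
}
```

For the example above, this would be called as manual_dump 331686 chkpnt on the execution host.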

 

Restarting Manual Checkpoint

To restart such a checkpoint, issue:

criu restore -D checkpointFolder --shell-job

within an LSF job script, e.g.:

#BSUB -q test
#BSUB -P acc_hpcstaff
#BSUB -W 50
#BSUB -n 1
#BSUB -R affinity[core(3)]
#BSUB -oo %J.out
criu restore -D chkpnt --shell-job

Note that in this case, the process being restored is the parent process of 3 parallel processes. Hence, the LSF options must request 3 cores. If the checkpointed process is only a single thread, you need only request 1 core.