Checkpoint Restart

 

Checkpoint/Restore In Userspace, or CRIU (pronounced kree-oo, IPA: /krɪʊ/, Russian: криу), is a Linux software application. It can freeze a running container (or an individual application) and checkpoint its state to disk. The saved data can then be used to restore the application and run it exactly as it was at the time of the freeze. This makes application or container live migration, snapshots, remote debugging, and many other things possible.

On Minerva, CRIU is installed on all compute and interactive nodes. It is not installed on the login nodes. It is integrated with LSF, which allows the use of the -k option of bsub and, subsequently, the bchkpnt and brestart commands.

Unlike most other checkpoint/restart programs, CRIU does not require the user to preload any code in order to take advantage of the capability. This allows users, with some care, to checkpoint programs running under LSF without using the built-in LSF features. Nevertheless, using the LSF checkpointing facilities is considerably easier.

CRIU Checkpoint/Restart Using LSF

To use the standard LSF method of specifying checkpoint, use the -k option of bsub:

bsub -k "checkpoint_dir [init=initial_checkpoint_period] [checkpoint_period] [method=method_name]"

Example: bsub -k "chkpntDir init=10 90 method=criu"

  • checkpoint_dir: Specify a relative or absolute path name. A relative path name is relative to the submission folder. The checkpoint directory can be any valid path. The job ID and job file name are appended to the checkpoint directory when a checkpoint file is created. The same directory is used for restarting the job (see brestart(1)).
  • checkpoint_period: Optional. Specify a positive integer number of minutes. The running job is checkpointed automatically every checkpoint period. Because checkpointing is a heavyweight operation, you should choose a checkpoint period greater than half an hour.
  • init=initial_checkpoint_period: Optional. Specify a positive integer number of minutes. The first checkpoint does not happen until the initial period has elapsed. After the first checkpoint, the checkpoint frequency is controlled by the normal checkpoint period.
  • method=method_name: The method name should be criu, as that is the only checkpointing method supported on Minerva.
  • The quotes (") are required if you specify a checkpoint period, an initial checkpoint period, or a custom checkpoint and restart method name.
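
The pieces of the -k argument described above can be assembled as in the following sketch. The helper name build_chkpnt_spec and the job script name myjob.lsf are hypothetical and used only for illustration; they are not part of LSF.

```shell
#!/bin/bash
# Build the quoted argument for `bsub -k` from its parts.
# build_chkpnt_spec is a hypothetical helper, not an LSF command.
build_chkpnt_spec() {
    local dir=$1 init=$2 period=$3
    local spec=$dir
    # Both periods are optional positive integers, in minutes.
    if [ -n "$init" ]; then spec="$spec init=$init"; fi
    if [ -n "$period" ]; then spec="$spec $period"; fi
    # criu is the only method supported on Minerva.
    echo "$spec method=criu"
}

# Example: first checkpoint after 10 minutes, then every 90 minutes.
# (myjob.lsf is a placeholder for your own job script.)
# bsub -k "$(build_chkpnt_spec chkpntDir 10 90)" < myjob.lsf
```

Passing empty strings for both periods yields a specification with only the directory and the method, in which case checkpoints have to be requested by hand.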

 

If no checkpoint period is specified, each checkpoint must be initiated manually with bchkpnt:

bchkpnt [-k] jobid

The optional -k kills the job after the checkpoint has been taken.

To restart the job, one uses the brestart command:

brestart [options] checkpointFolder jobid

Example: brestart -W 4:00 chkpnt 193876
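
The two commands are usually used as a pair: checkpoint and kill, then restart from the same folder. The sketch below wraps them in hypothetical helper functions (lsf_run, checkpoint_and_kill, restart_from are illustrative names, not LSF commands). It assumes that bchkpnt may return before the checkpoint has been fully written, so the restart is left as a separate step.

```shell
#!/bin/bash
# Hypothetical helpers. Set DRY_RUN=1 to print each command instead of running it.
lsf_run() {
    if [ -n "$DRY_RUN" ]; then echo "$@"; else "$@"; fi
}

# Take a checkpoint of job $1 and kill the job once it has been taken.
checkpoint_and_kill() {
    lsf_run bchkpnt -k "$1"
}

# Restart job $2 from checkpoint folder $1 with wall-clock limit $3.
restart_from() {
    lsf_run brestart -W "$3" "$1" "$2"
}

# Typical sequence:
# checkpoint_and_kill 193876
# ...wait until bjobs no longer shows the job as running...
# restart_from chkpnt 193876 4:00
```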

An LSF script example:

#BSUB -q test
#BSUB -P acc_hpcstaff
#BSUB -W 2:00
#BSUB -n 1
#BSUB -k "chkpnt method=criu"
#BSUB -oo %J.out

/hpc/users/fludee01/TEST_PGMS/checkpoint/serial/serialcount 0 myout.txt

Checkpointing in the absence of the -k option

If a job has been submitted without the -k option and there is a need to checkpoint it, it can be done manually.

First, find the job ID and the execution host of the job with the bjobs command. Take note of both.
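
As a sketch of this step, the job ID and execution host can be extracted from bjobs output with a small filter. This assumes the installed LSF version supports the -o and -noheader options of bjobs, and that the host field may carry a slot-count prefix such as 4*. The function name running_jobs is hypothetical, and the filter handles only the single-host case.

```shell
#!/bin/bash
# Read "jobid stat exec_host" lines on stdin and print the job ID and
# execution host of each running job. A leading slot count such as "4*"
# on the host name is stripped (assumed output format, single host only).
running_jobs() {
    awk '$2 == "RUN" { sub(/^[0-9]+\*/, "", $3); print $1, $3 }'
}

# On a node where LSF is available:
# bjobs -noheader -o "jobid stat exec_host" | running_jobs
```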

Next, ssh to the execution host. Once there, issue the command:

pstree -a -u your_user_id -p

Look for the res process. This is the LSF process that reserves the resources on the node for your job. There may be more than one if you have several jobs running on the node. Identify the one you want by looking at its first child process and noting the last part of the command name. This should be the job number you are looking for. This is your job file.

The next child process is the LSF job script. It, too, has the job ID as the last field in the file name.

The next child should be the one you are interested in. It is the process started by the job script that is running your calculation. Note its process ID. In the example used here, it is 331686 and happens to be the first process in a parallel job that has 3 other processes.
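
If the pstree output is long, the same parent-to-child walk can be done on plain ps output. The following is a minimal sketch. The function name descendants is hypothetical, and it expects "pid ppid command" lines such as those produced by ps -o pid=,ppid=,args=.

```shell
#!/bin/bash
# Read "pid ppid command" lines on stdin and print every descendant of the
# process whose PID is given as $1, one per line, in input order.
descendants() {
    awk -v root="$1" '
        { sub(/^ +/, ""); ppid[$1] = $2; line[$1] = $0; order[NR] = $1 }
        END {
            keep[root] = 1
            # ps does not guarantee parents come first, so repeat until stable.
            do {
                changed = 0
                for (i = 1; i <= NR; i++) {
                    p = order[i]
                    if (!(p in keep) && (ppid[p] in keep)) { keep[p] = 1; changed = 1 }
                }
            } while (changed)
            for (i = 1; i <= NR; i++)
                if (order[i] != root && (order[i] in keep)) print line[order[i]]
        }'
}

# On the execution host, with the PID of the res process you identified:
# ps -u "$USER" -o pid=,ppid=,args= | descendants 12345
```

The PID 12345 is a placeholder. The last line printed for a serial job, or the first process below the job script for a parallel one, is the candidate to hand to criu.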

Issue the command:

criu dump -t processid -D checkpointFolder --shell-job
E.g.: criu dump -t 331686 -D chkpnt --shell-job
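
A small wrapper can guard against the two most common slips here: a mistyped process ID and a missing checkpoint folder. This is a sketch only. The function name manual_dump is hypothetical, it assumes criu expects the image directory to exist already, and it uses --shell-job, the option CRIU documents for processes started from a shell.

```shell
#!/bin/bash
# Hypothetical wrapper around the manual dump. Set DRY_RUN=1 to print the
# criu command instead of running it.
manual_dump() {
    local pid=$1 dir=$2
    local run=""
    if [ -n "$DRY_RUN" ]; then run=echo; fi
    # Stop early if the target process has exited or is not ours to signal.
    kill -0 "$pid" 2>/dev/null || { echo "no such process: $pid" >&2; return 1; }
    # Create the checkpoint folder so criu has somewhere to write its images.
    mkdir -p "$dir" || return 1
    $run criu dump -t "$pid" -D "$dir" --shell-job
}

# manual_dump 331686 chkpnt
```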

Restarting manual checkpoint

To restart such a checkpoint, issue the following command within an LSF job script:

criu restore -D checkpointFolder --shell-job

E.g.:


#BSUB -q test
#BSUB -P acc_hpcstaff
#BSUB -W 50
#BSUB -n 1
#BSUB -R affinity[core(3)]
#BSUB -oo %J.out

criu restore -D chkpnt --shell-job

Note that in this case, the process being restored is the parent process of 3 parallel processes, so the LSF options must request 3 cores. If the checkpointed process is only a single thread, you need only request 1 core.
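
Before submitting the restore job, it can be worth confirming that the folder actually contains a dump. The check below is a sketch that assumes the dump wrote CRIU's usual inventory.img index file. The function name has_checkpoint is hypothetical.

```shell
#!/bin/bash
# Succeeds only if the folder looks like a CRIU dump. inventory.img is the
# index file that criu dump writes alongside the per-process images.
has_checkpoint() {
    [ -f "$1/inventory.img" ]
}

# In the restore job script:
# has_checkpoint chkpnt && criu restore -D chkpnt --shell-job
```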