{"id":5859,"date":"2022-08-31T14:51:47","date_gmt":"2022-08-31T18:51:47","guid":{"rendered":"https:\/\/labs.icahn.mssm.edu\/minervalab\/?page_id=5859"},"modified":"2025-11-03T13:24:02","modified_gmt":"2025-11-03T18:24:02","slug":"job-checkpoint","status":"publish","type":"page","link":"https:\/\/labs.icahn.mssm.edu\/minervalab\/documentation\/job-checkpoint\/","title":{"rendered":"Checkpoint Restart"},"content":{"rendered":"<p>[et_pb_section fb_built=&#8221;1&#8243; fullwidth=&#8221;on&#8221; _builder_version=&#8221;4.9.0&#8243; _module_preset=&#8221;default&#8221;][et_pb_fullwidth_menu menu_id=&#8221;15&#8243; menu_style=&#8221;centered&#8221; fullwidth_menu=&#8221;on&#8221; active_link_color=&#8221;#d80b8c&#8221; dropdown_menu_line_color=&#8221;#221f72&#8243; _builder_version=&#8221;4.9.0&#8243; _module_preset=&#8221;default&#8221; menu_font=&#8221;|600|||||||&#8221; menu_text_color=&#8221;#FFFFFF&#8221; menu_font_size=&#8221;16px&#8221; background_color=&#8221;#221f72&#8243; background_layout=&#8221;dark&#8221; sticky_position=&#8221;top&#8221;][\/et_pb_fullwidth_menu][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;4.9.0&#8243; custom_padding=&#8221;0px||0px||false|false&#8221;][et_pb_row _builder_version=&#8221;4.9.0&#8243; background_size=&#8221;initial&#8221; background_position=&#8221;top_left&#8221; background_repeat=&#8221;repeat&#8221; custom_padding=&#8221;||0px||false|false&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;3.25&#8243; custom_padding=&#8221;|||&#8221; custom_padding__hover=&#8221;|||&#8221;][et_pb_text _builder_version=&#8221;4.9.0&#8243; _module_preset=&#8221;default&#8221;]<\/p>\n<p><a href=\"https:\/\/labs.icahn.mssm.edu\/minervalab\/scientific-computing-and-data\/\">Scientific Computing and Data<\/a> \/ <a href=\"https:\/\/labs.icahn.mssm.edu\/minervalab\/\">High Performance Computing<\/a> \/ <a title=\"Documentation\" 
href=\"https:\/\/labs.icahn.mssm.edu\/minervalab\/documentation\/\">Documentation<\/a> \/ Restart Your LSF Jobs: Job Checkpoint<\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; _builder_version=&#8221;4.9.0&#8243; _module_preset=&#8221;default&#8221;][et_pb_row _builder_version=&#8221;4.9.0&#8243; _module_preset=&#8221;default&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.9.0&#8243; _module_preset=&#8221;default&#8221;][et_pb_text _builder_version=&#8221;4.9.0&#8243; _module_preset=&#8221;default&#8221; header_font=&#8221;|600|||||||&#8221; header_text_color=&#8221;#221f72&#8243; header_font_size=&#8221;26px&#8221; header_2_text_color=&#8221;#221f72&#8243; header_2_font_size=&#8221;24px&#8221;]<\/p>\n<h1>Checkpoint Restart<\/h1>\n<p><a href=\"https:\/\/criu.org\/Main_Page\">Checkpoint\/Restore In Userspace<\/a>, or CRIU (pronounced kree-oo, IPA: \/kr\u026a\u028a\/, Russian: \u043a\u0440\u0438\u0443), is a Linux software application. It can freeze a running container (or an individual application) and checkpoint its state to disk. The data saved can be used to restore the application and run it exactly as it was at the time of the freeze. This functionality makes application or container live migration, snapshots, remote debugging, and many other things possible.<\/p>\n<p>On Minerva, CRIU is installed on all compute and interactive nodes. It is not installed on the login nodes. It is integrated with LSF, which enables the -k option of bsub and, subsequently, the bchkpnt and brestart commands.<br \/>Unlike most other checkpoint\/restart programs, CRIU does not require the user to preload any code to take advantage of the capability. This allows users, with some care, to checkpoint programs running under LSF without using the built-in LSF features. 
Nevertheless, using the LSF checkpointing facilities is considerably easier.<\/p>\n<p><b>NOTE: The current version of CRIU does not restore file positions upon restart. If a file's size at restart differs from its size at checkpoint time, the restart will fail. Hence, we do not recommend using the periodic checkpoint feature of LSF. <\/b><\/p>\n<h2>CRIU Checkpoint\/Restart Using LSF<\/h2>\n<p>To request checkpointing with the standard LSF method, use the -k option of bsub:<\/p>\n<blockquote>\n<p>bsub -k &quot;checkpoint_dir method=criu&quot;<\/p>\n<\/blockquote>\n<p>Example: <b>bsub -k &quot;chkpntDir method=criu&quot; &lt; myScript.lsf<\/b><\/p>\n<ul>\n<li>Specify a relative or absolute path name for chkpntDir. A relative path name is resolved against the submission folder.<\/li>\n<li>The quotes (&quot;) are required because you must specify a custom checkpoint and restart method name.<\/li>\n<\/ul>\n<p>The job ID and job file name are appended to the checkpoint directory when a checkpoint file is created.<br \/>The checkpoint directory is used for restarting the job (see brestart(1)). The checkpoint directory can be any valid path.<\/p>\n<p>The method name must be criu, as that is the only checkpointing method supported on Minerva.<\/p>\n<p>A checkpoint must be initiated using <b>bchkpnt<\/b>.<\/p>\n<blockquote>\n<p>bchkpnt [-k] jobid<\/p>\n<\/blockquote>\n<p><i>Note: The optional <b>-k<\/b> specifies that the job is to be killed after the checkpoint is taken. 
Because file position information is not saved or used on restart, specifying the -k option is highly recommended.<\/i><\/p>\n<p>To restart the job, use the <b>brestart<\/b> command:<\/p>\n<blockquote>\n<p>brestart [options] checkpointFolder jobid<\/p>\n<\/blockquote>\n<p>Example: <strong>brestart -W 4:00 chkpnt 193876<\/strong><br \/>An LSF script example:<\/p>\n<blockquote>\n<p>#BSUB -q test<br \/>#BSUB -P acc_hpcstaff<br \/>#BSUB -W 2:00<br \/>#BSUB -n 1<br \/>#BSUB -k &quot;chkpnt method=criu&quot;<br \/>#BSUB -oo %J.out<br \/>\/hpc\/users\/fludee01\/TEST_PGMS\/checkpoint\/serial\/serialcount 0 myout.txt<\/p>\n<\/blockquote>\n<p>&nbsp;<\/p>\n<h2>Checkpointing in the absence of the -k option<\/h2>\n<p>A job that was submitted without the <b>-k<\/b> option can still be checkpointed manually.<\/p>\n<p>First, find the job number and the execution host. This is done with the <b>bjobs<\/b> command:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1219\" src=\"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-content\/uploads\/sites\/342\/2020\/03\/bjobsmarked.png\" alt=\"\" width=\"942\" height=\"48\" \/><\/p>\n<p>Next, ssh to the execution host. Once there, issue the command:<\/p>\n<blockquote>\n<pre><code>pstree -a -u your_user_id -p<\/code><\/pre>\n<\/blockquote>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-1220\" src=\"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-content\/uploads\/sites\/342\/2020\/03\/pstreemarked.png\" alt=\"\" width=\"540\" height=\"143\" \/><\/p>\n<p>Look for the <i>res<\/i> process. This is the LSF process that is reserving the resources on the node for your job. There may be more than one if you have several jobs running on the node. Identify the one you want by looking at the first child process and noting the last part of the command name. 
This should be the job number you are looking for (yellow highlight). This is your job file.<\/p>\n<p>The next child process is the LSF job script. It, too, has the <em>jobid<\/em> as the last field in the file name.<\/p>\n<p>The next child should be the one you are interested in. It is the process started by the job script that is running your calculation. Note the process ID (peach highlight). In this case, it is 331686 and happens to be the first process in a parallel job that has 3 other processes.<\/p>\n<p>Issue the command:<\/p>\n<blockquote>\n<pre><code>criu dump -t processid -D checkpointFolder --shell-job \r\nE.g.: criu dump -t 331686 -D chkpnt --shell-job<\/code><\/pre>\n<\/blockquote>\n<p>&nbsp;<\/p>\n<h2>Restarting a Manual Checkpoint<\/h2>\n<p>To restart such a checkpoint, issue:<\/p>\n<pre><code>criu restore -D checkpointFolder --shell-job<\/code><\/pre>\n<p>within an LSF job script, e.g.:<\/p>\n<blockquote>\n<p>#BSUB -q test<br \/>#BSUB -P acc_hpcstaff<br \/>#BSUB -W 50<br \/>#BSUB -n 1<br \/>#BSUB -R affinity[core(3)]<br \/>#BSUB -oo %J.out<br \/>criu restore -D chkpnt --shell-job<\/p>\n<\/blockquote>\n<p>Note that in this case, the process being restored is the parent process of 3 parallel processes. Hence, the LSF options must request 3 cores. If the checkpointed process is only a single thread, you need only request 1 core.<\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scientific Computing and Data \/ High Performance Computing \/ Documentation \/ Restart Your LSF Jobs: Job CheckpointCheckpoint Restart Checkpoint\/Restore In Userspace, or CRIU (pronounced kree-oo, IPA: \/kr\u026a\u028a\/, Russian: \u043a\u0440\u0438\u0443), is a Linux software application. It can freeze a running container (or an individual application) and checkpoint its state to disk. 
The data saved can be [&hellip;]<\/p>\n","protected":false},"author":600,"featured_media":0,"parent":35,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"class_list":["post-5859","page","type-page","status-publish","hentry"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages\/5859","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/users\/600"}],"replies":[{"embeddable":true,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/comments?post=5859"}],"version-history":[{"count":8,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages\/5859\/revisions"}],"predecessor-version":[{"id":7948,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages\/5859\/revisions\/7948"}],"up":[{"embeddable":true,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages\/35"}],"wp:attachment":[{"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/media?parent=5859"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}