
Load Sharing Facility (LSF) Job Scheduler

When you log into Minerva, you are placed on one of the login nodes. The login nodes should only be used for basic tasks such as file editing, code compilation, data backup, and job submission; they should not be used to run production jobs. Production work should be performed on the system's compute nodes. Note that Minerva compute nodes do not have internet access by design, for security reasons, so you need to be on a login node or one of the interactive nodes to communicate with the outside world.

Contents
Running Jobs on Minerva Compute Nodes
Submitting Batch Jobs with bsub
Commonly Used bsub Options
Sample Batch Jobs
Array Jobs
Interactive Jobs
Useful LSF Commands
Pending Reasons
LSF Queues And Policies
Multiple Serial Jobs
Improper Resource Specification
Minerva Training Session

 

Running Jobs on Minerva Compute Nodes

Access to compute resources and job scheduling are managed by the IBM Spectrum LSF (Load Sharing Facility) batch system. LSF is responsible for allocating resources to users, providing a framework for starting, executing, and monitoring work on allocated resources, and scheduling work for future execution. Once access to compute resources has been allocated through the batch system, users can execute jobs on the allocated resources.

You should request compute resources (e.g., number of nodes, number of compute cores per node, amount of memory, max time per job, etc.) that are consistent with the type of application(s) you are running:

  • A serial (non-parallel) application can only make use of a single compute core on a single node, and will only see that node’s memory.
  • A threaded program (e.g., one that uses OpenMP) employs a shared-memory programming (SMP) model and is also restricted to a single node, but can run on multiple CPU cores on that node. If multiple nodes are requested for a threaded application, it will only see the memory and compute cores on the first node assigned to the job, while those on the remaining nodes are wasted. It is highly recommended that the number of threads be set to the number of compute cores.
  • An MPI (Message Passing Interface) parallel program can be distributed over multiple nodes: it launches multiple copies of its executable (MPI tasks, each assigned unique IDs called ranks) that can communicate with each other across the network.
  • An LSF job array enables large numbers of independent jobs to be distributed over more than a single node from a single batch job submission.

 

Submitting Batch Jobs with bsub

To submit a job to the LSF queue, you must first have a project allocation account, which must be provided using the “-P” option flag. To see the list of project allocation accounts you have access to, run:

$ mybalance

If you need access to a specific project account, you will have to have the project authorizer (owner/delegate) send us a request at hpchelp@hpc.mssm.edu.

A batch job is the most common way users run production applications on Minerva. To submit a batch job to one of the Minerva queues use the bsub command:

bsub [options] command

$ bsub -P acc_hpcstaff -q premium -n 1 -W 00:10 -o hello.out echo "Hello World!"

In this example, a job is submitted to the premium queue (-q premium) to execute the command

echo "Hello World!"

with a resource request of a single core (-n 1) and the default amount of memory (3 GB) for 10 minutes of walltime (-W 00:10), under a specific project allocation account that Minerva staff members have access to (-P acc_hpcstaff). The result will be written to a file (-o hello.out), along with a summary of resource usage information, when the job is completed. The job will advance in the queue until it reaches the top, at which point LSF will allocate the requested compute resources to the batch job.

Typically, the user submits a job script to the batch system.

bsub [options] < YourJobScript

Here “YourJobScript” is the name of a text file containing #BSUB directives and shell commands that describe the particulars of the job you are submitting.

$ bsub < HelloWorld.lsf

where HelloWorld.lsf is:

#!/bin/bash
#BSUB -P acc_hpcstaff
#BSUB -q premium
#BSUB -n 1
#BSUB -W 00:10
#BSUB -o "hello.out"

echo "Hello World!"

Note that if an option is given on both the bsub command line and in the job script, the command-line option overrides the one in the script. The following job will be submitted to the express queue, not to the premium queue specified in the job script.

$ bsub -q express < HelloWorld.lsf

 

Commonly Used bsub Options

-P acc_ProjectName Project allocation account name (required)
-q queue_name Job submission queue (default: premium)
-W WallClockTime Wall-clock limit in the form HH:MM (default: 1:00)
-J job_name Job name
-n Ncore Total number of CPU cores requested (default: 1)
-R rusage[mem=#] Amount of memory per core in MB (default: rusage[mem=3000]). Note that this is per core, not per job.

  • Max memory per node: 160GiB (compute), 326GB (GPU), 1.4TiB (himem), 1.9TB (himem-GPU-A100-80GB)

-R span[ptile=#] Number of CPU cores per physical node
-R span[hosts=1] All cores on the same node
-R himem Request a high-memory node
-o output_file Direct job standard output to output_file (without the -e option, error output also goes to this file). LSF will append the job’s output to the specified output_file. If you want the output to overwrite an existing output_file, use the “-oo” option instead.
-e error_file Direct job error output to error_file. To overwrite an existing error_file, use the “-eo” option instead.
-L login_shell Initialize the execution environment using the specified login shell

For more details about the bsub options, click here.

Although there are default values for all batch parameters except -P, it is always a good idea to specify the name of the queue, the number of cores, and the walltime for all batch jobs. To minimize the time spent waiting in the queue, specify the smallest walltime that will safely allow the job to complete.
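
For instance, a submission that specifies these explicitly might look like the following sketch; the allocation account, queue, and resource values here are illustrative, not recommendations:

$ bsub -P acc_YourAllocationAccount -q premium -n 4 -W 02:00 -R rusage[mem=4000] -R span[hosts=1] -J myjob -o myjob.%J.out -eo myjob.%J.err < YourJobScript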

LSF inserts job report information into the job’s output file. This information includes the submitting user and host, the execution host, the CPU time (user plus system time) used by the job, and the exit status. Note that the standard LSF configuration allows the job’s output to be emailed to the user when the job completes, BUT this feature has been disabled on Minerva for stability reasons.

 

Sample Batch Jobs

Serial Jobs

A serial job is one that only requires a single computational core. There is no queue specifically configured to run serial jobs. Serial jobs share nodes, rather than having exclusive access. Multiple jobs will be scheduled on an available node until either all cores are in use, or until there is not enough memory available for additional processes on that node. The following job script requests a single core and 8GB of memory for 2 hours in the “express” queue:

#!/bin/bash
#BSUB -J mySerialJob                           # Job name
#BSUB -P acc_YourAllocationAccount   # allocation account
#BSUB -q express                                  # queue
#BSUB -n 1                                            # number of compute cores = 1
#BSUB -R rusage[mem=8000]              # 8GB of memory
#BSUB -W 02:00                                   # walltime in HH:MM
#BSUB -o %J.stdout                              # output log (%J : JobID)
#BSUB -eo %J.stderr                             # error log
#BSUB -L /bin/bash                               # Initialize the execution environment

 

ml gcc
cd /sc/arion/work/MyID/my/job/dir/
../mybin/serial_executable < testdata.inp > results.log

Multithreaded Jobs

In general, a multithreaded application uses a single process which then spawns multiple threads of execution. The compute cores must be on the same node and it’s highly recommended the number of threads is set to the number of compute cores. The following example requests 8 cores on the same node for 12 hours in the “premium” queue. Note that the memory requirement ( -R rusage[mem=4000] ) is in MB and is PER CORE, not per job. A total of 32GB of memory will be allocated for this job.

#!/bin/bash
#BSUB -J mySTARjob                   # Job name
#BSUB -P acc_PLK2                     # allocation account
#BSUB -q premium                        # queue
#BSUB -n 8                                    # number of compute cores
#BSUB -W 12:00                            # walltime in HH:MM
#BSUB -R rusage[mem=4000]      # 32 GB of memory (4 GB per core)
#BSUB -R span[hosts=1]            # all cores from the same node
#BSUB -o %J.stdout                     # output log (%J : JobID)
#BSUB -eo %J.stderr                   # error log
#BSUB -L /bin/bash                      # Initialize the execution environment

 

module load star                           # load star module
WRKDIR=/sc/arion/projects/hpcstaff/benchmark_star

STAR --genomeDir $WRKDIR/star-genome --readFilesIn Experiment1.fastq --runThreadN 8 --outFileNamePrefix Experiment1Star

MPI Parallel Jobs

An MPI application launches multiple copies of its executable (MPI tasks) across multiple nodes; the tasks are separate processes that communicate with each other over the network. The following example requests 48 cores and 2 hours in the “premium” queue. Those 48 cores are distributed over 6 nodes (8 cores per node), and a total of 192 GB of memory will be allocated for this job (32 GB on each node).

#!/bin/bash
#BSUB -J myMPIjob                     # Job name
#BSUB -P acc_hpcstaff                # allocation account
#BSUB -q premium                       # queue
#BSUB -n 48                                 # total number of compute cores
#BSUB -R span[ptile=8]              # 8 cores per node
#BSUB -R rusage[mem=4000]     # 192 GB of memory (4 GB per core)
#BSUB -W 02:00                          # walltime in HH:MM
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash

 

module load selfsched
mpirun -np 48 selfsched < test.inp

Please refer to LSF Queues and Policies for further information about available queues and computing resource allocation. Please refer to the GPU section for further information about options relevant to GPU job submissions.

Array Jobs

Sometimes it is necessary to run a group of jobs that share the same computational requirements but use different input files. Job arrays can be used to handle this type of embarrassingly parallel workload. They can be submitted, controlled, and monitored as a single unit or as individual jobs or groups of jobs. Each job submitted from a job array shares the same job ID as the job array and is uniquely referenced using an array index.
To create a job array, add an index range to the job name specification:

#BSUB -J MyArrayJob[1-100]

In this way you request 100 jobs, numbered from 1 to 100. Each job in the job array can be identified by its index, which is accessible through the environment variable LSB_JOBINDEX. LSF provides the runtime variables %I and %J, which correspond to the job array index and the job ID, respectively. They can be used in the #BSUB option specifications to diversify the jobs. In your commands, however, you have to use the environment variables LSB_JOBINDEX and LSB_JOBID.

#!/bin/bash
#BSUB -P acc_hpcstaff
#BSUB -n 1
#BSUB -W 02:00
#BSUB -q express
#BSUB -J “jobarraytest[1-10]”
#BSUB -o out.%J.%I
#BSUB -e err.%J.%I

 

echo "Working on file.$LSB_JOBINDEX"

A total of 10 jobs, jobarraytest[1] to jobarraytest[10], will be created and submitted to the queue simultaneously by this script. The output and error files will be written to out.JobID.1 ~ out.JobID.10 and err.JobID.1 ~ err.JobID.10, respectively.
Note: We also have an in-house self-scheduler tool (“selfsched”) that can submit large numbers of independent serial jobs as a single batch; each job should be short (less than ca. 10 min.) and does not have to be indexed by number. See Multiple Serial Jobs below.

Interactive Jobs

Interactive batch jobs give users interactive access to compute resources. A common use for interactive batch jobs is debugging and testing. Running a batch-interactive job is done by using the -I option with bsub.
Here is an example command creating an interactive shell on compute nodes with internet access:

bsub -P AllocationAccount -q interactive -n 8 -W 15 -R span[hosts=1] -XF -Is /bin/bash

This command allocates a total of 8 cores on one of the interactive compute nodes, which are reserved exclusively for jobs running in interactive mode. All cores are on the same node. The -XF option enables X11 forwarding so that graphical applications can open on your screen. Once the interactive shell has started, the user can execute jobs interactively there. In addition to those dedicated nodes, regular compute nodes (which have no internet access) can also be used to open an interactive session by submitting a batch interactive job to one of the non-interactive queues such as “premium”, “express”, or “gpu” using the bsub -I, -Is, or -Ip options.
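
As an example, a minimal sketch of a batch interactive session on a regular compute node via the premium queue (the allocation account, core count, and walltime are illustrative):

bsub -P acc_YourAllocationAccount -q premium -n 4 -W 02:00 -R span[hosts=1] -Is /bin/bash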

Useful LSF Commands

bjobs Show job status

Check the status of your own jobs in the queue. To display detailed information about a job in a multi-line format, use the bjobs command with the “-l” option flag: bjobs -l JobID. If a job has already completed, the bjobs command will not show information about it; at that point, use the bhist command to retrieve the job’s record from the LSF database.
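
For example (the job ID is hypothetical):

$ bjobs -l 123456     # detailed information for a pending or running job
$ bhist -l 123456     # record of a job that has already finished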

bkill Cancel a batch job

A job can be removed from the queue, or killed if it is already running, with the “bkill JobID” command. There are several ways to terminate a specific set of your jobs:

  • To terminate a job by its job name: bkill -J myjob_1
  • To terminate a group of jobs using a wildcard: bkill -J myjob_*
  • To kill some selected array jobs: bkill JobID[1,7,10-25]
  • To kill all your jobs: bkill 0

bmod Modify the resource requirement of a pending job

Many of the batch job specifications can be modified after a batch job is submitted and before it runs. Typical fields that can be modified include the job size (number of cores), the queue, and the wall-clock limit. Job specifications cannot be modified by the user once the job enters the RUN state. The bmod command is used to modify a job’s specifications. For example:

bmod -q express JobID changes the job’s queue to the express queue.
bmod -R rusage[mem=20000] JobID changes the job’s memory requirement to 20 GB per core.

Note that -R replaces ALL -R fields, not just the one you specify.
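
Because -R replaces all existing resource requirement strings, a job originally submitted with both a memory and a span requirement should have both re-specified in a single bmod; a sketch with a hypothetical job ID and values:

bmod -R "rusage[mem=20000] span[hosts=1]" 123456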

bpeek Displays the stdout and stderr output of an unfinished job.

bqueues Displays information about queues.

bhosts Displays hosts and their static and dynamic resources.

For the full list of LSF commands, see IBM Spectrum LSF Command Reference.

Pending Reasons

There could be many reasons for a job to stay in the queue longer than usual: the whole cluster may be busy; your job may request a large amount of compute resources (e.g., memory or compute cores) that no compute node can satisfy at the moment; or your job would overlap with a scheduled PM (planned maintenance). To see the pending reason, use the bjobs command with the “-l” flag:

$ bjobs -l

LSF Queues And Policies

The command to check the queues is “bqueues”. To get more details about a specific queue, type “bqueues -l <queue_name>”, e.g., bqueues -l premium.
Current Minerva Queues:
* The default memory for all queues is set to 3000 MB.

Queue Description Max Walltime
premium Normal submission queue 144 hrs
express Rapid turnaround jobs 12 hrs
interactive Jobs running in interactive mode 12 hrs
long Jobs requiring extended runtime 336 hrs
gpu Jobs requiring gpu resources 144 hrs
gpuexpress Short jobs requiring gpu resources 15 hrs
private Jobs using dedicated resources Unlimited
others Any other queues are for testing by the Scientific Computing group N/A

 

Policies

LSF is configured to do “absolute priority scheduling (APS)” with backfill. Backfilling allows smaller, shorter jobs to use otherwise idle resources.
To check the priority of pending jobs, run: bjobs -u all -aps
In certain special cases, the priority of a job may be manually increased upon request. To request a priority change, contact ISMMS Scientific Computing Support at hpchelp@hpc.mssm.edu. We will need the job ID and the reason in order to submit the request.

 

Multiple Serial Jobs

How to submit multiple serial jobs over more than a single node:

Sometimes users want to submit large numbers of independent serial jobs as a single batch. Rather than using a script to repeatedly call bsub, a self-scheduling utility (“selfsched”) can be used to have multiple serial jobs bundled and scheduled over more than a single node with one bsub command.

Usage:
In your batch script, load the self-scheduler module and execute it using mpi wrapper (“mpirun”) in a parallel mode:

module load selfsched
mpirun selfsched < YourInputForSelfScheduler

where YourInputForSelfScheduler is a file containing serial job commands like,

/my/bin/path/Exec_1 < my_input_parameters_1 > output_1.log
/my/bin/path/Exec_2 < my_input_parameters_2 > output_2.log
/my/bin/path/Exec_3 < my_input_parameters_3 > output_3.log
...

Each line has a 2048-character limit and TAB characters are not allowed.
Please note that one of the compute cores is used to monitor and schedule the serial jobs over the rest of the cores, so the actual number of cores used for the real computation is (total number of cores assigned − 1).
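
The sketch below shows one way to wrap the self-scheduler in a batch job, following the MPI example above; the allocation account, queue, core count, walltime, and memory are illustrative. With -n 13, one core schedules and 12 cores run the serial commands.

#!/bin/bash
#BSUB -J mySelfSchedJob                # Job name
#BSUB -P acc_YourAllocationAccount     # allocation account
#BSUB -q premium                       # queue
#BSUB -n 13                            # 1 scheduler core + 12 worker cores
#BSUB -W 04:00                         # walltime in HH:MM
#BSUB -R rusage[mem=3000]              # memory per core in MB
#BSUB -o %J.stdout                     # output log (%J : JobID)
#BSUB -eo %J.stderr                    # error log
#BSUB -L /bin/bash                     # Initialize the execution environment

module load selfsched
mpirun -np 13 selfsched < YourInputForSelfScheduler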

A simple utility (“PrepINP”) is also provided to facilitate generation of YourInputForSelfScheduler file. The self-scheduler module has to be loaded first.

Usage:

module load selfsched
PrepINP < templ.txt > YourInputForSelfScheduler

templ.txt contains the input parameters, with the number fields replaced by “#”, from which the YourInputForSelfScheduler file is generated.

Example 1:

1 10000 2 F                              ← start, end, stride, fixed field length?
/my/bin/path/Exec_# < my_input_parameters_# > output_#.log

The output will be

/my/bin/path/Exec_1 < my_input_parameters_1 > output_1.log
/my/bin/path/Exec_3 < my_input_parameters_3 > output_3.log
/my/bin/path/Exec_5 < my_input_parameters_5 > output_5.log
...
/my/bin/path/Exec_9999 < my_input_parameters_9999 > output_9999.log

Example 2:

1 10000 1 T                              ← start, end, stride, fixed field length?
5                                        ← field length
/my/bin/path/Exec_# < my_input_parameters_# > output_#.log

The output will be

/my/bin/path/Exec_00001 < my_input_parameters_00001 > output_00001.log
/my/bin/path/Exec_00002 < my_input_parameters_00002 > output_00002.log
/my/bin/path/Exec_00003 < my_input_parameters_00003 > output_00003.log
...
/my/bin/path/Exec_10000 < my_input_parameters_10000 > output_10000.log

Improper Resource Specification

Minerva is a shared resource. Improper specification of job requirements not only wastes those resources but also prevents other researchers from running their jobs.

1. If your program is not explicitly written to use more than one core, specifying more than one core wastes cores that could be used by other users. E.g.

IF

bsub -n 6 < single_core_program

THEN

  • 6 cores allocated

  • program runs on 1 core

  • 5 cores are idle

2. If your program can use more than one core but does not use things like mpirun, mpiexec, or torchrun, then all cores must be on the same node. This is a Shared-memory MultiProcessing program (SMP). You must specify -R span[hosts=1] to ensure all cores are on the same node.

E.g:

IF

bsub -n 6 < SMP_program

THEN

  • 6 cores are allocated: 1 on node1, 2 on node2, 3 on node3

  • 1 core on node1 is used to run the program

  • The other 5 cores sit idle

or, perhaps,

IF

bsub -n 6 < SMP_program

THEN

  • 6 cores are allocated: 3 on node1, 1 on node2, 2 on node3

  • 3 cores on node1 are used to run the program

  • The other 3 cores sit idle

But with -R span[hosts=1]

IF

bsub -n 6 -R span[hosts=1] < SMP_program

THEN

  • 6 cores are allocated: 6 on node1

  • All 6 cores on node1 are used to run the program

3. Memory specified by -R rusage[mem=xxx] is reserved for your use and cannot be used by anyone else until it is released at the end of the job. If most or all of the memory on a node is reserved, no additional jobs can run even if there are still CPUs available.

Example:

IF

  • A 192GB node has 3 jobs dispatched to it

  • each job is requesting 1 core and 64GB of memory (-n 1 -R rusage[mem=64G] )

  • each is actually using only 1GB

THEN

  • All memory on the node is reserved so no other job can be dispatched to it.

  • Only 3GB of the memory is being used.

  • 189 GB of memory and 43 cores are idle and cannot be used.

4. Check to see how much memory your program is actually using and request accordingly. If you are running a series of jobs using the same program, run one or two test jobs with a large amount of memory and then adjust for the production runs.

Test:

bsub -R rusage[mem=6G] other options < testJob.lsf

Look at output:

Successfully completed.

Resource usage summary:

    CPU time :                1.82 sec.
    Max Memory :              8 MB
    Average Memory :          6.75 MB
    Total Requested Memory :  6144.00 MB
    Delta Memory :            6136.00 MB
    Max Swap :                -
    Max Processes :           3
    Max Threads :             3
    Run time :                2 sec.
    Turnaround time :         50 sec.

The test run shows this program used a maximum of only 8 MB. Memory usage varies from run to run depending on what else is running on the node, so reduce the requested memory but give yourself some “wiggle room”:

bsub -R rusage[mem=15M] other options < productionJob.lsf

Successfully completed.

Resource usage summary:

    CPU time :                2.93 sec.
    Max Memory :              5 MB
    Average Memory :          4.00 MB
    Total Requested Memory :  15.00 MB
    Delta Memory :            10.00 MB
    Max Swap :                -
    Max Processes :           3
    Max Threads :             3
    Run time :                4 sec.
    Turnaround time :         47 sec.

Minerva Training Session

On April 8, 2022, the Scientific Computing and Data team hosted a training class on the LSF Job Scheduler. Click here for the training session PowerPoint slides.
Click here to see more about Minerva Training Classes.