Load Sharing Facility (LSF) Job Scheduler
When you log into Minerva, you are placed on one of the login nodes. Minerva login nodes should only be used for basic tasks such as file editing, code compilation, data backup, and job submission. The login nodes should not be used to run production jobs. Production work should be performed on the system's compute nodes. Note that Minerva compute nodes do not have internet access by design for security reasons. You need to be on a Minerva login node or one of the interactive nodes to communicate with the outside world.
Contents
Running Jobs on Minerva Compute Nodes
Submitting Batch Jobs with bsub
Commonly Used bsub Options
Sample Batch Jobs
Array Jobs
Interactive Jobs
Useful LSF Commands
Pending Reasons
LSF Queues And Policies
Multiple Serial Jobs
Improper Resource Specification
Minerva Training Session
Running Jobs on Minerva Compute Nodes
Access to compute resources and job scheduling are managed by the IBM Spectrum LSF (Load Sharing Facility) batch system. LSF is responsible for allocating resources to users, providing a framework for starting, executing, and monitoring work on allocated resources, and scheduling work for future execution. Once access to compute resources has been allocated through the batch system, users can execute jobs on the allocated resources.
You should request compute resources (e.g., number of nodes, number of compute cores per node, amount of memory, max time per job, etc.) that are consistent with the type of application(s) you are running; typical requests for each type are sketched after this list:
- A serial (non-parallel) application can only make use of a single compute core on a single node, and will only see that node’s memory.
- A threaded program (e.g., one that uses OpenMP) employs a shared memory programming (SMP) model and is also restricted to a single node, but can run on multiple CPU cores on that same node. If multiple nodes are requested for a threaded application, it will only see the memory and compute cores on the first node assigned to the job, and the cores on the remaining nodes will be wasted. It is highly recommended that the number of threads be set to the number of compute cores.
- An MPI (Message Passing Interface) parallel program can be distributed over multiple nodes: it launches multiple copies of its executable (MPI tasks, each assigned unique IDs called ranks) that can communicate with each other across the network.
- An LSF job array enables large numbers of independent jobs to be distributed over more than a single node from a single batch job submission.
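As a rough sketch, the request for each application type typically differs only in a few bsub options; the core counts and job name below are placeholders, not recommendations:
serial:    bsub -n 1 [other options] < YourJobScript
threaded:  bsub -n 8 -R span[hosts=1] [other options] < YourJobScript
MPI:       bsub -n 48 -R span[ptile=8] [other options] < YourJobScript
job array: bsub -J "myArray[1-100]" [other options] < YourJobScript
Complete job scripts for each of these cases are given in the Sample Batch Jobs section below.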
Submitting Batch Jobs with bsub
To submit a job to the LSF queue, you must first have a project allocation account, which has to be provided using the "-P" option flag. To see the list of project allocation accounts you have access to:
$ mybalance
If you need access to a specific project account, you will have to have the project authorizer (owner/delegate) send us a request at hpchelp@hpc.mssm.edu.
A batch job is the most common way users run production applications on Minerva. To submit a batch job to one of the Minerva queues use the bsub command:
bsub [options] command
$ bsub -P acc_hpcstaff -q premium -n 1 -W 00:10 -o hello.out echo "Hello World!"
In this example, a job is submitted to the premium queue (-q premium) to execute the command echo "Hello World!" with a resource request of a single core (-n 1) and the default amount of memory (3 GB) for 10 minutes of walltime (-W 00:10), under a specific project allocation account that Minerva staff members have access to (-P acc_hpcstaff). The result will be written to a file (-o hello.out) along with a summary of resource usage information when the job is completed. The job will advance in the queue until it has reached the top. At this point, LSF will allocate the requested compute resources to the batch job.
Typically, the user submits a job script to the batch system.
bsub [options] < YourJobScript
Here "YourJobScript" is the name of a text file containing #BSUB directives and shell commands that describe the particulars of the job you are submitting.
$ bsub < HelloWorld.lsf
where HelloWorld.lsf is:
#!/bin/bash
#BSUB -P acc_hpcstaff
#BSUB -q premium
#BSUB -n 1
#BSUB -W 00:10
#BSUB -o hello.out
echo "Hello World!"
Note that if an option is given on both the bsub command line and in the job script, the command line option overrides the option in the script. The following job will be submitted to the express queue, not to the premium queue specified in the job script.
$ bsub -q express < HelloWorld.lsf
Commonly Used bsub Options
Option | Description |
-P acc_ProjectName | Project allocation account name (required) |
-q queue_name | Job submission queue (default: premium) |
-W WallClockTime | Wall-clock limit in the form HH:MM (default: 1:00) |
-J job_name | Job name |
-n Ncore | Total number of cpu cores requested (default: 1) |
-R rusage[mem=#] | Amount of memory per core in MB (default: rusage[mem=3000]). Note that this is not per job. |
-R span[ptile=#] | Number of cpu cores per physical node |
-R span[hosts=1] | All cores on the same node |
-R himem | Request a high-memory node |
-o output_file | Direct job standard output to output_file (without the -e option, error output also goes to this file). LSF will append the job's output to the specified output_file. If you want the output to overwrite the existing output_file, use the "-oo" option instead. |
-e error_file | Direct job error output to error_file. To overwrite an existing error_file, use the "-eo" option instead. |
-L login_shell | Initializes the execution environment using the specified login shell |
For more details about the bsub options, click here.
Although there are default values for all batch parameters except for -P, it is always a good idea to specify the name of the queue, the number of cores, and the walltime for all batch jobs. To minimize the time spent waiting in the queue, specify the smallest walltime that will safely allow the job to complete.
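For example, an explicit submission that sets the queue, core count, and walltime might look like the following (the allocation account here is just a placeholder):
$ bsub -P acc_MyProject -q express -n 1 -W 00:30 -o job.%J.out < YourJobScript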
LSF inserts job report information into the job's output. This information includes the submitting user and host, the execution host, the CPU time (user plus system time) used by the job, and the exit status. Note that the standard LSF configuration can email the job's output to the user when the job completes, BUT this feature has been disabled on Minerva for stability reasons.
Sample Batch Jobs
Serial Jobs
A serial job is one that only requires a single computational core. There is no queue specifically configured to run serial jobs. Serial jobs share nodes, rather than having exclusive access. Multiple jobs will be scheduled on an available node until either all cores are in use, or until there is not enough memory available for additional processes on that node. The following job script requests a single core and 8GB of memory for 2 hours in the “express” queue:
#!/bin/bash
#BSUB -J mySerialJob # Job name
#BSUB -P acc_YourAllocationAccount # allocation account
#BSUB -q express # queue
#BSUB -n 1 # number of compute cores = 1
#BSUB -R rusage[mem=8000] # 8GB of memory
#BSUB -W 02:00 # walltime in HH:MM
#BSUB -o %J.stdout # output log (%J : JobID)
#BSUB -eo %J.stderr # error log
#BSUB -L /bin/bash # Initialize the execution environment
ml gcc
cd /sc/arion/work/MyID/my/job/dir/
../mybin/serial_executable < testdata.inp > results.log
Multithreaded Jobs
In general, a multithreaded application uses a single process which then spawns multiple threads of execution. The compute cores must be on the same node, and it is highly recommended that the number of threads be set to the number of compute cores. The following example requests 8 cores on the same node for 12 hours in the "premium" queue. Note that the memory requirement (-R rusage[mem=4000]) is in MB and is PER CORE, not per job. A total of 32GB of memory will be allocated for this job.
#!/bin/bash
#BSUB -J mySTARjob # Job name
#BSUB -P acc_PLK2 # allocation account
#BSUB -q premium # queue
#BSUB -n 8 # number of compute cores
#BSUB -W 12:00 # walltime in HH:MM
#BSUB -R rusage[mem=4000] # 32 GB of memory (4 GB per core)
#BSUB -R span[hosts=1] # all cores from the same node
#BSUB -o %J.stdout # output log (%J : JobID)
#BSUB -eo %J.stderr # error log
#BSUB -L /bin/bash # Initialize the execution environment
module load star # load star module
WRKDIR=/sc/arion/projects/hpcstaff/benchmark_star
STAR --genomeDir $WRKDIR/star-genome --readFilesIn Experiment1.fastq --runThreadN 8 --outFileNamePrefix Experiment1Star
MPI Parallel Jobs
An MPI application launches multiple copies of its executable (MPI tasks) over multiple nodes that are separated but can communicate with each other across the network. The following example requests 48 cores and 2 hours in the “premium” queue. Those 48 cores are distributed over 6 nodes (8 cores per node) and a total of 192GB of memory will be allocated for this job with 32GB on each node.
#!/bin/bash
#BSUB -J myMPIjob # Job name
#BSUB -P acc_hpcstaff # allocation account
#BSUB -q premium # queue
#BSUB -n 48 # total number of compute cores
#BSUB -R span[ptile=8] # 8 cores per node
#BSUB -R rusage[mem=4000] # 192 GB of memory (4 GB per core)
#BSUB -W 02:00 # walltime in HH:MM
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash
module load selfsched
mpirun -np 48 selfsched < test.inp
Please refer to LSF Queues and Policies for further information about available queues and computing resource allocation. Please refer to the GPU section for further information about options relevant to GPU job submissions.
Array Jobs
Sometimes it is necessary to run a group of jobs that share the same computational requirements but use different input files. Job arrays can be used to handle this type of embarrassingly parallel workload. They can be submitted, controlled, and monitored as a single unit or as individual jobs or groups of jobs. Each job submitted from a job array shares the same job ID as the job array and is uniquely referenced using an array index.
To create a job array, add an index range to the job name specification:
#BSUB -J MyArrayJob[1-100]
In this way you ask for 100 jobs, numbered from 1 to 100. Each job in the job array can be identified by its index, which is accessible through the environment variable "LSB_JOBINDEX". LSF provides the runtime variables %I and %J, which correspond to the job array index and the job ID, respectively. They can be used in the #BSUB option specifications to diversify the jobs. In your commands, however, you have to use the environment variables LSB_JOBINDEX and LSB_JOBID.
#!/bin/bash
#BSUB -P acc_hpcstaff
#BSUB -n 1
#BSUB -W 02:00
#BSUB -q express
#BSUB -J “jobarraytest[1-10]”
#BSUB -o out.%J.%I
#BSUB -e err.%J.%I
echo "Working on file.$LSB_JOBINDEX"
A total of 10 jobs, jobarraytest[1] to jobarraytest[10], will be created and submitted to the queue simultaneously by this script. The output and error files will be written to out.JobID.1 ~ out.JobID.10 and err.JobID.1 ~ err.JobID.10, respectively.
Note: We also have an in-house tool, the self-scheduler, that can submit large numbers of independent serial jobs as a single batch; each job should be short (less than ca. 10 min.) and does not have to be indexed by number. See the Multiple Serial Jobs section below for details.
Interactive Jobs
Interactive batch jobs give users interactive access to compute resources. A common use for interactive batch jobs is debugging and testing. Running a batch-interactive job is done by using the -I option with bsub.
Here is an example command creating an interactive shell on compute nodes with internet access:
bsub -P AllocationAccount -q interactive -n 8 -W 15 -R span[hosts=1] -XF -Is /bin/bash
This command allocates a total of 8 cores on one of the interactive compute nodes, which are reserved exclusively for jobs running in interactive mode. All cores are on the same node. The -XF option enables X11 forwarding so that graphical applications can open on your screen. Once the interactive shell has been started, the user can execute jobs interactively there. In addition to those dedicated nodes, regular compute nodes that have no internet access can also be used to open an interactive session by submitting a batch-interactive job to the non-interactive queues such as "premium", "express", and "gpu" using the bsub -I, -Is, and -Ip options.
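For example, a batch-interactive shell on a regular compute node (no internet access) can be started through one of those queues; the allocation account below is a placeholder:
$ bsub -P acc_MyProject -q premium -n 1 -W 01:00 -Is /bin/bash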
Useful LSF Commands
bjobs Show job status
Check the status of your own jobs in the queue. To display detailed information about a job in a multi-line format, use the bjobs command with the "-l" option flag: bjobs -l JobID. If a job has already completed, the bjobs command won't show information about it. At that point, one must invoke the bhist command to retrieve the job's record from the LSF database.
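For example, with a hypothetical job ID:
$ bjobs -l 123456     # detailed information about a pending or running job
$ bhist -l 123456     # record of a job that has already finished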
bkill Cancel a batch job
A job can be removed from the queue, or killed if it is already running, using the "bkill JobID" command. There are several ways to terminate a specific set of your jobs.
- To terminate a job by its job name:
bkill -J myjob_1
- To terminate a bunch of jobs using a wildcard:
bkill -J myjob_*
- To kill some selected array jobs:
bkill JobID[1,7,10-25]
- To kill all your jobs:
bkill 0
bmod Modify the resource requirement of a pending job
Many of the batch job specifications can be modified after a batch job is submitted and before it runs. Typical fields that can be modified include the job size (number of cores), queue, and wall clock limit. Job specifications cannot be modified by the user once the job enters the RUN state. The bmod command is used to modify a job's specifications. For example:
bmod -q express JobID changes the job's queue to the express queue.
bmod -R rusage[mem=20000] JobID changes the job's memory requirement to 20 GB per core. Note that -R replaces ALL -R fields, not just the one you specify.
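Because -R replaces all of the job's resource requirement strings, re-specify every -R field you still need when changing one of them. For example, to raise the memory while keeping all cores on one node (with a hypothetical job ID):
$ bmod -R "rusage[mem=20000] span[hosts=1]" 123456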
bpeek Displays the stdout and stderr output of an unfinished job.
bqueues Displays information about queues.
bhosts Displays hosts and their static and dynamic resources.
For the full list of LSF commands, see IBM Spectrum LSF Command Reference.
Pending Reasons
There could be many reasons for a job to stay in the queue longer than usual: the whole cluster may be busy; your job may request a large amount of compute resources (such as memory or compute cores) that no compute node can satisfy at the moment; or your job would overlap with a scheduled PM. To see the pending reason, use the bjobs command with the "-l" flag:
$ bjobs -l
LSF Queues And Policies
The command to check the queues is "bqueues". To get more details about a specific queue, type "bqueues -l <queue_name>", e.g., bqueues -l premium
Current Minerva Queues:
* The default memory for all queues is set to 3000 MB.
Queue | Description | Max Walltime |
premium | Normal submission queue | 144 hrs |
express | Rapid turnaround jobs | 12 hrs |
interactive | Jobs running in interactive mode | 12 hrs |
long | Jobs requiring extended runtime | 336 hrs |
gpu | Jobs requiring gpu resources | 144 hrs |
gpuexpress | Short jobs requiring gpu resources | 15 hrs |
private | Jobs using dedicated resources | Unlimited |
others | Any other queues are for testing by the Scientific Computing group | N/A |
Policies
LSF is configured to do "absolute priority scheduling (APS)" with backfill. Backfilling allows smaller, shorter jobs to use otherwise idle resources.
To check the priority of the pending jobs, you can run: $ bjobs -u all -aps
In certain special cases, the priority of a job may be manually increased upon request. To request priority change you may contact ISMMS Scientific Computing Support at hpchelp@hpc.mssm.edu. We will need the job ID and reason to submit the request.
Multiple Serial Jobs
How to submit multiple serial jobs over more than a single node:
Sometimes users want to submit large numbers of independent serial jobs as a single batch. Rather than using a script to repeatedly call bsub, a self-scheduling utility (“selfsched”) can be used to have multiple serial jobs bundled and scheduled over more than a single node with one bsub command.
Usage:
In your batch script, load the self-scheduler module and execute it using the MPI wrapper ("mpirun") in parallel mode:
module load selfsched
mpirun selfsched < YourInputForSelfScheduler
where YourInputForSelfScheduler is a file containing serial job commands like,
/my/bin/path/Exec_1 < my_input_parameters_1 > output_1.log
/my/bin/path/Exec_2 < my_input_parameters_2 > output_2.log
/my/bin/path/Exec_3 < my_input_parameters_3 > output_3.log
...
Each line has a 2048-character limit, and TAB characters are not allowed.
Please note that one of the compute cores is used to monitor and schedule the serial jobs over the rest of the cores, so the actual number of cores used for the real computation is (the total number of cores assigned - 1).
A simple utility ("PrepINP") is also provided to facilitate generation of the YourInputForSelfScheduler file. The self-scheduler module has to be loaded first.
Usage:
module load selfsched
PrepINP < templ.txt > YourInputForSelfScheduler
templ.txt contains the input parameters, with the number fields replaced by "#", used to generate the YourInputForSelfScheduler file.
Example 1:
1 10000 2 F ← start, end, stride, fixed field length?
/my/bin/path/Exec_# < my_input_parameters_# > output_#.log
The output will be
/my/bin/path/Exec_1 < my_input_parameters_1 > output_1.log
/my/bin/path/Exec_3 < my_input_parameters_3 > output_3.log
/my/bin/path/Exec_5 < my_input_parameters_5 > output_5.log
...
/my/bin/path/Exec_9999 < my_input_parameters_9999 > output_9999.log
Example 2:
1 10000 1 T ← start, end, stride, fixed field length?
5 ← field length
/my/bin/path/Exec_# < my_input_parameters_# > output_#.log
The output will be
/my/bin/path/Exec_00001 < my_input_parameters_00001 > output_00001.log
/my/bin/path/Exec_00002 < my_input_parameters_00002 > output_00002.log
/my/bin/path/Exec_00003 < my_input_parameters_00003 > output_00003.log
...
/my/bin/path/Exec_10000 < my_input_parameters_10000 > output_10000.log
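Putting the pieces together, a minimal sketch of a self-scheduler batch script might look like the following; the allocation account and file names are placeholders, and with -n 13 one core does the scheduling while the other 12 run the serial commands:
#!/bin/bash
#BSUB -P acc_MyProject
#BSUB -q premium
#BSUB -n 13
#BSUB -W 02:00
#BSUB -o selfsched.%J.stdout
module load selfsched
PrepINP < templ.txt > MyJobList    # generate the list of serial commands
mpirun selfsched < MyJobList       # distribute them over the allocated cores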
Improper Resource Specification
Minerva is a shared resource. Improper specification of job requirements not only wastes resources but also prevents other researchers from running their jobs.
1. If your program is not explicitly written to use more than one core, specifying more than one core wastes cores that could be used by other users. E.g.,
IF
bsub -n 6 < single_core_program
THEN
- 6 cores are allocated
- the program runs on 1 core
- 5 cores sit idle
2. If your program can use more than one core but does not use a launcher such as mpirun, mpiexec, or torchrun, it is a shared-memory multiprocessing (SMP) program and all of its cores must be on the same node. You must specify -R span[hosts=1] to ensure all cores are on the same node.
E.g.:
IF
bsub -n 6 < SMP_program
THEN
- 6 cores are allocated: 1 on node1, 2 on node2, 3 on node3
- 1 core on node1 is used to run the program
- the other 5 cores sit idle
or, perhaps,
IF
bsub -n 6 < SMP_program
THEN
- 6 cores are allocated: 3 on node1, 1 on node2, 2 on node3
- 3 cores on node1 are used to run the program
- the other 3 cores sit idle
But with -R span[hosts=1]:
IF
bsub -n 6 -R span[hosts=1] < SMP_program
THEN
- 6 cores are allocated: 6 on node1
- all 6 cores on node1 are used to run the program
3. Memory specified by -R rusage[mem=xxx] is reserved for your use and cannot be used by anyone else until it is released at the end of the job. If most or all of the memory on a node is reserved, no additional jobs can run there even if there are still CPUs available.
Example:
IF
- a 192GB node has 3 jobs dispatched to it
- each job requests 1 core and 64GB of memory (-n 1 -R rusage[mem=64G])
- each is actually using only 1GB
THEN
- all memory on the node is reserved, so no other job can be dispatched to it
- only 3GB of the memory is being used
- 189GB of memory and 43 cores are idle and cannot be used
4. Check to see how much memory your program is actually using and request accordingly. If you are running a series of jobs using the same program, run one or two test jobs with a large amount of memory and then adjust for the production runs.
Test:
bsub -R rusage[mem=6G] other options < testJob.lsf
Look at output:
Successfully completed.
Resource usage summary:
CPU time : 1.82 sec.
Max Memory : 8 MB
Average Memory : 6.75 MB
Total Requested Memory : 6144.00 MB
Delta Memory : 6136.00 MB
Max Swap : –
Max Processes : 3
Max Threads : 3
Run time : 2 sec.
Turnaround time : 50 sec.
The test run shows this program only used 8 MB at maximum. Memory usage varies between runs depending on what else is running on the node, so reduce the requested amount of memory but give yourself some "wiggle room":
bsub -R rusage[mem=15M] other options < productionJob.lsf
Successfully completed.
Resource usage summary:
CPU time : 2.93 sec.
Max Memory : 5 MB
Average Memory : 4.00 MB
Total Requested Memory : 15.00 MB
Delta Memory : 10.00 MB
Max Swap : –
Max Processes : 3
Max Threads : 3
Run time : 4 sec.
Turnaround time : 47 sec.
Minerva Training Session
On April 8, 2022, the Scientific Computing and Data team hosted a training class on the LSF job scheduler. Click here for the training session PowerPoint slides.
Click here to see more about Minerva Training Classes.