Harnessing GPUs for Scientific Computing: Concepts and Usage Etiquette
In modern computing, GPUs (Graphics Processing Units) have evolved far beyond their initial role as accelerators for graphics tasks such as 3D rendering and image processing. This evolution has led to the emergence of GPGPU (General-Purpose computing on Graphics Processing Units), a paradigm in which GPUs take on a broad range of computational tasks traditionally handled by CPUs.
The NVIDIA CUDA toolkit includes everything needed to develop GPU-accelerated applications: GPU-accelerated libraries, a compiler, development tools, and the CUDA runtime. NVCC (NVIDIA CUDA Compiler), the compiler driver, acts as a high-level interface that simplifies the CUDA compilation process. It separates CUDA source into host code (for the CPU) and device code (for the GPU), sending the host code to a standard C++ host compiler (such as GCC or Clang) and the device code to the CUDA backend compiler. NVCC also creates fat binaries, which bundle the compiled device code for different GPU architectures into the host executable, and it manages the linking of the compiled host code, the device code, and the necessary CUDA runtime libraries.
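For instance, a compile of this kind might look like the following sketch; my_app.cu is a placeholder source file, and the -gencode pairs shown target V100 (sm_70) and A100 (sm_80) cards and should be adjusted to the GPUs you intend to run on:

nvcc -O2 -o my_app my_app.cu \
    -gencode arch=compute_70,code=sm_70 \
    -gencode arch=compute_80,code=sm_80    # fat binary containing device code for both architectures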
Minerva has a total of 75 GPU nodes. Please check below for more information.
- 12 Intel nodes, each with 32 cores, 384 GiB RAM, and 4 NVIDIA V100 GPU cards with 16 GB memory per card
- 8 Intel nodes, each with 48 cores, 384 GiB RAM, and 4 NVIDIA A100 GPU cards with 40 GB memory per card
- 2 Intel nodes, each with 64 cores, 2 TiB RAM, and 4 NVIDIA A100 GPU cards (80 GB memory per card) with NVLink connectivity
- 49 nodes in total: 47 nodes with Intel Xeon Platinum 8568Y+ processors (96 cores, 1.5 TB RAM per node) and 4 NVIDIA H100 GPUs (80 GB each, SXM5 with NVLink), plus 2 nodes with Intel Xeon Platinum 8358 processors (32 cores, 512 GB RAM per node) and 4 NVIDIA H100 GPUs (80 GB each, PCIe). The 2 PCIe nodes also feature 3.84 TB NVMe SSD storage (3.5 TB usable)
- 4 nodes with AMD Genoa 9334 processors (64 cores at 2.7 GHz, 1.5 TB RAM per node), each equipped with 8 NVIDIA L40S GPUs, for 32 L40S GPUs in total
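Depending on the LSF version installed, the GPU inventory and current per-host GPU status can often be checked from the command line; treat the commands below as examples to verify on the cluster rather than guaranteed options:

lsload -gpu     # per-host GPU load, where supported
bhosts -gpu     # per-host GPU allocation status, where supported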
Requesting GPU Resources
The GPU nodes must be accessed by way of the LSF job queue. There are GPU nodes on the interactive, gpu, and gpuexpress queues.
To request GPU resources, use the -gpu option of the bsub command. At a minimum, the number of GPU cards requested per node must be specified using the num= sub-option. The requested GPU model should be specified with an LSF resource request via the -R option.
bsub -gpu num=2 -R a100
Full details of the -gpu option can be found in the LSF documentation.
By default, GPU resource requests are made in exclusive_process mode.
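If several processes genuinely need to share a card, standard LSF syntax allows the mode to be overridden with the mode= sub-option; the line below is a sketch to check against the LSF documentation rather than a site recommendation:

bsub -gpu "num=1:mode=shared" -R a100 ...    # allow more than one process on the allocated GPU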
The available GPU models and corresponding resource specifications are:
-R | GPU MODEL |
---|---|
v100 | TeslaV100_PCIE_16GB |
a100 | NVIDIAA100_PCIE_40GB |
a10080g | NVIDIAA100_SXM4_80GB |
h10080g | NVIDIAH100_PCIE_80GB |
h100nvl | NVIDIAH100_SXM5_80GB |
l40s | NVIDIAL40S_PCIE_48GB |
The -gpu gmodel= option does not always work; the recommended way to specify the GPU model is via the -R option.
Also note that CUDA 11.8 or higher is needed to utilize the H100 GPU devices.
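For example, a job-script fragment targeting the H100 NVL nodes might combine both points as follows; load whichever CUDA module of version 11.8 or newer appears in ml avail cuda:

#BSUB -q gpu
#BSUB -R h100nvl
#BSUB -gpu num=1
ml cuda      # load a CUDA module that is version 11.8 or newer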
Note that GPU resource allocation is per node. Suppose your LSF options call for multiple CPUs (e.g., bsub -n 4 …). If LSF cannot satisfy all 4 CPUs on one node, it may split the job across multiple nodes (e.g., 2 CPUs on Node A and 2 on Node B). If you also request GPUs with -gpu "num=1", then 1 GPU per node is allocated, so in this example with 2 nodes, 2 GPUs are reserved in total (1 per node). If your application cannot run in a multi-node setup (e.g., it is not MPI-aware or cannot coordinate GPU use across nodes), the extra GPU on the second node is wasted. The solution is to use -R "span[hosts=1]" to force all resources onto one node and avoid the waste.
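Putting this together, a single-node request might look like the sketch below, where acc_xxx is your project account and my_program stands in for whatever you actually run:

bsub -P acc_xxx -q gpu -n 4 -R "span[hosts=1]" -R a100 -gpu num=1 -W 02:00 -oo run.out my_program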
If your program needs to know which GPU cards have been allocated to your job (not common), LSF sets the CUDA_VISIBLE_DEVICES environment variable to the list of cards assigned to your job.
Supplemental Software
One will almost certainly need auxiliary software to utilize the GPUs: most likely the CUDA libraries from NVIDIA, and perhaps the cuDNN (CUDA Deep Neural Network) library. There are several versions of each on Minerva. Use:
ml avail cuda
and/or
ml avail cudnn
to determine which versions are available for loading.
For developers, there are a number of CUDA-accelerated libraries available for download from NVIDIA.
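As a sketch, after loading a cuda module, a program that calls one of these libraries (cuBLAS in this hypothetical example, with my_app.cu as a placeholder source file) could be built as follows:

ml cuda
nvcc -O2 -o my_app my_app.cu -lcublas    # link against the cuBLAS library shipped with the toolkit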
Interactive Submission
Minerva sets aside a number of GPU-enabled nodes to be accessed via the interactive queue. This number is changed periodically based on demand but is always small, e.g., 1 or 2. The number and type of GPU will be posted in the announcement section of the home page.
To open an interactive session on one of these nodes:
bsub -P acc_xxx -q interactive -n 1 -R v100 -gpu num=1 -W 01:00 -Is /bin/bash
Alternatively, one can open an interactive session on one of the batch GPU nodes. This is particularly useful if the interactive nodes do not have the GPU model you would like to use:
bsub -P acc_xxx -q gpu -n 1 -R a100 -gpu num=2 -W 01:00 -Is /bin/bash
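Once the session starts, it can be helpful to confirm which cards are visible before starting work, for example with:

nvidia-smi    # lists the GPU cards visible to your session and their current utilization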
Batch Submission
Batch submission is a straightforward specification of the GPU related bsub options in your LSF script.
bsub < test.lsf
Where test.lsf is something like:
#BSUB -q gpu
#BSUB -R a100
#BSUB -gpu num=4
#BSUB -n 1
#BSUB -W 4
#BSUB -P acc_xxx
#BSUB -oo test.out

ml cuda
ml cudnn

echo "salve mundi"
Accessing the Local SSD on the A100, A100-80GB, H100-80GB, H100NVL, and L40S GPU Nodes
Local SSD storage plays a critical role in minimizing data bottlenecks and fully leveraging GPU capabilities. Fast local SSDs, especially NVMe drives, provide high-throughput data access between storage and GPU memory, which is essential for workloads involving large datasets or frequent read/write operations. The choice between SATA and NVMe SSDs directly impacts training speed, I/O performance, and overall GPU efficiency in GPGPU systems. Local SSD storage is available on the A100, A100-80GB, H100-80GB, H100NVL, and L40S GPU nodes, as shown below. 1 TB of local HDD storage is available on the V100 GPU nodes.
GPU Model | Local SSD Storage |
---|---|
a100 | 1.8 TB SATA SSD |
a10080g | 7.0 TB NVMe PCIe SSD |
h10080g | 3.84 TB NVMe PCIe SSD |
h100nvl | 3.84 TB NVMe PCIe SSD |
l40s | 3.84 TB NVMe PCIe SSD |
To take advantage of the local SSD, request the resource using the rusage specification, for example:
-R "rusage[ssd_gb=1000]"
This example will allocate 1,000 GB of dedicated SSD space to your job.
We advise keeping your ssd_gb request at or below 1,500 (1.5 TB).
The soft link /ssd points to the SSD storage. You can specify /ssd in your job script and direct your temporary files there. At the end of your job script, please remember to clean up your temporary files.
A foolproof way of doing this is to use LSF’s options for executing pre_execution commands and post_execution commands. To do this, add the following to your bsub options:
#BSUB -E "mkdir /ssd/$LSB_JOBID"     # create a scratch folder before execution begins
#BSUB -Ep "rm -rf /ssd/$LSB_JOBID"   # remove the folder after the job completes
Inside your LSF script, use /ssd/$LSB_JOBID as the folder in which to create and use temporary files.
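A minimal sketch of a script that ties these pieces together is shown below; my_program and its --scratch option are placeholders for your own application, and the ssd_gb and walltime values are only examples:

#BSUB -q gpu
#BSUB -R a100
#BSUB -R rusage[ssd_gb=500]
#BSUB -gpu num=1
#BSUB -n 1
#BSUB -W 2:00
#BSUB -P acc_xxx
#BSUB -oo ssd_test.out
#BSUB -E "mkdir /ssd/$LSB_JOBID"
#BSUB -Ep "rm -rf /ssd/$LSB_JOBID"

export TMPDIR=/ssd/$LSB_JOBID      # point temporary files at the local SSD
my_program --scratch "$TMPDIR"     # hypothetical program and option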
Monitoring GPU Usage
The LSF queuing system on Minerva is configured to gather GPU resource usage using the NVIDIA Data Center GPU Manager (DCGM). This allows users to view the GPU usage of their finished jobs using bjobs -l -gpu if the job finished within the last 30 minutes, or bhist -l -gpu otherwise.
The following is a sample of the report (SM = streaming multiprocessor, i.e., the part of the GPU card that runs CUDA):
bjobs -l -gpu 69703953
HOST: lg07c03; CPU_TIME: 8 seconds # Indicates the job ran on host lg07c03 and the total CPU time consumed by the job was 8 seconds.
GPU ID: 2 # This section provides information specific to each GPU used by the job.
Total Execution Time: 9 seconds # This indicates the duration the GPU was actively engaged in processing tasks for this specific job
Energy Consumed: 405 Joules # This represents the energy consumed by this GPU during its execution time.
SM Utilization (%): Avg 16, Max 100, Min 0 # SM (Streaming Multiprocessor) utilization reflects the percentage of time the Streaming Multiprocessors (which execute your GPU threads) are executing a task. The average utilization is 16%, suggesting that the GPUs were not constantly running at full capacity during the entire execution time. The maximum utilization reaching 100% indicates that at some points, the GPUs were fully engaged. The minimum utilization of 0% signifies periods when the GPUs were idle.
Memory Utilization (%): Avg 0, Max 3, Min 0 # GPU Memory Utilization indicates how much of the GPU’s dedicated memory is currently in use
Max GPU Memory Used: 38183895040 bytes # This represents the peak memory usage on each GPU during the job’s execution.
GPU ID: 1
Total Execution Time: 9 seconds
Energy Consumed: 415 Joules
SM Utilization (%): Avg 20, Max 100, Min 0
Memory Utilization (%): Avg 0, Max 3, Min 0
Max GPU Memory Used: 38183895040 bytes
GPU Energy Consumed: 820.000000 Joules # This represents the total energy consumed by both GPUs combined (405 J + 415 J = 820 J).
GPU Etiquette
The GPU nodes on Minerva are in high demand. To optimize resource allocation, it’s critical to specify job requirements accurately in LSF. Incorrect resource requests can lead to unnecessary reservation of GPU resources, making them unavailable for other researchers.
The bsub options that need particular attention are:
- -n : the number of slots assigned to your job.
- -R rusage[mem=] : the amount of memory assigned to each slot.
- -gpu num= : the number of GPU cards per node assigned to your job.
- -R span[] : the arrangement of the resources assigned to your job across Minerva nodes.
-n
This option specifies how many job slots you want for your job. View this value as the number of packets of resources you want allocated to your job. By default, each slot you ask for will have 1 CPU and 3 GB of memory. In most cases, job slots and CPUs are synonymous. Note that if your program is not explicitly coded to be parallelized, adding more cores will not improve performance.
-R rusage[mem=]
This option specifies how much memory is to be allocated per slot. The default is 3 GB. Note that the request is per slot, not per job: the total amount of memory requested will be this value times the number of slots you have requested with the -n option.
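For example (the numbers are arbitrary), the following request reserves 40 GB in total, 10 GB for each of 4 slots:

bsub -n 4 -R rusage[mem=10G] ...    # 4 slots x 10 GB = 40 GB total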
-gpu num=
This option specifies how many GPU cards you want allocated on each compute node (not each slot) assigned to your job. The GPU cards/devices are numbered 0 through 3; on the L40S nodes, the devices are numbered 0 through 7.
Because your program needs to connect to a specific GPU device, LSF will set an environment variable, CUDA_VISIBLE_DEVICES, to the list of GPU devices assigned to your job, e.g., CUDA_VISIBLE_DEVICES=0,3. Do not change these values manually and, if writing your own code, you must honor the assignment. The installed software on Minerva that uses GPUs honors these assignments.
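As a quick, purely illustrative check inside a job script, print the assignment rather than hard-coding device IDs:

echo "GPUs assigned by LSF: $CUDA_VISIBLE_DEVICES"    # e.g., 0,3; CUDA applications will only see these devices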
-R span[]
This option tells LSF how to lay out your slots/CPUs on the compute nodes of Minerva. By default, slots/CPUs can be, and usually are, spread out over several nodes.
There are two ways this option is generally specified: ptile=x, to specify how many slots are placed on each node, or hosts=1, to specify that all slots are to be allocated on a single node (1 is the only value accepted).
Example A:
-R span[ptile=value], where value is the number of slots/cores to be placed on each node.
e.g.
#BSUB -n 6                 # allocate 6 cores/slots in total
#BSUB -R rusage[mem=5G]    # allocate 5 GB per core/slot
#BSUB -R span[ptile=4]     # place 4 cores/slots per node
#BSUB -gpu num=2           # allocate 2 GPU cards per node
#BSUB -R a100
As a result, two nodes will be allocated:
• Node 1: 4 CPU cores, 20 GB memory (4 × 5 GB), 2 A100 GPUs
• Node 2: 2 CPU cores, 10 GB memory (2 × 5 GB), 2 A100 GPUs
Example B:
-R span[hosts=1], which allocates all cores/slots to one host.
e.g.
#BSUB -n 6
#BSUB -R rusage[mem=5G]
#BSUB -R span[hosts=1]
#BSUB -gpu num=2
#BSUB -R a100
One node will be allocated with the following resources:
• 6 CPU cores
• 30 GB memory (6 × 5 GB)
• 2 A100 GPUs
If you do not specify a span requirement, LSF will make its own placement decisions, which may result in allocating only one slot per node. In the example above, this could lead to your job occupying 12 GPU cards in total: 2 on each of 6 different nodes.
If your program is not designed for distributed execution (e.g., if it’s a Shared Memory Parallel (SMP) application—which most parallel programs in our environment are), it will only utilize resources on the first node assigned. The additional nodes (and their GPUs) will remain idle and unavailable to other users, potentially leading to significant delays in the job queue.
Some signs that your program cannot run in a distributed fashion include:
- The documentation does not say that it can.
- The mpirun command is not used to start the program.
- The program has options such as -ncpu or -threads.
In such cases, all required resources must be located on a single node, and you should include the following in your job script:
-R span[hosts=1]
This ensures that all allocated CPUs, memory, and GPUs are placed on the same node.
This is especially important when GPUs are involved. GPU devices are valuable and limited in number. By default, LSF will allocate the number of GPUs you request (via the -gpu num= option) on every node assigned to your job—even if your application only runs on one of them.
If your program is single-threaded or an SMP (Shared Memory Parallel) application, and your job spans multiple nodes, only the first node will be used. GPUs on all other nodes will remain allocated but idle, potentially blocking other jobs from accessing those resources and causing unnecessary scheduling delays.