General Purpose Graphics Processor Unit (GPGPU or just GPU) and GPU Etiquette

NVIDIA has made the computing engine of their graphics processing units (GPUs) accessible to programmers. This engine, the Compute Unified Device Architecture (CUDA for short), is a parallel computing architecture developed by NVIDIA and used to render images for display on a computer screen. These devices can be programmed using a variant of C or C++ and the NVIDIA compiler. While the CUDA engine is programmer accessible in virtually all of NVIDIA's newer graphics cards, NVIDIA also makes specially purposed devices used exclusively for computation, called GPGPUs or, more commonly, just GPUs.

Minerva has a total of 20 nodes configured with 4 NVIDIA GPU cards each:

  • 12 Intel nodes each with 32 cores, 384GiB RAM, and 4 NVIDIA V100 GPU cards with 16GB memory on each card
  • 8 Intel nodes each with 48 cores, 384GiB RAM, and 4 NVIDIA A100 GPU cards with 40GB memory on each card
  • 2 Intel nodes each with 64 cores, 2TiB RAM, and 4 NVIDIA A100 GPU cards with 80GB memory on each card, each with NVLINK connectivity

 

Accessing the GPU Nodes

The GPU nodes must be accessed by way of a queued job. GPU nodes are available on both the interactive and gpu queues, so your job must specify either:

-q interactive
or
-q gpu

To specify which type of GPU you want:

-R v100
or
-R a100

If the job is submitted to the interactive queue, make sure you specify the model GPU that is in that queue. Otherwise, your job will stall in the queue.

In addition, the number of GPU cards on each node to be allocated to your job must be specified using the LSF rusage specification, e.g.:

-R rusage[ngpus_excl_p=1] # For 1 GPU card per allocated node

-R rusage[ngpus_excl_p=3] # For 3 GPU cards per allocated node

Note that if your LSF options call for multiple CPUs (e.g., -n 4) and LSF allocates these CPUs across more than one node, the number of GPU cards you specified in the ngpus_excl_p parameter will be allocated to your job on each of those nodes. If your job cannot distribute its workload over multiple nodes, be sure to specify the option: -R span[hosts=1].

If your program needs to know which GPU cards have been allocated to your job (not common), LSF sets the CUDA_VISIBLE_DEVICES environment variable to specify which cards have been assigned to your job.

 

Supplemental Software

One will almost certainly need auxiliary software to utilize the GPUs: most likely the CUDA libraries from NVIDIA and perhaps the cuDNN libraries. There are several versions of each on Minerva. Use:

ml avail cuda
and/or
ml avail cudnn

to determine which versions are available for loading.

For developers, there are a number of CUDA accelerated libraries available for download from NVIDIA.

 

Interactive Submission

Minerva sets aside a number of GPU-enabled nodes to be accessed via the interactive queue. This number is changed periodically based on demand but is always small, e.g., 1 or 2. The number and type of GPU will be posted in the announcement section of the home page.

To open an interactive session on one of these nodes:

bsub -P acc_xxx -q interactive -n 1 -R v100 -R rusage[ngpus_excl_p=1] -W 01:00 -Is /bin/bash

Alternatively, one can open an interactive session on one of the batch GPU nodes. This is particularly useful if the interactive nodes do not have the model GPU you would like to use:

bsub -P acc_xxx -q gpu -n 1 -R a100 -R rusage[ngpus_excl_p=1] -W 01:00 -Is /bin/bash

 

Batch Submission

Batch submission is a straightforward specification of the GPU related bsub options in your LSF script.

bsub < test.lsf

Where test.lsf is something like:

#BSUB -q gpu
#BSUB -R a100
#BSUB -R rusage[ngpus_excl_p=1]
#BSUB -n 1
#BSUB -W 4
#BSUB -P acc_xxx
#BSUB -oo test.out

ml cuda
ml cudnn

echo "salve mundi"

 

Accessing the Local SSD on the a100 GPU Nodes

To take advantage of the local 1.8 TB SSD, request the resource using the rusage specification, for example:

-R "rusage[ssd_gb=1000]"

This example will allocate 1000GB of dedicated SSD space to your job.

We advise you to keep your ssd_gb request <= 1500 (1.5 TB).

The symlink /ssd points to the SSD storage. You can specify /ssd in your job script and direct your temporary files there. At the end of your job script, please remember to clean up your temporary files. A foolproof way of doing this is to use LSF's options for executing pre-execution and post-execution commands. To do this, add the following to your bsub options:

#BSUB -E "mkdir /ssd/$LSB_JOBID" # Create a folder before beginning execution
#BSUB -Ep "rm -rf /ssd/$LSB_JOBID" # Remove the folder after the job completes

Inside your LSF script, use /ssd/$LSB_JOBID as the folder in which to create and use temp files.
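As a complement to the pre- and post-execution hooks, cleanup can also be done inside the job script itself. The following is a minimal sketch of that pattern; the /tmp fallback for /ssd and the $$ fallback for the LSF job ID are only there so the snippet can be exercised off-cluster, and are assumptions rather than Minerva specifics.

```shell
# Sketch: a scratch directory on the node-local SSD with guaranteed cleanup.
# On Minerva the root would be /ssd; /tmp is a stand-in so this runs anywhere.
SSD_ROOT="${SSD_ROOT:-/tmp}"
SCRATCH="$SSD_ROOT/scratch-${LSB_JOBID:-$$}"   # LSB_JOBID is set by LSF inside a job

mkdir -p "$SCRATCH"
# Remove the scratch directory even if the script exits early.
trap 'rm -rf "$SCRATCH"' EXIT

# ... write temporary files under "$SCRATCH" here ...
echo "using scratch dir: $SCRATCH"
```

The trap ensures the directory is removed on any normal or error exit of the script, though the LSF post-execution hook remains the safer net if the job is killed outright.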

 

Monitoring GPU Usage

The LSF queuing system on Minerva is configured to gather GPU resource usage via the NVIDIA Data Center GPU Manager (DCGM). This allows users to view the GPU usage of their finished jobs using

bjobs -l -gpu 

if the job finished within the last 30 minutes or

bhist -l -gpu 

otherwise.

The following is a sample of the report (SM = streaming multiprocessor, i.e., the part of the card that runs CUDA code):

bjobs -l -gpu 69703953
HOST: lg07c03; CPU_TIME: 8 seconds
GPU ID: 2
Total Execution Time: 9 seconds
Energy Consumed: 405 Joules
SM Utilization (%): Avg 16, Max 100, Min 0
Memory Utilization (%): Avg 0, Max 3, Min 0
Max GPU Memory Used: 38183895040 bytes

GPU ID: 1
Total Execution Time: 9 seconds
Energy Consumed: 415 Joules
SM Utilization (%): Avg 20, Max 100, Min 0
Memory Utilization (%): Avg 0, Max 3, Min 0
Max GPU Memory Used: 38183895040 bytes

GPU Energy Consumed: 820.000000 Joules

 

Further Information:

CUDA programming
Available applications

 

GPU Etiquette

The GPU nodes in Minerva can be in great demand. Because of the manner in which LSF allocates GPU resources to a job, it is important to specify the job requirements carefully. Incorrect LSF resource requests can cause resources to be unnecessarily reserved and, hence, unavailable to other researchers.

The bsub options that need particular attention are:

-n The number of slots assigned to your job.
-R rusage[mem=] The amount of memory assigned to each slot.
-R rusage[ngpus_excl_p=] The number of GPU cards per node assigned to your job.
-R span[] The arrangement of the resources assigned to your job across Minerva nodes.

-n
This option specifies how many job slots you want for your job. View this value as how many packets of resources you want allocated to your job. By default, each slot you ask for will have 1 CPU and 3GB of memory. In most cases, job slots and CPUs are synonymous. Note that if your program is not parallelized, adding more cores will not improve performance.

-R rusage[mem=]
This option specifies how much memory is to be allocated per slot. The default is 3GB. Note that the request is per slot, not per job: the total amount of memory requested will be this value times the number of slots requested with the -n option.
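To make the arithmetic concrete, here is a tiny sketch with hypothetical request values (not a real job submission):

```shell
# Hypothetical request: total memory = per-slot memory * number of slots.
SLOTS=4        # value given to -n 4
MEM_PER_SLOT=5 # GB, value given to -R rusage[mem=5G]

TOTAL=$((SLOTS * MEM_PER_SLOT))
echo "total memory requested: ${TOTAL}GB"   # 4 slots * 5GB = 20GB
```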

-R rusage[ngpus_excl_p=]
This option specifies how many GPU cards you want allocated on each compute node (not per slot) allocated to your job. The GPU cards/devices are numbered 0 through 3.

Because your program needs to connect to a specific GPU device, LSF sets the environment variable CUDA_VISIBLE_DEVICES to the list of GPU devices assigned to your job, e.g., CUDA_VISIBLE_DEVICES=0,3. Do not change these values manually and, if writing your own code, you must honor the assignment. The installed software on Minerva that uses GPUs honors these assignments.
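A job script can inspect (but should never modify) this assignment. The sketch below parses the comma-separated list into an array; the value 0,3 is a stand-in used for illustration, since inside a real job LSF sets the variable itself:

```shell
# Read-only inspection of the GPUs LSF assigned to this job.
# In a real job you would use "$CUDA_VISIBLE_DEVICES"; here a stand-in value:
DEVICES="0,3"

# Split the comma-separated list into an array and count the devices.
IFS=',' read -r -a GPU_IDS <<< "$DEVICES"
echo "assigned ${#GPU_IDS[@]} GPU(s): ${GPU_IDS[*]}"
```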

-R span[]
This option tells LSF how to lay out your slots/cpus on the compute nodes of Minerva. By default, slots/cpus can be, and usually are, spread out over several nodes.

There are two ways this option is generally specified: ptile=<x>, to specify the tiling, or hosts=<x>, to specify how many nodes to use.

Example A:
-R span[ptile=<value>] where <value> is the number of slots/cores to be placed on each node.
e.g.

#BSUB -n 6 # Allocate 6 cores/slots total
#BSUB -R rusage[mem=5G] #Allocate 5GB per core/slot
#BSUB -R span[ptile=4] # Allocate 4 core/slots per node

Will result in 2 nodes being allocated: one with 4 cpus and 20GB (4*5GB) of memory and a second with 2 cpus and 10GB (2*5GB) of memory.
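The node count follows from ceiling division of slots by the ptile value. A small arithmetic sketch of that layout (hypothetical values matching Example A):

```shell
# Hypothetical layout arithmetic for -n 6 with span[ptile=4]:
# number of nodes = ceiling(slots / ptile).
SLOTS=6
PTILE=4

NODES=$(( (SLOTS + PTILE - 1) / PTILE ))            # ceiling division
LAST_NODE_SLOTS=$(( SLOTS - (NODES - 1) * PTILE ))  # slots left for the final node
echo "nodes: $NODES, slots on last node: $LAST_NODE_SLOTS"
```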

Example B:
-R span[hosts=1] allocates all cores/slots to one host.
e.g.

#BSUB -n 6
#BSUB -R rusage[mem=5G]
#BSUB -R span[hosts=1]

Will result in 1 node being allocated with 6 cpus and 30GB (6*5GB) of memory.

If your program is not a distributed program, or is a Shared Memory Parallel (SMP) program (and most parallel programs in our environment are SMP), only the resources on the first of the nodes assigned to your job will be used. The remainder will be unavailable to other users and may cause significant backlogs in job processing.

Some hints that your program cannot run in a distributed manner are:

  • The documentation does not say that it can.
  • The mpirun command is not used to start the program.
  • The program has options such as:
    • -ncpu
    • -threads

In these cases all resources must be on a single node and you should specify

-R span[hosts=1]

to ensure all resources are placed on a single node.

This is crucial when GPUs are involved. GPU devices are valuable and there are not a large number of them. LSF will allocate the number of GPUs you request with the rusage[ngpus_excl_p=] option on each node that is allocated to your job whether you are using that node or not.

#BSUB -n 6
#BSUB -R rusage[mem=5G]
#BSUB -R rusage[ngpus_excl_p=2]
#BSUB -R span[ptile=4]

Will result in 2 nodes being allocated: one with 4 CPUs, 20GB of memory, and 2 GPUs, and a second with 2 CPUs, 10GB of memory, and 2 GPUs.

If your program is single-threaded or an SMP program, only the first node will be used; the GPUs on the second node will be allocated but unused, probably preventing another job from being dispatched.

Even worse, if you do not use the span option, LSF could distribute one core to each of 6 nodes, each with 2 GPUs reserved. In that case 10 GPUs would be wasted.