General Purpose Graphics Processor Unit (GPGPU or just GPU) and GPU Etiquette

NVIDIA has made the computing engine of their graphics processing units (GPUs) accessible to programmers. This engine, the Compute Unified Device Architecture (CUDA for short), is a parallel computing architecture developed by NVIDIA and used to render images for display on a computer screen. These devices can be programmed using a variant of C or C++ and the NVIDIA compiler. While the CUDA engine is programmer accessible in virtually all of NVIDIA's newer graphics cards, NVIDIA also makes specially purposed devices used exclusively for computation, called GPGPUs or, more commonly, just GPUs.

Minerva has a total of 20 nodes configured with 4 NVIDIA GPU cards each:

  • 12 Intel nodes each with 32 cores, 384GiB RAM, and 4 NVIDIA V100 GPU cards with 16GB memory on each card
  • 8 Intel nodes each with 48 cores, 384GiB RAM, and 4 NVIDIA A100 GPU cards with 40GB memory on each card
  • 2 Intel nodes each with 64 cores, 2TiB RAM, and 4 NVIDIA A100 GPU cards with 80GB memory on each card, each with NVLINK connectivity

 

Accessing the GPU Nodes

The GPU nodes must be accessed by way of a queued job. GPU nodes are available on both the interactive and gpu queues, so your job must specify either:

-q interactive
or
-q gpu

To specify which type of GPU you want:

-R v100
or
-R a100

If the job is submitted to the interactive queue, make sure you specify the model GPU that is in that queue. Otherwise, your job will stall in the queue.

In addition, the number of GPU cards on each node to be allocated to your job must be specified using the LSF rusage specification, e.g.:

-R rusage[ngpus_excl_p=1] # For 1 GPU card per allocated node

-R rusage[ngpus_excl_p=3] # For 3 GPU cards per allocated node

Note that if your LSF options call for multiple CPUs (e.g., -n 4) and LSF allocates these CPUs across more than one node, the number of GPU cards you specified in the ngpus_excl_p parameter will be allocated to your job on each of those nodes. If your job cannot distribute its workload over multiple nodes, be sure to specify the option: -R span[hosts=1].

If your program needs to know which GPU cards have been allocated to your job (not common), LSF sets the CUDA_VISIBLE_DEVICES environment variable to specify which cards have been assigned to your job.

 

Supplemental Software

One will almost certainly need auxiliary software to utilize the GPUs: most likely the CUDA libraries from NVIDIA and perhaps the cuDNN libraries. There are several versions of each on Minerva. Use:

ml avail cuda
and/or
ml avail cudnn

to determine which versions are available for loading.

For developers, there are a number of CUDA accelerated libraries available for download from NVIDIA.

 

Interactive Submission

Minerva sets aside a number of GPU-enabled nodes to be accessed via the interactive queue. This number is changed periodically based on demand but is always small, e.g., 1 or 2. The number and type of GPU will be posted in the announcement section of the home page.

To open an interactive session on one of these nodes:

bsub -P acc_xxx -q interactive -n 1 -R v100 -R rusage[ngpus_excl_p=1] -W 01:00 -Is /bin/bash

Alternatively, one can open an interactive session on one of the batch GPU nodes. This is particularly useful if the interactive nodes do not have the model GPU you would like to use:

bsub -P acc_xxx -q gpu -n 1 -R a100 -R rusage[ngpus_excl_p=1] -W 01:00 -Is /bin/bash

 

Batch Submission

Batch submission is a straightforward specification of the GPU related bsub options in your LSF script.

bsub < test.lsf

Where test.lsf is something like:

#BSUB -q gpu
#BSUB -R a100
#BSUB -R rusage[ngpus_excl_p=1]
#BSUB -n 1
#BSUB -W 4
#BSUB -P acc_xxx
#BSUB -oo test.out

ml cuda
ml cudnn

echo "salve mundi"

 

Accessing the Local SSD on the a100 GPU Nodes

To take advantage of the local 1.8 TB SSD, request the resource using the rusage specification, for example:

-R "rusage[ssd_gb=1000]"

This example will allocate 1000GB of dedicated SSD space to your job.

We advise you to keep your ssd_gb request <= 1500 (1.5 TB).

The symlink /ssd points to the SSD storage. You can specify /ssd in your job script and direct your temporary files there. At the end of your job script, please remember to clean up your temporary files. A foolproof way of doing this is to use LSF's options for executing pre-execution and post-execution commands. To do this, add the following to your bsub options:

#BSUB -E "mkdir /ssd/$LSB_JOBID" # Create a folder before beginning execution
#BSUB -Ep "rm -rf /ssd/$LSB_JOBID" # Remove the folder after the job completes

Inside your LSF script, use /ssd/$LSB_JOBID as the folder in which to create and use temp files.
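As a complement to the pre- and post-execution hooks, cleanup can also be done inside the job script itself. The following is a minimal sketch of that pattern; the /tmp fallback for /ssd and the $$ fallback for the LSF job ID are only there so the snippet can be exercised off-cluster, and are assumptions rather than Minerva specifics.

```shell
# Sketch: a scratch directory on the node-local SSD with guaranteed cleanup.
# On Minerva the root would be /ssd; /tmp is a stand-in so this runs anywhere.
SSD_ROOT="${SSD_ROOT:-/tmp}"
SCRATCH="$SSD_ROOT/scratch-${LSB_JOBID:-$$}"   # LSB_JOBID is set by LSF inside a job

mkdir -p "$SCRATCH"
# Remove the scratch directory even if the script exits early.
trap 'rm -rf "$SCRATCH"' EXIT

# ... write temporary files under "$SCRATCH" here ...
echo "using scratch dir: $SCRATCH"
```

The trap ensures the directory is removed on any normal or error exit of the script, though the LSF post-execution hook remains the safer net if the job is killed outright.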

 

Monitoring GPU Usage

The LSF queuing system on Minerva is configured to gather GPU resource usage via the NVIDIA Data Center GPU Manager (DCGM). This allows users to view the GPU usage of their finished jobs using

bjobs -l -gpu 

if the job finished within the last 30 minutes or

bhist -l -gpu 

otherwise.

The following is a sample of the report (SM = streaming multiprocessor, i.e., the part of the card that runs CUDA code):

bjobs -l -gpu 69703953
HOST: lg07c03; CPU_TIME: 8 seconds
GPU ID: 2
Total Execution Time: 9 seconds
Energy Consumed: 405 Joules
SM Utilization (%): Avg 16, Max 100, Min 0
Memory Utilization (%): Avg 0, Max 3, Min 0
Max GPU Memory Used: 38183895040 bytes

GPU ID: 1
Total Execution Time: 9 seconds
Energy Consumed: 415 Joules
SM Utilization (%): Avg 20, Max 100, Min 0
Memory Utilization (%): Avg 0, Max 3, Min 0
Max GPU Memory Used: 38183895040 bytes

GPU Energy Consumed: 820.000000 Joules

 

Further Information:

CUDA programming
Available applications

 

GPU Etiquette

The GPU nodes in Minerva can be in great demand. Because of the manner in which LSF allocates GPU resources to a job, it is important to specify the job requirements carefully. Incorrect LSF resource requests can cause resources to be unnecessarily reserved and, hence, unavailable to other researchers.

The bsub options that need particular attention are:

-n The number of slots assigned to your job.
-R rusage[mem=] The amount of memory assigned to each slot.
-R rusage[ngpus_excl_p=] The number of GPU cards per node assigned to your job.
-R span[] The arrangement of the resources assigned to your job across Minerva nodes.

-n
This option specifies how many job slots you want for your job. View this value as how many packets of resources you want allocated to your job. By default, each slot you ask for will have 1 CPU and 3GB of memory. In most cases, job slots and CPUs are synonymous. Note that if your program is not parallelized, adding more cores will not improve performance.

-R rusage[mem=]
This option specifies how much memory is to be allocated per slot. The default is 3GB. Note that the request is per slot, not per job: the total amount of memory requested will be this value times the number of slots requested with the -n option.
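To make the arithmetic concrete, here is a tiny sketch with hypothetical request values (not a real job submission):

```shell
# Hypothetical request: total memory = per-slot memory * number of slots.
SLOTS=4        # value given to -n 4
MEM_PER_SLOT=5 # GB, value given to -R rusage[mem=5G]

TOTAL=$((SLOTS * MEM_PER_SLOT))
echo "total memory requested: ${TOTAL}GB"   # 4 slots * 5GB = 20GB
```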

-R rusage[ngpus_excl_p=]
This option specifies how many GPU cards you want allocated on each compute node (not per slot) allocated to your job. The GPU cards/devices are numbered 0 through 3.

Because your program needs to connect to a specific GPU device, LSF sets the environment variable CUDA_VISIBLE_DEVICES to the list of GPU devices assigned to your job, e.g., CUDA_VISIBLE_DEVICES=0,3. Do not change these values manually and, if writing your own code, you must honor the assignment. The installed software on Minerva that uses GPUs honors these assignments.
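A job script can inspect (but should never modify) this assignment. The sketch below parses the comma-separated list into an array; the value 0,3 is a stand-in used for illustration, since inside a real job LSF sets the variable itself:

```shell
# Read-only inspection of the GPUs LSF assigned to this job.
# In a real job you would use "$CUDA_VISIBLE_DEVICES"; here a stand-in value:
DEVICES="0,3"

# Split the comma-separated list into an array and count the devices.
IFS=',' read -r -a GPU_IDS <<< "$DEVICES"
echo "assigned ${#GPU_IDS[@]} GPU(s): ${GPU_IDS[*]}"
```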

-R span[]
This option tells LSF how to lay out your slots/cpus on the compute nodes of Minerva. By default, slots/cpus can be, and usually are, spread out over several nodes.

There are two ways this option is generally specified: ptile=<x>, to specify the tiling, or hosts=<x>, to specify how many nodes to use.

Example A:
-R span[ptile=<value>] where <value> is the number of slots/cores to be placed on each node.
e.g.

#BSUB -n 6 # Allocate 6 cores/slots total
#BSUB -R rusage[mem=5G] #Allocate 5GB per core/slot
#BSUB -R span[ptile=4] # Allocate 4 core/slots per node

Will result in 2 nodes being allocated: one with 4 cpus and 20GB (4*5GB) of memory and a second with 2 cpus and 10GB (2*5GB) of memory.
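The node count follows from ceiling division of slots by the ptile value. A small arithmetic sketch of that layout (hypothetical values matching Example A):

```shell
# Hypothetical layout arithmetic for -n 6 with span[ptile=4]:
# number of nodes = ceiling(slots / ptile).
SLOTS=6
PTILE=4

NODES=$(( (SLOTS + PTILE - 1) / PTILE ))            # ceiling division
LAST_NODE_SLOTS=$(( SLOTS - (NODES - 1) * PTILE ))  # slots left for the final node
echo "nodes: $NODES, slots on last node: $LAST_NODE_SLOTS"
```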

Example B:
-R span[hosts=1] allocates all cores/slots to one host.
e.g.

#BSUB -n 6
#BSUB -R rusage[mem=5G]
#BSUB -R span[hosts=1]

Will result in 1 node being allocated with 6 cpus and 30GB (6*5GB) of memory.

If your program is not a distributed program, or is a Shared Memory Parallel (SMP) program (and most parallel programs in our environment are SMP), only the resources on the first of the nodes assigned to your job will be used. The remainder will be unavailable to other users and may cause significant backlogs in job processing.

Some hints that your program cannot run in a distributed manner are:

  • The documentation does not say that it can.
  • The mpirun command is not used to start the program.
  • The program has options such as:
    • -ncpu
    • -threads

In these cases all resources must be on a single node and you should specify

-R span[hosts=1]

to ensure all resources are placed on a single node.

This is crucial when GPUs are involved. GPU devices are valuable and there are not a large number of them. LSF will allocate the number of GPUs you request with the rusage[ngpus_excl_p=] option on each node that is allocated to your job whether you are using that node or not.

#BSUB -n 6
#BSUB -R rusage[mem=5G]
#BSUB -R rusage[ngpus_excl_p=2]
#BSUB -R span[ptile=4]

Will result in 2 nodes being allocated: one with 4 CPUs, 20GB of memory, and 2 GPUs, and a second with 2 CPUs, 10GB of memory, and 2 GPUs.

If your program is single-threaded or an SMP program, only the first node will be used; the GPUs on the second node will be allocated but unused, probably preventing another job from being dispatched.

Even worse, if you do not use the span option, LSF could distribute one core to each of 6 nodes, each with 2 GPUs reserved. In that case 10 GPUs would be wasted.