General Purpose Graphics Processing Unit (GPGPU)
NVIDIA has made the computing engine of its graphics processing units (GPUs) accessible to programmers. This engine, the Compute Unified Device Architecture (CUDA), is a parallel computing architecture developed by NVIDIA and originally used to render images for display on a computer screen. These devices can be programmed using a variant of C or C++ together with the NVIDIA compiler. While the CUDA engine is programmer-accessible in virtually all of NVIDIA's newer graphics cards, NVIDIA also makes specially purposed devices used exclusively for computation, called GPGPUs or, more commonly, just GPUs.
Minerva has a total of 20 nodes configured with 4 NVIDIA GPU cards each:
- 12 Intel nodes, each with 32 cores, 384 GiB RAM, and 4 NVIDIA V100 GPU cards with 16 GB of memory per card
- 8 Intel nodes, each with 48 cores, 384 GiB RAM, and 4 NVIDIA A100 GPU cards with 40 GB of memory per card
Accessing the GPU Nodes
The GPU nodes must be accessed by way of a queued job. There are GPU nodes on both the interactive and gpu queues, so your job must specify either:
-q interactive
-q gpu
To specify which model of GPU you want, add an LSF resource requirement, e.g.:
-R v100
-R a100
If the job is submitted to the interactive queue, make sure you specify a GPU model that is actually present in that queue. Otherwise, your job will stall in the queue.
In addition, the number of GPU cards on each node to be allocated to your job must be specified using the LSF rusage specification, e.g.:
-R rusage[ngpus_excl_p=1] # For 1 GPU card per allocated node
-R rusage[ngpus_excl_p=3] # For 3 GPU cards per allocated node
Please note that if your LSF options request multiple CPUs (e.g., -n 4) and LSF allocates those CPUs across more than one node, the number of GPU cards specified in the ngpus_excl_p parameter will be allocated to your job on each of those nodes. If your job cannot distribute its workload over multiple nodes, be sure to specify the option: -R span[hosts=1].
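For example, the following submission keeps all cores on a single node so that exactly four GPU cards are allocated in total (the account, wall time, and program name are placeholders):

```shell
# Hypothetical submission: 8 cores constrained to one node via
# span[hosts=1], with 4 V100 cards allocated on that node. Without
# span[hosts=1], cores spread over two nodes would get 4 GPUs on EACH.
bsub -P acc_xxx -q gpu -n 8 -R span[hosts=1] \
     -R v100 -R "rusage[ngpus_excl_p=4]" \
     -W 02:00 -oo mygpu.out ./my_gpu_program
```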
If your program needs to know which GPU cards have been allocated to your job (not common), LSF sets the CUDA_VISIBLE_DEVICES environment variable to specify which cards have been assigned to your job.
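Inside the job, the variable is an ordinary comma-separated list of card indices. The sketch below simulates a two-card allocation so it can run outside LSF; in a real job the variable is already set for you:

```shell
# LSF exports CUDA_VISIBLE_DEVICES on each allocated node as a
# comma-separated list of card indices. The default below only
# simulates a two-card allocation for illustration.
CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,2}
# Count how many cards were assigned to this job.
NGPUS=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l | tr -d ' ')
echo "assigned $NGPUS GPU card(s): $CUDA_VISIBLE_DEVICES"
```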
You will almost certainly need auxiliary software to utilize the GPUs: most likely the CUDA libraries from NVIDIA, and perhaps the cuDNN libraries as well. There are several versions of each on Minerva. Use:
ml avail cuda
ml avail cudnn
to determine which versions are available for loading.
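Once you have picked versions from the ml avail output, load them before building or running your code. The version strings below are illustrative only; use ones that actually appear on Minerva:

```shell
# Load a CUDA toolkit and a cuDNN build (version strings are examples;
# substitute ones listed by `ml avail cuda` and `ml avail cudnn`).
ml cuda/11.8 cudnn/8.6-cuda11
# Confirm what is loaded and that the CUDA compiler is on PATH.
ml list
nvcc --version
```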
For developers, there are a number of CUDA accelerated libraries available for download from NVIDIA.
Minerva sets aside a number of GPU-enabled nodes to be accessed via the interactive queue. This number is changed periodically based on demand but is always small, e.g., one or two. The number and type of GPUs will be posted in the announcement section of the home page.
To open an interactive session on one of these nodes:
bsub -P acc_xxx -q interactive -n 1 -R v100 -R rusage[ngpus_excl_p=1] -W 01:00 -Is /bin/bash
Alternatively, one can open an interactive session on one of the batch GPU nodes. This is particularly useful if the interactive nodes do not have the model GPU you would like to use:
bsub -P acc_xxx -q gpu -n 1 -R a100 -R rusage[ngpus_excl_p=1] -W 01:00 -Is /bin/bash
Batch submission is a straightforward specification of the GPU related bsub options in your LSF script.
bsub < test.lsf
Where test.lsf is something like:
#BSUB -q gpu
#BSUB -R a100
#BSUB -R rusage[ngpus_excl_p=1]
#BSUB -n 1
#BSUB -W 4
#BSUB -P acc_xxx
#BSUB -oo test.out
echo "salve mundi"
Accessing the Local SSD on the A100 GPU Nodes
To take advantage of the local 1.8 TB SSD, request the resource using the rusage specification, for example:
-R rusage[ssd_rg=1000]
This example will allocate 1000 GB of dedicated SSD space to your job. We advise keeping your ssd_rg request at or below 1,500 (1.5 TB).
The symlink /ssd points to the SSD storage. You can specify /ssd in your job script and direct your temporary files there. At the end of your job script, please remember to clean up your temporary files.
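A simple way to guarantee cleanup is a per-job scratch directory removed by an exit trap. In a real job on the A100 nodes you would pass /ssd as the parent directory; the sketch below falls back to the system temp directory so it is self-contained:

```shell
#!/bin/bash
# Sketch: per-job scratch area with automatic cleanup. On the A100
# nodes, replace the parent directory with /ssd (the SSD symlink).
SCRATCH=$(mktemp -d "${TMPDIR:-/tmp}/gpu_job.XXXXXX")
# Remove the scratch area even if the job fails or is killed.
trap 'rm -rf "$SCRATCH"' EXIT
echo "writing temporaries under $SCRATCH"
touch "$SCRATCH/intermediate.dat"   # stand-in for real temporary output
ls "$SCRATCH"
```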
- CUDA programming: http://www.nvidia.com/object/What-is-GPU-Computing.html
- Available applications: http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf