We have added two new A100-80GB GPU nodes to the LSF queue. Each node is equipped with 2 TB of memory and 7 TB of usable local NVMe PCIe SSD storage, providing higher performance than the prior generation.
What are the A100-80GB GPU nodes on Minerva?
8 A100 GPUs in 2 nodes
· 64 Intel Xeon Platinum 8358 2.6 GHz CPU cores per node, for a total of 128 CPU cores across the 2 nodes, and 2 TB of memory per node
· 7.68 TB NVMe PCIe SSD (7.0 TB usable) per node, which can deliver a sustained read-write speed of 3.5 GB/s, in contrast with SATA SSDs, which are limited to 600 MB/s
· 4 A100 GPUs per node with 80 GB of memory for each GPU, for a total of 320 GB of GPU memory per node
· The A100 GPUs are connected via NVLink
How to submit jobs to the A100-80GB GPU nodes?
· A100-80GB GPU nodes are available in the GPU queue (use the LSF flag “-q gpu”).
· To submit your jobs to those A100-80GB GPU nodes, the flag “-R a10080g” is required; that is, add #BSUB -R a10080g to your LSF script or -R a10080g to your LSF command line. For example, the following requests 1 GPU card, 8 CPUs, and 256 GB of memory (32000 MB for each of the 8 CPU cores) for 1 hour:
bsub -P acc_xxx -q gpu -n 8 -R rusage[mem=32000] -R a10080g -R rusage[ngpus_excl_p=1] -R span[hosts=1] -W 01:00 -Is /bin/bash
· Note that the gpu queue also contains other GPU nodes with V100 and A100 GPU cards. You can access those resources with the corresponding flags “-R v100” and “-R a100”. If no GPU model flag is specified, your job will start on the earliest available GPU node.
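The interactive command above can also be written as a batch script. The following is a minimal sketch, assuming the same resource request as in the example; the job name and output file name are hypothetical, and the nvidia-smi call is only an illustrative check of which GPU the job received.

```shell
#!/bin/bash
#BSUB -P acc_xxx                      # replace with your own project account
#BSUB -q gpu                          # GPU queue
#BSUB -n 8                            # 8 CPU cores
#BSUB -R rusage[mem=32000]            # memory request, as in the example above
#BSUB -R a10080g                      # restrict the job to the A100-80GB nodes
#BSUB -R rusage[ngpus_excl_p=1]       # 1 GPU card
#BSUB -R span[hosts=1]                # keep all cores on one node
#BSUB -W 01:00                        # 1 hour wall-clock limit
#BSUB -J a100_80g_example             # hypothetical job name
#BSUB -o a100_80g_example.%J.out      # hypothetical output file (%J = job ID)

# Record where the job landed.
job_host="$(uname -n)"
echo "Running on host: ${job_host}"

# Report the allocated GPU if the NVIDIA tools are available on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv
else
    echo "nvidia-smi not found on this host"
fi
```

Assuming the script is saved as a100_job.lsf (a hypothetical file name), it would be submitted with: bsub < a100_job.lsf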
How to use the SSD on the A100-80GB GPU nodes?
· The symbolic link /ssd points to the local NVMe SSD storage. You can specify /ssd in your job script and direct your temporary files there. At the end of your job script, please remember to clean up your temporary files.
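One way to follow this advice is sketched below, assuming /ssd is writable by your job. The directory naming scheme is hypothetical, and the fallback to /tmp exists only so the sketch also runs on hosts without /ssd. A trap removes the temporary directory when the script exits, including when it exits on an error.

```shell
#!/bin/bash
# Assumption: /ssd is writable by the job; otherwise fall back to $TMPDIR or /tmp.
SCRATCH_BASE="/ssd"
if [ ! -d "$SCRATCH_BASE" ] || [ ! -w "$SCRATCH_BASE" ]; then
    SCRATCH_BASE="${TMPDIR:-/tmp}"
fi

# Create a private directory per job. LSB_JOBID is set by LSF inside a job;
# "manual" is used when the script runs outside LSF.
SCRATCH_DIR="$(mktemp -d "${SCRATCH_BASE}/${USER:-job}_${LSB_JOBID:-manual}_XXXXXX")"

# Remove the temporary directory whenever the script exits.
cleanup() {
    rm -rf "$SCRATCH_DIR"
}
trap cleanup EXIT

# Point tools that honor TMPDIR at the local SSD.
export TMPDIR="$SCRATCH_DIR"

# Placeholder for real work that writes temporary files.
echo "temporary data" > "${SCRATCH_DIR}/example.tmp"
```

Results you want to keep should be copied from the temporary directory to permanent storage before the script ends, because the trap deletes everything under it.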
What CUDA version is supported on the A100-80GB GPU nodes?
· CUDA 11.x or later is supported on those A100-80GB nodes. Please load the CUDA module with ml cuda/11.1 or ml cuda (cuda/11.1 is currently the default version).
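Inside a job script, the module load can be followed by a quick check that the CUDA toolkit is on the PATH. This is a minimal sketch, assuming the cuda/11.1 module puts nvcc on the PATH; the cuda_check variable is a hypothetical name used only for illustration.

```shell
#!/bin/bash
# Load the CUDA module if the "ml" command is available in this shell.
if type ml >/dev/null 2>&1; then
    ml cuda/11.1
fi

# Report the toolkit version, or note that the compiler was not found.
if command -v nvcc >/dev/null 2>&1; then
    cuda_check="found"
    nvcc --version
else
    cuda_check="missing"
    echo "nvcc is not on the PATH; check that the cuda module loaded"
fi
```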
If you have any questions about this, please send us a ticket at firstname.lastname@example.org