New A100 GPU nodes are available on Minerva

We have completed testing of the new A100 GPU nodes, and they have been added to the LSF gpu queue; to use them, jobs must request the resource -R a100. The A100 provides higher performance than the prior generation of GPUs; a detailed datasheet is available here.
 
What are the A100 GPU nodes on Minerva?
32 A100 GPUs in 8 nodes 
  • 48 Intel Xeon Platinum 8268 2.9 GHz CPU cores per node and 384 GB of memory per node, for a total of 384 CPU cores
  • 1.92 TB SSD (1.8 TB usable) per node
  • 4 A100 GPUs per node, each with 40 GB of memory, for a total of 160 GB of GPU memory per node
  • The A100 GPUs are connected via PCIe
How to submit jobs to the A100 GPU nodes?
  • A100 GPU nodes are available in the gpu queue (use the LSF flag “-q gpu”).
  • To submit your jobs to the A100 GPU nodes, the flag “-R a100” is required, i.e., add #BSUB -R a100 to your LSF script or -R a100 to your bsub command line.
  • Note that the gpu queue also contains GPU nodes with V100 and P100 cards. You can target those with the corresponding flags “-R v100” and “-R p100”. If no GPU model flag is specified, your job will start on the earliest available GPU node. A minimal job script is sketched after this list.
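For reference, below is a minimal sketch of an LSF job script that targets the A100 nodes. The queue name (gpu) and the -R a100 flag come from the points above; the job name, core count, wall-clock limit, GPU count, and output file are illustrative placeholders, and your project may require additional directives (for example an allocation/project flag).

    #!/bin/bash
    #BSUB -J a100_example            # job name (placeholder)
    #BSUB -q gpu                     # GPU queue, as described above
    #BSUB -R a100                    # request A100 GPU nodes specifically
    #BSUB -gpu num=1                 # standard LSF -gpu option; GPU count is illustrative
    #BSUB -n 1                       # number of CPU cores (illustrative)
    #BSUB -W 01:00                   # wall-clock limit (illustrative)
    #BSUB -o a100_example.%J.out     # job output file

    ml cuda/11.1                     # load the CUDA module (see the CUDA section below)
    nvidia-smi                       # confirm the allocated A100 GPU

You would submit such a script with bsub < a100_example.lsf, or pass -q gpu -R a100 directly on a bsub command line.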
How to use the SSD on the A100 GPU nodes?
  • To take advantage of the local 1.8 TB SSD, request the resource with a flag such as -R rusage[ssd_gb=1000], which allocates 1000 GB of dedicated SSD space to your job. We advise requesting ssd_gb of no more than about 1500 (1.5 TB).
  • The symlink /ssd points to the SSD storage. You can specify /ssd in your job script and direct your temporary files there. At the end of your job script, please remember to clean up your temporary files; see the sketch after this list.
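As a rough sketch, the snippet below combines the ssd_gb reservation with a job-specific scratch directory under /ssd and cleans it up at the end of the job. The -R flags and the /ssd path come from the points above; the directory naming scheme is an assumption, and $LSB_JOBID is the standard LSF job-ID environment variable.

    #BSUB -q gpu
    #BSUB -R a100
    #BSUB -R rusage[ssd_gb=1000]     # reserve 1000 GB of local SSD for this job

    # Job-specific scratch directory on the local SSD (naming is illustrative).
    SCRATCH=/ssd/$USER.$LSB_JOBID
    mkdir -p "$SCRATCH"

    # ... run your workload, writing temporary files to $SCRATCH ...

    # Clean up temporary files before the job ends.
    rm -rf "$SCRATCH"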
What CUDA version is supported on the A100 GPU nodes?
  • CUDA 11.x or later is supported on the A100 nodes. Please load the CUDA module with ml cuda/11.1 or ml cuda (cuda/11.1 is currently the default version); a quick check is shown below.
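For example, after loading the module on an A100 node you can verify the toolkit and the visible GPUs with the standard CUDA tools nvcc and nvidia-smi; the exact output will depend on the installed driver.

    ml cuda/11.1      # or simply: ml cuda (cuda/11.1 is the current default)
    nvcc --version    # report the loaded CUDA toolkit version
    nvidia-smi        # list the A100 GPUs visible to the job (run on a GPU node)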
 
If you have any questions, please open a ticket by emailing us at hpchelp@hpc.mssm.edu.