{"id":10937,"date":"2025-07-23T10:40:07","date_gmt":"2025-07-23T14:40:07","guid":{"rendered":"https:\/\/labs.icahn.mssm.edu\/minervalab\/?page_id=10937"},"modified":"2026-02-13T12:33:08","modified_gmt":"2026-02-13T17:33:08","slug":"gpgpu","status":"publish","type":"page","link":"https:\/\/labs.icahn.mssm.edu\/minervalab\/documentation\/gpgpu\/","title":{"rendered":"GPGPU"},"content":{"rendered":"<p>[et_pb_section fb_built=&#8221;1&#8243; admin_label=&#8221;section&#8221; _builder_version=&#8221;4.16&#8243; custom_padding=&#8221;0px||0px||false|false&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_row admin_label=&#8221;row&#8221; _builder_version=&#8221;4.16&#8243; custom_padding=&#8221;||0px||false|false&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.16&#8243; custom_padding=&#8221;|||&#8221; global_colors_info=&#8221;{}&#8221; custom_padding__hover=&#8221;|||&#8221;][et_pb_text admin_label=&#8221;Breadcrumb&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p><a href=\"https:\/\/labs.icahn.mssm.edu\/minervalab\/scientific-computing-and-data\/\">Scientific Computing and Data<\/a>\u00a0\/\u00a0<a href=\"https:\/\/labs.icahn.mssm.edu\/minervalab\/\">High Performance Computing<\/a> \/ <a href=\"https:\/\/labs.icahn.mssm.edu\/minervalab\/documentation\/\">Documentation<\/a> \/\u00a0gpgpu<\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section][et_pb_section fb_built=&#8221;1&#8243; admin_label=&#8221;section&#8221; _builder_version=&#8221;4.16&#8243; custom_padding=&#8221;0px||0px||false|false&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_row _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;|auto|-20px|auto|false|false&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; 
global_colors_info=&#8221;{}&#8221;][et_pb_text admin_label=&#8221;Title&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; header_font=&#8221;|700|||||||&#8221; header_text_color=&#8221;#221f72&#8243; custom_margin=&#8221;||0px||false|false&#8221; custom_padding=&#8221;||0px||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<h1>Harnessing GPUs for Scientific Computing: Concepts and Usage Etiquette<\/h1>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][et_pb_row admin_label=&#8221;row&#8221; _builder_version=&#8221;4.16&#8243; background_size=&#8221;initial&#8221; background_position=&#8221;top_left&#8221; background_repeat=&#8221;repeat&#8221; global_colors_info=&#8221;{}&#8221;][et_pb_column type=&#8221;4_4&#8243; _builder_version=&#8221;4.16&#8243; custom_padding=&#8221;|||&#8221; global_colors_info=&#8221;{}&#8221; custom_padding__hover=&#8221;|||&#8221;][et_pb_text admin_label=&#8221;Overview&#8221; _builder_version=&#8221;4.27.4&#8243; text_line_height=&#8221;1.5em&#8221; hover_enabled=&#8221;0&#8243; global_colors_info=&#8221;{}&#8221; sticky_enabled=&#8221;0&#8243;]In modern computing, GPUs (Graphics Processing Units) have evolved far beyond their initial role as accelerators for graphics tasks such as 3D rendering and image processing. This evolution has led to the emergence of GPGPU (General-Purpose computing on Graphics Processing Units), a paradigm shift where GPUs are utilized for a broader range of computational tasks traditionally handled by CPUs.<\/p>\n<p>The NVIDIA CUDA toolkit includes everything needed to develop GPU-accelerated applications, including GPU-accelerated libraries, a compiler, development tools, and the CUDA runtime.\u00a0NVCC (NVIDIA CUDA Compiler), the compiler driver, acts as a high-level interface, simplifying the CUDA compilation process for developers. 
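<\/p>\n<p>As an illustration, a hypothetical <code>nvcc<\/code> invocation (the file name and the choice of architectures are assumptions, not Minerva requirements) that embeds device code for two GPU generations, e.g., compute capability 7.0 (V100) and 8.0 (A100), into one fat binary:<\/p>\n<pre><code>nvcc -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -o my_app my_app.cu<\/code><\/pre>\n<p>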
NVCC separates the CUDA code into two parts: host code (for the CPU) and device code (for the GPU), and sends the host code to a standard C++ host compiler (like GCC or Clang) and\u00a0the device code to the CUDA backend compiler. NVCC handles the creation of fat binaries, which bundle the compiled device code for different GPU architectures within the host executable. NVCC manages the linking of the compiled host code, the device code, and necessary CUDA runtime libraries.<\/p>\n<p>Minerva has a total of 81 GPU nodes. Please check below for more information.<\/p>\n<ul>\n<li>12 Intel nodes each with 32 cores, 384GiB RAM, and 4 NVIDIA V100 GPU cards with 16GB memory on each card (<a href=\"https:\/\/images.nvidia.com\/content\/technologies\/volta\/pdf\/volta-v100-datasheet-update-us-1165301-r5.pdf\">see more<\/a>)<\/li>\n<li>8 Intel nodes each with 48 cores, 384GiB RAM, and 4 NVIDIA A100 GPU cards with 40GB memory on each card (<a href=\"https:\/\/www.nvidia.com\/content\/dam\/en-zz\/Solutions\/Data-Center\/a100\/pdf\/nvidia-a100-datasheet.pdf\">see more<\/a>)<\/li>\n<li>2 Intel nodes each with 64 cores, 2TiB RAM, and 4 NVIDIA A100 GPU cards with 80GB memory on each card each with NVLINK connectivity<\/li>\n<li>49 nodes in total, including 47 nodes with Intel Xeon Platinum 8568Y+ processors (96 cores, 1.5 TB RAM per node) and 4 NVIDIA H100 GPUs (80 GB each) connected via SXM5 with NVLink, as well as 2 nodes with Intel Xeon Platinum 8358 processors (32 cores, 512 GB RAM per node) and 4 NVIDIA H100 GPUs (80 GB each) connected via PCIe. The 2 additional nodes also feature 3.84 TB NVMe SSD storage (3.5 TB usable) (<a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/h100\/\">see more<\/a>)<\/li>\n<li>4 nodes, each with AMD Genoa 9334 processors (64 cores, 1.5 TB RAM per node) and equipped with 8 NVIDIA L40S GPUs per node, totaling 32 L40S GPUs across 4 nodes. 
Each processor operates at 2.7 GHz (<a href=\"https:\/\/www.nvidia.com\/en-us\/data-center\/l40s\/\">see more<\/a>)<\/li>\n<li>6 Lenovo SR780a V3 DGX nodes with 48 NVIDIA B200 GPUs (8 NVLinked per node, 192 GB per GPU; 9 TB total GPU memory), 672 Intel Xeon Platinum 8570 cores, 12 TB system memory, and 25 TB NVMe per node. Supports FP4 (4-bit) precision for near-exaflop AI inference performance (<a href=\"https:\/\/resources.nvidia.com\/en-us-dgx-systems\/dgx-b200-datasheet\">see more<\/a>)<\/li>\n<\/ul>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;Requesting GPU Resources&#8221; _builder_version=&#8221;4.27.4&#8243; text_line_height=&#8221;1.5em&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<h2>Requesting GPU Resources<\/h2>\n<p>The GPU nodes must be accessed by way of the LSF job queue. There are nodes on the <code>interactive<\/code>, <code>gpu<\/code> and <code>gpuexpress<\/code> queues.<\/p>\n<p>To request GPU resources, use the <code>-gpu<\/code> option of the bsub command. At a minimum, the number of GPU cards requested per node must be specified using the <code>num=<\/code> sub-option. 
The requested GPU model should be specified by using an LSF resource request via the <code>-R<\/code> option.<\/p>\n<pre><code>bsub -gpu num=2 -R a100<\/code><\/pre>\n<p><strong>Full details of the <code>-gpu<\/code> option can be obtained\u202f<a href=\"https:\/\/www.ibm.com\/docs\/en\/spectrum-lsf\/10.1.0?topic=o-gpu\">here<\/a>.<\/strong><\/p>\n<p><strong>By default, GPU resource requests are <i>exclusive_process<\/i>.<\/strong><\/p>\n<p>The available GPU models and corresponding resource specifications are:[\/et_pb_text][et_pb_code admin_label=&#8221;Table&#8221; _builder_version=&#8221;4.27.4&#8243; _module_preset=&#8221;default&#8221; position_origin_a=&#8221;center_center&#8221; position_origin_f=&#8221;center_center&#8221; text_orientation=&#8221;center&#8221; width=&#8221;50%&#8221; max_width=&#8221;50%&#8221; custom_margin=&#8221;20px||30px|150px|false|false&#8221; custom_padding=&#8221;|||200px|false|false&#8221; custom_css_main_element=&#8221;.centered-content {||    display: flex;||    justify-content: center; \/* Center horizontally *\/||    align-items: center; \/* Center vertically *\/||    height: 100%; \/* Ensure full height is available *\/||    text-align: center; \/* Optional: Center the text within the content *\/||}||||.centered-table {||        margin-left: auto;||        margin-right: auto;||        text-align: center;||}||    ||.centered-table td, .centered-table th {||        text-align: center;||        vertical-align: middle;||}&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<div class=\"centered-content\"><!-- [et_pb_line_break_holder] --><\/p>\n<table class=\"centered-table\" border=\"1\"><!-- [et_pb_line_break_holder] --><\/p>\n<thead>\n<tr class=\"tableizer-firstrow\">\n<th>-R\u00a0<\/th>\n<th>GPU MODEL\u00a0<\/th>\n<\/tr>\n<\/thead>\n<tbody><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>v100\u00a0<\/td>\n<td>TeslaV100_PCIE_16GB\u00a0<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> 
<\/p>\n<tr>\n<td>a100\u00a0<\/td>\n<td>NVIDIAA100_PCIE_40GB\u00a0<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>a10080g\u00a0<\/td>\n<td>NVIDIAA100_SXM4_80GB\u00a0<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>h10080g\u00a0<\/td>\n<td>NVIDIAH100_PCIE_80GB\u00a0<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>h100nvl\u00a0<\/td>\n<td>NVIDIAH100_SXM5_80GB\u00a0<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>l40s\u00a0<\/td>\n<td>NVIDIAL40S_PCIE_48GB\u00a0<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>b200\u00a0<\/td>\n<td>NVIDIAB200\u00a0<\/td>\n<\/tr>\n<p>  <!-- [et_pb_line_break_holder] --><\/tbody>\n<\/table>\n<p><!-- [et_pb_line_break_holder] --><\/div>\n<p>[\/et_pb_code][et_pb_text admin_label=&#8221;Requesting GPU Resources Continued&#8221; _builder_version=&#8221;4.27.4&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;||10px||false|false&#8221; custom_padding=&#8221;||10px||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<strong>Specifying the GPU model &#8216;b200&#8217; is required for an LSF job to be placed on a B200 GPU node.<\/strong><\/p>\n<p><strong>Specifying the GPU model number in the\u202f<code>-gpu gmodel=<\/code>\u202foption does not always work. The recommended way to specify the GPU model is via the\u202f-R\u202foption.<\/strong><\/p>\n<p><strong>Also, CUDA 11.8 or higher is needed to utilize the H100 GPU devices.<\/strong><\/p>\n<p><strong>Note that GPU resource allocation is per node.<\/strong> Suppose your LSF options call for multiple CPUs (e.g., <code>bsub -n 4 \u2026<\/code>). If LSF can&#8217;t satisfy all 4 CPUs on one node, it may split the job across multiple nodes (e.g., 2 CPUs on Node A, 2 on Node B). When you also request GPUs with <code>-gpu \"num=1\"<\/code>, then 1 GPU per node is allocated. So in this example with 2 nodes, 2 GPUs total are reserved (1 per node). 
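<\/p>\n<p>For example, the following request (the account name and walltime are placeholders) keeps all 4 CPUs, and therefore both GPUs, on a single node:<\/p>\n<pre><code>bsub -P acc_xxx -q gpu -n 4 -R a100 -R \"span[hosts=1]\" -gpu \"num=2\" -W 01:00 -Is \/bin\/bash<\/code><\/pre>\n<p>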
If your application can\u2019t run in a multi-node setup (e.g., it&#8217;s not MPI-aware or can&#8217;t coordinate GPU use across nodes), those extra GPUs on the second node are wasted. The solution is to use <code>-R \"span[hosts=1]\"<\/code> to force all resources onto one node and avoid waste.<\/p>\n<p>If your program needs to know which GPU cards have been allocated to your job (not common), LSF sets the <strong><code>CUDA_VISIBLE_DEVICES<\/code><\/strong> environment variable to indicate which cards have been assigned.[\/et_pb_text][et_pb_text admin_label=&#8221;Supplemental Software&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<h2>Supplemental Software<\/h2>\n<p>You will almost certainly need auxiliary software to utilize the GPUs, most likely the CUDA libraries from NVIDIA and perhaps the cuDNN (CUDA Deep Neural Network library) libraries. There are several versions of each on Minerva. Use:<\/p>\n<pre><code>ml avail cuda<\/code><\/pre>\n<p>and\/or<\/p>\n<pre><code>ml avail cudnn<\/code><\/pre>\n<p>to determine which versions are available for loading.<\/p>\n<p>For developers, there are a number of CUDA-accelerated libraries available for download from NVIDIA.<\/p>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;Interactive Submission&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<h2>Interactive Submission<\/h2>\n<p>Minerva sets aside a number of GPU-enabled nodes to be accessed via the interactive queue. This number is changed periodically based on demand but is always small, e.g., 1 or 2. 
The number and type of GPU will be posted in the announcement section of the home page.<\/p>\n<p>To open an interactive session on one of these nodes:<\/p>\n<pre><code><strong>bsub -P acc_xxx -q interactive -n 1 -R v100 -gpu num=1 -W 01:00 -Is \/bin\/bash<\/strong><\/code><\/pre>\n<p>Alternatively, one can open an interactive session on one of the batch GPU nodes. This is particularly useful if the interactive nodes do not have the GPU model you would like to use:<\/p>\n<pre><code><strong>bsub -P acc_xxx -q gpu -n 1 -R a100 -gpu num=2 -W 01:00 -Is \/bin\/bash<\/strong><\/code><\/pre>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;Batch Submission&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<h2>Batch Submission<\/h2>\n<p>Batch submission is a straightforward specification of the GPU-related bsub options in your LSF script.<\/p>\n<pre><code>bsub &lt; test.lsf <\/code><\/pre>\n<p>where <code>test.lsf<\/code> is something like:<\/p>\n<blockquote>\n<pre><code>#BSUB -q gpu\r\n#BSUB -R a100\r\n#BSUB -gpu num=4\r\n#BSUB -n 1\r\n#BSUB -W 4\r\n#BSUB -P acc_xxx\r\n#BSUB -oo test.out\r\n\r\nml cuda \r\nml cudnn \r\n\r\necho \"salve mundi\" \r\n<\/code><\/pre>\n<\/blockquote>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;Accessing the Local SSD Before Table&#8221; _builder_version=&#8221;4.27.4&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<h2>Accessing the Local SSD on the A100, A100-80GB, B200, H100-80GB, H100NVL, and L40S GPU Nodes<\/h2>\n<p>To fully leverage GPU capabilities, local SSD storage plays a critical role in minimizing data bottlenecks. Fast local SSDs\u2014especially NVMe drives\u2014ensure high-throughput data access between storage and GPU memory, which is essential for workloads involving large datasets or frequent read\/write operations. 
Choosing between SATA and NVMe SSDs directly impacts training speed, I\/O performance, and overall GPU efficiency in GPGPU systems. Local SSD storage is available on A100, A100-80GB, H100-80GB, H100NVL and L40S GPU nodes, as shown below. 1 TB local HDD storage is available on V100 GPU nodes.<\/p>\n<p>[\/et_pb_text][et_pb_code admin_label=&#8221;Table&#8221; _builder_version=&#8221;4.27.4&#8243; _module_preset=&#8221;default&#8221; position_origin_a=&#8221;center_center&#8221; position_origin_f=&#8221;center_center&#8221; text_orientation=&#8221;center&#8221; width=&#8221;54%&#8221; max_width=&#8221;54%&#8221; custom_margin=&#8221;30px||30px|150px|false|false&#8221; custom_padding=&#8221;|||200px|false|false&#8221; custom_css_main_element=&#8221;.centered-content {||    display: flex;||    justify-content: center; \/* Center horizontally *\/||    align-items: center; \/* Center vertically *\/||    height: 100%; \/* Ensure full height is available *\/||    text-align: center; \/* Optional: Center the text within the content *\/||}||||.centered-table {||        margin-left: auto;||        margin-right: auto;||        text-align: center;||}||    ||.centered-table td, .centered-table th {||        text-align: center;||        vertical-align: middle;||}&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<div class=\"centered-content\"><!-- [et_pb_line_break_holder] --><\/p>\n<table class=\"centered-table\" border=\"1\"><!-- [et_pb_line_break_holder] --><\/p>\n<thead>\n<tr class=\"tableizer-firstrow\">\n<th>GPU Model<\/th>\n<th>Local SSD Storage<\/th>\n<\/tr>\n<\/thead>\n<tbody><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>a100\u00a0<\/td>\n<td>1.8 TB SATA SSD<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>a10080g\u00a0<\/td>\n<td>7.0 TB NVMe PCIe SSD<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>h10080g\u00a0<\/td>\n<td>3.84 TB NVMe PCIe SSD<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> 
<\/p>\n<tr>\n<td>h100nvl\u00a0<\/td>\n<td>3.84 TB NVMe PCIe SSD<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>l40s\u00a0<\/td>\n<td>3.84 TB NVMe PCIe SSD<\/td>\n<\/tr>\n<p><!-- [et_pb_line_break_holder] --> <\/p>\n<tr>\n<td>b200\u00a0<\/td>\n<td>25 TB NVMe PCIe SSD<\/td>\n<\/tr>\n<p>  <!-- [et_pb_line_break_holder] --><\/tbody>\n<\/table>\n<p><!-- [et_pb_line_break_holder] --><\/div>\n<p>[\/et_pb_code][et_pb_text admin_label=&#8221;Accessing the Local SSD After the Table&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p>To take advantage of the local SSD, request the resource using the rusage specification, for example:<\/p>\n<pre><code>-R \"rusage[ssd_gb=1000]\"<\/code><\/pre>\n<p>This example will allocate 1000 GB of dedicated SSD space to your job.<\/p>\n<p>We advise keeping your <code>ssd_gb<\/code> request <code>&lt;= 1,500<\/code> (1.5 TB).<\/p>\n<p>The soft link, <code><strong>\/ssd<\/strong><\/code>, points to the SSD storage. You can specify <code>\/ssd<\/code> in your job script and direct your temporary files there. At the end of your job script, please remember to clean up your temporary files.<\/p>\n<p>A foolproof way of doing this is to use LSF\u2019s options for executing <i>pre_execution<\/i> commands and <i>post_execution<\/i> commands. 
To do this, add the following to your bsub options:<\/p>\n<blockquote>\n<pre><code>#BSUB -E \"mkdir \/ssd\/$LSB_JOBID\"   # create a folder before execution begins\r\n#BSUB -Ep \"rm -rf \/ssd\/$LSB_JOBID\"  # remove the folder after the job completes<\/code><\/pre>\n<\/blockquote>\n<p>Inside your LSF script, use <code>\/ssd\/$LSB_JOBID<\/code> as the folder in which to create and use temp files.<\/p>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;Monitoring GPU Usage&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<h2>Monitoring GPU Usage<\/h2>\n<p>The LSF queuing system on Minerva is configured to gather GPU resource usage using <a href=\"https:\/\/developer.nvidia.com\/dcgm\">NVIDIA Data Center GPU Manager (DCGM)<\/a>. This allows users to view the GPU usage of their finished jobs using <code>bjobs -l -gpu <\/code> if the job finished within the last 30 minutes or <code>bhist -l -gpu <\/code> otherwise.<\/p>\n<p>The following is a sample of the report: (SM = streaming multiprocessors, 
i.e., the part of the GPU card that runs CUDA)<\/p>\n<pre><code>bjobs -l -gpu 69703953<\/code><\/pre>\n<p><strong><br \/> HOST: lg07c03; CPU_TIME: 8 seconds <\/strong>#\u00a0 Indicates the job ran on host lg07c03 and the total CPU time consumed by the job was 8 seconds.<strong><br \/> GPU ID: 2\u00a0 <\/strong># This\u00a0section provides information specific to each GPU used by the job.\u00a0<strong><br \/> Total Execution Time: 9 seconds <\/strong>#\u00a0 This indicates the duration the GPU was actively engaged in processing tasks for this specific job.<strong><br \/> Energy Consumed: 405 Joules <\/strong>#\u00a0This represents the energy consumed by each GPU during its execution time.<strong><br \/> SM Utilization (%): Avg 16, Max 100, Min 0 <\/strong>#\u00a0SM (Streaming Multiprocessor) utilization reflects the percentage of time the Streaming Multiprocessors (which execute your GPU threads) are executing a task.\u00a0The average utilization is 16%, suggesting that the GPUs were not constantly running at full capacity during the entire execution time.\u00a0The maximum utilization reaching 100% indicates that at some points, the GPUs were fully engaged. 
The minimum utilization of 0% signifies periods when the GPUs were idle.<br \/><strong> Memory Utilization (%): Avg 0, Max 3, Min 0\u00a0<\/strong>#\u00a0GPU Memory Utilization indicates how much of the GPU&#8217;s dedicated memory is currently in use<br \/> <strong>Max GPU Memory Used: 38183895040 bytes <\/strong>#\u00a0This represents the peak memory usage on each GPU during the job&#8217;s execution.<\/p>\n<p><strong>GPU ID: 1<br \/> Total Execution Time: 9 seconds<br \/> Energy Consumed: 415 Joules<br \/> SM Utilization (%): Avg 20, Max 100, Min 0<br \/> Memory Utilization (%): Avg 0, Max 3, Min 0<br \/> Max GPU Memory Used: 38183895040 bytes<br \/> GPU Energy Consumed: 820.000000 Joules <\/strong># This represents the total energy consumed by both GPUs combined (405 J + 415 J = 820 J).\u00a0<\/p>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;Further Information:&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<h2>Further Information:<\/h2>\n<p><a href=\"http:\/\/www.nvidia.com\/object\/What-is-GPU-Computing.html\">CUDA Programming<\/a><br \/> <a href=\"http:\/\/www.nvidia.com\/docs\/IO\/123576\/nv-applications-catalog-lowres.pdf\">Available Applications <\/a><\/p>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;GPU Etiquette&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<h2>GPU Etiquette<\/h2>\n<p>The GPU nodes on Minerva are in high demand. To optimize resource allocation, it\u2019s critical to specify job requirements accurately in LSF. 
Incorrect resource requests can lead to unnecessary reservation of GPU resources, making them unavailable for other researchers.<\/p>\n<p>The bsub options that need particular attention are:<br \/> <code>-n<\/code> The number of slots assigned to your job.<br \/> <code>-R rusage[mem=]<\/code> The amount of memory assigned to each slot.<br \/> <code>-gpu num=<\/code> The number of GPU cards per node assigned to your job.<br \/> <code>-R span[]<\/code> The arrangement of the resources assigned to your job across Minerva nodes.<\/p>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;-n&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p><code><strong>-n<\/strong><\/code><\/p>\n<p>This option specifies how many job slots you want for your job. View this value as how many packets of resources you want allocated to your job. By default, each slot you ask for will have 1 CPU and 3 GB of memory. In most cases, job slots and CPUs are synonymous. <strong>Note that if your program is not explicitly coded to be parallelized, adding more cores will not improve performance.<\/strong><\/p>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;-R rusage%91mem=%93 &#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<code><strong>-R rusage[mem=]<\/strong><\/code><\/p>\n<p>This option specifies how much memory is to be allocated per slot. The default is 3 GB. Note that the request is per slot, not per job. 
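<\/p>\n<p>A quick sketch of the arithmetic (the slot count and per-slot memory below are example values):<\/p>\n<pre><code># bsub -n 4 -R rusage[mem=5G] reserves 4 x 5 GB = 20 GB in total\r\nSLOTS=4\r\nMEM_PER_SLOT_GB=5\r\nTOTAL_GB=$((SLOTS * MEM_PER_SLOT_GB))\r\necho \"${TOTAL_GB} GB reserved in total\"<\/code><\/pre>\n<p>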
The total amount of memory requested will be this value times the number of slots you have requested with the <code>-n<\/code> option.[\/et_pb_text][et_pb_text admin_label=&#8221;-gpu num=&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p><code><strong>-gpu num=<\/strong><\/code><\/p>\n<p>This option specifies how many GPU cards you want allocated <strong>per compute node, not per slot<\/strong>. The GPU cards\/devices are numbered 0 through 3. However, on the L40S and B200 nodes, which have 8 GPUs each, the devices are numbered 0 through 7.<\/p>\n<p>Because your program needs to connect to a specific GPU device, LSF will set an environment variable, CUDA_VISIBLE_DEVICES, to the list of GPU devices assigned to your job, e.g., CUDA_VISIBLE_DEVICES=0,3. Do not change these values manually and, if writing your own code, you must honor the assignment. The installed software on Minerva that uses GPUs honors these assignments.<\/p>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;-R span%91%93&#8243; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<code><strong>-R span[]<\/strong><\/code><\/p>\n<p>This option tells LSF how to lay out your slots\/CPUs on the compute nodes of Minerva. 
By default, slots\/CPUs can be, and usually are, spread out over several nodes.<\/p>\n<p>There are two ways that this option is generally specified: using <code>ptile=x<\/code> to specify the tiling across nodes, or <code>hosts=1<\/code> to specify that all slots are to be allocated on one node (1 is the only value accepted).[\/et_pb_text][et_pb_text admin_label=&#8221;Example A&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p><i>Example A:<\/i><\/p>\n<p><code>-R span[ptile=]<\/code> where the value is the number of slots\/cores to be placed on a node.<\/p>\n<p>e.g.<\/p>\n<blockquote>\n<pre><code>#BSUB -n 6\t# Allocate 6 cores\/slots total\r\n#BSUB -R rusage[mem=5G]  # Allocate 5GB per core\/slot\r\n#BSUB -R span[ptile=4]   # Allocate 4 cores\/slots per node\r\n\r\n#BSUB -gpu num=2\r\n\r\n#BSUB -R a100\r\n<\/code><\/pre>\n<\/blockquote>\n<p>As a result, two nodes will be allocated:<br \/> \u2022 Node 1: 4 CPU cores, 20 GB memory (4 \u00d7 5 GB), 2 A100 GPUs<br \/> \u2022 Node 2: 2 CPU cores, 10 GB memory (2 \u00d7 5 GB), 2 A100 GPUs<\/p>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;Example B&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p><i>Example B:<\/i><\/p>\n<p><code>-R span[hosts=1]<\/code> allocates all cores\/slots on one host.<\/p>\n<p>e.g.<\/p>\n<blockquote>\n<pre><code>#BSUB -n 6 \r\n#BSUB -R rusage[mem=5G] \r\n#BSUB -R span[hosts=1] \r\n\r\n#BSUB -gpu num=2\r\n\r\n#BSUB -R a100\r\n<\/code><\/pre>\n<\/blockquote>\n<p>One node will be allocated with the following resources:<br \/> \u2022 6 CPU cores<br \/> \u2022 30 
GB memory (6 \u00d7 5 GB)<br \/> \u2022 2 A100 GPUs<\/p>\n<p>[\/et_pb_text][et_pb_text admin_label=&#8221;Tips&#8221; _builder_version=&#8221;4.16&#8243; _module_preset=&#8221;default&#8221; custom_margin=&#8221;20px||||false|false&#8221; custom_css_main_element=&#8221;ol, ul {||  padding: 30px 0px 10px 50px;||  list-style-position: outside;||}||||ol ol, ul ul {||  padding: 0px 0px 10px 50px;||  list-style-type: circle;||}||||ol li, ul li {||  margin-bottom: 5px;||}||&#8221; global_colors_info=&#8221;{}&#8221;]<\/p>\n<p>If you do not specify a span requirement, LSF will make its own placement decisions, which may result in allocating only one slot per node. In the example above, this could lead to your job occupying 12 GPU cards (two on each of six different nodes).<\/p>\n<p>If your program is not designed for distributed execution (e.g., if it&#8217;s a Shared Memory Parallel (SMP) application\u2014which most parallel programs in our environment are), it will only utilize resources on the first node assigned. The additional nodes (and their GPUs) will remain allocated but idle, unavailable to other users, potentially leading to significant delays in the job queue.<\/p>\n<p>Some signs that your program <strong>cannot run in a distributed fashion<\/strong> include:<\/p>\n<ul>\n<li>Documentation does not say that it can.<\/li>\n<li>The <code>mpirun<\/code> command is not used to start the program.<\/li>\n<li>It has program options such as\n<ul>\n<li><code>-ncpu<\/code><\/li>\n<li><code>-threads<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>In such cases, all required resources must be located on a single node, and you should include the following in your job script:<\/p>\n<p><code>-R span[hosts=1]<\/code><\/p>\n<p>This ensures that all allocated CPUs, memory, and GPUs are placed on the same node.<\/p>\n<p><strong>This is especially important when GPUs are involved.<\/strong> GPU devices are valuable and limited in number. 
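<\/p>\n<p>The scale of the potential waste can be sketched with the numbers from the example above (illustrative values, assuming LSF happened to place one slot on each of 6 nodes):<\/p>\n<pre><code># 6 nodes, each allocated 2 GPUs by -gpu \"num=2\"\r\nNODES=6\r\nGPUS_PER_NODE=2\r\nTOTAL_GPUS=$((NODES * GPUS_PER_NODE))\r\necho \"${TOTAL_GPUS} GPUs reserved; an SMP job would use only ${GPUS_PER_NODE} of them\"<\/code><\/pre>\n<p>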
By default, LSF will allocate the number of GPUs you request (via the -gpu num= option) on every node assigned to your job\u2014even if your application only runs on one of them.<\/p>\n<p><em>If your program is single-threaded or an SMP (Shared Memory Parallel) application, and your job spans multiple nodes, only the first node will be used. GPUs on all other nodes will remain allocated but idle, potentially blocking other jobs from accessing those resources and causing unnecessary scheduling delays.<\/em><\/p>\n<p>[\/et_pb_text][\/et_pb_column][\/et_pb_row][\/et_pb_section]<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Scientific Computing and Data\u00a0\/\u00a0High Performance Computing \/ Documentation \/\u00a0gpgpuHarnessing GPUs for Scientific Computing: Concepts and Usage EtiquetteIn modern computing, GPUs (Graphics Processing Units) have evolved far beyond their initial role as graphics rendering accelerators like 3D rendering and image processing. This evolution has led to the emergence of GPGPU (General-Purpose computing on Graphics Processing Units), 
[&hellip;]<\/p>\n","protected":false},"author":624,"featured_media":0,"parent":35,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_et_pb_use_builder":"on","_et_pb_old_content":"","_et_gb_content_width":"","footnotes":""},"class_list":["post-10937","page","type-page","status-publish","hentry"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages\/10937","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/users\/624"}],"replies":[{"embeddable":true,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/comments?post=10937"}],"version-history":[{"count":26,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages\/10937\/revisions"}],"predecessor-version":[{"id":13288,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages\/10937\/revisions\/13288"}],"up":[{"embeddable":true,"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/pages\/35"}],"wp:attachment":[{"href":"https:\/\/labs.icahn.mssm.edu\/minervalab\/wp-json\/wp\/v2\/media?parent=10937"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}