Minerva Quick Start

 

Partitions

The Chimera partition:

  • 286 compute nodes – each with 48 Intel 8168 cores (2.7GHz) and 192 GB memory
  • 4 high-memory nodes – each with 48 Intel 8168 cores (2.7GHz) and 1.5 TB memory
  • 48 V100 GPUs in 12 nodes – each node with 32 Intel 6142 cores (2.6GHz), 384 GB memory, and 4 V100 GPUs (16 GB each)

The BODE2 partition:

  • 78 compute nodes – each with 48 Intel 8268 cores (2.9GHz) and 192 GB memory
  • Only BODE-enabled users have access to the BODE2 partition

Connecting to Minerva

For security, Minerva uses the Secure Shell (ssh) protocol and two-factor authentication. Unix systems typically have an ssh client already installed. Windows users can download one of several free ssh clients, such as PuTTY.

Two-factor authentication requires you to enter a password that combines your Sinai password with a generated token code (from either a software or a hardware token).

Software Token:
On Android and iPhone, the application is called “VIP Access” and is published by Symantec. BlackBerry, Windows Mobile, etc. are also supported.

Hardware Token:
Hardware tokens are normally obtained from the IT Helpdesk, but none are available at this time.

To set up two-factor authentication, visit the ASCIT website.

From on-site and off-site

All users can log in to the Minerva cluster via ssh to minerva.hpc.mssm.edu. As part of our HIPAA compliance activities, the external gateway access to Minerva must be shut down. The High Performance Computing team has made adjustments so that all users can connect to the internal login nodes, which means that all users need a VPN account to log in from off campus. Please refer to here for details.

For example:

> ssh your_userid@minerva.hpc.mssm.edu
Password: > your_Sinai_password123456

(The > sign indicates what you type; 123456 represents the numeric sequence obtained from your token, entered immediately after your password.)

For more information on logging in, please visit Logging In
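
To avoid retyping the full host name, the login step above can be wrapped in an OpenSSH client configuration entry. The following is a minimal sketch, not an official setup: the function name add_minerva_host and the alias "minerva" are hypothetical, and it assumes an OpenSSH client that reads ~/.ssh/config.

```shell
# Append a Host entry for Minerva to an OpenSSH client config file.
# "minerva" is a hypothetical alias chosen for this sketch.
add_minerva_host() {
    userid=$1
    config_file=${2:-"$HOME/.ssh/config"}
    cat >> "$config_file" <<EOF
Host minerva
    HostName minerva.hpc.mssm.edu
    User $userid
EOF
}
```

After running add_minerva_host your_userid, typing ssh minerva should behave like the full command. You are still prompted for your Sinai password followed by the token code, and off-campus logins still require the VPN.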


File System

  • /hpc/users/<userid> – User HOME directories. 20GB quota. It is NOT purged and is backed up. Generally used for the ‘rc’ and configuration files of various programs.
  • /sc/arion/work/<userid> – A WORK directory for each user. 100GB quota. It is NOT purged and it is NOT backed up. It can be used for any purpose.
  • /sc/arion/scratch/<userid> – A folder for each user inside the /sc/arion/scratch directory. /sc/arion/scratch has a 100TB quota that is shared by all users. Use it in lieu of /tmp for temporary files and for short-term storage of up to 14 days. Files older than 14 days are purged automatically by the system.
  • /sc/arion/projects/<projectid> – A directory for each approved project. PIs can request project storage by submitting an allocation request, which must be renewed annually at https://labs.icahn.mssm.edu/minervalab/event/allocation-renewal-period-20-21/. The quota is set to the approved allocation for the project. It is NOT purged but it is NOT backed up.
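
Because files in scratch are purged after 14 days, it can help to check which files are at risk before they disappear. The sketch below is one possible way to do that with find. The function name list_old_files is hypothetical, and it assumes the purge is based on modification time, which the policy above does not specify.

```shell
# List files under a directory that were last modified at least
# `days` days ago (default 14).
# Assumption: age is measured by modification time; the actual purge
# policy may use a different timestamp.
list_old_files() {
    dir=$1
    days=${2:-14}
    # "-mtime +N" matches files whose age in whole days exceeds N,
    # so N = days - 1 selects files at least `days` days old.
    find "$dir" -type f -mtime "+$((days - 1))"
}
```

For example, list_old_files /sc/arion/scratch/<userid> 10 would list files that are already 10 or more days old, leaving a few days to copy anything important elsewhere.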

 


Queues

Default memory per core is set to 3000MB for all queues.
The queues that are available are:

Queue        Max Walltime  Description
premium      144 hrs       Jobs requesting high priority, with APS doubled to 200. Charged at 150% of the allocation rate.
express      12 hrs        Jobs requiring less than 12 hours of walltime
interactive  12 hrs        Jobs running in interactive mode
long         2 weeks       Jobs requiring more than 144 hours of walltime
gpu          144 hrs       Jobs running on GPU nodes
private      unlimited     Jobs using dedicated resources
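
The walltime limits in the table above can be checked before submitting a job. The following is a small sketch under stated assumptions: the function names queue_max_hours and fits_queue are hypothetical, queue names are assumed to be lowercase, and 2 weeks is taken as 14 × 24 = 336 hours.

```shell
# Print the maximum walltime in hours for a queue, as listed in the
# table above. Prints nothing for "private" (unlimited) and returns 1
# for a queue that is not in the table.
queue_max_hours() {
    case $1 in
        premium|gpu)         echo 144 ;;
        express|interactive) echo 12 ;;
        long)                echo 336 ;;   # 2 weeks = 14 * 24 hours
        private)             : ;;          # unlimited
        *)                   return 1 ;;
    esac
}

# Succeed if a request of $2 hours fits within the limit of queue $1.
# Returns 2 for an unknown queue.
fits_queue() {
    max=$(queue_max_hours "$1") || return 2
    [ -z "$max" ] || [ "$2" -le "$max" ]
}
```

A call such as fits_queue express 13 fails because 13 hours exceeds the 12-hour limit, which suggests choosing a queue with a longer limit instead.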

 


LSF

Minerva uses LSF for batch submission; bsub is the submission command. Options can be given on the command line or in the submission script. If the options are placed in the submission script, however, you must feed the script to the bsub command via stdin for the options to be read, e.g.:

cat MyLSF.script | bsub
or
bsub < MyLSF.script
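
As an illustration of options placed in the submission script, the sketch below writes a small MyLSF.script with #BSUB directives. The job name, core count, walltime, and output file names are hypothetical values chosen for the example, and your site may require further options that are not shown here.

```shell
# Write a minimal submission script; the quoted 'EOF' keeps the
# variables unexpanded until the job actually runs.
cat > MyLSF.script <<'EOF'
#!/bin/bash
# Job name (hypothetical)
#BSUB -J myjob
# Queue to submit to
#BSUB -q express
# Number of cores
#BSUB -n 4
# Walltime limit as hours:minutes
#BSUB -W 2:00
# Files for standard output and error; %J expands to the job ID
#BSUB -o myjob.%J.out
#BSUB -e myjob.%J.err

echo "Job ${LSB_JOBID:-unknown} running on $(uname -n)"
EOF
```

The script is then submitted with bsub < MyLSF.script. Running it directly with sh MyLSF.script only executes the echo line, since the #BSUB lines are ordinary comments to the shell.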

 

Some important points of interest:

 

  • By default, LSF emails the job output and logs to you. This feature is not working yet, so you must use the “-o” option to save the output.
  • In general, the shortest quantum of time in LSF is 1 minute. Wall time is expressed as HHH:MM, with no seconds. Durations are generally in minutes.
  • System-level checkpoints are supported by LSF. There are some “gotchas” (e.g., the default method does not work on our system), so check with the SC staff if you need to do checkpointing.
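
Since wall time is written as hours and minutes without seconds, a duration in minutes can be converted with simple integer arithmetic. This is a minimal sketch; the function name to_walltime is hypothetical.

```shell
# Convert a duration in whole minutes to the hours:minutes form
# used for wall time (no seconds).
to_walltime() {
    printf '%d:%02d\n' "$(($1 / 60))" "$(($1 % 60))"
}

to_walltime 90    # prints 1:30
```

The result can be passed on the command line, for example bsub -W "$(to_walltime 90)" -o myjob.out < MyLSF.script, where myjob.out is a placeholder file name.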

Some useful commands:

bjobs – shows all your jobs in the queue
bpeek – peek at your output before the job ends
bqueues – what queues are available
bkill – kill a job

Check the man pages for all options.