Disaster Recovery Plan

Introduction

The Scientific Computing Facility is in the process of developing a Disaster Recovery Plan. We are currently in the first phase of that work: requirements collection. In this phase we are talking with scientists to determine how quickly we would need to recover from a disaster, so that we understand both their requirements and our recovery objectives. A draft plan will be developed and released for comment within the scientific computing community to ensure that no requirement is inadvertently overlooked. In the next phase, an implementation plan will be developed and we will purchase the equipment and software needed to meet the requirements outlined by the researchers, within the available budget. Finally, we will complete the disaster recovery document and share the plan.

This document is the first draft of the Plan; it provides an outline and an informed, initial baseline for discussion. The Plan consists of four parts: the recovery point and time objectives, and the preventative, detective, and corrective measures. The Plan will be updated twice a year, evolving in response to updated requirements and new resources.


The Scientific Computing Facility

The Scientific Computing Facility consists of the Minerva supercomputer, which provides a total of 2.0 petaflops of compute power. Its compute nodes are:

- 353 nodes with two Intel Xeon Platinum 8268 24C 2.9 GHz sockets (48 cores per node, 16,944 cores in total) and 192 GB of memory per node;
- 55 nodes with two Intel Xeon Platinum 8358 32C 2.6 GHz sockets (64 cores per node, 3,520 cores in total) and 1.5 TB of memory per node;
- 36 nodes with two Intel Xeon Platinum 8268 24C 2.9 GHz sockets (48 cores per node, 1,728 cores in total) and 1.5 TiB of memory per node;
- GPU nodes with two Intel Xeon Gold 6142 16C 2.6 GHz sockets (32 cores per node, 416 cores in total), 390 GiB of memory, and 4 NVIDIA V100 GPU cards per node;
- GPU nodes with two Intel Xeon Platinum 8268 24C 2.9 GHz sockets (48 cores per node, 384 cores in total), 390 GiB of memory, and 4 NVIDIA A100 GPU cards per node.

The system has 210 terabytes of total memory, 350 terabytes of solid-state storage, and 32 petabytes of usable spinning storage accessed via IBM’s Spectrum Scale/General Parallel File System (GPFS). Scientists in the medical school use this facility for research only.
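As a quick consistency check on these figures, the per-partition core totals follow directly from the node counts and cores per node. The short script below is a sketch that simply restates the numbers above (the GPU node counts are implied by dividing the stated core totals by cores per node) and reproduces the totals:

    # Consistency check on the partition figures listed above.
    # Each entry is (cores_per_node, node_count).
    partitions = {
        "Platinum 8268, 192 GB":  (48, 353),        # 16,944 cores
        "Platinum 8358, 1.5 TB":  (64, 55),         # 3,520 cores
        "Platinum 8268, 1.5 TiB": (48, 36),         # 1,728 cores
        "Gold 6142 + V100":       (32, 416 // 32),  # 416 cores -> 13 nodes implied
        "Platinum 8268 + A100":   (48, 384 // 48),  # 384 cores -> 8 nodes implied
    }

    total = 0
    for name, (cores_per_node, nodes) in partitions.items():
        cores = cores_per_node * nodes
        total += cores
        print(f"{name}: {nodes} nodes x {cores_per_node} cores = {cores} cores")
    print(f"Total compute cores: {total}")  # 22,992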

Recovery Point and Time Objectives
The Recovery Point Objective (RPO) is the maximum tolerable period of time over which data might be lost due to a disaster. The Recovery Time Objective (RTO) is the duration of time, and the service level, within which service must be restored after a disaster. For example, an RPO of 24 hours means that at most one day of data may be lost, which nightly backups can satisfy; an RTO of one week means service must be restored within one week of the event.

It is our current belief that the compute capacity and the data stored in the Scientific Computing Facility are not under major time pressure to be recovered in the event of a disaster. As we collect requirements from the scientists, we will revisit and revise this assumption and align our disaster recovery strategy accordingly.

Preventative Measures
These are controls that are aimed at preventing an event from occurring.

Several preventative measures are already in place, and we anticipate implementing further safeguards to protect Minerva and its associated data storage. First, the equipment is on an Uninterruptible Power Supply (UPS), providing Minerva with clean power and protection from sags, surges, and momentary losses of power. Minerva also benefits from the building security and fire detection systems that are in place. Minerva's scratch disk storage is a RAID6 configuration, which tolerates two simultaneous disk failures; three disks in the same RAID set would have to be lost simultaneously before data is lost. This is a work or scratch file system and we do not provide backup services for it; researchers are cautioned about this. The servers for the scratch storage are also set up in a redundant configuration; this redundancy protects operations against server failure but is not intended to address disaster recovery.

To address the long-term preservation of data, the Scientific Computing Facility is evaluating and purchasing the equipment and software that will provide a secure and protected archival storage capability for the scientists and researchers at Mount Sinai. This will enable them to protect files and data generated on Minerva or in their laboratories in case of any kind of disaster within the Mount Sinai buildings. The archival storage system will include an automated tape storage system. The operational plan for this system will include creating multiple copies of the data on different tapes and regularly rotating one copy offsite. Possible duplication of files to a remote tape storage system is also under consideration. We will regularly check that the files written to tape remain readable by periodically recalling and scanning tapes, both from within the tape library and from any stored off-site. We will provide and/or regularly use a checksum system to ensure that the files written to tape are identical to those on disk. We also plan to back up user home directories and project directories nightly to this same archival storage system. The exact configuration and procedures will be documented and updated in this Plan as they are determined. The archival storage system will contain data subject to minimum retention periods under regulations and/or laws (NIH, etc.); however, compliance with those regulations and laws is the responsibility of the researcher, not of the facility itself.
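The exact tooling for the checksum step has not been chosen, but it could take a form along the following lines. This is a minimal sketch, not the facility's procedure: the use of SHA-256, the manifest format, and the /archive/staging paths are all assumptions made for illustration.

    import hashlib
    import sys
    from pathlib import Path

    CHUNK = 1024 * 1024  # read files in 1 MiB chunks to keep memory use flat

    def sha256_of(path: Path) -> str:
        """Return the SHA-256 hex digest of a file."""
        digest = hashlib.sha256()
        with path.open("rb") as fh:
            for chunk in iter(lambda: fh.read(CHUNK), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def write_manifest(src_dir: Path, manifest: Path) -> None:
        """Record a checksum for every file before it is written to tape."""
        with manifest.open("w") as out:
            for path in sorted(p for p in src_dir.rglob("*") if p.is_file()):
                out.write(f"{sha256_of(path)}  {path.relative_to(src_dir)}\n")

    def verify_manifest(restored_dir: Path, manifest: Path) -> bool:
        """After recalling files from tape, confirm they match the recorded checksums."""
        ok = True
        for line in manifest.read_text().splitlines():
            expected, name = line.split("  ", 1)
            restored = restored_dir / name
            if not restored.is_file() or sha256_of(restored) != expected:
                print(f"MISMATCH: {name}", file=sys.stderr)
                ok = False
        return ok

    if __name__ == "__main__":
        # Hypothetical paths; the real staging and manifest locations are TBD.
        write_manifest(Path("/archive/staging"), Path("/archive/staging.manifest"))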

Detective Measures
These controls are aimed at detecting or discovering unwanted events.

The Scientific Computing Facility will institute system-wide monitoring using Nagios. We will be notified via email and text message of a variety of system and file-system anomalies so that we can examine their impact in real time. We will also place a monitor in the Sinai Computer Room Operations Center, along with a procedure and call list, so that operators can inform us of any event out of the ordinary. We have also been added to the Mount Sinai Medical Center Notification System and have tested this functionality.
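For illustration only, a file-system usage check of the kind Nagios can schedule might look like the sketch below. The mount point /gpfs/scratch and the 85%/95% thresholds are hypothetical placeholders; the actual checks will be defined in the monitoring configuration.

    #!/usr/bin/env python3
    # Nagios-style plugin: exit 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
    import shutil
    import sys

    MOUNT = "/gpfs/scratch"   # hypothetical GPFS mount point
    WARN, CRIT = 85.0, 95.0   # hypothetical utilization thresholds (percent)

    def main() -> int:
        try:
            usage = shutil.disk_usage(MOUNT)
        except OSError as exc:
            print(f"UNKNOWN: cannot stat {MOUNT}: {exc}")
            return 3
        pct = 100.0 * usage.used / usage.total
        msg = f"{MOUNT} is {pct:.1f}% full"
        if pct >= CRIT:
            print(f"CRITICAL: {msg}")
            return 2
        if pct >= WARN:
            print(f"WARNING: {msg}")
            return 1
        print(f"OK: {msg}")
        return 0

    if __name__ == "__main__":
        sys.exit(main())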

Corrective Measures
These controls are aimed at correcting or restoring the system after a disaster.

If a disaster should occur and our computing and data infrastructure is not available, there are no special procedures for our users to follow. Since we do not support patient care, there is no strict requirement on how much downtime can be tolerated.

We will develop a procedure documenting the series of steps required to bring the system back up after a power outage or disaster. We will publish on our website the process for recovering data from tape so that scientists can bring their data back. If data in archival storage or on the work/scratch file system is needed urgently after a disaster, we will explore moving disks and/or setting up servers in other locations to retrieve the data.
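The recovery procedure itself is still to be written. As a placeholder for discussion, the sketch below shows one possible shape for it: verify each dependency in order and stop at the first failure so staff know where to look. The mount point, hostnames, and ports are hypothetical, not the facility's actual configuration.

    #!/usr/bin/env python3
    # Sketch of an ordered bring-up check after an outage or disaster.
    import os
    import socket
    import sys

    def gpfs_mounted() -> bool:
        # Hypothetical mount point for the GPFS file system.
        return os.path.ismount("/gpfs")

    def host_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
        # Hypothetical hosts/ports; a TCP connect stands in for a health check.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    STAGES = [
        ("GPFS file system mounted", gpfs_mounted),
        ("scheduler responding", lambda: host_reachable("scheduler.example", 6817)),
        ("login node reachable", lambda: host_reachable("login1.example", 22)),
    ]

    def main() -> int:
        for name, check in STAGES:
            if check():
                print(f"OK:   {name}")
            else:
                print(f"FAIL: {name} -- stop here and investigate before continuing")
                return 1
        print("All bring-up checks passed.")
        return 0

    if __name__ == "__main__":
        sys.exit(main())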

For system dependencies, we rely on Mount Sinai's computer room infrastructure, including networking, to bring our computers back online. We do not have a special request for high-priority recovery of our machines or data.