Scientific Computing and Data

Partnering with researchers to advance scientific discovery

About Data Ark

The Data Ark team downloads and organizes the data and performs quality assurance and quality control on them. The team also manages the data access process, answers questions about the data, and updates the data sets to their latest versions. The Data Ark is located on Minerva at /sc/arion/projects/data-ark/. This Mount Sinai data commons is guided by the FAIR principles [1]: making data more findable, accessible, interoperable and reusable. Data Ark includes both public (restricted and unrestricted) and Sinai-generated data sets.

The overarching goal of the Data Ark is to ensure that research data at Mount Sinai are managed, processed and combined in a way that optimizes the power, pace and relevance of our science.

  • Power: Scientists typically use only a tiny fraction of available data
  • Pace: Users will have rapid access to huge, powerful research data sets
  • Relevance: Our diverse patient population is ideal for testing the generalizability of our results

Data Ark is an initiative led by Associate Professor Paul O’Reilly and Dean for Scientific Computing and Data Patricia Kovatch, and supported by the Department of Genetics and Genomic Sciences and Scientific Computing. An advisory board has been convened to provide guidance and to help Data Ark become sustainable over time.


Access Data Ark

First, all users must read, agree to, and sign the Data Use Agreement specific to the requested data set. Once the agreement has been submitted, along with any evidence of approved permission for public restricted-use data, the Data Ark team will grant access within two working days. Users will receive email confirmation that access has been granted.

The Data Use Agreement is accessible only through the Mount Sinai campus network or secure remote VPN. Open the Data Use Agreement page and choose the data set that you would like to access from the drop-down list. From there, you can follow the link to view and agree to the specific Data Use Agreement. Users will need to log in with their Sinai account and password and can choose only one data set at a time.
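
Once access has been granted, one way to confirm it from a Minerva session is to check which directories under /sc/arion/projects/data-ark/ your account can read. The Python sketch below is illustrative only. It assumes that each data set sits in its own subdirectory directly under that path, which this page does not confirm, and the function name readable_subdirs is hypothetical.

```python
import os
from pathlib import Path

# Location given on this page; the layout beneath it is an assumption.
DATA_ARK_ROOT = Path("/sc/arion/projects/data-ark")


def readable_subdirs(root):
    """Return the names of immediate subdirectories of `root` that the
    current user can list and enter, in sorted order."""
    root = Path(root)
    if not root.is_dir():
        return []
    try:
        entries = sorted(root.iterdir())
    except PermissionError:
        # The root itself is not listable for this account.
        return []
    names = []
    for entry in entries:
        # R_OK allows listing the directory; X_OK allows entering it.
        if entry.is_dir() and os.access(entry, os.R_OK | os.X_OK):
            names.append(entry.name)
    return names


if __name__ == "__main__":
    for name in readable_subdirs(DATA_ARK_ROOT):
        print(name)
```

If a data set for which you have received email confirmation does not appear in the output, the actual directory layout may differ from the one assumed here, so contact the Data Ark team rather than relying on this check.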

For more information

For all inquiries relating to the Data Ark, please email:


Data Ark Data Sets, Version 1

As of launch, the Data Ark consists of the seven data sets listed below (click links for dedicated data set pages). We plan to expand the number, type and diversity of data sets over the next year.

Public data sets (unrestricted):

  • 1000 Genomes Project – Whole Genome Sequencing (WGS) data on ~1,000 individuals of mixed ancestry
  • GTEx – Gene expression data on hundreds of individuals across ~50 tissues
  • GWAS Summary Stats – Genome-Wide Association Study (GWAS) results in a standardized format across thousands of outcomes

Public data sets (restricted):

  • UK Biobank – Genetic data (genotype and whole exome sequencing, WES) from the UK Biobank on 500,000 individuals
  • TCGA – coming soon

Mount Sinai generated data (unrestricted):

  • STOP COVID NYC Cohort – Symptom and behavior data relating to COVID-19 on ~50,000 New York City residents surveyed via phone apps in April 2020

Mount Sinai generated data (restricted):

