Data Ark Data Sets

The Data Ark is hosted on Minerva. The number, type, and diversity of its data sets continue to grow.


Public Data Sets (Unrestricted)
Data Set | Description
1000 Genomes Project | Phase 3 of the 1000 Genomes Project: individual-level called genotype data (VCF format) on ~2,500 individuals of mixed ancestry
GTEx | Gene expression data on hundreds of individuals across ~50 tissues
GWAS Summary Stats | Genome-wide association study (GWAS) results in a standardized format across thousands of outcomes
gnomAD | The Genome Aggregation Database (gnomAD) aggregates and harmonizes both exome and genome sequencing data from a wide variety of large-scale sequencing projects
TCGA (The Cancer Genome Atlas) | The Cancer Genome Atlas (TCGA) molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types
eQTLGen Consortium | The Expression Quantitative Trait Locus (eQTLGen) Consortium was set up to identify the downstream consequences of trait-related genetic variants
UKBB-LD | The UK Biobank linkage disequilibrium (LD) data set can be used for post-GWAS analyses such as fine-mapping
LDSCORE | Baseline LD scores generated by the Broad Institute for use in LD Score regression analysis
BLAST | The National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST) and related databases
Reference Genome | A collection of the most frequently used human and mouse reference genomes and related annotations



Public Data Sets (Restricted)
Data Set | Description
UK Biobank | Genetic data (genotype/WES) from the UK Biobank on 500,000 individuals



Mount Sinai Generated Data (Unrestricted)
Data Set | Description
STOP COVID NYC Cohort | Symptom and behavior data related to COVID-19 from ~50,000 New York City residents surveyed via phone apps in April 2020



Mount Sinai Generated Data (Restricted)
Data Set | Description
Mount Sinai Data Warehouse COVID-19 Electronic Health Record (EHR) Data Set | De-identified clinical data from Caboodle on patients with, or suspected of having, COVID-19, containing 350 data elements and updated daily
The Mount Sinai COVID-19 Biobank | Blood samples from hundreds of COVID-19 patients hospitalized at Mount Sinai, with genotype/WGS data available
The Living Brain Project | A multiscale, data-driven investigation of the human brain in which a single living population is studied using the full human subject neuroscience toolkit


School-Acquired Data Sets

The following data set was purchased by the Icahn School of Medicine at Mount Sinai and is restricted to School users:

Data (Restricted)
Data Set | Description
IBM® MarketScan® | The IBM® MarketScan® Research Databases provide one of the longest-running and largest collections of proprietary de-identified claims data for privately and publicly insured people in the U.S.


Data Set Supplements – Open Soon Through Data Ark

The following data set supplements are hosted on Minerva but are not yet available through Data Ark:

Data Supplements Hosted Through Minerva (Restricted)
Data Set | Description
The Imaging Research Warehouse (IRW) 1.0 | De-identified image slices (over 217 million) for over 700,000 MSHS studies from 2017-2021. Image modalities include DX, CT, CR, MR, MG, and NM
The CBIPM-Biome Data Set | An electronic medical record-linked blood/serum biobank with over 50,000 enrolled participants. Genetic, epidemiologic, and molecular data are available, including whole exome sequencing (WES) data for a diverse cohort of over 30,000 individuals with many ancestral and cultural backgrounds
CIB (Cancer Institute Biorepository) | In combination with Freezerworks, CIB supports the biorepository in annotating, managing, and searching donor and sample (tissue and fluid) information, consent status, clinical annotations, and sample tracking


Helpful Links to External Data Sets

The following data set is not hosted on Minerva, but it can be referenced and used as an external data set:

  • All of Us – An ambitious effort to gather health data from one million or more people living in the United States to accelerate research that may improve health



Access Data Ark

For Public Unrestricted data sets, you can simply access the following path on Minerva:
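
Once you have the path, a quick way to confirm what you can read under it is to list its subdirectories. The following is a minimal Python sketch, not an official Data Ark tool: the `DATA_ARK_ROOT` environment variable, the placeholder value `/path/to/data-ark`, and the `list_data_sets` helper are all hypothetical names introduced for illustration, and the directory layout is assumed to be one subdirectory per data set.

```python
from __future__ import annotations

import os
from pathlib import Path

# Hypothetical placeholder: substitute the real Data Ark path on Minerva.
# Neither the environment variable name nor the default value is official.
DATA_ARK_ROOT = Path(os.environ.get("DATA_ARK_ROOT", "/path/to/data-ark"))


def list_data_sets(root: Path) -> list[str]:
    """Return the sorted names of subdirectories under root that you can read."""
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory; check the Data Ark path")
    return sorted(
        entry.name
        for entry in root.iterdir()
        # Keep only directories the current user can both list and enter.
        if entry.is_dir() and os.access(entry, os.R_OK | os.X_OK)
    )
```

Calling `list_data_sets(DATA_ARK_ROOT)` on a Minerva node returns only the directories your account can open, so restricted data sets you have not yet been granted access to would simply be absent from the result under this assumed layout.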


For any other data set, you must read, agree to, and sign the Data Use Agreement specific to the requested data set. Once you have submitted the agreement, along with any evidence of approved permission for public restricted-use data, the Data Ark team will grant access within two working days. You will receive email confirmation that access has been granted.

The Data Use Agreement is accessible only through the Mount Sinai campus network or secure remote VPN. Open the Data Use Agreement page and choose the data set that you would like to access from the drop-down list. From there you can follow the link to view and agree to the specific Data Use Agreement. You will need to log in with your Sinai account and password, and you can choose only one data set at a time.

For all inquiries relating to the Data Ark please email:
