Data Ark Data Sets

The Data Ark is located on Minerva. The number, type, and diversity of its data sets are increasing on an ongoing basis.


Public Data Sets (Unrestricted)
Data Set | Description
1000 Genomes Project | Phase 3 of the 1000 Genomes Project: individual-level called genotype data (VCF format) for ~2,500 individuals of mixed ancestry.
BLAST | The National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST) and related database.
eQTLGen Consortium | The Expression Quantitative Trait Locus (eQTLGen) Consortium was set up to identify the downstream consequences of trait-related genetic variants.
Genebass | A resource of association statistics, encompassing 4,529 phenotypes with gene-based and single-variant testing across 394,841 individuals in exome sequence data from the UK Biobank.
Genome Aggregation Database (gnomAD) | Aggregates and harmonizes both exome and genome sequencing data from a wide variety of large-scale sequencing projects.
Genome-wide Association Study (GWAS) Summary Stats | GWAS results in a standardized format across thousands of outcomes.
Genotype-Tissue Expression (GTEx) Project | Gene expression data on hundreds of individuals across ~50 tissues.
Linkage Disequilibrium (LD) Score Regression Data | Baseline LD score data generated by the Broad Institute for use in LD Score regression analysis.
Reference Genome | A collection of the most frequently used human and mouse reference genomes and related annotations.
The Cancer Genome Atlas (TCGA) | TCGA molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
UK Biobank (UKBB)-Linkage Disequilibrium (LD) | The UK Biobank Linkage Disequilibrium data set can be used for post-GWAS analyses such as fine-mapping.


Public Data Sets (Restricted)
Data Set | Description
UK Biobank | Genetic data (genotype/WES) from the UK Biobank on 500,000 individuals.

Mount Sinai Generated Data (Restricted)
Data Set | Description
CBIPM-BioMe Data | An electronic medical record-linked blood/serum biobank with over 50,000 enrolled participants. Genetic, epidemiologic, and molecular data are available, including whole exome sequencing (WES) data for a diverse cohort of over 30,000 individuals from many ancestral and cultural backgrounds.
Living Brain Project | A multiscale, data-driven investigation of the human brain wherein a single living population is being studied using the full human subject neuroscience toolkit.
Mount Sinai COVID-19 Biobank | Blood samples from hundreds of COVID-19 patients hospitalized at Mount Sinai, with genotype/WGS data available.
Mount Sinai Data Warehouse (MSDW) COVID-19 Electronic Health Record (EHR) Data | De-identified clinical data from Caboodle on patients with, or suspected of having, COVID-19. Contains 350 data elements and is updated daily.
Mount Sinai Data Warehouse (MSDW) De-identified Observational Medical Outcomes Partnership (OMOP) Data | The OHDSI consortium develops the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) and related open-source software tools to enable biomedical research.
Stop COVID NYC Cohort | A phone-based survey of SARS-CoV-2 symptoms and a variety of measures at baseline and daily follow-up, collected from 45,865 New York City residents.

School-Acquired Data Sets

The following data set was purchased by the Icahn School of Medicine at Mount Sinai and is restricted to School users only:

Data (Restricted)
Data Set | Description
Merative® MarketScan® | The Merative® MarketScan® Research Databases provide one of the longest-running and largest collections of proprietary de-identified claims data for privately and publicly insured people in the U.S.


Data Sets Available Soon On Data Ark

The following data sets are data supplements hosted on Minerva but not yet available through Data Ark:

Data Supplements Hosted Through Minerva (Restricted)
Data Set | Description
Cancer Institute Biorepository (CIB) | In combination with Freezerworks, CIB helps the biorepository annotate, manage, and search donor and sample (tissue and fluid) information, consent status, clinical annotations, and sample tracking.
Imaging Research Warehouse (IRW) 1.0 | De-identified image slices (over 217 million slices) for over 700,00 MSHS studies from 2017-2021. Image modalities include DX, CT, CR, MR, MG, and NM.


Helpful Links to External Data Sets

The following data set is not hosted on Minerva, but it can be referenced and used as an external data set:

  • All of Us – An ambitious effort to gather health data from one million or more people living in the United States to accelerate research that may improve health

Access Data Ark

Effective January 22, 2024, you must read, agree to, and sign the Data Use Agreement to access public, Mount Sinai-generated, and restricted data sets (you must be logged in through the Mount Sinai campus network or secure remote VPN). Access is granted within 24 hours. On Minerva, you can then load the module to see the path variables:

$ module load dataark
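The page does not list the path variables that the dataark module defines, so the sketch below does not assume any variable names. It snapshots the environment before and after a command and prints only the entries that are new or changed. The helper name show_new_vars is hypothetical and is not part of Minerva or the module system.

```shell
# Hypothetical helper: print environment variables added or changed by a command.
# The command runs in the current shell, so shell functions such as `module` work.
show_new_vars() {
    _before=$(mktemp) && _after=$(mktemp) || return 1
    env | sort > "$_before"          # snapshot before the command
    "$@"                             # e.g. module load dataark
    _status=$?
    env | sort > "$_after"           # snapshot after the command
    # Lines found only in the second snapshot are new or modified variables.
    comm -13 "$_before" "$_after"
    rm -f "$_before" "$_after"
    return "$_status"
}

# Intended use on a Minerva login or compute node (not run here):
#   show_new_vars module load dataark
```

As an alternative, `module show dataark` should print what the modulefile sets without loading it, since both Lmod and Environment Modules provide a `show` subcommand.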

The Data Use Agreement is accessible only through the Mount Sinai campus network or secure remote VPN. Click here for the Data Use Agreement and choose the data set that you would like to access from the drop-down list. From there, follow the link to view and agree to the specific Data Use Agreement. You will need to log in with your Sinai account and password, and you can choose only one data set at a time.

For all inquiries relating to the Data Ark please email: