Data Ark Data Sets
The Data Ark is hosted on Minerva, and the number, type, and diversity of its data sets continue to grow.
Public Data Sets (Unrestricted) | |
Data Set | Description |
1,000 Genomes Project | Phase 3 of the 1000 Genomes Project: individual-level called genotype data (VCF format) on ~2,500 individuals of mixed ancestry |
GTEx | Gene expression data on hundreds of individuals across ~50 tissues |
GWAS Summary Stats | Genome Wide Association Studies (GWAS) results in a standardized format across thousands of outcomes |
gnomAD | The Genome Aggregation Database (gnomAD) aggregates and harmonizes both exome and genome sequencing data from a wide variety of large-scale sequencing projects |
TCGA (The Cancer Genome Atlas) | The Cancer Genome Atlas (TCGA) molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types |
eQTLGen Consortium | The Expression Quantitative Trait Locus (eQTLGen) Consortium has been set up to identify the downstream consequences of trait-related genetic variants. |
UKBB-LD | The UK Biobank Linkage Disequilibrium data set can be used for post-genome-wide association study (GWAS) analyses such as fine-mapping |
LDSCORE | Baseline LD score information generated by the Broad Institute for use in LD Score regression analysis |
BLAST | The National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST) and related databases |
Reference Genome | A collection of the most frequently used human and mouse reference genomes and related annotations |
Public Data Sets (Restricted) | |
Data Set | Description |
UK Biobank | Genetic data (genotype/WES) from the UK Biobank on 500,000 individuals |
Mount Sinai Generated Data (Unrestricted) | |
Data Set | Description |
STOP COVID NYC Cohort | Symptom and behavior data on COVID-19 from ~50,000 New York City residents surveyed via phone apps in April 2020 |
Mount Sinai Generated Data (Restricted) | |
Data Set | Description |
Mount Sinai Data Warehouse COVID-19 Electronic Health Record (EHR) Data Set | De-identified clinical data from Caboodle on patients with confirmed or suspected COVID-19, containing 350 data elements and updated daily |
The Mount Sinai COVID-19 Biobank | Blood samples from hundreds of COVID-19 patients hospitalized at Mount Sinai, with genotype/WGS data available |
The Living Brain Project | A multiscale, data-driven investigation of the human brain wherein a single living population is being studied using the full human subject neuroscience toolkit. |
School-Acquired Data Sets
The following data set is purchased by the Icahn School of Medicine at Mount Sinai and is restricted to School users only:
Data (Restricted) | |
Data Set | Description |
IBM® MarketScan® | The IBM® MarketScan® Research Databases provide one of the longest-running and largest collections of proprietary de-identified claims data for privately and publicly insured people in the U.S. |
Data Set Supplements – Open Soon Through Data Ark
The following data set supplements are hosted on Minerva but are not yet open through Data Ark:
Data Supplements Hosted Through Minerva (Restricted) | |
Data Set | Description |
The Imaging Research Warehouse (IRW) 1.0 | De-identified image slices (over 217 million slices) for over 700,00 MSHS studies from 2017-2021. Image modalities include DX, CT, CR, MR, MG and NM |
An electronic medical record-linked blood/serum biobank with over 50,000 enrolled participants. Genetic, epidemiologic, and molecular data are available, including whole exome sequencing (WES) data for over 30,000 individuals from a diverse cohort with many ancestral and cultural backgrounds | |
CIB (Cancer Institute Biorepository) | In combination with Freezerworks, CIB supports the biorepository in annotating, managing, and searching donor and sample (tissue and fluid) information, consent status, clinical annotations, and sample tracking. |
Helpful Links to External Data Sets
The following data set is not hosted on Minerva, but it can be referenced and used as an external data set:
- All of Us – An ambitious effort to gather health data from one million or more people living in the United States to accelerate research that may improve health
Access Data Ark
For Public Unrestricted data sets, you can simply access the following path on Minerva:
/sc/arion/projects/data-ark/Public_Unrestricted
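As a minimal sketch of browsing that path from Python on a Minerva node: the helper below lists the subdirectories directly under the public path. The function name `list_data_sets` is hypothetical, and the idea that each data set sits in its own top-level subdirectory is an assumption; the actual layout may differ.

```python
from pathlib import Path

# Path quoted from the text above; it exists only on Minerva.
PUBLIC_ROOT = Path("/sc/arion/projects/data-ark/Public_Unrestricted")


def list_data_sets(root):
    """Return the sorted names of the subdirectories directly under root.

    Assumption (not confirmed by this page): each data set lives in its
    own top-level subdirectory. Returns an empty list if root is not a
    directory, for example when run on a machine other than Minerva.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__":
    for name in list_data_sets(PUBLIC_ROOT):
        print(name)
```

Run off Minerva, the script prints nothing because the path does not exist there.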
For any other data set, users must read, agree to, and sign the Data Use Agreement specific to that data set. For public restricted-use data, users must also submit evidence of approved permission. Once these have been submitted, the Data Ark team will grant access within two working days, and users will receive email confirmation that access has been granted.
The Data Use Agreement is accessible only through the Mount Sinai campus network or secure remote VPN. Go to the Data Use Agreement and choose the data set that you would like to access from the drop-down list. From there you can follow the link to view and agree to the specific Data Use Agreement. You will need to log in with your Sinai account and password, and you can choose only one data set at a time.
For all inquiries relating to the Data Ark please email: data-ark-team@lists.mssm.edu