Data Ark Data Sets

The Data Ark is hosted on Minerva, and the number, type, and diversity of its data sets continue to grow.

Public Data Sets (Unrestricted)
Data Set	Description
1,000 Genomes Project	Phase 3 of the 1000 Genomes Project: individual-level called genotype data (VCF format) for ~2,500 individuals of mixed ancestry.
BLAST	The National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST) and related database.
eQTLGen Consortium	The Expression Quantitative Trait Locus (eQTLGen) Consortium has been set up to identify the downstream consequences of trait-related genetic variants.
Genebass	Genebass is a resource of association statistics, encompassing 4,529 phenotypes with gene-based and single-variant testing across 394,841 individuals in exome sequence data from the UK Biobank.
Genome Aggregation Database (gnomAD)	The Genome Aggregation Database (gnomAD) aggregates and harmonizes both exome and genome sequencing data from a wide variety of large-scale sequencing projects.
Genome-wide Association Study (GWAS) Summary Stats	Genome-wide Association Studies (GWAS) results in a standardized format across thousands of outcomes.
Genotype-Tissue Expression (GTEx) Project	Gene expression data on hundreds of individuals across ~50 tissues.
Linkage Disequilibrium (LD) Score Regression Data	Baseline LD scores generated by the Broad Institute for use in LD Score regression analysis.
Reference Genome	A collection of the most frequently used human and mouse reference genome and related annotations.
The Cancer Genome Atlas (TCGA)	The Cancer Genome Atlas (TCGA) molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types.
UK Biobank (UKBB)-Linkage Disequilibrium (LD)	The UK Biobank Linkage Disequilibrium data set can be used for post-GWAS (genome-wide association study) analyses such as fine-mapping.

Mount Sinai Generated Data (Restricted)
Data Set	Description
De-identified Digital Pathology Slides	A digital archive of over 1.5 million whole slide images encompassing a broad spectrum of biopsies, resections, and autopsies, covering diverse diseases and a wide range of patient backgrounds.
Living Brain Project	A multiscale, data-driven investigation of the human brain wherein a single living population is being studied using the full human subject neuroscience toolkit.
Mount Sinai COVID-19 Biobank	Blood samples from hundreds of COVID-19 patients hospitalized at Mount Sinai, with genotype/WGS data available.
Mount Sinai Data Warehouse (MSDW) COVID-19 Electronic Health Record (EHR) Data	De-identified clinical data from Caboodle on patients with confirmed or suspected COVID-19, containing 350 data elements and updated daily.
Mount Sinai Data Warehouse (MSDW) De-identified Observational Medical Outcomes Partnership (OMOP) Data	The OHDSI consortium develops the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) and related open-source software tools to enable biomedical research.
Stop COVID NYC Cohort	A phone-based survey of SARS-CoV-2 symptoms and a variety of measures at baseline and daily follow-up, collected from 45,865 New York City residents.

Data Sets Available Soon On Data Ark

The following supplemental data sets are hosted on Minerva but are not yet available through Data Ark:

Data Supplements Hosted Through Minerva (Restricted)
Data Set	Description
Cancer Institute Biorepository (CIB)	In combination with Freezerworks, CIB supports the biorepository to annotate, manage and search donor and sample (tissue and fluid) information, consent status, clinical annotations, and sample tracking.
Imaging Research Warehouse (IRW) 1.0	De-identified image slices (over 217 million slices) for over 700,000 MSHS studies from 2017-2021. Image modalities include DX, CT, CR, MR, MG and NM.

Helpful Links to External Data Sets

The following data set is not hosted on Minerva, but it can be referenced and used as an external data set:

All of Us – An ambitious effort to gather health data from one million or more people living in the United States to accelerate research that may improve health.

Access Data Ark

Effective January 22, 2024, you must read, agree to, and sign the Data Use Agreement to access public, Mount Sinai-generated, and restricted data sets. Access is granted within 24 hours. Once access is granted, load the module on Minerva to see the path variables:

$ module load dataark
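As a rough sketch of how the path variables might be inspected from a Python script, the example below filters the current environment for variables whose names contain a keyword and reports whether each value is an existing directory. It assumes the script is started from a shell session in which the dataark module has already been loaded. The keyword "DATAARK" and the function name dataark_variables are hypothetical, chosen for illustration only; the actual variable names set by the module are not stated here and should be checked on Minerva.

```python
import os


def dataark_variables(environ=None, keyword="DATAARK"):
    """Return environment variables whose name contains `keyword`.

    The match is case-insensitive. The default keyword is an assumption,
    not a documented naming convention of the dataark module.
    """
    env = os.environ if environ is None else environ
    key = keyword.lower()
    return {
        name: value
        for name, value in sorted(env.items())
        if key in name.lower()
    }


if __name__ == "__main__":
    # Print each matching variable and whether its value is a directory
    # visible from the current node.
    for name, value in dataark_variables().items():
        status = "ok" if os.path.isdir(value) else "not a directory"
        print(f"{name}={value} ({status})")
```

If nothing is printed, either the module is not loaded in the current session or its variables use a different naming pattern than the one assumed above.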

The Data Use Agreement is accessible only through the Mount Sinai campus network or secure remote VPN. Click here for the Data Use Agreement and choose the data set that you would like to access from the drop-down list. From there, follow the link to view and agree to the specific Data Use Agreement. You will need to log in with your Sinai account and password, and you can choose only one data set at a time.

For all inquiries relating to the Data Ark please email: hpchelp@hpc.mssm.edu