Data Ark: a Data Commons for Mount Sinai
Increasing the power, pace and relevance of our science
Image by Jessica Johnson ©. See www.jessicajohnsonart.com
The overarching goal of the Data Ark is to ensure that research data at Mount Sinai are managed, processed and combined in a way that optimizes the power, pace and relevance of our science.
- Power: Scientists typically use only a tiny fraction of available data
- Pace: Users will have rapid access to huge, powerful research data sets
- Relevance: Our diverse patient population is ideal for testing the generalizability of our results
Data Ark Data Sets, version 1 (3/1/21):
As of launch, the Data Ark consists of the seven data sets listed below (click links for dedicated data set pages). We plan to expand the number, type and diversity of data sets over the next year.
Public data sets (unrestricted):
- 1000 Genomes Project – Whole Genome Sequencing (WGS) data on ~1,000 individuals of mixed ancestry
- GTEx – Gene expression data on hundreds of individuals across ~50 tissues
- GWAS Summary Stats – Genome-Wide Association Study (GWAS) results in a standardized format across thousands of outcomes
Public data sets (restricted):
- UK Biobank – Genetic data (genotype/WES) from the UK Biobank on 500,000 individuals
- TCGA data – COMING SOON!!
Mount Sinai generated data (unrestricted):
- STOP COVID NYC Cohort – symptom and behavior data relating to COVID-19 from ~50,000 New York City residents surveyed via phone apps in April 2020
Mount Sinai generated data (restricted):
- Mount Sinai Data Warehouse COVID-19 Electronic Health Record (EHR) Data Set – de-identified clinical data from Caboodle on patients with confirmed or suspected COVID-19, containing 350 data elements and updated daily
- The Mount Sinai COVID-19 Biobank – blood samples from hundreds of COVID-19 patients hospitalized at Mount Sinai, with genotype/WGS data available.
How can I access the data sets?
You must read, agree to and sign the Data Use Agreement specific to each data set that you want to access. Once the agreement has been submitted, along with any evidence of approved permission required for restricted-use public data, the Data Ark team will grant access within two working days. You will be notified by email that access has been granted.
To access the Data Use Agreement page, you must be connected through the Mount Sinai campus network or the secure remote VPN. Please click here and choose the data set that you would like to access from the drop-down list. From there you can follow the link to view and agree to the specific data use agreement. You will need to log in with your Sinai account and password. You can choose only one data set at a time.
Help us?
We need your help to keep the Data Ark afloat: please report every grant submission, award and publication enabled by the Data Ark by emailing the details to email@example.com. Thanks so much for letting us know how the Data Ark has been useful!
For more information
For all inquiries relating to the Data Ark, please email: firstname.lastname@example.org
What is the Data Ark?
- Space on the Minerva Supercomputer to host all frequent-use research data sets
- A team of data scientists/engineers to manage the resource, process data and simplify the access process
- An opportunity for a step-change in the power and pace of Sinai research
This Mount Sinai data commons is guided by the FAIR principles: making data more findable, accessible, interoperable and reusable. Data Ark includes both public (restricted and unrestricted) and Sinai-generated data sets.
The Data Ark team downloads, organizes and performs quality assurance and quality control on the data. The team also manages the data access process, answers questions on the data, and updates to the latest versions of the data sets. The Data Ark is located on Minerva at /sc/arion/projects/data-ark/.
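Once access has been granted, a quick way to see what is visible to your account is to list the directories under the Data Ark location from a Minerva session. The following is a minimal Python sketch, not an official Data Ark tool: the function name list_data_sets is hypothetical, and it assumes (this page does not say) that each data set sits in its own directory directly under the root.

```python
from pathlib import Path

# Location given on this page; only reachable from a Minerva session.
DATA_ARK_ROOT = Path("/sc/arion/projects/data-ark/")


def list_data_sets(root=DATA_ARK_ROOT):
    """Return the sorted names of directories directly under root.

    Hypothetical helper: it assumes one top-level directory per data set,
    which is an illustration rather than a documented layout.
    Returns an empty list if root is missing or cannot be read.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    try:
        return sorted(p.name for p in root.iterdir() if p.is_dir())
    except PermissionError:
        # The root exists but the current account cannot list it.
        return []


if __name__ == "__main__":
    for name in list_data_sets():
        print(name)
```

Run off Minerva, the script simply prints nothing, because the path does not exist there.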
Why use the Data Ark?
- Increasing your sample size reduces false positives and boosts statistical power
- Analyzing new data sources allows testing the generalizability of your results and enables you to ask new scientific questions
- It will save you time otherwise spent locating, processing and correcting data
- The data quality is extremely high due to its processing by the dedicated Data Ark team and its repeated use by many Sinai investigators able to detect and correct data errors
- It reduces wasteful duplication of data sets
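The first point above, on sample size and statistical power, can be illustrated with a short calculation. This is a rough sketch using a normal approximation for a two-sided, two-sample test. The effect size, sample sizes and function name are illustrative assumptions, not figures from any Data Ark data set. At a fixed significance threshold, higher power also means that a larger share of significant findings are true effects.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the standardized mean difference between the groups;
    n_per_group is the number of individuals in each group.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic under the alternative hypothesis.
    shift = effect_size * sqrt(n_per_group / 2)
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


if __name__ == "__main__":
    # Illustrative values only: a small effect at growing sample sizes.
    for n in (100, 1_000, 10_000):
        print(n, round(two_sample_power(0.1, n), 3))
```

With the effect size held constant, the printed power rises as the per-group sample size grows.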
Why share your data?
- Data quality will be maximized by professional processing and repeat use
- Your lab will have more time for science rather than processing data
- The profile of your data set will be raised
- Expanded opportunities for citations and collaboration
- New ways of using your data will be highlighted
- Being a good data-sharer will be credited in faculty evaluations and by the appointments and promotions committee
Diverse research projects performed across Mount Sinai on exactly the same large data resource will foster effective collaboration and have the potential to dramatically increase the pace of our scientific and medical advances.
Data Ark is an initiative led by Associate Professor Paul O’Reilly and Dean for Scientific Computing Patricia Kovatch, and supported by the Department of Genetics and Genomic Sciences and Scientific Computing. An advisory board has been convened to provide guidance and to help Data Ark become sustainable over time.
 Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18