Data Ark: Data Commons for Mount Sinai
Increasing the power, pace and relevance of our science

Image by Jessica Johnson ©. See www.jessicajohnsonart.com
The overarching goal of the Data Ark is to ensure that research data at Mount Sinai are managed, processed and combined in a way that optimizes the power, pace and relevance of our science.
- Power: Scientists typically use only a tiny fraction of available data
- Pace: Users will have rapid access to large, powerful research data sets
- Relevance: Our diverse patient population is ideal for testing the generalizability of our results
Data Ark Data Sets (06/26/23)
The Data Ark is located on Minerva, and the number, type, and diversity of its data sets will increase substantially in the coming months. It consists of public data sets, Mount Sinai-generated data sets and School-acquired data sets. Some data supplements are also provided via the Data Ark.
Public Datasets (Unrestricted)
- 1,000 Genomes Project
- BLAST
- eQTLGen
- Genebass
- Genome Aggregation Database (gnomAD)
- Genome-wide Association Study (GWAS) Summary Stats
- Genotype-Tissue Expression (GTEx) Project
- Linkage Disequilibrium (LD) Score Regression Data
- Reference Genome
- The Cancer Genome Atlas (TCGA)
- UK Biobank (UKBB)-Linkage Disequilibrium (LD)
Mount Sinai Generated Datasets (Restricted)
- CBIPM-BioMe Data (pending IRB approval)
- Living Brain Project
- Mount Sinai COVID-19 Biobank
- Mount Sinai Data Warehouse (MSDW) De-identified COVID-19 Electronic Health Record (EHR) Data
- Mount Sinai Data Warehouse (MSDW) De-identified Observational Medical Outcomes Partnership (OMOP) Data
- STOP COVID NYC Cohort
- UK Biobank (changes to be implemented soon)
User Group-acquired Datasets (Restricted)
Data Ark also provides helpful links to external data sets.
Helpful External Data Sets: All of Us
To see more detail about each data set, including supplemental data sets, click here.
How can I access the data sets?
For Public Unrestricted data sets, there are no access restrictions, and you can access the data directly on Minerva at the following path:
/sc/arion/projects/data-ark/Public_Unrestricted
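As a quick way to see what is available, the short Python sketch below lists the folders found directly under that path. It is only an illustration: the function name `list_data_sets` is hypothetical, and the assumption that each data set sits in its own top-level folder is not confirmed here. It needs to be run on Minerva, where the path exists; elsewhere it simply returns an empty list.

```python
from pathlib import Path

# Path quoted above; it only exists on Minerva.
PUBLIC_ROOT = Path("/sc/arion/projects/data-ark/Public_Unrestricted")


def list_data_sets(root: Path = PUBLIC_ROOT) -> list[str]:
    """Return the sorted names of the folders directly under root.

    Assumption (not confirmed by the Data Ark team): each data set has
    its own top-level folder. Returns an empty list if root is missing.
    """
    if not root.is_dir():
        return []
    # Keep folders only; loose files such as READMEs are skipped.
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__":
    for name in list_data_sets():
        print(name)
```

For details about the contents of each data set, the Data Ark documentation remains the reference; a folder listing only shows what is present on disk.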
If you haven’t used Minerva before, please follow this link to register and here for quick start guidelines.
For any other data set, you must read, agree to and sign the Data Use Agreement specific to each data set that you want to access. Once you have submitted the agreement, together with any evidence of approved permission required for public restricted-use data, the Data Ark team will grant access within two working days. You will be notified by email when access has been granted.
To access the Data Use Agreement page:
- Log in through the Mount Sinai campus network or secure remote VPN
- Click here to access the Data Ark Forms
- Choose the data set that you would like to access from the drop-down list
- Follow the link to view and agree to the specific Data Use Agreement. You will need to log in with your Sinai account and password.
- You can choose only one data set at a time.
Contact the Team or Submit a Ticket
We need your help to keep the Data Ark afloat: please report every grant submission, award and publication enabled by the Data Ark by emailing us at data-ark-team@lists.mssm.edu with the info. Thanks so much for letting us know how the Data Ark has been useful!
For more information and for all inquiries relating to the Data Ark, please email data-ark-team@lists.mssm.edu, or join our Data Ark Slack channel at https://join.slack.com/t/data-ark/signup and sign up using your Mount Sinai credentials. You will be able to interact with the researchers and the Data Ark group right away!
What is the Data Ark?
- Space on the Minerva Supercomputer to host all frequent-use research data sets
- A team of data scientists/engineers to manage the resource, process data and simplify the access process
- An opportunity for a step-change in the power and pace of Sinai research
This Mount Sinai data commons is guided by the FAIR principles [1]: making data more findable, accessible, interoperable and reusable. Data Ark includes both public (restricted and unrestricted) and Sinai-generated data sets.
The Data Ark team downloads and organizes the data and performs quality assurance and quality control on them. The team also manages the data access process, answers questions about the data, and updates the data sets to their latest versions. The Data Ark is located on Minerva at /sc/arion/projects/data-ark/.
Why use the Data Ark?
- Increasing your sample size reduces false positives and boosts statistical power
- Analyzing new data sources allows testing the generalizability of your results and enables you to ask new scientific questions
- It will save you time otherwise spent locating, processing and correcting data
- The data quality is extremely high due to its processing by the dedicated Data Ark team and its repeated use by many Sinai investigators able to detect and correct data errors
- It reduces wasteful duplication of data sets
Why share your data?
- Data quality will be maximized by professional processing and repeat use
- Your lab will have more time for science rather than processing data
- The profile of your data set will be raised
- Expanded opportunities for citations and collaboration
- New ways of using your data will be highlighted
- Being a good data-sharer will be credited in faculty evaluations and by the appointments and promotions committee
Diverse research projects performed across Mount Sinai on exactly the same large data resource will foster effective collaboration and have the potential to dramatically increase the pace of our scientific and medical advances.
About Us
Data Ark is an initiative led by Associate Professor Paul O’Reilly and Dean for Scientific Computing and Data Patricia Kovatch, and supported by the Department of Genetics and Genomic Sciences and Scientific Computing and Data. An advisory board has been convened to provide guidance and to help Data Ark become sustainable over time.