Data Ark: Data Commons for Mount Sinai
Increasing the power, pace and relevance of our science
Image by Jessica Johnson ©. See www.jessicajohnsonart.com
The overarching goal of the Data Ark is to ensure that research data at Mount Sinai are managed, processed and combined in a way that optimizes the power, pace and relevance of our science.
- Power: Scientists typically use only a tiny fraction of available data
- Pace: Users will have rapid access to huge, powerful research data sets
- Relevance: Our diverse patient population is ideal for testing the generalizability of our results
Data Ark Data Sets (10/10/24)
The Data Ark is located on Minerva and the number, type, and diversity of data sets on the Data Ark will increase substantially in the coming months. The Data Ark consists of public data sets, Mount Sinai generated data sets and School-Acquired data sets. There are also some data supplements provided via Data Ark.
Public Datasets (Unrestricted)
- 1,000 Genomes Project
- BLAST
- eQTLGen
- Genebass
- Genome Aggregation Database (gnomAD)
- Genome-wide Association Study (GWAS) Summary Stats
- Genotype-Tissue Expression (GTEx) Project
- Linkage Disequilibrium (LD) Score Regression Data
- Reference Genome
- The Cancer Genome Atlas (TCGA)
- UK Biobank (UKBB)-Linkage Disequilibrium (LD)
Mount Sinai Generated Datasets (Restricted)
- CBIPM-BioMe Data
- Digital Pathology Slides (Coming Soon)
- Living Brain Project
- Mount Sinai COVID-19 Biobank
- Mount Sinai Data Warehouse (MSDW) De-identified COVID-19 Electronic Health Record (EHR) Data
- Mount Sinai Data Warehouse (MSDW) De-identified Observational Medical Outcomes Partnership (OMOP) Data
- STOP COVID NYC Cohort
User Group-acquired Datasets (Restricted)
Data Ark also provides helpful links to external data sets.
Helpful External Data Sets: All of Us
To see more detail about each data set, including supplemental data sets, click here.
How can I access the data sets?
As of January 22, 2024, to access public, Mount Sinai-generated and restricted datasets, you must read, agree to and sign the Data Use Agreement (you must be connected through the Mount Sinai campus network or the secure remote VPN). Access is granted within 24 hours. On Minerva, you can then load the module with $ module load dataark to see the path variables.
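As a minimal sketch of how the path variables might be inspected: a Python session started from the same shell after running module load dataark inherits the variables the module exports. The page does not name those variables, so the helper below (find_path_variables, a hypothetical name) simply assumes their names contain "ark"; adjust the search string to whatever the module actually sets.

```python
import os


def find_path_variables(environ, needle="ark"):
    """Return {name: value} for variables whose name contains `needle`.

    The match is case-insensitive. The default needle "ark" is an
    assumption: the variable names set by the dataark module are not
    documented on this page.
    """
    needle = needle.lower()
    return {
        name: value
        for name, value in sorted(environ.items())
        if needle in name.lower()
    }


if __name__ == "__main__":
    # Run this from a shell where `module load dataark` was already executed.
    for name, value in find_path_variables(os.environ).items():
        print(f"{name}={value}")
```

If nothing is printed, the module may not be loaded in the current shell, or its variables may use a different naming pattern than the one assumed here.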
If you haven’t used Minerva before, please follow this link to register and here for quick start guidelines.
A Minerva account is required to access the Data Use Agreement page. Please follow these steps:
- Click here to access the Data Ark DUA Forms. You must be on the Mount Sinai campus network, or on the school VPN if you are off-campus
- Choose the data set that you would like to access from the drop-down list
- Enter your Minerva username and password in the next prompt window (no VIP token needed).
- Follow the link to view and agree to the specific Data Use Agreement.
- You can choose only one data set at a time.
Contact the Team or Submit a Ticket
We need your help to keep the Data Ark afloat: please report every grant submission, award and publication enabled by the Data Ark by emailing us at hpchelp@hpc.mssm.edu with the info. Thanks so much for letting us know how the Data Ark has been useful!
For more information and for all inquiries relating to the Data Ark, please email hpchelp@hpc.mssm.edu, or join our Data Ark Slack channel at https://join.slack.com/t/data-ark/signup and sign up using your Mount Sinai credentials. You will be able to interact with the researchers and the Data Ark group right away!
What is the Data Ark?
- Space on the Minerva Supercomputer to host all frequent-use research data sets
- A team of data scientists/engineers to manage the resource, process data and simplify the access process
- An opportunity for a step-change in the power and pace of Sinai research
This Mount Sinai data commons is guided by the FAIR principles [1]: making data more findable, accessible, interoperable and reusable. Data Ark includes both public (restricted and unrestricted) and Sinai-generated data sets.
The Data Ark team downloads, organizes and performs quality assurance and quality control on the data. The team also manages the data access process, answers questions on the data, and updates to the latest versions of the data sets. The Data Ark is located on Minerva at /sc/arion/projects/data-ark/.
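For orientation, here is a small sketch that lists the top-level directories under the Data Ark location given above. The root path comes from this page; the directory layout beneath it is not described here, and restricted data sets may stay unreadable until the relevant Data Use Agreement is approved, so the hypothetical helper list_top_level returns an empty list rather than failing in those cases.

```python
from pathlib import Path

# Root path as stated on this page; everything below it is undocumented here.
DATA_ARK_ROOT = Path("/sc/arion/projects/data-ark/")


def list_top_level(root=DATA_ARK_ROOT):
    """Return the sorted names of sub-directories directly under `root`.

    Returns an empty list if `root` does not exist or cannot be read,
    for example when run off Minerva or without access.
    """
    try:
        return sorted(
            entry.name for entry in Path(root).iterdir() if entry.is_dir()
        )
    except OSError:
        return []


if __name__ == "__main__":
    for name in list_top_level():
        print(name)
```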
Why use the Data Ark?
- Increasing your sample size reduces false positives and boosts statistical power
- Analyzing new data sources allows testing the generalizability of your results and enables you to ask new scientific questions
- It will save you time otherwise spent locating, processing and correcting data
- The data quality is extremely high due to its processing by the dedicated Data Ark team and its repeated use by many Sinai investigators able to detect and correct data errors
- It reduces wasteful duplication of data sets
Why share your data?
- Data quality will be maximized by professional processing and repeat use
- Your lab will have more time for science rather than processing data
- The profile of your data set will be raised
- Expanded opportunities for citations and collaboration
- New ways of using your data will be highlighted
- Being a good data-sharer will be credited in faculty evaluations and by the appointments and promotions committee
Diverse research projects performed across Mount Sinai on exactly the same large data resource will foster effective collaboration and have the potential to dramatically increase the pace of our scientific and medical advances.
About Us
Data Ark is an initiative led by Associate Professor Paul O’Reilly and Dean for Scientific Computing and Data Patricia Kovatch, and supported by the Department of Genetics and Genomic Sciences and Scientific Computing and Data. An advisory board has been convened to provide guidance and to help Data Ark become sustainable over time.
Acknowledge CTSA
Please acknowledge CTSA as a funding source for Data Ark in your resulting publications as follows:
“This work was supported in part through the computational resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences.”
To associate the CTSA grant UL1TR004419 to an existing publication, please follow these instructions from the NIH (see the section “Associating Funding to your Publications”).