Data Ark: Data Commons for Mount Sinai

Increasing the power, pace and relevance of our science

Image by Jessica Johnson ©. See www.jessicajohnsonart.com

The overarching goal of the Data Ark is to ensure that research data at Mount Sinai are managed, processed and combined in a way that optimizes the power, pace and relevance of our science.

Power: Scientists typically use only a tiny fraction of available data
Pace: Users will have rapid access to large, powerful research data sets
Relevance: Our diverse patient population is ideal for testing the generalizability of our results

Data Ark Data Sets (5/7/25)

The Data Ark is located on Minerva, and the number, type and diversity of its data sets will increase substantially in the coming months. The Data Ark consists of public data sets, Mount Sinai-generated data sets and school-acquired data sets. Some data supplements are also provided via the Data Ark.

Public Datasets (Unrestricted)

Mount Sinai Generated Datasets (Restricted)

To find more information about the Kidney Precision Medicine Project (KPMP), click here.

Data Ark also provides links to helpful external data sets.

Helpful External Data Sets: All of Us

To see more detail about each data set, including supplemental data sets, click here.

More Data Sets

How can I access the data sets?

Effective January 22, 2024, to access public, Mount Sinai-generated and restricted datasets, you must read, agree to and sign the Data Use Agreement (you must be logged in through the Mount Sinai campus network or secure remote VPN). Access is granted within 24 hours. Once you have access, you can load the module on Minerva with $ module load dataark to see the path variables.
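The module-loading step above can be sketched as a short shell session. This page does not list the names of the path variables that the dataark module defines, so the sketch makes no assumption about them: it snapshots the environment before and after the load and prints whatever is new or changed. The helper names env_diff and snapshot_env are hypothetical, introduced here for illustration only, and the sketch assumes a bash shell in which the module command is available, as it is on Minerva.

```shell
#!/usr/bin/env bash
# Sketch: show which environment variables appear after loading the dataark module.
# env_diff and snapshot_env are illustrative helpers, not part of Minerva or the module system.

# Print lines present in the second snapshot but not in the first.
# Both files must be sorted in the C locale.
env_diff() {
    LC_ALL=C comm -13 "$1" "$2"
}

# Write a sorted snapshot of the current environment to the given file.
snapshot_env() {
    env | LC_ALL=C sort > "$1"
}

before=$(mktemp)
after=$(mktemp)

snapshot_env "$before"
# The guard keeps this sketch harmless on machines without a module system.
if command -v module >/dev/null 2>&1; then
    module load dataark
fi
snapshot_env "$after"

# New or changed variables, which should include the path variables set by the module.
env_diff "$before" "$after"

rm -f "$before" "$after"
```

On systems that use Environment Modules or Lmod, running module show dataark is a simpler alternative that prints what the module file sets without loading it.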

If you haven’t used Minerva before, please follow this link to register and here for quick start guidelines.

A Minerva account is required to access the Data Use Agreement page. Please follow these steps:

Click here to access the Data Ark DUA Forms. You must be on the Mount Sinai campus network, or on the school VPN if off-campus.
Choose the data set that you would like to access from the drop-down list.
Enter your Minerva username and password in the next prompt window (no VIP token needed).
Follow the link to view and agree to the specific Data Use Agreement.
You can choose only one data set at a time.

Contact the Team or Submit a Ticket

We need your help to keep the Data Ark afloat: please report every grant submission, award and publication enabled by the Data Ark by emailing us at hpchelp@hpc.mssm.edu with the info. Thanks so much for letting us know how the Data Ark has been useful!

For more information and for all inquiries relating to the Data Ark, please email hpchelp@hpc.mssm.edu, or join our Data Ark Slack channel at https://join.slack.com/t/data-ark/signup and sign up using your Mount Sinai credentials. You will be able to interact with the researchers and the Data Ark group right away!

What is the Data Ark?

Space on the Minerva Supercomputer to host all frequent-use research data sets
A team of data scientists/engineers to manage the resource, process data and simplify the access process
An opportunity for a step-change in the power and pace of Sinai research

This Mount Sinai data commons is guided by the FAIR principles [1]: making data more findable, accessible, interoperable and reusable. Data Ark includes both public (restricted and unrestricted) and Sinai-generated data sets.

The Data Ark team downloads and organizes the data and performs quality assurance and quality control on them. The team also manages the data access process, answers questions about the data, and updates the data sets to their latest versions. The Data Ark is located on Minerva at /sc/arion/projects/data-ark/.

Why use the Data Ark?

Increasing your sample size reduces false positives and boosts statistical power
Analyzing new data sources allows testing the generalizability of your results and enables you to ask new scientific questions
It will save you time otherwise spent locating, processing and correcting data
The data quality is extremely high due to its processing by the dedicated Data Ark team and its repeated use by many Sinai investigators able to detect and correct data errors
It reduces wasteful duplication of data sets

Why share your data?

Data quality will be maximized by professional processing and repeat use
Your lab will have more time for science rather than processing data
The profile of your data set will be raised
Expanded opportunities for citations and collaboration
New ways of using your data will be highlighted
Being a good data-sharer will be credited in faculty evaluations and by the appointments and promotions committee

Diverse research projects performed across Mount Sinai on exactly the same large data resource will foster effective collaboration and have the potential to dramatically increase the pace of our scientific and medical advances.

About Us

Data Ark is an initiative led by Associate Professor Paul O’Reilly and Dean for Scientific Computing and Data Patricia Kovatch, and supported by the Department of Genetics and Genomic Sciences and Scientific Computing and Data. An advisory board has been convened to provide guidance and to help Data Ark become sustainable over time.

Acknowledge CTSA

Please acknowledge CTSA as a funding source for Data Ark in any resulting publications, as follows.

“This work was supported in part through the computational resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences.”

To associate the CTSA grant UL1TR004419 with an existing publication, please follow these instructions from the NIH (see the section “Associating Funding to your Publications”).

More About Data Ark

————————–

[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18