AIR·MS Data Modalities
The AIR·MS platform features the following Mount Sinai and public datasets. Our team continues to add data modalities, with the goal of building a comprehensive, multi-modal research resource. To help you get started with the data, we offer Quick Start guides available here.
Available Datasets (PHI = Protected Health Information, De-ID = De-Identified Information):
- Mount Sinai Data Warehouse (MSDW) OMOP De-Identified (De-ID) and Protected Health Information (PHI)
- Pathology Metadata (PHI)
- Mount Sinai Million Data (PHI and De-ID)
- Electrocardiogram (ECG) Metadata (PHI)
- Intensive Care Unit (ICU) Operational Datamart (PHI)
- Radiology Metadata (PHI)
- Echocardiography Metadata (PHI)
Public Datasets:
- Synthetic Public Use File (DE-SynPUF)
Mount Sinai Data Warehouse (MSDW) OMOP De-Identified (De-ID) and Protected Health Information (PHI)
The MSDW dataset uses the OMOP Common Data Model. It consists of clinical data extracted from Mount Sinai’s Epic Caboodle database and other ancillary systems. We offer both the identified (PHI) and de-identified (De-ID) versions of this data.
Pathology Metadata (PHI)
The Pathology metadata aids researchers in the field of Computational Pathology. Researchers can query the metadata in combination with other linked data modalities to build a patient cohort, then apply quantitative methods to analyze digital microscopy slides and relate the resulting statistical descriptors to patient outcomes.
We are also working on making the digital slides available to researchers on Minerva HPC.
Note: Pathology reports are now available in AIR·MS and can be found in table ACC_RESULTS. The reports are broken up into sections (clinical history, final diagnosis, SNOMED coding, etc.) that can be identified by column PATH_RPT_HEADING_NAME. If you are looking for the full report, you can combine all records with the same ACCESSION_2_ID into a single output. The column containing the free text (ACC_RESULTS_FINDING) has been enabled with SAP HANA full-text search capabilities. Examples of how to use HANA full-text search are available in this tutorial notebook.
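As a rough illustration of combining report sections, the sketch below groups rows by ACCESSION_2_ID and joins each section's heading and text into one report. It assumes the rows have already been fetched from ACC_RESULTS as Python dictionaries keyed by column name. The sample rows are invented, and because no section-ordering column is named here, sections are kept in the order they arrive.

```python
from collections import defaultdict


def combine_report_sections(rows):
    """Join all section rows that share an ACCESSION_2_ID into one report text."""
    sections_by_accession = defaultdict(list)
    for row in rows:
        sections_by_accession[row["ACCESSION_2_ID"]].append(row)

    full_reports = {}
    for accession_id, sections in sections_by_accession.items():
        parts = []
        for section in sections:
            heading = section["PATH_RPT_HEADING_NAME"]
            # Treat a missing finding as empty text rather than failing
            finding = section["ACC_RESULTS_FINDING"] or ""
            parts.append(f"{heading}:\n{finding}")
        full_reports[accession_id] = "\n\n".join(parts)
    return full_reports


# Invented rows for illustration only; real values will differ
sample_rows = [
    {"ACCESSION_2_ID": "A-001", "PATH_RPT_HEADING_NAME": "Clinical History",
     "ACC_RESULTS_FINDING": "Cough."},
    {"ACCESSION_2_ID": "A-002", "PATH_RPT_HEADING_NAME": "Final Diagnosis",
     "ACC_RESULTS_FINDING": None},
    {"ACCESSION_2_ID": "A-001", "PATH_RPT_HEADING_NAME": "Final Diagnosis",
     "ACC_RESULTS_FINDING": "Benign."},
]

reports = combine_report_sections(sample_rows)
print(reports["A-001"])
```

If a particular section order matters for your analysis, sort the rows before passing them in, using whichever column in your extract reflects that order.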
Mount Sinai Million Health Discoveries Program
The current lack of diversity in genomic research data is hindering what we can learn about health and potential treatments in our global population. By enhancing the diversity of people participating in genomic research, we can advance our knowledge and discovery of human genetics for all populations. To that end, The Charles Bronfman Institute for Personalized Medicine is spearheading the effort to carry out the genetic sequencing of one million Mount Sinai patients within the next five years. This initiative, one of the largest such sequencing projects of its kind, will integrate health and research data at Mount Sinai to promote discoveries that will directly benefit our patient population.
Access to the BioMe Biobank and Mount Sinai Million Biobank on HPC can be requested via the CBIPM Data and Specimen Inquiry Form.
Electrocardiogram (ECG) Data (PHI)
Electrocardiogram data, derived from Mount Sinai’s Cardiology Information System, is now available in AIR·MS.
Intensive Care Unit (ICU) Data (PHI)
The Mount Sinai ICU Datamart is the world’s first ICU data platform designed to simultaneously support research, quality improvement, and operational initiatives. It is built on a common data model that standardizes critical care medical concepts, enabling consistent interpretation and integration of data across diverse ICU settings.
The Mount Sinai ICU Datamart harmonizes and delivers high-fidelity information from all Mount Sinai adult ICUs with highly granular data available from 2011 onward and refreshed weekly. Beyond serving as a comprehensive data resource, it also tracks the evolution of the health system’s critical care landscape, capturing changes in unit specialties, geographic distribution, and the addition of new ICUs.
By transforming the ICU’s inherently rich data environment into a standardized, dynamic, and accessible platform, the Mount Sinai ICU Datamart empowers clinicians, researchers, and administrators to advance data-driven care, operational excellence, and clinical discovery.
Radiology Metadata (PHI)
AIR·MS now features radiology metadata extracted from the Mount Sinai IRW 2.0 XNAT system (via an MSDW data pipeline). This data set consists of detailed DICOM (Digital Imaging and Communications in Medicine) tags associated with the medical images.
These tags provide essential metadata, including patient information, imaging parameters, equipment details, and procedural context, ensuring a comprehensive understanding of each radiological study.
By integrating this metadata, we enable researchers to gain deeper insights into the imaging data, facilitating advanced analyses and fostering innovations in medical imaging research.
Echocardiography Metadata (PHI)
AIR·MS contains DICOM metadata tags for cardiovascular imaging studies performed in the Mount Sinai Health System and stored in the Softlink cardiovascular PACS system. This does not include radiology data stored in the radiology PACS systems. Common modalities include US, CT, XA, NM, MR, and IVUS, which cover, among others, echocardiographic ultrasound, vascular ultrasound, and angiographic data. These tags are linked to DICOM files by ECHO_METADATA.IMAGE_FILE_PATH in a repository on the Minerva cluster. Note that the schema name CDMECHO is a misnomer: this catalog contains much more than echocardiogram data.
The catalog contains every public DICOM metadata tag (dictionaries are publicly available; see https://dicom.innolitics.com/ciods for instance). Each tag is stored in ECHO_TAGS_DATA.TAGS and is labeled by its name rather than its hexadecimal number for readability (see https://www.dicomlibrary.com/dicom/dicom-tags/ for a dictionary of public tags and names). The value of the tag is stored in ECHO_TAGS_DATA.TAG_VALUE.
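Because each tag is stored as a name/value pair, it can be convenient to reshape the rows into one dictionary per image or study. The sketch below assumes the rows from ECHO_TAGS_DATA are already in memory as Python dictionaries. The grouping column FILE_KEY is a hypothetical placeholder, not a documented column name, and the sample tag names and values are invented.

```python
from collections import defaultdict


def pivot_tags(rows, key_column):
    """Reshape name/value tag rows into one {tag name: tag value} dict per key.

    If the same tag appears more than once for a key, the last value wins.
    """
    tags_by_key = defaultdict(dict)
    for row in rows:
        tags_by_key[row[key_column]][row["TAGS"]] = row["TAG_VALUE"]
    return dict(tags_by_key)


# Invented rows; FILE_KEY is a hypothetical grouping column
sample_tag_rows = [
    {"FILE_KEY": "file-1", "TAGS": "Modality", "TAG_VALUE": "US"},
    {"FILE_KEY": "file-1", "TAGS": "Manufacturer", "TAG_VALUE": "ExampleCo"},
    {"FILE_KEY": "file-2", "TAGS": "Modality", "TAG_VALUE": "XA"},
]

tags = pivot_tags(sample_tag_rows, key_column="FILE_KEY")
print(tags["file-1"]["Modality"])
```

Replace FILE_KEY with whichever column identifies an image or study in your extract.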
Notes:
- As of late August 2025, there are known significant gaps in data availability around 2019 and from mid-2023 onward. The catalog ends in late 2023. Work is underway on mechanisms to capture the missing data and to keep the catalog updated going forward.
- The archive contains some duplicated data (e.g., two identical studies may be stored under two separate paths).
- There is a known bug in the way the value representation of “Person Name” is stored. It is stored as a list of single characters rather than a string (e.g., the name Smith is stored as [S,m,i,t,h]).
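Until the Person Name bug is fixed, affected values can be repaired on the client side. The helper below is a minimal sketch that assumes the value arrives as a string in exactly the form shown above: square brackets around single characters separated by commas, with no spaces. The function name is hypothetical. Values that do not match this pattern are returned unchanged.

```python
def repair_person_name(raw):
    """Rebuild a name from the '[S,m,i,t,h]' form; return other values as-is."""
    if not (raw.startswith("[") and raw.endswith("]")):
        return raw
    inner = raw[1:-1]
    if inner == "":
        return ""
    # Single characters joined by commas put a comma at every odd index,
    # so the total length is odd and inner[1::2] contains only commas.
    if len(inner) % 2 == 0 or any(ch != "," for ch in inner[1::2]):
        return raw
    # Every even index holds one character of the original name
    return inner[0::2]


print(repair_person_name("[S,m,i,t,h]"))
```

Taking every second character, rather than splitting on commas, keeps names intact even when one of the stored characters is itself a comma. If your client returns an actual list of characters instead of a string, `"".join(value)` is enough.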
Synthetic Public Use File (DE-SynPUF)
The SYNPUF (Synthetic Public Use Files) dataset, provided by the Centers for Medicare & Medicaid Services (CMS), offers a synthetic version of Medicare claims data from the years 2008 to 2010. This dataset is meticulously designed to maintain the statistical properties and relationships present in the original data while ensuring that no actual patient information is disclosed, thereby safeguarding privacy.
SYNPUF includes a comprehensive array of variables such as beneficiary demographics, chronic conditions, hospital and outpatient claims, and prescription drug events, making it an invaluable resource for researchers and data scientists. It serves as an exemplary tool for developing and testing healthcare models, algorithms, and applications without the constraints associated with sensitive real-world data.
The SYNPUF dataset in AIR·MS uses the OMOP Common Data Model, in line with other clinical data sets available on the platform. Because SYNPUF data does not require IRB approval, you can get onboarded quickly and start building ML models.
