AIR·MS Data Modalities

The AIR·MS platform proudly features the following Mount Sinai and public datasets. Our team is dedicated to continuously expanding our database with additional data modalities, striving to build a comprehensive, multi-modal research resource. To help you get started with the data, we offer Quick Start guides available here.

 

Mount Sinai Datasets

  • Mount Sinai Data Warehouse
  • Clinical Notes
  • Computational Pathology
  • Radiology
  • GI Research DB
  • Mount Sinai Million Health Discoveries Program
  • Electrocardiogram (ECG)
  • Cardiovascular Imaging

 

Public Datasets

  • Medicare Claims Synthetic Public Use Files (SynPUFs)

 

 

Mount Sinai Data Warehouse (MSDW) in OMOP format

The MSDW dataset uses the OMOP Common Data Model. It comprises clinical data extracted from Mount Sinai’s Epic Caboodle database and other ancillary systems.

 

We offer both an identifiable and a de-identified version of the MSDW dataset in AIR·MS.

MSDW OMOP identifiable (PHI) with extended attributes

Current Status

Schema: CDMPHI
Data snapshot from 04/16/2025
Unique patients: 12,107,621

Table Record Count
CARE_SITE 106,699
CDC_RACE_ETHNICITY_XTN 967
CONCEPT 11,337,020
CONCEPT_ANCESTOR 101,072,304
CONCEPT_CLASS 453
CONCEPT_RELATIONSHIP 170,586,066
CONCEPT_SYNONYM 5,190,521
CONDITION_OCCURRENCE 206,228,288
DEATH 49,797
DOMAIN 50
DRUG_EXPOSURE 213,932,157
DRUG_STRENGTH 3,003,619
FACT_RELATIONSHIP 161,135,148
LOCATION 13,394,927
MEASUREMENT 1,875,881,345
NOTE 242,610,978
OBSERVATION 516,585,500
OBSERVATION_PERIOD 12,147,842
PERSON 12,107,621
PROCEDURE_OCCURRENCE 312,011,901
PROVIDER 1,363,962
PROVIDER_ATTRIBUTE_XTN 783,560
RELATIONSHIP 730
VISIT_OCCURRENCE 198,943,396
VOCABULARY 254

Note

Some of the standard OMOP tables contain extension fields (starting with the prefix ‘XTN’) which contain data outside of the OMOP standard data model. Many of these XTN attributes are based on data derived directly from EPIC (i.e. codes used in EPIC rather than the standardized OMOP codes), or attributes not currently contained in the OMOP standard.

MSDW OMOP de-identified (de-id)

Current Status

Schema: CDMDEID
Data snapshot from 09/26/2023
Unique patients: 10,934,180

Table Record Count
MEASUREMENT 1,330,106,439
OBSERVATION 274,496,232
PROCEDURE_OCCURRENCE 192,252,154
DRUG_EXPOSURE 149,311,961
CONCEPT_RELATIONSHIP 143,179,970
NOTE 137,886,581
CONDITION_OCCURRENCE 122,966,185
VISIT_OCCURRENCE 118,936,818
FACT_RELATIONSHIP 118,710,809
CONCEPT_ANCESTOR 94,315,531
CONCEPT_SYNONYM 11,899,964
OBSERVATION_PERIOD 10,955,831
PERSON 10,934,180
CONCEPT 10,302,430
DRUG_STRENGTH 2,994,169
LOCATION 1,321,549
PROVIDER 1,307,112
CARE_SITE 301,467
DEATH 3,213
RELATIONSHIP 692
CONCEPT_CLASS 437
VOCABULARY 251
DOMAIN 50

 

 

Clinical Notes

Clinical notes in the form of unstructured data (progress notes, telephone encounters, nursing notes, procedures, etc.) extracted from the MSDW OMOP identifiable dataset have been loaded into AIR·MS and enabled for search using SAP HANA’s in-memory full-text search capabilities. This feature lets researchers build patient cohorts based on terms contained in unstructured reports in seconds or even milliseconds. Researchers can further filter by note type or an array of other clinical attributes.

Tip: See the Getting Started Guides for examples of how to perform searches using Python!

Current Status

Schema: CDMPHI
Data snapshot from 04/16/2025
Number of indexed notes: 242,610,978
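As a rough illustration of the cohort-building idea described above, the following Python sketch assembles a parameterized query that uses SAP HANA's CONTAINS full-text predicate. It is not taken from the Getting Started Guides. The function name build_note_search is hypothetical, the column names NOTE_TEXT, PERSON_ID and NOTE_TYPE_CONCEPT_ID are assumed from the standard OMOP NOTE table, and it is an assumption that NOTE_TEXT is the indexed column and that the search term can be passed as a bound parameter in this environment.

```python
from typing import List, Optional, Sequence, Tuple


def build_note_search(term: str,
                      note_type_concept_ids: Optional[Sequence[int]] = None,
                      fuzzy: Optional[float] = None,
                      schema: str = "CDMPHI") -> Tuple[str, List]:
    """Return (sql, params) selecting patients whose notes match a term.

    Assumptions: OMOP column names on the NOTE table, and a full-text
    index on NOTE_TEXT. Verify both against the actual schema.
    """
    if not term.strip():
        raise ValueError("search term must not be empty")

    # CONTAINS is SAP HANA's full-text predicate; FUZZY(x) optionally
    # enables approximate matching with a similarity threshold in [0, 1].
    if fuzzy is None:
        predicate = "CONTAINS(NOTE_TEXT, ?)"
    else:
        if not 0.0 <= fuzzy <= 1.0:
            raise ValueError("fuzzy threshold must be between 0 and 1")
        predicate = f"CONTAINS(NOTE_TEXT, ?, FUZZY({fuzzy:.2f}))"

    sql = f"SELECT DISTINCT PERSON_ID FROM {schema}.NOTE WHERE {predicate}"
    params: List = [term]

    # Optional filter on note type, bound as parameters rather than inlined.
    if note_type_concept_ids:
        placeholders = ", ".join("?" for _ in note_type_concept_ids)
        sql += f" AND NOTE_TYPE_CONCEPT_ID IN ({placeholders})"
        params.extend(note_type_concept_ids)

    return sql, params
```

The returned pair can be passed to any DB-API cursor that uses question-mark placeholders, for example cursor.execute(sql, params). Keeping the search term as a bound parameter avoids quoting problems when the term contains apostrophes.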
 

 

Pathology (metadata)

The Pathology metadata aids researchers in the field of Computational Pathology. Researchers can query the metadata in combination with other linked data modalities to build a patient cohort, then apply quantitative methods to analyze digital microscopy slides and relate the resulting statistical descriptors to patient outcomes.

We are also working on making the digital slides available to researchers on Minerva HPC.

Current Status

Data Source: PowerPath
Schema: CDMPATHOLOGY
Data snapshot from 01/08/2025
Unique patients: 3,528,719

Note

 

Pathology reports are now available in AIR·MS and can be found in table ACC_RESULTS. The reports are broken up into sections (clinical history, final diagnosis, SNOMED coding, etc.) that can be identified by column PATH_RPT_HEADING_NAME. If you are looking for the full report, you can combine all records with the same ACCESSION_2_ID into a single output. The column containing the free text (ACC_RESULTS_FINDING) has been enabled with SAP HANA full-text search capabilities. Examples of how to use HANA full-text search are available in this tutorial notebook.
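Combining the sections of a report can be sketched in Python as follows. This is a minimal illustration, not the platform's own method: the function name assemble_reports is hypothetical, rows are assumed to arrive as dictionaries keyed by the ACC_RESULTS column names, and ACC_RESULTS_REC_ID is assumed to be a sortable value that gives the section order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def assemble_reports(rows: Iterable[Mapping]) -> Dict[object, str]:
    """Group ACC_RESULTS rows by case and join their sections into one text.

    Each row is expected to provide ACCESSION_2_ID, ACC_RESULTS_REC_ID,
    PATH_RPT_HEADING_NAME and ACC_RESULTS_FINDING.
    """
    by_case = defaultdict(list)
    for row in rows:
        by_case[row["ACCESSION_2_ID"]].append(row)

    reports = {}
    for case_id, sections in by_case.items():
        # Assumed: ACC_RESULTS_REC_ID orders the sections within a report.
        sections.sort(key=lambda r: r["ACC_RESULTS_REC_ID"])
        parts = []
        for section in sections:
            heading = section["PATH_RPT_HEADING_NAME"]
            text = section["ACC_RESULTS_FINDING"] or ""  # tolerate NULL text
            parts.append(f"{heading}\n{text}")
        reports[case_id] = "\n\n".join(parts)
    return reports
```

Each section is emitted as its heading followed by its text, and sections are separated by a blank line. The same grouping could be done in SQL, but doing it client-side keeps the sketch independent of database-specific aggregation functions.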

The following tables and attributes are available in AIR·MS:

ACCESSION

Number of records: 7,107,464
Column Name Comments
ACC_CATG Case category
ACC_PROCESS_STEP_COMPLETED_DATE Case finalize date / status update datetime
ACCESSION_2_ID PowerPath unique case ID
ACCESSION_NO Case number
BIRTH_DATE Patient date of birth
CREATED_DATE Case creation date
CURRENT_STATUS_ID PowerPath unique identifier for a case status
FACILITY_CODE Facility code
FACILITY_ID PowerPath ID for the facility associated with the accession
FACILITY_NAME Facility name
IMPORTED_CASE One-character “Y” or “N” code indicating if the case was imported into PowerPath
LAST_UPDATE_DATETIME Case finalize date / status update datetime
MED_REC_NO Medical record number (EPIC MRN)
MRN_FACILITY_CODE PowerPath ID for the facility that assigned the MRN
MRN_FACILITY_DESCRIPTION Name of the facility that assigned the MRN
ORDER_NUMBER Order number from the ordering system
PATIENT_AGE The patient’s age on the case creation date
PATIENT_ID PowerPath patient ID
PERSONNEL_2_FULL_NAME Name of the pathologist who finalized the accession
PERSONNEL_2_ID PowerPath ID for the pathologist who finalized the accession
PROCESS_STEP_DESCRIPTION Case status name / description
VISIT_NUMBER Encounter identifier

ACC_ICD

Number of records: 3,548,815
Column Name Comments
ACC_ICD9_ID PowerPath surrogate unique identifier for an ICD-10 code assigned to a case
ACCESSION_2_ID PowerPath unique case ID
LAST_UPDATE_DATETIME Case finalize date / status update datetime
MEDICAL_CODE ICD-10 code assigned for billing
MEDICAL_CODE_ID PowerPath surrogate unique identifier for an ICD-10 code

ACC_SLIDE

Number of records: 14,241,494
Column Name Comments
ACC_BLOCK_ID PowerPath unique block ID
ACC_BLOCK_LABEL Specimen block identifier
ACC_PROCESS_STEP_COMPLETED_DATE Case finalize date / status update datetime
ACC_SLIDE_ID PowerPath unique slide ID
ACC_SPECIMEN_DESCRIPTION Specimen source description
ACC_SPECIMEN_ID PowerPath unique specimen ID
ACCESSION_2_ID PowerPath unique case ID
BIOPSY Boolean flag: 1 = biopsy, 0 = non-biopsy
COLLECTION_DATE Specimen collection date
CONSULT_LABEL Optional free text for slides of type “consult”
LAB_PROCEDURE_CODE The procedure code
LAB_PROCEDURE_DESCRIPTION The procedure description
LAB_PROCEDURE_ID PowerPath unique procedure identifier
LAST_UPDATE_DATETIME Case finalize date / status update datetime
RECV_DATE Specimen received date
SLIDE_LABEL Derived unique (business key) identifier for each slide
SLIDE_NO Ordinal number of the slide from the specimen & block
SLIDE_TYPE Whether the slide is stained, unstained, or antibody/IHC
SOURCE_MATERIAL_LABEL Derived unique (business key) identifier for each slide’s source specimen and block
SOURCE_REC_TYPE Where the slide came from, either specimen or block
SPECIMEN_CATEGORY_ID PowerPath specimen category ID
SPECIMEN_CATEGORY_NAME The specimen category name
SPECIMEN_GROUPS_CODE Specimen specialty code
SPECIMEN_GROUPS_ID PowerPath specimen specialty ID
SPECIMEN_LABEL Specimen identifier
TYPE Whether the slide is consult or not consult

ACC_RESULTS

Number of records: 31,193,765
Column Name Comments
ACC_RESULTS_FINDING The text of the report section
ACC_RESULTS_ID PowerPath unique identifier for each report section
ACC_RESULTS_REC_ID Sort order of result section on RTF
ACCESSION_2_ID PowerPath unique case ID
LAST_UPDATE_DATETIME Case finalize date / status update datetime
PATH_RPT_HEADING_ID PowerPath result section heading on RTF
PATH_RPT_HEADING_NAME PowerPath result section heading name

ACC_SLIDE_IMAGESERVER

Number of records: 2,564,239
Column Name Comments
ACC_SLIDE_ID PowerPath unique slide ID
ACC_SLIDE_IMAGESERVER_DESCRIPTION The name of the Philips iSyntax slide image file
ACC_SLIDE_IMAGESERVER_ID PowerPath unique identifier for a slide image
INTERNAL_SLIDE_ID Identifier for the slide, also known as the “barcode” ID
LAST_UPDATE_DATETIME Case finalize date / status update datetime
SCAN_DATE The date on which the slide image was digitized

 

 

Radiology (metadata)

AIR·MS now features radiology metadata extracted from the Mount Sinai IRW 2.0 XNAT system (via an MSDW data pipeline). This dataset comprises detailed DICOM (Digital Imaging and Communications in Medicine) tags associated with the medical images.

These tags provide essential metadata, including patient information, imaging parameters, equipment details, and procedural context, ensuring a comprehensive understanding of each radiological study.

By integrating this metadata, we enable researchers to gain deeper insights into the imaging data, facilitating advanced analyses and fostering innovations in medical imaging research.

Current Status

 

Schema: CDMRADIOLOGY
Data snapshot from 08/16/2024
Unique patients: 2,057,482

The following tables and attributes are available in AIR·MS:

RADIOLOGY_METADATA

Number of records: 66,993,826
Column Name
ID
PATIENT_ID
SERIES_INSTANCE_UID
STUDY_INSTANCE_UID
ETL_RECORD_UPDATE_DATETIME

RADIOLOGY_DICOM_DATA

Number of records: 7,745,407,087
Column Name
ID
RADIOLOGY_METADATA_ID
DICOM_TAGS
TAG_VALUE_REPRESENTATION
TAG_VALUE
TAG_XTN_PATIENT_EPIC_MRN
TAG_PATIENT_NAME
SERIES_INSTANCE_UID
STUDY_INSTANCE_UID
ETL_RECORD_UPDATE_DATETIME
 

 

Mount Sinai Million Health Discoveries Program

The current lack of diversity in genomic research data is hindering what we can learn about health and potential treatments in our global population. By enhancing the diversity of people participating in genomic research, we can advance our knowledge and discovery of human genetics for all populations. To that end, the Charles Bronfman Institute for Personalized Medicine is spearheading the effort to carry out the genetic sequencing of one million Mount Sinai patients within the next five years. This initiative, one of the largest sequencing projects of its kind, will integrate health and research data at Mount Sinai to promote discoveries that will directly benefit our patient population.

Access to the BioMe Biobank and Mount Sinai Million Biobank on HPC can be requested via the CBIPM Data and Specimen Inquiry Form.

The following tables and attributes are available in AIR·MS:

Mount Sinai Million / BioMe identifiable (PHI)

Current Status

Schema: CDMMSM
Data snapshot from 06/09/2025
Unique patients: 279,885

PATIENT

Column Name Comments
ID Internal ID for linking to other tables within the dataset
MRN Medical Record Number (EPIC MRN, only accessible under regulatory approval)
MASKED_MRN De-identifier for combined BioMe Biobank set with Regeneron and Sema4 data
RGN_ID De-identifier for first Regeneron batch regarding BioMe Biobank
SEMA4_ID De-identifier for Sema4, a subset of Masked MRN ID
MSM_ID De-identifier for Mount Sinai Million Biobank, a combined set of RGN_ID and new MSM ID
MILLION_ID Indicator for all consented patients with and without genomic data
AIR_CREATED_AT Record creation in AIR·MS
AIR_UPDATED_AT Record updated in AIR·MS

Mount Sinai Million / BioMe de-identified (de-id)

Current Status

Schema: CDMMSMDEID
Data snapshot from 06/09/2025
Unique patients: 279,885

PATIENT

Column Name Comments
ID Internal ID for linking to other tables within the dataset
MASKED_MRN De-identifier for combined BioMe Biobank set with Regeneron and Sema4 data
RGN_ID De-identifier for first Regeneron batch regarding BioMe Biobank
SEMA4_ID De-identifier for Sema4, a subset of Masked MRN ID
MSM_ID De-identifier for Mount Sinai Million Biobank, a combined set of RGN_ID and new MSM ID
MILLION_ID Indicator for all consented patients with and without genomic data
AIR_CREATED_AT Record creation in AIR·MS
AIR_UPDATED_AT Record updated in AIR·MS
 

 

Electrocardiogram (ECG)

Current Status

Data Source: GE HealthCare MUSE Cardiology Information System

Schema: CDMECG
Data snapshot from: 04/10/2021
Unique patients: 1,961,254

The following tables and attributes are available in AIR·MS:

PATIENT_DEMOGRAPHICS

Number of records: 9,275,130
Column Name
PATIENT_DEMOGRAPHICS_ID (X)
FILE_ENTRY_ID
PATIENT_ID
PATIENTAGE
AGEUNITS
DATEOFBIRTH
GENDER
RACE
PATIENTLASTNAME
PATIENTFIRSTNAME

DIAGNOSIS

Number of records: 9,168,266
Column Name
DIAGNOSIS_ID (X)
FILE_ENTRY_ID
MODALITY
DIAGNOSISSTATEMENT

LEAD_DATA

Number of records: 73,631,055
Column Name
LEAD_DATA_ID (X)
FILE_ENTRY_ID
LEADBYTECOUNTTOTAL
LEADTIMEOFFSET
LEADSAMPLECOUNTTOTAL
LEADAMPLITUDEUNITSPERBIT
LEADAMPLITUDEUNITS
LEADHIGHLIMIT
LEADLOWLIMIT
LEADID
LEADOFFSETFIRSTSAMPLE
FIRSTSAMPLEBASELINE
LEADSAMPLESIZE
LEADOFF
BASELINESWAY
LEADDATACRC32
WAVEFORMDATA

ECG_FILES

Number of records: 9,610,935
Column Name
FILE_ENTRY_ID (X)
FILE_NAME
FILE_PATH
FILE_HASH
FILE_SIZE_BYTES
ACQUISITION_DATE
ACQUISITION_TIME
PROCESSING_STATUS
STATUS_CODE
NOTES_AND_COMMENTS
FILE_TIMESTAMP
AIR_CREATED_AT
AIR_UPDATED_AT
JSON_STATUS

MUSE_INFO

Number of records: 9,168,230
Column Name
MUSEVERSION
FILE_ENTRY_ID

ORDER_INFO

Number of records: 7,757,472
Column Name
ORDER_INFO_ID (X)
FILE_ENTRY_ID
HISACCOUNTNUMBER
ORDERTIME
ADMITTIME
ADMITDATE
HISLOCATION
BED
ATTENDINGMDHISID
ATTENDINGMDLASTNAME
ATTENDINGMDFIRSTNAME
ALTERNATEVISITID
HISDISPOSITION
ADMITSOURCE
PRIMARYDIAGNOSTICCODE
SERVICINGFACILITY
ADMITTINGMDHISID
ADMITTINGMDLASTNAME
ADMITTINGMDFIRSTNAME
CONSULTINGMDID
REFERRINGMDHISID
HOSPITALSERVICE
ADMISSIONTYPE

ORIGINAL_DIAGNOSIS

Number of records: 9,164,220
Column Name
ORIGINAL_DIAGNOSIS_ID (X)
FILE_ENTRY_ID
MODALITY
DIAGNOSISSTATEMENT

ORIGINAL_RESTING_ECG_MEASUREMENTS

Number of records: 9,168,044
Column Name
ORIGINAL_RESTING_ECG_MEASUREMENTS_ID (X)
VENTRICULARRATE
ATRIALRATE
PRINTERVAL
QRSDURATION
QTINTERVAL
QTCORRECTED
PAXIS
RAXIS
TAXIS
QRSCOUNT
QONSET
QOFFSET
PONSET
POFFSET
TOFFSET
ECGSAMPLEBASE
ECGSAMPLEEXPONENT
QTCFREDERICA

PHARMA_DATA

Number of records: 3,613,372
Column Name
PHARMA_DATA_ID (X)
PHARMARRINTERVAL
PHARMAUNIQUEECGID
PHARMAPPINTERVAL
PHARMACARTID
FILE_ENTRY_ID

QRS_TIMES_TYPES

Number of records: 9,649,712
Column Name
GLOBALRR
QTRGGR
FILE_ENTRY_ID

RESTING_ECG

Number of records: 9,657,368
Column Name
RESTING_ECG_ID (X)
FILE_ENTRY_ID
PATIENT_ID
ACQUISITIONDATE
ACQUISITIONTIME
STATUS

RESTING_ECG_MEASUREMENTS

Number of records: 9,167,778
Column Name
RESTING_ECG_MEASUREMENTS_ID (X)
FILE_ENTRY_ID
VENTRICULARRATE
ATRIALRATE
PRINTERVAL
QRSDURATION
QTINTERVAL
QTCORRECTED
PAXIS
RAXIS
TAXIS
QRSCOUNT
QONSET
QOFFSET
PONSET
POFFSET
TOFFSET
ECGSAMPLEBASE
ECGSAMPLEEXPONENT
QTCFREDERICA

TEST_DEMOGRAPHICS

Number of records: 9,654,325
Column Name
TEST_DEMOGRAPHICS_ID (X)
FILE_ENTRY_ID
DATATYPE
SITE
SITENAME
ACQUISITIONDEVICE
STATUS
EDITLISTSTATUS
PRIORITY
LOCATION
LOCATIONNAME
ROOMID
ACQUISITIONTIME
ACQUISITIONDATE
CARTNUMBER
ACQUISITIONSOFTWAREVERSION
ANALYSISSOFTWAREVERSION
EDITTIME
EDITDATE
EDITORID
REFERRINGMDLASTNAME
REFERRINGMDFIRSTNAME
ACQUISITIONTECHLASTNAME
EDITORLASTNAME
EDITORFIRSTNAME
SECONDARYID
HISSTATUS
 

 

Cardiovascular Imaging (metadata)

AIR·MS contains DICOM metadata tags for cardiovascular imaging studies performed in the Mount Sinai Health System and stored in the Softlink cardiovascular PACS system. This does not include radiology data held in the radiology PACS systems. Common modalities include US, CT, XA, NM, MR, and IVUS, covering, among others, echocardiographic ultrasound, vascular ultrasound, and angiographic data. These tags are linked to DICOM files in a repository on the Minerva cluster by ECHO_METADATA.IMAGE_FILE_PATH. Note that the schema name CDMECHO is a misnomer: this catalog contains much more than echocardiogram data.

The catalog contains every public DICOM metadata tag (dictionaries are publicly available; see https://dicom.innolitics.com/ciods for instance). Each tag is stored in ECHO_TAGS_DATA.TAGS and, for readability, is labeled by its name rather than its hexadecimal number (see https://www.dicomlibrary.com/dicom/dicom-tags/ for a dictionary of public tags and names). The value of the tag is stored in ECHO_TAGS_DATA.TAG_VALUE.
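Because ECHO_TAGS_DATA stores one row per tag, it is often convenient to turn the rows for each image back into a single record. The Python sketch below shows one way to do that client-side. The function name pivot_tags is hypothetical, rows are assumed to be (SOP_INSTANCE_UID, TAGS, TAG_VALUE) tuples already fetched from the database, and the tag names used in the usage example are only illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def pivot_tags(
    rows: Iterable[Tuple[str, str, Optional[str]]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Turn long-format (SOP_INSTANCE_UID, TAGS, TAG_VALUE) rows into one
    dictionary of tag name -> value per DICOM instance.

    If a tag appears more than once for the same instance, the last value
    seen wins.
    """
    wide = defaultdict(dict)
    for sop_uid, tag_name, tag_value in rows:
        wide[sop_uid][tag_name] = tag_value
    return dict(wide)
```

For example, pivot_tags([("1.2.3", "Modality", "US"), ("1.2.3", "Manufacturer", "Acme")]) yields a single entry for instance "1.2.3" holding both tags. Given the size of the table, it is advisable to restrict the query to the studies and tags of interest before pivoting.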

Notes:

  • As of late August 2025, there are known significant gaps in data availability around 2019 and from mid-2023 onward. The catalog ends in late 2023. Mechanisms for capturing missing data and updating the catalog going forward are underway.

  • There will be some duplicated data within the archive (e.g., two identical studies may be stored under two separate paths).

  • There is a known bug in the way the value representation of “Person Name” is stored: it is stored as a list of single characters rather than a string (e.g., the name Smith is stored as [S,m,i,t,h]).
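Until the “Person Name” bug is fixed, affected values can be repaired on the client side. The following Python sketch is one possible workaround, not an official fix. The function name repair_person_name is hypothetical, and the exact serialization of the character list is an assumption: the sketch handles an actual list, a quoted form such as ['S', 'm', 'i', 't', 'h'], and the unquoted form [S,m,i,t,h] shown above.

```python
import ast
from typing import Any


def repair_person_name(value: Any) -> str:
    """Rebuild a name that was stored as a list of single characters."""
    # Case 1: the driver already returned a real list or tuple.
    if isinstance(value, (list, tuple)):
        return "".join(str(ch) for ch in value)

    text = str(value).strip()
    if not (text.startswith("[") and text.endswith("]")):
        return text  # looks like a normal string, leave it unchanged

    # Case 2: quoted list literal, e.g. "['S', 'm', 'i', 't', 'h']".
    try:
        parsed = ast.literal_eval(text)
        if isinstance(parsed, (list, tuple)):
            return "".join(str(ch) for ch in parsed)
    except (ValueError, SyntaxError):
        pass

    # Case 3: unquoted form, e.g. "[S,m,i,t,h]". A whitespace-only element
    # is treated as a space character. Names that themselves contain a
    # comma cannot be recovered reliably from this form.
    chars = []
    for part in text[1:-1].split(","):
        stripped = part.strip()
        chars.append(stripped if stripped else (" " if part else ""))
    return "".join(chars)
```

Values that are not wrapped in square brackets are returned unchanged, so the helper can be applied to a whole column without first checking which rows are affected.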

 

Current Status

Schema: CDMECHO
Data snapshot from: 10/17/2023
Unique patients: 885,957

The following tables and attributes are available in AIR·MS:

ECHO_METADATA

Number of records: 268,682,931
Column Name
ID
FILE_ENTRY_ID
PATIENT_ID
SERIES_INSTANCE_UID
STUDY_INSTANCE_UID
SOP_INSTANCE_UID
IMAGE_FILE_PATH
AIR_CREATED_AT
AIR_UPDATED_AT

ECHO_TAGS_DATA

Number of records: 21,089,507,788
Column Name
ID
FILE_ENTRY_ID
SERIES_INSTANCE_UID
STUDY_INSTANCE_UID
SOP_INSTANCE_UID
TAGS
TAG_VALUE
AIR_CREATED_AT
AIR_UPDATED_AT
 

 

Synthetic Public Use File (DE-SynPUF)

The SYNPUF (Synthetic Public Use Files) dataset, provided by the Centers for Medicare & Medicaid Services (CMS), offers a synthetic version of Medicare claims data from the years 2008 to 2010. This dataset is meticulously designed to maintain the statistical properties and relationships present in the original data while ensuring that no actual patient information is disclosed, thereby safeguarding privacy.

SYNPUF includes a comprehensive array of variables such as beneficiary demographics, chronic conditions, hospital and outpatient claims, and prescription drug events, making it an invaluable resource for researchers and data scientists. It serves as an exemplary tool for developing and testing healthcare models, algorithms, and applications without the constraints associated with sensitive real-world data.

The SYNPUF dataset in AIR·MS uses the OMOP Common Data Model, in line with the other clinical datasets available on the platform. Since SYNPUF data does not require IRB approval, you can easily get onboarded and start building ML models!

Current Status

Schema: CDMSYNPUF

Number of patients: 2,326,856
Number of observations: 37,531,051
Number of measurements: 72,387,791

 

Work in Progress

We are constantly integrating new data into AIR·MS. The following data modalities are currently being worked on:

  • Electroencephalogram (EEG)

  • Endoscopy & colonoscopy reports

  • Bedmaster

  • Radiology images and reports