Data and Resources

SAP HANA In-Memory Database 

What is SAP HANA?

SAP HANA (High-performance Analytical Appliance) is an in-memory database platform developed by SAP SE. It uses in-memory computing to store data primarily in RAM, unlike traditional relational databases that need to retrieve data from disk-based storage solutions. With this technological innovation, SAP HANA can access in-memory data 10,000 times faster than data stored on standard disks. The result is that companies can now rapidly analyze large amounts of data and process transactions in seconds rather than hours. SAP HANA is the platform for SAP’s Enterprise Resource Planning software (S/4HANA) as well as other business applications and can run on-premises, in the cloud or in a hybrid configuration. Enterprises that have migrated over to SAP HANA have been able to realize accelerated business processes, improved data insights, and simplified IT environments. Since SAP HANA can run both OLTP and OLAP workloads, it eliminates the need to move data to another database to run analytical applications. This removes the burden of maintaining separate legacy systems, data silos and data warehouses. The result is less data redundancy, a smaller hardware footprint and less data management costs as well. SAP HANA analyzes live data for real-time business decisions and analytics, using advanced data processing engines for business, text, spatial, graph and series data. The initial version of the SAP HANA database was released to select customers in late 2010. SAP later debuted the general release at SAPPHIRE NOW, SAP’s annual technology conference in Orlando, FL in June 2011. SAP HANA made history as the first in-memory database in the world. It was an extremely popular product release, quickly becoming SAP’s fastest-adopted solution. Today there are over 30,000 customers utilizing SAP HANA.

Overview of SAP HANA

SAP HANA Platform

Overview of SAP HANA

While SAP HANA is mostly recognized for its ability to process large amounts of data at record speeds, it is far more than just an in-memory database. SAP HANA offers both column and row-based storage, making it capable of handling both OLTP and OLAP workloads. Tables that are organized in columns are optimized for high performing read operations while still providing good performance for write operations. In addition, column store offers extremely efficient data compression, which in turn saves memory and speeds up searches and calculations. Typical compression rates are 7x or greater when compared with a traditional RDBMS. Other column store features include table partitioning (which can further improve performance by utilizing partition pruning), and a delta store to optimize write operations. Another technology built into the platform includes dynamic tiering, allowing for a temperature based (hot/warm/cold) data aging strategy. The purpose of this feature is to extend SAP HANA memory with a disk-centric columnar store for less frequently used data. In addition, SAP HANA offers data integration services, high availability and disaster recovery capabilities, and a complete development platform known as SAP HANA XSA (SAP HANA extended application services, advanced model) that can be used to code with Java and Node.js. It includes OData support and is the successor to the now-deprecated XS classic model. Apps that were created with SAP HANA XS can be ported over to SAP HANA XSA. The SAP Web IDE for SAP HANA is a browser-based integrated development environment used to create web-based and mobile user interfaces, business logic, and advanced data models. It provides a handful of developer tools, including syntax-aware code editors, inspection tools, debugging tools, and CDS modeling capability.

SAP HANA and Machine Learning

Python Machine Learning Client for SAP HANA

Welcome to Python machine learning client for SAP HANA (hana-ml)! This package enables Python data scientists to access SAP HANA data and build various machine learning models using the data directly in SAP HANA. This page provides an overview of hana-ml. Python machine learning client for SAP HANA consists of two main parts:

SAP HANA DataFrame, which provides a set of methods for accessing and querying data in SAP HANA without bringing the data to the client.
A set of machine learning APIs for developing machine learning models.

Specifically, machine learning APIs are composed of two packages:

PAL packagePAL package consists of a set of Python algorithms and functions which provide access to machine learning capabilities in SAP HANA Predictive Analysis Library(PAL). SAP HANA PAL functions cover a variety of machine learning algorithms for training a model and then the trained model is used for scoring.
APL packageAutomated Predictive Library (APL) package exposes the data mining capabilities of the Automated Analytics engine in SAP HANA through a set of functions. These functions develop a predictive modeling process that analysts can use to answer simple questions on their customer datasets stored in SAP HANA.

hana-ml uses SAP HANA Python driver (hdbcli) to connect to and access SAP HANA. A figure of architecture is shown below:

For more information, see the Python Machine Learning Client for SAP HANA website

Secure cloud infrastructure 

The team at HPI.MS worked closely with engineers at Microsoft, Mount Sinai IT and Data4Life to establish secure Microsoft Azure cloud infrastructure to run AIR.MS. The following architecture diagram depicts the AIR.MS infrastructure running in the Mount Sinai Azure cloud.

Python Libraries 

SAP HANA ML (Machine Learning) is one of the Python libraries that we provide to researchers to query AIR.MS. SAP HANA ML is a component of SAP HANA, an in-memory database and analytics platform designed for handling massive amounts of data in real-time. HANA ML provides integrated machine learning capabilities to build, deploy, and execute machine learning models directly within the SAP HANA environment. It provides a wide range of machine learning algorithms, such as regression, classification, clustering, and time-series forecasting, that can be used for various use cases like patient dataset preparation, notes analysis, etc. It offers a set of libraries and APIs to develop custom machine learning models, or users can use pre-built models from the SAP HANA Model Store. The biggest advantage of HANA ML is In-Database Processing. HANA ML leverages the power of SAP HANA’s in-memory processing capabilities, allowing data to be analyzed and processed directly within the database. This eliminates the need for data movement, resulting in faster model training and predictions. HANA ML also supports model versioning and model management, allowing researchers to track and control the lifecycle of their machine learning models. Additionally, it offers model explainability features, providing insights into how the models arrive at their predictions, which is crucial for ensuring transparency and compliance with regulations. More details can be found in the SAP Documentation.

Benchmarking 

AIR.MS in-memory full-text search performance examples

Number of indexed notes: 86,893,767

Example 1: Show count for notes containing an EXACT search string ‘acute kidney disease‘, broken down by gender

0.31 second runtime!

Example 2: Show count for notes containing FUZZY search string ‘acute kidney disease‘, broken down by gender

0.39 second runtime!

Example 3: Show results for FUZZY search containing term ‘ulcerative colitis‘, and group by Note Type

AIR·MS Data Modalities 

AIR·MS data

The AIR·MS platform proudly features the following Mount Sinai and public datasets. Our team is dedicated to continuously expanding our database with additional data modalities, striving to build a comprehensive, multi-modal research resource. To help you get started with the data, we offer Quick Start guides available here.

Mount Sinai Data Warehouse (MSDW) in OMOP format

The MSDW dataset leverages the OMOP Common Data Model. The data is comprised of clinical data extracted from Mount Sinai’s Epic Caboodle database and other ancillary systems.

We offer both an identifiable and a de-identified version of the MSDW dataset in AIR·MS

MSDW OMOP identifiable (PHI) with extended attributes

Current Status

Schema: CDMPHI

Data snapshot from 04/16/2025

Unique patients:  12,107,621

MSDW OMOP de-identified (de-id)

Current Status

Schema: CDMDEID

Data snapshot from 09/26/2023

Unique patients:  10,934,180

Table	Record Count
CARE_SITE	106,699
CDC_RACE_ETHNICITY_XTN	967
CONCEPT	11,337,020
CONCEPT_ANCESTOR	101,072,304
CONCEPT_CLASS	453
CONCEPT_RELATIONSHIP	170,586,066
CONCEPT_SYNONYM	5,190,521
CONDITION_OCCURRENCE	206,228,288
DEATH	49,797
DOMAIN	50
DRUG_EXPOSURE	213,932,157
DRUG_STRENGTH	3,003,619
FACT_RELATIONSHIP	161,135,148
LOCATION	13,394,927
MEASUREMENT	1,875,881,345
NOTE	242,610,978
OBSERVATION	516,585,500
OBSERVATION_PERIOD	12,147,842
PERSON	12,107,621
PROCEDURE_OCCURRENCE	312,011,901
PROVIDER	1,363,962
PROVIDER_ATTRIBUTE_XTN	783,560
RELATIONSHIP	730
VISIT_OCCURRENCE	198,943,396
VOCABULARY	254

Note

Some of the standard OMOP tables contain extension fields (starting with the prefix ‘XTN’) which contain data outside of the OMOP standard data model. Many of these XTN attributes are based on data derived directly from EPIC (i.e. codes used in EPIC rather than the standardized OMOP codes), or attributes not currently contained in the OMOP standard.

Table	Record Count
MEASUREMENT	1,330,106,439
OBSERVATION	274,496,232
PROCEDURE_OCCURRENCE	192,252,154
DRUG_EXPOSURE	149,311,961
CONCEPT_RELATIONSHIP	143,179,970
NOTE	137,886,581
CONDITION_OCCURRENCE	122,966,185
VISIT_OCCURRENCE	118,936,818
FACT_RELATIONSHIP	118,710,809
CONCEPT_ANCESTOR	94,315,531
CONCEPT_SYNONYM	11,899,964
OBSERVATION_PERIOD	10,955,831
PERSON	10,934,180
CONCEPT	10,302,430
DRUG_STRENGTH	2,994,169
LOCATION	1,321,549
PROVIDER	1,307,112
CARE_SITE	301,467
DEATH	3,213
RELATIONSHIP	692
CONCEPT_CLASS	437
VOCABULARY	251
DOMAIN	50

Clinical Notes

Clinical notes in the form of unstructured data (progress notes, telephone encounters, nursing notes, procedures, etc.) extracted from the MSDW OMOP identifiable dataset have been loaded to AIR.MS and enabled for search using SAP HANA’s in-memory full-text search capabilities. This feature empowers the researcher to build patient cohorts based on terms contained in unstructured reports in seconds or even milliseconds! The researcher can further filter based on note type or an array of other clinical attributes.

Tip: See the Getting Started Guides for examples on how to perform search using python!

Current Status

Schema: CDMPHI

Data snapshot from 04/16/2025

Number of indexed notes: 242,610,978

Pathology (metadata)

The Pathology metadata aids researchers in the field of Computational Pathology. Researchers are able to query the metadata in combination with other linked data modalities to build a patient cohort, subsequently apply quantitative methods for the analysis of digital microscopy slides and relating the resulting statistical descriptors to patient outcomes.

We are also working on making the digital slides available to researchers on Minerva HPC.

Current Status

Data Source: Powerpath

Schema: CDMPATHOLOGY

Data snapshot from 01/08/2025

Unique patients: 3,528,719

Note

Pathology reports are now available in AIR·MS and can be found in table ACC_RESULTS. The reports are broken up into sections (clinical history, final diagnosis, SNOMED coding, etc.) that can be identified by column PATH_RPT_HEADING_NAME. If you are looking for the full report, you can combine all records for with the same ACCESSION_2_ID into a single output. The column containing the free-text (ACC_RESULTS_FINDING) has been enabled with SAP HANA full-text search capabilities. Examples of how to use HANA full-text search are available in this tutorial notebook. The following tables and attributes are available in AIR·MS: ACCESSION

Number of records: 7,107,464

Column Name	Comments
ACC_CATG	Case category
ACC_PROCESS_STEP_COMPLETED_DATE	Case finalize date / status update datetime
ACCESSION_2_ID	PowerPath unique case ID
ACCESSION_NO	Case number
BIRTH_DATE	Patient date of birth
CREATED_DATE	Case creation date
CURRENT_STATUS_ID	PowerPath unique identifier for a case status
FACILITY_CODE	Facility code
FACILITY_ID	PowerPath ID for the facility associated with the accession
FACILITY_NAME	Facility name
IMPORTED_CASE	One-character \”Y\” or \”N\” code indicating if the case was imported into PowerPath
LAST_UPDATE_DATETIME	Case finalize date / status update datetime
MED_REC_NO	Medical record number (EPIC MRN)
MRN_FACILITY_CODE	PowerPath ID for the facility that assigned the MRN
MRN_FACILITY_DESCRIPTION	Name of the facility that assigned the MRN
ORDER_NUMBER	Order number from the ordering system
PATIENT_AGE	The patient’s age on the case creation date
PATIENT_ID	PowerPath patient ID
PERSONNEL_2_FULL_NAME	Name of the pathologist who finalized the accession
PERSONNEL_2_ID	PowerPath ID for the pathologist who finalized the accession
PROCESS_STEP_DESCRIPTION	Case status name / description
VISIT_NUMBER	Encounter identifier

ACC_ICD

Number of records: 3,548,815

Column Name	Comments
ACC_ICD9_ID	PowerPath surrogate unique identifier for an ICD-10 code assigned to a case
ACCESSION_2_ID	PowerPath unique case ID
LAST_UPDATE_DATETIME	Case finalize date / status update datetime
MEDICAL_CODE	ICD-10 code assigned for billing
MEDICAL_CODE_ID	PowerPath surrogate unique identifier for an ICD-10 code

ACC_SLIDE

Number of records: 14,241,494

Column Name	Comments
ACC_BLOCK_ID	PowerPath unique block ID
ACC_BLOCK_LABEL	Specimen block identifier
ACC_PROCESS_STEP_COMPLETED_DATE	Case finalize date / status update datetime
ACC_SLIDE_ID	PowerPath unique slide ID
ACC_SPECIMEN_DESCRIPTION	Specimen source description
ACC_SPECIMEN_ID	PowerPath unique specimen ID
ACCESSION_2_ID	PowerPath unique case ID
BIOPSY	Boolean flag for 1 = biopsy, 0 = non-biopsy
COLLECTION_DATE	Specimen collection date
CONSULT_LABEL	Optional free text for slides of type \”consult\”
LAB_PROCEDURE_CODE	The procedure code
LAB_PROCEDURE_DESCRIPTION	The procedure description
LAB_PROCEDURE_ID	PowerPath unique procedure identifier
LAST_UPDATE_DATETIME	Case finalize date / status update datetime
RECV_DATE	Specimen received date
SLIDE_LABEL	Derived unique (business key) identifier for each slide
SLIDE_NO	Ordinal number of the slide from the specimen & block
SLIDE_TYPE	Whether the slide is stained, unstained or antibody/IHC
SOURCE_MATERIAL_LABEL	Derived unique (business key) identifier for each slide’s source specimen and block
SOURCE_REC_TYPE	Where the slide came from, either specimen or block
SPECIMEN_CATEGORY_ID	PowerPath specimen category ID
SPECIMEN_CATEGORY_NAME	The specimen category name
SPECIMEN_GROUPS_CODE	Specimen specialty code
SPECIMEN_GROUPS_ID	PowerPath specimen specialty ID
SPECIMEN_LABEL	Specimen identifier
TYPE	Whether the slide is consult or not consult

ACC_RESULTS

Number of records: 31,193,765

Column Name	Comments
ACC_RESULTS_FINDING	The text of the report section
ACC_RESULTS_ID	PowerPath unique identifier for each report section
ACC_RESULTS_REC_ID	sort order of result section on RTF
ACCESSION_2_ID	PowerPath unique case ID
LAST_UPDATE_DATETIME	Case finalize date / status update datetime
PATH_RPT_HEADING_ID	PowerPath result section heading on RTF
PATH_RPT_HEADING_NAME	PowerPath result section heading name

ACC_SLIDE_IMAGESERVER

Number of records: 2,564,239

Column Name	Comments
ACC_SLIDE_ID	PowerPath unique slide ID
ACC_SLIDE_IMAGESERVER_DESCRIPTION	The name of the Philips iSyntax slide image file
ACC_SLIDE_IMAGESERVER_ID	PowerPath unique identifier for a slide image
INTERNAL_SLIDE_ID	Identifier for the slide, also known as the \”barcode\” ID
LAST_UPDATE_DATETIME	Case finalize date / status update datetime
SCAN_DATE	The date on which the slide image was digitized

Radiology (metadata)

AIR·MS now features radiology metadata extracted from the Mount Sinai IRW 2.0 XNAT system (via an MSDW data pipeline). This data set is comprised of detailed DICOM (Digital Imaging and Communications in Medicine) tags associated with the medical images.

These tags provide essential metadata, including patient information, imaging parameters, equipment details, and procedural context, ensuring a comprehensive understanding of each radiological study.

By integrating this metadata, we enable researchers to gain deeper insights into the imaging data, facilitating advanced analyses and fostering innovations in medical imaging research.

Current Status

Schema: CDMRADIOLOGY

Data snapshot from 8/16/2024

Unique patients: 2,057,482

The following tables and attributes are available in AIR·MS:

Number of records: 66,993,826

RADIOLOGY_METADATA

PATIENT_ID

SERIES_INSTANCE_UID

STUDY_INSTANCE_UID

ETL_RECORD_UPDATE_DATETIME

Number of records: 7,745,407,087

RADIOLOGY_DICOM_DATA

RADIOLOGY_METADATA_ID

DICOM_TAGS

TAG_VALUE_REPRESENTATION

TAG_VALUE” NCLOB MEMORY

TAG_XTN_PATIENT_EPIC_MRN

TAG_PATIENT_NAME

SERIES_INSTANCE_UID

STUDY_INSTANCE_UID

ETL_RECORD_UPDATE_DATETIME

Mount Sinai Million Health Discoveries Program

The current lack of diversity in genomic research data is hindering what we can learn about health and potential treatments in our global population. By enhancing the diversity of people participating in genomic research, we can advance our knowledge and discovery of human genetics for all populations. To that end, The Charles Bronfman Institute for Personalized Medicine is spearheading the effort to carry out the genetic sequencing of one million Mount Sinai patients within the next five years. This initiative, one of the largest such sequencing projects of its kind, will integrate health and research data at Mount Sinai to promote discoveries that will directly benefit our patient population.

Access to the BioMe Biobank and Mount Sinai Million Biobank on HPC can be requested via the CBIPM Data and Specimen Inquiry Form

The following tables and attributes are available in AIR·MS:

Mount Sinai Million / BioMe identifiable (PHI)

Current Status

Schema: CDMMSM

Data snapshot from 06/09/2025

Unique patients:  279,885

Mount Sinai Million / BioMe de-identified (de-id)

Current Status

Schema: CDMMSMDEID

Data snapshot from 06/09/2025

Unique patients:  279,885

PATIENT	Comments
ID	Internal ID for linking to other tables within the dataset
MRN	Medical Record Number (EPIC MRN – only accessible under regulatory approval)
MASKED_MRN	De-Identifier for combined BioMe Biobank set with Regeneron and Sema4 data
RGN_ID	De-identifier for first Regeneron batch regarding BioMe Biobank
SEMA4_ID	De-identifier for Sema4, a subset of Masked MRN ID
MSM_ID	De-identifier for Mount Sinai Million Biobank, a combined setoff RGN_ID and new MSM ID
MILLION_ID	Indicator for all consented patients with and without genomic data
AIR_CREATED_AT	Record creation in AIR·MS
AIR_UPDATED_AT	Record updated in AIR·MS

PATIENT	Comments
ID	Internal ID for linking to other tables within the dataset
MASKED_MRN	De-Identifier for combined BioMe Biobank set with Regeneron and Sema4 data
RGN_ID	De-identifier for first Regeneron batch regarding BioMe Biobank
SEMA4_ID	De-identifier for Sema4, a subset of Masked MRN ID
MSM_ID	De-identifier for Mount Sinai Million Biobank, a combined setoff RGN_ID and new MSM ID
MILLION_ID	Indicator for all consented patients with and without genomic data
AIR_CREATED_AT	Record creation in AIR·MS
AIR_UPDATED_AT	Record updated in AIR·MS

Electrocardiogram (ECG)

Data Source: GE HealthCare MUSE Cardiology Information System

Schema: CDMECG

Data snapshot from: 04/10/2021

Unique patients: 1,961,254

The following tables and attributes are available in AIR·MS:

Number of records: 9,275,130

PATIENT_DEMOGRAPHICS

PATIENT_DEMOGRAPHICS_ID (X)

FILE_ENTRY_ID

PATIENT_ID

PATIENTAGE

AGEUNITS

DATEOFBIRTH

GENDER

RACE

PATIENTLASTNAME

PATIENTFIRSTNAME

Number of records: 9,168,266

DIAGNOSIS

DIAGNOSIS_ID (X)

FILE_ENTRY_ID

MODALITY

DIAGNOSISSTATEMENT

Number of records: 73,631,055

LEAD_DATA

LEAD_DATA_ID (X)

FILE_ENTRY_ID

LEADBYTECOUNTTOTAL

LEADTIMEOFFSET

LEADSAMPLECOUNTTOTAL

LEADAMPLITUDEUNITSPERBIT

LEADAMPLITUDEUNITS

LEADHIGHLIMIT

LEADLOWLIMIT

LEADID

LEADOFFSETFIRSTSAMPLE

FIRSTSAMPLEBASELINE

LEADSAMPLESIZE

LEADOFF

BASELINESWAY

LEADDATACRC32

WAVEFORMDATA

Number of records: 9,610,935

ECG_FILES

FILE_ENTRY_ID (X)

FILE_NAME

FILE_PATH

FILE_HASH

FILE_SIZE_BYTES

ACQUISITION_DATE

ACQUISITION_TIME

PROCESSING_STATUS

STATUS_CODE

NOTES_AND_COMMENTS

FILE_TIMESTAMP

AIR_CREATED_AT

AIR_UPDATED_AT

JSON_STATUS

Number of records: 9,168,230

MUSE_INFO

MUSEVERSION

FILE_ENTRY_ID

Number of records: 7,757,472

ORDER_INFO

ORDER_INFO_ID (X)

FILE_ENTRY_ID

HISACCOUNTNUMBER

ORDERTIME

ADMITTIME

ADMITDATE

HISLOCATION

BED

ATTENDINGMDHISID

ATTENDINGMDLASTNAME

ATTENDINGMDFIRSTNAME

ALTERNATEVISITID

HISDISPOSITION

ADMITSOURCE

PRIMARYDIAGNOSTICCODE

SERVICINGFACILITY

ADMITTINGMDHISID

ADMITTINGMDLASTNAME

ADMITTINGMDFIRSTNAME

CONSULTINGMDID

REFERRINGMDHISID

HOSPITALSERVICE

ADMISSIONTYPE

Number of records: 9,164,220

ORIGINAL_DIAGNOSIS

ORIGINAL_DIAGNOSIS_ID (X)

FILE_ENTRY_ID

MODALITY

DIAGNOSISSTATEMENT

Number of records: 9,168,044

ORIGINAL_RESTING_ECG_MEASUREMENTS

ORIGINAL_RESTING_ECG_MEASUREMENTS_ID (X)

VENTRICULARRATE

ATRIALRATE

PRINTERVAL

QRSDURATION

QTINTERVAL

QTCORRECTED

PAXIS

RAXIS

TAXIS

QRSCOUNT

QONSET

QOFFSET

PONSET

POFFSET

TOFFSET

ECGSAMPLEBASE

ECGSAMPLEEXPONENT

QTCFREDERICA

Number of records: 3,613,372

PHARMA_DATA

PHARMA_DATA_ID (X)

PHARMARRINTERVAL

PHARMAUNIQUEECGID

PHARMAPPINTERVAL

PHARMACARTID

FILE_ENTRY_ID

Number of records: 9,649,712

QRS_TIMES_TYPES

GLOBALRR

QTRGGR

FILE_ENTRY_ID

Number of records: 9,657,368

RESTING_ECG

RESTING_ECG_ID (X)

FILE_ENTRY_ID

PATIENT_ID

ACQUISITIONDATE

ACQUISITIONTIME

STATUS

Number of records: 9,167,778

RESTING_ECG_MEASUREMENTS

RESTING_ECG_MEASUREMENTS_ID (X)

FILE_ENTRY_ID

VENTRICULARRATE

ATRIALRATE

PRINTERVAL

QRSDURATION

QTINTERVAL

QTCORRECTED

PAXIS

RAXIS

TAXIS

QRSCOUNT

QONSET

QOFFSET

PONSET

POFFSET

TOFFSET

ECGSAMPLEBASE

ECGSAMPLEEXPONENT

QTCFREDERICA

Number of records: 9,654,325

TEST_DEMOGRAPHICS

TEST_DEMOGRAPHICS_ID (X)

FILE_ENTRY_ID

DATATYPE

SITE

SITENAME

ACQUISITIONDEVICE

STATUS

EDITLISTSTATUS

PRIORITY

LOCATION

LOCATIONNAME

ROOMID

ACQUISITIONTIME

ACQUISITIONDATE

CARTNUMBER

ACQUISITIONSOFTWAREVERSION

ANALYSISSOFTWAREVERSION

EDITTIME

EDITDATE

EDITORID

REFERRINGMDLASTNAME

REFERRINGMDFIRSTNAME

ACQUISITIONTECHLASTNAME

EDITORLASTNAME

EDITORFIRSTNAME

SECONDARYID

HISSTATUS

Echocardiogram (echo)

Schema: CDMECHO

Data snapshot from: 10/17/2023

Unique patients: 885,957

The following tables and attributes are available in AIR·MS:

Number of records: 268,682,931

ECHO_METADATA

FILE_ENTRY_ID

PATIENT_ID

SERIES_INSTANCE_UID

STUDY_INSTANCE_UID

SOP_INSTANCE_UID

IMAGE_FILE_PATH

AIR_CREATED_AT

AIR_UPDATED_AT

Number of records: 21,089,507,788

ECHO_TAGS_DATA

FILE_ENTRY_ID

SERIES_INSTANCE_UID

STUDY_INSTANCE_UID

SOP_INSTANCE_UID

Synthetic Public Use File (DE-SynPUF)

The SYNPUF (Synthetic Public Use Files) dataset, provided by the Centers for Medicare & Medicaid Services (CMS), offers a synthetic version of Medicare claims data from the years 2008 to 2010. This dataset is meticulously designed to maintain the statistical properties and relationships present in the original data while ensuring that no actual patient information is disclosed, thereby safeguarding privacy.

SYNPUF includes a comprehensive array of variables such as beneficiary demographics, chronic conditions, hospital and outpatient claims, and prescription drug events, making it an invaluable resource for researchers and data scientists. It serves as an exemplary tool for developing and testing healthcare models, algorithms, and applications without the constraints associated with sensitive real-world data.

The SYNPUF dataset in AIR·MS utilizes the OMOP Common Data Model, aligned with other clinical data sets available on the platform. Since SYNPUF data does not require an approved IRB, you can easily get onboarded and start building ML models!

Current Status

Schema: CDMSYNPUF

Number of patients: 2,326,856

Number of observations: 37,531,051

Number of measurements: 72,387,791

Work in Progress

We are constantly integrating new data to AIR·MS. The following data modalities are currently being worked on:

Electroencephalogram (EEG)
Endoscopy & colonoscopy reports
Bedmaster
Radiology images and reports

Compute

Training	Feedback

Consulting	Open Positions

Publications

Archive

Compute

Training	Feedback

Consulting	Open Positions

Publications

Archive

Services

Training	Github

Consulting	Open Positions

Publications

Archive

About Us

Meet the Team	Feedback

Instagram	Open Positions

Youtube

Contact Us

Data and Resources

SAP HANA In-Memory Database

What is SAP HANA?

SAP HANA Platform

SAP HANA and Machine Learning

Python Machine Learning Client for SAP HANA

Secure cloud infrastructure

Python Libraries

Benchmarking

AIR.MS in-memory full-text search performance examples

Example 1: Show count for notes containing an EXACT search string ‘acute kidney disease‘, broken down by gender

Example 2: Show count for notes containing FUZZY search string ‘acute kidney disease‘, broken down by gender

Example 3: Show results for FUZZY search containing term ‘ulcerative colitis‘, and group by Note Type

AIR·MS Data Modalities

AIR·MS data

Mount Sinai Data Warehouse (MSDW) in OMOP format

We offer both an identifiable and a de-identified version of the MSDW dataset in AIR·MS

MSDW OMOP identifiable (PHI) with extended attributes

​​​​​​​Current Status

MSDW OMOP de-identified (de-id)

​​​​​​​Current Status

Note

Clinical Notes

​​​​​​​Current Status

Pathology (metadata)

​​​​​​​Current Status

Note

Radiology (metadata)

Current Status

Mount Sinai Million Health Discoveries Program

Mount Sinai Million / BioMe identifiable (PHI)

​​​​​​​Current Status

Mount Sinai Million / BioMe de-identified (de-id)

​​​​​​​Current Status

Electrocardiogram (ECG)

Echocardiogram (echo)

Synthetic Public Use File (DE-SynPUF)

​​​​​​​Current Status

Work in Progress

SAP HANA In-Memory Database 

Secure cloud infrastructure 

Python Libraries 

Benchmarking 

AIR·MS Data Modalities 

Current Status

Current Status

Current Status

Current Status

Current Status

Current Status

Current Status