Scientific Computing & Data

Data and Resources

SAP HANA In-Memory Database 

What is SAP HANA?

SAP HANA (High-performance Analytical Appliance) is an in-memory database platform developed by SAP SE. It stores data primarily in RAM, unlike traditional relational databases, which must retrieve data from disk-based storage. With this technological innovation, SAP HANA can access in-memory data 10,000 times faster than data stored on standard disks. The result is that companies can rapidly analyze large amounts of data and process transactions in seconds rather than hours.

SAP HANA is the platform for SAP’s Enterprise Resource Planning software (S/4HANA) as well as other business applications, and it can run on-premises, in the cloud, or in a hybrid configuration. Enterprises that have migrated to SAP HANA have been able to realize accelerated business processes, improved data insights, and simplified IT environments.

Since SAP HANA can run both OLTP and OLAP workloads, it eliminates the need to move data to another database to run analytical applications. This removes the burden of maintaining separate legacy systems, data silos, and data warehouses. The result is less data redundancy, a smaller hardware footprint, and lower data management costs. SAP HANA analyzes live data for real-time business decisions and analytics, using advanced data processing engines for business, text, spatial, graph, and series data.

The initial version of the SAP HANA database was released to select customers in late 2010. SAP later debuted the general release at SAPPHIRE NOW, SAP’s annual technology conference, in Orlando, FL in June 2011. SAP HANA made history as the first in-memory database in the world. It was an extremely popular product release, quickly becoming SAP’s fastest-adopted solution. Today there are over 30,000 customers using SAP HANA.

Overview of SAP HANA

SAP HANA Platform

While SAP HANA is best known for its ability to process large amounts of data at high speed, it is far more than just an in-memory database. SAP HANA offers both column-based and row-based storage, making it capable of handling both OLTP and OLAP workloads. Tables that are organized in columns are optimized for high-performing read operations while still providing good performance for write operations. In addition, the column store offers extremely efficient data compression, which in turn saves memory and speeds up searches and calculations. Typical compression rates are 7x or greater when compared with a traditional RDBMS. Other column store features include table partitioning (which can further improve performance through partition pruning) and a delta store to optimize write operations.

Another technology built into the platform is dynamic tiering, which allows for a temperature-based (hot/warm/cold) data aging strategy. The purpose of this feature is to extend SAP HANA memory with a disk-centric columnar store for less frequently used data.

In addition, SAP HANA offers data integration services, high availability and disaster recovery capabilities, and a complete development platform known as SAP HANA XSA (SAP HANA extended application services, advanced model) that can be used to code with Java and Node.js. XSA includes OData support and is the successor to the now-deprecated XS classic model. Apps that were created with SAP HANA XS can be ported over to SAP HANA XSA. The SAP Web IDE for SAP HANA is a browser-based integrated development environment used to create web-based and mobile user interfaces, business logic, and advanced data models. It provides a handful of developer tools, including syntax-aware code editors, inspection tools, debugging tools, and CDS modeling capability.
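To make the column-store compression mentioned above more concrete, here is a toy Python sketch of dictionary encoding, a technique commonly used by column stores. It is an illustration only and not SAP HANA's actual implementation; the function names and the sample column are hypothetical.

```python
from typing import List, Sequence, Tuple


def dictionary_encode(column: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Replace each value with an index into a sorted list of distinct values."""
    dictionary = sorted(set(column))
    position = {value: i for i, value in enumerate(dictionary)}
    value_ids = [position[value] for value in column]
    return dictionary, value_ids


def dictionary_decode(dictionary: Sequence[str], value_ids: Sequence[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the value IDs."""
    return [dictionary[i] for i in value_ids]


# Hypothetical low-cardinality column (values invented for illustration).
note_types = [
    "Progress Note", "Nursing Note", "Progress Note",
    "Telephone Encounter", "Progress Note", "Nursing Note",
]

dictionary, value_ids = dictionary_encode(note_types)
print(dictionary)  # the distinct values, stored once
print(value_ids)   # small integers instead of repeated strings
```

Each distinct string is stored once, and the column itself shrinks to a list of small integers. The fewer distinct values a column has, the larger the saving, which is one reason columnar layouts tend to compress well.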

SAP HANA and Machine Learning

Python Machine Learning Client for SAP HANA

Welcome to Python machine learning client for SAP HANA (hana-ml)! This package enables Python data scientists to access SAP HANA data and build various machine learning models using the data directly in SAP HANA. This page provides an overview of hana-ml. Python machine learning client for SAP HANA consists of two main parts:

  • SAP HANA DataFrame, which provides a set of methods for accessing and querying data in SAP HANA without bringing the data to the client.
  • A set of machine learning APIs for developing machine learning models.

Specifically, machine learning APIs are composed of two packages:

  • PAL package: The PAL package consists of a set of Python algorithms and functions that provide access to the machine learning capabilities of the SAP HANA Predictive Analysis Library (PAL). SAP HANA PAL functions cover a variety of machine learning algorithms for training a model; the trained model is then used for scoring.
  • APL package: The Automated Predictive Library (APL) package exposes the data mining capabilities of the Automated Analytics engine in SAP HANA through a set of functions. These functions develop a predictive modeling process that analysts can use to answer simple questions on their customer datasets stored in SAP HANA.

hana-ml uses the SAP HANA Python driver (hdbcli) to connect to and access SAP HANA. An architecture figure is shown below:

For more information, see the Python Machine Learning Client for SAP HANA website
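As a minimal sketch of the DataFrame workflow described above, the helpers below open a hana-ml connection and run a row count and a small preview inside SAP HANA. The helper names are hypothetical, and the host, port, and credentials in the usage comment are placeholders; the schema and table names are taken from the dataset listings on this page, and PERSON_ID and YEAR_OF_BIRTH are standard OMOP columns of the PERSON table.

```python
def open_connection(address: str, port: int, user: str, password: str):
    """Open a hana-ml connection (requires the hana-ml package to be installed)."""
    # Imported lazily so this module can be loaded without hana-ml present.
    from hana_ml.dataframe import ConnectionContext
    return ConnectionContext(address=address, port=port, user=user, password=password)


def count_rows(conn, table: str, schema: str) -> int:
    """Count the rows of schema.table; the counting runs inside SAP HANA."""
    return conn.table(table, schema=schema).count()


def preview(conn, table: str, schema: str, columns, n: int = 5):
    """Fetch the first n rows of the chosen columns as a pandas DataFrame."""
    return conn.table(table, schema=schema).select(*columns).head(n).collect()


# Example usage (placeholders, not real connection details):
# conn = open_connection("hana.example.org", 443, "MY_USER", "MY_PASSWORD")
# print(count_rows(conn, table="PERSON", schema="CDMDEID"))
# print(preview(conn, "PERSON", "CDMDEID", ["PERSON_ID", "YEAR_OF_BIRTH"]))
# conn.close()
```

Only the final count or the few previewed rows travel to the client; the table itself stays in the database, which is the point of the SAP HANA DataFrame.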

 

Secure cloud infrastructure 

The team at HPI.MS worked closely with engineers at Microsoft, Mount Sinai IT and Data4Life to establish secure Microsoft Azure cloud infrastructure to run AIR.MS. The following architecture diagram depicts the AIR.MS infrastructure running in the Mount Sinai Azure cloud.

 

Python Libraries 

SAP HANA ML (Machine Learning) is one of the Python libraries that we provide to researchers to query AIR.MS. SAP HANA ML is a component of SAP HANA, an in-memory database and analytics platform designed for handling massive amounts of data in real time. HANA ML provides integrated machine learning capabilities to build, deploy, and execute machine learning models directly within the SAP HANA environment. It provides a wide range of machine learning algorithms, such as regression, classification, clustering, and time-series forecasting, that can be used for use cases such as patient dataset preparation and notes analysis. It offers a set of libraries and APIs to develop custom machine learning models, or users can use pre-built models from the SAP HANA Model Store.

The biggest advantage of HANA ML is in-database processing. HANA ML leverages SAP HANA’s in-memory processing capabilities, allowing data to be analyzed and processed directly within the database. This eliminates the need for data movement, resulting in faster model training and predictions.

HANA ML also supports model versioning and model management, allowing researchers to track and control the lifecycle of their machine learning models. Additionally, it offers model explainability features, providing insights into how the models arrive at their predictions, which is crucial for ensuring transparency and compliance with regulations. More details can be found in the SAP Documentation.

 

Benchmarking 

AIR.MS in-memory full-text search performance examples

Number of indexed notes: 86,893,767

Example 1: Show the count of notes containing the EXACT search string ‘acute kidney disease’, broken down by gender

 

0.31 second runtime!

Example 2: Show the count of notes containing the FUZZY search string ‘acute kidney disease’, broken down by gender

 

0.39 second runtime!

Example 3: Show results for a FUZZY search for the term ‘ulcerative colitis’, grouped by Note Type
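The queries behind these timings are not reproduced on this page. As a hedged sketch of what Examples 1 and 2 could look like, the helper below builds a SAP HANA SQL statement that uses the CONTAINS predicate in EXACT or FUZZY mode. It assumes the full-text index sits on the OMOP NOTE.NOTE_TEXT column and that gender comes from PERSON.GENDER_CONCEPT_ID; the function name, the default schema, and the fuzziness value are illustrative choices.

```python
from typing import Optional


def note_count_by_gender_sql(term: str, schema: str = "CDMPHI",
                             fuzziness: Optional[float] = None) -> str:
    """Build a query counting notes that match `term`, grouped by gender.

    fuzziness=None requests an exact match; a value in (0, 1] requests a
    fuzzy match with that minimum similarity score.
    """
    if fuzziness is not None and not 0.0 < fuzziness <= 1.0:
        raise ValueError("fuzziness must be in (0, 1]")
    escaped = term.replace("'", "''")  # escape single quotes in the literal
    mode = "EXACT" if fuzziness is None else f"FUZZY({fuzziness})"
    return (
        "SELECT p.GENDER_CONCEPT_ID, COUNT(*) AS NOTE_COUNT\n"
        f'FROM "{schema}"."NOTE" n\n'
        f'JOIN "{schema}"."PERSON" p ON p.PERSON_ID = n.PERSON_ID\n'
        f"WHERE CONTAINS(n.NOTE_TEXT, '{escaped}', {mode})\n"
        "GROUP BY p.GENDER_CONCEPT_ID"
    )


exact_sql = note_count_by_gender_sql("acute kidney disease")
fuzzy_sql = note_count_by_gender_sql("acute kidney disease", fuzziness=0.8)
print(fuzzy_sql)
```

The resulting string can be run through a database connection, for example with the `sql(...)` method of a hana-ml `ConnectionContext` followed by `collect()`. Grouping by NOTE_TYPE_CONCEPT_ID instead of gender would give a query in the spirit of Example 3.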

 

AIR·MS Data Modalities 

AIR·MS data

The AIR·MS platform proudly features the following Mount Sinai and public datasets. Our team is dedicated to continuously expanding our database with additional data modalities, striving to build a comprehensive, multi-modal research resource. To help you get started with the data, we offer Quick Start guides available here.

 

 

Mount Sinai Data Warehouse (MSDW) in OMOP format

The MSDW dataset leverages the OMOP Common Data Model. It comprises clinical data extracted from Mount Sinai’s Epic Caboodle database and other ancillary systems.

 

We offer both an identifiable and a de-identified version of the MSDW dataset in AIR·MS.

MSDW OMOP identifiable (PHI) with extended attributes

Current Status

Schema: CDMPHI
Data snapshot from 04/16/2025
Unique patients: 12,107,621
Table Record Count
CARE_SITE 106,699
CDC_RACE_ETHNICITY_XTN 967
CONCEPT 11,337,020
CONCEPT_ANCESTOR 101,072,304
CONCEPT_CLASS 453
CONCEPT_RELATIONSHIP 170,586,066
CONCEPT_SYNONYM 5,190,521
CONDITION_OCCURRENCE 206,228,288
DEATH 49,797
DOMAIN 50
DRUG_EXPOSURE 213,932,157
DRUG_STRENGTH 3,003,619
FACT_RELATIONSHIP 161,135,148
LOCATION 13,394,927
MEASUREMENT 1,875,881,345
NOTE 242,610,978
OBSERVATION 516,585,500
OBSERVATION_PERIOD 12,147,842
PERSON 12,107,621
PROCEDURE_OCCURRENCE 312,011,901
PROVIDER 1,363,962
PROVIDER_ATTRIBUTE_XTN 783,560
RELATIONSHIP 730
VISIT_OCCURRENCE 198,943,396
VOCABULARY 254

Note

Some of the standard OMOP tables contain extension fields (starting with the prefix ‘XTN’) which contain data outside of the OMOP standard data model. Many of these XTN attributes are based on data derived directly from EPIC (i.e. codes used in EPIC rather than the standardized OMOP codes), or attributes not currently contained in the OMOP standard.

MSDW OMOP de-identified (de-id)

Current Status

Schema: CDMDEID
Data snapshot from 09/26/2023
Unique patients: 10,934,180
Table Record Count
MEASUREMENT 1,330,106,439
OBSERVATION 274,496,232
PROCEDURE_OCCURRENCE 192,252,154
DRUG_EXPOSURE 149,311,961
CONCEPT_RELATIONSHIP 143,179,970
NOTE 137,886,581
CONDITION_OCCURRENCE 122,966,185
VISIT_OCCURRENCE 118,936,818
FACT_RELATIONSHIP 118,710,809
CONCEPT_ANCESTOR 94,315,531
CONCEPT_SYNONYM 11,899,964
OBSERVATION_PERIOD 10,955,831
PERSON 10,934,180
CONCEPT 10,302,430
DRUG_STRENGTH 2,994,169
LOCATION 1,321,549
PROVIDER 1,307,112
CARE_SITE 301,467
DEATH 3,213
RELATIONSHIP 692
CONCEPT_CLASS 437
VOCABULARY 251
DOMAIN 50

 

Clinical Notes

Clinical notes in the form of unstructured data (progress notes, telephone encounters, nursing notes, procedures, etc.) extracted from the MSDW OMOP identifiable dataset have been loaded to AIR.MS and enabled for search using SAP HANA’s in-memory full-text search capabilities. This feature empowers the researcher to build patient cohorts based on terms contained in unstructured reports in seconds or even milliseconds! The researcher can further filter based on note type or an array of other clinical attributes.

Tip: See the Getting Started Guides for examples of how to perform a search using Python!

​​​​​​​Current Status

Schema: CDMPHI
Data snapshot from 04/16/2025
Number of indexed notes: 242,610,978
 

 

Pathology (metadata)

The Pathology metadata aids researchers in the field of Computational Pathology. Researchers can query the metadata in combination with other linked data modalities to build a patient cohort, then apply quantitative methods to analyze digital microscopy slides and relate the resulting statistical descriptors to patient outcomes.

We are also working on making the digital slides available to researchers on Minerva HPC.

​​​​​​​Current Status

Data Source: PowerPath
Schema: CDMPATHOLOGY
Data snapshot from 01/08/2025
Unique patients: 3,528,719

Note

Pathology reports are now available in AIR·MS and can be found in table ACC_RESULTS. The reports are broken up into sections (clinical history, final diagnosis, SNOMED coding, etc.) that can be identified by column PATH_RPT_HEADING_NAME. If you are looking for the full report, you can combine all records with the same ACCESSION_2_ID into a single output. The column containing the free text (ACC_RESULTS_FINDING) has been enabled with SAP HANA full-text search capabilities. Examples of how to use HANA full-text search are available in this tutorial notebook.

The following tables and attributes are available in AIR·MS:

ACCESSION

Number of records: 7,107,464
Column Name Comments
ACC_CATG Case category
ACC_PROCESS_STEP_COMPLETED_DATE Case finalize date / status update datetime
ACCESSION_2_ID PowerPath unique case ID
ACCESSION_NO Case number
BIRTH_DATE Patient date of birth
CREATED_DATE Case creation date
CURRENT_STATUS_ID PowerPath unique identifier for a case status
FACILITY_CODE Facility code
FACILITY_ID PowerPath ID for the facility associated with the accession
FACILITY_NAME Facility name
IMPORTED_CASE One-character “Y” or “N” code indicating if the case was imported into PowerPath
LAST_UPDATE_DATETIME Case finalize date / status update datetime
MED_REC_NO Medical record number  (EPIC MRN)
MRN_FACILITY_CODE PowerPath ID for the facility that assigned the MRN
MRN_FACILITY_DESCRIPTION Name of the facility that assigned the MRN
ORDER_NUMBER Order number from the ordering system
PATIENT_AGE The patient’s age on the case creation date
PATIENT_ID PowerPath patient ID
PERSONNEL_2_FULL_NAME Name of the pathologist who finalized the accession
PERSONNEL_2_ID PowerPath ID for the pathologist who finalized the accession
PROCESS_STEP_DESCRIPTION Case status name / description
VISIT_NUMBER Encounter identifier

ACC_ICD

Number of records: 3,548,815
Column Name Comments
ACC_ICD9_ID PowerPath surrogate unique identifier for an ICD-10 code assigned to a case
ACCESSION_2_ID PowerPath unique case ID
LAST_UPDATE_DATETIME Case finalize date / status update datetime
MEDICAL_CODE ICD-10 code assigned for billing
MEDICAL_CODE_ID PowerPath surrogate unique identifier for an ICD-10 code

ACC_SLIDE

Number of records: 14,241,494
Column Name Comments
ACC_BLOCK_ID PowerPath unique block ID
ACC_BLOCK_LABEL Specimen block identifier
ACC_PROCESS_STEP_COMPLETED_DATE Case finalize date / status update datetime
ACC_SLIDE_ID PowerPath unique slide ID
ACC_SPECIMEN_DESCRIPTION Specimen source description
ACC_SPECIMEN_ID PowerPath unique specimen ID
ACCESSION_2_ID PowerPath unique case ID
BIOPSY Boolean flag: 1 = biopsy, 0 = non-biopsy
COLLECTION_DATE Specimen collection date
CONSULT_LABEL Optional free text for slides of type “consult”
LAB_PROCEDURE_CODE The procedure code
LAB_PROCEDURE_DESCRIPTION The procedure description
LAB_PROCEDURE_ID PowerPath unique procedure identifier
LAST_UPDATE_DATETIME Case finalize date / status update datetime
RECV_DATE Specimen received date
SLIDE_LABEL Derived unique (business key) identifier for each slide
SLIDE_NO Ordinal number of the slide from the specimen & block
SLIDE_TYPE Whether the slide is stained, unstained, or antibody/IHC
SOURCE_MATERIAL_LABEL Derived unique (business key) identifier for each slide’s source specimen and block
SOURCE_REC_TYPE Where the slide came from, either specimen or block
SPECIMEN_CATEGORY_ID PowerPath specimen category ID
SPECIMEN_CATEGORY_NAME The specimen category name
SPECIMEN_GROUPS_CODE Specimen specialty code
SPECIMEN_GROUPS_ID PowerPath specimen specialty ID
SPECIMEN_LABEL Specimen identifier
TYPE Whether the slide is consult or not consult

ACC_RESULTS

Number of records: 31,193,765
Column Name Comments
ACC_RESULTS_FINDING The text of the report section
ACC_RESULTS_ID PowerPath unique identifier for each report section
ACC_RESULTS_REC_ID Sort order of the result section on RTF
ACCESSION_2_ID PowerPath unique case ID
LAST_UPDATE_DATETIME Case finalize date / status update datetime
PATH_RPT_HEADING_ID PowerPath result section heading on RTF
PATH_RPT_HEADING_NAME PowerPath result section heading name
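
As a small sketch of how the sections of one report could be combined into a single output, the helper below groups ACC_RESULTS rows by ACCESSION_2_ID and joins them in ACC_RESULTS_REC_ID order. It assumes the rows have already been fetched as dictionaries keyed by column name; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def assemble_reports(rows: Iterable[Mapping[str, object]]) -> Dict[object, str]:
    """Combine ACC_RESULTS sections into one text per ACCESSION_2_ID."""
    sections = defaultdict(list)
    for row in rows:
        sections[row["ACCESSION_2_ID"]].append(row)

    reports = {}
    for accession_id, parts in sections.items():
        # ACC_RESULTS_REC_ID is documented as the sort order of the sections.
        parts.sort(key=lambda r: r["ACC_RESULTS_REC_ID"])
        reports[accession_id] = "\n\n".join(
            f'{p["PATH_RPT_HEADING_NAME"]}\n{p["ACC_RESULTS_FINDING"]}' for p in parts
        )
    return reports


# Invented sample rows, for illustration only.
sample_rows = [
    {"ACCESSION_2_ID": 1, "ACC_RESULTS_REC_ID": 2,
     "PATH_RPT_HEADING_NAME": "Final Diagnosis", "ACC_RESULTS_FINDING": "Benign."},
    {"ACCESSION_2_ID": 1, "ACC_RESULTS_REC_ID": 1,
     "PATH_RPT_HEADING_NAME": "Clinical History", "ACC_RESULTS_FINDING": "Screening."},
    {"ACCESSION_2_ID": 2, "ACC_RESULTS_REC_ID": 1,
     "PATH_RPT_HEADING_NAME": "Final Diagnosis", "ACC_RESULTS_FINDING": "Pending."},
]

reports = assemble_reports(sample_rows)
print(reports[1])
```

Doing the concatenation on the client keeps the example independent of how the free-text column is typed in the database; for large cohorts, filtering by ACCESSION_2_ID in the query first keeps the transferred data small.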

ACC_SLIDE_IMAGESERVER

Number of records: 2,564,239
Column Name Comments
ACC_SLIDE_ID PowerPath unique slide ID
ACC_SLIDE_IMAGESERVER_DESCRIPTION The name of the Philips iSyntax slide image file
ACC_SLIDE_IMAGESERVER_ID PowerPath unique identifier for a slide image
INTERNAL_SLIDE_ID Identifier for the slide, also known as the “barcode” ID
LAST_UPDATE_DATETIME Case finalize date / status update datetime
SCAN_DATE The date on which the slide image was digitized
 

 

Radiology (metadata)

AIR·MS now features radiology metadata extracted from the Mount Sinai IRW 2.0 XNAT system (via an MSDW data pipeline). This dataset comprises detailed DICOM (Digital Imaging and Communications in Medicine) tags associated with the medical images.

These tags provide essential metadata, including patient information, imaging parameters, equipment details, and procedural context, ensuring a comprehensive understanding of each radiological study.

By integrating this metadata, we enable researchers to gain deeper insights into the imaging data, facilitating advanced analyses and fostering innovations in medical imaging research.

Current Status

Schema: CDMRADIOLOGY
Data snapshot from 08/16/2024
Unique patients: 2,057,482

The following tables and attributes are available in AIR·MS:

RADIOLOGY_METADATA

Number of records: 66,993,826
Column Name
ID
PATIENT_ID
SERIES_INSTANCE_UID
STUDY_INSTANCE_UID
ETL_RECORD_UPDATE_DATETIME

RADIOLOGY_DICOM_DATA

Number of records: 7,745,407,087
Column Name
ID
RADIOLOGY_METADATA_ID
DICOM_TAGS
TAG_VALUE_REPRESENTATION
TAG_VALUE
TAG_XTN_PATIENT_EPIC_MRN
TAG_PATIENT_NAME
SERIES_INSTANCE_UID
STUDY_INSTANCE_UID
ETL_RECORD_UPDATE_DATETIME
 

 

Mount Sinai Million Health Discoveries Program

The current lack of diversity in genomic research data is hindering what we can learn about health and potential treatments in our global population. By enhancing the diversity of people participating in genomic research, we can advance our knowledge and discovery of human genetics for all populations. To that end, The Charles Bronfman Institute for Personalized Medicine is spearheading the effort to carry out the genetic sequencing of one million Mount Sinai patients within the next five years. This initiative, one of the largest such sequencing projects of its kind, will integrate health and research data at Mount Sinai to promote discoveries that will directly benefit our patient population.

Access to the BioMe Biobank and Mount Sinai Million Biobank on HPC can be requested via the CBIPM Data and Specimen Inquiry Form

The following tables and attributes are available in AIR·MS:

Mount Sinai Million / BioMe identifiable (PHI)

Current Status

Schema: CDMMSM
Data snapshot from 06/09/2025
Unique patients: 279,885

PATIENT
Column Name Comments
ID Internal ID for linking to other tables within the dataset
MRN Medical Record Number (EPIC MRN – only accessible under regulatory approval)
MASKED_MRN De-identifier for combined BioMe Biobank set with Regeneron and Sema4 data
RGN_ID De-identifier for first Regeneron batch regarding BioMe Biobank
SEMA4_ID De-identifier for Sema4, a subset of Masked MRN ID
MSM_ID De-identifier for Mount Sinai Million Biobank, a combined set of RGN_ID and new MSM ID
MILLION_ID Indicator for all consented patients with and without genomic data
AIR_CREATED_AT Record creation in AIR·MS
AIR_UPDATED_AT Record updated in AIR·MS

Mount Sinai Million / BioMe de-identified (de-id)

Current Status

Schema: CDMMSMDEID
Data snapshot from 06/09/2025
Unique patients: 279,885

PATIENT
Column Name Comments
ID Internal ID for linking to other tables within the dataset
MASKED_MRN De-identifier for combined BioMe Biobank set with Regeneron and Sema4 data
RGN_ID De-identifier for first Regeneron batch regarding BioMe Biobank
SEMA4_ID De-identifier for Sema4, a subset of Masked MRN ID
MSM_ID De-identifier for Mount Sinai Million Biobank, a combined set of RGN_ID and new MSM ID
MILLION_ID Indicator for all consented patients with and without genomic data
AIR_CREATED_AT Record creation in AIR·MS
AIR_UPDATED_AT Record updated in AIR·MS
 

 

Electrocardiogram (ECG)

 
Data Source: GE HealthCare MUSE Cardiology Information System
Schema: CDMECG
Data snapshot from: 04/10/2021
Unique patients: 1,961,254

The following tables and attributes are available in AIR·MS:

PATIENT_DEMOGRAPHICS

Number of records: 9,275,130
Column Name
PATIENT_DEMOGRAPHICS_ID (X)
FILE_ENTRY_ID
PATIENT_ID
PATIENTAGE
AGEUNITS
DATEOFBIRTH
GENDER
RACE
PATIENTLASTNAME
PATIENTFIRSTNAME

DIAGNOSIS

Number of records: 9,168,266
Column Name
DIAGNOSIS_ID (X)
FILE_ENTRY_ID
MODALITY
DIAGNOSISSTATEMENT

LEAD_DATA

Number of records: 73,631,055
Column Name
LEAD_DATA_ID (X)
FILE_ENTRY_ID
LEADBYTECOUNTTOTAL
LEADTIMEOFFSET
LEADSAMPLECOUNTTOTAL
LEADAMPLITUDEUNITSPERBIT
LEADAMPLITUDEUNITS
LEADHIGHLIMIT
LEADLOWLIMIT
LEADID
LEADOFFSETFIRSTSAMPLE
FIRSTSAMPLEBASELINE
LEADSAMPLESIZE
LEADOFF
BASELINESWAY
LEADDATACRC32
WAVEFORMDATA

ECG_FILES

Number of records: 9,610,935
Column Name
FILE_ENTRY_ID (X)
FILE_NAME
FILE_PATH
FILE_HASH
FILE_SIZE_BYTES
ACQUISITION_DATE
ACQUISITION_TIME
PROCESSING_STATUS
STATUS_CODE
NOTES_AND_COMMENTS
FILE_TIMESTAMP
AIR_CREATED_AT
AIR_UPDATED_AT
JSON_STATUS

MUSE_INFO

Number of records: 9,168,230
Column Name
MUSEVERSION
FILE_ENTRY_ID

ORDER_INFO

Number of records: 7,757,472
Column Name
ORDER_INFO_ID (X)
FILE_ENTRY_ID
HISACCOUNTNUMBER
ORDERTIME
ADMITTIME
ADMITDATE
HISLOCATION
BED
ATTENDINGMDHISID
ATTENDINGMDLASTNAME
ATTENDINGMDFIRSTNAME
ALTERNATEVISITID
HISDISPOSITION
ADMITSOURCE
PRIMARYDIAGNOSTICCODE
SERVICINGFACILITY
ADMITTINGMDHISID
ADMITTINGMDLASTNAME
ADMITTINGMDFIRSTNAME
CONSULTINGMDID
REFERRINGMDHISID
HOSPITALSERVICE
ADMISSIONTYPE

ORIGINAL_DIAGNOSIS

Number of records: 9,164,220
Column Name
ORIGINAL_DIAGNOSIS_ID (X)
FILE_ENTRY_ID
MODALITY
DIAGNOSISSTATEMENT

ORIGINAL_RESTING_ECG_MEASUREMENTS

Number of records: 9,168,044
Column Name
ORIGINAL_RESTING_ECG_MEASUREMENTS_ID (X)
VENTRICULARRATE
ATRIALRATE
PRINTERVAL
QRSDURATION
QTINTERVAL
QTCORRECTED
PAXIS
RAXIS
TAXIS
QRSCOUNT
QONSET
QOFFSET
PONSET
POFFSET
TOFFSET
ECGSAMPLEBASE
ECGSAMPLEEXPONENT
QTCFREDERICA

PHARMA_DATA

Number of records: 3,613,372
Column Name
PHARMA_DATA_ID (X)
PHARMARRINTERVAL
PHARMAUNIQUEECGID
PHARMAPPINTERVAL
PHARMACARTID
FILE_ENTRY_ID

QRS_TIMES_TYPES

Number of records: 9,649,712
Column Name
GLOBALRR
QTRGGR
FILE_ENTRY_ID

RESTING_ECG

Number of records: 9,657,368
Column Name
RESTING_ECG_ID (X)
FILE_ENTRY_ID
PATIENT_ID
ACQUISITIONDATE
ACQUISITIONTIME
STATUS

RESTING_ECG_MEASUREMENTS

Number of records: 9,167,778
Column Name
RESTING_ECG_MEASUREMENTS_ID (X)
FILE_ENTRY_ID
VENTRICULARRATE
ATRIALRATE
PRINTERVAL
QRSDURATION
QTINTERVAL
QTCORRECTED
PAXIS
RAXIS
TAXIS
QRSCOUNT
QONSET
QOFFSET
PONSET
POFFSET
TOFFSET
ECGSAMPLEBASE
ECGSAMPLEEXPONENT
QTCFREDERICA

TEST_DEMOGRAPHICS

Number of records: 9,654,325
Column Name
TEST_DEMOGRAPHICS_ID (X)
FILE_ENTRY_ID
DATATYPE
SITE
SITENAME
ACQUISITIONDEVICE
STATUS
EDITLISTSTATUS
PRIORITY
LOCATION
LOCATIONNAME
ROOMID
ACQUISITIONTIME
ACQUISITIONDATE
CARTNUMBER
ACQUISITIONSOFTWAREVERSION
ANALYSISSOFTWAREVERSION
EDITTIME
EDITDATE
EDITORID
REFERRINGMDLASTNAME
REFERRINGMDFIRSTNAME
ACQUISITIONTECHLASTNAME
EDITORLASTNAME
EDITORFIRSTNAME
SECONDARYID
HISSTATUS
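
The storage format of the WAVEFORMDATA column in LEAD_DATA is not documented on this page. The sketch below ASSUMES it follows the convention of GE MUSE XML exports, where each lead is a base64 string of 16-bit little-endian signed samples that are scaled by LEADAMPLITUDEUNITSPERBIT to obtain values in LEADAMPLITUDEUNITS. Please verify this against the actual data before relying on it; the function name and the sample values are hypothetical.

```python
import base64
import struct
from typing import List


def decode_waveform(waveform_b64: str, units_per_bit: float) -> List[float]:
    """Decode one lead into amplitude values.

    Assumption: base64-encoded, 16-bit little-endian signed samples,
    scaled by units_per_bit (the LEADAMPLITUDEUNITSPERBIT column).
    """
    raw = base64.b64decode(waveform_b64)
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes for 16-bit samples")
    count = len(raw) // 2
    samples = struct.unpack(f"<{count}h", raw)
    return [sample * units_per_bit for sample in samples]


# Invented four-sample lead, encoded the same way, for illustration only.
encoded = base64.b64encode(struct.pack("<4h", 0, 1, -2, 100)).decode("ascii")
amplitudes = decode_waveform(encoded, units_per_bit=5.0)
print(amplitudes)
```

If the assumption holds, LEADSAMPLECOUNTTOTAL should equal the number of decoded samples, which gives a quick sanity check on real rows.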
 

 

Echocardiogram (echo)

 
Schema: CDMECHO
Data snapshot from: 10/17/2023
Unique patients: 885,957

The following tables and attributes are available in AIR·MS:

ECHO_METADATA

Number of records: 268,682,931
Column Name
ID
FILE_ENTRY_ID
PATIENT_ID
SERIES_INSTANCE_UID
STUDY_INSTANCE_UID
SOP_INSTANCE_UID
IMAGE_FILE_PATH
AIR_CREATED_AT
AIR_UPDATED_AT

ECHO_TAGS_DATA

Number of records: 21,089,507,788
Column Name
ID
FILE_ENTRY_ID
SERIES_INSTANCE_UID
STUDY_INSTANCE_UID
SOP_INSTANCE_UID
TAGS
TAG_VALUE
AIR_CREATED_AT
AIR_UPDATED_AT
 

Synthetic Public Use File (DE-SynPUF)

The SYNPUF (Synthetic Public Use Files) dataset, provided by the Centers for Medicare & Medicaid Services (CMS), offers a synthetic version of Medicare claims data from the years 2008 to 2010. This dataset is meticulously designed to maintain the statistical properties and relationships present in the original data while ensuring that no actual patient information is disclosed, thereby safeguarding privacy.

SYNPUF includes a comprehensive array of variables such as beneficiary demographics, chronic conditions, hospital and outpatient claims, and prescription drug events, making it an invaluable resource for researchers and data scientists. It serves as an exemplary tool for developing and testing healthcare models, algorithms, and applications without the constraints associated with sensitive real-world data.

The SYNPUF dataset in AIR·MS uses the OMOP Common Data Model, in line with the other clinical datasets available on the platform. Because SYNPUF data does not require IRB approval, you can be onboarded quickly and start building ML models!

​​​​​​​Current Status

Schema: CDMSYNPUF

Number of patients: 2,326,856
Number of observations: 37,531,051
Number of measurements: 72,387,791

 

Work in Progress

We are constantly integrating new data into AIR·MS. The following data modalities are currently being worked on:

  • Electroencephalogram (EEG)

  • Endoscopy & colonoscopy reports

  • Bedmaster

  • Radiology images and reports