Scientific Computing and Data / AIR·MS (AI Ready Mount Sinai) / Documentation
Application Tier
The application tier provides the foundation for running applications within the AIR·MS environment. It streamlines software development by providing infrastructure components commonly needed by microservice-based applications:
- scalable compute infrastructure for executing application code,
- a database for storing application metadata,
- an access control mechanism to restrict usage of individual applications.
These building blocks ensure that software developers have a consistent and reliable foundation to build upon, enhancing efficiency and reducing redundancy.
In terms of user roles within AIR·MS, the application tier caters to distinct personas with specific needs:
- Researchers, who access deployed applications via private endpoints within the Mount Sinai network.
- Service Providers, who deploy and manage services integrated with the AIR·MS environment.
Architecture
The application tier is composed of an execution layer, a data layer, and an application gateway. The diagram below presents the architecture of the application tier.
Execution Layer
In adopting a practical and scalable infrastructure strategy, we opt for a microservices architecture containerized through Docker and orchestrated with Kubernetes. This approach enables a flexible deployment model, supporting efficiency and scalability for our applications.
To facilitate our Kubernetes-based approach, the following infrastructure elements play a critical role:
- Azure Kubernetes Service (AKS): Serving as a reliable foundation, AKS manages our containerized microservices, offering scalability and ease of orchestration.
- Container registry: Essential for version control and efficient distribution, the container registry is employed to store and manage container images in a centralized repository.
- Key Vault: Prioritizing security, Key Vault securely manages sensitive information, such as API keys and database credentials, ensuring a robust layer of protection for our microservices.
Data Layer
An Azure Database for PostgreSQL – Flexible Server is used to store application metadata. This includes configuration values, internal application states, and integration-specific metadata.
This managed PostgreSQL service resides in a delegated subnet and is only accessible from within the application tier.
Persistent volumes can also be mounted by application containers. This is useful for applications that need to store data in files that persist beyond the lifecycle of a single container instance.
In addition, the AIR·MS platform integrates with SAP HANA for storing and querying sensitive research data. Although not part of the application tier per se, applications deployed to the tier can securely query SAP HANA:
- All communication uses SSL.
- Access is controlled via LDAP-authenticated Entra ID group mappings.
- Privileges restrict access to specific datasets through Sailpoint-managed roles.
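The connection requirements above can be sketched as follows. The host, port, and account names are hypothetical placeholders, and in practice the credentials would come from Key Vault; the options dict is intended for the SAP HANA Python client (e.g. `hdbcli.dbapi.connect(**opts)`), though the exact parameter set should be checked against its documentation.

```python
# Sketch: connection options for an SSL-encrypted SAP HANA query.
# Host, port, and user are hypothetical placeholders; real credentials
# would be fetched from Key Vault. The resulting dict is meant to be
# passed to the SAP HANA Python client, e.g. hdbcli.dbapi.connect(**opts).

def hana_connection_options(host: str, user: str, password: str,
                            port: int = 443) -> dict:
    """Return connection options that enforce encrypted transport."""
    return {
        "address": host,
        "port": port,
        "user": user,                    # LDAP-authenticated identity
        "password": password,            # from Key Vault, never hard-coded
        "encrypt": True,                 # require SSL/TLS
        "sslValidateCertificate": True,  # reject untrusted certificates
    }

opts = hana_connection_options("hana.example.internal", "svc_app", "example")
```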
Azure Application Gateway
The Azure Application Gateway provides a unified entry point for applications deployed in AIR·MS. It performs TLS termination and acts as a reverse proxy that routes traffic based on subdomain or path.
Examples:
- Path-based routing:
- Subdomain-based routing:
Environment-specific base domains include:
- https://airms.mssm.edu (Production)
- https://airms-staging.mssm.edu (Staging)
- https://airms-sandbox.mssm.edu (Sandbox)
The different environments are isolated from each other, allowing development and testing without impacting production services.
The gateway supports both Kubernetes ingress and VM-based services.
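As an illustration, the two routing modes might be modeled like this. The backend names and the `dqd` subdomain are made-up examples; only the `/visian` path on the production domain corresponds to an endpoint documented elsewhere on this page.

```python
# Sketch: choosing a backend service from the request's subdomain or path.
# The routing table below is illustrative, not the actual AIR·MS gateway
# configuration.
from typing import Optional
from urllib.parse import urlparse

BACKENDS = {
    ("airms.mssm.edu", "/visian"): "visian-service",  # path-based routing
    ("dqd.airms.mssm.edu", "/"): "dqd-service",       # subdomain-based routing
}

def route(url: str) -> Optional[str]:
    """Return the backend for a request URL, or None if nothing matches."""
    parts = urlparse(url)
    for (host, prefix), backend in BACKENDS.items():
        if parts.hostname == host and parts.path.startswith(prefix):
            return backend
    return None
```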
Application Access Control
Access to deployed applications is controlled using Microsoft Entra ID:
- Each application is registered with Entra ID and assigned a unique client ID.
- Entra ID groups define which users can access which applications.
- Sailpoint is used for automating and managing group membership.
Network-level access is also restricted:
- Only AIR·MS users within the Mount Sinai network can access private application endpoints
- Network security groups (NSGs) enforce subnet-level isolation and allow only the authorized traffic
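A minimal sketch of the group-based check described above, assuming illustrative group and application names (the real memberships live in Entra ID and are kept up to date through Sailpoint):

```python
# Sketch: group-based application authorization. The group and application
# names are made-up examples; in AIR·MS the memberships live in Entra ID
# and are managed through Sailpoint.

APP_GROUPS = {
    "visian": {"AIRMS-Visian-Annotators", "AIRMS-Visian-PIs"},
    "dqd": {"AIRMS-Data-Admins"},
}

def can_access(app: str, user_groups: set) -> bool:
    """A user may open an app if they belong to at least one of its groups."""
    return bool(APP_GROUPS.get(app, set()) & user_groups)
```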
CI/CD Pipeline
The CI/CD pipeline automates how applications are built, tested, and deployed across different environments (Sandbox, Staging, Production). It uses GitHub Actions on Mount Sinai’s own GitHub Enterprise instance and self-hosted GitHub runners to build and deploy applications reliably and securely.
The key responsibilities of the CI/CD pipeline are to:
- Build an application when code changes are made and push it to the container registry
- Run tests to ensure it still works
- Deploy it to the desired environment
- Create databases and access users, when needed
Key Actions
There are 3 automated workflows available:
| Action | What it does |
| --- | --- |
| Build | Automatically triggered when code changes. Builds the app, runs tests, and stores the image. |
| Deploy | Deploys the app to the specified environment (Sandbox, Staging, Production). Triggered automatically on Sandbox. |
| Database | Creates or updates a metadata database and access user for an application. Triggered manually. |
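The trigger rules above can be summarized in a short sketch. The action and environment names follow the table; treating non-Sandbox deployments as manual is an assumption based on "Triggered automatically on Sandbox".

```python
# Sketch of the trigger rules in the table above: Build runs on every code
# change, Deploy starts automatically only on Sandbox (manually elsewhere),
# and the Database workflow is always triggered manually.

def is_automatic(action: str, environment: str) -> bool:
    """Return True if the workflow starts without manual intervention."""
    if action == "Build":
        return True                      # triggered by every code change
    if action == "Deploy":
        return environment == "Sandbox"  # manual on Staging/Production
    return False                         # Database workflow is manual
```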
Deployment Secrets
To deploy securely, sensitive information like passwords or API keys (called secrets) must be stored safely in Azure Key Vault. These are loaded automatically during deployment.
There are two kinds of configurations:
- Secrets: Stored in Key Vault, e.g. passwords
- Settings: Non-sensitive configuration stored in the repository
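A sketch of how the two kinds of configuration might be combined at deploy time. `fetch_secret` is a stub standing in for a Key Vault lookup (in production this would go through the Azure SDK), and the keys and values are illustrative only.

```python
# Sketch: assembling the runtime configuration from the two sources above.
# fetch_secret is a stub standing in for a Key Vault lookup; the keys and
# values shown are illustrative only.

SETTINGS = {"log_level": "INFO", "db_host": "db.internal"}  # from the repo

def fetch_secret(name: str) -> str:
    """Stub for Key Vault; a real deployment would use the Azure SDK."""
    return f"secret-value-for-{name}"

def build_config(secret_names: list) -> dict:
    config = dict(SETTINGS)                # non-sensitive settings
    for name in secret_names:
        config[name] = fetch_secret(name)  # sensitive values from the vault
    return config

config = build_config(["db_password"])
```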
Image Signing
All application versions are stored as signed Docker images, which ensures their authenticity and integrity. To allow only trusted images into the container registry, Docker Content Trust is enabled. The CI/CD pipeline uses a signing key that is granted access to the container registry.
Data Quality Dashboard
Overview
The Data Quality Dashboard (DQD) on AIR·MS helps the data team and users understand the quality of datasets added to the AIR·MS database.
The DQD is part of the HADES library in OHDSI and has been modified to run on SAP HANA. It is currently executed on demand by the data administrators on the AIR·MS platform, because a run is resource-intensive.
Quality Checks
The DQD performs a set of data quality checks on the AIR·MS dataset. It executes the checks systematically, assesses them against a predetermined threshold, and then communicates the results in a straightforward and understandable manner.
The quality checks are organized according to the Kahn framework, which uses categories and contexts to represent different methods of evaluating data quality.
More information: Official DQD documentation.
The DQD consists of 24 types of checks, categorized into Kahn contexts and categories. Moreover, every type of data quality check is categorized as a table check, field check, or concept-level check.
- Table-level checks are assessments of the table as a whole, without focusing on specific fields, or checks that apply across multiple event tables. These checks ensure that the necessary tables exist and that some individuals in the PERSON table have corresponding records in the event tables.
- Field-level checks pertain to individual fields within a table and are the most common type of check in the current version of DQD. This comprises checks that assess primary key relationships and checks that verify if the concepts in a domain adhere to the specified rules.
- Concept-level checks pertain to specific concepts (codes).
More information about each type of check: test type information in the OHDSI DQD documentation.
With a few exceptions, each check collects a set of relevant table rows (for example, all rows in a specific table, or all rows using a specific code) and then verifies if each row satisfies a certain pass/fail criterion. For example, that the patient ID actually occurs in the PERSON table, or that a specific code is classified as a preferred code by OHDSI. If the fraction of rows that fail the check is above the check-specific threshold, the check is marked as failed.
The thresholds differ between checks: some fail as soon as a single row fails, others require 5% or more of the rows to fail, indicating that some criteria are considered impossible to fulfill in every single case.
Also, apart from passing or failing, a check can be skipped if no relevant rows to check were found. For example, if a particular table is not used or a specific code does not appear in the data at all. It is not uncommon for 50% or more of all plausibility checks to be skipped. Skipped checks are counted as passed in the summary table, and the failure percentage is calculated relative to all checks, including skipped ones.
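Putting these rules together, the outcome of a single check can be sketched as follows. The strict `>` comparison is a plausible reading of the description above, not the exact DQD implementation.

```python
# Sketch of the evaluation logic described above: a check is skipped when
# no relevant rows exist, and otherwise fails when the fraction of failing
# rows exceeds the check-specific threshold.

def evaluate_check(rows_checked: int, rows_failed: int,
                   threshold: float) -> str:
    """Return 'skipped', 'passed', or 'failed' for one data quality check."""
    if rows_checked == 0:
        return "skipped"                # no relevant rows were found
    if rows_failed / rows_checked > threshold:
        return "failed"
    return "passed"
```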
Around 4,000 specific data quality checks are executed against the database and assessed using a predetermined threshold. The outcomes are visualized in a table as shown below:
The table organizes the output according to the following main categories:
- Plausibility: Does the data agree with basic logical and medical expectations? Example: Does the measurement unit provided for a specific lab test make sense, e.g. is body height given in a unit like cm or m?
- Conformance: Does the data conform to the OMOP Common Data Model? Example: Is the patient ID given for a diagnosis entry indeed the primary ID of an entry in the PERSON table?
- Completeness: Are all the expected data elements and vocabulary mappings present? Example: Does every medication entry have a standard OHDSI code identifying the medication given?
The DQD user interface provides a complete list of all run checks, including a check description, the fraction of failed rows, and the overall check pass/fail outcome.
Application Access
Currently, only data administrators have access to the results of a DQD run. In the next version of AIR·MS, we plan to enable researchers to sign in to a researcher portal to view the DQD results, if they wish.
Understanding Azure Machine Learning and the Azure Machine Learning (AML) Platform
What is Azure?
The Azure cloud platform (commonly called Azure) is Microsoft’s public cloud platform. Azure offers a large collection of services, which includes platform as a service (PaaS), infrastructure as a service (IaaS), and managed database service capabilities. It has more than 200 products and cloud services designed to support a wide range of solutions. Azure lets you build, run, and manage applications across multiple clouds, on-premises, and at the edge, with the tools and frameworks of your choice.
Azure relies on virtualization technology. To learn more about virtualization, visit this link with excellent information by Microsoft Learn: How does Azure work? [↗]
What is Azure Machine Learning?
For you, as a researcher in the AIR·MS project, the machine learning-related services of Azure are of particular interest. These services form the Azure Machine Learning platform (commonly referred to as AML). It is designed to govern the entire machine learning life cycle, so you can train and deploy models without focusing on setup. The platform is suitable for any kind of machine learning, from classical models to deep learning, and from supervised to unsupervised learning.
With built-in services, like Azure Machine Learning studio, which provides a user-friendly interface, and Automated Machine Learning capabilities that assist you in model selection and training, Azure Machine Learning has tools and features to suit every level of experience.
How to Use Azure Machine Learning?
Using Azure Machine Learning requires an Azure account and an Azure subscription. As a researcher, please reach out to the AIR·MS team with your request to use AML. The AIR·MS team will create the required Azure accounts and enroll your account in the appropriate subscription.
Azure Machine Learning manages all the resources you need for the machine learning lifecycle inside a workspace. Workspaces can be shared by multiple users and include things like the computing resources available for your notebooks, training clusters, and pipelines. They are also containers for your data stores and a repository for models. The AIR·MS team will create a workspace for you at setup time.
You can interact with Azure Machine Learning in these ways:
- In the cloud with the AML user interface.
- From your local machine through the Python software development kit (SDK), REST API, and command line interface (CLI) extension.
Azure Machine Learning enables users familiar with machine learning frameworks to quickly train and deploy models using code, while giving others powerful visual tools. If you prefer low-code or no-code options, you can use Azure Machine Learning studio to quickly train and deploy machine learning models.
What is Azure Machine Learning Studio?
Azure Machine Learning studio is a browser-based service that provides no-code and code-first solutions to visually create, train, and manage models through a web UI.
The components of Azure Machine Learning studio are:
- Jupyter Notebooks: Notebooks provide a collaborative environment for runnable code, visualizations, and comments. Included in studio are sample notebooks you can use to get started with Azure Machine Learning.
- AutoML: Automated Machine Learning (AutoML) automates creating the best machine learning models, helping you find the best model for your data – no matter your data science expertise. Specializing in classification, regression, and time-series forecasting, AutoML experiments with different features, algorithms, and parameters depending on the task, then provides scores on models it thinks are the best fit. You can use AutoML in Azure Machine Learning studio or through the Python SDK.
- Designer: If you prefer a no-code option, Azure Machine Learning Designer within the Azure Machine Learning studio gives you a visual canvas with drag and drop controls to manipulate datasets and modules. You can find more information about this option here.
Modules within Azure Machine Learning Designer are algorithms that can have a range of purposes, from data ingress functions to training, scoring, and validation processes.
If you are looking for scenarios in which AML has been particularly powerful for teams across different companies, visit When to use Azure Machine Learning [↗] on the Microsoft Learn website.
Using AutoML and AML Designer
About Microsoft AutoML
Automated machine learning, also referred to as automated ML or AutoML, is the process of automating the time-consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers to build machine learning models with high scale, efficiency, and productivity, all while sustaining model quality. It particularly specializes in classification, regression, and time-series forecasting.
State-of-the-art machine learning/AI systems consist of complex pipelines with choices of hyperparameters, models, and configuration details that need to be tuned for optimal performance. The resulting optimization space can be too complex and high-dimensional for researchers and engineers to explore manually.
When automated systems are used, the high costs of running a single experiment (for example, training a deep neural network) and the high sample complexity (that is, large number of experiments required) together make naïve approaches impractical. Many of the problems we are interested in can be cast as high-dimensional combinatorial optimization tasks.
Broadly speaking, AutoML tackles these problems by designing probabilistic machine learning models to guide (automated) experimental decisions and meta-learning to reduce the sample complexity and transfer knowledge across related datasets or problems.
No-Code UI or a Code-Based SDK for AutoML
No code
If you prefer a no-code approach, the following tutorial from Microsoft explains the AutoML user interface and its features. You can follow along at your own pace: No-code AutoML training for tabular data [↗].
SDK
If you’re a code-experienced researcher, you can use AutoML with the Azure Machine Learning Python SDK. Get started with this tutorial from Microsoft: Train an object detection model (preview) with AutoML and Python [↗].
Azure Machine Learning designer
Azure Machine Learning designer is a drag-and-drop interface used to train and deploy models in Azure Machine Learning. It allows you to use a visual canvas to build an end-to-end machine learning workflow. Train, test, and deploy models in the designer:
- Drag and drop data assets and components onto the canvas.
- Connect the components to create a pipeline draft.
- Submit a pipeline run using the compute resources in your Azure Machine Learning workspace.
- Convert your training pipelines to inference pipelines.
- Publish your pipelines to a REST pipeline endpoint to submit a new pipeline that runs with different parameters and data assets.
- Publish a training pipeline to reuse a single pipeline to train multiple models while changing parameters and data assets.
- Publish a batch inference pipeline to make predictions on new data by using a previously trained model.
- Deploy a real-time inference pipeline to an online endpoint to make predictions on new data in real time.
Core Concepts
- Pipeline: A pipeline consists of data assets and analytical components, which you connect. Pipelines have many uses: You can make a pipeline that trains a single model, or one that trains multiple models. You can create a pipeline that makes predictions in real time or in batch, or make a pipeline that only cleans data. Pipelines let you reuse your work and organize your projects.
- Data: A machine learning data asset makes it easy to access and work with your data. Several sample data assets [↗] are included in the designer for you to experiment with. You can register [↗] additional data assets as you need them.
- Component: A component is an algorithm that you can perform on your data. The designer has several components ranging from data ingress functions to training, scoring, and validation processes.
To learn more: Tutorial: Designer – train a no-code regression model [↗]
VISIAN – the Image Annotation Tool
VISIAN is a web-based editor to annotate medical images. It allows you to view medical images in 2D and 3D while changing the viewing orientation or adjusting parameters such as contrast and brightness. VISIAN is equipped with annotation features including brush and outline tools, as well as smart brushes that significantly speed up the segmentation process. Moreover, VISIAN supports multiple annotation layers, enabling detailed analysis and preparation for medical research and diagnostics.
Accessing VISIAN
The latest release of VISIAN, integrated for the Bowel Segmentation use case, makes use of the AIR·MS backend (Azure) for storing medical images and annotations. When annotators use VISIAN, both the medical images and the created annotations are loaded and saved in storage in AIR·MS. No data is managed outside of Mount Sinai’s infrastructure.
You can access VISIAN from within the Mount Sinai network or using a VPN. In order to access the application, approved annotators of the project need to be assigned a specific role. For this purpose, please send us a request by email using our contact information. Once this role is assigned to an annotator, the principal investigator of the project can add them to the project, using their email address. Afterwards, annotators can authenticate using Mount Sinai’s single sign-on.
URL: https://airms.mssm.edu/visian/
Working with VISIAN
When you open VISIAN, you see a list of the projects in which you are collaborating. To open a project, click its title.

Once within a project, the image studies of that project are shown. Projects in VISIAN have Principal Investigators (PIs) and Annotators, each with a different view. Principal Investigators can see all the studies and the annotations created by the Annotators of that project. In contrast, Annotators can only see the studies assigned to them.
Image studies have a status:
- No Annotation
- In Progress
- Completed

Upon choosing a study, the study is loaded in the browser, and Annotators can create their annotation:

Annotators can save their work by clicking the Save button.

An annotation can be saved as In Progress or Completed. Principal investigators can see the status of all annotations in their projects.

To open any supplementary image series while adding an annotation, click the clip icon.

Managing Project Data
PIs have special file management privileges to upload image studies and to retrieve the created annotations. For this purpose, Azure Storage Explorer (ASE) must be used. In order to upload image studies to the project, go through the following steps.
- After installing Azure Storage Explorer, open it and click Sign in with Azure:
- Select Azure as the environment to sign in:
- You are redirected to the browser:
- In the browser, enter your Mount Sinai email address (@mssm.edu or @mountsinai.org):
- Back in ASE, click the subscription where the Bowel Segmentation project is stored:
- In the panel on the left, navigate to the storage container called bowelseg, by clicking Storage Accounts > imgannostorage > Blob Containers > bowelseg:

- Within this storage container, there are two subfolders: input and output. As the names indicate, the input folder is for placing the input image studies, and the output folder is where the annotations for these studies will be found later:
- Within the input folder, subfolders for each of the image studies can be created. Below, image studies from a few subjects are shown:

- Within each subject’s folder, the system expects exactly two subfolders: one with the image series to be annotated, called ABD-PELVIS_AX HASTE T2 Long TE_COMPOSED, and another with the supplementary image series, called ABD-PELVIS_COR T2 HASTE_ MBH_COMP_AD:

- The actual .dcm files should be copied into each of those two subfolders (e.g. through drag-and-drop from your Finder, on macOS):
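The expected layout can also be prepared locally before uploading with ASE. A short sketch, where the subject ID is a made-up example and the two series folder names are the ones required above:

```python
# Sketch: preparing the folder layout the system expects before uploading
# with Azure Storage Explorer. The subject ID is a made-up example; the
# two series folder names are the ones required by the project.
from pathlib import Path

SERIES = [
    "ABD-PELVIS_AX HASTE T2 Long TE_COMPOSED",  # series to be annotated
    "ABD-PELVIS_COR T2 HASTE_ MBH_COMP_AD",     # supplementary series
]

def prepare_subject_folders(root: Path, subject_id: str) -> list:
    """Create input/<subject>/<series> folders and return their paths."""
    created = []
    for series in SERIES:
        folder = root / "input" / subject_id / series
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

The .dcm files for each series would then be copied into the corresponding folder before the upload.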