De-identified Digital Pathology Slides (Coming Soon)

Overview

The Department of Pathology, Molecular and Cell-Based Medicine, the Windreich Department of AI and Human Health, and Scientific Computing and Data have collaborated to share this extensive digital archive of over 1.5 million whole slide images, collected from the Mount Sinai Anatomic Pathology and Consultation Service. These specimens encompass a broad spectrum of biopsies, resections, and autopsies, reflecting the diversity of diseases affecting patients from a wide range of backgrounds. Virtually every organ system is represented within this collection, including but not limited to the lung, heart, pancreas, kidney, liver, gastrointestinal, genitourinary, gynecological, hematological, and neuropathological systems. The disease processes span a wide array, encompassing neoplastic, developmental, inflammatory, toxic, metabolic, genetic, degenerative, traumatic, and infectious pathologies. The slides were prepared using a variety of staining techniques, from routine hematoxylin and eosin (H&E) to specialized stains such as silver and trichrome, as well as immunohistochemistry. This rich dataset offers a unique and powerful resource for advancing the study of human disease through digital pathology.

 

What Is Hosted Under Data Ark?

Currently, Data Ark hosts 1.5 million de-identified digital pathology slides, and more slides will be made available on a continuing basis. The slides have been linked to patients' electronic health records (EHR), and EHR data is available from the Mount Sinai Data Warehouse. Slides were scanned at 40x magnification on the Philips Ultrafast or other systems, and the resulting iSyntax files were converted to TIF. All slides served through Data Ark are de-identified and provided in this readable TIF format.

Figure 1. Digitized pathology slides are gigapixel images that can span hundreds of thousands of pixels in each dimension.
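To give a sense of what "gigapixel" means in practice, the following minimal Python sketch estimates the pixel count, uncompressed size, and tile count of a single slide. The dimensions, the 3-channel 8-bit layout, and the 512-pixel tile size are illustrative assumptions, not values taken from the archive.

```python
import math


def slide_footprint(width_px, height_px, channels=3, bytes_per_channel=1, tile_px=512):
    """Return (gigapixels, uncompressed GiB, tile count) for one image level.

    All defaults are assumptions for illustration: 8-bit RGB pixels and
    square 512-pixel tiles.
    """
    pixels = width_px * height_px
    gigapixels = pixels / 1e9
    raw_gib = pixels * channels * bytes_per_channel / 2**30
    # Tiles needed to cover the image, counting partial tiles at the edges.
    tiles = math.ceil(width_px / tile_px) * math.ceil(height_px / tile_px)
    return gigapixels, raw_gib, tiles


# Hypothetical slide dimensions, chosen only to illustrate the scale.
gp, gib, tiles = slide_footprint(100_000, 80_000)
print(f"{gp:.1f} gigapixels, {gib:.1f} GiB uncompressed, {tiles} tiles of 512 px")
```

Because a single slide at full resolution can exceed the memory of a typical workstation, analysis code usually reads one tile or region at a time rather than loading the whole image.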

Digital pathology slides were collected from 191,119 patients through October 1, 2024, with demographic information detailed below.

Table 1. Cohort demographics: (a) gender, (b) race and ethnicity, and (c) age.

(a)

Gender                                   Count
Female                                   120,121
Indeterminate                            13
Male                                     70,945
Unknown                                  40

(b)

Race Ethnicity Combined                  Count
American Indian or Alaska Native         140
Asian                                    14,071
Black or African-American                28,323
Hispanic                                 35,508
Native Hawaiian or Pacific Islander      159
Patient Declined                         958
White                                    66,515
Unknown/Other/Not Reported               45,445

(c)

Age Group                                Count
0-10                                     1,377
11-20                                    3,239
21-30                                    14,950
31-40                                    28,684
41-50                                    31,704
51-60                                    35,551
61-70                                    38,128
71-80                                    26,597
81-90                                    9,424
91+                                      1,465
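
As a quick consistency check, the short Python sketch below confirms that each panel of Table 1 sums to the stated cohort of 191,119 patients and prints each category's share. The counts are copied from the table; the variable names are chosen here for illustration.

```python
COHORT_SIZE = 191_119  # patients, as stated in the text above Table 1

# Counts copied from Table 1 (a), (b), and (c).
tables = {
    "gender": {"Female": 120_121, "Indeterminate": 13, "Male": 70_945, "Unknown": 40},
    "race_ethnicity": {
        "American Indian or Alaska Native": 140,
        "Asian": 14_071,
        "Black or African-American": 28_323,
        "Hispanic": 35_508,
        "Native Hawaiian or Pacific Islander": 159,
        "Patient Declined": 958,
        "White": 66_515,
        "Unknown/Other/Not Reported": 45_445,
    },
    "age_group": {
        "0-10": 1_377, "11-20": 3_239, "21-30": 14_950, "31-40": 28_684,
        "41-50": 31_704, "51-60": 35_551, "61-70": 38_128, "71-80": 26_597,
        "81-90": 9_424, "91+": 1_465,
    },
}

for name, counts in tables.items():
    total = sum(counts.values())
    status = "matches" if total == COHORT_SIZE else "does NOT match"
    print(f"{name}: total {total:,} {status} the cohort size")
    for category, count in counts.items():
        print(f"  {category}: {count:,} ({100 * count / COHORT_SIZE:.1f}%)")
```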

Access (Coming Soon)

To use this data, you must read, agree to, and sign the Data Use Agreement (you must be logged in through the Mount Sinai campus network or secure remote VPN). IRB (Institutional Review Board) approval is not required to access Digital Pathology Slide data via Data Ark.

After access is granted, you can access the slides directly in their folder on Minerva. To get the path variable, load the module by issuing the command $ module load dataark.
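
Once the module is loaded, a short script can confirm that the slide folder is visible. The following is a minimal Python sketch using only the standard library. The environment variable name DATAARK_SLIDES_DIR is a hypothetical placeholder, since this page does not state the name of the path variable that the module sets; substitute the actual name reported after loading the module.

```python
import os
from pathlib import Path

# HYPOTHETICAL name: replace with the path variable actually set by
# `module load dataark`; this page does not specify it.
SLIDE_DIR_VAR = "DATAARK_SLIDES_DIR"


def list_slides(root, limit=10):
    """Return up to `limit` TIF file paths found anywhere under `root`."""
    root = Path(root)
    if not root.is_dir():
        raise FileNotFoundError(f"Slide folder not found: {root}")
    found = []
    # Stop early instead of walking the whole archive, which is very large.
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in {".tif", ".tiff"}:
            found.append(path)
            if len(found) >= limit:
                break
    return found


slide_root = os.environ.get(SLIDE_DIR_VAR)
if slide_root is None:
    print(f"{SLIDE_DIR_VAR} is not set; load the module first.")
else:
    for slide in list_slides(slide_root):
        print(f"{slide}  ({slide.stat().st_size / 2**20:.0f} MiB)")
```

The early exit matters here: with over a million files in the archive, listing everything before printing would be slow.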

In addition, we are working on a Digital Slide Archive web application for interactive slide viewing and annotation (coming soon).

The digital pathology technology effectively reduces process time

Figure 2. Value stream mapping of a simple consult: (a) traditional glass slide vs. (b) digital slide. Image adapted from Haghighi, Mehrvash, et al. 2021. doi: 10.4103/jpi.jpi_74_21.

Data Ark Data Sets

Please visit the Data Ark Data Set webpage to explore other data sets.