De-identified Digital Pathology Slides

Overview

The Department of Pathology, Molecular and Cell-Based Medicine, the Windreich Department of AI and Human Health, and Scientific Computing and Data have collaborated to share this digital archive of over 2.6 million whole slide images collected from the Mount Sinai Anatomic Pathology and Consultation Service. The specimens cover a broad spectrum of biopsies, resections, and autopsies, reflecting the diversity of diseases affecting patients from a wide range of backgrounds. Virtually every organ system is represented, including but not limited to the lung, heart, pancreas, kidney, liver, gastrointestinal, genitourinary, gynecological, hematological, and neuropathological systems. The disease processes include neoplastic, developmental, inflammatory, toxic, metabolic, genetic, degenerative, traumatic, and infectious pathologies. The slides were prepared using a variety of staining techniques, from routine hematoxylin and eosin (H&E) to specialized stains such as silver and trichrome, as well as immunohistochemistry. This dataset is a rich resource for advancing the study of human disease through digital pathology.

What Is Hosted Under Data Ark?

Data Ark currently hosts 2.6 million de-identified digital pathology slides, and more slides will be made available on a continuing basis. The slides are linked to patients’ electronic health records (EHR), and the EHR data is available from the Mount Sinai Data Warehouse. Slides were scanned at 40x magnification on the Philips Ultrafast or another system, and the resulting iSyntax files were converted to TIF. All slides served through Data Ark have been de-identified.

Figure 1. Digitized pathology slides are gigapixel images that can span hundreds of thousands of pixels in each dimension.
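To give a rough sense of what "gigapixel" means in practice, the following sketch estimates the uncompressed size of a slide and the number of tiles needed to cover it. The 100,000 x 100,000 pixel dimensions, the 8-bit RGB pixel format, and the 512-pixel tile size are illustrative assumptions, not properties documented for the Data Ark files.

```python
def uncompressed_bytes(width, height, channels=3, bytes_per_channel=1):
    """Size of the raw pixel data, assuming 8-bit RGB by default."""
    return width * height * channels * bytes_per_channel


def tile_grid(width, height, tile=512):
    """Number of (columns, rows) of square tiles needed to cover the image."""
    cols = -(-width // tile)   # ceiling division without floats
    rows = -(-height // tile)
    return cols, rows


# Illustrative dimensions only (assumed, not taken from a real slide).
width, height = 100_000, 100_000

size = uncompressed_bytes(width, height)
cols, rows = tile_grid(width, height)

print(f"{size / 1e9:.1f} GB uncompressed")   # 30.0 GB uncompressed
print(f"{cols} x {rows} = {cols * rows} tiles of 512 px")
```

Because a single slide at this scale would not fit comfortably in memory once decompressed, analysis pipelines typically read one tile or region at a time rather than loading the whole image.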

Digital pathology slides were collected from 191,119 patients through October 1, 2024, with demographic information detailed below.

Table 1. Cohort demographics: (a) gender, (b) race and ethnicity, and (c) age.

(a)

Gender	Count
Female	120,121
Indeterminate	13
Male	70,945
Unknown	40

(b)

Race Ethnicity Combined	Count
American Indian or Alaska Native	140
Asian	14,071
Black or African-American	28,323
Hispanic	35,508
Native Hawaiian or Pacific Islander	159
Patient Declined	958
White	66,515
Unknown/Other/Not Reported	45,445

(c)

Age Group	Count
0-10	1,377
11-20	3,239
21-30	14,950
31-40	28,684
41-50	31,704
51-60	35,551
61-70	38,128
71-80	26,597
81-90	9,424
91+	1,465

Access

To use this data, you must read, agree to, and sign the Data Use Agreement (you must be logged in through the Mount Sinai campus network or the secure remote VPN). IRB (Institutional Review Board) approval is not required to access Digital Pathology Slide data via Data Ark.

Once access has been granted, you can reach the slides directly in their folder on Minerva. To set the path variable, load the module with the command $ module load dataark.
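Once the module is loaded, the slide folder can be walked from a script. The sketch below is a minimal example under stated assumptions: the environment variable name DATAARK_PATHOLOGY_DIR is hypothetical (this page does not say which variable the dataark module sets, so check the module's output on Minerva), and the .tif/.tiff extensions are assumed from the TIF conversion described above.

```python
import os
from pathlib import Path
from typing import Iterator, Tuple


def slide_root_from_env(var_name: str = "DATAARK_PATHOLOGY_DIR") -> Path:
    """Return the slide folder named by an environment variable.

    The default variable name is a placeholder, not the documented one.
    """
    value = os.environ.get(var_name)
    if not value:
        raise RuntimeError(
            f"{var_name} is not set; run 'module load dataark' first "
            "and check which variable the module defines."
        )
    return Path(value)


def iter_slides(
    root: Path, suffixes: Tuple[str, ...] = (".tif", ".tiff")
) -> Iterator[Path]:
    """Lazily yield slide files under root, matching suffixes case-insensitively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(suffixes):
                yield Path(dirpath) / name
```

Calling iter_slides(slide_root_from_env()) yields one path at a time, which avoids building a list of millions of entries in memory before any slide is processed.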

We are also working on the Digital Slide Archive web application, which will provide an interactive slide viewer and annotation tools (coming soon).

Further details about how to use these slides in your research can be found here:

If you have questions, you can submit a ticket here: https://hpims.atlassian.net/servicedesk/customer/portal/67

Digital pathology technology effectively reduces processing time

Figure 2. Value stream mapping of a simple consult: (a) traditional glass slide vs. (b) digital slide. Image adapted from Haghighi, Mehrvash, et al. 2021. doi: 10.4103/jpi.jpi_74_21.

Data Ark Data Sets

Please visit the Data Ark Data Set webpage to explore other data sets.