CBIPM-BioMe Biobank Program

Effective Sunday, February 1, 2026, BioMe Biobank data will no longer be available through Data Ark. For continued access to BioMe Biobank and Mount Sinai Million data, please submit a request through https://app.smartsheet.com/b/form/8dfa289108b4498d8565e31bd0f3ad00.

Introduction

The BioMe Biobank Program at the Charles Bronfman Institute for Personalized Medicine (CBIPM-BioMe Biobank Program) is an electronic health record (EHR)-linked biorepository that enrolls participants (relatively) non-selectively from across the Mount Sinai Health System (MSHS). Since its founding in 2007, over 60,000 participants have been enrolled in the CBIPM-BioMe Biobank Program from over 26 outpatient sites across Manhattan and Queens. Through linkage with the EHR, participant data is regularly updated to include data for care provided within the MSHS before and after study enrollment.

CBIPM-BioMe Biobank Program participants have been genotyped using various platforms, including the Regeneron Global Screening Array (GSA) and the Sema4 Global Diversity Array (GDA).

The IRB-approved CBIPM-BioMe Biobank Program protocol permits use for research purposes of a) anonymized participant genotype information, and b) associated de-identified past, present, and future clinical phenotype information from the EHR.

Access

This CBIPM-BioMe Biobank Program dataset is hosted under the HPC Data Ark and is only available to Mount Sinai faculty and staff researchers. To access the dataset, you must read, agree to, and sign the Data Use Agreement (access to the agreement requires a Minerva HPC account and login through the Mount Sinai campus network/secure remote VPN).

Dataset Overview

The CBIPM-BioMe Biobank Program dataset is currently at its freeze V2 and consists of:

Anonymized genotypic data

Genotyping microarray: Combined and intersected set of 53,982 samples (GSA + GDA)

Regeneron Global Screening Array (GSA) — 31,413 samples

Sema4 Global Diversity Array (GDA) — 22,569 samples

Imputed data: After variant QC, imputation was performed separately for

GDA with 1.6m typed variants before QC

GSA with 650k typed variants before QC

For each array (GDA, GSA), imputation was performed twice: once with the 1000G reference panel and once with the TOPMed reference panel. Imputed data for each array and reference panel were filtered using a strict INFO score threshold of >0.7. For each reference panel, the imputed data from both arrays were then combined.

Whole exome sequencing data: 30,813 CBIPM-BioMe Biobank Program samples sequenced with the IDT targeted sequencing kit by Regeneron and an additional 15,080 samples sequenced with the SureSelect targeted sequencing kit by Sema4

BioMe Epic EHR Data mart data in Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) format

The EHR information is based on data from Epic and other clinical databases, standardized and structured with OMOP. The accompanying OMOP data dictionary contains detailed information on the definition of each data element in the set, and a log of all changes made. In the de-identified dataset, all 18 protected health information identifiers as delineated in the HIPAA Privacy Rule (including names, dates, and addresses) are masked or removed. Only OMOP data tables that comply with the HIPAA Privacy Rule anonymization criteria have been uploaded. To maintain the temporal relationship between events, all dates are converted to elapsed days relative to the patient encounter start date. For more information on the data dictionary, please visit https://labs.icahn.mssm.edu/msdw/data-dictionary.
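
As an illustration of the date handling described above, the following Python sketch converts calendar dates into elapsed days relative to an encounter start date. The function name, the event labels, and the dates are hypothetical and are not taken from the dataset; the actual de-identification pipeline may differ.

```python
from datetime import date


def to_elapsed_days(event_date: date, encounter_start: date) -> int:
    """Return the number of days from encounter_start to event_date.

    Negative values mean the event happened before the encounter started.
    """
    return (event_date - encounter_start).days


# Hypothetical example values; not taken from the dataset.
encounter_start = date(2020, 3, 1)
events = {
    "lab_result": date(2020, 3, 4),
    "prior_diagnosis": date(2020, 2, 20),
}

# Replace each calendar date with its offset in days.
elapsed = {name: to_elapsed_days(d, encounter_start) for name, d in events.items()}
print(elapsed)
```

Because only offsets are kept, the order of events and the intervals between them are preserved while the calendar dates themselves are no longer present.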

For more details on the study design and quality control, please review the following documents:

The GSA (Regeneron) and GDA (Sema4) microarray Document

Regeneron Whole Exome Sequencing Data Document

Data folder tree structure on Minerva

Within the dataset folder /sc/arion/projects/data-ark/CBIPM/:

├── datamart                     Versioned OMOP Common Data Model V5.4 containing 37 structured tables
│   └── BioMe_Anonymized
│       └── 2023-04-03           Version date
├── Microarray                   Genotypic microarray data
│   └── combined                 Combined GDA + GSA set
│       ├── genotyped_V2         Intersected set of genotyped variants in plink format
│       ├── imputed_1kg_V2       Separate imputation of GSA and GDA with 1000G V3 P5 reference panel and combined
│       │   ├── addition         Genetic principal components, imputation info scores, KING kinship matrix
│       │   ├── bgen             Genetic data in bgen format
│       │   └── plink            Hard-called imputed data in plink format
│       └── imputed_TOPMED_V2    Separate imputation of GSA and GDA with TOPMed r2 reference panel and combined
│           ├── addition         Same as in imputed_1kg_V2
│           ├── bgen             Same as in imputed_1kg_V2
│           └── plink            Same as in imputed_1kg_V2
└── WES
    ├── Regeneron                Regeneron WES data in bcf format
    │   ├── biallelic            bcf files for 22 chromosomes split into biallelic parts
    │   │   └── stats            Accompanying bcftools stats files
    │   └── multiallelic         bcf files for 22 chromosomes split into multiallelic parts
    │       └── stats            Accompanying bcftools stats files
    └── sema4                    Sema4 WES data in bcf format
        └── stats                Accompanying bcftools stats files
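
For scripted analyses, it can help to build these directory paths programmatically instead of typing them by hand. The Python sketch below does this for the imputed microarray folders. The root path and folder names are copied from the tree above; the helper name, the short panel keys, and the assumption that addition, bgen, and plink sit directly under each imputed folder are illustrative only. File names inside the folders are not listed on this page, so the sketch stops at the directory level.

```python
from pathlib import PurePosixPath

# Dataset root as given on this page.
CBIPM_ROOT = PurePosixPath("/sc/arion/projects/data-ark/CBIPM")

# Folder names copied from the tree; the short keys are hypothetical.
IMPUTED_DIRS = {"1kg": "imputed_1kg_V2", "topmed": "imputed_TOPMED_V2"}
SUBFOLDERS = ("addition", "bgen", "plink")


def imputed_dir(panel: str, subfolder: str) -> PurePosixPath:
    """Build the directory path for one reference panel and one subfolder."""
    if panel not in IMPUTED_DIRS:
        raise ValueError(f"unknown panel {panel!r}; expected one of {sorted(IMPUTED_DIRS)}")
    if subfolder not in SUBFOLDERS:
        raise ValueError(f"unknown subfolder {subfolder!r}; expected one of {SUBFOLDERS}")
    return CBIPM_ROOT / "Microarray" / "combined" / IMPUTED_DIRS[panel] / subfolder


print(imputed_dir("topmed", "bgen"))
print(imputed_dir("1kg", "plink"))
```

PurePosixPath only manipulates path strings, so the sketch runs anywhere; on Minerva, check that a directory exists before relying on the assumed nesting.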

Terminology

Data mart: Specialized, focused subset of a larger data warehouse designed to serve the specific needs of a particular group of users

bcf files: Binary vcf files compatible with bcftools

Hard-called genotypes: Imputed genotypes converted to plink file format with --hard-call-threshold 0.4999

Imputation INFO score: Imputation quality score retrieved from the Michigan and TOPMed Imputation Servers
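
To make the two thresholds on this page more concrete, the Python sketch below applies an INFO score filter of >0.7 and a hard-call threshold of 0.4999 to a few made-up values. It assumes biallelic variants, an alternate-allele dosage between 0 and 2, and a rule under which a dosage is set to missing when it lies farther than the threshold from the nearest whole-number genotype. This is a simplified reading of the PLINK option, not a reimplementation of it, and the function names and example values are hypothetical.

```python
from typing import Optional

INFO_THRESHOLD = 0.7          # INFO score cutoff stated in the Dataset Overview
HARD_CALL_THRESHOLD = 0.4999  # value stated in the Terminology entry


def passes_info_filter(info: float, threshold: float = INFO_THRESHOLD) -> bool:
    """Keep a variant only if its INFO score is strictly above the threshold."""
    return info > threshold


def hard_call(dosage: float, threshold: float = HARD_CALL_THRESHOLD) -> Optional[int]:
    """Convert an alternate-allele dosage in [0, 2] to a genotype of 0, 1, or 2.

    Returns None (missing) when the dosage is farther than `threshold`
    from the nearest whole-number genotype (assumed rule, biallelic only).
    """
    if not 0.0 <= dosage <= 2.0:
        raise ValueError("dosage must lie between 0 and 2")
    nearest = round(dosage)
    if abs(dosage - nearest) > threshold:
        return None
    return nearest


# Made-up example values; not taken from the dataset.
dosages = [0.02, 0.3, 0.5, 1.2, 1.97]
print([hard_call(d) for d in dosages])
print([passes_info_filter(s) for s in (0.65, 0.7, 0.93)])
```

Under this rule, a threshold of 0.4999 calls nearly every dosage and leaves missing only those that sit almost exactly halfway between two genotypes. A stricter value such as 0.1 would set a dosage of 0.3 to missing.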

Resources for Genetic Data Analysis

Minerva HPC documentation

Minerva documentation https://labs.icahn.mssm.edu/minervalab/documentation/

The following computational tools are installed in the module system and can be loaded with ml [module_name] (more information: Software Environment: Lmod)

Data manipulation and analysis tools

Plink https://www.cog-genomics.org/plink/1.9/resources

Bcftools https://samtools.github.io/bcftools/

GWAS tools

Saige https://saigegit.github.io/SAIGE-doc/

Regenie https://rgcgithub.github.io/regenie/

PRS tools

PRSice-2 and PRSet https://choishingwan.github.io/PRSice/

PRS-CSx https://github.com/getian107/PRScsx

Additional Resources

For more information on the CBIPM-BioMe Biobank Program dataset, visit the CBIPM website or contact cbipm@mssm.edu. To explore other available datasets on the HPC Data Ark, visit the HPC Data Ark Datasets website. On Minerva, use module load dataark to see the path variables for the HPC Data Ark datasets.

Note: This page is a continuous work in progress. Please reach out to cbipm@mssm.edu if there is any additional information that you would like to see added to any section of this page (e.g., terms added to the Terminology section).

Acknowledgement

Use this acknowledgement in publications that utilize any HPC Data Ark dataset: “This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai, supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences.”