
The CBIPM-BioMe Data

Overview

The BioMe Biobank Program of the Charles Bronfman Institute for Personalized Medicine (CBIPM) maintains a comprehensive data management and analysis environment that covers clinical phenotype and genotype information and allows phenotype and genotype data to be linked for research.

The CBIPM-BioMe data set hosted under Data Ark currently includes BioMe microarray data from the Global Screening Array (GSA, from Regeneron) and the Global Diversity Array (GDA, from Sema4), Whole Exome Sequencing data (Regeneron and Sema4), and the BioMe Epic EHR Data Mart in OMOP format. All data are provided in anonymized form.

Genotypic data

The CBIPM data set is currently at freeze version V2: a combined set of 53,982 genotyping-array samples imputed with the 1000G and TOPMed reference panels. V2 adds 22,299 BioMe Biobank samples with 1.6 million typed variants, genotyped on the Illumina Global Diversity Array (GDA), to the former release of 31,683 imputed samples genotyped on the Illumina Global Screening Array (GSA) with ~650k typed variants.

For details on the study design and quality control, please review the following documents carefully:

The GSA (Regeneron) and GDA (Sema4) microarray Document

Regeneron Whole Exome Sequencing Data Document

BioMe Epic EHR data mart

The data are available as “|”-separated text files. The EHR information is based on data from Epic and other clinical databases. To promote transparency, reproducibility, and usability, a detailed data dictionary accompanies the data sets. It defines each data element in the set and includes a log of all changes made. In the de-identified data set, all 18 protected health information identifiers delineated in the HIPAA Privacy Rule, including names, dates, and addresses, are masked or removed. To maintain the temporal relationship between events, all dates are converted to elapsed days relative to the patient encounter start date. For more information on the data dictionary, please visit https://labs.icahn.mssm.edu/msdw/data-dictionary/
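As a rough illustration of working with this format, the following Python sketch reads a “|”-separated text stream that has a header line and interprets an elapsed-days column as integers. The column names (person_id, event_name, days_since_encounter_start) and the sample rows are made up for this example; the actual field names are defined in the data dictionary.

```python
import csv
import io


def read_pipe_delimited(handle):
    """Return a list of dicts, one per row, from a "|"-separated stream with a header line."""
    reader = csv.DictReader(handle, delimiter="|")
    return list(reader)


# Hypothetical sample data; the real column names come from the data dictionary.
sample = io.StringIO(
    "person_id|event_name|days_since_encounter_start\n"
    "1001|lab_draw|0\n"
    "1001|follow_up|14\n"
)

rows = read_pipe_delimited(sample)

# Dates are stored as elapsed days relative to the encounter start,
# so they are parsed as integers rather than calendar dates.
offsets = [int(row["days_since_encounter_start"]) for row in rows]
print(offsets)
```

To read an actual file, pass an open file handle (for example, one opened with open(path, newline="")) in place of the in-memory sample.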

Here is the CBIPM data folder tree structure on Minerva:

.
├── datamart
│   └── BioMe_Anonymized
├── Microarray
│   └── combined
│       ├── genotyped_TOPMED_V2
│       ├── imputed_1kg_V2
│       └── imputed_TOPMED_V2
└── WES
    ├── regeneron
    └── sema4

Access

This data set is open only to Mount Sinai researchers, staff, and faculty. To use these data, you must read, agree to, and sign the Data Use Agreement. You must be logged in through the Mount Sinai campus network or secure remote VPN, and a Minerva HPC account is needed. On Minerva, you can load the module with $ module load dataark to see the path variables.
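Once the module is loaded, one way to inspect the variables it sets is to filter the environment from Python. The sketch below is only an illustration: this page does not list the variable names, so the search hint "dataark" and the demo entries are assumptions, and the actual names on Minerva may differ.

```python
import os


def variables_matching(environ, name_hint):
    """Return {name: value} for variables whose name contains name_hint (case-insensitive)."""
    hint = name_hint.lower()
    return {name: value for name, value in sorted(environ.items()) if hint in name.lower()}


# Demonstration on a made-up environment; both entries are placeholders,
# not real Minerva variable names or paths.
demo_env = {"EXAMPLE_DATAARK_DIR": "/placeholder/path", "HOME": "/home/user"}
found = variables_matching(demo_env, "dataark")
print(found)

# In a real session, run this after `module load dataark`:
# for name, value in variables_matching(os.environ, "dataark").items():
#     print(name, value)
```

If the hint matches nothing, the function returns an empty dictionary, in which case a different hint may be needed.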

More information

Please visit the CBIPM website. For further information on the CBIPM data set, please contact cbipm@mssm.edu.

Data Ark Data Sets

Please visit the Data Ark Data Set webpage to explore other data sets.