UK Biobank

Overview

The UK Biobank is the largest widely available genetic epidemiological data set in the world. The data set includes genotype, whole exome and whole genome sequencing data on 500,000 individuals, linked to a rich set of thousands of self-reported traits, hospital and primary care records, biomarker measures, mental health information and imaging data. Participants aged 40-69 years were recruited during 2006-2010 across 22 recruitment centres in the United Kingdom and have been followed up multiple times since, yielding a wide range of information on their traits, biomarkers and diseases.

Access

To use these data, you must submit a project proposal to the UK Biobank (see here) describing your planned research and receive its approval. Alternatively, if you are collaborating with one of the labs at Sinai that already has an approved application (see list here), you may be able to join that application, after discussion with the PI or their delegate, if your project is strongly aligned with theirs. However, we highly recommend making your own application, in which you can specify all the research that you plan to conduct with the data and control who has access to it.

UK Biobank access fees fall into three tiers, based broadly on the size of the respective datasets.

While the phenotype data for approved applications can only be accessed by investigators listed on the corresponding application, we host and manage all of the genetic data on the Data Ark (genotype data in PLINK format, imputed data in BGEN format), which can be used by any investigator on an approved UK Biobank application.
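As a minimal sketch of working with the PLINK-format genotype data, the Python below parses the sample table of a PLINK `.fam` file, which has six whitespace-separated columns per individual (family ID, individual ID, father ID, mother ID, sex code, phenotype). The function name `read_fam`, the `FamRecord` type and the two example rows are hypothetical and for illustration only; they are not taken from the Data Ark or from the UK Biobank tutorial, and the actual file locations on the Data Ark are not assumed here.

```python
import io
from typing import List, NamedTuple, TextIO


class FamRecord(NamedTuple):
    """One row of a PLINK .fam file (hypothetical helper type)."""
    family_id: str
    individual_id: str
    father_id: str
    mother_id: str
    sex: int        # PLINK convention: 1 = male, 2 = female, 0 = unknown
    phenotype: str  # kept as text; "-9" conventionally means missing


def read_fam(handle: TextIO) -> List[FamRecord]:
    """Parse an open .fam file into a list of records."""
    records = []
    for line_number, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 6:
            raise ValueError(
                f"line {line_number}: expected 6 columns, got {len(fields)}"
            )
        fid, iid, father, mother, sex, pheno = fields
        records.append(FamRecord(fid, iid, father, mother, int(sex), pheno))
    return records


# Made-up example rows standing in for a real .fam file.
example = io.StringIO("1001 1001 0 0 1 -9\n1002 1002 0 0 2 -9\n")
samples = read_fam(example)
print(len(samples), "samples;", samples[0].individual_id, "is the first ID")
```

For a real file, the same function can be called on a handle from `open(path)`, where `path` points to whichever `.fam` file your approved application gives you access to.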

The data access process, the range of data types and how to perform UK Biobank analyses are described in a UK Biobank tutorial produced by Judit Garcia-Gonzalez (judit.garciagonzalez@mssm.edu) and Sam (Shing Wan) Choi (shingwan.choi@mssm.edu) from the O’Reilly lab, which we highly recommend reading. The tutorial includes SQL scripts (written by Sam Choi) that make accessing ICD-10 coded disease data from the biobank substantially easier. For more details on the UK Biobank, see the main website here.

To use these data through the Data Ark, you must read, agree to and sign the Data Use Agreement (you must be logged in through the Mount Sinai campus network or secure remote VPN).

More Information

The UK Biobank genetic data were uploaded to the Data Ark by Shing Wan Choi (shingwan.choi@mssm.edu) on 11/1/20 and will be updated as new genetic data are provided by the UK Biobank.

Data Ark Data Sets

Please visit the Data Ark Data Set webpage to explore other data sets.