GWAS Summary Statistics
Overview
Genome-wide association studies (GWAS) provide a powerful tool for identifying genetic loci associated with phenotypes of interest. The sharing of GWAS summary statistics has enabled a range of secondary research applications that do not require access to individual-level data, such as gene prioritization, fine-mapping, pathway enrichment analyses, causal inference of exposures, risk prediction, genetic correlation and heritability estimation.
Several thousand sets of GWAS summary statistics, obtained from the IEU Open GWAS Project, are available in the Data Ark, including:
* ebi-a (n = 288): GWAS satisfying minimum requirements imported from the EBI database of complete GWAS summary data
* ieu-a (n = 440): GWAS generated by many different consortia that have been manually collected and curated, initially developed for MR-Base
* ieu-b (n = 37): GWAS generated by many different consortia that have been manually collected and curated, initially developed for MR-Base (round 2)
* ukb-b (n = 2514): IEU analysis of UK Biobank phenotypes
These GWAS are stored on the Data Ark in the GWAS-VCF format, which provides a consistent and robust approach to storing genetic variants, annotations and metadata, enabling interoperability and reusability consistent with the FAIR principles [1]. Crucially, this ensures that all the provided GWAS are harmonized so that, e.g., the ALT allele corresponds to the effect allele, and that all the files use a consistent labeling scheme.
More details on the GWAS-VCF format (illustrated above) and the available open-source tools for working with GWAS-VCFs can be found in the corresponding publication [1]. The tools are also included in the Data Ark.
[1] Lyon, M. et al. (2021). The variant call format provides efficient and robust storage of GWAS summary statistics. Genome Biol 22, 32.
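As a rough illustration of what the harmonized GWAS-VCF layout means in practice, the Python sketch below parses a single record from a single-trait file. It assumes the per-trait fields are named ES (effect size), SE (standard error) and LP (-log10 p-value), as we understand the GWAS-VCF specification to define them. The example line is synthetic, and the function name parse_gwas_vcf_line is hypothetical. For real analyses, the open-source tools described in the publication are the better choice.

```python
# Minimal sketch: parse one data line of a single-trait GWAS-VCF file.
# Assumptions (not confirmed by this page): FORMAT keys ES, SE and LP
# follow the GWAS-VCF specification; the example record is made up.

# Synthetic tab-separated record: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
EXAMPLE_LINE = (
    "1\t49298\trs10399793\tT\tC\t.\tPASS\tAF=0.62\t"
    "ES:SE:LP:AF:ID\t0.0034:0.0021:0.98:0.62:rs10399793"
)


def parse_gwas_vcf_line(line):
    """Return a dict of summary statistics for one GWAS-VCF record."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[:5]

    # Pair each FORMAT key with the matching value in the sample column.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": rsid,
        "other_allele": ref,    # REF allele
        "effect_allele": alt,   # ALT allele is the effect allele after harmonization
        "beta": float(sample["ES"]),
        "se": float(sample["SE"]),
        # LP stores -log10(p), so invert it to recover the p-value.
        "pval": 10 ** (-float(sample["LP"])),
    }


if __name__ == "__main__":
    record = parse_gwas_vcf_line(EXAMPLE_LINE)
    print(record["id"], record["effect_allele"], record["beta"], record["pval"])
```

The sketch does not handle missing values (written as "." in VCF), multi-allelic sites or header lines, so it would need extending before use on real files.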
Now available in the Data Ark:
- The Oxford Brain Imaging Genetics Server – BIG40: This open data server contains results from GWAS of almost 4,000 imaging-derived phenotypes from the multimodal brain imaging in UK Biobank. It is a major update to the original BIG server, using data from the 40,000-subject imaging data release from early 2020. The discovery sample size was 22,138 and the replication sample size was 11,086. Chromosomes 1 to 22 and X are included, resulting in associations with 17,103,079 SNPs. More information can be found here.
Access
Effective January 22, 2024, you must read, agree to, and sign the Data Use Agreement (you must be logged in through the Mount Sinai campus network or the secure remote VPN). Access is granted within 24 hours. On Minerva, you can then load the module with $ module load dataark to see the path variables.
More Information
The GWAS summary statistics, as well as scripts for working with them, were uploaded to the Data Ark by Shea Andrews (shea.andrews@mssm.edu), on 01/16/21.
Data Ark Data Sets
Please visit the Data Ark Data Set webpage to explore other data sets.