GWAS Summary Statistics
Genome-wide association studies (GWAS) provide a powerful tool for identifying genetic loci associated with phenotypes of interest. The sharing of GWAS summary statistics has enabled a range of secondary research applications that do not require access to individual-level data, such as gene prioritization, fine-mapping, pathway enrichment analysis, causal inference of exposures, risk prediction, genetic correlation, and heritability estimation.
Several thousand GWAS summary statistics are available in the Data Ark, obtained from the IEU Open GWAS Project, including:
* ebi-a (n = 288): GWAS satisfying minimum requirements imported from the EBI database of complete GWAS summary data
* ieu-a (n = 440): GWAS generated by many different consortia that have been manually collected and curated, initially developed for MR-Base
* ieu-b (n = 37): GWAS generated by many different consortia that have been manually collected and curated, initially developed for MR-Base (round 2)
* ukb-b (n = 2514): IEU analysis of UK Biobank phenotypes
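As a rough illustration of how the batches above can be told apart, the sketch below assumes that dataset identifiers follow the IEU Open GWAS convention of a batch prefix followed by a study number or accession (for example `ieu-a-2`). The example identifiers are illustrative only and are not a listing of what is stored in the Data Ark; the helper name `batch_of` is hypothetical.

```python
from collections import Counter

# Batch sizes exactly as listed above.
BATCH_SIZES = {"ebi-a": 288, "ieu-a": 440, "ieu-b": 37, "ukb-b": 2514}


def batch_of(dataset_id: str) -> str:
    """Return the batch prefix of an ID, e.g. 'ieu-a-2' -> 'ieu-a'.

    Assumes the ID has the form '<source>-<letter>-<rest>'.
    """
    parts = dataset_id.split("-", 2)
    if len(parts) < 3:
        raise ValueError(f"unexpected dataset ID format: {dataset_id!r}")
    return "-".join(parts[:2])


# Illustrative IDs only (assumed format, not a Data Ark listing).
example_ids = ["ieu-a-2", "ukb-b-19953", "ebi-a-GCST004132", "ieu-a-7"]
counts = Counter(batch_of(i) for i in example_ids)

# Total number of GWAS across the four listed batches.
total = sum(BATCH_SIZES.values())

if __name__ == "__main__":
    print(dict(counts))
    print("total GWAS in listed batches:", total)
```

Summing the listed batch sizes gives the total number of GWAS across these four batches.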
These GWAS are stored on the Data Ark in the GWAS-VCF format, which provides a consistent and robust approach to storing genetic variants, annotations, and metadata, enabling interoperability and reusability consistent with the FAIR principles. Crucially, this ensures that all the provided GWAS are harmonized so that, e.g., the ALT allele corresponds to the effect allele, and that all the files use a consistent labeling scheme.
More details on the GWAS-VCF format (illustrated above) and the available open-source tools for working with GWAS-VCFs can be found in the corresponding publication: Lyon, M. et al. (2021). The variant call format provides efficient and robust storage of GWAS summary statistics. Genome Biol 22, 32. These tools are also included in the Data Ark.
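To make the harmonization concrete, here is a minimal sketch of reading one GWAS-VCF data line with the Python standard library. It assumes the per-study FORMAT keys described in the GWAS-VCF publication (ES for effect size, SE for standard error, LP for the -log10 p-value); the example record is made up for illustration, and `parse_gwas_vcf_line` is a hypothetical helper, not one of the tools shipped with the Data Ark. In practice a dedicated VCF parser is likely a better choice.

```python
# A made-up, tab-separated GWAS-VCF data line (illustration only).
EXAMPLE_LINE = "\t".join([
    "1", "49298", "rs10399793", "T", "C", ".", "PASS", "AF=0.62",
    "ES:SE:LP:AF:ID",
    "0.0013:0.0021:2.0:0.62:rs10399793",
])


def parse_gwas_vcf_line(line: str) -> dict:
    """Parse one single-study GWAS-VCF data line into a plain dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[:5]

    # Column 9 names the per-study fields; column 10 holds their values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))

    return {
        "chrom": chrom,
        "pos": int(pos),
        "rsid": rsid,
        "other_allele": ref,    # REF is the non-effect allele
        "effect_allele": alt,   # ALT is the effect allele after harmonization
        "beta": float(sample["ES"]),
        "se": float(sample["SE"]),
        "pval": 10 ** -float(sample["LP"]),  # LP is -log10(p)
    }


if __name__ == "__main__":
    print(parse_gwas_vcf_line(EXAMPLE_LINE))
```

This sketch does not handle header lines (those starting with `#`), missing values written as `.`, or multi-sample files.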
Now available in the Data Ark:
- The Oxford Brain Imaging Genetics Server – BIG40: This open data server contains results from GWAS of almost 4,000 imaging-derived phenotypes from the multimodal brain imaging in UK Biobank. It is a major update to the original BIG server, using data from the 40,000-subject imaging data release from early 2020. The discovery sample size was 22,138 and the replication sample size was 11,086. Chromosomes 1 to 22 and X are included, resulting in associations with 17,103,079 SNPs. More information is available on the BIG40 server.
The GWAS summary statistics, as well as scripts for working with them, were uploaded to the Data Ark by Shea Andrews (firstname.lastname@example.org), on 01/16/21.
No DUA form is required to use this data. You can access the data at the following path on Minerva: /sc/arion/projects/data-ark/Public_Unrestricted/GWAS_SumStats. Alternatively, load the module with $ module load dataark to see the path variables.
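As a starting point for exploring the directory, the following sketch lists GWAS-VCF files under the path given above. The internal layout of the directory is not described here, so the recursive search and the `*.vcf.gz` file suffix are assumptions; `find_gwas_vcfs` is a hypothetical helper name.

```python
from pathlib import Path

# Path on Minerva as stated above.
GWAS_ROOT = Path("/sc/arion/projects/data-ark/Public_Unrestricted/GWAS_SumStats")


def find_gwas_vcfs(root, pattern="*.vcf.gz"):
    """Return a sorted list of files under `root` matching `pattern`.

    The '*.vcf.gz' suffix is an assumption about how the files are named.
    Returns an empty list if `root` is not an existing directory.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob(pattern) if p.is_file())


if __name__ == "__main__":
    # Print the first few matches; prints nothing when run off Minerva.
    for path in find_gwas_vcfs(GWAS_ROOT)[:10]:
        print(path)
```

If the search returns nothing on Minerva, adjusting `pattern` after inspecting the directory by hand is the likely fix.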