New genomic data sets now available on the Data Ark Data Commons

The Department of Genetics & Genomic Sciences and Scientific Computing and Data are pleased to announce the launch of new public, unrestricted data sets hosted on the Data Ark:

Select TCGA (The Cancer Genome Atlas Program)
gnomAD (Genome Aggregation Database)
eQTLGen (Expression Quantitative Trait Locus)
UKBB-LD (UK Biobank Linkage Disequilibrium)
Reference genome

These public data are directly accessible to users on the Minerva supercomputer without a Data Use Agreement (DUA) form.

Select TCGA Data

The Cancer Genome Atlas (TCGA) is a landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The Data Ark hosts a select part of the TCGA data that belongs to the “open-access” category and was obtained from the Genomic Data Commons Data Portal; this includes all the open-access biospecimen data, clinical data, RNA-seq counts, and whole-exome sequencing (WXS) data in Mutation Annotation Format (MAF) files. All the RNA-seq counts files (over 11,000 patients) and their related biospecimen and clinical data are combined and consolidated into 33 different outcomes.

To learn more about Select TCGA and how to access this data set, click here.

gnomAD Data

The Genome Aggregation Database (gnomAD) aggregates and harmonizes both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and makes summary data available for the wider scientific community.

The v2.1.1 data set provided (GRCh37/hg19) spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals sequenced as part of various disease-specific and population genetic studies. gnomAD v2.1.1 is preferable to v3 for interpreting coding variants. The v3.1 data set (GRCh38) spans 76,156 genomes, providing more data for noncoding regions and for coding regions not covered well in exomes, such as regions with high GC content or regions not targeted by exome capture.

To learn more about gnomAD and how to access this data set, click here.

eQTLGen

The eQTLGen Consortium was set up to identify the downstream consequences of trait-related genetic variants. To investigate the genetics of gene expression, the research group performed cis- and trans-expression quantitative trait locus (eQTL) analyses using blood-derived expression data from a total of 31,684 individuals.

You can find the cis-eQTL, trans-eQTL, eQTS, replication, and single-cell eQTLGen data folders on the Data Ark.

To learn more about eQTLGen and how to access this data set, click here.

UKBB-LD

UKBB-LD is a set of summary linkage disequilibrium (LD) matrix files computed from UK Biobank (UKBB) data, based on N=337K British-ancestry individuals. The LD information is stored as 2,763 3Mb-long regions spanning the entire genome. This data set can be used for post-genome-wide association study (GWAS) analyses such as fine-mapping. The data were generated by Alkes Price’s group at Harvard.

For more information on this data set, click here.

Reference Genome

The Data Ark hosts the most frequently used reference genome files and related annotations. We provide FASTA files: GRCh37 and GRCh38 for human, and GRCm38 and GRCm39 for mouse. In addition, we host Ensembl, GENCODE, and RefSeq annotations for different annotation purposes. For version control, a ‘current’ symlink always points to the most recent version of each file, so you do not need to change the path in your code to access the latest file.
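As an illustration of how a ‘current’ symlink keeps code paths stable, here is a minimal Python sketch. It builds a throwaway directory rather than touching the Data Ark itself; the directory names (`v1`, `v2`), the file name `genome.fa`, and the helper `resolve_current` are all hypothetical and do not reflect the actual Data Ark layout.

```python
import os
import tempfile


def resolve_current(path):
    """Return the real location that a (possibly symlinked) path points to."""
    return os.path.realpath(path)


# Build a temporary directory that mimics a versioned layout.
# All names here are made up for the example.
root = tempfile.mkdtemp()
for version in ("v1", "v2"):
    os.makedirs(os.path.join(root, version))
    with open(os.path.join(root, version, "genome.fa"), "w") as fh:
        fh.write(f">chr_demo {version}\nACGT\n")

# 'current' is a symlink to the newest version directory.
current = os.path.join(root, "current")
os.symlink("v2", current)

# Code refers only to the stable path that goes through 'current'.
stable_path = os.path.join(current, "genome.fa")
resolved = resolve_current(stable_path)

with open(stable_path) as fh:
    header = fh.readline().strip()

print("stable path :", stable_path)
print("resolves to :", resolved)
print("first line  :", header)
```

Because the script only ever opens the stable path, repointing the symlink to a newer version would change what is read without any change to the code. Recording the resolved path alongside your results is one way to keep track of which version an analysis actually used.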

To learn more about the Reference Genome and how to access this data set, click here.

Click here to learn more about other available data within the Data Ark.

Contact the Data Ark Team at data-ark-team@lists.mssm.edu

Join our Data Ark Slack channel at https://join.slack.com/t/data-ark/signup