Reference Genome and Annotation
Data Ark is building an accessible reference genome resource folder. The folder covers the most frequently used reference genomes (.fasta files) and annotation files (.tdf files).
The GENCODE and Ensembl gene annotations are the same. The only exception is that genes common to the pseudoautosomal regions (PARs) of human chromosomes X and Y appear twice in the GENCODE GTF, whereas the Ensembl file lists them only for chromosome X.
In general, the GENCODE/Ensembl annotations are more comprehensive than RefSeq: they contain more exons, have greater genomic coverage, and capture many more variants in both genome and exome datasets. You can find more information through this paper link.
For version control, we keep a “current” version, which is a symlink that always points to the most recent version of the file.
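As a rough illustration of how a “current” symlink works, the sketch below builds a throwaway directory with two versions and points “current” at the newer one. The names v1, v2, and genome.fasta are hypothetical and are not the actual layout of the Data Ark folder.

```shell
#!/bin/sh
# Minimal sketch of a "current" symlink. All names here (v1, v2, genome.fasta)
# are hypothetical and only illustrate the idea.
set -eu

# Work in a temporary directory so nothing real is touched.
demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT

# Two hypothetical versions of the same reference file.
mkdir "$demo_dir/v1" "$demo_dir/v2"
echo ">chr_demo" > "$demo_dir/v1/genome.fasta"
echo ">chr_demo" > "$demo_dir/v2/genome.fasta"

# "current" is a symlink to the newest version directory.
ln -s v2 "$demo_dir/current"

# readlink prints the target the symlink points to.
resolved=$(readlink "$demo_dir/current")
echo "current -> $resolved"

# Files are reachable through the symlink without knowing the version.
ls "$demo_dir/current"
```

On the real folder, running readlink (or ls -l) on the “current” entry should show which version it points to at that moment, which is worth recording if an analysis needs to be reproducible.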
No DUA form is required to use this data. Access the data on Minerva at /sc/arion/projects/data-ark/Public_Unrestricted/reference_genome, or run $ module load dataark to see the path variables.
To suggest a new reference genome or related data sets, join our Data Ark Slack channel at https://join.slack.com/t/data-ark/signup and sign up using your Mount Sinai credentials.
Data Ark Data Sets
Public data sets (unrestricted)
- 1,000 Genomes Project
- GWAS Summary Stats
- The Cancer Genome Atlas (TCGA)
- Reference Genome
Public data sets (restricted)
Mount Sinai generated data (unrestricted)
Mount Sinai generated data (restricted)