Reference Genome and Annotation
Data Ark is building an accessible reference genome resource folder. The folder covers the most frequently used reference genomes (.fasta files) and annotation files (.tdf files).
GENCODE and Ensembl provide the same gene annotation. The only exception is that genes common to the pseudoautosomal regions (PARs) of human chromosomes X and Y appear twice in the GENCODE GTF, whereas the Ensembl file lists them only on chromosome X.
In general, the GENCODE/Ensembl annotations are more comprehensive than RefSeq: they contain more exons, have greater genomic coverage, and capture many more variants in both genome and exome datasets. You can find more information through this paper link.
For version control, each file has a “current” version, which is a symlink that always points to the most recent version of the file.
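As an illustration of how such a “current” symlink behaves, here is a minimal sketch in a scratch directory. The version names v1 and v2 are hypothetical and do not reflect the actual folder layout on Minerva.

```shell
# Hypothetical layout in a temporary directory; real version names may differ.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/v1" "$demo_dir/v2"

# "current" first points at v1 ...
ln -s v1 "$demo_dir/current"
# ... and is re-pointed at v2 when a newer version is released.
# -f replaces the existing link; -n treats "current" as the link itself
# rather than following it into the v1 directory.
ln -sfn v2 "$demo_dir/current"

# readlink shows which version the symlink resolves to.
resolved=$(readlink "$demo_dir/current")
echo "current -> $resolved"
```

Running readlink on a “current” link in the resource folder is a quick way to record which version an analysis actually used.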
NO DUA form is required to use this data. Access the data on Minerva at /sc/arion/projects/data-ark/Public_Unrestricted/reference_genome, or load the module with $ module load dataark to see the path variables.
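The access steps above might look like the following sketch. It assumes a standard module system (Environment Modules or Lmod) and guards each step, since the path and the dataark module exist only on Minerva. The variable names ref_dir and status are introduced here for illustration only; the names of the path variables that the module sets are not given above, so none are assumed.

```shell
# Path quoted from the text above; it exists only on Minerva.
ref_dir=/sc/arion/projects/data-ark/Public_Unrestricted/reference_genome

if [ -d "$ref_dir" ]; then
    ls "$ref_dir"          # list the available genomes and annotations
    status="found"
else
    echo "Not found: $ref_dir (are you on Minerva?)" >&2
    status="missing"
fi

# If a module system is available, load the module and inspect what it sets.
if command -v module >/dev/null 2>&1; then
    module load dataark || true
    module show dataark || true
fi
```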
Data Ark Data Sets
Please visit the Data Ark Data Set webpage to explore other data sets.