Reference Genome and Annotation 

Data Ark is building an accessible reference genome resource folder. The folder covers the most frequently used reference genomes (.fasta files) and annotation files (.tdf files).

The genome files are downloaded from Ensembl Release 106, and the annotation files are downloaded from Ensembl, RefSeq, and GENCODE.

The gene annotation is the same in the GENCODE and Ensembl files. The only exception is that genes common to the pseudoautosomal regions (PAR) of human chromosomes X and Y appear twice in the GENCODE GTF, while the Ensembl file lists them only for chromosome X.

In general, the GENCODE/Ensembl annotations are more comprehensive than RefSeq: they contain more exons, have greater genomic coverage, and capture many more variants in both genome and exome datasets. You can find more information in the linked paper.

For version control, each file has a “current” version: a symlink that always points to the most recent version of the file.
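
Because the “current” symlink moves whenever a newer version is added, it can be useful to record which version it pointed to when an analysis was run. The following is a minimal Python sketch of that idea. The function name resolve_current and the assumption that the symlink is literally named "current" inside the directory you pass in are illustrative, not part of the Data Ark documentation.

```python
import os


def resolve_current(directory, link_name="current"):
    """Return the real path that the symlink `link_name` in `directory` points to.

    Assumption: the symlink is named "current"; adjust link_name if the
    actual name in the resource folder differs.
    """
    link_path = os.path.join(directory, link_name)
    if not os.path.islink(link_path):
        raise FileNotFoundError(f"{link_path} is not a symlink")
    # realpath follows the link (and any nested links) to the actual target
    return os.path.realpath(link_path)
```

Writing the returned path to an analysis log keeps results traceable to a specific version even after “current” is repointed.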

No DUA form is required to use this data. Access the data on Minerva at /sc/arion/projects/data-ark/Public_Unrestricted/reference_genome, or run "module load dataark" to see the path variables.
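
As a quick way to see what is available, the sketch below lists the entries directly under the reference genome folder. It assumes only the path given above; the layout beneath that path is not described here, so the helper (list_entries, a hypothetical name) simply reports whatever it finds and returns an empty list when the path is not reachable, for example when run outside Minerva.

```python
from pathlib import Path

# Path taken from the text above; only reachable on Minerva.
REFERENCE_ROOT = Path(
    "/sc/arion/projects/data-ark/Public_Unrestricted/reference_genome"
)


def list_entries(root=REFERENCE_ROOT):
    """Return the sorted names of files and folders directly under `root`."""
    root = Path(root)
    if not root.is_dir():
        # Not on Minerva, or the folder is not mounted
        return []
    return sorted(p.name for p in root.iterdir())
```

On Minerva, calling list_entries() with no arguments inspects the shared folder; passing another directory lets you point the same helper elsewhere.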

To suggest a new reference genome or related data sets, join our Data Ark Slack channel at https://join.slack.com/t/data-ark/signup and sign up using your Mount Sinai credentials.
