Scientific Computing and Data

Partnering with researchers to advance scientific discovery

1000 Genomes Project

The 1000 Genomes Project was initiated in 2008 with three pilot phases: low-coverage whole genome sequencing (WGS) of 180 individuals of African, Asian, and European ancestry; deep-coverage sequencing of two trios; and deep-coverage sequencing of 1000 genes in 900 unrelated samples. These pilot studies were expanded into larger projects, reported in three Nature papers published in 2010, 2012, and 2015. The final data set comprises 2,504 individuals from 26 global populations (~4X WGS and 30X WES for all, plus 24 individuals with 30X WGS for validation).

Since its completion in 2013, the 1000 Genomes Project has been superseded by the International Genome Sample Resource (IGSR), which was established to host and extend the 1000G data. For more information on the 1000G Project and the IGSR, see the 1000G & IGSR website.

The Data Ark hosts a replica of the Phase 3 individual-level called genotype data (VCF format), created by Google Health from high-coverage (30X) Illumina sequencing performed by the New York Genome Center and called with DeepVariant (detailed methods for the variant calling pipeline can be found here). The CRAM files can be obtained from the 1000G website, but they are extremely large (36 TB) and are not needed for the vast majority of use cases, so they are not included on the Data Ark.
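For readers new to the VCF format, the following is a minimal sketch of how per-sample genotype calls can be read from a VCF using only the Python standard library. The sample IDs, positions, and genotypes in the example are invented for illustration and are not taken from the 1000G data, and no Data Ark file path is assumed. For real analyses, a dedicated tool such as bcftools is likely a better choice than hand-written parsing.

```python
import gzip
import io


def open_vcf(path):
    """Open a plain-text or gzip/bgzip-compressed VCF file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def iter_genotypes(lines):
    """Yield (chrom, pos, ref, alt, {sample: GT}) for each VCF data line."""
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the 9 fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        # FORMAT (column 9) names the colon-separated keys of each sample field
        gt_index = fields[8].split(":").index("GT")
        calls = {
            sample: value.split(":")[gt_index]
            for sample, value in zip(samples, fields[9:])
        }
        yield chrom, int(pos), ref, alt, calls


# Hypothetical two-sample VCF; all values below are made up for illustration.
EXAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B",
    "chr1\t10000\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30\t1/1:28",
    "chr1\t10050\t.\tC\tT\t45\tPASS\t.\tGT:DP\t0/0:31\t./.:0",
])

for chrom, pos, ref, alt, calls in iter_genotypes(io.StringIO(EXAMPLE_VCF)):
    print(chrom, pos, ref, alt, calls)
```

To read a file on disk instead of the in-memory example, pass the handle returned by open_vcf to iter_genotypes. In the GT field, 0 denotes the reference allele, 1 the first alternate allele, and "." a missing call.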

To use this data, you must read, agree to, and sign the Data Use Agreement (you must be logged in through the Mount Sinai campus network or secure remote VPN).

