We are leading one of the national Proteomics and Genomics Data Analysis Centers (PGDAC) in the NCI-funded CPTAC consortium (https://proteomics.cancer.gov/programs/cptacnetwork).

CPTAC (Clinical Proteomic Tumor Analysis Consortium) is a sister program to TCGA (The Cancer Genome Atlas). In the past few years, CPTAC has organized nationwide efforts to perform deep proteomic and genomic profiling of multiple tumor types at a scale larger than that of any previous proteomics study. In the pilot phase of CPTAC, integrative proteogenomic analyses have already led to important findings for colon, breast and ovarian cancers (Liebler et al., Nature, 2014; Mertins et al., Nature, 2016; Zhang et al., Cell, 2016). In the coming few years, six new cancer types will be studied, including lung cancer. Currently, deep proteomic/genomic profiling of 100 tumor and 100 matched blood samples for each tumor type is being carried out by multiple centers in the CPTAC network. In the proteomic profiling in particular, a major focus is on characterizing post-translational modification (PTM) activity through phospho-, glyco- and acetyl-proteomics profiling. These experiments use state-of-the-art mass spectrometry platforms (TMT+MS3) and extensive fractionation techniques, which enable the identification and quantification of >20K proteins and >70K PTMs in each sample.

The rationale for our CPTAC-PGDAC rests on two points:

  1. Systems-level approaches are key to enhancing our understanding of cancer, a disease that involves a complex array of pathway interactions and dysfunctions across multiple systems. Over the past couple of decades, Drs. Schadt (M-PI), Wang (M-PI) and their collaborators have pioneered the field of systems learning. The systems learning approach we will employ for this project involves two basic components: (1) inferring the gene/protein regulatory networks for the biological samples of interest; and (2) using the inferred networks to derive novel biological and clinical knowledge that elucidates the complexity of cancer.
  1. Unique properties of proteomics data require tailored methods.

Both unlabeled and labeled proteomics experiments produce outputs with substantial missing data. Properly modeling the non-ignorable missing-data mechanism could significantly enhance the efficiency of proteomics data analysis. Besides missing data, labeled proteomics experiments also give rise to severe batch effects. To address these issues, Dr. Wang and team members have proposed a suite of statistical methods that properly model proteomics data and can improve estimation and inference accuracy.
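
To illustrate why a non-ignorable missing-data mechanism matters, the following is a minimal simulation sketch, not the methods developed by Dr. Wang's team. It assumes, purely for illustration, that log-abundances are normally distributed and that the probability of detecting a protein rises with its abundance according to a logistic curve. The function name and all parameter values are hypothetical.

```python
import math
import random


def simulate_abundance_dependent_missingness(n=20000, seed=0):
    """Simulate measurements whose chance of being observed depends on their value.

    Assumptions (illustrative only): true log-abundances follow N(20, 2^2),
    and the detection probability is a logistic function of abundance, so
    low-abundance values are more likely to be missing.
    """
    rng = random.Random(seed)
    true_values = [rng.gauss(20.0, 2.0) for _ in range(n)]

    observed = []
    for x in true_values:
        # Detection probability increases with abundance (non-ignorable missingness).
        p_detect = 1.0 / (1.0 + math.exp(-(x - 20.0)))
        if rng.random() < p_detect:
            observed.append(x)

    true_mean = sum(true_values) / n
    observed_mean = sum(observed) / len(observed)
    missing_rate = 1.0 - len(observed) / n
    return true_mean, observed_mean, missing_rate


if __name__ == "__main__":
    true_mean, observed_mean, missing_rate = simulate_abundance_dependent_missingness()
    print(f"true mean:      {true_mean:.2f}")
    print(f"observed mean:  {observed_mean:.2f}")
    print(f"missing rate:   {missing_rate:.2%}")
```

Under these assumptions, the mean of the observed values overestimates the true mean, because low-abundance measurements are preferentially lost. An analysis that simply drops missing values inherits that bias, which is the motivation for modeling the missing-data mechanism explicitly.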