Research | Ron Do Laboratory

Genomic discovery

We have conducted large-scale rare variant association studies for machine learning-based phenotypes for coronary artery disease (Petrazzini et al. Nature Genetics. 2024) and MASLD (Chen et al. Genome Biology. 2024).

Using machine learning and EHR clinical data to predict disease

We have used machine learning and clinical features from EHRs to build an EHR score to predict coronary artery disease (CAD) risk one year from diagnosis. We showed that the EHR score improves CAD prediction by 12% and reclassification by 26% compared to the conventional clinical risk score, the pooled cohort equations (Petrazzini et al. JACC. 2022). We have conducted several studies on machine learning-based estimation of disease risk for CAD (Petrazzini et al. AHJ Plus. 2022 and Forrest et al. Lancet. 2023), heart failure (Park et al. JAHA. 2023), SARDS (Forrest et al. Nat Comm. 2023), Lyme disease (Forrest et al. Clin Infectious Dis. 2023), and venous thromboembolism (Chen et al. ATVB. 2023).

Using machine learning and genomic features to predict variant function

Computational methods have been developed to predict variant pathogenicity. However, more granular levels of variant characterization are needed to fully understand the links between variant, function and phenotype. One such area is variant mode of inheritance (MOI). A large fraction of coding variants are missing MOI information, and for those that have it, the information is not necessarily accurate. We have used machine learning and a wide array of genomic annotations to predict recessive MOI for missense variants. We developed MOI-Pred, a three-way variant-level MOI prediction tool for autosomal recessive MOI, and showed that it has excellent discriminative power in identifying missense variants that cause autosomal recessive diseases (Petrazzini et al. medRxiv 2021).

Causal inference of risk factors and complex disease using Mendelian randomization

We have developed and implemented methods for causal analysis of complex disease using Mendelian randomization. We developed a modified approach to Mendelian randomization to isolate causal influences among a set of correlated risk factors for plasma lipids and coronary artery disease (CAD) (Do et al. Nature Genetics. 2013). We further developed MR-PRESSO, a method to detect and correct for horizontal pleiotropy (Verbanck et al. Nature Genetics. 2018). We have applied MR-PRESSO and other Mendelian randomization methods to inform drug outcomes for clinical trials, specifically for uric acid and chronic kidney disease (Jordan et al. PLOS Medicine. 2019). We are currently developing and applying new methods to infer causality between risk factors and disease.
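For readers unfamiliar with Mendelian randomization, the basic idea can be illustrated with the standard inverse-variance weighted (IVW) estimator, which combines per-variant effects on an exposure and an outcome into one causal effect estimate. The sketch below is a generic textbook illustration under simplifying assumptions (independent variants, no horizontal pleiotropy, fixed-effect weights). It is not MR-PRESSO or the multivariable approach described above, and the function name and input numbers are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted Mendelian randomization estimate.

    Each list holds one entry per genetic variant (instrument):
      beta_exposure: variant effect on the risk factor
      beta_outcome:  variant effect on the disease outcome
      se_outcome:    standard error of beta_outcome
    Returns (causal_estimate, standard_error).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have one entry per variant")

    # Weighted regression of outcome effects on exposure effects through the
    # origin, with weights 1 / se_outcome^2.
    numerator = sum(
        bx * by / se**2
        for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(
        bx**2 / se**2 for bx, se in zip(beta_exposure, se_outcome)
    )
    if denominator == 0:
        raise ValueError("exposure effects must not all be zero")

    estimate = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    return estimate, standard_error


if __name__ == "__main__":
    # Made-up summary statistics for three variants (illustration only).
    bx = [0.10, 0.20, 0.40]
    by = [0.05, 0.10, 0.20]
    se = [0.01, 0.02, 0.02]
    est, est_se = ivw_estimate(bx, by, se)
    print(f"IVW causal estimate: {est:.3f} (SE {est_se:.3f})")
```

Horizontal pleiotropy violates the assumptions of this simple estimator, which is why methods that detect and correct for it are needed.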

Evaluating genetic pleiotropy in human genetic variation

We have developed the HOrizontal Pleiotropy Score (HOPS) (Jordan, Verbanck et al. Genome Biology. 2019), which measures the amount of horizontal pleiotropy of a given genetic variant across a set of traits and/or diseases. We applied HOPS to genetic variation genome-wide and showed that horizontal pleiotropy is pervasive, enriched in regulatory regions and driven by highly polygenic traits. The results suggest that pervasive horizontal pleiotropy in human genetic variation is driven by extreme polygenicity of traits.

Impact of population genetic forces on human biology and disease

We have used sequencing data generated from various medical genetics projects to pursue questions in population genetics. The Exome Sequencing Project showed that the majority of coding variation in the human genome is rare and deleterious. The study also demonstrated that there is an overabundance of rare coding variation beyond what we would expect in theory, and that this is best explained by a population having undergone recent, accelerated growth. Building on this work, we developed a population genetics framework that uses the accumulation of deleterious coding mutations in both ancient and modern humans to make inferences about the role of natural selection in diverse human populations. We showed that there is no evidence that natural selection has been less effective at removing deleterious mutations in Europeans than in West Africans (Do et al. Nature Genetics. 2015). We have also shown that our framework can be used as a test to determine the mode of selection for gene sets (Balick et al. PLOS Genetics. 2015). We have extended this work by examining the effect of deleterious load on the medical phenome (Vy et al. PLOS Genetics. 2021). In addition, we have inferred recessive selection using human population frequency data and provided insights into the mode of selection and disease mode of inheritance (Balick*, Jordan* et al. AJHG. 2021).

Interpreting genetic association links between genes and clinical phenotypes using electronic health record (EHR)-linked biobank data

Large-scale genetic association summary statistics are now publicly available for a wide array of genes, phenotypes and expression quantitative trait loci in tissues. However, interpreting the genetic links between these rich data sources is difficult and complex. We are currently applying network-based methods to better understand shared genetic relationships between genes and phenotypes in specific tissues (Rocheleau et al. Comm Biol. 2022).

Human genetics-guided framework to inform drug therapeutic outcomes

We aim to build a framework that leverages genomic data to inform the main indications and side effects of drug therapeutics. We have led a study that showed that genes with five genetic features – tissue-specific gene expression, Mendelian loci, phenotype-level and tissue-level effects of genetic associations, and genetic constraint – are associated with 2.6 increased side effects compared to genes with none of these features (Duffy et al. Science Advances. 2021). In another study, we developed a genetic priority score that integrates diverse human genetic data to prioritize drug indications (Duffy et al. Nature Genetics. 2024). We have also developed a machine learning-based genetic priority score for drug indications (Chen et al. Nature Communications. 2024).

Evaluating clinical utility of genetic risk to disease

We now have some knowledge of the genetic risk conferred by rare clinical variants and high polygenic risk scores for a number of diseases. However, there has been limited progress in understanding how best to translate this information into clinical diagnosis. We are involved in studies that evaluate whether genetic risk can inform clinical diagnosis. We have evaluated rates of diagnosis for carriers of a highly penetrant variant in the TTR gene (TTR V122I) that causes hereditary amyloid transthyretin cardiomyopathy (hATTR-CM). We showed that the variant is strongly associated with heart failure in African Americans and Hispanic Americans, but that hATTR-CM is markedly underdiagnosed in carriers of TTR V122I (Damrauer et al. JAMA. 2019). We have also participated in a study highlighting that established clinical guidelines for assessing cardiovascular disease risk are insensitive to high polygenic risk for coronary artery disease (Aragam et al. JACC. 2020). We have further assessed genotype-first approaches to diagnosis (Forrest et al. Hum. Mol. Genet. 2021; Forrest et al. Hum. Mut. 2021; Forrest et al. Eur J Heart Fail. 2022) and population-based approaches to quantify penetrance (Forrest et al. JAMA. 2022). We have extended our insights in a perspective on using large-scale population-based data to improve disease risk assessment of clinical variants (Forrest et al. Nature Genetics. 2025). We have also developed a framework to quantify machine learning-based penetrance using clinical data from electronic health records (Forrest et al. Science. 2025).