PI:
Adam C. Knapp, Ph.D.
Project Summary
The DNA of every organism retains a trace of its evolutionary lineage, and as the DNA is copied, new genetic variations are introduced. Researchers analyze patterns in DNA sequences to piece together organisms’ evolutionary histories, and the estimation of these relationships has diverse applications. Examples include characterizing the evolutionary patterns of metastatic colorectal cancer or determining the relationship between different viral strains. The evolutionary history is often represented by a phylogenetic tree that displays the ancestry and descent relationships among organisms; however, genetic information is not always transferred vertically, and in these cases, the relationships should be represented by a network. For example, two viral genomes can co-infect the same host cell, leading to the exchange of genetic segments, called recombination, and thus contributing to their diversity. Additionally, organismal evolution occurs at two distinct levels: at the level of individual genes and at the level of species, which constrains the histories of the individual genes. Accurate estimation of evolutionary trees and networks is a major problem in mathematical and statistical inference since the precision of their estimation has a direct impact on fields such as epidemiology, forensic medicine, biosecurity, and cancer biology. The main objective of this project is to improve the scalability and accuracy of current methods and develop new methods for inferring phylogenetic species networks. The project is a collaboration between American University and the University of Florida and offers valuable educational and outreach opportunities. Specifically, graduate and undergraduate students will be trained in phylogenetic methodologies as part of the process, resulting in a vertical transfer of knowledge.
The work on inferring species-level relationships from multigene alignments is specific to models that simultaneously account for variability in individual gene histories due to processes such as incomplete lineage sorting, gene flow, hybridization, and recombination. The focus is on site-based species methodologies, with the goals of (1) improving quartet species inference and scalability; (2) implementing inference of rooted species trees from rooted triples; (3) developing more efficient site-based methods for identifying hybrid species or recombination events without the need for an outgroup; and (4) implementing site-based species network inference from quartets and 4-taxon networks. To achieve these goals, the PIs will extend multiple mathematical ideas from Markov models in the single gene tree setting, such as leaf transformations and a measure based on paralinear distance, to models under coalescent and gene flow; these ideas have so far not been formally described, tested, or implemented under the species tree model. The method for hybrid species identification is based on deriving functions whose ratio approaches the ratio of the mixing parameter, which is distinct from current algebraic methods that look for vanishing polynomials over networks. As a result, this work contributes to the mathematical, statistical, and biological sciences, and has the potential to spark new discussions in the theoretical and computational communities. A complete software package implementing our site-based species-level inference methods will be freely available to the empirical phylogenetics community. This project is jointly funded by the Mathematical Biology Program in the Division of Mathematical Sciences and the Systematics and Biodiversity Science Cluster in the Division of Environmental Biology. This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.
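For readers unfamiliar with the paralinear distance mentioned above, the following is a minimal sketch of the classical pairwise version for a single alignment of two nucleotide sequences. It is not the project's coalescent-based measure, which the summary notes has not yet been formally described. The function names are hypothetical, the formula is the unscaled form d = -ln(|det J| / sqrt(det D_x det D_y)), where J is the joint site-pattern frequency matrix and D_x, D_y are diagonal matrices of marginal base frequencies, and scaling conventions (for example a factor of 1/4) vary between authors.

```python
import math

NUCLEOTIDES = "ACGT"


def joint_frequencies(seq_x, seq_y):
    """Return the 4x4 matrix J, where J[a][b] is the fraction of sites
    at which seq_x shows base a and seq_y shows base b.
    Sites containing anything other than A, C, G, T are skipped."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    counts = [[0.0] * 4 for _ in range(4)]
    used = 0
    for a, b in zip(seq_x.upper(), seq_y.upper()):
        if a in index and b in index:
            counts[index[a]][index[b]] += 1.0
            used += 1
    if used == 0:
        raise ValueError("no comparable sites")
    return [[c / used for c in row] for row in counts]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def paralinear_distance(seq_x, seq_y):
    """Unscaled paralinear distance between two aligned sequences."""
    j = joint_frequencies(seq_x, seq_y)
    row_marginals = [sum(row) for row in j]
    col_marginals = [sum(j[r][c] for r in range(4)) for c in range(4)]
    det_j = abs(determinant(j))
    # det of a diagonal matrix is the product of its diagonal entries
    det_marginals = math.prod(row_marginals) * math.prod(col_marginals)
    if det_j == 0.0 or det_marginals == 0.0:
        raise ValueError("distance undefined for these sequences")
    return -math.log(det_j / math.sqrt(det_marginals))


if __name__ == "__main__":
    x = "AAAACCCCGGGGTTTT"
    y = "AAACCCCGGGGTTTTA"
    print(paralinear_distance(x, y))
```

The distance is zero for identical sequences in which all four bases occur, and it is undefined when a base is missing from either sequence or when the joint matrix is singular, so the sketch raises an error in those cases rather than returning a value.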
Publications
Adam C. Knapp, Daniel A. Cruz, Borna Mehrad, Reinhard C. Laubenbacher. Personalizing computational models to construct medical digital twins. bioRxiv 2024.05.31.596692; doi: 10.1101/2024.05.31.596692