Date: June 21, 2022
Speaker: Dr. Manu Aggarwal
Affiliation: National Institutes of Health (NIH)/National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)
Title: Dory: Overcoming Barriers to Computing Persistent Homology
Abstract: The structure and relative arrangement of the constituents of any biological system are crucial to its function because of proximity-dependent interactions. Given data measuring the spatial embedding of such constituents, a pattern of interest is a region devoid of constituents that is surrounded by a high-density region whose constituents are close enough to interact; we colloquially refer to such a pattern as a hole. Holes have been shown to have functional significance: chromatin loops in chromosomes enable long-range control of gene transcription, three-dimensional voids in protein crystal structures are related to ligand interaction, and cosmic voids in the universe are related to dark energy. An algorithm to compute loops and voids is therefore needed to analyze the deluge of experimental data, which often carry large experimental uncertainties. Furthermore, identifying voids in a 3D embedding by visual inspection is subjective and prone to inconsistencies. An objective, mathematically sound method to detect holes and assess their statistical significance is required. Persistent homology (PH) is an approach to topological data analysis (TDA) that can compute the existence of holes in discrete data sets, assigning each a significance based on its robustness to experimental variability in the data set. This information comes at a high computational cost (run time and memory) that has limited the applicability of PH to small data sets of a few thousand points. Further, PH is commonly restricted to computing only the existence and significance of holes and not their location, because of the higher computational cost and a lack of precision in computing locations. We developed Dory, an efficient and scalable algorithm that computes PH, along with the locations of significant holes with improved precision, in large data sets.
We used Dory to find cosmic voids in the arrangement of around 108k galaxies in the universe, to find protein homologs with significantly different topology by analyzing 180k publicly available crystal structures (PDB), and to find chromatin loops in the human genome by analyzing high-resolution Hi-C contact maps that result in point clouds with millions of points. In benchmarks against other software, Dory was the only one able to analyze genome-wide Hi-C contact maps, and it did so in a few minutes using less than 3 GB of memory. To validate the locations of loops and voids computed by Dory, we show that the loops in Hi-C data sets and the voids in proteins agree with known biology.