Author ORCID Identifier
https://orcid.org/0009-0000-0259-391X
Date of Award
5-6-2024
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
First Advisor
Alex Zelikovsky
Abstract
Extracting insights from the vast dataset of viral genome sequences collected throughout the COVID-19 pandemic requires novel algorithms tailored to its unique properties. These properties, including high sampling density, unambiguous knowledge of the phylogenetic root sequence, and completeness with respect to the virus’s evolutionary history in humans, make it distinct among viral genome datasets. This dissertation details the development and application of advanced computational methodologies for analyzing the SARS-CoV-2 genomic dataset. We introduce a suite of computational techniques tailored to this data, beginning with SPHERE, an algorithm for scalable phylogeny reconstruction that adapts to the high density of the genomic data. Next is (ε, τ)-MSN, which forms genetic relatedness networks by joining all possible minimum spanning trees and augmenting the network with additional edges to capture groups of similar sequences. Furthermore, we present an unsupervised learning approach for finding a clustering of genomic sequences that minimizes cluster entropy. We also propose a method for implementing evolutionary jumps within genetic algorithms, simulating the punctuated equilibrium phenomenon observed in SARS-CoV-2 sequencing data; this method was shown to improve the speed of convergence on hard instances of the 0-1 Knapsack Problem. Collectively, these works detail new, efficient ways to model and extract information from large-scale viral sequencing datasets.
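As an illustration of the "joining all possible minimum spanning trees" step mentioned in the abstract, the following is a minimal sketch, not the dissertation's implementation. It relies on a standard fact: an edge belongs to at least one minimum spanning tree exactly when its endpoints are not already connected by strictly lighter edges. The function name `union_of_msts` and the `(weight, u, v)` edge format are assumptions made for this example, and the additional edges controlled by ε and τ are not modeled here.

```python
from itertools import groupby


def union_of_msts(n, edges):
    """Return the edges that appear in at least one minimum spanning tree.

    n     -- number of nodes, labeled 0..n-1 (hypothetical labeling)
    edges -- iterable of (weight, u, v) tuples, e.g. pairwise genetic distances
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    result = set()
    # Process edges in groups of equal weight, lightest first.
    for _, group in groupby(sorted(edges), key=lambda e: e[0]):
        group = list(group)
        # First pass: test every edge against components formed by
        # strictly lighter edges only.
        for _, u, v in group:
            if find(u) != find(v):
                result.add((min(u, v), max(u, v)))
        # Second pass: merge components using all edges of this weight.
        for _, u, v in group:
            parent[find(u)] = find(v)
    return result


# Toy example: a triangle with tied weights plus two heavier edges.
example_edges = [(1, 0, 1), (1, 1, 2), (1, 0, 2), (2, 0, 3), (3, 1, 3)]
network = union_of_msts(4, example_edges)
```

In this toy input all three tied triangle edges are retained, since each lies on some minimum spanning tree, whereas a single spanning tree would keep only two of them.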
DOI
https://doi.org/10.57709/36973044
Recommended Citation
Novikov, Daniel, "Efficient Algorithms for Large Scale Analysis of Viral Genome Sequencing Data." Dissertation, Georgia State University, 2024.
doi: https://doi.org/10.57709/36973044