Author ORCID Identifier

https://orcid.org/0009-0000-0259-391X

Date of Award

5-6-2024

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Advisor

Alex Zelikovsky

Abstract

Extracting insights from the vast dataset of viral genome sequences collected throughout the COVID-19 pandemic requires novel algorithms tailored to its unique properties. These properties, such as high sampling density, unambiguous knowledge of the phylogenetic root sequence, and completeness with respect to the virus’s evolutionary history in humans, make it distinct among viral genome datasets. This dissertation details the development and application of advanced computational methodologies for analyzing the SARS-CoV-2 genomic dataset. We introduce a suite of computational techniques tailored to this data, beginning with SPHERE, an algorithm for scalable phylogeny reconstruction that adapts to the high density of the genomic data. Next is (ε, τ)-MSN, which forms genetic relatedness networks by taking the union of all possible minimum spanning trees and augmenting the network with additional edges to capture groups of similar sequences. Furthermore, we present an unsupervised learning approach for finding a clustering of genomic sequences that minimizes cluster entropy. We also propose a method for implementing evolutionary jumps within genetic algorithms, which simulates the punctuated equilibrium phenomenon observed in SARS-CoV-2 sequencing data; this method was shown to improve the speed of convergence on hard instances of the 0-1 Knapsack Problem. Collectively, these works detail new, efficient ways to model and extract information from large-scale viral sequencing datasets.
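
As a rough illustration of the idea of joining all possible minimum spanning trees, the sketch below computes the set of edges that appear in at least one minimum spanning tree of a complete graph over sequences. It is not the dissertation's implementation: the use of Hamming distance, the function names `hamming`, `pairwise_edges`, and `union_of_msts`, and the input format are all assumptions made for this example, and the additional edge augmentation controlled by ε and τ is deliberately left out.

```python
from itertools import combinations, groupby


def hamming(s, t):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(a != b for a, b in zip(s, t))


def pairwise_edges(seqs):
    """All (i, j, distance) triples for i < j (assumed distance: Hamming)."""
    return [(i, j, hamming(seqs[i], seqs[j]))
            for i, j in combinations(range(len(seqs)), 2)]


def union_of_msts(n, edges):
    """Return the set of edges that belong to at least one MST.

    Edges are processed in groups of equal weight, Kruskal style. An edge
    is in some MST exactly when its endpoints are not already connected
    by strictly lighter edges, so each group is tested against the
    components as they stood before any edge of that group was merged.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = set()
    by_weight = sorted(edges, key=lambda e: e[2])
    for _, group in groupby(by_weight, key=lambda e: e[2]):
        crossing = [(u, v, w) for u, v, w in group if find(u) != find(v)]
        kept.update(crossing)
        for u, v, _ in crossing:
            parent[find(u)] = find(v)
    return kept
```

On the toy input `["AAAA", "AAAT", "AATA", "TTTT"]`, the last sequence is equally close to the second and third, so both of those edges are kept, and the resulting network has four edges where any single spanning tree would have three.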

DOI

https://doi.org/10.57709/36973044
