Date of Award
12-2024
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
First Advisor
Murray Patterson
Abstract
Molecular sequence analysis is vital for understanding the functions, structures, and behaviors of organisms, including viruses, and it plays a key role in disease prevention and control. To apply machine learning (ML) and deep learning (DL) models to biological data, protein sequences must first be transformed into fixed-length numerical representations suitable for efficient clustering and classification. This research develops methods for extracting informative numerical features from protein sequences to enable comprehensive ML/DL analysis, with a focus on the challenges of clustering and classifying those sequences. Feature embedding techniques such as k-mers and One Hot Encoding (OHE) capture patterns within sequences. We also explore more advanced encoding methods, including low-dimensional representations, ViralVectors (based on minimizers), sparse coding, and PseAAC embedding, which account for variations in sequence length and complexity while preserving the information needed for accurate ML/DL analysis. To validate our approach, we experiment with diverse datasets, including SARS-CoV-2 virus sequences and T-cell receptor (TCR) sequences linked to cancer. We apply various ML/DL models for clustering and classification and compare our results with existing methods to demonstrate the effectiveness of our techniques. This research also contributes to cancer studies by developing a simulator that models single-nucleotide variants (SNVs) and copy number aberrations (CNAs) for cancer phylogeny inference. By integrating insights from protein sequence analysis and cancer evolution modeling, this work advances our understanding of biological systems and enhances methodologies in bioinformatics and disease control.
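As an illustration of what a fixed-length numerical representation means here, the following minimal sketch builds a k-mer count vector for a protein sequence. It is not the dissertation's implementation: the function name `kmer_spectrum`, the 20-letter amino acid alphabet, the choice of k, and the example sequences are all assumptions made for this example.

```python
from itertools import product

# Assumption: the 20 standard amino acids; the dissertation may use another alphabet.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_spectrum(sequence, k=3, alphabet=AMINO_ACIDS):
    """Count every length-k substring of `sequence` (hypothetical helper).

    The result always has len(alphabet) ** k entries, so sequences of
    different lengths map to vectors of the same length.
    """
    # Assign each possible k-mer a fixed position in the output vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)
    # Slide a window of width k along the sequence and count each k-mer.
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:  # skip k-mers with non-standard symbols
            vector[position] += 1
    return vector


# Two illustrative sequences of different lengths (made up for this example).
short = kmer_spectrum("MFVFLVLLPLVSSQ", k=2)
longer = kmer_spectrum("MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNS", k=2)
print(len(short), len(longer))
```

Both vectors have the same length regardless of the input length, which is the property that lets standard clustering and classification models consume them directly.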
Recommended Citation
Tayebi, Zahra, "Clustering, Classification, and Simulation of Evolutionarily Related Sequences in the Context of Cancer and Viral Spread." Dissertation, Georgia State University, 2024.
https://scholarworks.gsu.edu/cs_diss/228