Date of Award

12-2024

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Advisor

Murray Patterson

Abstract

Molecular sequence analysis is vital for understanding the functions, structures, and behaviors of organisms, including viruses, and plays a key role in disease prevention and control. To apply machine learning (ML) and deep learning (DL) models to biological data, protein sequences must be transformed into fixed-length numerical representations for efficient clustering and classification. This research aims to develop methods for extracting informative numerical features from protein sequences to enable comprehensive ML/DL analysis. We focus on the challenges of clustering and classifying protein sequences by converting them into numerical formats. Feature embedding techniques such as k-mers and one-hot encoding (OHE) capture patterns within sequences. We also explore advanced encoding methods such as low-dimensional representations, ViralVectors (based on minimizers), sparse coding, and PseAAC embedding, which account for variations in sequence length and complexity while preserving information crucial for accurate ML/DL analysis. To validate our approach, we experiment with diverse datasets, including SARS-CoV-2 virus sequences and T-cell receptor (TCR) sequences linked to cancer. We apply various ML/DL models for clustering and classification, comparing our results with existing methods to demonstrate the effectiveness of our techniques. This research also contributes to cancer studies by developing a simulator that models single-nucleotide variants (SNVs) and copy number aberrations (CNAs) for cancer phylogeny inference. By integrating insights from protein sequence analysis and cancer evolution modeling, this work advances our understanding of biological systems and enhances methodologies in bioinformatics and disease control.
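
For readers unfamiliar with the two baseline embeddings named in the abstract, the following is a minimal Python sketch of a k-mer count vector and a one-hot encoding for protein sequences. It is an illustration only, not the dissertation's implementation: the function names `kmer_spectrum` and `one_hot`, the 20-letter amino acid alphabet, the default `k=3`, and the choice to zero-pad or truncate to `max_len` are all assumptions made here so that sequences of different lengths map to vectors of the same size.

```python
from itertools import product

# Assumption: the 20 standard amino acids; other symbols are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_spectrum(seq, k=3, alphabet=AMINO_ACIDS):
    """Return a fixed-length vector of k-mer counts (length = len(alphabet)**k)."""
    # Map every possible k-mer over the alphabet to a position in the vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    # Slide a window of width k across the sequence and count each k-mer.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows containing non-standard residues
            vec[index[kmer]] += 1
    return vec


def one_hot(seq, max_len, alphabet=AMINO_ACIDS):
    """One-hot encode a sequence, zero-padded or truncated to max_len positions."""
    pos = {a: i for i, a in enumerate(alphabet)}
    vec = [0] * (max_len * len(alphabet))
    for i, ch in enumerate(seq[:max_len]):
        if ch in pos:  # unknown residues leave their block all zeros
            vec[i * len(alphabet) + pos[ch]] = 1
    return vec
```

Under these assumptions, `kmer_spectrum("MKVLA", k=2)` yields a vector of length 400 regardless of the input length, while `one_hot` needs an explicit `max_len` to produce a fixed size, which is one reason length-independent embeddings are attractive for downstream clustering and classification.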
