Author ORCID Identifier

0000-0001-8121-2168

Date of Award

Fall 12-6-2024

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Advisor

Murray Patterson

Second Advisor

Alexander Zelikovsky

Third Advisor

Esra Akbas

Fourth Advisor

Jose Bento

Abstract

Advances in technology and reductions in the cost of molecular sequencing have made an unprecedented amount of sequence data available. The scale of this data has outpaced traditional methods of analysis, making machine-learning approaches to clustering and classification an attractive choice. Because sequencing data is high-dimensional, restricting the analysis to a much smaller region of interest is a viable option; for the SARS-CoV-2 genome, for example, focusing on the spike region can save a great deal of processing. Since the spike protein mediates the attachment of the coronavirus to the host cell, most of the newer and more contagious variants can be characterized by alterations to the spike protein; hence it is often sufficient for characterizing the different SARS-CoV-2 variants. Machine learning models have been applied to predict protein function, aiding in drug design, the understanding of protein interactions, and more. Machine learning techniques can also improve the accuracy and efficiency of sequence analysis algorithms, which are fundamental to comparing biological sequences and identifying similarities between them. Computationally efficient generation of feature embeddings is an area that needs more attention from researchers. Applying a machine learning (ML) model to a biological sequence requires first transforming the sequence into a fixed-length numerical form, which typically requires sequence alignment, a fundamental but computationally expensive step. While several compact embeddings exist, generating them is computationally expensive because the features added to the resulting vectors are indexed in a naive fashion. To solve this problem, we propose a fast, alignment-free, hashing-based approach to designing a fixed-length feature embedding for spike protein sequences, which can be used as input to any standard ML model.
Using real-world data, we show that the proposed embedding is not only efficient to compute but also outperforms current state-of-the-art embedding methods in terms of classification accuracy. We also propose kernel function-based methods to efficiently perform supervised analysis of biological sequences. However, since kernel-based methods are expensive in terms of storage cost, their scalability remains a major issue. To solve this problem, we propose an embedding-based solution that combines the quality of kernel methods with the power of representation learning to generate low-dimensional vectors, which achieve state-of-the-art performance in molecular sequence analysis. To test the robustness of ML models, we propose several ways of introducing biologically meaningful errors into SARS-CoV-2 genome sequences, reflecting the error profiles of modern next-generation sequencing (NGS) technologies such as Illumina and PacBio. We compare the different embedding techniques on different types of sequencing data, such as sequences with long-read-based errors and sequences with Illumina-based errors.
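
To make the idea of an alignment-free, hashing-based, fixed-length embedding concrete, the following is a minimal illustrative sketch and not the method developed in the dissertation. It assumes a simple scheme in which each k-mer of a sequence is hashed to one of `dim` buckets and the bucket counts form the feature vector. The function names, the choice of SHA-256, and the default values of `k` and `dim` are all assumptions made for illustration.

```python
import hashlib
from typing import List


def kmers(seq: str, k: int) -> List[str]:
    """Return all overlapping substrings of length k (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hashed_embedding(seq: str, k: int = 3, dim: int = 64) -> List[int]:
    """Map a sequence of any length to a length-`dim` count vector.

    Hypothetical scheme for illustration: each k-mer is hashed directly to a
    bucket index, so no alignment and no lookup table of all possible k-mers
    is needed.
    """
    vec = [0] * dim
    for kmer in kmers(seq, k):
        digest = hashlib.sha256(kmer.encode("ascii")).digest()
        # Use the first 8 bytes of the digest as an integer, reduced modulo dim.
        idx = int.from_bytes(digest[:8], "big") % dim
        vec[idx] += 1
    return vec


# Two toy amino-acid strings of different lengths (not real spike sequences).
short_seq = "MFVFLVLLPLVSSQ"
long_seq = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNS"

emb_short = hashed_embedding(short_seq)
emb_long = hashed_embedding(long_seq)
```

Both vectors have the same length regardless of the input length, which is what lets them be passed to a standard ML model. Distinct k-mers can collide in the same bucket; in this sketch that is the trade-off accepted for a compact vector and constant-time indexing.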
