Author ORCID Identifier

0000-0001-6434-3297

Date of Award

Spring 5-6-2024

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Advisor

Murray Patterson

Second Advisor

Alex Zelikovsky

Third Advisor

Esra Akbas

Fourth Advisor

Zhisheng Yan

Fifth Advisor

Daniel Takabi

Abstract

Molecular sequence analysis is a fundamental process for elucidating the functions, structures, and behaviors inherent in biological sequences. It also extends to characterizing the associated organisms, such as viruses, which facilitates the development of preventive measures to limit their spread and influence. Because viruses can trigger epidemics with global ramifications, comprehensive sequence analysis is pivotal to understanding and managing their impact. The rapid expansion of bio-sequence data has outpaced traditional analytical techniques, such as the phylogenetic approach, because of their high computational cost. Consequently, clustering and classification have emerged as compelling alternatives, and machine learning (ML) and deep learning (DL) algorithms can implement them effectively. Although ML/DL models are known for their high analytical capabilities, they typically require inputs in either numerical or image form. Efficient and effective mechanisms are therefore needed to transform bio-sequences into ML/DL-compatible inputs, and this research aims to devise such techniques. To this end, this work puts forward fast, alignment-free feature-engineering-based approaches and image-based approaches that convert bio-sequences into numerical and image form, respectively. The feature-engineering-based methods PSSMFreq2Vec and PSSM2Vec combine k-mers with a position weight matrix (PWM) to be scalable, alignment-free, and compact, while Hashing2Vec combines hashing with k-mers to achieve high embedding-generation speed while remaining alignment-free.
Furthermore, two of the image-based approaches follow the underlying concept of Chaos Game Representation (CGR) to map sequences to images, while another uses a Bezier-function-based mapping of sequences into images; all of them aim to enable the application of sophisticated vision DL models to bio-sequences. The representations obtained from both the feature-engineering-based and the image-based methods are passed to ML/DL models to perform classification tasks, and the results show high predictive performance compared with the respective baseline models.
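As a rough illustration of two ideas named in the abstract, the sketch below hashes k-mers into a fixed-length count vector and computes textbook CGR coordinates for a nucleotide sequence. It is not the dissertation's PSSM2Vec, Hashing2Vec, or image-based implementation: the function names, the use of MD5 as the hash, the default `k` and `dim`, the A/C/G/T corner layout, and the grid size are all assumptions chosen for illustration.

```python
import hashlib

def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def hashed_kmer_vector(seq, k=3, dim=16):
    """Count k-mers into `dim` buckets chosen by a hash (alignment-free).

    MD5 is used only because it is deterministic across runs; the hash
    function, k, and dim are illustrative assumptions.
    """
    vec = [0] * dim
    for kmer in kmers(seq, k):
        digest = hashlib.md5(kmer.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] += 1
    return vec

# Classic CGR corner assignment for nucleotides (an assumption here;
# protein sequences would need a different corner scheme).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq):
    """Start at the square's centre and move halfway to each base's corner."""
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

def cgr_image(seq, size=8):
    """Bin CGR points into a size x size grid of counts (a tiny 'image')."""
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid
```

Either output, the count vector or the grid, could then be handed to a standard classifier or a vision model, which is the general workflow the abstract describes.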

DOI

https://doi.org/10.57709/36979457
