Author ORCID Identifier
0000-0001-6434-3297
Date of Award
Spring 5-6-2024
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
First Advisor
Murray Patterson
Second Advisor
Alex Zelikovsky
Third Advisor
Esra Akbas
Fourth Advisor
Zhisheng Yan
Fifth Advisor
Daniel Takabi
Abstract
Molecular sequence analysis is a fundamental process for elucidating the functions, structures, and behaviors of sequences. Its applications include characterizing associated organisms, such as viruses, which supports the development of preventive measures to limit their spread and impact. Because viruses can trigger epidemics with global ramifications, comprehensive sequence analysis is pivotal to understanding and managing their impact effectively. The rapid expansion of bio-sequence data has outpaced traditional analytical techniques, such as the phylogenetic approach, because of their high computational cost. Consequently, clustering and classification have emerged as compelling alternatives, and machine learning (ML) and deep learning (DL) algorithms can implement these methods effectively. Although ML/DL models are known for their high analytical capabilities, they typically require inputs in either numerical or image form. Efficient and effective mechanisms are therefore needed to transform bio-sequences into ML/DL-compatible inputs, and this research aims to devise such techniques. To this end, this work proposes fast, alignment-free feature-engineering-based approaches and image-based approaches that convert bio-sequences into numerical and image form, respectively. The feature-engineering-based methods PSSMFreq2Vec and PSSM2Vec combine the power of k-mers and the position weight matrix (PWM) to be scalable, alignment-free, and compact, while Hashing2Vec combines hashing and k-mers to achieve high embedding generation speed and to be alignment-free, respectively.
Furthermore, two of the image-based approaches follow the underlying concept of Chaos Game Representation (CGR) to map sequences to images, while a third uses a Bezier function-based mapping of sequences into images. These approaches aim to enable the application of sophisticated vision DL models to bio-sequences. The representations obtained from both the feature-engineering-based and the image-based methods are passed to ML/DL models to perform classification tasks, and the results show high predictive performance compared with the respective baseline models.
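As a rough illustration of the two families of representations named in the abstract, the sketch below shows (a) a generic hashed k-mer count vector and (b) a classical Chaos Game Representation count grid for nucleotide sequences. It is not the dissertation's Hashing2Vec or its CGR-based methods. The function names, the MD5-based bucket choice, the default `k`, `dim`, and `size` values, and the corner assignment for A, C, G, and T are all assumptions made for this example.

```python
import hashlib

def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def hashed_kmer_vector(seq, k=3, dim=64):
    """Alignment-free, fixed-length vector of hashed k-mer counts.

    Hypothetical helper: each k-mer is mapped to one of `dim` buckets with a
    stable hash (MD5 here, an assumption), and the bucket count is incremented.
    """
    vec = [0] * dim
    for kmer in kmers(seq, k):
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        idx = int.from_bytes(digest[:8], "big") % dim
        vec[idx] += 1
    return vec

# Assumed corner layout for the four nucleotides (one common convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_image(seq, size=32):
    """Classical CGR: move halfway toward each base's corner, count visits.

    Returns a size x size grid of visit counts; characters outside A/C/G/T
    are skipped.
    """
    img = [[0] * size for _ in range(size)]
    x, y = 0.5, 0.5  # start at the center of the unit square
    for base in seq:
        if base not in CORNERS:
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        img[row][col] += 1
    return img

if __name__ == "__main__":
    sequence = "ACGTACGTTGCA"
    print(hashed_kmer_vector(sequence, k=3, dim=16))
    print(sum(map(sum, cgr_image(sequence, size=8))))
```

Either output can then be fed to a standard classifier: the vector to a tabular ML model, the grid to a vision model. Both have a fixed size regardless of sequence length, which is what makes such representations alignment-free.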
DOI
https://doi.org/10.57709/36979457
Recommended Citation
Murad, Taslim, "Designing Methods for Representation Learning of Molecular Sequences and its Application in Analysis Tasks." Dissertation, Georgia State University, 2024.
doi: https://doi.org/10.57709/36979457