Designing Methods for Representation Learning of Molecular Sequences and its Application in Analysis Tasks
Murad, Taslim
Abstract
Molecular sequence analysis is a fundamental process for elucidating the functions, structures, and behaviors of sequences. It also helps characterize the associated organisms, such as viruses, which supports the development of preventive measures to limit their spread and impact. Because viruses can trigger epidemics with global ramifications, comprehensive sequence analysis is pivotal to understanding and managing them effectively. The rapid growth of bio-sequence data has outpaced traditional analytical techniques, such as the phylogenetic approach, because of their high computational cost. Consequently, clustering and classification have emerged as compelling alternatives, and machine learning (ML) and deep learning (DL) algorithms can implement them effectively. Although ML/DL models are known for their high analytical capabilities, they typically require inputs in either numerical or image form. Efficient and effective mechanisms are therefore needed to transform bio-sequences into ML/DL-compatible inputs, and this research aims to devise such techniques. To that end, this work proposes fast, alignment-free feature-engineering-based approaches that convert bio-sequences into numerical form, and image-based approaches that convert them into images. The feature-engineering-based methods PSSMFreq2Vec and PSSM2Vec combine k-mers with the position weight matrix (PWM) to be scalable, alignment-free, and compact, while Hashing2Vec combines hashing with k-mers to generate embeddings quickly while remaining alignment-free.
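As an illustration of how hashing and k-mers can be combined into a fixed-length, alignment-free embedding, the following is a minimal sketch of generic feature hashing over k-mers. It is not the Hashing2Vec method itself, whose details are not given in this abstract; the function names, the choice of MD5 as the hash, and the parameters `k` and `dim` are all assumptions made for the example.

```python
import hashlib


def kmers(seq, k):
    """Return all overlapping substrings of length k (hypothetical helper)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hashed_kmer_vector(seq, k=3, dim=16):
    """Count k-mers into a fixed number of buckets chosen by a hash.

    Assumption: MD5 is used only as a stable, deterministic hash;
    any well-mixed hash function would serve the same purpose.
    """
    vec = [0] * dim
    for kmer in kmers(seq, k):
        digest = hashlib.md5(kmer.encode("ascii")).digest()
        bucket = int.from_bytes(digest[:8], "big") % dim
        vec[bucket] += 1
    return vec


embedding = hashed_kmer_vector("ACGTACGT", k=3, dim=16)
```

The vector length depends only on `dim`, not on the sequence length or alphabet size, so sequences of different lengths map to comparable inputs without alignment. Distinct k-mers may collide in the same bucket, which is the usual trade-off of feature hashing.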
Furthermore, two of the image-based approaches build on Chaos Game Representation (CGR) to map sequences to images, while a third uses a Bézier function-based mapping; all three aim to enable the application of sophisticated vision DL models to bio-sequences. The representations obtained from both the feature-engineering-based and the image-based methods are passed to ML/DL models for classification, and the results show higher predictive performance than the respective baseline models.
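For readers unfamiliar with CGR, the following is a minimal sketch of the classical construction for nucleotide sequences: starting from the center of the unit square, each base moves the current point halfway toward that base's corner, and the visited points are binned into a pixel grid. This shows only the underlying concept, not the thesis's own CGR-based methods; the corner assignment, the `size` parameter, and the function names are assumptions made for the example.

```python
# Assumed corner layout for the four nucleotides on the unit square.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq):
    """Return the CGR trajectory of a nucleotide sequence."""
    x, y = 0.5, 0.5  # start at the center of the square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_image(seq, size=8):
    """Bin the CGR points into a size x size grid of counts."""
    img = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        img[row][col] += 1
    return img


image = cgr_image("ACGTTGCA", size=8)
```

The resulting grid can be treated as a single-channel image and passed to a vision model. Protein sequences need a different corner or mapping scheme, since the four-corner layout assumed here covers only the nucleotide alphabet.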