
Clustering, Classification, and Simulation of Evolutionarily Related Sequences in the Context of Cancer and Viral Spread

Tayebi, Zahra
Abstract

Molecular sequence analysis is vital for understanding the functions, structures, and behaviors of organisms, including viruses, and plays a key role in disease prevention and control. To apply machine learning (ML) and deep learning (DL) models to biological data, protein sequences must first be transformed into fixed-length numerical representations suitable for clustering and classification. This research develops methods for extracting informative numerical features from protein sequences, addressing the challenges of clustering and classifying them with ML/DL models. Feature embedding techniques such as k-mers and one-hot encoding (OHE) capture patterns within sequences. We also explore advanced encoding methods, including low-dimensional representations, ViralVectors (based on minimizers), sparse coding, and PseAAC embedding, which account for variations in sequence length and complexity while preserving the information needed for accurate ML/DL analysis. To validate our approach, we experiment with diverse datasets, including SARS-CoV-2 virus sequences and T-cell receptor (TCR) sequences linked to cancer. We apply various ML/DL models for clustering and classification and compare our results with existing methods to demonstrate the effectiveness of our techniques. This research also contributes to cancer studies by developing a simulator that models single-nucleotide variants (SNVs) and copy number aberrations (CNAs) for cancer phylogeny inference. By integrating insights from protein sequence analysis and cancer evolution modeling, this work advances our understanding of biological systems and enhances methodologies in bioinformatics and disease control.
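
As a rough illustration of what a fixed-length numerical representation looks like, the sketch below builds a k-mer count vector and a one-hot encoding for a protein sequence. It is not taken from the thesis: the function names `kmer_spectrum` and `one_hot`, the 20-letter amino acid alphabet, the choice to skip non-standard symbols, and the pad-or-truncate handling of sequence length are all assumptions made for this example.

```python
from itertools import product

# Assumed alphabet: the 20 standard amino acids (hypothetical choice for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_spectrum(seq, k=3, alphabet=AMINO_ACIDS):
    """Return a count vector of length len(alphabet)**k, one slot per possible k-mer."""
    # Map every possible k-mer to a fixed position so all sequences share one layout.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # k-mers containing symbols outside the alphabet are skipped
            vec[j] += 1
    return vec


def one_hot(seq, length, alphabet=AMINO_ACIDS):
    """Return a flattened one-hot encoding, truncated or zero-padded to `length` positions."""
    pos = {a: i for i, a in enumerate(alphabet)}
    width = len(alphabet)
    vec = [0] * (length * width)
    for i, ch in enumerate(seq[:length]):
        if ch in pos:  # unknown symbols leave their block all zeros
            vec[i * width + pos[ch]] = 1
    return vec


if __name__ == "__main__":
    spectrum = kmer_spectrum("ACAC", k=2)
    print(len(spectrum), sum(spectrum))   # 400 possible 2-mers, 3 observed
    print(len(one_hot("AC", length=3)))   # 3 positions x 20 symbols
```

The k-mer vector has the same length for every input regardless of sequence length, which is what makes it usable by standard clustering and classification models; the one-hot encoding reaches a fixed length only through padding or truncation, one of the limitations that motivates the more compact encodings mentioned in the abstract.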

Date
2024-12-01
Keywords
Biological sequences, Protein sequence analysis, Embedding generation, Clustering, Classification, Machine learning, Deep learning models