
Attentive Multi-modal Learning: Unifying Visual and Clinical Insights for Medical Image Classification and Report Generation

Nagur Shareef Shaik
Abstract

This work investigates the integration of diverse multi-modal data, spanning medical imaging, clinical text, and genetic information, through scalable attention networks and vision-language models to improve diagnostic accuracy and automate medical report generation. The core hypothesis is that attentive multi-modal learning can seamlessly unify visual and clinical insights, thereby enhancing the precision of disease diagnosis and facilitating efficient medical report generation. Focusing on neuroimaging and retinal imaging, this research introduces novel architectures and methodologies that leverage deep learning techniques to synthesize and analyze multi-modal data. We propose Spatial Sequence Attention (SSA), an attention mechanism that identifies subtle brain morphological changes in structural MRIs (sMRIs) that are otherwise imperceptible through manual analysis, specifically targeting the diagnosis of complex cognitive impairments such as schizophrenia. Building on this, we extend the approach to additional modalities, incorporating Functional Network Connectivity (FNC) data and Single Nucleotide Polymorphism (SNP) genomic information. To achieve this, we introduce the Multi-modal Imaging Genomics Transformer (MIGTrans), which leverages attention-based mechanisms to uncover structural brain abnormalities, functional connectivity disruptions, and relevant genetic markers associated with schizophrenia, improving classification accuracy and interpretability. Finally, we extend this multi-modal framework to retinal imaging, introducing the Multi-Modal Medical Transformer (M3T) to generate clinically relevant medical reports. M3T efficiently integrates retinal images with diagnostic text, improving both the quality and coherence of the generated reports.
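
The abstract describes attention-based fusion of imaging and clinical-text modalities in general terms only; the following minimal sketch illustrates what such cross-modal attention fusion can look like in practice. It is an illustrative assumption, not the thesis's SSA, MIGTrans, or M3T architecture, and all module names, dimensions, and the two-class output are hypothetical.

# Illustrative sketch: generic cross-attention fusion of imaging features
# with clinical-text features, in the spirit of attention-based multi-modal
# learning. Not the thesis's actual models; all names/sizes are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, heads=8, num_classes=2):
        super().__init__()
        # Image tokens act as queries over clinical-text tokens (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, num_classes)  # e.g., patient vs. control

    def forward(self, img_tokens, text_tokens):
        # img_tokens:  (batch, n_image_patches, dim)
        # text_tokens: (batch, n_text_tokens, dim)
        fused, _ = self.cross_attn(img_tokens, text_tokens, text_tokens)
        fused = self.norm(fused + img_tokens)   # residual connection
        pooled = fused.mean(dim=1)              # simple mean pooling over patches
        return self.classifier(pooled)

# Usage with random tensors standing in for encoder outputs.
model = CrossModalFusion()
img = torch.randn(4, 196, 256)
txt = torch.randn(4, 32, 256)
logits = model(img, txt)  # shape: (4, 2)

In practice, the image and text tensors would come from modality-specific encoders (e.g., a vision backbone and a clinical-text encoder), and the same fusion pattern generalizes to additional modalities such as connectivity or genomic features.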

Date
2025-04-04
Keywords
Multi-modal Learning, Attention Networks, Vision-Language Models (VLM), Schizophrenia, Medical Report Generation
Citation
Nagur Shareef Shaik (2025). "Attentive Multi-modal Learning: Unifying Visual and Clinical Insights for Medical Image Classification and Report Generation." Thesis, Georgia State University. https://doi.org/10.57709/tw0z-py92
Embargo Lift Date
2025-04-04