Attentive Multi-modal Learning: Unifying Visual and Clinical Insights for Medical Image Classification and Report Generation
Nagur Shareef Shaik
Abstract
This work investigates the integration of diverse multi-modal data, spanning medical imaging, clinical text, and genetic information, through scalable attention networks and vision-language models to improve diagnostic accuracy and automate medical report generation. The core hypothesis is that attentive multi-modal learning can seamlessly unify visual and clinical insights, thereby enhancing the precision of disease diagnosis and facilitating efficient medical report generation. Focusing on neuroimaging and retinal imaging, this research introduces novel architectures and methodologies that leverage deep learning techniques to synthesize and analyze multi-modal data. We propose an innovative attention mechanism, Spatial Sequence Attention (SSA), to identify subtle brain morphological changes in structural MRIs (sMRIs) that are otherwise imperceptible through manual analysis, specifically targeting the diagnosis of complex cognitive impairments such as schizophrenia. Building on this, we extend the approach to integrate additional multi-modal data, incorporating Functional Network Connectivity (FNC) data and Single Nucleotide Polymorphism (SNP) genomic information. To achieve this, we introduce the Multi-modal Imaging Genomics Transformer (MIGTrans), which leverages attention-based mechanisms to uncover structural brain abnormalities, functional connectivity disruptions, and relevant genetic markers associated with schizophrenia, enhancing classification accuracy and interpretability. Finally, we extend this multi-modal framework to retinal imaging, introducing the Multi-Modal Medical Transformer (M3T) to generate clinically relevant medical reports. M3T efficiently integrates retinal images with diagnostic text, improving both the quality and coherence of the generated reports.
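As a rough illustration of the kind of attentive processing described above, the sketch below applies generic multi-head self-attention over spatial positions of 3D sMRI feature maps. It is not the SSA, MIGTrans, or M3T architecture, whose details are not given in this abstract; the class name, tensor shapes, and hyperparameters are assumptions for illustration only.

# Minimal sketch, assuming a PyTorch-style pipeline; not the author's actual SSA module.
import torch
import torch.nn as nn

class SpatialSequenceAttentionSketch(nn.Module):
    """Hypothetical block: attend over 3D CNN feature-map positions as a token sequence."""
    def __init__(self, channels: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (batch, channels, D, H, W) feature maps from a 3D CNN backbone (assumed shape)
        b, c, d, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)         # (batch, D*H*W, channels)
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention across spatial positions
        tokens = self.norm(tokens + attended)            # residual connection + layer norm
        return tokens.mean(dim=1)                        # pooled representation for classification

# Usage with random data standing in for sMRI features:
feats = torch.randn(2, 256, 6, 7, 6)
pooled = SpatialSequenceAttentionSketch()(feats)         # -> shape (2, 256)

In the same spirit, the multi-modal models named above (MIGTrans, M3T) could fuse such pooled image tokens with FNC, SNP, or text embeddings via cross-attention, but their concrete designs are described in the body of the work rather than here.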
