

Welcome to My GitHub Profile 👋

Multimodal Fusion Learning with Dual Attention for Medical Imaging

Multimodal fusion learning has shown significant promise in classifying various diseases such as skin cancer and brain tumors. However, existing methods face three key limitations:

  1. Lack of Generalizability: Existing methods often fail to generalize across diagnosis tasks due to their focus on a specific disease.
  2. Limited Use of Diverse Modalities: They do not fully leverage multiple health records from diverse modalities to learn robust complementary information.
  3. Single Attention Mechanism: Relying on a single attention mechanism misses the benefits of combining multiple attention strategies within and across various modalities.

Our Proposed Approach: DRIFA

To address these challenges, we propose a Dual Robust Information Fusion Attention mechanism (DRIFA).

Key Features of DRIFA:

DRIFA can be integrated with any deep neural network, forming a multimodal fusion learning framework known as DRIFA-Net.
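To illustrate what "integrated with any deep neural network" could look like in practice, here is a minimal sketch of attaching an attention module to an off-the-shelf backbone. The `AttentionModule` and `DRIFAWrapper` names, the ResNet-18 backbone, and the simple channel gate are illustrative assumptions, not the published DRIFA-Net implementation.

```python
# Hypothetical sketch: wrapping an arbitrary CNN backbone with a DRIFA-style
# attention module. Class names and layer choices are illustrative only.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class AttentionModule(nn.Module):
    """Placeholder attention block; DRIFA's MFA/MIFA modules would go here."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)  # channel-wise reweighting of the feature map


class DRIFAWrapper(nn.Module):
    """Attach an attention module and a classification head to any feature extractor."""

    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.attention = AttentionModule(channels=512)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.attention(self.features(x)))


model = DRIFAWrapper(num_classes=7)
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 7])
```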

Performance Highlights:

Technologies and Applications:

Figure 1. Detailed architecture of DRIFA-Net. Key components include: (A) the target-specific multimodal fusion learning (TMFL) phase, followed by (B) an uncertainty quantification (UQ) phase. The TMFL phase comprises a robust residual attention (RRA) block, shown in (C), and utilizes multi-branch fusion attention (MFA), an additional MFA module for further refinement of local representations, a multimodal information fusion attention (MIFA) module for improved multimodal representation learning, and multitask learning (MTL) for handling multiple classification tasks. During the UQ phase, the reliability of DRIFA-Net's predictions is assessed.
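As a rough illustration of the UQ idea in (B), the sketch below shows a generic Monte Carlo dropout recipe for estimating prediction reliability. This is a common uncertainty quantification approach and is not claimed to match DRIFA-Net's exact UQ scheme; the toy classifier at the end is a placeholder.

```python
# Illustrative sketch of Monte Carlo dropout for uncertainty quantification.
import torch
import torch.nn as nn


def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Run several stochastic forward passes with dropout active and
    return the mean class probabilities and their per-class standard deviation."""
    model.eval()
    # Re-enable dropout layers only, keeping e.g. batch norm in eval mode.
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0), probs.std(dim=0)


# Toy usage with a hypothetical classifier containing dropout.
net = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 4)
)
mean_p, std_p = mc_dropout_predict(net, torch.randn(2, 3, 32, 32))
print(mean_p.shape, std_p.shape)  # torch.Size([2, 4]) torch.Size([2, 4])
```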


Figure 2. (a) Multi-branch fusion attention (MFA) module. Key components include hierarchical information fusion attention (HIFA) for diverse local information enhancement and channel-wise local information attention (CLIA) for improved channel-specific representation learning.
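For readers unfamiliar with channel-wise attention in general, here is a minimal squeeze-and-excitation-style sketch. The `ChannelAttention` class, its layer sizes, and the reduction ratio are assumptions for illustration, not the published CLIA design.

```python
# Minimal sketch of a channel-wise attention block (squeeze-and-excitation style).
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pooling
        return x * weights.view(b, c, 1, 1)     # excite: per-channel reweighting


x = torch.randn(2, 64, 28, 28)
print(ChannelAttention(64)(x).shape)  # torch.Size([2, 64, 28, 28])
```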


Figure 3. (a) Multimodal information fusion attention (MIFA) module. This module includes multimodal global information fusion attention (MGIFA) (shown in b) and multimodal local information fusion attention (MLIFA) (shown in c).
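To give a flavor of cross-modal fusion, the sketch below blends two modality embeddings with a learned gate. The `GatedModalityFusion` class, the gating scheme, and the MRI/CT feature names are illustrative assumptions, not the published MIFA architecture.

```python
# Illustrative sketch of gated fusion of two modality embeddings.
import torch
import torch.nn as nn


class GatedModalityFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Learn a per-dimension gate from both modalities, then blend them.
        g = self.gate(torch.cat([feat_a, feat_b], dim=-1))
        return g * feat_a + (1.0 - g) * feat_b


mri, ct = torch.randn(4, 256), torch.randn(4, 256)   # hypothetical modality features
print(GatedModalityFusion(256)(mri, ct).shape)        # torch.Size([4, 256])
```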


Figure 4. Visual representation of the important regions highlighted by our proposed DRIFA-Net and four SOTA methods using the Grad-CAM technique on two benchmark datasets, D1 and D3. (a) and (g) display the original images, while (b) and (h) present results for Gloria, (c) and (i) for MTF with MA, (d) and (j) for CAF, (e) and (k) for MTTU-Net, and (f) and (l) for our proposed DRIFA-Net.
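For reference, here is a generic Grad-CAM recipe implemented manually with a forward hook and autograd. It illustrates the visualization technique in general; the ResNet-18 model and random input are placeholders, and this is not the code used to produce Figure 4.

```python
# Minimal, generic Grad-CAM sketch: weight a convolutional layer's activations
# by the gradient of the target class score, then combine channels into a
# coarse saliency map and upsample it to the input resolution.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
feats = {}
model.layer4.register_forward_hook(lambda mod, inp, out: feats.update(a=out))

x = torch.randn(1, 3, 224, 224)                    # placeholder image tensor
score = model(x)[0].max()                          # score of the top predicted class
grads = torch.autograd.grad(score, feats["a"])[0]  # d(score) / d(activations)

weights = grads.mean(dim=(2, 3), keepdim=True)                 # per-channel importance
cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))  # weighted combination
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
print(cam.shape)  # torch.Size([1, 1, 224, 224])
```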


Figure 5. t-SNE visualization of different models applied to the dermoscopy images of the D1 dataset, where (a) represents the t-SNE visualization of Gloria, (b) of MTTU-Net, and (c) of our proposed DRIFA-Net.
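Plots like these can be reproduced in spirit with a standard scikit-learn t-SNE recipe, as sketched below. The features and labels here are synthetic placeholders, not the D1 dermoscopy embeddings.

```python
# Generic t-SNE visualization recipe (illustrative only).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.randn(500, 256)          # stand-in for learned embeddings
labels = np.random.randint(0, 7, size=500)    # stand-in for class labels

# Project the high-dimensional features to 2D for inspection.
embedded = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=8, cmap="tab10")
plt.title("t-SNE of learned features (illustrative)")
plt.savefig("tsne.png", dpi=150)
```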

Citation:

If you find this work useful, please cite:

@inproceedings{dhar2025multimodal,
  title={Multimodal Fusion Learning with Dual Attention for Medical Imaging},
  author={Dhar, Joy and Zaidi, N. and Haghighat, M. and Goyal, P. and Roy, S. and Alavi, A. and Kumar, V.},
  booktitle={IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2025},
  url={https://arxiv.org/abs/2412.01248}
}