This project investigates multimodal learning for aerial scene recognition by jointly leveraging RGB aerial imagery and environmental audio. A cross-attention–based fusion model is proposed and evaluated on the ADVANCE dataset.
- Task: Aerial scene classification (13 classes)
- Modalities: Vision + Audio
- Dataset: ADVANCE (5,075 paired samples)
- Objective: Compare unimodal baselines against multimodal fusion
- Vision encoder: CLIP RN50 (frozen) with CBAM attention
- Audio encoder: ResNet-18 (frozen) with SENet attention
- Fusion: Cross-Attention Block (CAB) enabling bidirectional audio–visual refinement
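The fusion stage described above can be illustrated with a minimal PyTorch sketch. This is an assumed reference implementation, not the project's exact code: the token counts, embedding dimension (512), and head count are placeholders, and the residual-plus-pool design is one common way to realize bidirectional audio–visual refinement.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Hypothetical sketch of a bidirectional Cross-Attention Block (CAB).

    Visual tokens attend to audio tokens and vice versa; the refined
    streams are pooled and concatenated for 13-way classification.
    Dimensions and layer choices are illustrative assumptions.
    """

    def __init__(self, dim=512, heads=8, num_classes=13):
        super().__init__()
        # Audio -> visual refinement (visual queries, audio keys/values)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Visual -> audio refinement (audio queries, visual keys/values)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, vis, aud):
        # vis: (B, Nv, D) visual tokens; aud: (B, Na, D) audio tokens
        v_ref, _ = self.a2v(vis, aud, aud)
        a_ref, _ = self.v2a(aud, vis, vis)
        # Residual connection, then mean-pool each refined stream
        v = self.norm_v(vis + v_ref).mean(dim=1)
        a = self.norm_a(aud + a_ref).mean(dim=1)
        return self.classifier(torch.cat([v, a], dim=-1))

# Example: 7x7 visual feature map flattened to 49 tokens, 16 audio tokens
fused_logits = CrossAttentionBlock()(
    torch.randn(2, 49, 512), torch.randn(2, 16, 512))
```

With frozen encoders, only the CAB and classifier would receive gradients during fine-tuning.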
- Best unimodal (vision): 91.46% accuracy
- Best unimodal (audio): 79.00% accuracy
- Multimodal CAB fusion: 93.62% accuracy
Multimodal CAB fusion outperforms the best unimodal baseline by 2.16 percentage points (93.62% vs. 91.46% for vision alone).
Grad-CAM analysis shows improved spatial focus and more semantically meaningful activations in the multimodal model compared to unimodal baselines.
- Manav Jobanputra
- Pranav Khunt
Supervisor: Dr. Ankit Jha