This repository explores the reproduction and improvement of the Tem-adapter architecture for Video Question Answering (VideoQA) on the SUTD-TrafficQA dataset. The project covers three steps: replicating the published results with the released checkpoints, training from scratch, and extending the architecture with a custom cross-attention layer.
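For readers unfamiliar with the term, the snippet below is a minimal sketch of generic scaled dot-product cross-attention, where one sequence (for example, question tokens) attends to another (for example, video frame features). It is not the custom layer from this repository: the function names `softmax` and `cross_attention` are hypothetical, and learned projections, multiple heads, and batching are omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: list of vectors from one sequence (e.g. question tokens)
    keys, values: lists of vectors from the other sequence (e.g. frames)
    Returns one output vector per query: a weighted average of `values`.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

When every key is identical, the attention weights are uniform and each output is the plain mean of the value vectors, which is a quick sanity check for an implementation.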
- Dataset: Download the SUTD-TrafficQA dataset and place it in the `data/` folder
- Released Checkpoint: Drive Link
- Reproduced Checkpoint: Drive Link
| Source | Validation Accuracy |
|---|---|
| Original (paper) | 46.00% |
| Reproduced (ckpt) | 46.00% |

✔️ The official checkpoint reproduces the published validation accuracy exactly.

| Metric | Value |
|---|---|
| Sum loss | 0.127 |
| Avg loss | 0.34 |
| CE loss | 33.28 |
| Recon loss | 0.0067 |
| Average Accuracy | 98.20% |
| Validation Accuracy | 45.37% |
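
The accuracy figures in the tables above can be compared with a few lines of arithmetic. This is only an illustrative sketch: `accuracy_gap` is a hypothetical helper, and it assumes that "Average Accuracy" in the last table refers to accuracy on the training set, which the table itself does not state.

```python
def accuracy_gap(a, b):
    """Difference between two accuracies, in percentage points."""
    return round(a - b, 2)


paper_val = 46.00      # validation accuracy reported in the paper
scratch_val = 45.37    # validation accuracy of the run in the last table
scratch_avg = 98.20    # "Average Accuracy" (assumed to be training accuracy)

# Shortfall of the run in the last table relative to the paper.
print(accuracy_gap(paper_val, scratch_val))    # 0.63

# Gap between average and validation accuracy within that run.
print(accuracy_gap(scratch_avg, scratch_val))  # 52.83
```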