Hi, thank you for your great work on GRAM!
I’m currently trying to reproduce the evaluation results reported in the paper using the provided pretrained models. However, I encountered some issues regarding the datasets used for evaluation. I would appreciate it if you could provide some clarification:
- Datasets for Evaluation
According to the paper and repository, GRAM is evaluated on datasets such as:
- MSR-VTT
- DiDeMo
- ActivityNet
- VATEX
Could you kindly share:
- Where can I download the processed versions of these datasets (or any instructions for processing the raw versions)?
- Are there any specific preprocessing scripts you used to prepare the evaluation data so that it is compatible with your code?
- Is there a recommended directory structure for these datasets?
- Audio Files Requirement
In the config JSON files, I see that both `vision_path` and `audio_path` are required for evaluation.
- Are the audio features provided as part of the official datasets? If not, could you provide instructions or scripts to extract them?
- For example, how should the `audio_path` directory be structured for the MSR-VTT or DiDeMo dataset?
- What audio format and feature type (e.g., wav2vec2 or VGGish) does the model expect?
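To make the question concrete, this is roughly what I'm doing right now to obtain audio files. It is only a minimal sketch of my own attempt, not something taken from your repo: the `data/msrvtt/...` paths, the 16 kHz mono WAV target, and the use of raw `ffmpeg` are all my assumptions. Please let me know if the model instead expects precomputed features rather than raw waveforms.

```python
# My current best guess at audio extraction -- NOT taken from the GRAM repo.
# Assumes raw .mp4 videos and ffmpeg available on PATH; all paths are hypothetical.
import subprocess
from pathlib import Path

VIDEO_DIR = Path("data/msrvtt/videos")  # hypothetical location of the raw videos
AUDIO_DIR = Path("data/msrvtt/audio")   # hypothetical target for audio_path
AUDIO_DIR.mkdir(parents=True, exist_ok=True)

for video in sorted(VIDEO_DIR.glob("*.mp4")):
    wav = AUDIO_DIR / f"{video.stem}.wav"
    # Extract a 16 kHz mono WAV track; the sample rate is a guess on my part.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1", "-ar", "16000", str(wav)],
        check=True,
    )
```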
- Recommended Folder Structure
To avoid misconfiguration, could you also share an example of a valid folder structure for one of the datasets (e.g., MSR-VTT), showing where the video, audio, and metadata (e.g., captions or annotations) should be located?
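For reference, this is the layout I am currently assuming for MSR-VTT. The directory names and comments below are purely my own guesses and do not come from the repo or the paper; I would be happy to adjust to whatever structure your code actually expects.

```
MSRVTT/
├── videos/        # raw .mp4 clips        -> vision_path? (my guess)
├── audio/         # extracted .wav files  -> audio_path?  (my guess)
└── annotations/   # captions / retrieval split files (JSON or CSV?)
```

If the real layout differs (for example, precomputed features instead of raw files), even a short corrected tree like this in the README would be very helpful.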
Looking forward to your guidance!