Thank you for your excellent work and for open-sourcing this project! 🙌
I'm trying to reproduce the audio zero-shot classification results and was wondering if you could share a few quick details:
- The prompt template used (e.g., "This is a sound of {}")
- The exact class label list (and how they were formatted)
- Any audio/text preprocessing steps (e.g., resampling, clip length, text normalization)
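
To make the request concrete, here is a minimal sketch of where these three pieces would enter a typical zero-shot evaluation. Everything in it is a placeholder and not the project's actual setup: the label list, the `format_label` rule, and the toy vectors standing in for encoder outputs are assumptions for illustration only.

```python
import math

# Hypothetical placeholders: the real template and label list are what this issue asks for.
TEMPLATE = "This is a sound of {}"
RAW_LABELS = ["dog_bark", "Car Horn", "rain"]


def format_label(label):
    """One possible formatting choice (an assumption): lowercase, underscores to spaces."""
    return label.replace("_", " ").strip().lower()


def build_prompts(labels, template=TEMPLATE):
    """Fill the template once per class label."""
    return [template.format(format_label(label)) for label in labels]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_predict(audio_emb, text_embs):
    """Return the index of the prompt embedding most similar to the audio embedding."""
    scores = [cosine(audio_emb, t) for t in text_embs]
    return max(range(len(scores)), key=scores.__getitem__)


prompts = build_prompts(RAW_LABELS)
# Toy vectors standing in for real text/audio encoder outputs.
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
audio_emb = [0.1, 0.9, 0.2]
predicted = zero_shot_predict(audio_emb, text_embs)
print(prompts[predicted])
```

Small changes to the template or to label formatting change the text embeddings and can shift accuracy, which is why the exact strings matter for reproduction.
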
If there's a config snippet or eval script lying around, that'd be awesome too! 😄
No worries if it's not handy—just thought I'd ask. Really appreciate your help!