Avano is a powerful Persian speech-to-text service designed for multi-speaker transcription in a single audio session.
Avano uses the state-of-the-art vhdm/whisper-large-fa-v1 model, which is specifically fine-tuned for Persian speech recognition. The model achieves a Word Error Rate (WER) of 14.07% on clean Persian speech data.
- 🎯 Fine-tuned on high-quality Persian speech data
- 🚀 Based on OpenAI's Whisper Large V3 Turbo architecture
- 📊 14.07% Word Error Rate (WER)
- 💪 Optimized for Persian voice transcription
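For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that definition; `word_error_rate` is a hypothetical helper, and it does not reproduce the text normalization or evaluation setup behind the reported 14.07% figure, which the source does not describe.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 14.07% corresponds to roughly 14 word-level errors (substitutions, deletions, or insertions) per 100 reference words.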
- Python 3.10 or higher
- CUDA-compatible GPU (recommended)
- Docker and Docker Compose (optional)
- Clone the repository:
git clone https://github.com/ma14ch/avano.git
cd avano
- Start the service using Docker Compose:
docker-compose up --build
The service will be available at http://localhost:5016.
- Clone the repository:
git clone https://github.com/ma14ch/avano.git
cd avano
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Run the service:
python src/main.py
The service will be available at http://localhost:5016.
- The service automatically detects GPU availability
- Default port is 5016 (can be modified in main.py)
- Model files are stored in the models/ directory
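GPU detection of this kind is commonly done with a single PyTorch call. The sketch below shows one way it could look; it is an assumption about the approach, not a copy of the code in main.py, and `pick_device` is a hypothetical name.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # assumption: the service runs its models through PyTorch
    except ImportError:
        # Without PyTorch installed there is nothing to query, so fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Running `print(pick_device())` inside the virtual environment or container is a quick way to check which device a PyTorch-based service would see.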
Check if the API is running:
curl -X GET http://localhost:5016/
Send an audio file for transcription:
curl -X POST http://localhost:5016/api/inference/ \
-F "audio_file=@/path/to/your/audio/file.mp3" \
-F "num_speakers=2"
- audio_file: The audio file to transcribe (required)
- num_speakers: Number of speakers to identify (optional)
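The same request can be sent from Python. The sketch below is a minimal client that uses only the standard library and assumes nothing beyond the endpoint and form fields shown in the curl example; `build_multipart` and `transcribe` are hypothetical helper names, not part of Avano, and the 600-second timeout is an arbitrary choice.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

API_URL = "http://localhost:5016/api/inference/"


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns a (body, content_type_header) tuple.
    """
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            (f"--{boundary}\r\n"
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f"{value}\r\n").encode("utf-8")
        )
    parts.append(
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="{file_field}"; '
         f'filename="{filename}"\r\n'
         f"Content-Type: {file_type}\r\n\r\n").encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(audio_path, num_speakers=None, url=API_URL, timeout=600):
    """POST an audio file to the inference endpoint and return the parsed JSON."""
    path = Path(audio_path)
    # num_speakers is optional, so it is only sent when provided.
    fields = {} if num_speakers is None else {"num_speakers": num_speakers}
    body, content_type = build_multipart(
        fields, "audio_file", path.name, path.read_bytes()
    )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the service running, `transcribe("/path/to/your/audio/file.mp3", num_speakers=2)` mirrors the curl command above.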
Check if the models are loaded correctly:
curl -X GET http://localhost:5016/debug/models
The API returns a JSON response with transcribed segments:
{
  "segments": [
    {
      "speaker": "SPEAKER_0",
      "start": 0.5,
      "end": 5.2,
      "transcription": "متن تبدیلشده برای گوینده اول"
    },
    {
      "speaker": "SPEAKER_1",
      "start": 5.8,
      "end": 10.3,
      "transcription": "متن تبدیلشده برای گوینده دوم"
    }
  ]
}
- Optimized for clean audio quality
- Not designed for real-time streaming ASR
- May occasionally produce hallucinations (a common limitation in Whisper models)
- Performs best on standard Persian speech; accuracy may drop with heavy accents or dialects
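As an illustration of consuming the JSON response format shown earlier, the sketch below turns the `segments` list into a readable transcript and totals the speaking time per speaker. It assumes `start` and `end` are offsets in seconds, which the source does not state explicitly; `format_transcript` and `speaking_time` are hypothetical helpers, not part of the service.

```python
from collections import defaultdict


def format_transcript(response: dict) -> str:
    """Render the segments as one line per segment, ordered by start time."""
    lines = []
    for seg in sorted(response["segments"], key=lambda s: s["start"]):
        lines.append(
            f'[{seg["start"]:.1f}-{seg["end"]:.1f}] '
            f'{seg["speaker"]}: {seg["transcription"]}'
        )
    return "\n".join(lines)


def speaking_time(response: dict) -> dict:
    """Sum the segment durations for each speaker label."""
    totals = defaultdict(float)
    for seg in response["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)
```

Both functions take the parsed response dictionary, for example the result of `json.loads` applied to the API output.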
This project is licensed under the MIT License - see the LICENSE file for details.