Turn YouTube videos into text transcripts with Voxtral AI
Scriptotic is a transcription tool that downloads YouTube videos, extracts the audio, and creates accurate text transcripts using Voxtral Mini 3B, one of the most accurate speech-to-text models available. It offers both a web interface (Playwright-compatible) and command-line access.
- YouTube Video Download: Automatically downloads audio from YouTube videos
- High-Quality Transcription: Uses Voxtral Mini 3B (about 14% lower word error rate than Whisper large-v3)
- Speaker Identification: Identifies different speakers and labels their dialogue
- Web Interface: Modern web interface accessible via browser and Playwright automation
- Command Line Support: CLI for automation and scripting
- Multilingual Support: Automatic language detection and superior multilingual accuracy
- Multiple Output Formats: Text, JSON, or SRT subtitle formats
- Windows 10/11 with WSL2 or Linux (64-bit)
- WSL2 with Ubuntu (for Windows users)
- NVIDIA GPU (RTX 20-series or newer, 12GB+ VRAM recommended)
- 16GB+ RAM (for optimal performance)
- 15GB+ free disk space (for Voxtral model and temporary files)
- Stable internet connection (for first-time setup and for downloading audio from YouTube)
- HuggingFace Account: Free account at huggingface.co for Voxtral model access
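The disk and GPU requirements above can be checked before running setup. The following is a minimal sketch, not part of Scriptotic: the function names are hypothetical, and the VRAM threshold uses the ~9.5GB model footprint quoted later in this README rather than the recommended 12GB. The VRAM parser expects the output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which prints one value in MiB per GPU.

```python
import shutil

GIB = 1024 ** 3


def free_disk_gib(path: str = ".") -> float:
    """Return the free disk space at `path` in GiB."""
    return shutil.disk_usage(path).free / GIB


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one total-memory value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def meets_requirements(
    free_gib: float,
    vram_mib: list[int],
    min_disk_gib: float = 15.0,    # "15GB+ free disk space"
    min_vram_mib: int = 9728,      # ~9.5GB model footprint (assumed threshold)
) -> bool:
    """Return True if there is enough disk space and at least one large-enough GPU."""
    has_disk = free_gib >= min_disk_gib
    has_vram = any(v >= min_vram_mib for v in vram_mib)
    return has_disk and has_vram
```

To use it, capture the `nvidia-smi` output (for example with `subprocess.run(..., capture_output=True, text=True)`) and pass it to `parse_vram_mib` together with `free_disk_gib()`.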
For Windows Users:
# Run in WSL2 Ubuntu terminal
bash scripts/setup_wsl2_voxtral.sh
For Linux Users:
# Run in your Linux terminal
bash scripts/setup_wsl2_voxtral.sh
For Windows Users (Recommended):
# Double-click this file in Windows Explorer:
start_scriptotic.bat
# Or run in Command Prompt:
start_scriptotic.bat
For WSL/Linux Users:
# Run in terminal:
bash start_scriptotic.sh
Open your browser and go to: http://localhost:5000
The web interface provides:
- ✅ Form-based transcription - Enter YouTube URL, speaker names, choose format
- ✅ Real-time progress tracking - Live updates as transcription proceeds
- ✅ Server status monitoring - Shows Voxtral server startup progress
- ✅ File downloads - Download completed transcripts
- ✅ Full Playwright compatibility - Perfect for browser automation
// Navigate to Scriptotic web interface
await page.goto('http://localhost:5000');
// Fill in the YouTube URL
await page.getByRole('textbox', { name: 'YouTube URL:' }).fill('https://www.youtube.com/watch?v=dQw4w9WgXcQ');
// Set speaker names
await page.getByRole('textbox', { name: 'Speaker Names:' }).fill('Rick, Audience');
// Choose JSON output format
await page.getByLabel('Output Format:').selectOption('JSON');
// Start transcription (when server is ready)
await page.getByRole('button', { name: 'Generate Transcript' }).click();
// Wait for completion and download result
await page.getByRole('button', { name: 'Download Transcript' }).click();
- Start the web server using start_scriptotic.bat (Windows) or start_scriptotic.sh (Linux)
- Open http://localhost:5000 in your browser
- First time: You'll be prompted to enter your HuggingFace token
- Enter the YouTube video URL
- Enter speaker names (comma-separated, e.g., "Alice, Bob, Charlie") - optional
- Choose output format:
- text: Human-readable transcript with speaker labels
- json: Structured data with timestamps
- srt: Subtitle file format
- Click "Generate Transcript"
- Wait for processing (progress bar shows status)
- Download the completed transcript
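To illustrate how the srt output format relates to timestamped segments, here is a minimal sketch of SRT rendering. It is not Scriptotic's own code: the segment keys (`start`, `end`, `speaker`, `text`) are hypothetical and may not match the fields in Scriptotic's JSON output. Only the SRT layout itself (a cue number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then the text) is standard.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments (hypothetical schema) as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        time_line = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{time_line}\n[{seg['speaker']}] {seg['text']}")
    # Cues are separated by a blank line.
    return "\n\n".join(cues) + "\n"


example = [
    {"start": 0.0, "end": 2.5, "speaker": "Alice", "text": "Hello everyone."},
    {"start": 2.5, "end": 5.0, "speaker": "Bob", "text": "Thanks for having me."},
]
print(to_srt(example))
```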
# Direct CLI usage (after starting the web server)
python src/core/scriptotic.py "https://www.youtube.com/watch?v=VIDEO_ID" --names "Alice,Bob,Charlie" --output "transcript.txt"
# With JSON output
python src/core/scriptotic.py "https://www.youtube.com/watch?v=VIDEO_ID" --format json --output "transcript.json"
- Go to huggingface.co and create a free account
- Go to Settings → Access Tokens
- Click "New token" and create a token with Read permissions
- Accept the license agreement for the Voxtral model:
- Save your token - you'll need it when you first run Scriptotic
The transcript will include:
- Video title and metadata
- Model information (Voxtral Mini 3B)
- Speaker-labeled dialogue:
[Alice] Hello everyone, welcome to today's discussion.
[Bob] Thanks for having me, Alice. I'm excited to talk about this topic.
[Alice] Let's start with the basics...
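If you want to post-process the text output, the speaker-labeled dialogue can be split into (speaker, text) pairs. This is a minimal sketch that assumes every turn starts with a `[Name]` label, as in the sample above; `parse_turns` is a hypothetical helper, not part of Scriptotic. Any other bracketed text inside a turn (for example `[laughs]`) would be misread as a speaker label.

```python
import re

# A "[Name]" label followed by text, up to the next label or the end of input.
SPEAKER_TURN = re.compile(r"\[([^\]]+)\]\s*(.*?)(?=\s*\[[^\]]+\]|\Z)", re.DOTALL)


def parse_turns(transcript: str) -> list[tuple[str, str]]:
    """Split a speaker-labeled transcript into (speaker, text) pairs."""
    return [
        (match.group(1), match.group(2).strip())
        for match in SPEAKER_TURN.finditer(transcript)
    ]


sample = (
    "[Alice] Hello everyone, welcome to today's discussion.\n"
    "[Bob] Thanks for having me, Alice.\n"
)
for speaker, text in parse_turns(sample):
    print(f"{speaker}: {text}")
```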
WSL2-Based Design:
- Web Server: Flask server running in WSL2
- Backend: vLLM server with Voxtral Mini 3B model
- Communication: HTTP API between web interface and transcription engine
- Access: Browser-based interface accessible from Windows and WSL2
- Web Interface: Modern HTML/JavaScript interface with real-time updates
- Audio Download: yt-dlp downloads audio from YouTube in WSL2 environment
- Server Management: Automatic Voxtral server startup and health monitoring
- Transcription: Voxtral Mini 3B transcribes audio via vLLM HTTP API
- Speaker Mapping: Optional speaker name assignment
- Output: Generates formatted transcript in chosen format
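The audio download step above can be sketched as follows. This is an illustration under stated assumptions, not Scriptotic's actual invocation: the exact yt-dlp options Scriptotic passes are not documented here, and the function names, WAV format, and `downloads` directory are hypothetical. The flags themselves (`--extract-audio`, `--audio-format`, `--output`) are standard yt-dlp options, and audio conversion requires ffmpeg.

```python
import subprocess
from pathlib import Path


def build_ytdlp_command(url: str, out_dir: str = "downloads") -> list[str]:
    """Build a yt-dlp argv that keeps only the audio track, converted to WAV."""
    # yt-dlp fills in %(id)s and %(ext)s from the video metadata.
    output_template = str(Path(out_dir) / "%(id)s.%(ext)s")
    return [
        "yt-dlp",
        "--extract-audio",
        "--audio-format", "wav",
        "--output", output_template,
        url,
    ]


def download_audio(url: str, out_dir: str = "downloads") -> None:
    """Run yt-dlp; raises CalledProcessError if the download fails."""
    subprocess.run(build_ytdlp_command(url, out_dir), check=True)
```

Building the argument list separately from running it keeps the command easy to inspect and log before anything is downloaded.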
- Engine: Voxtral Mini 3B (mistralai/Voxtral-Mini-3B-2507)
- Accuracy: 5.1% WER (14% better than Whisper large-v3's 5.9% WER)
- Languages: Superior multilingual support with automatic detection
- Requirements: ~9.5GB VRAM, fits perfectly on RTX 4080 12GB
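The "14% better" figure is a relative reduction in word error rate, not a difference in percentage points. A short check using the two WER values quoted above (the function name is only for illustration):

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Whisper large-v3 at 5.9% WER versus Voxtral Mini 3B at 5.1% WER.
reduction = relative_wer_reduction(5.9, 5.1)
print(f"{reduction:.1%}")  # 13.6%, which rounds to the quoted 14%
```

The absolute difference is 0.8 percentage points; dividing by the 5.9% baseline gives the relative figure.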
- All transcription runs locally - no audio or transcripts are sent to external servers
- Internet access is needed only for model downloads and for fetching audio from YouTube
- Your HuggingFace token is stored locally in WSL2 environment
- Audio files are automatically cleaned up after processing
"Web server won't start"
- Make sure WSL2 is installed and Ubuntu distribution is available
- Run the setup script:
bash scripts/setup_wsl2_voxtral.sh
- Check that all dependencies are installed in the virtual environment
"Can't access http://localhost:5000"
- Ensure the web server is running (check terminal output)
- Try http://127.0.0.1:5000 instead
- Check Windows firewall settings for port 5000
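A quick way to tell "the server is not running" apart from "the browser cannot reach it" is to test the port directly. This is a minimal sketch using only the Python standard library; `port_is_open` is a hypothetical helper, and the default host and port come from the URLs above.

```python
import socket


def port_is_open(host: str = "127.0.0.1", port: int = 5000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    status = "reachable" if port_is_open() else "not reachable"
    print(f"Port 5000 on 127.0.0.1 is {status}")
```

If the port is reachable from WSL2 but not from Windows, the firewall or WSL2 networking is the more likely cause.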
"Server failed to start"
- Check that your NVIDIA driver is version 555 or newer
- Ensure you have enough disk space (~15GB free)
- Check the server logs:
cat wsl2/voxtral.log
"HuggingFace token invalid"
- Make sure you accepted the license agreement for the Voxtral model
- Generate a new token with Read permissions
- The token should start with hf_
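The prefix check can be automated before the token is handed to the server. This is a minimal sketch; `looks_like_hf_token` is a hypothetical helper that validates only the shape of the token, not whether HuggingFace accepts it.

```python
def looks_like_hf_token(token: str) -> bool:
    """Return True if the token has the hf_ prefix and some content after it."""
    token = token.strip()  # tolerate whitespace picked up by copy and paste
    return token.startswith("hf_") and len(token) > len("hf_")
```

A token that passes this check can still be rejected if it lacks Read permissions or the model license has not been accepted.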
Very slow transcription
- This is normal - Voxtral prioritizes accuracy over speed
- A 1-hour video typically takes 10-15 minutes to process
- The accuracy improvement (14% better than Whisper) is worth the wait
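For planning purposes, the 10-15 minutes per hour figure can be turned into a rough estimate for other video lengths. This sketch assumes processing time scales linearly with video length, which this README does not state; the function name is hypothetical, and actual times depend on hardware.

```python
def estimate_processing_minutes(video_minutes: float) -> tuple[float, float]:
    """Return a (low, high) estimate based on 10-15 minutes per hour of video."""
    low = video_minutes * 10 / 60
    high = video_minutes * 15 / 60
    return low, high


low, high = estimate_processing_minutes(90)
print(f"A 90-minute video: roughly {low:.1f} to {high:.1f} minutes")
```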
WSL2 using too much memory (Windows)
- WSL2 may keep 8-12GB allocated even after servers stop
- Solution: Run cleanup_memory.bat to free WSL2 memory
- Or: Close the server terminal window (automatically runs cleanup)
- This frees the vmmemWSL process you see in Task Manager
If you encounter issues:
- Check the web interface status indicator
- Look at server logs: cat wsl2/voxtral.log
- Verify WSL2 is working: run wsl -d Ubuntu ls in Command Prompt
- Make sure all setup steps were completed
- Try with a shorter video first
Main Launchers (Choose one):
- start_scriptotic.bat - Windows launcher with visible server console
- start_scriptotic.sh - Linux/WSL launcher with visible server console
Utility Scripts:
- cleanup_memory.bat - Windows utility to free WSL2 memory (use if vmmemWSL is using too much RAM)
Core Components:
- src/core/web_server.py - Flask web server
- src/core/templates/index.html - Web interface
- src/core/scriptotic.py - Core transcription logic and CLI
- src/core/voxtral_engine.py - Voxtral transcription engine
Setup:
- scripts/setup_wsl2_voxtral.sh - One-time environment setup
- Current Version: 2.0.0
- Status: Production ready - Web interface with Playwright compatibility
- Architecture: Flask web server + WSL2 vLLM backend
This project is open source. See individual model licenses for AI models used.
Found a bug or want to improve Scriptotic? Please open an issue or submit a pull request!
Made with ❤️ by brinedew