YouTube to transcript tool with speaker diarization - Windows GUI/CLI app using WhisperX and AI speaker identification

Brinedew/Scriptotic

Scriptotic

Turn YouTube videos into text transcripts with Voxtral AI

Scriptotic is a transcription tool that downloads YouTube videos, extracts the audio, and creates text transcripts using Voxtral Mini 3B, a speech-to-text model with a 5.1% word error rate (WER). It offers both a web interface (compatible with Playwright automation) and command-line access.

Features

  • YouTube Video Download: Automatically downloads audio from YouTube videos
  • High-Quality Transcription: Uses Voxtral Mini 3B (about 14% lower word error rate than Whisper large-v3)
  • Speaker Identification: Identifies different speakers and labels their dialogue
  • Web Interface: Modern web interface accessible via browser and Playwright automation
  • Command Line Support: CLI for automation and scripting
  • Multilingual Support: Automatic language detection and superior multilingual accuracy
  • Multiple Output Formats: Text, JSON, or SRT subtitle formats

What You Need

System Requirements

  • Windows 10/11 with WSL2 and an Ubuntu distribution, or 64-bit Linux
  • NVIDIA GPU (RTX 20-series or newer, 12GB+ VRAM recommended)
  • 16GB+ RAM (for optimal performance)
  • 15GB+ free disk space (for Voxtral model and temporary files)
  • Stable internet connection (for first-time setup and for downloading audio from YouTube)

Accounts You'll Need

  • HuggingFace Account: Free account at huggingface.co for Voxtral model access

Quick Start

1. Setup (First Time Only)

For Windows Users:

# Run in WSL2 Ubuntu terminal
bash scripts/setup_wsl2_voxtral.sh

For Linux Users:

# Run in your Linux terminal
bash scripts/setup_wsl2_voxtral.sh

2. Start the Web Interface

For Windows Users (Recommended):

# Double-click this file in Windows Explorer:
start_scriptotic.bat

# Or run in Command Prompt:
start_scriptotic.bat

For WSL/Linux Users:

# Run in terminal:
bash start_scriptotic.sh

3. Access the Interface

Open your browser and go to: http://localhost:5000

The web interface provides:

  • Form-based transcription - Enter YouTube URL, speaker names, choose format
  • Real-time progress tracking - Live updates as transcription proceeds
  • Server status monitoring - Shows Voxtral server startup progress
  • File downloads - Download completed transcripts
  • Full Playwright compatibility - Perfect for browser automation

Playwright Automation Example

// Navigate to Scriptotic web interface
await page.goto('http://localhost:5000');

// Fill in the YouTube URL
await page.getByRole('textbox', { name: 'YouTube URL:' }).fill('https://www.youtube.com/watch?v=dQw4w9WgXcQ');

// Set speaker names
await page.getByRole('textbox', { name: 'Speaker Names:' }).fill('Rick, Audience');

// Choose JSON output format
await page.getByLabel('Output Format:').selectOption('JSON');

// Start transcription (when server is ready)
await page.getByRole('button', { name: 'Generate Transcript' }).click();

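// Wait for completion and download the result. Transcription can take several
// minutes, so allow a generous timeout (the 30-minute value is an assumption).
await page.getByRole('button', { name: 'Download Transcript' }).click({ timeout: 30 * 60 * 1000 });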

How to Use

Web Interface (Recommended)

  1. Start the web server using start_scriptotic.bat (Windows) or start_scriptotic.sh (Linux)
  2. Open http://localhost:5000 in your browser
  3. First time: You'll be prompted to enter your HuggingFace token
  4. Enter the YouTube video URL
  5. Enter speaker names (comma-separated, e.g., "Alice, Bob, Charlie") - optional
  6. Choose output format:
    • text: Human-readable transcript with speaker labels
    • json: Structured data with timestamps
    • srt: Subtitle file format
  7. Click "Generate Transcript"
  8. Wait for processing (progress bar shows status)
  9. Download the completed transcript
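
To make the srt option more concrete, here is a minimal sketch of how timestamped, speaker-labeled segments could be rendered as SRT subtitles. The `Segment` class and its fields are hypothetical and chosen only for illustration; Scriptotic's actual internal data structures and JSON schema may differ.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical transcript segment; the real schema may differ."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    speaker: str
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"[{seg.speaker}] {seg.text}\n"
        )
    return "\n".join(cues)
```

Calling `to_srt([Segment(0.0, 2.5, "Alice", "Hello everyone")])` yields a single cue numbered 1 that runs from 00:00:00,000 to 00:00:02,500.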

Command Line Usage

# Direct CLI usage (after starting the web server)
python src/core/scriptotic.py "https://www.youtube.com/watch?v=VIDEO_ID" --names "Alice,Bob,Charlie" --output "transcript.txt"

# With JSON output
python src/core/scriptotic.py "https://www.youtube.com/watch?v=VIDEO_ID" --format json --output "transcript.json"
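
For readers scripting around the CLI, the flags shown above could be parsed roughly as follows. This is a sketch based only on the flags that appear in this README (`--names`, `--format`, `--output`); it is not the project's actual parser, and the default format of text is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="scriptotic.py")
    parser.add_argument("url", help="YouTube video URL")
    parser.add_argument("--names", default="",
                        help="comma-separated speaker names (optional)")
    # Assumption: 'text' is the default when --format is omitted.
    parser.add_argument("--format", choices=["text", "json", "srt"],
                        default="text", help="output format")
    parser.add_argument("--output", help="path of the transcript file to write")
    return parser


def parse_names(raw: str) -> list[str]:
    """Split 'Alice, Bob,Charlie' into a clean list, dropping empty entries."""
    return [name.strip() for name in raw.split(",") if name.strip()]
```

Splitting the names in a separate helper keeps the handling of stray spaces and trailing commas in one place.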

HuggingFace Token Setup (First Run Only)

  1. Go to huggingface.co and create a free account
  2. Go to Settings → Access Tokens
  3. Click "New token" and create a token with Read permissions
  4. Accept the license agreement for the Voxtral model (mistralai/Voxtral-Mini-3B-2507) on its HuggingFace model page
  5. Save your token - you'll need it when you first run Scriptotic

Output Format

The transcript will include:

  • Video title and metadata
  • Model information (Voxtral Mini 3B)
  • Speaker-labeled dialogue:
    [Alice] Hello everyone, welcome to today's discussion.
    [Bob] Thanks for having me, Alice. I'm excited to talk about this topic.
    [Alice] Let's start with the basics...
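
If you want to post-process a text transcript, the speaker-labeled lines can be split back into (speaker, text) pairs. The sketch below assumes every dialogue line has exactly the `[Name] text` shape shown in the sample and skips anything else, such as title or metadata lines; `parse_dialogue` is a hypothetical helper, not part of Scriptotic.

```python
import re

# Matches lines such as "[Alice] Hello everyone".
LINE_PATTERN = re.compile(r"^\[(?P<speaker>[^\]]+)\]\s*(?P<text>.*)$")


def parse_dialogue(transcript: str) -> list[tuple[str, str]]:
    """Return (speaker, text) pairs for each speaker-labeled line."""
    turns = []
    for line in transcript.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:  # lines without a [Speaker] prefix are ignored
            turns.append((match["speaker"], match["text"]))
    return turns
```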
    

Technical Details

Architecture

WSL2-Based Design:

  • Web Server: Flask server running in WSL2
  • Backend: vLLM server with Voxtral Mini 3B model
  • Communication: HTTP API between web interface and transcription engine
  • Access: Browser-based interface accessible from Windows and WSL2

What's Happening Under the Hood

  1. Web Interface: Modern HTML/JavaScript interface with real-time updates
  2. Audio Download: yt-dlp downloads audio from YouTube in WSL2 environment
  3. Server Management: Automatic Voxtral server startup and health monitoring
  4. Transcription: Voxtral Mini 3B transcribes audio via vLLM HTTP API
  5. Speaker Mapping: Optional speaker name assignment
  6. Output: Generates formatted transcript in chosen format
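
The health-monitoring step can be pictured as a simple polling loop. The sketch below is an illustration, not Scriptotic's actual code: the function names are hypothetical, and the URL you pass in (for example `http://localhost:8000/health`, which assumes vLLM's default port and its health endpoint) depends on how the backend is configured.

```python
import time
import urllib.error
import urllib.request


def http_probe(url: str) -> bool:
    """Return True if the URL answers with HTTP 200 within 5 seconds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_for_server(url, timeout_s=600.0, interval_s=2.0,
                    probe=http_probe, sleep=time.sleep, clock=time.monotonic):
    """Poll `url` until the probe succeeds or `timeout_s` seconds elapse."""
    deadline = clock() + timeout_s
    while True:
        if probe(url):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The probe, sleep, and clock are injectable so the loop can be exercised without a running server. Model loading can take a while on first start, which is why the default timeout here is generous.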

Model Information

  • Engine: Voxtral Mini 3B (mistralai/Voxtral-Mini-3B-2507)
  • Accuracy: 5.1% WER (14% better than Whisper large-v3's 5.9% WER)
  • Languages: Superior multilingual support with automatic detection
  • Requirements: ~9.5GB VRAM, so it fits on a GPU with 12GB of VRAM
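
The "14% better" figure is a relative reduction in word error rate, not a 14-point difference. The arithmetic, using only the two WER values quoted above, works out as follows:

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Whisper large-v3 at 5.9% WER versus Voxtral Mini 3B at 5.1% WER.
reduction = relative_wer_reduction(5.9, 5.1)
print(f"{reduction:.1%}")  # prints 13.6%, which rounds to the quoted 14%
```

In absolute terms the gap is 0.8 percentage points, or roughly 8 fewer word errors per 1,000 words.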

Privacy

  • All processing is local - no audio or transcripts are sent to external servers
  • Internet access is needed only to download the model and to fetch audio from YouTube
  • Your HuggingFace token is stored locally in WSL2 environment
  • Audio files are automatically cleaned up after processing

Troubleshooting

Common Issues

"Web server won't start"

  • Make sure WSL2 is installed and Ubuntu distribution is available
  • Run the setup script: bash scripts/setup_wsl2_voxtral.sh
  • Check that all dependencies are installed in the virtual environment

"Can't access http://localhost:5000"

  • Ensure the web server is running (check terminal output)
  • Try http://127.0.0.1:5000 instead
  • Check Windows firewall settings for port 5000
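
To tell "nothing is listening" apart from a browser problem, you can test the port directly. This is a small diagnostic sketch using only the Python standard library; `port_is_open` is a hypothetical helper, not something shipped with Scriptotic.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False
```

For example, `port_is_open("127.0.0.1", 5000)` returning False suggests the web server is not running or is not reachable from where you ran the check.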

"Server failed to start"

  • Check that your NVIDIA driver is version 555 or newer
  • Ensure you have enough disk space (~15GB free)
  • Check the server logs: cat wsl2/voxtral.log
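
The first two checks can be automated with a short preflight script. The sketch below reads the driver version through `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and measures free disk space with `shutil.disk_usage`. The thresholds come from this README; the function names are hypothetical and the script is not part of Scriptotic.

```python
import shutil
import subprocess

MIN_DRIVER_MAJOR = 555  # minimum driver version from the checklist above
MIN_FREE_GB = 15        # approximate free space needed for the model


def driver_major(version: str) -> int:
    """Extract the major number from a version string such as '555.42.02'."""
    return int(version.strip().split(".")[0])


def driver_ok(version: str) -> bool:
    return driver_major(version) >= MIN_DRIVER_MAJOR


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()[0]


def free_disk_gb(path: str = ".") -> float:
    """Free space, in GB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9
```

A typical use is `driver_ok(installed_driver_version()) and free_disk_gb() >= MIN_FREE_GB`, run from inside the environment where the server starts.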

"HuggingFace token invalid"

  • Make sure you accepted the license agreement for the Voxtral model
  • Generate a new token with Read permissions
  • The token should start with hf_
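
A quick format check can catch copy-paste mistakes before the token is ever sent anywhere. This sketch only verifies the `hf_` prefix mentioned above; it cannot tell whether the token is valid or whether the model license has been accepted, and `looks_like_hf_token` is a hypothetical helper.

```python
def looks_like_hf_token(token: str) -> bool:
    """Cheap sanity check: non-empty token with the 'hf_' prefix."""
    token = token.strip()  # tolerate whitespace picked up when pasting
    return token.startswith("hf_") and len(token) > len("hf_")
```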

Very slow transcription

  • This is expected: Voxtral prioritizes accuracy over speed
  • A 1-hour video typically takes 10-15 minutes to process
  • The trade-off is a word error rate about 14% lower than Whisper large-v3
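
To set expectations for other video lengths, the 10-15 minutes per hour figure can be scaled. The sketch below assumes processing time grows linearly with video length, which is a simplification; actual times depend on your GPU and on server startup.

```python
def estimate_processing_minutes(video_minutes: float) -> tuple[float, float]:
    """Return a (low, high) estimate based on 10-15 minutes per hour of video."""
    low = video_minutes * 10 / 60
    high = video_minutes * 15 / 60
    return low, high
```

Under that assumption, a 30-minute video would take roughly 5 to 7.5 minutes.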

WSL2 using too much memory (Windows)

  • WSL2 may keep 8-12GB allocated even after servers stop
  • Solution: Run cleanup_memory.bat to free WSL2 memory
  • Or: Close the server terminal window (automatically runs cleanup)
  • This frees the vmmemWSL process you see in Task Manager

Getting Help

If you encounter issues:

  1. Check the web interface status indicator
  2. Look at server logs: cat wsl2/voxtral.log
  3. Verify WSL2 is working: run wsl -d Ubuntu ls in Command Prompt
  4. Make sure all setup steps were completed
  5. Try with a shorter video first

Files Overview

Main Launchers (Choose one):

  • start_scriptotic.bat - Windows launcher with visible server console
  • start_scriptotic.sh - Linux/WSL launcher with visible server console

Utility Scripts:

  • cleanup_memory.bat - Windows utility to free WSL2 memory (use if vmmemWSL is using too much RAM)

Core Components:

  • src/core/web_server.py - Flask web server
  • src/core/templates/index.html - Web interface
  • src/core/scriptotic.py - Core transcription logic and CLI
  • src/core/voxtral_engine.py - Voxtral transcription engine

Setup:

  • scripts/setup_wsl2_voxtral.sh - One-time environment setup

Version Information

  • Current Version: 2.0.0
  • Status: Production ready - Web interface with Playwright compatibility
  • Architecture: Flask web server + WSL2 vLLM backend

License

This project is open source. See individual model licenses for AI models used.

Contributing

Found a bug or want to improve Scriptotic? Please open an issue or submit a pull request!


Made with ❤️ by brinedew
