Author: Piyush Ramteke
A Retrieval-Augmented Generation (RAG) system that transforms video course content into an intelligent, searchable knowledge base. This project enables users to ask natural language questions about video content and receive contextual answers with precise timestamps.
- Overview
- Features
- Architecture
- Pipeline Workflow
- Technologies Used
- Installation
- Usage
- Project Structure
- Use Cases
- How It Works
- Configuration
- Future Improvements
- Contributing
- License
This RAG-based AI system processes video tutorials (specifically the Sigma Web Development Course) and creates a semantic search engine that allows learners to:
- Ask questions in natural language
- Get answers with specific video references
- Navigate directly to relevant timestamps in videos
- Search across multiple video lectures simultaneously
The system leverages OpenAI Whisper for speech-to-text transcription, BGE-M3 embeddings for semantic understanding, and Ollama LLMs (like Llama 3.2) for generating human-like responses.
| Feature | Description |
|---|---|
| Video to Audio Conversion | Automatically extracts audio from video files using FFmpeg |
| Speech-to-Text Transcription | Uses Whisper large-v2 model with Hindi-to-English translation |
| Chunk-based Processing | Splits transcriptions into timestamped segments for precise retrieval |
| Semantic Search | Uses BGE-M3 embeddings for meaning-based search (not just keywords) |
| AI-Powered Responses | Generates contextual answers using local LLMs via Ollama |
| Timestamp Navigation | Provides exact timestamps for relevant content |
| Persistent Storage | Saves embeddings using joblib for fast subsequent queries |
```
RAG PIPELINE

  Videos (.mp4) --FFmpeg--> Audio (.mp3) --Whisper large-v2--> JSON (transcripts) --Chunking + BGE-M3--> Embeddings (.joblib)

QUERY PIPELINE

  User query (natural language) --BGE-M3--> Query embedding --Cosine similarity--> Top-5 chunks --Llama 3.2--> LLM response
```
1. `video_to_mp3.py`: FFmpeg extracts audio → creates `.mp3` files with structured naming (see the sketch after this list)
2. `mp3_to_json.py`: Whisper model → transcribes audio → generates timestamped JSON chunks
3. `preprocess_json.py`: BGE-M3 model → creates embeddings → stores in `embeddings.joblib`
4. `process_incoming.py`: user query → semantic search → LLM generates contextual response
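As a rough illustration of the first stage, here is a minimal sketch of how FFmpeg can be driven from Python (folder names are taken from the Project Structure section; the actual video_to_mp3.py may differ):

```python
# Hypothetical sketch of the extraction step, not the exact contents of video_to_mp3.py.
import os
import subprocess

for name in os.listdir("videos"):
    if name.endswith(".mp4"):
        out_path = os.path.join("Audios", os.path.splitext(name)[0] + ".mp3")
        # -vn drops the video stream; -q:a 0 keeps high variable-bitrate audio quality
        subprocess.run(
            ["ffmpeg", "-i", os.path.join("videos", name), "-vn", "-q:a", "0", out_path],
            check=True,
        )
```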
| Category | Technology |
|---|---|
| Speech Recognition | OpenAI Whisper (large-v2) |
| Embeddings | BGE-M3 (via Ollama) |
| LLM | Llama 3.2 / DeepSeek-R1 (via Ollama) |
| Audio Processing | FFmpeg |
| Data Processing | Pandas, NumPy, Scikit-learn |
| Storage | Joblib |
| API Server | Ollama (localhost:11434) |
| Language | Python 3.x |
- Python 3.8+ installed
- FFmpeg installed and in PATH
- Ollama installed and running
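Optional sanity check: you can confirm the Ollama server is reachable on its default port by listing the models it has pulled (uses Ollama's /api/tags endpoint):

```python
import requests

# Ollama listens on localhost:11434 by default; /api/tags lists locally installed models
tags = requests.get("http://localhost:11434/api/tags").json()
print([model["name"] for model in tags.get("models", [])])
```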
```bash
git clone <repository-url>
cd "Rag Based AI"
pip install whisper pandas numpy scikit-learn joblib requests
```

Pull the required Ollama models:

```bash
# Install embedding model
ollama pull bge-m3

# Install LLM (choose one)
ollama pull llama3.2
# or
ollama pull deepseek-r1
```

Install the bundled Whisper submodule:

```bash
cd whisper
pip install -e .
```

Place your video files in the videos/ folder and run:

```bash
python video_to_mp3.py
```

Then transcribe the audio:

```bash
python mp3_to_json.py
```

This creates JSON files with timestamped transcriptions in the jsons/ folder. Next, build the embeddings:

```bash
python preprocess_json.py
```

This creates embeddings.joblib containing all chunk embeddings. Finally, start asking questions:

```bash
python process_incoming.py
```

Example interaction:
Ask a Question: Where is HTML concluded in this course?
Response: HTML is concluded in Video 13 titled "Entities, Code tag and more on HTML".
You can find the conclusion at around 8:40 (520 seconds). The instructor also
mentions in Video 14 "Introduction to CSS" at the beginning (around 0:05) that
HTML has been completed. I recommend watching Video 13 from timestamp 8:40 onwards
for the HTML conclusion!
```
Rag Based AI/
│
├── video_to_mp3.py        # Converts videos to MP3 audio files
├── mp3_to_json.py         # Transcribes audio using Whisper
├── preprocess_json.py     # Creates embeddings from transcriptions
├── process_incoming.py    # Main query processing script
│
├── embeddings.joblib      # Stored embeddings database
├── prompt.txt             # Last generated prompt (for debugging)
├── response.txt           # Last LLM response (for debugging)
│
├── Audios/                # Converted audio files (.mp3)
├── jsons/                 # Transcription JSON files
│   ├── 01_Installing VS Code & How Websites Work.mp3.json
│   ├── 02_Your First HTML Website.mp3.json
│   └── ... (18 video transcriptions)
│
├── whisper/               # OpenAI Whisper submodule
│   ├── whisper/           # Core Whisper library
│   ├── tests/             # Test files
│   └── notebooks/         # Jupyter notebooks
│
└── README.md              # This file
```
| Use Case | Description |
|---|---|
| Course Navigation | Help students find specific topics in lengthy video courses |
| Study Assistant | Answer questions about course content with precise references |
| Revision Helper | Quickly locate topics for exam preparation |
| Content Discovery | Search across multiple lectures simultaneously |
| Use Case | Description |
|---|---|
| Training Videos | Make corporate training searchable |
| Meeting Recordings | Find specific discussions in recorded meetings |
| Webinar Archives | Search through past webinars efficiently |
| Knowledge Base | Create searchable video documentation |
| Use Case | Description |
|---|---|
| Viewer Support | Help viewers find specific content |
| Content Indexing | Automatic chapter generation for videos |
| FAQ Automation | Auto-answer common viewer questions |
| Accessibility | Make video content accessible via text search |
| Use Case | Description |
|---|---|
| Lecture Archives | Search through academic lecture recordings |
| Interview Analysis | Find specific quotes in recorded interviews |
| Conference Videos | Navigate through conference presentations |
| Podcast Search | Make podcast episodes searchable |
The Whisper model processes audio files and generates timestamped segments:
```json
{
  "number": "1",
  "title": "Installing VS Code & How Websites Work",
  "start": 0.0,
  "end": 3.5,
  "text": "From today's video, we will start the Sigma Web Development course."
}
```
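These segments come from Whisper's transcribe output. A minimal sketch of the call that could produce them (the exact arguments used in mp3_to_json.py are an assumption; task="translate" matches the Hindi-to-English translation mentioned in the Features table):

```python
import whisper

# Load the same model size referenced in mp3_to_json.py
model = whisper.load_model("large-v2")

# task="translate" asks Whisper to emit English text for non-English (Hindi) audio
result = model.transcribe("Audios/01_Installing VS Code & How Websites Work.mp3", task="translate")

# Each segment carries its own start/end timestamps in seconds
for segment in result["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```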
Each text chunk is converted to a 1024-dimensional vector using BGE-M3:

```python
embedding = create_embedding([chunk_text])  # Returns [1024] vector
```
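create_embedding is a small helper in the scripts; a sketch of how it could call the local Ollama server (assuming the /api/embed endpoint and the bge-m3 model listed under Technologies Used):

```python
import requests

def create_embedding(texts):
    # Request BGE-M3 embeddings from the local Ollama server, one vector per input text
    response = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": "bge-m3", "input": texts},
    )
    response.raise_for_status()
    return response.json()["embeddings"]  # list of 1024-dimensional vectors
```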
User queries are embedded and compared using cosine similarity:

```python
similarities = cosine_similarity(all_embeddings, [query_embedding])
top_5_chunks = get_top_n(similarities, n=5)
```
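get_top_n above is shorthand; a self-contained version of this retrieval step, assuming the chunk dictionaries and their vectors have already been loaded from embeddings.joblib, might look like:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def top_n_chunks(chunks, chunk_embeddings, query_embedding, n=5):
    # Similarity of every stored chunk vector (shape: num_chunks x 1024) against the query vector
    sims = cosine_similarity(chunk_embeddings, [query_embedding]).flatten()
    # Indices of the n highest-scoring chunks, best match first
    best = np.argsort(sims)[::-1][:n]
    return [chunks[i] for i in best]
```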
Top chunks are formatted into a prompt and sent to the LLM:

```python
prompt = f"""
Here are video subtitle chunks: {relevant_chunks}
User question: {user_query}
Answer with video references and timestamps...
"""
response = llm.generate(prompt)
```
In preprocess_json.py and process_incoming.py:

```python
"model": "bge-m3"  # Change to preferred embedding model
```

In process_incoming.py:

```python
"model": "llama3.2"  # Options: llama3.2, deepseek-r1, mistral, etc.
```

In process_incoming.py:

```python
top_results = 5  # Increase for more context, decrease for speed
```

In mp3_to_json.py:

```python
model = whisper.load_model("large-v2")  # Options: tiny, base, small, medium, large, large-v2
```

- Web Interface - Create a Streamlit/Gradio UI for easier interaction
- Multi-language Support - Extend beyond Hindi-English translation
- Real-time Processing - Process videos as they're uploaded
- GPU Acceleration - Optimize for faster embedding generation
- Vector Database - Replace joblib with Chroma/Pinecone for scalability
- Caching Layer - Cache common queries for faster responses
- API Endpoints - Create REST API for integration with other systems
- Video Player Integration - Direct links to video timestamps
- Batch Processing - Handle multiple queries simultaneously
- Fine-tuning - Custom model training on domain-specific content
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Please read CODE_OF_CONDUCT.md for guidelines.
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAI Whisper - For the excellent speech recognition model
- Ollama - For making local LLM deployment easy
- BGE-M3 - For the powerful multilingual embedding model
- Sigma Web Development Course - The course content used for demonstration
- GitHub Issues: Use the Issues tab for bugs or feature requests
- Email: piyu.143247@gmail.com
- LinkedIn: www.linkedin.com/in/piyu24
Made with ❤️ by Piyush Ramteke for better learning experiences