A basic Retrieval-Augmented Generation (RAG) implementation for testing purposes, built to enable conversational interactions with Starter Story YouTube videos.

Tech stack:
- Backend: Symfony 7.3 (PHP 8.4)
- Database: PostgreSQL 17
- Search Engine: Elasticsearch 8.13.2
- AI/ML: OpenAI API with LLPhant library
- Frontend: Tailwind (4.1) & DaisyUI
- Web Server: Caddy
To run the project you will need:
- Docker and Docker Compose
- OpenAI API key
- Supadata API access (for fetching YouTube transcripts)
- YouTube API key
You can try a demo here
```bash
git clone <repository-url>
cd ChatWithStarterStory
```

Copy the environment files and configure them:

```bash
cp .env .env.local
```

Add your API keys to `.env.local`:
```env
OPENAI_API_KEY=your_openai_api_key_here
SUPADATA_API_KEY=your_supadata_api_key_here
YOUTUBE_API_KEY=your_youtube_api_key_here
```

```bash
# Build and start all services
docker compose --env-file .env.docker up -d --build

# Access the PHP container
docker exec -ti php /bin/bash
```

Inside the PHP container:
```bash
# Install Composer dependencies
composer install

# Create database and run migrations
bin/console doctrine:database:create
bin/console doctrine:migrations:migrate

# Build Tailwind CSS (in a separate terminal)
bin/console tailwind:build --watch
```

- Web Interface: http://localhost:8080 (Caddy will proxy to the Symfony app)
- Elasticsearch: http://localhost:9200
- Database: PostgreSQL on the default port, with credentials from `.env.docker`
The RAG system requires a three-step data preparation process:
```bash
bin/console app:import-youtube-videos
```

This command:
- Fetches videos from the Starter Story YouTube channel
- Retrieves video metadata (title, description, thumbnail, etc.)
- Stores video information in the PostgreSQL database
- Processes up to 100 videos in batches
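For reference, here is a minimal sketch of what such an import command could look like, using Symfony's HttpClient against the YouTube Data API v3 `search` endpoint. The class internals, the channel ID placeholder, and the entity persistence are assumptions for illustration, not the project's actual code:

```php
<?php
// Hypothetical import loop; persistence of the Video entity is omitted.

use Symfony\Component\Console\Attribute\AsCommand;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Contracts\HttpClient\HttpClientInterface;

#[AsCommand(name: 'app:import-youtube-videos')]
class ImportYoutubeVideosCommand extends Command
{
    public function __construct(private HttpClientInterface $http)
    {
        parent::__construct();
    }

    protected function execute(InputInterface $input, OutputInterface $output): int
    {
        $imported = 0;
        $pageToken = null;

        do {
            // YouTube Data API v3 search endpoint, 50 results per page
            $query = [
                'part' => 'snippet',
                'channelId' => 'STARTER_STORY_CHANNEL_ID', // placeholder
                'type' => 'video',
                'maxResults' => 50,
                'key' => $_ENV['YOUTUBE_API_KEY'],
            ];
            if ($pageToken !== null) {
                $query['pageToken'] = $pageToken;
            }

            $page = $this->http
                ->request('GET', 'https://www.googleapis.com/youtube/v3/search', ['query' => $query])
                ->toArray();

            foreach ($page['items'] as $item) {
                // Store title, description, thumbnail, etc. in PostgreSQL (omitted)
                $imported++;
            }

            $pageToken = $page['nextPageToken'] ?? null;
        } while ($pageToken !== null && $imported < 100); // cap at 100 videos

        $output->writeln(sprintf('Imported %d videos.', $imported));

        return Command::SUCCESS;
    }
}
```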
```bash
bin/console app:create-transcription-chunks
```

This command:
- Fetches transcriptions for each imported video using Supadata API
- Breaks transcriptions into manageable chunks with timestamps
- Creates `TranscriptionChunk` entities with content, offset, and duration
- Respects API rate limits with built-in delays
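The chunking itself could look roughly like the sketch below, which groups transcript segments into chunks of about 1,000 characters while keeping the first segment's offset and accumulating durations. The Supadata segment field names (`text`, `offset`, `duration`) and the size limit are assumptions:

```php
<?php
// Groups transcript segments into timestamped chunks; field names are assumed.

/**
 * @param array<int, array{text: string, offset: int, duration: int}> $segments
 * @return array<int, array{content: string, offset: int, duration: int}>
 */
function buildChunks(array $segments, int $maxChars = 1000): array
{
    $chunks = [];
    $current = null;

    foreach ($segments as $segment) {
        if ($current === null) {
            // Start a new chunk at this segment's timestamp
            $current = [
                'content' => $segment['text'],
                'offset' => $segment['offset'],
                'duration' => $segment['duration'],
            ];
        } elseif (strlen($current['content']) + strlen($segment['text']) + 1 > $maxChars) {
            // Chunk is full: flush it and start the next one
            $chunks[] = $current;
            $current = [
                'content' => $segment['text'],
                'offset' => $segment['offset'],
                'duration' => $segment['duration'],
            ];
        } else {
            // Append the segment and extend the chunk's duration
            $current['content'] .= ' ' . $segment['text'];
            $current['duration'] += $segment['duration'];
        }
    }

    if ($current !== null) {
        $chunks[] = $current;
    }

    return $chunks;
}
```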
```bash
bin/console app:generate-embeddings
```

This command:
- Processes transcription chunks that don't have embeddings
- Generates vector embeddings using OpenAI's embedding model
- Stores embeddings for semantic search capabilities
- Processes chunks in batches of 25 for optimal performance
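Under the hood this can be as simple as calling LLPhant's embedding generator per chunk. The sketch below assumes LLPhant's `OpenAI3SmallEmbeddingGenerator` (the model the project actually uses isn't stated) and a hypothetical `setEmbedding()` setter on the entity:

```php
<?php
// Embedding sketch with LLPhant; the entity wiring is hypothetical.

use LLPhant\Embeddings\EmbeddingGenerator\OpenAI\OpenAI3SmallEmbeddingGenerator;

// Reads OPENAI_API_KEY from the environment by default
$generator = new OpenAI3SmallEmbeddingGenerator();

// $chunks: TranscriptionChunk entities that don't have embeddings yet
foreach (array_chunk($chunks, 25) as $batch) { // batches of 25, as above
    foreach ($batch as $chunk) {
        // embedText() returns the embedding as an array of floats
        $vector = $generator->embedText($chunk->getContent());
        $chunk->setEmbedding($vector); // hypothetical setter
    }
    // $entityManager->flush(); // persist each batch before continuing
}
```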
- Data Ingestion: YouTube videos are imported and transcribed into searchable chunks
- Vector Storage: Text chunks are converted to embeddings and stored in Elasticsearch
- Query Processing: User questions are converted to embeddings for similarity search
- Context Retrieval: Most relevant video chunks are retrieved based on semantic similarity
- Response Generation: OpenAI LLM generates answers using retrieved context
- Result Presentation: Responses include relevant video links with timestamps
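To make the Context Retrieval step concrete, this is roughly what a direct kNN query against Elasticsearch 8.x looks like via the official PHP client. The index name `transcription_chunks` and the field names are assumptions; a vector-store abstraction such as LLPhant's can manage the index for you:

```php
<?php
// Direct kNN retrieval sketch; index and field names are assumed.

use Elastic\Elasticsearch\ClientBuilder;

$client = ClientBuilder::create()->setHosts(['http://localhost:9200'])->build();

// $queryVector: the embedding of the user's question,
// produced with the same model as the stored chunks
$response = $client->search([
    'index' => 'transcription_chunks',
    'body' => [
        'knn' => [
            'field' => 'embedding',
            'query_vector' => $queryVector,
            'k' => 5,               // chunks to return
            'num_candidates' => 50, // candidates examined per shard
        ],
        '_source' => ['content', 'offset', 'videoId'],
    ],
]);

foreach ($response['hits']['hits'] as $hit) {
    // Each hit's content and video offset feed the LLM's context
    echo $hit['_source']['content'], "\n";
}
```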
```
User Question → Embedding → Vector Search → Context Building → LLM Query → Response + Video Links
```
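The whole flow can be sketched with LLPhant primitives. The class names below match LLPhant's public API, but how this project actually wires them together (and which chat and embedding models it uses) is an assumption:

```php
<?php
// End-to-end RAG sketch using LLPhant; the wiring is an assumption.

use Elastic\Elasticsearch\ClientBuilder;
use LLPhant\Chat\OpenAIChat;
use LLPhant\Embeddings\EmbeddingGenerator\OpenAI\OpenAI3SmallEmbeddingGenerator;
use LLPhant\Embeddings\VectorStores\Elasticsearch\ElasticsearchVectorStore;
use LLPhant\Query\SemanticSearch\QuestionAnswering;

$es = ClientBuilder::create()->setHosts(['http://localhost:9200'])->build();

$embeddingGenerator = new OpenAI3SmallEmbeddingGenerator(); // uses OPENAI_API_KEY
$vectorStore = new ElasticsearchVectorStore($es);
$chat = new OpenAIChat();

// QuestionAnswering embeds the question, runs the similarity search,
// builds the context, and queries the LLM in one call
$qa = new QuestionAnswering($vectorStore, $embeddingGenerator, $chat);

echo $qa->answerQuestion('How do founders validate a SaaS idea?');
```

Mapping answers back to video links with timestamps would then use each retrieved chunk's offset and video ID.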
```bash
# Stop all services
docker compose down --remove-orphans

# View logs
docker compose logs -f

# Rebuild specific service
docker compose up -d --build php
```

```bash
# Build Tailwind CSS
bin/console tailwind:build

# Watch for changes
bin/console tailwind:build --watch
```

This project is for testing and educational purposes. Please ensure compliance with YouTube's Terms of Service and OpenAI's usage policies when using this application.