RAG Workshop: Building a PDF-Based Chatbot for Persian Documents

This repository contains the materials, notebooks, and code used in a hands-on workshop I presented titled "Development of an Offline Persian AI Assistant using RAG" at the WTLAB 2025 workshops. This workshop focused on building a Retrieval-Augmented Generation (RAG) chatbot for Persian PDF documents, such as legal regulations and organizational rules.

The workshop demonstrates how to extract text from PDFs, chunk and index the content, retrieve relevant documents, and generate final answers using open-source Large Language Models (LLMs). A simple Streamlit web interface is also included for interactive chat.

Workshop Goals

Understand the basics of LLMs and Retrieval-Augmented Generation (RAG)
Learn how to choose and use open-source models
Learn how to parse Persian PDF documents efficiently
Convert text to structured and searchable representations
Implement a RAG pipeline to answer user questions from private domain documents with citations
Deploy a simple chatbot UI using Streamlit

Repository Structure

File / Folder	Description
`0.About-me.ipynb`	Workshop introduction and presenter info
`1.Introduction-to-LLMs-And-RAG.ipynb`	Overview of LLMs and RAG concepts
`2.GettingStarted-with-Open-Source-LLMs.ipynb`	Practical intro to running and choosing open-source models
`3.RAG-Workflow.ipynb`	Explanation of RAG architecture
`4.Parsing-PDFs.ipynb`	Extracting and cleaning text from Persian PDFs
`5.Implement_RAG.ipynb`	Building a RAG system
`8.More-Advanced-RAG.ipynb`	Explanation of more advanced methods, Parent-Child Retriever, GraphRAG, etc.
`7.streamlit_ui.py`	Minimal chatbot UI built with Streamlit
`data-sources/`	Sample PDF regulation document used in the workshop
`docling-pdf-parse/`	Parsing PDF using docling
`pyproject.toml`, `uv.lock`	Environment configuration for reproducibility

Running the RAG Chatbot

1) Prepare the vector index (optional depending on notebook steps)

Run the relevant notebook to generate chunk embeddings.

2) Launch the Streamlit UI:

streamlit run 7.streamlit_ui.py

This opens a web interface where you can interact with your chatbot.

Notes on Persian PDF Processing

Persian PDFs are often encrypted, scanned, or improperly encoded. In this workshop, we compare multiple parsing approaches:

PyPDF2
OCR-based extraction
Docling structured document parsing

Pros/cons and preprocessing techniques are discussed in the notebooks.

Future Extensions

You may extend this project by:

Using FAISS / Qdrant / Weaviate for scalable vector search
Implementing GraphRAG (entity/relation extraction → knowledge graphs) or Parent-Child Document Retriever

Author

Ali Moameri; Software Engineer & NLP Researcher

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

RAG Workshop: Building a PDF-Based Chatbot for Persian Documents

Workshop Goals

Repository Structure

Running the RAG Chatbot

1) Prepare the vector index (optional depending on notebook steps)

2) Launch the Streamlit UI:

Notes on Persian PDF Processing

Future Extensions

Author

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data-sources		data-sources
docling-pdf-parse		docling-pdf-parse
images		images
.gitignore		.gitignore
.python-version		.python-version
0.About-me.ipynb		0.About-me.ipynb
1.Introduction-to-LLMs-And-RAG.ipynb		1.Introduction-to-LLMs-And-RAG.ipynb
2.GettingStarted-with-Open-Source-LLMs.ipynb		2.GettingStarted-with-Open-Source-LLMs.ipynb
3.RAG-Workflow.ipynb		3.RAG-Workflow.ipynb
4.Parsing-PDFs.ipynb		4.Parsing-PDFs.ipynb
5.Implement_RAG.ipynb		5.Implement_RAG.ipynb
7.streamlit_ui.py		7.streamlit_ui.py
8.More-Advanced-RAG.ipynb		8.More-Advanced-RAG.ipynb
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

alimoameri/rag-workshop-wtlab2025

Folders and files

Latest commit

History

Repository files navigation

RAG Workshop: Building a PDF-Based Chatbot for Persian Documents

Workshop Goals

Repository Structure

Running the RAG Chatbot

1) Prepare the vector index (optional depending on notebook steps)

2) Launch the Streamlit UI:

Notes on Persian PDF Processing

Future Extensions

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages