This project is a Python-based Retrieval-Augmented Generation (RAG) pipeline that summarizes `.pdf` and `.txt` documents using a lightweight sentence embedding model and a locally hosted LLM (Ollama). It performs semantic search over chunked document text to extract the most relevant context before querying the language model.
- ✅ Supports both `.txt` and `.pdf` inputs
- ✅ Automatically filters out irrelevant sections such as References or Bibliography
- ✅ Splits large documents into context-friendly chunks
- ✅ Uses sentence-transformers (`all-MiniLM-L6-v2`) for vector similarity search
- ✅ Summarizes based on the top-k most relevant chunks using Ollama + Gemma3:1b
- ✅ Outputs concise, context-aware answers to a given question
- ✅ Saves results as `.txt`
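The section filtering and chunking described above can be sketched as follows. The function names and parameters here are illustrative, not necessarily those used in the script:

```python
import re

# Headings that mark the start of sections to discard (an assumed pattern;
# the actual script may match different section names).
STOP_SECTIONS = re.compile(r"^(references|bibliography)\b", re.IGNORECASE)

def strip_trailing_sections(text: str) -> str:
    """Drop everything from a References/Bibliography heading onward."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if STOP_SECTIONS.match(line.strip()):
            return "\n".join(lines[:i])
    return text

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows that fit the model's context."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words), 1), step)]
```

Overlapping windows keep sentences that straddle a chunk boundary retrievable from at least one chunk.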
- Python
- SentenceTransformers
- PyPDF2
- Ollama
- NumPy
- Pandas
- `text` = read and clean input document
- `chunks` = split text into retrievable pieces
- `chunk_embeddings` = embed each chunk using SentenceTransformer
- `context` = retrieve top-k relevant chunks via cosine similarity to query
- `answer` = run LLM on prompt with query + context
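The retrieval step can be sketched like this. In the real pipeline the vectors come from `SentenceTransformer("all-MiniLM-L6-v2").encode(...)`; plain Python lists stand in for them here so the example is self-contained:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_chunks(query_emb, chunk_embs, chunks, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(
        zip(chunks, (cosine_similarity(query_emb, e) for e in chunk_embs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [chunk for chunk, _ in scored[:k]]
```

A NumPy version would compute the same scores with one matrix-vector product, which is what you would want for large documents.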
- File Reading
- Preprocessing
- Chunking
- Embedding
- Retrieval
- RAG Summarization
- Output
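The final RAG summarization step combines the question with the retrieved context into a single prompt. A minimal sketch, with an illustrative prompt template (the script's actual wording may differ):

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble a grounded prompt from the query and retrieved chunks."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# With a local Ollama server running, the prompt would be sent roughly like:
#   import ollama
#   reply = ollama.chat(model="gemma3:1b",
#                       messages=[{"role": "user", "content": build_prompt(q, ctx)}])
#   answer = reply["message"]["content"]
```

Instructing the model to use only the supplied context is what keeps the summary grounded in the retrieved chunks rather than the model's prior knowledge.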
```bash
pip install -r requirements.txt
ollama run gemma3:1b
python pdf-summaizer.py
```

Don't forget to star me on GitHub and follow me! Thanks :)