AskTheManual is a Multimodal Retrieval-Augmented Generation (RAG) Proof of Concept designed to read, see, and explain manuals to your customers. It transforms static PDF manuals into an interactive chatbot that understands both text and images.
New: Now features a graphical user interface (GUI) that makes the data ingestion process simple and intuitive!
This project follows a multi-stage pipeline to create a searchable knowledge base:
- Extraction: Converts detailed PDF manuals into Markdown (see the docling sketch after this list).
- Human-in-the-Loop Review: A GUI allows you to filter out "junk" images (icons, decorative elements) and keep only relevant diagrams.
- Vision Enrichment: Uses AI (or human input) to describe screenshots, turning visual information into searchable text.
- Vector Indexing: Stores the enriched content in a local vector database.
- Local Chat: A Streamlit dashboard to query your manual.
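For orientation, the Extraction stage is essentially a couple of docling calls. A minimal sketch, assuming your manual is named manual.pdf (the filename and output path are placeholders; the GUI handles this step for you):

```python
from docling.document_converter import DocumentConverter

# Parse the PDF into docling's document model (layout, tables, figures),
# then export everything as Markdown for the downstream pipeline.
converter = DocumentConverter()
result = converter.convert("manual.pdf")  # placeholder filename

with open("manual.md", "w", encoding="utf-8") as f:
    f.write(result.document.export_to_markdown())
```

The images extracted at this stage are what the review gallery shows you next.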
- Local Control & Privacy: Uses local Ollama and FAISS for the core "brain". Your proprietary manual text stays local.
- No "Black Box": You control the extraction. You decide which images strictly belong in the knowledge base.
- Multimodal: The bot understands what's inside screenshots (e.g., "The default IP is 127.0.0.1") because the ingestion pipeline explicitly captures it.
Ensure you have Python 3.10+ installed.
```bash
pip install streamlit docling langchain-huggingface langchain-community faiss-cpu sentence-transformers requests ttkbootstrap openai
```

(Note: ttkbootstrap is required for the new GUI.)
Download the docling models:

```bash
docling-tools models download
```

- Install Ollama from ollama.com.
- Pull a model (e.g., Qwen 2.5):
```bash
ollama pull qwen2.5:7b
```
- Make sure the Ollama server is running.
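A quick way to verify the server is reachable before launching the GUI, using the requests package from the dependency list (11434 is Ollama's default port):

```python
import requests

# GET /api/tags lists the models available to the local Ollama server.
try:
    resp = requests.get("http://localhost:11434/api/tags", timeout=5)
    resp.raise_for_status()
    print("Ollama is running. Models:", [m["name"] for m in resp.json()["models"]])
except requests.ConnectionError:
    print("Ollama is not reachable - start it with 'ollama serve'.")
```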
The project uses sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. To avoid the GUI freezing while downloading this model on the first run, execute this one-liner:
python -c "from langchain_huggingface import HuggingFaceEmbeddings; HuggingFaceEmbeddings(model_name='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')"For automatic image description (Vision AI), you need an OpenAI API key.
- Export it:
```bash
export OPENAI_API_KEY="sk-..."
```

- Or paste it directly into image_to_information.py (not recommended for production).
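For reference, the automatic description step boils down to a single vision call. A sketch of the idea; the model name, prompt, and describe_image helper are illustrative, not necessarily what image_to_information.py actually uses:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def describe_image(path: str) -> str:
    """Ask a vision-capable model for a searchable description of a screenshot."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any vision-capable model works
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this screenshot from a technical manual. "
                         "Transcribe all visible settings, labels, and values."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

The returned text is embedded alongside the manual's prose, which is how a question like "What is the default IP?" can match a screenshot.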
We have replaced the complex chain of scripts with a single app.
Run the GUI:
```bash
python AskTheManual_GUI.py
```

The app will guide you through three stages:
- Extraction & Review:
  - Select your PDF (it must be located in the project's main directory).
  - The app extracts all images.
  - Interactive Review: A gallery appears. Use the arrow keys to navigate; press 'DELETE' to discard junk images and 'KEEP' to keep valid ones.
- Enrichment:
  - Vision AI (automatic, though not every description is guaranteed to be correct): Sends the kept images to OpenAI to generate detailed technical descriptions.
  - Human Description (recommended if you want to be certain everything is correct): If you don't have an API key, you can type descriptions for each image manually in the GUI.
- Indexing:
  - Click "Update Vector Index" to finalize the database (a sketch of this step follows below).
Once indexing is complete, launch the chat interface:

```bash
streamlit run chatbot_dashboard.py
```
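Under the hood, the dashboard's answer loop is retrieval plus a local LLM call. A minimal sketch, reusing the placeholder index path from above (the actual prompt and flow in chatbot_dashboard.py may differ):

```python
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
)
# allow_dangerous_deserialization is required for FAISS indexes you built yourself.
db = FAISS.load_local("faiss_index", embeddings,
                      allow_dangerous_deserialization=True)

question = "What is the default IP address?"
chunks = db.similarity_search(question, k=4)
context = "\n\n".join(doc.page_content for doc in chunks)

llm = Ollama(model="qwen2.5:7b")  # the model pulled during setup
print(llm.invoke(
    f"Answer using only this manual excerpt:\n{context}\n\nQuestion: {question}"
))
```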
This PoC is intended for internal testing and demonstration. It serves as a blueprint for how technical documentation can be made "intelligent" by combining text parsers, Vision AI, and Vector Search.
