Skip to content

AgentRAG is a powerful AI-driven system that integrates web crawling, scraping, vector search, and intelligent chatbot interactions to process online information.

License

Notifications You must be signed in to change notification settings

Followb1ind1y/AgentRAG

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AgentRAG: AI-Powered Retrieval & Summarization with Web Search

About

This project is a comprehensive AI-driven system that integrates web crawling, scraping, vector search, and intelligent chatbot interactions. It extracts and processes online information, stores it in a Pinecone vector database, and enables fast and relevant retrieval using OpenAI embeddings. Key features include:

  1. Web Crawling extracts and filters sitemap URLs to gather target pages.
  2. Vector Processing chunks content, generates OpenAI embeddings, and stores them in Pinecone for semantic search.
  3. Hybrid Retrieval combines vector similarity search and MMR optimization to fetch contextually relevant documents.
  4. AI Agents leverage Retrieval-Augmented Generation (RAG) with Pinecone and LangGraph, enhanced by real-time web searches via Tavily API, to deliver accurate, context-aware responses.
  5. Multi-Interface Support provides both REST APIs (FastAPI) and Discord chatbot integrations, featuring automatic conversation summarization and PDF export capabilities.

With these components, the system can power intelligent chatbots, research assistants, and automated knowledge retrieval applications. 🚀

Project Structure

AgentRAG/
│── src/
│   ├── scraper.py     
│   ├── preprocess.py    
│   ├── retriever.py     
│   ├── discord_bot.py
│   ├── api.py                 
│── assets/              
│── requirements.txt  
│── docker-compose.yml  
│── Dockerfile
│── README.md           

1. Web Crawling

Extracts all links from a given sitemap.xml file and optionally filters them based on a provided substring.

Run the script

python src/crawl.py \
  --url "https://python.langchain.com/sitemap.xml" \
  --filter "/docs/tutorials/"

Output Example

🔗 Found 12 Pages in Sitemap:
https://python.langchain.com/docs/tutorials/
https://python.langchain.com/docs/tutorials/agents/
https://python.langchain.com/docs/tutorials/chatbot/
...

2. Web Scraping & Vector Storage with Pinecone

Fetche webpage content, processe it into text chunks, generate vector embeddings using OpenAI's embedding model, and store them in a Pinecone vector database for efficient retrieval.

Run the script

python src/preprocess.py \
  --urls "https://python.langchain.com/docs/tutorials/" \
  "https://python.langchain.com/docs/tutorials/agents/" \
  "https://python.langchain.com/docs/tutorials/chatbot/" \
  --index_name "langchain-tut"

Output Example

Starting to fetch webpages...
Fetching pages: 100%|#############| 3/3 [00:00<00:00,  7.73it/s]
Text chunking...
Vectorizing and storing in Pinecone...
Index langchain-tut does not exist, creating...
✅ Index langchain-tut is ready
Successfully stored 19 chunks to index langchain-tut

3. Vector Search

Retrieve relevant information from a Pinecone vector database using OpenAI embeddings.Initialize a Pinecone vector store, embed text using OpenAI's text-embedding-ada-002 model, and allow searching via similarity or MMR-based retrieval. It retrieves and displays relevant results along with metadata.

Run the script

python src/retriever.py \
  --index "langchain-tut" \
  --search_type "similarity" \
  --query "What is a chatbot?"

Output Example

✅ Index 'langchain-tut' found. Checking status...
✅ Index 'langchain-tut' is ready for use.

🔍 Search Results:
...

4. AI Agent with Retrieval & Web Search

An AI agent that retrieves information from a vector database and supplements responses with web search results. It uses Pinecone for vector search, LangGraph for AI-driven interactions, and OpenAI's GPT model for response generation.

  • Retrieval-Augmented Generation (RAG): Queries a Pinecone vector store for relevant context.
  • Web Search Integration: Uses Tavily API to fetch external information when needed.
  • Memory Management: Stores conversation history for summarization.
  • PDF Export: Saves agent interactions as a PDF summary.

Run the script

python src/agent.py --mode chat

Output Example

Discord Bot Demo

Discord Bot Demo

5. Chat API with Retrieval and Summarization

Provide a simple interface to interact with an AI model that can answer questions, retrieve relevant documents from a Pinecone vector store, and summarize conversation history. It combines local data retrieval with web search, making it adaptable for different use cases.

Run the API

To run the FastAPI app, use the following command:

uvicorn main:app --host 0.0.0.0 --port 8000 --reload

6. Discord AI Chatbot

A simple AI-powered chatbot for Discord, leveraging an agent-based execution model to process user queries and provide intelligent responses. The bot maintains conversation history and can generate summaries of interactions.

Run the Bot

Ensure you have a valid Discord bot token and the necessary dependencies installed.

python src/discord_bot.py

Output Example

  • Chat Bot

    Discord Bot Demo
  • Summarization

    Discord Bot Demo

7. Docker Usage

  • Build and Run with Docker

    git clone https://github.com/Followb1ind1y/AgentRAG.git
    cd AgentRAG
    docker-compose build
    docker-compose up
  • Environment Variables for Docker

    If you need to configure environment variables (e.g., API keys), create a .env file in the root directory or use the .env.example file as a template. Some important environment variables you may need are:

    # Example .env file
    OPENAI_API_KEY=your_openai_api_key
    PINECONE_API_KEY=your_pinecone_api_key
    DISCORD_BOT_TOKEN=your_discord_bot_token

Licence

This repository is licensed under the Apache-2.0 License - see the LICENSE file for details.

About

AgentRAG is a powerful AI-driven system that integrates web crawling, scraping, vector search, and intelligent chatbot interactions to process online information.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published