aeRAG: A Scalable Modular Knowledge System

1. Introduction

1.1 Context and Project Origin

Archethic is a transaction chain that introduces a number of innovations beyond what most traditional blockchains (e.g., Ethereum, Bitcoin, Cosmos) provide. Despite its unique capabilities—such as enhanced security models, improved transaction handling, and a more scalable architecture—the complexity of its underlying technology can make it difficult to explain, learn, and adopt.

To tackle this issue, we aim to develop a Retrieval-Augmented Generation (RAG) system that unifies Archethic’s data (source code, commits, issues, pull requests, developer documentation, community discussions, etc.) into a single, easily accessible knowledge base. By bridging natural language queries with relevant, curated project data, we hope to improve:

  • Technical clarity for developers looking to build decentralized applications (DApps) on Archethic.
  • Transparency for potential investors who need quick, accurate insights into the technology’s fundamentals and roadmap.
  • Efficiency for members of the Archethic Foundation who want to query existing data, code, and documents rapidly.
  • Community engagement by enabling content creators and ecosystem advocates to generate reliable articles, FAQs, and promotional materials.
  • Scalability by ensuring that as the ecosystem grows, the system can integrate new data sources and retrieval methods efficiently.

1.2 Main Objective

The overarching goal is to provide a comprehensive, AI-powered solution that can accurately answer a wide range of questions—from deep technical queries about node configurations or smart contract development, to higher-level inquiries about Archethic’s value proposition and economic model.

Furthermore, the system is designed with replicability and extensibility in mind: the same architecture and codebase can be reused for other projects by simply adjusting configuration files and data-collection endpoints to suit each new context. By abstracting data collection, storage, and retrieval processes, aeRAG can be adapted for different blockchain ecosystems or even non-blockchain domains.

1.3 Scope of This Document

This document details:

  • High-Level Architecture of the RAG pipeline: from data collection in GitHub repositories, to chunking and embedding content, to producing final answers through a Large Language Model (LLM).
  • Implementation Details of the current system: how MongoDB is used to store, query, and index data; how chunking strategies differ between code and text; and how metadata such as summaries and keywords are generated.
  • Future Extensions including advanced retrieval logic, integration with vector databases (e.g., Faiss, LlamaIndex, or Milvus), query optimization techniques (SQL-based retrieval, dynamic reasoning, multi-step planning), and support for multiple LLM backends (Mistral, DeepSeek, OpenAI, etc.).
  • Deployment Considerations for local or cloud-based environments.
  • Testing and Maintenance approaches that ensure the system remains robust, accurate, and easy to evolve over time.
  • User Interface and Accessibility by discussing how aeRAG can be integrated into chatbots, API endpoints, and other interactive platforms to make knowledge retrieval as seamless as possible.

By offering a modular, well-structured RAG solution, this project aims to accelerate Archethic’s adoption and empower its community with a reliable, user-friendly tool for exploring, understanding, and leveraging the technology’s full potential. Additionally, it serves as a blueprint for any ecosystem looking to integrate RAG-based AI-driven knowledge retrieval.


2. Requirements and Use Cases

2.1 Functional Requirements

2.1.1 Data Collection

  • GitHub Integration:
    The system must be able to fetch code, commits, pull requests, issues, and release notes from Archethic’s (or any other project's) GitHub repositories. This includes handling authentication (e.g., OAuth tokens), pagination, and rate limits.
  • Extended Data Sources:
    While GitHub is the primary data source initially, the architecture should support plugging in additional data feeds (e.g., forum posts, technical blogs, governance records). Configuration files or environment variables should make it simple to add or remove these sources.

2.1.2 Data Organization and Indexing

  • Structured Storage in MongoDB:
    Each type of data (e.g., commits, issues, code files) is stored in dedicated MongoDB collections. The system must maintain references between these collections (e.g., issues linked to related pull requests).
  • Metadata Generation:
    The system must generate extended metadata for each file or record, including:
    • Text chunking (custom size, overlap, or code syntax-aware).
    • Embeddings (e.g., with SentenceTransformers).
    • Summaries (T5 Summarizer or equivalent).
    • Keywords (Yake or any other configured extractor).
  • Vector Index Support:
    Even though MongoDB houses the core data, the system should be designed to integrate with a vector database (e.g., Faiss, LlamaIndex, Milvus) for rapid similarity searches on embeddings.

2.1.3 Retrieval-Augmented Queries

  • Similarity Search:
    For any incoming natural language query, the system must locate the most relevant chunks (via embeddings) and produce a concise context for the LLM.
  • LLM Interaction:
    The pipeline should support hooking up multiple LLMs (Mistral, DeepSeek, OpenAI, etc.). Internally, the system organizes the context, merges it with user queries, and delegates the final text generation to the chosen model.

2.1.4 Chatbot / API Interface

  • User Interaction:
    The solution should provide at least a chatbot-like interface, whether through a CLI, a web-based UI, or both.
  • API Access (Planned):
    An HTTP endpoint (e.g., REST or GraphQL) may be exposed so external applications can query the system programmatically.

2.1.5 Logging and Monitoring

  • Request Logs:
    Each query, along with the relevant chunks retrieved and LLM response, should be optionally logged for future analysis (with attention to privacy/security concerns).
  • System Health:
    Basic metrics (e.g., memory usage, response times, indexing completion stats) are necessary for debugging and optimization.

2.1.6 Testing (Unit and Non-Regression)

  • Unit Tests:
    Each module (collectors, chunkers, embeddings, metadata management) should include tests to confirm expected behavior.
  • Integration & Non-Regression Tests:
    The system should run scenario-based tests (e.g., retrieving commits for a specific repo, chunking and vectorizing them, ensuring results remain consistent over time).

2.2 Non-Functional Requirements

2.2.1 Performance and Scalability

  • Low Latency Retrieval:
    The system must return relevant chunks quickly, especially in conversational use cases. This requirement often drives the need for an ANN (Approximate Nearest Neighbor) index in a vector database.
  • Horizontal Scalability:
    As more data sources are added, the system should handle increased storage and indexing loads via horizontal scaling. For instance, separate microservices could handle chunking, vector indexing, and LLM calls.

2.2.2 Reliability and Fault Tolerance

  • Robust Error Handling:
    If a collector encounters an error (e.g., a GitHub rate limit), it should retry gracefully or log the issue without halting the entire pipeline.
  • Retry and Rollback:
    Large-scale indexing processes might fail mid-way; the system should be able to resume or roll back partial updates.

2.2.3 Maintainability and Clean Code

  • Modular Architecture:
    Each functionality (data collection, chunking, embeddings, etc.) is self-contained with clear interfaces. This fosters easier upgrades and refactoring.
  • Adherence to Clean Code Principles:
    Following Robert C. Martin’s guidelines (SOLID, DRY, etc.) ensures the code remains approachable for new contributors.
  • Documentation:
    Clear in-code docstrings and external documentation for each module so that maintainers can easily extend or modify the system.

2.2.4 Security and Access Control

  • Token Management:
    Credentials for GitHub or other data sources must be stored securely, e.g., in environment variables or a secrets manager.
  • User Data Privacy:
    If personal or sensitive data is ever ingested, the system must comply with relevant privacy regulations and best practices.
  • Network Restrictions:
    For enterprise or closed-source usage, there may be a requirement to restrict external LLM calls or manage them via secure proxies.

2.2.5 Portability

  • Local vs. Cloud Deployments:
    The system should support easy installation on a local machine (e.g., on-premise server) or container-based deployment in the cloud.
  • Minimal External Dependencies:
    While vector databases or advanced LLMs might be optional add-ons, the core pipeline should remain functional and not break if certain optional dependencies are missing or swapped.

2.3 Use Cases

2.3.1 Developer Building DApps on Archethic

  1. Scenario: A developer wants to understand Archethic’s smart contract framework and needs to see examples of code usage.
  2. Steps:
    • Developer types a question, e.g., “How do I deploy a multi-sig contract on Archethic?”
    • aeRAG fetches relevant code snippets, docs, and commits that reference multi-sig, compiles context, and feeds it to the LLM.
    • LLM responds with a step-by-step guide or references to official docs.
  3. Benefit: Rapid, contextual knowledge retrieval shortens the learning curve and speeds up development.

2.3.2 Potential Investor or Partner

  1. Scenario: A venture capital firm or an individual investor wants to understand Archethic’s roadmap and unique selling points.
  2. Steps:
    • They query, “What makes Archethic more scalable than typical proof-of-stake networks?”
    • aeRAG surfaces relevant discussions from the repository, official documentation, and whitepapers.
    • The LLM generates a concise overview of the scalability features (e.g., the transaction chain’s design).
  3. Benefit: Clear, investor-friendly explanations build trust and expedite decision-making.

2.3.3 Archethic Foundation Member

  1. Scenario: A foundation developer needs to quickly find the commit history related to a specific protocol upgrade that might have introduced a bug.
  2. Steps:
    • They query, “Show me all commits related to ‘protocol_upgrade_X’ since version 2.0”.
    • aeRAG compiles the relevant commits, issues, and possibly pull requests to highlight changes.
    • The developer identifies potential regressions faster without manually browsing through logs.
  3. Benefit: Faster internal processes and improved incident resolution.

2.3.4 Community Content Creator

  1. Scenario: A community ambassador wants to write a blog post about Archethic’s approach to governance.
  2. Steps:
    • They ask, “What governance proposals have shaped Archethic’s roadmap in 2023?”
    • aeRAG retrieves data from the main GitHub governance repo or related discussion boards, plus any official announcements.
    • The LLM provides an ordered list or summary of major proposals with key points.
  3. Benefit: Accurate, centralized knowledge supports consistent community messaging.

3. High-Level Architecture

This chapter provides a bird’s-eye view of the aeRAG system—how data travels from raw sources (like GitHub) through chunking, embedding, and indexing, and ultimately how it’s fed to a Large Language Model (LLM) to answer user queries. The architecture is designed around modularity and extensibility, ensuring new data sources, vector databases, or LLM backends can be added with minimal friction.


3.1 Overview of the RAG Pipeline

  1. Data Collection

    • The system periodically or on-demand fetches information from configured sources (primarily GitHub).
    • Each repository’s commits, issues, pull requests, and files are retrieved and stored.
  2. Data Storage

    • Retrieved data is structured and saved in MongoDB (or any chosen database).
    • Collections such as commits, issues, pull_requests, main_files, etc., store relevant metadata and references.
  3. Document Chunking

    • Each stored file or piece of text is split into manageable chunks for fine-grained retrieval.
    • aeRAG uses overlapping text-based or language-specific code chunking strategies to preserve context.
  4. Vectorization

    • Each chunk is embedded into a vector space using a chosen embedding model (e.g., SentenceTransformers).
    • The resulting vectors are stored, either directly in MongoDB or in a vector database (planned feature).
  5. Vector Database Indexing (Planned)

    • For faster similarity searches, embeddings may be indexed in a specialized vector DB (Faiss, LlamaIndex, or Milvus).
    • The system can fall back to MongoDB-based indexing if a vector DB is unavailable or optional for smaller deployments.
  6. Retrieval & RAG Engine

    • Upon user query, the engine searches for top-k similar chunks based on semantic similarity.
    • The system compiles relevant chunks into a context package.
  7. LLM Integration

    • The context package and user query are passed to the selected LLM (Mistral, DeepSeek, OpenAI, etc.).
    • The LLM returns an answer that is more accurate and context-aware, thanks to the RAG pipeline.
  8. User Interaction

    • Interactions can occur via a CLI chatbot, web UI, or an API endpoint.
    • Query logs or results can be monitored for debugging and continuous improvement.

3.2 Key Components and Their Roles

3.2.1 Collectors

Primary Purpose: Gather data from external sources, especially GitHub, and insert or update that data in MongoDB.

  1. GitHubCollector

    • Coordinates the retrieval of commits, pull requests, issues, and file trees.
    • Uses submodules to handle specific data (e.g., github_commits.py, github_issues.py).
    • Manages API rate limits, token authentication, and error handling.
  2. Collector Extensibility

    • Additional collectors can be created for other data sources (forums, curated docs, or even other Git platforms).
    • Each collector follows a similar pattern: fetch data → standardize format → store in the database.
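
To make that shared pattern concrete, the sketch below shows what a common collector base class could look like; the class and method names are hypothetical and do not come from the current codebase.

from abc import ABC, abstractmethod

class BaseCollector(ABC):
    """Hypothetical base class illustrating the fetch -> standardize -> store pattern."""

    def __init__(self, database_manager):
        self.db = database_manager

    @abstractmethod
    def fetch(self):
        """Retrieve raw records from the external source (GitHub, forum, ...)."""

    @abstractmethod
    def standardize(self, raw_record):
        """Map a raw record to the document schema expected by MongoDB."""

    def collect(self):
        # Run the full pipeline: fetch data, normalize it, then persist it.
        for raw_record in self.fetch():
            document = self.standardize(raw_record)
            self.db.upsert(document)  # assumed upsert-style helper on the DB manager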

3.2.2 Core Database & File Storage

Primary Purpose: Serve as the unified data backbone, storing both raw text/code and structured metadata.

  1. DatabaseManager

    • Handles MongoDB connections and collection creation.
    • Defines indexes for improved query performance (e.g., on repo, date, or updated_at fields).
  2. FileStorageManager

    • Manages large file content, either locally (via a simple HTTP server) or through a remote storage service.
    • Ensures raw files remain accessible for post-processing or direct viewing.

3.2.3 Chunking Strategies

Primary Purpose: Break down large documents (code files, markdown docs, etc.) into smaller chunks that can be indexed and retrieved more precisely.

  1. TextChunkingStrategy

    • Splits text into segments based on a chosen size (e.g., 500 tokens) and overlap (e.g., 50 tokens).
    • Useful for documentation, discussion threads, or any natural-language content.
  2. CodeChunkingStrategy

    • Analyzes code structure (functions, classes, imports) to split at logical boundaries.
    • Different sub-routines exist for languages like Python, JavaScript/TypeScript, etc.
  3. ChunkingStrategyFactory

    • Dynamically decides which strategy to use based on file type or extension.
    • Extendable for new formats (HTML, JSON config, etc.).

3.2.4 Embeddings and Metadata

Primary Purpose: Add semantic layers to the data so the RAG engine can find relevant chunks via similarity search.

  1. Embedding Models

    • Typically uses a SentenceTransformerEmbeddingModel for general semantic embeddings.
    • Could integrate other encoders if needed (e.g., OpenAI embeddings, BERT variants).
  2. Summarizers & Keyword Extractors

    • Summarizers (e.g., T5) can generate short abstracts of each chunk, aiding high-level indexing.
    • Keyword extractors (e.g., Yake) allow quick tagging and filtering.
  3. MetadataManager

    • Coordinates the creation of chunked documents, embeddings, summaries, and keywords.
    • Updates or re-generates metadata if a file changes in GitHub or a new version is released.

3.2.5 Vector Database (Future Integration)

Primary Purpose: Accelerate similarity searches on embedded data, crucial as the dataset grows.

  1. Supported Options

    • Faiss: A library for efficient similarity search in high-dimensional spaces.
    • LlamaIndex or Milvus: Alternative solutions offering advanced features (like partitioning or distributed indexes).
  2. Index Management

    • The system defines how embeddings are uploaded, updated, or deleted.
    • Queries the index to retrieve the top-k results for a user’s query embedding.

3.2.6 RAG Engine

Primary Purpose: Orchestrate the retrieval and packaging of relevant context for the LLM.

  1. Similarity Search

    • Takes the user’s question, embeds it, and searches the vector index or fallback (MongoDB).
    • Ranks chunks by relevance score.
  2. Context Assembly

    • Collects top-k chunks into a consolidated context, possibly combining multiple documents.
    • May apply summarization or additional filtering (e.g., removing duplicates).
  3. Query Routing (Future Feature)

    • In more advanced modes, decides if the LLM even needs retrieval from the knowledge base (fallback to base LLM if the question is generic).
    • Could also orchestrate multi-step plans, e.g., generating a structured query for the DB.

3.2.7 LLMs and Response Generation

Primary Purpose: Generate answers in natural language using the retrieved context.

  1. LLM Manager

    • Abstracts the interface to different LLMs (Mistral, DeepSeek, OpenAI, etc.).
    • Passes the user query plus retrieved context to the chosen model.
  2. Prompt Engineering

    • Shapes how the context and user question are combined (system prompts, user prompts, or few-shot examples); a minimal prompt-assembly sketch follows this list.
    • Minimizes hallucination by emphasizing or injecting retrieved facts.
  3. Response Handling

    • Returns the LLM’s final text to the user (via CLI, web UI, or API).
    • Optionally logs the query and final response for analysis or debugging.
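
As a minimal sketch of the prompt-assembly step, assuming a chat-style message format, the snippet below combines retrieved chunks with the user question; the exact wording and structure depend on the chosen LLM backend.

def build_prompt(user_question: str, chunks: list[str]) -> list[dict]:
    """Combine retrieved chunks and the user question into chat-style messages."""
    context = "\n\n---\n\n".join(chunks)
    system_prompt = (
        "You are an assistant for the Archethic ecosystem. "
        "Answer using ONLY the context below; if the context is insufficient, say so.\n\n"
        f"Context:\n{context}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ]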

3.2.8 Server and CLI

Primary Purpose: Provide entry points for interactive usage and automation.

  1. LocalStorageServer

    • A minimal HTTP server that serves raw files (such as code or logs) for convenience.
    • Can be extended to offer a small web UI or endpoints if needed.
  2. main.py CLI

    • Interactive menu for listing repos, updating data, starting the local server, and performing metadata updates.
    • Allows testing or demonstration of the system’s core functions in a terminal environment.
  3. Future API

    • A planned extension for external apps: REST or GraphQL endpoints that handle user queries.

3.3 Data Flow Diagram (Conceptual)

Below is a conceptual, text-based outline of how data moves through aeRAG.

  1. Collector calls the GitHub API → receives commits, issues, PRs.
  2. Collector stores results in MongoDB (collections: commits, issues, etc.).
  3. MetadataManager reads a file or record → uses ChunkingStrategy to split → uses EmbeddingModel to produce vectors.
  4. MetadataManager saves chunk metadata (embedding, summary, keywords) in MongoDB or a Vector DB.
  5. At query time, the RAG engine:
    • Embeds the user query → searches top-k similar chunks.
    • Gathers context from the relevant chunk documents.
    • Sends the context and user query to the LLM.
    • LLM produces a final answer, which is returned to the user.
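
The query-time steps above could be orchestrated roughly as follows; embedding_model, vector_index, chunk_docs, and llm are placeholders for the concrete components (SentenceTransformers, Faiss or MongoDB, and the LLM manager), not names from the codebase.

import numpy as np

def answer_query(question, embedding_model, vector_index, chunk_docs, llm, top_k=5):
    # 1. Embed the user query (shape (1, dim), float32 for the vector index).
    query_vector = np.asarray([embedding_model.encode(question)], dtype=np.float32)
    # 2. Retrieve the positions of the top-k most similar chunks.
    _, indices = vector_index.search(query_vector, top_k)
    # 3. Gather the corresponding chunk texts (chunk_docs: indexable sequence of chunk documents).
    context = "\n\n".join(chunk_docs[i]["chunk_src"] for i in indices[0])
    # 4. Let the LLM answer using the assembled context.
    return llm.generate(question=question, context=context)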

3.4 Architectural Rationale

  1. Separation of Concerns

    • Each step (collecting, chunking, embedding, retrieving, LLM generation) is modular. This allows partial upgrades (e.g., swapping out the chunking strategy or embedding model) without rewriting the entire pipeline.
  2. Incremental Adoption

    • The system works with just MongoDB for smaller data sets. As the data volume increases, a specialized vector DB can be integrated for speed and scalability.
  3. Scalability and Flexibility

    • By supporting multiple LLMs and data sources, the architecture remains future-proof.
    • The system can scale horizontally, with separate microservices for chunking, embeddings, and search if needed.
  4. Maintainability

    • Clean Code patterns keep each module self-contained and testable.
    • Clear data flows ensure developers can trace how a chunk moves from raw file to LLM prompt.

3.5 Summary

The aeRAG architecture combines a robust ingestion pipeline with modular chunking, embedding, and retrieval strategies. It leverages MongoDB for flexible document storage and aims to integrate specialized vector databases for high-speed similarity search. A dedicated RAG engine then collates context for the LLM, ultimately enabling domain-specific, context-rich responses for end-users.

In the next chapter, we will delve deeper into the existing implementation, showing how these architectural components map to actual modules and functions in the codebase. This includes details on how the GitHub collectors are structured, how metadata is generated and stored, and the current approach to chunking and embedding.


4. Current Implementation Details

Building on the High-Level Architecture described in Chapter 3, this chapter focuses on how the system is implemented today (02/03/2025), including the structure of collectors, metadata generation, and data handling in MongoDB. Future releases will add or modify features documented here, and a dedicated changelog will track major changes over time.


4.1 Data Collection

4.1.1 GitHubCollector

  • Primary Module: github_collector.py
  • Role: Orchestrates fetching and updating information from GitHub, including:
    • Repository Info: Basic metadata (description, language, URL, last commit date).
    • Commits: Detailed commit logs, including author, date, changed files, and diffs.
    • Pull Requests: PR titles, states, commit references, and attached comments.
    • Issues: Titles, bodies, state (open/closed), labels, and discussion threads.

4.1.1.1 Workflow

  1. Fetch Repository List: A call to GitHub’s /orgs/<org>/repos endpoint retrieves all repos for the organization.
  2. Update Selected Repos: For each repo, the system fetches new commits, pull requests, and issues since the last recorded date.
  3. Store in MongoDB:
    • Data gets normalized into collections like repositories, commits, pull_requests, issues, etc.
    • Any large file content is saved separately via the file_storage_manager.py.

4.1.1.2 Error & Rate-Limit Handling

  • The collector checks GitHub’s response headers (e.g., X-RateLimit-Remaining) and waits if the rate limit is reached.
  • If requests fail, the system logs the error and continues processing for other repos or data types.
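
As a rough illustration (not the actual collector code), the loop below fetches a paginated GitHub REST endpoint with the requests library, authenticating via a GITHUB_TOKEN environment variable and pausing when the rate limit is exhausted.

import os
import time
import requests

def fetch_paginated(url, params=None):
    """Yield items from a paginated GitHub REST endpoint, respecting rate limits."""
    headers = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}
    while url:
        response = requests.get(url, headers=headers, params=params)
        # When the rate limit is exhausted, wait until the reset time advertised by GitHub.
        if response.status_code == 403 and response.headers.get("X-RateLimit-Remaining") == "0":
            reset_at = int(response.headers.get("X-RateLimit-Reset", time.time() + 60))
            time.sleep(max(reset_at - time.time(), 1))
            continue
        response.raise_for_status()
        yield from response.json()
        url = response.links.get("next", {}).get("url")  # next page URL from the Link header, if any
        params = None  # later pages already encode their query parameters in the URL

# Example: iterate over all repositories of an organization.
# for repo in fetch_paginated("https://api.github.com/orgs/archethic-foundation/repos"):
#     print(repo["full_name"])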

4.1.2 Additional Collector Files

  • github_commits.py: Fetches commit details and changed files.
  • github_issues.py: Retrieves issues and related comments, storing them in issues and issues_comments collections.
  • github_pull_requests.py: Manages pull request data (state, associated commits, comments).
  • github_files.py: Fetches files from the main branch or the latest release tag, storing them into main_files or last_release_files.

4.2 Data Storage in MongoDB

4.2.1 DatabaseManager

  • File: database_manager.py
  • Responsibilities:
    • Open a connection to the MongoDB instance.
    • Create indices for each collection (e.g., on repo, date, state, or custom fields).
    • Provide a simple db property to access raw PyMongo operations.

4.2.1.1 Collections Overview

  1. repositories:
    • Holds basic info: _id (e.g., org/repo), description, language, URL.
  2. commits:
    • Each commit includes metadata such as message, date, author, files changed.
    • Linked to files in files or main_files via commit SHA references.
  3. pull_requests and issues:
    • Each record captures title, body, creation date, labels, comments, etc.
    • Comments are stored separately in pull_requests_comments and issues_comments to keep the main documents smaller.
  4. files, main_files, and last_release_files:
    • Store file metadata (filename, commit ID, and external URLs to the raw content).
  5. metadata:
    • Contains references to text chunks, embeddings, summaries, and file hashes.
    • The _id often encodes the source collection and document ID (e.g., meta_repo_commits_sha).
  6. chunks:
    • Stores chunk-level data (e.g., chunk_src, embedding, metadata_id).
    • Links back to the metadata collection, enabling fine-grained retrieval.
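
For illustration, indexes on these collections could be created with PyMongo roughly as follows; the database name and the exact index definitions are assumptions rather than the DatabaseManager's actual code.

from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["aerag"]  # database name is an assumption

# Compound and single-field indexes on frequently queried fields.
db["commits"].create_index([("repo", ASCENDING), ("date", DESCENDING)])
db["issues"].create_index([("repo", ASCENDING), ("state", ASCENDING)])
db["chunks"].create_index([("metadata_id", ASCENDING)])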

4.2.2 FileStorageManager

  • File: file_storage_manager.py
  • Purpose:
    • Writes large or raw file content to a local storage path.
    • Generates a public (or internal) URL, which is saved into MongoDB for future retrieval or analysis.
    • Also handles retrieving existing content if needed for re-processing.

4.2.2.1 Usage Flow

  1. Store File: On retrieving a new file from GitHub, the collector calls store_file_content() with (content, repo, reference_id, filename).
  2. Generate URL: The manager returns a path like http://localhost:8000/<repo>/<reference_id>/<filename>.
  3. MongoDB Reference: This URL gets stored in a relevant collection (e.g., main_files).
  4. Subsequent Access: If a chunking process or user needs the raw file, it can fetch via this URL.
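
A simplified sketch of this flow, assuming local disk storage and the URL scheme shown above; the real file_storage_manager.py may handle encoding, collisions, and base URLs differently.

from pathlib import Path

class FileStorageManager:
    def __init__(self, root_dir="storage", base_url="http://localhost:8000"):
        self.root_dir = Path(root_dir)
        self.base_url = base_url

    def store_file_content(self, content: str, repo: str, reference_id: str, filename: str) -> str:
        # Write the raw content under <root>/<repo>/<reference_id>/<filename> ...
        target = self.root_dir / repo / reference_id / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
        # ... and return the URL that will be stored in MongoDB (e.g., in main_files).
        return f"{self.base_url}/{repo}/{reference_id}/{filename}"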

4.3 Chunking and Metadata Generation

4.3.1 Chunking Strategies

  • TextChunkingStrategy (text_chunking_strategy.py):

    • Splits documents into overlapping text chunks. Typical defaults: chunk size of ~500–1000 characters and overlap of 50–200 characters.
    • Useful for markdown docs, issue bodies, or any plain text.
  • CodeChunkingStrategy (code_chunking_strategy.py):

    • Splits code based on classes, function definitions, imports, etc.
    • Supports multiple languages (Python, JavaScript, Dart, Go, C/C++, Ruby, etc.) by using language-specific regex patterns.
  • ChunkingStrategyFactory (chunking_strategy_factory.py):

    • Determines whether to treat a file as code or text based on file extensions or other metadata.
    • Fallback: If the file type is unknown, it uses the text strategy.
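
A condensed sketch of extension-based strategy selection and overlapping text chunking; the chunk size, overlap, and extension list are examples rather than the project's configured values.

def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character-based chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

CODE_EXTENSIONS = {".py", ".js", ".ts", ".dart", ".go", ".c", ".cpp", ".rb"}

def choose_strategy(filename: str) -> str:
    """Return 'code' for known code extensions, otherwise fall back to 'text'."""
    suffix = "." + filename.rsplit(".", 1)[-1] if "." in filename else ""
    return "code" if suffix in CODE_EXTENSIONS else "text"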

4.3.2 MetadataManager

  • File: metadata_manager.py
  • Core Logic:
    1. Detect File Type: For each record (commit, file, etc.), check extension or stored metadata.
    2. Retrieve Content: Fetch text from external_url if available, or from the patch field for commits.
    3. Chunk the Content: Call the appropriate chunking strategy.
    4. Generate Embeddings: Pass each chunk to an embedding model.
    5. Create Summaries & Keywords: Use T5-based summarizers or Yake-based keyword extractors.
    6. Save to metadata and chunks: Each chunk is stored with an _id referencing the metadata record.

4.3.2.1 Example Flow for a New File

  1. A new file in main_files is detected with _id: <repo>_main_<filepath>.
  2. MetadataManager extracts text from file_storage_manager’s URL.
  3. Chooses CodeChunkingStrategy if extension is .py; else TextChunkingStrategy.
  4. The resulting chunk texts are embedded using a SentenceTransformerEmbeddingModel from embeddings.py.
  5. Summaries and keywords might be generated, stored in the metadata collection.
  6. Each chunk’s embedding and text is saved to chunks.

4.4 Embeddings and Summaries

4.4.1 Embedding Models

  • File: embeddings.py
  • Implemented Class: SentenceTransformerEmbeddingModel (commonly with the "all-MiniLM-L6-v2" or user-specified model).
  • Usage:
    • encode(text) returns a vector (list of floats).
    • For chunked text, this is called in a loop. The resulting vectors are inserted into the chunks collection along with references to the parent metadata ID.
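
A minimal example of this usage with the sentence-transformers library; the model name matches the default cited above but may differ from the configured deployment.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_texts = ["def transfer(amount): ...", "Archethic uses a transaction chain ..."]
vectors = model.encode(chunk_texts)  # shape (num_chunks, 384) for this model
# Each vector would then be inserted into the `chunks` collection alongside
# the chunk text and a reference to its parent metadata document.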

4.4.2 Summarizers

  • File: summarizers.py
  • Current Approach: A T5-based summarizer, which can compress chunked text into shorter paragraphs.
  • Optional Step: Summaries are not always generated for large volumes of data, as it can be time-consuming. Instead, it may be triggered on demand or for specific file types.
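
For reference, a T5-based summary can be produced with the Hugging Face transformers pipeline as sketched below; the actual summarizers.py wrapper and model size may differ.

from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")
chunk_text = "Archethic's transaction chain replaces global consensus with ..."
result = summarizer(chunk_text, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])  # short abstract of the chunk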

4.4.3 Keyword Extraction

  • File: keywords_extractors.py
  • Extractor Used: Yake (YakeKeywordExtractor) by default.
  • Purpose: Assign up to 10 keywords to each extracted document or chunk to facilitate topic tagging and quick searching.
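
A small example of Yake-based extraction in the spirit of YakeKeywordExtractor; the language and phrase-length settings shown are illustrative.

import yake

extractor = yake.KeywordExtractor(lan="en", n=2, top=10)  # up to 10 keywords, max 2-word phrases
keywords = extractor.extract_keywords("Archethic smart contracts run on a transaction chain ...")
for phrase, score in keywords:
    print(phrase, score)  # in Yake, a lower score means a more relevant keyword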

4.5 RAG Engine and Retrieval

4.5.1 Current Retrieval Mechanism

  • MongoDB:
    • Currently, the system relies on MongoDB queries combined with embedded vectors.
    • The approach is simpler but less optimal for high-scale similarity searches.
    • A fallback to “metadata-based” queries also exists (e.g., matching chunk labels, file names, or keywords).

Note: A skeleton rag_engine.py file exists (or is planned) to unify retrieval logic. In future iterations, it will orchestrate how queries are parsed, how chunk searches are performed, and how fallback logic is managed.
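
The current, non-indexed retrieval can be pictured as the sketch below: load chunk embeddings from MongoDB and rank them by cosine similarity in Python. Field names follow this document; the real logic may pre-filter by repository or keywords.

import numpy as np

def naive_top_k(query_vector, chunk_docs, top_k=5):
    """chunk_docs: iterable of chunk documents with an 'embedding' field (list of floats)."""
    chunk_docs = list(chunk_docs)
    matrix = np.asarray([doc["embedding"] for doc in chunk_docs], dtype=np.float32)
    query = np.asarray(query_vector, dtype=np.float32)
    # Cosine similarity between the query and every stored chunk embedding.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query) + 1e-9)
    best = np.argsort(scores)[::-1][:top_k]
    return [chunk_docs[i] for i in best]

# Example: naive_top_k(model.encode("transaction throughput"), db["chunks"].find({}))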

4.5.2 Planned Vector Index

  • Faiss / LlamaIndex:
    • The codebase contains placeholders or “TODO” comments for hooking into a vector database.
    • Once integrated, each chunk embedding is pushed to the index, enabling top-k similarity searches.

4.5.3 LLM Integration

  • Connection Points:
    • The pipeline can feed top-k chunks to an LLM.
    • Chat-based usage is prototyped in main.py, but a more robust “RAG engine” that orchestrates chunk retrieval is under development.
  • Configurable Model:
    • The user can specify Mistral, DeepSeek, OpenAI GPT-4, etc. The code is structured so that each model implements a standardized interface (see llm_manager.py or llm_interface.py in the LLMs folder).

4.6 Local Server and CLI

4.6.1 Local Storage Server

  • File: local_storage_server.py
  • Role: Serves locally stored files (e.g., raw code from GitHub) through a simple HTTP interface.
  • Usage:
    • Typically runs at localhost:8000, so any external file references become accessible via standard HTTP requests.
  • Extension:
    • Could be expanded into a minimal REST API for retrieving chunk data or for chatbot integration.

4.6.2 main.py CLI

  • Interactive Menu: Provides the user with commands such as:

    • Listing GitHub repositories and updating them.
    • Fetching and storing new commits/issues/PR data.
    • Updating metadata for selected collections (files, commits, issues, etc.).
    • Starting/Stopping the local storage server.
    • Basic logs viewing (to see if the server or collector is running smoothly).
  • Roadmap:

    • Possibly a “Chat” option (still partially implemented) to let the user query the system from the CLI.

4.7 Example Data Flow

Consider a scenario: New commit made to archethic-foundation/core-engine with a patch for improving transaction logic.

  1. GitHubCollector → detects new commits since last fetch.
  2. Commit details stored in commits collection, associated files in main_files.
  3. MetadataManager → sees an unprocessed commit or file.
  4. Chunk the file if it’s new or changed.
  5. Embed the chunk text, store vectors in chunks.
  6. On user query, e.g., “How does Archethic handle transaction throughput?”
  7. The system currently searches chunks by textual keywords or naive embedding similarity in MongoDB; in the future, this step will query the Faiss index instead.
  8. Assembles the top relevant chunks, merges them, and sends them to an LLM for an answer.

4.8 Limitations in the Current Implementation

  1. No Dedicated Vector DB

    • Reliance on MongoDB’s basic queries or naive embedding search can be slow at scale.
    • The upcoming integration of Faiss or LlamaIndex is expected to greatly improve retrieval performance.
  2. Minimal RAG Engine

    • There is no sophisticated fallback logic (e.g., deciding whether to query the knowledge base at all).
    • Multi-step reasoning (like generating SQL or advanced queries) is not implemented yet.
  3. Limited UI/UX

    • The user interacts via CLI or a very basic local HTTP server.
    • A modern, user-friendly solution could be:
      1. A minimal web-based chatbot (e.g., built with React, Vue, or Svelte) that calls the RAG API, displaying results in real time.
      2. Plugin-based integration with popular chat platforms (e.g., Slack, Discord) for collaborative usage.
    • Either approach would prioritize ease of access, a clean interface, and real-time feedback for end users.
  4. Automatic Summarization

    • Summaries can be expensive. Currently, summary generation is triggered manually for large volumes of data.
    • Two possible enhancements:
      1. Batch Summarization: Summaries are generated in bulk after certain thresholds (e.g., once daily or weekly).
      2. More Efficient Model: A GPU-accelerated or lighter summarization model could handle large-scale data without blocking processes.

4.9 Summary

The current state of aeRAG already provides:

  • End-to-end GitHub data collection and structured storage in MongoDB.
  • Document chunking with strategies for both textual and code-based content.
  • Embedding-based metadata generation (via SentenceTransformers), plus optional summaries and keywords.
  • A CLI-driven workflow to fetch data, chunk, embed, and prepare it for retrieval.

Upcoming expansions—like vector database integration, an enhanced RAG engine, and more sophisticated LLM prompts—will transform this from a basic aggregator into a fully featured retrieval and conversation system.

This chapter reflects the codebase as of 02/03/2025; subsequent updates will be tracked in a dedicated changelog for easier versioning.



5. Technical Proposals for Next Steps

Even though aeRAG already handles core ingestion, chunking, and basic retrieval, several key technical developments are needed for the system to reach robust, production-grade status. This chapter outlines future proposals that will enhance aeRAG’s performance, user experience, and feature set—particularly focusing on a minimal-yet-performant RAG engine, testing strategy, summarization improvements, user interface enhancements, and the idea of a GPT-like agent enriched by RAG.


5.1 Minimal Yet Performant RAG Engine

5.1.1 Motivation

  • Immediate Testing & Validation: To ensure reliability and gather baseline metrics, aeRAG needs a RAG engine that’s straightforward yet leverages a vector database for quick, similarity-based retrieval.
  • Scalable Foundation: A “lightweight” design means simpler debugging, clearer performance data, and a stable foundation for subsequent feature expansions.

5.1.2 Proposed Approach

  1. Faiss Integration
    • Core Step: Store new or updated chunk embeddings in a Faiss index instead of relying solely on MongoDB for similarity searches.
    • Implementation:
      • Develop a service (e.g., faiss_index_manager.py) to push vectors from the chunks collection into Faiss.
      • Adapt the retrieval pipeline to query Faiss for top-k results, then map back to chunk data.
  2. Light RAG Engine
    • Focus: Keep logic minimal—query embeddings, fetch top-k chunks, assemble context, and invoke the chosen LLM.
    • Benefit: Clear testable steps, easier to measure latency, resource usage, and accuracy.

5.1.3 Advantages

  • Reduced Complexity: Eliminates advanced multi-step logic until we have proven performance.
  • Better Data for Tests: With Faiss or a similar vector index, we can run systematic performance checks on speed and recall quality.

5.2 Testing Strategy: Unit, Non-Regression, and Performance

5.2.1 Current Gap

  • No Existing Tests: There are currently no automated tests for the collectors, the chunkers, or the RAG pipeline.

5.2.2 Planned Testing Layers

  1. Unit Tests
    • Scope: Each module (collectors, chunking, embedding, LLM connectors) gets dedicated tests.
    • Example: Verifying GitHubCollector properly handles pagination; confirming the chunk factory returns CodeChunkingStrategy for .py.
  2. Non-Regression Tests
    • Goal: Protect against breakage of previously functioning features as new code is introduced.
    • Method: Maintain a canonical test dataset (sample code commits, issues, etc.). Each code update triggers a pipeline run, ensuring outputs remain consistent.
  3. Performance Tests
    • Usage: Evaluate retrieval speed, memory usage, and scaling.
    • Implementation: Query the minimal Faiss-based engine with realistic data volume to see if it meets targeted latencies.

5.2.3 Release Confidence

  • Once all three test layers pass, changes can be tagged as a “stable” or “release” build, ensuring production readiness.

5.3 Summarization Revisited

5.3.1 Observations from T5 and Mistral-7B

  • T5: While easy to integrate, it has proven suboptimal for Archethic’s specialized content. The process is also relatively slow for large batches.
  • Mistral-7B Instruct: Delivers more relevant summaries but is even more GPU-heavy, raising cost and infrastructure needs.

5.3.2 Potential Strategies

  1. Optimized or Smaller Models
    • Investigate quantization, model distillation, or specialized short-input summarizers that reduce load.
  2. Selective Summaries
    • Summaries are generated only for top-accessed or frequently queried chunks, conserving resources.

5.3.3 Benefits

  • Optional but Valuable: Enhanced summarizations still have potential for improving retrieval context, especially if done selectively.
  • User-Focused: Detailed, relevant summaries help developers, investors, or community members grasp key points without parsing entire chunks.

5.4 User Interface Enhancements

5.4.1 Telegram or Discord Chat Integration

  • Rationale: Users often prefer a familiar messaging platform.
  • Telegram Bot: Commands (e.g., /rag "search query") forward to aeRAG, and the bot replies with top chunks or an LLM-generated answer.
  • Discord Bot: Similarly, the bot listens in specified channels, responding with context from the RAG engine.

5.4.2 Modern Web UI

  • Approach:
    • Build a minimal React or Vue single-page app that communicates with an HTTP/REST (or GraphQL) endpoint.
    • Provide a sleek chat window, plus references for chunk sources.
  • Benefit: A more general interface accessible to those not using Telegram/Discord or the CLI.

5.5 GPT-Like Agent Enriched by RAG

5.5.1 Concept

  • Hybrid Approach: Combine aeRAG’s domain knowledge with advanced OpenAI ChatGPT+ models.
  • User Experience: End users “talk to” a GPT instance that’s automatically augmented with fresh, context-specific data from Archethic repos.

5.5.2 Possible Implementations

  1. ChatGPT Plugin
    • The plugin calls aeRAG for relevant chunks whenever the user asks a domain-specific question.
    • Must follow OpenAI’s plugin guidelines.
  2. Proxy API
    • A server intercepts user queries, retrieves context from aeRAG, and injects it into ChatGPT’s system prompts.
    • Manages session and token usage carefully.

5.5.3 Anticipated Gains

  • Real-Time Augmentation: Even advanced GPT models lack up-to-date or project-specific knowledge; RAG bridging solves that.
  • Easy Adoption: People already using ChatGPT+ can retain their workflow while gaining specialized insights on Archethic.

5.6 Additional Data Sources

5.6.1 Other Git Platforms

  • Adaptation: Minimal changes for GitLab or Bitbucket collectors, reusing the GitHubCollector’s structure.

5.6.2 Community & Governance

  • On-Chain Proposals, Forum Threads:
    • Each post/proposal can be chunked and embedded, letting the user query historical or current decisions easily.

5.6.3 Index Refreshes

  • Scheduled Jobs: Nightly or weekly re-checking of large sources, ensuring the knowledge base remains updated.

5.7 Logging, Analytics, and Monitoring

5.7.1 Structured Logging

  • Objective: Collect queries, chunk retrieval data, and LLM latency in JSON logs.
  • Benefit: Facilitates debugging, usage tracking, and compliance with security or privacy standards.

5.7.2 Usage Analytics

  • Focus: Identify the most common search queries or bottlenecks.
  • Outcome: Informs future data collection priorities, helps refine chunking or summarization.

5.7.3 Resource Monitoring

  • GPU/CPU Tracking: Summarization or large LLM inferences can spike resource usage.
  • Alerts: DevOps can set thresholds that trigger notifications if usage is excessive or if a collector fails.

5.8 Conclusion

Implementing Faiss-based retrieval for a minimal RAG engine, establishing unit/non-regression/performance tests, optimizing summaries, and enhancing user interfaces (Telegram/Discord or a web UI) all represent critical next steps. Furthermore, the notion of a GPT-like system—where ChatGPT+ is seamlessly augmented by aeRAG’s specialized data—captures the long-term vision of delivering advanced, domain-aware conversations.

In Chapter 6, we will detail the Roadmap and Priorities, mapping these technical proposals onto specific milestones, timelines, and resource planning. This ensures every enhancement—whether for performance, user experience, or LLM integration—fits into a coherent path forward for aeRAG’s evolution.


6. Roadmap and Priorities

This chapter provides a structured plan for implementing the enhancements described throughout the document. By dividing the roadmap into short-, medium-, and long-term milestones, we ensure aeRAG evolves incrementally yet remains focused on delivering tangible improvements at each phase.


6.1 Overview of Roadmap Structure

  1. Short-Term (1–3 Months)
    • Activities that are essential for establishing baseline stability, performance, and basic user-facing features.
  2. Medium-Term (3–6 Months)
    • More advanced capabilities, such as specialized summarizations and broader UI support, once the core RAG engine is stable.
  3. Long-Term (6+ Months)
    • Scaling to large datasets, advanced retrieval logic, deeper LLM integrations, and potential plugin-based GPT augmentation.

Each section below lists priority tasks (P1, P2, etc.) to help guide resource allocation and sprint planning.


6.2 Short-Term Milestones (1–3 Months)

6.2.1 Integrate Faiss-Based Minimal RAG Engine

  • P1: Implement Faiss Index Manager
    • Write a service (e.g., faiss_index_manager.py) to store and query chunk embeddings.
    • Incorporate periodic synchronization between the chunks collection and the Faiss index.
  • P1: Refactor Retrieval Flow
    • Update the RAG engine logic (minimal version) to embed user queries, retrieve the top-k similar chunks from Faiss, and pass them to the chosen LLM.
  • P2: Basic Performance Benchmarks
    • Use a small test dataset to measure average retrieval latency, memory usage, and throughput under typical load.

6.2.2 Establish Testing Foundations

  • P1: Unit Tests
    • Create a skeleton test suite in tests/ to cover each module: collectors, chunkers, embeddings, LLM connectors.
    • Examples: verifying GitHub pagination, ensuring ChunkingStrategyFactory picks the correct strategy.
  • P1: Non-Regression & Performance Tests
    • Define a canonical test dataset of code, commits, and issues.
    • Automate the pipeline in CI, checking that retrieval and indexing produce consistent results and meet basic time thresholds.
  • P2: First Stable Release Tag
    • Once tests reliably pass and Faiss integration is validated, tag a stable release (e.g., v0.1) for internal or public usage.

6.2.3 Initial User Interface Enhancements

  • P2: Telegram Bot (Prototype)
    • Provide a minimal chat-based interface for querying the new RAG engine.
    • Deploy in a test environment to gather feedback on usability.
  • P3: CLI Improvements
    • Expand main.py with a simple “chat” command that demonstrates end-to-end retrieval using Faiss.

Goal: Emerge from the short-term phase with a solid, test-backed RAG foundation and Faiss-accelerated retrieval ready to handle typical user queries.


6.3 Medium-Term Milestones (3–6 Months)

6.3.1 Summarization Optimization

  • P1: Selective Summarization
    • Summaries are generated only for frequently accessed or newly updated chunks, balancing resource costs.
    • Evaluate model performance (T5, Mistral-7B, or alternatives) via smaller-scale deployments.
  • P2: Batch Summaries
    • Schedule a nightly or weekly job to summarize new chunks en masse, possibly offloading to a GPU server if required.

6.3.2 UI & Chat Integrations

  • P1: Enhanced Telegram (or Discord) Bot
    • Add more advanced features: conversation state, partial multi-step retrieval.
    • Possibly store user messages and RAG responses for analytics (with privacy considerations).
  • P2: Prototype Web UI
    • Develop a minimal React or Vue front-end with a chat window, chunk references in the response, and optional user authentication.
    • Gather user feedback to refine the design and plan a full MVP launch.

6.3.3 Extended Data Sources

  • P2: Additional Git Platforms
    • Adapt existing collectors for GitLab or Bitbucket if needed by the Archethic ecosystem or external adopters.
  • P3: Community & Governance
    • Index forum discussions, on-chain governance proposals, or relevant archives for a richer knowledge base.

6.3.4 Testing Maturity

  • P1: Expanded Non-Regression Coverage
    • Add more scenarios (larger repos, code merges, governance data) to ensure robust pipeline performance.
  • P2: Performance Stress Tests
    • Evaluate system behavior under higher concurrency or data volumes.
    • Identify bottlenecks (in chunking, embeddings, or Faiss indexing) and optimize accordingly.

6.4 Long-Term Milestones (6+ Months)

6.4.1 GPT-Like Agent Enriched by RAG

  • P1: Plugin or Proxy Approach
    • Create a ChatGPT plugin or a proxy server that injects aeRAG chunks into GPT prompts.
    • Overcome constraints related to prompt size, session management, and plugin policy guidelines.
  • P2: Continuous Context Updates
    • Automate retrieval of the latest Archethic data, ensuring near real-time augmentation for GPT-based queries.
    • Integrate usage analytics to measure query frequency, chunk relevance, and user satisfaction.

6.4.2 Advanced RAG Retrieval Logic

  • P2: Multi-Step Reasoning
    • The RAG engine orchestrates iterative queries. For complex questions, the system re-retrieves or refines its context.
    • Could utilize an LLM-based “planner” that decides which chunks or data sources to target next.
  • P3: SQL-Based or GraphQL-Based RAG
    • Let the LLM generate structured queries for more refined data fetching (e.g., selecting commits by date range, labeling issues by state).

6.4.3 Scalability and Architecture Refinements

  • P1: Distributed Faiss or Milvus
    • Move from a single-node vector index to a multi-node cluster for high availability in enterprise-scale usage.
  • P2: Microservices
    • Split chunking, embeddings, and retrieval tasks into separate services if usage patterns or data volumes justify it.

6.5 Priorities in Context

  Priority   Task                                             Timeframe
  P1         Faiss integration, minimal RAG engine            Short-Term
  P1         Core test suite (unit, non-regression)           Short-Term
  P2         Basic Telegram bot                               Short-Term
  P1         Summarization optimization (selective/batch)     Medium-Term
  P1         Enhanced Telegram/Discord or Web UI              Medium-Term
  P2         GPT-like augmentation or plugin approach         Long-Term
  P2         Multi-step RAG reasoning                         Long-Term
  P3         Additional data collectors (GitLab, forums)      Medium-Term

Note: This table is a simplified summary. Actual priority may shift based on resource availability, user feedback, and emerging business needs.


6.6 Conclusion

The short-, medium-, and long-term milestones outlined above aim to transform aeRAG from a basic aggregator of Archethic data into a powerful, user-friendly RAG solution. By prioritizing Faiss-based minimal retrieval, rigorous testing, and UI enhancements in the near term, we lay the foundation for more ambitious goals—like GPT augmentation and advanced multi-step retrieval—in subsequent phases.

Through these incremental steps, aeRAG will become increasingly stable, flexible, and valuable to Archethic’s developers, investors, and community members, while remaining adaptable enough to serve other ecosystems or projects that require reliable retrieval-augmented AI.


7. Conclusion

7.1 Recap of aeRAG’s Purpose

Archethic is a powerful transaction chain offering advanced capabilities that surpass many traditional blockchains. However, complex technology can be an adoption barrier—both for developers seeking to build DApps and for potential investors or community members seeking clear, reliable information.

In response, aeRAG (ArchEthic RAG) was conceived as a Retrieval-Augmented Generation system to:

  • Collect, index, and enrich Archethic’s growing dataset (commits, issues, pull requests, documentation, discussions, etc.).
  • Enable natural language access to that knowledge via state-of-the-art LLMs.
  • Scale seamlessly to new data sources, new retrieval engines, and advanced user interfaces.

7.2 Key Insights from This Document

  1. Clean Data Pipeline
    • Chapter 3 and 4 demonstrated how GitHub data is collected, stored in MongoDB, and chunked. This pipeline ensures a steady flow of up-to-date information for later retrieval.
  2. Metadata and Chunking
    • Adopting code-specific or text-specific chunking strategies (Chapter 4) yields more targeted embeddings, enabling relevant, high-quality matches when a user queries the system.
  3. Embedding and Summarization
    • SentenceTransformers embeddings provide a semantic layer, while optional summarization can boost clarity (albeit at higher compute cost).
  4. Performance Needs
    • The shift to a Faiss-based RAG engine (Chapter 5 and 6) lays the groundwork for fast nearest neighbor lookups, crucial as data scales.
  5. Testing and Roadmap
    • Thorough testing—unit, non-regression, and performance—underpins release confidence (Chapters 5 and 6).
    • The roadmap (Chapter 6) delineates short-, medium-, and long-term priorities, ensuring a progressive yet organized evolution.

7.3 Anticipated Impact

By merging domain-specific knowledge with modern LLMs, aeRAG can drastically:

  • Reduce onboarding friction for new developers on Archethic.
  • Provide transparency for investors or partners by surfacing relevant commits, proposals, and metrics.
  • Streamline the collaborative workflow, letting the Archethic Foundation and community easily retrieve and cross-reference historical context (e.g., searching commits that introduced certain protocol changes).

7.4 Reusability Beyond Archethic

An important design criterion for aeRAG is replicability. Although the immediate focus is on Archethic data, the same core architecture—GitHub collectors, chunk-based embeddings, Faiss integration, etc.—can be extended to:

  • Other blockchain ecosystems (e.g., for verifying or explaining code, proposals, and community discussions).
  • Non-blockchain domains that need a robust RAG pipeline (documentation portals, corporate wikis, specialized research archives, etc.).

7.5 Next Steps

From the perspective of ongoing development:

  1. Implement the Minimal Faiss RAG Engine
    • This addresses immediate performance bottlenecks and sets the stage for advanced queries.
  2. Establish Comprehensive Testing
    • Ensure unit and performance tests run in CI to catch regressions and validate readiness for stable releases.
  3. Enhance UI and Accessibility
    • Whether through a Telegram/Discord bot or a minimal web app, user-friendliness is key to encouraging adoption.
  4. Plan GPT-Augmentation
    • Investigate an official ChatGPT plugin or a proxy approach to seamlessly integrate aeRAG context into GPT-4 or other advanced LLMs.

7.6 Final Remarks

aeRAG stands as a scalable, modular RAG system poised to demystify Archethic for a wide audience—developers, investors, and community members alike. By adhering to Clean Code principles and robust design patterns, the project remains maintainable and extensible, ready to incorporate upcoming innovations in both the Archethic chain and the broader AI landscape.

In the spirit of open collaboration, the next releases of aeRAG will continue to iterate on performance, testing, and user experience. Contributions from the community—be they in chunking optimizations, new LLM connectors, or advanced retrieval techniques—will only deepen the value that aeRAG offers to Archethic and beyond.

With a well-defined roadmap, a capable pipeline for data, and a strong vision for future expansions, aeRAG is prepared to grow alongside Archethic, becoming an indispensable resource for knowledge retrieval and user engagement in the months and years ahead.



8. Appendices and Resources

This chapter gathers supporting information and technical references that complement the core sections of this document. It serves as a handy reference for developers extending or maintaining aeRAG, as well as newcomers seeking deeper insights into the technologies and design principles behind it.


8.1 References and Documentation

8.1.1 MongoDB

  • Official Docs: https://docs.mongodb.com
    • Comprehensive guides on queries, indexing strategies, and replication.
    • Useful for optimizing queries, especially with large collections like commits, issues, chunks.

8.1.2 GitHub API

  • GitHub REST API Documentation: https://docs.github.com/en/rest
    • Endpoints for repositories, commits, issues, pull requests, etc.
    • Rate limit details and examples of pagination logic.

8.1.3 Vector Databases

  • Faiss: https://github.com/facebookresearch/faiss
    • High-performance nearest neighbor search, recommended for the Minimal RAG approach.
    • Includes CPU- and GPU-based indexing strategies, memory management tips.
  • LlamaIndex: https://github.com/jerryjliu/llama_index
    • Alternative indexing library with a more Pythonic interface, though it may require additional overhead to integrate seamlessly.
  • Milvus: https://milvus.io/
    • A distributed, cloud-native vector database, scalable for enterprise-level usage.

8.1.4 Embeddings and Summarization

8.1.5 OpenAI GPT and ChatGPT Plugins


8.2 Code Samples and Templates

8.2.1 Minimal Faiss Integration Snippet

# faiss_index_manager.py (example)
import faiss
import numpy as np

class FaissIndexManager:
    def __init__(self, dim, index_path=None):
        self.dim = dim
        # Example: Using a simple IndexFlatL2
        self.index = faiss.IndexFlatL2(dim)
        
        if index_path:
            self.load_index(index_path)

    def add_vectors(self, vectors: np.ndarray):
        # Faiss expects C-contiguous float32 arrays of shape (num_vectors, dim).
        vectors = np.ascontiguousarray(vectors, dtype=np.float32)
        self.index.add(vectors)

    def search(self, query_vector: np.ndarray, top_k: int = 5):
        # query_vector shape: (1, dim); returns (distances, indices), each of shape (1, top_k).
        query_vector = np.ascontiguousarray(query_vector, dtype=np.float32)
        distances, indices = self.index.search(query_vector, top_k)
        return distances, indices

    def save_index(self, path):
        faiss.write_index(self.index, path)

    def load_index(self, path):
        self.index = faiss.read_index(path)

Notes:

  • This snippet demonstrates the bare minimum for storing vectors and retrieving top-k neighbors.
  • Real-world usage: You’d likely map each vector to a chunk_id for referencing back to chunks in MongoDB.
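
One way to do that mapping, shown here as an assumption rather than existing code, is to wrap the flat index in a faiss.IndexIDMap so each vector is stored under an explicit 64-bit id that can reference a chunk document:

import faiss
import numpy as np

dim = 384
index = faiss.IndexIDMap(faiss.IndexFlatL2(dim))

vectors = np.random.rand(3, dim).astype(np.float32)    # stand-in for chunk embeddings
chunk_ids = np.array([101, 102, 103], dtype=np.int64)  # numeric ids referencing chunk documents
index.add_with_ids(vectors, chunk_ids)

query = np.random.rand(1, dim).astype(np.float32)
distances, ids = index.search(query, 2)
print(ids)  # returns the stored chunk ids rather than positional indices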

8.2.2 Summarization Trigger Example

def maybe_summarize_chunk(chunk_text, usage_count, summarizer):
    """ Summarize a chunk if it's heavily accessed or passes a specific threshold. """
    threshold = 10  # example usage threshold
    if usage_count > threshold:
        summary = summarizer.summarize(chunk_text)
        return summary
    return None

Notes:

  • Integrates domain logic (e.g., usage_count from logs or metadata) to decide if summarization is warranted.
  • Minimizes unnecessary compute for rarely accessed chunks.

8.3 Clean Code Best Practices

aeRAG embraces Robert C. Martin’s “Clean Code” principles to ensure maintainability and readability:

  1. Meaningful Names
    • Use descriptive function/class/variable names (fetch_commits(), MetadataManager), avoiding abbreviations that obscure meaning.
  2. Single Responsibility Principle (SRP)
    • Keep modules like github_collector.py focused on one main concern—fetching GitHub data—rather than mixing in chunking or storage logic.
  3. DRY (Don’t Repeat Yourself)
    • Factor out repeated code into utility functions or base classes, e.g., github_request.py for common GitHub API calls.
  4. Small Functions
    • Limit function size to a single screenful or less, ensuring clarity. Larger logic can be subdivided into helper methods.
  5. Commenting and Docstrings
    • Provide docstrings at the start of each class or method clarifying purpose, parameters, and return values.
    • Use inline comments sparingly—only when the rationale or approach is non-trivial.

8.4 Additional Design Patterns and Architectural Tips

  1. Factory Pattern
    • Used in chunking_strategy_factory.py to pick the correct strategy (text vs. code).
    • Encourages future extensions (e.g., specialized chunkers for JSON configs, logs, or plain text with fewer lines).
  2. Adapter / Wrapper Classes
    • Could be introduced for bridging external APIs or libraries (like Faiss or a Slack/Telegram API).
    • Keeps your core logic independent of specific library details.
  3. Microservice Architecture (Long-Term)
    • If data or user requests grow significantly, splitting the chunking pipeline, vector indexing, and LLM requests into separate services can improve concurrency and maintainability.

8.5 Potential Pitfalls & Troubleshooting

  1. GitHub Rate Limits
    • Use environment variables (e.g., GITHUB_TOKEN) and implement exponential backoff to avoid frequent “403 Forbidden” errors.
  2. Memory Management with Faiss
    • Large embeddings (millions of vectors) can exceed RAM. Investigate GPU-based indexes or sharding across multiple machines.
  3. LLM Token Limits
    • When merging multiple chunks for a large query, you can exceed an LLM’s context window (e.g., 8k or 32k tokens). Implement chunk pruning or summarization to fit within constraints; a simple pruning sketch follows this list.
  4. Summarization Quality
    • Domain-specific or newly introduced models (Mistral) may require custom fine-tuning for best results.
    • Ensure your hardware (GPU VRAM) can handle the chosen model at acceptable speeds.
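
A simple character-budget pruning sketch (a real implementation would count tokens with the target model's tokenizer):

def prune_chunks(chunks, max_chars=12000):
    """Keep the highest-ranked chunks (assumed already sorted by relevance) within a budget."""
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break
        kept.append(chunk)
        used += len(chunk)
    return kept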

8.6 Glossary of Terms

  • RAG (Retrieval-Augmented Generation):
    AI approach that fetches relevant documents/chunks before or during generating a response.
  • Chunk:
    A smaller segment of a larger document (e.g., code file, markdown docs).
  • Embedding:
    A numeric vector representing text in a semantic space, enabling similarity-based searches.
  • Faiss:
    A C++ library from Facebook (Meta) for efficient similarity search of dense vectors.
  • Mistral-7B:
    A 7-billion-parameter LLM known for good instruction-following, albeit with higher resource demands.
  • ChatGPT Plugin:
    An add-on that extends ChatGPT with domain-specific capabilities, bridging external data sources.

8.7 Future Reading and Exploration

  1. LangChain: https://github.com/hwchase17/langchain
    • A popular library for building LLM “chains,” relevant for multi-step reasoning in advanced RAG engines.
  2. RLHF (Reinforcement Learning from Human Feedback)
    • Potential method to refine LLM behavior using user feedback from aeRAG queries.
  3. Cloud-Native Vector DB Services
    • Vendors like AWS OpenSearch, Azure Cognitive Search, or managed Milvus services if you prefer not to self-host a vector index at scale.

8.8 Conclusion

This appendix—covering references, code snippets, best practices, design insights, and a glossary—aims to empower developers and stakeholders with the context and resources needed to extend aeRAG confidently. Whether you plan to integrate a new summarizer, adopt additional Git platforms, or scale up Faiss to multi-machine setups, these tips and references provide a starting point.

As aeRAG grows—alongside the Archethic ecosystem—so too will the supporting tools, patterns, and advanced features. By adhering to clean code principles, leveraging well-documented libraries, and staying mindful of potential pitfalls, you ensure aeRAG remains adaptable, performant, and developer-friendly for the long haul.
