The Content Engine analyzes and compares multiple PDF documents using Retrieval-Augmented Generation (RAG). It integrates a backend framework, a vector store, an embedding model, and a local language model (LLM), along with a Streamlit frontend for user interaction.
- Backend Framework: LangChain, a toolkit for building LLM applications with a focus on retrieval-augmented generation. Installation: `pip install langchain`
- Frontend Framework: Streamlit, an open-source app framework for creating interactive web applications. Installation: `pip install streamlit`
- Vector Store: ChromaDB, chosen for its efficient storage and querying of embeddings. Installation: `pip install chromadb`
- Embedding Model: Sentence Transformers, a local embedding model used to generate embeddings from PDF content. Installation: `pip install sentence-transformers`
- Local Language Model (LLM): Hugging Face Transformers, used to run a local model for processing queries and generating insights. Installation: `pip install transformers`
Download and preprocess the three provided PDF documents (Alphabet Inc., Tesla Inc., Uber Technologies Inc.).
Use PyMuPDF or PyPDF2 to extract text and structure from PDFs.
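Extraction and chunking could look like the sketch below, using PyMuPDF. The function names, chunk size, and overlap are illustrative choices, not part of the project:

```python
def extract_pdf_text(pdf_path):
    """Extract plain text from every page of a PDF using PyMuPDF."""
    import fitz  # PyMuPDF
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks so each fits the embedding model."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Overlapping chunks help preserve context that would otherwise be cut at chunk boundaries.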
Use a Sentence Transformers model to create embeddings for the extracted document content.
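A minimal embedding sketch follows; the model name `all-MiniLM-L6-v2` is one common choice, not mandated by the project. The `cosine_similarity` helper shows how two embeddings can be compared:

```python
def embed_chunks(chunks, model_name="all-MiniLM-L6-v2"):
    """Embed a list of text chunks with a local Sentence Transformers model."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return model.encode(chunks).tolist()  # list of float vectors

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)
```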
Implement functions to persist embeddings into ChromaDB vector store.
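Persistence could be sketched as below. The collection name `content_engine`, the database path, and the `make_ids` helper are illustrative assumptions:

```python
def make_ids(doc_name, n):
    """Stable per-document chunk IDs, e.g. 'tesla-0', 'tesla-1', ..."""
    return [f"{doc_name}-{i}" for i in range(n)]

def persist_embeddings(chunks, embeddings, doc_name, db_path="./chroma_db"):
    """Store chunk texts and embeddings in a persistent ChromaDB collection."""
    import chromadb
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection("content_engine")
    collection.add(
        ids=make_ids(doc_name, len(chunks)),
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": doc_name}] * len(chunks),
    )
    return collection
```

Tagging each chunk with a `source` metadata field lets later queries restrict results to a single document, which is useful for comparisons.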
Define retrieval tasks based on document embeddings using ChromaDB.
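Retrieval against the stored embeddings might look like this sketch; the `embed_fn` parameter, the collection name, and the metadata filter shape are assumptions layered on ChromaDB's query API:

```python
def build_source_filter(doc_names):
    """ChromaDB metadata filter restricting results to the named documents."""
    return {"source": {"$in": list(doc_names)}}

def retrieve(query, embed_fn, doc_names=None, db_path="./chroma_db", top_k=3):
    """Embed the query and fetch the most similar chunks from ChromaDB."""
    import chromadb
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection("content_engine")
    kwargs = {"query_embeddings": [embed_fn(query)], "n_results": top_k}
    if doc_names:
        kwargs["where"] = build_source_filter(doc_names)
    results = collection.query(**kwargs)
    return results["documents"][0]  # top-k chunk texts for the single query
```

Passing `doc_names=["alphabet", "tesla"]` would limit a comparative question to those two filings.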
Set up a local LLM via Hugging Face Transformers to generate contextual insights from the retrieved content.
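One way to wire a local model in is the sketch below; the model choice (`google/flan-t5-base`) and the prompt template are assumptions, not the project's fixed design:

```python
def build_prompt(question, context_chunks):
    """Format retrieved context and the user question into a single prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def generate_insight(question, context_chunks, model_name="google/flan-t5-base"):
    """Answer a question grounded in retrieved chunks using a local HF model."""
    from transformers import pipeline
    generator = pipeline("text2text-generation", model=model_name)
    prompt = build_prompt(question, context_chunks)
    return generator(prompt, max_new_tokens=200)[0]["generated_text"]
```

Grounding the prompt in retrieved chunks, rather than asking the model directly, is what makes this a RAG pipeline.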
Use Streamlit to create a user-friendly interface for querying and displaying comparative insights from documents.
- Clone the repository:

  ```
  git clone https://github.com/yourusername/content-engine.git
  cd content-engine
  ```
- Install dependencies: `pip install -r requirements.txt`
- Run the Streamlit app: `streamlit run content_engine.py`