The Content Engine analyzes and compares multiple PDF documents using Retrieval-Augmented Generation (RAG). It integrates a backend framework, a vector store, an embedding model, and a local language model (LLM), along with a Streamlit frontend for user interaction.
- Backend Framework: LangChain, a toolkit for building LLM applications with a focus on retrieval-augmented generation. Installation: `pip install langchain`
- Frontend Framework: Streamlit, an open-source app framework for creating interactive web applications. Installation: `pip install streamlit`
- Vector Store: ChromaDB, chosen for its efficient storage and querying of embeddings. Installation: `pip install chromadb`
- Embedding Model: Sentence Transformers, a local embedding model used to generate embeddings from PDF content. Installation: `pip install sentence-transformers`
- Local Language Model (LLM): Hugging Face Transformers, used to run a local model for processing queries and generating insights. Installation: `pip install transformers`
Download and preprocess the three provided PDF documents (Alphabet Inc., Tesla Inc., Uber Technologies Inc.).
Use PyMuPDF or PyPDF2 to extract text and structure from PDFs.
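Extraction and chunking could look like the sketch below, using PyMuPDF. The function names, chunk size, and overlap are illustrative choices, not part of the project:

```python
def extract_pdf_text(pdf_path):
    """Extract plain text from every page of a PDF using PyMuPDF."""
    import fitz  # PyMuPDF
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks so each fits the embedding model."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Overlapping chunks help preserve context that would otherwise be cut at chunk boundaries.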
Use a Sentence Transformers model to create embeddings for the extracted document content.
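A minimal embedding sketch follows; the model name `all-MiniLM-L6-v2` is one common choice, not mandated by the project. The `cosine_similarity` helper shows how two embeddings can be compared:

```python
def embed_chunks(chunks, model_name="all-MiniLM-L6-v2"):
    """Embed a list of text chunks with a local Sentence Transformers model."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return model.encode(chunks).tolist()  # list of float vectors

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)
```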
Implement functions to persist embeddings into ChromaDB vector store.
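Persistence could be sketched as below. The collection name `content_engine`, the database path, and the `make_ids` helper are illustrative assumptions:

```python
def make_ids(doc_name, n):
    """Stable per-document chunk IDs, e.g. 'tesla-0', 'tesla-1', ..."""
    return [f"{doc_name}-{i}" for i in range(n)]

def persist_embeddings(chunks, embeddings, doc_name, db_path="./chroma_db"):
    """Store chunk texts and embeddings in a persistent ChromaDB collection."""
    import chromadb
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection("content_engine")
    collection.add(
        ids=make_ids(doc_name, len(chunks)),
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": doc_name}] * len(chunks),
    )
    return collection
```

Tagging each chunk with a `source` metadata field lets later queries restrict results to a single document, which is useful for comparisons.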
Define retrieval tasks based on document embeddings using ChromaDB.
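Retrieval against the stored embeddings might look like this sketch; the `embed_fn` parameter, the collection name, and the metadata filter shape are assumptions layered on ChromaDB's query API:

```python
def build_source_filter(doc_names):
    """ChromaDB metadata filter restricting results to the named documents."""
    return {"source": {"$in": list(doc_names)}}

def retrieve(query, embed_fn, doc_names=None, db_path="./chroma_db", top_k=3):
    """Embed the query and fetch the most similar chunks from ChromaDB."""
    import chromadb
    client = chromadb.PersistentClient(path=db_path)
    collection = client.get_or_create_collection("content_engine")
    kwargs = {"query_embeddings": [embed_fn(query)], "n_results": top_k}
    if doc_names:
        kwargs["where"] = build_source_filter(doc_names)
    results = collection.query(**kwargs)
    return results["documents"][0]  # top-k chunk texts for the single query
```

Passing `doc_names=["alphabet", "tesla"]` would limit a comparative question to those two filings.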
Set up a local LLM via Hugging Face Transformers to generate contextual insights from the retrieved content.
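One way to wire a local model in is the sketch below; the model choice (`google/flan-t5-base`) and the prompt template are assumptions, not the project's fixed design:

```python
def build_prompt(question, context_chunks):
    """Format retrieved context and the user question into a single prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def generate_insight(question, context_chunks, model_name="google/flan-t5-base"):
    """Answer a question grounded in retrieved chunks using a local HF model."""
    from transformers import pipeline
    generator = pipeline("text2text-generation", model=model_name)
    prompt = build_prompt(question, context_chunks)
    return generator(prompt, max_new_tokens=200)[0]["generated_text"]
```

Grounding the prompt in retrieved chunks, rather than asking the model directly, is what makes this a RAG pipeline.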
Use Streamlit to create a user-friendly interface for querying and displaying comparative insights from documents.
- Clone the repository:

  ```
  git clone https://github.com/yourusername/content-engine.git
  cd content-engine
  ```
- Install dependencies: `pip install -r requirements.txt`
- Run the Streamlit app: `streamlit run content_engine.py`