GitHub - khushita28/PDFQuery_LangChain: This project creates a PDF Question Answering system using Langchain, Astra DB, and a Vector Database. It extracts PDF text, generates vector embeddings for efficient storage and retrieval, and uses a Language Model to answer questions by finding and interpreting relevant sections.

PDF Document Question Answering System
This project builds a question answering system for PDF documents using LangChain and Astra DB, a Cassandra-based database with vector search. The system extracts text from PDF files, generates embeddings for semantic search, stores those embeddings in Astra DB, and retrieves the most relevant sections so that a language model can answer queries.

Project Overview
1) Text Extraction: Parses and extracts text from PDF documents.
2) Embeddings Generation: Converts text into vector embeddings to enable semantic search.
3) Vector Storage in Astra DB: Stores embeddings in Astra DB (DataStax's managed service built on Apache Cassandra) with a vector index for efficient retrieval.
4) Relevant Section Retrieval: Uses vector similarity to find the document sections closest to a given question.
5) LLM-Based Question Answering: Generates answers from the content of the retrieved sections, using LangChain's LLM integrations.
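
Steps 2 to 4 above can be illustrated with a minimal, standard-library-only sketch. It is not the notebook's code: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, the chunk list stands in for the Astra DB vector index, and the function names and the `chunk_size`/`overlap` values are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (sizes are illustrative)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    windows = (text[i:i + chunk_size] for i in range(0, len(text), step))
    return [w for w in windows if w.strip()]


def toy_embed(text):
    """Stand-in embedding: a sparse word-count vector, NOT a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    query_vector = toy_embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]
```

In the actual project, the embedding model and the vector store would replace `toy_embed` and the in-memory list, but the ranking idea is the same: embed the question, compare it with stored vectors, and keep the closest sections.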

Prerequisites
Python 3.7+
Astra DB Account with a keyspace configured for vector storage.
Google Colab or Local Development Environment

Configuration
Astra DB Setup: Configure Astra DB credentials and endpoint in the project.
API Keys: Set up API keys for the embedding model and language model providers accessed through LangChain.
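
One common way to handle the configuration above is to read credentials from environment variables rather than hard-coding them. The sketch below assumes three variable names (`ASTRA_DB_APPLICATION_TOKEN`, `ASTRA_DB_API_ENDPOINT`, `LLM_API_KEY`); these names and the `Settings` class are illustrative assumptions, not values taken from the project.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names; adjust to whatever the notebook actually expects.
REQUIRED_VARS = (
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_API_ENDPOINT",
    "LLM_API_KEY",
)


@dataclass(frozen=True)
class Settings:
    astra_token: str
    astra_endpoint: str
    llm_api_key: str


def load_settings(env=None):
    """Read credentials from a mapping (defaults to os.environ) and fail early."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return Settings(
        astra_token=env["ASTRA_DB_APPLICATION_TOKEN"],
        astra_endpoint=env["ASTRA_DB_API_ENDPOINT"],
        llm_api_key=env["LLM_API_KEY"],
    )
```

Failing early with a list of missing names tends to be easier to debug than an authentication error raised later during a database or model call.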

Example Workflow
Load PDF Documents: Add documents to the documents/ folder.
Process and Store: Run the notebook (PDF_Document_Question_Answering.ipynb) to extract, embed, and store document content.
Query and Answer: Input questions, and the system retrieves and interprets relevant text segments to generate answers.
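
The first and last workflow steps can be sketched as follows: discovering PDF files in the documents/ folder, and assembling retrieved sections plus a question into a prompt for the language model. The helper names `find_pdfs` and `build_prompt` and the prompt wording are assumptions for illustration; the notebook may structure this differently.

```python
from pathlib import Path


def find_pdfs(folder="documents"):
    """Return the PDF files directly inside `folder` (empty list if it is missing)."""
    root = Path(folder)
    if not root.is_dir():
        return []
    return sorted(
        path for path in root.iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def build_prompt(question, sections):
    """Combine retrieved sections and the question into one prompt string."""
    context = "\n\n".join(
        f"[{index}] {section.strip()}"
        for index, section in enumerate(sections, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the sections in the prompt makes it easier to trace an answer back to the text segment it came from.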

Future Improvements
Enhanced PDF Preprocessing: Incorporate support for complex document structures.
Multi-Document Search: Enable search across multiple documents in the system.
Caching Mechanism: Implement caching for frequently accessed documents and embeddings.
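
As a rough idea of what the proposed caching could look like, the sketch below keeps embeddings in an in-memory dictionary keyed by a hash of the text, so identical text is only embedded once. `EmbeddingCache` and its `embed_fn` parameter are hypothetical names; a real implementation might persist the cache or add eviction.

```python
import hashlib


class EmbeddingCache:
    """In-memory cache keyed by the SHA-256 hash of the input text."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # any callable mapping text -> vector
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```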

License
This project is licensed under the MIT License.

Repository Files
PDF_Document_Question_Answering.ipynb
README.md
