Health Assistant: Medical RAG System

Welcome to the Health Assistant project! This is a Retrieval-Augmented Generation (RAG) system for medical question answering, combining structured Q&A from MedQuAD and authoritative articles from MedlinePlus.

🚀 Project Overview

Goal: Provide accurate, explainable, and up-to-date medical answers using a combination of trusted datasets and LLMs.
Data Sources:
- MedQuAD: Medical Q&A pairs from NIH and other reputable sources (XML format)
- MedlinePlus: 4,000+ encyclopedia articles from the U.S. National Library of Medicine (scraped live)
Retrieval: All data is indexed in ChromaDB for fast semantic search.
Generation: Google Gemini synthesizes the final answer from the retrieved context.

🛠️ Setup Instructions

Clone the repository and install dependencies:

git clone <your-repo-url>
cd "Health Assistant"
python -m venv env
env\Scripts\activate       # On Windows
source env/bin/activate    # On macOS/Linux
pip install -r requirements.txt

Set up your environment variables:
- Create a .env file with your Gemini API key:
```
GEMINI_API_KEY=your_gemini_api_key_here
```
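
How the scripts read this key is not shown here, so the following is only a minimal sketch of one way to do it with the standard library. The helper names `load_env_file` and `get_gemini_api_key` are hypothetical, and the project itself may well use a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_gemini_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY", "")
    if not key or key == "your_gemini_api_key_here":
        raise RuntimeError("GEMINI_API_KEY is missing or still set to the placeholder value")
    return key
```

Failing early on the placeholder value gives a clearer error than a rejected API call later on. Keeping `.env` out of version control (for example through `.gitignore`) avoids leaking the key.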

Directory Structure:

Health Assistant/
├── MedQuAD/                # MedQuAD XML dataset (place here)
├── chroma_persistent_storage/ # ChromaDB vector storage
├── process_medquad_data.py # Script to process MedQuAD
├── scrape_medlineplus.py   # Script to scrape MedlinePlus
├── query_medical_qa.py     # Query interface
├── requirements.txt
└── README.md

📚 Data Processing Pipeline

1. Process MedQuAD Data

Purpose: Parse all XML Q&A pairs and store them in ChromaDB.
Run:
```
python process_medquad_data.py
```
Options:
- Test mode (processes a small subset)
- Full mode (processes all files)
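
As a rough illustration of the parsing step, the sketch below extracts Q&A pairs from one XML document with the standard library. The element and attribute names (`Document`, `Focus`, `QAPair`, `Question`, `Answer`, `qid`, `qtype`) are assumptions about the MedQuAD layout, the sample document is invented, and `extract_qa_pairs` is a hypothetical helper rather than the code in `process_medquad_data.py`.

```python
import xml.etree.ElementTree as ET

# Invented sample in the assumed MedQuAD layout; not real dataset content.
SAMPLE_XML = """<Document id="0000001" source="EXAMPLE" url="https://example.org/topic">
  <Focus>Example Condition</Focus>
  <QAPairs>
    <QAPair pid="1">
      <Question qid="0000001-1" qtype="information">What is Example Condition?</Question>
      <Answer>Example Condition is a placeholder used for this sketch.</Answer>
    </QAPair>
    <QAPair pid="2">
      <Question qid="0000001-2" qtype="symptoms">What are the symptoms?</Question>
      <Answer></Answer>
    </QAPair>
  </QAPairs>
</Document>"""


def extract_qa_pairs(xml_text: str) -> list[dict[str, str]]:
    """Return one record per Q&A pair that has both a question and an answer."""
    root = ET.fromstring(xml_text)
    focus = (root.findtext("Focus") or "").strip()
    records: list[dict[str, str]] = []
    for pair in root.iter("QAPair"):
        question_elem = pair.find("Question")
        question = (pair.findtext("Question") or "").strip()
        answer = (pair.findtext("Answer") or "").strip()
        if question_elem is None or not question or not answer:
            continue  # nothing useful to index for this pair
        records.append({
            "id": question_elem.get("qid", ""),
            "question": question,
            "answer": answer,
            "focus": focus,
            "qtype": question_elem.get("qtype", ""),
            "source": root.get("source", ""),
            "url": root.get("url", ""),
        })
    return records


if __name__ == "__main__":
    for record in extract_qa_pairs(SAMPLE_XML):
        print(record["id"], "-", record["question"])
```

Carrying `source` and `url` along with each record is what later makes it possible to show where an answer came from.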

2. Scrape MedlinePlus Encyclopedia

Purpose: Scrape all A-Z medical articles and add them to ChromaDB.
Run:
```
python scrape_medlineplus.py
```
Options:
- Choose how many articles to scrape (start with 100 for testing, or 'all' for the full set)
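
To make the title/content extraction concrete, here is a minimal sketch that pulls the page title and paragraph text out of an HTML string using only the standard library. It is an assumption-laden stand-in: `ArticleParser` and `parse_article` are hypothetical names, the sample HTML is invented, and the real scraper may rely on different libraries and on selectors specific to MedlinePlus pages.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.paragraphs: list[str] = []
        self._current_tag: str | None = None  # "title" or "p" while inside one
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (for example <a>) is captured as well.
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            text = " ".join("".join(self._buffer).split())  # collapse whitespace
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current_tag = None


def parse_article(html: str) -> dict[str, str]:
    """Return the article title and its paragraphs joined by newlines."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "content": "\n".join(parser.paragraphs)}


SAMPLE_HTML = (
    "<html><head><title>Example Topic</title></head><body>"
    '<p>First   paragraph with a <a href="#">link</a>.</p>'
    "<p> </p><p>Second paragraph.</p></body></html>"
)

if __name__ == "__main__":
    print(parse_article(SAMPLE_HTML))
```

Testing the parser on a saved page before scraping a large batch makes it easier to notice when the page layout does not match these assumptions.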

3. Query the Knowledge Base

Purpose: Ask medical questions and get answers synthesized from both MedQuAD and MedlinePlus.
Run:
```
python query_medical_qa.py
```
Features:
- Interactive mode (ask multiple questions)
- Single question mode
- Shows sources for every answer

🧠 How It Works

Ingestion:
- MedQuAD XMLs are parsed for Q&A pairs.
- MedlinePlus articles are scraped and parsed for title/content.
Indexing:
- All documents are embedded and stored in ChromaDB for semantic search.
Retrieval:
- For each user question, the most relevant documents are retrieved from ChromaDB by semantic similarity.
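
The idea behind top-k retrieval can be sketched without any external dependency. The example below swaps in a toy bag-of-words "embedding" and cosine similarity purely for illustration; in this project ChromaDB and a real embedding model do that work, and the function names and sample documents here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, documents: list[str], k: int = 3) -> list[tuple[float, str]]:
    """Score every document against the question and keep the k best."""
    query_vector = embed(question)
    scored = [(cosine(query_vector, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    docs = [
        "Diabetes is a disease in which blood glucose levels are too high.",
        "Asthma causes the airways to narrow and swell.",
        "Glaucoma damages the optic nerve.",
    ]
    for score, doc in top_k("What is diabetes?", docs, k=2):
        print(f"{score:.2f}  {doc}")
```

Word overlap misses paraphrases and synonyms, which is the main reason a semantic embedding model is used instead of this toy scoring.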
Generation:
- The Gemini LLM is prompted with the retrieved context to generate a final answer.
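
One plausible way to assemble that prompt is sketched below. The wording of the instructions, the `max_chars` budget, and the `build_prompt` name are all assumptions made for illustration; the actual prompt used in `query_medical_qa.py` is not shown here, and the call to Gemini itself is deliberately left out.

```python
def build_prompt(question: str, contexts: list[str], max_chars: int = 4000) -> str:
    """Combine retrieved passages and the user question into a single prompt string."""
    numbered: list[str] = []
    used = 0
    for index, context in enumerate(contexts, start=1):
        snippet = context.strip()
        if used + len(snippet) > max_chars:
            break  # stop once the context budget would be exceeded
        numbered.append(f"[{index}] {snippet}")
        used += len(snippet)
    context_block = "\n\n".join(numbered) if numbered else "(no context retrieved)"
    return (
        "Answer the medical question using only the numbered context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_prompt("What is Example Condition?", ["Passage one.", "Passage two."]))
```

Numbering the passages lets the model refer back to them, and telling it to admit when the context is insufficient is a common way to discourage unsupported answers.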
Transparency:
- The sources (Q&A or article titles/URLs) are shown for every answer.
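
A source list like this could be produced from the metadata stored with each retrieved document. The sketch below assumes hypothetical metadata keys (`title`, `question`, `url`, `source`) and a hypothetical `format_sources` helper; the real scripts may store and display this information differently.

```python
def format_sources(metadatas: list[dict[str, str]]) -> str:
    """Render a numbered, de-duplicated source list for display under an answer."""
    lines: list[str] = []
    seen: set[tuple[str, str]] = set()
    for meta in metadatas:
        label = meta.get("title") or meta.get("question") or "Untitled source"
        url = meta.get("url", "")
        if (label, url) in seen:
            continue  # the same document can be retrieved more than once
        seen.add((label, url))
        origin = meta.get("source", "unknown")
        suffix = f" ({url})" if url else ""
        lines.append(f"{len(lines) + 1}. [{origin}] {label}{suffix}")
    if not lines:
        return "Sources: none"
    return "Sources:\n" + "\n".join(lines)


if __name__ == "__main__":
    print(format_sources([
        {"source": "MedlinePlus", "title": "Example Topic", "url": "https://example.org/topic"},
        {"source": "MedQuAD", "question": "What is Example Condition?"},
        {"source": "MedlinePlus", "title": "Example Topic", "url": "https://example.org/topic"},
    ]))
```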

🤖 Example Usage

# Process MedQuAD (once)
python process_medquad_data.py

# Scrape MedlinePlus (once)
python scrape_medlineplus.py

# Query the system
python query_medical_qa.py

📄 License & Credits

MedQuAD: NIH License
MedlinePlus: A.D.A.M. Medical Encyclopedia
ChromaDB, Google Gemini: see respective licenses

🙋‍♂️ Team & Contact

Built for [Your Hackathon Name]
Team: [Your Team Name]
Contact: [Your Email or Discord]

Good luck and have fun hacking!
