Welcome to the Health Assistant project! This is a Retrieval-Augmented Generation (RAG) system for medical question answering, combining structured Q&A from MedQuAD and authoritative articles from MedlinePlus.
- Goal: Provide accurate, explainable, and up-to-date medical answers using a combination of trusted datasets and LLMs.
- Data Sources:
  - MedQuAD: Medical Q&A pairs from NIH and other reputable sources (XML format)
  - MedlinePlus: 4,000+ encyclopedia articles from the U.S. National Library of Medicine (scraped live)
- Retrieval: All data is indexed in ChromaDB for fast semantic search.
- Generation: Uses Google Gemini LLM for answer synthesis.
- Clone the repository and install dependencies:
git clone <your-repo-url>
cd "Health Assistant"
python -m venv env
env\Scripts\activate  # On Windows
pip install -r requirements.txt
- Set up your environment variables:
  - Create a .env file with your Gemini API key:
GEMINI_API_KEY=your_gemini_api_key_here
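The scripts need to read this key at startup. The project's actual loading code is not shown here, so the following is only a minimal sketch using the standard library; `load_env_file` and `get_gemini_api_key` are hypothetical helper names, and a dedicated package could be used instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an '=' separator.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_gemini_api_key(path=".env"):
    """Prefer a variable already set in the environment, then fall back to the .env file."""
    key = os.environ.get("GEMINI_API_KEY") or load_env_file(path).get("GEMINI_API_KEY")
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file.")
    return key
```

Failing early with a clear message is a deliberate choice in this sketch: a missing key is easier to diagnose at startup than in the middle of a query.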
- Directory Structure:
Health Assistant/
├── MedQuAD/                     # MedQuAD XML dataset (place here)
├── chroma_persistent_storage/   # ChromaDB vector storage
├── process_medquad_data.py      # Script to process MedQuAD
├── scrape_medlineplus.py        # Script to scrape MedlinePlus
├── query_medical_qa.py          # Query interface
├── requirements.txt
└── README.md
- Purpose: Parse all XML Q&A pairs and store them in ChromaDB.
- Run:
python process_medquad_data.py
- Options:
  - Test mode (processes a small subset)
  - Full mode (processes all files)
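To illustrate the parsing step, here is a rough sketch of extracting Q&A pairs with the standard library. It assumes each XML file has a `Document` root with a `Focus` element and `QAPair` elements containing `Question` and `Answer`; check this against your copy of the dataset. The names `iter_xml_files` and `extract_qa_pairs` are hypothetical and not taken from `process_medquad_data.py`.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def iter_xml_files(dataset_dir, limit=None):
    """List XML files under dataset_dir; a small limit mimics a test mode."""
    files = sorted(Path(dataset_dir).rglob("*.xml"))
    return files[:limit] if limit is not None else files


def extract_qa_pairs(xml_text):
    """Return a list of question/answer dicts from one MedQuAD-style XML string."""
    root = ET.fromstring(xml_text)
    focus = (root.findtext("Focus") or "").strip()
    pairs = []
    for qa in root.iter("QAPair"):
        question = (qa.findtext("Question") or "").strip()
        answer = (qa.findtext("Answer") or "").strip()
        # Skip pairs with an empty question or answer so they are not indexed.
        if question and answer:
            pairs.append({"question": question, "answer": answer, "focus": focus})
    return pairs


# Made-up sample in the assumed layout (not real dataset content).
SAMPLE = """<Document id="0000001" source="ExampleSource">
  <Focus>Example Condition</Focus>
  <QAPairs>
    <QAPair pid="1">
      <Question qid="0000001-1" qtype="information">What is Example Condition?</Question>
      <Answer>Example Condition is a placeholder used for this sketch.</Answer>
    </QAPair>
    <QAPair pid="2">
      <Question qid="0000001-2" qtype="symptoms">What are the symptoms?</Question>
      <Answer></Answer>
    </QAPair>
  </QAPairs>
</Document>"""

pairs = extract_qa_pairs(SAMPLE)
print(len(pairs))            # prints 1 (the pair with an empty answer is skipped)
print(pairs[0]["question"])  # prints "What is Example Condition?"
```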
- Purpose: Scrape all A-Z medical articles and add them to ChromaDB.
- Run:
python scrape_medlineplus.py
- Options:
  - Choose how many articles to scrape (start with 100 for testing, or 'all' for the full set)
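Once a page has been downloaded, its title and body text need to be pulled out of the HTML. The sketch below shows one way to do that with the standard library only. It assumes the body text lives in `<p>` elements, which may not match the real MedlinePlus page layout, and `ArticleParser` and `parse_article` are hypothetical names rather than functions from `scrape_medlineplus.py`.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag whose text is currently being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # closing an inline tag such as <a> inside a paragraph
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if tag == "title":
            self.title = text
        elif text:
            self.paragraphs.append(text)
        self._current = None


def parse_article(html_text):
    """Return a dict with the page title and newline-joined paragraph text."""
    parser = ArticleParser()
    parser.feed(html_text)
    return {"title": parser.title, "content": "\n".join(parser.paragraphs)}
```

A dedicated HTML parsing library would be more forgiving of malformed markup; the standard-library parser is used here only to keep the sketch dependency-free.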
- Purpose: Ask medical questions and get answers synthesized from both MedQuAD and MedlinePlus.
- Run:
python query_medical_qa.py
- Features:
  - Interactive mode (ask multiple questions)
  - Single question mode
  - Shows sources for every answer
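As an illustration of how the two modes and the source listing might fit together, here is a small command-line sketch. How `query_medical_qa.py` actually selects a mode is not documented here, so the positional `question` argument is an assumption, and `answer_question` is a hypothetical placeholder standing in for the real retrieval and generation step.

```python
import argparse


def answer_question(question):
    """Placeholder for the real retrieval + generation step (hypothetical)."""
    return {
        "answer": f"(answer to: {question})",
        "sources": ["MedQuAD: example Q&A", "MedlinePlus: example article"],
    }


def format_response(result):
    """Render the answer followed by a numbered list of its sources."""
    lines = [result["answer"], "", "Sources:"]
    lines += [f"  {i}. {source}" for i, source in enumerate(result["sources"], 1)]
    return "\n".join(lines)


def run(argv=None, input_fn=input, output_fn=print):
    """Answer one question if given as an argument, otherwise loop interactively."""
    parser = argparse.ArgumentParser(description="Ask a medical question.")
    parser.add_argument("question", nargs="?", help="omit to start interactive mode")
    args = parser.parse_args(argv)

    if args.question:  # single question mode
        output_fn(format_response(answer_question(args.question)))
        return

    while True:  # interactive mode: a blank line exits
        question = input_fn("Question (blank to quit): ").strip()
        if not question:
            break
        output_fn(format_response(answer_question(question)))
```

Calling `run()` with no arguments reads from the command line; passing `input_fn` and `output_fn` makes the loop easy to exercise without a terminal.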
- Ingestion:
  - MedQuAD XMLs are parsed for Q&A pairs.
  - MedlinePlus articles are scraped and parsed for title/content.
- Indexing:
  - All documents are embedded and stored in ChromaDB for semantic search.
- Retrieval:
  - For each user question, the top relevant documents are retrieved.
- Generation:
  - The Gemini LLM is prompted with the retrieved context to generate a final answer.
- Transparency:
  - The sources (Q&A or article titles/URLs) are shown for every answer.
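The retrieval and generation steps above can be sketched roughly as follows. To stay self-contained, this example swaps in a toy bag-of-words cosine similarity for the embedding search that ChromaDB performs, and it stops at building the prompt instead of calling Gemini. The function names, the document fields, and the prompt wording are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, top_k=2):
    """Return up to top_k documents ranked by similarity to the question."""
    q_vec = Counter(tokenize(question))
    scored = [(cosine(q_vec, Counter(tokenize(d["text"]))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:top_k] if score > 0]


def build_prompt(question, retrieved):
    """Assemble a prompt that numbers each source so the answer can cite it."""
    context = "\n\n".join(
        f"[{i}] {d['source']}: {d['text']}" for i, d in enumerate(retrieved, 1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up documents standing in for indexed MedQuAD and MedlinePlus entries.
docs = [
    {"source": "MedQuAD", "text": "Asthma symptoms include wheezing and shortness of breath."},
    {"source": "MedlinePlus", "text": "A balanced diet supports heart health."},
    {"source": "MedlinePlus", "text": "Asthma is a chronic airway disease."},
]
question = "What are the symptoms of asthma?"
hits = retrieve(question, docs)
print(build_prompt(question, hits))
```

Numbering the context entries keeps each source attached to its text, which is what lets the final answer be shown alongside the documents it was drawn from.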
# Process MedQuAD (once)
python process_medquad_data.py
# Scrape MedlinePlus (once)
python scrape_medlineplus.py
# Query the system
python query_medical_qa.py
- MedQuAD: NIH License
- MedlinePlus: A.D.A.M. Medical Encyclopedia
- ChromaDB, Google Gemini: see respective licenses
- Built for [Your Hackathon Name]
- Team: [Your Team Name]
- Contact: [Your Email or Discord]
Good luck and have fun hacking!