Patient Analyzer

TL;DR

A portfolio project demonstrating how to compare medical patients in embedding space using machine learning. Includes an interactive Streamlit demo that visualizes patient embeddings and finds nearest neighbors using synthetic data.

Problem Statement

In healthcare analytics, comparing patients based on their medical histories can help with:

Finding similar patient cases for treatment insights
Identifying patient cohorts for clinical studies
Supporting clinical decision-making with data-driven similarity analysis

This project converts patient records (age, gender, diagnosis codes, clinical notes) into vector embeddings and uses distance metrics to find similar patients. The embeddings can be visualized in 2D using PCA/t-SNE to understand patient clusters.
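As a rough illustration of the idea above, the sketch below builds TF-IDF vectors from flattened patient text and ranks the other patients by cosine similarity. It uses only the standard library. The patient IDs, the record text, and the helper names (`tfidf_vectors`, `cosine_similarity`, `nearest_neighbors`) are assumptions made for this example, and the demo's actual implementation may differ, for instance by relying on a library vectorizer.

```python
import math
from collections import Counter

# Hypothetical records: diagnosis codes plus a short note, flattened to text.
patients = {
    "P001": "E11 I10 type 2 diabetes with hypertension",
    "P002": "E11 type 2 diabetes on metformin",
    "P003": "J45 asthma with seasonal allergies",
    "P004": "I10 hypertension with elevated cholesterol",
}


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_id, records, k=3):
    """Return the k most similar patients as (patient_id, score) pairs."""
    ids = list(records)
    vectors = dict(zip(ids, tfidf_vectors([records[i] for i in ids])))
    scores = [
        (other, cosine_similarity(vectors[query_id], vectors[other]))
        for other in ids
        if other != query_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


neighbors = nearest_neighbors("P001", patients, k=3)
for patient_id, score in neighbors:
    print(f"{patient_id}: {score:.3f}")
```

Cosine similarity ignores vector length, so a long clinical note and a short one can still score as close when they share the same distinctive terms.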

Quick Start

Prerequisites

Python 3.8+

Installation

# Clone the repository
git clone https://github.com/edgarbc/patient_analyzer.git
cd patient_analyzer

# Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Run the Streamlit Demo

streamlit run demo/app.py

The demo will open in your browser and allow you to:

Select a patient from the synthetic dataset
View patient details
See the top-3 most similar patients
Visualize all patients in 2D embedding space

Features

Synthetic Patient Data: 10 sample patients with realistic attributes (age, gender, diagnosis codes, clinical notes)
Lightweight Embeddings: Uses TF-IDF vectorization for fast, dependency-light processing
Nearest Neighbor Search: Finds similar patients using cosine similarity
Interactive Visualization: 2D PCA scatter plot showing patient clusters
CI-Friendly: No large model downloads required for demo/tests
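To make the 2D PCA projection behind the scatter plot more concrete, here is a from-scratch sketch that reduces a handful of embedding vectors to two coordinates using power iteration on the covariance matrix. The 4-dimensional vectors and the function names (`center`, `covariance`, `top_component`, `pca_2d`) are made up for illustration. The demo itself presumably calls a library PCA, so treat this as an explanation of the technique rather than the project's code.

```python
import math

# Hypothetical 4-dimensional embeddings for five patients.
embeddings = [
    [0.9, 0.1, 0.0, 0.2],
    [0.8, 0.2, 0.1, 0.1],
    [0.1, 0.9, 0.3, 0.0],
    [0.0, 0.8, 0.2, 0.1],
    [0.5, 0.5, 0.9, 0.4],
]


def center(rows):
    """Subtract the per-column mean from every row."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    return [[row[j] - means[j] for j in range(dims)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]


def top_component(matrix, iterations=500):
    """Approximate the dominant eigenvector by power iteration."""
    dims = len(matrix)
    # Fixed, non-uniform start vector keeps the result deterministic.
    vec = [float(i + 1) for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variance left in the matrix
            break
        vec = [x / norm for x in nxt]
    return vec


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    dims = len(cov)
    components = []
    for _ in range(2):
        v = top_component(cov)
        eigenvalue = sum(
            v[i] * cov[i][j] * v[j] for i in range(dims) for j in range(dims)
        )
        components.append(v)
        # Deflation: remove the component just found before the next pass.
        cov = [
            [cov[i][j] - eigenvalue * v[i] * v[j] for j in range(dims)]
            for i in range(dims)
        ]
    return [
        [sum(row[j] * comp[j] for j in range(dims)) for comp in components]
        for row in centered
    ]


points = pca_2d(embeddings)
for x, y in points:
    print(f"{x:+.3f} {y:+.3f}")
```

Each output pair can be plotted as one dot. Patients whose original vectors point in similar directions tend to land near each other, although two dimensions can only preserve part of the original structure.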

Project Structure

patient_analyzer/
├── README.md                    # This file
├── LICENSE                      # Apache 2.0 License
├── requirements.txt             # Python dependencies
├── demo/
│   ├── app.py                   # Streamlit demo application
│   └── sample_patients.csv      # Synthetic patient dataset
├── tests/
│   └── smoke_test.py            # CI smoke test
├── .github/
│   └── workflows/
│       └── ci.yml               # CI workflow
├── generate_embeddings.py       # BERT-based embedding generation (production)
├── generate_patients.py         # Patient data generation utilities
├── visualize_patients.py        # Visualization utilities
└── notebooks/                   # Jupyter notebooks for exploration

Using Real Embeddings (Production)

For production use cases that need higher-quality similarity results, you can use sentence-transformers or medical BERT models:

# Install optional dependencies for production embeddings
pip install sentence-transformers torch transformers

Then modify demo/app.py to use a sentence-transformers model or Bio_ClinicalBERT instead of TF-IDF (see comments in the code).

Privacy & Data Use Statement

⚠️ Important: This project uses 100% synthetic data for demonstration purposes.

No real patient data is included or required
The sample dataset is algorithmically generated with fictional patient information
For production use with real data, ensure compliance with:
- HIPAA (US healthcare data)
- GDPR (EU personal data)
- Your organization's data governance policies

Never commit real patient data to version control.

Expected Output

The Streamlit demo shows a patient selector, patient details table, nearest neighbors list, and a 2D scatter plot of patient embeddings.

Development

Running Tests

python -m pytest tests/smoke_test.py -v

Linting

# Install development dependencies
pip install flake8 black

# Run linter
flake8 demo/ tests/
black --check demo/ tests/

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Submit a pull request

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
