A portfolio project demonstrating how to compare medical patients in embedding space using machine learning. Includes an interactive Streamlit demo that visualizes patient embeddings and finds nearest neighbors using synthetic data.
In healthcare analytics, comparing patients based on their medical histories can help with:
- Finding similar patient cases for treatment insights
- Identifying patient cohorts for clinical studies
- Supporting clinical decision-making with data-driven similarity analysis
This project converts patient records (age, gender, diagnosis codes, clinical notes) into vector embeddings and uses distance metrics to find similar patients. The embeddings can be visualized in 2D using PCA/t-SNE to understand patient clusters.
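The record-to-vector step and the similarity lookup described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the code in `demo/app.py`: the sample records, the field names (`codes`, `notes`), the age-banding trick, and the helper functions are all assumptions made for the example, and the TF-IDF weighting is a hand-rolled smoothed variant rather than a library implementation.

```python
import math
from collections import Counter

# Hypothetical records; field names are assumptions, not the demo's CSV schema.
patients = [
    {"id": "P1", "age": 67, "gender": "F", "codes": "E11 I10",
     "notes": "type 2 diabetes with hypertension"},
    {"id": "P2", "age": 70, "gender": "M", "codes": "E11 I10",
     "notes": "diabetes and hypertension follow up"},
    {"id": "P3", "age": 25, "gender": "F", "codes": "J45",
     "notes": "asthma exacerbation after exercise"},
    {"id": "P4", "age": 31, "gender": "M", "codes": "J45 J30",
     "notes": "asthma with seasonal allergies"},
]


def record_to_tokens(patient):
    """Flatten one record into tokens; age is bucketed into decades."""
    age_band = f"age_{(patient['age'] // 10) * 10}s"
    text = f"{age_band} gender_{patient['gender']} {patient['codes']} {patient['notes']}"
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (a dict) per token list."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in doc_freq.items()}
    vectors = []
    for doc in docs:
        weights = {t: count * idf[t] for t, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity; a plain dot product because vectors are unit length."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def most_similar(index, vectors, k=3):
    """Return (patient_index, similarity) pairs for the k nearest neighbors."""
    scored = [(cosine(vectors[index], v), j)
              for j, v in enumerate(vectors) if j != index]
    scored.sort(reverse=True)
    return [(j, round(score, 3)) for score, j in scored[:k]]


vectors = tfidf_vectors([record_to_tokens(p) for p in patients])
neighbors = most_similar(0, vectors, k=3)
print([patients[j]["id"] for j, _ in neighbors])
```

In this toy dataset the closest match for the first patient is the other diabetes and hypertension case, because those two records share the most distinctive tokens.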
- Python 3.8+
# Clone the repository
git clone https://github.com/edgarbc/patient_analyzer.git
cd patient_analyzer
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
streamlit run demo/app.py
The demo will open in your browser and allow you to:
- Select a patient from the synthetic dataset
- View patient details
- See the top-3 most similar patients
- Visualize all patients in 2D embedding space
- Synthetic Patient Data: 10 sample patients with realistic attributes (age, gender, diagnosis codes, clinical notes)
- Lightweight Embeddings: Uses TF-IDF vectorization for fast, dependency-light processing
- Nearest Neighbor Search: Finds similar patients using cosine similarity
- Interactive Visualization: 2D PCA scatter plot showing patient clusters
- CI-Friendly: No large model downloads required for demo/tests
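To make the 2D PCA view less of a black box, here is a rough sketch of what the projection does: center the vectors, find the two directions of greatest variance, and use the coordinates along those directions as plot positions. The function `pca_2d` is hypothetical and written in plain Python with power iteration so it runs without extra packages. It is an assumption-laden illustration, not the routine the demo uses, and a library PCA would normally be the better choice.

```python
import math
import random


def _orthonormalize(vector, basis):
    """Remove the parts of vector along basis, then scale to unit length."""
    for b in basis:
        dot = sum(x * y for x, y in zip(vector, b))
        vector = [x - dot * y for x, y in zip(vector, b)]
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 1e-12 else None


def pca_2d(rows, iterations=200, seed=0):
    """Project rows (equal-length lists of floats) onto two principal axes."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered data (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    components = []
    for _ in range(2):
        v = _orthonormalize([rng.random() for _ in range(d)], components)
        for _ in range(iterations):
            # Power iteration: repeatedly apply the covariance matrix, staying
            # orthogonal to the components that were already found.
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            w = _orthonormalize(w, components)
            if w is None:  # no variance left in the remaining directions
                break
            v = w
        components.append(v)

    # Coordinates of each centered row along the two components.
    return [[sum(c * x for c, x in zip(comp, row)) for comp in components]
            for row in centered]


# Toy vectors that lie almost on a line, so the first axis carries most variance.
points = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.5]]
coords = pca_2d(points)
for x, y in coords:
    print(round(x, 3), round(y, 3))
```

The sign of each axis is arbitrary, so a plot built this way may appear mirrored relative to another PCA implementation while showing the same clusters.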
patient_analyzer/
├── README.md # This file
├── LICENSE # Apache 2.0 License
├── requirements.txt # Python dependencies
├── demo/
│ ├── app.py # Streamlit demo application
│ └── sample_patients.csv # Synthetic patient dataset
├── tests/
│ └── smoke_test.py # CI smoke test
├── .github/
│ └── workflows/
│   └── ci.yml # CI workflow
├── generate_embeddings.py # BERT-based embedding generation (production)
├── generate_patients.py # Patient data generation utilities
├── visualize_patients.py # Visualization utilities
└── notebooks/ # Jupyter notebooks for exploration
For production use cases that need higher-quality similarity than TF-IDF provides, you can use sentence transformers or medical BERT models:
# Install optional dependencies for production embeddings
pip install sentence-transformers torch transformers
Then modify demo/app.py to use the sentence-transformers or Bio_ClinicalBERT model instead of TF-IDF (see comments in the code).
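As a rough idea of what that swap could look like, the sketch below wraps a sentence-transformers model behind two small helpers. The helper names (`load_encoder`, `embed_texts`, `rank_by_cosine`) and the default model name `all-MiniLM-L6-v2` are assumptions chosen for illustration; the actual hooks in `demo/app.py` may differ, and a clinical model could be substituted by passing a different model name.

```python
def load_encoder(model_name="all-MiniLM-L6-v2"):
    """Load a sentence-transformers model (downloads weights on first use).

    The default model name is an assumption for this sketch.
    """
    # Imported lazily so the rest of the module works without the package.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def embed_texts(texts, encoder):
    """Encode texts into unit-length vectors, returned as lists of floats."""
    vectors = encoder.encode(texts, normalize_embeddings=True)
    return [[float(x) for x in row] for row in vectors]


def rank_by_cosine(query_index, vectors, k=3):
    """Return indices of the k vectors most similar to the query vector.

    Assumes unit-length vectors, so the dot product equals cosine similarity.
    """
    query = vectors[query_index]
    scored = [(sum(a * b for a, b in zip(query, v)), j)
              for j, v in enumerate(vectors) if j != query_index]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


# Example usage (requires the optional dependencies and a model download):
#   encoder = load_encoder()
#   vectors = embed_texts(["diabetes with hypertension", "asthma flare"], encoder)
#   print(rank_by_cosine(0, vectors, k=1))
```

Keeping the encoder as a parameter means tests can pass in a small stand-in object with an `encode` method, which preserves the no-large-downloads property for CI.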
- No real patient data is included or required
- The sample dataset is algorithmically generated with fictional patient information
- For production use with real data, ensure compliance with:
  - HIPAA (US healthcare data)
  - GDPR (EU personal data)
  - Your organization's data governance policies
Never commit real patient data to version control.
The Streamlit demo shows a patient selector, patient details table, nearest neighbors list, and a 2D scatter plot of patient embeddings.
python -m pytest tests/smoke_test.py -v
# Install development dependencies
pip install flake8 black
# Run linter
flake8 demo/ tests/
black --check demo/ tests/
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
