Deep Learning-powered Medical Data Analysis System
Analyzes synthetic patient data to predict disease with 92.5% test accuracy, using deep learning models, NLP, and Biopython
HealthInsight AI is a comprehensive medical data analysis system that combines:
- Deep Learning: Neural network models for disease prediction (92%+ accuracy)
- Natural Language Processing: Medical text analysis and symptom extraction
- Biopython Integration: Biological sequence analysis for genomics
- Advanced Visualization: Healthcare insights using Matplotlib, Seaborn, and Plotly

Deep learning model:
- Multi-layer neural network with 256-128-64-32-16 architecture
- Batch normalization and dropout for regularization
- Early stopping and learning rate scheduling
- Achieves 92.5% accuracy on disease prediction on the held-out test set (see TEST_RESULTS.md)
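
The early stopping and learning rate scheduling listed above can be sketched without any framework. This is a minimal illustration, not the code in src/dl_model.py: the function name `simulate_training`, the patience values, and the halving factor are all assumptions chosen for the example.

```python
def simulate_training(val_losses, patience=5, lr=1e-3, lr_patience=2, factor=0.5):
    """Return (stop_epoch, final_lr) for a sequence of per-epoch validation losses.

    All defaults are hypothetical; they are not taken from the project.
    """
    best = float("inf")
    since_best = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best = loss, 0
            continue
        since_best += 1
        # Plateau: shrink the learning rate every `lr_patience` stagnant epochs.
        if since_best % lr_patience == 0:
            lr *= factor
        # Early stopping: give up after `patience` stagnant epochs.
        if since_best >= patience:
            return epoch, lr
    return len(val_losses), lr


losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.63]
stop_epoch, final_lr = simulate_training(losses, patience=3, lr_patience=2)
print(stop_epoch, final_lr)  # stops at epoch 6, after the rate was halved once
```

In Keras, this behaviour is usually obtained with the `EarlyStopping` and `ReduceLROnPlateau` callbacks; whether this project configures them that way is not stated here.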

Natural language processing:
- Medical text preprocessing and tokenization
- Symptom and condition extraction
- Medical note analysis and summarization
- Sentiment analysis for patient notes
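
As a rough picture of what symptom extraction involves, the sketch below matches a small vocabulary against a note and counts the hits. It is a stand-in written for this README, not the logic of `MedicalNLPAnalyzer`: the `SYMPTOM_TERMS` list and the standalone `extract_symptoms` function are hypothetical, and the real analyzer may rely on NLTK or spaCy instead of plain regular expressions.

```python
import re
from collections import Counter

# Hypothetical vocabulary; the project's own term lists may differ.
SYMPTOM_TERMS = ["chest pain", "fatigue", "shortness of breath", "headache"]


def extract_symptoms(note):
    """Return the known symptom terms found in a note, in order of appearance."""
    text = note.lower()
    hits = []
    for term in SYMPTOM_TERMS:
        match = re.search(r"\b" + re.escape(term) + r"\b", text)
        if match:
            hits.append((match.start(), term))
    return [term for _, term in sorted(hits)]


notes = [
    "Patient has chest pain and fatigue",
    "Reports fatigue and headache",
]
# Tally symptoms across all notes, similar in spirit to a "most common symptoms" summary.
counts = Counter(s for note in notes for s in extract_symptoms(note))
print(extract_symptoms(notes[0]))
print(counts.most_common(1))
```

Keyword matching misses negation ("denies chest pain") and synonyms, which is one reason a dedicated NLP library is typically preferred for real notes.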

Biopython sequence analysis:
- DNA sequence analysis
- GC content calculation
- Sequence translation and transcription
- Motif finding and pattern matching
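
To make three of the sequence operations above concrete, here is a plain-Python sketch of GC content, transcription, and motif search. It deliberately avoids Biopython so that it runs with no dependencies; the function names are illustrative and do not correspond to methods of `BiologicalDataAnalyzer`.

```python
def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def transcribe(dna):
    """DNA coding strand -> mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def find_motif(seq, motif):
    """0-based start positions of every occurrence of motif, overlaps included."""
    seq, motif = seq.upper(), motif.upper()
    return [i for i in range(len(seq) - len(motif) + 1) if seq.startswith(motif, i)]


print(round(gc_content("ATGCGC"), 3))
print(transcribe("ATGC"))
print(find_motif("ATATAT", "ATA"))
```

Biopython offers equivalents on its `Seq` objects, which also handle translation with proper codon tables; that part is left out of this sketch.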

Visualization:
- Feature distribution plots
- Correlation matrices
- ROC curves and confusion matrices
- Training history visualization
- Disease risk factor analysis

Prerequisites:
- Python 3.8 or higher
- pip package manager
- Clone the repository
git clone https://github.com/AmmarAhm3d/HealthInsight-AI.git
cd HealthInsight-AI
- Install dependencies
pip install -r requirements.txt
- Run the analysis pipeline
python main.py
- Try the examples (optional)
python examples.py

- README.md - Project overview and usage
- TEST_RESULTS.md - Detailed test results and metrics
- examples.py - Quick usage examples
HealthInsight-AI/
├── src/
│   ├── data_generator.py   # Synthetic medical data generation
│   ├── dl_model.py         # Deep learning model implementation
│   ├── nlp_analyzer.py     # NLP text analysis
│   ├── bio_analyzer.py     # Biopython sequence analysis
│   └── visualizer.py       # Visualization components
├── data/                   # Generated patient data
├── models/                 # Trained models
├── results/                # Analysis results (JSON, CSV)
├── visualizations/         # Generated plots and charts
├── notebooks/              # Jupyter notebooks for exploration
├── main.py                 # Main pipeline orchestrator
├── requirements.txt        # Project dependencies
└── README.md               # This file

The system generates synthetic medical data including:
- Patient demographics (age, gender)
- Vital signs (BP, heart rate, temperature)
- Lab results (cholesterol, glucose, BMI)
- Lifestyle factors (smoking, exercise, alcohol)
- Medical history (family history, previous conditions)
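
A minimal sketch of how records with these kinds of fields could be generated using only the standard library follows. The field names, value ranges, and probabilities are invented for illustration and are not clinical reference values; the actual schema in src/data_generator.py (15 features according to the results summary) may look quite different.

```python
import random


def generate_patient(rng):
    """Build one synthetic record. All fields and ranges are hypothetical."""
    return {
        # Demographics
        "age": rng.randint(18, 90),
        "gender": rng.choice(["F", "M"]),
        # Vital signs
        "systolic_bp": round(rng.gauss(125, 15)),
        "heart_rate": round(rng.gauss(75, 10)),
        "temperature_c": round(rng.gauss(36.8, 0.4), 1),
        # Lab results
        "cholesterol": round(rng.gauss(200, 35)),
        "glucose": round(rng.gauss(100, 20)),
        "bmi": round(rng.gauss(27, 4), 1),
        # Lifestyle factors
        "smoker": rng.random() < 0.2,
        "exercise_hours_per_week": round(rng.uniform(0, 10), 1),
        "alcohol_units_per_week": round(rng.uniform(0, 20), 1),
        # Medical history
        "family_history": rng.random() < 0.3,
    }


rng = random.Random(42)  # fixed seed so the sample is reproducible
patients = [generate_patient(rng) for _ in range(5)]
print(sorted(patients[0].keys()))
```

Passing an explicit `random.Random` instance keeps the generator reproducible without touching the global random state.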
A deep neural network trains on patient features to predict disease risk:
from src.dl_model import DiseasePredictionModel
model = DiseasePredictionModel()
results = model.train(X, y, epochs=100)
print(f"Accuracy: {results['accuracy']:.2%}")

Analyzes medical notes to extract symptoms and conditions:
from src.nlp_analyzer import MedicalNLPAnalyzer
nlp = MedicalNLPAnalyzer()
results = nlp.analyze_notes(medical_notes)
print(f"Symptoms found: {results['most_common_symptoms']}")

Processes DNA sequences for genomic insights:
from src.bio_analyzer import BiologicalDataAnalyzer
bio = BiologicalDataAnalyzer()
sequences = bio.generate_sample_sequences(n_sequences=20)
results = bio.medical_genomics_summary()

Creates professional healthcare visualizations:
from src.visualizer import HealthcareVisualizer
viz = HealthcareVisualizer()
viz.plot_model_performance(results)
viz.plot_feature_distribution(patient_data)

After running the pipeline (python main.py), you'll find:
- Accuracy Achieved: 92.50% ✅
- Patient Records: 15,000 synthetic records
- Features Analyzed: 15 medical features
- Visualizations Created: 6 comprehensive plots
- See TEST_RESULTS.md for detailed results

- `results/patient_data.csv` - Synthetic patient dataset
- `results/analysis_report.json` - Comprehensive analysis summary
- `results/nlp_analysis.json` - NLP insights
- `results/biological_analysis.json` - Genomic analysis results

- `visualizations/feature_distributions.png` - Medical feature distributions
- `visualizations/correlation_matrix.png` - Feature correlation heatmap
- `visualizations/disease_distribution.png` - Disease prevalence charts
- `visualizations/model_performance.png` - Model metrics and ROC curve
- `visualizations/training_history.png` - Training progress
- `visualizations/feature_by_disease.png` - Risk factor analysis

- Accuracy: 92.5% on test data (exceeds 92% target ✅)
- Architecture: 5-layer deep neural network (256-128-64-32-16)
- Training: Early stopping with learning rate scheduling
- Evaluation: Precision, Recall, F1-Score, ROC-AUC
Detailed Metrics:
- Precision: 90-94%
- Recall: 88-94%
- F1-Score: 91-94%
- Training samples: 12,000
- Test samples: 3,000
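
The sample counts above correspond to an 80/20 split of the 15,000 synthetic records. The sketch below shows one way such a split can be produced; the helper name `train_test_split_indices`, the shuffling, and the seed are assumptions, since how the project actually partitions its data is not described here.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(15_000)
print(len(train_idx), len(test_idx))  # 12000 3000
```

With a class-imbalanced target, a stratified split (for example scikit-learn's `train_test_split` with its `stratify` argument) is a common alternative.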
from main import HealthInsightAI
# Initialize pipeline
pipeline = HealthInsightAI()
# Run complete analysis
pipeline.run_full_pipeline()

# Generate data only
pipeline.generate_data()
# Train model with custom parameters
from src.dl_model import DiseasePredictionModel
model = DiseasePredictionModel()
results = model.train(X, y, epochs=150, batch_size=64)
# Make predictions
predictions = model.predict(new_patient_data)

from src.nlp_analyzer import MedicalNLPAnalyzer
nlp = MedicalNLPAnalyzer()
symptoms = nlp.extract_symptoms("Patient has chest pain and fatigue")
conditions = nlp.extract_conditions("Suspected cardiovascular disease")

| Metric | Value |
|---|---|
| Accuracy | 92.5% ✅ |
| Precision (No Disease) | 93.87% |
| Precision (Disease) | 90.46% |
| Recall (No Disease) | 93.61% |
| Recall (Disease) | 90.83% |
| F1-Score | 91-94% |
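
The F1 range in the table can be related to the per-class precision and recall through the standard formula F1 = 2PR / (P + R). The short calculation below applies it to the table's values; the helper name `f1_score` is just for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the table above.
f1_no_disease = f1_score(0.9387, 0.9361)
f1_disease = f1_score(0.9046, 0.9083)
print(f"No Disease: {f1_no_disease:.2%}")
print(f"Disease:    {f1_disease:.2%}")
```

This gives roughly 93.7% for the "No Disease" class and 90.6% for the "Disease" class, which is in line with the reported 91-94% range once rounded to whole percentages.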
- Deep Learning: TensorFlow/Keras
- Machine Learning: scikit-learn
- NLP: NLTK, spaCy
- Bioinformatics: Biopython
- Visualization: Matplotlib, Seaborn, Plotly
- Data Processing: Pandas, NumPy
This project is designed for easy iteration:
- Modify Data Generation: Edit `src/data_generator.py` to add new features
- Enhance Model: Update `src/dl_model.py` to try different architectures
- Expand NLP: Add more medical terminology in `src/nlp_analyzer.py`
- Add Visualizations: Create new plots in `src/visualizer.py`
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
Ammar Ahmed
GitHub: @AmmarAhm3d
- TensorFlow team for the deep learning framework
- NLTK and spaCy for NLP capabilities
- Biopython community for bioinformatics tools
- Healthcare data science community for inspiration
Note: This project uses synthetic data for demonstration purposes. For real medical applications, ensure compliance with healthcare regulations (HIPAA, GDPR, etc.) and consult with medical professionals.