This project aims to analyze and predict the likelihood of a stroke based on various health factors using Machine Learning (ML) and Deep Learning (DL) models.
We performed detailed data cleaning, visualization, feature engineering, model building, and evaluation.
This was done as part of an academic project to understand end-to-end Machine Learning pipelines and Deep Learning modeling.
- The dataset `healthcare-dataset-stroke-data.csv` was loaded using pandas.
- Basic information such as shape and statistical description was obtained to understand the dataset:

```python
data.shape
data.describe()
```

- The `bmi` column had missing values.
- Missing `bmi` values were filled with the mean BMI, grouped by gender, marital status, and age group, to ensure contextual relevance (sketched below).
- The `id` column was dropped as it does not provide any predictive value.
- An `age_group` column was created to categorize individuals as Infant, Child, Adolescent, Young Adult, Adult, or Old Aged based on their age.
- Rows with ambiguous gender ("Other") were removed so the gender feature stays binary.
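As a rough illustration (not the project's exact code), these cleaning steps could look like the sketch below. The grouping columns follow the Kaggle dataset's schema (`gender`, `ever_married`), and the `age_group` bin edges are illustrative assumptions.

```python
import pandas as pd

data = pd.read_csv("healthcare-dataset-stroke-data.csv")

# Illustrative age-group bins; the exact cut-offs used in the project may differ
bins = [0, 2, 12, 19, 35, 60, 120]
labels = ["Infant", "Child", "Adolescent", "Young Adult", "Adult", "Old Aged"]
data["age_group"] = pd.cut(data["age"], bins=bins, labels=labels)

# Fill missing bmi with the mean BMI of each (gender, marital status, age group) group
data["bmi"] = data["bmi"].fillna(
    data.groupby(["gender", "ever_married", "age_group"], observed=False)["bmi"].transform("mean")
)

# Drop the identifier column and the ambiguous "Other" gender rows
data = data.drop(columns=["id"])
data = data[data["gender"] != "Other"]
```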
We used Plotly Express, Seaborn, and Matplotlib to visualize various feature relationships with stroke:
These visualizations helped us understand which features were most influential.
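For example, plots along these lines were typical; the specific feature pairings shown here are illustrative, not the project's exact figures:

```python
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Stroke counts across age groups (Seaborn / Matplotlib)
sns.countplot(data=data, x="age_group", hue="stroke")
plt.title("Stroke counts by age group")
plt.show()

# Interactive scatter of glucose level vs. BMI, coloured by stroke (Plotly Express)
fig = px.scatter(data, x="avg_glucose_level", y="bmi", color="stroke")
fig.show()
```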
The dataset was highly imbalanced (very few people had strokes).
We used upsampling and SMOTE (Synthetic Minority Over-sampling Technique) to balance the dataset:
- Upsampling involved duplicating minority class samples.
- SMOTE generated synthetic samples to balance the classes.
- Important continuous features like `age`, `avg_glucose_level`, and `bmi` were normalized using MinMaxScaler or StandardScaler for consistent scaling across ML models (see the sketch below).
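A minimal sketch of the scaling and resampling steps, assuming an 80/20 train/test split, StandardScaler, and SMOTE applied only to the training data (the project's exact split, scaler choice, and parameters may differ):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Assumes categorical columns have already been encoded to numeric values
X = data.drop(columns=["stroke"])
y = data["stroke"]

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the continuous features (StandardScaler shown; the project also mentions MinMaxScaler)
num_cols = ["age", "avg_glucose_level", "bmi"]
scaler = StandardScaler()
x_train[num_cols] = scaler.fit_transform(x_train[num_cols])
x_test[num_cols] = scaler.transform(x_test[num_cols])

# Oversample the minority (stroke) class with SMOTE on the training data only
x_train, y_train = SMOTE(random_state=42).fit_resample(x_train, y_train)
```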
We implemented and trained several ML models:
| Model | Key Details |
|---|---|
| Extra Trees Classifier | Ensemble method |
| Random Forest Classifier | Ensemble method, tuned with hyperparameters |
| XGBoost Classifier | Advanced boosting technique |
| Gradient Boosting Classifier | Boosted ensemble method |
Each model was:
- Trained on the training set.
- Evaluated on both training and testing sets.
- Assessed using Accuracy Scores, Classification Reports, and Confusion Matrices.
Example model training:
```python
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

etc_model = ExtraTreesClassifier()
rfc_model = RandomForestClassifier(n_estimators=29, max_leaf_nodes=900, max_features=0.8, criterion='entropy')
xgb_model = XGBClassifier(objective="binary:logistic", eval_metric="auc")
gbc_model = GradientBoostingClassifier(max_depth=29, min_samples_leaf=4, min_samples_split=13, subsample=0.8)

models = [etc_model, rfc_model, xgb_model, gbc_model]
for model in models:
    model.fit(x_train, y_train)
```

Confusion matrices were plotted for each model.
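A sketch of how each model's scores and confusion matrix could be reported with scikit-learn utilities (the project's exact plotting code may differ):

```python
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

for model in models:
    y_pred = model.predict(x_test)
    print(type(model).__name__)
    print("Train accuracy:", model.score(x_train, y_train))
    print("Test accuracy :", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    ConfusionMatrixDisplay.from_estimator(model, x_test, y_test)
    plt.show()
```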
To enhance prediction accuracy, Deep Learning models were built using TensorFlow and Keras.
- A simple ANN model with two hidden layers.
- Compiled with the `adam` optimizer and `binary_crossentropy` loss.
- Trained for 50 epochs (a compile/fit sketch follows the model definition below).
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(16, input_dim=X.shape[1], activation='relu'),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')
])
```
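Compiling and training this baseline model, following the bullets above, would look roughly like this; the `X_train`/`y_train` names and the accuracy metric are assumptions rather than details taken from the project code:

```python
# Compile with the adam optimizer and binary cross-entropy loss, as described above
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train for 50 epochs (X_train / y_train are assumed names for the prepared training arrays)
model.fit(X_train, y_train, epochs=50)
```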
An improved model was then built:
- Handled class imbalance using SMOTE.
- Added Dropout layers to prevent overfitting.
- Used an EarlyStopping callback to terminate training early if no improvement was observed (see the training sketch after the code below).
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential([
    Dense(32, activation='relu', input_dim=X_train.shape[1]),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
```

This improved model showed better generalization and achieved higher accuracy.
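Because EarlyStopping monitors `val_loss`, the fit call needs validation data. A sketch of the training step, assuming the same compile settings as the baseline and an illustrative validation split and epoch cap:

```python
# Assumed to use the same optimizer and loss as the baseline model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# A validation split supplies the val_loss that EarlyStopping monitors;
# the split fraction and epoch cap here are illustrative choices
model.fit(X_train, y_train, epochs=100, validation_split=0.2, callbacks=[early_stop])
```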
- Classification reports showed precision, recall, f1-score for both classes (Stroke / No Stroke).
- Confusion matrices visualized True Positives, True Negatives, False Positives, and False Negatives.
- Test Accuracy was printed at the end for both ML and DL models (a sketch of the DL evaluation follows below).
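For the Keras models, the sigmoid outputs have to be thresholded before a classification report or confusion matrix can be computed; a minimal sketch assuming a 0.5 cut-off and `X_test`/`y_test` as the held-out data:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Convert the sigmoid probabilities to 0/1 labels at a 0.5 threshold
y_prob = model.predict(X_test)
y_pred = (y_prob > 0.5).astype(int)

print("Test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["No Stroke", "Stroke"]))
print(confusion_matrix(y_test, y_pred))
```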
Through this project, we learned how to:
- Preprocess messy real-world healthcare datasets.
- Handle missing values thoughtfully.
- Balance imbalanced datasets using resampling and SMOTE.
- Build, train, and evaluate multiple machine learning models.
- Build deep learning models using Keras and TensorFlow.
- Evaluate models using statistical and visual metrics.
This end-to-end project helped us understand the critical steps in building a reliable prediction system for sensitive applications like healthcare.
- Python
- Pandas, Numpy
- Matplotlib, Seaborn, Plotly Express
- Scikit-learn
- XGBoost
- TensorFlow, Keras
- imbalanced-learn (SMOTE)
- Healthcare Dataset for Stroke Prediction (available on Kaggle).









