Diabetes Prediction Project
This project involves an end-to-end Exploratory Data Analysis (EDA) and the development of a predictive model to determine whether a patient has diabetes based on diagnostic measurements. The analysis is performed using the Pima Indians Diabetes Database.
Table of Contents
Project Overview
Dataset Description
Dependencies
Key Features of the Analysis
Model Performance
Visualizations
Project Overview
The primary goal of this notebook is to clean the diagnostic data, explore relationships between health metrics (such as Glucose, BMI, and Age), and train a machine learning model (Logistic Regression) to predict diabetes outcomes.
Dataset Description
The dataset used is diabetes.csv. It contains 768 observations with the following columns:
Glucose: Plasma glucose concentration.
BloodPressure: Diastolic blood pressure (mm Hg).
SkinThickness: Triceps skin fold thickness (mm).
Insulin: 2-hour serum insulin (mu U/ml).
BMI: Body mass index (weight in kg/(height in m)^2).
DiabetesPedigreeFunction: A function that scores the likelihood of diabetes based on family history.
Age: Age in years.
Outcome: Class variable (0 if non-diabetic, 1 if diabetic).
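A first look at the columns listed above might be sketched as follows. The rows are made-up values so the snippet runs on its own; in the notebook the frame would presumably come from `pd.read_csv("diabetes.csv")`. The helper `count_zeros` and the list `SUSPECT_COLUMNS` are hypothetical names introduced here, and treating zeros in those columns as suspicious is an assumption rather than something taken from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; the real data would be loaded with
# pd.read_csv("diabetes.csv") (file location assumed).
df = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 0, 137],
        "BloodPressure": [72, 66, 0, 66, 40],
        "SkinThickness": [35, 29, 0, 23, 35],
        "Insulin": [0, 0, 0, 94, 168],
        "BMI": [33.6, 26.6, 23.3, 0.0, 43.1],
        "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
        "Age": [50, 31, 32, 21, 33],
        "Outcome": [1, 0, 1, 0, 1],
    }
)

# Columns where a zero is not a plausible measurement
# (an assumption of this sketch, not a rule from the notebook).
SUSPECT_COLUMNS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zeros(frame, columns):
    """Return the number of zero entries in each of the given columns."""
    return (frame[columns] == 0).sum()


print(df.shape)            # number of rows and columns
print(df.isnull().sum())   # explicit null values per column
print(df.describe().T)     # central tendency and dispersion per column
print(count_zeros(df, SUSPECT_COLUMNS))
```

Counting zeros separately from nulls matters here because `isnull()` will not flag a zero, even when the zero is standing in for a measurement that was never taken.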
Dependencies
To run this notebook, you will need the following Python libraries:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
```
Key Features of the Analysis
Descriptive Statistics: Summarizing central tendency and dispersion to identify potential outliers (e.g., zero values in BMI or Blood Pressure).
Exploratory Data Analysis (EDA):
Correlation analysis using heatmaps to find which features most influence the Outcome.
Distribution analysis of health metrics.
Data Cleaning: Checking for null values and handling inconsistencies in the dataset.
Machine Learning:
Implementation of Logistic Regression.
Hyperparameter tuning using RandomizedSearchCV to find the best-performing hyperparameter settings.
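The machine learning step could look roughly like the sketch below. It is not the notebook's code: the data is a synthetic stand-in generated with `make_classification`, and the search space (only the regularization strength `C`, sampled log-uniformly), the number of iterations, the train/test split, and the scoring metric are all assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the feature matrix and the Outcome column.
X, y = make_classification(
    n_samples=768, n_features=7, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical search space: inverse regularization strength C only.
param_distributions = {"C": loguniform(1e-3, 1e2)}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=20,           # number of random candidates to try
    scoring="roc_auc",   # rank candidates by cross-validated AUC
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated AUC:", search.best_score_)
print("held-out accuracy:", search.best_estimator_.score(X_test, y_test))
```

Random search samples a fixed number of candidates from the distribution instead of trying every grid point, which keeps the cost predictable when the range of `C` spans several orders of magnitude.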
Model Performance
The model's effectiveness is evaluated using the Receiver Operating Characteristic (ROC) curve.
Metric: Area Under the Curve (AUC).
Result: The notebook includes code to calculate the roc_auc score and plot the curve, demonstrating the model's ability to distinguish between diabetic and non-diabetic patients.
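As a small illustration of how the AUC is obtained from `roc_curve` and `auc`, the sketch below uses four invented labels and scores instead of the model's real predictions.

```python
from sklearn.metrics import auc, roc_curve

# Toy labels and predicted scores, invented for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps the decision threshold and returns the false positive
# rate and true positive rate at each threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# auc integrates the resulting curve with the trapezoidal rule.
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.2f}")  # 0.75 for this toy input
```

Three of the four positive/negative pairs are ranked correctly here, which is where the value 0.75 comes from. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation.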
Visualizations
The project generates several key plots:
Heatmaps: To visualize the correlation matrix.
ROC Curve: To visualize the trade-off between the true positive rate and false positive rate at various threshold settings.
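Both plots can be sketched in a few lines with seaborn and matplotlib. The data below is random stand-in data, the outcome is artificially tied to Glucose so the plots are not pure noise, and Glucose itself is used in place of a model's predicted probability. None of this reflects the notebook's actual figures or results.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import auc, roc_curve

rng = np.random.default_rng(0)

# Random stand-in data; column names follow the dataset description.
df = pd.DataFrame(
    {
        "Glucose": rng.normal(120, 30, 200),
        "BMI": rng.normal(32, 7, 200),
        "Age": rng.integers(21, 80, 200),
    }
)
# Fake outcome loosely tied to Glucose (assumption for illustration only).
df["Outcome"] = (df["Glucose"] + rng.normal(0, 30, 200) > 125).astype(int)

fig, (ax_heat, ax_roc) = plt.subplots(1, 2, figsize=(11, 4.5))

# Heatmap of the pairwise Pearson correlations.
corr = df.corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax_heat)
ax_heat.set_title("Correlation matrix")

# ROC curve, with Glucose standing in for a model's predicted score.
fpr, tpr, _ = roc_curve(df["Outcome"], df["Glucose"])
roc_auc = auc(fpr, tpr)
ax_roc.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="grey", label="chance")
ax_roc.set_xlabel("False positive rate")
ax_roc.set_ylabel("True positive rate")
ax_roc.set_title("ROC curve")
ax_roc.legend(loc="lower right")

fig.tight_layout()
```

The dashed diagonal marks a classifier that ranks patients at random, so the further the curve bows toward the top-left corner, the better the separation.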