Manav13254/Data-Science-Project

Diabetes Prediction Project

This project involves an end-to-end Exploratory Data Analysis (EDA) and the development of a Predictive Model to determine whether a patient has diabetes based on diagnostic measurements. The analysis is performed using the Pima Indians Diabetes Database.

📋 Table of Contents

Project Overview

Dataset Description

Dependencies

Key Features of the Analysis

Model Performance

Visualizations

🚀 Project Overview

The primary goal of this notebook is to clean the diagnostic data, explore relationships between health metrics (like Glucose, BMI, and Age), and train a Machine Learning model (Logistic Regression) to predict diabetes outcomes.

📊 Dataset Description

The dataset used is diabetes.csv. It contains 768 observations, including the following columns:

Glucose: Plasma glucose concentration.

BloodPressure: Diastolic blood pressure (mm Hg).

SkinThickness: Triceps skin fold thickness (mm).

Insulin: 2-hour serum insulin (mu U/ml).

BMI: Body mass index (weight in kg/(height in m)^2).

DiabetesPedigreeFunction: A function which scores likelihood of diabetes based on family history.

Age: Age in years.

Outcome: Class variable (0 if non-diabetic, 1 if diabetic).
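As a quick sanity check on the schema described above, the sketch below verifies that a loaded frame has the listed columns and that Outcome is coded as 0/1. The helper name `check_diabetes_frame` and the two inline rows are hypothetical, made up for illustration; they are not taken from the notebook or from the real dataset.

```python
import io

import pandas as pd

# Columns named in the dataset description above.
EXPECTED_COLUMNS = [
    "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def check_diabetes_frame(df: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the frame matches the description."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if "Outcome" in df.columns and not df["Outcome"].isin([0, 1]).all():
        problems.append("Outcome contains values other than 0 and 1")
    return problems


# Hypothetical rows for illustration only, NOT real patient records.
toy_csv = io.StringIO(
    "Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome\n"
    "120,70,30,100,28.5,0.45,35,0\n"
    "160,80,35,0,33.1,0.60,50,1\n"
)
toy = pd.read_csv(toy_csv)
problems = check_diabetes_frame(toy)
print(problems)
```

With the real file, the same check would be called as `check_diabetes_frame(pd.read_csv("diabetes.csv"))`, assuming the CSV sits next to the notebook.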

๐Ÿ›  Dependencies To run this notebook, you will need the following Python libraries:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scipy as sp
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc

🔍 Key Features of the Analysis

Descriptive Statistics: Summarizing central tendency and dispersion to identify potential outliers (e.g., zero values in BMI or Blood Pressure).

Exploratory Data Analysis (EDA):

Correlation analysis using heatmaps to find which features most influence the Outcome.

Distribution analysis of health metrics.

Data Cleaning: Checking for null values and handling inconsistencies in the dataset.

Machine Learning:

Implementation of Logistic Regression.

Hyperparameter tuning using RandomizedSearchCV to find the best-performing model configuration.
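The cleaning and modelling steps listed above could be sketched roughly as follows. This is not the notebook's code: the data is synthetic, and the choice of columns where zero is treated as missing, the median imputation, the scaling step, the search space for `C`, the number of iterations and folds, and the `roc_auc` scoring are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for diabetes.csv (NOT real patient data).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n).round(),
    "BMI": rng.normal(32, 6, n).round(1),
    "Age": rng.integers(21, 70, n),
})
# Outcome is loosely tied to Glucose so the model has something to learn.
y = (X["Glucose"] + rng.normal(0, 25, n) > 125).astype(int)

# Assumption: a zero in these columns is a placeholder for a missing reading.
X.loc[rng.choice(n, 15, replace=False), "BMI"] = 0.0
zero_as_missing = ["Glucose", "BMI"]
X[zero_as_missing] = X[zero_as_missing].replace(0, np.nan)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Imputing and scaling inside the pipeline means each CV fold
# learns its statistics from its own training portion only.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

# Randomized search over the regularization strength C (assumed search space).
search = RandomizedSearchCV(
    pipeline,
    param_distributions={"logisticregression__C": loguniform(1e-3, 1e2)},
    n_iter=10,
    cv=5,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
print(search.best_params_)
print(best_model.score(X_test, y_test))  # accuracy on the held-out split
```

Keeping the imputer inside the pipeline is a design choice rather than a requirement; it avoids leaking information from validation folds into the imputed values.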

📈 Model Performance

The model's effectiveness is evaluated using the Receiver Operating Characteristic (ROC) Curve.

Metric: Area Under the Curve (AUC).

Result: The notebook includes code to calculate the roc_auc score and plot the curve, demonstrating the model's ability to distinguish between diabetic and non-diabetic patients.
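To make the AUC metric concrete, here is a small worked example using `roc_curve` and `auc` from the dependency list. The four labels and scores are hypothetical toy values, not output from the notebook's model. The second calculation illustrates one common reading of AUC: the fraction of (diabetic, non-diabetic) pairs in which the diabetic patient receives the higher score, which matches the ROC area when there are no tied scores.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical toy labels (1 = diabetic) and predicted scores.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC points across all thresholds, then the area under them.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Pairwise interpretation: how often does a positive outrank a negative?
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean([p > q for p in pos for q in neg])

print(roc_auc, pairwise)  # 0.75 0.75
```

Three of the four positive/negative pairs are ordered correctly (only 0.35 vs 0.4 is not), which gives 0.75 in both calculations.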

🖼 Visualizations

The project generates several key plots:

Heatmaps: To visualize the correlation matrix.

ROC Curve: To visualize the trade-off between the true positive rate and false positive rate at various threshold settings.
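The two plots above might be produced along these lines. This is a minimal sketch on synthetic data, not the notebook's plotting code: the column subset, the colour map, and the use of raw Glucose as the score for the ROC curve are assumptions chosen only so that there is something to draw.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import roc_curve, auc

# Synthetic stand-in for diabetes.csv (NOT real patient data).
rng = np.random.default_rng(1)
n = 200
glucose = rng.normal(120, 30, n)
df = pd.DataFrame({
    "Glucose": glucose,
    "BMI": rng.normal(32, 6, n),
    "Age": rng.integers(21, 70, n),
    "Outcome": (glucose + rng.normal(0, 25, n) > 125).astype(int),
})

# Pairwise Pearson correlations between all columns.
corr = df.corr()

fig, (ax_heat, ax_roc) = plt.subplots(1, 2, figsize=(11, 4.5))

# Heatmap of the correlation matrix, annotated with the coefficients.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax_heat)
ax_heat.set_title("Correlation matrix")

# ROC curve; Glucose itself serves as the score here (assumption).
fpr, tpr, _ = roc_curve(df["Outcome"], df["Glucose"])
ax_roc.plot(fpr, tpr, label=f"AUC = {auc(fpr, tpr):.2f}")
ax_roc.plot([0, 1], [0, 1], linestyle="--", label="chance")
ax_roc.set_xlabel("False positive rate")
ax_roc.set_ylabel("True positive rate")
ax_roc.set_title("ROC curve")
ax_roc.legend(loc="lower right")

fig.tight_layout()
```

In a notebook the figure renders inline; elsewhere, `fig.savefig(...)` with a path of your choosing would write it to disk.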
