GitHub - Manav13254/Data-Science-Project

Diabetes Prediction Project

This project involves an end-to-end Exploratory Data Analysis (EDA) and the development of a Predictive Model to determine whether a patient has diabetes based on diagnostic measurements. The analysis is performed using the Pima Indians Diabetes Database.

📋 Table of Contents

Project Overview

Dataset Description

Dependencies

Key Features of the Analysis

Model Performance

Visualizations

🚀 Project Overview

The primary goal of this notebook is to clean the diagnostic data, explore relationships between health metrics (like Glucose, BMI, and Age), and train a Machine Learning model (Logistic Regression) to predict diabetes outcomes.

📊 Dataset Description

The dataset used is diabetes.csv. It contains 768 observations with the following features:

Glucose: Plasma glucose concentration.

BloodPressure: Diastolic blood pressure (mm Hg).

SkinThickness: Triceps skin fold thickness (mm).

Insulin: 2-hour serum insulin (mu U/ml).

BMI: Body mass index (weight in kg/(height in m)^2).

DiabetesPedigreeFunction: A function which scores likelihood of diabetes based on family history.

Age: Age in years.

Outcome: Class variable (0 if non-diabetic, 1 if diabetic).
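
As a starting point, the columns listed above can be checked right after loading. This is a minimal sketch, not the notebook's actual code: the helper names load_diabetes and summarize and the inline sample rows are hypothetical, and the real diabetes.csv may contain columns beyond those listed here.

```python
import io

import pandas as pd

# Columns named in the dataset description. The real file may have more;
# this sketch only verifies that the listed ones are present.
EXPECTED_COLUMNS = [
    "Glucose", "BloodPressure", "SkinThickness", "Insulin",
    "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

# Made-up rows for illustration only; they are not taken from diabetes.csv.
SAMPLE_CSV = """Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
120,70,20,80,25.0,0.5,30,0
150,0,35,0,33.1,0.6,45,1
95,64,0,0,0.0,0.2,22,0
"""


def load_diabetes(source):
    """Read the CSV (path or file-like object) and check the described columns."""
    df = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df


def summarize(df):
    """Return the row count, per-column null counts, and class balance."""
    return {
        "n_rows": len(df),
        "null_counts": df[EXPECTED_COLUMNS].isna().sum().to_dict(),
        "class_balance": df["Outcome"].value_counts().to_dict(),
    }


# Demo on the inline sample; pass "diabetes.csv" instead to inspect the real file.
demo_df = load_diabetes(io.StringIO(SAMPLE_CSV))
demo_summary = summarize(demo_df)
print(demo_summary["n_rows"], demo_summary["class_balance"])
```

Passing "diabetes.csv" to load_diabetes in place of the inline sample should report 768 rows if the file matches the description above.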

🛠 Dependencies

To run this notebook, you will need the following Python libraries:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import scipy as sp
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, auc

🔍 Key Features of the Analysis

Descriptive Statistics: Summarizing central tendency and dispersion to identify potential outliers and implausible values (e.g., zeros in BMI or BloodPressure, which are not physiologically possible).

Exploratory Data Analysis (EDA):

Correlation analysis using heatmaps to find which features are most strongly correlated with the Outcome.

Distribution analysis of health metrics.

Data Cleaning: Checking for null values and handling inconsistencies in the dataset.

Machine Learning:

Implementation of Logistic Regression.

Hyperparameter tuning using RandomizedSearchCV to select the best-performing estimator.
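
The cleaning and modelling steps above could be combined roughly as follows. This is a sketch under stated assumptions rather than the notebook's code: treating zeros as missing values, median imputation, StandardScaler (the repository ships a scaler.pkl, but which scaler it holds is not stated), the search range for C, and the helper names mark_zeros_missing and tune_logistic_regression are all choices made here for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a zero in these columns stands in for a missing measurement.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def mark_zeros_missing(df):
    """Return a copy of df with implausible zeros replaced by NaN."""
    out = df.copy()
    out[ZERO_AS_MISSING] = out[ZERO_AS_MISSING].replace(0, np.nan)
    return out


def tune_logistic_regression(X, y, n_iter=10, seed=0):
    """Randomized search over the regularization strength C, scored by ROC AUC."""
    pipeline = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill the NaNs set above
        ("scale", StandardScaler()),                   # assumed scaler
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    search = RandomizedSearchCV(
        pipeline,
        param_distributions={"clf__C": loguniform(1e-3, 1e2)},  # assumed range
        n_iter=n_iter,
        scoring="roc_auc",
        cv=3,
        random_state=seed,
    )
    search.fit(X, y)
    return search
```

With df loaded from diabetes.csv, calling tune_logistic_regression(mark_zeros_missing(df).drop(columns="Outcome"), df["Outcome"]) returns a fitted search object whose best_estimator_ and best_params_ attributes hold the selected model and settings. Keeping imputation and scaling inside the pipeline means they are refit on each cross-validation fold, which avoids leaking statistics from the validation split.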

📈 Model Performance

The model's effectiveness is evaluated using the Receiver Operating Characteristic (ROC) Curve.

Metric: Area Under the Curve (AUC).

Result: The notebook includes code to calculate the roc_auc score and plot the curve, demonstrating the model's ability to distinguish between diabetic and non-diabetic patients.
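
For reference, the AUC can be computed from true labels and predicted probabilities with the roc_curve and auc functions listed under Dependencies. The labels and scores below are toy values made up for illustration, not results from this project, and the helper name roc_auc_from_scores is hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve


def roc_auc_from_scores(y_true, y_score):
    """Return the ROC curve points and the area under that curve."""
    fpr, tpr, _thresholds = roc_curve(y_true, y_score)
    return fpr, tpr, auc(fpr, tpr)


# Toy data: two non-diabetic (0) and two diabetic (1) cases with
# made-up predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, roc_auc = roc_auc_from_scores(y_true, y_score)
print(f"AUC = {roc_auc:.2f}")  # prints "AUC = 0.75"
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. In the toy example, three of the four positive/negative pairs are ranked correctly, which gives 0.75.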

🖼 Visualizations

The project generates several key plots:

Heatmaps: To visualize the correlation matrix.

ROC Curve: To visualize the trade-off between the true positive rate and false positive rate at various threshold settings.
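
Both plots can be sketched as below. To keep the example small it uses plain matplotlib for the heatmap; with seaborn, sns.heatmap(df.corr(), annot=True) is a common one-line alternative. The function names plot_correlation_heatmap and plot_roc, the colour map, and the output file name in the usage note are assumptions made for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import auc, roc_curve


def plot_correlation_heatmap(df, ax):
    """Draw the correlation matrix of df's numeric columns on ax."""
    corr = df.corr()
    image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ticks = range(len(corr.columns))
    ax.set_xticks(ticks)
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(ticks)
    ax.set_yticklabels(corr.columns)
    ax.figure.colorbar(image, ax=ax)
    ax.set_title("Correlation matrix")
    return corr


def plot_roc(y_true, y_score, ax):
    """Draw the ROC curve with a chance diagonal and return the AUC."""
    fpr, tpr, _thresholds = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, label=f"AUC = {roc_auc:.2f}")
    ax.plot([0, 1], [0, 1], linestyle="--", label="chance")
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.set_title("ROC curve")
    ax.legend(loc="lower right")
    return roc_auc
```

A typical call creates the axes with fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4)), passes the loaded DataFrame to plot_correlation_heatmap and the test labels and predicted probabilities to plot_roc, then saves the figure with fig.savefig("plots.png").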

Repository files (4 commits):

Project.ipynb
README.md
Scaler.pkl
app.py
best_log_reg_model.pkl
diabetes.csv
random_search_log_reg_model.pkl
requirements.txt
scaler.pkl
