This project implements a machine learning model that predicts passenger survival on the Titanic from features such as age, sex, passenger class, and family size. It uses the classic Titanic dataset and a Random Forest Classifier with hyperparameter tuning to achieve high accuracy.
The Titanic dataset (Titanic-Dataset.csv) contains 891 samples with the following features:
- PassengerId: Unique identifier for each passenger
- Survived: Target variable (0 = did not survive, 1 = survived)
- Pclass: Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Name: Passenger's name
- Sex: Passenger's gender
- Age: Passenger's age
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
The dataset is sourced from a local file path in the provided notebook.
TITANIC SURVIVAL PREDICTION.ipynb: Jupyter Notebook containing the complete code for data preprocessing, feature engineering, model training, evaluation, and visualization.
Titanic-Dataset.csv: The dataset file (not included in this repository; ensure it is available in your local environment or update the file path in the notebook).
To run the notebook, install the following Python libraries:

pip install pandas numpy matplotlib seaborn scikit-learn
The Titanic dataset is loaded using pandas.
Dataset shape, data types, missing values, and summary statistics are analyzed.
Visualizations include survival counts, survival by passenger class and by gender, and the age distribution.
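The loading and exploration steps might look like the sketch below. The small inline DataFrame is a hypothetical stand-in for Titanic-Dataset.csv so the snippet is self-contained; in the notebook, pandas.read_csv loads the full 891-row file instead.

```python
import pandas as pd

# In the notebook the real file is loaded, e.g.:
#   df = pd.read_csv("Titanic-Dataset.csv")
# A tiny synthetic frame stands in here to illustrate the same checks.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
})

print(df.shape)           # dataset shape (rows, columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # missing values per column
print(df.describe())      # summary statistics for numeric columns

# Survival rate by passenger class, the kind of grouping behind the plots.
print(df.groupby("Pclass")["Survived"].mean())
```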
- Extracted titles from names and grouped rare titles (e.g., Lady, Countess) into a 'Rare' category.
- Created FamilySize (SibSp + Parch + 1) and IsAlone (1 if FamilySize = 1, else 0) features.
- Binned Age into 5 categories (0-4) based on age ranges.
- Binned Fare into 4 categories (0-3) using quartiles.
- Dropped unnecessary columns: PassengerId, Name, Ticket, Cabin, AgeBand, FareBand.
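The feature engineering steps above can be sketched roughly as follows. The three-row frame, the title-extraction regex, and the set of "common" titles are illustrative assumptions, not the notebook's exact code:

```python
import pandas as pd

# Hypothetical mini-frame; column names follow the Titanic schema.
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John", "Rothes, the Countess. of"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Age": [22.0, 38.0, 33.0],
    "Fare": [7.25, 71.28, 86.5],
})

# Pull the title (the word ending in a period) out of the name, then fold
# anything outside a small common set into a 'Rare' bucket.
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
common = {"Mr", "Mrs", "Miss", "Master"}
df["Title"] = df["Title"].where(df["Title"].isin(common), "Rare")

# Family-derived features.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Ordinal bands: 5 equal-width age bins (0-4), 4 fare quartiles (0-3).
df["AgeBand"] = pd.cut(df["Age"], 5, labels=False)
df["FareBand"] = pd.qcut(df["Fare"], 4, labels=False)
```

In the notebook, AgeBand and FareBand are then used to map Age and Fare to their ordinal codes before the band columns are dropped along with PassengerId, Name, Ticket, and Cabin.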
Numeric features (Age, Fare, SibSp, Parch, FamilySize) are imputed with median values and scaled using StandardScaler. Categorical features (Pclass, Sex, Embarked, Title, IsAlone) are imputed with the most frequent value and one-hot encoded. Data is split into training (80%) and testing (20%) sets with stratification.
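A preprocessing setup like the one described can be expressed with scikit-learn's ColumnTransformer. This is a sketch under the stated assumptions (median/most-frequent imputation, StandardScaler, one-hot encoding, stratified 80/20 split); the six-row frame is synthetic filler just to show it runs:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["Age", "Fare", "SibSp", "Parch", "FamilySize"]
categorical_features = ["Pclass", "Sex", "Embarked", "Title", "IsAlone"]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# Tiny synthetic frame (with deliberate missing values) standing in for the
# engineered Titanic features.
X = pd.DataFrame({
    "Age": [22, 38, None, 35, 28, 40],
    "Fare": [7.2, 71.3, 8.1, 53.1, 13.0, 30.0],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Parch": [0, 0, 0, 0, 0, 1],
    "FamilySize": [2, 2, 1, 2, 1, 2],
    "Pclass": [3, 1, 3, 1, 2, 2],
    "Sex": ["male", "female", "female", "female", "male", "male"],
    "Embarked": ["S", "C", "S", "S", None, "Q"],
    "Title": ["Mr", "Mrs", "Miss", "Mrs", "Mr", "Mr"],
    "IsAlone": [0, 0, 1, 0, 1, 0],
})
y = pd.Series([0, 1, 1, 1, 0, 0])

# Stratified 80/20 split, then fit the transformer on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train_t = preprocessor.fit_transform(X_train)
```

Fitting the imputers and scaler on the training split only (and merely transforming the test split) avoids leaking test-set statistics into training.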
A Random Forest Classifier is trained within a pipeline that includes preprocessing steps. Hyperparameter tuning is performed using GridSearchCV with parameters: n_estimators (100, 200), max_depth (None, 5, 10), and min_samples_split (2, 5).
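The training and tuning step can be sketched as below with the grid stated above. The make_classification toy data and the bare StandardScaler first step are stand-ins; in the notebook, the first pipeline step is the ColumnTransformer over the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in data for the preprocessed Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = Pipeline([
    ("preprocessor", StandardScaler()),  # the ColumnTransformer in the notebook
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Parameter names are prefixed with the pipeline step name ("classifier__").
param_grid = {
    "classifier__n_estimators": [100, 200],
    "classifier__max_depth": [None, 5, 10],
    "classifier__min_samples_split": [2, 5],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)
print(grid.best_score_)  # mean cross-validation accuracy of the best candidate
```

Because the classifier sits inside the pipeline, each cross-validation fold refits the preprocessing on its own training portion, so the search is free of preprocessing leakage.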
Model accuracy is calculated using accuracy_score. A classification report (precision, recall, F1-score) is generated. A confusion matrix shows classification performance.
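The evaluation calls look like the following; the y_test/y_pred arrays here are hypothetical stand-ins for the notebook's test labels and the tuned model's predictions:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Stand-ins for y_test and best_model.predict(X_test).
y_test = [0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0]

print(accuracy_score(y_test, y_pred))          # fraction of correct predictions
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class
print(confusion_matrix(y_test, y_pred))        # rows: actual, columns: predicted
```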
Best Parameters: {'classifier__max_depth': 5, 'classifier__min_samples_split': 2, 'classifier__n_estimators': 100}
Cross-validation Accuracy: ~82.7%
Test Accuracy: 82.7%
Confusion Matrix (rows: actual 0/1, columns: predicted 0/1):
[[97 13]
 [18 51]]
That is, 97 true negatives, 13 false positives, 18 false negatives, and 51 true positives over the 179-sample test set.