RohitRP22/Insurance_Premium_Amount_Prediction

Insurance Premium Prediction Dataset

Problem Statement

The goal of this dataset is to facilitate the development and testing of regression models for predicting insurance premiums based on various customer characteristics and policy details. Insurance companies often rely on data-driven approaches to estimate premiums, taking into account factors such as age, income, health status, and claim history. This synthetic dataset simulates real-world scenarios so that practitioners can work on feature engineering, data cleaning, and model training.

Dataset Overview

This dataset contains more than 2 lakh (200,000) records and 20 features, with a mix of categorical, numerical, and text data. It includes missing values, incorrect data types, and skewed distributions to mimic the complexities of real-world datasets. The target variable for prediction is "Premium Amount".

Features

  1. Age: Age of the insured individual (Numerical)
  2. Gender: Gender of the insured individual (Categorical: Male, Female)
  3. Annual Income: Annual income of the insured individual (Numerical, skewed)
  4. Marital Status: Marital status of the insured individual (Categorical: Single, Married, Divorced)
  5. Number of Dependents: Number of dependents (Numerical, with missing values)
  6. Education Level: Highest education level attained (Categorical: High School, Bachelor's, Master's, PhD)
  7. Occupation: Occupation of the insured individual (Categorical: Employed, Self-Employed, Unemployed)
  8. Health Score: A score representing the health status (Numerical, skewed)
  9. Location: Type of location (Categorical: Urban, Suburban, Rural)
  10. Policy Type: Type of insurance policy (Categorical: Basic, Comprehensive, Premium)
  11. Previous Claims: Number of previous claims made (Numerical, with outliers)
  12. Vehicle Age: Age of the vehicle insured (Numerical)
  13. Credit Score: Credit score of the insured individual (Numerical, with missing values)
  14. Insurance Duration: Duration of the insurance policy (Numerical, in years)
  15. Premium Amount: Target variable representing the insurance premium amount (Numerical, skewed)
  16. Policy Start Date: Start date of the insurance policy (Text, improperly formatted)
  17. Customer Feedback: Short feedback comments from customers (Text)
  18. Smoking Status: Smoking status of the insured individual (Categorical: Yes, No)
  19. Exercise Frequency: Frequency of exercise (Categorical: Daily, Weekly, Monthly, Rarely)
  20. Property Type: Type of property owned (Categorical: House, Apartment, Condo)
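
The feature list above can be written down as a small schema, which makes it easy to check a loaded table before doing any modelling. This is a minimal sketch: it assumes the column headers in the data file match the feature names exactly, which this README does not confirm, and `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Feature names grouped by the kind stated in the list above.
# Assumption: the file's column headers match these names exactly.
EXPECTED_COLUMNS = {
    "numerical": [
        "Age", "Annual Income", "Number of Dependents", "Health Score",
        "Previous Claims", "Vehicle Age", "Credit Score",
        "Insurance Duration", "Premium Amount",
    ],
    "categorical": [
        "Gender", "Marital Status", "Education Level", "Occupation",
        "Location", "Policy Type", "Smoking Status",
        "Exercise Frequency", "Property Type",
    ],
    "text": ["Policy Start Date", "Customer Feedback"],
}


def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected feature names that are absent from df."""
    expected = [name for group in EXPECTED_COLUMNS.values() for name in group]
    return [name for name in expected if name not in df.columns]


# Usage on an empty frame that only has one of the expected columns.
example = pd.DataFrame(columns=["Age"])
absent = missing_columns(example)
```

Running the check right after loading the data surfaces renamed or dropped columns early, before they cause less obvious errors in later steps.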

Data Characteristics

  • Missing Values: Certain features contain missing values to simulate real-world data collection issues.
  • Incorrect Data Types: Some fields are intentionally set to incorrect data types to practice data cleaning.
  • Skewed Distributions: Numerical features like Annual Income and Premium Amount have skewed distributions, which can be addressed through transformations.
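
One possible way to handle the three characteristics above is sketched below with pandas. The small frame is hand-made for illustration and is not taken from the dataset. The date format (`%Y-%m-%d`), the choice of median imputation, and the `log1p` transform are assumptions rather than anything this README prescribes, and `clean` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for the real file; values are illustrative only.
raw = pd.DataFrame({
    "Annual Income": [25000.0, 48000.0, None, 1200000.0],
    "Number of Dependents": [2.0, None, 0.0, 3.0],
    "Credit Score": ["710", "655", None, "590"],  # numbers stored as text
    "Policy Start Date": ["2023-01-15", "2022-07-03", "not a date", "2021-11-30"],
    "Premium Amount": [850.0, 1200.0, 640.0, 9800.0],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Incorrect data types: coerce text to numbers and dates.
    # Entries that cannot be parsed become NaN / NaT instead of raising.
    out["Credit Score"] = pd.to_numeric(out["Credit Score"], errors="coerce")
    out["Policy Start Date"] = pd.to_datetime(
        out["Policy Start Date"], format="%Y-%m-%d", errors="coerce"
    )
    # Missing values: fill numeric gaps with the column median (one common choice).
    for col in ["Annual Income", "Number of Dependents", "Credit Score"]:
        out[col] = out[col].fillna(out[col].median())
    # Skewed distributions: add log1p-transformed copies of the skewed columns.
    for col in ["Annual Income", "Premium Amount"]:
        out["Log " + col] = np.log1p(out[col])
    return out


cleaned = clean(raw)
```

The median is used here because it is less sensitive to the skew and outliers described above than the mean. Other imputation strategies may suit the real data better.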

Usage

This dataset can be used for:

  • Practicing feature engineering techniques.
  • Implementing data cleaning and preprocessing steps.
  • Training regression models for predicting insurance premiums.
  • Evaluating model performance and tuning hyperparameters.
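
A compact sketch of the training and evaluation steps listed above, using scikit-learn. The data here is randomly generated so that the block runs on its own. It reuses a few column names from the feature list, but the values and the target have no real relationship to the dataset. The random forest, the log-transformed target, and the mean absolute error metric are illustrative choices, not the method used in this repository.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Randomly generated stand-in data (assumption: not the real dataset).
rng = np.random.default_rng(0)
n = 400
data = pd.DataFrame({
    "Age": rng.integers(18, 65, n).astype(float),
    "Annual Income": rng.lognormal(10, 1, n),
    "Credit Score": rng.integers(300, 850, n).astype(float),
    "Policy Type": rng.choice(["Basic", "Comprehensive", "Premium"], n),
    "Smoking Status": rng.choice(["Yes", "No"], n),
})
data.loc[rng.choice(n, 40, replace=False), "Credit Score"] = np.nan
data["Premium Amount"] = rng.lognormal(6.5, 0.8, n)  # made-up skewed target

numeric = ["Age", "Annual Income", "Credit Score"]
categorical = ["Policy Type", "Smoking Status"]
X = data[numeric + categorical]
y = data["Premium Amount"]

# Impute numeric gaps with the median and one-hot encode the categoricals.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Fit on log1p(target) and map predictions back with expm1.
model = TransformedTargetRegressor(
    regressor=Pipeline([
        ("prep", preprocess),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ]),
    func=np.log1p,
    inverse_func=np.expm1,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model.fit(X_train, y_train)
pred = model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
```

Keeping the preprocessing inside the pipeline means the imputer and encoder are fitted on the training split only, which avoids leaking test data. The same `model` object can be passed to `GridSearchCV` for hyperparameter tuning.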

License

This synthetic dataset is created for educational purposes and can be used freely for practice and experimentation.
