Customer Segmentation & Sales Analytics (End-to-End Automated Pipeline)

Fully automated segmentation system using K-Prototypes, BigQuery, GitHub Actions & Power BI for real-time insurance customer insights.

Highlights

Fully automated from raw data to Power BI dashboard
Cloud-native architecture using BigQuery + GitHub Actions + Power BI
Segmentation using K-Prototypes for mixed numerical & categorical data
Runs fully on free-tier cloud services (zero cost to operate)
Designed for marketing & sales decision-making (targeting, upsell, retention)
Production-ready, scalable, and maintainable for enterprise deployment

Skills Demonstrated

✔ Data Engineering (ETL, orchestration, scheduling)

✔ Machine Learning (clustering with mixed data types)

✔ Business Analytics & Storytelling (segment insights & business strategies)

✔ Cloud Analytics (BigQuery + Power BI)

✔ Automation & Machine Learning Operations (GitHub Actions, scheduled refresh)

✔ AI-Augmented Development (segment labeling + SQL generation)

Overview

With rising competition and evolving customer expectations, insurers leverage data-driven intelligence to understand customer behavior and tailor their offerings accordingly. Customer segmentation provides a structured way to group individuals with shared characteristics, enabling more efficient targeting and marketing strategies.

Note:
This public repository is intended for demonstration purposes only. The automated execution of the Python workflows runs in a separate private repository of Github.

Technically, this project automates an end-to-end customer segmentation workflow, integrating raw data ingestion from Google Sheets, clustering analysis using K-Prototypes for mixed numeric and categorical features, data warehousing in Google BigQuery, and daily refreshed Power BI dashboards. It is designed to deliver actionable insights for marketing and sales teams through a fully automated, cloud-based, and production-ready solution.

Key Business Impacts

This solution enables insurers to operationalize customer segmentation for commercial decision-making, driving measurable value across sales, marketing, and product teams. Key impacts include:

Targeted Marketing & Improved Conversion: Segmentation enables marketing teams to tailor campaigns and offers to specific customer cohorts, reducing untargeted spend and improving lead conversion.
Increased Revenue via Cross-Sell & Upsell: Segment profiles (e.g., high-income mid-life customers) reveal opportunities to introduce higher-value products or bundle policies, increasing premium per customer and Customer Lifetime Value (CLV).
Higher Retention of High-Value Customers: Tenure-driven insights highlight that long-standing customers deliver higher CLV, enabling proactive retention programs for profitable cohorts.
Product & Portfolio Strategy Optimization: Product performance by customer segment identifies which insurance products drive premium growth and which should be prioritized for development, pricing, and sales emphasis.
Geo-Targeted Commercial Planning: Insights on revenue concentration by location allow sales teams to allocate resources, budgets, and agents to high-value geographies.

Data Insights & Interpretation

Customer Segmentation Overview

The clustering analysis identifies three customer segments with clear differences in demographics, value contribution, and insurance behaviour.

Segment	Profile	Business Value	Product Strategy
Young Urban Starters	Younger, budget-conscious customers with lower income and shorter policy tenure.	Lower current premium and CLV, but represent an important pool of future growth.	Entry-level pricing, affordable bundles.
Discerning Independents	Middle-aged, high-income individuals with moderate tenure, often single or divorced.	Currently contribute the largest share of total annual premium, making them a key revenue segment despite relatively lower per-customer CLV due to short tenure.	Upsell and cross-sell add-on covering not only business but also life insurance products.
Loyal Family Households	Mature, predominantly married customers with the longest policy tenure. Strong focus on family and long-term protection.	Provide stable, predictable revenue and the highest CLV across segments. They form the backbone of long-term profitability.	Renewal incentives, long-term loyalty programs, family bundles and multi-policy discounts.

Key Analytical Findings

Revenue concentration is segment-driven: Discerning Independents contribute the largest share of total annual premium driven by higher-income profiles and demand for higher-value insurance products, despite not being the largest segment by customer count.
Retention drives profitability: Customers with 10+ years tenure are consistently represented among high-CLV and top-premium customers. This highlights the importance of customer retention strategies in driving revenue and lifetime value.
Young customers drive volume, not value: Younger customers aged 18–30 (Young Urban Starters) appear in higher volume but contribute lower total premium due to lower coverage.
Product performance varies by segment: Business products generate the highest total premiums, led by OfficeProtect and MotorShield.
Geography matters: London generates the highest total annual premium, indicating a strong concentration of high-value customers.

Results Summary

3 distinct customer clusters identified
Discerning Independents generate the highest total premium (53.2%)
London contributes the highest geographic premium share (21.3%)
ETL + dashboard refresh automated with zero manual intervention

Technologies Used

Python: pandas, seaborn, matplotlib, kmodes, gower, prince, sklearn, pandas-gbq
Artificial Intelligence: gemini-2.5-flash
GitHub Actions: automation of Python scripts
Google BigQuery: SQL
Power BI: real-time dashboards

ETL Pipeline

End-to-end automated data flow from ingestion to BI dashboard

Workflow Overview

This is a project covering an automated end-to-end data pipeline which performs clustering, stores the output in a warehouse, and updates a live dashboard.

Data Ingestion: Collecting raw data from Google Sheets: customers.csv, sales.csv, products.csv
Data Processing (Extract-Transform-Load ETL)

Data cleaning (e.g. checking any missing values or duplicates)
Data wrangling and feature engineering to ensure the data is in the right format for K-Prototype clustering
K-Prototypes clustering for a mix of numeric and categorical data
Segment labeling and description generated by Gemini 2.5 Flash
Loading the updated customers, sales and products tables to Google BigQuery
Automation of the above process via GitHub Actions (daily at 10PM UK time)

Data Warehousing: Processed model outputs were stored as tables in BigQuery for reporting and analysis, alongside views generated from SQL queries.

Business Intelligence: A live dashboard in Microsoft Power BI shows an analysis of customer and sales data for the company. Data is scheduled for an automatic update on a daily basis at 11PM UK time.

Use of Artificial Intelligence (AI)

Artificial intelligence was incorporated to enhance productivity and interpretability within the analytics pipeline, while all statistical modeling, feature engineering, and data processing steps remained transparent and fully deterministic which were executed using classical machine learning methods.

AI primarily assisted in two key areas of the workflow:

SQL Query Generation for BigQuery

Google Gemini was used to accelerate the development of SQL queries executed in BigQuery. The model helped to:

draft initial query structures for data transformations and aggregation
convert business questions into query specifications

All AI-generated queries were reviewed, validated, and tested manually before integration into the automated pipeline.

Automated Segment Labeling and Description

After clustering customers with K-Prototypes, Gemini was leveraged within the Python pipeline to generate human-interpretable labels and narrative descriptions for each customer segment. This enabled:

more intuitive representation of segment profiles
higher adaptability in light of any changes in clustering parameters, feature sets, or updated customer data

This significantly improves the scalability and maintainability of the pipeline, as no manual relabeling is required when the model evolves.

Schedule & Maintenance

Layer	Schedule	Managed By
Python Script	Daily @ 10pm BST	GitHub Actions
BigQuery Views	Live / real-time	Google BigQuery
Power BI	Daily @ 11pm BST	Power BI Scheduler

Author

Carmen Wong

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
.github/workflows		.github/workflows
Power BI report		Power BI report
dataset		dataset
images		images
Customer_Segmentation_gh_final.ipynb		Customer_Segmentation_gh_final.ipynb
README.md		README.md
customer_segmentation_gh.py		customer_segmentation_gh.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Customer Segmentation & Sales Analytics (End-to-End Automated Pipeline)

Highlights

Skills Demonstrated

Overview

Key Business Impacts

Data Insights & Interpretation

Customer Segmentation Overview

Key Analytical Findings

Results Summary

Technologies Used

ETL Pipeline

Workflow Overview

Use of Artificial Intelligence (AI)

Schedule & Maintenance

Author

About

Uh oh!

Releases

Packages

Languages

cckmwong-data/customer_segmentation

Folders and files

Latest commit

History

Repository files navigation

Customer Segmentation & Sales Analytics (End-to-End Automated Pipeline)

Highlights

Skills Demonstrated

Overview

Key Business Impacts

Data Insights & Interpretation

Customer Segmentation Overview

Key Analytical Findings

Results Summary

Technologies Used

ETL Pipeline

Workflow Overview

Use of Artificial Intelligence (AI)

Schedule & Maintenance

Author

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages