An interactive web application for exploring the Human Development Index (HDI) dataset, focusing on visualizing the impact of the three core components of HDI (Health, Education, and Income) across different countries. The tool provides clustering capabilities to identify patterns and bottlenecks in human development across regions.
- Interactive Map Visualization: Navigate through countries with an interactive world map
- Component Analysis: Visualize how the three HDI components (Life Expectancy, Education, GNI per capita) impact each country
- Clustering Insights: Identify groups of countries with similar development patterns
- Bottleneck Identification: Discover which components are limiting factors for different regions
- Clean Dashboard: Simple, focused interface showing country-specific statistics
- Streamlit - Python web framework for rapid data app development
- Pros: Python-native, easy to prototype, built-in interactive widgets
- Perfect for data science projects with minimal frontend overhead
- Pandas - Data manipulation and analysis
- NumPy - Numerical computations
- Scikit-learn - Clustering algorithms (K-Means, DBSCAN, Hierarchical)
- Plotly - Interactive charts and graphs
- Choropleth maps for geographic visualization
- Scatter plots, bar charts, radar charts for component analysis
- 3D scatter plots for clustering visualization
- Plotly Express - High-level plotting interface
- Matplotlib/Seaborn (optional) - Additional statistical visualizations
- PyYAML - Configuration management
- Streamlit-Folium (optional) - If we need advanced map interactions
data-vis-project/
│
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore file
│
├── app.py # Main Streamlit application entry point
│
├── data/ # Data directory
│ ├── raw/ # Original HDI datasets
│ ├── processed/ # Cleaned and processed data
│ └── README.md # Data documentation
│
├── src/ # Source code modules
│ ├── __init__.py
│ │
│ ├── data_processing.py # Data loading, cleaning, and preprocessing
│ │ ├── load_hdi_data()
│ │ ├── clean_data()
│ │ └── calculate_hdi_components()
│ │
│ ├── clustering.py # Clustering algorithms and analysis
│ │ ├── perform_kmeans()
│ │ ├── perform_hierarchical()
│ │ ├── determine_optimal_clusters()
│ │ └── analyze_cluster_characteristics()
│ │
│ ├── visualizations.py # Visualization components
│ │ ├── create_choropleth_map()
│ │ ├── create_component_chart()
│ │ ├── create_radar_chart()
│ │ ├── create_cluster_scatter()
│ │ └── create_time_series()
│ │
│ └── utils.py # Utility functions
│ ├── get_country_info()
│ ├── calculate_statistics()
│ └── format_display_data()
│
├── config/ # Configuration files
│ ├── app_config.yaml # Application settings
│ └── visualization_config.yaml # Visualization themes and settings
│
├── assets/ # Static assets
│ ├── styles/ # Custom CSS
│ └── images/ # Logos, icons
│
└── tests/ # Unit tests (optional)
├── __init__.py
├── test_data_processing.py
├── test_clustering.py
└── test_visualizations.py
The application follows a modular architecture with clear separation of concerns:
Raw Data → Data Processing → Processed Data → Streamlit App → User Interface
↓ ↓ ↓
Calculations Storage Visualizations
df = pd.read_csv("data/raw/data-raw.csv")
# Result: 195 countries × 880 columns- Loads the raw CSV containing HDI data from 1990-2021
- Includes all components: HDI scores, life expectancy, education metrics, GNI
# Converts string columns to numeric
df['HDI (2021)'] = pd.to_numeric(df['HDI (2021)'], errors='coerce')
# Handles missing values marked as ".." in the dataset
# Removes aggregate regions (e.g., "Arab States", "World")Purpose: Ensure all data is in the correct format for calculations
This is the core function that implements the UNDP HDI methodology:
A) Health Index (Life Expectancy)
Health_Index = (Life_Expectancy - 20) / (85 - 20)- Input: Life expectancy at birth (years)
- Range: [20, 85] years → normalized to [0, 1]
- Example: 70 years → (70-20)/(85-20) = 0.769
B) Education Index
# Two sub-indices:
Mean_School_Index = Mean_Years_Schooling / 18
Expected_School_Index = Expected_Years_Schooling / 18
# Combined:
Education_Index = (Mean_School_Index + Expected_School_Index) / 2- Mean Years: Average years of schooling for adults 25+
- Expected Years: Years a child is expected to receive
- Maximum: 18 years for both
C) Income Index (GNI per capita)
Income_Index = (ln(GNI) - ln(100)) / (ln(75000) - ln(100))- Input: Gross National Income per capita in PPP $
- Logarithmic transformation: Reflects diminishing returns of income
- Range: [100, 75000] USD → normalized to [0, 1]
D) HDI Calculation (Geometric Mean)
HDI = (Health_Index × Education_Index × Income_Index)^(1/3)- Why geometric mean?: Ensures balanced development
- If any component is 0, the entire HDI becomes 0
- Countries must improve all dimensions, not just one
E) Bottleneck Identification
# Find the minimum component for each country
bottleneck = min(Health_Index, Education_Index, Income_Index)
# Example: Afghanistan
Health_Index = 0.65
Education_Index = 0.35 ← Bottleneck!
Income_Index = 0.45Insight: Shows which dimension is holding back a country's development
# Extracts columns like:
# "Human Development Index (1990)", ..., "Human Development Index (2021)"
# Creates separate DataFrames for HDI, Life Expectancy, Education, GNIPurpose: Enable historical trend analysis
Saves multiple processed files:
hdi_components.csv- Main dataset with all calculationsclustering_data.csv- Only the 3 indices for clusteringtimeseries_*.csv- Historical data for each metric
fig = go.Figure(data=go.Choropleth(
locations=df['ISO3'], # ISO 3-letter country codes
z=df['HDI'], # Values to color by
colorscale='RdYlGn', # Red → Yellow → Green
...
))How it works:
- Uses ISO3 codes to match countries to geographic shapes
- Maps HDI values to colors (Red = low, Green = high)
- Adds hover tooltips and interactivity
- Plotly handles the geographic projection
Color Scales:
- HDI: Red-Yellow-Green (intuitive good/bad)
- Health: Red gradient (health intensity)
- Education: Blue gradient (knowledge theme)
- Income: Green gradient (money association)
# Bar chart showing the 3 indices side-by-side
components = ['Health', 'Education', 'Income']
values = [0.85, 0.92, 0.78]Purpose: Quick visual comparison of a country's strengths/weaknesses
- Radar Chart: Spider plot comparing countries
- Time Series: Line graphs showing evolution over time
- 3D Scatter: Clustering results in 3D component space
- Bottleneck Bar Chart: Distribution of limiting factors
Helper functions that support the main application:
data = {
'components': pd.read_csv('hdi_components.csv'),
'clustering': pd.read_csv('clustering_data.csv'),
'timeseries': {
'hdi': pd.read_csv('timeseries_hdi.csv'),
...
}
}Uses Streamlit's @st.cache_data to load files only once
# Get columns for years 2010-2020
year_cols = get_year_columns(df, start_year=2010, end_year=2020)
# Calculate mean HDI across those years
mean_df = df[year_cols].mean(axis=1)Purpose: Powers the year range slider functionality
# Find by name or ISO3 code
country_data = df[df['Country'] == 'Switzerland']
# Returns: All HDI data for SwitzerlandA) Page Configuration
st.set_page_config(
page_title="HDI Explorer",
layout="wide", # Full width
...
)B) Data Loading with Caching
@st.cache_data # ← Caches result, loads once
def load_data():
return load_processed_data()Benefit: Instant subsequent page loads
C) Sidebar Filters
selected_category = st.sidebar.selectbox("HDI Category", categories)
selected_region = st.sidebar.selectbox("Region", regions)
selected_bottleneck = st.sidebar.selectbox("Bottleneck", bottlenecks)Creates dropdown menus for filtering
D) Dynamic Filtering Logic
filtered_df = components_df.copy()
if selected_category != 'All':
filtered_df = filtered_df[filtered_df['HDI_Category'] == selected_category]
if selected_region != 'All':
filtered_df = filtered_df[filtered_df['Region'] == selected_region]Real-time filtering: Changes immediately update the visualization
E) Year Range Slider
year_range = st.sidebar.slider(
"Select Year Range",
min_value=1990,
max_value=2021,
value=(1990, 2021) # Default: all years
)
# Calculate mean HDI for selected range
mean_hdi = timeseries_df[year_cols].mean(axis=1)Interactive: Drag slider ends to adjust range
F) Map Rendering
fig = create_choropleth_map(
filtered_df,
value_column='HDI',
title='Human Development Index by Country'
)
st.plotly_chart(fig, use_container_width=True)Plotly integration: Streamlit displays interactive Plotly charts
G) Reactive Updates
User changes filter → filtered_df updates → map re-renders → statistics recalculate
Everything happens automatically - Streamlit re-runs the script on each interaction
User Action: "Show me Sub-Saharan African countries with Education bottlenecks"
# 1. User selects filters
Region = "SSA"
Bottleneck = "Education"
# 2. Data filtering
df_filtered = df[df['Region'] == 'SSA']
df_filtered = df_filtered[df_filtered['Bottleneck_Component'] == 'Education']
# Result: 25 countries
# 3. Map creation
fig = create_choropleth_map(df_filtered, value_column='Education_Index')
# Colors SSA countries by their education index
# 4. Statistics calculation
avg_education = df_filtered['Education_Index'].mean() # e.g., 0.52
# 5. Display
st.plotly_chart(fig) # Shows map
st.metric("Avg Education Index", f"{avg_education:.3f}") # Shows 0.520def identify_bottleneck(health_idx, education_idx, income_idx):
"""
Returns which component is lowest (the bottleneck)
"""
components = {
'Health': health_idx,
'Education': education_idx,
'Income': income_idx
}
bottleneck = min(components, key=components.get)
return bottleneckWhy this matters:
- Policy insight: Countries should focus on their bottleneck
- Resource allocation: Improving the bottleneck has the highest impact
- Development strategy: Balanced improvement is better than focusing on already-strong areas
def calculate_mean_for_years(df, year_columns):
"""
Calculates average HDI across multiple years for each country
"""
# year_columns = ['HDI (2010)', 'HDI (2011)', ..., 'HDI (2020)']
result = df[['ISO3', 'Country']].copy()
result['Mean_Value'] = df[year_columns].mean(axis=1)
return resultUse case: Compare average development levels across different time periods
1. Caching
@st.cache_data
def load_data():
# Only loads once, then cachedBenefit: Instant subsequent loads
2. Lazy Loading
# Only loads time series when needed
if year_range_enabled:
timeseries_data = load_timeseries()3. Efficient Filtering
# Pandas boolean indexing (very fast)
filtered_df = df[df['HDI_Category'] == 'Very High']Arithmetic Mean: (0.9 + 0.9 + 0.1) / 3 = 0.63
Geometric Mean: (0.9 × 0.9 × 0.1)^(1/3) = 0.44
Implication: A country with one very low component gets a lower HDI Policy message: Can't compensate for poor health with high income
GNI = $1,000 → Income_Index = 0.40
GNI = $10,000 → Income_Index = 0.74
GNI = $100,000 → Income_Index = 1.00
Reason: Diminishing returns - going from $1k to $10k improves life more than $50k to $60k
- Choropleth map showing HDI values by country
- Color gradient based on HDI score
- Click interaction to select countries
- Hover tooltips with basic information
- Filter by HDI categories (Very High, High, Medium, Low)
When a country is selected:
- Country Overview: HDI score, rank, and category
- Component Breakdown: Visual breakdown of the three components
- Life Expectancy Index
- Education Index (mean + expected years)
- GNI per capita Index
- Radar Chart: Compare country against regional/global averages
- Bottleneck Analysis: Identify which component is the limiting factor
- Historical Trends: Time series of HDI evolution
- Cluster View: Toggle to show countries grouped by similarity
- Algorithm Selection: K-Means, Hierarchical, or DBSCAN
- Number of Clusters: Interactive slider
- 3D Scatter Plot: Visualize countries in 3-dimensional space (3 HDI components)
- Cluster Characteristics: Summary statistics for each cluster
- Bottleneck Patterns: Identify common bottlenecks within clusters
The Human Development Index is calculated from three dimensions:
- Health - Life Expectancy at Birth
- Education -
- Mean Years of Schooling
- Expected Years of Schooling
- Standard of Living - GNI per capita (PPP $)
Formula: HDI = (Health Index × Education Index × Income Index)^(1/3)
- Set up development environment
- Install dependencies
- Load and explore HDI dataset
- Clean and preprocess data
- Calculate component contributions
- Implement bottleneck detection algorithm
- Extract time series data (1990-2021)
- Save processed data to multiple CSV files
- Create basic Streamlit app structure
- Implement choropleth map with Plotly
- Add sidebar filters (Category, Region, Bottleneck)
- Build dashboard layout with metrics
- Add year range slider with mean calculation
- Implement real-time filtering
- Create bottleneck distribution chart
- Add data table view
- Implement component breakdown visualizations
- Build bottleneck identification logic
- Create radar charts for country comparison
- Add historical trend visualizations for individual countries
- Implement country detail page
- Implement K-Means clustering
- Implement Hierarchical clustering
- Create cluster visualization (3D scatter)
- Add cluster selection and filtering
- Analyze and display cluster characteristics
- Show common bottlenecks within clusters
- Add custom styling with CSS
- Implement data caching for performance
- Write comprehensive documentation
- Add country-to-country comparison feature
- Export functionality for filtered data
- Add more interactive tooltips
- Comprehensive testing
- Python 3.8+
- pip or conda package manager
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txtstreamlit run app.pyThe application will open in your default browser at http://localhost:8501
- Primary Source: United Nations Development Programme (UNDP)
- Dataset: Human Development Reports
- URL: http://hdr.undp.org/en/data
[Add team member names and responsibilities here]
[Specify license for educational project]
- UNDP for providing the HDI dataset
- Data Visualization course instructors
Last Updated: October 6, 2025
