
Mini Data Warehouse

A simple data warehouse project demonstrating PostgreSQL database setup with synthetic e-commerce data using Docker and Python.

Overview

This project creates a mini data warehouse with synthetic customer, product, order, and order item data. It uses PostgreSQL as the database backend with PgAdmin for database administration, all containerized with Docker.

Project Structure

mini-data-warehouse/
├── data/                        # Generated CSV files and exports
│   ├── customers.csv            # Customer data (200 records)
│   ├── products.csv             # Product data (50 records)
│   ├── orders.csv               # Order data (1000+ records)
│   └── order_items.csv          # Order item data (variable records)
├── init/
│   └── schema.sql               # Database initialization script
├── 📊 Core Data Components
├── data.py                      # Synthetic data generation
├── transform_pipeline.py        # ETL transformation pipeline
├── data_quality.py              # Data quality assessment tools
├── incremental_load.py          # Incremental data loading
├── retention_policy.py          # Data retention management
├── 🌟 Star Schema & Analytics
├── star_schema.sql              # Star schema creation script
├── dashboard.py                 # Interactive dashboard generator
├── 🛠️ Management & Operations
├── warehouse_manager.py         # Central management CLI
├── docker-compose.yaml          # Docker services configuration
├── 🔧 Development Environment
├── flake.nix                    # Nix flake for reproducible environments
├── flake.lock                   # Nix flake lock file
├── .envrc                       # direnv configuration
└── README.md                    # This documentation

Database Schema

The warehouse has four source tables plus a star schema layer built on top of them:

Source tables:

  • customers: Customer information with enriched geographic and tier data
  • products: Product catalog with performance analytics and price tiers
  • orders: Order headers with comprehensive order metrics
  • order_items: Order line items with detailed product associations

Star schema tables:

  • dim_customer: Customer dimension with regional groupings and behavior analysis
  • dim_product: Product dimension with category and performance metrics
  • dim_date: Date dimension with comprehensive time attributes
  • fact_sales: Central fact table with sales metrics and calculated fields

Prerequisites

Option 1: Using Nix (Recommended)

  • Nix with flakes enabled (direnv is optional, for automatic environment loading)

Option 2: Manual Setup

  • Docker and Docker Compose
  • Python 3.x (for data generation)
  • pip (Python package manager)

Quick Start

Option 1: Using Management CLI (Recommended)

The fastest way to get started with the complete data warehouse:

# Setup complete environment and run full pipeline
python warehouse_manager.py pipeline

This single command will:

  • Generate synthetic data
  • Start PostgreSQL and PgAdmin services
  • Create star schema with dimension tables
  • Run data transformation pipelines
  • Execute data quality checks
  • Generate interactive dashboard

Option 2: Using Nix (Development Environment)

  1. Enter the development environment:
nix develop
  2. Run the complete pipeline:
python warehouse_manager.py pipeline
  3. Or run individual components:
# Generate data and start services
python warehouse_manager.py setup

# Generate dashboard
python warehouse_manager.py dashboard

# Check data quality
python warehouse_manager.py quality

# View status
python warehouse_manager.py status

Option 3: Manual Step-by-Step

  1. Generate Synthetic Data:

First, install the required Python dependencies and generate the synthetic data:

pip install pandas faker
python data.py

This will create CSV files in the data/ directory with:

  • 200 synthetic customers
  • 50 products across 5 categories (Electronics, Furniture, Clothing, Books, Sports)
  • 1000 orders with associated order items
  2. Start the Database:

Start the PostgreSQL database and PgAdmin using Docker Compose:

docker-compose up -d

This will:

  • Start a PostgreSQL 15 database on port 5432
  • Start PgAdmin on port 8080
  • Automatically load the CSV data into the database tables
  3. Access the Database:

Database Connection:

  • Host: localhost
  • Port: 5432
  • Database: warehouse
  • Username: admin
  • Password: secret

PgAdmin Web Interface: open http://localhost:8080 and log in with the PGADMIN_DEFAULT_EMAIL and PGADMIN_DEFAULT_PASSWORD values from docker-compose.yaml.
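For programmatic access, here is a minimal Python sketch using the connection details above (it assumes the psycopg2 package is installed; the query is only illustrative):

import psycopg2

# Connection details match the defaults in docker-compose.yaml
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="warehouse",
    user="admin",
    password="secret",
)

with conn.cursor() as cur:
    # Illustrative query: confirm the orders table was loaded
    cur.execute("SELECT COUNT(*) FROM orders;")
    print("orders loaded:", cur.fetchone()[0])

conn.close()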

Advanced Usage Examples

Management CLI Operations

# Complete pipeline (setup → transform → quality → dashboard)
python warehouse_manager.py pipeline

# Individual operations
python warehouse_manager.py setup              # Initial setup
python warehouse_manager.py generate           # Generate more data
python warehouse_manager.py transform          # Run ETL pipeline  
python warehouse_manager.py quality            # Check data quality
python warehouse_manager.py dashboard          # Create dashboard
python warehouse_manager.py status             # Show current status

# Incremental operations
python warehouse_manager.py incremental --customers 10 --orders 50
python warehouse_manager.py retention          # Execute retention policies

# Maintenance
python warehouse_manager.py stop               # Stop services
python warehouse_manager.py cleanup            # Full cleanup

Data Quality Management

# Run comprehensive quality checks
python data_quality.py

# View detailed quality report
cat data_quality_report.json | jq '.'

Quality checks include:

  • Completeness Analysis: NULL value detection and completeness rates
  • Referential Integrity: Foreign key constraint validation
  • Data Consistency: Business rule validation and anomaly detection
  • Overall Quality Score: Automated scoring (0-100) based on quality metrics
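For illustration, here is a stripped-down completeness check along the lines of what data_quality.py does; the table and columns checked are assumptions, and the real script covers all tables and produces a full report:

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")

# Columns to check are illustrative; the real script inspects every table
checks = {"customers": ["email", "country"]}

with conn.cursor() as cur:
    for table, columns in checks.items():
        for column in columns:
            # COUNT(column) skips NULLs, so the ratio is the completeness rate
            cur.execute(f"SELECT COUNT(*), COUNT({column}) FROM {table};")
            total, non_null = cur.fetchone()
            rate = 100.0 * non_null / total if total else 100.0
            print(f"{table}.{column}: {rate:.1f}% complete")

conn.close()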

Star Schema Analytics

Once the star schema is created, you can run advanced analytical queries:

-- Customer lifetime value analysis using star schema
SELECT 
    dc.region,
    dc.customer_tier,
    COUNT(DISTINCT dc.customer_key) as customers,
    AVG(fs.total_amount) as avg_order_value,
    SUM(fs.total_amount) as total_revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.customer_key = dc.customer_key
JOIN dim_date dd ON fs.date_key = dd.date_key
WHERE dd.year = 2024
GROUP BY dc.region, dc.customer_tier
ORDER BY total_revenue DESC;

-- Product performance by price tier and category  
SELECT 
    dp.category,
    dp.price_tier,
    COUNT(fs.sales_key) as transactions,
    SUM(fs.quantity) as units_sold,
    SUM(fs.total_amount) as revenue,
    SUM(fs.total_amount * fs.profit_margin / 100) as profit
FROM fact_sales fs
JOIN dim_product dp ON fs.product_key = dp.product_key
GROUP BY dp.category, dp.price_tier
ORDER BY revenue DESC;

-- Time-series revenue analysis with trends
SELECT 
    dd.year,
    dd.quarter,
    dd.month_name,
    SUM(fs.total_amount) as monthly_revenue,
    COUNT(DISTINCT fs.customer_key) as active_customers,
    AVG(fs.total_amount) as avg_transaction_value
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key  
GROUP BY dd.year, dd.quarter, dd.month_name, dd.month
ORDER BY dd.year, dd.month;
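The same queries can also be run from Python, for example with pandas (a sketch assuming pandas and SQLAlchemy are installed):

import pandas as pd
from sqlalchemy import create_engine

# Same credentials as docker-compose.yaml
engine = create_engine("postgresql+psycopg2://admin:secret@localhost:5432/warehouse")

query = """
SELECT dp.category, SUM(fs.total_amount) AS revenue
FROM fact_sales fs
JOIN dim_product dp ON fs.product_key = dp.product_key
GROUP BY dp.category
ORDER BY revenue DESC;
"""

# Load the result into a DataFrame for further analysis or charting
revenue_by_category = pd.read_sql(query, engine)
print(revenue_by_category)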

Data Generation Details

The synthetic data is generated using the Python Faker library with the following characteristics:

  • Customers: Random names, emails, and geographic locations
  • Products: 5 categories with realistic price ranges ($5-$2000)
  • Orders: Random date range (2024), 1-5 items per order
  • Geographic Distribution: Global customer base with diverse countries
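A minimal sketch of that approach is shown below; column names and exact logic are illustrative and may differ from data.py:

import os
import random
import pandas as pd
from faker import Faker

fake = Faker()
categories = ["Electronics", "Furniture", "Clothing", "Books", "Sports"]
os.makedirs("data", exist_ok=True)

# 200 synthetic customers with name, email and country
customers = pd.DataFrame({
    "customer_id": range(1, 201),
    "name": [fake.name() for _ in range(200)],
    "email": [fake.email() for _ in range(200)],
    "country": [fake.country() for _ in range(200)],
})

# 50 products across the five categories, priced between $5 and $2000
products = pd.DataFrame({
    "product_id": range(1, 51),
    "category": [random.choice(categories) for _ in range(50)],
    "price": [round(random.uniform(5, 2000), 2) for _ in range(50)],
})

customers.to_csv("data/customers.csv", index=False)
products.to_csv("data/products.csv", index=False)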

Nix Flake Usage

This project includes a Nix flake for reproducible development environments. Here are the available commands:

Development Shell

# Enter development environment with all dependencies
nix develop

# Use direnv for automatic environment loading (if installed)
direnv allow

Run Applications

# Generate synthetic data using the packaged app
nix run .

# Or run the specific generate-data app
nix run .#generate-data

Build Packages

# Build the data generation package
nix build

# Build Docker image (optional)
nix build .#docker

Flake Information

# Show all available outputs
nix flake show

# Check flake for issues
nix flake check

Management Commands

Stop the services

docker-compose down

View logs

docker-compose logs

Regenerate data and restart services

python warehouse_manager.py generate --customers 50 --orders 100
python warehouse_manager.py incremental --customers 10 --orders 25

Connect to database directly

docker exec -it mini_warehouse_db psql -U admin -d warehouse

Generate fresh dashboard

python warehouse_manager.py dashboard
# Open dashboard.html in your browser

Configuration & Customization

Retention Policy Configuration

Create retention_config.json to customize data retention:

{
  "raw_data": {
    "orders": 1095,
    "order_items": 1095,
    "customers": -1,
    "products": -1
  },
  "aggregated_data": {
    "fact_sales": 2555,
    "dim_customer": -1,
    "dim_product": -1,
    "dim_date": -1
  }
}
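A sketch of how such a policy might be applied is shown below; it assumes values are retention periods in days, that -1 means keep indefinitely, and that each table has a known date column. The real retention_policy.py also archives rows before deleting them:

import json
import psycopg2

with open("retention_config.json") as f:
    config = json.load(f)

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")

# Illustrative date column per table; the real script knows the actual columns
date_columns = {"orders": "order_date", "order_items": "order_date"}

with conn, conn.cursor() as cur:
    for table, days in config["raw_data"].items():
        if days < 0 or table not in date_columns:
            continue  # -1 means retain indefinitely
        # Delete rows older than the configured number of days
        cur.execute(
            f"DELETE FROM {table} "
            f"WHERE {date_columns[table]} < NOW() - make_interval(days => %s)",
            (days,),
        )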

Environment Variables

The default configuration can be modified in docker-compose.yaml:

  • POSTGRES_USER: Database username (default: admin)
  • POSTGRES_PASSWORD: Database password (default: secret)
  • POSTGRES_DB: Database name (default: warehouse)
  • PGADMIN_DEFAULT_EMAIL: PgAdmin login email
  • PGADMIN_DEFAULT_PASSWORD: PgAdmin password

Ports

  • PostgreSQL: 5432
  • PgAdmin: 8080

Data Generation Parameters

Customize data generation in data.py:

  • num_customers: Number of customers (default: 200)
  • num_products: Number of products (default: 50)
  • num_orders: Number of orders (default: 1000)
  • Geographic distribution and product categories

Troubleshooting

Data not loading

Ensure CSV files are generated before starting Docker containers:

python data.py
docker-compose up -d

Connection refused

Wait a few seconds for PostgreSQL to fully initialize, then try connecting again.

Port conflicts

If ports 5432 or 8080 are already in use, modify the port mappings in docker-compose.yaml.

Enhanced Features ✨

This project includes the following advanced data warehouse features:

🔄 Data Transformation Pipelines

  • Customer Data Enrichment: Geographic mapping, customer tiers, activity analysis
  • Product Performance Analytics: Price tiers, popularity rankings, margin analysis
  • Daily Sales Aggregates: Time-series analysis with customer segmentation
  • Automated ETL Process: Staging tables and data cleansing workflows
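As an illustration of the customer-tier step, here is a pandas sketch; the spend thresholds and column names are assumptions, and transform_pipeline.py defines the actual rules:

import pandas as pd

orders = pd.read_csv("data/orders.csv")
customers = pd.read_csv("data/customers.csv")

# Total spend per customer (assumes orders.csv has customer_id and total_amount)
spend = (orders.groupby("customer_id")["total_amount"]
               .sum().rename("lifetime_spend").reset_index())
customers = (customers.merge(spend, on="customer_id", how="left")
                      .fillna({"lifetime_spend": 0}))

# Assign a tier from lifetime spend; the cut points are illustrative
customers["customer_tier"] = pd.cut(
    customers["lifetime_spend"],
    bins=[-1, 500, 2000, float("inf")],
    labels=["Bronze", "Silver", "Gold"],
)

print(customers[["customer_id", "lifetime_spend", "customer_tier"]].head())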

⭐ Star Schema Implementation

  • Fact Tables: fact_sales with comprehensive sales metrics
  • Dimension Tables: dim_customer, dim_product, dim_date with enriched attributes
  • Optimized Queries: Pre-built views for common analytics patterns
  • Performance Indexes: Strategic indexing for fast query performance

πŸ” Data Quality Management

  • Completeness Checks: Null value analysis across all tables
  • Referential Integrity: Foreign key constraint validation
  • Data Consistency: Business rule validation and anomaly detection
  • Quality Scoring: Automated quality score calculation (0-100)
  • Quality Reports: Detailed JSON and console reports

📊 Interactive Dashboard

  • Real-time Metrics: Key business indicators and trends
  • Visual Analytics: Revenue charts, category performance, customer insights
  • Responsive Design: Modern HTML dashboard with charts
  • Auto-generated: One-command dashboard creation
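A minimal sketch of how a dashboard like this could be produced (one illustrative metric only; the real dashboard.py output is richer):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://admin:secret@localhost:5432/warehouse")

# One illustrative metric: monthly revenue from the fact table
monthly = pd.read_sql(
    """
    SELECT dd.year, dd.month, SUM(fs.total_amount) AS revenue
    FROM fact_sales fs
    JOIN dim_date dd ON fs.date_key = dd.date_key
    GROUP BY dd.year, dd.month
    ORDER BY dd.year, dd.month
    """,
    engine,
)

# Render the metric as a simple HTML table and save it
html = "<h1>Monthly Revenue</h1>" + monthly.to_html(index=False)
with open("dashboard.html", "w") as f:
    f.write(html)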

📈 Incremental Data Loading

  • Change Data Capture: Simulated CDC for existing records
  • Incremental Updates: Add new customers and orders seamlessly
  • Star Schema Sync: Automatic dimension and fact table updates
  • Load Metadata: Tracking of load timestamps and record counts
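A simplified sketch of the incremental pattern is shown below; the load_metadata table and its columns are hypothetical placeholders, and incremental_load.py additionally refreshes the star schema:

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")

with conn, conn.cursor() as cur:
    # Look up when the last incremental load finished (hypothetical metadata table)
    cur.execute("SELECT MAX(loaded_at) FROM load_metadata;")
    last_load = cur.fetchone()[0]

    # Pick up only orders created since that point
    cur.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE order_date > COALESCE(%s, '1970-01-01'::timestamp);",
        (last_load,),
    )
    new_orders = cur.fetchone()[0]
    print(f"{new_orders} new orders to load into fact_sales")

    # Record this run so the next load can resume from here
    cur.execute(
        "INSERT INTO load_metadata (loaded_at, records_loaded) VALUES (NOW(), %s);",
        (new_orders,),
    )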

πŸ—„οΈ Data Retention Policies

  • Configurable Retention: JSON-based retention policy configuration
  • Automated Archival: Safe archiving before deletion
  • Audit Trail: Complete audit log of retention policy executions
  • Flexible Rules: Time-based retention with table-specific policies

πŸ› οΈ Management CLI

  • Unified Interface: Single command-line tool for all operations
  • Pipeline Automation: Complete ETL pipeline execution
  • Status Monitoring: Real-time status of warehouse components
  • Maintenance Operations: Cleanup, backup, and service management

License

This project is open source and available under the MIT License.
