
Mini Data Warehouse

A simple data warehouse project demonstrating PostgreSQL database setup with synthetic e-commerce data using Docker and Python.

Overview

This project creates a mini data warehouse with synthetic customer, product, order, and order item data. It uses PostgreSQL as the database backend with PgAdmin for database administration, all containerized with Docker.

Project Structure

mini-data-warehouse/
├── data/                        # Generated CSV files and exports
│   ├── customers.csv            # Customer data (200 records)
│   ├── products.csv             # Product data (50 records)
│   ├── orders.csv               # Order data (1000+ records)
│   └── order_items.csv          # Order item data (variable records)
├── init/
│   └── schema.sql               # Database initialization script
├── 📊 Core Data Components
├── data.py                      # Synthetic data generation
├── transform_pipeline.py        # ETL transformation pipeline
├── data_quality.py              # Data quality assessment tools
├── incremental_load.py          # Incremental data loading
├── retention_policy.py          # Data retention management
├── 🌟 Star Schema & Analytics
├── star_schema.sql              # Star schema creation script
├── dashboard.py                 # Interactive dashboard generator
├── 🛠️ Management & Operations
├── warehouse_manager.py         # Central management CLI
├── docker-compose.yaml          # Docker services configuration
├── 🔧 Development Environment
├── flake.nix                    # Nix flake for reproducible environments
├── flake.lock                   # Nix flake lock file
├── .envrc                       # direnv configuration
└── README.md                    # This documentation

Database Schema

The warehouse has four source tables plus a star schema layer built on top of them:

Source tables:

  • customers: Customer information with enriched geographic and tier data
  • products: Product catalog with performance analytics and price tiers
  • orders: Order headers with comprehensive order metrics
  • order_items: Order line items with detailed product associations

Star schema tables:

  • dim_customer: Customer dimension with regional groupings and behavior analysis
  • dim_product: Product dimension with category and performance metrics
  • dim_date: Date dimension with comprehensive time attributes
  • fact_sales: Central fact table with sales metrics and calculated fields

Prerequisites

Option 1: Using Nix (Recommended)

  • Nix with flakes enabled (direnv is optional, for automatic environment loading)

Option 2: Manual Setup

  • Docker and Docker Compose
  • Python 3.x (for data generation)
  • pip (Python package manager)

Quick Start

Option 1: Using Management CLI (Recommended)

The fastest way to get started with the complete data warehouse:

# Setup complete environment and run full pipeline
python warehouse_manager.py pipeline

This single command will:

  • Generate synthetic data
  • Start PostgreSQL and PgAdmin services
  • Create star schema with dimension tables
  • Run data transformation pipelines
  • Execute data quality checks
  • Generate interactive dashboard

Option 2: Using Nix (Development Environment)

  1. Enter the development environment:
nix develop
  2. Run the complete pipeline:
python warehouse_manager.py pipeline
  3. Or run individual components:
# Generate data and start services
python warehouse_manager.py setup

# Generate dashboard
python warehouse_manager.py dashboard

# Check data quality
python warehouse_manager.py quality

# View status
python warehouse_manager.py status

Option 3: Manual Step-by-Step

  1. Generate Synthetic Data:

First, install the required Python dependencies and generate the synthetic data:

pip install pandas faker
python data.py

This will create CSV files in the data/ directory with:

  • 200 synthetic customers
  • 50 products across 5 categories (Electronics, Furniture, Clothing, Books, Sports)
  • 1000 orders with associated order items
  2. Start the Database:

Start the PostgreSQL database and PgAdmin using Docker Compose:

docker-compose up -d

This will:

  • Start a PostgreSQL 15 database on port 5432
  • Start PgAdmin on port 8080
  • Automatically load the CSV data into the database tables
  3. Access the Database:

Database Connection:

  • Host: localhost
  • Port: 5432
  • Database: warehouse
  • Username: admin
  • Password: secret

PgAdmin Web Interface: open http://localhost:8080 and log in with the PGADMIN_DEFAULT_EMAIL and PGADMIN_DEFAULT_PASSWORD values from docker-compose.yaml.
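For programmatic access, here is a minimal Python sketch using the connection details above (it assumes the psycopg2 package is installed; the query is only illustrative):

import psycopg2

# Connection details match the defaults in docker-compose.yaml
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="warehouse",
    user="admin",
    password="secret",
)

with conn.cursor() as cur:
    # Illustrative query: confirm the orders table was loaded
    cur.execute("SELECT COUNT(*) FROM orders;")
    print("orders loaded:", cur.fetchone()[0])

conn.close()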

Advanced Usage Examples

Management CLI Operations

# Complete pipeline (setup → transform → quality → dashboard)
python warehouse_manager.py pipeline

# Individual operations
python warehouse_manager.py setup              # Initial setup
python warehouse_manager.py generate           # Generate more data
python warehouse_manager.py transform          # Run ETL pipeline  
python warehouse_manager.py quality            # Check data quality
python warehouse_manager.py dashboard          # Create dashboard
python warehouse_manager.py status             # Show current status

# Incremental operations
python warehouse_manager.py incremental --customers 10 --orders 50
python warehouse_manager.py retention          # Execute retention policies

# Maintenance
python warehouse_manager.py stop               # Stop services
python warehouse_manager.py cleanup            # Full cleanup

Data Quality Management

# Run comprehensive quality checks
python data_quality.py

# View detailed quality report
cat data_quality_report.json | jq '.'

Quality checks include:

  • Completeness Analysis: NULL value detection and completeness rates
  • Referential Integrity: Foreign key constraint validation
  • Data Consistency: Business rule validation and anomaly detection
  • Overall Quality Score: Automated scoring (0-100) based on quality metrics
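For illustration, here is a stripped-down completeness check along the lines of what data_quality.py does; the table and columns checked are assumptions, and the real script covers all tables and produces a full report:

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")

# Columns to check are illustrative; the real script inspects every table
checks = {"customers": ["email", "country"]}

with conn.cursor() as cur:
    for table, columns in checks.items():
        for column in columns:
            # COUNT(column) skips NULLs, so the ratio is the completeness rate
            cur.execute(f"SELECT COUNT(*), COUNT({column}) FROM {table};")
            total, non_null = cur.fetchone()
            rate = 100.0 * non_null / total if total else 100.0
            print(f"{table}.{column}: {rate:.1f}% complete")

conn.close()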

Star Schema Analytics

Once the star schema is created, you can run advanced analytical queries:

-- Customer lifetime value analysis using star schema
SELECT 
    dc.region,
    dc.customer_tier,
    COUNT(DISTINCT dc.customer_key) as customers,
    AVG(fs.total_amount) as avg_order_value,
    SUM(fs.total_amount) as total_revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.customer_key = dc.customer_key
JOIN dim_date dd ON fs.date_key = dd.date_key
WHERE dd.year = 2024
GROUP BY dc.region, dc.customer_tier
ORDER BY total_revenue DESC;

-- Product performance by price tier and category  
SELECT 
    dp.category,
    dp.price_tier,
    COUNT(fs.sales_key) as transactions,
    SUM(fs.quantity) as units_sold,
    SUM(fs.total_amount) as revenue,
    SUM(fs.total_amount * fs.profit_margin / 100) as profit
FROM fact_sales fs
JOIN dim_product dp ON fs.product_key = dp.product_key
GROUP BY dp.category, dp.price_tier
ORDER BY revenue DESC;

-- Time-series revenue analysis with trends
SELECT 
    dd.year,
    dd.quarter,
    dd.month_name,
    SUM(fs.total_amount) as monthly_revenue,
    COUNT(DISTINCT fs.customer_key) as active_customers,
    AVG(fs.total_amount) as avg_transaction_value
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key  
GROUP BY dd.year, dd.quarter, dd.month_name, dd.month
ORDER BY dd.year, dd.month;
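The same queries can also be run from Python, for example with pandas (a sketch assuming pandas and SQLAlchemy are installed):

import pandas as pd
from sqlalchemy import create_engine

# Same credentials as docker-compose.yaml
engine = create_engine("postgresql+psycopg2://admin:secret@localhost:5432/warehouse")

query = """
SELECT dp.category, SUM(fs.total_amount) AS revenue
FROM fact_sales fs
JOIN dim_product dp ON fs.product_key = dp.product_key
GROUP BY dp.category
ORDER BY revenue DESC;
"""

# Load the result into a DataFrame for further analysis or charting
revenue_by_category = pd.read_sql(query, engine)
print(revenue_by_category)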

Data Generation Details

The synthetic data is generated using the Python Faker library with the following characteristics:

  • Customers: Random names, emails, and geographic locations
  • Products: 5 categories with realistic price ranges ($5-$2000)
  • Orders: Random date range (2024), 1-5 items per order
  • Geographic Distribution: Global customer base with diverse countries
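A minimal sketch of that approach is shown below; column names and exact logic are illustrative and may differ from data.py:

import os
import random
import pandas as pd
from faker import Faker

fake = Faker()
categories = ["Electronics", "Furniture", "Clothing", "Books", "Sports"]
os.makedirs("data", exist_ok=True)

# 200 synthetic customers with name, email and country
customers = pd.DataFrame({
    "customer_id": range(1, 201),
    "name": [fake.name() for _ in range(200)],
    "email": [fake.email() for _ in range(200)],
    "country": [fake.country() for _ in range(200)],
})

# 50 products across the five categories, priced between $5 and $2000
products = pd.DataFrame({
    "product_id": range(1, 51),
    "category": [random.choice(categories) for _ in range(50)],
    "price": [round(random.uniform(5, 2000), 2) for _ in range(50)],
})

customers.to_csv("data/customers.csv", index=False)
products.to_csv("data/products.csv", index=False)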

Nix Flake Usage

This project includes a Nix flake for reproducible development environments. Here are the available commands:

Development Shell

# Enter development environment with all dependencies
nix develop

# Use direnv for automatic environment loading (if installed)
direnv allow

Run Applications

# Generate synthetic data using the packaged app
nix run .

# Or run the specific generate-data app
nix run .#generate-data

Build Packages

# Build the data generation package
nix build

# Build Docker image (optional)
nix build .#docker

Flake Information

# Show all available outputs
nix flake show

# Check flake for issues
nix flake check

Management Commands

Stop the services

docker-compose down

View logs

docker-compose logs

Regenerate data and restart services

python warehouse_manager.py generate --customers 50 --orders 100
python warehouse_manager.py incremental --customers 10 --orders 25

Connect to database directly

docker exec -it mini_warehouse_db psql -U admin -d warehouse

Generate fresh dashboard

python warehouse_manager.py dashboard
# Open dashboard.html in your browser

Configuration & Customization

Retention Policy Configuration

Create retention_config.json to customize data retention:

{
  "raw_data": {
    "orders": 1095,
    "order_items": 1095,
    "customers": -1,
    "products": -1
  },
  "aggregated_data": {
    "fact_sales": 2555,
    "dim_customer": -1,
    "dim_product": -1,
    "dim_date": -1
  }
}
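A sketch of how such a policy might be applied is shown below; it assumes values are retention periods in days, that -1 means keep indefinitely, and that each table has a known date column. The real retention_policy.py also archives rows before deleting them:

import json
import psycopg2

with open("retention_config.json") as f:
    config = json.load(f)

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")

# Illustrative date column per table; the real script knows the actual columns
date_columns = {"orders": "order_date", "order_items": "order_date"}

with conn, conn.cursor() as cur:
    for table, days in config["raw_data"].items():
        if days < 0 or table not in date_columns:
            continue  # -1 means retain indefinitely
        # Delete rows older than the configured number of days
        cur.execute(
            f"DELETE FROM {table} "
            f"WHERE {date_columns[table]} < NOW() - make_interval(days => %s)",
            (days,),
        )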

Environment Variables

The default configuration can be modified in docker-compose.yaml:

  • POSTGRES_USER: Database username (default: admin)
  • POSTGRES_PASSWORD: Database password (default: secret)
  • POSTGRES_DB: Database name (default: warehouse)
  • PGADMIN_DEFAULT_EMAIL: PgAdmin login email
  • PGADMIN_DEFAULT_PASSWORD: PgAdmin password

Ports

  • PostgreSQL: 5432
  • PgAdmin: 8080

Data Generation Parameters

Customize data generation in data.py:

  • num_customers: Number of customers (default: 200)
  • num_products: Number of products (default: 50)
  • num_orders: Number of orders (default: 1000)
  • Geographic distribution and product categories

Troubleshooting

Data not loading

Ensure CSV files are generated before starting Docker containers:

python data.py
docker-compose up -d

Connection refused

Wait a few seconds for PostgreSQL to fully initialize, then try connecting again.

Port conflicts

If ports 5432 or 8080 are already in use, modify the port mappings in docker-compose.yaml.

Enhanced Features ✨

This project includes the following advanced data warehouse features:

🔄 Data Transformation Pipelines

  • Customer Data Enrichment: Geographic mapping, customer tiers, activity analysis
  • Product Performance Analytics: Price tiers, popularity rankings, margin analysis
  • Daily Sales Aggregates: Time-series analysis with customer segmentation
  • Automated ETL Process: Staging tables and data cleansing workflows
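As an illustration of the customer-tier step, here is a pandas sketch; the spend thresholds and column names are assumptions, and transform_pipeline.py defines the actual rules:

import pandas as pd

orders = pd.read_csv("data/orders.csv")
customers = pd.read_csv("data/customers.csv")

# Total spend per customer (assumes orders.csv has customer_id and total_amount)
spend = (orders.groupby("customer_id")["total_amount"]
               .sum().rename("lifetime_spend").reset_index())
customers = (customers.merge(spend, on="customer_id", how="left")
                      .fillna({"lifetime_spend": 0}))

# Assign a tier from lifetime spend; the cut points are illustrative
customers["customer_tier"] = pd.cut(
    customers["lifetime_spend"],
    bins=[-1, 500, 2000, float("inf")],
    labels=["Bronze", "Silver", "Gold"],
)

print(customers[["customer_id", "lifetime_spend", "customer_tier"]].head())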

⭐ Star Schema Implementation

  • Fact Tables: fact_sales with comprehensive sales metrics
  • Dimension Tables: dim_customer, dim_product, dim_date with enriched attributes
  • Optimized Queries: Pre-built views for common analytics patterns
  • Performance Indexes: Strategic indexing for fast query performance

πŸ” Data Quality Management

  • Completeness Checks: Null value analysis across all tables
  • Referential Integrity: Foreign key constraint validation
  • Data Consistency: Business rule validation and anomaly detection
  • Quality Scoring: Automated quality score calculation (0-100)
  • Quality Reports: Detailed JSON and console reports

📊 Interactive Dashboard

  • Real-time Metrics: Key business indicators and trends
  • Visual Analytics: Revenue charts, category performance, customer insights
  • Responsive Design: Modern HTML dashboard with charts
  • Auto-generated: One-command dashboard creation
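A minimal sketch of how a dashboard like this could be produced (one illustrative metric only; the real dashboard.py output is richer):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://admin:secret@localhost:5432/warehouse")

# One illustrative metric: monthly revenue from the fact table
monthly = pd.read_sql(
    """
    SELECT dd.year, dd.month, SUM(fs.total_amount) AS revenue
    FROM fact_sales fs
    JOIN dim_date dd ON fs.date_key = dd.date_key
    GROUP BY dd.year, dd.month
    ORDER BY dd.year, dd.month
    """,
    engine,
)

# Render the metric as a simple HTML table and save it
html = "<h1>Monthly Revenue</h1>" + monthly.to_html(index=False)
with open("dashboard.html", "w") as f:
    f.write(html)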

📈 Incremental Data Loading

  • Change Data Capture: Simulated CDC for existing records
  • Incremental Updates: Add new customers and orders seamlessly
  • Star Schema Sync: Automatic dimension and fact table updates
  • Load Metadata: Tracking of load timestamps and record counts
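A simplified sketch of the incremental pattern is shown below; the load_metadata table and its columns are hypothetical placeholders, and incremental_load.py additionally refreshes the star schema:

import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")

with conn, conn.cursor() as cur:
    # Look up when the last incremental load finished (hypothetical metadata table)
    cur.execute("SELECT MAX(loaded_at) FROM load_metadata;")
    last_load = cur.fetchone()[0]

    # Pick up only orders created since that point
    cur.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE order_date > COALESCE(%s, '1970-01-01'::timestamp);",
        (last_load,),
    )
    new_orders = cur.fetchone()[0]
    print(f"{new_orders} new orders to load into fact_sales")

    # Record this run so the next load can resume from here
    cur.execute(
        "INSERT INTO load_metadata (loaded_at, records_loaded) VALUES (NOW(), %s);",
        (new_orders,),
    )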

πŸ—„οΈ Data Retention Policies

  • Configurable Retention: JSON-based retention policy configuration
  • Automated Archival: Safe archiving before deletion
  • Audit Trail: Complete audit log of retention policy executions
  • Flexible Rules: Time-based retention with table-specific policies

πŸ› οΈ Management CLI

  • Unified Interface: Single command-line tool for all operations
  • Pipeline Automation: Complete ETL pipeline execution
  • Status Monitoring: Real-time status of warehouse components
  • Maintenance Operations: Cleanup, backup, and service management

License

This project is open source and available under the MIT License.
