A simple data warehouse project demonstrating PostgreSQL database setup with synthetic e-commerce data using Docker and Python.
This project creates a mini data warehouse with synthetic customer, product, order, and order item data. It uses PostgreSQL as the database backend with PgAdmin for database administration, all containerized with Docker.
mini-data-warehouse/
├── data/                     # Generated CSV files and exports
│   ├── customers.csv         # Customer data (200 records)
│   ├── products.csv          # Product data (50 records)
│   ├── orders.csv            # Order data (1000+ records)
│   └── order_items.csv       # Order item data (variable records)
├── init/
│   └── schema.sql            # Database initialization script
│
│     # Core Data Components
├── data.py                   # Synthetic data generation
├── transform_pipeline.py     # ETL transformation pipeline
├── data_quality.py           # Data quality assessment tools
├── incremental_load.py       # Incremental data loading
├── retention_policy.py       # Data retention management
│
│     # Star Schema & Analytics
├── star_schema.sql           # Star schema creation script
├── dashboard.py              # Interactive dashboard generator
│
│     # Management & Operations
├── warehouse_manager.py      # Central management CLI
├── docker-compose.yaml       # Docker services configuration
│
│     # Development Environment
├── flake.nix                 # Nix flake for reproducible environments
├── flake.lock                # Nix flake lock file
├── .envrc                    # direnv configuration
└── README.md                 # This documentation
The data warehouse consists of four main tables:
- customers: Customer information with enriched geographic and tier data
- products: Product catalog with performance analytics and price tiers
- orders: Order headers with comprehensive order metrics
- order_items: Order line items with detailed product associations

On top of these, the star schema adds dimension and fact tables (key relationships are sketched after this list):
- dim_customer: Customer dimension with regional groupings and behavior analysis
- dim_product: Product dimension with category and performance metrics
- dim_date: Date dimension with comprehensive time attributes
- fact_sales: Central fact table with sales metrics and calculated fields
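As a rough illustration of how the fact table relates to the dimensions, the sketch below creates two of the tables with psycopg2. The column list is inferred from the example queries later in this README and is an assumption; star_schema.sql is the authoritative definition.

```python
# Illustrative only: columns are inferred from the analytics queries below;
# the real DDL lives in star_schema.sql (dim_customer/dim_product omitted here).
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS dim_date (
    date_key   INTEGER PRIMARY KEY,
    year       INTEGER,
    quarter    INTEGER,
    month      INTEGER,
    month_name TEXT
);

CREATE TABLE IF NOT EXISTS fact_sales (
    sales_key     SERIAL PRIMARY KEY,
    customer_key  INTEGER,              -- FK to dim_customer in the real schema
    product_key   INTEGER,              -- FK to dim_product in the real schema
    date_key      INTEGER REFERENCES dim_date (date_key),
    quantity      INTEGER,
    total_amount  NUMERIC(12, 2),
    profit_margin NUMERIC(5, 2)
);
"""

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")
with conn, conn.cursor() as cur:
    cur.execute(DDL)
conn.close()
```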
- Nix package manager with flakes enabled
- Docker and Docker Compose
- Python 3.x (for data generation)
- pip (Python package manager)
The fastest way to get started with the complete data warehouse:
# Set up the complete environment and run the full pipeline
python warehouse_manager.py pipeline

This single command will:
- Generate synthetic data
- Start PostgreSQL and PgAdmin services
- Create star schema with dimension tables
- Run data transformation pipelines
- Execute data quality checks
- Generate interactive dashboard
- Enter the development environment:
nix develop

- Run the complete pipeline:

python warehouse_manager.py pipeline

- Or run individual components:
# Generate data and start services
python warehouse_manager.py setup
# Generate dashboard
python warehouse_manager.py dashboard
# Check data quality
python warehouse_manager.py quality
# View status
python warehouse_manager.py status

- Generate Synthetic Data:
First, install the required Python dependencies and generate the synthetic data:
pip install pandas faker
python data.py

This will create CSV files in the data/ directory with:
- 200 synthetic customers
- 50 products across 5 categories (Electronics, Furniture, Clothing, Books, Sports)
- 1000 orders with associated order items
- Start the Database:
Start the PostgreSQL database and PgAdmin using Docker Compose:
docker-compose up -d

This will:
- Start a PostgreSQL 15 database on port 5432
- Start PgAdmin on port 8080
- Automatically load the CSV data into the database tables
- Access the Database:
Database Connection:
- Host: localhost
- Port: 5432
- Database: warehouse
- Username: admin
- Password: secret
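If psycopg2 is installed (for example via pip install psycopg2-binary), a quick connectivity check from Python using the default credentials above looks like this:

```python
# Minimal sanity check against the running warehouse database.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM customers")
    print("customers loaded:", cur.fetchone()[0])
conn.close()
```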
PgAdmin Web Interface:
- URL: http://localhost:8080
- Email: admin@local.com
- Password: admin
- Host: db (use the Docker service name when registering the server inside PgAdmin)
# Complete pipeline (setup β transform β quality β dashboard)
python warehouse_manager.py pipeline
# Individual operations
python warehouse_manager.py setup # Initial setup
python warehouse_manager.py generate # Generate more data
python warehouse_manager.py transform # Run ETL pipeline
python warehouse_manager.py quality # Check data quality
python warehouse_manager.py dashboard # Create dashboard
python warehouse_manager.py status # Show current status
# Incremental operations
python warehouse_manager.py incremental --customers 10 --orders 50
python warehouse_manager.py retention # Execute retention policies
# Maintenance
python warehouse_manager.py stop # Stop services
python warehouse_manager.py cleanup    # Full cleanup

# Run comprehensive quality checks
python data_quality.py
# View detailed quality report
cat data_quality_report.json | jq '.'

Quality checks include:
- Completeness Analysis: NULL value detection and completeness rates
- Referential Integrity: Foreign key constraint validation
- Data Consistency: Business rule validation and anomaly detection
- Overall Quality Score: Automated scoring (0-100) based on quality metrics
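As a rough illustration of the completeness check above, the sketch below computes per-column NULL rates directly against the raw tables; the actual checks and scoring in data_quality.py may differ.

```python
# Per-column NULL-rate check (a simplified stand-in for data_quality.py).
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")
cur = conn.cursor()

def null_rates(table):
    # Discover the table's columns, then count NULLs in each.
    cur.execute(
        "SELECT column_name FROM information_schema.columns "
        "WHERE table_name = %s", (table,))
    columns = [row[0] for row in cur.fetchall()]
    rates = {}
    for col in columns:
        cur.execute(f'SELECT COUNT(*), COUNT("{col}") FROM {table}')
        total, non_null = cur.fetchone()
        rates[col] = 0.0 if total == 0 else 1 - non_null / total
    return rates

for table in ("customers", "products", "orders", "order_items"):
    print(table, null_rates(table))
```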
Once the star schema is created, you can run advanced analytical queries:
-- Customer lifetime value analysis using star schema
SELECT
dc.region,
dc.customer_tier,
COUNT(DISTINCT dc.customer_key) as customers,
AVG(fs.total_amount) as avg_order_value,
SUM(fs.total_amount) as total_revenue
FROM fact_sales fs
JOIN dim_customer dc ON fs.customer_key = dc.customer_key
JOIN dim_date dd ON fs.date_key = dd.date_key
WHERE dd.year = 2024
GROUP BY dc.region, dc.customer_tier
ORDER BY total_revenue DESC;
-- Product performance by price tier and category
SELECT
dp.category,
dp.price_tier,
COUNT(fs.sales_key) as transactions,
SUM(fs.quantity) as units_sold,
SUM(fs.total_amount) as revenue,
SUM(fs.total_amount * fs.profit_margin / 100) as profit
FROM fact_sales fs
JOIN dim_product dp ON fs.product_key = dp.product_key
GROUP BY dp.category, dp.price_tier
ORDER BY revenue DESC;
-- Time-series revenue analysis with trends
SELECT
dd.year,
dd.quarter,
dd.month_name,
SUM(fs.total_amount) as monthly_revenue,
COUNT(DISTINCT fs.customer_key) as active_customers,
AVG(fs.total_amount) as avg_transaction_value
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key
GROUP BY dd.year, dd.quarter, dd.month_name, dd.month
ORDER BY dd.year, dd.month;

The synthetic data is generated using the Python Faker library with the following characteristics (a sketch of the approach follows the list):
- Customers: Random names, emails, and geographic locations
- Products: 5 categories with realistic price ranges ($5-$2000)
- Orders: Random date range (2024), 1-5 items per order
- Geographic Distribution: Global customer base with diverse countries
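The sketch below shows a minimal version of this generation approach with pandas and Faker; the exact columns and the order-item logic in data.py may differ.

```python
# Simplified sketch of the Faker-based generation; data.py is authoritative
# (order_items generation is omitted here).
import random
from datetime import date
import pandas as pd
from faker import Faker

fake = Faker()
CATEGORIES = ["Electronics", "Furniture", "Clothing", "Books", "Sports"]

customers = pd.DataFrame([{
    "customer_id": i,
    "name": fake.name(),
    "email": fake.email(),
    "country": fake.country(),
} for i in range(1, 201)])

products = pd.DataFrame([{
    "product_id": i,
    "category": random.choice(CATEGORIES),
    "price": round(random.uniform(5, 2000), 2),
} for i in range(1, 51)])

orders = pd.DataFrame([{
    "order_id": i,
    "customer_id": random.randint(1, 200),
    "order_date": fake.date_between(date(2024, 1, 1), date(2024, 12, 31)),
} for i in range(1, 1001)])

customers.to_csv("data/customers.csv", index=False)
products.to_csv("data/products.csv", index=False)
orders.to_csv("data/orders.csv", index=False)
```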
This project includes a Nix flake for reproducible development environments. Here are the available commands:
# Enter development environment with all dependencies
nix develop
# Use direnv for automatic environment loading (if installed)
direnv allow

# Generate synthetic data using the packaged app
nix run .
# Or run the specific generate-data app
nix run .#generate-data

# Build the data generation package
nix build
# Build Docker image (optional)
nix build .#docker

# Show all available outputs
nix flake show
# Check flake for issues
nix flake check

# Stop the database services
docker-compose down

# View service logs
docker-compose logs

# Generate additional data
python warehouse_manager.py generate --customers 50 --orders 100
python warehouse_manager.py incremental --customers 10 --orders 25

# Connect to the database directly
docker exec -it mini_warehouse_db psql -U admin -d warehouse

# Regenerate the dashboard
python warehouse_manager.py dashboard
# Open dashboard.html in your browser

Create retention_config.json to customize data retention (values are retention periods in days):
{
  "raw_data": {
    "orders": 1095,
    "order_items": 1095,
    "customers": -1,
    "products": -1
  },
  "aggregated_data": {
    "fact_sales": 2555,
    "dim_customer": -1,
    "dim_product": -1,
    "dim_date": -1
  }
}

The default configuration can be modified in docker-compose.yaml:
- POSTGRES_USER: Database username (default: admin)
- POSTGRES_PASSWORD: Database password (default: secret)
- POSTGRES_DB: Database name (default: warehouse)
- PGADMIN_DEFAULT_EMAIL: PgAdmin login email
- PGADMIN_DEFAULT_PASSWORD: PgAdmin password
- PostgreSQL: 5432
- PgAdmin: 8080
Customize data generation in data.py:
- num_customers: Number of customers (default: 200)
- num_products: Number of products (default: 50)
- num_orders: Number of orders (default: 1000)
- Geographic distribution and product categories
Ensure CSV files are generated before starting Docker containers:
python data.py
docker-compose up -d

Wait a few seconds for PostgreSQL to fully initialize, then try connecting again.
If ports 5432 or 8080 are already in use, modify the port mappings in docker-compose.yaml.
This project includes the following advanced data warehouse features:
- Customer Data Enrichment: Geographic mapping, customer tiers, activity analysis
- Product Performance Analytics: Price tiers, popularity rankings, margin analysis
- Daily Sales Aggregates: Time-series analysis with customer segmentation
- Automated ETL Process: Staging tables and data cleansing workflows
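As an illustration of one such enrichment step, the hypothetical sketch below assigns customer tiers from total spend. The column names (such as total_amount) and the tier thresholds are assumptions; transform_pipeline.py defines the real rules.

```python
# Hypothetical enrichment step: thresholds and column names are assumed,
# not taken from transform_pipeline.py.
import pandas as pd

orders = pd.read_csv("data/orders.csv")
customers = pd.read_csv("data/customers.csv")

# Lifetime spend per customer from the raw orders.
spend = (orders.groupby("customer_id")["total_amount"]
               .sum().rename("lifetime_spend").reset_index())
enriched = customers.merge(spend, on="customer_id", how="left")
enriched["lifetime_spend"] = enriched["lifetime_spend"].fillna(0)

# Simple threshold-based tiers.
def tier(amount):
    if amount >= 5000:
        return "gold"
    if amount >= 1000:
        return "silver"
    return "bronze"

enriched["customer_tier"] = enriched["lifetime_spend"].apply(tier)
print(enriched[["customer_id", "lifetime_spend", "customer_tier"]].head())
```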
- Fact Tables: fact_sales with comprehensive sales metrics
- Dimension Tables: dim_customer, dim_product, dim_date with enriched attributes
- Optimized Queries: Pre-built views for common analytics patterns
- Performance Indexes: Strategic indexing for fast query performance
- Completeness Checks: Null value analysis across all tables
- Referential Integrity: Foreign key constraint validation
- Data Consistency: Business rule validation and anomaly detection
- Quality Scoring: Automated quality score calculation (0-100)
- Quality Reports: Detailed JSON and console reports
- Real-time Metrics: Key business indicators and trends
- Visual Analytics: Revenue charts, category performance, customer insights
- Responsive Design: Modern HTML dashboard with charts
- Auto-generated: One-command dashboard creation
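As a rough sketch of what one-command generation can look like, the snippet below writes a single revenue table to an HTML file using the connection settings above; dashboard.py produces a much richer page with charts.

```python
# Minimal stand-in for dashboard.py: one aggregated table rendered to HTML.
import psycopg2
import pandas as pd

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")
revenue = pd.read_sql(
    "SELECT dd.month_name, SUM(fs.total_amount) AS revenue "
    "FROM fact_sales fs JOIN dim_date dd ON fs.date_key = dd.date_key "
    "GROUP BY dd.month_name, dd.month ORDER BY dd.month", conn)
conn.close()

with open("dashboard.html", "w") as f:
    f.write("<h1>Monthly Revenue</h1>" + revenue.to_html(index=False))
```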
- Change Data Capture: Simulated CDC for existing records
- Incremental Updates: Add new customers and orders seamlessly
- Star Schema Sync: Automatic dimension and fact table updates
- Load Metadata: Tracking of load timestamps and record counts
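A watermark-based sketch of the incremental idea is shown below; the load_metadata and staging_orders tables here are hypothetical stand-ins for whatever metadata incremental_load.py actually maintains.

```python
# Watermark-based incremental load sketch; table and column names are assumed.
import psycopg2

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")
cur = conn.cursor()

# Last successful load timestamp for orders (hypothetical metadata table).
cur.execute("SELECT MAX(loaded_until) FROM load_metadata WHERE table_name = 'orders'")
(watermark,) = cur.fetchone()

# Copy only orders newer than the watermark into a staging table.
cur.execute(
    "INSERT INTO staging_orders SELECT * FROM orders WHERE order_date > %s",
    (watermark,),
)

# Record the new high-water mark for the next run.
cur.execute(
    "INSERT INTO load_metadata (table_name, loaded_until, loaded_at) "
    "VALUES ('orders', (SELECT MAX(order_date) FROM orders), NOW())"
)
conn.commit()
```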
- Configurable Retention: JSON-based retention policy configuration
- Automated Archival: Safe archiving before deletion
- Audit Trail: Complete audit log of retention policy executions
- Flexible Rules: Time-based retention with table-specific policies
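A sketch of how such a retention config might be applied is below. The date-column mapping and the reading of -1 as "keep indefinitely" are assumptions; retention_policy.py, with its archival and audit steps, is authoritative.

```python
# Hypothetical application of retention_config.json; archival and audit
# logging (which retention_policy.py performs) are omitted here.
import json
import psycopg2

with open("retention_config.json") as f:
    config = json.load(f)

DATE_COLUMNS = {"orders": "order_date"}  # assumed mapping of table -> date column

conn = psycopg2.connect(host="localhost", port=5432, dbname="warehouse",
                        user="admin", password="secret")
cur = conn.cursor()

for table, days in config["raw_data"].items():
    if days < 0 or table not in DATE_COLUMNS:
        continue  # -1 assumed to mean "no retention limit"
    cur.execute(
        f"DELETE FROM {table} WHERE {DATE_COLUMNS[table]} < NOW() - make_interval(days => %s)",
        (days,),
    )
conn.commit()
```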
- Unified Interface: Single command-line tool for all operations
- Pipeline Automation: Complete ETL pipeline execution
- Status Monitoring: Real-time status of warehouse components
- Maintenance Operations: Cleanup, backup, and service management
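The pattern behind such a unified tool is a plain subcommand dispatcher; the sketch below is a hypothetical argparse skeleton, not warehouse_manager.py's actual code.

```python
# Hypothetical skeleton of an argparse-based subcommand CLI; the real tool
# wires these commands to the modules listed earlier in this README.
import argparse

def setup(args):
    print("generating data and starting services...")

def pipeline(args):
    setup(args)
    print("running transform -> quality -> dashboard")

def incremental(args):
    print(f"adding {args.customers} customers and {args.orders} orders")

def main():
    parser = argparse.ArgumentParser(prog="warehouse_manager.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("setup").set_defaults(func=setup)
    sub.add_parser("pipeline").set_defaults(func=pipeline)

    inc = sub.add_parser("incremental")
    inc.add_argument("--customers", type=int, default=10)
    inc.add_argument("--orders", type=int, default=50)
    inc.set_defaults(func=incremental)

    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```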
This project is open source and available under the MIT License.