Warehouse Intelligence System: Data pipeline, EDA, anomaly detection, and ML prediction API for warehouse scan analysis. Built with Python, FastAPI, and Docker.

numericalmachinelearning/dexory-technical-task


Dexory Technical Task - Warehouse Intelligence System

Applied AI Engineer Technical Assessment
Author: Alessandro Alati
Date: November 2025

πŸ“‹ Project Overview

This project analyzes 10 days of warehouse scan data to extract actionable intelligence about inventory accuracy, error patterns, and operational issues. It includes a complete data pipeline, exploratory analysis, anomaly detection, and a containerized REST API.


🎯 Completed Tasks

βœ… Point 1: Data Engineering Pipeline

  • Ingests 10 days of warehouse scan data (350K+ records)
  • Merges with warehouse layout (33K+ locations)
  • Outputs clean Parquet dataset with spatial features
  • Includes comprehensive unit tests
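As a rough sketch of the merge step (column names such as `location_name`, `aisle`, and `shelf_level` are assumptions; the real schema may differ), the join of scans onto the layout can look like:

```python
import pandas as pd

# Hypothetical toy data standing in for the real scan and layout files.
scans = pd.DataFrame({
    "location_name": ["A1", "A1", "B2"],
    "day": [1, 2, 1],
    "status": ["ok", "error", "ok"],
})
layout = pd.DataFrame({
    "location_name": ["A1", "B2"],
    "aisle": ["AZ 1", "AB 3"],
    "shelf_level": [0, 2],
})

# Left-join scans onto the layout so every scan row gains spatial features;
# validate="m:1" guards against duplicate layout rows silently inflating counts.
merged = scans.merge(layout, on="location_name", how="left", validate="m:1")

# Persist as Parquet (uses pyarrow, already in the dependency list):
# merged.to_parquet("warehouse_data.parquet", index=False)
print(len(merged))  # 3 -- a left join preserves every scan row
```

A left join (rather than inner) makes scans at unknown locations surface as NaN spatial features instead of disappearing, which the unit tests can then check.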

βœ… Point 2: Exploratory Data Analysis (EDA)

  • WHAT: Daily accuracy trends and error type breakdown
  • WHERE: Spatial hotspots (shelf levels, aisles, height correlations)
  • WHEN: Velocity analysis (fast-moving locations)
  • Generates publication-quality visualizations
  • Statistical validation (chi-square tests, correlations)

βœ… Point 3: Anomaly Detection

  • Composite risk scoring model (error severity + operational impact)
  • Identifies Top 20 most problematic locations
  • Transparent, explainable scoring system
  • Output: Ranked CSV with actionable metrics

βœ… Point 4: Scalable & Containerized API

  • FastAPI application with 4 endpoints
  • Docker containerization with docker-compose
  • Interactive Swagger documentation
  • Health checks and error handling

βœ… Point 5: Error Prediction Model

  • Random Forest classifier (zero-to-one model)
  • Predicts high/low error risk from static features only
  • Works on new warehouses with no scan history
  • 65% accuracy (a 30% relative improvement over the 50% baseline)

πŸ“ Project Structure

dexory-technical-task/
β”œβ”€β”€ core_scripts/               # Main analysis scripts
β”‚   β”œβ”€β”€ data_pipeline.py        # Data ingestion and cleaning
β”‚   β”œβ”€β”€ eda.py                  # Exploratory data analysis
β”‚   β”œβ”€β”€ anomaly_detection.py    # Top 20 problematic locations
β”‚   β”œβ”€β”€ error_prediction.py     # ML model for error prediction
β”‚   └── test_pipeline.py        # Unit tests
β”‚
β”œβ”€β”€ API/
β”‚   └── warehouse-api/          # FastAPI application
β”‚       β”œβ”€β”€ app/main.py         # API endpoints
β”‚       β”œβ”€β”€ Dockerfile          # Container definition
β”‚       β”œβ”€β”€ docker-compose.yml  # Docker orchestration
β”‚       └── requirements.txt    # API dependencies
β”‚
β”œβ”€β”€ data_models/
β”‚   β”œβ”€β”€ technical-task-data/    # Raw input data (10 days)
β”‚   └── output/                 # Processed data & models
β”‚       β”œβ”€β”€ warehouse_data.parquet
β”‚       β”œβ”€β”€ top_20_problematic.csv
β”‚       β”œβ”€β”€ error_predictor.pkl
β”‚       └── eda_plots/          # Analysis visualizations
β”‚
└── README.md                   # This file

πŸš€ Quick Start

Prerequisites

  • Python 3.11+
  • Docker Desktop (for API)

1. Install Dependencies

pip install -r requirements.txt

2. Run the Complete Pipeline

# Step 1: Process data (Point 1)
cd core_scripts
python data_pipeline.py

# Step 2: Run EDA (Point 2)
python eda.py

# Step 3: Detect anomalies (Point 3)
python anomaly_detection.py

# Step 4: Train prediction model (Point 5)
python error_prediction.py

3. Launch the API (Point 4)

# Navigate to API folder
cd ../API/warehouse-api

# Run with Docker
docker-compose up --build

# Access API at:
# http://localhost:8000/docs

πŸ“Š Key Results

Inventory Accuracy

  • Mean Accuracy: 75.07%
  • Range: 3.09% - 99.30%
  • Most Common Error: Unknown item found (3.83%)

Spatial Insights

  • Ground shelves: 6.24% error rate (highest)
  • High shelves: 1.37% error rate (lowest)
  • Most problematic aisle: AZ 1 (10.64% error rate)
  • Significant correlation: Shelf level affects error rate (p < 0.001)
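The shelf-level significance test above can be sketched with a chi-square test of independence; the counts below are illustrative stand-ins (not the real data), chosen to mirror the reported ~6.2% ground vs ~1.4% high-shelf error rates:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: rows = shelf levels (ground, mid, high),
# columns = (error scans, correct scans).
table = np.array([
    [624, 9376],   # ground: ~6.2% error rate
    [350, 9650],   # mid
    [137, 9863],   # high:  ~1.4% error rate
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.2e}, dof={dof}")
```

With differences this large, `p` comes out far below 0.001, matching the significance level reported above.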

Velocity Analysis

  • High-velocity locations: ~650 (2% of total)
  • Static locations: 25,234 (75% of total)
  • Finding: High-velocity locations have significantly more errors

Anomaly Detection

  • Top 20 problematic locations identified
  • Scoring factors: Error rate (40%), Operational impact (30%), Error severity (20%), Spatial context (10%)
  • Highest risk score: 0.52 (a location with an 18.5% error rate and high velocity)

Prediction Model

  • Algorithm: Random Forest (balanced class weights)
  • Accuracy: 65% (vs 50% baseline)
  • High Error Recall: 73% (catches 73% of problematic locations)
  • Key features: Shelf height, position, aisle location

πŸ”Œ API Endpoints

The API serves analysis results and predictions:

Endpoint                    Description
GET /health                 System health check
GET /warehouse/anomalies    Top 20 problematic locations
GET /warehouse/stats        Daily accuracy trends & error breakdown
GET /location/{name}        Detailed location analysis + prediction

Interactive docs: http://localhost:8000/docs

See API/warehouse-api/README.md for detailed API documentation.


πŸ§ͺ Testing

Run unit tests:

cd core_scripts
pytest test_pipeline.py -v

Test coverage includes:

  • Data loading and validation
  • Merge operations (no data loss)
  • Feature extraction
  • Edge case handling
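A test in the style of the "no data loss" check might look like the following (the helper name and columns are hypothetical; the real `test_pipeline.py` may be organised differently):

```python
# Illustrative pytest-style test for the merge step.
import pandas as pd

def merge_layout(scans: pd.DataFrame, layout: pd.DataFrame) -> pd.DataFrame:
    """Left-join scans onto the layout so no scan rows are lost."""
    return scans.merge(layout, on="location_name", how="left", validate="m:1")

def test_merge_preserves_all_scans():
    scans = pd.DataFrame({"location_name": ["A1", "A1", "ZZ"]})
    layout = pd.DataFrame({"location_name": ["A1"], "aisle": ["AZ 1"]})
    merged = merge_layout(scans, layout)
    assert len(merged) == len(scans)          # no data loss on merge
    assert merged["aisle"].isna().sum() == 1  # unknown location surfaces as NaN
```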

πŸ“ˆ Visualizations

The EDA saves its visualizations to data_models/output/eda_plots/:

  1. daily_accuracy.png - Accuracy trends over 10 days
  2. status_breakdown.png - Overall status distribution
  3. substatus_breakdown.png - Top 15 error types
  4. spatial_hotspots.png - Error rates by shelf level and aisle
  5. fast_moving_locations.png - Top 20 highest velocity locations
  6. problematic_locations.png - Top 20 risk scores
  7. error_prediction_model.png - Model performance metrics

πŸ’‘ Key Insights

Operational Recommendations

  1. Ground-Level Shelves Need Attention

    • Despite easy access, ground shelves have highest error rates (6.24%)
    • Hypothesis: Rushing, picking interference, or label damage
    • Action: Investigate workflows for ground-level operations
  2. Aisle AZ 1 Requires Investigation

    • 10.64% error rate (2.7x warehouse average)
    • May indicate: lighting issues, layout problems, or label quality
    • Action: On-site audit of physical conditions
  3. Fast-Moving Locations = Higher Risk

    • Positive correlation between velocity and error rate
    • More handling = more opportunities for errors
    • Action: Implement more frequent audits for high-velocity locations
  4. Predictive Model Enables Proactive Management

    • Can identify high-risk locations before they accumulate errors
    • Works on new warehouses (zero-to-one capability)
    • Action: Deploy for ongoing monitoring and early intervention

πŸ› οΈ Technologies Used

  • Data Processing: pandas, numpy, pyarrow
  • Validation: pydantic
  • Machine Learning: scikit-learn (Random Forest)
  • Visualization: matplotlib, seaborn
  • Statistical Analysis: scipy
  • API: FastAPI, uvicorn
  • Containerization: Docker, docker-compose
  • Testing: pytest

πŸ“ Model Justification

Why Composite Risk Scoring (Point 3)?

Chosen over unsupervised methods (Isolation Forest, DBSCAN) because:

  • βœ… Transparent and explainable to stakeholders
  • βœ… Incorporates domain knowledge (error severity weights)
  • βœ… Tunable based on business priorities
  • βœ… Every component can be validated independently
  • βœ… Produces actionable insights

Formula:

Risk Score = 0.40 × Error_Rate + 
             0.30 × Operational_Impact + 
             0.20 × Error_Type_Severity + 
             0.10 × Spatial_Context
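The weighted sum is straightforward to compute in pandas. A minimal sketch, assuming each component has already been normalised to [0, 1] (column names and values below are hypothetical):

```python
import pandas as pd

# Hypothetical per-location components, each pre-normalised to [0, 1].
df = pd.DataFrame({
    "error_rate":          [0.185, 0.02],
    "operational_impact":  [0.9,   0.1],
    "error_type_severity": [0.5,   0.2],
    "spatial_context":     [0.6,   0.3],
})

# Weights mirror the formula above; being explicit keeps them tunable.
WEIGHTS = {
    "error_rate": 0.40,
    "operational_impact": 0.30,
    "error_type_severity": 0.20,
    "spatial_context": 0.10,
}

df["risk_score"] = sum(w * df[col] for col, w in WEIGHTS.items())
top = df.sort_values("risk_score", ascending=False).head(20)
print(df["risk_score"].round(3).tolist())  # → [0.504, 0.108]
```

Because every component is visible in the output table, a stakeholder can see exactly why a location ranked where it did, which is the point of choosing this over an unsupervised method.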

Why Random Forest (Point 5)?

Chosen for zero-to-one prediction because:

  • βœ… Handles mixed feature types (numeric + categorical)
  • βœ… Robust to class imbalance (with class_weight='balanced')
  • βœ… Provides feature importances (interpretability)
  • βœ… No feature scaling required
  • βœ… Proven performance on tabular data

Alternatives considered: Logistic Regression (too simple) and XGBoost (overkill for this data size).
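The training setup can be sketched as follows; synthetic data stands in for the static location features (shelf height, position, aisle encoding), and the class imbalance ratio is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: ~25% of locations labelled "high error risk".
X, y = make_classification(n_samples=1000, n_features=5,
                           weights=[0.75, 0.25], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# class_weight='balanced' reweights the minority high-risk class so the
# model doesn't just predict "low risk" everywhere.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=42)
clf.fit(X_train, y_train)

print(f"accuracy: {clf.score(X_test, y_test):.2f}")
print("feature importances:", clf.feature_importances_.round(2))
```

The `feature_importances_` attribute is what backs the "key features" finding above: it gives a per-feature contribution without any extra explainability tooling.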


πŸ“„ Requirements

See requirements.txt for complete dependencies.

Core packages:

pandas>=2.0.0
scikit-learn>=1.3.0
fastapi>=0.104.0
uvicorn>=0.24.0

Built with FastAPI β€’ Docker β€’ Python 3.11 πŸš€
