Pure-Python, stream-oriented analysis of Superstore-style order data (no pandas). Queries operate on iterators to avoid loading everything into memory, and helper aggregations keep the code DRY.
- Require Python 3.10+ (tested with 3.13).
- Optional but recommended: create a virtualenv.
source .venv/bin/activate
- Install deps (only coverage is non-stdlib):
pip3 install -r requirements.txt
- Data is present in
data/orders.csvtaken fromhttps://www.kaggle.com/datasets/fatihilhan/global-superstore-dataset. All queries default to this path unless you pass a custom one. - Additional dataset notes present in
DATASET_NODES.md
- Run the all analyses:
python3 queries_run.py-
Key analytical functions (in
queries.py):-
sales_by_year_region_category(path)– total sales per (Year, Market, Category).- Identify high-performing regions and categories to focus marketing and supply-chain resources - Detect underperforming markets requiring pricing changes, better distribution, or promotions. -
yoy_category_sales_trends(path)– Year over Year change per (Market, Category) in percentage.- Detect growth or decline in specific markets or product categories. - Support goal tracking, because YoY is a key KPI in performance dashboards. -
profit_margin_by_category_subcategory(path)– profit/sales per (Category, Sub-Category). - Discover high-margin vs. low-margin product segments. - Inform inventory and catalog decisions: remove low-margin items that occupy storage but don’t drive profit. -
top_categories_by_margin(path, n=5)– top N category/sub-category by margin (min-heap).- Highlight the most profitable categories for targeted advertising and homepage placemen - Support profit maximization by optimizing product mix. -
total_discounted_profit(path)/discounted_profit_share(path)– profit from discounted orders and total share of total profit.- Evaluate whether discount strategies are profitable or harming margins. - Detect margin leakage from overly aggressive discounting. -
average_fulfillment_days(path)– mean days between Order.Date and Ship.Date.`- Measure operational efficiency of supply chain and warehouse processes. - Evaluate carrier performance and SLA compliance. - Predict customer satisfaction (shipping delays → low ratings & high returns). -
profit volatility(profit std) and min, max, count per (Market, Category).- Inform product managers which categories need stabilization strategies (pricing, inventory). - Help supply chain plan safety stock levels for unstable categories.
-
-
Console formatters live in
queries_run.py(tables only; no logic).
- Run the unit suite (uses in-memory CSV via
StringIOto avoid I/O):python3 -m unittest
- Coverage:
python3 -m coverage run -m unittest python3 -m coverage report # optional HTML python3 -m coverage html
orders.py–Ordermodel, CSV streaming, datetime/product parsing.grouping_aggregation_helpers.py– reusable sum/mean/stddev/min-max-count by key.queries.py– analytical queries built on streams and helpers.queries_run.py– CLI-style output aggregating multiple reports.test_queries_unittest.py–unittestcoverage for helpers, parsing, and queries (in-memory data).
- Output is added in Analyses-Results.pdf.
- Unit test coverage is added in Unit Test Coverage.png
- Stream processing means large files are handled lazily; use
StringIOin tests or pass file-like objects tostream_orders. - Stddev defaults to sample (n-1); set
sample=Falseif you need population metrics.
-
Input validation & error handling: Wrap Order.from_row parsing with schema checks, helpful error messages (column + value), and a strategy for bad rows (skip-with-log vs fail-fast). Validate numeric conversions and date parsing instead of silent None.
-
Single-pass aggregations: Current queries reread the CSV multiple times; refactor to either share a cached stream (noting iterator exhaustion) or build combined aggregations in one pass per execution. For large files this is a meaningful perf win.
-
Numerical semantics: Decide on sample stddev behavior for single observations; either return None/nan or document the current 0. Align with business definitions for YOY edge cases
-
Testing: Add integration tests over a real sample file, and tests for failure paths (bad date, missing column, zero sales) to ensure validation and logging work. Include property-based or fuzz tests for aggregations if feasible.
-
Testing: Add integration tests over a real sample file, and tests for failure paths (bad date, missing column, zero sales) to ensure validation and logging work. Include property-based or fuzz tests for aggregations if feasible.