Skip to content

MikeEbbe/data-engineering-warehouse-project

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Data Engineering Warehouse Project

This portfolio project showcases a modern approach to data warehousing and analytics using SQL Server. It covers key components such as ETL processes, data modeling, and analytics, while reflecting industry-standard practices in modern data engineering.

Project Overview

This project involves:

  1. Data Architecture: Designing a Modern Data Warehouse Using the Medallion Architecture: Bronze, Silver, and Gold Layers.
  2. ETL Pipelines: Extracting, Transforming, and Loading data from source systems into the warehouse.
  3. Data Modeling (Kimball): Developing fact and dimension tables optimized for analytical queries.

Data Architecture

The data architecture for this project is the Medallion (Multi-Hop) Architecture. This architecture consists of three distinct layers that perform ETL on the data.

Data Architecture

Bronze Layer

In the Bronze Layer the data is Extracted from the source system (the CRM and ERP). The data is not Transformed in the Bronze Layer and is Loaded to the 'bronze' schema using a Full Load.

Silver Layer

The Silver Layer Transforms the data from the Bronze Layer to clean data, and then it is Loaded into the 'silver' schema using a Full Load. The following Transformations are applied:

  • Data Cleaning: string trimming and replacing NULL values
  • Data Standardization: making values consistent
  • Data Normalization: replacing special characters
  • Derived Columns: calculating the cost
  • Data Enrichment: adding additional data from other tables (e.g., customer's birth dates and countries)

Gold Layer

The Gold Layer integrates and aggregates data to create business-ready information. It is modeled using a star schema to optimize analytics and facilitate informed decision-making.

Links & Tools:

  • Datasets: The data used for this project (csv files).
  • Data with Baraa: Helpful information for learning about SQL, ETL and data warehouses.
  • SQL Server: Lightweight server for hosting the SQL database.
  • mssql VSCode extension: VSCode extension for managing and interacting with the database.
  • Git: For managing the code and project details.
  • DrawIO: Tool to design data architecture, models, flows, and diagrams.
  • Notion: All-in-one tool for project management and planning.

Repository Structure

data-engineering-warehouse-project/
│
├── datasets/                           # Raw datasets used for the project (ERP and CRM data)
│
├── docs/                               # Project documentation and architecture details
│   ├── data_architecture.drawio.png    # Shows the project's architecture
│   ├── data_catalog.md                 # Catalog of datasets, including field descriptions and metadata
│   ├── data_flow.drawio.png            # Shows the data flow diagram
│   ├── data_models.drawio.png          # Shows the data model (star schema)
│   └── naming-conventions.md           # Consistent naming guidelines for tables, columns, and files
│
├── scripts/                            # SQL scripts for ETL and transformations
│   ├── bronze/                         # Scripts for extracting and loading raw data
│   ├── data-analysis/                  # Scripts for data analysis and reporting 
│   ├── gold/                           # Scripts for creating analytical models
│   └── silver/                         # Scripts for cleaning and transforming data
│
├── tests/                              # Scripts for validating the quality of the scripts
│
├── README.md                           # Project overview and information
└── LICENSE                             # License information for the repository

License

This project is licensed under the MIT License. You are free to use, modify, and share this project with proper attribution.

About Me

Hello, I am Mike Ebbe, an IT/AI graduate aspiring to specialize in data engineering! This project is my way of further delving into world of data by getting acquainted with data architectures and ETL. I hope to learn a lot more!

Tools and Methodologies I Want to Learn in Future Projects:

  • Cloud Data Platforms:
    • Databricks
    • Snowflake
  • Big Data Processing Frameworks
    • Apache Spark
  • Event Streaming Platforms:
    • Kafka
  • Data Warehousing Methodologies:
    • Inmon
  • Databases
    • PostgreSQL

About

Building a modern data warehouse with SQL Server, including ETL processes, date modeling and analytics

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages