This portfolio project showcases a modern approach to data warehousing and analytics using SQL Server. It covers key components such as ETL processes, data modeling, and analytics, while reflecting industry-standard practices in modern data engineering.
This project involves:
- Data Architecture: Designing a Modern Data Warehouse Using the Medallion Architecture: Bronze, Silver, and Gold Layers.
- ETL Pipelines: Extracting, Transforming, and Loading data from source systems into the warehouse.
- Data Modeling (Kimball): Developing fact and dimension tables optimized for analytical queries.
The data architecture for this project is the Medallion (Multi-Hop) Architecture. This architecture consists of three distinct layers that perform ETL on the data.
In the Bronze Layer the data is Extracted from the source system (the CRM and ERP). The data is not Transformed in the Bronze Layer and is Loaded to the 'bronze' schema using a Full Load.
The Silver Layer Transforms the data from the Bronze Layer to clean data, and then it is Loaded into the 'silver' schema using a Full Load. The following Transformations are applied:
- Data Cleaning: string trimming and replacing NULL values
- Data Standardization: making values consistent
- Data Normalization: replacing special characters
- Derived Columns: calculating the cost
- Data Enrichment: adding additional data from other tables (e.g., customer's birth dates and countries)
The Gold Layer integrates and aggregates data to create business-ready information. It is modeled using a star schema to optimize analytics and facilitate informed decision-making.
- Datasets: The data used for this project (csv files).
- Data with Baraa: Helpful information for learning about SQL, ETL and data warehouses.
- SQL Server: Lightweight server for hosting the SQL database.
- mssql VSCode extension: VSCode extension for managing and interacting with the database.
- Git: For managing the code and project details.
- DrawIO: Tool to design data architecture, models, flows, and diagrams.
- Notion: All-in-one tool for project management and planning.
data-engineering-warehouse-project/
│
├── datasets/ # Raw datasets used for the project (ERP and CRM data)
│
├── docs/ # Project documentation and architecture details
│ ├── data_architecture.drawio.png # Shows the project's architecture
│ ├── data_catalog.md # Catalog of datasets, including field descriptions and metadata
│ ├── data_flow.drawio.png # Shows the data flow diagram
│ ├── data_models.drawio.png # Shows the data model (star schema)
│ └── naming-conventions.md # Consistent naming guidelines for tables, columns, and files
│
├── scripts/ # SQL scripts for ETL and transformations
│ ├── bronze/ # Scripts for extracting and loading raw data
│ ├── data-analysis/ # Scripts for data analysis and reporting
│ ├── gold/ # Scripts for creating analytical models
│ └── silver/ # Scripts for cleaning and transforming data
│
├── tests/ # Scripts for validating the quality of the scripts
│
├── README.md # Project overview and information
└── LICENSE # License information for the repository
This project is licensed under the MIT License. You are free to use, modify, and share this project with proper attribution.
Hello, I am Mike Ebbe, an IT/AI graduate aspiring to specialize in data engineering! This project is my way of further delving into world of data by getting acquainted with data architectures and ETL. I hope to learn a lot more!
- Cloud Data Platforms:
- Databricks
- Snowflake
- Big Data Processing Frameworks
- Apache Spark
- Event Streaming Platforms:
- Kafka
- Data Warehousing Methodologies:
- Inmon
- Databases
- PostgreSQL
