
# Azure Data Engineering Learning Repository

**Azure Databricks · Azure Data Factory · Python**

A comprehensive learning repository for Microsoft Azure data engineering that demonstrates modern cloud data pipeline development using Azure Databricks, Azure Data Factory, Delta Live Tables, and Azure Key Vault. This project provides hands-on examples and real-world use cases for building scalable data engineering solutions on Azure.

## 🎯 Project Overview

This repository serves as a complete learning platform for Azure data engineering, focusing on:

- **Big Data Processing** with Azure Databricks and Apache Spark
- **Data Pipeline Orchestration** with Azure Data Factory
- **Real-Time Analytics** with Delta Live Tables
- **Security Management** with Azure Key Vault
- **Data Lake Architecture** and best practices
- **ETL/ELT Workflows** in the Azure ecosystem

๐Ÿ“ Repository Structure

Azure-DataEngineering/
โ”œโ”€โ”€ ๐Ÿ“‚ Azure Databricks/              # Databricks notebooks and Delta Lake examples
โ”‚   โ”œโ”€โ”€ ๐Ÿ““ Databricks_Commands.ipynb   # Basic Databricks operations
โ”‚   โ””โ”€โ”€ ๐Ÿ“‚ Delta Live Tables/          # Advanced Delta Live Tables examples
โ”œโ”€โ”€ ๐Ÿ“‚ Azure DataFactory/             # Data Factory pipelines and configurations
โ”‚   โ”œโ”€โ”€ ๐Ÿ“‚ DataFactory Repo/          # Complete ADF repository structure
โ”‚   โ””โ”€โ”€ ๐Ÿ“‚ Usecases/                  # Real-world ADF use cases
โ”œโ”€โ”€ ๐Ÿ“‚ Azure-KeyVault/                # Key Vault integration examples
โ””โ”€โ”€ ๐Ÿ“„ README.md                      # This file

## 🧮 Azure Databricks

**Location:** `Azure Databricks/`

### Core Files

Essential Databricks operations and mounting configurations:

- **DBFS Mounting**: Connect to Azure Blob Storage
- **Storage Integration**: Access external data sources
- **Authentication**: Secure connection patterns
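
The mounting pattern above can be sketched as follows. This is an illustrative, stdlib-only sketch (not code from the repo): the account, container, and mount-point names are hypothetical placeholders, and the actual `dbutils.fs.mount` call only works inside a Databricks notebook, so it is shown in a comment.

```python
# Sketch of the pieces an ADLS Gen2 mount with service-principal (OAuth)
# authentication typically needs. Names are placeholders.

def abfss_source(container: str, storage_account: str) -> str:
    """Return the abfss:// URL used as the mount source for ADLS Gen2."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/"

def oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Spark configs for service-principal (OAuth) authentication."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

source = abfss_source("raw", "mystorageacct")
# Inside a Databricks notebook you would then mount it with:
# dbutils.fs.mount(source=source, mount_point="/mnt/raw",
#                  extra_configs=oauth_configs(client_id, client_secret, tenant_id))
```

In practice the client secret should come from a Databricks secret scope (backed by Key Vault), never from a literal in the notebook.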

### Delta Live Tables: `Delta Live Tables/`

Advanced data pipeline development with Delta Lake:

| Notebook | Focus | Description |
|----------|-------|-------------|
| `01_DeltaLiveTable_Basics.ipynb` | Fundamentals | Delta table creation, CRUD operations, and time travel |
| `02_DeltaLiveTable_Pipeline_Output.ipynb` | Pipeline Results | Analyzing and monitoring pipeline outputs |
| `02_DeltaLiveTable_Pipeline.ipynb` | Pipeline Design | Building end-to-end data pipelines |
| `03_DeltaLiveTable_Pipeline.ipynb` | Advanced Pipelines | Complex pipeline patterns and optimizations |

**🔧 Key Concepts Covered:**

- ✅ **Delta Lake Operations**: Create, Read, Update, Delete (CRUD)
- ✅ **Time Travel**: Query historical data versions
- ✅ **Data Versioning**: Track changes and rollbacks
- ✅ **Pipeline Orchestration**: Automated data workflows
- ✅ **Performance Optimization**: Table optimization and cleanup
- ✅ **Unity Catalog**: Modern data governance
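
Time travel is worth a closer look. Delta Lake keeps every committed version of a table in its transaction log, so reading historical data is just reading an older version. The stdlib-only toy class below models that behavior; it is a conceptual sketch, not Delta's implementation. With delta-spark you would instead write, e.g., `spark.read.format("delta").option("versionAsOf", 2).load(path)`.

```python
# Toy model of Delta's version history and "versionAsOf" reads.

class ToyDeltaTable:
    """Append-only version history, like Delta's transaction log."""

    def __init__(self):
        self._versions = []          # one snapshot per commit

    def commit(self, rows: list) -> int:
        """Commit a new snapshot; returns its version number."""
        self._versions.append(list(rows))
        return len(self._versions) - 1

    def read(self, version_as_of=None) -> list:
        """Read the latest snapshot, or an older one (time travel)."""
        if version_as_of is None:
            version_as_of = len(self._versions) - 1
        return self._versions[version_as_of]

t = ToyDeltaTable()
t.commit([{"movie": "Dune"}])                          # version 0
t.commit([{"movie": "Dune"}, {"movie": "Arrival"}])    # version 1
assert t.read(version_as_of=0) == [{"movie": "Dune"}]  # historical read
```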

๐Ÿญ Azure Data Factory

Location: Azure DataFactory/

DataFactory Repository: DataFactory Repo/

Complete Azure Data Factory project structure with production-ready components:

Core Components

Component Directory Description
๐Ÿ“Š Dataflows dataflow/ Data transformation logic
๐Ÿ“ Datasets dataset/ Data source and sink definitions
๐Ÿ”— Linked Services linkedService/ Connection configurations
๐Ÿš€ Pipelines pipeline/ Orchestration workflows
โฐ Triggers trigger/ Pipeline execution schedules

#### Featured Dataflows

| Dataflow | Purpose | Description |
|----------|---------|-------------|
| `DataflowPipeline_for_DimMovies.json` | Dimension Table | Movie dimension data processing |
| `DataflowPipeline_for_DimOnlineService.json` | Service Dimension | Online service dimension creation |
| `DataflowPipeline_for_FactOnlinePurchase.json` | Fact Table | Online purchase fact table ETL |
| `FactOnlinePurchase_MonthlySnapshot.json` | Snapshots | Monthly aggregation snapshots |

### Real-World Use Cases: `Usecases/`

Practical examples covering common data engineering scenarios:

| Use Case | Requirement | Implementation |
|----------|-------------|----------------|
| Requirement 01 | File Copy | Copy CSV files between blob storages |
| Requirement 02 | File Cleanup | Delete processed files from blob storage |
| Requirement 03 | SQL to Blob | Extract data from SQL Database to Blob |
| Requirement 04 | Databricks Integration | Orchestrate Databricks notebooks |
| Requirement 05 | Database Migration | Copy data between SQL databases |
| Requirement 06 | Data Transformation | Implement dataflow activities |
| Requirement 07 | Pipeline Monitoring | Monitor and troubleshoot pipelines |
| Requirement 08 | CI/CD Deployment | Implement DevOps for Data Factory |
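
To make the file-copy scenario concrete, the sketch below builds the JSON shape of a minimal ADF pipeline with a single Copy activity as a Python dict. This is illustrative, not one of the repo's actual pipeline files; the dataset names `InputCsv` and `OutputCsv` are hypothetical placeholders.

```python
# Sketch: minimal ADF Copy pipeline definition (CSV between blob stores).
import json

def copy_pipeline(name: str, source_ds: str, sink_ds: str) -> dict:
    """Return an ADF pipeline definition with a single Copy activity."""
    return {
        "name": name,
        "properties": {
            "activities": [{
                "name": "CopyCsvFiles",
                "type": "Copy",
                "inputs": [{"referenceName": source_ds, "type": "DatasetReference"}],
                "outputs": [{"referenceName": sink_ds, "type": "DatasetReference"}],
                "typeProperties": {
                    "source": {"type": "DelimitedTextSource"},
                    "sink": {"type": "DelimitedTextSink"},
                },
            }]
        },
    }

definition = copy_pipeline("CopyCsvBetweenBlobs", "InputCsv", "OutputCsv")
print(json.dumps(definition, indent=2))
```

Definitions like this live under the `pipeline/` directory of an ADF Git repository and are deployed via the ADF UI or ARM templates.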

**🎯 Data Factory Capabilities:**

- ✅ **Data Integration**: Connect to 90+ data sources
- ✅ **ETL/ELT Pipelines**: Transform data at scale
- ✅ **Hybrid Connectivity**: On-premises and cloud integration
- ✅ **Monitoring & Alerting**: Pipeline health and performance
- ✅ **CI/CD Integration**: DevOps best practices
- ✅ **Cost Optimization**: Efficient resource utilization

๐Ÿ” Azure Key Vault

Location: Azure-KeyVault/

Secure secrets management and authentication:

Component Purpose Description
01_Create_secrets_from_python.py Secret Management Python-based secret creation and retrieval
requirements.txt Dependencies Required Azure SDK packages

**🔑 Security Features:**

- ✅ **Secret Storage**: Secure API keys and connection strings
- ✅ **Access Policies**: Fine-grained permission control
- ✅ **Auditing**: Complete access logging
- ✅ **Integration**: Seamless Azure service integration
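
A small stdlib-only sketch of the groundwork for talking to Key Vault from Python: deriving the vault URL and validating a secret name before calling the SDK. Key Vault secret names may only contain letters, digits, and dashes (1-127 characters). The vault name is a placeholder, and the actual `azure-keyvault-secrets` calls are shown in comments.

```python
# Sketch: vault URL construction and secret-name validation.
import re

def vault_url(vault_name: str) -> str:
    """Key Vault endpoints follow https://<name>.vault.azure.net/."""
    return f"https://{vault_name}.vault.azure.net/"

def valid_secret_name(name: str) -> bool:
    """Letters, digits, and dashes only; 1-127 characters."""
    return re.fullmatch(r"[A-Za-z0-9-]{1,127}", name) is not None

url = vault_url("your-keyvault")
assert valid_secret_name("storage-key")

# With azure-identity and azure-keyvault-secrets installed you would then:
#   from azure.identity import DefaultAzureCredential
#   from azure.keyvault.secrets import SecretClient
#   client = SecretClient(vault_url=url, credential=DefaultAzureCredential())
#   client.set_secret("storage-key", "your-key")
#   print(client.get_secret("storage-key").value)
```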

## 🔧 Prerequisites

### Software Requirements

- **Python 3.8+** with pip
- **Azure CLI** for authentication
- **Azure Databricks** workspace access
- **Azure Data Factory** instance
- **Azure Key Vault** access

### Azure Setup

1. **Azure Subscription**: Active Azure subscription
2. **Resource Group**: Create a dedicated resource group
3. **Storage Account**: Azure Data Lake Storage Gen2
4. **Databricks Workspace**: Premium tier recommended
5. **Data Factory**: V2 instance
6. **Key Vault**: Standard tier
### Python Dependencies

```bash
# Core Azure SDKs
pip install azure-identity
pip install azure-keyvault-secrets
pip install azure-storage-blob
pip install azure-mgmt-datafactory

# Databricks integration
pip install databricks-cli
pip install delta-spark
```

## 🚀 Getting Started

### 1. Clone the Repository

```bash
git clone https://github.com/TheDataArtisanDev/Azure-DataEngineering.git
cd Azure-DataEngineering
```

### 2. Azure Authentication

```bash
# Log in to Azure
az login

# Set your subscription
az account set --subscription "your-subscription-id"

# Create a service principal (for automation)
az ad sp create-for-rbac --name "data-engineering-sp"
```

### 3. Configure Databricks

```bash
# Install the Databricks CLI
pip install databricks-cli

# Configure authentication
databricks configure --token
```

### 4. Set Up Key Vault

```bash
# Create a Key Vault
az keyvault create --name "your-keyvault" --resource-group "your-rg"

# Set a secret
az keyvault secret set --vault-name "your-keyvault" --name "storage-key" --value "your-key"
```

### 5. Run Your First Example

```bash
# Test the Key Vault connection
cd Azure-KeyVault
python 01_Create_secrets_from_python.py
```

## 🎓 Key Learning Outcomes

After completing this repository, you will understand:

- ✅ **Azure Data Platform** architecture and services
- ✅ **Databricks & Spark** for big data processing
- ✅ **Delta Lake** for reliable data lakes
- ✅ **Data Factory** for orchestration and integration
- ✅ **Security Best Practices** with Key Vault
- ✅ **Production Deployment** and monitoring
- ✅ **Cost Optimization** strategies

## 💡 Use Cases & Examples

### Real-World Scenarios Covered

- 🏪 **Retail Analytics**: Customer purchase behavior analysis
- 🎬 **Entertainment**: Movie rating and recommendation systems
- 💰 **Financial Services**: Transaction processing and monitoring
- 📊 **Business Intelligence**: Dimensional modeling and reporting
- 🔄 **Data Migration**: Legacy system modernization

### Technical Patterns

- **Medallion Architecture**: Bronze, Silver, Gold data layers
- **CDC (Change Data Capture)**: Real-time data synchronization
- **Metadata-Driven Pipelines**: Dynamic and scalable ETL
- **Error Handling**: Robust pipeline resilience
- **Performance Tuning**: Optimized data processing
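
The medallion and metadata-driven patterns combine naturally: one metadata row per source table, expanded into one load step per layer. The stdlib-only sketch below illustrates the idea; the table names, lake path convention, and load plan are hypothetical, and a real pipeline would invoke Spark or ADF at each step instead of returning paths.

```python
# Sketch: metadata-driven expansion over the medallion layers.

LAYERS = ["bronze", "silver", "gold"]

# One metadata row per source table drives the whole pipeline.
TABLE_METADATA = [
    {"table": "dim_movies", "source": "sql"},
    {"table": "fact_online_purchase", "source": "blob"},
]

def layer_path(layer: str, table: str) -> str:
    """Lake path convention (assumed): /mnt/datalake/<layer>/<table>."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"/mnt/datalake/{layer}/{table}"

def plan_loads(metadata: list) -> list:
    """Expand the metadata into one load step per table per layer."""
    return [layer_path(layer, m["table"]) for m in metadata for layer in LAYERS]

for path in plan_loads(TABLE_METADATA):
    print(path)   # first line: /mnt/datalake/bronze/dim_movies
```

Adding a new source table then means adding one metadata row, not writing a new pipeline.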

๐Ÿค Contributing

This is a learning project focused on Azure data engineering. Contributions are welcome:

  • ๐Ÿ’ก Suggest new use cases or Azure services
  • ๐Ÿ› Report issues or improvements
  • ๐Ÿ“š Add documentation or examples
  • ๐Ÿ”ง Optimize existing pipelines or code

For major changes, please open an issue first to discuss your ideas.


**Happy Learning! 🚀**

*This repository provides a comprehensive journey through Azure data engineering. Use it to build production-ready data solutions on Microsoft Azure!*
