This repository contains an Azure Data Factory (ADF) pipeline designed to dynamically ingest data from Azure SQL Database into a Bronze Layer (Azure Data Lake Storage). The pipeline is optimized for incremental processing, backfilling, and efficient file management.
The ADF pipeline performs automated data ingestion following a metadata-driven architecture. It loops through multiple SQL tables, retrieves data using dynamic queries, and writes results into the Bronze Layer in Parquet format.
The pipeline supports:
- Dynamic ingestion across multiple tables
- Incremental loading using Change Data Capture (CDC)
- Data backfilling logic for historical loads
- Optimized storage management to avoid empty files
- GitHub integration for version control
How the dynamic ingestion works:
- Uses parameters, variables, and a ForEach loop to ingest all configured tables (a sketch follows this list).
- Generates dynamic SQL queries to pull data from Azure SQL Database.
- Allows new tables to be added without modifying the pipeline logic.
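A minimal sketch of this metadata-driven loop is shown below. All names here (the pipeline, the `tableList` parameter, the datasets, and the `CopyToBronze` activity) are illustrative placeholders rather than the exact objects in this repository:

```json
{
  "name": "pl_ingest_bronze",
  "properties": {
    "parameters": {
      "tableList": {
        "type": "Array",
        "defaultValue": [
          { "schema": "dbo", "table": "Customers" },
          { "schema": "dbo", "table": "Orders" }
        ]
      }
    },
    "activities": [
      {
        "name": "ForEachTable",
        "type": "ForEach",
        "typeProperties": {
          "items": { "value": "@pipeline().parameters.tableList", "type": "Expression" },
          "activities": [
            {
              "name": "CopyToBronze",
              "type": "Copy",
              "inputs": [ { "referenceName": "ds_azuresql_src", "type": "DatasetReference" } ],
              "outputs": [ { "referenceName": "ds_bronze_parquet", "type": "DatasetReference" } ],
              "typeProperties": {
                "source": {
                  "type": "AzureSqlSource",
                  "sqlReaderQuery": {
                    "value": "SELECT * FROM [@{item().schema}].[@{item().table}]",
                    "type": "Expression"
                  }
                },
                "sink": { "type": "ParquetSink" }
              }
            }
          ]
        }
      }
    ]
  }
}
```

New tables are then onboarded by appending a `{schema, table}` entry to `tableList` (or to an external control table), with no change to the pipeline logic itself.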
- Implements Change Data Capture (CDC) to identify inserts, updates, and deletes.
- Supports:
  - Incremental loads based on `__cdc_lsn`
  - Historical data backfill for selected time windows
- Ensures minimal data movement and optimized performance (a query sketch follows this list).
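Against SQL Server CDC, an incremental pull typically reads the change window between two log sequence numbers (LSNs). The sketch below shows the general shape of such a dynamic source query; the capture instance `dbo_Customers` and the `windowStart`/`windowEnd` parameters are assumptions for illustration, and the `__$start_lsn` column that CDC returns presumably corresponds to what this project tracks as `__cdc_lsn`:

```json
{
  "source": {
    "type": "AzureSqlSource",
    "sqlReaderQuery": {
      "value": "DECLARE @from_lsn BINARY(10) = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', '@{pipeline().parameters.windowStart}'); DECLARE @to_lsn BINARY(10) = sys.fn_cdc_map_time_to_lsn('largest less than or equal', '@{pipeline().parameters.windowEnd}'); SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Customers(@from_lsn, @to_lsn, 'all');",
      "type": "Expression"
    }
  }
}
```

A historical backfill can reuse the same query shape with a wider `windowStart`/`windowEnd` range, so only the window boundaries change between incremental and backfill runs.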
- Creates secure linked services for:
  - Azure SQL Database
  - Azure Data Lake Storage Gen2 (Bronze container)
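For reference, a hedged sketch of an ADLS Gen2 linked service follows; the names are placeholders, and it assumes system-assigned managed identity authentication (the repository may instead use a service principal or a Key Vault-backed account key):

```json
{
  "name": "ls_adls_bronze",
  "properties": {
    "type": "AzureBlobFS",
    "typeProperties": {
      "url": "https://<storage-account>.dfs.core.windows.net"
    }
  }
}
```

With managed identity, no secret appears in the linked service definition at all; access is granted by assigning the factory's identity a role such as Storage Blob Data Contributor on the storage account.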
- All ingested data is stored in Parquet format for:
  - Better query performance
  - Compression
  - Compatibility with Spark, Fabric, Databricks, and Synapse
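A parameterized Parquet dataset over the Bronze container can then be declared roughly as below; the dataset, linked service, and container names are illustrative:

```json
{
  "name": "ds_bronze_parquet",
  "properties": {
    "type": "Parquet",
    "linkedServiceName": { "referenceName": "ls_adls_bronze", "type": "LinkedServiceReference" },
    "parameters": {
      "folderPath": { "type": "String" },
      "fileName": { "type": "String" }
    },
    "typeProperties": {
      "location": {
        "type": "AzureBlobFSLocation",
        "fileSystem": "bronze",
        "folderPath": { "value": "@dataset().folderPath", "type": "Expression" },
        "fileName": { "value": "@dataset().fileName", "type": "Expression" }
      },
      "compressionCodec": "snappy"
    }
  }
}
```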
ADF can also create output files when an incremental run captures no new data (e.g., consecutive runs with no source changes), leaving empty Parquet files behind. This project includes logic to:
- Detect and delete empty Parquet files (see the sketch below)
- Maintain clean Bronze storage
- Reduce ADLS clutter and unnecessary file counts
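One common pattern, sketched below as an assumption rather than the repository's verbatim code, is to gate a Delete activity on the Copy activity's `rowsCopied` output inside the ForEach loop. Testing `rowsCopied` is more robust than testing file size, because an "empty" Parquet file still carries schema and footer bytes:

```json
{
  "name": "DeleteEmptyFile",
  "type": "IfCondition",
  "dependsOn": [
    { "activity": "CopyToBronze", "dependencyConditions": [ "Succeeded" ] }
  ],
  "typeProperties": {
    "expression": {
      "value": "@equals(activity('CopyToBronze').output.rowsCopied, 0)",
      "type": "Expression"
    },
    "ifTrueActivities": [
      {
        "name": "DeleteParquetFile",
        "type": "Delete",
        "typeProperties": {
          "dataset": {
            "referenceName": "ds_bronze_parquet",
            "type": "DatasetReference",
            "parameters": {
              "folderPath": { "value": "@item().schema", "type": "Expression" },
              "fileName": { "value": "@concat(item().table, '.parquet')", "type": "Expression" }
            }
          },
          "enableLogging": false
        }
      }
    ]
  }
}
```

Because the condition references `@item()`, this activity sits inside the same ForEach loop as the Copy activity, immediately after it.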
The full ADF pipeline source code is stored and version-controlled in GitHub, enabling:
- Collaboration
- CI/CD readiness
- Rollback and change tracking
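At the factory level, GitHub integration is expressed through the `repoConfiguration` property; a sketch with placeholder account, repository, and branch values:

```json
{
  "repoConfiguration": {
    "type": "FactoryGitHubConfiguration",
    "accountName": "<github-account>",
    "repositoryName": "<repo-name>",
    "collaborationBranch": "main",
    "rootFolder": "/"
  }
}
```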