TheDataArtisanDev/GCP-DataEngineering

GCP Data Engineering Learning Repository

Apache Beam | Google Cloud | Python | Java

A comprehensive learning repository for Google Cloud Platform (GCP) Data Engineering that demonstrates modern data pipeline development using Apache Beam, BigQuery, Pub/Sub, and other GCP services. This project covers both Python and Java implementations with real-world examples and hands-on exercises.

🎯 Project Overview

This repository serves as a complete learning platform for data engineering concepts, focusing on:

  • Batch and Stream Processing with Apache Beam
  • Data Pipeline Development in Python and Java
  • GCP Services Integration (BigQuery, Pub/Sub, Cloud Storage)
  • Real-world Data Processing scenarios
  • Best Practices for scalable data engineering

📁 Repository Structure

GCP-DataEngineering/
├── 📂 Apache Beam/                    # Python-based Apache Beam examples
├── 📂 ApacheBeam-Java/                # Java-based Apache Beam examples
├── 📂 GCP BigQuery/                   # BigQuery integration examples
├── 📂 GCP PubSub/                     # Pub/Sub messaging examples
├── 📂 Datasets/                       # Sample datasets for practice
└── 📄 README.md                       # This file

🐍 Apache Beam - Python

Location: Apache Beam/

Core Files

Examples Directory: Examples/

Progressive learning examples covering core Beam concepts:

| File | Concept | Description |
|------|---------|-------------|
| 01_Beam_create_integers.py | Create | Creating PCollections from integers |
| 02_Beam Create Key-Value Pairs.py | Key-Value | Working with key-value pair data |
| 03_Beam Create objects.py | Objects | Creating PCollections from custom objects |
| 04_Beam Create String.py | Strings | String data processing |
| 05_Beam Filter.py | Filter | Filtering data based on conditions |
| 06_Beam Map Elements to Formatted String.py | Map | Transforming elements to formatted strings |
| 07_Beam Map Elements.py | Map | Basic element transformation |
| 08_Beam FlatMap.py | FlatMap | Flattening nested collections |
| 09_Beam FlatMap Elements from List to Integer.py | FlatMap | Converting lists to individual integers |
| 10_Beam Group by Key and Sum.py | GroupByKey | Grouping and aggregating data |
| 11_Beam Group by Key.py | GroupByKey | Basic grouping operations |
| 12_Beam ParDo (Parallel Do).py | ParDo | Parallel data processing |
| 13_Beam ParDo with Key-Value.py | ParDo | ParDo with key-value pairs |
| 14_WordCount.ipynb | WordCount | Classic word counting example |
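
The transforms in this table compose naturally into a single pipeline. The following is a minimal sketch, assuming `apache-beam` is installed and the pipeline runs on the default local runner. The helper names (`split_words`, `is_long_word`, `format_count`, `run`) and the sample input are illustrative and are not taken from the repository's example files.

```python
def split_words(line):
    """FlatMap helper: one input line yields many lowercase words."""
    return line.lower().split()


def is_long_word(word, min_length=4):
    """Filter helper: keep words with at least min_length characters."""
    return len(word) >= min_length


def format_count(pair):
    """Map helper: turn a (word, count) pair into a printable string."""
    word, count = pair
    return f"{word}: {count}"


def run(lines):
    """Build and run the pipeline locally, printing one line per word."""
    # Imported here so the helpers above stay usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(lines)
            | "SplitWords" >> beam.FlatMap(split_words)
            | "KeepLongWords" >> beam.Filter(is_long_word)
            | "PairWithOne" >> beam.Map(lambda word: (word, 1))
            | "GroupByWord" >> beam.GroupByKey()
            | "SumCounts" >> beam.Map(lambda kv: (kv[0], sum(kv[1])))
            | "Format" >> beam.Map(format_count)
            | "Print" >> beam.Map(print)
        )
```

Calling `run(["beam makes batch pipelines", "beam makes streaming pipelines"])` executes the pipeline. The order of the printed lines is not guaranteed, because Beam does not order the elements of a PCollection.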

Main Functions Directory: Main Functions/

In-depth Jupyter notebooks covering essential Beam transformations:

| Notebook | Transform | Description |
|----------|-----------|-------------|
| 01_Create.ipynb | Create | Creating PCollections from various sources |
| 02_ReadTransform.ipynb | Read | Reading data from files and external sources |
| 03_WriteTransform.ipynb | Write | Writing data to various sinks |
| 04_FlatMap.ipynb | FlatMap | Advanced flattening operations |
| 05_Map.ipynb | Map | Element-wise transformations |
| 06_FilterLambda.ipynb | Filter | Lambda-based filtering |
| 07_Filter.ipynb | Filter | Advanced filtering techniques |
| 08_Flatten.ipynb | Flatten | Combining multiple PCollections |
| 09_CombinePerKey.ipynb | CombinePerKey | Aggregation operations |
| 10_CountPerKey.ipynb | Count | Counting elements per key |
| 11_CogroupByKey.ipynb | CoGroupByKey | Joining multiple PCollections |
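
As an illustration of the last row, `CoGroupByKey` can act as a join between two keyed PCollections. This is a hedged sketch, assuming `apache-beam` is installed; the tags `"names"` and `"salaries"`, the helper `flatten_join`, and the sample records are hypothetical and do not come from the notebook.

```python
def flatten_join(element):
    """Expand one CoGroupByKey result into joined rows.

    element has the shape (key, {"names": [...], "salaries": [...]}).
    A key missing from either side yields no rows (inner-join behavior).
    """
    key, grouped = element
    salaries = list(grouped["salaries"])  # materialize so it can be re-iterated
    for name in grouped["names"]:
        for salary in salaries:
            yield (key, name, salary)


def run(names, salaries):
    """Join two lists of (key, value) pairs and print the joined rows."""
    # Imported here so flatten_join stays usable without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:
        names_pc = pipeline | "CreateNames" >> beam.Create(names)
        salaries_pc = pipeline | "CreateSalaries" >> beam.Create(salaries)
        (
            {"names": names_pc, "salaries": salaries_pc}
            | "Join" >> beam.CoGroupByKey()
            | "ExpandRows" >> beam.FlatMap(flatten_join)
            | "Print" >> beam.Map(print)
        )
```

For example, `run([(1, "Ann"), (2, "Bob")], [(1, 100), (1, 120)])` prints two rows for key 1 and nothing for key 2, since key 2 has no salary record.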

Datasets: datasets/


☕ Apache Beam - Java

Location: ApacheBeam-Java/

Project Structure

| File | Concept | Description |
|------|---------|-------------|
| BeamExample.java | Basic Pipeline | Simple Beam pipeline example |
| code_01_BeamCreate_Integer.java | Create | Creating integer PCollections |
| code_01_BeamCreate_KV.java | Key-Value | Key-value pair creation |
| code_01_BeamCreate_Objects.java | Objects | Custom object processing |
| code_01_BeamCreate_String.java | Strings | String data handling |
| code_02_BeamFilter.java | Filter | Data filtering operations |
| code_03_BeamMapElements_formattedStringOutput.java | Map | Formatted string output |
| code_04_BeamFlatMap.java | FlatMap | Collection flattening |
| code_05_BeamGroupByKey.java | GroupByKey | Data grouping |
| code_05_BeamGroupByKey_Sum.java | GroupByKey | Grouping with summation |
| code_06_BeamParDo.java | ParDo | Parallel processing |
| code_07_BeamParDo_KeyValue.java | ParDo | ParDo with key-value data |

| File | Concept | Description |
|------|---------|-------------|
| code_01_BeamWindowing.java | Windowing | Time-based data windowing |
| code_01_BeamWindowing_Demo.java | Windowing | Advanced windowing demo |
| code_02_BeamSideInputs.java | Side Inputs | Data enrichment patterns |
| code_02_BeamStatefulProcessing.java | Stateful | Stateful data processing |
| code_03_BeamPipeline.java | Pipeline | Complex pipeline configurations |

| File | Use Case | Description |
|------|----------|-------------|
| code_01_wordcount.java | Word Count | Classic word counting implementation |
| code_02_even_odd.java | Classification | Even/odd number classification |
| code_03_average_numbers.java | Aggregation | Numerical average calculation |
| code_03_average_numbers_combineApproach.java | Combine | Average using Combine transforms |
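
The idea behind a combine-based average is that each worker keeps a small (sum, count) accumulator that can be merged with others, so the full list of numbers never has to sit in one place. The sketch below models that idea in plain Python, to stay in one language with the other examples here; it is not a translation of `code_03_average_numbers_combineApproach.java`. The four method names mirror Beam's `CombineFn` interface, but the class `AverageAccumulator` and the function `average_in_partitions` are hypothetical and deliberately do not depend on Beam.

```python
class AverageAccumulator:
    """Plain-Python model of a combine function that computes an average."""

    def create_accumulator(self):
        # (running sum, element count) for an empty partition
        return (0.0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        # Partial results from different partitions are added component-wise.
        total, count = 0.0, 0
        for part_total, part_count in accumulators:
            total += part_total
            count += part_count
        return (total, count)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def average_in_partitions(partitions):
    """Average values split across partitions, as parallel workers would."""
    fn = AverageAccumulator()
    partials = []
    for partition in partitions:
        accumulator = fn.create_accumulator()
        for value in partition:
            accumulator = fn.add_input(accumulator, value)
        partials.append(accumulator)
    return fn.extract_output(fn.merge_accumulators(partials))
```

Averaging per-partition averages would give the wrong answer when partitions differ in size, which is why the accumulator carries the count alongside the sum.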

☁️ GCP Services Integration

BigQuery Integration: GCP BigQuery/

| Notebook | Focus | Description |
|----------|-------|-------------|
| 01_Load_from_StorageBucket.ipynb | Data Loading | Loading data from Cloud Storage to BigQuery |
| 02_BigQuery_Datasets_Python.ipynb | Dataset Management | Creating and managing BigQuery datasets |
| 03_BigQuery_Tables_Python.ipynb | Table Operations | BigQuery table creation and manipulation |
| 04_Load_to_StorageBucket.ipynb | Data Export | Exporting BigQuery data to Cloud Storage |
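
Loading a file from Cloud Storage into BigQuery, the topic of the first notebook, can be sketched with the `google-cloud-bigquery` client as shown below. This assumes the package is installed, application default credentials are configured, and the CSV file has a header row. The helper names and every project, dataset, table, and bucket value passed in are placeholders, not names used by the notebooks.

```python
def gcs_uri(bucket, object_name):
    """Build a gs:// URI for an object in a Cloud Storage bucket."""
    return f"gs://{bucket}/{object_name.lstrip('/')}"


def table_id(project, dataset, table):
    """Build a fully qualified BigQuery table ID."""
    return f"{project}.{dataset}.{table}"


def load_csv(project, dataset, table, bucket, object_name):
    """Load a CSV object from Cloud Storage into a BigQuery table."""
    # Imported here so the helpers above work without the client installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: the first row is a header
        autodetect=True,      # let BigQuery infer the schema
    )
    destination = table_id(project, dataset, table)
    load_job = client.load_table_from_uri(
        gcs_uri(bucket, object_name), destination, job_config=job_config
    )
    load_job.result()  # wait for completion; raises if the load failed
    return client.get_table(destination).num_rows
```

Schema autodetection is convenient for practice data; for production tables an explicit schema is usually the safer choice.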

Pub/Sub Messaging: GCP PubSub/

| Notebook | Focus | Description |
|----------|-------|-------------|
| 01_PubSub_messaging.ipynb | Messaging | Complete Pub/Sub messaging implementation |
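
Publishing a message with the `google-cloud-pubsub` client might look like the sketch below. It assumes the package is installed, credentials are configured, and the topic already exists. The helpers `encode_message`, `decode_message`, and `publish`, and the JSON payload format, are illustrative choices and are not taken from the notebook.

```python
import json


def encode_message(payload):
    """Pub/Sub message data must be bytes, so serialize the dict as UTF-8 JSON."""
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def decode_message(data):
    """Reverse of encode_message, for use on the subscriber side."""
    return json.loads(data.decode("utf-8"))


def publish(project_id, topic_id, payload):
    """Publish one JSON message and return the server-assigned message ID."""
    # Imported here so the helpers above work without the client installed.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, encode_message(payload))
    return future.result()  # blocks until the publish is acknowledged
```

A call such as `publish("YOUR_PROJECT_ID", "your-topic", {"event": "signup"})` returns the message ID as a string once Pub/Sub has accepted the message.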

📊 Sample Datasets

Location: Datasets/

| File | Type | Description |
|------|------|-------------|
| employee.txt | Employee Data | Sample employee records |
| titanic_dataset.csv | Historical Data | Famous Titanic passenger dataset |

🔧 Prerequisites

Software Requirements

  • Python 3.7+ with pip
  • Java 8+ with Maven
  • Google Cloud SDK (gcloud CLI)
  • Git for version control

GCP Setup

  1. GCP Project: Create or use existing GCP project
  2. Authentication: Configure gcloud authentication
  3. APIs: Enable required APIs (BigQuery, Pub/Sub, Cloud Storage)
  4. Service Account: Create service account with appropriate permissions

Python Dependencies

pip install "apache-beam[gcp]"  # quotes keep shells such as zsh from expanding the brackets
pip install google-cloud-bigquery
pip install google-cloud-pubsub
pip install mysql-connector-python

Java Dependencies

Maven dependencies are configured in pom.xml


🚀 Getting Started

1. Clone the Repository

git clone https://github.com/TheDataArtisanDev/GCP-DataEngineering.git
cd GCP-DataEngineering

2. Set Up Python Environment

# Create virtual environment
python -m venv beam-env
source beam-env/bin/activate  # On Windows: beam-env\Scripts\activate

# Install dependencies
pip install -r requirements.txt  # Create requirements.txt first, based on the imports the examples use

3. Configure GCP

# Authenticate with GCP
gcloud auth login

# Set your project
gcloud config set project YOUR_PROJECT_ID

# Create application default credentials
gcloud auth application-default login

4. Run Your First Example

# Python example (run from the repository root)
cd "Apache Beam/Examples"
python 01_Beam_create_integers.py

# Java example (return to the repository root first)
cd ../../ApacheBeam-Java
mvn compile exec:java -Dexec.mainClass="com.example.BeamExample"

🎓 Key Learning Outcomes

After completing this repository, you will understand:

  • ✅ Apache Beam fundamentals in Python and Java
  • ✅ Data pipeline design patterns and best practices
  • ✅ GCP service integration for end-to-end data workflows
  • ✅ Batch and stream processing concepts
  • ✅ Scalable data transformation techniques
  • ✅ Real-world data engineering scenarios

🀝 Contributing

Contributions are welcome! This is a learning project, so feel free to:

  • Report issues or bugs you find
  • Suggest improvements to examples
  • Add more use cases or examples
  • Fix documentation or code issues

For major changes, please open an issue first to discuss what you would like to change.


Happy Learning! 🚀

This repository is a comprehensive data engineering learning journey. If you find it helpful, feel free to use it for your own learning!
