A comprehensive learning repository for Google Cloud Platform (GCP) Data Engineering that demonstrates modern data pipeline development using Apache Beam, BigQuery, Pub/Sub, and other GCP services. This project covers both Python and Java implementations with real-world examples and hands-on exercises.
- Project Overview
- Repository Structure
- Apache Beam - Python
- Apache Beam - Java
- GCP Services Integration
- Sample Datasets
- Prerequisites
- Getting Started
- Learning Path
- Contributing
This repository serves as a complete learning platform for data engineering concepts, focusing on:
- Batch and Stream Processing with Apache Beam
- Data Pipeline Development in Python and Java
- GCP Services Integration (BigQuery, Pub/Sub, Cloud Storage)
- Real-world Data Processing scenarios
- Best Practices for scalable data engineering
GCP-DataEngineering/
├── Apache Beam/        # Python-based Apache Beam examples
├── ApacheBeam-Java/    # Java-based Apache Beam examples
├── GCP BigQuery/       # BigQuery integration examples
├── GCP PubSub/         # Pub/Sub messaging examples
├── Datasets/           # Sample datasets for practice
└── README.md           # This file
Location: Apache Beam/
- MySQL Connection - Database integration with Apache Beam
- Spark Runner - Running Beam pipelines on Spark
- Apache Beam Basics Notebook - Interactive learning guide
Examples Directory: Examples/
Progressive learning examples covering core Beam concepts:
| File | Concept | Description |
|---|---|---|
| 01_Beam_create_integers.py | Create | Creating PCollections from integers |
| 02_Beam Create Key-Value Pairs.py | Key-Value | Working with key-value pair data |
| 03_Beam Create objects.py | Objects | Creating PCollections from custom objects |
| 04_Beam Create String.py | Strings | String data processing |
| 05_Beam Filter.py | Filter | Filtering data based on conditions |
| 06_Beam Map Elements to Formatted String.py | Map | Transforming elements to formatted strings |
| 07_Beam Map Elements.py | Map | Basic element transformation |
| 08_Beam FlatMap.py | FlatMap | Flattening nested collections |
| 09_Beam FlatMap Elements from List to Integer.py | FlatMap | Converting lists to individual integers |
| 10_Beam Group by Key and Sum.py | GroupByKey | Grouping and aggregating data |
| 11_Beam Group by Key.py | GroupByKey | Basic grouping operations |
| 12_Beam ParDo (Parallel Do).py | ParDo | Parallel data processing |
| 13_Beam ParDo with Key-Value.py | ParDo | ParDo with key-value pairs |
| 14_WordCount.ipynb | WordCount | Classic word counting example |
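
Several of the concepts in the table above (Create, FlatMap, Filter, Map, and a per-key aggregation) compose naturally into one small pipeline. The following is a minimal sketch, assuming `apache-beam` is installed and the default DirectRunner is used; `split_words` and `run` are illustrative names and are not taken from the files in this repository.

```python
import re


def split_words(line):
    """Return the lowercase words in a line (used as the FlatMap step)."""
    return re.findall(r"[a-z']+", line.lower())


def run(lines):
    """Count words with Create -> FlatMap -> Filter -> Map -> CombinePerKey."""
    # Imported here so split_words can be tested without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipeline:  # DirectRunner unless options say otherwise
        (
            pipeline
            | "Create" >> beam.Create(lines)
            | "Split" >> beam.FlatMap(split_words)          # one line -> many words
            | "DropShort" >> beam.Filter(lambda w: len(w) > 2)
            | "Pair" >> beam.Map(lambda w: (w, 1))          # word -> (word, 1)
            | "Sum" >> beam.CombinePerKey(sum)              # add the 1s per word
            | "Print" >> beam.Map(print)
        )
```

Calling `run(["the cat sat", "the hat"])` prints one `(word, count)` tuple per distinct word longer than two characters; the order of the printed tuples is not guaranteed.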
Main Functions Directory: Main Functions/
In-depth Jupyter notebooks covering essential Beam transformations:
| Notebook | Transform | Description |
|---|---|---|
| 01_Create.ipynb | Create | Creating PCollections from various sources |
| 02_ReadTransform.ipynb | Read | Reading data from files and external sources |
| 03_WriteTransform.ipynb | Write | Writing data to various sinks |
| 04_FlatMap.ipynb | FlatMap | Advanced flattening operations |
| 05_Map.ipynb | Map | Element-wise transformations |
| 06_FilterLambda.ipynb | Filter | Lambda-based filtering |
| 07_Filter.ipynb | Filter | Advanced filtering techniques |
| 08_Flatten.ipynb | Flatten | Combining multiple PCollections |
| 09_CombinePerKey.ipynb | CombinePerKey | Aggregation operations |
| 10_CountPerKey.ipynb | Count | Counting elements per key |
| 11_CogroupByKey.ipynb | CoGroupByKey | Joining multiple PCollections |
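
The output shape of CoGroupByKey is often the least obvious part of the transforms listed above. The sketch below models it in plain Python with no Beam dependency, so it only illustrates the semantics; `cogroup` and the sample contact data are hypothetical. In Beam's Python SDK the corresponding call would be `{"emails": emails, "phones": phones} | beam.CoGroupByKey()`.

```python
from collections import defaultdict


def cogroup(**tagged):
    """Model CoGroupByKey: join several lists of (key, value) pairs by key.

    Returns {key: {tag: [values...]}}, with an empty list for any tag
    that has no values for a given key.
    """
    result = defaultdict(lambda: {tag: [] for tag in tagged})
    for tag, pairs in tagged.items():
        for key, value in pairs:
            result[key][tag].append(value)
    return dict(result)


# Hypothetical sample data, standing in for two keyed PCollections.
emails = [("amy", "amy@example.com"), ("bob", "bob@example.com")]
phones = [("amy", "555-0100"), ("amy", "555-0101"), ("cara", "555-0102")]

joined = cogroup(emails=emails, phones=phones)
for key in sorted(joined):
    print(key, joined[key])
```

Every key that appears in any input shows up once in the result, which makes the behavior closer to a full outer join than to an inner join.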
Datasets: datasets/
- flights_sample.csv - Flight data for analysis
- Poem.txt - Text data for word processing
- transactions.csv - Financial transaction data
- word_count_data.txt - Sample text for word counting
- output/ - Pipeline output files
Location: ApacheBeam-Java/
- pom.xml - Maven build configuration
- coding_guide.md - Java Beam coding guidelines
Basic Examples: src/main/java/com/example/
| File | Concept | Description |
|---|---|---|
| BeamExample.java | Basic Pipeline | Simple Beam pipeline example |
| code_01_BeamCreate_Integer.java | Create | Creating integer PCollections |
| code_01_BeamCreate_KV.java | Key-Value | Key-value pair creation |
| code_01_BeamCreate_Objects.java | Objects | Custom object processing |
| code_01_BeamCreate_String.java | Strings | String data handling |
| code_02_BeamFilter.java | Filter | Data filtering operations |
| code_03_BeamMapElements_formattedStringOutput.java | Map | Formatted string output |
| code_04_BeamFlatMap.java | FlatMap | Collection flattening |
| code_05_BeamGroupByKey.java | GroupByKey | Data grouping |
| code_05_BeamGroupByKey_Sum.java | GroupByKey | Grouping with summation |
| code_06_BeamParDo.java | ParDo | Parallel processing |
| code_07_BeamParDo_KeyValue.java | ParDo | ParDo with key-value data |
Complex Examples: src/main/java/com/complexExamples/
| File | Concept | Description |
|---|---|---|
| code_01_BeamWindowing.java | Windowing | Time-based data windowing |
| code_01_BeamWindowing_Demo.java | Windowing | Advanced windowing demo |
| code_02_BeamSideInputs.java | Side Inputs | Data enrichment patterns |
| code_02_BeamStatefulProcessing.java | Stateful | Stateful data processing |
| code_03_BeamPipeline.java | Pipeline | Complex pipeline configurations |
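
Time-based windowing assigns each element to a window according to its timestamp, and aggregations then run per window. The sketch below is a language-neutral model of fixed (tumbling) windows written in plain Python; `fixed_window`, `sum_per_window`, and the sample events are hypothetical, and the Java files in this repository may configure windowing differently. In the Beam Java SDK, fixed windows are applied with `Window.into(FixedWindows.of(Duration.standardMinutes(1)))`.

```python
def fixed_window(timestamp, size):
    """Return the [start, end) fixed window that contains timestamp (seconds)."""
    start = timestamp - (timestamp % size)
    return (start, start + size)


def sum_per_window(events, size):
    """Sum values per fixed window; events are (timestamp, value) pairs."""
    totals = {}
    for timestamp, value in events:
        window = fixed_window(timestamp, size)
        totals[window] = totals.get(window, 0) + value
    return totals


# Hypothetical events: (event time in seconds, value).
events = [(5, 1), (42, 2), (61, 3), (119, 4), (120, 5)]
print(sum_per_window(events, 60))
# {(0, 60): 3, (60, 120): 7, (120, 180): 5}
```

Window boundaries are half-open, so the event at second 120 falls into the third window, not the second.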
Use Cases: src/main/java/com/usecases/
| File | Use Case | Description |
|---|---|---|
| code_01_wordcount.java | Word Count | Classic word counting implementation |
| code_02_even_odd.java | Classification | Even/odd number classification |
| code_03_average_numbers.java | Aggregation | Numerical average calculation |
| code_03_average_numbers_combineApproach.java | Combine | Average using Combine transforms |
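
A Combine-based average scales because each worker keeps only a small accumulator (a running sum and a count) and the runner merges accumulators afterwards. The sketch below models the four CombineFn methods in plain Python without depending on Beam; `AverageFn` and `combine` are hypothetical names, and the Java use case in this repository may be structured differently. The method names mirror the Python SDK's `beam.CombineFn`; the Java SDK uses `createAccumulator`, `addInput`, `mergeAccumulators`, and `extractOutput`.

```python
class AverageFn:
    """Plain-Python model of the four CombineFn methods for a mean."""

    def create_accumulator(self):
        return (0.0, 0)  # (running sum, element count)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        accumulators = list(accumulators)
        total = sum(acc[0] for acc in accumulators)
        count = sum(acc[1] for acc in accumulators)
        return (total, count)

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float("nan")


def combine(fn, bundles):
    """Simulate a runner: combine each bundle separately, then merge."""
    partials = []
    for bundle in bundles:
        accumulator = fn.create_accumulator()
        for value in bundle:
            accumulator = fn.add_input(accumulator, value)
        partials.append(accumulator)
    return fn.extract_output(fn.merge_accumulators(partials))


# Three bundles, as if the input had been split across three workers.
print(combine(AverageFn(), [[1, 2, 3], [4, 5], [6]]))  # 3.5
```

Averaging the per-bundle averages would give a different, incorrect answer here, which is why the accumulator carries the sum and the count separately.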
BigQuery Integration: GCP BigQuery/
| Notebook | Focus | Description |
|---|---|---|
| 01_Load_from_StorageBucket.ipynb | Data Loading | Loading data from Cloud Storage to BigQuery |
| 02_BigQuery_Datasets_Python.ipynb | Dataset Management | Creating and managing BigQuery datasets |
| 03_BigQuery_Tables_Python.ipynb | Table Operations | BigQuery table creation and manipulation |
| 04_Load_to_StorageBucket.ipynb | Data Export | Exporting BigQuery data to Cloud Storage |
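
Loading a CSV file from Cloud Storage into a BigQuery table can be sketched as follows. This assumes the `google-cloud-bigquery` package is installed and application default credentials are configured; `full_table_id`, `load_csv_from_gcs`, and all project, dataset, bucket, and file names are placeholders, and the notebooks in this repository may take a different approach.

```python
def full_table_id(project, dataset, table):
    """Build the fully qualified 'project.dataset.table' identifier."""
    return f"{project}.{dataset}.{table}"


def load_csv_from_gcs(uri, table_id):
    """Load a CSV file at a gs:// URI into a table and return its row count."""
    # Imported here so full_table_id can be tested without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes the file has a header row
        autodetect=True,      # let BigQuery infer the schema
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # block until the load job finishes
    return client.get_table(table_id).num_rows
```

A call might look like `load_csv_from_gcs("gs://YOUR_BUCKET/flights_sample.csv", full_table_id("YOUR_PROJECT_ID", "demo", "flights"))`, where the bucket, dataset, and table names are placeholders.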
Pub/Sub Messaging: GCP PubSub/
| Notebook | Focus | Description |
|---|---|---|
| 01_PubSub_messaging.ipynb | Messaging | Complete Pub/Sub messaging implementation |
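
Publishing a message to a Pub/Sub topic can be sketched as follows. This assumes the `google-cloud-pubsub` package is installed, the topic already exists, and application default credentials are configured; `encode_event` and `publish_event` are hypothetical helper names and are not taken from the notebook.

```python
import json


def encode_event(event):
    """Serialize a dict to the UTF-8 bytes that Pub/Sub expects as payload."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


def publish_event(project_id, topic_id, event):
    """Publish one event and return the server-assigned message ID."""
    # Imported here so encode_event can be tested without the client library.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, topic_id)
    future = publisher.publish(topic_path, data=encode_event(event))
    return future.result()  # block until the publish is acknowledged
```

Pub/Sub payloads are raw bytes, so the JSON encoding step is a convention chosen for this sketch, not something the service requires.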
Location: Datasets/
| File | Type | Description |
|---|---|---|
| employee.txt | Employee Data | Sample employee records |
| titanic_dataset.csv | Historical Data | Famous Titanic passenger dataset |
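- Python 3.7+ with pip (recent Apache Beam releases require a newer Python version; check the Beam release notes for the supported range)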
- Java 8+ with Maven
- Google Cloud SDK (gcloud CLI)
- Git for version control
- GCP Project: Create or use existing GCP project
- Authentication: Configure gcloud authentication
- APIs: Enable required APIs (BigQuery, Pub/Sub, Cloud Storage)
- Service Account: Create service account with appropriate permissions
pip install "apache-beam[gcp]"
pip install google-cloud-bigquery
pip install google-cloud-pubsub
pip install mysql-connector-python

Maven dependencies are configured in pom.xml.
git clone https://github.com/TheDataArtisanDev/GCP-DataEngineering.git
cd GCP-DataEngineering

# Create virtual environment
python -m venv beam-env
source beam-env/bin/activate # On Windows: beam-env\Scripts\activate
# Install dependencies
pip install -r requirements.txt # Create this file first, based on the imports used in the examples

# Authenticate with GCP
gcloud auth login
# Set your project
gcloud config set project YOUR_PROJECT_ID
# Create application default credentials
gcloud auth application-default login

# Python example (run from the repository root)
cd "Apache Beam/Examples"
python 01_Beam_create_integers.py

# Java example (run from the repository root)
cd ApacheBeam-Java
mvn compile exec:java -Dexec.mainClass="com.example.BeamExample"

After completing this repository, you will understand:
- Apache Beam fundamentals in Python and Java
- Data pipeline design patterns and best practices
- GCP service integration for end-to-end data workflows
- Batch and stream processing concepts
- Scalable data transformation techniques
- Real-world data engineering scenarios
Contributions are welcome! This is a learning project, so feel free to:
- Report issues or bugs you find
- Suggest improvements to examples
- Add more use cases or examples
- Fix documentation or code issues
For major changes, please open an issue first to discuss what you would like to change.
- Apache Beam Documentation
- Google Cloud Platform Documentation
- BigQuery Documentation
- Pub/Sub Documentation
Happy Learning!
This repository is a comprehensive data engineering learning journey. If you find it helpful, feel free to use it for your own learning!