TheDataArtisanDev/README.md

Hi there! 👋

I'm Yaseen Shaik - a Senior Data Engineer with 6 years of experience building production-scale data platforms across Azure, GCP, and AWS.

I specialize in PySpark, real-time streaming, and data orchestration at scale, with an interest in Gen AI applications in data engineering.

LinkedIn Email


πŸ› οΈ Tech Stack

Python PySpark SQL Java Shell Apache Spark Apache Flink Kafka Apache Beam Airflow

Azure GCP AWS Databricks Snowflake BigQuery PostgreSQL MySQL Neo4j Redis

Docker Kubernetes Terraform Helm Prometheus Grafana LangChain OpenAI


🚀 Data Engineering Projects

12 production-ready DAG patterns with complete monitoring stack

What's Inside:

  • Operators & Sensors: Bash, Python, branching, XCom patterns, file/time sensors
  • Database Integration: SQL operators, connection management, data extraction
  • External Systems: Microsoft Teams webhook notifications, API integrations
  • Real ETL Pipelines: NYC Taxi data pipeline, advanced DuckDB analytics pipeline
  • Dynamic DAGs: Factory patterns for programmatic DAG generation
  • Complete Monitoring: Prometheus + Grafana dashboards, StatsD exporter, real-time metrics

Tech: Apache Airflow, Python, Docker, Prometheus, Grafana, DuckDB


Apache Beam pipelines (Python & Java), BigQuery, Pub/Sub, and Cloud Storage integrations

What's Inside:

  • Apache Beam Python: 40+ examples covering Create, Read, Write, FlatMap, Map, Filter, Flatten, CombinePerKey, CountPerKey, CoGroupByKey transforms
  • Apache Beam Java: Complete Maven project with basic to complex patterns (windowing, side inputs, stateful processing, use cases like word count, even/odd classification, average calculations)
  • BigQuery Integration: Dataset management, table operations, data loading from Cloud Storage, data export workflows
  • Pub/Sub Messaging: Real-time streaming patterns with complete messaging implementation
  • Sample Datasets: Flight data, transaction data, text processing examples

Tech: Apache Beam, Python, Java, BigQuery, Pub/Sub, Cloud Storage, Maven


Databricks, Data Factory, Delta Live Tables, and Key Vault integration

What's Inside:

  • Delta Live Tables: Time travel queries, schema evolution, metadata inspection, pipeline design and monitoring
  • Azure Data Factory: Complete repository structure with dataflows, datasets, linked services, pipelines, triggers
  • Production Dataflows: Dimension tables (DimMovies, DimOnlineService), Fact tables (FactOnlinePurchase), monthly snapshots
  • 8 Real-World Use Cases: CSV copy operations, file cleanup, SQL to Blob transfer, Databricks orchestration, database migration, dataflow activities, pipeline monitoring, CI/CD implementation
  • Key Vault Integration: Secret management with Python, secure connection strings, access policies

Tech: Azure Databricks, Data Factory, Delta Lake, Key Vault, Python, Dataflows


PySpark patterns, transformations, and Apache Iceberg data lakehouse

What's Inside:

  • Core Patterns: RDD operations, file loading strategies, filtering, data transformations
  • Data Manipulation: Pivot operations, string processing with regex, backfilling strategies, encoding techniques
  • Apache Iceberg Integration: Table creation and management, time travel queries for historical data access, schema evolution and versioning, metadata inspection and catalog operations

Tech: PySpark, Apache Spark, Apache Iceberg, Python, Data Lakehouse


🔥 Apache Flink

Hands-on Apache Flink implementations with Table API, SQL, and streaming transformations

What's Inside:

  • Basics (7 notebooks): Table operations, schema inspection, data conversions (Pandas ↔ Flink Table, List of Rows, Tuples), row counting
  • Transformations: Map, filter, groupBy, aggregations, joins with dynamic column handling
  • PyFlink Examples: Word count and streaming patterns in Python
  • SQL Examples: SQL-based stream processing with Flink SQL

Tech: Apache Flink, Python, PyFlink, Flink Table API, Flink SQL


Java-based Kafka Streams application for real-time data processing

What's Inside:

  • Deduplication: State store-based unique record processing with RocksDB backend
  • Stream Joining: Stream-to-stream and stream-to-table join operations
  • Schema Validation: YAML-based message validation with custom validators
  • Error Handling: Dead letter queue (DLQ) patterns for resilience
  • State Management: RocksDB-backed state stores for stateful stream processing
  • Components: Stream processors, join processors, topology definitions, Kafka configuration

Tech: Java, Apache Kafka Streams, Spring Boot, Docker, RocksDB, Maven
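
The deduplication and dead letter queue bullets above can be illustrated with a small sketch. This is plain Python, not the Kafka Streams API and not the repository's Java code: the `DedupProcessor` class, the `run` helper, and the "record must have an `id`" rule are all hypothetical, and an in-memory set stands in for a RocksDB-backed state store.

```python
from dataclasses import dataclass, field


@dataclass
class DedupProcessor:
    """Plain-Python stand-in for a stateful stream processor (hypothetical)."""

    # Stand-in for a persistent state store: the record keys seen so far.
    seen_keys: set = field(default_factory=set)
    # Stand-in for a dead letter topic: rejected records plus the reason.
    dead_letters: list = field(default_factory=list)

    def process(self, record):
        """Return the record if it is valid and new, otherwise None."""
        # Assumed validation rule: a record must be a dict with a non-empty "id".
        if not isinstance(record, dict) or not record.get("id"):
            self.dead_letters.append({"record": record, "error": "missing id"})
            return None
        if record["id"] in self.seen_keys:
            return None  # duplicate key: drop it
        self.seen_keys.add(record["id"])
        return record


def run(records):
    """Push records through one processor; return (forwarded, dead_letters)."""
    processor = DedupProcessor()
    forwarded = [r for r in records if processor.process(r) is not None]
    return forwarded, processor.dead_letters
```

Invalid records are routed to the dead letter list instead of raising, so one bad message does not stop the stream. In a real topology the set would need to be bounded or windowed, since an unbounded key store grows forever.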


Python-based data processing with Snowpark DataFrames, joins, and UDFs

What's Inside:

  • Connection Management: Secure Snowflake authentication with environment variables
  • DataFrame Operations: Snowpark API for data transformations, filtering, aggregations
  • Join Operations: Multi-table joins with Snowpark DataFrames
  • Report Generation: Data aggregation and summary report creation
  • User-Defined Functions: Inline UDFs for custom transformations

Tech: Snowflake, Snowpark Python, Jupyter, Python


Production-ready Kubernetes deployment patterns from basic to advanced with Helm charts

What's Inside:

  • Basic Deployments: Nginx with NodePort services, fundamental Kubernetes concepts (Pods, Services, Deployments)
  • Flask Microservices: Multi-replica Python APIs with Docker containerization, REST API patterns
  • Health Monitoring: Production-grade liveness and readiness probes, self-healing capabilities
  • Helm Charts: Infrastructure as Code with templating, values injection, package management

Tech: Kubernetes, Docker, Helm, Python/Flask, YAML


Containerization from basics to multi-stage builds with Docker Compose orchestration

What's Inside:

  • Basic Containerization: Simple Flask app deployment with Dockerfile basics
  • API Services: Dynamic CSV report generator with REST endpoints, JSON to CSV conversion
  • Multi-container Apps: Flask + Redis with Docker Compose, service networking, volume mapping
  • Multi-stage Builds: Production-optimized PDF generator, separate build and runtime environments, minimal image size

Tech: Docker, Docker Compose, Python/Flask, Redis
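
As a rough illustration of the JSON to CSV conversion mentioned above, here is a minimal standard-library sketch. The function name `json_to_csv` and the choice to leave missing fields empty are assumptions made for this example, not details taken from the repository.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    rows = json.loads(json_text)
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError("expected a JSON array of objects")

    # Collect column names in first-seen order so the header is stable.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    # restval fills fields that a row does not define (assumed behaviour).
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A REST endpoint could return this string with a `text/csv` content type. Nested objects are not flattened here; they would be written using their Python string form.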


πŸ—„οΈ NoSQL Explorations

Comprehensive hands-on with NoSQL databases

What's Inside:

  • Graph Database (Neo4j): 7 progressive notebooks covering node creation, relationships, properties, labels and multi-labels, Cypher fundamentals (MATCH, MERGE, WITH clauses), updating and deleting operations
  • Python Integration: Neo4j driver usage, connection management, query execution

Tech: Neo4j, Cypher, Python, Jupyter


🤖 Gen AI Learning Projects

AI-powered data analytics agent converting natural language to SQL with production monitoring

What's Inside:

  • Text-to-SQL System: Natural language interface powered by Azure OpenAI GPT-4 with LangGraph orchestration
  • Data Analysis: CSV/Excel file upload and analysis capability
  • Production Monitoring: Prometheus metrics collection (query duration, success rates, performance tracking)
  • Grafana Dashboards: Real-time visualization of system performance and cost analysis
  • LangGraph State Management: Multi-step workflow orchestration for query processing

Tech: Azure OpenAI GPT-4, LangGraph, DuckDB, Prometheus, Grafana, Python, Streamlit


Hands-on implementations exploring LLM capabilities (7 experiments)

What's Inside:

  • RAG Systems: Document processing for ePub (Alice in Wonderland), PowerPoint (pitch decks), PDF (travel guides, receipts), Word documents - all with FAISS vector store and semantic search
  • LLM Output Structuring: Using Pydantic for structured responses, schema validation
  • Function Calling: Dynamic weather chatbot with tool use and API integration
  • Context-Aware Chatbot: E-commerce assistant with memory management and conversation chains
  • Focus Areas: Chunking strategies (RecursiveCharacterTextSplitter), embedding model evaluation (HuggingFace, Azure OpenAI), vector store optimization, retrieval pattern analysis

Tech: FAISS, LangChain, Azure OpenAI, Pydantic, Vector Embeddings, Jupyter
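
The chunking idea listed under Focus Areas can be sketched in plain Python. This is a simplified stand-in for recursive character splitting, not LangChain's `RecursiveCharacterTextSplitter`: it has no chunk overlap, and the `split_text` name and default separators are assumptions for this example.

```python
def split_text(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters.

    Coarse separators are tried first; a piece that is still too long
    is split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)  # close the open chunk
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_text(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

For example, `split_text("alpha beta\n\ngamma delta epsilon", 12)` keeps the first paragraph whole and splits the second at word boundaries. Splitting on paragraph breaks before falling back to words tends to keep related sentences in the same chunk, which usually helps retrieval.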


Complete hotel management automation with LangGraph multi-agent workflow

What's Inside:

  • Multi-Agent System: Three specialized agents (booking agent, housekeeping agent, customer service agent) with coordinated workflows
  • State Management: Shared state across agents using LangGraph state machines
  • RAG Integration: FAQ-based customer service responses using vector database (FAISS)
  • Streamlit UI: Interactive web interface for hotel operations
  • MySQL Integration: Database for booking and room management
  • Workflow Orchestration: Complete booking → housekeeping → customer service pipeline

Tech: LangGraph, Azure OpenAI, MySQL, RAG, FAISS, Streamlit, Python
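
The booking → housekeeping → customer service flow with shared state can be pictured with a small sketch. This is plain Python rather than the LangGraph API: the agent functions, the state keys, and the "assign the first free room" rule are all hypothetical, and no LLM, FAISS index, or MySQL database is involved.

```python
def booking_agent(state):
    """Assign the first free room (assumed rule) and record it in the state."""
    room = state["free_rooms"][0]
    return {
        **state,
        "room": room,
        "free_rooms": state["free_rooms"][1:],
        "log": state["log"] + [f"booked room {room}"],
    }


def housekeeping_agent(state):
    """Schedule cleaning for the room chosen by the booking agent."""
    return {
        **state,
        "cleaning_scheduled": True,
        "log": state["log"] + [f"cleaning scheduled for room {state['room']}"],
    }


def customer_service_agent(state):
    """Compose a confirmation message from the shared state."""
    reply = f"Room {state['room']} is confirmed for {state['guest']}."
    return {**state, "reply": reply, "log": state["log"] + ["reply sent"]}


# Fixed order of agents; each one receives the state left by the previous one.
PIPELINE = [booking_agent, housekeeping_agent, customer_service_agent]


def run_workflow(state):
    """Pass one shared state dict through every agent in order."""
    for agent in PIPELINE:
        state = agent(state)
    return state
```

Each agent returns a new dict instead of mutating its input, which makes every step easy to inspect. A graph framework adds conditional routing and persistence on top of this basic shape.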


Multi-agent SQL generation with CrewAI framework - security compliance and cost monitoring

What's Inside:

  • SQL Generation Agent: Automated SQL query generation from natural language
  • SQL Validation Agent: Security compliance checking, SQL injection prevention
  • Multi-Agent Coordination: CrewAI framework for agent orchestration and collaboration
  • Real-time Cost Monitoring: Token usage tracking and cost analysis
  • Security Features: Query validation, pattern matching, safeguard implementations

Tech: CrewAI, Azure OpenAI, Python, Multi-agent orchestration
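
The query validation and pattern matching bullets above might look something like the following sketch. It is a deliberately conservative, standard-library illustration, not the repository's validation agent and not a CrewAI API: the `validate_sql` name, the deny-list, and the rejection messages are assumptions.

```python
import re

# Hypothetical deny-list of statement keywords; a real policy would be broader.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|truncate|grant|revoke)\b",
    re.IGNORECASE,
)


def validate_sql(query):
    """Return (is_allowed, reason) for a generated SQL string."""
    # Tolerate a single trailing semicolon, then inspect the rest.
    stripped = query.strip().rstrip(";").strip()
    if not stripped:
        return False, "empty query"
    if ";" in stripped:
        return False, "multiple statements are not allowed"
    if "--" in stripped or "/*" in stripped:
        return False, "comments are not allowed"
    if not re.match(r"(select|with)\b", stripped, re.IGNORECASE):
        return False, "only SELECT queries are allowed"
    match = FORBIDDEN.search(stripped)
    if match:
        return False, f"forbidden keyword: {match.group(1).lower()}"
    return True, "ok"
```

Because the check works on raw text, it also rejects harmless queries whose string literals contain a semicolon or a forbidden word. Pattern matching of this kind is best treated as one layer alongside read-only database credentials, not as a complete defence against SQL injection.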


🎓 Certifications

Cloud & Data Engineering:

Foundational & Specialized:


📫 Connect: LinkedIn • Email

Pinned

  1. spark-projects Public

    This repo contains code related to Apache Spark

    Jupyter Notebook

  2. GCP-DataEngineering Public

    This repo contains code related to GCP services, including Apache Beam (Java/Python) and BigQuery

    Jupyter Notebook

  3. Azure-DataEngineering Public

    This repo contains code related to Azure services, including Data Factory, Databricks, and Key Vault.

    Jupyter Notebook

  4. kafka-streams-processor Public

    This repository contains code that reads streaming Kafka data and performs various streaming operations

    Java

  5. airflow-production-patterns Public

    Apache Airflow DAG patterns and orchestration examples with production monitoring stack

    Jupyter Notebook

  6. ai-engineering-portfolio Public

    Jupyter Notebook