Yaseen Banu TheDataArtisanDev

Hi there! 👋

I'm Yaseen Shaik, a Senior Data Engineer with 6 years of experience building production-scale data platforms across Azure, GCP, and AWS.

I specialize in PySpark, real-time streaming, and data orchestration at scale, and I'm interested in applying Gen AI to data engineering.

🛠️ Tech Stack

🚀 Data Engineering Projects

⚙️ Apache Airflow Production Patterns

12 production-ready DAG patterns with a complete monitoring stack

What's Inside:

Operators & Sensors: Bash, Python, branching, XCom patterns, file/time sensors
Database Integration: SQL operators, connection management, data extraction
External Systems: Microsoft Teams webhook notifications, API integrations
Real ETL Pipelines: NYC Taxi data pipeline, advanced DuckDB analytics pipeline
Dynamic DAGs: Factory patterns for programmatic DAG generation
Complete Monitoring: Prometheus + Grafana dashboards, StatsD exporter, real-time metrics
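The factory pattern behind the Dynamic DAGs item can be sketched in plain Python. This is only an illustration under stated assumptions: `DagSpec` is a hypothetical stand-in for an Airflow DAG object, and the `weather` source, the schedules, and the task names are invented for the example rather than taken from the repository.

```python
from dataclasses import dataclass, field


@dataclass
class DagSpec:
    """Hypothetical stand-in for an Airflow DAG: an ID, a schedule, and task edges."""
    dag_id: str
    schedule: str
    # Maps each task ID to the list of task IDs it depends on.
    tasks: dict[str, list[str]] = field(default_factory=dict)


def build_etl_dag(source: str, schedule: str) -> DagSpec:
    """Factory: one call produces one extract -> transform -> load pipeline."""
    spec = DagSpec(dag_id=f"etl_{source}", schedule=schedule)
    spec.tasks[f"extract_{source}"] = []
    spec.tasks[f"transform_{source}"] = [f"extract_{source}"]
    spec.tasks[f"load_{source}"] = [f"transform_{source}"]
    return spec


# Hypothetical config; in practice this might come from YAML or a database.
SOURCES = {"nyc_taxi": "@daily", "weather": "@hourly"}

# Generate one DAG spec per config entry, keyed by DAG ID.
DAGS = {
    spec.dag_id: spec
    for spec in (build_etl_dag(name, cron) for name, cron in SOURCES.items())
}
```

In an actual Airflow project, the factory would typically return a `DAG` object, and each result would be exposed at module level so the scheduler can discover it. Adding a pipeline then means adding one config entry instead of a new DAG file.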

Tech: Apache Airflow, Python, Docker, Prometheus, Grafana, DuckDB

☁️ GCP Data Engineering

Apache Beam pipelines (Python & Java), BigQuery, Pub/Sub, and Cloud Storage integrations

What's Inside:

Apache Beam Python: 40+ examples covering Create, Read, Write, FlatMap, Map, Filter, Flatten, CombinePerKey, CountPerKey, CoGroupByKey transforms
Apache Beam Java: Complete Maven project progressing from basic to complex patterns (windowing, side inputs, stateful processing), with use cases such as word count, even/odd classification, and average calculations
BigQuery Integration: Dataset management, table operations, data loading from Cloud Storage, data export workflows
Pub/Sub Messaging: Real-time streaming patterns with complete messaging implementation
Sample Datasets: Flight data, transaction data, text processing examples
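To show what a per-key combine such as an average calculation does, here is a plain-Python emulation, not Beam code. The method names mirror the four steps of a Beam `CombineFn` (`create_accumulator`, `add_input`, `merge_accumulators`, `extract_output`); the `combine_per_key` helper, the shard count, and the flight-delay records are assumptions made up for this sketch.

```python
from collections import defaultdict


class MeanCombiner:
    """Plain-Python imitation of the four CombineFn steps for computing a mean."""

    def create_accumulator(self):
        return (0.0, 0)  # (running sum, count)

    def add_input(self, acc, value):
        total, count = acc
        return (total + value, count + 1)

    def merge_accumulators(self, accs):
        totals, counts = zip(*accs)
        return (sum(totals), sum(counts))

    def extract_output(self, acc):
        total, count = acc
        return total / count if count else float("nan")


def combine_per_key(pairs, combiner, num_shards=2):
    """Emulate a per-key combine: accumulate per shard, then merge per key."""
    shards = [defaultdict(combiner.create_accumulator) for _ in range(num_shards)]
    for i, (key, value) in enumerate(pairs):
        shard = shards[i % num_shards]  # pretend each shard is a separate worker
        shard[key] = combiner.add_input(shard[key], value)
    keys = {key for shard in shards for key in shard}
    return {
        key: combiner.extract_output(
            combiner.merge_accumulators([s[key] for s in shards if key in s])
        )
        for key in keys
    }


# Hypothetical (carrier, delay_minutes) records.
delays = [("AA", 10), ("DL", 5), ("AA", 20), ("DL", 15), ("AA", 30)]
averages = combine_per_key(delays, MeanCombiner())
print(averages)
```

The accumulator carries a sum and a count, not a running mean, because partial sums and counts from different workers can be merged exactly while partial means cannot.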

Tech: Apache Beam, Python, Java, BigQuery, Pub/Sub, Cloud Storage, Maven

☁️ Azure Data Engineering

Databricks, Data Factory, Delta Live Tables, and Key Vault integration

What's Inside:

Delta Live Tables: Time travel queries, schema evolution, metadata inspection, pipeline design and monitoring
Azure Data Factory: Complete repository structure with dataflows, datasets, linked services, pipelines, triggers
Production Dataflows: Dimension tables (DimMovies, DimOnlineService), Fact tables (FactOnlinePurchase), monthly snapshots
8 Real-World Use Cases: CSV copy operations, file cleanup, SQL to Blob transfer, Databricks orchestration, database migration, dataflow activities, pipeline monitoring, CI/CD implementation
Key Vault Integration: Secret management with Python, secure connection strings, access policies

Tech: Azure Databricks, Data Factory, Delta Lake, Key Vault, Python, Dataflows

⚡ Apache Spark Projects

PySpark patterns, transformations, and Apache Iceberg data lakehouse

What's Inside:

Core Patterns: RDD operations, file loading strategies, filtering, data transformations
Data Manipulation: Pivot operations, string processing with regex, backfilling strategies, encoding techniques
Apache Iceberg Integration: Table creation and management, time travel queries for historical data access, schema evolution and versioning, metadata inspection and catalog operations

Tech: PySpark, Apache Spark, Apache Iceberg, Python, Data Lakehouse

🔥 Apache Flink

Hands-on Apache Flink implementations with Table API, SQL, and streaming transformations

What's Inside:

Basics (7 notebooks): Table operations, schema inspection, data conversions (Pandas ↔ Flink Table, List of Rows, Tuples), row counting
Transformations: Map, filter, groupBy, aggregations, joins with dynamic column handling
PyFlink Examples: Word count and streaming patterns in Python
SQL Examples: SQL-based stream processing with Flink SQL

Tech: Apache Flink, Python, PyFlink, Flink Table API, Flink SQL

📊 Kafka Streams Processor

Java-based Kafka Streams application for real-time data processing

What's Inside:

Deduplication: State store-based unique record processing with RocksDB backend
Stream Joining: Stream-to-stream and stream-to-table join operations
Schema Validation: YAML-based message validation with custom validators
Error Handling: Dead letter queue (DLQ) patterns for resilience
State Management: RocksDB-backed state stores for stateful stream processing
Components: Stream processors, join processors, topology definitions, Kafka configuration
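The deduplication and dead letter queue ideas above can be sketched as follows. The project itself is Java on Kafka Streams; this sketch uses Python for brevity, an in-memory set as a stand-in for the RocksDB-backed state store, and a required-field check in place of YAML-based validation. The field names `id` and `amount` are hypothetical.

```python
import json


def parse(raw, required_fields):
    """Decode one message and check that it carries the required fields."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("message is not a JSON object")
    missing = [name for name in required_fields if name not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return record


def process_stream(raw_messages, required_fields=("id", "amount")):
    """Route each message to the output, the duplicate counter, or the DLQ."""
    seen_ids = set()  # stand-in for a persistent state store
    output, dead_letters = [], []
    duplicates = 0

    for raw in raw_messages:
        try:
            record = parse(raw, required_fields)
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            dead_letters.append({"raw": raw, "error": str(exc)})
            continue
        key = str(record["id"])
        if key in seen_ids:
            duplicates += 1  # already processed: drop silently
            continue
        seen_ids.add(key)
        output.append(record)
    return output, dead_letters, duplicates


messages = [
    '{"id": 1, "amount": 9.5}',
    '{"id": 1, "amount": 9.5}',  # duplicate of the first message
    "not json",                  # malformed: goes to the DLQ
    '{"id": 2}',                 # fails validation: goes to the DLQ
    '{"id": 3, "amount": 4}',
]
output, dead_letters, duplicates = process_stream(messages)
```

Bad messages are set aside with their error instead of stopping the stream, which is the point of the DLQ pattern: one malformed record should not block the valid ones behind it.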

Tech: Java, Apache Kafka Streams, Spring Boot, Docker, RocksDB, Maven

❄️ Snowflake Snowpark Examples

Python-based data processing with Snowpark DataFrames, joins, and UDFs

What's Inside:

Connection Management: Secure Snowflake authentication with environment variables
DataFrame Operations: Snowpark API for data transformations, filtering, aggregations
Join Operations: Multi-table joins with Snowpark DataFrames
Report Generation: Data aggregation and summary report creation
User-Defined Functions: Inline UDFs for custom transformations

Tech: Snowflake, Snowpark Python, Jupyter, Python

☸️ Kubernetes Deployments

Production-ready Kubernetes deployment patterns from basic to advanced with Helm charts

What's Inside:

Basic Deployments: Nginx with NodePort services, fundamental Kubernetes concepts (Pods, Services, Deployments)
Flask Microservices: Multi-replica Python APIs with Docker containerization, REST API patterns
Health Monitoring: Production-grade liveness and readiness probes, self-healing capabilities
Helm Charts: Infrastructure as Code with templating, values injection, package management
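The difference between liveness and readiness probes can be illustrated with a minimal service that exposes one endpoint for each. This sketch uses a bare WSGI callable so that it runs on the standard library alone (a Flask application exposes the same WSGI interface); the paths `/healthz` and `/ready` and the `STATE` flag are assumptions, not taken from the repository.

```python
import json

# Hypothetical readiness flag; a real service would flip this once its
# dependencies (database, cache, model load) are available.
STATE = {"ready": False}


def app(environ, start_response):
    """Minimal WSGI app exposing separate liveness and readiness endpoints."""
    path = environ.get("PATH_INFO", "/")
    if path == "/healthz":  # liveness: the process is up
        status, body = "200 OK", {"status": "alive"}
    elif path == "/ready":  # readiness: safe to receive traffic
        if STATE["ready"]:
            status, body = "200 OK", {"status": "ready"}
        else:
            status, body = "503 Service Unavailable", {"status": "starting"}
    else:
        status, body = "404 Not Found", {"error": "not found"}
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]


def call(path):
    """Helper for trying the app without starting a server."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], json.loads(body)
```

Kubernetes restarts a container whose liveness probe keeps failing, and stops routing Service traffic to a pod whose readiness probe fails. Keeping the two endpoints separate lets a slow-starting pod stay alive without receiving requests too early.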

Tech: Kubernetes, Docker, Helm, Python/Flask, YAML

🐳 Docker Containerization Patterns

Containerization from basics to multi-stage builds with Docker Compose orchestration

What's Inside:

Basic Containerization: Simple Flask app deployment with Dockerfile basics
API Services: Dynamic CSV report generator with REST endpoints, JSON to CSV conversion
Multi-container Apps: Flask + Redis with Docker Compose, service networking, volume mapping
Multi-stage Builds: Production-optimized PDF generator, separate build and runtime environments, minimal image size
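The JSON to CSV conversion mentioned under API Services might look roughly like the function below. It is a sketch that assumes the input is a JSON array of flat objects; the function name and the sample records are hypothetical.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    rows = json.loads(json_text)
    if not rows:
        return ""
    # Union of keys in first-seen order, so rows with missing keys still fit.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


report = json_to_csv(
    '[{"name": "Ana", "total": 12.5}, {"name": "Ben", "city": "Oslo"}]'
)
print(report)
```

Building the header from the union of keys, not from the first row alone, keeps the conversion from failing when records have different fields.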

Tech: Docker, Docker Compose, Python/Flask, Redis

🗄️ NoSQL Explorations

Hands-on exploration of NoSQL databases

What's Inside:

Graph Database (Neo4j): 7 progressive notebooks covering node creation, relationships, properties, labels and multi-labels, Cypher fundamentals (MATCH, MERGE, WITH clauses), updating and deleting operations
Python Integration: Neo4j driver usage, connection management, query execution

Tech: Neo4j, Cypher, Python, Jupyter

🤖 Gen AI Learning Projects

🤖 AI Data Analytics Agent

AI-powered data analytics agent that converts natural language to SQL, with production monitoring

What's Inside:

Text-to-SQL System: Natural language interface powered by Azure OpenAI GPT-4 with LangGraph orchestration
Data Analysis: CSV/Excel file upload and analysis capability
Production Monitoring: Prometheus metrics collection (query duration, success rates, performance tracking)
Grafana Dashboards: Real-time visualization of system performance and cost analysis
LangGraph State Management: Multi-step workflow orchestration for query processing

Tech: Azure OpenAI GPT-4, LangGraph, DuckDB, Prometheus, Grafana, Python, Streamlit

📚 Gen AI Experiments

Hands-on implementations exploring LLM capabilities (7 experiments)

What's Inside:

RAG Systems: Document processing for ePub (Alice in Wonderland), PowerPoint (pitch decks), PDF (travel guides, receipts), and Word documents, all with a FAISS vector store and semantic search
LLM Output Structuring: Using Pydantic for structured responses, schema validation
Function Calling: Dynamic weather chatbot with tool use and API integration
Context-Aware Chatbot: E-commerce assistant with memory management and conversation chains
Focus Areas: Chunking strategies (RecursiveCharacterTextSplitter), embedding model evaluation (HuggingFace, Azure OpenAI), vector store optimization, retrieval pattern analysis
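As a rough illustration of the chunking idea, the function below splits text on the coarsest separator first and falls back to finer ones only for pieces that are still too long. It is a simplified sketch in the spirit of recursive character splitting, not LangChain's `RecursiveCharacterTextSplitter` implementation; it omits chunk overlap, and the separator list and sample text are assumptions.

```python
def split_recursive(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta."
chunks = split_recursive(sample, chunk_size=20)
print(chunks)
```

Trying paragraph breaks before line breaks and spaces tends to keep related sentences in the same chunk, which usually matters more for retrieval quality than hitting an exact chunk length.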

Tech: FAISS, LangChain, Azure OpenAI, Pydantic, Vector Embeddings, Jupyter

🏨 Multi-Agent Hotel Management

Complete hotel management automation with a LangGraph multi-agent workflow

What's Inside:

Multi-Agent System: Three specialized agents (booking agent, housekeeping agent, customer service agent) with coordinated workflows
State Management: Shared state across agents using LangGraph state machines
RAG Integration: FAQ-based customer service responses using vector database (FAISS)
Streamlit UI: Interactive web interface for hotel operations
MySQL Integration: Database for booking and room management
Workflow Orchestration: Complete booking → housekeeping → customer service pipeline

Tech: LangGraph, Azure OpenAI, MySQL, RAG, FAISS, Streamlit, Python

🔧 CrewAI SQL Assistant

Multi-agent SQL generation with the CrewAI framework, including security compliance checks and cost monitoring

What's Inside:

SQL Generation Agent: Automated SQL query generation from natural language
SQL Validation Agent: Security compliance checking, SQL injection prevention
Multi-Agent Coordination: CrewAI framework for agent orchestration and collaboration
Real-time Cost Monitoring: Token usage tracking and cost analysis
Security Features: Query validation, pattern matching, safeguard implementations
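The query validation and pattern matching safeguards can be sketched as a small read-only filter for generated SQL. This is an assumption about how such a check could work, not the project's actual validation agent; the deny-list and the function name are hypothetical.

```python
import re

# Hypothetical deny-list; a real policy would be broader and paired with
# database-level permissions (for example a read-only role).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b",
    re.IGNORECASE,
)


def validate_sql(query):
    """Return (is_allowed, reason) for a generated SQL string."""
    stripped = query.strip().rstrip(";").strip()
    if ";" in stripped:
        return False, "multiple statements are not allowed"
    if "--" in stripped or "/*" in stripped:
        return False, "comments are not allowed"
    if not re.match(r"(?i)(select|with)\b", stripped):
        return False, "only SELECT queries are allowed"
    match = FORBIDDEN.search(stripped)
    if match:
        return False, f"forbidden keyword: {match.group(1).lower()}"
    return True, "ok"


print(validate_sql("SELECT name FROM users WHERE id = 1;"))
print(validate_sql("SELECT 1; DROP TABLE users"))
```

Pattern matching like this is deliberately strict and will reject some harmless queries, such as a string literal that contains a semicolon. It works best as a first filter in front of parameterized execution and restricted database permissions, not as the only defense.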

Tech: CrewAI, Azure OpenAI, Python, Multi-agent orchestration

🎓 Certifications

Cloud & Data Engineering:

Azure Certified Data Engineer Associate (2024)
GCP Certified Professional Data Engineer (2023)
Databricks Certified Data Engineer Associate (2025)
Databricks Generative AI Engineer Associate (2025)

Foundational & Specialized:

📫 Connect: LinkedIn • Email
