I'm Yaseen Shaik, a Senior Data Engineer with 6 years of experience building production-scale data platforms across Azure, GCP, and AWS.
I specialize in PySpark, real-time streaming, and data orchestration at scale, with an interest in Gen AI applications in data engineering.
12 production-ready DAG patterns with a complete monitoring stack
What's Inside:
- Operators & Sensors: Bash, Python, branching, XCom patterns, file/time sensors
- Database Integration: SQL operators, connection management, data extraction
- External Systems: Microsoft Teams webhook notifications, API integrations
- Real ETL Pipelines: NYC Taxi data pipeline, advanced DuckDB analytics pipeline
- Dynamic DAGs: Factory patterns for programmatic DAG generation
- Complete Monitoring: Prometheus + Grafana dashboards, StatsD exporter, real-time metrics
Tech: Apache Airflow, Python, Docker, Prometheus, Grafana, DuckDB
GCP Data Engineering
Apache Beam pipelines (Python & Java), BigQuery, Pub/Sub, and Cloud Storage integrations
What's Inside:
- Apache Beam Python: 40+ examples covering Create, Read, Write, FlatMap, Map, Filter, Flatten, CombinePerKey, CountPerKey, CoGroupByKey transforms
- Apache Beam Java: Complete Maven project with basic to complex patterns (windowing, side inputs, stateful processing, use cases like word count, even/odd classification, average calculations)
- BigQuery Integration: Dataset management, table operations, data loading from Cloud Storage, data export workflows
- Pub/Sub Messaging: Real-time streaming patterns with complete messaging implementation
- Sample Datasets: Flight data, transaction data, text processing examples
Tech: Apache Beam, Python, Java, BigQuery, Pub/Sub, Cloud Storage, Maven
Azure Data Engineering
Databricks, Data Factory, Delta Live Tables, and Key Vault integration
What's Inside:
- Delta Live Tables: Time travel queries, schema evolution, metadata inspection, pipeline design and monitoring
- Azure Data Factory: Complete repository structure with dataflows, datasets, linked services, pipelines, triggers
- Production Dataflows: Dimension tables (DimMovies, DimOnlineService), Fact tables (FactOnlinePurchase), monthly snapshots
- 8 Real-World Use Cases: CSV copy operations, file cleanup, SQL to Blob transfer, Databricks orchestration, database migration, dataflow activities, pipeline monitoring, CI/CD implementation
- Key Vault Integration: Secret management with Python, secure connection strings, access policies
Tech: Azure Databricks, Data Factory, Delta Lake, Key Vault, Python, Dataflows
PySpark patterns, transformations, and Apache Iceberg data lakehouse
What's Inside:
- Core Patterns: RDD operations, file loading strategies, filtering, data transformations
- Data Manipulation: Pivot operations, string processing with regex, backfilling strategies, encoding techniques
- Apache Iceberg Integration: Table creation and management, time travel queries for historical data access, schema evolution and versioning, metadata inspection and catalog operations
Tech: PySpark, Apache Spark, Apache Iceberg, Python, Data Lakehouse
🔥 Apache Flink
Hands-on Apache Flink implementations with Table API, SQL, and streaming transformations
What's Inside:
- Basics (7 notebooks): Table operations, schema inspection, data conversions (Pandas → Flink Table, List of Rows, Tuples), row counting
- Transformations: Map, filter, groupBy, aggregations, joins with dynamic column handling
- PyFlink Examples: Word count and streaming patterns in Python
- SQL Examples: SQL-based stream processing with Flink SQL
Tech: Apache Flink, Python, PyFlink, Flink Table API, Flink SQL
Java-based Kafka Streams application for real-time data processing
What's Inside:
- Deduplication: State store-based unique record processing with RocksDB backend
- Stream Joining: Stream-to-stream and stream-to-table join operations
- Schema Validation: YAML-based message validation with custom validators
- Error Handling: Dead letter queue (DLQ) patterns for resilience
- State Management: RocksDB-backed state stores for stateful stream processing
- Components: Stream processors, join processors, topology definitions, Kafka configuration
Tech: Java, Apache Kafka Streams, Spring Boot, Docker, RocksDB, Maven
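The deduplication and dead letter queue ideas listed above can be illustrated without a Kafka cluster. This is a minimal Python sketch (the project itself is Java), assuming an in-memory set in place of a RocksDB-backed state store and plain lists in place of output and DLQ topics. The function name `process_records`, the `id` key field, and the validation rule are hypothetical.

```python
from typing import Any, Dict, List, Set, Tuple

Record = Dict[str, Any]


def process_records(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (unique valid records, dead-letter entries)."""
    seen_ids: Set[str] = set()      # stand-in for a persistent state store
    output: List[Record] = []       # stand-in for the output topic
    dead_letters: List[Record] = []  # stand-in for the DLQ topic

    for record in records:
        record_id = record.get("id")
        # Invalid records are routed to the DLQ instead of failing the stream.
        if not isinstance(record_id, str) or not record_id:
            dead_letters.append({"record": record, "error": "missing or invalid 'id'"})
            continue
        # A key already present in the store marks the record as a duplicate.
        if record_id in seen_ids:
            continue
        seen_ids.add(record_id)
        output.append(record)

    return output, dead_letters


events = [
    {"id": "a1", "amount": 10},
    {"id": "a1", "amount": 10},  # duplicate, dropped
    {"amount": 5},               # no id, sent to the DLQ
    {"id": "b2", "amount": 7},
]
unique, dlq = process_records(events)
print(len(unique), len(dlq))
```

In a real Kafka Streams topology the store would be fault tolerant and usually bounded by a retention window, which this sketch does not model.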
Snowflake Snowpark Examples
Python-based data processing with Snowpark DataFrames, joins, and UDFs
What's Inside:
- Connection Management: Secure Snowflake authentication with environment variables
- DataFrame Operations: Snowpark API for data transformations, filtering, aggregations
- Join Operations: Multi-table joins with Snowpark DataFrames
- Report Generation: Data aggregation and summary report creation
- User-Defined Functions: Inline UDFs for custom transformations
Tech: Snowflake, Snowpark Python, Jupyter, Python
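As an illustration of the environment-variable approach to connection management, here is a minimal sketch using only the standard library. The `SNOWFLAKE_` prefix, the variable names, and the helper `snowflake_params_from_env` are assumptions for this example and are not taken from the repository.

```python
import os
from typing import Dict, Mapping, Optional

REQUIRED_KEYS = ("account", "user", "password")
OPTIONAL_KEYS = ("role", "warehouse", "database", "schema")


def snowflake_params_from_env(
    env: Optional[Mapping[str, str]] = None, prefix: str = "SNOWFLAKE_"
) -> Dict[str, str]:
    """Build a connection-parameter dict from environment variables."""
    env = os.environ if env is None else env

    # Fail fast with a clear message instead of a late authentication error.
    missing = [
        f"{prefix}{key.upper()}"
        for key in REQUIRED_KEYS
        if not env.get(f"{prefix}{key.upper()}")
    ]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")

    params = {key: env[f"{prefix}{key.upper()}"] for key in REQUIRED_KEYS}
    for key in OPTIONAL_KEYS:
        value = env.get(f"{prefix}{key.upper()}")
        if value:
            params[key] = value
    return params


# Demo with a fake environment so no real credentials are needed.
demo_env = {
    "SNOWFLAKE_ACCOUNT": "my_account",
    "SNOWFLAKE_USER": "demo",
    "SNOWFLAKE_PASSWORD": "not-a-real-secret",
    "SNOWFLAKE_WAREHOUSE": "COMPUTE_WH",
}
params = snowflake_params_from_env(demo_env)
print(sorted(params))
```

With Snowpark installed, a dict like this would typically be passed to `Session.builder.configs(params).create()`. That call is left out here so the sketch runs without a Snowflake account.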
☸️ Kubernetes Deployments
Production-ready Kubernetes deployment patterns from basic to advanced with Helm charts
What's Inside:
- Basic Deployments: Nginx with NodePort services, fundamental Kubernetes concepts (Pods, Services, Deployments)
- Flask Microservices: Multi-replica Python APIs with Docker containerization, REST API patterns
- Health Monitoring: Production-grade liveness and readiness probes, self-healing capabilities
- Helm Charts: Infrastructure as Code with templating, values injection, package management
Tech: Kubernetes, Docker, Helm, Python/Flask, YAML
Containerization from basics to multi-stage builds with Docker Compose orchestration
What's Inside:
- Basic Containerization: Simple Flask app deployment with Dockerfile basics
- API Services: Dynamic CSV report generator with REST endpoints, JSON to CSV conversion
- Multi-container Apps: Flask + Redis with Docker Compose, service networking, volume mapping
- Multi-stage Builds: Production-optimized PDF generator, separate build and runtime environments, minimal image size
Tech: Docker, Docker Compose, Python/Flask, Redis
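The JSON to CSV conversion behind a report endpoint can be sketched with the standard library alone. This is an illustrative version, assuming the payload is a JSON array of flat objects. The function name `json_to_csv` and the sample payload are hypothetical, and the Flask routing is omitted.

```python
import csv
import io
import json


def json_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text."""
    rows = json.loads(json_text)
    if not isinstance(rows, list) or not all(isinstance(row, dict) for row in rows):
        raise ValueError("expected a JSON array of objects")

    # Collect columns in first-seen order across all rows.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return buffer.getvalue()


payload = '[{"name": "Ada", "total": 3}, {"name": "Lin", "region": "EU"}]'
report = json_to_csv(payload)
print(report)
```

Building the header from the union of keys keeps rows with differing fields aligned, at the cost of reading the whole payload before writing.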
NoSQL Explorations
Hands-on exploration of NoSQL databases
What's Inside:
- Graph Database (Neo4j): 7 progressive notebooks covering node creation, relationships, properties, labels and multi-labels, Cypher fundamentals (MATCH, MERGE, WITH clauses), updating and deleting operations
- Python Integration: Neo4j driver usage, connection management, query execution
Tech: Neo4j, Cypher, Python, Jupyter
AI-powered data analytics agent converting natural language to SQL with production monitoring
What's Inside:
- Text-to-SQL System: Natural language interface powered by Azure OpenAI GPT-4 with LangGraph orchestration
- Data Analysis: CSV/Excel file upload and analysis capability
- Production Monitoring: Prometheus metrics collection (query duration, success rates, performance tracking)
- Grafana Dashboards: Real-time visualization of system performance and cost analysis
- LangGraph State Management: Multi-step workflow orchestration for query processing
Tech: Azure OpenAI GPT-4, LangGraph, DuckDB, Prometheus, Grafana, Python, Streamlit
Gen AI Experiments
Hands-on implementations exploring LLM capabilities (7 experiments)
What's Inside:
- RAG Systems: Document processing for ePub (Alice in Wonderland), PowerPoint (pitch decks), PDF (travel guides, receipts), Word documents - all with FAISS vector store and semantic search
- LLM Output Structuring: Using Pydantic for structured responses, schema validation
- Function Calling: Dynamic weather chatbot with tool use and API integration
- Context-Aware Chatbot: E-commerce assistant with memory management and conversation chains
- Focus Areas: Chunking strategies (RecursiveCharacterTextSplitter), embedding model evaluation (HuggingFace, Azure OpenAI), vector store optimization, retrieval pattern analysis
Tech: FAISS, LangChain, Azure OpenAI, Pydantic, Vector Embeddings, Jupyter
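To make the chunking strategy concrete, here is a simplified sketch of recursive character splitting. It is a stand-in written for illustration, not LangChain's `RecursiveCharacterTextSplitter` implementation: it has no chunk overlap, and the function name, separator order, and sample text are assumptions.

```python
from typing import List, Sequence


def recursive_split(
    text: str, chunk_size: int, separators: Sequence[str] = ("\n\n", "\n", " ")
) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        piece = piece.strip()
        if not piece:
            continue
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph about data pipelines.\n\n"
    "Second one.\n\n"
    "A much longer third paragraph that will not fit into a single chunk "
    "of forty characters."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Trying paragraph breaks before line breaks before spaces tends to keep related sentences in the same chunk, which is the main reason this style of splitter is popular for retrieval.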
Complete hotel management automation with a LangGraph multi-agent workflow
What's Inside:
- Multi-Agent System: Three specialized agents (booking agent, housekeeping agent, customer service agent) with coordinated workflows
- State Management: Shared state across agents using LangGraph state machines
- RAG Integration: FAQ-based customer service responses using vector database (FAISS)
- Streamlit UI: Interactive web interface for hotel operations
- MySQL Integration: Database for booking and room management
- Workflow Orchestration: Complete booking → housekeeping → customer service pipeline
Tech: LangGraph, Azure OpenAI, MySQL, RAG, FAISS, Streamlit, Python
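The shared-state idea behind the three agents can be sketched in plain Python. This does not use the LangGraph API. It only mirrors the pattern of nodes that read a shared state and return partial updates, and the state fields, agent logic, and room number are hypothetical.

```python
from typing import Callable, List, Optional, TypedDict


class HotelState(TypedDict):
    guest: str
    room: Optional[str]
    room_ready: bool
    log: List[str]


def booking_agent(state: HotelState) -> dict:
    # Hypothetical rule: every guest is assigned room 101.
    return {"room": "101", "log": state["log"] + [f"booked room 101 for {state['guest']}"]}


def housekeeping_agent(state: HotelState) -> dict:
    return {"room_ready": True, "log": state["log"] + [f"room {state['room']} cleaned"]}


def customer_service_agent(state: HotelState) -> dict:
    status = "ready" if state["room_ready"] else "not ready yet"
    message = f"told {state['guest']} that room {state['room']} is {status}"
    return {"log": state["log"] + [message]}


# Fixed order: booking, then housekeeping, then customer service.
PIPELINE: List[Callable[[HotelState], dict]] = [
    booking_agent,
    housekeeping_agent,
    customer_service_agent,
]


def run_workflow(guest: str) -> HotelState:
    state: HotelState = {"guest": guest, "room": None, "room_ready": False, "log": []}
    for agent in PIPELINE:
        # Each agent returns only the fields it changed; merge them into the state.
        state = {**state, **agent(state)}
    return state


final_state = run_workflow("Ada")
for line in final_state["log"]:
    print(line)
```

Having each agent return only its changes keeps the agents independent of one another, since each one depends only on the fields it reads.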
CrewAI SQL Assistant
Multi-agent SQL generation with the CrewAI framework, including security compliance checks and cost monitoring
What's Inside:
- SQL Generation Agent: Automated SQL query generation from natural language
- SQL Validation Agent: Security compliance checking, SQL injection prevention
- Multi-Agent Coordination: CrewAI framework for agent orchestration and collaboration
- Real-time Cost Monitoring: Token usage tracking and cost analysis
- Security Features: Query validation, pattern matching, safeguard implementations
Tech: CrewAI, Azure OpenAI, Python, Multi-agent orchestration
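A pattern-matching safeguard of the kind a validation agent might apply to generated SQL can be sketched as follows. The rules, keyword list, and function name `validate_select_only` are assumptions for illustration, not the repository's actual checks.

```python
import re
from typing import Tuple

# Keywords that indicate a write or DDL operation.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|create|grant|revoke|merge|exec|execute)\b",
    re.IGNORECASE,
)


def validate_select_only(sql: str) -> Tuple[bool, str]:
    """Return (is_allowed, reason) for a generated SQL string."""
    statement = sql.strip().rstrip(";").strip()
    if not statement:
        return False, "empty query"
    if ";" in statement:
        return False, "multiple statements are not allowed"
    if "--" in statement or "/*" in statement:
        return False, "SQL comments are not allowed"
    if not re.match(r"(?i)^(select|with)\b", statement):
        return False, "only SELECT queries are allowed"
    match = FORBIDDEN.search(statement)
    if match:
        return False, f"forbidden keyword: {match.group(1).lower()}"
    return True, "ok"


for query in (
    "SELECT name FROM users WHERE id = 1;",
    "SELECT 1; DROP TABLE users",
    "WITH x AS (SELECT 1) DELETE FROM t",
):
    print(validate_select_only(query))
```

Keyword matching is deliberately coarse: it can reject harmless queries that mention a forbidden word inside a string literal, and it is best treated as one layer alongside a read-only database role rather than a complete defense.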
Cloud & Data Engineering:
- Azure Certified Data Engineer Associate (2024)
- GCP Certified Professional Data Engineer (2023)
- Databricks Certified Data Engineer Associate (2025)
- Databricks Generative AI Engineer Associate (2025)
Foundational & Specialized:
- Databricks Developer Foundations
- Databricks Developer Essentials
- Udacity Nanodegree - Data Streaming (2022)
- AWS Academy Graduate - Cloud Architecting (2021)
- Applied Data Science I: Scientific Computing & Python (with honors)
- Hadoop Ecosystem Masterclass (Udemy)
- Apache Kafka Series (Udemy)