Data Engineer Screening Exercise (60 minutes)

Overview

This is a live coding interview evaluating four core competencies:

  1. System design thinking and communication (Part 1)
  2. Implementation skills - writing clean, functional code (Part 2)
  3. Code review skills - identifying security and data issues (Part 3)
  4. Orchestration design - building production-ready workflows (Part 4)

Setup: You may use the internet, documentation, and AI assistants as you normally would while coding.


PART 1: Architecture Discussion (10 minutes)

⚠️ NOTE TO INTERVIEWEE: Complete this entire discussion section before moving to Part 2 (Implementation).

Scenario

Business stakeholders want data from AviationStack made available through our internal self-service BI tooling so they can track and monitor the performance of our airline partners.

Context: You're joining an existing data team. We already have:

  • Airflow for orchestration
  • Snowflake as our data warehouse
  • dbt for transformations
  • GCS for staging/raw data storage
  • Looker for BI (connected to Snowflake)

Your Task

Walk us through how you would approach this request.

Think through the entire lifecycle: from gathering requirements, to designing the architecture, to implementing and operationalizing the pipeline.

We're interested in understanding:

  • Your thought process and how you break down ambiguous problems
  • What questions you'd ask stakeholders and why
  • Which high-level architecture you'd choose
  • What technical tradeoffs you'd consider
  • How you'd ensure this runs reliably in production

This is an open conversation. We're evaluating your communication, critical thinking, and how you approach real-world data engineering problems.


PART 2: API Implementation (15 minutes)

Implement the API extraction function in src/aviation_stack.py:

  1. Write a function to fetch flight data from the AviationStack Flights API
  2. Extract these fields for end users: flight_date, flight_status, departure airport, arrival airport, airline_name
  3. Return the data in a format suitable for downstream storage (JSON or structured dict/list)
  4. Handle API errors appropriately
  5. Note: You can use a free API key or mock the response - just document your choice (an illustrative sketch follows the success criteria below)

Success criteria:

  • Function is callable and returns structured data
  • Code handles basic error cases
  • Code is readable with appropriate variable names
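
For reference, a minimal sketch of one possible shape for this function. The endpoint and response field names follow the public AviationStack documentation and should be verified; AVIATIONSTACK_API_KEY is a hypothetical environment variable, and the error handling is intentionally basic.

```python
# Illustrative sketch only. Endpoint and field names per the AviationStack
# docs (verify before use); AVIATIONSTACK_API_KEY is a hypothetical env var.
import os

import requests

BASE_URL = "http://api.aviationstack.com/v1/flights"


def fetch_flights(limit: int = 100) -> list[dict]:
    """Fetch flights and keep only the fields needed downstream."""
    params = {"access_key": os.environ["AVIATIONSTACK_API_KEY"], "limit": limit}
    try:
        response = requests.get(BASE_URL, params=params, timeout=30)
        response.raise_for_status()
    except requests.RequestException as exc:
        # Re-raise so the orchestrator (Part 4) can retry or alert.
        raise RuntimeError(f"AviationStack request failed: {exc}") from exc

    records = response.json().get("data", [])
    return [
        {
            "flight_date": r.get("flight_date"),
            "flight_status": r.get("flight_status"),
            "departure_airport": (r.get("departure") or {}).get("airport"),
            "arrival_airport": (r.get("arrival") or {}).get("airport"),
            "airline_name": (r.get("airline") or {}).get("name"),
        }
        for r in records
    ]
```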

PART 3: Code Review (10 minutes)

Review the existing code in src/snowflake.py. This module contains intentional issues that a senior engineer should identify.

Tasks:

  1. Identify problems in the implementation (security, correctness, efficiency, best practices)
  2. Document your findings - add comments directly in the code explaining what's wrong
  3. Fix critical issues - address at least one of them

What we're looking for:

  • Do you understand idempotency and how to implement it?
  • Can you prioritize critical vs. minor issues?

Note: src/gcs.py also has issues, but given the time constraint, focus on snowflake.py.
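
For illustration, one common way to make a load idempotent is to delete and reload the partition being processed inside a single transaction. The table, stage, and column names below are hypothetical; this is a sketch of the pattern, not the expected fix.

```python
# Sketch of an idempotent load: delete-and-reload one partition inside a
# single transaction. Table, stage, and column names are hypothetical.
from snowflake.connector import SnowflakeConnection


def reload_partition(conn: SnowflakeConnection, flight_date: str) -> None:
    """Running this twice for the same date leaves exactly one copy of its rows."""
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        # Remove rows from any previous (possibly partial) run of this date.
        cur.execute(
            "DELETE FROM raw.flights WHERE flight_date = %(d)s", {"d": flight_date}
        )
        # Re-copy that date's files from the external stage; flight_date is
        # assumed to be a validated YYYY-MM-DD string, never raw user input.
        cur.execute(
            f"COPY INTO raw.flights FROM @raw.gcs_stage/flights/{flight_date}/ "
            "FILE_FORMAT = (TYPE = 'JSON')"
        )
        cur.execute("COMMIT")
    finally:
        cur.close()
```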


PART 4: Airflow DAG Implementation (20 minutes)

In dags/flight_data.py, implement a production-ready Airflow DAG that orchestrates the complete pipeline.

Requirements:

  1. Create the DAG with appropriate schedule, retries, and configuration
  2. Implement these tasks:
    • extract_flight_data - Call your aviation_stack.py function
    • upload_to_gcs - Save data to GCS with date-based partitioning
    • load_to_snowflake - Load from GCS to Snowflake (using your idempotent approach)
    • run_dbt - Trigger dbt models
  3. Define task dependencies clearly
  4. Add appropriate error handling - What happens if the API fails? If Snowflake is down?
  5. Include data quality checks - At least one validation step

Success criteria:

  • DAG is syntactically correct (or close enough to be runnable with minor fixes)
  • Task dependencies are logical and correct
  • Shows understanding of Airflow best practices (task isolation, XCom, etc.)
  • Demonstrates production thinking (retries, timeouts, failure handling)

Hints (a possible skeleton follows this list):

  • Use PythonOperator or @task decorator for custom tasks
  • Use BashOperator for dbt or consider DbtCloudRunJobOperator
  • Think about how to pass data between tasks (XCom vs. external storage)
  • Consider using Airflow connections for credentials
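
One possible skeleton, assuming a recent Airflow 2.x with the TaskFlow API. Task bodies, the dbt command, and all names are placeholders rather than a model answer; the point is the structure (dependencies, retries, timeouts, and a quality gate).

```python
# Skeleton only: task bodies, bucket/table names, and the dbt command are
# placeholders. Assumes Airflow 2.x with the TaskFlow API.
from datetime import datetime, timedelta

from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator


@dag(
    dag_id="flight_data",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
        "execution_timeout": timedelta(minutes=30),
    },
)
def flight_data():
    @task
    def extract_flight_data() -> list[dict]:
        # Call the Part 2 function; a failed API call raises and triggers retries.
        ...

    @task
    def upload_to_gcs(records: list[dict], ds: str | None = None) -> str:
        # Write to a date-partitioned path, e.g. flights/dt=<ds>/flights.json,
        # and return that path via XCom (keep the payload itself out of XCom).
        ...

    @task
    def load_to_snowflake(gcs_path: str) -> None:
        # Idempotent load of the staged file (see the Part 3 pattern).
        ...

    @task
    def check_row_count(ds: str | None = None) -> None:
        # Data quality gate: raise if no rows landed for this date.
        ...

    run_dbt = BashOperator(
        task_id="run_dbt",
        bash_command="dbt run --select tag:flights",
    )

    gcs_path = upload_to_gcs(extract_flight_data())
    load_to_snowflake(gcs_path) >> check_row_count() >> run_dbt


flight_data()
```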

PART 5: Wrap-up Discussion (5 minutes)

Be prepared to discuss:

  • What would you add with more time?
  • How would you monitor this pipeline in production?
  • What data quality checks would you implement?
  • How would you handle schema changes in the API?
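
As a concrete starting point for the data quality question, one lightweight check (hypothetical table and column names, run through a Snowflake cursor) could back the check_row_count placeholder in the DAG skeleton above:

```python
# Hypothetical validation: fail the run if no rows landed for the date or if
# any rows are missing flight_status. `cur` is an open Snowflake cursor.
def validate_flights(cur, flight_date: str) -> None:
    cur.execute(
        "SELECT COUNT(*), COUNT_IF(flight_status IS NULL) "
        "FROM raw.flights WHERE flight_date = %(d)s",
        {"d": flight_date},
    )
    row_count, null_statuses = cur.fetchone()
    if row_count == 0:
        raise ValueError(f"No flights loaded for {flight_date}")
    if null_statuses > 0:
        raise ValueError(f"{null_statuses} rows missing flight_status for {flight_date}")
```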

Interview Structure Summary

Part                          Duration   Focus
1. Architecture Discussion    10 min     Requirements, tradeoffs, system design
2. API Implementation         15 min     Writing clean code with error handling
3. Code Review                10 min     Identifying security/correctness issues
4. Airflow DAG                20 min     Production-ready orchestration
5. Wrap-up                    5 min      Production concerns & extensions
Total                         60 min
