---
layout: post
title: GNEC Hackathon Win
date: 2025-12-23
author_name: Silke Nodwell
author_role: Lead at Women Coding Community
image: /assets/images/blog/2025-12-23-gnec-hackathon-win.png
description: Our Women Coding Community Team Takes 3rd Place at the GNEC Fall Hackathon
category: tech-career
---

<div class="text-justify">
<p><em>Our Women Coding Community Team Takes 3rd Place at the GNEC Fall Hackathon</em></p>
<p><em>“Nourish Together” team: <a href="https://www.linkedin.com/in/silke-nodwell-763681172/">Silke Nodwell</a>, <a href="https://www.linkedin.com/in/ainan-ihsan-27792b68/">Ainan Ihsan</a>, <a href="https://www.linkedin.com/in/tammy-l-s-a80b6a2a9/">Tammy Sisodiya</a>, <a href="https://www.linkedin.com/in/nino-godoradze/">Nino Godoradze</a></em></p>
<p>What does it take to win a hackathon? For our team of four from Women Coding Community, the answer is a combination of persistence, experimentation, and the resilience to tackle each new challenge together.</p>
<p>This fall, we competed in the <a href="https://gnec-hackathon-2025-fall.devpost.com/">Global NGO Executive Committee (GNEC) Hackathon</a>, which focused on the <a href="https://sdgs.un.org/goals">UN Sustainable Development Goals</a> of <strong>No Poverty</strong> and <strong>Zero Hunger</strong>. Our team, Nourish Together, placed <a href="https://gnec-hackathon-2025-fall.devpost.com/"><strong>3rd out of more than 150 teams.</strong></a> Over eight weeks of evening meetings, trial-and-error, and bursts of productivity, we built a working prototype we’re really proud of. Here’s how it came together.</p>
<h2>How It All Started</h2>
<p>In August, <a href="https://www.linkedin.com/in/tammy-l-s-a80b6a2a9/">Tammy</a> posted a message in the WCC Slack about the GNEC hackathon, asking whether anyone was interested in joining a team. Within a couple of hours, we had a full team. We met virtually a few days later and quickly realised how international our group was: four people across three countries and time zones, spanning the US, Sweden and the UK.</p>
<h2>Choosing a Direction</h2>
<p>At first, we struggled. “Poverty” and “hunger” are enormous, complex issues, and none of us worked directly in those domains. Coming up with a focused idea felt daunting.</p>
<p>Eventually, we realised that instead of trying to design for communities we didn’t know well, we could design for a group whose needs we <em>did</em> understand: donors. Donors have both the resources and the technology to use an app, and, crucially, we felt equipped to empathise with their decision-making process.</p>
<p>So we framed our problem simply:</p>
<p><strong>How might we help donors confidently identify charities aligned with their values?</strong></p>
<h2>Designing Our Website</h2>
<p>Our initial concept was intentionally broad: a website that could help donors explore both financial and non-financial ways to contribute. As part of this, we imagined an “impact tracker” that visualised how each donated dollar translated into tangible outcomes.</p>
<p>Around this time, I had seen an impressive prototype built with <a href="https://lovable.dev/">Lovable</a> at work and suggested we try it for our UI. Lovable turned out to be ideal. It enabled us to quickly build a polished mock impact tracker that displayed the journey of each donation and showed how many families could be helped through that contribution.</p>
<p>After the impact tracker was in place, <a href="https://www.linkedin.com/in/tammy-l-s-a80b6a2a9/">Tammy</a> added an embedded food bank map. It connected directly to Google Maps so users could instantly find their nearest food bank if they preferred donating food rather than money. </p>
<h2>Adding an LLM-Inspired Recommender</h2>
<p>With the core website established, we began exploring how to make the platform more intelligent. That was when the idea of a charity recommender emerged. At first, we approached it as a conventional machine learning problem. At the same time, I was reading <em>Prompt Engineering with LLMs</em> with the WCC book club, which sparked the question:</p>
<p><strong>Instead of building a traditional machine-learning recommender, what if we built an LLM-powered one?</strong></p>
<p>In true hackathon fashion, we opted for the simplest approach that could work. Rather than running a full large language model, we implemented a recommender using a Sentence Transformer, a smaller and more efficient model designed to convert text into vector embeddings, or numeric representations that capture meaning.</p>
<p>We generated embeddings for each charity’s description and stored them in a FAISS index, which is optimised for fast similarity search across large collections of vectors. This setup mirrors the retrieval step of a Retrieval-Augmented Generation (RAG) system, where relevant items are retrieved by semantic similarity rather than exact keyword matches.</p>
<p>When a user entered a query, it was embedded in the same way and compared against the stored charity embeddings using cosine similarity, a common metric for how close two vectors lie in high-dimensional space. The nearest matches became our recommendations. The result was an intuitive, natural-language experience for our ‘Find Your Perfect Charity’ feature.</p>
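<p>For readers curious what this looks like in practice, here is a minimal sketch of the approach (not our production code; the model name and charity descriptions are illustrative):</p>
<pre><code>from sentence_transformers import SentenceTransformer
import faiss

# Illustrative charity descriptions; ours came from a curated dataset
charities = [
    "Provides emergency food parcels to families in crisis",
    "Funds school meals for children in low-income communities",
    "Supports urban farming projects to reduce local food insecurity",
]

# Encode descriptions into embeddings; normalising them makes
# inner-product search equivalent to cosine similarity
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(charities, normalize_embeddings=True)

# Store the embeddings in a FAISS index for fast similarity search
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Embed the user's query the same way and return the nearest matches
query = model.encode(["help hungry children near me"], normalize_embeddings=True)
scores, ids = index.search(query, 2)
print([charities[i] for i in ids[0]])
</code></pre>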
<h2>The Reality of Online Collaboration</h2>
<p>In the beginning, we tried assigning fixed roles. This approach collapsed almost immediately because volunteer schedules are unpredictable. Some weeks one of us had time to take on multiple tasks, while other weeks we were barely available. We switched to leaving tasks unassigned; anyone could pick up an item from the list as long as they kept the group updated on their progress.</p>
<p>Working fully online added its own challenges. During one evening sync call, we realised that we needed uninterrupted time together if we were going to deliver something cohesive. So we all agreed to take a full day off work to focus on the project. This became a turning point. It aligned our codebase, resolved repository issues and built crucial momentum. </p>
<p>What followed was a very determined weekend sprint.</p>
<h2>Two Days, a Cloud Deployment and Little Sleep</h2>
<p>Integrating the FAISS index into our Lovable website was more complicated than expected. After several failed attempts, we switched to a new plan. We deployed the recommender on <a href="https://railway.com/">Railway</a> as a standalone Cloud API and connected this API to our Lovable front end.</p>
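<p>For a sense of what such a standalone service can look like, here is a minimal FastAPI sketch (our actual API lives in the GitHub repo linked at the end; names and routes here are illustrative):</p>
<pre><code>from fastapi import FastAPI
from pydantic import BaseModel

# Reuses `model`, `index`, and `charities` from the sketch above,
# built once when the service starts
app = FastAPI()

class Query(BaseModel):
    text: str
    top_k: int = 5

@app.post("/recommend")
def recommend(query: Query):
    # Embed the incoming query and look up the nearest charities
    vector = model.encode([query.text], normalize_embeddings=True)
    scores, ids = index.search(vector, query.top_k)
    return {"charities": [charities[i] for i in ids[0]]}
</code></pre>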
<p>This approach worked, but not without challenges. On Saturday night, <a href="https://www.linkedin.com/in/ainan-ihsan-27792b68/">Ainan</a> stayed up late resolving dependency issues. Early on Sunday morning, I picked up where she left off and finally got the API communicating with the UI.</p>
<p>Meanwhile, <a href="https://www.linkedin.com/in/nino-godoradze/">Nino</a>, who lives in the United States and has the most video-editing experience, logged in at 1 pm UK time. She edited a beautifully polished demo video, only to discover that it was thirty seconds too short for the submission requirements. To fix this, we recorded a quick Teams call introducing ourselves and added it to the final cut. It ended up fitting perfectly.</p>
<p>We had eight weeks to complete the project, yet we still submitted with only thirty minutes to spare. Hackathon law, surely!</p>
<h2>Tools That Made a Difference</h2>
<ul>
<li>
<p><a href="https://lovable.dev/">Lovable</a> for creating a clean, engaging UI</p>
</li>
<li>
<p><a href="https://railway.com/">Railway</a> for hosting our FAISS-based recommender as an API</p>
</li>
<li>
<p><a href="https://www.gamma.app">Gamma</a> for transforming plain text into polished slides</p>
</li>
</ul>
<h2>What We Learned</h2>
<p>This project was not only about building an intelligent website for donors. It was also about learning how to collaborate flexibly, make practical technical decisions and sustain momentum even when the path forward was unclear. </p>
<p>If there is one lesson we are taking forward, it is this:</p>
<p><strong>A hackathon team succeeds when it stays adaptable.</strong></p>
<p>Not when roles are perfectly assigned, or when the plan unfolds neatly, but when everyone leans in however and whenever they can.</p>
<p>We are proud of our 3rd-place finish, proud of the product we built and even prouder of how we worked together. Above all, we are grateful for Women Coding Community, which brought this team together and made this experience possible.</p>
<hr />
<p><a href="https://hackathon-donation-project.lovable.app/"><em>Hackathon website</em></a></p>
<p><a href="https://devpost.com/software/1031061"><em>Devpost project</em></a></p>
<p><a href="https://github.com/ainanihsan/Recommender-FAST-API"><em>GitHub (recommender API)</em></a></p>
<p><a href="https://github.com/ainanihsan/hackathon-donation-project"><em>GitHub (donation project)</em></a></p>
</div>
---
layout: post
title: Data Engineering Portfolio Projects
date: 2025-12-25
author_name: Sowmiya Ravikumar
author_role: Data Engineer
image: /assets/images/blog/2025-12-25-data-engineering-portfolio-projects.jpg
description: Data Engineering Portfolio Projects
category: tech-career
---

<div class="text-justify">
<p>Building portfolio projects for Data Engineering can be challenging outside enterprise environments due to limited access to realistic data, missing business context, and cloud costs.</p>
<p>Below are <strong>four practical portfolio projects</strong> that aspiring data engineers can build to showcase real-world skills. Each project focuses on a commonly used data engineering pattern and can be implemented using open-source tools or managed cloud services.</p>
<hr />
<h2>1. Daily Sales Batch ETL</h2>
<p>A Finance team requires an <strong>audit-ready daily sales report</strong> delivered every morning by <strong>8:00 AM</strong>, based on the previous day’s completed orders. This is a classic <strong>batch data engineering</strong> scenario where data must be processed on a <strong>fixed schedule</strong> with strong guarantees around correctness, reproducibility, and scalability.</p>
<h3>Architecture Overview</h3>
<ol>
<li><strong>Extract</strong> daily order data from the raw storage layer </li>
<li><strong>Transform</strong> sales data into clean, analytics-ready models </li>
<li><strong>Load</strong> curated tables for reporting and audits </li>
<li><strong>Schedule</strong> the pipeline to meet a strict daily SLA</li>
</ol>
<h3>Technology Stack</h3>
<table class="bordered-table">
<thead>
<tr>
<th>Environment</th>
<th>Storage</th>
<th>Processing</th>
<th>Scheduling</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Local / Open-Source</strong></td>
<td>MinIO</td>
<td>Spark Docker</td>
<td>Airflow</td>
</tr>
<tr>
<td><strong>AWS</strong></td>
<td>Amazon S3</td>
<td>AWS Glue</td>
<td>Glue Triggers</td>
</tr>
<tr>
<td><strong>GCP</strong></td>
<td>GCS</td>
<td>Cloud Dataflow</td>
<td>Cloud Scheduler</td>
</tr>
</tbody>
</table>
<h3>Key points to consider</h3>
<ul>
<li><strong>Idempotent runs:</strong> Safe daily runs and reruns without duplication (see the sketch after this list) </li>
<li><strong>Incremental loading:</strong> Process only new or updated records using bookmarks </li>
<li><strong>Failure recovery &amp; backfills:</strong> Reprocess data for specific dates as needed </li>
<li><strong>Schema evolution:</strong> Adapt to new columns or data type changes </li>
<li><strong>Partitioned, columnar storage:</strong> Efficient querying and maintenance</li>
</ul>
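<p>A minimal PySpark sketch of the extract-transform-load core, illustrating idempotent, incremental daily runs (paths, column names, and the run date are placeholders; a scheduler such as Airflow would pass the date in):</p>
<pre><code>from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-sales-etl").getOrCreate()

run_date = "2025-12-24"  # hypothetical; injected by the scheduler in practice

# Extract: read only the run date's slice of the raw layer (incremental load)
orders = spark.read.parquet("s3a://raw/orders/").where(F.col("order_date") == run_date)

# Transform: keep completed orders and aggregate into a daily sales model
daily_sales = (
    orders.where(F.col("status") == "COMPLETED")
          .groupBy("order_date", "product_id")
          .agg(F.sum("amount").alias("total_sales"),
               F.count("*").alias("order_count"))
)

# Load: overwrite only this date's partition, so reruns and backfills
# are idempotent and never duplicate data
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
(daily_sales.write
    .mode("overwrite")
    .partitionBy("order_date")
    .parquet("s3a://curated/daily_sales/"))
</code></pre>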
<hr />
<h2>2. Real-Time Order Monitoring</h2>
<p>A Customer Support team needs <strong>immediate visibility into stuck orders</strong> (e.g., payment complete but products not shipped) to intervene before customers churn. This is a classic <strong>real-time operational use case</strong>, where events must be processed <strong>as they occur</strong> with guarantees for correctness and timeliness.</p>
<h3>Architecture Overview</h3>
<ol>
<li><strong>Capture Events:</strong> Track order updates in near real-time from transactional systems using Change Data Capture (CDC) </li>
<li><strong>Process Stream:</strong> Transform, deduplicate, and aggregate events as they arrive </li>
<li><strong>Persist &amp; Query:</strong> Store curated streams or aggregates for dashboards and alerts </li>
<li><strong>Alert / Monitor:</strong> Trigger notifications for stuck orders or SLA violations</li>
</ol>
<h3>Technology Stack</h3>
<table class="bordered-table">
<thead>
<tr>
<th>Environment</th>
<th>Event Capture / CDC</th>
<th>Stream Processing</th>
<th>Storage / Query</th>
<th>Alerting / Monitoring</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Local / Open-Source</strong></td>
<td>Debezium + Kafka</td>
<td>Spark Structured Streaming</td>
<td>MinIO + DuckDB</td>
<td>Python / Spark triggers, Prometheus + Grafana</td>
</tr>
<tr>
<td><strong>AWS</strong></td>
<td>DMS + Kinesis</td>
<td>AWS Glue Streaming</td>
<td>S3 + Athena</td>
<td>CloudWatch / SNS</td>
</tr>
<tr>
<td><strong>GCP</strong></td>
<td>Datastream + Pub/Sub (CDC)</td>
<td>Cloud Dataflow</td>
<td>GCS + BigQuery</td>
<td>Cloud Monitoring + Pub/Sub alerts</td>
</tr>
</tbody>
</table>
<h3>Key points to consider</h3>
<ul>
<li><strong>Late-arriving / out-of-order events:</strong> Use watermarks to handle delayed events and support backfills (see the sketch after this list) </li>
<li><strong>Deduplication:</strong> Exactly-once processing despite duplicate records from the source </li>
<li><strong>Partial/missing events handling:</strong> Robust to incomplete or missing sequences </li>
<li><strong>Windowed aggregations:</strong> Real-time metrics over fixed or sliding time windows</li>
</ul>
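<p>A minimal PySpark Structured Streaming sketch of these ideas (broker, topic, field names, and windows are hypothetical; counting paid orders per window is a simplified proxy, and a fuller version would join against shipment events):</p>
<pre><code>from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stuck-order-monitor").getOrCreate()

# Read raw CDC events from Kafka
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "orders.cdc")
       .load())

# Parse the fields we need out of the JSON payload
value = F.col("value").cast("string")
orders = raw.select(
    F.get_json_object(value, "$.order_id").alias("order_id"),
    F.get_json_object(value, "$.status").alias("status"),
    F.get_json_object(value, "$.event_time").cast("timestamp").alias("event_time"),
)

# The watermark bounds state for late events; dropDuplicates absorbs
# duplicate CDC records replayed by the source
deduped = (orders.withWatermark("event_time", "10 minutes")
                 .dropDuplicates(["order_id", "status", "event_time"]))

# Count paid-but-not-yet-shipped orders over sliding 15-minute windows
stuck = (deduped.where(F.col("status") == "PAYMENT_COMPLETE")
                .groupBy(F.window("event_time", "15 minutes", "5 minutes"))
                .count())

stuck.writeStream.outputMode("update").format("console").start().awaitTermination()
</code></pre>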
<hr />
<h2>3. Campaign Performance Data Analytics</h2>
<p>A Marketing team runs campaigns across <strong>Google Ads, Meta, and Email</strong>. They need a single source of truth to consistently analyse total spend, conversions, and campaign performance across channels. This is a classic <strong>analytics engineering</strong> use case, where raw ingestion data is transformed into curated, <strong>analysis-ready models.</strong></p>
<h3>Architecture Overview</h3>
<ol>
<li><strong>Storage (Bronze):</strong> Capture unprocessed campaign and conversion data. </li>
<li><strong>Transformation (Silver):</strong> Clean, standardize, enrich, and apply business logic. </li>
<li><strong>Data Warehouse (Gold):</strong> Aggregate metrics at campaign and channel level for reporting and product analytics. </li>
<li><strong>Orchestration &amp; Consumption:</strong> Automate daily ETL runs and query from BI tools.</li>
</ol>
<h3>Technology Stack</h3>
<table class="bordered-table">
<thead>
<tr>
<th>Environment</th>
<th>Storage</th>
<th>Processing</th>
<th>Data Warehouse / Product Layer</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Local / Open-Source</strong></td>
<td>DuckDB / MinIO</td>
<td>Spark Docker</td>
<td>DuckDB</td>
</tr>
<tr>
<td><strong>AWS</strong></td>
<td>S3</td>
<td>AWS Glue</td>
<td>Redshift</td>
</tr>
<tr>
<td><strong>GCP</strong></td>
<td>GCS</td>
<td>Cloud Dataflow</td>
<td>BigQuery</td>
</tr>
</tbody>
</table>
<h3>Key points to consider</h3>
<ul>
<li><strong>Schema evolution detection &amp; data contracts</strong>: Prevent broken transformations due to upstream changes </li>
<li><strong>Dimension modelling</strong>: Using Star/Snowflake schemas, SCD Type 2, and surrogate keys for historical tracking </li>
<li><strong>Data integrity &amp; quality checks</strong>: Handle missing, malformed, or inconsistent records, backfill specific dates without duplication </li>
<li><strong>Consistent metric definitions</strong>: Ensure KPIs (spend, conversions, ROI) are reliable (see the sketch after this list)</li>
</ul>
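<p>A minimal bronze-silver-gold pass using DuckDB's Python API (file paths, column names, and the KPI are illustrative):</p>
<pre><code>import duckdb

con = duckdb.connect("marketing.duckdb")  # hypothetical local warehouse file

# Bronze: land raw campaign exports from each channel as-is
con.sql("""
    CREATE OR REPLACE TABLE bronze_spend AS
    SELECT * FROM read_csv_auto('data/raw/*_spend.csv')
""")

# Silver: standardise channel names, enforce types, drop malformed rows
con.sql("""
    CREATE OR REPLACE TABLE silver_spend AS
    SELECT lower(trim(channel))          AS channel,
           CAST(spend AS DECIMAL(12, 2)) AS spend_usd,
           CAST(conversions AS INTEGER)  AS conversions,
           CAST(report_date AS DATE)     AS report_date
    FROM bronze_spend
    WHERE spend IS NOT NULL AND report_date IS NOT NULL
""")

# Gold: one row per channel per day, with a single agreed KPI definition
con.sql("""
    CREATE OR REPLACE TABLE gold_channel_daily AS
    SELECT channel, report_date,
           SUM(spend_usd) AS total_spend,
           SUM(conversions) AS total_conversions,
           SUM(spend_usd) / NULLIF(SUM(conversions), 0) AS cost_per_conversion
    FROM silver_spend
    GROUP BY channel, report_date
""")
</code></pre>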
<hr />
<h2>4. Real-Time IoT Sensor Analytics</h2>
<p>A factory floor needs to monitor <strong>high-frequency IoT sensor data</strong> to detect overheating machines or abnormal energy usage before equipment fails. This is a <strong>stateful streaming use case</strong>, where it is critical to compute averages, trends, and anomalies in real time.</p>
<h3>Architecture Overview</h3>
<ol>
<li><strong>Ingestion:</strong> Capture sensor readings continuously from IoT devices or message streams </li>
<li><strong>Stream Processing:</strong> Maintain state, compute rolling averages, windowed aggregations, and detect anomalies </li>
<li><strong>Storage:</strong> Persist aggregated or processed sensor data for operational and historical use </li>
<li><strong>Monitoring &amp; Alerting:</strong> Visualize metrics and trigger alerts on abnormal conditions</li>
</ol>
<h3>Technology Stack</h3>
<table class="bordered-table">
<thead>
<tr>
<th>Environment</th>
<th>Ingestion</th>
<th>Stream Processing</th>
<th>Storage / Query</th>
<th>Monitoring &amp; Alerting</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Local / Open-Source</strong></td>
<td>Kafka</td>
<td>Apache Flink</td>
<td>InfluxDB</td>
<td>Grafana / Python triggers</td>
</tr>
<tr>
<td><strong>AWS</strong></td>
<td>Kinesis</td>
<td>Amazon Managed Service for Apache Flink</td>
<td>Timestream</td>
<td>CloudWatch / SNS</td>
</tr>
<tr>
<td><strong>GCP</strong></td>
<td>Pub/Sub</td>
<td>Dataproc for Apache Flink</td>
<td>Cloud Bigtable</td>
<td>Cloud Monitoring / Pub/Sub alerts</td>
</tr>
</tbody>
</table>
<h3>Key points to consider</h3>
<ul>
<li><strong>Event-time processing &amp; watermarks:</strong> Handles late or out-of-order readings (see the sketch after this list) </li>
<li><strong>Stateful rolling computations:</strong> Maintains averages, trends, and windowed metrics efficiently </li>
<li><strong>Dynamic anomaly detection:</strong> Configurable thresholds or statistical models per sensor </li>
<li><strong>High-throughput resilience:</strong> Processes large volumes of events without data loss </li>
<li><strong>Reliable alerting:</strong> Minimizes false positives while triggering timely notifications</li>
</ul>
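<p>A minimal PyFlink Table API sketch of a stateful windowed aggregation with an alert condition (it assumes the Flink Kafka connector jar is on the classpath; the topic, fields, and 80-degree threshold are illustrative):</p>
<pre><code>from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Declare the sensor stream with an event-time watermark for late readings
t_env.execute_sql("""
    CREATE TABLE sensor_readings (
        sensor_id   STRING,
        temperature DOUBLE,
        event_time  TIMESTAMP(3),
        WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'sensors',
        'properties.bootstrap.servers' = 'localhost:9092',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# One-minute tumbling averages per sensor, keeping only windows that run hot
t_env.execute_sql("""
    SELECT sensor_id,
           TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
           AVG(temperature) AS avg_temp
    FROM sensor_readings
    GROUP BY sensor_id, TUMBLE(event_time, INTERVAL '1' MINUTE)
    HAVING AVG(temperature) &gt; 80.0
""").print()
</code></pre>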
<hr />
<h2>Useful Tips</h2>
<ul>
<li>Use publicly available data sources like Kaggle, open APIs, or public cloud datasets:
<ul>
<li><a href="https://www.kaggle.com/datasets/yashdevladdha/uber-ride-analytics-dashboard">Uber Data Analytics Dashboard Dataset</a></li>
<li><a href="https://www.kaggle.com/datasets/manjeetsingh/retaildataset">Retail Data Analytics Dataset</a></li>
<li><a href="https://github.com/bytewax/awesome-public-real-time-datasets?tab=readme-ov-file">Free Real-Time APIs</a></li>
</ul>
</li>
<li>Generate synthetic data (using Faker or GenAI) to simulate scale and edge cases, as sketched below:
<ul>
<li><a href="https://fakerapi.it/fake-data-download">Faker API</a></li>
</ul>
</li>
<li>Provision infrastructure using IaC (Terraform, CloudFormation, or YAML configs)</li>
<li>Check code into a GitHub repository and document the architecture, data flow, assumptions and trade-offs in a README</li>
</ul>
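<p>As a quick example of the synthetic-data tip, a short script using the Python Faker library (column names, statuses, and row counts are arbitrary):</p>
<pre><code>import csv
import random
from faker import Faker

fake = Faker()
Faker.seed(42)   # reproducible runs
random.seed(42)

# Generate synthetic orders to exercise the batch and streaming pipelines above
with open("orders.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["order_id", "customer", "status", "amount", "order_date"])
    for _ in range(10_000):
        writer.writerow([
            fake.uuid4(),
            fake.name(),
            random.choice(["PAYMENT_COMPLETE", "SHIPPED", "CANCELLED"]),
            round(random.uniform(5, 500), 2),
            fake.date_between(start_date="-30d", end_date="today").isoformat(),
        ])
</code></pre>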
<h2>Resources</h2>
<p><a href="https://hub.docker.com/_/spark">Apache Spark Docker</a></p>
<p><a href="https://duckdb.org/docs/stable/clients/python/overview">DuckDB Docs Python API</a></p>
<p><a href="https://debezium.io/documentation/reference/3.4/tutorial.html">Debezium Tutorial</a></p>
<p><a href="https://www.baeldung.com/minio">Introduction to MinIO | Baeldung</a></p>
<p><a href="https://peterbaumann.substack.com/p/future-data-systems?utm_source=%2Finbox&amp;utm_medium=reader2">Future Data Systems Article</a></p>
<p><a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/best-practices.html">Best practices for optimizing Apache Iceberg workloads</a></p>
</div>