
Built a serverless data pipeline that ingests product and inspiration imagery metadata from a public API, normalizes inconsistent metadata, and stores analytics-ready datasets in Amazon S3 to support downstream trend analysis, visual similarity, and pricing opportunity detection.


nivasharmaa/MetadataForge


MetaDataForge

MetaDataForge is a serverless data engineering pipeline that ingests visual metadata from third-party APIs, normalizes semi-structured payloads, and persists analytics-ready datasets to Amazon S3 using AWS Lambda.

The project demonstrates a modern, cloud-native pattern for serverless ingestion → transformation → lakehouse storage, inspired by real-world visual commerce and content analytics systems.

Features Implemented

  • Serverless ingestion of visual metadata from a third-party REST API (Unsplash)
  • Schema-driven normalization of semi-structured JSON payloads
  • Defensive handling of missing fields and malformed records
  • Cloud persistence of analytics-ready datasets in Amazon S3
  • Environment-based configuration for secrets and infrastructure settings
  • End-to-end execution using AWS Lambda (no notebooks, no local runtime)
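In practice, the schema-driven normalization and defensive field handling listed above can be sketched roughly as follows. The field names and defaults here are illustrative assumptions, not the repository's actual schema:

```python
# Illustrative sketch of schema-driven normalization with defensive defaults.
# Field names and default values are assumptions, not taken from the repo.
EXPECTED_SCHEMA = {
    "id": None,
    "created_at": None,
    "likes": 0,
    "description": "",
}


def normalize_record(raw: dict) -> dict:
    """Project a semi-structured payload onto the expected flat schema,
    substituting defaults for missing or null fields."""
    row = {}
    for field, default in EXPECTED_SCHEMA.items():
        value = raw.get(field)
        row[field] = default if value is None else value
    # Nested fields are flattened defensively as well: a missing "user"
    # object degrades to an empty photographer name instead of a KeyError.
    row["photographer"] = (raw.get("user") or {}).get("name", "")
    return row
```

Because every lookup goes through `.get()` with an explicit default, malformed or partial records produce a complete row rather than crashing the Lambda invocation.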

Tech Stack Used

Languages & Libraries

  • Python
  • Pandas

Cloud & Infrastructure

  • AWS Lambda
  • Amazon S3
  • AWS IAM

Data Formats

  • CSV (written by the serverless Lambda execution path)
  • Parquet (written by the local development pipeline)

APIs

  • Unsplash REST API

Pipeline Overview

  1. AWS Lambda fetches visual metadata from the Unsplash API
  2. Raw JSON responses are normalized into a flat, analytics-ready schema
  3. Cleaned metadata is written to Amazon S3 for downstream consumption

This mirrors the ingestion layer used by visual commerce, trend analysis, and content discovery platforms.
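A minimal sketch of that three-step flow is shown below. The endpoint, environment-variable names, row schema, and S3 key layout are all assumptions for illustration; the actual handler in this repository may differ:

```python
# Hedged sketch of the ingest -> normalize -> persist flow. Endpoint, env-var
# names, and the S3 key are illustrative assumptions, not taken from the repo.
import csv
import io
import json
import os
import urllib.request


def fetch_photos(access_key: str, per_page: int = 30) -> list:
    """Step 1: pull a page of photo metadata from the Unsplash REST API."""
    req = urllib.request.Request(
        f"https://api.unsplash.com/photos?per_page={per_page}",
        headers={"Authorization": f"Client-ID {access_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def to_csv(rows: list) -> str:
    """Step 2 output: serialize flat, normalized rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def lambda_handler(event, context):
    """Step 3: write the cleaned dataset to S3 for downstream consumption."""
    import boto3  # provided by the AWS Lambda runtime

    raw = fetch_photos(os.environ["UNSPLASH_ACCESS_KEY"])
    # Flatten each payload into a minimal analytics-ready row (schema assumed).
    rows = [{"id": r.get("id"), "likes": r.get("likes", 0)} for r in raw]
    boto3.client("s3").put_object(
        Bucket=os.environ["OUTPUT_BUCKET"],
        Key="unsplash/photos.csv",
        Body=to_csv(rows).encode("utf-8"),
    )
    return {"rows": len(rows)}
```

Keeping the fetch, transform, and write steps in separate functions mirrors the ingestion → transformation → storage separation described above and keeps each step testable outside Lambda.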

Planned Enhancements

The current MVP intentionally focuses on core serverless ingestion, normalization, and cloud persistence. Future enhancements include:

  • Asynchronous or batched ingestion to support higher API throughput
  • Scheduled and event-driven execution using Amazon EventBridge
  • Explicit retry and backoff strategies to handle external API rate limits
  • Partitioned storage layouts and downstream serving layers for analytics and ML workloads
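The retry-and-backoff enhancement could look roughly like this. This is a sketch, not the project's implementation; the status-code handling assumes Unsplash's standard 429 rate-limit response, and the retry parameters are illustrative:

```python
# Sketch of exponential backoff for rate-limited API calls; retry counts and
# delays are illustrative assumptions, not values from the repository.
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a GET request, doubling the wait after each 429/5xx response."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Re-raise immediately on non-retryable errors or the final attempt.
            if err.code not in (429, 500, 502, 503) or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```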

What This Project Demonstrates

  • Practical serverless pipeline design using AWS Lambda
  • Cloud-native ingestion without reliance on local infrastructure
  • Schema-first thinking for semi-structured data
  • Clear separation between ingestion, transformation, and storage layers
  • Secure configuration and permissions management
  • Foundations for scalable lakehouse-style architectures

Summary

MetaDataForge is a production-oriented serverless ingestion pipeline that transforms raw visual metadata into analytics-ready datasets in the cloud. The project emphasizes clarity, correctness, and extensibility, reflecting patterns used in real-world data engineering systems.
