# 🎵 Sparkify: Data Modeling with Apache Cassandra

## 📖 Project Description

This project focuses on building a NoSQL database using Apache Cassandra for a music streaming startup, Sparkify. The goal is to create an optimized data model to analyze the company's new music streaming app data, specifically focusing on user activity and song listening history.

I have developed an ETL pipeline using Python to process a directory of CSV event files into a denormalized dataset and modeled several tables designed specifically to answer high-priority analytical queries.

πŸ—οΈ Repository File Structure

```
Data-Modeling-Cassandra
├── event_data/                 # Directory containing raw CSV event logs partitioned by date
├── images/                     # Screenshots and diagrams used in documentation
├── Project_Notebook.ipynb      # The main Jupyter Notebook containing the ETL and modeling logic
├── event_datafile_new.csv      # The processed, denormalized dataset used for table loading
└── README.md                   # Project documentation and summary
```

## ✨ Technical Highlights

**Denormalization ETL:** Implemented a Python script to iterate through the daily event files and consolidate them into a single `event_datafile_new.csv`, reducing the complexity of data loading.
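The consolidation step can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the column names, the `artist`-based filter, and the function name are assumptions chosen for the example.

```python
import csv
import glob
import os

def consolidate_events(event_dir: str, out_path: str, columns: list) -> int:
    """Merge every daily CSV in event_dir into one denormalized file,
    keeping only the listed columns and skipping rows with no song play.
    Returns the number of data rows written (header excluded)."""
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        # Process the daily files in a stable (sorted) order.
        for path in sorted(glob.glob(os.path.join(event_dir, "*.csv"))):
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    # Rows with an empty artist field carry no song play
                    # (assumed filter, mirroring the usual event-log cleanup).
                    if not row.get("artist"):
                        continue
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Writing one consolidated CSV up front means each Cassandra table can be loaded with a single pass over one file instead of re-walking the date-partitioned directory per table.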

**Query-First Design:** Modeled three distinct tables based strictly on the required SELECT statements, so that each table's partition key and clustering columns serve its query efficiently.

**Primary Key Optimization:** Applied specific partition keys to distribute data across nodes and clustering columns to ensure data is sorted correctly within partitions.

**Data Integrity:** Utilized `IF NOT EXISTS` clauses during table creation and `DROP TABLE` statements to ensure a clean, repeatable ETL process.
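The query-first pattern above can be illustrated with one hypothetical table. The table and column names below are illustrative, not necessarily the notebook's exact schema; the point is that the columns in the `WHERE` clause become the primary key:

```sql
-- Target query (assumed example): fetch artist, song and length
-- for a given session_id and item_in_session.
--   session_id      -> partition key   (distributes rows across nodes)
--   item_in_session -> clustering column (sorts rows within the partition)
DROP TABLE IF EXISTS session_songs;

CREATE TABLE IF NOT EXISTS session_songs (
    session_id      int,
    item_in_session int,
    artist          text,
    song            text,
    length          float,
    PRIMARY KEY (session_id, item_in_session)
);

SELECT artist, song, length
FROM session_songs
WHERE session_id = 338 AND item_in_session = 4;
```

Because Cassandra cannot efficiently filter on non-key columns, each of the three analytical queries gets its own table whose primary key matches that query's `WHERE` clause, trading storage duplication for fast, single-partition reads.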

## 🚀 How to Run

**Prerequisites:** Ensure you have a local instance of Apache Cassandra, or a containerized version, running.

**Environment:** Install the `cassandra-driver` package using pip:

```bash
pip install cassandra-driver
```

**Execution:** Open `Project_Notebook.ipynb` in your Jupyter environment.

Step 1: Run the ETL section to process the raw event_data files into the denormalized CSV.

Step 2: Execute the CQL statements to create the keyspace and tables.

Step 3: Run the provided test queries to verify that the data has been loaded and modeled correctly.
