# 🎵 Sparkify: Data Modeling with Apache Cassandra

## 📖 Project Description

This project focuses on building a NoSQL database using Apache Cassandra for a music streaming startup, Sparkify. The goal is to create an optimized data model to analyze the company's new music streaming app data, specifically focusing on user activity and song listening history.

I have developed an ETL pipeline using Python to process a directory of CSV event files into a denormalized dataset and modeled several tables designed specifically to answer high-priority analytical queries.

πŸ—οΈ Repository File Structure

```
Data-Modeling-Cassandra
├── event_data/                 # Directory containing raw CSV event logs partitioned by date
├── images/                     # Screenshots and diagrams used in documentation
├── Project_Notebook.ipynb      # The main Jupyter Notebook containing the ETL and modeling logic
├── event_datafile_new.csv      # The processed, denormalized dataset used for table loading
└── README.md                   # Project documentation and summary
```

## ✨ Technical Highlights

**Denormalization ETL:** Implemented a Python script to iterate through the daily event files and consolidate them into a single `event_datafile_new.csv`, reducing the complexity of data loading.
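The consolidation step can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the column names, the `artist`-based filter, and the function name are assumptions chosen for the example.

```python
import csv
import glob
import os

def consolidate_events(event_dir: str, out_path: str, columns: list) -> int:
    """Merge every daily CSV in event_dir into one denormalized file,
    keeping only the listed columns and skipping rows with no song play.
    Returns the number of data rows written (header excluded)."""
    rows_written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        # Process the daily files in a stable (sorted) order.
        for path in sorted(glob.glob(os.path.join(event_dir, "*.csv"))):
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    # Rows with an empty artist field carry no song play
                    # (assumed filter, mirroring the usual event-log cleanup).
                    if not row.get("artist"):
                        continue
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Writing one consolidated CSV up front means each Cassandra table can be loaded with a single pass over one file instead of re-walking the date-partitioned directory per table.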

**Query-First Design:** Modeled three distinct tables based strictly on the required SELECT statements, so that each table's partition key and clustering columns serve its query efficiently.

**Primary Key Optimization:** Applied specific partition keys to distribute data across nodes and clustering columns to ensure data is sorted correctly within partitions.

**Data Integrity:** Utilized `IF NOT EXISTS` clauses during table creation and `DROP TABLE` statements to ensure a clean, repeatable ETL process.
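The query-first pattern above can be illustrated with one hypothetical table. The table and column names below are illustrative, not necessarily the notebook's exact schema; the point is that the columns in the `WHERE` clause become the primary key:

```sql
-- Target query (assumed example): fetch artist, song and length
-- for a given session_id and item_in_session.
--   session_id      -> partition key   (distributes rows across nodes)
--   item_in_session -> clustering column (sorts rows within the partition)
DROP TABLE IF EXISTS session_songs;

CREATE TABLE IF NOT EXISTS session_songs (
    session_id      int,
    item_in_session int,
    artist          text,
    song            text,
    length          float,
    PRIMARY KEY (session_id, item_in_session)
);

SELECT artist, song, length
FROM session_songs
WHERE session_id = 338 AND item_in_session = 4;
```

Because Cassandra cannot efficiently filter on non-key columns, each of the three analytical queries gets its own table whose primary key matches that query's `WHERE` clause, trading storage duplication for fast, single-partition reads.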

## 🚀 How to Run

**Prerequisites:** Ensure you have a local instance of Apache Cassandra, or a containerized version, running.

**Environment:** Install the `cassandra-driver` package using pip:

```bash
pip install cassandra-driver
```

**Execution:** Open `Project_Notebook.ipynb` in your Jupyter environment.

Step 1: Run the ETL section to process the raw event_data files into the denormalized CSV.

Step 2: Execute the CQL statements to create the keyspace and tables.

Step 3: Run the provided test queries to verify that the data has been loaded and modeled correctly.
