This project builds a NoSQL database in Apache Cassandra for Sparkify, a music streaming startup. The goal is an optimized data model for analyzing data from the company's new streaming app, specifically user activity and song listening history.
I developed a Python ETL pipeline that processes a directory of CSV event files into a single denormalized dataset, and modeled three tables designed specifically to answer high-priority analytical queries.
Data-Modeling-Cassandra
├── event_data/ # Directory containing raw CSV event logs partitioned by date
├── images/ # Screenshots and diagrams used in documentation
├── Project_Notebook.ipynb # The main Jupyter Notebook containing the ETL and Modeling logic
├── event_datafile_new.csv # The processed, denormalized dataset used for table loading
└── README.md # Project documentation and summary

✨ Technical Highlights
Denormalization ETL: Implemented a Python script to iterate through daily event files and consolidate them into a single event_datafile_new.csv, reducing the complexity of data loading.
Query-First Design: Modeled three distinct tables, each derived strictly from one of the required SELECT statements, so that the partition key and clustering columns of every table match the filters and ordering its query needs.
Primary Key Optimization: Applied specific Partition Keys to distribute data across nodes and Clustering Columns to ensure data is sorted correctly within partitions.
Data Integrity: Used IF NOT EXISTS clauses during table creation and DROP TABLE statements during cleanup so the ETL process can be re-run from a clean state.
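
To illustrate the query-first approach, the sketch below builds the CQL for one hypothetical query: "which songs were played in a given session, in play order". The table name `songs_by_session`, its columns, and the helper `recreate_table` are assumptions made for this example and are not taken from the notebook. `session` is expected to be a cassandra-driver `Session` already bound to a keyspace.

```python
# Hypothetical table for one query; names and columns are illustrative assumptions.
TABLE = "songs_by_session"

DROP_TABLE = f"DROP TABLE IF EXISTS {TABLE}"

# session_id is the partition key: all rows for one session share a partition.
# item_in_session is the clustering column: rows are stored sorted by it.
CREATE_TABLE = f"""
CREATE TABLE IF NOT EXISTS {TABLE} (
    session_id int,
    item_in_session int,
    artist text,
    song text,
    length float,
    PRIMARY KEY ((session_id), item_in_session)
)
"""

# The SELECT the table is designed for: it filters on the partition key only.
SELECT_BY_SESSION = (
    f"SELECT artist, song, length FROM {TABLE} WHERE session_id = %s"
)


def recreate_table(session):
    """Drop and recreate the table so the notebook can be re-run from scratch.

    `session` is assumed to be a cassandra.cluster.Session; any object with
    an execute() method works.
    """
    session.execute(DROP_TABLE)
    session.execute(CREATE_TABLE)
```

With a live session, `session.execute(SELECT_BY_SESSION, (338,))` would then read a single partition; the value 338 is only a placeholder.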
How to Run
Prerequisites: Ensure you have a local instance of Apache Cassandra or a containerized version running.
Environment: Install the cassandra-driver using pip:
    pip install cassandra-driver
Execution: Open Project_Notebook.ipynb in your Jupyter environment.
Step 1: Run the ETL section to process the raw event_data files into the denormalized CSV.
Step 2: Execute the CQL statements to create the keyspace and tables.
Step 3: Run the provided test queries to verify that the data has been loaded and modeled correctly.
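
Step 1 can be sketched roughly as follows. This is a minimal sketch, not the notebook's code: the function name `consolidate_events`, the assumption that every daily file starts with the same header row, and the rule of skipping blank rows are all illustrative choices.

```python
import csv
import glob
import os


def consolidate_events(src_dir, out_path):
    """Merge every CSV in src_dir into one file; return the number of data rows."""
    out_abs = os.path.abspath(out_path)
    # Collect inputs first and never read the output file as an input.
    paths = [
        p for p in sorted(glob.glob(os.path.join(src_dir, "*.csv")))
        if os.path.abspath(p) != out_abs
    ]
    rows_written = 0
    header_written = False
    with open(out_path, "w", newline="", encoding="utf8") as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline="", encoding="utf8") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # empty file, nothing to copy
                if not header_written:
                    # Assumption: all files share this header, so keep it once.
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    if not any(field.strip() for field in row):
                        continue  # skip rows with no data at all
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

For the layout shown above, the call would be `consolidate_events("event_data", "event_datafile_new.csv")`.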