This project demonstrates real-time data streaming, processing, and visualization using Apache Kafka, PySpark, Cassandra, and an interactive Python Dash dashboard. The dashboard was initially hosted on an EC2 instance to provide real-time updates; for cost-effectiveness and simplicity of presentation, a static version was later hosted online.
The project simulates a real-time system to generate logs, process them, and visualize the insights through a dashboard. It includes:
- Data Ingestion: Logs are generated at 2 logs/second and sent to Kafka (see the producer sketch after this list).
- Data Processing: PySpark Structured Streaming processes logs from Kafka every 10 seconds.
- Data Storage: Transformed data is stored in Amazon Keyspaces (Cassandra) for real-time querying and in Amazon S3 as a data lake for future batch queries.
- Data Visualization: A Python Dash dashboard visualizes the stored data in real time.
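The ingestion step boils down to a small Kafka producer. Below is a minimal sketch (the real script is `kafka/log_generator.py`), assuming the `kafka-python` client, a broker on `localhost:9092`, and an illustrative log schema; those details are assumptions, not the project's exact code.

```python
import json
import random
import time
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

# Broker address and log fields are assumptions; adjust to your setup.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    log = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": random.choice(["INFO", "WARN", "ERROR"]),
        "endpoint": random.choice(["/home", "/login", "/checkout"]),
        "response_time_ms": random.randint(10, 500),
    }
    producer.send("logs", value=log)  # topic created during setup
    time.sleep(0.5)                   # 2 logs/second
```

The stack behind these stages: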
- Apache Kafka: For real-time log ingestion.
- PySpark Structured Streaming: For scalable stream processing.
- Amazon Keyspaces (Cassandra): For low-latency, distributed data storage (see the query sketch after this list).
- Amazon S3: For storing raw data for batch processing.
- Dash: For building interactive dashboards.
- Plotly: For creating visually engaging graphs.
- Amazon EC2: For hosting Kafka, Spark, and the Dash web application.
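Keyspaces is what the dashboard reads from. As an illustration (not the project's exact code), here is a hedged sketch of a low-latency read using the DataStax `cassandra-driver`, following the AWS-documented TLS connection pattern for Keyspaces; the keyspace, table, and column names are assumptions carried over from the other sketches in this README.

```python
from ssl import CERT_REQUIRED, PROTOCOL_TLSv1_2, SSLContext

from cassandra.auth import PlainTextAuthProvider
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Keyspaces requires TLS on port 9142 and service-specific credentials;
# the region, keyspace, and table names below are assumptions.
ssl_context = SSLContext(PROTOCOL_TLSv1_2)
ssl_context.load_verify_locations("sf-class2-root.crt")  # Starfield root CA from AWS
ssl_context.verify_mode = CERT_REQUIRED

auth = PlainTextAuthProvider(
    username="your-service-specific-username",
    password="your-service-specific-password",
)
cluster = Cluster(
    ["cassandra.us-east-1.amazonaws.com"],
    port=9142,
    ssl_context=ssl_context,
    auth_provider=auth,
)
session = cluster.connect()

# Pull the latest aggregates for the dashboard; the schema is hypothetical.
rows = session.execute(
    "SELECT window_start, endpoint, avg_response_time_ms "
    "FROM logs_ks.processed_logs LIMIT 100"
)
for row in rows:
    print(row.window_start, row.endpoint, row.avg_response_time_ms)
```

Project layout: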
```
├── kafka/
│   └── log_generator.py          # Script to generate logs and send to Kafka
│
├── spark/
│   ├── process_logs.py           # Spark Structured Streaming script to process logs
│   ├── application.conf          # Configuration file for the Keyspaces-Spark connector
│   └── config.py                 # Configuration file for Spark
│
├── visualization/
│   ├── app.py                    # Dash web application for visualization
│   └── assets/                   # Static CSS, images, etc.
│
├── architecture-diagram.png      # Architecture diagram for the project
│
└── README.md                     # Documentation for the project
```
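Conceptually, `spark/process_logs.py` does something like the hedged sketch below: read the `logs` topic from Kafka, parse each record with the illustrative schema from the producer sketch, aggregate over 10-second windows, and fan each micro-batch out to Keyspaces and S3. The keyspace, table, bucket, and column names are assumptions, and the Kafka source and Spark Cassandra Connector packages must be supplied via `--packages` at `spark-submit` time.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("process_logs").getOrCreate()

# Schema mirrors the generator sketch above; adjust to the real log format.
schema = StructType([
    StructField("timestamp", StringType()),
    StructField("level", StringType()),
    StructField("endpoint", StringType()),
    StructField("response_time_ms", IntegerType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # Kafka instance address (assumed)
    .option("subscribe", "logs")
    .load()
)

logs = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), schema).alias("log"))
    .select("log.*")
)

# Average response time per endpoint over 10-second windows.
agg = (
    logs.withColumn("ts", col("timestamp").cast("timestamp"))
    .groupBy(window(col("ts"), "10 seconds"), col("endpoint"))
    .agg(avg("response_time_ms").alias("avg_response_time_ms"))
    .select(col("window.start").alias("window_start"), "endpoint", "avg_response_time_ms")
)

def write_batch(batch_df, batch_id):
    # Keyspaces/Cassandra sink via the Spark Cassandra Connector (names assumed).
    (batch_df.write.format("org.apache.spark.sql.cassandra")
        .options(keyspace="logs_ks", table="processed_logs")
        .mode("append")
        .save())
    # Copy of each micro-batch's aggregates to the S3 data lake (bucket name assumed).
    batch_df.write.mode("append").parquet("s3a://your-log-data-lake/processed-logs/")

query = (
    agg.writeStream
    .foreachBatch(write_batch)
    .outputMode("update")
    .trigger(processingTime="10 seconds")
    .start()
)
query.awaitTermination()
```

To set everything up on AWS: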
- Launch 3 EC2 instances (t2.micro or larger, depending on data volume):
  - Kafka and Zookeeper
    - Create a topic named `logs` in Kafka (a topic-creation sketch follows this list).
    - Update the configuration file.
  - Spark cluster for both real-time and batch processing
    - Update the configuration file.
  - Dash web application
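Topic creation is normally done with the Kafka CLI scripts that ship with the broker; as a sketch in the project's own language, the `logs` topic can also be created with `kafka-python`'s admin client. The broker address, partition count, and replication factor below are assumptions.

```python
from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

# Single-broker setup assumed; tune partitions/replication for your cluster.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics([
    NewTopic(name="logs", num_partitions=3, replication_factor=1)
])
admin.close()
```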
- Key Setup Steps:
  - Install and configure Kafka and Zookeeper on the first instance.
  - Install Spark on the second instance and configure it for both Structured Streaming and batch jobs.
  - Install Dash and Plotly on the third instance for visualizing the processed data.
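Below is a stripped-down sketch of the dashboard (the real implementation is `visualization/app.py`), assuming Dash with a Plotly line chart and a hypothetical `fetch_latest_metrics()` helper standing in for the Keyspaces query from the earlier driver sketch.

```python
import pandas as pd
import plotly.express as px
from dash import Dash, Input, Output, dcc, html

def fetch_latest_metrics() -> pd.DataFrame:
    # Hypothetical helper: the real app would query the processed aggregates
    # from Keyspaces (see the driver sketch earlier) and return them here.
    return pd.DataFrame(columns=["window_start", "avg_response_time_ms", "endpoint"])

app = Dash(__name__)
app.layout = html.Div([
    html.H2("Real-Time Log Metrics"),
    dcc.Graph(id="response-time-graph"),
    # Refresh every 10 seconds to match the Spark micro-batch interval.
    dcc.Interval(id="refresh", interval=10_000, n_intervals=0),
])

@app.callback(Output("response-time-graph", "figure"), Input("refresh", "n_intervals"))
def update_graph(_):
    df = fetch_latest_metrics()
    return px.line(df, x="window_start", y="avg_response_time_ms", color="endpoint")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8050)
```

Polling with `dcc.Interval` every 10 seconds keeps the charts roughly in step with Spark's micro-batch cadence.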
To make this project easier to deploy, replicate, and scale, the following enhancements are planned:
- Cloud Infrastructure Automation with Terraform: Automate the provisioning of EC2 instances, Amazon Keyspaces, and other AWS resources.
- Containerization with Docker: Use Docker containers to encapsulate various components like Kafka, Spark, and the Dash app.
- Streamlined Deployment: Combine Terraform and Docker for end-to-end automation.
