Before starting, ensure you have downloaded and extracted the Apache Spark (spark-3.5.1-bin-hadoop3) and Apache Kafka archives:
- Set up the Spark environment variables:

  ```bash
  export SPARK_HOME=~/spark-3.5.1-bin-hadoop3
  export PATH=$SPARK_HOME/bin:$PATH
  ```
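  To confirm the variables are picked up, check that the Spark binaries resolve on the `PATH`:

  ```bash
  spark-submit --version
  ```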
- Install MinIO and the MinIO Client (mc):

  ```bash
  brew install minio/stable/minio
  brew install minio/stable/mc
  ```
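  A quick check that both binaries were installed:

  ```bash
  minio --version
  mc --version
  ```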
- Create a data directory for MinIO:

  ```bash
  mkdir -p ~/minio/data
  ```

- Start the MinIO server with access credentials (the root user acts as the S3 access key and the root password as the secret key):

  ```bash
  export MINIO_ROOT_USER=hadoopUser123
  export MINIO_ROOT_PASSWORD='StrongPass!2024'
  minio server ~/minio/data
  ```
- Configure the MinIO Client to point at your server (arguments are alias, URL, access key, secret key):

  ```bash
  mc alias set myminio http://127.0.0.1:9000 hadoopUser123 'StrongPass!2024'
  ```
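  If the alias and credentials are correct, the server responds to:

  ```bash
  mc admin info myminio
  ```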
- Create a bucket to store your data:

  ```bash
  mc mb myminio/drone-data-lake
  ```
- Create a second bucket for storing analysis results:

  ```bash
  mc mb myminio/storageanalyse
  ```
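  Both buckets should now show up when listing the alias:

  ```bash
  mc ls myminio
  ```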
- Configure Spark to use MinIO:

  Edit `$SPARK_HOME/conf/spark-defaults.conf` and add the following properties (note that the access key matches `MINIO_ROOT_USER` and the secret key matches `MINIO_ROOT_PASSWORD`):

  ```properties
  spark.hadoop.fs.s3a.endpoint            http://127.0.0.1:9000
  spark.hadoop.fs.s3a.access.key          hadoopUser123
  spark.hadoop.fs.s3a.secret.key          StrongPass!2024
  spark.hadoop.fs.s3a.path.style.access   true
  spark.hadoop.fs.s3a.impl                org.apache.hadoop.fs.s3a.S3AFileSystem
  ```
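  If your Spark install only ships the template, copy it first and then edit the copy:

  ```bash
  cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf
  ```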
- Start ZooKeeper (in the first terminal, from the Kafka directory):

  ```bash
  bin/zookeeper-server-start.sh config/zookeeper.properties
  ```
- Start the first Kafka broker (in the second terminal):

  ```bash
  bin/kafka-server-start.sh config/server.properties
  ```
- Start the second Kafka broker (in the third terminal):

  ```bash
  bin/kafka-server-start.sh config/server-1.properties
  ```

  Example configuration for `server-1.properties` (copy `config/server.properties` and change these values so the two brokers do not collide):

  ```properties
  broker.id=1
  log.dirs=/tmp/kafka-logs1
  listeners=PLAINTEXT://:9093
  ```
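  Before creating topics, you can check that both brokers are reachable, one listener per broker:

  ```bash
  bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9092
  bin/kafka-broker-api-versions.sh --bootstrap-server localhost:9093
  ```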
- Create the Kafka topics:

  ```bash
  bin/kafka-topics.sh --create --topic drone-data --bootstrap-server localhost:9092
  bin/kafka-topics.sh --create --topic high-danger-alerts --bootstrap-server localhost:9092
  ```
- Increase each topic to 3 partitions:

  ```bash
  bin/kafka-topics.sh --alter --topic drone-data --partitions 3 --bootstrap-server localhost:9092
  bin/kafka-topics.sh --alter --topic high-danger-alerts --partitions 3 --bootstrap-server localhost:9092
  ```
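  You can verify the partition count at any time:

  ```bash
  bin/kafka-topics.sh --describe --topic drone-data --bootstrap-server localhost:9092
  bin/kafka-topics.sh --describe --topic high-danger-alerts --bootstrap-server localhost:9092
  ```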
- Assign partitions to brokers with replicas to prevent data loss in case of a crash:

  Create a file named `topics.json` with the following content:

  ```json
  {
    "version": 1,
    "partitions": [
      {"topic": "drone-data", "partition": 0, "replicas": [0, 1]},
      {"topic": "drone-data", "partition": 1, "replicas": [0, 1]},
      {"topic": "drone-data", "partition": 2, "replicas": [1, 0]},
      {"topic": "high-danger-alerts", "partition": 0, "replicas": [0, 1]},
      {"topic": "high-danger-alerts", "partition": 1, "replicas": [0, 1]},
      {"topic": "high-danger-alerts", "partition": 2, "replicas": [1, 0]}
    ]
  }
  ```

  Then execute the reassignment command:

  ```bash
  bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file topics.json --execute
  ```
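  The same tool can confirm that the reassignment completed:

  ```bash
  bin/kafka-reassign-partitions.sh --bootstrap-server localhost:9092 --reassignment-json-file topics.json --verify
  ```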
- Generate data (from the producer project):

  ```bash
  sbt clean
  sbt compile
  sbt run
  ```
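  To confirm the producer is publishing, peek at the topic from the Kafka directory in another terminal:

  ```bash
  bin/kafka-console-consumer.sh --topic drone-data --bootstrap-server localhost:9092 --from-beginning --max-messages 5
  ```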
- Make sure the MinIO server is still running with the access credentials from the setup above (restart it if needed); the `drone-data-lake` and `storageanalyse` buckets created earlier should already exist:

  ```bash
  export MINIO_ROOT_USER=hadoopUser123
  export MINIO_ROOT_PASSWORD='StrongPass!2024'
  minio server ~/minio/data
  ```
- Generate the JAR files for the consumers:

  ```bash
  cd consumerSparkAlert
  sbt package
  cd ../consumerSparkDatalake
  sbt package
  cd ../analysisDatalake
  sbt package
  ```
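  Assuming sbt's default layout, each `sbt package` writes its JAR under `target/scala-2.12/` inside the project; from the project root you can confirm the artifacts with:

  ```bash
  ls consumerSparkAlert/target/scala-2.12/
  ls consumerSparkDatalake/target/scala-2.12/
  ls analysisDatalake/target/scala-2.12/
  ```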
- Run the consumers using Spark:

  - Consumer for the datalake:

    ```bash
    spark-submit --class ConsumerDatalake --master "local[*]" \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.kafka:kafka-clients:3.7.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk:1.12.452 \
      consumerdatalake_2.12-0.1.jar
    ```

  - Consumer for the alerts:

    ```bash
    spark-submit --class ConsumerAlert --master "local[*]" \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.kafka:kafka-clients:3.7.0 \
      consumeralert_2.12-0.1.jar
    ```

  - Consumer for the alert process:

    ```bash
    spark-submit --class ConsumerAlertProcess --master "local[*]" \
      --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.kafka:kafka-clients:3.7.0,org.scalaj:scalaj-http_2.12:2.4.2,com.typesafe.play:play-json_2.12:2.9.2 \
      consumeralert_2.12-0.1.jar
    ```
- Run the data analysis consumer:

  ```bash
  spark-submit --class DataAnalysis --master "local[*]" \
    --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.kafka:kafka-clients:3.7.0,org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk:1.12.452 \
    dronedataanalysis_2.12-0.1.jar
  ```
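  Once the analysis job has written output, the results should appear in the second bucket:

  ```bash
  mc ls --recursive myminio/storageanalyse
  ```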
To launch the project, ensure all Spark environment variables are set, both Kafka brokers are running, and MinIO is configured correctly. The steps above walk through setting up Kafka topics, generating data, running MinIO, and executing the Spark consumers and the data analysis job. Adjust paths and configurations to match your system.