
BnnaFish/big-data-in-docker


Description

A minimal example of running Big Data tools locally in Docker.

Architecture

MinIO as S3-compatible object storage.

Hive Metastore to store table metadata.

MariaDB as the Hive Metastore backend.

A Spark worker to run ETL scripts.

Trino as the query engine, so the data can be queried from pandas or an IDE.

Goal

The goal is to move a plain file from S3 into a partitioned Data Lake table.

The well-known Iris dataset is good enough for this purpose.

HowTo

First, fetch the sample data:

wget -P ./data https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
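
The file is a headerless CSV: four measurements followed by a class label, one row per flower, e.g.:

5.1,3.5,1.4,0.2,Iris-setosa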

Then bring up MinIO. The data file is uploaded to the bucket by the compose entrypoint command.

make minio_up
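
To check that the upload succeeded, you can list the bucket with any S3 client. Here is a sketch using boto3; the endpoint, bucket name, and the default MinIO credentials are assumptions and may differ from this repo's compose setup:

import boto3

# Connect to the local MinIO; endpoint, credentials, and bucket name are
# assumed values for illustration.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# "Contents" is absent when the bucket is empty, hence the .get() fallback.
for obj in s3.list_objects_v2(Bucket="data").get("Contents", []):
    print(obj["Key"], obj["Size"])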

Then bring up the Hive Metastore:

make hive_up

Then run the Spark script, which creates a new Iceberg table and loads the data from the file into it:

make fill_tables
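
Under the hood, the job reads the CSV from object storage and writes it out as a partitioned Iceberg table. Below is a minimal sketch of such a job; the catalog name, bucket path, column names, and table name are illustrative assumptions, not necessarily this repo's exact configuration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Assumed Iceberg catalog named "iceberg", backed by the Hive Metastore.
spark = (
    SparkSession.builder.appName("fill_tables")
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.iceberg.type", "hive")
    .getOrCreate()
)

# iris.data is headerless, so name the columns explicitly.
df = (
    spark.read.csv("s3a://data/iris.data", inferSchema=True)
    .toDF("sepal_length", "sepal_width", "petal_length", "petal_width", "species")
)

# Create (or replace) the Iceberg table, partitioned by species.
df.writeTo("iceberg.default.iris").partitionedBy(col("species")).createOrReplace()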

Then bring up Trino so it can be queried from pandas:

make trino_up

Install the Python requirements:

poetry install
poetry shell

Then run the pandas script:

python pandas_app/fetch.py
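
For reference, here is a minimal sketch of what such a fetch script can look like, assuming the trino Python client; the host, port, user, catalog, and schema are assumptions, so adjust them to match the compose setup:

import pandas as pd
import trino

# Connect to Trino over its DBAPI interface; connection parameters are
# assumed values for illustration.
conn = trino.dbapi.connect(
    host="localhost",
    port=8080,
    user="pandas",
    catalog="iceberg",
    schema="default",
)

# Query the Iceberg table created by the Spark step and load the result
# into a pandas DataFrame.
df = pd.read_sql("SELECT species, count(*) AS n FROM iris GROUP BY species", conn)
print(df)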
