A minimal example of running Big Data tools in local Docker:

- Hive Metastore to store metadata.
- MariaDB as the Hive Metastore backend.
- A Spark worker to run scripts.
- Trino as a query engine to run queries from pandas or an IDE.
The goal is to move a plain file from S3 into a partitioned Data Lake table. The well-known iris dataset is good enough for this purpose.
First, fetch the sample data:

```sh
wget -P ./data https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
```

Then bring up MinIO. The file will be uploaded to the bucket via the compose entrypoint command:

```sh
make minio_up
```
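If you prefer to push the file into MinIO manually instead of relying on the compose entrypoint, a minimal sketch with boto3 could look like this. The endpoint, credentials, and bucket/key names below are assumptions, not values from this repo; adjust them to your compose configuration.

```python
# Manual upload sketch: push the iris file into a MinIO bucket over the S3 API.
# Endpoint, credentials, bucket, and key below are assumptions for illustration.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # assumed MinIO API port
    aws_access_key_id="minioadmin",          # assumed default MinIO credentials
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="datalake")          # assumed bucket name
s3.upload_file("./data/iris.data", "datalake", "raw/iris.data")
```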
Then bring up the Hive Metastore:

```sh
make hive_up
```

Then run the Spark script that creates a new Iceberg table and moves the data from the file into it:
```sh
make fill_tables
```
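As a rough sketch, such a Spark job could look like the snippet below. The catalog, schema, table name, and S3 path are assumptions, not necessarily the ones used by the repo's script, and the Iceberg catalog settings are assumed to come from the Spark/compose configuration.

```python
# Sketch of a Spark job that reads the raw iris file from S3 and writes it
# into a partitioned Iceberg table. Names and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Iceberg catalog configuration is assumed to be provided by the cluster / compose setup.
spark = SparkSession.builder.appName("fill_tables").getOrCreate()

# iris.data is a headerless CSV: four measurements plus the species label.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = spark.read.csv("s3a://datalake/raw/iris.data", inferSchema=True).toDF(*columns)

# Create (or replace) an Iceberg table partitioned by species and load the data into it.
df.writeTo("iceberg.db.iris").partitionedBy(col("species")).createOrReplace()
```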
Then bring up Trino so it can be queried from pandas:

```sh
make trino_up
```

Install the Python requirements:
```sh
poetry install
poetry shell
```

Then run the pandas script:
```sh
python pandas_app/fetch.py
```
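For reference, a minimal sketch of such a script using the trino client and pandas is shown below. The host, port, user, catalog, schema, and table name are assumptions; adjust them to match your Trino setup and the table created by the Spark job.

```python
# Sketch of querying the Iceberg table through Trino into a pandas DataFrame.
# Host, port, user, catalog, schema, and table name are assumptions for illustration.
import pandas as pd
from trino.dbapi import connect

conn = connect(
    host="localhost",
    port=8080,          # assumed Trino HTTP port
    user="admin",
    catalog="iceberg",
    schema="db",
)

# Pull a few rows from the table and load them into pandas.
df = pd.read_sql("SELECT * FROM iris LIMIT 10", conn)
print(df.head())
```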