A minimal example of running Big Data tools in local Docker:

- Hive Metastore to store metadata.
- MariaDB as the Hive Metastore backend.
- A Spark worker to run scripts.
- Trino as a query engine to run queries from pandas or an IDE.
The goal is to move a plain file from S3 into a partitioned Data Lake table. The well-known iris dataset is good enough for this purpose.
First, fetch the sample data:

```sh
wget -P ./data https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
```

Then bring up MinIO. The file will be uploaded to the bucket via the compose entrypoint command:

```sh
make minio_up
```
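If you prefer to push the file into MinIO manually instead of relying on the compose entrypoint, a minimal sketch with boto3 could look like this. The endpoint, credentials, and bucket/key names below are assumptions, not values from this repo; adjust them to your compose configuration.

```python
# Manual upload sketch: push the iris file into a MinIO bucket over the S3 API.
# Endpoint, credentials, bucket, and key below are assumptions for illustration.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # assumed MinIO API port
    aws_access_key_id="minioadmin",          # assumed default MinIO credentials
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="datalake")          # assumed bucket name
s3.upload_file("./data/iris.data", "datalake", "raw/iris.data")
```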
Then bring up the Hive Metastore:

```sh
make hive_up
```

Then run the Spark script that creates a new Iceberg table and moves the data from the file into it:
```sh
make fill_tables
```
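As a rough sketch, such a Spark job could look like the snippet below. The catalog, schema, table name, and S3 path are assumptions, not necessarily the ones used by the repo's script, and the Iceberg catalog settings are assumed to come from the Spark/compose configuration.

```python
# Sketch of a Spark job that reads the raw iris file from S3 and writes it
# into a partitioned Iceberg table. Names and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Iceberg catalog configuration is assumed to be provided by the cluster / compose setup.
spark = SparkSession.builder.appName("fill_tables").getOrCreate()

# iris.data is a headerless CSV: four measurements plus the species label.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = spark.read.csv("s3a://datalake/raw/iris.data", inferSchema=True).toDF(*columns)

# Create (or replace) an Iceberg table partitioned by species and load the data into it.
df.writeTo("iceberg.db.iris").partitionedBy(col("species")).createOrReplace()
```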
Then bring up Trino so it can be queried from pandas:

```sh
make trino_up
```

Install the Python requirements:
```sh
poetry install
poetry shell
```

Then run the pandas script:
```sh
python pandas_app/fetch.py
```
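For reference, a minimal sketch of such a script using the trino client and pandas is shown below. The host, port, user, catalog, schema, and table name are assumptions; adjust them to match your Trino setup and the table created by the Spark job.

```python
# Sketch of querying the Iceberg table through Trino into a pandas DataFrame.
# Host, port, user, catalog, schema, and table name are assumptions for illustration.
import pandas as pd
from trino.dbapi import connect

conn = connect(
    host="localhost",
    port=8080,          # assumed Trino HTTP port
    user="admin",
    catalog="iceberg",
    schema="db",
)

# Pull a few rows from the table and load them into pandas.
df = pd.read_sql("SELECT * FROM iris LIMIT 10", conn)
print(df.head())
```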