This repo contains the work done for the Data Engineering Zoomcamp 2026 from DataTalks.Club. Currently, only the first module is implemented. In this module, Python, Docker, and PostgreSQL are used to build a data migration pipeline that loads a CSV file with 2021 NY Taxi data into a PostgreSQL database. The CSV files are extracted and written to the database in chunks with the pandas library, through a containerized CLI application. The PostgreSQL database runs in a multi-container application together with PgAdmin, while the Python script is executed in a separate container. The multi-container application must be up before running the script container.
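The chunked CSV-to-PostgreSQL ingestion described above can be sketched as follows. This is a minimal illustration, not the repo's actual script: the function name `ingest_csv` and its parameters are hypothetical, and the real pipeline may differ.

```python
# Hypothetical sketch of chunked CSV ingestion with pandas + SQLAlchemy.
import pandas as pd
from sqlalchemy import create_engine


def ingest_csv(csv_path, table_name, engine, chunksize=100_000):
    """Read the CSV in chunks and append each chunk to the target table."""
    reader = pd.read_csv(csv_path, iterator=True, chunksize=chunksize)
    for chunk in reader:
        # Appending chunk by chunk keeps memory usage bounded,
        # regardless of the total CSV size.
        chunk.to_sql(table_name, con=engine, if_exists="append", index=False)
```

Reading with `chunksize` keeps memory bounded, which is why the source data is ingested in chunks rather than in one `read_csv` call.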
This work relies on Docker for containerization. You must have Docker installed and running in order to build and run the containers.
- Clone this repo

```bash
git clone https://github.com/PLCodingStuff/data-engineering-zoomcamp.git
cd data-engineering-zoomcamp
```

- Get into the `pipeline` folder

```bash
cd pipeline
```

- Start the multi-container application

```bash
docker compose up
```

- Build the containerized script

```bash
docker build -t taxi:v001 .
```

- Check the network

```bash
docker network ls
```

- Run the container

```bash
# The network name is based on the directory name,
# or can be found with the previous command
docker run -it --rm \
  --network=pipeline_default \
  taxi:v001 \
    --pg-user=root \
    --pg-password=root \
    --pg-host=pgdatabase \
    --pg-port=5432 \
    --pg-db=ny_taxi \
    --target-table=yellow_taxi_trips
```

- You can connect to PgAdmin through `localhost:8085`, with email `admin@admin.com` and password `root`.
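The flags passed to `docker run` are consumed by the Python script inside the container. A minimal sketch of how such a CLI might parse them, assuming `argparse` (the repo's actual implementation may differ):

```python
import argparse


def parse_args(argv=None):
    # Parse the connection flags shown in the `docker run` command above.
    parser = argparse.ArgumentParser(
        description="Ingest NY Taxi CSV data into PostgreSQL"
    )
    parser.add_argument("--pg-user", required=True)
    parser.add_argument("--pg-password", required=True)
    parser.add_argument("--pg-host", required=True)
    parser.add_argument("--pg-port", type=int, default=5432)
    parser.add_argument("--pg-db", required=True)
    parser.add_argument("--target-table", required=True)
    return parser.parse_args(argv)


# The parsed values would typically be assembled into a connection URL, e.g.:
#   postgresql://{pg_user}:{pg_password}@{pg_host}:{pg_port}/{pg_db}
```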
You can modify the PostgreSQL and PgAdmin environment variables in `docker-compose.yaml`. The default PostgreSQL user and password are both `root`, set under the fields `POSTGRES_USER` and `POSTGRES_PASSWORD`. The default PgAdmin email is `admin@admin.com` and the password is `root`, set under the fields `PGADMIN_DEFAULT_EMAIL` and `PGADMIN_DEFAULT_PASSWORD`. If you are familiar with Docker configuration files, feel free to make changes; keep in mind that every change in `docker-compose.yaml` also requires a corresponding change to the container run command.
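For orientation, the relevant services in `docker-compose.yaml` might look like the sketch below. The image tags and volume setup are assumptions; only the environment fields, the PgAdmin port, and the default credentials come from this README, so consult the actual file before editing.

```yaml
services:
  pgdatabase:
    image: postgres:13               # image tag is an assumption
    environment:
      POSTGRES_USER: root            # default user (change here)
      POSTGRES_PASSWORD: root        # default password (change here)
      POSTGRES_DB: ny_taxi

  pgadmin:
    image: dpage/pgadmin4            # image name is an assumption
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@admin.com
      PGADMIN_DEFAULT_PASSWORD: root
    ports:
      - "8085:80"                    # PgAdmin reachable at localhost:8085
```

If you change `POSTGRES_USER`, `POSTGRES_PASSWORD`, or the database name here, pass the matching `--pg-user`, `--pg-password`, and `--pg-db` values to the script container.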