In this study, I will try to explain, step by step, how to set up Airflow jobs on Google Cloud Platform (GCP).
- Create a new VM instance for InfluxDB (we choose InfluxDB since it provides system metrics by default):
  - name → influxdb-study
  - region → us-central1 (Iowa) | Zone → us-central1-a
  - machine type → e2-standard-2 would be sufficient
  - Under Boot disk →
    - Operating system: Ubuntu
    - Version: Ubuntu 18.04 LTS (Bionic)
    - Boot disk type: SSD persistent disk (recommended for InfluxDB)
    - Disk size: 20 GB (to start with)
  - Firewall:
    - Allow HTTP & HTTPS traffic (for development purposes only; in production, take security constraints into consideration.)
  - Access scopes: Allow full access to all Cloud APIs
  - Then hit "Create" (a gcloud sketch follows below)
- In VPC Network, open the port listed below so the service can be reached.
  - InfluxDB default port: 8086
  - Create a firewall rule:
    - name: influx-db
    - logs: off → prevents additional cost
    - Source IPv4 ranges: 0.0.0.0/0 → all IP addresses
    - Protocols and ports: specified protocols and ports → tcp: 8086 (InfluxDB)
    - hit "Create" (a gcloud sketch follows below)
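The same rule can be created with gcloud. A minimal sketch of the console steps above; the Airflow rule later in this guide is identical apart from the name and tcp:8080. Note that 0.0.0.0/0 leaves the port open to the whole internet, which is acceptable for a short-lived study setup only.

```bash
# Allow inbound TCP 8086 (InfluxDB) from any IPv4 address on the default network.
# Firewall-rule logging is off by default, so no extra logging cost is incurred.
gcloud compute firewall-rules create influx-db \
  --direction=INGRESS \
  --allow=tcp:8086 \
  --source-ranges=0.0.0.0/0
```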
- Create a new VM instance for Airflow:
  - name → airflow-study
  - region → us-west1 (Oregon) | Zone → us-west1-a
  - machine type → e2-standard-2
  - Under Boot disk →
    - Operating system: Ubuntu
    - Version: Ubuntu 18.04 LTS (Bionic)
    - Boot disk type: SSD persistent disk
    - Disk size: 20 GB (to start with)
  - Firewall:
    - Allow HTTP & HTTPS traffic (for development purposes only; in production, take security constraints into consideration.)
  - Access scopes: Allow full access to all Cloud APIs
  - Then hit "Create"
- In VPC Network, open the port listed below so the Airflow UI can be reached (same pattern as the InfluxDB rule above).
  - Airflow default port: 8080
  - Create a firewall rule:
    - name: airflow
    - logs: off → prevents additional cost
    - Source IPv4 ranges: 0.0.0.0/0 → all IP addresses
    - Protocols and ports: specified protocols and ports → tcp: 8080 (Airflow)
    - hit "Create"
- To give these services fixed IP addresses: VPC Network → IP addresses → Reserve External Static Address
  - The region must be the same one where the VM was created.
  - The type of each address changes from Ephemeral to Static. (A gcloud sketch follows below.)
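A minimal gcloud sketch of the same promotion, assuming the VM name and region used above; the address name influxdb-study-ip is hypothetical, and the airflow-study VM is handled the same way in its own region.

```bash
# Look up the VM's current ephemeral external IP.
EXTERNAL_IP=$(gcloud compute instances describe influxdb-study \
  --zone=us-central1-a \
  --format='get(networkInterfaces[0].accessConfigs[0].natIP)')

# Promote that ephemeral address to a reserved static address in the same region.
gcloud compute addresses create influxdb-study-ip \
  --addresses="$EXTERNAL_IP" \
  --region=us-central1
```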
- Connect via SSH to influxdb-study.
  - Run apt update and apt upgrade to update the VM.
  - Then switch to root with the sudo su command to install these services.
- Install InfluxDB
  - Follow the documentation to install InfluxDB on our Ubuntu VM (see the quick check sketched below):
    - wget https://dl.influxdata.com/influxdb/releases/influxdb2-2.2.0-amd64.deb (downloads the package)
    - sudo dpkg -i influxdb2-2.2.0-amd64.deb (installs the package)
    - systemctl status influxdb (shows the status of the InfluxDB service)
    - systemctl start influxdb (starts the service)
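Optionally, to make sure InfluxDB starts again after a reboot and is actually listening on its default port, a small hedged check (the /health endpoint is part of the InfluxDB 2.x API):

```bash
# Start InfluxDB now and enable it at boot.
sudo systemctl enable --now influxdb

# A {"status": "pass"} response means the service is up on port 8086.
curl -s http://localhost:8086/health
```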
- Install Telegraf (to get system metrics as sample data)
  - Follow the documentation to install Telegraf on our Ubuntu VM:
    - wget -qO- https://repos.influxdata.com/influxdb.key | sudo tee /etc/apt/trusted.gpg.d/influxdb.asc >/dev/null
    - echo "deb https://repos.influxdata.com/debian stable main" | sudo tee /etc/apt/sources.list.d/influxdb.list
    - sudo apt-get update && sudo apt-get install telegraf
    - systemctl status telegraf (shows the status of the Telegraf service)
    - systemctl start telegraf (starts the service)
  - To set up InfluxDB, use the external IP of the influxdb VM and append the port number: <influxdb VM external IP>:8086
    - Get Started
      - username: admin
      - password: password
      - Initial Org: acme
      - Initial Bucket Name: telegraf
    - Quick Start
    - Data → API Tokens → admin's token (to get the token for the telegraf.conf file)
- Configure Telegraf to connect with our InfluxDB
  - cd /etc/telegraf/ and nano telegraf.conf
  - Comment out the [[outputs.influxdb]] section.
  - Edit the parameters under [[outputs.influxdb_v2]] by uncommenting (removing the leading #) and setting:
    - urls → the VM's external IP instead of the localhost IP (127.0.0.1)
    - token → the admin's token
    - organization = "acme"
    - bucket = "telegraf"
    - timeout = "10s"
    - user_agent = "telegraf"
  - (An equivalent scripted edit is sketched below.)
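For reference, the same edit can be scripted. This is a minimal sketch that appends an [[outputs.influxdb_v2]] section via a heredoc instead of editing the file by hand; it assumes the stock [[outputs.influxdb]] block has been commented out as described above, and the IP and token placeholders must be replaced with your own values.

```bash
# Append an InfluxDB v2 output section to telegraf.conf (replace the placeholders).
sudo tee -a /etc/telegraf/telegraf.conf >/dev/null <<'EOF'
[[outputs.influxdb_v2]]
  urls = ["http://<influxdb VM external IP>:8086"]
  token = "<admin's API token>"
  organization = "acme"
  bucket = "telegraf"
  timeout = "10s"
  user_agent = "telegraf"
EOF

# Restart Telegraf so the new output takes effect.
sudo systemctl restart telegraf
```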
- Connect via SSH to airflow-study.
  - Run apt update and apt upgrade to update the VM.
  - Then switch to root with the sudo su command to install these services.
- Check whether any Docker containers are already on the VM: docker ps
- Go to the root folder: cd /
- Make a directory for Airflow: mkdir airflow
- Enter the airflow directory: cd airflow
- Install Docker Engine (https://docs.docker.com/engine/install/ubuntu/):
  - Install the prerequisites: sudo apt-get install ca-certificates curl gnupg lsb-release
  - Add Docker's GPG key:
    - sudo mkdir -p /etc/apt/keyrings
    - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
  - Set up the repository:
    - echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
  - Install Docker Engine and the Compose plugin (https://docs.docker.com/compose/install/compose-plugin/):
    - sudo apt-get update
    - sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
  - Check for a successful installation: docker ps (see also the hello-world check sketched after this list)
  - Check the installed Docker version: docker version
  - List the available versions: apt-cache madison docker-ce (the result is used as the version string)
  - To install a specific version: sudo apt-get install docker-ce=<VERSION_STRING> docker-ce-cli=<VERSION_STRING> containerd.io docker-compose-plugin, for example sudo apt-get install docker-ce=5:20.10.16~3-0~ubuntu-jammy docker-ce-cli=5:20.10.16~3-0~ubuntu-jammy containerd.io docker-compose-plugin (pick a version string from the apt-cache madison output that matches your Ubuntu release, e.g. bionic)
  - Check the Docker Compose version: docker compose version
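As an optional sanity check of the fresh Docker installation (a standard step from the Docker docs, not part of the original write-up):

```bash
# Pulls and runs a tiny test image; a welcome message confirms the engine works.
sudo docker run hello-world
```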
- Go to the airflow folder under the root folder: cd / then cd airflow/
  - (info) To download any file from the VM to your local PC, you can use the download option at the upper right corner of the SSH window; enter /airflow/docker-compose.yaml in the dialog box to download the docker-compose file.
- In the airflow folder, create a new YAML file via nano docker-compose.yaml and copy in the contents of docker-compose.yaml.txt.
  - We need PostgreSQL to keep the DAGs' metadata.
  - Since we define AIRFLOW_EXECUTOR=LocalExecutor, any airflow-worker nodes fail; everything runs on the master node because of the LocalExecutor.
  - Save and exit.
- Make a directory in airflow for the DAGs: mkdir dags
  - If any plugins are used: mkdir dags plugins
  - If any scripts are used, create a new folder under the dags folder: cd dags/ then mkdir scripts
- Install Airflow:
  - apt install docker-compose
  - Use the YAML file to install Airflow: docker-compose up -d (-d is for a detached run)
  - Check the downloaded images and the status of the containers: docker ps (docker ps -a for all containers, including failed ones)
  - We need to define a username and password for Airflow. We can do this in the airflow scheduler container (airflow_airflow-scheduler1).
    - Execute docker exec -it -u 0 542 bash to access the container (542 stands for the first 3 digits of the airflow-scheduler container's ID).
    - To create the credentials: airflow users create --username admin --firstname melih --lastname melih --role Admin --password admin --email admin@airflow
    - To exit from the container: exit
    - (A scripted variant is sketched below.)
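Instead of copying the container ID's first digits by hand, the user-creation step can be scripted. A minimal sketch, assuming the scheduler container's name contains "scheduler" (check the docker ps output for the exact names your compose file produces):

```bash
# Grab the ID of the first container whose name matches "scheduler".
SCHEDULER_ID=$(docker ps --filter "name=scheduler" --format '{{.ID}}' | head -n 1)

# Create the Airflow admin user inside it (same credentials as above).
docker exec -it -u 0 "$SCHEDULER_ID" bash -c \
  "airflow users create --username admin --firstname melih --lastname melih \
   --role Admin --password admin --email admin@airflow"
```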
- Access the UI at <airflow VM IP>:8080
  - username: admin
  - password: admin
- To activate the changes that we made in the telegraf.conf file, restart the service: systemctl restart influxdb
- To see the activities of InfluxDB: journalctl -fu influxdb
- Restart Telegraf as well: systemctl restart telegraf
- Check the status of Telegraf: systemctl status telegraf
- In each container, the related Airflow files live under /opt/bitnami/airflow/
- Go into the dags folder: cd airflow/dags/
- Create an InfluxDB DAG file: nano influxdb_dag.py (we define the functions in this file and declare the dependencies at the bottom of the file)
- Paste everything from Airflow-configuration.txt
  - We import InfluxDBOperator to connect Airflow with InfluxDB.
  - We import BashOperator to execute commands in a Bash shell.
  - We need to use the influxdb VM's IP address in the tokens, buckets, and URLs in the functions.
  - We need to adapt the query, which is written in Flux syntax, properly (watch out for r["host"] == <VM name>).
  - To write the output to BigQuery, edit project_id = <bigquery project id>.
  - Indicate the table_id where the data is written: table_id = <table indicated in the functions>.
- We face an error in the Airflow UI. To solve this problem, we need to install the libraries that are imported in the influxdb_dag.py file. We do this inside the airflow-scheduler container (you can find the commands in Airflow Study&HelperFunctions.txt):
  - sudo pip3 install virtualenv
  - virtualenv -p python3 <target folder>
  - . /opt/bitnami/airflow/venv/bin/activate (to switch into the venv)
  - (venv) pip3 install influxdb
  - (venv) pip3 install influxdb-client
  - (venv) pip3 install apache-airflow-providers-influxdb
  - (venv) pip3 install pandas_gbq
  - python3 -m venv --upgrade <target folder> (to upgrade the venv, if needed)
- Then exit the virtual environment (deactivate). (A quick import check is sketched below.)
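Once the libraries are installed, it can help to confirm that the DAG file imports cleanly before triggering anything from the UI. A minimal sketch, assuming the bitnami layout and venv paths used above; run it inside the airflow-scheduler container:

```bash
# Importing the DAG file with the venv's Python surfaces syntax and
# missing-library errors before the scheduler hits them.
/opt/bitnami/airflow/venv/bin/python /opt/bitnami/airflow/dags/influxdb_dag.py

# The new DAG should then appear in the scheduler's DAG list.
airflow dags list
```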
- We do these configurations in the scheduler container.
  - Execute docker exec -it -u 0 542 bash to access the airflow-scheduler container (542 stands for the first 3 digits of the scheduler container's ID, which we can find via the docker ps command).
- Reach the Airflow UI.
  - Since we installed the required libraries, the warning in the upper banner should disappear.
- cd airflow/dags/scripts and nano command.sh to copy the commands into the command.sh file.
- To make command.sh executable, run chmod +x command.sh.
- In the UI, the DAG influxdb_query_operator must be activated via its toggle button.
- Then press the "play" button to trigger the DAG without waiting for the scheduled time interval.
- Then check the "log" to see how the process went. If the log is not shown, check the given error and correct it (the usual suspect is that the secret keys of the airflow and scheduler containers differ; both containers must share the same Airflow webserver secret key).
  - This correction needs to be made in the related containers (scheduler_1 and airflow_1).
- Then restart the containers: docker restart 6a6 (6a6 stands for the first three characters of the airflow_1 container's ID).
Any suggestions and improvement advice are welcome!!
Summary Info
- To define the username and password: the airflow users create --username admin --firstname melih --lastname gor --role Admin --password admin --email admin@airflow command.
- To exit from the container: exit.
- Refresh the services to apply the configuration changes: systemctl restart influxdb
- Check the status of the restarted service: systemctl status influxdb
- To see the activities of InfluxDB: journalctl -fu influxdb
- Restart Telegraf as well: systemctl restart telegraf
- Check the status of Telegraf: systemctl status telegraf
- In the "airflow/dags" folder, create a file with nano influxdb_dag.py and paste in the code from the "Airflow-configuration.txt" file.