tysker · Jan 12, 2026
diff --git a/README.md b/README.md
@@ -3,7 +3,9 @@
 ## Project Description
 
 This repository is a complete end-to-end DevOps learning project built around a small Python Flask
-application. The goal is to gradually build a realistic production-like environment that includes:
+application. All access follows a bastion-based, non-root security model.
+
+The goal is to gradually build a realistic production-like environment that includes:
 
 - containerization with Docker
 - CI/CD pipelines (GitHub Actions)
@@ -16,6 +18,8 @@ application. The goal is to gradually build a realistic production-like environm
 The project grows in clear stages. Each stage is documented with **what was done**, **why it matters**,
 and **how it was implemented**, so it becomes both a learning journal and a portfolio project.
 
+**Current status:** Stages 1–10 are complete. The application is deployed, monitored with Prometheus and Grafana, and reachable over HTTP. Next step: TLS and a reverse proxy (Stage 11).
+
 ## Structure
 
 Current project layout:
@@ -24,19 +28,32 @@ Current project layout:
 cloud_devops_lab/
 ├── ansible
 │   ├── ansible.cfg
+│   ├── ansible.log
 │   ├── group_vars
-│   │   └── all.yml
+│   │   ├── all.yml
+│   │   ├── app.yml
+│   │   └── monitoring.yml
 │   ├── hosts.ini
 │   ├── playbooks
-│   │   └── bootstrap.yml
+│   │   ├── bootstrap_1.yml
+│   │   ├── bootstrap_2.yml
+│   │   ├── deploy_app.yml
+│   │   ├── monitoring_grafana.yml
+│   │   ├── monitoring_node_exporter.yml
+│   │   └── monitoring_prometheus.yml
 │   ├── README.md
 │   └── roles
-│       ├── bootstrap_users
+│       ├── bootstrap_user
 │       ├── common
+│       ├── deploy_app
 │       ├── docker
+│       ├── grafana
+│       ├── node_exporter
+│       ├── prometheus
 │       └── ssh_hardening
 ├── app
 │   ├── Dockerfile
+│   ├── gunicorn.conf.py
 │   ├── requirements.txt
 │   ├── src
 │   │   ├── app.py
@@ -50,6 +67,8 @@ cloud_devops_lab/
 │       ├── lib
 │       ├── lib64 -> lib
 │       └── pyvenv.cfg
+├── docs
+│   └── project-checklist.md
 ├── infrastructure
 │   └── terraform
 │       ├── main.tf
@@ -74,6 +93,8 @@ cloud_devops_lab/
 - Linode for server hosting
 - Cloudflare (DNS)
 - Domain registrar
+- Grafana
+- Prometheus
 
 ## Running the Application Locally
 
@@ -104,8 +125,8 @@ The project is built in incremental stages. Each stage adds a new DevOps capabil
 - <s>Stage 7: SSH hardening</s>
 - <s>Stage 8: Docker installation (via Ansible)</s>
 - <s>Stage 9: Application deployment</s>
-- Stage 10: Monitoring stack (Prometheus & Grafana)
-- Stage 11: TLS certificates & reverse proxy
+- <s>Stage 10: Monitoring stack (Prometheus & Grafana)</s>
+- Stage 11: TLS certificates & reverse proxy (Caddy)
 
 ### Stage 1 — Flask Application
 
@@ -143,7 +164,6 @@ foundation for CI/CD pipelines, registries, deployment automation, and infrastru
 1. Build image: `docker build -t cloud-devops-app:0.1 .`
 2. Run container: `docker run -p 5000:5000 cloud-devops-app:0.1`
 3. Test health endpoint: `curl http://localhost:5000/health`
-4. Test metrics: `curl http://localhost:5000/metrics/custom`
 
 ### Stage 3 — CI/CD Pipeline (GHCR Integration)
 
@@ -289,6 +309,74 @@ A repeatable deployment reduces manual steps and ensures consistent environments
 - Exposed HTTP on port 80 mapped to container port 5000.
 - Added an Ansible health check against `/health`.
 
+### Stage 10 — Monitoring Stack (Prometheus & Grafana)
+
+This stage adds metrics-based observability for both the infrastructure and the application.
+
+#### Part 1 — Node Exporter
+
+**What:**  
+Deployed Node Exporter on the application and monitoring servers.
+
+**Why:**  
+Host-level metrics (CPU, memory, disk, network) are essential for understanding system health and capacity.
+
+**How:**  
+- Installed Node Exporter via Docker using Ansible.
+- Metrics exposed on port `9100`.
+- Targets scraped via private IPs.
+
+---
+
+#### Part 2 — Prometheus
+
+**What:**  
+Deployed Prometheus on the monitoring server.
+
+**Why:**  
+Prometheus acts as the central metrics collection and storage system.
+
+**How:**  
+- Prometheus deployed via Docker using Ansible.
+- Configuration rendered from a Jinja2 template (`prometheus.yml.j2`) into `prometheus.yml`.
+- Scrapes:
+  - Node Exporter on app + monitoring servers
+  - Flask application metrics
+- Persistent data directory mounted on the host.
+
+---
+
+#### Part 3 — Grafana
+
+**What:**  
+Deployed Grafana for metrics visualization.
+
+**Why:**  
+Metrics are only useful if they can be explored and visualized effectively.
+
+**How:**  
+- Grafana deployed via Docker using Ansible.
+- Prometheus configured as a data source.
+- Access restricted to SSH port forwarding (no public exposure).
+- Imported **Node Exporter Full** dashboard (ID 1860).
+
+---
+
+#### Part 4 — Flask application metrics
+
+**What:**  
+Exposed application metrics in Prometheus format.
+
+**Why:**  
+Application-level observability enables insight into runtime behavior, performance, and stability.
+
+**How:**  
+- Added `/metrics` endpoint using `prometheus_client`.
+- Removed the earlier JSON-based metrics endpoint.
+- Prometheus scrapes the app at:
+  - `http://<app_private_ip>:80/metrics`
+- Metrics verified in Prometheus and visualized in Grafana.
+
 ### Access Model
 
 - Direct SSH access is allowed only to the jump server.
@@ -329,7 +417,12 @@ A chronological log describing the work done in each stage.
 - <s>Procced to Stage 7: SSH hardening</s>
 - <s>Procced to Stage 8: Docker installation (via Ansible)</s>
 - <s>Procced to Stage 9: Application deployment using Docker and GHCR</s>
-- Procced to Stage 10: Stage 10: Monitoring stack (Prometheus & Grafana)
+- <s>Proceeded to Stage 10: Monitoring stack (Prometheus & Grafana)</s>
+- Proceeded to Stage 11: TLS certificates & reverse proxy (Caddy)
+
+Stage 11 will introduce HTTPS, automatic TLS certificates, and a reverse proxy
+in front of the application. This enables secure traffic, prepares the setup
+for Cloudflare proxying, and allows stricter firewall rules on the application server.
 
 ## Git Workflow & Conventions
 

diff --git a/ansible/README.md b/ansible/README.md
@@ -12,10 +12,12 @@ all servers in the Cloud DevOps Lab.
 ## Roles
 
 ### common
+
 - Verifies basic connectivity (`ping`)
 - Used as a baseline dependency for all other roles
 
 ### bootstrap_users
+
 - Creates a non-root `devops` user
 - Adds the user to the `sudo` group
 - Configures passwordless sudo (temporary)
@@ -26,5 +28,5 @@ all servers in the Cloud DevOps Lab.
 Run from the `ansible/` directory:
 
 ```bash
-ansible-playbook playbooks/bootstrap.yml
+ansible-playbook playbooks/....
 ```
diff --git a/ansible/group_vars/app.yml b/ansible/group_vars/app.yml
@@ -1,4 +1,4 @@
-app_image: "ghcr.io/tysker/cloud_devops_app:77ecd38"
+app_image: "ghcr.io/tysker/cloud_devops_app:0950da9"
 app_container_name: "cloud-devops-app"
 app_container_port: 5000
 app_public_port: 80
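As a rough illustration of how the values in `group_vars/app.yml` fit together, the sketch below derives the Docker port mapping and the health-check URL from them. The `AppConfig` class, its method names, and the address `203.0.113.10` are hypothetical and not part of this repository; the `deploy_app` role may consume these variables differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    """Mirrors the variables in ansible/group_vars/app.yml (hypothetical helper)."""

    image: str = "ghcr.io/tysker/cloud_devops_app:0950da9"
    container_name: str = "cloud-devops-app"
    container_port: int = 5000
    public_port: int = 80

    def port_mapping(self) -> str:
        # Same "host:container" form that `docker run -p` accepts.
        return f"{self.public_port}:{self.container_port}"

    def health_url(self, host: str) -> str:
        # The README describes a health check against /health on the public port.
        return f"http://{host}:{self.public_port}/health"


config = AppConfig()
print(config.port_mapping())
print(config.health_url("203.0.113.10"))  # placeholder address (TEST-NET-3)
```

Keeping both ports in one place means the published port can change (for example when a reverse proxy takes over port 80) without touching the container's internal port.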
diff --git a/ansible/roles/prometheus/templates/prometheus.yml.j2 b/ansible/roles/prometheus/templates/prometheus.yml.j2
@@ -3,7 +3,7 @@ global:
   evaluation_interval: 15s
 
 scrape_configs:
-  - job_name: "node-exporter"
+  - job_name: "node"
     static_configs:
       - targets:
           - "{{ hostvars['monitoring-1'].ansible_host }}:9100"
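Renaming the job changes the `job` label on every series it scrapes, so queries or dashboard filters that still match `job="node-exporter"` will not see newly scraped data. One way to confirm the renamed job is healthy is to inspect the response of Prometheus's `/api/v1/targets` endpoint. The following is a minimal sketch under that assumption; the helper name, the sample payload, and its IP addresses are made up for illustration.

```python
import json
from typing import Any


def targets_for_job(payload: dict[str, Any], job: str) -> dict[str, str]:
    """Return {instance: health} for the active targets of one scrape job."""
    active = payload.get("data", {}).get("activeTargets", [])
    return {
        target["labels"]["instance"]: target["health"]
        for target in active
        if target.get("labels", {}).get("job") == job
    }


# Trimmed, hypothetical response body from GET /api/v1/targets.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"instance": "10.0.0.11:9100", "job": "node"}, "health": "up"},
      {"labels": {"instance": "10.0.0.12:9100", "job": "node"}, "health": "down"},
      {"labels": {"instance": "10.0.0.12:80", "job": "flask_app"}, "health": "up"}
    ]
  }
}
""")

print(targets_for_job(SAMPLE_RESPONSE, "node"))
```

Against a live server, the same function could be fed the JSON fetched from port `9090` (the Prometheus default), for instance through an SSH tunnel to the monitoring host.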

diff --git a/docs/project-checklist.md b/docs/project-checklist.md
@@ -29,7 +29,7 @@ completed, what is in progress, and what belongs to future expansion.
 - [x] Cloudflare in DNS-only mode (proxy disabled)
 - [ ] Terraform-managed DNS records (Cloudflare provider)
 - [ ] Stable DNS target via reserved IP
-- [ ] Decide exposure model for monitoring (private vs public)
+- [x] Decide exposure model for monitoring (private vs public)
 
 ---
 
@@ -124,19 +124,24 @@ completed, what is in progress, and what belongs to future expansion.
 
 ### Prometheus
 
-- [ ] Prometheus deployed on monitoring server
-- [ ] Scrape node exporters
-- [ ] Scrape application metrics
-- [ ] Retention and storage configured
-- [ ] Alert rules defined
+- [x] Prometheus deployed on monitoring server
+- [x] Scrape node exporters (app + monitoring via private IPs)
+- [x] Scrape application metrics (`/metrics` scraped as job `flask_app`)
+- [ ] Retention and storage tuned/configured explicitly (beyond defaults)
+- [ ] Alert rules defined
 
 ### Grafana
 
-- [ ] Grafana deployed on monitoring server
-- [ ] Prometheus datasource configured
-- [ ] Node exporter dashboards imported
-- [ ] App dashboards created
-- [ ] Access control (auth / private access)
+- [x] Grafana deployed on monitoring server
+- [x] Prometheus datasource configured
+- [x] Node exporter dashboards imported
+- [x] App dashboards created (custom panels for `flask_app`):
+  - `process_resident_memory_bytes`
+  - `rate(process_cpu_seconds_total[5m])`
+  - `rate(python_gc_objects_collected_total[5m])`
+- [ ] Access control hardening (no default creds, users/roles; still private via SSH tunnel)
 
 ---
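The three app panels listed above use metrics that `prometheus_client` typically exposes by default on Linux (its process and Python GC collectors). A quick way to check that a scrape actually contains them is to parse the text output of `/metrics`. The parser below is a simplified sketch, not a full implementation of the exposition format (it does not parse labels and discards timestamps); the function names and sample text are illustrative.

```python
def parse_samples(text: str) -> dict[str, float]:
    """Map each series (name plus optional label set) to its sample value."""
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled series: everything up to the closing brace is the key.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        samples[series] = float(rest[0])  # rest[1], if present, is a timestamp
    return samples


def metric_names(samples: dict[str, float]) -> set[str]:
    """Strip label sets so series collapse to their metric names."""
    return {series.split("{", 1)[0] for series in samples}


# Hypothetical excerpt of a /metrics response.
SAMPLE = """\
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.5e+07
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1.25
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 321.0
python_gc_objects_collected_total{generation="1"} 12.0
"""

REQUIRED = {
    "process_resident_memory_bytes",
    "process_cpu_seconds_total",
    "python_gc_objects_collected_total",
}

missing = REQUIRED - metric_names(parse_samples(SAMPLE))
print("missing metrics:", sorted(missing))
```

A check like this could run after each deploy so that a dashboard panel going blank is caught as a missing metric and not discovered later in Grafana.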