109 changes: 101 additions & 8 deletions README.md
Expand Up @@ -3,7 +3,9 @@
## Project Description

This repository is a complete end-to-end DevOps learning project built around a small Python Flask
application. The goal is to gradually build a realistic production-like environment that includes:
application. All access follows a bastion-based, non-root security model.

The goal is to gradually build a realistic production-like environment that includes:

- containerization with Docker
- CI/CD pipelines (GitHub Actions)
Expand All @@ -16,6 +18,8 @@ application. The goal is to gradually build a realistic production-like environm
The project grows in clear stages. Each stage is documented with **what was done**, **why it matters**,
and **how it was implemented**, so it becomes both a learning journal and a portfolio project.

**Current status:** Stages 1–10 completed. Application is deployed, monitored with Prometheus and Grafana, and accessible via HTTP. Next step: TLS and reverse proxy.

## Structure

Current project layout:
Expand All @@ -24,19 +28,32 @@ Current project layout:
cloud_devops_lab/
├── ansible
│   ├── ansible.cfg
│   ├── ansible.log
│   ├── group_vars
│   │   └── all.yml
│   │   ├── all.yml
│   │   ├── app.yml
│   │   └── monitoring.yml
│   ├── hosts.ini
│   ├── playbooks
│   │   └── bootstrap.yml
│   │   ├── bootstrap_1.yml
│   │   ├── bootstrap_2.yml
│   │   ├── deploy_app.yml
│   │   ├── monitoring_grafana.yml
│   │   ├── monitoring_node_exporter.yml
│   │   └── monitoring_prometheus.yml
│   ├── README.md
│   └── roles
│       ├── bootstrap_users
│       ├── bootstrap_user
│       ├── common
│       ├── deploy_app
│       ├── docker
│       ├── grafana
│       ├── node_exporter
│       ├── prometheus
│       └── ssh_hardening
├── app
│   ├── Dockerfile
│   ├── gunicorn.conf.py
│   ├── requirements.txt
│   ├── src
│   │   ├── app.py
Expand All @@ -50,6 +67,8 @@ cloud_devops_lab/
│   ├── lib
│   ├── lib64 -> lib
│   └── pyvenv.cfg
├── docs
│   └── project-checklist.md
├── infrastructure
│   └── terraform
│       ├── main.tf
Expand All @@ -74,6 +93,8 @@ cloud_devops_lab/
- Linode for server hosting
- Cloudflare (DNS)
- Domain registrar
- Grafana
- Prometheus

## Running the Application Locally

Expand Down Expand Up @@ -104,8 +125,8 @@ The project is built in incremental stages. Each stage adds a new DevOps capabil
- <s>Stage 7: SSH hardening</s>
- <s>Stage 8: Docker installation (via Ansible)</s>
- <s>Stage 9: Application deployment</s>
- Stage 10: Monitoring stack (Prometheus & Grafana)
- Stage 11: TLS certificates & reverse proxy
- <s>Stage 10: Monitoring stack (Prometheus & Grafana)</s>
- Stage 11: TLS certificates & reverse proxy (Caddy)

### Stage 1 — Flask Application

Expand Down Expand Up @@ -143,7 +164,6 @@ foundation for CI/CD pipelines, registries, deployment automation, and infrastru
1. Build image: `docker build -t cloud-devops-app:0.1 .`
2. Run container: `docker run -p 5000:5000 cloud-devops-app:0.1`
3. Test health endpoint: `curl http://localhost:5000/health`
4. Test metrics: `curl http://localhost:5000/metrics/custom`

### Stage 3 — CI/CD Pipeline (GHCR Integration)

Expand Down Expand Up @@ -289,6 +309,74 @@ A repeatable deployment reduces manual steps and ensures consistent environments
- Exposed HTTP on port 80 mapped to container port 5000.
- Added an Ansible health check against `/health`.
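The health check in the last step can be pictured as a small retry loop. The following is a minimal Python sketch of that idea, not the Ansible task itself; the function names `http_status` and `wait_for_health`, the retry count, and the delay are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code of a GET request, or None if it fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def wait_for_health(url, retries=5, delay=1.0, fetch=http_status):
    """Poll `url` until it answers 200; give up after `retries` attempts."""
    for attempt in range(1, retries + 1):
        if fetch(url) == 200:
            return True
        if attempt < retries:
            time.sleep(delay)  # the container may still be starting
    return False


if __name__ == "__main__":
    # Assumed address: the app published on port 80 of the local machine.
    print(wait_for_health("http://localhost:80/health"))
```

Retrying matters here because a freshly started container usually needs a moment before it accepts connections, so a single immediate request could report a false failure.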

### Stage 10 — Monitoring stack (Prometheus & Grafana)

This stage introduces full observability for both the infrastructure and the application.

#### Part 1 — Node Exporter

**What:**
Deployed Node Exporter on the application and monitoring servers.

**Why:**
Host-level metrics (CPU, memory, disk, network) are essential for understanding system health and capacity.

**How:**
- Installed Node Exporter via Docker using Ansible.
- Metrics exposed on port `9100`.
- Targets scraped via private IPs.
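Node Exporter serves its metrics on port `9100` in the Prometheus text exposition format. As a rough illustration of what a scraper reads there, the sketch below parses a few such lines in Python. The function name `parse_metrics` and the sample values are made up for the example, and the parser is deliberately simplified: it assumes no timestamps and no commas inside label values.

```python
def parse_metrics(text):
    """Parse exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name, brace, rest = series.partition("{")
        labels = {}
        if brace:
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, _, raw = pair.partition("=")
                    labels[key] = raw.strip('"')
        samples.append((name, labels, float(value)))
    return samples


# Illustrative sample, not real output from the project's servers.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
node_memory_MemAvailable_bytes 2.048e+09
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```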

---

#### Part 2 — Prometheus

**What:**
Deployed Prometheus on the monitoring server.

**Why:**
Prometheus acts as the central metrics collection and storage system.

**How:**
- Prometheus deployed via Docker using Ansible.
- Configuration rendered from a template (`prometheus.yml`).
- Scrapes:
  - Node Exporter on app + monitoring servers
  - Flask application metrics
- Persistent data directory mounted on the host.
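To make the templating step more concrete, here is a small Python sketch that renders a comparable scrape configuration from two host addresses. It stands in for the Jinja2 template used by the Ansible role and is not a copy of it: the function name `render_prometheus_config` and the `10.0.0.x` addresses are hypothetical, while the job names `node` and `flask_app`, the `15s` evaluation interval, and ports `9100` and `80` are taken from this project.

```python
def render_prometheus_config(app_ip, monitoring_ip, evaluation_interval="15s"):
    """Build a prometheus.yml-style document as plain text."""
    # Job name -> list of host:port targets. The metrics path is left out
    # because Prometheus defaults to /metrics.
    jobs = {
        "node": [f"{monitoring_ip}:9100", f"{app_ip}:9100"],
        "flask_app": [f"{app_ip}:80"],
    }
    lines = [
        "global:",
        f"  evaluation_interval: {evaluation_interval}",
        "",
        "scrape_configs:",
    ]
    for job_name, targets in jobs.items():
        lines.append(f'  - job_name: "{job_name}"')
        lines.append("    static_configs:")
        lines.append("      - targets:")
        for target in targets:
            lines.append(f'          - "{target}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical private addresses, used only for illustration.
    print(render_prometheus_config(app_ip="10.0.0.11", monitoring_ip="10.0.0.12"))
```

Generating the file from inventory data keeps the target list in one place, so adding a server does not require editing the Prometheus configuration by hand.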

---

#### Part 3 — Grafana

**What:**
Deployed Grafana for metrics visualization.

**Why:**
Metrics are only useful if they can be explored and visualized effectively.

**How:**
- Grafana deployed via Docker using Ansible.
- Prometheus configured as a data source.
- Access restricted to SSH port forwarding (no public exposure).
- Imported **Node Exporter Full** dashboard (ID 1860).
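Because Grafana is reachable only through SSH port forwarding, opening the UI means building a tunnel through the jump server first. The sketch below assembles such an `ssh` command in Python. The host names are placeholders, and it assumes Grafana listens on its default port `3000` on the monitoring server; the actual port and addresses in this project may differ.

```python
import shlex


def grafana_tunnel_command(jump_host, monitoring_host, user="devops",
                           local_port=3000, grafana_port=3000):
    """Return the argument list for an SSH tunnel to Grafana via the jump host."""
    return [
        "ssh",
        "-N",                                  # forward ports only, run no remote command
        "-J", f"{user}@{jump_host}",           # hop through the jump server
        "-L", f"{local_port}:localhost:{grafana_port}",  # local port -> Grafana
        f"{user}@{monitoring_host}",
    ]


if __name__ == "__main__":
    # Placeholder hosts, for illustration only.
    command = grafana_tunnel_command("jump.example.com", "10.0.0.12")
    print(shlex.join(command))
```

With the tunnel running, Grafana would be available at `http://localhost:3000` on the workstation, without any public port on the monitoring server.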

---

#### Part 4 — Flask application metrics

**What:**
Exposed application metrics in Prometheus format.

**Why:**
Application-level observability enables insight into runtime behavior, performance, and stability.

**How:**
- Added `/metrics` endpoint using `prometheus_client`.
- Removed the earlier JSON-based metrics endpoint.
- Prometheus scrapes the app at:
  - `http://<app_private_ip>:80/metrics`
- Metrics verified in Prometheus and visualized in Grafana.
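To show what a `/metrics` endpoint in Prometheus format returns, here is a dependency-free Python sketch built on a plain WSGI callable. It is a stand-in, not the project's code: the real application uses Flask with `prometheus_client`, whereas this example hand-writes the exposition text, and the metric name `app_requests_total` and the class `RequestCounter` are hypothetical.

```python
import threading


class RequestCounter:
    """Tiny thread-safe counter, standing in for a prometheus_client counter."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self):
        with self._lock:
            self._value += 1

    def render(self):
        """Return the counter in Prometheus text exposition format."""
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )


REQUESTS = RequestCounter("app_requests_total", "Total HTTP requests handled.")


def app(environ, start_response):
    """WSGI app: /metrics returns the counter, every other path increments it."""
    if environ.get("PATH_INFO") == "/metrics":
        body = REQUESTS.render().encode("utf-8")
        headers = [("Content-Type", "text/plain; version=0.0.4; charset=utf-8")]
    else:
        REQUESTS.inc()
        body = b"ok"
        headers = [("Content-Type", "text/plain")]
    start_response("200 OK", headers)
    return [body]


if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("127.0.0.1", 5000, app).serve_forever()
```

A plain-text format like this is what lets Prometheus scrape the application the same way it scrapes Node Exporter, which the earlier JSON endpoint could not offer.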

### Access Model

- Direct SSH access is allowed only to the jump server.
Expand Down Expand Up @@ -329,7 +417,12 @@ A chronological log describing the work done in each stage.
- <s>Proceeded to Stage 7: SSH hardening</s>
- <s>Proceeded to Stage 8: Docker installation (via Ansible)</s>
- <s>Proceeded to Stage 9: Application deployment using Docker and GHCR</s>
- Proceeded to Stage 10: Monitoring stack (Prometheus & Grafana)
- <s>Proceeded to Stage 10: Monitoring stack (Prometheus & Grafana)</s>
- Proceeded to Stage 11: TLS certificates & reverse proxy (Caddy)

Stage 11 will introduce HTTPS, automatic TLS certificates, and a reverse proxy
in front of the application. This enables secure traffic, prepares the setup
for Cloudflare proxying, and allows stricter firewall rules on the application server.

## Git Workflow & Conventions

Expand Down
4 changes: 3 additions & 1 deletion ansible/README.md
Expand Up @@ -12,10 +12,12 @@ all servers in the Cloud DevOps Lab.
## Roles

### common

- Verifies basic connectivity (`ping`)
- Used as a baseline dependency for all other roles

### bootstrap_users

- Creates a non-root `devops` user
- Adds the user to the `sudo` group
- Configures passwordless sudo (temporary)
Expand All @@ -26,5 +28,5 @@ all servers in the Cloud DevOps Lab.
Run from the `ansible/` directory:

```bash
ansible-playbook playbooks/bootstrap.yml
ansible-playbook playbooks/....
```
2 changes: 1 addition & 1 deletion ansible/group_vars/app.yml
@@ -1,4 +1,4 @@
app_image: "ghcr.io/tysker/cloud_devops_app:77ecd38"
app_image: "ghcr.io/tysker/cloud_devops_app:0950da9"
app_container_name: "cloud-devops-app"
app_container_port: 5000
app_public_port: 80
2 changes: 1 addition & 1 deletion ansible/roles/prometheus/templates/prometheus.yml.j2
Expand Up @@ -3,7 +3,7 @@ global:
evaluation_interval: 15s

scrape_configs:
- job_name: "node-exporter"
- job_name: "node"
static_configs:
- targets:
- "{{ hostvars['monitoring-1'].ansible_host }}:9100"
Expand Down
27 changes: 16 additions & 11 deletions docs/project-checklist.md
Expand Up @@ -29,7 +29,7 @@ completed, what is in progress, and what belongs to future expansion.
- [x] Cloudflare in DNS-only mode (proxy disabled)
- [ ] Terraform-managed DNS records (Cloudflare provider)
- [ ] Stable DNS target via reserved IP
- [ ] Decide exposure model for monitoring (private vs public)
- [x] Decide exposure model for monitoring (private vs public)

---

Expand Down Expand Up @@ -124,19 +124,24 @@ completed, what is in progress, and what belongs to future expansion.

### Prometheus

- [ ] Prometheus deployed on monitoring server
- [ ] Scrape node exporters
- [ ] Scrape application metrics
- [ ] Retention and storage configured
- [ ] Alert rules defined
- [x] Prometheus deployed on monitoring server
- [x] Scrape node exporters (app + monitoring via private IPs)
- [x] Scrape application metrics (`/metrics` scraped as job `flask_app`)
- [ ] Retention and storage tuned/configured explicitly (beyond defaults)
- [ ] Alert rules defined

### Grafana

- [ ] Grafana deployed on monitoring server
- [ ] Prometheus datasource configured
- [ ] Node exporter dashboards imported
- [ ] App dashboards created
- [ ] Access control (auth / private access)
- [x] Grafana deployed on monitoring server
- [x] Prometheus datasource configured
- [x] Node exporter dashboards imported
- [x] App dashboards created (custom panels for `flask_app`):
  - `process_resident_memory_bytes`
  - `rate(process_cpu_seconds_total[5m])`
  - `rate(python_gc_objects_collected_total[5m])`
- [ ] Access control hardening (no default creds, users/roles; still private via SSH tunnel)
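
Two of the dashboard queries above wrap a counter in `rate(...[5m])`. As a rough picture of what that computes, the Python sketch below derives a per-second rate from raw counter samples and treats a drop in value as a counter reset. It is a simplification: Prometheus additionally extrapolates towards the window boundaries, which this example skips. The function name `simple_rate` and the sample data are invented for illustration.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    `samples` is a time-ordered list of (timestamp_seconds, counter_value).
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # The counter went down, so the process restarted: count from zero.
            increase += value
        else:
            increase += value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    # Made-up CPU-seconds samples taken 60 seconds apart.
    cpu_seconds = [(0, 10.0), (60, 16.0), (120, 22.0)]
    print(simple_rate(cpu_seconds))
```

For `process_cpu_seconds_total`, the result reads as the fraction of one CPU core the process used over the window.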

---

Expand Down