diff --git a/.claude/commands/review-challenge.md b/.claude/commands/review-challenge.md index f70cdb4..4e51c82 100644 --- a/.claude/commands/review-challenge.md +++ b/.claude/commands/review-challenge.md @@ -1,5 +1,5 @@ --- -allowed-tools: Bash(kubeasy-cli*),Bash(kubectl*),Bash(cat*),Bash(grep*),Bash(ls*),Bash(sleep*),Bash(head*),Bash(tail*),Read,Write,Edit +allowed-tools: Bash(kubeasy*),Bash(kubectl*),Bash(cat*),Bash(grep*),Bash(ls*),Bash(sleep*),Bash(head*),Bash(tail*),Read,Write,Edit description: Review a Kubeasy challenge for quality, pedagogy, and bypass resistance --- @@ -32,7 +32,7 @@ You must experience the challenge as a learner first. Run structural validation before deploying anything: ```bash -kubeasy-cli dev lint +kubeasy dev lint ``` If lint fails → **stop the review immediately**, score 0/20, verdict ❌ Fail. @@ -41,15 +41,15 @@ Write the PR comment with lint errors and exit. ### Phase 3: Deploy and Verify Broken State ```bash -kubeasy-cli dev apply --clean +kubeasy dev apply --clean sleep 10 -kubeasy-cli dev status +kubeasy dev status ``` **Then immediately run validations:** ```bash -kubeasy-cli dev validate +kubeasy dev validate ``` All validations MUST FAIL at this point. This confirms the broken state is real. @@ -74,7 +74,7 @@ kubectl get events -n --sort-by='.lastTimestamp' 1. Form a hypothesis about what's wrong 2. Apply a fix using `kubectl` -3. Verify with `kubeasy-cli dev validate ` +3. Verify with `kubeasy dev validate ` **Maximum 5 attempts.** If you can't solve it after 5 tries, flag the challenge and continue. 
@@ -83,7 +83,7 @@ kubectl get events -n --sort-by='.lastTimestamp' Reset to broken state: ```bash -kubeasy-cli dev apply --clean +kubeasy dev apply --clean sleep 10 ``` @@ -167,7 +167,7 @@ Write a spoiler-free PR comment to `review--pr-comment.md` in the current ### Phase 10: Clean up ```bash -kubeasy-cli dev clean +kubeasy dev clean ``` ## Spoiler-Free Writing Guide diff --git a/cascading-blackout/challenge.yaml b/cascading-blackout/challenge.yaml new file mode 100644 index 0000000..2f669f4 --- /dev/null +++ b/cascading-blackout/challenge.yaml @@ -0,0 +1,116 @@ +title: "Cascading Blackout" +type: "fix" +theme: "networking" +difficulty: "hard" +estimatedTime: 30 + +description: | + The order-processing platform was running perfectly until a recent security hardening push. + The edge proxy returns HTTP 200 on its health endpoint, but actual order requests + fail silently — customers see empty responses or timeouts. + The team reports that "nothing changed in the application code" — + but the infrastructure change touched multiple components at once. + +initialSituation: | + A three-tier order processing system is deployed in the namespace: + - An edge proxy (nginx) that routes requests to a backend service + - A backend application that processes orders and caches results + - A Redis cache used by the backend for session and order data + Each tier has its own Deployment and Service, and its pods are running. + After a recent infrastructure change, the edge proxy health check still works, + but end-to-end order requests fail. + The security hardening introduced several changes simultaneously — + investigate each tier carefully before settling on a root cause. + +objective: | + Restore full end-to-end communication across the platform. + Orders submitted through the edge proxy must reach the backend, + and the backend must be able to read and write to the cache. + All services should remain healthy and reachable through their Services. 
+ +objectives: + - key: gateway-running + title: "Gateway Online" + description: "The edge proxy pods must be running and ready" + order: 1 + type: condition + spec: + target: + kind: Pod + labelSelector: + app: edge-proxy + checks: + - type: Ready + status: "True" + + - key: backend-running + title: "Backend Online" + description: "The backend pods must be running and ready" + order: 2 + type: condition + spec: + target: + kind: Pod + labelSelector: + app: order-backend + checks: + - type: Ready + status: "True" + + - key: cache-running + title: "Cache Online" + description: "The cache pods must be running and ready" + order: 3 + type: condition + spec: + target: + kind: Pod + labelSelector: + app: order-cache + checks: + - type: Ready + status: "True" + + - key: gateway-to-backend + title: "Gateway Reaches Backend" + description: "The edge proxy must be able to forward requests to the backend service" + order: 4 + type: connectivity + spec: + sourcePod: + labelSelector: + app: edge-proxy + targets: + - url: "http://order-backend:8080/health" + expectedStatusCode: 200 + timeoutSeconds: 5 + + - key: backend-service-identity + title: "Backend Service Classification" + description: "The backend pods are correctly classified within the platform" + order: 5 + type: condition + spec: + target: + kind: Pod + labelSelector: + app: order-backend + tier: backend + checks: + - type: Initialized + status: "True" + + - key: backend-healthy + title: "Backend Fully Operational" + description: "The backend reports healthy status including cache connectivity" + order: 6 + type: log + spec: + target: + kind: Pod + labelSelector: + app: order-backend + container: order-backend + expectedStrings: + - "ready to accept connections" + sinceSeconds: 120 diff --git a/cascading-blackout/manifests/backend.yaml b/cascading-blackout/manifests/backend.yaml new file mode 100644 index 0000000..fb3c387 --- /dev/null +++ b/cascading-blackout/manifests/backend.yaml @@ -0,0 +1,64 @@ +apiVersion: 
apps/v1 +kind: Deployment +metadata: + name: order-backend + namespace: cascading-blackout + labels: + app: order-backend +spec: + replicas: 1 + selector: + matchLabels: + app: order-backend + template: + metadata: + labels: + app: order-backend + spec: + containers: + - name: order-backend + image: busybox:1.36 + ports: + - containerPort: 8080 + command: + - /bin/sh + - -c + - | + # Background: check cache and update state flag + while true; do + if nc -z -w2 order-cache 6379 2>/dev/null; then + echo "[$(date)] ready to accept connections" + touch /tmp/cache_ok + else + echo "[$(date)] ERROR: cannot reach cache at order-cache:6379" + rm -f /tmp/cache_ok + fi + sleep 5 + done & + # Foreground: HTTP server reflects cache state + while true; do + if [ -f /tmp/cache_ok ]; then + echo -e "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nok" | nc -l -p 8080 -w5 || true + else + echo -e "HTTP/1.1 503 Service Unavailable\r\nContent-Type: text/plain\r\n\r\nunavailable" | nc -l -p 8080 -w5 || true + fi + done + readinessProbe: + httpGet: + path: /health + port: 8080 + initialDelaySeconds: 5 + periodSeconds: 10 + failureThreshold: 3 +--- +apiVersion: v1 +kind: Service +metadata: + name: order-backend + namespace: cascading-blackout +spec: + selector: + app: order-backend + ports: + - port: 8080 + targetPort: 8080 diff --git a/cascading-blackout/manifests/cache.yaml b/cascading-blackout/manifests/cache.yaml new file mode 100644 index 0000000..48cf77f --- /dev/null +++ b/cascading-blackout/manifests/cache.yaml @@ -0,0 +1,41 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: order-cache + namespace: cascading-blackout + labels: + app: order-cache + tier: cache +spec: + replicas: 1 + selector: + matchLabels: + app: order-cache + template: + metadata: + labels: + app: order-cache + tier: cache + spec: + containers: + - name: redis + image: redis:7-alpine + ports: + - containerPort: 6379 + readinessProbe: + exec: + command: ["redis-cli", "ping"] + initialDelaySeconds: 
5 + periodSeconds: 5 +--- +apiVersion: v1 +kind: Service +metadata: + name: order-cache + namespace: cascading-blackout +spec: + selector: + app: order-cache + ports: + - port: 6379 + targetPort: 6379 diff --git a/cascading-blackout/manifests/gateway.yaml b/cascading-blackout/manifests/gateway.yaml new file mode 100644 index 0000000..485e2a4 --- /dev/null +++ b/cascading-blackout/manifests/gateway.yaml @@ -0,0 +1,78 @@ +apiVersion: apps/v1 +kind: Deployment +metadata: + name: edge-proxy + namespace: cascading-blackout + labels: + app: edge-proxy + tier: frontend +spec: + replicas: 1 + selector: + matchLabels: + app: edge-proxy + template: + metadata: + labels: + app: edge-proxy + tier: frontend + spec: + containers: + - name: edge-proxy + image: nginx:1.25-alpine + ports: + - containerPort: 80 + volumeMounts: + - name: nginx-config + mountPath: /etc/nginx/conf.d/default.conf + subPath: default.conf + readinessProbe: + httpGet: + path: /healthz + port: 80 + initialDelaySeconds: 5 + periodSeconds: 5 + livenessProbe: + httpGet: + path: /healthz + port: 80 + initialDelaySeconds: 10 + periodSeconds: 10 + volumes: + - name: nginx-config + configMap: + name: gateway-config +--- +apiVersion: v1 +kind: Service +metadata: + name: edge-proxy + namespace: cascading-blackout +spec: + selector: + app: edge-proxy + ports: + - port: 80 + targetPort: 80 +--- +apiVersion: v1 +kind: ConfigMap +metadata: + name: gateway-config + namespace: cascading-blackout +data: + default.conf: | + server { + listen 80; + + location /healthz { + return 200 'ok'; + add_header Content-Type text/plain; + } + + location /api/ { + proxy_pass http://order-backend:8080/; + proxy_connect_timeout 5s; + proxy_read_timeout 10s; + } + } diff --git a/cascading-blackout/manifests/network-policies.yaml b/cascading-blackout/manifests/network-policies.yaml new file mode 100644 index 0000000..e950851 --- /dev/null +++ b/cascading-blackout/manifests/network-policies.yaml @@ -0,0 +1,76 @@ +apiVersion: 
networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: gateway-policy + namespace: cascading-blackout +spec: + podSelector: + matchLabels: + app: edge-proxy + policyTypes: + - Ingress + - Egress + ingress: + - ports: + - port: 80 + protocol: TCP + egress: + - ports: + - port: 53 + protocol: UDP + - port: 53 + protocol: TCP + - to: + - podSelector: + matchLabels: + app: order-backend + ports: + - port: 8080 + protocol: TCP +--- +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: backend-policy + namespace: cascading-blackout +spec: + podSelector: + matchLabels: + app: order-backend + policyTypes: + - Ingress + - Egress + ingress: + - from: + - podSelector: + matchLabels: + app: edge-proxy + ports: + - port: 8080 + protocol: TCP + egress: + - ports: + - port: 53 + protocol: UDP + - port: 53 + protocol: TCP +--- +apiVersion: networking.k8s.io/v1 +kind: NetworkPolicy +metadata: + name: cache-policy + namespace: cascading-blackout +spec: + podSelector: + matchLabels: + app: order-cache + policyTypes: + - Ingress + ingress: + - from: + - podSelector: + matchLabels: + tier: backend + ports: + - port: 6379 + protocol: TCP diff --git a/cascading-blackout/policies/protect.yaml b/cascading-blackout/policies/protect.yaml new file mode 100644 index 0000000..fa608e7 --- /dev/null +++ b/cascading-blackout/policies/protect.yaml @@ -0,0 +1,146 @@ +apiVersion: kyverno.io/v1 +kind: ClusterPolicy +metadata: + name: protect-cascading-blackout + annotations: + argocd.argoproj.io/sync-wave: "2" +spec: + validationFailureAction: Enforce + background: true + rules: + - name: preserve-gateway-image + match: + resources: + namespaces: ["cascading-blackout"] + kinds: ["Deployment"] + names: ["edge-proxy"] + validate: + message: "Cannot change the edge proxy image" + pattern: + spec: + template: + spec: + containers: + - name: edge-proxy + image: "nginx:1.25-alpine" + + - name: preserve-backend-image + match: + resources: + namespaces: ["cascading-blackout"] + 
kinds: ["Deployment"] + names: ["order-backend"] + validate: + message: "Cannot change the backend application image" + pattern: + spec: + template: + spec: + containers: + - name: order-backend + image: "busybox:1.36" + + - name: prevent-backend-deletion + match: + resources: + namespaces: ["cascading-blackout"] + kinds: ["Deployment"] + names: ["order-backend"] + validate: + message: "The backend Deployment is protected and cannot be deleted." + deny: + conditions: + any: + - key: "{{ request.operation }}" + operator: Equals + value: "DELETE" + + - name: preserve-cache-image + match: + resources: + namespaces: ["cascading-blackout"] + kinds: ["Deployment"] + names: ["order-cache"] + validate: + message: "Cannot change the cache image" + pattern: + spec: + template: + spec: + containers: + - name: redis + image: "redis:7-alpine" + + - name: prevent-netpol-deletion + match: + resources: + namespaces: ["cascading-blackout"] + kinds: ["NetworkPolicy"] + validate: + message: "NetworkPolicies cannot be deleted — they are part of the security requirements. Fix them instead." + deny: + conditions: + any: + - key: "{{ request.operation }}" + operator: Equals + value: "DELETE" + + - name: preserve-gateway-policy + match: + resources: + namespaces: ["cascading-blackout"] + kinds: ["NetworkPolicy"] + names: ["gateway-policy"] + validate: + message: "The gateway NetworkPolicy is correctly configured and should not be modified" + deny: + conditions: + any: + - key: "{{ request.operation }}" + operator: Equals + value: "UPDATE" + + - name: preserve-backend-policy + match: + resources: + namespaces: ["cascading-blackout"] + kinds: ["NetworkPolicy"] + names: ["backend-policy"] + validate: + message: "The backend NetworkPolicy cannot be modified. You can create additional NetworkPolicy resources to extend its connectivity rules." 
+ deny: + conditions: + any: + - key: "{{ request.operation }}" + operator: Equals + value: "UPDATE" + + - name: preserve-cache-policy + match: + resources: + namespaces: ["cascading-blackout"] + kinds: ["NetworkPolicy"] + names: ["cache-policy"] + validate: + message: "The cache NetworkPolicy is intentionally configured this way and cannot be modified." + deny: + conditions: + any: + - key: "{{ request.operation }}" + operator: Equals + value: "UPDATE" + + - name: preserve-gateway-config + match: + resources: + namespaces: ["cascading-blackout"] + kinds: ["ConfigMap"] + names: ["gateway-config"] + validate: + message: "Cannot modify the gateway configuration" + deny: + conditions: + any: + - key: "{{ request.operation }}" + operator: Equals + value: "UPDATE" diff --git a/review-cascading-blackout-pr-comment.md b/review-cascading-blackout-pr-comment.md new file mode 100644 index 0000000..932edb8 --- /dev/null +++ b/review-cascading-blackout-pr-comment.md @@ -0,0 +1,29 @@ +## 🔍 Challenge Review: Cascading Blackout + +**Score: 19/20** · Verdict: ✅ Pass + +| Criterion | Score | Comment | +|-----------|:-----:|---------| +| Clarity | 4/4 | Symptom-only description, realistic incident framing. The hint that "multiple components" were affected sets expectations without revealing cause. All validation titles remain generic throughout. | +| Pedagogy | 4/4 | Teaches three complementary networking concepts in one challenge (label-based pod selectors, egress rules, and additive policy composition). Investigation path flows naturally from observable symptoms to root causes. Two-bug cascading design is realistic and earns its Hard rating. | +| Validation | 3/4 | All checks are consistent and reliable — no timing-related flakiness. The intermediate validation provides useful partial-progress feedback when one fix is applied before the other. 
Minor: it checks a specific implementation detail rather than a pure outcome, which is slightly narrower than ideal but justified by the pedagogical value of the intermediate signal. | +| Bypass resistance | 4/4 | All three NetworkPolicies are protected against both deletion and modification. Images locked. One error message proactively hints at the correct remediation approach for the connectivity gap, guiding learners without revealing the solution. No bypasses found. | +| UX | 4/4 | The gateway connectivity validation now passes consistently in the broken state, correctly focusing learner attention on the backend-to-cache tier. Error messages from protection policies are informative. Intermediate validation feedback reflects partial progress cleanly. Difficulty and time estimate are accurate. | + +### What works well + +This is a well-crafted challenge that mirrors real production incidents. The "security hardening touched multiple components at once" framing is authentic — this is exactly how cascading failures occur in practice. The investigation path is well-structured: observable symptoms in logs point toward the network layer, where two independent but compounding issues reveal themselves. Neither fix alone is sufficient, which teaches learners to reason about the full communication chain rather than stopping at the first finding. The intermediate validation turns what could be a frustrating two-bug puzzle into a guided discovery experience. All protection policies have clear, actionable error messages. + +### Minor note + +The intermediate validation checks a specific metadata property rather than an observable service behavior. This is an acceptable trade-off for the feedback value it provides, but a future improvement could express the same constraint through the final end-to-end outcome alone (i.e., rely solely on the log validation to confirm full connectivity once both fixes are in place). 
+ +### Flags + +- Solvable: ✅ +- Bypass found: ❌ +- Coherent with learning goal: ✅ +- Solved in 1 attempt (two coordinated sub-fixes applied simultaneously) + +--- +*Reviewed by Kubeasy Challenge Reviewer*