Merged
79 commits
82df632
Add chaos testing setup and experiment documentation
XploY04 Oct 2, 2025
08348c5
Add documentation for primary pod deletion without TARGET_PODS
XploY04 Oct 6, 2025
1718988
Enhance documentation and code for primary pod chaos testing without …
XploY04 Oct 6, 2025
ee55b8f
Enhance chaos testing setup by implementing dynamic pod targeting and…
XploY04 Oct 16, 2025
b8ae7b1
feat: Add setup scripts for cnp-bench, Prometheus monitoring, and dat…
XploY04 Nov 2, 2025
da0a01f
feat: Add setup and workload testing scripts for CNPG monitoring with…
XploY04 Nov 3, 2025
d9246e0
fix: Update probe timeout and interval formats to include 's' suffix …
XploY04 Nov 3, 2025
6a193d9
Add Jepsen consistency test job and results PVC
XploY04 Nov 18, 2025
2b9e31f
fix: Update chaos experiment configurations for consistency and monit…
XploY04 Nov 18, 2025
304367d
Add Jepsen chaos test runner script for CNPG
XploY04 Nov 18, 2025
b9d9a8c
fix: Update namespace for litmus-admin ServiceAccount in RBAC configu…
XploY04 Nov 20, 2025
ea1b989
feat: Add CNPG Jepsen Chaos Engine without probes for consistency tes…
XploY04 Nov 20, 2025
5cab5ce
fix: Update chaos interval for primary pod deletion to 180 seconds an…
XploY04 Nov 20, 2025
10313b1
fix: Enhance pod status check command and update Prometheus query for…
XploY04 Nov 20, 2025
6e575a9
docs: Update CNPG plugin installation, add disk space recommendations…
XploY04 Nov 22, 2025
55047b7
refactor: Consistently use LITMUS_NAMESPACE for Litmus resources and …
XploY04 Nov 22, 2025
3274fe4
feat: add pg-eu CloudNativePG cluster manifest and update README with…
XploY04 Nov 22, 2025
f7245ed
docs: Streamline README setup instructions by adding a repo clone ste…
XploY04 Nov 23, 2025
be495db
docs: Remove kubectl cnpg plugin commands section from README.md
XploY04 Nov 23, 2025
cf9e711
feat: Implement GitHub Actions for automated chaos testing, enhance t…
XploY04 Nov 25, 2025
db71bf9
docs: update CloudNativePG operator installation instructions to use …
XploY04 Nov 25, 2025
b772d26
docs: Improve CNPG plugin installation instructions by adding krew up…
XploY04 Nov 25, 2025
28161e2
chore: configuration changes
gbartolini Nov 25, 2025
7749541
feat: Remove local experiment and simplify instructions to use Chao…
XploY04 Nov 25, 2025
5e3b81e
chore: add operator configuration map
gbartolini Nov 25, 2025
98ad079
docs: separate comment from command in CNPG rollout verification exam…
XploY04 Nov 26, 2025
5ac5c1b
feat: Add GitHub Actions for Kind cluster setup, tool installation, a…
XploY04 Nov 27, 2025
52cd2b3
test: Add Step 1 - disk cleanup action
XploY04 Nov 27, 2025
87cf067
test: Add Step 2 - tool installation
XploY04 Nov 27, 2025
52256f6
fix: Remove redundant tool check (already in free-disk-space)
XploY04 Nov 27, 2025
bb04540
feat: Update actions to use cnpg-playground and optimize tool install…
XploY04 Nov 27, 2025
4c03d8d
test: Add Step 3 - CNPG Playground cluster setup
XploY04 Nov 27, 2025
c6745af
test: Add Step 3 - CNPG Playground cluster setup
XploY04 Nov 27, 2025
bae27c5
test: Add Step 4 - CNPG operator and PostgreSQL cluster
XploY04 Nov 27, 2025
42f2c40
fix: Correct YAML indentation in pg-eu-cluster probes
XploY04 Nov 27, 2025
c577e4a
fix: Wait for CNPG webhook to be ready before cluster deployment
XploY04 Nov 27, 2025
b2cabc3
test: Add Step 5 - Litmus Chaos operator and experiments
XploY04 Nov 27, 2025
66bf1c2
test: Add Step 6 - Prometheus monitoring for chaos probes
XploY04 Nov 27, 2025
306e54a
fix: Use deployment rollout status for webhook wait
XploY04 Nov 27, 2025
3a81ff2
perf: Optimize Prometheus installation
XploY04 Nov 27, 2025
42b89e5
feat: Add complete Jepsen + Chaos test workflow
XploY04 Nov 27, 2025
c8661bf
fix: Add push trigger to register chaos test workflow
XploY04 Nov 30, 2025
4822b35
fix: Remove Litmus control plane check for Litmus 3.x compatibility
XploY04 Nov 30, 2025
6d78b27
fix: Include PNG graph files in test artifacts
XploY04 Nov 30, 2025
995519c
fix: Wait for Prometheus to scrape metrics before chaos test
XploY04 Nov 30, 2025
a4b2c00
debug: Add comprehensive Prometheus metrics verification
XploY04 Nov 30, 2025
e1a5c51
fix: Add comprehensive Prometheus verification and Litmus 3.x compati…
XploY04 Nov 30, 2025
9904ea0
debug: Add comprehensive probe failure debugging
XploY04 Nov 30, 2025
ae3ba20
fix: Reduce default chaos duration to 300s for faster tests
XploY04 Dec 1, 2025
5ce6dfc
debug: Add comprehensive experiment pod logging for probe diagnosis
XploY04 Dec 1, 2025
5559248
debug: Add experiment job pod logs to diagnose probe failures
XploY04 Dec 1, 2025
3fbf46e
refactor: Switch to full chaos test on push, disable infra test on PR
XploY04 Dec 1, 2025
69b3f80
fix: Increase EOT probe wait time to 180s for experiment completion
XploY04 Dec 1, 2025
9a22ab6
fix: Wait for ChaosResult completion instead of fixed time
XploY04 Dec 1, 2025
01bf046
fix: Remove continuous probe causing experiment Error on expected beh…
XploY04 Dec 1, 2025
64ae71b
refactor: Remove verbose debugging code
XploY04 Dec 1, 2025
d049753
security: Implement least privilege permissions in workflows
XploY04 Dec 1, 2025
851196f
chore: Change chaos test schedule from daily to weekly
XploY04 Dec 1, 2025
df0c083
chore: Update chaos test schedule and add PR trigger
XploY04 Dec 1, 2025
d8db968
ci: remove Jepsen chaos testing setup workflow and execution script.
XploY04 Dec 1, 2025
7abea15
docs: Remove GitHub Actions chaos testing README.
XploY04 Dec 2, 2025
2f0b7dc
feat: Add GitHub Actions docs and simplify chaos test workflow
XploY04 Dec 2, 2025
d7cd132
ci: reduce workflow artifact retention days to 7
XploY04 Dec 2, 2025
9c9d8d0
feat: Migrate to cnpg-playground monitoring setup, update Prometheus …
XploY04 Dec 5, 2025
1a48815
fix: use correct Prometheus service name in chaos-test-full workflow
XploY04 Dec 5, 2025
cae6455
fix: use correct Prometheus service name in chaos-test-full workflow
XploY04 Dec 5, 2025
004c1f0
fix: Correct Prometheus service endpoint in CNPG Jepsen chaos experim…
XploY04 Dec 5, 2025
facc20a
docs: Add detailed documentation on monitoring dependency on and tro…
XploY04 Dec 5, 2025
3067fd3
docs: removed operator configuration
gbartolini Dec 10, 2025
6b442b7
docs: fixed CNPG link
gbartolini Dec 10, 2025
b78a0de
fix: Update curl command for Prometheus metrics query to use --data-u…
XploY04 Dec 10, 2025
05bbc59
fix: Remove CNPG operator configuration step from setup action
XploY04 Dec 10, 2025
c9f97bc
fix: Update Prometheus port-forward service name in metrics verificat…
XploY04 Dec 10, 2025
b2dbfe4
Revise README for CloudNativePG Chaos Testing
XploY04 Dec 10, 2025
05bd1e4
fix: Update README and script to remove references to Elle analysis r…
XploY04 Dec 10, 2025
13eeb43
fix: Remove reference to Jepsen results file in README
XploY04 Dec 10, 2025
d9a1b2e
fix: Clean up whitespace and improve readability in chaos-test-full.yml
XploY04 Dec 10, 2025
e288262
change name of Actions readme.
XploY04 Dec 10, 2025
def23f6
change name to overview
XploY04 Dec 10, 2025
231 changes: 231 additions & 0 deletions .github/OVERVIEW.md
@@ -0,0 +1,231 @@
# GitHub Actions for CloudNativePG Chaos Testing

This directory contains GitHub Actions workflows and reusable composite actions for automated chaos testing of CloudNativePG clusters.

## Workflows

### `chaos-test-full.yml`

Comprehensive chaos testing workflow that validates PostgreSQL cluster resilience under failure conditions.

**What it does**:
- Provisions a Kind cluster using cnpg-playground
- Installs CloudNativePG operator and PostgreSQL cluster
- Deploys Litmus Chaos and Prometheus monitoring
- Runs Jepsen consistency tests with pod-delete chaos injection
- **Validates resilience** - fails the build if chaos tests don't pass
- Collects comprehensive artifacts including cluster state dumps on failure

**Triggers**:
- **Manual**: `workflow_dispatch` with configurable chaos duration (default: 300s)
- **Automatic**: Pull requests to `main` branch (skips documentation-only changes)
- **Scheduled**: Weekly on Sundays at 13:00 UTC

**Quality Gates**:
- Litmus chaos experiment must pass
- Jepsen consistency validation must pass (`:valid? true`)
- Workflow fails if either check fails
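
A minimal sketch of how the two gates could be combined in a single step. The function name `quality_gate`, the results-file argument, and passing the verdict in as a string are assumptions made for illustration; the workflow's real implementation may differ. Litmus stores the verdict in the ChaosResult under `.status.experimentStatus.verdict`.

```shell
# Hypothetical helper: succeeds only when both quality gates pass.
#   $1 - Litmus verdict, for example read with:
#        kubectl get chaosresult <name> -n <namespace> \
#          -o jsonpath='{.status.experimentStatus.verdict}'
#   $2 - path to the Jepsen results file (the file name is an assumption)
quality_gate() {
  local verdict="$1" results_file="$2"

  if [ "$verdict" != "Pass" ]; then
    echo "Litmus chaos experiment verdict: ${verdict:-<empty>}" >&2
    return 1
  fi

  # Jepsen marks a consistent history with ':valid? true'.
  if ! grep -qF ':valid? true' "$results_file"; then
    echo "Jepsen did not report ':valid? true'" >&2
    return 1
  fi

  echo "Both quality gates passed"
}
```

A plain `grep` also matches a nested `:valid? true` reported by a sub-checker, so a stricter gate would parse the top-level key of the results map instead.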

---

## Reusable Composite Actions

### `free-disk-space`

Removes unnecessary pre-installed software from GitHub-hosted runners to free up roughly 40 GB of disk space.

**What it removes**:
- .NET SDK (~15-20 GB)
- Android SDK (~12 GB)
- Haskell tools (~5-8 GB)
- Large tool caches (CodeQL, Go, Python, Ruby, Node)
- Unused browsers

**What it preserves**:
- Docker
- kubectl
- Kind
- Helm
- jq

**Usage**:
```yaml
- name: Free disk space
  uses: ./.github/actions/free-disk-space
```
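
To check how much space a cleanup actually frees on a given runner image, the readings from `df` can be compared before and after. This is a sketch only; `avail_kib` and `freed_gib` are hypothetical helper names and are not part of the action.

```shell
# Available space on a mount point in KiB (POSIX df output, 4th column).
avail_kib() {
  df -Pk "${1:-/}" | awk 'NR == 2 { print $4 }'
}

# Whole GiB freed between two avail_kib readings.
#   $1 - reading before cleanup, $2 - reading after cleanup
freed_gib() {
  local before="$1" after="$2"
  echo $(( ($after - $before) / 1024 / 1024 ))
}

# Example usage around a cleanup step:
#   before="$(avail_kib /)"
#   sudo rm -rf /usr/share/dotnet
#   echo "Freed $(freed_gib "$before" "$(avail_kib /)") GiB"
```

Integer arithmetic keeps the helper free of locale-dependent number formatting, at the cost of rounding down to whole GiB.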

---

### `setup-tools`

Installs and upgrades chaos testing tools to latest stable versions.

**Tools installed/upgraded**:
- kubectl (latest stable)
- Kind (latest release)
- Helm (latest via official installer)
- krew (kubectl plugin manager)
- kubectl-cnpg plugin (via krew)

**Usage**:
```yaml
- name: Setup chaos testing tools
  uses: ./.github/actions/setup-tools
```
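
As an illustration of what "latest stable" means for kubectl, the sketch below resolves the version from the upstream `stable.txt` file and builds the download URL from it. The function names are hypothetical and the action itself may install the tools differently.

```shell
# Build the download URL for a kubectl release binary.
#   $1 - version, e.g. v1.31.0
#   $2 - OS (default: linux), $3 - architecture (default: amd64)
kubectl_url() {
  echo "https://dl.k8s.io/release/$1/bin/${2:-linux}/${3:-amd64}/kubectl"
}

# Resolve the latest stable version, then download and install kubectl.
install_latest_kubectl() {
  local version
  version="$(curl -fsSL https://dl.k8s.io/release/stable.txt)" || return 1
  curl -fsSLo kubectl "$(kubectl_url "$version")" || return 1
  sudo install -m 0755 kubectl /usr/local/bin/kubectl
  rm -f kubectl
}
```

Keeping the URL construction in its own function makes it easy to pin a specific version later without touching the download logic.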

---

### `setup-kind`

Creates a Kind cluster using the proven cnpg-playground configuration.

**Features**:
- Multi-node cluster with PostgreSQL-labeled nodes
- Configured for HA testing
- Proven configuration from cnpg-playground

**Inputs**:
- `region` (optional): Region name for the cluster (default: `eu`)

**Outputs**:
- `kubeconfig`: Path to kubeconfig file
- `cluster-name`: Name of the created cluster

**Usage**:
```yaml
- name: Create Kind cluster
  uses: ./.github/actions/setup-kind
  with:
    region: eu
```
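
Composite-action outputs such as `kubeconfig` and `cluster-name` are typically written by a step to the file named in `$GITHUB_OUTPUT`. The sketch below shows that mechanism; the helper name, the default kubeconfig path, and the `k8s-<region>` naming scheme are assumptions, not values confirmed by this document.

```shell
# Hypothetical helper: publish the action outputs for a given region.
#   $1 - region (default: eu)
#   $2 - kubeconfig path (assumed location; adjust to the real one)
publish_kind_outputs() {
  local region="${1:-eu}"
  local kubeconfig="${2:-/tmp/cnpg-playground/k8s/kube-config.yaml}"
  {
    echo "kubeconfig=${kubeconfig}"
    echo "cluster-name=k8s-${region}"   # assumed naming scheme
  } >> "$GITHUB_OUTPUT"
}
```

In a composite action, each value written this way still has to be mapped in the `outputs:` section of `action.yml` before a calling workflow can read it.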

---

### `setup-cnpg`

Installs CloudNativePG operator and deploys a PostgreSQL cluster.

**What it does**:
1. Installs CNPG operator using `kubectl cnpg install generate` (recommended method)
2. Waits for operator deployment to be ready
3. Applies CNPG operator configuration
4. Waits for webhook to be fully initialized
5. Deploys PostgreSQL cluster
6. Waits for cluster to be ready with health checks

**Requirements**:
- `clusters/cnpg-config.yaml` - CNPG operator configuration
- `clusters/pg-eu-cluster.yaml` - PostgreSQL cluster definition

**Usage**:
```yaml
- name: Setup CloudNativePG
  uses: ./.github/actions/setup-cnpg
```
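
The six steps listed above can be sketched as one shell function. The namespace `cnpg-system` and deployment `cnpg-controller-manager` are the CloudNativePG defaults; the cluster name `pg-eu`, the timeouts, and the use of a second rollout check as the webhook wait are assumptions for illustration.

```shell
# Hypothetical outline of the setup-cnpg steps.
#   $1 - name of the Cluster resource (assumed default: pg-eu)
setup_cnpg() {
  local cluster="${1:-pg-eu}"
  local ns="cnpg-system" deploy="cnpg-controller-manager"

  # 1. Install the operator from the manifest generated by the plugin.
  kubectl cnpg install generate | kubectl apply --server-side -f - || return 1
  # 2. Wait for the operator deployment to be ready.
  kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=180s || return 1
  # 3. Apply the operator configuration.
  kubectl apply -f clusters/cnpg-config.yaml || return 1
  # 4. Wait again so the webhook is served before a Cluster is created.
  kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=180s || return 1
  # 5. Deploy the PostgreSQL cluster.
  kubectl apply -f clusters/pg-eu-cluster.yaml || return 1
  # 6. Block until the Cluster reports the Ready condition.
  kubectl wait --for=condition=Ready "cluster/${cluster}" --timeout=600s
}
```

Returning on the first failure keeps a broken operator install from being masked by later steps that happen to succeed.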

---

### `setup-litmus`

Installs Litmus Chaos operator, experiments, and RBAC configuration.

**What it installs**:
- litmus-core operator (via Helm)
- pod-delete chaos experiment
- Litmus RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding)

**Verification**:
- Checks all CRDs are installed
- Verifies operator is ready
- Validates RBAC permissions

**Requirements**:
- `litmus-rbac.yaml` - RBAC configuration file

**Usage**:
```yaml
- name: Setup Litmus Chaos
  uses: ./.github/actions/setup-litmus
```
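
The verification checks could look roughly like the sketch below. The three CRD names are the core Litmus CRDs; the namespace default `litmus`, the ServiceAccount default `litmus-admin`, and the function name are assumptions, so adjust them to the values used in `litmus-rbac.yaml`.

```shell
# Hypothetical verification helper.
#   $1 - Litmus namespace (assumed default: litmus)
#   $2 - ServiceAccount name (assumed default: litmus-admin)
verify_litmus() {
  local ns="${1:-litmus}" sa="${2:-litmus-admin}" crd

  # The core Litmus CRDs must exist.
  for crd in chaosengines chaosexperiments chaosresults; do
    kubectl get crd "${crd}.litmuschaos.io" >/dev/null || {
      echo "missing CRD: ${crd}.litmuschaos.io" >&2
      return 1
    }
  done

  # Every deployment in the Litmus namespace must be Available.
  kubectl wait --for=condition=Available deployment --all \
    -n "$ns" --timeout=120s >/dev/null || return 1

  # The chaos ServiceAccount must be allowed to delete pods.
  kubectl auth can-i delete pods \
    --as="system:serviceaccount:${ns}:${sa}" >/dev/null || {
      echo "ServiceAccount ${ns}/${sa} cannot delete pods" >&2
      return 1
    }
}
```

`kubectl auth can-i` exits non-zero when the answer is "no", which is what lets it act as a gate here.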

---

### `setup-prometheus`

Installs Prometheus and Grafana monitoring using cnpg-playground's built-in monitoring solution.

**What it installs**:
- Prometheus Operator (via cnpg-playground monitoring/setup.sh)
- Grafana Operator with official CNPG dashboard
- CNPG PodMonitor for PostgreSQL metrics

**Requirements**:
- `monitoring/podmonitor-pg-eu.yaml` - CNPG PodMonitor configuration
- cnpg-playground must be cloned to `/tmp/cnpg-playground` (done by setup-kind action)

**Usage**:
```yaml
- name: Setup Prometheus
  uses: ./.github/actions/setup-prometheus
```

---

## Artifacts

Each workflow run produces the following artifacts (retained for 30 days):

**Jepsen Results**:
- `history.edn` - Operation history
- `STATISTICS.txt` - Test statistics
- `*.png` - Visualization graphs

**Litmus Results**:
- `chaosresult.yaml` - Chaos experiment results

**Logs**:
- `test.log` - Complete test execution log

**Cluster State** (on failure only):
- `cluster-state-dump.yaml` - Complete cluster state including pods, events, and operator logs
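
A failure-only dump of this kind could be collected along the following lines. The function name, the section markers, the operator namespace, and the log tail length are assumptions; the workflow's real collector may gather more or different data.

```shell
# Hypothetical collector for the failure-only artifact.
#   $1 - output file (default: cluster-state-dump.yaml)
#   $2 - operator namespace (assumed default: cnpg-system)
dump_cluster_state() {
  local out="${1:-cluster-state-dump.yaml}" ns="${2:-cnpg-system}"
  {
    echo "# --- pods ---"
    kubectl get pods --all-namespaces -o wide
    echo "# --- events ---"
    kubectl get events --all-namespaces --sort-by=.lastTimestamp
    echo "# --- operator logs ---"
    kubectl logs -n "$ns" deployment/cnpg-controller-manager --tail=500
  } > "$out" 2>&1
  # Never fail the job because part of the dump could not be collected.
  return 0
}
```

In a workflow, a step calling this helper would normally be guarded with `if: failure()` so the dump is only produced when an earlier step fails.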

---

## Usage in Other Workflows

You can reuse these actions in your own workflows:

```yaml
name: My Chaos Test

on:
  workflow_dispatch:

jobs:
  test:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      actions: write

    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

      - name: Free disk space
        uses: ./.github/actions/free-disk-space

      - name: Setup tools
        uses: ./.github/actions/setup-tools

      - name: Create cluster
        uses: ./.github/actions/setup-kind
        with:
          region: us

      - name: Setup CNPG
        uses: ./.github/actions/setup-cnpg

      # Your custom chaos testing steps here
```

---
103 changes: 103 additions & 0 deletions .github/actions/free-disk-space/action.yml
@@ -0,0 +1,103 @@
name: 'Free Disk Space'
description: 'Remove unnecessary pre-installed software to free up disk space (preserves Docker, kubectl, Kind, Helm)'
branding:
  icon: 'hard-drive'
  color: 'blue'

runs:
  using: 'composite'
  steps:
    - name: Display disk usage before cleanup
      shell: bash
      run: |
        echo "=== Disk Usage Before Cleanup ==="
        df -h /
        echo ""
        echo "=== Pre-installed tools we'll keep ==="
        echo "Docker: $(docker --version)"
        # 'kubectl version --short' was removed in kubectl 1.28; the plain form is already short
        echo "kubectl: $(kubectl version --client 2>/dev/null || echo 'will install')"
        echo "Kind: $(kind version 2>/dev/null || echo 'will install')"
        echo "Helm: $(helm version --short 2>/dev/null || echo 'will install')"
        echo "jq: $(jq --version)"

    - name: Remove .NET SDK and tools
      shell: bash
      run: |
        echo "Removing .NET SDK (~15-20 GB)..."
        sudo rm -rf /usr/share/dotnet
        sudo rm -rf /opt/hostedtoolcache/dotnet

    - name: Remove Android SDK
      shell: bash
      run: |
        echo "Removing Android SDK (~12 GB)..."
        sudo rm -rf /usr/local/lib/android
        # Quote the expansions so a path containing spaces cannot split into several arguments
        sudo rm -rf "${ANDROID_HOME:-/usr/local/lib/android/sdk}"
        sudo rm -rf "${ANDROID_NDK_HOME:-/usr/local/lib/android/sdk/ndk}"

    - name: Remove Haskell tools
      shell: bash
      run: |
        echo "Removing Haskell/GHC (~5-8 GB)..."
        sudo rm -rf /opt/ghc
        sudo rm -rf /usr/local/.ghcup
        sudo rm -rf ~/.ghcup

    - name: Remove large cached tools
      shell: bash
      run: |
        echo "Removing large tool caches..."
        # Remove CodeQL (~5 GB); drop this line if the job needs CodeQL security scanning
        sudo rm -rf /opt/hostedtoolcache/CodeQL

        # Remove cached Go versions (we'll use latest if needed)
        sudo rm -rf /opt/hostedtoolcache/go

        # Remove cached Python versions (keep system Python)
        sudo rm -rf /opt/hostedtoolcache/Python

        # Remove cached Ruby versions
        sudo rm -rf /opt/hostedtoolcache/Ruby

        # Remove cached Node versions (keep system Node)
        sudo rm -rf /opt/hostedtoolcache/node

    - name: Remove unused browsers and drivers
      shell: bash
      run: |
        echo "Removing browser test tools (not needed for chaos testing)..."
        # Keep Chrome for potential debugging, remove others
        sudo rm -rf /usr/share/microsoft-edge
        sudo rm -rf /opt/microsoft/msedge
        sudo apt-get remove -y firefox chromium-browser 2>/dev/null || true

    - name: Clean package manager caches
      shell: bash
      run: |
        echo "Cleaning package manager caches..."
        sudo apt-get clean
        sudo rm -rf /var/lib/apt/lists/*

    - name: Clean Docker build cache (preserve images)
      shell: bash
      run: |
        echo "Cleaning Docker build cache..."
        # Only remove build cache, not images (we need Docker functional)
        docker builder prune --all --force || true

    - name: Display disk usage after cleanup
      shell: bash
      run: |
        echo ""
        echo "=== Disk Usage After Cleanup ==="
        df -h /
        echo ""
        echo "=== Verify essential tools still available ==="
        docker --version
        echo "Docker: ✅"

        # These will be installed by setup-tools action
        kubectl version --client 2>/dev/null && echo "kubectl: ✅ (pre-installed)" || echo "kubectl: will be installed"
        kind version 2>/dev/null && echo "Kind: ✅ (pre-installed)" || echo "Kind: will be installed"
        helm version --short 2>/dev/null && echo "Helm: ✅ (pre-installed)" || echo "Helm: will be installed"
        jq --version && echo "jq: ✅"