148 changes: 148 additions & 0 deletions etcd/README.md
@@ -0,0 +1,148 @@
# etcd.openshift.io API Group

This API group contains CRDs related to etcd cluster management in Two Node OpenShift with Fencing deployments.

## API Versions

### v1alpha1

Contains the `PacemakerCluster` custom resource for monitoring Pacemaker cluster health in Two Node OpenShift with Fencing deployments.

#### PacemakerCluster

- **Feature Gate**: `DualReplica`
- **Component**: `two-node-fencing`
- **Scope**: Cluster-scoped singleton resource (must be named "cluster")
- **Resource Path**: `pacemakerclusters.etcd.openshift.io`

The `PacemakerCluster` resource provides visibility into the health and status of a Pacemaker-managed cluster.
It is periodically updated by the cluster-etcd-operator's status collector.

### Pacemaker Resources

A **Pacemaker resource** is a unit of work managed by Pacemaker. In Pacemaker terminology, resources are services
or applications that Pacemaker monitors, starts, stops, and moves between nodes to maintain high availability.

For Two Node OpenShift with Fencing, we manage three resources:
- **Kubelet**: The Kubernetes node agent and a prerequisite for etcd
- **Etcd**: The distributed key-value store
- **FencingAgent**: Used to isolate failed nodes during a quorum loss event

### Status Structure

> **Note on optional fields:** Per Kubernetes API conventions, all status fields are marked `+optional` in the Go
> type definitions to allow for partial updates and backward compatibility. However, CEL validation rules enforce
> that certain fields (like conditions and lastUpdated) cannot be removed once set. This means these fields are
> "optional on initial creation, but required once present."
> See: [kube-api-linter statusoptional](https://github.com/kubernetes-sigs/kube-api-linter/blob/main/pkg/analysis/statusoptional/doc.go)
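
As a minimal sketch of the pattern the note describes (hypothetical type name and marker placement, not the actual definition in this PR):

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PacemakerClusterStatusSketch is a hypothetical stand-in for the real status type.
// Fields are +optional per the convention above, while CEL transition rules on the
// struct keep them from being removed once they have been set.
//
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.conditions) || has(self.conditions)",message="conditions are required once set"
// +kubebuilder:validation:XValidation:rule="!has(oldSelf.lastUpdated) || has(self.lastUpdated)",message="lastUpdated is required once set"
type PacemakerClusterStatusSketch struct {
	// conditions holds the cluster-level conditions (Healthy, InService, NodeCountAsExpected).
	// +optional
	// +kubebuilder:validation:MinItems=3
	Conditions []metav1.Condition `json:"conditions,omitempty"`

	// lastUpdated records when the status collector last wrote the status.
	// +optional
	LastUpdated metav1.Time `json:"lastUpdated,omitempty"`
}
```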
**Review thread on lines +33 to +37**

> **Contributor:** We are dropping this requirement; it's better to be able to actually make the fields required.
>
> **Author:** Does this mean I should bump the linter in the go.mod?
>
> **Contributor:** We should make that bump, yes. If you have time to do it, go for it; otherwise I'll try to get to it in the next few days.

```yaml
status:
  conditions:                # Cluster-level conditions (required once set, min 3 items)
  - type: Healthy
  - type: InService
  - type: NodeCountAsExpected
  lastUpdated: <timestamp>   # When status was last updated (required once set, cannot decrease)
  nodes:                     # Per-node status (0-32 nodes, expects 2)
  - name: <hostname>         # RFC 1123 subdomain name
    ipAddresses:             # List of global unicast IPv4 or IPv6 addresses in canonical form
    - <ip>                   # First address used for etcd peer URLs
    conditions:              # Node-level conditions (required once set, min 7 items)
    - type: Healthy
    - type: Online
    - type: InService
    - type: Active
    - type: Ready
    - type: Clean
    - type: Member
    resources:               # Array of Pacemaker resources on this node (required once set, min 3 items)
    - name: Kubelet          # All three resources (Kubelet, Etcd, FencingAgent) must be present
      conditions:            # Resource-level conditions (required once set, min 8 items)
      - type: Healthy
      - type: InService
      - type: Managed
      - type: Enabled
      - type: Operational
      - type: Active
      - type: Started
      - type: Schedulable
    - name: Etcd
      conditions: []
    - name: FencingAgent
      conditions: []
```

**Review thread on `lastUpdated`**

> **Contributor:** Is there a way to be confident that things are still working without a clock that will cause writes every 30 seconds?
>
> Nodes in Kube have similar properties that haven't aged well, mainly for scale reasons, which I guess is less of an issue here. Still, I wonder how else we could be confident things were still good without a ticking clock.
>
> **Author:** I asked Claude to scan through the existing APIs and review our options for this. The best it could come up with was moving the staleness check into the health checker:
>
> ```go
> // In healthcheck controller
> if cr.ResourceVersion == c.lastSeenResourceVersion {
>     timeSinceLastChange := time.Since(c.lastResourceVersionTime)
>     if timeSinceLastChange > threshold {
>         // Status is stale
>     }
> }
> ```
>
> But this still has the problem that if the status is healthy and stable for more than 5 minutes, we don't record a heartbeat, so the status becomes stale.
>
> The core objective of the API is to prevent users from entering a dangerous state without a warning provided in the cluster. I think the only way to achieve this is with some kind of heartbeat. The `lastUpdated` mechanism is the simplest I could come up with.
>
> **Contributor:** And `lastUpdated` will tick every time you reconcile, not every time there is a particular change, right?

**Review thread on `ipAddresses`**

> **Contributor:** In Nodes, these are just `addresses`, so maybe we want to be consistent.
>
> Also, maybe we want to copy the structure where we can add details to addresses later. That would mean making this a list of structs:
>
> ```yaml
> addresses:
> - address: <ip>
>   type: ??? # Future expansion of the API? Other metadata?
> ```
>
> **Author:** Updating to use this approach.

### Cluster-Level Conditions

**All three conditions are required once the status is populated.**

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Cluster is healthy (`ClusterHealthy`) | Cluster has issues (`ClusterUnhealthy`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `NodeCountAsExpected` | Node count is as expected (`AsExpected`) | Wrong count (`InsufficientNodes`, `ExcessiveNodes`) |

### Node-Level Conditions

**All seven conditions are required for each node once the status is populated.**

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Node is healthy (`NodeHealthy`) | Node has issues (`NodeUnhealthy`) |
| `Online` | Node is online (`Online`) | Node is offline (`Offline`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `Active` | Node is active (`Active`) | Node is in standby (`Standby`) |
| `Ready` | Node is ready (`Ready`) | Node is pending (`Pending`) |
| `Clean` | Node is clean (`Clean`) | Node is unclean (`Unclean`) |
| `Member` | Node is a member (`Member`) | Not a member (`NotMember`) |

### Resource-Level Conditions

Each resource in the `resources` array has its own conditions. **All eight conditions are required for each resource once the status is populated.**

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Resource is healthy (`ResourceHealthy`) | Resource has issues (`ResourceUnhealthy`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `Managed` | Managed by Pacemaker (`Managed`) | Not managed (`Unmanaged`) |
| `Enabled` | Resource is enabled (`Enabled`) | Resource is disabled (`Disabled`) |
| `Operational` | Resource is operational (`Operational`) | Resource has failed (`Failed`) |
| `Active` | Resource is active (`Active`) | Resource is not active (`Inactive`) |
| `Started` | Resource is started (`Started`) | Resource is stopped (`Stopped`) |
| `Schedulable` | Resource is schedulable (`Schedulable`) | Resource is not schedulable (`Unschedulable`) |

### Validation Rules

**Resource naming:**
- Resource name must be "cluster" (singleton)

**Node name validation:**
- Must be a lowercase RFC 1123 subdomain name
- Consists of lowercase alphanumeric characters, '-' or '.'
- Must start and end with an alphanumeric character
- Maximum 253 characters
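
These are the standard DNS-1123 subdomain rules; as an illustrative sketch, a consumer could check a candidate name with the apimachinery validation helper:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/validation"
)

func main() {
	// IsDNS1123Subdomain returns an empty slice when the name satisfies the
	// lowercase RFC 1123 subdomain rules listed above.
	for _, name := range []string{"node-1.example.com", "Node_1"} {
		if errs := validation.IsDNS1123Subdomain(name); len(errs) == 0 {
			fmt.Printf("%s: valid\n", name)
		} else {
			fmt.Printf("%s: %v\n", name, errs)
		}
	}
}
```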

**IP address validation:**
- Pacemaker allows multiple IP addresses for Corosync communication between nodes (1-8 addresses)
- The first address in the list is used for IP-based peer URLs for etcd membership
- Each address must be in canonical form (e.g., `192.168.1.1` not `192.168.001.001`, or `2001:db8::1` not `2001:0db8::1`)
- Each address must be a valid global unicast IPv4 or IPv6 address (including private/RFC1918 addresses)
- Excludes loopback, link-local, and multicast addresses
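
A standalone Go sketch mirroring these address rules (hypothetical helper for illustration only; the actual enforcement lives in the CRD validation):

```go
package main

import (
	"fmt"
	"net/netip"
)

// validateAddress mirrors the address rules above: the value must parse as an IP,
// must already be written in canonical form, and must be a global unicast address
// (loopback, link-local, and multicast addresses are rejected).
func validateAddress(s string) error {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return fmt.Errorf("not a valid IP address: %w", err)
	}
	if addr.String() != s {
		return fmt.Errorf("address %q is not in canonical form (canonical form is %q)", s, addr.String())
	}
	if !addr.IsGlobalUnicast() {
		return fmt.Errorf("address %q is not a global unicast address", s)
	}
	return nil
}

func main() {
	fmt.Println(validateAddress("192.168.1.1"))  // ok: private RFC1918 addresses count as global unicast
	fmt.Println(validateAddress("2001:0db8::1")) // rejected: not canonical (should be 2001:db8::1)
	fmt.Println(validateAddress("127.0.0.1"))    // rejected: loopback
}
```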

**Timestamp validation:**
- `lastUpdated` is optional on initial creation, but once set it cannot be removed and must always increase (prevents stale updates)

**Required conditions (once status is populated):**
- All cluster-level conditions must be present (MinItems=3)
- All node-level conditions must be present for each node (MinItems=7)
- All resource-level conditions must be present for each resource in the `resources` array (MinItems=8)

**Resource names:**
- Valid values are: `Kubelet`, `Etcd`, `FencingAgent`
- All three resources must be present in each node's `resources` array

### Usage

The cluster-etcd-operator healthcheck controller watches this resource and updates operator conditions based on
the cluster state. The aggregate `Healthy` conditions at each level (cluster, node, resource) provide a quick
way to determine overall health.
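
For example, a consumer-side sketch (hypothetical helper, not part of the operator) that reads the aggregate cluster-level condition with the standard apimachinery condition helpers:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isClusterHealthy checks the aggregate cluster-level Healthy condition.
// The conditions slice would come from a fetched PacemakerCluster's
// .status.conditions; fetching the object is omitted here.
func isClusterHealthy(conditions []metav1.Condition) bool {
	return meta.IsStatusConditionTrue(conditions, "Healthy")
}

func main() {
	conditions := []metav1.Condition{
		{Type: "Healthy", Status: metav1.ConditionTrue, Reason: "ClusterHealthy"},
	}
	fmt.Println(isClusterHealthy(conditions)) // true
}
```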
26 changes: 26 additions & 0 deletions etcd/install.go
@@ -0,0 +1,26 @@
package etcd

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"

	v1alpha1 "github.com/openshift/api/etcd/v1alpha1"
)

const (
	GroupName = "etcd.openshift.io"
)

var (
	schemeBuilder = runtime.NewSchemeBuilder(v1alpha1.Install)
	// Install is a function which adds every version of this group to a scheme
	Install = schemeBuilder.AddToScheme
)

func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func Kind(kind string) schema.GroupKind {
	return schema.GroupKind{Group: GroupName, Kind: kind}
}
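
As a usage sketch (hypothetical caller), the group would be registered into a scheme before building clients, informers, or codecs:

```go
package main

import (
	"k8s.io/apimachinery/pkg/runtime"

	"github.com/openshift/api/etcd"
)

func main() {
	// Register every version of the etcd.openshift.io group (currently v1alpha1)
	// into the scheme used by clients or informers.
	scheme := runtime.NewScheme()
	if err := etcd.Install(scheme); err != nil {
		panic(err)
	}
}
```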
3 changes: 3 additions & 0 deletions etcd/v1alpha1/Makefile
@@ -0,0 +1,3 @@
.PHONY: test
test:
	make -C ../../tests test GINKGO_EXTRA_ARGS=--focus="etcd.openshift.io/v1alpha1"
6 changes: 6 additions & 0 deletions etcd/v1alpha1/doc.go
@@ -0,0 +1,6 @@
// +k8s:deepcopy-gen=package,register
// +k8s:defaulter-gen=TypeMeta
// +k8s:openapi-gen=true
// +openshift:featuregated-schema-gen=true
// +groupName=etcd.openshift.io
package v1alpha1
39 changes: 39 additions & 0 deletions etcd/v1alpha1/register.go
@@ -0,0 +1,39 @@
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

var (
	GroupName     = "etcd.openshift.io"
	GroupVersion  = schema.GroupVersion{Group: GroupName, Version: "v1alpha1"}
	schemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
	// Install is a function which adds this version to a scheme
	Install = schemeBuilder.AddToScheme

	// SchemeGroupVersion generated code relies on this name
	// Deprecated
	SchemeGroupVersion = GroupVersion
	// AddToScheme exists solely to keep the old generators creating valid code
	// DEPRECATED
	AddToScheme = schemeBuilder.AddToScheme
)

// Resource generated code relies on this being here, but it logically belongs to the group
// DEPRECATED
func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func addKnownTypes(scheme *runtime.Scheme) error {
	metav1.AddToGroupVersion(scheme, GroupVersion)

	scheme.AddKnownTypes(GroupVersion,
		&PacemakerCluster{},
		&PacemakerClusterList{},
	)

	return nil
}