Skip to content

Conversation

@DavidHurta
Copy link
Contributor

@DavidHurta DavidHurta commented Dec 10, 2025

This pull request proposes to migrate the CVO metrics endpoint from bearer token authentication to mutual TLS (mTLS) to comply with OpenShift metrics conventions.

Key changes:

  • Replaced custom certificate watching logic with the k8s.io/apiserver/pkg/server/dynamiccertificates package to simplify logic
  • Authenticate clients using mutual TLS (mTLS)
  • Authorize clients by verifying CN
  • Add configurable authentication/authorization options for HyperShift compatibility
  • Update ServiceMonitor to use mTLS instead of bearer tokens

Breaking Changes:

  • Prometheus must now authenticate using client certificates signed by a trusted cluster CA. Clients without valid certificates are rejected during the TLS handshake.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 10, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 10, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link

coderabbitai bot commented Dec 10, 2025

Walkthrough

Switched metrics authentication from bearer-token to TLS client certificates with CN-based authorization, introduced dynamic certificate controllers and a MetricsOptions API, removed fsnotify-based reloads and token-review code, updated ServiceMonitor and go.mod, and adjusted tests and startup wiring.

Changes

Cohort / File(s) Summary
Dependency updates
go.mod
Added direct k8s.io/apiserver v0.34.1; removed its indirect entry; replaced gopkg.in/fsnotify.v1 v1.4.7 with github.com/fsnotify/fsnotify v1.9.0 (moved between require blocks).
ServiceMonitor manifest
install/0000_90_cluster-version-operator_02_servicemonitor.yaml
Removed bearerTokenFile from endpoint; added tlsConfig.certFile and tlsConfig.keyFile to enable TLS client-cert for Prometheus scraping.
Metrics implementation
pkg/cvo/metrics.go
Replaced tokenReview/token-based auth with certificate-based auth using dynamic serving-certificate controller and client CA; added public MetricsOptions (DisableAuthentication, DisableAuthorization); changed RunMetrics and createHttpServer signatures; removed fsnotify-based rotation and token-review helpers; adjusted server lifecycle and authHandler to use CN checks.
Metrics tests
pkg/cvo/metrics_test.go
Rewrote tests to simulate TLS peer certificates and CN-based authorization; removed fake TokenReview client and bearer-token test scaffolding; added x509/tls-based test constructs.
Startup wiring
pkg/start/start.go
Updated call site to construct and pass MetricsOptions (DisableAuthentication/DisableAuthorization) to RunMetrics instead of the prior boolean flag.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

✨ Finishing touches
  • 📝 Generate docstrings

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 290ee8b and 9ad1d4e.

📒 Files selected for processing (2)
  • pkg/cvo/metrics.go
  • pkg/cvo/metrics_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/cvo/metrics.go
  • pkg/cvo/metrics_test.go
🪛 ast-grep (0.40.5)
pkg/cvo/metrics.go

[warning] 297-297: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{ClientAuth: clientAuth}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)


[warning] 325-328: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
ClientAuth: clientAuth,
GetConfigForClient: servingCertController.GetConfigForClient,
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🔇 Additional comments (5)
pkg/cvo/metrics.go (4)

157-181: LGTM! Well-documented authorization handler.

The documentation clearly explains the design rationale for HTTP-layer authorization vs TLS-layer rejection, and the trade-off is explicitly acknowledged in the PR description.


230-232: Good defensive validation.

Correctly prevents an invalid configuration where authorization would be attempted without authentication.


295-305: Static analysis false positive — crypto.SecureTLSConfig already sets secure TLS defaults.

The static analyzer flags missing MinVersion at lines 297 and 326, but crypto.SecureTLSConfig() from library-go wraps the config and sets secure TLS parameters including minimum version. These warnings can be safely ignored.


286-293: Client CA handling for HyperShift is properly configured.

The ConfigMap loading from kube-system/extension-apiserver-authentication is skipped entirely when DisableAuthentication is enabled. In HyperShift mode, DisableAuthentication is explicitly set to true in start.go (line 354), preventing any attempt to load the CA bundle and avoiding failures in hosted environments.

pkg/cvo/metrics_test.go (1)

1029-1099: LGTM! Good test coverage for CN-based authorization.

The tests cover the three primary scenarios: valid CN, invalid CN, and missing certificate. The mock TLS state construction correctly simulates the authorization logic.


Comment @coderabbitai help to get the list of available commands and usage tips.

@DavidHurta
Copy link
Contributor Author

/test ?

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 10, 2025

@DavidHurta: The following commands are available to trigger required jobs:

/test e2e-agnostic-operator
/test e2e-agnostic-operator-techpreview
/test e2e-agnostic-ovn
/test e2e-agnostic-ovn-upgrade-into-change
/test e2e-agnostic-ovn-upgrade-into-change-techpreview
/test e2e-agnostic-ovn-upgrade-out-of-change
/test e2e-agnostic-ovn-upgrade-out-of-change-techpreview
/test e2e-aws-ovn-techpreview
/test e2e-hypershift
/test e2e-hypershift-conformance
/test gofmt
/test images
/test lint
/test okd-scos-images
/test unit
/test verify-deps
/test verify-update
/test verify-yaml

The following commands are available to trigger optional jobs:

/test e2e-agnostic-operator-devpreview
/test e2e-extended-tests
/test okd-scos-e2e-aws-ovn

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-operator
pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-ovn
pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-ovn-upgrade-into-change
pull-ci-openshift-cluster-version-operator-main-e2e-agnostic-ovn-upgrade-out-of-change
pull-ci-openshift-cluster-version-operator-main-e2e-aws-ovn-techpreview
pull-ci-openshift-cluster-version-operator-main-e2e-hypershift
pull-ci-openshift-cluster-version-operator-main-e2e-hypershift-conformance
pull-ci-openshift-cluster-version-operator-main-gofmt
pull-ci-openshift-cluster-version-operator-main-images
pull-ci-openshift-cluster-version-operator-main-lint
pull-ci-openshift-cluster-version-operator-main-okd-scos-images
pull-ci-openshift-cluster-version-operator-main-unit
pull-ci-openshift-cluster-version-operator-main-verify-deps
pull-ci-openshift-cluster-version-operator-main-verify-update
pull-ci-openshift-cluster-version-operator-main-verify-yaml
Details

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Dec 10, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DavidHurta

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 10, 2025
@DavidHurta
Copy link
Contributor Author

/test e2e-agnostic-ovn
/test e2e-hypershift
/test e2e-hypershift-conformance

1 similar comment
@DavidHurta
Copy link
Contributor Author

/test e2e-agnostic-ovn
/test e2e-hypershift
/test e2e-hypershift-conformance

@DavidHurta
Copy link
Contributor Author

/test all

@DavidHurta
Copy link
Contributor Author

/test all

@DavidHurta
Copy link
Contributor Author

DavidHurta commented Dec 13, 2025

Note: PromeCIeus and the e2e-hypershift job can be used to verify that the CVO metrics in a hosted control plane can be scraped (when HyperShift is specified to use the OCP monitoring).

@DavidHurta
Copy link
Contributor Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Dec 13, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/cvo/metrics.go (1)

201-299: Harden the listener TLS config: ensure MinVersion/ciphers apply even if GetConfigForClient returns nil.

The base TLS config is secured via crypto.SecureTLSConfig, but the listener uses a fresh tls.Config literal containing only GetConfigForClient. If the callback returns nil or fails, the listener operates with default TLS settings lacking explicit cipher suite and minimum version constraints. Wrap the listener config with crypto.SecureTLSConfig to guarantee hardened settings at the listener level.

-	tlsConfig := &tls.Config{GetConfigForClient: servingCertController.GetConfigForClient}
+	tlsConfig := crypto.SecureTLSConfig(&tls.Config{GetConfigForClient: servingCertController.GetConfigForClient})
 	go startListening(server, tlsConfig, listenAddress, resultChannel)

Also verify the process has RBAC to read kube-system/extension-apiserver-authentication and that client-ca-file is the intended trust root for Prometheus' client cert chain in all target environments.

🧹 Nitpick comments (3)
pkg/cvo/metrics_test.go (1)

4-6: CN-based TLS auth tests are aligned with the new implementation; consider using httptest.NewRequest for a fully-formed request.
Current http.NewRequest("GET", "url-not-important", nil) is probably fine for this handler, but httptest.NewRequest("GET", "https://example.invalid/metrics", nil) is more idiomatic and avoids edge cases if the handler later reads URL/Host.

Also applies to: 1030-1088

pkg/cvo/metrics.go (2)

126-143: Rename disableAuth to disableAuthorization to avoid confusion (authn vs authz).
Right now createHttpServer(disableAuth bool) is called with metricsOptions.DisableAuthorization, but the parameter name reads like “disable authentication”.

-func createHttpServer(disableAuth bool) *http.Server {
-	if disableAuth {
+func createHttpServer(disableAuthorization bool) *http.Server {
+	if disableAuthorization {
 		handler := http.NewServeMux()
 		handler.Handle("/metrics", promhttp.Handler())
 		server := &http.Server{
 			Handler: handler,
 		}
 		return server
 	}

149-166: CN authorization is very strict; confirm the allowed identity set is complete for all supported deployments.
Hardcoding only system:serviceaccount:openshift-monitoring:prometheus-k8s may be correct, but it will break scraping if the client cert CN differs in hosted control planes / alternate Prometheus setups. Consider allowing a small allowlist (still static) to reduce fragility.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between d4cb3b0 and 920ad71.

⛔ Files ignored due to path filters (69)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/fsnotify/fsnotify/.cirrus.yml is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/.gitignore is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/.mailmap is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/CHANGELOG.md is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/CONTRIBUTING.md is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/LICENSE is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/README.md is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/backend_fen.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/backend_inotify.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/backend_kqueue.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/backend_other.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/backend_windows.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/fsnotify.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/darwin.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/debug_darwin.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/debug_dragonfly.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/debug_freebsd.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/debug_kqueue.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/debug_linux.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/debug_netbsd.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/debug_openbsd.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/debug_solaris.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/debug_windows.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/freebsd.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/internal.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/unix.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/unix2.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/internal/windows.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/shared.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/staticcheck.conf is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/system_bsd.go is excluded by !vendor/**, !**/vendor/**
  • vendor/github.com/fsnotify/fsnotify/system_darwin.go is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/.editorconfig is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/.gitignore is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/.travis.yml is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/AUTHORS is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/CHANGELOG.md is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/CONTRIBUTING.md is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/LICENSE is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/README.md is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/fen.go is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/fsnotify.go is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/inotify.go is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/inotify_poller.go is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/kqueue.go is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/open_mode_bsd.go is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/open_mode_darwin.go is excluded by !vendor/**, !**/vendor/**
  • vendor/gopkg.in/fsnotify.v1/windows.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/cert_key.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/client_ca.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/configmap_cafile_content.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_cafile_content.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_serving_content.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/dynamic_sni_content.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/interfaces.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/named_certificates.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/static_content.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/tlsconfig.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/union_content.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/apiserver/pkg/server/dynamiccertificates/util.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/client-go/tools/events/OWNERS is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/client-go/tools/events/doc.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/client-go/tools/events/event_broadcaster.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/client-go/tools/events/event_recorder.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/client-go/tools/events/fake.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/client-go/tools/events/helper.go is excluded by !vendor/**, !**/vendor/**
  • vendor/k8s.io/client-go/tools/events/interfaces.go is excluded by !vendor/**, !**/vendor/**
  • vendor/modules.txt is excluded by !vendor/**, !**/vendor/**
📒 Files selected for processing (5)
  • go.mod (2 hunks)
  • install/0000_90_cluster-version-operator_02_servicemonitor.yaml (1 hunks)
  • pkg/cvo/metrics.go (5 hunks)
  • pkg/cvo/metrics_test.go (2 hunks)
  • pkg/start/start.go (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • install/0000_90_cluster-version-operator_02_servicemonitor.yaml
  • pkg/cvo/metrics_test.go
  • pkg/start/start.go
  • pkg/cvo/metrics.go
  • go.mod
🧬 Code graph analysis (1)
pkg/start/start.go (1)
pkg/cvo/metrics.go (2)
  • MetricsOptions (196-199)
  • RunMetrics (208-299)
🪛 ast-grep (0.40.0)
pkg/cvo/metrics.go

[warning] 245-245: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)


[warning] 247-247: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{ClientAuth: tls.RequireAndVerifyClientCert}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)


[warning] 273-273: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{GetConfigForClient: servingCertController.GetConfigForClient}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🔇 Additional comments (5)
install/0000_90_cluster-version-operator_02_servicemonitor.yaml (1)

14-21: Verify Prometheus mounts metrics-client-certs at the referenced file paths (otherwise scraping will fail).
ServiceMonitor now expects client cert/key at /etc/prometheus/secrets/metrics-client-certs/{tls.crt,tls.key}; ensure the cluster-monitoring-managed Prometheus pod has that secret mounted and populated in all target topologies (self-managed + hosted).

go.mod (1)

29-29: Dependency changes look consistent with dynamiccertificates adoption; please verify module graph is tidy and conflict-free.
Run a go mod tidy-equivalent in CI and ensure no remaining imports of gopkg.in/fsnotify.v1, and no k8s.io/* version skew introduced by the new direct k8s.io/apiserver requirement.

Also applies to: 42-42

pkg/start/start.go (1)

353-358: Double-check HyperShift semantics: this disables both client-cert auth and CN authorization.
Wiring to o.HyperShift is clean, but please confirm the hosted-control-plane scraping model really expects unauthenticated metrics over TLS (vs. mTLS with different client identity).

pkg/cvo/metrics.go (2)

18-20: Good move switching to dynamiccertificates; verify k8s.io/apiserver integration matches the vendored K8s minor version.
This import-level change is appropriate for hot-reloading certs/CA; just ensure no k8s.io/* skew and that dynamiccertificates APIs used here match v0.34.1 expectations.


196-199: MetricsOptions is a clean API shape for callers.

@DavidHurta
Copy link
Contributor Author

/test e2e-hypershift
/test e2e-agnostic-ovn

@DavidHurta
Copy link
Contributor Author

/test all

@DavidHurta
Copy link
Contributor Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Dec 19, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/cvo/metrics.go (1)

149-166: Certificate-based authorization correctly implemented.

The CN validation logic is sound: it checks for the presence of peer certificates and validates that the Common Name matches the expected Prometheus service account. The HTTP status codes (401 for missing certificate, 403 for unauthorized CN) are appropriate.

Note: The CN "system:serviceaccount:openshift-monitoring:prometheus-k8s" is hardcoded. Consider whether this should be configurable via MetricsOptions if other clients might need access in the future.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 920ad71 and ff6d96e.

📒 Files selected for processing (4)
  • install/0000_90_cluster-version-operator_02_servicemonitor.yaml (1 hunks)
  • pkg/cvo/metrics.go (5 hunks)
  • pkg/cvo/metrics_test.go (2 hunks)
  • pkg/start/start.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/cvo/metrics_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • install/0000_90_cluster-version-operator_02_servicemonitor.yaml
  • pkg/start/start.go
  • pkg/cvo/metrics.go
🧬 Code graph analysis (1)
pkg/start/start.go (1)
pkg/cvo/metrics.go (2)
  • MetricsOptions (196-199)
  • RunMetrics (208-312)
🪛 ast-grep (0.40.0)
pkg/cvo/metrics.go

[warning] 256-256: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{ClientAuth: clientAuth}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)


[warning] 283-286: MinVersionis missing from this TLS configuration. By default, TLS 1.2 is currently used as the minimum when acting as a client, and TLS 1.0 when acting as a server. General purpose web applications should default to TLS 1.3 with all other protocols disabled. Only where it is known that a web server must support legacy clients with unsupported an insecure browsers (such as Internet Explorer 10), it may be necessary to enable TLS 1.0 to provide support. AddMinVersion: tls.VersionTLS13' to the TLS configuration to bump the minimum version to TLS 1.3.
Context: tls.Config{
ClientAuth: clientAuth,
GetConfigForClient: servingCertController.GetConfigForClient,
}
Note: [CWE-327]: Use of a Broken or Risky Cryptographic Algorithm [OWASP A03:2017]: Sensitive Data Exposure [OWASP A02:2021]: Cryptographic Failures [REFERENCES]
https://owasp.org/Top10/A02_2021-Cryptographic_Failures

(missing-ssl-minversion-go)

🔇 Additional comments (10)
install/0000_90_cluster-version-operator_02_servicemonitor.yaml (1)

14-21: LGTM! TLS client certificate authentication correctly configured.

The ServiceMonitor now uses TLS client certificates (certFile and keyFile) for authentication instead of bearer tokens, aligning with the mTLS implementation in pkg/cvo/metrics.go that validates the client certificate's CN.

pkg/cvo/metrics.go (8)

18-19: LGTM! Imports correctly updated for certificate-based approach.

The new imports for dynamiccertificates and kubernetes client replace the previous token-based authentication dependencies and support the dynamic certificate loading implementation.


126-143: LGTM! Simplified server creation aligns with TLS-based authentication.

The function signature correctly removes the authentication client dependency since authentication is now handled at the TLS layer via client certificates, with authorization delegated to the authHandler.


196-199: LGTM! Clean options struct for metrics configuration.

The MetricsOptions struct provides a clear, type-safe way to configure authentication and authorization behavior for the metrics server.


208-223: LGTM! Serving certificate controller correctly initialized.

The serving certificate controller is properly set up to watch and automatically reload the server's TLS certificate and key files. The RunOnce initialization before starting the background watcher ensures the files are loaded before the server starts.


225-249: LGTM! Client CA controller correctly set up when authentication is enabled.

The conditional creation and initialization of the client CA controller is appropriate—it's only needed when client certificate authentication is enabled. The controller watches the extension-apiserver-authentication ConfigMap in kube-system for CA changes, which aligns with Kubernetes' standard approach for extension API server authentication.


276-288: LGTM! Server and TLS configuration properly wired.

The HTTP server creation respects the DisableAuthorization setting, and the TLS configuration correctly uses GetConfigForClient to enable dynamic certificate updates. The ClientAuth setting in the final config ensures proper fallback behavior if the callback fails.


290-311: LGTM! Server lifecycle properly managed with graceful shutdown.

The implementation correctly handles the run/shutdown contexts, waits for server termination, performs graceful shutdown on context cancellation, and properly collects results from the server goroutine. Error handling and logging are appropriate.


252-267: This concern is not valid. crypto.SecureTLSConfig from github.com/openshift/library-go/pkg/crypto automatically sets MinVersion to the cluster's default TLS version (via DefaultTLSVersion()) when MinVersion is not explicitly set in the passed tls.Config. The code in metrics.go correctly relies on this behavior and does not need explicit MinVersion configuration.

Likely an incorrect or invalid review comment.

pkg/start/start.go (1)

353-357: LGTM! MetricsOptions correctly configured for HyperShift mode.

The creation of the MetricsOptions struct with both DisableAuthentication and DisableAuthorization set to o.HyperShift is appropriate. This ensures that in HyperShift deployments, both authentication and authorization are disabled together, while in standard deployments, both are enabled.

@DavidHurta
Copy link
Contributor Author

/retest

1 similar comment
@DavidHurta
Copy link
Contributor Author

/retest

@DavidHurta
Copy link
Contributor Author

/test e2e-hypershift-conformance
/test e2e-hypershift

@DavidHurta
Copy link
Contributor Author

/test all

@DavidHurta
Copy link
Contributor Author

/test all

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jan 6, 2026
@openshift-ci-robot
Copy link
Contributor

@DavidHurta: This pull request references Jira Issue OCPBUGS-66898, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from dis016 January 6, 2026 16:34
@DavidHurta
Copy link
Contributor Author

Testing

Standalone OpenShift

Prometheus can scrape using mTLS

$ oc --as system:admin -n openshift-monitoring exec -c prometheus pod/prometheus-k8s-0 -- curl -si --cacert /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt --cert /etc/prometheus/secrets/metrics-client-certs/tls.crt --key /etc/prometheus/secrets/metrics-client-certs/tls.key https://cluster-version-operator.openshift-cluster-version.svc.cluster.local:9099/metrics
HTTP/1.1 200 OK
Content-Type: text/plain; version=0.0.4; charset=utf-8; escaping=underscores
Date: Tue, 06 Jan 2026 15:39:21 GMT
Transfer-Encoding: chunked

# HELP cluster_installer Reports info about the installation process and, if applicable, the install tool. The type is either 'openshift-install', indicating that openshift-install was used to install the cluster, or 'other', indicating that an unknown process installed the cluster. The invoker is 'user' by default, but it may be overridden by a consuming tool. The version reported is that of the openshift-install that was used to generate the manifests and, if applicable, provision the infrastructure.
# TYPE cluster_installer gauge

Client needs to provide its certificate

$ oc --as system:admin -n openshift-monitoring exec -c prometheus pod/prometheus-k8s-0 -- curl --cacert /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt https://cluster-version-operator.openshift-cluster-version.svc.cluster.local:9099/metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (56) OpenSSL SSL_read: error:0A00045C:SSL routines::tlsv13 alert certificate required, errno 0
command terminated with exit code 56

The CN is verified correctly

The metrics-server Pod has its client certificate signed by the same issuer as does Prometheus. Let's try to scrape the CVO using the metric-server certs. I am using --insecure to bypass the need for a relevant local CA bundle file.

$ oc --as system:admin -n openshift-monitoring exec pod/metrics-server-676544b7fb-kzjbp -- curl --insecure --cert /etc/tls/metrics-server-client-certs/tls.crt --key /etc/tls/metrics-server-client-certs/tls.key  https://cluster-version-operator.openshift-cluster-version.svc.cluster.local:9099/metrics | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (56) OpenSSL SSL_read: error:0A000412:SSL routines::sslv3 alert bad certificate, errno 0
command terminated with exit code 56

Checking the CVO logs:

I0106 15:39:30.018925       1 metrics.go:183] Access denied for CN: system:serviceaccount:openshift-monitoring:metrics-server

HyperShift

PromeCleus can be used in combination with the e2e-hypershift job to verify that the CVO metrics in a hosted control plane can be scraped. However, to my knowledge, the job utilizes the OCP monitoring. The metrics endpoint needs to be accessible by a custom monitoring stack as well. For the time being, that means having a disabled authorization and authentication to allow any client to scrape the hosted CVO.

Given that the OCP monitoring is able to scrape hosted CVO, it is safe to assume that a custom monitoring stack can as well. The hosted CVO does not even have access to the required CA bundle.

Prometheus output using PromeCIeus from a e2e-hypershift job run showcases a number of clusters:
image

The E2E functionality requires a follow-up with the HyperShift team regarding the E2E testing to not potentially introduce a regression with this PR.

@DavidHurta
Copy link
Contributor Author

/hold

I am checking with the HyperShift team.

However, the PR is ready for a review.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2026
@DavidHurta DavidHurta marked this pull request as ready for review January 6, 2026 16:59
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 6, 2026
@DavidHurta
Copy link
Contributor Author

/test e2e-agnostic-ovn-upgrade-into-change
/test e2e-aws-ovn-techpreview
/test e2e-hypershift

@openshift-ci-robot
Copy link
Contributor

@DavidHurta: This pull request references Jira Issue OCPBUGS-66898, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @dis016

Details

In response to this:

This pull request proposes to migrate the CVO metrics endpoint from bearer token authentication to mutual TLS (mTLS) to comply with OpenShift metrics conventions.

Key changes:

  • Replaced custom certificate watching logic with the k8s.io/apiserver/pkg/server/dynamiccertificates package to simplify logic
  • Authenticate clients using mutual TLS (mTLS)
  • Authorize clients by verifying CN
  • Perform authorization once during TLS handshake instead of per-request
  • Add configurable authentication/authorization options for HyperShift compatibility
  • Update ServiceMonitor to use mTLS instead of bearer tokens

Breaking Changes:

  • Prometheus must now authenticate using client certificates signed by a trusted cluster CA. Clients without valid certificates are rejected during the TLS handshake.

Trade-offs:

  • Authorization at the TLS layer improves performance but produces less descriptive errors on failures.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

cn := verifiedChains[0][0].Subject.CommonName
if cn != metricsAllowedClientCN {
klog.V(4).Infof("Access denied for CN: %s", cn)
return errors.New("unauthorized CN")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't mind as much for logs, but for error messages, which might eventually reach users, can we avoid accronyms with unauthorized common name or similar instead of using CN?

Also, not clear to me why we'd log the rejected CN in the previous Infof but not include it in the error message here. The client has the certificate, and they can go look at their CN there, but if we include the CN in the error message, maybe that saves them a bit of work?

Copy link
Contributor Author

@DavidHurta DavidHurta Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me double-check where exactly does the returned error bubble up to.

Copy link
Contributor Author

@DavidHurta DavidHurta Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error value bubbles up to the server's logger, that is, the value is logged in CVO logs (see a net/http line). This can be verified by verifying the CVO's logs. We currently use the default server's logger.

2026/01/08 17:08:57 http: TLS handshake error from 10.0.110.147:42700: unauthorized CN

The user observers only a TLS alert message. See a crypto/tls code section:

	if c.config.VerifyPeerCertificate != nil && !echRejected {
		if err := c.config.VerifyPeerCertificate(certificates, c.verifiedChains); err != nil {
			c.sendAlert(alertBadCertificate)
			return err
		}
	}

This is unfortunate to some degree, as there is a specific access_denied TLS alert message that I would probably want to return instead of bad_certificate for failed authorizations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am starting to be open to reverting the 290ee8b commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could highlight the security aspects in more detail in the code comments.

The optimization was not my main goal with the commit. It was rather to move the authorization into the TLS layer and to make the mechanism's nature more explicit. The optimization was a nice plus.
However, the trade-off is more difficult debugging and technically wrong return codes to clients. This is an internal service, so I don't presume to cause it too many disruptions. However, I am inclined into reverting the commit to keep things correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to your original question.

I don't mind as much for logs, but for error messages, which might eventually reach users, can we avoid accronyms with unauthorized common name or similar instead of using CN?

Yes, definitely.

Also, not clear to me why we'd log the rejected CN in the previous Infof but not include it in the error message here. The client has the certificate, and they can go look at their CN there, but if we include the CN in the error message, maybe that saves them a bit of work?

Given we have found out that the client does not get to see these logs, we can delete the klog.V(4).Infof("Access denied for CN: %s", cn) line entirely, and we can include the CN in error itself.

Copy link
Contributor Author

@DavidHurta DavidHurta Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on Slack thread and this discussion.

  • Reverting the TLS authorization in c46a6cf
  • Improving authorization docs in d4bdd70
  • Using full names in 534feae

// they change on disk using dynamiccertificates.
func RunMetrics(runContext context.Context, shutdownContext context.Context, listenAddress, certFile, keyFile string, restConfig *rest.Config, metricsOptions MetricsOptions) error {
if listenAddress == "" {
return errors.New("TLS configuration is required to serve metrics")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably worth updating this error message to point out that listenAddress is required. Looks like it's been stale since 0e6136c 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 021e46f.

origKeyChecksum, err := checksumFile(keyFile)
if err != nil {
return fmt.Errorf("Failed to initialize key file checksum: %w", err)
if err := servingContentController.RunOnce(metricsContext); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that the RunOnce approach is intended to catch issues early and allow this function to hard fail, before launching the Run method. But there's more going on in Run than in RunOnce, and having no way to get information back out of the Run goroutine to tell you how it's going seems like it could be problematic. What if we have some kind of issue that would need a container restart to recover? But 🤷, I guess if I was particularly motivated, I could try to push for a clearer feedback mechanism in the upstream library. And ClusterVersionOperatorDown gives us a chance at calling in an admin to manually recover things if something in the metrics-serving pipe breaks. So not a blocker, just something that I'm grumbling about, in the hopes that it might motivate someone else to chase upstream about moving to an API with better feeback mechanisms ;)

Copy link
Contributor Author

@DavidHurta DavidHurta Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth noting for transparency that the dynamiccertificates.NewDynamicServingCertificateController does allow to consume an event recorder. It would not fix the CVO, but it could potentially help debug some issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not currently set up in the PR.

Copy link
Contributor Author

@DavidHurta DavidHurta Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was originally done to keep the scope down, and it eventually ended up in the PR. Also, the related docs, which reference rhobs/observailbity-operator changes that initially omitted the logging due to:

			// Disabling events for now because the controller generates
			// invalid events when used with DynamicServingContentFromFiles.

However, I see that the project's contributors added the logging a year ago!

I don't see a valid point in defending to not bring the logging with this PR other than "let's ship it, life is short :shipit: "

}

clientAuth := tls.NoClientCert
if !metricsOptions.DisableAuthentication {
Copy link
Member

@wking wking Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this seems like it would be slightly simpler if you shifted the clientAuth := tls.NoClientCert default up to the var clientCA ... level above, and merged the clientAuth = tls.RequireAndVerifyClientCert line into the previous !metricsOptions.DisableAuthentication block, instead of having two of those blocks in such close proximity.

Same for baseTlSConfig too, I think collecting all the insecure defaults in one place, and then overriding all of them in the secure section. Or maybe for the cheap ones we can initialize to the secure settings, like:

clientAuth := tls.RequireAndVerifyClientCert
baseTlSConfig := crypto.SecureTLSConfig(&tls.Config{ClientAuth: clientAuth})
baseTlSConfig.VerifyPeerCertificate = verifyPeerCertificate
var clientCA dynamiccertificates.CAContentProvider
var clientCAController *dynamiccertificates.ConfigMapCAController
if metricsOptions.DisableAuthentication { // weakening
	clientAuth = tls.NoClientCert
	baseTlSConfig.VerifyPeerCertificate = someDefault
} else { // expensive strengthening
	kubeClient, err := kubernetes.NewForConfig(restConfig)
	...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 86fcd9a.


resultChannelCount++
go func() {
servingCertController.Run(1, metricsContext.Done())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering about that workers=1 argument, 😂:

doesn't matter what workers say, only start one.

continue
}
case <-metricsContext.Done():
klog.Infof("Clean shutdown requested: %v", metricsContext.Err())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the klog line will include metrics.go, but maybe mention metrics here, just to be extra clear? Clean metrics shutdown requested: ...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in f49bde3.

if loopError == nil {
loopError = shutdownError
} else if shutdownError != nil { // log the error we are discarding
klog.Errorf("Failed to gracefully shut down metrics server: %v", shutdownError)
Copy link
Member

@wking wking Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to keep the logging down here, instead of shifting it to log unconditionally, otherwise you'll double up on the loopError message (which is getting logged below at the Infof level, and then again (presumably at the Errorf level) in whatever is at the top of the bubbled-back return loopError pipeline.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change deserved its own commit.

It seems to be possible for the previous logic:

			shutdown = true
			shutdownError := server.Shutdown(shutdownContext)

			if loopError == nil {
				loopError = shutdownError
			} else if shutdownError != nil { // log the error we are discarding
				klog.Errorf("Failed to gracefully shut down metrics server: %v", shutdownError)
			}

to not print an error from the http server not shutting down correctly. This was not previously possible as only one go routine existed.

Imagine a scenario:

  1. loopError is nil
  2. metricsContext is cancelled
  3. http server shuts down incorrectly; the shutdownError holds the error
  4. not printed immediately; the value gets saved to loopError
  5. loopError may get overridden while other go routines are shutting down
  6. The Graceful shutdown complete for metrics server: %s outputs the last value saved in loopError
  7. the shutdownError gets forgotten

This does not resolve your comment regarding the even triple same logs. Maybe it's worth revisiting the purpose of the loopError and my usage of it, as the logic seems to be a bit too tricky at first glance. My proposed changes should log every failure and return the last loop's error. Without the change, it is still possible to even triple log failures (handleServerResult, the Graceful shutdown complete for metrics server, and the parent function), but we can also forget the http server's failures all together.

DavidHurta and others added 7 commits January 13, 2026 23:30
This reverts commit 290ee8b.

The used library does not allow to modify the returned
TLS error message to return `access_denied` on authorization failures.
The logic would proceed to return `bad_certificate` on authorization
failures.

Revert to HTTP authorization to be able to return a user-friendly
return values, which are correct.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@DavidHurta DavidHurta requested a review from wking January 13, 2026 22:33
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 14, 2026

@DavidHurta: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic-ovn 9ad1d4e link true /test e2e-agnostic-ovn
ci/prow/e2e-agnostic-operator 9ad1d4e link true /test e2e-agnostic-operator

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@DavidHurta
Copy link
Contributor Author

Testing for 9ad1d4e

Standalone OpenShift

Prometheus can scrape using mTLS

$ oc --as system:admin -n openshift-monitoring exec -c prometheus pod/prometheus-k8s-0 -- curl -si --cacert /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt --cert /etc/prometheus/secrets/metrics-client-certs/tls.crt --key /etc/prometheus/secrets/metrics-client-certs/tls.key https://cluster-version-operator.openshift-cluster-version.svc.cluster.local:9099/metrics | head
HTTP/1.1 200 OK
Content-Type: text/plain; version=0.0.4; charset=utf-8; escaping=underscores
Date: Wed, 14 Jan 2026 14:10:23 GMT
Transfer-Encoding: chunked

# HELP cluster_installer Reports info about the installation process and, if applicable, the install tool. The type is either 'openshift-install', indicating that openshift-install was used to install the cluster, or 'other', indicating that an unknown process installed the cluster. The invoker is 'user' by default, but it may be overridden by a consuming tool. The version reported is that of the openshift-install that was used to generate the manifests and, if applicable, provision the infrastructure.
# TYPE cluster_installer gauge
cluster_installer{invoker="openshift-internal-ci/release-openshift-origin-installer-launch-aws-modern/2011417831714852864",type="openshift-install",version="v4.21.0"} 1
# HELP cluster_operator_condition_transitions Reports the number of times that a condition on a cluster operator changes status
# TYPE cluster_operator_condition_transitions gauge

Client needs to provide its certificate

$ oc --as system:admin -n openshift-monitoring exec -c prometheus pod/prometheus-k8s-0 -- curl --cacert /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt https://cluster-version-operator.openshift-cluster-version.svc.cluster.local:9099/metrics
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (56) OpenSSL SSL_read: error:0A00045C:SSL routines::tlsv13 alert certificate required, errno 0
command terminated with exit code 56

The CN is verified correctly

The metrics-server Pod has its client certificate signed by the same issuer as does Prometheus. Let's try to scrape the CVO using the metric-server certs. I am using --insecure to bypass the need for a relevant local CA bundle file.

$ oc --as system:admin -n openshift-monitoring exec pod/metrics-server-78f96dd467-bddsp -- curl -si --insecure --cert /etc/tls/metrics-server-client-certs/tls.crt --key /etc/tls/metrics-server-client-certs/tls.key  https://cluster-version-operator.openshift-cluster-version.svc.cluster.local:9099/metrics
HTTP/1.1 403 Forbidden
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Wed, 14 Jan 2026 14:12:46 GMT
Content-Length: 84

unauthorized common name: system:serviceaccount:openshift-monitoring:metrics-server

CA is verified during the TLS handshake

Using different certs from the metrics-server Pod:

$ oc --as system:admin -n openshift-monitoring exec pod/metrics-server-78f96dd467-bddsp -- curl --insecure --cert /etc/tls/private/tls.crt --key /etc/tls/private/tls.key  https://cluster-version-operator.openshift-cluster-version.svc.cluster.local:9099/metrics 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (56) OpenSSL SSL_read: error:0A000418:SSL routines::tlsv1 alert unknown ca, errno 0
command terminated with exit code 56

Checking the CVO logs:

The logs need a higher verbosity level. Switching to TechPreviewNoUpgrade feature set, switching to Trace log level, restarting the CVO, and running:

$ oc logs -n openshift-cluster-version deployment/cluster-version-operator | grep metrics.go | grep -v "skipping metrics for" | uniq
I0114 14:42:35.282860       1 metrics.go:340] Waiting on 4 outstanding goroutines.
I0114 14:42:35.283286       1 metrics.go:193] Metrics port listening for HTTPS on 0.0.0.0:9099
I0114 14:42:53.020998       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:42:55.148626       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:43:23.016661       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:43:25.143561       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:43:53.016801       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:43:55.143616       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:44:23.016536       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:44:25.143496       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:44:53.016644       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:44:55.143915       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:45:23.016094       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:45:25.143359       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:45:53.016683       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:45:55.143721       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:46:23.016811       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:46:25.143878       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:46:53.016674       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:46:55.143358       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:47:23.016298       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:47:25.143710       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:47:53.016235       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:47:55.143829       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:48:23.018661       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:48:25.143923       1 metrics.go:179] Access granted for common name: system:serviceaccount:openshift-monitoring:prometheus-k8s
I0114 14:48:25.422465       1 metrics.go:174] Access denied for common name: system:serviceaccount:openshift-monitoring:metrics-server

HyperShift

For information, see a previous comment.

Prometheus output using PromeCIeus from a e2e-hypershift job run showcases a number of clusters:

image

@DavidHurta
Copy link
Contributor Author

/test e2e-agnostic-ovn
/test e2e-agnostic-operator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants