Draft
added 18 commits
February 16, 2026 14:19
This commit adds complete telemetry support for PX-Backup to upload metrics and logs to Pure1.

Key Changes:
- Add telemetry configuration section in values.yaml with endpoint definitions
- Define endpoints (osbEndpoint, registerEndpoint, restEndpoint) in values.yaml as single source of truth
- Remove environment-based conditionals from templates for cleaner configuration

New Helm Templates:
- configmap.yaml: Main telemetry configuration
- registration-deployment.yaml: Pure1 registration and certificate management
- registration-configmap.yaml: Envoy proxy config for registration
- metrics-collector-deployment.yaml: Metrics collection and upload
- metrics-collector-configmap.yaml: Envoy proxy config for metrics
- logs-collector-deployment.yaml: Log collection and upload
- logs-collector-configmap.yaml: Envoy proxy config for logs
- rbac.yaml: Service account and RBAC permissions

Architecture:
- 3-pod architecture: registration, metrics-collector, logs-collector
- Uses PX-Enterprise proven patterns (cert_checker, security context, resource limits)
- Certificate-based mTLS authentication with Pure1
- Envoy proxy for all telemetry traffic

Configuration:
- All endpoints defined in values.yaml (no scattered conditionals)
- Easy to override for different environments via --set or custom values files
- Removed unused 'environment' field for cleaner configuration

Tested on BK32 cluster with successful:
- OSB registration (HTTP 202)
- Pure1 certificate issuance (HTTP 200)
- Metrics upload every 10 seconds (HTTP 200)
- Log upload on-demand (HTTP 200)
- Add init containers to wait for PX-Backup pod and appliance-id
- Add wait-for-certificate init container to prevent premature uploads
- Add phonehome-cache volume to prevent duplicate uploads
- Add RBAC for logs-collector pod
- Improve error messages and logging in init containers
- Standardize all telemetry pods to use portworx/telemetry-envoy:1.1.18
  - Metrics collector: purestorage/telemetry-envoy:1.0.0 → portworx/telemetry-envoy:1.1.18
  - Logs collector: purestorage/telemetry-envoy:1.0.0 → portworx/telemetry-envoy:1.1.18
  - Registration pod: already using portworx/telemetry-envoy:1.1.18 (no change)
- Fix double-encoded JSON in reporting data collection
  - Add 'jq -r .json_data' to extract inner JSON from reporting API response
  - Prevents double-encoded JSON in uploaded log bundles to Pure1
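For reviewers unfamiliar with the double-encoding issue, here is a minimal Python sketch of what `jq -r .json_data` achieves: the reporting response carries its payload as a JSON string inside the `json_data` field, so it has to be decoded a second time before bundling. Only the field name `json_data` comes from the commit; the function name and the sample payload are hypothetical, and the real collector does this in shell with `jq`, not in Python.

```python
import json


def extract_reporting_json(api_response: str):
    """Return the inner reporting document from a response whose
    json_data field may hold JSON serialized as a string."""
    outer = json.loads(api_response)
    inner = outer["json_data"]
    # A string value means the payload was encoded twice; decode it once more.
    if isinstance(inner, str):
        return json.loads(inner)
    # Already a JSON object: nothing left to unwrap.
    return inner


# Hypothetical response body, made up for illustration only.
sample = json.dumps({"json_data": json.dumps({"backups": 3})})
```

Uploading `outer["json_data"]` without the second decode is what would leave escaped quotes in the bundle.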
This commit enhances the logs collector script with comprehensive improvements:

Robustness Improvements:
- Add structured logging with timestamps and severity levels (INFO, WARN, ERROR, DEBUG)
- Implement graceful shutdown handling (SIGTERM, SIGINT, SIGHUP)
- Add comprehensive input validation for environment variables
- Add timeout configuration for curl and grpcurl operations
- Add disk space validation before collection
- Improve error handling with per-function error checks

Bug Fixes:
- Fix multi-container pod handling with --all-containers flag
- Rename POD_LABEL to PXBACKUP_POD_LABEL for clarity
- Add startup delay (60s default) to prevent timing issues during helm upgrades

New Features:
- Track and log upload status from grpcurl responses
- Implement log file cleanup (3-day retention) to prevent disk space issues
- Add grpcurl installation check and automatic installation
- Add detailed debug logging capability

Configuration:
- Add startupDelay configuration option (default: 60 seconds)
- Maintain backward compatibility with existing configurations

All improvements have been tested and verified on bk36 cluster.
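The 3-day retention rule can be sketched as a pure selection step, shown here in Python for illustration. This is an assumption about the behavior, not the script's actual code: the function name, the file names, and the choice to compare modification times against a strict cutoff are all hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def select_expired(mtimes, now, retention_days=3):
    """Return the names of files whose modification time is older than
    the retention window. mtimes maps file name -> epoch seconds."""
    cutoff = now - retention_days * SECONDS_PER_DAY
    # Strictly older than the cutoff; a file exactly at the boundary is kept.
    return sorted(name for name, mtime in mtimes.items() if mtime < cutoff)


# Hypothetical file names and timestamps, for illustration only.
example = {
    "old.log": 6 * SECONDS_PER_DAY,
    "edge.log": 7 * SECONDS_PER_DAY,
    "new.log": 9 * SECONDS_PER_DAY,
}
```

Keeping the selection separate from the deletion makes the rule easy to check without touching the disk.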
- Add excludePodPatterns configuration option in values.yaml
- Default exclusions: alertmanager,prometheus,frontend
- Update logs-collector-deployment.yaml to use EXCLUDE_POD_PATTERNS env var
- Clear EXCLUDE_PATTERNS (was used for log line filtering, now empty)

This makes pod exclusion configurable via Helm values, allowing users to customize which pods are excluded from telemetry log collection.
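As a rough illustration of how a comma-separated `EXCLUDE_POD_PATTERNS` value could be applied, here is a Python sketch. It assumes plain substring matching against pod names, which the commit does not confirm; the function names and pod names are hypothetical.

```python
def parse_patterns(raw):
    """Split a comma-separated pattern string, dropping blanks and whitespace."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def filter_pods(pod_names, raw_patterns):
    """Keep only pods whose name contains none of the exclusion patterns.
    Substring matching is an assumption made for this sketch."""
    patterns = parse_patterns(raw_patterns)
    return [name for name in pod_names
            if not any(pattern in name for pattern in patterns)]


# Hypothetical pod names; the pattern string is the default from the commit.
pods = ["px-backup-abc", "pxcentral-frontend-xyz", "prometheus-0", "alertmanager-1"]
kept = filter_pods(pods, "alertmanager,prometheus,frontend")
```

With an empty pattern string nothing is excluded, which matches the idea that users can override the default list.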
…ention

- Update logfile_patterns to match new Go binary file naming:
  - /var/cores/px-backup-logs-*.log.gz (pod logs)
  - /var/cores/px-backup-reporting-*.log.gz (reporting data)
- This matches the original shell script naming convention
- Ensures log-upload-service can scan and upload files to Pure1
- Remove DownstreamTlsContext from listener (log-upload-service connects via plain HTTP)
- Add proper UpstreamTlsContext to cluster with client certificate via SDS
- Fix SDS configuration to use correct private key filename (/appliance-cert/private_key)
- Add server certificate validation for phonehome.portworx.com

This fixes the 'broken pipe' and 'connection reset' errors that were preventing log uploads to Pure1. The listener now accepts plain HTTP from log-upload-service and Envoy handles mTLS to Pure1 on the upstream side.
Move all telemetry-related images from scattered locations in pxbackup.telemetry section to centralized images section, following the existing PX-Central pattern.

Changes:
- Move telemetry config from pxbackup.telemetry to top-level telemetry
- Add 6 telemetry images to centralized images section:
  * telemetryInitContainerImage (kubectl)
  * telemetryEnvoyImage
  * telemetryRegistrationImage (ccm-go)
  * telemetryMetricsCollectorImage (realtime-metrics)
  * telemetryDataCollectorImage
  * telemetryLogUploadImage
- Update all 3 telemetry deployment templates to use new image paths
- Remove duplicate image configurations from telemetry section
- Use global images.pullPolicy instead of per-image settings
- Delete obsolete logs-collector-configmap.yaml (728 lines)
- Remove appliance-id Secret volume and volumeMount from registration-deployment.yaml
- Remove applianceIdPath from CCM-Go config in registration-configmap.yaml
- Add comment explaining CCM-Go reads from APPLIANCE_ID environment variable
- Simplifies pod spec: reduces volume mounts from 3 to 2
- Single source of truth: px-backup-telemetry-config ConfigMap only
This commit removes all init containers from telemetry deployments and replaces them with Kubernetes-native dependency primitives.

Changes:
- Remove configmap.yaml (PX-Backup creates ConfigMap at runtime)
- Remove all 3 init containers from logs-collector-deployment.yaml:
  * update-envoy-config (Envoy ConfigMap update)
  * wait-for-appliance-id (appliance-id wait)
  * wait-for-certificate (Pure1 certificate wait)
- Remove all 3 init containers from metrics-collector-deployment.yaml:
  * cert-checker (certificate wait)
  * update-envoy-config (Envoy config update)
  * wait-for-certificate (Pure1 certificate wait)
- Remove telemetryInitContainerImage from values.yaml
- Remove unnecessary RBAC rules for ConfigMap access
- Change optional: true to optional: false for dependencies:
  * px-backup-telemetry-config ConfigMap
  * px-backup-telemetry-logs-envoy-config ConfigMap
  * px-backup-telemetry-metrics-envoy-config ConfigMap
  * pure-telemetry-certs Secret

Architecture:
- PX-Backup creates px-backup-telemetry-config ConfigMap with appliance-id
- PX-Backup updates all 3 Envoy ConfigMaps with actual appliance-id
- Pods use optional: false to wait for dependencies (Kubernetes-native)
- No init containers needed - cleaner, simpler architecture

Benefits:
- Eliminates 9 init containers (3 per deployment × 3 deployments)
- Reduces complexity and startup time
- Uses Kubernetes-native dependency management
- Follows cloud-native best practices
…or appliance-id preservation

1. Helm Lookup Implementation for Appliance-ID Preservation
   - Implemented Helm lookup logic in all three Envoy ConfigMap templates
   - Preserves appliance-id during helm upgrades to prevent 401 errors
   - Uses regex pattern to extract UUID from existing ConfigMaps
   - Falls back to APPLIANCE_ID_PLACEHOLDER for fresh installations

2. REST API Support for Log Upload Service
   - Added REST API configuration alongside existing gRPC implementation
   - Exposed HTTP port 8080 for REST API endpoint
   - Enables flexible log upload triggering via both gRPC and REST

Logs Collector Deployment (logs-collector-deployment.yaml):
- Added REST API configuration environment variables
- Exposed HTTP port 8080 for REST API endpoint
- Added LOG_UPLOAD_REST_PORT environment variable
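The preserve-or-fallback logic described in item 1 can be sketched as follows. This is a Python illustration of the idea only; the templates themselves use Helm's lookup mechanism, and the exact regex, the function name, and the sample ConfigMap text here are assumptions. Only `APPLIANCE_ID_PLACEHOLDER` is taken from the commit.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID layout (assumed pattern).
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
PLACEHOLDER = "APPLIANCE_ID_PLACEHOLDER"


def preserved_appliance_id(existing_config):
    """Return the UUID found in an existing ConfigMap body, or the
    placeholder when there is no ConfigMap or no UUID in it."""
    if not existing_config:
        # Fresh installation: nothing to preserve.
        return PLACEHOLDER
    match = UUID_RE.search(existing_config)
    return match.group(0) if match else PLACEHOLDER


# Hypothetical ConfigMap content, for illustration only.
existing = "x-appliance-id: 123e4567-e89b-12d3-a456-426614174000"
```

On upgrade the extracted value is rendered back into the template, so the registered identity does not change and Pure1 keeps accepting the uploads.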
- Implement telemetry.environment (staging/production) to switch endpoints
- Move planId into endpoints block for environment-specific configuration
- Remove legacy endpoint fields (osbEndpoint, registerEndpoint, restEndpoint)
- Update all telemetry templates to use index function for endpoint lookup
- Update image tags to 2.11.0-staging
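The environment-keyed lookup amounts to indexing a per-environment block by the value of telemetry.environment. A Python sketch of that shape is below; the key names inside each block, the plan IDs, and the hosts are hypothetical placeholders and do not reflect the chart's real values.

```python
# Hypothetical values structure; every plan ID and URL is a placeholder.
VALUES = {
    "telemetry": {
        "environment": "staging",
        "endpoints": {
            "staging": {
                "planId": "hypothetical-staging-plan",
                "osb": "https://osb.staging.example.invalid",
            },
            "production": {
                "planId": "hypothetical-production-plan",
                "osb": "https://osb.example.invalid",
            },
        },
    }
}


def resolve_endpoints(values):
    """Return the endpoint block for the configured environment,
    mirroring an index-style lookup in a template."""
    telemetry = values["telemetry"]
    environment = telemetry["environment"]
    try:
        return telemetry["endpoints"][environment]
    except KeyError:
        raise ValueError(f"unknown telemetry environment: {environment!r}")
```

Failing loudly on an unknown environment is a design choice of this sketch; the commit does not say how the templates handle that case.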
- Remove all endpoint URLs from Helm templates
- Endpoints now defined in px-backup Go code
- Remove OSB_ENDPOINT and OSB_PLAN_ID env vars from pxcentral-backup.yaml
- Remove STARTUP_DELAY env var
- Add appliance-id and endpoint preservation during upgrades
- Remove unused env vars: LOG_TAIL_LINES, LOG_DIR
- Remove hardcoded timeout env vars: REPORTING_API_TIMEOUT, UPLOAD_SERVICE_TIMEOUT
- Remove gRPC env vars: UPLOAD_SERVICE_ADDR, UPLOAD_USE_REST_API
- Rename UPLOAD_SERVICE_REST_URL to UPLOAD_SERVICE_URL
- Clean up comments
- Add liveness and readiness probes to registration container (port 12011)
- Add httpServer.port: 12011 to registration configmap (matches px-e)
- Add gRPC readiness probe to log-upload-service (port 9090)
- Add HTTP readiness probe to logs-collector envoy (/ping-trusted:12002)
- Update realtime-metrics image from purestorage:1.0.23 to portworx:1.0.32

Health probes enable auto-recovery from transient failures and improve pod readiness detection during startup.
What this PR does / why we need it:
Which issue(s) this PR fixes (optional)
Closes #
Special notes for your reviewer: