
Telemetry data upload to pure1 #863

Draft
bk-px wants to merge 18 commits into 2.11.0-staging from telemetry-reporting-api-data-collector-non-init

Conversation

@bk-px
Collaborator

@bk-px bk-px commented Feb 16, 2026

What this PR does / why we need it:

Which issue(s) this PR fixes (optional)
Closes #

Special notes for your reviewer:

Bhanuchandra K added 18 commits February 16, 2026 14:19
This commit adds complete telemetry support for PX-Backup to upload metrics and logs to Pure1.

Key Changes:
- Add telemetry configuration section in values.yaml with endpoint definitions
- Define endpoints (osbEndpoint, registerEndpoint, restEndpoint) in values.yaml as single source of truth
- Remove environment-based conditionals from templates for cleaner configuration

New Helm Templates:
- configmap.yaml: Main telemetry configuration
- registration-deployment.yaml: Pure1 registration and certificate management
- registration-configmap.yaml: Envoy proxy config for registration
- metrics-collector-deployment.yaml: Metrics collection and upload
- metrics-collector-configmap.yaml: Envoy proxy config for metrics
- logs-collector-deployment.yaml: Log collection and upload
- logs-collector-configmap.yaml: Envoy proxy config for logs
- rbac.yaml: Service account and RBAC permissions

Architecture:
- 3-pod architecture: registration, metrics-collector, logs-collector
- Reuses patterns proven in PX-Enterprise (cert_checker, security context, resource limits)
- Certificate-based mTLS authentication with Pure1
- Envoy proxy for all telemetry traffic

Configuration:
- All endpoints defined in values.yaml (no scattered conditionals)
- Easy to override for different environments via --set or custom values files
- Removed unused 'environment' field for cleaner configuration

Tested successfully on the BK32 cluster:
- OSB registration (HTTP 202)
- Pure1 certificate issuance (HTTP 200)
- Metrics upload every 10 seconds (HTTP 200)
- Log upload on-demand (HTTP 200)
- Add init containers to wait for PX-Backup pod and appliance-id
- Add wait-for-certificate init container to prevent premature uploads
- Add phonehome-cache volume to prevent duplicate uploads
- Add RBAC for logs-collector pod
- Improve error messages and logging in init containers
- Standardize all telemetry pods to use portworx/telemetry-envoy:1.1.18
  - Metrics collector: purestorage/telemetry-envoy:1.0.0 → portworx/telemetry-envoy:1.1.18
  - Logs collector: purestorage/telemetry-envoy:1.0.0 → portworx/telemetry-envoy:1.1.18
  - Registration pod: already using portworx/telemetry-envoy:1.1.18 (no change)

- Fix double-encoded JSON in reporting data collection
  - Add 'jq -r .json_data' to extract inner JSON from reporting API response
  - Prevents double-encoded JSON in uploaded log bundles to Pure1
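
The double-encoding fix can be illustrated with a small sketch. It assumes the reporting API wraps its payload as a JSON-encoded string under a `json_data` field, which is inferred from the `jq -r .json_data` filter named above. The sample payload and the function name `extract_inner_json` are hypothetical, and the collector itself uses jq rather than Python.

```python
import json


def extract_inner_json(response_text: str) -> str:
    """Return the inner JSON document stored under `json_data`.

    Similar in spirit to `jq -r .json_data`: the outer response is parsed
    once and the string value is returned raw, without encoding it again.
    """
    outer = json.loads(response_text)
    inner = outer["json_data"]
    if not isinstance(inner, str):
        # Already a structured value, so serialize it exactly once.
        return json.dumps(inner)
    return inner


# Hypothetical response: the inner document is itself a JSON-encoded string.
sample_response = json.dumps({"json_data": json.dumps({"backups": 3})})
inner = extract_inner_json(sample_response)
```

Writing `inner` to the bundle directly, instead of the whole response, means a consumer needs only one `json.loads` call to get structured data.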
This commit enhances the logs collector script with comprehensive improvements:

Robustness Improvements:
- Add structured logging with timestamps and severity levels (INFO, WARN, ERROR, DEBUG)
- Implement graceful shutdown handling (SIGTERM, SIGINT, SIGHUP)
- Add comprehensive input validation for environment variables
- Add timeout configuration for curl and grpcurl operations
- Add disk space validation before collection
- Improve error handling with per-function error checks

Bug Fixes:
- Fix multi-container pod handling with --all-containers flag
- Rename POD_LABEL to PXBACKUP_POD_LABEL for clarity
- Add startup delay (60s default) to prevent timing issues during helm upgrades

New Features:
- Track and log upload status from grpcurl responses
- Implement log file cleanup (3-day retention) to prevent disk space issues
- Add grpcurl installation check and automatic installation
- Add detailed debug logging capability
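
The 3-day retention cleanup could look roughly like the following sketch. Only the retention period comes from the commit message. The `*.log.gz` glob, the use of file modification time, and the function name `cleanup_old_logs` are assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 3 * 24 * 60 * 60  # 3-day retention, as stated in the commit


def cleanup_old_logs(log_dir: Path,
                     now: Optional[float] = None,
                     pattern: str = "*.log.gz") -> List[Path]:
    """Delete files matching `pattern` whose mtime is older than the retention.

    Returns the list of removed paths so the caller can log them.
    """
    now = time.time() if now is None else now
    removed = []
    for path in log_dir.glob(pattern):
        if path.is_file() and now - path.stat().st_mtime > RETENTION_SECONDS:
            path.unlink()
            removed.append(path)
    return removed
```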

Configuration:
- Add startupDelay configuration option (default: 60 seconds)
- Maintain backward compatibility with existing configurations

All improvements have been tested and verified on bk36 cluster.
- Add excludePodPatterns configuration option in values.yaml
- Default exclusions: alertmanager,prometheus,frontend
- Update logs-collector-deployment.yaml to use EXCLUDE_POD_PATTERNS env var
- Clear EXCLUDE_PATTERNS (was used for log line filtering, now empty)

This makes pod exclusion configurable via Helm values, allowing users to
customize which pods are excluded from telemetry log collection.
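
A minimal sketch of how a comma-separated EXCLUDE_POD_PATTERNS value might be applied to pod names follows. Substring matching is an assumption, since the commit does not say whether the patterns are substrings, globs, or regular expressions. The pod names and the helper names are hypothetical.

```python
def parse_patterns(raw: str) -> list:
    """Split a comma-separated pattern string, dropping blanks and whitespace."""
    return [p.strip() for p in raw.split(",") if p.strip()]


def filter_pods(pod_names, raw_patterns: str) -> list:
    """Keep pods whose name contains none of the exclusion patterns.

    Assumption: a pattern excludes a pod when it occurs anywhere in the name.
    """
    patterns = parse_patterns(raw_patterns)
    return [name for name in pod_names
            if not any(p in name for p in patterns)]


# Hypothetical pod names, filtered with the default exclusions from the commit.
pods = ["px-backup-abc", "pxcentral-frontend-1", "prometheus-0", "alertmanager-0"]
kept = filter_pods(pods, "alertmanager,prometheus,frontend")
```

An empty pattern string excludes nothing, so clearing the Helm value collects logs from every pod.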
…ention

- Update logfile_patterns to match new Go binary file naming:
  - /var/cores/px-backup-logs-*.log.gz (pod logs)
  - /var/cores/px-backup-reporting-*.log.gz (reporting data)
- This matches the original shell script naming convention
- Ensures log-upload-service can scan and upload files to Pure1
- Remove DownstreamTlsContext from listener (log-upload-service connects via plain HTTP)
- Add proper UpstreamTlsContext to cluster with client certificate via SDS
- Fix SDS configuration to use correct private key filename (/appliance-cert/private_key)
- Add server certificate validation for phonehome.portworx.com

This fixes the 'broken pipe' and 'connection reset' errors that were preventing
log uploads to Pure1. The listener now accepts plain HTTP from log-upload-service
and Envoy handles mTLS to Pure1 on the upstream side.
Move all telemetry-related images from scattered locations in
pxbackup.telemetry section to centralized images section, following
the existing PX-Central pattern.

Changes:
- Move telemetry config from pxbackup.telemetry to top-level telemetry
- Add 6 telemetry images to centralized images section:
  * telemetryInitContainerImage (kubectl)
  * telemetryEnvoyImage
  * telemetryRegistrationImage (ccm-go)
  * telemetryMetricsCollectorImage (realtime-metrics)
  * telemetryDataCollectorImage
  * telemetryLogUploadImage
- Update all 3 telemetry deployment templates to use new image paths
- Remove duplicate image configurations from telemetry section
- Use global images.pullPolicy instead of per-image settings
- Delete obsolete logs-collector-configmap.yaml (728 lines)
- Remove appliance-id Secret volume and volumeMount from registration-deployment.yaml
- Remove applianceIdPath from CCM-Go config in registration-configmap.yaml
- Add comment explaining CCM-Go reads from APPLIANCE_ID environment variable
- Simplifies pod spec: reduces volume mounts from 3 to 2
- Single source of truth: px-backup-telemetry-config ConfigMap only
This commit removes all init containers from telemetry deployments and
replaces them with Kubernetes-native dependency primitives.

Changes:
- Remove configmap.yaml (PX-Backup creates ConfigMap at runtime)
- Remove all 3 init containers from logs-collector-deployment.yaml:
  * update-envoy-config (Envoy ConfigMap update)
  * wait-for-appliance-id (appliance-id wait)
  * wait-for-certificate (Pure1 certificate wait)
- Remove all 3 init containers from metrics-collector-deployment.yaml:
  * cert-checker (certificate wait)
  * update-envoy-config (Envoy config update)
  * wait-for-certificate (Pure1 certificate wait)
- Remove telemetryInitContainerImage from values.yaml
- Remove unnecessary RBAC rules for ConfigMap access
- Change optional: true to optional: false for dependencies:
  * px-backup-telemetry-config ConfigMap
  * px-backup-telemetry-logs-envoy-config ConfigMap
  * px-backup-telemetry-metrics-envoy-config ConfigMap
  * pure-telemetry-certs Secret

Architecture:
- PX-Backup creates px-backup-telemetry-config ConfigMap with appliance-id
- PX-Backup updates all 3 Envoy ConfigMaps with actual appliance-id
- Pods reference their dependencies with optional: false, so containers do not start until the ConfigMaps and Secret exist (Kubernetes-native)
- No init containers needed - cleaner, simpler architecture

Benefits:
- Eliminates 9 init containers (3 per deployment × 3 deployments)
- Reduces complexity and startup time
- Uses Kubernetes-native dependency management
- Follows cloud-native best practices
…or appliance-id preservation

1. Helm Lookup Implementation for Appliance-ID Preservation
   - Implemented Helm lookup logic in all three Envoy ConfigMap templates
   - Preserves appliance-id during helm upgrades to prevent 401 errors
   - Uses regex pattern to extract UUID from existing ConfigMaps
   - Falls back to APPLIANCE_ID_PLACEHOLDER for fresh installations

2. REST API Support for Log Upload Service
   - Added REST API configuration alongside existing gRPC implementation
   - Exposed HTTP port 8080 for REST API endpoint
   - Enables flexible log upload triggering via both gRPC and REST
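
The appliance-id preservation described in item 1 can be sketched as a regex extraction with a placeholder fallback. The real logic lives in the Helm templates, whose exact regex is not shown in this PR description. The UUID pattern, the sample ConfigMap text, and the function name `resolve_appliance_id` below are assumptions.

```python
import re
from typing import Optional

# Assumed UUID shape (8-4-4-4-12 hex digits); the template's regex may differ.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
PLACEHOLDER = "APPLIANCE_ID_PLACEHOLDER"


def resolve_appliance_id(existing_config: Optional[str]) -> str:
    """Reuse the appliance-id found in an existing ConfigMap, if any.

    `existing_config` is the text of the ConfigMap from the previous release,
    or None on a fresh installation where no ConfigMap exists yet.
    """
    if existing_config:
        match = UUID_RE.search(existing_config)
        if match:
            return match.group(0)
    return PLACEHOLDER


# Hypothetical Envoy config fragment carrying a previously issued appliance-id.
previous = 'value: "123e4567-e89b-12d3-a456-426614174000"'
upgraded_id = resolve_appliance_id(previous)
fresh_id = resolve_appliance_id(None)
```

Reusing the existing id on upgrade keeps the rendered config in step with the identity already registered, which is what avoids the 401 errors mentioned above.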

Logs Collector Deployment (logs-collector-deployment.yaml):
- Added REST API configuration environment variables
- Exposed HTTP port 8080 for REST API endpoint
- Added LOG_UPLOAD_REST_PORT environment variable
- Implement telemetry.environment (staging/production) to switch endpoints
- Move planId into endpoints block for environment-specific configuration
- Remove legacy endpoint fields (osbEndpoint, registerEndpoint, restEndpoint)
- Update all telemetry templates to use index function for endpoint lookup
- Update image tags to 2.11.0-staging
- Remove all endpoint URLs from Helm templates
- Endpoints now defined in px-backup Go code
- Remove OSB_ENDPOINT and OSB_PLAN_ID env vars from pxcentral-backup.yaml
- Remove STARTUP_DELAY env var
- Add appliance-id and endpoint preservation during upgrades
- Remove unused env vars: LOG_TAIL_LINES, LOG_DIR
- Remove hardcoded timeout env vars: REPORTING_API_TIMEOUT, UPLOAD_SERVICE_TIMEOUT
- Remove gRPC env vars: UPLOAD_SERVICE_ADDR, UPLOAD_USE_REST_API
- Rename UPLOAD_SERVICE_REST_URL to UPLOAD_SERVICE_URL
- Clean up comments
- Add liveness and readiness probes to registration container (port 12011)
- Add httpServer.port: 12011 to registration configmap (matches px-e)
- Add gRPC readiness probe to log-upload-service (port 9090)
- Add HTTP readiness probe to logs-collector envoy (/ping-trusted:12002)
- Update realtime-metrics image from purestorage:1.0.23 to portworx:1.0.32

Health probes enable auto-recovery from transient failures and improve
pod readiness detection during startup.
@bk-px bk-px added the DO_NOT_MERGE Required changes label Feb 16, 2026
@bk-px bk-px marked this pull request as draft February 16, 2026 12:05
