Skip to content

Conversation

@nitink23
Copy link

@nitink23 nitink23 commented Jul 14, 2025

What's added in this PR?


Core Telemetry Implementation

  • Implemented telemetry setup with error handling, idempotency, and initialization

Infrastructure Implementation

  • Docker Configuration: Implemented Grafana Alloy collector setup for telemetry aggregation
  • Telemetry Configuration: Created Alloy config for OTLP collection and forwarding
  • Environment Setup: Configured proper telemetry endpoints

Instrumented Components

  • Zephyr Agent Core: Fully instrumented with telemetry implementation and tracing
  • Rspack Plugin: Fully instrumented with telemetry implementation and tracing
  • Rolldown Plugin: Fully instrumented with telemetry implementation and tracing

Screenshots

N/A – No UI changes, only code improvements and cleanup.


What's the issues or discussion related to this PR?

This PR addresses multiple aspects of the OpenTelemetry implementation:

Core Telemetry setup

  • Zephyr Agent: telemetry setup with better error handling, idempotenc, initialization
  • Better Comments: Added comprehensive documentation to all telemetry code

Infrastructure:

  • Docker Setup: Grafana Alloy collector for telemetry aggregation
  • Configuration: Proper OTLP endpoint and header configuration
  • Environment: Support for development and production telemetry

What are the steps to test this PR?

  1. Dependencies: Run pnpm install to ensure all OTel dependencies are correct
  2. Code Quality: Run pnpm lint to verify all ESLint issues are resolved
  3. Build: Run pnpm build to ensure all components build successfully
  4. Telemetry Infrastructure:
    • Start Docker collector: docker compose -f scripts/telemetry/compose.yml up
    • Verify OTLP endpoint at http://localhost:4318
    • Test telemetry collection from rspack and rolldown plugins should reflect in grafana cloud

Documentation update for this PR (if applicable)?

Yes - The telemetry infrastructure setup should be documented:

  • Docker Setup: Instructions for running the Grafana Alloy collector
  • Environment Variables: Configuration for OTLP endpoints and headers
  • Plugin Status: Document which plugins have telemetry and which don't
  • Infrastructure: Setup guide for telemetry collection and forwarding

(Optional) What's left to be done for this PR?


(Optional) What's the potential risk and how to mitigate it?


Who do you wish to review this PR other than required reviewers?

@claude @Nsttt @zackarychapple


(Required) Pre-PR/Merge checklist

  • I have added/updated/opened a PR to documentation to cover this new behavior
  • I have added an explanation of my changes
  • I have written new tests (if applicable) - No new tests needed for cleanup/improvements
  • I have tested this locally (standing from a first time user point of view, never touch this app before)
  • I have/will run tests, or ask for help to add test

Note: Documentation PR should be created to cover the telemetry infrastructure setup, especially the Docker setup requirements.

@cla-assistant
Copy link

cla-assistant bot commented Jul 14, 2025

CLA assistant check
All committers have signed the CLA.

@cla-assistant
Copy link

cla-assistant bot commented Jul 14, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@claude
Copy link

claude bot commented Jul 14, 2025

Claude finished @nitink23's task —— View job


OpenTelemetry Implementation Review ✅

Note: Writing @claude please review in any comment will trigger a full review.

Summary: This PR implements OpenTelemetry tracing and metrics infrastructure, despite the description claiming to only add dev dependencies.

📋 Review Issues
Severity File/Line(s) Issue Recommendation
Critical PR Description Misleading PR description - claims "only dev dependencies" but contains full OTel implementation Update PR description to accurately reflect the actual changes
Critical libs/zephyr-agent/package.json:53-62 OpenTelemetry packages in devDependencies but used in production code Move OTel packages to dependencies - they're required at runtime
Critical libs/zephyr-agent/src/index.ts:5 Top-level void initTelemetry() blocks app startup Make telemetry initialization async and handle failures gracefully
High libs/zephyr-agent/src/lib/telemetry/index.ts:84-88 Silent telemetry disabling when endpoint missing Add proper logging/metrics for telemetry health monitoring
High libs/zephyr-agent/src/lib/telemetry/index.ts:118-122 Telemetry init errors only logged, not propagated Consider failing fast on telemetry errors in critical environments
High Multiple files No tests for telemetry functionality Add unit tests for telemetry module, initialization, and error handling
High alloy-config.alloy:133-141 Grafana Cloud credentials in config file Move credentials to environment variables or secure config
Medium libs/zephyr-agent/src/lib/telemetry/index.ts:14 Hard-coded relative path for .env file Use more robust path resolution or config management
Medium libs/zephyr-agent/src/lib/telemetry/index.ts:92-96 OTLP exporter created without error handling Add try-catch around exporter creation
Medium libs/zephyr-agent/src/lib/telemetry/index.ts:146-156 parseKeyValueString function lacks input validation Add validation for malformed key-value pairs
Medium libs/zephyr-agent/src/zephyr-engine/index.ts:30 getTracer import but limited usage Either use telemetry more comprehensively or document the minimal approach
Low libs/zephyr-agent/src/lib/telemetry/index.ts:79 autoInstrument defaults to true Document the performance impact of auto-instrumentation
Low libs/zephyr-agent/src/lib/edge-requests/get-build-id.ts:17 Hardcoded span attributes Consider using constants or configuration for span attributes
🔒 Security & Performance Analysis

Security Issues:

  • Grafana Cloud credentials hardcoded in config file (should be env vars only)
  • No input validation on telemetry configuration
  • Debug logging may expose sensitive information

Performance Concerns:

  • Top-level telemetry initialization blocks app startup
  • Auto-instrumentation enabled by default (high overhead)
  • No telemetry sampling configuration
  • Memory leaks possible if telemetry shutdown not called

Missing Error Handling:

  • No fallback when telemetry fails to initialize
  • Silent failures in span creation
  • No circuit breaker for telemetry operations
🧪 Test Coverage Gaps

Critical Missing Tests:

  1. Telemetry initialization - test successful init, failure scenarios
  2. Environment variable parsing - test valid/invalid configurations
  3. Span creation and attributes - test tracer functionality
  4. Shutdown behavior - test graceful shutdown
  5. Error scenarios - test missing endpoints, invalid credentials
  6. Integration tests - test end-to-end tracing in ZephyrEngine

Suggested Test Structure:

describe('Telemetry', () => {
  describe('initTelemetry', () => {
    it('should initialize successfully with valid config')
    it('should handle missing OTLP endpoint gracefully')
    it('should parse resource attributes correctly')
    it('should be idempotent')
  })
})
💡 Code Quality Improvements

Recommended Changes:

  1. Fix dependency classification:
// In libs/zephyr-agent/package.json
- "devDependencies": {
+ "dependencies": {
    "@opentelemetry/api": "^1.9.0",
    "@opentelemetry/auto-instrumentations-node": "^0.62.0",
    "@opentelemetry/exporter-trace-otlp-http": "^0.203.0",
    // ... other OTel packages
  1. Improve initialization:
// In libs/zephyr-agent/src/index.ts
import { initTelemetry } from './lib/telemetry';

// Initialize telemetry with proper error handling
(async () => {
  try {
    await initTelemetry();
  } catch (error) {
    console.error('Telemetry initialization failed:', error);
    // Continue without telemetry or fail fast based on config
  }
})();
  1. Add environment variable documentation:
// Required environment variables:
// - OTEL_EXPORTER_OTLP_ENDPOINT: OTLP endpoint URL
// - OTEL_RESOURCE_ATTRIBUTES: service.name=zephyr-agent,deployment.environment=production
// - GRAFANA_CLOUD_OTLP_ENDPOINT: Grafana Cloud OTLP endpoint
// - GRAFANA_CLOUD_INSTANCE_ID: Grafana Cloud instance ID
// - GRAFANA_CLOUD_API_KEY: Grafana Cloud API key

Overall Quality Score: 4/10

Top 3 Next Steps:

  1. Fix Critical Issues: Move OTel packages to dependencies and update PR description
  2. Add Test Coverage: Write comprehensive tests for telemetry functionality
  3. Improve Error Handling: Add proper error handling and fallback mechanisms

Recommendation: This PR needs significant work before merging. The telemetry implementation itself is reasonable, but the packaging, testing, and error handling need substantial improvement.


@nx-cloud
Copy link

nx-cloud bot commented Jul 14, 2025

View your CI Pipeline Execution ↗ for commit 65dc919

Command Status Duration Result
nx affected -t build --skipRemoteCache ❌ Failed 1m 41s View ↗
nx affected -t test ✅ Succeeded 1m 28s View ↗
nx affected -t lint ✅ Succeeded 7s View ↗

☁️ Nx Cloud last updated this comment at 2025-07-17 23:05:42 UTC

@nitink23 nitink23 marked this pull request as draft July 14, 2025 21:49
@nitink23 nitink23 closed this Aug 5, 2025
@nitink23 nitink23 reopened this Aug 5, 2025
@nitink23 nitink23 changed the title build: added some dependencies to the dev for setting up Otel feat: Otel traces setup Aug 5, 2025
@nitink23 nitink23 marked this pull request as ready for review August 5, 2025 21:13
@nitink23 nitink23 requested a review from ryok90 August 12, 2025 18:53
@Nsttt Nsttt marked this pull request as draft August 14, 2025 18:27
Copy link
Member

@Nsttt Nsttt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some things that are missing that we need to do at some point;

  1. Move the boilerplate creation of spans / setup of otel to an internal lib we use on Zephyr, so it's easier to maintain.
  2. Add a ZEPHYR_DISABLE_TELEMETRY env variable for people that do not want to give telemetry to us.
  3. Check the ammount of info we currently track and decide the pieces that we want to send as traces based on events that happen on the Engine.
  4. Try to keep the tracing logic inside of zephyr-engine and zephyr-agent to not pollute the plugins.

"zephyr-agent": "workspace:*"
},
"devDependencies": {
"@types/is-ci": "catalog:typescript",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can revert this I think.

const OTLP_ENDPOINT =
process.env['OTEL_EXPORTER_OTLP_ENDPOINT'] || 'http://localhost:4318';
// OTEL_EXPORTER_OTLP_HEADERS: Authentication headers for the telemetry endpoint
const OTEL_EXPORTER_OTLP_HEADERS = process.env['OTEL_EXPORTER_OTLP_HEADERS'];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably hardcode the ones we actually want to track.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah if we decide to host the docker image we would need to replace the endpoint anyways

Comment on lines +60 to +63
const resource = resourceFromAttributes({
[ATTR_SERVICE_NAME]: 'zephyr-packages',
[ATTR_SERVICE_VERSION]: '0.0.1',
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to decide on which attributes we want to track from the environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants