317 Connect Orchestration Engine and Policy Decision Point to Observability Stack #380

Closed
ginaxu1 wants to merge 4 commits into main from 317-part3-oe-pdp

ginaxu1 commented Dec 9, 2025

Review order: either review #371 first and merge it to main (which updates the base branch here), or, ideally, review and merge this PR into #371 first.

Summary

The Orchestration Engine (OE) and Policy Decision Point (PDP) are now instrumented and connected to the observability stack from #371.

  1. OE (orchestration-engine:4000)

    • Metrics endpoint: /metrics (Prometheus format)
    • HTTP metrics middleware applied to all routes
    • Records: http_requests_total, http_request_duration_seconds
    • Uses exchange/shared/monitoring package
  2. PDP (policy-decision-point:8082)

    • Metrics endpoint: /metrics (Prometheus format)
    • HTTP metrics middleware applied to all routes
    • Records: http_requests_total, http_request_duration_seconds
    • Uses exchange/shared/monitoring package

Both services now expose the following Prometheus metrics:

  • http_requests_total{http_method, http_route, http_status_code} - Total HTTP request count by method, route, and status code
  • http_request_duration_seconds{http_method, http_route} - HTTP request latency histogram by method and route
  • external_calls_total{opendif.external.target, opendif.external.operation} - External service call metrics (when used)
  • business_events_total{opendif.business.action, opendif.business.outcome} - Business event metrics (when used)

Note: Custom attributes use the opendif. namespace prefix to distinguish them from standard OpenTelemetry semantic conventions.

Why these changes are needed:

  • Services currently have no metrics instrumentation, making it impossible to monitor performance, errors, and request patterns
  • The observability stack (Prometheus + Grafana) is configured but cannot collect data without service instrumentation
  • OpenTelemetry provides vendor-agnostic instrumentation, allowing teams to choose their observability backend (Prometheus, Datadog, New Relic, etc.) without code changes
  • This is the foundation for comprehensive observability across all services

Type of Change

  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Other (please describe):

Changes Made

New Files Created

  1. observability/generate_sample_traffic.sh (156 lines)
    • Script to generate sample HTTP traffic for testing Grafana dashboards
    • Targets orchestration-engine (port 4000) and policy-decision-point (port 8082)
    • Configurable via environment variables (ORCHESTRATION_ENGINE_URL, POLICY_DECISION_POINT_URL, REQUEST_INTERVAL, REQUEST_COUNT)
    • Sends requests to health, metrics, and API endpoints to populate metrics

Modified Files

Service Integration

  1. exchange/orchestration-engine/server/server.go

    • Added import: github.com/gov-dx-sandbox/exchange/shared/monitoring
    • Added /metrics endpoint: mux.Handle("/metrics", monitoring.Handler())
    • Wrapped handlers with monitoring.HTTPMetricsMiddleware() for automatic instrumentation
    • Metrics middleware applied before CORS middleware (outermost layer)
    • Lines changed: ~9 lines added
  2. exchange/orchestration-engine/go.mod & go.sum

    • Added dependency on exchange/shared/monitoring package
    • Updated via go mod tidy
  3. exchange/policy-decision-point/main.go

    • Added import: github.com/gov-dx-sandbox/exchange/shared/monitoring
    • Added /metrics endpoint: mux.Handle("/metrics", monitoring.Handler())
    • Wrapped handlers with monitoring.HTTPMetricsMiddleware() for automatic instrumentation
    • Lines changed: ~9 lines added
  4. exchange/policy-decision-point/go.mod & go.sum

    • Added dependency on exchange/shared/monitoring package
    • Updated via go mod tidy
  5. exchange/docker-compose.yml

    • Added SERVICE_NAME and OTEL_METRICS_EXPORTER environment variables to orchestration-engine and policy-decision-point services
    • Ensures proper service identification in metrics
    • Defaults to Prometheus exporter for local development
    • Lines changed: 4 lines added (2 per service)

Observability Stack Configuration

  1. observability/prometheus/prometheus.yml
    • Added scrape configs for orchestration-engine:4000 and policy-decision-point:8082
    • Both services configured with /metrics path
    • Lines changed: ~18 lines added

Testing

  • I have tested this change locally
  • I have added unit tests for new functionality
  • I have tested edge cases
  • All existing tests pass

Test Results

Unit Tests (exchange/shared/monitoring):

=== RUN   TestHandler
--- PASS: TestHandler (0.00s)
=== RUN   TestHTTPMetricsMiddleware
--- PASS: TestHTTPMetricsMiddleware (0.00s)
=== RUN   TestNormalizeRoute
--- PASS: TestNormalizeRoute (0.00s)
PASS

Service Integration:

  • Both services successfully expose /metrics endpoints
  • Metrics middleware correctly wraps HTTP handlers
  • Prometheus can successfully scrape both services

Runtime Testing

To verify the observability stack is working:

  1. Start observability stack:

    cd observability
    ./start-grafana.sh  # or: docker compose up -d
  2. Start Go services (ensure they're on opendif-network):

    cd exchange
    docker compose up -d orchestration-engine policy-decision-point
  3. Verify metrics endpoints:

    curl http://localhost:4000/metrics | grep http_requests_total
    curl http://localhost:8082/metrics | grep http_requests_total
  4. Check Prometheus targets:

  5. Generate sample traffic:

    cd observability
    ./generate_sample_traffic.sh
    # Script targets orchestration-engine:4000 and policy-decision-point:8082
  6. View metrics in Grafana:

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have checked that there are no merge conflicts
  • I have verified all services are on opendif-network
  • I have verified Prometheus can scrape all service endpoints

Related Issues

Related to observability stack setup. Enables metrics collection from Go services for monitoring and debugging.

Deployment Notes

Pre-Deployment Checklist

  1. Service Restart Required: Services must be restarted to load the new monitoring code

    # Stop existing services
    # Rebuild: go build .
    # Restart services
  2. Network Setup: Ensure opendif-network exists before starting services

    docker network create opendif-network  # if it doesn't exist
    # Or use: cd observability && ./start-grafana.sh
  3. No Configuration Changes Required: Services use Prometheus exporter by default (no env vars needed for local dev)

  4. Prometheus Already Configured: Prometheus is already configured to scrape these services (see observability/prometheus/prometheus.yml)

  5. Grafana Dashboard Ready: Grafana dashboard is already configured to display these metrics

Environment Variables (Optional)

For local development, no environment variables are needed (Prometheus is default).

To switch to other backends (Datadog, New Relic, etc.), set:

export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=<your-endpoint>
export OTEL_EXPORTER_OTLP_HEADERS="<your-headers>"
export SERVICE_NAME=<service-name>
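
As a rough illustration of the defaulting behaviour described above, the sketch below reads these variables and falls back to the Prometheus exporter when OTEL_METRICS_EXPORTER is unset. The exporterConfig type and loadExporterConfig function are hypothetical and not part of exchange/shared/monitoring. Rejecting otlp without an endpoint is a choice made for this sketch only; the actual package may behave differently.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// exporterConfig holds the settings read from the environment (hypothetical type).
type exporterConfig struct {
	Exporter    string
	Endpoint    string
	ServiceName string
}

// loadExporterConfig reads the variables through getenv so the logic can be
// exercised without touching the real process environment.
func loadExporterConfig(getenv func(string) string) (exporterConfig, error) {
	cfg := exporterConfig{
		Exporter:    strings.ToLower(strings.TrimSpace(getenv("OTEL_METRICS_EXPORTER"))),
		Endpoint:    getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
		ServiceName: getenv("SERVICE_NAME"),
	}
	if cfg.Exporter == "" {
		cfg.Exporter = "prometheus" // default for local development
	}
	switch cfg.Exporter {
	case "prometheus":
		return cfg, nil
	case "otlp":
		// Assumption of this sketch: an explicit endpoint is required.
		if cfg.Endpoint == "" {
			return cfg, fmt.Errorf("OTEL_EXPORTER_OTLP_ENDPOINT must be set when OTEL_METRICS_EXPORTER=otlp")
		}
		return cfg, nil
	default:
		return cfg, fmt.Errorf("unsupported OTEL_METRICS_EXPORTER %q", cfg.Exporter)
	}
}

func main() {
	cfg, err := loadExporterConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("exporter=%s service=%q\n", cfg.Exporter, cfg.ServiceName)
}
```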

Post-Deployment Verification

  1. Check Metrics Endpoints:

    curl http://localhost:4000/metrics | grep http_requests_total
    curl http://localhost:8082/metrics | grep http_requests_total
  2. Verify Prometheus Scraping:

  3. View in Grafana:

  4. Generate Sample Traffic:

    cd observability
    ./generate_sample_traffic.sh

Related PRs

  • 317-part2-connect: refactors to use OpenTelemetry infra in exchange/shared/monitoring/
  • This PR (317-part3-oe-pdp) integrates that infra into OE and PDP

ginaxu1 changed the title from "317 Part 3: Connect Orchestration Engine and Policy Decision Point to Observability Stack" to "317 Connect Orchestration Engine and Policy Decision Point to Observability Stack" on Dec 9, 2025
ginaxu1 force-pushed the 317-part2-connect branch 4 times, most recently from 448532b to af8ba3f, on December 18, 2025 09:24
sthanikan2000 marked this pull request as draft on December 18, 2025 16:48
Base automatically changed from 317-part2-connect to main on December 23, 2025 15:32
ginaxu1 marked this pull request as ready for review on December 27, 2025 08:33

ginaxu1 commented Jan 3, 2026

Created new branch #408

ginaxu1 closed this on Jan 3, 2026
ginaxu1 deleted the 317-part3-oe-pdp branch on January 5, 2026 04:48