317 Connect Orchestration Engine and Policy Decision Point to Observability Stack by ginaxu1 · Pull Request #380 · OpenDIF/opendif-core

ginaxu1 · 2025-12-09T02:45:58Z

Review and merge #371 into main first (the base branch here will then be updated), or, ideally, review this PR and merge it into #371 first.

Summary

The Orchestration Engine (OE) and Policy Decision Point (PDP) are now instrumented and connected to the observability stack introduced in #371.

OE (orchestration-engine:4000)
- Metrics endpoint: /metrics (Prometheus format)
- HTTP metrics middleware applied to all routes
- Records: http_requests_total, http_request_duration_seconds
- Uses exchange/shared/monitoring package

PDP (policy-decision-point:8082)
- Metrics endpoint: /metrics (Prometheus format)
- HTTP metrics middleware applied to all routes
- Records: http_requests_total, http_request_duration_seconds
- Uses exchange/shared/monitoring package

Both services now expose the following Prometheus metrics:

- http_requests_total{http_method, http_route, http_status_code}: total HTTP request count by method, route, and status code
- http_request_duration_seconds{http_method, http_route}: HTTP request latency histogram by method and route
- external_calls_total{opendif.external.target, opendif.external.operation}: external service call metrics (when used)
- business_events_total{opendif.business.action, opendif.business.outcome}: business event metrics (when used)

Note: Custom attributes use the opendif. namespace prefix to distinguish them from standard OpenTelemetry semantic conventions.
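
For readers unfamiliar with how the first two metrics get their labels, the following is a minimal sketch using only the Go standard library. It is not the code in exchange/shared/monitoring: the `recorder` type, the `simulate` helper, and the `/decide` path are hypothetical, and an in-memory map stands in for the real metric instruments.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// requestKey mirrors the labels listed for http_requests_total.
type requestKey struct {
	Method string
	Route  string
	Status int
}

// recorder is a hypothetical in-memory stand-in for real metric instruments.
type recorder struct {
	mu        sync.Mutex
	counts    map[requestKey]int
	durations map[string][]time.Duration // keyed by "METHOD route"
}

func newRecorder() *recorder {
	return &recorder{counts: map[requestKey]int{}, durations: map[string][]time.Duration{}}
}

// statusWriter captures the status code written by the wrapped handler.
type statusWriter struct {
	http.ResponseWriter
	status int
}

func (w *statusWriter) WriteHeader(code int) {
	w.status = code
	w.ResponseWriter.WriteHeader(code)
}

// middleware records one count and one duration sample per request.
func (r *recorder) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		start := time.Now()
		// Default to 200 in case the handler never calls WriteHeader.
		sw := &statusWriter{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(sw, req)
		elapsed := time.Since(start)

		r.mu.Lock()
		defer r.mu.Unlock()
		r.counts[requestKey{req.Method, req.URL.Path, sw.status}]++
		k := req.Method + " " + req.URL.Path
		r.durations[k] = append(r.durations[k], elapsed)
	})
}

// simulate sends one in-process request through a wrapped handler
// that replies with the given status code.
func simulate(r *recorder, method, path string, status int) {
	h := r.middleware(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(status)
	}))
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(method, path, nil))
}

func main() {
	r := newRecorder()
	simulate(r, http.MethodGet, "/health", http.StatusOK)
	simulate(r, http.MethodGet, "/health", http.StatusOK)
	simulate(r, http.MethodPost, "/decide", http.StatusBadRequest) // hypothetical route
	for k, n := range r.counts {
		fmt.Printf("http_requests_total{http_method=%q,http_route=%q,http_status_code=\"%d\"} %d\n",
			k.Method, k.Route, k.Status, n)
	}
}
```

The sketch uses the raw URL path as the route label for brevity; a real middleware would usually normalize the route first to keep label cardinality bounded.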

Why these changes are needed:

- Services currently have no metrics instrumentation, making it impossible to monitor performance, errors, and request patterns.
- The observability stack (Prometheus + Grafana) is configured but cannot collect data without service instrumentation.
- OpenTelemetry provides vendor-agnostic instrumentation, allowing teams to choose their observability backend (Prometheus, Datadog, New Relic, etc.) without code changes.
- This is the foundation for comprehensive observability across all services.

Type of Change

New feature (non-breaking change which adds functionality)
Bug fix (non-breaking change which fixes an issue)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Refactoring (no functional changes)
Performance improvement
Other (please describe):

Changes Made

New Files Created

observability/generate_sample_traffic.sh (156 lines)
- Script to generate sample HTTP traffic for testing Grafana dashboards
- Targets orchestration-engine (port 4000) and policy-decision-point (port 8082)
- Configurable via environment variables (ORCHESTRATION_ENGINE_URL, POLICY_DECISION_POINT_URL, REQUEST_INTERVAL, REQUEST_COUNT)
- Sends requests to health, metrics, and API endpoints to populate metrics

Modified Files

Service Integration

exchange/orchestration-engine/server/server.go
- Added import: github.com/gov-dx-sandbox/exchange/shared/monitoring
- Added /metrics endpoint: mux.Handle("/metrics", monitoring.Handler())
- Wrapped handlers with monitoring.HTTPMetricsMiddleware() for automatic instrumentation
- Metrics middleware applied before CORS middleware (outermost layer)
- Lines changed: ~9 lines added
exchange/orchestration-engine/go.mod & go.sum
- Added dependency on exchange/shared/monitoring package
- Updated via go mod tidy
exchange/policy-decision-point/main.go
- Added import: github.com/gov-dx-sandbox/exchange/shared/monitoring
- Added /metrics endpoint: mux.Handle("/metrics", monitoring.Handler())
- Wrapped handlers with monitoring.HTTPMetricsMiddleware() for automatic instrumentation
- Lines changed: ~9 lines added
exchange/policy-decision-point/go.mod & go.sum
- Added dependency on exchange/shared/monitoring package
- Updated via go mod tidy
exchange/docker-compose.yml
- Added SERVICE_NAME and OTEL_METRICS_EXPORTER environment variables to orchestration-engine and policy-decision-point services
- Ensures proper service identification in metrics
- Defaults to Prometheus exporter for local development
- Lines changed: 4 lines added (2 per service)
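
To illustrate what "outermost layer" means for the metrics middleware, here is a small standard-library sketch. The `trace` and `callOrder` helpers are hypothetical and exist only to show the order of execution; they are not part of the monitoring package or of either service.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// trace returns a middleware that logs when it is entered and when it returns.
func trace(name string, log *[]string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*log = append(*log, name+":in")
			next.ServeHTTP(w, r)
			*log = append(*log, name+":out")
		})
	}
}

// callOrder wraps a handler as metrics(cors(handler)), sends one
// in-process request, and returns the observed order of execution.
func callOrder() []string {
	var log []string
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log = append(log, "handler")
	})
	metrics := trace("metrics", &log)
	cors := trace("cors", &log)

	// Metrics is applied last, so it runs first and finishes last.
	wrapped := metrics(cors(handler))
	wrapped.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	return log
}

func main() {
	fmt.Println(callOrder())
	// prints: [metrics:in cors:in handler cors:out metrics:out]
}
```

Because the outermost middleware sees the request first and the response last, it also observes requests that an inner layer answers on its own, such as CORS preflight responses, and its measured duration includes the time spent in the inner layers.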

Observability Stack Configuration

observability/prometheus/prometheus.yml
- Added scrape configs for orchestration-engine:4000 and policy-decision-point:8082
- Both services configured with /metrics path
- Lines changed: ~18 lines added

Testing

I have tested this change locally
I have added unit tests for new functionality
I have tested edge cases
All existing tests pass

Test Results

Unit Tests (exchange/shared/monitoring):

=== RUN   TestHandler
--- PASS: TestHandler (0.00s)
=== RUN   TestHTTPMetricsMiddleware
--- PASS: TestHTTPMetricsMiddleware (0.00s)
=== RUN   TestNormalizeRoute
--- PASS: TestNormalizeRoute (0.00s)
PASS
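
The PR description does not show what TestNormalizeRoute covers. As a rough illustration of why route normalization matters for the http_route label, the sketch below assumes one common approach: replacing purely numeric path segments with a placeholder. The function name `normalizeRouteSketch`, the `:id` placeholder, and the example paths are all hypothetical and may differ from the actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// isNumeric reports whether s is non-empty and consists only of ASCII digits.
func isNumeric(s string) bool {
	if s == "" {
		return false
	}
	for _, c := range s {
		if c < '0' || c > '9' {
			return false
		}
	}
	return true
}

// normalizeRouteSketch replaces numeric path segments with ":id" so that
// /consumers/1 and /consumers/2 share a single label value.
// This is an assumed strategy, not the monitoring package's code.
func normalizeRouteSketch(path string) string {
	parts := strings.Split(path, "/")
	for i, p := range parts {
		if isNumeric(p) {
			parts[i] = ":id"
		}
	}
	return strings.Join(parts, "/")
}

func main() {
	for _, p := range []string{"/health", "/consumers/42/grants", "/"} {
		fmt.Printf("%s -> %s\n", p, normalizeRouteSketch(p))
	}
}
```

Without some normalization of this kind, every distinct ID in a URL would create a new time series in Prometheus.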

Service Integration:

Both services successfully expose /metrics endpoints
Metrics middleware correctly wraps HTTP handlers
Prometheus can successfully scrape both services

Runtime Testing

To verify the observability stack is working:

Start observability stack:

cd observability
./start-grafana.sh  # or: docker compose up -d

Start Go services (ensure they're on opendif-network):

cd exchange
docker compose up -d orchestration-engine policy-decision-point

Verify metrics endpoints:

curl http://localhost:4000/metrics | grep http_requests_total
curl http://localhost:8082/metrics | grep http_requests_total
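
The grep above also matches HELP and TYPE comment lines. If a stricter check is wanted, for example in a smoke test, the response body can be filtered for sample lines of one exact metric name. The sketch below works on an embedded sample body rather than a live endpoint; `sampleLines` and `sampleExposition` are hypothetical names, and the sample values are made up for illustration. The label names exposed by the real exporter may differ from those in the sample.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sampleExposition is an illustrative body in the Prometheus text format.
const sampleExposition = `# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{http_method="GET",http_route="/health",http_status_code="200"} 3
http_request_duration_seconds_count{http_method="GET",http_route="/health"} 3
`

// sampleLines returns the sample lines whose metric name equals metric,
// skipping blank lines and "#" comment lines.
func sampleLines(body, metric string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first "{" (labels) or space (value).
		name := line
		if i := strings.IndexAny(line, "{ "); i >= 0 {
			name = line[:i]
		}
		if name == metric {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	lines := sampleLines(sampleExposition, "http_requests_total")
	fmt.Printf("found %d sample line(s)\n", len(lines))
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

In a real check, the body would come from an HTTP GET against the service's /metrics endpoint instead of the embedded constant.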

Check Prometheus targets:
- Open http://localhost:9091/targets
- Services should show as "UP" (green)

Generate sample traffic:

cd observability
./generate_sample_traffic.sh
# Script targets orchestration-engine:4000 and policy-decision-point:8082

View metrics in Grafana:
- Open http://localhost:3002/d/go-services/go-services-metrics
- Metrics should appear after services receive traffic

Checklist

My code follows the project's style guidelines
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have checked that there are no merge conflicts
I have verified all services are on opendif-network
I have verified Prometheus can scrape all service endpoints

Related Issues

Related to observability stack setup. Enables metrics collection from Go services for monitoring and debugging.

Deployment Notes

Pre-Deployment Checklist

Service Restart Required: Services must be restarted to load the new monitoring code
```
# Stop existing services
# Rebuild: go build .
# Restart services
```

Network Setup: Ensure opendif-network exists before starting services

docker network create opendif-network  # if it doesn't exist
# Or use: cd observability && ./start-grafana.sh

No Configuration Changes Required: Services use Prometheus exporter by default (no env vars needed for local dev)
Prometheus Already Configured: Prometheus is already configured to scrape these services (see observability/prometheus/prometheus.yml)
Grafana Dashboard Ready: Grafana dashboard is already configured to display these metrics

Environment Variables (Optional)

For local development, no environment variables are needed (Prometheus is default).

To switch to other backends (Datadog, New Relic, etc.), set:

export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=<your-endpoint>
export OTEL_EXPORTER_OTLP_HEADERS="<your-headers>"
export SERVICE_NAME=<service-name>
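
How a service might interpret these variables can be sketched as follows. This is an assumption-laden illustration, not the logic in exchange/shared/monitoring: the `exporterConfig` type and `loadExporterConfig` function are hypothetical, and requiring an explicit endpoint for `otlp` is a strictness choice made for the example (OpenTelemetry SDKs typically fall back to a localhost default instead).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// exporterConfig holds the settings read from the environment.
type exporterConfig struct {
	Exporter    string
	Endpoint    string
	ServiceName string
}

// loadExporterConfig reads the variables named in the PR description.
// getenv is injected so the function can be exercised without touching
// the real process environment.
func loadExporterConfig(getenv func(string) string) (exporterConfig, error) {
	cfg := exporterConfig{
		Exporter:    strings.ToLower(strings.TrimSpace(getenv("OTEL_METRICS_EXPORTER"))),
		Endpoint:    getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
		ServiceName: getenv("SERVICE_NAME"),
	}
	// Prometheus is the default when nothing is configured.
	if cfg.Exporter == "" {
		cfg.Exporter = "prometheus"
	}
	// Assumed validation: fail fast rather than rely on a default endpoint.
	if cfg.Exporter == "otlp" && cfg.Endpoint == "" {
		return cfg, fmt.Errorf("OTEL_EXPORTER_OTLP_ENDPOINT must be set when OTEL_METRICS_EXPORTER=otlp")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadExporterConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("exporter=%s service=%q\n", cfg.Exporter, cfg.ServiceName)
}
```

Passing `getenv` as a parameter keeps the sketch easy to test with a map-backed function.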

Post-Deployment Verification

Check Metrics Endpoints:

curl http://localhost:4000/metrics | grep http_requests_total
curl http://localhost:8082/metrics | grep http_requests_total

Verify Prometheus Scraping:
- Open http://localhost:9091/targets
- All services should show as "UP" (green)

View in Grafana:
- Open http://localhost:3002/d/go-services/go-services-metrics
- Metrics should appear after services receive some traffic

Generate Sample Traffic:

cd observability
./generate_sample_traffic.sh

Related PRs

317-part2-connect: refactors to use OpenTelemetry infra in exchange/shared/monitoring/
This PR (317-part3-oe-pdp) integrates that infra into OE and PDP

ginaxu1 · 2026-01-03T14:26:38Z

Created new branch #408

ginaxu1 requested review from mushrafmim and sthanikan2000 December 9, 2025 02:45

ginaxu1 changed the title ~~317 Part 3: Connect Orchestration Engine and Policy Decision Point to Observability Stack~~ 317 Connect Orchestration Engine and Policy Decision Point to Observability Stack Dec 9, 2025

ginaxu1 force-pushed the 317-part2-connect branch 4 times, most recently from 448532b to af8ba3f Compare December 18, 2025 09:24

sthanikan2000 marked this pull request as draft December 18, 2025 16:48

ginaxu1 force-pushed the 317-part2-connect branch from 24d171f to eba3284 Compare December 19, 2025 05:48

Refactor observability stack with OpenTelemetry

3825981

ginaxu1 force-pushed the 317-part2-connect branch from eba3284 to 3825981 Compare December 19, 2025 06:45

ginaxu1 and others added 3 commits December 19, 2025 12:43

Address PR Comment

9b620f9

Rename opendif-mvp to opendif-core (#367)

bef2a41

Update observability README

d77a01f

ginaxu1 force-pushed the 317-part3-oe-pdp branch from 90844ba to d77a01f Compare December 19, 2025 07:22

ginaxu1 force-pushed the 317-part2-connect branch from 4ee78bb to e5d0cda Compare December 23, 2025 08:43

Base automatically changed from 317-part2-connect to main December 23, 2025 15:32

ginaxu1 marked this pull request as ready for review December 27, 2025 08:33

ginaxu1 closed this Jan 3, 2026

ginaxu1 deleted the 317-part3-oe-pdp branch January 5, 2026 04:48

ginaxu1 wants to merge 4 commits into main from 317-part3-oe-pdp
