
Conversation

@cbb330 (Collaborator) commented Jan 14, 2026

Summary

A critical latency metric is pinned at 30s when the real latency is as high as 5 minutes. The incorrect reading was used in an investigation and led to wrong conclusions.

Above (screenshot): log-based metric showing the real latency.

Below (screenshot): Prometheus-based metric showing p99 capped at 30s.

Configure histogram buckets to extend to 600 seconds for the catalog_metadata_retrieval_latency metric, enabling accurate capture of long-running metadata operations.
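For context on why the pinning likely happens: Micrometer clamps the generated percentile-histogram buckets at the configured maximum expected value, so any request slower than that cap only lands above the top finite bucket and a Prometheus histogram_quantile() over those buckets cannot report a value past it. A minimal, illustrative sketch (not this repo's code; metric name reused from above, values made up):

```java
// Illustrative sketch only: with the old 30s cap, a 5-minute request is
// counted above every finite bucket, so the computed p99 pins at ~30s.
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class BucketCapDemo {
    public static void main(String[] args) {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        Timer timer = Timer.builder("catalog_metadata_retrieval_latency")
                .publishPercentileHistogram()
                .maximumExpectedValue(Duration.ofSeconds(30)) // the old, too-low cap
                .register(registry);

        timer.record(Duration.ofMinutes(5)); // a genuinely slow metadata call

        // Print the bucket boundaries: no finite bucket exceeds ~30s, so the
        // 5-minute sample is invisible to any percentile estimate.
        for (CountAtBucket bucket : timer.takeSnapshot().histogramCounts()) {
            System.out.printf("le=%.1fs count=%.0f%n",
                    bucket.bucket(TimeUnit.SECONDS), bucket.count());
        }
    }
}
```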

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Added maximum-expected-value.catalog_metadata_retrieval_latency=600s to application.properties and a new Spring Boot integration test to verify the configuration.
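For reference, the full property key presumably sits under Spring Boot's standard Micrometer distribution namespace; the summary above quotes only the trailing key, so the prefix below is an assumption:

```properties
# Assumed full key (the PR text quotes only the trailing part):
management.metrics.distribution.maximum-expected-value.catalog_metadata_retrieval_latency=600s
```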

Testing Done

  • Manually Tested on local docker setup. Please include commands run, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Added MetricsHistogramConfigurationTest, which verifies the following (a rough sketch of the test shape follows the test command below):

  • The 600s max expected value configuration is set
  • Percentiles histogram is enabled
  • MeterRegistry is PrometheusMeterRegistry with histogram buckets
  • Histogram buckets extend to 600s (verified via Timer.takeSnapshot().histogramCounts())
  • The configuration value is parseable as a Duration
Test command:
./gradlew :services:tables:test --tests "MetricsHistogramConfigurationTest"
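A sketch of how such a test could be shaped, assuming the property lives under the standard management.metrics.distribution namespace and the timer can be resolved (or registered) by name inside the test; the class name and details below are placeholders, not the repo's actual code:

```java
// Hedged sketch of the test shape described above, not the merged test itself.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.convert.DurationStyle;
import org.springframework.boot.test.autoconfigure.actuate.metrics.AutoConfigureMetrics;
import org.springframework.boot.test.context.SpringBootTest;

import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest        // default webEnvironment, per the final revision
@AutoConfigureMetrics  // re-enable the metrics exporters Spring Boot disables in tests
class MetricsHistogramConfigurationTestSketch {

    @Autowired
    private MeterRegistry meterRegistry;

    // Assumed full property key; the PR quotes only the trailing part.
    @Value("${management.metrics.distribution.maximum-expected-value.catalog_metadata_retrieval_latency}")
    private String configuredMax;

    @Test
    void histogramBucketsExtendTo600Seconds() {
        // Production registry, not the SimpleMeterRegistry test fallback.
        assertThat(meterRegistry).isInstanceOf(PrometheusMeterRegistry.class);

        // The configured value parses as a 600s Duration.
        assertThat(DurationStyle.detectAndParse(configuredMax)).isEqualTo(Duration.ofSeconds(600));

        // Resolve the timer and record a sample so histogram buckets materialize.
        Timer timer = meterRegistry.timer("catalog_metadata_retrieval_latency");
        timer.record(Duration.ofSeconds(1));

        // The largest finite bucket boundary should reach 600s.
        double topFiniteBucketSeconds = Arrays.stream(timer.takeSnapshot().histogramCounts())
                .mapToDouble(b -> b.bucket(TimeUnit.SECONDS))
                .filter(Double::isFinite)
                .max()
                .orElse(0.0);
        assertThat(topFiniteBucketSeconds).isGreaterThanOrEqualTo(600.0);
    }
}
```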

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

…metric

Configure histogram buckets to extend to 600 seconds for the metadata retrieval
latency metric, enabling capture of long-running metadata operations.

Changes:
- Add maximum-expected-value.catalog_metadata_retrieval_latency=600s to application.properties
- Add MetricsHistogramConfigurationTest to verify Spring Boot configuration
@cbb330 cbb330 force-pushed the cbb330-histogram-600s branch from 984464b to 235e7a7 Compare January 14, 2026 08:03
teamurko previously approved these changes Jan 14, 2026
sumedhsakdeo previously approved these changes Jan 14, 2026
…eusMeterRegistry

- Use @AutoConfigureMetrics to enable production metrics config in tests
  (Spring Boot disables metrics exporters by default in tests)
- Validate MeterRegistry is PrometheusMeterRegistry, not SimpleMeterRegistry
- Scrape /actuator/prometheus endpoint to verify the le="600.0" bucket exists (an illustrative line is shown below)
- Confirms maximum-expected-value.catalog_metadata_retrieval_latency=600s is applied
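For reference, the bucket line that this revision of the test scraped for would look roughly like the sample below; the exact exposed metric name depends on Micrometer's Prometheus naming conventions and the counts are made up:

```text
catalog_metadata_retrieval_latency_seconds_bucket{le="600.0"} 42
catalog_metadata_retrieval_latency_seconds_bucket{le="+Inf"} 43
```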
@cbb330 cbb330 dismissed stale reviews from sumedhsakdeo and teamurko via 5d59115 January 15, 2026 21:01
@abhisheknath2011 (Member) left a comment:
Thanks for the PR. I have seen a couple of instances of sudden latency spikes earlier where some of the requests were actually stuck (in a specific time range) and took a significant amount of time while retrieving metadata. I have captured two instances of such high latency from HDFS in a doc.

Replace HTTP endpoint parsing with direct Timer.takeSnapshot() inspection:
- Remove TestRestTemplate and /actuator/prometheus calls
- Use HistogramSnapshot.histogramCounts() to verify bucket boundaries
- Change @SpringBootTest to use default webEnvironment (faster startup)
- Remove brittle Prometheus text format parsing
@cbb330 cbb330 merged commit 05c8a5d into linkedin:main Jan 15, 2026
1 check passed
