
Conversation

@cbb330 (Collaborator) commented Jan 14, 2026

Summary

A critical latency metric is pinned at 30s when the real latency is as high as 5 minutes. The incorrect reading was used in an investigation and led to wrong conclusions.

Above (screenshot): log-based metric showing the real latency.

Below (screenshot): Prometheus-based metric showing p99 capped at 30s.

Configure histogram buckets to extend to 600 seconds for the catalog_metadata_retrieval_latency metric, enabling accurate capture of long-running metadata operations.
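For context on why the pinning likely happens: Micrometer clamps the generated percentile-histogram buckets at the configured maximum expected value, so any request slower than that cap only lands above the top finite bucket and a Prometheus histogram_quantile() over those buckets cannot report a value past it. A minimal, illustrative sketch (not this repo's code; metric name reused from above, values made up):

```java
// Illustrative sketch only: with the old 30s cap, a 5-minute request is
// counted above every finite bucket, so the computed p99 pins at ~30s.
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.distribution.CountAtBucket;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.time.Duration;
import java.util.concurrent.TimeUnit;

public class BucketCapDemo {
    public static void main(String[] args) {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        Timer timer = Timer.builder("catalog_metadata_retrieval_latency")
                .publishPercentileHistogram()
                .maximumExpectedValue(Duration.ofSeconds(30)) // the old, too-low cap
                .register(registry);

        timer.record(Duration.ofMinutes(5)); // a genuinely slow metadata call

        // Print the bucket boundaries: no finite bucket exceeds ~30s, so the
        // 5-minute sample is invisible to any percentile estimate.
        for (CountAtBucket bucket : timer.takeSnapshot().histogramCounts()) {
            System.out.printf("le=%.1fs count=%.0f%n",
                    bucket.bucket(TimeUnit.SECONDS), bucket.count());
        }
    }
}
```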

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Added maximum-expected-value.catalog_metadata_retrieval_latency=600s to application.properties and a new Spring Boot integration test to verify the configuration.
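For reference, the full property key presumably sits under Spring Boot's standard Micrometer distribution namespace; the summary above quotes only the trailing key, so the prefix below is an assumption:

```properties
# Assumed full key (the PR text quotes only the trailing part):
management.metrics.distribution.maximum-expected-value.catalog_metadata_retrieval_latency=600s
```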

Testing Done

  • Manually Tested on local docker setup. Please include commands run, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

Added MetricsHistogramConfigurationTest, which verifies the following (a rough sketch of the test shape follows the test command below):

  • The 600s max expected value configuration is set
  • Percentiles histogram is enabled
  • MeterRegistry is PrometheusMeterRegistry with histogram buckets
  • Histogram buckets extend to 600s (verified via Timer.takeSnapshot().histogramCounts())
  • The configuration value is parseable as a Duration
Test command:
./gradlew :services:tables:test --tests "MetricsHistogramConfigurationTest"
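A sketch of how such a test could be shaped, assuming the property lives under the standard management.metrics.distribution namespace and the timer can be resolved (or registered) by name inside the test; the class name and details below are placeholders, not the repo's actual code:

```java
// Hedged sketch of the test shape described above, not the merged test itself.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusMeterRegistry;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.convert.DurationStyle;
import org.springframework.boot.test.autoconfigure.actuate.metrics.AutoConfigureMetrics;
import org.springframework.boot.test.context.SpringBootTest;

import java.time.Duration;
import java.util.Arrays;
import java.util.concurrent.TimeUnit;

import static org.assertj.core.api.Assertions.assertThat;

@SpringBootTest        // default webEnvironment, per the final revision
@AutoConfigureMetrics  // re-enable the metrics exporters Spring Boot disables in tests
class MetricsHistogramConfigurationTestSketch {

    @Autowired
    private MeterRegistry meterRegistry;

    // Assumed full property key; the PR quotes only the trailing part.
    @Value("${management.metrics.distribution.maximum-expected-value.catalog_metadata_retrieval_latency}")
    private String configuredMax;

    @Test
    void histogramBucketsExtendTo600Seconds() {
        // Production registry, not the SimpleMeterRegistry test fallback.
        assertThat(meterRegistry).isInstanceOf(PrometheusMeterRegistry.class);

        // The configured value parses as a 600s Duration.
        assertThat(DurationStyle.detectAndParse(configuredMax)).isEqualTo(Duration.ofSeconds(600));

        // Resolve the timer and record a sample so histogram buckets materialize.
        Timer timer = meterRegistry.timer("catalog_metadata_retrieval_latency");
        timer.record(Duration.ofSeconds(1));

        // The largest finite bucket boundary should reach 600s.
        double topFiniteBucketSeconds = Arrays.stream(timer.takeSnapshot().histogramCounts())
                .mapToDouble(b -> b.bucket(TimeUnit.SECONDS))
                .filter(Double::isFinite)
                .max()
                .orElse(0.0);
        assertThat(topFiniteBucketSeconds).isGreaterThanOrEqualTo(600.0);
    }
}
```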

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

…metric

Configure histogram buckets to extend to 600 seconds for the metadata retrieval
latency metric, enabling capture of long-running metadata operations.

Changes:
- Add maximum-expected-value.catalog_metadata_retrieval_latency=600s to application.properties
- Add MetricsHistogramConfigurationTest to verify Spring Boot configuration
@cbb330 cbb330 force-pushed the cbb330-histogram-600s branch from 984464b to 235e7a7 Compare January 14, 2026 08:03
teamurko previously approved these changes Jan 14, 2026
sumedhsakdeo previously approved these changes Jan 14, 2026
…eusMeterRegistry

- Use @AutoConfigureMetrics to enable production metrics config in tests
  (Spring Boot disables metrics exporters by default in tests)
- Validate MeterRegistry is PrometheusMeterRegistry, not SimpleMeterRegistry
- Scrape /actuator/prometheus endpoint to verify the le="600.0" bucket exists (an illustrative line is shown below)
- Confirms maximum-expected-value.catalog_metadata_retrieval_latency=600s is applied
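For reference, the bucket line that this revision of the test scraped for would look roughly like the sample below; the exact exposed metric name depends on Micrometer's Prometheus naming conventions and the counts are made up:

```text
catalog_metadata_retrieval_latency_seconds_bucket{le="600.0"} 42
catalog_metadata_retrieval_latency_seconds_bucket{le="+Inf"} 43
```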
@cbb330 cbb330 dismissed stale reviews from sumedhsakdeo and teamurko via 5d59115 January 15, 2026 21:01
@abhisheknath2011 (Member) left a comment:
Thanks for the PR. I have seen a couple of instances of sudden latency spikes earlier where some of the requests were actually stuck (in a specific time range) and took a significant amount of time while retrieving metadata. I have captured two instances of such high latency from HDFS in a doc.

Replace HTTP endpoint parsing with direct Timer.takeSnapshot() inspection:
- Remove TestRestTemplate and /actuator/prometheus calls
- Use HistogramSnapshot.histogramCounts() to verify bucket boundaries
- Change @SpringBootTest to use default webEnvironment (faster startup)
- Remove brittle Prometheus text format parsing
@cbb330 cbb330 merged commit 05c8a5d into linkedin:main Jan 15, 2026
1 check passed
