Klag

Kafka Consumer Lag Exporter — Know when your consumers fall behind, before it becomes a problem.

Inspired by kafka-lag-exporter (archived 2024). Built with Vert.x and Micrometer.

Scales to large clusters: Monitors thousands of consumer groups in ~50MB heap. Request batching with configurable delays prevents overwhelming brokers — fetch offsets for 500+ groups without spiking cluster CPU.

Why Klag?

Consumer lag is the gap between what Kafka has produced and what your consumers have processed. Left unmonitored, growing lag leads to:

Stale data in downstream systems
Memory pressure as consumers struggle to catch up
Silent failures when consumer groups die without alerts

Klag continuously monitors all consumer groups and exposes metrics to your observability stack.

Key Features

Feature	Why It Matters
Lag velocity	Know if lag is growing or shrinking — catch problems before they escalate
Time-based lag estimation	See lag in seconds/minutes, not just message counts
Hot partition detection	Find partitions with uneven load causing bottlenecks
Consumer group state tracking	Alert on Rebalancing, Dead, or Empty states
Request batching	Safely monitor large clusters without overwhelming brokers
Stale group cleanup	Automatically stops reporting deleted/inactive groups

Supported Sinks

Prometheus endpoint
Datadog
OTLP (OpenTelemetry) — works with Grafana Cloud, New Relic, etc.
(planned) Prometheus Push Gateway, StatsD, Google Stackdriver

Quick Start

docker run -e KAFKA_BOOTSTRAP_SERVERS=kafka:9092 \
           -e METRICS_REPORTER=prometheus \
           -p 8888:8888 \
           themoah/klag

Metrics available at http://localhost:8888/metrics

Metrics Exposed

Metric	Description
`klag.consumer.lag`	Current lag per partition (also `.sum`, `.max`, `.min` aggregations)
`klag.consumer.lag.velocity`	Rate of change — positive means falling behind
`klag.consumer.lag.time_ms`	Estimated time (ms) to process current lag at current rate
`klag.consumer.lag.time_to_close_seconds`	Estimated seconds until lag reaches zero (only when catching up)
`klag.consumer.group.state`	Group health: Stable, Rebalancing, Dead, Empty
`klag.hot_partition`	Partitions with statistically abnormal throughput
`klag.hot_partition.lag`	Lag on hot partitions specifically
`klag.topic.partitions`	Partition count per topic
`klag.partition.log_end_offset`	Latest offset per partition
`klag.consumer.committed_offset`	Last committed offset per consumer

All metrics tagged with consumer_group, topic, partition where applicable.

Blogpost: Introducing Klag

Installation

Helm Chart

helm install klag ./charts/klag \
  --set kafka.bootstrapServers="kafka-broker:9092"

With SASL authentication

helm install klag ./charts/klag \
  --set kafka.bootstrapServers="kafka:9092" \
  --set kafka.securityProtocol="SASL_SSL" \
  --set kafka.saslMechanism="PLAIN" \
  --set kafka.saslJaasConfig="org.apache.kafka.common.security.plain.PlainLoginModule required username='user' password='pass';"

See charts/klag/README.md for full configuration options.

Docker with Environment File

docker run --env-file .env themoah/klag

Sample .env file

# Kafka connection
KAFKA_BOOTSTRAP_SERVERS=instance.gcp.confluent.cloud:9092
KAFKA_SECURITY_PROTOCOL=SASL_SSL
KAFKA_SASL_MECHANISM=PLAIN
KAFKA_SASL_JAAS_CONFIG="org.apache.kafka.common.security.plain.PlainLoginModule required username=${SASL_USERNAME} password=${SASL_PASSWORD};"

# Metrics
METRICS_REPORTER=prometheus
METRICS_INTERVAL_MS=30000
METRICS_GROUP_FILTER=*

# Optional: JVM metrics
METRICS_JVM_ENABLED=true

Configuration

Configure via src/main/resources/application.properties or environment variables:

Variable	Default	Description
`KAFKA_BOOTSTRAP_SERVERS`	`localhost:9092`	Kafka broker addresses
`KAFKA_REQUEST_TIMEOUT_MS`	`30000`	Request timeout
`KAFKA_CHUNK_COUNT`	`1`	Split offset requests into N batches
`KAFKA_CHUNK_DELAY_MS`	`0`	Delay (ms) between batches
`METRICS_REPORTER`	`none`	`prometheus`, `datadog`, or `otlp`
`METRICS_INTERVAL_MS`	`60000`	How often to collect metrics
`METRICS_GROUP_FILTER`	`*`	Glob pattern to filter consumer groups

See CLAUDE.md for the complete configuration reference.

Kafka ACL Permissions

Klag requires read-only access to monitor consumer lag. It uses only the Kafka Admin Client API with DESCRIBE permissions—no write or alter access needed.

Required Permissions

Resource	Name	Permission	Operations
CLUSTER	kafka-cluster	DESCRIBE	Health check, list consumer groups
TOPIC	`*` or prefixed	DESCRIBE	Get partition info and offsets
GROUP	`*` or prefixed	DESCRIBE	Get group state and committed offsets

Self-Managed Kafka

Monitor all groups and topics

# Cluster permissions (required)
kafka-acls --bootstrap-server <broker> \
  --add --allow-principal User:<klag-user> \
  --operation Describe --cluster

# All topics
kafka-acls --bootstrap-server <broker> \
  --add --allow-principal User:<klag-user> \
  --operation Describe --topic '*'

# All consumer groups
kafka-acls --bootstrap-server <broker> \
  --add --allow-principal User:<klag-user> \
  --operation Describe --group '*'

Monitor specific prefix only

# Cluster permissions (required for listConsumerGroups)
kafka-acls --bootstrap-server <broker> \
  --add --allow-principal User:<klag-user> \
  --operation Describe --cluster

# Topics with prefix
kafka-acls --bootstrap-server <broker> \
  --add --allow-principal User:<klag-user> \
  --operation Describe --topic 'myapp-' \
  --resource-pattern-type prefixed

# Consumer groups with prefix
kafka-acls --bootstrap-server <broker> \
  --add --allow-principal User:<klag-user> \
  --operation Describe --group 'myapp-' \
  --resource-pattern-type prefixed

Confluent Cloud

Create a service account with the following ACLs using the Confluent CLI:

Monitor all groups and topics

# Set your cluster ID
CLUSTER_ID=<your-cluster-id>
SERVICE_ACCOUNT=<service-account-id>

# Cluster permissions
confluent kafka acl create --allow --service-account $SERVICE_ACCOUNT \
  --operations describe --cluster-scope

# All topics
confluent kafka acl create --allow --service-account $SERVICE_ACCOUNT \
  --operations describe --topic '*'

# All consumer groups
confluent kafka acl create --allow --service-account $SERVICE_ACCOUNT \
  --operations describe --consumer-group '*'

Monitor specific prefix only

CLUSTER_ID=<your-cluster-id>
SERVICE_ACCOUNT=<service-account-id>

# Cluster permissions (required for listConsumerGroups)
confluent kafka acl create --allow --service-account $SERVICE_ACCOUNT \
  --operations describe --cluster-scope

# Topics with prefix
confluent kafka acl create --allow --service-account $SERVICE_ACCOUNT \
  --operations describe --topic 'myapp-' --prefix

# Consumer groups with prefix
confluent kafka acl create --allow --service-account $SERVICE_ACCOUNT \
  --operations describe --consumer-group 'myapp-' --prefix

Note: The METRICS_GROUP_FILTER environment variable provides application-level filtering, but cluster DESCRIBE permission is always required since listConsumerGroups() queries all groups before filtering.

Building from Source

Requires Java 21.

./gradlew clean test      # Run tests
./gradlew clean assemble  # Build fat JAR
./gradlew clean run       # Run with hot-reload

Development

Testing Helm Chart

./scripts/test-helm-chart.sh

Local Kubernetes Testing (macOS)

# Full test: create cluster, install chart, validate, cleanup
./scripts/local-k8s-test.sh

# Auto-install dependencies (kind, helm, kubectl) via Homebrew
./scripts/local-k8s-test.sh --auto-install

# Keep cluster running after test
./scripts/local-k8s-test.sh --skip-cleanup

# Cleanup only
./scripts/local-k8s-test.sh --cleanup

Prerequisites: Docker Desktop, kind, helm, kubectl (use --auto-install to install via Homebrew).

Contributing

Fork the repository
Create a feature branch

Run tests before submitting:

./gradlew test                # Java tests
./scripts/test-helm-chart.sh  # Helm chart tests

Submit a pull request

Some parts of the code were written with Claude

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
.github/workflows		.github/workflows
charts/klag		charts/klag
dashboard		dashboard
scripts		scripts
src		src
.dockerignore		.dockerignore
.editorconfig		.editorconfig
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
TODO.md		TODO.md
build.gradle.kts		build.gradle.kts
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle.kts		settings.gradle.kts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Klag

Why Klag?

Key Features

Supported Sinks

Quick Start

Metrics Exposed

Installation

Helm Chart

Docker with Environment File

Configuration

Kafka ACL Permissions

Required Permissions

Self-Managed Kafka

Confluent Cloud

Building from Source

Development

Testing Helm Chart

Local Kubernetes Testing (macOS)

Contributing

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

themoah/klag

Folders and files

Latest commit

History

Repository files navigation

Klag

Why Klag?

Key Features

Supported Sinks

Quick Start

Metrics Exposed

Installation

Helm Chart

Docker with Environment File

Configuration

Kafka ACL Permissions

Required Permissions

Self-Managed Kafka

Confluent Cloud

Building from Source

Development

Testing Helm Chart

Local Kubernetes Testing (macOS)

Contributing

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages