Automatic scale-to-zero for HashiCorp Nomad workloads with Traefik and wake-on-request
Scale-to-zero lets Nomad services scale down to 0 allocations when idle and wakes them automatically on the next request. This cuts infrastructure costs for services with intermittent or unpredictable traffic while keeping them reachable: the first request after an idle period waits for the service to become healthy, and later requests are served normally.
- Automatic Wake-on-Request: Services scale from 0 to N when traffic arrives via Traefik middleware
- Intelligent Idle Detection: Configurable idle timeouts automatically scale down unused services
- Dead Job Revival: Automatically restore and start stopped/purged jobs on first request
- ACL-Ready: First-class support for Nomad and Consul ACLs with token management
- Flexible Storage: Choose between Consul KV (simple) or Redis (high-performance) backends
- Production-Ready: Minimal configuration, battle-tested in production environments
- Per-Service Configuration: Fine-grained control via job metadata and Traefik tags
This implementation combines several components:
- Traefik as the ingress proxy
- ScaleWaker - a custom Traefik middleware plugin to wake services
- idle-scaler - an agent to scale services back down after idle timeout
- activity-store - Consul KV or Redis backend to track activity and store job specs
The fastest way to try scale-to-zero locally is with the all-in-one demo script:
./local-test/scripts/start-local-with-acl.sh
This script:
- Starts Consul + Nomad in dev mode with ACLs enabled
- Creates least-privilege tokens
- Starts Traefik with the ScaleWaker plugin
- Builds and runs the idle-scaler as a Nomad system job
Once the stack is running:
# 1. Deploy a scale-to-zero enabled job
nomad job run local-test/sample-jobs/echo-s2z.hcl
# 2. Test the service
curl -H 'Host: echo-s2z.localhost' http://localhost/
# 3. Scale it down to 0
nomad job scale echo-s2z main 0
# 4. Hit it again - it wakes back up automatically!
curl -H 'Host: echo-s2z.localhost' http://localhost/
See LOCAL_TESTING.md for a detailed development setup and testing guide.
Scale-to-zero is perfect for:
- Development/Staging Environments: Dramatically reduce costs for environments used only during business hours
- Preview Environments: PR previews and feature branches that sit idle most of the time
- Batch Processing Services: Jobs triggered by external events (webhooks, cron) with idle periods
- Internal Tools: Admin panels, dashboards, and utilities with sporadic usage
- Microservices: Low-traffic services in a microservices architecture
- Multi-Tenant Applications: Per-customer services that aren't always active
Traditional auto-scaling typically scales to a minimum of 1 instance, which still consumes resources 24/7. Scale-to-zero goes further:
- Reduce costs by 90%+ for idle services
- Keep services reachable with automatic wake-on-request, at the cost of a cold start on the first request after an idle period
- Optimize resource utilization across your cluster
- Simplify operations with automatic lifecycle management
- A request arrives at Traefik for some-service.localhost.
- The ScaleWaker middleware determines the target service/job from the request (usually the Host header).
- If the service is not healthy/registered (typically because it's scaled to 0):
  - it calls the Nomad API to scale the job group up (usually to 1)
  - it waits for the service to become healthy in Consul (bounded by a timeout)
- It records activity (last request timestamp) in the configured activity store.
- The request is proxied to the now-running backend.
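The request path above can be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not the ScaleWaker source: the `Target` and `Cluster` types, the `wake` function, and the in-memory `fakeCluster` are hypothetical names introduced here so the flow can run without Nomad or Consul.

```go
package main

import (
	"fmt"
	"time"
)

// Target names the Nomad job/group to scale and the Consul service to watch.
// Hypothetical type, introduced for this sketch only.
type Target struct {
	Job, Group, Service string
}

// Cluster is a hypothetical abstraction over the Nomad API, Consul health
// checks, and the activity store; it is not the plugin's real interface.
type Cluster interface {
	Healthy(service string) bool
	ScaleUp(job, group string, count int) error
	RecordActivity(service string, at time.Time)
}

// wake mirrors the steps above: scale up if the service is not healthy, wait
// for health (bounded by timeout), then record activity before proxying.
func wake(c Cluster, t Target, timeout, poll time.Duration) error {
	if !c.Healthy(t.Service) {
		if err := c.ScaleUp(t.Job, t.Group, 1); err != nil {
			return fmt.Errorf("scale up %s/%s: %w", t.Job, t.Group, err)
		}
		deadline := time.Now().Add(timeout)
		for !c.Healthy(t.Service) {
			if time.Now().After(deadline) {
				return fmt.Errorf("service %s not healthy after %s", t.Service, timeout)
			}
			time.Sleep(poll)
		}
	}
	c.RecordActivity(t.Service, time.Now())
	return nil
}

// fakeCluster is an in-memory stand-in so the sketch runs on its own.
type fakeCluster struct {
	count     int
	lastTouch time.Time
}

func (f *fakeCluster) Healthy(string) bool { return f.count > 0 }

func (f *fakeCluster) ScaleUp(_, _ string, count int) error {
	f.count = count
	return nil
}

func (f *fakeCluster) RecordActivity(_ string, at time.Time) { f.lastTouch = at }

func main() {
	c := &fakeCluster{} // starts scaled to 0
	target := Target{Job: "echo-s2z", Group: "main", Service: "echo-s2z"}
	err := wake(c, target, 2*time.Second, 10*time.Millisecond)
	fmt.Println("error:", err, "count:", c.count)
}
```

Bounding the health wait with a deadline keeps a request from hanging indefinitely when a job fails to start; the caller can then return an error to the client instead.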
- The idle-scaler periodically scans for scale-to-zero-enabled jobs.
- For each job it reads the last activity timestamp.
- If now - lastActivity > idleTimeout, it scales the job group down to 0.
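The idle check above comes down to a single comparison. The sketch below shows one way to express it in Go; the function names `shouldScaleDown` and `idleJobs` are hypothetical, and storing timestamps as Unix seconds is an assumption about the activity store format, not something this README specifies.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// shouldScaleDown applies the rule above: now - lastActivity > idleTimeout.
// Timestamps are Unix seconds, which is an assumption about the store format.
func shouldScaleDown(nowUnix, lastActivityUnix int64, idleTimeout time.Duration) bool {
	idle := time.Unix(nowUnix, 0).Sub(time.Unix(lastActivityUnix, 0))
	return idle > idleTimeout
}

// idleJobs returns, in sorted order, the jobs whose last activity is older
// than idleTimeout. lastActivity maps a job name to a Unix timestamp.
func idleJobs(nowUnix int64, lastActivity map[string]int64, idleTimeout time.Duration) []string {
	var idle []string
	for job, ts := range lastActivity {
		if shouldScaleDown(nowUnix, ts, idleTimeout) {
			idle = append(idle, job)
		}
	}
	sort.Strings(idle)
	return idle
}

func main() {
	now := time.Now().Unix()
	lastActivity := map[string]int64{
		"echo-s2z":   now - 600, // idle for 10 minutes
		"my-service": now - 30,  // active 30 seconds ago
	}
	// With a 300 second timeout only echo-s2z qualifies for scale-down.
	fmt.Println(idleJobs(now, lastActivity, 300*time.Second))
}
```

The comparison is strict, so a job that has been idle for exactly the timeout is left running until the next scan.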
- HashiCorp Nomad cluster (1.0+)
- HashiCorp Consul cluster
- Traefik (2.9+) as ingress proxy
- Redis (optional, recommended for large deployments)
- Deploy the Traefik Plugin
Option A: Local Plugin Installation (Recommended for now)
Copy the plugin to Traefik's local plugins directory:
# Clone the repository
git clone https://github.com/Metatable-ai/nomad_scale_to_zero.git
# Copy plugin to Traefik plugins directory
mkdir -p /path/to/traefik/plugins-local
cp -r nomad_scale_to_zero/traefik-plugin /path/to/traefik/plugins-local/
Configure Traefik static configuration:
experimental:
  localPlugins:
    scalewaker:
      moduleName: "nomad_scale_to_zero/traefik-plugin"
Option B: Plugin Catalog (Coming soon)
Note: Plugin Catalog support requires restructuring this repository to have the plugin code at the root. This is planned for a future release. For now, use local plugin installation (Option A).
Configure environment variables:
S2Z_NOMAD_ADDR=http://nomad.service.consul:4646
S2Z_CONSUL_ADDR=http://consul.service.consul:8500
S2Z_ACTIVITY_STORE=redis
S2Z_REDIS_ADDR=redis.service.consul:6379
S2Z_NOMAD_TOKEN=<your-nomad-token>
S2Z_CONSUL_TOKEN=<your-consul-token>
- Deploy the Idle-Scaler
Run the idle-scaler as a Nomad system job:
nomad job run local-test/system-jobs/idle-scaler.hcl
Ensure environment variables are set with appropriate tokens.
- Configure Jobs for Scale-to-Zero
Add metadata to your job specifications:
job "my-service" {
  meta = {
    "scale-to-zero.enabled" = "true"
    "scale-to-zero.idle-timeout" = "300" # seconds
    "scale-to-zero.job-spec-kv" = "scale-to-zero/jobs/my-service"
  }

  group "main" {
    # ... your group config

    service {
      tags = [
        "traefik.enable=true",
        "traefik.http.routers.myservice.rule=Host(`myservice.example.com`)",
        "traefik.http.middlewares.scalewaker-myservice.plugin.scalewaker.serviceName=my-service",
        "traefik.http.middlewares.scalewaker-myservice.plugin.scalewaker.timeout=30s",
        "traefik.http.routers.myservice.middlewares=scalewaker-myservice",
      ]
    }
  }
}
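To illustrate how a component might consume the job metadata above, here is a minimal Go sketch that turns a job's meta map into typed settings. The `S2ZConfig` struct and `parseMeta` function are hypothetical names, and falling back to a caller-supplied default when the idle timeout is missing is an assumption, not documented behavior.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// S2ZConfig is a hypothetical struct holding the per-job settings.
type S2ZConfig struct {
	Enabled     bool
	IdleTimeout time.Duration
	JobSpecKey  string
}

// parseMeta reads the scale-to-zero keys from a Nomad job's meta map.
// defaultTimeout is used when "scale-to-zero.idle-timeout" is absent
// (an assumption made for this sketch).
func parseMeta(meta map[string]string, defaultTimeout time.Duration) (S2ZConfig, error) {
	cfg := S2ZConfig{
		IdleTimeout: defaultTimeout,
		JobSpecKey:  meta["scale-to-zero.job-spec-kv"],
	}
	if v, ok := meta["scale-to-zero.enabled"]; ok {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("invalid scale-to-zero.enabled %q: %w", v, err)
		}
		cfg.Enabled = enabled
	}
	if v, ok := meta["scale-to-zero.idle-timeout"]; ok {
		secs, err := strconv.Atoi(v)
		if err != nil || secs <= 0 {
			return cfg, fmt.Errorf("invalid scale-to-zero.idle-timeout %q", v)
		}
		cfg.IdleTimeout = time.Duration(secs) * time.Second
	}
	return cfg, nil
}

func main() {
	meta := map[string]string{
		"scale-to-zero.enabled":      "true",
		"scale-to-zero.idle-timeout": "300",
		"scale-to-zero.job-spec-kv":  "scale-to-zero/jobs/my-service",
	}
	cfg, err := parseMeta(meta, 10*time.Minute)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println(cfg.Enabled, cfg.IdleTimeout, cfg.JobSpecKey)
}
```

Meta values are always strings in a Nomad job, so the timeout is parsed as an integer number of seconds to match the `# seconds` comment in the job specification.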
- Use Redis for Storage: For deployments with 50+ jobs, use Redis instead of Consul KV to reduce Raft pressure
- Set Appropriate Timeouts: Balance cold-start latency against resource savings
- Monitor Wake Times: Track how long services take to become healthy after wake-up
- Use ACL Tokens: Always use least-privilege tokens in production
- Test Dead Job Revival: Verify job specs are stored correctly and can be restored
- Configure Health Checks: Ensure services have proper health checks for reliable wake detection
Create dedicated policies for scale-to-zero components:
Nomad Policy (see local-test/nomad/scale-to-zero-policy.hcl):
namespace "*" {
policy = "write"
capabilities = ["submit-job", "read-job", "scale-job"]
}
Consul Policy (see local-test/nomad/scale-to-zero-consul-policy.hcl):
key_prefix "scale-to-zero/" {
policy = "write"
}
service_prefix "" {
policy = "write"
}
The project evolved from a "verbose tags everywhere" setup to a simpler V2 configuration:
- Infrastructure configuration comes from environment variables (set once on Traefik / idle-scaler), instead of repeating addresses in every job.
- Per-job configuration in Traefik tags is minimal (usually only serviceName and timeout).
- ACL support is first-class:
  - Nomad API calls include X-Nomad-Token
  - Consul API calls include X-Consul-Token
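As an illustration of the ACL headers listed above, the following Go sketch builds a Nomad scale request and a Consul health query with the tokens attached. The endpoints used (`POST /v1/job/:job_id/scale` and `GET /v1/health/service/:service`) come from the public Nomad and Consul HTTP APIs; the function names, the `Message` text, and the assumption that the plugin builds its requests this way are illustrative only.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// newScaleRequest builds a Nomad "scale job" call and attaches the ACL
// token, if any, as X-Nomad-Token. Hypothetical helper for this sketch.
func newScaleRequest(nomadAddr, job, group string, count int, token string) (*http.Request, error) {
	body, err := json.Marshal(map[string]interface{}{
		"Count":   count,
		"Target":  map[string]string{"Group": group},
		"Message": "scale-to-zero wake-up", // illustrative message text
	})
	if err != nil {
		return nil, err
	}
	endpoint := nomadAddr + "/v1/job/" + url.PathEscape(job) + "/scale"
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		req.Header.Set("X-Nomad-Token", token)
	}
	return req, nil
}

// newHealthRequest builds a Consul query for passing instances of a service
// and attaches the ACL token, if any, as X-Consul-Token.
func newHealthRequest(consulAddr, service, token string) (*http.Request, error) {
	endpoint := consulAddr + "/v1/health/service/" + url.PathEscape(service) + "?passing=true"
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	if token != "" {
		req.Header.Set("X-Consul-Token", token)
	}
	return req, nil
}

func main() {
	req, err := newScaleRequest("http://nomad.service.consul:4646", "echo-s2z", "main", 1, "example-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path, req.Header.Get("X-Nomad-Token") != "")
}
```

Leaving the header off when the token is empty matches the "optional" status of the token variables, so the same code path works on clusters without ACLs.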
Traefik plugin (ScaleWaker) reads:
- S2Z_NOMAD_ADDR (default: http://nomad.service.consul:4646)
- S2Z_CONSUL_ADDR (default: http://consul.service.consul:8500)
- S2Z_NOMAD_TOKEN (optional)
- S2Z_CONSUL_TOKEN (optional)
- S2Z_ACTIVITY_STORE (consul or redis)
- S2Z_JOB_SPEC_STORE (consul or redis)
- S2Z_REDIS_ADDR, S2Z_REDIS_PASSWORD (optional, if using Redis)
Idle-scaler uses:
- NOMAD_ADDR, CONSUL_ADDR
- NOMAD_TOKEN, CONSUL_TOKEN (optional)
- IDLE_CHECK_INTERVAL, DEFAULT_IDLE_TIMEOUT
- STORE_TYPE + Redis config when applicable
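A rough Go sketch of reading the plugin's environment variables with the documented defaults might look like this. The `PluginEnv` struct and `loadPluginEnv` function are hypothetical; defaulting `S2Z_ACTIVITY_STORE` to `consul` when unset and requiring `S2Z_REDIS_ADDR` for the Redis store are assumptions made for the example, not confirmed behavior.

```go
package main

import (
	"fmt"
	"os"
)

// PluginEnv is a hypothetical struct for the settings ScaleWaker reads.
type PluginEnv struct {
	NomadAddr     string
	ConsulAddr    string
	NomadToken    string
	ConsulToken   string
	ActivityStore string
	RedisAddr     string
}

// loadPluginEnv reads settings through lookup (os.LookupEnv in production,
// a map-backed function in tests) and applies the documented defaults.
func loadPluginEnv(lookup func(string) (string, bool)) (PluginEnv, error) {
	get := func(key, def string) string {
		if v, ok := lookup(key); ok && v != "" {
			return v
		}
		return def
	}
	env := PluginEnv{
		NomadAddr:     get("S2Z_NOMAD_ADDR", "http://nomad.service.consul:4646"),
		ConsulAddr:    get("S2Z_CONSUL_ADDR", "http://consul.service.consul:8500"),
		NomadToken:    get("S2Z_NOMAD_TOKEN", ""),
		ConsulToken:   get("S2Z_CONSUL_TOKEN", ""),
		ActivityStore: get("S2Z_ACTIVITY_STORE", "consul"), // default is an assumption
		RedisAddr:     get("S2Z_REDIS_ADDR", ""),
	}
	switch env.ActivityStore {
	case "consul":
	case "redis":
		// Requiring an address for Redis is an assumption of this sketch.
		if env.RedisAddr == "" {
			return env, fmt.Errorf("S2Z_REDIS_ADDR is required when S2Z_ACTIVITY_STORE=redis")
		}
	default:
		return env, fmt.Errorf("S2Z_ACTIVITY_STORE must be consul or redis, got %q", env.ActivityStore)
	}
	return env, nil
}

func main() {
	env, err := loadPluginEnv(os.LookupEnv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("Nomad:", env.NomadAddr, "store:", env.ActivityStore)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the loader easy to exercise with a plain map.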
The quickest way to demo this to other developers is the single script:
local-test/scripts/start-local-with-acl.sh
It will:
- start Consul + Nomad in dev mode with ACLs enabled
- create least-privilege tokens and export them
- start Traefik configured with the local plugin and Consul Catalog provider
- build the idle-scaler and run it as a Nomad system job (with tokens)
It prints dashboards and tails Traefik logs. Stop with Ctrl-C (the script traps and cleans up the spawned processes).
After the local stack is up:
- Submit a sample job (for example):
nomad job run local-test/sample-jobs/echo-s2z.hcl
- Hit it through Traefik:
curl -H 'Host: echo-s2z.localhost' http://localhost/
- Scale it down to 0:
nomad job scale echo-s2z main 0
- Hit it again; it should wake back up:
curl -H 'Host: echo-s2z.localhost' http://localhost/
The policies used by the local ACL script live in local-test/nomad/:
- scale-to-zero-policy.hcl: Nomad policy for submitting/reading/scaling jobs
- scale-to-zero-consul-policy.hcl: Consul policy for KV writes and service discovery/cleanup
- consul-catalog-read-policy.hcl: Consul policy used by Traefik's Consul Catalog provider
- nomad-agent-consul-policy.hcl: Consul policy used by the Nomad agent for service registration/deregistration
- traefik-plugin/: ScaleWaker Traefik middleware plugin (Go)
- idle-scaler/: Idle scaler agent (Go)
- activity-store/: Shared store abstraction (Consul KV / Redis)
- local-test/: Local development configs and sample jobs
- local-test/scripts/start-local-with-acl.sh: One-shot local demo with ACLs
- local-test/traefik/: Dynamic Traefik config (fallback router/middleware)
- local-test/sample-jobs/: Sample Nomad jobs with minimal V2 tags
- local-test/nomad/: ACL policy HCLs for local testing
We welcome contributions! Please see our Contributing Guide for details on:
- How to report bugs and request features
- Development setup and testing
- Code style and conventions
- Pull request process
- Documentation: Start with LOCAL_TESTING.md for setup details
- Bug Reports: Open an issue with reproduction steps
- Questions: Use GitHub Discussions for questions
- Component Docs: See activity-store/README.md and idle-scaler/README.md
- Creating Releases: See RELEASE.md for the complete release process
- Changelog: See CHANGELOG.md for version history
Future enhancements we're considering:
- Metrics and monitoring integration (Prometheus/Grafana)
- Support for multiple activity stores simultaneously
- Configurable wake-up strategies (parallel scaling, gradual rollout)
- Integration with other ingress controllers (nginx, envoy)
- Webhook notifications for scale events
- Advanced idle detection (request rate, resource usage)
Have ideas? Open a feature request or start a discussion!
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Built with:
- HashiCorp Nomad - Workload orchestration
- HashiCorp Consul - Service discovery and KV storage
- Traefik - Cloud-native ingress proxy
Ready to get started? Check out our Quick Start guide or dive into LOCAL_TESTING.md for detailed setup instructions.
Want to contribute? Read our Contributing Guide to get involved!