Intelligent, Cost-Optimized Kubernetes Autoscaling for AWS
Reduce infrastructure costs by 40-50% through intelligent, event-driven autoscaling
Features • Architecture • Quick Start • Technical Docs • Testing
Client: TechFlow Solutions - E-commerce platform (Dhaka, Bangladesh).
Current Pain Points:
- 💸 1.2 lakh BDT/month infrastructure waste (5 static nodes running 24/7, 60% idle).
- 🔥 Flash sale crashes - manual scaling takes 15-20 minutes, causing significant revenue loss.
- 👨‍💻 Manual intervention required for every traffic spike, leading to high operational overhead.
Our Solution: node-fleet is an intelligent, serverless autoscaler for K3s clusters. It shifts from a reactive, manual operations model to an automated, metric-driven architecture. By leveraging AWS Lambda and Prometheus, it ensures the cluster capacity perfectly matches real-time demand while optimizing for cost via Spot instances.
- ✅ Cost Reduction: Target 40-50% savings (60K BDT/month).
- ✅ Response Time: <3 min for new capacity (83% faster than manual).
- ✅ Reliability: 0 service disruptions during scaling operations.
Design Rationale:
- K3s over Standard K8s: Chosen for its lightweight footprint (50% less resource usage on master), crucial for cost-saving on small instances.
- AWS Lambda: Used for the autoscaler "brain" to avoid paying for a 24/7 controller instance.
- VPC Isolation: Master and monitoring run in a controlled environment, while workers reside in private subnets for enhanced security.
- EventBridge triggers Lambda every 2 minutes.
- Lambda scrapes custom metrics from Prometheus.
- Lambda checks DynamoDB to manage state and distributed locking.
- Decision Engine evaluates thresholds and historical patterns.
- EC2 Manager provisions nodes from Launch Templates or terminates them gracefully.
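The flow above maps to a short control loop inside the Lambda handler. A minimal sketch of that loop, assuming hypothetical `autoscaler.*` modules named after the components described later (the handler signature details and return payloads are illustrative, not the actual implementation):

```python
import os

# Hypothetical internal modules named after the autoscaler components.
from autoscaler.metrics import MetricCollector
from autoscaler.decisions import DecisionEngine
from autoscaler.state import StateManager
from autoscaler.ec2 import EC2Manager


def handler(event, context):
    """EventBridge-triggered entry point (invoked every 2 minutes)."""
    state = StateManager(table=os.environ["DYNAMODB_TABLE"],
                         cluster_id=os.environ["CLUSTER_ID"])
    # Skip this run if another invocation is already scaling the cluster.
    if not state.acquire_lock(ttl_seconds=300):
        return {"action": "SKIPPED", "reason": "lock held by another run"}

    try:
        metrics = MetricCollector(os.environ["PROMETHEUS_URL"]).collect()
        decision = DecisionEngine(
            min_nodes=int(os.environ["MIN_NODES"]),
            max_nodes=int(os.environ["MAX_NODES"]),
        ).evaluate(metrics, state.current())
        EC2Manager().apply(decision)
        return {"action": decision.name}
    finally:
        state.release_lock()
```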
Network Topology:
- VPC: 10.0.0.0/16 with 2 Public and 2 Private Subnets across 2 AZs.
- Security Groups: Granular whitelisting (Port 6443 for K3s API, 30090 for Prometheus).
- NAT Gateway: Enables workers in private subnets to download packages and patches.
| Category | Technology | Justification |
|---|---|---|
| Cloud Provider | AWS | Industry-standard reliability and robust Serverless (Lambda) ecosystem. |
| Orchestration | K3s | Lightweight, perfect for cost-optimized cloud and edge deployments. |
| Monitoring | Prometheus | Native Kubernetes support and powerful PromQL for complex scaling metrics. |
| IaC | Pulumi (TS) | Preferred over Terraform for its strong type-safety and general-purpose programming constructs. |
| State Storage | DynamoDB | NoSQL performance with built-in conditional writes for distributed locking. |
| Testing | k6 / Pytest | Modern tools for high-performance load testing and robust unit verification. |
The autoscaler core is modularly designed around specialized components:
- Metric Collector: Aggregates CPU, Memory, and Pending Pod data from Prometheus.
- Decision Engine: Evaluates thresholds (70% CPU/Pending Pods) and applies cooldowns.
- State Manager: Implements distributed locking in DynamoDB to prevent simultaneous scaling.
- EC2 Manager: Orchestrates instance lifecycle via Launch Templates.
We utilize Pulumi's TypeScript SDK to manage our AWS fleet. This allows for native programming constructs like loops for Multi-AZ distribution and conditional logic for Spot vs On-Demand provisioning.
```typescript
// Example: Cost-Optimized Spot Instance Definition
export const workerSpotTemplate = new aws.ec2.LaunchTemplate("worker-spot", {
    imageId: ubuntuAmiId,
    instanceType: "t3.medium",
    instanceMarketOptions: { marketType: "spot" },
});
```

Prerequisites:
- AWS CLI configured and Pulumi CLI installed.
- Node.js 18+ and Python 3.11+.
```bash
cd pulumi
npm install
pulumi up --yes
```

- SSH into the Master node using the IP provided by Pulumi output.
- Run `sudo bash /tmp/master-setup.sh` to initialize the K3s control plane.
- Verify cluster connectivity: `kubectl get nodes`.

Access Grafana via port-forwarding:

```bash
kubectl port-forward -n monitoring svc/grafana 3000:80
```

You may deploy this architecture in the following environments:
- AWS Free Tier (Recommended): 1x t3.micro master, 2x t3.micro workers (minimize costs).
- LocalStack + Minikube: Simulate AWS services (Lambda, DynamoDB) locally for development.
- On-premise K3s cluster: Raspberry Pi or VM lab with mocked AWS service endpoints.
The Lambda handler (node-fleet-autoscaler) follows a tiered execution logic:
- Pre-flight: Acquires a DynamoDB lock with a 5-minute TTL to prevent race conditions (see the locking sketch after this list).
- Discovery: Queries Prometheus for `node_cpu_utilization`, `pending_pods`, and `api_latency`.
- Logic: Applies the Decision Engine (see Algorithm section below).
- Execution: Interacts with the EC2 Service to manage the fleet.
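A sketch of the pre-flight locking step using a DynamoDB conditional write. The attribute names follow the state schema documented further below (`cluster_id`, `scaling_in_progress`, `lock_expiry`); the helper name and exact expressions are illustrative:

```python
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")


def acquire_lock(table: str, cluster_id: str, ttl_seconds: int = 300) -> bool:
    """Try to take the scaling lock; fail fast if another run already holds it."""
    now = int(time.time())
    try:
        dynamodb.update_item(
            TableName=table,
            Key={"cluster_id": {"S": cluster_id}},
            UpdateExpression="SET scaling_in_progress = :on, lock_expiry = :exp",
            # Succeed only if no lock exists, the previous lock expired,
            # or no scaling operation is currently in progress.
            ConditionExpression=(
                "attribute_not_exists(lock_expiry) OR lock_expiry < :now "
                "OR scaling_in_progress = :off"
            ),
            ExpressionAttributeValues={
                ":on": {"BOOL": True},
                ":off": {"BOOL": False},
                ":exp": {"N": str(now + ttl_seconds)},
                ":now": {"N": str(now)},
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise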
Required Environment Variables:
- `PROMETHEUS_URL`: Endpoint for cluster metrics.
- `DYNAMODB_TABLE`: Name of the state tracking table.
- `CLUSTER_ID`: Unique identifier for the node fleet.
- `SNS_TOPIC_ARN`: ARN for scaling alert notifications.
- `MIN_NODES` / `MAX_NODES`: Guardrails for fleet size.
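For illustration, the handler could load and validate these variables roughly as follows (the `Settings` dataclass and the default guardrail values are assumptions, not part of the actual code):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    prometheus_url: str
    dynamodb_table: str
    cluster_id: str
    sns_topic_arn: str
    min_nodes: int
    max_nodes: int


def load_settings() -> Settings:
    """Read configuration from the environment and enforce the node guardrails."""
    settings = Settings(
        prometheus_url=os.environ["PROMETHEUS_URL"],
        dynamodb_table=os.environ["DYNAMODB_TABLE"],
        cluster_id=os.environ["CLUSTER_ID"],
        sns_topic_arn=os.environ["SNS_TOPIC_ARN"],
        min_nodes=int(os.environ.get("MIN_NODES", "1")),
        max_nodes=int(os.environ.get("MAX_NODES", "5")),
    )
    if settings.min_nodes > settings.max_nodes:
        raise ValueError("MIN_NODES must not exceed MAX_NODES")
    return settings
```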
The Lambda function operates under a Least Privilege policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceStatus",
        "ec2:CreateTags",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "secretsmanager:GetSecretValue",
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}
```

The cluster scrapes both system and kube-state metrics every 15 seconds:
```yaml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
  - job_name: 'kube-state-metrics'
    static_configs:
      - targets: ['kube-state-metrics.monitoring.svc.cluster.local:8080']
```

Key PromQL Queries Used:
- CPU: `avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100` (Direct load indicator)
- Memory: `(1 - avg(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100`
- Pending Pods: `sum(kube_pod_status_phase{phase="Pending"})` (Critical scale-up trigger)
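For illustration, the Metric Collector could run these queries against Prometheus's standard instant-query endpoint (`/api/v1/query`); the helper and result key names below are assumptions:

```python
import requests

QUERIES = {
    "cpu_utilization": 'avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100',
    "memory_utilization": '(1 - avg(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100',
    "pending_pods": 'sum(kube_pod_status_phase{phase="Pending"})',
}


def collect_metrics(prometheus_url: str) -> dict:
    """Run each PromQL instant query and return a name -> float mapping."""
    results = {}
    for name, query in QUERIES.items():
        resp = requests.get(
            f"{prometheus_url}/api/v1/query",
            params={"query": query},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()["data"]["result"]
        # An empty result (e.g. no pending pods reported) is treated as zero.
        results[name] = float(data[0]["value"][1]) if data else 0.0
    return results
```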
Table: `node-fleet-cluster-state`
- Partition Key: `cluster_id` (String)
- Attributes: `node_count`, `scaling_in_progress`, `lock_expiry`.
Example State (Locked):
```json
{
  "cluster_id": "node-fleet-prod",
  "scaling_in_progress": true,
  "lock_expiry": 1706263200,
  "node_count": 4
}
```

Every new worker node automatically joins the fleet via this entrypoint:
```bash
#!/bin/bash
# 1. Join Cluster
K3S_TOKEN=$(aws secretsmanager get-secret-value --secret-id node-fleet/k3s-token --query SecretString --output text)
MASTER_IP=$(aws ec2 describe-instances --filters "Name=tag:Role,Values=k3s-master" --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
curl -sfL https://get.k3s.io | K3S_URL=https://${MASTER_IP}:6443 K3S_TOKEN=${K3S_TOKEN} sh -

# 2. Validate Successful Join
if systemctl is-active --quiet k3s-agent; then
  echo "✅ Successfully joined cluster"
else
  echo "❌ Failed to join cluster" && exit 1
fi
```

The autoscaler uses a tiered decision-making engine to balance cluster stability with cost efficiency.
```python
def node_fleet_brain(metrics, state, patterns, now, cooldown=600):
    # 1. Cooldown Check: do nothing while the last action is still settling
    if now < state.last_action_time + cooldown:
        return IDLE
    # 2. Critical Scale-Up (Reactive): pending pods mean unserved demand
    if metrics.pending_pods > 0:
        return SCALE_UP(nodes=2, level="CRITICAL")
    # 3. Standard Scale-Up (Reactive): sustained CPU pressure
    if metrics.cpu_utilization > 70:
        return SCALE_UP(nodes=1, level="HIGH_LOAD")
    # 4. Predictive Pre-Scaling (AI-Driven): warm up before known traffic events
    if patterns.is_flash_sale_incoming(now):
        return SCALE_UP(nodes=1, level="PREDICTIVE")
    # 5. Gradual Scale-Down (Safe): only when the cluster is clearly idle
    if metrics.cpu_utilization < 30 and metrics.pending_pods == 0:
        return SCALE_DOWN(nodes=1)
    return IDLE
```

- Why a 70% CPU Threshold?: t3.medium instances have a finite CPU credit balance. Scaling at 70% prevents "exhaustion", where the node would be throttled, causing cascading failures.
- Why prioritize Pending Pods?: A pending pod indicates that a user request is currently unfulfilled. This is the highest priority signal for scaling.
- Why 10-Min Scale-Down Cooldown?: Draining a node causes pod evictions. We wait 10 minutes to ensure a traffic surge is truly over, minimizing "churn" and pod rescheduling overhead.
We track Lambda performance, EC2 lifecycle events, and cluster health in real-time.
CloudWatch Metrics Tracked:
- `LambdaDuration` / `LambdaErrors`: Monitors autoscaler health and execution time.
- `EC2LaunchEvents` / `EC2Terminations`: Tracks fleet churn.
- `ClusterCPUUtilization` / `PendingPodCount`: Core scaling signals.
- `ScalingDecisionHistory`: Custom metric logging every Up/Down/Idle event for audit.
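A sketch of how the `ScalingDecisionHistory` custom metric might be published from the Lambda; the CloudWatch namespace and dimension names are illustrative assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def record_decision(cluster_id: str, decision: str, node_delta: int) -> None:
    """Publish one data point per autoscaler run for later auditing."""
    cloudwatch.put_metric_data(
        Namespace="NodeFleet/Autoscaler",
        MetricData=[
            {
                "MetricName": "ScalingDecisionHistory",
                "Dimensions": [
                    {"Name": "ClusterId", "Value": cluster_id},
                    {"Name": "Decision", "Value": decision},  # UP / DOWN / IDLE
                ],
                "Value": float(node_delta),
                "Unit": "Count",
            }
        ],
    )
```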
Alarms Configuration:
- 🔴 Scaling Failure: Triggered after 3 consecutive Lambda errors → Sends SNS Notification.
- 🔴 CPU Overload: Triggered if Cluster CPU > 90% for 5 minutes → Sends Urgent Pager Alert.
- ⚠️ Capacity Warning: Triggered if Node Count stays at Maximum for 10+ minutes.
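As a sketch, the Scaling Failure alarm could be expressed as a CloudWatch metric alarm on the Lambda's `Errors` metric (the alarm name, period, and placeholder arguments below are assumptions; the same alarm could equally be declared in Pulumi):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def create_scaling_failure_alarm(function_name: str, sns_topic_arn: str) -> None:
    """Alarm after 3 consecutive failed autoscaler runs, notifying via SNS."""
    cloudwatch.put_metric_alarm(
        AlarmName="node-fleet-scaling-failure",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        Statistic="Sum",
        Period=120,               # one EventBridge invocation per 2-minute period
        EvaluationPeriods=3,      # three consecutive failing periods
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],
    )
```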
We utilize a multi-layered testing strategy verified across 120 test cases:
- Load Testing: Use `k6 run load-test.js` to simulate traffic spikes by increasing Virtual Users (VUs) from 1 to 100, forcing CPU saturation.
- Scale-Up Testing: Triggered by load tests; verified by polling `kubectl get nodes` to ensure new instances transition to `Ready` in <3 min.
- Scale-Down Testing: Safely verified by reducing k6 load and monitoring the `StateManager` logs as nodes are cordoned, drained, and removed.
- Failure Scenarios Tested:
- Lambda Timeout: Verified state recovery after execution interruption.
- EC2 Quota Full: Logic handles "Insufficient Capacity" gracefully.
- Prometheus Down: System enters "Safety Mode" (holds current state).
- Lock Stuck: lock_expiry mechanism auto-clears stale DynamoDB locks.
- Node Join Failure: Automated termination of "Ghost" instances (NotReady nodes).
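Illustrative Pytest cases for the Decision Engine thresholds, assuming `node_fleet_brain` (as sketched in the Algorithm section) and its `SCALE_UP` / `SCALE_DOWN` / `IDLE` action helpers are importable from a hypothetical `autoscaler.decision_engine` module and return comparable value objects:

```python
from types import SimpleNamespace

# Hypothetical module path; adjust to the real package layout.
from autoscaler.decision_engine import node_fleet_brain, SCALE_UP, SCALE_DOWN, IDLE


def make_metrics(cpu=50.0, pending=0):
    return SimpleNamespace(cpu_utilization=cpu, pending_pods=pending)


def make_state(last_action_time=0):
    return SimpleNamespace(last_action_time=last_action_time)


# No flash sale predicted in any of these cases.
QUIET = SimpleNamespace(is_flash_sale_incoming=lambda now: False)


def test_pending_pods_trigger_critical_scale_up():
    decision = node_fleet_brain(make_metrics(pending=3), make_state(), QUIET, now=1000)
    assert decision == SCALE_UP(nodes=2, level="CRITICAL")


def test_cooldown_blocks_new_actions():
    decision = node_fleet_brain(make_metrics(cpu=95.0), make_state(last_action_time=900),
                                QUIET, now=1000, cooldown=600)
    assert decision == IDLE


def test_idle_cluster_scales_down_by_one_node():
    decision = node_fleet_brain(make_metrics(cpu=20.0), make_state(), QUIET, now=1000)
    assert decision == SCALE_DOWN(nodes=1)
```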
Final Results: 100% Pass Rate (120/120 Tests). See the full Verification Report for details.
| Resource | Baseline (5 Nodes Static) | node-fleet Optimized |
|---|---|---|
| EC2 Instances | 1.2 Lakh BDT | 55,000 BDT |
| Lambda/DynamoDB | 0 BDT | 1,000 BDT |
| Total Monthly | 1.2 Lakh BDT | ~56,000 BDT |
| Savings | - | 53.3% Savings π |
- Core DevOps Engineer: Bayajid Alam
- Architecture & Logic: AI Assisted Professional Implementation
| Session | Focus Area | Key Milestones |
|---|---|---|
| Session 1-4 | Infrastructure | VPC, IAM, K3s Master deployment via Pulumi. |
| Session 5-8 | Monitoring | Prometheus, Grafana, and Custom Metric Exporters. |
| Session 9-12 | Autoscaler Core | Lambda Decision Engine, locking, and DynamoDB state. |
| Session 13-16 | Verification | Load Testing (k6), Failure Scenarios, and final naming audit. |
Scale Smart. Save More. Automate Everything.
Built with ❤️ for node-fleet architects

