Intelligent, Cost-Optimized Kubernetes Autoscaling for AWS
Reduce infrastructure costs by 40-50% through intelligent, event-driven autoscaling
Features • Architecture • Quick Start • Technical Docs • Testing
Client: TechFlow Solutions - E-commerce platform (Dhaka, Bangladesh).
Current Pain Points:
- 💸 1.2 lakh BDT/month infrastructure waste (5 static nodes running 24/7, 60% idle).
- 🔥 Flash sale crashes - manual scaling takes 15-20 minutes, causing significant revenue loss.
- 👨‍💻 Manual intervention required for every traffic spike, leading to high operational overhead.
Our Solution: node-fleet is an intelligent, serverless autoscaler for K3s clusters. It shifts from a reactive, manual operations model to an automated, metric-driven architecture. By leveraging AWS Lambda and Prometheus, it ensures the cluster capacity perfectly matches real-time demand while optimizing for cost via Spot instances.
- ✅ Cost Reduction: Target 40-50% savings (60K BDT/month).
- ✅ Response Time: <3 min for new capacity (83% faster than manual).
- ✅ Reliability: 0 service disruptions during scaling operations.
Design Rationale:
- K3s over Standard K8s: Chosen for its lightweight footprint (50% less resource usage on master), crucial for cost-saving on small instances.
- AWS Lambda: Used for the autoscaler "brain" to avoid paying for a 24/7 controller instance.
- VPC Isolation: Master and monitoring run in a controlled environment, while workers reside in private subnets for enhanced security.
- EventBridge triggers Lambda every 2 minutes.
- Lambda scrapes custom metrics from Prometheus.
- Lambda checks DynamoDB to manage state and distributed locking.
- Decision Engine evaluates thresholds and historical patterns.
- EC2 Manager provisions nodes from Launch Templates or terminates them gracefully.
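The flow above maps to a short control loop inside the Lambda handler. A minimal sketch of that loop, assuming hypothetical `autoscaler.*` modules named after the components described later (the handler signature details and return payloads are illustrative, not the actual implementation):

```python
import os

# Hypothetical internal modules named after the autoscaler components.
from autoscaler.metrics import MetricCollector
from autoscaler.decisions import DecisionEngine
from autoscaler.state import StateManager
from autoscaler.ec2 import EC2Manager


def handler(event, context):
    """EventBridge-triggered entry point (invoked every 2 minutes)."""
    state = StateManager(table=os.environ["DYNAMODB_TABLE"],
                         cluster_id=os.environ["CLUSTER_ID"])
    # Skip this run if another invocation is already scaling the cluster.
    if not state.acquire_lock(ttl_seconds=300):
        return {"action": "SKIPPED", "reason": "lock held by another run"}

    try:
        metrics = MetricCollector(os.environ["PROMETHEUS_URL"]).collect()
        decision = DecisionEngine(
            min_nodes=int(os.environ["MIN_NODES"]),
            max_nodes=int(os.environ["MAX_NODES"]),
        ).evaluate(metrics, state.current())
        EC2Manager().apply(decision)
        return {"action": decision.name}
    finally:
        state.release_lock()
```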
Network Topology:
- VPC: 10.0.0.0/16 with 2 Public and 2 Private Subnets across 2 AZs.
- Security Groups: Granular whitelisting (Port 6443 for K3s API, 30090 for Prometheus).
- NAT Gateway: Enables workers in private subnets to download packages and patches.
| Category | Technology | Justification |
|---|---|---|
| Cloud Provider | AWS | Industry-standard reliability and robust Serverless (Lambda) ecosystem. |
| Orchestration | K3s | Lightweight, perfect for cost-optimized cloud and edge deployments. |
| Monitoring | Prometheus | Native Kubernetes support and powerful PromQL for complex scaling metrics. |
| IaC | Pulumi (TS) | Preferred over Terraform for its strong type-safety and general-purpose programming constructs. |
| State Storage | DynamoDB | NoSQL performance with built-in conditional writes for distributed locking. |
| Testing | k6 / Pytest | Modern tools for high-performance load testing and robust unit verification. |
The autoscaler core is modularly designed around specialized components:
- Metric Collector: Aggregates CPU, Memory, and Pending Pod data from Prometheus.
- Decision Engine: Evaluates thresholds (70% CPU/Pending Pods) and applies cooldowns.
- State Manager: Implements distributed locking in DynamoDB to prevent simultaneous scaling.
- EC2 Manager: Orchestrates instance lifecycle via Launch Templates.
We utilize Pulumi's TypeScript SDK to manage our AWS fleet. This allows for native programming constructs like loops for Multi-AZ distribution and conditional logic for Spot vs On-Demand provisioning.
```typescript
// Example: Cost-Optimized Spot Instance Definition
export const workerSpotTemplate = new aws.ec2.LaunchTemplate("worker-spot", {
    imageId: ubuntuAmiId,
    instanceType: "t3.medium",
    instanceMarketOptions: { marketType: "spot" },
});
```

Prerequisites:
- AWS CLI configured and Pulumi CLI installed.
- Node.js 18+ and Python 3.11+.
```bash
cd pulumi
npm install
pulumi up --yes
```

- SSH into the Master node using the IP provided by Pulumi output.
- Run `sudo bash /tmp/master-setup.sh` to initialize the K3s control plane.
- Verify cluster connectivity: `kubectl get nodes`.

Access Grafana via port-forwarding:

```bash
kubectl port-forward -n monitoring svc/grafana 3000:80
```

You may deploy this architecture in the following environments:
- AWS Free Tier (Recommended): 1x t3.micro master, 2x t3.micro workers (minimize costs).
- LocalStack + Minikube: Simulate AWS services (Lambda, DynamoDB) locally for development.
- On-premise K3s cluster: Raspberry Pi or VM lab with mocked AWS service endpoints.
The Lambda handler (node-fleet-autoscaler) follows a tiered execution logic:
- Pre-flight: Acquires a DynamoDB lock with a 5-minute TTL to prevent race conditions (see the locking sketch after this list).
- Discovery: Queries Prometheus for `node_cpu_utilization`, `pending_pods`, and `api_latency`.
- Logic: Applies the Decision Engine (see Algorithm section below).
- Execution: Interacts with the EC2 Service to manage the fleet.
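A sketch of the pre-flight locking step using a DynamoDB conditional write. The attribute names follow the state schema documented further below (`cluster_id`, `scaling_in_progress`, `lock_expiry`); the helper name and exact expressions are illustrative:

```python
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")


def acquire_lock(table: str, cluster_id: str, ttl_seconds: int = 300) -> bool:
    """Try to take the scaling lock; fail fast if another run already holds it."""
    now = int(time.time())
    try:
        dynamodb.update_item(
            TableName=table,
            Key={"cluster_id": {"S": cluster_id}},
            UpdateExpression="SET scaling_in_progress = :on, lock_expiry = :exp",
            # Succeed only if no lock exists, the previous lock expired,
            # or no scaling operation is currently in progress.
            ConditionExpression=(
                "attribute_not_exists(lock_expiry) OR lock_expiry < :now "
                "OR scaling_in_progress = :off"
            ),
            ExpressionAttributeValues={
                ":on": {"BOOL": True},
                ":off": {"BOOL": False},
                ":exp": {"N": str(now + ttl_seconds)},
                ":now": {"N": str(now)},
            },
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise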
Required Environment Variables:
- `PROMETHEUS_URL`: Endpoint for cluster metrics.
- `DYNAMODB_TABLE`: Name of the state tracking table.
- `CLUSTER_ID`: Unique identifier for the node fleet.
- `SNS_TOPIC_ARN`: ARN for scaling alert notifications.
- `MIN_NODES` / `MAX_NODES`: Guardrails for fleet size.
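For illustration, the handler could load and validate these variables roughly as follows (the `Settings` dataclass and the default guardrail values are assumptions, not part of the actual code):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    prometheus_url: str
    dynamodb_table: str
    cluster_id: str
    sns_topic_arn: str
    min_nodes: int
    max_nodes: int


def load_settings() -> Settings:
    """Read configuration from the environment and enforce the node guardrails."""
    settings = Settings(
        prometheus_url=os.environ["PROMETHEUS_URL"],
        dynamodb_table=os.environ["DYNAMODB_TABLE"],
        cluster_id=os.environ["CLUSTER_ID"],
        sns_topic_arn=os.environ["SNS_TOPIC_ARN"],
        min_nodes=int(os.environ.get("MIN_NODES", "1")),
        max_nodes=int(os.environ.get("MAX_NODES", "5")),
    )
    if settings.min_nodes > settings.max_nodes:
        raise ValueError("MIN_NODES must not exceed MAX_NODES")
    return settings
```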
The Lambda function operates under a Least Privilege policy:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:RunInstances",
        "ec2:TerminateInstances",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceStatus",
        "ec2:CreateTags",
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:UpdateItem",
        "secretsmanager:GetSecretValue",
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}
```

The cluster scrapes both system and kube-state metrics every 15 seconds:
```yaml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'kubernetes-nodes'
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
  - job_name: 'kube-state-metrics'
    static_configs:
      - targets: ['kube-state-metrics.monitoring.svc.cluster.local:8080']
```

Key PromQL Queries Used:
- CPU: `avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100` (Direct load indicator)
- Memory: `(1 - avg(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100`
- Pending Pods: `sum(kube_pod_status_phase{phase="Pending"})` (Critical scale-up trigger)
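For illustration, the Metric Collector could run these queries against Prometheus's standard instant-query endpoint (`/api/v1/query`); the helper and result key names below are assumptions:

```python
import requests

QUERIES = {
    "cpu_utilization": 'avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) * 100',
    "memory_utilization": '(1 - avg(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100',
    "pending_pods": 'sum(kube_pod_status_phase{phase="Pending"})',
}


def collect_metrics(prometheus_url: str) -> dict:
    """Run each PromQL instant query and return a name -> float mapping."""
    results = {}
    for name, query in QUERIES.items():
        resp = requests.get(
            f"{prometheus_url}/api/v1/query",
            params={"query": query},
            timeout=10,
        )
        resp.raise_for_status()
        data = resp.json()["data"]["result"]
        # An empty result (e.g. no pending pods reported) is treated as zero.
        results[name] = float(data[0]["value"][1]) if data else 0.0
    return results
```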
Table: `node-fleet-cluster-state`
- Partition Key: `cluster_id` (String)
- Attributes: `node_count`, `scaling_in_progress`, `lock_expiry`.
Example State (Locked):
```json
{
  "cluster_id": "node-fleet-prod",
  "scaling_in_progress": true,
  "lock_expiry": 1706263200,
  "node_count": 4
}
```

Every new worker node automatically joins the fleet via this entrypoint:
```bash
#!/bin/bash
# 1. Join Cluster
K3S_TOKEN=$(aws secretsmanager get-secret-value --secret-id node-fleet/k3s-token --query SecretString --output text)
MASTER_IP=$(aws ec2 describe-instances --filters "Name=tag:Role,Values=k3s-master" --query 'Reservations[0].Instances[0].PrivateIpAddress' --output text)
curl -sfL https://get.k3s.io | K3S_URL=https://${MASTER_IP}:6443 K3S_TOKEN=${K3S_TOKEN} sh -

# 2. Validate Successful Join
if systemctl is-active --quiet k3s-agent; then
  echo "✅ Successfully joined cluster"
else
  echo "❌ Failed to join cluster" && exit 1
fi
```

The autoscaler uses a tiered decision-making engine to balance cluster stability with cost efficiency.
```python
def node_fleet_brain(metrics, state, patterns, now, cooldown=600):
    # 1. Cooldown Check: do nothing while the last action is still settling
    if now < state.last_action_time + cooldown:
        return IDLE
    # 2. Critical Scale-Up (Reactive): pending pods mean unserved demand
    if metrics.pending_pods > 0:
        return SCALE_UP(nodes=2, level="CRITICAL")
    # 3. Standard Scale-Up (Reactive): sustained CPU pressure
    if metrics.cpu_utilization > 70:
        return SCALE_UP(nodes=1, level="HIGH_LOAD")
    # 4. Predictive Pre-Scaling (AI-Driven): warm up before known traffic events
    if patterns.is_flash_sale_incoming(now):
        return SCALE_UP(nodes=1, level="PREDICTIVE")
    # 5. Gradual Scale-Down (Safe): only when the cluster is clearly idle
    if metrics.cpu_utilization < 30 and metrics.pending_pods == 0:
        return SCALE_DOWN(nodes=1)
    return IDLE
```

- Why a 70% CPU Threshold?: t3.medium instances have a finite CPU credit balance. Scaling at 70% prevents "exhaustion", where the node would be throttled, causing cascading failures.
- Why prioritize Pending Pods?: A pending pod indicates that a user request is currently unfulfilled. This is the highest priority signal for scaling.
- Why 10-Min Scale-Down Cooldown?: Draining a node causes pod evictions. We wait 10 minutes to ensure a traffic surge is truly over, minimizing "churn" and pod rescheduling overhead.
We track Lambda performance, EC2 lifecycle events, and cluster health in real-time.
CloudWatch Metrics Tracked:
- `LambdaDuration` / `LambdaErrors`: Monitors autoscaler health and execution time.
- `EC2LaunchEvents` / `EC2Terminations`: Tracks fleet churn.
- `ClusterCPUUtilization` / `PendingPodCount`: Core scaling signals.
- `ScalingDecisionHistory`: Custom metric logging every Up/Down/Idle event for audit.
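A sketch of how the `ScalingDecisionHistory` custom metric might be published from the Lambda; the CloudWatch namespace and dimension names are illustrative assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def record_decision(cluster_id: str, decision: str, node_delta: int) -> None:
    """Publish one data point per autoscaler run for later auditing."""
    cloudwatch.put_metric_data(
        Namespace="NodeFleet/Autoscaler",
        MetricData=[
            {
                "MetricName": "ScalingDecisionHistory",
                "Dimensions": [
                    {"Name": "ClusterId", "Value": cluster_id},
                    {"Name": "Decision", "Value": decision},  # UP / DOWN / IDLE
                ],
                "Value": float(node_delta),
                "Unit": "Count",
            }
        ],
    )
```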
Alarms Configuration:
- 🔴 Scaling Failure: Triggered after 3 consecutive Lambda errors → Sends SNS Notification.
- 🔴 CPU Overload: Triggered if Cluster CPU > 90% for 5 minutes → Sends Urgent Pager Alert.
- ⚠️ Capacity Warning: Triggered if Node Count stays at Maximum for 10+ minutes.
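As a sketch, the Scaling Failure alarm could be expressed as a CloudWatch metric alarm on the Lambda's `Errors` metric (the alarm name, period, and placeholder arguments below are assumptions; the same alarm could equally be declared in Pulumi):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def create_scaling_failure_alarm(function_name: str, sns_topic_arn: str) -> None:
    """Alarm after 3 consecutive failed autoscaler runs, notifying via SNS."""
    cloudwatch.put_metric_alarm(
        AlarmName="node-fleet-scaling-failure",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        Statistic="Sum",
        Period=120,               # one EventBridge invocation per 2-minute period
        EvaluationPeriods=3,      # three consecutive failing periods
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],
    )
```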
We utilize a multi-layered testing strategy verified across 120 test cases:
- Load Testing: Use `k6 run load-test.js` to simulate traffic spikes by increasing Virtual Users (VUs) from 1 to 100, forcing CPU saturation.
- Scale-Up Testing: Triggered by load tests; verified by polling `kubectl get nodes` to ensure new instances transition to `Ready` in <3 min.
- Scale-Down Testing: Safely verified by reducing k6 load and monitoring the `StateManager` logs as nodes are cordoned, drained, and removed.
- Failure Scenarios Tested:
- Lambda Timeout: Verified state recovery after execution interruption.
- EC2 Quota Full: Logic handles "Insufficient Capacity" gracefully.
- Prometheus Down: System enters "Safety Mode" (holds current state).
- Lock Stuck: lock_expiry mechanism auto-clears stale DynamoDB locks.
- Node Join Failure: Automated termination of "Ghost" instances (NotReady nodes).
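Illustrative Pytest cases for the Decision Engine thresholds, assuming `node_fleet_brain` (as sketched in the Algorithm section) and its `SCALE_UP` / `SCALE_DOWN` / `IDLE` action helpers are importable from a hypothetical `autoscaler.decision_engine` module and return comparable value objects:

```python
from types import SimpleNamespace

# Hypothetical module path; adjust to the real package layout.
from autoscaler.decision_engine import node_fleet_brain, SCALE_UP, SCALE_DOWN, IDLE


def make_metrics(cpu=50.0, pending=0):
    return SimpleNamespace(cpu_utilization=cpu, pending_pods=pending)


def make_state(last_action_time=0):
    return SimpleNamespace(last_action_time=last_action_time)


# No flash sale predicted in any of these cases.
QUIET = SimpleNamespace(is_flash_sale_incoming=lambda now: False)


def test_pending_pods_trigger_critical_scale_up():
    decision = node_fleet_brain(make_metrics(pending=3), make_state(), QUIET, now=1000)
    assert decision == SCALE_UP(nodes=2, level="CRITICAL")


def test_cooldown_blocks_new_actions():
    decision = node_fleet_brain(make_metrics(cpu=95.0), make_state(last_action_time=900),
                                QUIET, now=1000, cooldown=600)
    assert decision == IDLE


def test_idle_cluster_scales_down_by_one_node():
    decision = node_fleet_brain(make_metrics(cpu=20.0), make_state(), QUIET, now=1000)
    assert decision == SCALE_DOWN(nodes=1)
```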
Final Results: 100% Pass Rate (120/120 Tests). See the full Verification Report for details.
| Resource | Baseline (5 Nodes Static) | node-fleet Optimized |
|---|---|---|
| EC2 Instances | 1.2 Lakh BDT | 55,000 BDT |
| Lambda/DynamoDB | 0 BDT | 1,000 BDT |
| Total Monthly | 1.2 Lakh BDT | ~56,000 BDT |
| Savings | - | 53.3% Savings π |
- Core DevOps Engineer: Bayajid Alam
- Architecture & Logic: AI Assisted Professional Implementation
| Session | Focus Area | Key Milestones |
|---|---|---|
| Session 1-4 | Infrastructure | VPC, IAM, K3s Master deployment via Pulumi. |
| Session 5-8 | Monitoring | Prometheus, Grafana, and Custom Metric Exporters. |
| Session 9-12 | Autoscaler Core | Lambda Decision Engine, locking, and DynamoDB state. |
| Session 13-16 | Verification | Load Testing (k6), Failure Scenarios, and final naming audit. |
Scale Smart. Save More. Automate Everything.
Built with ❤️ for node-fleet architects

