Autoscaling (HPA, VPA, KEDA) - deep dive
The autoscaling stack: HPA metrics pipeline, VPA right-sizing, KEDA event-driven scaling, and node-level scaling with Karpenter vs Cluster Autoscaler.
Autoscaling is where bad defaults cost real money. At Binocs the right-sizing work that saved $1,800 to $2,000 per month was 70% autoscaling: correctly-tuned HPA, replacing Cluster Autoscaler with Karpenter, mixing spot, and getting resource requests sane. This section is the playbook.
The two-layer system
The pod-level autoscalers create or resize Pods. When Pods cannot schedule (no node has capacity) the node-level autoscaler provisions more nodes. When nodes are underused, the node autoscaler drains and terminates them. This two-layer model is the entire elasticity story.
HPA: how it actually works
The HPA controller runs in kube-controller-manager. Every 15 seconds it:
- Reads the current metric for each Pod in the target Deployment from metrics.k8s.io (metrics-server) or custom.metrics.k8s.io (Prometheus Adapter, etc.).
- Computes desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric).
- Applies minReplicas, maxReplicas, and stabilization windows.
- Patches the Deployment's replica count if it changed.
The formula is multiplicative. If you are at 5 replicas averaging 90% CPU with a 70% target, desired = ceil(5 * 90/70) = ceil(6.43) = 7. Add a few replicas, the average per-replica drops, you stabilize.
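A manifest for the 70% CPU example might look like the sketch below. The names and replica bounds are assumptions for illustration. Note that averageUtilization is a percentage of each container's CPU request, so the formula only behaves sensibly when requests are realistic.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                # hypothetical Deployment
  minReplicas: 3             # illustrative bounds, not a recommendation
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # targetMetric in the formula above
```

With this target, 5 replicas averaging 90% CPU produce the desired count of 7 computed above, provided 7 is within the min/max bounds.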
Stabilization windows
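Default behavior: HPA scales up immediately and scales down through a 5-minute stabilization window. During scale-down it acts on the highest replica recommendation seen in the last 5 minutes, so a brief dip in load does not shrink the Deployment. You can tune both via behavior: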
behavior:
  scaleUp:
    stabilizationWindowSeconds: 30
    policies:
      - type: Percent
        value: 100 # double at most
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 600 # 10 min
    policies:
      - type: Percent
        value: 10 # shed at most 10% per minute
        periodSeconds: 60

For latency-sensitive services, longer scaleDown windows prevent flapping during traffic dips. For bursty workloads, shorter scaleUp windows respond faster.
Custom metrics
The metrics field supports:
- Resource: CPU, memory from metrics-server.
- Pods: per-pod custom metric (e.g., requests per second per pod).
- Object: a metric describing another Kubernetes object in the same namespace (e.g., requests per second on an Ingress).
- External: a metric from outside the cluster (e.g., SQS queue length or a CloudWatch metric).
Prometheus Adapter is the standard way to expose Prometheus metrics to HPA. Define a metric like http_requests_per_second, scale on it. The trick: pick a metric that is causal to load, not a symptom of it. CPU works for compute-bound services. RPS works for request-bound services. p99 latency does not work as an HPA metric - by the time latency spikes you are already overloaded.
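As a sketch of scaling on a causal metric, the HPA below targets a per-pod request rate. It assumes a Prometheus Adapter rule already exposes http_requests_per_second through custom.metrics.k8s.io. The names, bounds, and the target of 100 requests per second per pod are illustrative assumptions.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-rps              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                # hypothetical Deployment
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # must be served by the adapter
        target:
          type: AverageValue
          averageValue: "100"              # assumed per-pod capacity
```

The target should come from load testing: pick the request rate one pod handles comfortably, not the rate at which it falls over.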
VPA: when and how
VPA is for workloads where you do not know the right CPU/memory request upfront. It observes actual usage over a window (default 8 days), computes a recommendation (a lower bound, target, and upper bound drawn from roughly the P50 to P95 of observed usage), and then acts according to its update mode:
- Off: writes recommendations to the VPA object. You read them, update Deployments manually.
- Initial: sets requests on new Pod creation only.
- Auto: evicts running Pods to apply new requests.
The big trap: VPA Auto on a Deployment that also has HPA on CPU/memory creates a conflict. VPA raises the request, HPA sees lower utilization and scales replicas down, per-replica load increases, VPA raises the request again, and the loop repeats. Use VPA in Off/Initial mode with HPA, or VPA Auto without HPA.
I have used VPA mostly in Off mode as a recommendation engine. You read the recommendation in your monitoring, update the Helm values, ship. Cleaner than letting VPA evict pods unexpectedly.
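A recommendation-only VPA is a small object. The sketch below assumes the VPA components and CRDs are installed in the cluster; the object and Deployment names are hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                # hypothetical Deployment
  updatePolicy:
    updateMode: "Off"        # recommend only, never evict
```

The recommendation then appears under status.recommendation.containerRecommendations (lowerBound, target, upperBound per container), which kubectl describe vpa api-vpa prints.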
KEDA: event-driven and scale-to-zero
KEDA fills the gaps in HPA. It adds:
- 40+ event sources: SQS, Kafka, RabbitMQ, Redis lists, NATS, Azure Service Bus, GCP Pub/Sub, Prometheus queries, MySQL/Postgres query results, AWS DynamoDB Streams, cron schedules.
- Scale to zero: when the queue is empty, scale the Deployment to 0 replicas. Plain HPA cannot do this by default; minReplicas: 0 was added in 1.16 behind the HPAScaleToZero feature gate and only for Object or External metrics, whereas KEDA pairs it with idle-aware scalers.
- HPA wrapping: KEDA creates an HPA underneath, so you still get all HPA features.
Example: scale on SQS depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: worker-scaler }
spec:
  scaleTargetRef: { name: worker }
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123/work
        queueLength: "10"
        awsRegion: us-east-1
      authenticationRef: { name: aws-irsa }

KEDA polls SQS and feeds the queue depth to the HPA it manages, which computes the desired replicas from 1 up to maxReplicaCount. KEDA itself handles the step between 0 and 1: when the queue is empty for a while, it scales the Deployment to 0. Cost: nothing while idle.
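The ScaledObject refers to an authentication object named aws-irsa that is not shown. A sketch of what it could look like follows, assuming a recent KEDA release where the pod identity provider is called aws (older releases used aws-eks) and an IAM role already bound to the relevant service account. Treat the spec as an assumption about the setup, not the configuration actually used.

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: aws-irsa             # matches authenticationRef in the ScaledObject
spec:
  podIdentity:
    provider: aws            # assumption: IRSA-style pod identity on EKS
```

How long "a while" lasts before scaling to zero is governed by the ScaledObject's cooldownPeriod (300 seconds by default), and how often the queue is checked by pollingInterval (30 seconds by default).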
We used KEDA at Binocs for nightly batch workers that processed customer document uploads. Scale to zero overnight saved a few hundred dollars per month and removed the need for cron-based deployment toggling.
Cluster Autoscaler vs Karpenter
Cluster Autoscaler
The original. Works with pre-defined node groups (ASGs on AWS). On a pending Pod:
- Simulates scheduling the pending Pod against templates of each node group.
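- Picks a node group that would fit, using the configured expander (random by default; least-waste and priority are common choices).
- Raises that ASG's desired capacity by the number of nodes needed.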
- EC2 launches an instance, kubelet registers, scheduler places the Pod.
Slow (1-2 minutes node-to-ready), constrained to ASG instance types, no bin-packing optimization across instance types. Scaling down is a separate, conservative loop that drains nodes whose pods can fit elsewhere.
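Both the node-group choice and the conservative scale-down loop are driven by flags on the cluster-autoscaler container. The fragment below is a sketch with illustrative values; the three scale-down settings shown are the documented defaults written out explicitly.

```yaml
# Fragment of the cluster-autoscaler container spec; values are illustrative.
command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --expander=least-waste                   # how to choose among node groups that fit
  - --balance-similar-node-groups
  - --scale-down-utilization-threshold=0.5   # node is a removal candidate below 50% requested
  - --scale-down-unneeded-time=10m           # how long it must stay unneeded
  - --scale-down-delay-after-add=10m         # pause scale-down after a scale-up
```

Lowering the scale-down timers reclaims idle nodes sooner at the cost of more churn.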
Karpenter
AWS's purpose-built node provisioner. Skips ASGs. On a pending Pod:
- Looks at the Pod's resource requests, taints, affinities.
- Picks the cheapest EC2 instance type (across hundreds of options) that fits this Pod and any other pending Pods.
- Launches the EC2 instance directly via the EC2 API.
- Pod schedules in under 60s typically.
Wins over CA:
- Faster: 40-60s vs 1-2 min.
- Cheaper: picks across the entire EC2 catalog. No ASG instance-type constraint.
- Better bin-packing: consolidates Pods onto fewer larger nodes when possible. Has a consolidation policy that proactively replaces underutilized nodes with smaller ones.
- Spot-native: trivial to mix spot and on-demand, handle interruptions gracefully.
At Binocs, switching from Cluster Autoscaler to Karpenter cut node-launch time significantly and let us aggressively use spot instances. Karpenter's consolidation would automatically replace 4 m5.large nodes at 30% utilization with a single m5.xlarge at about 60% (the same 2.4 vCPU of load on half the capacity), then a few minutes later swap it for spot if available. That alone was a meaningful chunk of the monthly savings.
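Consolidation is configured on the NodePool. The following is a sketch against the karpenter.sh/v1 API. The NodePool and EC2NodeClass names, the CPU limit, and the timing values are assumptions, not the configuration used at Binocs.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                    # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # hypothetical EC2NodeClass
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m           # wait before acting on a candidate node
    budgets:
      - nodes: "10%"               # cap on nodes disrupted at the same time
  limits:
    cpu: "200"                     # illustrative ceiling on total provisioned CPU
```

The disruption budget bounds how much of the fleet consolidation may touch at once; per-workload availability is still the job of PodDisruptionBudgets.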
Pod Disruption Budgets matter for autoscaling
When a node autoscaler drains a node to consolidate or terminate, it respects PodDisruptionBudgets. Without PDBs you can have your entire Deployment briefly unavailable during a scale-down.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: api-pdb } # hypothetical name
spec:
  minAvailable: 2 # or "50%"
  selector: { matchLabels: { app: api } }

Karpenter respects PDBs. CA respects PDBs. Without one, the autoscaler will happily evict all your pods at once.
Spot instance handling
Spot saves 50-90% on EC2 cost. Risk: AWS reclaims with 2-minute warning. The patterns:
- Run stateless workloads on spot, stateful on on-demand.
- Use Karpenter's NodePool with karpenter.sh/capacity-type: [spot, on-demand] and let it prefer spot.
- Run aws-node-termination-handler to cordon and drain on spot interruption signals (Karpenter can also consume these signals itself through its SQS interruption queue).
- Diversify across instance families so AWS does not reclaim all your spot at once.
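The capacity-type and diversification patterns above can be expressed as NodePool requirements. This is a sketch; the instance categories and the generation cutoff are assumptions chosen for illustration.

```yaml
# Fragment: goes under spec.template.spec of a karpenter.sh/v1 NodePool.
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]      # Karpenter favors spot when both are allowed
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values: ["c", "m", "r"]            # several families, so one reclaim wave hits less
  - key: karpenter.k8s.aws/instance-generation
    operator: Gt
    values: ["4"]                      # skip older generations
```

A stateful workload can opt out of spot by setting a nodeSelector of karpenter.sh/capacity-type: on-demand on its pod template.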
At Binocs, about 60% of our compute was spot. The interruption handling was a DaemonSet (NTH) plus PDBs plus topology spread to make sure interruptions never took out a whole Deployment.
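The topology spread piece could look like the fragment below, a sketch that assumes the pods carry an app: api label. It spreads replicas across nodes and zones so a single spot reclaim cannot remove them all.

```yaml
# Fragment: goes under spec.template.spec of a Deployment.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname        # spread across nodes
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api                               # assumed pod label
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # spread across availability zones
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: api
```

ScheduleAnyway keeps the constraint soft, so pods still schedule when capacity is tight; DoNotSchedule enforces it strictly and can leave pods pending.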
The right-sizing story
The $1,800 to $2,000 per month savings was a combination:
- Audited resource requests vs actual usage with VPA in Off mode. Found Deployments requesting 2 CPU but using 0.3. Cut requests, doubled effective node capacity.
- Switched from Cluster Autoscaler with fixed node groups (m5.xlarge) to Karpenter with the full EC2 catalog. Karpenter mixed m5, m6i, c6i, t3 based on actual workload.
- Moved stateless workloads to spot via Karpenter. ~60% of compute on spot.
- KEDA scale-to-zero for batch workers.
- Right-sized HPA min/max. Some had minReplicas=10 from copy-paste; reality only needed 3.
None of these are magic. They are all in this section. The work was disciplined measurement, then tuning each layer.
The interview narrative
Two-layer system: pod-level (HPA for replicas, VPA for requests, KEDA for events and scale-to-zero) and node-level (Karpenter or Cluster Autoscaler). HPA needs good metrics (causal, not symptom). VPA needs care - do not mix Auto with HPA. KEDA fills HPA gaps for event-driven workloads. Karpenter beats Cluster Autoscaler on EKS for speed, cost, and bin-packing. PDBs protect availability during autoscaler-driven disruptions. Close with the cost story: this is where you save real money.
Learn more
- Docs: Horizontal Pod Autoscaler Walkthrough (kubernetes.io)
- Docs: Vertical Pod Autoscaler (github.com)
- Docs: KEDA Concepts (keda.sh)
- Docs: Karpenter Documentation (karpenter.sh)
- Docs: Cluster Autoscaler FAQ (github.com)