Autoscaling (HPA, VPA, KEDA) - deep dive

The autoscaling stack: HPA metrics pipeline, VPA right-sizing, KEDA event-driven scaling, and node-level scaling with Karpenter vs Cluster Autoscaler.

Autoscaling is where bad defaults cost real money. At Binocs the right-sizing work that saved $1,800 to $2,000 per month was 70% autoscaling: correctly-tuned HPA, replacing Cluster Autoscaler with Karpenter, mixing spot, and getting resource requests sane. This section is the playbook.

The two-layer system

The pod-level autoscalers create or resize Pods. When Pods cannot schedule (no node has capacity) the node-level autoscaler provisions more nodes. When nodes are underused, the node autoscaler drains and terminates them. This two-layer model is the entire elasticity story.

HPA: how it actually works

The HPA controller runs in kube-controller-manager. Every 15 seconds it:

Reads the current metric for each Pod in the target Deployment from metrics.k8s.io (metrics-server) or custom.metrics.k8s.io (Prometheus Adapter, etc.).
Computes desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric).
Applies minReplicas, maxReplicas, and stabilization windows.
Patches the Deployment's replica count if it changed.

The formula is multiplicative. If you are at 5 replicas averaging 90% CPU with a 70% target, desired = ceil(5 * 90/70) = ceil(6.43) = 7. Add a few replicas, the average per-replica drops, you stabilize.
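
A minimal sketch of that calculation, to make the arithmetic concrete. The function name and the min/max defaults are mine, not the controller's code. The 10% tolerance is the controller's default dead band, a detail the steps above skip.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))

# The worked example from the text: 5 replicas at 90% CPU, 70% target.
print(desired_replicas(5, 90, 70, max_replicas=20))
```

The real controller also handles missing metrics, not-yet-ready Pods, and multiple metrics (it takes the largest result), all of which this sketch ignores.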

Stabilization windows

Default behavior: HPA scales up immediately and scales down through a 5-minute stabilization window: it acts on the highest replica recommendation computed during the last 5 minutes, so a brief dip in load does not shed replicas. You can tune both via behavior:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 30
    policies:
      - type: Percent
        value: 100  # double at most
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 600  # 10 min
    policies:
      - type: Percent
        value: 10  # shed at most 10% per minute
        periodSeconds: 60

For latency-sensitive services, longer scaleDown windows prevent flapping during traffic dips. For bursty workloads, shorter scaleUp windows respond faster.
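
A rough sketch of how a scale-down stabilization window behaves, assuming the simplified model "use the highest recommendation seen inside the window". The class and method names are hypothetical. This is an illustration of the idea, not the controller's implementation, and it ignores the rate-limiting policies.

```python
from collections import deque

class ScaleDownStabilizer:
    """Remembers recent recommendations and returns the highest one in the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, recommendation):
        self.history.append((now, recommendation))
        # Forget recommendations that have aged out of the window.
        while self.history and self.history[0][0] < now - self.window_seconds:
            self.history.popleft()
        return max(replicas for _, replicas in self.history)

s = ScaleDownStabilizer(window_seconds=300)
for t, rec in [(0, 10), (60, 4), (300, 4), (301, 4)]:
    print(t, s.stabilize(t, rec))
```

The recommendation drops to 4 at t=60, but the stabilized value stays at 10 until the old recommendation leaves the window. That delay is the price you pay for not flapping.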

Custom metrics

The metrics field supports:

Resource: CPU, memory from metrics-server.
Pods: per-pod custom metric (e.g., requests per second per pod).
Object: a metric describing a single Kubernetes object in the same namespace (e.g., requests per second on an Ingress).
External: a metric from outside the cluster (e.g., SQS queue depth or another CloudWatch metric).

Prometheus Adapter is the standard way to expose Prometheus metrics to HPA. Define a metric like http_requests_per_second, scale on it. The trick: pick a metric that is causal to load, not a symptom of it. CPU works for compute-bound services. RPS works for request-bound services. p99 latency does not work as an HPA metric - by the time latency spikes you are already overloaded.

VPA: when and how

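VPA is for workloads where you do not know the right CPU/memory request upfront. It observes actual usage over a window (default 8 days) and computes a recommendation from usage percentiles (by default a target around P90, with a lower bound near P50 and an upper bound near P95). What it does with that recommendation depends on the update mode: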

Off: writes recommendations to the VPA object. You read them, update Deployments manually.
Initial: sets requests on new Pod creation only.
Auto: evicts running Pods to apply new requests.

The big trap: VPA Auto on a Deployment that also has HPA on CPU/memory creates a conflict. VPA scales request up, HPA sees lower utilization, scales replicas down, per-replica load increases, VPA scales request up. Use VPA in Off/Initial mode with HPA, or VPA Auto without HPA.
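
To see why the two controllers fight, here is a small numeric sketch. All numbers are hypothetical: 2100m of total CPU demand, 6 replicas, a 70% utilization target, and a VPA that raises the request from 500m to 700m. It ignores the HPA tolerance band and stabilization.

```python
import math

def hpa_desired(replicas, usage_per_pod_m, request_m, target_utilization=70):
    """HPA replica formula on CPU utilization (usage as a percentage of request)."""
    utilization = 100 * usage_per_pod_m / request_m
    return math.ceil(replicas * utilization / target_utilization)

TOTAL_USAGE_M = 2100  # hypothetical total CPU demand, in millicores
replicas = 6
per_pod = TOTAL_USAGE_M / replicas  # 350m per Pod

before = hpa_desired(replicas, per_pod, request_m=500)  # utilization is 70%
after = hpa_desired(replicas, per_pod, request_m=700)   # VPA raised the request: 50%
per_pod_after = TOTAL_USAGE_M / after                   # same load on fewer Pods

print(before, after, per_pod_after)
```

Nothing about the load changed, yet the HPA drops a replica and each remaining Pod now uses more CPU, which is exactly the signal that pushes the VPA recommendation up again.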

I have used VPA mostly in Off mode as a recommendation engine. You read the recommendation in your monitoring, update the Helm values, ship. Cleaner than letting VPA evict pods unexpectedly.

KEDA: event-driven and scale-to-zero

KEDA fills the gaps in HPA. It adds:

40+ event sources: SQS, Kafka, RabbitMQ, Redis lists, NATS, Azure Service Bus, GCP Pub/Sub, Prometheus queries, MySQL/Postgres query results, AWS DynamoDB Streams, cron schedules.
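Scale to zero: when the queue is empty, scale the Deployment to 0 replicas. Stock HPA cannot do this: minReplicas: 0 arrived in 1.16 behind the alpha HPAScaleToZero feature gate and only works with Object or External metrics. KEDA handles the 0-to-1 activation itself, driven by idle-aware scalers.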
HPA wrapping: KEDA creates an HPA underneath, so you still get all HPA features.

Example: scale on SQS depth.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: worker-scaler }
spec:
  scaleTargetRef: { name: worker }
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123/work
        queueLength: "10"
        awsRegion: us-east-1
      authenticationRef: { name: aws-irsa }

KEDA polls SQS and exposes the queue depth as a metric; the HPA it manages computes desired replicas from 1 upward. When the queue has been empty for the cooldownPeriod (300 seconds by default), KEDA scales the Deployment to 0. Cost: nothing while idle.
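
The replica math for the ScaledObject above can be sketched like this, assuming queueLength acts as a per-replica target (messages per worker). The function name is hypothetical, and the sketch leaves out the cooldown delay, activation thresholds, and in-flight message handling.

```python
import math

def replicas_for_queue(queue_depth, queue_length_target=10,
                       min_replicas=0, max_replicas=50):
    """Approximate replica count for a given queue depth."""
    if queue_depth <= 0:
        # Empty queue: fall back to the floor (0 means scale to zero).
        return min_replicas
    desired = math.ceil(queue_depth / queue_length_target)
    return max(min_replicas, min(max_replicas, desired))

for depth in (0, 7, 95, 1000):
    print(depth, replicas_for_queue(depth))
```

With the values from the manifest, 95 messages ask for 10 workers and a backlog of 1000 is capped at maxReplicaCount.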

We used KEDA at Binocs for nightly batch workers that processed customer document uploads. Scale to zero overnight saved a few hundred dollars per month and removed the need for cron-based deployment toggling.

Cluster Autoscaler vs Karpenter

Cluster Autoscaler

The original. Works with pre-defined node groups (ASGs on AWS). On a pending Pod:

Simulates scheduling the pending Pod against templates of each node group.
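Picks a node group that would fit, using the configured expander (random by default; least-waste and priority are common alternatives).
Raises that ASG's desired capacity by the number of nodes needed.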
EC2 launches an instance, kubelet registers, scheduler places the Pod.

Slow (1-2 minutes node-to-ready), constrained to ASG instance types, no bin-packing optimization across instance types. Scaling down is a separate, conservative loop that drains nodes whose pods can fit elsewhere.

Karpenter

AWS's purpose-built node provisioner. Skips ASGs. On a pending Pod:

Looks at the Pod's resource requests, taints, affinities.
Picks the cheapest EC2 instance type (across hundreds of options) that fits this Pod and any other pending Pods.
Launches the EC2 instance directly via the EC2 API.
Pod schedules in under 60s typically.
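
The selection step can be sketched as "sum the pending requests, filter the catalog, take the cheapest fit". This is a toy model, not Karpenter's scheduler: the catalog is four instance types with illustrative prices (not real quotes), and it ignores taints, affinities, DaemonSet overhead, allocatable vs raw capacity, and splitting Pods across several nodes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpu: float
    memory_gib: float
    hourly_usd: float  # illustrative price, an assumption for this sketch

CATALOG = [
    InstanceType("m5.large", 2, 8, 0.10),
    InstanceType("m5.xlarge", 4, 16, 0.20),
    InstanceType("c6i.xlarge", 4, 8, 0.17),
    InstanceType("m6i.2xlarge", 8, 32, 0.40),
]

def cheapest_fit(pending_pods, catalog=CATALOG):
    """pending_pods: list of (cpu, memory_gib) requests to place on one node."""
    cpu = sum(p[0] for p in pending_pods)
    mem = sum(p[1] for p in pending_pods)
    fits = [t for t in catalog if t.vcpu >= cpu and t.memory_gib >= mem]
    # None means no single instance type in the catalog can hold the batch.
    return min(fits, key=lambda t: t.hourly_usd) if fits else None

choice = cheapest_fit([(1, 2), (1.5, 3), (1, 2)])
print(choice.name if choice else "no fit")
```

The point of the sketch is the batching: the three pending Pods are sized together, so a compute-heavy batch lands on a compute-optimized type instead of whatever a fixed node group happens to offer.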

Wins over CA:

Faster: 40-60s vs 1-2 min.
Cheaper: picks across the entire EC2 catalog. No ASG instance-type constraint.
Better bin-packing: consolidates Pods onto fewer larger nodes when possible. Has a consolidation policy that proactively replaces underutilized nodes with smaller ones.
Spot-native: trivial to mix spot and on-demand, handle interruptions gracefully.

At Binocs, switching from Cluster Autoscaler to Karpenter cut node-launch time significantly and let us aggressively use spot instances. Karpenter's consolidation would automatically repack 4 m5.large nodes at 30% util onto half the capacity (2 m5.large, or a single m5.xlarge) at about 60% util, then a few minutes later swap a node for spot if available. That alone was a meaningful chunk of the monthly savings.

Pod Disruption Budgets matter for autoscaling

When a node autoscaler drains a node to consolidate or terminate, it respects PodDisruptionBudgets. Without PDBs you can have your entire Deployment briefly unavailable during a scale-down.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: api-pdb }  # name is illustrative
spec:
  minAvailable: 2  # or "50%"
  selector: { matchLabels: { app: api } }

Both Karpenter and CA honor PDBs when draining. Without one, a consolidation or scale-down can evict every replica that happens to sit on the drained nodes at the same time.
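
How much room a PDB leaves the autoscaler is simple arithmetic. A sketch, assuming minAvailable is either an integer or a percentage string and that percentages round up. The function name is hypothetical.

```python
import math

def allowed_disruptions(healthy_pods, expected_pods, min_available):
    """How many Pods may be evicted right now without violating the budget."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        # Percentages are taken of the expected Pod count and rounded up.
        required = math.ceil(expected_pods * int(min_available[:-1]) / 100)
    else:
        required = int(min_available)
    return max(0, healthy_pods - required)

print(allowed_disruptions(3, 3, 2))      # 3 healthy replicas, minAvailable: 2
print(allowed_disruptions(7, 7, "50%"))  # 7 healthy replicas, minAvailable: "50%"
```

Note the edge case: a Deployment with 2 replicas and minAvailable: 2 allows zero evictions, which blocks node drains entirely.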

Spot instance handling

Spot saves 50-90% on EC2 cost. Risk: AWS reclaims with 2-minute warning. The patterns:

Run stateless workloads on spot, stateful on on-demand.
Use Karpenter's NodePool with karpenter.sh/capacity-type: [spot, on-demand] and let it prefer spot.
Run aws-node-termination-handler (or, on Karpenter-managed nodes, Karpenter's own interruption handling) to cordon and drain on spot interruption signals.
Diversify across instance families so AWS does not reclaim all your spot at once.

At Binocs, about 60% of our compute was spot. The interruption handling was a DaemonSet (NTH) plus PDBs plus topology spread to make sure interruptions never took out a whole Deployment.
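
For a back-of-envelope feel of what that mix is worth, a sketch that assumes every unit of compute costs the same on-demand and that the spot discount is uniform. Both are simplifications, and the function name is mine.

```python
def blended_savings_pct(spot_share_pct, spot_discount_pct):
    """Percent saved versus running the same compute entirely on-demand."""
    return spot_share_pct * spot_discount_pct / 100

low = blended_savings_pct(60, 50)   # 60% on spot at a 50% discount
high = blended_savings_pct(60, 90)  # 60% on spot at a 90% discount
print(low, high)
```

So 60% on spot puts the compute bill somewhere between 30% and 54% below all on-demand, before counting any cost from interruptions.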

The right-sizing story

The $1,800 to $2,000 per month savings was a combination:

Audited resource requests vs actual usage with VPA in Off mode. Found Deployments requesting 2 CPU but using 0.3. Cut requests, doubled effective node capacity.
Switched from Cluster Autoscaler with fixed node groups (m5.xlarge) to Karpenter with the full EC2 catalog. Karpenter mixed m5, m6i, c6i, t3 based on actual workload.
Moved stateless workloads to spot via Karpenter. ~60% of compute on spot.
KEDA scale-to-zero for batch workers.
Right-sized HPA min/max. Some had minReplicas=10 from copy-paste; reality only needed 3.

None of these are magic. They are all in this section. The work was disciplined measurement, then tuning each layer.

The interview narrative

Two-layer system: pod-level (HPA for replicas, VPA for requests, KEDA for events and scale-to-zero) and node-level (Karpenter or Cluster Autoscaler). HPA needs good metrics (causal, not symptom). VPA needs care - do not mix Auto with HPA. KEDA fills HPA gaps for event-driven workloads. Karpenter beats Cluster Autoscaler on EKS for speed, cost, and bin-packing. PDBs protect availability during autoscaler-driven disruptions. Close with the cost story: this is where you save real money.

