EKS and Kubernetes - Deep Dive

Control plane internals, the scheduler, networking with VPC CNI, IRSA, autoscaling strategies, and the production checklist nobody writes down.

This is the operator's view of EKS. Less "what is a pod" and more "why is my pod pending for 4 minutes." Earned the hard way at Binocs running 12 services and an analytics pipeline.

Control plane: what AWS runs for you

EKS runs the control plane in an AWS-owned VPC: at least two API server instances behind a load balancer, a three-node etcd cluster spread across AZs, the scheduler, and the controller manager. kube-proxy is not part of that; it runs as a DaemonSet on your nodes. You pay $0.10 per hour per cluster for this. Worth it; running etcd yourself is a part-time job.

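The API server is your only entry point. kubectl talks to it over HTTPS and authenticates with an IAM-signed token from aws eks get-token (the AWS CLI's built-in replacement for the standalone aws-iam-authenticator binary). Kubernetes RBAC then controls what that identity can do.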
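For reference, the user entry that aws eks update-kubeconfig writes into your kubeconfig looks roughly like the fragment below. The cluster name and region are placeholders, not values from this setup.

```yaml
# Fragment of ~/.kube/config (users section only).
# "my-cluster" and "us-east-1" are hypothetical placeholders.
users:
  - name: my-cluster
    user:
      exec:
        # kubectl runs this command to obtain a short-lived bearer token
        apiVersion: client.authentication.k8s.io/v1beta1
        command: aws
        args:
          - --region
          - us-east-1
          - eks
          - get-token
          - --cluster-name
          - my-cluster
```

If kubectl suddenly returns Unauthorized, running that same aws command by hand is a quick way to check which IAM identity you are actually presenting.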

The scheduler

The scheduler watches for pods with no node assignment and finds a node for each. The algorithm is filter then score:

Filter: drop nodes that cannot run the pod (insufficient resources, wrong selectors, taints not tolerated).
Score: rank the remaining nodes by pod spread, image locality, and affinity preferences.

Common failure modes:

Pending pod because no node has enough CPU/memory request capacity. Solution: scale nodes or reduce requests.
Pending pod because of zone affinity rules that no node satisfies. Solution: relax affinity or add nodes in that AZ.
Pending pod because of taints (e.g., GPU nodes tainted, your pod has no toleration). Solution: add the toleration.

kubectl describe pod shows the scheduler's reasoning in the Events section (look for FailedScheduling). Read it.
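A minimal sketch of a pod that touches all three failure modes above: explicit requests, a zone pin, and a toleration. The names, image, zone, and taint key are assumptions for illustration; match the toleration to whatever taint your GPU nodes actually carry.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer                      # hypothetical name
spec:
  # Hard zone pin: the pod stays Pending if no node in this AZ has room.
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a
  # Without this, the filter step rejects every node carrying the taint.
  tolerations:
    - key: nvidia.com/gpu            # assumed taint key
      operator: Exists
      effect: NoSchedule
  containers:
    - name: trainer
      image: registry.example.com/trainer:1.4.2   # hypothetical image
      resources:
        # The scheduler fits pods by requests, not by actual usage.
        requests:
          cpu: "2"
          memory: 4Gi
```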

VPC CNI: pod networking on EKS

The Amazon VPC CNI gives each pod a real IP from your VPC subnet, attached via secondary ENIs on the node. This is great: pods can talk directly to RDS, security groups work for pods, no overlay network overhead.

It is also constrained: each EC2 instance type has a limit on ENIs and IPs per ENI. An m5.large supports 29 pods max: 3 ENIs * (10 IPs - 1 primary IP per ENI) + 2 for host-network pods. Run out of IPs and pods stay pending.

Tactics:

Enable prefix delegation: each ENI gets /28 prefixes instead of individual IPs, raising the pod density limit dramatically.
Use larger instance types if you run many small pods.
Plan VPC subnets with enough address space (a /22 per AZ at minimum for a production cluster).
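Prefix delegation is switched on through environment variables on the aws-node DaemonSet in kube-system. The fragment below is a sketch of the relevant part of that container spec, not a complete manifest. It needs Nitro-based instances, and the node's max-pods setting has to be raised separately or the kubelet will keep enforcing the old limit.

```yaml
# Fragment of the aws-node DaemonSet container spec (kube-system).
env:
  # Assign /28 prefixes to ENIs instead of individual secondary IPs.
  - name: ENABLE_PREFIX_DELEGATION
    value: "true"
  # Keep one spare prefix attached so new pods get an IP without waiting.
  - name: WARM_PREFIX_TARGET
    value: "1"
```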

IRSA: IAM for pods

Old way: store AWS credentials in a Kubernetes Secret. New way: associate an IAM role with a Kubernetes ServiceAccount. Pods using that ServiceAccount get a projected OIDC token that the AWS SDK exchanges for temporary credentials via STS.

Setup:

Create an OIDC provider for the cluster (eksctl does this).
Create an IAM role with a trust policy allowing the OIDC subject to assume it.
Annotate the ServiceAccount with the role ARN.
The EKS pod identity webhook injects the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE env vars, and the AWS SDK in the pod picks them up automatically.

Result: no static credentials, scoped permissions per workload, audit trail in CloudTrail.
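The Kubernetes side of that setup is small. A sketch, assuming a hypothetical analytics namespace, a report-writer ServiceAccount, and a placeholder account ID and role name:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: report-writer                # hypothetical
  namespace: analytics               # hypothetical
  annotations:
    # Placeholder ARN. The role's trust policy must allow the subject
    # system:serviceaccount:analytics:report-writer to assume it.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/report-writer-s3
---
apiVersion: v1
kind: Pod
metadata:
  name: report-writer
  namespace: analytics
spec:
  # Pods opt in by running under the annotated ServiceAccount.
  serviceAccountName: report-writer
  containers:
    - name: app
      image: registry.example.com/report-writer:2.0.1   # hypothetical image
```

Pods created before the annotation was added do not get the env vars; restart them after annotating.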

Autoscaling: three layers

HPA (Horizontal Pod Autoscaler): scales replica count based on CPU, memory, or custom metrics. Reconciles every 15 seconds. Use it.
VPA (Vertical Pod Autoscaler): adjusts requests/limits based on actual usage. Restarts pods to apply. Useful for right-sizing during dev, risky in production.
Cluster autoscaler or Karpenter: adds/removes nodes when pods are pending or nodes are underutilized.
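A minimal HPA sketch for the first layer, assuming a Deployment named api. The replica bounds and the 70% target are illustrative, not recommendations.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                        # assumed Deployment name
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          # Utilization is measured against the container's CPU request,
          # so the HPA only works if requests are set.
          type: Utilization
          averageUtilization: 70
```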

Karpenter vs Cluster Autoscaler: Karpenter watches pending pods and provisions instances of any compatible type, including spot, in seconds. Cluster Autoscaler operates on pre-defined node groups and is slower. Karpenter is the future. We migrated to it and node provisioning dropped from 3 minutes to 40 seconds.
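A sketch of what "any compatible type, including spot" looks like in a Karpenter NodePool. This assumes the karpenter.sh/v1 API and an EC2NodeClass named default that already exists; earlier Karpenter releases use different API versions and field names, so check against the version you run. The limits and categories are illustrative.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default                # assumed to exist already
      requirements:
        # Let Karpenter choose between spot and on-demand capacity.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        # Constrain families loosely rather than naming instance types.
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
  limits:
    cpu: 1000                        # cap on total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m
```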

Probes done right

readinessProbe: gates traffic. If it fails, the pod is removed from Service endpoints but not killed. Use this for "am I ready to serve traffic right now?"
livenessProbe: kills the pod if it fails. Use this for "am I deadlocked beyond recovery?" Be conservative; cascading liveness failures take down services.
startupProbe: gives slow-starting apps time to come up before liveness applies.

A common production setup:

readinessProbe:
  httpGet: { path: /ready, port: 8000 }
  periodSeconds: 5
  failureThreshold: 2
livenessProbe:
  httpGet: { path: /healthz, port: 8000 }
  periodSeconds: 30
  failureThreshold: 5
startupProbe:
  httpGet: { path: /healthz, port: 8000 }
  periodSeconds: 5
  failureThreshold: 30

Rolling updates and PDBs

Deployment rolling update: maxSurge: 25%, maxUnavailable: 25% by default. The deployment controller creates new replicas, waits for readiness, then terminates old ones.
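If you would rather never dip below the desired replica count during a rollout, override the defaults. A sketch with a hypothetical api Deployment; the image and replica count are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # up to 1 extra pod above the 4 desired
      maxUnavailable: 0    # never remove an old pod before a new one is Ready
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # hypothetical image
          readinessProbe:
            httpGet: { path: /ready, port: 8000 }
```

The trade-off is that the rollout needs spare capacity for the surge pods, so it can stall on a full cluster.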

Pod Disruption Budget: the minimum number of pods that must remain available during voluntary disruptions such as node drains. It does not govern Deployment rollouts; maxUnavailable does that. Without a PDB, a cluster upgrade that drains nodes can take all your replicas down at once.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api   # metadata.name is required; this name is an assumption
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

Networking: Services and Ingress

Service types:

ClusterIP: internal-only virtual IP, load-balanced across pods. Default.
NodePort: exposes a port on every node. Rarely useful directly, mostly for cluster-level load balancers.
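LoadBalancer: provisions an external load balancer. On EKS without extra configuration, the legacy controller creates a Classic Load Balancer; to get an NLB, annotate the Service and let the AWS Load Balancer Controller manage it.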
ExternalName: DNS CNAME, no proxying.
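A sketch of a LoadBalancer Service that asks for an NLB, assuming the AWS Load Balancer Controller is installed. The Service name, selector, and ports are placeholders.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api
  annotations:
    # Hand this Service to the AWS Load Balancer Controller (provisions an NLB).
    service.beta.kubernetes.io/aws-load-balancer-type: external
    # Register pod IPs directly as targets; works because of the VPC CNI.
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  selector:
    app: api
  ports:
    - port: 443
      targetPort: 8000
      protocol: TCP
```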

Ingress is the right way to expose HTTP. The AWS Load Balancer Controller creates an ALB per Ingress (or shared via IngressGroup annotation), routes by host and path, terminates TLS via ACM.
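A sketch of an Ingress handled by the AWS Load Balancer Controller. The hostname, group name, and certificate ARN are placeholders; the backend assumes a Service named api listening on port 80.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    # Ingresses with the same group name share one ALB.
    alb.ingress.kubernetes.io/group.name: public            # hypothetical group
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    # Placeholder ACM certificate ARN.
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:111122223333:certificate/example
spec:
  ingressClassName: alb
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api
                port:
                  number: 80
```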

Service mesh: we did not run one. Istio and Linkerd add real value if you need mTLS between every service, fine-grained traffic policies, or detailed inter-service telemetry. They also add operational complexity. Most 10-service shops do not need it.

Storage

PersistentVolumeClaim binds to a PersistentVolume. On EKS, the EBS CSI driver provisions gp3 volumes on demand. Volumes are AZ-local, so the pod that uses a PVC is pinned to that AZ.
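A sketch of a gp3 StorageClass and a claim against it. The names and size are placeholders. WaitForFirstConsumer matters because of the AZ pinning: the volume is created only after the pod is scheduled, so it lands in the AZ the scheduler picked rather than the other way around.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
# Delay provisioning until a pod using the claim has been scheduled.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data                         # hypothetical claim name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi
```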

Stateful workloads: use StatefulSet, not Deployment. StatefulSets give stable pod names (api-0, api-1, api-2), stable network identity, ordered rollout, and per-pod PVCs that survive pod replacement.
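A minimal StatefulSet sketch that produces the api-0, api-1, api-2 pods mentioned above. The image, mount path, and size are placeholders, and it assumes a StorageClass named gp3 exists in the cluster.

```yaml
# Headless Service: gives each pod a stable DNS name (api-0.api, api-1.api, ...).
apiVersion: v1
kind: Service
metadata:
  name: api
spec:
  clusterIP: None
  selector:
    app: api
  ports:
    - port: 8000
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: api
spec:
  serviceName: api                   # must reference the headless Service
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.0.0   # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /var/lib/api             # hypothetical path
  # One PVC per pod (data-api-0, data-api-1, ...); it outlives the pod.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3        # assumed to exist
        resources:
          requests:
            storage: 20Gi
```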

The production checklist

Resource requests on every container (for scheduling).
Memory limits on every container (to prevent runaway pods).
No CPU limits (or only for known-bursty workloads).
Readiness probes on every workload.
PDBs on every Deployment with >1 replica.
HPA on every workload that has variable traffic.
Logs to stdout/stderr, shipped by Fluent Bit to CloudWatch / Loki.
Metrics scraped by Prometheus (kube-state-metrics + node-exporter + app /metrics).
NetworkPolicies, if you run a default-deny-egress shop.
Secrets from AWS Secrets Manager via the CSI driver, not raw Kubernetes Secrets.
Image scanning in CI (Trivy or ECR scanning).
Images pinned by digest in production, not :latest.
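Several of these items fit in one manifest. A sketch, with hypothetical names and a digest placeholder you would fill in from your registry; the numbers are illustrative. The NetworkPolicy only has an effect if the cluster runs something that enforces NetworkPolicies.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: prod                    # hypothetical namespace
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          # Pinned by digest; <digest> is a placeholder, not a real value.
          image: registry.example.com/api@sha256:<digest>
          resources:
            requests:                # what the scheduler reserves
              cpu: 250m
              memory: 512Mi
            limits:
              memory: 512Mi          # memory limit only; no CPU limit
          readinessProbe:
            httpGet: { path: /ready, port: 8000 }
            periodSeconds: 5
---
# Default-deny egress for every pod in the namespace; allow rules go in
# additional, more specific policies.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: prod
spec:
  podSelector: {}
  policyTypes: ["Egress"]
```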

Disaster recovery

Velero for cluster backups: a scheduled backup dumps API resources to S3 and snapshots the PVs. A restore recreates the resources and rebuilds the volumes from those snapshots.
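The schedule itself is a Velero custom resource. A sketch, assuming Velero is installed in the velero namespace with an S3 backup location already configured; the namespace list, cron expression, and retention are placeholders.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly
  namespace: velero
spec:
  schedule: "0 3 * * *"              # cron syntax: every day at 03:00
  template:
    includedNamespaces: ["prod"]     # hypothetical namespace
    snapshotVolumes: true            # snapshot PVs along with API resources
    ttl: 720h0m0s                    # keep each backup for 30 days
```

A backup you have never restored is a guess. Run a restore into a scratch cluster before you need it.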

For databases: use RDS or Aurora, not in-cluster Postgres. Easier backups, easier upgrades, AZ failover handled.
