EKS and Kubernetes
What EKS gives you over raw Kubernetes, the objects you actually use day to day, and how I ran 12 microservices on it.
Kubernetes is a control loop. You declare desired state (3 replicas of this image, this much CPU, this service exposed) and a set of controllers reconcile actual state to match. EKS is AWS's managed control plane: they run etcd, the API server, the scheduler, and the controller manager. You run the worker nodes and the workloads.
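The reconcile idea above can be sketched as a toy in-memory controller. This is only an illustration of the pattern, not how kube-controller-manager is implemented: the `ReplicaController` class, its method names, and the `pod-N` naming are all hypothetical.

```python
import itertools


class ReplicaController:
    """Toy controller: drives a list of pod names toward a desired count."""

    def __init__(self, desired: int) -> None:
        self.desired = desired                # declared (desired) state
        self.pods: list[str] = []             # observed (actual) state
        self._ids = itertools.count()         # hypothetical pod naming scheme

    def reconcile(self) -> bool:
        """Take one step toward desired state. Returns True when converged."""
        if len(self.pods) < self.desired:
            self.pods.append(f"pod-{next(self._ids)}")   # scale up by one
            return False
        if len(self.pods) > self.desired:
            self.pods.pop()                              # scale down by one
            return False
        return True

    def run_until_converged(self) -> int:
        """Loop reconcile() until nothing is left to do; return steps taken."""
        steps = 0
        while not self.reconcile():
            steps += 1
        return steps


if __name__ == "__main__":
    ctl = ReplicaController(desired=3)
    ctl.run_until_converged()
    print(ctl.pods)            # ['pod-0', 'pod-1', 'pod-2']

    ctl.pods.remove("pod-1")   # simulate a pod dying
    ctl.run_until_converged()
    print(ctl.pods)            # ['pod-0', 'pod-2', 'pod-3']
```

The point is that nobody issues a "restart pod-1" command. The controller only compares declared and observed state and closes the gap, whatever caused it.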
What I ran at Binocs
12 microservices on a 3-node EKS cluster (m5.large, eventually m5.xlarge). Traffic from an ALB Ingress, internal traffic via ClusterIP services, secrets from AWS Secrets Manager via the CSI driver, persistent volumes on gp3 EBS.
The objects I touched daily
- Deployment: declarative replica management, rolling updates with surge and unavailable controls.
- Service: stable virtual IP for a set of pods, selected by labels.
- Ingress: HTTP routing to services, with TLS termination and hostname-based routing. On EKS, with the AWS Load Balancer Controller installed, this provisions an ALB.
- ConfigMap and Secret: config and credentials, mounted as files or env vars.
- HPA: scale replica count based on CPU or custom metrics.
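- PodDisruptionBudget: minimum available pods during voluntary disruptions such as node drains and node group upgrades. It applies to evictions; a Deployment's own rolling update is governed by its surge and unavailable settings instead.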
- NetworkPolicy: pod-to-pod firewall rules.
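For the HPA entry above, the scaling rule in the Kubernetes documentation is a ratio: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). A minimal sketch of that arithmetic follows. The function name and the clamping parameters are my own; the real controller also handles unready pods, missing metrics, and stabilization windows, which this ignores.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA rule: scale by the ratio of observed to target metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: skip scaling to avoid flapping.
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured minReplicas / maxReplicas bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # 3 replicas at 90% average CPU against a 60% target
    print(desired_replicas(3, 90, 60))   # 5
    # 3 replicas at 63% against 60%: inside the tolerance band
    print(desired_replicas(3, 63, 60))   # 3
    # 4 replicas at 30% against 60%
    print(desired_replicas(4, 30, 60))   # 2
```

CPU utilization here is measured against the pod's CPU request, so an HPA on CPU only behaves sensibly when requests are set.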
EKS specifics that matter
- IRSA (IAM Roles for Service Accounts): pods assume AWS IAM roles via OIDC. No long-lived credentials in the cluster.
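- VPC CNI: each pod gets a real IP from your VPC subnet, assigned from an ENI attached to the node. Pods are first-class network citizens: the node's security groups apply by default, and per-pod security groups are available as an opt-in feature.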
- Fargate profiles: serverless pods, no node to manage, costs more per CPU-hour but no idle nodes.
- Managed node groups: AWS provisions and upgrades EC2 nodes for you, you pick instance types and AZ spread.
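One consequence of the VPC CNI model above is that pod density is bounded by ENI capacity. A rough sketch of the usual calculation, assuming the default secondary-IP mode (no prefix delegation) and the ENI limits I remember for these instance types (m5.large: 3 ENIs with 10 IPv4 addresses each, m5.xlarge: 4 ENIs with 15 each). Check the current EC2 limits before relying on the numbers; the `ENI_LIMITS` table and function name are illustrative.

```python
# Assumed ENI limits per instance type: (max ENIs, IPv4 addresses per ENI).
ENI_LIMITS = {
    "m5.large": (3, 10),
    "m5.xlarge": (4, 15),
}


def max_pods(instance_type: str) -> int:
    """Max pods per node in secondary-IP mode.

    Each ENI's primary IP is not handed to pods, hence (ips - 1).
    The +2 accounts for host-network pods that do not consume a pod IP.
    """
    enis, ips_per_eni = ENI_LIMITS[instance_type]
    return enis * (ips_per_eni - 1) + 2


if __name__ == "__main__":
    for itype in ENI_LIMITS:
        per_node = max_pods(itype)
        print(itype, per_node, "per node,", 3 * per_node, "across 3 nodes")
```

On a 3-node cluster that ceiling includes system pods (CoreDNS, the CNI and kube-proxy daemonsets), so the room left for application replicas is smaller than the raw figure.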
The painful lessons
- CPU limits cause throttling. CFS quotas are accounted in 100ms windows. A burst that uses up the quota early in a window is throttled for the rest of that window, which shows up as latency spikes. Set requests for scheduling, skip CPU limits, and set memory limits so one leaking pod cannot exhaust the node's memory.
- Liveness probes that fail because the app is slow under load will cascade. The probe restarts the pod, traffic shifts to remaining pods, they get slower, more probes fail. Use readiness probes for traffic routing, be conservative with liveness.
- Pod-to-pod DNS goes through CoreDNS. Default ndots:5 in resolv.conf means api.svc becomes 5 DNS lookups. Set dnsConfig.options.ndots: 1.
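Two of the lessons above come down to arithmetic, so here is a rough model of each. Both are simplified sketches under stated assumptions. The throttling function assumes a single-threaded request that starts at the beginning of a CFS period, with nothing else in the container using CPU. The DNS function follows glibc-style search-list behaviour, and the `SEARCH` list is a hypothetical pod resolv.conf (namespace `default` plus a VPC domain), not something copied from a real cluster.

```python
def wall_clock_ms(cpu_work_ms: int, limit_millicores: int, period_ms: int = 100) -> int:
    """Wall-clock time for a request needing cpu_work_ms of CPU under a CPU limit."""
    # Quota per period; a single thread cannot use more than the whole period.
    quota_ms = min(limit_millicores * period_ms // 1000, period_ms)
    if quota_ms <= 0:
        raise ValueError("limit too small for this model")
    if cpu_work_ms <= 0:
        return 0
    # Periods in which the quota is fully spent and the task is then throttled.
    exhausted_periods = (cpu_work_ms - 1) // quota_ms
    remaining_ms = cpu_work_ms - exhausted_periods * quota_ms
    return exhausted_periods * period_ms + remaining_ms


# Hypothetical search list for a pod in namespace "default" on EKS.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local", "ec2.internal"]


def dns_candidates(name: str, search: list[str], ndots: int) -> list[str]:
    """Names the resolver tries, in order, until one of them resolves."""
    if name.endswith("."):
        return [name]                       # already fully qualified
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded        # try the name as given first
    return expanded + [absolute]            # walk the search list first


if __name__ == "__main__":
    # 120ms of CPU work under a 500m limit: 50ms of quota per 100ms window.
    print(wall_clock_ms(120, 500))    # 220
    print(wall_clock_ms(120, 1000))   # 120

    # External hostname: search list walked first under ndots:5.
    print(dns_candidates("api.example.com", SEARCH, ndots=5))
    print(dns_candidates("api.example.com", SEARCH, ndots=1)[0])
```

In this model a 500m limit nearly doubles the latency of a 120ms request even when the node is otherwise idle. With ndots:1, an external hostname is tried as-is first, while short service names with no dots still go through the search list, so in-namespace service discovery keeps working.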
Learn more
- Docs: Kubernetes Documentation (Kubernetes)
- Repo: Kubernetes the Hard Way (Kelsey Hightower)