Kubernetes architecture (control plane vs nodes) - deep dive

How the control loops, API server, etcd, and kubelet collaborate to keep your cluster converging toward desired state.

The single most useful sentence I can give you about Kubernetes: it is a distributed state machine where the API server is the only thing that touches etcd and every other component is a reconciler that watches the API server. Once that clicks, the rest is just naming.

The control loop that runs everything

Every controller in Kubernetes implements the same pattern. Watch the API server for changes to a resource type, diff desired vs actual, take action to close the gap, repeat forever. The Deployment controller watches Deployments and creates ReplicaSets. The ReplicaSet controller watches ReplicaSets and creates Pods. The scheduler watches unscheduled Pods and assigns nodeName. The kubelet watches Pods with its nodeName and starts containers.
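To make the pattern concrete, here is a minimal sketch of one reconcile pass for a ReplicaSet-like controller. FakeCluster and its fields are hypothetical stand-ins for what a real controller reads from the API server; nothing here is the actual controller code.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for the API server's view of one ReplicaSet."""
    desired_replicas: int
    pods: list = field(default_factory=list)
    next_id: int = 0


def reconcile(cluster: FakeCluster) -> list:
    """One pass of observe -> diff -> act. Returns the actions taken."""
    actions = []
    diff = cluster.desired_replicas - len(cluster.pods)
    if diff > 0:
        # Too few pods: create the missing ones.
        for _ in range(diff):
            name = f"pod-{cluster.next_id}"
            cluster.next_id += 1
            cluster.pods.append(name)
            actions.append(f"create {name}")
    elif diff < 0:
        # Too many pods: delete the surplus, newest first.
        for _ in range(-diff):
            name = cluster.pods.pop()
            actions.append(f"delete {name}")
    return actions  # an empty list means actual already matches desired


if __name__ == "__main__":
    cluster = FakeCluster(desired_replicas=3)
    print(reconcile(cluster))  # first pass creates pods
    print(reconcile(cluster))  # second pass finds nothing to do
    cluster.desired_replicas = 1
    print(reconcile(cluster))  # scale-down deletes the surplus
```

The point of the sketch is idempotence: running reconcile again with nothing changed does nothing, which is what lets a controller restart or replay events safely.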

The flow has no direct calls between components. Everything goes through the API server. This is why the API server is the bottleneck you scale first in large clusters.

etcd is the cluster

People treat etcd like a database the cluster uses. It is the other way around. etcd holds every Pod, Service, ConfigMap, Secret, Node, Lease, and custom resource. If etcd loses quorum (it can no longer reach a majority of its members), writes through the API server fail and the cluster is read-only at best. Existing pods keep running because the kubelet has cached state, but you cannot deploy, scale, or recover anything.

A few etcd realities you should know:

It uses Raft. Five-node etcd survives two failures, three-node survives one. Always odd numbers.
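By default the API server asks etcd to compact old revisions every 5 minutes. Compaction alone does not shrink the database file; that takes defragmentation. If the database grows unchecked you will eventually hit the storage quota (default 2 GiB), etcd raises a NOSPACE alarm and rejects writes, and the cluster freezes. EKS handles this for you.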
Watch fan-out is the silent scaling problem. Every kubelet, controller, and operator opens a watch. 5,000 nodes means 5,000+ watches on Pod resources alone. The API server uses the watch cache to deduplicate, but it is still real work.
Back up etcd. etcdctl snapshot save is one command. On self-hosted clusters this is the single most important operational task. On EKS, AWS does it.
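The Raft numbers above fall out of simple majority arithmetic. A minimal sketch (the function names are mine, not etcd's):

```python
def quorum(members: int) -> int:
    """Smallest majority of a Raft group with this many members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} members: quorum {quorum(n)}, survives {tolerated_failures(n)} failure(s)")
```

Going from three members to four raises the quorum from two to three but still tolerates only one failure, which is why even sizes add cost without adding resilience.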

API server is stateless and that matters

The API server has no state of its own. You can run three of them behind a load balancer and they all behave identically. This means:

Rolling upgrades are easy. Drain one, upgrade, put it back, repeat.
You scale horizontally for read load. Writes still hit etcd serialized through Raft.
Authentication and authorization run on every request, and admission runs on every write: OIDC for authentication, RBAC for authorization, webhooks for admission.

The admission controller chain is where things get interesting. Mutating webhooks can rewrite your pod (this is how sidecar injection works for Istio, Linkerd, Vault). Validating webhooks can reject it (OPA Gatekeeper, Kyverno). When debugging "why did my pod come up with these extra containers" - look at the mutating webhooks.
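As a rough illustration of what a mutating webhook hands back, here is a sketch of a handler that appends one container to a Pod. The response shape follows the admission.k8s.io/v1 AdmissionReview format (a base64-encoded JSON Patch); the sidecar name and image are hypothetical, and real injectors such as Istio's do far more than this.

```python
import base64
import json


def inject_sidecar(review: dict) -> dict:
    """Build an AdmissionReview response that appends one container to a Pod."""
    request = review["request"]
    # Hypothetical sidecar; a real injector would template this from its config.
    sidecar = {"name": "example-proxy", "image": "example.invalid/proxy:1.0"}
    # JSON Patch (RFC 6902): the "/-" suffix appends to the end of the array.
    patch = [{"op": "add", "path": "/spec/containers/-", "value": sidecar}]
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": {
            "uid": request["uid"],  # the response must echo the request uid
            "allowed": True,
            "patchType": "JSONPatch",
            "patch": base64.b64encode(json.dumps(patch).encode()).decode(),
        },
    }


if __name__ == "__main__":
    fake_review = {"request": {"uid": "abc-123"}}  # trimmed-down request for illustration
    print(json.dumps(inject_sidecar(fake_review), indent=2))
```

The pod author never sees this patch applied, which is exactly why the extra containers look mysterious when you inspect the running pod.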

The scheduler picks, kubelet acts

The scheduler is one process. It runs a two-phase algorithm.

Filtering: drop nodes that cannot run this pod. Not enough free resources, untolerated taints, nodeSelector mismatch, anti-affinity violated.
Scoring: rank the survivors. Spread across zones, prefer less loaded nodes, honor topology spread constraints.

Then it writes nodeName onto the pod and walks away. That is it. The scheduler never touches a container. It is an assignment service.
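The two phases can be sketched in a few lines. This is a toy model under my own assumptions (CPU is the only resource, taints are plain strings, and the score is simply "most CPU left over"); the real scheduler runs many filter and score plugins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    free_cpu_millis: int
    taints: frozenset = frozenset()


@dataclass
class Pod:
    name: str
    cpu_request_millis: int
    tolerations: frozenset = frozenset()
    node_name: Optional[str] = None


def feasible(pod: Pod, node: Node) -> bool:
    """Filtering: the pod must fit and must tolerate every taint on the node."""
    fits = node.free_cpu_millis >= pod.cpu_request_millis
    tolerated = node.taints <= pod.tolerations
    return fits and tolerated


def score(pod: Pod, node: Node) -> int:
    """Scoring: toy 'least loaded' rule, more CPU left after placement is better."""
    return node.free_cpu_millis - pod.cpu_request_millis


def schedule(pod: Pod, nodes: list) -> Optional[str]:
    """Filter, score, then record the assignment. Returns None if nothing fits."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None  # the pod stays Pending
    best = max(candidates, key=lambda n: score(pod, n))
    pod.node_name = best.name  # the only thing the scheduler writes
    return best.name
```

Note that schedule only sets node_name. Nothing in it starts a container, which mirrors the split between the scheduler and the kubelet.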

The kubelet on the chosen node sees the pod (because it watches pods with spec.nodeName=$me), pulls images, calls the CRI to run them, mounts volumes, sets up the network via CNI, and reports status. The kubelet is also a tiny scheduler for itself - it can evict pods under memory pressure, restart crashed containers per the restartPolicy, and run liveness and readiness probes.

What runs on every worker node

kubelet: the node agent. Talks to the API server, talks to the container runtime over CRI, talks to the CNI binary for network setup.
container runtime: containerd is the default in EKS and most managed clusters. The dockershim was removed in 1.24. CRI-O is the Red Hat alternative.
kube-proxy: turns Service VIPs into real packet flow. Default mode is iptables; IPVS scales better past ~1,000 services. eBPF CNIs like Cilium can replace kube-proxy entirely.
CNI plugin: gives the pod an IP. AWS VPC CNI, Calico, Cilium, Flannel. More on this in 10.10.
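One detail of iptables mode is worth working through. As I understand it, kube-proxy writes one rule per endpoint and relies on random probability matching to spread connections, with rules tried in order. The sketch below is only the arithmetic behind that idea, not kube-proxy code, and the function names are mine.

```python
from fractions import Fraction


def rule_probabilities(endpoints: int) -> list:
    """Per-rule match probability for a chain of rules tried in order."""
    # Rule i only sees traffic the earlier rules passed over, so it must match
    # 1 / (endpoints still remaining). The last rule matches everything left.
    return [Fraction(1, endpoints - i) for i in range(endpoints)]


def effective_share(endpoints: int) -> list:
    """Overall share of connections each endpoint ends up receiving."""
    shares = []
    reaching = Fraction(1)  # fraction of traffic that reaches the current rule
    for p in rule_probabilities(endpoints):
        shares.append(reaching * p)
        reaching *= 1 - p
    return shares


if __name__ == "__main__":
    print(rule_probabilities(3))
    print(effective_share(3))
```

The chain is evaluated linearly per Service, which gives some intuition for why a hash-based mode like IPVS holds up better as the number of Services and endpoints grows.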

Managed control plane reality (what I ran at Binocs)

On EKS the control plane is hidden. You get an API endpoint, and AWS runs the API server, etcd, scheduler, and controller-manager. You pay $73 per cluster per month for that. Your job is the worker nodes - node groups, instance types, scaling.

The right-sizing exercise that saved us $1,800 to $2,000 per month was almost entirely on the data plane: pick smaller node instance types, mix on-demand with spot for stateless workloads, set sensible resource requests so the scheduler can actually bin-pack. The control plane was a fixed cost we never touched.

One trap with managed control planes: the API server has rate limits you do not see until a noisy controller hammers it. We hit this once with an operator that did unbatched watches. CloudWatch shows API request count per cluster - watch it.

High availability that actually matters

For production, three things matter:

Multi-AZ etcd. Three nodes across three AZs. EKS does this by default.
Multi-AZ workers. Spread your node groups so an AZ outage drops at most a third of capacity. Topology spread constraints and pod anti-affinity help here.
PodDisruptionBudgets. Tell the cluster "at most 1 pod of this Deployment can be unavailable during voluntary disruptions." Without PDBs a node drain during an upgrade can take your service to zero.
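To see why a PodDisruptionBudget changes what a drain can do, here is a small model of the eviction check for a budget expressed as maxUnavailable. It is a sketch under a simplifying assumption I am making: no replacement pods become healthy while the drain runs. The function names are hypothetical.

```python
def eviction_allowed(expected_pods: int, healthy_pods: int, max_unavailable: int) -> bool:
    """True if evicting one more healthy pod still keeps the budget intact."""
    desired_healthy = max(expected_pods - max_unavailable, 0)
    return healthy_pods - 1 >= desired_healthy


def drain(pods_on_node: int, expected_pods: int, healthy_pods: int, max_unavailable: int) -> int:
    """Count how many evictions a drain gets through before the budget blocks it."""
    evicted = 0
    while evicted < pods_on_node and eviction_allowed(expected_pods, healthy_pods, max_unavailable):
        healthy_pods -= 1  # assumption: no replacement is healthy yet
        evicted += 1
    return evicted


if __name__ == "__main__":
    # Three replicas, all on the node being drained.
    print(drain(pods_on_node=3, expected_pods=3, healthy_pods=3, max_unavailable=1))
    # Same drain with a budget loose enough to be no protection at all.
    print(drain(pods_on_node=3, expected_pods=3, healthy_pods=3, max_unavailable=3))
```

With maxUnavailable set to 1 the drain stalls after the first eviction until a replacement is healthy elsewhere; with no effective budget all three replicas go at once.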

The interview answer

When someone asks "how does Kubernetes work" you answer with the control loop pattern, name the four control plane components plus etcd, name the three node components, and emphasize that the API server is the only writer to etcd. Then you mention that everything else (Deployments, ReplicaSets, Services, your custom operators) is the same pattern - watch, diff, reconcile. That is the whole architecture in 90 seconds.

Learn more

Docs: Kubernetes Components (kubernetes.io)
Paper: Borg, Omega, and Kubernetes (Burns et al.) (research.google)
Docs: etcd documentation (etcd.io)
Article: Kelsey Hightower - Kubernetes the Hard Way (github.com)
