Deep dive · 15 min read

Pods, ReplicaSets, Deployments - deep dive

Pod lifecycle, the sidecar pattern, rolling update math, surge and unavailable, and how to actually do zero-downtime deploys.

The Pod is the most misunderstood object in Kubernetes. People think of it as "a container with extra steps." It is not. It is a shared execution context for a set of containers that need to live together, share the same network identity, and die together. Get the Pod model right and every other workload object makes sense.

What a Pod actually is

When the kubelet creates a Pod, it works through the following sequence:

Calls the CRI's RunPodSandbox. This starts the pause container, which holds the network, IPC, and (optionally) PID namespaces.
As part of sandbox setup, the container runtime invokes the CNI plugin, which gives the sandbox an IP and wires it into the cluster network.
Runs each initContainer in order, to completion, before any app container starts.
Starts the app containers inside those namespaces one after another, without waiting for each to become ready.
Runs each container's lifecycle.postStart hook, if defined, right after that container is created.

All containers in the Pod share localhost and the same IP, and they can share any Pod volume that each of them mounts. They do not share a filesystem by default - each container has its own root FS from its own image.
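To make the sharing rule concrete, here is a minimal sketch that models a Pod manifest as a plain Python dict and reports which volumes are actually shared. The Pod name, images, volume names, and the shared_volumes helper are all hypothetical, chosen only for illustration.

```python
# Hypothetical Pod manifest as a plain dict; names and images are illustrative.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-with-shipper"},
    "spec": {
        # Volumes are declared once, at the Pod level.
        "volumes": [
            {"name": "logs", "emptyDir": {}},
            {"name": "cache", "emptyDir": {}},
        ],
        "containers": [
            {
                "name": "app",
                "image": "example/app:1.0",
                "volumeMounts": [
                    {"name": "logs", "mountPath": "/var/log/app"},
                    {"name": "cache", "mountPath": "/cache"},
                ],
            },
            {
                "name": "log-shipper",
                "image": "example/shipper:1.0",
                "volumeMounts": [{"name": "logs", "mountPath": "/logs"}],
            },
        ],
    },
}


def shared_volumes(pod):
    """Return {volume name: [container names]} for volumes mounted by 2+ containers."""
    mounts = {}
    for container in pod["spec"]["containers"]:
        for mount in container.get("volumeMounts", []):
            mounts.setdefault(mount["name"], []).append(container["name"])
    return {vol: names for vol, names in mounts.items() if len(names) > 1}


print(shared_volumes(pod))
```

Only the logs volume is shared here, because both containers mount it. The cache volume is visible to app alone, and neither container sees the other's root filesystem.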

The sidecar pattern (and what changed in 1.28)

Sidecars are the canonical multi-container Pod use case. A proxy that handles TLS termination, a log shipper that tails files, an agent that refreshes secrets. The classic problem was the sidecar lifecycle: the app would exit first and the log shipper would hang forever, or the sidecar would crash and take the app with it.

Kubernetes 1.28 introduced native sidecar containers (alpha in 1.28, enabled by default since 1.29) as initContainers with restartPolicy: Always. These start in order before app containers, keep running for the Pod's lifetime, and shut down after app containers exit. This is how Istio's sidecar proxy, modern Vault Agent, and OpenTelemetry collectors should now run. Istio's ambient mesh, by contrast, removes the per-Pod sidecar altogether.
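A minimal sketch of what that looks like in a Pod spec, again modelled as a Python dict. The container names and images are hypothetical, and split_containers is an illustrative helper, not part of any Kubernetes client. The only field that turns an init container into a native sidecar is restartPolicy: Always on that container.

```python
def split_containers(pod_spec):
    """Classify containers into run-to-completion init, native sidecar, and app."""
    init, sidecars = [], []
    for container in pod_spec.get("initContainers", []):
        if container.get("restartPolicy") == "Always":
            sidecars.append(container["name"])  # keeps running for the Pod's lifetime
        else:
            init.append(container["name"])  # must exit before the next one starts
    apps = [c["name"] for c in pod_spec.get("containers", [])]
    return {"init": init, "sidecar": sidecars, "app": apps}


# Hypothetical spec: one classic init container, one native sidecar, one app.
pod_spec = {
    "initContainers": [
        {"name": "migrate", "image": "example/migrate:1.0"},
        {
            "name": "log-shipper",
            "image": "example/shipper:1.0",
            "restartPolicy": "Always",
        },
    ],
    "containers": [{"name": "app", "image": "example/app:1.0"}],
}

print(split_containers(pod_spec))
```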

ReplicaSet: the simplest controller

A ReplicaSet has three fields that matter:

replicas: desired count.
selector: which Pods belong to me.
template: how to make new ones.

The controller loop: list Pods matching the selector, if count < replicas create more, if count > replicas delete the surplus. Surplus deletion follows a priority order - roughly: unscheduled and pending pods first, then not-ready pods, then pods with a lower controller.kubernetes.io/pod-deletion-cost annotation, then more recently created pods and pods with more restarts. This matters because during a scale-down you do not want to kill your healthiest pods.
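The loop can be sketched in a few lines. This is a deliberately simplified model, not the controller's real code: the Pod dataclass and reconcile function are hypothetical, and the ranking uses only readiness, age, and restart count, while the real controller also looks at scheduling state, phase, and deletion cost.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    ready: bool
    age_seconds: int
    restarts: int = 0


def deletion_rank(pod):
    # Lower tuples are deleted first: not-ready, then younger, then more restarts.
    return (pod.ready, pod.age_seconds, -pod.restarts)


def reconcile(pods, replicas):
    """Return ('create', n), ('delete', [names]) or ('noop', None)."""
    diff = replicas - len(pods)
    if diff > 0:
        return ("create", diff)
    if diff < 0:
        victims = sorted(pods, key=deletion_rank)[:-diff]
        return ("delete", [p.name for p in victims])
    return ("noop", None)


pods = [
    Pod("a", ready=True, age_seconds=3600),
    Pod("b", ready=False, age_seconds=50),
    Pod("c", ready=True, age_seconds=60, restarts=2),
    Pod("d", ready=True, age_seconds=60),
]
print(reconcile(pods, 2))
```

Scaling these four pods down to two removes b (not ready) and c (young, with restarts), and keeps the long-running healthy pod a.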

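The selector is the gotcha. ReplicaSets adopt any Pod matching the selector, even ones they did not create, as long as the Pod has no other controller owner. If you have a stray Pod with matching labels, the ReplicaSet will count it and possibly delete it. In practice, let a Deployment own the ReplicaSet: the Deployment controller adds a pod-template-hash label to each ReplicaSet's selector and Pod template, so a bare Pod carrying only your app labels no longer matches. Do not set or edit that label by hand.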
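A small sketch of the matching rule shows why the extra label helps. It implements only matchLabels-style equality matching, and the label values, including the hash, are made up for illustration.

```python
def matches(selector, labels):
    """matchLabels semantics: every selector key/value must be present on the Pod."""
    return all(labels.get(key) == value for key, value in selector.items())


# Hypothetical labels; the hash value is invented for this example.
stray_pod = {"app": "web"}
managed_pod = {"app": "web", "pod-template-hash": "5d9c7b6f4"}

loose_selector = {"app": "web"}
deployment_selector = {"app": "web", "pod-template-hash": "5d9c7b6f4"}

print(matches(loose_selector, stray_pod))       # the stray Pod would be adopted
print(matches(deployment_selector, stray_pod))  # the stray Pod is ignored
```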

Deployment: rolling updates and rollback

A Deployment is a thin layer over ReplicaSets that adds versioning and rollout strategy.

The two knobs are maxSurge and maxUnavailable. With 10 replicas, 25% surge, 25% unavailable:

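At any moment you have at most 13 pods (10 + surge, where 25% of 10 rounds up to 3) and at least 8 available (10 - unavailable, where 25% of 10 rounds down to 2).
Faster rollouts: bump surge. Tight on cluster capacity: keep surge low and bump unavailable instead.
For tight latency SLOs set maxUnavailable: 0 so you never drop below 100% of desired capacity.

Setting BOTH to 0 is invalid: the API server rejects the Deployment, because the rollout could never make progress. Set surge to 100% and you double your compute cost during the roll.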
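The arithmetic is easy to get wrong because of rounding: Kubernetes rounds a percentage maxSurge up and a percentage maxUnavailable down. The sketch below works through it; rollout_bounds and resolve are hypothetical helper names, and the zero check is a simplification of the real validation.

```python
def resolve(value, replicas, round_up):
    """Resolve an int or 'N%' string against the replica count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer math avoids float rounding: ceil for surge, floor for unavailable.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Simplification: the API rejects a literal 0/0; this sketch treats
        # any pair that resolves to 0/0 as an error as well.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {"max_total": replicas + surge, "min_available": replicas - unavailable}


print(rollout_bounds(10))               # 10 replicas, 25% / 25%
print(rollout_bounds(20, "10%", 0))     # 20 replicas, 10% surge, no unavailability
```

With 10 replicas and the 25% defaults this gives a ceiling of 13 pods and a floor of 8 available. With 20 replicas, a 25% surge adds 5 pods while a 10% surge adds only 2.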

Probes are not optional

Without proper probes your rolling update is a coin flip.

startupProbe: gives a slow-starting app time to come up before liveness kicks in. Use it for JVM apps that take 30 seconds to warm up.
readinessProbe: when this fails, the Pod is removed from Service endpoints. The pod keeps running, just no traffic. Use this for "I am alive but warming caches."
livenessProbe: when this fails repeatedly, the kubelet restarts the container. Use sparingly - a bad liveness probe is a self-DoS.

The order matters during a rolling deploy. The new Pod starts, the startup probe passes, then readiness flips to true, then the Service includes its IP. Only then does the Deployment controller scale down the old Pod. If the readiness probe is missing, the Pod counts as ready as soon as its containers start, so the Service sends it traffic before the app can serve and you see 502s.
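As a rough sketch, here are all three probes on one container, plus the sum that decides how long a slow starter gets. The paths, port, image, and the startup_budget_seconds helper are assumptions for illustration; the fallback values in the helper are the documented probe defaults (periodSeconds 10, failureThreshold 3, initialDelaySeconds 0), and the result is an approximation of the real timing.

```python
def startup_budget_seconds(probe):
    """Approximate time a startupProbe tolerates before the container is restarted."""
    return (
        probe.get("initialDelaySeconds", 0)
        + probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10)
    )


# Hypothetical container spec; endpoints and port are illustrative.
container = {
    "name": "app",
    "image": "example/jvm-app:1.0",
    "startupProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 5,
        "failureThreshold": 12,
    },
    "readinessProbe": {
        "httpGet": {"path": "/ready", "port": 8080},
        "periodSeconds": 5,
    },
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
}

print(startup_budget_seconds(container["startupProbe"]))
```

For an app that takes 30 seconds to warm up, a budget of about 60 seconds leaves headroom without hiding a genuinely stuck start. Liveness and readiness checks only begin once the startup probe has succeeded.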

terminationGracePeriodSeconds and preStop

When a Pod is deleted, the kubelet sends SIGTERM, waits up to terminationGracePeriodSeconds (default 30), then sends SIGKILL. In parallel, the endpoints controller removes the Pod IP from Services. There is a race: kube-proxy on other nodes may take a few hundred ms to update its iptables rules, during which traffic still flows to the dying pod.

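Fix: add a preStop hook that sleeps 5 to 10 seconds. The Pod is removed from the endpoints as soon as deletion starts, but the hook delays SIGTERM, so the app keeps serving while kube-proxy converges. Then your app receives SIGTERM and shuts down cleanly. The sleep counts against terminationGracePeriodSeconds, so leave room for the app's own shutdown. This single trick eliminates 90% of the 502s during rollouts.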
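A minimal sketch of the hook and the time budget it has to fit into. The grace period covers both the preStop hook and the app's own SIGTERM handling, so the two must add up to less than terminationGracePeriodSeconds. The spec values and the shutdown_budget helper are hypothetical, and the exec form assumes the image ships a sleep binary.

```python
def shutdown_budget(grace_period_s, pre_stop_sleep_s, app_drain_s):
    """Seconds left before SIGKILL once the preStop sleep and app drain are done."""
    return grace_period_s - (pre_stop_sleep_s + app_drain_s)


# Hypothetical Pod spec fragment.
pod_spec = {
    "terminationGracePeriodSeconds": 30,
    "containers": [
        {
            "name": "app",
            "image": "example/app:1.0",
            # Assumes a `sleep` binary exists in the image.
            "lifecycle": {"preStop": {"exec": {"command": ["sleep", "10"]}}},
        }
    ],
}

# 10 s sleep plus an assumed 15 s drain fits in the default 30 s with 5 s to spare.
print(shutdown_budget(pod_spec["terminationGracePeriodSeconds"], 10, 15))
```

A negative result means the kubelet would SIGKILL the container mid-drain. Either shorten the sleep or raise terminationGracePeriodSeconds.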

When Deployment is wrong

StatefulSet: ordered, named pods with stable storage. web-0, web-1. Pod 0 starts before pod 1. PVCs are per-pod. Use for: Postgres, Kafka, Elasticsearch, anything that votes or has stable peer identity.
DaemonSet: one pod per node. Use for: log shippers (Fluent Bit), node-level agents (Datadog), CNI, kube-proxy itself.
Job: run to completion. Retries on failure. Parallelism controls.
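CronJob: scheduled Jobs. Watch out for missed runs during control plane downtime - set startingDeadlineSeconds to bound how late a missed run may still start, and concurrencyPolicy: Forbid or Replace so a late run cannot overlap the next one.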
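The stable identity a StatefulSet provides is mostly a naming convention, which a short sketch makes visible. Pods are named after the StatefulSet plus an ordinal, and each PVC is named from the volume claim template, the StatefulSet name, and the same ordinal. The statefulset_identities helper and the www claim name are illustrative assumptions.

```python
def statefulset_identities(name, replicas, claim_templates):
    """Stable Pod names and per-Pod PVC names, listed in start order."""
    return [
        {
            "pod": f"{name}-{ordinal}",
            "pvcs": [f"{claim}-{name}-{ordinal}" for claim in claim_templates],
        }
        for ordinal in range(replicas)
    ]


for identity in statefulset_identities("web", 2, ["www"]):
    print(identity)
```

Because the names are derived from the ordinal, a rescheduled web-1 comes back as web-1 and reattaches to the same claim. A Deployment cannot give you that, since its Pods get random suffixes.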

What I ran at Binocs

Almost everything was a Deployment plus HPA. The exceptions: Postgres and Redis were managed services (RDS, ElastiCache), so no StatefulSets. We had a Fluent Bit DaemonSet for shipping logs to CloudWatch and a couple of CronJobs for nightly cleanup tasks.

The single biggest reliability win was tuning probes and the preStop sleep. Before that, every rolling deploy threw a handful of 502s in CloudFront. After, deploys were silent. Cost-wise, getting maxSurge right on heavy services mattered - one of our larger Deployments was 20 replicas and a 25% surge meant briefly running 5 extra m5.large worth of compute during every deploy. We dropped that to 10% for stable services.

The interview narrative

Lead with the layering: Pod is the schedulable unit and shares namespaces; ReplicaSet maintains replica count via a reconcile loop; Deployment is the user-facing object that owns ReplicaSets and gives you versioned rollouts. Then mention rolling update strategy, probes, and graceful shutdown as the three things you tune in production. If they push deeper, talk about native sidecars in 1.28, the preStop hook trick, and when you would reach for a StatefulSet instead.

Learn more

Pod lifecycle (Docs, kubernetes.io)
Deployments (Docs, kubernetes.io)
Configure Liveness, Readiness and Startup Probes (Docs, kubernetes.io)
Brendan Burns - Designing Distributed Systems (sidecar chapter) (Docs, azure.microsoft.com)
