
Services (ClusterIP, NodePort, LoadBalancer) - deep dive

kube-proxy modes, EndpointSlices, externalTrafficPolicy, session affinity, the SNAT problem, and why eBPF CNIs are replacing kube-proxy.

Services are where most "weird Kubernetes networking problems" actually live. The model looks simple until you start debugging why your client IPs are all coming through as the node IP, or why your iptables rule count has hit 100k and the API server is sluggish. This section unpacks what kube-proxy is actually doing and how to operate Services in production.

The cast

The endpoint controller (well, since 1.21, the EndpointSlice controller) is the only component that turns a Service selector into a list of Pod IPs. kube-proxy never looks at Pods. It looks at EndpointSlices.

EndpointSlices replaced the monolithic Endpoints object specifically to fix a scaling cliff. With Endpoints, a Service with 1,000 Pods meant a single 1,000-entry object. Every change rewrote and re-broadcast the whole thing to every node. EndpointSlices chunk that into batches of ~100, so a single pod change only updates one slice. This matters above ~200 Pods per Service.
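A minimal sketch of why chunking helps, assuming a fixed slice size of 100 and endpoints packed in order. The real controller's packing strategy is more involved, and `chunk_endpoints` and the generated IPs are hypothetical names used only for illustration:

```python
def chunk_endpoints(pod_ips, max_per_slice=100):
    """Split a flat endpoint list into slice-sized chunks."""
    return [pod_ips[i:i + max_per_slice]
            for i in range(0, len(pod_ips), max_per_slice)]

# 1,000 made-up Pod IPs standing in for one large Service.
pod_ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
slices = chunk_endpoints(pod_ips)

# A single Pod change only touches the slice that holds that Pod.
changed_ip = pod_ips[437]
touched = [idx for idx, s in enumerate(slices) if changed_ip in s]

print("slices:", len(slices))
print("slices touched by one Pod change:", touched)
print("entries re-sent:", len(slices[touched[0]]), "instead of", len(pod_ips))
```

With a single Endpoints object the same change would re-send all 1,000 entries to every node.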

kube-proxy modes

kube-proxy has three Linux modes, listed here from oldest to newest, plus a fourth option that removes kube-proxy altogether.

iptables (default)

kube-proxy writes a chain per Service and a chain per endpoint. To pick a Pod it uses the iptables statistic module in random mode. With 5 endpoints, the sequential rules match with probability 1/5, 1/4, 1/3, 1/2, and 1, which works out to an even 1/5 chance per endpoint. Simple, and no userspace proxy in the data path.
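The cascading probabilities are easy to check by hand. The sketch below is a plain Python model of the rule walk, not kube-proxy code; `rule_probabilities`, `overall_probabilities`, and `pick_endpoint` are hypothetical helper names:

```python
import random
from fractions import Fraction

def rule_probabilities(n):
    """Per-rule match probability for n endpoints: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]

def overall_probabilities(n):
    """Chance that each endpoint is chosen, given that rules are tried in order."""
    result, still_unmatched = [], Fraction(1)
    for p in rule_probabilities(n):
        result.append(still_unmatched * p)
        still_unmatched *= 1 - p
    return result

def pick_endpoint(endpoints, rng):
    """Walk the rules in order; the first rule that matches wins."""
    for endpoint, p in zip(endpoints, rule_probabilities(len(endpoints))):
        if rng.random() < p:
            return endpoint
    raise AssertionError("unreachable: the last rule has probability 1")

print(rule_probabilities(5))
print(overall_probabilities(5))
print(pick_endpoint(["pod-a", "pod-b", "pod-c"], random.Random()))
```

Each rule only sees the traffic the earlier rules did not match, which is why the per-rule probability has to rise toward 1 to keep the overall split even.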

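Problem: rule count is O(Services * Endpoints). 1,000 Services * 10 endpoints is 10,000 Service-endpoint pairs, which at several rules per pair comes to roughly 50k rules, and because chains are evaluated sequentially, packet processing time degrades linearly. Historically, each Service change triggered a full iptables-restore, which under load can take seconds; recent kube-proxy releases reduce this by restoring only the chains that changed.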

IPVS

Same data flow but uses the Linux IPVS (IP Virtual Server) load balancer instead of iptables. Hash-based lookup so adding endpoints is O(1) cost. Supports actual algorithms: round-robin, least-connection, source hash. Use this for clusters with thousands of Services.
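To make the three scheduling ideas concrete, here is a toy sketch of each. It is not how IPVS implements them in the kernel: the CRC32 hash and the function names are assumptions chosen for illustration.

```python
import itertools
import zlib

def round_robin(backends):
    """Hand out backends in a fixed rotation."""
    return itertools.cycle(backends)

def least_connection(active_connections):
    """Pick the backend with the fewest active connections."""
    return min(active_connections, key=active_connections.get)

def source_hash(client_ip, backends):
    """Map a client IP to a backend deterministically (toy hash, not IPVS's)."""
    return backends[zlib.crc32(client_ip.encode()) % len(backends)]

backends = ["10.0.1.5", "10.0.2.9", "10.0.3.4"]

rr = round_robin(backends)
print([next(rr) for _ in range(4)])
print(least_connection({"10.0.1.5": 12, "10.0.2.9": 3, "10.0.3.4": 7}))
print(source_hash("203.0.113.7", backends))
```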

nftables (beta in 1.31, GA in 1.33)

Modern replacement for iptables with better update performance. Still kernel netfilter. EKS supports it as opt-in.

eBPF (Cilium, no kube-proxy)

Cilium replaces kube-proxy entirely. Service VIPs are translated in the eBPF program attached to the socket layer, not in netfilter. No iptables chains for Services at all. Scales to millions of endpoints with no rule degradation. This is where new clusters are headed.

externalTrafficPolicy and the SNAT problem

When external traffic hits a LoadBalancer Service, the default externalTrafficPolicy: Cluster does this:

1. Cloud LB sends packet to node A on the NodePort.
2. Node A's iptables picks any Pod cluster-wide. Often a Pod on node B.
3. To make the reply route back through node A, node A SNATs the source IP to itself.
4. Your app sees the node IP, not the client IP.

With externalTrafficPolicy: Local:

1. Cloud LB sends packet to node A on the NodePort.
2. iptables picks only a Pod on node A. If none, drop.
3. No SNAT needed.
4. Your app sees the real client IP.
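The two policies can be compared with a small model. This is a sketch of the behavior described above, not kube-proxy's implementation; the `route` function, node names, and addresses are all hypothetical:

```python
import random

def route(client_ip, ingress_node, pods_by_node, node_ips, policy, rng=random):
    """Return (pod, source_ip_the_pod_sees), or None if the packet is dropped."""
    if policy == "Local":
        local_pods = pods_by_node.get(ingress_node, [])
        if not local_pods:
            return None  # no endpoint on this node: the packet is dropped
        return rng.choice(local_pods), client_ip  # no SNAT, client IP preserved
    if policy == "Cluster":
        all_pods = [pod for pods in pods_by_node.values() for pod in pods]
        if not all_pods:
            return None
        # SNAT to the ingress node so the reply routes back through it.
        return rng.choice(all_pods), node_ips[ingress_node]
    raise ValueError(f"unknown policy: {policy}")

pods_by_node = {"node-a": [], "node-b": ["pod-1", "pod-2"]}
node_ips = {"node-a": "10.0.0.1", "node-b": "10.0.0.2"}
client = "203.0.113.7"

print(route(client, "node-a", pods_by_node, node_ips, "Cluster"))
print(route(client, "node-a", pods_by_node, node_ips, "Local"))
print(route(client, "node-b", pods_by_node, node_ips, "Local"))
```

The second call returning None is the case a load balancer health check has to catch: node A has no local Pods, so it should not receive traffic under Local.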

The catch with Local: your LB needs to skip nodes that have no Pods. AWS NLB with the right health check on the NodePort handles this. AWS ALB with target type ip (via the AWS Load Balancer Controller) skips kube-proxy entirely and sends to Pod IPs directly - this is what you want for production HTTP traffic on EKS.

Service discovery: DNS

CoreDNS (the default cluster DNS since 1.13) creates records:

my-svc.my-ns.svc.cluster.local -> ClusterIP
my-svc.my-ns.svc.cluster.local (headless) -> list of Pod IPs
pod-ip.my-ns.pod.cluster.local -> single Pod (the IP is written with dashes, e.g. 10-0-0-1)
_port-name._protocol.my-svc.my-ns.svc.cluster.local -> SRV records for named ports
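The naming scheme is regular enough to build by hand. A small sketch, assuming the default cluster domain cluster.local; the helper names are hypothetical, and the SRV form follows the usual _port-name._protocol prefix convention:

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """A/AAAA name for a Service (ClusterIP, or Pod IPs if headless)."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

def pod_fqdn(pod_ip, namespace, cluster_domain="cluster.local"):
    """A record for a single Pod: dots in the IP become dashes."""
    return f"{pod_ip.replace('.', '-')}.{namespace}.pod.{cluster_domain}"

def srv_name(port_name, protocol, service, namespace, cluster_domain="cluster.local"):
    """SRV record name for a named Service port."""
    base = service_fqdn(service, namespace, cluster_domain)
    return f"_{port_name}._{protocol.lower()}.{base}"

print(service_fqdn("my-svc", "my-ns"))
print(pod_fqdn("10.0.0.1", "my-ns"))
print(srv_name("http", "TCP", "my-svc", "my-ns"))
```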

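NDOTS gotcha: by default, ndots: 5 in the Pod's /etc/resolv.conf means any name with fewer than 5 dots is tried against the search path first. curl google.com queries google.com.my-ns.svc.cluster.local, google.com.svc.cluster.local, and google.com.cluster.local before finally trying google.com. as an absolute name. That is at least 3 failed DNS lookups before every external call, more if the node contributes its own search domains, and doubled when both A and AAAA records are requested. Either lower ndots in your Pod's dnsConfig (ndots: 2 lets names with two or more dots, such as api.example.com, skip the search path) or append a trailing dot to FQDNs.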
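The lookup order can be reproduced with a few lines. This is a simplified model of the resolver's search-list logic, assuming the three standard Kubernetes search domains; `candidate_queries` is a hypothetical helper, not a resolver API:

```python
def candidate_queries(name, search, ndots=5):
    """Return the names a resolver would try, in order, for one lookup."""
    if name.endswith("."):
        return [name]  # already absolute: the search path is skipped
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + searched  # enough dots: try the name as-is first
    return searched + [absolute]      # too few dots: search path first

search = ["my-ns.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_queries("google.com", search):
    print(query)

print(candidate_queries("google.com.", search))
print(candidate_queries("api.example.com", search, ndots=2)[0])
```

Note that google.com has only one dot, so it still walks the search path even with ndots: 2; the trailing dot is the only fix that works for every name.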

Session affinity

sessionAffinity: ClientIP makes kube-proxy key on the source IP so a client sticks to the same Pod. The default timeout is 3 hours (10800 seconds). Useful for WebSocket or stateful protocols, but it balances poorly when many clients sit behind one NAT (they all land on the same Pod) or when a few source IPs carry most of the traffic (the remaining Pods sit idle).
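A toy model of source-IP stickiness shows the NAT problem directly. It assumes a simple lookup table with a timeout that is refreshed on each new connection; the `ClientIPAffinity` class is hypothetical and is not how kube-proxy stores this state:

```python
import random

class ClientIPAffinity:
    """Stick each source IP to one Pod until the entry times out."""

    def __init__(self, pods, timeout_seconds=10800, rng=None):
        self.pods = list(pods)
        self.timeout = timeout_seconds
        self._rng = rng or random.Random()
        self._table = {}  # source IP -> (pod, last_seen_seconds)

    def pick(self, source_ip, now):
        entry = self._table.get(source_ip)
        if entry is not None and now - entry[1] <= self.timeout:
            pod = entry[0]                     # still sticky
        else:
            pod = self._rng.choice(self.pods)  # new or expired: pick again
        self._table[source_ip] = (pod, now)
        return pod

lb = ClientIPAffinity(["pod-a", "pod-b", "pod-c"], rng=random.Random(0))

# Fifty different users behind one office NAT all present the same source IP.
nat_ip = "198.51.100.20"
pods_used = {lb.pick(nat_ip, now=t) for t in range(50)}
print("Pods serving the whole office:", pods_used)
```

Cookie-based affinity at the ingress avoids this because it identifies the browser session rather than the network address.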

For real session affinity, do it at the ingress/LB layer with cookies.

EndpointSlice topology hints

Topology-aware routing tells kube-proxy "prefer endpoints in the same zone as the consuming node." Massive win for cross-AZ data transfer cost on AWS - intra-AZ is free, cross-AZ is $0.01/GB each way. We saw this matter for chatty services calling each other 10k+ times per minute. Enable with the annotation service.kubernetes.io/topology-mode: Auto.
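A back-of-the-envelope sketch of the cost at stake, assuming the $0.01/GB figure is billed on both the sending and the receiving side. The traffic volume and the two-thirds cross-zone share (even spreading over three zones) are made-up inputs, not measurements:

```python
CROSS_AZ_USD_PER_GB_EACH_WAY = 0.01

def monthly_cross_az_cost(gb_per_month, cross_az_fraction):
    """Cost of the share of traffic that crosses zones, charged in both directions."""
    cross_az_gb = gb_per_month * cross_az_fraction
    return cross_az_gb * CROSS_AZ_USD_PER_GB_EACH_WAY * 2

# Hypothetical service-to-service traffic: 10 TB/month over three zones.
# With no zone preference, roughly 2 of every 3 calls leave the caller's zone.
before = monthly_cross_az_cost(10_000, 2 / 3)

# Same traffic if routing hints keep most calls in-zone (assumed 10% spill).
after = monthly_cross_az_cost(10_000, 0.10)

print(f"without hints: ${before:.2f}/month")
print(f"with hints:    ${after:.2f}/month")
```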

The catch: hints are honored only when each zone has enough endpoints to handle its traffic share. If zone B has 1 Pod and 70% of consumers, the hint is dropped and you fall back to cluster-wide. So this works at scale, not for 3-pod deployments.

NodePort: when and when not

NodePort is the integration point that LoadBalancer Services build on. Used standalone it is mostly for:

Bare metal where you front it with MetalLB or HAProxy.
Development clusters (kind, minikube).
Edge cases where you need a stable port on every node.

In cloud, never expose NodePort directly to the internet. You lose the cloud LB's TLS termination, WAF, DDoS protection, and connection draining.

What I ran at Binocs

EKS. All public traffic came through ALBs provisioned by the AWS Load Balancer Controller, with target type ip so packets went straight to Pods, no kube-proxy hop, no SNAT, real client IPs preserved. Internal east-west was ClusterIP with kube-proxy in iptables mode. We had about 80 Services - well below the scale where IPVS would matter.

The cost lesson: enabling topology-aware hints on three high-volume internal services dropped cross-AZ data transfer by about 40% on those flows. Not a headline number but it stacked with the right-sizing work that contributed to the $1.8k-$2k/month savings. Small networking choices compound.

The interview narrative

Frame it as three layers. The Service object holds the selector and VIP. The EndpointSlice controller turns the selector into a current list of Pod IPs. kube-proxy on every node watches EndpointSlices and programs the kernel (iptables, IPVS, nftables) to translate Service VIP to Pod IP. Then mention externalTrafficPolicy, the SNAT problem, target-type-ip on AWS ALB, and topology-aware routing. If they go deep, eBPF and Cilium are where this is all going.

Learn more

Docs: Service (kubernetes.io)
Docs: Virtual IPs and Service Proxies (kubernetes.io)
Docs: EndpointSlices (kubernetes.io)
Docs: AWS Load Balancer Controller (kubernetes-sigs.github.io)
Talk: Liz Rice - eBPF and Cilium talks (youtube.com)
