Networking (CNI, kube-proxy, eBPF) - deep dive

Pod networking from the ground up: CNI plugin model, overlay vs underlay, AWS VPC CNI's ENI math, kube-proxy modes, NetworkPolicy enforcement, and Cilium/eBPF as the modern replacement.

Kubernetes networking is the deepest rabbit hole in the platform. The good news: the model is small. Pods get flat routable IPs, Services are virtual IPs translated by kernel rules, NetworkPolicy is enforced by your CNI. The bad news: every layer has gotchas (ENI limits on AWS, iptables rule scaling, NetworkPolicy controller differences) and the modern eBPF rewrite is changing the rules under everyone's feet.

The Kubernetes network model (revisited)

The model has three rules. They sound simple but they are constraints on every networking implementation:

All Pods can reach all Pods without NAT.
All Nodes can reach all Pods without NAT.
A Pod sees itself with the same IP others see for it.

This rules out a lot of designs. You cannot just give each Pod a Docker bridge IP - those are not routable across nodes. You need either:

Underlay routing: Pod IPs are real, routable in your infrastructure (AWS VPC CNI, Calico in BGP mode).
Overlay tunneling: Pod IPs are virtual, encapsulated in VXLAN/Geneve/IPIP packets between nodes (Flannel, Calico in overlay mode, Cilium in tunnel mode).

Underlay is faster (no encap overhead) and simpler to debug (tcpdump shows real Pod IPs). Overlay decouples from the underlying network (works on any infrastructure).
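To put a number on the encap overhead, here is a minimal sketch of the MTU arithmetic. It assumes an IPv4 outer header and no tunnel options; the function name `pod_mtu` is just for illustration, and real CNIs may reserve a few extra bytes.

```python
# Bytes each encapsulation adds inside the underlay MTU (IPv4 outer header,
# no tunnel options). Assumed figures for illustration.
ENCAP_OVERHEAD_BYTES = {
    "underlay": 0,             # no encapsulation at all
    "ipip": 20,                # outer IPv4 header
    "vxlan": 20 + 8 + 8 + 14,  # outer IPv4 + UDP + VXLAN header + inner Ethernet
    "geneve": 20 + 8 + 8 + 14, # same as VXLAN when no Geneve options are used
}


def pod_mtu(underlay_mtu: int, encap: str) -> int:
    """Largest Pod-side MTU that still fits in one underlay packet."""
    return underlay_mtu - ENCAP_OVERHEAD_BYTES[encap]


for encap in ENCAP_OVERHEAD_BYTES:
    print(f"{encap:9s} pod MTU on a 1500-byte network: {pod_mtu(1500, encap)}")
```

On a standard 1500-byte network, a VXLAN overlay leaves 1450 bytes for the Pod interface. A mismatch here is a classic source of "small requests work, large responses hang" bugs.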

CNI: the plugin contract

The CNI specification is intentionally minimal: a plugin is a binary, invoked with a command (ADD, DEL, CHECK, VERSION) in the CNI_COMMAND environment variable, passed a JSON network config on stdin, and expected to print a JSON result on stdout. The container runtime (containerd, CRI-O) invokes it on the kubelet's behalf when setting up a Pod's network namespace.
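The contract is small enough to sketch in a few lines. The following is an illustrative skeleton only, not a working plugin: it creates no interfaces, the address it returns is a hard-coded placeholder, and the function names `handle` and `main` are hypothetical. It assumes the result layout of CNI spec version 1.0.0.

```python
import json

SUPPORTED_VERSIONS = ["0.4.0", "1.0.0"]


def handle(command, conf, env):
    """Return the JSON result for one CNI command, or None if nothing is printed."""
    if command == "VERSION":
        return {"cniVersion": "1.0.0", "supportedVersions": SUPPORTED_VERSIONS}
    if command == "ADD":
        # A real plugin would create the veth pair and assign the address here.
        # The address below is a placeholder, not real IPAM.
        return {
            "cniVersion": conf.get("cniVersion", "1.0.0"),
            "interfaces": [
                {"name": env.get("CNI_IFNAME", "eth0"), "sandbox": env.get("CNI_NETNS", "")}
            ],
            "ips": [{"address": "10.244.1.5/24", "interface": 0}],
        }
    if command in ("DEL", "CHECK"):
        return None  # success is signalled by exit code 0 and empty stdout
    raise ValueError(f"unsupported CNI_COMMAND: {command!r}")


def main(env, stdin, stdout):
    """Read the command from the environment and the config from stdin."""
    raw = stdin.read()
    conf = json.loads(raw) if raw.strip() else {}
    result = handle(env.get("CNI_COMMAND", ""), conf, env)
    if result is not None:
        json.dump(result, stdout)
    return 0
```

Installed as an executable, the entry point would be `sys.exit(main(os.environ, sys.stdin, sys.stdout))`. Taking the environment and streams as parameters keeps the sketch easy to exercise without a container runtime.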

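Plugins can be chained in a single network config (a conflist), for example a main plugin followed by portmap or bandwidth. Separately, a meta-plugin such as Multus attaches multiple networks to one Pod: one for the primary network, others for secondary interfaces. This is used for things like exposing Pods on multiple VLANs or attaching SR-IOV interfaces.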

AWS VPC CNI: the EKS default

The AWS VPC CNI gives every Pod a real IP from your VPC's subnets. Pods are first-class VPC citizens - security groups can reference them, VPC flow logs see them, ELBs can target them directly. This is the underlay model done well.

The mechanism:

The CNI's node agent (ipamd, running in the aws-node DaemonSet) attaches multiple Elastic Network Interfaces (ENIs) to the EC2 instance.
Each ENI has multiple secondary IPs.
The CNI maintains a warm pool of IPs (WARM_IP_TARGET, WARM_ENI_TARGET) so Pod creation is fast.
On ADD, it picks a free IP, creates a veth pair, moves one end into the Pod namespace, programs source-based routing.

The ENI/IP density problem

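The number of ENIs per instance and IPs per ENI is fixed by EC2 instance type. Each ENI's primary IP is not handed out to Pods, and EKS adds 2 for host-network Pods (aws-node and kube-proxy), so the limit is ENIs * (IPs per ENI - 1) + 2. An m5.large supports 3 ENIs with 10 IPs each: 3 * 9 + 2 = 29 Pods max. An m5.4xlarge supports 8 ENIs with 30 IPs each: 8 * 29 + 2 = 234.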
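A minimal sketch of the max-Pods arithmetic, assuming the formula EKS uses for its default max-pods values: each ENI's primary address is reserved, and 2 is added for host-network Pods. The function name `max_pods` is illustrative.

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """Pods per node in secondary-IP mode: ENIs * (IPs per ENI - 1) + 2."""
    return enis * (ips_per_eni - 1) + 2


# ENI and per-ENI IP limits for the two instance types discussed in the text.
for name, enis, ips in [("m5.large", 3, 10), ("m5.4xlarge", 8, 30)]:
    print(f"{name}: {max_pods(enis, ips)} Pods")
```

Running this for the instance types your node pools or Karpenter provisioners can launch is a quick way to spot density limits before a scale-up does.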

If you do not know this, you hit a wall. We had a Deployment scale up, Karpenter launched m5.large nodes, the cluster ran out of IPs, new Pods went unschedulable even with plenty of CPU and memory free.

Fixes:

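Prefix delegation: set ENABLE_PREFIX_DELEGATION=true. Each secondary-IP slot on an ENI now holds a /28 prefix (16 IPs) instead of a single address. m5.large goes from 29 to 110 Pods (the raw address count is higher; 110 is the recommended max-pods cap for smaller instances). Requires Nitro-based instances.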
Larger instance types: scale up to instances with higher native limits.
Custom networking: route Pod IPs from a separate subnet. Decouples Pod density from node subnet size.

Enable prefix delegation early. It is one parameter and prevents an entire class of incidents.
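The jump from 29 to 110 is a cap, not the raw address count. A rough sketch, assuming the caps used by AWS's max-pods calculator (110 for instances with fewer than 30 vCPUs, 250 otherwise); treat both the caps and the function name as assumptions to check against current AWS guidance.

```python
PREFIX_SIZE = 16  # addresses in one /28 prefix


def max_pods_with_prefixes(enis: int, ips_per_eni: int, vcpus: int) -> tuple:
    """Return (raw address capacity, recommended max-pods) with prefix delegation."""
    raw = enis * (ips_per_eni - 1) * PREFIX_SIZE + 2
    cap = 110 if vcpus < 30 else 250  # assumed recommended caps
    return raw, min(raw, cap)


raw, recommended = max_pods_with_prefixes(enis=3, ips_per_eni=10, vcpus=2)  # m5.large
print(f"m5.large: {raw} addresses available, max-pods {recommended}")
```

The address space stops being the bottleneck. The remaining limit is how many Pods the kubelet and the node's CPU and memory can reasonably carry.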

Security group per Pod

The VPC CNI supports security groups per Pod (via the SecurityGroupPolicy CRD). You can apply VPC security groups to specific Pods as if they were EC2 instances. Useful for compliance ("this Pod needs to talk to this RDS instance which only allows this SG"). It is limited to Nitro-based instance types and requires ENABLE_POD_ENI=true, which gives each such Pod a branch ENI off a trunk interface. It does not require custom networking.

kube-proxy modes in depth

We covered the basics in 10.3. Going deeper:

iptables mode internals

For each Service, kube-proxy programs:

A rule in the shared KUBE-SERVICES chain: matches on (ClusterIP, port), jumps to the per-service chain.
A KUBE-SVC-xxxx chain: lists endpoints with probability-based selection.
A KUBE-SEP-xxxx chain per endpoint: DNATs to the Pod IP.

The probability trick:

-m statistic --mode random --probability 0.333  -> endpoint 1
-m statistic --mode random --probability 0.500  -> endpoint 2 (of remaining 2)
                                                -> endpoint 3 (fallthrough)

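For 3 endpoints, each gets 1/3 probability overall. For N endpoints, the rule probabilities are 1/N, 1/(N-1), ..., 1/2, 1. Evaluation is sequential, so it is O(N), but only the first packet of a connection walks the chains (conntrack handles the rest) and it is very fast in kernel.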
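It is worth checking that the per-rule probabilities really do produce a uniform split. A small sketch using exact fractions (the function names are illustrative):

```python
from fractions import Fraction


def rule_probabilities(n: int) -> list:
    """Per-rule match probabilities: 1/N, 1/(N-1), ..., 1/2, 1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_shares(n: int) -> list:
    """Overall share of new connections that each endpoint receives."""
    shares = []
    reaching = Fraction(1)  # probability that a packet reaches this rule at all
    for p in rule_probabilities(n):
        shares.append(reaching * p)
        reaching *= 1 - p
    return shares


print([str(p) for p in rule_probabilities(3)])
print([str(s) for s in effective_shares(3)])
```

Each rule only sees the traffic that earlier rules did not take, so a 1/2 match on the remaining 2/3 is again 1/3 overall.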

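The scaling issue: every Service or endpoint change triggers an iptables-restore, which takes a global lock and, in older kube-proxy versions, rewrites all rules (newer versions sync only the chains that changed). Under churn (many deploys per minute) on large clusters you can see minute-long delays in Service updates.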

IPVS mode

Uses Linux Virtual Server, a kernel-level L4 load balancer. Service VIPs map to IPVS virtual services. Endpoints are real servers. Hash-based lookup, O(1) per packet.

Real algorithms: rr (round-robin), wrr (weighted round-robin), lc (least connection), wlc (weighted least connection), sh (source hash), dh (destination hash).

Practical: enable IPVS when your cluster has >1000 Services or you are seeing measurable kube-proxy reload pressure.

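nftables mode (GA in 1.33)

Replacement for iptables mode: same conceptual model, but built on the modern nftables kernel subsystem. Faster rule updates, more efficient packet processing. It went alpha in 1.29, beta in 1.31, and GA in 1.33. Enable it when your kernel and Kubernetes version support it.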

NetworkPolicy: the firewall layer

NetworkPolicy is the Kubernetes-native way to control traffic between Pods. It looks like a CRD but is a built-in API (networking.k8s.io/v1), and it does nothing unless your CNI implements it.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: api-only-from-frontend, namespace: prod }
spec:
  podSelector: { matchLabels: { app: api } }
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector: { matchLabels: { app: frontend } }
      ports:
        - { protocol: TCP, port: 8080 }

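That policy says "Pods labeled app=api in prod accept ingress only from Pods labeled app=frontend in the same namespace, on TCP 8080." A bare podSelector under from never matches Pods in other namespaces; that needs a namespaceSelector.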

Implementations:

Calico: full support, fast.
Cilium: full support plus L7 (HTTP method, URL path, Kafka topic).
AWS VPC CNI: added NetworkPolicy support in 2023. Less mature than Calico/Cilium.
Flannel: no NetworkPolicy support (you would add Calico for policy alongside Flannel for networking).

Default behavior is allow-all. The discipline is to start with a default-deny NetworkPolicy in each namespace, then explicitly allow what is needed.
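A default-deny policy is short enough to generate per namespace. The sketch below builds the manifest as a plain dict and prints it as JSON, which kubectl apply accepts as readily as YAML. The policy name default-deny-all and the helper name are illustrative choices.

```python
import json


def default_deny(namespace: str) -> dict:
    """NetworkPolicy that selects every Pod in the namespace and allows nothing."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            "podSelector": {},                    # empty selector = all Pods
            "policyTypes": ["Ingress", "Egress"], # no rules listed, so both are denied
        },
    }


print(json.dumps(default_deny("prod"), indent=2))
```

Denying egress also blocks DNS lookups, so the first explicit allow you add is usually egress to the cluster DNS Service on port 53 (UDP and TCP).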

Cilium and eBPF: the replacement

eBPF is the biggest shift in Linux networking in a decade. Programs run at kernel hook points (socket, XDP, tc, kprobe), JIT-compiled to native machine code, verified for safety. Cilium uses eBPF to reimplement most of the Kubernetes networking stack.

Cilium replaces kube-proxy

Service VIP translation in Cilium happens at the socket layer via an eBPF program. When a Pod calls connect() to a Service VIP, the eBPF program rewrites the destination to a Pod IP before the packet is built. No iptables, no NAT in the data path. Faster, no rule-scaling problem.

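To enable: install Cilium with kubeProxyReplacement: true (older releases used the value strict), point it at the API server directly with k8sServiceHost and k8sServicePort, and delete the kube-proxy DaemonSet.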

NetworkPolicy with L7 awareness

Standard NetworkPolicy is L3/L4 (Pod selectors, ports). Cilium's CiliumNetworkPolicy adds L7:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata: { name: api-get-users-from-frontend, namespace: prod }  # example name and namespace
spec:
  endpointSelector: { matchLabels: { app: api } }
  ingress:
    - fromEndpoints: [{ matchLabels: { app: frontend } }]
      toPorts:
        - ports: [{ port: "8080", protocol: TCP }]
          rules:
            http:
              - method: GET
                path: "/users/.*"

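The eBPF datapath redirects matching traffic to Cilium's node-local Envoy proxy, which parses HTTP and rejects requests that violate the policy (an HTTP 403 rather than a silent drop). L3/L4 enforcement stays in eBPF; L7 goes through that userspace proxy, but there is no per-Pod sidecar.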

Hubble: observability

Hubble is Cilium's observability layer. Because the eBPF datapath sees every packet, Hubble can show real-time Pod-to-Pod flows: who called whom and whether the connection was allowed or dropped, plus HTTP paths and status codes where L7 visibility is enabled. The kind of visibility you previously needed a service mesh for, at a fraction of the overhead.

When Cilium is overkill

For small clusters with no NetworkPolicy needs and no scaling pain, the default CNI plus kube-proxy is fine. Cilium pays off when:

You hit kube-proxy iptables scaling limits.
You want L7 NetworkPolicy.
You want deep observability without a service mesh.
You are running large clusters where eBPF performance wins matter.

CoreDNS and the DNS layer

CoreDNS is the cluster DNS server (default since 1.13). Runs as a Deployment, fronted by a Service at a fixed ClusterIP (often 10.96.0.10). Every Pod's /etc/resolv.conf points to it.

Performance tips:

Run multiple replicas, autoscale on QPS.
Set ndots: 2 in your Pod's dnsConfig if you do a lot of external lookups. With the default of 5, most external names walk the whole search list before being tried as-is.
Use NodeLocal DNSCache: a DaemonSet that runs a CoreDNS instance on every node, intercepts DNS via iptables, caches locally. Cuts DNS latency dramatically for high-QPS apps.
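To see why ndots matters, here is a sketch of the order in which a glibc-style resolver tries names. It is a simplified model (other resolvers differ in details), and the search list shown is the typical one for a Pod in a namespace called prod, which is an assumption for the example.

```python
from typing import List


def lookup_order(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Names tried, in order, until one resolves (simplified resolver model)."""
    if name.endswith("."):
        return [name]  # already fully qualified: the search list is skipped
    via_search = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + via_search  # enough dots: try the name as-is first
    return via_search + [absolute]      # too few dots: walk the search list first


SEARCH = ["prod.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for ndots in (5, 2):
    print(f"ndots={ndots}:")
    for candidate in lookup_order("api.example.com", SEARCH, ndots):
        print("  ", candidate)
```

With ndots at 5, an external name with two dots costs three failed lookups before the real one. With ndots at 2 the real name is tried first, and a trailing dot in the application's config skips the search list entirely.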

NodeLocal DNSCache was one of the cheaper wins we got at Binocs. Some of our services hit DNS at ~500 QPS per pod. Going from cluster CoreDNS to NodeLocal cache cut p99 lookup time from ~5ms to <1ms.

What I ran at Binocs

EKS with AWS VPC CNI (default) plus kube-proxy in iptables mode. About 80 Services, well below where IPVS would matter. We enabled prefix delegation to handle Pod density. NodeLocal DNSCache as a DaemonSet on every node.

NetworkPolicy was minimal - we relied on VPC security groups at the node level and IAM for cross-service authz. In hindsight a default-deny NetworkPolicy per namespace would have been a good defense in depth. If I were doing it again on a larger cluster I would start with Cilium for kube-proxy replacement and L7 policy from day one.

The interview narrative

Open with the network model: flat Pod IPs, no NAT, every node can reach every Pod. CNI is the plugin that makes this real (AWS VPC CNI for EKS, Cilium increasingly the modern choice). kube-proxy programs the kernel (iptables, IPVS, nftables) to translate Service VIPs. NetworkPolicy is the firewall layer, only enforced if your CNI supports it. eBPF (via Cilium) is the modern replacement that does Service translation in the socket layer, supports L7 NetworkPolicy, and gives you observability without a service mesh. Close with operational gotchas: ENI density on AWS VPC CNI, kube-proxy reload pressure, NodeLocal DNSCache as a cheap win.

Learn more

Docs: Cluster Networking (kubernetes.io)
Paper: CNI specification (github.com)
Docs: AWS VPC CNI (github.com)
Docs: Cilium Documentation (docs.cilium.io)
Talk: Liz Rice - eBPF Documentary and talks (youtube.com)
Docs: Network Policy (kubernetes.io)
