Volumes, PVCs, and StorageClasses - deep dive
CSI architecture, dynamic provisioning, volume expansion, snapshots, topology-aware scheduling, and when to actually use stateful workloads in Kubernetes.
Storage is where Kubernetes goes from "magical" to "you better know what is under the hood." Get the abstraction wrong and you lose data on a Pod restart. Get it right and you have portable, snapshottable, dynamically-provisioned storage that survives node failure.
The four-layer model
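The four layers are the PVC, the PV, the StorageClass, and the CSI driver. The separation of PVC from PV is the same separation as a Deployment from a Pod. The user-facing object (PVC, Deployment) declares intent. The cluster-managed object (PV, Pod) is the realized resource. StorageClass is the template that lets the cluster create PVs on demand, and the CSI driver is what actually talks to the storage backend.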
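To make the intent side concrete, here is a minimal sketch of a PVC and a Pod that mounts it. The names (app-data, app), the mount path, and the image are made up for illustration, and it assumes a StorageClass named gp3 already exists in the cluster.

```yaml
# Hypothetical names throughout; assumes a StorageClass called "gp3" exists.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: nginx:1.27
      volumeMounts:
        - name: data              # must match the volume name below
          mountPath: /var/lib/app
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: app-data       # the Pod references the PVC, never the PV
```

The Pod only ever names the claim. Which PV ends up behind it is the cluster's business.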
CSI: the plugin model
Before CSI, storage drivers were compiled into Kubernetes itself. Adding support for a new storage backend meant a Kubernetes release. CSI (Container Storage Interface) made storage drivers external. Now each driver is a Pod (or DaemonSet + Deployment) that implements a gRPC interface:
- Controller plugin: runs as a Deployment. Handles CreateVolume, DeleteVolume, ControllerPublishVolume (attach to node).
- Node plugin: runs as a DaemonSet. Handles NodeStageVolume, NodePublishVolume (mount into Pod).
The AWS EBS CSI driver, for example, runs the controller pod that calls EC2 APIs to create volumes, and a DaemonSet that runs on every worker to format and mount them. On EKS this is an addon you enable; the driver assumes IAM permissions via IRSA.
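As a rough sketch of the IRSA wiring, the controller's ServiceAccount carries an annotation pointing at an IAM role. The account ID and role name below are placeholders, and the ServiceAccount name is the one the driver commonly uses; check what your install actually creates.

```yaml
# Placeholder account ID and role name; verify the ServiceAccount name in your cluster.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ebs-csi-controller-sa
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/ExampleEbsCsiDriverRole
```

The role itself needs a trust policy for the cluster's OIDC provider and permissions to manage EBS volumes. That part lives in IAM, not in Kubernetes.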
Dynamic vs static provisioning
Static: cluster admin pre-creates PV objects pointing at existing disks. Users create PVCs that bind to those PVs. Old-school, used when storage is provisioned out-of-band (legacy NFS, FC SAN).
Dynamic: user creates a PVC referencing a StorageClass. The provisioner watches PVCs, calls the CSI driver to create the underlying disk, creates the PV, binds it. This is the default in cloud. Almost no one uses static anymore.
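For contrast, a static setup might look like the sketch below: an admin-created PV pointing at an existing NFS export, plus a PVC pinned to it. The server address, export path, and object names are hypothetical.

```yaml
# Static provisioning sketch; server, path, and names are made up.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: legacy-nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""            # empty string: no dynamic provisioning
  nfs:
    server: nfs.example.internal
    path: /exports/data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: legacy-nfs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""            # must match the PV, and skips the default class
  volumeName: legacy-nfs-pv       # bind to this specific PV
  resources:
    requests:
      storage: 100Gi
```

In the dynamic case you would write only the PVC, with a real storageClassName and no volumeName.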
volumeBindingMode: the foot-gun fix
The default volumeBindingMode: Immediate means the PV is provisioned as soon as you create the PVC, before any Pod exists. The problem is that cloud disks are zone-scoped, and the scheduler must then place the Pod on a node in the PV's zone. If the PV was provisioned in us-east-1a but the only node with capacity is in 1b, your Pod stays Pending until capacity shows up in 1a.
volumeBindingMode: WaitForFirstConsumer flips the order: do not provision the PV until a Pod that uses the PVC is scheduled. The scheduler picks a node based on the Pod's other constraints (CPU, memory, affinity), then the CSI driver provisions the PV in that node's zone. This is what you want for all zone-scoped storage. Any gp2 or gp3 StorageClass you use on EKS should set this.
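A minimal sketch of such a StorageClass for the EBS CSI driver follows. The class name gp3-wffc is my own invention.

```yaml
# Hypothetical class name; provisioner and parameters are for the AWS EBS CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wffc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer   # delay provisioning until a Pod is scheduled
```

With this class, a freshly created PVC sits in Pending until a Pod references it. That is expected, not a failure.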
Reclaim policy: the data-loss prevention knob
When a PVC is deleted, the reclaim policy decides what happens to the PV:
- Delete: CSI driver deletes the underlying disk. Default for dynamic provisioning. Useful for ephemeral workloads. Catastrophic for production state.
- Retain: PV stays, disk stays, but the PV is in Released state and cannot be bound to a new PVC without admin cleanup.
For any PVC backing production state, set reclaimPolicy: Retain on the StorageClass. The CI/CD pipeline that accidentally runs kubectl delete -f all.yaml will not eat your database.
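One way to set this up, sketched with a hypothetical class name gp3-retain, is a dedicated StorageClass for anything you cannot afford to lose:

```yaml
# Hypothetical class name; PVs created through this class keep their disk when the PVC is deleted.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-retain
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```

The policy is copied onto each PV at creation time, so this only affects new volumes. For a PV that already exists, the field to change is spec.persistentVolumeReclaimPolicy on the PV itself.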
Volume expansion
CSI supports online expansion. Bump the PVC's spec.resources.requests.storage; the controller resizes the EBS volume and the kubelet expands the filesystem. No restart. Shrinking is not supported - that needs migration to a new volume.
```yaml
# StorageClass
allowVolumeExpansion: true
```

Without that flag, expansion attempts fail. Make sure the gp3 StorageClass you use on EKS has it enabled.
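On the PVC side, expansion is just a bigger number. A sketch, assuming a hypothetical PVC named app-data that was originally requested at a smaller size:

```yaml
# Hypothetical PVC; only the storage request changes compared with the original manifest.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 50Gi   # raised from the previous, smaller value
```

Re-apply the manifest and watch the PVC's status conditions and capacity to see the resize progress.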
Volume snapshots
VolumeSnapshot and VolumeSnapshotClass are the API for point-in-time backups. The CSI driver maps these to backend snapshot APIs (EBS snapshots, etc.). You can restore by creating a new PVC with dataSource: VolumeSnapshot.
This is your backup story for stateful workloads in Kubernetes. Velero builds on this for cluster-wide snapshot + restore including the Kubernetes objects themselves.
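A sketch of the full snapshot-and-restore round trip follows. It assumes the snapshot CRDs and snapshot controller are installed (they are not part of core Kubernetes), and all object names are hypothetical.

```yaml
# Assumes the VolumeSnapshot CRDs and controller are installed; names are made up.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: ebs-snapshots
driver: ebs.csi.aws.com
deletionPolicy: Retain            # keep the backend snapshot if the object is deleted
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: app-data-snap
spec:
  volumeSnapshotClassName: ebs-snapshots
  source:
    persistentVolumeClaimName: app-data   # existing PVC to snapshot
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data-restored
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: app-data-snap
  resources:
    requests:
      storage: 10Gi               # must be at least the size of the snapshot source
```

The restored PVC is a brand-new volume. The original PVC is untouched.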
Access modes and what backends support
| Mode | EBS | EFS | NFS | RBD |
|---|---|---|---|---|
| ReadWriteOnce | yes | yes | yes | yes |
| ReadWriteMany | no | yes | yes | no |
| ReadOnlyMany | no | yes | yes | yes |
| ReadWriteOncePod | yes | yes | yes | yes |
If you need RWX, you are picking from EFS, FSx, NFS, or a software-defined storage layer (Rook/Ceph, Longhorn). EBS is RWO only because block storage cannot safely have two filesystem mounts writing simultaneously.
ReadWriteOncePod (beta in 1.27, GA in 1.29, CSI volumes only) is stricter than ReadWriteOnce. RWO allows multiple Pods on the same node to share the mount. RWOP guarantees one Pod, period, which is useful for database leaders where you must prevent split-brain even within a node.
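Requesting it is a one-line change on the PVC. A sketch with a hypothetical claim name, assuming the gp3 class is backed by a CSI driver:

```yaml
# Hypothetical PVC; ReadWriteOncePod only works with CSI-backed volumes.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-leader-data
spec:
  accessModes:
    - ReadWriteOncePod   # at most one Pod cluster-wide may use this claim
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi
```

A second Pod that references the same claim will not be scheduled while the first one holds it.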
Topology awareness
Multi-zone clusters need topology-aware storage. The CSI driver records the volume's zone as node affinity on each PV (topology.ebs.csi.aws.com/zone: us-east-1a). The scheduler treats that as a node-affinity constraint on any Pod using the volume. Combined with WaitForFirstConsumer, this guarantees Pod and PV land in the same zone.
For multi-AZ HA of stateful workloads, this means you cannot just move the Pod across AZs - the PV is pinned. StatefulSets with one PVC per Pod across zones is the pattern: pod-0 in 1a, pod-1 in 1b, pod-2 in 1c, each with its own zone-local PV.
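The pinning is visible on the PV itself. The excerpt below is an illustration of what a dynamically provisioned EBS PV can look like; the PV name and volume ID are made up, and the exact topology key may differ by driver version.

```yaml
# Illustrative PV excerpt; name and volumeHandle are invented.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  csi:
    driver: ebs.csi.aws.com
    volumeHandle: vol-0123456789abcdef0
  nodeAffinity:                   # Pods using this PV can only run on matching nodes
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.ebs.csi.aws.com/zone
              operator: In
              values:
                - us-east-1a
```

If a Pod is stuck Pending with a volume node affinity conflict, this field is the first place to look.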
StatefulSet + PVC template
StatefulSet's volumeClaimTemplates is the magic that gives each Pod its own PVC:
```yaml
volumeClaimTemplates:
  - metadata: { name: data }
    spec:
      accessModes: [ReadWriteOnce]
      storageClassName: gp3
      resources: { requests: { storage: 20Gi } }
```

Pods web-0, web-1, web-2 each get data-web-0, data-web-1, data-web-2. When web-1 is rescheduled, it gets the same data-web-1 PVC re-attached. Stable identity, stable storage.
The trap: deleting the StatefulSet does NOT delete the PVCs by default. This is on purpose so you do not lose data. To fully clean up: kubectl delete pvc -l app=mystatefulset.
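If you do want the PVCs to go away with the StatefulSet, recent Kubernetes versions offer persistentVolumeClaimRetentionPolicy (as far as I know, enabled by default since 1.27). A sketch, with hypothetical names and image:

```yaml
# Sketch only; names, image, and sizes are made up. Requires a Kubernetes version
# where persistentVolumeClaimRetentionPolicy is available.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  selector:
    matchLabels:
      app: web
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete           # remove PVCs when the StatefulSet is deleted
    whenScaled: Retain            # keep PVCs when scaling down
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.27
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: gp3
        resources:
          requests:
            storage: 20Gi
```

Both fields default to Retain, which is the behavior described above. Only opt into Delete for data you can recreate.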
When NOT to run stateful workloads in Kubernetes
Hot take: for most teams, do not. Use managed services for databases (RDS, Aurora, Cloud SQL), message brokers (MSK, Confluent Cloud), and caches (ElastiCache, Memorystore). The operational overhead of running Postgres on Kubernetes is real: backups, failover, version upgrades, certificate rotation, monitoring.
Run stateful in Kubernetes when:
- You need portability across clouds.
- The managed offering does not exist or is too restrictive.
- You have a real Kubernetes operations team that wants this.
At Binocs we used RDS for Postgres and ElastiCache for Redis. Zero PVCs in production. The few PVCs we had were for staging/dev databases and a couple of small caches that ran in cluster. This eliminated an entire class of operational complexity from EKS.
What I learned tuning storage at Binocs
The right-sizing exercise included a storage review that contributed to the $1.8k-$2k/month savings:
- Audited every PVC. Found three orphaned PVCs from old experiments still costing $50/month each.
- Migrated remaining in-cluster volumes from gp2 to gp3 - same performance baseline, cheaper, and you can independently tune IOPS and throughput.
- Right-sized EFS provisioned throughput where we used it. Default bursting was over-provisioned.
- Set reclaimPolicy: Retain for any PVC tied to data we cared about. Cleaning up Retained PVs is far less work than restoring deleted data.
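For the gp3 migration, the independent IOPS and throughput tuning happens in the StorageClass parameters. A sketch with a hypothetical class name and example values, not the numbers we used:

```yaml
# Hypothetical class; iops and throughput values are examples only.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-tuned
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "4000"          # parameters are strings
  throughput: "250"     # MiB/s
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```

Leave iops and throughput out entirely to get the gp3 baseline, and only pay for more where a workload actually needs it.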
The other lesson: snapshots are not backups by themselves. EBS snapshots in the same region in the same account are not disaster recovery. We replicated critical snapshots cross-region.
The interview narrative
Open with the PVC-PV-StorageClass-CSI four-layer model. PVC is the request, PV is the disk, StorageClass automates provisioning via CSI. Hit volumeBindingMode (WaitForFirstConsumer) and reclaimPolicy (Retain for stateful prod) as the two production-critical knobs. Mention volume expansion, snapshots, and topology awareness for multi-AZ. Close with the opinion that most teams should run stateful workloads in managed services, not in Kubernetes.