Operators and CRDs - deep dive
How CRDs hook into the API server, controller-runtime reconcile loops, status subresources, finalizers, conversion webhooks, and when an operator is the right tool.
The operator pattern is one of the cleanest ideas in modern infrastructure. You take operational expertise that previously lived in runbooks and on-call brains, and encode it as a controller that runs the same reconcile loop the built-in Kubernetes controllers use. The built-in resources you work with every day - Deployments, ReplicaSets, Services - are already managed by controllers of exactly this kind; an operator applies the same pattern to your own resource types. The pattern is the foundation.
The CRD: extending the API
A CustomResourceDefinition registers a new resource kind with the API server. Once defined, the API server treats the new type like a first-class resource: stored in etcd, validated against an OpenAPI schema, served via watch and list, subject to RBAC.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.platform.example.com
spec:
  group: platform.example.com
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    kind: Backup
    shortNames: [bk]
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: [source, destination]
              properties:
                source: { type: string }
                destination: { type: string }
                schedule: { type: string, pattern: "^[0-9*/, -]+$" }
            status:
              type: object
              properties:
                phase: { type: string, enum: [Pending, Running, Succeeded, Failed] }
                lastRunTime: { type: string, format: date-time }
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Phase
          type: string
          jsonPath: .status.phase
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp

A few things worth calling out:
- OpenAPI schema validation: the API server rejects malformed objects before they hit storage. Catch typos at apply time.
- Status subresource: when enabled, status can only be updated via a separate /status endpoint. Prevents users from accidentally writing status, and controllers from accidentally writing spec.
- Printer columns: kubectl get backups shows your columns. UX matters.
- Conversion webhooks: if you have multiple versions (v1, v1beta1) the API server can call your webhook to convert between them. Critical for graceful schema evolution.
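To get a feel for what the schedule pattern in the schema above lets through, here is a small Go sketch that applies the same regular expression locally. It only mimics the pattern check; the real validation happens inside the API server, and the helper name validSchedule is an assumption made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// schedulePattern mirrors the `pattern` set on spec.schedule in the CRD above.
var schedulePattern = regexp.MustCompile(`^[0-9*/, -]+$`)

// validSchedule is a hypothetical helper: it reports whether s would pass
// the pattern check. It does not verify that s is a meaningful cron expression.
func validSchedule(s string) bool {
	return schedulePattern.MatchString(s)
}

func main() {
	candidates := []string{"*/5 * * * *", "0 2 * * 1-5", "@daily", "0 2 * * MON"}
	for _, s := range candidates {
		fmt.Printf("%q passes pattern: %v\n", s, validSchedule(s))
	}
}
```

Note that the pattern is deliberately loose: it rejects shortcuts such as @daily and named weekdays, but it would also accept strings that are not valid cron expressions, so the controller still has to parse the schedule itself.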
The controller: the loop that does the work
A CRD is inert without a controller. The controller is a regular program (typically Go using controller-runtime) that:
- Watches the CRD via the API server.
- For each event, reconciles: read current state, compute desired state, take actions to converge, update status, requeue if needed.
The work queue is the heart of it. Events deduplicate by object key - if a Backup is modified 10 times in a second, the queue holds a single entry for it, and the reconciler runs against the latest state rather than once per event. The reconciler must be idempotent: running it twice on the same state should produce the same result.
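The deduplication described above can be sketched with a few lines of standard-library Go. This is a single-goroutine toy, not the client-go workqueue: the type keyQueue and its methods are hypothetical names, and the real queue also tracks in-flight items and rate limits.

```go
package main

import "fmt"

// keyQueue is a minimal sketch of a deduplicating work queue.
// It is NOT the client-go workqueue; all names here are hypothetical.
type keyQueue struct {
	order   []string        // keys waiting to be processed, in FIFO order
	pending map[string]bool // set of keys currently present in order
}

func newKeyQueue() *keyQueue {
	return &keyQueue{pending: make(map[string]bool)}
}

// Add enqueues key unless it is already waiting to be processed.
func (q *keyQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get pops the next key; ok is false when the queue is empty.
func (q *keyQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len reports how many distinct keys are waiting.
func (q *keyQueue) Len() int { return len(q.order) }

func main() {
	q := newKeyQueue()
	for i := 0; i < 10; i++ {
		q.Add("default/nightly") // ten rapid updates to the same Backup
	}
	q.Add("default/weekly")
	fmt.Println("distinct keys queued:", q.Len())
	for key, ok := q.Get(); ok; key, ok = q.Get() {
		fmt.Println("reconcile", key)
	}
}
```

Because the queue carries only the key, the reconciler has to re-read the object when it runs, which is why it always sees the latest state rather than a stale event payload.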
controller-runtime: the standard library
Almost every modern operator uses controller-runtime (the library that powers Kubebuilder and Operator SDK). It gives you:
- Caching client (uses watch for reads, no API server hammering).
- Work queue with rate limiting.
- Leader election (only one replica reconciles at a time).
- Owner references and garbage collection.
- Webhook scaffolding for admission and conversion.
A typical reconcile function:
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var backup platformv1.Backup
	if err := r.Get(ctx, req.NamespacedName, &backup); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Compute desired Job
	desired := buildJobFor(&backup)
	if err := ctrl.SetControllerReference(&backup, desired, r.Scheme); err != nil {
		return ctrl.Result{}, err
	}

	// Apply server-side
	if err := r.Patch(ctx, desired, client.Apply, client.ForceOwnership, client.FieldOwner("backup-operator")); err != nil {
		return ctrl.Result{}, err
	}

	// Update status
	backup.Status.Phase = "Running"
	if err := r.Status().Update(ctx, &backup); err != nil {
		return ctrl.Result{}, err
	}

	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

Pattern points:
- client.IgnoreNotFound: the object may be deleted between event and reconcile. Treat NotFound as "nothing to do."
- SetControllerReference: ties the Job to the Backup so garbage collection cleans up children.
- Server-side apply (client.Apply): the controller declares ownership of fields. Multiple controllers can share an object without stomping each other.
- Status().Update: hits the /status subresource, doesn't touch spec.
- RequeueAfter: poll on a schedule even without events.
Finalizers: handling deletion
By default, deleting a custom resource just removes it from etcd. If your operator needs to clean up external state (delete an S3 bucket, deregister a domain), you need a finalizer.
The protocol:
- On create/update, add metadata.finalizers: [my-op/cleanup] to the object.
- When the user deletes the object, the API server sets deletionTimestamp but does NOT remove it until all finalizers are cleared.
- Your reconciler sees the deletionTimestamp, runs cleanup, then removes its finalizer.
- Once all finalizers are gone, the API server actually deletes.
if !backup.DeletionTimestamp.IsZero() {
	if controllerutil.ContainsFinalizer(&backup, finalizerName) {
		if err := r.cleanupExternalState(ctx, &backup); err != nil {
			return ctrl.Result{}, err
		}
		controllerutil.RemoveFinalizer(&backup, finalizerName)
		return ctrl.Result{}, r.Update(ctx, &backup)
	}
	return ctrl.Result{}, nil
}

Forgetting to remove a finalizer is the classic operator bug. The object becomes undeletable. Users have to manually kubectl patch ... -p '{"metadata":{"finalizers":null}}' --type=merge to escape.
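The snippet above covers only the deletion branch. To see both halves of the protocol in one place (add the finalizer while the object is live, clean up and remove it once deletion is requested), here is a standard-library model. The object type, markDeleted, and the reconcile signature are all hypothetical stand-ins; a real controller would use the controllerutil helpers and persist each change through the API server.

```go
package main

import (
	"fmt"
	"time"
)

const finalizerName = "my-op/cleanup"

// object is a hypothetical stand-in for the metadata fields that matter here.
type object struct {
	Finalizers        []string
	DeletionTimestamp *time.Time
}

// markDeleted simulates what the API server does on a delete request.
func markDeleted(o *object) {
	now := time.Now()
	o.DeletionTimestamp = &now
}

func hasFinalizer(o *object, name string) bool {
	for _, f := range o.Finalizers {
		if f == name {
			return true
		}
	}
	return false
}

func removeFinalizer(o *object, name string) {
	kept := o.Finalizers[:0]
	for _, f := range o.Finalizers {
		if f != name {
			kept = append(kept, f)
		}
	}
	o.Finalizers = kept
}

// reconcile returns true once no finalizers remain on a deleted object,
// i.e. when storage may drop it. cleanup stands in for deleting external state.
func reconcile(o *object, cleanup func() error) (bool, error) {
	if o.DeletionTimestamp == nil {
		// Live object: make sure our finalizer is present (idempotent).
		if !hasFinalizer(o, finalizerName) {
			o.Finalizers = append(o.Finalizers, finalizerName)
		}
		return false, nil
	}
	if hasFinalizer(o, finalizerName) {
		if err := cleanup(); err != nil {
			return false, err // keep the finalizer so cleanup is retried
		}
		removeFinalizer(o, finalizerName)
	}
	return len(o.Finalizers) == 0, nil
}

func main() {
	cleanup := func() error { fmt.Println("cleaning up external state"); return nil }
	o := &object{}
	reconcile(o, cleanup)
	fmt.Println("finalizers after create:", o.Finalizers)
	markDeleted(o)
	done, err := reconcile(o, cleanup)
	fmt.Println("safe to remove:", done, "error:", err)
}
```

The important property is the error path: when cleanup fails, the finalizer stays in place, so the object remains visible and the cleanup is retried on the next reconcile.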
Status and conditions
Status should reflect observed state, not desired state. The convention is the Condition pattern:
status:
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2026-06-20T10:00:00Z"
      reason: AllResourcesProvisioned
      message: "Backup completed successfully"
    - type: Synced
      status: "True"
      ...

This matches how Pods, Deployments, and Nodes report status. Tools like ArgoCD, Crossplane, and ad-hoc kubectl wait --for=condition=Ready all rely on this convention.
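One detail of the Condition convention is easy to get wrong: lastTransitionTime should move only when status actually flips, not on every reconcile. The sketch below models that rule with the standard library. The Condition struct is a trimmed-down, hypothetical version of metav1.Condition, and setCondition is an illustrative helper; apimachinery provides meta.SetStatusCondition for real controllers.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a trimmed-down, hypothetical version of metav1.Condition.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	LastTransitionTime time.Time
	Reason             string
	Message            string
}

// t0 is an arbitrary fixed instant used by the example.
var t0 = time.Date(2026, 6, 20, 10, 0, 0, 0, time.UTC)

// setCondition inserts or updates the condition with the same Type.
// LastTransitionTime changes only when Status changes.
func setCondition(conds []Condition, c Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status != c.Status {
			conds[i].Status = c.Status
			conds[i].LastTransitionTime = now
		}
		conds[i].Reason = c.Reason
		conds[i].Message = c.Message
		return conds
	}
	c.LastTransitionTime = now
	return append(conds, c)
}

func main() {
	var conds []Condition
	conds = setCondition(conds, Condition{Type: "Ready", Status: "False", Reason: "Provisioning"}, t0)
	conds = setCondition(conds, Condition{Type: "Ready", Status: "True", Reason: "AllResourcesProvisioned"}, t0.Add(time.Minute))
	// Same status an hour later: the transition time must not move.
	conds = setCondition(conds, Condition{Type: "Ready", Status: "True", Reason: "AllResourcesProvisioned"}, t0.Add(time.Hour))
	fmt.Println(conds[0].Type, conds[0].Status, conds[0].LastTransitionTime.Format(time.RFC3339))
}
```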
Versioning and conversion webhooks
You will need to evolve your CRD schema. The pattern:
- Bump version: v1 becomes v1 and v2. Both served: true. One is storage: true.
- Conversion webhook: a webhook your controller serves that converts between versions on read/write.
- Deprecate the old version after a few releases.
This is what makes operators production-grade across long timelines. Without conversion webhooks you have to do flag-day migrations.
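At its core, a conversion webhook wraps a pair of functions that map one version's fields onto the other's. The sketch below shows that pair for a made-up change: the v2 schema here is purely hypothetical and simply renames Destination to Target. It leaves out the webhook plumbing (the ConversionReview request and response) and checks the property that matters most, a lossless round trip.

```go
package main

import "fmt"

// BackupSpecV1 mirrors the v1 spec fields shown earlier in this article.
type BackupSpecV1 struct {
	Source      string
	Destination string
	Schedule    string
}

// BackupSpecV2 is hypothetical: it renames Destination to Target.
type BackupSpecV2 struct {
	Source   string
	Target   string
	Schedule string
}

// v1ToV2 converts an old-style spec to the new shape.
func v1ToV2(in BackupSpecV1) BackupSpecV2 {
	return BackupSpecV2{Source: in.Source, Target: in.Destination, Schedule: in.Schedule}
}

// v2ToV1 converts back so clients still using v1 keep working.
func v2ToV1(in BackupSpecV2) BackupSpecV1 {
	return BackupSpecV1{Source: in.Source, Destination: in.Target, Schedule: in.Schedule}
}

func main() {
	old := BackupSpecV1{Source: "pvc/data", Destination: "s3://bucket/backups", Schedule: "0 2 * * *"}
	roundTripped := v2ToV1(v1ToV2(old))
	fmt.Println("lossless round trip:", roundTripped == old)
}
```

A rename is the easy case. When a new version adds information the old one cannot express, the conversion has to preserve it somewhere (annotations are a common choice) or the round trip silently drops data.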
Admission webhooks
Two kinds, separate from controllers:
- Mutating webhook: rewrite incoming objects (inject defaults, sidecars). Runs before schema validation.
- Validating webhook: reject objects that violate policy. Runs after mutation and schema validation.
Both are HTTPS services your operator registers. The API server calls them on every matching admission request. Be careful: a slow or broken webhook configured with failurePolicy: Fail blocks every request it matches, so a webhook that matches Pods can stop all Pod creation in the cluster.
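The ordering matters because validation sees the object after defaults have been filled in. Here is a minimal sketch of that mutate-then-validate sequence as plain functions, without any webhook server. The default schedule and the s3:// rule are invented for illustration, and the names defaultSpec, validateSpec, and admit are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// BackupSpec mirrors the spec fields from the CRD shown earlier.
type BackupSpec struct {
	Source      string
	Destination string
	Schedule    string
}

// defaultSpec plays the mutating role. The default schedule is hypothetical.
func defaultSpec(s BackupSpec) BackupSpec {
	if s.Schedule == "" {
		s.Schedule = "0 2 * * *"
	}
	return s
}

// validateSpec plays the validating role. The s3:// rule is a made-up policy.
func validateSpec(s BackupSpec) error {
	if !strings.HasPrefix(s.Destination, "s3://") {
		return errors.New("destination must be an s3:// URL")
	}
	if s.Schedule == "" {
		return errors.New("schedule must be set")
	}
	return nil
}

// admit runs mutation first and validation second, like the admission chain.
func admit(s BackupSpec) (BackupSpec, error) {
	s = defaultSpec(s)
	if err := validateSpec(s); err != nil {
		return BackupSpec{}, err
	}
	return s, nil
}

func main() {
	out, err := admit(BackupSpec{Source: "pvc/data", Destination: "s3://bucket/backups"})
	fmt.Println("admitted with schedule:", out.Schedule, "error:", err)

	_, err = admit(BackupSpec{Source: "pvc/data", Destination: "/tmp/backups"})
	fmt.Println("rejected:", err)
}
```

An object with no schedule passes here only because defaulting ran first; calling validateSpec on the raw input would reject it.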
When operators are right
- Stateful systems with operational complexity: Postgres Operator (CrunchyData / Zalando), Strimzi (Kafka), Elastic Cloud on Kubernetes, Vault Operator, Redis Enterprise Operator. Failover, backups, restores, version upgrades all become declarative.
- Platform abstractions: your platform team exposes kind: Microservice that auto-creates Deployment, HPA, Service, Ingress, ServiceMonitor, NetworkPolicy with sane defaults. Developers say "I want a microservice" and get the whole stack.
- External resource management: Crossplane exposes cloud resources (RDS, S3) as CRDs. AWS Controllers for Kubernetes (ACK) does the same.
- GitOps controllers: ArgoCD's Application CR, Flux's Kustomization CR.
When operators are wrong
- YAML templating: Helm or Kustomize. Operators are a heavy lift for "render this template with these values."
- One-time setup: a Job, an init container, or a Terraform module.
- Stateless apps with simple lifecycle: Deployment. Do not invent your own.
- Custom RBAC on the same object types: ClusterRoles and admission webhooks suffice.
The rule of thumb: if the operator's reconcile loop is just "render YAML and apply," skip it. If the loop encodes domain knowledge (Postgres failover, Kafka partition rebalancing, certificate rotation), it is the right tool.
Operators in production: what to watch
- Single point of failure: most operators run as a Deployment with leader election. If it crashes, your reconciliation stops. Critical operators (cert-manager, ArgoCD) should have alerting on the operator's health.
- API server load: a poorly written operator with no rate limit can DoS the API server. Always rate-limit your work queue and use the cached client for reads.
- Upgrades: operator upgrades can change CRD schema. Test in staging. Have a rollback plan.
- RBAC: the operator's ServiceAccount tends to end up with broad permissions on the resources it manages. Scope them as tightly as you can.
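On the API server load point, the usual mechanism for rate-limiting a work queue is per-key exponential backoff: each consecutive failure for the same object doubles the wait, up to a cap. The sketch below is a standard-library illustration of that idea. The backoff type and its methods are hypothetical; client-go's workqueue package provides production rate limiters that controller-runtime uses by default.

```go
package main

import (
	"fmt"
	"time"
)

// backoff is a hypothetical per-key exponential failure rate limiter.
type backoff struct {
	base     time.Duration
	max      time.Duration
	failures map[string]int // consecutive failures recorded per key
}

func newBackoff(base, max time.Duration) *backoff {
	return &backoff{base: base, max: max, failures: make(map[string]int)}
}

// When records one more failure for key and returns how long to wait
// before retrying it: base * 2^(previous failures), capped at max.
func (b *backoff) When(key string) time.Duration {
	n := b.failures[key]
	b.failures[key] = n + 1
	d := b.base
	for i := 0; i < n && d < b.max; i++ {
		d *= 2
	}
	if d > b.max {
		d = b.max
	}
	return d
}

// Forget resets the failure count after a successful reconcile.
func (b *backoff) Forget(key string) {
	delete(b.failures, key)
}

func main() {
	b := newBackoff(time.Millisecond, 8*time.Millisecond)
	for i := 0; i < 5; i++ {
		fmt.Println("retry default/nightly after", b.When("default/nightly"))
	}
	b.Forget("default/nightly")
	fmt.Println("after success, next failure waits", b.When("default/nightly"))
}
```

Tracking failures per key means one permanently broken Backup backs off on its own without slowing reconciliation of healthy objects.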
What I have used in production
At Binocs we leaned heavily on operators:
- cert-manager: TLS automation. The operator that proves the pattern's value.
- External Secrets Operator: AWS Secrets Manager sync.
- AWS Load Balancer Controller: provisions ALBs from Ingress.
- Karpenter: node provisioning.
- ArgoCD: GitOps deployments.
- Prometheus Operator / kube-prometheus-stack: monitoring.
We did not write our own operator. The bar to justify it is: "no existing operator solves this, and the operational logic is complex enough that a Helm chart plus runbook is worse." At our scale, that bar was never met.
The interview narrative
Open with the layered pattern: CRDs extend the API, operators are controllers that reconcile CRDs, both leveraging the same machinery as built-in resources. Hit reconcile loop semantics (idempotent, work-queue-driven, status updates via subresource), finalizers for deletion cleanup, conversion webhooks for schema evolution, and the controller-runtime library as the standard. Close with the opinion: most teams should consume operators (cert-manager, ESO, ArgoCD), not write them. Justify writing one only when the operational logic is complex enough that a Helm chart plus runbook is clearly worse.