
Terraform and Helm - Deep Dive

State management, module boundaries, drift detection, Helm hooks, chart testing, GitOps with ArgoCD, and the multi-environment patterns that scale.

Infrastructure as code is mostly about three things: state, drift, and blast radius. Terraform handles AWS, Helm handles Kubernetes, ArgoCD handles continuous sync. This is the production playbook.

Terraform state, in depth

State is a JSON file mapping the resources in your .tf code to their real-world IDs (AWS ARNs, GCP project IDs, etc.). Without state, Terraform has no way to know what it manages.

Remote state backends:

S3 + DynamoDB: the standard for AWS shops. S3 stores the file, DynamoDB provides locking via conditional writes.
Terraform Cloud: managed by HashiCorp, includes plan visualization and policy as code.
GCS, Azure Storage, Consul: equivalents for other clouds and for self-hosted setups.

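terraform {
  backend "s3" {
    bucket         = "binocs-tfstate"              # bucket that stores the state file
    key            = "infra-platform/prod.tfstate" # one key per state file
    region         = "us-east-1"
    dynamodb_table = "tfstate-lock"                # lock table (partition key: LockID)
    encrypt        = true                          # server-side encryption at rest
  }
}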

State locking matters. Two concurrent terraform apply runs against the same state can corrupt it. The DynamoDB table holds a per-state-file lock for the duration of the operation, so the second run fails fast instead of writing over the first.

State surgery (rare but real):

terraform state mv: rename a resource without destroy/recreate.
terraform state rm: remove from state without destroying the resource.
terraform import: bring an existing resource under management.

You will need these once per quarter. Document the runbook.

Module design

Modules are reusable groupings of resources. A good module:

Has a clear purpose ("a service in our platform").
Exposes a minimal set of variables.
Outputs only what callers need.
Is versioned (Git tag or registry version).

Example structure:

modules/
  service/
    main.tf      # ECR repo, IAM role, log group, route53 record
    variables.tf
    outputs.tf
    versions.tf
envs/
  prod/
    main.tf      # calls service module for each service
    backend.tf

Anti-patterns:

The "god module" that creates the entire VPC, EKS, RDS, etc. Hard to test, hard to reuse.
Modules with 50 variables. If you need that many knobs, you have multiple use cases jammed together.
Modules that wrap a single resource. Just use the resource directly.

Drift detection

Drift is when reality diverges from state. Causes:

Someone clicked in the AWS console.
An IAM policy was attached by automation outside Terraform (for example, organization-wide security guardrail tooling).
A Kubernetes operator mutated a resource Terraform thinks it owns.

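terraform plan shows drift; terraform plan -refresh-only isolates it from pending code changes. Run it nightly in CI with -detailed-exitcode, which exits 2 when the plan is non-empty, and alert on diffs. Tools like driftctl and Terraform Cloud's drift detection automate this.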
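
A minimal sketch of the nightly check, assuming GitHub Actions as the CI system (the text does not name one). The workflow name, schedule, and the envs/prod working directory are illustrative, and cloud credentials are assumed to be configured separately.

```yaml
# .github/workflows/drift.yml (hypothetical path and names)
name: terraform-drift
on:
  schedule:
    - cron: "0 3 * * *"        # nightly at 03:00 UTC
  workflow_dispatch: {}         # allow manual runs
jobs:
  plan:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: envs/prod
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # keep terraform's raw exit code
      # Assumption: AWS credentials are provided by an earlier step or OIDC.
      - run: terraform init -input=false
      - name: Detect drift
        run: |
          code=0
          # -lock=false: a read-only plan should not block a real apply
          terraform plan -detailed-exitcode -input=false -lock=false || code=$?
          # 0 = no changes, 1 = error, 2 = non-empty plan
          if [ "$code" -eq 2 ]; then
            echo "::warning::Drift or unapplied changes detected in envs/prod"
            exit 1
          fi
          exit "$code"
```

A failed scheduled run is the alert here; route it to Slack with whatever notification step the team already uses. Note that exit code 2 also fires for merged-but-unapplied code, so this works best when applies happen promptly after merge.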

Helm chart structure

mychart/
  Chart.yaml          # chart metadata
  values.yaml         # default values
  values-prod.yaml    # env override
  templates/
    deployment.yaml
    service.yaml
    ingress.yaml
    hpa.yaml
    _helpers.tpl      # named templates
  charts/             # dependencies

_helpers.tpl holds reusable named templates like app.fullname and app.labels. Put labeling logic here so every resource gets consistent labels.

Standard labels we used:

labels:
  app.kubernetes.io/name: {{ .Chart.Name }}
  app.kubernetes.io/instance: {{ .Release.Name }}
  # quote: an appVersion like 1.0 would otherwise render as a YAML number
  app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
  app.kubernetes.io/managed-by: {{ .Release.Service }}
  team: payments

Helm hooks

Hooks run at specific points in the release lifecycle: pre-install, post-install, pre-upgrade, post-upgrade, pre-rollback, post-rollback, pre-delete, post-delete, test.

Use cases:

pre-upgrade Job that runs DB migrations.
post-install Job that seeds initial data.
test Job that hits the readiness endpoint.

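apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app.fullname" . }}-migrate
  annotations:
    # run before the release's resources are created or upgraded
    "helm.sh/hook": pre-upgrade,pre-install
    # lower weights run first; -5 runs ahead of hooks with the default weight of 0
    "helm.sh/hook-weight": "-5"
    # delete the previous hook Job before creating the new one
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        command: ["alembic", "upgrade", "head"]
      restartPolicy: Never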
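
The test use case can be sketched the same way. This follows the shape of the test Pod that helm create scaffolds; the Service name (assumed to match app.fullname), .Values.service.port, and the /healthz path are assumptions, not values from this chart.

```yaml
# templates/tests/readiness.yaml (hypothetical file name)
apiVersion: v1
kind: Pod
metadata:
  name: {{ include "app.fullname" . }}-test
  annotations:
    # executed only when `helm test <release>` is run
    "helm.sh/hook": test
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  restartPolicy: Never
  containers:
    - name: probe
      image: busybox:1.36
      command: ['wget']
      # Assumed: a Service named after app.fullname, a service.port value,
      # and a /healthz endpoint. wget exits non-zero on an HTTP error,
      # which fails the test.
      args: ['-q', '-O', '-', 'http://{{ include "app.fullname" . }}:{{ .Values.service.port }}/healthz']
```

Keeping hook-succeeded in the delete policy cleans up passing test Pods while leaving failed ones around for kubectl logs.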

GitOps with ArgoCD

ArgoCD watches a Git repo and reconciles cluster state to match. The flow:

Developer pushes Helm value change to git.
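ArgoCD detects the change: by default it polls the repo roughly every three minutes, and a Git webhook cuts that to seconds.
ArgoCD diffs desired (rendered manifests) vs actual cluster state.
ArgoCD applies the diff, automatically if automated sync is enabled and on a manual sync otherwise.
UI shows sync and health status. ArgoCD does not roll back a failed sync on its own; revert the commit, or add Argo Rollouts if you need automated rollback.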

Pull-based deploys (ArgoCD) beat push-based (CI runs kubectl) because:

No long-lived cluster credentials in CI.
Self-healing: with automated sync and selfHeal enabled, if someone kubectl deletes a deployment, ArgoCD restores it.
Single source of truth: git.
Easier disaster recovery: bootstrap a new cluster by pointing ArgoCD at the repo.
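
A minimal sketch of an Application that gives this behavior. The repo URL, chart path, application name, and namespaces are placeholders; the automated, prune, and selfHeal fields are what turn on continuous sync and self-healing.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: mychart-prod            # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/deploy.git   # placeholder
    targetRevision: main
    path: charts/mychart
    helm:
      valueFiles:
        - values.yaml
        - values-prod.yaml      # later files override earlier ones
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD runs in
    namespace: payments         # hypothetical namespace
  syncPolicy:
    automated:
      prune: true               # delete resources that were removed from git
      selfHeal: true            # revert manual changes made in the cluster
    syncOptions:
      - CreateNamespace=true
```

Without the automated block, ArgoCD still reports the app as OutOfSync but waits for someone to trigger the sync.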

Multi-environment patterns

Three approaches, ranked from least to most operational overhead:

Separate Terraform workspaces per env (state isolation), same code with var.environment switches. Works for small teams.
Separate state files per env in separate directories, shared modules. Better for medium teams; clearer blast radius.
Separate AWS accounts per env, cross-account roles for CI. Best for security and compliance; most operational overhead.

For Helm: one chart, multiple values-{env}.yaml files. ArgoCD ApplicationSet generates one Application per env from a template.
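
A sketch of that ApplicationSet using a list generator. The environment names, cluster URLs, repo URL, and namespace are placeholders; {{env}} and {{cluster}} are ApplicationSet template parameters filled in from each list element.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: mychart
  namespace: argocd
spec:
  generators:
    - list:
        elements:               # one Application is generated per element
          - env: staging
            cluster: https://staging.example.com   # placeholder
          - env: prod
            cluster: https://prod.example.com      # placeholder
  template:
    metadata:
      name: "mychart-{{env}}"
    spec:
      project: default
      source:
        repoURL: https://git.example.com/platform/deploy.git   # placeholder
        targetRevision: main
        path: charts/mychart
        helm:
          valueFiles:
            - values.yaml
            - "values-{{env}}.yaml"   # per-env override on top of defaults
      destination:
        server: "{{cluster}}"
        namespace: mychart
      syncPolicy:
        automated: {}
```

Adding an environment then means adding one list element and one values file, with no copy of the Application manifest.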

Secrets, the right way

Do not put plaintext secrets in values.yaml. SOPS and sealed-secrets keep only ciphertext in Git, but the decrypted form still ends up in the cluster as a regular Secret, and every rotation means re-encrypting and committing.

Use External Secrets Operator (ESO):

Secrets live in AWS Secrets Manager or HashiCorp Vault.
ESO uses an IAM role (for example via IRSA on EKS) with read access to fetch them.
You create an ExternalSecret resource referencing the secret.
ESO syncs into a regular Kubernetes Secret in the namespace.
App mounts the Secret as usual.

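Rotation: rotate in Secrets Manager, ESO picks up the new value on the ExternalSecret's next refreshInterval, then restart pods that read the Secret through environment variables so they see it.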
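
A minimal ExternalSecret sketch for the flow above. The store name, namespace, secret names, and remote key are hypothetical, and a ClusterSecretStore pointing at AWS Secrets Manager is assumed to exist already. The apiVersion depends on the installed ESO release: v1beta1 is shown, and newer releases serve external-secrets.io/v1.

```yaml
apiVersion: external-secrets.io/v1beta1   # check which version your ESO release serves
kind: ExternalSecret
metadata:
  name: payments-db             # hypothetical
  namespace: payments
spec:
  refreshInterval: 5m           # how often ESO re-reads the remote secret
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager   # assumed to be defined elsewhere
  target:
    name: payments-db           # the Kubernetes Secret that ESO creates
  data:
    - secretKey: DB_PASSWORD    # key inside the Kubernetes Secret
      remoteRef:
        key: prod/payments/db   # hypothetical Secrets Manager secret name
        property: password      # JSON field inside that secret
```

The refreshInterval sets the upper bound on how stale the in-cluster Secret can be after a rotation.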

Testing

Terraform: terraform validate for syntax and internal consistency, terraform plan against a test workspace, terraform-compliance or OPA for policy checks. For real integration tests, Terratest spins up resources, asserts behavior, and tears them down. Slow and expensive; run sparingly.

Helm: helm template to verify the templates render, helm lint, the helm-unittest plugin for assertions against rendered templates, chart-testing (ct) for linting only the charts that changed in CI. Skip integration tests in CI for charts; rely on staging.
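
As an illustration of the unit-test layer, here is a small helm-unittest suite. It assumes templates/deployment.yaml builds its image from .Values.image.repository and .Values.image.tag, as the migration Job above does; the file name and tag value are made up.

```yaml
# tests/deployment_test.yaml (hypothetical file name)
suite: deployment
templates:
  - templates/deployment.yaml
tests:
  - it: renders a Deployment with the requested image tag
    set:
      image.tag: "1.2.3"        # override values for this test only
    asserts:
      - isKind:
          of: Deployment
      - matchRegex:
          path: spec.template.spec.containers[0].image
          pattern: ':1\.2\.3$'  # image reference must end with the tag
```

Suites like this run with helm unittest against the chart directory and need no cluster, which is what makes them cheap enough for every PR.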

The production patterns

Terraform plans run on every PR, output posted as a comment.
Apply gated behind approval for prod.
Helm chart changes go through ArgoCD, no manual helm upgrade in prod.
Drift detection runs nightly, alerts to Slack.
Back up the state file daily to a separate S3 bucket with versioning.
Tag every AWS resource with Environment, Team, ManagedBy=terraform, CostCenter.
