Terraform and Helm - Deep Dive
State management, module boundaries, drift detection, Helm hooks, chart testing, GitOps with ArgoCD, and the multi-environment patterns that scale.
Infrastructure as code is mostly about three things: state, drift, and blast radius. Terraform handles AWS, Helm handles Kubernetes, ArgoCD handles continuous sync. This is the production playbook.
Terraform state, in depth
State is a JSON file mapping resources in your .tf code to their real-world IDs (AWS ARNs, GCP project IDs, etc). Without state, Terraform has no way to know what it manages.
Remote state backends:
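- S3 + DynamoDB: the standard for AWS shops. S3 stores the file, DynamoDB provides locking via conditional writes. Recent Terraform releases can also lock natively in S3 (use_lockfile), which removes the need for the table.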
- Terraform Cloud: managed by HashiCorp, includes plan visualization and policy as code.
- Consul, GCS, Azure Storage: equivalents for self-hosted setups and other clouds.
terraform {
  backend "s3" {
    bucket         = "binocs-tfstate"
    key            = "infra-platform/prod.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tfstate-lock"
    encrypt        = true
  }
}
State locking matters. Two concurrent terraform apply runs against the same state can corrupt it. The DynamoDB table holds a per-state-file lock for the duration of the operation.
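To make the locking idea concrete, here is a minimal Python sketch of a lock built on a conditional write. It is an in-memory stand-in, not the DynamoDB API or Terraform's implementation; the class and function names (LockTable, run_with_lock) are hypothetical.

```python
import threading
import uuid


class LockHeldError(Exception):
    """Raised when another operation already holds the lock."""


class LockTable:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self._items = {}
        self._mutex = threading.Lock()  # makes check-and-write atomic

    def put_if_absent(self, lock_id, owner):
        # Conditional write: succeed only if no item exists for lock_id.
        with self._mutex:
            if lock_id in self._items:
                raise LockHeldError(f"{lock_id} is held by {self._items[lock_id]}")
            self._items[lock_id] = owner

    def delete_if_owner(self, lock_id, owner):
        # Release only the lock this owner created, never someone else's.
        with self._mutex:
            if self._items.get(lock_id) == owner:
                del self._items[lock_id]


def run_with_lock(table, state_key, operation):
    """Acquire the per-state-file lock, run the operation, always release."""
    owner = str(uuid.uuid4())
    table.put_if_absent(state_key, owner)
    try:
        return operation()
    finally:
        table.delete_if_owner(state_key, owner)
```

A second run against the same state key fails fast with LockHeldError instead of writing over the first run's changes, which is the behavior you want from a second terraform apply.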
State surgery (rare but real):
- terraform state mv: rename a resource without destroy/recreate.
- terraform state rm: remove from state without destroying the resource.
- terraform import: bring an existing resource under management.
You will need these once per quarter. Document the runbook.
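One way to keep that runbook honest is to generate it. The sketch below is a hypothetical helper (surgery_steps and the backup file name are assumptions) that wraps each operation with a state backup before and a plan after. It only builds command strings and does not run Terraform, and it accepts a single address for rm to keep the example small.

```python
import shlex

# Number of positional arguments each operation takes in this sketch.
ARGS_REQUIRED = {"mv": 2, "rm": 1, "import": 2}
BASE_COMMAND = {
    "mv": ["terraform", "state", "mv"],
    "rm": ["terraform", "state", "rm"],
    "import": ["terraform", "import"],
}


def surgery_steps(action, *args, backup_path="state-backup.tfstate"):
    """Return the shell commands for one state operation, in order."""
    if action not in BASE_COMMAND:
        raise ValueError(f"unknown action: {action}")
    if len(args) != ARGS_REQUIRED[action]:
        raise ValueError(f"{action} takes {ARGS_REQUIRED[action]} argument(s)")
    return [
        # 1. Save the current state so the change can be undone.
        f"terraform state pull > {shlex.quote(backup_path)}",
        # 2. The operation itself; shlex quotes addresses like module.x["a"].
        shlex.join(BASE_COMMAND[action] + list(args)),
        # 3. Confirm the result: the plan should show no unexpected changes.
        "terraform plan",
    ]
```

For example, surgery_steps("mv", "aws_s3_bucket.old", "aws_s3_bucket.new") yields the backup, the move, and a verifying plan as three lines you can paste into the runbook.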
Module design
Modules are reusable groupings of resources. A good module:
- Has a clear purpose ("a service in our platform").
- Exposes a minimal set of variables.
- Outputs only what callers need.
- Is versioned (Git tag or registry version).
Example structure:
modules/
  service/
    main.tf        # ECR repo, IAM role, log group, route53 record
    variables.tf
    outputs.tf
    versions.tf
envs/
  prod/
    main.tf        # calls service module for each service
    backend.tf
Anti-patterns:
- The "god module" that creates the entire VPC, EKS, RDS, etc. Hard to test, hard to reuse.
- Modules with 50 variables. If you need that many knobs, you have multiple use cases jammed together.
- Modules that wrap a single resource. Just use the resource directly.
Drift detection
Drift is when reality diverges from state. Causes:
- Someone clicked in the AWS console.
- An IAM policy was added by Service Control Policy automation.
- A Kubernetes operator mutated a resource Terraform thinks it owns.
terraform plan shows drift: it refreshes state against the real infrastructure and reports any difference. Run it nightly in CI and alert on diffs. Tools like driftctl and Terraform Cloud's drift detection automate this.
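A nightly check can key off the exit code of terraform plan -detailed-exitcode, which returns 0 for an empty diff, 2 for a non-empty diff, and 1 for an error. The sketch below is one possible wrapper; the function names are hypothetical, and it assumes the terraform binary is on PATH and the working directory is already initialized.

```python
import subprocess


def classify_plan_exit(code):
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in_sync"  # plan succeeded, empty diff
    if code == 2:
        return "drift"    # plan succeeded, non-empty diff
    return "error"        # plan itself failed


def check_drift(workdir):
    """Run a plan in workdir and return (status, plan output)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode), result.stdout
```

Note that a non-empty diff also appears when code has been merged but not yet applied, so run the check against the commit that was last applied if you want the alert to mean drift only.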
Helm chart structure
mychart/
  Chart.yaml          # chart metadata
  values.yaml         # default values
  values-prod.yaml    # env override
  templates/
    deployment.yaml
    service.yaml
    ingress.yaml
    hpa.yaml
    _helpers.tpl      # named templates
  charts/             # dependencies
_helpers.tpl holds reusable named templates like app.fullname and app.labels. Put labeling logic here so every resource gets consistent labels.
Standard labels we used:
labels:
  app.kubernetes.io/name: {{ .Chart.Name }}
  app.kubernetes.io/instance: {{ .Release.Name }}
  app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
  app.kubernetes.io/managed-by: {{ .Release.Service }}
  team: payments
Helm hooks
Hooks run at specific points in the release lifecycle: pre-install, post-install, pre-upgrade, post-upgrade, pre-rollback, post-rollback, pre-delete, post-delete, test.
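When several hooks fire on the same event, Helm runs them in ascending helm.sh/hook-weight order (default 0), breaking ties by resource kind and then by name. A minimal Python sketch of that ordering follows. It assumes hooks are plain dictionaries (a hypothetical shape, not Helm's internal types), uses made-up hook names, and omits the kind tiebreak for brevity.

```python
def execution_order(hooks, event):
    """Return the names of hooks for one event in the order they would run."""
    selected = [h for h in hooks if event in h["events"]]
    # Weights are strings in the annotation; they may be negative.
    ordered = sorted(selected, key=lambda h: (int(h.get("weight", "0")), h["name"]))
    return [h["name"] for h in ordered]


hooks = [
    {"name": "notify", "events": ["pre-upgrade"], "weight": "5"},
    {"name": "migrate", "events": ["pre-upgrade", "pre-install"], "weight": "-5"},
    {"name": "backup-db", "events": ["pre-upgrade"], "weight": "-10"},
    {"name": "seed", "events": ["post-install"]},  # no weight: defaults to 0
]

print(execution_order(hooks, "pre-upgrade"))
```

Lower weights run first, so a backup at -10 completes before a migration at -5. Helm waits for each hook to finish before starting the next.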
Use cases:
- pre-upgrade Job that runs DB migrations.
- post-install Job that seeds initial data.
- test Job that hits the readiness endpoint.
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "app.fullname" . }}-migrate
  annotations:
    "helm.sh/hook": pre-upgrade,pre-install
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          command: ["alembic", "upgrade", "head"]
      restartPolicy: Never
GitOps with ArgoCD
ArgoCD watches a Git repo and reconciles cluster state to match. The flow:
- Developer pushes Helm value change to git.
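- ArgoCD detects the change: by default it polls the repo every three minutes, or reacts within seconds if a Git webhook is configured.
- ArgoCD diffs desired (rendered manifests) vs actual cluster state.
- ArgoCD applies the diff (automatically if auto-sync is enabled, otherwise on a manual sync).
- UI shows sync and health status. ArgoCD itself does not roll back a failed sync automatically; revert the commit in Git, or pair it with Argo Rollouts for automated rollback.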
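The diff step in the flow above can be sketched as a comparison of two maps keyed by resource identity. This is a simplified model in Python, not ArgoCD's actual diff engine: the function name, the (kind, name) keys, and the spec dictionaries are all assumptions for illustration.

```python
def diff_state(desired, actual, prune=False):
    """Compare desired manifests with live resources and list needed actions."""
    create = sorted(key for key in desired if key not in actual)
    update = sorted(
        key for key in desired if key in actual and desired[key] != actual[key]
    )
    # Resources that exist in the cluster but not in Git are only removed
    # when pruning is switched on.
    delete = sorted(key for key in actual if key not in desired) if prune else []
    return {"create": create, "update": update, "delete": delete}


desired = {
    ("Deployment", "api"): {"replicas": 3},
    ("Service", "api"): {"port": 80},
}
actual = {
    ("Deployment", "api"): {"replicas": 2},
    ("ConfigMap", "legacy"): {"data": {}},
}

print(diff_state(desired, actual, prune=True))
```

An empty result in all three lists corresponds to a Synced application; anything else is OutOfSync until the actions are applied.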
Pull-based deploys (ArgoCD) beat push-based (CI runs kubectl) because:
- No long-lived cluster credentials in CI.
- Self-healing: with automated sync and selfHeal enabled, if someone kubectl deletes a deployment, ArgoCD restores it.
- Single source of truth: git.
- Easier disaster recovery: bootstrap a new cluster by pointing ArgoCD at the repo.
Multi-environment patterns
Three approaches, in order of increasing operational overhead:
- Separate Terraform workspaces per env (state isolation), same code with var.environment switches. Works for small teams.
- Separate state files per env in separate directories, shared modules. Better for medium teams; clearer blast radius.
- Separate AWS accounts per env, cross-account roles for CI. Best for security and compliance; most operational overhead.
For Helm: one chart, multiple values-{env}.yaml files. ArgoCD ApplicationSet generates one Application per env from a template.
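What the ApplicationSet does can be approximated in a few lines: take a template and a list of environments, and stamp out one Application per entry. The Python sketch below builds Application manifests as dictionaries. The function name, repository URL, project, and the convention that each cluster is registered under its environment name are assumptions, not something the original setup confirms.

```python
def render_applications(app, chart_path, repo_url, environments):
    """Build one Argo CD Application manifest (as a dict) per environment."""
    return [
        {
            "apiVersion": "argoproj.io/v1alpha1",
            "kind": "Application",
            "metadata": {"name": f"{app}-{env}", "namespace": "argocd"},
            "spec": {
                "project": "default",
                "source": {
                    "repoURL": repo_url,
                    "path": chart_path,
                    "targetRevision": "HEAD",
                    # Shared defaults first, then the env override.
                    "helm": {"valueFiles": ["values.yaml", f"values-{env}.yaml"]},
                },
                # Assumes each cluster is registered under its env name.
                "destination": {"name": env, "namespace": app},
            },
        }
        for env in environments
    ]


apps = render_applications(
    "mychart", "charts/mychart", "https://example.com/org/deploy.git",
    ["dev", "staging", "prod"],
)
print([a["metadata"]["name"] for a in apps])
```

Adding an environment then means adding one list entry and one values file, with no copy of the chart.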
Secrets, the right way
Do not put plain secrets in values.yaml. Even when they are encrypted in Git with SOPS or sealed-secrets, the decrypted form still ends up stored in the cluster (in etcd), not only in your secret store.
Use External Secrets Operator (ESO):
- Secrets live in AWS Secrets Manager or HashiCorp Vault.
- ESO has an IAM role to fetch them.
- You create an ExternalSecret resource referencing the secret.
- ESO syncs into a regular Kubernetes Secret in the namespace.
- App mounts the Secret as usual.
Rotation: rotate in Secrets Manager, ESO picks up the new value on its next refresh (the refreshInterval on the ExternalSecret, typically minutes), then restart pods so they read it.
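For reference, an ExternalSecret is a small manifest. The sketch below builds one as a Python dictionary and prints it as JSON, which kubectl apply accepts. The helper name, store name, secret path, and property names are hypothetical, and the apiVersion assumes a release that serves v1beta1 (newer ESO releases also serve v1).

```python
import json


def external_secret(name, namespace, store_name, remote_key, properties,
                    refresh_interval="5m"):
    """Build an ExternalSecret manifest that maps remote properties to keys."""
    return {
        "apiVersion": "external-secrets.io/v1beta1",  # assumption: see lead-in
        "kind": "ExternalSecret",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # How often ESO re-reads the remote secret; bounds rotation delay.
            "refreshInterval": refresh_interval,
            "secretStoreRef": {"name": store_name, "kind": "ClusterSecretStore"},
            # Name of the regular Kubernetes Secret that ESO creates.
            "target": {"name": name},
            "data": [
                {
                    "secretKey": prop,
                    "remoteRef": {"key": remote_key, "property": prop},
                }
                for prop in properties
            ],
        },
    }


manifest = external_secret(
    "payments-db", "payments", "aws-secrets-manager",
    "prod/payments/db", ["username", "password"],
)
print(json.dumps(manifest, indent=2))
```

The app then mounts the Secret named payments-db as usual and never sees the ExternalSecret itself.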
Testing infrastructure code
Terraform: terraform validate, terraform plan against a test workspace, terraform-compliance or OPA for policy checks. For real integration tests, Terratest spins up resources, asserts behavior, tears down. Slow and expensive; run sparingly.
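A policy check does not have to start with a full policy engine. The sketch below is a plain-Python stand-in (not OPA or terraform-compliance) that reads the JSON form of a plan, as produced by terraform show -json on a saved plan file, and lists resources the plan would delete or replace. The function name and sample addresses are hypothetical.

```python
def destructive_changes(plan):
    """Return addresses of resources that the plan would delete or replace."""
    return sorted(
        change["address"]
        for change in plan.get("resource_changes", [])
        # A replacement shows up as both "delete" and "create" in actions.
        if "delete" in change.get("change", {}).get("actions", [])
    )


plan = {
    "resource_changes": [
        {"address": "aws_ecr_repository.api", "change": {"actions": ["create"]}},
        {"address": "aws_db_instance.main", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_iam_role.api", "change": {"actions": ["update"]}},
        {"address": "aws_route53_record.old", "change": {"actions": ["delete"]}},
        {"address": "aws_cloudwatch_log_group.api", "change": {"actions": ["no-op"]}},
    ]
}

print(destructive_changes(plan))
```

In CI, a non-empty result for a prod plan can fail the check or require an extra approval rather than block outright.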
Helm: helm template to verify the chart renders to valid YAML, helm lint, the helm-unittest plugin for assertions on rendered templates, chart-testing (ct) for linting only the charts that changed in CI. Skip integration tests in CI for charts; rely on staging.
The production patterns
- Terraform plans run on every PR, output posted as a comment.
- Apply gated behind approval for prod.
- Helm chart changes go through ArgoCD, no manual helm upgrade in prod.
- Drift detection runs nightly, alerts to Slack.
- Back up the state file daily to a separate S3 bucket with versioning enabled.
- Tag every AWS resource with Environment, Team, ManagedBy=terraform, CostCenter.
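The tagging rule is easy to enforce on the PR plan. The sketch below checks the JSON form of a plan (from terraform show -json on a saved plan file) for the four tags. The function name is hypothetical, and it assumes tags appear under the tags or tags_all attribute, as they do for AWS provider resources.

```python
REQUIRED_TAGS = ("Environment", "Team", "ManagedBy", "CostCenter")


def missing_tags(plan):
    """Map each planned resource address to the required tags it lacks."""
    problems = {}
    for change in plan.get("resource_changes", []):
        after = (change.get("change") or {}).get("after")
        if not isinstance(after, dict):
            continue  # resource is being deleted, nothing to check
        if "tags" not in after and "tags_all" not in after:
            continue  # resource type has no tags attribute
        # tags_all includes provider-level default tags when they are set.
        tags = {**(after.get("tags") or {}), **(after.get("tags_all") or {})}
        missing = [name for name in REQUIRED_TAGS if name not in tags]
        if tags.get("ManagedBy", "terraform") != "terraform":
            missing.append("ManagedBy=terraform")
        if missing:
            problems[change["address"]] = missing
    return problems


plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"], "after": {
            "tags": {"Environment": "prod", "Team": "payments",
                     "ManagedBy": "terraform", "CostCenter": "cc-1"}}}},
        {"address": "aws_sqs_queue.jobs", "change": {"actions": ["create"], "after": {
            "tags": {"Environment": "prod"}}}},
        {"address": "aws_iam_role.old", "change": {"actions": ["delete"], "after": None}},
    ]
}

print(missing_tags(plan))
```

Setting the four tags once through the AWS provider's default_tags block keeps this check green without repeating them on every resource.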