
Binocs - EKS infra and 60% cloud savings

How I ran a one-week AWS cost audit at Binocs and shaved 1.8K to 2K USD off the monthly bill. Methodology, tools, the order of operations, the trap of premature commitment, and the patterns that keep cost down after you fix it.

The brief

A senior pinged me on Slack one morning. "Bill is creeping. Take a look this week, see what is up." That was the brief. By the end of the week I had a Notion doc with 1.8K to 2K USD per month in identified savings and the terraform changes to capture them.

This is the playbook I used, the mistakes I almost made, and the patterns I put in place so the bill stayed down.

The methodology

Cost optimization is performance work in disguise. You measure, you find the hotspot, you fix the hotspot, you measure again. The temptation is always to start with whatever is technically interesting. The right move is to start with whatever is expensive.

My order of operations -

1. Cost Explorer by service, top to bottom.
2. Find the orphans (running resources nobody owns).
3. Right-size what is actually running.
4. Move suitable workloads to spot or savings-plans-eligible.
5. Check storage and data transfer (they are quietly expensive).
6. Put guardrails in so the bill stays down.

I did steps 1 to 4 in week one and step 5 in week two as I found time. Step 6 was a separate small project later.

Step 1 - measure the right thing

I opened Cost Explorer, set the date range to the last 90 days, grouped by service, and sorted by cost. The breakdown -

EC2 (which is mostly EKS node groups) - 40 percent.
RDS - 25 percent.
Data transfer - 10 percent.
S3 - 7 percent.
CloudWatch (yes, really) - 5 percent.
Everything else - 13 percent.

The lesson - everything below the top 3 is noise for the first pass. I focused on EC2 and RDS. Data transfer I came back to in week two.
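The same by-service breakdown can be pulled programmatically. This is a minimal sketch, assuming boto3 and credentials with Cost Explorer read access; it is not the exact query from the audit, the function names are illustrative, and it ignores pagination. The parsing helper works on the response shape of get_cost_and_usage, so it can be tried on a saved response without calling AWS.

```python
from datetime import date, timedelta


def fetch_cost_by_service(days=90):
    """Ask Cost Explorer for the last `days` days of cost, grouped by service."""
    import boto3  # imported lazily so share_by_service works without AWS access

    end = date.today()
    start = end - timedelta(days=days)
    ce = boto3.client("ce")
    return ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="MONTHLY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )


def share_by_service(response):
    """Sum cost per service across all periods; return (service, percent) sorted by cost."""
    totals = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, round(100 * cost / grand_total, 1)) for name, cost in ranked]
```

Reading only the first three rows of the result is the programmatic version of "start with whatever is expensive".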

I also grouped by tag (cluster, environment, team). This is only useful if you have a tagging policy. We did, sort of. Anything in EKS was tagged by cluster. Anything not in EKS was inconsistent. That tagging gap cost me half a day of detective work later, and it is the reason I added a tagging policy at the end of the project.

Step 2 - find the orphans

The single highest ROI thing in any cost audit is finding resources that no one owns. The dev cluster I mentioned in the crisp version was the big one - 3 m5.xlarge nodes running for 90 days, plus an RDS instance, plus an ALB. Someone had spun it up for a POC in April and moved on.

How I found it - Cost Explorer grouped by cluster tag. There were 4 clusters in the list and only 3 in active use. The fourth was the orphan. Total bill on the orphan was 280 USD per month. Terraform destroy, save the state for archaeology, done.

Other orphans I found -

Two EBS volumes detached from terminated EC2 instances, 8 USD per month each.
One Elastic IP not attached to anything, 3.60 USD per month.
An old NAT Gateway in a deprecated VPC, 32 USD per month.
Three S3 buckets with old build artifacts, 12 USD per month.

None of these were big individually. Together they were another 70 USD per month. The discipline is to look.
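Detached volumes and idle Elastic IPs can be listed straight from the EC2 API. This is a minimal sketch, assuming boto3 and read-only credentials; it is not the script used in the audit, the helper names are illustrative, and it reads only the first page of results. The two filter functions operate on the response shapes of describe_volumes and describe_addresses.

```python
def find_unattached_volumes(volumes_response):
    """Return IDs of EBS volumes that are 'available', i.e. attached to nothing."""
    return [
        volume["VolumeId"]
        for volume in volumes_response.get("Volumes", [])
        if volume.get("State") == "available" and not volume.get("Attachments")
    ]


def find_unassociated_addresses(addresses_response):
    """Return public IPs of Elastic IPs that have no association."""
    return [
        address["PublicIp"]
        for address in addresses_response.get("Addresses", [])
        if "AssociationId" not in address
    ]


def scan_region(region_name):
    """Query one region and report likely orphans (first page of results only)."""
    import boto3  # imported lazily so the filters above work without AWS access

    ec2 = boto3.client("ec2", region_name=region_name)
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )
    addresses = ec2.describe_addresses()
    return {
        "volumes": find_unattached_volumes(volumes),
        "elastic_ips": find_unassociated_addresses(addresses),
    }
```

A hit from this list is a candidate, not a verdict; someone still has to confirm that nobody owns the resource before it is deleted.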

Step 3 - right-size what is running

This is the biggest line item win and the one that requires the most care. The data sources -

AWS Compute Optimizer for EC2 recommendations.
CloudWatch Container Insights for pod-level CPU and memory.
Kubernetes Vertical Pod Autoscaler in recommendation mode for per-pod requests.

The prod node group ran on m5.2xlarge nodes. Container Insights showed average CPU at 18 percent and average memory at 35 percent over a 7-day window, with peaks at 32 percent and 48 percent respectively. I right-sized to m5.large for the general API workload and used m5.xlarge for the LLM worker pool, because those pods occasionally spike higher.

The risk is that average usage hides peaks. I looked at p95 and p99, not just the average, and I sized to the p99 with 20 percent headroom. After deployment, I watched CPU throttling metrics for two weeks. Nothing throttled.
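The sizing rule is simple enough to write down. This is a minimal sketch, assuming usage samples have already been exported as a list of numbers (for example cores in use per minute); the function names are illustrative, and the nearest-rank percentile here may differ slightly from how a monitoring tool interpolates.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct percent at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def required_capacity(samples, pct=99, headroom_pct=20):
    """Capacity target: the chosen percentile plus a headroom margin."""
    return percentile(samples, pct) * (100 + headroom_pct) / 100
```

Sizing to the average instead would mean passing the mean of the samples, which is exactly the number that hides the peaks.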

Pod requests were the other half. Most helm charts ship with absurdly high default resource requests because they want to "feel safe". VPA in recommendation mode runs alongside your pods and tells you what they actually use. I dropped requests across the board to match VPA recommendations plus 30 percent headroom. This let more pods schedule per node, which is the actual mechanism that turns right-sized nodes into a smaller bill.
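The link between smaller requests and a smaller bill is integer division. This is a rough sketch that considers CPU only (memory is an equally real scheduling constraint); the example figures in the comments are hypothetical and are not taken from the Binocs cluster.

```python
import math


def request_with_headroom(recommended_millicores, headroom_pct=30):
    """Turn a VPA target into a request by adding a headroom margin, rounded up."""
    return math.ceil(recommended_millicores * (100 + headroom_pct) / 100)


def pods_per_node(allocatable_millicores, request_millicores):
    """How many pods of this size fit on one node, by CPU request alone."""
    if request_millicores <= 0:
        raise ValueError("request must be positive")
    return allocatable_millicores // request_millicores


# Hypothetical example: a chart ships with a 1000m request, VPA recommends 200m,
# and the node has about 1900m allocatable after system reservations.
tuned_request = request_with_headroom(200)       # 260m
before = pods_per_node(1900, 1000)               # 1 pod per node
after = pods_per_node(1900, tuned_request)       # 7 pods per node
```

Fewer nodes are needed only once the scheduler can pack more pods onto each one, which is why the request change matters as much as the instance type change.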

Figure - the right-sizing flow: measure, recommend, resize, verify.

Savings - about 700 USD per month on the prod node group.

Step 4 - spot for batch

The CIM processing pipeline runs as Kubernetes jobs. Each job processes one CIM, takes 1 to 5 minutes, can be retried if interrupted. This is the textbook spot instance use case.

The setup - a separate node group with a mixed instance policy. 80 percent of capacity is spot across three instance types (so AWS has flexibility to fulfill the spot request), 20 percent is on-demand fallback. Pods schedule on spot when available, fall back to on-demand when spot pricing spikes or capacity is unavailable.

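I added a PodDisruptionBudget so that node drains triggered by a spot interruption notice evict at most 25 percent of in-flight jobs at a time. A PDB only governs voluntary evictions, so it cannot stop AWS from reclaiming the instance once the notice period ends. What actually makes interruptions safe is that the job retry logic was already idempotent (the pipeline is content-addressed by CIM hash, so re-running a job is a cheap no-op if the result is already in S3).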
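The content-addressed pattern is what makes a retry cheap. This is a minimal sketch in which a plain dict stands in for the S3 bucket and SHA-256 is an assumed choice of hash; the real pipeline's key scheme and storage calls are not shown in this post.

```python
import hashlib


def content_key(document_bytes):
    """Derive the storage key from the document content itself."""
    return hashlib.sha256(document_bytes).hexdigest()


def process_once(document_bytes, store, process):
    """Run `process` only if no result exists for this content.

    Returns (result, did_work). `store` is any dict-like object; in production
    it would be a thin wrapper over object storage.
    """
    key = content_key(document_bytes)
    if key in store:
        return store[key], False  # retry after an interruption: nothing to redo
    result = process(document_bytes)
    store[key] = result
    return result, True
```

A job killed halfway through never writes its key, so the retry does the full work; a job killed after writing is skipped on retry.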

Spot pricing for the instance types I picked was about 35 to 40 percent of on-demand. Savings on this node group worked out to about 60 percent overall (some on-demand still needed), roughly 450 USD per month.
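The blended cost of a mixed node group is a weighted average. This is a back-of-envelope sketch using the nominal split and price range quoted above; it assumes every node-hour is billed at one of two flat rates, which real spot pricing does not guarantee.

```python
def blended_cost_ratio(spot_share, spot_price_ratio):
    """Cost of a mixed group relative to running everything on-demand.

    spot_share: fraction of capacity on spot (0 to 1).
    spot_price_ratio: spot price as a fraction of the on-demand price.
    """
    if not 0 <= spot_share <= 1:
        raise ValueError("spot_share must be between 0 and 1")
    return spot_share * spot_price_ratio + (1 - spot_share)


def expected_savings(spot_share, spot_price_ratio):
    """Fraction of the on-demand bill saved by the mix."""
    return 1 - blended_cost_ratio(spot_share, spot_price_ratio)


low = expected_savings(0.80, 0.40)   # about 0.48
high = expected_savings(0.80, 0.35)  # about 0.52
```

On the nominal 80/20 split this gives roughly 48 to 52 percent. A realized figure nearer 60 percent would need spot to cover more of the actual node-hours than the nominal split, or a deeper discount than the quoted range; the sketch does not model either.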

Step 5 - RDS

We were on db.m5.2xlarge with 1000 GB of io2 provisioned IOPS storage. The instance was a Postgres 15 cluster running the main app DB.

CloudWatch metrics over 30 days - CPU peaked at 22 percent, IOPS peaked at 4500 (we had 6000 provisioned), connection count peaked at 80 (max was 500). The instance was massively over-provisioned.

I dropped to db.m5.xlarge with gp3 storage at 12000 baseline IOPS. gp3 is cheaper than io2 below a certain IOPS threshold and our workload was well below it. Before pushing to prod I ran the same workload in staging for 48 hours with synthetic load, watched the latency p99 stay below SLO, then did the prod resize during a maintenance window.

Savings - about 380 USD per month.
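Tallying the per-step savings quoted in the sections above gives a quick consistency check against the headline range. The figures are the approximate ones from the text; the dictionary labels are just shorthand.

```python
# Approximate monthly savings in USD, as quoted per step.
monthly_savings_usd = {
    "orphaned dev cluster": 280,
    "smaller orphans": 70,
    "prod node group right-sizing": 700,
    "spot for batch": 450,
    "RDS resize": 380,
}

total_usd = sum(monthly_savings_usd.values())
in_headline_range = 1800 <= total_usd <= 2000

for item, amount in sorted(monthly_savings_usd.items(), key=lambda kv: -kv[1]):
    print(f"{item:32s} {amount:5d}")
print(f"{'total':32s} {total_usd:5d}")
```

The total lands at about 1,880 USD per month, inside the 1.8K to 2K range, with right-sizing and spot accounting for well over half of it.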

Step 6 - the discipline that keeps it down

After the audit, the bill stayed down. Six months later it was still down. Three things made that hold.

Tagging policy

Every resource gets a team, env, and service tag. CI fails terraform plans that introduce untagged resources. With consistent tags, Cost Explorer becomes a self-service tool for any engineer.
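A tag gate of this kind can be built on Terraform's JSON plan output. This is a minimal sketch, assuming the plan was exported with `terraform show -json` and that tags are set directly on each resource; it is not the actual CI job, and it skips resources whose type has no tags attribute. Tags supplied only through provider-level defaults would need `tags_all` instead, which this sketch does not handle.

```python
REQUIRED_TAGS = frozenset({"team", "env", "service"})


def untagged_resources(plan, required=REQUIRED_TAGS):
    """Return {resource address: [missing tags]} for resources the plan creates or updates."""
    problems = {}
    for resource in plan.get("resource_changes", []):
        change = resource.get("change", {})
        actions = set(change.get("actions", []))
        if not actions & {"create", "update"}:
            continue  # deletes and no-ops cannot introduce untagged resources
        after = change.get("after") or {}
        if "tags" not in after:
            continue  # this resource type does not expose a tags attribute
        tags = after.get("tags") or {}
        missing = sorted(set(required) - set(tags))
        if missing:
            problems[resource["address"]] = missing
    return problems
```

In CI the plan would be loaded with `json.load` and the job would exit non-zero whenever the returned dict is non-empty, printing the addresses so the author knows what to fix.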

Monthly cost review

First Monday of every month I sent a Slack summary - top 5 line items, change vs last month, anything that grew more than 10 percent. Took 30 minutes to produce. People started self-correcting before I had to ask.
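The summary itself is a small comparison of two months of per-service totals. This is a sketch assuming the totals are already available as plain dicts of service name to USD; the function name and output shape are illustrative, and posting to Slack is left out.

```python
def monthly_summary(previous, current, top_n=5, growth_threshold=0.10):
    """Pick the top line items and anything that grew past the threshold.

    Returns {"top": [(service, cost)], "grew": [(service, percent or None)]}.
    A percent of None marks a line item that did not exist last month.
    """
    top = sorted(current.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    grew = []
    for service, cost in current.items():
        before = previous.get(service, 0.0)
        if before == 0:
            if cost > 0:
                grew.append((service, None))  # new spend, no baseline to compare
            continue
        change = (cost - before) / before
        if change > growth_threshold:
            grew.append((service, round(change * 100, 1)))
    return {"top": top, "grew": grew}
```

New line items are reported separately from percentage growth because a service that appears from nothing is often the more interesting signal.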

Cost as a CI check

For PRs that touch terraform, I added a terraform plan summary that flagged any new compute resources and forced an explicit acknowledgment. Friction in the right place.
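The flagging half of that check reduces to a filter over the JSON plan. This is a minimal sketch, assuming the plan was exported with `terraform show -json`; the set of resource types treated as compute is an assumption chosen for illustration, and the acknowledgment mechanism (a label, a checkbox, a reviewer) is left to the CI system.

```python
# Assumed list of AWS provider resource types worth a second look; extend as needed.
COMPUTE_TYPES = frozenset({
    "aws_instance",
    "aws_autoscaling_group",
    "aws_eks_node_group",
    "aws_db_instance",
})


def new_compute_resources(plan, compute_types=COMPUTE_TYPES):
    """Return sorted addresses of compute resources that the plan would create."""
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        if resource.get("type") in compute_types and "create" in actions:
            flagged.append(resource["address"])
    return sorted(flagged)
```

A replacement shows up in the plan with both delete and create actions, so it is flagged too, which is usually the desired behavior when an instance type changes.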

What I almost did wrong

I almost signed a 1-year Reserved Instance plan before I right-sized. The math would have looked good on the old footprint and disastrous on the new footprint. The rule - right-size first, commit to long-term pricing second.

I almost migrated to ARM (Graviton) in the same change. ARM migration is a real win but it is a separate project with its own testing burden. Doing it in the same week as right-sizing would have made any regression hard to attribute. I parked it.

I almost did a Karpenter migration to "do it properly". Karpenter is great but it is a tool change, not a cost fix. I did the cost fix first and left Karpenter as a follow-up that a teammate picked up later.

What this taught me

Cost work is operations work, not architecture work. The wins come from measurement and discipline, not from clever rewrites. The same is true of performance work, of reliability work, of security work. Measure first, attack the biggest thing, change one variable, measure again. Repeat until done. Then put the discipline in place so it does not regrow.

Also - any senior engineer who tells you "we should look at the cloud bill" is offering you a high-leverage week of work. Take it. The bill always has room. The team always notices.

Learn more

Docs - AWS EKS Best Practices Guide (AWS)
Docs - AWS Cost Explorer docs (AWS)
Repo - Kubernetes Vertical Pod Autoscaler (GitHub)
Docs - Karpenter docs (Karpenter)
Docs - AWS Compute Optimizer (AWS)
Article - Brendan Gregg, systems performance (brendangregg.com)