Senior 10 min · March 06, 2026

Cloud Cost Optimisation — The $120K Forgotten Database

An untagged RDS instance quietly cost $8,500 a month while the AWS bill drifted from $45K to $110K.

Naren · Founder
Quick Answer
  • Cloud cost optimisation is the practice of aligning cloud spend with actual usage
  • Rightsizing matches instance types to workload requirements — downsizing 40% of instances cuts bills by 20-50%
  • Reserved instances give 30-72% discounts over on-demand for steady workloads
  • Spot instances reduce costs by 60-90% for fault-tolerant, interruptible jobs
  • Tagging without enforcement is just decoration — build automated policies that stop untagged resources
  • Biggest mistake: mistaking reserved instance discounts for cost control while leaving idle resources running
Plain-English First

Imagine you rent a huge warehouse with 50 rooms, but you only ever use 3 of them — and you pay full rent every single month. Cloud cost optimisation is the art of figuring out which rooms you actually need, downsizing to the right-sized space, pre-paying for rooms you're certain you'll use long-term, and turning off the lights in rooms nobody is in after 6pm. Your cloud bill works exactly the same way.

Cloud bills have a nasty habit of doing one thing: going up. A startup spins up a few EC2 instances to test an idea, forgets to turn them off, adds an RDS database 'just for dev', and six months later the founders are staring at a $40,000 invoice wondering how it happened. This isn't a cautionary tale — it's Tuesday at most engineering companies. Cloud providers design their consoles to make provisioning effortless and deprovisioning forgettable. That asymmetry is expensive.

The problem cloud cost optimisation solves isn't just waste — it's invisible waste. Unused Elastic IPs, idle load balancers, forgotten S3 buckets full of old logs, over-provisioned RDS instances running at 8% CPU — none of these announce themselves. They silently accumulate. Cost optimisation is the discipline of making cloud spend intentional: every resource should be the right size, running only when needed, and tagged so you know exactly which team or product is paying for it.

By the end of this article you'll know how to identify and eliminate the four biggest categories of cloud waste, how to write Infrastructure as Code that enforces cost-aware defaults, how to use reserved capacity and spot instances strategically, and how to build a tagging policy that gives you real visibility. These are the same techniques engineering teams at scale use to routinely cut cloud bills by 30–60% without touching a single line of application code.

What is Cloud Cost Optimisation?

Cloud cost optimisation is the practice of continuously aligning cloud resource provisioning with actual workload requirements. It's not about cutting corners — it's about eliminating the gap between what you pay and what you need. That gap is almost always wider than you think.

Most engineers treat cloud spend as a fixed cost, like rent. It's not. Every resource is individually billable, and providers design their consoles to make adding resources frictionless. The result? A typical AWS account has 30-40% waste: idle instances, oversized databases, unused storage, and orphaned resources that nobody remembers creating.

Cost optimisation operates at three levels:
  • Tactical: Rightsizing instances, stopping idle resources, removing orphaned volumes.
  • Strategic: Reserved instances, savings plans, spot usage for batch workloads.
  • Cultural: Tagging policies, cost allocation reports, and developer education on cost awareness.

Each level compounds. Without tactical cleanup, strategic discounts are wasted on empty resources. Without cultural enforcement, tactical wins revert within a quarter.

main.tf (HCL)
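# TheCodeForge: Terraform example enforcing cost-aware defaults
locals {
  # Cron entry that halts non-production instances at 20:00 (instance clock, usually UTC)
  auto_stop_script = <<-EOF
    #!/bin/bash
    echo "0 20 * * * /sbin/shutdown -h now" | crontab -
  EOF
}

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0" # placeholder AMI ID; use one that exists in your region
  instance_type = var.env == "production" ? "t3.medium" : "t3.micro"

  # Enforce tags for cost visibility
  tags = {
    Name        = "web-${var.env}"
    Environment = var.env
    Owner       = var.team
    CostCenter  = var.cost_center
    # timestamp() is re-evaluated on every plan; ignore_changes below pins the value set at creation
    ExpiresOn   = var.env == "staging" ? timeadd(timestamp(), "168h") : "never"
  }

  # Auto-stop for non-production (the heredoc lives in a local so the conditional stays valid HCL)
  user_data = var.env != "production" ? local.auto_stop_script : null

  lifecycle {
    # Without this, every apply would push the expiry another week into the future
    ignore_changes = [tags["ExpiresOn"]]
  }
}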
Cost Optimisation Mental Model
  • First: stop paying for resources you don't use (idle instances, orphaned volumes).
  • Second: shrink resources you use but don't need full power for (rightsize).
  • Third: commit to longer-term usage for discounts (RIs, savings plans) only after you've right-sized.
  • Fourth: use spot for variable, fault-tolerant workloads only when steady-state is optimised.
Production Insight
The biggest cost optimisation trap is buying reserved instances before rightsizing. A 3-year all-upfront RI on an m5.xlarge that runs at 5% CPU is still a discounted payment for capacity that sits about 95% idle.
Always rightsize before you reserve.
Rule: Eliminating 50% of your waste saves more than a 30% discount on that same waste.
Key Takeaway
Cloud cost optimisation starts with eliminating waste, not getting discounts.
Discounts on waste are still waste.
Rightsize first, reserve second, spot third.
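To make the rule concrete, here is a minimal sketch of the arithmetic. The function names and the scenario (a $10,000 monthly bill of which $4,000 is waste, half of it removable, and a 30% commitment discount) are illustrative assumptions, not figures from a real account.

```python
def waste_paid_after_discount(waste: float, discount: float) -> float:
    """Wasted dollars you still pay if you only buy a discount."""
    return waste * (1 - discount)


def waste_paid_after_cleanup(waste: float, eliminated: float) -> float:
    """Wasted dollars you still pay if you remove a fraction of the waste."""
    return waste * (1 - eliminated)


def total_cost(spend: float, waste: float,
               eliminated: float = 0.0, discount: float = 0.0) -> float:
    """Monthly bill after removing part of the waste, then discounting what remains."""
    return (spend - waste * eliminated) * (1 - discount)


# Hypothetical scenario: $10,000 bill, $4,000 of it waste.
spend, waste = 10_000.0, 4_000.0
print("waste still paid, discount only:", round(waste_paid_after_discount(waste, 0.30)))
print("waste still paid, cleanup only: ", round(waste_paid_after_cleanup(waste, 0.50)))
print("bill, discount only:            ", round(total_cost(spend, waste, discount=0.30)))
print("bill, cleanup then discount:    ", round(total_cost(spend, waste, 0.50, 0.30)))
```

In this scenario the discount alone leaves $2,800 of waste on the bill, while cleanup leaves $2,000. Doing both in the right order brings the bill to $5,600 instead of $7,000, and the commitment you sign is sized to what you actually need.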
Should You Optimise Cost or Spend More?
If: Workload runs <6 hours/day or at <40% of peak usage
Use: Rightsize down and consider spot/preemptible instances.
If: Workload runs 24/7 with steady CPU >50%
Use: Reserved instance or savings plan; commit for 1 or 3 years.
If: Workload is batch, fault-tolerant, and time-flexible
Use: Spot fleets with diversity across instance families.
If: No resource tags, no cost allocation visibility
Use: Stop everything except production; implement tagging policy first.

10-Point Cloud Cost Optimization Checklist

Use this checklist as a quick reference to audit your cloud spend. Each point addresses a common source of waste or a best practice for cost control.

  1. Stop idle compute resources. Identify instances with less than 5% CPU over the last 14 days. Stop or terminate them. Use AWS Instance Scheduler to automate stop/start.
  2. Rightsize over-provisioned instances. Run Compute Optimizer or cross-check CloudWatch metrics. Downsize to the smallest instance type that meets your workload’s peak requirements.
  3. Remove orphaned EBS volumes and Elastic IPs. Unattached volumes and unused IPs accrue costs. List and delete them monthly.
  4. Set S3 lifecycle policies. Transition logs and backups to cheaper storage classes (Standard-IA → Glacier → Deep Archive) and expire old objects.
  5. Abort incomplete multipart uploads. Add a lifecycle rule to remove incomplete uploads older than 7 days. They accumulate hidden storage cost.
  6. Enforce tagging on all resources. Use policy-as-code checks on your IaC (for example Checkov for Terraform or cfn-guard for CloudFormation) to require Owner, Environment, and CostCenter tags. Automatically terminate untagged resources.
  7. Buy Reserved Instances or Savings Plans only after rightsizing. Commit for steady-state, rightsized workloads. Start with 1-year partial upfront to preserve flexibility.
  8. Use Spot instances for fault-tolerant workloads. Batch processing, CI/CD workers, and stateless microservices are ideal. Implement termination handling.
  9. Set up cost anomaly alerts. Use AWS Cost Anomaly Detection, AWS Budgets, or third-party tools to alert on >30% increases in top services. Assign an owner to review alerts weekly.
  10. Conduct monthly cost reviews. Review Cost Explorer, look for new services or unexplained spikes. Maintain a shared spreadsheet of cost optimization actions.
Automate Where Possible
Manually checking all 10 points every month is tedious. Automate steps 1–5 with AWS Config rules, Lambda functions, and S3 lifecycle policies. Reserve manual review for steps 6–10.
Production Insight
A team I worked with applied this checklist in a two-day cost audit blitz. They found and fixed 14 idle databases, 23 oversized instances, and 6 orphaned volumes. Savings: $18,000/month. The checklist turned a reactive firefight into a repeatable process.
Rule: Turn this checklist into a quarterly automated report. If you can't automate it, write a runbook and schedule a 2-hour cost review every month.
Key Takeaway
A 10-point checklist gives you a repeatable, quick way to identify and fix the most common cloud waste sources. Automate what you can; review the rest monthly.
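Checklist item 3 is one of the easiest to script. The sketch below is one possible way to do it, not a finished tool: the two filter functions work on the dictionaries that the EC2 `describe_volumes` and `describe_addresses` calls return, and `fetch_orphans` (a hypothetical helper name) shows how they could be wired to boto3. It only lists candidates; deleting them is left as a deliberate manual step.

```python
def unattached_volumes(describe_volumes_response: dict) -> list:
    """Return id and size of EBS volumes in the 'available' (unattached) state."""
    return [
        {"VolumeId": v["VolumeId"], "SizeGiB": v["Size"]}
        for v in describe_volumes_response.get("Volumes", [])
        if v.get("State") == "available"
    ]


def unassociated_addresses(describe_addresses_response: dict) -> list:
    """Return allocation ids of Elastic IPs that are not associated with anything."""
    return [
        a.get("AllocationId", a.get("PublicIp"))
        for a in describe_addresses_response.get("Addresses", [])
        if "AssociationId" not in a
    ]


def fetch_orphans(region: str):
    """Query the real EC2 API. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the pure functions above work without it

    ec2 = boto3.client("ec2", region_name=region)
    volumes = []
    for page in ec2.get_paginator("describe_volumes").paginate():
        volumes.extend(unattached_volumes(page))
    addresses = unassociated_addresses(ec2.describe_addresses())
    return volumes, addresses
```

Running `fetch_orphans("us-east-1")` once a month and posting the result to the owning team's channel covers the "list" half of the checklist item.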

Rightsizing: Match Instance Size to Actual Workload

Rightsizing means picking the instance type and size that matches your workload's actual resource consumption. The default — t3.medium for everything — is almost always wrong. A web server that serves 100 req/s might need 4 vCPUs; one that serves 5 req/s might run fine on a burstable nano.

AWS Compute Optimizer and Azure Advisor give recommendations based on historical CPU, memory, and network utilisation. But they're conservative — they only suggest downsizing if utilisation is below a threshold (e.g., 40% max CPU over 2 weeks). In practice, most engineering teams can downsize 50% of their instances with no performance impact.

The mechanics:
  • Over-provisioned CPU: Burstable instances (T3/T4g) accumulate CPU credits when idle; check the credit balance before downsizing.
  • Over-provisioned memory: EC2 does not report memory by default, so use the CloudWatch agent's mem_used_percent metric (or FreeableMemory for RDS); RDS instances often run at <20% memory.
  • Over-provisioned I/O: EBS gp3 volumes let you lower provisioned IOPS and throughput to match actual usage patterns.

Rightsizing is a continuous process. Quarterly reviews catch new waste from autoscaling groups that launched larger instances during a peak and never scaled down.

rightsize_check.sh (Bash)
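#!/bin/bash
# TheCodeForge: check whether one EC2 instance averaged <10% CPU over the last 14 days
# Requires GNU date (Linux) and jq. Replace the instance ID with your own.
# Prints true or false; prints "no data" when CloudWatch returns no datapoints.
INSTANCE_ID="i-1234567890abcdef0"

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions "Name=InstanceId,Value=${INSTANCE_ID}" \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT00:00:00Z)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 86400 \
  --statistics Average \
  --query 'Datapoints[*].Average' \
  --output json | jq 'if length == 0 then "no data" else (add / length) < 10 end'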
Rightsizing Trap
Don't downsize burstable instances (T3/T4g) without checking the CPU credit balance. A burstable instance with zero credits in standard mode is throttled to its baseline performance. Switch to unlimited mode or a non-burstable type first.
Production Insight
Rightsizing a production RDS instance from db.r5.xlarge to db.r5.large saves ~$200/month. If you have 20 such instances, that's $48,000/year. No code change, only a few minutes of metric review and a short restart (or a failover on Multi-AZ) that you can schedule in a maintenance window.
But also consider: Multi-AZ runs a standby alongside the primary, so you pay for roughly two instances. Rightsizing shrinks both, and for dev environments switching to single-AZ removes the second one entirely.
Rule: Rightsize dev and staging before production — they're often 2-3x over-provisioned.
Key Takeaway
Rightsizing is the lowest-effort, highest-return cost optimisation action.
Target 50% of instances for downsizing; average 30% savings.
Use Compute Optimizer, but validate with actual metrics before changing.
Right-size Decision
If: CPU <10% and memory <30% for 14 consecutive days
Use: Downsize to the next lower instance size. If already at the smallest size, consider switching to burstable (T3/T4g).
If: CPU <10% but memory >70%
Use: Cut vCPUs but keep the memory. Move to a memory-optimised family with fewer vCPUs for the same RAM (e.g., r5.large with 2 vCPU/16 GiB instead of m5.xlarge with 4 vCPU/16 GiB).
If: CPU spikes but average is low
Use: Burstable (T3/T4g) with 'unlimited' mode, monitoring CPUSurplusCreditsCharged; move to compute-optimised (C5/C6i) if the spikes turn out to be sustained.
If: Instance has less than 2 weeks of utilisation data
Use: Do not rightsize yet; wait until at least 2 weeks of data is available.
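The decision grid above can be sketched as a small function, which is handy if you want to run it across an export of CloudWatch metrics. The `Utilisation` class, the returned labels, and the 70% spike threshold are assumptions made for this illustration; the 10%, 30%, 70% and 14-day cut-offs come from the grid.

```python
from dataclasses import dataclass

SPIKE_THRESHOLD_PCT = 70.0  # assumption: a max CPU at or above this counts as a "spike"


@dataclass
class Utilisation:
    avg_cpu_pct: float
    max_cpu_pct: float
    avg_mem_pct: float
    days_observed: int


def rightsize_recommendation(u: Utilisation) -> str:
    """Map two weeks of utilisation data to one of the actions in the grid."""
    if u.days_observed < 14:
        return "wait"              # not enough data yet
    if u.avg_cpu_pct < 10 and u.max_cpu_pct >= SPIKE_THRESHOLD_PCT:
        return "burstable"         # low average, spiky peaks
    if u.avg_cpu_pct < 10 and u.avg_mem_pct < 30:
        return "downsize"          # both CPU and memory are idle
    if u.avg_cpu_pct < 10 and u.avg_mem_pct > 70:
        return "memory-optimised"  # fewer vCPUs, same RAM
    return "keep"


print(rightsize_recommendation(Utilisation(4.0, 12.0, 18.0, 30)))
```

The spike check runs before the plain downsize check on purpose: an instance that idles at 4% but regularly peaks near 100% should not be shrunk on its average alone.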

Reserved Instances and Savings Plans: Commit to Save

Reserved Instances (RIs) and Savings Plans are discount programs in exchange for committed usage. If you know you'll run a database or web server for the next 1-3 years, you can pay upfront or partially upfront and get 30-72% off the on-demand rate.

The key distinction:
  • Standard RIs: Locked to a specific instance family (e.g., m5), region, and platform. Highest discount, least flexibility.
  • Convertible RIs: Can be exchanged for a different instance family, but at a lower discount (40-60%).
  • Compute Savings Plans: Apply to EC2, Fargate, and Lambda usage in any region, including the EC2 instances behind ECS or EKS. More flexible, slightly lower discount than RIs.
  • EC2 Instance Savings Plans: Apply to a specific instance family within a region.

Senior engineers treat RIs as a second-order optimisation, after rightsizing. You don't want to commit to an m5.large for 3 years only to discover you could have moved to t3.medium and saved more without the commitment.

One real strategy: use a mix of 1-year partial upfront for predictable workloads and 3-year all upfront for baseline production. This balances discount depth with financial flexibility.

ri_analysis.py (Python)
# TheCodeForge: find the EC2 instance types that dominate spend, as input to RI planning
from datetime import date, timedelta

import boto3

ce = boto3.client('ce')

# Cost Explorer treats End as exclusive, so this covers the last 30 full days
end = date.today()
start = end - timedelta(days=30)

request = dict(
    TimePeriod={'Start': start.isoformat(), 'End': end.isoformat()},
    Granularity='DAILY',
    Metrics=['UsageQuantity', 'UnblendedCost'],
    # Restrict to EC2 compute so other services do not skew the ranking
    Filter={'Dimensions': {'Key': 'SERVICE',
                           'Values': ['Amazon Elastic Compute Cloud - Compute']}},
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'INSTANCE_TYPE'}],
)

# Sum cost per instance type, following pagination tokens
costs = {}
while True:
    usage = ce.get_cost_and_usage(**request)
    for item in usage['ResultsByTime']:
        for group in item['Groups']:
            key = group['Keys'][0]
            costs[key] = costs.get(key, 0) + float(group['Metrics']['UnblendedCost']['Amount'])
    if 'NextPageToken' not in usage:
        break
    request['NextPageToken'] = usage['NextPageToken']

top_5 = sorted(costs.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 instance types by cost:")
for instance_type, cost in top_5:
    print(f"{instance_type}: ${cost:.2f}")

# Suggested RI coverage: 50% of the top instance type with 1-year partial upfront
Reserved Instances Myth
Many teams buy RIs because they think it 'saves money automatically.' But if you buy an RI for an instance that you later stop, you still pay for it. You can't cancel RIs; at best you can try to resell Standard RIs on the Reserved Instance Marketplace. Only buy RIs for workloads that are rightsized and will definitely stay running.
Production Insight
I've seen teams buy $500k in 3-year RIs because they believed the 60% discount was a no-brainer. Then they migrated to containers and only used 30% of the RI commitment. The discount is huge — until you need to change architecture. Compute Savings Plans avoid this lock-in for containerised workloads.
Rule: Buy RIs only after at least 6 months of consistent usage data. Start with 1-year partial upfront for flexibility.
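A quick way to sanity-check a commitment is its break-even utilisation. The sketch below uses a deliberately simplified model (an assumption of this example): an RI with discount `d` bills `1 - d` of the on-demand rate for every hour of the term, while on-demand bills the full rate only for the hours actually used. It ignores upfront cash and resale value.

```python
def break_even_utilisation(discount: float) -> float:
    """Fraction of the term the instance must run for the RI to match on-demand cost."""
    return 1 - discount


def ri_is_cheaper(utilisation: float, discount: float) -> bool:
    """True if paying the discounted rate for every hour beats on-demand for used hours."""
    return (1 - discount) < utilisation


# The scenario from the insight above: a 60% discount, but only 30% of the commitment used.
print(ri_is_cheaper(utilisation=0.30, discount=0.60))
print(ri_is_cheaper(utilisation=0.50, discount=0.60))
```

With a 60% discount the break-even is 40% utilisation. A team that ends up using 30% of its commitment pays more than it would have on-demand, despite the headline discount.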
Key Takeaway
Reserved instances and savings plans are powerful, but only after rightsizing.
Start with 1-year commitments; add more as usage patterns stabilise.
Compute Savings Plans offer the best balance for modern, containerised architectures.
Reserved Instance or Savings Plan?
If: Workload is containerised (ECS/EKS/Fargate)
Use: Compute Savings Plan, which covers both EC2 and container compute.
If: Workload runs on a specific instance type with no changes expected
Use: Standard 3-year all upfront RI for maximum discount.
If: Not sure if workload will change architecture in 2 years
Use: 1-year partial upfront RI or Compute Savings Plan, to maximise flexibility.

Savings Plans vs Reserved Instances: Which Should You Choose?

Both Reserved Instances (RIs) and Savings Plans offer deep discounts in exchange for committing to a certain dollar amount per hour (Savings Plans) or a specific instance configuration (RIs). The right choice depends on how predictable and stable your workloads are.

Standard RIs give the highest discount (up to 72%) but lock you to a specific instance family in a specific region. They are best for steady-state, long-lived workloads that won't change architecture — e.g., a production MySQL database on db.r5.large.

Convertible RIs offer slightly lower discounts (40-60%) but can be exchanged for a different instance family, size, operating system, or tenancy within the same region. They’re useful if you anticipate moderate changes but still want a commitment discount.

Compute Savings Plans apply to EC2, Fargate, and Lambda usage in any region, which includes the EC2 instances behind ECS and EKS clusters. Discounts are 30-50%, but you gain flexibility to change instance types, sizes, regions, or even move between compute services. Ideal for containerized workloads.

EC2 Instance Savings Plans are similar to Standard RIs — they apply to a specific instance family in a region — but are slightly more flexible because they cover any instance size within that family.

Feature | Standard RI | Convertible RI | Compute Savings Plan | EC2 Instance Savings Plan
Discount depth (1yr) | ~40% | ~30% | ~30% | ~35%
Discount depth (3yr) | ~60-72% | ~50-60% | ~45-55% | ~55-65%
Instance family locked? | Yes | Can exchange | No | Yes (family only)
Region locked? | Yes | Yes | No (floating) | Yes (region)
Scope | Single AZ or entire region | Single AZ or entire region | All regions | Per region
Best for | Predictable, immutable workloads | Workloads with future migration plans | Containerized or serverless architectures | Steady but variable instance sizes

If your architecture is cloud-native and you use containers or Lambda, Compute Savings Plans are the safest bet. If you have legacy monoliths on known instance types, Standard RIs maximize savings. Never buy a 3-year RI until you have at least 6 months of stable usage data — financial flexibility is worth the lower discount.

The 80% Rule
A common rule-of-thumb: cover 80% of your steady baseline compute spend with commitment discounts (RIs or Savings Plans). Leave the remaining 20% on-demand to absorb spikes. This balances savings with flexibility. Review coverage quarterly and adjust.
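One way to turn the 80% rule into a number is sketched below. Treating the baseline as a low percentile of hourly spend is an assumption of this example (the 10th percentile default is arbitrary), and the inputs are on-demand-equivalent dollars per hour. A real Savings Plan commitment is expressed in discounted dollars per hour, so treat the result as a starting point for the AWS purchase recommendations rather than a figure to buy directly.

```python
def baseline_spend(hourly_spend: list, percentile: float = 10.0) -> float:
    """Spend level that the workload stays at or above for most hours."""
    if not hourly_spend:
        raise ValueError("hourly_spend must not be empty")
    ordered = sorted(hourly_spend)
    index = min(int(len(ordered) * percentile / 100), len(ordered) - 1)
    return ordered[index]


def commitment_target(hourly_spend: list, coverage: float = 0.8,
                      percentile: float = 10.0) -> float:
    """Hourly spend to cover with commitments: a share of the steady baseline."""
    return coverage * baseline_spend(hourly_spend, percentile)


# Hypothetical day: $10/hour for 20 hours, $30/hour during a 4-hour peak.
day = [10.0] * 20 + [30.0] * 4
print(round(commitment_target(day), 2))
```

The peak hours do not move the target at all, which is the point: commitments should follow the floor of your usage, not the average.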
Production Insight
I’ve seen multi-million-dollar environments where the team bought 3-year Standard RIs for their entire fleet, then migrated to EKS six months later. Most of the RIs no longer matched the instance types the new architecture ran on, so the commitment was largely wasted. A Compute Savings Plan would have covered the same spend with less lock-in. The lesson: flexibility compounds in value as your architecture evolves.
Rule: Use Compute Savings Plans as the default; reserve Standard RIs only for absolutely static workloads like licensed databases on specific instance types.
Key Takeaway
Savings Plans offer more flexibility than Reserved Instances, usually for a modestly lower discount. For modern, containerized architectures, choose Compute Savings Plans. Only buy Standard RIs for workloads that won't change for the full term.

Spot and Preemptible Instances: Cheap Compute for the Brave

Spot instances (AWS) and preemptible VMs (GCP) let you use spare cloud capacity at a steep discount, typically 60-90% off the on-demand price. The trade-off: the cloud provider can reclaim the instance at very short notice (2 minutes in AWS, 30 seconds in GCP).

This isn't a free-for-all. It works best for:
  • Batch processing (data pipelines, CI/CD workers)
  • Stateless microservices that can tolerate interruption
  • Fault-tolerant distributed jobs (e.g., training ML models with checkpointing)
  • Rendering or simulation workloads

The challenge is handling termination gracefully. Your application must save state or be able to restart. AWS sends a Spot Instance Termination Notice event 2 minutes before reclaiming the instance. You can catch this via the instance metadata endpoint.

Avoid spot for stateful databases, single-instance workloads, or anything that can't tolerate an abrupt stop. The cost savings are real, but the operational cost of re-architecting for spot can be significant.

spot_termination_listener.py (Python)
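# TheCodeForge: poll instance metadata for a spot interruption notice and checkpoint work
import time

import requests

IMDS = "http://169.254.169.254/latest"
# instance-action is the current endpoint; termination-time is kept by AWS for backward compatibility
ACTION_URL = f"{IMDS}/meta-data/spot/instance-action"

def imds_token():
    # IMDSv2 needs a session token; instances that enforce it reject token-less requests
    response = requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
        timeout=2,
    )
    response.raise_for_status()
    return response.text

def check_termination():
    try:
        response = requests.get(
            ACTION_URL, headers={"X-aws-ec2-metadata-token": imds_token()}, timeout=2
        )
        # 404 means no interruption is scheduled; 200 carries the action and its time
        if response.status_code == 200:
            print("⚠️  Spot termination imminent. Checkpointing...")
            # Save state to S3 or distribute work
            return True
    except requests.exceptions.RequestException:
        pass  # metadata service unreachable; try again on the next poll
    return False

while True:
    if check_termination():
        # Graceful shutdown logic
        break
    # Normal processing
    time.sleep(5)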
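Detecting the notice is only half of graceful handling; the job also has to be resumable. The sketch below shows one simple checkpointing pattern under stated assumptions: work is a list of independent items, progress is a single index, and the checkpoint is a local JSON file. On real spot instances the checkpoint would need to live somewhere that survives the instance, such as S3 or a shared filesystem. `run_job`, `load_checkpoint` and `save_checkpoint` are hypothetical names for this example.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> int:
    """Return the index of the next item to process (0 if there is no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write via a temp file and rename, so an interruption cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def run_job(items, process, checkpoint_path, should_stop=lambda: False) -> bool:
    """Process items in order, resuming from the checkpoint. Returns True when all are done."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        if should_stop():  # e.g. a function that polls the spot interruption notice
            return False
        process(items[i])
        save_checkpoint(checkpoint_path, i + 1)
    return True
```

Passing the termination check as `should_stop` keeps the two concerns separate: the job stops between items, and a replacement instance calling `run_job` with the same checkpoint path picks up where the reclaimed one left off.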
Spot Fleet Diversity
Use multiple instance types (c5.large, m5.large, r5.large) in your spot fleet request. AWS can then pick from whichever have spare capacity, reducing interruptions. Set the allocation strategy to capacityOptimized (spelled capacity-optimized in EC2 Fleet and Auto Scaling groups) for the lowest interruption risk.
Production Insight
A company I consulted for saved 72% on their Spark cluster by switching to spot instances. But they forgot to set up termination handling, so the cluster restarted from scratch 4 times a day because nodes got reclaimed mid-job. After adding checkpointing to HDFS and a spot fleet with fallback to on-demand, savings were real and stable.
Rule: Always set a target capacity with a mix of spot and on-demand. Use lowestPrice only for extremely fault-tolerant workloads; use capacityOptimized for production.
Key Takeaway
Spot instances can cut compute costs by 60-90%, but require architectural changes.
Always handle termination notices and use with fault-tolerant workloads.
Combine spot with on-demand capacity pools for reliability SLAs.
Should You Use Spot or On-Demand?
If: Workload is stateless, batch, or time-flexible
Use: Spot all the way; target 70-90% of instances from spot.
If: Workload is stateful, transactional, or single-instance
Use: On-demand or reserved. Spot is too risky.
If: Workload can tolerate interruptions but needs an SLA
Use: A spot fleet with an on-demand base capacity, so on-demand instances (at a higher price) hold the SLA when spot capacity is reclaimed.

Tagging and Cost Allocation: Visibility Is the Prerequisite

You can't optimise what you can't see. Tagging is the foundation of cost visibility — attaching metadata (key-value pairs) to every cloud resource so you can attribute costs to teams, products, environments, or cost centers.

But tags are only useful if they're:
  1. Mandatory: IaC policies should reject deployments without required tags.
  2. Standardised: A fixed set of tags (e.g., Owner, Environment, CostCenter, Project) used everywhere.
  3. Enforced: Resources created without required tags are automatically terminated or reported.

Without enforcement, tags become optional and quickly rot. After three months, half your resources are untagged and you're back to guessing who's spending what.

AWS supports cost allocation tags, both AWS-generated and user-defined, which appear in Cost Explorer and detailed billing reports once you activate them in the Billing console. Once tags are in place, you can run reports per team, set budget alerts for specific CostCenter tags, and even block untagged resources from launching via SCPs (Service Control Policies) or clean them up afterwards with custom Lambda functions.

lambda_enforce_tags.py (Python)
# TheCodeForge: AWS Lambda function that reports (and can terminate) instances missing required tags
import boto3

REQUIRED_TAGS = ['Owner', 'Environment', 'CostCenter']

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    # Paginate: describe_instances returns results in pages on large accounts
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running', 'stopped']}]
    )

    non_compliant = []
    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
                missing = [tag for tag in REQUIRED_TAGS if tag not in tags]
                if missing:
                    print(f"Instance {instance['InstanceId']} is missing tags: {missing}")
                    non_compliant.append(instance['InstanceId'])
                    # Option 1: Send alert via SNS
                    # Option 2: Terminate directly (use with caution!)
                    # ec2.terminate_instances(InstanceIds=[instance['InstanceId']])
    return {'non_compliant': non_compliant}
Tags Are Metadata Contracts
  • Mandatory tags: Owner, Environment, CostCenter, Project.
  • Enforcement at deployment: refuse to create resources without tags.
  • Automated cleanup: an SCP that denies untagged launches, plus a Lambda that tags, reports, or terminates unknown resources.
  • Report weekly: Cost Explorer per tag → Share with team leads.
Production Insight
A common failure: teams create a tagging policy but don't enforce it. Three months later, they run a cost report and see 60% of resources are untagged. The report is useless. You have to enforce tags at the infrastructure level, with Terraform validation, SCPs, or CloudFormation Guard rules. Otherwise, the tagging policy is a suggestion, not a rule.
Rule: Enforce tags in IaC before you create any resource. Don't let untagged resources reach production.
Key Takeaway
Tags turn cloud spend from a black box into a per-team, per-project breakdown.
Enforce tags at the IaC level, not through documentation.
Without enforcement, tagging is theater — worthless for cost optimisation.
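Enforcing tags at the IaC level can be as simple as a CI step that inspects the Terraform plan before it is applied. The sketch below assumes the plan has been exported with `terraform show -json` and reads the `resource_changes` list from that JSON. The function name and the required tag set are choices made for this example; resource types without a `tags` attribute are skipped.

```python
REQUIRED_TAGS = ("Owner", "Environment", "CostCenter")


def resources_missing_tags(plan: dict, required=REQUIRED_TAGS) -> dict:
    """Map resource address -> list of missing tag keys, for resources the plan creates or updates."""
    problems = {}
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after") or {}  # 'after' is null for deletions
        if "tags" not in after and "tags_all" not in after:
            continue  # this resource type is not taggable (or is being deleted)
        # tags_all also contains provider-level default tags when they are known at plan time
        tags = after.get("tags_all") or after.get("tags") or {}
        missing = [key for key in required if key not in tags]
        if missing:
            problems[rc["address"]] = missing
    return problems


example_plan = {
    "resource_changes": [
        {"address": "aws_instance.web",
         "change": {"actions": ["create"],
                    "after": {"tags": {"Owner": "team-a", "Environment": "dev"}}}},
    ]
}
print(resources_missing_tags(example_plan))
```

In a pipeline, a non-empty result would fail the build, so an untagged resource never reaches `terraform apply`.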
Enforce Tags or Not?
If: You have a tag policy but untagged resources are common
Use: Implement automatic tagging for existing resources, then enforce for new ones.
If: You need per-team cost allocation
Use: A CostCenter tag, activated as a Cost Allocation Tag in the AWS Billing console.
If: You're about to start a migration
Use: Include tag enforcement in the migration checklist; it's easier than retrofitting.

The Cost of Untagged Resources: A Visual Impact

Tags are the single most important tool for cost attribution, but they only work when enforced. Untagged resources are invisible in cost reports — they fall into a generic "Untagged" bucket that tells you nothing about ownership, project, or environment. This invisibility is a direct cause of cloud waste.

The diagram below shows the chain reaction from an untagged resource to a growing bill. When a developer provisions an EC2 instance without tags:

  1. The instance shows up in Cost Explorer under "No Tag: Environment".
  2. The operations team doesn’t know who owns it or why it exists.
  3. No one is responsible for stopping it, so it runs indefinitely.
  4. The cost is buried in a generic line item — nobody notices.
  5. After months, the bill has crept up by thousands of dollars.

With tagging, the opposite happens: resources are automatically attributed to a team, cost allocation reports surface anomalies, and the team lead gets a budget alert when spend exceeds threshold.

The impact is not just financial. Untagged resources slow down incident response (who to page?), hinder compliance audits, and make capacity planning guesswork. A single untagged production database can delay a root cause analysis by hours because no one knows who launched it.

The visual below summarises the flow from untagged resource to cost leak. Use it to justify enforcing tag policies across your organization.

The Untagged Resource Tax
At $12,000/month per 100 untagged instances, the "tax" adds up fast. A single untagged RDS database can cost $8,500/month (like in our production incident). Enforcing tags from day one is the cheapest way to avoid this tax.
Production Insight
In the $120K forgotten database incident, the root cause wasn’t just the untagged RDS — it was the lack of visibility. The instance was launched by a contractor who left the company. Without an Owner tag, no one knew who to ask. Without an Environment tag, it wasn't flagged as dev/staging. The bill grew silently for 14 months. A simple mandatory tag policy with automated enforcement would have stopped it in its tracks.
Rule: Always enforce tags via IaC or service control policies. If a resource cannot be created without tags, it cannot be forgotten.
Key Takeaway
Untagged resources are invisible cost leaks. Enforcing tagging at the infrastructure level turns a black box of cloud spend into a transparent, per-team breakdown. Visibility alone can reduce waste by 20-30%.
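To put a number on the untagged bucket, you can group Cost Explorer results by a tag key and measure how much spend has no value for it. The sketch below works on the response dictionary of `get_cost_and_usage` grouped with `{'Type': 'TAG', 'Key': 'Owner'}`. It relies on my understanding that Cost Explorer returns group keys in the form `Owner$<value>`, with an empty value for untagged spend; verify that against your own output before relying on it.

```python
def untagged_share(response: dict, tag_key: str, metric: str = "UnblendedCost") -> float:
    """Fraction of total cost that carries no value for the given tag key."""
    untagged = 0.0
    total = 0.0
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            amount = float(group["Metrics"][metric]["Amount"])
            total += amount
            # Assumed key format: "<tag_key>$<value>"; an empty value means untagged
            if group["Keys"][0] == f"{tag_key}$":
                untagged += amount
    return untagged / total if total else 0.0


# Hand-written sample in the shape of a Cost Explorer response (not real billing data)
sample = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["Owner$"], "Metrics": {"UnblendedCost": {"Amount": "60.0"}}},
            {"Keys": ["Owner$team-a"], "Metrics": {"UnblendedCost": {"Amount": "40.0"}}},
        ]}
    ]
}
print(f"{untagged_share(sample, 'Owner'):.0%} of spend has no Owner tag")
```

Tracking this percentage week over week gives you a single metric for whether tag enforcement is actually working.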

Storage Lifecycle Management: Stop Paying for Old Logs

Storage costs are the sneakiest creepers in your cloud bill. Data accumulates, and once written, it almost never gets deleted. S3 charges per GB per month, and that cost grows linearly with data volume. A project that writes 500GB of logs a month and deletes nothing will be paying for 6TB after a year.

The solution is lifecycle policies: automatically transition objects to cheaper storage classes and expire them after a set period. Typical data lifecycle:
  • 0-30 days: S3 Standard (hot, frequent access)
  • 30-90 days: S3 Infrequent Access (lower storage cost, higher retrieval cost)
  • 90-365 days: S3 Glacier (long-term archival, retrieval takes minutes to hours)
  • 365+ days: S3 Glacier Deep Archive (cheapest, retrieval takes 12-48 hours)
  • Expiration: Delete objects after, say, 3 years.

Set these policies on every bucket from day one. Lifecycle rules also apply to objects already in the bucket, so you can add them retroactively, but transitioning millions of existing objects incurs per-request charges.

Also watch for incomplete multipart uploads. S3 charges you for the chunks even if the upload never finished. They accumulate silently. Add a rule to expire incomplete uploads after 7 days.

lifecycle-policy.json (JSON)
{
  "Rules": [
    {
      "Id": "Standard-to-IA-after-30d",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 1095
      }
    },
    {
      "Id": "AbortIncompleteMultipartUpload",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}
Glacier Deep Archive Pricing
S3 Glacier Deep Archive costs about $1/TB per month, roughly 1/20th the cost of S3 Standard. But retrieval takes up to 12 hours (standard) or 48 hours (bulk). Perfect for compliance archives, old logs, and backup data you hope you'll never need.
Production Insight
I audited an AWS account that had been running for 5 years. Their S3 storage cost was $28k/month. 60% of it was logs older than 90 days in S3 Standard. After applying lifecycle policies to transition old logs to Glacier Deep Archive and expire them after 3 years, the monthly storage bill dropped to $6k. No data lost. Just a policy change.
Rule: Never leave logs in Standard beyond 30 days. Set lifecycle policies on the day you create the bucket — retrofitting is harder.
Key Takeaway
Storage costs grow linearly with data volume — lifecycle policies are your only defence.
Transition old data to cheaper tiers automatically.
Expire incomplete multipart uploads — they're invisible cost leaks.
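A rough model makes the effect of lifecycle tiers visible. The sketch below follows the 30/90/365-day transitions and 3-year expiry described in this section. The per-GB prices are approximate list prices used as illustrative assumptions, and the model ignores request, transition, and retrieval fees as well as minimum storage durations, so treat the output as an order-of-magnitude comparison only.

```python
from typing import Optional

# Assumed prices in USD per GB-month (approximate, for illustration only)
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}


def storage_class_for_age(age_days: int) -> Optional[str]:
    """Storage class an object is in at a given age, or None once it has expired."""
    if age_days >= 1095:
        return None
    if age_days >= 365:
        return "DEEP_ARCHIVE"
    if age_days >= 90:
        return "GLACIER"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"


def monthly_bill(gb_per_month: float, months: int, use_lifecycle: bool) -> float:
    """Storage bill for one month, after `months` monthly batches have been written."""
    total = 0.0
    for batch in range(months):
        age_days = batch * 30  # the newest batch is 0 days old
        storage_class = storage_class_for_age(age_days) if use_lifecycle else "STANDARD"
        if storage_class is not None:
            total += gb_per_month * PRICE_PER_GB_MONTH[storage_class]
    return total


for months in (12, 36):
    flat = monthly_bill(500, months, use_lifecycle=False)
    tiered = monthly_bill(500, months, use_lifecycle=True)
    print(f"after {months} months: ${flat:.2f} without lifecycle, ${tiered:.2f} with")
```

The gap widens every month, because without a policy each new batch of logs is billed at the Standard rate indefinitely.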
Storage Class For Your Data
If: Data accessed more than once per month
Use: S3 Standard.
If: Data accessed less than once per month, but retrieval needs to be fast (<5 min)
Use: S3 Standard-IA or One Zone-IA (cheaper, single AZ).
If: Data must be retained for compliance but accessed rarely (<1% annually)
Use: S3 Glacier Deep Archive at about $1/TB/month, retrieval within 12 hours.
If: Data is temporary or processing intermediate files
Use: S3 lifecycle expiration to delete after 7-30 days. Set Abort incomplete multipart uploads to 7 days.
● Production incident · POST-MORTEM · severity: high

The Forgotten Dev Database That Cost $120,000

Symptom
Monthly AWS bill gradually increased from $45k to $110k over a year, with no clear spike — just a steady upward drift. The RDS line item alone was $8,500/month.
Assumption
Teams assumed the finance department tracked cost allocation. They didn't. Everyone thought someone else was monitoring the staging environment.
Root cause
A single db.r5.4xlarge RDS instance in us-east-1 was left running after the project was shelved. No auto-stop policy, no tag indicating ownership, and no CloudWatch alarm on cost anomalies.
Fix
Stopped the instance, took a final snapshot, then deleted it. Set up a monthly cost anomaly budget with AWS Budgets. Added a mandatory 'expiration_date' tag enforced via a Lambda function that terminates untagged resources older than 90 days.
Key lesson
  • Tag every resource with owner, environment, and expiration date — enforce it in IaC.
  • Set up cost anomaly alerts from day one, not after the bill shocks you.
  • Treat staging environments like production: use auto-stop schedules and instance scheduler.
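The enforcement logic behind such a Lambda might look roughly like the sketch below. It is not the function used in the incident. The 90-day window for untagged resources comes from the fix described above; the `notify` state for younger untagged resources and the rule that a passed `expiration_date` means termination are assumptions, as are all names.

```python
from datetime import date

REQUIRED_TAGS = ("owner", "environment", "expiration_date")
GRACE_DAYS = 90  # untagged resources older than this are terminated


def enforcement_action(tags: dict, created: date, today: date) -> str:
    """Return 'keep', 'notify', or 'terminate' for a single resource."""
    missing = [t for t in REQUIRED_TAGS if t not in tags]
    if missing:
        age_days = (today - created).days
        # Assumption: young untagged resources get a warning, not termination.
        return "terminate" if age_days > GRACE_DAYS else "notify"
    # Assumption: expiration_date is an ISO date string such as "2026-12-31".
    expires = date.fromisoformat(tags["expiration_date"])
    return "terminate" if expires < today else "keep"


today = date(2026, 3, 6)
print(enforcement_action({}, date(2025, 11, 1), today))
print(enforcement_action({"owner": "team-a"}, date(2026, 2, 1), today))
```

Keeping the decision in a pure function makes it easy to dry-run against an inventory export before letting anything terminate real resources.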
Production debug guide
When your cloud bill jumps, follow this symptom→action grid to find the leak fast.
5 entries
Symptom · 01
Monthly bill increased by 30%+ with no deployment change
Fix
Open AWS Cost Explorer → Group by service → look for the top-3 cost drivers that changed. Filter by 'yesterday' vs 'same day last week'.
Symptom · 02
A specific service (e.g., RDS) shows rising cost despite no new instances
Fix
Check for unused Multi-AZ deployments. Often failover tests leave standby replicas running at full cost. Also verify that old snapshots are not auto-retained beyond the policy.
Symptom · 03
Data transfer costs suddenly spike
Fix
Look for cross-region traffic or NAT Gateway processing. Use VPC Flow Logs to identify heavy traffic sources. Filter by bytes sent to public addresses.
Symptom · 04
Storage costs increase but file count is stable
Fix
Check S3 storage class transition policies. Old objects may still be sitting in S3 Standard when they should be in Glacier. Also review incomplete multipart uploads: they accumulate hidden chunks that never show up as objects.
Symptom · 05
Team A suspects Team B is driving up costs
Fix
Enforce cost allocation tags and run the AWS Cost & Usage Report with tag splits. If no tags exist, use CloudFormation/SAM templates to retroactively tag resources by creation time pattern.
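The first step in the grid (find the top cost drivers that changed) boils down to a diff between two periods. A minimal sketch, assuming per-service totals have already been exported from Cost Explorer into two dictionaries; the figures and the function name are made up for illustration.

```python
def rank_cost_changes(previous: dict, current: dict, top: int = 3) -> list:
    """Return (service, delta, pct_change) tuples, largest increase first."""
    rows = []
    for service in set(previous) | set(current):
        before = previous.get(service, 0.0)
        after = current.get(service, 0.0)
        delta = after - before
        # A service with no prior spend has no meaningful percentage change.
        pct = round(delta / before * 100, 1) if before else None
        rows.append((service, round(delta, 2), pct))
    # Sort by increase, then by name so ties are deterministic.
    rows.sort(key=lambda r: (-r[1], r[0]))
    return rows[:top]


# Hypothetical monthly totals in USD.
last_month = {"EC2": 20000.0, "RDS": 5000.0, "S3": 3000.0}
this_month = {"EC2": 21000.0, "RDS": 8500.0, "S3": 3000.0, "NAT Gateway": 900.0}

for service, delta, pct in rank_cost_changes(last_month, this_month):
    print(service, delta, pct)
```

Ranking by absolute change rather than percentage keeps tiny services with large relative swings from crowding out the line items that actually move the bill.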
★ Cost Waste Forensics Playbook
Five-minute checks when you suspect cloud waste is bleeding money.
Idle compute instances
Immediate action
Run AWS Compute Optimizer for EC2 + RDS recommendations
Commands
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" --query "Reservations[*].Instances[*].[InstanceId,LaunchTime,State.Name]" --output table
aws ce get-rightsizing-recommendation --service AmazonEC2
Fix now
Stop instances with <5% CPU over the last week and no current load balancer target membership.
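The "fix now" rule above is a two-condition filter. The sketch below assumes the 7-day CPU averages and load balancer membership have already been collected (for example from CloudWatch and target group lookups); the `InstanceStats` type and the sample instance IDs are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstanceStats:
    instance_id: str
    avg_cpu_7d: float          # average CPU % over the last 7 days
    in_lb_target_group: bool   # currently registered behind a load balancer


def stop_candidates(stats: list, cpu_threshold: float = 5.0) -> list:
    """Return IDs of instances that are idle and not serving LB traffic."""
    return [
        s.instance_id
        for s in stats
        if s.avg_cpu_7d < cpu_threshold and not s.in_lb_target_group
    ]


fleet = [
    InstanceStats("i-aaa", 2.1, False),   # idle, no LB: candidate
    InstanceStats("i-bbb", 3.0, True),    # idle but behind an LB: keep
    InstanceStats("i-ccc", 41.5, False),  # busy: keep
]
print(stop_candidates(fleet))
```

Low CPU alone is not proof of idleness (memory-bound or network-bound workloads exist), so treat the output as a review list rather than an automatic stop list.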
Orphaned EBS volumes / Elastic IPs
Immediate action
List unattached EBS volumes and unused EIPs — they accrue hourly
Commands
aws ec2 describe-volumes --filters "Name=status,Values=available" --query "Volumes[*].[VolumeId,Size,State]" --output table
aws ec2 describe-addresses --query "Addresses[?AssociationId==null].[PublicIp,AllocationId]" --output table
Fix now
Delete or attach orphaned volumes. Release unassociated Elastic IPs.
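To decide how urgently to clean up, it can help to put a rough monthly figure on the orphans the two commands return. This is a sketch only: the default prices below are illustrative assumptions, not quoted from this article, and should be replaced with the current rates for your region and volume type.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def orphan_monthly_cost(volume_sizes_gib: list,
                        idle_eip_count: int,
                        gib_month_price: float = 0.08,   # assumed volume price per GiB-month
                        eip_hour_price: float = 0.005    # assumed idle address price per hour
                        ) -> float:
    """Estimate the monthly cost of unattached volumes and idle Elastic IPs."""
    storage = sum(volume_sizes_gib) * gib_month_price
    addresses = idle_eip_count * eip_hour_price * HOURS_PER_MONTH
    return round(storage + addresses, 2)


# Hypothetical findings: three unattached volumes and four idle addresses.
print(orphan_monthly_cost([100, 500, 1000], idle_eip_count=4))
```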
S3 storage cost out of control
Immediate action
Run S3 Inventory and lifecycle policy review
Commands
aws s3api list-buckets --query "Buckets[].Name" --output text | tr '\t' '\n' | xargs -I {} aws s3api get-bucket-lifecycle-configuration --bucket {}
aws s3 ls s3://[bucket] --recursive --summarize | grep "Total Objects"
Fix now
Set lifecycle rules to transition objects older than 30 days to Glacier Deep Archive and expire after 365 days.
NAT Gateway costs spiking
Immediate action
Check data processed through NAT gateways — they cost $0.045/GB
Commands
aws cloudwatch get-metric-statistics --namespace AWS/NATGateway --metric-name BytesOutToDestination --statistics Sum --start-time 2026-04-21T00:00:00Z --end-time 2026-04-22T00:00:00Z --period 3600 --dimensions Name=NatGatewayId,Value=[gateway-id]
aws ec2 describe-nat-gateways --query "NatGateways[*].[NatGatewayId,VpcId,State]" --output table
Fix now
Redirect traffic through VPC endpoints where possible. Gateway endpoints for S3 and DynamoDB take that traffic off the NAT gateway entirely, which removes its per-GB processing charge.
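The output of the `get-metric-statistics` command can be turned into a dollar figure with a few lines. A minimal sketch, assuming the response's `Datapoints` list has been loaded from JSON; the $0.045/GB rate is the one quoted above, and treating a billing GB as 1024³ bytes is an assumption.

```python
NAT_PRICE_PER_GB = 0.045      # rate quoted in the playbook above
BYTES_PER_GB = 1024 ** 3      # assumption about the billing unit


def nat_processing_cost(datapoints: list) -> float:
    """Sum the 'Sum' field (bytes) of each datapoint and price it per GB."""
    total_bytes = sum(point["Sum"] for point in datapoints)
    return total_bytes / BYTES_PER_GB * NAT_PRICE_PER_GB


# Hypothetical hourly datapoints: 50 GB and 150 GB processed.
sample = [{"Sum": 50 * 1024 ** 3}, {"Sum": 150 * 1024 ** 3}]
print(f"${nat_processing_cost(sample):.2f}")
```

`BytesOutToDestination` covers only one direction of traffic, so the real processing charge for the same window will be higher than this estimate.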
Cost Optimisation Tactics Comparison
Strategy | Savings Potential | Effort to Implement | Risk of Over-Optimisation
Rightsizing | 20-50% | Low (use CloudWatch + Compute Optimizer) | Low: can always scale back up
Reserved Instances / Savings Plans | 30-72% | Medium (requires usage commitment) | Medium: locked into instance family (RIs). Savings Plans are safer.
Spot / Preemptible Instances | 60-90% | High (requires architectural changes for fault tolerance) | High: if not handled, interruptions can cause data loss or downtime
Storage Lifecycle Policies | 50-80% on storage costs | Medium (set once per bucket, but monitoring needed) | Low: retrieval cost trade-off. Test retrieval time.
Tagging + Cost Allocation | Indirect (enables all other optimisations) | Medium (enforcement needed) | None: more visibility never hurts

Key takeaways

1
Cloud waste is invisible: you must actively look for idle, oversized, and orphaned resources.
2
Rightsize first, reserve second, spot third. This order maximises savings with minimum risk.
3
Tags are the prerequisite for any cost visibility. Enforce them in IaC, not in documentation.
4
Storage costs grow linearly: lifecycle policies are non-negotiable.
5
Cost optimisation is a continuous practice, not a one-time cleanup. Run monthly reviews.
6
Discounts on waste are still waste. Never buy reserved capacity for under-utilised resources.

Common mistakes to avoid

5 patterns
×

Buying Reserved Instances before Rightsizing

Symptom
You commit to a 3-year m5.large RI, but your instance is running at 10% CPU. You could have used a t3.medium and saved more without the commitment.
Fix
Always rightsize first. Run Compute Optimizer or check CPU utilisation metrics for at least 2 weeks. Only then buy RIs or Savings Plans for the optimised instance.
×

Using spot instances without termination handling

Symptom
Batch jobs fail randomly, data is lost, and developers lose trust in spot. You end up switching back to on-demand, paying 3x more.
Fix
Implement spot termination listeners that checkpoint state to S3 or a durable queue. Use spot fleets with fallback to on-demand. Educate the team on the trade-offs.
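A termination listener can be as simple as checking for an interruption notice between units of work. The sketch below keeps the polling and checkpointing injectable so it runs anywhere. On a real Spot instance, `poll_notice` would query the instance metadata path `/latest/meta-data/spot/instance-action`, which returns a small JSON document once an interruption is scheduled. The function names and the doubling "work" step are placeholders.

```python
import json
from typing import Callable, Iterable, Optional


def parse_instance_action(body: Optional[str]) -> Optional[str]:
    """Return the scheduled action (e.g. 'terminate') or None if no notice."""
    if not body:
        return None  # empty / not-found response: nothing is scheduled
    return json.loads(body).get("action")


def run_batch(items: Iterable[int],
              poll_notice: Callable[[], Optional[str]],
              save_checkpoint: Callable[[int], None]) -> int:
    """Process items until finished or interrupted; return the count processed."""
    done = 0
    for item in items:
        if parse_instance_action(poll_notice()) is not None:
            save_checkpoint(done)  # persist progress before the instance is reclaimed
            break
        _ = item * 2               # placeholder for the real unit of work
        done += 1
    return done


# Simulated run: the notice arrives before the third item.
notices = iter([None, None, '{"action": "terminate", "time": "2026-03-06T12:00:00Z"}'])
saved = []
processed = run_batch([1, 2, 3, 4], lambda: next(notices), saved.append)
print(processed, saved)
```

In practice `save_checkpoint` would write to S3 or a durable queue, as the fix above suggests, so a replacement instance can resume from the saved offset.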
×

Creating a tagging policy but not enforcing it

Symptom
After 3 months, 50% of resources are untagged. Cost reports are useless. Finance can't allocate costs, and no one knows who owns what.
Fix
Enforce tags in IaC (Terraform validation, SCP, CloudFormation). Use a Lambda to tag or terminate untagged resources. Make tags mandatory before creation.
×

Storing logs in S3 Standard indefinitely

Symptom
Storage costs grow linearly year after year. You're paying $0.023/GB for logs from 2018 that nobody has ever queried.
Fix
Implement lifecycle policies: transition to Standard-IA after 30 days, Glacier after 90, and expire after 365 days. Use S3 Storage Class Analysis to validate.
×

Setting up cost anomaly alerts but not acting on them

Symptom
Alerts fire every week, but no one investigates because there are too many false positives. Eventually alerts are ignored.
Fix
Start with high-signal alerts: >30% increase in top-5 services only. Tune thresholds over a month. Assign an on-call rotation for cost alerts, just as you would for production incidents routed through PagerDuty.
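The "high-signal" rule above (top-5 services, more than 30% increase) can be written as a small filter. A sketch under the assumption that per-service totals for two comparable periods are available as dictionaries; the figures and the function name are hypothetical.

```python
def high_signal_alerts(previous: dict, current: dict,
                       top_n: int = 5, threshold: float = 0.30) -> list:
    """Return top-N services (by current spend) whose cost rose by more than threshold."""
    top_services = sorted(current, key=current.get, reverse=True)[:top_n]
    alerts = []
    for service in top_services:
        before = previous.get(service, 0.0)
        # Skip services with no baseline; a ratio against zero is meaningless.
        if before > 0 and (current[service] - before) / before > threshold:
            alerts.append(service)
    return alerts


prev = {"EC2": 20000, "RDS": 5000, "S3": 3000, "NAT": 500, "Lambda": 200, "SQS": 10}
curr = {"EC2": 21000, "RDS": 8500, "S3": 3100, "NAT": 900, "Lambda": 210, "SQS": 50}
print(high_signal_alerts(prev, curr))
```

In this sample, SQS quintupled but stays silent because it is outside the top five by spend, which is exactly the noise the rule is meant to suppress.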
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
A team tells you their AWS bill doubled in the last month. How do you de...
Q02 · SENIOR
Explain when you would choose a Compute Savings Plan over a Standard Res...
Q03 · SENIOR
Your team wants to cut cloud costs by 40%. What's your step-by-step appr...
Q01 of 03 · SENIOR

A team tells you their AWS bill doubled in the last month. How do you debug it?

ANSWER
First, don't panic. Open AWS Cost Explorer and group by service. Look for the top cost drivers that changed; usually one service is the culprit. Check for new resource types, cross-region data transfer spikes, or increased usage in an existing service. Review the Cost Anomaly Detection dashboard. Also verify that no new accounts were added to the AWS Organization and started incurring costs. If the culprit is a single service, drill into the specific resources. Common causes: a new CI/CD agent using large instances, an EBS snapshot policy that retained all backups, or an S3 bucket with public access that was scraped.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is cloud cost optimisation in simple terms?
02
How much money can cloud cost optimisation typically save?
03
Is cloud cost optimisation a one-time project?
04
What's the biggest mistake teams make with reserved instances?
05
Can I use spot instances for production databases?