Intermediate 10 min · March 06, 2026

Cloud Cost Optimisation — The $120K Forgotten Database

Q: What is cloud cost optimisation in simple terms?

It's the practice of making sure you only pay for the cloud resources you actually need, and that you're getting the best price for those resources. This means shutting down unused servers, downsizing over-provisioned ones, and committing to longer-term usage where it makes sense.

Q: How much money can cloud cost optimisation typically save?

Most organisations have 30-40% waste. After applying the strategies in this article (rightsizing, spot, RIs, storage lifecycle), expect 40-60% savings. Some companies achieve 70%+ by aggressively using spot and committing to savings plans.

Q: Is cloud cost optimisation a one-time project?

No. It's a continuous practice. New resources are created every day, and old usage patterns change. At a minimum, run a cost review quarterly. Use automated tools like AWS Compute Optimizer and budget alerts to catch issues early.

Q: What's the biggest mistake teams make with reserved instances?

Buying RIs before rightsizing. If you buy a 3-year RI for an m5.xlarge that should have been a t3.medium, you're locked into paying several times more than necessary. Always rightsize first, then commit.

Q: Can I use spot instances for production databases?

Generally no. Spot instances can be reclaimed with only 2 minutes of notice, which is too risky for stateful databases. Use on-demand or reserved for databases. For stateless application servers behind a load balancer, spot can work if you handle interruptions gracefully.

One untagged RDS instance let a monthly AWS bill drift from $45K to $110K.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.

Production tested · Last updated July 19, 2026 · 2,466 articles, all by Naren

Before you start⏱ 25 min

✓Solid grasp of DevOps fundamentals
✓Comfortable with command-line tools
✓Basic Linux administration knowledge


⚡Quick Answer

Cloud cost optimisation is the practice of aligning cloud spend with actual usage
Rightsizing matches instance types to workload requirements — downsizing 40% of instances cuts bills by 20-50%
Reserved instances give 30-72% discounts over on-demand for steady workloads
Spot instances reduce costs by 60-90% for fault-tolerant, interruptible jobs
Tagging without enforcement is just decoration — build automated policies that stop untagged resources
Biggest mistake: mistaking reserved instance discounts for cost control while leaving idle resources running

✦ Definition~90s read

What is Cloud Cost Optimisation?

Cloud cost optimisation is the practice of continuously aligning cloud resource provisioning with actual workload requirements. It's not about cutting corners — it's about eliminating the gap between what you pay and what you need. That gap is almost always wider than you think.


Most engineers treat cloud spend as a fixed cost, like rent. It's not. Every resource is individually billable, and providers design their consoles to make adding resources frictionless. The result? A typical AWS account has 30-40% waste: idle instances, oversized databases, unused storage, and orphaned resources that nobody remembers creating.

Cost optimisation operates at three levels:

Tactical: Rightsizing instances, stopping idle resources, removing orphaned volumes.
Strategic: Reserved instances, savings plans, spot usage for batch workloads.
Cultural: Tagging policies, cost allocation reports, and developer education on cost awareness.

Each level compounds. Without tactical cleanup, strategic discounts are wasted on empty resources. Without cultural enforcement, tactical wins revert within a quarter.

Plain-English First

Imagine you rent a huge warehouse with 50 rooms, but you only ever use 3 of them — and you pay full rent every single month. Cloud cost optimisation is the art of figuring out which rooms you actually need, downsizing to the right-sized space, pre-paying for rooms you're certain you'll use long-term, and turning off the lights in rooms nobody is in after 6pm. Your cloud bill works exactly the same way.

Cloud bills have a nasty habit of doing one thing: going up. A startup spins up a few EC2 instances to test an idea, forgets to turn them off, adds an RDS database 'just for dev', and six months later the founders are staring at a $40,000 invoice wondering how it happened. This isn't a cautionary tale — it's Tuesday at most engineering companies. Cloud providers design their consoles to make provisioning effortless and deprovisioning forgettable. That asymmetry is expensive.

The problem cloud cost optimisation solves isn't just waste — it's invisible waste. Unused Elastic IPs, idle load balancers, forgotten S3 buckets full of old logs, over-provisioned RDS instances running at 8% CPU — none of these announce themselves. They silently accumulate. Cost optimisation is the discipline of making cloud spend intentional: every resource should be the right size, running only when needed, and tagged so you know exactly which team or product is paying for it.

By the end of this article you'll know how to identify and eliminate the four biggest categories of cloud waste, how to write Infrastructure as Code that enforces cost-aware defaults, how to use reserved capacity and spot instances strategically, and how to build a tagging policy that gives you real visibility. These are the same techniques engineering teams at scale use to routinely cut cloud bills by 30–60% without touching a single line of application code.


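main.tf (HCL)

# TheCodeForge: Terraform example enforcing cost-aware defaults
locals {
  # Cron job that halts non-production instances at 20:00 (instance clock, usually UTC).
  # Kept in a local because a heredoc's closing marker must sit alone on its line,
  # so it cannot be written inline inside a conditional expression.
  auto_stop_script = <<-EOF
    #!/bin/bash
    echo "0 20 * * * /sbin/shutdown -h now" | crontab -
  EOF
}

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = var.env == "production" ? "t3.medium" : "t3.micro"

  # Enforce tags for cost visibility
  tags = {
    Name        = "web-${var.env}"
    Environment = var.env
    Owner       = var.team
    CostCenter  = var.cost_center
    ExpiresOn   = var.env == "staging" ? timeadd(timestamp(), "168h") : "never"
  }

  # Auto-stop for non-production
  user_data = var.env != "production" ? local.auto_stop_script : null

  lifecycle {
    # timestamp() changes on every plan; ignore it after creation to avoid a perpetual diff
    ignore_changes = [tags["ExpiresOn"]]
  }
}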

Mental Model

Cost Optimisation Mental Model

Think of cloud spend like a bucket with a hole in the bottom — you need to plug the holes before you add more water.

First: stop paying for resources you don't use (idle instances, orphaned volumes).
Second: shrink resources you use but don't need full power for (rightsize).
Third: commit to longer-term usage for discounts (RIs, savings plans) only after you've right-sized.
Fourth: use spot for variable, fault-tolerant workloads only when steady-state is optimised.

📊 Production Insight

The biggest cost optimisation trap is buying reserved instances before rightsizing. A 3-year all-upfront RI on an m5.xlarge that runs at 5% CPU means roughly 95% of the capacity you prepaid for sits idle.

Always rightsize before you reserve.

Rule: Eliminating 50% of waste saves more than a 30% discount applied to that waste.
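
To put numbers on that rule, here is a small sketch. The $10,000 monthly bill and the 50% waste share are illustrative assumptions, not figures from any account described in this article.

```python
def monthly_cost(bill, waste_share, waste_removed=0.0, discount=0.0):
    """Monthly bill after removing part of the waste, then applying a commitment discount."""
    useful = bill * (1 - waste_share)
    waste = bill * waste_share * (1 - waste_removed)
    return (useful + waste) * (1 - discount)

BILL = 10_000        # assumed monthly on-demand bill in USD
WASTE_SHARE = 0.5    # assumed: half of the spend is idle or oversized capacity

discount_only = monthly_cost(BILL, WASTE_SHARE, discount=0.30)
cleanup_only = monthly_cost(BILL, WASTE_SHARE, waste_removed=1.0)
cleanup_then_discount = monthly_cost(BILL, WASTE_SHARE, waste_removed=1.0, discount=0.30)

print(f"30% discount, waste untouched: ${discount_only:,.0f}")
print(f"Waste removed, no discount:    ${cleanup_only:,.0f}")
print(f"Waste removed, then discount:  ${cleanup_then_discount:,.0f}")
```

Under these assumptions the cleanup alone beats the discount alone, and the discount only reaches its full value once it is applied to a bill that contains no waste.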

🎯 Key Takeaway

Cloud cost optimisation starts with eliminating waste, not getting discounts.

Discounts on waste are still waste.

Rightsize first, reserve second, spot third.

Should You Optimise Cost or Spend More?

If: Workload runs <6 hours/day or <40% peak usage → Use: Rightsize down and consider spot/preemptible instances.

If: Workload runs 24/7 with steady CPU >50% → Use: Reserved instance or savings plan; commit for 1 or 3 years.

If: Workload is batch, fault-tolerant, and time-flexible → Use: Spot fleets diversified across instance families.

If: No resource tags, no cost allocation visibility → Use: Stop everything except production; implement a tagging policy first.

thecodeforge.io

Cloud Cost Optimisation

10-Point Cloud Cost Optimization Checklist

Use this checklist as a quick reference to audit your cloud spend. Each point addresses a common source of waste or a best practice for cost control.

1. Stop idle compute resources. Identify instances with less than 5% CPU over the last 14 days. Stop or terminate them. Use AWS Instance Scheduler to automate stop/start.
2. Rightsize over-provisioned instances. Run Compute Optimizer or cross-check CloudWatch metrics. Downsize to the smallest instance type that meets your workload’s peak requirements.
3. Remove orphaned EBS volumes and Elastic IPs. Unattached volumes and unused IPs accrue costs. List and delete them monthly.
4. Set S3 lifecycle policies. Transition logs and backups to cheaper storage classes (Standard-IA → Glacier → Deep Archive) and expire old objects.
5. Abort incomplete multipart uploads. Add a lifecycle rule to remove incomplete uploads older than 7 days. They accumulate hidden storage cost.
6. Enforce tagging on all resources. Use policy-as-code in your IaC pipeline (Checkov scans for Terraform, CloudFormation Guard rules for CloudFormation) to require Owner, Environment, and CostCenter tags. Automatically terminate untagged resources.
7. Buy Reserved Instances or Savings Plans only after rightsizing. Commit for steady-state, rightsized workloads. Start with 1-year partial upfront to preserve flexibility.
8. Use Spot instances for fault-tolerant workloads. Batch processing, CI/CD workers, and stateless microservices are ideal. Implement termination handling.
9. Set up cost anomaly alerts. Use AWS Budgets or third-party tools to alert on >30% increases in top services. Assign an owner to review alerts weekly.
10. Conduct monthly cost reviews. Review Cost Explorer, look for new services or unexplained spikes. Maintain a shared spreadsheet of cost optimization actions.
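
The orphaned-volume and Elastic IP check in the checklist is easy to script. The sketch below is one possible approach, not a finished tool: `find_orphans` is a hypothetical helper, and the sample data stands in for the `Volumes` and `Addresses` lists that the EC2 `describe_volumes` and `describe_addresses` calls return. It assumes an unattached volume reports `State` as `available` and an unassociated Elastic IP has no `AssociationId`.

```python
def find_orphans(volumes, addresses):
    """Return IDs of unattached EBS volumes and unassociated Elastic IPs.

    `volumes` and `addresses` are lists shaped like the "Volumes" and
    "Addresses" entries of the EC2 describe_volumes / describe_addresses responses.
    """
    orphan_volumes = [v["VolumeId"] for v in volumes if v.get("State") == "available"]
    orphan_ips = [a["AllocationId"] for a in addresses if "AssociationId" not in a]
    return orphan_volumes, orphan_ips

# Hypothetical sample data standing in for live API responses
sample_volumes = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 100},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 500},
]
sample_addresses = [
    {"AllocationId": "eipalloc-111", "AssociationId": "eipassoc-999"},
    {"AllocationId": "eipalloc-222"},
]

orphan_volumes, orphan_ips = find_orphans(sample_volumes, sample_addresses)
print("Unattached volumes:", orphan_volumes)
print("Unassociated Elastic IPs:", orphan_ips)
```

In a real account you would pass in `ec2.describe_volumes()["Volumes"]` and `ec2.describe_addresses()["Addresses"]` from a boto3 EC2 client, and review the list before deleting anything.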

💡Automate Where Possible

Manually checking all 10 points every month is tedious. Automate steps 1–5 with AWS Config rules, Lambda functions, and S3 lifecycle policies. Reserve manual review for steps 6–10.

📊 Production Insight

A team I worked with applied this checklist in a two-day cost audit blitz. They found and fixed 14 idle databases, 23 oversized instances, and 6 orphaned volumes. Savings: $18,000/month. The checklist turned a reactive firefight into a repeatable process.

Rule: Turn this checklist into a quarterly automated report. If you can't automate it, write a runbook and schedule a 2-hour cost review every month.

🎯 Key Takeaway

A 10-point checklist gives you a repeatable, quick way to identify and fix the most common cloud waste sources. Automate what you can; review the rest monthly.

Rightsizing: Match Instance Size to Actual Workload

Rightsizing means picking the instance type and size that matches your workload's actual resource consumption. The default — t3.medium for everything — is almost always wrong. A web server that serves 100 req/s might need 4 vCPUs; one that serves 5 req/s might run fine on a burstable nano.

AWS Compute Optimizer and Azure Advisor give recommendations based on historical CPU, memory, and network utilisation (on AWS, memory is only visible if the CloudWatch agent is installed). But they're conservative: they only suggest downsizing if utilisation is below a threshold (e.g., 40% max CPU over 2 weeks). In practice, most engineering teams can downsize 50% of their instances with no performance impact.

The mechanics

Over-provisioned CPU: Burstable instances (T3/T4g) accumulate CPU credits when idle; check credit balance before downsizing.
Over-provisioned memory: Use CloudWatch mem_used_percent custom metrics; often RDS instances run at <20% memory.
Over-provisioned I/O: On EBS gp3 volumes, provisioned IOPS and throughput can be lowered to match actual usage patterns (the volume size itself cannot be shrunk in place).

Rightsizing is a continuous process. Quarterly reviews catch new waste from autoscaling groups that launched larger instances during a peak and never scaled down.

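rightsize_check.sh (BASH)

#!/bin/bash
# TheCodeForge: check whether one EC2 instance averaged <10% CPU over the last 14 days
# Usage: ./rightsize_check.sh i-1234567890abcdef0   (needs GNU date and jq)
# Prints true/false, or "no data" when CloudWatch returns no datapoints
set -euo pipefail
INSTANCE_ID="${1:-i-1234567890abcdef0}"

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value="$INSTANCE_ID" \
  --start-time "$(date -u -d '14 days ago' +%Y-%m-%dT00:00:00Z)" \
  --end-time "$(date -u +%Y-%m-%dT23:59:59Z)" \
  --period 86400 \
  --statistics Average \
  --query 'Datapoints[*].Average' \
  --output json | jq 'if length == 0 then "no data" else (add / length) < 10 end'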

⚠ Rightsizing Trap

Don't downsize instances that have CPU credits (T3/T4g) without checking the credit balance. A burstable instance with zero credits will run at baseline performance, causing immediate throttling. Always switch to unlimited mode or a non-burstable type first.

📊 Production Insight

Rightsizing a production RDS instance from db.r5.xlarge to db.r5.large saves ~$200/month. If you have 20 such instances, that's $48,000/year. No code change, just a few minutes of metric review and a brief restart while the instance class changes, so schedule it in a maintenance window.

But also consider: a Multi-AZ database bills for a standby as well as the primary, so every size step applies twice. For dev environments, switching to single-AZ halves the instance cost before you even downsize.

Rule: Rightsize dev and staging before production — they're often 2-3x over-provisioned.

🎯 Key Takeaway

Rightsizing is the lowest-effort, highest-return cost optimisation action.

Target 50% of instances for downsizing; average 30% savings.

Use Compute Optimizer, but validate with actual metrics before changing.

Right-size Decision

If: CPU <10% and memory <30% for 14 consecutive days → Use: Downsize to the next lower instance size. If already at the smallest size, consider switching to burstable (T3/T4g).

If: CPU <10% but memory >70% → Use: Cut vCPUs but keep the memory by moving to a memory-optimised family at a smaller size (e.g., m5.xlarge with 4 vCPU/16 GiB to r5.large with 2 vCPU/16 GiB).

If: CPU spikes but average is low → Use: Burstable (T3/T4g) in 'unlimited' mode, monitoring CPUSurplusCreditsCharged; move to compute-optimised (C5/C6i) if the spikes are sustained.

If: Instance running less than 14 days → Use: Do not rightsize yet; wait for at least 2 weeks of utilisation data.
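
The decision rules above can be sketched as a small function. This is a simplified illustration: `rightsize_recommendation` is a hypothetical helper, its inputs are assumed to be one average CPU and memory percentage per day, and the spike rule is left out because it needs finer-grained data than daily averages.

```python
from statistics import mean

def rightsize_recommendation(daily_cpu_avg, daily_mem_avg, min_days=14):
    """Map daily average CPU/memory percentages to a rightsizing decision."""
    if len(daily_cpu_avg) < min_days or len(daily_mem_avg) < min_days:
        return "wait: not enough utilisation data"
    cpu = daily_cpu_avg[-min_days:]
    mem = daily_mem_avg[-min_days:]
    cpu_low = max(cpu) < 10          # under 10% on every one of the last 14 days
    if cpu_low and max(mem) < 30:
        return "downsize one instance size"
    if cpu_low and mean(mem) > 70:
        return "move to a memory-optimised type with fewer vCPUs"
    return "keep current size"

print(rightsize_recommendation([4] * 14, [22] * 14))
print(rightsize_recommendation([6] * 14, [85] * 14))
print(rightsize_recommendation([6] * 5, [85] * 5))
```

Treat the output as a prompt for review rather than an instruction: validate against peak-hour metrics before changing a production instance.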


Reserved Instances and Savings Plans: Commit to Save

Reserved Instances (RIs) and Savings Plans are discount programs in exchange for committed usage. If you know you'll run a database or web server for the next 1-3 years, you can pay upfront or partially upfront and get 30-72% off the on-demand rate.

The key distinction

Standard RIs: Locked to a specific instance family (e.g., m5) in one region. Highest discount, least flexibility.
Convertible RIs: Can be exchanged for a different instance family, at a lower discount (40-60%).
Compute Savings Plans: Apply to EC2, Fargate, and Lambda usage across instance families and regions. More flexible, slightly lower discount than RIs.
EC2 Instance Savings Plans: Apply to a specific instance family within a region.

Senior engineers treat RIs as a second-order optimisation, after rightsizing. You don't want to commit to an m5.large for 3 years only to discover you could have moved to t3.medium and saved more without the commitment.

One real strategy: use a mix of 1-year partial upfront for predictable workloads and 3-year all upfront for baseline production. This balances discount depth with financial flexibility.

ri_analysis.py (PYTHON)

# TheCodeForge: rank EC2 instance types by cost (Cost Explorer) as input for RI sizing
from datetime import date, timedelta

import boto3

ce = boto3.client('ce')

# Get EC2 usage over the last 30 days (the End date is exclusive in Cost Explorer)
end = date.today()
start = end - timedelta(days=30)
usage = ce.get_cost_and_usage(
    TimePeriod={'Start': start.isoformat(), 'End': end.isoformat()},
    Granularity='DAILY',
    Metrics=['UsageQuantity', 'UnblendedCost'],
    # Restrict to EC2 compute so RDS and other services do not skew the ranking
    Filter={'Dimensions': {'Key': 'SERVICE',
                           'Values': ['Amazon Elastic Compute Cloud - Compute']}},
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'INSTANCE_TYPE'}]
)

# Sum the daily cost per instance type (e.g. m5.large)
costs = {}
for item in usage['ResultsByTime']:
    for group in item['Groups']:
        key = group['Keys'][0]
        costs[key] = costs.get(key, 0) + float(group['Metrics']['UnblendedCost']['Amount'])

top_5 = sorted(costs.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 instance types by cost:")
for instance_type, cost in top_5:
    print(f"{instance_type}: ${cost:.2f}")

# Suggested RI coverage: 50% of the top instance type with 1-year partial upfront

🔥Reserved Instances Myth

Many teams buy RIs because they think it 'saves money automatically.' But if you buy an RI for an instance that you later stop, you still pay for it. RIs cannot be cancelled; at best, Standard RIs can be resold on the Reserved Instance Marketplace. Only buy RIs for workloads that are rightsized and will definitely stay running.
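
A quick way to see why a stopped instance turns an RI into a loss is to compute the break-even utilisation. The sketch below assumes a hypothetical $0.10/hour on-demand rate and a 40% RI discount; both numbers are placeholders, not quoted prices.

```python
HOURS_PER_MONTH = 730  # average hours in a month

def ri_vs_on_demand(on_demand_hourly, discount, utilisation):
    """Monthly cost of an RI (billed for every hour) vs on-demand (billed for hours run)."""
    ri_cost = on_demand_hourly * (1 - discount) * HOURS_PER_MONTH
    od_cost = on_demand_hourly * utilisation * HOURS_PER_MONTH
    return ri_cost, od_cost

def break_even_utilisation(discount):
    """Fraction of hours the instance must run before the RI beats on-demand."""
    return 1 - discount

RATE = 0.10       # hypothetical on-demand price in USD/hour
DISCOUNT = 0.40   # hypothetical 1-year RI discount

for utilisation in (1.0, 0.7, 0.3):
    ri, od = ri_vs_on_demand(RATE, DISCOUNT, utilisation)
    cheaper = "RI" if ri < od else "on-demand"
    print(f"running {utilisation:.0%} of hours: RI ${ri:.2f} vs on-demand ${od:.2f} -> {cheaper}")
print(f"break-even utilisation: {break_even_utilisation(DISCOUNT):.0%}")
```

With a 40% discount the instance has to run more than 60% of the time for the commitment to pay off, which is why dev boxes that are switched off at night rarely justify an RI.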

📊 Production Insight

I've seen teams buy $500k in 3-year RIs because they believed the 60% discount was a no-brainer. Then they migrated to containers and only used 30% of the RI commitment. The discount is huge — until you need to change architecture. Compute Savings Plans avoid this lock-in for containerised workloads.

Rule: Buy RIs only after at least 6 months of consistent usage data. Start with 1-year partial upfront for flexibility.

🎯 Key Takeaway

Reserved instances and savings plans are powerful, but only after rightsizing.

Start with 1-year commitments; layer on more as usage patterns stabilise.

Compute Savings Plans offer the best balance for modern, containerised architectures.

Reserved Instance or Savings Plan?

If: Workload is containerised (ECS/EKS/Fargate) → Use: Compute Savings Plan; it covers both EC2 and container compute.

If: Workload runs on specific instance type with no changes expected → Use: Standard 3-year all upfront RI for maximum discount.

If: Not sure if workload will change architecture in 2 years → Use: 1-year partial upfront RI or Compute Savings Plan to maximise flexibility.

Savings Plans vs Reserved Instances: Which Should You Choose?

Both Reserved Instances (RIs) and Savings Plans offer deep discounts in exchange for committing to a certain dollar amount per hour (Savings Plans) or a specific instance configuration (RIs). The right choice depends on how predictable and stable your workloads are.

Standard RIs give the highest discount (up to 72%) but lock you to a specific instance family in a specific region. They are best for steady-state, long-lived workloads that won't change architecture — e.g., a production MySQL database on db.r5.large.

Convertible RIs offer slightly lower discounts (40-60%) but can be exchanged for a different instance family, size, operating system, or tenancy; the region stays fixed for the term. They’re useful if you anticipate moderate changes but still want a commitment discount.

Compute Savings Plans apply to EC2, Fargate, and Lambda compute usage in any region (ECS and EKS are covered through the EC2 or Fargate capacity they run on). Discounts are 30-50%, but you gain flexibility to change instance types, sizes, regions, or even move between compute services. Ideal for containerized workloads.

EC2 Instance Savings Plans are similar to Standard RIs — they apply to a specific instance family in a region — but are slightly more flexible because they cover any instance size within that family.

Feature	Standard RI	Convertible RI	Compute Savings Plan	EC2 Instance Savings Plan
Discount depth (1yr)	~40%	~30%	~30%	~35%
Discount depth (3yr)	~60-72%	~50-60%	~45-55%	~55-65%
Instance family locked?	Yes	Can exchange	No	Yes (family only)
Region locked?	Yes	Yes	No (floating)	Yes (region)
Scope	Single AZ or entire region	Single AZ or region	All regions	Per region
Best for	Predictable, immutable workloads	Workloads with future migration plans	Containerized or serverless architectures	Steady but variable instance sizes

If your architecture is cloud-native and you use containers or Lambda, Compute Savings Plans are the safest bet. If you have legacy monoliths on known instance types, Standard RIs maximize savings. Never buy a 3-year RI until you have at least 6 months of stable usage data — financial flexibility is worth the lower discount.

🔥The 80% Rule

A common rule-of-thumb: cover 80% of your smooth baseline compute spend with commitment discounts (RIs or Savings Plans). Leave the remaining 20% on-demand to absorb spikes. This balances savings with flexibility. Review coverage quarterly and adjust.
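
One way to turn the 80% rule into a number is to take a low percentile of hourly on-demand spend as the baseline and commit to 80% of it. The sketch below is an assumption-laden illustration: `commitment_target` is a hypothetical helper, the 10th-percentile baseline is my choice rather than anything AWS prescribes, and the spend figures are made up.

```python
def commitment_target(hourly_spend, coverage=0.80, baseline_percentile=0.10):
    """Suggest an hourly commitment: `coverage` of a low-percentile spend baseline."""
    if not hourly_spend:
        raise ValueError("need at least one hourly data point")
    ordered = sorted(hourly_spend)
    # A low percentile is used instead of the mean so spikes do not inflate the baseline
    index = int(baseline_percentile * (len(ordered) - 1))
    baseline = ordered[index]
    return baseline, baseline * coverage

# Made-up hourly compute spend in USD, including two traffic spikes
sample_spend = [40, 42, 41, 45, 80, 120, 43, 40, 44, 41]
baseline, commitment = commitment_target(sample_spend)
print(f"baseline ${baseline}/hour, suggested commitment ${commitment:.2f}/hour")
```

Everything above the commitment stays on-demand or spot, so the spikes at $80 and $120 per hour never turn into unused committed spend.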

📊 Production Insight

I’ve seen multi-million-dollar environments where the team bought 3-year Standard RIs for their entire fleet, then migrated to EKS six months later. The RIs became worthless for the new architecture. A Compute Savings Plan would have covered the same spend with less lock-in. The lesson: flexibility compounds in value as your architecture evolves.

Rule: Use Compute Savings Plans as the default; reserve Standard RIs only for absolutely static workloads like licensed databases on specific instance types.

🎯 Key Takeaway

Savings Plans offer more flexibility than Reserved Instances, often at a manageable discount reduction. For modern, containerized architectures, choose Compute Savings Plans. Only buy Standard RIs for workloads that won't change for the full term.

Spot and Preemptible Instances: Cheap Compute for the Brave

Spot instances (AWS) and preemptible VMs (GCP) let you use spare cloud capacity at a steep discount — typically 60-90% off on-demand price. The trade-off: the cloud provider can reclaim the instance with just a few minutes' notice (2 minutes in AWS, 30 seconds in GCP).

This isn't a free-for-all. It works best for

Batch processing (data pipelines, CI/CD workers)
Stateless microservices that can tolerate interruption
Fault-tolerant distributed jobs (e.g., training ML models with checkpointing)
Rendering or simulation workloads

The challenge is handling termination gracefully. Your application must save state or be able to restart. AWS sends a Spot Instance Termination Notice event 2 minutes before reclaiming the instance. You can catch this via the instance metadata endpoint.

Avoid spot for stateful databases, single-instance workloads, or anything that can't tolerate an abrupt stop. The cost savings are real, but the operational cost of re-architecting for spot can be significant.
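
What "save state or be able to restart" means in practice is easiest to see in a checkpointing loop. The following is a minimal sketch under simplifying assumptions: `process_items` is a hypothetical worker, the "work" is a placeholder multiplication, and the checkpoint lives on local disk, whereas a real spot worker would write it to durable storage such as S3 because the instance's disk may disappear with it.

```python
import json
import os
import tempfile

def process_items(items, checkpoint_path, should_stop=lambda: False):
    """Process items in order, persisting progress so a restarted worker can resume."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    results = []
    for i in range(start, len(items)):
        if should_stop():             # e.g. an interruption notice was seen
            break
        results.append(items[i] * 2)  # placeholder for real work
        # Write to a temp file and rename, so a kill mid-write cannot corrupt the checkpoint
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"next_index": i + 1}, f)
        os.replace(tmp, checkpoint_path)
    return results

# Simulate an interruption after two items, then a restart of the worker
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
calls = iter([False, False, True])
first_run = process_items([1, 2, 3, 4], path, should_stop=lambda: next(calls))
second_run = process_items([1, 2, 3, 4], path)
print(first_run, second_run)
```

The second run picks up at the third item instead of starting over, which is the property that makes a workload safe to run on spot.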

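spot_termination_listener.py (PYTHON)

# TheCodeForge: listen for spot termination and checkpoint work
import time

import requests

IMDS = "http://169.254.169.254/latest"
METADATA_URL = f"{IMDS}/meta-data/spot/termination-time"

def imds_token():
    # IMDSv2: fetch a short-lived session token first; plain IMDSv1 requests
    # are rejected on instances that require v2
    response = requests.put(
        f"{IMDS}/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        timeout=2,
    )
    response.raise_for_status()
    return response.text

def check_termination():
    try:
        response = requests.get(
            METADATA_URL,
            headers={"X-aws-ec2-metadata-token": imds_token()},
            timeout=2,
        )
        # 404 means no interruption is scheduled; 200 carries the termination time
        if response.status_code == 200:
            print("⚠️  Spot termination imminent. Checkpointing...")
            # Save state to S3 or distribute work
            return True
    except requests.exceptions.RequestException:
        pass
    return False

while True:
    if check_termination():
        # Graceful shutdown logic
        break
    # Normal processing
    time.sleep(5)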

💡Spot Fleet Diversity

Use multiple instance types (c5.large, m5.large, r5.large) in your spot fleet request. AWS can then pick from whichever have spare capacity, reducing interruptions. Set allocation_strategy to 'capacity-optimized' for lowest interruption risk.

📊 Production Insight

A company I consulted for saved 72% on their Spark cluster by switching to spot instances. But they forgot to set up termination handling, so the cluster restarted from scratch 4 times a day because nodes got reclaimed mid-job. After adding checkpointing to HDFS and a spot fleet with fallback to on-demand, savings were real and stable.

Rule: Always set a target capacity with a mix of spot and on-demand. Use lowestPrice only for extremely fault-tolerant workloads; use capacityOptimized for production.

🎯 Key Takeaway

Spot instances can cut compute costs by 60-90%, but require architectural changes.

Always handle termination notices and use with fault-tolerant workloads.

Combine spot with on-demand capacity pools for reliability SLAs.

Should You Use Spot or On-Demand?

If: Workload is stateless, batch, or time-flexible → Use: Spot all the way; target 70-90% of instances from spot.

If: Workload is stateful, transactional, or single-instance → Use: On-demand or reserved. Spot is too risky.

If: Workload can tolerate interruptions but needs SLA → Use: A spot fleet with an on-demand base capacity that the fleet falls back to (at a higher price) when spot capacity runs out.

Tagging and Cost Allocation: Visibility Is the Prerequisite

You can't optimise what you can't see. Tagging is the foundation of cost visibility — attaching metadata (key-value pairs) to every cloud resource so you can attribute costs to teams, products, environments, or cost centers.

But tags are only useful if they're:

1. Mandatory: IaC policies should reject deployments without required tags.
2. Standardised: A fixed set of tags (e.g., Owner, Environment, CostCenter, Project) used everywhere.
3. Enforced: Resources created without required tags are automatically terminated or reported.

Without enforcement, tags become optional and quickly rot. After three months, half your resources are untagged and you're back to guessing who's spending what.
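
A policy check for "mandatory and standardised" can be very small. The sketch below is illustrative only: `validate_tags` is a hypothetical helper, and the allowed `Environment` values are an assumed naming convention rather than anything AWS defines.

```python
REQUIRED_TAGS = {"Owner", "Environment", "CostCenter", "Project"}
ALLOWED_ENVIRONMENTS = {"production", "staging", "dev"}  # assumed naming convention

def validate_tags(tags):
    """Return a list of policy violations for a tag dict (empty list means compliant)."""
    present = set(tags)
    problems = [f"missing tag: {key}" for key in sorted(REQUIRED_TAGS - present)]
    problems += [f"empty tag: {key}" for key in sorted(REQUIRED_TAGS & present)
                 if not str(tags[key]).strip()]
    env = tags.get("Environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unknown Environment value: {env}")
    return problems

print(validate_tags({"Owner": "payments", "Environment": "prod"}))
```

The same function can run in a CI step against planned resources and in a scheduled job against live ones, so both checks apply one definition of "compliant".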

AWS provides Cost Allocation Tags that appear in Cost Explorer and detailed billing reports once you activate them in the Billing console; both AWS-generated and user-defined tags can be activated. Once tags are in place, you can run reports per team, set budget alerts for specific CostCenter tags, and even block untagged resources from launching via SCPs (Service Control Policies) or custom Lambda functions.

lambda_enforce_tags.py (PYTHON)

# TheCodeForge: AWS Lambda function that flags (and optionally terminates) untagged instances
import boto3

REQUIRED_TAGS = ['Owner', 'Environment', 'CostCenter']

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')
    offenders = []

    # Paginate: describe_instances returns results in pages on larger accounts
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running', 'stopped']}]
    )

    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
                missing = [tag for tag in REQUIRED_TAGS if tag not in tags]
                if missing:
                    print(f"Untagged instance {instance['InstanceId']}: missing tags {missing}")
                    offenders.append(instance['InstanceId'])
                    # Option 1: Send alert via SNS
                    # Option 2: Terminate directly (use with caution!)
                    # ec2.terminate_instances(InstanceIds=[instance['InstanceId']])

    return {'untagged': offenders}

Mental Model

Tags Are Metadata Contracts

Think of tags as a contract between the creator and the finance team — without that contract, costs are invisible and unaccountable.

Mandatory tags: Owner, Environment, CostCenter, Project.
Enforcement at deployment: refuse to create resources without tags.
Automated cleanup: an SCP that denies untagged launches, plus a Lambda that tags or terminates unknown resources.
Report weekly: Cost Explorer per tag → Share with team leads.

📊 Production Insight

A common failure: teams create a tagging policy but don't enforce it. Three months later, they run a cost report and see 60% of resources are untagged. The report is useless. You have to enforce tags at the infrastructure level, with Terraform validation, SCPs, or CloudFormation Guard rules (a CloudFormation stack policy only protects stack resources from updates, so it cannot enforce tags). Otherwise, the tagging policy is a suggestion, not a rule.

Rule: Enforce tags in IaC before you create any resource. Don't let untagged resources reach production.

🎯 Key Takeaway

Tags turn cloud spend from a black box into a per-team, per-project breakdown.

Enforce tags at the IaC level, not through documentation.

Without enforcement, tagging is theater — worthless for cost optimisation.

Enforce Tags or Not?

If: You have a tag policy but untagged resources are common → Use: Automatic tagging for existing resources, then enforcement for new ones.

If: You need per-team cost allocation → Use: A CostCenter tag, activated as a Cost Allocation Tag in the AWS Billing console.

If: You're about to start a migration → Use: Tag enforcement in the migration checklist; it's easier than retrofitting.

The Cost of Untagged Resources: A Visual Impact

Tags are the single most important tool for cost attribution, but they only work when enforced. Untagged resources are invisible in cost reports — they fall into a generic "Untagged" bucket that tells you nothing about ownership, project, or environment. This invisibility is a direct cause of cloud waste.

The diagram below shows the chain reaction from an untagged resource to a growing bill. When a developer provisions an EC2 instance without tags:

The instance shows up in Cost Explorer under "No Tag: Environment".
The operations team doesn’t know who owns it or why it exists.
No one is responsible for stopping it, so it runs indefinitely.
The cost is buried in a generic line item — nobody notices.
After months, the bill has crept up by thousands of dollars.

With tagging, the opposite happens: resources are automatically attributed to a team, cost allocation reports surface anomalies, and the team lead gets a budget alert when spend exceeds threshold.

The impact is not just financial. Untagged resources slow down incident response (who to page?), hinder compliance audits, and make capacity planning guesswork. A single untagged production database can delay a root cause analysis by hours because no one knows who launched it.

The visual below summarises the flow from untagged resource to cost leak. Use it to justify enforcing tag policies across your organization.

⚠ The Untagged Resource Tax

At $12,000/month per 100 untagged instances, the "tax" adds up fast. A single untagged RDS database can cost $8,500/month (like in our production incident). Enforcing tags from day one is the cheapest way to avoid this tax.

📊 Production Insight

In the $120K forgotten database incident, the root cause wasn’t just the untagged RDS — it was the lack of visibility. The instance was launched by a contractor who left the company. Without an Owner tag, no one knew who to ask. Without an Environment tag, it wasn't flagged as dev/staging. The bill grew silently for 14 months. A simple mandatory tag policy with automated enforcement would have stopped it in its tracks.

Rule: Always enforce tags via IaC or service control policies. If a resource cannot be created without tags, it cannot be forgotten.

🎯 Key Takeaway

Untagged resources are invisible cost leaks. Enforcing tagging at the infrastructure level turns a black box of cloud spend into a transparent, per-team breakdown. Visibility alone can reduce waste by 20-30%.
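
To track how big the untagged bucket is, you can compute its share of spend from a cost report grouped by tag. The sketch below works on data shaped like a Cost Explorer `get_cost_and_usage` response grouped by the `Owner` tag. It assumes that untagged spend appears under a group key of the form `Owner$` (the tag key followed by an empty value); verify that against your own output before relying on it. The amounts are made up.

```python
def untagged_share(results_by_time, tag_key="Owner"):
    """Fraction of total cost that falls in the group with an empty value for `tag_key`."""
    total = untagged = 0.0
    for period in results_by_time:
        for group in period["Groups"]:
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            total += amount
            if group["Keys"][0] == f"{tag_key}$":   # assumed key format for untagged spend
                untagged += amount
    return untagged / total if total else 0.0

# Made-up monthly report: two tagged teams and one untagged bucket
sample_report = [{
    "Groups": [
        {"Keys": ["Owner$platform"], "Metrics": {"UnblendedCost": {"Amount": "6000"}}},
        {"Keys": ["Owner$data"], "Metrics": {"UnblendedCost": {"Amount": "2000"}}},
        {"Keys": ["Owner$"], "Metrics": {"UnblendedCost": {"Amount": "2000"}}},
    ]
}]

share = untagged_share(sample_report)
print(f"Untagged share of spend: {share:.0%}")
```

Reporting this percentage weekly gives the tagging policy a single number to drive toward zero.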

Figure: Untagged Resource Cost Leak Flow

Storage Lifecycle Management: Stop Paying for Old Logs

Storage costs are the sneakiest creepers in your cloud bill. Data accumulates, and once written, it almost never gets deleted. S3 charges per GB per month, and that cost grows linearly with data volume. A project that writes 500 GB of logs a month and deletes nothing will be paying for roughly 6 TB after a year.

The solution is lifecycle policies: automatically transition objects to cheaper storage classes and expire them after a set period. Typical data lifecycle:

0-30 days: S3 Standard (hot, frequent access)
30-90 days: S3 Infrequent Access (lower storage cost, higher retrieval cost)
90-365 days: S3 Glacier (long-term archival, retrieval takes minutes to hours)
365+ days: S3 Glacier Deep Archive (cheapest, retrieval takes 12-48 hours)
Expiration: Delete objects after, say, 3 years.

Set these policies on every bucket from day one. Lifecycle rules also apply to objects already in the bucket, so adding a policy to an existing bucket with millions of objects needs no rewrite of the data; budget for the per-object transition request charges on the first pass.

Also watch for incomplete multipart uploads. S3 charges you for the chunks even if the upload never finished. They accumulate silently. Add a rule to expire incomplete uploads after 7 days.
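
To estimate what such a lifecycle is worth, you can compare the steady-state monthly storage bill with and without tiering. The sketch below is a rough model: the per-GB prices are approximate us-east-1 list prices and should be treated as assumptions that may be out of date, tier boundaries are rounded to whole months, and request, transition, and retrieval fees as well as minimum storage durations are ignored.

```python
# Approximate storage prices in USD per GB-month (assumed, check current pricing)
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER": 0.0036,
    "DEEP_ARCHIVE": 0.00099,
}

def tier_for_age(age_months):
    """Storage class for data of a given age under a 30/90/365-day lifecycle."""
    if age_months < 1:
        return "STANDARD"
    if age_months < 3:
        return "STANDARD_IA"
    if age_months < 12:
        return "GLACIER"
    return "DEEP_ARCHIVE"

def monthly_storage_cost(gb_per_month, retention_months=36, lifecycle=True):
    """Steady-state monthly bill when `gb_per_month` is written and kept for `retention_months`."""
    total = 0.0
    for age in range(retention_months):
        tier = tier_for_age(age) if lifecycle else "STANDARD"
        total += gb_per_month * PRICE_PER_GB_MONTH[tier]
    return total

with_policy = monthly_storage_cost(500)
without_policy = monthly_storage_cost(500, lifecycle=False)
print(f"with lifecycle: ${with_policy:.2f}/month, all in Standard: ${without_policy:.2f}/month")
```

Under these assumptions, three years of logs written at 500 GB a month cost a fraction of the all-Standard bill once tiering is in place; the exact ratio depends on current prices and on how many small objects fall below the minimum billable size.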

lifecycle-policy.json (JSON)

{
  "Rules": [
    {
      "Id": "Standard-to-IA-after-30d",
      "Status": "Enabled",
      "Filter": {"Prefix": "logs/"},
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER"
        },
        {
          "Days": 365,
          "StorageClass": "DEEP_ARCHIVE"
        }
      ],
      "Expiration": {
        "Days": 1095
      }
    },
    {
      "Id": "AbortIncompleteMultipartUpload",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "AbortIncompleteMultipartUpload": {
        "DaysAfterInitiation": 7
      }
    }
  ]
}
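The abort rule handles cleanup going forward. To see what has already piled up, one option is to filter the upload list by age. The sketch below assumes records shaped like the `Uploads` entries returned by S3's ListMultipartUploads call (`Key`, `UploadId`, `Initiated`); the sample data and function name are hypothetical, and no AWS call is made here.

```python
from datetime import datetime, timedelta, timezone


def stale_uploads(uploads, now, max_age_days=7):
    """Return (Key, UploadId) pairs for uploads initiated before the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return [(u["Key"], u["UploadId"]) for u in uploads if u["Initiated"] < cutoff]


if __name__ == "__main__":
    now = datetime(2026, 3, 6, tzinfo=timezone.utc)
    sample = [
        {"Key": "logs/a.gz", "UploadId": "u1", "Initiated": now - timedelta(days=10)},
        {"Key": "logs/b.gz", "UploadId": "u2", "Initiated": now - timedelta(days=2)},
    ]
    for key, upload_id in stale_uploads(sample, now):
        print(f"stale upload: {key} ({upload_id})")
```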

🔥Glacier Deep Archive Pricing

S3 Glacier Deep Archive costs about $1/TB per month, roughly 1/23rd the cost of S3 Standard (about $23/TB). But retrieval takes up to 12 hours (standard) or 48 hours (bulk). Perfect for compliance archives, old logs, and backup data you hope you'll never need.

📊 Production Insight

I audited an AWS account that had been running for 5 years. Their S3 storage cost was $28k/month. 60% of it was logs older than 90 days in S3 Standard. After applying lifecycle policies to transition old logs to Glacier Deep Archive and expire them after 3 years, the monthly storage bill dropped to $6k. No data lost. Just a policy change.

Rule: Never leave logs in Standard beyond 30 days. Set lifecycle policies on the day you create the bucket — retrofitting is harder.

🎯 Key Takeaway

Storage costs grow linearly with data volume — lifecycle policies are your only defence.

Transition old data to cheaper tiers automatically.

Expire incomplete multipart uploads — they're invisible cost leaks.

Storage Class For Your Data

If: Data accessed more than once per month → Use: S3 Standard.

If: Data accessed less than once per month, but retrieval needs to be fast (<5 min) → Use: S3 Standard-IA or One Zone-IA (cheaper, single AZ).

If: Data must be retained for compliance but accessed rarely (<1% annually) → Use: S3 Glacier Deep Archive, $1/TB/month, retrieval within 12 hours.

If: Data is temporary or processing intermediate files → Use: S3 lifecycle expiration to delete after 7-30 days. Set Abort incomplete multipart uploads to 7 days.

Autoscaling: Stop Paying for Idle Capacity

Most teams over-provision because they're terrified of a traffic spike. That fear is expensive. Autoscaling isn't just about handling load — it's about not paying for servers that sit there doing nothing.

The why is simple: steady-state compute is a lie. Your traffic has peaks and valleys. Why provision for the peak 24/7 when you can scale up in 90 seconds?

Here's how you do it right. Set your target CPU or memory utilization to 70-80%. Not 50%. That's just paying for air. Use predictive scaling if your traffic is cyclical. For unpredictable bursts, use dynamic scaling with a cooldown period. Don't use simple scaling — it lags behind.
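A rough sizing calculation shows why the target matters. This is a simplified sketch, not how AWS computes capacity internally: it assumes CPU demand is expressed as the sum of per-instance CPU percentages and ignores warmup and minimum sizes. The function name is illustrative.

```python
import math


def instances_needed(total_cpu_demand_pct: float, target_utilization_pct: float) -> int:
    """Smallest fleet that keeps average CPU at or below the target."""
    return math.ceil(total_cpu_demand_pct / target_utilization_pct)


if __name__ == "__main__":
    demand = 600  # equivalent to six instances running flat out
    for target in (50, 75):
        print(f"target {target}%: {instances_needed(demand, target)} instances")
```

For the same demand, a 50% target needs 12 instances and a 75% target needs 8, so the lower target pays for a third more capacity.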

Production trap: scaling policies that trigger on average metrics across all instances. One hot instance can be dying while the average looks fine. Use the max, not the average.

autoscaling-prod.ymlYAML

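# io.thecodeforge - devops tutorial
# AWS Auto Scaling policy for production (CloudFormation resources)

# Target tracking policy, despite the resource name.
SimpleScalingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref WebASG
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      # Predefined metrics are fleet averages; tracking a max
      # needs a CustomizedMetricSpecification instead.
      PredefinedMetricSpecification:
        PredefinedMetricType: ASGAverageCPUUtilization
      TargetValue: 75
      DisableScaleIn: false
    EstimatedInstanceWarmup: 60

# Fragment: MinSize, MaxSize, and the launch template are omitted here.
WebASG:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    Cooldown: '120'  # default cooldown; only simple scaling policies honor it
    HealthCheckType: EC2
    HealthCheckGracePeriod: 300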

Output

CloudWatch alarm triggers at 75% CPU. ASG scales out 2 instances in 90 seconds. Cost drops 40% during off-peak hours.

⚠ Production Trap: Avg Metrics Will Burn You

Scaling on average CPU masks hot spots. Always use max across instances or percentile 95. One crashed instance can drag the average down while others are screaming.
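A small illustration of the masking effect, using made-up per-instance CPU readings. The instance IDs, readings, and function name are hypothetical.

```python
from statistics import mean


def scale_out_needed(cpu_by_instance, threshold_pct, stat):
    """Apply `stat` (for example mean or max) to the fleet and compare to the threshold."""
    return stat(list(cpu_by_instance.values())) >= threshold_pct


if __name__ == "__main__":
    fleet = {"i-a": 22, "i-b": 25, "i-c": 97, "i-d": 20}  # one instance is saturated
    print("average says scale out:", scale_out_needed(fleet, 75, mean))
    print("max says scale out:", scale_out_needed(fleet, 75, max))
```

The fleet average here is 41%, so an average-based alarm stays quiet while one instance sits at 97%.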

🎯 Key Takeaway

Autoscale to 75% target utilization, not 50%. Idle compute is wasted money.

Egress Costs: The Silent Budget Killer

Everyone watches compute spend. Nobody watches egress. Then the bill comes and you're hemorrhaging 30% on data leaving your cloud. AWS charges $0.09/GB out to internet. That adds up fast when you're lazy with architecture.

The why is infrastructural gravity. Cloud providers make ingress free, but egress is a profit center. Every API response, every log stream, every CDN miss hits your wallet.

How to fight back. First, use a CDN for static assets. CloudFront costs $0.085/GB — still not free, but better than direct S3. Second, compress before transit. Gzip your JSON responses. That's a 3x reduction. Third, colocate services in the same region. Cross-region egress is criminal — $0.02/GB just for moving data between Virginia and Oregon.

Senior shortcut: put a Layer 7 proxy in front of your APIs and cache aggressively. One cached response can absorb a thousand requests that would otherwise hit your origin. And for the love of god, don't ship raw logs to a third-party SIEM without filtering first.
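The caching idea can be sketched in a few lines. This is an in-process stand-in for what a proxy or CDN does, not a replacement for one; the class name, the `fetch` callback, and the injectable `clock` are illustrative assumptions.

```python
import time


class TTLCache:
    """Serve repeated requests from memory until the entry is older than the TTL."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # called only on a miss or an expired entry
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}             # key -> (stored_at, value)
        self.origin_fetches = 0

    def get(self, key):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = self._fetch(key)
        self.origin_fetches += 1
        self._store[key] = (now, value)
        return value


if __name__ == "__main__":
    cache = TTLCache(fetch=lambda k: f"payload for {k}", ttl_seconds=60)
    for _ in range(1000):
        cache.get("/api/catalog")
    print("origin fetches:", cache.origin_fetches)
```

A thousand requests for the same key inside the TTL cost one origin fetch; the next fetch only happens once the entry expires.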

egress-cost-control.ymlYAML

# io.thecodeforge - devops tutorial
# CloudFront distribution with compression

EgressCache:
  Type: AWS::CloudFront::Distribution
  Properties:
    DistributionConfig:
      Enabled: true  # required by CloudFormation
      DefaultCacheBehavior:
        Compress: true
        ViewerProtocolPolicy: redirect-to-https  # required property
        ForwardedValues:  # legacy setting; cache policies are the newer option
          QueryString: false
        MinTTL: 86400
        MaxTTL: 604800
        TargetOriginId: S3-WebAssets
      Origins:
        - DomainName: my-bucket.s3.us-east-1.amazonaws.com
          Id: S3-WebAssets
          S3OriginConfig: {}
      PriceClass: PriceClass_100

Output

Egress from S3 dropped from 500 GB/month to 12 GB/month. Monthly S3 egress cost reduced from $45 to $1.08; CloudFront's own data transfer is billed separately.
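The arithmetic behind those figures, plus the effect of compression, can be checked with a short sketch. It uses the $0.09/GB internet egress rate quoted earlier; the sample payload and function names are made up, and real compression ratios depend on your data.

```python
import gzip
import json

EGRESS_USD_PER_GB = 0.09  # internet egress rate quoted in this section


def compression_ratio(payload: bytes) -> float:
    """Uncompressed size divided by gzip-compressed size."""
    return len(payload) / len(gzip.compress(payload))


def monthly_egress_cost(gb_per_month: float, ratio: float = 1.0) -> float:
    """Egress bill for the month, optionally after compressing by `ratio`."""
    return gb_per_month / ratio * EGRESS_USD_PER_GB


if __name__ == "__main__":
    sample = json.dumps(
        [{"id": i, "status": "ok", "region": "us-east-1"} for i in range(1000)]
    ).encode()
    print(f"gzip ratio on sample JSON: {compression_ratio(sample):.1f}x")
    print(f"500 GB uncompressed: ${monthly_egress_cost(500):.2f}")
    print(f"500 GB at 3x compression: ${monthly_egress_cost(500, 3.0):.2f}")
```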

💡Senior Shortcut: Cache Aggressively, Filter Logs

Put a 7-day TTL on your static API responses. That single change cut egress by 70% on a platform I optimized. Logs: ship only errors and slow paths. Ignore 200s.
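A log filter along those lines might look like the sketch below. The record fields (`status`, `duration_ms`), the 400+ definition of an error, and the 1000 ms slow threshold are all assumptions chosen for illustration.

```python
def should_ship(record, slow_ms=1000):
    """Ship errors and slow requests; drop fast successful ones."""
    return record["status"] >= 400 or record["duration_ms"] >= slow_ms


if __name__ == "__main__":
    logs = [
        {"path": "/health", "status": 200, "duration_ms": 3},
        {"path": "/checkout", "status": 500, "duration_ms": 120},
        {"path": "/search", "status": 200, "duration_ms": 2400},
    ]
    shipped = [r for r in logs if should_ship(r)]
    print(f"shipping {len(shipped)} of {len(logs)} records")
```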

🎯 Key Takeaway

Ingress is free; egress is not. Cache everything, compress everything, colocate everything.

Identify Mismanaged Resources Before They Leak Budget

Wasted cloud spend often hides in orphaned volumes, unattached IPs, idle load balancers, and oversized databases. Before optimizing anything, you must find these leaks. Start with a cost anomaly detection tool (AWS Cost Anomaly Detection, Azure Advisor, or GCP Recommender) that flags spend spikes and unused resources.

Audit your environment weekly: list all EBS volumes not attached to an instance, all Elastic IPs not assigned to a running resource, and all load balancers with zero traffic. Orphaned storage volumes alone account for 5-10% of monthly costs in typical accounts. Automate termination with lifecycle hooks or tag-based cleanup scripts.

The cost of ignoring mismanaged resources compounds rapidly as your infrastructure grows. A single forgotten 500GB gp3 volume costs $40/month; ten such volumes waste $4,800/year. Set up a recurring cron job or serverless function (e.g., AWS Lambda) that runs every Sunday, tags resources with 'last-seen: date', and sends a report to your team channel. Prune before you save.

cleanup-orphaned-volumes.ymlYAML

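# io.thecodeforge - devops tutorial
# Assumes repository secrets AWS_ROLE_ARN and SLACK_WEBHOOK_URL exist.

name: aws-orphan-volume-cleanup
on:
  schedule:
    - cron: '0 8 * * 0'  # every Sunday 8 AM UTC
permissions:
  id-token: write  # needed for OIDC role assumption
  contents: read
jobs:
  find-and-report:  # this job only lists and reports; it deletes nothing
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: List unattached EBS volumes
        run: |
          aws ec2 describe-volumes \
            --filters Name=status,Values=available \
            --query 'Volumes[*].VolumeId' \
            --output text
      - name: Notify Slack
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          curl -X POST -H 'Content-type: application/json' \
            --data '{"text":"Orphan volume report generated"}' \
            "$SLACK_WEBHOOK_URL"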

Output

A list of volume IDs printed to the console then a Slack notification fired.
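To put a price on that list, the sketch below totals unattached volume sizes. It assumes records shaped like the `Volumes` entries from EC2 DescribeVolumes (`VolumeId`, `Size` in GiB, `State`, `Attachments`) and a gp3 rate of $0.08/GB-month, which is the rate implied by the $40 figure for 500GB. The sample data is hypothetical.

```python
GP3_USD_PER_GB_MONTH = 0.08  # assumed rate; matches $40/month for 500GB


def orphaned_volumes(volumes):
    """Volumes that are available and have no attachments."""
    return [v for v in volumes if v["State"] == "available" and not v.get("Attachments")]


def monthly_waste(volumes):
    """Monthly cost of all orphaned volumes at the assumed gp3 rate."""
    return sum(v["Size"] for v in orphaned_volumes(volumes)) * GP3_USD_PER_GB_MONTH


if __name__ == "__main__":
    volumes = [
        {"VolumeId": f"vol-{i:04d}", "Size": 500, "State": "available", "Attachments": []}
        for i in range(10)
    ]
    volumes.append(
        {"VolumeId": "vol-live", "Size": 200, "State": "in-use",
         "Attachments": [{"InstanceId": "i-0abc"}]}
    )
    waste = monthly_waste(volumes)
    print(f"{len(orphaned_volumes(volumes))} orphans, ${waste:.2f}/month, ${waste * 12:.2f}/year")
```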

⚠ Production Trap:

Never auto-delete volumes without a backup snapshot. Always snapshot first, then delete only after the volume has stayed unattached for 7 days.
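One way to encode that rule is a small decision function that a cleanup script calls per volume. The inputs (`unattached_since`, `has_snapshot`) are assumptions about what your script tracks, for example via a last-seen tag; the function name is illustrative.

```python
from datetime import date


def next_action(unattached_since: date, has_snapshot: bool, today: date,
                grace_days: int = 7) -> str:
    """Return 'snapshot', 'wait', or 'delete' for an unattached volume."""
    if not has_snapshot:
        return "snapshot"  # never delete without a backup
    if (today - unattached_since).days < grace_days:
        return "wait"      # still inside the grace period
    return "delete"


if __name__ == "__main__":
    today = date(2026, 3, 6)
    print(next_action(date(2026, 2, 20), has_snapshot=False, today=today))
    print(next_action(date(2026, 3, 3), has_snapshot=True, today=today))
    print(next_action(date(2026, 2, 20), has_snapshot=True, today=today))
```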

🎯 Key Takeaway

Scan for orphaned resources weekly. One unattached volume is a leak; ten is a flood.


Pause Idle Redshift Clusters Instead of Paying Per Hour

Redshift clusters burn money even when no queries run. A single dc2.large cluster costs roughly $0.25/hour ($180/month) doing nothing. For development, staging, or analytics environments that sit idle overnight or on weekends, pausing the cluster cuts compute costs to zero for those hours. The why is simple: Redshift charges by the node-hour whether you query it or not.

The how is straightforward. Use the AWS Console, CLI, or an automated Lambda function triggered by a CloudWatch schedule to pause the cluster after 60 minutes of no activity. Track cluster activity via CloudWatch metrics for query duration and connections. For repeatable patterns (pause at 7 PM, resume at 7 AM) use an EventBridge rule.

Pausing preserves data but stops compute billing; storage costs remain. To cut storage costs as well, take a final snapshot and delete the cluster before long idle periods, then restore from the snapshot when it is needed again. A data team test cluster paused 16 hours/day saves roughly two-thirds of its compute bill. Do not automate pausing for production clusters serving live traffic unless you have a warm standby or auto-resume logic.
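The savings and the pause window are easy to sanity-check. This sketch uses the $0.25/hour dc2.large figure from this section and a 30-day month; the function names and the UTC-hour convention are assumptions.

```python
HOURLY_RATE_USD = 0.25  # dc2.large figure quoted in this section


def compute_savings(paused_hours_per_day: float, days: int = 30):
    """Return (dollars saved, fraction of compute bill saved) for the period."""
    saved = HOURLY_RATE_USD * paused_hours_per_day * days
    full = HOURLY_RATE_USD * 24 * days
    return saved, saved / full


def should_be_paused(hour: int, pause_at: int = 19, resume_at: int = 7) -> bool:
    """True if `hour` (0-23) falls in the pause window, which may wrap past midnight."""
    if pause_at < resume_at:
        return pause_at <= hour < resume_at
    return hour >= pause_at or hour < resume_at


if __name__ == "__main__":
    for hours in (12, 16):
        saved, fraction = compute_savings(hours)
        print(f"paused {hours}h/day: saves ${saved:.2f} ({fraction:.0%} of compute)")
```

A 7 PM to 7 AM schedule pauses 12 hours a day, which is half the compute bill; 16 hours a day is two-thirds.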

pause-redshift-weekly.ymlYAML

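# io.thecodeforge - devops tutorial
# Assumes a repository secret AWS_ROLE_ARN; a matching resume job is not shown.

name: pause-redshift-weekend
on:
  schedule:
    - cron: '0 20 * * 5'  # Friday 8 PM UTC
permissions:
  id-token: write  # needed for OIDC role assumption
jobs:
  pause-cluster:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: Pause Redshift cluster
        run: |
          aws redshift pause-cluster \
            --cluster-identifier my-dev-cluster
      - name: Confirm state
        run: |
          aws redshift describe-clusters \
            --cluster-identifier my-dev-cluster \
            --query 'Clusters[0].ClusterStatus'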

Output

The cluster status changes from 'available' to 'pausing', then 'paused'.

⚠ Production Trap:

Pausing drops all active connections. Applications that retry endlessly without a timeout will pile up errors. Always pair pause with a maintenance window flag in your app config.
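On the application side, a bounded retry that checks a maintenance flag first could look like this sketch. The `in_maintenance` callback, the `MaintenanceWindow` exception, and the backoff values are hypothetical; wire them to whatever config flag your app uses.

```python
import time


class MaintenanceWindow(Exception):
    """Raised when the cluster is known to be paused."""


def query_with_retry(run_query, in_maintenance, max_attempts=3,
                     base_delay=0.5, sleep=time.sleep):
    """Fail fast during maintenance; otherwise retry a bounded number of times."""
    if in_maintenance():
        raise MaintenanceWindow("cluster is paused for maintenance")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


if __name__ == "__main__":
    print(query_with_retry(lambda: "42 rows", in_maintenance=lambda: False))
```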

🎯 Key Takeaway

Pause non-production Redshift clusters during off-hours. Compute billing stops immediately; data persists.

● Production incident · POST-MORTEM · severity: high

The Forgotten Dev Database That Cost $120,000

Symptom

Monthly AWS bill gradually increased from $45k to $110k over a year, with no clear spike — just a steady upward drift. The RDS line item alone was $8,500/month.

Assumption

Teams assumed the finance department tracked cost allocation. They didn't. Everyone thought someone else was monitoring the staging environment.

Root cause

A single db.r5.4xlarge RDS instance in us-east-1 was left running after the project was shelved. No auto-stop policy, no tag indicating ownership, and no CloudWatch alarm on cost anomalies.

Fix

Stopped the instance, took a final snapshot, then deleted it. Set up a monthly cost anomaly budget with AWS Budgets. Added a mandatory 'expiration_date' tag enforced via a Lambda function that terminates untagged resources older than 90 days.

Key lesson

Tag every resource with owner, environment, and expiration date — enforce it in IaC.
Set up cost anomaly alerts from day one, not after the bill shocks you.
Treat staging environments like production: use auto-stop schedules and instance scheduler.
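The selection logic behind such an enforcement Lambda can be sketched without any AWS calls. The resource record shape (`id`, `tags`, `created`) is a hypothetical normalised form, and the required tag names follow the lesson above; the actual termination step is left out.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_TAGS = {"owner", "environment", "expiration_date"}


def termination_candidates(resources, now, max_age_days=90):
    """IDs of resources missing any required tag and older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    candidates = []
    for resource in resources:
        missing = REQUIRED_TAGS - set(resource["tags"])
        if missing and resource["created"] < cutoff:
            candidates.append(resource["id"])
    return candidates


if __name__ == "__main__":
    now = datetime(2026, 3, 6, tzinfo=timezone.utc)
    inventory = [
        {"id": "db-old-untagged", "tags": {}, "created": now - timedelta(days=400)},
        {"id": "db-new-untagged", "tags": {}, "created": now - timedelta(days=10)},
        {"id": "db-old-tagged",
         "tags": {"owner": "data", "environment": "dev", "expiration_date": "2026-06-01"},
         "created": now - timedelta(days=400)},
    ]
    print(termination_candidates(inventory, now))
```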

Production debug guide

When your cloud bill jumps, follow this symptom→action grid to find the leak fast.

Symptom · 01

Monthly bill increased by 30%+ with no deployment change

→

Fix

Open AWS Cost Explorer → Group by service → look for the top-3 cost drivers that changed. Filter by 'yesterday' vs 'same day last week'.

Symptom · 02

A specific service (e.g., RDS) shows rising cost despite no new instances

→

Fix

Check for unused Multi-AZ deployments. Often failover tests leave standby replicas running at full cost. Also verify that old snapshots are not auto-retained beyond the policy.

Symptom · 03

Data transfer costs suddenly spike

→

Fix

Look for cross-region traffic or NAT Gateway processing. Use VPC Flow Logs to identify heavy traffic sources. Filter by bytes sent to public addresses.

Symptom · 04

Storage costs increase but file count is stable

→

Fix

Check S3 storage class transition policies. Old objects may still be sitting in S3 Standard when they should be in Glacier. Also review incomplete multipart uploads, which accumulate hidden chunks.

Symptom · 05

Team A suspects Team B is driving up costs

→

Fix

Enforce cost allocation tags and run the AWS Cost & Usage Report with tag splits. If no tags exist, use CloudFormation/SAM templates to retroactively tag resources by creation time pattern.
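For the first symptom in the grid, ranking services by how much their cost changed is a quick way to find the driver. The sketch assumes you have already exported per-service totals for two periods (for example from Cost Explorer grouped by service); the numbers and function name are hypothetical.

```python
def top_cost_changes(previous, current, n=3):
    """Return the n services with the largest cost increase, biggest first."""
    services = set(previous) | set(current)
    deltas = {s: current.get(s, 0) - previous.get(s, 0) for s in services}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:n]


if __name__ == "__main__":
    last_month = {"RDS": 3000, "EC2": 20000, "S3": 5000, "NAT Gateway": 400}
    this_month = {"RDS": 8500, "EC2": 21000, "S3": 5200, "NAT Gateway": 2400}
    for service, delta in top_cost_changes(last_month, this_month):
        print(f"{service}: +${delta}")
```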

★ Cost Waste Forensics Playbook

Five-minute checks when you suspect cloud waste is bleeding money.

Idle compute instances

Immediate action

Run AWS Compute Optimizer for EC2 + RDS recommendations

Commands

aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" --query "Reservations[*].Instances[*].[InstanceId,LaunchTime,State.Name]" --output table

aws ce get-rightsizing-recommendation --service AmazonEC2

Fix now

Stop instances with <5% CPU over the last week and no current load balancer target membership.
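That rule translates into a simple filter once the metrics are gathered. The record fields (`id`, `avg_cpu_7d`, `in_lb_target_group`) are a hypothetical shape you would build from CloudWatch and load balancer target data; nothing here calls AWS.

```python
def stop_candidates(instances, cpu_threshold_pct=5.0):
    """IDs of instances under the CPU threshold that serve no load balancer."""
    return [
        i["id"]
        for i in instances
        if i["avg_cpu_7d"] < cpu_threshold_pct and not i["in_lb_target_group"]
    ]


if __name__ == "__main__":
    fleet = [
        {"id": "i-idle", "avg_cpu_7d": 1.2, "in_lb_target_group": False},
        {"id": "i-quiet-but-serving", "avg_cpu_7d": 2.0, "in_lb_target_group": True},
        {"id": "i-busy", "avg_cpu_7d": 48.0, "in_lb_target_group": False},
    ]
    print(stop_candidates(fleet))
```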

Orphaned EBS volumes / Elastic IPs

S3 storage cost out of control

NAT Gateway costs spiking

Cost Optimisation Tactics Comparison

Strategy	Savings Potential	Effort to Implement	Risk of Over-Optimisation
Rightsizing	20-50%	Low (use CloudWatch + Compute Optimizer)	Low — can always scale back up
Reserved Instances / Savings Plans	30-72%	Medium (requires usage commitment)	Medium — locked into instance family (RIs). Savings plans are safer.
Spot / Preemptible Instances	60-90%	High (requires architectural changes for fault tolerance)	High — if not handled, interruptions can cause data loss or downtime
Storage Lifecycle Policies	50-80% on storage costs	Medium (set once per bucket, but monitoring needed)	Low — retrieval cost trade-off. Test retrieval time.
Tagging + Cost Allocation	Indirect (enables all other optimisations)	Medium (enforcement needed)	None — more visibility never hurts

⚙ Quick Reference

10 commands from this guide

File	Command / Code	Purpose
main.tf	resource "aws_instance" "web" {	What is Cloud Cost Optimisation?
rightsize_check.sh	aws cloudwatch get-metric-statistics \	Rightsizing
ri_analysis.py	pricing = boto3.client('pricing', region_name='us-east-1')	Reserved Instances and Savings Plans
spot_termination_listener.py	METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"	Spot and Preemptible Instances
lambda_enforce_tags.py	from botocore.exceptions import ClientError	Tagging and Cost Allocation
lifecycle-policy.json	{	Storage Lifecycle Management
autoscaling-prod.yml	SimpleScalingPolicy:	Autoscaling
egress-cost-control.yml	EgressCache:	Egress Costs
cleanup-orphaned-volumes.yml	name: aws-orphan-volume-cleanup	Identify Mismanaged Resources Before They Leak Budget
pause-redshift-weekly.yml	name: pause-redshift-weekend	Pause Idle Redshift Clusters Instead of Paying Per Hour

Key takeaways

Cloud waste is invisible: you must actively look for idle, oversized, and orphaned resources.

Rightsize first, reserve second, spot third. This order maximises savings with minimum risk.

Tags are the prerequisite for any cost visibility. Enforce them in IaC, not in documentation.

Storage costs grow linearly: lifecycle policies are non-negotiable.

Cost optimisation is a continuous practice, not a one-time cleanup. Run monthly reviews.

Discounts on waste are still waste. Never buy reserved capacity for under-utilised resources.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

A team tells you their AWS bill doubled in the last month. How do you de...

Q02SENIOR

Explain when you would choose a Compute Savings Plan over a Standard Res...

Q03SENIOR

Your team wants to cut cloud costs by 40%. What's your step-by-step appr...

Q01 of 03SENIOR

A team tells you their AWS bill doubled in the last month. How do you debug it?

ANSWER

First, don't panic. Open AWS Cost Explorer and filter by service. Look for the top cost drivers that changed — usually one service is the culprit. Check for new resource types, cross-region data transfer spikes, or increased usage in an existing service. Run the Cost Anomaly Detection dashboard. Also verify that no new accounts were added under an organization that suddenly incurred costs. If the culprit is a single service, drill into the specific resources. Common causes: a new CI/CD agent using large instances, an EBS snapshot policy that retained all backups, or an S3 bucket with public access that was scraped.

