Cloud Cost Optimisation — The $120K Forgotten Database
One untagged RDS instance pushed a monthly AWS bill from $45K to $110K.
- Cloud cost optimisation is the practice of aligning cloud spend with actual usage
- Rightsizing matches instance types to workload requirements — downsizing 40% of instances cuts bills by 20-50%
- Reserved instances give 30-72% discounts over on-demand for steady workloads
- Spot instances reduce costs by 60-90% for fault-tolerant, interruptible jobs
- Tagging without enforcement is just decoration — build automated policies that stop untagged resources
- Biggest mistake: mistaking reserved instance discounts for cost control while leaving idle resources running
Imagine you rent a huge warehouse with 50 rooms, but you only ever use 3 of them — and you pay full rent every single month. Cloud cost optimisation is the art of figuring out which rooms you actually need, downsizing to the right-sized space, pre-paying for rooms you're certain you'll use long-term, and turning off the lights in rooms nobody is in after 6pm. Your cloud bill works exactly the same way.
Cloud bills have a nasty habit of doing one thing: going up. A startup spins up a few EC2 instances to test an idea, forgets to turn them off, adds an RDS database 'just for dev', and six months later the founders are staring at a $40,000 invoice wondering how it happened. This isn't a cautionary tale — it's Tuesday at most engineering companies. Cloud providers design their consoles to make provisioning effortless and deprovisioning forgettable. That asymmetry is expensive.
The problem cloud cost optimisation solves isn't just waste — it's invisible waste. Unused Elastic IPs, idle load balancers, forgotten S3 buckets full of old logs, over-provisioned RDS instances running at 8% CPU — none of these announce themselves. They silently accumulate. Cost optimisation is the discipline of making cloud spend intentional: every resource should be the right size, running only when needed, and tagged so you know exactly which team or product is paying for it.
By the end of this article you'll know how to identify and eliminate the four biggest categories of cloud waste, how to write Infrastructure as Code that enforces cost-aware defaults, how to use reserved capacity and spot instances strategically, and how to build a tagging policy that gives you real visibility. These are the same techniques engineering teams at scale use to routinely cut cloud bills by 30–60% without touching a single line of application code.
What is Cloud Cost Optimisation?
Cloud cost optimisation is the practice of continuously aligning cloud resource provisioning with actual workload requirements. It's not about cutting corners — it's about eliminating the gap between what you pay and what you need. That gap is almost always wider than you think.
Most engineers treat cloud spend as a fixed cost, like rent. It's not. Every resource is individually billable, and providers design their consoles to make adding resources frictionless. The result? A typical AWS account has 30-40% waste: idle instances, oversized databases, unused storage, and orphaned resources that nobody remembers creating.
Optimisation happens at three levels:
- Tactical: Rightsizing instances, stopping idle resources, removing orphaned volumes.
- Strategic: Reserved instances, savings plans, spot usage for batch workloads.
- Cultural: Tagging policies, cost allocation reports, and developer education on cost awareness.
Each level compounds. Without tactical cleanup, strategic discounts are wasted on empty resources. Without cultural enforcement, tactical wins revert within a quarter.
Work through the savings in this order:
- First: stop paying for resources you don't use (idle instances, orphaned volumes).
- Second: shrink resources you use but don't need full power for (rightsize).
- Third: commit to longer-term usage for discounts (RIs, savings plans) only after you've right-sized.
- Fourth: use spot for variable, fault-tolerant workloads only when steady-state is optimised.
10-Point Cloud Cost Optimization Checklist
Use this checklist as a quick reference to audit your cloud spend. Each point addresses a common source of waste or a best practice for cost control.
- Stop idle compute resources. Identify instances with less than 5% CPU over the last 14 days. Stop or terminate them. Use AWS Instance Scheduler to automate stop/start.
- Rightsize over-provisioned instances. Run Compute Optimizer or cross-check CloudWatch metrics. Downsize to the smallest instance type that meets your workload’s peak requirements.
- Remove orphaned EBS volumes and Elastic IPs. Unattached volumes and unused IPs accrue costs. List and delete them monthly.
- Set S3 lifecycle policies. Transition logs and backups to cheaper storage classes (Standard-IA → Glacier → Deep Archive) and expire old objects.
- Abort incomplete multipart uploads. Add a lifecycle rule to remove incomplete uploads older than 7 days. They accumulate hidden storage cost.
- Enforce tagging on all resources. Use IaC policy (Terraform with Checkov, CloudFormation stack policy) to require Owner, Environment, and CostCenter tags. Automatically terminate untagged resources.
- Buy Reserved Instances or Savings Plans only after rightsizing. Commit for steady-state, rightsized workloads. Start with 1-year partial upfront to preserve flexibility.
- Use Spot instances for fault-tolerant workloads. Batch processing, CI/CD workers, and stateless microservices are ideal. Implement termination handling.
- Set up cost anomaly alerts. Use AWS Budgets or third-party tools to alert on >30% increases in top services. Assign an owner to review alerts weekly.
- Conduct monthly cost reviews. Review Cost Explorer, look for new services or unexplained spikes. Maintain a shared spreadsheet of cost optimization actions.
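The first checklist item can be sketched in a few lines. This is a minimal illustration, assuming you have already exported one daily average CPU figure per instance (for example from CloudWatch); the instance IDs and numbers are hypothetical, and "idle" is interpreted here as every day in the 14-day window staying under 5%.

```python
IDLE_CPU_THRESHOLD = 5.0  # percent, taken from the checklist
LOOKBACK_DAYS = 14


def find_idle_instances(daily_cpu_by_instance,
                        threshold=IDLE_CPU_THRESHOLD,
                        days=LOOKBACK_DAYS):
    """Return instance IDs whose daily CPU stayed below `threshold` for `days` days."""
    idle = []
    for instance_id, samples in daily_cpu_by_instance.items():
        window = samples[-days:]
        # Not enough history yet: do not flag brand-new instances.
        if len(window) < days:
            continue
        if all(sample < threshold for sample in window):
            idle.append(instance_id)
    return sorted(idle)


# Hypothetical daily average CPU percentages, oldest first.
metrics = {
    "i-web": [35.0] * 14,
    "i-forgotten-dev": [1.2] * 14,
    "i-new": [0.5] * 3,
}
print(find_idle_instances(metrics))  # ['i-forgotten-dev']
```

Flagged instances are candidates for review, not automatic termination: a low-CPU box may still be memory- or I/O-bound.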
Rightsizing: Match Instance Size to Actual Workload
Rightsizing means picking the instance type and size that matches your workload's actual resource consumption. The default — t3.medium for everything — is almost always wrong. A web server that serves 100 req/s might need 4 vCPUs; one that serves 5 req/s might run fine on a burstable nano.
AWS Compute Optimizer and Azure Advisor give recommendations based on historical CPU, memory, and network utilisation. But they're conservative — they only suggest downsizing if utilisation is below a threshold (e.g., 40% max CPU over 2 weeks). In practice, most engineering teams can downsize 50% of their instances with no performance impact.
- Over-provisioned CPU: Burstable instances (T3/T4g) accumulate CPU credits when idle; check credit balance before downsizing.
- Over-provisioned memory: Use CloudWatch mem_used_percent custom metrics; RDS instances often run at <20% memory.
- Over-provisioned I/O: EBS gp3 volumes can be downsized by reducing IOPS and throughput to actual usage patterns.
Rightsizing is a continuous process. Quarterly reviews catch new waste from autoscaling groups that launched larger instances during a peak and never scaled down.
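A rightsizing rule of this kind can be sketched as a small function. The version below is an assumption-laden illustration, not how Compute Optimizer works: it uses the 40% peak-CPU threshold mentioned above, the T3 size ladder, and moves one size at a time so each step can be re-measured before the next.

```python
# Size ladder for the T3 family; other families start at different sizes.
SIZES = ["nano", "micro", "small", "medium", "large", "xlarge", "2xlarge"]


def suggest_size(instance_type, peak_cpu_percent, downsize_below=40.0):
    """Suggest one size smaller when peak CPU is under the threshold."""
    family, size = instance_type.split(".")
    index = SIZES.index(size)
    if peak_cpu_percent < downsize_below and index > 0:
        return f"{family}.{SIZES[index - 1]}"
    # Either busy enough already, or there is no smaller size to move to.
    return instance_type


# Hypothetical two-week peak CPU readings.
print(suggest_size("t3.medium", 18.0))  # t3.small
print(suggest_size("t3.medium", 65.0))  # t3.medium
print(suggest_size("t3.nano", 2.0))     # t3.nano
```

CPU is only one dimension; a real check would apply the same test to memory and network before acting on the suggestion.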
Reserved Instances and Savings Plans: Commit to Save
Reserved Instances (RIs) and Savings Plans offer discounts in exchange for committed usage. If you know you'll run a database or web server for the next 1-3 years, you can pay all upfront, partially upfront, or nothing upfront and get 30-72% off the on-demand rate.
- Standard RIs: Locked to a specific instance family (e.g., m5). Highest discount, least flexibility.
- Convertible RIs: Allow changing instance family but lower discount (40-60%).
- Compute Savings Plans: Apply to EC2, Fargate, and Lambda usage regardless of instance family, size, or region. More flexible, slightly lower discount than RIs.
- EC2 Instance Savings Plans: Apply to a specific instance family within a region.
Senior engineers treat RIs as a second-order optimisation, after rightsizing. You don't want to commit to an m5.large for 3 years only to discover you could have moved to t3.medium and saved more without the commitment.
One real strategy: use a mix of 1-year partial upfront for predictable workloads and 3-year all upfront for baseline production. This balances discount depth with financial flexibility.
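The reason commitments come after rightsizing is simple arithmetic: a commitment is billed for every hour whether or not the capacity is used. The sketch below works through that break-even point; the hourly rate is hypothetical and the model ignores upfront-payment timing.

```python
HOURS_PER_YEAR = 8760


def break_even_utilisation(discount):
    """Fraction of hours a commitment must be used to match on-demand cost."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def commitment_saving(on_demand_hourly, discount, utilisation, hours=HOURS_PER_YEAR):
    """Yearly saving versus on-demand; negative means the commitment lost money.

    `utilisation` is the fraction of hours the capacity is actually needed.
    """
    committed_cost = on_demand_hourly * (1 - discount) * hours   # paid regardless
    on_demand_cost = on_demand_hourly * utilisation * hours      # paid only when used
    return on_demand_cost - committed_cost


# Hypothetical instance at $0.10/hour with a 40% commitment discount.
print(break_even_utilisation(0.40))
print(round(commitment_saving(0.10, 0.40, utilisation=1.0), 2))
print(round(commitment_saving(0.10, 0.40, utilisation=0.5), 2))
```

With a 40% discount the commitment only pays off if the capacity is needed more than 60% of the time, which is why an instance that later turns out to be idle or oversized erases the discount.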
Savings Plans vs Reserved Instances: Which Should You Choose?
Both Reserved Instances (RIs) and Savings Plans offer deep discounts in exchange for committing to a certain dollar amount per hour (Savings Plans) or a specific instance configuration (RIs). The right choice depends on how predictable and stable your workloads are.
Standard RIs give the highest discount (up to 72%) but lock you to a specific instance family in a specific region. They are best for steady-state, long-lived workloads that won't change architecture — e.g., a production MySQL database on db.r5.large.
Convertible RIs offer slightly lower discounts (40-60%) but allow you to exchange them for a different instance family, size, operating system, or tenancy within the same region. They’re useful if you anticipate moderate changes but still want a commitment discount.
Compute Savings Plans apply to EC2, Fargate, and Lambda usage in any region (containers on ECS or EKS are covered through the EC2 or Fargate capacity they run on). Discounts are 30-50%, but you gain flexibility to change instance types, sizes, or regions, or even move between compute services. Ideal for containerized workloads.
EC2 Instance Savings Plans are similar to Standard RIs — they apply to a specific instance family in a region — but are slightly more flexible because they cover any instance size within that family.
| Feature | Standard RI | Convertible RI | Compute Savings Plan | EC2 Instance Savings Plan |
|---|---|---|---|---|
| Discount depth (1yr) | ~40% | ~30% | ~30% | ~35% |
| Discount depth (3yr) | ~60-72% | ~50-60% | ~45-55% | ~55-65% |
| Instance family locked? | Yes | Can change | No | Yes (family only) |
| Region locked? | Yes | Yes | No (floating) | Yes (region) |
| Scope | Single AZ or entire region | Single AZ or region | All regions | Per region |
| Best for | Predictable, immutable workloads | Workloads with future migration plans | Containerized or serverless architectures | Steady but variable instance sizes |
If your architecture is cloud-native and you use containers or Lambda, Compute Savings Plans are the safest bet. If you have legacy monoliths on known instance types, Standard RIs maximize savings. Never buy a 3-year RI until you have at least 6 months of stable usage data — financial flexibility is worth the lower discount.
Spot and Preemptible Instances: Cheap Compute for the Brave
Spot instances (AWS) and preemptible VMs (GCP) let you use spare cloud capacity at a steep discount — typically 60-90% off on-demand price. The trade-off: the cloud provider can reclaim the instance with just a few minutes' notice (2 minutes in AWS, 30 seconds in GCP).
Good candidates for spot:
- Batch processing (data pipelines, CI/CD workers)
- Stateless microservices that can tolerate interruption
- Fault-tolerant distributed jobs (e.g., training ML models with checkpointing)
- Rendering or simulation workloads
The challenge is handling termination gracefully. Your application must save state or be able to restart. AWS sends a Spot Instance Termination Notice event 2 minutes before reclaiming the instance. You can catch this via the instance metadata endpoint.
Avoid spot for stateful databases, single-instance workloads, or anything that can't tolerate an abrupt stop. The cost savings are real, but the operational cost of re-architecting for spot can be significant.
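Termination handling usually starts with polling the instance metadata service for the `spot/instance-action` document, which returns 404 until an interruption is scheduled. The following is a minimal sketch using IMDSv2 and only the standard library; the function names are illustrative, and the fetch step only works from inside an EC2 instance, so the demo below exercises just the parsing logic with a sample payload.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

METADATA = "http://169.254.169.254/latest"


def parse_instance_action(status_code, body, now=None):
    """Return seconds until reclaim, or None when no interruption is scheduled."""
    if status_code != 200:
        return None  # 404 is the normal "nothing scheduled" answer
    notice = json.loads(body)
    reclaim_at = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reclaim_at - now).total_seconds())


def fetch_instance_action(timeout=2):
    """Query the metadata service (IMDSv2). Only works on an EC2 instance."""
    token_request = urllib.request.Request(
        f"{METADATA}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as response:
        token = response.read().decode()
    request = urllib.request.Request(
        f"{METADATA}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, ""


# Sample payload in the documented shape; the timestamps are made up.
sample = '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'
noon = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(parse_instance_action(200, sample, now=noon))  # 120.0
print(parse_instance_action(404, ""))                # None
```

A worker would call `fetch_instance_action()` every few seconds and, once a non-None value comes back, checkpoint its state and stop accepting new work.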
Use the lowestPrice allocation strategy only for extremely fault-tolerant workloads; use capacityOptimized for production.
Tagging and Cost Allocation: Visibility Is the Prerequisite
You can't optimise what you can't see. Tagging is the foundation of cost visibility — attaching metadata (key-value pairs) to every cloud resource so you can attribute costs to teams, products, environments, or cost centers.
But tags are only useful if they're:
1. Mandatory: IaC policies should reject deployments without required tags.
2. Standardised: A fixed set of tags (e.g., Owner, Environment, CostCenter, Project) used everywhere.
3. Enforced: Resources created without required tags are automatically terminated or reported.
Without enforcement, tags become optional and quickly rot. After three months, half your resources are untagged and you're back to guessing who's spending what.
AWS provides Cost Allocation Tags that appear in the Cost Explorer and detailed billing reports. You can also create user-defined tags. Once tags are in place, you can run reports per team, set budget alerts for specific CostCenter tags, and even block untagged resources from launching via SCP (Service Control Policies) or Custom Lambda functions.
- Mandatory tags: Owner, Environment, CostCenter, Project.
- Enforcement at deployment: refuse to create resources without tags.
- Automated cleanup: a Lambda function that either tags unknown resources or terminates them, backed by SCPs that deny untagged launches.
- Report weekly: Cost Explorer per tag → Share with team leads.
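The "refuse to create resources without tags" rule boils down to a check like the one below. This is a minimal sketch of the logic only, assuming the planned resources have already been extracted into plain dictionaries (for instance from a Terraform plan); the resource names and tag values are hypothetical.

```python
REQUIRED_TAGS = ("Owner", "Environment", "CostCenter", "Project")


def missing_tags(tags):
    """Return the required tag keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def check_resources(resources):
    """Map resource name -> missing tags, for non-compliant resources only."""
    report = {}
    for name, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[name] = gaps
    return report


# Hypothetical resources from a deployment plan.
planned = {
    "web": {
        "Owner": "payments-team",
        "Environment": "prod",
        "CostCenter": "cc-101",
        "Project": "checkout",
    },
    "dev-db": {"Environment": "dev"},
}
violations = check_resources(planned)
print(violations)  # {'dev-db': ['Owner', 'CostCenter', 'Project']}
```

In a CI pipeline, a non-empty `violations` result would fail the build before anything is provisioned.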
The Cost of Untagged Resources: A Visual Impact
Tags are the single most important tool for cost attribution, but they only work when enforced. Untagged resources are invisible in cost reports — they fall into a generic "Untagged" bucket that tells you nothing about ownership, project, or environment. This invisibility is a direct cause of cloud waste.
The diagram below shows the chain reaction from an untagged resource to a growing bill. When a developer provisions an EC2 instance without tags:
- The instance shows up in Cost Explorer under "No Tag: Environment".
- The operations team doesn’t know who owns it or why it exists.
- No one is responsible for stopping it, so it runs indefinitely.
- The cost is buried in a generic line item — nobody notices.
- After months, the bill has crept up by thousands of dollars.
With tagging, the opposite happens: resources are automatically attributed to a team, cost allocation reports surface anomalies, and the team lead gets a budget alert when spend exceeds threshold.
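To make the "Untagged" bucket concrete, here is a small sketch that groups billing line items by a tag and reports how much spend has no owner. The line-item shape and the figures are assumptions for illustration, not the format of any AWS billing export.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key="Owner"):
    """Sum monthly cost per tag value; missing or empty tags go to 'Untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        value = item.get("tags", {}).get(tag_key) or "Untagged"
        totals[value] += item["monthly_cost"]
    return dict(totals)


def untagged_share(line_items, tag_key="Owner"):
    """Fraction of total spend that cannot be attributed to anyone."""
    totals = cost_by_tag(line_items, tag_key)
    total = sum(totals.values())
    return totals.get("Untagged", 0.0) / total if total else 0.0


# Hypothetical line items.
items = [
    {"resource": "i-0a1", "monthly_cost": 1200.0, "tags": {"Owner": "payments"}},
    {"resource": "i-0b2", "monthly_cost": 800.0, "tags": {"Owner": "search"}},
    {"resource": "db-dev", "monthly_cost": 2000.0, "tags": {}},
]
print(cost_by_tag(items))     # {'payments': 1200.0, 'search': 800.0, 'Untagged': 2000.0}
print(untagged_share(items))  # 0.5
```

Tracking the untagged share week over week gives a single number that shows whether tag enforcement is actually working.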
The impact is not just financial. Untagged resources slow down incident response (who to page?), hinder compliance audits, and make capacity planning guesswork. A single untagged production database can delay a root cause analysis by hours because no one knows who launched it.
The visual below summarises the flow from untagged resource to cost leak. Use it to justify enforcing tag policies across your organization.
Storage Lifecycle Management: Stop Paying for Old Logs
Storage costs are the sneakiest creepers in your cloud bill. Data accumulates, and once written, it almost never gets deleted. S3 charges per GB per month, and that cost grows linearly with data volume. A project that starts with 500GB of logs and deletes nothing can easily be paying for 5TB within a year.
The solution is lifecycle policies: automatically transition objects to cheaper storage classes and expire them after a set period. A typical data lifecycle:
- 0-30 days: S3 Standard (hot, frequent access)
- 30-90 days: S3 Infrequent Access (lower storage cost, higher retrieval cost)
- 90-365 days: S3 Glacier (long-term archival, retrieval takes minutes to hours)
- 365+ days: S3 Glacier Deep Archive (cheapest, retrieval takes 12-48 hours)
- Expiration: Delete objects after, say, 3 years.
Set these policies on every bucket from day one. Rules added later also apply to objects already in the bucket, so retrofitting existing buckets still pays off; for one-off bulk jobs across millions of existing objects, S3 Batch Operations is an option.
Also watch for incomplete multipart uploads. S3 charges you for the chunks even if the upload never finished. They accumulate silently. Add a rule to expire incomplete uploads after 7 days.
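The tiering schedule and the multipart cleanup rule can be expressed as one lifecycle configuration. The sketch below builds a dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration` (as its `LifecycleConfiguration` argument); the rule ID and the 3-year expiry are illustrative choices, and nothing here calls AWS. A small helper shows which class an object of a given age would end up in.

```python
import json


def build_lifecycle_configuration(expire_after_days=1095, abort_multipart_after_days=7):
    """Return a lifecycle configuration dict covering the whole bucket."""
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix = every object
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": expire_after_days},
                "AbortIncompleteMultipartUpload": {
                    "DaysAfterInitiation": abort_multipart_after_days
                },
            }
        ]
    }


def storage_class_for_age(age_days, config):
    """Which storage class an object of this age sits in under the first rule."""
    rule = config["Rules"][0]
    if age_days >= rule["Expiration"]["Days"]:
        return "EXPIRED"
    current = "STANDARD"
    for transition in rule["Transitions"]:  # listed in ascending order of Days
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


config = build_lifecycle_configuration()
print(json.dumps(config, indent=2))
for age in (10, 45, 400, 2000):
    print(age, storage_class_for_age(age, config))
```

Keeping the configuration in code makes it easy to apply the same policy to every bucket rather than relying on console clicks.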
The Forgotten Dev Database That Cost $120,000
- Tag every resource with owner, environment, and expiration date — enforce it in IaC.
- Set up cost anomaly alerts from day one, not after the bill shocks you.
- Treat staging environments like production: use auto-stop schedules and instance scheduler.
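A basic anomaly check is just a month-over-month comparison per service. The sketch below uses the 30% threshold from the checklist and also flags services that did not exist last month; the service names and dollar figures are hypothetical, and a real setup would read them from a billing export or rely on AWS Budgets.

```python
def find_cost_anomalies(previous, current, threshold=0.30):
    """Return (service, relative_change) pairs for spend that jumped.

    relative_change is None for services with no spend in the previous month.
    """
    anomalies = []
    for service, cost in current.items():
        before = previous.get(service, 0.0)
        if before == 0:
            if cost > 0:
                anomalies.append((service, None))  # brand-new spend
            continue
        change = (cost - before) / before
        if change > threshold:
            anomalies.append((service, round(change, 2)))
    return sorted(anomalies, key=lambda pair: pair[0])


# Hypothetical monthly spend per service, in dollars.
last_month = {"EC2": 20000.0, "RDS": 15000.0, "S3": 5000.0}
this_month = {"EC2": 21000.0, "RDS": 60000.0, "S3": 5200.0, "SageMaker": 3000.0}
print(find_cost_anomalies(last_month, this_month))
# [('RDS', 3.0), ('SageMaker', None)]
```

The output is only useful if someone owns it: each flagged service should map, via tags, to a team that is expected to explain the change.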
Key takeaways
Common mistakes to avoid
Buying Reserved Instances before rightsizing
Using spot instances without termination handling
Creating a tagging policy but not enforcing it
Storing logs in S3 Standard indefinitely
Setting up cost anomaly alerts but not acting on them
Interview Questions on This Topic
A team tells you their AWS bill doubled in the last month. How do you debug it?
Frequently Asked Questions