Capacity Planning — Why Auto-Scaling Won't Save You
Auto-scaling lags 3-5 minutes during traffic spikes.
20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.
- Capacity planning builds a model of future load: QPS, storage, memory, and bandwidth
- Peak traffic is 10-20x average — plan for the worst hour, not the daily average
- Storage grows with user count and data per user; use retention policies to cap growth
- Memory and CPU scale with request complexity and concurrency, not just QPS
- Bandwidth often becomes the bottleneck before CPU does—CDN caching saves you
- Performance insight: underestimating peak QPS by 2x causes total collapse; auto-scaling lags minutes behind
- Production insight: always model worst-case peak, not average — and validate with load tests before launch
Imagine you're opening a lemonade stand at a school fair. Before you show up, you need to guess: how many kids will come, how many cups you need, how fast you can pour, and whether you need one table or three. Capacity planning is exactly that — but for software. You're estimating how much traffic your system will handle, how much data it'll store, and whether your servers will buckle under pressure. Do it before you build, and you sleep at night. Skip it, and your site goes down the moment it gets popular.
Every system that has ever crashed under load had one thing in common: nobody did the math beforehand. Twitter's Fail Whale, early Reddit meltdowns, the Healthcare.gov launch disaster — these weren't random bad luck. They were the predictable result of shipping systems without ever asking 'what happens when a million people show up at once?' Capacity planning is the engineering discipline that answers that question before it becomes a crisis.
The core problem capacity planning solves is the gap between 'it works on my machine' and 'it works for ten million users.' A system that handles 10 requests per second behaves completely differently at 100,000 requests per second. Memory leaks that are invisible at small scale become catastrophic at large scale. Database queries that return in 2ms under no load suddenly take 4 seconds when 500 connections compete for the same rows. Capacity planning gives you a model — however rough — of where those breaking points are, so you can design around them intentionally rather than discover them in production.
By the end of this article you'll know how to estimate Queries Per Second (QPS) for a real system, calculate storage growth over time, size your bandwidth and memory requirements, and translate all of that into a concrete infrastructure starting point. These are the exact skills that separate engineers who can design systems from engineers who just implement tickets.
Here's the thing: every hour spent planning capacity saves ten hours of production firefighting. It's not a one-time exercise — it's a muscle you build.
Don't assume your cloud provider's default limits will save you. They won't. I've seen a team lose a $500k deal because they hit the default DynamoDB write capacity — and nobody had checked the limit.
What is Capacity Planning?
Capacity planning builds a model of future system load before you write production code. You estimate QPS, storage, memory, CPU, bandwidth — then design around those numbers. It's not about perfect prediction; it's about bounding risk so you don't wake up at 3 AM to a 503 tsunami.
The feedback loop: estimate → build → monitor → adjust. Each iteration tightens the model. Without it, you're guessing. I've seen teams spend weeks optimizing a query that didn't matter while their database would run out of storage in 3 months.
Averages lie. Peak tells the truth. A system handling 100 QPS average might burst to 2000 QPS for 2 minutes. If your pool is sized for the average, you'll saturate connections fast. Always model the worst hour.
Common trap: treating capacity planning as a one-time exercise. It's not. Launch day traffic is nothing compared to year 2 growth. Revisit your model quarterly or whenever you hit a 2x traffic milestone.
Write-heavy workloads are the silent killer. A payment processing team I advised sized for average write QPS — then a flash sale hit 50x burst. The primary fell over, replication lag hit hours. Model write QPS separately with a 3x safety factor.
Another angle: capacity planning is a communication tool. When you have numbers, you can explain to product why a feature launch needs a two-week infra lead time. That conversation never happens when the model lives in someone's head.
- Start with a rough estimate based on assumptions.
- Build infrastructure to that estimate with a safety margin.
- Monitor actual traffic and resource usage in production.
- Update your model with real data; refine for the next cycle.
- The goal is not perfect prediction — it's avoiding catastrophic failure.
Estimating Queries Per Second (QPS)
QPS is the heartbeat of your system. Every other resource — database connections, CPU, memory, bandwidth — depends on it. Start with your expected DAU, multiply by requests per user per day, then apply the peak hour fraction (typically 10% of daily traffic in 1 hour). The formula: peak QPS = (DAU requests/user/day peak fraction) / 3600.
But here's the nuance: the peak fraction varies. A global consumer app might see 15-20% of daily traffic in the evening commute. A B2B SaaS might only see 5-7% during business hours. If you lack historical data, start with 10% and add a 1.5x safety margin.
Also, QPS isn't uniform across endpoints. Your login endpoint may get 10x less traffic than your feed endpoint. Profile your traffic — treat each endpoint's resource cost separately. I've seen teams mis-size compute because they assumed all requests consumed equal CPU.
Another trap: webhooks and callbacks can burst unexpectedly. A payment webhook once caused a 50x spike for 2 seconds, saturating our connection pool. Plan for these async bursts by adding buffer in your peak estimate.
For event-driven systems, QPS estimation is trickier because it depends on producers. Estimate incoming event rate from queue metrics, not endpoint hits. Same formulas apply, but replace DAU with message producers.
Write QPS is often a fraction of read QPS, but each write can be an order of magnitude more expensive — row locks, index updates, replication. Model write QPS separately with a 2x overhead for index maintenance.
- Average QPS hides spikes. A 100 QPS average could mean 2000 QPS burst for 5 minutes.
- Database connection pools and thread pools must handle the burst, not the average.
- Use 95th percentile from monitoring if you have it; otherwise assume peak = 10x average.
Storage Sizing Over Time
Storage grows with two dimensions: number of users and data per user. Each user might store 500KB of profile data, 2MB of images, and 100KB of logs per day. Over a year, that compounds fast. Don't forget replication and backup factors — a 3x replication multiplier is common. A good rule: estimate storage after 1 year with a 2x buffer for growth.
A subtle trap: logs and temporary data explode unexpectedly. A developer adds a debug log that writes 20 bytes per request at 1000 QPS — that's 1.7GB/day. Unnoticed for a week, it fills your disk. Set retention policies and monitor growth rates, not absolute usage.
Also consider data lifecycle. Not all data needs hot storage. Archive old data to cheaper storage (S3 Glacier, GCP Archive) to cap costs. Compression ratios vary: text compresses 4-5x, images don't. Use estimates by data type.
Another common mistake: database storage != file storage. A MongoDB document may be 1KB in your model, but on disk it's 2-3KB with indexes and journaling. Factor 2x for database storage estimates. Also include transaction logs — they can grow significantly during heavy writes.
Cold storage costs and GDPR retention laws may force long-term data keeping. Plan a tiered strategy: hot data on fast SSDs, warm on HDD, cold on object storage. The cost difference can be 10x between hot and cold.
Real-world: A company stored logs indefinitely because they forgot to set retention. Their storage bill hit $200k/month — more than compute. They implemented 30-day retention and tiered old logs to Glacier, cutting costs by 90%.
Also watch for unused objects in S3 that accumulate. Implement lifecycle policies to expire unused data.
Compute and Memory Requirements
CPU and memory are driven by QPS and request complexity. A typical web server consumes 50-100ms CPU time and 10-50MB memory per request. To handle 1000 peak QPS, you need at least 100 concurrent threads (assuming 100ms per request). Each thread may need 2MB, so 200MB for threads alone. Add heap, caches, GC overhead — aim for 4-8GB RAM per instance. Formula: instances = peak QPS / (1 / avg_response_time) / max_concurrency.
But memory isn't just thread stacks. Caches, connection pools, and GC overhead dominate. A 4GB heap with G1GC at 1000 QPS can see GC pauses of 50-100ms — enough to push latency over SLO. Sweet spot for G1GC is 4-8GB; above 8GB pause times increase non-linearly. Below 2GB, GC frequency spikes.
Also monitor non-heap memory: Metaspace, thread stacks, direct buffers. Connection pools consume memory too — 100 connections at 1MB each = 100MB just waiting.
With Java 21 virtual threads, memory per thread drops to ~2KB vs 1MB for platform threads. That means you can handle thousands of concurrent requests with a 2GB heap instead of 8GB. But virtual threads still need carrier threads from a small pool (default = cores). If your code blocks on synchronized or native methods, it pins the carrier, reducing concurrency. Great for I/O-bound, not magic for CPU-bound.
Also consider vertical scaling vs horizontal. Sometimes one large instance is cheaper than many small ones, especially if workloads benefit from large caches. Compare total cost: 8 small vs 1 large with same total RAM — often 20-30% cheaper.
Cold start overhead: containerized services may not be ready immediately after restart. Plan a 30-second grace period in auto-scaling triggers. Use pod disruption budgets in K8s to avoid mass restarts.
Real-world: A trading platform with 16GB heap saw 200ms GC pauses during peak hours. Switching to ZGC dropped pauses to <1ms. ZGC uses more CPU but latency was the constraint. Measure your constraint.
Memory leaks are another common cause. A team ignored growing heap usage over weeks, assuming GC would handle it — then JVM hit OOM. Add weekly heap growth alerts.
Bandwidth and Network Considerations
Bandwidth is the silent killer. A single 500KB image served 1000 times per second consumes 500MB/s of egress bandwidth — about 4 Gbps. Most cloud instances cap network at 10 Gbps. Outbound costs can dominate your bill. Use a CDN for static assets, compress responses, and cache aggressively. For real-time apps, plan for sustained throughput, not just bursts.
Don't forget internal bandwidth either. Cross-AZ traffic costs money and adds latency. If your database writes are routed through a different availability zone, you'll pay egress fees and see 2-5ms additional latency per call. Keep your data and compute inside the same AZ when possible.
Also consider intra-service bandwidth. If your services communicate over HTTP and are chatty, that adds load. Use protobuf or gRPC to reduce payload size.
DNS and TLS handshake overhead: each new connection adds 100-200ms before data transfer. Keep alive connections reduce this. Estimate number of concurrent connections.
Also watch for bandwidth spikes from health checks and monitoring probes. If you have 200 microservices each monitoring each other (mesh), that's 200 * 200 = 40,000 probes per minute. Those small requests add up. Use a dedicated health check service.
Connection multiplexing: HTTP/2 multiplexes streams over a single connection, reducing overhead. For internal services, gRPC HTTP/2 can improve bandwidth utilisation.
Network topology matters. If you use a service mesh like Istio, each proxy adds 5-10% bandwidth overhead. Factor that in. Outbound bandwidth from cloud to internet is often more expensive than inbound. Monitor both directions.
Real-world: A photo-sharing app's egress bill hit $50k/month because images served directly from origin. After enabling CDN, it dropped to $8k/month — 84% reduction. The CDN also cut load on app servers by 90%.
Another: A video streaming startup's $100k monthly bandwidth bill was cut by 70% by moving to CDN and compressing with AV1.
Estimate: peak QPS average response size 1.5 (for headers, retransmits). For media-heavy apps, add another 20% overhead.
Database Capacity Planning
Databases are the hardest component to scale. Unlike app servers, you can't just add instances and expect linear performance. Database capacity planning must account for read throughput, write throughput, storage, connection pool, and replication lag.
First, estimate read QPS and write QPS separately. Reads can be offloaded to replicas — typical pattern: one primary for writes, multiple read replicas. Write capacity is often the bottleneck because every write hits the primary. Size primary's CPU and IOPS accordingly.
Connection pool sizing: pool size = peak QPS * avg query time (seconds). For 1000 QPS with 50ms queries, you need at least 50 connections. But also account for overhead — a common mistake is setting pool size equal to database's max_connections, which can exhaust the database with too many connections. Tune both sides.
Replication lag: if you have read replicas, ensure they can handle read traffic without falling behind. Monitor seconds_behind_master (MySQL) or replica lag (PostgreSQL). Keep lag under 5 seconds for responsive apps.
Storage for databases includes indexes and transaction logs. Indexes can double the storage of a table. Transaction logs (WAL) grow significantly during heavy writes — plan for at least 25% extra storage for logs.
Connection pool memory: each database connection uses about 2-5MB on the database side. For 200 connections, that's 1GB just for connections. Size the database instance accordingly.
Use a connection pooler like PgBouncer or ProxySQL to maintain persistent pool and reduce connection overhead. Some ORMs hold connections longer than expected due to transactional boundaries — test with realistic request patterns.
Real-world failure: A SaaS startup sized RDS instance based on average QPS, ignoring peak. When a customer imported a million records via API, database CPU hit 100%, connection pool saturated, all queries timed out. Recovery took 4 hours. Fix: add write replicas, use async processing, set connection pool limit that prevents thundering herd.
Separate read and write connection pools to avoid contention. Use two datasources or a proxy that routes by query type.
A billing system's database crashed during month-end due to unplanned write bursts — a single script triggered millions of updates. Adding a write queue saved them. Always plan for batch operations that can spike write load.
Capacity Planning for Event-Driven and Async Workloads
Event-driven systems shift the capacity model. Instead of QPS hitting an endpoint, you have message producers pushing events into a queue, and consumers processing at their own rate. The key metric is message arrival rate vs consumption rate. If arrival exceeds consumption, the queue grows indefinitely — hitting queue depth limits, memory pressure, or consumer timeouts.
Start by estimating peak message production rate. Often comes from upstream services or external webhooks. For example, a payment webhook might deliver 1000 events/sec during a flash sale. Treat this like peak QPS but with no concurrency ceiling — messages can pile up.
Next, measure average processing time per message (deserialization, business logic, I/O). Then required consumers = (peak message rate) * (processing time). Add safety factor 1.5-2x for burst handling.
Watch for poison pill messages — messages that fail repeatedly and consume all consumer capacity. Implement dead-letter queues (DLQ) and circuit breakers on consumer failures.
Backpressure: if consumers can't keep up, you need to signal the producer to slow down. Rarely built-in by default. Use bounded queue with drop policy or implement backpressure mechanism.
Batch processing can increase throughput — tune batch size for latency vs throughput.
Monitor queue depth growth rate. If it's positive for more than 5 minutes, you're losing ground. Set alerts on growth rate, not just absolute depth. Auto-scaling based on queue depth (KEDA for Kubernetes) works better than CPU-based scaling for async workloads.
Real-world failure: A fintech startup's event queue grew to 10M messages over a weekend due to a single failing consumer. A malformed message kept failing, retry loop consumed all capacity. Recovery took 12 hours. Add DLQ and alert on consumer error rate.
Another nuance: strict ordering requirements force partitioning — each partition is processed by one consumer. Plan enough partitions for peak load.
Capacity Planning for Cloud Costs
Capacity planning directly impacts cloud costs. Every estimate — QPS, storage, bandwidth — becomes a line item on your bill. Understanding that relationship lets you design cost-efficient systems from the start.
Start with unit economics: cost per request. If you run a Java service on 8 instances at $0.50/hour each, handling 1000 peak QPS, that's $0.004 per 1000 requests in compute alone. Add storage, bandwidth, database — you might get to $0.01 per 1000 requests. Know this number; share it with product.
Reserved vs on-demand: reserve for baseline, use spot for burst, on-demand as last resort. Reserved instances save 30-60%, but commit for predictable baselines only.
Right-sizing is where most money is wasted. Teams often over-provision because they don't trust their estimates. That's fine for the first month, but after 90 days of monitoring, rightsize all instances. Use AWS Compute Optimizer or similar.
Storage tiering is a huge lever. Hot data on SSDs, warm on HDDs, cold on object storage. A photo-sharing app with 10PB of data can reduce costs from $1M/month to $200k/month with proper tiering.
Data transfer costs: ingress is often free, egress expensive. Design to minimize cross-region or internet egress. Use CloudFront or Cloudflare for outgoing traffic.
Hidden costs: orphaned resources — load balancers, unused EBS volumes, idle NAT gateways. Set up cost anomaly detection and regular cleanup.
Real story: A team spent $50k/month on Redis clusters because they never reevaluated after cache hit ratio improved. Rightsizing saved $20k/month.
Another: A company had 10% of instances idle for months, costing $30k/month. They added scheduled shutdown for non-production environments.
Build a simple cost model in a spreadsheet — compute hours, storage tier, bandwidth * egress. Update quarterly and compare to actual bills. If actual exceeds model by 20%, investigate.
Why Your Capacity Estimates Are Wrong (And How to Fix Them)
Every capacity estimate is a lie until proven otherwise. The question isn't whether your numbers are wrong — it's whether they're wrong in a survivable direction.
Three factors burn junior engineers most often. First, hardware resources aren't fungible. Throwing more CPU at a memory-bound process just makes the OOM killer busier. You need to identify which resource is actually the bottleneck — and it's rarely the one you think.
Second, software efficiency isn't a constant. A query that returns in 2ms on an empty table can take 200ms with 10 million rows. You can't estimate capacity without understanding how your algorithms degrade under load. Third, workload characteristics change without warning. Black Friday spikes, viral tweets, bot traffic — your "average" load is a bedtime story.
The fix is simple: measure everything, assume every assumption is wrong, and build in 3x headroom you'll actually need. Your CTO will thank you at 2 AM during the post-launch scramble.
Two Metrics That Kill More Deployments Than All Bugs Combined
You're tracking CPU and memory. Everyone tracks CPU and memory. That's why production catches fire while your dashboards show green.
The real killers are request latency p99 and concurrency limit. p99 latency tells you what the slowest 1% of your users experience. If your p99 jumps from 200ms to 2 seconds, your users are rage-quitting even though your p50 looks fine. Everything cascades — timeouts cause retries, retries cause backpressure, backpressure kills the whole box.
Concurrency limit is the number of simultaneous requests your system can handle before response times go nonlinear. Once you hit that inflection point, adding one more request doubles your latency. That's the moment your SLA dies.
Here's the rule: measure p99 at every service boundary. Graph it against concurrency. The moment p99 rises 50% above baseline under peak load, you're at capacity. Stop adding traffic. Scale up. That's your hard limit — not the theoretical max from your load test.
The Real-World Capacity Plan That Saves Your Weekend
Theory is cheap. Let me show you what a real capacity plan looks like when your CEO just announced a product launch with 10x expected traffic and you have two weeks.
Start with the worst case: what breaks first? For an e-commerce site, it's always the checkout database. Every item added to cart, every payment processed, every inventory decrement — all hitting the same row. That's your serialization point.
Here's the playbook. Day 1: benchmark the checkout path under synthetic load. Find the max concurrent transactions before p99 goes to hell. Day 2-3: cache product catalog reads (90% of page views are reads). Day 4: shard the inventory database by product category. Day 5: add read replicas for order history. Day 6: load test the whole chain again. Day 7: pray and monitor.
Real example from a past life: 500k users, 5k writes/second at peak. Checkout DB hit 95% CPU at 3k writes/second. We added a write queue with batch commit — writes went to Redis first, then drained in batches. CPU dropped to 40%, throughput doubled, p99 stayed under 200ms. No new hardware, just understanding where the bottleneck actually lived.
The E-Commerce Capacity Trap: Checkout as a Concurrency Problem
Most teams size e-commerce infrastructure by average daily traffic. That’s how you die on Black Friday. The real question: what’s your peak concurrency during checkout? Every user who adds to cart, loads a payment gateway, or waits for inventory validation holds a connection. If your thread pool or database connection pool is sized for 500 concurrent users but your checkout burst hits 2,000, you’re looking at cascading failures.
Start with the checkout funnel. Measure session duration and request rate at the payment endpoint, not the homepage. Then apply Little’s Law: concurrency = throughput × latency. If you process 100 checkouts/sec and each takes 2 seconds, you need 200 concurrent connections just for payment. Multiply by 3 for headroom. Add circuit breakers for payment provider latency spikes. Capacity planning without concurrency modeling is just wishful thinking.
The Product Page Cache That Breaks Your Inventory
E-commerce teams love caching product pages for 60 seconds. It slashes DB load. But when a flash sale drops inventory from 500 to 0 in 5 seconds, your cache serves stale “in stock” data for the next 55 seconds. Users add to cart, hit checkout, and get “item unavailable.” That’s a 50% cart abandonment rate you just coded.
The fix: tiered caching with invalidation on inventory writes. Cache product description and images at the CDN for 10 minutes. Cache inventory counts with a 5-second TTL. Or better, push invalidations from the inventory service when stock changes. Redis pub-sub works. Kafka works. Just don’t let a 60-second cache do a 5-second job. Measure your inventory write rate and design cache TTL as a fraction of that interval. If 50% of your orders come in the first minute of a drop, your cache is your enemy.
E-Commerce Checkout: Real-World Capacity Failure
Why the worst capacity mistakes happen at checkout. An e-commerce site handles 50,000 daily visitors, 10,000 add-to-cart actions, and 2,000 checkout attempts. The naive plan: allocate 100 concurrent checkout threads based on average 5-second checkout time. But flash sales spike traffic 10x within 30 seconds. Checkout fails because database connection pools exhaust, not because compute runs out. The fix: reserve dedicated capacity for checkout writes, set admission control on cart-to-checkout transitions, and pre-compute invoice totals asynchronously. A production trace shows 40% of checkout latency comes from inventory lock contention, not SQL queries. Separating inventory holds into a Redis-backed cache with TTL reduces p99 checkout time from 12s to 1.2s without adding servers. The hard lesson: capacity plans that ignore the checkout funnel's hot path will fail at peak.
Cache That Breaks Inventory: E-Commerce Case Study
Why caching product pages accidentally zeroes out stock. A fashion retailer caches product pages for 5 minutes to reduce database load. Inventory queries are cached at page level. During a flash sale, 2,000 users see "in stock" for a sneaker with only 50 units. The checkout system, reading from a separate inventory service, rejects 1,950 orders. Users rage, trust drops. Root cause: product page cache TTL of 300 seconds vs. inventory read-through cache of 60 seconds created a stale window. The fix: add cache-invalidation on inventory decrement events, not TTL. Implement a write-through cache for inventory counts fed by the order service. In production, this reduced database reads by 80% and eliminated overselling. The capacity implication: compute needed for cache invalidation events (1 per purchase) is 10x cheaper than recomputing product pages on miss. Plan invalidation bandwidth, not just read bandwidth.
Black Friday Crash at a Retail Startup
- Always model worst-case peak traffic, not average.
- Auto-scaling is not instant — you need headroom or pre-provision.
- Databases are the hardest to scale; plan their capacity first.
- Monitor connection pool usage as a leading indicator of saturation.
- Run load tests at 5x expected peak to validate your model before launch.
- Don't let marketing launch a campaign without a capacity sign-off.
- Use feature flags to gradually ramp traffic to new capacity — don't flip the switch for all users at once.
- Plan for write-heavy bursts — they overwhelm primaries faster than reads.
- Consider using auto-scaling pre-warming scripts to reduce lag during known traffic events.
du -sh /var/log/* to find large files. Check for forgotten debug logs.ulimit -a), connection pool, or disk IO (iostat -x 1). Use vmstat 5 5 to see context switching and blocking. Often it's thread pool exhaustion, not CPU.curl -v /health from inside the container to verify. Increase thread pool or add instances.vmstat 5 5netstat -an | grep :80 | wc -lKey takeaways
Common mistakes to avoid
5 patternsUsing only average QPS instead of peak
Assuming auto-scaling will save you instantly
Neglecting database connection pool sizing
Ignoring storage growth rate
Treating all QPS as equal cost
Interview Questions on This Topic
How would you estimate the capacity requirements for a new social media app expected to have 1 million users in the first year?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.
That's Estimation. Mark it forged?
18 min read · try the examples if you haven't