Capacity Planning — Why Auto-Scaling Won't Save You
Auto-scaling lags 3-5 minutes during traffic spikes.
- Capacity planning builds a model of future load: QPS, storage, memory, and bandwidth
- Peak traffic is 10-20x average — plan for the worst hour, not the daily average
- Storage grows with user count and data per user; use retention policies to cap growth
- Memory and CPU scale with request complexity and concurrency, not just QPS
- Bandwidth often becomes the bottleneck before CPU does—CDN caching saves you
- Performance insight: underestimating peak QPS by 2x causes total collapse; auto-scaling lags minutes behind
- Production insight: always model worst-case peak, not average — and validate with load tests before launch
Imagine you're opening a lemonade stand at a school fair. Before you show up, you need to guess: how many kids will come, how many cups you need, how fast you can pour, and whether you need one table or three. Capacity planning is exactly that — but for software. You're estimating how much traffic your system will handle, how much data it'll store, and whether your servers will buckle under pressure. Do it before you build, and you sleep at night. Skip it, and your site goes down the moment it gets popular.
Every system that has ever crashed under load had one thing in common: nobody did the math beforehand. Twitter's Fail Whale, early Reddit meltdowns, the Healthcare.gov launch disaster — these weren't random bad luck. They were the predictable result of shipping systems without ever asking 'what happens when a million people show up at once?' Capacity planning is the engineering discipline that answers that question before it becomes a crisis.
The core problem capacity planning solves is the gap between 'it works on my machine' and 'it works for ten million users.' A system that handles 10 requests per second behaves completely differently at 100,000 requests per second. Memory leaks that are invisible at small scale become catastrophic at large scale. Database queries that return in 2ms under no load suddenly take 4 seconds when 500 connections compete for the same rows. Capacity planning gives you a model — however rough — of where those breaking points are, so you can design around them intentionally rather than discover them in production.
By the end of this article you'll know how to estimate Queries Per Second (QPS) for a real system, calculate storage growth over time, size your bandwidth and memory requirements, and translate all of that into a concrete infrastructure starting point. These are the exact skills that separate engineers who can design systems from engineers who just implement tickets.
Here's the thing: every hour spent planning capacity saves ten hours of production firefighting. It's not a one-time exercise — it's a muscle you build.
Don't assume your cloud provider's default limits will save you. They won't. I've seen a team lose a $500k deal because they hit the default DynamoDB write capacity — and nobody had checked the limit.
What is Capacity Planning?
Capacity planning builds a model of future system load before you write production code. You estimate QPS, storage, memory, CPU, bandwidth — then design around those numbers. It's not about perfect prediction; it's about bounding risk so you don't wake up at 3 AM to a 503 tsunami.
The feedback loop: estimate → build → monitor → adjust. Each iteration tightens the model. Without it, you're guessing. I've seen teams spend weeks optimizing a query that didn't matter while their database would run out of storage in 3 months.
Averages lie. Peak tells the truth. A system handling 100 QPS average might burst to 2000 QPS for 2 minutes. If your pool is sized for the average, you'll saturate connections fast. Always model the worst hour.
Common trap: treating capacity planning as a one-time exercise. It's not. Launch day traffic is nothing compared to year 2 growth. Revisit your model quarterly or whenever you hit a 2x traffic milestone.
Write-heavy workloads are the silent killer. A payment processing team I advised sized for average write QPS — then a flash sale hit 50x burst. The primary fell over, replication lag hit hours. Model write QPS separately with a 3x safety factor.
Another angle: capacity planning is a communication tool. When you have numbers, you can explain to product why a feature launch needs a two-week infra lead time. That conversation never happens when the model lives in someone's head.
- Start with a rough estimate based on assumptions.
- Build infrastructure to that estimate with a safety margin.
- Monitor actual traffic and resource usage in production.
- Update your model with real data; refine for the next cycle.
- The goal is not perfect prediction — it's avoiding catastrophic failure.
Estimating Queries Per Second (QPS)
QPS is the heartbeat of your system. Every other resource — database connections, CPU, memory, bandwidth — depends on it. Start with your expected DAU, multiply by requests per user per day, then apply the peak hour fraction (typically 10% of daily traffic in 1 hour). The formula: peak QPS = (DAU requests/user/day peak fraction) / 3600.
But here's the nuance: the peak fraction varies. A global consumer app might see 15-20% of daily traffic in the evening commute. A B2B SaaS might only see 5-7% during business hours. If you lack historical data, start with 10% and add a 1.5x safety margin.
Also, QPS isn't uniform across endpoints. Your login endpoint may get 10x less traffic than your feed endpoint. Profile your traffic — treat each endpoint's resource cost separately. I've seen teams mis-size compute because they assumed all requests consumed equal CPU.
Another trap: webhooks and callbacks can burst unexpectedly. A payment webhook once caused a 50x spike for 2 seconds, saturating our connection pool. Plan for these async bursts by adding buffer in your peak estimate.
For event-driven systems, QPS estimation is trickier because it depends on producers. Estimate incoming event rate from queue metrics, not endpoint hits. Same formulas apply, but replace DAU with message producers.
Write QPS is often a fraction of read QPS, but each write can be an order of magnitude more expensive — row locks, index updates, replication. Model write QPS separately with a 2x overhead for index maintenance.
- Average QPS hides spikes. A 100 QPS average could mean 2000 QPS burst for 5 minutes.
- Database connection pools and thread pools must handle the burst, not the average.
- Use 95th percentile from monitoring if you have it; otherwise assume peak = 10x average.
Storage Sizing Over Time
Storage grows with two dimensions: number of users and data per user. Each user might store 500KB of profile data, 2MB of images, and 100KB of logs per day. Over a year, that compounds fast. Don't forget replication and backup factors — a 3x replication multiplier is common. A good rule: estimate storage after 1 year with a 2x buffer for growth.
A subtle trap: logs and temporary data explode unexpectedly. A developer adds a debug log that writes 20 bytes per request at 1000 QPS — that's 1.7GB/day. Unnoticed for a week, it fills your disk. Set retention policies and monitor growth rates, not absolute usage.
Also consider data lifecycle. Not all data needs hot storage. Archive old data to cheaper storage (S3 Glacier, GCP Archive) to cap costs. Compression ratios vary: text compresses 4-5x, images don't. Use estimates by data type.
Another common mistake: database storage != file storage. A MongoDB document may be 1KB in your model, but on disk it's 2-3KB with indexes and journaling. Factor 2x for database storage estimates. Also include transaction logs — they can grow significantly during heavy writes.
Cold storage costs and GDPR retention laws may force long-term data keeping. Plan a tiered strategy: hot data on fast SSDs, warm on HDD, cold on object storage. The cost difference can be 10x between hot and cold.
Real-world: A company stored logs indefinitely because they forgot to set retention. Their storage bill hit $200k/month — more than compute. They implemented 30-day retention and tiered old logs to Glacier, cutting costs by 90%.
Also watch for unused objects in S3 that accumulate. Implement lifecycle policies to expire unused data.
Compute and Memory Requirements
CPU and memory are driven by QPS and request complexity. A typical web server consumes 50-100ms CPU time and 10-50MB memory per request. To handle 1000 peak QPS, you need at least 100 concurrent threads (assuming 100ms per request). Each thread may need 2MB, so 200MB for threads alone. Add heap, caches, GC overhead — aim for 4-8GB RAM per instance. Formula: instances = peak QPS / (1 / avg_response_time) / max_concurrency.
But memory isn't just thread stacks. Caches, connection pools, and GC overhead dominate. A 4GB heap with G1GC at 1000 QPS can see GC pauses of 50-100ms — enough to push latency over SLO. Sweet spot for G1GC is 4-8GB; above 8GB pause times increase non-linearly. Below 2GB, GC frequency spikes.
Also monitor non-heap memory: Metaspace, thread stacks, direct buffers. Connection pools consume memory too — 100 connections at 1MB each = 100MB just waiting.
With Java 21 virtual threads, memory per thread drops to ~2KB vs 1MB for platform threads. That means you can handle thousands of concurrent requests with a 2GB heap instead of 8GB. But virtual threads still need carrier threads from a small pool (default = cores). If your code blocks on synchronized or native methods, it pins the carrier, reducing concurrency. Great for I/O-bound, not magic for CPU-bound.
Also consider vertical scaling vs horizontal. Sometimes one large instance is cheaper than many small ones, especially if workloads benefit from large caches. Compare total cost: 8 small vs 1 large with same total RAM — often 20-30% cheaper.
Cold start overhead: containerized services may not be ready immediately after restart. Plan a 30-second grace period in auto-scaling triggers. Use pod disruption budgets in K8s to avoid mass restarts.
Real-world: A trading platform with 16GB heap saw 200ms GC pauses during peak hours. Switching to ZGC dropped pauses to <1ms. ZGC uses more CPU but latency was the constraint. Measure your constraint.
Memory leaks are another common cause. A team ignored growing heap usage over weeks, assuming GC would handle it — then JVM hit OOM. Add weekly heap growth alerts.
Bandwidth and Network Considerations
Bandwidth is the silent killer. A single 500KB image served 1000 times per second consumes 500MB/s of egress bandwidth — about 4 Gbps. Most cloud instances cap network at 10 Gbps. Outbound costs can dominate your bill. Use a CDN for static assets, compress responses, and cache aggressively. For real-time apps, plan for sustained throughput, not just bursts.
Don't forget internal bandwidth either. Cross-AZ traffic costs money and adds latency. If your database writes are routed through a different availability zone, you'll pay egress fees and see 2-5ms additional latency per call. Keep your data and compute inside the same AZ when possible.
Also consider intra-service bandwidth. If your services communicate over HTTP and are chatty, that adds load. Use protobuf or gRPC to reduce payload size.
DNS and TLS handshake overhead: each new connection adds 100-200ms before data transfer. Keep alive connections reduce this. Estimate number of concurrent connections.
Also watch for bandwidth spikes from health checks and monitoring probes. If you have 200 microservices each monitoring each other (mesh), that's 200 * 200 = 40,000 probes per minute. Those small requests add up. Use a dedicated health check service.
Connection multiplexing: HTTP/2 multiplexes streams over a single connection, reducing overhead. For internal services, gRPC HTTP/2 can improve bandwidth utilisation.
Network topology matters. If you use a service mesh like Istio, each proxy adds 5-10% bandwidth overhead. Factor that in. Outbound bandwidth from cloud to internet is often more expensive than inbound. Monitor both directions.
Real-world: A photo-sharing app's egress bill hit $50k/month because images served directly from origin. After enabling CDN, it dropped to $8k/month — 84% reduction. The CDN also cut load on app servers by 90%.
Another: A video streaming startup's $100k monthly bandwidth bill was cut by 70% by moving to CDN and compressing with AV1.
Estimate: peak QPS average response size 1.5 (for headers, retransmits). For media-heavy apps, add another 20% overhead.
Database Capacity Planning
Databases are the hardest component to scale. Unlike app servers, you can't just add instances and expect linear performance. Database capacity planning must account for read throughput, write throughput, storage, connection pool, and replication lag.
First, estimate read QPS and write QPS separately. Reads can be offloaded to replicas — typical pattern: one primary for writes, multiple read replicas. Write capacity is often the bottleneck because every write hits the primary. Size primary's CPU and IOPS accordingly.
Connection pool sizing: pool size = peak QPS * avg query time (seconds). For 1000 QPS with 50ms queries, you need at least 50 connections. But also account for overhead — a common mistake is setting pool size equal to database's max_connections, which can exhaust the database with too many connections. Tune both sides.
Replication lag: if you have read replicas, ensure they can handle read traffic without falling behind. Monitor seconds_behind_master (MySQL) or replica lag (PostgreSQL). Keep lag under 5 seconds for responsive apps.
Storage for databases includes indexes and transaction logs. Indexes can double the storage of a table. Transaction logs (WAL) grow significantly during heavy writes — plan for at least 25% extra storage for logs.
Connection pool memory: each database connection uses about 2-5MB on the database side. For 200 connections, that's 1GB just for connections. Size the database instance accordingly.
Use a connection pooler like PgBouncer or ProxySQL to maintain persistent pool and reduce connection overhead. Some ORMs hold connections longer than expected due to transactional boundaries — test with realistic request patterns.
Real-world failure: A SaaS startup sized RDS instance based on average QPS, ignoring peak. When a customer imported a million records via API, database CPU hit 100%, connection pool saturated, all queries timed out. Recovery took 4 hours. Fix: add write replicas, use async processing, set connection pool limit that prevents thundering herd.
Separate read and write connection pools to avoid contention. Use two datasources or a proxy that routes by query type.
A billing system's database crashed during month-end due to unplanned write bursts — a single script triggered millions of updates. Adding a write queue saved them. Always plan for batch operations that can spike write load.
Capacity Planning for Event-Driven and Async Workloads
Event-driven systems shift the capacity model. Instead of QPS hitting an endpoint, you have message producers pushing events into a queue, and consumers processing at their own rate. The key metric is message arrival rate vs consumption rate. If arrival exceeds consumption, the queue grows indefinitely — hitting queue depth limits, memory pressure, or consumer timeouts.
Start by estimating peak message production rate. Often comes from upstream services or external webhooks. For example, a payment webhook might deliver 1000 events/sec during a flash sale. Treat this like peak QPS but with no concurrency ceiling — messages can pile up.
Next, measure average processing time per message (deserialization, business logic, I/O). Then required consumers = (peak message rate) * (processing time). Add safety factor 1.5-2x for burst handling.
Watch for poison pill messages — messages that fail repeatedly and consume all consumer capacity. Implement dead-letter queues (DLQ) and circuit breakers on consumer failures.
Backpressure: if consumers can't keep up, you need to signal the producer to slow down. Rarely built-in by default. Use bounded queue with drop policy or implement backpressure mechanism.
Batch processing can increase throughput — tune batch size for latency vs throughput.
Monitor queue depth growth rate. If it's positive for more than 5 minutes, you're losing ground. Set alerts on growth rate, not just absolute depth. Auto-scaling based on queue depth (KEDA for Kubernetes) works better than CPU-based scaling for async workloads.
Real-world failure: A fintech startup's event queue grew to 10M messages over a weekend due to a single failing consumer. A malformed message kept failing, retry loop consumed all capacity. Recovery took 12 hours. Add DLQ and alert on consumer error rate.
Another nuance: strict ordering requirements force partitioning — each partition is processed by one consumer. Plan enough partitions for peak load.
Capacity Planning for Cloud Costs
Capacity planning directly impacts cloud costs. Every estimate — QPS, storage, bandwidth — becomes a line item on your bill. Understanding that relationship lets you design cost-efficient systems from the start.
Start with unit economics: cost per request. If you run a Java service on 8 instances at $0.50/hour each, handling 1000 peak QPS, that's $0.004 per 1000 requests in compute alone. Add storage, bandwidth, database — you might get to $0.01 per 1000 requests. Know this number; share it with product.
Reserved vs on-demand: reserve for baseline, use spot for burst, on-demand as last resort. Reserved instances save 30-60%, but commit for predictable baselines only.
Right-sizing is where most money is wasted. Teams often over-provision because they don't trust their estimates. That's fine for the first month, but after 90 days of monitoring, rightsize all instances. Use AWS Compute Optimizer or similar.
Storage tiering is a huge lever. Hot data on SSDs, warm on HDDs, cold on object storage. A photo-sharing app with 10PB of data can reduce costs from $1M/month to $200k/month with proper tiering.
Data transfer costs: ingress is often free, egress expensive. Design to minimize cross-region or internet egress. Use CloudFront or Cloudflare for outgoing traffic.
Hidden costs: orphaned resources — load balancers, unused EBS volumes, idle NAT gateways. Set up cost anomaly detection and regular cleanup.
Real story: A team spent $50k/month on Redis clusters because they never reevaluated after cache hit ratio improved. Rightsizing saved $20k/month.
Another: A company had 10% of instances idle for months, costing $30k/month. They added scheduled shutdown for non-production environments.
Build a simple cost model in a spreadsheet — compute hours, storage tier, bandwidth * egress. Update quarterly and compare to actual bills. If actual exceeds model by 20%, investigate.
Black Friday Crash at a Retail Startup
- Always model worst-case peak traffic, not average.
- Auto-scaling is not instant — you need headroom or pre-provision.
- Databases are the hardest to scale; plan their capacity first.
- Monitor connection pool usage as a leading indicator of saturation.
- Run load tests at 5x expected peak to validate your model before launch.
- Don't let marketing launch a campaign without a capacity sign-off.
- Use feature flags to gradually ramp traffic to new capacity — don't flip the switch for all users at once.
- Plan for write-heavy bursts — they overwhelm primaries faster than reads.
- Consider using auto-scaling pre-warming scripts to reduce lag during known traffic events.
du -sh /var/log/* to find large files. Check for forgotten debug logs.ulimit -a), connection pool, or disk IO (iostat -x 1). Use vmstat 5 5 to see context switching and blocking. Often it's thread pool exhaustion, not CPU.curl -v /health from inside the container to verify. Increase thread pool or add instances.Key takeaways
Common mistakes to avoid
5 patternsUsing only average QPS instead of peak
Assuming auto-scaling will save you instantly
Neglecting database connection pool sizing
Ignoring storage growth rate
Treating all QPS as equal cost
Interview Questions on This Topic
How would you estimate the capacity requirements for a new social media app expected to have 1 million users in the first year?
Frequently Asked Questions
That's Estimation. Mark it forged?
13 min read · try the examples if you haven't