Senior 13 min · March 06, 2026

Capacity Planning — Why Auto-Scaling Won't Save You

Auto-scaling lags 3-5 minutes during traffic spikes.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Capacity planning builds a model of future load: QPS, storage, memory, and bandwidth
  • Peak traffic is 10-20x average — plan for the worst hour, not the daily average
  • Storage grows with user count and data per user; use retention policies to cap growth
  • Memory and CPU scale with request complexity and concurrency, not just QPS
  • Bandwidth often becomes the bottleneck before CPU does—CDN caching saves you
  • Performance insight: underestimating peak QPS by 2x causes total collapse; auto-scaling lags minutes behind
  • Production insight: always model worst-case peak, not average — and validate with load tests before launch
Plain-English First

Imagine you're opening a lemonade stand at a school fair. Before you show up, you need to guess: how many kids will come, how many cups you need, how fast you can pour, and whether you need one table or three. Capacity planning is exactly that — but for software. You're estimating how much traffic your system will handle, how much data it'll store, and whether your servers will buckle under pressure. Do it before you build, and you sleep at night. Skip it, and your site goes down the moment it gets popular.

Every system that has ever crashed under load had one thing in common: nobody did the math beforehand. Twitter's Fail Whale, early Reddit meltdowns, the Healthcare.gov launch disaster — these weren't random bad luck. They were the predictable result of shipping systems without ever asking 'what happens when a million people show up at once?' Capacity planning is the engineering discipline that answers that question before it becomes a crisis.

The core problem capacity planning solves is the gap between 'it works on my machine' and 'it works for ten million users.' A system that handles 10 requests per second behaves completely differently at 100,000 requests per second. Memory leaks that are invisible at small scale become catastrophic at large scale. Database queries that return in 2ms under no load suddenly take 4 seconds when 500 connections compete for the same rows. Capacity planning gives you a model — however rough — of where those breaking points are, so you can design around them intentionally rather than discover them in production.

By the end of this article you'll know how to estimate Queries Per Second (QPS) for a real system, calculate storage growth over time, size your bandwidth and memory requirements, and translate all of that into a concrete infrastructure starting point. These are the exact skills that separate engineers who can design systems from engineers who just implement tickets.

Here's the thing: every hour spent planning capacity saves ten hours of production firefighting. It's not a one-time exercise — it's a muscle you build.

Don't assume your cloud provider's default limits will save you. They won't. I've seen a team lose a $500k deal because they hit the default DynamoDB write capacity — and nobody had checked the limit.

What is Capacity Planning?

Capacity planning builds a model of future system load before you write production code. You estimate QPS, storage, memory, CPU, bandwidth — then design around those numbers. It's not about perfect prediction; it's about bounding risk so you don't wake up at 3 AM to a 503 tsunami.

The feedback loop: estimate → build → monitor → adjust. Each iteration tightens the model. Without it, you're guessing. I've seen teams spend weeks optimizing a query that didn't matter while their database would run out of storage in 3 months.

Averages lie. Peak tells the truth. A system handling 100 QPS average might burst to 2000 QPS for 2 minutes. If your pool is sized for the average, you'll saturate connections fast. Always model the worst hour.

Common trap: treating capacity planning as a one-time exercise. It's not. Launch day traffic is nothing compared to year 2 growth. Revisit your model quarterly or whenever you hit a 2x traffic milestone.

Write-heavy workloads are the silent killer. A payment processing team I advised sized for average write QPS — then a flash sale hit 50x burst. The primary fell over, replication lag hit hours. Model write QPS separately with a 3x safety factor.

Another angle: capacity planning is a communication tool. When you have numbers, you can explain to product why a feature launch needs a two-week infra lead time. That conversation never happens when the model lives in someone's head.

io/thecodeforge/estimation/CapacityPlanner.java (Java)
package io.thecodeforge.estimation;

public class CapacityPlanner {
    public static void main(String[] args) {
        int expectedUsers = 500_000;
        double dauRatio = 0.2;
        int dau = (int)(expectedUsers * dauRatio);
        int readsPerUser = 50;
        double peakHourFraction = 0.1;
        double peakReadQPS = (dau * readsPerUser * peakHourFraction) / 3600.0;
        System.out.println("Expected peak read QPS: " + Math.round(peakReadQPS));
        System.out.println("Plan for at least " + Math.round(peakReadQPS * 1.5) + " QPS with safety margin");
    }
}
Output
Expected peak read QPS: 139
Plan for at least 208 QPS with safety margin
The Feedback Loop
  • Start with a rough estimate based on assumptions.
  • Build infrastructure to that estimate with a safety margin.
  • Monitor actual traffic and resource usage in production.
  • Update your model with real data; refine for the next cycle.
  • The goal is not perfect prediction — it's avoiding catastrophic failure.
Production Insight
Skipping capacity planning is a decision to discover your scaling limits in production.
Even a rough estimate prevents late-night firefights.
Rule: Always run the numbers before you commit to an architecture.
Pro tip: Load test at 10x expected traffic to validate your model.
Real example: A fintech startup skipped load testing — their payment gateway timed out during the first marketing push. Recovery took 6 hours.
Another insight: capacity planning is a negotiation tool with your cloud provider. Know your peak QPS to negotiate reserved instance discounts.
A payment team sized for average write QPS and collapsed under a 50x burst. Rule: model write QPS separately with a 3x safety factor.
Key Takeaway
Capacity planning is the difference between a launch and a disaster.
Start with a rough estimate and refine as you learn.
A bad estimate is better than no estimate.
The cost of over-provisioning is almost always lower than the cost of a crash.
Capacity planning is a muscle, not a one-time calculation.
Write-heavy workloads are the silent killer — model them separately with a 3x safety factor.
When to Perform Capacity Planning
If: Building a new system from scratch
Use: Estimate based on expected user base, market research, and comparable systems.
If: Existing system with monitoring data
Use: Use 95th percentile of historical traffic, storage, and resource usage over the last 90 days.
If: Preparing for a known event (launch, sale)
Use: Model peak at 5-10x normal traffic and provision accordingly.
If: After a production incident related to capacity
Use: Trigger a full re-estimation within 48 hours and implement guardrails.

Estimating Queries Per Second (QPS)

QPS is the heartbeat of your system. Every other resource (database connections, CPU, memory, bandwidth) depends on it. Start with your expected DAU, multiply by requests per user per day, then apply the peak hour fraction (typically 10% of daily traffic in 1 hour). The formula: peak QPS = (DAU * requests/user/day * peak fraction) / 3600.

But here's the nuance: the peak fraction varies. A global consumer app might see 15-20% of daily traffic in the evening commute. A B2B SaaS might only see 5-7% during business hours. If you lack historical data, start with 10% and add a 1.5x safety margin.
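To see how much the peak fraction moves the result, here is a minimal sketch that applies the formula to one DAU figure with three fractions from the ranges mentioned above, plus the 1.5x margin. The class name `PeakFractionSketch` and the sample inputs are illustrative assumptions, not measurements.

```java
public class PeakFractionSketch {
    // peak QPS = (DAU * requests per user per day * peak fraction) / 3600
    static double peakQps(long dau, long requestsPerUserPerDay, double peakFraction) {
        return (dau * requestsPerUserPerDay * peakFraction) / 3600.0;
    }

    public static void main(String[] args) {
        long dau = 200_000;         // assumed example value
        long requestsPerUser = 50;  // assumed example value
        double safetyMargin = 1.5;  // margin suggested in the text

        // 5% (B2B-like), 10% (default guess), 20% (consumer evening peak)
        for (double fraction : new double[] {0.05, 0.10, 0.20}) {
            double peak = peakQps(dau, requestsPerUser, fraction);
            System.out.printf("fraction=%.2f peak=%.0f QPS, provision for %.0f QPS%n",
                    fraction, peak, peak * safetyMargin);
        }
    }
}
```

Going from 5% to 20% quadruples the provisioning target, which is why the fraction is worth replacing with measured data as soon as any exists.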

Also, QPS isn't uniform across endpoints. Your login endpoint may get 10x less traffic than your feed endpoint. Profile your traffic — treat each endpoint's resource cost separately. I've seen teams mis-size compute because they assumed all requests consumed equal CPU.
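One way to weight QPS by endpoint cost is to sum CPU time per second across endpoints instead of multiplying total QPS by a single cost. The sketch below does that; the endpoint names, QPS figures, and per-request CPU costs are hypothetical values chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedQpsSketch {
    // Peak QPS for one endpoint and the CPU milliseconds a single request costs.
    record Endpoint(double peakQps, double cpuMsPerRequest) {}

    // CPU-seconds demanded per wall-clock second, i.e. cores needed at 100% utilisation.
    static double coresNeeded(Map<String, Endpoint> endpoints) {
        double cpuMsPerSecond = 0;
        for (Endpoint e : endpoints.values()) {
            cpuMsPerSecond += e.peakQps() * e.cpuMsPerRequest();
        }
        return cpuMsPerSecond / 1000.0;
    }

    public static void main(String[] args) {
        Map<String, Endpoint> endpoints = new LinkedHashMap<>();
        endpoints.put("feed", new Endpoint(800, 20));    // high volume, cheap
        endpoints.put("login", new Endpoint(80, 150));   // low volume, expensive (password hashing)
        endpoints.put("search", new Endpoint(120, 60));

        double naive = 1000 * 20 / 1000.0; // total QPS * cheapest cost, the "all requests are equal" guess
        System.out.println("Naive cores:    " + naive);
        System.out.println("Weighted cores: " + coresNeeded(endpoints));
    }
}
```

With these assumed numbers the low-volume login endpoint accounts for about a third of the CPU demand, which a flat per-request cost would miss.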

Another trap: webhooks and callbacks can burst unexpectedly. A payment webhook once caused a 50x spike for 2 seconds, saturating our connection pool. Plan for these async bursts by adding buffer in your peak estimate.

For event-driven systems, QPS estimation is trickier because it depends on producers. Estimate incoming event rate from queue metrics, not endpoint hits. Same formulas apply, but replace DAU with message producers.

Write QPS is often a fraction of read QPS, but each write can be an order of magnitude more expensive — row locks, index updates, replication. Model write QPS separately with a 2x overhead for index maintenance.

io/thecodeforge/estimation/QPSEstimator.java (Java)
package io.thecodeforge.estimation;

public class QPSEstimator {
    public static double peakReadQPS(int dau, int readsPerUserPerDay) {
        // Widen to double before multiplying so large inputs cannot overflow int.
        double dailyReads = (double) dau * readsPerUserPerDay;
        double peakHourSeconds = 3600;
        double peakFactor = 0.1; // peak hour = 10% of daily traffic
        return (dailyReads * peakFactor) / peakHourSeconds;
    }

    public static void main(String[] args) {
        int dau = 200_000;
        int readsPerUser = 50;
        System.out.println("Peak Read QPS: " + peakReadQPS(dau, readsPerUser));
    }
}
Output
Peak Read QPS: 277.777...
The 95th Percentile Rule
  • Average QPS hides spikes. A 100 QPS average could mean 2000 QPS burst for 5 minutes.
  • Database connection pools and thread pools must handle the burst, not the average.
  • Use 95th percentile from monitoring if you have it; otherwise assume peak = 10x average.
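If you have per-minute QPS samples from monitoring, the 95th percentile can be computed with the nearest-rank method. This is a minimal sketch; the sample data is invented to show how far the average can sit below the burst.

```java
import java.util.Arrays;

public class PercentileSketch {
    // Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
    static double percentile(double[] samples, int p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (p * sorted.length + 99) / 100; // integer ceil(p/100 * n)
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical per-minute QPS samples: mostly near 100, with a two-minute burst.
        double[] qps = {90, 95, 100, 100, 105, 110, 100, 95, 100, 105,
                        100, 110, 95, 100, 105, 100, 90, 100, 1800, 2000};
        double avg = Arrays.stream(qps).average().orElse(0);
        System.out.println("average = " + avg);
        System.out.println("p95     = " + percentile(qps, 95));
        System.out.println("max     = " + percentile(qps, 100));
    }
}
```

Note that a percentile only sees bursts that occupy enough of the window: a spike covering less than 5% of the samples falls above p95, so it is worth tracking the maximum alongside it.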
Production Insight
If you underestimate peak QPS by 2x, your database connection pool saturates in minutes.
Connection pool exhaustion manifests as slow queries → timeouts → 503s.
Rule: Always model peak QPS with a safety factor of at least 1.5.
Pro tip: Validate your QPS estimates with a load test using k6 before go-live.
Real example: A video platform hit 4x normal QPS during a live event. Their connection pool was sized for 2x normal QPS with a further 2x safety factor, which was just enough to absorb it.
Real story: A social media app saw 50x write QPS during a coordinated bot attack — they had no rate limiter. Add rate limiting as a capacity safety net.
A SaaS platform's QPS model was off by 4x because they assumed all requests had equal cost. Always weight QPS by endpoint resource cost.
Another failure: Marketing scheduled an email blast at 10 AM — 15x spike in 2 minutes, connection pool saturated, cascading failures. Plan for those.
Key Takeaway
QPS is the foundation of capacity planning.
Estimate peak, not average.
Everything else scales from this number.
Always add a 1.5x safety margin to your peak estimate.
Write QPS is more expensive than read QPS — model it separately.
Break QPS down by endpoint and weight by resource cost per request.
Choosing an estimation method
If: You have historical traffic data
Use: Use 95th percentile of peak QPS from last 90 days
If: New product with no data
Use: Estimate based on comparable products and launch scale
If: Marketing campaign expected
Use: Multiply baseline peak by 2x to 5x depending on campaign size
If: Real-time event (e.g., product launch)
Use: Use worst-case: 10x normal QPS with pre-scaling and circuit breakers

Storage Sizing Over Time

Storage grows with two dimensions: number of users and data per user. Each user might store 500KB of profile data, 2MB of images, and 100KB of logs per day. Over a year, that compounds fast. Don't forget replication and backup factors — a 3x replication multiplier is common. A good rule: estimate storage after 1 year with a 2x buffer for growth.

A subtle trap: logs and temporary data explode unexpectedly. A developer adds a debug log that writes 20 bytes per request at 1000 QPS — that's 1.7GB/day. Unnoticed for a week, it fills your disk. Set retention policies and monitor growth rates, not absolute usage.
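The debug-log arithmetic is easy to reproduce for your own line sizes and traffic. A small sketch, using the 20 bytes per request and 1000 QPS from the example above (the class and method names are illustrative):

```java
public class LogVolumeSketch {
    static final long SECONDS_PER_DAY = 86_400;

    // Bytes written per day by a log line of the given size emitted once per request.
    static long bytesPerDay(long bytesPerRequest, long qps) {
        return bytesPerRequest * qps * SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long daily = bytesPerDay(20, 1000); // the debug-log example from the text
        System.out.printf("per day:  %.2f GB%n", daily / 1e9);
        System.out.printf("per week: %.2f GB%n", daily * 7 / 1e9);
    }
}
```

A realistic log line is often a few hundred bytes rather than 20, so the same calculation with your actual line size is usually less forgiving.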

Also consider data lifecycle. Not all data needs hot storage. Archive old data to cheaper storage (S3 Glacier, GCP Archive) to cap costs. Compression ratios vary: text compresses 4-5x, images don't. Use estimates by data type.

Another common mistake: database storage != file storage. A MongoDB document may be 1KB in your model, but on disk it's 2-3KB with indexes and journaling. Factor 2x for database storage estimates. Also include transaction logs — they can grow significantly during heavy writes.

Cold storage costs and GDPR retention laws may force long-term data keeping. Plan a tiered strategy: hot data on fast SSDs, warm on HDD, cold on object storage. The cost difference can be 10x between hot and cold.

Real-world: A company stored logs indefinitely because they forgot to set retention. Their storage bill hit $200k/month — more than compute. They implemented 30-day retention and tiered old logs to Glacier, cutting costs by 90%.

Also watch for unused objects in S3 that accumulate. Implement lifecycle policies to expire unused data.

io/thecodeforge/estimation/StorageEstimator.java (Java)
package io.thecodeforge.estimation;

public class StorageEstimator {
    public static long yearlyStorageGB(int users, long bytesPerUserDaily, int retentionDays) {
        // Cast to long first so the product cannot overflow int arithmetic.
        long totalBytes = (long) users * bytesPerUserDaily * retentionDays;
        return totalBytes / (1024L * 1024 * 1024); // bytes -> GB (binary, 2^30 bytes)
    }

    public static void main(String[] args) {
        int users = 1_000_000;
        long bytesPerUserDaily = 2_000_000; // 2MB per user per day
        int retentionDays = 365;
        long gb = yearlyStorageGB(users, bytesPerUserDaily, retentionDays);
        System.out.println("Yearly storage: " + gb + " GB (raw)");
        System.out.println("With 3x replication: " + (gb * 3) + " GB");
    }
}
Output
Yearly storage: 679865 GB (raw)
With 3x replication: 2039595 GB
The Log Trap
Logs are the silent storage killer. A single verbose log line per request at 1000 QPS can generate 10GB per week. Set log rotation and monitor daily log volume as a capacity metric.
Production Insight
Storage costs sneak up on you. Logs, temporary files, and test data are often forgotten.
A 10TB database might cost $30k/month in cloud storage alone.
Rule of thumb: double your initial estimate to account for replication and backups.
Log aggregation systems (ELK, Datadog) are common storage hogs. Set log retention to 30 days, not indefinite.
Real story: A startup thought they had 500GB of storage — then they realized their test environment had written 4TB of debug logs over six months.
Cost trap: storing logs indefinitely can cost more than your compute. Set retention policies on day one.
Another: A company's $200k monthly storage bill was cut by 90% by setting 30-day retention and tiering to Glacier.
Monitor daily storage growth rate, not just total capacity. A 1% daily growth may seem small but doubles storage in 70 days.
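The doubling-time figure comes from compound growth: days = ln(2) / ln(1 + daily rate). The sketch below computes that, plus the days left before a volume fills at the same rate. It assumes a constant growth rate, which real systems rarely have, so treat the result as an alerting threshold rather than a forecast.

```java
public class GrowthSketch {
    // Days until storage doubles at a constant compound daily growth rate (0.01 = 1%).
    static double doublingDays(double dailyGrowthRate) {
        return Math.log(2) / Math.log(1 + dailyGrowthRate);
    }

    // Days until usage reaches capacity at the same constant growth rate.
    static double daysUntilFull(double usedGb, double capacityGb, double dailyGrowthRate) {
        return Math.log(capacityGb / usedGb) / Math.log(1 + dailyGrowthRate);
    }

    public static void main(String[] args) {
        System.out.printf("1%% daily growth doubles in %.1f days%n", doublingDays(0.01));
        // Hypothetical volume: 600 GB used of 1000 GB.
        System.out.printf("600/1000 GB at 1%%/day is full in %.1f days%n",
                daysUntilFull(600, 1000, 0.01));
    }
}
```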
Key Takeaway
Storage grows with (users * data_per_user * retention).
Plan for at least 2x headroom.
Don't forget replication and backup multipliers.
Monitor growth rates, not just absolute usage.
Set alerts on daily storage growth rate, not just total capacity.
Tiered storage can cut your bill by 60% — plan for it.
Set log retention policies on day one — they're the silent storage killer.
When to Archive Old Data
If: Data older than 90 days with infrequent access
Use: Move to cold storage (e.g., S3 Glacier, GCP Archive) to reduce costs
If: Regulatory compliance requires long retention
Use: Keep in cold storage but ensure ability to restore within required SLA
If: User-generated content (images, videos)
Use: Never delete, but migrate older content to cheaper storage tiers
If: Temporary data, logs, debug info
Use: Purge after 30-90 days; set automated retention policies

Compute and Memory Requirements

CPU and memory are driven by QPS and request complexity. A typical web server consumes 50-100ms CPU time and 10-50MB memory per request. To handle 1000 peak QPS, you need at least 100 concurrent threads (assuming 100ms per request). Each thread may need 2MB, so 200MB for threads alone. Add heap, caches, and GC overhead, and aim for 4-8GB RAM per instance. Formula: instances = (peak QPS * avg_response_time) / max_concurrency_per_instance, where peak QPS * avg_response_time is the number of requests in flight at once.
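The thread-count arithmetic in the paragraph above is an application of Little's Law: requests in flight = arrival rate * time per request. A minimal sketch follows; the per-instance concurrency of 25 is an assumed value for illustration, and the 2MB per thread is the figure from the text.

```java
public class ConcurrencySketch {
    // Little's Law: requests in flight = arrival rate * time each request spends in the system.
    static double inFlight(double peakQps, double avgResponseTimeSec) {
        return peakQps * avgResponseTimeSec;
    }

    // Instances needed when each instance can serve maxConcurrencyPerInstance requests at once.
    static int instances(double peakQps, double avgResponseTimeSec, int maxConcurrencyPerInstance) {
        return (int) Math.ceil(inFlight(peakQps, avgResponseTimeSec) / maxConcurrencyPerInstance);
    }

    public static void main(String[] args) {
        double concurrent = inFlight(1000, 0.1);  // 1000 QPS at 100ms per request
        double threadMemoryMb = concurrent * 2;   // 2MB per thread, as in the text
        System.out.println("Concurrent requests: " + concurrent);
        System.out.println("Thread memory (MB):  " + threadMemoryMb);
        System.out.println("Instances at 25 concurrent each: " + instances(1000, 0.1, 25));
    }
}
```

Because concurrency scales with response time, a latency regression from 100ms to 400ms quadruples the threads and instances needed at the same QPS.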

But memory isn't just thread stacks. Caches, connection pools, and GC overhead dominate. A 4GB heap with G1GC at 1000 QPS can see GC pauses of 50-100ms — enough to push latency over SLO. Sweet spot for G1GC is 4-8GB; above 8GB pause times increase non-linearly. Below 2GB, GC frequency spikes.

Also monitor non-heap memory: Metaspace, thread stacks, direct buffers. Connection pools consume memory too — 100 connections at 1MB each = 100MB just waiting.

With Java 21 virtual threads, memory per thread drops to ~2KB vs 1MB for platform threads. That means you can handle thousands of concurrent requests with a 2GB heap instead of 8GB. But virtual threads still need carrier threads from a small pool (default = cores). If your code blocks on synchronized or native methods, it pins the carrier, reducing concurrency. Great for I/O-bound, not magic for CPU-bound.

Also consider vertical scaling vs horizontal. Sometimes one large instance is cheaper than many small ones, especially if workloads benefit from large caches. Compare total cost: 8 small vs 1 large with same total RAM — often 20-30% cheaper.

Cold start overhead: containerized services may not be ready immediately after restart. Plan a 30-second grace period in auto-scaling triggers. Use pod disruption budgets in K8s to avoid mass restarts.

Real-world: A trading platform with 16GB heap saw 200ms GC pauses during peak hours. Switching to ZGC dropped pauses to <1ms. ZGC uses more CPU but latency was the constraint. Measure your constraint.

Memory leaks are another common cause. A team ignored growing heap usage over weeks, assuming GC would handle it — then JVM hit OOM. Add weekly heap growth alerts.

io/thecodeforge/estimation/ComputeEstimator.java (Java)
package io.thecodeforge.estimation;

public class ComputeEstimator {
    public static int requiredInstances(double peakQPS, double avgResponseTimeSec, double maxCpuPerInstance) {
        // Assumes each instance serves one request at a time; scale by per-instance concurrency otherwise.
        double requestsPerInstance = (1.0 / avgResponseTimeSec) * maxCpuPerInstance;
        return (int) Math.ceil(peakQPS / requestsPerInstance);
    }

    public static void main(String[] args) {
        double peakQPS = 1000;
        double avgResponseTimeSec = 0.1; // 100ms
        double maxCpuPerInstance = 0.8; // 80% target CPU utilisation
        int instances = requiredInstances(peakQPS, avgResponseTimeSec, maxCpuPerInstance);
        System.out.println("Required instances: " + instances);
    }
}
Output
Required instances: 125
GC Realities:
G1GC's sweet spot is 4-8GB heaps. Above 8GB, pause times increase non-linearly. Below 2GB, GC frequency spikes. Use jstat -gcutil to monitor GC overhead as a capacity signal — if GC overhead exceeds 5% of CPU, your heap is likely too small.
Production Insight
CPU is rarely the first bottleneck — memory often is. GC spikes under load cause latency spikes.
Right-size heap: too large causes long GC pauses, too small causes frequent GC.
Rule: Keep heap under 8GB to stay within G1GC's sweet spot.
Monitor GC pause time as a latency signal. If GC pauses exceed 1% of request timeout, reduce heap or switch to ZGC.
Real data: A trading platform with 16GB heap saw 200ms GC pauses — moving to ZGC dropped pauses to <1ms.
A common failure: too many container restarts due to liveness probes failing during heavy load. Give liveness probes timeouts that tolerate load, and use readiness probes to account for startup time.
A team ignored growing heap usage over three weeks, then hit OOM. Add weekly heap growth alerts to catch memory leaks early.
Another: High thread concurrency (many idle threads) can bloat non-heap memory. Use virtual threads or reduce thread count.
Key Takeaway
Memory scales with concurrency, not just QPS.
Compute instances = (peak QPS * response_time_sec) / (per_instance_concurrency * target_cpu_utilisation).
Monitor GC overhead as a capacity signal.
Don't assume CPU is the bottleneck — check memory and GC first.
Cold starts and container restarts can spike latency — build buffer into your capacity model.
For low-latency services, consider ZGC over G1GC to avoid long pause times.
Choosing Heap Size for Java Services
If: Service with high throughput and low latency requirements
Use: Start with 4GB heap, monitor GC pauses. If pauses exceed 20ms, reduce heap to 2GB or switch to ZGC
If: Batch processing, no strict latency requirements
Use: Use larger heap (8-16GB) but expect longer GC pauses. Acceptable if batch timeout is generous
If: Microservices with very low QPS (< 10)
Use: 2GB heap is sufficient. Watch for memory leaks more than GC pauses
If: High thread concurrency (many idle threads)
Use: Monitor non-heap memory (Metaspace, thread stacks). Use NIO or virtual threads to reduce thread count

Bandwidth and Network Considerations

Bandwidth is the silent killer. A single 500KB image served 1000 times per second consumes 500MB/s of egress bandwidth — about 4 Gbps. Most cloud instances cap network at 10 Gbps. Outbound costs can dominate your bill. Use a CDN for static assets, compress responses, and cache aggressively. For real-time apps, plan for sustained throughput, not just bursts.

Don't forget internal bandwidth either. Cross-AZ traffic costs money and adds latency. If your database writes are routed through a different availability zone, you'll pay egress fees and see 2-5ms additional latency per call. Keep your data and compute inside the same AZ when possible.

Also consider intra-service bandwidth. If your services communicate over HTTP and are chatty, that adds load. Use protobuf or gRPC to reduce payload size.

DNS and TLS handshake overhead: each new connection adds 100-200ms before data transfer. Keep-alive connections avoid paying this on every request. Estimate the number of concurrent connections you need to hold open.

Also watch for bandwidth spikes from health checks and monitoring probes. If you have 200 microservices each probing every other one (a full mesh) once per minute, that's roughly 200 * 200 = 40,000 probes per minute. Those small requests add up. Use a dedicated health check service.

Connection multiplexing: HTTP/2 multiplexes streams over a single connection, reducing overhead. For internal services, gRPC HTTP/2 can improve bandwidth utilisation.

Network topology matters. If you use a service mesh like Istio, each proxy adds 5-10% bandwidth overhead. Factor that in. Outbound bandwidth from cloud to internet is often more expensive than inbound. Monitor both directions.

Real-world: A photo-sharing app's egress bill hit $50k/month because images served directly from origin. After enabling CDN, it dropped to $8k/month — 84% reduction. The CDN also cut load on app servers by 90%.

Another: A video streaming startup's $100k monthly bandwidth bill was cut by 70% by moving to CDN and compressing with AV1.

Estimate: peak QPS * average response size * 1.5 (for headers, retransmits). For media-heavy apps, add another 20% overhead.

io/thecodeforge/estimation/BandwidthEstimator.java (Java)
package io.thecodeforge.estimation;

public class BandwidthEstimator {
    public static double peakBandwidthGbps(double peakQPS, double avgResponseSizeMB, double overheadFactor) {
        // Decimal units throughout: 1 MB = 1,000,000 bytes, 1 Gbps = 1,000,000,000 bits/s.
        double bytesPerSecond = peakQPS * avgResponseSizeMB * 1_000_000;
        return (bytesPerSecond * overheadFactor) / 1_000_000_000.0 * 8;
    }

    public static void main(String[] args) {
        double peakQPS = 1000;
        double avgResponseSizeMB = 0.5; // 500KB
        double overhead = 1.5;
        double gbps = peakBandwidthGbps(peakQPS, avgResponseSizeMB, overhead);
        System.out.println("Peak bandwidth: " + gbps + " Gbps");
        // Upper bound: assumes the peak rate is sustained for a full 30-day month.
        System.out.println("Monthly egress (TB): " + (gbps * 3600 * 24 * 30 / 8 / 1000));
    }
}
Output
Peak bandwidth: 6.0 Gbps
Monthly egress (TB): 1944.0
CDN Savings:
A CDN can reduce origin bandwidth by 80-90% for static assets. This not only cuts cost but also reduces load on your app servers, effectively increasing capacity without adding instances.
Production Insight
Network bandwidth often becomes the bottleneck before CPU, especially for media-heavy apps.
Egress costs: AWS charges ~$0.09/GB for internet transfer.
Rule: Estimate bandwidth = peak QPS * average response size. Then add 50% for overhead.
Egress costs for a video streaming app can exceed compute costs. Plan for CDN and compression early.
Real example: A photo-sharing app's egress bill hit $50k/month before CDN — dropped to $8k after.
A video streaming startup's $100k monthly bandwidth bill was cut by 70% by moving to CDN and compressing with AV1.
Internal cross-AZ traffic costs money — colocate services in the same AZ to avoid egress fees.
Health check mesh can saturate internal bandwidth — aggregate checks instead.
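To turn throughput into a bill, multiply monthly egress by the per-GB rate. The sketch below uses the ~$0.09/GB figure quoted above and an assumed average (not peak) throughput of 0.5 Gbps. The 85% CDN offload is an illustrative value inside the 80-90% range mentioned earlier, and the calculation ignores the CDN's own charges.

```java
public class EgressCostSketch {
    // Monthly egress in GB for a given average throughput in Gbps over a 30-day month.
    static double monthlyEgressGb(double avgGbps) {
        double secondsPerMonth = 30 * 24 * 3600;
        return avgGbps / 8.0 * secondsPerMonth; // Gbps -> GB/s -> GB/month
    }

    // Origin egress cost after a fraction of traffic is served by a CDN instead.
    static double monthlyCostUsd(double egressGb, double usdPerGb, double cdnOffloadFraction) {
        return egressGb * (1 - cdnOffloadFraction) * usdPerGb;
    }

    public static void main(String[] args) {
        double egressGb = monthlyEgressGb(0.5); // assumed average throughput
        System.out.printf("Monthly egress: %.0f GB%n", egressGb);
        System.out.printf("Cost from origin only: $%.0f%n", monthlyCostUsd(egressGb, 0.09, 0.0));
        System.out.printf("Origin cost with 85%% CDN offload: $%.0f%n", monthlyCostUsd(egressGb, 0.09, 0.85));
    }
}
```

Use average throughput here, not the peak figure from the bandwidth estimate: peak sizes the network, average sizes the bill.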
Key Takeaway
Bandwidth = QPS * data_per_request.
Use CDNs for static and compress dynamic responses.
Egress cost can exceed compute cost — monitor it.
Internal cross-AZ traffic costs money — keep services colocated.
Bandwidth cost can exceed compute cost — CDN and compression are not optional for media-heavy apps.
Always include a 50% overhead factor for headers and retransmits in your model.
When to Use a CDN
If: Static assets (images, CSS, JS) served to end users
Use: Always use CDN. Reduces bandwidth cost by 80%+ and improves latency.
If: Dynamic API responses (<10KB, per-user)
Use: CDN not effective for dynamic content. Optimize response size with compression and caching headers.
If: Video or large file downloads
Use: Use CDN with chunk-based caching. Consider using a dedicated media delivery service.
If: Real-time streaming (WebRTC, WebSockets)
Use: CDN not suitable. Use edge compute or dedicated media servers.

Database Capacity Planning

Databases are the hardest component to scale. Unlike app servers, you can't just add instances and expect linear performance. Database capacity planning must account for read throughput, write throughput, storage, connection pool, and replication lag.

First, estimate read QPS and write QPS separately. Reads can be offloaded to replicas — typical pattern: one primary for writes, multiple read replicas. Write capacity is often the bottleneck because every write hits the primary. Size primary's CPU and IOPS accordingly.

Connection pool sizing: pool size = peak QPS * avg query time (seconds). For 1000 QPS with 50ms queries, you need at least 50 connections. But also account for overhead — a common mistake is setting pool size equal to database's max_connections, which can exhaust the database with too many connections. Tune both sides.

Replication lag: if you have read replicas, ensure they can handle read traffic without falling behind. Monitor seconds_behind_master (MySQL) or replica lag (PostgreSQL). Keep lag under 5 seconds for responsive apps.

Storage for databases includes indexes and transaction logs. Indexes can double the storage of a table. Transaction logs (WAL) grow significantly during heavy writes — plan for at least 25% extra storage for logs.
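A rough way to fold these multipliers into one number is sketched below. It assumes the 25% log allowance applies on top of data plus indexes, which is one reasonable reading of the rule above rather than a fixed convention. The 500GB raw table size is a made-up input.

```java
public class DbStorageSketch {
    // Provisioned storage = raw table data * index multiplier * (1 + log allowance).
    static double provisionedGb(double rawTableGb, double indexFactor, double walFraction) {
        return rawTableGb * indexFactor * (1 + walFraction);
    }

    public static void main(String[] args) {
        double rawGb = 500;        // hypothetical raw table data
        double indexFactor = 2.0;  // indexes can double table storage
        double walFraction = 0.25; // at least 25% extra for transaction logs
        System.out.println("Provision at least " + provisionedGb(rawGb, indexFactor, walFraction) + " GB");
    }
}
```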

Connection pool memory: each database connection uses about 2-5MB on the database side. For 200 connections, that's 400MB to 1GB just for connections. Size the database instance accordingly.

Use a connection pooler like PgBouncer or ProxySQL to maintain persistent pool and reduce connection overhead. Some ORMs hold connections longer than expected due to transactional boundaries — test with realistic request patterns.

Real-world failure: A SaaS startup sized RDS instance based on average QPS, ignoring peak. When a customer imported a million records via API, database CPU hit 100%, connection pool saturated, all queries timed out. Recovery took 4 hours. Fix: add write replicas, use async processing, set connection pool limit that prevents thundering herd.

Separate read and write connection pools to avoid contention. Use two datasources or a proxy that routes by query type.

A billing system's database crashed during month-end due to unplanned write bursts — a single script triggered millions of updates. Adding a write queue saved them. Always plan for batch operations that can spike write load.

io/thecodeforge/estimation/DatabaseCapacityPlanner.java (Java)
package io.thecodeforge.estimation;

public class DatabaseCapacityPlanner {
    public static int requiredConnections(double peakQPS, double avgQueryTimeSec, double overheadFactor) {
        // Connections in use at once = query arrival rate * average query time.
        double base = peakQPS * avgQueryTimeSec;
        return (int) Math.ceil(base * overheadFactor);
    }

    public static void main(String[] args) {
        double peakQPS = 1000;
        double avgQueryTime = 0.05; // 50ms
        double overhead = 1.2; // 20% headroom
        int poolSize = requiredConnections(peakQPS, avgQueryTime, overhead);
        System.out.println("Minimum pool size: " + poolSize);
        // Round up to a whole number so the printed limit is an integer connection count.
        System.out.println("Database max_connections should be > " + (int) Math.ceil(poolSize * 1.5));
    }
}
Output
Minimum pool size: 60
Database max_connections should be > 90
Connection Pool Trap
Setting pool size too high (> 200) can overwhelm the database with context switching. Always set a max pool limit on the application side, and monitor active connections on the database side. A sudden spike in active connections is a leading indicator of a capacity crisis.
Production Insight
Databases are the most constrained resource in any system.
Write throughput is typically the bottleneck — plan for it first.
Connection pool exhaustion is the quickest path to a database outage.
Rule: Set pool size = (peak QPS * avg query time) * 1.2, but never exceed 200 per instance.
Real example: A billing system's database crashed during month-end due to unplanned write bursts — adding a write queue saved them.
Another failure: a microservice using a single database instance for both read and write saw connection pool exhaustion during a backup operation when the database was locked. Separate read and write connection pools.
A SaaS startup sized RDS based on average QPS — a million-record import killed it. Always size for worst-case write bursts.
Monitor active connections as a leading indicator — when they approach pool max, you have minutes to react.
Key Takeaway
Database capacity is often the hardest to scale — plan for it first.
Connection pool = (peak QPS * avg query time) * safety factor.
Write throughput is the bottleneck — shard or use queuing.
Monitor replication lag and active connections as leading indicators.
Separate read and write connection pools to avoid contention.
Size for worst-case write bursts, not average load.
Database Scaling Strategy
If: Read-heavy workload, low write QPS
Use: Add read replicas. Tune cache layer (Redis, Memcached) to reduce database reads.
If: Write-heavy workload, high concurrency
Use: Consider sharding (horizontal partition) or use a distributed database like CockroachDB. Use async writes where possible.
If: Mixed workload, moderate QPS
Use: Start with a strong primary and 2-3 read replicas. Monitor lag and add replicas as needed.
If: Bursty workload with unpredictable spikes
Use: Use connection pooling with a queue (e.g., HikariCP) and database proxy (pgBouncer, ProxySQL) to absorb bursts.

Capacity Planning for Event-Driven and Async Workloads

Event-driven systems shift the capacity model. Instead of QPS hitting an endpoint, you have message producers pushing events into a queue, and consumers processing at their own rate. The key metric is message arrival rate vs consumption rate. If arrival exceeds consumption, the queue grows indefinitely — hitting queue depth limits, memory pressure, or consumer timeouts.

Start by estimating peak message production rate. Often comes from upstream services or external webhooks. For example, a payment webhook might deliver 1000 events/sec during a flash sale. Treat this like peak QPS but with no concurrency ceiling — messages can pile up.

Next, measure the average processing time per message (deserialization, business logic, I/O). Then required consumers = (peak message rate per second) × (processing time in seconds). At 1000 events/sec and 50ms per message, that is 50 consumers. Add a safety factor of 1.5-2x for burst handling.

Watch for poison pill messages — messages that fail repeatedly and consume all consumer capacity. Implement dead-letter queues (DLQ) and circuit breakers on consumer failures.
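
A minimal in-memory sketch of the dead-letter idea, assuming a fixed retry budget. The handler, the retry limit of 3, and the message format are all hypothetical; a real broker would provide its own DLQ configuration rather than this loop.

```java
// Sketch only: retries each message a bounded number of times, then parks it in a dead-letter list.
final class DeadLetterSketch {
    static final int MAX_ATTEMPTS = 3; // assumed retry budget, not a value from the article

    /** Hypothetical handler: rejects messages it cannot parse. */
    static boolean handle(String message) {
        return !message.startsWith("malformed");
    }

    /** Processes every queued message and returns the ones that exhausted their retries. */
    static java.util.List<String> drain(java.util.Deque<String> queue) {
        java.util.List<String> deadLetters = new java.util.ArrayList<>();
        while (!queue.isEmpty()) {
            String message = queue.poll();
            boolean done = false;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS && !done; attempt++) {
                done = handle(message);
            }
            if (!done) {
                deadLetters.add(message); // park it instead of retrying forever
            }
        }
        return deadLetters;
    }

    public static void main(String[] args) {
        java.util.Deque<String> queue = new java.util.ArrayDeque<>(
                java.util.Arrays.asList("order-1", "malformed-42", "order-2"));
        System.out.println("Dead-lettered: " + drain(queue));
    }
}
```

The point of the bounded loop is that a poison pill costs at most MAX_ATTEMPTS units of consumer time, after which the healthy messages behind it keep flowing.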

Backpressure: if consumers can't keep up, you need to signal the producer to slow down. This is rarely built in by default. Use a bounded queue with a drop policy, or implement an explicit backpressure mechanism.
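
One way to picture the bounded-queue option is with the JDK's ArrayBlockingQueue. This is a sketch under assumptions: the capacity of 100 and the event count are made up, and no consumer is running, so the queue simply fills.

```java
// Sketch only: a bounded queue that rejects new events once it is full.
final class BoundedQueueSketch {
    /** Offers every event to the queue and returns how many were rejected because it was full. */
    static int offerAll(java.util.concurrent.BlockingQueue<Integer> queue, int eventCount) {
        int rejected = 0;
        for (int event = 0; event < eventCount; event++) {
            // offer() returns false immediately when the queue is full (drop policy).
            // put() would block the producer instead, which is the backpressure variant.
            if (!queue.offer(event)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        java.util.concurrent.BlockingQueue<Integer> queue =
                new java.util.concurrent.ArrayBlockingQueue<>(100); // assumed capacity
        System.out.println("Rejected: " + offerAll(queue, 250));
    }
}
```

Either way, the bound turns an invisible, unbounded backlog into an explicit signal: a rejected offer or a blocked producer that you can count and alert on.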

Batch processing can increase throughput — tune batch size for latency vs throughput.

Monitor queue depth growth rate. If it's positive for more than 5 minutes, you're losing ground. Set alerts on growth rate, not just absolute depth. Auto-scaling based on queue depth (KEDA for Kubernetes) works better than CPU-based scaling for async workloads.
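
The growth-rate alert can be sketched as a check over periodic depth samples. This assumes one sample per minute, so a window of 5 corresponds to the five-minute rule; the class, method names, and sample values are illustrative.

```java
// Sketch only: flags a queue whose depth has risen at every step across a recent window.
final class QueueGrowthMonitor {
    /** True when depth increased at each of the last windowSamples intervals. */
    static boolean sustainedGrowth(long[] depthSamples, int windowSamples) {
        if (depthSamples.length < windowSamples + 1) {
            return false; // not enough history yet
        }
        for (int i = depthSamples.length - windowSamples; i < depthSamples.length; i++) {
            if (depthSamples[i] <= depthSamples[i - 1]) {
                return false;
            }
        }
        return true;
    }

    /** Average growth in messages per interval across all samples. */
    static double growthPerInterval(long[] depthSamples) {
        if (depthSamples.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        long delta = depthSamples[depthSamples.length - 1] - depthSamples[0];
        return (double) delta / (depthSamples.length - 1);
    }

    public static void main(String[] args) {
        long[] perMinuteDepth = {1000, 1200, 1500, 1900, 2400, 3000}; // hypothetical samples
        System.out.println("Alert: " + sustainedGrowth(perMinuteDepth, 5));
        System.out.println("Growth per minute: " + growthPerInterval(perMinuteDepth));
    }
}
```

A threshold on absolute depth would stay silent on these samples until the queue was already deep; the slope check fires while there is still time to add consumers.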

Real-world failure: A fintech startup's event queue grew to 10M messages over a weekend due to a single failing consumer. A malformed message kept failing, and the retry loop consumed all capacity. Recovery took 12 hours. Add a DLQ and alert on consumer error rate.

Another nuance: strict ordering requirements force partitioning — each partition is processed by one consumer. Plan enough partitions for peak load.

io/thecodeforge/estimation/EventDrivenCapacityPlanner.java (Java)
package io.thecodeforge.estimation;

public class EventDrivenCapacityPlanner {
    // Consumers needed = peak arrival rate / per-consumer throughput, padded by a safety factor.
    public static int requiredConsumers(double peakMsgRatePerSec, double processingTimeMs) {
        double processingTimeSec = processingTimeMs / 1000.0;
        double capacityPerConsumer = 1.0 / processingTimeSec; // messages per second per consumer
        double safety = 1.5;
        return (int) Math.ceil((peakMsgRatePerSec / capacityPerConsumer) * safety);
    }

    public static void main(String[] args) {
        double peakMsgRate = 2000;
        double processingTime = 50; // 50ms per message
        int consumers = requiredConsumers(peakMsgRate, processingTime);
        System.out.println("Required consumers: " + consumers);
        System.out.println("Also monitor queue depth trend and configure DLQ.");
    }
}
Output
Required consumers: 150
Also monitor queue depth trend and configure DLQ.
Queue Depth Trap
If queue depth grows linearly over time, you're under-provisioned. A queue that grows at 1% per hour might take days to become critical, but during a burst it can explode in minutes. Monitor the derivative of queue depth, not just the absolute value.
Production Insight
Event-driven systems mask capacity problems — messages queue silently until memory or storage runs out.
A single slow consumer can cause a backlog that takes hours to drain.
Rule: Monitor queue depth growth rate. If it's positive for more than 5 minutes, you're losing ground.
Pro tip: Use auto-scaling based on queue depth (e.g., KEDA for Kubernetes) to dynamically adjust consumer count.
Real story: A fintech startup's event queue grew to 10M messages over a weekend because a malformed message sent a single consumer into a retry loop that consumed all capacity. Recovery took 12 hours. Add dead-letter queues, circuit breakers on consumer failures, and alerts on consumer health.
Also watch for batch jobs that flood the queue — a nightly batch can overwhelm consumers if not throttled.
Key Takeaway
Consumer capacity for event-driven systems = peak message arrival rate × processing time per message (in seconds).
Monitor queue depth growth rate as a leading indicator.
Use dead-letter queues and backpressure to prevent hidden failures.
Auto-scaling based on queue depth (KEDA) works better than CPU-based scaling for async workloads.
Always include a safety factor of 1.5-2x on consumer count.
Don't ignore queue depth growth — set alerts on growth rate, not just threshold.
Capacity Strategy for Async Systems
If: Stable message rate with predictable peaks
Use: Provision consumers for peak + 1.5x safety. Use batch processing to improve throughput.
If: Unpredictable bursty producers (e.g., webhooks)
Use: Use auto-scaling based on queue depth. Consider a queue with a bounded size and a drop policy.
If: High processing time per message (CPU-bound)
Use: Scale consumers horizontally. Consider partitioning the queue to increase parallelism.
If: Messages have strict ordering requirements
Use: Partition by key. Each partition is processed by one consumer. Plan enough partitions for peak load.
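
The earlier point that a backlog can take hours to drain follows from simple arithmetic: a queue only shrinks at the rate by which consumption exceeds arrival. A rough sketch, with hypothetical rates applied to the 10M-message backlog from the incident above:

```java
// Sketch only: estimates how long a backlog takes to clear once consumers are healthy again.
final class BacklogDrainEstimate {
    /** Drain time in seconds = backlog / (consume rate - arrival rate); infinite if there is no surplus. */
    static double drainSeconds(double backlogMessages, double arrivalPerSec, double consumePerSec) {
        double surplusPerSec = consumePerSec - arrivalPerSec;
        if (surplusPerSec <= 0) {
            return Double.POSITIVE_INFINITY; // the queue never shrinks
        }
        return backlogMessages / surplusPerSec;
    }

    public static void main(String[] args) {
        // Assumed rates: 1,000 msg/s still arriving, 1,500 msg/s total consumer throughput.
        double seconds = drainSeconds(10_000_000, 1000, 1500);
        System.out.println("Drain time (hours): " + seconds / 3600.0);
    }
}
```

Under these assumed rates the backlog needs about five and a half hours to clear, even though consumers have 50% more throughput than the arrival rate. Recovery speed is a capacity requirement of its own, separate from steady-state sizing.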

Capacity Planning for Cloud Costs

Capacity planning directly impacts cloud costs. Every estimate — QPS, storage, bandwidth — becomes a line item on your bill. Understanding that relationship lets you design cost-efficient systems from the start.

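Start with unit economics: cost per request. If you run a Java service on 8 instances at $0.50/hour each, sized for 1000 peak QPS, that's $4/hour for at most 3.6M requests per hour. At sustained peak this is about $0.0011 per 1000 requests in compute alone; at an average load of roughly a quarter of peak it is closer to $0.004, because idle capacity is still billed. Add storage, bandwidth, and database, and you might get to $0.01 per 1000 requests. Know this number; share it with product.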
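
A minimal sketch of the compute part of that calculation, assuming instance hours are the only cost and that the QPS passed in is the traffic the fleet actually serves. The class and method names are illustrative.

```java
// Sketch only: compute cost per 1,000 requests for a fixed-size fleet.
final class UnitEconomics {
    /** Cost per 1,000 requests = hourly fleet cost / requests served per hour x 1,000. */
    static double costPerThousandRequests(int instances, double hourlyPricePerInstance, double sustainedQps) {
        double hourlyFleetCost = instances * hourlyPricePerInstance;
        double requestsPerHour = sustainedQps * 3600.0;
        return hourlyFleetCost / requestsPerHour * 1000.0;
    }

    public static void main(String[] args) {
        // The 8-instance, $0.50/hour example, first at sustained peak and then at a quarter of peak.
        System.out.println("At 1000 QPS: " + costPerThousandRequests(8, 0.50, 1000));
        System.out.println("At 250 QPS:  " + costPerThousandRequests(8, 0.50, 250));
    }
}
```

At a sustained 1000 QPS the example works out to roughly $0.0011 per 1000 requests; at a quarter of that traffic it is roughly $0.0044, since the same fleet is billed either way. Utilization, not just instance price, drives the unit cost.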

Reserved vs on-demand: reserve for baseline, use spot for burst, on-demand as last resort. Reserved instances save 30-60%, but commit for predictable baselines only.

Right-sizing is where most money is wasted. Teams often over-provision because they don't trust their estimates. That's fine for the first month, but after 90 days of monitoring, rightsize all instances. Use AWS Compute Optimizer or similar.

Storage tiering is a huge lever. Hot data on SSDs, warm on HDDs, cold on object storage. A photo-sharing app with 10PB of data can reduce costs from $1M/month to $200k/month with proper tiering.
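
To see how a saving of that size can arise, here is a rough sketch. The per-GB prices and the 10/20/70 split between hot, warm, and cold data are assumptions chosen for illustration, not real provider quotes.

```java
// Sketch only: compares all-hot storage with a tiered layout under assumed prices.
final class StorageTieringCost {
    /** Monthly cost = sum over tiers of (total GB x share of data in the tier x price per GB-month). */
    static double monthlyCost(double totalGb, double[] tierShare, double[] pricePerGbMonth) {
        double cost = 0;
        for (int i = 0; i < tierShare.length; i++) {
            cost += totalGb * tierShare[i] * pricePerGbMonth[i];
        }
        return cost;
    }

    public static void main(String[] args) {
        double totalGb = 10_000_000;           // 10 PB expressed in decimal GB
        double[] prices = {0.10, 0.03, 0.005}; // hot, warm, cold in USD per GB-month (assumed)
        double allHot = monthlyCost(totalGb, new double[] {1.0, 0.0, 0.0}, prices);
        double tiered = monthlyCost(totalGb, new double[] {0.10, 0.20, 0.70}, prices);
        System.out.println("All hot (USD/month): " + Math.round(allHot));
        System.out.println("Tiered (USD/month):  " + Math.round(tiered));
    }
}
```

Under these assumptions the all-hot layout costs about $1M per month and the tiered one about $195k. The saving depends almost entirely on how much data can really be treated as cold, so measure access patterns before committing to a split.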

Data transfer costs: ingress is often free, egress expensive. Design to minimize cross-region or internet egress. Use CloudFront or Cloudflare for outgoing traffic.

Hidden costs: orphaned resources — load balancers, unused EBS volumes, idle NAT gateways. Set up cost anomaly detection and regular cleanup.

Real story: A team spent $50k/month on Redis clusters because they never reevaluated after cache hit ratio improved. Rightsizing saved $20k/month.

Another: A company had 10% of instances idle for months, costing $30k/month. They added scheduled shutdown for non-production environments.

Build a simple cost model in a spreadsheet: compute hours × hourly rate, storage per tier × price per GB, and egress volume × egress price. Update quarterly and compare to actual bills. If actual exceeds model by 20%, investigate.

Production Insight
Cloud costs grow with every resource you add. Capacity planning without cost estimation leads to bill shock.
Unit economics: cost per request = (total monthly spend) / (total requests). Track this monthly.
Rule: Reserve for baseline, use spot for burst, on-demand for overflow.
Real example: A startup's $200k monthly bill was cut by 60% after right-sizing instances and adding storage tiering.
Cost trap: Over-provisioned databases are the #1 waste in cloud spend. Monitor and resize regularly.
A team spent $50k/month on Redis clusters they didn't need. Rightsizing saved $20k/month.
Orphaned resources (idle NAT gateways, unused EBS volumes) silently add up. Automate cleanup.
Another: Scheduled shutdowns for non-prod saved $30k/month — do this on day one.
Key Takeaway
Capacity planning is also cost planning — every estimate has a price tag.
Know your cost per request and use it to inform architecture decisions.
Reserved instances for baseline, spot for burst.
Right-size after 90 days of monitoring.
Orphaned resources are silent budget killers.
Build a cost model in a spreadsheet and update it quarterly — if actual exceeds model by 20%, investigate.
● Production incident · POST-MORTEM · severity: high

Black Friday Crash at a Retail Startup

Symptom
Homepage load times climbed from 200ms to 12s, followed by 503 errors. Database CPU hit 100%. Orders and revenue were lost.
Assumption
Auto-scaling in the cloud would handle any traffic spike.
Root cause
Auto-scaling lagged by several minutes. Database connection pool sized for normal load. No read replicas. No circuit breakers.
Fix
Implemented read replicas, connection pool tuning, auto-scaling pre-warming, circuit breakers, and capacity gates for campaigns.
Key lesson
  • Always model worst-case peak traffic, not average.
  • Auto-scaling is not instant — you need headroom or pre-provision.
  • Databases are the hardest to scale; plan their capacity first.
  • Monitor connection pool usage as a leading indicator of saturation.
  • Run load tests at 5x expected peak to validate your model before launch.
  • Don't let marketing launch a campaign without a capacity sign-off.
  • Use feature flags to gradually ramp traffic to new capacity — don't flip the switch for all users at once.
  • Plan for write-heavy bursts — they overwhelm primaries faster than reads.
  • Consider using auto-scaling pre-warming scripts to reduce lag during known traffic events.
Production debug guide: step-by-step symptom to action (8 entries)
Symptom · 01
Response times spike but CPU is low
Fix
Check for database lock contention (use SHOW ENGINE INNODB STATUS), network bandwidth saturation (nload, vnstat), or thread pool exhaustion (check thread pool metrics in app server logs). Also verify if connection pool is exhausted.
Symptom · 02
Requests queue up and timeout
Fix
Increase connection pool size or add application server instances. Check load balancer settings for max connections and timeouts. Monitor request queue depth as a leading indicator.
Symptom · 03
Disk fills up unexpectedly
Fix
Review log rotation and data retention policies. Estimate storage growth per user and set alerts at 70% capacity. Use du -sh /var/log/* to find large files. Check for forgotten debug logs.
Symptom · 04
Latency grows linearly with QPS
Fix
Check if you hit a resource limit: open file handles (ulimit -a), connection pool, or disk IO (iostat -x 1). Use vmstat 5 5 to see context switching and blocking. Often it's thread pool exhaustion, not CPU.
Symptom · 05
Load balancer health checks fail intermittently
Fix
Check if the application's request queue depth is near capacity. Use curl -v http://localhost:<port>/health from inside the container to verify. Increase thread pool or add instances.
Symptom · 06
Database replicas lag, then the database fails over
Fix
Check replication lag (SHOW SLAVE STATUS, or SHOW REPLICA STATUS on MySQL 8.0.22+). If lag exceeds your threshold, add or upsize replica instances or reduce write load. Consider caching reads to offload replicas.
Symptom · 07
Scale-out events cause cascading failures
Fix
Downstream services may not handle the sudden increase in traffic. Inspect circuit breaker states. Add circuit breakers and backpressure. Test scale-out scenarios in staging first.
Symptom · 08
Database CPU spikes while app CPU is idle
Fix
Check for slow queries or missing indexes. Use slow query log and EXPLAIN. Add read replicas for read-heavy workloads. Consider caching with Redis or Memcached.
★ Capacity Crisis Cheat Sheet
When your system starts failing under load, use these commands to diagnose quickly and apply immediate fixes.
High latency across all endpoints
Immediate action
Check CPU and memory: top, htop
Commands
vmstat 5 5
netstat -an | grep :80 | wc -l
Fix now
Add temporary capacity by scaling horizontally or adding read replicas.
Database queries timing out
Immediate action
Check active connections: SHOW PROCESSLIST;
Commands
SHOW STATUS LIKE 'Threads_connected';
SHOW ENGINE INNODB STATUS;
Fix now
Kill idle connections and increase max_connections immediately.
Memory usage grows over time until OOM
Immediate action
Check heap usage with jmap -heap (JDK 8; on JDK 9+ use jhsdb jmap --heap --pid <pid>) or a memory profiler
Commands
jmap -heap <pid>
jstat -gcutil <pid> 1000
Fix now
Increase heap or set memory limits; investigate memory leak in code.
Storage usage exceeds 80% capacity
Immediate action
Check disk usage: df -h
Commands
du -sh /var/log/*
ls -la /tmp | sort -k5 -rn | head
Fix now
Enable log rotation and shorten retention periods; move cold data to cheaper storage.
Application starts slow and becomes fast after warmup
Immediate action
Check if JIT compilation or lazy loading is causing cold starts
Commands
jstat -compiler <pid>
jstat -class <pid>
Fix now
Warm up the application by sending dummy requests before accepting real traffic.
Connections to database pool timeout
Immediate action
Check pool metrics: active vs idle connections
Commands
SELECT * FROM pg_stat_activity;
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
Fix now
Increase pool size and add read replicas immediately.
Write operations slow down during peak hours
Immediate action
Check replication lag and primary CPU
Commands
SHOW SLAVE STATUS;
iostat -x 1
Fix now
Offload analytics queries to replicas, consider write sharding or async writes.

Key takeaways

1. Capacity planning is the math that prevents production collapses: do it before you build.
2. Always model for peak traffic, not average: peaks are 10-20x higher.
3. Storage grows with (users × data per user × retention): plan for 2x headroom.
4. Database capacity is the hardest to scale: connection pool sizing is critical.
5. Bandwidth often becomes the bottleneck before CPU: CDN and compression are must-haves.
6. Monitor growth rates, not just absolute values: leading indicators save you.
7. Event-driven systems need special attention: queue depth trend is your best early warning.
8. Capacity planning is also cost planning: know your cost per request.
9. Over-provision initially, then rightsize after 90 days of real data.
10. Automate scaling, but never assume auto-scaling is instant: plan buffers.

Common mistakes to avoid

5 patterns
×

Using only average QPS instead of peak

Symptom
System works fine at low traffic but crashes under burst. Connection pool saturates, latency spikes, 503 errors appear within minutes of a sudden traffic surge.
Fix
Always estimate peak QPS with a safety factor of 1.5x. Use 95th percentile of historical data. Run load tests at 5x expected peak before launch.
×

Assuming auto-scaling will save you instantly

Symptom
During a traffic spike, auto-scaling kicks in but takes 2-5 minutes to provision new instances. By then, connections are exhausted and cascading failures occur.
Fix
Pre-warm instances before known events. Use buffer capacity (e.g., 20% headroom). Set aggressive scaling thresholds (e.g., scale up at 60% CPU, not 80%).
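
Why a 60% trigger rather than 80%? A back-of-envelope sketch, assuming traffic grows linearly during the scaling lag and that new capacity only helps once the lag has passed. The growth rate and lag values are hypothetical and the model is deliberately simplified.

```java
// Sketch only: the highest utilization at which scaling can start without saturating during the lag.
final class ScalingThreshold {
    /**
     * If traffic grows by growthPerMinute (as a fraction of current traffic) for lagMinutes,
     * load reaches current x (1 + growth x lag). Scaling must trigger at or below the inverse of that.
     */
    static double maxTriggerUtilization(double growthPerMinute, double lagMinutes) {
        return 1.0 / (1.0 + growthPerMinute * lagMinutes);
    }

    public static void main(String[] args) {
        // Assumed spike: traffic rising 20% per minute, 4 minutes to provision new instances.
        System.out.println("Trigger at or below: " + maxTriggerUtilization(0.20, 4));
    }
}
```

With a 20% per-minute ramp and a 4-minute lag, the fleet saturates unless scaling starts at about 56% utilization or lower. Faster spikes push that number down further, which is the argument for pre-warming before known events.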
×

Neglecting database connection pool sizing

Symptom
Queries start timing out, application logs show 'Connection pool exhausted' errors, database CPU is high due to too many connections context switching.
Fix
Calculate pool size = (peak QPS × avg query time in seconds) × 1.2. Set a max pool limit of 200 per instance. Use a connection pooler (PgBouncer, ProxySQL) to manage backend connections.
×

Ignoring storage growth rate

Symptom
Disk fills up unexpectedly, causing application errors. Logs, temporary files, and test data consume space without anyone noticing until it's critical.
Fix
Monitor daily storage growth rate, not just total capacity. Set alerts at 70% threshold. Implement log rotation and retention policies (30-90 days). Tier old data to cold storage.
×

Treating all QPS as equal cost

Symptom
Compute instances are sized based on average request complexity, but a search endpoint that is 100x more expensive than a simple read dominates CPU usage. Under load, the search endpoint slows everything down.
Fix
Break QPS down by endpoint and weight by resource cost per request. Use separate scaling groups for expensive endpoints if needed. Consider caching or dedicated compute for expensive queries.
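
Weighting by cost can be as simple as multiplying each endpoint's QPS by a relative cost factor. A sketch, assuming a simple read is the unit of cost and search is 100x more expensive as in the symptom above; the QPS figures are hypothetical.

```java
// Sketch only: shows how a low-QPS but expensive endpoint can dominate total load.
final class WeightedLoad {
    /** Share of total weighted load (QPS x relative cost) contributed by one endpoint. */
    static double loadShare(double[] qps, double[] relativeCost, int endpoint) {
        double total = 0;
        for (int i = 0; i < qps.length; i++) {
            total += qps[i] * relativeCost[i];
        }
        return qps[endpoint] * relativeCost[endpoint] / total;
    }

    public static void main(String[] args) {
        double[] qps = {900, 10};          // index 0: simple read, index 1: search (assumed traffic)
        double[] relativeCost = {1, 100};  // search costs 100x a simple read
        System.out.println("Search share of load: " + loadShare(qps, relativeCost, 1));
    }
}
```

In this example search is about 1% of requests but just over half of the weighted load, which is why sizing on raw QPS undercounts it.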
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
How would you estimate the capacity requirements for a new social media ...
Q02 · SENIOR
What's the most common failure you've seen caused by poor capacity plann...
Q03 · SENIOR
How do you estimate storage requirements for a system that stores user-g...
Q04 · SENIOR
Explain the trade-off between over-provisioning and under-provisioning i...
Q05 · SENIOR
How would you estimate bandwidth requirements for a video streaming plat...
Q01 of 05 · SENIOR

How would you estimate the capacity requirements for a new social media app expected to have 1 million users in the first year?

ANSWER
Start with a DAU estimate: assume 20% of total users become daily active, so 200K DAU. Then estimate requests per user per day: for a social feed, roughly 50 reads and 10 writes.
Peak read QPS: the peak hour carries about 10% of daily traffic, so peak read QPS = (200K × 50 × 0.1) / 3600 ≈ 278 QPS. Multiply by a 1.5 safety factor to get about 417 peak read QPS.
Peak write QPS: (200K × 10 × 0.1) / 3600 ≈ 56 write QPS. Model writes separately.
Storage: each user may produce 2MB/day, so yearly storage = 1M × 2MB × 365 ≈ 730TB raw (an upper bound that counts all 1M registered users, not just DAU). With 3x replication that is about 2190TB.
Bandwidth: assume 500KB per response. Peak bandwidth = 278 × 0.5MB × 1.5 overhead ≈ 208 MB/s ≈ 1.67 Gbps.
Compute: assume 100ms response time and an 80% CPU target, with each instance serving one request at a time. Instances = 278 / ((1 / 0.1) × 0.8) ≈ 35 instances (rounded up); divide by the per-instance concurrency if instances serve requests in parallel.
This gives a starting point; refine with monitoring.
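
The arithmetic in this answer can be reproduced with a short sketch. All inputs are the assumptions stated in the answer; the class and method names are illustrative, and the instance count assumes each instance serves one request at a time.

```java
// Sketch only: reproduces the back-of-envelope numbers from the interview answer.
final class SocialAppEstimate {
    /** Peak QPS = daily users x requests per user per day x share of traffic in the peak hour / 3600. */
    static double peakQps(double dailyActiveUsers, double requestsPerUserPerDay, double peakHourShare) {
        return dailyActiveUsers * requestsPerUserPerDay * peakHourShare / 3600.0;
    }

    /** Instances = QPS / (requests per second per instance x target utilization), rounded up. */
    static int instances(double qps, double responseTimeSec, double targetUtilization) {
        double usableQpsPerInstance = (1.0 / responseTimeSec) * targetUtilization;
        return (int) Math.ceil(qps / usableQpsPerInstance);
    }

    public static void main(String[] args) {
        double dau = 1_000_000 * 0.20;                         // 20% of users are daily active
        double readQps = peakQps(dau, 50, 0.10);
        double writeQps = peakQps(dau, 10, 0.10);
        double storageTbPerYear = 1_000_000 * 2.0 * 365 / 1_000_000; // 2 MB/user/day, MB to TB
        double bandwidthMbPerSec = readQps * 0.5 * 1.5;        // 500 KB responses, 1.5x overhead

        System.out.println("Peak read QPS: " + Math.round(readQps));
        System.out.println("Peak read QPS with 1.5x safety: " + Math.round(readQps * 1.5));
        System.out.println("Peak write QPS: " + Math.round(writeQps));
        System.out.println("Storage TB/year raw: " + storageTbPerYear + ", replicated 3x: " + storageTbPerYear * 3);
        System.out.println("Bandwidth MB/s: " + Math.round(bandwidthMbPerSec));
        System.out.println("Instances: " + instances(readQps, 0.1, 0.8));
    }
}
```

Keeping the estimate in code rather than in your head makes it easy to change one assumption, such as the DAU ratio or the peak-hour share, and see how every downstream number moves.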
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What's the difference between capacity planning and performance testing?
02
How often should I revisit my capacity plan?
03
What's the single most important number to estimate first in capacity planning?
04
Should I over-provision or under-provision initially?
05
How do I handle capacity planning for serverless or auto-scaling architectures?