Senior 11 min · March 05, 2026

Scalability Concepts - Local Session Memory Logout Storm

During a 50K-user Black Friday surge, 40% errors from local session state.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Scalability means your system handles more load without redesign.
  • Vertical scaling (scale up): bigger machine, no code changes, hits a ceiling.
  • Horizontal scaling (scale out): more machines, requires stateless architecture, unlimited in theory.
  • Stateless design is the prerequisite for horizontal scaling — each request is self-contained.
  • Caching and read replicas often solve 80% of database bottlenecks before adding nodes.
  • Performance insight: a properly cached endpoint serves in <1ms vs 200ms from DB; cache hit rate >90% is achievable.
Plain-English First

Imagine a lemonade stand. On a slow Tuesday, one kid handles everything fine. But on a hot Saturday with a queue around the block, you have two choices: hire a bigger, faster kid (make one worker stronger) or hire more kids and split the queue (add more workers). That's scalability in a nutshell — your system's ability to handle more work without falling over. Every design decision you make today either opens that door wider or quietly nails it shut.

Every system works fine with ten users. The brutal truth is that most production outages aren't caused by bad code — they're caused by systems that were never designed to grow. Twitter's 'Fail Whale', Slack's 2021 degradation event, and countless startup horror stories all share the same root cause: scalability was an afterthought. Understanding scalability isn't optional for an intermediate engineer — it's the line between a system that survives launch day and one that embarrasses you in front of your entire user base.

The core problem scalability solves is demand unpredictability. Your e-commerce site might handle 500 requests per second on a normal Wednesday. On Black Friday it might need to handle 50,000. If your architecture can only scale by crossing fingers and upgrading to a bigger server, you're one viral tweet away from a very bad day. Scalability concepts give you a vocabulary and a toolkit to reason about growth before it happens — and design systems that bend instead of breaking.

By the end of this article you'll be able to explain the difference between vertical and horizontal scaling and know which to reach for first, understand why stateless design is the foundation everything else is built on, describe how load balancers and caching multiply your throughput without multiplying your bill, and walk into a system design interview and speak confidently about trade-offs — not just definitions.

Vertical vs Horizontal Scaling — Choosing Your Growth Strategy

Vertical scaling (scaling up) means giving your existing machine more power — more CPU cores, more RAM, faster SSDs. It's the simplest option and often the right first move. There's no code change, no architecture rethink, and you can do it in minutes on most cloud providers. The catch? Every machine has a ceiling. At some point AWS doesn't have a bigger instance type, and even if it did, a single machine is a single point of failure.

Horizontal scaling (scaling out) means adding more machines and splitting the work between them. This is how Netflix, Google and every large-scale system you've ever used actually works. It has no theoretical ceiling — you can keep adding nodes. But it demands that your application be stateless, because requests will land on different servers unpredictably.

The practical rule of thumb: scale vertically first until it hurts, then design for horizontal. Premature horizontal scaling adds enormous operational complexity — distributed systems are hard. A startup serving 10,000 users probably doesn't need a Kubernetes cluster; they need a better database index and maybe one more server tier.

The real decision point is around state. If your app stores session data in memory on a single server, horizontal scaling will immediately break user logins. That's why stateless design — covered next — isn't a nice-to-have. It's the prerequisite.

ScalingDecisionFramework.pseudoPSEUDOCODE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
// ─────────────────────────────────────────────────────────────
// SCALING DECISION FRAMEWORK
// Run this mental checklist BEFORE choosing a scaling strategy
// ─────────────────────────────────────────────────────────────

function chooseScalingStrategy(currentLoad, projectedLoad, appIsStateless):

    // Step 1: Calculate how much headroom you need
    growthFactor = projectedLoad / currentLoad
    // e.g. Black Friday estimate: 50,000 rps / 500 rps = 100x growth needed

    // Step 2: Check if vertical scaling can close the gap cheaply
    currentInstanceType  = "db.t3.medium"   // 2 vCPU, 4 GB RAM
    upgradedInstanceType = "db.r6g.4xlarge" // 16 vCPU, 128 GB RAM  (~8x capacity)

    if growthFactor <= 8 AND upgradedInstanceType IS available:
        // Vertical scaling is simpler and fast — do it first
        return SCALE_UP(
            targetInstance = upgradedInstanceType,
            estimatedCost  = "$0.90/hr → $4.80/hr",  // predictable pricing
            operationalRisk = LOW                      // no code changes needed
        )

    // Step 3: Beyond vertical ceiling — must go horizontal
    if growthFactor > 8 OR upgradedInstanceType NOT available:

        if NOT appIsStateless:
            // ⚠️  STOP — horizontal scaling WILL BREAK stateful apps
            // Sessions stored in-memory on Server A won't exist on Server B
            return REFACTOR_FIRST(
                action = "Move sessions to Redis / JWT tokens",
                reason = "Load balancer will route user requests to ANY server"
            )

        // App is stateless — safe to scale out
        return SCALE_OUT(
            addNodes        = ceil(growthFactor / capacityPerNode),
            loadBalancer    = "Round-robin or least-connections",
            operationalRisk = MEDIUM  // distributed systems add failure modes
        )

// ─── EXAMPLE OUTPUT ──────────────────────────────────────────
// Input:  currentLoad=500rps, projectedLoad=50000rps, stateless=false
// Output: REFACTOR_FIRST
//         → Move sessions to Redis THEN scale horizontally
//
// Input:  currentLoad=500rps, projectedLoad=2000rps, stateless=true
// Output: SCALE_UP
//         → Upgrade instance (4x growth, within vertical ceiling)
// ─────────────────────────────────────────────────────────────
Output
Decision: REFACTOR_FIRST
Action : Move session storage from in-memory to Redis
Reason : Load balancer distributes requests across all nodes.
User on Server A will hit Server B next request.
In-memory session on A does not exist on B → instant logout bug.
Decision: SCALE_UP
Target : db.r6g.4xlarge (16 vCPU / 128 GB RAM)
Cost : $0.90/hr → $4.80/hr
Risk : LOW — zero code changes, deploy in ~5 minutes
Pro Tip: The 'Stateless Test'
Before assuming you can scale horizontally, ask: 'If I killed the server handling this request right now and a different server picked it up, would the user notice?' If yes, you have stateful logic — find it and externalize it to Redis, a database, or JWTs before you add a single new node.
Production Insight
A startup scaled to 10 nodes in Kubernetes without externalizing sessions.
Users saw logouts every 3rd request due to round-robin load balancing.
Fix: move sessions to Redis before adding more than 1 node — always.
Don’t rely on 'sticky sessions' — they hide the problem, not solve it.
Key Takeaway
Vertical first until it hurts.
Horizontal requires statelessness.
Don't skip the stateless audit — it's the most common scaling failure.
Choose Your Scaling Path
IfGrowth factor <= 8 and bigger instance available
UseScale up vertically — quicker, simpler, cheaper for this range
IfGrowth factor > 8 or no bigger instance
UsePlan horizontal scaling — requires stateless app design
IfApp is stateful (sessions in memory)
UseRefactor to externalize state first — then scale horizontally
IfApp is stateless
UseDeploy behind load balancer, add nodes, test readiness

Scalability Math — How Many Nodes Do You Actually Need?

When you move from vertical to horizontal scaling, the obvious question is: how many servers do I need? The formula is straightforward in theory, but production complexity makes it a little more nuanced.

## The Base Formula

N = (Peak QPS × (1 + Safety Margin)) / (Max QPS per Node)

Where
  • Peak QPS: The maximum requests per second you expect during traffic spikes (e.g., 50,000).
  • Max QPS per Node: The maximum throughput a single node can sustain while keeping response times acceptable (e.g., 1,000 RPS).
  • Safety Margin: A buffer for unexpected spikes, deployment headroom, and failover capacity (typically 0.2 to 0.5).

## Example Calculation

For a Black Friday event
  • Peak QPS = 50,000
  • Max QPS per Node = 2,000 (based on load testing with your application)
  • Safety Margin = 0.3 (30% headroom)

N = (50,000 × 1.3) / 2,000 = 65,000 / 2,000 = 32.5 → round up to 33 nodes.

## Important Caveats

  1. Linear scaling assumption — The formula assumes each additional node adds exactly its capacity. In practice, there's often a small overhead from coordination (e.g., connection pooling to shared databases). Expect 80-90% linearity for stateless services.
  2. Cold start penalty — New nodes have empty caches, so their initial throughput is lower. Pre-warming helps.
  3. Crossover point — At very high node counts (e.g., >50), you may hit load balancer limits or database connection limits. Plan for multiple load balancers and database replicas.
  4. Not all requests are equal — Some requests are heavier (e.g., checkout vs product listing). Use average + P99 latency, not raw RPS.

## Production Formula

N = (Peak QPS × SafetyMultiplier) / (NodeCapacity × LinearEfficiency)

Where NodeCapacity is measured under realistic load (not synthetic microbenchmarks), LinearEfficiency is 0.8–0.9, and SafetyMultiplier is 1.2–1.5.

ScalabilityMath.pseudoPSEUDOCODE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
// ── Scalability Math Calculator ────────────────────────────
function calculateScaledNodes(peakQPS, nodeCapacity, safetyMargin, linearEfficiency):
    // safetyMargin: 0.2 for modest spikes, 0.5 for unpredictable traffic
    // linearEfficiency: 0.85 for typical stateless apps, 0.95 for perfect stateless
    
    adjustedPeak = peakQPS * (1 + safetyMargin)
    effectiveCapacity = nodeCapacity * linearEfficiency
    
    nodeCount = ceil(adjustedPeak / effectiveCapacity)
    
    // Add one more for N+1 redundancy if required
    if highAvailability requested:
        nodeCount = nodeCount + 1
    
    return nodeCount

// ── Real example ────────────────────────────────────────
// Load test results: each node handles 1,800 RPS at P99 < 200ms
// Expected peak Black Friday: 100,000 RPS
// Safety margin: 0.5 (aggressive because unknown pattern)
// Linear efficiency: 0.85 (database connection pool shared)

nodesNeeded = calculateScaledNodes(
    peakQPS = 100000,
    nodeCapacity = 1800,
    safetyMargin = 0.5,
    linearEfficiency = 0.85
)
// = ceil((100000 * 1.5) / (1800 * 0.85))
// = ceil(150000 / 1530)
// = ceil(98.04)
// = 99 nodes (plus 1 for HA = 100 nodes)
Output
Input: peakQPS=100000, nodeCapacity=1800, safetyMargin=0.5, linearEfficiency=0.85
Calculation: (100000 * 1.5) / (1800 * 0.85) = 150000 / 1530 ≈ 98.04
Round up: 99 nodes
With HA: 100 nodes
Cost estimate: 100 nodes × $0.50/hr = $50/hr during spike
vs 1 node × $100/hr for vertical scaling (if even possible)
Don't Assume Linear Scaling
Every additional node adds a tiny amount of overhead (connection pool management, load balancer processing, cache coherency). Test with 1, 2, 4, 8 nodes and measure actual throughput to derive your linear efficiency factor. Don't rely on theoretical capacity.
Production Insight
Production experience: A team calculated they needed 20 nodes based on perfect linear scaling. At 15 nodes, they hit the database connection pool limit — actual capacity dropped. Always test scaling curves under realistic load.
Key Takeaway
Use the formula N = (peakQPS safetyMargin) / (nodeCapacity linearEfficiency). Always measure your linear efficiency and account for shared bottlenecks like database connections.

Stateless Design and Load Balancing — The Foundation of Horizontal Scale

Stateless design means each server treats every incoming request as if it's meeting that user for the first time. No local memory of what happened before. All state the request needs — auth tokens, user preferences, cart contents — travels with the request or lives in a shared external store like Redis or a database.

This sounds like a constraint, but it's actually a superpower. When no single server 'owns' a user's session, a load balancer can freely route any request to any available server. You can add servers during a traffic spike, remove them when it passes, and restart individual servers without losing anyone's session. Your system becomes elastic.

A load balancer sits in front of your server pool and distributes incoming requests. The two most common strategies are round-robin (requests cycle through servers sequentially — great for uniform workloads) and least-connections (each new request goes to the server with fewest active connections — better when requests vary in processing time, like an API mixing fast reads and slow report generation).

Health checks are the underrated hero here. A good load balancer pings each server every few seconds. If a server stops responding, traffic is automatically rerouted to healthy nodes — and users never know a server died. This is how large systems achieve high availability without magic.

StatelessRequestFlow.pseudoPSEUDOCODE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
// ─────────────────────────────────────────────────────────────
// STATELESS REQUEST FLOW WITH LOAD BALANCER
// Demonstrates how a stateless API handles auth across two servers
// ─────────────────────────────────────────────────────────────

// ── SHARED INFRASTRUCTURE (lives outside any single server) ──
redisSessionStore = ExternalRedis(host="redis.internal", port=6379)
jwtSecretKey      = EnvironmentVariable("JWT_SECRET")  // same key on ALL servers

// ── SERVER A and SERVER B are identical clones ────────────────
function handleRequest(incomingHttpRequest):

    // Load balancer already decided this request lands here
    // We don't know or care which server handled the previous request

    authHeader = incomingHttpRequest.headers["Authorization"]
    // e.g. "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."

    if authHeader is null:
        return HTTP_401("Missing auth token")

    jwtToken = authHeader.stripPrefix("Bearer ")

    // Verify the JWT using the shared secret — works on ANY server
    // because the secret is the same everywhere and state is IN the token
    decodedPayload = JWT.verify(jwtToken, jwtSecretKey)
    // decodedPayload = { userId: "usr_8821", role: "admin", exp: 1712000000 }

    if decodedPayload.isExpired():
        return HTTP_401("Token expired — please log in again")

    // Fetch user-specific data from the SHARED store (not local memory)
    userCart = redisSessionStore.get(key = "cart:" + decodedPayload.userId)
    // userCart = [{ productId: "prod_44", qty: 2 }, { productId: "prod_91", qty: 1 }]

    // Process the request normally
    orderTotal = calculateTotal(userCart)

    return HTTP_200({ cart: userCart, total: orderTotal })

// ── WHAT MAKES THIS STATELESS ─────────────────────────────────
// 1. No in-memory session map — server restarts lose NOTHING
// 2. JWT carries identity — valid on Server A, B, or C equally
// 3. Cart lives in RedisServer B reads the same cart as Server A
// 4. Load balancer can route usr_8821's next request ANYWHERE safely

// ── LOAD BALANCER LOGIC (simplified) ─────────────────────────
serverPool = [ServerA(weight=1), ServerB(weight=1), ServerC(weight=1)]

function routeIncomingRequest(request):
    // Least-connections strategy — helps when cart checkout is slow
    targetServer = serverPool.minBy(server => server.activeConnections)
    targetServer.activeConnections += 1

    response = targetServer.handleRequest(request)

    targetServer.activeConnections -= 1
    return response
Output
GET /api/cart → Load Balancer routes to ServerB (fewest connections)
ServerB receives request:
✓ Auth header found: Bearer eyJhbGci...
✓ JWT verified with shared secret
✓ Decoded: { userId: 'usr_8821', role: 'admin' }
✓ Cart fetched from Redis: 2 items
✓ Total calculated: $84.97
HTTP 200 OK
{
"cart": [
{ "productId": "prod_44", "qty": 2, "price": "$29.99" },
{ "productId": "prod_91", "qty": 1, "price": "$24.99" }
],
"total": "$84.97"
}
-- ServerA was handling a slow checkout, ServerB had 0 active connections
-- User never knew which server responded. That's the point.
Watch Out: Sticky Sessions Are a Trap
Some load balancers offer 'sticky sessions' (also called session affinity) which always route a user to the same server. This papers over a stateful app but destroys the benefits of horizontal scaling — if that one server dies, the user's session is gone anyway, and you can't freely redistribute load. Fix the stateful design instead of relying on sticky sessions.
Production Insight
A production team used sticky sessions to bypass a stateful app — until a node failed.
All users pinned to that node lost their session simultaneously.
Rule: sticky sessions mask the real problem. Externalize state, don't pin users.
Key Takeaway
Stateless design = requests don't depend on which server handled previous request.
That's the only way horizontal scaling works safely.
Audit your app: find every in-memory user store and move it outside.

Stateless Architecture Flow — Load Balancer to Any App Node

The following diagram illustrates how a stateless architecture routes requests from a load balancer to any available application node. Because no node stores user-specific data locally, the load balancer is free to distribute requests evenly across all healthy instances. State such as session tokens and user preferences live in a shared Redis cache or are encoded in JWT tokens sent with each request. This ensures that even if a node fails, the user's session survives and can be handled by another node.

The key takeaway: any request can go to any node, and the result is identical.

Production Insight
In our Black Friday incident, we lacked a diagram like this. The developers assumed each node was independent but didn't realize session state was local. Visualizing the flow with a shared Redis would have caught the issue before production.
Key Takeaway
A stateless architecture requires a shared external store for any user-specific data. The load balancer treats all nodes as interchangeable — that's the foundation of horizontal scaling.

Caching and Database Scaling — Where Most Performance Is Actually Won

Here's an uncomfortable truth: most scalability problems aren't compute problems — they're database problems. Your web servers are usually fine. It's the database that melts under load because every request hits it, even for data that hasn't changed in hours.

Caching solves this by storing the result of expensive operations in fast, in-memory storage and serving repeat requests from there. A Redis cache lookup takes under 1 millisecond. A PostgreSQL query joining three large tables might take 200ms. If 10,000 users all request the homepage product list within a minute, you want to hit your database once and serve everyone else from cache — not hammer your database 10,000 times.

The cache hierarchy matters. Browser caches handle static assets (images, CSS). A CDN cache handles geographically-distributed content. An application-level cache like Redis handles dynamic query results. Each layer handles a different class of data.

For databases specifically, you have two main scaling levers: read replicas and sharding. Read replicas copy your primary database to one or more secondary nodes. Reads are distributed across replicas; writes go only to the primary. This works brilliantly when your workload is read-heavy — which most web apps are (roughly 80% reads, 20% writes is common). Sharding partitions data itself across multiple databases — User IDs 1–1M on Shard A, 1M–2M on Shard B. It's powerful but operationally complex. Reach for read replicas first.

CacheAsidePattern.pseudoPSEUDOCODE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
// ─────────────────────────────────────────────────────────────
// CACHE-ASIDE PATTERN (also called Lazy Loading)
// The most common and safest caching strategy for web APIs
// Application controls what gets cached and when
// ─────────────────────────────────────────────────────────────

redisCache    = RedisClient(host="cache.internal")
primaryDB     = PostgresClient(host="db-primary.internal")
readReplicaDB = PostgresClient(host="db-replica.internal")  // read-only copy

// ── CACHE TTL STRATEGY ────────────────────────────────────────
// TTL (Time To Live) = how long cached data stays valid
// Too short → cache provides little benefit, DB still hammered
// Too long  → users see stale data after updates
PRODUCT_CATALOG_TTL = 300   // 5 minutes — changes rarely
USER_PROFILE_TTL    = 60    // 1 minute  — changes occasionally
LIVE_STOCK_COUNT_TTL = 5    // 5 seconds — must be near-realtime

function getProductCatalog(categoryId):
    cacheKey = "catalog:category:" + categoryId
    // e.g. "catalog:category:electronics"

    // ── STEP 1: Check the cache first (fast path) ─────────────
    cachedResult = redisCache.get(cacheKey)

    if cachedResult is NOT null:
        // Cache HIT — served in <1ms, database not touched
        logMetric("cache_hit", key=cacheKey)
        return JSON.parse(cachedResult)

    // ── STEP 2: Cache MISS — go to read replica ───────────────
    // Using read replica, not primary — keeps primary free for writes
    logMetric("cache_miss", key=cacheKey)

    freshProducts = readReplicaDB.query("""
        SELECT p.id, p.name, p.price, p.stock_count, c.name AS category
        FROM   products p
        JOIN   categories c ON c.id = p.category_id
        WHERE  p.category_id = :categoryId
          AND  p.is_active = true
        ORDER  BY p.created_at DESC
        LIMIT  50
    """, params={ categoryId: categoryId })
    // This query takes ~180ms on a warm database

    // ── STEP 3: Populate cache for next request ───────────────
    redisCache.setWithExpiry(
        key     = cacheKey,
        value   = JSON.stringify(freshProducts),
        ttlSecs = PRODUCT_CATALOG_TTL
    )

    return freshProducts

// ── CACHE INVALIDATION ON UPDATE ─────────────────────────────
// When a product changes, the cache for its category MUST be cleared
function updateProductPrice(productId, newPrice, categoryId):

    // Write always goes to PRIMARY database
    primaryDB.execute("""
        UPDATE products SET price = :newPrice, updated_at = NOW()
        WHERE  id = :productId
    """, params={ productId, newPrice })

    // Invalidate the cached category so next read gets fresh data
    staleCacheKey = "catalog:category:" + categoryId
    redisCache.delete(staleCacheKey)
    // Next call to getProductCatalog() will be a cache miss → DB fetch → repopulate

    logEvent("price_updated", productId=productId, cacheInvalidated=staleCacheKey)
Output
Request 1 — GET /products?category=electronics
cache_miss: catalog:category:electronics
→ Query read replica: 183ms
→ Cache populated with TTL=300s
Total response time: 187ms
Request 2–9,999 — GET /products?category=electronics (within 5 min)
cache_hit: catalog:category:electronics
→ Served from Redis: 0.8ms
Total response time: 4ms
Database not touched.
Admin updates product_44 price to $34.99:
→ Write to primary DB: 12ms
→ Cache key 'catalog:category:electronics' deleted
Request 10,000 — GET /products?category=electronics
cache_miss: catalog:category:electronics (cache was invalidated)
→ Query read replica: 181ms ← fresh data with new price
→ Cache repopulated
Total response time: 185ms
Interview Gold: The Two Hard Problems
Phil Karlton famously said 'There are only two hard things in computer science: cache invalidation and naming things.' When an interviewer asks about caching, always mention invalidation strategy alongside TTL. Saying 'I'd set a 5-minute TTL AND delete the cache key on write' shows you understand both sides of the coin — most candidates only talk about TTL.
Production Insight
A team added Redis caching but set TTL to 1 hour and never invalidated on writes.
Users saw outdated stock for 45 minutes during a flash sale, causing overselling.
Fix: always implement cache invalidation on writes. TTL is a safety net, not a strategy.
Key Takeaway
Cache before you add more servers — databases are the real bottleneck.
Always pair writes with cache invalidation.
Read replicas for read-heavy workloads; sharding is the last resort.

Database Scaling Decision Matrix — Replication vs Sharding vs Federation

When a single database can't handle your load, you have three major strategies. Choosing the wrong one early can cost months of refactoring. Here's a decision matrix based on workload type, complexity, and growth pattern.

## The Three Strategies

  1. Read Replication — Create one or more read-only copies of your database. All writes go to the primary; reads are distributed to replicas.
  2. - Best for: Read-heavy workloads (80/20 or worse).
  3. - Complexity: Low to medium (replication lag monitoring).
  4. - Limitation: Writes still bottleneck on one primary.
  5. Sharding — Partition data across multiple databases based on a shard key (e.g., user_id % N).
  6. - Best for: Balanced read-write workloads, or when data size exceeds single machine capacity.
  7. - Complexity: High (shard key must be carefully chosen, cross-shard queries are painful).
  8. - Limitation: Rebalancing shards is expensive.
  9. Federation (Functional Partitioning) — Split by domain. One database for users, another for orders, another for products.
  10. - Best for: Microservices architectures with clear domain boundaries.
  11. - Complexity: Medium (requires service-level joins).
  12. - Limitation: Queries that span domains are slow (API composition).

## Decision Matrix

FactorReplicationShardingFederation
Write throughputLimited by primaryScales with shardsScales per domain
Read throughputScales linearly with replicasScales with shardsScales per domain
Query complexitySimple (any query)Must target correct shardMust orchestrate across services
Data size limitPrimary size limitUnlimited (add shards)Per-domain limit
Operational complexityLowHighMedium
ConsistencyStrong on primary, eventual on replicasPer-shard strongPer-domain strong
Best forRead-heavy, moderate sizeMassive data, high write volumeMicroservices, domain isolation

## When to Use Each

  • Start with replication if 80%+ of your queries are reads and you're not hitting write limits.
  • Move to sharding if you have billions of rows and write volume exceeds what one primary can handle.
  • Use federation if your system naturally has distinct business domains (e.g., user service, order service) and you've already adopted microservices.

## Combined Approach

Most large systems use a combination: federation for domain isolation, read replicas within each domain for reads, and sharding within a domain when that domain's data grows beyond a single database.

DatabaseScalingDecision.pseudoPSEUDOCODE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
function chooseDatabaseScalingStrategy(workloadProfile):
    // workloadProfile: { readRatio, writeRatio, dataSize, writeVolume, domainCount }
    
    if workloadProfile.readRatio > 0.8:
        return "READ_REPLICAS"
        // Add replicas until primary write capacity becomes bottleneck
        // Monitor replication lag < 1s
    
    if workloadProfile.dataSize > 10 TB OR workloadProfile.writeVolume > 50k writes/s:
        // Single primary can't handle writes or data size exceeds max instance
        if workloadProfile.domainCount > 3 AND clear domain boundaries exist:
            return "FEDERATION"
            // Split into domain databases, each with its own replicas
        else:
            return "SHARDING (partition by customer_region or user_id_hash)"
            // Choose shard key that evenly distributes load and queries
    
    // Default: replication is simplest
    return "READ_REPLICAS (monitor for future growth)"
Output
Example input:
readRatio=0.9, dataSize=5TB, writeVolume=10k writes/s, domainCount=1
Output: READ_REPLICAS
→ Add 3 read replicas to handle 90% read traffic.
→ Primary handles 10k writes/s, within limits.
→ Monitor if write volume grows beyond 50k.
Example input:
readRatio=0.6, dataSize=50TB, writeVolume=100k writes/s, domainCount=5
Output: FEDERATION
→ Split into users, orders, products, payments DBs.
→ Each domain has its own primary + replicas.
→ Within orders domain, if data size > 10TB, shard orders by customer_id.
Production Insight
A team prematurely sharded their database when replication would have solved the problem. The sharding key was poorly chosen, causing hot spots and cross-shard join nightmares. They spent 6 months refactoring back to replication. Measure your current bottleneck before choosing.
Key Takeaway
Always start with replication. Shard only when writes become bottleneck or data exceeds single machine capacity. Federation is best when your architecture already has clear domain boundaries.

Performance vs Scalability — Know the Difference and Why It Matters

Performance and scalability are not the same thing. Performance is how fast your system responds to a single request — latency and throughput at a given load. Scalability is whether that performance holds up as you increase load. A system can be blindingly fast for 100 users (great performance) but collapse at 10,000 (poor scalability). Conversely, a system can be moderately slow but degrade gracefully under massive load (good scalability).

Understanding this distinction is crucial because the wrong diagnosis leads to wrong fixes. If your API returns in 2 seconds for a single user, caching won't help — you need a faster query, better indexes, or a more efficient algorithm. If your API returns in 20ms for one user but takes 500ms when 100 users hit it simultaneously, that's a scalability issue — likely a database connection pool saturating or a missing index causing a table scan under load.

The practical approach: measure your system's performance at a baseline load, then gradually increase load and watch for the inflection point where latency starts to climb. That's where your scalability bottleneck lives. Most tools like wrk, k6, or artillery can give you this curve. Without this measurement, you're guessing — and guessing breaks production.

Performance vs Scalability Mental Model
  • Performance = what happens with one request? Latency, throughput at low concurrency.
  • Scalability = what happens as requests multiply? Does latency stay flat or spike?
  • Bad performance but good scalability: each request is slow but consistent regardless of load (e.g., a geo-distributed system with high base latency).
  • Good performance but bad scalability: fast for one user, collapses under load — classic sign of shared locked resources (DB, thread pool).
  • Rule of thumb: optimize for performance first (cheap), then test for scalability before adding infrastructure.
Production Insight
A team optimized a query from 300ms to 20ms for a single user (great performance win).
But under 1000 concurrent users, it still crashed — the query was now fast, but connection pool was too small.
Rule: performance fixes address single-request speed; scalability fixes address behavior under concurrency.
Key Takeaway
Performance and scalability are independent.
Measure your latency curve under increasing load.
Don't buy more servers when your query itself is slow — fix the query first.

Auto-scaling and Elasticity — Scaling Without Human Intervention

Elasticity is the ability to automatically add or remove resources as demand changes. It's the cloud's killer feature: you don't need to predict traffic spikes. You set policies (e.g., CPU > 70% for 5 minutes → add 2 nodes), and the platform does the rest.

But auto-scaling is not magic. It only works if your application can be safely scaled — which brings us back to stateless design. If your app stores state in memory, an auto-scaling event that kills a node also kills every session on that node. Worse, a scale-in event (removing nodes) might terminate a node mid-request.

Key considerations: horizontal pod auto-scaling in Kubernetes looks at CPU or custom metrics. AWS Auto Scaling groups use launch configurations. The warm-up time of new instances matters — a new server with an empty cache can spike database load. Pre-warming caches via startup hooks or graceful degradation helps.

The most common mistake is setting aggressive scale-up policies and no scale-down. You spike, add 10 nodes, traffic drops, but those nodes stay running — burning money. Always set cooldown periods and scale-in thresholds. Test with load generators before production.

Auto-scaling Gotcha: Thundering Herd
When you scale up, new nodes hit the database for the first time (empty caches). If all new nodes start at once, the database gets a thundering herd of requests from all of them. Mitigate: use a pre-warming step (e.g., call a startup endpoint that loads common data into cache), or ramp up nodes one by one with health checks.
Production Insight
A team set auto-scaling to add nodes when CPU > 80% for 1 minute.
During flash sale, CPUs hit 80% every time new nodes started (because they were idle at first).
Result: nodes were added continuously until the quota was hit, costing $40k in one hour.
Fix: use a more conservative metric like request latency, and add a cooldown period.
Key Takeaway
Auto-scaling requires stateless apps and careful metric selection.
Test scaling policies with load testing before Black Friday.
Always set scale-down policies to avoid burning money.

Availability (The 9s) — What They Mean and How to Design for Them

Availability is measured in 'nines' — the percentage of time a system is operational. Each nine represents an order of magnitude reduction in downtime. Understanding these levels is critical for setting realistic SLAs and designing the right redundancy.

## The 9s Table

Availability LevelDowntime per YearDowntime per MonthDowntime per WeekTypical Architecture
99% (two nines)3.65 days7.2 hours1.68 hoursSingle server, no redundancy
99.9% (three nines)8.76 hours43.8 minutes10.1 minutesActive-passive failover
99.99% (four nines)52.56 minutes4.38 minutes1.01 minutesActive-active, load balancer, multi-AZ
99.999% (five nines)5.26 minutes26.3 seconds6.05 secondsMulti-region, automatic failover, redundant everything
99.9999% (six nines)31.5 seconds2.6 seconds0.6 secondsGeographically distributed, fault-tolerant, near-zero downtime

## What It Takes

  • Two nines (99%) — Good for internal tools or non-critical systems. A few hours of downtime per month is acceptable.
  • Three nines (99.9%) — Typical for consumer web apps. Unplanned downtime under 9 hours per year requires some form of redundancy (e.g., multi-AZ database, load balanced app servers).
  • Four nines (99.99%) — Expected for production workloads at scale. Requires active-active architecture, automatic failover, and careful change management. Downtime budget is under 1 hour per year.
  • Five nines (99.999%) — Banking, telecom, mission-critical. Requires multi-region deployment, chaos engineering, and significant operational investment. Most systems don't need this — it's expensive and complex.

## Design Implications

Each additional nine roughly doubles operational complexity and cost. Don't design for five nines unless you have a regulatory or business requirement. The cost of achieving 99.999% is often not justified for most B2C web apps.

The Cost of Nines
Moving from 99.9% to 99.99% doesn't just mean adding one more replica. It typically requires redundancy at every layer: load balancers, application servers, databases, caches, DNS, cross-region failover, and automated recovery procedures. The engineering effort to detect and recover from failures in seconds (not minutes) is substantial.
Production Insight
Our platform advertised 99.99% availability but during the Black Friday logout storm, the actual availability was 97% because session failures cascaded. We learned that achieving high availability requires every component to be redundant — the weakest link determines your real uptime.
Key Takeaway
Match your availability target to business requirements, not engineering ego. Understand what it costs to gain each nine — the last two are exponentially harder than the first.
● Production incidentPOST-MORTEMseverity: high

The Black Friday Logout Storm

Symptom
During Black Friday traffic spike (50,000 concurrent users), users were randomly logged out, lost shopping carts, and saw inconsistent data. Error rates jumped to 40%.
Assumption
The team assumed that horizontal scaling with a round-robin load balancer was safe because they had 'stateless' services. They forgot that session state was stored in-memory on each server via HttpContext.Session (ASP.NET).
Root cause
A user's first request landed on Server A, which stored session data in local memory. The next request, routed to Server B by the load balancer, found no session and created a new one — user appeared logged out. This happened on every request, causing infinite redirect loops and login prompts.
Fix
Moved session storage from in-process memory to a shared Redis cache. Changed the session provider configuration in web.config to use StackExchange.Redis. Deployed with zero downtime via rolling update. Cache hit rate for session data: 99.9%, response times dropped from 300ms to 5ms.
Key lesson
  • Never assume your app is stateless — audit every store of user-specific data in local memory.
  • Externalize all session, cart, and user state to a shared store before adding a second node.
  • Test horizontal scaling in a staging environment with real load patterns before Black Friday.
Production debug guideIdentify why your system breaks under load and how to fix it fast.4 entries
Symptom · 01
Users randomly logged out, session lost
Fix
Check session storage mechanism. If in-memory, move to Redis or database. Validate with curl -i -c cookies.txt -b cookies.txt across multiple endpoints.
Symptom · 02
Database CPU 100% but web servers idle
Fix
Add query-level monitoring. Identify the top 5 slowest queries. Add appropriate indexes, then implement caching (Redis) for read-heavy data.
Symptom · 03
Load balancer reports backend unhealthy despite service running
Fix
Check health check endpoint. Ensure it actually validates dependencies (DB, cache). A '200 OK' from a broken server is worse than a '503'.
Symptom · 04
Response times spike when new server is added
Fix
Measure if the new server is cold (empty cache). Pre-warm caches during deployment. Use connection pooling limits to avoid overwhelming databases.
★ Quick Debug Cheat Sheet: Production OverloadWhen your app starts timing out under load, run these commands to pinpoint the bottleneck.
API response time > 2s
Immediate action
Check CPU and memory on all nodes. Look for any single node at 100% CPU.
Commands
top -b -n1 | head -20
kubectl top pods (if Kubernetes) or docker stats
Fix now
Add one more node immediately, then profile the slow endpoints.
Database connections exhausted+
Immediate action
Check connection pool stats on the app side and database side.
Commands
SHOW max_connections; SELECT count(*) FROM pg_stat_activity;
SELECT * FROM pg_stat_activity WHERE state = 'active';
Fix now
Reduce connection pool size per node to 10-20. Increase max_connections on DB if safe. Add a PgBouncer or connection pooler.
Cache hit rate below 50%+
Immediate action
Log the cache key being queried. Check if keys match what is cached.
Commands
redis-cli MONITOR | head -100
echo 'keys *' | redis-cli | head -20
Fix now
Implement cache-aside with consistent key naming. Pre-warm on deploy. Set appropriate TTL.
Load balancer returns 502/503+
Immediate action
Check health check endpoint of backend nodes. Verify network connectivity.
Commands
curl -I http://internal-service:8080/health
telnet internal-service 8080 (or nc -zv)
Fix now
Ensure health check endpoint is simple and fast. Tune health check interval and retry count. Add more nodes.
AspectVertical Scaling (Scale Up)Horizontal Scaling (Scale Out)
MechanismBigger CPU/RAM on one machineMore machines, split the load
ComplexityLow — no code changes neededHigh — stateless design required
Cost curveExponential — bigger = disproportionately pricierLinear — each node costs roughly the same
CeilingHard limit — biggest instance type availableEffectively unlimited
Single point of failureYes — one machine going down = full outageNo — other nodes absorb dead node's traffic
Best forDatabases, early-stage apps, quick winsWeb/API tiers, microservices, large-scale systems
Time to implementMinutes (resize instance)Days to weeks (architecture refactor)
Failure modeDowntime during resize windowPartial degradation — system degrades gracefully

Key takeaways

1
Scale vertically first
it's faster, simpler, and often enough. Move to horizontal only when you hit the machine's ceiling or need fault tolerance, not by default.
2
Stateless design isn't a feature
it's a prerequisite. If a server restart loses user data, you can't safely scale horizontally. Externalize all state to Redis, a database, or tokens before adding nodes.
3
Most scalability bottlenecks live in the database, not the web tier. Read replicas for read-heavy workloads and aggressive caching will outperform adding more web servers if the DB is the real constraint.
4
Cache invalidation is the hard part of caching
always design your write path to delete or update cache keys, not just rely on TTL expiry. A short TTL without invalidation still serves stale data between writes.
5
Performance and scalability are different
a fast single-request experience doesn't guarantee good multi-user behavior. Measure latency under load before choosing a fix.
6
Auto-scaling is powerful but dangerous. use request latency metrics with cooldowns to avoid runaway costs and thundering herd issues.

Common mistakes to avoid

5 patterns
×

Scaling horizontally without making the app stateless first

Symptom
Users randomly get logged out, lose shopping carts, or see inconsistent data as requests land on different servers.
Fix
Audit your app for any in-process memory used to store user state (session maps, local caches keyed by user). Move them to Redis or encode them in JWT tokens. Then add nodes behind a load balancer.
×

Setting cache TTL too high on data that changes on writes

Symptom
Users see stale prices, outdated stock counts, or deleted items still appearing for minutes after an admin update.
Fix
Pair every write operation with an explicit cache.delete(key) call for affected cache entries. Use TTL as a safety net for missed invalidations, not as your only freshness strategy.
×

Sending ALL database traffic to the primary even after adding read replicas

Symptom
Read replicas sit idle while the primary is overwhelmed and becomes the bottleneck.
Fix
Explicitly route SELECT queries to a read-replica connection pool in your ORM or data access layer. In most ORMs this is a one-line config change; the hard part is making it a conscious habit for every query you write.
×

Auto-scaling with only CPU metric and no cooldown

Symptom
During traffic spikes, nodes keep being added until you hit the quota, costing huge amounts. When traffic drops, nodes never scale down.
Fix
Use a combination of CPU, memory, and request latency metrics. Set a cooldown period (e.g., 5 minutes) between scaling events. Always define a scale-in policy with a lower threshold.
×

Assuming adding more servers will fix a database bottleneck

Symptom
Adding web servers doesn't improve throughput; database CPU stays at 100%.
Fix
Profile database queries first. Add indexes, optimize queries, then consider caching or read replicas. Only after that consider sharding or application-level changes.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Your API handles 1,000 requests per second comfortably. You're told to d...
Q02SENIOR
What's the difference between horizontal and vertical scaling, and under...
Q03SENIOR
You've added Redis caching and your cache hit rate is 95% — but users ar...
Q04SENIOR
Explain the concept of 'sticky sessions' and why they are considered an ...
Q05SENIOR
What is the difference between caching and read replicas for database sc...
Q01 of 05SENIOR

Your API handles 1,000 requests per second comfortably. You're told to design it to handle 100,000 RPS by next month. Walk me through your scaling strategy from first principle to final architecture.

ANSWER
First, I'd verify the current system's bottleneck by measuring where latency starts to degrade under load. I'd use tools like wrk or k6 to generate load and find the inflection point. Typically, it's the database. I'd start with vertical scaling on the database (bigger instance) if tomorrow it's 10x, but for 100x I'd go horizontal. Prerequisite: make all services stateless (move sessions to Redis, use JWT auth). Then add a load balancer with health checks. For the database, I'd add read replicas (for read-heavy workloads) and implement caching with Redis for hot data. If that's not enough, I'd consider sharding the database. I'd also set up auto-scaling with proper metrics (request latency, not just CPU). The final architecture: stateless API tier behind a load balancer, Redis cache layer, read replicas for reads, and a sharded primary for writes. All with monitoring and gradual rollout.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between scalability and performance in system design?
02
When should I start thinking about scalability in a new project?
03
Can a single database ever be fast enough, or do I always need read replicas?
04
Should I always use a load balancer even for a single server?
05
What is the 'thundering herd' problem in auto-scaling and how do you prevent it?
🔥

That's Fundamentals. Mark it forged?

11 min read · try the examples if you haven't

Previous
System Design Basics
2 / 10 · Fundamentals
Next
CAP Theorem