Scalability Concepts Explained — Vertical, Horizontal and Beyond
Every system works fine with ten users. The uncomfortable truth is that many of the worst production failures aren't caused by bad code — they're caused by systems that were never designed to grow. Twitter's 'Fail Whale' era, Slack's 2021 degradation event, and countless startup horror stories share the same root cause: scalability was an afterthought. Understanding scalability isn't optional for an intermediate engineer — it's the line between a system that survives launch day and one that embarrasses you in front of your entire user base.
The core problem scalability solves is demand unpredictability. Your e-commerce site might handle 500 requests per second on a normal Wednesday. On Black Friday it might need to handle 50,000. If your architecture can only scale by crossing fingers and upgrading to a bigger server, you're one viral tweet away from a very bad day. Scalability concepts give you a vocabulary and a toolkit to reason about growth before it happens — and design systems that bend instead of breaking.
By the end of this article you'll be able to explain the difference between vertical and horizontal scaling and know which to reach for first, understand why stateless design is the foundation everything else is built on, describe how load balancers and caching multiply your throughput without multiplying your bill, and walk into a system design interview and speak confidently about trade-offs — not just definitions.
Vertical vs Horizontal Scaling — Choosing Your Growth Strategy
Vertical scaling (scaling up) means giving your existing machine more power — more CPU cores, more RAM, faster SSDs. It's the simplest option and often the right first move. There's no code change, no architecture rethink, and you can do it in minutes on most cloud providers. The catch? Every machine has a ceiling. At some point AWS doesn't have a bigger instance type, and even if it did, a single machine is a single point of failure.
Horizontal scaling (scaling out) means adding more machines and splitting the work between them. This is how Netflix, Google and every large-scale system you've ever used actually works. It has no hard ceiling — you can keep adding nodes, though coordination overhead means capacity doesn't grow perfectly linearly forever. But it demands that your application tier be stateless, because requests will land on different servers unpredictably.
The practical rule of thumb: scale vertically first until it hurts, then design for horizontal. Premature horizontal scaling adds enormous operational complexity — distributed systems are hard. A startup serving 10,000 users probably doesn't need a Kubernetes cluster; they need a better database index and maybe one more server tier.
The real decision point is around state. If your app stores session data in memory on a single server, horizontal scaling will immediately break user logins. That's why stateless design — covered next — isn't a nice-to-have. It's the prerequisite.
```
// ─────────────────────────────────────────────────────────────
// SCALING DECISION FRAMEWORK
// Run this mental checklist BEFORE choosing a scaling strategy
// ─────────────────────────────────────────────────────────────
function chooseScalingStrategy(currentLoad, projectedLoad, appIsStateless):

    // Step 1: Calculate how much headroom you need
    growthFactor = projectedLoad / currentLoad
    // e.g. Black Friday estimate: 50,000 rps / 500 rps = 100x growth needed

    // Step 2: Check if vertical scaling can close the gap cheaply
    currentInstanceType  = "db.t3.medium"    // 2 vCPU, 4 GB RAM
    upgradedInstanceType = "db.r6g.4xlarge"  // 16 vCPU, 128 GB RAM (~8x capacity)

    if growthFactor <= 8 AND upgradedInstanceType IS available:
        // Vertical scaling is simpler and fast — do it first
        return SCALE_UP(
            targetInstance  = upgradedInstanceType,
            estimatedCost   = "$0.90/hr → $4.80/hr",  // predictable pricing
            operationalRisk = LOW                     // no code changes needed
        )

    // Step 3: Beyond vertical ceiling — must go horizontal
    if growthFactor > 8 OR upgradedInstanceType NOT available:
        if NOT appIsStateless:
            // ⚠️ STOP — horizontal scaling WILL BREAK stateful apps
            // Sessions stored in-memory on Server A won't exist on Server B
            return REFACTOR_FIRST(
                action = "Move sessions to Redis / JWT tokens",
                reason = "Load balancer will route user requests to ANY server"
            )

        // App is stateless — safe to scale out
        return SCALE_OUT(
            addNodes        = ceil(growthFactor / capacityPerNode),
            loadBalancer    = "Round-robin or least-connections",
            operationalRisk = MEDIUM  // distributed systems add failure modes
        )

// ─── EXAMPLE OUTPUT ──────────────────────────────────────────
// Input:  currentLoad=500rps, projectedLoad=50000rps, stateless=false
// Output: REFACTOR_FIRST → Move sessions to Redis THEN scale horizontally
//
// Input:  currentLoad=500rps, projectedLoad=2000rps, stateless=true
// Output: SCALE_UP → Upgrade instance (4x growth, within vertical ceiling)
// ─────────────────────────────────────────────────────────────
```

Traced in full, the two scenarios look like this:

```
Decision: REFACTOR_FIRST
  Action : Move session storage from in-memory to Redis
  Reason : Load balancer distributes requests across all nodes.
           User on Server A will hit Server B next request.
           In-memory session on A does not exist on B → instant logout bug.

Decision: SCALE_UP
  Target : db.r6g.4xlarge (16 vCPU / 128 GB RAM)
  Cost   : $0.90/hr → $4.80/hr
  Risk   : LOW — zero code changes, deploy in ~5 minutes
```
Stateless Design and Load Balancing — The Foundation of Horizontal Scale
Stateless design means each server treats every incoming request as if it's meeting that user for the first time. No local memory of what happened before. All state the request needs — auth tokens, user preferences, cart contents — travels with the request or lives in a shared external store like Redis or a database.
This sounds like a constraint, but it's actually a superpower. When no single server 'owns' a user's session, a load balancer can freely route any request to any available server. You can add servers during a traffic spike, remove them when it passes, and restart individual servers without losing anyone's session. Your system becomes elastic.
A load balancer sits in front of your server pool and distributes incoming requests. The two most common strategies are round-robin (requests cycle through servers sequentially — great for uniform workloads) and least-connections (each new request goes to the server with fewest active connections — better when requests vary in processing time, like an API mixing fast reads and slow report generation).
Health checks are the underrated hero here. A good load balancer pings each server every few seconds. If a server stops responding, traffic is automatically rerouted to healthy nodes — and users never know a server died. This is how large systems achieve high availability without magic.
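The two routing strategies and the health-check filter can be made concrete in a few lines. The sketch below is illustrative Python, not any real load balancer's API — the `Server` and `LoadBalancer` classes are invented for this example:

```python
import itertools

class Server:
    def __init__(self, name: str):
        self.name = name
        self.active_connections = 0
        self.healthy = True  # flipped by periodic health checks

class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self._rr = itertools.count()  # round-robin cursor

    def _healthy(self):
        # Routing only ever sees servers that passed their last health check
        return [s for s in self.servers if s.healthy]

    def round_robin(self) -> Server:
        pool = self._healthy()
        return pool[next(self._rr) % len(pool)]

    def least_connections(self) -> Server:
        # Better when request cost varies: fast reads vs slow report generation
        return min(self._healthy(), key=lambda s: s.active_connections)

lb = LoadBalancer([Server("a"), Server("b"), Server("c")])
lb.servers[0].active_connections = 5  # "a" is mid-way through a slow report
lb.servers[2].healthy = False         # health check failed: "c" gets no traffic

print([lb.round_robin().name for _ in range(4)])  # ['a', 'b', 'a', 'b']
print(lb.least_connections().name)                # b
```

Real load balancers (NGINX, HAProxy, AWS ALB) expose these same strategies as configuration options; the point of the sketch is only that routing consults health status and connection counts — never anything user-specific, which is exactly why statelessness matters.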
```
// ─────────────────────────────────────────────────────────────
// STATELESS REQUEST FLOW WITH LOAD BALANCER
// Demonstrates how a stateless API handles auth across two servers
// ─────────────────────────────────────────────────────────────

// ── SHARED INFRASTRUCTURE (lives outside any single server) ──
redisSessionStore = ExternalRedis(host="redis.internal", port=6379)
jwtSecretKey = EnvironmentVariable("JWT_SECRET")  // same key on ALL servers

// ── SERVER A and SERVER B are identical clones ────────────────
function handleRequest(incomingHttpRequest):
    // Load balancer already decided this request lands here
    // We don't know or care which server handled the previous request
    authHeader = incomingHttpRequest.headers["Authorization"]
    // e.g. "Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9..."

    if authHeader is null:
        return HTTP_401("Missing auth token")

    jwtToken = authHeader.stripPrefix("Bearer ")

    // Verify the JWT using the shared secret — works on ANY server
    // because the secret is the same everywhere and state is IN the token
    decodedPayload = JWT.verify(jwtToken, jwtSecretKey)
    // decodedPayload = { userId: "usr_8821", role: "admin", exp: 1712000000 }

    if decodedPayload.isExpired():
        return HTTP_401("Token expired — please log in again")

    // Fetch user-specific data from the SHARED store (not local memory)
    userCart = redisSessionStore.get(key = "cart:" + decodedPayload.userId)
    // userCart = [{ productId: "prod_44", qty: 2 }, { productId: "prod_91", qty: 1 }]

    // Process the request normally
    orderTotal = calculateTotal(userCart)
    return HTTP_200({ cart: userCart, total: orderTotal })

// ── WHAT MAKES THIS STATELESS ─────────────────────────────────
// 1. No in-memory session map — server restarts lose NOTHING
// 2. JWT carries identity — valid on Server A, B, or C equally
// 3. Cart lives in Redis — Server B reads the same cart as Server A
// 4. Load balancer can route usr_8821's next request ANYWHERE safely

// ── LOAD BALANCER LOGIC (simplified) ─────────────────────────
serverPool = [ServerA(weight=1), ServerB(weight=1), ServerC(weight=1)]

function routeIncomingRequest(request):
    // Least-connections strategy — helps when cart checkout is slow
    targetServer = serverPool.minBy(server => server.activeConnections)
    targetServer.activeConnections += 1
    response = targetServer.handleRequest(request)
    targetServer.activeConnections -= 1
    return response
```
```
ServerB receives request:
  ✓ Auth header found: Bearer eyJhbGci...
  ✓ JWT verified with shared secret
  ✓ Decoded: { userId: 'usr_8821', role: 'admin' }
  ✓ Cart fetched from Redis: 2 items
  ✓ Total calculated: $84.97

HTTP 200 OK
{
  "cart": [
    { "productId": "prod_44", "qty": 2, "price": "$29.99" },
    { "productId": "prod_91", "qty": 1, "price": "$24.99" }
  ],
  "total": "$84.97"
}

-- ServerA was handling a slow checkout, ServerB had 0 active connections
-- User never knew which server responded. That's the point.
```
Caching and Database Scaling — Where Most Performance Is Actually Won
Here's an uncomfortable truth: most scalability problems aren't compute problems — they're database problems. Your web servers are usually fine. It's the database that melts under load because every request hits it, even for data that hasn't changed in hours.
Caching solves this by storing the result of expensive operations in fast, in-memory storage and serving repeat requests from there. A Redis cache lookup takes under 1 millisecond. A PostgreSQL query joining three large tables might take 200ms. If 10,000 users all request the homepage product list within a minute, you want to hit your database once and serve everyone else from cache — not hammer your database 10,000 times.
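The arithmetic in that paragraph is worth doing explicitly. A quick back-of-envelope script, using the illustrative numbers from the text (10,000 identical requests in a minute, ~200ms for the uncached join, ~1ms per cache hit):

```python
# Illustrative numbers from the text above
requests = 10_000
db_query_ms = 200   # three-table join on a warm database
cache_hit_ms = 1    # Redis lookup, rounded up

# Without a cache, every request hits the database
no_cache_ms = requests * db_query_ms

# With cache-aside: one miss fills the cache, the other 9,999 are hits
cache_aside_ms = db_query_ms + (requests - 1) * cache_hit_ms

print(no_cache_ms / 1000)     # 2000.0 seconds of cumulative DB work
print(cache_aside_ms / 1000)  # 10.199 seconds — two orders of magnitude less
```

The ratio (~196x here) is why caching is usually the highest-leverage scalability fix: it attacks the load at its source instead of adding capacity to absorb it.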
The cache hierarchy matters. Browser caches handle static assets (images, CSS). A CDN cache handles geographically-distributed content. An application-level cache like Redis handles dynamic query results. Each layer handles a different class of data.
For databases specifically, you have two main scaling levers: read replicas and sharding. Read replicas copy your primary database to one or more secondary nodes. Reads are distributed across replicas; writes go only to the primary. This works brilliantly when your workload is read-heavy — which most web apps are (roughly 80% reads, 20% writes is common). Sharding partitions the data itself across multiple databases — user IDs 1–1,000,000 on Shard A, 1,000,001–2,000,000 on Shard B. It's powerful but operationally complex. Reach for read replicas first.
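Range-based sharding as just described is conceptually a lookup table from ID range to database host. A minimal Python sketch — the shard boundaries and hostnames here are hypothetical:

```python
# Hypothetical shard map: (low_id, high_id, host) — boundaries are inclusive
SHARDS = [
    (1, 1_000_000, "shard-a.internal"),
    (1_000_001, 2_000_000, "shard-b.internal"),
]

def shard_for_user(user_id: int) -> str:
    """Return the database host that owns this user's rows."""
    for low, high, host in SHARDS:
        if low <= user_id <= high:
            return host
    raise ValueError(f"no shard covers user_id={user_id}")

print(shard_for_user(8_821))      # shard-a.internal
print(shard_for_user(1_500_000))  # shard-b.internal
```

Range sharding keeps adjacent IDs together but can create hot shards (all new users land on the newest range); hash-based sharding spreads writes more evenly at the cost of efficient range queries. Either way, cross-shard joins and resharding are the operational burden that makes read replicas the better first move.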
```
// ─────────────────────────────────────────────────────────────
// CACHE-ASIDE PATTERN (also called Lazy Loading)
// The most common and safest caching strategy for web APIs
// Application controls what gets cached and when
// ─────────────────────────────────────────────────────────────
redisCache    = RedisClient(host="cache.internal")
primaryDB     = PostgresClient(host="db-primary.internal")
readReplicaDB = PostgresClient(host="db-replica.internal")  // read-only copy

// ── CACHE TTL STRATEGY ────────────────────────────────────────
// TTL (Time To Live) = how long cached data stays valid
// Too short → cache provides little benefit, DB still hammered
// Too long  → users see stale data after updates
PRODUCT_CATALOG_TTL  = 300  // 5 minutes — changes rarely
USER_PROFILE_TTL     = 60   // 1 minute — changes occasionally
LIVE_STOCK_COUNT_TTL = 5    // 5 seconds — must be near-realtime

function getProductCatalog(categoryId):
    cacheKey = "catalog:category:" + categoryId
    // e.g. "catalog:category:electronics"

    // ── STEP 1: Check the cache first (fast path) ─────────────
    cachedResult = redisCache.get(cacheKey)
    if cachedResult is NOT null:
        // Cache HIT — served in <1ms, database not touched
        logMetric("cache_hit", key=cacheKey)
        return JSON.parse(cachedResult)

    // ── STEP 2: Cache MISS — go to read replica ───────────────
    // Using read replica, not primary — keeps primary free for writes
    logMetric("cache_miss", key=cacheKey)
    freshProducts = readReplicaDB.query("""
        SELECT p.id, p.name, p.price, p.stock_count, c.name AS category
        FROM products p
        JOIN categories c ON c.id = p.category_id
        WHERE p.category_id = :categoryId AND p.is_active = true
        ORDER BY p.created_at DESC
        LIMIT 50
    """, params={ categoryId: categoryId })
    // This query takes ~180ms on a warm database

    // ── STEP 3: Populate cache for next request ───────────────
    redisCache.setWithExpiry(
        key     = cacheKey,
        value   = JSON.stringify(freshProducts),
        ttlSecs = PRODUCT_CATALOG_TTL
    )
    return freshProducts

// ── CACHE INVALIDATION ON UPDATE ─────────────────────────────
// When a product changes, the cache for its category MUST be cleared
function updateProductPrice(productId, newPrice, categoryId):
    // Write always goes to PRIMARY database
    primaryDB.execute("""
        UPDATE products
        SET price = :newPrice, updated_at = NOW()
        WHERE id = :productId
    """, params={ productId, newPrice })

    // Invalidate the cached category so next read gets fresh data
    staleCacheKey = "catalog:category:" + categoryId
    redisCache.delete(staleCacheKey)
    // Next call to getProductCatalog() will be a cache miss → DB fetch → repopulate

    logEvent("price_updated", productId=productId, cacheInvalidated=staleCacheKey)
```
```
Request 1 — GET /products?category=electronics
  cache_miss: catalog:category:electronics
  → Query read replica: 183ms
  → Cache populated with TTL=300s
  Total response time: 187ms

Request 2–9,999 — GET /products?category=electronics (within 5 min)
  cache_hit: catalog:category:electronics
  → Served from Redis: 0.8ms
  Total response time: 4ms
  Database not touched.

Admin updates product_44 price to $34.99:
  → Write to primary DB: 12ms
  → Cache key 'catalog:category:electronics' deleted

Request 10,000 — GET /products?category=electronics
  cache_miss: catalog:category:electronics (cache was invalidated)
  → Query read replica: 181ms  ← fresh data with new price
  → Cache repopulated
  Total response time: 185ms
```
| Aspect | Vertical Scaling (Scale Up) | Horizontal Scaling (Scale Out) |
|---|---|---|
| Mechanism | Bigger CPU/RAM on one machine | More machines, split the load |
| Complexity | Low — no code changes needed | High — stateless design required |
| Cost curve | Superlinear — bigger machines cost disproportionately more | Linear — each node costs roughly the same |
| Ceiling | Hard limit — biggest instance type available | Effectively unlimited |
| Single point of failure | Yes — one machine going down = full outage | No — other nodes absorb dead node's traffic |
| Best for | Databases, early-stage apps, quick wins | Web/API tiers, microservices, large-scale systems |
| Time to implement | Minutes (resize instance) | Days to weeks (architecture refactor) |
| Failure mode | Downtime during resize window | Partial degradation — system degrades gracefully |
🎯 Key Takeaways
- Scale vertically first — it's faster, simpler, and often enough. Move to horizontal only when you hit the machine's ceiling or need fault tolerance, not by default.
- Stateless design isn't a feature — it's a prerequisite. If a server restart loses user data, you can't safely scale horizontally. Externalize all state to Redis, a database, or tokens before adding nodes.
- Most scalability bottlenecks live in the database, not the web tier. Read replicas for read-heavy workloads and aggressive caching will outperform adding more web servers if the DB is the real constraint.
- Cache invalidation is the hard part of caching — always design your write path to delete or update cache keys, not just rely on TTL expiry. A short TTL without invalidation still serves stale data between writes.
⚠ Common Mistakes to Avoid
- ✕ Mistake 1: Scaling horizontally without making the app stateless first — Symptom: users randomly get logged out, lose shopping carts, or see inconsistent data as requests land on different servers — Fix: audit your app for any in-process memory used to store user state (session maps, local caches keyed by user), move them to Redis or encode them in JWT tokens, THEN add nodes behind a load balancer.
- ✕ Mistake 2: Setting cache TTL too high on data that changes on writes — Symptom: users see stale prices, outdated stock counts, or deleted items still appearing for minutes after an admin update — Fix: pair every write operation with an explicit cache.delete(key) call for affected cache entries. Use TTL as a safety net for missed invalidations, not as your only freshness strategy.
- ✕ Mistake 3: Sending ALL database traffic to the primary even after adding read replicas — Symptom: read replicas sit idle while the primary is overwhelmed and becomes the bottleneck — Fix: explicitly route SELECT queries to a read-replica connection pool in your ORM or data access layer. In most ORMs this is a one-line config change; the hard part is making it a conscious habit for every query you write.
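The fix for Mistake 3, read/write splitting, can be sketched as a thin routing layer. Everything below is invented for illustration (`FakeDB` is a stand-in for a real connection; real ORMs expose this as router hooks or a replica connection setting):

```python
class FakeDB:
    """Stand-in for a real connection; returns which node ran the query."""
    def __init__(self, name):
        self.name = name

    def run(self, sql, params=None):
        return self.name

class RoutingConnection:
    """Route SELECTs to read replicas, everything else to the primary."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self._next = 0  # round-robin cursor over replicas

    def execute(self, sql, params=None):
        if sql.lstrip().upper().startswith("SELECT"):
            replica = self.replicas[self._next % len(self.replicas)]
            self._next += 1
            return replica.run(sql, params)
        # Writes (INSERT/UPDATE/DELETE) always hit the primary
        return self.primary.run(sql, params)

conn = RoutingConnection(
    primary=FakeDB("db-primary"),
    replicas=[FakeDB("db-replica-1"), FakeDB("db-replica-2")],
)
print(conn.execute("SELECT * FROM products"))            # db-replica-1
print(conn.execute("SELECT * FROM users"))               # db-replica-2
print(conn.execute("UPDATE products SET price = 9.99"))  # db-primary
```

One real caveat this sketch ignores: replication lag means a SELECT issued immediately after a write may not see that write on a replica, so read-your-own-writes flows sometimes need to be pinned to the primary.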
Interview Questions on This Topic
- Q: Your API handles 1,000 requests per second comfortably. You're told to design it to handle 100,000 RPS by next month. Walk me through your scaling strategy from first principle to final architecture.
- Q: What's the difference between horizontal and vertical scaling, and under what specific circumstances would you choose one over the other? What makes an application 'horizontally scalable'?
- Q: You've added Redis caching and your cache hit rate is 95% — but users are still occasionally seeing stale product prices seconds after an admin updates them. What's the likely cause and how would you fix it without dropping your hit rate significantly?
Frequently Asked Questions
What is the difference between scalability and performance in system design?
Performance is about how fast your system responds to a single request — latency and throughput at a given load. Scalability is about whether that performance holds up as load increases. A system can be fast for 100 users and completely collapse at 10,000. Good scalability means your response times degrade gracefully (or not at all) as demand grows, which requires a fundamentally different design mindset than just optimizing individual queries.
When should I start thinking about scalability in a new project?
You should think about it at the architecture level from day one — but implement only what you need right now. Specifically: design your app to be stateless (it costs almost nothing and keeps options open), but don't build a distributed caching layer or sharded database until you have evidence you need it. The rule of thumb is: make stateless design a habit always, defer complex infrastructure until a real bottleneck forces your hand.
Can a single database ever be fast enough, or do I always need read replicas?
A well-indexed single database can handle tens of thousands of queries per second — it's genuinely surprising how far a properly tuned Postgres instance can go. Most teams reach for read replicas prematurely when the real issue is a missing index or an N+1 query pattern. Profile first: add EXPLAIN ANALYZE to your slowest queries, fix indexes, then consider replicas if the primary is still saturated. Read replicas add replication lag complexity; only add that complexity when the numbers justify it.
Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.