Mid-level 10 min · March 06, 2026

System Design Interview - Cache Stampede in Production

Q: What is the single most important part of a System Design interview?

Requirement clarification. If you design a system for 10 users that needs to support 10 million, you've failed the round before it started. Always ask about the scale, user behavior, and expected latency first.

Q: How do I handle the 'I don't know' moment in a design round?

Be honest but analytical. Say: 'I haven't used Tool X specifically in production, but based on its documentation, I expect it handles Y by doing Z. Alternatively, we could use Tool A which I am more familiar with.' It shows you can reason through unknowns.

Q: Should I choose SQL or NoSQL in an interview?

There is no default answer. SQL is better for relational integrity and financial transactions (ACID). NoSQL is superior for high-availability, low-latency, and rapidly evolving schemas. State the trade-off and let the interviewer's constraints guide your choice.

Q: How deep should I go into implementation details?

Go deep on the components that are most critical to the system's performance and reliability. For example, if caching is crucial (read-heavy system), explain cache eviction policies, write-through vs. lazy loading, and how to handle cache stampede. Skip the CRUD endpoints unless they directly affect scaling. The interviewer wants to see you can prioritise.

Q: How do I estimate QPS / storage without a calculator?

Use round numbers: 1 day = 100,000 seconds (roughly). So if you have 10 million requests per day, average QPS ≈ 100. Peak QPS is around 2x-3x average. For storage: estimate object size (e.g., 200 bytes per tweet) times number of objects per day times retention period. Always add 20% for indexes and metadata. Memorise: 1TB = 1e12 bytes. These estimates are not exact but show you understand magnitude.

Q: How do you handle hot keys in a distributed system?

Hot keys (e.g., a celebrity's profile) can overload a single shard. Use consistent hashing with virtual nodes to spread load more evenly. For known hot keys, replicate them to multiple nodes and route reads to any replica. For write-heavy hot keys, you may need to split the key into several sub-keys and aggregate on read, which adds complexity. Cache aggressively for read-heavy hot keys. Monitor per-shard metrics to detect skew.

Q: What's the difference between vertical and horizontal scaling for databases?

Vertical scaling adds CPU/RAM to an existing node. It's simple but has a hard cap and leaves you with a single point of failure. Horizontal scaling adds more nodes and requires data partitioning (sharding). It raises the ceiling far higher and improves fault tolerance, but adds complexity in queries and transactions. For databases, start with vertical scaling for the primary, then add read replicas horizontally for read scaling. Shard only when the write load exceeds a single node's capacity.

In production, a cache stampede during peak hours caused 100% database CPU and >30s P95 latency.

Naren Founder & Principal Engineer

20+ years shipping production code across the stack, with years spent interviewing engineers. Written from production experience, not tutorials.

Production tested · Last updated May 23, 2026


⚡Quick Answer

System design interviews test your ability to handle ambiguity and make trade-offs under pressure
Core skill: requirement clarification defines the scope before drawing any boxes
Key components: functional vs non-functional requirements, high-level design, deep-dive bottlenecks, wrap-up with failure modes
Performance insight: a 10x scale miscalculation (e.g., QPS off by a factor of 10) can make your entire design irrelevant, so always verify your numbers
Production insight: the same mistake shows up in real systems — teams build for current load, then wonder why it crumbles at 10x
Biggest mistake: jumping straight to a solution (Kafka! NoSQL!) without asking 'What problem are we solving?'

Definition (~90s read)

What is a System Design Interview?

A system design interview guide is a structured approach to solving open-ended architectural problems under time pressure, typically within 45–60 minutes. It exists because FAANG and tier-2 tech companies use these interviews to evaluate your ability to decompose vague requirements into scalable, maintainable systems—not to test rote memorization of database internals.

The guide provides a repeatable 4-step framework (requirements → high-level design → deep dive into bottlenecks → wrap-up) that keeps you from jumping into premature implementation, a common failure mode even for senior engineers in these sessions. It also covers critical production realities like cache stampede, where thousands of concurrent requests miss a cold or just-expired cache entry at once and overwhelm your database, and it forces you to articulate trade-offs (e.g., SQL vs NoSQL, sync vs async) with concrete numbers from back-of-envelope calculations.

Without this guide, candidates often waste time on irrelevant details or propose architectures that work in theory but fail under real-world load patterns like thundering herds or partial failures.

Plain-English First

Imagine you're asked to design a city from scratch. You don't start by choosing the color of doorknobs — you start with roads, power grids, and water pipes. System design interviews work exactly the same way: interviewers want to see that you can think big, make smart trade-offs, and build something that won't collapse under pressure. It's less about memorizing answers and more about showing you can be the architect, not just the bricklayer.

Every senior engineering role at a top tech company has one brutal filter: the system design round. It's the interview that makes experienced developers freeze up, not because they lack knowledge, but because the question is deliberately open-ended. 'Design Twitter.' 'Design a URL shortener.' 'Design Netflix.' The candidate who answers these well isn't the one who memorized the most blog posts — it's the one who can think out loud, reason through trade-offs, and communicate at the level of a staff engineer.

The core problem this interview type solves — from the interviewer's perspective — is figuring out how you'll behave when given an ambiguous, high-stakes technical problem with no single right answer. At scale, every architectural decision has cascading consequences. Choosing the wrong database engine, ignoring read/write ratios, or failing to think about failure modes can mean millions in lost revenue or a 3am outage. The design interview is a compressed simulation of exactly that situation.

By the end of this guide, you'll have a repeatable framework you can apply to any system design prompt, understand the specific trade-offs interviewers are listening for (and the buzzwords that actually hurt you), know how to handle the moments where you genuinely don't know the answer, and walk away with a mental model that works in real production systems — not just whiteboards.

Why a System Design Interview Guide Exists

A system design interview guide is a structured framework for evaluating how you decompose a vague requirement into a working distributed system. The core mechanic is trade-off analysis: you must justify every choice—database, cache, consistency model—against latency, throughput, durability, and cost. It's not about memorizing solutions; it's about demonstrating a repeatable process for making decisions under uncertainty.

In practice, the interviewer cares about your ability to identify bottlenecks before they happen. You start with a single server, then scale horizontally, adding a load balancer, caching layer, and database sharding. Key properties like read-to-write ratio, data size (e.g., 10 TB), and latency SLAs (e.g., p99 < 200 ms) drive every decision. You must also reason about failure modes: what happens when the cache cluster loses a node, or the database replica falls behind.

You use this guide when preparing for roles at companies where system complexity is the norm—FAANG, high-growth startups, or any team operating at scale. It matters because a naive design (e.g., single Redis for all reads) can cause a cache stampede that takes down production. Mastering this guide means you can design systems that survive real traffic spikes without waking up the on-call engineer.

Not a Memorization Test

Interviewers can spot a rehearsed answer instantly. They want to see you reason through trade-offs, not recite a textbook solution.

Production Insight

A flash sale on an e-commerce site caused 10x traffic to a single Redis node, triggering a cache stampede that cascaded to the database, which fell over under 50k QPS.

Symptom: p99 latency jumped from 50 ms to 15 seconds, then the database connection pool exhausted, causing a full site outage.

Rule of thumb: always add a local cache (e.g., Caffeine) in front of Redis and use a rate limiter on cache rebuilds to cap concurrent DB queries.
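The rule of thumb above can be sketched in a few lines. This is a minimal illustration, not a production cache: an in-process dictionary stands in for the cache tier, and the names `StampedeSafeCache`, `slow_db_query`, and the `jitter` parameter are assumptions made for the example. It shows two ideas: only one caller per key rebuilds a missing entry, and TTLs are randomised so that keys do not all expire at the same moment.

```python
import random
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class StampedeSafeCache:
    """In-process stand-in for a cache tier; one rebuild per key at a time."""

    def __init__(self, ttl_seconds: float = 60.0, jitter: float = 0.1):
        self.ttl_seconds = ttl_seconds
        self.jitter = jitter  # fraction of the TTL that is randomised
        self._data: Dict[str, Tuple[Any, float]] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _fresh(self, key: str) -> Optional[Tuple[Any, float]]:
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        # Cache miss: only one thread per key runs the loader, the rest wait here.
        with self._lock_for(key):
            entry = self._fresh(key)  # re-check: another thread may have rebuilt it
            if entry is not None:
                return entry[0]
            value = loader(key)
            ttl = self.ttl_seconds * (1 + random.uniform(-self.jitter, self.jitter))
            self._data[key] = (value, time.monotonic() + ttl)
            return value


calls = {"n": 0}
calls_lock = threading.Lock()


def slow_db_query(key: str) -> str:
    """Hypothetical database read that counts how often it is invoked."""
    with calls_lock:
        calls["n"] += 1
    time.sleep(0.05)  # simulate a slow query
    return f"value-for-{key}"


cache = StampedeSafeCache(ttl_seconds=30)
threads = [
    threading.Thread(target=cache.get, args=("product:42", slow_db_query))
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"database queries for 50 concurrent readers: {calls['n']}")
```

With a shared cache such as Redis, the per-key lock would have to be a distributed lock or a request-coalescing layer instead of a `threading.Lock`; the control flow stays the same.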

Key Takeaway

A system design interview tests your trade-off reasoning, not your solution recall.

Always quantify: latency, throughput, data size, and failure modes drive every decision.

Production systems fail from cache stampedes and connection pool exhaustion—design for those first.

The 4-Step Framework for Architectural Clarity

A System Design Interview isn't a coding test; it's a conversation about trade-offs. To avoid the 'blank whiteboard' syndrome, you need a reliable framework. We recommend the following: 1. Understand Requirements (Functional & Non-Functional), 2. High-Level Design (The 'Boxes and Arrows'), 3. Deep Dive into Bottlenecks (Database Sharding, Caching), and 4. Wrap-up (Identifying SPOFs and Scaling).

Rather than starting with a dry definition, let's look at a practical example of a building block that comes up in many designs: generating unique IDs in a distributed environment.

Here's the thing most candidates miss: the framework is just a container. What matters is how you move through it. Don't treat it as a rigid checklist — treat it as a conversation. If the interviewer asks a deep question about caching mid-way through step 2, follow that thread. The framework keeps you from getting lost, not from getting interesting.

io/thecodeforge/design/DistributedIdGenerator.java (Java)

package io.thecodeforge.design;

import java.util.concurrent.atomic.AtomicLong;

/**
 * A production-grade concept for Unique ID Generation in a distributed system.
 * In a real interview, you'd discuss Snowflake IDs or UUIDs to avoid DB bottlenecks.
 */
public class DistributedIdGenerator {
    private final long datacenterId;
    private final long workerId;
    private final AtomicLong sequence = new AtomicLong(0L);

    public DistributedIdGenerator(long datacenterId, long workerId) {
        this.datacenterId = datacenterId;
        this.workerId = workerId;
    }

    public synchronized String generateId() {
        long timestamp = System.currentTimeMillis();
        // In a real interview, explain how bit-shifting ensures sortability and uniqueness
        return String.format("%d-%d-%d-%d", timestamp, datacenterId, workerId, sequence.getAndIncrement());
    }

    public static void main(String[] args) {
        DistributedIdGenerator generator = new DistributedIdGenerator(1, 42);
        System.out.println("Generated Unique ID: " + generator.generateId());
    }
}

Output

Generated Unique ID: 1742031000000-1-42-0
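The listing above mentions bit-shifting without showing it. The following is a hedged sketch of a Snowflake-style layout (41-bit timestamp, 5-bit datacenter ID, 5-bit worker ID, 12-bit sequence). The class name, the custom epoch, and the worker ID of 7 are assumptions for illustration; note that a 5-bit worker field only allows IDs 0 to 31. Because the timestamp occupies the highest bits, IDs from one generator sort by creation time, and the datacenter and worker bits keep IDs from different machines distinct.

```python
import threading
import time

EPOCH_MS = 1_700_000_000_000  # hypothetical custom epoch, chosen for this example
DATACENTER_BITS, WORKER_BITS, SEQUENCE_BITS = 5, 5, 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeIdGenerator:
    def __init__(self, datacenter_id: int, worker_id: int):
        if not 0 <= datacenter_id < (1 << DATACENTER_BITS):
            raise ValueError("datacenter_id out of range")
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000)
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (DATACENTER_BITS + WORKER_BITS + SEQUENCE_BITS))
                | (self.datacenter_id << (WORKER_BITS + SEQUENCE_BITS))
                | (self.worker_id << SEQUENCE_BITS)
                | self._sequence
            )


gen = SnowflakeIdGenerator(datacenter_id=1, worker_id=7)
ids = [gen.next_id() for _ in range(5000)]
print(ids[0], ids[-1])
print("sorted by creation time:", ids == sorted(ids))
```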

Forge Tip: Clarify Constraints Early

Before drawing a single box, ask: 'Is this system read-heavy or write-heavy?' and 'What is the Daily Active User (DAU) count?' A system for 100 users is a project; a system for 100 million is an architecture.

Production Insight

The most common failure in early-stage startups mirrors the interview: they skip requirement gathering and over-engineer for scale that never comes.

If you don't ask about the read/write ratio, you may design a caching layer for a write-heavy system, wasting engineering time on a cache whose hit ratio will stay low.

Rule: always anchor your design to actual constraints, not hypothetical scale. I've debugged a production outage where a team blindly added Redis to a write-heavy job queue — cache hit ratio was 0.3%. The latency actually increased because of the extra hop.

Key Takeaway

Framework first, solution second.

The four steps prevent you from drawing before you know the canvas size.

Remember: requirements are the most load-bearing part. Get them wrong and nothing else matters.

Infrastructure as Code: Deploying the Architecture

Interviewers love when you can bridge the gap between a whiteboard drawing and actual deployment. Understanding how to containerize and scale your components is key to demonstrating seniority.

But don't just name tools. Explain why you'd choose Docker over a VM, or Kubernetes over a single server. The 'how' is easy — the 'why' is what separates staff engineers.

The dirty secret: most senior engineers have been burned by overcomplicated infrastructure. If you're interviewing, showing you understand the pain of maintaining a 15-microservice nightmare is more impressive than listing every AWS service. Say 'I'd start with a monolith until I have a concrete bottleneck.' That's the answer that gets 'hire.'

io/thecodeforge/deploy/ArchitectureStack.Dockerfile (Docker)

# TheCodeForge - Scaling the API Layer
FROM eclipse-temurin:17-jdk-alpine

# Best practice: don't run the application as root in production
RUN addgroup -S forgegroup && adduser -S forgeuser -G forgegroup
USER forgeuser

WORKDIR /app
COPY target/system-design-app.jar app.jar

# Expose the service port
EXPOSE 8080

# Standard entrypoint for Spring Boot apps
ENTRYPOINT ["java", "-Xmx2g", "-jar", "app.jar"]

Output

Successfully built and tagged thecodeforge/api-layer:latest

Avoid the 'Buzzword Bingo' Trap

Don't just say 'We'll use Kafka.' Say: 'We'll use a message queue like Kafka to decouple the user-facing API from the heavy image-processing worker, ensuring high availability even if the worker service is down.'

Production Insight

I've seen teams adopt Kubernetes just because it's trendy — then spend months managing cluster upgrades while their monolithic PHP app runs fine on a single server.

The rule: choose the minimal infrastructure that meets your requirements. Add complexity only when you have a concrete bottleneck.

If you're interviewing, showing you understand this trade-off is more impressive than listing every AWS service. I remember a candidate who said 'I'd actually use a single EC2 instance with autoscaling and skip k8s until we hit 10 services.' That candidate got the offer.

Key Takeaway

Infrastructure choices are trade-offs, not checklists.

A Dockerfile is easy; knowing when to avoid containers is hard.

In an interview, say: 'I'd start with a simple architecture and scale only when the data demands it.'

Back-of-Envelope Estimations: The Numbers That Validate Your Design

In a system design interview, you can talk theory all day, but the moment you put numbers on the whiteboard, you demonstrate real-world experience. Interviewers want to see that you can estimate QPS, storage, bandwidth, and cache size with reasonable accuracy.

Don't aim for precision; aim for orders of magnitude. Being off by a factor of 10 is acceptable if you catch it and adjust. Here's the cheat sheet: 1 million requests/day ≈ 12 QPS. 1 TB = 1,000 GB. A single MySQL primary handles roughly 1k writes/second. A single Redis node handles roughly 100k reads/second.

Here's a real trick: always mention 'I'd add 30% headroom for traffic spikes.' It shows you've dealt with production surprises. If you forget the replication factor (3x) and the retention period (say 18 months), your storage estimate will be off by 3x or more, and your architecture will be wrong.

io/thecodeforge/design/EstimationHelper.py (Python)

from typing import Tuple

def estimate_qps(daily_active_users: int, requests_per_user: int, peak_multiplier: float = 2.0) -> Tuple[float, float]:
    """Estimate average and peak QPS for a system."""
    daily_requests = daily_active_users * requests_per_user
    avg_qps = daily_requests / 86400
    peak_qps = avg_qps * peak_multiplier
    return round(avg_qps, 2), round(peak_qps, 2)

# Example for Twitter-like system: 100M DAU, each user fetches timeline 50 times
avg, peak = estimate_qps(100_000_000, 50)
print(f"Average QPS: {avg}, Peak QPS: {peak}")
# Output: Average QPS: 57870.37, Peak QPS: 115740.74

Output

Average QPS: 57870.37, Peak QPS: 115740.74

The Capacity Triangle: QPS × Storage × Latency

QPS is the surface area of your system. High QPS demands caching, partitioning, and asynchronous processing.
Storage grows unbounded. Estimate 3x growth for data, indexes, and backups.
Latency is the sharpest constraint. Every network hop adds ~1ms locally, ~50ms cross-region.

Production Insight

At a payment company, we estimated storage for transaction logs as 200 GB/month. We forgot to account for replication factor (3x) and retention (18 months). Six months later the replicated footprint was roughly 3.6 TB (200 GB × 6 months × 3 replicas), three times the raw figure, and we exceeded budget by $50k.

Always include replication, retention, and indexing overhead in your estimates.

If you interview for a storage-heavy system, say: 'I'd multiply my raw estimate by 4-5x for production overhead.' That phrase alone signals you've been burned before.
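To make the overhead concrete, here is a small sketch of the arithmetic, using the 200 GB/month, 3x replication, and 18-month retention figures from the incident above. The function name and the 20% index overhead default are assumptions for illustration; real overhead depends on the storage engine and backup policy.

```python
def estimate_storage_gb(
    raw_gb_per_month: float,
    retention_months: int,
    replication_factor: int = 3,
    index_overhead: float = 0.2,
) -> float:
    """Steady-state footprint once the retention window is full."""
    raw = raw_gb_per_month * retention_months
    return raw * (1 + index_overhead) * replication_factor


naive = 200 * 18  # raw data only, no replication or indexes
realistic = estimate_storage_gb(200, 18)
print(f"naive: {naive:,.0f} GB, with overhead: {realistic:,.0f} GB")
print(f"multiplier: {realistic / naive:.1f}x")
```

Under these assumptions the multiplier is about 3.6x before backups, which is why quoting a 4-5x factor in an interview is a reasonable rule of thumb.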

Key Takeaway

Estimations are not about precision — they're about keeping your design realistic.

A single order-of-magnitude error can render your entire architecture useless.

Memorise the 'power of 10' numbers: single DB ~1k writes/s, Redis ~100k reads/s, Kafka ~1 million events/s.

Trade-Off Decision Tree: SQL vs NoSQL, Cache vs DB, Sync vs Async

Every system design interview forces you to make choices. The ability to articulate the trade-offs is what separates a good answer from a great one. Let's create a mental decision tree:

If your system requires strong consistency (e.g., banking), go SQL with a single writer and read replicas. Accept write latency.
If you need high availability with eventual consistency for a global feed, choose NoSQL (Cassandra, DynamoDB).
If you have a read-heavy workload (90% reads), add a cache layer (Redis) and consider CDN for static assets.
If your workload is write-heavy (50%+ writes), you need a commit log (Kafka) and a write-optimised store (Cassandra).

The decision tree is not about picking the right answer — it's about showing you understand the constraints.

Don't forget to mention the hard part: consistency trade-offs. If you pick eventual consistency, say 'I accept that a user might see stale data for a few seconds — that's fine for this use case.' If you pick strong consistency, say 'I'm trading availability for correctness — I'll need to handle higher write latency and potential downtime during partitions.' That's the level of depth that gets 'strong hire.'

The Consistency vs Availability Mistake

When an interviewer asks 'Do you need strong consistency?' resist the urge to say 'Yes, always.' For a social media feed, eventual consistency is fine. For a payment system, it's not. Know your CAP theorem.

Production Insight

The biggest production incident I've seen was a team that chose MongoDB for a financial system because 'it scales better.' They hit a node failure and lost 15 minutes of transactions — no rollback possible.

The lesson: consistency guarantees are not optional — they are fundamental to your data model.

In interviews, if you pick eventual consistency, you'd better have a reason that the system can tolerate temporary inconsistency (e.g., view count vs payment). And always mention you'd use quorum reads for critical data.
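The quorum idea can be checked with one inequality. This sketch assumes Dynamo-style replication with N replicas, W write acknowledgements, and R read responses; the function name is hypothetical. When R + W > N, every read quorum overlaps every write quorum in at least one replica, so a read contacts at least one node that holds the latest acknowledged write.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


# N = 3 replicas
print(quorums_overlap(3, w=2, r=2))  # quorum writes and reads
print(quorums_overlap(3, w=1, r=1))  # fast, but reads may be stale
print(quorums_overlap(3, w=3, r=1))  # slow writes, cheap reads
```

The trade-off to say out loud: raising W or R buys consistency at the cost of latency and availability, because more replicas must respond before the operation succeeds.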

Key Takeaway

Every database choice is a compromise.

NoSQL stores typically relax consistency or query flexibility to scale out. SQL databases keep strong guarantees but are harder to scale for writes.

The right choice depends on your write-to-read ratio, consistency requirements, and growth trajectory.

Data Store Decision Tree

If: Strong consistency, ACID transactions required
→ Use: A relational database (PostgreSQL/MySQL). Plan for vertical scaling or read replicas first. Shard only when necessary.

If: High horizontal scalability, less strict consistency (eventual)
→ Use: A distributed NoSQL store (Cassandra, DynamoDB). Understand quorum-based reads/writes for tuning consistency.

If: High throughput writes with eventual consistency
→ Use: An append-only log (Kafka) + a message queue + a consumer that writes to NoSQL. This decouples producers from the database write bottleneck.

If: Variable read patterns, need low latency (P95 < 10ms)
→ Use: A distributed cache (Redis) with a write-through or lazy-loading strategy. Handle cache stampede with locking and jitter.

Failure Modes: Designing for When Things Go Wrong

The most overlooked part of system design interviews is discussing failure modes. Too many candidates describe a perfect system where everything works. The real world has partial failures — a network partition, a node crash, a buggy deployment. A senior engineer designs for these.

You should discuss:

Single point of failure (SPOF): every load balancer, database master, and queue broker is a candidate. Make everything redundant.
Cascading failures: when one component fails and overloads the next. Example: a cache node goes down → all traffic hits DB → DB goes down → app fails.
Recoverability: how do you return to normal after a failure? Blue-green deployments, circuit breakers, and graceful degradation are key concepts.

Don't just list them; explain the specific scenario. Say 'If the cache cluster goes down, I'd have a circuit breaker that falls back to the database, with a cap on concurrent database connections to prevent overload. The users would see slightly slower responses, but no outage.' That's a senior answer.

io/thecodeforge/design/circuit_breaker.py (Python)

import time
from typing import Any, Callable

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_timeout: float = 60.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "OPEN":
            # After the cool-down, let one trial request through.
            if time.monotonic() - self.last_failure_time > self.reset_timeout:
                self.state = "HALF_OPEN"
            else:
                raise RuntimeError("Circuit breaker is OPEN. Request blocked.")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            # A failed trial call, or too many consecutive failures, opens the circuit.
            if self.state == "HALF_OPEN" or self.failure_count >= self.max_failures:
                self.state = "OPEN"
            raise
        # Any success closes the circuit and resets the consecutive-failure count.
        self.state = "CLOSED"
        self.failure_count = 0
        return result

# Usage: get_timeline is a placeholder for a real downstream call
def get_timeline(user_id: str) -> list:
    return [f"post for {user_id}"]

cb = CircuitBreaker(max_failures=3, reset_timeout=30)
timeline = cb.call(get_timeline, "user123")

Output

If the API fails 3 times in a row, subsequent calls are blocked for 30 seconds, preventing cascading overload.

Don't Assume 100% Uptime — Describe Graceful Degradation

When a system is partially down, what should the user see? A search service should return recent results from cache if the index is unreachable. A video streaming service should lower quality. Always describe the degraded user experience.
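As a rough illustration of the search example, the sketch below serves the last known results when the index call fails and flags the response as degraded so the UI can say so. `DegradableSearch` and `index_lookup` are hypothetical names, and the in-memory dictionary stands in for whatever cache the service already has.

```python
from typing import Callable, Dict, List, Tuple


class DegradableSearch:
    def __init__(self, search_index: Callable[[str], List[str]]):
        self._search_index = search_index
        self._last_good: Dict[str, List[str]] = {}

    def search(self, query: str) -> Tuple[List[str], bool]:
        """Return (results, degraded)."""
        try:
            results = self._search_index(query)
        except Exception:
            # Index unreachable: serve the last known (possibly stale) results.
            return self._last_good.get(query, []), True
        self._last_good[query] = results
        return results, False


state = {"up": True}


def index_lookup(query: str) -> List[str]:
    """Hypothetical index call that fails while state['up'] is False."""
    if not state["up"]:
        raise ConnectionError("index unreachable")
    return [f"{query}-result-1", f"{query}-result-2"]


search = DegradableSearch(index_lookup)
fresh = search.search("redis")
state["up"] = False  # simulate the index going down
stale = search.search("redis")
unseen = search.search("kafka")
print(fresh, stale, unseen, sep="\n")
```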

Production Insight

A major social network went down for 12 hours because a load balancer's health check started failing due to a race condition in their deployment script. The fix: add a timeout and a secondary health check.

In your interview, if you mention 'I'd add health checks and circuit breakers', you've already outdone 90% of candidates.

The rule: always allocate 2 minutes in your wrap-up to discuss failure modes — it shows operational maturity.

Key Takeaway

Design for failure, not success.

A system that works perfectly under ideal conditions is a toy.

In production, partial failures are the norm — plan for them.

The Wrap-Up: Single Points of Failure and Scaling Plans

The final step of your interview answer should be a structured wrap-up. Summarise what you've designed, then explicitly call out:

1. Single points of failure: 'Our load balancer is a SPOF. I'd make it redundant with active-passive and a floating IP.'
2. Scaling plan: 'Currently this handles 1M DAU. To reach 100M DAU, I'd shard the database by user ID, add a Redis cache for timelines, and introduce Kafka for async processing of tweets.'
3. Future improvements: things you'd add if time allowed (like telemetry, rate limiting, etc.)

This structured wrap-up leaves a strong final impression — it tells the interviewer you think holistically.

Pro tip: end with an open-ended question. 'Is there any requirement I missed that would change this design?' That shows you're not attached to your answer — you're collaborating.

End with a Cliffhanger

After your wrap-up, pause and ask: 'Are there any constraints or requirements you'd like me to reconsider?' This shows humility and a willingness to iterate.

Production Insight

The best interview answers I've seen always close the loop. The candidate didn't just draw boxes — they said 'Here's what could kill this design at 10x scale, and here's how I'd fix it.'

That confidence comes from experience, not memorisation.

Practice this wrap-up until it becomes second nature. I've seen candidates lose an offer because they trailed off at the end. The wrap-up is your final impression — make it count.

Key Takeaway

A strong wrap-up solidifies your architecture.

Summarising SPOFs and scaling plans shows you know the design's weak points.

The final question 'Any constraints?' turns the interview into a collaborative conversation.

Data Partitioning and Sharding Strategies: Consistent Hashing, Range Partitioning, and Rebalancing

When you outgrow a single database instance, partitioning becomes inevitable. The two most common strategies are range partitioning and consistent hashing. Range partitioning splits data by key ranges (e.g., user ID 1-1000 on shard A, 1001-2000 on shard B). It's simple and supports range queries, but hot keys can overload a single shard. Consistent hashing distributes data across a ring using a hash function. It minimizes data movement when nodes join or leave, but range queries become expensive because data is scattered.

In interviews, you need to choose based on query patterns. If you frequently query by user ID range, range partitioning is natural. If you need uniform load and dynamic scaling, consistent hashing wins. Many production systems use a hybrid: consistent hashing with virtual nodes (several ring positions per physical node, called `replicas` in the code below) to spread load evenly, and secondary indexes for range queries.

Always discuss rebalancing: adding new nodes in a range-partitioned system requires splitting ranges and migrating data. Consistent hashing only moves the keys in the ring segments that the new node takes over. Cassandra, for example, assigns each node many virtual nodes (token ranges) and streams the affected ranges to a node when it joins.

Here's the real-world truth: rebalancing is where most designs break. Say 'I'd use consistent hashing with 150 virtual nodes per physical node to distribute load evenly.' That level of detail signals you've done this before.

io/thecodeforge/design/ConsistentHashRing.java (Java)

package io.thecodeforge.design;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class ConsistentHashRing<T> {
    private final TreeMap<Long, T> ring = new TreeMap<>();
    private final MessageDigest md;
    private final int replicas;

    public ConsistentHashRing(int replicas) throws NoSuchAlgorithmException {
        this.md = MessageDigest.getInstance("MD5");
        this.replicas = replicas;
    }

    public void addNode(T node) {
        for (int i = 0; i < replicas; i++) {
            long hash = hash(node.toString() + i);
            ring.put(hash, node);
        }
    }

    public void removeNode(T node) {
        for (int i = 0; i < replicas; i++) {
            long hash = hash(node.toString() + i);
            ring.remove(hash);
        }
    }

    public T getNode(String key) {
        if (ring.isEmpty()) return null;
        long hash = hash(key);
        Map.Entry<Long, T> entry = ring.ceilingEntry(hash);
        if (entry == null) entry = ring.firstEntry();
        return entry.getValue();
    }

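    // Uses 4 bytes of the MD5 digest as an unsigned 32-bit ring position.
    private long hash(String key) {
        md.reset();
        // Fix the charset so every JVM hashes the same key to the same position.
        md.update(key.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        byte[] digest = md.digest();
        return ((long) (digest[3] & 0xFF) << 24) | ((long) (digest[2] & 0xFF) << 16) |
               ((long) (digest[1] & 0xFF) << 8) | (digest[0] & 0xFF);
    }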

    public static void main(String[] args) throws Exception {
        ConsistentHashRing<String> ring = new ConsistentHashRing<>(3);
        ring.addNode("shard1");
        ring.addNode("shard2");
        ring.addNode("shard3");
        System.out.println(ring.getNode("user1234"));
    }
}

Output

shard2

The Ring of Nodes

Each node maps to multiple points on the ring (virtual nodes) for even load distribution.
Data is assigned to the nearest clockwise node.
Adding or removing a node only affects its neighbours, not the entire ring.
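The claim that adding a node only disturbs part of the ring can be measured. The sketch below is an illustrative Python ring, separate from the Java listing: it uses SHA-256 instead of MD5 and 150 virtual nodes per shard, and the names `HashRing`, `shard1` to `shard5`, and the `user{i}` keys are assumptions for the example. It counts how many of 10,000 keys change owner when a fifth shard joins, and checks that every moved key went to the new shard.

```python
import bisect
import hashlib
from typing import Dict, List


class HashRing:
    def __init__(self, vnodes: int = 150):
        self.vnodes = vnodes
        self._points: List[int] = []     # sorted ring positions
        self._owner: Dict[int, str] = {}  # ring position -> physical node

    @staticmethod
    def _hash(value: str) -> int:
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def get_node(self, key: str) -> str:
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._points, self._hash(key))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing()
for name in ("shard1", "shard2", "shard3", "shard4"):
    ring.add_node(name)

keys = [f"user{i}" for i in range(10_000)]
before = {k: ring.get_node(k) for k in keys}
ring.add_node("shard5")
after = {k: ring.get_node(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"moved {len(moved) / len(keys):.1%} of keys")
print("all moved keys went to shard5:", all(after[k] == "shard5" for k in moved))
```

Going from four to five shards, about one fifth of the keys should move in expectation. A naive `hash(key) % node_count` scheme would instead remap most keys.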

Production Insight

I've seen a sharded system crash when a hot user took down one shard while the others stayed idle.

Virtual nodes spread key ranges evenly across physical machines, but they do not split the traffic of a single key.

Rule: always monitor per-shard load distribution and set up alerts for skew. A single hot key can still overload one node, so replicate or cache known hot keys in addition to using virtual nodes.

Key Takeaway

Partitioning strategy must match query patterns.

Range queries need range partitioning; uniform load needs consistent hashing.

Wrong choice causes hot spots that nullify scaling gains.

Observability and Monitoring: Logging, Metrics, and Distributed Tracing

Most system design descriptions skip observability, but in production you can't fix what you can't see. The three pillars are logging, metrics, and distributed tracing. Logs give you per-request detail but are expensive to store long-term. Metrics (latency, error rates, throughput) give you aggregated health. Distributed tracing connects a single request across multiple services.

In an interview, mentioning you'd integrate Prometheus for metrics, structured logging (JSON), and Jaeger for tracing shows operational maturity. Discuss how you'd monitor the key SLIs: latency (P50, P95, P99), error rate, throughput, and saturation (e.g., CPU, memory, connection pool). Define SLOs and set up alerts based on burn-rate budgets.
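Burn rate is easier to explain with the arithmetic in front of you. In this sketch, burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a value of 1.0 means the budget runs out exactly at the end of the SLO window. The function names and the alert threshold of 10 are assumptions for illustration; real thresholds depend on the window lengths you choose.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target)."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1")
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(errors: int, requests: int, slo_target: float,
                threshold: float = 10.0) -> bool:
    # threshold is a tunable assumption, not a recommended value
    return burn_rate(errors, requests, slo_target) >= threshold


# 99.9% SLO, 50 failed requests out of 10,000 in the measurement window
rate = burn_rate(errors=50, requests=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print("page on-call:", should_page(50, 10_000, 0.999))
```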

A common mistake is to design for low latency but never instrument the system to measure it. Add a simple tracing middleware from day one. Without tracing, debugging a 500ms latency spike across 10 services becomes a guessing game.

Here's a concrete snippet: 'I'd set up structured logging with correlation IDs, then pipe logs into a centralised system (ELK). For metrics, I'd expose endpoints for Prometheus and create dashboards showing P99 latency, error rate, and throughput. For tracing, I'd use OpenTelemetry with Jaeger.' That's a senior-level answer.
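The structured-logging part of that answer might look like the following sketch, which uses only the Python standard library. The logger name `timeline-service`, the field names, and the helpers `build_logger` and `handle_request` are assumptions for illustration. Each log line is one JSON object, and a correlation ID held in a context variable is attached to every line written while a request is being handled.

```python
import contextvars
import io
import json
import logging
import sys
import uuid
from typing import Optional, TextIO

correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(stream: TextIO) -> logging.Logger:
    logger = logging.getLogger("timeline-service")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    # Reuse the caller's ID when present so one ID follows the request across services.
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("timeline fetched")
        return cid
    finally:
        correlation_id.reset(token)


handle_request(build_logger(sys.stdout), incoming_id="req-abc123")
```

In a real service the incoming ID would come from a request header and be forwarded on outgoing calls, which is what lets a log search by correlation ID reconstruct one request's path.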

io/thecodeforge/observability/TracingMiddleware.java (Java)

package io.thecodeforge.observability;

import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import java.io.IOException;

public class TracingMiddleware implements Filter {
    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        Tracer tracer = GlobalTracer.get();
        Span span = tracer.buildSpan("api-call")
                .withTag("method", ((HttpServletRequest) request).getMethod())
                .withTag("path", ((HttpServletRequest) request).getRequestURI())
                .start();
        try (Scope ignored = tracer.activateSpan(span)) {
            chain.doFilter(request, response);
        } catch (Exception e) {
            span.setTag("error", true);
            span.log(e.getMessage());
            throw e;
        } finally {
            span.finish();
        }
    }
    // ... other filter methods
}

Output

Every API request now creates a span that can be forwarded to downstream services via headers.

Don't ignore observability in interviews.

Interviewers love when you mention monitoring requirements. It shows you've operated systems in production, not just designed them on whiteboards.

Production Insight

A team once deployed a service that silently dropped 5% of requests for two weeks.

They had no metrics to detect it.

Rule: instrument every service from day one, even before the feature is complete. That 5% drop could have been caught in 5 minutes with a simple error rate dashboard.

Key Takeaway

Observability isn't optional — it's how you know if your design works.

Logging, metrics, and tracing form the three pillars.

Interview tip: say 'I'd add distributed tracing to root-cause failures across services.'

Idempotency and Retry Strategies: Preventing Double Charges and Data Corruption

In distributed systems, network failures are inevitable. A client sends a request, the server processes it, but the acknowledgment is lost. The client retries — and suddenly you have two orders, two payments, two emails. That's why idempotency is not optional.

Idempotency means performing the same operation multiple times produces the same result. For write operations, use an idempotency key: a unique token generated by the client and sent with the request. The server stores the result keyed by that token; if it sees the same key again, it returns the stored response without executing the operation again.

In interviews, mention idempotency early. Say 'Every write operation will include an idempotency key. The client generates a UUID, sends it with the request. The server deduplicates by that key.' This signals you've built payment systems.

For retries, use exponential backoff with jitter — never retry instantly. A naive retry can bring down a struggling service. Say 'I'd use exponential backoff with a base of 1 second, doubling each time, capped at 30 seconds, plus random jitter (±20%) to spread retries.'
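The retry policy described above can be sketched in a few lines. This is an illustration under stated assumptions, not a production client: `backoff_delays`, `call_with_retries`, and the `flaky` operation are hypothetical names, and the choice of which exceptions count as retryable is an assumption. Jitter is applied after the cap, so a single delay can reach 36 seconds.

```python
import random
import time


def backoff_delays(base=1.0, cap=30.0, jitter=0.2, attempts=5):
    """Yield delays of base * 2^n, capped, then spread by +/- jitter."""
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        yield delay * random.uniform(1 - jitter, 1 + jitter)


def call_with_retries(operation, retries=5, sleep=time.sleep):
    """Run operation(); on a transient error, wait and try again."""
    delays = backoff_delays(attempts=retries)
    while True:
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the last error
            sleep(delay)


# Demo with a hypothetical operation that fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

# sleep is stubbed out so the demo finishes instantly.
print(call_with_retries(flaky, sleep=lambda seconds: None))
```

Retries like this are only safe when the operation itself is idempotent, which is why the two topics belong together.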

io/thecodeforge/design/IdempotencyMiddleware.py · PYTHON

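import uuid
from flask import Flask, request, jsonify, make_response
from functools import wraps

app = Flask(__name__)
idempotency_store = {}  # In production, use Redis with TTL

def idempotent(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        key = request.headers.get('Idempotency-Key')
        if not key:
            return jsonify({'error': 'Missing Idempotency-Key header'}), 400
        if key in idempotency_store:
            # Replay the stored body and status instead of re-executing
            body, status = idempotency_store[key]
            return jsonify(body), status
        # The view returns a (response, status) tuple, so normalise it first
        response = make_response(f(*args, **kwargs))
        # Note: check-then-store is not atomic; concurrent duplicates need
        # a lock or an atomic set-if-absent in the backing store
        idempotency_store[key] = (response.get_json(), response.status_code)
        return response
    return decorated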

@app.route('/order', methods=['POST'])
@idempotent
def create_order():
    # Process order (deduplicated by idempotency key)
    return jsonify({'order_id': str(uuid.uuid4()), 'status': 'created'}), 201

Output

POST /order with Idempotency-Key header returns same result on retry

Bring up idempotency without being asked

In any design that involves external calls or financial transactions, say 'I'd make this operation idempotent using a client-generated key.' That single sentence can turn a 'good' answer into a 'strong hire'.

Production Insight

I've debugged a payment gateway that processed double charges on 0.1% of transactions — costing the company $100k before we added idempotency keys.

The fix: a simple 'Idempotency-Key' header with a Redis-based dedup cache.

If you've ever touched payments, you know idempotency is table stakes. Mention it or lose the offer.

Key Takeaway

Idempotency prevents duplicate actions when retries happen.

Use client-generated keys and server-side dedup storage.

Retry with exponential backoff + jitter, not instant retries.

Clarify the Scope: Stop Building Facebook When They Asked for a Chat App

The single biggest mistake in system design interviews isn't choosing the wrong database — it's solving the wrong problem. You get 45 minutes. If you spend ten of them whiteboarding a distributed file system for a social media app that only needs to serve 10,000 users, you've already failed.

Start by asking questions that kill ambiguity. What are the read/write ratios? Are we serving global traffic or regional? Do we need real-time or eventual consistency? What's the expected latency SLA?

Companies don't care if you can recite the perfect sharding strategy for 100 million users. They care if you can ask "Is this actually going to hit 100 million users?" before you design for it.

Frame your constraints early. Write them on the board. Reference them when someone suggests Cassandra for a key-value store serving 500 QPS. The interview isn't a test of how many systems you know; it's a test of how well you narrow the problem space before you start throwing tech at it.

ScopeConstraints.py · PYTHON

# io.thecodeforge: interview tutorial

class DesignScope:
    """Don't skip this. Ever."""
    def __init__(self):
        self.users = 1_000_000  # Monthly active
        self.read_ratio = 0.95   # 95% reads
        self.write_ratio = 0.05  # 5% writes
        self.latency_sla_ms = 200
        self.consistency = "eventual"  # Accept stale reads
        self.geo_distribution = "us-east-1 only"

    def validate(self):
        if self.users < 100_000:
            print("Simple monolith. Don't overengineer.")
            return
        if self.read_ratio > 0.9:
            print("Heavy read workload. Cache aggressively.")
        if self.consistency == "eventual":
            print("Stop trying to sell me Kafka.")
        print(f"Scope validated: {self.__dict__}")

scope = DesignScope()
scope.validate()

Output

Heavy read workload. Cache aggressively.

Stop trying to sell me Kafka.

Scope validated: {'users': 1000000, 'read_ratio': 0.95, 'write_ratio': 0.05, 'latency_sla_ms': 200, 'consistency': 'eventual', 'geo_distribution': 'us-east-1 only'}

Production Trap:

I've seen senior engineers spend 20 minutes debating DynamoDB vs Cassandra for a system that only needed PostgreSQL with a read replica. Ask about scale first. Real systems rarely need the distributed hype.

Key Takeaway

Spend the first 5 minutes defining constraints. Everything else is optimization.

API Design: Your Whiteboard Is a Contract — Violate It and You're Cooked

Once you've scoped the problem, you need to define the API surface before you draw a single box. This is where interviewers separate the architects from the diagram-drawers.

Start with the core operations. For a video platform: uploadVideo(userId, videoFile, metadata) and getFeed(userId, pageToken, pageSize). Define the request/response shapes. Include status codes and error handling. Production APIs return 409 on conflicts, not 500 on everything.

Don't forget pagination. If your design returns ten thousand records in one response, you've just designed a denial-of-service attack against yourself. Use cursor-based pagination with pageToken — it scales better than offset/limit when data moves.
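As a small illustration of cursor-based pagination, the sketch below encodes the sort key of the last returned row into an opaque `pageToken`. The in-memory `VIDEOS` list, the `(created_at, id)` sort key, and the function names are assumptions for the example; a real service would push the comparison into the database query.

```python
import base64
import json

# Hypothetical in-memory "table", already sorted by (created_at, id).
VIDEOS = [{"id": i, "created_at": 1000 + i, "title": f"video-{i}"} for i in range(1, 8)]


def encode_token(last_item):
    """Pack the sort key of the last returned row into an opaque token."""
    raw = json.dumps({"created_at": last_item["created_at"], "id": last_item["id"]})
    return base64.urlsafe_b64encode(raw.encode()).decode()


def decode_token(token):
    return json.loads(base64.urlsafe_b64decode(token.encode()))


def get_feed(page_token=None, page_size=3):
    """Return one page plus the token for the next page (None when exhausted)."""
    rows = VIDEOS
    if page_token:
        cursor = decode_token(page_token)
        key = (cursor["created_at"], cursor["id"])
        # SQL equivalent (sketch): WHERE (created_at, id) > (:created_at, :id)
        rows = [v for v in VIDEOS if (v["created_at"], v["id"]) > key]
    page = rows[:page_size]
    has_more = len(rows) > page_size
    next_token = encode_token(page[-1]) if has_more else None
    return {"videos": page, "next_page_token": next_token}


# Walk every page until the server stops returning a token.
token = None
while True:
    page = get_feed(page_token=token)
    print([v["id"] for v in page["videos"]])
    token = page["next_page_token"]
    if token is None:
        break
```

Because the token stores a position in the sort order rather than a row offset, rows inserted or deleted before the cursor don't cause skipped or repeated items.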

Write the endpoints on the board. Reference them when you talk about load balancers, caching, and database indexing. A good API design makes the rest of the architecture fall into place. A bad one means you're building a house on a swamp.

APIBlueprint.py · PYTHON

# io.thecodeforge: interview tutorial

from typing import Optional
from dataclasses import dataclass

@dataclass
class UploadVideoRequest:
    user_id: str
    video_file: bytes  # Multipart in reality
    title: str
    description: Optional[str] = None

@dataclass
class UploadVideoResponse:
    video_id: str
    upload_url: str  # Presigned S3 URL
    status: str  # "pending", "processing", "ready"

@dataclass
class GetFeedRequest:
    user_id: str
    page_token: Optional[str] = None
    page_size: int = 20

@dataclass
class GetFeedResponse:
    videos: list[dict]
    next_page_token: Optional[str]
    total_estimate: int  # Approximate, don't count()

# Don't forget error responses:
# 400 - malformed request
# 409 - conflict (duplicate upload detected, e.g. via a reused idempotency key)
# 429 - rate limited (back off, client)
print("API contract defined. Now draw the boxes.")

Output

API contract defined. Now draw the boxes.

Senior Shortcut:

When an interviewer asks "how would you design the API?", they're actually checking if you know about rate limiting, idempotency keys, and pagination. Mention all three in one sentence. It's a power move.

Key Takeaway

Define API contracts before drawing architecture. It forces clarity on every downstream decision.

● Production incident · POST-MORTEM · severity: high

The Cache Stampede That Cost a Weekend

Symptom

Users saw empty timelines or 503 errors during peak hours. Database CPU hit 100%. P95 latency > 30 seconds.

Assumption

The cache TTL was short (5 minutes) to keep timelines fresh. They assumed the database could handle the cache-miss rate of 20k QPS.

Root cause

Cache stampede: when many keys expire at once, all requests go to the database simultaneously. The DB connection pool saturated, causing cascading failures across services.

Fix

Added locking around cache rebuilds (only one request fetches from the DB while the others wait) and jitter on TTL expiration to avoid simultaneous expiry. The team also scaled read replicas and added a CDN for static content.

Key lesson

Always model your cache miss rate: even a 50% spike in misses can overwhelm your database.
Never trust default TTL values: they're designed for demos, not production.
In an interview, discussing cache stampede shows depth — say 'I'd use a mutex around recompute and stagger TTLs.'
Always include cache warm-up during deployment to avoid cold cache stampede.

Production debug guide

Symptom → Action mapping for the most common real-world failures that mirror interview mistakes

5 entries

Symptom · 01

Database CPU spikes to 100% during traffic bursts

→

Fix

Check if cache miss rate is spiking. Refresh hot keys ahead of expiry with a background job (refresh-ahead), or implement a read-through cache with jittered TTL.

Symptom · 02

API latency increases linearly with user count

→

Fix

Identify the bottleneck: likely an unscaled database query. Add a read replica or shard by user ID. Alternatively, add a cache layer for hot data.

Symptom · 03

Writes are slow, reads are fast

→

Fix

You have a read-optimised system for a write-heavy workload. Switch to an append-only log (like Kafka) and process writes asynchronously. Consider NoSQL if writes are high volume.

Symptom · 04

Service A can't reach Service B even though both are healthy

→

Fix

Likely a network partition, a cause of production incidents that rarely comes up in interviews. Implement circuit breakers and fallback responses. A service mesh (e.g., Istio) can help with resilient routing.

Symptom · 05

P99 latency increases after adding a cache layer

→

Fix

Check cache hit ratio. If low, the cache is not warming properly. Pre-warm cache or adjust eviction policy. Also check for cache stampede.

★ Quick Reference: Three Number Checks That Save Your Design

When you're in an interview and need to validate your scaling decisions, run through these three back-of-envelope calculations. If you can't justify one, your design is likely flawed.

You proposed a monolithic database without any caching or sharding

Immediate action

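Estimate QPS and data size. 1 billion reads/month is roughly 385 QPS on average (1e9 / (30 × 86,400)). A single well-tuned MySQL instance can typically serve on the order of 10k simple read QPS, so a monolith is fine at 385 QPS. After 10x growth with a 2-3x peak factor (roughly 8k to 12k QPS), you would be at or past that limit.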

Commands

QPS = (total requests per day / 86400) * peak traffic factor (usually 2-3x)

Storage = average object size * number of objects (users * tweets * etc.)

Fix now

If QPS > 10k, add caching (Redis) and/or read replicas. If writes > 1k QPS, consider horizontal sharding or NoSQL.

You chose SQL when the workload is write-heavy (90/10 write/read ratio)

You forgot to discuss fault tolerance and recovery

You designed a system with synchronous communication for every request
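The QPS and storage formulas from the first check are easy to turn into a quick calculation. The sketch below is only a worked example: the function names, the 30-day month, the 3x peak factor, and the write workload (200-byte objects, 10 million per day, one year of retention, 20% overhead for indexes and metadata) are assumptions.

```python
SECONDS_PER_DAY = 86_400


def peak_qps(requests_per_day, peak_factor=3):
    """Average QPS scaled by a peak factor (typically 2-3x)."""
    return requests_per_day / SECONDS_PER_DAY * peak_factor


def storage_bytes(object_size, objects_per_day, retention_days, overhead=0.2):
    """Raw storage plus an assumed 20% for indexes and metadata."""
    return object_size * objects_per_day * retention_days * (1 + overhead)


# 1 billion reads/month, assuming a 30-day month.
reads_per_day = 1_000_000_000 / 30
print(f"average QPS: {reads_per_day / SECONDS_PER_DAY:.0f}")   # 386
print(f"peak QPS:    {peak_qps(reads_per_day):.0f}")           # 1157

# Hypothetical write workload: 200-byte objects, 10M per day, kept for 1 year.
total = storage_bytes(200, 10_000_000, 365)
print(f"storage:     {total / 1e12:.2f} TB")                   # 0.88 TB
```

On a whiteboard you would round these to "about 400 QPS average, a bit over 1k at peak, under 1 TB per year", which is all the precision the design decision needs.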

Key Technology Trade-Offs in System Design

Concept	When to Use	Core Trade-off
SQL (PostgreSQL/MySQL)	Structured data, ACID compliance required	Harder to scale horizontally (sharding is complex)
NoSQL (Cassandra/DynamoDB)	Unstructured data, massive write throughput	Eventual consistency; lacks complex joins
Caching (Redis/Memcached)	Read-heavy workloads with frequent access	Complexity in cache invalidation (Stale data)
Load Balancing	Distributing traffic across multiple nodes	Introduces a single point of failure (SPOF) if not redundant
Message Queue (Kafka/RabbitMQ)	Decoupling producers from consumers, buffering writes	Adds latency and operational complexity

Key takeaways

System design is about the 'Why' (trade-offs) more than the 'What' (tools).

Always start with the API signatures and Data Schema before scaling components.

Master the CAP Theorem: understand why you cannot have Consistency, Availability, and Partition Tolerance simultaneously, and that during a network partition you must choose between consistency and availability.

Practice whiteboarding your thoughts: the 'human touch' of explaining your logic matters more than a perfect diagram.

Back-of-envelope calculations are your best friend. Memorise the order-of-magnitude numbers for common components.

Failure mode analysis is the single most under-discussed topic in interviews. Address it explicitly.

A structured wrap-up (SPOFs + scaling plan) leaves the strongest final impression.

Data partitioning is a deep topic: bring it up in every design to show depth.

Observability is the difference between a toy system and a production system.

Idempotency is not optional: discuss it for every write operation to prevent data corruption.

Common mistakes to avoid

7 patterns

Jumping into drawing boxes before defining functional and non-functional requirements

Symptom

The design looks good on the whiteboard but fails under real constraints — e.g., designed for 100 users, not 100 million.

Fix

Spend the first 3-5 minutes clarifying: what are the core features? What is the expected scale (DAU, QPS, storage)? What are the latency and consistency requirements?

Failing to estimate scale (QPS, storage, bandwidth)

Symptom

Interviewer asks 'Why did you choose this database?' and you can't justify the choice with numbers — the design feels random.

Fix

Always run a back-of-envelope calculation: QPS = (DAU × requests per user per day) / 86,400 × peak factor. Have rough benchmarks ready: single MySQL ~1k writes/s, Redis ~100k reads/s, Kafka ~1M events/s.

Ignoring the 'Failure Mode': Never assume a network call succeeds

Symptom

Your design has a single load balancer and a single database master. When one fails, the entire system goes down.

Fix

Identify single points of failure (SPOF) in your design and make them redundant. Discuss circuit breakers, retries with exponential backoff, and graceful degradation (e.g., show cached data when backend is down).

Choosing technologies without trade-off analysis

Symptom

Candidate says 'We'll use Kafka' without explaining why it's better than RabbitMQ or SQS for this specific use case.

Fix

Always explain the trade-off: 'Kafka gives us durability and replayability at the cost of higher operational overhead. For a simple notification system, SQS might be simpler.'

Over-engineering for scale that doesn't exist

Symptom

Candidate designs a distributed sharded NoSQL system for a startup with 1000 users, adding unnecessary complexity.

Fix

Start simple: a monolith with a single database. Then identify the scaling bottleneck and suggest the minimal addition (e.g., a cache layer). Interviewers value pragmatic simplicity.

Ignoring data replication and consistency trade-offs

Symptom

Data loss during node failure because replication factor is 1 or consistency level is set too low.

Fix

Configure a replication factor of at least 3 in production. Use quorum reads/writes for critical data. For eventual consistency, define a conflict resolution strategy (e.g., last-write-wins).

Designing without considering network latency between components

Symptom

High end-to-end latency because services are deployed across different regions and each call adds 50–100ms.

Fix

Co-locate dependent services in the same region or availability zone. Use async communication for non-blocking calls. Consider edge caching for read-heavy global workloads.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Design a URL shortener. Walk through your requirement gathering, then hi...

Q02 · SENIOR

Design a real-time chat system for 1 billion users. How do you handle me...

Q03 · SENIOR

You need to design a system that processes 10,000 orders per second duri...

Q04 · SENIOR

What's the difference between horizontal scaling and vertical scaling, a...

Q05 · SENIOR

Design a news feed system like Facebook's News Feed. How do you rank and...

Q06 · SENIOR

How would you design a rate limiter for a public API?

Q01 of 06 · SENIOR

Design a URL shortener. Walk through your requirement gathering, then high-level design.

ANSWER

First, clarify: how many URLs per month? (Assume 100 million). Are we generating short codes or letting users customise them? What's the read/write ratio? (It's read-heavy, ~10:1). High-level design: a web server that accepts long URL and returns short code. The short code is generated using a base-62 random string (or Snowflake ID stored in DB). Storage: relational DB for mapping (short -> long) plus a cache (Redis) for hot URLs. For scaling, we use a CDN for the redirect (if static), and as traffic grows, add read replicas and eventually shard by short code prefix. Trade-off: we chose SQL because we need strong consistency for the mapping. We add a cache to reduce read latency. We'll use a consistent hash to distribute cache load.
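For the short-code step in this answer, one option is to base-62 encode a numeric ID (such as an auto-increment or Snowflake-style ID). The sketch below is illustrative: the alphabet ordering and function names are assumptions, and a random-string scheme would look different.

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe characters (the ordering is an arbitrary choice).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n):
    """Encode a non-negative integer ID as a base-62 string."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base62(code):
    """Inverse of encode_base62."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


short = encode_base62(125)
print(short, decode_base62(short))  # 21 125
```

Seven characters give 62^7, roughly 3.5 trillion codes, which is far more than the assumed 100 million URLs per month would need.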

FAQ · 7 QUESTIONS

Frequently Asked Questions

What is the single most important part of a System Design interview?

How do I handle the 'I don't know' moment in a design round?

Should I choose SQL or NoSQL in an interview?

How deep should I go into implementation details?

How do I estimate QPS / storage without a calculator?

How do you handle hot keys in a distributed system?

What's the difference between vertical and horizontal scaling for databases?

Naren Founder & Principal Engineer

20+ years shipping production code across the stack, with years spent interviewing engineers. Written from production experience, not tutorials.


May 23, 2026

last updated

1,554

articles · all by Naren
