Mid-level 8 min · March 06, 2026

System Design Interview - Cache Stampede in Production

In the production incident covered in this guide, a cache stampede during peak hours drove database CPU to 100% and P95 latency above 30 seconds.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • System design interviews test your ability to handle ambiguity and make trade-offs under pressure
  • Core skill: requirement clarification defines the scope before drawing any boxes
  • Key components: functional vs non-functional requirements, high-level design, deep-dive bottlenecks, wrap-up with failure modes
  • Performance insight: a 10x scale miscalculation (e.g., QPS off by a factor of 10) can make your entire design irrelevant, so always verify numbers
  • Production insight: the same mistake shows up in real systems — teams build for current load, then wonder why it crumbles at 10x
  • Biggest mistake: jumping straight to a solution (Kafka! NoSQL!) without asking 'What problem are we solving?'
Plain-English First

Imagine you're asked to design a city from scratch. You don't start by choosing the color of doorknobs — you start with roads, power grids, and water pipes. System design interviews work exactly the same way: interviewers want to see that you can think big, make smart trade-offs, and build something that won't collapse under pressure. It's less about memorizing answers and more about showing you can be the architect, not just the bricklayer.

Every senior engineering role at a top tech company has one brutal filter: the system design round. It's the interview that makes experienced developers freeze up, not because they lack knowledge, but because the question is deliberately open-ended. 'Design Twitter.' 'Design a URL shortener.' 'Design Netflix.' The candidate who answers these well isn't the one who memorized the most blog posts — it's the one who can think out loud, reason through trade-offs, and communicate at the level of a staff engineer.

The core problem this interview type solves — from the interviewer's perspective — is figuring out how you'll behave when given an ambiguous, high-stakes technical problem with no single right answer. At scale, every architectural decision has cascading consequences. Choosing the wrong database engine, ignoring read/write ratios, or failing to think about failure modes can mean millions in lost revenue or a 3am outage. The design interview is a compressed simulation of exactly that situation.

By the end of this guide, you'll have a repeatable framework you can apply to any system design prompt, understand the specific trade-offs interviewers are listening for (and the buzzwords that actually hurt you), know how to handle the moments where you genuinely don't know the answer, and walk away with a mental model that works in real production systems — not just whiteboards.

The 4-Step Framework for Architectural Clarity

A System Design Interview isn't a coding test; it's a conversation about trade-offs. To avoid the 'blank whiteboard' syndrome, you need a reliable framework. We recommend the following: 1. Understand Requirements (Functional & Non-Functional), 2. High-Level Design (The 'Boxes and Arrows'), 3. Deep Dive into Bottlenecks (Database Sharding, Caching), and 4. Wrap-up (Identifying SPOFs and Scaling).

Rather than starting with a dry definition, let's look at a practical example: generating unique IDs in a distributed environment.

Here's the thing most candidates miss: the framework is just a container. What matters is how you move through it. Don't treat it as a rigid checklist — treat it as a conversation. If the interviewer asks a deep question about caching mid-way through step 2, follow that thread. The framework keeps you from getting lost, not from getting interesting.

io/thecodeforge/design/DistributedIdGenerator.java (Java)
package io.thecodeforge.design;

import java.util.concurrent.atomic.AtomicLong;

/**
 * A production-grade concept for Unique ID Generation in a distributed system.
 * In a real interview, you'd discuss Snowflake IDs or UUIDs to avoid DB bottlenecks.
 */
public class DistributedIdGenerator {
    private final long datacenterId;
    private final long workerId;
    private final AtomicLong sequence = new AtomicLong(0L);

    public DistributedIdGenerator(long datacenterId, long workerId) {
        this.datacenterId = datacenterId;
        this.workerId = workerId;
    }

    public synchronized String generateId() {
        long timestamp = System.currentTimeMillis();
        // In a real interview, explain how bit-shifting ensures sortability and uniqueness
        return String.format("%d-%d-%d-%d", timestamp, datacenterId, workerId, sequence.getAndIncrement());
    }

    public static void main(String[] args) {
        DistributedIdGenerator generator = new DistributedIdGenerator(1, 42);
        System.out.println("Generated Unique ID: " + generator.generateId());
    }
}
Output (the timestamp varies per run)
Generated Unique ID: 1742031000000-1-42-0
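The code comment above mentions bit-shifting without showing it. Here is a minimal Python sketch of a Snowflake-style layout, assuming 41 timestamp bits, 5 datacenter bits, 5 worker bits, and 12 sequence bits. The class name, the custom epoch, and the bit widths are illustrative assumptions, not a reference implementation. Note that a 5-bit worker field only allows IDs 0 to 31, so the worker ID 42 used above would not fit this particular layout.

```python
import threading
import time

# Assumed layout: 41 bits timestamp | 5 bits datacenter | 5 bits worker | 12 bits sequence
EPOCH_MS = 1_577_836_800_000  # hypothetical custom epoch: 2020-01-01T00:00:00Z
DATACENTER_BITS, WORKER_BITS, SEQUENCE_BITS = 5, 5, 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

WORKER_SHIFT = SEQUENCE_BITS
DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS
TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_BITS + DATACENTER_BITS


class SnowflakeLikeGenerator:
    def __init__(self, datacenter_id: int, worker_id: int):
        if not 0 <= datacenter_id < (1 << DATACENTER_BITS):
            raise ValueError("datacenter_id out of range")
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.datacenter_id = datacenter_id
        self.worker_id = worker_id
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000)
            if now < self._last_ms:
                # Clock moved backwards: reuse the last timestamp to stay monotonic
                now = self._last_ms
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one
                    while now <= self._last_ms:
                        now = int(time.time() * 1000)
            else:
                self._sequence = 0
            self._last_ms = now
            # The timestamp sits in the high bits, so IDs sort by creation time
            return (((now - EPOCH_MS) << TIMESTAMP_SHIFT)
                    | (self.datacenter_id << DATACENTER_SHIFT)
                    | (self.worker_id << WORKER_SHIFT)
                    | self._sequence)


def decode(id_: int) -> dict:
    """Split an ID back into its fields (useful when debugging)."""
    return {
        "timestamp_ms": (id_ >> TIMESTAMP_SHIFT) + EPOCH_MS,
        "datacenter_id": (id_ >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1),
        "worker_id": (id_ >> WORKER_SHIFT) & ((1 << WORKER_BITS) - 1),
        "sequence": id_ & MAX_SEQUENCE,
    }


generator = SnowflakeLikeGenerator(datacenter_id=1, worker_id=17)
print(decode(generator.next_id()))
```

Because the result is a single 64-bit integer rather than a formatted string, it fits a BIGINT primary key and stays roughly time-ordered across workers.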
Forge Tip: Clarify Constraints Early
Before drawing a single box, ask: 'Is this system read-heavy or write-heavy?' and 'What is the Daily Active User (DAU) count?' A system for 100 users is a project; a system for 100 million is an architecture.
Production Insight
The most common failure in early-stage startups mirrors the interview: they skip requirement gathering and over-engineer for scale that never comes.
If you don't ask about read/write ratio, you'll design a caching layer for a write-heavy system and spend engineering time on a cache that almost never hits.
Rule: always anchor your design to actual constraints, not hypothetical scale. I've debugged a production outage where a team blindly added Redis to a write-heavy job queue — cache hit ratio was 0.3%. The latency actually increased because of the extra hop.
Key Takeaway
Framework first, solution second.
The four steps prevent you from drawing before you know the canvas size.
Remember: requirements are the most loaded part — get them wrong and nothing else matters.

Infrastructure as Code: Deploying the Architecture

Interviewers love when you can bridge the gap between a whiteboard drawing and actual deployment. Understanding how to containerize and scale your components is key to demonstrating seniority.

But don't just name tools. Explain why you'd choose Docker over a VM, or Kubernetes over a single server. The 'how' is easy — the 'why' is what separates staff engineers.

The dirty secret: most senior engineers have been burned by overcomplicated infrastructure. If you're interviewing, showing you understand the pain of maintaining a 15-microservice nightmare is more impressive than listing every AWS service. Say 'I'd start with a monolith until I have a concrete bottleneck.' That's the answer that gets 'hire.'

io/thecodeforge/deploy/ArchitectureStack.Dockerfile (Docker)
# TheCodeForge - Scaling the API Layer
FROM eclipse-temurin:17-jdk-alpine

# Best practice: Don't run as root in production interview scenarios
RUN addgroup -S forgegroup && adduser -S forgeuser -G forgegroup
USER forgeuser

WORKDIR /app
COPY target/system-design-app.jar app.jar

# Expose the service port
EXPOSE 8080

# Standard entrypoint for Spring Boot apps
ENTRYPOINT ["java", "-Xmx2g", "-jar", "app.jar"]
Output
Successfully built and tagged thecodeforge/api-layer:latest
Avoid the 'Buzzword Bingo' Trap
Don't just say 'We'll use Kafka.' Say: 'We'll use a message queue like Kafka to decouple the user-facing API from the heavy image-processing worker, ensuring high availability even if the worker service is down.'
Production Insight
I've seen teams adopt Kubernetes just because it's trendy — then spend months managing cluster upgrades while their monolithic PHP app runs fine on a single server.
The rule: choose the minimal infrastructure that meets your requirements. Add complexity only when you have a concrete bottleneck.
If you're interviewing, showing you understand this trade-off is more impressive than listing every AWS service. I remember a candidate who said 'I'd actually use a single EC2 instance with autoscaling and skip k8s until we hit 10 services.' That candidate got the offer.
Key Takeaway
Infrastructure choices are trade-offs, not checklists.
A Dockerfile is easy; knowing when to avoid containers is hard.
In an interview, say: 'I'd start with a simple architecture and scale only when the data demands it.'

Back-of-Envelope Estimations: The Numbers That Validate Your Design

In a system design interview, you can talk theory all day, but the moment you put numbers on the whiteboard, you demonstrate real-world experience. Interviewers want to see you can estimate QPS, storage, bandwidth, and cache size with reasonable accuracy.

Don't aim for precision; aim for orders of magnitude. Being off by a factor of 10 is acceptable if you catch it and adjust. Here's the cheat sheet (rough rules of thumb that vary with hardware and workload): 1 million requests/day ≈ 12 QPS. 1 TB = 1,000 GB. A single MySQL instance can handle ~1k writes/second. A single Redis node can handle ~100k reads/second.

Here's a real trick: always mention 'I'd add 30% headroom for traffic spikes.' It shows you've dealt with production surprises. If you forget to include replication factor (3x) and retention (say 18 months), your storage estimate will be off by a factor of 5 — and your architecture will be wrong.

io/thecodeforge/design/EstimationHelper.py (Python)
from typing import Tuple

def estimate_qps(daily_active_users: int, requests_per_user: int, peak_multiplier: float = 2.0) -> Tuple[float, float]:
    """Estimate average and peak QPS for a system."""
    daily_requests = daily_active_users * requests_per_user
    avg_qps = daily_requests / 86400
    peak_qps = avg_qps * peak_multiplier
    return round(avg_qps, 2), round(peak_qps, 2)

# Example for Twitter-like system: 100M DAU, each user fetches timeline 50 times
avg, peak = estimate_qps(100_000_000, 50)
print(f"Average QPS: {avg}, Peak QPS: {peak}")
# Output: Average QPS: 57870.37, Peak QPS: 115740.74
Output
Average QPS: 57870.37, Peak QPS: 115740.74
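Storage estimates follow the same pattern as the QPS estimate, with retention, replication, and headroom layered on top. The sketch below uses the 200 GB/month and 18-month figures from the payment example in this section. The function name, the 20% index overhead, and the 30% headroom default are assumptions for illustration.

```python
def estimate_storage_gb(raw_gb_per_month: float,
                        retention_months: int,
                        replication_factor: int = 3,
                        index_overhead: float = 0.2,   # assumed: indexes add ~20%
                        headroom: float = 0.3) -> float:  # assumed: 30% spare capacity
    """Provisioned storage once retention, indexes, replication and headroom are included."""
    raw_retained = raw_gb_per_month * retention_months
    with_indexes = raw_retained * (1 + index_overhead)
    replicated = with_indexes * replication_factor
    return replicated * (1 + headroom)


raw = 200 * 18
total = estimate_storage_gb(200, 18)
print(f"Raw retained: {raw} GB, provisioned: {total:,.0f} GB, multiplier: {total / raw:.1f}x")
# Output: Raw retained: 3600 GB, provisioned: 16,848 GB, multiplier: 4.7x
```

Under these assumptions the multiplier lands at about 4.7x, which is in line with the '4-5x for production overhead' rule of thumb.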
The Capacity Triangle: QPS × Storage × Latency
  • QPS is the surface area of your system. High QPS demands caching, partitioning, and asynchronous processing.
  • Storage grows unbounded. Estimate 3x growth for data, indexes, and backups.
  • Latency is the sharpest constraint. Every network hop adds ~1ms locally, ~50ms cross-region.
Production Insight
At a payment company, we estimated storage for transaction logs as 200 GB/month. We forgot to account for replication factor (3x) and retention (18 months). Six months later we hit 1.6 TB and exceeded budget by $50k.
Always include replication, retention, and indexing overhead in your estimates.
If you interview for a storage-heavy system, say: 'I'd multiply my raw estimate by 4-5x for production overhead.' That phrase alone signals you've been burned before.
Key Takeaway
Estimations are not about precision — they're about keeping your design realistic.
A single order-of-magnitude error can render your entire architecture useless.
Memorise the 'power of 10' numbers: single DB ~1k writes/s, Redis ~100k reads/s, Kafka ~1 million events/s.

Trade-Off Decision Tree: SQL vs NoSQL, Cache vs DB, Sync vs Async

Every system design interview forces you to make choices. The ability to articulate the trade-offs is what separates a good answer from a great one. Let's create a mental decision tree:

  • If your system requires strong consistency (e.g., banking), go SQL with a single writer and read replicas. Accept write latency.
  • If you need high availability with eventual consistency for a global feed, choose NoSQL (Cassandra, DynamoDB).
  • If you have a read-heavy workload (90% reads), add a cache layer (Redis) and consider CDN for static assets.
  • If your workload is write-heavy (50%+ writes), you need a commit log (Kafka) and a write-optimised store (Cassandra).

The decision tree is not about picking the right answer — it's about showing you understand the constraints.

Don't forget to mention the hard part: consistency trade-offs. If you pick eventual consistency, say 'I accept that a user might see stale data for a few seconds — that's fine for this use case.' If you pick strong consistency, say 'I'm trading availability for correctness — I'll need to handle higher write latency and potential downtime during partitions.' That's the level of depth that gets 'strong hire.'

The Consistency vs Availability Mistake
When an interviewer asks 'Do you need strong consistency?' resist the urge to say 'Yes, always.' For a social media feed, eventual consistency is fine. For a payment system, it's not. Know your CAP theorem.
Production Insight
The biggest production incident I've seen was a team that chose MongoDB for a financial system because 'it scales better.' They hit a node failure and lost 15 minutes of transactions — no rollback possible.
The lesson: consistency guarantees are not optional — they are fundamental to your data model.
In interviews, if you pick eventual consistency, you'd better have a reason that the system can tolerate temporary inconsistency (e.g., view count vs payment). And always mention you'd use quorum reads for critical data.
Key Takeaway
Every database choice is a compromise.
NoSQL stores typically relax consistency to gain scale. SQL databases typically give up easy horizontal scale to keep consistency.
The right choice depends on your write-to-read ratio, consistency requirements, and growth trajectory.
Data Store Decision Tree
If: Strong consistency, ACID transactions required
Use: A relational database (PostgreSQL/MySQL). Plan for vertical scaling or read replicas first. Shard only when necessary.
If: High horizontal scalability, less strict consistency (eventual)
Use: A distributed NoSQL store (Cassandra, DynamoDB). Understand quorum-based reads/writes for tuning consistency.
If: High throughput writes with eventual consistency
Use: An append-only log (Kafka) acting as the queue, plus a consumer that writes to NoSQL. This decouples producers from the database write bottleneck.
If: Variable read patterns, need low latency (P95 < 10ms)
Use: A distributed cache (Redis) with a write-through or lazy-loading strategy. Handle cache stampede with locking and jitter.
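The decision tree above can be written down as a small function, which is a handy way to check that your rules don't contradict each other. This is only a sketch: the `Workload` fields and the 0.5 and 0.9 thresholds are assumptions lifted from the '50%+ writes' and '90% reads' rules of thumb earlier in this section, not hard limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    needs_acid: bool
    read_fraction: float               # share of requests that are reads, 0.0 to 1.0
    needs_low_latency_reads: bool = False


def recommend(w: Workload) -> List[str]:
    """Map workload traits to the stores suggested by the decision tree."""
    picks = []
    if w.needs_acid:
        picks.append("relational database (PostgreSQL/MySQL) with read replicas")
    elif w.read_fraction <= 0.5:
        picks.append("append-only log (Kafka) feeding a write-optimised NoSQL store")
    else:
        picks.append("distributed NoSQL store (Cassandra, DynamoDB) with tunable quorum")
    # A cache is layered on top of the primary store, it never replaces it
    if w.needs_low_latency_reads or w.read_fraction >= 0.9:
        picks.append("distributed cache (Redis) with stampede protection")
    return picks


print(recommend(Workload(needs_acid=True, read_fraction=0.95)))
print(recommend(Workload(needs_acid=False, read_fraction=0.3)))
```

In an interview the value is in saying the conditions out loud, not in the code itself.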

Failure Modes: Designing for When Things Go Wrong

The most overlooked part of system design interviews is discussing failure modes. Too many candidates describe a perfect system where everything works. The real world has partial failures — a network partition, a node crash, a buggy deployment. A senior engineer designs for these.

You should discuss:
  • Single point of failure (SPOF): every load balancer, database master, and queue broker is a candidate. Make everything redundant.
  • Cascading failures: when one component fails and overloads the next. Example: a cache node goes down → all traffic hits DB → DB goes down → app fails.
  • Recoverability: how do you return to normal after a failure? Blue-green deployments, circuit breakers, and graceful degradation are key concepts.

Don't just list them — explain the specific scenario. Say 'If the cache cluster goes down, I'd have a circuit breaker that falls back to the database with a pool limiter to prevent overload. The users would see slightly slower responses, but no outage.' That's a senior answer.

io/thecodeforge/design/circuit_breaker.py (Python)
import time
from typing import Any, Callable

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_timeout: float = 60.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, func: Callable[..., Any], *args, **kwargs) -> Any:
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.reset_timeout:
                # Timeout elapsed: let one trial request through
                self.state = "HALF_OPEN"
            else:
                raise RuntimeError("Circuit breaker is OPEN. Request blocked.")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed trial request, or too many consecutive failures, opens the circuit
            if self.state == "HALF_OPEN" or self.failure_count >= self.max_failures:
                self.state = "OPEN"
            raise
        # Any success closes the circuit and resets the consecutive-failure count
        self.state = "CLOSED"
        self.failure_count = 0
        return result

# Usage (get_timeline is a hypothetical stand-in for a real API client call)
def get_timeline(user_id: str) -> list:
    return [f"post-for-{user_id}"]

cb = CircuitBreaker(max_failures=3, reset_timeout=30)
timeline = cb.call(get_timeline, "user123")
Output
If the API fails 3 times in a row, subsequent calls are blocked for 30 seconds, preventing cascading overload.
Don't Assume 100% Uptime — Describe Graceful Degradation
When a system is partially down, what should the user see? A search service should return recent results from cache if the index is unreachable. A video streaming service should lower quality. Always describe the degraded user experience.
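To make the 'fall back to the database with a limiter' answer concrete, here is a hedged sketch of a reader that degrades in steps: cache first, then a bounded number of concurrent database reads, then load shedding. `DegradedReader`, `CacheUnavailable`, and the `cache_get`/`db_get` callables are hypothetical names for illustration, and the limit of 10 is an arbitrary assumption.

```python
import threading


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cluster is unreachable."""


class DegradedReader:
    def __init__(self, cache_get, db_get, max_db_concurrency: int = 10):
        self._cache_get = cache_get
        self._db_get = db_get
        # Caps how many fallback reads may hit the database at the same time
        self._db_slots = threading.BoundedSemaphore(max_db_concurrency)

    def get(self, key: str):
        """Return (value, source) where source is 'cache', 'database' or 'shed'."""
        try:
            return self._cache_get(key), "cache"
        except CacheUnavailable:
            pass  # fall through to the database path
        # Non-blocking acquire: shed load instead of queueing on a struggling DB
        if not self._db_slots.acquire(blocking=False):
            return None, "shed"
        try:
            return self._db_get(key), "database"
        finally:
            self._db_slots.release()


def cache_down(key):
    raise CacheUnavailable()


reader = DegradedReader(cache_down, lambda key: f"row:{key}")
print(reader.get("user123"))
```

A caller that receives `"shed"` can serve a stale or default response, which is the degraded experience the user actually sees.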
Production Insight
A major social network went down for 12 hours because a load balancer's health check started failing due to a race condition in their deployment script. The fix: add a timeout and a secondary health check.
In your interview, if you mention 'I'd add health checks and circuit breakers', you've already outdone 90% of candidates.
The rule: always allocate 2 minutes in your wrap-up to discuss failure modes — it shows operational maturity.
Key Takeaway
Design for failure, not success.
A system that works perfectly under ideal conditions is a toy.
In production, partial failures are the norm — plan for them.

The Wrap-Up: Single Points of Failure and Scaling Plans

The final step of your interview answer should be a structured wrap-up. Summarise what you've designed, then explicitly call out:
  1. Single points of failure: 'Our load balancer is a SPOF. I'd make it redundant with active-passive and a floating IP.'
  2. Scaling plan: 'Currently this handles 1M DAU. To reach 100M DAU, I'd shard the database by user ID, add a Redis cache for timelines, and introduce Kafka for async processing of tweets.'
  3. Future improvements: things you'd add if time allowed (like telemetry, rate limiting, etc.)

This structured wrap-up leaves a strong final impression — it tells the interviewer you think holistically.

Pro tip: end with an open-ended question. 'Is there any requirement I missed that would change this design?' That shows you're not attached to your answer — you're collaborating.

End with a Cliffhanger
After your wrap-up, pause and ask: 'Are there any constraints or requirements you'd like me to reconsider?' This shows humility and a willingness to iterate.
Production Insight
The best interview answers I've seen always close the loop. The candidate didn't just draw boxes — they said 'Here's what could kill this design at 10x scale, and here's how I'd fix it.'
That confidence comes from experience, not memorisation.
Practice this wrap-up until it becomes second nature. I've seen candidates lose an offer because they trailed off at the end. The wrap-up is your final impression — make it count.
Key Takeaway
A strong wrap-up solidifies your architecture.
Summarising SPOFs and scaling plans shows you know the design's weak points.
The final question 'Any constraints?' turns the interview into a collaborative conversation.

Data Partitioning and Sharding Strategies: Consistent Hashing, Range Partitioning, and Rebalancing

When you outgrow a single database instance, partitioning becomes inevitable. The two most common strategies are range partitioning and consistent hashing. Range partitioning splits data by key ranges (e.g., user ID 1-1000 on shard A, 1001-2000 on shard B). It's simple and supports range queries, but hot keys can overload a single shard. Consistent hashing distributes data across a ring using a hash function. It minimizes data movement when nodes join or leave, but range queries become expensive because data is scattered.

In interviews, you need to choose based on query patterns. If you frequently query by user ID range, range partitioning is natural. If you need uniform load and dynamic scaling, consistent hashing wins. Many production systems use a hybrid: consistent hashing with virtual nodes (replicas) to spread load evenly, and secondary indexes for range queries.

Always discuss rebalancing: adding new nodes in a range-partitioned system requires splitting ranges and migrating data. Consistent hashing only moves keys within the affected segment. Systems like Cassandra use virtual nodes to spread token ranges and stream the affected ranges to a new node when it joins; hinted handoff is a separate mechanism that replays writes a replica missed while it was down.

Here's the real-world truth: rebalancing is where most designs break. Say 'I'd use consistent hashing with 150 virtual nodes per physical node to distribute load evenly.' That level of detail signals you've done this before.

io/thecodeforge/design/ConsistentHashRing.java (Java)
package io.thecodeforge.design;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

public class ConsistentHashRing<T> {
    private final TreeMap<Long, T> ring = new TreeMap<>();
    private final MessageDigest md;
    private final int replicas;

    public ConsistentHashRing(int replicas) throws NoSuchAlgorithmException {
        this.md = MessageDigest.getInstance("MD5");
        this.replicas = replicas;
    }

    public void addNode(T node) {
        for (int i = 0; i < replicas; i++) {
            long hash = hash(node.toString() + i);
            ring.put(hash, node);
        }
    }

    public void removeNode(T node) {
        for (int i = 0; i < replicas; i++) {
            long hash = hash(node.toString() + i);
            ring.remove(hash);
        }
    }

    public T getNode(String key) {
        if (ring.isEmpty()) return null;
        long hash = hash(key);
        Map.Entry<Long, T> entry = ring.ceilingEntry(hash);
        if (entry == null) entry = ring.firstEntry();
        return entry.getValue();
    }

    private long hash(String key) {
        md.reset();
        md.update(key.getBytes(java.nio.charset.StandardCharsets.UTF_8));
        byte[] digest = md.digest();
        return ((long) (digest[3] & 0xFF) << 24) | ((long) (digest[2] & 0xFF) << 16) |
               ((long) (digest[1] & 0xFF) << 8) | (digest[0] & 0xFF);
    }

    public static void main(String[] args) throws Exception {
        ConsistentHashRing<String> ring = new ConsistentHashRing<>(3);
        ring.addNode("shard1");
        ring.addNode("shard2");
        ring.addNode("shard3");
        System.out.println(ring.getNode("user1234"));
    }
}
Output (which shard is returned depends on the MD5 hash of the key)
shard2
The Ring of Nodes
  • Each node maps to multiple points on the ring (virtual nodes) for even load distribution.
  • Data is assigned to the nearest clockwise node.
  • Adding or removing a node only affects its neighbours, not the entire ring.
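The claim that adding a node only moves a slice of the keys is easy to check empirically. The sketch below is a compact Python ring with 150 virtual nodes per physical node (the figure suggested earlier in this section) that measures how many of 10,000 keys change owner when a fourth shard joins. The `HashRing` class and the key names are illustrative assumptions.

```python
import bisect
import hashlib


class HashRing:
    def __init__(self, vnodes: int = 150):
        self.vnodes = vnodes
        self._points = []   # sorted hash positions on the ring
        self._owner = {}    # hash position -> physical node

    @staticmethod
    def _hash(value: str) -> int:
        # First 8 bytes of MD5 as an unsigned 64-bit ring position
        return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = node

    def get_node(self, key: str) -> str:
        # First point at or after the key's hash, wrapping around the ring
        idx = bisect.bisect_left(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


keys = [f"user{i}" for i in range(10_000)]
ring = HashRing()
for name in ("shard1", "shard2", "shard3"):
    ring.add_node(name)
before = {k: ring.get_node(k) for k in keys}

ring.add_node("shard4")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"moved {len(moved) / len(keys):.1%} of keys, all to shard4: "
      f"{all(after[k] == 'shard4' for k in moved)}")
```

With four equal nodes you would expect roughly a quarter of the keys to move, and every moved key should land on the new shard. A naive `hash(key) % node_count` scheme would instead reshuffle most keys.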
Production Insight
I've seen a sharded system crash when a hot user took down one shard while the others stayed idle.
Virtual nodes (replicas) spread the load evenly across physical machines.
Rule: always monitor per-shard load distribution and set up alerts for skew. Virtual nodes even out how keys are spread, but a single hot key still lands on one node, so pair them with caching or key splitting for hot users.
Key Takeaway
Partitioning strategy must match query patterns.
Range queries need range partitioning; uniform load needs consistent hashing.
Wrong choice causes hot spots that nullify scaling gains.

Observability and Monitoring: Logging, Metrics, and Distributed Tracing

Most system design descriptions skip observability, but in production you can't fix what you can't see. The three pillars are logging, metrics, and distributed tracing. Logs give you per-request detail but are expensive to store long-term. Metrics (latency, error rates, throughput) give you aggregated health. Distributed tracing connects a single request across multiple services.

In an interview, mentioning you'd integrate Prometheus for metrics, structured logging (JSON), and Jaeger for tracing shows operational maturity. Discuss how you'd monitor the key SLIs: latency (P50, P95, P99), error rate, throughput, and saturation (e.g., CPU, memory, connection pool). Define SLOs and alert on how fast the error budget is burning.
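If the interviewer pushes on what P99 or an SLO alert actually means, it helps to have the arithmetic ready. This is a minimal sketch assuming nearest-rank percentiles and a burn rate defined as observed error rate divided by the error budget (1 minus the SLO). Real monitoring systems usually compute percentiles from histograms instead of raw samples, and the sample numbers here are made up.

```python
import math
from typing import List


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile; p is in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def burn_rate(errors: int, requests: int, slo: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1 - slo
    return (errors / requests) / error_budget


latencies_ms = list(range(1, 101))  # made-up samples: 1 ms .. 100 ms
print(percentile(latencies_ms, 50), percentile(latencies_ms, 95), percentile(latencies_ms, 99))
# Output: 50 95 99

# 50 failed requests out of 10,000 against a 99.9% availability SLO
print(f"burn rate: {burn_rate(50, 10_000, 0.999):.1f}x")
# Output: burn rate: 5.0x
```

A burn rate of 5x means the error budget for the period would be gone in a fifth of the time, which is the kind of signal worth paging on.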

A common mistake is to design for low latency but not instrument to measure it. Add a simple tracing middleware from day one. Without tracing, debugging a 500ms latency spike across 10 services becomes a guessing game.

Here's a concrete snippet: 'I'd set up structured logging with correlation IDs, then pipe logs into a centralised system (ELK). For metrics, I'd expose endpoints for Prometheus and create dashboards showing P99 latency, error rate, and throughput. For tracing, I'd use OpenTelemetry with Jaeger.' That's a senior-level answer.

io/thecodeforge/observability/TracingMiddleware.java (Java)
package io.thecodeforge.observability;

import io.opentracing.Scope;
import io.opentracing.Span;
import io.opentracing.Tracer;
import io.opentracing.util.GlobalTracer;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import java.io.IOException;

public class TracingMiddleware implements Filter {
    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        Tracer tracer = GlobalTracer.get();
        Span span = tracer.buildSpan("api-call")
                .withTag("method", ((HttpServletRequest) request).getMethod())
                .withTag("path", ((HttpServletRequest) request).getRequestURI())
                .start();
        try (Scope ignored = tracer.activateSpan(span)) {
            chain.doFilter(request, response);
        } catch (Exception e) {
            span.setTag("error", true);
            span.log(e.getMessage());
            throw e;
        } finally {
            span.finish();
        }
    }
    // ... other filter methods
}
Output
Every API request now creates a span that can be forwarded to downstream services via headers.
Don't ignore observability in interviews.
Interviewers love when you mention monitoring requirements. It shows you've operated systems in production, not just designed them on whiteboards.
Production Insight
A team once deployed a service that silently dropped 5% of requests for two weeks.
They had no metrics to detect it.
Rule: instrument every service from day one, even before the feature is complete. That 5% drop could have been caught in 5 minutes with a simple error rate dashboard.
Key Takeaway
Observability isn't optional — it's how you know if your design works.
Logging, metrics, and tracing form the three pillars.
Interview tip: say 'I'd add distributed tracing to root-cause failures across services.'

Idempotency and Retry Strategies: Preventing Double Charges and Data Corruption

In distributed systems, network failures are inevitable. A client sends a request, the server processes it, but the acknowledgment is lost. The client retries — and suddenly you have two orders, two payments, two emails. That's why idempotency is not optional.

Idempotency means performing the same operation multiple times produces the same result. For write operations, use an idempotency key: a unique token generated by the client and sent with the request. The server stores the result keyed by that token; if it sees the same key again, it returns the stored response without executing the operation again.

In interviews, mention idempotency early. Say 'Every write operation will include an idempotency key. The client generates a UUID, sends it with the request. The server deduplicates by that key.' This signals you've built payment systems.

For retries, use exponential backoff with jitter — never retry instantly. A naive retry can bring down a struggling service. Say 'I'd use exponential backoff with a base of 1 second, doubling each time, capped at 30 seconds, plus random jitter (±20%) to spread retries.'
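The backoff policy described above (base of 1 second, doubling, capped at 30 seconds, ±20% jitter) fits in a few lines. Treat this as a sketch: `backoff_delays` and `call_with_retry` are hypothetical helper names, and retrying on every `Exception` is a simplification, since real clients usually retry only on timeouts and other transient errors.

```python
import random
import time


def backoff_delays(max_retries: int = 5, base: float = 1.0,
                   cap: float = 30.0, jitter: float = 0.2):
    """Yield one delay per retry: base * 2**n, capped, then spread by +/- jitter."""
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        yield delay * random.uniform(1 - jitter, 1 + jitter)


def call_with_retry(func, max_attempts: int = 5, sleep=time.sleep):
    """Call func, retrying on any exception until max_attempts is reached."""
    delays = backoff_delays(max_retries=max_attempts - 1)
    while True:
        try:
            return func()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error
            sleep(delay)


# Show the schedule without actually sleeping
print([round(d, 2) for d in backoff_delays(max_retries=7)])
```

The jitter is applied after the cap, so a capped delay can land anywhere between 24 and 36 seconds. Retries like this are only safe when the operation being retried is idempotent, which is why the two topics belong together.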

io/thecodeforge/design/IdempotencyMiddleware.py (Python)
import uuid
from functools import wraps

from flask import Flask, jsonify, make_response, request

app = Flask(__name__)
idempotency_store = {}  # In production, use Redis with TTL

def idempotent(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        key = request.headers.get('Idempotency-Key')
        if not key:
            return jsonify({'error': 'Missing Idempotency-Key header'}), 400
        if key in idempotency_store:
            # Replay the stored body and status instead of executing again
            body, status = idempotency_store[key]
            return jsonify(body), status
        # make_response turns the view's (body, status) tuple into a Response object
        response = make_response(f(*args, **kwargs))
        # Note: check-then-store is not atomic; with Redis, reserve the key first (e.g. SET NX)
        idempotency_store[key] = (response.get_json(), response.status_code)
        return response
    return decorated

@app.route('/order', methods=['POST'])
@idempotent
def create_order():
    # Process order (deduplicated by idempotency key)
    return jsonify({'order_id': str(uuid.uuid4()), 'status': 'created'}), 201
Output
POST /order with Idempotency-Key header returns same result on retry
Bring up idempotency without being asked
In any design that involves external calls or financial transactions, say 'I'd make this operation idempotent using a client-generated key.' That single sentence can turn a 'good' answer into a 'strong hire'.
Production Insight
I've debugged a payment gateway that processed double charges on 0.1% of transactions — costing the company $100k before we added idempotency keys.
The fix: a simple 'Idempotency-Key' header with a Redis-based dedup cache.
If you've ever touched payments, you know idempotency is table stakes. Mention it or lose the offer.
Key Takeaway
Idempotency prevents duplicate actions when retries happen.
Use client-generated keys and server-side dedup storage.
Retry with exponential backoff + jitter, not instant retries.
Production incident · Post-mortem · Severity: high

The Cache Stampede That Cost a Weekend

Symptom
Users saw empty timelines or 503 errors during peak hours. Database CPU hit 100%. P95 latency > 30 seconds.
Assumption
The cache TTL was short (5 minutes) to keep timelines fresh. The team assumed the database could absorb the resulting cache-miss traffic of 20k QPS.
Root cause
Cache stampede: when many keys expire at once, all requests go to the database simultaneously. The DB connection pool saturated, causing cascading failures across services.
Fix
Implement locking around cache rebuild (only one request fetches from the DB while the others wait) and add jitter to TTLs to avoid simultaneous expiry. The team also scaled read replicas and added a CDN for static content.
Key lesson
  • Always model your cache miss rate — a 50% spike can kill your database.
  • Never trust default TTL values: they're designed for demo, not production.
  • In an interview, discussing cache stampede shows depth — say 'I'd use a mutex around recompute and stagger TTLs.'
  • Always include cache warm-up during deployment to avoid cold cache stampede.
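The fix described in this post-mortem, a mutex around the rebuild plus jittered TTLs, can be sketched in-process as follows. `StampedeSafeCache` and its `loader` callback are hypothetical names, the 300-second TTL mirrors the 5-minute TTL from the incident, and the ±10% jitter is an assumption. Across several application servers you would need a distributed lock or request coalescing instead of `threading.Lock`.

```python
import random
import threading
import time


class StampedeSafeCache:
    def __init__(self, loader, ttl: float = 300.0, jitter: float = 0.1):
        self._loader = loader            # hypothetical function: key -> value from the DB
        self._ttl = ttl
        self._jitter = jitter
        self._data = {}                  # key -> (value, expires_at)
        self._key_locks = {}             # key -> lock guarding that key's rebuild
        self._guard = threading.Lock()   # protects _key_locks

    def _lock_for(self, key):
        with self._guard:
            return self._key_locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            # Double-check: another thread may have rebuilt the entry while we waited
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            # Jittered TTL so keys written together do not expire together
            ttl = self._ttl * random.uniform(1 - self._jitter, 1 + self._jitter)
            self._data[key] = (value, time.monotonic() + ttl)
            return value


db_calls = []

def slow_loader(key):
    db_calls.append(key)
    time.sleep(0.05)  # simulate a slow database query
    return f"value-for-{key}"

cache = StampedeSafeCache(slow_loader)
results = []
threads = [threading.Thread(target=lambda: results.append(cache.get("timeline:1")))
           for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(results)} requests served with {len(db_calls)} database call(s)")
```

Twenty concurrent requests for the same cold key result in a single database call; without the per-key lock each of them would have queried the database.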
Production debug guide
Symptom → Action mapping for the most common real-world failures that mirror interview mistakes (5 entries)
Symptom · 01
Database CPU spikes to 100% during traffic bursts
Fix
Check whether the cache miss rate is spiking. Refresh hot keys before they expire with a background job (refresh-ahead), or use a read-through cache with jittered TTLs and a lock around rebuilds.
Symptom · 02
API latency increases linearly with user count
Fix
Identify the bottleneck: likely an unscaled database query. Add a read replica or shard by user ID. Alternatively, add a cache layer for hot data.
Symptom · 03
Writes are slow, reads are fast
Fix
You have a read-optimised system for a write-heavy workload. Switch to an append-only log (like Kafka) and process writes asynchronously. Consider NoSQL if writes are high volume.
Symptom · 04
Service A can't reach Service B even though both are healthy
Fix
Suspect a network partition, a common cause of production incidents that rarely comes up in interviews. Implement circuit breakers and fallback responses. A service mesh (e.g., Istio) can help with resilient routing.
Symptom · 05
P99 latency increases after adding a cache layer
Fix
Check cache hit ratio. If low, the cache is not warming properly. Pre-warm cache or adjust eviction policy. Also check for cache stampede.
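For Symptom 04, the circuit breaker with a fallback response could be sketched like this. It is a simplified, single-threaded illustration; the class name, thresholds, and `fallback` callback are hypothetical, and production libraries add locking, metrics, and finer-grained half-open handling.

```python
import time


class CircuitBreaker:
    """Fail fast to a fallback after repeated errors, then probe again later."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: skip the remote call entirely and serve the fallback.
                return fallback()
            # Timeout elapsed (half-open): let one trial call through below.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, Service A stops waiting on timeouts to Service B and returns a degraded response (for example, cached data) immediately.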
★ Quick Reference: Three Number Checks That Save Your Design
When you're in an interview and need to validate your scaling decisions, run through these three back-of-envelope calculations. If you can't justify one, your design is likely flawed.
You proposed a monolithic database without any caching or sharding
Immediate action
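Estimate QPS and data size. For example, 1 billion reads/month is roughly 385 QPS on average (1,000,000,000 / (30 × 86,400)). As a rule of thumb, a single well-tuned MySQL instance can serve on the order of 10k simple reads per second, so a monolith is fine at this load. It stops being fine once peak traffic or 10x growth pushes you toward that ceiling.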
Commands
QPS = (total requests per day / 86400) * peak traffic factor (usually 2-3x)
Storage = average object size * number of objects (users * tweets * etc.)
Fix now
If QPS > 10k, add caching (Redis) and/or read replicas. If writes > 1k QPS, consider horizontal sharding or NoSQL.
You chose SQL when the workload is write-heavy (90/10 write/read ratio)
Immediate action
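Reconsider the trade-off: a single-primary SQL database funnels every write through one node, and transactional guarantees plus index maintenance add cost to each write. A write-heavy system (like a logging service) is usually better served by an append-only log or NoSQL.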
Commands
Calculate write QPS: if you expect 100 million writes/day = ~1150 writes/second (peak ~2300).
Check if your schema has indexes that slow writes. Each index adds ~50% write overhead.
Fix now
If write QPS > 5000, use Cassandra or ScyllaDB for write-heavy sharded systems. Or batch writes into a log (Kafka) before sinking to DB.
You forgot to discuss fault tolerance and recovery
Immediate action
Assume one component fails completely. What degrades? Do you lose data? How do you recover?
Commands
List single points of failure: load balancer, database master, message queue broker.
For each SPOF, ask: 'Is this stateless? Can it be replicated? What's the RTO?'
Fix now
Make load balancers redundant (active-passive with keepalived), use multi-AZ for databases, and implement leader election for critical services.
You designed a system with synchronous communication for every request
Immediate action
Identify synchronous calls that could be async. Any service that doesn't need an immediate response should be decoupled.
Commands
List all inter-service calls. Mark which ones must be synchronous (e.g., user authentication) and which can be async (e.g., email notification).
For async candidates, introduce a message queue (Kafka, RabbitMQ) and design event-driven flows.
Fix now
Decouple at least one critical path: e.g., after order placement, the confirmation email can be sent asynchronously.
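The back-of-envelope formulas in this quick reference can be checked with a few lines of arithmetic. The sketch below uses the article's own figures (1 billion reads/month, 100 million writes/day, a 2x peak factor); the function names and the storage example (500 million objects of 1 KB each) are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def average_qps(requests_per_day):
    """Average queries per second for a given daily request volume."""
    return requests_per_day / SECONDS_PER_DAY


def peak_qps(requests_per_day, peak_factor=2.0):
    """Average QPS scaled by a peak traffic factor (usually 2-3x)."""
    return average_qps(requests_per_day) * peak_factor


def storage_bytes(object_count, avg_object_bytes):
    """Raw storage estimate, before replication and index overhead."""
    return object_count * avg_object_bytes


reads_per_day = 1_000_000_000 / 30                 # 1 billion reads per month
print(round(average_qps(reads_per_day)))           # 386
print(round(average_qps(100_000_000)))             # 1157 writes/second on average
print(round(peak_qps(100_000_000, 2.0)))           # 2315 writes/second at peak
print(storage_bytes(500_000_000, 1_000) / 1e9)     # 500.0 (GB, hypothetical dataset)
```

In an interview, rounding these to one significant figure (about 400 QPS, about 1,200 writes/second) is enough; what matters is which order of magnitude the design has to survive.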
Key Technology Trade-Offs in System Design
Concept | When to Use | Core Trade-off
SQL (PostgreSQL/MySQL) | Structured data, ACID compliance required | Harder to scale horizontally (sharding is complex)
NoSQL (Cassandra/DynamoDB) | Unstructured data, massive write throughput | Eventual consistency; lacks complex joins
Caching (Redis/Memcached) | Read-heavy workloads with frequent access | Complexity in cache invalidation (stale data)
Load Balancing | Distributing traffic across multiple nodes | Introduces a single point of failure (SPOF) if not redundant
Message Queue (Kafka/RabbitMQ) | Decoupling producers from consumers, buffering writes | Adds latency and operational complexity

Key takeaways

1. System design is about the 'Why' (trade-offs) more than the 'What' (tools).
2. Always start with the API signatures and Data Schema before scaling components.
3. Master the CAP Theorem: understand why you cannot have Consistency, Availability, and Partition Tolerance simultaneously.
4. Practice whiteboarding your thoughts: the 'human touch' of explaining your logic matters more than a perfect diagram.
5. Back-of-envelope calculations are your best friend. Memorise the order-of-magnitude numbers for common components.
6. Failure mode analysis is the single most under-discussed topic in interviews. Address it explicitly.
7. A structured wrap-up (SPOFs + scaling plan) leaves the strongest final impression.
8. Data partitioning is a deep topic: bring it up in every design to show depth.
9. Observability is the difference between a toy system and a production system.
10. Idempotency is not optional: discuss it for every write operation to prevent data corruption.

Common mistakes to avoid

Jumping into drawing boxes before defining functional and non-functional requirements

Symptom
The design looks good on the whiteboard but fails under real constraints — e.g., designed for 100 users, not 100 million.
Fix
Spend the first 3-5 minutes clarifying: what are the core features? What is the expected scale (DAU, QPS, storage)? What are the latency and consistency requirements?

Failing to estimate scale (QPS, storage, bandwidth)

Symptom
Interviewer asks 'Why did you choose this database?' and you can't justify the choice with numbers — the design feels random.
Fix
Always run a back-of-envelope calculation: QPS = (DAU × requests per user per day) / 86400 × peak factor. Have rough benchmarks ready: single MySQL ~1k writes/s, Redis ~100k reads/s, Kafka ~1M events/s.

Ignoring the 'Failure Mode': Never assume a network call succeeds

Symptom
Your design has a single load balancer and a single database master. When one fails, the entire system goes down.
Fix
Identify single points of failure (SPOF) in your design and make them redundant. Discuss circuit breakers, retries with exponential backoff, and graceful degradation (e.g., show cached data when backend is down).

Choosing technologies without trade-off analysis

Symptom
Candidate says 'We'll use Kafka' without explaining why it's better than RabbitMQ or SQS for this specific use case.
Fix
Always explain the trade-off: 'Kafka gives us durability and replayability at the cost of higher operational overhead. For a simple notification system, SQS might be simpler.'

Over-engineering for scale that doesn't exist

Symptom
Candidate designs a distributed sharded NoSQL system for a startup with 1000 users, adding unnecessary complexity.
Fix
Start simple: a monolith with a single database. Then identify the scaling bottleneck and suggest the minimal addition (e.g., a cache layer). Interviewers value pragmatic simplicity.

Ignoring data replication and consistency trade-offs

Symptom
Data loss during node failure because replication factor is 1 or consistency level is set too low.
Fix
Configure replication factor >= 3 in production. Use quorum reads/writes for critical data. For eventual consistency, define conflict resolution strategy (e.g., last-write-wins).

Designing without considering network latency between components

Symptom
High end-to-end latency because services are deployed across different regions and each call adds 50–100ms.
Fix
Co-locate dependent services in the same region or availability zone. Use async communication for non-blocking calls. Consider edge caching for read-heavy global workloads.
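The replication advice in the list above (replication factor of at least 3, quorum reads and writes) rests on a simple overlap rule: a read is guaranteed to touch at least one replica that holds the latest acknowledged write when R + W > N. A small sketch, with hypothetical function names:

```python
def quorum_size(replication_factor):
    """Smallest majority of replicas, e.g. 2 of 3 or 3 of 5."""
    return replication_factor // 2 + 1


def reads_overlap_writes(n, w, r):
    """True when every read set of size r intersects every write set of size w."""
    return r + w > n


def tolerated_failures(replication_factor):
    """Replicas that can be down while a majority quorum is still reachable."""
    return replication_factor - quorum_size(replication_factor)
```

With N = 3 and W = R = 2, reads overlap writes and one node can fail without blocking quorum operations. With W = R = 1 the system is faster but a read may land on a replica that has not yet received the write, which is the eventual-consistency case that needs a conflict resolution strategy.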
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: Design a URL shortener. Walk through your requirement gathering, then hi...
Q02 · SENIOR: Design a real-time chat system for 1 billion users. How do you handle me...
Q03 · SENIOR: You need to design a system that processes 10,000 orders per second duri...
Q04 · SENIOR: What's the difference between horizontal scaling and vertical scaling, a...
Q05 · SENIOR: Design a news feed system like Facebook's News Feed. How do you rank and...
Q06 · SENIOR: How would you design a rate limiter for a public API?
Q01 of 06 · SENIOR

Design a URL shortener. Walk through your requirement gathering, then high-level design.

ANSWER
First, clarify: how many URLs per month? (Assume 100 million.) Are we generating short codes or letting users customise them? What's the read/write ratio? (It's read-heavy, ~10:1.)
High-level design: a web server accepts a long URL and returns a short code. The short code is a base-62 string, either random or derived from a unique numeric ID (for example a Snowflake-style ID). Storage: a relational DB for the mapping (short -> long) plus a cache (Redis) for hot URLs.
Scaling: put a CDN in front of the redirect (if the mapping is static), and as traffic grows, add read replicas and eventually shard by short code prefix.
Trade-offs: we chose SQL because we need strong consistency for the mapping. We add a cache to reduce read latency and use consistent hashing to distribute cache load across nodes.
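The base-62 short code mentioned in the answer can be illustrated as below. This is a sketch assuming the code is derived from a unique integer ID; the alphabet ordering and function names are arbitrary choices, and a random-string scheme would instead need a collision check against the database.

```python
import string

# 10 digits + 26 lowercase + 26 uppercase = 62 URL-safe symbols.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n):
    """Convert a non-negative integer ID into a base-62 short code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, remainder = divmod(n, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))


def decode_base62(code):
    """Convert a base-62 short code back into its integer ID."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Seven base-62 characters cover 62^7 (about 3.5 trillion) distinct codes, which at 100 million new URLs per month leaves ample headroom.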
FAQ · 7 QUESTIONS

Frequently Asked Questions

01. What is the single most important part of a System Design interview?
02. How do I handle the 'I don't know' moment in a design round?
03. Should I choose SQL or NoSQL in an interview?
04. How deep should I go into implementation details?
05. How do I estimate QPS / storage without a calculator?
06. How do you handle hot keys in a distributed system?
07. What's the difference between vertical and horizontal scaling for databases?