Mid-level 9 min · March 05, 2026

System Design — Replication Lag Breaks Read-Your-Writes

Q: What is the difference between Scalability and Reliability?

Scalability is the system's ability to handle increased load by adding resources. Reliability is the system's ability to remain functional even when components fail. A system can be scalable but unreliable if its parts break frequently under that load.

Q: When should I use a Message Queue like RabbitMQ or Kafka?

Use a Message Queue to decouple heavy tasks from the user request cycle. For example, when a user uploads a video, return 'Success' immediately and put the video processing task into a queue to be handled asynchronously by a worker service.

Q: What is 'Sticky Sessions' and why are they generally avoided?

Sticky sessions force a specific user to always talk to the same server. While this makes session management easy, it makes scaling hard and load balancing uneven. It's better to use a distributed session store like Redis so any server can handle any user's request.

After posting, users saw comments missing for up to 5 seconds — replication lag broke read-your-writes.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

Last updated: May 24, 2026

⚡Quick Answer

System design decides how your software handles growth, not just user features
Core components: load balancers, caches, databases, and message queues
Horizontal scaling adds machines; vertical scaling adds power to one machine
Caching can cut database load by an order of magnitude for read-heavy workloads, but cache invalidation is the hard part
Biggest mistake: microservices for a 10-user app — start monolithic, split later

✦ Definition~90s read

What is System Design Basics?

Replication lag is the delay between writing data to a primary database node and having that write propagate to replica nodes. In a typical read-your-writes consistency model, a user expects to immediately see data they just wrote — but if your application reads from a replica that hasn't caught up, the user sees stale data, breaking that guarantee.

This is a fundamental tension in distributed databases: you trade consistency for availability and latency. Systems like PostgreSQL with streaming replication, MySQL with asynchronous replication, or managed services like Amazon Aurora all exhibit this behavior under load.

The fix often involves session-level read-after-write consistency (e.g., reading from the primary for a short window), using quorum reads, or accepting eventual consistency where the business logic can tolerate it. If you're building a social feed or analytics dashboard, replication lag might be acceptable; for a banking app or collaborative document editor, it's a hard no.

Plain-English First

Imagine you open a lemonade stand. On day one, you serve 10 customers alone — easy. But what if 10,000 people show up? You'd need more workers, bigger jugs, a way to take orders faster, and maybe a fridge so you're not squeezing lemons from scratch every time. System design is exactly that: planning HOW your software handles the crowd before the crowd arrives. It's the blueprint you draw before you build, not the patch you apply after everything breaks.

Every application you've ever loved — Instagram, Spotify, Google Maps — was built twice. Once in code, and once in architecture. The code handles what the app does. The architecture determines whether it survives Monday morning when a million users show up at once. System design is the discipline of making those architectural decisions deliberately, before production teaches you the hard way. It's the difference between a service that scales gracefully and one that crumbles under its own success.

The problem system design solves isn't a coding problem — it's a coordination problem. A single server can handle a few hundred users just fine. But what happens when you hit ten thousand? A hundred thousand? The naive answer is 'add a bigger server,' but that only buys you time. Real scalability means distributing work intelligently, caching aggressively, tolerating failure gracefully, and keeping data consistent without grinding everything to a halt. Each of those goals pulls in a different direction, and system design is the art of finding the right tension between them.

By the end of this article you'll understand the core building blocks every system is made of — load balancers, caches, databases, and message queues — why each one exists, when to reach for it, and what you give up when you do. You'll be able to look at a system description in an interview or a design doc and immediately start asking the right questions instead of staring at a blank whiteboard.

Why Replication Lag Breaks Read-Your-Writes

System design basics are the foundational patterns and trade-offs that determine how a distributed system behaves under load. Replication lag is one of the first of those trade-offs you will hit: it is the delay between a write landing on the primary node and that write becoming visible on a replica. In a leader-based replication setup, the leader accepts writes and asynchronously propagates changes to followers. A client that writes to the leader and immediately reads from a follower may therefore not see its own write, which is a violation of read-your-writes consistency.

Replication lag ranges from milliseconds to seconds, depending on network conditions, write load, and replica count. Under normal conditions lag often stays below 100 ms, but during a write burst or node failure it can spike to multiple seconds. Asynchronous replication is fast but weak: it sacrifices consistency for availability and low write latency. Strong consistency requires synchronous replication, where each write waits for replicas to acknowledge it, so write latency is bounded by the slowest replica that must confirm.

Use read-your-writes guarantees when user experience depends on immediate feedback after a write — for example, after a user updates their profile or posts a comment. In real systems, this matters because users expect their own actions to be reflected instantly. Without it, they see stale data, get confused, and lose trust. The solution is either to read from the leader after a write, or to use session-level consistency with version stamps.
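The version-stamp idea can be sketched as follows. This is a minimal illustration, assuming every write returns a monotonically increasing version (for example a log sequence number) and that each replica can report the highest version it has applied. The class `VersionStampRouter` and its methods are hypothetical names, not part of any database driver.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical router: a replica is used only if it has caught up to the session's last write. */
public class VersionStampRouter {
    // Highest version each session has written (assumption: versions only increase).
    private final Map<String, Long> lastWrittenVersion = new ConcurrentHashMap<>();

    /** Record the version the leader returned after a successful write. */
    public void recordWrite(String sessionId, long version) {
        lastWrittenVersion.merge(sessionId, version, Long::max);
    }

    /** Returns "replica" if the replica has applied the session's last write, else "leader". */
    public String chooseNode(String sessionId, long replicaAppliedVersion) {
        long required = lastWrittenVersion.getOrDefault(sessionId, 0L);
        return replicaAppliedVersion >= required ? "replica" : "leader";
    }

    public static void main(String[] args) {
        VersionStampRouter router = new VersionStampRouter();
        router.recordWrite("session-1", 42L);
        System.out.println(router.chooseNode("session-1", 40L)); // leader: replica is behind
        System.out.println(router.chooseNode("session-1", 42L)); // replica: caught up
        System.out.println(router.chooseNode("session-2", 10L)); // replica: session never wrote
    }
}
```

Compared with a fixed time window, this sends a read to the leader only while the replica is actually behind, at the cost of tracking a version per session.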

Common Misconception

Replication lag doesn't just cause stale reads — it can cause lost updates if two clients read-before-write on different replicas.

Production Insight

During a Black Friday sale, a shopping cart service saw 2-second replication lag between the leader node and its followers. Users adding items to their cart saw an empty cart on refresh because the read went to a follower.

Symptom: Users report 'my cart keeps losing items' — support tickets spike, cart abandonment rate jumps 15%.

Rule of thumb: After any write whose result the user will see immediately, read from the leader for a window longer than your worst observed replication lag (at least 5 seconds here), or use a session cookie that pins that user's reads to the leader.
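A minimal sketch of the time-window rule, assuming the application can look up when a user last wrote. `LeaderPinningRouter` is a hypothetical helper. The in-memory map is also an assumption: with several application servers, the timestamp would need to live in a shared store or in the session cookie.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper that pins a user's reads to the leader for a fixed window after a write. */
public class LeaderPinningRouter {
    private final Duration pinWindow;
    private final Map<String, Instant> lastWriteAt = new ConcurrentHashMap<>();

    public LeaderPinningRouter(Duration pinWindow) {
        this.pinWindow = pinWindow;
    }

    /** Call after every successful write by this user. */
    public void recordWrite(String userId, Instant now) {
        lastWriteAt.put(userId, now);
    }

    /** True while the user is still inside the pin window and must read from the leader. */
    public boolean mustReadFromLeader(String userId, Instant now) {
        Instant last = lastWriteAt.get(userId);
        return last != null && now.isBefore(last.plus(pinWindow));
    }

    public static void main(String[] args) {
        LeaderPinningRouter router = new LeaderPinningRouter(Duration.ofSeconds(5));
        Instant t0 = Instant.parse("2026-01-01T00:00:00Z");
        router.recordWrite("alice", t0);
        System.out.println(router.mustReadFromLeader("alice", t0.plusSeconds(2))); // true
        System.out.println(router.mustReadFromLeader("alice", t0.plusSeconds(6))); // false
        System.out.println(router.mustReadFromLeader("bob", t0));                  // false
    }
}
```

Passing the current time in as a parameter keeps the routing decision deterministic and easy to test.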

Key Takeaway

Read-your-writes consistency is not guaranteed with async replication — you must explicitly route reads to the leader after a write.

Replication lag is not a bug; it's a trade-off for lower write latency and higher availability.

Always measure p99 replication lag in production — it's the metric that tells you when your consistency guarantees break.
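To make "measure p99" concrete, here is a small sketch that computes a nearest-rank percentile over lag samples. The sample values are invented for illustration. In a real system the samples would come from your database's replication metrics.

```java
import java.util.Arrays;

/** Nearest-rank percentile over replication lag samples, in milliseconds. */
public class LagPercentile {
    public static long percentile(long[] samplesMs, int pct) {
        if (samplesMs.length == 0 || pct < 1 || pct > 100) {
            throw new IllegalArgumentException("need samples and a percentile in 1..100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // Nearest rank: ceil(pct/100 * n), computed in integer arithmetic to avoid rounding surprises.
        int rank = (pct * sorted.length + 99) / 100;
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        long[] lagMs = new long[100];
        Arrays.fill(lagMs, 20);   // hypothetical data: most samples at 20 ms
        lagMs[98] = 800;          // two slow samples during a write burst
        lagMs[99] = 2500;
        System.out.println("p50 = " + percentile(lagMs, 50) + " ms"); // p50 = 20 ms
        System.out.println("p99 = " + percentile(lagMs, 99) + " ms"); // p99 = 800 ms
    }
}
```

The median hides the problem entirely here. Only the tail shows that some reads would miss a write made 800 ms earlier.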

Figure: Replication Lag Breaks Read-Your-Writes Consistency

System Component Flow: From Client to Database

Before diving into individual components, it's crucial to understand how requests flow through a modern distributed system. The typical path starts with a user on a client device and ends with data being read or written to a database, passing through several intermediary layers. Each layer serves a specific purpose: security, routing, caching, and processing. Understanding this flow helps you identify where bottlenecks occur and which component to scale.

The flow usually follows this sequence: Client -> DNS -> Load Balancer -> Application Server -> Cache -> Database. The DNS resolves the domain to an IP address, typically the load balancer's IP. The load balancer distributes traffic across multiple application servers to avoid overloading any single instance. Application servers contain the business logic and may consult a cache (like Redis) to reduce database load. Only when the cache misses do we reach the database (primary or replica). Depending on the request type (read or write), the database interaction differs: writes go to primary, reads can go to replicas if eventual consistency is acceptable.

In high-traffic systems, additional layers like CDNs for static assets and message queues for async tasks are inserted. The exact flow varies by use case, but the core path remains the same. When designing, always map out this flow to ensure you don't miss critical components like health checks or circuit breakers between layers.

io/thecodeforge/flow/request-flow.yamlYAML

# Simplified request flow example (conceptual)
client:
  action: HTTP GET /api/users/123
  headers:
    Host: app.example.com

dns:
  resolution: app.example.com -> 203.0.113.10 (LB VIP)

load_balancer:
  type: L7 (HTTP)
  algorithm: least_connections
  health_check: /health
  target: 10.0.1.5:8080

app_server:
  cache_check:
    query: "GET user:123"
    result: MISS
  database_query:
    query: "SELECT * FROM users WHERE id=123"
    result: user data

response:
  status: 200
  body: {"id":123,"name":"Alice"}

Output

System returns user data within 50ms when cached, ~200ms on cache miss.

Forge Tip: Trace the Full Path

Always trace the complete request path in your architecture diagrams. Missing a single hop (like a missing health check) can lead to cascading failures.

Production Insight

In production, the load balancer is often the first point of failure if misconfigured. We once saw a 5-minute outage because the LB health check interval was too long, allowing a faulty server to stay in pool and serve errors. Rule: set health check intervals to at most 5 seconds with fast failure detection.

Key Takeaway

Every request flows through multiple layers; each is a potential bottleneck. Map the full path early to plan scaling and failure modes.

System Component Flow

Horizontal vs Vertical Scaling: Choosing Your Growth Path

The table below summarizes the key differences between vertical and horizontal scaling.

Feature	Vertical Scaling (Scale Up)	Horizontal Scaling (Scale Out)
Complexity	Low (no code changes)	High (needs LB, distributed logic)
Hardware	Increasing single machine resources	Adding more commodity servers
Availability	Single point of failure	High (other nodes stay up)
Cost	Rises steeply at the high end	Roughly linear and predictable
Limit	Hard ceiling (max hardware)	No fixed ceiling in theory
Latency	Lower (no network hops)	Higher (network overhead)
Data consistency	Simple (single node)	Complex (replication, sharding)

In practice, most mature systems use a hybrid: start with vertical scaling for simplicity, then move to horizontal as traffic grows. Always keep your application stateless — store session data in Redis so you can add or remove servers without session loss.

io/thecodeforge/scaling/ScalingDecision.javaJAVA

package io.thecodeforge.scaling;

/**
 * Simulates the decision logic for scaling strategy.
 */
public class ScalingDecision {
    public static String decideScalingStrategy(int activeUsers, double avgLatencyMs) {
        if (activeUsers < 10_000 && avgLatencyMs < 200) {
            return "VERTICAL: Upgrade instance to 16 vCPU / 64GB RAM";
        } else if (activeUsers < 100_000) {
            return "HORIZONTAL: Add 2 more instances behind LB (target: 5 nodes)";
        } else {
            return "HORIZONTAL + AUTO SCALING: Deploy to Kubernetes with HPA";
        }
    }

    public static void main(String[] args) {
        System.out.println(decideScalingStrategy(5000, 150));   // VERTICAL
        System.out.println(decideScalingStrategy(50000, 300));  // HORIZONTAL
    }
}

Output

VERTICAL: Upgrade instance to 16 vCPU / 64GB RAM

HORIZONTAL: Add 2 more instances behind LB (target: 5 nodes)

The Vertical Ceiling

Even if you can afford a 128-core machine, you cannot scale forever. At some point, the cost and physical limits (rack space, power, cooling) make vertical scaling impractical. Plan for horizontal from day one by designing stateless APIs.

Production Insight

We've seen teams invest in a super-expensive bare-metal server only to hit CPU limits 6 months later. The migration to horizontal scaling took weeks because the app had sticky session dependencies. Rule: design for horizontal scaling from the start, even if you start with one server.

Key Takeaway

Vertical scaling is simple but limited; horizontal scaling is complex but elastic. Start with vertical, but architect for horizontal from day one by keeping services stateless.

Back-of-the-Envelope Estimation: Latency Numbers Every Programmer Should Know

System design interviews and real-world capacity planning often require quick, order-of-magnitude estimates. Knowing typical latency numbers helps you reason about bottlenecks without running benchmarks. These numbers come from the well-known list popularized by Jeff Dean.

Here is the reference table for common operations. The figures are the classic approximations: modern SSDs and networks are faster for some rows (the disk rows describe a spinning disk), but the orders of magnitude and the ratios between tiers are what matter:

Operation	Latency	Operations per second
L1 cache reference	0.5 ns	2,000,000,000
Branch mispredict	5 ns	200,000,000
L2 cache reference	7 ns	~142,857,000
Mutex lock/unlock	100 ns	10,000,000
Main memory reference	100 ns	10,000,000
Compress 1KB with Zippy	10,000 ns (10 μs)	100,000
Send 2KB over 1 Gbps network	20,000 ns (20 μs)	50,000
Read 1MB sequentially from memory	250,000 ns (250 μs)	4,000
Round trip within same datacenter	500,000 ns (500 μs)	2,000
Disk seek	10,000,000 ns (10 ms)	100
Read 1MB sequentially from disk	20,000,000 ns (20 ms)	50
Packet round-trip California to Netherlands	150,000,000 ns (150 ms)	~6.7

How to use these numbers: If you read a key from an in-memory cache, it takes ~100 ns. If the cache misses and you have to seek on disk, it takes 10 ms, which is 100,000x slower. That's why caching is critical. Likewise, a cross-continent round trip (~150 ms) is about 300x slower than a round trip inside one datacenter (~500 μs), so network hops dominate latency. Keep your services in the same region and avoid synchronous cross-region calls in latency-sensitive paths.

Forge Tip: Multiplication Rules

Use these simplified rules: Memory is 100 ns; Network round-trip within same datacenter is 500 μs; Disk I/O is 10 ms; Cross-continent is 150 ms. For back-of-envelope, round to nearest order of magnitude.
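The simplified rules can be turned into a quick calculation. The sketch below uses only the four round numbers from this tip. The example request path (two same-datacenter round trips plus one disk seek) is a hypothetical scenario chosen for illustration.

```java
/** Back-of-the-envelope ratios from the four simplified latency rules. */
public class LatencyEstimate {
    public static final long MEMORY_NS = 100;
    public static final long SAME_DC_RTT_NS = 500_000;
    public static final long DISK_SEEK_NS = 10_000_000;
    public static final long CROSS_CONTINENT_RTT_NS = 150_000_000;

    /** Total latency in milliseconds of a request path made of sequential hops. */
    public static double totalMs(long... hopsNs) {
        long sum = 0;
        for (long hop : hopsNs) {
            sum += hop;
        }
        return sum / 1_000_000.0;
    }

    public static void main(String[] args) {
        System.out.println(DISK_SEEK_NS / MEMORY_NS);                // 100000
        System.out.println(CROSS_CONTINENT_RTT_NS / SAME_DC_RTT_NS); // 300
        // Hypothetical path: cache round trip, database round trip, one disk seek.
        System.out.println(totalMs(SAME_DC_RTT_NS, SAME_DC_RTT_NS, DISK_SEEK_NS)); // 11.0
    }
}
```

Even in this rough form the arithmetic shows where the time goes: the single disk seek accounts for 10 of the 11 ms.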

Production Insight

In production, latency numbers are rarely exactly these textbook values. Factors like network congestion, CPU throttling, and NUMA effects can add 2-10x variability. We once saw a 1000x latency spike because a shared network switch was saturated. Always benchmark your actual hardware under realistic load.

Key Takeaway

Know the latency hierarchy from fastest to slowest: memory (ns), same-DC network (μs), disk (ms), cross-region network (100s of ms). Use it to reason about where to optimize.

The Core Pillar: Scaling from One to a Million Users

In system design, we differentiate between Vertical Scaling (buying a bigger machine) and Horizontal Scaling (buying more machines). While Vertical Scaling is easy, it has a hard ceiling. Horizontal Scaling is the industry standard for modern distributed systems, but it introduces the need for Load Balancing and Data Consistency management.

To handle this, we use a Load Balancer (LB) as the entry point. The LB sits in front of your application servers and conducts traffic based on algorithms like Round Robin or Least Connections. This ensures no single server is overwhelmed while others sit idle.

io/thecodeforge/scaling/LoadBalancerController.javaJAVA

package io.thecodeforge.scaling;

import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * A simplified simulation of a Round-Robin Load Balancing logic.
 * In production, this logic would live in Nginx, HAProxy, or an AWS ALB.
 */
@RestController
public class LoadBalancerController {
    private final String[] serverNodes = {"10.0.0.1", "10.0.0.2", "10.0.0.3"};
    private final AtomicInteger requestCounter = new AtomicInteger(0);

    @GetMapping("/route")
    public String distributeTraffic() {
        // floorMod keeps the index non-negative even after the counter overflows
        int index = Math.floorMod(requestCounter.getAndIncrement(), serverNodes.length);
        String targetNode = serverNodes[index];
        return "[TheCodeForge-LB] Routing request to Node: " + targetNode;
    }
}

Output

[TheCodeForge-LB] Routing request to Node: 10.0.0.2

Forge Tip: The Single Point of Failure

Adding a load balancer solves server congestion, but the LB itself can become a Single Point of Failure (SPOF). In production-grade architecture, always deploy LBs in a 'High Availability' (HA) pair with a floating IP.

Production Insight

Round-robin works fine when all servers have equal capacity and health.

But if one server starts failing, round-robin still sends traffic to it — you need health checks with circuit breakers.

Rule: always pair your LB algorithm with active health probes and automatic node removal.
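A minimal sketch of round-robin combined with node removal, assuming some external health prober calls `markDown` and `markUp`. The class name and those two methods are hypothetical. Real load balancers such as Nginx or HAProxy implement this internally.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Round-robin selection that skips nodes a health prober has marked as down. */
public class HealthAwareRoundRobin {
    private final List<String> nodes;
    private final Set<String> unhealthy = ConcurrentHashMap.newKeySet();
    private final AtomicInteger counter = new AtomicInteger();

    public HealthAwareRoundRobin(List<String> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    public void markDown(String node) { unhealthy.add(node); }

    public void markUp(String node) { unhealthy.remove(node); }

    /** Returns the next healthy node, or empty if every node is down. */
    public Optional<String> next() {
        for (int attempts = 0; attempts < nodes.size(); attempts++) {
            int index = Math.floorMod(counter.getAndIncrement(), nodes.size());
            String candidate = nodes.get(index);
            if (!unhealthy.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        HealthAwareRoundRobin lb =
                new HealthAwareRoundRobin(List.of("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        lb.markDown("10.0.0.2"); // the prober reported a failed health check
        for (int i = 0; i < 4; i++) {
            System.out.println(lb.next().orElse("no healthy nodes"));
        }
        // prints 10.0.0.1, 10.0.0.3, 10.0.0.1, 10.0.0.3
    }
}
```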

Key Takeaway

Horizontal scaling removes the single-machine ceiling but requires a load balancer.

The load balancer itself must be redundant — never a single point of failure.

Scale by adding nodes, not by buying a bigger server.

Database Scaling and The CAP Theorem

When scaling databases, you'll eventually face the CAP Theorem: when a network partition occurs, a distributed system has to give up either Consistency or Availability (often summarized as "pick two of three" among Consistency, Availability, and Partition Tolerance). For a global system, we often use 'Read Replicas' to handle heavy traffic. We write to a 'Primary' node and read from 'Secondary' nodes. This improves performance but introduces 'Eventual Consistency', where a user might not see their own post for anywhere from a few milliseconds to several seconds while data synchronizes.

io/thecodeforge/db/ReplicaSetup.sqlSQL

-- TheCodeForge: Simulating a Primary-Replica split at the query level
-- Primary Node (Write Operations)
INSERT INTO users (username, bio) VALUES ('dev_forge', 'Building the future of tech');

-- Replica Node (Read Operations - Scaled Horizontally)
-- This allows us to handle 10,000+ simultaneous read requests without taxing the primary
SELECT * FROM users WHERE username = 'dev_forge' /* read_from_replica_01 */;

Output

Query executed successfully on read-replica-node-01.

The Replication Lag Trap

Never assume data is instantly available across all nodes. If your application logic requires 'Read-After-Write' consistency (e.g., updating a password), ensure that specific read is routed to the Primary node.

Production Insight

Replication lag isn't constant. A spike in writes can delay replicas by seconds.

We once had a booking system where a user still saw 'pending confirmation' after booking because the read replica hadn't picked up the write.

Rule: identify critical read-after-write paths and route them to primary.

Key Takeaway

CAP theorem forces a trade-off: you cannot have perfect consistency and availability during a partition.

In practice, many systems choose availability and eventual consistency for reads.

But read-your-writes consistency must be explicitly handled.

The CAP Theorem Triangle: Consistency, Availability, Partition Tolerance

The CAP theorem, formulated by Eric Brewer, states that a distributed data store can only provide two of the following three guarantees simultaneously: Consistency (every read receives the most recent write or an error), Availability (every request receives a non-error response, without guarantee that it contains the most recent write), and Partition Tolerance (the system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes).

In practice, network partitions are inevitable, so you must choose between CP (Consistency + Partition Tolerance) and AP (Availability + Partition Tolerance). A CP system, like HBase or a traditional RDBMS with synchronous replication, will reject writes or become unavailable during a partition to ensure consistency. An AP system, like Cassandra or DynamoDB in their default configurations, will accept writes and serve possibly stale reads during a partition, but remains available. Some stores also let you choose the consistency level per operation, so the mode can vary by request.

The triangle visualization helps you understand the trade-offs. Most modern systems are configured AP for reads (eventual consistency) and CP for writes that require strong consistency, but you must explicitly design for that split.

io/thecodeforge/cap/CAPExample.javaJAVA

package io.thecodeforge.cap;

/**
 * Illustrating CAP trade-off in a key-value store scenario.
 * Assume a partition has occurred between two data centers.
 */
public class CAPExample {
    enum Mode { CP, AP }

    public static String handleWrite(String key, String value, Mode mode) {
        if (mode == Mode.CP) {
            // Try synchronous write to all replicas; if one fails, reject
            if (!writeToAllReplicas(key, value)) {
                return "ERROR: Write rejected to maintain consistency";
            }
            return "OK: Write consistent across all nodes";
        } else { // AP
            // Write to local replica and accept eventual consistency
            writeToLocalReplica(key, value);
            return "OK: Write accepted, may be stale on other replicas";
        }
    }

    private static boolean writeToAllReplicas(String key, String value) {
        // Simulate partition: return false for remote DC
        return false;
    }

    private static void writeToLocalReplica(String key, String value) {
        // Local write succeeds
    }

    public static void main(String[] args) {
        System.out.println(handleWrite("user:1", "name=Alice", Mode.CP));
        System.out.println(handleWrite("user:1", "name=Alice", Mode.AP));
    }
}

Output

ERROR: Write rejected to maintain consistency

OK: Write accepted, may be stale on other replicas

Forge Tip: CAP is a Trade-off, Not a Rule

You can mix CAP choices per operation. Use strong consistency for critical financial data and eventual consistency for user profiles. Design your data layer to support both modes.

Production Insight

In production, partitions are rare but devastating when they happen. We once had a network split between two AWS regions that caused a 45-minute inconsistency window on user balances. The fix was to use quorum-based reads and writes (W=3, R=2) with Cassandra to balance consistency and availability. Quorum reads see the latest acknowledged write only when R + W exceeds the replication factor. Rule: always test your system under artificial network partitions using Chaos Engineering.

Key Takeaway

CAP theorem forces a choice between consistency and availability during a partition. Most systems choose AP for reads and CP for writes that require strong consistency, using techniques like quorum and read-repair.
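The quorum technique rests on one inequality: with N replicas, a read of R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. A small sketch of that check follows. The class name is hypothetical, and real stores add further mechanisms such as read repair that this sketch ignores.

```java
/** Checks the quorum overlap condition R + W > N for a replicated store. */
public class QuorumCheck {
    /** True when every read quorum shares at least one replica with every write quorum. */
    public static boolean readsOverlapWrites(int n, int w, int r) {
        if (n < 1 || w < 1 || r < 1 || w > n || r > n) {
            throw new IllegalArgumentException("require 1 <= w, r <= n");
        }
        return r + w > n;
    }

    public static void main(String[] args) {
        System.out.println(readsOverlapWrites(3, 2, 2)); // true: classic quorum on 3 replicas
        System.out.println(readsOverlapWrites(3, 1, 1)); // false: fast, but reads may be stale
        System.out.println(readsOverlapWrites(4, 3, 2)); // true: 3 + 2 > 4
    }
}
```

Lowering W or R buys latency and availability. The check makes explicit at what point that stops guaranteeing that a read sees the latest acknowledged write.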

CAP Theorem Triangle

Caching: The Fast Lane to Performance

Caching stores frequently accessed data in a faster storage layer (memory) so that repeated requests avoid expensive database calls. Common caching patterns include Cache-Aside (application checks cache first), Read-Through (cache loads from DB on miss), and Write-Through (each write goes to the cache and the DB together). One of the most widely used caches is Redis, an in-memory data store that typically answers in under a millisecond.

Caching isn't free. You trade memory for speed, and you introduce the problem of cache invalidation — keeping the cache in sync with the source of truth. A stale cache can serve outdated data, sometimes breaking business logic.

io/thecodeforge/cache/CacheService.javaJAVA

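package io.thecodeforge.cache;

import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Service;

@Service
public class CacheService {
    // UserRepository and UserProfile are application types assumed to exist elsewhere.
    private final UserRepository database;

    public CacheService(UserRepository database) {
        this.database = database;
    }

    @Cacheable(value = "userCache", key = "#userId")
    public UserProfile getUserProfile(String userId) {
        // On a cache miss this method runs and Spring stores its return value in userCache.
        return database.findUserById(userId);
    }
}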

Output

First call: database hit. Subsequent calls: cache hit (sub-millisecond).

Cache Invalidation Is Hard

Stale data is the #1 caching failure. Use TTLs, write-through, or event-driven invalidation (e.g., Kafka + Redis pub/sub) to keep cache fresh.

Production Insight

Cache miss storms happen when a popular item TTL expires and all requests hit the database at once – called 'thundering herd'.

Fix: use a mutex or 'early recompute' for critical cache keys.

Rule: always set a reasonable TTL and consider refresh-ahead.
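The mutex fix can be sketched with a per-key single-flight load. This is a minimal in-process illustration that relies on `ConcurrentHashMap.computeIfAbsent`, which applies the loader at most once per missing key. The class name is hypothetical, there is no TTL handling, and a cache shared across servers would need a distributed lock instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** In-process cache where concurrent misses on the same key trigger only one load. */
public class SingleFlightCache<V> {
    private final ConcurrentHashMap<String, V> cache = new ConcurrentHashMap<>();
    private final Function<String, V> loader;

    public SingleFlightCache(Function<String, V> loader) {
        this.loader = loader;
    }

    /** Callers racing on a missing key wait while exactly one of them runs the loader. */
    public V get(String key) {
        return cache.computeIfAbsent(key, loader);
    }

    /** Drop a key so the next get() reloads it (stands in for TTL expiry). */
    public void invalidate(String key) {
        cache.remove(key);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger dbCalls = new AtomicInteger();
        SingleFlightCache<String> cache = new SingleFlightCache<>(key -> {
            dbCalls.incrementAndGet(); // stands in for an expensive database query
            return "profile-for-" + key;
        });
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> cache.get("user:123"));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("database calls: " + dbCalls.get()); // database calls: 1
    }
}
```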

Key Takeaway

Caching reduces latency from hundreds of milliseconds to sub-millisecond.

But cache invalidation and thundering herds are the real production enemies.

Use TTLs, write-through, and careful key design to stay safe.

Load Balancers: Beyond Round-Robin

Load balancers distribute incoming traffic across multiple servers. But the algorithm matters. Round-Robin is simple but assumes equal server capacity and health. Least Connections sends requests to the server with fewest active connections — better for variable-length requests. IP Hash can enable session persistence without sticky cookies, by hashing the client IP to a specific server.
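As an illustration of the Least Connections rule, the sketch below tracks active connections per server and always picks the smallest count. The class is hypothetical and single-process. A real load balancer maintains these counters from its own connection table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Picks the server with the fewest active connections; ties go to the first configured server. */
public class LeastConnectionsPicker {
    // Insertion order is kept so tie-breaking is predictable.
    private final Map<String, Integer> active = new LinkedHashMap<>();

    public LeastConnectionsPicker(String... servers) {
        for (String server : servers) {
            active.put(server, 0);
        }
    }

    /** Chooses a server and counts the new connection against it. */
    public synchronized String acquire() {
        String best = null;
        for (Map.Entry<String, Integer> entry : active.entrySet()) {
            if (best == null || entry.getValue() < active.get(best)) {
                best = entry.getKey();
            }
        }
        if (best == null) {
            throw new IllegalStateException("no servers configured");
        }
        active.merge(best, 1, Integer::sum);
        return best;
    }

    /** Call when a request finishes so the server's count drops again. */
    public synchronized void release(String server) {
        active.computeIfPresent(server, (name, count) -> Math.max(0, count - 1));
    }

    public static void main(String[] args) {
        LeastConnectionsPicker picker = new LeastConnectionsPicker("node1", "node2");
        String first = picker.acquire();      // node1 (tie, first configured wins)
        String second = picker.acquire();     // node2 (node1 already has one connection)
        picker.release(first);                // the request on node1 finished quickly
        System.out.println(first + " " + second + " " + picker.acquire()); // node1 node2 node1
    }
}
```

Unlike round-robin, the third request goes back to node1 because it is idle, while node2 is still busy with a long request.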

In production, LBs also perform health checks (active and passive), SSL termination, and rate limiting. They can be L4 (TCP/UDP) or L7 (HTTP/HTTPS). L7 allows content-based routing, e.g., forwarding /api/ to one pool and /static/ to another.

io/thecodeforge/lb/haproxy.cfgYAML

# HAProxy configuration example for L7 load balancing
global
  log /dev/log local0

defaults
  mode http
  timeout connect 5000ms
  timeout client 50000ms
  timeout server 50000ms

frontend http-in
  bind *:80
  acl api_request path_beg /api
  use_backend api-servers if api_request
  default_backend web-servers

backend api-servers
  balance leastconn
  option httpchk GET /health
  server node1 10.0.1.1:8080 check fall 3 rise 2
  server node2 10.0.1.2:8080 check fall 3 rise 2

backend web-servers
  balance roundrobin
  server web1 10.0.2.1:3000 check
  server web2 10.0.2.2:3000 check

Output

HAProxy shows backend health on statistics page, routes API to least-loaded node.

Load Balancer as a Traffic Cop

Round-Robin: sends each car to the next lane in order.
Least Connections: sends to the lane with fewest cars waiting.
Health checks: cop blocks lanes with accidents (failed servers).
SSL termination: cop handles the toll booth so servers don't have to.

Production Insight

If health checks are too lenient, backend servers that are partially failing (high latency, random errors) remain in the pool.

We saw a case where a memory-leaking server passed basic health but crashed every hour.

Rule: use deep health checks (e.g., a small endpoint that verifies DB connectivity) and set aggressive timeouts.

Key Takeaway

Choose load balancing algorithm based on workload: round-robin for uniform requests, leastconn for variable length.

Always configure health checks with appropriate failure thresholds.

Consider L7 routing to split traffic logically.

Message Queues: Decouple and Scale Asynchronously

Message queues (e.g., RabbitMQ, Kafka, SQS) decouple producers from consumers. When a user uploads a video, the web server returns immediately and enqueues a task saying 'process video'. A separate worker picks that task, does the heavy lifting, and updates the database. This pattern keeps the web server responsive even under load.

Key concepts: Producer sends messages to a queue; Consumer pulls them; Queue buffers during spikes. Kafka adds the idea of log-based persistence and replayability. The trade-off is complexity: you now have at-least-once delivery, idempotency, and ordering challenges.

io/thecodeforge/queue/VideoUploadController.javaJAVA

package io.thecodeforge.queue;

import org.springframework.web.bind.annotation.*;
import org.springframework.amqp.rabbit.core.RabbitTemplate;

@RestController
@RequestMapping("/videos")
public class VideoUploadController {
    private final RabbitTemplate rabbitTemplate;

    public VideoUploadController(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    @PostMapping("/upload")
    public String upload(@RequestBody VideoUploadRequest req) {
        rabbitTemplate.convertAndSend("video.processing", req);
        return "Upload accepted, video is being processed.";
    }
}

Output

Immediate response: "Upload accepted" — user doesn't wait for encoding.

Forge Tip: At-Least-Once & Idempotency

Most message systems guarantee at-least-once delivery. If your worker processes a message twice, it must be idempotent — doing it again has no side effect. Use a dedup table or idempotency keys in the consumer.
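A minimal sketch of an idempotent consumer using an idempotency key. The in-memory set stands in for the dedup table mentioned above and is an assumption made to keep the example self-contained. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Consumer that ignores redelivered messages by remembering processed message ids. */
public class IdempotentConsumer {
    // Assumption: a durable dedup table in production; an in-memory set for this sketch.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger videosEncoded = new AtomicInteger();

    /** Returns true if the message was processed, false if it was a duplicate delivery. */
    public boolean handle(String messageId, String videoPath) {
        if (!processedIds.add(messageId)) {
            return false; // add() returns false when the id was already recorded
        }
        videosEncoded.incrementAndGet(); // stands in for the real side effect (encoding)
        return true;
    }

    public int videosEncoded() {
        return videosEncoded.get();
    }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer();
        System.out.println(consumer.handle("msg-1", "/uploads/a.mp4")); // true
        System.out.println(consumer.handle("msg-1", "/uploads/a.mp4")); // false (redelivery)
        System.out.println(consumer.handle("msg-2", "/uploads/b.mp4")); // true
        System.out.println(consumer.videosEncoded());                   // 2
    }
}
```

In a real consumer the dedup record and the side effect should be committed together, for example in one database transaction. Otherwise a crash between the two steps either loses the work or repeats it.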

Production Insight

If your consumer falls behind, the queue grows. That is fine, since buffering is the queue's job, but if messages wait longer than the queue's TTL or retention period, they are dropped.

We once saw a queue grow to 50 million messages because a downstream database was too slow.

Rule: monitor queue depth and set alerts. Always have dead-letter queues for poison pills.

Key Takeaway

Message queues protect your web server from heavy tasks.

At-least-once delivery requires idempotent consumers.

Monitor queue depth and always handle poison messages gracefully.

API Gateways: The Bouncer Your Microservices Need

Your microservices shouldn't expose their internal ports to the internet. That's a security audit waiting to happen. An API gateway sits as a single entry point. It handles authentication, rate limiting, request routing, and aggregation. Why? Because every service shouldn't need to reimplement OAuth or worry about DDoS attacks. You route all external traffic through the gateway. It parses the JWT, checks the rate limit in Redis, then forwards the request to the correct service. This is not 'nice to have'. It's how you enforce security boundaries. Without it, you'll end up with auth logic spread across 40 microservices. And when a vulnerability hits, you'll play whack-a-mole instead of fixing it in one place.

ApiGatewayFilter.javaJAVA

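// io.thecodeforge.gateway
import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;
import java.util.concurrent.TimeUnit;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class AuthRateLimitFilter extends ZuulFilter {
    private final StringRedisTemplate rateLimiter; // string serialization so Redis INCR works

    public AuthRateLimitFilter(StringRedisTemplate rateLimiter) {
        this.rateLimiter = rateLimiter;
    }

    @Override public String filterType() { return "pre"; } // run before routing
    @Override public int filterOrder() { return 1; }
    @Override public boolean shouldFilter() { return true; }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        String clientId = ctx.getRequest().getHeader("X-Client-Id");
        String key = "ratelimit:" + clientId;
        // increment returns a Long; a missing key is created with value 1
        Long count = rateLimiter.opsForValue().increment(key, 1);
        if (count != null && count == 1) {
            rateLimiter.expire(key, 1, TimeUnit.MINUTES); // start a fixed 1-minute window
        }
        if (count != null && count > 100) { // 100 req/min
            ctx.setResponseStatusCode(429);
            ctx.setResponseBody("{\"error\":\"rate limit exceeded\"}");
            ctx.setSendZuulResponse(false);
        }
        return null;
    }
}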

Output

HTTP/1.1 429 Too Many Requests

{"error":"rate limit exceeded"}

Production Trap:

Don't put business logic in the gateway. It becomes a monolith in disguise. Keep it thin: auth, routing, rate limiting. Nothing else.

Key Takeaway

An API gateway centralizes cross-cutting concerns. It's not a proxy for lazy service decomposition.

Content Delivery Networks: Stop Serving Static Assets from Your App Server

Serving images, CSS, and JavaScript from your application server is wasteful. Your app server should compute. Static assets should be cached at the edge. A CDN (Content Delivery Network) caches your static files on servers worldwide. When a user in Tokyo requests your logo, they get it from a CDN server in Tokyo, not your origin server in Virginia. That removes a cross-ocean round trip, typically on the order of 100 ms or more, from every asset request. It also offloads your application server: it handles fewer static requests, so it has more capacity for the critical ones. Why are you still sending 500KB of JavaScript through your Rails controller? Configure your CDN with a cache-control header of one year for immutable assets. Then use content hashes in filenames to bust the cache on deploy.

CdnConfig.javaJAVA

// io.thecodeforge.cdn
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.ResourceHandlerRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;

@Configuration
public class CdnConfig implements WebMvcConfigurer {
    @Override
    public void addResourceHandlers(ResourceHandlerRegistry registry) {
        // public/static is fingerprinted by build tool
        registry.addResourceHandler("/static/**")
                .addResourceLocations("classpath:/public/static/")
                .setCachePeriod(31536000); // 1 year in seconds
    }
}

Output

Static resources served with max-age=31536000. Browsers and CDNs cache aggressively.

Production Trap:

Cache invalidation is not instant. If you overwrite a file without changing its name, expect stale assets for up to the TTL. Always use content-addressable filenames.
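A minimal sketch of content-addressable filenames, assuming the build step can read each asset's bytes. `AssetFingerprint` and its methods are hypothetical names for illustration, not part of Spring or any CDN SDK, and the 8-character hash prefix is an arbitrary choice. Most build tools do this for you; the sketch only shows the idea.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: derives a cache-busting filename from the file's content.
final class AssetFingerprint {
    private AssetFingerprint() {}

    /** Turns "app.js" into "app.<first 8 hex chars of SHA-256>.js". */
    static String fingerprint(String fileName, byte[] content) {
        String hash = sha256Hex(content).substring(0, 8);
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0) {
            return fileName + "." + hash; // no extension: append the hash
        }
        return fileName.substring(0, dot) + "." + hash + fileName.substring(dot);
    }

    /** Lowercase hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }

    public static void main(String[] args) {
        byte[] v1 = "console.log('v1');".getBytes(StandardCharsets.UTF_8);
        byte[] v2 = "console.log('v2');".getBytes(StandardCharsets.UTF_8);
        System.out.println(fingerprint("app.js", v1));
        System.out.println(fingerprint("app.js", v2)); // changed content, changed name
    }
}
```

Because the name changes whenever the bytes change, a one-year max-age is safe: a new deploy references a new URL, and the old cached copy is simply never requested again.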

Key Takeaway

CDNs are for static assets. They reduce latency and offload your origin server. Don't make application code the path every user takes to a static file; let it act only as the origin the CDN pulls from.

● Production incident · POST-MORTEM · severity: high

The Replication Lag That Broke Read-Your-Writes

Symptom

Users reported that after posting a comment, the page showed the comment missing for up to 5 seconds. The app used read replicas.

Assumption

Team assumed eventual consistency was acceptable. They didn't test read-after-write scenarios.

Root cause

Writes went to the primary, but subsequent reads hit a replica that had not yet replicated the write. Nothing guaranteed read-your-writes consistency.

Fix

For any session that just performed a write, route all subsequent reads for that user to the primary database for 10 seconds, or use an intermediate cache.
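The routing rule in the fix can be sketched roughly as follows. This is an illustration under assumptions, not the code from the incident: `ReadYourWritesRouter`, `recordWrite`, and `routeRead` are hypothetical names, and the in-memory map only works if a user's requests land on the same app instance. With stateless servers you would keep the last-write timestamp in a cookie or a shared store instead.

```java
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical router: pins a user's reads to the primary for a window after a write.
final class ReadYourWritesRouter {
    enum Target { PRIMARY, REPLICA }

    private final Map<String, Long> lastWriteMillis = new ConcurrentHashMap<>();
    private final long pinWindowMillis;
    private final LongSupplier clock; // injected so the window can be tested

    ReadYourWritesRouter(Duration pinWindow, LongSupplier clock) {
        this.pinWindowMillis = pinWindow.toMillis();
        this.clock = clock;
    }

    /** Call after every successful write made on behalf of this user. */
    void recordWrite(String userId) {
        lastWriteMillis.put(userId, clock.getAsLong());
    }

    /** Reads go to the primary while the user's last write is inside the window. */
    Target routeRead(String userId) {
        Long lastWrite = lastWriteMillis.get(userId);
        if (lastWrite == null) {
            return Target.REPLICA;
        }
        if (clock.getAsLong() - lastWrite < pinWindowMillis) {
            return Target.PRIMARY;
        }
        lastWriteMillis.remove(userId, lastWrite); // window expired; drop the entry
        return Target.REPLICA;
    }

    public static void main(String[] args) {
        ReadYourWritesRouter router =
                new ReadYourWritesRouter(Duration.ofSeconds(10), System::currentTimeMillis);
        System.out.println(router.routeRead("alice")); // no recent write
        router.recordWrite("alice");
        System.out.println(router.routeRead("alice")); // inside the 10-second window
    }
}
```

The window should be longer than the worst replication lag you have measured. Here the 10 seconds from the fix comfortably covers the 5-second lag users reported.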

Key lesson

Eventual consistency is dangerous for features that require immediate read-after-write correctness.
Profile your API calls to identify which ones need strong consistency, and route them accordingly.
Never assume replication lag is negligible — measure it under load.

Production debug guide · Spot and fix common scaling failures before users notice · 5 entries

Symptom · 01

API response times spike during peak hours

→

Fix

Check database query times, then cache hit ratio. Likely cause: missing index or cache eviction storm.

Symptom · 02

Intermittent 503 Service Unavailable

→

Fix

Inspect connection pool exhaustion and thread pool saturation. Increase pool size or add more server instances.

Symptom · 03

Users see stale or inconsistent data across regions

→

Fix

Measure replication lag from primary to replicas. Consider read-after-write routing or use a distributed cache with TTL.

Symptom · 04

Single server goes down and app is unreachable

→

Fix

Verify load balancer health checks are configured and that you have at least two servers in the pool.

Symptom · 05

CPU on one server is 100% while others idle

→

Fix

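Check the load balancer algorithm: sticky sessions or uneven distribution are the usual causes. Disable stickiness unless you need it, and prefer least-connections when request costs vary (plain round-robin is fine when they are uniform).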
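For the last symptom, here is a rough sketch of how a least-connections choice works. Real load balancers expose this as a configuration option, so `LeastConnectionsBalancer` and the server names are purely hypothetical and exist only to show why one busy server stops receiving new work.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-process balancer that tracks in-flight requests per server.
final class LeastConnectionsBalancer {
    private final Map<String, Integer> inFlight = new LinkedHashMap<>();

    LeastConnectionsBalancer(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        for (String s : servers) {
            inFlight.put(s, 0);
        }
    }

    /** Picks the server with the fewest in-flight requests; ties go to list order. */
    synchronized String acquire() {
        String best = null;
        for (Map.Entry<String, Integer> e : inFlight.entrySet()) {
            if (best == null || e.getValue() < inFlight.get(best)) {
                best = e.getKey();
            }
        }
        inFlight.merge(best, 1, Integer::sum);
        return best;
    }

    /** Call when the request finishes so the server's count drops again. */
    synchronized void release(String server) {
        inFlight.computeIfPresent(server, (s, n) -> Math.max(0, n - 1));
    }

    public static void main(String[] args) {
        LeastConnectionsBalancer lb =
                new LeastConnectionsBalancer(List.of("app-1", "app-2", "app-3"));
        lb.acquire(); // app-1 takes a long-running request and never releases it here
        for (int i = 0; i < 3; i++) {
            String server = lb.acquire(); // the busy server is skipped
            System.out.println(server);
            lb.release(server);
        }
    }
}
```

Round-robin would keep handing every third request to the server stuck on slow work. Counting in-flight requests lets the balancer route around it.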

★ System Design Cheat Sheet for Debugging · Quick commands and fixes for the top three production scaling issues.

Database query times > 500ms

Immediate action

Check slow query log and execution plan.

Commands

EXPLAIN ANALYZE SELECT ... (PostgreSQL, MySQL 8.0.18+)

SHOW INDEX FROM table_name; (MySQL; in PostgreSQL's psql, use \d table_name)

Fix now

Add missing index or rewrite query to avoid full table scan.

Cache miss rate > 40%

Connection pool exhaustion

Vertical vs Horizontal Scaling

Feature	Vertical Scaling (Scale Up)	Horizontal Scaling (Scale Out)
Complexity	Low (No code changes needed)	High (Requires Load Balancer & Distributed Logic)
Hardware	Increasing CPU/RAM on one box	Adding more standard commodity servers
Availability	SPOF (If server dies, app dies)	High (Other nodes stay up if one fails)
Cost	Rises steeply (non-linearly) at the high end	Linear and more predictable

Key takeaways

Scalability is not just about size; it's about the ability to handle growth gracefully without total architectural rewrites.

Caching is your best friend for performance—use Redis or Memcached to store 'hot' data and avoid expensive database hits.

State is the enemy of scaling. Keep your application servers 'stateless' so you can add or remove them at will without losing user sessions.

The CAP theorem is a hard constraint in distributed systems: during a network partition you must choose between consistency and availability, so accepting eventual consistency is often the price of high availability.

Common mistakes to avoid

3 patterns

Premature Optimization

Symptom

Building a distributed microservices architecture for a product with 10 users leads to high complexity and slow feature delivery.

Fix

Start monolithic but design with modularity so you can split services later when the user base grows.

Ignoring Latency

Symptom

Adding too many network hops (e.g., LB -> App -> Cache -> DB) without measuring round-trip time (RTT), causing slow responses.

Fix

Profile network hops early. Use a CDN for static assets, co-locate services in the same region, and batch requests where possible.

Hard-coding IPs

Symptom

Services can no longer find each other when servers are replaced or scaled. Manual IP updates are error-prone and cause outages.

Fix

Use service discovery (Consul, Eureka) or internal DNS names. Never reference IPs directly in config or code.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

You are designing a service like 'URL Shortener' (TinyURL). How would yo...

Q02 · SENIOR

Explain the trade-offs between a NoSQL (e.g., MongoDB) and a Relational ...

Q03 · SENIOR

How does a Content Delivery Network (CDN) reduce latency for global user...

Q01 of 03 · SENIOR

You are designing a service like 'URL Shortener' (TinyURL). How would you handle 100,000 requests per second with low latency?

ANSWER

Key components: a database for the short-to-long URL mappings (a wide-column store such as Cassandra, or distributed SQL), sharded with consistent hashing; hot URLs cached in Redis with a TTL; and a CDN in front of redirects. For write-heavy workloads, consider pre-generating keys or using a distributed ID generator (e.g., Snowflake). Scale horizontally with stateless web servers behind a load balancer that also enforces rate limiting.
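To make the sharding part of that answer concrete, here is a small consistent-hash ring. It is a sketch under assumptions: `ConsistentHashRing`, the shard names, the MD5-based hash, and the count of 100 virtual nodes are all illustrative choices, not something the answer prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical ring: maps short-URL keys to shards via virtual nodes.
final class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places the node at `virtualNodes` positions to smooth the distribution. */
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    /** Walks clockwise from the key's hash to the first virtual node, wrapping around. */
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** First 8 bytes of the MD5 digest, used only for placement, not for security. */
    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every JVM", e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addNode("shard-a");
        ring.addNode("shard-b");
        ring.addNode("shard-c");
        System.out.println("abc123 -> " + ring.nodeFor("abc123"));
    }
}
```

The property worth mentioning in an interview: when a shard is removed, only the keys that lived on it move. Every other key keeps its shard, which is what makes resharding cheap compared with `hash(key) % N`.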

FAQ · 3 QUESTIONS

Frequently Asked Questions

What is the difference between Scalability and Reliability?

When should I use a Message Queue like RabbitMQ or Kafka?

What is 'Sticky Sessions' and why are they generally avoided?
