System Design — Replication Lag Breaks Read-Your-Writes
After posting, users saw comments missing for up to 5 seconds — replication lag broke read-your-writes.
20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.
- System design decides how your software handles growth, not just user features
- Core components: load balancers, caches, databases, and message queues
- Horizontal scaling adds machines; vertical scaling adds power to one machine
- Caching can cut database load by an order of magnitude, but cache invalidation is the hard part
- Biggest mistake: microservices for a 10-user app — start monolithic, split later
Imagine you open a lemonade stand. On day one, you serve 10 customers alone — easy. But what if 10,000 people show up? You'd need more workers, bigger jugs, a way to take orders faster, and maybe a fridge so you're not squeezing lemons from scratch every time. System design is exactly that: planning HOW your software handles the crowd before the crowd arrives. It's the blueprint you draw before you build, not the patch you apply after everything breaks.
Every application you've ever loved — Instagram, Spotify, Google Maps — was built twice. Once in code, and once in architecture. The code handles what the app does. The architecture determines whether it survives Monday morning when a million users show up at once. System design is the discipline of making those architectural decisions deliberately, before production teaches you the hard way. It's the difference between a service that scales gracefully and one that crumbles under its own success.
The problem system design solves isn't a coding problem — it's a coordination problem. A single server can handle a few hundred users just fine. But what happens when you hit ten thousand? A hundred thousand? The naive answer is 'add a bigger server,' but that only buys you time. Real scalability means distributing work intelligently, caching aggressively, tolerating failure gracefully, and keeping data consistent without grinding everything to a halt. Each of those goals pulls in a different direction, and system design is the art of finding the right tension between them.
By the end of this article you'll understand the core building blocks every system is made of — load balancers, caches, databases, and message queues — why each one exists, when to reach for it, and what you give up when you do. You'll be able to look at a system description in an interview or a design doc and immediately start asking the right questions instead of staring at a blank whiteboard.
Why Replication Lag Breaks Read-Your-Writes
System design basics are the foundational patterns and trade-offs that determine how a distributed system behaves under load. Replication lag is one of them: the delay between a write being committed on the primary node and that write becoming visible on a replica. In a leader-based replication setup, the leader accepts writes and asynchronously propagates changes to followers. A client that writes to the leader and immediately reads from a follower may therefore not see its own write, which violates read-your-writes consistency.
Replication lag ranges from milliseconds to seconds, depending on network conditions, load, and replica count. Under normal conditions lag often stays below 100 ms, but during a traffic burst or node failure it can spike to multiple seconds. The key property is that asynchronous replication is fast but weak: it trades consistency for availability and low write latency. Strong consistency requires synchronous replication, where each write waits for replicas to acknowledge it, so write latency is bounded below by the slowest replica that must respond.
Use read-your-writes guarantees when user experience depends on immediate feedback after a write — for example, after a user updates their profile or posts a comment. In real systems, this matters because users expect their own actions to be reflected instantly. Without it, they see stale data, get confused, and lose trust. The solution is either to read from the leader after a write, or to use session-level consistency with version stamps.
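The version-stamp approach can be sketched as follows. This is a minimal illustration, not a production design: `Node`, `Cluster`, and `Session` are hypothetical in-memory stand-ins for real database nodes, and the version stamp is a plain counter rather than a real log position.

```python
class Node:
    """In-memory stand-in for a database node (hypothetical, for illustration)."""

    def __init__(self):
        self.data = {}
        self.version = 0  # highest write version applied on this node


class Cluster:
    def __init__(self):
        self.leader = Node()
        self.replica = Node()
        self._pending = []  # writes not yet applied on the replica

    def write(self, key, value):
        """Apply a write on the leader and return its version stamp."""
        self.leader.version += 1
        self.leader.data[key] = value
        self._pending.append((self.leader.version, key, value))
        return self.leader.version

    def replicate(self):
        """Simulate asynchronous replication catching up."""
        for version, key, value in self._pending:
            self.replica.data[key] = value
            self.replica.version = version
        self._pending.clear()

    def read(self, key, min_version=0):
        """Use the replica only if it has caught up to min_version."""
        if self.replica.version >= min_version:
            return self.replica.data.get(key), "replica"
        return self.leader.data.get(key), "leader"  # fall back to the leader


class Session:
    """Remembers the last version this client wrote (session consistency)."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.last_written = 0

    def post(self, key, value):
        self.last_written = self.cluster.write(key, value)

    def get(self, key):
        return self.cluster.read(key, min_version=self.last_written)


cluster = Cluster()
session = Session(cluster)
session.post("comment:1", "hello")
value_before, source_before = session.get("comment:1")  # replica is behind
cluster.replicate()
value_after, source_after = session.get("comment:1")    # replica caught up
print(value_before, source_before, value_after, source_after)
```

Only the session that wrote pays the cost of a leader read, and only until the replica catches up. Other clients keep reading from the replica and accept eventual consistency.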
System Component Flow: From Client to Database
Before diving into individual components, it's crucial to understand how requests flow through a modern distributed system. The typical path starts with a user on a client device and ends with data being read or written to a database, passing through several intermediary layers. Each layer serves a specific purpose: security, routing, caching, and processing. Understanding this flow helps you identify where bottlenecks occur and which component to scale.
The flow usually follows this sequence: Client -> DNS -> Load Balancer -> Application Server -> Cache -> Database. The DNS resolves the domain to an IP address, typically the load balancer's IP. The load balancer distributes traffic across multiple application servers to avoid overloading any single instance. Application servers contain the business logic and may consult a cache (like Redis) to reduce database load. Only when the cache misses do we reach the database (primary or replica). Depending on the request type (read or write), the database interaction differs: writes go to primary, reads can go to replicas if eventual consistency is acceptable.
In high-traffic systems, additional layers like CDNs for static assets and message queues for async tasks are inserted. The exact flow varies by use case, but the core path remains the same. When designing, always map out this flow to ensure you don't miss critical components like health checks or circuit breakers between layers.
Horizontal vs Vertical Scaling: Choosing Your Growth Path
The table below summarizes the key differences between vertical and horizontal scaling.
| Feature | Vertical Scaling (Scale Up) | Horizontal Scaling (Scale Out) |
|---|---|---|
| Complexity | Low (no code changes) | High (needs LB, distributed logic) |
| Hardware | Increasing single machine resources | Adding more commodity servers |
| Availability | Single point of failure | High (other nodes stay up) |
| Cost | Rises steeply at the high end | Roughly linear and predictable |
| Limit | Hard ceiling (max hardware) | No hard ceiling in theory |
| Latency | Lower (no network hops) | Higher (network overhead) |
| Data consistency | Simple (single node) | Complex (replication, sharding) |
In practice, most mature systems use a hybrid: start with vertical scaling for simplicity, then move to horizontal as traffic grows. Always keep your application stateless — store session data in Redis so you can add or remove servers without session loss.
Back-of-the-Envelope Estimation: Latency Numbers Every Programmer Should Know
System design interviews and real-world capacity planning often require quick, order-of-magnitude estimates. Knowing typical latency numbers helps you reason about bottlenecks without running benchmarks. These numbers are based on the famous Jeff Dean talk and have been updated for modern hardware.
Here is the reference table for common operations (approximate, 2026-era hardware):
| Operation | Latency | Approx. operations per second |
|---|---|---|
| L1 cache reference | 0.5 ns | 2,000,000,000 |
| Branch mispredict | 5 ns | 200,000,000 |
| L2 cache reference | 7 ns | 142,857,143 |
| Mutex lock/unlock | 100 ns | 10,000,000 |
| Main memory reference | 100 ns | 10,000,000 |
| Compress 1KB with Zippy | 10,000 ns (10 μs) | 100,000 |
| Send 2KB over 1 Gbps network | 20,000 ns (20 μs) | 50,000 |
| Read 1MB sequentially from memory | 250,000 ns (250 μs) | 4,000 |
| Round trip within same datacenter | 500,000 ns (500 μs) | 2,000 |
| Disk seek | 10,000,000 ns (10 ms) | 100 |
| Read 1MB sequentially from disk | 20,000,000 ns (20 ms) | 50 |
| Packet round-trip California to Netherlands | 150,000,000 ns (150 ms) | 6.7 |
How to use these numbers: If you read a key from an in-memory cache, it takes ~100 ns. If that cache misses and you have to seek on disk, it's ~10 ms, which is 100,000x slower. That's why caching is critical. Likewise, a cross-continent round trip (~150 ms) is about 300x slower than a round trip within the same datacenter (~500 μs), so network hops dominate latency. Keep your services in the same region and avoid synchronous cross-region calls in latency-sensitive paths.
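The same arithmetic can be written down as a short script. This is a sketch that reuses four rows from the table above; the helper names (`slowdown`, `ops_per_second`, `average_read_ns`) and the 99% hit ratio are illustrative assumptions, not measurements.

```python
NS = 1
US = 1_000 * NS
MS = 1_000 * US

# Values copied from the reference table (order-of-magnitude estimates).
LATENCY_NS = {
    "main_memory_reference": 100 * NS,
    "datacenter_round_trip": 500 * US,
    "disk_seek": 10 * MS,
    "cross_continent_round_trip": 150 * MS,
}


def slowdown(slow, fast):
    """How many times slower one operation is than another."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


def ops_per_second(op):
    """How many sequential operations of this kind fit into one second."""
    return 1_000_000_000 / LATENCY_NS[op]


def average_read_ns(hit_ratio):
    """Expected read latency for an in-memory cache backed by disk."""
    hit = LATENCY_NS["main_memory_reference"]
    miss = LATENCY_NS["disk_seek"]
    return hit_ratio * hit + (1 - hit_ratio) * miss


print(f"{slowdown('disk_seek', 'main_memory_reference'):,.0f}x")  # 100,000x
print(f"{slowdown('cross_continent_round_trip', 'datacenter_round_trip'):,.0f}x")  # 300x
print(f"{ops_per_second('disk_seek'):,.0f} disk seeks per second")
print(f"{average_read_ns(0.99):,.0f} ns average read at a 99% hit ratio")
```

Note how the last figure is dominated by the 1% of misses: even a very good hit ratio leaves the average roughly a thousand times slower than a pure memory read.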
The Core Pillar: Scaling from One to a Million Users
In system design, we differentiate between Vertical Scaling (buying a bigger machine) and Horizontal Scaling (buying more machines). While Vertical Scaling is easy, it has a hard ceiling. Horizontal Scaling is the industry standard for modern distributed systems, but it introduces the need for Load Balancing and Data Consistency management.
To handle this, we use a Load Balancer (LB) as the entry point. The LB sits in front of your application servers and directs traffic using algorithms like Round Robin or Least Connections. This ensures no single server is overwhelmed while others sit idle.
Database Scaling and The CAP Theorem
When scaling databases, you'll eventually face the CAP Theorem: of Consistency, Availability, and Partition Tolerance, you can only guarantee two at once. For a global system, we often use 'Read Replicas' to handle heavy read traffic. We write to a 'Primary' node and read from 'Secondary' nodes. This improves performance but introduces 'Eventual Consistency', where a user might not see their own post for anywhere from a few milliseconds to several seconds while data synchronizes.
The CAP Theorem Triangle: Consistency, Availability, Partition Tolerance
The CAP theorem, formulated by Eric Brewer, states that a distributed data store can only provide two of the following three guarantees simultaneously: Consistency (every read receives the most recent write or an error), Availability (every request receives a non-error response, without guarantee that it contains the most recent write), and Partition Tolerance (the system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes).
In practice, network partitions are inevitable, so you must choose between CP (Consistency + Partition Tolerance) and AP (Availability + Partition Tolerance). A CP system, like HBase or a traditional RDBMS with synchronous replication, will reject writes or become unavailable during a partition to ensure consistency. An AP system, like Cassandra or DynamoDB, will accept writes and serve possibly stale reads during a partition, but remains available. Some databases also let you choose the consistency level per operation, for example Cassandra's tunable consistency levels or DynamoDB's strongly consistent reads.
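The difference between the two choices shows up most clearly on a read during a partition. The toy model below is a sketch under simplifying assumptions: `Replica` and `PartitionError` are hypothetical names, and a single boolean flag stands in for a real network partition.

```python
class PartitionError(Exception):
    """Raised by a CP replica that cannot confirm it has the latest data."""


class Replica:
    def __init__(self, mode):
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.partitioned = False  # True when cut off from the primary

    def apply(self, value):
        """Replication from the primary; it only arrives when connected."""
        if not self.partitioned:
            self.value = value

    def read(self):
        if self.partitioned and self.mode == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise PartitionError("cannot guarantee latest value")
        # AP: answer anyway, possibly with stale data.
        return self.value


cp, ap = Replica("CP"), Replica("AP")
for replica in (cp, ap):
    replica.apply("v1")        # replicated before the partition
    replica.partitioned = True
    replica.apply("v2")        # lost: the replica is cut off

print(ap.read())               # the AP replica serves the stale "v1"
try:
    cp.read()
except PartitionError as err:
    print("CP replica refused:", err)
```

Neither behavior is wrong. The AP replica keeps the feature working with old data, while the CP replica surfaces an error the caller has to handle.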
The CAP triangle is a way to visualize these trade-offs. Many systems mix both modes: they serve ordinary reads from replicas with eventual consistency and send operations that need strong consistency to the primary, but you must explicitly design for that split.
Caching: The Fast Lane to Performance
Caching stores frequently accessed data in a faster storage layer (memory) so that repeated requests avoid expensive database calls. Common caching patterns include Cache-Aside (the application checks the cache first and loads from the DB on a miss), Read-Through (the cache itself loads from the DB on a miss), and Write-Through (every write goes to the cache and the DB together). One of the most widely used caches is Redis, an in-memory data store with sub-millisecond latency.
Caching isn't free. You trade memory for speed, and you introduce the problem of cache invalidation — keeping the cache in sync with the source of truth. A stale cache can serve outdated data, sometimes breaking business logic.
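A minimal Cache-Aside sketch with TTL expiry and invalidate-on-write might look like this. It assumes a plain dict in place of the database and a local dict in place of Redis; the `CacheAside` class and its `db_reads` counter are hypothetical names introduced for illustration.

```python
import time


class CacheAside:
    def __init__(self, db, ttl_seconds=60.0, clock=time.monotonic):
        self.db = db          # dict standing in for the database
        self.ttl = ttl_seconds
        self.clock = clock
        self._cache = {}      # key -> (value, expires_at)
        self.db_reads = 0     # counts how often we fall through to the DB

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]   # cache hit
        # Miss or expired entry: load from the source of truth and repopulate.
        self.db_reads += 1
        value = self.db.get(key)
        self._cache[key] = (value, self.clock() + self.ttl)
        return value

    def put(self, key, value):
        self.db[key] = value          # write to the source of truth first
        self._cache.pop(key, None)    # then invalidate the cached copy


db = {"user:1": "Ada"}
store = CacheAside(db)
store.get("user:1")
store.get("user:1")                   # served from cache
print(store.db_reads)                 # 1

store.put("user:1", "Grace")          # invalidates the cached entry
print(store.get("user:1"))            # Grace

db["user:1"] = "changed behind the cache"
print(store.get("user:1"))            # still Grace until the TTL expires
```

The last call is the invalidation problem in miniature: any write that bypasses `put` leaves a stale entry until the TTL runs out, so the TTL is effectively your upper bound on staleness.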
Load Balancers: Beyond Round-Robin
Load balancers distribute incoming traffic across multiple servers. But the algorithm matters. Round-Robin is simple but assumes equal server capacity and health. Least Connections sends requests to the server with fewest active connections — better for variable-length requests. IP Hash can enable session persistence without sticky cookies, by hashing the client IP to a specific server.
In production, LBs also perform health checks (active and passive), SSL termination, and rate limiting. They can be L4 (TCP/UDP) or L7 (HTTP/HTTPS). L7 allows content-based routing, e.g., forwarding /api/ to one pool and /static/ to another.
Think of the load balancer as a traffic cop directing cars (requests) into lanes (servers):
- Round-Robin: the cop sends each car to the next lane in order.
- Least Connections: the cop sends each car to the lane with the fewest cars waiting.
- Health checks: the cop blocks lanes with accidents (failed servers).
- SSL termination: the cop handles the toll booth so servers don't have to.
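The three selection algorithms described in this section can be sketched in a few lines. This is an illustration of the selection logic only: the class names and server labels are hypothetical, and a real load balancer would also handle health checks, timeouts, and concurrency.

```python
import hashlib
import itertools


class RoundRobin:
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)  # next server in order, wrapping around


class LeastConnections:
    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def pick(self):
        # Choose the server with the fewest in-flight requests.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1  # call when the request finishes


def ip_hash(client_ip, servers):
    """Map a client IP to the same server every time (session persistence)."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


servers = ["app-1", "app-2", "app-3"]

rr = RoundRobin(servers)
print([rr.pick() for _ in range(4)])  # ['app-1', 'app-2', 'app-3', 'app-1']

lc = LeastConnections(servers)
first_three = [lc.pick() for _ in range(3)]  # one request on each server
lc.release("app-2")                          # app-2 finishes early
print(lc.pick())                             # app-2

print(ip_hash("203.0.113.7", servers) == ip_hash("203.0.113.7", servers))  # True
```

One caveat with the simple modulo in `ip_hash`: adding or removing a server remaps most clients, which is why consistent hashing is often used when the pool changes frequently.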
Message Queues: Decouple and Scale Asynchronously
Message queues (e.g., RabbitMQ, Kafka, SQS) decouple producers from consumers. When a user uploads a video, the web server returns immediately and enqueues a task saying 'process video'. A separate worker picks that task, does the heavy lifting, and updates the database. This pattern keeps the web server responsive even under load.
Key concepts: Producer sends messages to a queue; Consumer pulls them; Queue buffers during spikes. Kafka adds the idea of log-based persistence and replayability. The trade-off is complexity: you now have at-least-once delivery, idempotency, and ordering challenges.
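With at-least-once delivery the same message can arrive twice, so the consumer has to be idempotent. A common way to get there is to deduplicate on a message ID, sketched below with Python's standard `queue` module standing in for a real broker. The message shape (`id`, `payload`) and the `Worker` class are assumptions for illustration.

```python
import queue


class Worker:
    def __init__(self):
        self.processed_ids = set()  # in production this would be durable storage
        self.results = []

    def handle(self, message):
        """Process a message once; return False if it was a duplicate."""
        if message["id"] in self.processed_ids:
            return False            # redelivery: skip without side effects
        self.results.append(f"processed {message['payload']}")
        self.processed_ids.add(message["id"])
        return True


tasks = queue.Queue()
for message in [
    {"id": "m1", "payload": "video-1"},
    {"id": "m2", "payload": "video-2"},
    {"id": "m1", "payload": "video-1"},  # broker redelivers m1
]:
    tasks.put(message)

worker = Worker()
while not tasks.empty():
    worker.handle(tasks.get())

print(worker.results)  # ['processed video-1', 'processed video-2']
```

In a real system the side effect and the "processed" marker should be committed together (for example in one database transaction), otherwise a crash between the two steps reintroduces the duplicate.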
API Gateways: The Bouncer Your Microservices Need
Your microservices shouldn't expose their internal ports to the internet. That's a security audit waiting to happen. An API gateway sits as a single entry point. It handles authentication, rate limiting, request routing, and aggregation. Why? Because every service shouldn't need to reimplement OAuth or worry about DDoS attacks. You route all external traffic through the gateway. It parses the JWT, checks the rate limit in Redis, then forwards the request to the correct service. This is not 'nice to have'. It's how you enforce security boundaries. Without it, you'll end up with auth logic spread across 40 microservices. And when a vulnerability hits, you'll play whack-a-mole instead of fixing it in one place.
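The authenticate, rate-limit, route sequence can be sketched as a single function. Everything named here is a stand-in: `VALID_TOKENS` replaces real JWT verification, the in-memory `FixedWindowLimiter` replaces a Redis-backed counter, and the routes and service names are hypothetical.

```python
import time

ROUTES = {"/users": "user-service", "/orders": "order-service"}  # hypothetical
VALID_TOKENS = {"token-abc": "alice"}  # stand-in for verifying a JWT


class FixedWindowLimiter:
    """Allow at most `limit` requests per client per time window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._counts = {}  # (client, window index) -> request count

    def allow(self, client):
        bucket = (client, int(self.clock() // self.window))
        self._counts[bucket] = self._counts.get(bucket, 0) + 1
        return self._counts[bucket] <= self.limit


def handle(path, token, limiter):
    """Return (status code, target service) for an incoming request."""
    user = VALID_TOKENS.get(token)
    if user is None:
        return 401, None              # authentication happens once, here
    if not limiter.allow(user):
        return 429, None              # rate limit enforced before routing
    for prefix, service in ROUTES.items():
        if path.startswith(prefix):
            return 200, service       # forward to the matching service
    return 404, None


limiter = FixedWindowLimiter(limit=2, window_seconds=60, clock=lambda: 0.0)
print(handle("/users/42", "token-abc", limiter))   # (200, 'user-service')
print(handle("/orders/7", "bad-token", limiter))   # (401, None)
print(handle("/orders/7", "token-abc", limiter))   # (200, 'order-service')
print(handle("/users/42", "token-abc", limiter))   # (429, None)
```

The services behind the gateway never see the token check or the rate limiter, which is the point: a fix to either lands in one place.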
Content Delivery Networks: Stop Serving Static Assets from Your App Server
Serving images, CSS, and JavaScript from your application server is wasteful. Your app server should compute. Static assets should be cached at the edge. A CDN (Content Delivery Network) caches your static files on servers worldwide. When a user in Tokyo requests your logo, they get it from a CDN server in Tokyo, not your origin server in Virginia. This reduces latency by orders of magnitude. It also offloads your application server. The server handles fewer requests, so it handles more critical ones. Why are you still sending 500KB of JavaScript through your Rails controller? Configure your CDN with a cache-control header of one year for immutable assets. Then use content hashes in filenames to bust the cache on deploy.
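The content-hash naming scheme and the one-year header can be sketched with the standard library. The helper names `hashed_name` and `cache_headers` and the 12-character digest length are illustrative choices; build tools usually do this step for you.

```python
import hashlib
from pathlib import PurePosixPath

ONE_YEAR_SECONDS = 365 * 24 * 60 * 60  # 31,536,000


def hashed_name(filename, content):
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:12]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


def cache_headers():
    """Headers for assets whose URL changes whenever their content changes."""
    return {"Cache-Control": f"public, max-age={ONE_YEAR_SECONDS}, immutable"}


old = hashed_name("static/app.js", b"console.log('v1');")
new = hashed_name("static/app.js", b"console.log('v2');")
print(old)            # static/app.<12 hex chars>.js
print(old != new)     # True: a new deploy produces a new URL
print(cache_headers())
```

Because the URL changes with the content, the long cache lifetime is safe: browsers and CDN edges never need to revalidate an old file, they simply stop being asked for it.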
The Replication Lag That Broke Read-Your-Writes
- Eventual consistency is dangerous for features that require immediate read-after-write correctness.
- Profile your API calls to identify which ones need strong consistency, and route them accordingly.
- Never assume replication lag is negligible — measure it under load.
EXPLAIN ANALYZE SELECT ... (PostgreSQL/MySQL)
SHOW INDEX FROM table_name;
Common mistakes to avoid
- Premature Optimization
- Ignoring Latency
- Hard-coding IPs
Interview Questions on This Topic
You are designing a service like 'URL Shortener' (TinyURL). How would you handle 100,000 requests per second with low latency?