PACELC Theorem: When CAP Isn't Enough — Real Trade-Offs in Distributed Systems
PACELC theorem explained with production trade-offs.
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
PACELC says: during a network partition, choose between availability and consistency (like CAP). When the system is healthy (Else), choose between low latency and strong consistency. This helps you design databases that don't just survive partitions but also perform well day-to-day.
Imagine a library with two branches. If the phone line between them is cut (partition), you can either let each branch lend books independently (availability) or force them to wait until the line is fixed (consistency). When the line works (else), you can either update both branches instantly (strong consistency, slower) or let one branch update first and sync later (low latency, eventual consistency). PACELC captures both scenarios.
CAP theorem is taught in every system design interview, but it's incomplete. It only tells you what to do when the network is on fire. What about the other 99.9% of the time when everything is fine? That's where PACELC comes in — and ignoring it has cost teams millions in latency tax.
PACELC adds the 'Else' clause: when there is no Partition, you still have to choose between Latency and Consistency. Most production outages I've seen aren't partition-related — they're cascading failures from a database that chose strong consistency on every read, then collapsed under load.
By the end of this article, you'll be able to classify any distributed data store by its PACELC trade-off, predict its failure modes under load, and choose the right configuration for your service's latency budget.
Why CAP Alone Will Burn You in Production
CAP theorem says you can have at most two of Consistency, Availability, and Partition Tolerance. But it only applies during a partition. In practice, partitions are rare. The real cost is the latency-vs-consistency trade-off you make every millisecond when the network is healthy.
I've seen teams design systems that are 'CP' (consistent and partition-tolerant) using something like Zookeeper. During a partition, they lose availability — fine. But during normal operation, they still pay the latency cost of strong consistency on every read. Their p99 latency is 200ms because they insist on linearizable reads. Meanwhile, a competitor using an 'AP' system with eventual consistency serves reads in 5ms. Guess who wins?
PACELC forces you to answer: when there's no partition, do you optimize for low latency (L) or strong consistency (C)? Most production systems should optimize for latency and accept eventual consistency. But you need to know the cost.
The PACELC Spectrum: From PA/EL to PC/EC
Every distributed database sits somewhere on the PACELC spectrum. Let's map the common ones.
PA/EL (Partition: Availability, Else: Low latency): Cassandra, DynamoDB, Riak. These are your 'always available, eventually consistent' systems. They shine when you need 99.999% uptime and can tolerate stale reads. The cost: you must handle conflicts (e.g., last-write-wins or CRDTs).
PC/EC (Partition: Consistency, Else: Consistency): HBase, Zookeeper, Spanner. These are your 'strong consistency at all costs' systems. They give you linearizable reads and writes, but you pay in latency and availability. Spanner uses TrueTime to achieve external consistency with reasonable latency, but it's complex and expensive.
PC/EL (Partition: Consistency, Else: Low latency): Some systems like MongoDB with default settings. During a partition, they favor consistency (primary-only writes). During normal operation, they favor low latency (reads from primary, writes acknowledged without replica wait). But this is a dangerous mix — you can lose writes during a partition if the primary goes down.
PA/EC (Partition: Availability, Else: Consistency): Rare. DynamoDB with DAX (in-memory cache) can behave this way: during partition, DynamoDB is available (AP), but DAX provides strong consistency for cached data. But this is a hybrid.
Tuning Consistency vs Latency in Production
You don't have to pick one PACELC class and stick with it. Most databases let you tune consistency per operation. This is where the real power lies.
In Cassandra, you can set consistency level per query. For a checkout service, you might use LOCAL_QUORUM for the 'place order' write (strong consistency) and ONE for a 'get order history' read (low latency). This gives you the best of both worlds — but you must understand the trade-off.
Here's the rule: every time you increase consistency, you increase latency. In Cassandra, QUORUM waits for a majority of replicas. If your replication factor is 3, QUORUM waits for 2 replicas. That adds network round-trips. Under high load, this can cause timeouts.
I've seen a team use QUORUM for every read in a Cassandra cluster with 5 replicas. Their p99 latency was 500ms. They switched to LOCAL_QUORUM (wait for 2 of 3 in local DC) and dropped to 50ms. The trade-off: during a cross-DC failover, they might read stale data. But that was acceptable for their use case.
When PACELC Breaks: The Hidden Cost of 'Else'
PACELC assumes the 'else' case is the normal, healthy state. But what if the system is degraded but not partitioned? For example, high CPU, memory pressure, or slow disks. In these cases, the latency-vs-consistency trade-off still applies, but the cost of strong consistency multiplies.
I debugged a case where a MongoDB replica set had a secondary with a slow disk. The primary was healthy, but writes with w:'majority' waited for the slow secondary to ack. Write latency went from 10ms to 500ms. The fix: reduce write concern to w:1 (acknowledged by primary only) and monitor secondary lag. This is effectively moving from EC (else consistency) to EL (else latency) for writes.
Another scenario: network latency between datacenters. If your application is multi-region and you use strong consistency across regions, every write waits for a round-trip to the other region. That's 100ms+ added latency. In this case, you might want to use local consistency (e.g., LOCAL_QUORUM) and accept eventual consistency across regions. PACELC doesn't explicitly cover multi-region, but the same principle applies: when the network is healthy but slow, you still trade latency for consistency.
PACELC in the Real World: DynamoDB, Cassandra, and Spanner
Let's look at three production systems and their PACELC profiles.
DynamoDB: PA/EL. During a partition, DynamoDB chooses availability (you can still read and write, but may get stale data). During normal operation, it chooses low latency (eventual consistency by default). You can opt into strongly consistent reads, but they cost 2x the read capacity units and have higher latency. DynamoDB's DAX cache adds another layer: it can serve strongly consistent reads with low latency by caching the primary, but that's a separate component.
Cassandra: PA/EL by default. Tunable consistency lets you move along the spectrum. For example, using SERIAL consistency gives you linearizable transactions (PA/EC for those operations), but at a latency cost.
Spanner: PC/EC. Spanner uses TrueTime to provide external consistency (linearizability) even across datacenters. During a partition, Spanner chooses consistency over availability (it may reject writes if it can't get a quorum). During normal operation, it still enforces strong consistency, but TrueTime allows it to do so with lower latency than traditional two-phase commit. However, Spanner is complex and expensive — you only use it when you absolutely need global strong consistency.
The PACELC Fallacy: Why You Can't Have It All
Some vendors claim to offer 'strong consistency with low latency'. That's a lie — or at least a half-truth. They might use techniques like consensus protocols (Raft, Paxos) or clock synchronization (TrueTime) to reduce the latency cost of consistency, but there's always a trade-off.
For example, etcd uses Raft to provide linearizable reads. But a linearizable read still requires a round-trip to the leader, which adds latency. You can use 'ReadIndex' or 'lease read' to optimize, but you're still paying a cost. Similarly, Spanner's TrueTime reduces the cost of global consistency, but it's not free — you need specialized hardware and software.
The PACELC theorem reminds us that there's no free lunch. You must choose: either you pay in latency, or you pay in consistency. The art is in choosing the right trade-off for each operation.
Production Patterns: How to Apply PACELC in Your Architecture
Here are three patterns I've used in production to apply PACELC thinking.
Pattern 1: Operation-level consistency tuning. As shown earlier, use different consistency levels for different operations. Critical writes (e.g., payment) use strong consistency; non-critical reads (e.g., product catalog) use eventual consistency. This is the most common and effective pattern.
Pattern 2: Service-level PACELC isolation. Separate your services by their PACELC requirements. For example, a 'user session' service might need strong consistency (PC/EC) because stale sessions cause login issues. A 'recommendation' service can be PA/EL because stale recommendations are acceptable. Run them on different databases or clusters.
Pattern 3: Hybrid PACELC with caching. Use a strongly consistent database (e.g., PostgreSQL) as source of truth, and a cache (e.g., Redis) for low-latency reads. The cache is eventually consistent with the database. This gives you the best of both worlds, but adds complexity in cache invalidation.
I've used Pattern 3 in a high-traffic e-commerce site. Product details were served from Redis (low latency), but inventory counts were read from PostgreSQL (strong consistency). This prevented overselling while keeping page load times under 100ms.
The 2-Second Read That Took Down Checkout
- Strong consistency is a shared resource — if you use it everywhere, you'll bottleneck on the primary.
- Route read-heavy workloads to replicas and accept eventual consistency where possible.
nodetool tpstats for pending tasks. 2. Check nodetool gossipinfo for node status. 3. If a node is slow, remove it from the cluster temporarily. 4. Reduce consistency level to ONE for non-critical writes.rs.status() for replica lag. 2. If a secondary is lagging, reduce write concern to w:1 temporarily. 3. Investigate secondary disk I/O or network latency.nodetool cfstats orders | grep 'Read latency'nodetool proxyhistogramsKey takeaways
Interview Questions on This Topic
How does PACELC extend CAP, and why is it more useful for real-world system design?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
That's Distributed Systems. Mark it forged?
6 min read · try the examples if you haven't