CAP Theorem — Why Cassandra Returned Stale Data at 3 AM
During an AZ outage, Cassandra's default ONE consistency returned stale balances on payment paths.
- CAP Theorem: you can guarantee at most 2 of Consistency, Availability, Partition Tolerance during a network partition
- Consistency (C): every read receives the most recent write or an error
- Availability (A): every request receives a (non-error) response — even if stale
- Partition Tolerance (P): the system continues to operate despite dropped or delayed messages
- Performance insight: in Cassandra workloads we've observed 30–50% reductions in p99 tail latency when dropping read consistency from LOCAL_QUORUM to ONE during partition recovery — but staleness windows widen proportionally. Your numbers will vary by topology and replication factor
- Production insight: the hardest part isn't choosing CP or AP on a whiteboard — it's verifying that choice actually holds under asymmetric packet loss, partial link degradation, and slow-node scenarios that Chaos Monkey alone won't cover
- Biggest mistake: assuming CA databases exist in distributed systems; in any real multi-node network, partitions are inevitable, so CA is a design fiction — when a partition occurs, you are always choosing between C and A whether you planned for it or not
Imagine you and your friend both have a copy of the same homework answers. If your friend crosses out a wrong answer and writes a new one, do you instantly have the right answer too? Probably not — you'd need someone to tell you. Now imagine your friend is in another country and the phone lines are cut. That's a network partition. CAP Theorem just says: when the phone lines are cut, you have to decide — do you give out your possibly-stale answer (Available but not Consistent), or do you refuse to answer until the lines are fixed (Consistent but not Available)? You can't do both. The moment you accept that the phone lines will sometimes be cut — and in any real distributed system they will be — you stop arguing about whether to prepare and start arguing about which failure mode hurts your users less.
Every distributed system you've ever built or used is quietly making a bet you probably didn't consciously place. When Cassandra writes to one node before replicating, when etcd blocks a read during a leader election, when DynamoDB serves a stale cart total — these aren't bugs. They're deliberate, principled trade-offs that trace directly back to a single conjecture presented by Eric Brewer at PODC 2000 and formally proved by Gilbert and Lynch in 2002. CAP Theorem is the gravitational constant of distributed systems design. You don't fight it. You work with it.
The problem CAP Theorem names — not solves, names — is this: the moment you run your system on more than one machine, a network failure between those machines is not a matter of if but when. When that failure happens, your system must make a choice about what to promise its callers. CAP gives you a precise vocabulary for that choice: Consistency (every read gets the most recent write or an error), Availability (every request gets a non-error response), and Partition Tolerance (the system keeps working despite dropped messages between nodes). The punchline? You can guarantee at most two of these simultaneously when a partition actually occurs. And since Partition Tolerance is never optional in a real multi-node network, the real choice is always between C and A.
After reading this article you'll be able to explain what each CAP property means at the protocol level — not just the dictionary definition. You'll understand why CA systems are a theoretical fiction, how real databases like Cassandra, Spanner, and etcd position themselves on the CAP spectrum, and how to make an informed, defensible architectural choice when your tech lead asks 'should we use a CP or AP database for this service?' You'll also understand why CAP alone is not sufficient for reasoning about normal-operation trade-offs — and where PACELC fills that gap.
What is CAP Theorem?
CAP Theorem is the foundational constraint in distributed systems design. It states that when a network partition occurs, a distributed system must choose between Consistency and Availability — it cannot guarantee both simultaneously. This isn't a weakness of a particular database vendor; it's a mathematical result, proved formally in 2002, that applies to every distributed system regardless of how well it's engineered.
Let's be precise about each property — because the informal definitions are where most misunderstandings start.
Consistency (C) in CAP means linearizability: after a write completes, every subsequent read from any node in the system sees that write. This is stronger than what most people mean when they say 'consistent' in a database context. It requires coordination between nodes — typically via a leader-based consensus protocol like Raft or Paxos — which introduces latency and the possibility of blocking.
Availability (A) in CAP means every request to a non-failed node receives a response — not an error, not a timeout. The response may contain stale data. The key is that no client is left hanging indefinitely. Systems like Cassandra favour availability: they return a best-effort value even when some replicas are unreachable.
Partition Tolerance (P) means the system continues to operate despite network failures that cause messages to be lost or delayed between nodes. In any distributed system spanning more than one machine, network partitions are not a choice — they are a given. A partition can be as brief as a dropped packet or as long as a datacenter losing connectivity for minutes. You don't opt into Partition Tolerance; you acknowledge that your network will fail and design accordingly.
The constraint CAP imposes is this: during a partition, a node that can't communicate with another node must decide — do I respond with potentially stale data (A), or do I refuse to respond until I can confirm consistency (C)? There is no third path.
Why does the theorem hold? Because during an asynchronous network partition, a node cannot distinguish between 'the other node is slow' and 'a partition has occurred.' If you require both C and A, your protocol would need to respond immediately with the latest data — but it can't know the latest data without communicating with the other side, which it can't reach. Gilbert and Lynch formalised this into a proof that no asynchronous protocol can guarantee both properties under partition.
The 'CA' Myth — Why You Can't Have All Three
A common misconception is that you can build a system that is Consistent, Available, and Partition-Tolerant — just not all the time. That framing is wrong. During a partition, you must choose between C and A. A system claiming to be CA either doesn't handle partitions (it stops working entirely) or it quietly sacrifices one guarantee without admitting it.
Consider two nodes, N1 and N2, with a network cut between them. A write arrives at N1. N2 cannot know about this write because messages are dropped. A client now reads from N2. If N2 returns a response — even old data — the system is Available but not Consistent. If N2 refuses to respond until it can contact N1, the system is Consistent but not Available. There is no third path. Gilbert and Lynch proved this in 2002: any protocol that guarantees both C and A during an asynchronous partition would require synchronous communication, which is impossible when the network is down.
Now, a nuance worth addressing for senior engineers: what about single-datacenter systems with synchronous replication? A PostgreSQL primary with a synchronous standby in the same datacenter is effectively both Consistent and Available — in practice, because the probability of a true network partition within a single well-managed DC is extremely low. But it is not Partition Tolerant by design. A datacenter-level failure takes both nodes down simultaneously. You haven't escaped CAP; you've placed your bet that partitions within a single DC won't happen. That bet is usually correct — until the day it isn't. Single-DC synchronous replication is a practical approximation of CA, not a proof that CA is achievable. The moment you span datacenters or availability zones, the network partition becomes a real and frequent event, and the CA approximation collapses.
This is why 'CA' vendors in a distributed context are almost always misusing terminology. What they usually mean is: 'our system prioritises consistency and availability in normal operation, and we stop serving during partitions.' That's a CP system that admits to becoming unavailable under failure. Calling it CA is marketing, not engineering.
Consistency Models: From Strong to Eventual
CAP's Consistency property is binary in the theorem — either you have linearizability during a partition or you don't. But in practice, consistency is a spectrum, and the choice of where to sit on that spectrum is one of the most consequential architectural decisions you'll make. Here are the models you'll encounter in production, ordered from strongest to weakest.
- Linearizability (Strong Consistency): After a write completes, all subsequent reads from any node see that write. Operations appear instantaneous and globally ordered. This is what CAP's C means. It requires coordination — quorum or consensus — and introduces latency proportional to round-trip time between replicas. Examples: Google Spanner (via TrueTime), etcd (Raft), Zookeeper (Zab). During partitions, minority sides become unavailable.
- Sequential Consistency: Operations appear in some global order that is consistent with each process's local order. Weaker than linearizability — you don't get real-time ordering guarantees — but easier to implement. Less common in modern distributed databases; you'll encounter it more in academic literature and some memory model discussions.
- Causal Consistency: Writes that are causally related (one write depends on another) are seen in the correct order across all nodes. Writes that are causally independent may appear in different orders on different nodes. Used in CRDT-based systems and some multi-region databases. Provides a useful middle ground: stronger than eventual, cheaper than linearizability.
- Read-Your-Writes (Session Consistency): A client always sees its own writes, but may not see writes from other clients immediately. This can be provided without full linearizability by routing a client's reads to the replica that received their write, or by using a session token. Common in MongoDB (read from primary) and DynamoDB (read-your-writes within a session).
- Eventual Consistency: Given enough time without new writes, all replicas converge to the same value. During that convergence window, different replicas may return different values. This is the default for many AP systems: Cassandra, DynamoDB (eventually consistent reads), Riak. After a partition heals, anti-entropy processes (gossip, Merkle tree comparison, read-repair) reconcile divergent replicas.
The question isn't 'which model is best.' It's 'which model matches what your users will actually experience and what your business can recover from.' A recommendation feed returning yesterday's items is annoying but harmless. A payment balance returning a value from before a deposit was processed is a bug with financial consequences.
One more thing worth naming: the relationship between consistency models and the R + W > N quorum rule. In a system with N replicas, if you require W replicas to acknowledge a write and R replicas to respond to a read, and W + R > N, then at least one replica must have participated in both the write and the read — guaranteeing you'll see the latest write. This is the mathematical foundation behind Cassandra's consistency levels. QUORUM with replication factor 3 means W=2, R=2, W+R=4 > N=3. ONE means R=1, and with W=2, you have W+R=3 = N, which means you might miss the latest write if the one replica you read from wasn't one of the two that acknowledged the write.
- Linearizability costs coordination — typically +10–50ms per write for quorum round-trips in a multi-AZ deployment. That cost exists even when there is no partition, which is why PACELC matters for normal-operation design.
- Eventual consistency is cheap to write — no waiting for quorum — but you pay in complexity: conflict resolution logic, staleness detection, and reconciliation jobs that someone has to build and maintain.
- In production, the majority of services can use eventual consistency if they design for it: idempotent reads, user-visible staleness indicators ('prices updated every 60 seconds'), and reconciliation after partition recovery.
- The hidden cost of consistency mismatches is debugging anomalies months later when a developer assumed strong consistency in a code review but the database was configured for eventual. That assumption lives in no config file.
CAP in Normal Operation: Where PACELC Fills the Gap
CAP tells you what to sacrifice during a partition. But most of your system's lifetime is spent in normal operation — no partition, no crisis, just steady-state reads and writes. In normal operation, CAP says nothing. Both C and A are achievable. So what governs the trade-offs you make every day?
That's where PACELC comes in. Proposed by Daniel Abadi in 2012, PACELC extends CAP with a second branch:
- If there is a Partition (P), choose between Availability (A) and Consistency (C) — the CAP branch.
- Else (E), in normal operation, choose between Latency (L) and Consistency (C).
The insight is that even without a partition, consistency costs something. Every strongly consistent write requires coordination between replicas — quorum acknowledgement, leader confirmation, or a two-phase commit. That coordination adds latency. Every millisecond of coordination latency is a trade-off you're making against throughput and user experience.
Let's map real databases onto both dimensions:
Cassandra is PA/EL: during partition it chooses Availability; in normal operation it chooses low Latency (writes go to the coordinator immediately, replication is asynchronous by default). This makes Cassandra extremely fast under normal load and extremely tolerant of node failures — at the cost of consistency both during and between partitions.
DynamoDB (eventually consistent) is PA/EL: same pattern. Default reads are fast and stale-tolerant. Strongly consistent reads opt into PC/EC behaviour for that request.
Spanner is PC/EC: during partition it chooses Consistency (minority sides block); in normal operation it also chooses Consistency (TrueTime-coordinated commits, external consistency). This makes Spanner correct and expensive. The latency overhead is real — cross-region commits involve TrueTime uncertainty windows of 1–7ms on top of network round-trip.
etcd is PC/EC: Raft consensus means every write goes through a leader commit before acknowledgement. In normal operation, p99 write latency in a 3-node cluster across availability zones is typically 5–15ms. That's the cost of EC.
MongoDB with default settings is PA/EL in practice: secondaries lag behind the primary, reads from secondaries are eventually consistent. If you configure readConcern:linearizable and writeConcern:majority, you get PC/EC for those operations — at latency cost.
Why does this matter? Because most architecture discussions about CAP are really about PACELC. The question 'should we use Cassandra or Spanner?' is almost never triggered by a partition event you're anticipating. It's triggered by normal-operation requirements: how fast do writes need to be? Can reads be 500ms stale? Can we afford 15ms write latency for strong consistency? PACELC gives you the vocabulary to answer those questions with the same rigour that CAP applies to failure scenarios.
Real Database Positions on the CAP and PACELC Spectrum
Let's map major databases to their actual behaviour under partition and under normal operation. These are not marketing positions — they're what happens when you inject a network fault.
Cassandra: AP (CAP), PA/EL (PACELC). By default, writes succeed once the coordinator acknowledges — replication is asynchronous. Reads with ONE return data from any single replica, potentially stale. Raise to LOCAL_QUORUM and you get quorum-consistent reads in normal operation (EL becomes EC for that query), but during a partition, any node that can't reach quorum will time out. Cassandra's power is that you tune this per query — it's the most flexible point on the spectrum.
DynamoDB: AP (CAP), PA/EL (PACELC) by default. The default read is eventually consistent — DynamoDB routes to the nearest replica without quorum. ConsistentRead=true uses quorum and shifts that read to PC/EC. Importantly, DynamoDB is not the original Dynamo system — it's a managed service with different internals. The original Dynamo (used by Amazon's shopping cart) used vector clocks and sloppy quorums; DynamoDB uses a different replication mechanism. Don't conflate the two when reading the 2007 Dynamo paper.
etcd: CP (CAP), PC/EC (PACELC). Raft consensus means every write goes through a leader commit. During a partition, the minority side of the Raft cluster stops accepting writes and by default stops serving reads (to prevent serving stale data). This is correct CP behaviour. In normal operation, the Raft overhead means p99 write latency across 3 AZs is typically 5–15ms. For distributed locks and configuration management, this is the right trade-off.
Zookeeper: CP (CAP), PC/EC (PACELC). Uses Zab (Zookeeper Atomic Broadcast) instead of Raft, but similar guarantees. During a partition, the minority side stops serving. Writes always go to the leader. Used extensively for leader election, service discovery, and distributed coordination where correctness outweighs availability.
Google Spanner: CP (CAP), PC/EC (PACELC). TrueTime enables externally consistent distributed transactions across global regions. During a global partition — rare but possible — some writes block until the TrueTime uncertainty window resolves. In normal operation, TrueTime adds 1–7ms to commit latency on top of network round-trip. Spanner is the closest production system to having strong consistency at global scale, but the latency cost is real and the availability cost during partition is real.
MongoDB (replica set, default config): CP-leaning (CAP), PA/EL in practice (PACELC). Writes go to the primary with writeConcern:1 by default — acknowledged by one node before returning. Secondaries replicate asynchronously and may lag. Reads from secondaries are eventually consistent. During partition: if the primary is isolated from a majority, the majority elects a new primary after ~10 seconds — during which writes are unavailable. If you lower writeConcern to w:0 or w:1 without journaling, you can lose acknowledged writes during failover, which is an AP-like behaviour. MongoDB's position shifts significantly based on write concern and read preference — treat the defaults as a starting point, not a fixed CAP label.
Redis Cluster: this one requires care. Redis Cluster uses hash slots partitioned across master nodes, each with replica nodes. During a partition, the behaviour is slot-dependent. With the default cluster-require-full-coverage=yes, if any master and all its replicas are unreachable, the entire cluster stops accepting writes — CP-like for availability but not linearizable. With cluster-require-full-coverage=no, the cluster continues serving the slots it can reach — AP-like for available slots. Redis Cluster does not guarantee linearizability in any configuration: replication is asynchronous, so a promoted replica may not have all writes from the failed master. Redis Cluster is best described as AP with last-write-wins semantics and potential write loss during failover — which is fine for caching and session storage, and wrong for anything requiring durability.
Split-Brain: The Partition Failure Mode Nobody Plans For
Split-brain is what happens when a partition causes two parts of your cluster to each believe they are the authoritative majority — and both continue accepting writes independently. It is the most dangerous partition failure mode, and it's the one most teams discover in production rather than in testing.
Here's the scenario. You have a 3-node cluster: N1, N2, N3. A network partition separates N1 from N2 and N3. N2 and N3 form a majority and elect N2 as the new leader. N1, isolated, doesn't detect the partition immediately — it still thinks it's the leader. During the detection window (which can be seconds to minutes depending on heartbeat timeout configuration), N1 accepts writes. N2 also accepts writes as the new leader. Both sides commit conflicting state.
When the partition heals, you have two versions of truth. The system must reconcile them, and depending on the conflict resolution strategy, some writes will be lost or overwritten.
For CP systems (etcd, Zookeeper), Raft and Zab are specifically designed to prevent split-brain: a node can only be leader if it has acknowledgement from a majority. N1 in the scenario above, isolated from N2 and N3, cannot form a majority and steps down. It stops accepting writes. This is correct — availability is sacrificed, but split-brain is prevented.
For AP systems (Cassandra, DynamoDB, Redis Cluster with async replication), split-brain is a real possibility. Cassandra with low write consistency (ONE or ANY) can accept writes on both sides of a partition. After recovery, last-write-wins (LWW) semantics based on timestamp ordering determine which write survives. If the client clocks are not well-synchronised, the wrong write wins. Redis Cluster's async replication means a promoted replica may not have all writes from the failed master — those writes are lost.
Asymmetric partitions make this harder. In a real network, partitions are rarely a clean cut. You might have a situation where N1 can send messages to N2 but N2's responses are dropped. N1 sees N2 as alive; N2 sees N1 as unreachable. Both might attempt to become leader, depending on the election protocol. This is why fault domain placement matters: place replicas in separate availability zones, use rack-aware placement policies in Cassandra, and configure etcd peer URLs to use AZ-specific endpoints. The goal is to ensure that a single physical failure — a switch, a power unit, an AZ network event — takes down at most a minority of your replicas.
Conflict resolution strategies for AP systems:
- Last-Write-Wins (LWW): the write with the highest timestamp wins. Simple, but requires synchronised clocks. NTP drift of a few milliseconds can cause the wrong write to win. Cassandra uses LWW by default.
- Vector Clocks: track causality between writes using per-node counters. A write that happened after another write (causally) always wins. Writes that are concurrent (neither caused the other) are flagged as conflicts for application-level resolution. The original Dynamo system used vector clocks; DynamoDB removed them in favour of LWW.
- CRDTs (Conflict-free Replicated Data Types): data structures designed so that concurrent writes always merge deterministically without conflicts. Examples: G-Counter (increment-only counter), OR-Set (add/remove set with tombstones). Used in Riak and some Cassandra patterns. The constraint is that not all data types can be expressed as CRDTs.
- Application-Level Reconciliation: a background job reads from all replicas after partition recovery, detects divergence, and applies business logic to determine the correct value. This is the most flexible approach and the most work. Required when LWW semantics would produce wrong results and CRDTs don't fit the data model.
How to Make the CAP Decision in Your Architecture
Here's a practical framework for deciding CP or AP for a given service — and for documenting that decision so the next engineer doesn't have to reverse-engineer it during an incident.
Step 1: Identify data criticality and failure mode cost. Ask: what's worse for this service — wrong data or no data? For a financial transaction, wrong data (serving a stale balance) can cause real monetary loss or regulatory exposure. An error or brief unavailability is recoverable. Choose CP. For a product recommendation feed, stale recommendations are invisible to most users. An error that breaks page load is highly visible. Choose AP.
Step 2: Define recovery time objectives honestly. How long can this service be unavailable during a partition before the business impact exceeds the impact of serving stale data? If the answer is 'zero seconds,' you need AP. If the answer is 'a few seconds during leader election is acceptable,' CP is viable. Be honest — most teams underestimate their tolerance for brief unavailability and overestimate the cost of stale data.
Step 3: Evaluate read/write ratio and access patterns. Write-heavy services benefit from AP because they avoid quorum coordination overhead on every write. Read-heavy services can often tolerate stale reads if the underlying data changes slowly (product catalog, user preferences). Read-heavy services where data changes fast and staleness matters (account balance, inventory) need CP for reads even if the write rate is low.
Step 4: Apply the quorum math. For the databases you're considering, calculate R + W > N for your replication factor and desired consistency. Know what consistency level each critical read and write uses. Write it down. This is the most often skipped step — teams pick a database and never calculate what their consistency configuration actually guarantees.
Step 5: Design for partition recovery explicitly. If you choose AP, partition recovery is not automatic from an application perspective — replicas reconcile at the data layer, but your application may have made decisions based on stale data during the partition window. Design a reconciliation path: a background job, a read-repair trigger, or an idempotent retry mechanism. If you choose CP, partition recovery means the minority side comes back online and catches up with the majority — but clients that were hitting the minority side were getting errors during the partition. Design a retry/backoff policy and a user-facing message for that state.
Step 6: Document and test. For each critical data path, write one sentence: 'During a partition, this path [serves stale data / returns an error] because we prioritise [availability / consistency] for [business reason].' Then run a fault injection test that verifies it. If the observed behaviour doesn't match the sentence, the sentence or the configuration is wrong. Fix whichever one is cheaper.
The 3 AM Cassandra Read Timeout That Wasn't a Bug
- Check your database driver logs at DEBUG level during incidents — UnavailableExceptions and ReadTimeoutExceptions often appear there before they surface anywhere else.
- Understand your database's default consistency levels — they almost always favour availability over consistency because that is the safer default for the majority of workloads. Payment paths are not the majority of workloads.
- When partitions happen, 'available' doesn't mean 'correct'. Test your system under network faults — not just under load — before you find out in production at 3 AM.
- Document the CAP trade-off for each critical read/write path. If your on-call engineer can't explain which guarantee you're sacrificing during a partition, you're already broken.
Key takeaways
Common mistakes to avoid
5 patternsAssuming CA databases exist in distributed systems
Ignoring consistency level configuration defaults — and checking them only in cqlsh, not in the application driver
Failing to plan for partition recovery — especially conflict resolution
Choosing CP for a read-heavy service without understanding the availability cost — or choosing AP for a write-heavy financial service without understanding the staleness cost
Treating Redis Cluster as a straightforward AP store for all use cases
Interview Questions on This Topic
Explain CAP Theorem. Why can't you have all three guarantees simultaneously?
Frequently Asked Questions
That's Fundamentals. Mark it forged?
18 min read · try the examples if you haven't