Senior 18 min · March 05, 2026

CAP Theorem — Why Cassandra Returned Stale Data at 3 AM

During an AZ outage, Cassandra's default ONE consistency returned stale balances on payment paths.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • CAP Theorem: you can guarantee at most 2 of Consistency, Availability, Partition Tolerance during a network partition
  • Consistency (C): every read receives the most recent write or an error
  • Availability (A): every request receives a (non-error) response — even if stale
  • Partition Tolerance (P): the system continues to operate despite dropped or delayed messages
  • Performance insight: in Cassandra workloads we've observed 30–50% reductions in p99 tail latency when dropping read consistency from LOCAL_QUORUM to ONE during partition recovery — but staleness windows widen proportionally. Your numbers will vary by topology and replication factor
  • Production insight: the hardest part isn't choosing CP or AP on a whiteboard — it's verifying that choice actually holds under asymmetric packet loss, partial link degradation, and slow-node scenarios that Chaos Monkey alone won't cover
  • Biggest mistake: assuming CA databases exist in distributed systems; in any real multi-node network, partitions are inevitable, so CA is a design fiction — when a partition occurs, you are always choosing between C and A whether you planned for it or not
Plain-English First

Imagine you and your friend both have a copy of the same homework answers. If your friend crosses out a wrong answer and writes a new one, do you instantly have the right answer too? Probably not — you'd need someone to tell you. Now imagine your friend is in another country and the phone lines are cut. That's a network partition. CAP Theorem just says: when the phone lines are cut, you have to decide — do you give out your possibly-stale answer (Available but not Consistent), or do you refuse to answer until the lines are fixed (Consistent but not Available)? You can't do both. The moment you accept that the phone lines will sometimes be cut — and in any real distributed system they will be — you stop arguing about whether to prepare and start arguing about which failure mode hurts your users less.

Every distributed system you've ever built or used is quietly making a bet you probably didn't consciously place. When Cassandra writes to one node before replicating, when etcd blocks a read during a leader election, when DynamoDB serves a stale cart total — these aren't bugs. They're deliberate, principled trade-offs that trace directly back to a single conjecture presented by Eric Brewer at PODC 2000 and formally proved by Gilbert and Lynch in 2002. CAP Theorem is the gravitational constant of distributed systems design. You don't fight it. You work with it.

The problem CAP Theorem names — not solves, names — is this: the moment you run your system on more than one machine, a network failure between those machines is not a matter of if but when. When that failure happens, your system must make a choice about what to promise its callers. CAP gives you a precise vocabulary for that choice: Consistency (every read gets the most recent write or an error), Availability (every request gets a non-error response), and Partition Tolerance (the system keeps working despite dropped messages between nodes). The punchline? You can guarantee at most two of these simultaneously when a partition actually occurs. And since Partition Tolerance is never optional in a real multi-node network, the real choice is always between C and A.

After reading this article you'll be able to explain what each CAP property means at the protocol level — not just the dictionary definition. You'll understand why CA systems are a theoretical fiction, how real databases like Cassandra, Spanner, and etcd position themselves on the CAP spectrum, and how to make an informed, defensible architectural choice when your tech lead asks 'should we use a CP or AP database for this service?' You'll also understand why CAP alone is not sufficient for reasoning about normal-operation trade-offs — and where PACELC fills that gap.

What is CAP Theorem?

CAP Theorem is the foundational constraint in distributed systems design. It states that when a network partition occurs, a distributed system must choose between Consistency and Availability — it cannot guarantee both simultaneously. This isn't a weakness of a particular database vendor; it's a mathematical result, proved formally in 2002, that applies to every distributed system regardless of how well it's engineered.

Let's be precise about each property — because the informal definitions are where most misunderstandings start.

Consistency (C) in CAP means linearizability: after a write completes, every subsequent read from any node in the system sees that write. This is stronger than what most people mean when they say 'consistent' in a database context. It requires coordination between nodes — typically via a leader-based consensus protocol like Raft or Paxos — which introduces latency and the possibility of blocking.

Availability (A) in CAP means every request to a non-failed node receives a response — not an error, not a timeout. The response may contain stale data. The key is that no client is left hanging indefinitely. Systems like Cassandra favour availability: they return a best-effort value even when some replicas are unreachable.

Partition Tolerance (P) means the system continues to operate despite network failures that cause messages to be lost or delayed between nodes. In any distributed system spanning more than one machine, network partitions are not a choice — they are a given. A partition can be as brief as a dropped packet or as long as a datacenter losing connectivity for minutes. You don't opt into Partition Tolerance; you acknowledge that your network will fail and design accordingly.

The constraint CAP imposes is this: during a partition, a node that can't communicate with another node must decide — do I respond with potentially stale data (A), or do I refuse to respond until I can confirm consistency (C)? There is no third path.

Why does the theorem hold? Because during an asynchronous network partition, a node cannot distinguish between 'the other node is slow' and 'a partition has occurred.' If you require both C and A, your protocol would need to respond immediately with the latest data — but it can't know the latest data without communicating with the other side, which it can't reach. Gilbert and Lynch formalised this into a proof that no asynchronous protocol can guarantee both properties under partition.

io/thecodeforge/cap/CAPDecision.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
package io.thecodeforge.cap;

/**
 * Illustrates the CAP trade-off decision during a network partition.
 * This is intentionally simplified for clarity — production systems
 * make this choice per-query, not per-service.
 */
public class CAPDecision {

    public enum Guarantee {
        CP, // Consistent + Partition Tolerant — blocks or errors during partition (e.g., etcd, Zookeeper)
        AP  // Available + Partition Tolerant — may serve stale data during partition (e.g., Cassandra, DynamoDB)
    }

    /**
     * Choose the CAP guarantee based on business requirement.
     *
     * Note: in real systems, you'd also implement a fallback path —
     * for example, a CP read that falls back to a cached AP response
     * after a timeout, rather than surfacing an error to the user.
     *
     * @param requiresConsistency true if stale reads are unacceptable (e.g., financial ledger)
     * @param mustBeAvailable     true if the system must always respond (e.g., product catalog)
     * @return the CAP guarantee to enforce
     */
    public static Guarantee choose(boolean requiresConsistency, boolean mustBeAvailable) {
        if (requiresConsistency && mustBeAvailable) {
            // This is the impossible ask. During a partition you cannot have both.
            // If your requirements genuinely demand both, the real answer is:
            // redesign the requirement — either relax consistency to eventual,
            // or accept a short unavailability window during rare partitions.
            throw new IllegalArgumentException(
                "During a partition, C and A are mutually exclusive. " +
                "Redefine your requirement or accept a trade-off."
            );
        }
        if (requiresConsistency) {
            return Guarantee.CP;
        }
        return Guarantee.AP;
    }

    /**
     * A more realistic pattern: attempt a CP read, fall back to AP on unavailability.
     * This is a common production pattern for services that prefer consistency
     * but can tolerate a stale cached response over a user-visible error.
     */
    public static String readWithFallback(
            java.util.function.Supplier<String> cpRead,
            java.util.function.Supplier<String> apFallback) {
        try {
            return cpRead.get(); // attempt LOCAL_QUORUM or equivalent
        } catch (RuntimeException unavailable) {
            // Partition occurred — CP store is unreachable for this node.
            // Fall back to AP (stale cache, read from ONE replica, etc.)
            // Log this — stale data served is a known event, not a silent failure.
            System.err.println("[WARN] CP read unavailable, falling back to AP: " + unavailable.getMessage());
            return apFallback.get();
        }
    }

    public static void main(String[] args) {
        // Payment ledger: stale balance is worse than a brief error
        Guarantee paymentService = choose(true, false);
        System.out.println("Payment service picks: " + paymentService);

        // Shopping cart: stale items are better than a broken page
        Guarantee cartService = choose(false, true);
        System.out.println("Cart service picks: " + cartService);

        // Product catalog read with fallback
        String result = readWithFallback(
            () -> "[CP] latest product price: $29.99",
            () -> "[AP fallback] cached product price: $29.99 (may be stale)"
        );
        System.out.println(result);
    }
}
Output
Payment service picks: CP
Cart service picks: AP
[CP] latest product price: $29.99
Two Traps That Catch Senior Engineers
First: don't confuse 'eventual consistency' with 'availability without consequences.' Eventual consistency sacrifices Consistency during the window before replicas reconcile — that window can be milliseconds or minutes depending on anti-entropy configuration and network conditions. Second: 'strong consistency' in most databases means the system will block or error during partition — that's the availability cost. Both are deliberate trade-offs, not bugs. The mistake is not knowing which one your system is making on a given code path.
Production Insight
Your system will behave differently during a partition than during normal operation — and the difference is often invisible until it matters. Running Chaos Monkey or Toxiproxy against a staging environment is necessary but not sufficient. The failure modes that hurt most in production are asymmetric: one side of a partition can send but not receive, or packet loss is 20% rather than 100%, or a single slow node triggers timeouts without a clean partition being detected at all.
The discipline that catches these is fault injection at the network level combined with workload replay — send real production read/write patterns against a partitioned cluster and measure what the system actually returns, not what the documentation says it should return.
Rule: run partition drills quarterly, not just after incidents. Document the actual observed behaviour, not the theoretical CAP label.
Key Takeaway
CAP is not a menu you pick from freely. During a partition, you must choose either C or A — not both. Partition Tolerance is never optional in a multi-node system. The engineering discipline is knowing which choice each of your data paths is making, testing that it makes the right one under failure, and documenting it so the engineer paged at 3 AM doesn't have to reverse-engineer it.
Choosing CP vs AP for a Service
IfService requires strong consistency — financial transactions, distributed locks, inventory reservation, user permissions
UseChoose CP: Zookeeper, etcd, Spanner, CockroachDB, or PostgreSQL with synchronous replication. Accept that during a partition, this service will be unavailable on the minority side. Design a user-facing error message accordingly — a clear 'service temporarily unavailable' is better than a wrong answer.
IfService can tolerate stale data for seconds to minutes — user profiles, product catalog, recommendation feeds, session metadata
UseChoose AP: Cassandra, DynamoDB (eventually consistent reads), Riak. Design for staleness: show users a 'last updated' timestamp, make reads idempotent, implement reconciliation logic for post-partition recovery.
IfRead-heavy service with a mix of critical and non-critical paths
UseUse a hybrid: CP store (etcd or Spanner) for authoritative writes and critical reads; AP cache (Redis, Cassandra with ONE) for high-volume non-critical reads. The cache goes stale during partitions — that is expected and should be logged, not silently swallowed.
IfYou need both C and A with zero tolerance for either failure mode
UseThat requirement is not achievable in any distributed system. The real conversation to have is: which failure mode is more recoverable for this business? Wrong data in a financial ledger, or 12 seconds of 503 errors? One of those is recoverable. The other may not be.

The 'CA' Myth — Why You Can't Have All Three

A common misconception is that you can build a system that is Consistent, Available, and Partition-Tolerant — just not all the time. That framing is wrong. During a partition, you must choose between C and A. A system claiming to be CA either doesn't handle partitions (it stops working entirely) or it quietly sacrifices one guarantee without admitting it.

Consider two nodes, N1 and N2, with a network cut between them. A write arrives at N1. N2 cannot know about this write because messages are dropped. A client now reads from N2. If N2 returns a response — even old data — the system is Available but not Consistent. If N2 refuses to respond until it can contact N1, the system is Consistent but not Available. There is no third path. Gilbert and Lynch proved this in 2002: any protocol that guarantees both C and A during an asynchronous partition would require synchronous communication, which is impossible when the network is down.

Now, a nuance worth addressing for senior engineers: what about single-datacenter systems with synchronous replication? A PostgreSQL primary with a synchronous standby in the same datacenter is effectively both Consistent and Available — in practice, because the probability of a true network partition within a single well-managed DC is extremely low. But it is not Partition Tolerant by design. A datacenter-level failure takes both nodes down simultaneously. You haven't escaped CAP; you've placed your bet that partitions within a single DC won't happen. That bet is usually correct — until the day it isn't. Single-DC synchronous replication is a practical approximation of CA, not a proof that CA is achievable. The moment you span datacenters or availability zones, the network partition becomes a real and frequent event, and the CA approximation collapses.

This is why 'CA' vendors in a distributed context are almost always misusing terminology. What they usually mean is: 'our system prioritises consistency and availability in normal operation, and we stop serving during partitions.' That's a CP system that admits to becoming unavailable under failure. Calling it CA is marketing, not engineering.

Single-DC Systems and the CA Approximation
A single PostgreSQL instance or a synchronous primary/standby pair within one datacenter appears to be CA — Consistent (all reads see the latest write) and Available (every request gets a response). Technically it's not Partition Tolerant: a network failure between the two nodes takes down the standby's ability to replicate, and depending on configuration, the primary may block writes until the standby reconnects. But in practice, intra-DC partitions are rare enough that the system approximates CA most of the time. The lesson: CA is not achievable across datacenters or regions. The moment you distribute across network boundaries with real failure probabilities, you are in CAP territory and must choose C or A under partition.
Production Insight
We've debugged a production incident where a distributed SQL database — marketed as CA — silently dropped writes during a network hiccup between availability zones. The system couldn't maintain both C and A, so it made an undocumented choice: it dropped the write rather than serving an error. The application had no visibility into this because the database returned a success acknowledgement before replication was confirmed.
The fix wasn't switching databases. It was reading the vendor's consistency documentation at the protocol level — not the marketing page — and setting synchronous replication with write confirmation only after durable replication. That added latency. But silent data loss is not a latency trade-off; it's a correctness failure.
Rule: always verify how a database behaves under a partition by testing it, not by reading its marketing claims. The test is five lines of Toxiproxy config and a write workload. Run it.
Key Takeaway
CA is a useful concept for single-node or single-DC systems where partitions are rare enough to ignore in practice. It is a fiction in multi-datacenter distributed systems. Every vendor claiming CA in a distributed context is either describing a CP system that becomes unavailable during partition, or they haven't tested their own failure modes. Accept the constraint, design for one failure mode, and test it.

Consistency Models: From Strong to Eventual

CAP's Consistency property is binary in the theorem — either you have linearizability during a partition or you don't. But in practice, consistency is a spectrum, and the choice of where to sit on that spectrum is one of the most consequential architectural decisions you'll make. Here are the models you'll encounter in production, ordered from strongest to weakest.

  1. Linearizability (Strong Consistency): After a write completes, all subsequent reads from any node see that write. Operations appear instantaneous and globally ordered. This is what CAP's C means. It requires coordination — quorum or consensus — and introduces latency proportional to round-trip time between replicas. Examples: Google Spanner (via TrueTime), etcd (Raft), Zookeeper (Zab). During partitions, minority sides become unavailable.
  2. Sequential Consistency: Operations appear in some global order that is consistent with each process's local order. Weaker than linearizability — you don't get real-time ordering guarantees — but easier to implement. Less common in modern distributed databases; you'll encounter it more in academic literature and some memory model discussions.
  3. Causal Consistency: Writes that are causally related (one write depends on another) are seen in the correct order across all nodes. Writes that are causally independent may appear in different orders on different nodes. Used in CRDT-based systems and some multi-region databases. Provides a useful middle ground: stronger than eventual, cheaper than linearizability.
  4. Read-Your-Writes (Session Consistency): A client always sees its own writes, but may not see writes from other clients immediately. This can be provided without full linearizability by routing a client's reads to the replica that received their write, or by using a session token. Common in MongoDB (read from primary) and DynamoDB (read-your-writes within a session).
  5. Eventual Consistency: Given enough time without new writes, all replicas converge to the same value. During that convergence window, different replicas may return different values. This is the default for many AP systems: Cassandra, DynamoDB (eventually consistent reads), Riak. After a partition heals, anti-entropy processes (gossip, Merkle tree comparison, read-repair) reconcile divergent replicas.

The question isn't 'which model is best.' It's 'which model matches what your users will actually experience and what your business can recover from.' A recommendation feed returning yesterday's items is annoying but harmless. A payment balance returning a value from before a deposit was processed is a bug with financial consequences.

One more thing worth naming: the relationship between consistency models and the R + W > N quorum rule. In a system with N replicas, if you require W replicas to acknowledge a write and R replicas to respond to a read, and W + R > N, then at least one replica must have participated in both the write and the read — guaranteeing you'll see the latest write. This is the mathematical foundation behind Cassandra's consistency levels. QUORUM with replication factor 3 means W=2, R=2, W+R=4 > N=3. ONE means R=1, and with W=2, you have W+R=3 = N, which means you might miss the latest write if the one replica you read from wasn't one of the two that acknowledged the write.

Consistency as a Currency
  • Linearizability costs coordination — typically +10–50ms per write for quorum round-trips in a multi-AZ deployment. That cost exists even when there is no partition, which is why PACELC matters for normal-operation design.
  • Eventual consistency is cheap to write — no waiting for quorum — but you pay in complexity: conflict resolution logic, staleness detection, and reconciliation jobs that someone has to build and maintain.
  • In production, the majority of services can use eventual consistency if they design for it: idempotent reads, user-visible staleness indicators ('prices updated every 60 seconds'), and reconciliation after partition recovery.
  • The hidden cost of consistency mismatches is debugging anomalies months later when a developer assumed strong consistency in a code review but the database was configured for eventual. That assumption lives in no config file.
Production Insight
We've seen teams choose strong consistency for a read-heavy product catalog API — not because the business required it, but because 'consistent' sounded safer. During a datacenter partition, the API became completely unavailable for 4 minutes while the CP store waited for quorum. Customer-facing search was down. The business impact of 4 minutes of outage was worse than the impact of serving yesterday's product prices would have been.
The right answer would have been eventual consistency for the catalog read path, with a fallback cache, and strong consistency reserved for inventory reservation writes where overselling is the actual risk.
Rule: every strong consistency requirement should have a documented business justification. If you can't write one sentence explaining why stale data is worse than downtime for this specific path, you probably don't need strong consistency on that path.
Key Takeaway
Strong consistency means CP — the system becomes unavailable on the minority side during partition. Eventual consistency means AP — replicas diverge during partition and converge after. The quorum math (R + W > N) is the bridge between theory and configuration. Know your model, know your numbers, and document which data paths use which — especially the ones that handle money.

CAP in Normal Operation: Where PACELC Fills the Gap

CAP tells you what to sacrifice during a partition. But most of your system's lifetime is spent in normal operation — no partition, no crisis, just steady-state reads and writes. In normal operation, CAP says nothing. Both C and A are achievable. So what governs the trade-offs you make every day?

That's where PACELC comes in. Proposed by Daniel Abadi in 2012, PACELC extends CAP with a second branch:

  • If there is a Partition (P), choose between Availability (A) and Consistency (C) — the CAP branch.
  • Else (E), in normal operation, choose between Latency (L) and Consistency (C).

The insight is that even without a partition, consistency costs something. Every strongly consistent write requires coordination between replicas — quorum acknowledgement, leader confirmation, or a two-phase commit. That coordination adds latency. Every millisecond of coordination latency is a trade-off you're making against throughput and user experience.

Cassandra is PA/EL: during partition it chooses Availability; in normal operation it chooses low Latency (writes go to the coordinator immediately, replication is asynchronous by default). This makes Cassandra extremely fast under normal load and extremely tolerant of node failures — at the cost of consistency both during and between partitions.

DynamoDB (eventually consistent) is PA/EL: same pattern. Default reads are fast and stale-tolerant. Strongly consistent reads opt into PC/EC behaviour for that request.

Spanner is PC/EC: during partition it chooses Consistency (minority sides block); in normal operation it also chooses Consistency (TrueTime-coordinated commits, external consistency). This makes Spanner correct and expensive. The latency overhead is real — cross-region commits involve TrueTime uncertainty windows of 1–7ms on top of network round-trip.

etcd is PC/EC: Raft consensus means every write goes through a leader commit before acknowledgement. In normal operation, p99 write latency in a 3-node cluster across availability zones is typically 5–15ms. That's the cost of EC.

MongoDB with default settings is PA/EL in practice: secondaries lag behind the primary, reads from secondaries are eventually consistent. If you configure readConcern:linearizable and writeConcern:majority, you get PC/EC for those operations — at latency cost.

Why does this matter? Because most architecture discussions about CAP are really about PACELC. The question 'should we use Cassandra or Spanner?' is almost never triggered by a partition event you're anticipating. It's triggered by normal-operation requirements: how fast do writes need to be? Can reads be 500ms stale? Can we afford 15ms write latency for strong consistency? PACELC gives you the vocabulary to answer those questions with the same rigour that CAP applies to failure scenarios.

PACELC Quick Reference
PA/EL systems (Cassandra, DynamoDB default): fast in normal operation, available during partition, eventually consistent always. Choose these when throughput and availability matter more than read-after-write correctness. PC/EC systems (Spanner, etcd, CockroachDB): consistent in normal operation, consistent during partition (at the cost of availability on minority side). Choose these when correctness is non-negotiable and you can absorb the latency overhead. PA/EC or PC/EL are less common but possible with per-operation tuning — for example, Cassandra with LOCAL_QUORUM reads (EC in normal operation) but default write behaviour (AP during partition if you lower write consistency).
Production Insight
The teams I've seen struggle most with database selection are the ones who evaluated CAP position without evaluating PACELC. They picked Cassandra for a service that needed both high write throughput (correct — PA/EL is great here) and read-after-write consistency on a critical confirmation flow (wrong — PA/EL does not give you this in normal operation, let alone during partition).
The fix was a hybrid: Cassandra for the high-volume event stream (PA/EL), and a lightweight Postgres instance with synchronous replication for the confirmation read (PC/EC on a low-volume, high-criticality path). Both systems doing what they're designed to do, on the paths that need them.
Rule: evaluate PACELC position alongside CAP position when choosing a database. The normal-operation latency/consistency trade-off affects every request. The partition trade-off affects a fraction of them.
Key Takeaway
CAP governs failure scenarios. PACELC governs the other 99.9% of your system's life. Know both axes for every database you use. The question 'how much latency does consistency cost in normal operation?' is just as important as 'what happens during a partition?' — and PACELC is the framework that lets you answer it.

Real Database Positions on the CAP and PACELC Spectrum

Let's map major databases to their actual behaviour under partition and under normal operation. These are not marketing positions — they're what happens when you inject a network fault.

Cassandra: AP (CAP), PA/EL (PACELC). By default, writes succeed once the coordinator acknowledges — replication is asynchronous. Reads with ONE return data from any single replica, potentially stale. Raise to LOCAL_QUORUM and you get quorum-consistent reads in normal operation (EL becomes EC for that query), but during a partition, any node that can't reach quorum will time out. Cassandra's power is that you tune this per query — it's the most flexible point on the spectrum.

DynamoDB: AP (CAP), PA/EL (PACELC) by default. The default read is eventually consistent — DynamoDB routes to the nearest replica without quorum. ConsistentRead=true uses quorum and shifts that read to PC/EC. Importantly, DynamoDB is not the original Dynamo system — it's a managed service with different internals. The original Dynamo (used by Amazon's shopping cart) used vector clocks and sloppy quorums; DynamoDB uses a different replication mechanism. Don't conflate the two when reading the 2007 Dynamo paper.

etcd: CP (CAP), PC/EC (PACELC). Raft consensus means every write goes through a leader commit. During a partition, the minority side of the Raft cluster stops accepting writes and by default stops serving reads (to prevent serving stale data). This is correct CP behaviour. In normal operation, the Raft overhead means p99 write latency across 3 AZs is typically 5–15ms. For distributed locks and configuration management, this is the right trade-off.

Zookeeper: CP (CAP), PC/EC (PACELC). Uses Zab (Zookeeper Atomic Broadcast) instead of Raft, but similar guarantees. During a partition, the minority side stops serving. Writes always go to the leader. Used extensively for leader election, service discovery, and distributed coordination where correctness outweighs availability.

Google Spanner: CP (CAP), PC/EC (PACELC). TrueTime enables externally consistent distributed transactions across global regions. During a global partition — rare but possible — some writes block until the TrueTime uncertainty window resolves. In normal operation, TrueTime adds 1–7ms to commit latency on top of network round-trip. Spanner is the closest production system to having strong consistency at global scale, but the latency cost is real and the availability cost during partition is real.

MongoDB (replica set, default config): CP-leaning (CAP), PA/EL in practice (PACELC). Writes go to the primary with writeConcern:1 by default — acknowledged by one node before returning. Secondaries replicate asynchronously and may lag. Reads from secondaries are eventually consistent. During partition: if the primary is isolated from a majority, the majority elects a new primary after ~10 seconds — during which writes are unavailable. If you lower writeConcern to w:0 or w:1 without journaling, you can lose acknowledged writes during failover, which is an AP-like behaviour. MongoDB's position shifts significantly based on write concern and read preference — treat the defaults as a starting point, not a fixed CAP label.

Redis Cluster: this one requires care. Redis Cluster uses hash slots partitioned across master nodes, each with replica nodes. During a partition, the behaviour is slot-dependent. With the default cluster-require-full-coverage=yes, if any master and all its replicas are unreachable, the entire cluster stops accepting writes — CP-like for availability but not linearizable. With cluster-require-full-coverage=no, the cluster continues serving the slots it can reach — AP-like for available slots. Redis Cluster does not guarantee linearizability in any configuration: replication is asynchronous, so a promoted replica may not have all writes from the failed master. Redis Cluster is best described as AP with last-write-wins semantics and potential write loss during failover — which is fine for caching and session storage, and wrong for anything requiring durability.

io/thecodeforge/cap/CAPPositionProvider.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
package io.thecodeforge.cap;

import java.util.Map;

/**
 * CAP and PACELC positions for commonly used databases.
 *
 * These reflect default configuration behaviour under partition.
 * Many of these can be tuned per-operation — treat the defaults
 * as your starting point, not your fixed position.
 *
 * Key reminder: DynamoDB != original Dynamo (2007 paper).
 * Redis Cluster behaviour is slot-dependent and config-dependent.
 */
public class CAPPositionProvider {

    public record DatabaseProfile(
        String capChoice,
        String pacelcChoice,
        String caveat
    ) {}

    private static final Map<String, DatabaseProfile> PROFILES = Map.of(
        "Cassandra", new DatabaseProfile(
            "AP",
            "PA/EL",
            "Fully tunable per query. LOCAL_QUORUM shifts reads to EC in normal op."
        ),
        "DynamoDB", new DatabaseProfile(
            "AP",
            "PA/EL",
            "ConsistentRead=true shifts individual reads to PC/EC. Not the 2007 Dynamo system."
        ),
        "etcd", new DatabaseProfile(
            "CP",
            "PC/EC",
            "Minority side stops serving during partition. Raft adds 5-15ms write latency across AZs."
        ),
        "Zookeeper", new DatabaseProfile(
            "CP",
            "PC/EC",
            "Zab consensus. Minority side stops. Preferred for locks and leader election."
        ),
        "Spanner", new DatabaseProfile(
            "CP",
            "PC/EC",
            "TrueTime adds 1-7ms uncertainty window to commits. Global partition causes write blocking."
        ),
        "MongoDB (replica set)", new DatabaseProfile(
            "CP-leaning",
            "PA/EL (default) -> PC/EC (linearizable)",
            "writeConcern and readConcern heavily influence actual position. Test your config."
        ),
        "Redis Cluster", new DatabaseProfile(
            "AP (slot-dependent)",
            "PA/EL",
            "Async replication = potential write loss during failover. cluster-require-full-coverage changes behaviour."
        )
    );

    public static DatabaseProfile getProfile(String dbName) {
        return PROFILES.getOrDefault(dbName, null);
    }

    public static void main(String[] args) {
        String[] dbs = {"Cassandra", "etcd", "Spanner", "Redis Cluster", "MongoDB (replica set)"};
        for (String db : dbs) {
            DatabaseProfile p = getProfile(db);
            if (p != null) {
                System.out.printf("%-30s CAP: %-12s PACELC: %-30s Note: %s%n",
                    db, p.capChoice(), p.pacelcChoice(), p.caveat());
            }
        }
    }
}
Output
Cassandra CAP: AP PACELC: PA/EL Note: Fully tunable per query. LOCAL_QUORUM shifts reads to EC in normal op.
etcd CAP: CP PACELC: PC/EC Note: Minority side stops serving during partition. Raft adds 5-15ms write latency across AZs.
Spanner CAP: CP PACELC: PC/EC Note: TrueTime adds 1-7ms uncertainty window to commits. Global partition causes write blocking.
Redis Cluster CAP: AP (slot-dependent) PACELC: PA/EL Note: Async replication = potential write loss during failover. cluster-require-full-coverage changes behaviour.
MongoDB (replica set) CAP: CP-leaning PACELC: PA/EL (default) -> PC/EC Note: writeConcern and readConcern heavily influence actual position. Test your config.
Production Insight
The single most useful thing you can do after reading a database's CAP and PACELC position is run a partition drill against it with your actual workload. Take your most critical write path. Inject a partition with Toxiproxy or tc netem. Observe what the database returns — does it error, block, or serve stale data? Does it match the documentation? You will be surprised how often it doesn't.
We ran this exercise against a MongoDB replica set configured with writeConcern:1 (the default at the time). Under partition with the primary isolated, a write was acknowledged by the primary, the primary stepped down, the new primary didn't have that write, and a subsequent read returned the pre-write value. The write was permanently lost. The documentation said writeConcern:1 was the default. It didn't say 'acknowledged writes can be lost during failover.' That's a CA-flavoured promise on a CP-positioned database.
Rule: test the actual behaviour, not the label. Every database's CAP position is a configuration-dependent approximation, and the defaults are often optimised for availability and performance — not for the correctness guarantees your most critical paths need.
Key Takeaway
Database CAP labels are defaults, not fixed identities. Redis Cluster is not simply AP — it's slot-dependent and write-loss-tolerant. MongoDB is not simply CP — it's highly configurable and defaults can produce AP-like write loss during failover. Know your database's actual behaviour at your actual configuration. The code above gives you a starting map; fault injection gives you the truth.

Split-Brain: The Partition Failure Mode Nobody Plans For

Split-brain is what happens when a partition causes two parts of your cluster to each believe they are the authoritative majority — and both continue accepting writes independently. It is the most dangerous partition failure mode, and it's the one most teams discover in production rather than in testing.

Here's the scenario. You have a 3-node cluster: N1, N2, N3. A network partition separates N1 from N2 and N3. N2 and N3 form a majority and elect N2 as the new leader. N1, isolated, doesn't detect the partition immediately — it still thinks it's the leader. During the detection window (which can be seconds to minutes depending on heartbeat timeout configuration), N1 accepts writes. N2 also accepts writes as the new leader. Both sides commit conflicting state.

When the partition heals, you have two versions of truth. The system must reconcile them, and depending on the conflict resolution strategy, some writes will be lost or overwritten.

For CP systems (etcd, Zookeeper), Raft and Zab are specifically designed to prevent split-brain: a node can only be leader if it has acknowledgement from a majority. N1 in the scenario above, isolated from N2 and N3, cannot form a majority and steps down. It stops accepting writes. This is correct — availability is sacrificed, but split-brain is prevented.

For AP systems (Cassandra, DynamoDB, Redis Cluster with async replication), split-brain is a real possibility. Cassandra with low write consistency (ONE or ANY) can accept writes on both sides of a partition. After recovery, last-write-wins (LWW) semantics based on timestamp ordering determine which write survives. If the client clocks are not well-synchronised, the wrong write wins. Redis Cluster's async replication means a promoted replica may not have all writes from the failed master — those writes are lost.

Asymmetric partitions make this harder. In a real network, partitions are rarely a clean cut. You might have a situation where N1 can send messages to N2 but N2's responses are dropped. N1 sees N2 as alive; N2 sees N1 as unreachable. Both might attempt to become leader, depending on the election protocol. This is why fault domain placement matters: place replicas in separate availability zones, use rack-aware placement policies in Cassandra, and configure etcd peer URLs to use AZ-specific endpoints. The goal is to ensure that a single physical failure — a switch, a power unit, an AZ network event — takes down at most a minority of your replicas.

  1. Last-Write-Wins (LWW): the write with the highest timestamp wins. Simple, but requires synchronised clocks. NTP drift of a few milliseconds can cause the wrong write to win. Cassandra uses LWW by default.
  2. Vector Clocks: track causality between writes using per-node counters. A write that happened after another write (causally) always wins. Writes that are concurrent (neither caused the other) are flagged as conflicts for application-level resolution. The original Dynamo system used vector clocks; DynamoDB removed them in favour of LWW.
  3. CRDTs (Conflict-free Replicated Data Types): data structures designed so that concurrent writes always merge deterministically without conflicts. Examples: G-Counter (increment-only counter), OR-Set (add/remove set with tombstones). Used in Riak and some Cassandra patterns. The constraint is that not all data types can be expressed as CRDTs.
  4. Application-Level Reconciliation: a background job reads from all replicas after partition recovery, detects divergence, and applies business logic to determine the correct value. This is the most flexible approach and the most work. Required when LWW semantics would produce wrong results and CRDTs don't fit the data model.
Watch the Asymmetric Partition
In real networks, partitions are rarely a clean cut. You might have a split where one side can send messages but not receive them (asymmetric packet loss). This makes CAP trade-offs harder to reason about: your CP system may think it has a majority when it actually doesn't — because it can send heartbeats but not receive acknowledgements. Always use fault domains (availability zones, racks) and place replicas to ensure a single physical failure takes down at most a minority. And test asymmetric partition scenarios specifically — not just full network cuts — because they're the ones your monitoring is least likely to detect cleanly.
Production Insight
One of the most instructive split-brain incidents I've seen involved a Cassandra cluster where the network interface card on one node developed intermittent packet loss — not a clean partition. The node appeared up in nodetool status (heartbeats got through occasionally) but was dropping ~30% of replication messages. Writes to that node were inconsistently replicated. Reads hitting that node returned a mix of current and stale data with no error signals.
The symptom was user-facing: some users saw their profile updates revert after a few minutes. The root cause took two days to find because nodetool status showed all nodes as UN (up/normal). The detection required correlating write latency histograms per-node with application-level data divergence reports.
Rule: partial partitions (packet loss, slow nodes, one-directional drops) are harder to detect than clean partitions. Build monitoring that tracks per-replica write success rates and per-replica read latency distributions — not just cluster-level health. A node that's 'up' but dropping 30% of messages is a split-brain waiting to happen.
Key Takeaway
Split-brain is the partition failure mode that turns an availability problem into a data correctness problem. CP systems prevent it via consensus — at the cost of availability on the minority side. AP systems tolerate it and require explicit conflict resolution strategies. Know which conflict resolution strategy your AP system uses (LWW, vector clocks, CRDTs, application reconciliation) and whether it's correct for your data model. Then test asymmetric partition scenarios specifically — they're the hardest to detect and the most common in real networks.

How to Make the CAP Decision in Your Architecture

Here's a practical framework for deciding CP or AP for a given service — and for documenting that decision so the next engineer doesn't have to reverse-engineer it during an incident.

Step 1: Identify data criticality and failure mode cost. Ask: what's worse for this service — wrong data or no data? For a financial transaction, wrong data (serving a stale balance) can cause real monetary loss or regulatory exposure. An error or brief unavailability is recoverable. Choose CP. For a product recommendation feed, stale recommendations are invisible to most users. An error that breaks page load is highly visible. Choose AP.

Step 2: Define recovery time objectives honestly. How long can this service be unavailable during a partition before the business impact exceeds the impact of serving stale data? If the answer is 'zero seconds,' you need AP. If the answer is 'a few seconds during leader election is acceptable,' CP is viable. Be honest — most teams underestimate their tolerance for brief unavailability and overestimate the cost of stale data.

Step 3: Evaluate read/write ratio and access patterns. Write-heavy services benefit from AP because they avoid quorum coordination overhead on every write. Read-heavy services can often tolerate stale reads if the underlying data changes slowly (product catalog, user preferences). Read-heavy services where data changes fast and staleness matters (account balance, inventory) need CP for reads even if the write rate is low.

Step 4: Apply the quorum math. For the databases you're considering, calculate R + W > N for your replication factor and desired consistency. Know what consistency level each critical read and write uses. Write it down. This is the most often skipped step — teams pick a database and never calculate what their consistency configuration actually guarantees.

Step 5: Design for partition recovery explicitly. If you choose AP, partition recovery is not automatic from an application perspective — replicas reconcile at the data layer, but your application may have made decisions based on stale data during the partition window. Design a reconciliation path: a background job, a read-repair trigger, or an idempotent retry mechanism. If you choose CP, partition recovery means the minority side comes back online and catches up with the majority — but clients that were hitting the minority side were getting errors during the partition. Design a retry/backoff policy and a user-facing message for that state.

Step 6: Document and test. For each critical data path, write one sentence: 'During a partition, this path [serves stale data / returns an error] because we prioritise [availability / consistency] for [business reason].' Then run a fault injection test that verifies it. If the observed behaviour doesn't match the sentence, the sentence or the configuration is wrong. Fix whichever one is cheaper.

The Hybrid Architecture Trap
Hybrid approaches — CP store for authoritative state, AP cache for reads — are correct in principle but dangerous in practice if the cache invalidation strategy isn't designed for partition scenarios. During a partition, the CP store may be unavailable (minority side) while the AP cache continues serving stale data. If the application treats a cache hit as authoritative, users get stale data with no error signal and no way to know it's stale. Always make cache staleness explicit: log it, surface it in monitoring, and design the UI to handle it gracefully (e.g., 'data last updated X minutes ago').
Production Insight
One team I worked with used DynamoDB (AP) for an order tracking service. During a regional partition, orders that were placed appeared to 'disappear' for some users — they were reading from a replica that hadn't received the write yet, and the service had no reconciliation logic. The business impact was significant: customer support calls spiked, and some users placed duplicate orders thinking the first one was lost.
The fix had two parts: first, a reconciliation job that ran on partition recovery, comparing order IDs across regions and flagging divergence. Second, a user-facing message on the order confirmation page: 'Your order has been placed. It may take a few minutes to appear in your order history.' That single sentence reduced customer support calls by 60% during subsequent partition events — not because the system was more correct, but because users knew what to expect.
Rule: if you go AP, design for staleness visibility at every layer — monitoring, logging, and user interface. Stale data served silently is the worst outcome. Stale data served with context is manageable.
Key Takeaway
Your CAP choice is not a one-time architectural decision. Revisit it when your data volume grows, when you expand to new regions, when your user base crosses regulatory boundaries requiring data residency, and when your business model changes (a recommendation feed becoming a financial service, for example). The correct choice today may be genuinely wrong next year — and the cost of discovering that in a production incident is much higher than the cost of a quarterly architecture review.
● Production incidentPOST-MORTEMseverity: high

The 3 AM Cassandra Read Timeout That Wasn't a Bug

Symptom
During a single-AZ outage in AWS us-east-1, a read-heavy microservice started returning balance values that were 2–3 seconds old. Some reads succeeded; others threw timeouts. No errors in the application logs — only latency spikes in the APM dashboard. The Cassandra driver was logging UnavailableExceptions at DEBUG level, which nobody had routed to alerts.
Assumption
Engineers assumed the database was fine because writes were succeeding. They blamed the application's read-path logic — a caching layer that had recently been refactored. Multiple rollbacks later, nothing changed. The real signal was sitting in driver-level debug logs the whole time.
Root cause
The service was using Cassandra's default read consistency level (ONE), which during a partition returned data from any available replica — often a stale one. Writes used QUORUM with replication factor 3, so a write was considered successful once 2 of 3 replicas acknowledged it. During the AZ outage, one replica was unreachable. Reads hitting that replica returned pre-partition data. The system was available but not consistent — exactly the AP trade-off Cassandra makes by default, applied to a payment path that needed the opposite.
Fix
Changed read consistency to LOCAL_QUORUM, which requires acknowledgements from a majority of replicas in the local datacenter. During partitions, reads that can't reach a quorum now block and return an error rather than serving stale data. The trade-off — accepting unavailability over inconsistency — was correct for a payment reconciliation path. Post-fix measurement: p99 read latency increased by approximately 18ms under normal operation (quorum round-trip overhead). During the next simulated AZ failure in staging, the service correctly returned errors for ~12 seconds — the duration of the partition — rather than serving stale balances for 7 minutes. The on-call team considered 12 seconds of errors vastly preferable to 7 minutes of wrong data flowing into a financial ledger.
Key lesson
  • Check your database driver logs at DEBUG level during incidents — UnavailableExceptions and ReadTimeoutExceptions often appear there before they surface anywhere else.
  • Understand your database's default consistency levels — they almost always favour availability over consistency because that is the safer default for the majority of workloads. Payment paths are not the majority of workloads.
  • When partitions happen, 'available' doesn't mean 'correct'. Test your system under network faults — not just under load — before you find out in production at 3 AM.
  • Document the CAP trade-off for each critical read/write path. If your on-call engineer can't explain which guarantee you're sacrificing during a partition, you're already broken.
Production debug guideWhen your system behaves unexpectedly during partitions, these symptom→action pairs help you pinpoint whether you're looking at a consistency failure or an availability failure — before you start changing code.4 entries
Symptom · 01
Reads return old data during a known network partition
Fix
Check read consistency level first — not application code. In Cassandra: inspect the driver's default consistency (often set in application config, not cqlsh). In DynamoDB: check whether ConsistentRead is true. If you're reading with ONE or ANY in Cassandra, or eventually consistent reads in DynamoDB, you've chosen AP for that query. Raising to LOCAL_QUORUM or ConsistentRead=true trades some availability for correctness. Do not raise to ALL unless you accept total unavailability when any node is down.
Symptom · 02
Writes succeed but reads intermittently time out
Fix
This is almost always a write/read consistency mismatch. If writes use QUORUM and reads use ONE, replicas can have divergent data during a partition — some replicas have the new write, others don't. A ONE read that hits a stale replica returns old data; a ONE read that hits an unavailable replica times out. Align consistency levels or enable read-repair. In Cassandra, you can trigger read-repair by issuing a read at QUORUM, which will force reconciliation across replicas.
Symptom · 03
Service returns 503 or 500 during partition when it should be available
Fix
Verify whether the database is CP (etcd, Zookeeper). During a partition, the minority side of a Raft or Zab cluster will refuse writes and may stop serving reads. This is correct behaviour for CP — the system is protecting consistency. If this service path can tolerate stale data, consider routing read-only queries to an AP cache or a read replica with explicitly lowered consistency, and reserve the CP store for writes and critical reads.
Symptom · 04
Data loss after a network partition recovers
Fix
Look for writes that were acknowledged at a consistency level lower than replication factor. For example, writes with w=1 to a 3-node cluster: if the node holding that write goes down before replication completes, the write is lost. In Cassandra, check whether hinted handoff is enabled and whether hints were dropped due to the partition lasting longer than the hints window (default: 3 hours). Run nodetool repair after partition recovery. In systems without automatic repair, implement a reconciliation job that detects divergent values using checksums or version vectors.
★ CAP Quick Debug Cheat SheetWhen you suspect a CAP-related issue, run these commands to gather evidence fast. Note: cqlsh CONSISTENCY shows the session-level default for that cqlsh client — your application's consistency level is set in the driver configuration (e.g., application.yml or driver builder). Always check both.
Stale reads under partition
Immediate action
Identify read consistency at both cqlsh session level and application driver config — they can differ
Commands
cqlsh -e 'CONSISTENCY;' # shows cqlsh session default, NOT your app driver setting
nodetool proxyhistograms | head # shows read/write latency distribution across replicas
Fix now
In application driver config, set default consistency to LOCAL_QUORUM for critical reads. Validate by tailing driver DEBUG logs during a simulated partition — UnavailableException should appear instead of stale data
Unavailability during partition+
Immediate action
Determine whether the database is CP or AP and which side of the partition your service is on
Commands
etcdctl endpoint status --write-out=table # shows leader, raft index, and which endpoints are healthy
nodetool status # UN=Up/Normal, DN=Down/Normal — shows which Cassandra nodes are unreachable
Fix now
For CP stores: confirm your service is on the majority side. If it's on the minority side, no amount of retrying will help — route traffic to the majority. For AP stores: if you're seeing unavailability, check internal timeout configs; AP stores should not block during partition
Data divergence after partition recovery+
Immediate action
Trigger repair before users notice inconsistency — the longer you wait, the wider the divergence window
Commands
nodetool repair -pr # repairs primary range on this node — less disruptive than full repair
nodetool repair --full # full repair — run during off-peak hours, can be I/O intensive
Fix now
Schedule a full cluster repair after partition recovery. For DynamoDB or other managed AP stores with no direct repair command, implement application-level reconciliation: read from multiple regions, compare version timestamps, write the winner back
CAP and PACELC Properties Quick Comparison
PropertyWhat It GuaranteesDuring a PartitionNormal Operation (PACELC)Example Database
Consistency (C)Every read sees the most recent write (linearizability)Must block or return an error if partition prevents quorum coordination — minority side becomes unavailableEC systems pay coordination latency on every write (e.g., +5–15ms across AZs for Raft consensus)etcd, Spanner, Zookeeper (CP / PC/EC)
Availability (A)Every request to a non-failed node gets a non-error responseMay return stale data — the system responds rather than blockingEL systems write fast (no quorum wait) but reads may lag behind writes in normal operationCassandra, DynamoDB default (AP / PA/EL)
Partition Tolerance (P)System continues operating despite message loss between nodesSystem must choose C or A — P itself is not optional in any multi-node networkP is always assumed. PACELC addresses the trade-off when P is not occurring.All distributed systems — P is the given, not the choice

Key takeaways

1
CAP Theorem was conjectured by Brewer in 2000 and proved by Gilbert and Lynch in 2002. It is a constraint during network partitions
not a menu. During partition, choose C or A. Partition Tolerance is not optional.
2
CA is achievable in single-DC synchronous systems as a practical approximation
because intra-DC partitions are rare. It is a fiction in multi-datacenter distributed systems. Every vendor claiming CA in a distributed context deserves a fault injection test before you trust the claim.
3
PACELC extends CAP to normal operation
even without partition, consistent systems pay coordination latency (EC), while eventually consistent systems optimise for low latency at the cost of potential read staleness (EL). Evaluate both axes when choosing a database.
4
Understand your database's actual behaviour at your actual configuration
not its marketing label. Redis Cluster is AP with potential write loss, not simply a fast AP store. MongoDB's CAP position is highly configuration-dependent. Cassandra's position is tunable per query.
5
The quorum math (R + W > N) is the bridge between CAP theory and database configuration. Know your replication factor, your read consistency, and your write consistency for every critical data path. Write it down.
6
Split-brain is the partition failure mode that turns an availability event into a data correctness event. Plan for conflict resolution
LWW, vector clocks, CRDTs, or application-level reconciliation — whichever is correct for your data model. Then test asymmetric partition scenarios, not just clean cuts.
7
Test under real network faults
asymmetric packet loss, partial link degradation, slow nodes — not just full partitions. The failure modes that hurt most in production are the subtle ones that don't trigger clean detection.

Common mistakes to avoid

5 patterns
×

Assuming CA databases exist in distributed systems

Symptom
Vendors claim CA but during a network partition the database either hangs, silently drops writes, or returns errors with no documentation of which guarantee was sacrificed. Production incidents manifest as unexplained timeouts or post-incident data loss that the vendor attributes to 'network instability.'
Fix
Accept that during a partition you must choose CP or AP. Single-DC synchronous replication approximates CA in practice because intra-DC partitions are rare — but this approximation collapses the moment you span datacenters. Test with network fault injection (Toxiproxy, tc netem) to see actual behaviour under partition, not theoretical behaviour from documentation.
×

Ignoring consistency level configuration defaults — and checking them only in cqlsh, not in the application driver

Symptom
Reads return stale data under partition because the application driver's default consistency level (e.g., ONE in Cassandra) favours availability. No errors surface — just wrong data served silently. This is compounded when engineers check consistency via cqlsh (which shows the session-level default) without realising the application driver may be configured differently.
Fix
Check consistency levels in three places: the cqlsh session default, the application driver configuration (e.g., application.yml, driver builder code), and any per-query overrides in the data access layer. For critical reads, set LOCAL_QUORUM or SERIAL at the driver level. Document the setting and the reason. Route driver DEBUG logs to a log aggregator so UnavailableExceptions are visible during incidents.
×

Failing to plan for partition recovery — especially conflict resolution

Symptom
After a partition heals, replicas have divergent data. AP systems (Cassandra, DynamoDB) reconcile at the storage layer via anti-entropy and read-repair, but the application has no visibility into which version 'won.' With last-write-wins semantics and unsynchronised clocks, the wrong version may win. Data appears inconsistent or reverts to older values days after the partition.
Fix
Implement explicit reconciliation logic for any data path where LWW semantics produce incorrect results. Options: vector clocks for causal ordering, CRDTs for data types that support it (counters, sets), or application-level reconciliation jobs that run after partition recovery. For Cassandra, run nodetool repair after partition events. Set up monitoring that detects divergence by comparing checksums or version vectors across replicas — don't wait for users to report it.
×

Choosing CP for a read-heavy service without understanding the availability cost — or choosing AP for a write-heavy financial service without understanding the staleness cost

Symptom
CP + read-heavy: during partition, the service returns 503 for the entire duration of the minority side's unavailability — which can be minutes if leader election is slow or the partition is persistent. Users see complete outage. AP + financial: stale reads cause users to act on incorrect balance or inventory data. The service is 'available' but the data is wrong, which is worse than an error for financial paths.
Fix
Match the CAP choice to the actual failure mode cost for that specific service. For read-heavy services tolerant of staleness: AP with eventual consistency and a staleness SLO (e.g., 'reads may be up to 5 seconds stale'). For financial paths: CP with an explicit error page designed for the partition scenario ('service temporarily unavailable — please retry in 30 seconds'). Never default to CP because it 'feels safer' — it has a real availability cost that affects real users.
×

Treating Redis Cluster as a straightforward AP store for all use cases

Symptom
Data loss after a Redis Cluster master failover — writes acknowledged by the master before it failed are not present on the promoted replica. The application treats the Redis acknowledgement as durable, but Redis's async replication means it was not. This manifests as lost session data, missing cache entries that should have been populated, or — in misuse cases — lost application state that was incorrectly stored in Redis as a primary store.
Fix
Understand Redis Cluster's actual guarantees: async replication, potential write loss during failover, and slot-dependent availability based on cluster-require-full-coverage configuration. Use Redis for caching and session data where loss is tolerable. Never use Redis Cluster as a primary store for data you cannot reconstruct. For data that must survive failover, use a CP store with synchronous replication for that path.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain CAP Theorem. Why can't you have all three guarantees simultaneou...
Q02SENIOR
Is MongoDB CP or AP? Does it change with configuration?
Q03SENIOR
Your system uses Cassandra. A network partition causes read requests to ...
Q04SENIOR
How does PACELC differ from CAP, and when would you use it in an archite...
Q01 of 04SENIOR

Explain CAP Theorem. Why can't you have all three guarantees simultaneously?

ANSWER
CAP Theorem was conjectured by Eric Brewer at PODC 2000 and formally proved by Gilbert and Lynch in 2002. It states that in a distributed system, during a network partition, you must choose between Consistency (every read sees the most recent write — linearizability) and Availability (every request to a non-failed node gets a non-error response). Partition Tolerance is not a choice — it's the given. Any multi-node system must tolerate the fact that the network between nodes will fail. The impossibility argument: during an asynchronous network partition, a node cannot distinguish between 'the other node is slow' and 'a partition has occurred.' If you require both C and A, a node on the minority side of a partition must both respond immediately (A) and respond with the latest write (C). But it can't know the latest write without communicating with the other side — which it can't reach. Any protocol that guarantees both would require synchronous communication, which is impossible in an asynchronous network with partitions. So during a partition, you sacrifice one: either refuse to serve stale data (CP, sacrifice A) or serve a response even if stale (AP, sacrifice C). In practice, the real architectural question is PACELC: even without a partition, strongly consistent systems pay coordination latency on every write (the EC trade-off), while eventually consistent systems optimise for low latency but may serve stale reads in normal operation (the EL trade-off).
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is CAP Theorem in simple terms?
02
Does CAP Theorem apply when there is no partition?
03
Is it possible to switch between CP and AP dynamically?
04
What's the difference between CAP and PACELC?
05
What is split-brain and how do you prevent it?
🔥

That's Fundamentals. Mark it forged?

18 min read · try the examples if you haven't

Previous
Scalability Concepts
3 / 10 · Fundamentals
Next
Horizontal vs Vertical Scaling