Cassandra LOCAL_ONE Write Loss — Multi-DC Failure Patterns
LOCAL_ONE acknowledged a write, then the DC failed before hinted handoff.
20+ years shipping high-throughput database systems. Lessons pulled from things that broke in production.
- Cassandra is a distributed NoSQL database designed for high write throughput and horizontal scaling.
- Data is partitioned across nodes using a consistent hash ring — each node owns a token range.
- Replication factor determines how many copies of each partition exist; replication strategy sets placement.
- Writes go to Memtable and CommitLog first, then flushed to immutable SSTables on disk.
- Tunable consistency lets you choose between strong reads (QUORUM) and low-latency reads (ONE) per query.
- Production trap: using LOCAL_QUORUM in a multi-datacenter cluster can cause silent data loss during a DC failure.
Imagine a city library that's so popular, one building can't hold all the books or serve all the visitors. So they build 6 identical branch libraries across the city, and each branch is responsible for books whose titles start with certain letters. When you want a book, you go to the branch that owns it — and if that branch is closed, the next nearest one has a copy. Cassandra works exactly like that: it splits your data into chunks (partitions), spreads those chunks across many servers (nodes), and keeps copies on neighboring servers so nothing is ever lost when one goes down.
Every time you hit 'like' on a social media post, check your Uber driver's live location, or see your Netflix watch history load in under a second — there's a non-trivial chance Cassandra is doing the heavy lifting underneath. It was built by Facebook engineers to handle the inbox search problem: hundreds of millions of writes per day, globally distributed, with zero acceptable downtime. Relational databases buckled under that load. Cassandra didn't.
The core problem Cassandra solves is the tension between scale, availability, and write throughput that kills traditional RDBMS systems. When you need to ingest millions of events per second across datacenters in Tokyo, Frankfurt, and Ohio simultaneously, you can't afford a single master node, a two-phase commit, or a schema that requires costly JOINs. Cassandra abandons all three. It trades the strong consistency guarantee of SQL for eventual consistency and gives you fine-grained control over exactly how eventual that consistency is — per query.
After working through this article, you'll be able to design a Cassandra data model from scratch using partition keys and clustering columns, explain exactly what happens under the hood during a write (Memtable → CommitLog → SSTable), configure replication for a multi-datacenter cluster, tune consistency levels intelligently based on your SLA, and avoid the five production mistakes that turn a fast Cassandra cluster into a slow one.
How Cassandra LOCAL_ONE Consistency Actually Works
Cassandra's LOCAL_ONE consistency level writes to one replica in the local datacenter and returns success as soon as that replica acknowledges the write. It does not wait for other replicas, even if the replication factor (RF) is higher. This is the default write consistency for many production workloads because it minimizes latency while still providing durability within the local DC.
In practice, LOCAL_ONE sends the write to the coordinator, which picks the closest replica (by snitch) in the local DC. If that replica fails or is unreachable, the coordinator retries against another replica in the same DC — but only one acknowledgment is required. This means a write can succeed even if only one of, say, three replicas in the local DC is alive. However, if that single replica later fails before propagating to others (via hinted handoff or repair), the write is lost.
The critical trade-off: LOCAL_ONE gives the lowest latency of any durable consistency level, but it sacrifices read-your-writes guarantees and cross-DC consistency. Use it when throughput and low p99 latency matter more than immediate consistency — for example, in time-series ingestion or event logging where a small fraction of lost writes is acceptable. Never use it for transactional or billing data where every write must survive a single-node failure.
The Cassandra Data Model: Partition Keys, Clustering Columns, and Physical Storage
Cassandra's data model is deceptively simple: a table is a collection of rows, and each row is identified by a primary key. But that primary key has two parts: the partition key determines which node stores the row, and the clustering columns determine the order within that partition.
Here's the thing most docs gloss over: the partition key isn't just a lookup key. It's the unit of physical locality. All rows sharing the same partition key live on the same node and are stored contiguously on disk. That's why bad partition key design (e.g., using a timestamp as the sole partition key) creates hot spots — all writes for that second hit one node.
Internally, each partition is stored as a single row inside a Memtable, then flushed to an SSTable on disk. When you query by partition key, Cassandra uses an in-memory index (the partition summary) to find which SSTable contains that partition. It'll read only that SSTable, not the whole file. Bloom filters help skip SSTables that definitely don't contain your partition.
The clustering columns sort rows within a partition. This ordering is physical — it's how data is laid out on disk. So if you query with a range on the first clustering column, Cassandra can skip rows efficiently. This is a big deal for time-series workloads where you often query recent data.
- A good partition key distributes data evenly across shelves — no shelf gets all the books.
- If you use a sequential ID (like an auto-increment), all new books will pile onto one shelf — a hot spot.
- Clustering order matters because Cassandra writes data in that order. If you need range queries, make the range part of the clustering key, not the partition key.
- Large partitions (over 100 MB) hurt compaction and read performance. Keep each partition under 10 MB ideally.
Write Path Internals: Memtable, CommitLog, and SSTable
When you issue a write to Cassandra, the coordinator node sends it to all replicas that should hold the partition (based on replication strategy). Each replica does this:
- Writes the mutation to the CommitLog (append-only, sequential writes on disk) — this ensures durability even if the node crashes before the Memtable flushes.
- Applies the mutation to an in-memory structure called the Memtable (a sorted map of partition key → row data).
- Returns success to the coordinator — as soon as it's in both CommitLog and Memtable.
- Later, when the Memtable is full (default 128 MB) or a flush is triggered, it's sorted and written to disk as an immutable SSTable. The CommitLog entries for that Memtable are then truncated.
The key insight: writes never touch a disk read path until flush. That's why Cassandra write throughput is so high — it's essentially sequential CommitLog writes plus memory operations. But that speed has a cost: if you lose power between node 3 (return success) and node 4 (flush), the Mutation is replayed from the CommitLog on startup. The CommitLog is your safety net.
SSTables are immutable — once written, never modified. That means deletions and updates don't modify existing data; they write a new tombstone (for deletes) or a newer timestamped version. Old data becomes garbage that must be compacted away. That's why tombstones are dangerous: if you create a delete-heavy workload without frequent compactions, SSTables pile up with dead rows, inflating read latency.
nodetool compact after large batch deletes, or use a shorter TTL and let compaction handle it naturally. The rule of thumb: a single partition should never have more than 100 tombstones.commitlog_total_space_in_mb combined with a surge in write volume. The node couldn't flush fast enough, so it blocked writes.commitlog_sync: periodic (default batch mode) with a short sync window. Use memtable_heap_space_in_mb = 512 MB or less to avoid long GC pauses.commitlog_sync: batch — this slower but guarantees each write is fsynced before ack. Expect 2-3x write latency trade-off.Visual: Cassandra Write Path — CommitLog → Memtable → SSTable
The write path is the most critical flow in Cassandra because it determines both throughput and durability. This diagram visualizes the three main stages: the CommitLog (durable, append-only log), the Memtable (in-memory sorted buffer), and the SSTable (immutable disk file). The sequence is: (1) write arrives, (2) immediately written to CommitLog for crash recovery, (3) applied to Memtable for fast reads, (4) acknowledged to client once both are done, (5) later, Memtable is flushed to an SSTable on disk, and the CommitLog segment is truncated.
The architectural insight: writes are fast because step 2 and 3 are sequential I/O and memory operations. The flush (step 5) happens asynchronously in the background. This design prioritizes write throughput over read latency — reads may have to consult multiple SSTables, but writes are nearly as fast as the network and memory allow.
The diagram below uses a mermaid flowchart to capture the flow.
commitlog_sync: periodic with a shorter sync window (e.g., 10ms) instead. The CommitLog is your only safety net for unflushed writes.Replication: Strategies, Snitch, and Multi-Datacenter Placement
Cassandra distributes data across nodes using a consistent hash ring. Each node owns a range of token values. A partition key is hashed to a token, and the node that owns that token range is the primary replica. Additional replicas are placed on the next N nodes in the ring (where N = replication factor).
But that's only half the story. The actual placement depends on the replication strategy and the snitch. The snitch tells Cassandra about network topology — which nodes are in which rack and datacenter. Without a proper snitch, Cassandra will place replicas on nodes in the same rack, defeating fault tolerance.
- SimpleStrategy: for single-datacenter testing. Places replicas on the next N nodes in the ring. Avoid in production.
- NetworkTopologyStrategy: for multi-datacenter. You specify a replication factor per datacenter. For example, RF=3 in DC-West and RF=2 in DC-East. The snitch determines which nodes belong to which DC, and Cassandra places the right number of replicas in each DC, distributed across racks to survive rack failures.
Here's the gotcha: if your snitch is misconfigured (e.g., using SimpleSnitch when you have multiple datacenters), Cassandra will think all nodes are in one datacenter and will place all replicas in that single DC — the other DC gets zero data. You'll have a false sense of redundancy.
Tunable Consistency: What Each Level Actually Means for Latency and Correctness
Cassandra's consistency levels let you choose how many replicas must respond before a query returns. This is per-query — you can use strong consistency for critical operations and eventual consistency for everything else.
Here's the trade-off: higher consistency = higher latency + lower availability. If you require ALL replicas to agree, a single node failure makes that query impossible. If you require ONE replica, you get fast responses but risk stale reads or lost writes.
- ONE: one replica responds. Fastest. Weakest guarantee. Use for bulk writes or non-critical reads.
- LOCAL_ONE: one replica in the coordinator's DC responds. Avoids cross-DC latency. Good for local-read scenarios.
- QUORUM: majority of replicas. For an RF=3, that's 2 replicas. Ensures read-your-write consistency within a single session if both reads and writes use QUORUM.
- LOCAL_QUORUM: majority within the local datacenter. Preferred for multi-DC writes to avoid cross-DC latency.
- EACH_QUORUM: majority in each datacenter. Strong, but slow — requires cross-DC coordination. Use only for critical financial operations.
- ALL: all replicas. Guarantees strong consistency, but a single node failure makes the query fail. Avoid in production unless you can tolerate failure on every read/write.
The production rule: use LOCAL_QUORUM for writes and reads within the same datacenter. This provides strong consistency within that DC and survives one node failure (RF=3). For global reads that don't need the latest data, use LOCAL_ONE.
A common mistake is using QUORUM in a multi-DC world: QUORUM means majority of all replicas, not majority in each DC. In a 3+3 DC setup, QUORUM requires 4 replicas – which could be 3 in DC1 and 1 in DC2. That cross-DC read adds latency and if DC2 is unreachable, QUORUM may fail. Use LOCAL_QUORUM instead.
- Use LOCAL_QUORUM for writes that must be durable across nodes (your core business data).
- Use LOCAL_ONE for background reads (recommendations, logs, analytics). Staleness is acceptable.
- If your application needs read-after-write consistency, both read and write must use the same strong consistency level (QUORUM or LOCAL_QUORUM).
- NEVER use ALL in production – a single node failure kills all reads/writes.
Quick-Reference: Consistency Levels and Their Effects
Choosing the right consistency level per query is the key to balancing performance and correctness. This table consolidates the available levels, their guarantees, and recommended use cases. Unlike the high-level comparison earlier, this table includes the exact number of replicas required and how failure tolerance changes with replication factor.
The rule of thumb: for business-critical operations within a datacenter, use LOCAL_QUORUM. For global operations requiring strong consistency across datacenters, use EACH_QUORUM — but expect 20-100ms latency. For high-throughput background tasks, use LOCAL_ONE or ONE.
Important nuance: LOCAL_SERIAL and SERIAL are used for lightweight transactions (LWTs) like conditional updates. They use Paxos protocol internally and are significantly slower — avoid in hot paths.
Compaction Strategies: SizeTiered, Leveled, and TimeWindow – When to Use Each
SSTables are immutable, so over time you accumulate many small files. Compaction merges them into larger ones, discards tombstones and overwritten data, and reclaims disk space. Cassandra offers three main strategies:
SizeTieredCompactionStrategy (STCS): merges SSTables of similar size. Default. Simple, but can lead to write amplification (multiple merges of the same data) and space amplification (temporary disk double during compaction). Best for write-heavy, read-rare workloads where uniformity is acceptable.
LeveledCompactionStrategy (LCS): organizes SSTables into levels (L0, L1, L2...). Each level is 10x larger than the previous. Compaction occurs within a level, guaranteeing that 90% of reads hit at most one SSTable per level. Results in lower read latency but higher write amplification (more compaction work for each write). Best for read-heavy workloads, high consistency, or systems where read tail latency matters.
TimeWindowCompactionStrategy (TWCS): designed for time-series data. SSTables are grouped by time window (e.g., one hour). Compaction only merges SSTables within the same window. Old windows are left untouched. Ideal for metrics data where older data is rarely queried and tombstones from TTLs can be efficiently removed per window.
The trick: choose based on your write/read ratio and access pattern. STCS works for most cases, but don't use it with high TTL workloads – it'll keep merging tombstones into ever-larger SSTables, wasting disk and read time. TWCS is far more efficient for TTL-heavy tables.
nodetool compactionstats. If the pending count stays above 100 for more than 10 minutes, your cluster is falling behind. Consider reducing SSTable max size or switching to LCS for faster compaction. Also ensure disk has 50% free space for STCS to avoid compaction failure.concurrent_compactors from 2 to 8, and decreased sstable_size_in_mb to 50 to keep per-window SSTable count low. Reads returned to sub-10ms.Compaction Strategy Decision Matrix — Choosing the Right Strategy for Your Workload
Cassandra offers three compaction strategies, each optimized for different access patterns. The decision comes down to your read/write ratio, data lifetime (TTL), and whether you prioritize write throughput or read latency. Below is a decision matrix comparing STCS, LCS, and TWCS across key dimensions.
How to use this matrix: Identify your workload category (column) and find the recommended strategy. Then check the trade-offs in terms of write amplification, read amplification, disk space overhead, and compaction frequency. The 'Best For' column gives a quick verdict.
nodetool compactionstats to see pending compactions. A backlog above 200 indicates the cluster can't keep up. If using STCS, ensure at least 50% free disk space to allow compaction to operate. For LCS, monitor write latency — if it increases, compaction may be throttled. For TWCS, check that compaction windows match your TTL — if windows are too large, tombstones accumulate.Tombstones and Bloom Filters — How Deletes Are Handled and How Reads Are Accelerated
Tombstones are Cassandra's mechanism for handling deletes. When you delete a row, Cassandra writes a tombstone — a marker that says 'this data is considered deleted as of timestamp X.' The original data is not removed immediately; it remains in the SSTable until compaction. During a read, Cassandra scans through SSTables and skips any cell with a tombstone whose timestamp is newer than the data. This makes deletes and TTLs (which also create tombstones) deceptively expensive: each tombstone must be read and ignored, adding latency.
Bloom filters are the counterbalance. Before reading an SSTable, Cassandra checks the Bloom filter — a probabilistic data structure that tells if an SSTable definitely does NOT contain the requested partition key. If the filter says 'no,' the SSTable is skipped entirely. This drastically reduces read latency, especially when many SSTables exist. However, Bloom filters have a false positive rate (default 1 in 10,000) — sometimes they say 'maybe' when the data isn't there, causing a wasted read. The filter is stored in memory and is keyed by partition key.
Tombstone gotchas: - A partition with >100K tombstones can cause a read timeout even if the live data is just a few rows. - You can monitor tombstones with nodetool cfstats — look for 'SSTable count' and 'Tombstone count'. - The gc_grace_seconds setting (default 10 days) determines how long a tombstone must exist before compaction can remove it. This is to prevent resurrecting data from other nodes that may not have seen the delete yet. If you're sure all replicas have the delete, you can lower this value to reduce tombstone duration.
Bloom filter tuning: - bloom_filter_fp_chance (default 0.01 = 1% false positive) trades memory for accuracy. Lower values use more memory but reduce wasted reads. - Bloom filters are stored in off-heap memory. If memory is tight, you can increase the false positive chance to 0.1, but expect higher read latency. - Use nodetool cfstats to see 'Bloom filter false positives' — if this is high, consider reducing bloom_filter_fp_chance.
tombstone_warn_threshold (default 1000) and tombstone_failure_threshold (default 100,000). If a query scans more than 1000 tombstones, a warning is logged. If it exceeds 100,000, the query is aborted with an error. Monitor these logs to detect tombstone buildup early.gc_grace_seconds to 3600 (1 hour) since all replicas were in the same datacenter and consistency was LOCAL_QUORUM, guaranteeing tombstone propagation.Who This Is For (And Who Should Walk Away)
If you're reaching for Cassandra because 'it's like MongoDB but better' or because you heard it scales linearly, you need to stop and read this first. This isn't a tutorial for database beginners. You should already know what ACID and BASE mean, understand why joins hurt at scale, and have shipped at least one production system that didn't fit into a single Postgres instance. Cassandra will punish you if you don't understand its trade-offs. If you can't stomach eventual consistency or don't have a team that can handle operational complexity, go back to your RDS instance. This page assumes you have Java 11+ on your PATH, understand Linux filesystem basics, and have read the Cassandra docs on partition keys at least once. No hand-holding. We're building production systems, not school projects.
Every Node Is a Coordinator – Why You Can't Have a Master
Cassandra has no single point of failure because there's no master node. Every node in the ring is identical and can serve any request. When your application sends a query, the driver picks a coordinator node based on the token range of the partition key. That coordinator talks to the replicas that own the data, gathers responses according to your consistency level, and returns the reconciled result to you. This means your application never needs to know where data lives. It also means that if a coordinator node dies mid-request, the driver automatically retries on another node. Clients don't need connection pooling to a single master. They need a driver that can handle a ring of peers. The trade-off: no transactions, no cross-partition joins, no global secondary indexes without performance penalties. Every node is a coordinator, and every node is a potential bottleneck if you don't distribute your partition keys evenly. Hot spots kill throughput faster than any single server failure.
Data Partitioning – Where Your Rows Actually Live
Cassandra distributes data across nodes using consistent hashing. Each row's partition key is hashed to a token value between -2^63 and 2^63 - 1. The ring maps token ranges to nodes. When you insert a row, Cassandra hashes the partition key, finds which node owns that token, and sends the data there. This is why partition key design is the single most important schema decision you'll make. Choose a key with high cardinality (like user_id) to spread writes evenly. Pick a key with low cardinality (like status) and you'll create a hot node that handles all writes for 'active' users while other nodes sit idle. Clustering columns sort data within a partition, but they don't affect distribution. One partition can store gigabytes of data if you're not careful – that's a bad time for reads. Keep partitions under 100MB in production. Monitor with nodetool cfstats and look at 'Compacted partition maximum bytes'. If you see partitions over 1GB, you're in trouble.
status with 3 values) means 33% of your cluster handles all traffic. You'll see uneven load in nodetool ring and wonder why 2 nodes are at 95% CPU while others idle. Fix it before you deploy.Materialized Views – Pre-Joined Read Performance at a Cost
Why: In Cassandra, queries must follow the primary key structure of a table. Materialized Views let you pre-define alternative access patterns without duplicating writes manually. How: Each Materialized View is a separate table that Cassandra maintains automatically. When you write to the base table, Cassandra asynchronously updates the view. This gives you query flexibility without client-side joins. But the cost is real: view updates add write latency and stress the coordinator. Views also have strict constraints — the base table’s primary key must be fully included. If a view update fails (e.g., a node goes down), the base write still succeeds but the view becomes stale. Production trap: Views do not guarantee eventual consistency timing. Use them only when reads must follow a non-primary-key path and you accept write amplification. Never use them for high-throughput or latency-sensitive writes.
Decentralization – Every Node Holds the Keys, None Holds the Crown
Why: Cassandra’s decentralization means no single point of failure. Every node is identical — there is no master, no slave, no failover. How: All nodes use the gossip protocol to share cluster state. Each node runs the same software, processes reads and writes, and participates in token ring partitioning. A client can connect to any node (the coordinator) and get consistent data. If a node dies, the replica nodes serve its data. Adding or removing nodes requires no manual rebalancing — Cassandra uses consistent hashing to redistribute tokens automatically. This design gives you linear scalability: double the nodes, double the throughput, no bottlenecks. The trade-off? No transactions, no complex joins — consistency must be tunable. Production trap: In a large cluster, gossip overhead can consume bandwidth. Monitor repair intervals to prevent data divergence. Decentralization works best when you embrace eventual consistency.
Conclusion
Cassandra is not a drop-in replacement for a relational database. Its architecture trades joins and ACID transactions for linear scalability and fault tolerance across commodity hardware. The key takeaway is that every design decision—from consistency levels to compaction strategies—exists to manage trade-offs between latency, correctness, and operational cost. Tunable consistency lets you choose whether to prioritize read speed or write availability, but misconfiguring it can silently serve stale data or block writes under load. Compaction strategies dictate how SSTables merge; picking the wrong one for your workload leads to read amplification or disk bloat. Tombstones and Bloom filters handle deletes and reads efficiently, but excessive tombstones degrade performance. Decentralization means every node can coordinate requests, eliminating single points of failure but requiring careful partition key design to avoid hot spots. Master these fundamentals, and you can build resilient systems that scale horizontally. Ignore them, and Cassandra will punish you with unpredictable latency and hard-to-diagnose data anomalies.
Final Thoughts on Cassandra Basics
Cassandra thrives in workloads demanding high write throughput and multi-datacenter replication—think real-time analytics, IoT sensor ingestion, or messaging systems. But it fails spectacularly when forced into relational patterns like joins or multi-row transactions. The decentralization model (every node is a coordinator) forces you to design partition keys that distribute writes evenly; a bad key creates a hot spot that throttles the whole cluster. Materialized views offer pre-joined read performance, but they impose write amplification and eventual consistency that can confuse application logic. Tombstones, while necessary for deletes, accumulate if you don't align compaction strategy with your data lifecycle—leading to massive read amplification and repair storms. Bloom filters accelerate reads by skipping SSTables unlikely to contain data, but they require enough memory to stay in RAM. In the end, Cassandra is a tool for specific use cases: master it by understanding why each feature exists, not just how to configure it. Walk away if you need strong consistency across multiple rows, or if your data model requires frequent schema changes.
The Silent Write Loss: When a Single Replica Fails in a Multi-Datacenter Write
- Never use consistency level ONE for business-critical writes in a multi-DC setup.
- Always verify that your consistency level survives a single node and a single datacenter failure.
- Run incremental repairs weekly — gossip-based anti-entropy alone isn't enough for full consistency recovery.
nodetool cfstats — if > 100, run nodetool compact to merge SSTables. Also verify Bloom filters are enabled via cassandra.yaml.nodetool tpstats — if write stage depth > 100, reduce write concurrency or increase concurrent_writes in cassandra.yaml. Also verify hinted handoff is enabled.consistency parameter in CQL or driver. Run nodetool repair on affected keyspace.nodetool info — if off-heap memory (Memtable) is high, reduce memtable_heap_space_in_mb. Also verify GC tuning: CMS or G1? For large heaps, switch to G1GC with MaxGCPauseMillis=200.nodetool tpstats | grep -i writetail -100 /var/log/cassandra/system.log | grep -i 'WriteTimeout'write_request_timeout_in_ms in cassandra.yaml (default 2000ms) — but only temporarily. Root cause is usually compaction backlog or overloaded coordinator.Key takeaways
Common mistakes to avoid
4 patternsUsing SimpleStrategy with RF=1 in production
ALTER KEYSPACE and run nodetool repair to redistribute data.Designing partition keys that create hot spots
nodetool cfhistograms. Use a composite partition key (e.g., hash of user_id + time bucket) to spread load evenly.Ignoring tombstone accumulation from TTL deletes
nodetool cfstats shows tombstone count high per partition.nodetool compact or enable unchecked_tombstone_compaction carefully.Setting consistency to ALL for 'safety'
Interview Questions on This Topic
Explain the write path in Cassandra step-by-step. What happens from when a write request arrives until it's durable?
Frequently Asked Questions
20+ years shipping high-throughput database systems. Lessons pulled from things that broke in production.
That's NoSQL. Mark it forged?
16 min read · try the examples if you haven't