Cassandra LOCAL_ONE Write Loss — Multi-DC Failure Patterns
LOCAL_ONE acknowledged a write, then the DC failed before hinted handoff.
- Cassandra is a distributed NoSQL database designed for high write throughput and horizontal scaling.
- Data is partitioned across nodes using a consistent hash ring — each node owns a token range.
- Replication factor determines how many copies of each partition exist; replication strategy sets placement.
- Writes go to Memtable and CommitLog first, then flushed to immutable SSTables on disk.
- Tunable consistency lets you choose between strong reads (QUORUM) and low-latency reads (ONE) per query.
- Production trap: using LOCAL_QUORUM in a multi-datacenter cluster can cause silent data loss during a DC failure.
Imagine a city library that's so popular, one building can't hold all the books or serve all the visitors. So they build 6 identical branch libraries across the city, and each branch is responsible for books whose titles start with certain letters. When you want a book, you go to the branch that owns it — and if that branch is closed, the next nearest one has a copy. Cassandra works exactly like that: it splits your data into chunks (partitions), spreads those chunks across many servers (nodes), and keeps copies on neighboring servers so nothing is ever lost when one goes down.
Every time you hit 'like' on a social media post, check your Uber driver's live location, or see your Netflix watch history load in under a second — there's a non-trivial chance Cassandra is doing the heavy lifting underneath. It was built by Facebook engineers to handle the inbox search problem: hundreds of millions of writes per day, globally distributed, with zero acceptable downtime. Relational databases buckled under that load. Cassandra didn't.
The core problem Cassandra solves is the tension between scale, availability, and write throughput that kills traditional RDBMS systems. When you need to ingest millions of events per second across datacenters in Tokyo, Frankfurt, and Ohio simultaneously, you can't afford a single master node, a two-phase commit, or a schema that requires costly JOINs. Cassandra abandons all three. It trades the strong consistency guarantee of SQL for eventual consistency and gives you fine-grained control over exactly how eventual that consistency is — per query.
After working through this article, you'll be able to design a Cassandra data model from scratch using partition keys and clustering columns, explain exactly what happens under the hood during a write (Memtable → CommitLog → SSTable), configure replication for a multi-datacenter cluster, tune consistency levels intelligently based on your SLA, and avoid the five production mistakes that turn a fast Cassandra cluster into a slow one.
The Cassandra Data Model: Partition Keys, Clustering Columns, and Physical Storage
Cassandra's data model is deceptively simple: a table is a collection of rows, and each row is identified by a primary key. But that primary key has two parts: the partition key determines which node stores the row, and the clustering columns determine the order within that partition.
Here's the thing most docs gloss over: the partition key isn't just a lookup key. It's the unit of physical locality. All rows sharing the same partition key live on the same node and are stored contiguously on disk. That's why bad partition key design (e.g., using a timestamp as the sole partition key) creates hot spots — all writes for that second hit one node.
Internally, each partition is stored as a single row inside a Memtable, then flushed to an SSTable on disk. When you query by partition key, Cassandra uses an in-memory index (the partition summary) to find which SSTable contains that partition. It'll read only that SSTable, not the whole file. Bloom filters help skip SSTables that definitely don't contain your partition.
The clustering columns sort rows within a partition. This ordering is physical — it's how data is laid out on disk. So if you query with a range on the first clustering column, Cassandra can skip rows efficiently. This is a big deal for time-series workloads where you often query recent data.
- A good partition key distributes data evenly across shelves — no shelf gets all the books.
- If you use a sequential ID (like an auto-increment), all new books will pile onto one shelf — a hot spot.
- Clustering order matters because Cassandra writes data in that order. If you need range queries, make the range part of the clustering key, not the partition key.
- Large partitions (over 100 MB) hurt compaction and read performance. Keep each partition under 10 MB ideally.
Write Path Internals: Memtable, CommitLog, and SSTable
When you issue a write to Cassandra, the coordinator node sends it to all replicas that should hold the partition (based on replication strategy). Each replica does this:
- Writes the mutation to the CommitLog (append-only, sequential writes on disk) — this ensures durability even if the node crashes before the Memtable flushes.
- Applies the mutation to an in-memory structure called the Memtable (a sorted map of partition key → row data).
- Returns success to the coordinator — as soon as it's in both CommitLog and Memtable.
- Later, when the Memtable is full (default 128 MB) or a flush is triggered, it's sorted and written to disk as an immutable SSTable. The CommitLog entries for that Memtable are then truncated.
The key insight: writes never touch a disk read path until flush. That's why Cassandra write throughput is so high — it's essentially sequential CommitLog writes plus memory operations. But that speed has a cost: if you lose power between node 3 (return success) and node 4 (flush), the Mutation is replayed from the CommitLog on startup. The CommitLog is your safety net.
SSTables are immutable — once written, never modified. That means deletions and updates don't modify existing data; they write a new tombstone (for deletes) or a newer timestamped version. Old data becomes garbage that must be compacted away. That's why tombstones are dangerous: if you create a delete-heavy workload without frequent compactions, SSTables pile up with dead rows, inflating read latency.
nodetool compact after large batch deletes, or use a shorter TTL and let compaction handle it naturally. The rule of thumb: a single partition should never have more than 100 tombstones.commitlog_total_space_in_mb combined with a surge in write volume. The node couldn't flush fast enough, so it blocked writes.commitlog_sync: periodic (default batch mode) with a short sync window. Use memtable_heap_space_in_mb = 512 MB or less to avoid long GC pauses.commitlog_sync: batch — this slower but guarantees each write is fsynced before ack. Expect 2-3x write latency trade-off.Visual: Cassandra Write Path — CommitLog → Memtable → SSTable
The write path is the most critical flow in Cassandra because it determines both throughput and durability. This diagram visualizes the three main stages: the CommitLog (durable, append-only log), the Memtable (in-memory sorted buffer), and the SSTable (immutable disk file). The sequence is: (1) write arrives, (2) immediately written to CommitLog for crash recovery, (3) applied to Memtable for fast reads, (4) acknowledged to client once both are done, (5) later, Memtable is flushed to an SSTable on disk, and the CommitLog segment is truncated.
The architectural insight: writes are fast because step 2 and 3 are sequential I/O and memory operations. The flush (step 5) happens asynchronously in the background. This design prioritizes write throughput over read latency — reads may have to consult multiple SSTables, but writes are nearly as fast as the network and memory allow.
The diagram below uses a mermaid flowchart to capture the flow.
commitlog_sync: periodic with a shorter sync window (e.g., 10ms) instead. The CommitLog is your only safety net for unflushed writes.Replication: Strategies, Snitch, and Multi-Datacenter Placement
Cassandra distributes data across nodes using a consistent hash ring. Each node owns a range of token values. A partition key is hashed to a token, and the node that owns that token range is the primary replica. Additional replicas are placed on the next N nodes in the ring (where N = replication factor).
But that's only half the story. The actual placement depends on the replication strategy and the snitch. The snitch tells Cassandra about network topology — which nodes are in which rack and datacenter. Without a proper snitch, Cassandra will place replicas on nodes in the same rack, defeating fault tolerance.
- SimpleStrategy: for single-datacenter testing. Places replicas on the next N nodes in the ring. Avoid in production.
- NetworkTopologyStrategy: for multi-datacenter. You specify a replication factor per datacenter. For example, RF=3 in DC-West and RF=2 in DC-East. The snitch determines which nodes belong to which DC, and Cassandra places the right number of replicas in each DC, distributed across racks to survive rack failures.
Here's the gotcha: if your snitch is misconfigured (e.g., using SimpleSnitch when you have multiple datacenters), Cassandra will think all nodes are in one datacenter and will place all replicas in that single DC — the other DC gets zero data. You'll have a false sense of redundancy.
Tunable Consistency: What Each Level Actually Means for Latency and Correctness
Cassandra's consistency levels let you choose how many replicas must respond before a query returns. This is per-query — you can use strong consistency for critical operations and eventual consistency for everything else.
Here's the trade-off: higher consistency = higher latency + lower availability. If you require ALL replicas to agree, a single node failure makes that query impossible. If you require ONE replica, you get fast responses but risk stale reads or lost writes.
- ONE: one replica responds. Fastest. Weakest guarantee. Use for bulk writes or non-critical reads.
- LOCAL_ONE: one replica in the coordinator's DC responds. Avoids cross-DC latency. Good for local-read scenarios.
- QUORUM: majority of replicas. For an RF=3, that's 2 replicas. Ensures read-your-write consistency within a single session if both reads and writes use QUORUM.
- LOCAL_QUORUM: majority within the local datacenter. Preferred for multi-DC writes to avoid cross-DC latency.
- EACH_QUORUM: majority in each datacenter. Strong, but slow — requires cross-DC coordination. Use only for critical financial operations.
- ALL: all replicas. Guarantees strong consistency, but a single node failure makes the query fail. Avoid in production unless you can tolerate failure on every read/write.
The production rule: use LOCAL_QUORUM for writes and reads within the same datacenter. This provides strong consistency within that DC and survives one node failure (RF=3). For global reads that don't need the latest data, use LOCAL_ONE.
A common mistake is using QUORUM in a multi-DC world: QUORUM means majority of all replicas, not majority in each DC. In a 3+3 DC setup, QUORUM requires 4 replicas – which could be 3 in DC1 and 1 in DC2. That cross-DC read adds latency and if DC2 is unreachable, QUORUM may fail. Use LOCAL_QUORUM instead.
- Use LOCAL_QUORUM for writes that must be durable across nodes (your core business data).
- Use LOCAL_ONE for background reads (recommendations, logs, analytics). Staleness is acceptable.
- If your application needs read-after-write consistency, both read and write must use the same strong consistency level (QUORUM or LOCAL_QUORUM).
- NEVER use ALL in production – a single node failure kills all reads/writes.
Quick-Reference: Consistency Levels and Their Effects
Choosing the right consistency level per query is the key to balancing performance and correctness. This table consolidates the available levels, their guarantees, and recommended use cases. Unlike the high-level comparison earlier, this table includes the exact number of replicas required and how failure tolerance changes with replication factor.
The rule of thumb: for business-critical operations within a datacenter, use LOCAL_QUORUM. For global operations requiring strong consistency across datacenters, use EACH_QUORUM — but expect 20-100ms latency. For high-throughput background tasks, use LOCAL_ONE or ONE.
Important nuance: LOCAL_SERIAL and SERIAL are used for lightweight transactions (LWTs) like conditional updates. They use Paxos protocol internally and are significantly slower — avoid in hot paths.
Compaction Strategies: SizeTiered, Leveled, and TimeWindow – When to Use Each
SSTables are immutable, so over time you accumulate many small files. Compaction merges them into larger ones, discards tombstones and overwritten data, and reclaims disk space. Cassandra offers three main strategies:
SizeTieredCompactionStrategy (STCS): merges SSTables of similar size. Default. Simple, but can lead to write amplification (multiple merges of the same data) and space amplification (temporary disk double during compaction). Best for write-heavy, read-rare workloads where uniformity is acceptable.
LeveledCompactionStrategy (LCS): organizes SSTables into levels (L0, L1, L2...). Each level is 10x larger than the previous. Compaction occurs within a level, guaranteeing that 90% of reads hit at most one SSTable per level. Results in lower read latency but higher write amplification (more compaction work for each write). Best for read-heavy workloads, high consistency, or systems where read tail latency matters.
TimeWindowCompactionStrategy (TWCS): designed for time-series data. SSTables are grouped by time window (e.g., one hour). Compaction only merges SSTables within the same window. Old windows are left untouched. Ideal for metrics data where older data is rarely queried and tombstones from TTLs can be efficiently removed per window.
The trick: choose based on your write/read ratio and access pattern. STCS works for most cases, but don't use it with high TTL workloads – it'll keep merging tombstones into ever-larger SSTables, wasting disk and read time. TWCS is far more efficient for TTL-heavy tables.
nodetool compactionstats. If the pending count stays above 100 for more than 10 minutes, your cluster is falling behind. Consider reducing SSTable max size or switching to LCS for faster compaction. Also ensure disk has 50% free space for STCS to avoid compaction failure.concurrent_compactors from 2 to 8, and decreased sstable_size_in_mb to 50 to keep per-window SSTable count low. Reads returned to sub-10ms.Compaction Strategy Decision Matrix — Choosing the Right Strategy for Your Workload
Cassandra offers three compaction strategies, each optimized for different access patterns. The decision comes down to your read/write ratio, data lifetime (TTL), and whether you prioritize write throughput or read latency. Below is a decision matrix comparing STCS, LCS, and TWCS across key dimensions.
How to use this matrix: Identify your workload category (column) and find the recommended strategy. Then check the trade-offs in terms of write amplification, read amplification, disk space overhead, and compaction frequency. The 'Best For' column gives a quick verdict.
nodetool compactionstats to see pending compactions. A backlog above 200 indicates the cluster can't keep up. If using STCS, ensure at least 50% free disk space to allow compaction to operate. For LCS, monitor write latency — if it increases, compaction may be throttled. For TWCS, check that compaction windows match your TTL — if windows are too large, tombstones accumulate.Tombstones and Bloom Filters — How Deletes Are Handled and How Reads Are Accelerated
Tombstones are Cassandra's mechanism for handling deletes. When you delete a row, Cassandra writes a tombstone — a marker that says 'this data is considered deleted as of timestamp X.' The original data is not removed immediately; it remains in the SSTable until compaction. During a read, Cassandra scans through SSTables and skips any cell with a tombstone whose timestamp is newer than the data. This makes deletes and TTLs (which also create tombstones) deceptively expensive: each tombstone must be read and ignored, adding latency.
Bloom filters are the counterbalance. Before reading an SSTable, Cassandra checks the Bloom filter — a probabilistic data structure that tells if an SSTable definitely does NOT contain the requested partition key. If the filter says 'no,' the SSTable is skipped entirely. This drastically reduces read latency, especially when many SSTables exist. However, Bloom filters have a false positive rate (default 1 in 10,000) — sometimes they say 'maybe' when the data isn't there, causing a wasted read. The filter is stored in memory and is keyed by partition key.
Tombstone gotchas: - A partition with >100K tombstones can cause a read timeout even if the live data is just a few rows. - You can monitor tombstones with nodetool cfstats — look for 'SSTable count' and 'Tombstone count'. - The gc_grace_seconds setting (default 10 days) determines how long a tombstone must exist before compaction can remove it. This is to prevent resurrecting data from other nodes that may not have seen the delete yet. If you're sure all replicas have the delete, you can lower this value to reduce tombstone duration.
Bloom filter tuning: - bloom_filter_fp_chance (default 0.01 = 1% false positive) trades memory for accuracy. Lower values use more memory but reduce wasted reads. - Bloom filters are stored in off-heap memory. If memory is tight, you can increase the false positive chance to 0.1, but expect higher read latency. - Use nodetool cfstats to see 'Bloom filter false positives' — if this is high, consider reducing bloom_filter_fp_chance.
tombstone_warn_threshold (default 1000) and tombstone_failure_threshold (default 100,000). If a query scans more than 1000 tombstones, a warning is logged. If it exceeds 100,000, the query is aborted with an error. Monitor these logs to detect tombstone buildup early.gc_grace_seconds to 3600 (1 hour) since all replicas were in the same datacenter and consistency was LOCAL_QUORUM, guaranteeing tombstone propagation.The Silent Write Loss: When a Single Replica Fails in a Multi-Datacenter Write
- Never use consistency level ONE for business-critical writes in a multi-DC setup.
- Always verify that your consistency level survives a single node and a single datacenter failure.
- Run incremental repairs weekly — gossip-based anti-entropy alone isn't enough for full consistency recovery.
nodetool cfstats — if > 100, run nodetool compact to merge SSTables. Also verify Bloom filters are enabled via cassandra.yaml.nodetool tpstats — if write stage depth > 100, reduce write concurrency or increase concurrent_writes in cassandra.yaml. Also verify hinted handoff is enabled.consistency parameter in CQL or driver. Run nodetool repair on affected keyspace.nodetool info — if off-heap memory (Memtable) is high, reduce memtable_heap_space_in_mb. Also verify GC tuning: CMS or G1? For large heaps, switch to G1GC with MaxGCPauseMillis=200.write_request_timeout_in_ms in cassandra.yaml (default 2000ms) — but only temporarily. Root cause is usually compaction backlog or overloaded coordinator.Key takeaways
Common mistakes to avoid
4 patternsUsing SimpleStrategy with RF=1 in production
ALTER KEYSPACE and run nodetool repair to redistribute data.Designing partition keys that create hot spots
nodetool cfhistograms. Use a composite partition key (e.g., hash of user_id + time bucket) to spread load evenly.Ignoring tombstone accumulation from TTL deletes
nodetool cfstats shows tombstone count high per partition.nodetool compact or enable unchecked_tombstone_compaction carefully.Setting consistency to ALL for 'safety'
Interview Questions on This Topic
Explain the write path in Cassandra step-by-step. What happens from when a write request arrives until it's durable?
Frequently Asked Questions
That's NoSQL. Mark it forged?
10 min read · try the examples if you haven't