Mid-level 10 min · March 05, 2026

Cassandra LOCAL_ONE Write Loss — Multi-DC Failure Patterns

LOCAL_ONE acknowledged a write, then the DC failed before hinted handoff.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Cassandra is a distributed NoSQL database designed for high write throughput and horizontal scaling.
  • Data is partitioned across nodes using a consistent hash ring — each node owns a token range.
  • Replication factor determines how many copies of each partition exist; replication strategy sets placement.
  • Writes go to Memtable and CommitLog first, then flushed to immutable SSTables on disk.
  • Tunable consistency lets you choose between strong reads (QUORUM) and low-latency reads (ONE) per query.
  • Production trap: using LOCAL_QUORUM in a multi-datacenter cluster can cause silent data loss during a DC failure.
Plain-English First

Imagine a city library that's so popular, one building can't hold all the books or serve all the visitors. So they build 6 identical branch libraries across the city, and each branch is responsible for books whose titles start with certain letters. When you want a book, you go to the branch that owns it — and if that branch is closed, the next nearest one has a copy. Cassandra works exactly like that: it splits your data into chunks (partitions), spreads those chunks across many servers (nodes), and keeps copies on neighboring servers so nothing is ever lost when one goes down.

Every time you hit 'like' on a social media post, check your Uber driver's live location, or see your Netflix watch history load in under a second — there's a non-trivial chance Cassandra is doing the heavy lifting underneath. It was built by Facebook engineers to handle the inbox search problem: hundreds of millions of writes per day, globally distributed, with zero acceptable downtime. Relational databases buckled under that load. Cassandra didn't.

The core problem Cassandra solves is the tension between scale, availability, and write throughput that kills traditional RDBMS systems. When you need to ingest millions of events per second across datacenters in Tokyo, Frankfurt, and Ohio simultaneously, you can't afford a single master node, a two-phase commit, or a schema that requires costly JOINs. Cassandra abandons all three. It trades the strong consistency guarantee of SQL for eventual consistency and gives you fine-grained control over exactly how eventual that consistency is — per query.

After working through this article, you'll be able to design a Cassandra data model from scratch using partition keys and clustering columns, explain exactly what happens under the hood during a write (Memtable → CommitLog → SSTable), configure replication for a multi-datacenter cluster, tune consistency levels intelligently based on your SLA, and avoid the five production mistakes that turn a fast Cassandra cluster into a slow one.

The Cassandra Data Model: Partition Keys, Clustering Columns, and Physical Storage

Cassandra's data model is deceptively simple: a table is a collection of rows, and each row is identified by a primary key. But that primary key has two parts: the partition key determines which node stores the row, and the clustering columns determine the order within that partition.

Here's the thing most docs gloss over: the partition key isn't just a lookup key. It's the unit of physical locality. All rows sharing the same partition key live on the same node and are stored contiguously on disk. That's why bad partition key design (e.g., using a timestamp as the sole partition key) creates hot spots — all writes for that second hit one node.

Internally, each partition is stored as a single row inside a Memtable, then flushed to an SSTable on disk. When you query by partition key, Cassandra uses an in-memory index (the partition summary) to find which SSTable contains that partition. It'll read only that SSTable, not the whole file. Bloom filters help skip SSTables that definitely don't contain your partition.

The clustering columns sort rows within a partition. This ordering is physical — it's how data is laid out on disk. So if you query with a range on the first clustering column, Cassandra can skip rows efficiently. This is a big deal for time-series workloads where you often query recent data.

io_thecodeforge_cassandra_data_model.cqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
-- TheCodeForge — Cassandra data model for a user timeline
CREATE TABLE io.thecodeforge.user_timeline (
    user_id UUID,                         -- partition key: all tweets for one user on one node
    tweet_id TIMEUUID,                    -- clustering column: sorted by time
    tweet_text TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (user_id, tweet_id)
) WITH CLUSTERING ORDER BY (tweet_id DESC);

-- Query: get the 20 most recent tweets for a user (scans only one partition, uses sorted order)
SELECT * FROM io.thecodeforge.user_timeline
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
ORDER BY tweet_id DESC
LIMIT 20;
The One-Bookcase Analogy for Partition Design
  • A good partition key distributes data evenly across shelves — no shelf gets all the books.
  • If you use a sequential ID (like an auto-increment), all new books will pile onto one shelf — a hot spot.
  • Clustering order matters because Cassandra writes data in that order. If you need range queries, make the range part of the clustering key, not the partition key.
  • Large partitions (over 100 MB) hurt compaction and read performance. Keep each partition under 10 MB ideally.
Production Insight
A team stored IoT sensor readings with sensor_id as partition key and timestamp as clustering column. Each sensor produced 100k readings/day. Partitions hit 2 GB. Reads slowed to crawl.
Rule: add a time bucket (e.g., month) to the partition key to break partitions into ~1 GB chunks.
Key Takeaway
Good partition design = low latency + even load.
Bad partition design = hot spots + 500+ ms reads.
Your partition key is the single most impactful design decision in Cassandra.
Choosing a Partition Key Strategy
IfHigh write throughput (millions/sec) with random IDs
UseUse UUID or hash of natural key as partition key. Distributes evenly.
IfTime-series data with range queries per time window
UseAdd a time bucket (e.g., day, month) as a composite partition key prefix.
IfRead-heavy with single-row lookups by ID
UseUse the ID directly as partition key. Ensure IDs are uniformly distributed.
IfNeed secondary indexes for non-key columns
UseAvoid secondary indexes in production — they'll scan all nodes. Use materialized views or application-side denormalisation instead.

Write Path Internals: Memtable, CommitLog, and SSTable

When you issue a write to Cassandra, the coordinator node sends it to all replicas that should hold the partition (based on replication strategy). Each replica does this:

  1. Writes the mutation to the CommitLog (append-only, sequential writes on disk) — this ensures durability even if the node crashes before the Memtable flushes.
  2. Applies the mutation to an in-memory structure called the Memtable (a sorted map of partition key → row data).
  3. Returns success to the coordinator — as soon as it's in both CommitLog and Memtable.
  4. Later, when the Memtable is full (default 128 MB) or a flush is triggered, it's sorted and written to disk as an immutable SSTable. The CommitLog entries for that Memtable are then truncated.

The key insight: writes never touch a disk read path until flush. That's why Cassandra write throughput is so high — it's essentially sequential CommitLog writes plus memory operations. But that speed has a cost: if you lose power between node 3 (return success) and node 4 (flush), the Mutation is replayed from the CommitLog on startup. The CommitLog is your safety net.

SSTables are immutable — once written, never modified. That means deletions and updates don't modify existing data; they write a new tombstone (for deletes) or a newer timestamped version. Old data becomes garbage that must be compacted away. That's why tombstones are dangerous: if you create a delete-heavy workload without frequent compactions, SSTables pile up with dead rows, inflating read latency.

io/thecodeforge/cassandra/WritePathDebug.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
package io.thecodeforge.cassandra;

import com.datastax.oss.driver.api.core.CqlSession;

/**
 * TheCodeForgeDemonstrates write path behavior: consistency level affects durability.
 * Never run this against a production cluster without understanding the impact.
 */
public class WritePathDebug {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // This write returns immediately after one replica acknowledges
            // But if that replica crashes before CommitLog flush, the write is lost
            session.execute(
                "INSERT INTO io.thecodeforge.user_timeline (user_id, tweet_id, tweet_text) " +
                "VALUES (123e4567-e89b-12d3-a456-426614174000, now(), 'Hello Cassandra!') " +
                "USING CONSISTENCY ONE;"
            );
            System.out.println("Write acknowledged. But is it durable? Check CommitLog on replica.");
        }
    }
}
Output
Write acknowledged. But is it durable? Check CommitLog on replica.
Tombstone Trap
Each delete creates a tombstone marker. If you delete rows with a high frequency (e.g., TTL on time-series data), those tombstones accumulate in SSTables until compaction. During reads, Cassandra must scan through tombstones to skip them. A partition with thousands of tombstones can cause read timeouts even if the actual live data is small. Fix: run nodetool compact after large batch deletes, or use a shorter TTL and let compaction handle it naturally. The rule of thumb: a single partition should never have more than 100 tombstones.
Production Insight
A production cluster serving a real-time analytics pipeline had write latencies under 1 ms for months. Then one day, write latencies spiked to 10 seconds. Root cause: the CommitLog filled up disk because of a misconfigured commitlog_total_space_in_mb combined with a surge in write volume. The node couldn't flush fast enough, so it blocked writes.
Rule: always monitor disk usage for the CommitLog directory. Set alerts when it crosses 70%.
Key Takeaway
Write speed comes from immutability and sequential I/O.
CommitLog is your durability guarantee — never treat it as optional.
Tombstones are the silent killer of read performance: avoid them or compact aggressively.
When to Tune Write Path Settings
IfLow write latency required (<5ms p99)
UseEnsure commitlog_sync: periodic (default batch mode) with a short sync window. Use memtable_heap_space_in_mb = 512 MB or less to avoid long GC pauses.
IfHigh durability needed (no data loss)
UseUse commitlog_sync: batch — this slower but guarantees each write is fsynced before ack. Expect 2-3x write latency trade-off.

Visual: Cassandra Write Path — CommitLog → Memtable → SSTable

The write path is the most critical flow in Cassandra because it determines both throughput and durability. This diagram visualizes the three main stages: the CommitLog (durable, append-only log), the Memtable (in-memory sorted buffer), and the SSTable (immutable disk file). The sequence is: (1) write arrives, (2) immediately written to CommitLog for crash recovery, (3) applied to Memtable for fast reads, (4) acknowledged to client once both are done, (5) later, Memtable is flushed to an SSTable on disk, and the CommitLog segment is truncated.

The architectural insight: writes are fast because step 2 and 3 are sequential I/O and memory operations. The flush (step 5) happens asynchronously in the background. This design prioritizes write throughput over read latency — reads may have to consult multiple SSTables, but writes are nearly as fast as the network and memory allow.

The diagram below uses a mermaid flowchart to capture the flow.

Durability Guarantee
The write is considered durable once it's in the CommitLog and Memtable. If the node crashes before flush, the CommitLog is replayed on restart, ensuring no data loss. This is why you must never disable the CommitLog or set commitlog_sync to periodic without understanding the trade-off.
Production Insight
In a real-world incident, a team disabled the CommitLog to improve write latency during a benchmark — and lost all unflushed data when the node crashed. Always keep CommitLog enabled. If you need lower latency, tune commitlog_sync: periodic with a shorter sync window (e.g., 10ms) instead. The CommitLog is your only safety net for unflushed writes.
Key Takeaway
The write path is a three-stage pipeline: CommitLog for durability, Memtable for speed, SSTable for persistence. Never disable the CommitLog — it's the only guarantee against data loss on crash.

Replication: Strategies, Snitch, and Multi-Datacenter Placement

Cassandra distributes data across nodes using a consistent hash ring. Each node owns a range of token values. A partition key is hashed to a token, and the node that owns that token range is the primary replica. Additional replicas are placed on the next N nodes in the ring (where N = replication factor).

But that's only half the story. The actual placement depends on the replication strategy and the snitch. The snitch tells Cassandra about network topology — which nodes are in which rack and datacenter. Without a proper snitch, Cassandra will place replicas on nodes in the same rack, defeating fault tolerance.

The two built-in strategies
  • SimpleStrategy: for single-datacenter testing. Places replicas on the next N nodes in the ring. Avoid in production.
  • NetworkTopologyStrategy: for multi-datacenter. You specify a replication factor per datacenter. For example, RF=3 in DC-West and RF=2 in DC-East. The snitch determines which nodes belong to which DC, and Cassandra places the right number of replicas in each DC, distributed across racks to survive rack failures.

Here's the gotcha: if your snitch is misconfigured (e.g., using SimpleSnitch when you have multiple datacenters), Cassandra will think all nodes are in one datacenter and will place all replicas in that single DC — the other DC gets zero data. You'll have a false sense of redundancy.

cassandra.yaml (snitch and replication)YAML
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# TheCodeForge — cassandra.yaml snitch configuration
endpoint_snitch: GossipingPropertyFileSnitch

# Then in cassandra-rackdc.properties:
# dc=dc_west
# rack=rack1

# Keyspace creation using NetworkTopologyStrategy
CREATE KEYSPACE io.thecodeforge.myapp WITH REPLICATION = {
  'class': 'NetworkTopologyStrategy',
  'dc_west': 3,
  'dc_east': 2
};

# Verify replication with:
# SELECT * FROM system_schema.keyspaces WHERE keyspace_name = 'io.thecodeforge.myapp';
Snitch Selection Matters
The snitch is one of the first settings you configure, and one of the hardest to change later. You cannot change the snitch type without a full cluster restart or data rebalancing. Choose wisely: start with GossipingPropertyFileSnitch for multi-DC clusters. Turn on cross-datacenter replication with NetworkTopologyStrategy from day one.
Production Insight
A startup used SimpleStrategy with RF=3 across two AWS regions. They thought data was replicated in both regions. In reality, SimpleStrategy placed all three replicas in one region (because the snitch didn't distinguish DCs). When that region went down, they lost all data.
Rule: always use NetworkTopologyStrategy with a proper snitch in any multi-DC setup.
Key Takeaway
SimpleStrategy is for testing only.
Use NetworkTopologyStrategy with GossipingPropertyFileSnitch in any multi-DC setup.
Your snitch defines your failure domains – get it wrong and your replication strategy is meaningless.
Choosing Replication Strategy and RF
IfSingle DC, development or small-scale
UseSimpleStrategy with RF=3 (tolerates one node failure). Acceptable for non-critical data.
IfProduction single DC
UseNetworkTopologyStrategy with RF=3 (tolerates one node failure). Prepare for future multi-DC.
IfMulti-DC production
UseNetworkTopologyStrategy with RF=3 in each DC. This tolerates both node and DC failures.

Tunable Consistency: What Each Level Actually Means for Latency and Correctness

Cassandra's consistency levels let you choose how many replicas must respond before a query returns. This is per-query — you can use strong consistency for critical operations and eventual consistency for everything else.

Here's the trade-off: higher consistency = higher latency + lower availability. If you require ALL replicas to agree, a single node failure makes that query impossible. If you require ONE replica, you get fast responses but risk stale reads or lost writes.

The common levels
  • ONE: one replica responds. Fastest. Weakest guarantee. Use for bulk writes or non-critical reads.
  • LOCAL_ONE: one replica in the coordinator's DC responds. Avoids cross-DC latency. Good for local-read scenarios.
  • QUORUM: majority of replicas. For an RF=3, that's 2 replicas. Ensures read-your-write consistency within a single session if both reads and writes use QUORUM.
  • LOCAL_QUORUM: majority within the local datacenter. Preferred for multi-DC writes to avoid cross-DC latency.
  • EACH_QUORUM: majority in each datacenter. Strong, but slow — requires cross-DC coordination. Use only for critical financial operations.
  • ALL: all replicas. Guarantees strong consistency, but a single node failure makes the query fail. Avoid in production unless you can tolerate failure on every read/write.

The production rule: use LOCAL_QUORUM for writes and reads within the same datacenter. This provides strong consistency within that DC and survives one node failure (RF=3). For global reads that don't need the latest data, use LOCAL_ONE.

A common mistake is using QUORUM in a multi-DC world: QUORUM means majority of all replicas, not majority in each DC. In a 3+3 DC setup, QUORUM requires 4 replicas – which could be 3 in DC1 and 1 in DC2. That cross-DC read adds latency and if DC2 is unreachable, QUORUM may fail. Use LOCAL_QUORUM instead.

io_thecodeforge_cassandra_consistency_usage.cqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- TheCodeForge — Using consistency levels strategically

-- Critical write: must survive a single node failure
INSERT INTO io.thecodeforge.orders (order_id, amount, status)
VALUES (uuid(), 99.99, 'confirmed')
USING CONSISTENCY LOCAL_QUORUM;

-- Non-critical read: latest may be a few seconds old
SELECT * FROM io.thecodeforge.recommendations WHERE user_id = ?
CONSISTENCY LOCAL_ONE;

-- Batch read: need consistent snapshot across multiple partitions
-- Use SERIAL (paxos) only for lightweight transactions (LWT)
BEGIN BATCH
  UPDATE io.thecodeforge.user_timeline SET is_visible = false
  WHERE user_id = ? AND tweet_id = ?
  IF EXISTS
APPLY BATCH;  -- LWT requires SERIAL consistency

-- Checking effective consistency with drivers:
-- In Java driver: SimpleStatement.withDefaultReadConsistency()
Consistency as a Knob, Not a Switch
  • Use LOCAL_QUORUM for writes that must be durable across nodes (your core business data).
  • Use LOCAL_ONE for background reads (recommendations, logs, analytics). Staleness is acceptable.
  • If your application needs read-after-write consistency, both read and write must use the same strong consistency level (QUORUM or LOCAL_QUORUM).
  • NEVER use ALL in production – a single node failure kills all reads/writes.
Production Insight
A social media app used LOCAL_ONE for all writes to maximize throughput. Users complained that their posts sometimes disappeared after refresh. Root cause: write to one replica, then read from another – the new post didn't propagate yet.
Fix: set write consistency to LOCAL_QUORUM and read consistency to LOCAL_QUORUM – immediate visibility with minimal latency increase (from 2ms to 8ms p99).
Key Takeaway
Consistency is a per-query trade-off, not a cluster-wide setting.
LOCAL_QUORUM is the safest default for production – you get strong consistency without cross-DC overhead.
Perfection (ALL) is the enemy of uptime: availability always wins.
Choosing Consistency Level for a Given Operation
IfUser-facing write that must be immediately visible
UseUse LOCAL_QUORUM for both write and subsequent read. Ensures read-your-write.
IfBatch analytics read – volume, no staleness concerns
UseUse LOCAL_ONE or ANY (weakest). High throughput, no consistency guarantees.
IfDistributed counter update
UseUse QUORUM (actually LOCAL_QUORUM) – counters require atomicity across replicas; lightweight transactions not needed.
IfConditional update (LWT) like 'insert if not exists'
UseMust use SERIAL or LOCAL_SERIAL. These are slower (Paxos protocol) – avoid in hot paths.

Quick-Reference: Consistency Levels and Their Effects

Choosing the right consistency level per query is the key to balancing performance and correctness. This table consolidates the available levels, their guarantees, and recommended use cases. Unlike the high-level comparison earlier, this table includes the exact number of replicas required and how failure tolerance changes with replication factor.

The rule of thumb: for business-critical operations within a datacenter, use LOCAL_QUORUM. For global operations requiring strong consistency across datacenters, use EACH_QUORUM — but expect 20-100ms latency. For high-throughput background tasks, use LOCAL_ONE or ONE.

Important nuance: LOCAL_SERIAL and SERIAL are used for lightweight transactions (LWTs) like conditional updates. They use Paxos protocol internally and are significantly slower — avoid in hot paths.

consistency_levels_demo.cqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
-- Check current consistency level in CQLSH
CONSISTENCY;
-- Output: Current consistency level is LOCAL_QUORUM.

-- Set per query
SELECT * FROM io.thecodeforge.orders WHERE order_id = ?
CONSISTENCY LOCAL_ONE;

-- In Java driver:
// SimpleStatement stmt = SimpleStatement.newInstance("SELECT ...");
// stmt = stmt.setExecutionProfile(session.getContext().getExecutionProfile()
//     .withConsistencyLevel(DefaultConsistencyLevel.LOCAL_ONE));
Danger: Using ALL in Production
Setting consistency to ALL means all replicas must respond. If any replica is down (planned maintenance, network partition), the query will fail with an UnavailableException. This makes your system unavailable during node failures. Never use ALL in production; use LOCAL_QUORUM instead for strong consistency within a DC.
Production Insight
A financial services company used QUORUM (not LOCAL_QUORUM) across three datacenters with RF=3 each. During a cross-region network issue, QUORUM tried to gather majority of all replicas (5 out of 9) but could only reach replicas in two DCs, still meeting the requirement. However, latency spiked to 200ms due to cross-DC coordination. Switching to LOCAL_QUORUM reduced latency to 5ms within each DC and still survived a single node failure per DC.
Key Takeaway
Use LOCAL_QUORUM as the default for production writes and reads. Reserve EACH_QUORUM only for operations that must be consistent across all datacenters. Never use ALL. Always consider how many replicas your consistency level requires and what happens when one fails.

Compaction Strategies: SizeTiered, Leveled, and TimeWindow – When to Use Each

SSTables are immutable, so over time you accumulate many small files. Compaction merges them into larger ones, discards tombstones and overwritten data, and reclaims disk space. Cassandra offers three main strategies:

SizeTieredCompactionStrategy (STCS): merges SSTables of similar size. Default. Simple, but can lead to write amplification (multiple merges of the same data) and space amplification (temporary disk double during compaction). Best for write-heavy, read-rare workloads where uniformity is acceptable.

LeveledCompactionStrategy (LCS): organizes SSTables into levels (L0, L1, L2...). Each level is 10x larger than the previous. Compaction occurs within a level, guaranteeing that 90% of reads hit at most one SSTable per level. Results in lower read latency but higher write amplification (more compaction work for each write). Best for read-heavy workloads, high consistency, or systems where read tail latency matters.

TimeWindowCompactionStrategy (TWCS): designed for time-series data. SSTables are grouped by time window (e.g., one hour). Compaction only merges SSTables within the same window. Old windows are left untouched. Ideal for metrics data where older data is rarely queried and tombstones from TTLs can be efficiently removed per window.

The trick: choose based on your write/read ratio and access pattern. STCS works for most cases, but don't use it with high TTL workloads – it'll keep merging tombstones into ever-larger SSTables, wasting disk and read time. TWCS is far more efficient for TTL-heavy tables.

io_thecodeforge_cassandra_compaction.cqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
-- TheCodeForge — Compaction strategy configuration

-- For a read-heavy user profile table:
ALTER TABLE io.thecodeforge.users WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160
};

-- For a time-series metrics table with 30-day TTL:
CREATE TABLE io.thecodeforge.metrics (
    sensor_id UUID,
    hour_bucket INT,
    timestamp TIMESTAMP,
    value DOUBLE,
    PRIMARY KEY ((sensor_id, hour_bucket), timestamp)
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': 1
} AND default_time_to_live = 2592000; -- 30 days

-- Check current compaction strategy:
SELECT table_name, compaction FROM system_schema.tables WHERE keyspace_name = 'io.thecodeforge';
Compaction Monitoring
Keep an eye on pending compactions with nodetool compactionstats. If the pending count stays above 100 for more than 10 minutes, your cluster is falling behind. Consider reducing SSTable max size or switching to LCS for faster compaction. Also ensure disk has 50% free space for STCS to avoid compaction failure.
Production Insight
A team ran TWCS with a 24-hour window for their metrics ingestion. They deleted 90% of old data using TTL=30 days. Over time, 30 days of SSTables accumulated, each window containing many small SSTables from daily flushes. Pending compactions hit 500 and reads slowed to 5 seconds.
Fix: increased compaction thread throughput: concurrent_compactors from 2 to 8, and decreased sstable_size_in_mb to 50 to keep per-window SSTable count low. Reads returned to sub-10ms.
Key Takeaway
Read-heavy → LCS.
Time-series with TTL → TWCS.
Write-heavy, simple → STCS.
Never use STCS with high TTL workloads – you'll pay in read latency and disk waste.
Selecting a Compaction Strategy
IfWrite-heavy, reads occasional, no TTLs
UseSizeTieredCompactionStrategy (default). Simple works.
IfRead-heavy, low latency critical, moderate writes
UseLeveledCompactionStrategy – predictable read performance at cost of write amplification.
IfTime-series data with TTL (metrics, logs)
UseTimeWindowCompactionStrategy – efficient tombstone removal and bounded space.
IfUncertain workload
UseStart with STCS, monitor compaction metrics, switch if needed. Changing strategies is safe but requires full compaction of existing data.

Compaction Strategy Decision Matrix — Choosing the Right Strategy for Your Workload

Cassandra offers three compaction strategies, each optimized for different access patterns. The decision comes down to your read/write ratio, data lifetime (TTL), and whether you prioritize write throughput or read latency. Below is a decision matrix comparing STCS, LCS, and TWCS across key dimensions.

How to use this matrix: Identify your workload category (column) and find the recommended strategy. Then check the trade-offs in terms of write amplification, read amplification, disk space overhead, and compaction frequency. The 'Best For' column gives a quick verdict.

compaction_decision_example.cqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Example: If you have a user_profile table that is read-heavy (high reads per write):
ALTER TABLE io.thecodeforge.user_profile WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160
};

-- Example: Time-series sensor data with TTL=7 days:
CREATE TABLE io.thecodeforge.sensor_data (
    sensor_id UUID,
    day_bucket INT,
    ts TIMESTAMP,
    value DOUBLE,
    PRIMARY KEY ((sensor_id, day_bucket), ts)
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': 1
} AND default_time_to_live = 604800;

-- Verify current strategy:
SELECT table_name, compaction FROM system_schema.tables WHERE keyspace_name = 'io.thecodeforge';
Monitoring Compaction Health
Run nodetool compactionstats to see pending compactions. A backlog above 200 indicates the cluster can't keep up. If using STCS, ensure at least 50% free disk space to allow compaction to operate. For LCS, monitor write latency — if it increases, compaction may be throttled. For TWCS, check that compaction windows match your TTL — if windows are too large, tombstones accumulate.
Production Insight
A team ran STCS on a time-series table with TTL=30 days. After 6 months, SSTable count per partition exceeded 500. Reads timed out. They switched to TWCS with daily windows, and within a week reads returned to normal. The key was matching compaction windows to the time bucket of the partition key.
Key Takeaway
Match compaction strategy to workload: STCS for write-heavy, LCS for read-heavy, TWCS for time-series with TTL. Monitor pending compactions and disk space to avoid performance degradation.

Tombstones and Bloom Filters — How Deletes Are Handled and How Reads Are Accelerated

Tombstones are Cassandra's mechanism for handling deletes. When you delete a row, Cassandra writes a tombstone — a marker that says 'this data is considered deleted as of timestamp X.' The original data is not removed immediately; it remains in the SSTable until compaction. During a read, Cassandra scans through SSTables and skips any cell with a tombstone whose timestamp is newer than the data. This makes deletes and TTLs (which also create tombstones) deceptively expensive: each tombstone must be read and ignored, adding latency.

Bloom filters are the counterbalance. Before reading an SSTable, Cassandra checks the Bloom filter — a probabilistic data structure that tells if an SSTable definitely does NOT contain the requested partition key. If the filter says 'no,' the SSTable is skipped entirely. This drastically reduces read latency, especially when many SSTables exist. However, Bloom filters have a false positive rate (default 1 in 10,000) — sometimes they say 'maybe' when the data isn't there, causing a wasted read. The filter is stored in memory and is keyed by partition key.

Tombstone gotchas: - A partition with >100K tombstones can cause a read timeout even if the live data is just a few rows. - You can monitor tombstones with nodetool cfstats — look for 'SSTable count' and 'Tombstone count'. - The gc_grace_seconds setting (default 10 days) determines how long a tombstone must exist before compaction can remove it. This is to prevent resurrecting data from other nodes that may not have seen the delete yet. If you're sure all replicas have the delete, you can lower this value to reduce tombstone duration.

Bloom filter tuning: - bloom_filter_fp_chance (default 0.01 = 1% false positive) trades memory for accuracy. Lower values use more memory but reduce wasted reads. - Bloom filters are stored in off-heap memory. If memory is tight, you can increase the false positive chance to 0.1, but expect higher read latency. - Use nodetool cfstats to see 'Bloom filter false positives' — if this is high, consider reducing bloom_filter_fp_chance.

tombstone_bloom_commands.shBASH
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
# Check tombstone and bloom filter stats for a table
nodetool cfstats io.thecodeforge.my_table | grep -E 'Bloom|Tombstone|SSTable'

# Sample output:
# Bloom filter false positives: 0
# Bloom filter false ratio: 0.00000
# Bloom filter space used: 1048576
# SSTable count: 12
# Tombstone count: 5432
# Tombstone ratio: 0.10

# To lower bloom filter false positive chance (uses more memory):
# In cassandra.yaml: bloom_filter_fp_chance: 0.001
# Or per table via CQL:
ALTER TABLE io.thecodeforge.my_table WITH bloom_filter_fp_chance = 0.001;

# To force compaction to remove tombstones early (after reducing gc_grace_seconds):
nodetool compact io.thecodeforge my_table
Tombstone Threshold Alert
Cassandra has a built-in tombstone threshold: tombstone_warn_threshold (default 1000) and tombstone_failure_threshold (default 100,000). If a query scans more than 1000 tombstones, a warning is logged. If it exceeds 100,000, the query is aborted with an error. Monitor these logs to detect tombstone buildup early.
Production Insight
A high-traffic analytics system used TTL=1 hour on log entries. After a month, reads started timing out. Investigation showed some partitions had >200,000 tombstones. The fix was to switch from SizeTieredCompactionStrategy to TimeWindowCompactionStrategy with compaction_window_unit='HOURS', which compacted each hour's data independently and removed tombstones efficiently. Additionally, they reduced gc_grace_seconds to 3600 (1 hour) since all replicas were in the same datacenter and consistency was LOCAL_QUORUM, guaranteeing tombstone propagation.
Key Takeaway
Tombstones are a necessary evil for deletes and TTLs — monitor them with nodetool cfstats and use compaction strategies that handle them efficiently. Bloom filters drastically reduce read latency by skipping irrelevant SSTables; tune bloom_filter_fp_chance based on memory availability and read latency requirements.
● Production incidentPOST-MORTEMseverity: high

The Silent Write Loss: When a Single Replica Fails in a Multi-Datacenter Write

Symptom
Orders placed from the affected datacenter were missing in global sales reports. The application reported write success (consistency level ONE was used for latency).
Assumption
The team assumed that since replication factor was 3 across two DCs, a write hitting any one replica would eventually propagate. They didn't account for the fact that consistency level ONE only acknowledges a single replica before returning success — if that replica then fails before propagating, the write is lost.
Root cause
Consistency level LOCAL_ONE with a single DC failure: the write was acknowledged by a node in the affected DC, but that DC became unreachable before the hinted handoff could deliver the write to the other DC. The write was lost permanently.
Fix
Changed write consistency to LOCAL_QUORUM (2 replicas in a 3-node RF per DC) so that at least two replicas must acknowledge before success. Combined with a background repair job that ran every 24 hours to reconcile inconsistencies.
Key lesson
  • Never use consistency level ONE for business-critical writes in a multi-DC setup.
  • Always verify that your consistency level survives a single node and a single datacenter failure.
  • Run incremental repairs weekly — gossip-based anti-entropy alone isn't enough for full consistency recovery.
Production debug guideSymptom → Action: Quick checks to diagnose common failures in production4 entries
Symptom · 01
Read latency spikes above 1 second
Fix
Check SSTable count per partition with nodetool cfstats — if > 100, run nodetool compact to merge SSTables. Also verify Bloom filters are enabled via cassandra.yaml.
Symptom · 02
Write timeouts (WriteTimeoutException)
Fix
Check coordinator node load with nodetool tpstats — if write stage depth > 100, reduce write concurrency or increase concurrent_writes in cassandra.yaml. Also verify hinted handoff is enabled.
Symptom · 03
Inconsistent reads — stale data returned
Fix
Check consistency levels: if using eventual consistency (ONE), switch to QUORUM for critical queries. Use consistency parameter in CQL or driver. Run nodetool repair on affected keyspace.
Symptom · 04
High memory usage / GC pauses
Fix
Check Heap via nodetool info — if off-heap memory (Memtable) is high, reduce memtable_heap_space_in_mb. Also verify GC tuning: CMS or G1? For large heaps, switch to G1GC with MaxGCPauseMillis=200.
★ Quick Debug Cheat Sheet: Cassandra Production IssuesWhen the pager goes off, run these commands first. They'll point you at the root cause in under 2 minutes.
Write timeout
Immediate action
Run `nodetool tpstats` and look at 'WriteStage' active count.
Commands
nodetool tpstats | grep -i write
tail -100 /var/log/cassandra/system.log | grep -i 'WriteTimeout'
Fix now
Increase write_request_timeout_in_ms in cassandra.yaml (default 2000ms) — but only temporarily. Root cause is usually compaction backlog or overloaded coordinator.
Read timeout+
Immediate action
Check SSTable count per table with `nodetool cfstats`.
Commands
nodetool cfstats <keyspace>.<table> | grep -E 'SSTable count|Read latency'
nodetool compactionstats
Fix now
Run nodetool compact <keyspace> <table> to merge small SSTables. If compaction backlog > 200, enable incremental_backups: false and schedule regular compactions.
Node marked as down by gossip+
Immediate action
Check network and process status.
Commands
ping <node-ip> && grep -i 'gossip' /var/log/cassandra/system.log | tail -20
nodetool status
Fix now
If node is actually up but not in ring, restart Cassandra process. If it's down, start with service cassandra start. If it won't come up, check system.log for out-of-memory or disk-full errors.
Consistency Levels at a Glance
Consistency LevelGuaranteeLatency ImpactFault ToleranceBest For
ONE1 replicaFastest (~1ms p50)Lose 1 replica = data lossBulk writes, non-critical reads
LOCAL_ONE1 local replicaFast (~1ms local)Lose 1 local replica = staleLocal reads, low staleness acceptable
QUORUMMajority of allModerate (3-10ms)Survive 1 node failureGlobal consistency (single DC)
LOCAL_QUORUMMajority in DCModerate (3-8ms local)Survive 1 node per DCMulti-DC production default
EACH_QUORUMMajority per DCSlow (20-100ms cross-DC)Survive 1 node per DCFinance, critical cross-DC ops
ALL100% of replicasSlowest (unavailable on failure)No failure toleranceTesting only

Key takeaways

1
Partition keys are the most important design decision
they determine data locality and load distribution.
2
Writes are fast because they're sequential CommitLog + memory; reads are cheap because of bloom filters and SSTable indexes.
3
Replication strategies and snitch define your failure domains
always use NetworkTopologyStrategy in production.
4
Consistency is tunable per query
LOCAL_QUORUM is the safe default for multi-DC survival.
5
Never use SimpleStrategy or ALL consistency in production
they sacrifice either durability or availability.
6
Monitor compaction backlog and tombstone counts regularly
they're the biggest silent killers of performance.

Common mistakes to avoid

4 patterns
×

Using SimpleStrategy with RF=1 in production

Symptom
Any single node failure causes data loss. No redundancy. Queries to that node hang until timeout.
Fix
Change to NetworkTopologyStrategy with RF=3 at minimum. Migrate via ALTER KEYSPACE and run nodetool repair to redistribute data.
×

Designing partition keys that create hot spots

Symptom
One node has 10x more SSTables than others, high GC pauses, read/write timeouts on that node only.
Fix
Analyze data distribution with nodetool cfhistograms. Use a composite partition key (e.g., hash of user_id + time bucket) to spread load evenly.
×

Ignoring tombstone accumulation from TTL deletes

Symptom
Read latency spikes unexpectedly. nodetool cfstats shows tombstone count high per partition.
Fix
Switch to TimeWindowCompactionStrategy for TTL-heavy tables. Run periodic nodetool compact or enable unchecked_tombstone_compaction carefully.
×

Setting consistency to ALL for 'safety'

Symptom
Random write failures when any node is down for maintenance or restarts. Availability drops to 0% during node failures.
Fix
Use LOCAL_QUORUM for production. Understand that strong consistency trades availability. If you need read-your-write, use QUORUM, not ALL.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain the write path in Cassandra step-by-step. What happens from when...
Q02SENIOR
What happens when a node fails in a Cassandra cluster? How does the clus...
Q03SENIOR
How does Cassandra achieve linear scalability? Explain the relationship ...
Q04SENIOR
Why shouldn't you use secondary indexes in production? How would you des...
Q05SENIOR
What is the difference between Leveled and SizeTiered compaction? When w...
Q01 of 05SENIOR

Explain the write path in Cassandra step-by-step. What happens from when a write request arrives until it's durable?

ANSWER
1. Coordinator receives write, hashes partition key, determines replicas using replication strategy and snitch. 2. Coordinator sends mutation to all replicas (based on consistency level). 3. Each replica writes to CommitLog (append-only, for crash recovery). 4. Mutation is applied to Memtable (in-memory sorted map). 5. Replica acknowledges success (if consistency requirement met). 6. Periodically, Memtable is flushed to an SSTable on disk (sorted, immutable). 7. CommitLog entries for flushed data are truncated. Durability: data is in CommitLog (disk) before ack. If node crashes before flush, CommitLog replays on restart.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is the difference between partition key and clustering key in Cassandra?
02
Can I change the replication factor or strategy after creating a keyspace?
03
What happens when a write consistency level cannot be met (e.g., ONE but all replicas are down)?
04
Why does Cassandra use a gossip protocol instead of a coordinator for node membership?
05
How do I handle a hot partition in Cassandra?
🔥

That's NoSQL. Mark it forged?

10 min read · try the examples if you haven't

Previous
Redis Data Structures
8 / 15 · NoSQL
Next
SQL vs NoSQL — When to Use Which