Senior 9 min · March 09, 2026

Apache Cassandra — Hot Partitions and Ring Architecture

Q: Does Cassandra support ACID transactions?

Cassandra supports lightweight transactions (LWT) using the Paxos protocol for conditional updates within a single partition, but they come with a significant performance cost — approximately 4x the latency of a regular write due to the multi-round-trip Paxos consensus. For most use cases, Cassandra's eventual consistency with LOCAL_QUORUM provides the right balance. Full multi-row ACID transactions across partitions are not supported — design your data model to avoid needing them. If you genuinely need cross-partition transactions, evaluate whether Cassandra is the right database for that specific workload.

Q: How do I choose the right compaction strategy?

STCS (Size-Tiered Compaction Strategy) is the default and works well for write-heavy workloads where data is rarely updated. LCS (Leveled Compaction Strategy) provides better read performance and bounded space amplification — use it for read-heavy workloads where the extra compaction I/O is acceptable. TWCS (Time-Window Compaction Strategy) is ideal for time-series data with TTL, as it groups SSTables by time window and drops entire windows when data expires — avoiding the need to merge expired tombstones with live data. Cassandra 5.0 introduced the Unified Compaction Strategy (UCS) as an experimental option that adapts behavior based on workload patterns.

Q: What is a tombstone in Cassandra and why should I care?

A tombstone is a marker that indicates a row or cell has been deleted. Cassandra does not immediately remove data — it writes a tombstone that is only physically cleaned up after gc_grace_seconds (default 10 days) AND after compaction runs AND after repair has propagated the tombstone to all replicas. Excessive tombstones degrade read performance because Cassandra must scan through them during reads. Hitting the tombstone_failure_threshold (default 100,000 per query) causes a TombstoneOverwhelmingException that aborts the read entirely. Prevention: use TTL on INSERT, run repair regularly within gc_grace_seconds, and monitor tombstone ratios with nodetool tablehistograms.

Q: Can I use secondary indexes in Cassandra?

In Cassandra 5.0, the recommended secondary indexing approach is Storage Attached Indexes (SAI). SAI supports equality and range queries on non-primary-key columns without ALLOW FILTERING, and adds vector ANN search for embedding-based workloads. SASI (Storage Attached Secondary Index) is deprecated in Cassandra 5.0 — do not create new SASI indexes. The original legacy secondary indexes are still present but have severe performance limitations. The general rule: SAI for selective queries (< 5% of partition rows), dedicated table for known stable query patterns with high query volume.

Q: When should I use Cassandra vs PostgreSQL?

Use Cassandra when: you need write throughput that a single PostgreSQL instance cannot absorb, your dataset exceeds 1TB and growing, you need multi-datacenter active-active replication, or your data naturally fits a time-series or event-log model. Use PostgreSQL when: you need ACID transactions across multiple tables, your queries are ad-hoc and unpredictable, your data is relational and normalized, or your scale is small enough that a single well-tuned PostgreSQL instance handles the load. The worst outcome is starting with Cassandra for a workload that needs joins and transactions — the data model migration cost is high.

A single 4GB partition OOM-killed two nodes at peak traffic.

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Lessons pulled from things that broke in production.

✓ Production

production tested

May 24, 2026

last updated

1,554

articles · all by Naren

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Apache Cassandra is a masterless, peer-to-peer distributed NoSQL database favoring availability over strong consistency
Data is distributed via consistent hashing across a ring of equal nodes — no single point of failure
Writes go to any node and propagate to replicas based on Replication Factor
LSM-Tree storage engine delivers extreme write throughput — sequential I/O, no read-before-write
Query-Driven Design: model tables around queries, not entities — joins do not exist
Cassandra 5.0 introduced Storage Attached Indexes (SAI) — the correct answer to secondary indexing from 2024 onward
Biggest mistake: over-normalizing data like in SQL — denormalization is the correct Cassandra pattern

✦ Definition~90s read

What is Introduction to Apache Cassandra?

Apache Cassandra was designed to solve the availability vs. consistency trade-off in distributed systems, specifically favoring high availability and partition tolerance — the AP side of the CAP theorem. Traditional relational databases struggle to scale horizontally because they rely on a central master node that becomes both the bottleneck and the single point of failure.

★

Imagine you are running a massive global library.

Cassandra eliminates this bottleneck entirely by treating every node in the cluster as an equal peer.

This architecture allows linear scaling: adding a new node to the cluster provides a predictable, proportional increase in performance and storage capacity. There is no leader election, no failover window, and no promotion ceremony when a node dies. Its neighbors simply serve its data until it comes back.

Before writing any application code, you need a keyspace and a table. A keyspace is Cassandra's equivalent of a database schema — it defines the replication strategy and factor. The example below uses NetworkTopologyStrategy, which is the only strategy you should use in production. SimpleStrategy is shown for contrast because you will encounter it in tutorials, and understanding why it is wrong is as important as knowing the right answer.

Note the PRIMARY KEY (service_name, log_time) design. service_name is the partition key — it controls which node stores the data. log_time is the clustering column — it controls the sort order within the partition. This is the fundamental unit of Cassandra data modeling, and every decision downstream flows from getting it right.

Plain-English First

Imagine you are running a massive global library. Instead of keeping all books in one giant building (a single server), you distribute them across a ring of interconnected buildings (a cluster). If one building catches fire, the library stays open because every book has copies in other buildings. Apache Cassandra is that global library system; it ensures that no matter how much data you add or how many servers fail, your information is always available and reachable in milliseconds.

Apache Cassandra is a peer-to-peer distributed NoSQL database designed to handle massive amounts of data across many commodity servers. Originally developed at Facebook to power Inbox Search, it provides high availability with no single point of failure and linear scalability.

In this guide, we'll break down exactly what makes Cassandra unique in the crowded database landscape, why its masterless design beats traditional leader-follower models for write-heavy workloads, how to design partition keys that don't become time bombs, and how to connect a Java application using the DataStax driver. By the end, you'll understand not just what Cassandra is, but how to avoid the mistakes that cause production outages.

What Is Apache Cassandra and Why Does It Exist?

keyspace_setup.cqlSQL

-- ── SimpleStrategy: development only, single datacenter ─────────────────────
-- Never use in production — it ignores rack and datacenter topology.
-- Replicas are placed on the next N nodes clockwise in the ring,
-- which means two replicas can land on the same physical rack.
CREATE KEYSPACE IF NOT EXISTS forge_dev
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- ── NetworkTopologyStrategy: production standard ─────────────────────────────
-- Rack-aware and datacenter-aware placement.
-- With RF=3 across 3 racks, you can lose an entire rack and still serve queries.
CREATE KEYSPACE IF NOT EXISTS forge_system
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3
  };

USE forge_system;

-- ── Table design: audit_logs ─────────────────────────────────────────────────
-- Partition key: service_name  → determines which node(s) store this row
-- Clustering column: log_time  → determines sort order within the partition
-- CLUSTERING ORDER BY DESC: most recent logs are read first with no sort overhead
CREATE TABLE IF NOT EXISTS audit_logs (
    service_name text,
    log_time     timestamp,
    request_id   uuid,
    payload      text,
    PRIMARY KEY (service_name, log_time)
) WITH CLUSTERING ORDER BY (log_time DESC)
  AND default_time_to_live = 2592000;  -- 30-day TTL: logs auto-expire

-- ── Writing data: any node accepts the write ────────────────────────────────
INSERT INTO audit_logs (service_name, log_time, request_id, payload)
VALUES ('payment-service', toTimestamp(now()), uuid(), '{"status":"ok"}');

INSERT INTO audit_logs (service_name, log_time, request_id, payload)
VALUES ('payment-service', toTimestamp(now()), uuid(), '{"status":"retry"}');

-- ── Reading data: the query matches the partition key exactly ────────────────
-- This is a single-partition read — one node handles it, response in <5ms
SELECT service_name, log_time, payload
FROM audit_logs
WHERE service_name = 'payment-service'
  AND log_time > '2026-01-01 00:00:00+0000'
LIMIT 100;

Output

Warnings: None

Keyspace 'forge_dev' created.

Keyspace 'forge_system' created.

Table 'audit_logs' created.

-- INSERT: applied

service_name | log_time | payload

-----------------+---------------------------------+---------------------

payment-service | 2026-04-18 14:23:01.000000+0000 | {"status":"retry"}

payment-service | 2026-04-18 14:22:58.000000+0000 | {"status":"ok"}

(2 rows)

The Ring Architecture Mental Model

Every node owns a range of the token ring — data is distributed by hashing the partition key to a token
Any node can accept reads and writes — there is no master, no leader election, no failover delay
Replication factor determines how many nodes hold a copy of each partition
If a node goes down, its neighbors serve its data immediately — no recovery window
Adding a node splits an existing token range and streams data to the new node automatically — linear scaling

Production Insight

Cassandra's AP design means reads may return slightly stale data during replication lag.

Tunable consistency lets you choose per-query: ONE for speed (fastest, weakest), LOCAL_QUORUM for balance (default choice), ALL for safety (strongest, slowest).

QUORUM math: for RF=3, QUORUM = floor(3/2)+1 = 2 nodes must respond. You can lose one node and still satisfy QUORUM.

Rule: use LOCAL_QUORUM for production writes and reads — it balances consistency with cross-datacenter latency without requiring all replicas to be available.

Key Takeaway

Cassandra exists because master-slave databases cannot scale writes horizontally without a bottleneck.

Every node is equal — any node accepts writes, and replication handles redundancy automatically.

QUORUM = floor(RF/2)+1: with RF=3, you need 2 nodes — you can lose one and still serve queries.

Never use SimpleStrategy in production — it has no rack or datacenter awareness.

Choosing Replication Strategy

IfSingle datacenter development or testing

→

UseUse SimpleStrategy with RF=1 — do NOT use in production

IfSingle production datacenter

→

UseUse NetworkTopologyStrategy with RF=3 spread across different racks

IfMulti-datacenter deployment

→

UseUse NetworkTopologyStrategy with RF=3 per datacenter for cross-DC resilience

IfCompliance requires geographic data isolation

→

UseUse NetworkTopologyStrategy with LOCAL_QUORUM consistency to ensure reads never cross datacenter boundaries

thecodeforge.io

Cassandra Ring Architecture & Hot Partitions

Cassandra Introduction

Partition Keys, Clustering Columns, and Primary Key Design

If I had to point to one thing that separates engineers who run Cassandra well from engineers who generate production incidents, it's partition key design. Every other Cassandra concept — consistency levels, compaction, gossip — is secondary to getting this right.

The primary key in Cassandra has two distinct parts that most tutorials blur together:

The partition key determines which node stores the data. Cassandra hashes the partition key value to a token, and that token falls in a node's token range. All rows with the same partition key are stored together on disk — this is what makes single-partition reads so fast. But it's also why a bad partition key creates a hot spot: if thousands of writes share the same partition key value, all of them land on the same two or three nodes.

Clustering columns determine the sort order of rows within a partition. They are stored sorted on disk in the SSTable. This means range queries on clustering columns — WHERE log_time > x AND log_time < y — are essentially free: Cassandra reads a contiguous block of sorted data from a single SSTable. There is no sort step, no heap sort, no temporary buffer.

The combination of partition key and clustering columns forms the primary key. Every row within a partition must have a unique combination of clustering column values. A missing clustering column in a write will upsert the existing row — not create a new one.

Partition size is the operational concern that flows from this design. Because all rows with the same partition key live on the same nodes, an unbounded partition grows indefinitely and eventually causes the problems described in the production incident at the top of this guide. The rule I operate by: keep partitions under 100MB and under 100,000 rows. If either limit is at risk, add a time bucket to the partition key.

partition_key_design.cqlSQL

-- ── Example 1: Simple partition key — one column ────────────────────────────
-- All events for a user are in one partition.
-- Problem: a user with 10 million events creates a 4GB partition.
CREATE TABLE user_events_v1 (
    user_id   uuid,
    event_time timestamp,
    event_type text,
    metadata  text,
    PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- ── Example 2: Composite partition key — time bucketing ──────────────────────
-- Now the partition key is (user_id, bucket).
-- Each day gets its own partition — bounded growth, predictable size.
-- 'bucket' is a date string like '2026-04-18', computed in the application.
-- Trade-off: queries spanning multiple days require multiple partition reads.
CREATE TABLE user_events_v2 (
    user_id    uuid,
    bucket     text,      -- application sets this: '2026-04-18'
    event_time timestamp,
    event_type text,
    metadata   text,
    PRIMARY KEY ((user_id, bucket), event_time)
--                ^-----------^  composite partition key: two columns together
) WITH CLUSTERING ORDER BY (event_time DESC);

-- ── Writing to the bucketed table ─────────────────────────────────────────────
-- Application code computes the bucket from the event timestamp
-- then includes it in every write and read
INSERT INTO user_events_v2 (user_id, bucket, event_time, event_type, metadata)
VALUES (
    a3b4c5d6-e7f8-9012-abcd-ef1234567890,
    '2026-04-18',
    toTimestamp(now()),
    'page_view',
    '{"page":"/dashboard"}'
);

-- ── Reading from a specific bucket ───────────────────────────────────────────
-- This is a single-partition read: fast, predictable, no cluster scan
SELECT event_time, event_type, metadata
FROM user_events_v2
WHERE user_id = a3b4c5d6-e7f8-9012-abcd-ef1234567890
  AND bucket   = '2026-04-18'
  AND event_time > '2026-04-18 08:00:00+0000'
LIMIT 50;

-- ── What ALLOW FILTERING looks like — and why it is dangerous ─────────────────
-- This forces Cassandra to scan EVERY partition on EVERY node
-- to find rows where event_type = 'purchase'.
-- With 50 nodes and 500GB of data, this query touches all of it.
-- Never run this in production without SAI or a dedicated table.
SELECT * FROM user_events_v2
WHERE event_type = 'purchase'
ALLOW FILTERING;  -- full cluster scan — do not do this

Output

Table 'user_events_v1' created.

Table 'user_events_v2' created.

-- INSERT: applied

event_time | event_type | metadata

-----------------------------------+------------+---------------------

2026-04-18 14:23:01.000000+0000 | page_view | {"page":"/dashboard"}

(1 rows)

-- ALLOW FILTERING query:

Warnings: Aggregation query used without partition key (potential performance issue)

-- This scanned 50 nodes. Do not use in production.

The 100MB / 100,000 Row Partition Rule

Keep individual partitions under 100MB and under 100,000 rows. These are not hard limits but operational thresholds I've used across multiple production clusters. Beyond these numbers, compaction takes longer, reads require more SSTable lookups, and a single write spike can overwhelm the nodes responsible for that token range. Check partition sizes regularly with nodetool tablehistograms. If you're approaching the limit, add a time bucket to the partition key before you hit it — not after.

Production Insight

The most dangerous partition key mistake is the one that looks fine on day one.

A partition key based on user_id is fine when you have 1,000 users and 100 events each.

At 50,000 events per power user, you have a 500MB partition waiting to OOM a node.

Design for the maximum, not the average — you won't have time to redesign at 3am during a traffic spike.

Key Takeaway

The partition key determines data placement — choose high-cardinality values and add time buckets for time-series data.

Clustering columns determine sort order within a partition — sorted on disk, so range queries are essentially free.

ALLOW FILTERING is a full cluster scan — it is not a query strategy, it is an emergency diagnostic tool.

Keep partitions under 100MB and 100,000 rows — if you cannot, add a time bucket.

Partition Key Design Decisions

IfData volume per partition key value is bounded and small (< 10,000 rows)

→

UseSimple partition key is fine — one column, high cardinality (UUID, user_id)

IfData volume per partition key grows unboundedly over time (time-series, events)

→

UseAdd a time bucket to the partition key: (entity_id, bucket) where bucket is day/hour/week

IfNeed to query by a column that is not the partition key

→

UseCreate a separate table with that column as the partition key — duplicate the data

IfQuery needs to filter on a low-selectivity column across many partitions

→

UseIn Cassandra 5.0+, use a Storage Attached Index (SAI) — do not use ALLOW FILTERING

The LSM-Tree Write Path — Why Cassandra Writes So Fast

Understanding Cassandra's write performance requires understanding one design decision: all writes are sequential. There is no read-before-write, no in-place update, no locking. The LSM-Tree (Log-Structured Merge-Tree) storage engine is the reason Cassandra can absorb write rates that would saturate a PostgreSQL cluster.

Here's what actually happens when your application executes an INSERT or UPDATE:

Commit log write — The mutation is appended to a sequential commit log on disk. Sequential I/O. If the node crashes after this step, the data can be recovered from the commit log on restart.
Memtable write — The mutation is written to an in-memory structure called the memtable. All writes to the same partition are merged here — updates, deletes, and inserts coexist. The memtable is the 'current view' of recent data.
Memtable flush to SSTable — When the memtable reaches a configured size, it is flushed to disk as an immutable SSTable (Sorted String Table). Immutable means once written, it is never modified. Updates and deletes don't rewrite existing SSTables — they write new entries with a higher timestamp.
Compaction — Background compaction merges multiple SSTables into fewer, larger ones. During compaction, entries for the same partition key are merged by timestamp — newer wins. Deleted entries (tombstones) are purged after gc_grace_seconds.

The trade-off this creates: reads are more complex than writes. A read for a partition may need to check the memtable and multiple SSTables to find all the data. Bloom filters (one per SSTable) give Cassandra a probabilistic fast path — 'definitely not in this SSTable' — before opening the file. Partition indexes give the exact byte offset. By 2026, Cassandra 5.0's trie-based memtable and storage improvements have significantly reduced the read amplification that older LSM implementations suffered.

write_path_demonstration.cqlSQL

-- ── Demonstrating Cassandra's write semantics ────────────────────────────────
-- INSERT and UPDATE are identical in Cassandra — both are 'upserts'.
-- There is no read-before-write. The write goes to memtable immediately.
INSERT INTO forge_system.audit_logs
    (service_name, log_time, request_id, payload)
VALUES
    ('order-service', '2026-04-18 10:00:00+0000', uuid(), '{"amount":99.99}');

-- UPDATE writes a new cell with a higher timestamp — the old cell still exists
-- in the SSTable until compaction. Cassandra resolves conflicts by timestamp.
UPDATE forge_system.audit_logs
SET payload = '{"amount":99.99,"status":"confirmed"}'
WHERE service_name = 'order-service'
  AND log_time = '2026-04-18 10:00:00+0000';

-- ── Deletes write tombstones, not erasures ───────────────────────────────────
-- This DELETE writes a tombstone marker — it does NOT remove the data immediately.
-- The data will be physically removed after gc_grace_seconds (default: 10 days)
-- AND after a compaction runs AND after repair has propagated the tombstone.
DELETE FROM forge_system.audit_logs
WHERE service_name = 'order-service'
  AND log_time = '2026-04-18 10:00:00+0000';

-- ── Checking SSTable and memtable state via nodetool ─────────────────────────
-- Run from the shell, not from cqlsh:
-- nodetool flush forge_system         → forces memtable to flush to SSTable
-- nodetool compact forge_system       → forces compaction on the keyspace
-- nodetool tablehistograms forge_system audit_logs → read/write latency breakdown

-- ── TTL as an alternative to application-level deletes ───────────────────────
-- When data has a natural expiry, use TTL instead of explicit DELETE.
-- Cassandra handles the tombstone lifecycle automatically.
-- Here, this log entry expires in 86400 seconds (24 hours).
INSERT INTO forge_system.audit_logs
    (service_name, log_time, request_id, payload)
VALUES
    ('cache-service', toTimestamp(now()), uuid(), '{"cache":"hit"}')
USING TTL 86400;

Output

-- INSERT: applied

-- UPDATE: applied

-- DELETE: applied (tombstone written)

-- nodetool tablehistograms forge_system audit_logs

Forge_system/audit_logs histograms

Offset SSTables Write Latency Read Latency

1 0 74us 0us

2 1 112us 145us

3 1 118us 148us

-- SSTable count: 1 (after flush)

-- Tombstone live cells ratio: 0.12 (acceptable — below 0.20 threshold)

-- INSERT with TTL: applied

Why Deletes Are the Most Dangerous Write in Cassandra

A tombstone is a write — it goes through the same memtable → SSTable → compaction path as regular data
Tombstones accumulate until gc_grace_seconds passes (default 10 days) AND compaction runs
During reads, Cassandra must scan through tombstones to find live data — high tombstone counts degrade reads
If a node was down when a DELETE ran, it must receive the tombstone via repair before gc_grace_seconds expires
Common tombstone traps: nullable columns, frequent row-level deletes, and TTL-on-collections

Production Insight

The most operationally expensive thing you can do in Cassandra is delete frequently from a hot partition.

Each delete is a write that must be read around during future reads until compaction clears it.

If you need frequent deletes, use TTL on insertion instead — it amortizes the cost and keeps tombstone density predictable.

Monitor tombstone ratios with nodetool tablehistograms. When the live cells to tombstone ratio exceeds 0.20, investigate before latency spikes.

Key Takeaway

LSM-Tree = sequential writes to commit log and memtable, periodic flush to immutable SSTables, background compaction.

Cassandra writes are faster than B-Tree RDBMS writes because there is no read-before-write and no in-place update.

Deletes are writes — they create tombstones that degrade read performance until compaction clears them.

Use TTL for data with natural expiry — it keeps tombstone accumulation predictable and bounded.

Choosing a Compaction Strategy

IfWrite-heavy workload, data is rarely updated or deleted

→

UseUse STCS (Size-Tiered Compaction Strategy) — the default, optimizes write throughput

IfRead-heavy workload, data is frequently queried by partition key

→

UseUse LCS (Leveled Compaction Strategy) — fewer SSTables per partition, better read performance, more I/O during compaction

IfTime-series data with TTL-based expiry

→

UseUse TWCS (Time-Window Compaction Strategy) — groups SSTables by time window, drops entire windows when TTL expires, minimal write amplification

IfUnclear workload or mixed read/write

→

UseStart with STCS, monitor with nodetool tablehistograms, switch to LCS if read latency is the dominant complaint

Common Mistakes and How to Avoid Them

When moving from SQL to Cassandra, the mistakes fall into two categories: design mistakes you make before writing a line of code, and operational mistakes you make after the cluster is running. The design mistakes are more expensive — they require data migration to fix. The operational ones are more urgent — they cause incidents.

The single most consequential design mistake is treating Cassandra like a relational database. Normalizing data, expecting joins, using secondary indexes as a query strategy, filtering on non-primary-key columns with ALLOW FILTERING — all of these patterns assume a query engine that Cassandra deliberately does not have. Cassandra's query engine is intentionally limited because that limitation is what makes it fast at scale.

For the operational side, the development cluster below is the starting point for understanding how the cluster behaves before you have 12 production nodes to experiment on. The compose file uses Cassandra 5.0 — pinned, not latest. Using latest in any tutorial or development setup is a trap: you will develop against a different version than production, and Cassandra has meaningful behavioral differences between major versions.

docker-compose.ymlYAML

# ── Three-node Cassandra development cluster ─────────────────────────────────
# Pinned to Cassandra 5.0 — do not use 'latest' in development or production.
# Cassandra 5.0 ships with significant storage engine improvements:
#   - Trie-based memtable (better memory efficiency)
#   - Storage Attached Indexes (SAI) replacing legacy SASI
#   - Unified compaction strategy option

services:
  cassandra-node1:
    image: cassandra:5.0
    container_name: forge-cassandra-1
    hostname: cassandra-node1
    ports:
      - "9042:9042"
    environment:
      - CASSANDRA_CLUSTER_NAME=TheCodeForgeCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack1
      - CASSANDRA_SEEDS=cassandra-node1,cassandra-node2
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=256M
    volumes:
      - cassandra-data1:/var/lib/cassandra
    healthcheck:
      test: ["CMD-SHELL", "nodetool status | grep -E '^UN'"]
      interval: 30s
      timeout: 10s
      retries: 10
      start_period: 60s

  cassandra-node2:
    image: cassandra:5.0
    container_name: forge-cassandra-2
    hostname: cassandra-node2
    environment:
      - CASSANDRA_CLUSTER_NAME=TheCodeForgeCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack2
      - CASSANDRA_SEEDS=cassandra-node1,cassandra-node2
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=256M
    volumes:
      - cassandra-data2:/var/lib/cassandra
    depends_on:
      cassandra-node1:
        condition: service_healthy

  cassandra-node3:
    image: cassandra:5.0
    container_name: forge-cassandra-3
    hostname: cassandra-node3
    environment:
      - CASSANDRA_CLUSTER_NAME=TheCodeForgeCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack3
      - CASSANDRA_SEEDS=cassandra-node1,cassandra-node2
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=256M
    volumes:
      - cassandra-data3:/var/lib/cassandra
    depends_on:
      cassandra-node2:
        condition: service_healthy

volumes:
  cassandra-data1:
  cassandra-data2:
  cassandra-data3:

Output

# After all three nodes start (allow ~3 minutes for gossip convergence)

# Run: docker exec forge-cassandra-1 nodetool status

# Datacenter: datacenter1

# ========================

# Status=Up/Down

# |/ State=Normal/Leaving/Joining/Moving

# -- Address Load Tokens Owns Host ID Rack

# UN 172.20.0.2 195.21 KiB 16 33.3% a1b2c3d4-e5f6-7890-abcd-ef1234567890 rack1

# UN 172.20.0.3 207.43 KiB 16 33.3% b2c3d4e5-f6a7-8901-bcde-f12345678901 rack2

# UN 172.20.0.4 189.67 KiB 16 33.4% c3d4e5f6-a7b8-9012-cdef-123456789012 rack3

# UN = Up, Normal — all three nodes are healthy and distributing load evenly

# Each node owns ~33% of the token ring

Watch Out:

The most common setup mistake is using the SimpleStrategy replication class in production. While fine for single-node development, production environments require NetworkTopologyStrategy to ensure rack-aware and datacenter-aware data placement. The second most common mistake is using cassandra:latest as the image tag — it breaks reproducibility and means developers and production may be running different major versions with different behaviors.

Production Insight

Denormalization in Cassandra means duplicating data across multiple tables.

Each table is designed for exactly one query pattern — this is by design, not a flaw.

Rule: if you need the same data for two queries, create two tables — storage is cheap, latency is not.

The storage cost of denormalization at scale is almost always lower than the operational cost of slow queries or a poorly tuned secondary index.

Key Takeaway

Query-First Design: write your CQL queries before designing your tables — the query defines the table structure.

Joins do not exist — if you need data from two entities, denormalize into one table or create two query-specific tables.

Never use SimpleStrategy in production and never pin to cassandra:latest — both create incidents you cannot explain.

Data Modeling Decision Matrix

IfNeed to query by user_id and by email

→

UseCreate two tables: one partitioned by user_id, one by email — do NOT rely on secondary indexes for this

IfTime-series data with high write volume

→

UseUse a time bucket in the partition key (e.g., day or hour) to prevent unbounded partition growth

IfNeed to aggregate or group data across partitions

→

UsePre-compute aggregations at write time and store in a dedicated summary table

IfQuery requires filtering on a non-primary-key column in Cassandra 5.0+

→

UseUse a Storage Attached Index (SAI) — it replaces SASI and avoids the full-cluster scan of ALLOW FILTERING

Storage Attached Indexes (SAI) — Secondary Indexing in Cassandra 5.0

Before Cassandra 5.0, secondary indexing was the most common source of bad advice in Cassandra guides. The options were:

SASI (Storage Attached Secondary Index) — better than the original secondary indexes, still had significant operational overhead and limited query capabilities.
Original secondary indexes — effectively hidden tables, painful performance characteristics, deprecated by the Cassandra community for most production use.
ALLOW FILTERING — not an index at all, a full cluster scan dressed up as a query option.

Cassandra 5.0 introduced Storage Attached Indexes (SAI), which supersede SASI. SAI is built into the SSTable storage format, requires no separate index tables, and supports range queries, equality, and — for vector embeddings — approximate nearest neighbor (ANN) search. Vector search support makes Cassandra 5.0 relevant for AI-adjacent workloads that weren't possible before.

SAI doesn't make secondary indexing free. It's still more expensive than a partition key query because it must look across all SSTables on the node. But the cost is manageable for selective queries — queries that filter down to a small result set. The rule I apply: if your SAI query returns more than 5-10% of the rows in the partition, the query is not selective enough for an index to help, and you should create a dedicated table instead.

SASI is deprecated in Cassandra 5.0. Any guide that recommends SASI for new deployments is out of date.

sai_indexes.cqlSQL

-- ── Storage Attached Indexes: Cassandra 5.0+ ─────────────────────────────────
-- SAI allows querying on non-primary-key columns without ALLOW FILTERING.
-- SASI is deprecated in Cassandra 5.0 — use SAI for all new indexes.

USE forge_system;

-- Table: product catalog — partition key is category
CREATE TABLE IF NOT EXISTS products (
    category    text,
    product_id  uuid,
    name        text,
    price       decimal,
    in_stock    boolean,
    rating      float,
    PRIMARY KEY (category, product_id)
);

-- ── Create SAI indexes on non-primary-key columns ────────────────────────────
-- Syntax: CREATE INDEX index_name ON table(column) USING 'sai';
CREATE INDEX products_price_idx  ON products(price)   USING 'sai';
CREATE INDEX products_stock_idx  ON products(in_stock) USING 'sai';
CREATE INDEX products_rating_idx ON products(rating)  USING 'sai';

-- ── SAI-enabled queries: no ALLOW FILTERING required ─────────────────────────
-- Range query on price — only possible with SAI or as a partition key
SELECT product_id, name, price
FROM products
WHERE category = 'electronics'
  AND price > 100.00
  AND price < 500.00;

-- Combined filter: in stock AND rating above threshold
-- Both conditions use SAI — Cassandra intersects the index results
SELECT product_id, name, price, rating
FROM products
WHERE category = 'electronics'
  AND in_stock = true
  AND rating > 4.5;

-- ── Vector search with SAI (AI workloads) ────────────────────────────────────
-- Cassandra 5.0 supports ANN (Approximate Nearest Neighbor) search via SAI
-- Useful for semantic search, recommendation systems, embedding lookup
CREATE TABLE IF NOT EXISTS product_embeddings (
    category   text,
    product_id uuid,
    embedding  vector<float, 1536>,  -- 1536-dimensional OpenAI embedding
    PRIMARY KEY (category, product_id)
);

-- SAI index on the vector column for ANN search
CREATE INDEX embeddings_ann_idx
    ON product_embeddings(embedding)
    USING 'sai'
    WITH OPTIONS = {'similarity_function': 'cosine'};

-- ANN query: find the 10 most similar products to a given embedding vector
-- This runs on-node without a cluster scan — orders of magnitude faster than
-- brute-force similarity across all rows
SELECT product_id
FROM product_embeddings
WHERE category = 'electronics'
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, ...] -- 1536 floats from your model
LIMIT 10;

Output

Index 'products_price_idx' created.

Index 'products_stock_idx' created.

Index 'products_rating_idx' created.

-- Price range query:

product_id | name | price

--------------------------------------+-------------------+-------

d1e2f3a4-b5c6-7890-defg-h12345678901 | Forge Headphones | 149.99

e2f3a4b5-c6d7-8901-efgh-i23456789012 | Cloud Core Pro | 299.00

(2 rows)

-- Combined filter query:

product_id | name | price | rating

--------------------------------------+-------------------+--------+-------

d1e2f3a4-b5c6-7890-defg-h12345678901 | Forge Headphones | 149.99 | 4.7

(1 rows)

-- Vector index created.

-- ANN query: (10 results returned based on cosine similarity)

SAI vs Dedicated Table: When to Use Which

SAI is right when: the indexed column has high cardinality, queries are selective (< 5% of partition rows), and you cannot predict all query patterns at schema design time
Dedicated table is right when: you need the absolute fastest reads, the query pattern is known and stable, or selectivity is too low for SAI to help
Vector search via SAI is the recommended pattern for embedding lookup in Cassandra 5.0+ — no separate vector database required for moderate scale
SASI is deprecated in Cassandra 5.0 — do not create new SASI indexes on any new deployment

Production Insight

SAI in Cassandra 5.0 is a genuine improvement over everything that came before it.

But 'genuine improvement' is not the same as 'free.' A low-selectivity SAI query — one that matches 30% of rows — still performs worse than a well-designed partition key query.

Rule: always include the partition key in SAI queries when possible. SAI narrows results within a partition far more efficiently than across the whole cluster.

Key Takeaway

SAI replaces SASI in Cassandra 5.0 — SASI is deprecated and should not be used in new deployments.

SAI supports equality, range, and vector (ANN) queries without ALLOW FILTERING.

Include the partition key alongside SAI conditions wherever possible — it scopes the index lookup to one partition rather than the full cluster.

For AI workloads requiring embedding similarity search, Cassandra 5.0 + SAI vector indexes removes the need for a separate vector database at moderate scale.

Secondary Index Strategy in Cassandra 5.0

IfNeed to query on a non-primary-key column, column has high cardinality

→

UseUse SAI (USING 'sai') — efficient for selective queries, no separate table required

IfNeed to query on a non-primary-key column, query is for a known stable pattern

→

UseCreate a dedicated table with the column as partition key — fastest reads, predictable performance

IfNeed approximate nearest neighbor search for ML embeddings

→

UseUse SAI vector index on a vector<float, N> column — built into Cassandra 5.0, no external vector DB needed for moderate scale

IfExisting cluster uses SASI indexes

→

UsePlan migration to SAI — SASI is deprecated in 5.0 and will not receive new features

Connecting a Java Application — DataStax Java Driver

Every other concept in this guide — rings, partition keys, consistency levels, SAI — becomes concrete once you see how an application actually talks to Cassandra. The standard driver by 2026 is the DataStax Java Driver for Apache Cassandra, version 4.x. It replaced the older CQL driver and the Astyanax client.

Three things the driver handles that you do not want to implement yourself:

Load balancing policy — the driver tracks which nodes are in which datacenter, which are up, and which are token-aware. With the DefaultLoadBalancingPolicy, queries are automatically routed to the replica that owns the data — no unnecessary hops through coordinator nodes.
Connection pooling — each driver instance maintains a pool of connections per host. You create one CqlSession per application and reuse it for the lifetime of the process. Creating a new session per request is one of the most common Java + Cassandra performance mistakes.
Prepared statements — CQL statements should be prepared once and executed many times. Prepared statements are parsed and validated once by the coordinator; subsequent executions send only the bound values. Unprepared statements parse on every execution — at scale, the coordinator overhead becomes measurable.

The Spring Data Cassandra abstraction sits on top of this driver and is suitable for straightforward CRUD workloads. For anything involving custom query patterns, consistency level overrides, or performance-critical paths, I prefer going directly to the driver — the abstraction layer hides the consistency level control that Cassandra's tunable consistency depends on.

CassandraConfig.java + AuditLogRepository.javaJAVA

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

// ── pom.xml dependencies ──────────────────────────────────────────────────────
// <dependency>
//   <groupId>com.datastax.oss</groupId>
//   <artifactId>java-driver-core</artifactId>
//   <version>4.17.0</version>
// </dependency>
// <dependency>
//   <groupId>com.datastax.oss</groupId>
//   <artifactId>java-driver-query-builder</artifactId>
//   <version>4.17.0</version>
// </dependency>

// ── CassandraConfig.java — single CqlSession for the application lifetime ────
package io.thecodeforge.config;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.net.InetSocketAddress;
import java.time.Duration;

@Configuration
public class CassandraConfig {

    // ONE CqlSession per application — not one per request.
    // CqlSession is thread-safe and manages its own connection pool internally.
    // Creating a session per request is one of the most common performance mistakes.
    @Bean(destroyMethod = "close")
    public CqlSession cqlSession() {
        return CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .withKeyspace("forge_system")
                .withConfigLoader(
                    DriverConfigLoader.programmaticBuilder()
                        // Request timeout: fail fast rather than hold threads
                        .withDuration(DefaultDriverOption.REQUEST_TIMEOUT,
                                      Duration.ofMillis(500))
                        // Consistency level: LOCAL_QUORUM for production
                        // Tolerates one replica being down in the local DC
                        .withString(DefaultDriverOption.REQUEST_CONSISTENCY,
                                    "LOCAL_QUORUM")
                        // Token-aware routing: send queries directly to the
                        // replica that owns the partition — avoids coordinator hop
                        .withString(DefaultDriverOption.LOAD_BALANCING_POLICY_CLASS,
                                    "DefaultLoadBalancingPolicy")
                        .build()
                )
                .build();
    }
}

// ── AuditLog.java — simple value object ──────────────────────────────────────
package io.thecodeforge.model;

import java.time.Instant;
import java.util.UUID;

public record AuditLog(
    String  serviceName,
    Instant logTime,
    UUID    requestId,
    String  payload
) {}

// ── AuditLogRepository.java — DataStax driver with prepared statements ────────
package io.thecodeforge.repository;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.*;
import io.thecodeforge.model.AuditLog;
import org.springframework.stereotype.Repository;
import java.time.Instant;
import java.util.List;
import java.util.UUID;
import java.util.stream.StreamSupport;

@Repository
public class AuditLogRepository {

    private final CqlSession session;

    // Prepared statements: parse once, execute many times.
    // Prepared at construction time — fails fast on startup if CQL is wrong.
    private final PreparedStatement insertStmt;
    private final PreparedStatement selectRecentStmt;

    public AuditLogRepository(CqlSession session) {
        this.session = session;

        this.insertStmt = session.prepare(
            "INSERT INTO audit_logs (service_name, log_time, request_id, payload) "
          + "VALUES (?, ?, ?, ?) USING TTL 2592000"
        );

        this.selectRecentStmt = session.prepare(
            "SELECT service_name, log_time, request_id, payload "
          + "FROM audit_logs "
          + "WHERE service_name = ? AND log_time > ? "
          + "LIMIT 100"
        );
    }

    public void save(AuditLog log) {
        BoundStatement bound = insertStmt.bind(
            log.serviceName(),
            log.logTime(),
            log.requestId(),
            log.payload()
        );
        // execute() is synchronous — use executeAsync() for high-throughput writes
        session.execute(bound);
    }

    public List<AuditLog> findRecent(String serviceName, Instant since) {
        BoundStatement bound = selectRecentStmt.bind(serviceName, since);
        ResultSet rs = session.execute(bound);

        return StreamSupport.stream(rs.spliterator(), false)
                .map(row -> new AuditLog(
                    row.getString("service_name"),
                    row.getInstant("log_time"),
                    row.getUuid("request_id"),
                    row.getString("payload")
                ))
                .toList(); // Java 16+ — returns unmodifiable list
    }

    // ── Async write for high-throughput ingestion paths ───────────────────────
    // Returns CompletableFuture<AsyncResultSet> — caller decides whether to await
    public java.util.concurrent.CompletableFuture<AsyncResultSet> saveAsync(AuditLog log) {
        BoundStatement bound = insertStmt.bind(
            log.serviceName(),
            log.logTime(),
            log.requestId(),
            log.payload()
        );
        return session.executeAsync(bound).toCompletableFuture();
    }
}

Output

-- Application startup log (condensed):

[main] INFO c.d.o.d.i.core.session.DefaultSession - [forge-session] Initializing

[main] INFO c.d.o.d.i.core.loadbalancing.DefaultLoadBalancingPolicy

- [forge-session] Using datacenter 'datacenter1' for local policy

[main] INFO c.d.o.d.i.core.session.DefaultSession - [forge-session] Initialized

-- Repository.save() called:

-- Prepared statement bound and executed in 2.3ms (LOCAL_QUORUM satisfied)

-- Repository.findRecent('payment-service', 2026-04-18T00:00:00Z):

-- Single-partition read, 2 rows returned in 1.8ms

One Session, Many Queries — The Connection Pool Model

CqlSession is thread-safe — inject it as a singleton and share it across all repositories
The driver manages a connection pool per host automatically — no manual pool management needed
PreparedStatement objects are also reusable and thread-safe — prepare at startup, bind per request
executeAsync() returns a CompletableFuture — use it for fire-and-forget writes or high-throughput ingestion
Token-aware routing in DefaultLoadBalancingPolicy means queries go directly to the replica — no extra coordinator hop

Production Insight

The two driver mistakes that show up in every Cassandra performance post-mortem:

1. Creating a new CqlSession per request — session creation takes 100-300ms. At 100 req/s, this saturates the application before Cassandra is ever stressed.

2. Using unprepared statements in a hot path — the coordinator parses the CQL string on every execution. At scale, this adds measurable latency and CPU load to your coordinator nodes.

Both mistakes look fine in load testing with 10 users. They collapse at production scale.

Key Takeaway

Create one CqlSession per application — it is thread-safe and manages its own connection pool.

Prepare statements at startup, bind per request — never execute unprepared statements in a hot path.

Use LOCAL_QUORUM as the default consistency level — it tolerates one replica being down while maintaining strong consistency within the local datacenter.

Use executeAsync() for high-throughput write paths — it allows multiple in-flight writes without blocking application threads.

Sync vs Async Driver Execution

IfStandard request-response REST API endpoint

→

UseUse session.execute() — synchronous, straightforward, compatible with Spring MVC thread model

IfHigh-throughput write ingestion (events, metrics, logs)

→

UseUse session.executeAsync() with CompletableFuture — allows batching multiple in-flight writes without blocking threads

IfApplication uses Spring WebFlux or reactive streams

→

UseUse the reactive driver extension (java-driver-reactive) which returns Mono/Flux directly

IfBulk data migration or backfill job

→

UseUse executeAsync() with a semaphore to limit concurrency — avoids overwhelming Cassandra with thousands of simultaneous requests

Who Should Touch This Database (And Who Should Run Away)

Cassandra is not your father's relational database. You do not reach for it when you have three tables and a weekend project. This is the hammer you pull out when you need to ingest millions of writes per second across a dozen data centers and you cannot afford a single second of downtime. If you are a backend engineer working on time-series metrics, IoT sensor pipelines, or a real-time recommendation engine that must survive a rack failure without blinking: this database was built for you. If you are building a simple blog with user profiles and comments, stop reading and go back to PostgreSQL. Cassandra trades away joins, secondary indexes that work out of the box, and strong consistency across partitions for raw write throughput and linear scalability. You need to know how partition keys work. You need to understand that your query pattern must be designed before you create a single table. If you cannot model your access paths up front, Cassandra will punish you in production and your on-call rotation will hate you.

ModelCheck.javaJAVA

// io.thecodeforge
// Verify your access pattern before writing a single row
public class ModelCheck {
    // If your query is "get all orders by user_id",
    // your partition key MUST be user_id. No exceptions.
    public static final String CREATE_ORDERS_TABLE = 
        "CREATE TABLE IF NOT EXISTS orders_by_user (" +
        "user_id UUID," +
        "order_time TIMESTAMP," +
        "amount DECIMAL," +
        "PRIMARY KEY (user_id, order_time)" +
        ") WITH CLUSTERING ORDER BY (order_time DESC);";

    public static void main(String[] args) {
        System.err.println("Design the query first. Then build the table.");
        System.err.println("Reverse that order and you will re-architect at 2 AM.");
    }
}

Output

Design the query first. Then build the table.

Reverse that order and you will re-architect at 2 AM.

Production Trap:

Do NOT let a junior architect 'normalize' Cassandra tables. Every table is a denormalized query result set. One table per query pattern. Violate this and your cluster will scatter reads across nodes, killing latency.

Key Takeaway

Model your query first, then build the table — Cassandra is not relational.

What You Need Before You Start (Or You Will Flame Out)

You cannot just install Cassandra and start typing CQL. You need a solid grasp of Java — not just syntax, but the JVM's garbage collection behavior, heap sizing, and concurrent programming. Cassandra is written in Java, and every node runs inside a JVM. If you do not know what a GC pause looks like in a production graph, you will mistake a dying node for a network issue. You also need to be comfortable on Linux. Not Windows. Not Mac for production. Linux. You will tail logs in /var/log/cassandra, adjust kernel parameters for swappiness and net.core.somaxconn, and debug ulimit complaints before breakfast. Basic database concepts help: you should know what a B-tree is so you can appreciate why an LSM-tree behaves differently under compaction. If you have never used any NoSQL database before, spend a weekend with DynamoDB or Scylla first. The distributed systems concepts — consistent hashing, gossip protocol, hinted handoff — are not optional. They are the difference between understanding why your read repair failed and blindly restarting nodes.

GCSettings.javaJAVA

// io.thecodeforge
// Tune JVM before you tune Cassandra queries
public class GCSettings {
    public static void main(String[] args) {
        // Production Cassandra nodes MUST use G1GC
        // -XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5
        // Never use CMS or ParallelGC unless you enjoy 10-second pauses
        System.out.println("Check jvm.options for:");
        System.out.println("-XX:+UseG1GC");
        System.out.println("-Xms16G -Xmx16G  // always equal heap");
        System.out.println("-XX:MaxGCPauseMillis=200");
    }
}

Output

Check jvm.options for:

-XX:+UseG1GC

-Xms16G -Xmx16G // always equal heap

-XX:MaxGCPauseMillis=200

Production Trap:

Never set -Xms and -Xmx to different values. Unequal heap sizes trigger full GCs when the JVM tries to shrink or expand the heap. Equal them out and your node stays steady.

Key Takeaway

You need Java, Linux, and GC tuning knowledge before you run a single production node. Skip this and you will burn a weekend debugging pauses.

● Production incidentPOST-MORTEMseverity: high

Hot Partition Crash During Peak Traffic

Symptom

Two nodes hit 100% CPU and OOM-killed. Read latency spiked from 5ms to 12 seconds. Application received WriteTimeoutException on 40% of requests.

Assumption

Partition key design distributes writes evenly across the cluster. No single partition should exceed 100MB.

Root cause

The order_id was used as the partition key, but a batch import job inserted 2 million rows with the same partition key (bulk order event). This created a single 4GB partition that overwhelmed two nodes responsible for that token range.

Fix

Redesigned partition key to include a time bucket: (order_date, order_id). Added partition size monitoring with alerts at 100MB threshold. Implemented client-side rate limiting for bulk imports.

Key lesson

Never use a low-cardinality value as a partition key — it creates hot spots
Monitor partition sizes in production — anything over 100MB is a red flag
Bulk import jobs must respect partition key distribution — add time bucketing
Set up alerts on compaction pending tasks as an early warning for hot partitions

Production debug guideFrom slow queries to node failures5 entries

Symptom · 01

Read latency spikes above 100ms

→

Fix

Run nodetool tablehistograms to check read latency distribution; look for large SSTable counts or tombstone-heavy partitions

Symptom · 02

Node shows DN (Down) in nodetool status

→

Fix

Check system.log for OOM or disk full errors; verify network connectivity with ping and gossip state with nodetool gossipinfo

Symptom · 03

WriteTimeoutException from application

→

Fix

Check if consistency level exceeds available replicas; run nodetool tpstats to see if write thread pools are saturated

Symptom · 04

Compaction never completes

→

Fix

Check disk I/O with iostat; verify compaction strategy matches workload (STCS for write-heavy, LCS for read-heavy, TWCS for time-series)

Symptom · 05

ALLOW FILTERING query causes cluster-wide scan

→

Fix

Identify queries using ALLOW FILTERING in application logs or with slow query log enabled. Redesign the table so the filter column is part of the primary key, or create a SAI index in Cassandra 5.0+

★ Cassandra Emergency DebuggingWhen your Cassandra cluster is in trouble, start here

Node unresponsive or high latency−

Immediate action

Check node status and pending compactions

Commands

nodetool status

nodetool tpstats | grep -E 'pending|blocked'

Fix now

If pending compactions > 50, reduce write throughput or add nodes. If a node is DN, check /var/log/cassandra/system.log for OOM or disk-full errors before attempting restart.

Data inconsistency after node recovery+

Disk space critical (>85% used)+

Database Architecture Comparison

Aspect	Traditional RDBMS (MySQL/PostgreSQL)	Apache Cassandra
Architecture	Centralized (Master-Slave)	Distributed (Peer-to-Peer)
Scalability	Vertical (bigger hardware)	Horizontal (more nodes, linear throughput increase)
Data Model	Normalized (fixed schema, joins)	Denormalized (flexible schema, one table per query)
Availability	High with failover lag (30s-2min)	Always-on (no failover delay, no leader election)
Write Speed	Moderate (B-Tree, random I/O, locking)	Extreme (LSM-Tree, sequential I/O, no read-before-write)
Read Pattern	Flexible (joins, aggregations, ad-hoc)	Query-driven (design tables around queries)
Secondary Indexes	B-Tree indexes, generally efficient	SAI in Cassandra 5.0+ (selective queries only)
Transactions	Full ACID across tables and rows	Lightweight transactions (LWT) for single-partition conditional updates only
Operational Cost	Lower for small scale (< 1TB)	Lower for massive scale (10TB+, multi-datacenter)

Key takeaways

Apache Cassandra is a masterless, distributed database designed for zero-downtime and massive scale

every node is equal, any node accepts reads and writes.

Always model your data around your queries (Query-Driven Design) rather than entities

joins do not exist, and ALLOW FILTERING is a diagnostic tool, not a query strategy.

Linear scalability means adding nodes provides a predictable, proportional increase in throughput

no rebalancing ceremony, no leader election.

The Gossip protocol and Snitch configuration are the heartbeat of the cluster's awareness

they power failure detection and topology-aware routing.

Tunable consistency is Cassandra's operational superpower

LOCAL_QUORUM is the right default for production: QUORUM = floor(RF/2)+1 nodes must respond, tolerating one failure with RF=3.

Denormalization is not a flaw

it is the correct pattern when joins do not exist. Duplicate data across tables; storage is cheap, cross-partition reads are not.

Tombstones are writes

every DELETE creates a tombstone that degrades reads until compaction clears it. Use TTL on INSERT instead of explicit DELETE for time-bounded data.

SAI (Storage Attached Indexes) in Cassandra 5.0 replaces SASI and supports range queries and vector ANN search

SASI is deprecated and should not be used in new deployments.

Create one CqlSession per application, prepare statements at startup, and use LOCAL_QUORUM as the default consistency level

these three driver decisions determine whether your application runs well under load.

Common mistakes to avoid

6 patterns

Over-normalizing data like in SQL databases

Symptom

Application performs multiple sequential reads to assemble a single view, resulting in 50-200ms latency per page load instead of the expected 2-5ms.

Fix

Redesign tables using Query-First approach: define the exact CQL query first, then create a denormalized table with all required columns in the primary key or as stored columns. If you need the same data for two query patterns, create two tables.

Using SimpleStrategy in production

Symptom

During a rack failure, all replicas of a partition reside in the same rack. Data becomes temporarily or permanently unavailable.

Fix

Switch to NetworkTopologyStrategy with RF=3 per datacenter. Verify snitch configuration matches your infrastructure (GossipingPropertyFileSnitch for most deployments).

Ignoring partition key cardinality — creating hot spots

Symptom

One node shows 90% CPU while others idle at 5%. Compaction backlog grows unbounded on the hot node. Eventually the node OOM-kills and the cluster degrades.

Fix

Choose a high-cardinality partition key (UUIDs, composite keys with time buckets). Monitor partition sizes with nodetool tablehistograms — keep individual partitions under 100MB and 100,000 rows.

Using consistency level ALL for every query

Symptom

Any single node failure or network partition causes all reads and writes to fail, completely negating Cassandra's availability guarantees.

Fix

Use LOCAL_QUORUM as the default consistency level. It tolerates one node failure while still guaranteeing strong consistency within the local datacenter. QUORUM = floor(RF/2) + 1.

Using ALLOW FILTERING in production queries

Symptom

A query that takes 2ms during development takes 30 seconds in production when the table has 50 million rows. The coordinator scans every partition on every node.

Fix

Redesign the table so the filter column is part of the primary key, or create a Storage Attached Index (SAI) in Cassandra 5.0+. ALLOW FILTERING is a diagnostic tool, not a query strategy.

Creating a new CqlSession per request in the Java driver

Symptom

Application throughput collapses under load. Thread pool exhaustion in the application, not in Cassandra. P99 latency spikes to 300ms+ because session initialization dominates request time.

Fix

Create one CqlSession as a singleton bean and inject it into all repositories. Session creation is expensive (100-300ms); request execution should be 1-5ms.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

What is the role of the Gossip protocol in an Apache Cassandra cluster?

Q02SENIOR

How does Cassandra achieve high write throughput compared to traditional...

Q03SENIOR

Explain the LSM-Tree (Log-Structured Merge-Tree) and its importance in C...

Q04JUNIOR

What is the difference between SimpleStrategy and NetworkTopologyStrateg...

Q05SENIOR

What is a tombstone in Cassandra, why does it degrade read performance, ...

Q06SENIOR

What is Storage Attached Indexing (SAI) in Cassandra 5.0, and how does i...

Q01 of 06SENIOR

What is the role of the Gossip protocol in an Apache Cassandra cluster?

ANSWER

Gossip is a peer-to-peer communication protocol that every Cassandra node uses to discover and maintain awareness of other nodes in the cluster. Each node gossips with up to 3 other nodes every second, exchanging state information (up/down, load, schema version). This decentralized protocol has no single point of failure and converges on cluster state within a few seconds. It powers failure detection — if a node stops responding to gossip, it is marked as down after a configurable phi-convict-threshold. The Snitch works alongside Gossip to tell each node the datacenter and rack topology, which the load balancing policy uses to make routing decisions.

FAQ · 5 QUESTIONS

Frequently Asked Questions

Does Cassandra support ACID transactions?

How do I choose the right compaction strategy?

What is a tombstone in Cassandra and why should I care?

Can I use secondary indexes in Cassandra?

When should I use Cassandra vs PostgreSQL?

Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Lessons pulled from things that broke in production.

✓ Verified

production tested

May 24, 2026

last updated

1,554

articles · all by Naren

🔥

That's Cassandra. Mark it forged?

9 min read · try the examples if you haven't