
Introduction to Apache Cassandra

📍 Part of: Cassandra → Topic 1 of 4
A comprehensive guide to Apache Cassandra — master the fundamentals of this distributed NoSQL database, its ring architecture, and linear scalability patterns.
🧑‍💻 Beginner-friendly — no prior database experience needed
In this tutorial, you'll learn
  • Apache Cassandra is a masterless, distributed database designed for zero-downtime and massive scale — every node is equal, any node accepts reads and writes.
  • Always model your data around your queries (Query-Driven Design) rather than entities — joins do not exist, and ALLOW FILTERING is a diagnostic tool, not a query strategy.
  • Linear scalability means adding nodes provides a predictable, proportional increase in throughput — no rebalancing ceremony, no leader election.
Quick Answer
  • Apache Cassandra is a masterless, peer-to-peer distributed NoSQL database favoring availability over strong consistency
  • Data is distributed via consistent hashing across a ring of equal nodes — no single point of failure
  • Writes go to any node and propagate to replicas based on Replication Factor
  • LSM-Tree storage engine delivers extreme write throughput — sequential I/O, no read-before-write
  • Query-Driven Design: model tables around queries, not entities — joins do not exist
  • Cassandra 5.0 introduced Storage Attached Indexes (SAI) — the correct answer to secondary indexing from 2024 onward
  • Biggest mistake: over-normalizing data like in SQL — denormalization is the correct Cassandra pattern
🚨 START HERE
Cassandra Emergency Debugging
When your Cassandra cluster is in trouble, start here
🟠 Node unresponsive or high latency
Immediate Action: Check node status and pending compactions
Commands
nodetool status
nodetool tpstats | grep -E 'pending|blocked'
Fix Now: If pending compactions > 50, reduce write throughput or add nodes. If a node shows DN, check /var/log/cassandra/system.log for OOM or disk-full errors before attempting a restart.
🟡 Data inconsistency after node recovery
Immediate Action: Run repair on the affected keyspace
Commands
nodetool repair --full forge_system
nodetool getendpoints forge_system audit_logs <partition_key>
Fix Now: Schedule weekly nodetool repair via cron for all keyspaces. Repairs must complete within gc_grace_seconds (default 10 days) or deleted data can resurrect.
🟡 Disk space critical (>85% used)
Immediate Action: Check SSTable count and tombstone accumulation
Commands
nodetool cfstats forge_system.audit_logs | grep 'Space used'
find /var/lib/cassandra/data -name '*.db' | wc -l
Fix Now: Force compaction with nodetool compact. If tombstones are the cause, verify gc_grace_seconds is configured correctly and that repair runs regularly before tombstones are purged.
Production Incident: Hot Partition Crash During Peak Traffic
A production Cassandra cluster lost two nodes simultaneously during a flash sale event, causing data unavailability for 23 minutes.
Symptom: Two nodes hit 100% CPU and were OOM-killed. Read latency spiked from 5ms to 12 seconds. The application received WriteTimeoutException on 40% of requests.
Assumption: The partition key design distributes writes evenly across the cluster; no single partition should exceed 100MB.
Root cause: order_id was used as the partition key, and a batch import job inserted 2 million rows with the same partition key (a bulk order event). This created a single 4GB partition that overwhelmed the two nodes responsible for that token range.
Fix: Redesigned the partition key to include a time bucket: (order_date, order_id). Added partition size monitoring with alerts at a 100MB threshold. Implemented client-side rate limiting for bulk imports.
Key Lesson
  • Never use a low-cardinality value as a partition key — it creates hot spots
  • Monitor partition sizes in production — anything over 100MB is a red flag
  • Bulk import jobs must respect partition key distribution — add time bucketing
  • Set up alerts on compaction pending tasks as an early warning for hot partitions
Production Debug Guide — from slow queries to node failures
Read latency spikes above 100ms: Run nodetool tablehistograms to check the read latency distribution; look for large SSTable counts or tombstone-heavy partitions.
Node shows DN (Down) in nodetool status: Check system.log for OOM or disk-full errors; verify network connectivity with ping and gossip state with nodetool gossipinfo.
WriteTimeoutException from the application: Check whether the consistency level exceeds the available replicas; run nodetool tpstats to see if write thread pools are saturated.
Compaction never completes: Check disk I/O with iostat; verify the compaction strategy matches the workload (STCS for write-heavy, LCS for read-heavy, TWCS for time-series).
ALLOW FILTERING query causes a cluster-wide scan: Identify queries using ALLOW FILTERING in application logs or with the slow query log enabled. Redesign the table so the filter column is part of the primary key, or create an SAI index in Cassandra 5.0+.

Apache Cassandra is a peer-to-peer distributed NoSQL database designed to handle massive amounts of data across many commodity servers. Originally developed at Facebook to power Inbox Search, it provides high availability with no single point of failure and linear scalability.

In this guide, we'll break down exactly what makes Cassandra unique in the crowded database landscape, why its masterless design beats traditional leader-follower models for write-heavy workloads, how to design partition keys that don't become time bombs, and how to connect a Java application using the DataStax driver. By the end, you'll understand not just what Cassandra is, but how to avoid the mistakes that cause production outages.

What Is Apache Cassandra and Why Does It Exist?

Apache Cassandra was designed to solve the availability vs. consistency trade-off in distributed systems, specifically favoring high availability and partition tolerance — the AP side of the CAP theorem. Traditional relational databases struggle to scale horizontally because they rely on a central master node that becomes both the bottleneck and the single point of failure. Cassandra eliminates this bottleneck entirely by treating every node in the cluster as an equal peer.

This architecture allows linear scaling: adding a new node to the cluster provides a predictable, proportional increase in performance and storage capacity. There is no leader election, no failover window, and no promotion ceremony when a node dies. Its neighbors simply serve its data until it comes back.

Before writing any application code, you need a keyspace and a table. A keyspace is Cassandra's equivalent of a database schema — it defines the replication strategy and factor. The example below uses NetworkTopologyStrategy, which is the only strategy you should use in production. SimpleStrategy is shown for contrast because you will encounter it in tutorials, and understanding why it is wrong is as important as knowing the right answer.

Note the PRIMARY KEY (service_name, log_time) design. service_name is the partition key — it controls which node stores the data. log_time is the clustering column — it controls the sort order within the partition. This is the fundamental unit of Cassandra data modeling, and every decision downstream flows from getting it right.

keyspace_setup.cql · SQL
-- ── SimpleStrategy: development only, single datacenter ─────────────────────
-- Never use in production — it ignores rack and datacenter topology.
-- Replicas are placed on the next N nodes clockwise in the ring,
-- which means two replicas can land on the same physical rack.
CREATE KEYSPACE IF NOT EXISTS forge_dev
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- ── NetworkTopologyStrategy: production standard ─────────────────────────────
-- Rack-aware and datacenter-aware placement.
-- With RF=3 across 3 racks, you can lose an entire rack and still serve queries.
CREATE KEYSPACE IF NOT EXISTS forge_system
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3
  };

USE forge_system;

-- ── Table design: audit_logs ─────────────────────────────────────────────────
-- Partition key: service_name  → determines which node(s) store this row
-- Clustering column: log_time  → determines sort order within the partition
-- CLUSTERING ORDER BY DESC: most recent logs are read first with no sort overhead
CREATE TABLE IF NOT EXISTS audit_logs (
    service_name text,
    log_time     timestamp,
    request_id   uuid,
    payload      text,
    PRIMARY KEY (service_name, log_time)
) WITH CLUSTERING ORDER BY (log_time DESC)
  AND default_time_to_live = 2592000;  -- 30-day TTL: logs auto-expire

-- ── Writing data: any node accepts the write ────────────────────────────────
INSERT INTO audit_logs (service_name, log_time, request_id, payload)
VALUES ('payment-service', toTimestamp(now()), uuid(), '{"status":"ok"}');

INSERT INTO audit_logs (service_name, log_time, request_id, payload)
VALUES ('payment-service', toTimestamp(now()), uuid(), '{"status":"retry"}');

-- ── Reading data: the query matches the partition key exactly ────────────────
-- This is a single-partition read — one node handles it, response in <5ms
SELECT service_name, log_time, payload
FROM audit_logs
WHERE service_name = 'payment-service'
  AND log_time > '2026-01-01 00:00:00+0000'
LIMIT 100;
▶ Output
Warnings: None
Keyspace 'forge_dev' created.
Keyspace 'forge_system' created.
Table 'audit_logs' created.

-- INSERT: applied
-- INSERT: applied

service_name | log_time | payload
-----------------+---------------------------------+---------------------
payment-service | 2026-04-18 14:23:01.000000+0000 | {"status":"retry"}
payment-service | 2026-04-18 14:22:58.000000+0000 | {"status":"ok"}

(2 rows)
Mental Model
The Ring Architecture Mental Model
Think of Cassandra as a circular conveyor belt where every station is equal — any station can accept packages, and packages are copied to nearby stations for safety.
  • Every node owns a range of the token ring — data is distributed by hashing the partition key to a token
  • Any node can accept reads and writes — there is no master, no leader election, no failover delay
  • Replication factor determines how many nodes hold a copy of each partition
  • If a node goes down, its neighbors serve its data immediately — no recovery window
  • Adding a node splits an existing token range and streams data to the new node automatically — linear scaling
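The ring mechanics above can be sketched in a few lines of Python. Everything here is illustrative: the node names are invented, md5 stands in for Cassandra's Murmur3Partitioner, and each node gets a single token rather than the vnodes a real cluster uses.

```python
import hashlib
from bisect import bisect_right

def token(key: str) -> int:
    # Stand-in hash: map a partition key into a 64-bit token space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**64)

# Hypothetical 4-node cluster: (token, node) pairs sorted around the ring.
nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = sorted((token(n), n) for n in nodes)

def replicas(partition_key: str, rf: int = 3) -> list:
    """First rf nodes found walking clockwise from the key's token."""
    tokens = [tok for tok, _ in ring]
    idx = bisect_right(tokens, token(partition_key)) % len(ring)
    return [ring[(idx + i) % len(ring)][1] for i in range(rf)]

print(replicas("payment-service"))  # three distinct nodes hold this partition
```

Any node can coordinate the write, but the hash alone decides which replicas store it — no leader involved.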
📊 Production Insight
Cassandra's AP design means reads may return slightly stale data during replication lag.
Tunable consistency lets you choose per-query: ONE for speed (fastest, weakest), LOCAL_QUORUM for balance (default choice), ALL for safety (strongest, slowest).
QUORUM math: for RF=3, QUORUM = floor(3/2)+1 = 2 nodes must respond. You can lose one node and still satisfy QUORUM.
Rule: use LOCAL_QUORUM for production writes and reads — it balances consistency with cross-datacenter latency without requiring all replicas to be available.
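The quorum arithmetic is worth internalizing; a two-line helper makes it concrete (function names are mine, not from any driver API):

```python
def quorum(rf: int) -> int:
    """Replicas that must acknowledge: floor(RF/2) + 1."""
    return rf // 2 + 1

def survivable_failures(rf: int) -> int:
    """Replica losses you can absorb while still meeting QUORUM."""
    return rf - quorum(rf)

print(quorum(3), survivable_failures(3))  # 2 1 -> lose one replica, still serve
print(quorum(5), survivable_failures(5))  # 3 2
```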
🎯 Key Takeaway
Cassandra exists because master-slave databases cannot scale writes horizontally without a bottleneck.
Every node is equal — any node accepts writes, and replication handles redundancy automatically.
QUORUM = floor(RF/2)+1: with RF=3, you need 2 nodes — you can lose one and still serve queries.
Never use SimpleStrategy in production — it has no rack or datacenter awareness.
Choosing Replication Strategy
If: Single-datacenter development or testing
Use: SimpleStrategy with RF=1 — do NOT use in production
If: A single production datacenter
Use: NetworkTopologyStrategy with RF=3 spread across different racks
If: A multi-datacenter deployment
Use: NetworkTopologyStrategy with RF=3 per datacenter for cross-DC resilience
If: Compliance requires geographic data isolation
Use: NetworkTopologyStrategy with LOCAL_QUORUM consistency so reads never cross datacenter boundaries

Partition Keys, Clustering Columns, and Primary Key Design

If I had to point to one thing that separates engineers who run Cassandra well from engineers who generate production incidents, it's partition key design. Every other Cassandra concept — consistency levels, compaction, gossip — is secondary to getting this right.

The primary key in Cassandra has two distinct parts that most tutorials blur together:

The partition key determines which node stores the data. Cassandra hashes the partition key value to a token, and that token falls in a node's token range. All rows with the same partition key are stored together on disk — this is what makes single-partition reads so fast. But it's also why a bad partition key creates a hot spot: if thousands of writes share the same partition key value, all of them land on the same two or three nodes.

Clustering columns determine the sort order of rows within a partition. They are stored sorted on disk in the SSTable. This means range queries on clustering columns — WHERE log_time > x AND log_time < y — are essentially free: Cassandra reads a contiguous block of sorted data from a single SSTable. There is no sort step, no heap sort, no temporary buffer.

The combination of partition key and clustering columns forms the primary key. Every row within a partition must have a unique combination of clustering column values. A missing clustering column in a write will upsert the existing row — not create a new one.

Partition size is the operational concern that flows from this design. Because all rows with the same partition key live on the same nodes, an unbounded partition grows indefinitely and eventually causes the problems described in the production incident at the top of this guide. The rule I operate by: keep partitions under 100MB and under 100,000 rows. If either limit is at risk, add a time bucket to the partition key.

partition_key_design.cql · SQL
-- ── Example 1: Simple partition key — one column ────────────────────────────
-- All events for a user are in one partition.
-- Problem: a user with 10 million events creates a 4GB partition.
CREATE TABLE user_events_v1 (
    user_id   uuid,
    event_time timestamp,
    event_type text,
    metadata  text,
    PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- ── Example 2: Composite partition key — time bucketing ──────────────────────
-- Now the partition key is (user_id, bucket).
-- Each day gets its own partition — bounded growth, predictable size.
-- 'bucket' is a date string like '2026-04-18', computed in the application.
-- Trade-off: queries spanning multiple days require multiple partition reads.
CREATE TABLE user_events_v2 (
    user_id    uuid,
    bucket     text,      -- application sets this: '2026-04-18'
    event_time timestamp,
    event_type text,
    metadata   text,
    PRIMARY KEY ((user_id, bucket), event_time)
--                ^-----------^  composite partition key: two columns together
) WITH CLUSTERING ORDER BY (event_time DESC);

-- ── Writing to the bucketed table ─────────────────────────────────────────────
-- Application code computes the bucket from the event timestamp
-- then includes it in every write and read
INSERT INTO user_events_v2 (user_id, bucket, event_time, event_type, metadata)
VALUES (
    a3b4c5d6-e7f8-9012-abcd-ef1234567890,
    '2026-04-18',
    toTimestamp(now()),
    'page_view',
    '{"page":"/dashboard"}'
);

-- ── Reading from a specific bucket ───────────────────────────────────────────
-- This is a single-partition read: fast, predictable, no cluster scan
SELECT event_time, event_type, metadata
FROM user_events_v2
WHERE user_id = a3b4c5d6-e7f8-9012-abcd-ef1234567890
  AND bucket   = '2026-04-18'
  AND event_time > '2026-04-18 08:00:00+0000'
LIMIT 50;

-- ── What ALLOW FILTERING looks like — and why it is dangerous ─────────────────
-- This forces Cassandra to scan EVERY partition on EVERY node
-- to find rows where event_type = 'purchase'.
-- With 50 nodes and 500GB of data, this query touches all of it.
-- Never run this in production without SAI or a dedicated table.
SELECT * FROM user_events_v2
WHERE event_type = 'purchase'
ALLOW FILTERING;  -- full cluster scan — do not do this
▶ Output
Table 'user_events_v1' created.
Table 'user_events_v2' created.

-- INSERT: applied

event_time | event_type | metadata
-----------------------------------+------------+---------------------
2026-04-18 14:23:01.000000+0000 | page_view | {"page":"/dashboard"}

(1 rows)

-- ALLOW FILTERING query:
Warnings: Aggregation query used without partition key (potential performance issue)
-- This scanned 50 nodes. Do not use in production.
⚠ The 100MB / 100,000 Row Partition Rule
Keep individual partitions under 100MB and under 100,000 rows. These are not hard limits but operational thresholds I've used across multiple production clusters. Beyond these numbers, compaction takes longer, reads require more SSTable lookups, and a single write spike can overwhelm the nodes responsible for that token range. Check partition sizes regularly with nodetool tablehistograms. If you're approaching the limit, add a time bucket to the partition key before you hit it — not after.
📊 Production Insight
The most dangerous partition key mistake is the one that looks fine on day one.
A partition key based on user_id is fine when you have 1,000 users and 100 events each.
At 50,000 events per power user, you have a 500MB partition waiting to OOM a node.
Design for the maximum, not the average — you won't have time to redesign at 3am during a traffic spike.
🎯 Key Takeaway
The partition key determines data placement — choose high-cardinality values and add time buckets for time-series data.
Clustering columns determine sort order within a partition — sorted on disk, so range queries are essentially free.
ALLOW FILTERING is a full cluster scan — it is not a query strategy, it is an emergency diagnostic tool.
Keep partitions under 100MB and 100,000 rows — if you cannot, add a time bucket.
Partition Key Design Decisions
If: Data volume per partition key value is bounded and small (< 10,000 rows)
Use: A simple partition key is fine — one column, high cardinality (UUID, user_id)
If: Data volume per partition key grows unboundedly over time (time-series, events)
Use: Add a time bucket to the partition key: (entity_id, bucket) where bucket is day/hour/week
If: You need to query by a column that is not the partition key
Use: Create a separate table with that column as the partition key — duplicate the data
If: A query needs to filter on a low-selectivity column across many partitions
Use: In Cassandra 5.0+, a Storage Attached Index (SAI) — do not use ALLOW FILTERING

The LSM-Tree Write Path — Why Cassandra Writes So Fast

Understanding Cassandra's write performance requires understanding one design decision: all writes are sequential. There is no read-before-write, no in-place update, no locking. The LSM-Tree (Log-Structured Merge-Tree) storage engine is the reason Cassandra can absorb write rates that would saturate a PostgreSQL cluster.

Here's what actually happens when your application executes an INSERT or UPDATE:

  1. Commit log write — The mutation is appended to a sequential commit log on disk. Sequential I/O. If the node crashes after this step, the data can be recovered from the commit log on restart.
  2. Memtable write — The mutation is written to an in-memory structure called the memtable. All writes to the same partition are merged here — updates, deletes, and inserts coexist. The memtable is the 'current view' of recent data.
  3. Memtable flush to SSTable — When the memtable reaches a configured size, it is flushed to disk as an immutable SSTable (Sorted String Table). Immutable means once written, it is never modified. Updates and deletes don't rewrite existing SSTables — they write new entries with a higher timestamp.
  4. Compaction — Background compaction merges multiple SSTables into fewer, larger ones. During compaction, entries for the same partition key are merged by timestamp — newer wins. Deleted entries (tombstones) are purged after gc_grace_seconds.

The trade-off this creates: reads are more complex than writes. A read for a partition may need to check the memtable and multiple SSTables to find all the data. Bloom filters (one per SSTable) give Cassandra a probabilistic fast path — 'definitely not in this SSTable' — before opening the file. Partition indexes give the exact byte offset. By 2026, Cassandra 5.0's trie-based memtable and storage improvements have significantly reduced the read amplification that older LSM implementations suffered.
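The four steps can be caricatured in a toy Python model — a sketch of the LSM idea only, not of Cassandra's actual storage engine (no bloom filters, no gc_grace_seconds wait, in-memory lists standing in for disk):

```python
class ToyLSM:
    """Toy LSM tree: commit log + memtable + immutable 'SSTables'."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []              # step 1: sequential append
        self.memtable = {}                # step 2: in-memory, last-write-wins
        self.sstables = []                # step 3: immutable sorted runs
        self.memtable_limit = memtable_limit
        self._clock = 0                   # stand-in for write timestamps

    def write(self, key, value):          # value=None models a tombstone
        self._clock += 1
        self.commit_log.append((key, value, self._clock))
        self.memtable[key] = (value, self._clock)   # no read-before-write
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):                      # memtable -> immutable sorted run
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def compact(self):                    # step 4: merge, newest timestamp wins
        merged = {}
        for table in self.sstables:
            for key, (value, ts) in table:
                if key not in merged or ts > merged[key][1]:
                    merged[key] = (value, ts)
        # Purge tombstones (real Cassandra waits gc_grace_seconds first).
        live = {k: v for k, v in merged.items() if v[0] is not None}
        self.sstables = [sorted(live.items())] if live else []

    def read(self, key):                  # check memtable, then every SSTable
        if key in self.memtable:
            return self.memtable[key][0]
        best = None
        for table in self.sstables:
            for k, (value, ts) in table:
                if k == key and (best is None or ts > best[1]):
                    best = (value, ts)
        return best[0] if best else None

db = ToyLSM()
db.write("payment-service", "ok")
db.write("payment-service", "retry")   # upsert: new cell, higher timestamp
db.write("order-service", None)        # delete == tombstone write
print(db.read("payment-service"))      # retry
```

Notice that read() must consult the memtable plus every SSTable — that is the read amplification the bloom filters and partition indexes exist to mitigate, and compact() is what keeps it bounded.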

write_path_demonstration.cql · SQL
1234567891011121314151617181920212223242526272829303132333435363738
-- ── Demonstrating Cassandra's write semantics ────────────────────────────────
-- INSERT and UPDATE are identical in Cassandra — both are 'upserts'.
-- There is no read-before-write. The write goes to memtable immediately.
INSERT INTO forge_system.audit_logs
    (service_name, log_time, request_id, payload)
VALUES
    ('order-service', '2026-04-18 10:00:00+0000', uuid(), '{"amount":99.99}');

-- UPDATE writes a new cell with a higher timestamp — the old cell still exists
-- in the SSTable until compaction. Cassandra resolves conflicts by timestamp.
UPDATE forge_system.audit_logs
SET payload = '{"amount":99.99,"status":"confirmed"}'
WHERE service_name = 'order-service'
  AND log_time = '2026-04-18 10:00:00+0000';

-- ── Deletes write tombstones, not erasures ───────────────────────────────────
-- This DELETE writes a tombstone marker — it does NOT remove the data immediately.
-- The data will be physically removed after gc_grace_seconds (default: 10 days)
-- AND after a compaction runs AND after repair has propagated the tombstone.
DELETE FROM forge_system.audit_logs
WHERE service_name = 'order-service'
  AND log_time = '2026-04-18 10:00:00+0000';

-- ── Checking SSTable and memtable state via nodetool ─────────────────────────
-- Run from the shell, not from cqlsh:
-- nodetool flush forge_system         → forces memtable to flush to SSTable
-- nodetool compact forge_system       → forces compaction on the keyspace
-- nodetool tablehistograms forge_system audit_logs → read/write latency breakdown

-- ── TTL as an alternative to application-level deletes ───────────────────────
-- When data has a natural expiry, use TTL instead of explicit DELETE.
-- Cassandra handles the tombstone lifecycle automatically.
-- Here, this log entry expires in 86400 seconds (24 hours).
INSERT INTO forge_system.audit_logs
    (service_name, log_time, request_id, payload)
VALUES
    ('cache-service', toTimestamp(now()), uuid(), '{"cache":"hit"}')
USING TTL 86400;
▶ Output
-- INSERT: applied
-- UPDATE: applied
-- DELETE: applied (tombstone written)

-- nodetool tablehistograms forge_system audit_logs
forge_system/audit_logs histograms
Offset  SSTables  Write Latency  Read Latency
1       0         74us           0us
2       1         112us          145us
3       1         118us          148us

-- SSTable count: 1 (after flush)
-- Tombstone live cells ratio: 0.12 (acceptable — below 0.20 threshold)

-- INSERT with TTL: applied
Mental Model
Why Deletes Are the Most Dangerous Write in Cassandra
In a traditional database, DELETE removes the row. In Cassandra, DELETE adds a row — a tombstone marker. This distinction has caused more production incidents than almost any other Cassandra misunderstanding.
  • A tombstone is a write — it goes through the same memtable → SSTable → compaction path as regular data
  • Tombstones accumulate until gc_grace_seconds passes (default 10 days) AND compaction runs
  • During reads, Cassandra must scan through tombstones to find live data — high tombstone counts degrade reads
  • If a node was down when a DELETE ran, it must receive the tombstone via repair before gc_grace_seconds expires
  • Common tombstone traps: nullable columns, frequent row-level deletes, and TTL-on-collections
📊 Production Insight
The most operationally expensive thing you can do in Cassandra is delete frequently from a hot partition.
Each delete is a write that must be read around during future reads until compaction clears it.
If you need frequent deletes, use TTL on insertion instead — it amortizes the cost and keeps tombstone density predictable.
Monitor tombstone ratios with nodetool tablehistograms. When the live cells to tombstone ratio exceeds 0.20, investigate before latency spikes.
🎯 Key Takeaway
LSM-Tree = sequential writes to commit log and memtable, periodic flush to immutable SSTables, background compaction.
Cassandra writes are faster than B-Tree RDBMS writes because there is no read-before-write and no in-place update.
Deletes are writes — they create tombstones that degrade read performance until compaction clears them.
Use TTL for data with natural expiry — it keeps tombstone accumulation predictable and bounded.
Choosing a Compaction Strategy
If: Write-heavy workload, data is rarely updated or deleted
Use: STCS (Size-Tiered Compaction Strategy) — the default, optimizes write throughput
If: Read-heavy workload, data is frequently queried by partition key
Use: LCS (Leveled Compaction Strategy) — fewer SSTables per partition, better read performance, more I/O during compaction
If: Time-series data with TTL-based expiry
Use: TWCS (Time-Window Compaction Strategy) — groups SSTables by time window, drops entire windows when TTLs expire, minimal write amplification
If: Unclear workload or mixed read/write
Use: Start with STCS, monitor with nodetool tablehistograms, and switch to LCS if read latency is the dominant complaint

Common Mistakes and How to Avoid Them

When moving from SQL to Cassandra, the mistakes fall into two categories: design mistakes you make before writing a line of code, and operational mistakes you make after the cluster is running. The design mistakes are more expensive — they require data migration to fix. The operational ones are more urgent — they cause incidents.

The single most consequential design mistake is treating Cassandra like a relational database. Normalizing data, expecting joins, using secondary indexes as a query strategy, filtering on non-primary-key columns with ALLOW FILTERING — all of these patterns assume a query engine that Cassandra deliberately does not have. Cassandra's query engine is intentionally limited because that limitation is what makes it fast at scale.

For the operational side, the development cluster below is the starting point for understanding how the cluster behaves before you have 12 production nodes to experiment on. The compose file uses Cassandra 5.0 — pinned, not latest. Using latest in any tutorial or development setup is a trap: you will develop against a different version than production, and Cassandra has meaningful behavioral differences between major versions.

docker-compose.yml · YAML
# ── Three-node Cassandra development cluster ─────────────────────────────────
# Pinned to Cassandra 5.0 — do not use 'latest' in development or production.
# Cassandra 5.0 ships with significant storage engine improvements:
#   - Trie-based memtable (better memory efficiency)
#   - Storage Attached Indexes (SAI) replacing legacy SASI
#   - Unified compaction strategy option

services:
  cassandra-node1:
    image: cassandra:5.0
    container_name: forge-cassandra-1
    hostname: cassandra-node1
    ports:
      - "9042:9042"
    environment:
      - CASSANDRA_CLUSTER_NAME=TheCodeForgeCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack1
      - CASSANDRA_SEEDS=cassandra-node1,cassandra-node2
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=256M
    volumes:
      - cassandra-data1:/var/lib/cassandra
    healthcheck:
      test: ["CMD-SHELL", "nodetool status | grep -E '^UN'"]
      interval: 30s
      timeout: 10s
      retries: 10
      start_period: 60s

  cassandra-node2:
    image: cassandra:5.0
    container_name: forge-cassandra-2
    hostname: cassandra-node2
    environment:
      - CASSANDRA_CLUSTER_NAME=TheCodeForgeCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack2
      - CASSANDRA_SEEDS=cassandra-node1,cassandra-node2
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=256M
    volumes:
      - cassandra-data2:/var/lib/cassandra
    depends_on:
      cassandra-node1:
        condition: service_healthy

  cassandra-node3:
    image: cassandra:5.0
    container_name: forge-cassandra-3
    hostname: cassandra-node3
    environment:
      - CASSANDRA_CLUSTER_NAME=TheCodeForgeCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack3
      - CASSANDRA_SEEDS=cassandra-node1,cassandra-node2
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=256M
    volumes:
      - cassandra-data3:/var/lib/cassandra
    depends_on:
      cassandra-node2:
        condition: service_healthy

volumes:
  cassandra-data1:
  cassandra-data2:
  cassandra-data3:
▶ Output
# After all three nodes start (allow ~3 minutes for gossip convergence)
# Run: docker exec forge-cassandra-1 nodetool status

# Datacenter: datacenter1
# ========================
# Status=Up/Down
# |/ State=Normal/Leaving/Joining/Moving
# -- Address Load Tokens Owns Host ID Rack
# UN 172.20.0.2 195.21 KiB 16 33.3% a1b2c3d4-e5f6-7890-abcd-ef1234567890 rack1
# UN 172.20.0.3 207.43 KiB 16 33.3% b2c3d4e5-f6a7-8901-bcde-f12345678901 rack2
# UN 172.20.0.4 189.67 KiB 16 33.4% c3d4e5f6-a7b8-9012-cdef-123456789012 rack3
#
# UN = Up, Normal — all three nodes are healthy and distributing load evenly
# Each node owns ~33% of the token ring
⚠ Watch Out:
The most common setup mistake is using the SimpleStrategy replication class in production. While fine for single-node development, production environments require NetworkTopologyStrategy to ensure rack-aware and datacenter-aware data placement. The second most common mistake is using cassandra:latest as the image tag — it breaks reproducibility and means developers and production may be running different major versions with different behaviors.
📊 Production Insight
Denormalization in Cassandra means duplicating data across multiple tables.
Each table is designed for exactly one query pattern — this is by design, not a flaw.
Rule: if you need the same data for two queries, create two tables — storage is cheap, latency is not.
The storage cost of denormalization at scale is almost always lower than the operational cost of slow queries or a poorly tuned secondary index.
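The two-tables-per-query rule looks like this in miniature — dicts simulate the tables, and the table names users_by_id and users_by_email are hypothetical ones chosen for this sketch (a real application would issue two INSERTs, optionally in a logged batch):

```python
# Simulated query tables -- in Cassandra these would be two real tables,
# each partitioned by the column its query filters on.
users_by_id = {}      # stands in for: PRIMARY KEY (user_id)
users_by_email = {}   # stands in for: PRIMARY KEY (email)

def create_user(user_id: str, email: str, name: str) -> None:
    """Write the full row to every query table at insert time."""
    row = {"user_id": user_id, "email": email, "name": name}
    users_by_id[user_id] = row      # serves: WHERE user_id = ?
    users_by_email[email] = row     # serves: WHERE email = ?

create_user("u-42", "ada@example.com", "Ada")
print(users_by_id["u-42"]["name"])                    # Ada
print(users_by_email["ada@example.com"]["user_id"])   # u-42
```

Both lookups are now single-partition reads; the duplication is the price, and it is paid once at write time.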
🎯 Key Takeaway
Query-First Design: write your CQL queries before designing your tables — the query defines the table structure.
Joins do not exist — if you need data from two entities, denormalize into one table or create two query-specific tables.
Never use SimpleStrategy in production and never pin to cassandra:latest — both create incidents you cannot explain.
Data Modeling Decision Matrix
  • If: Need to query by user_id and by email
    Use: Create two tables: one partitioned by user_id, one by email — do NOT rely on secondary indexes for this
  • If: Time-series data with high write volume
    Use: A time bucket in the partition key (e.g., day or hour) to prevent unbounded partition growth
  • If: Need to aggregate or group data across partitions
    Use: Pre-computed aggregations at write time, stored in a dedicated summary table
  • If: Query requires filtering on a non-primary-key column in Cassandra 5.0+
    Use: A Storage Attached Index (SAI) — it replaces SASI and avoids the full-cluster scan of ALLOW FILTERING
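The time-bucket row above can be made concrete with a small Java helper (the name and key layout are illustrative — the bucket string would join the partition key, e.g. PRIMARY KEY ((sensor_id, day), event_time)):

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

class TimeBucket {
    // Illustrative helper: derive a UTC day bucket (e.g. "2026-04-18") to include
    // in the partition key. Bucketing by day caps each partition at one day's
    // worth of rows, preventing unbounded partition growth.
    static String dayBucket(Instant eventTime) {
        return LocalDate.ofInstant(eventTime, ZoneOffset.UTC).toString();
    }

    public static void main(String[] args) {
        System.out.println(dayBucket(Instant.parse("2026-04-18T13:45:00Z")));
        // prints 2026-04-18 — all events from the same UTC day share one partition
    }
}
```

Choosing hour vs. day granularity is a sizing decision: pick the bucket so a full bucket stays under the ~100MB / 100,000-row partition guideline.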

Storage Attached Indexes (SAI) — Secondary Indexing in Cassandra 5.0

Before Cassandra 5.0, secondary indexing was the most common source of bad advice in Cassandra guides. The options were:

  • SASI (Storage Attached Secondary Index) — better than the original secondary indexes, still had significant operational overhead and limited query capabilities.
  • Original secondary indexes — effectively hidden tables, painful performance characteristics, deprecated by the Cassandra community for most production use.
  • ALLOW FILTERING — not an index at all, a full cluster scan dressed up as a query option.

Cassandra 5.0 introduced Storage Attached Indexes (SAI), which supersede SASI. SAI is built into the SSTable storage format, requires no separate index tables, and supports range queries, equality, and — for vector embeddings — approximate nearest neighbor (ANN) search. Vector search support makes Cassandra 5.0 relevant for AI-adjacent workloads that weren't possible before.

SAI doesn't make secondary indexing free. It's still more expensive than a partition key query because it must look across all SSTables on the node. But the cost is manageable for selective queries — queries that filter down to a small result set. The rule I apply: if your SAI query returns more than 5-10% of the rows in the partition, the query is not selective enough for an index to help, and you should create a dedicated table instead.

SASI is deprecated in Cassandra 5.0. Any guide that recommends SASI for new deployments is out of date.

sai_indexes.cql · SQL
-- ── Storage Attached Indexes: Cassandra 5.0+ ─────────────────────────────────
-- SAI allows querying on non-primary-key columns without ALLOW FILTERING.
-- SASI is deprecated in Cassandra 5.0 — use SAI for all new indexes.

USE forge_system;

-- Table: product catalog — partition key is category
CREATE TABLE IF NOT EXISTS products (
    category    text,
    product_id  uuid,
    name        text,
    price       decimal,
    in_stock    boolean,
    rating      float,
    PRIMARY KEY (category, product_id)
);

-- ── Create SAI indexes on non-primary-key columns ────────────────────────────
-- Syntax: CREATE INDEX index_name ON table(column) USING 'sai';
CREATE INDEX products_price_idx  ON products(price)   USING 'sai';
CREATE INDEX products_stock_idx  ON products(in_stock) USING 'sai';
CREATE INDEX products_rating_idx ON products(rating)  USING 'sai';

-- ── SAI-enabled queries: no ALLOW FILTERING required ─────────────────────────
-- Range query on price — only possible with SAI or as a partition key
SELECT product_id, name, price
FROM products
WHERE category = 'electronics'
  AND price > 100.00
  AND price < 500.00;

-- Combined filter: in stock AND rating above threshold
-- Both conditions use SAI — Cassandra intersects the index results
SELECT product_id, name, price, rating
FROM products
WHERE category = 'electronics'
  AND in_stock = true
  AND rating > 4.5;

-- ── Vector search with SAI (AI workloads) ────────────────────────────────────
-- Cassandra 5.0 supports ANN (Approximate Nearest Neighbor) search via SAI
-- Useful for semantic search, recommendation systems, embedding lookup
CREATE TABLE IF NOT EXISTS product_embeddings (
    category   text,
    product_id uuid,
    embedding  vector<float, 1536>,  -- 1536-dimensional OpenAI embedding
    PRIMARY KEY (category, product_id)
);

-- SAI index on the vector column for ANN search
CREATE INDEX embeddings_ann_idx
    ON product_embeddings(embedding)
    USING 'sai'
    WITH OPTIONS = {'similarity_function': 'cosine'};

-- ANN query: find the 10 most similar products to a given embedding vector
-- This runs on-node without a cluster scan — orders of magnitude faster than
-- brute-force similarity across all rows
SELECT product_id
FROM product_embeddings
WHERE category = 'electronics'
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, ...] -- 1536 floats from your model
LIMIT 10;
▶ Output
Index 'products_price_idx' created.
Index 'products_stock_idx' created.
Index 'products_rating_idx' created.

-- Price range query:
product_id | name | price
--------------------------------------+-------------------+-------
d1e2f3a4-b5c6-7890-def0-a12345678901 | Forge Headphones | 149.99
e2f3a4b5-c6d7-8901-ef01-b23456789012 | Cloud Core Pro | 299.00

(2 rows)

-- Combined filter query:
product_id | name | price | rating
--------------------------------------+-------------------+--------+-------
d1e2f3a4-b5c6-7890-def0-a12345678901 | Forge Headphones | 149.99 | 4.7

(1 rows)

-- Vector index created.
-- ANN query: (10 results returned based on cosine similarity)
Mental Model
SAI vs Dedicated Table: When to Use Which
SAI is a selective filter on a partition. A dedicated table is the right choice when SAI's overhead exceeds the cost of the extra write.
  • SAI is right when: the indexed column has high cardinality, queries are selective (< 5% of partition rows), and you cannot predict all query patterns at schema design time
  • Dedicated table is right when: you need the absolute fastest reads, the query pattern is known and stable, or selectivity is too low for SAI to help
  • Vector search via SAI is the recommended pattern for embedding lookup in Cassandra 5.0+ — no separate vector database required for moderate scale
  • SASI is deprecated in Cassandra 5.0 — do not create new SASI indexes on any new deployment
📊 Production Insight
SAI in Cassandra 5.0 is a genuine improvement over everything that came before it.
But 'genuine improvement' is not the same as 'free.' A low-selectivity SAI query — one that matches 30% of rows — still performs worse than a well-designed partition key query.
Rule: always include the partition key in SAI queries when possible. SAI narrows results within a partition far more efficiently than across the whole cluster.
🎯 Key Takeaway
SAI replaces SASI in Cassandra 5.0 — SASI is deprecated and should not be used in new deployments.
SAI supports equality, range, and vector (ANN) queries without ALLOW FILTERING.
Include the partition key alongside SAI conditions wherever possible — it scopes the index lookup to one partition rather than the full cluster.
For AI workloads requiring embedding similarity search, Cassandra 5.0 + SAI vector indexes removes the need for a separate vector database at moderate scale.
Secondary Index Strategy in Cassandra 5.0
  • If: Need to query on a non-primary-key column, and the column has high cardinality
    Use: SAI (USING 'sai') — efficient for selective queries, no separate table required
  • If: Need to query on a non-primary-key column for a known, stable pattern
    Use: A dedicated table with the column as partition key — fastest reads, predictable performance
  • If: Need approximate nearest neighbor search for ML embeddings
    Use: An SAI vector index on a vector<float, N> column — built into Cassandra 5.0, no external vector DB needed for moderate scale
  • If: Existing cluster uses SASI indexes
    Use: Plan a migration to SAI — SASI is deprecated in 5.0 and will not receive new features

Connecting a Java Application — DataStax Java Driver

Every other concept in this guide — rings, partition keys, consistency levels, SAI — becomes concrete once you see how an application actually talks to Cassandra. The standard driver by 2026 is the DataStax Java Driver for Apache Cassandra, version 4.x. It superseded the 3.x driver line and older-generation clients such as Netflix's Astyanax.

Three things the driver handles that you do not want to implement yourself:

  1. Load balancing policy — the driver tracks which nodes are in which datacenter, which are up, and which token ranges each node owns. With the default token-aware DefaultLoadBalancingPolicy, queries are routed directly to a replica that owns the data — no unnecessary extra hop through a coordinator node.
  2. Connection pooling — each driver instance maintains a pool of connections per host. You create one CqlSession per application and reuse it for the lifetime of the process. Creating a new session per request is one of the most common Java + Cassandra performance mistakes.
  3. Prepared statements — CQL statements should be prepared once and executed many times. Prepared statements are parsed and validated once by the coordinator; subsequent executions send only the bound values. Unprepared statements parse on every execution — at scale, the coordinator overhead becomes measurable.

The Spring Data Cassandra abstraction sits on top of this driver and is suitable for straightforward CRUD workloads. For anything involving custom query patterns, consistency level overrides, or performance-critical paths, I prefer going directly to the driver — the abstraction layer hides the consistency level control that Cassandra's tunable consistency depends on.

CassandraConfig.java + AuditLogRepository.java · JAVA
// ── pom.xml dependencies ──────────────────────────────────────────────────────
// <dependency>
//   <groupId>com.datastax.oss</groupId>
//   <artifactId>java-driver-core</artifactId>
//   <version>4.17.0</version>
// </dependency>
// <dependency>
//   <groupId>com.datastax.oss</groupId>
//   <artifactId>java-driver-query-builder</artifactId>
//   <version>4.17.0</version>
// </dependency>

// ── CassandraConfig.java — single CqlSession for the application lifetime ────
package io.thecodeforge.config;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.net.InetSocketAddress;
import java.time.Duration;

@Configuration
public class CassandraConfig {

    // ONE CqlSession per application — not one per request.
    // CqlSession is thread-safe and manages its own connection pool internally.
    // Creating a session per request is one of the most common performance mistakes.
    @Bean(destroyMethod = "close")
    public CqlSession cqlSession() {
        return CqlSession.builder()
                .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
                .withLocalDatacenter("datacenter1")
                .withKeyspace("forge_system")
                .withConfigLoader(
                    DriverConfigLoader.programmaticBuilder()
                        // Request timeout: fail fast rather than hold threads
                        .withDuration(DefaultDriverOption.REQUEST_TIMEOUT,
                                      Duration.ofMillis(500))
                        // Consistency level: LOCAL_QUORUM for production
                        // Tolerates one replica being down in the local DC
                        .withString(DefaultDriverOption.REQUEST_CONSISTENCY,
                                    "LOCAL_QUORUM")
                        // Token-aware routing: send queries directly to the
                        // replica that owns the partition — avoids coordinator hop
                        .withString(DefaultDriverOption.LOAD_BALANCING_POLICY_CLASS,
                                    "DefaultLoadBalancingPolicy")
                        .build()
                )
                .build();
    }
}

// ── AuditLog.java — simple value object ──────────────────────────────────────
package io.thecodeforge.model;

import java.time.Instant;
import java.util.UUID;

public record AuditLog(
    String  serviceName,
    Instant logTime,
    UUID    requestId,
    String  payload
) {}

// ── AuditLogRepository.java — DataStax driver with prepared statements ────────
package io.thecodeforge.repository;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.*;
import io.thecodeforge.model.AuditLog;
import org.springframework.stereotype.Repository;
import java.time.Instant;
import java.util.List;
import java.util.UUID;
import java.util.stream.StreamSupport;

@Repository
public class AuditLogRepository {

    private final CqlSession session;

    // Prepared statements: parse once, execute many times.
    // Prepared at construction time — fails fast on startup if CQL is wrong.
    private final PreparedStatement insertStmt;
    private final PreparedStatement selectRecentStmt;

    public AuditLogRepository(CqlSession session) {
        this.session = session;

        this.insertStmt = session.prepare(
            "INSERT INTO audit_logs (service_name, log_time, request_id, payload) "
          + "VALUES (?, ?, ?, ?) USING TTL 2592000"
        );

        this.selectRecentStmt = session.prepare(
            "SELECT service_name, log_time, request_id, payload "
          + "FROM audit_logs "
          + "WHERE service_name = ? AND log_time > ? "
          + "LIMIT 100"
        );
    }

    public void save(AuditLog log) {
        BoundStatement bound = insertStmt.bind(
            log.serviceName(),
            log.logTime(),
            log.requestId(),
            log.payload()
        );
        // execute() is synchronous — use executeAsync() for high-throughput writes
        session.execute(bound);
    }

    public List<AuditLog> findRecent(String serviceName, Instant since) {
        BoundStatement bound = selectRecentStmt.bind(serviceName, since);
        ResultSet rs = session.execute(bound);

        return StreamSupport.stream(rs.spliterator(), false)
                .map(row -> new AuditLog(
                    row.getString("service_name"),
                    row.getInstant("log_time"),
                    row.getUuid("request_id"),
                    row.getString("payload")
                ))
                .toList(); // Java 16+ — returns unmodifiable list
    }

    // ── Async write for high-throughput ingestion paths ───────────────────────
    // Returns CompletableFuture<AsyncResultSet> — caller decides whether to await
    public java.util.concurrent.CompletableFuture<AsyncResultSet> saveAsync(AuditLog log) {
        BoundStatement bound = insertStmt.bind(
            log.serviceName(),
            log.logTime(),
            log.requestId(),
            log.payload()
        );
        return session.executeAsync(bound).toCompletableFuture();
    }
}
▶ Output
-- Application startup log (condensed):
[main] INFO c.d.o.d.i.core.session.DefaultSession - [forge-session] Initializing
[main] INFO c.d.o.d.i.core.loadbalancing.DefaultLoadBalancingPolicy
- [forge-session] Using datacenter 'datacenter1' for local policy
[main] INFO c.d.o.d.i.core.session.DefaultSession - [forge-session] Initialized

-- Repository.save() called:
-- Prepared statement bound and executed in 2.3ms (LOCAL_QUORUM satisfied)

-- Repository.findRecent('payment-service', 2026-04-18T00:00:00Z):
-- Single-partition read, 2 rows returned in 1.8ms
Mental Model
One Session, Many Queries — The Connection Pool Model
Creating a CqlSession is expensive — it opens connections, discovers topology, and warms up the load balancing policy. Do it once. Reuse it everywhere.
  • CqlSession is thread-safe — inject it as a singleton and share it across all repositories
  • The driver manages a connection pool per host automatically — no manual pool management needed
  • PreparedStatement objects are also reusable and thread-safe — prepare at startup, bind per request
  • executeAsync() returns a CompletableFuture — use it for fire-and-forget writes or high-throughput ingestion
  • Token-aware routing in DefaultLoadBalancingPolicy means queries go directly to the replica — no extra coordinator hop
📊 Production Insight
The two driver mistakes that show up in every Cassandra performance post-mortem:
1. Creating a new CqlSession per request — session creation takes 100-300ms. At 100 req/s, this saturates the application before Cassandra is ever stressed.
2. Using unprepared statements in a hot path — the coordinator parses the CQL string on every execution. At scale, this adds measurable latency and CPU load to your coordinator nodes.
Both mistakes look fine in load testing with 10 users. They collapse at production scale.
🎯 Key Takeaway
Create one CqlSession per application — it is thread-safe and manages its own connection pool.
Prepare statements at startup, bind per request — never execute unprepared statements in a hot path.
Use LOCAL_QUORUM as the default consistency level — it tolerates one replica being down while maintaining strong consistency within the local datacenter.
Use executeAsync() for high-throughput write paths — it allows multiple in-flight writes without blocking application threads.
Sync vs Async Driver Execution
  • If: Standard request-response REST API endpoint
    Use: session.execute() — synchronous, straightforward, compatible with the Spring MVC thread model
  • If: High-throughput write ingestion (events, metrics, logs)
    Use: session.executeAsync() with CompletableFuture — allows multiple in-flight writes without blocking threads
  • If: Application uses Spring WebFlux or reactive streams
    Use: The driver's reactive execution (CqlSession.executeReactive()) — it returns a Reactive Streams publisher that adapts directly to Flux
  • If: Bulk data migration or backfill job
    Use: executeAsync() with a semaphore to limit concurrency — avoids overwhelming Cassandra with thousands of simultaneous requests
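The bulk-migration row can be sketched without a live cluster. This is a minimal illustration under stated assumptions: BoundedAsyncWriter is a hypothetical name, and the Supplier stands in for session.executeAsync(...).toCompletableFuture():

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class BoundedAsyncWriter {
    // Cap the number of in-flight async writes so a backfill job cannot
    // flood the cluster. The Supplier simulates a driver call such as
    // session.executeAsync(bound).toCompletableFuture().
    private final Semaphore permits;

    BoundedAsyncWriter(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> writeFn)
            throws InterruptedException {
        permits.acquire();                     // blocks once maxInFlight writes are pending
        try {
            return writeFn.get()
                          .whenComplete((result, error) -> permits.release());
        } catch (RuntimeException e) {
            permits.release();                 // don't leak a permit if the call itself fails
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        BoundedAsyncWriter writer = new BoundedAsyncWriter(32);
        CompletableFuture<String> f =
                writer.submit(() -> CompletableFuture.supplyAsync(() -> "ok"));
        System.out.println(f.join()); // prints ok
    }
}
```

The semaphore turns "thousands of simultaneous requests" into a steady 32 in flight: the submitting thread simply blocks until a prior write completes and releases a permit.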
🗂 Database Architecture Comparison
Traditional RDBMS vs. Cassandra for distributed workloads
| Aspect | Traditional RDBMS (MySQL/PostgreSQL) | Apache Cassandra |
|---|---|---|
| Architecture | Centralized (Master-Slave) | Distributed (Peer-to-Peer) |
| Scalability | Vertical (bigger hardware) | Horizontal (more nodes, linear throughput increase) |
| Data Model | Normalized (fixed schema, joins) | Denormalized (flexible schema, one table per query) |
| Availability | High with failover lag (30s-2min) | Always-on (no failover delay, no leader election) |
| Write Speed | Moderate (B-Tree, random I/O, locking) | Extreme (LSM-Tree, sequential I/O, no read-before-write) |
| Read Pattern | Flexible (joins, aggregations, ad-hoc) | Query-driven (design tables around queries) |
| Secondary Indexes | B-Tree indexes, generally efficient | SAI in Cassandra 5.0+ (selective queries only) |
| Transactions | Full ACID across tables and rows | Lightweight transactions (LWT) for single-partition conditional updates only |
| Operational Cost | Lower for small scale (< 1TB) | Lower for massive scale (10TB+, multi-datacenter) |

🎯 Key Takeaways

  • Apache Cassandra is a masterless, distributed database designed for zero-downtime and massive scale — every node is equal, any node accepts reads and writes.
  • Always model your data around your queries (Query-Driven Design) rather than entities — joins do not exist, and ALLOW FILTERING is a diagnostic tool, not a query strategy.
  • Linear scalability means adding nodes provides a predictable, proportional increase in throughput — no rebalancing ceremony, no leader election.
  • The Gossip protocol and Snitch configuration are the heartbeat of the cluster's awareness — they power failure detection and topology-aware routing.
  • Tunable consistency is Cassandra's operational superpower — LOCAL_QUORUM is the right default for production: QUORUM = floor(RF/2)+1 nodes must respond, tolerating one failure with RF=3.
  • Denormalization is not a flaw — it is the correct pattern when joins do not exist. Duplicate data across tables; storage is cheap, cross-partition reads are not.
  • Tombstones are writes — every DELETE creates a tombstone that degrades reads until compaction clears it. Use TTL on INSERT instead of explicit DELETE for time-bounded data.
  • SAI (Storage Attached Indexes) in Cassandra 5.0 replaces SASI and supports range queries and vector ANN search — SASI is deprecated and should not be used in new deployments.
  • Create one CqlSession per application, prepare statements at startup, and use LOCAL_QUORUM as the default consistency level — these three driver decisions determine whether your application runs well under load.

⚠ Common Mistakes to Avoid

    Over-normalizing data like in SQL databases
    Symptom

    Application performs multiple sequential reads to assemble a single view, resulting in 50-200ms latency per page load instead of the expected 2-5ms.

    Fix

    Redesign tables using Query-First approach: define the exact CQL query first, then create a denormalized table with all required columns in the primary key or as stored columns. If you need the same data for two query patterns, create two tables.

    Using SimpleStrategy in production
    Symptom

    During a rack failure, all replicas of a partition reside in the same rack. Data becomes temporarily or permanently unavailable.

    Fix

    Switch to NetworkTopologyStrategy with RF=3 per datacenter. Verify snitch configuration matches your infrastructure (GossipingPropertyFileSnitch for most deployments).

    Ignoring partition key cardinality — creating hot spots
    Symptom

    One node shows 90% CPU while others idle at 5%. Compaction backlog grows unbounded on the hot node. Eventually the node OOM-kills and the cluster degrades.

    Fix

    Choose a high-cardinality partition key (UUIDs, composite keys with time buckets). Monitor partition sizes with nodetool tablehistograms — keep individual partitions under 100MB and 100,000 rows.

    Using consistency level ALL for every query
    Symptom

    Any single node failure or network partition causes all reads and writes to fail, completely negating Cassandra's availability guarantees.

    Fix

    Use LOCAL_QUORUM as the default consistency level. It tolerates one node failure while still guaranteeing strong consistency within the local datacenter. QUORUM = floor(RF/2) + 1.
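That quorum formula can be checked in a couple of lines of Java — integer division gives the floor for free:

```java
class Quorum {
    // QUORUM = floor(RF / 2) + 1 replicas must acknowledge a read or write.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1; // integer division == floor here
    }

    public static void main(String[] args) {
        System.out.println(quorum(3)); // 2 -> tolerates one replica being down
        System.out.println(quorum(5)); // 3 -> tolerates two replicas being down
    }
}
```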

    Using ALLOW FILTERING in production queries
    Symptom

    A query that takes 2ms during development takes 30 seconds in production when the table has 50 million rows. The coordinator scans every partition on every node.

    Fix

    Redesign the table so the filter column is part of the primary key, or create a Storage Attached Index (SAI) in Cassandra 5.0+. ALLOW FILTERING is a diagnostic tool, not a query strategy.

    Creating a new CqlSession per request in the Java driver
    Symptom

    Application throughput collapses under load. Thread pool exhaustion in the application, not in Cassandra. P99 latency spikes to 300ms+ because session initialization dominates request time.

    Fix

    Create one CqlSession as a singleton bean and inject it into all repositories. Session creation is expensive (100-300ms); request execution should be 1-5ms.

Interview Questions on This Topic

  • Q (Mid-level): What is the role of the Gossip protocol in an Apache Cassandra cluster?
    Gossip is a peer-to-peer communication protocol that every Cassandra node uses to discover and maintain awareness of other nodes in the cluster. Each node gossips with up to 3 other nodes every second, exchanging state information (up/down, load, schema version). This decentralized protocol has no single point of failure and converges on cluster state within a few seconds. It powers failure detection — if a node stops responding to gossip, it is marked as down after a configurable phi-convict-threshold. The Snitch works alongside Gossip to tell each node the datacenter and rack topology, which the load balancing policy uses to make routing decisions.
  • Q (Mid-level): How does Cassandra achieve high write throughput compared to traditional B-Tree based databases?
    Cassandra uses an LSM-Tree (Log-Structured Merge-Tree) storage engine. Writes go to an in-memory memtable and are immediately appended to a sequential commit log on disk. Sequential I/O is orders of magnitude faster than the random I/O required by B-Tree indexes. When the memtable fills, it is flushed to an immutable SSTable on disk. Background compaction merges SSTables. This architecture means writes never read-modify-update existing data structures — they only append. There is no locking, no read-before-write, and no page split. In Cassandra 5.0, the trie-based memtable further reduces memory overhead and improves flush performance.
  • Q (Senior): Explain the LSM-Tree (Log-Structured Merge-Tree) and its importance in Cassandra's storage engine.
    An LSM-Tree is a write-optimized data structure that converts random writes into sequential writes. Data flows through three stages: (1) write to memtable (in-memory sorted structure) and commit log simultaneously, (2) flush memtable to SSTable (sorted, immutable file on disk) when it reaches configured size, (3) background compaction merges multiple SSTables into fewer, larger SSTables, resolving conflicts by timestamp and purging tombstones after gc_grace_seconds. This design trades read amplification (may need to check multiple SSTables) for extreme write throughput. Bloom filters give a fast 'definitely not in this SSTable' path, and partition indexes provide the exact byte offset for each partition. Cassandra 5.0 introduced the trie-based memtable to reduce memory amplification.
  • Q (Junior): What is the difference between SimpleStrategy and NetworkTopologyStrategy for replication?
    SimpleStrategy places replicas on the next N nodes clockwise in the ring, ignoring rack and datacenter topology. It is only suitable for single-datacenter development. NetworkTopologyStrategy is rack and datacenter aware — it places replicas in different racks within each datacenter, protecting against rack-level failures. For production, NetworkTopologyStrategy with RF=3 per datacenter is the standard. You specify the replication factor per datacenter: {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}. SimpleStrategy in production is a correctness issue, not just a best practice violation — a rack failure can take down all replicas of a partition.
  • Q (Senior): What is a tombstone in Cassandra, why does it degrade read performance, and how do you prevent tombstone accumulation?
    A tombstone is a deletion marker written when data is deleted in Cassandra. Because SSTables are immutable, Cassandra cannot remove data in place — it writes a tombstone that is propagated to replicas and physically removed only after gc_grace_seconds (default 10 days) AND after a compaction runs. Tombstones degrade reads because Cassandra must scan through all tombstones in a partition to find live data, and each scanned tombstone counts against the tombstone_failure_threshold (default 100,000). Prevention strategies: use TTL on INSERT instead of explicit DELETE for time-bounded data; avoid nullable columns that generate cell-level tombstones; ensure nodetool repair runs within gc_grace_seconds so tombstones are safe to purge; monitor tombstone ratio with nodetool tablehistograms.
  • Q (Senior): What is Storage Attached Indexing (SAI) in Cassandra 5.0, and how does it differ from SASI?
    SAI (Storage Attached Indexes) is the secondary indexing framework introduced in Cassandra 5.0 that replaces SASI (Storage Attached Secondary Index). SASI required separate index files and had limited query support. SAI is built into the SSTable storage format, supports equality, range queries, and vector approximate nearest neighbor (ANN) search, and has significantly lower write overhead than SASI. SASI is deprecated in Cassandra 5.0. The key operational difference: SAI indexes integrate with the native SSTable format, so there are no separate index tables to maintain, and compaction handles index data along with row data automatically. For vector search workloads, SAI with the cosine or dot-product similarity function enables embedding lookup without a separate vector database.

Frequently Asked Questions

Does Cassandra support ACID transactions?

Cassandra supports lightweight transactions (LWT) using the Paxos protocol for conditional updates within a single partition, but they come with a significant performance cost — approximately 4x the latency of a regular write due to the multi-round-trip Paxos consensus. For most use cases, Cassandra's eventual consistency with LOCAL_QUORUM provides the right balance. Full multi-row ACID transactions across partitions are not supported — design your data model to avoid needing them. If you genuinely need cross-partition transactions, evaluate whether Cassandra is the right database for that specific workload.

How do I choose the right compaction strategy?

STCS (Size-Tiered Compaction Strategy) is the default and works well for write-heavy workloads where data is rarely updated. LCS (Leveled Compaction Strategy) provides better read performance and bounded space amplification — use it for read-heavy workloads where the extra compaction I/O is acceptable. TWCS (Time-Window Compaction Strategy) is ideal for time-series data with TTL, as it groups SSTables by time window and drops entire windows when data expires — avoiding the need to merge expired tombstones with live data. Cassandra 5.0 introduced the Unified Compaction Strategy (UCS) as an experimental option that adapts behavior based on workload patterns.

What is a tombstone in Cassandra and why should I care?

A tombstone is a marker that indicates a row or cell has been deleted. Cassandra does not immediately remove data — it writes a tombstone that is only physically cleaned up after gc_grace_seconds (default 10 days) AND after compaction runs AND after repair has propagated the tombstone to all replicas. Excessive tombstones degrade read performance because Cassandra must scan through them during reads. Hitting the tombstone_failure_threshold (default 100,000 per query) causes a TombstoneOverwhelmingException that aborts the read entirely. Prevention: use TTL on INSERT, run repair regularly within gc_grace_seconds, and monitor tombstone ratios with nodetool tablehistograms.

Can I use secondary indexes in Cassandra?

In Cassandra 5.0, the recommended secondary indexing approach is Storage Attached Indexes (SAI). SAI supports equality and range queries on non-primary-key columns without ALLOW FILTERING, and adds vector ANN search for embedding-based workloads. SASI (Storage Attached Secondary Index) is deprecated in Cassandra 5.0 — do not create new SASI indexes. The original legacy secondary indexes are still present but have severe performance limitations. The general rule: SAI for selective queries (< 5% of partition rows), dedicated table for known stable query patterns with high query volume.

When should I use Cassandra vs PostgreSQL?

Use Cassandra when: you need write throughput that a single PostgreSQL instance cannot absorb, your dataset exceeds 1TB and growing, you need multi-datacenter active-active replication, or your data naturally fits a time-series or event-log model. Use PostgreSQL when: you need ACID transactions across multiple tables, your queries are ad-hoc and unpredictable, your data is relational and normalized, or your scale is small enough that a single well-tuned PostgreSQL instance handles the load. The worst outcome is starting with Cassandra for a workload that needs joins and transactions — the data model migration cost is high.

🔥
Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Next → Cassandra Data Model and Keyspaces
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged