Query-Driven Design: model tables around queries, not entities — joins do not exist
Cassandra 5.0 introduced Storage Attached Indexes (SAI) — the correct answer to secondary indexing from 2024 onward
Biggest mistake: over-normalizing data like in SQL — denormalization is the correct Cassandra pattern
✦ Definition~90s read
What is Introduction to Apache Cassandra?
Apache Cassandra was designed to solve the availability vs. consistency trade-off in distributed systems, specifically favoring high availability and partition tolerance — the AP side of the CAP theorem. Traditional relational databases struggle to scale horizontally because they rely on a central master node that becomes both the bottleneck and the single point of failure.
★
Imagine you are running a massive global library.
Cassandra eliminates this bottleneck entirely by treating every node in the cluster as an equal peer.
This architecture allows linear scaling: adding a new node to the cluster provides a predictable, proportional increase in performance and storage capacity. There is no leader election, no failover window, and no promotion ceremony when a node dies. Its neighbors simply serve its data until it comes back.
Before writing any application code, you need a keyspace and a table. A keyspace is Cassandra's equivalent of a database schema — it defines the replication strategy and factor. The example below uses NetworkTopologyStrategy, which is the only strategy you should use in production. SimpleStrategy is shown for contrast because you will encounter it in tutorials, and understanding why it is wrong is as important as knowing the right answer.
Note the PRIMARY KEY (service_name, log_time) design. service_name is the partition key — it controls which node stores the data. log_time is the clustering column — it controls the sort order within the partition. This is the fundamental unit of Cassandra data modeling, and every decision downstream flows from getting it right.
Plain-English First
Imagine you are running a massive global library. Instead of keeping all books in one giant building (a single server), you distribute them across a ring of interconnected buildings (a cluster). If one building catches fire, the library stays open because every book has copies in other buildings. Apache Cassandra is that global library system; it ensures that no matter how much data you add or how many servers fail, your information is always available and reachable in milliseconds.
Apache Cassandra is a peer-to-peer distributed NoSQL database designed to handle massive amounts of data across many commodity servers. Originally developed at Facebook to power Inbox Search, it provides high availability with no single point of failure and linear scalability.
In this guide, we'll break down exactly what makes Cassandra unique in the crowded database landscape, why its masterless design beats traditional leader-follower models for write-heavy workloads, how to design partition keys that don't become time bombs, and how to connect a Java application using the DataStax driver. By the end, you'll understand not just what Cassandra is, but how to avoid the mistakes that cause production outages.
What Is Apache Cassandra and Why Does It Exist?
Apache Cassandra was designed to solve the availability vs. consistency trade-off in distributed systems, specifically favoring high availability and partition tolerance — the AP side of the CAP theorem. Traditional relational databases struggle to scale horizontally because they rely on a central master node that becomes both the bottleneck and the single point of failure. Cassandra eliminates this bottleneck entirely by treating every node in the cluster as an equal peer.
This architecture allows linear scaling: adding a new node to the cluster provides a predictable, proportional increase in performance and storage capacity. There is no leader election, no failover window, and no promotion ceremony when a node dies. Its neighbors simply serve its data until it comes back.
Before writing any application code, you need a keyspace and a table. A keyspace is Cassandra's equivalent of a database schema — it defines the replication strategy and factor. The example below uses NetworkTopologyStrategy, which is the only strategy you should use in production. SimpleStrategy is shown for contrast because you will encounter it in tutorials, and understanding why it is wrong is as important as knowing the right answer.
Note the PRIMARY KEY (service_name, log_time) design. service_name is the partition key — it controls which node stores the data. log_time is the clustering column — it controls the sort order within the partition. This is the fundamental unit of Cassandra data modeling, and every decision downstream flows from getting it right.
keyspace_setup.cqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
-- ── SimpleStrategy: development only, single datacenter ─────────────────────-- Never use in production — it ignores rack and datacenter topology.-- Replicas are placed on the next N nodes clockwise in the ring,-- which means two replicas can land on the same physical rack.CREATEKEYSPACEIFNOTEXISTS forge_dev
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
-- ── NetworkTopologyStrategy: production standard ─────────────────────────────-- Rack-aware and datacenter-aware placement.-- With RF=3 across 3 racks, you can lose an entire rack and still serve queries.CREATEKEYSPACEIFNOTEXISTS forge_system
WITH replication = {
'class': 'NetworkTopologyStrategy',
'datacenter1': 3
};
USE forge_system;
-- ── Table design: audit_logs ─────────────────────────────────────────────────-- Partition key: service_name → determines which node(s) store this row-- Clustering column: log_time → determines sort order within the partition-- CLUSTERING ORDER BY DESC: most recent logs are read first with no sort overheadCREATETABLEIFNOTEXISTSaudit_logs (
service_name text,
log_time timestamp,
request_id uuid,
payload text,
PRIMARYKEY (service_name, log_time)
) WITHCLUSTERINGORDERBY (log_time DESC)
AND default_time_to_live = 2592000; -- 30-day TTL: logs auto-expire-- ── Writing data: any node accepts the write ────────────────────────────────INSERTINTOaudit_logs (service_name, log_time, request_id, payload)
VALUES ('payment-service', toTimestamp(now()), uuid(), '{"status":"ok"}');
INSERTINTOaudit_logs (service_name, log_time, request_id, payload)
VALUES ('payment-service', toTimestamp(now()), uuid(), '{"status":"retry"}');
-- ── Reading data: the query matches the partition key exactly ────────────────-- This is a single-partition read — one node handles it, response in <5msSELECT service_name, log_time, payload
FROM audit_logs
WHERE service_name = 'payment-service'AND log_time > '2026-01-01 00:00:00+0000'LIMIT100;
Every node owns a range of the token ring — data is distributed by hashing the partition key to a token
Any node can accept reads and writes — there is no master, no leader election, no failover delay
Replication factor determines how many nodes hold a copy of each partition
If a node goes down, its neighbors serve its data immediately — no recovery window
Adding a node splits an existing token range and streams data to the new node automatically — linear scaling
Production Insight
Cassandra's AP design means reads may return slightly stale data during replication lag.
Tunable consistency lets you choose per-query: ONE for speed (fastest, weakest), LOCAL_QUORUM for balance (default choice), ALL for safety (strongest, slowest).
QUORUM math: for RF=3, QUORUM = floor(3/2)+1 = 2 nodes must respond. You can lose one node and still satisfy QUORUM.
Rule: use LOCAL_QUORUM for production writes and reads — it balances consistency with cross-datacenter latency without requiring all replicas to be available.
Key Takeaway
Cassandra exists because master-slave databases cannot scale writes horizontally without a bottleneck.
Every node is equal — any node accepts writes, and replication handles redundancy automatically.
QUORUM = floor(RF/2)+1: with RF=3, you need 2 nodes — you can lose one and still serve queries.
Never use SimpleStrategy in production — it has no rack or datacenter awareness.
Choosing Replication Strategy
IfSingle datacenter development or testing
→
UseUse SimpleStrategy with RF=1 — do NOT use in production
IfSingle production datacenter
→
UseUse NetworkTopologyStrategy with RF=3 spread across different racks
IfMulti-datacenter deployment
→
UseUse NetworkTopologyStrategy with RF=3 per datacenter for cross-DC resilience
IfCompliance requires geographic data isolation
→
UseUse NetworkTopologyStrategy with LOCAL_QUORUM consistency to ensure reads never cross datacenter boundaries
thecodeforge.io
Cassandra Ring Architecture & Hot Partitions
Cassandra Introduction
Partition Keys, Clustering Columns, and Primary Key Design
If I had to point to one thing that separates engineers who run Cassandra well from engineers who generate production incidents, it's partition key design. Every other Cassandra concept — consistency levels, compaction, gossip — is secondary to getting this right.
The primary key in Cassandra has two distinct parts that most tutorials blur together:
The partition key determines which node stores the data. Cassandra hashes the partition key value to a token, and that token falls in a node's token range. All rows with the same partition key are stored together on disk — this is what makes single-partition reads so fast. But it's also why a bad partition key creates a hot spot: if thousands of writes share the same partition key value, all of them land on the same two or three nodes.
Clustering columns determine the sort order of rows within a partition. They are stored sorted on disk in the SSTable. This means range queries on clustering columns — WHERE log_time > x AND log_time < y — are essentially free: Cassandra reads a contiguous block of sorted data from a single SSTable. There is no sort step, no heap sort, no temporary buffer.
The combination of partition key and clustering columns forms the primary key. Every row within a partition must have a unique combination of clustering column values. A missing clustering column in a write will upsert the existing row — not create a new one.
Partition size is the operational concern that flows from this design. Because all rows with the same partition key live on the same nodes, an unbounded partition grows indefinitely and eventually causes the problems described in the production incident at the top of this guide. The rule I operate by: keep partitions under 100MB and under 100,000 rows. If either limit is at risk, add a time bucket to the partition key.
partition_key_design.cqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
-- ── Example 1: Simple partition key — one column ────────────────────────────-- All events for a user are in one partition.-- Problem: a user with 10 million events creates a 4GB partition.CREATETABLEuser_events_v1 (
user_id uuid,
event_time timestamp,
event_type text,
metadata text,
PRIMARYKEY (user_id, event_time)
) WITHCLUSTERINGORDERBY (event_time DESC);
-- ── Example 2: Composite partition key — time bucketing ──────────────────────-- Now the partition key is (user_id, bucket).-- Each day gets its own partition — bounded growth, predictable size.-- 'bucket' is a date string like '2026-04-18', computed in the application.-- Trade-off: queries spanning multiple days require multiple partition reads.CREATETABLEuser_events_v2 (
user_id uuid,
bucket text, -- application sets this: '2026-04-18'
event_time timestamp,
event_type text,
metadata text,
PRIMARYKEY ((user_id, bucket), event_time)
-- ^-----------^ composite partition key: two columns together
) WITHCLUSTERINGORDERBY (event_time DESC);
-- ── Writing to the bucketed table ─────────────────────────────────────────────-- Application code computes the bucket from the event timestamp-- then includes it in every write and readINSERTINTOuser_events_v2 (user_id, bucket, event_time, event_type, metadata)
VALUES (
a3b4c5d6-e7f8-9012-abcd-ef1234567890,
'2026-04-18',
toTimestamp(now()),
'page_view',
'{"page":"/dashboard"}'
);
-- ── Reading from a specific bucket ───────────────────────────────────────────-- This is a single-partition read: fast, predictable, no cluster scanSELECT event_time, event_type, metadata
FROM user_events_v2
WHERE user_id = a3b4c5d6-e7f8-9012-abcd-ef1234567890
AND bucket = '2026-04-18'AND event_time > '2026-04-18 08:00:00+0000'LIMIT50;
-- ── What ALLOW FILTERING looks like — and why it is dangerous ─────────────────-- This forces Cassandra to scan EVERY partition on EVERY node-- to find rows where event_type = 'purchase'.-- With 50 nodes and 500GB of data, this query touches all of it.-- Never run this in production without SAI or a dedicated table.SELECT * FROM user_events_v2
WHERE event_type = 'purchase'ALLOWFILTERING; -- full cluster scan — do not do this
Warnings: Aggregation query used without partition key (potential performance issue)
-- This scanned 50 nodes. Do not use in production.
The 100MB / 100,000 Row Partition Rule
Keep individual partitions under 100MB and under 100,000 rows. These are not hard limits but operational thresholds I've used across multiple production clusters. Beyond these numbers, compaction takes longer, reads require more SSTable lookups, and a single write spike can overwhelm the nodes responsible for that token range. Check partition sizes regularly with nodetool tablehistograms. If you're approaching the limit, add a time bucket to the partition key before you hit it — not after.
Production Insight
The most dangerous partition key mistake is the one that looks fine on day one.
A partition key based on user_id is fine when you have 1,000 users and 100 events each.
At 50,000 events per power user, you have a 500MB partition waiting to OOM a node.
Design for the maximum, not the average — you won't have time to redesign at 3am during a traffic spike.
Key Takeaway
The partition key determines data placement — choose high-cardinality values and add time buckets for time-series data.
Clustering columns determine sort order within a partition — sorted on disk, so range queries are essentially free.
ALLOW FILTERING is a full cluster scan — it is not a query strategy, it is an emergency diagnostic tool.
Keep partitions under 100MB and 100,000 rows — if you cannot, add a time bucket.
Partition Key Design Decisions
IfData volume per partition key value is bounded and small (< 10,000 rows)
→
UseSimple partition key is fine — one column, high cardinality (UUID, user_id)
IfData volume per partition key grows unboundedly over time (time-series, events)
→
UseAdd a time bucket to the partition key: (entity_id, bucket) where bucket is day/hour/week
IfNeed to query by a column that is not the partition key
→
UseCreate a separate table with that column as the partition key — duplicate the data
IfQuery needs to filter on a low-selectivity column across many partitions
→
UseIn Cassandra 5.0+, use a Storage Attached Index (SAI) — do not use ALLOW FILTERING
The LSM-Tree Write Path — Why Cassandra Writes So Fast
Understanding Cassandra's write performance requires understanding one design decision: all writes are sequential. There is no read-before-write, no in-place update, no locking. The LSM-Tree (Log-Structured Merge-Tree) storage engine is the reason Cassandra can absorb write rates that would saturate a PostgreSQL cluster.
Here's what actually happens when your application executes an INSERT or UPDATE:
Commit log write — The mutation is appended to a sequential commit log on disk. Sequential I/O. If the node crashes after this step, the data can be recovered from the commit log on restart.
Memtable write — The mutation is written to an in-memory structure called the memtable. All writes to the same partition are merged here — updates, deletes, and inserts coexist. The memtable is the 'current view' of recent data.
Memtable flush to SSTable — When the memtable reaches a configured size, it is flushed to disk as an immutable SSTable (Sorted String Table). Immutable means once written, it is never modified. Updates and deletes don't rewrite existing SSTables — they write new entries with a higher timestamp.
Compaction — Background compaction merges multiple SSTables into fewer, larger ones. During compaction, entries for the same partition key are merged by timestamp — newer wins. Deleted entries (tombstones) are purged after gc_grace_seconds.
The trade-off this creates: reads are more complex than writes. A read for a partition may need to check the memtable and multiple SSTables to find all the data. Bloom filters (one per SSTable) give Cassandra a probabilistic fast path — 'definitely not in this SSTable' — before opening the file. Partition indexes give the exact byte offset. By 2026, Cassandra 5.0's trie-based memtable and storage improvements have significantly reduced the read amplification that older LSM implementations suffered.
write_path_demonstration.cqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
-- ── Demonstrating Cassandra's write semantics ────────────────────────────────-- INSERT and UPDATE are identical in Cassandra — both are 'upserts'.-- There is no read-before-write. The write goes to memtable immediately.INSERTINTO forge_system.audit_logs
(service_name, log_time, request_id, payload)
VALUES
('order-service', '2026-04-18 10:00:00+0000', uuid(), '{"amount":99.99}');
-- UPDATE writes a new cell with a higher timestamp — the old cell still exists-- in the SSTable until compaction. Cassandra resolves conflicts by timestamp.UPDATE forge_system.audit_logs
SET payload = '{"amount":99.99,"status":"confirmed"}'WHERE service_name = 'order-service'AND log_time = '2026-04-18 10:00:00+0000';
-- ── Deletes write tombstones, not erasures ───────────────────────────────────-- This DELETE writes a tombstone marker — it does NOT remove the data immediately.-- The data will be physically removed after gc_grace_seconds (default: 10 days)-- AND after a compaction runs AND after repair has propagated the tombstone.DELETEFROM forge_system.audit_logs
WHERE service_name = 'order-service'AND log_time = '2026-04-18 10:00:00+0000';
-- ── Checking SSTable and memtable state via nodetool ─────────────────────────-- Run from the shell, not from cqlsh:-- nodetool flush forge_system → forces memtable to flush to SSTable-- nodetool compact forge_system → forces compaction on the keyspace-- nodetool tablehistograms forge_system audit_logs → read/write latency breakdown-- ── TTL as an alternative to application-level deletes ───────────────────────-- When data has a natural expiry, use TTL instead of explicit DELETE.-- Cassandra handles the tombstone lifecycle automatically.-- Here, this log entry expires in 86400 seconds (24 hours).INSERTINTO forge_system.audit_logs
(service_name, log_time, request_id, payload)
VALUES
('cache-service', toTimestamp(now()), uuid(), '{"cache":"hit"}')
USINGTTL86400;
IfRead-heavy workload, data is frequently queried by partition key
→
UseUse LCS (Leveled Compaction Strategy) — fewer SSTables per partition, better read performance, more I/O during compaction
IfTime-series data with TTL-based expiry
→
UseUse TWCS (Time-Window Compaction Strategy) — groups SSTables by time window, drops entire windows when TTL expires, minimal write amplification
IfUnclear workload or mixed read/write
→
UseStart with STCS, monitor with nodetool tablehistograms, switch to LCS if read latency is the dominant complaint
Common Mistakes and How to Avoid Them
When moving from SQL to Cassandra, the mistakes fall into two categories: design mistakes you make before writing a line of code, and operational mistakes you make after the cluster is running. The design mistakes are more expensive — they require data migration to fix. The operational ones are more urgent — they cause incidents.
The single most consequential design mistake is treating Cassandra like a relational database. Normalizing data, expecting joins, using secondary indexes as a query strategy, filtering on non-primary-key columns with ALLOW FILTERING — all of these patterns assume a query engine that Cassandra deliberately does not have. Cassandra's query engine is intentionally limited because that limitation is what makes it fast at scale.
For the operational side, the development cluster below is the starting point for understanding how the cluster behaves before you have 12 production nodes to experiment on. The compose file uses Cassandra 5.0 — pinned, not latest. Using latest in any tutorial or development setup is a trap: you will develop against a different version than production, and Cassandra has meaningful behavioral differences between major versions.
# After all three nodes start (allow ~3 minutes for gossip convergence)
# Run: docker exec forge-cassandra-1 nodetool status
# Datacenter: datacenter1
# ========================
# Status=Up/Down
# |/ State=Normal/Leaving/Joining/Moving
# -- Address Load Tokens Owns Host ID Rack
# UN 172.20.0.2 195.21 KiB 16 33.3% a1b2c3d4-e5f6-7890-abcd-ef1234567890 rack1
# UN 172.20.0.3 207.43 KiB 16 33.3% b2c3d4e5-f6a7-8901-bcde-f12345678901 rack2
# UN 172.20.0.4 189.67 KiB 16 33.4% c3d4e5f6-a7b8-9012-cdef-123456789012 rack3
#
# UN = Up, Normal — all three nodes are healthy and distributing load evenly
# Each node owns ~33% of the token ring
Watch Out:
The most common setup mistake is using the SimpleStrategy replication class in production. While fine for single-node development, production environments require NetworkTopologyStrategy to ensure rack-aware and datacenter-aware data placement. The second most common mistake is using cassandra:latest as the image tag — it breaks reproducibility and means developers and production may be running different major versions with different behaviors.
Production Insight
Denormalization in Cassandra means duplicating data across multiple tables.
Each table is designed for exactly one query pattern — this is by design, not a flaw.
Rule: if you need the same data for two queries, create two tables — storage is cheap, latency is not.
The storage cost of denormalization at scale is almost always lower than the operational cost of slow queries or a poorly tuned secondary index.
Key Takeaway
Query-First Design: write your CQL queries before designing your tables — the query defines the table structure.
Joins do not exist — if you need data from two entities, denormalize into one table or create two query-specific tables.
Never use SimpleStrategy in production and never pin to cassandra:latest — both create incidents you cannot explain.
Data Modeling Decision Matrix
IfNeed to query by user_id and by email
→
UseCreate two tables: one partitioned by user_id, one by email — do NOT rely on secondary indexes for this
IfTime-series data with high write volume
→
UseUse a time bucket in the partition key (e.g., day or hour) to prevent unbounded partition growth
IfNeed to aggregate or group data across partitions
→
UsePre-compute aggregations at write time and store in a dedicated summary table
IfQuery requires filtering on a non-primary-key column in Cassandra 5.0+
→
UseUse a Storage Attached Index (SAI) — it replaces SASI and avoids the full-cluster scan of ALLOW FILTERING
Storage Attached Indexes (SAI) — Secondary Indexing in Cassandra 5.0
Before Cassandra 5.0, secondary indexing was the most common source of bad advice in Cassandra guides. The options were:
SASI (Storage Attached Secondary Index) — better than the original secondary indexes, still had significant operational overhead and limited query capabilities.
Original secondary indexes — effectively hidden tables, painful performance characteristics, deprecated by the Cassandra community for most production use.
ALLOW FILTERING — not an index at all, a full cluster scan dressed up as a query option.
Cassandra 5.0 introduced Storage Attached Indexes (SAI), which supersede SASI. SAI is built into the SSTable storage format, requires no separate index tables, and supports range queries, equality, and — for vector embeddings — approximate nearest neighbor (ANN) search. Vector search support makes Cassandra 5.0 relevant for AI-adjacent workloads that weren't possible before.
SAI doesn't make secondary indexing free. It's still more expensive than a partition key query because it must look across all SSTables on the node. But the cost is manageable for selective queries — queries that filter down to a small result set. The rule I apply: if your SAI query returns more than 5-10% of the rows in the partition, the query is not selective enough for an index to help, and you should create a dedicated table instead.
SASI is deprecated in Cassandra 5.0. Any guide that recommends SASI for new deployments is out of date.
sai_indexes.cqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
-- ── Storage Attached Indexes: Cassandra 5.0+ ─────────────────────────────────-- SAI allows querying on non-primary-key columns without ALLOW FILTERING.-- SASI is deprecated in Cassandra 5.0 — use SAI for all new indexes.USE forge_system;
-- Table: product catalog — partition key is categoryCREATETABLEIFNOTEXISTSproducts (
category text,
product_id uuid,
name text,
price decimal,
in_stock boolean,
rating float,
PRIMARYKEY (category, product_id)
);
-- ── Create SAI indexes on non-primary-key columns ────────────────────────────-- Syntax: CREATE INDEX index_name ON table(column) USING 'sai';CREATEINDEX products_price_idx ONproducts(price) USING'sai';
CREATEINDEX products_stock_idx ONproducts(in_stock) USING'sai';
CREATEINDEX products_rating_idx ONproducts(rating) USING'sai';
-- ── SAI-enabled queries: no ALLOW FILTERING required ─────────────────────────-- Range query on price — only possible with SAI or as a partition keySELECT product_id, name, price
FROM products
WHERE category = 'electronics'AND price > 100.00AND price < 500.00;
-- Combined filter: in stock AND rating above threshold-- Both conditions use SAI — Cassandra intersects the index resultsSELECT product_id, name, price, rating
FROM products
WHERE category = 'electronics'AND in_stock = true
AND rating > 4.5;
-- ── Vector search with SAI (AI workloads) ────────────────────────────────────-- Cassandra 5.0 supports ANN (Approximate Nearest Neighbor) search via SAI-- Useful for semantic search, recommendation systems, embedding lookupCREATETABLEIFNOTEXISTSproduct_embeddings (
category text,
product_id uuid,
embedding vector<float, 1536>, -- 1536-dimensional OpenAI embeddingPRIMARYKEY (category, product_id)
);
-- SAI index on the vector column for ANN searchCREATEINDEX embeddings_ann_idx
ONproduct_embeddings(embedding)
USING'sai'WITHOPTIONS = {'similarity_function': 'cosine'};
-- ANN query: find the 10 most similar products to a given embedding vector-- This runs on-node without a cluster scan — orders of magnitude faster than-- brute-force similarity across all rowsSELECT product_id
FROM product_embeddings
WHERE category = 'electronics'ORDERBY embedding ANNOF [0.1, 0.2, 0.3, ...] -- 1536 floats from your modelLIMIT10;
-- ANN query: (10 results returned based on cosine similarity)
SAI vs Dedicated Table: When to Use Which
SAI is right when: the indexed column has high cardinality, queries are selective (< 5% of partition rows), and you cannot predict all query patterns at schema design time
Dedicated table is right when: you need the absolute fastest reads, the query pattern is known and stable, or selectivity is too low for SAI to help
Vector search via SAI is the recommended pattern for embedding lookup in Cassandra 5.0+ — no separate vector database required for moderate scale
SASI is deprecated in Cassandra 5.0 — do not create new SASI indexes on any new deployment
Production Insight
SAI in Cassandra 5.0 is a genuine improvement over everything that came before it.
But 'genuine improvement' is not the same as 'free.' A low-selectivity SAI query — one that matches 30% of rows — still performs worse than a well-designed partition key query.
Rule: always include the partition key in SAI queries when possible. SAI narrows results within a partition far more efficiently than across the whole cluster.
Key Takeaway
SAI replaces SASI in Cassandra 5.0 — SASI is deprecated and should not be used in new deployments.
SAI supports equality, range, and vector (ANN) queries without ALLOW FILTERING.
Include the partition key alongside SAI conditions wherever possible — it scopes the index lookup to one partition rather than the full cluster.
For AI workloads requiring embedding similarity search, Cassandra 5.0 + SAI vector indexes removes the need for a separate vector database at moderate scale.
Secondary Index Strategy in Cassandra 5.0
IfNeed to query on a non-primary-key column, column has high cardinality
→
UseUse SAI (USING 'sai') — efficient for selective queries, no separate table required
IfNeed to query on a non-primary-key column, query is for a known stable pattern
→
UseCreate a dedicated table with the column as partition key — fastest reads, predictable performance
IfNeed approximate nearest neighbor search for ML embeddings
→
UseUse SAI vector index on a vector<float, N> column — built into Cassandra 5.0, no external vector DB needed for moderate scale
IfExisting cluster uses SASI indexes
→
UsePlan migration to SAI — SASI is deprecated in 5.0 and will not receive new features
Connecting a Java Application — DataStax Java Driver
Every other concept in this guide — rings, partition keys, consistency levels, SAI — becomes concrete once you see how an application actually talks to Cassandra. The standard driver by 2026 is the DataStax Java Driver for Apache Cassandra, version 4.x. It replaced the older CQL driver and the Astyanax client.
Three things the driver handles that you do not want to implement yourself:
Load balancing policy — the driver tracks which nodes are in which datacenter, which are up, and which are token-aware. With the DefaultLoadBalancingPolicy, queries are automatically routed to the replica that owns the data — no unnecessary hops through coordinator nodes.
Connection pooling — each driver instance maintains a pool of connections per host. You create one CqlSession per application and reuse it for the lifetime of the process. Creating a new session per request is one of the most common Java + Cassandra performance mistakes.
Prepared statements — CQL statements should be prepared once and executed many times. Prepared statements are parsed and validated once by the coordinator; subsequent executions send only the bound values. Unprepared statements parse on every execution — at scale, the coordinator overhead becomes measurable.
The Spring Data Cassandra abstraction sits on top of this driver and is suitable for straightforward CRUD workloads. For anything involving custom query patterns, consistency level overrides, or performance-critical paths, I prefer going directly to the driver — the abstraction layer hides the consistency level control that Cassandra's tunable consistency depends on.
-- Single-partition read, 2 rows returned in 1.8ms
One Session, Many Queries — The Connection Pool Model
CqlSession is thread-safe — inject it as a singleton and share it across all repositories
The driver manages a connection pool per host automatically — no manual pool management needed
PreparedStatement objects are also reusable and thread-safe — prepare at startup, bind per request
executeAsync() returns a CompletableFuture — use it for fire-and-forget writes or high-throughput ingestion
Token-aware routing in DefaultLoadBalancingPolicy means queries go directly to the replica — no extra coordinator hop
Production Insight
The two driver mistakes that show up in every Cassandra performance post-mortem:
1. Creating a new CqlSession per request — session creation takes 100-300ms. At 100 req/s, this saturates the application before Cassandra is ever stressed.
2. Using unprepared statements in a hot path — the coordinator parses the CQL string on every execution. At scale, this adds measurable latency and CPU load to your coordinator nodes.
Both mistakes look fine in load testing with 10 users. They collapse at production scale.
Key Takeaway
Create one CqlSession per application — it is thread-safe and manages its own connection pool.
Prepare statements at startup, bind per request — never execute unprepared statements in a hot path.
Use LOCAL_QUORUM as the default consistency level — it tolerates one replica being down while maintaining strong consistency within the local datacenter.
Use executeAsync() for high-throughput write paths — it allows multiple in-flight writes without blocking application threads.
Sync vs Async Driver Execution
IfStandard request-response REST API endpoint
→
UseUse session.execute() — synchronous, straightforward, compatible with Spring MVC thread model
UseUse session.executeAsync() with CompletableFuture — allows batching multiple in-flight writes without blocking threads
IfApplication uses Spring WebFlux or reactive streams
→
UseUse the reactive driver extension (java-driver-reactive) which returns Mono/Flux directly
IfBulk data migration or backfill job
→
UseUse executeAsync() with a semaphore to limit concurrency — avoids overwhelming Cassandra with thousands of simultaneous requests
Who Should Touch This Database (And Who Should Run Away)
Cassandra is not your father's relational database. You do not reach for it when you have three tables and a weekend project. This is the hammer you pull out when you need to ingest millions of writes per second across a dozen data centers and you cannot afford a single second of downtime. If you are a backend engineer working on time-series metrics, IoT sensor pipelines, or a real-time recommendation engine that must survive a rack failure without blinking: this database was built for you. If you are building a simple blog with user profiles and comments, stop reading and go back to PostgreSQL. Cassandra trades away joins, secondary indexes that work out of the box, and strong consistency across partitions for raw write throughput and linear scalability. You need to know how partition keys work. You need to understand that your query pattern must be designed before you create a single table. If you cannot model your access paths up front, Cassandra will punish you in production and your on-call rotation will hate you.
ModelCheck.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// io.thecodeforge// Verify your access pattern before writing a single rowpublicclassModelCheck {
// If your query is "get all orders by user_id",// your partition key MUST be user_id. No exceptions.publicstaticfinalString CREATE_ORDERS_TABLE =
"CREATE TABLE IF NOT EXISTS orders_by_user (" +
"user_id UUID," +
"order_time TIMESTAMP," +
"amount DECIMAL," +
"PRIMARY KEY (user_id, order_time)" +
") WITH CLUSTERING ORDER BY (order_time DESC);";
publicstaticvoidmain(String[] args) {
System.err.println("Design the query first. Then build the table.");
System.err.println("Reverse that order and you will re-architect at 2 AM.");
}
}
Output
Design the query first. Then build the table.
Reverse that order and you will re-architect at 2 AM.
Production Trap:
Do NOT let a junior architect 'normalize' Cassandra tables. Every table is a denormalized query result set. One table per query pattern. Violate this and your cluster will scatter reads across nodes, killing latency.
Key Takeaway
Model your query first, then build the table — Cassandra is not relational.
What You Need Before You Start (Or You Will Flame Out)
You cannot just install Cassandra and start typing CQL. You need a solid grasp of Java — not just syntax, but the JVM's garbage collection behavior, heap sizing, and concurrent programming. Cassandra is written in Java, and every node runs inside a JVM. If you do not know what a GC pause looks like in a production graph, you will mistake a dying node for a network issue. You also need to be comfortable on Linux. Not Windows. Not Mac for production. Linux. You will tail logs in /var/log/cassandra, adjust kernel parameters for swappiness and net.core.somaxconn, and debug ulimit complaints before breakfast. Basic database concepts help: you should know what a B-tree is so you can appreciate why an LSM-tree behaves differently under compaction. If you have never used any NoSQL database before, spend a weekend with DynamoDB or Scylla first. The distributed systems concepts — consistent hashing, gossip protocol, hinted handoff — are not optional. They are the difference between understanding why your read repair failed and blindly restarting nodes.
GCSettings.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
// io.thecodeforge// Tune JVM before you tune Cassandra queriespublicclassGCSettings {
publicstaticvoidmain(String[] args) {
// Production Cassandra nodes MUST use G1GC// -XX:+UseG1GC -XX:G1RSetUpdatingPauseTimePercent=5// Never use CMS or ParallelGC unless you enjoy 10-second pausesSystem.out.println("Check jvm.options for:");
System.out.println("-XX:+UseG1GC");
System.out.println("-Xms16G -Xmx16G// always equal heap");System.out.println("-XX:MaxGCPauseMillis=200");
}
}
Output
Check jvm.options for:
-XX:+UseG1GC
-Xms16G -Xmx16G // always equal heap
-XX:MaxGCPauseMillis=200
Production Trap:
Never set -Xms and -Xmx to different values. Unequal heap sizes trigger full GCs when the JVM tries to shrink or expand the heap. Equal them out and your node stays steady.
Key Takeaway
You need Java, Linux, and GC tuning knowledge before you run a single production node. Skip this and you will burn a weekend debugging pauses.
● Production incidentPOST-MORTEMseverity: high
Hot Partition Crash During Peak Traffic
Symptom
Two nodes hit 100% CPU and OOM-killed. Read latency spiked from 5ms to 12 seconds. Application received WriteTimeoutException on 40% of requests.
Assumption
Partition key design distributes writes evenly across the cluster. No single partition should exceed 100MB.
Root cause
The order_id was used as the partition key, but a batch import job inserted 2 million rows with the same partition key (bulk order event). This created a single 4GB partition that overwhelmed two nodes responsible for that token range.
Fix
Redesigned partition key to include a time bucket: (order_date, order_id). Added partition size monitoring with alerts at 100MB threshold. Implemented client-side rate limiting for bulk imports.
Key lesson
Never use a low-cardinality value as a partition key — it creates hot spots
Monitor partition sizes in production — anything over 100MB is a red flag
Bulk import jobs must respect partition key distribution — add time bucketing
Set up alerts on compaction pending tasks as an early warning for hot partitions
Production debug guideFrom slow queries to node failures5 entries
Symptom · 01
Read latency spikes above 100ms
→
Fix
Run nodetool tablehistograms to check read latency distribution; look for large SSTable counts or tombstone-heavy partitions
Symptom · 02
Node shows DN (Down) in nodetool status
→
Fix
Check system.log for OOM or disk full errors; verify network connectivity with ping and gossip state with nodetool gossipinfo
Symptom · 03
WriteTimeoutException from application
→
Fix
Check if consistency level exceeds available replicas; run nodetool tpstats to see if write thread pools are saturated
Symptom · 04
Compaction never completes
→
Fix
Check disk I/O with iostat; verify compaction strategy matches workload (STCS for write-heavy, LCS for read-heavy, TWCS for time-series)
Symptom · 05
ALLOW FILTERING query causes cluster-wide scan
→
Fix
Identify queries using ALLOW FILTERING in application logs or with slow query log enabled. Redesign the table so the filter column is part of the primary key, or create a SAI index in Cassandra 5.0+
★ Cassandra Emergency DebuggingWhen your Cassandra cluster is in trouble, start here
Node unresponsive or high latency−
Immediate action
Check node status and pending compactions
Commands
nodetool status
nodetool tpstats | grep -E 'pending|blocked'
Fix now
If pending compactions > 50, reduce write throughput or add nodes. If a node is DN, check /var/log/cassandra/system.log for OOM or disk-full errors before attempting restart.
Schedule weekly nodetool repair via cron for all keyspaces. Repairs must complete within gc_grace_seconds (default 10 days) or deleted data can resurrect.
Force compaction with nodetool compact. If tombstones are the cause, verify gc_grace_seconds is configured correctly and that repair runs regularly before tombstones are purged.
Database Architecture Comparison
Aspect
Traditional RDBMS (MySQL/PostgreSQL)
Apache Cassandra
Architecture
Centralized (Master-Slave)
Distributed (Peer-to-Peer)
Scalability
Vertical (bigger hardware)
Horizontal (more nodes, linear throughput increase)
Data Model
Normalized (fixed schema, joins)
Denormalized (flexible schema, one table per query)
Availability
High with failover lag (30s-2min)
Always-on (no failover delay, no leader election)
Write Speed
Moderate (B-Tree, random I/O, locking)
Extreme (LSM-Tree, sequential I/O, no read-before-write)
Read Pattern
Flexible (joins, aggregations, ad-hoc)
Query-driven (design tables around queries)
Secondary Indexes
B-Tree indexes, generally efficient
SAI in Cassandra 5.0+ (selective queries only)
Transactions
Full ACID across tables and rows
Lightweight transactions (LWT) for single-partition conditional updates only
Operational Cost
Lower for small scale (< 1TB)
Lower for massive scale (10TB+, multi-datacenter)
Key takeaways
1
Apache Cassandra is a masterless, distributed database designed for zero-downtime and massive scale
every node is equal, any node accepts reads and writes.
2
Always model your data around your queries (Query-Driven Design) rather than entities
joins do not exist, and ALLOW FILTERING is a diagnostic tool, not a query strategy.
3
Linear scalability means adding nodes provides a predictable, proportional increase in throughput
no rebalancing ceremony, no leader election.
4
The Gossip protocol and Snitch configuration are the heartbeat of the cluster's awareness
they power failure detection and topology-aware routing.
5
Tunable consistency is Cassandra's operational superpower
LOCAL_QUORUM is the right default for production: QUORUM = floor(RF/2)+1 nodes must respond, tolerating one failure with RF=3.
6
Denormalization is not a flaw
it is the correct pattern when joins do not exist. Duplicate data across tables; storage is cheap, cross-partition reads are not.
7
Tombstones are writes
every DELETE creates a tombstone that degrades reads until compaction clears it. Use TTL on INSERT instead of explicit DELETE for time-bounded data.
8
SAI (Storage Attached Indexes) in Cassandra 5.0 replaces SASI and supports range queries and vector ANN search
SASI is deprecated and should not be used in new deployments.
9
Create one CqlSession per application, prepare statements at startup, and use LOCAL_QUORUM as the default consistency level
these three driver decisions determine whether your application runs well under load.
Common mistakes to avoid
6 patterns
×
Over-normalizing data like in SQL databases
Symptom
Application performs multiple sequential reads to assemble a single view, resulting in 50-200ms latency per page load instead of the expected 2-5ms.
Fix
Redesign tables using Query-First approach: define the exact CQL query first, then create a denormalized table with all required columns in the primary key or as stored columns. If you need the same data for two query patterns, create two tables.
×
Using SimpleStrategy in production
Symptom
During a rack failure, all replicas of a partition reside in the same rack. Data becomes temporarily or permanently unavailable.
Fix
Switch to NetworkTopologyStrategy with RF=3 per datacenter. Verify snitch configuration matches your infrastructure (GossipingPropertyFileSnitch for most deployments).
×
Ignoring partition key cardinality — creating hot spots
Symptom
One node shows 90% CPU while others idle at 5%. Compaction backlog grows unbounded on the hot node. Eventually the node OOM-kills and the cluster degrades.
Fix
Choose a high-cardinality partition key (UUIDs, composite keys with time buckets). Monitor partition sizes with nodetool tablehistograms — keep individual partitions under 100MB and 100,000 rows.
×
Using consistency level ALL for every query
Symptom
Any single node failure or network partition causes all reads and writes to fail, completely negating Cassandra's availability guarantees.
Fix
Use LOCAL_QUORUM as the default consistency level. It tolerates one node failure while still guaranteeing strong consistency within the local datacenter. QUORUM = floor(RF/2) + 1.
×
Using ALLOW FILTERING in production queries
Symptom
A query that takes 2ms during development takes 30 seconds in production when the table has 50 million rows. The coordinator scans every partition on every node.
Fix
Redesign the table so the filter column is part of the primary key, or create a Storage Attached Index (SAI) in Cassandra 5.0+. ALLOW FILTERING is a diagnostic tool, not a query strategy.
×
Creating a new CqlSession per request in the Java driver
Symptom
Application throughput collapses under load. Thread pool exhaustion in the application, not in Cassandra. P99 latency spikes to 300ms+ because session initialization dominates request time.
Fix
Create one CqlSession as a singleton bean and inject it into all repositories. Session creation is expensive (100-300ms); request execution should be 1-5ms.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
What is the role of the Gossip protocol in an Apache Cassandra cluster?
Q02SENIOR
How does Cassandra achieve high write throughput compared to traditional...
Q03SENIOR
Explain the LSM-Tree (Log-Structured Merge-Tree) and its importance in C...
Q04JUNIOR
What is the difference between SimpleStrategy and NetworkTopologyStrateg...
Q05SENIOR
What is a tombstone in Cassandra, why does it degrade read performance, ...
Q06SENIOR
What is Storage Attached Indexing (SAI) in Cassandra 5.0, and how does i...
Q01 of 06SENIOR
What is the role of the Gossip protocol in an Apache Cassandra cluster?
ANSWER
Gossip is a peer-to-peer communication protocol that every Cassandra node uses to discover and maintain awareness of other nodes in the cluster. Each node gossips with up to 3 other nodes every second, exchanging state information (up/down, load, schema version). This decentralized protocol has no single point of failure and converges on cluster state within a few seconds. It powers failure detection — if a node stops responding to gossip, it is marked as down after a configurable phi-convict-threshold. The Snitch works alongside Gossip to tell each node the datacenter and rack topology, which the load balancing policy uses to make routing decisions.
Q02 of 06SENIOR
How does Cassandra achieve high write throughput compared to traditional B-Tree based databases?
ANSWER
Cassandra uses an LSM-Tree (Log-Structured Merge-Tree) storage engine. Writes go to an in-memory memtable and are immediately appended to a sequential commit log on disk. Sequential I/O is orders of magnitude faster than the random I/O required by B-Tree indexes. When the memtable fills, it is flushed to an immutable SSTable on disk. Background compaction merges SSTables. This architecture means writes never read-modify-update existing data structures — they only append. There is no locking, no read-before-write, and no page split. In Cassandra 5.0, the trie-based memtable further reduces memory overhead and improves flush performance.
Q03 of 06SENIOR
Explain the LSM-Tree (Log-Structured Merge-Tree) and its importance in Cassandra's storage engine.
ANSWER
An LSM-Tree is a write-optimized data structure that converts random writes into sequential writes. Data flows through three stages: (1) write to memtable (in-memory sorted structure) and commit log simultaneously, (2) flush memtable to SSTable (sorted, immutable file on disk) when it reaches configured size, (3) background compaction merges multiple SSTables into fewer, larger SSTables, resolving conflicts by timestamp and purging tombstones after gc_grace_seconds. This design trades read amplification (may need to check multiple SSTables) for extreme write throughput. Bloom filters give a fast 'definitely not in this SSTable' path, and partition indexes provide the exact byte offset for each partition. Cassandra 5.0 introduced the trie-based memtable to reduce memory amplification.
Q04 of 06JUNIOR
What is the difference between SimpleStrategy and NetworkTopologyStrategy for replication?
ANSWER
SimpleStrategy places replicas on the next N nodes clockwise in the ring, ignoring rack and datacenter topology. It is only suitable for single-datacenter development. NetworkTopologyStrategy is rack and datacenter aware — it places replicas in different racks within each datacenter, protecting against rack-level failures. For production, NetworkTopologyStrategy with RF=3 per datacenter is the standard. You specify the replication factor per datacenter: {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}. SimpleStrategy in production is a correctness issue, not just a best practice violation — a rack failure can take down all replicas of a partition.
Q05 of 06SENIOR
What is a tombstone in Cassandra, why does it degrade read performance, and how do you prevent tombstone accumulation?
ANSWER
A tombstone is a deletion marker written when data is deleted in Cassandra. Because SSTables are immutable, Cassandra cannot remove data in place — it writes a tombstone that is propagated to replicas and physically removed only after gc_grace_seconds (default 10 days) AND after a compaction runs. Tombstones degrade reads because Cassandra must scan through all tombstones in a partition to find live data, and each scanned tombstone counts against the tombstone_failure_threshold (default 100,000). Prevention strategies: use TTL on INSERT instead of explicit DELETE for time-bounded data; avoid nullable columns that generate cell-level tombstones; ensure nodetool repair runs within gc_grace_seconds so tombstones are safe to purge; monitor tombstone ratio with nodetool tablehistograms.
Q06 of 06SENIOR
What is Storage Attached Indexing (SAI) in Cassandra 5.0, and how does it differ from SASI?
ANSWER
SAI (Storage Attached Indexes) is the secondary indexing framework introduced in Cassandra 5.0 that replaces SASI (Storage Attached Secondary Index). SASI required separate index files and had limited query support. SAI is built into the SSTable storage format, supports equality, range queries, and vector approximate nearest neighbor (ANN) search, and has significantly lower write overhead than SASI. SASI is deprecated in Cassandra 5.0. The key operational difference: SAI indexes integrate with the native SSTable format, so there are no separate index tables to maintain, and compaction handles index data along with row data automatically. For vector search workloads, SAI with the cosine or dot-product similarity function enables embedding lookup without a separate vector database.
01
What is the role of the Gossip protocol in an Apache Cassandra cluster?
SENIOR
02
How does Cassandra achieve high write throughput compared to traditional B-Tree based databases?
SENIOR
03
Explain the LSM-Tree (Log-Structured Merge-Tree) and its importance in Cassandra's storage engine.
SENIOR
04
What is the difference between SimpleStrategy and NetworkTopologyStrategy for replication?
JUNIOR
05
What is a tombstone in Cassandra, why does it degrade read performance, and how do you prevent tombstone accumulation?
SENIOR
06
What is Storage Attached Indexing (SAI) in Cassandra 5.0, and how does it differ from SASI?
SENIOR
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
Does Cassandra support ACID transactions?
Cassandra supports lightweight transactions (LWT) using the Paxos protocol for conditional updates within a single partition, but they come with a significant performance cost — approximately 4x the latency of a regular write due to the multi-round-trip Paxos consensus. For most use cases, Cassandra's eventual consistency with LOCAL_QUORUM provides the right balance. Full multi-row ACID transactions across partitions are not supported — design your data model to avoid needing them. If you genuinely need cross-partition transactions, evaluate whether Cassandra is the right database for that specific workload.
Was this helpful?
02
How do I choose the right compaction strategy?
STCS (Size-Tiered Compaction Strategy) is the default and works well for write-heavy workloads where data is rarely updated. LCS (Leveled Compaction Strategy) provides better read performance and bounded space amplification — use it for read-heavy workloads where the extra compaction I/O is acceptable. TWCS (Time-Window Compaction Strategy) is ideal for time-series data with TTL, as it groups SSTables by time window and drops entire windows when data expires — avoiding the need to merge expired tombstones with live data. Cassandra 5.0 introduced the Unified Compaction Strategy (UCS) as an experimental option that adapts behavior based on workload patterns.
Was this helpful?
03
What is a tombstone in Cassandra and why should I care?
A tombstone is a marker that indicates a row or cell has been deleted. Cassandra does not immediately remove data — it writes a tombstone that is only physically cleaned up after gc_grace_seconds (default 10 days) AND after compaction runs AND after repair has propagated the tombstone to all replicas. Excessive tombstones degrade read performance because Cassandra must scan through them during reads. Hitting the tombstone_failure_threshold (default 100,000 per query) causes a TombstoneOverwhelmingException that aborts the read entirely. Prevention: use TTL on INSERT, run repair regularly within gc_grace_seconds, and monitor tombstone ratios with nodetool tablehistograms.
Was this helpful?
04
Can I use secondary indexes in Cassandra?
In Cassandra 5.0, the recommended secondary indexing approach is Storage Attached Indexes (SAI). SAI supports equality and range queries on non-primary-key columns without ALLOW FILTERING, and adds vector ANN search for embedding-based workloads. SASI (Storage Attached Secondary Index) is deprecated in Cassandra 5.0 — do not create new SASI indexes. The original legacy secondary indexes are still present but have severe performance limitations. The general rule: SAI for selective queries (< 5% of partition rows), dedicated table for known stable query patterns with high query volume.
Was this helpful?
05
When should I use Cassandra vs PostgreSQL?
Use Cassandra when: you need write throughput that a single PostgreSQL instance cannot absorb, your dataset exceeds 1TB and growing, you need multi-datacenter active-active replication, or your data naturally fits a time-series or event-log model. Use PostgreSQL when: you need ACID transactions across multiple tables, your queries are ad-hoc and unpredictable, your data is relational and normalized, or your scale is small enough that a single well-tuned PostgreSQL instance handles the load. The worst outcome is starting with Cassandra for a workload that needs joins and transactions — the data model migration cost is high.