Beginner 8 min · March 09, 2026

Apache Cassandra — Hot Partitions and Ring Architecture

A single 4GB partition OOM-killed two nodes at peak traffic.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
Quick Answer
  • Apache Cassandra is a masterless, peer-to-peer distributed NoSQL database favoring availability over strong consistency
  • Data is distributed via consistent hashing across a ring of equal nodes — no single point of failure
  • Writes go to any node and propagate to replicas based on Replication Factor
  • LSM-Tree storage engine delivers extreme write throughput — sequential I/O, no read-before-write
  • Query-Driven Design: model tables around queries, not entities — joins do not exist
  • Cassandra 5.0 introduced Storage Attached Indexes (SAI) — the correct answer to secondary indexing from 2024 onward
  • Biggest mistake: over-normalizing data like in SQL — denormalization is the correct Cassandra pattern

Apache Cassandra is a peer-to-peer distributed NoSQL database designed to handle massive amounts of data across many commodity servers. Originally developed at Facebook to power Inbox Search, it provides high availability with no single point of failure and linear scalability.

In this guide, we'll break down exactly what makes Cassandra unique in the crowded database landscape, why its masterless design beats traditional leader-follower models for write-heavy workloads, how to design partition keys that don't become time bombs, and how to connect a Java application using the DataStax driver. By the end, you'll understand not just what Cassandra is, but how to avoid the mistakes that cause production outages.

What Is Apache Cassandra and Why Does It Exist?

Apache Cassandra was designed to solve the availability vs. consistency trade-off in distributed systems, specifically favoring high availability and partition tolerance — the AP side of the CAP theorem. Traditional relational databases struggle to scale horizontally because they rely on a central master node that becomes both the bottleneck and the single point of failure. Cassandra eliminates this bottleneck entirely by treating every node in the cluster as an equal peer.

This architecture allows linear scaling: adding a new node to the cluster provides a predictable, proportional increase in performance and storage capacity. There is no leader election, no failover window, and no promotion ceremony when a node dies. Its neighbors simply serve its data until it comes back.

Before writing any application code, you need a keyspace and a table. A keyspace is Cassandra's equivalent of a database schema — it defines the replication strategy and factor. The example below uses NetworkTopologyStrategy, which is the only strategy you should use in production. SimpleStrategy is shown for contrast because you will encounter it in tutorials, and understanding why it is wrong is as important as knowing the right answer.

Note the PRIMARY KEY (service_name, log_time) design. service_name is the partition key — it controls which node stores the data. log_time is the clustering column — it controls the sort order within the partition. This is the fundamental unit of Cassandra data modeling, and every decision downstream flows from getting it right.

Partition Keys, Clustering Columns, and Primary Key Design

If I had to point to one thing that separates engineers who run Cassandra well from engineers who generate production incidents, it's partition key design. Every other Cassandra concept — consistency levels, compaction, gossip — is secondary to getting this right.

The primary key in Cassandra has two distinct parts that most tutorials blur together:

The partition key determines which node stores the data. Cassandra hashes the partition key value to a token, and that token falls in a node's token range. All rows with the same partition key are stored together on disk — this is what makes single-partition reads so fast. But it's also why a bad partition key creates a hot spot: if thousands of writes share the same partition key value, all of them land on the same two or three nodes.

Clustering columns determine the sort order of rows within a partition. They are stored sorted on disk in the SSTable. This means range queries on clustering columns — WHERE log_time > x AND log_time < y — are essentially free: Cassandra reads a contiguous block of sorted data from a single SSTable. There is no sort step, no heap sort, no temporary buffer.

The combination of partition key and clustering columns forms the primary key. Every row within a partition must have a unique combination of clustering column values. A missing clustering column in a write will upsert the existing row — not create a new one.

Partition size is the operational concern that flows from this design. Because all rows with the same partition key live on the same nodes, an unbounded partition grows indefinitely and eventually causes the problems described in the production incident at the top of this guide. The rule I operate by: keep partitions under 100MB and under 100,000 rows. If either limit is at risk, add a time bucket to the partition key.

The LSM-Tree Write Path — Why Cassandra Writes So Fast

Understanding Cassandra's write performance requires understanding one design decision: all writes are sequential. There is no read-before-write, no in-place update, no locking. The LSM-Tree (Log-Structured Merge-Tree) storage engine is the reason Cassandra can absorb write rates that would saturate a PostgreSQL cluster.

Here's what actually happens when your application executes an INSERT or UPDATE:

  1. Commit log write — The mutation is appended to a sequential commit log on disk. Sequential I/O. If the node crashes after this step, the data can be recovered from the commit log on restart.
  2. Memtable write — The mutation is written to an in-memory structure called the memtable. All writes to the same partition are merged here — updates, deletes, and inserts coexist. The memtable is the 'current view' of recent data.
  3. Memtable flush to SSTable — When the memtable reaches a configured size, it is flushed to disk as an immutable SSTable (Sorted String Table). Immutable means once written, it is never modified. Updates and deletes don't rewrite existing SSTables — they write new entries with a higher timestamp.
  4. Compaction — Background compaction merges multiple SSTables into fewer, larger ones. During compaction, entries for the same partition key are merged by timestamp — newer wins. Deleted entries (tombstones) are purged after gc_grace_seconds.

The trade-off this creates: reads are more complex than writes. A read for a partition may need to check the memtable and multiple SSTables to find all the data. Bloom filters (one per SSTable) give Cassandra a probabilistic fast path — 'definitely not in this SSTable' — before opening the file. Partition indexes give the exact byte offset. By 2026, Cassandra 5.0's trie-based memtable and storage improvements have significantly reduced the read amplification that older LSM implementations suffered.

Common Mistakes and How to Avoid Them

When moving from SQL to Cassandra, the mistakes fall into two categories: design mistakes you make before writing a line of code, and operational mistakes you make after the cluster is running. The design mistakes are more expensive — they require data migration to fix. The operational ones are more urgent — they cause incidents.

The single most consequential design mistake is treating Cassandra like a relational database. Normalizing data, expecting joins, using secondary indexes as a query strategy, filtering on non-primary-key columns with ALLOW FILTERING — all of these patterns assume a query engine that Cassandra deliberately does not have. Cassandra's query engine is intentionally limited because that limitation is what makes it fast at scale.

For the operational side, the development cluster below is the starting point for understanding how the cluster behaves before you have 12 production nodes to experiment on. The compose file uses Cassandra 5.0 — pinned, not latest. Using latest in any tutorial or development setup is a trap: you will develop against a different version than production, and Cassandra has meaningful behavioral differences between major versions.

Storage Attached Indexes (SAI) — Secondary Indexing in Cassandra 5.0

Before Cassandra 5.0, secondary indexing was the most common source of bad advice in Cassandra guides. The options were:

  • SASI (Storage Attached Secondary Index) — better than the original secondary indexes, still had significant operational overhead and limited query capabilities.
  • Original secondary indexes — effectively hidden tables, painful performance characteristics, deprecated by the Cassandra community for most production use.
  • ALLOW FILTERING — not an index at all, a full cluster scan dressed up as a query option.

Cassandra 5.0 introduced Storage Attached Indexes (SAI), which supersede SASI. SAI is built into the SSTable storage format, requires no separate index tables, and supports range queries, equality, and — for vector embeddings — approximate nearest neighbor (ANN) search. Vector search support makes Cassandra 5.0 relevant for AI-adjacent workloads that weren't possible before.

SAI doesn't make secondary indexing free. It's still more expensive than a partition key query because it must look across all SSTables on the node. But the cost is manageable for selective queries — queries that filter down to a small result set. The rule I apply: if your SAI query returns more than 5-10% of the rows in the partition, the query is not selective enough for an index to help, and you should create a dedicated table instead.

SASI is deprecated in Cassandra 5.0. Any guide that recommends SASI for new deployments is out of date.

Connecting a Java Application — DataStax Java Driver

Every other concept in this guide — rings, partition keys, consistency levels, SAI — becomes concrete once you see how an application actually talks to Cassandra. The standard driver by 2026 is the DataStax Java Driver for Apache Cassandra, version 4.x. It replaced the older CQL driver and the Astyanax client.

Three things the driver handles that you do not want to implement yourself:

  1. Load balancing policy — the driver tracks which nodes are in which datacenter, which are up, and which are token-aware. With the DefaultLoadBalancingPolicy, queries are automatically routed to the replica that owns the data — no unnecessary hops through coordinator nodes.
  2. Connection pooling — each driver instance maintains a pool of connections per host. You create one CqlSession per application and reuse it for the lifetime of the process. Creating a new session per request is one of the most common Java + Cassandra performance mistakes.
  3. Prepared statements — CQL statements should be prepared once and executed many times. Prepared statements are parsed and validated once by the coordinator; subsequent executions send only the bound values. Unprepared statements parse on every execution — at scale, the coordinator overhead becomes measurable.

The Spring Data Cassandra abstraction sits on top of this driver and is suitable for straightforward CRUD workloads. For anything involving custom query patterns, consistency level overrides, or performance-critical paths, I prefer going directly to the driver — the abstraction layer hides the consistency level control that Cassandra's tunable consistency depends on.

Database Architecture Comparison
AspectTraditional RDBMS (MySQL/PostgreSQL)Apache Cassandra
ArchitectureCentralized (Master-Slave)Distributed (Peer-to-Peer)
ScalabilityVertical (bigger hardware)Horizontal (more nodes, linear throughput increase)
Data ModelNormalized (fixed schema, joins)Denormalized (flexible schema, one table per query)
AvailabilityHigh with failover lag (30s-2min)Always-on (no failover delay, no leader election)
Write SpeedModerate (B-Tree, random I/O, locking)Extreme (LSM-Tree, sequential I/O, no read-before-write)
Read PatternFlexible (joins, aggregations, ad-hoc)Query-driven (design tables around queries)
Secondary IndexesB-Tree indexes, generally efficientSAI in Cassandra 5.0+ (selective queries only)
TransactionsFull ACID across tables and rowsLightweight transactions (LWT) for single-partition conditional updates only
Operational CostLower for small scale (< 1TB)Lower for massive scale (10TB+, multi-datacenter)

Key Takeaways

  • Apache Cassandra is a masterless, distributed database designed for zero-downtime and massive scale — every node is equal, any node accepts reads and writes.
  • Always model your data around your queries (Query-Driven Design) rather than entities — joins do not exist, and ALLOW FILTERING is a diagnostic tool, not a query strategy.
  • Linear scalability means adding nodes provides a predictable, proportional increase in throughput — no rebalancing ceremony, no leader election.
  • The Gossip protocol and Snitch configuration are the heartbeat of the cluster's awareness — they power failure detection and topology-aware routing.
  • Tunable consistency is Cassandra's operational superpower — LOCAL_QUORUM is the right default for production: QUORUM = floor(RF/2)+1 nodes must respond, tolerating one failure with RF=3.
  • Denormalization is not a flaw — it is the correct pattern when joins do not exist. Duplicate data across tables; storage is cheap, cross-partition reads are not.
  • Tombstones are writes — every DELETE creates a tombstone that degrades reads until compaction clears it. Use TTL on INSERT instead of explicit DELETE for time-bounded data.
  • SAI (Storage Attached Indexes) in Cassandra 5.0 replaces SASI and supports range queries and vector ANN search — SASI is deprecated and should not be used in new deployments.
  • Create one CqlSession per application, prepare statements at startup, and use LOCAL_QUORUM as the default consistency level — these three driver decisions determine whether your application runs well under load.

Common Mistakes to Avoid

  • Over-normalizing data like in SQL databases
    Symptom: Application performs multiple sequential reads to assemble a single view, resulting in 50-200ms latency per page load instead of the expected 2-5ms.
    Fix: Redesign tables using Query-First approach: define the exact CQL query first, then create a denormalized table with all required columns in the primary key or as stored columns. If you need the same data for two query patterns, create two tables.
  • Using SimpleStrategy in production
    Symptom: During a rack failure, all replicas of a partition reside in the same rack. Data becomes temporarily or permanently unavailable.
    Fix: Switch to NetworkTopologyStrategy with RF=3 per datacenter. Verify snitch configuration matches your infrastructure (GossipingPropertyFileSnitch for most deployments).
  • Ignoring partition key cardinality — creating hot spots
    Symptom: One node shows 90% CPU while others idle at 5%. Compaction backlog grows unbounded on the hot node. Eventually the node OOM-kills and the cluster degrades.
    Fix: Choose a high-cardinality partition key (UUIDs, composite keys with time buckets). Monitor partition sizes with nodetool tablehistograms — keep individual partitions under 100MB and 100,000 rows.
  • Using consistency level ALL for every query
    Symptom: Any single node failure or network partition causes all reads and writes to fail, completely negating Cassandra's availability guarantees.
    Fix: Use LOCAL_QUORUM as the default consistency level. It tolerates one node failure while still guaranteeing strong consistency within the local datacenter. QUORUM = floor(RF/2) + 1.
  • Using ALLOW FILTERING in production queries
    Symptom: A query that takes 2ms during development takes 30 seconds in production when the table has 50 million rows. The coordinator scans every partition on every node.
    Fix: Redesign the table so the filter column is part of the primary key, or create a Storage Attached Index (SAI) in Cassandra 5.0+. ALLOW FILTERING is a diagnostic tool, not a query strategy.
  • Creating a new CqlSession per request in the Java driver
    Symptom: Application throughput collapses under load. Thread pool exhaustion in the application, not in Cassandra. P99 latency spikes to 300ms+ because session initialization dominates request time.
    Fix: Create one CqlSession as a singleton bean and inject it into all repositories. Session creation is expensive (100-300ms); request execution should be 1-5ms.

Interview Questions on This Topic

  • QWhat is the role of the Gossip protocol in an Apache Cassandra cluster?Mid-levelReveal
    Gossip is a peer-to-peer communication protocol that every Cassandra node uses to discover and maintain awareness of other nodes in the cluster. Each node gossips with up to 3 other nodes every second, exchanging state information (up/down, load, schema version). This decentralized protocol has no single point of failure and converges on cluster state within a few seconds. It powers failure detection — if a node stops responding to gossip, it is marked as down after a configurable phi-convict-threshold. The Snitch works alongside Gossip to tell each node the datacenter and rack topology, which the load balancing policy uses to make routing decisions.
  • QHow does Cassandra achieve high write throughput compared to traditional B-Tree based databases?Mid-levelReveal
    Cassandra uses an LSM-Tree (Log-Structured Merge-Tree) storage engine. Writes go to an in-memory memtable and are immediately appended to a sequential commit log on disk. Sequential I/O is orders of magnitude faster than the random I/O required by B-Tree indexes. When the memtable fills, it is flushed to an immutable SSTable on disk. Background compaction merges SSTables. This architecture means writes never read-modify-update existing data structures — they only append. There is no locking, no read-before-write, and no page split. In Cassandra 5.0, the trie-based memtable further reduces memory overhead and improves flush performance.
  • QExplain the LSM-Tree (Log-Structured Merge-Tree) and its importance in Cassandra's storage engine.SeniorReveal
    An LSM-Tree is a write-optimized data structure that converts random writes into sequential writes. Data flows through three stages: (1) write to memtable (in-memory sorted structure) and commit log simultaneously, (2) flush memtable to SSTable (sorted, immutable file on disk) when it reaches configured size, (3) background compaction merges multiple SSTables into fewer, larger SSTables, resolving conflicts by timestamp and purging tombstones after gc_grace_seconds. This design trades read amplification (may need to check multiple SSTables) for extreme write throughput. Bloom filters give a fast 'definitely not in this SSTable' path, and partition indexes provide the exact byte offset for each partition. Cassandra 5.0 introduced the trie-based memtable to reduce memory amplification.
  • QWhat is the difference between SimpleStrategy and NetworkTopologyStrategy for replication?JuniorReveal
    SimpleStrategy places replicas on the next N nodes clockwise in the ring, ignoring rack and datacenter topology. It is only suitable for single-datacenter development. NetworkTopologyStrategy is rack and datacenter aware — it places replicas in different racks within each datacenter, protecting against rack-level failures. For production, NetworkTopologyStrategy with RF=3 per datacenter is the standard. You specify the replication factor per datacenter: {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}. SimpleStrategy in production is a correctness issue, not just a best practice violation — a rack failure can take down all replicas of a partition.
  • QWhat is a tombstone in Cassandra, why does it degrade read performance, and how do you prevent tombstone accumulation?SeniorReveal
    A tombstone is a deletion marker written when data is deleted in Cassandra. Because SSTables are immutable, Cassandra cannot remove data in place — it writes a tombstone that is propagated to replicas and physically removed only after gc_grace_seconds (default 10 days) AND after a compaction runs. Tombstones degrade reads because Cassandra must scan through all tombstones in a partition to find live data, and each scanned tombstone counts against the tombstone_failure_threshold (default 100,000). Prevention strategies: use TTL on INSERT instead of explicit DELETE for time-bounded data; avoid nullable columns that generate cell-level tombstones; ensure nodetool repair runs within gc_grace_seconds so tombstones are safe to purge; monitor tombstone ratio with nodetool tablehistograms.
  • QWhat is Storage Attached Indexing (SAI) in Cassandra 5.0, and how does it differ from SASI?SeniorReveal
    SAI (Storage Attached Indexes) is the secondary indexing framework introduced in Cassandra 5.0 that replaces SASI (Storage Attached Secondary Index). SASI required separate index files and had limited query support. SAI is built into the SSTable storage format, supports equality, range queries, and vector approximate nearest neighbor (ANN) search, and has significantly lower write overhead than SASI. SASI is deprecated in Cassandra 5.0. The key operational difference: SAI indexes integrate with the native SSTable format, so there are no separate index tables to maintain, and compaction handles index data along with row data automatically. For vector search workloads, SAI with the cosine or dot-product similarity function enables embedding lookup without a separate vector database.

Frequently Asked Questions

Does Cassandra support ACID transactions?

Cassandra supports lightweight transactions (LWT) using the Paxos protocol for conditional updates within a single partition, but they come with a significant performance cost — approximately 4x the latency of a regular write due to the multi-round-trip Paxos consensus. For most use cases, Cassandra's eventual consistency with LOCAL_QUORUM provides the right balance. Full multi-row ACID transactions across partitions are not supported — design your data model to avoid needing them. If you genuinely need cross-partition transactions, evaluate whether Cassandra is the right database for that specific workload.

How do I choose the right compaction strategy?

STCS (Size-Tiered Compaction Strategy) is the default and works well for write-heavy workloads where data is rarely updated. LCS (Leveled Compaction Strategy) provides better read performance and bounded space amplification — use it for read-heavy workloads where the extra compaction I/O is acceptable. TWCS (Time-Window Compaction Strategy) is ideal for time-series data with TTL, as it groups SSTables by time window and drops entire windows when data expires — avoiding the need to merge expired tombstones with live data. Cassandra 5.0 introduced the Unified Compaction Strategy (UCS) as an experimental option that adapts behavior based on workload patterns.

What is a tombstone in Cassandra and why should I care?

A tombstone is a marker that indicates a row or cell has been deleted. Cassandra does not immediately remove data — it writes a tombstone that is only physically cleaned up after gc_grace_seconds (default 10 days) AND after compaction runs AND after repair has propagated the tombstone to all replicas. Excessive tombstones degrade read performance because Cassandra must scan through them during reads. Hitting the tombstone_failure_threshold (default 100,000 per query) causes a TombstoneOverwhelmingException that aborts the read entirely. Prevention: use TTL on INSERT, run repair regularly within gc_grace_seconds, and monitor tombstone ratios with nodetool tablehistograms.

Can I use secondary indexes in Cassandra?

In Cassandra 5.0, the recommended secondary indexing approach is Storage Attached Indexes (SAI). SAI supports equality and range queries on non-primary-key columns without ALLOW FILTERING, and adds vector ANN search for embedding-based workloads. SASI (Storage Attached Secondary Index) is deprecated in Cassandra 5.0 — do not create new SASI indexes. The original legacy secondary indexes are still present but have severe performance limitations. The general rule: SAI for selective queries (< 5% of partition rows), dedicated table for known stable query patterns with high query volume.

When should I use Cassandra vs PostgreSQL?

Use Cassandra when: you need write throughput that a single PostgreSQL instance cannot absorb, your dataset exceeds 1TB and growing, you need multi-datacenter active-active replication, or your data naturally fits a time-series or event-log model. Use PostgreSQL when: you need ACID transactions across multiple tables, your queries are ad-hoc and unpredictable, your data is relational and normalized, or your scale is small enough that a single well-tuned PostgreSQL instance handles the load. The worst outcome is starting with Cassandra for a workload that needs joins and transactions — the data model migration cost is high.

🔥

That's Cassandra. Mark it forged?

8 min read · try the examples if you haven't

Previous
Migrating Oracle PL/SQL to PostgreSQL – Common Errors
1 / 4 · Cassandra
Next
Cassandra Data Model and Keyspaces