Introduction to Apache Cassandra
- Apache Cassandra is a masterless, peer-to-peer distributed NoSQL database favoring availability over strong consistency
- Data is distributed via consistent hashing across a ring of equal nodes — no single point of failure
- Writes go to any node and propagate to replicas based on Replication Factor
- LSM-Tree storage engine delivers extreme write throughput — sequential I/O, no read-before-write
- Query-Driven Design: model tables around queries, not entities — joins do not exist
- Cassandra 5.0 introduced Storage Attached Indexes (SAI) — the correct answer to secondary indexing from 2024 onward
- Biggest mistake: over-normalizing data like in SQL — denormalization is the correct Cassandra pattern
Production Debug Guide — from slow queries to node failures

| Symptom | Diagnostic commands |
|---|---|
| Node unresponsive or high latency | `nodetool status`; `nodetool tpstats \| grep -E 'pending\|blocked'` |
| Data inconsistency after node recovery | `nodetool repair --full forge_system`; `nodetool getendpoints forge_system audit_logs <partition_key>` |
| Disk space critical (>85% used) | `nodetool cfstats forge_system.audit_logs \| grep 'Space used'`; `find /var/lib/cassandra/data -name '*.db' \| wc -l` |
Apache Cassandra is a peer-to-peer distributed NoSQL database designed to handle massive amounts of data across many commodity servers. Originally developed at Facebook to power Inbox Search, it provides high availability with no single point of failure and linear scalability.
In this guide, we'll break down exactly what makes Cassandra unique in the crowded database landscape, why its masterless design beats traditional leader-follower models for write-heavy workloads, how to design partition keys that don't become time bombs, and how to connect a Java application using the DataStax driver. By the end, you'll understand not just what Cassandra is, but how to avoid the mistakes that cause production outages.
What Is Apache Cassandra and Why Does It Exist?
Apache Cassandra was designed to solve the availability vs. consistency trade-off in distributed systems, specifically favoring high availability and partition tolerance — the AP side of the CAP theorem. Traditional relational databases struggle to scale horizontally because they rely on a central master node that becomes both the bottleneck and the single point of failure. Cassandra eliminates this bottleneck entirely by treating every node in the cluster as an equal peer.
This architecture allows linear scaling: adding a new node to the cluster provides a predictable, proportional increase in performance and storage capacity. There is no leader election, no failover window, and no promotion ceremony when a node dies. Its neighbors simply serve its data until it comes back.
Before writing any application code, you need a keyspace and a table. A keyspace is Cassandra's equivalent of a database schema — it defines the replication strategy and factor. The example below uses NetworkTopologyStrategy, which is the only strategy you should use in production. SimpleStrategy is shown for contrast because you will encounter it in tutorials, and understanding why it is wrong is as important as knowing the right answer.
Note the PRIMARY KEY (service_name, log_time) design. service_name is the partition key — it controls which node stores the data. log_time is the clustering column — it controls the sort order within the partition. This is the fundamental unit of Cassandra data modeling, and every decision downstream flows from getting it right.
```sql
-- ── SimpleStrategy: development only, single datacenter ─────────────────────
-- Never use in production — it ignores rack and datacenter topology.
-- Replicas are placed on the next N nodes clockwise in the ring,
-- which means two replicas can land on the same physical rack.
CREATE KEYSPACE IF NOT EXISTS forge_dev
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- ── NetworkTopologyStrategy: production standard ─────────────────────────────
-- Rack-aware and datacenter-aware placement.
-- With RF=3 across 3 racks, you can lose an entire rack and still serve queries.
CREATE KEYSPACE IF NOT EXISTS forge_system
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'datacenter1': 3
};

USE forge_system;

-- ── Table design: audit_logs ─────────────────────────────────────────────────
-- Partition key: service_name → determines which node(s) store this row
-- Clustering column: log_time → determines sort order within the partition
-- CLUSTERING ORDER BY DESC: most recent logs are read first with no sort overhead
CREATE TABLE IF NOT EXISTS audit_logs (
  service_name text,
  log_time timestamp,
  request_id uuid,
  payload text,
  PRIMARY KEY (service_name, log_time)
) WITH CLUSTERING ORDER BY (log_time DESC)
  AND default_time_to_live = 2592000;  -- 30-day TTL: logs auto-expire

-- ── Writing data: any node accepts the write ────────────────────────────────
INSERT INTO audit_logs (service_name, log_time, request_id, payload)
VALUES ('payment-service', toTimestamp(now()), uuid(), '{"status":"ok"}');

INSERT INTO audit_logs (service_name, log_time, request_id, payload)
VALUES ('payment-service', toTimestamp(now()), uuid(), '{"status":"retry"}');

-- ── Reading data: the query matches the partition key exactly ────────────────
-- This is a single-partition read — one node handles it, response in <5ms
SELECT service_name, log_time, payload
FROM audit_logs
WHERE service_name = 'payment-service'
  AND log_time > '2026-01-01 00:00:00+0000'
LIMIT 100;
```
Keyspace 'forge_dev' created.
Keyspace 'forge_system' created.
Table 'audit_logs' created.
-- INSERT: applied
-- INSERT: applied
service_name | log_time | payload
-----------------+---------------------------------+---------------------
payment-service | 2026-04-18 14:23:01.000000+0000 | {"status":"retry"}
payment-service | 2026-04-18 14:22:58.000000+0000 | {"status":"ok"}
(2 rows)
- Every node owns a range of the token ring — data is distributed by hashing the partition key to a token
- Any node can accept reads and writes — there is no master, no leader election, no failover delay
- Replication factor determines how many nodes hold a copy of each partition
- If a node goes down, its neighbors serve its data immediately — no recovery window
- Adding a node splits an existing token range and streams data to the new node automatically — linear scaling
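The ring mechanics in these bullets can be sketched in a few lines of Java. This is a toy model for intuition only — real Cassandra uses Murmur3 hashing and vnodes (many tokens per node), neither of which is shown here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy token ring: each node owns one token; a partition key hashes to a
// token, and replicas are the owning node plus the next RF-1 distinct
// nodes clockwise (SimpleStrategy-style placement).
public class TokenRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // token → node

    public void addNode(String name, long token) { ring.put(token, name); }

    // Stand-in for Murmur3: any deterministic hash works for illustration.
    static long token(String partitionKey) {
        return partitionKey.hashCode() * 2654435761L;
    }

    public List<String> replicas(String partitionKey, int rf) {
        List<String> result = new ArrayList<>();
        Long k = ring.ceilingKey(token(partitionKey)); // first node at/after token
        if (k == null) k = ring.firstKey();            // wrap around the ring
        while (result.size() < rf) {
            String node = ring.get(k);
            if (!result.contains(node)) result.add(node);
            k = ring.higherKey(k);
            if (k == null) k = ring.firstKey();        // wrap
        }
        return result;
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode("node1", Long.MIN_VALUE / 2);
        ring.addNode("node2", 0L);
        ring.addNode("node3", Long.MAX_VALUE / 2);
        // The same partition key always maps to the same replica set —
        // this is why any node can coordinate: the mapping is deterministic.
        System.out.println(ring.replicas("payment-service", 2));
    }
}
```

Adding a node to this toy is just inserting one more token, which splits an existing range — the sketch version of the "linear scaling" bullet above.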
Partition Keys, Clustering Columns, and Primary Key Design
If I had to point to one thing that separates engineers who run Cassandra well from engineers who generate production incidents, it's partition key design. Every other Cassandra concept — consistency levels, compaction, gossip — is secondary to getting this right.
The primary key in Cassandra has two distinct parts that most tutorials blur together:
The partition key determines which node stores the data. Cassandra hashes the partition key value to a token, and that token falls in a node's token range. All rows with the same partition key are stored together on disk — this is what makes single-partition reads so fast. But it's also why a bad partition key creates a hot spot: if thousands of writes share the same partition key value, all of them land on the same two or three nodes.
Clustering columns determine the sort order of rows within a partition. They are stored sorted on disk in the SSTable. This means range queries on clustering columns — WHERE log_time > x AND log_time < y — are essentially free: Cassandra reads a contiguous block of sorted data from a single SSTable. There is no sort step, no heap sort, no temporary buffer.
The combination of partition key and clustering columns forms the primary key. Every row within a partition is identified by a unique combination of clustering column values. A write that reuses an existing combination of partition key and clustering values will upsert that row — not create a new one.
Partition size is the operational concern that flows from this design. Because all rows with the same partition key live on the same nodes, an unbounded partition grows indefinitely and eventually causes the problems described in the production incident at the top of this guide. The rule I operate by: keep partitions under 100MB and under 100,000 rows. If either limit is at risk, add a time bucket to the partition key.
```sql
-- ── Example 1: Simple partition key — one column ────────────────────────────
-- All events for a user are in one partition.
-- Problem: a user with 10 million events creates a 4GB partition.
CREATE TABLE user_events_v1 (
  user_id uuid,
  event_time timestamp,
  event_type text,
  metadata text,
  PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

-- ── Example 2: Composite partition key — time bucketing ──────────────────────
-- Now the partition key is (user_id, bucket).
-- Each day gets its own partition — bounded growth, predictable size.
-- 'bucket' is a date string like '2026-04-18', computed in the application.
-- Trade-off: queries spanning multiple days require multiple partition reads.
CREATE TABLE user_events_v2 (
  user_id uuid,
  bucket text,          -- application sets this: '2026-04-18'
  event_time timestamp,
  event_type text,
  metadata text,
  PRIMARY KEY ((user_id, bucket), event_time)
  --          ^----------------^ composite partition key: two columns together
) WITH CLUSTERING ORDER BY (event_time DESC);

-- ── Writing to the bucketed table ─────────────────────────────────────────────
-- Application code computes the bucket from the event timestamp
-- then includes it in every write and read
INSERT INTO user_events_v2 (user_id, bucket, event_time, event_type, metadata)
VALUES (
  a3b4c5d6-e7f8-9012-abcd-ef1234567890,
  '2026-04-18',
  toTimestamp(now()),
  'page_view',
  '{"page":"/dashboard"}'
);

-- ── Reading from a specific bucket ───────────────────────────────────────────
-- This is a single-partition read: fast, predictable, no cluster scan
SELECT event_time, event_type, metadata
FROM user_events_v2
WHERE user_id = a3b4c5d6-e7f8-9012-abcd-ef1234567890
  AND bucket = '2026-04-18'
  AND event_time > '2026-04-18 08:00:00+0000'
LIMIT 50;

-- ── What ALLOW FILTERING looks like — and why it is dangerous ─────────────────
-- This forces Cassandra to scan EVERY partition on EVERY node
-- to find rows where event_type = 'purchase'.
-- With 50 nodes and 500GB of data, this query touches all of it.
-- Never run this in production without SAI or a dedicated table.
SELECT * FROM user_events_v2
WHERE event_type = 'purchase'
ALLOW FILTERING;  -- full cluster scan — do not do this
```
Table 'user_events_v2' created.
-- INSERT: applied
event_time | event_type | metadata
-----------------------------------+------------+---------------------
2026-04-18 14:23:01.000000+0000 | page_view | {"page":"/dashboard"}
(1 rows)
-- ALLOW FILTERING query:
Warnings: Aggregation query used without partition key (potential performance issue)
-- This scanned 50 nodes. Do not use in production.
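Both the INSERT and every SELECT against a bucketed table like user_events_v2 depend on the application computing the bucket the same way everywhere. A minimal Java sketch — the dailyBucket helper name is mine, not a driver API:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Compute the 'bucket' component of a composite partition key.
// Daily buckets like '2026-04-18' cap partition growth at one day of rows.
// Always bucket in UTC — mixing zones would split one logical day across
// two partitions depending on the writer's locale.
public class PartitionBucket {
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    static String dailyBucket(Instant eventTime) {
        return DAY.format(eventTime);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2026-04-18T14:23:01Z");
        // The writer and every reader must agree on this value:
        System.out.println(dailyBucket(t)); // prints 2026-04-18
    }
}
```

If daily partitions are still too large (the 100MB / 100,000-row rule from above), the same helper can shrink the window to hourly buckets by changing the format pattern.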
The LSM-Tree Write Path — Why Cassandra Writes So Fast
Understanding Cassandra's write performance requires understanding one design decision: all writes are sequential. There is no read-before-write, no in-place update, no locking. The LSM-Tree (Log-Structured Merge-Tree) storage engine is the reason Cassandra can absorb write rates that would saturate a PostgreSQL cluster.
Here's what actually happens when your application executes an INSERT or UPDATE:
- Commit log write — The mutation is appended to a sequential commit log on disk. Sequential I/O. If the node crashes after this step, the data can be recovered from the commit log on restart.
- Memtable write — The mutation is written to an in-memory structure called the memtable. All writes to the same partition are merged here — updates, deletes, and inserts coexist. The memtable is the 'current view' of recent data.
- Memtable flush to SSTable — When the memtable reaches a configured size, it is flushed to disk as an immutable SSTable (Sorted String Table). Immutable means once written, it is never modified. Updates and deletes don't rewrite existing SSTables — they write new entries with a higher timestamp.
- Compaction — Background compaction merges multiple SSTables into fewer, larger ones. During compaction, entries for the same partition key are merged by timestamp — newer wins. Deleted entries (tombstones) are purged after gc_grace_seconds.
The trade-off this creates: reads are more complex than writes. A read for a partition may need to check the memtable and multiple SSTables to find all the data. Bloom filters (one per SSTable) give Cassandra a probabilistic fast path — 'definitely not in this SSTable' — before opening the file. Partition indexes give the exact byte offset. By 2026, Cassandra 5.0's trie-based memtable and storage improvements have significantly reduced the read amplification that older LSM implementations suffered.
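As a mental model, the write path and timestamp-wins read path can be sketched as a toy LSM engine in Java. This is purely illustrative — real Cassandra adds the commit log, bloom filters, partition indexes, and background compaction scheduling, none of which appear here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy LSM engine: sorted in-memory memtable, immutable flushed "SSTables",
// timestamp-wins merge on read, tombstones purged by compaction.
public class ToyLsm {
    record Cell(String value, long timestamp) {}   // value == null → tombstone

    private TreeMap<String, Cell> memtable = new TreeMap<>();
    private final List<Map<String, Cell>> sstables = new ArrayList<>();

    // Writes touch only the memtable — no read-before-write, no locking.
    public void write(String key, String value, long ts) {
        Cell existing = memtable.get(key);
        if (existing == null || existing.timestamp() < ts) {
            memtable.put(key, new Cell(value, ts));    // newer timestamp wins
        }
    }

    public void delete(String key, long ts) { write(key, null, ts); } // tombstone

    // Flush: the memtable becomes an immutable SSTable, never modified again.
    public void flush() {
        sstables.add(Map.copyOf(memtable));
        memtable = new TreeMap<>();
    }

    // Read: merge the memtable and every SSTable by timestamp — this is the
    // read amplification the prose describes: more SSTables, more lookups.
    public String read(String key) {
        Cell best = memtable.get(key);
        for (Map<String, Cell> sst : sstables) {
            Cell c = sst.get(key);
            if (c != null && (best == null || c.timestamp() > best.timestamp())) {
                best = c;
            }
        }
        return best == null ? null : best.value();     // tombstone reads as absent
    }

    // Compaction: merge all SSTables, keep the newest cell per key,
    // physically drop tombstones (the gc_grace_seconds wait is omitted).
    public void compact() {
        Map<String, Cell> merged = new TreeMap<>();
        for (Map<String, Cell> sst : sstables) {
            sst.forEach((k, c) -> merged.merge(k, c,
                (a, b) -> a.timestamp() >= b.timestamp() ? a : b));
        }
        merged.values().removeIf(c -> c.value() == null);
        sstables.clear();
        sstables.add(merged);
    }
}
```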
```sql
-- ── Demonstrating Cassandra's write semantics ────────────────────────────────
-- INSERT and UPDATE are identical in Cassandra — both are 'upserts'.
-- There is no read-before-write. The write goes to memtable immediately.
INSERT INTO forge_system.audit_logs (service_name, log_time, request_id, payload)
VALUES ('order-service', '2026-04-18 10:00:00+0000', uuid(), '{"amount":99.99}');

-- UPDATE writes a new cell with a higher timestamp — the old cell still exists
-- in the SSTable until compaction. Cassandra resolves conflicts by timestamp.
UPDATE forge_system.audit_logs
SET payload = '{"amount":99.99,"status":"confirmed"}'
WHERE service_name = 'order-service'
  AND log_time = '2026-04-18 10:00:00+0000';

-- ── Deletes write tombstones, not erasures ───────────────────────────────────
-- This DELETE writes a tombstone marker — it does NOT remove the data immediately.
-- The data will be physically removed after gc_grace_seconds (default: 10 days)
-- AND after a compaction runs AND after repair has propagated the tombstone.
DELETE FROM forge_system.audit_logs
WHERE service_name = 'order-service'
  AND log_time = '2026-04-18 10:00:00+0000';

-- ── Checking SSTable and memtable state via nodetool ─────────────────────────
-- Run from the shell, not from cqlsh:
--   nodetool flush forge_system                       → forces memtable to flush to SSTable
--   nodetool compact forge_system                     → forces compaction on the keyspace
--   nodetool tablehistograms forge_system audit_logs  → read/write latency breakdown

-- ── TTL as an alternative to application-level deletes ───────────────────────
-- When data has a natural expiry, use TTL instead of explicit DELETE.
-- Cassandra handles the tombstone lifecycle automatically.
-- Here, this log entry expires in 86400 seconds (24 hours).
INSERT INTO forge_system.audit_logs (service_name, log_time, request_id, payload)
VALUES ('cache-service', toTimestamp(now()), uuid(), '{"cache":"hit"}')
USING TTL 86400;
```
-- UPDATE: applied
-- DELETE: applied (tombstone written)
-- nodetool tablehistograms forge_system audit_logs
forge_system/audit_logs histograms
Offset SSTables Write Latency Read Latency
1 0 74us 0us
2 1 112us 145us
3 1 118us 148us
-- SSTable count: 1 (after flush)
-- Tombstone live cells ratio: 0.12 (acceptable — below 0.20 threshold)
-- INSERT with TTL: applied
- A tombstone is a write — it goes through the same memtable → SSTable → compaction path as regular data
- Tombstones accumulate until gc_grace_seconds passes (default 10 days) AND compaction runs
- During reads, Cassandra must scan through tombstones to find live data — high tombstone counts degrade reads
- If a node was down when a DELETE ran, it must receive the tombstone via repair before gc_grace_seconds expires
- Common tombstone traps: nullable columns, frequent row-level deletes, and TTL-on-collections
Common Mistakes and How to Avoid Them
When moving from SQL to Cassandra, the mistakes fall into two categories: design mistakes you make before writing a line of code, and operational mistakes you make after the cluster is running. The design mistakes are more expensive — they require data migration to fix. The operational ones are more urgent — they cause incidents.
The single most consequential design mistake is treating Cassandra like a relational database. Normalizing data, expecting joins, using secondary indexes as a query strategy, filtering on non-primary-key columns with ALLOW FILTERING — all of these patterns assume a query engine that Cassandra deliberately does not have. Cassandra's query engine is intentionally limited because that limitation is what makes it fast at scale.
For the operational side, the development cluster below is the starting point for understanding how the cluster behaves before you have 12 production nodes to experiment on. The compose file uses Cassandra 5.0 — pinned, not latest. Using latest in any tutorial or development setup is a trap: you will develop against a different version than production, and Cassandra has meaningful behavioral differences between major versions.
```yaml
# ── Three-node Cassandra development cluster ─────────────────────────────────
# Pinned to Cassandra 5.0 — do not use 'latest' in development or production.
# Cassandra 5.0 ships with significant storage engine improvements:
#   - Trie-based memtable (better memory efficiency)
#   - Storage Attached Indexes (SAI) replacing legacy SASI
#   - Unified compaction strategy option
services:
  cassandra-node1:
    image: cassandra:5.0
    container_name: forge-cassandra-1
    hostname: cassandra-node1
    ports:
      - "9042:9042"
    environment:
      - CASSANDRA_CLUSTER_NAME=TheCodeForgeCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack1
      - CASSANDRA_SEEDS=cassandra-node1,cassandra-node2
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=256M
    volumes:
      - cassandra-data1:/var/lib/cassandra
    healthcheck: &cassandra-health
      test: ["CMD-SHELL", "nodetool status | grep -E '^UN'"]
      interval: 30s
      timeout: 10s
      retries: 10
      start_period: 60s

  cassandra-node2:
    image: cassandra:5.0
    container_name: forge-cassandra-2
    hostname: cassandra-node2
    environment:
      - CASSANDRA_CLUSTER_NAME=TheCodeForgeCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack2
      - CASSANDRA_SEEDS=cassandra-node1,cassandra-node2
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=256M
    volumes:
      - cassandra-data2:/var/lib/cassandra
    # Nodes 2 and 3 need the same healthcheck — 'condition: service_healthy'
    # only works when the dependency actually defines a healthcheck.
    healthcheck: *cassandra-health
    depends_on:
      cassandra-node1:
        condition: service_healthy

  cassandra-node3:
    image: cassandra:5.0
    container_name: forge-cassandra-3
    hostname: cassandra-node3
    environment:
      - CASSANDRA_CLUSTER_NAME=TheCodeForgeCluster
      - CASSANDRA_DC=datacenter1
      - CASSANDRA_RACK=rack3
      - CASSANDRA_SEEDS=cassandra-node1,cassandra-node2
      - CASSANDRA_ENDPOINT_SNITCH=GossipingPropertyFileSnitch
      - MAX_HEAP_SIZE=1G
      - HEAP_NEWSIZE=256M
    volumes:
      - cassandra-data3:/var/lib/cassandra
    healthcheck: *cassandra-health
    depends_on:
      cassandra-node2:
        condition: service_healthy

volumes:
  cassandra-data1:
  cassandra-data2:
  cassandra-data3:
```
# Run: docker exec forge-cassandra-1 nodetool status
# Datacenter: datacenter1
# ========================
# Status=Up/Down
# |/ State=Normal/Leaving/Joining/Moving
# -- Address Load Tokens Owns Host ID Rack
# UN 172.20.0.2 195.21 KiB 16 33.3% a1b2c3d4-e5f6-7890-abcd-ef1234567890 rack1
# UN 172.20.0.3 207.43 KiB 16 33.3% b2c3d4e5-f6a7-8901-bcde-f12345678901 rack2
# UN 172.20.0.4 189.67 KiB 16 33.4% c3d4e5f6-a7b8-9012-cdef-123456789012 rack3
#
# UN = Up, Normal — all three nodes are healthy and distributing load evenly
# Each node owns ~33% of the token ring
Storage Attached Indexes (SAI) — Secondary Indexing in Cassandra 5.0
Before Cassandra 5.0, secondary indexing was the most common source of bad advice in Cassandra guides. The options were:
- SASI (Storage Attached Secondary Index) — better than the original secondary indexes, still had significant operational overhead and limited query capabilities.
- Original secondary indexes — effectively hidden tables, painful performance characteristics, deprecated by the Cassandra community for most production use.
- ALLOW FILTERING — not an index at all, a full cluster scan dressed up as a query option.
Cassandra 5.0 introduced Storage Attached Indexes (SAI), which supersede SASI. SAI is built into the SSTable storage format, requires no separate index tables, and supports range queries, equality, and — for vector embeddings — approximate nearest neighbor (ANN) search. Vector search support makes Cassandra 5.0 relevant for AI-adjacent workloads that weren't possible before.
SAI doesn't make secondary indexing free. It's still more expensive than a partition key query because it must look across all SSTables on the node. But the cost is manageable for selective queries — queries that filter down to a small result set. The rule I apply: if your SAI query returns more than 5-10% of the rows in the partition, the query is not selective enough for an index to help, and you should create a dedicated table instead.
SASI is deprecated in Cassandra 5.0. Any guide that recommends SASI for new deployments is out of date.
```sql
-- ── Storage Attached Indexes: Cassandra 5.0+ ─────────────────────────────────
-- SAI allows querying on non-primary-key columns without ALLOW FILTERING.
-- SASI is deprecated in Cassandra 5.0 — use SAI for all new indexes.
USE forge_system;

-- Table: product catalog — partition key is category
CREATE TABLE IF NOT EXISTS products (
  category text,
  product_id uuid,
  name text,
  price decimal,
  in_stock boolean,
  rating float,
  PRIMARY KEY (category, product_id)
);

-- ── Create SAI indexes on non-primary-key columns ────────────────────────────
-- Syntax: CREATE INDEX index_name ON table(column) USING 'sai';
CREATE INDEX products_price_idx  ON products(price)    USING 'sai';
CREATE INDEX products_stock_idx  ON products(in_stock) USING 'sai';
CREATE INDEX products_rating_idx ON products(rating)   USING 'sai';

-- ── SAI-enabled queries: no ALLOW FILTERING required ─────────────────────────
-- Range query on price — only possible with SAI or as a partition key
SELECT product_id, name, price
FROM products
WHERE category = 'electronics'
  AND price > 100.00 AND price < 500.00;

-- Combined filter: in stock AND rating above threshold
-- Both conditions use SAI — Cassandra intersects the index results
SELECT product_id, name, price, rating
FROM products
WHERE category = 'electronics'
  AND in_stock = true
  AND rating > 4.5;

-- ── Vector search with SAI (AI workloads) ────────────────────────────────────
-- Cassandra 5.0 supports ANN (Approximate Nearest Neighbor) search via SAI
-- Useful for semantic search, recommendation systems, embedding lookup
CREATE TABLE IF NOT EXISTS product_embeddings (
  category text,
  product_id uuid,
  embedding vector<float, 1536>,  -- 1536-dimensional OpenAI embedding
  PRIMARY KEY (category, product_id)
);

-- SAI index on the vector column for ANN search
CREATE INDEX embeddings_ann_idx ON product_embeddings(embedding)
USING 'sai'
WITH OPTIONS = {'similarity_function': 'cosine'};

-- ANN query: find the 10 most similar products to a given embedding vector
-- This runs on-node without a cluster scan — orders of magnitude faster than
-- brute-force similarity across all rows
SELECT product_id
FROM product_embeddings
WHERE category = 'electronics'
ORDER BY embedding ANN OF [0.1, 0.2, 0.3, ...]  -- 1536 floats from your model
LIMIT 10;
```
Index 'products_stock_idx' created.
Index 'products_rating_idx' created.
-- Price range query:
product_id | name | price
--------------------------------------+-------------------+-------
d1e2f3a4-b5c6-7890-defg-h12345678901 | Forge Headphones | 149.99
e2f3a4b5-c6d7-8901-efgh-i23456789012 | Cloud Core Pro | 299.00
(2 rows)
-- Combined filter query:
product_id | name | price | rating
--------------------------------------+-------------------+--------+-------
d1e2f3a4-b5c6-7890-defg-h12345678901 | Forge Headphones | 149.99 | 4.7
(1 rows)
-- Vector index created.
-- ANN query: (10 results returned based on cosine similarity)
- SAI is right when: the indexed column has high cardinality, queries are selective (< 5% of partition rows), and you cannot predict all query patterns at schema design time
- Dedicated table is right when: you need the absolute fastest reads, the query pattern is known and stable, or selectivity is too low for SAI to help
- Vector search via SAI is the recommended pattern for embedding lookup in Cassandra 5.0+ — no separate vector database required for moderate scale
- SASI is deprecated in Cassandra 5.0 — do not create new SASI indexes on any new deployment
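The 'similarity_function': 'cosine' option ranks rows by cosine similarity between vectors. A minimal Java illustration of the metric that ANN search approximates — the index internals (graph-based approximate search) are not shown, only the similarity function itself:

```java
// Cosine similarity: 1.0 for vectors pointing the same direction,
// 0.0 for orthogonal vectors. ANN search returns the rows whose stored
// embedding scores highest against the query vector, without a full scan.
public class Cosine {
    static double similarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot   += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        float[] query = {0.1f, 0.2f, 0.3f};
        // Same direction, different magnitude → similarity ≈ 1.0
        System.out.println(similarity(query, new float[]{0.2f, 0.4f, 0.6f}));
        // Orthogonal vectors → similarity ≈ 0.0
        System.out.println(similarity(query, new float[]{0.3f, 0.0f, -0.1f}));
    }
}
```

Because cosine ignores magnitude, it is the usual choice for text embeddings; SAI also supports other similarity functions, but cosine is what the example above configures.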
Connecting a Java Application — DataStax Java Driver
Every other concept in this guide — rings, partition keys, consistency levels, SAI — becomes concrete once you see how an application actually talks to Cassandra. The standard driver by 2026 is the DataStax Java Driver for Apache Cassandra, version 4.x. It replaced the 3.x driver line and the Thrift-era clients such as Netflix's Astyanax.
Three things the driver handles that you do not want to implement yourself:
- Load balancing policy — the driver tracks which nodes are in which datacenter, which are up, and which are token-aware. With the DefaultLoadBalancingPolicy, queries are automatically routed to the replica that owns the data — no unnecessary hops through coordinator nodes.
- Connection pooling — each driver instance maintains a pool of connections per host. You create one CqlSession per application and reuse it for the lifetime of the process. Creating a new session per request is one of the most common Java + Cassandra performance mistakes.
- Prepared statements — CQL statements should be prepared once and executed many times. Prepared statements are parsed and validated once by the coordinator; subsequent executions send only the bound values. Unprepared statements parse on every execution — at scale, the coordinator overhead becomes measurable.
The Spring Data Cassandra abstraction sits on top of this driver and is suitable for straightforward CRUD workloads. For anything involving custom query patterns, consistency level overrides, or performance-critical paths, I prefer going directly to the driver — the abstraction layer hides the consistency level control that Cassandra's tunable consistency depends on.
```java
// ── pom.xml dependencies ──────────────────────────────────────────────────────
// <dependency>
//   <groupId>com.datastax.oss</groupId>
//   <artifactId>java-driver-core</artifactId>
//   <version>4.17.0</version>
// </dependency>
// <dependency>
//   <groupId>com.datastax.oss</groupId>
//   <artifactId>java-driver-query-builder</artifactId>
//   <version>4.17.0</version>
// </dependency>

// ── CassandraConfig.java — single CqlSession for the application lifetime ────
package io.thecodeforge.config;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.net.InetSocketAddress;
import java.time.Duration;

@Configuration
public class CassandraConfig {

    // ONE CqlSession per application — not one per request.
    // CqlSession is thread-safe and manages its own connection pool internally.
    // Creating a session per request is one of the most common performance mistakes.
    @Bean(destroyMethod = "close")
    public CqlSession cqlSession() {
        return CqlSession.builder()
            .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
            .withLocalDatacenter("datacenter1")
            .withKeyspace("forge_system")
            .withConfigLoader(
                DriverConfigLoader.programmaticBuilder()
                    // Request timeout: fail fast rather than hold threads
                    .withDuration(DefaultDriverOption.REQUEST_TIMEOUT, Duration.ofMillis(500))
                    // Consistency level: LOCAL_QUORUM for production
                    // Tolerates one replica being down in the local DC
                    .withString(DefaultDriverOption.REQUEST_CONSISTENCY, "LOCAL_QUORUM")
                    // Token-aware routing: send queries directly to the
                    // replica that owns the partition — avoids coordinator hop
                    .withString(DefaultDriverOption.LOAD_BALANCING_POLICY_CLASS,
                        "DefaultLoadBalancingPolicy")
                    .build()
            )
            .build();
    }
}

// ── AuditLog.java — simple value object ──────────────────────────────────────
package io.thecodeforge.model;

import java.time.Instant;
import java.util.UUID;

public record AuditLog(
    String serviceName,
    Instant logTime,
    UUID requestId,
    String payload
) {}

// ── AuditLogRepository.java — DataStax driver with prepared statements ────────
package io.thecodeforge.repository;

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.*;
import io.thecodeforge.model.AuditLog;
import org.springframework.stereotype.Repository;
import java.time.Instant;
import java.util.List;
import java.util.UUID;
import java.util.stream.StreamSupport;

@Repository
public class AuditLogRepository {

    private final CqlSession session;

    // Prepared statements: parse once, execute many times.
    // Prepared at construction time — fails fast on startup if CQL is wrong.
    private final PreparedStatement insertStmt;
    private final PreparedStatement selectRecentStmt;

    public AuditLogRepository(CqlSession session) {
        this.session = session;
        this.insertStmt = session.prepare(
            "INSERT INTO audit_logs (service_name, log_time, request_id, payload) " +
            "VALUES (?, ?, ?, ?) USING TTL 2592000"
        );
        this.selectRecentStmt = session.prepare(
            "SELECT service_name, log_time, request_id, payload " +
            "FROM audit_logs " +
            "WHERE service_name = ? AND log_time > ? " +
            "LIMIT 100"
        );
    }

    public void save(AuditLog log) {
        BoundStatement bound = insertStmt.bind(
            log.serviceName(), log.logTime(), log.requestId(), log.payload()
        );
        // execute() is synchronous — use executeAsync() for high-throughput writes
        session.execute(bound);
    }

    public List<AuditLog> findRecent(String serviceName, Instant since) {
        BoundStatement bound = selectRecentStmt.bind(serviceName, since);
        ResultSet rs = session.execute(bound);
        return StreamSupport.stream(rs.spliterator(), false)
            .map(row -> new AuditLog(
                row.getString("service_name"),
                row.getInstant("log_time"),
                row.getUuid("request_id"),
                row.getString("payload")
            ))
            .toList(); // Java 16+ — returns unmodifiable list
    }

    // ── Async write for high-throughput ingestion paths ──────────────────────
    // Returns CompletableFuture<AsyncResultSet> — caller decides whether to await
    public java.util.concurrent.CompletableFuture<AsyncResultSet> saveAsync(AuditLog log) {
        BoundStatement bound = insertStmt.bind(
            log.serviceName(), log.logTime(), log.requestId(), log.payload()
        );
        return session.executeAsync(bound).toCompletableFuture();
    }
}
```
[main] INFO c.d.o.d.i.core.session.DefaultSession - [forge-session] Initializing
[main] INFO c.d.o.d.i.core.loadbalancing.DefaultLoadBalancingPolicy
- [forge-session] Using datacenter 'datacenter1' for local policy
[main] INFO c.d.o.d.i.core.session.DefaultSession - [forge-session] Initialized
-- Repository.save() called:
-- Prepared statement bound and executed in 2.3ms (LOCAL_QUORUM satisfied)
-- Repository.findRecent('payment-service', 2026-04-18T00:00:00Z):
-- Single-partition read, 2 rows returned in 1.8ms
- CqlSession is thread-safe — inject it as a singleton and share it across all repositories
- The driver manages a connection pool per host automatically — no manual pool management needed
- PreparedStatement objects are also reusable and thread-safe — prepare at startup, bind per request
- executeAsync() returns a CompletableFuture — use it for fire-and-forget writes or high-throughput ingestion
- Token-aware routing in DefaultLoadBalancingPolicy means queries go directly to the replica — no extra coordinator hop
- session.execute() — synchronous, straightforward, compatible with Spring MVC thread model

| Aspect | Traditional RDBMS (MySQL/PostgreSQL) | Apache Cassandra |
|---|---|---|
| Architecture | Centralized (Master-Slave) | Distributed (Peer-to-Peer) |
| Scalability | Vertical (bigger hardware) | Horizontal (more nodes, linear throughput increase) |
| Data Model | Normalized (fixed schema, joins) | Denormalized (flexible schema, one table per query) |
| Availability | High with failover lag (30s-2min) | Always-on (no failover delay, no leader election) |
| Write Speed | Moderate (B-Tree, random I/O, locking) | Extreme (LSM-Tree, sequential I/O, no read-before-write) |
| Read Pattern | Flexible (joins, aggregations, ad-hoc) | Query-driven (design tables around queries) |
| Secondary Indexes | B-Tree indexes, generally efficient | SAI in Cassandra 5.0+ (selective queries only) |
| Transactions | Full ACID across tables and rows | Lightweight transactions (LWT) for single-partition conditional updates only |
| Operational Cost | Lower for small scale (< 1TB) | Lower for massive scale (10TB+, multi-datacenter) |
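The write-speed row is the one that surprises people coming from B-Trees, so here is the idea in miniature: a toy LSM store where writes go blindly into a sorted in-memory memtable, the memtable is flushed into immutable sorted runs, and reads check newest data first. This is an illustrative sketch (class and method names are ours), not Cassandra's actual engine, which adds commit logs, bloom filters, and compaction:

```java
import java.util.*;

// Toy LSM store: shows why writes never read existing data first.
class ToyLsmStore {
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Immutable sorted runs ("SSTables"), newest first
    private final Deque<NavigableMap<String, String>> sstables = new ArrayDeque<>();
    private final int flushThreshold;

    ToyLsmStore(int flushThreshold) { this.flushThreshold = flushThreshold; }

    // Write path: blind insert into the sorted buffer. No read-before-write;
    // an overwrite is simply a newer entry that shadows older ones at read time.
    void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) flush();
    }

    // Flush: the memtable becomes an immutable sorted run, written sequentially.
    void flush() {
        if (memtable.isEmpty()) return;
        sstables.addFirst(Collections.unmodifiableNavigableMap(memtable));
        memtable = new TreeMap<>();
    }

    // Read path: memtable first, then runs newest-to-oldest; first hit wins.
    String get(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (NavigableMap<String, String> run : sstables) {
            v = run.get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```

Because newer runs shadow older ones, an update costs the same as an insert; compaction later merges runs and discards the shadowed versions.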
🎯 Key Takeaways
- Apache Cassandra is a masterless, distributed database designed for zero-downtime and massive scale — every node is equal, any node accepts reads and writes.
- Always model your data around your queries (Query-Driven Design) rather than entities — joins do not exist, and ALLOW FILTERING is a diagnostic tool, not a query strategy.
- Linear scalability means adding nodes provides a predictable, proportional increase in throughput — no rebalancing ceremony, no leader election.
- The Gossip protocol and Snitch configuration are the heartbeat of the cluster's awareness — they power failure detection and topology-aware routing.
- Tunable consistency is Cassandra's operational superpower, and LOCAL_QUORUM is the right default for production: a quorum is floor(RF/2)+1 replicas (2 of 3 with RF=3), counted within the local datacenter, so one replica can fail without blocking reads or writes.
- Denormalization is not a flaw — it is the correct pattern when joins do not exist. Duplicate data across tables; storage is cheap, cross-partition reads are not.
- Tombstones are writes — every DELETE creates a tombstone that degrades reads until compaction clears it. Use TTL on INSERT instead of explicit DELETE for time-bounded data.
- SAI (Storage Attached Indexes) in Cassandra 5.0 replaces SASI and supports range queries and vector ANN search — SASI is deprecated and should not be used in new deployments.
- Create one CqlSession per application, prepare statements at startup, and use LOCAL_QUORUM as the default consistency level — these three driver decisions determine whether your application runs well under load.
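The quorum arithmetic from the consistency takeaway, made concrete (helper names are ours, not a driver API):

```java
// Quorum size for a given replication factor: floor(RF/2) + 1.
// With LOCAL_QUORUM, RF is the replication factor of the local datacenter.
final class QuorumMath {
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1; // integer division == floor
    }
    // Replicas that can be down while quorum reads/writes still succeed
    static int tolerableFailures(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }
}
```

With RF=3 a quorum is 2 nodes and one replica can fail; with RF=5 a quorum is 3 and two can fail. This is also why RF=2 is a poor choice: its quorum is 2, so a single node failure blocks quorum operations.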
Interview Questions on This Topic
- Q: What is the role of the Gossip protocol in an Apache Cassandra cluster? (Mid-level)
- Q: How does Cassandra achieve high write throughput compared to traditional B-Tree based databases? (Mid-level)
- Q: Explain the LSM-Tree (Log-Structured Merge-Tree) and its importance in Cassandra's storage engine. (Senior)
- Q: What is the difference between SimpleStrategy and NetworkTopologyStrategy for replication? (Junior)
- Q: What is a tombstone in Cassandra, why does it degrade read performance, and how do you prevent tombstone accumulation? (Senior)
- Q: What is Storage Attached Indexing (SAI) in Cassandra 5.0, and how does it differ from SASI? (Senior)
Frequently Asked Questions
Does Cassandra support ACID transactions?
Cassandra supports lightweight transactions (LWT) using the Paxos protocol for conditional updates within a single partition, but they come with a significant performance cost — approximately 4x the latency of a regular write due to the multi-round-trip Paxos consensus. For most use cases, Cassandra's eventual consistency with LOCAL_QUORUM provides the right balance. Full multi-row ACID transactions across partitions are not supported — design your data model to avoid needing them. If you genuinely need cross-partition transactions, evaluate whether Cassandra is the right database for that specific workload.
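As a sketch of what LWT looks like in CQL (the keyspace reuses forge_system from this guide; the users table and its columns are hypothetical):

```sql
-- Insert only if the row does not already exist (runs a Paxos round)
INSERT INTO forge_system.users (user_id, email)
VALUES (42, 'a@example.com')
IF NOT EXISTS;

-- Conditional update, still confined to a single partition
UPDATE forge_system.users
SET email = 'b@example.com'
WHERE user_id = 42
IF email = 'a@example.com';
```

The driver surfaces the outcome in the result's `[applied]` column (`wasApplied()` in the Java driver), so always check it: an unapplied LWT is not an error, it is a normal outcome.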
How do I choose the right compaction strategy?
STCS (Size-Tiered Compaction Strategy) is the default and works well for write-heavy workloads where data is rarely updated. LCS (Leveled Compaction Strategy) provides better read performance and bounded space amplification — use it for read-heavy workloads where the extra compaction I/O is acceptable. TWCS (Time-Window Compaction Strategy) is ideal for time-series data with TTL, as it groups SSTables by time window and drops entire windows when data expires — avoiding the need to merge expired tombstones with live data. Cassandra 5.0 introduced the Unified Compaction Strategy (UCS) as an experimental option that adapts behavior based on workload patterns.
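For example, switching the audit_logs table from this guide to TWCS could look like the following sketch; the window settings are illustrative and should be tuned to your TTL and query windows:

```sql
-- Group SSTables into 1-day windows; fully expired windows are dropped whole
ALTER TABLE forge_system.audit_logs
WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': '1'
};
```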
What is a tombstone in Cassandra and why should I care?
A tombstone is a marker that indicates a row or cell has been deleted. Cassandra does not immediately remove data — it writes a tombstone that is only physically cleaned up after gc_grace_seconds (default 10 days) AND after compaction runs AND after repair has propagated the tombstone to all replicas. Excessive tombstones degrade read performance because Cassandra must scan through them during reads. Hitting the tombstone_failure_threshold (default 100,000 per query) causes a TombstoneOverwhelmingException that aborts the read entirely. Prevention: use TTL on INSERT, run repair regularly within gc_grace_seconds, and monitor tombstones per read with nodetool tablestats (average and maximum tombstones per slice).
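The prevention advice in CQL, using the audit_logs schema from this guide: write time-bounded rows with a TTL so they expire without an explicit DELETE (values are illustrative):

```sql
-- Expires automatically after 30 days; no DELETE statement ever runs
INSERT INTO forge_system.audit_logs (service_name, log_time, request_id, payload)
VALUES ('payment-service', toTimestamp(now()), uuid(), '{"status":"ok"}')
USING TTL 2592000;
```

Expired cells do still become tombstone-like markers internally, which is why pairing TTL with TWCS works so well: entire expired SSTables are dropped without ever compacting their contents against live data.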
Can I use secondary indexes in Cassandra?
In Cassandra 5.0, the recommended secondary indexing approach is Storage Attached Indexes (SAI). SAI supports equality and range queries on non-primary-key columns without ALLOW FILTERING, and adds vector ANN search for embedding-based workloads. SASI (SSTable Attached Secondary Index) is deprecated in Cassandra 5.0 — do not create new SASI indexes. The legacy built-in secondary indexes (2i) are still present but have severe performance limitations. The general rule: SAI for selective queries (< 5% of partition rows), a dedicated denormalized table for known, stable query patterns with high query volume.
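Creating an SAI index in CQL is a one-liner; a sketch against the audit_logs table from this guide (the status column is hypothetical, added here only to illustrate indexing a non-primary-key column):

```sql
-- Cassandra 5.0 syntax: USING 'sai' selects the Storage Attached Index implementation
CREATE INDEX audit_status_sai
ON forge_system.audit_logs (status)
USING 'sai';

-- Selective filtering now works without ALLOW FILTERING
SELECT * FROM forge_system.audit_logs
WHERE service_name = 'payment-service' AND status = 'FAILED';
```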
When should I use Cassandra vs PostgreSQL?
Use Cassandra when: you need write throughput that a single PostgreSQL instance cannot absorb, your dataset exceeds 1TB and growing, you need multi-datacenter active-active replication, or your data naturally fits a time-series or event-log model. Use PostgreSQL when: you need ACID transactions across multiple tables, your queries are ad-hoc and unpredictable, your data is relational and normalized, or your scale is small enough that a single well-tuned PostgreSQL instance handles the load. The worst outcome is starting with Cassandra for a workload that needs joins and transactions — the data model migration cost is high.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.