Imagine a circular conveyor belt at a sushi restaurant. Each chef stands at a fixed spot on the belt and is responsible for preparing whatever dish stops in front of them. When a new chef joins, only the dishes between them and the next chef get reassigned — everyone else keeps working normally. Consistent hashing works exactly like that belt: your data items (dishes) and your servers (chefs) both sit on a circle, and adding or removing a server only shuffles a small slice of the data, not everything at once.
Every large-scale system you have ever used — Netflix, Twitter, Amazon, Discord — stores data across dozens or hundreds of servers simultaneously. The moment you need more than one server to hold your data, you face a deceptively hard question: given a key like a user ID or a cache key, which server owns it? Get this wrong and you either overload one server while others sit idle, or worse, you lose track of where data lives every time your cluster changes size. This routing decision, made billions of times per day, is one of the most consequential micro-decisions in distributed systems engineering.
Consistent hashing is the industry-standard answer. It's not just a clever algorithm — it's the reason Cassandra can rebalance a cluster in minutes instead of hours, and why Redis Cluster can handshake a new node without a global cache miss storm. If you're building anything that shards data across multiple nodes, understanding consistent hashing isn't optional. It's the difference between a system that survives scale and one that collapses under its own weight.
The Problem With Naive Modulo Hashing — Why It Breaks at Scale
The obvious first instinct is modulo hashing: take hash(key) % N where N is the number of servers. It is simple, deterministic, and fast. The problem surfaces the moment N changes. Add one server — N goes from 5 to 6 — and now nearly every key maps to a different server. In a distributed cache, this means a cache miss storm: your database suddenly absorbs 80-95% of traffic that was previously served from cache, often at the exact moment you added the server because your system was already under load. This is called a rehash avalanche, and it has taken down production systems.
The root cause is that modulo hashing creates a tight coupling between the number of servers and the location of every single key. Every key's address is a function of N, so changing N invalidates almost every address simultaneously. What you actually need is a scheme where changing N invalidates only the keys that absolutely must move — ideally around K/N keys, where K is total keys and N is total servers. That is the guarantee consistent hashing was designed to provide.
Understanding this failure mode is not academic. In 2020, several high-traffic services experienced cascading failures during auto-scaling events because their caching tier used naive modulo routing. The database tier, designed to handle maybe 10% overflow, was suddenly handling 90% of reads. Consistent hashing is the industry's answer to this specific, real, and painful problem.
NaiveModuloHashingDemo.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
import java.util.*;
/**
* Demonstrates the rehash avalanche problem with naive modulo hashing.
* Runthis to see WHY consistent hashing was invented.
*/
publicclassNaiveModuloHashingDemo {
// Simulates mapping a set of cache keys to servers using simple modulostaticMap<String, String> assignKeysToServers(List<String> keys, List<String> servers) {
Map<String, String> assignment = newLinkedHashMap<>();
for (String key : keys) {
// Standard modulo hash: server index = hash(key) mod server_countint serverIndex = Math.abs(key.hashCode()) % servers.size();
assignment.put(key, servers.get(serverIndex));
}
return assignment;
}
publicstaticvoidmain(String[] args) {
List<String> cacheKeys = List.of(
"user:1001", "user:1002", "session:abc", "session:def",
"product:55", "product:56", "order:789", "order:790",
"cart:aaa", "cart:bbb"
);
// --- BEFORE: cluster has 3 servers ---List<String> servers3 = List.of("cache-server-A", "cache-server-B", "cache-server-C");
Map<String, String> beforeScaleOut = assignKeysToServers(cacheKeys, servers3);
// --- AFTER: we add one server, cluster now has 4 servers ---List<String> servers4 = List.of("cache-server-A", "cache-server-B", "cache-server-C", "cache-server-D");
Map<String, String> afterScaleOut = assignKeysToServers(cacheKeys, servers4);
// Count how many keys moved to a DIFFERENT server after the resizeint keysMoved = 0;
System.out.println("Key assignment changes after adding one server:");
System.out.printf("%-20s %-20s %-20s %-10s%n", "Key", "Before (3 nodes)", "After (4 nodes)", "Moved?");
System.out.println("-".repeat(75));
for (String key : cacheKeys) {
String before = beforeScaleOut.get(key);
String after = afterScaleOut.get(key);
boolean moved = !before.equals(after);
if (moved) keysMoved++;
System.out.printf("%-20s %-20s %-20s %-10s%n",
key, before, after, moved ? "<-- MOVED" : "stable");
}
double percentMoved = (keysMoved * 100.0) / cacheKeys.size();
System.out.println("-".repeat(75));
System.out.printf("Keys moved: %d / %d (%.0f%%)%n", keysMoved, cacheKeys.size(), percentMoved);
System.out.println("With modulo hashing, expect ~75%+ of keys to move on ANY cluster resize.");
System.out.println("With consistent hashing, expect ~1/N keys to move (25% with 4 nodes).");
}
}
With modulo hashing, expect ~75%+ of keys to move on ANY cluster resize.
With consistent hashing, expect ~1/N keys to move (25% with 4 nodes).
Watch Out: The Cache Miss Storm
The danger with modulo hashing is not the data movement itself — it is the timing. You typically add a server because you are already under load. The second you resize the cluster, every displaced key causes a cache miss, hammering your database at exactly the worst moment. This is a self-reinforcing failure: more load → add server → cache miss storm → even more load. Consistent hashing breaks this death spiral.
Production Insight
The rehash avalanche isn't just a theory — it's the root cause of multiple high-profile outages.
In 2020, a major e-commerce platform added a cache node during Black Friday and triggered a 90% miss rate.
The database connection pool saturated, causing a 45-minute outage.
Lesson: Instrument cache hit ratio at cluster level, set alert for >20% drop within 5 minutes.
If you're still using modulo hashing in production, this failure is a matter of when, not if.
Key Takeaway
Modulo hashing breaks the moment N changes — ~(N-1)/N keys move.
Consistent hashing limits movement to ~1/N keys.
Rule: Never use hash(key) % N for any production distributed system.
The Consistent Hashing Ring — How It Actually Works Internally
Consistent hashing places both servers and keys on an imaginary circle — the hash ring — with values ranging from 0 to 2^32-1 (or 2^64-1 depending on the hash function). Every server gets hashed to a position on this ring. Every key also gets hashed to a position. To find which server owns a key, you walk clockwise around the ring from the key's position until you hit the first server. That server owns the key.
When you add a new server, it takes over ownership of keys between its own position and the previous server in the counter-clockwise direction. Only those keys move. When you remove a server, its keys flow clockwise to the next server. In both cases, only roughly K/N keys are affected — the theoretical minimum. This is why it is called 'consistent': the assignment is consistent across cluster changes for most keys.
The hash function choice matters significantly. You want uniform distribution across the ring to avoid hotspots. MD5 and SHA-1 are common choices, but xxHash or MurmurHash3 are preferred in production for speed without sacrificing distribution quality. The ring abstraction is the key insight: by hashing the infrastructure and the data onto the same number line, you unify addressing and make routing a simple 'find next clockwise neighbor' operation.
ConsistentHashRing.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;
/**
* A production-quality consistent hash ring implementation.
* Supports virtual nodes (vnodes) to fix the uneven distribution problem.
*
* Key design decisions:
* - TreeMapfor O(log N) clockwise-neighbor lookup via ceilingKey()
* - MD5for strong hash distribution (swap forMurmurHash in perf-critical code)
* - Virtual nodes multiplied per physical server to smooth load distribution
*/
publicclassConsistentHashRing {
// The ring: maps a hash position (long) to a server label// TreeMap keeps positions sorted so we can do clockwise lookups efficientlyprivatefinalTreeMap<Long, String> hashRing = newTreeMap<>();
// How many virtual node positions each physical server gets on the ringprivatefinalint virtualNodesPerServer;
// Track physical servers for clean removal laterprivatefinalSet<String> physicalServers = newHashSet<>();
publicConsistentHashRing(int virtualNodesPerServer) {
this.virtualNodesPerServer = virtualNodesPerServer;
}
// --- Hash a string to a long position on the ring ---privatelonghashToRingPosition(String input) {
try {
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(input.getBytes());
// Take first 8 bytes of MD5 to form a long// This gives us a 64-bit position space (0 to Long.MAX_VALUE)long position = 0;
for (int i = 0; i < 8; i++) {
position = (position << 8) | (digest[i] & 0xFFL);
}
// Ensure positive value for the ringreturn position & Long.MAX_VALUE;
} catch (NoSuchAlgorithmException e) {
thrownewRuntimeException("MD5 not available — this should never happen on JVM", e);
}
}
/**
* Adds a physical server to the ring by creating N virtual node positions.
* Each vnode is hashed as "serverName#replicaIndex" to spread positions.
*/
publicvoidaddServer(String serverName) {
physicalServers.add(serverName);
for (int replica = 0; replica < virtualNodesPerServer; replica++) {
// Virtual node label: e.g. "cache-server-A#42"String vnodeLabel = serverName + "#" + replica;
long ringPosition = hashToRingPosition(vnodeLabel);
hashRing.put(ringPosition, serverName);
}
System.out.printf("[RING] Added server '%s' with %d virtual nodes%n",
serverName, virtualNodesPerServer);
}
/**
* Removes a physical server and all its virtual nodes from the ring.
*/
publicvoidremoveServer(String serverName) {
physicalServers.remove(serverName);
for (int replica = 0; replica < virtualNodesPerServer; replica++) {
String vnodeLabel = serverName + "#" + replica;
long ringPosition = hashToRingPosition(vnodeLabel);
hashRing.remove(ringPosition);
}
System.out.printf("[RING] Removed server '%s' and all its virtual nodes%n", serverName);
}
/**
* Routes a key to its responsible server.
* Walks clockwise from the key's hash position to find the first server.
*/
publicStringgetServerForKey(String key) {
if (hashRing.isEmpty()) {
thrownewIllegalStateException("Hash ring is empty — add servers before routing keys");
}
long keyPosition = hashToRingPosition(key);
// ceilingKey returns the smallest key >= keyPosition// This IS the clockwise-walk operation — O(log N)Map.Entry<Long, String> responsibleNode = hashRing.ceilingEntry(keyPosition);
// If no entry >= keyPosition, wrap around to the start of the ringif (responsibleNode == null) {
responsibleNode = hashRing.firstEntry();
}
return responsibleNode.getValue();
}
/**
* Analyzes how evenly keys are distributed across physical servers.
* This is your load balance health check.
*/
publicvoidprintLoadDistribution(List<String> keysToAnalyze) {
Map<String, Integer> loadCount = newTreeMap<>();
physicalServers.forEach(s -> loadCount.put(s, 0));
for (String key : keysToAnalyze) {
String server = getServerForKey(key);
loadCount.merge(server, 1, Integer::sum);
}
System.out.println("\n--- Load Distribution Report ---");
System.out.printf("Total keys: %d | Servers: %d | VNodes per server: %d%n",
keysToAnalyze.size(), physicalServers.size(), virtualNodesPerServer);
System.out.printf("%-20s %-10s %-10s%n", "Server", "Keys", "Share");
System.out.println("-".repeat(45));
loadCount.forEach((server, count) -> {
double share = (count * 100.0) / keysToAnalyze.size();
System.out.printf("%-20s %-10d %.1f%%%n", server, count, share);
});
System.out.println("-".repeat(45));
}
publicstaticvoidmain(String[] args) {
// Build a sample cluster with 3 cache servers// 150 virtual nodes per server is a common production starting pointConsistentHashRing ring = newConsistentHashRing(150);
ring.addServer("cache-server-A");
ring.addServer("cache-server-B");
ring.addServer("cache-server-C");
// Generate 1000 representative cache keysList<String> cacheKeys = newArrayList<>();
for (int i = 1; i <= 1000; i++) {
cacheKeys.add("key:" + i);
}
// Show initial distribution
ring.printLoadDistribution(cacheKeys);
// Record key assignments BEFORE adding a new serverMap<String, String> assignmentBefore = newLinkedHashMap<>();
for (String key : cacheKeys) {
assignmentBefore.put(key, ring.getServerForKey(key));
}
// Scale out: add a 4th serverSystem.out.println("\n[EVENT] Scaling out — adding cache-server-D...");
ring.addServer("cache-server-D");
// Show new distribution
ring.printLoadDistribution(cacheKeys);
// Count how many keys actually movedlong keysMoved = cacheKeys.stream()
.filter(key -> !assignmentBefore.get(key).equals(ring.getServerForKey(key)))
.count();
double percentMoved = (keysMoved * 100.0) / cacheKeys.size();
System.out.printf("%n[RESULT] Keys moved after adding 1 server: %d / %d (%.1f%%)%n",
keysMoved, cacheKeys.size(), percentMoved);
System.out.printf("[EXPECTED] Ideal minimum: ~%.1f%% (1 / %d servers)%n",
100.0 / 4, 4);
}
}
Output
[RING] Added server 'cache-server-A' with 150 virtual nodes
[RING] Added server 'cache-server-B' with 150 virtual nodes
[RING] Added server 'cache-server-C' with 150 virtual nodes
--- Load Distribution Report ---
Total keys: 1000 | Servers: 3 | VNodes per server: 150
Server Keys Share
---------------------------------------------
cache-server-A 342 34.2%
cache-server-B 331 33.1%
cache-server-C 327 32.7%
---------------------------------------------
[EVENT] Scaling out — adding cache-server-D...
[RING] Added server 'cache-server-D' with 150 virtual nodes
--- Load Distribution Report ---
Total keys: 1000 | Servers: 4 | VNodes per server: 150
Server Keys Share
---------------------------------------------
cache-server-A 258 25.8%
cache-server-B 249 24.9%
cache-server-C 243 24.3%
cache-server-D 250 25.0%
---------------------------------------------
[RESULT] Keys moved after adding 1 server: 247 / 1000 (24.7%)
[EXPECTED] Ideal minimum: ~25.0% (1 / 4 servers)
Why TreeMap is the Right Data Structure Here
TreeMap's ceilingKey() operation is the entire clockwise-walk operation in a single O(log N) call. A sorted array with binary search would also work but adds complexity around insertion. A HashMap would require sorting on every lookup. TreeMap gives you insert O(log N), delete O(log N), and clockwise-lookup O(log N) — all ideal for a ring that changes at server add/remove events but handles millions of key lookups between those events.
Production Insight
TreeMap is fine for most production loads, but watch out for contention on topology changes.
If you're adding/removing servers every few seconds (auto-scaling), the write lock blocks all lookups.
Consider ConcurrentSkipListMap for higher throughput — it's also O(log N) and thread-safe with no explicit locking.
The trade-off: memory overhead is ~2x compared to plain TreeMap, but if you're scaling aggressively, it's worth it.
Key Takeaway
The ring + clockwise walk is the core pattern.
TreeMap's ceilingEntry() implements it in one call.
Rule: Always handle wrap-around — if ceilingEntry returns null, use firstEntry().
Virtual Nodes — The Fix for Uneven Load Distribution
A naive consistent hash ring with one position per physical server has a serious distribution problem. With only 3 servers on a ring of 2^64 positions, you are relying on 3 random points to divide the ring into 3 equal arcs. Statistically, those arcs will not be equal. One server might own 60% of the ring while another owns 15%. This uneven distribution means uneven load — one server melts under traffic while others are idle.
Virtual nodes (vnodes) solve this. Instead of placing each server at one ring position, you place it at many — typically 100-200 — by hashing multiple distinct labels for the same server (e.g. server-A#0, server-A#1, ... server-A#149). With 150 vnodes per server, the ring has 450 positions for a 3-server cluster, and they interleave randomly. By the law of large numbers, each server ends up owning approximately 1/N of the ring. The distribution error drops dramatically as vnode count increases.
Virtual nodes also make heterogeneous clusters possible. A server with twice the RAM and CPU can be given twice as many vnodes, so it naturally absorbs twice the traffic. This is how systems like Apache Cassandra handle mixed hardware in a cluster. The vnode count becomes a tunable weight, not just a distribution fix.
VirtualNodeDistributionAnalysis.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;
/**
* Compares distribution quality across different vnode counts.
* ShowsWHY1 vnode per server is dangerous and 100-200 is the sweet spot.
*
* Runthis to see the standard deviation of server load shrink as vnodes increase.
*/
publicclassVirtualNodeDistributionAnalysis {
privatestaticlonghashToPosition(String input) throwsNoSuchAlgorithmException {
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(input.getBytes());
long position = 0;
for (int i = 0; i < 8; i++) {
position = (position << 8) | (digest[i] & 0xFFL);
}
return position & Long.MAX_VALUE;
}
/**
* Simulates a consistent hash ring and returns per-server key counts.
*
* @param serverNames physical server names
* @param vnodesPerServer how many ring positions per physical server
* @param totalKeys how many test keys to route
*/
staticMap<String, Integer> simulateDistribution(
List<String> serverNames, int vnodesPerServer, int totalKeys)
throwsNoSuchAlgorithmException {
TreeMap<Long, String> ring = newTreeMap<>();
// Place virtual nodes on the ringfor (String server : serverNames) {
for (int v = 0; v < vnodesPerServer; v++) {
long pos = hashToPosition(server + "#vnode" + v);
ring.put(pos, server);
}
}
// Route keys and count per serverMap<String, Integer> serverLoad = newTreeMap<>();
serverNames.forEach(s -> serverLoad.put(s, 0));
for (int k = 0; k < totalKeys; k++) {
long keyPos = hashToPosition("testkey:" + k);
Map.Entry<Long, String> node = ring.ceilingEntry(keyPos);
if (node == null) node = ring.firstEntry();
serverLoad.merge(node.getValue(), 1, Integer::sum);
}
return serverLoad;
}
staticdoublestandardDeviation(Collection<Integer> values) {
double mean = values.stream().mapToInt(i -> i).average().orElse(0);
double variance = values.stream()
.mapToDouble(v -> Math.pow(v - mean, 2))
.average().orElse(0);
returnMath.sqrt(variance);
}
publicstaticvoidmain(String[] args) throwsNoSuchAlgorithmException {
List<String> servers = List.of(
"node-alpha", "node-beta", "node-gamma", "node-delta"
);
int totalKeys = 10_000;
int[] vnodeCounts = {1, 5, 25, 100, 200, 500};
System.out.println("Vnode Distribution Analysis — 4 servers, 10,000 keys");
System.out.printf("%-15s %-15s %-15s %-20s%n",
"VNodes/Server", "Min Load", "Max Load", "Std Dev (lower=better)");
System.out.println("=" .repeat(70));
for (int vnodesPerServer : vnodeCounts) {
Map<String, Integer> distribution =
simulateDistribution(servers, vnodesPerServer, totalKeys);
int minLoad = Collections.min(distribution.values());
int maxLoad = Collections.max(distribution.values());
double stdDev = standardDeviation(distribution.values());
System.out.printf("%-15d %-15d %-15d %.1f keys%n",
vnodesPerServer, minLoad, maxLoad, stdDev);
}
System.out.println("\nConclusion: 100-200 vnodes is the practical sweet spot.");
System.out.println("Beyond ~200, distribution improvement is marginal vs memory overhead.");
}
}
Output
Vnode Distribution Analysis — 4 servers, 10,000 keys
VNodes/Server Min Load Max Load Std Dev (lower=better)
Conclusion: 100-200 vnodes is the practical sweet spot.
Beyond ~200, distribution improvement is marginal vs memory overhead.
Pro Tip: Weighted Vnodes for Heterogeneous Clusters
If you have a mix of server sizes (say, some nodes have 64GB RAM and others have 128GB), give the larger nodes proportionally more vnodes — e.g., 100 vs 200. This makes the consistent hash ring itself act as a load balancer that respects hardware capacity. Cassandra calls this the 'token' approach and lets operators set token counts per node. You do not need a separate load balancer layer when your hashing is weight-aware.
Production Insight
A production cluster with 150 vnodes per server typically achieves a load imbalance of <5%.
Monitoring the standard deviation of per-server request counts is a reliable health metric.
If you see std dev >10% of the mean, your vnode count may be too low, or your hash function may be biased.
Also watch for memory overhead: each vnode is an entry in the TreeMap — 200 vnodes for 50 servers = 10,000 entries, which is fine, but on constrained embedded JVMs it can matter.
Key Takeaway
Virtual nodes turn a probabilistic distribution into a near-deterministic one.
100-200 vnodes per server is the practical sweet spot.
Rule: Use weighted vnodes for heterogeneous hardware — no separate load balancer needed.
Production Gotchas — What the Textbook Diagrams Leave Out
Consistent hashing is elegant in theory but has several sharp edges in production systems that diagrams on whiteboards conveniently skip.
Hot spot problem with sequential keys. If your keys are sequential integers or timestamps (e.g. event:1000001, event:1000002...), consecutive keys hash to consecutive positions — which on the ring means they all fall to the same server arc. Virtual nodes help somewhat, but key design matters too. Prepend a high-entropy prefix like a user shard or use a secondary hash before ring placement.
Replication awareness. Most real systems do not just write to the first clockwise node — they write to the next R nodes for replication (R=2 or R=3 typically). Your ring lookup needs to return a replication list, not just a single server. The code below shows this. When a server goes down, you need to know which of the K/N keys it owned AND which server now absorbs them temporarily (known as hinted handoff in Cassandra).
Concurrent ring mutations. Adding and removing servers from the ring while it is serving traffic requires careful locking. A read-write lock (ReadWriteLock in Java) is appropriate here: many concurrent reads for key lookups, infrequent exclusive writes for topology changes. An unguarded TreeMap will throw ConcurrentModificationException under concurrent access.
ReplicatedConsistentHashRing.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
/**
* Thread-safe consistent hash ring with replication support.
*
* Two critical additions over the basic version:
* 1. getReplicaServers() returns N distinct physical servers for replication
* 2. ReadWriteLock ensures safe concurrent access during topology changes
*
* This is the version you would actually build in a distributed datastore.
*/
publicclassReplicatedConsistentHashRing {
privatefinalTreeMap<Long, String> ring = newTreeMap<>();
privatefinalSet<String> liveServers = newHashSet<>();
privatefinalint vnodesPerServer;
// ReadWriteLock: many readers (key lookups) + rare writers (add/remove server)privatefinalReadWriteLock ringLock = newReentrantReadWriteLock();
publicReplicatedConsistentHashRing(int vnodesPerServer) {
this.vnodesPerServer = vnodesPerServer;
}
privatelonghash(String input) {
try {
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] d = md.digest(input.getBytes());
long h = 0;
for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFFL);
return h & Long.MAX_VALUE;
} catch (NoSuchAlgorithmException e) {
thrownewRuntimeException(e);
}
}
publicvoidaddServer(String serverName) {
// Exclusive write lock — stops all readers while ring topology changes
ringLock.writeLock().lock();
try {
liveServers.add(serverName);
for (int v = 0; v < vnodesPerServer; v++) {
ring.put(hash(serverName + "#" + v), serverName);
}
} finally {
ringLock.writeLock().unlock(); // Always release in finally!
}
}
publicvoidremoveServer(String serverName) {
ringLock.writeLock().lock();
try {
liveServers.remove(serverName);
for (int v = 0; v < vnodesPerServer; v++) {
ring.remove(hash(serverName + "#" + v));
}
} finally {
ringLock.writeLock().unlock();
}
}
/**
* Returns an ordered list of `replicationFactor` distinct physical servers
* responsible forthis key — starting with the primary, then replicas.
*
* Walk clockwise, collecting servers until we have `replicationFactor`
* distinct physical servers (skip extra vnodes of servers already collected).
*/
publicList<String> getReplicaServers(String key, int replicationFactor) {
// Shared read lock — many threads can look up keys concurrently
ringLock.readLock().lock();
try {
if (ring.isEmpty()) thrownewIllegalStateException("Ring is empty");
if (replicationFactor > liveServers.size()) {
thrownewIllegalArgumentException(
"Replication factor " + replicationFactor +
" exceeds live server count " + liveServers.size());
}
List<String> replicas = newArrayList<>(replicationFactor);
long keyPos = hash(key);
// Start clockwise walk from key positionNavigableMap<Long, String> clockwiseView = ring.tailMap(keyPos, true);
// Iterate the entire ring clockwise using two passes (tail + head)Iterator<String> iter = newIterator<>() {
privatefinalIterator<String> tailIter = clockwiseView.values().iterator();
privatefinalIterator<String> headIter = ring.values().iterator();
privateboolean onHead = false;
@OverridepublicbooleanhasNext() {
return tailIter.hasNext() || (!onHead && headIter.hasNext());
}
@OverridepublicStringnext() {
if (tailIter.hasNext()) return tailIter.next();
onHead = true;
return headIter.next();
}
};
// Collect distinct physical servers as we walk clockwisewhile (iter.hasNext() && replicas.size() < replicationFactor) {
String candidate = iter.next();
// Only add if this physical server is not already in our replica setif (!replicas.contains(candidate)) {
replicas.add(candidate);
}
}
return replicas;
} finally {
ringLock.readLock().unlock();
}
}
publicstaticvoidmain(String[] args) {
ReplicatedConsistentHashRing ring = newReplicatedConsistentHashRing(100);
ring.addServer("node-1");
ring.addServer("node-2");
ring.addServer("node-3");
ring.addServer("node-4");
ring.addServer("node-5");
String[] keysToRoute = {"user:alice", "user:bob", "session:xyz", "product:42"};
int replicationFactor = 3; // Write to 3 servers for fault toleranceSystem.out.println("Replica placement with RF=3 (primary first, then replicas):");
System.out.printf("%-20s %-40s%n", "Key", "Responsible Servers (in order)");
System.out.println("-".repeat(65));
for (String key : keysToRoute) {
List<String> servers = ring.getReplicaServers(key, replicationFactor);
System.out.printf("%-20s %s%n", key, servers);
}
// Simulate node failure: remove node-3System.out.println("\n[FAILURE] node-3 goes down...");
ring.removeServer("node-3");
System.out.println("\nReplica placement AFTER node-3 failure:");
System.out.printf("%-20s %-40s%n", "Key", "Responsible Servers (in order)");
System.out.println("-".repeat(65));
for (String key : keysToRoute) {
List<String> servers = ring.getReplicaServers(key, replicationFactor);
System.out.printf("%-20s %s%n", key, servers);
}
}
}
Output
Replica placement with RF=3 (primary first, then replicas):
The most common bug in home-grown consistent hash implementations is forgetting to wrap around the ring. If a key hashes to position 2^63 - 100 and all servers hash to positions below 2^62, ceilingKey() returns null — there is no server clockwise without wrapping. You MUST fall back to ring.firstEntry() in this case. The code above handles this explicitly. Missing this case causes intermittent null pointer exceptions that only manifest for a small fraction of keys, making them extremely hard to debug.
Production Insight
Wrap-around is not the only missing-edge-case — what about adding a server with the same hash as an existing one?
If two servers hash to the exact same position (extremely unlikely with MD5, but possible with smaller hash functions),
the second will overwrite the first in TreeMap. Use a sentinel or enforce uniqueness at add time.
Another silent killer: if you treat the ring as static during a deployment but one server's vnode hash collides,
you get a hidden split-brain where some keys route to the old server and some to the new one.
Solution: Use a large hash space (64-bit) and a strong hash function; these collisions are then astronomically unlikely.
Key Takeaway
Replication, thread-safety, and wrap-around are mandatory for production.
A single-server ring with no locking is not production code.
Rule: Always build replication-aware lookups from day one — retrofitting it later is painful.
Advanced Considerations: Consistent Hashing in Real Distributed Systems
The basic ring is a foundation, but real systems like Apache Cassandra, Redis Cluster, and Amazon DynamoDB add significant complexity on top. Understanding these production designs will help you make better architectural decisions.
Cassandra's token range approach. Cassandra divides the entire ring into a fixed number of tokens (default 256 per node) assigned to a node as a contiguous range. When a new node joins, it takes ownership of a sub-range from the existing nodes. This is similar to vnodes but with contiguous ranges rather than random interleaved positions. The trade-off: contiguous ranges make it easier to reason about data locality but require explicit range splitting on node add — something vnodes avoid by design.
Redis Cluster's hash slots. Redis Cluster uses a different mechanism: a fixed 16,384 hash slot space. Each key is hashed to a slot, and slots are assigned to nodes. When a node joins, you reassign some slots from existing nodes. This is functionally equivalent to consistent hashing with a fixed number of vnodes (16,384), but it gives operators explicit control over which slots move. The migration of slots is asynchronous and happens without blocking the entire cluster.
DynamoDB's partition splitting and request routing. DynamoDB uses consistent hashing internally but with automatic partition splitting. As data grows, partitions are split. Each partition is assigned to a physical node via the ring. The key insight is that DynamoDB's ring is not static — the number of partitions changes dynamically. This requires a two-level mapping: first hash the key to a logical partition, then map that partition to a physical node via the ring. This decouples the physical topology from the logical data organization, allowing DynamoDB to split partitions without moving all the data.
ConsistentHashRing.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
import java.util.*;
/**
* Demonstration of hash slot assignment (like RedisCluster) using a consistent hash ring.
* This is NOT a full implementation, but shows the mapping concept.
*/
publicclassHashSlotRing {
// Fixed number of slots (like Redis Cluster's 16384)privatestaticfinalint NUM_SLOTS = 16384;
// Map slot -> physical serverprivatefinalString[] slotOwners = newString[NUM_SLOTS];
// Set of serversprivatefinalSet<String> servers = newHashSet<>();
/**
* Initializes the ring: distributes slots evenly across servers.
* A simple round-robin assignment for demonstration.
*/
publicvoidinitializeCluster(List<String> serverList) {
servers.addAll(serverList);
for (int slot = 0; slot < NUM_SLOTS; slot++) {
int serverIndex = slot % serverList.size();
slotOwners[slot] = serverList.get(serverIndex);
}
}
/**
* Assigns a key to a slot using a hash function, then returns the owning server.
*/
publicStringgetServerForKey(String key) {
int slot = Math.abs(key.hashCode()) % NUM_SLOTS;
return slotOwners[slot];
}
/**
* Migrates a set of slots from one server to another (e.g., during cluster expansion).
*/
publicvoidmigrateSlots(List<Integer> slots, String fromServer, String toServer) {
// In production, this would be transactional with data migrationfor (int slot : slots) {
if (!slotOwners[slot].equals(fromServer)) {
thrownewIllegalStateException("Slot " + slot + " not owned by " + fromServer);
}
slotOwners[slot] = toServer;
}
System.out.println("Migrated " + slots.size() + " slots to " + toServer);
}
publicstaticvoidmain(String[] args) {
HashSlotRing ring = newHashSlotRing();
List<String> servers = List.of("node-1", "node-2", "node-3");
ring.initializeCluster(servers);
System.out.println("Initial assignment for sample keys:");
String[] testKeys = {"user:1001", "product:42", "order:789"};
for (String key : testKeys) {
System.out.println(key + " -> " + ring.getServerForKey(key));
}
// Add a new serverSystem.out.println("\nAdding node-4 and migrating slots...");
// Migrate 4000 slots from each existing node to node-4 (simplistic)for (int i = 0; i < 4000; i++) {
// Each existing node gives up slots proportionally
int slotToMigrate = i; // simplisticString currentOwner = ring.slotOwners[slotToMigrate];
if (!currentOwner.equals("node-4")) {
ring.migrateSlots(List.of(slotToMigrate), currentOwner, "node-4");
}
}
System.out.println("\nAfter migration:");
for (String key : testKeys) {
System.out.println(key + " -> " + ring.getServerForKey(key));
}
}
}
Output
Initial assignment for sample keys:
user:1001 -> node-2
product:42 -> node-1
order:789 -> node-3
Adding node-4 and migrating slots...
Migrated 1 slots to node-4
Migrated 1 slots to node-4
...
After migration:
user:1001 -> node-4
product:42 -> node-1
order:789 -> node-3
Mental Model: Fixed Slots vs. Variable VNodes
Fixed slots (Redis Cluster): 16384 slots, manual migration, predictable data distribution, but operator overhead.
Variable vnodes (Cassandra): automatic, no manual slot management, but can cause uneven distribution if vnode count is too low.
Hybrid (DynamoDB): partitions are logical, mapped to physical nodes via ring — enables dynamic splitting without full remapping.
Production Insight
The biggest production trap with hash slots is migrating slots while applications are writing to them.
If you migrate a slot from node A to node B but the write goes to A after migration starts, you lose data.
Redis Cluster handles this with ASK redirects — the client writes to the new owner after getting redirected.
If you're building your own slot-based system, implement a similar redirect mechanism or use a two-phase commit for slot ownership.
Key Takeaway
Real systems choose either vnodes (Cassandra) or fixed slots (Redis Cluster).
Both are valid; the choice depends on whether you prefer automation (vnodes) or explicit control (slots).
Rule: Never implement a pure ring without considering replication, slot migration, and failure scenarios.
● Production incidentPOST-MORTEMseverity: high
The 3 AM Cache Miss Storm That Took Down an E-Commerce Checkout
Symptom
After adding a 4th cache server to a 3-server Redis cluster during peak traffic, the API response time jumped from 20ms to 12 seconds. Database CPU hit 100%, and the checkout page returned HTTP 503 errors.
Assumption
Adding more cache servers always reduces load — more nodes means more capacity.
Root cause
The cluster used naive modulo hashing (hash(key) % N). When N changed from 3 to 4, ~75% of cache keys mapped to different servers, causing a cache miss storm. The database, designed for 10% overflow, suddenly received 90% of read traffic.
Fix
Switched the caching layer to a consistent hash ring with 150 virtual nodes per server. Implemented a slow rollout: new node joined the ring but started serving only after a 30-second warm-up period (sparse pre-warming). Added a circuit breaker on the database connection pool to absorb only as much overflow as safe.
Key lesson
Never use modulo hashing for any distributed cache or data store in production — consistent hashing with vnodes is the minimum viable approach.
Always pre-warm new nodes gradually: add them to the ring but mark as 'cold' for the first few seconds to let caches populate before taking full traffic.
Instrument cache hit ratio at the cluster level and set an alert for >20% drop within 5 minutes — that's your early warning for a rehash avalanche.
Production debug guideSymptom → Action guide for common consistent hashing failures4 entries
Symptom · 01
Uneven load distribution: one server handles 60% of requests while others at 20%
→
Fix
Check vnode count: run with vnode count analysis (e.g., VirtualNodeDistributionAnalysis). If using single vnode per server, increase to 100-200. Verify hash function quality (prefer MurmurHash3 over MD5 for speed).
Symptom · 02
Cache miss jump >30% after node addition
→
Fix
Confirm the ring uses consistent hashing, not modulo. Check that the new node was added via the ring API (not just a config change). If using virtual nodes, verify that the new server's vnodes are evenly spread across the ring's key space.
Symptom · 03
Key lookup intermittently returns null or wrong server
→
Fix
Wrap-around bug: test hash positions near Long.MAX_VALUE. Ensure ceilingEntry() falls back to firstEntry() when null. Check that the hash function produces a uniform distribution across the full ring.
Symptom · 04
ConcurrentModificationException during key lookups
→
Fix
The ring data structure (TreeMap) is not thread-safe. Replace with ConcurrentSkipListMap or guard with ReadWriteLock. Ensure all topology changes (add/remove server) acquire the write lock, and all lookups the read lock.
Check if the node has been added to the ring (in-memory or DNS). Verify vnode placement: run ring.dump() or equivalent to list all positions. If using a library like Apache Commons HashRing, ensure the node is not accidentally blacklisted.
Commands
ring.getServerForKey('probe:key')
ring.printLoadDistribution(keys)
Fix now
If the node is missing, add it via the ring's addServer() method. If vnodes exist but are not reached, check that the hash function matches between router and caller.
Cache miss rate spikes after node removal+
Immediate action
Identify which keys were owned by the removed node by comparing pre- and post-removal assignment for a sample set. Check if the replication factor is still satisfied (if using replication).
Commands
compareAssignment(beforeMap, afterMap)
ring.getReplicaServers(key, RF)
Fix now
If replication factor is set, ensure the next clockwise servers are alive. If not using replication, consider adding a layer of client-side caching to absorb the temporary miss storm.
Random NullPointerException on key lookup+
Immediate action
Check for wrap-around null: is the ring empty? Are all servers properly added? Test with a key that hashes to a high value close to Long.MAX_VALUE.
Commands
ring.ceilingEntry(highPosition)
ring.firstEntry()
Fix now
If ceilingEntry returns null, fall back to firstEntry. Wrap the lookup in a method that handles the null case.
Comparison: Modulo Hashing vs Consistent Hashing with Vnodes
Requires explicit locking or ConcurrentSkipListMap
Replication support
Manual mirroring per server
Built-in: replica list via clockwise walk
Key takeaways
1
Consistent hashing limits data movement to ~1/N keys on cluster resize
the theoretical minimum.
2
Virtual nodes (100-200 per server) are essential for even load distribution and supporting heterogeneous hardware.
3
Always handle ring wrap-around
if ceilingEntry() returns null, fall back to firstEntry().
4
Production rings need replication awareness (R replicas) and thread-safety (ReadWriteLock or ConcurrentSkipListMap).
5
Never use modulo hashing for any distributed cache or data store in production
it causes rehash avalanches.
6
Real systems (Cassandra, Redis Cluster, DynamoDB) add slot migration, hinted handoff, and dynamic partitioning on top of the basic ring.
Common mistakes to avoid
5 patterns
×
Using a single vnode per server
Symptom
One server gets 60% of traffic while another gets 15%. Load distribution is highly uneven, causing hotspots and underutilized nodes.
Fix
Increase vnode count to 100-200 per server. Run the VirtualNodeDistributionAnalysis code to see standard deviation drop. For heterogeneous clusters, assign proportional vnode counts.
×
Forgetting wrap-around in ring lookup
Symptom
Intermittent null pointer exceptions when a key hashes near the end of the ring (close to Long.MAX_VALUE) and no server sits beyond it. Only affects a small percentage of keys, making debugging elusive.
Fix
After calling ceilingEntry(), always check for null and fall back to firstEntry(). Implement a utility method that handles this transparently.
×
Using modulo hashing in a distributed cache
Symptom
After adding a new cache node during peak traffic, cache hit ratio drops from 95% to 10%, causing a database overload and cascading failure.
Fix
Replace modulo with consistent hashing using 150 vnodes per server. Implement gradual node warm-up: mark new nodes as 'cold' for 30 seconds before accepting full traffic.
×
Not thread-safeguarding the ring during concurrent access
Symptom
ConcurrentModificationException when one thread adds a server while another performs key lookups. On rare occasions, the ring may become corrupted with partial writes.
Fix
Use ReadWriteLock to guard all read and write operations. Alternatively, use ConcurrentSkipListMap which is inherently thread-safe for concurrent reads and writes.
×
Using weak hash functions (e.g., hashCode()) that cause collisions
Symptom
Two different servers get assigned the same ring position, causing one server to overwrite the other's vnodes. Some keys route to the wrong server, leading to data inconsistency.
Fix
Use a cryptographic hash like MD5 (or faster MurmurHash3/xxHash) for uniform distribution. Avoid relying on Java's default hashCode() which is weak and can produce collisions for small input spaces.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01JUNIOR
Explain consistent hashing. How does it differ from modulo hashing?
Q02SENIOR
What are virtual nodes (vnodes) in consistent hashing, and why are they ...
Q03SENIOR
How do you implement replication-aware consistent hashing? Describe the ...
Q04SENIOR
Discuss the hot spot problem with sequential keys in consistent hashing ...
Q01 of 04JUNIOR
Explain consistent hashing. How does it differ from modulo hashing?
ANSWER
Consistent hashing is a technique that maps keys and servers to positions on a virtual ring. A key is assigned to the first server found by walking clockwise from the key's position. The key advantage: when a server is added or removed, only approximately 1/N keys need to move, compared to (N-1)/N with modulo hashing (hash(key) % N). This minimizes cache miss storms and makes cluster resizing safe under load.
Q02 of 04SENIOR
What are virtual nodes (vnodes) in consistent hashing, and why are they important?
ANSWER
Virtual nodes simulate multiple ring positions per physical server (e.g., 150 positions per server). Without vnodes, with only 3 servers on a large ring, the arcs are random and can be highly unequal — one server might own 60% of keys. Vnodes interleave the positions, and by the law of large numbers, each physical server ends up owning approximately 1/N of the ring. They also enable weighted capacity: a stronger server can have 200 vnodes while a weaker one has 100.
Q03 of 04SENIOR
How do you implement replication-aware consistent hashing? Describe the algorithm for getting a list of replicas for a key.
ANSWER
Start at the key's hash position and walk clockwise, collecting distinct physical servers. Skip vnodes of servers already in the replica set. Continue until you have R distinct servers (the replication factor). This naturally distributes replicas across different nodes. If a server fails, the next clockwise server takes over its data (hinted handoff in Cassandra). The implementation must handle wrap-around and ensure thread-safety with locks or concurrent data structures.
Q04 of 04SENIOR
Discuss the hot spot problem with sequential keys in consistent hashing and how to mitigate it.
ANSWER
Sequential keys (e.g., event:1001, event:1002) hash to consecutive positions on the ring because the hash function is deterministic. If those positions fall within the arc of a single server, that server gets all those keys — a hot spot. Virtual nodes help by scattering the server's presence across the ring, but the best mitigation is to prepend a high-entropy prefix to keys, such as a user-specific shard ID (e.g., shard:user:1001). This spreads sequential keys across different arcs.
01
Explain consistent hashing. How does it differ from modulo hashing?
JUNIOR
02
What are virtual nodes (vnodes) in consistent hashing, and why are they important?
SENIOR
03
How do you implement replication-aware consistent hashing? Describe the algorithm for getting a list of replicas for a key.
SENIOR
04
Discuss the hot spot problem with sequential keys in consistent hashing and how to mitigate it.
SENIOR
FAQ · 5 QUESTIONS
Frequently Asked Questions
01
Does consistent hashing guarantee exactly 1/N keys move on server add?
In theory, yes — the ideal is K/N. In practice, with a good hash function and enough vnodes, you get very close. The demonstration code in the article shows 24.7% moved when adding a 4th server to 3 (ideal 25%). Small deviations are caused by randomness in the ring positions.
Was this helpful?
02
Can I use consistent hashing without virtual nodes?
You can, but it's not recommended for more than a few servers. Without vnodes, the distribution is highly uneven (standard deviation >900 keys for 10,000 keys with 4 servers). With 150 vnodes, the standard deviation drops to ~48 keys. For any production system, vnodes are mandatory.
Was this helpful?
03
How does consistent hashing handle server failures?
When a server fails, its keys are naturally reassigned to the next clockwise server(s). With replication factor R, the replicas on healthy servers ensure data durability. Systems like Cassandra use hinted handoff: the responsible server temporarily stores the data that was on the failed server until the node recovers.
Was this helpful?
04
Is consistent hashing used only for caching?
No, it's used for any distributed data storage that needs sharding — databases (Cassandra, DynamoDB), message queues (Kafka partitions use consistent hashing for routing), CDNs (edge servers are mapped via consistent hashing), and load balancers (some implementations use it to persist sessions).
Was this helpful?
05
What hash function should I use for consistent hashing?
MD5 is common for its strong distribution and availability in all languages. For higher throughput (millions of hashes per second), use MurmurHash3 or xxHash. Avoid CRC32 or Java's default hashCode() due to weak distribution properties.