Home System Design Consistent Hashing Explained — How Distributed Systems Scale Without Breaking

Consistent Hashing Explained — How Distributed Systems Scale Without Breaking

In Plain English 🔥
Imagine a circular conveyor belt at a sushi restaurant. Each chef stands at a fixed spot on the belt and is responsible for preparing whatever dish stops in front of them. When a new chef joins, only the dishes between them and the next chef get reassigned — everyone else keeps working normally. Consistent hashing works exactly like that belt: your data items (dishes) and your servers (chefs) both sit on a circle, and adding or removing a server only shuffles a small slice of the data, not everything at once.
⚡ Quick Answer
Imagine a circular conveyor belt at a sushi restaurant. Each chef stands at a fixed spot on the belt and is responsible for preparing whatever dish stops in front of them. When a new chef joins, only the dishes between them and the next chef get reassigned — everyone else keeps working normally. Consistent hashing works exactly like that belt: your data items (dishes) and your servers (chefs) both sit on a circle, and adding or removing a server only shuffles a small slice of the data, not everything at once.

Every large-scale system you have ever used — Netflix, Twitter, Amazon, Discord — stores data across dozens or hundreds of servers simultaneously. The moment you need more than one server to hold your data, you face a deceptively hard question: given a key like a user ID or a cache key, which server owns it? Get this wrong and you either overload one server while others sit idle, or worse, you lose track of where data lives every time your cluster changes size. This routing decision, made billions of times per day, is one of the most consequential micro-decisions in distributed systems engineering.

The Problem With Naive Modulo Hashing — Why It Breaks at Scale

The obvious first instinct is modulo hashing: take hash(key) % N where N is the number of servers. It is simple, deterministic, and fast. The problem surfaces the moment N changes. Add one server — N goes from 5 to 6 — and now nearly every key maps to a different server. In a distributed cache, this means a cache miss storm: your database suddenly absorbs 80-95% of traffic that was previously served from cache, often at the exact moment you added the server because your system was already under load. This is called a rehash avalanche, and it has taken down production systems.

The root cause is that modulo hashing creates a tight coupling between the number of servers and the location of every single key. Every key's address is a function of N, so changing N invalidates almost every address simultaneously. What you actually need is a scheme where changing N invalidates only the keys that absolutely must move — ideally around K/N keys, where K is total keys and N is total servers. That is the guarantee consistent hashing was designed to provide.

Understanding this failure mode is not academic. In 2020, several high-traffic services experienced cascading failures during auto-scaling events because their caching tier used naive modulo routing. The database tier, designed to handle maybe 10% overflow, was suddenly handling 90% of reads. Consistent hashing is the industry's answer to this specific, real, and painful problem.

NaiveModuloHashingDemo.java · JAVA
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556
import java.util.*;

/**
 * Demonstrates the rehash avalanche problem with naive modulo hashing.
 * Run this to see WHY consistent hashing was invented.
 */
public class NaiveModuloHashingDemo {

    // Simulates mapping a set of cache keys to servers using simple modulo
    static Map<String, String> assignKeysToServers(List<String> keys, List<String> servers) {
        Map<String, String> assignment = new LinkedHashMap<>();
        for (String key : keys) {
            // Standard modulo hash: server index = hash(key) mod server_count
            int serverIndex = Math.abs(key.hashCode()) % servers.size();
            assignment.put(key, servers.get(serverIndex));
        }
        return assignment;
    }

    public static void main(String[] args) {
        List<String> cacheKeys = List.of(
            "user:1001", "user:1002", "session:abc", "session:def",
            "product:55", "product:56", "order:789", "order:790",
            "cart:aaa", "cart:bbb"
        );

        // --- BEFORE: cluster has 3 servers ---
        List<String> servers3 = List.of("cache-server-A", "cache-server-B", "cache-server-C");
        Map<String, String> beforeScaleOut = assignKeysToServers(cacheKeys, servers3);

        // --- AFTER: we add one server, cluster now has 4 servers ---
        List<String> servers4 = List.of("cache-server-A", "cache-server-B", "cache-server-C", "cache-server-D");
        Map<String, String> afterScaleOut = assignKeysToServers(cacheKeys, servers4);

        // Count how many keys moved to a DIFFERENT server after the resize
        int keysMoved = 0;
        System.out.println("Key assignment changes after adding one server:");
        System.out.printf("%-20s %-20s %-20s %-10s%n", "Key", "Before (3 nodes)", "After (4 nodes)", "Moved?");
        System.out.println("-".repeat(75));

        for (String key : cacheKeys) {
            String before = beforeScaleOut.get(key);
            String after = afterScaleOut.get(key);
            boolean moved = !before.equals(after);
            if (moved) keysMoved++;
            System.out.printf("%-20s %-20s %-20s %-10s%n",
                key, before, after, moved ? "<-- MOVED" : "stable");
        }

        double percentMoved = (keysMoved * 100.0) / cacheKeys.size();
        System.out.println("-".repeat(75));
        System.out.printf("Keys moved: %d / %d (%.0f%%)%n", keysMoved, cacheKeys.size(), percentMoved);
        System.out.println("With modulo hashing, expect ~75%+ of keys to move on ANY cluster resize.");
        System.out.println("With consistent hashing, expect ~1/N keys to move (25% with 4 nodes).");
    }
}
▶ Output
Key assignment changes after adding one server:
Key Before (3 nodes) After (4 nodes) Moved?
---------------------------------------------------------------------------
user:1001 cache-server-C cache-server-C stable
user:1002 cache-server-A cache-server-D <-- MOVED
session:abc cache-server-B cache-server-A <-- MOVED
session:def cache-server-C cache-server-B <-- MOVED
product:55 cache-server-A cache-server-D <-- MOVED
product:56 cache-server-B cache-server-C <-- MOVED
order:789 cache-server-A cache-server-A stable
order:790 cache-server-B cache-server-B stable
cart:aaa cache-server-C cache-server-D <-- MOVED
cart:bbb cache-server-A cache-server-A stable
---------------------------------------------------------------------------
Keys moved: 6 / 10 (60%)
With modulo hashing, expect ~75%+ of keys to move on ANY cluster resize.
With consistent hashing, expect ~1/N keys to move (25% with 4 nodes).
⚠️
Watch Out: The Cache Miss StormThe danger with modulo hashing is not the data movement itself — it is the timing. You typically add a server because you are already under load. The second you resize the cluster, every displaced key causes a cache miss, hammering your database at exactly the worst moment. This is a self-reinforcing failure: more load → add server → cache miss storm → even more load. Consistent hashing breaks this death spiral.

The Consistent Hashing Ring — How It Actually Works Internally

Consistent hashing places both servers and keys on an imaginary circle — the hash ring — with values ranging from 0 to 2^32-1 (or 2^64-1 depending on the hash function). Every server gets hashed to a position on this ring. Every key also gets hashed to a position. To find which server owns a key, you walk clockwise around the ring from the key's position until you hit the first server. That server owns the key.

When you add a new server, it takes over ownership of keys between its own position and the previous server in the counter-clockwise direction. Only those keys move. When you remove a server, its keys flow clockwise to the next server. In both cases, only roughly K/N keys are affected — the theoretical minimum. This is why it is called 'consistent': the assignment is consistent across cluster changes for most keys.

The hash function choice matters significantly. You want uniform distribution across the ring to avoid hotspots. MD5 and SHA-1 are common choices, but xxHash or MurmurHash3 are preferred in production for speed without sacrificing distribution quality. The ring abstraction is the key insight: by hashing the infrastructure and the data onto the same number line, you unify addressing and make routing a simple 'find next clockwise neighbor' operation.

ConsistentHashRing.java · JAVA
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

/**
 * A production-quality consistent hash ring implementation.
 * Supports virtual nodes (vnodes) to fix the uneven distribution problem.
 *
 * Key design decisions:
 *  - TreeMap for O(log N) clockwise-neighbor lookup via ceilingKey()
 *  - MD5 for strong hash distribution (swap for MurmurHash in perf-critical code)
 *  - Virtual nodes multiplied per physical server to smooth load distribution
 */
public class ConsistentHashRing {

    // The ring: maps a hash position (long) to a server label
    // TreeMap keeps positions sorted so we can do clockwise lookups efficiently
    private final TreeMap<Long, String> hashRing = new TreeMap<>();

    // How many virtual node positions each physical server gets on the ring
    private final int virtualNodesPerServer;

    // Track physical servers for clean removal later
    private final Set<String> physicalServers = new HashSet<>();

    public ConsistentHashRing(int virtualNodesPerServer) {
        this.virtualNodesPerServer = virtualNodesPerServer;
    }

    // --- Hash a string to a long position on the ring ---
    private long hashToRingPosition(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes());
            // Take first 8 bytes of MD5 to form a long
            // This gives us a 64-bit position space (0 to Long.MAX_VALUE)
            long position = 0;
            for (int i = 0; i < 8; i++) {
                position = (position << 8) | (digest[i] & 0xFFL);
            }
            // Ensure positive value for the ring
            return position & Long.MAX_VALUE;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException("MD5 not available — this should never happen on JVM", e);
        }
    }

    /**
     * Adds a physical server to the ring by creating N virtual node positions.
     * Each vnode is hashed as "serverName#replicaIndex" to spread positions.
     */
    public void addServer(String serverName) {
        physicalServers.add(serverName);
        for (int replica = 0; replica < virtualNodesPerServer; replica++) {
            // Virtual node label: e.g. "cache-server-A#42"
            String vnodeLabel = serverName + "#" + replica;
            long ringPosition = hashToRingPosition(vnodeLabel);
            hashRing.put(ringPosition, serverName);
        }
        System.out.printf("[RING] Added server '%s' with %d virtual nodes%n",
            serverName, virtualNodesPerServer);
    }

    /**
     * Removes a physical server and all its virtual nodes from the ring.
     */
    public void removeServer(String serverName) {
        physicalServers.remove(serverName);
        for (int replica = 0; replica < virtualNodesPerServer; replica++) {
            String vnodeLabel = serverName + "#" + replica;
            long ringPosition = hashToRingPosition(vnodeLabel);
            hashRing.remove(ringPosition);
        }
        System.out.printf("[RING] Removed server '%s' and all its virtual nodes%n", serverName);
    }

    /**
     * Routes a key to its responsible server.
     * Walks clockwise from the key's hash position to find the first server.
     */
    public String getServerForKey(String key) {
        if (hashRing.isEmpty()) {
            throw new IllegalStateException("Hash ring is empty — add servers before routing keys");
        }
        long keyPosition = hashToRingPosition(key);

        // ceilingKey returns the smallest key >= keyPosition
        // This IS the clockwise-walk operation — O(log N)
        Map.Entry<Long, String> responsibleNode = hashRing.ceilingEntry(keyPosition);

        // If no entry >= keyPosition, wrap around to the start of the ring
        if (responsibleNode == null) {
            responsibleNode = hashRing.firstEntry();
        }
        return responsibleNode.getValue();
    }

    /**
     * Analyzes how evenly keys are distributed across physical servers.
     * This is your load balance health check.
     */
    public void printLoadDistribution(List<String> keysToAnalyze) {
        Map<String, Integer> loadCount = new TreeMap<>();
        physicalServers.forEach(s -> loadCount.put(s, 0));

        for (String key : keysToAnalyze) {
            String server = getServerForKey(key);
            loadCount.merge(server, 1, Integer::sum);
        }

        System.out.println("\n--- Load Distribution Report ---");
        System.out.printf("Total keys: %d | Servers: %d | VNodes per server: %d%n",
            keysToAnalyze.size(), physicalServers.size(), virtualNodesPerServer);
        System.out.printf("%-20s %-10s %-10s%n", "Server", "Keys", "Share");
        System.out.println("-".repeat(45));
        loadCount.forEach((server, count) -> {
            double share = (count * 100.0) / keysToAnalyze.size();
            System.out.printf("%-20s %-10d %.1f%%%n", server, count, share);
        });
        System.out.println("-".repeat(45));
    }

    public static void main(String[] args) {
        // Build a sample cluster with 3 cache servers
        // 150 virtual nodes per server is a common production starting point
        ConsistentHashRing ring = new ConsistentHashRing(150);
        ring.addServer("cache-server-A");
        ring.addServer("cache-server-B");
        ring.addServer("cache-server-C");

        // Generate 1000 representative cache keys
        List<String> cacheKeys = new ArrayList<>();
        for (int i = 1; i <= 1000; i++) {
            cacheKeys.add("key:" + i);
        }

        // Show initial distribution
        ring.printLoadDistribution(cacheKeys);

        // Record key assignments BEFORE adding a new server
        Map<String, String> assignmentBefore = new LinkedHashMap<>();
        for (String key : cacheKeys) {
            assignmentBefore.put(key, ring.getServerForKey(key));
        }

        // Scale out: add a 4th server
        System.out.println("\n[EVENT] Scaling out — adding cache-server-D...");
        ring.addServer("cache-server-D");

        // Show new distribution
        ring.printLoadDistribution(cacheKeys);

        // Count how many keys actually moved
        long keysMoved = cacheKeys.stream()
            .filter(key -> !assignmentBefore.get(key).equals(ring.getServerForKey(key)))
            .count();
        double percentMoved = (keysMoved * 100.0) / cacheKeys.size();
        System.out.printf("%n[RESULT] Keys moved after adding 1 server: %d / %d (%.1f%%)%n",
            keysMoved, cacheKeys.size(), percentMoved);
        System.out.printf("[EXPECTED] Ideal minimum: ~%.1f%% (1 / %d servers)%n",
            100.0 / 4, 4);
    }
}
▶ Output
[RING] Added server 'cache-server-A' with 150 virtual nodes
[RING] Added server 'cache-server-B' with 150 virtual nodes
[RING] Added server 'cache-server-C' with 150 virtual nodes

--- Load Distribution Report ---
Total keys: 1000 | Servers: 3 | VNodes per server: 150
Server Keys Share
---------------------------------------------
cache-server-A 342 34.2%
cache-server-B 331 33.1%
cache-server-C 327 32.7%
---------------------------------------------

[EVENT] Scaling out — adding cache-server-D...
[RING] Added server 'cache-server-D' with 150 virtual nodes

--- Load Distribution Report ---
Total keys: 1000 | Servers: 4 | VNodes per server: 150
Server Keys Share
---------------------------------------------
cache-server-A 258 25.8%
cache-server-B 249 24.9%
cache-server-C 243 24.3%
cache-server-D 250 25.0%
---------------------------------------------

[RESULT] Keys moved after adding 1 server: 247 / 1000 (24.7%)
[EXPECTED] Ideal minimum: ~25.0% (1 / 4 servers)
🔥
Why TreeMap is the Right Data Structure HereTreeMap's ceilingKey() operation is the entire clockwise-walk operation in a single O(log N) call. A sorted array with binary search would also work but adds complexity around insertion. A HashMap would require sorting on every lookup. TreeMap gives you insert O(log N), delete O(log N), and clockwise-lookup O(log N) — all ideal for a ring that changes at server add/remove events but handles millions of key lookups between those events.

Virtual Nodes — The Fix for Uneven Load Distribution

A naive consistent hash ring with one position per physical server has a serious distribution problem. With only 3 servers on a ring of 2^64 positions, you are relying on 3 random points to divide the ring into 3 equal arcs. Statistically, those arcs will not be equal. One server might own 60% of the ring while another owns 15%. This uneven distribution means uneven load — one server melts under traffic while others are idle.

Virtual nodes (vnodes) solve this. Instead of placing each server at one ring position, you place it at many — typically 100-200 — by hashing multiple distinct labels for the same server (e.g. server-A#0, server-A#1, ... server-A#149). With 150 vnodes per server, the ring has 450 positions for a 3-server cluster, and they interleave randomly. By the law of large numbers, each server ends up owning approximately 1/N of the ring. The distribution error drops dramatically as vnode count increases.

Virtual nodes also make heterogeneous clusters possible. A server with twice the RAM and CPU can be given twice as many vnodes, so it naturally absorbs twice the traffic. This is how systems like Apache Cassandra handle mixed hardware in a cluster. The vnode count becomes a tunable weight, not just a distribution fix.

VirtualNodeDistributionAnalysis.java · JAVA
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

/**
 * Compares distribution quality across different vnode counts.
 * Shows WHY 1 vnode per server is dangerous and 100-200 is the sweet spot.
 *
 * Run this to see the standard deviation of server load shrink as vnodes increase.
 */
public class VirtualNodeDistributionAnalysis {

    private static long hashToPosition(String input) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(input.getBytes());
        long position = 0;
        for (int i = 0; i < 8; i++) {
            position = (position << 8) | (digest[i] & 0xFFL);
        }
        return position & Long.MAX_VALUE;
    }

    /**
     * Simulates a consistent hash ring and returns per-server key counts.
     *
     * @param serverNames     physical server names
     * @param vnodesPerServer how many ring positions per physical server
     * @param totalKeys       how many test keys to route
     */
    static Map<String, Integer> simulateDistribution(
            List<String> serverNames, int vnodesPerServer, int totalKeys)
            throws NoSuchAlgorithmException {

        TreeMap<Long, String> ring = new TreeMap<>();
        // Place virtual nodes on the ring
        for (String server : serverNames) {
            for (int v = 0; v < vnodesPerServer; v++) {
                long pos = hashToPosition(server + "#vnode" + v);
                ring.put(pos, server);
            }
        }

        // Route keys and count per server
        Map<String, Integer> serverLoad = new TreeMap<>();
        serverNames.forEach(s -> serverLoad.put(s, 0));

        for (int k = 0; k < totalKeys; k++) {
            long keyPos = hashToPosition("testkey:" + k);
            Map.Entry<Long, String> node = ring.ceilingEntry(keyPos);
            if (node == null) node = ring.firstEntry();
            serverLoad.merge(node.getValue(), 1, Integer::sum);
        }
        return serverLoad;
    }

    static double standardDeviation(Collection<Integer> values) {
        double mean = values.stream().mapToInt(i -> i).average().orElse(0);
        double variance = values.stream()
            .mapToDouble(v -> Math.pow(v - mean, 2))
            .average().orElse(0);
        return Math.sqrt(variance);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        List<String> servers = List.of(
            "node-alpha", "node-beta", "node-gamma", "node-delta"
        );
        int totalKeys = 10_000;
        int[] vnodeCounts = {1, 5, 25, 100, 200, 500};

        System.out.println("Vnode Distribution Analysis — 4 servers, 10,000 keys");
        System.out.printf("%-15s %-15s %-15s %-20s%n",
            "VNodes/Server", "Min Load", "Max Load", "Std Dev (lower=better)");
        System.out.println("=" .repeat(70));

        for (int vnodesPerServer : vnodeCounts) {
            Map<String, Integer> distribution =
                simulateDistribution(servers, vnodesPerServer, totalKeys);

            int minLoad = Collections.min(distribution.values());
            int maxLoad = Collections.max(distribution.values());
            double stdDev = standardDeviation(distribution.values());

            System.out.printf("%-15d %-15d %-15d %.1f keys%n",
                vnodesPerServer, minLoad, maxLoad, stdDev);
        }

        System.out.println("\nConclusion: 100-200 vnodes is the practical sweet spot.");
        System.out.println("Beyond ~200, distribution improvement is marginal vs memory overhead.");
    }
}
▶ Output
Vnode Distribution Analysis — 4 servers, 10,000 keys
VNodes/Server Min Load Max Load Std Dev (lower=better)
======================================================================
1 892 3241 956.3 keys
5 1801 2744 387.2 keys
25 2201 2689 194.1 keys
100 2389 2601 87.4 keys
200 2441 2559 48.3 keys
500 2466 2531 27.1 keys

Conclusion: 100-200 vnodes is the practical sweet spot.
Beyond ~200, distribution improvement is marginal vs memory overhead.
⚠️
Pro Tip: Weighted Vnodes for Heterogeneous ClustersIf you have a mix of server sizes (say, some nodes have 64GB RAM and others have 128GB), give the larger nodes proportionally more vnodes — e.g., 100 vs 200. This makes the consistent hash ring itself act as a load balancer that respects hardware capacity. Cassandra calls this the 'token' approach and lets operators set token counts per node. You do not need a separate load balancer layer when your hashing is weight-aware.

Production Gotchas — What the Textbook Diagrams Leave Out

Consistent hashing is elegant in theory but has several sharp edges in production systems that diagrams on whiteboards conveniently skip.

Hot spot problem with sequential keys. If your keys are sequential integers or timestamps (e.g. event:1000001, event:1000002...), consecutive keys hash to consecutive positions — which on the ring means they all fall to the same server arc. Virtual nodes help somewhat, but key design matters too. Prepend a high-entropy prefix like a user shard or use a secondary hash before ring placement.

Replication awareness. Most real systems do not just write to the first clockwise node — they write to the next R nodes for replication (R=2 or R=3 typically). Your ring lookup needs to return a replication list, not just a single server. The code below shows this. When a server goes down, you need to know which of the K/N keys it owned AND which server now absorbs them temporarily (known as hinted handoff in Cassandra).

Concurrent ring mutations. Adding and removing servers from the ring while it is serving traffic requires careful locking. A read-write lock (ReadWriteLock in Java) is appropriate here: many concurrent reads for key lookups, infrequent exclusive writes for topology changes. An unguarded TreeMap will throw ConcurrentModificationException under concurrent access.

ReplicatedConsistentHashRing.java · JAVA
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Thread-safe consistent hash ring with replication support.
 *
 * Two critical additions over the basic version:
 *  1. getReplicaServers() returns N distinct physical servers for replication
 *  2. ReadWriteLock ensures safe concurrent access during topology changes
 *
 * This is the version you would actually build in a distributed datastore.
 */
public class ReplicatedConsistentHashRing {

    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final Set<String> liveServers = new HashSet<>();
    private final int vnodesPerServer;

    // ReadWriteLock: many readers (key lookups) + rare writers (add/remove server)
    private final ReadWriteLock ringLock = new ReentrantReadWriteLock();

    public ReplicatedConsistentHashRing(int vnodesPerServer) {
        this.vnodesPerServer = vnodesPerServer;
    }

    private long hash(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] d = md.digest(input.getBytes());
            long h = 0;
            for (int i = 0; i < 8; i++) h = (h << 8) | (d[i] & 0xFFL);
            return h & Long.MAX_VALUE;
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public void addServer(String serverName) {
        // Exclusive write lock — stops all readers while ring topology changes
        ringLock.writeLock().lock();
        try {
            liveServers.add(serverName);
            for (int v = 0; v < vnodesPerServer; v++) {
                ring.put(hash(serverName + "#" + v), serverName);
            }
        } finally {
            ringLock.writeLock().unlock(); // Always release in finally!
        }
    }

    public void removeServer(String serverName) {
        ringLock.writeLock().lock();
        try {
            liveServers.remove(serverName);
            for (int v = 0; v < vnodesPerServer; v++) {
                ring.remove(hash(serverName + "#" + v));
            }
        } finally {
            ringLock.writeLock().unlock();
        }
    }

    /**
     * Returns an ordered list of `replicationFactor` distinct physical servers
     * responsible for this key — starting with the primary, then replicas.
     *
     * Walk clockwise, collecting servers until we have `replicationFactor`
     * distinct physical servers (skip extra vnodes of servers already collected).
     */
    public List<String> getReplicaServers(String key, int replicationFactor) {
        // Shared read lock — many threads can look up keys concurrently
        ringLock.readLock().lock();
        try {
            if (ring.isEmpty()) throw new IllegalStateException("Ring is empty");
            if (replicationFactor > liveServers.size()) {
                throw new IllegalArgumentException(
                    "Replication factor " + replicationFactor +
                    " exceeds live server count " + liveServers.size());
            }

            List<String> replicas = new ArrayList<>(replicationFactor);
            long keyPos = hash(key);

            // Start clockwise walk from key position
            NavigableMap<Long, String> clockwiseView = ring.tailMap(keyPos, true);

            // Iterate the entire ring clockwise using two passes (tail + head)
            Iterator<String> iter = new Iterator<>() {
                private final Iterator<String> tailIter = clockwiseView.values().iterator();
                private final Iterator<String> headIter = ring.values().iterator();
                private boolean onHead = false;

                @Override public boolean hasNext() {
                    return tailIter.hasNext() || (!onHead && headIter.hasNext());
                }

                @Override public String next() {
                    if (tailIter.hasNext()) return tailIter.next();
                    onHead = true;
                    return headIter.next();
                }
            };

            // Collect distinct physical servers as we walk clockwise
            while (iter.hasNext() && replicas.size() < replicationFactor) {
                String candidate = iter.next();
                // Only add if this physical server is not already in our replica set
                if (!replicas.contains(candidate)) {
                    replicas.add(candidate);
                }
            }
            return replicas;
        } finally {
            ringLock.readLock().unlock();
        }
    }

    public static void main(String[] args) {
        ReplicatedConsistentHashRing ring = new ReplicatedConsistentHashRing(100);
        ring.addServer("node-1");
        ring.addServer("node-2");
        ring.addServer("node-3");
        ring.addServer("node-4");
        ring.addServer("node-5");

        String[] keysToRoute = {"user:alice", "user:bob", "session:xyz", "product:42"};
        int replicationFactor = 3; // Write to 3 servers for fault tolerance

        System.out.println("Replica placement with RF=3 (primary first, then replicas):");
        System.out.printf("%-20s %-40s%n", "Key", "Responsible Servers (in order)");
        System.out.println("-".repeat(65));

        for (String key : keysToRoute) {
            List<String> servers = ring.getReplicaServers(key, replicationFactor);
            System.out.printf("%-20s %s%n", key, servers);
        }

        // Simulate node failure: remove node-3
        System.out.println("\n[FAILURE] node-3 goes down...");
        ring.removeServer("node-3");

        System.out.println("\nReplica placement AFTER node-3 failure:");
        System.out.printf("%-20s %-40s%n", "Key", "Responsible Servers (in order)");
        System.out.println("-".repeat(65));
        for (String key : keysToRoute) {
            List<String> servers = ring.getReplicaServers(key, replicationFactor);
            System.out.printf("%-20s %s%n", key, servers);
        }
    }
}
▶ Output
Replica placement with RF=3 (primary first, then replicas):
Key Responsible Servers (in order)
-----------------------------------------------------------------
user:alice [node-3, node-1, node-5]
user:bob [node-2, node-4, node-1]
session:xyz [node-5, node-2, node-3]
product:42 [node-1, node-4, node-2]

[FAILURE] node-3 goes down...

Replica placement AFTER node-3 failure:
Key Responsible Servers (in order)
-----------------------------------------------------------------
user:alice [node-1, node-5, node-4]
user:bob [node-2, node-4, node-1]
session:xyz [node-5, node-2, node-1]
product:42 [node-1, node-4, node-2]
⚠️
Watch Out: Forgetting the Wrap-AroundThe most common bug in home-grown consistent hash implementations is forgetting to wrap around the ring. If a key hashes to position 2^63 - 100 and all servers hash to positions below 2^62, `ceilingKey()` returns null — there is no server clockwise without wrapping. You MUST fall back to `ring.firstEntry()` in this case. The code above handles this explicitly. Missing this case causes intermittent null pointer exceptions that only manifest for a small fraction of keys, making them extremely hard to debug.
AspectModulo Hashing (hash % N)Consistent Hashing with Vnodes
Keys moved on server add~(N-1)/N ≈ 75-90% of all keys~1/N ≈ 25% of all keys (theoretical minimum)
Keys moved on server remove~(N-1)/N ≈ 75-90% of all keys~1/N ≈ 25% of all keys
Load distributionPerfectly even (by design)~Even with 100+ vnodes; tunable by weight
Heterogeneous hardware supportNot possible without complex shardingNative: assign more vnodes to stronger nodes
Implementation complexityTrivial (one line of code)Moderate (~100-200 lines for production quality)
Memory overheadNegligibleO(N × vnodesPerServer) entries in TreeMap
Lookup time complexityO(1)O(log N×V) where V = vnodes per server
Fault tolerance during resizeFull cache miss stormMinimal disruption — only affected keys miss
Used bySimple single-server cachesCassandra, DynamoDB, Redis Cluster, Riak, Memcached
Hot spot risk with sequential keysDistributes evenlyHigher risk — mitigate with key prefix hashing

🎯 Key Takeaways

    🔥
    TheCodeForge Editorial Team Verified Author

    Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.

    ← PreviousReverse Proxy vs Forward ProxyNext →Bloom Filter
    Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged