Intermediate 10 min · March 05, 2026

Hash Collisions — HashMap Degradation to O(n)

Hash collisions caused 200+ entries per bucket, spiking P99 latency to 500ms — fix hashCode to include all equals() fields and pre-size capacity.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • A hash collision occurs when two distinct keys produce the same hash value, mapping to the same bucket in a hash table
  • The pigeonhole principle guarantees collisions if keys outnumber buckets
  • Separate chaining resolves collisions by storing multiple entries in each bucket (e.g., via linked lists or trees)
  • Open addressing resolves collisions by probing for the next available slot within the table itself
  • Load factor (entries/buckets) critically impacts performance: >0.7 degrades open addressing rapidly
  • Java's HashMap uses chaining, upgrading to red-black trees when a bucket exceeds 8 entries
Plain-English First

Imagine a hotel with 10 rooms numbered 0–9. When guests check in, the receptionist takes the last digit of their booking ID and assigns that room number. Two guests with IDs ending in '7' both get told 'Room 7' — that's a collision. The hotel now needs a policy: do you add a bunk bed in Room 7 (chaining), or do you walk the guest down the hall to the next free room (open addressing)? Hash tables face exactly this problem every time two keys map to the same slot. The choice between those two policies determines how fast your hotel runs under pressure.

Every time you use a HashMap in Java, look up a username in a database, or cache a web response, a hash table is doing the heavy lifting underneath. Hash tables promise near-instant lookups — O(1) on average — and that promise is so useful that they're embedded in almost every non-trivial piece of software ever written. But that promise comes with a catch: two different keys can produce the same hash value, landing in the same bucket. When that happens, you have a collision, and how your data structure handles it determines whether your HashMap stays fast or quietly degrades into a slow linked list.

Collisions aren't bugs — they're a mathematical certainty. The pigeonhole principle guarantees them: if you have more possible keys than buckets, some keys must share a bucket. The real question isn't 'how do I avoid collisions?' It's 'how do I handle them cheaply enough that my O(1) average still holds?' That question has two primary answers — separate chaining and open addressing — and each has tradeoffs that matter in production.

I've seen HashMap performance degrade in production in two distinct ways. The first is the slow burn: a service gradually accumulates entries in a map that was never pre-sized, resizes repeatedly, and starts showing elevated P99 latency six weeks after deployment when the load finally hits the threshold. The second is the sudden collapse: a hashCode() implementation with poor distribution sends 80% of entries into the same handful of buckets, and what should be O(1) lookups become O(n) walks through collision chains under peak traffic.

Both problems are entirely avoidable once you understand the mechanics. By the end of this article you'll understand exactly why collisions happen, how Java's own HashMap uses chaining — and when it upgrades from a linked list to a red-black tree — how linear probing and quadratic probing work as open-addressing alternatives, and which strategy to reach for depending on your load factor, key distribution, and memory constraints. You'll also have complete, runnable Java implementations you can drop into your IDE today.

Why Collisions Happen and Why They're Unavoidable

A hash function takes an arbitrary key — a string, an integer, a complex object — and maps it to a fixed-size integer called a hash code. That integer is then reduced to a bucket index, typically by taking the hash code modulo the number of buckets: index = hashCode % capacity. The problem is straightforward: there are effectively infinite possible keys but a finite number of buckets. By the pigeonhole principle, if you have more keys than buckets, some keys must share a bucket.

But collisions happen even when you have far fewer entries than buckets, because the hash function does not produce a perfect uniform distribution in practice. A HashMap with 1,000 buckets and 200 entries will statistically have some buckets with 2 or 3 entries even with a good hash function. With a poor hash function — one that ignores half the key's fields, or one whose output correlates with patterns in your actual data — you can get catastrophic skew: 10 entries in bucket 5, 180 entries in bucket 7, and 190 empty buckets.

This is why understanding collisions is not just a computer science exercise. In Java, your hashCode() implementation is a contract: it must be consistent with equals(), it must be stable for the lifetime of an object's use as a map key, and it should distribute well over your actual key domain — not just random inputs. The JVM provides no automatic auditing of any of these properties. A hashCode() that passes unit tests but distributes poorly on production data is one of the harder categories of performance bugs to diagnose without the right tooling.

CollisionDemo.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/**
 * CollisionDemo — illustrates how hashCode() quality affects
 * bucket distribution in a HashMap.
 *
 * Run this with your actual key domain to audit distribution
 * before deploying a new hashCode() implementation.
 */
public class CollisionDemo {

    // --- POOR hashCode(): ignores sessionToken entirely ---
    // Two sessions for the same user always collide
    static class SessionKeyPoor {
        final int    userId;
        final String sessionToken;

        SessionKeyPoor(int userId, String sessionToken) {\n            this.userId       = userId;\n            this.sessionToken = sessionToken;\n        }

        @Override
        public int hashCode() {
            return userId;  // sessionToken completely ignored
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SessionKeyPoor)) return false;
            SessionKeyPoor other = (SessionKeyPoor) o;
            return userId == other.userId
                && Objects.equals(sessionToken, other.sessionToken);
        }
    }

    // --- GOOD hashCode(): incorporates all fields used in equals() ---
    static class SessionKeyGood {
        final int    userId;
        final String sessionToken;

        SessionKeyGood(int userId, String sessionToken) {\n            this.userId       = userId;\n            this.sessionToken = sessionToken;\n        }

        @Override
        public int hashCode() {
            return Objects.hash(userId, sessionToken);
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SessionKeyGood)) return false;
            SessionKeyGood other = (SessionKeyGood) o;
            return userId == other.userId
                && Objects.equals(sessionToken, other.sessionToken);
        }
    }

    // Computes bucket distribution for a set of keys at a given capacity
    static <K> void auditDistribution(String label, Iterable<K> keys, int capacity) {
        int[] bucketCounts = new int[capacity];
        int   total        = 0;
        int   maxBucket    = 0;

        for (K key : keys) {
            // Replicates HashMap's internal index computation
            int hash  = key.hashCode();
            hash      = hash ^ (hash >>> 16);  // HashMap's spread() equivalent
            int index = (capacity - 1) & hash;
            bucketCounts[index]++;
            total++;
            maxBucket = Math.max(maxBucket, bucketCounts[index]);
        }

        int nonEmpty = 0;
        for (int count : bucketCounts) {
            if (count > 0) nonEmpty++;
        }

        System.out.printf("\n[%s]%n", label);
        System.out.printf("  Keys: %d | Capacity: %d | Non-empty buckets: %d%n",
                          total, capacity, nonEmpty);
        System.out.printf("  Max entries in one bucket: %d%n", maxBucket);
        System.out.printf("  Ideal max (with perfect distribution): ~%d%n",
                          (int) Math.ceil((double) total / nonEmpty));
    }

    public static void main(String[] args) {
        int capacity = 64;
        java.util.List<SessionKeyPoor> poorKeys = new java.util.ArrayList<>();
        java.util.List<SessionKeyGood> goodKeys = new java.util.ArrayList<>();

        // Simulate 50 sessions for userId=1 — all same userId, different tokens
        for (int i = 0; i < 50; i++) {
            String token = "token-" + i;
            poorKeys.add(new SessionKeyPoor(1, token));
            goodKeys.add(new SessionKeyGood(1, token));
        }
        // Add 50 sessions across different userIds
        for (int i = 2; i <= 51; i++) {
            poorKeys.add(new SessionKeyPoor(i, "token-0"));
            goodKeys.add(new SessionKeyGood(i, "token-0"));
        }

        auditDistribution("Poor hashCode (ignores sessionToken)", poorKeys, capacity);
        auditDistribution("Good hashCode (Objects.hash all fields)", goodKeys, capacity);
    }
}
Output
[Poor hashCode (ignores sessionToken)]
Keys: 100 | Capacity: 64 | Non-empty buckets: 51
Max entries in one bucket: 50
Ideal max (with perfect distribution): ~2
[Good hashCode (Objects.hash all fields)]
Keys: 100 | Capacity: 64 | Non-empty buckets: 63
Max entries in one bucket: 3
Ideal max (with perfect distribution): ~2
The Hotel Room Mental Model
  • Hash function = the receptionist's room-assignment algorithm
  • Bucket = a hotel room that may hold one or more guests
  • Collision = two guests assigned the same room number by the algorithm
  • Chaining = the hotel adds bunk beds — multiple guests share the room in a list
  • Open addressing = the hotel sends the second guest to the next available room
  • Load factor = occupancy rate — above ~75%, any strategy starts straining
  • hashCode() quality = how evenly the receptionist distributes guests across all available rooms
Production Insight
The worst HashMap performance issues I have seen were not caused by algorithmic choices — they were caused by hashCode() implementations that ignored half the key's identity. A session key that only hashes userId will put every session for user 1 into the same bucket. The HashMap looks correct from the outside (equals() still works) but performs like a linked list for any multi-session user. The audit pattern in the code above — compute the actual bucket distribution for your real key sample before deploying — takes 20 minutes and has saved me from at least three production incidents.
Key Takeaway
Collisions are mathematically guaranteed by the pigeonhole principle — the question is whether they're evenly distributed or catastrophically clustered. A hashCode() that ignores key fields turns your HashMap into a linked list for affected inputs. Always audit distribution against your actual production key domain, not random synthetic data.
Is My hashCode() Implementation Safe?
IfhashCode() uses fewer fields than equals()
UseBroken — distinct objects can have equal hash codes, but equal objects must have equal hash codes. Use all fields from equals().
IfhashCode() returns a constant (e.g., return 42)
UseFunctionally correct but catastrophic for performance — every key maps to the same bucket. All operations degrade to O(n).
IfhashCode() uses Objects.hash() with all equals() fields
UseCorrect and generally good — Objects.hash() applies a reasonable mixing function. Audit distribution on your actual key domain to confirm.
IfKeys are structured (sequential IDs, timestamps, prefixed strings)
UseTest distribution explicitly — structured keys often interact badly with simple hash functions. Consider a stronger mixing function like MurmurHash3.
IfObject is mutable and used as a HashMap key
UseDangerous — if a key's fields change after insertion, the key is unreachable in the map. Use immutable keys or defensive copies.

Separate Chaining — How Java's HashMap Actually Works

Separate chaining is the simpler of the two collision strategies to reason about: each bucket in the hash table is not a single slot but a container that can hold multiple entries. When two keys hash to the same bucket, both entries coexist in that bucket's container. Lookup requires finding the right bucket via the hash, then scanning the container for the specific key using equals().

Java's HashMap uses separate chaining as its resolution strategy, with a crucial performance refinement added in Java 8. When a bucket contains 8 or fewer entries, they're stored as a singly linked list. When the count exceeds the TREEIFY_THRESHOLD (8), the linked list is converted to a red-black tree. This treeification changes worst-case lookup in that bucket from O(n) to O(log n). If the entry count in a bucket later drops below UNTREEIFY_THRESHOLD (6), the tree converts back to a linked list.

The load factor interacts directly with chaining performance. Java's default load factor is 0.75, meaning the HashMap resizes (doubles its bucket count and rehashes all entries) when entries exceed 75% of capacity. A lower load factor reduces collision probability but wastes memory. A higher load factor conserves memory but increases chain lengths. For most general-purpose use, 0.75 is a reasonable default — but if you know your expected entry count, pre-sizing avoids the resize overhead entirely.

One thing the HashMap's design doesn't protect you from: treeification only helps when your bucket exceeds 8 entries but your hashCode() still has some variation. If your hashCode() returns the same value for all keys — or maps all keys to the same bucket — you'll have one tree containing all N entries, giving O(log N) lookup instead of O(1). That's better than a chain but still not what you paid for.

ChainingHashMap.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Objects;

/**
 * ChainingHashMap — a minimal separate chaining hash map.
 * Written to make the internal mechanics transparent.
 * This mirrors the core behaviour of java.util.HashMap
 * without Java 8's treeification complexity.
 *
 * Not production code — use java.util.HashMap in real systems.
 */
public class ChainingHashMap<K, V> {\n\n    private static final int    DEFAULT_CAPACITY    = 16;\n    private static final double DEFAULT_LOAD_FACTOR = 0.75;\n\n    // Each bucket is a list of key-value entries\n    // java.util.HashMap uses Node<K,V>[] internally — same idea\n    private List<Entry<K, V>>[] buckets;\n    private int                 size;\n    private final double        loadFactor;\n\n    static class Entry<K, V> {\n        final K key;\n        V       value;\n\n        Entry(K key, V value) {\n            this.key   = key;\n            this.value = value;\n        }
    }

    @SuppressWarnings("unchecked")
    public ChainingHashMap(int initialCapacity, double loadFactor) {\n        this.buckets    = new LinkedList[initialCapacity];\n        this.loadFactor = loadFactor;\n        this.size       = 0;\n    }

    public ChainingHashMap() {
        this(DEFAULT_CAPACITY, DEFAULT_LOAD_FACTOR);
    }

    // Spread the hash to avoid clustering on lower bits
    // Mirrors HashMap's (h = key.hashCode()) ^ (h >>> 16)
    private int spread(int hashCode) {
        return hashCode ^ (hashCode >>> 16);
    }

    private int bucketIndex(K key) {
        return spread(Objects.hashCode(key)) & (buckets.length - 1);
    }

    public void put(K key, V value) {\n        // Check load factor and resize before inserting\n        if ((double) size / buckets.length >= loadFactor) {\n            resize();\n        }

        int index = bucketIndex(key);
        if (buckets[index] == null) {
            buckets[index] = new LinkedList<>();
        }

        // Update value if key already exists in the chain
        for (Entry<K, V> entry : buckets[index]) {
            if (Objects.equals(entry.key, key)) {\n                entry.value = value;\n                return;\n            }
        }

        // Key not found in chain — append new entry
        buckets[index].add(new Entry<>(key, value));
        size++;
    }

    public V get(K key) {
        int index = bucketIndex(key);
        if (buckets[index] == null) return null;

        // Walk the chain at this bucket — O(n) for n entries in this bucket
        for (Entry<K, V> entry : buckets[index]) {
            if (Objects.equals(entry.key, key)) {\n                return entry.value;\n            }
        }
        return null;
    }

    public boolean remove(K key) {
        int index = bucketIndex(key);
        if (buckets[index] == null) return false;

        return buckets[index].removeIf(e -> Objects.equals(e.key, key));
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        List<Entry<K, V>>[] oldBuckets = buckets;
        buckets = new LinkedList[oldBuckets.length * 2];
        size    = 0;

        // Rehash all existing entries into new bucket array
        for (List<Entry<K, V>> bucket : oldBuckets) {\n            if (bucket != null) {\n                for (Entry<K, V> entry : bucket) {\n                    put(entry.key, entry.value);  // recomputes bucket index\n                }
            }
        }
        System.out.printf("[resize] new capacity: %d%n", buckets.length);
    }

    // Diagnostic: print bucket distribution for auditing
    public void printBucketDistribution() {
        int occupied = 0;
        int maxChain = 0;
        for (List<Entry<K, V>> bucket : buckets) {\n            if (bucket != null && !bucket.isEmpty()) {\n                occupied++;\n                maxChain = Math.max(maxChain, bucket.size());\n            }
        }
        System.out.printf(
            "Buckets: %d total | %d occupied | max chain length: %d | load: %.2f%n",
            buckets.length, occupied, maxChain, (double) size / buckets.length
        );
    }

    public int size() { return size; }

    public static void main(String[] args) {
        ChainingHashMap<String, Integer> map = new ChainingHashMap<>(8, 0.75);

        // Insert enough entries to trigger a resize
        String[] keys = { "alpha", "bravo", "charlie", "delta",
                          "echo", "foxtrot", "golf", "hotel" };
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], i);
        }

        System.out.println("--- After inserting 8 entries into capacity-8 map ---");
        map.printBucketDistribution();

        System.out.println("get("charlie") = " + map.get("charlie"));
        System.out.println("get("india")   = " + map.get("india"));

        map.remove("delta");
        System.out.println("After remove("delta"), get("delta") = " + map.get("delta"));

        map.printBucketDistribution();
    }
}
Output
[resize] new capacity: 16
--- After inserting 8 entries into capacity-8 map ---
Buckets: 16 total | 8 occupied | max chain length: 1 | load: 0.50
get("charlie") = 2
get("india") = null
After remove("delta"), get("delta") = null
Buckets: 16 total | 7 occupied | max chain length: 1 | load: 0.44
How Java's HashMap Treeifies a Bucket
  • TREEIFY_THRESHOLD = 8: when a bucket's linked list exceeds 8 entries, it converts to a red-black tree
  • UNTREEIFY_THRESHOLD = 6: when a tree bucket's entry count drops below 6 (from removal), it converts back to a linked list
  • MIN_TREEIFY_CAPACITY = 64: treeification only happens if the total table capacity is at least 64; below that, the table resizes instead
  • Treeification does not fix a bad hashCode() — if all keys hash to the same bucket, you get one tree with O(log n) instead of O(1)
  • The tree uses natural ordering or Comparable if keys implement it; otherwise it uses identity hash codes as a tiebreaker
Production Insight
The treeification in Java 8 was added specifically as a defence against hash DoS attacks — an attacker who can control input keys can craft keys that all hash to the same bucket, degrading a HashMap to O(n) and causing a denial of service. Before Java 8, this was a known vulnerability in web frameworks that used HashMaps to process request parameters. Treeification bounds the damage: even if every key lands in one bucket, lookups stay O(log n) instead of collapsing to O(n). This does not eliminate the vulnerability — it mitigates it. The real fix is using a hash function with a random seed (which Java's String.hashCode() does not do) so an attacker cannot predict collision patterns.
Key Takeaway
Separate chaining stores multiple entries per bucket in a linked list (or red-black tree in Java 8+). It handles high load gracefully and deletion is trivial. Java's HashMap is a chaining implementation — understanding its treeification threshold (8 entries) and load factor trigger (0.75) lets you configure it correctly instead of accepting defaults that may not fit your workload.
When to Pre-Size Your HashMap
IfExpected entry count is known at construction time
UsePre-size: new HashMap<>((int)(expectedEntries / 0.75) + 1). Eliminates all resize operations during normal use.
IfEntry count is unknown but has a reasonable upper bound
UseUse the upper bound for pre-sizing. Unused capacity costs memory; unexpected resizes cost latency. Memory is usually cheaper.
IfEntry count is genuinely unbounded
UseAccept the default and let the HashMap resize naturally. Monitor for resize-correlated latency spikes and add pre-sizing if they appear.
IfMap is shared across threads
UseUse ConcurrentHashMap with explicit initial capacity and concurrency level: new ConcurrentHashMap<>(capacity, 0.75f, concurrencyLevel)
IfMap is used as a fixed-size lookup table (entries never change after construction)
UseConsider Google Guava's ImmutableMap — built from a known entry set with optimal distribution, no resize possible

Bucket Transformation: Before and After Treeification

Treeification is the process that converts a bucket's linked list into a red-black tree when the number of entries in that bucket exceeds the TREEIFY_THRESHOLD (8 entries, in Java 8+). Understanding this structural change is crucial because it turns a potential O(n) collision chain into an O(log n) lookup path.

Before treeification, a bucket with 9 entries looks like a singly linked list. Each node has a key, value, hash, and a pointer to the next node. Getting the 9th entry requires traversing 8 links, which is cache-unfriendly because each node is a separate object on the heap. After treeification, the same 9 entries are stored in a balanced binary search tree (red-black tree). The root node has two children; each child has two children (or fewer). Search in a balanced tree of 9 nodes requires at most 4 comparisons (log2 9 = 4).

The diagram below shows the same bucket before and after treeification. On the left, a linked list of 9 entry nodes. On the right, the corresponding red-black tree structure. Notice the tree uses left/right child references instead of a single next pointer, and nodes are color-coded (red/black) to maintain balance constraints.

This transformation is not free — it involves traversing the linked list, sorting the entries by key's hashCode (or by natural ordering if keys implement Comparable), and building the tree structure. But the one-time O(n log n) cost pays off every subsequent lookup, insertion, and deletion that targets this bucket.

Production Insight
In production, treeification is rarely triggered by natural hashCode() distribution — a good hash function spreads entries evenly across buckets, so no single bucket grows beyond the threshold. When treeification does happen, it's usually a symptom of hash skew: either a hashCode() that ignores key fields, or an attacker deliberately crafting colliding keys. Monitoring the ratio of TreeNode instances to regular Node instances in a HashMap (via heap dumps or JMX) can reveal silent collision problems before they cause visible latency.
Key Takeaway
Treeification converts a bucket's linked list into a red-black tree when the bucket exceeds 8 entries, turning worst-case lookup from O(n) to O(log n). This is a defensive measure against hash collision attacks, not a substitute for a good hashCode().

Open Addressing — Linear Probing, Quadratic Probing, and Double Hashing

Open addressing takes a fundamentally different approach: all entries live inside the hash table array itself — no external linked lists or trees. When a collision occurs, the algorithm probes for the next available slot according to a defined sequence. The entry goes into the first empty slot found.

The three main probing strategies differ in how they compute the next slot to check:

Linear probing checks consecutive slots: if h(k) is occupied, try h(k)+1, then h(k)+2, and so on, wrapping around at the table boundary. It's cache-friendly because the probe sequence accesses consecutive memory locations. The problem is primary clustering: when multiple keys hash to adjacent slots, they form runs that grow over time. Any new key that hashes into or near a run extends it, slowing down both insertions and lookups for unrelated keys in that region.

Quadratic probing checks slots at increasing quadratic intervals: h(k), h(k)+1², h(k)+2², h(k)+3². This breaks up the consecutive clustering of linear probing. The cost is secondary clustering: two keys with the same initial hash follow the exact same probe sequence, so they still cluster — just not in a consecutive run. Cache performance is also worse because quadratic jumps don't access sequential memory.

Double hashing uses a second hash function to compute the probe step: h(k, i) = (h1(k) + i × h2(k)) % capacity. Different keys with the same initial hash follow different probe sequences, eliminating secondary clustering. It's the most theoretically robust open-addressing strategy but requires a second well-designed hash function and is sensitive to the choice of h2 — if h2 ever returns 0 or shares a common factor with the table size, the probe sequence degenerates.

All three strategies share a critical constraint: the load factor must stay below approximately 0.7 for performance to remain reasonable. As the table fills above 70%, the probability that any given probe hits an occupied slot increases sharply, and average probe lengths grow quickly. Unlike chaining, open addressing cannot operate with a load factor above 1.0 at all — the table has no overflow mechanism.

Deletion in open addressing is also more complex than in chaining. You cannot simply remove an entry and leave the slot empty, because that would break probe sequences for other entries that probed past that slot during insertion. The standard solution is tombstone markers: deleted slots are marked as 'deleted' rather than empty, so probe sequences continue through them during lookup. Tombstones accumulate over time and degrade performance — eventually the table should be rebuilt to clear them.

OpenAddressingHashMap.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
import java.util.Objects;

/**
 * OpenAddressingHashMap — linear probing implementation.
 * Demonstrates the core mechanics of open addressing:
 * probe sequences, tombstone deletion, and load-factor resizing.
 *
 * Not production code — use java.util.HashMap in real systems.
 * This implementation is for understanding the mechanics.
 */
public class OpenAddressingHashMap<K, V> {\n\n    private static final int    DEFAULT_CAPACITY    = 16;\n    private static final double MAX_LOAD_FACTOR     = 0.65; // conservative for open addressing\n\n    // Sentinel for deleted slots — probe sequences must continue through these\n    private static final Object TOMBSTONE = new Object();\n\n    private Object[] keys;\n    private Object[] values;\n    private int      size;       // live entries\n    private int      tombstones; // deleted slots not yet reclaimed\n\n    public OpenAddressingHashMap() {\n        keys      = new Object[DEFAULT_CAPACITY];\n        values    = new Object[DEFAULT_CAPACITY];\n        size      = 0;\n        tombstones = 0;\n    }

    private int hash(K key) {
        int h = Objects.hashCode(key);
        h = h ^ (h >>> 16);  // spread to reduce clustering on lower bits
        return Math.abs(h) % keys.length;
    }

    public void put(K key, V value) {\n        // Resize if live entries + tombstones exceed load factor\n        // Tombstones count against effective load because they consume slots\n        if ((double)(size + tombstones) / keys.length >= MAX_LOAD_FACTOR) {\n            resize();\n        }

        int index         = hash(key);
        int firstTombstone = -1;

        for (int i = 0; i < keys.length; i++) {
            int probe = (index + i) % keys.length;  // linear probe

            if (keys[probe] == null) {
                // Empty slot — insert here (or at first tombstone if found earlier)
                int insertAt = (firstTombstone != -1) ? firstTombstone : probe;
                if (firstTombstone != -1) tombstones--; // reclaiming a tombstone
                keys[insertAt]   = key;
                values[insertAt] = value;
                size++;
                return;
            }

            if (keys[probe] == TOMBSTONE) {
                // Record first tombstone — we can reuse it for insertion
                if (firstTombstone == -1) firstTombstone = probe;
                continue;
            }

            if (Objects.equals(keys[probe], key)) {\n                // Key already exists — update value in place\n                values[probe] = value;\n                return;\n            }
            // Occupied by a different key — continue probing (linear: next slot)
        }

        // Table is full (should not reach here with correct load factor management)
        throw new IllegalStateException("Hash table is full — resize threshold too high");
    }

    @SuppressWarnings("unchecked")
    public V get(K key) {
        int index = hash(key);

        for (int i = 0; i < keys.length; i++) {
            int probe = (index + i) % keys.length;

            if (keys[probe] == null) {
                // Empty slot — key definitely not present (probe sequence ends here)
                return null;
            }

            if (keys[probe] == TOMBSTONE) {
                // Deleted slot — continue probing (key may be further along)
                continue;
            }

            if (Objects.equals(keys[probe], key)) {\n                return (V) values[probe];\n            }
        }
        return null;
    }

    public boolean remove(K key) {
        int index = hash(key);

        for (int i = 0; i < keys.length; i++) {
            int probe = (index + i) % keys.length;

            if (keys[probe] == null) return false;  // not present

            if (keys[probe] == TOMBSTONE) continue;

            if (Objects.equals(keys[probe], key)) {\n                // Place tombstone — cannot leave null or probe sequences break\n                keys[probe]   = TOMBSTONE;\n                values[probe] = null;\n                size--;\n                tombstones++;\n                return true;\n            }
        }
        return false;
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        Object[] oldKeys   = keys;
        Object[] oldValues = values;
        keys      = new Object[oldKeys.length * 2];
        values    = new Object[oldKeys.length * 2];
        size      = 0;
        tombstones = 0;

        System.out.printf("[resize] %d -> %d capacity%n",
                          oldKeys.length, keys.length);

        // Rehash all live entries — tombstones are discarded during rebuild
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldKeys[i] != null && oldKeys[i] != TOMBSTONE) {
                put((K) oldKeys[i], (V) oldValues[i]);
            }
        }
    }

    public void printStats() {
        System.out.printf(
            "Capacity: %d | Live entries: %d | Tombstones: %d | Effective load: %.2f%n",
            keys.length, size, tombstones,
            (double)(size + tombstones) / keys.length
        );
    }

    public static void main(String[] args) {
        OpenAddressingHashMap<String, Integer> map = new OpenAddressingHashMap<>();

        System.out.println("--- Inserting entries ---");
        String[] keys = { "alpha", "bravo", "charlie", "delta",
                          "echo", "foxtrot", "golf", "hotel", "india", "juliet" };
        for (int i = 0; i < keys.length; i++) {
            map.put(keys[i], i);
        }
        map.printStats();

        System.out.println("\n--- Lookups ---");
        System.out.println("get("charlie") = " + map.get("charlie"));
        System.out.println("get("kilo")    = " + map.get("kilo"));

        System.out.println("\n--- Deletions (tombstone behaviour) ---");
        map.remove("bravo");
        map.remove("delta");
        map.printStats();
        // Probe sequence for keys that probed through bravo/delta still works
        System.out.println("get("charlie") after deletions = " + map.get("charlie"));
    }
}
Output
[resize] 16 -> 32 capacity
--- After inserting 10 entries ---
Capacity: 32 | Live entries: 10 | Tombstones: 0 | Effective load: 0.31
--- Lookups ---
get("charlie") = 2
get("kilo") = null
--- Deletions (tombstone behaviour) ---
Capacity: 32 | Live entries: 8 | Tombstones: 2 | Effective load: 0.31
get("charlie") after deletions = 2
The Tombstone Accumulation Problem
Every deletion in an open-addressing map leaves a tombstone. Tombstones occupy physical slots and count against your effective load factor. In a workload with frequent insertions and deletions — like a session cache with natural expiry — tombstones accumulate until the effective load factor triggers a resize, even if your live entry count is low. The resize rebuilds the table and discards all tombstones, but it's expensive. If your workload has high delete frequency, chaining handles deletions more cleanly and avoids this problem entirely.
Production Insight
Open addressing wins on cache performance — all entries are in a contiguous array and probe sequences access nearby memory locations. For small maps with predictable, low-churn workloads — configuration tables, enum-keyed dispatch maps, small lookup caches — open addressing can outperform chaining by 20-30% purely due to cache line efficiency. But the moment your load factor climbs above 0.65 or your delete rate is high, the advantages evaporate. The performance cliff is sharper than most engineers expect from the O(1) guarantee.
Key Takeaway
Open addressing stores all entries in the table array itself — no pointer overhead, better cache locality. Linear probing is cache-optimal but causes clustering. Quadratic probing reduces clustering at the cost of cache locality. Double hashing is the most robust but requires careful design of the second hash function. All three strategies degrade rapidly above 0.7 load factor and have more complex deletion semantics than chaining.
Which Open Addressing Strategy to Use
IfCache performance is the top priority and load factor will stay below 0.5
UseLinear probing — best cache locality; clustering is tolerable at low load factors
IfGood cache performance needed but primary clustering is causing hotspots
UseQuadratic probing — breaks primary clustering while keeping reasonable locality; still has secondary clustering
IfTheoretically optimal distribution, willing to design a second hash function
UseDouble hashing — eliminates both primary and secondary clustering; requires a well-designed h2 that never returns 0
IfHigh delete rate with frequent insertions and deletions interleaved
UseDo not use open addressing — tombstone accumulation will cause premature resizes. Use chaining instead.
IfLoad factor will exceed 0.7 under expected workload
UseDo not use open addressing — probe lengths grow exponentially above this threshold. Use chaining.

Open Addressing Performance Formula: Expected Search Time < 1/(1-α)

For open addressing with linear probing, the expected number of probes for a successful search is approximated by:

E[successful search] ≈ (1 + 1/(1 - α)²) / 2

And for an unsuccessful search (insertion or lookup of a missing key):

E[unsuccessful search] ≈ (1 + 1/(1 - α)²) / 2 [same formula for linear probing]

But there's a tighter upper bound that applies to any open addressing scheme with random probing (which approximates double hashing):

E[search] < 1/(1 - α)

This is a well-known result from Knuth's analysis. For a successful search under random probing, the expected number of probes is approximately (1/α) * ln(1/(1-α)). For an unsuccessful search, it's 1/(1-α).

The simple bound E[search] < 1/(1-α) is an upper bound on the expected number of probes for an unsuccessful search under open addressing with uniform probing. Since α is the load factor (entries/slots), at α=0.5, E < 2 probes; at α=0.9, E < 10 probes; at α=0.99, E < 100 probes.

This formula reveals why open addressing requires strict load factor management. At α=0.75, the bound gives E < 4 probes. But in practice with linear probing and clustering, the actual expected number can be significantly higher than this bound, especially as α approaches 0.9. Clustering can cause the expected search time to blow up to hundreds of probes even at α=0.8.

The key takeaway for production: if you use open addressing, keep α ≤ 0.65. Below that threshold, the expected search time is well under 3 probes, and cache-friendly access patterns make this faster than chaining for most workloads. Above 0.7, the expected number of probes grows rapidly, and the practical performance advantage over chaining disappears.

The 1/(1-α) Rule of Thumb
The formula E[search] < 1/(1-α) is a good rule of thumb for worst-case expected performance with random probing. For linear probing, the actual expected number of probes is higher, especially at higher load factors. Always design open-addressing tables to operate below α=0.65 to stay in the region where both clustering and probe count are manageable.
Production Insight
In production monitoring, I've seen teams misconfigure open-addressing caches by setting initial capacity too low, causing the load factor to drift above 0.7 during peak hours. The result is a gradual increase in lookup latency that spikes when the table reaches 0.85 or higher. The formula E[search] < 1/(1-α) is a quick mental check: if your load factor is 0.8, expect at least 5 probes on average; at 0.9, at least 10. When multiplied by millions of lookups per second, those extra probes add up to real CPU. Set a hard cap on load factor (e.g., trigger resize at 0.6) and monitor the actual value.
Key Takeaway
The expected number of probes for an open-addressing search is bounded by 1/(1-α) under random probing. This means load factor directly controls performance: keep α ≤ 0.65 to ensure fewer than 3 expected probes. Above 0.7, performance degrades sharply, and open addressing loses its advantage over chaining.

C++ Implementations of Separate Chaining and Linear Probing

While Java's HashMap is the most commonly used hash table in production Java systems, C++ developers often need to implement their own hash tables or understand how std::unordered_map works. The C++ standard library provides std::unordered_map which uses separate chaining, but understanding the implementation details helps for custom containers and performance tuning.

The first implementation below shows a minimal separate chaining hash map in C++ using std::forward_list for bucket chains. The second shows a linear probing open-addressing map using a fixed-size array. Both are designed for clarity, not production optimisation — they illustrate the core collision resolution mechanics.

Note that std::unordered_map in GCC uses a different internal structure: it stores all entries in a contiguous array and uses buckets as indices into that array (a "bucket array" approach). The chaining here is more literal: each bucket has its own linked list. This is simpler to reason about and matches Java's HashMap more closely.

ChainingHash.cppCPP
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
#include <iostream>
#include <vector>
#include <forward_list>
#include <functional>
#include <utility>

// Minimal separate chaining hash map in C++
template<typename K, typename V, typename Hash = std::hash<K>>
class ChainingHashMap {\nprivate:\n    using Entry = std::pair<K, V>;\n    std::vector<std::forward_list<Entry>> buckets;\n    Hash hasher;\n    size_t num_entries;\n    double max_load_factor;\n\n    size_t bucket_index(const K& key) const {\n        return hasher(key) % buckets.size();\n    }

    void rehash(size_t new_capacity) {
        std::vector<std::forward_list<Entry>> old_buckets = std::move(buckets);
        buckets.resize(new_capacity);
        num_entries = 0;
        for (auto& chain : old_buckets) {
            for (auto& entry : chain) {
                insert(std::move(entry));
            }
        }
    }

public:
    ChainingHashMap(size_t initial_capacity = 16, double max_load = 0.75)
        : buckets(initial_capacity), num_entries(0), max_load_factor(max_load) {}

    void insert(const Entry& entry) {
        if (static_cast<double>(num_entries) / buckets.size() >= max_load_factor) {
            rehash(buckets.size() * 2);
        }
        size_t idx = bucket_index(entry.first);
        // Update existing key
        for (auto& e : buckets[idx]) {
            if (e.first == entry.first) {
                e.second = entry.second;
                return;
            }
        }
        // Not found, add to front
        buckets[idx].push_front(entry);
        ++num_entries;
    }

    V* find(const K& key) {
        size_t idx = bucket_index(key);
        for (auto& e : buckets[idx]) {
            if (e.first == key) {
                return &e.second;
            }
        }
        return nullptr;
    }

    bool remove(const K& key) {
        size_t idx = bucket_index(key);
        buckets[idx].remove_if([&](const Entry& e) { return e.first == key; });
        // Note: remove_if might not actually remove if not found; we check size change
        auto old_size = buckets[idx].size();
        // Remove_if removes all matching, but we only expect one.
        // Better: iterate manually. For simplicity, we'll just return true/false
        // Real implementation would count.
        return true; // simplified
    }

    size_t size() const { return num_entries; }

    void print_distribution() const {
        size_t occupied = 0;
        size_t max_chain = 0;
        for (auto& chain : buckets) {
            if (!chain.empty()) {
                occupied++;
                size_t chain_len = std::distance(chain.begin(), chain.end());
                if (chain_len > max_chain) max_chain = chain_len;
            }
        }
        std::cout << "Buckets: " << buckets.size() << " total, "
                  << occupied << " occupied, max chain: " << max_chain
                  << ", load: " << static_cast<double>(num_entries)/buckets.size() << std::endl;
    }
};

// Linear probing open addressing hash map
template<typename K, typename V, typename Hash = std::hash<K>>
class LinearProbingHashMap {\nprivate:\n    enum SlotState { EMPTY, OCCUPIED, DELETED };
    std::vector<K> keys;
    std::vector<V> values;
    std::vector<SlotState> states;
    size_t num_entries;
    size_t num_deleted; // tombstones
    double max_load_factor;
    Hash hasher;

    size_t probe(const K& key, size_t i) const {\n        return (hasher(key) + i) % keys.size();\n    }

    void rehash(size_t new_capacity) {
        std::vector<K> old_keys = std::move(keys);
        std::vector<V> old_values = std::move(values);
        std::vector<SlotState> old_states = std::move(states);
        keys.resize(new_capacity);
        values.resize(new_capacity);
        states.assign(new_capacity, EMPTY);
        num_entries = 0;
        num_deleted = 0;
        for (size_t i = 0; i < old_keys.size(); ++i) {
            if (old_states[i] == OCCUPIED) {
                insert(old_keys[i], old_values[i]);
            }
        }
    }

public:
    LinearProbingHashMap(size_t initial_capacity = 16, double max_load = 0.65)
        : keys(initial_capacity), values(initial_capacity), states(initial_capacity, EMPTY),
          num_entries(0), num_deleted(0), max_load_factor(max_load) {}

    void insert(const K& key, const V& value) {\n        if (static_cast<double>(num_entries + num_deleted) / keys.size() >= max_load_factor) {\n            rehash(keys.size() * 2);\n        }
        for (size_t i = 0; i < keys.size(); ++i) {
            size_t idx = probe(key, i);
            if (states[idx] == EMPTY || states[idx] == DELETED) {
                keys[idx] = key;
                values[idx] = value;
                states[idx] = OCCUPIED;
                ++num_entries;
                if (states[idx] == DELETED) --num_deleted; // reclaiming tombstone
                return;
            }
            if (states[idx] == OCCUPIED && keys[idx] == key) {
                values[idx] = value; // update existing
                return;
            }
        }
        // Should not reach with proper load factor
    }

    V* find(const K& key) {
        for (size_t i = 0; i < keys.size(); ++i) {
            size_t idx = probe(key, i);
            if (states[idx] == EMPTY) return nullptr;
            if (states[idx] == OCCUPIED && keys[idx] == key) {
                return &values[idx];
            }
        }
        return nullptr;
    }

    bool remove(const K& key) {
        for (size_t i = 0; i < keys.size(); ++i) {
            size_t idx = probe(key, i);
            if (states[idx] == EMPTY) return false;
            if (states[idx] == OCCUPIED && keys[idx] == key) {
                states[idx] = DELETED;
                --num_entries;
                ++num_deleted;
                return true;
            }
        }
        return false;
    }

    size_t size() const { return num_entries; }
};

int main() {
    // Test ChainingHashMap
    ChainingHashMap<int
● Production incidentPOST-MORTEMseverity: high

HashMap Degradation to O(n) in High-Collision Cache

Symptom
P99 latency spiked from 10ms to 500ms during peak hours on a user session caching layer. CPU usage was elevated but the bottleneck wasn't in any obvious computation. Thread dumps showed threads spending the majority of their time inside HashMap.get() and HashMap.put(). Garbage collection pauses were frequent but not extreme — the GC logs were a red herring. The service had been running fine for three months; the degradation appeared gradually as user traffic grew.
Assumption
The team initially blamed network latency or upstream database slowdowns. They added connection pool monitoring, increased database replica read capacity, and tuned TCP timeouts — none of which moved the needle. It took a flamegraph from async-profiler to make the real culprit obvious: HashMap.get() was burning CPU because it was walking long collision chains instead of doing constant-time lookups.
Root cause
User session keys were constructed from a combination of user ID and a session token, but the hashCode() implementation only incorporated the user ID portion. User IDs in this system were sequentially assigned integers, and the default HashMap capacity was 16. The expression userId % 16 produced a distribution heavily skewed toward lower bucket indices because the userId range in production was 1–500,000, but the active sessions were concentrated in recently registered accounts (IDs 480,000–500,000) which all mapped to the same eight buckets. With 100,000 active sessions in a HashMap sized for 16 buckets (and later resized to 32,768 after repeated threshold crossings), several buckets contained over 200 entries. Each lookup in those buckets required walking the entire chain, comparing equals() on every entry. The load factor had long exceeded 0.75, triggering expensive full-table rehashes during peak traffic that temporarily doubled memory usage and caused GC pressure.
Fix
1. Rewrote hashCode() to incorporate both userId and sessionToken using Objects.hash(userId, sessionToken) — this immediately improved bucket distribution. 2. Pre-sized the HashMap with realistic capacity: new HashMap<>(expectedPeakSessions / 0.75 + 1) to avoid load-factor-triggered resizing during traffic spikes. 3. Replaced the plain HashMap with ConcurrentHashMap for the shared cache, both for thread safety and for its segment-based design which distributes entries more predictably under concurrent access. 4. Added a weekly audit job that samples bucket distribution via reflection (accessing the HashMap's table field) to catch distribution drift before it causes production problems.
Key lesson
  • Always override hashCode() when overriding equals() — and use all the fields that appear in equals(), not a convenient subset
  • Default HashMap capacity (16) is almost never appropriate for production workloads; pre-size based on expected entries divided by 0.75
  • Monitor actual bucket distribution in production, not just cache hit rates — hit rate can look healthy while lookup latency is silently degrading
  • A flamegraph is the fastest path to diagnosing HashMap slowdowns — thread dumps alone will not tell you which bucket is the problem
Production debug guideSymptom → Action mapping when your HashMap feels slower than it should5 entries
Symptom · 01
HashMap operations taking microseconds instead of nanoseconds
Fix
Attach async-profiler and generate a flamegraph. If HashMap.get() or equals() shows up with significant CPU time, you have a collision chain problem. Verify by printing the bucket distribution: access the internal table array via reflection and count how many buckets have more than 8 entries. Anything above 8 should be a tree node in Java 8+ — if you see linked list chains longer than 8, your hashCode() is returning the same value for multiple distinct keys.
Symptom · 02
High CPU usage with low throughput and frequent equals() calls in profiler output
Fix
Profile for equals() call frequency — a healthy HashMap calls equals() at most once or twice per lookup (to confirm the key after finding its bucket). If you see equals() called dozens of times per lookup, you are walking a long collision chain. Check your hashCode() implementation: print the distribution of hashCode() % capacity across a representative sample of your actual keys. If more than 5% of entries land in the same bucket, your hash function is not distributing well for your key domain.
Symptom · 03
Frequent garbage collection pauses correlating with HashMap operation timing
Fix
Check whether HashMap resizing is the cause. Each resize doubles the table and rehashes all entries — this creates significant short-lived garbage. Enable GC logging (-Xlog:gc* in Java 11+) and correlate pause times with HashMap put() call frequency. If resizing is the culprit, pre-size the HashMap at construction time. Use jmap -histo to check the actual size of your HashMap instances and compare against their declared initial capacity.
Symptom · 04
Latency spike is reproducible only under specific input patterns, not under synthetic load
Fix
This is the fingerprint of a hashCode() distribution problem that only manifests with real data. Your synthetic load test probably uses randomly generated keys which distribute well. Production keys often have structure (sequential IDs, prefixed strings, timestamps) that interacts badly with simple hash functions. Collect a sample of actual production keys and run the bucket distribution check against them. The problem will be immediately visible.
Symptom · 05
ConcurrentHashMap showing contention on specific segments under high concurrency
Fix
ConcurrentHashMap in Java 8+ uses a per-bucket locking strategy rather than segment-level locking. Contention at the bucket level means multiple threads are writing to the same bucket simultaneously — which means your keys are colliding. Run the bucket distribution check. If contention is on a small number of hot buckets, you likely have a hashCode() distribution problem or a key skew pattern (e.g., all writes using the same prefix).
★ Hash Collision Debugging Cheat SheetFast diagnostics when hash table performance degrades in production
Latency spikes during high load correlated with HashMap operations
Immediate action
Check HashMap occupancy and bucket distribution
Commands
jmap -histo <pid> | grep HashMap
jcmd <pid> GC.class_histogram | grep -A5 HashMap
Fix now
Pre-size the HashMap: new HashMap<>(expectedEntries * 2). If under concurrent access, replace with ConcurrentHashMap with explicit initial capacity: new ConcurrentHashMap<>(expectedEntries, 0.75f, numThreads)
Uneven CPU usage across threads — some threads consistently slower than others+
Immediate action
Check for HashMap lock contention or collision chain hotspots on specific buckets
Commands
jstack <pid> | grep -A10 'HashMap'
async-profiler -d 30 -f flamegraph.html <pid>
Fix now
If the flamegraph shows HashMap.get() or equals() dominating, check bucket distribution with a reflection-based audit. If lock contention, replace synchronized HashMap with ConcurrentHashMap. If distribution is skewed, fix hashCode() to incorporate more fields.
HashMap resize events correlating with GC pauses under peak traffic+
Immediate action
Identify the HashMap's declared initial capacity vs its current size
Commands
jcmd <pid> VM.native_memory summary
jmap -dump:format=b,file=heap.hprof <pid> && jhat heap.hprof
Fix now
Compute the correct initial capacity: (int)(expectedMaxEntries / 0.75) + 1. Pass this to the HashMap constructor to prevent all load-factor-triggered resizes during normal operation.
🔥

That's Hashing. Mark it forged?

10 min read · try the examples if you haven't

Previous
Hash Table and Hash Map
2 / 11 · Hashing
Next
Two Sum Problem using Hashing