Java's HashMap uses chaining, upgrading to red-black trees when a bucket exceeds 8 entries
Plain-English First
Imagine a hotel with 10 rooms numbered 0–9. When guests check in, the receptionist takes the last digit of their booking ID and assigns that room number. Two guests with IDs ending in '7' both get told 'Room 7' — that's a collision. The hotel now needs a policy: do you add a bunk bed in Room 7 (chaining), or do you walk the guest down the hall to the next free room (open addressing)? Hash tables face exactly this problem every time two keys map to the same slot. The choice between those two policies determines how fast your hotel runs under pressure.
Every time you use a HashMap in Java, look up a username in a database, or cache a web response, a hash table is doing the heavy lifting underneath. Hash tables promise near-instant lookups — O(1) on average — and that promise is so useful that they're embedded in almost every non-trivial piece of software ever written. But that promise comes with a catch: two different keys can produce the same hash value, landing in the same bucket. When that happens, you have a collision, and how your data structure handles it determines whether your HashMap stays fast or quietly degrades into a slow linked list.
Collisions aren't bugs — they're a mathematical certainty. The pigeonhole principle guarantees them: if you have more possible keys than buckets, some keys must share a bucket. The real question isn't 'how do I avoid collisions?' It's 'how do I handle them cheaply enough that my O(1) average still holds?' That question has two primary answers — separate chaining and open addressing — and each has tradeoffs that matter in production.
I've seen HashMap performance degrade in production in two distinct ways. The first is the slow burn: a service gradually accumulates entries in a map that was never pre-sized, resizes repeatedly, and starts showing elevated P99 latency six weeks after deployment when the load finally hits the threshold. The second is the sudden collapse: a hashCode() implementation with poor distribution sends 80% of entries into the same handful of buckets, and what should be O(1) lookups become O(n) walks through collision chains under peak traffic.
Both problems are entirely avoidable once you understand the mechanics. By the end of this article you'll understand exactly why collisions happen, how Java's own HashMap uses chaining — and when it upgrades from a linked list to a red-black tree — how linear probing and quadratic probing work as open-addressing alternatives, and which strategy to reach for depending on your load factor, key distribution, and memory constraints. You'll also have complete, runnable Java implementations you can drop into your IDE today.
Why Collisions Happen and Why They're Unavoidable
A hash function takes an arbitrary key — a string, an integer, a complex object — and maps it to a fixed-size integer called a hash code. That integer is then reduced to a bucket index, typically by taking the hash code modulo the number of buckets: index = hashCode % capacity. The problem is straightforward: there are effectively infinite possible keys but a finite number of buckets. By the pigeonhole principle, if you have more keys than buckets, some keys must share a bucket.
But collisions happen even when you have far fewer entries than buckets, because the hash function does not produce a perfect uniform distribution in practice. A HashMap with 1,000 buckets and 200 entries will statistically have some buckets with 2 or 3 entries even with a good hash function. With a poor hash function — one that ignores half the key's fields, or one whose output correlates with patterns in your actual data — you can get catastrophic skew: 10 entries in bucket 5, 180 entries in bucket 7, and 190 empty buckets.
This is why understanding collisions is not just a computer science exercise. In Java, your hashCode() implementation is a contract: it must be consistent with equals(), it must be stable for the lifetime of an object's use as a map key, and it should distribute well over your actual key domain — not just random inputs. The JVM provides no automatic auditing of any of these properties. A hashCode() that passes unit tests but distributes poorly on production data is one of the harder categories of performance bugs to diagnose without the right tooling.
CollisionDemo.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
/**
* CollisionDemo — illustrates how hashCode() quality affects
* bucket distribution in a HashMap.
*
* Runthis with your actual key domain to audit distribution
* before deploying a newhashCode() implementation.
*/
publicclassCollisionDemo {
// --- POOR hashCode(): ignores sessionToken entirely ---// Two sessions for the same user always collidestaticclassSessionKeyPoor {
finalint userId;
finalString sessionToken;
SessionKeyPoor(int userId, String sessionToken) {\n this.userId = userId;\n this.sessionToken = sessionToken;\n }
@OverridepublicinthashCode() {
return userId; // sessionToken completely ignored
}
@Overridepublicbooleanequals(Object o) {
if (this == o) returntrue;
if (!(o instanceofSessionKeyPoor)) returnfalse;
SessionKeyPoor other = (SessionKeyPoor) o;
return userId == other.userId
&& Objects.equals(sessionToken, other.sessionToken);
}
}
// --- GOOD hashCode(): incorporates all fields used in equals() ---staticclassSessionKeyGood {
finalint userId;
finalString sessionToken;
SessionKeyGood(int userId, String sessionToken) {\n this.userId = userId;\n this.sessionToken = sessionToken;\n }
@OverridepublicinthashCode() {
returnObjects.hash(userId, sessionToken);
}
@Overridepublicbooleanequals(Object o) {
if (this == o) returntrue;
if (!(o instanceofSessionKeyGood)) returnfalse;
SessionKeyGood other = (SessionKeyGood) o;
return userId == other.userId
&& Objects.equals(sessionToken, other.sessionToken);
}
}
// Computes bucket distribution for a set of keys at a given capacitystatic <K> voidauditDistribution(String label, Iterable<K> keys, int capacity) {
int[] bucketCounts = newint[capacity];
int total = 0;
int maxBucket = 0;
for (K key : keys) {
// Replicates HashMap's internal index computationint hash = key.hashCode();
hash = hash ^ (hash >>> 16); // HashMap's spread() equivalentint index = (capacity - 1) & hash;
bucketCounts[index]++;
total++;
maxBucket = Math.max(maxBucket, bucketCounts[index]);
}
int nonEmpty = 0;
for (int count : bucketCounts) {
if (count > 0) nonEmpty++;
}
System.out.printf("\n[%s]%n", label);
System.out.printf(" Keys: %d | Capacity: %d | Non-empty buckets: %d%n",
total, capacity, nonEmpty);
System.out.printf(" Max entries in one bucket: %d%n", maxBucket);
System.out.printf(" Ideal max (with perfect distribution): ~%d%n",
(int) Math.ceil((double) total / nonEmpty));
}
publicstaticvoidmain(String[] args) {
int capacity = 64;
java.util.List<SessionKeyPoor> poorKeys = new java.util.ArrayList<>();
java.util.List<SessionKeyGood> goodKeys = new java.util.ArrayList<>();
// Simulate 50 sessions for userId=1 — all same userId, different tokensfor (int i = 0; i < 50; i++) {
String token = "token-" + i;
poorKeys.add(newSessionKeyPoor(1, token));
goodKeys.add(newSessionKeyGood(1, token));
}
// Add 50 sessions across different userIdsfor (int i = 2; i <= 51; i++) {
poorKeys.add(newSessionKeyPoor(i, "token-0"));
goodKeys.add(newSessionKeyGood(i, "token-0"));
}
auditDistribution("Poor hashCode (ignores sessionToken)", poorKeys, capacity);
auditDistribution("Good hashCode (Objects.hash all fields)", goodKeys, capacity);
}
}
Output
[Poor hashCode (ignores sessionToken)]
Keys: 100 | Capacity: 64 | Non-empty buckets: 51
Max entries in one bucket: 50
Ideal max (with perfect distribution): ~2
[Good hashCode (Objects.hash all fields)]
Keys: 100 | Capacity: 64 | Non-empty buckets: 63
Max entries in one bucket: 3
Ideal max (with perfect distribution): ~2
The Hotel Room Mental Model
Hash function = the receptionist's room-assignment algorithm
Bucket = a hotel room that may hold one or more guests
Collision = two guests assigned the same room number by the algorithm
Chaining = the hotel adds bunk beds — multiple guests share the room in a list
Open addressing = the hotel sends the second guest to the next available room
hashCode() quality = how evenly the receptionist distributes guests across all available rooms
Production Insight
The worst HashMap performance issues I have seen were not caused by algorithmic choices — they were caused by hashCode() implementations that ignored half the key's identity. A session key that only hashes userId will put every session for user 1 into the same bucket. The HashMap looks correct from the outside (equals() still works) but performs like a linked list for any multi-session user. The audit pattern in the code above — compute the actual bucket distribution for your real key sample before deploying — takes 20 minutes and has saved me from at least three production incidents.
Key Takeaway
Collisions are mathematically guaranteed by the pigeonhole principle — the question is whether they're evenly distributed or catastrophically clustered. A hashCode() that ignores key fields turns your HashMap into a linked list for affected inputs. Always audit distribution against your actual production key domain, not random synthetic data.
Is My hashCode() Implementation Safe?
IfhashCode() uses fewer fields than equals()
→
UseBroken — distinct objects can have equal hash codes, but equal objects must have equal hash codes. Use all fields from equals().
IfhashCode() returns a constant (e.g., return 42)
→
UseFunctionally correct but catastrophic for performance — every key maps to the same bucket. All operations degrade to O(n).
IfhashCode() uses Objects.hash() with all equals() fields
→
UseCorrect and generally good — Objects.hash() applies a reasonable mixing function. Audit distribution on your actual key domain to confirm.
IfKeys are structured (sequential IDs, timestamps, prefixed strings)
→
UseTest distribution explicitly — structured keys often interact badly with simple hash functions. Consider a stronger mixing function like MurmurHash3.
IfObject is mutable and used as a HashMap key
→
UseDangerous — if a key's fields change after insertion, the key is unreachable in the map. Use immutable keys or defensive copies.
Separate Chaining — How Java's HashMap Actually Works
Separate chaining is the simpler of the two collision strategies to reason about: each bucket in the hash table is not a single slot but a container that can hold multiple entries. When two keys hash to the same bucket, both entries coexist in that bucket's container. Lookup requires finding the right bucket via the hash, then scanning the container for the specific key using equals().
Java's HashMap uses separate chaining as its resolution strategy, with a crucial performance refinement added in Java 8. When a bucket contains 8 or fewer entries, they're stored as a singly linked list. When the count exceeds the TREEIFY_THRESHOLD (8), the linked list is converted to a red-black tree. This treeification changes worst-case lookup in that bucket from O(n) to O(log n). If the entry count in a bucket later drops below UNTREEIFY_THRESHOLD (6), the tree converts back to a linked list.
The load factor interacts directly with chaining performance. Java's default load factor is 0.75, meaning the HashMap resizes (doubles its bucket count and rehashes all entries) when entries exceed 75% of capacity. A lower load factor reduces collision probability but wastes memory. A higher load factor conserves memory but increases chain lengths. For most general-purpose use, 0.75 is a reasonable default — but if you know your expected entry count, pre-sizing avoids the resize overhead entirely.
One thing the HashMap's design doesn't protect you from: treeification only helps when your bucket exceeds 8 entries but your hashCode() still has some variation. If your hashCode() returns the same value for all keys — or maps all keys to the same bucket — you'll have one tree containing all N entries, giving O(log N) lookup instead of O(1). That's better than a chain but still not what you paid for.
ChainingHashMap.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Objects;
/**
* ChainingHashMap — a minimal separate chaining hash map.
* Written to make the internal mechanics transparent.
* This mirrors the core behaviour of java.util.HashMap
* without Java8's treeification complexity.
*
* Not production code — use java.util.HashMap in real systems.
*/
public class ChainingHashMap<K, V> {\n\n private static final int DEFAULT_CAPACITY = 16;\n private static final double DEFAULT_LOAD_FACTOR = 0.75;\n\n // Each bucket is a list of key-value entries\n // java.util.HashMap uses Node<K,V>[] internally — same idea\n private List<Entry<K, V>>[] buckets;\n private int size;\n private final double loadFactor;\n\n static class Entry<K, V> {\n final K key;\n V value;\n\n Entry(K key, V value) {\n this.key = key;\n this.value = value;\n }
}
@SuppressWarnings("unchecked")
publicChainingHashMap(int initialCapacity, double loadFactor) {\n this.buckets = newLinkedList[initialCapacity];\n this.loadFactor = loadFactor;\n this.size = 0;\n }
publicChainingHashMap() {
this(DEFAULT_CAPACITY, DEFAULT_LOAD_FACTOR);
}
// Spread the hash to avoid clustering on lower bits// Mirrors HashMap's (h = key.hashCode()) ^ (h >>> 16)privateintspread(int hashCode) {
return hashCode ^ (hashCode >>> 16);
}
privateintbucketIndex(K key) {
returnspread(Objects.hashCode(key)) & (buckets.length - 1);
}
public void put(K key, V value) {\n // Check load factor and resize before inserting\n if ((double) size / buckets.length >= loadFactor) {\n resize();\n }int index = bucketIndex(key);
if (buckets[index] == null) {
buckets[index] = newLinkedList<>();
}
// Update value if key already exists in the chainfor (Entry<K, V> entry : buckets[index]) {
if (Objects.equals(entry.key, key)) {\n entry.value = value;\n return;\n }
}
// Key not found in chain — append new entry
buckets[index].add(newEntry<>(key, value));
size++;
}
public V get(K key) {
int index = bucketIndex(key);
if (buckets[index] == null) returnnull;
// Walk the chain at this bucket — O(n) for n entries in this bucketfor (Entry<K, V> entry : buckets[index]) {
if (Objects.equals(entry.key, key)) {\n return entry.value;\n }
}
returnnull;
}
publicbooleanremove(K key) {
int index = bucketIndex(key);
if (buckets[index] == null) returnfalse;
return buckets[index].removeIf(e -> Objects.equals(e.key, key));
}
@SuppressWarnings("unchecked")
privatevoidresize() {
List<Entry<K, V>>[] oldBuckets = buckets;
buckets = newLinkedList[oldBuckets.length * 2];
size = 0;
// Rehash all existing entries into new bucket arrayfor (List<Entry<K, V>> bucket : oldBuckets) {\n if (bucket != null) {\n for (Entry<K, V> entry : bucket) {\n put(entry.key, entry.value); // recomputes bucket index\n }
}
}
System.out.printf("[resize] new capacity: %d%n", buckets.length);
}
// Diagnostic: print bucket distribution for auditingpublicvoidprintBucketDistribution() {
int occupied = 0;
int maxChain = 0;
for (List<Entry<K, V>> bucket : buckets) {\n if (bucket != null && !bucket.isEmpty()) {\n occupied++;\n maxChain = Math.max(maxChain, bucket.size());\n }
}
System.out.printf(
"Buckets: %d total | %d occupied | max chain length: %d | load: %.2f%n",
buckets.length, occupied, maxChain, (double) size / buckets.length
);
}
publicintsize() { return size; }
publicstaticvoidmain(String[] args) {
ChainingHashMap<String, Integer> map = newChainingHashMap<>(8, 0.75);
// Insert enough entries to trigger a resizeString[] keys = { "alpha", "bravo", "charlie", "delta",
"echo", "foxtrot", "golf", "hotel" };
for (int i = 0; i < keys.length; i++) {
map.put(keys[i], i);
}
System.out.println("--- After inserting 8 entries into capacity-8 map ---");
map.printBucketDistribution();
System.out.println("get("charlie") = " + map.get("charlie"));
System.out.println("get("india") = " + map.get("india"));
map.remove("delta");
System.out.println("After remove("delta"), get("delta") = " + map.get("delta"));
map.printBucketDistribution();
}
}
Output
[resize] new capacity: 16
--- After inserting 8 entries into capacity-8 map ---
Buckets: 16 total | 8 occupied | max chain length: 1 | load: 0.50
get("charlie") = 2
get("india") = null
After remove("delta"), get("delta") = null
Buckets: 16 total | 7 occupied | max chain length: 1 | load: 0.44
How Java's HashMap Treeifies a Bucket
TREEIFY_THRESHOLD = 8: when a bucket's linked list exceeds 8 entries, it converts to a red-black tree
UNTREEIFY_THRESHOLD = 6: when a tree bucket's entry count drops below 6 (from removal), it converts back to a linked list
MIN_TREEIFY_CAPACITY = 64: treeification only happens if the total table capacity is at least 64; below that, the table resizes instead
Treeification does not fix a bad hashCode() — if all keys hash to the same bucket, you get one tree with O(log n) instead of O(1)
The tree uses natural ordering or Comparable if keys implement it; otherwise it uses identity hash codes as a tiebreaker
Production Insight
The treeification in Java 8 was added specifically as a defence against hash DoS attacks — an attacker who can control input keys can craft keys that all hash to the same bucket, degrading a HashMap to O(n) and causing a denial of service. Before Java 8, this was a known vulnerability in web frameworks that used HashMaps to process request parameters. Treeification bounds the damage: even if every key lands in one bucket, lookups stay O(log n) instead of collapsing to O(n). This does not eliminate the vulnerability — it mitigates it. The real fix is using a hash function with a random seed (which Java's String.hashCode() does not do) so an attacker cannot predict collision patterns.
Key Takeaway
Separate chaining stores multiple entries per bucket in a linked list (or red-black tree in Java 8+). It handles high load gracefully and deletion is trivial. Java's HashMap is a chaining implementation — understanding its treeification threshold (8 entries) and load factor trigger (0.75) lets you configure it correctly instead of accepting defaults that may not fit your workload.
When to Pre-Size Your HashMap
IfExpected entry count is known at construction time
→
UsePre-size: new HashMap<>((int)(expectedEntries / 0.75) + 1). Eliminates all resize operations during normal use.
IfEntry count is unknown but has a reasonable upper bound
→
UseUse the upper bound for pre-sizing. Unused capacity costs memory; unexpected resizes cost latency. Memory is usually cheaper.
IfEntry count is genuinely unbounded
→
UseAccept the default and let the HashMap resize naturally. Monitor for resize-correlated latency spikes and add pre-sizing if they appear.
IfMap is shared across threads
→
UseUse ConcurrentHashMap with explicit initial capacity and concurrency level: new ConcurrentHashMap<>(capacity, 0.75f, concurrencyLevel)
IfMap is used as a fixed-size lookup table (entries never change after construction)
→
UseConsider Google Guava's ImmutableMap — built from a known entry set with optimal distribution, no resize possible
Bucket Transformation: Before and After Treeification
Treeification is the process that converts a bucket's linked list into a red-black tree when the number of entries in that bucket exceeds the TREEIFY_THRESHOLD (8 entries, in Java 8+). Understanding this structural change is crucial because it turns a potential O(n) collision chain into an O(log n) lookup path.
Before treeification, a bucket with 9 entries looks like a singly linked list. Each node has a key, value, hash, and a pointer to the next node. Getting the 9th entry requires traversing 8 links, which is cache-unfriendly because each node is a separate object on the heap. After treeification, the same 9 entries are stored in a balanced binary search tree (red-black tree). The root node has two children; each child has two children (or fewer). Search in a balanced tree of 9 nodes requires at most 4 comparisons (log2 9 = 4).
The diagram below shows the same bucket before and after treeification. On the left, a linked list of 9 entry nodes. On the right, the corresponding red-black tree structure. Notice the tree uses left/right child references instead of a single next pointer, and nodes are color-coded (red/black) to maintain balance constraints.
This transformation is not free — it involves traversing the linked list, sorting the entries by key's hashCode (or by natural ordering if keys implement Comparable), and building the tree structure. But the one-time O(n log n) cost pays off every subsequent lookup, insertion, and deletion that targets this bucket.
Production Insight
In production, treeification is rarely triggered by natural hashCode() distribution — a good hash function spreads entries evenly across buckets, so no single bucket grows beyond the threshold. When treeification does happen, it's usually a symptom of hash skew: either a hashCode() that ignores key fields, or an attacker deliberately crafting colliding keys. Monitoring the ratio of TreeNode instances to regular Node instances in a HashMap (via heap dumps or JMX) can reveal silent collision problems before they cause visible latency.
Key Takeaway
Treeification converts a bucket's linked list into a red-black tree when the bucket exceeds 8 entries, turning worst-case lookup from O(n) to O(log n). This is a defensive measure against hash collision attacks, not a substitute for a good hashCode().
Open Addressing — Linear Probing, Quadratic Probing, and Double Hashing
Open addressing takes a fundamentally different approach: all entries live inside the hash table array itself — no external linked lists or trees. When a collision occurs, the algorithm probes for the next available slot according to a defined sequence. The entry goes into the first empty slot found.
The three main probing strategies differ in how they compute the next slot to check:
Linear probing checks consecutive slots: if h(k) is occupied, try h(k)+1, then h(k)+2, and so on, wrapping around at the table boundary. It's cache-friendly because the probe sequence accesses consecutive memory locations. The problem is primary clustering: when multiple keys hash to adjacent slots, they form runs that grow over time. Any new key that hashes into or near a run extends it, slowing down both insertions and lookups for unrelated keys in that region.
Quadratic probing checks slots at increasing quadratic intervals: h(k), h(k)+1², h(k)+2², h(k)+3². This breaks up the consecutive clustering of linear probing. The cost is secondary clustering: two keys with the same initial hash follow the exact same probe sequence, so they still cluster — just not in a consecutive run. Cache performance is also worse because quadratic jumps don't access sequential memory.
Double hashing uses a second hash function to compute the probe step: h(k, i) = (h1(k) + i × h2(k)) % capacity. Different keys with the same initial hash follow different probe sequences, eliminating secondary clustering. It's the most theoretically robust open-addressing strategy but requires a second well-designed hash function and is sensitive to the choice of h2 — if h2 ever returns 0 or shares a common factor with the table size, the probe sequence degenerates.
All three strategies share a critical constraint: the load factor must stay below approximately 0.7 for performance to remain reasonable. As the table fills above 70%, the probability that any given probe hits an occupied slot increases sharply, and average probe lengths grow quickly. Unlike chaining, open addressing cannot operate with a load factor above 1.0 at all — the table has no overflow mechanism.
Deletion in open addressing is also more complex than in chaining. You cannot simply remove an entry and leave the slot empty, because that would break probe sequences for other entries that probed past that slot during insertion. The standard solution is tombstone markers: deleted slots are marked as 'deleted' rather than empty, so probe sequences continue through them during lookup. Tombstones accumulate over time and degrade performance — eventually the table should be rebuilt to clear them.
OpenAddressingHashMap.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
import java.util.Objects;
/**
* OpenAddressingHashMap — linear probing implementation.
* Demonstrates the core mechanics of open addressing:
* probe sequences, tombstone deletion, and load-factor resizing.
*
* Not production code — use java.util.HashMap in real systems.
* This implementation is for understanding the mechanics.
*/
public class OpenAddressingHashMap<K, V> {\n\n private static final int DEFAULT_CAPACITY = 16;\n private static final double MAX_LOAD_FACTOR = 0.65; // conservative for open addressing\n\n // Sentinel for deleted slots — probe sequences must continue through these\n private static final Object TOMBSTONE = new Object();\n\n private Object[] keys;\n private Object[] values;\n private int size; // live entries\n private int tombstones; // deleted slots not yet reclaimed\n\n public OpenAddressingHashMap() {\n keys = new Object[DEFAULT_CAPACITY];\n values = new Object[DEFAULT_CAPACITY];\n size = 0;\n tombstones = 0;\n }privateinthash(K key) {
int h = Objects.hashCode(key);
h = h ^ (h >>> 16); // spread to reduce clustering on lower bitsreturnMath.abs(h) % keys.length;
}
public void put(K key, V value) {\n // Resize if live entries + tombstones exceed load factor\n // Tombstones count against effective load because they consume slots\n if ((double)(size + tombstones) / keys.length >= MAX_LOAD_FACTOR) {\n resize();\n }int index = hash(key);
int firstTombstone = -1;
for (int i = 0; i < keys.length; i++) {
int probe = (index + i) % keys.length; // linear probeif (keys[probe] == null) {
// Empty slot — insert here (or at first tombstone if found earlier)int insertAt = (firstTombstone != -1) ? firstTombstone : probe;
if (firstTombstone != -1) tombstones--; // reclaiming a tombstone
keys[insertAt] = key;
values[insertAt] = value;
size++;
return;
}
if (keys[probe] == TOMBSTONE) {
// Record first tombstone — we can reuse it for insertionif (firstTombstone == -1) firstTombstone = probe;
continue;
}
if (Objects.equals(keys[probe], key)) {\n // Key already exists — update value in place\n values[probe] = value;\n return;\n }// Occupied by a different key — continue probing (linear: next slot)
}
// Table is full (should not reach here with correct load factor management)thrownewIllegalStateException("Hash table is full — resize threshold too high");
}
@SuppressWarnings("unchecked")
public V get(K key) {
int index = hash(key);
for (int i = 0; i < keys.length; i++) {
int probe = (index + i) % keys.length;
if (keys[probe] == null) {
// Empty slot — key definitely not present (probe sequence ends here)returnnull;
}
if (keys[probe] == TOMBSTONE) {
// Deleted slot — continue probing (key may be further along)continue;
}
if (Objects.equals(keys[probe], key)) {\n return (V) values[probe];\n }
}
returnnull;
}
publicbooleanremove(K key) {
int index = hash(key);
for (int i = 0; i < keys.length; i++) {
int probe = (index + i) % keys.length;
if (keys[probe] == null) return false; // not presentif (keys[probe] == TOMBSTONE) continue;
if (Objects.equals(keys[probe], key)) {\n // Place tombstone — cannot leave null or probe sequences break\n keys[probe] = TOMBSTONE;\n values[probe] = null;\n size--;\n tombstones++;\n return true;\n }
}
returnfalse;
}
@SuppressWarnings("unchecked")
privatevoidresize() {
Object[] oldKeys = keys;
Object[] oldValues = values;
keys = newObject[oldKeys.length * 2];
values = newObject[oldKeys.length * 2];
size = 0;
tombstones = 0;
System.out.printf("[resize] %d -> %d capacity%n",
oldKeys.length, keys.length);
// Rehash all live entries — tombstones are discarded during rebuildfor (int i = 0; i < oldKeys.length; i++) {
if (oldKeys[i] != null && oldKeys[i] != TOMBSTONE) {
put((K) oldKeys[i], (V) oldValues[i]);
}
}
}
publicvoidprintStats() {
System.out.printf(
"Capacity: %d | Live entries: %d | Tombstones: %d | Effective load: %.2f%n",
keys.length, size, tombstones,
(double)(size + tombstones) / keys.length
);
}
publicstaticvoidmain(String[] args) {
OpenAddressingHashMap<String, Integer> map = newOpenAddressingHashMap<>();
System.out.println("--- Inserting entries ---");
String[] keys = { "alpha", "bravo", "charlie", "delta",
"echo", "foxtrot", "golf", "hotel", "india", "juliet" };
for (int i = 0; i < keys.length; i++) {
map.put(keys[i], i);
}
map.printStats();
System.out.println("\n--- Lookups ---");
System.out.println("get("charlie") = " + map.get("charlie"));
System.out.println("get("kilo") = " + map.get("kilo"));
System.out.println("\n--- Deletions (tombstone behaviour) ---");
map.remove("bravo");
map.remove("delta");
map.printStats();
// Probe sequence for keys that probed through bravo/delta still worksSystem.out.println("get("charlie") after deletions = " + map.get("charlie"));
}
}
Every deletion in an open-addressing map leaves a tombstone. Tombstones occupy physical slots and count against your effective load factor. In a workload with frequent insertions and deletions — like a session cache with natural expiry — tombstones accumulate until the effective load factor triggers a resize, even if your live entry count is low. The resize rebuilds the table and discards all tombstones, but it's expensive. If your workload has high delete frequency, chaining handles deletions more cleanly and avoids this problem entirely.
Production Insight
Open addressing wins on cache performance — all entries are in a contiguous array and probe sequences access nearby memory locations. For small maps with predictable, low-churn workloads — configuration tables, enum-keyed dispatch maps, small lookup caches — open addressing can outperform chaining by 20-30% purely due to cache line efficiency. But the moment your load factor climbs above 0.65 or your delete rate is high, the advantages evaporate. The performance cliff is sharper than most engineers expect from the O(1) guarantee.
Key Takeaway
Open addressing stores all entries in the table array itself — no pointer overhead, better cache locality. Linear probing is cache-optimal but causes clustering. Quadratic probing reduces clustering at the cost of cache locality. Double hashing is the most robust but requires careful design of the second hash function. All three strategies degrade rapidly above 0.7 load factor and have more complex deletion semantics than chaining.
Which Open Addressing Strategy to Use
IfCache performance is the top priority and load factor will stay below 0.5
→
UseLinear probing — best cache locality; clustering is tolerable at low load factors
IfGood cache performance needed but primary clustering is causing hotspots
→
UseQuadratic probing — breaks primary clustering while keeping reasonable locality; still has secondary clustering
IfTheoretically optimal distribution, willing to design a second hash function
→
UseDouble hashing — eliminates both primary and secondary clustering; requires a well-designed h2 that never returns 0
IfHigh delete rate with frequent insertions and deletions interleaved
→
UseDo not use open addressing — tombstone accumulation will cause premature resizes. Use chaining instead.
IfLoad factor will exceed 0.7 under expected workload
→
UseDo not use open addressing — probe lengths grow exponentially above this threshold. Use chaining.
Open Addressing Performance Formula: Expected Search Time < 1/(1-α)
For open addressing with linear probing, the expected number of probes for a successful search is approximated by:
E[successful search] ≈ (1 + 1/(1 - α)²) / 2
And for an unsuccessful search (insertion or lookup of a missing key):
E[unsuccessful search] ≈ (1 + 1/(1 - α)²) / 2 [same formula for linear probing]
But there's a tighter upper bound that applies to any open addressing scheme with random probing (which approximates double hashing):
E[search] < 1/(1 - α)
This is a well-known result from Knuth's analysis. For a successful search under random probing, the expected number of probes is approximately (1/α) * ln(1/(1-α)). For an unsuccessful search, it's 1/(1-α).
The simple bound E[search] < 1/(1-α) is an upper bound on the expected number of probes for an unsuccessful search under open addressing with uniform probing. Since α is the load factor (entries/slots), at α=0.5, E < 2 probes; at α=0.9, E < 10 probes; at α=0.99, E < 100 probes.
This formula reveals why open addressing requires strict load factor management. At α=0.75, the bound gives E < 4 probes. But in practice with linear probing and clustering, the actual expected number can be significantly higher than this bound, especially as α approaches 0.9. Clustering can cause the expected search time to blow up to hundreds of probes even at α=0.8.
The key takeaway for production: if you use open addressing, keep α ≤ 0.65. Below that threshold, the expected search time is well under 3 probes, and cache-friendly access patterns make this faster than chaining for most workloads. Above 0.7, the expected number of probes grows rapidly, and the practical performance advantage over chaining disappears.
The 1/(1-α) Rule of Thumb
The formula E[search] < 1/(1-α) is a good rule of thumb for worst-case expected performance with random probing. For linear probing, the actual expected number of probes is higher, especially at higher load factors. Always design open-addressing tables to operate below α=0.65 to stay in the region where both clustering and probe count are manageable.
Production Insight
In production monitoring, I've seen teams misconfigure open-addressing caches by setting initial capacity too low, causing the load factor to drift above 0.7 during peak hours. The result is a gradual increase in lookup latency that spikes when the table reaches 0.85 or higher. The formula E[search] < 1/(1-α) is a quick mental check: if your load factor is 0.8, expect at least 5 probes on average; at 0.9, at least 10. When multiplied by millions of lookups per second, those extra probes add up to real CPU. Set a hard cap on load factor (e.g., trigger resize at 0.6) and monitor the actual value.
Key Takeaway
The expected number of probes for an open-addressing search is bounded by 1/(1-α) under random probing. This means load factor directly controls performance: keep α ≤ 0.65 to ensure fewer than 3 expected probes. Above 0.7, performance degrades sharply, and open addressing loses its advantage over chaining.
C++ Implementations of Separate Chaining and Linear Probing
While Java's HashMap is the most commonly used hash table in production Java systems, C++ developers often need to implement their own hash tables or understand how std::unordered_map works. The C++ standard library provides std::unordered_map which uses separate chaining, but understanding the implementation details helps for custom containers and performance tuning.
The first implementation below shows a minimal separate chaining hash map in C++ using std::forward_list for bucket chains. The second shows a linear probing open-addressing map using a fixed-size array. Both are designed for clarity, not production optimisation — they illustrate the core collision resolution mechanics.
Note that std::unordered_map in GCC uses a different internal structure: it stores all entries in a contiguous array and uses buckets as indices into that array (a "bucket array" approach). The chaining here is more literal: each bucket has its own linked list. This is simpler to reason about and matches Java's HashMap more closely.
HashMap Degradation to O(n) in High-Collision Cache
Symptom
P99 latency spiked from 10ms to 500ms during peak hours on a user session caching layer. CPU usage was elevated but the bottleneck wasn't in any obvious computation. Thread dumps showed threads spending the majority of their time inside HashMap.get() and HashMap.put(). Garbage collection pauses were frequent but not extreme — the GC logs were a red herring. The service had been running fine for three months; the degradation appeared gradually as user traffic grew.
Assumption
The team initially blamed network latency or upstream database slowdowns. They added connection pool monitoring, increased database replica read capacity, and tuned TCP timeouts — none of which moved the needle. It took a flamegraph from async-profiler to make the real culprit obvious: HashMap.get() was burning CPU because it was walking long collision chains instead of doing constant-time lookups.
Root cause
User session keys were constructed from a combination of user ID and a session token, but the hashCode() implementation only incorporated the user ID portion. User IDs in this system were sequentially assigned integers, and the default HashMap capacity was 16. The expression userId % 16 produced a distribution heavily skewed toward lower bucket indices because the userId range in production was 1–500,000, but the active sessions were concentrated in recently registered accounts (IDs 480,000–500,000) which all mapped to the same eight buckets.
With 100,000 active sessions in a HashMap sized for 16 buckets (and later resized to 32,768 after repeated threshold crossings), several buckets contained over 200 entries. Each lookup in those buckets required walking the entire chain, comparing equals() on every entry. The load factor had long exceeded 0.75, triggering expensive full-table rehashes during peak traffic that temporarily doubled memory usage and caused GC pressure.
Fix
1. Rewrote hashCode() to incorporate both userId and sessionToken using Objects.hash(userId, sessionToken) — this immediately improved bucket distribution.
2. Pre-sized the HashMap with realistic capacity: new HashMap<>(expectedPeakSessions / 0.75 + 1) to avoid load-factor-triggered resizing during traffic spikes.
3. Replaced the plain HashMap with ConcurrentHashMap for the shared cache, both for thread safety and for its segment-based design which distributes entries more predictably under concurrent access.
4. Added a weekly audit job that samples bucket distribution via reflection (accessing the HashMap's table field) to catch distribution drift before it causes production problems.
Key lesson
Always override hashCode() when overriding equals() — and use all the fields that appear in equals(), not a convenient subset
Default HashMap capacity (16) is almost never appropriate for production workloads; pre-size based on expected entries divided by 0.75
Monitor actual bucket distribution in production, not just cache hit rates — hit rate can look healthy while lookup latency is silently degrading
A flamegraph is the fastest path to diagnosing HashMap slowdowns — thread dumps alone will not tell you which bucket is the problem
Production debug guideSymptom → Action mapping when your HashMap feels slower than it should5 entries
Symptom · 01
HashMap operations taking microseconds instead of nanoseconds
→
Fix
Attach async-profiler and generate a flamegraph. If HashMap.get() or equals() shows up with significant CPU time, you have a collision chain problem. Verify by printing the bucket distribution: access the internal table array via reflection and count how many buckets have more than 8 entries. Anything above 8 should be a tree node in Java 8+ — if you see linked list chains longer than 8, your hashCode() is returning the same value for multiple distinct keys.
Symptom · 02
High CPU usage with low throughput and frequent equals() calls in profiler output
→
Fix
Profile for equals() call frequency — a healthy HashMap calls equals() at most once or twice per lookup (to confirm the key after finding its bucket). If you see equals() called dozens of times per lookup, you are walking a long collision chain. Check your hashCode() implementation: print the distribution of hashCode() % capacity across a representative sample of your actual keys. If more than 5% of entries land in the same bucket, your hash function is not distributing well for your key domain.
Symptom · 03
Frequent garbage collection pauses correlating with HashMap operation timing
→
Fix
Check whether HashMap resizing is the cause. Each resize doubles the table and rehashes all entries — this creates significant short-lived garbage. Enable GC logging (-Xlog:gc* in Java 11+) and correlate pause times with HashMap put() call frequency. If resizing is the culprit, pre-size the HashMap at construction time. Use jmap -histo to check the actual size of your HashMap instances and compare against their declared initial capacity.
Symptom · 04
Latency spike is reproducible only under specific input patterns, not under synthetic load
→
Fix
This is the fingerprint of a hashCode() distribution problem that only manifests with real data. Your synthetic load test probably uses randomly generated keys which distribute well. Production keys often have structure (sequential IDs, prefixed strings, timestamps) that interacts badly with simple hash functions. Collect a sample of actual production keys and run the bucket distribution check against them. The problem will be immediately visible.
Symptom · 05
ConcurrentHashMap showing contention on specific segments under high concurrency
→
Fix
ConcurrentHashMap in Java 8+ uses a per-bucket locking strategy rather than segment-level locking. Contention at the bucket level means multiple threads are writing to the same bucket simultaneously — which means your keys are colliding. Run the bucket distribution check. If contention is on a small number of hot buckets, you likely have a hashCode() distribution problem or a key skew pattern (e.g., all writes using the same prefix).
★ Hash Collision Debugging Cheat SheetFast diagnostics when hash table performance degrades in production
Latency spikes during high load correlated with HashMap operations−
Immediate action
Check HashMap occupancy and bucket distribution
Commands
jmap -histo <pid> | grep HashMap
jcmd <pid> GC.class_histogram | grep -A5 HashMap
Fix now
Pre-size the HashMap: new HashMap<>(expectedEntries * 2). If under concurrent access, replace with ConcurrentHashMap with explicit initial capacity: new ConcurrentHashMap<>(expectedEntries, 0.75f, numThreads)
Uneven CPU usage across threads — some threads consistently slower than others+
Immediate action
Check for HashMap lock contention or collision chain hotspots on specific buckets
Commands
jstack <pid> | grep -A10 'HashMap'
async-profiler -d 30 -f flamegraph.html <pid>
Fix now
If the flamegraph shows HashMap.get() or equals() dominating, check bucket distribution with a reflection-based audit. If lock contention, replace synchronized HashMap with ConcurrentHashMap. If distribution is skewed, fix hashCode() to incorporate more fields.
HashMap resize events correlating with GC pauses under peak traffic+
Immediate action
Identify the HashMap's declared initial capacity vs its current size
Compute the correct initial capacity: (int)(expectedMaxEntries / 0.75) + 1. Pass this to the HashMap constructor to prevent all load-factor-triggered resizes during normal operation.