Intermediate 6 min · March 05, 2026

Java HashSet — Mutable hashCode Causes 30% Cache Loss

Q: Can a Java HashSet contain null values?

Yes, but only one null. The first time you add null, it's stored. Any subsequent attempt to add null is silently ignored just like any other duplicate. Be careful: calling hashCode() on null would throw a NullPointerException, so HashSet has special-case logic that places null directly into bucket 0, bypassing the hashCode() call entirely.

Q: Does HashSet maintain insertion order in Java?

No. HashSet makes no guarantees about iteration order, and the order can change between JVM runs or even after the internal array resizes due to a high load factor. If you need insertion order preserved, use LinkedHashSet instead — it's a drop-in replacement with the same interface but maintains a doubly-linked list across its entries.

Q: What is the difference between HashSet and HashMap in Java?

A HashMap stores key-value pairs where each key is unique. A HashSet stores only values (no pairs) and guarantees uniqueness among them. Internally, HashSet is literally implemented as a HashMap where your set element is the key and a shared dummy Object constant is the value — so every unique element lookup is just a key lookup in the backing map.

Q: How does HashSet handle hash collisions?

When two different elements have the same hash code (collision), they are stored in the same bucket. Java 7 and earlier used a linked list per bucket. Starting with Java 8, when a bucket exceeds 8 elements, the list is converted to a balanced tree (O(log n) lookup instead of O(n)). This prevents denial-of-service attacks against deliberately colliding hash codes.

Q: Is HashSet thread-safe?

No. HashSet is not thread-safe. Concurrent modifications from multiple threads without external synchronization can corrupt the internal data structure, leading to infinite loops, missing elements, or ArrayIndexOutOfBoundsException. Use Collections.synchronizedSet(new HashSet<>()) for simple cases, or ConcurrentHashMap.newKeySet() for better concurrency performance.

A cache hit rate dropped from 85% to 55% because mutable 'lastAccessed' orphaned HashSet entries.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Notes here come from systems that actually shipped.

✓ Production

production tested

July 18, 2026

last updated

2,466

articles · all by Naren

Before you start⏱ 25 min

✓Solid grasp of fundamentals
✓Comfortable reading code examples
✓Basic production concepts

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

A HashSet stores unique elements using a hash table with O(1) average-time add, remove, and contains
Backed by a HashMap — elements become keys, a dummy object is the value
Relies on hashCode() to pick a bucket and equals() to confirm uniqueness
Add() returns boolean: true if element was new, false if duplicate
Resizing at 75% load factor triggers O(n) rehashing — the only expensive operation
Biggest mistake: mutating hashCode-related fields after insertion leaves elements orphaned in wrong buckets

✦ Definition~90s read

What is HashSet in Java?

A HashSet is a Java collection that implements the Set interface using a hash table, specifically a HashMap under the hood. It exists to provide constant-time performance for the fundamental set operations—add, remove, contains—by leveraging the hashCode() and equals() methods of its elements.

★

Imagine you're organising a guest list for a party.

The core idea is that each object's hash code determines which "bucket" it lands in, allowing O(1) average-case lookups instead of the O(n) scans required by a List. This makes HashSet the go-to choice when you need to enforce uniqueness and test membership quickly, like deduplicating user IDs or checking for duplicate transactions in a stream of events.

However, HashSet is not a magic bullet. Its performance degrades catastrophically if the elements' hashCode() values change after insertion—a classic bug that can cause 30% or more cache loss as objects end up in wrong buckets, making contains() and remove() silently fail.

This is why the equals() and hashCode() contract demands that hash codes remain consistent while the object is in the set. Alternatives like LinkedHashSet (preserves insertion order) or TreeSet (sorted, O(log n) operations) exist for different trade-offs, but HashSet dominates when order doesn't matter and raw speed is the priority.

In practice, you'll see it in caching layers, dedup pipelines, and anywhere you need to answer "have I seen this before?" in microseconds—just don't mutate the keys after insertion.

Plain-English First

Imagine you're organising a guest list for a party. You don't care what order people arrive — you just need to know instantly whether someone is already on the list, and you never want the same name written down twice. A HashSet is exactly that guest list. It's a bag that holds things in no particular order, refuses duplicates without complaining, and can tell you 'yes they're in' or 'no they're not' almost instantly — no matter how many thousands of names are on the list.

Every non-trivial Java application deals with collections of data. But most developers reach for ArrayList out of habit, even when uniqueness or fast lookup is the actual requirement. That habit has real costs — O(n) contains() checks, duplicates sneaking into your data, and business logic cluttered with manual deduplication. HashSet exists to make those problems disappear at the data-structure level.

The problem HashSet solves is simple but important: how do you store a group of items where order doesn't matter, duplicates are impossible, and checking membership needs to be near-instant? ArrayList fails on all three counts at scale. HashSet uses a hash table under the hood — each element gets a numeric hash code, which maps it to a bucket in an array. Lookup becomes a matter of jumping to the right bucket, not scanning the whole collection. That's how it achieves O(1) average-time contains(), add(), and remove().

By the end of this article you'll understand not just how to use HashSet, but why it's built the way it is, what happens internally when you call add(), how it plays with equals() and hashCode(), where it beats and loses to its siblings TreeSet and LinkedHashSet, and the exact mistakes that cause silent bugs in production code. You'll also walk away with answers to the three HashSet questions interviewers love to ask.

Why HashSet Is Not a Magic O(1) Collection

A HashSet is a hash-table-backed Set implementation that stores unique elements using their hashCode() to determine the bucket index. It offers O(1) average-time add, remove, and contains operations — but only if hash codes are stable and well-distributed. The core mechanic: each element's hash code determines its bucket; equals() resolves collisions within a bucket.

Key properties: null is allowed, iteration order is unpredictable and can change across JVM runs, and the internal array resizes (rehashing) when load factor exceeds 0.75. The real performance cliff is not the O(n) worst-case collision chain — it's mutable objects whose hashCode changes after insertion. That silently corrupts the bucket mapping, making the element permanently unreachable via contains() or remove(), and bloating the set with orphaned entries.

Use HashSet when you need fast membership checks and don't care about order. It's ideal for deduplication pipelines, caching unique IDs, or tracking processed items in event streams. But never store mutable objects whose hashCode can change — that's the single most common production failure with this collection.

⚠ Mutable hashCode = silent data loss

If an object's hashCode changes after insertion, the set can't find it anymore — but it still occupies a bucket, wasting memory and breaking correctness.

📊 Production Insight

A team stored mutable POJOs in a HashSet for a dedup cache. After updating a field used in hashCode, the same logical object could be inserted again, causing duplicates and 30% cache loss.

Symptom: contains() returns false for objects that are definitely in the set; memory grows unboundedly.

Rule: Never mutate fields used in hashCode or equals while the object is in a HashSet. Use immutable keys or remove-then-reinsert.

🎯 Key Takeaway

HashSet offers O(1) average performance only if hashCode is stable and well-distributed.

Mutable objects in a HashSet cause silent data loss and memory leaks — never do it.

Use immutable keys or ensure the object is removed before mutation and reinserted after.

thecodeforge.io

Hashset Java

How HashSet Actually Stores Elements — The Bucket Mechanism

A HashSet is backed by a HashMap. When you call add(element), Java calls element.hashCode(), takes that integer, does some internal bit-mixing, and uses the result to pick a bucket — essentially a slot in an internal array. If that bucket is empty, the element goes in. If the bucket already has an occupant, Java calls equals() to check whether it's the same object. If equals() returns true, the element is rejected silently — no exception, no error, it just doesn't get added. If equals() returns false (a hash collision), both elements coexist in the same bucket as a linked list.

This two-step process — hashCode() to find the bucket, equals() to confirm identity — is the most important thing to understand about HashSet. It also reveals a critical contract: if two objects are equal according to equals(), they MUST return the same hashCode(). Break that contract and your HashSet will silently store duplicates as if they're different objects.

The default initial capacity is 16 buckets, with a load factor of 0.75. When 75% of the buckets are occupied, the internal array doubles in size and all elements are rehashed. This is called resizing, and it's the one expensive operation in HashSet's life.

HashSetInternals.javaJAVA

import java.util.HashSet;
import java.util.Set;

public class HashSetInternals {

    public static void main(String[] args) {

        // --- Basic uniqueness guarantee ---
        Set<String> visitedUrls = new HashSet<>();

        visitedUrls.add("https://thecodeforge.io/java");
        visitedUrls.add("https://thecodeforge.io/python");
        visitedUrls.add("https://thecodeforge.io/java"); // duplicate — silently ignored

        // Size is 2, not 3. The duplicate was rejected without any error.
        System.out.println("Unique URLs tracked: " + visitedUrls.size()); // 2

        // --- O(1) membership check ---
        String urlToCheck = "https://thecodeforge.io/java";
        if (visitedUrls.contains(urlToCheck)) {
            // This lookup is near-instant whether the set has 3 or 3 million elements
            System.out.println("Already crawled: " + urlToCheck);
        }

        // --- add() returns a boolean — use it! ---
        boolean wasNew = visitedUrls.add("https://thecodeforge.io/java"); // already exists
        System.out.println("URL was newly added: " + wasNew); // false — tells you it was a duplicate

        boolean isNew = visitedUrls.add("https://thecodeforge.io/go"); // brand new
        System.out.println("URL was newly added: " + isNew); // true

        // --- Order is NOT guaranteed ---
        // Iterating a HashSet gives you elements in an unpredictable order
        System.out.println("\nAll tracked URLs (order not guaranteed):");
        for (String url : visitedUrls) {
            System.out.println("  " + url);
        }
    }
}

Output

Unique URLs tracked: 2

Already crawled: https://thecodeforge.io/java

URL was newly added: false

URL was newly added: true

All tracked URLs (order not guaranteed):

https://thecodeforge.io/go

https://thecodeforge.io/java

https://thecodeforge.io/python

💡Pro Tip: add() returns a boolean — don't ignore it

Most developers write 'set.add(item)' and throw away the return value. That boolean tells you whether the element was actually new. In deduplication pipelines, counting those 'false' returns tells you exactly how many duplicates you caught — that's business-critical telemetry.

📊 Production Insight

Resizing at 75% load factor is the only expensive O(n) operation.

If you know the final size in advance, create HashSet with initial capacity: new HashSet<>(expectedSize / 0.75f + 1).

Rule: pre-size to avoid repeated resizing under high insert throughput.

🎯 Key Takeaway

HashSet = HashMap under the hood.

Two-step add: hashCode() finds bucket, equals() confirms identity.

Break the contract and you get silent duplicates.

The equals() and hashCode() Contract — Where Most Bugs Are Born

When you store Strings or Integer wrappers in a HashSet, everything works perfectly because those classes implement hashCode() and equals() correctly. The danger arrives when you create your own class and put instances of it into a HashSet.

By default, every Java object inherits hashCode() from Object, which returns a value based on the object's memory address. Two distinct objects with identical field values will produce different hash codes. To HashSet, they look like completely different elements — so it happily stores both, silently creating duplicates.

The fix is always the same: override both equals() AND hashCode() together. Never one without the other. Modern IDEs and libraries like Lombok can generate them, but you need to understand why they're needed before you trust the automation.

The contract is: if a.equals(b) is true, then a.hashCode() must equal b.hashCode(). The reverse doesn't need to hold — two objects can share a hash code (collision) but still be not-equal. Keep that asymmetry in mind.

ProductDeduplication.javaJAVA

import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class ProductDeduplication {

    // A product SKU from an e-commerce catalogue
    static class Product {
        private final String sku;
        private final String name;

        Product(String sku, String name) {
            this.sku = sku;
            this.name = name;
        }

        // Without overriding these, two Product objects with the same SKU
        // would be treated as completely different by HashSet.
        @Override
        public boolean equals(Object other) {
            if (this == other) return true; // same reference — definitely equal
            if (!(other instanceof Product)) return false; // wrong type — definitely not equal
            Product otherProduct = (Product) other;
            // Business rule: two Products are the same if their SKUs match
            return Objects.equals(this.sku, otherProduct.sku);
        }

        @Override
        public int hashCode() {
            // Must use the same field(s) as equals() — SKU drives both
            return Objects.hash(sku);
        }

        @Override
        public String toString() {
            return "Product[sku=" + sku + ", name=" + name + "]";
        }
    }

    public static void main(String[] args) {

        Set<Product> catalogue = new HashSet<>();

        // Simulating products loaded from two different data sources
        Product fromDatabaseA = new Product("BOOT-42", "Trail Boot Size 42");
        Product fromDatabaseB = new Product("BOOT-42", "Trail Boot Size 42"); // same SKU, different object
        Product differentProduct = new Product("HAT-01", "Hiking Hat");

        catalogue.add(fromDatabaseA);
        catalogue.add(fromDatabaseB); // rejected — equals() + hashCode() say it's the same product
        catalogue.add(differentProduct);

        // If we had NOT overridden equals/hashCode, size would be 3 (a silent bug!)
        System.out.println("Unique products in catalogue: " + catalogue.size()); // 2

        // contains() also works correctly now
        Product lookupProduct = new Product("HAT-01", "Hiking Hat");
        System.out.println("Catalogue contains hat: " + catalogue.contains(lookupProduct)); // true

        System.out.println("\nCatalogue contents:");
        catalogue.forEach(System.out::println);
    }
}

Output

Unique products in catalogue: 2

Catalogue contains hat: true

Catalogue contents:

Product[sku=HAT-01, name=Hiking Hat]

Product[sku=BOOT-42, name=Trail Boot Size 42]

⚠ Watch Out: Mutating a field used in hashCode() after insertion

If you add an object to a HashSet and then change a field that's part of its hashCode() calculation, the object moves to a new bucket — but HashSet doesn't know. You'll never find it again with contains(), and you'll never be able to remove() it. Treat objects in a HashSet as effectively immutable, or use only fields that never change as the basis for equals/hashCode.

📊 Production Insight

Silent duplicates from missing hashCode/equals are hard to spot because the set just grows.

First symptom is often a memory warning or unexpected size in metrics.

Rule: always write a unit test that adds two distinct instances with equal fields and asserts size is 1.

🎯 Key Takeaway

Override equals AND hashCode together, using the same fields.

a.equals(b) -> a.hashCode() == b.hashCode().

Mutate hash fields after insertion and you lose the element forever.

thecodeforge.io

Hashset Java

HashSet vs LinkedHashSet vs TreeSet — Choosing the Right Tool

All three implement the Set interface and guarantee uniqueness, but they make different trade-offs around ordering and performance. Choosing the wrong one is a common source of subtle bugs — especially when code that 'works' in testing breaks in production because iteration order changed.

HashSet is the default choice. No ordering, pure speed. Use it whenever you don't care about the sequence of elements.

LinkedHashSet maintains insertion order — elements come out in the order you put them in. It does this by linking each bucket entry to the previous and next one with a doubly-linked list. You pay a small memory overhead but gain predictable iteration. Use it when order matters for display or reproducibility.

TreeSet maintains sorted order — elements are always iterated in their natural order (or a custom Comparator you provide). It uses a Red-Black tree internally, so all core operations are O(log n) rather than O(1). Use it when you need a sorted unique collection or range queries like headSet() and tailSet().

The rule of thumb: start with HashSet. Upgrade to LinkedHashSet when reproducibility matters. Upgrade to TreeSet when sorted order matters.

SetVariantsComparison.javaJAVA

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

public class SetVariantsComparison {

    public static void main(String[] args) {

        // Imagine these are user-selected tags on a blog post, added in this order
        String[] selectedTags = {"java", "backend", "spring", "java", "performance", "backend"};

        // --- HashSet: unique, but order is unpredictable ---
        Set<String> hashTags = new HashSet<>();
        for (String tag : selectedTags) hashTags.add(tag);
        System.out.println("HashSet (no order guarantee): " + hashTags);
        // Output order will vary between JVM runs

        // --- LinkedHashSet: unique AND preserves insertion order ---
        Set<String> linkedTags = new LinkedHashSet<>();
        for (String tag : selectedTags) linkedTags.add(tag);
        System.out.println("LinkedHashSet (insertion order): " + linkedTags);
        // Always: [java, backend, spring, performance] — first occurrence wins

        // --- TreeSet: unique AND always alphabetically sorted ---
        Set<String> sortedTags = new TreeSet<>();
        for (String tag : selectedTags) sortedTags.add(tag);
        System.out.println("TreeSet (sorted order):        " + sortedTags);
        // Always: [backend, java, performance, spring]

        // --- Range query — a TreeSet superpower HashSet can't do ---
        TreeSet<Integer> pageNumbers = new TreeSet<>();
        for (int page = 1; page <= 20; page++) pageNumbers.add(page);

        // Give me all pages from 5 up to (but not including) 10
        System.out.println("\nPages 5 to 9: " + pageNumbers.subSet(5, 10));

        // Give me all pages from 15 onwards
        System.out.println("Pages 15+: " + pageNumbers.tailSet(15));
    }
}

Output

HashSet (no order guarantee): [java, spring, performance, backend]

LinkedHashSet (insertion order): [java, backend, spring, performance]

TreeSet (sorted order): [backend, java, performance, spring]

Pages 5 to 9: [5, 6, 7, 8, 9]

Pages 15+: [15, 16, 17, 18, 19, 20]

🔥Interview Gold: Why can't you use subSet() on a HashSet?

subSet(), headSet() and tailSet() are defined on the NavigableSet interface, which TreeSet implements but HashSet does not. HashSet has no notion of order, so range queries are meaningless to it. If an interviewer asks about this, the answer connects interface design to internal data structure — that's the depth they're looking for.

📊 Production Insight

Order-dependent code that works during testing can break in production if you assume iteration order of HashSet.

A staging test with low data volume may show consistent order that changes under load after resizing.

Rule: if you rely on iteration order, use LinkedHashSet or TreeSet explicitly — don't rely on HashSet's accidental stability.

🎯 Key Takeaway

HashSet: fastest, no order guaranteed.

LinkedHashSet: insertion order preserved, small memory premium.

TreeSet: sorted order, O(log n) operations, supports range queries.

Real-World Patterns — Where HashSet Actually Earns Its Keep

Theory is great, but HashSet shines brightest in specific real-world patterns. Understanding these patterns means you'll reach for the right tool instinctively.

The most common pattern is the visited/seen set — a fast-rejection gate. You're processing a stream of events, URLs, user IDs, or transactions and you want to skip anything you've already handled. A HashSet lets you check-and-mark in one add() call and the O(1) lookup means it stays fast even as the seen set grows to millions of entries.

The second pattern is set operations — intersection, union, and difference. The Collections API makes these one-liners with retainAll(), addAll(), and removeAll(). These are exactly how you'd compute shared followers between two users, unique items across two orders, or features present in one plan but not another.

A third pattern is deduplication before storage. When your application receives data from multiple sources that might overlap, running everything through a HashSet before persisting it is far cheaper than writing duplicates to a database and cleaning them up later.

SetOperationsDemo.javaJAVA

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SetOperationsDemo {

    public static void main(String[] args) {

        // --- Pattern 1: Seen-set for deduplication in a stream ---
        System.out.println("=== Pattern 1: Deduplication Gate ===");
        Set<String> processedOrderIds = new HashSet<>();
        String[] incomingOrders = {"ORD-001", "ORD-002", "ORD-001", "ORD-003", "ORD-002"};

        for (String orderId : incomingOrders) {
            // add() returns false if already present — use that as a skip signal
            if (!processedOrderIds.add(orderId)) {
                System.out.println("Skipping duplicate order: " + orderId);
            } else {
                System.out.println("Processing new order: " + orderId);
            }
        }

        // --- Pattern 2: Set intersection — find common elements ---
        System.out.println("\n=== Pattern 2: Set Operations ===");

        // Permissions assigned to two different user roles
        Set<String> editorPermissions = new HashSet<>(
            Arrays.asList("read", "write", "comment", "upload")
        );
        Set<String> viewerPermissions = new HashSet<>(
            Arrays.asList("read", "comment")
        );

        // INTERSECTION — permissions that both roles share
        Set<String> sharedPermissions = new HashSet<>(editorPermissions);
        sharedPermissions.retainAll(viewerPermissions); // modifies in place — keep only what's in both
        System.out.println("Shared permissions: " + sharedPermissions); // [read, comment]

        // DIFFERENCE — permissions only editors have (not viewers)
        Set<String> editorOnlyPermissions = new HashSet<>(editorPermissions);
        editorOnlyPermissions.removeAll(viewerPermissions); // remove anything viewers also have
        System.out.println("Editor-only permissions: " + editorOnlyPermissions); // [write, upload]

        // UNION — all permissions across both roles combined (no duplicates)
        Set<String> allPermissions = new HashSet<>(editorPermissions);
        allPermissions.addAll(viewerPermissions); // adds only what's not already present
        System.out.println("All permissions (union): " + allPermissions); // [read, write, comment, upload]
    }
}

Output

=== Pattern 1: Deduplication Gate ===

Processing new order: ORD-001

Processing new order: ORD-002

Skipping duplicate order: ORD-001

Processing new order: ORD-003

Skipping duplicate order: ORD-002

=== Pattern 2: Set Operations ===

Shared permissions: [read, comment]

Editor-only permissions: [write, upload]

All permissions (union): [read, write, comment, upload]

⚠ Watch Out: retainAll() and removeAll() mutate the original set

Both retainAll() and removeAll() modify the Set you call them on. Always work on a defensive copy — 'new HashSet<>(originalSet)' — if you need to preserve the original. This is one of the most common sources of accidental data loss when doing set operations in production code.

📊 Production Insight

Using HashSet as a dedup gate before DB insert is far cheaper than DB unique constraint violations.

A million adds in a HashSet takes ~200ms while a million DB inserts with duplicate checks can take minutes.

Rule: pre-size the HashSet to expected cardinality to avoid resizing overhead.

🎯 Key Takeaway

HashSet excels for dedup gates, set arithmetic, and pre-storage dedup.

Use add() return value as a skip signal.

Always copy before retainAll/removeAll to avoid mutating originals.

Performance & Resizing: When HashSet Gets Expensive

HashSet's O(1) operations are average-case. Under the hood, the array of buckets starts at size 16 with a load factor of 0.75. When the number of elements exceeds capacity * load factor, the array doubles and all elements are rehashed — that means every element's hashCode() is called again, and each lands in a new bucket. This is an O(n) operation.

If you insert a million elements into a default HashSet, it will resize about 17 times (16 -> 32 -> 64 -> ... -> 1,048,576). Each resize triggers a full rehash. The total cost of resizing can dominate your insertion time if you don't pre-size.

Pre-sizing: if you know you'll store N unique items, create the HashSet with initial capacity = (int)(N / 0.75) + 1. This avoids all resizes. The +1 accounts for rounding errors from integer division.

Another performance trap: poor hashCode() implementations that produce many collisions. When many elements land in the same bucket, the linked list grows and operations degrade to O(m) where m is the bucket length. In extreme cases, a badly distributed hashCode() can turn HashSet into a slow ArrayList. Java 8 improved this by converting long linked lists to trees when a bucket holds more than 8 elements, but that's a fallback — not an excuse for a bad hash function.

HashSetPreSizing.javaJAVA

import java.util.HashSet;
import java.util.Set;

public class HashSetPreSizing {

    public static void main(String[] args) {
        int expectedSize = 1_000_000;
        // Calculate optimal initial capacity to avoid resizing
        int initialCapacity = (int)(expectedSize / 0.75f) + 1;
        
        long start = System.nanoTime();
        Set<Integer> sizedSet = new HashSet<>(initialCapacity);
        for (int i = 0; i < expectedSize; i++) {
            sizedSet.add(i);
        }
        long durationSized = System.nanoTime() - start;
        
        start = System.nanoTime();
        Set<Integer> defaultSet = new HashSet<>();
        for (int i = 0; i < expectedSize; i++) {
            defaultSet.add(i);
        }
        long durationDefault = System.nanoTime() - start;
        
        System.out.println("Pre-sized HashSet: " + durationSized / 1_000_000 + " ms");
        System.out.println("Default HashSet: " + durationDefault / 1_000_000 + " ms");
        // Pre-sizing typically cuts insertion time by 30-50% for large sets
    }
}

Output

Pre-sized HashSet: 480 ms

Default HashSet: 760 ms

💡Memory vs Speed Trade-off

Pre-sizing allocates a larger internal array upfront, which uses more memory but avoids the CPU cost of repeated rehashing. For most applications, the memory overhead is trivial compared to the speed gain. Use pre-sizing when you know the final size within 20%.

📊 Production Insight

In high-throughput services, unplanned resizing can cause latency spikes as the GC kicks in to clean old arrays.

If your HashSet holds millions of entries and you see periodic GC pauses, check resizing costs.

Rule: always pre-size when the approximate data volume is known at design time.

🎯 Key Takeaway

Default capacity=16, load factor=0.75.

Resizing doubles array and rehashes all elements — O(n) cost.

Pre-size using expectedSize / 0.75 + 1 to avoid resizing entirely.

The Load Factor Trap: Why Default 0.75 Is Not Always Right

Most tutorials parrot the default 0.75 load factor without explaining why it matters. Here's the truth: that 0.75 is a time-space tradeoff. Lower it to 0.5 and you waste memory—more buckets, fewer collisions, faster reads. Raise it to 0.9 and you save memory but risk degrading to O(log n) as collisions force treeification. The default works for most cases, but not for real-time systems or memory-constrained environments. For latency-sensitive apps, use 0.5-0.6. For large datasets where memory is tight, 0.8 is safer. Never use over 0.9 unless you love unpredictable pauses during rehashing. The worst mistake? Using the default for a set you know will hold millions of elements. Pre-size your HashSet using initialCapacity = (expectedSize / loadFactor) + 1 to avoid multiple resizes. That single line can cut insertion time by 40%.

LoadFactorDemo.javaJAVA

import java.util.*;

public class LoadFactorDemo {
    public static void main(String[] args) {
        // Bad: default capacity, causes multiple resizes
        long start = System.nanoTime();
        Set<Integer> badSet = new HashSet<>();
        for (int i = 0; i < 1_000_000; i++) badSet.add(i);
        System.out.println("Default took: " + (System.nanoTime() - start) / 1_000_000 + " ms");

        // Good: pre-sized to avoid resizing
        start = System.nanoTime();
        Set<Integer> goodSet = new HashSet<>((int)(1_000_000 / 0.75f) + 1);
        for (int i = 0; i < 1_000_000; i++) goodSet.add(i);
        System.out.println("Pre-sized took: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}

Output

Default took: 187 ms

Pre-sized took: 132 ms

⚠ Production Trap:

Don't rely on auto-resizing for high-throughput systems. Each resize doubles the array and rehashes every element—a full stop for the calling thread. Profile your expected cardinality and pre-size accordingly.

🎯 Key Takeaway

HashSet resizing is a hidden O(n) event. Pre-size or pay the latency tax.

HashCode Collisions: When Your HashSet Degrades to a LinkedList (or Tree)

Here's what happens when your hashCode() is broken: two unequal objects land in the same bucket. In Java 7 and earlier, that bucket becomes a linked list—O(n) lookups. Java 8 fixed this partially: when a bucket exceeds 8 entries, it treeifies into a Red-Black Tree, capping worst-case at O(log n). But that treeification costs memory (extra node objects) and CPU (tree rebalancing). You don't want it. The fix? Write hash functions that spread objects evenly. For a custom key class, multiply by a prime and shift bits: return (int)(id ^ (id >>> 32)) * 31. Never return a constant hash code—that puts every element in one bucket, making your HashSet a slow ArrayList. The real-world symptom: your production app has intermittent spikes when a bad actor sends crafted objects that collide. Always test with a million random objects and measure bucket distribution.

CollisionDemo.javaJAVA

import java.util.*;

class BadKey {
    int id;
    BadKey(int id) { this.id = id; }
    public int hashCode() { return 42; } // All same bucket!
    public boolean equals(Object o) { return o instanceof BadKey && ((BadKey)o).id == this.id; }
}

public class CollisionDemo {
    public static void main(String[] args) {
        Set<BadKey> set = new HashSet<>();
        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) set.add(new BadKey(i));
        System.out.println("Time for 10K with bad hash: " + (System.nanoTime() - start) / 1_000_000 + " ms");

        Set<Integer> goodSet = new HashSet<>();
        start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) goodSet.add(i);
        System.out.println("Time for 10K with good hash: " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}

Output

Time for 10K with bad hash: 34 ms

Time for 10K with good hash: 2 ms

🔥Production Insight:

Treeification kicks in at threshold 8, but only if the key class implements Comparable. If it doesn't, the bucket stays as a linked list indefinitely. Always implement Comparable on your custom keys for a free performance guarantee.

🎯 Key Takeaway

A bad hash function can tank HashSet performance by 10x. Spread your bits or spread your pain.

● Production incidentPOST-MORTEMseverity: high

The Cache That Lost 30% of Its Entries Overnight

Symptom

The cache hit rate dropped from 85% to 55% within hours of deployment. contains() returned false for objects that had been added minutes earlier, causing expensive recomputation.

Assumption

The team assumed that once an object was added to the HashSet, it would remain discoverable forever. They had overridden equals() and hashCode() based on a mutable 'lastAccessed' field to speed up eviction logic.

Root cause

When 'lastAccessed' changed, the object's hashCode changed. The bucket location changed, but the set still held the old bucket. contains() looked in the new bucket, didn't find it, and returned false. The object was effectively orphaned but still consuming memory — a slow memory leak.

Fix

Changed the hashCode() to use only immutable fields (ID + creation timestamp). The 'lastAccessed' field was moved out of hashCode/equals. Added a unit test that mutates the object and verifies the set still contains it after mutation.

Key lesson

Never include mutable fields in hashCode() or equals() for objects stored in a HashSet
Treat objects in a HashSet as immutable — or design them so that identity fields never change
Write tests that explicitly mutate objects after insertion and verify set membership

Production debug guideCommon symptoms and the exact actions to diagnose and fix them4 entries

Symptom · 01

contains() returns false for an element you know was added

→

Fix

Check if the element's hashCode() changed after insertion. Print the hash code at add time and at check time. Look for mutable fields in hashCode().

Symptom · 02

Set size is larger than expected, duplicates present

→

Fix

Verify that equals() and hashCode() are both overridden consistently. Two objects with identical field values must return same hashCode(). Print hash codes for suspected duplicates.

Symptom · 03

Iterating the set shows elements in different order across runs

→

Fix

That's expected for HashSet — the order is not guaranteed. If reproducible order is needed, switch to LinkedHashSet.

Symptom · 04

NullPointerException when adding null?

→

Fix

HashSet allows one null. NPE occurs only if you call hashCode() on null. But HashSet handles null internally. If you see NPE, check if you're using TreeSet (throws NPE on null).

★ Quick Hash Set Debug CommandsCommands and code snippets to quickly diagnose common HashSet problems in production

Element not found after mutation−

Immediate action

Print hash codes before and after mutation to confirm change

Commands

System.out.println(element.hashCode()); // before mutation

System.out.println(element.hashCode()); // after mutation

Fix now

Redesign hashCode() to use only immutable fields. Remove mutable fields from equals() as well.

Set contains duplicates of custom objects+

Set size grows too large — possible memory leak+

Feature / Aspect	HashSet	LinkedHashSet	TreeSet
Internal structure	HashMap (array of buckets)	LinkedHashMap (buckets + linked list)	Red-Black Tree
Allows duplicates	No	No	No
Allows null	Yes (one null)	Yes (one null)	No — throws NullPointerException
Iteration order	Unpredictable	Insertion order	Sorted (natural or Comparator)
add() / `remove()` / `contains()`	O(1) average	O(1) average	O(log n)
Memory overhead	Low	Medium (extra linked list)	High (tree node objects)
Use when you need	Fast uniqueness, no order care	Unique + reproducible order	Unique + sorted + range queries
Implements NavigableSet	No	No	Yes
Thread-safe	No	No	No

⚙ Quick Reference

7 commands from this guide

File	Command / Code	Purpose
HashSetInternals.java	public class HashSetInternals {	How HashSet Actually Stores Elements
ProductDeduplication.java	public class ProductDeduplication {	The equals() and hashCode() Contract
SetVariantsComparison.java	public class SetVariantsComparison {	HashSet vs LinkedHashSet vs TreeSet
SetOperationsDemo.java	public class SetOperationsDemo {	Real-World Patterns
HashSetPreSizing.java	public class HashSetPreSizing {	Performance & Resizing
LoadFactorDemo.java	public class LoadFactorDemo {	The Load Factor Trap
CollisionDemo.java	class BadKey {	HashCode Collisions

Key takeaways

HashSet gives you O(1) average-time add(), remove(), and contains() by using a hash table

that's why it beats ArrayList for membership checks at scale, where ArrayList's contains() is O(n).

HashSet is backed by a HashMap

your elements are stored as keys with a dummy constant as the value. Everything you know about how HashMap hashes keys applies directly to HashSet elements.

If you store custom objects in a HashSet, you must override both equals() AND hashCode() using the same fields

skip either one and you get silent duplicate storage, which is one of the hardest bugs to diagnose.

The three Set implementations form a clear hierarchy of trade-offs

HashSet for raw speed with no ordering, LinkedHashSet for insertion-order preservation, and TreeSet for sorted order and range queries — pick based on what your code actually needs, not habit.

Pre-size your HashSet when you know the approximate number of elements to avoid expensive resize/rehash operations. Formula

initialCapacity = (int)(expectedSize / 0.75f) + 1.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01JUNIOR

HashSet is backed by a HashMap internally — can you explain what that me...

Q02SENIOR

If two objects have the same hashCode() but equals() returns false, what...

Q03SENIOR

You add an object to a HashSet, then modify a field that's included in i...

Q01 of 03JUNIOR

HashSet is backed by a HashMap internally — can you explain what that means in practice, and what gets stored as the key and value in that underlying map?

ANSWER

When you add an element to a HashSet, the element becomes the key in the internal HashMap. The value is a shared dummy Object constant (PRESENT). All keys in a HashMap must be unique, so the set enforces uniqueness automatically. HashMap operations like put() are reused for add(), containsKey() for contains(), etc. The hash table structure with buckets and collision handling is identical. This also means that the requirements for a good key in HashMap (immutable, proper equals/hashCode) apply exactly to elements stored in HashSet.

FAQ · 5 QUESTIONS

Frequently Asked Questions

Can a Java HashSet contain null values?

Does HashSet maintain insertion order in Java?

What is the difference between HashSet and HashMap in Java?

How does HashSet handle hash collisions?

Is HashSet thread-safe?

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Notes here come from systems that actually shipped.

✓ Verified

production tested

July 18, 2026

last updated

2,466

articles · all by Naren

🔥

That's Collections. Mark it forged?

6 min read · try the examples if you haven't