Java Intermediate

Java HashSet Explained — How It Works, When to Use It, and What to Watch Out For

📅 March 2026 ⏱ 8 min read 🎯 Intermediate

In Plain English 🔥

Imagine you're organising a guest list for a party. You don't care what order people arrive — you just need to know instantly whether someone is already on the list, and you never want the same name written down twice. A HashSet is exactly that guest list. It's a bag that holds things in no particular order, refuses duplicates without complaining, and can tell you 'yes they're in' or 'no they're not' almost instantly — no matter how many thousands of names are on the list.

⚡ Quick Answer

Every non-trivial Java application deals with collections of data. But most developers reach for ArrayList out of habit, even when uniqueness or fast lookup is the actual requirement. That habit has real costs — O(n) contains() checks, duplicates sneaking into your data, and business logic cluttered with manual deduplication. HashSet exists to make those problems disappear at the data-structure level.

The problem HashSet solves is simple but important: how do you store a group of items where order doesn't matter, duplicates are impossible, and checking membership needs to be near-instant? ArrayList fails on all three counts at scale. HashSet uses a hash table under the hood — each element gets a numeric hash code, which maps it to a bucket in an array. Lookup becomes a matter of jumping to the right bucket, not scanning the whole collection. That's how it achieves O(1) average-time contains(), add(), and remove().

By the end of this article you'll understand not just how to use HashSet, but why it's built the way it is, what happens internally when you call add(), how it plays with equals() and hashCode(), where it beats and loses to its siblings TreeSet and LinkedHashSet, and the exact mistakes that cause silent bugs in production code. You'll also walk away with answers to the three HashSet questions interviewers love to ask.

How HashSet Actually Stores Elements — The Bucket Mechanism

A HashSet is backed by a HashMap. When you call add(element), Java calls element.hashCode(), takes that integer, does some internal bit-mixing, and uses the result to pick a bucket — essentially a slot in an internal array. If that bucket is empty, the element goes in. If the bucket already has an occupant, Java calls equals() to check whether it's the same object. If equals() returns true, the element is rejected silently — no exception, no error, it just doesn't get added. If equals() returns false (a hash collision), both elements coexist in the same bucket as a linked list.

This two-step process — hashCode() to find the bucket, equals() to confirm identity — is the most important thing to understand about HashSet. It also reveals a critical contract: if two objects are equal according to equals(), they MUST return the same hashCode(). Break that contract and your HashSet will silently store duplicates as if they're different objects.

The default initial capacity is 16 buckets, with a load factor of 0.75. When 75% of the buckets are occupied, the internal array doubles in size and all elements are rehashed. This is called resizing, and it's the one expensive operation in HashSet's life.

HashSetInternals.java · JAVA

123456789101112131415161718192021222324252627282930313233343536373839

import java.util.HashSet;
import java.util.Set;

public class HashSetInternals {

    public static void main(String[] args) {

        // --- Basic uniqueness guarantee ---
        Set<String> visitedUrls = new HashSet<>();

        visitedUrls.add("https://thecodeforge.io/java");
        visitedUrls.add("https://thecodeforge.io/python");
        visitedUrls.add("https://thecodeforge.io/java"); // duplicate — silently ignored

        // Size is 2, not 3. The duplicate was rejected without any error.
        System.out.println("Unique URLs tracked: " + visitedUrls.size()); // 2

        // --- O(1) membership check ---
        String urlToCheck = "https://thecodeforge.io/java";
        if (visitedUrls.contains(urlToCheck)) {
            // This lookup is near-instant whether the set has 3 or 3 million elements
            System.out.println("Already crawled: " + urlToCheck);
        }

        // --- add() returns a boolean — use it! ---
        boolean wasNew = visitedUrls.add("https://thecodeforge.io/java"); // already exists
        System.out.println("URL was newly added: " + wasNew); // false — tells you it was a duplicate

        boolean isNew = visitedUrls.add("https://thecodeforge.io/go"); // brand new
        System.out.println("URL was newly added: " + isNew); // true

        // --- Order is NOT guaranteed ---
        // Iterating a HashSet gives you elements in an unpredictable order
        System.out.println("\nAll tracked URLs (order not guaranteed):");
        for (String url : visitedUrls) {
            System.out.println("  " + url);
        }
    }
}

▶ Output

Unique URLs tracked: 2
Already crawled: https://thecodeforge.io/java
URL was newly added: false
URL was newly added: true

All tracked URLs (order not guaranteed):
https://thecodeforge.io/go
https://thecodeforge.io/java
https://thecodeforge.io/python

⚠️

Pro Tip: add() returns a boolean — don't ignore itMost developers write 'set.add(item)' and throw away the return value. That boolean tells you whether the element was actually new. In deduplication pipelines, counting those 'false' returns tells you exactly how many duplicates you caught — that's business-critical telemetry.

The equals() and hashCode() Contract — Where Most Bugs Are Born

When you store Strings or Integer wrappers in a HashSet, everything works perfectly because those classes implement hashCode() and equals() correctly. The danger arrives when you create your own class and put instances of it into a HashSet.

By default, every Java object inherits hashCode() from Object, which returns a value based on the object's memory address. Two distinct objects with identical field values will produce different hash codes. To HashSet, they look like completely different elements — so it happily stores both, silently creating duplicates.

The fix is always the same: override both equals() AND hashCode() together. Never one without the other. Modern IDEs and libraries like Lombok can generate them, but you need to understand why they're needed before you trust the automation.

The contract is: if a.equals(b) is true, then a.hashCode() must equal b.hashCode(). The reverse doesn't need to hold — two objects can share a hash code (collision) but still be not-equal. Keep that asymmetry in mind.

ProductDeduplication.java · JAVA

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263

import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class ProductDeduplication {

    // A product SKU from an e-commerce catalogue
    static class Product {
        private final String sku;
        private final String name;

        Product(String sku, String name) {
            this.sku = sku;
            this.name = name;
        }

        // Without overriding these, two Product objects with the same SKU
        // would be treated as completely different by HashSet.
        @Override
        public boolean equals(Object other) {
            if (this == other) return true; // same reference — definitely equal
            if (!(other instanceof Product)) return false; // wrong type — definitely not equal
            Product otherProduct = (Product) other;
            // Business rule: two Products are the same if their SKUs match
            return Objects.equals(this.sku, otherProduct.sku);
        }

        @Override
        public int hashCode() {
            // Must use the same field(s) as equals() — SKU drives both
            return Objects.hash(sku);
        }

        @Override
        public String toString() {
            return "Product[sku=" + sku + ", name=" + name + "]";
        }
    }

    public static void main(String[] args) {

        Set<Product> catalogue = new HashSet<>();

        // Simulating products loaded from two different data sources
        Product fromDatabaseA = new Product("BOOT-42", "Trail Boot Size 42");
        Product fromDatabaseB = new Product("BOOT-42", "Trail Boot Size 42"); // same SKU, different object
        Product differentProduct = new Product("HAT-01", "Hiking Hat");

        catalogue.add(fromDatabaseA);
        catalogue.add(fromDatabaseB); // rejected — equals() + hashCode() say it's the same product
        catalogue.add(differentProduct);

        // If we had NOT overridden equals/hashCode, size would be 3 (a silent bug!)
        System.out.println("Unique products in catalogue: " + catalogue.size()); // 2

        // contains() also works correctly now
        Product lookupProduct = new Product("HAT-01", "Hiking Hat");
        System.out.println("Catalogue contains hat: " + catalogue.contains(lookupProduct)); // true

        System.out.println("\nCatalogue contents:");
        catalogue.forEach(System.out::println);
    }
}

▶ Output

Unique products in catalogue: 2
Catalogue contains hat: true

Catalogue contents:
Product[sku=HAT-01, name=Hiking Hat]
Product[sku=BOOT-42, name=Trail Boot Size 42]

⚠️

Watch Out: Mutating a field used in hashCode() after insertionIf you add an object to a HashSet and then change a field that's part of its hashCode() calculation, the object moves to a new bucket — but HashSet doesn't know. You'll never find it again with contains(), and you'll never be able to remove() it. Treat objects in a HashSet as effectively immutable, or use only fields that never change as the basis for equals/hashCode.

HashSet vs LinkedHashSet vs TreeSet — Choosing the Right Tool

All three implement the Set interface and guarantee uniqueness, but they make different trade-offs around ordering and performance. Choosing the wrong one is a common source of subtle bugs — especially when code that 'works' in testing breaks in production because iteration order changed.

HashSet is the default choice. No ordering, pure speed. Use it whenever you don't care about the sequence of elements.

LinkedHashSet maintains insertion order — elements come out in the order you put them in. It does this by linking each bucket entry to the previous and next one with a doubly-linked list. You pay a small memory overhead but gain predictable iteration. Use it when order matters for display or reproducibility.

TreeSet maintains sorted order — elements are always iterated in their natural order (or a custom Comparator you provide). It uses a Red-Black tree internally, so all core operations are O(log n) rather than O(1). Use it when you need a sorted unique collection or range queries like headSet() and tailSet().

The rule of thumb: start with HashSet. Upgrade to LinkedHashSet when reproducibility matters. Upgrade to TreeSet when sorted order matters.

SetVariantsComparison.java · JAVA

1234567891011121314151617181920212223242526272829303132333435363738394041

import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

public class SetVariantsComparison {

    public static void main(String[] args) {

        // Imagine these are user-selected tags on a blog post, added in this order
        String[] selectedTags = {"java", "backend", "spring", "java", "performance", "backend"};

        // --- HashSet: unique, but order is unpredictable ---
        Set<String> hashTags = new HashSet<>();
        for (String tag : selectedTags) hashTags.add(tag);
        System.out.println("HashSet (no order guarantee): " + hashTags);
        // Output order will vary between JVM runs

        // --- LinkedHashSet: unique AND preserves insertion order ---
        Set<String> linkedTags = new LinkedHashSet<>();
        for (String tag : selectedTags) linkedTags.add(tag);
        System.out.println("LinkedHashSet (insertion order): " + linkedTags);
        // Always: [java, backend, spring, performance] — first occurrence wins

        // --- TreeSet: unique AND always alphabetically sorted ---
        Set<String> sortedTags = new TreeSet<>();
        for (String tag : selectedTags) sortedTags.add(tag);
        System.out.println("TreeSet (sorted order):        " + sortedTags);
        // Always: [backend, java, performance, spring]

        // --- Range query — a TreeSet superpower HashSet can't do ---
        TreeSet<Integer> pageNumbers = new TreeSet<>();
        for (int page = 1; page <= 20; page++) pageNumbers.add(page);

        // Give me all pages from 5 up to (but not including) 10
        System.out.println("\nPages 5 to 9: " + pageNumbers.subSet(5, 10));

        // Give me all pages from 15 onwards
        System.out.println("Pages 15+: " + pageNumbers.tailSet(15));
    }
}

▶ Output

HashSet (no order guarantee): [java, spring, performance, backend]
LinkedHashSet (insertion order): [java, backend, spring, performance]
TreeSet (sorted order): [backend, java, performance, spring]

Pages 5 to 9: [5, 6, 7, 8, 9]
Pages 15+: [15, 16, 17, 18, 19, 20]

🔥

Interview Gold: Why can't you use subSet() on a HashSet?subSet(), headSet() and tailSet() are defined on the NavigableSet interface, which TreeSet implements but HashSet does not. HashSet has no notion of order, so range queries are meaningless to it. If an interviewer asks about this, the answer connects interface design to internal data structure — that's the depth they're looking for.

Real-World Patterns — Where HashSet Actually Earns Its Keep

Theory is great, but HashSet shines brightest in specific real-world patterns. Understanding these patterns means you'll reach for the right tool instinctively.

The most common pattern is the visited/seen set — a fast-rejection gate. You're processing a stream of events, URLs, user IDs, or transactions and you want to skip anything you've already handled. A HashSet lets you check-and-mark in one add() call and the O(1) lookup means it stays fast even as the seen set grows to millions of entries.

The second pattern is set operations — intersection, union, and difference. The Collections API makes these one-liners with retainAll(), addAll(), and removeAll(). These are exactly how you'd compute shared followers between two users, unique items across two orders, or features present in one plan but not another.

A third pattern is deduplication before storage. When your application receives data from multiple sources that might overlap, running everything through a HashSet before persisting it is far cheaper than writing duplicates to a database and cleaning them up later.

SetOperationsDemo.java · JAVA

12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SetOperationsDemo {

    public static void main(String[] args) {

        // --- Pattern 1: Seen-set for deduplication in a stream ---
        System.out.println("=== Pattern 1: Deduplication Gate ===");
        Set<String> processedOrderIds = new HashSet<>();
        String[] incomingOrders = {"ORD-001", "ORD-002", "ORD-001", "ORD-003", "ORD-002"};

        for (String orderId : incomingOrders) {
            // add() returns false if already present — use that as a skip signal
            if (!processedOrderIds.add(orderId)) {
                System.out.println("Skipping duplicate order: " + orderId);
            } else {
                System.out.println("Processing new order: " + orderId);
            }
        }

        // --- Pattern 2: Set intersection — find common elements ---
        System.out.println("\n=== Pattern 2: Set Operations ===");

        // Permissions assigned to two different user roles
        Set<String> editorPermissions = new HashSet<>(
            Arrays.asList("read", "write", "comment", "upload")
        );
        Set<String> viewerPermissions = new HashSet<>(
            Arrays.asList("read", "comment")
        );

        // INTERSECTION — permissions that both roles share
        Set<String> sharedPermissions = new HashSet<>(editorPermissions);
        sharedPermissions.retainAll(viewerPermissions); // modifies in place — keep only what's in both
        System.out.println("Shared permissions: " + sharedPermissions); // [read, comment]

        // DIFFERENCE — permissions only editors have (not viewers)
        Set<String> editorOnlyPermissions = new HashSet<>(editorPermissions);
        editorOnlyPermissions.removeAll(viewerPermissions); // remove anything viewers also have
        System.out.println("Editor-only permissions: " + editorOnlyPermissions); // [write, upload]

        // UNION — all permissions across both roles combined (no duplicates)
        Set<String> allPermissions = new HashSet<>(editorPermissions);
        allPermissions.addAll(viewerPermissions); // adds only what's not already present
        System.out.println("All permissions (union): " + allPermissions); // [read, write, comment, upload]
    }
}

▶ Output

=== Pattern 1: Deduplication Gate ===
Processing new order: ORD-001
Processing new order: ORD-002
Skipping duplicate order: ORD-001
Processing new order: ORD-003
Skipping duplicate order: ORD-002

=== Pattern 2: Set Operations ===
Shared permissions: [read, comment]
Editor-only permissions: [write, upload]
All permissions (union): [read, write, comment, upload]

⚠️

Watch Out: retainAll() and removeAll() mutate the original setBoth retainAll() and removeAll() modify the Set you call them on. Always work on a defensive copy — 'new HashSet<>(originalSet)' — if you need to preserve the original. This is one of the most common sources of accidental data loss when doing set operations in production code.

Feature / Aspect	HashSet	LinkedHashSet	TreeSet
Internal structure	HashMap (array of buckets)	LinkedHashMap (buckets + linked list)	Red-Black Tree
Allows duplicates	No	No	No
Allows null	Yes (one null)	Yes (one null)	No — throws NullPointerException
Iteration order	Unpredictable	Insertion order	Sorted (natural or Comparator)
add() / remove() / contains()	O(1) average	O(1) average	O(log n)
Memory overhead	Low	Medium (extra linked list)	High (tree node objects)
Use when you need	Fast uniqueness, no order care	Unique + reproducible order	Unique + sorted + range queries
Implements NavigableSet	No	No	Yes
Thread-safe	No	No	No

🎯 Key Takeaways

HashSet gives you O(1) average-time add(), remove(), and contains() by using a hash table — that's why it beats ArrayList for membership checks at scale, where ArrayList's contains() is O(n).
HashSet is backed by a HashMap — your elements are stored as keys with a dummy constant as the value. Everything you know about how HashMap hashes keys applies directly to HashSet elements.
If you store custom objects in a HashSet, you must override both equals() AND hashCode() using the same fields — skip either one and you get silent duplicate storage, which is one of the hardest bugs to diagnose.
The three Set implementations form a clear hierarchy of trade-offs: HashSet for raw speed with no ordering, LinkedHashSet for insertion-order preservation, and TreeSet for sorted order and range queries — pick based on what your code actually needs, not habit.

⚠ Common Mistakes to Avoid

✕Mistake 1: Forgetting to override hashCode() when you override equals() on a custom class — Symptom: your HashSet silently stores duplicates of your objects because two instances with identical field values land in different buckets and never get compared with equals(). The set grows larger than it should and contains() returns false for things you know you added — Fix: always override both equals() AND hashCode() together, using the same fields in both. Use Objects.hash(field1, field2) for a safe, null-tolerant hashCode() implementation.
✕Mistake 2: Mutating an object's hashCode-contributing field after adding it to a HashSet — Symptom: contains() returns false for an element you're certain is in the set, and remove() fails silently, causing a memory leak as the object is stranded in the wrong bucket forever — Fix: design objects that go into Sets to be effectively immutable, or at minimum ensure the fields used in hashCode() are declared final and set only in the constructor.
✕Mistake 3: Using a HashSet when you need a thread-safe set in a concurrent environment — Symptom: intermittent data corruption, elements mysteriously disappearing, or infinite loops during concurrent iteration — Fix: use Collections.synchronizedSet(new HashSet<>()) for coarse-grained locking, or ConcurrentHashMap.newKeySet() for better concurrent throughput (it's backed by ConcurrentHashMap and avoids the full lock on every operation).

Interview Questions on This Topic

QHashSet is backed by a HashMap internally — can you explain what that means in practice, and what gets stored as the key and value in that underlying map?
QIf two objects have the same hashCode() but equals() returns false, what happens when you try to add both of them to a HashSet? What's this situation called and how does HashSet handle it?
QYou add an object to a HashSet, then modify a field that's included in its hashCode() calculation. You then call contains() with that same object reference — what does it return and why? How would you prevent this bug?

Frequently Asked Questions

Can a Java HashSet contain null values?

Yes, but only one null. The first time you add null, it's stored. Any subsequent attempt to add null is silently ignored just like any other duplicate. Be careful: calling hashCode() on null would throw a NullPointerException, so HashSet has special-case logic that places null directly into bucket 0, bypassing the hashCode() call entirely.

Does HashSet maintain insertion order in Java?

No. HashSet makes no guarantees about iteration order, and the order can change between JVM runs or even after the internal array resizes due to a high load factor. If you need insertion order preserved, use LinkedHashSet instead — it's a drop-in replacement with the same interface but maintains a doubly-linked list across its entries.

What is the difference between HashSet and HashMap in Java?

A HashMap stores key-value pairs where each key is unique. A HashSet stores only values (no pairs) and guarantees uniqueness among them. Internally, HashSet is literally implemented as a HashMap where your set element is the key and a shared dummy Object constant is the value — so every unique element lookup is just a key lookup in the backing map.

🔥

TheCodeForge Editorial Team Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.

About Our Team Editorial Standards

Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged