C++ unordered_map — Bad Hash Silently Kills Server
CPU spikes to 95% and latency to 3s from a single bad hash bucket.
- unordered containers use a hash function to map keys to buckets instantly, giving average O(1) lookup
- The hash output is modded by bucket count to pick a slot; collisions are resolved via linked lists in each bucket
- Performance depends on hash quality: a bad hash clusters keys into few buckets, degrading to O(n)
- reserve(N) preallocates buckets to avoid expensive rehashing during bulk inserts
- Biggest mistake: assuming O(1) always — worst-case O(n) happens with hash flooding or overloaded buckets
If you've ever written a loop to check whether a value exists in a vector, you already know the pain. A vector with a million elements means a million comparisons in the worst case. Real-world software — think spell checkers, URL routers, caches, and word frequency counters — can't afford that. They need to look things up in microseconds, not milliseconds. That's the problem C++'s unordered containers were born to solve.
The standard library gives us std::map and std::set (sorted, tree-based, O(log n)) and their unordered siblings: std::unordered_map and std::unordered_set (hash-based, average O(1)). Most developers know both exist but reach for map by default out of habit. That habit is costing them real performance. Understanding the mechanics behind hash tables — how buckets work, what causes collisions, and when the O(1) promise breaks — is what separates a developer who uses the STL from one who truly commands it.
By the end of this article you'll know exactly how the unordered containers store data under the hood, how to write and plug in a custom hash for your own types, when to choose ordered over unordered and vice versa, and the three mistakes that silently wreck performance or cause subtle bugs in production code. You'll also have concrete answers for the interview questions that trip up even experienced candidates.
How unordered_map and unordered_set Actually Store Data
Both containers are built on a hash table. When you insert a key, the container passes it through a hash function — a deterministic algorithm that converts any value into a large integer. That integer is then mapped to a 'bucket' (think of a bucket as a numbered slot in an array). Lookup works the same way: hash the key, go to that bucket, done.
This is why average lookup is O(1). You don't scan anything — you calculate a position and jump there directly.
The word 'average' matters. Two different keys can hash to the same bucket — that's a collision. The STL resolves collisions with separate chaining: each bucket holds a linked list of all entries that hashed to it. If many keys collide into one bucket (a degenerate worst case or a deliberate hash-flooding attack), lookup degrades to O(n). This is rare in practice with good keys but critical to understand.
unordered_set stores only keys. unordered_map stores key-value pairs, where only the key is hashed and used for lookup. Think of unordered_map as a smarter phone book where you hash someone's name to find their number instantly, rather than flipping through pages.
The load factor (elements / buckets) controls rehashing. When it exceeds a threshold (default 1.0), the table doubles its bucket count and re-inserts everything — an O(n) operation that happens transparently. You can control this with reserve() to avoid it.
Writing a Custom Hash for Your Own Types
The STL ships with hash specializations for built-in types (int, size_t, pointers), strings, and a handful of others. The moment you want to store a struct — say, a Point, a UserId, or a composite cache key — you need to teach the container how to hash it.
There are two approaches: specialize std::hash<YourType> in the std namespace, or write a standalone functor and pass it as a template argument. Prefer the functor approach for types you don't own or control (third-party structs), and std::hash specialization for your own types.
The most important rule for writing a hash function: two equal objects MUST produce the same hash. The converse (two objects with the same hash must be equal) does NOT have to hold — that's just a collision. If you violate the first rule, you'll store an element, then be completely unable to find it again. The bug is silent and infuriating.
For combining multiple fields into one hash, never just XOR them naively — XOR is commutative, meaning hash(a, b) == hash(b, a), which causes unnecessary collisions for symmetric keys. Use the boost::hash_combine pattern (or replicate it yourself) which mixes bits more aggressively.
Ordered vs Unordered — Choosing the Right Tool for the Job
The single biggest decision: do you need your data in order, or do you just need fast lookup? That one question determines which container family to reach for.
std::map and std::set use a self-balancing Red-Black tree. Every operation is O(log n) — guaranteed, not average. They keep keys sorted at all times, which means you can iterate in order, do range queries with lower_bound() and upper_bound(), and find the minimum or maximum key in O(log n).
std::unordered_map and std::unordered_set use a hash table. Average O(1) for insert, lookup, and erase. Worst case O(n) during a rehash or with a pathological key set. No ordering — iteration visits elements in an implementation-defined, unpredictable sequence.
The performance gap is real but context-dependent. For small containers (under ~100 elements), the cache-friendly memory layout of a sorted vector often beats both. For large containers with frequent random lookups and no need for ordering, unordered wins decisively. The comparison table below captures the key dimensions.
One subtle advantage of ordered containers: they work with any type that supports operator< — no hash function required. This makes std::map the path of least resistance for custom types, which is partly why developers default to it. Once you get comfortable writing hash functors, that friction disappears.
Hash Collisions and Performance Degradation — When O(1) Breaks
The O(1) promise depends entirely on the hash function distributing keys uniformly across buckets. In practice, three things break this: poor hash functions, high load factors, and deliberate hash-flooding attacks.
A poor hash for integers could be simply returning key % size. If many keys share the same remainder, they all land in one bucket. That's an O(n) linked-list traversal on every lookup. The STL's default hash for ints is better, but you can still hit trouble with user-provided hashes that don't mix bits well.
High load factor (near 1.0) means few empty buckets—collisions are likely even with a good hash. The max_load_factor defaults to 1.0, meaning the table only rehashes when the number of elements equals the number of buckets. Lowering it to 0.75 or 0.5 reduces collisions at the cost of memory.
Deliberate hash-flooding attacks exploit a known hash function to generate many distinct keys that all hash to the same bucket. This is a real DoS vector for public-facing services that accept user-controlled keys. Modern STL implementations (libstdc++ since GCC 5, libc++) use a per-process randomized seed for string hashing to mitigate this, but custom hash functors may still be vulnerable.
You can inspect collisions directly: for (size_t i = 0; i < buckets; ++i) std::cout << bucket_size(i) << ' '; This prints how many elements are in each bucket. If any bucket has far more than the average, your hash has a problem.
Iterator Invalidation and Thread Safety — The Hidden Traps
Unordered containers have strict iterator invalidation rules that differ from ordered containers. Understanding these is critical in production code.
Insertion invalidates all iterators in a rehash (when the load factor threshold is exceeded). But if no rehash occurs, iterators remain valid. The container rehashes only when max_load_factor is exceeded — unless you call reserve() or rehash() explicitly.
Erase invalidates only the iterator to the erased element. All other iterators remain valid. However, the order of the remaining elements may change (elements after the erased one may shift).
Thread safety: None. Concurrent reads from multiple threads are safe because no data is modified. But any concurrent write (insert, erase, clear) with concurrent reads is a data race. The container is not thread-safe — you must add external synchronization (mutex, reader-writer lock) or use a concurrent container (e.g., Intel TBB's concurrent_hash_map).
A common mistake: iterating over the container while another thread inserts. Even if the other thread doesn't cause rehash, the iteration may read partially-updated bucket pointers. Always protect with a lock.
| Feature / Aspect | std::map / std::set | std::unordered_map / unordered_set |
|---|---|---|
| Underlying structure | Red-Black Tree | Hash Table with chaining |
| Lookup complexity | O(log n) — guaranteed | O(1) average, O(n) worst case |
| Insert complexity | O(log n) — guaranteed | O(1) amortized, O(n) during rehash |
| Iteration order | Ascending by key (always) | Undefined / implementation-specific |
| Range queries (lower_bound) | Yes — O(log n) | No — not supported |
| Custom key requirements | operator< (or custom comparator) | std::hash + operator== (or custom functor) |
| Memory overhead | ~48 bytes per node (pointer-heavy) | Lower per-element, but bucket array overhead |
| Cache friendliness | Poor — pointer chasing across heap | Better — bucket array is contiguous |
| Key modification safety | Safe via const keys in nodes | Unsafe — rehash loses element if key mutated |
| Best use case | Ordered data, range iteration, sorted output | High-frequency lookups, sets, frequency counts |
Key Takeaways
- unordered_map and unordered_set are O(1) average because they jump directly to a bucket via a hash — they never scan. That O(1) becomes O(n) only when hash collisions pile many keys into the same bucket.
- Call reserve(N) before bulk-inserting N elements to avoid mid-run rehashing — this single line can cut insertion time nearly in half for large datasets.
- Your hash function and operator== must be in sync: every field compared in operator== must be mixed into the hash. One mismatched field makes the container silently eat your inserts and return wrong results.
- Reach for std::map when you need sorted iteration, range queries, or guaranteed O(log n) (no worst-case spikes). Reach for std::unordered_map when raw lookup speed is the priority and order doesn't matter.
- Iteration order of unordered containers is undefined and may change between runs or after rehash. Never rely on it for correctness.
- Thread safety is not provided by the standard. Concurrent reads are safe, but any concurrent write with reads is a data race — use external synchronization.
- Inspect
bucket_size()distribution when performance degrades. A cluster in one bucket indicates a poor hash or high load factor.
Common Mistakes to Avoid
- Using operator[] for existence checks in unordered_map
Symptom: Every call to `map[key]` to check if a key exists silently inserts a zero-initialized value, bloating the map with phantom keys. This can cause memory leaks, unintended side effects, and logical errors.
Fix: Useorfind()to check existence. Only usecount()operator[]when you intend insert-or-update behavior. - Writing a hash that violates the equal-implies-same-hash contract
Symptom: You insert a key, but `find()` returns `end()` even though the key exists. This happens when two objects that compare equal produce different hashes (e.g., hashing only one field while comparing all fields). The bug is silent and infuriating.
Fix: Ensure every field used inoperator==also contributes to the hash function. Cross-check both implementations side by side. - Storing pointers as keys without hashing the pointed-to value
Symptom: Two different pointer addresses pointing to identical logical objects are treated as different keys, causing duplicates and logic errors in sets or maps.
Fix: Hash and compare by value, not by pointer. Either dereference inside your hash functor or store the values directly (e.g., string instead of char*). - Modifying a key after insertion without re-inserting
Symptom: Elements become unfindable. The key's hash changes, but its bucket placement doesn't update. This happens when you obtain a mutable reference to a key (e.g., through a loop) and change it.
Fix: Always erase the old key and insert the modified one. Never mutate keys stored in unordered containers. - Relying on iteration order being consistent
Symptom: A code path that depends on a specific iteration order breaks when a rehash occurs or when the same code runs on a different platform. Production data corruption may occur.
Fix: If order matters, collect into a vector and sort, or switch tostd::map. Never assume unordered container iteration order is stable.
Interview Questions on This Topic
- QWhat is the worst-case time complexity of std::unordered_map::find(), and under what real-world conditions could that worst case actually occur in production?SeniorReveal
- QHow would you store a std::pair<int, int> as a key in an unordered_map? Walk me through exactly what you'd write — the hash functor, the equality check, and how you'd pass them to the template.Mid-levelReveal
- QIf you have a std::unordered_map that's performing surprisingly slowly in a benchmark, what are the three things you'd investigate first, and what tools or member functions would you use to diagnose each one?SeniorReveal
- QWhat happens to iterators after calling
erase(iterator)on an unordered_map? Does it invalidate any other iterators?Mid-levelReveal - QExplain the difference between
andreserve()in std::unordered_map. When would you use each?SeniorRevealrehash()
Frequently Asked Questions
Can I use a struct or custom class as a key in std::unordered_map?
Yes, but you must provide two things: a hash function (either by specializing std::hash<YourType> or by writing a functor and passing it as the third template argument) and an equality operator (operator== on your type). Both are required — the hash locates the bucket, and equality resolves collisions within that bucket.
When should I use unordered_set instead of a sorted vector with binary search?
Use unordered_set when you have frequent insertions and deletions mixed with lookups, since maintaining a sorted vector through insertions is O(n) per insert. Use a sorted vector + binary_search when the data is built once and then read many times — it's more cache-friendly and can outperform unordered_set for read-heavy workloads under ~10k elements due to better memory locality.
Why does iterating over unordered_map give elements in a different order every run?
Iteration visits elements bucket by bucket, and the mapping of keys to buckets depends on the hash values and the current bucket count. Both can vary across runs (especially if std::hash for strings is randomized by the implementation as a security measure against hash-flooding). Never rely on unordered container iteration order — if you need order, collect into a vector and sort it, or switch to std::map.
How do I reduce memory usage of an unordered_map after removing many elements?
Unordered containers do not automatically shrink their bucket count after erasures. To reclaim memory, you can either swap with an empty container: std::unordered_map<...> empty; myMap.swap(empty);, or call myMap.rehash(0); which may reduce bucket count to accommodate the current number of elements (this is implementation-defined but works in libstdc++).
What is a hash-flooding attack and how do I protect against it?
A hash-flooding attack exploits a known hash function to generate many distinct keys that all hash to the same bucket, degrading lookups to O(n). To protect: use a randomized hash (e.g., SipHash via std::hash with seed) or implement a custom hash that incorporates a per-process random seed. Avoid custom hash functors that use only public, deterministic transformations for user-supplied keys.
That's STL. Mark it forged?
6 min read · try the examples if you haven't