Group Anagrams — Case Normalization Bug in Production
- The core insight is canonical form: two anagrams must produce the same key — any key strategy that guarantees this turns a pairwise O(n²) problem into a single-pass O(n·k) problem.
- Sorted-key is cleaner and safer for interviews; frequency-array key is faster for long strings — know both and be able to justify which you'd pick given constraints.
- Always use a delimiter when serialising a numeric array into a string key — omitting it causes silent hash collisions that produce wrong answers with no error message.
- Group anagrams by computing a canonical key for each word (sorted chars or frequency array) and bucketing by key in a HashMap.
- Sorted-key: O(k log k) per word; frequency key: O(k) per word using int[26] and a delimiter to avoid count collisions.
- Performance: frequency key wins for long strings; sorted-key is cleaner for interviews and short words.
- Biggest mistake: forgetting case normalization or omitting delimiter in frequency key — both cause silent wrong groupings.
- Production insight: the canonical-hash pattern extends to schema matching, duplicate detection, and bioinformatics clustering.
Quick Debug: Anagram Grouping Gotchas
Two known anagrams are in different groups
System.out.println(buildSortedKey("listen") + " vs " + buildSortedKey("silent"));If keys differ, check toLowerCase() — add word = word.toLowerCase(); at top of method.One group contains obviously unrelated words
Print the key for each word: System.out.println(key(word) + " -> " + word);Look for collisions: e.g., key for 'aa' and 'b' both evaluate to '20'?Large input runs slowly
Run a quick benchmark: time both implementations on a sample of 10,000 words.If using sorted-key, replace with buildFrequencyKey().Production Incident
Production Debug GuideSymptom → Action guide for when your anagram grouping produces wrong results
Anagram grouping shows up more often in the real world than you'd think. Search engines use it to match synonyms and spelling variants. DNA analysis tools cluster gene sequences that are permutations of one another. Even autocorrect relies on letter-frequency fingerprints to suggest words. If you've ever searched 'best pasta recipe' and also gotten results for 'best tapas recipe', you've brushed up against anagram-aware text processing in the wild.
The naive approach — comparing every word against every other word — works, but it scales like a nightmare. With just 10,000 words you're looking at 100 million comparisons. The real problem is that we keep re-doing work we've already done. Hashing solves this by collapsing the expensive comparison into a single, cheap key lookup. Once you reduce 'LISTEN' and 'SILENT' to the same key, grouping them is O(1) — it's just putting a book on the right shelf.
By the end of this article you'll understand two distinct hashing strategies for this problem (sorted key vs. frequency array key), know exactly when to reach for each one, be able to implement both from scratch, and walk into an interview ready to discuss trade-offs, edge cases, and optimisations with confidence.
Building the Mental Model — What Makes a Good Anagram Key?
Before writing a single line of code, let's nail down the insight that makes this problem tractable.
Two words are anagrams if and only if they contain exactly the same characters with exactly the same frequencies. 'RACE' and 'CARE' both have one R, one A, one C, one E. That's it — that's the entire definition.
So a perfect key is any representation that captures character frequencies and nothing else. Order shouldn't matter. Two properties must hold: (1) anagrams must produce the identical key, and (2) non-anagrams must produce different keys. If your key satisfies both, you can use a HashMap to bucket words by key in a single linear pass.
There are two popular ways to build this key:
- Sort the letters alphabetically. 'RACE' → 'ACER'. 'CARE' → 'ACER'. Same key, same bucket.
- Build a 26-element frequency array, then serialise it into a string like '1#0#1#1#1#0#...' (one count per letter a–z).
Both are valid. Sorting costs O(k log k) per word where k is the word length. The frequency array costs O(k) per word. For short words the difference is negligible; for very long strings the frequency array wins. Understanding this trade-off is exactly what separates a good interview answer from a great one.
import java.util.Arrays; public class AnagramKeyDemo { public static String sortedKey(String word) { char[] characters = word.toLowerCase().toCharArray(); Arrays.sort(characters); return new String(characters); } public static String frequencyKey(String word) { int[] letterCounts = new int[26]; for (char ch : word.toLowerCase().toCharArray()) { letterCounts[ch - 'a']++; } StringBuilder keyBuilder = new StringBuilder(); for (int count : letterCounts) { keyBuilder.append(count).append('#'); } return keyBuilder.toString(); } public static void main(String[] args) { String[] testWords = {"listen", "silent", "enlist", "race", "care", "hello"}; System.out.println("Word | Sorted Key | Frequency Key (truncated)"); System.out.println("-----------|-------------|---------------------------"); for (String word : testWords) { String sorted = sortedKey(word); String freq = frequencyKey(word).substring(0, 20) + "..."; System.out.printf("%-10s | %-11s | %s%n", word, sorted, freq); } } }
-----------|-------------|---------------------------
listen | eilnst | 0#0#0#0#1#0#0#0#1#0#0#1#0...
silent | eilnst | 0#0#0#0#1#0#0#0#1#0#0#1#0...
enlist | eilnst | 0#0#0#0#1#0#0#0#1#0#0#1#0...
race | acer | 0#0#1#0#1#0#0#0#0#0#0#0#0...
care | acer | 0#0#1#0#1#0#0#0#0#0#0#0#0...
hello | ehllo | 0#0#0#0#1#0#0#1#0#0#0#2#0...
The Full Solution — Grouping Words With a HashMap in One Pass
Now that we have a reliable key function, the grouping algorithm itself is almost embarrassingly simple. One HashMap, one loop, done.
The map's keys are our anagram fingerprints. The map's values are lists of words that share that fingerprint. We iterate through every word once, compute its key, and append the word to the correct list. At the end, we return all the lists.
Total time complexity: O(n · k log k) using sorted keys, or O(n · k) using frequency keys, where n is the number of words and k is the average word length. Space complexity: O(n · k) to store all words in the map.
Notice we never compare two words against each other. We never nest loops. That's the power of hashing: it transforms a pairwise comparison problem into a single-pass bucketing problem.
One practical thing worth noting — Java's getOrDefault on a HashMap is your friend here. It lets you avoid an explicit null-check before adding to the list, which keeps the core logic clean and readable. Production code is read far more often than it's written, so clarity matters.
import java.util.*; public class GroupAnagrams { public static List<List<String>> groupAnagrams(String[] words) { Map<String, List<String>> anagramBuckets = new HashMap<>(); for (String word : words) { String canonicalKey = buildSortedKey(word); anagramBuckets .computeIfAbsent(canonicalKey, k -> new ArrayList<>()) .add(word); } return new ArrayList<>(anagramBuckets.values()); } private static String buildSortedKey(String word) { char[] chars = word.toCharArray(); Arrays.sort(chars); return new String(chars); } public static List<List<String>> groupAnagramsFast(String[] words) { Map<String, List<String>> anagramBuckets = new HashMap<>(); for (String word : words) { String canonicalKey = buildFrequencyKey(word); anagramBuckets .computeIfAbsent(canonicalKey, k -> new ArrayList<>()) .add(word); } return new ArrayList<>(anagramBuckets.values()); } private static String buildFrequencyKey(String word) { int[] charFrequencies = new int[26]; for (char ch : word.toCharArray()) { charFrequencies[ch - 'a']++; } StringBuilder key = new StringBuilder(); for (int frequency : charFrequencies) { key.append(frequency).append('#'); } return key.toString(); } public static void main(String[] args) { String[] words = {"eat", "tea", "tan", "ate", "nat", "bat"}; System.out.println("Input words: " + Arrays.toString(words)); System.out.println(); List<List<String>> groupedBySortedKey = groupAnagrams(words); System.out.println("Grouped (sorted-key approach):"); for (List<String> group : groupedBySortedKey) { System.out.println(" " + group); } System.out.println(); List<List<String>> groupedByFrequency = groupAnagramsFast(words); System.out.println("Grouped (frequency-key approach):"); for (List<String> group : groupedByFrequency) { System.out.println(" " + group); } String[] singleWord = {"hello"}; List<List<String>> singleResult = groupAnagrams(singleWord); System.out.println(); System.out.println("Single word input: " + singleResult); String[] allAnagrams = {"abc", "bca", "cab", "bac"}; List<List<String>> allSameGroup = groupAnagrams(allAnagrams); System.out.println("All anagrams input: " + allSameGroup); } }
Grouped (sorted-key approach):
[eat, tea, ate]
[tan, nat]
[bat]
Grouped (frequency-key approach):
[eat, tea, ate]
[tan, nat]
[bat]
Single word input: [[hello]]
All anagrams input: [[abc, bca, cab, bac]]
map.computeIfAbsent(key, k -> new ArrayList<>()).add(word) is a single atomic operation in Java's HashMap — it's cleaner, less error-prone, and communicates intent immediately. The older if (map.containsKey(key)) { map.get(key).add(word) } else { ... } pattern is four lines of noise hiding one idea.Real-World Patterns — When Grouping Anagrams Actually Comes Up
This problem isn't just a coding exercise — the underlying pattern (hash by canonical form, group by hash) appears constantly in production software.
Search query normalisation: search engines often cluster queries that are anagram variants or phonetic equivalents so that a single indexed result can serve multiple query forms. The sorted-key trick is a lightweight first-pass filter before heavier NLP kicks in.
Duplicate detection in data pipelines: imagine you're ingesting CSV files from multiple sources. Column headers might arrive in different orders — 'name, age, city' vs 'age, city, name'. Sorting the headers and hashing them is a fast way to detect that two files share the same schema before attempting a merge.
Bioinformatics: DNA subsequences that are permutations of each other sometimes bind to the same protein receptor. Grouping them by character frequency is a preprocessing step in similarity analysis.
The deeper lesson: any time you need to group items that are 'equivalent under reordering', the canonical-key-plus-HashMap pattern is your go-to. You're not solving 'group anagrams' — you're solving 'equivalence class partitioning', which is one of the most reusable ideas in computer science.
import java.util.*; public class SchemaGrouper { public static Map<String, List<String>> groupEquivalentSchemas(String[] schemas) { Map<String, List<String>> equivalenceGroups = new HashMap<>(); for (String schema : schemas) { String[] columns = schema.split(","); Arrays.sort(columns); String canonicalSchema = String.join(",", columns); equivalenceGroups .computeIfAbsent(canonicalSchema, k -> new ArrayList<>()) .add(schema); } return equivalenceGroups; } public static void main(String[] args) { String[] incomingSchemas = { "name,age,city", "age,city,name", "city,name,age", "id,email,phone", "phone,id,email", "name,salary" }; Map<String, List<String>> groups = groupEquivalentSchemas(incomingSchemas); System.out.println("Equivalent schema groups:"); System.out.println("========================="); int groupNumber = 1; for (Map.Entry<String, List<String>> entry : groups.entrySet()) { System.out.println("Group " + groupNumber + " (canonical: " + entry.getKey() + ")"); for (String schema : entry.getValue()) { System.out.println(" -> " + schema); } groupNumber++; } } }
=========================
Group 1 (canonical: age,city,name)
-> name,age,city
-> age,city,name
-> city,name,age
Group 2 (canonical: email,id,phone)
-> id,email,phone
-> phone,id,email
Group 3 (canonical: name,salary)
-> name,salary
Edge Cases and Performance Gotchas
Even with a correct key function, there are edge cases that can trip you up in production or in an interview.
Empty input: an empty string produces an empty key. For sorted-key, an empty char array sorts to empty string. For frequency key, all counts are zero. Both work, but the resulting key is a valid empty string. Your code should handle this without special-casing.
Single-character words: Both keys work fine. For sorted-key, the character remains itself. For frequency key, only one count is non-zero.
Unicode and non-ASCII characters: The frequency-array approach with int[26] only works for lowercase English letters. For Unicode, you'd need a larger array (e.g., int[65536] for full Unicode plane) or a different strategy like counting character code points. The sorted-key approach handles any characters that implement Comparable.
Very large input (millions of words): Memory becomes a concern. The HashMap stores every word reference — space is O(n·k). You can reduce memory by not storing the words themselves but storing indices, or by using a streaming approach if grouping isn't needed all at once.
Punctuation and spaces: The problem usually assumes only letters. If punctuation is present, you must decide whether to include or strip it. Typically you'd filter to letters only first.
import java.util.*; public class EdgeCasesDemo { // Key function that filters non-letters and lowercases public static String robustSortedKey(String word) { StringBuilder lettersOnly = new StringBuilder(); for (char ch : word.toLowerCase().toCharArray()) { if (ch >= 'a' && ch <= 'z') { lettersOnly.append(ch); } } char[] chars = lettersOnly.toString().toCharArray(); Arrays.sort(chars); return new String(chars); } public static void main(String[] args) { String[] tricky = {"", "a", "listen!", "SILENT?"}; for (String w : tricky) { System.out.println("Word: '" + w + "' -> key: '" + robustSortedKey(w) + "'"); } } }
Word: 'a' -> key: 'a'
Word: 'listen!' -> key: 'eilnst'
Word: 'SILENT?' -> key: 'eilnst'
Interview Strategy: Picking the Right Approach
When you see 'group anagrams' in an interview, your instinct should be: canonical key + HashMap. But the interviewer will probe your understanding of trade-offs.
Sorted key is the safest first answer. It's simple to implement, easy to explain, and works for 99% of cases. Start here.
Frequency key is the optimisation follow-up. If the interviewer asks 'can you do better?', say: 'I can avoid sorting by using a fixed-size array for counting, which reduces per-word cost from O(k log k) to O(k).' Then implement it.
Space optimisation might come up: 'Can you reduce memory?' You can use a Trie to store words by their sorted path, or use integer encoding (e.g., prime product) but that's risky due to overflow.
Edge cases to discuss: empty string, single character, all same word, all anagrams, Unicode.
Generalisation: Be prepared to show you've understood the underlying equivalence-class pattern. The interviewer may ask how you'd adapt this to group columns, genes, or phone numbers.
Performance numbers: For n=10^5 and k=10, sorted key takes ~10^5 10 log 10 ≈ 3.3 million operations. Frequency key takes 10^5 10 = 1 million. The difference is measurable but often irrelevant in practice. Mention that context matters.
// Key points to mention in an interview: // 1. Complexity: O(n·k log k) with sorted, O(n·k) with frequency. // 2. Space: O(n·k) for the map. // 3. Edge cases: empty, case, punctuation, Unicode, large input. // 4. Generalisation: canonical-form + HashMap is a pattern. // 5. If memory is constrained, consider using indices instead of strings. // Example: group anagrams returning indices public Map<String, List<Integer>> groupAnagramIndices(String[] words) { Map<String, List<Integer>> groups = new HashMap<>(); for (int i = 0; i < words.length; i++) { String key = buildSortedKey(words[i]); groups.computeIfAbsent(key, k -> new ArrayList<>()).add(i); } return groups; }
- Level 1: Sorted-key HashMap — the clear, correct baseline.
- Level 2: Frequency-array key — the optimisation for long strings.
- Level 3: Equivalence-class pattern — the generalisation to other domains.
| Aspect | Sorted-Key Approach | Frequency-Array Key Approach |
|---|---|---|
| Time per word | O(k log k) — sort dominates | O(k) — single linear scan |
| Key format | Sorted string e.g. 'aert' | Serialised counts e.g. '1#0#1#0#1#...' |
| Key length | Exactly k characters | Always 52 characters (26 counts + 26 '#') |
| Collision risk | None — sort is deterministic | None — if delimiter used correctly |
| Readability | High — key is human-readable | Low — key is opaque to humans |
| Best for | Short words, interview clarity | Long strings, performance-critical paths |
| Unicode support | Works with any sortable chars | Must extend array size beyond 26 for Unicode |
| Implementation complexity | Very low — Arrays.sort + new String | Low — count loop + StringBuilder |
🎯 Key Takeaways
- The core insight is canonical form: two anagrams must produce the same key — any key strategy that guarantees this turns a pairwise O(n²) problem into a single-pass O(n·k) problem.
- Sorted-key is cleaner and safer for interviews; frequency-array key is faster for long strings — know both and be able to justify which you'd pick given constraints.
- Always use a delimiter when serialising a numeric array into a string key — omitting it causes silent hash collisions that produce wrong answers with no error message.
- The 'hash by canonical form, group by hash' pattern generalises far beyond anagrams — it solves any equivalence-class partitioning problem, including schema matching, duplicate detection, and bioinformatics clustering.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QWalk me through the time and space complexity of your solution — and how would it change if the input words could be up to 10,000 characters long instead of 20?SeniorReveal
- QYour current solution uses sorted characters as the key. Can you think of an alternative key that avoids sorting entirely and achieves O(n·k) overall? What are the trade-offs?Mid-levelReveal
- QWhat happens to your algorithm if the input contains Unicode characters like accented letters or emojis? How would you adapt the frequency-array approach to handle them?SeniorReveal
Frequently Asked Questions
What is the most efficient way to group anagrams in Java?
The most efficient approach is to use a HashMap where each key is a character-frequency fingerprint of the word (built in O(k) time) and each value is a list of words sharing that fingerprint. This gives O(n·k) total time complexity and O(n·k) space, which beats the sorted-key approach's O(n·k log k) on long strings.
Why do we sort the characters to create the anagram key?
Sorting removes the order of characters, leaving only their identity — which is exactly what two anagrams share. When you sort 'listen' and 'silent' you get 'eilnst' both times. That shared sorted form becomes the HashMap key, so both words land in the same bucket automatically.
Does the group anagrams algorithm work for words with repeated characters?
Yes, and both key strategies handle repeats correctly. The sorted-key approach keeps all duplicates in place — 'aab' sorts to 'aab', never 'ab'. The frequency-array approach explicitly counts each character, so two 'a's give a count of 2 in slot 0. There's no special-casing needed for repeated characters.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.