An inverted index stores for each word a sorted list of document IDs where that word appears. To search, you intersect the postings lists of query terms. This gives O(1) lookup per term instead of O(n) scan.
✦ Definition~90s read
What is Inverted Index and Text Search?
An inverted index is a data structure mapping each unique word (token) to the list of documents containing it. It enables fast full-text search by looking up the token's postings list instead of scanning every document.
★
Think of a cookbook's index at the back.
Plain-English First
Think of a cookbook's index at the back. Instead of flipping through every page to find 'chicken', you look up 'chicken' in the index and get page numbers 23, 45, 102. The inverted index is that index: a dictionary where each word points to the list of documents (pages) containing it.
Your search is slow. Not 'a bit sluggish' — slow enough that users refresh and your database connection pool drowns. You're doing WHERE text LIKE '%keyword%' on a table with millions of rows. That's a full table scan. Every. Single. Query. I've seen this bring down a content management system at 2am when a bot hit the search endpoint with 50 concurrent requests. The fix? An inverted index. It's the core of Elasticsearch, Lucene, and every decent search engine. By the end of this, you'll build one from scratch, understand when it's overkill, and never write LIKE '%...%' for search again.
Why Your LIKE Query Is a Crime Against Performance
Every time you write WHERE text LIKE '%search%', the database scans every row. That's O(n) per query. With an inverted index, you precompute the mapping from words to documents. Search becomes O(1) per term plus intersection cost. The trade-off: index build time and storage. But for read-heavy search workloads, it's the only sane choice. Here's how a checkout service handles product search: build the index offline, serve queries from memory-mapped files.
InvertedIndexBuilder.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
// io.thecodeforge — System Design tutorialimport java.io.*;
import java.util.*;
publicclassInvertedIndexBuilder {
// Builds inverted index from a list of documentspublicstaticMap<String, List<Integer>> buildIndex(List<String> documents) {
Map<String, List<Integer>> index = newHashMap<>();
for (int docId = 0; docId < documents.size(); docId++) {
String text = documents.get(docId).toLowerCase();
String[] tokens = text.split("\\W+"); // split on non-word charsfor (String token : tokens) {
if (token.isEmpty()) continue;
index.computeIfAbsent(token, k -> newArrayList<>()).add(docId);
}
}
// Sort postings lists for binary search intersectionfor (List<Integer> postings : index.values()) {
Collections.sort(postings);
}
return index;
}
publicstaticvoidmain(String[] args) {
List<String> docs = Arrays.asList(
"The quick brown fox",
"Jumped over the lazy dog",
"Quick brown dog jumps"
);
Map<String, List<Integer>> index = buildIndex(docs);
System.out.println(index);
}
}
If you don't sort postings lists, intersection becomes O(n*m). Sorted lists allow merge-join intersection in O(n+m). Always sort after building.
thecodeforge.io
Inverted Index & Text Search Pipeline
Inverted Index Text Search
thecodeforge.io
Inverted Index Build Pipeline
Inverted Index Text Search
Tokenization: The Silent Killer of Search Relevance
Tokenization is how you split text into tokens. The classic rookie mistake: splitting on whitespace only. That leaves punctuation attached: 'hello,' vs 'hello' are different tokens. Always lowercase and split on non-word characters. For production, consider stemming (e.g., 'running' -> 'run') and stop word removal ('the', 'a', 'and'). But be careful: stop word removal can break phrase queries. I've seen a legal document search fail because 'not' was removed as a stop word, flipping the meaning of clauses.
Stemming (Porter stemmer) is faster but cruder ('running' -> 'run', 'ran' -> 'ran'). Lemmatization uses vocabulary to return base form ('ran' -> 'run'). For search, stemming is usually sufficient. Use Lucene's EnglishAnalyzer if you need production-grade.
Intersecting Postings Lists: The Heart of Search
Given two query terms, you need documents containing both. With sorted postings lists, use a two-pointer merge. For multiple terms, intersect pairwise. This is O(n+m) per pair. For large indexes, use skip pointers (store offsets to skip ahead) to reduce comparisons. Here's the merge intersection for two sorted lists.
Intersection.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
// io.thecodeforge — System Design tutorialimport java.util.*;
publicclassIntersection {
// Returns intersection of two sorted listspublicstaticList<Integer> intersect(List<Integer> list1, List<Integer> list2) {
List<Integer> result = newArrayList<>();
int i = 0, j = 0;
while (i < list1.size() && j < list2.size()) {
int cmp = list1.get(i).compareTo(list2.get(j));
if (cmp == 0) {
result.add(list1.get(i));
i++; j++;
} elseif (cmp < 0) {
i++;
} else {
j++;
}
}
return result;
}
publicstaticvoidmain(String[] args) {
List<Integer> a = Arrays.asList(0, 2, 5, 7);
List<Integer> b = Arrays.asList(1, 2, 5, 8);
System.out.println(intersect(a, b)); // [2, 5]
}
}
Output
[2, 5]
Interview Gold: Skip Pointers
For very long postings lists, add skip pointers at intervals (e.g., every sqrt(n) entries). During intersection, you can skip over blocks that can't match. This reduces comparisons from O(n+m) to O(sqrt(n)+sqrt(m)) in practice.
Phrase Queries: When Word Order Matters
Phrase queries (e.g., "quick brown fox") require documents where the terms appear consecutively in order. Store positions (term offsets) in the postings list. Then check if positions of consecutive terms differ by 1. This adds storage but enables phrase search. Here's a position-aware index.
PositionalIndex.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
// io.thecodeforge — System Design tutorialimport java.util.*;
publicclassPositionalIndex {
// Map token -> (docId, positions)privateMap<String, Map<Integer, List<Integer>>> index = newHashMap<>();
publicvoidaddDocument(int docId, String text) {
String[] tokens = text.toLowerCase().split("\\W+");
for (int pos = 0; pos < tokens.length; pos++) {
String token = tokens[pos];
if (token.isEmpty()) continue;
index.computeIfAbsent(token, k -> newHashMap<>())
.computeIfAbsent(docId, k -> newArrayList<>())
.add(pos);
}
}
publicList<Integer> phraseQuery(String phrase) {
String[] terms = phrase.toLowerCase().split("\\s+");
if (terms.length == 0) returnCollections.emptyList();
// Start with docs containing first termMap<Integer, List<Integer>> firstPostings = index.getOrDefault(terms[0], Collections.emptyMap());
Set<Integer> candidateDocs = newHashSet<>(firstPostings.keySet());
for (int t = 1; t < terms.length; t++) {
Map<Integer, List<Integer>> termPostings = index.getOrDefault(terms[t], Collections.emptyMap());
Set<Integer> nextCandidates = newHashSet<>();
for (int docId : candidateDocs) {
List<Integer> prevPositions = firstPostings.get(docId);
List<Integer> currPositions = termPostings.get(docId);
if (prevPositions == null || currPositions == null) continue;
// Check if any position in prevPositions is exactly one less than a position in currPositionsfor (int p : prevPositions) {
if (currPositions.contains(p + 1)) {
nextCandidates.add(docId);
break;
}
}
}
candidateDocs = nextCandidates;
firstPostings = termPostings; // for next iteration
}
returnnewArrayList<>(candidateDocs);
}
publicstaticvoidmain(String[] args) {
PositionalIndex idx = newPositionalIndex();
idx.addDocument(0, "quick brown fox");
idx.addDocument(1, "brown fox quick");
System.out.println(idx.phraseQuery("quick brown")); // [0]System.out.println(idx.phraseQuery("brown fox")); // [0, 1]
}
}
Output
[0]
[0, 1]
Never Do This: Storing Positions as Objects
Positions per document can be large. Store them as primitive int arrays or use compressed bitmaps (e.g., Roaring Bitmaps). Avoid List<Integer> for each doc — memory overhead kills you at scale.
Scaling the Index: When Memory Isn't Enough
For millions of documents, an in-memory HashMap won't fit. Use disk-based indexes with block compression. Common approach: store index in multiple segments (like Lucene). Each segment is a sorted list of (term, postings) pairs. Merge segments periodically. For production, use memory-mapped files for fast access. Here's a simple disk-based index using RandomAccessFile.
Always call channel.force(true) after writing to ensure data is on disk. Otherwise, a crash can corrupt the index. I've seen this lose hours of indexing work.
When Not to Use an Inverted Index
Inverted indexes are great for keyword search, but terrible for fuzzy search, regex, or numerical range queries. If you need substring matching (e.g., '%abc%'), use a trie or suffix array. For geospatial, use R-tree. For small datasets (<10k documents), a simple scan with LIKE might be faster due to index overhead. Also, inverted indexes don't handle synonyms well — you need a thesaurus or embedding-based search for that.
Senior Shortcut: Hybrid Approach
Use inverted index for keyword search, and fall back to vector search (e.g., FAISS) for semantic search. Many production systems combine both: retrieve candidates via inverted index, then re-rank with embeddings.
thecodeforge.io
Inverted Index vs. LIKE Scan
Inverted Index Text Search
● Production incidentPOST-MORTEMseverity: high
The 4GB Container That Kept Dying
Symptom
A search service on a 4GB container kept OOM-killing itself every 30 minutes under moderate load.
Assumption
Team assumed it was a memory leak in the search library.
Root cause
The inverted index was built in-memory as a HashMap<String, List<Integer>> for 10 million documents. The postings lists were unsorted, so intersection required O(n*m) scanning. Memory usage exploded because each postings list was stored as an ArrayList with overhead.
Fix
Switched to sorted postings lists stored as int[] arrays on disk with memory-mapped files. Intersection uses binary search. Memory dropped to 500MB.
Key lesson
Never store postings lists as objects in a hash map for large indexes.
Use sorted arrays and memory-mapped I/O.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
Search returns no results for a term that should exist
→
Fix
1. Check tokenization: run the document text through your tokenizer and verify the term appears. 2. Check case: ensure both index and query use same case normalization. 3. Check stop words: is the term being removed? 4. Verify index build: did the document get indexed at all?
Symptom · 02
Search returns too many irrelevant results
→
Fix
1. Check if you're using phrase query when you need exact match. 2. Review tokenization: are you splitting on punctuation correctly? 3. Consider adding TF-IDF scoring to rank results. 4. Check for synonym expansion causing noise.
Symptom · 03
Index build takes too long or runs out of memory
→
Fix
1. Check if you're storing positions unnecessarily. 2. Use disk-based index for large datasets. 3. Batch documents and merge segments. 4. Increase JVM heap or use memory-mapped files.
★ Inverted Index and Text Search Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Search returns empty for known term−
Immediate action
Check if term is in index
Commands
grep -c '\bterm\b' index.dat
java -jar IndexInspector.jar --term term
Fix now
Rebuild index with correct tokenizer
OutOfMemoryError during index build+
Immediate action
Check heap usage
Commands
jstat -gc <pid>
jmap -heap <pid>
Fix now
Increase -Xmx or switch to disk-based index
Phrase query returns wrong documents+
Immediate action
Check positions
Commands
java -jar IndexInspector.jar --positions term
Verify token positions in source
Fix now
Rebuild index with correct position assignment
Slow search queries (>1s)+
Immediate action
Check if postings lists are sorted
Commands
java -jar IndexInspector.jar --check-sorted
Profile intersection code
Fix now
Sort postings lists and add skip pointers
Feature / Aspect
Inverted Index
LIKE Query
Search speed
O(1) per term + intersection
O(n) full table scan
Phrase search
Supported with positions
Supported with % but slow
Index build time
O(total tokens)
None
Storage overhead
High (postings lists)
None
Fuzzy search
Not native
Not native
Scalability
Millions of docs
Thousands of docs
Key takeaways
1
Inverted indexes turn O(n) text search into O(1) term lookup
essential for any system with more than 10k documents.
2
Always sort postings lists after building
unsorted lists make intersection O(n*m) instead of O(n+m).
3
Tokenization is the most common source of search bugs
lowercase, split on non-word chars, and be careful with stop words.
4
For large indexes, use disk-based storage with memory-mapped files and segment merging
never store everything in a single HashMap.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01SENIOR
How does an inverted index handle concurrent updates while serving queri...
Q02SENIOR
When would you choose an inverted index over a B-tree for text search?
Q03SENIOR
What happens when you search for a term that appears in 90% of documents...
Q04JUNIOR
What is an inverted index?
Q05SENIOR
You notice search returns documents that don't contain the query term. W...
Q06SENIOR
Design a search system for a news website with 10 million articles. How ...
Q01 of 06SENIOR
How does an inverted index handle concurrent updates while serving queries?
ANSWER
Use a copy-on-write approach: build a new segment in memory, then atomically swap the segment list. Queries see the old segments until swap completes. This avoids locks and ensures consistency.
Q02 of 06SENIOR
When would you choose an inverted index over a B-tree for text search?
ANSWER
Inverted index is better for full-text search with multiple terms and ranking. B-tree is better for exact match or prefix search on a single column. For example, search 'quick brown fox' across millions of documents: inverted index wins. For 'SELECT * WHERE title = 'quick'': B-tree wins.
Q03 of 06SENIOR
What happens when you search for a term that appears in 90% of documents? How do you mitigate?
ANSWER
The postings list is huge, making intersection expensive. Mitigation: store document frequency and skip high-frequency terms in query optimization. Or use a 'required' flag to force inclusion only if beneficial.
Q04 of 06JUNIOR
What is an inverted index?
ANSWER
A data structure mapping each word to the list of documents containing it, enabling fast full-text search.
Q05 of 06SENIOR
You notice search returns documents that don't contain the query term. What's the likely cause?
ANSWER
Synonym expansion or stemming over-applied. For example, stemming 'running' to 'run' is fine, but if 'ran' also stems to 'run', you might get false positives. Check your analyzer configuration.
Q06 of 06SENIOR
Design a search system for a news website with 10 million articles. How would you build and serve the inverted index?
ANSWER
Use a distributed index (shard by document ID). Build index offline using MapReduce or Spark. Serve from memory-mapped files on multiple nodes. Use a coordinator to merge results. For ranking, use TF-IDF or BM25. Cache popular queries.
01
How does an inverted index handle concurrent updates while serving queries?
SENIOR
02
When would you choose an inverted index over a B-tree for text search?
SENIOR
03
What happens when you search for a term that appears in 90% of documents? How do you mitigate?
SENIOR
04
What is an inverted index?
JUNIOR
05
You notice search returns documents that don't contain the query term. What's the likely cause?
SENIOR
06
Design a search system for a news website with 10 million articles. How would you build and serve the inverted index?
SENIOR
FAQ · 4 QUESTIONS
Frequently Asked Questions
01
What is an inverted index in simple terms?
It's like a book index: a list of words, each pointing to the pages (documents) where that word appears. Instead of scanning every page, you look up the word and get the pages instantly.
Was this helpful?
02
What's the difference between inverted index and forward index?
A forward index maps documents to their words. An inverted index maps words to documents. For search, inverted is faster because you look up the word, not the document.
Was this helpful?
03
How do I implement an inverted index in Java?
Use a HashMap<String, List<Integer>> where key is the token and value is sorted list of document IDs. Tokenize by lowercasing and splitting on non-word characters. Sort postings lists after building.
Was this helpful?
04
How does Elasticsearch use inverted indexes?
Elasticsearch (Lucene) builds an inverted index per shard. It stores term frequency, positions, and offsets for phrase queries. It uses segment merging and memory-mapped files for performance.