Senior 3 min · June 25, 2026

Inverted Index and Text Search: Build a Production Search Engine Without the Pain

Inverted index and text search explained with production patterns.

N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

Follow
Production
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer

An inverted index stores for each word a sorted list of document IDs where that word appears. To search, you intersect the postings lists of query terms. This gives O(1) lookup per term instead of O(n) scan.

✦ Definition~90s read
What is Inverted Index and Text Search?

An inverted index is a data structure mapping each unique word (token) to the list of documents containing it. It enables fast full-text search by looking up the token's postings list instead of scanning every document.

Think of a cookbook's index at the back.
Plain-English First

Think of a cookbook's index at the back. Instead of flipping through every page to find 'chicken', you look up 'chicken' in the index and get page numbers 23, 45, 102. The inverted index is that index: a dictionary where each word points to the list of documents (pages) containing it.

Your search is slow. Not 'a bit sluggish' — slow enough that users refresh and your database connection pool drowns. You're doing WHERE text LIKE '%keyword%' on a table with millions of rows. That's a full table scan. Every. Single. Query. I've seen this bring down a content management system at 2am when a bot hit the search endpoint with 50 concurrent requests. The fix? An inverted index. It's the core of Elasticsearch, Lucene, and every decent search engine. By the end of this, you'll build one from scratch, understand when it's overkill, and never write LIKE '%...%' for search again.

Why Your LIKE Query Is a Crime Against Performance

Every time you write WHERE text LIKE '%search%', the database scans every row. That's O(n) per query. With an inverted index, you precompute the mapping from words to documents. Search becomes O(1) per term plus intersection cost. The trade-off: index build time and storage. But for read-heavy search workloads, it's the only sane choice. Here's how a checkout service handles product search: build the index offline, serve queries from memory-mapped files.

InvertedIndexBuilder.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
// io.thecodeforge — System Design tutorial

import java.io.*;
import java.util.*;

public class InvertedIndexBuilder {
    // Builds inverted index from a list of documents
    public static Map<String, List<Integer>> buildIndex(List<String> documents) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            String text = documents.get(docId).toLowerCase();
            String[] tokens = text.split("\\W+"); // split on non-word chars
            for (String token : tokens) {
                if (token.isEmpty()) continue;
                index.computeIfAbsent(token, k -> new ArrayList<>()).add(docId);
            }
        }
        // Sort postings lists for binary search intersection
        for (List<Integer> postings : index.values()) {
            Collections.sort(postings);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The quick brown fox",
            "Jumped over the lazy dog",
            "Quick brown dog jumps"
        );
        Map<String, List<Integer>> index = buildIndex(docs);
        System.out.println(index);
    }
}
Output
{brown=[0, 2], dog=[1, 2], fox=[0], jumped=[1], jumps=[2], lazy=[1], over=[1], quick=[0, 2], the=[0, 1]}
Production Trap: Unsorted Postings Lists
If you don't sort postings lists, intersection becomes O(n*m). Sorted lists allow merge-join intersection in O(n+m). Always sort after building.
Inverted Index & Text Search Pipeline THECODEFORGE.IO Inverted Index & Text Search Pipeline From raw text to fast production search without LIKE queries Tokenization & Normalization Split text into tokens; lowercase, stem, remove stop words Build Inverted Index Map each token to sorted list of document IDs Intersect Postings Lists Merge sorted lists for multi-term queries Phrase Query Handling Check word order using positional offsets Scalable Index Storage Use disk-based or distributed index when memory limited ⚠ LIKE queries scan full table — O(n) per search Always use inverted index for text search at scale THECODEFORGE.IO
thecodeforge.io
Inverted Index & Text Search Pipeline
Inverted Index Text Search
Inverted Index Build PipelineTHECODEFORGE.IOInverted Index Build PipelineFrom raw documents to searchable indexRaw DocsUnstructured text corpusTokenizeSplit on non-word chars, lowercaseBuild MapTerm → sorted doc IDsStore IndexIn memory or disk segmentsQueryO(1) per term + merge⚠ Index build is O(N) once; each LIKE query is O(N) every timeTHECODEFORGE.IO
thecodeforge.io
Inverted Index Build Pipeline
Inverted Index Text Search

Tokenization: The Silent Killer of Search Relevance

Tokenization is how you split text into tokens. The classic rookie mistake: splitting on whitespace only. That leaves punctuation attached: 'hello,' vs 'hello' are different tokens. Always lowercase and split on non-word characters. For production, consider stemming (e.g., 'running' -> 'run') and stop word removal ('the', 'a', 'and'). But be careful: stop word removal can break phrase queries. I've seen a legal document search fail because 'not' was removed as a stop word, flipping the meaning of clauses.

Tokenizer.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
// io.thecodeforge — System Design tutorial

import java.util.*;
import java.util.regex.Pattern;

public class Tokenizer {
    private static final Pattern WORD_PATTERN = Pattern.compile("\\w+");
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "by", "with");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        java.util.regex.Matcher matcher = WORD_PATTERN.matcher(text.toLowerCase());
        while (matcher.find()) {
            String token = matcher.group();
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The quick brown fox jumps over the lazy dog!"));
    }
}
Output
[quick, brown, fox, jumps, over, lazy, dog]
Senior Shortcut: Stemming vs Lemmatization
Stemming (Porter stemmer) is faster but cruder ('running' -> 'run', 'ran' -> 'ran'). Lemmatization uses vocabulary to return base form ('ran' -> 'run'). For search, stemming is usually sufficient. Use Lucene's EnglishAnalyzer if you need production-grade.

Given two query terms, you need documents containing both. With sorted postings lists, use a two-pointer merge. For multiple terms, intersect pairwise. This is O(n+m) per pair. For large indexes, use skip pointers (store offsets to skip ahead) to reduce comparisons. Here's the merge intersection for two sorted lists.

Intersection.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
// io.thecodeforge — System Design tutorial

import java.util.*;

public class Intersection {
    // Returns intersection of two sorted lists
    public static List<Integer> intersect(List<Integer> list1, List<Integer> list2) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < list1.size() && j < list2.size()) {
            int cmp = list1.get(i).compareTo(list2.get(j));
            if (cmp == 0) {
                result.add(list1.get(i));
                i++; j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(0, 2, 5, 7);
        List<Integer> b = Arrays.asList(1, 2, 5, 8);
        System.out.println(intersect(a, b)); // [2, 5]
    }
}
Output
[2, 5]
Interview Gold: Skip Pointers
For very long postings lists, add skip pointers at intervals (e.g., every sqrt(n) entries). During intersection, you can skip over blocks that can't match. This reduces comparisons from O(n+m) to O(sqrt(n)+sqrt(m)) in practice.

Phrase Queries: When Word Order Matters

Phrase queries (e.g., "quick brown fox") require documents where the terms appear consecutively in order. Store positions (term offsets) in the postings list. Then check if positions of consecutive terms differ by 1. This adds storage but enables phrase search. Here's a position-aware index.

PositionalIndex.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
// io.thecodeforge — System Design tutorial

import java.util.*;

public class PositionalIndex {
    // Map token -> (docId, positions)
    private Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < tokens.length; pos++) {
            String token = tokens[pos];
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, k -> new HashMap<>())
                 .computeIfAbsent(docId, k -> new ArrayList<>())
                 .add(pos);
        }
    }

    public List<Integer> phraseQuery(String phrase) {
        String[] terms = phrase.toLowerCase().split("\\s+");
        if (terms.length == 0) return Collections.emptyList();
        // Start with docs containing first term
        Map<Integer, List<Integer>> firstPostings = index.getOrDefault(terms[0], Collections.emptyMap());
        Set<Integer> candidateDocs = new HashSet<>(firstPostings.keySet());
        for (int t = 1; t < terms.length; t++) {
            Map<Integer, List<Integer>> termPostings = index.getOrDefault(terms[t], Collections.emptyMap());
            Set<Integer> nextCandidates = new HashSet<>();
            for (int docId : candidateDocs) {
                List<Integer> prevPositions = firstPostings.get(docId);
                List<Integer> currPositions = termPostings.get(docId);
                if (prevPositions == null || currPositions == null) continue;
                // Check if any position in prevPositions is exactly one less than a position in currPositions
                for (int p : prevPositions) {
                    if (currPositions.contains(p + 1)) {
                        nextCandidates.add(docId);
                        break;
                    }
                }
            }
            candidateDocs = nextCandidates;
            firstPostings = termPostings; // for next iteration
        }
        return new ArrayList<>(candidateDocs);
    }

    public static void main(String[] args) {
        PositionalIndex idx = new PositionalIndex();
        idx.addDocument(0, "quick brown fox");
        idx.addDocument(1, "brown fox quick");
        System.out.println(idx.phraseQuery("quick brown")); // [0]
        System.out.println(idx.phraseQuery("brown fox"));   // [0, 1]
    }
}
Output
[0]
[0, 1]
Never Do This: Storing Positions as Objects
Positions per document can be large. Store them as primitive int arrays or use compressed bitmaps (e.g., Roaring Bitmaps). Avoid List<Integer> for each doc — memory overhead kills you at scale.

Scaling the Index: When Memory Isn't Enough

For millions of documents, an in-memory HashMap won't fit. Use disk-based indexes with block compression. Common approach: store index in multiple segments (like Lucene). Each segment is a sorted list of (term, postings) pairs. Merge segments periodically. For production, use memory-mapped files for fast access. Here's a simple disk-based index using RandomAccessFile.

DiskInvertedIndex.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
// io.thecodeforge — System Design tutorial

import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.util.*;

public class DiskInvertedIndex {
    private RandomAccessFile file;
    private FileChannel channel;
    private Map<String, Long> termOffset = new HashMap<>(); // term -> file offset

    public DiskInvertedIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
        channel = file.getChannel();
    }

    public void writeIndex(Map<String, List<Integer>> index) throws IOException {
        for (Map.Entry<String, List<Integer>> entry : index.entrySet()) {
            termOffset.put(entry.getKey(), file.getFilePointer());
            // Write term length, term, postings count, postings
            byte[] termBytes = entry.getKey().getBytes();
            file.writeInt(termBytes.length);
            file.write(termBytes);
            List<Integer> postings = entry.getValue();
            file.writeInt(postings.size());
            for (int docId : postings) {
                file.writeInt(docId);
            }
        }
    }

    public List<Integer> getPostings(String term) throws IOException {
        Long offset = termOffset.get(term);
        if (offset == null) return Collections.emptyList();
        file.seek(offset);
        int termLen = file.readInt();
        byte[] termBytes = new byte[termLen];
        file.read(termBytes);
        int count = file.readInt();
        List<Integer> postings = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            postings.add(file.readInt());
        }
        return postings;
    }

    public void close() throws IOException {
        channel.close();
        file.close();
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<Integer>> index = Map.of(
            "hello", List.of(0, 2, 5),
            "world", List.of(1, 3)
        );
        DiskInvertedIndex diskIdx = new DiskInvertedIndex("index.dat");
        diskIdx.writeIndex(index);
        System.out.println(diskIdx.getPostings("hello")); // [0, 2, 5]
        diskIdx.close();
    }
}
Output
[0, 2, 5]
Production Trap: File Channel Not Flushed
Always call channel.force(true) after writing to ensure data is on disk. Otherwise, a crash can corrupt the index. I've seen this lose hours of indexing work.

When Not to Use an Inverted Index

Inverted indexes are great for keyword search, but terrible for fuzzy search, regex, or numerical range queries. If you need substring matching (e.g., '%abc%'), use a trie or suffix array. For geospatial, use R-tree. For small datasets (<10k documents), a simple scan with LIKE might be faster due to index overhead. Also, inverted indexes don't handle synonyms well — you need a thesaurus or embedding-based search for that.

Senior Shortcut: Hybrid Approach
Use inverted index for keyword search, and fall back to vector search (e.g., FAISS) for semantic search. Many production systems combine both: retrieve candidates via inverted index, then re-rank with embeddings.
Inverted Index vs. LIKE ScanTHECODEFORGE.IOInverted Index vs. LIKE ScanWhy precomputation wins for keyword searchInverted IndexO(1) per term lookupSupports AND/OR/phrase queriesHigh build cost, large storageBest for keyword searchLIKE '%term%'O(N) full table scanNo multi-term optimizationZero build cost, no storageBest for small datasetsUse LIKE only when dataset <10k docs or substring matching neededTHECODEFORGE.IO
thecodeforge.io
Inverted Index vs. LIKE Scan
Inverted Index Text Search
● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom
A search service on a 4GB container kept OOM-killing itself every 30 minutes under moderate load.
Assumption
Team assumed it was a memory leak in the search library.
Root cause
The inverted index was built in-memory as a HashMap<String, List<Integer>> for 10 million documents. The postings lists were unsorted, so intersection required O(n*m) scanning. Memory usage exploded because each postings list was stored as an ArrayList with overhead.
Fix
Switched to sorted postings lists stored as int[] arrays on disk with memory-mapped files. Intersection uses binary search. Memory dropped to 500MB.
Key lesson
  • Never store postings lists as objects in a hash map for large indexes.
  • Use sorted arrays and memory-mapped I/O.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
Search returns no results for a term that should exist
Fix
1. Check tokenization: run the document text through your tokenizer and verify the term appears. 2. Check case: ensure both index and query use same case normalization. 3. Check stop words: is the term being removed? 4. Verify index build: did the document get indexed at all?
Symptom · 02
Search returns too many irrelevant results
Fix
1. Check if you're using phrase query when you need exact match. 2. Review tokenization: are you splitting on punctuation correctly? 3. Consider adding TF-IDF scoring to rank results. 4. Check for synonym expansion causing noise.
Symptom · 03
Index build takes too long or runs out of memory
Fix
1. Check if you're storing positions unnecessarily. 2. Use disk-based index for large datasets. 3. Batch documents and merge segments. 4. Increase JVM heap or use memory-mapped files.
★ Inverted Index and Text Search Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Search returns empty for known term
Immediate action
Check if term is in index
Commands
grep -c '\bterm\b' index.dat
java -jar IndexInspector.jar --term term
Fix now
Rebuild index with correct tokenizer
OutOfMemoryError during index build+
Immediate action
Check heap usage
Commands
jstat -gc <pid>
jmap -heap <pid>
Fix now
Increase -Xmx or switch to disk-based index
Phrase query returns wrong documents+
Immediate action
Check positions
Commands
java -jar IndexInspector.jar --positions term
Verify token positions in source
Fix now
Rebuild index with correct position assignment
Slow search queries (>1s)+
Immediate action
Check if postings lists are sorted
Commands
java -jar IndexInspector.jar --check-sorted
Profile intersection code
Fix now
Sort postings lists and add skip pointers
Feature / AspectInverted IndexLIKE Query
Search speedO(1) per term + intersectionO(n) full table scan
Phrase searchSupported with positionsSupported with % but slow
Index build timeO(total tokens)None
Storage overheadHigh (postings lists)None
Fuzzy searchNot nativeNot native
ScalabilityMillions of docsThousands of docs

Key takeaways

1
Inverted indexes turn O(n) text search into O(1) term lookup
essential for any system with more than 10k documents.
2
Always sort postings lists after building
unsorted lists make intersection O(n*m) instead of O(n+m).
3
Tokenization is the most common source of search bugs
lowercase, split on non-word chars, and be careful with stop words.
4
For large indexes, use disk-based storage with memory-mapped files and segment merging
never store everything in a single HashMap.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does an inverted index handle concurrent updates while serving queri...
Q02SENIOR
When would you choose an inverted index over a B-tree for text search?
Q03SENIOR
What happens when you search for a term that appears in 90% of documents...
Q04JUNIOR
What is an inverted index?
Q05SENIOR
You notice search returns documents that don't contain the query term. W...
Q06SENIOR
Design a search system for a news website with 10 million articles. How ...
Q01 of 06SENIOR

How does an inverted index handle concurrent updates while serving queries?

ANSWER
Use a copy-on-write approach: build a new segment in memory, then atomically swap the segment list. Queries see the old segments until swap completes. This avoids locks and ensures consistency.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is an inverted index in simple terms?
02
What's the difference between inverted index and forward index?
03
How do I implement an inverted index in Java?
04
How does Elasticsearch use inverted indexes?
N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

Follow
Verified
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
🔥

That's Async & Data Processing. Mark it forged?

3 min read · try the examples if you haven't

Previous
Stream Processing
5 / 7 · Async & Data Processing
Next
Backpressure