Senior 3 min · June 25, 2026

Inverted Index and Text Search: Build a Production Search Engine Without the Pain

Q: What is an inverted index in simple terms?

It's like a book index: a list of words, each pointing to the pages (documents) where that word appears. Instead of scanning every page, you look up the word and get the pages instantly.

Q: What's the difference between inverted index and forward index?

A forward index maps documents to their words. An inverted index maps words to documents. For search, inverted is faster because you look up the word, not the document.

Q: How do I implement an inverted index in Java?

Use a `HashMap >` where key is the token and value is sorted list of document IDs. Tokenize by lowercasing and splitting on non-word characters. Sort postings lists after building.

Q: How does Elasticsearch use inverted indexes?

Elasticsearch (Lucene) builds an inverted index per shard. It stores term frequency, positions, and offsets for phrase queries. It uses segment merging and memory-mapped files for performance.

Inverted index and text search explained with production patterns.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

An inverted index stores for each word a sorted list of document IDs where that word appears. To search, you intersect the postings lists of query terms. This gives O(1) lookup per term instead of O(n) scan.

✦ Definition~90s read

What is Inverted Index and Text Search?

An inverted index is a data structure mapping each unique word (token) to the list of documents containing it. It enables fast full-text search by looking up the token's postings list instead of scanning every document.

★

Think of a cookbook's index at the back.

Plain-English First

Think of a cookbook's index at the back. Instead of flipping through every page to find 'chicken', you look up 'chicken' in the index and get page numbers 23, 45, 102. The inverted index is that index: a dictionary where each word points to the list of documents (pages) containing it.

Your search is slow. Not 'a bit sluggish' — slow enough that users refresh and your database connection pool drowns. You're doing WHERE text LIKE '%keyword%' on a table with millions of rows. That's a full table scan. Every. Single. Query. I've seen this bring down a content management system at 2am when a bot hit the search endpoint with 50 concurrent requests. The fix? An inverted index. It's the core of Elasticsearch, Lucene, and every decent search engine. By the end of this, you'll build one from scratch, understand when it's overkill, and never write LIKE '%...%' for search again.

Why Your LIKE Query Is a Crime Against Performance

Every time you write WHERE text LIKE '%search%', the database scans every row. That's O(n) per query. With an inverted index, you precompute the mapping from words to documents. Search becomes O(1) per term plus intersection cost. The trade-off: index build time and storage. But for read-heavy search workloads, it's the only sane choice. Here's how a checkout service handles product search: build the index offline, serve queries from memory-mapped files.

InvertedIndexBuilder.javaJAVA

// io.thecodeforge — System Design tutorial

import java.io.*;
import java.util.*;

public class InvertedIndexBuilder {
    // Builds inverted index from a list of documents
    public static Map<String, List<Integer>> buildIndex(List<String> documents) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            String text = documents.get(docId).toLowerCase();
            String[] tokens = text.split("\\W+"); // split on non-word chars
            for (String token : tokens) {
                if (token.isEmpty()) continue;
                index.computeIfAbsent(token, k -> new ArrayList<>()).add(docId);
            }
        }
        // Sort postings lists for binary search intersection
        for (List<Integer> postings : index.values()) {
            Collections.sort(postings);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The quick brown fox",
            "Jumped over the lazy dog",
            "Quick brown dog jumps"
        );
        Map<String, List<Integer>> index = buildIndex(docs);
        System.out.println(index);
    }
}

Output

{brown=[0, 2], dog=[1, 2], fox=[0], jumped=[1], jumps=[2], lazy=[1], over=[1], quick=[0, 2], the=[0, 1]}

Production Trap: Unsorted Postings Lists

If you don't sort postings lists, intersection becomes O(n*m). Sorted lists allow merge-join intersection in O(n+m). Always sort after building.

thecodeforge.io

Inverted Index & Text Search Pipeline

Inverted Index Text Search

thecodeforge.io

Inverted Index Build Pipeline

Inverted Index Text Search

Tokenization: The Silent Killer of Search Relevance

Tokenization is how you split text into tokens. The classic rookie mistake: splitting on whitespace only. That leaves punctuation attached: 'hello,' vs 'hello' are different tokens. Always lowercase and split on non-word characters. For production, consider stemming (e.g., 'running' -> 'run') and stop word removal ('the', 'a', 'and'). But be careful: stop word removal can break phrase queries. I've seen a legal document search fail because 'not' was removed as a stop word, flipping the meaning of clauses.

Tokenizer.javaJAVA

// io.thecodeforge — System Design tutorial

import java.util.*;
import java.util.regex.Pattern;

public class Tokenizer {
    private static final Pattern WORD_PATTERN = Pattern.compile("\\w+");
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "by", "with");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        java.util.regex.Matcher matcher = WORD_PATTERN.matcher(text.toLowerCase());
        while (matcher.find()) {
            String token = matcher.group();
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("The quick brown fox jumps over the lazy dog!"));
    }
}

Output

[quick, brown, fox, jumps, over, lazy, dog]

Senior Shortcut: Stemming vs Lemmatization

Stemming (Porter stemmer) is faster but cruder ('running' -> 'run', 'ran' -> 'ran'). Lemmatization uses vocabulary to return base form ('ran' -> 'run'). For search, stemming is usually sufficient. Use Lucene's EnglishAnalyzer if you need production-grade.

Intersecting Postings Lists: The Heart of Search

Given two query terms, you need documents containing both. With sorted postings lists, use a two-pointer merge. For multiple terms, intersect pairwise. This is O(n+m) per pair. For large indexes, use skip pointers (store offsets to skip ahead) to reduce comparisons. Here's the merge intersection for two sorted lists.

Intersection.javaJAVA

// io.thecodeforge — System Design tutorial

import java.util.*;

public class Intersection {
    // Returns intersection of two sorted lists
    public static List<Integer> intersect(List<Integer> list1, List<Integer> list2) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < list1.size() && j < list2.size()) {
            int cmp = list1.get(i).compareTo(list2.get(j));
            if (cmp == 0) {
                result.add(list1.get(i));
                i++; j++;
            } else if (cmp < 0) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(0, 2, 5, 7);
        List<Integer> b = Arrays.asList(1, 2, 5, 8);
        System.out.println(intersect(a, b)); // [2, 5]
    }
}

Output

[2, 5]

Interview Gold: Skip Pointers

For very long postings lists, add skip pointers at intervals (e.g., every sqrt(n) entries). During intersection, you can skip over blocks that can't match. This reduces comparisons from O(n+m) to O(sqrt(n)+sqrt(m)) in practice.

Phrase Queries: When Word Order Matters

Phrase queries (e.g., "quick brown fox") require documents where the terms appear consecutively in order. Store positions (term offsets) in the postings list. Then check if positions of consecutive terms differ by 1. This adds storage but enables phrase search. Here's a position-aware index.

PositionalIndex.javaJAVA

// io.thecodeforge — System Design tutorial

import java.util.*;

public class PositionalIndex {
    // Map token -> (docId, positions)
    private Map<String, Map<Integer, List<Integer>>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        for (int pos = 0; pos < tokens.length; pos++) {
            String token = tokens[pos];
            if (token.isEmpty()) continue;
            index.computeIfAbsent(token, k -> new HashMap<>())
                 .computeIfAbsent(docId, k -> new ArrayList<>())
                 .add(pos);
        }
    }

    public List<Integer> phraseQuery(String phrase) {
        String[] terms = phrase.toLowerCase().split("\\s+");
        if (terms.length == 0) return Collections.emptyList();
        // Start with docs containing first term
        Map<Integer, List<Integer>> firstPostings = index.getOrDefault(terms[0], Collections.emptyMap());
        Set<Integer> candidateDocs = new HashSet<>(firstPostings.keySet());
        for (int t = 1; t < terms.length; t++) {
            Map<Integer, List<Integer>> termPostings = index.getOrDefault(terms[t], Collections.emptyMap());
            Set<Integer> nextCandidates = new HashSet<>();
            for (int docId : candidateDocs) {
                List<Integer> prevPositions = firstPostings.get(docId);
                List<Integer> currPositions = termPostings.get(docId);
                if (prevPositions == null || currPositions == null) continue;
                // Check if any position in prevPositions is exactly one less than a position in currPositions
                for (int p : prevPositions) {
                    if (currPositions.contains(p + 1)) {
                        nextCandidates.add(docId);
                        break;
                    }
                }
            }
            candidateDocs = nextCandidates;
            firstPostings = termPostings; // for next iteration
        }
        return new ArrayList<>(candidateDocs);
    }

    public static void main(String[] args) {
        PositionalIndex idx = new PositionalIndex();
        idx.addDocument(0, "quick brown fox");
        idx.addDocument(1, "brown fox quick");
        System.out.println(idx.phraseQuery("quick brown")); // [0]
        System.out.println(idx.phraseQuery("brown fox"));   // [0, 1]
    }
}

Output

[0]

[0, 1]

Never Do This: Storing Positions as Objects

Positions per document can be large. Store them as primitive int arrays or use compressed bitmaps (e.g., Roaring Bitmaps). Avoid List<Integer> for each doc — memory overhead kills you at scale.

Scaling the Index: When Memory Isn't Enough

For millions of documents, an in-memory HashMap won't fit. Use disk-based indexes with block compression. Common approach: store index in multiple segments (like Lucene). Each segment is a sorted list of (term, postings) pairs. Merge segments periodically. For production, use memory-mapped files for fast access. Here's a simple disk-based index using RandomAccessFile.

DiskInvertedIndex.javaJAVA

// io.thecodeforge — System Design tutorial

import java.io.*;
import java.nio.*;
import java.nio.channels.*;
import java.util.*;

public class DiskInvertedIndex {
    private RandomAccessFile file;
    private FileChannel channel;
    private Map<String, Long> termOffset = new HashMap<>(); // term -> file offset

    public DiskInvertedIndex(String path) throws IOException {
        file = new RandomAccessFile(path, "rw");
        channel = file.getChannel();
    }

    public void writeIndex(Map<String, List<Integer>> index) throws IOException {
        for (Map.Entry<String, List<Integer>> entry : index.entrySet()) {
            termOffset.put(entry.getKey(), file.getFilePointer());
            // Write term length, term, postings count, postings
            byte[] termBytes = entry.getKey().getBytes();
            file.writeInt(termBytes.length);
            file.write(termBytes);
            List<Integer> postings = entry.getValue();
            file.writeInt(postings.size());
            for (int docId : postings) {
                file.writeInt(docId);
            }
        }
    }

    public List<Integer> getPostings(String term) throws IOException {
        Long offset = termOffset.get(term);
        if (offset == null) return Collections.emptyList();
        file.seek(offset);
        int termLen = file.readInt();
        byte[] termBytes = new byte[termLen];
        file.read(termBytes);
        int count = file.readInt();
        List<Integer> postings = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            postings.add(file.readInt());
        }
        return postings;
    }

    public void close() throws IOException {
        channel.close();
        file.close();
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<Integer>> index = Map.of(
            "hello", List.of(0, 2, 5),
            "world", List.of(1, 3)
        );
        DiskInvertedIndex diskIdx = new DiskInvertedIndex("index.dat");
        diskIdx.writeIndex(index);
        System.out.println(diskIdx.getPostings("hello")); // [0, 2, 5]
        diskIdx.close();
    }
}

Output

[0, 2, 5]

Production Trap: File Channel Not Flushed

Always call channel.force(true) after writing to ensure data is on disk. Otherwise, a crash can corrupt the index. I've seen this lose hours of indexing work.

When Not to Use an Inverted Index

Inverted indexes are great for keyword search, but terrible for fuzzy search, regex, or numerical range queries. If you need substring matching (e.g., '%abc%'), use a trie or suffix array. For geospatial, use R-tree. For small datasets (<10k documents), a simple scan with LIKE might be faster due to index overhead. Also, inverted indexes don't handle synonyms well — you need a thesaurus or embedding-based search for that.

Senior Shortcut: Hybrid Approach

Use inverted index for keyword search, and fall back to vector search (e.g., FAISS) for semantic search. Many production systems combine both: retrieve candidates via inverted index, then re-rank with embeddings.

thecodeforge.io

Inverted Index vs. LIKE Scan

Inverted Index Text Search

● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom

A search service on a 4GB container kept OOM-killing itself every 30 minutes under moderate load.

Assumption

Team assumed it was a memory leak in the search library.

Root cause

The inverted index was built in-memory as a HashMap<String, List<Integer>> for 10 million documents. The postings lists were unsorted, so intersection required O(n*m) scanning. Memory usage exploded because each postings list was stored as an ArrayList with overhead.

Fix

Switched to sorted postings lists stored as int[] arrays on disk with memory-mapped files. Intersection uses binary search. Memory dropped to 500MB.

Key lesson

Never store postings lists as objects in a hash map for large indexes.
Use sorted arrays and memory-mapped I/O.

Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries

Symptom · 01

Search returns no results for a term that should exist

→

Fix

1. Check tokenization: run the document text through your tokenizer and verify the term appears. 2. Check case: ensure both index and query use same case normalization. 3. Check stop words: is the term being removed? 4. Verify index build: did the document get indexed at all?

Symptom · 02

Search returns too many irrelevant results

→

Fix

1. Check if you're using phrase query when you need exact match. 2. Review tokenization: are you splitting on punctuation correctly? 3. Consider adding TF-IDF scoring to rank results. 4. Check for synonym expansion causing noise.

Symptom · 03

Index build takes too long or runs out of memory

→

Fix

1. Check if you're storing positions unnecessarily. 2. Use disk-based index for large datasets. 3. Batch documents and merge segments. 4. Increase JVM heap or use memory-mapped files.

★ Inverted Index and Text Search Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.

Search returns empty for known term−

Immediate action

Check if term is in index

Commands

grep -c '\bterm\b' index.dat

java -jar IndexInspector.jar --term term

Fix now

Rebuild index with correct tokenizer

OutOfMemoryError during index build+

Phrase query returns wrong documents+

Slow search queries (>1s)+

Feature / Aspect	Inverted Index	LIKE Query
Search speed	O(1) per term + intersection	O(n) full table scan
Phrase search	Supported with positions	Supported with % but slow
Index build time	O(total tokens)	None
Storage overhead	High (postings lists)	None
Fuzzy search	Not native	Not native
Scalability	Millions of docs	Thousands of docs

Key takeaways

Inverted indexes turn O(n) text search into O(1) term lookup

essential for any system with more than 10k documents.

Always sort postings lists after building

unsorted lists make intersection O(n*m) instead of O(n+m).

Tokenization is the most common source of search bugs

lowercase, split on non-word chars, and be careful with stop words.

For large indexes, use disk-based storage with memory-mapped files and segment merging

never store everything in a single HashMap.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

How does an inverted index handle concurrent updates while serving queri...

Q02SENIOR

When would you choose an inverted index over a B-tree for text search?

Q03SENIOR

What happens when you search for a term that appears in 90% of documents...

Q04JUNIOR

What is an inverted index?

Q05SENIOR

You notice search returns documents that don't contain the query term. W...

Q06SENIOR

Design a search system for a news website with 10 million articles. How ...

Q01 of 06SENIOR

How does an inverted index handle concurrent updates while serving queries?

ANSWER

Use a copy-on-write approach: build a new segment in memory, then atomically swap the segment list. Queries see the old segments until swap completes. This avoids locks and ensures consistency.

FAQ · 4 QUESTIONS

Frequently Asked Questions

What is an inverted index in simple terms?

What's the difference between inverted index and forward index?

How do I implement an inverted index in Java?

How does Elasticsearch use inverted indexes?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

✓ Verified

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

🔥

That's Async & Data Processing. Mark it forged?

3 min read · try the examples if you haven't