Advanced 9 min · March 06, 2026

Design Google Search — Index Corruption Pitfalls

Malformed HTML corrupted index segments and spread to replicas, spiking errors from 0.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Google Search is a distributed system that crawls, indexes, and ranks billions of web pages in under 200ms
  • Core components: crawler, indexer, query processor, ranker, and serving infrastructure
  • Crawl budget and politeness policies prevent overloading servers while maximizing coverage
  • Inverted index enables sub-millisecond lookup of documents containing query terms
  • Production insight: a failing indexing node can cause stale results or missing pages — health checks and shard replication are mandatory
  • Biggest mistake: assuming relevance ranking is solved by TF-IDF — modern search uses multi-stage ranking with deep learning
  • Critical gotcha: corruption in one index segment can go undetected for hours — verify segment checksums after every merge
Plain-English First

Imagine the world's largest library with billions of books, and every time someone asks a question, a team of librarians instantly finds the most relevant page across all of them — in under half a second. Google Search is that library, but instead of librarians, it's machines crawling the internet 24/7, filing every word they find in a massive index, and ranking results by how trustworthy and relevant each page is. The magic isn't just storing everything — it's making the lookup feel instant no matter how obscure your question is. Think of the index as a card catalog that gets reorganized every night — the new cards don't appear until the next morning, but the old cards still work. Google's challenge is to reorganize that catalog millions of times per second while people are still using it.

Google processes over 8.5 billion searches per day. Behind every one of those queries is a pipeline that spans web crawling, link graph analysis, distributed indexing, real-time query parsing, and sub-100ms result serving — all at a scale that dwarfs most technology stacks in existence. When interviewers ask you to design a search engine, they're not expecting you to rebuild PageRank from scratch. They're testing whether you can reason about hard trade-offs: freshness vs. consistency, recall vs. precision, crawl budget vs. coverage. This is where senior engineers separate themselves.

The core problem search solves is deceptively simple: given an unstructured corpus of hundreds of billions of web pages, return the most relevant ten results for an arbitrary natural-language query in under 200 milliseconds. The difficulty is in every word of that sentence — 'hundreds of billions' demands distributed storage, 'arbitrary query' demands linguistic understanding, 'most relevant' demands a ranking model trained on human behavior, and '200 milliseconds' demands aggressive caching, pre-computation, and hardware co-design.

By the end of this article you'll be able to walk into any senior system design interview and articulate the full pipeline — from DNS to rendered SERP — naming the right data structures, explaining the right trade-offs, and spotting the gotchas that trip up candidates who only memorized diagrams. We'll cover the crawler, the indexing pipeline, the serving stack, and the ranking layer, with concrete pseudo-code and real numbers wherever they matter.

Here's the kicker: most engineers describe a linear pipeline and forget the feedback loops. The crawler discovers URLs from the indexer's freshness requirements, the indexer publishes stats that reshape the crawl priority, and the ranking model's performance dictates which segments get reindexed first. If you design each component in isolation, you get a system that's correct on paper but breaks in production. For example, a failing crawler on a high-traffic domain can cause the indexer to serve stale results for weeks before anyone notices — because the monitoring dashboard shows 'healthy' but the freshness metric never gets updated. That's the kind of failure that only surfaces when you instrument every boundary. Master the feedback loops, and you'll walk into any interview knowing exactly what separates a diagram-drawer from a system thinker.

One more thing: when you're designing the feedback loops, instrument every cross-component call with a latency histogram and a success/failure counter. Without that data, you're guessing. And trust me — you don't want to debug a silent crawl starvation at 3 AM on a Friday.

Design Google Search is a core concept in System Design. It's not about building the actual Google — it's about understanding the architectural principles that let you serve relevant results from a web-scale corpus at sub-200ms latency. The system decomposes into four main layers: a distributed web crawler that discovers and downloads pages, an indexing pipeline that transforms raw HTML into searchable inverted indexes, a query serving stack that processes user queries against the index, and a ranking system that orders results by relevance. Each layer must handle failure gracefully, scale horizontally, and operate under strict latency budgets. Senior engineers focus on the boundaries between these layers: how the crawler feeds the indexer without overwhelming it, how the index is distributed to the serving tier, and how ranking models are trained offline and served online without latency spikes.

Here's the thing most tutorials skip: the real bottleneck isn't the crawler or the indexer — it's the coordination between them. If the indexer falls behind, results go stale. If the crawler outpaces the indexer, you waste storage on pages nobody can query. The production reality is you need backpressure signals between every stage, and those signals are often the hardest thing to get right. Don't assume your pipeline is linear — it's a loop with feedback.

And then there's the failure mode that gets ignored until it bites you: a single slow component in the feedback loop can cascade. For example, if the ranking model's offline training pipeline stalls, new features never get deployed, but you don't notice until CTR drops 10% a week later. You need monitoring on every stage's throughput and latency, not just the serving tier.

Another overlooked feedback loop: the crawl priority influences indexing freshness, and indexing freshness influences ranking relevance. A page that's frequently updated but has low crawl priority will never accumulate enough ranking signals to surface in top results. You need a priority feedback signal from the ranking system back to the crawler — if a page's CTR drops below a threshold, increase its crawl frequency. This closes the loop and prevents stale content from polluting results.

Here's the production truth that's hard to swallow: most search engines fail not because the index is slow, but because the coordination signals between stages are missing. Fix the signals, and the system runs itself.

Don't fall into the trap of thinking each layer is independent. The real system is a set of feedback loops: the crawler feeds the indexer, the indexer's freshness determines ranking, and ranking performance dictates crawl priority. If you design in isolation, you'll miss the failures that happen at the boundaries.

One more thing: when you're designing the feedback loops, instrument every cross-component call with a latency histogram and a success/failure counter. Without that data, you're guessing. And trust me — you don't want to debug a silent crawl starvation at 3 AM on a Friday.

InvertedIndexLookup.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
package io.thecodeforge.search;

/**
 * Simplified inverted index lookup — production uses distributed shards.
 */
public class InvertedIndexLookup {
    private final Map<String, List<Posting>> index;

    public InvertedIndexLookup(Map<String, List<Posting>> index) {
        this.index = index;
    }

    public List<Posting> lookup(String term) {
        List<Posting> postings = index.get(term);
        if (postings == null) return List.of();
        // In real system, merge with phrase/positional index
        return postings;
    }

    static class Posting {\\n        final int docId;\\n        final float score; // e.g., TF-IDF contribution\\n        Posting(int docId, float score) {\\n            this.docId = docId;\\n            this.score = score;\\n        }
    }
}
Mental Model: The Library Card Catalog
  • Instead of scanning all books (linear scan), you look up the term in the catalog (O(1) hash lookup).
  • Each card (posting list) stores the document ID and a weight representing relevance or position.
  • The catalog itself is partitioned across many cabinets (shards) because no single cabinet holds all cards.
Production Insight
In production, the inverted index is not a single hash map — it's a set of immutable segments merged asynchronously.
Each segment includes a skip list for fast intersection of posting lists.
The biggest gotcha: memory-mapped segment files can cause major GC pressure if not sized correctly (JVM heap vs mmap).
Another trap: if you use off-heap memory for segments, you avoid GC but risk native memory leaks — always set a limit and monitor /proc/<pid>/smaps.
Also, segment alignment to cache lines matters: misaligned posting lists can leave 30% performance on the table — profile your memory access patterns.
A segment with a corrupted skip list can cause a query to skip valid documents silently. Always verify skip pointers during segment validation.
Key Takeaway
An inverted index is the heart of search.
It's not just a hash map — it's a distributed, segmented, and compressed data structure.
When an interview says 'inverted index', think sorted posting lists with skip pointers, not just a Python dict.
And always check your segment cache line alignment.
Index Strategy Decision
IfContent changes hourly (news, sports)
UseIncremental index with near-real-time updates; batch only for consolidation.
IfContent changes weekly (blog posts, documentation)
UseBatch index nightly; incremental updates not strictly needed.
IfCorpus size grows fast, disk space is constrained
UseUse tiered merging and aggressive compression; monitor write amplification.
IfIndex serving P99 latency exceeds 300ms
UseReduce number of segments via merge; check if segment cache hit ratio is below 90%.

Web Crawling: Discovery, Fetching, and Politeness

The crawler discovers URLs from a seed set and downloads each page's content. At Google scale, you can't crawl everything every day — you need a crawl budget. Popular pages get fetched daily, while less important ones might be re-crawled weekly. The crawler must respect robots.txt, detect duplicate content, and avoid overwhelming servers. A critical component is the URL frontier, a priority queue that decides which URLs to fetch next based on freshness, page importance, and politeness. The crawler uses DNS caching, HTTP persistent connections, and distributed storage (e.g., Bigtable) to store raw HTML. Deduplication via simhash prevents storing near-identical pages.

But here's the part that catches everyone off guard: the frontier itself becomes a distributed system at scale. You can't keep a single priority queue in memory when you have billions of URLs. You need a partitioned frontier where each partition handles a range of domains, and you need a coordinator to balance load. And politeness isn't just about per-domain rate — it's about detecting when a site pushes malicious content via robots.txt that tells you to slow down to a crawl. Always validate robots.txt against a known-good policy.

A real incident: a misconfigured CMS on a popular domain caused a redirect loop that consumed 40% of a region's crawl budget in under an hour. The fix was to bound redirect depth to 5 hops and add a per-domain redirect limit. Also, monitor per-domain crawl rate and frontier queue depth — a drop in depth with high retry counts suggests a domain is blocking or politeness is too aggressive.

The other lesson: don't let your crawler become the bottleneck for index freshness. Set a maximum crawl age for each domain and stick to it. When the crawler falls behind, the indexer gets hungry and starts serving stale results. That's a user-facing issue that's hard to debug because it's gradual.

DNS caching is another gotcha. A stale DNS entry can cause the crawler to hit a dead IP for hours. Use a TTL cap of 5 minutes and warm multiple DNS resolvers in parallel. If a domain's IP changes faster than your TTL, you'll miss it — compensate by monitoring DNS resolution failures and forcing re-resolution on error.

There's also a subtle politeness trap: some CDNs return a cached 200 even when the origin is down. The crawler thinks the page is fine, but its content is stale. Use a freshness signal from the origin's Last-Modified header. If the header doesn't change, you might be recrawling identical content. Detect content evolution by comparing hashes — if the content hash matches the last fetch, skip reindexing.

One more gotcha: DNS resolution can silently throttle your crawl rate. If your DNS cache TTL is too high, you'll hit dead IPs for hours. Set a cap of 5 minutes and monitor resolution failures. Also, redirect chains are budget killers — always follow a 301 and count hops.

And something that haunted my team: we once had a crawl partition lose its state after a node failure. The frontier coordinator didn't replicate the queue, so we lost 15% of pending URLs. Always persist the frontier state to a durable store or use a replicated log. Recovery time matters less than data loss.

CrawlerWorker.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
package io.thecodeforge.search.crawler;

import java.util.concurrent.*;
import java.net.URL;

public class CrawlerWorker implements Runnable {
    private final URL frontier;
    private final RateLimiter perDomainLimiter;

    public CrawlerWorker(URL frontier, RateLimiter perDomainLimiter) {\\n        this.frontier = frontier;\\n        this.perDomainLimiter = perDomainLimiter;\\n    }

    @Override
    public void run() {
        while (true) {
            URL url = frontier.next(); // blocking
            perDomainLimiter.acquire(url.getHost());
            // fetch and store...
        }
    }
}
Robots.txt Can Be Malicious
Some sites serve a robots.txt that disallows everything or imposes unrealistic delays. Always validate against a default policy and override if the site tries to starve your crawler.
Production Insight
Crawl politeness is often the hardest thing to get right in production.
If your crawler sends too many requests per second to a single domain, you get blocked or rate-limited.
Use a per-domain queue in the frontier with a configurable delay (e.g., 1 request per 2 seconds).
Another trap: DNS resolution latency can be a bottleneck — pre-resolve and cache aggressively.
Watch out for redirect chains: a 301 followed by 302 can inflate your crawl time and eat your budget.
To detect crawler starvation, monitor per-domain crawl rate and frontier queue depth — a drop signals a blocking issue.
Also, monitor DNS resolution failures — a flaky DNS can cause the crawler to skip entire domains. Cache TTLs and use multiple resolvers.
One more: if a site is slow, don't just slow down — escalate to the indexer so it knows the results are not fresh.
Persist the frontier state to a distributed log — losing it after a node crash means wasted crawl budget and delayed index freshness.
Key Takeaway
Distributed crawling is a scheduling problem, not just a fetch loop.
Crawl budget, politeness, and deduplication determine coverage efficiency.
Politeness breaks your system first — test your throttle logic before scalability.
And remember: crawler health is often invisible until index freshness drops — monitor per-domain crawl rates.
Always set a maximum redirect depth and a per-domain redirect limit.
Crawl Policy Decision Tree
IfPage is frequently updated (e.g., news site)
UseAssign high recrawl priority; use a short crawl interval (hours).
IfPage has never changed in the last 30 days
UseIncrease crawl interval to weeks or lower priority.
IfServer returns 429 Too Many Requests
UseExponential backoff; add domain to cooldown queue.
IfDNS resolution fails for a domain
UseRetry with exponential backoff; if persistent, deprioritize domain and alert ops.

Indexing Pipeline: From Raw Pages to Inverted Index

The indexing pipeline processes crawled HTML through several stages: parsing (extract text, links, metadata), tokenization (split into terms with language-aware segmentation), stop word removal and stemming (e.g., 'running' → 'run'), and finally building the inverted index. For scalability, indexing is done in batches — map-reduce jobs that generate sorted posting lists for each term. Each index segment is an immutable file containing the term dictionary, a frequency table, and the actual posting list. Segments are periodically merged into larger segments to keep the number of files small and lookups fast. The index is distributed across a cluster of index servers, typically sharded by term ID hash.

A detail that bites teams: merging segments is not just a performance concern — it's a consistency one. If a merge fails mid-way, you can lose data from both input segments. Always write merges atomically: write a new segment to a temp path, fsync, then atomically rename it into the index directory. And never skip the fsync — an OS crash during rename can leave you with a half-written segment that looks valid to the reader.

Pain point in production: merge I/O can saturate disk bandwidth. On HDDs, sequential merge I/O can push disk util to 100%, causing query latency to spike by 500ms. Always separate merge and query disks, or use I/O throttling. Use iostat -x 1 to monitor disk await — if >20ms, throttle merge rate. Also, segment merging is write amplification: merging two 10GB segments to a new 20GB segment writes 20GB even if only a small fraction changed. Tiered merging helps: small segments merge frequently, large ones rarely.

Another subtle failure: if a segment's term dictionary becomes corrupted, the entire posting list for a term may be unreadable. Always validate the dictionary checksum after each merge and before exposing the segment to queries.

Compression is not an afterthought. Variable-byte encoding can reduce posting list size by 70%, but decompression adds CPU cost. In production, you balance compression ratio against decode speed — Frame of Reference (FOR) and Delta encoding are common choices. Measure your cache line utilization: posting lists that fit in L1 cache (32KB) are orders of magnitude faster than those that spill into L2.

And here's a real-world production problem: when you add a new segment to the serving tier, you need to warm the OS page cache. If you don't, the first few queries hit a cold cache and latency spikes. Pre-fault the pages by reading the segment headers sequentially before serving.

Also, monitor segment-level health with a background checker that runs a sample query per segment. If any segment returns empty for a known term, flag it immediately. Don't wait for users to report missing results.

Segment corruption is the nightmare that wakes search engineers at 3 AM. Always checksum every segment after a merge and before serving. And don't forget to pre-warm the page cache — a cold segment adds 500ms to the first few thousand queries.

One more nuance: the tokenization stage is often the source of silent recall problems. If you don't handle Unicode normalization, 'Café' and 'cafe' end up as different terms, and you miss matches for queries with accented characters. Use locale-aware tokenization and apply Unicode normalization (NFC/NFD) before indexing.

IndexSegmentWriter.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
package io.thecodeforge.search.index;

import java.io.*;
import java.util.*;

/**
 * Simplified segment writer for educational purposes.
 */
public class IndexSegmentWriter {
    private final Map<String, List<Integer>> inverted = new TreeMap<>();

    public void addDocument(int docId, String title, String body) {\\n        String text = title + \\\" \\\" + body;\\n        // In production: tokenize, stem, filter\\n        String[] tokens = text.toLowerCase().split(\\\"\\\\\\\\W+\\\");\\n        for (String term : tokens) {\\n            if (term.length() < 2) continue; // skip very short tokens\\n            inverted.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);\\n        }\\n    }\\n\\n    public void write(String path) throws IOException {\\n        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(\\n                new FileOutputStream(path)))) {\\n            out.writeInt(inverted.size());\\n            for (var entry : inverted.entrySet()) {\\n                out.writeUTF(entry.getKey());\\n                out.writeInt(entry.getValue().size());\\n                for (int docId : entry.getValue()) {\\n                    out.writeInt(docId);\\n                }\\n            }\\n        }\\n    }\\n}\"\n      }",
      "callout": {
        "type": "warning",
        "title": "Warning: Tokenization Without Locale Awareness Breaks Recall",
        "text": "A default split on whitespace treats 'Café' as 'Café' but 'cafe' as different. Always apply locale-specific tokenization and Unicode normalization to avoid missing matches for queries with accented characters."
      },
      "production_insight": "Index merges are the biggest source of write amplification in search systems.\nMerging two 10GB segments into a new 20GB segment requires writing 20GB even if only a small fraction of data changed.\nUse a tiered merging strategy: smaller segments merge more frequently, large ones rarely.\nMonitor merge I/O to prevent it from starving query read operations.\nOne more thing: if you use SSDs, merge I/O contention is less painful, but on HDDs you can see query latency spikes during merges. Consider throttling merge speed during peak hours.\nCheck merge I/O with iostat -x 1 — if await > 20ms, throttle merge rate immediately.\nAnd always pre-warm the page cache on new segments to avoid cold-start latency.\nNever skip checksum validation on merged segments — corruption can spread silently to replicas.",
      "decision_tree": {
        "title": "Merge Strategy Decision Tree",
        "items": [
          {
            "condition": "Index has many small segments (high read amplification)",
            "result": "Merge aggressively, but throttle during peak hours to avoid I/O contention."
          },
          {
            "condition": "Disk space is tight and merge is expensive",
            "result": "Use log-structured merge (LSM) tree approach: tiered merging with gradual compaction."
          },
          {
            "condition": "Segment corruption detected after merge",
            "result": "Roll back to pre-merge checkpoint, verify input segment integrity, then retry merge."
          },
          {
            "condition": "Merge I/O spikes correlate with P99 latency spikes",
            "result": "Throttle merge speed; consider separating merge and query disks."
          }
        ]
      },
      "key_takeaway": "The index is not updated in place — it's a collection of immutable segments.\nMerging segments balances read performance (fewer files) against write cost.\nTiered merging is the default for Lucene, the core of Elasticsearch — understand it before tuning.\nIf you're not thinking about cache hierarchy, you're not ready for production search.\nAlways validate segment checksums after merge and pre-warm the page cache."
    },
    {
      "heading": "Query Processing and Serving",
      "content": "When a user types a query, the serving layer is responsible for parsing the query, looking up each term in the inverted index, intersecting or merging the posting lists, and ranking the candidate documents — all in under 200ms. The query parser applies spelling correction (e.g., 'gooogle''google'), synonym expansion, and query rewriting based on user intent. Then the term lookup happens across all shards in parallel. Each shard returns its local top-K results (e.g., top 1000), and the serving node merges them into a global top-K. Early termination is critical: you don't need to score all matching documents — you can stop after finding enough high-quality candidates. This is achieved via dynamic pruning algorithms like MaxScore and WAND.\n\nThe real pain point in production is the merge step at the root node. If you have 10,000 shards each sending 1000 results, the root has 10 million candidates to rank. That's a lot of work in a few milliseconds. Use a two-phase merge: first, each shard returns only doc IDs and a cheap estimated score; the root filters to, say, 10,000 candidates, then requests full features for those. That cuts merge time by orders of magnitude.\n\nAnother hidden issue: the root node's merge is CPU-bound. Profile with a CPU profiler — often sorting a large list of candidate doc IDs is the bottleneck. Replace full sort with a priority queue of size K. Also, if the root node runs out of CPU, query latency degrades linearly across all queries. Pre-compute common query rewrites offline and cache frequent query results aggressively — Google's frontend cache sits in front of the serving layer and handles a large percentage of repeat queries.\n\nTail latency is the silent killer. Even if 99% of queries complete in 100ms, a single slow shard can push P99 beyond 500ms. Implement hedging: send each query to two replicas and take the first response. Set a timeout per shard step and use early termination to cap worst-case execution time. Monitor root node CPUif >80% during peak, reduce top-K per shard or add hedging.\n\nDynamic pruning algorithms deserve a closer look. WAND (Weak AND) maintains an upper bound score for each posting list and skips documents that cannot beat the current K-th best score. MaxScore partitions terms into 'must have' and 'nice to have' groups, only scoring the 'must have' ones early. Both reduce ranking time by 30-50% in practice.\n\nOne more nuance: query rewriting can backfire. If you expand a 2-word query into a 10-term OR clause, you blow up the candidate set. Always cap the expansion factor. Also, cache common rewrites: the top 100K queries rarely change their best rewriting. Pre-compute them offline and serve from a lookup table.\n\nThe root node merge is deceptively CPU-intensive. If you have 10,000 shards each returning 1000 candidates, that's 10 million documents to rank in milliseconds. Use a two-phase merge: cheap score first, then full features only for the top 10,000. And always set per-shard timeouts.\n\nAnd here's a failure we hit: a bug in the query rewriter turned every 'the' query into a 50-term OR expansion. The root node got 500 million candidates and melted. The fix: enforce a maximum expansion factor of 10 and a hard cap on candidate count. Also, run a canary query before pushing rewriting changes.",
      "code": {
        "language": "java",
        "filename": "QueryParser.java",
        "code": "package io.thecodeforge.search.serving;\n\nimport java.util.*;\n\npublic class QueryParser {\n    private final SpellCorrector corrector;\n\n    public QueryParser(SpellCorrector corrector) {\n        this.corrector = corrector;\n    }\n\n    public ParsedQuery parse(String raw) {\n        String corrected = corrector.correct(raw);\n        String[] terms = corrected.toLowerCase().split(\"\\s+\");\n        return new ParsedQuery(Arrays.asList(terms));\n    }\n\n    record ParsedQuery(List<String> terms) {}\n}"
      },
      "callout": {
        "type": "info",
        "title": "Spell Correction Overhead",
        "text": "Spell correction can add 10-50ms per query if not cached. Pre-compute corrections for the top 100K frequent queries offline and serve from a lookup table."
      },
      "production_insight": "Parallel query execution across thousands of shards introduces a tail latency problem.\nA single slow shard can delay the entire query by its response time.\nUse hedging: send the query to two shard replicas and take the first response.\nCache popular query results aggressively — Google's frontend cache sits in front of the serving layer and handles a large percentage of repeat queries.\nDon't forget query rewriting: a badly phrased query can trigger expensive misspell corrections that add 50ms. Pre-compute common query rewrites offline.\nMonitor root node CPUif >80% during peak, reduce top-K per shard or add hedging to avoid latency spikes.\nAnother trap: greedy query rewriting that expands a 2-word query into a 10-term OR clause can blow up the number of candidate documents. Set a maximum expansion factor and fall back to original if exceeded.\nAlways set timeouts per shard step to avoid tail latency dominating the SLO.\nWhen deploying a new query rewriter, run a canary to ensure expansion doesn't exceed a candidate cap.",
      "decision_tree": {
        "title": "Query Serving Optimization Decision Tree",
        "items": [
          {
            "condition": "Query is a common phrase (e.g., 'weather')",
            "result": "Serve from cached result set; bypass index lookup if cache is hot."
          },
          {
            "condition": "Query contains rare terms (low document frequency)",
            "result": "Use early termination: expand candidate set with related terms but limit ranking depth."
          },
          {
            "condition": "Query latency exceeds SLO (200ms)",
            "result": "Reduce top-K per shard (e.g., from 1000 to 500), enable query rewriting to simpler form."
          },
          {
            "condition": "Root node CPU >80% during peak",
            "result": "Enable hedging, reduce top-K per shard, or scale root node horizontally."
          },
          {
            "condition": "Expanded query produces >10M candidates",
            "result": "Cap expansion factor; fall back to original query if cap exceeded."
          }
        ]
      },
      "key_takeaway": "Query serving is a race against the clock — every millisecond matters.\nDistribute the work, use early termination, and hedge against slow shards.\nThe biggest win is caching — never recompute what you already served.\nAlways measure the tail — a single slow shard can kill your SLO.\nSet expansion limits on query rewriting to avoid combinatorial blowup."
    },
    {
      "heading": "Ranking and Relevance",
      "content": "Ranking determines which documents appear at the top of the search results page. Early search engines used TF-IDF (term frequency–inverse document frequency), while Google's PageRank used the link graph. Modern search uses multi-stage ranking: first stage (retrieval) uses simple signals like BM25 or vector similarity to retrieve a few hundred candidates; second stage (reranking) applies a more expensive model — often a deep neural network with hundreds of features (e.g., click-through rate, page freshness, user location, query-document match scores). The ranking model is trained offline on user interaction data (clicks, dwell time, skips) and deployed to the online serving stack via model inference servers. Feature computation must be fast — features are precomputed and stored alongside the document in the index.\n\nA subtle issue: features that are computed at query time (like user location) can't be precomputed. That's fine, but be aware that any floating-point operation in the scoring path adds latency. Profile your feature extraction pipeline end-to-end. We've seen teams spend months building a complex DNN only to find that a single feature look-up (e.g., a Redis call for user history) dominates the latency budget. Measure, then model.\n\nCommon failure: label leakage during training. If you include future click data as a feature, offline AUC looks amazing (0.99), but online impact is zero. Always validate that all features are available at inference time. Also, model drift is silent — monitor feature distribution drift and set alerting thresholds on model AUC per query cluster. Another real issue: the second-stage reranker can become a bottleneck if it's too slow. Set a timeout per document, and fall back to first-stage scores if the reranker misses its deadline.\n\nOne more production headache: feature freshness. If you use a per-document CTR feature that's computed offline, it may be hours old. A page that's trending on social media might have old CTR data, causing the model to underrank it. Set TTLs on features and recompute at least every hour for high-traffic queries.\n\nAnother nuance: click modeling is non-trivial. Raw clicks are noisy — a user may click a result by accident. Use dwell time as a quality signal: a click with <2 seconds dwell is likely a bad click. Also, position bias must be corrected: results in position 1 get clicked more regardless of relevance. Use a position bias model to debias training data.\n\nModel monitoring is critical. Track the distribution of each feature in production — if a feature's values drift beyond a threshold, the model is likely stale. Use PSI (Population Stability Index) to quantify drift. Also, monitor the fallback rate: how often does the reranker time out and fall back to first-stage scores? A rising fallback rate indicates the reranker is becoming too slow, perhaps due to increased model complexity or resource contention.\n\nModel drift is silent. You'll see CTR drop over weeks, not hours. Monitor feature distribution drift with Population Stability Index (PSI). And never train on future data — label leakage will give you perfect offline AUC but zero real lift.\n\nOne more incident: a team deployed a new ranking model that had a feature 'search_volume_7d' computed from future data. Offline validation looked fine because the feature was computed after the query timestamp. Online CTR dropped 15%. The lesson: always validate feature timestamps in your training pipeline.",
      "code": {
        "language": "java",
        "filename": "BM25Scorer.java",
        "code": "package io.thecodeforge.search.ranking;\n\npublic class BM25Scorer {\n    private final double k1 = 1.2;\n    private final double b = 0.75;\n\n    public double score(int termFreq, int docLen, double avgDocLen, int numDocs, int docFreq) {\\n        double idf = Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));\\n        double tf = (termFreq * (k1 + 1)) / (termFreq + k1 * (1 - b + b * docLen / avgDocLen));\\n        return idf * tf;\\n    }\n}"
      },
      "callout": {
        "type": "mental_model",
        "title": "BM25 Intuition",
        "hook": "BM25 balances how often a term appears in a document (TF) against how rare the term is across the corpus (IDF).",
        "bullets": [
          "Higher TF means more relevant, but with diminishing returns (k1).",
          "Longer documents get a penalty (b controls length normalisation).",
          "Rare terms get a higher IDF boost."
        ]
      },
      "production_insight": "Ranking models drift over time as user behavior changes or content shifts.\nIf click-through rates drop for a group of queries, the model may be stale — retrain with recent data.\nA common failure: a new feature that spikes in importance can cause the model to overweight it, leading to bizarre results.\nMonitor feature distribution drift and set alerting thresholds on model AUC metrics per query cluster.\nAlso, watch out for label leakage during training — a feature that uses future user clicks will give you amazing offline metrics but zero real-world lift.\nAlways A/B test ranking changes for at least 2 weeks to measure real CTR impact.\nFeature freshness matters: a stale CTR feature (hours old) can rank an irrelevant page high. Set TTLs on features and recompute at least every hour for high-traffic queries.\nSet a timeout per document in the reranker and fall back to first-stage scores if it misses.\nNever trust a feature pipeline that doesn't validate temporal ordering.",
      "decision_tree": {
        "title": "Ranking Strategy Decision",
        "items": [
          {
            "condition": "Query is a navigational query (e.g., 'Facebook')",
            "result": "Use simple BM25 + page quality score; deep reranking adds latency without value."
          },
          {
            "condition": "Query is informational or ambiguous (e.g., 'Python vs Java')",
            "result": "Multi-stage ranking with DNN reranker; use user context signals."
          },
          {
            "condition": "Latency budget is tight (<50ms)",
            "result": "Skip reranking; rely on first-stage BM25 with precomputed quality scores."
          },
          {
            "condition": "Reranker fallback rate > 5%",
            "result": "Investigate model latency; consider reducing model size or increasing timeout."
          }
        ]
      },
      "key_takeaway": "Ranking is a machine learning problem, not a static formula.\nMulti-stage ranking trades off latency for accuracy.\nModel drift is silent — monitor offline metrics and have a fallback to a simpler model if the DNN becomes unstable.\nRanking is never done — your model is always stale. Automate retraining and A/B test every change.\nFeature freshness is critical: stale features lead to stale rankings. Set TTLs and recompute frequently."
    },
    {
      "heading": "Distributed Search Serving Architecture",
      "content": "Once the index is built, it must be distributed across a fleet of serving nodes that handle real-time queries. Google uses a two-tier serving architecture: the leaf tier holds index shards, and the root tier aggregates results. Each leaf node hosts a subset of the inverted index sharded by document ID or term. Sharding can be hash-based (e.g., MD5(term) mod N) or range-based. Hash-based gives even distribution but makes prefix queries expensive; range-based helps prefix queries but can cause hotspots. A common compromise is to use hash-based for even load and replicate hot terms across more nodes. Replication is essential for fault tolerance and read throughput — each shard has at least 3 replicas across different racks. Consistency between replicas is eventually consistent: updates are pushed with a version vector, and reads query the most up-to-date replica.\n\nThe serving layer also includes a result cache (e.g., for top 10,000 queries) and a frontend that handles query parsing, spelling correction, and personalization before hitting the index servers. Google's frontend cache likes to sit in front of the leafy tier — it can serve up to 80% of repeat queries without touching the index at all.\n\nOne gotcha: with shard replication, you must handle replica staleness during index pushes. A common pattern is to use a two-phase commit: prepare new index version on all replicas, then commit. If one replica fails to prepare, abort the entire push to avoid serving inconsistent results.\n\nWhen adding nodes, consistent hashing avoids full reshuffle, but each node must reload its segment metadata — this can cause a temporary CPU spike. Pre-warm new nodes with routing table before directing traffic. Also, if the routing table is updated asynchronously, some queries may hit a node that no longer holds the segment. Always validate shard ownership on each request via a version check.\n\nAnother production issue: version vector conflicts during concurrent pushes. Use a majority read protocol with timestamp ordering to ensure at least one replica returns the latest index version. If versions diverge, the root node should retry on other replicas or fall back to a cached result.\n\nCross-region replication adds another layer of complexity. Latency between regions can cause significant staleness. Use a primary region for writes and replicate asynchronously. For read affinity, route users to the nearest region and accept eventual consistency. If a region fails, redirect traffic to the nearest healthy region — but be prepared for a spike in latency as caches warm.\n\nAlso, think about routing table propagation. In a multi-region setup, the routing table must be synchronized across regions to avoid split-brain scenarios. Use a global distributed consensus like etcd or ZooKeeper to manage routing metadata. Periodically validate that all regions agree on shard ownership.\n\nConsistency during index pushes is the hardest problem. Use a two-phase commit: prepare on all replicas, then commit. If one replica fails, abort the entire push. Never serve from a version vector conflict.\n\nAnd here's a real outage: a bug in the routing table update caused half the queries to hit a node that had been decommissioned. The fix was to include a generation number in every request and reject stale responses. Also, always run a canary query per shard before and after topology changes.",
      "code": {
        "language": "java",
        "filename": "ShardRouter.java",
        "code": "package io.thecodeforge.search.serving;\n\nimport java.util.*;\n\n/**\n * Routes queries to appropriate shards based on term hash.\n */\npublic class ShardRouter {\n    private final List<String> shardNodes;\n    private final int replicationFactor;\n\n    public ShardRouter(List<String> shardNodes, int replicationFactor) {\\n        this.shardNodes = shardNodes;\\n        this.replicationFactor = replicationFactor;\\n    }\n\n    public List<String> getNodesForTerm(String term) {\n        int hash = Math.abs(term.hashCode());\n        int primaryIndex = hash % shardNodes.size();\n        List<String> result = new ArrayList<>();\n        for (int i = 0; i < replicationFactor; i++) {\n            result.add(shardNodes.get((primaryIndex + i) % shardNodes.size()));\n        }\n        return result;\n    }\n}"
      },
      "callout": {
        "type": "info",
        "title": "Replication Consistency",
        "text": "Even with eventual consistency, you need a way to detect stale replicas during a query. Use version vectors and prefer the replica with the highest version. If no replica is up-to-date, the query may return stale results — that's the trade-off for availability."
      },
      "production_insight": "One of the most common production bugs in search systems is misrouting queries after a cluster topology change.\nWhen nodes are added or removed, consistent hashing minimises reshuffling, but you must update the routing table atomically.\nIf the router and the index store disagree on shard assignments, queries hit the wrong node and return empty results.\nAlways have a fallback: if a shard returns an error, retry on a different replica.\nAnother trap: shard hot spots where a few terms get much more traffic — monitor QPS per shard and rebalance when variance exceeds 20%.\nWhen adding nodes, pre-warm the new routing table to avoid a CPU spike on the serving tier.\nAlso, version vector conflicts can cause query inconsistency. Use a majority read protocol with a timestamp ordering to ensure at least one replica returns the latest index version.\nIf you detect a split in version vectors, the root node should retry on other replicas or fall back to cached results to avoid serving stale data.\nInclude a generation number in every request to detect stale routing.",
      "decision_tree": {
        "title": "Shard Distribution Strategy",
        "items": [
          {
            "condition": "High-traffic terms, even load important",
            "result": "Hash-based sharding with consistent hashing for reshuffling."
          },
          {
            "condition": "Prefix queries common (e.g., 'a*', 'b*')",
            "result": "Range-based sharding, but monitor for hot spots and use dynamic splitting."
          },
          {
            "condition": "Need to support geo-distributed serving",
            "result": "Use geo-aware consistent hashing; replicate popular shards across regions."
          },
          {
            "condition": "QPS variance across shards >20%",
            "result": "Rebalance shards or replicate hot shards to additional nodes."
          }
        ]
      },
      "key_takeaway": "Distributed serving is about routing, replication, and consistency.\nHash-based sharding gives even load; range-based gives query locality.\nReplication is for fault tolerance, but consistency during pushes is the hardest part.\nMonitor shard QPS variance and rebalance before hotspots cause cascading failures.\nIn distributed serving, consistency is the hardest axis — prioritize availability but never ignore version conflicts.\nAlways pre-warm new nodes routing tables and validate shard ownership per request."
    },
    {
      "heading": "Monitoring and Operational Excellence",
      "content": "A search engine is only as reliable as its weakest stage. You need monitoring at every boundary: crawler health (per-domain crawl rate, frontier depth), indexing pipeline (segment commit latency, merge I/O, corruption checks), serving tier (P50/P95/P99 latency, error rates, cache hit ratio), and ranking (feature drift, model AUC). Without this, you're flying blind.\n\nConcrete metrics to track:\n- Crawler: DNS resolution success rate, 429 responses per domain, redirect chain length distribution.\n- Indexer: last commit timestamp, number of segments, merge write amplification factor.\n- Serving: P99 query latency per shard, root node CPU, cache hit ratio (frontend and backend).\n- Ranking: feature distribution drift (PSI), model AUC per query cluster, fallback rate.\n\nAlerting thresholds: if crawler per-domain rate drops by 50% in 10 minutes, page. If indexer last commit is older than 2 minutes in a near-real-time index, page. If P99 latency exceeds 200ms for 1 minute, page. If model AUC drops by 0.05 relative to baseline, investigate.\n\nA common failure: silent index corruption without alerting. Always run periodic segment integrity checks (checksum verification) and alert if any segment fails. Also, after each merge, verify the new segment before exposing it.\n\nCapacity planning: index size grows roughly linearly with crawled pages. Monitor storage growth and plan for 1.5x headroom. Shard replicas consume more storage but are necessary for read throughput. Use tiered storage: hot segments on SSD, cold segments on HDD.\n\nRun regular load tests with pre-recorded query logs to validate SLOs under peak. Use chaos engineering to test replica failover and index version rollback procedures. Synthetic queries should be part of your smoke test suite — push a fake query that expects a known result and alert if it fails.\n\nAlso, don't forget to monitor the monitoring pipeline itself. If your metrics endpoint goes down, you won't know. Set up a heartbeat check on your monitoring infrastructure and have a separate alerting channel (e.g., PagerDuty) that doesn't rely on the same stack.\n\nSynthetic queries are your best friend. Run a fixed set of queries every minute and alert if results are stale. That catches indexer failures before users do.\n\nOne more practice: maintain a runbook for each critical component. When the synthetic query alert fires, you want to know exactly which commands to run. We wrote a runbook that cut mean-time-to-recovery from 45 minutes to 8.",
      "code": {
        "language": "bash",
        "filename": "monitoring_commands.sh",
        "code": "# Crawler health\ncurl http://crawler-metrics:9090/metrics | grep frontier_depth\ncurl http://crawler-metrics:9090/metrics | grep per_domain_rate\n\n# Indexer health\ncurl http://indexer-internal:8080/metrics | grep last_commit_seconds\ncurl http://indexer-internal:8080/metrics | grep 'segment_count|merge_write_amplification'\n\n# Serving latency (via Prometheus)\nquery=\"histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))\"\ncurl \"http://prometheus:9090/api/v1/query?query=$query\"\n\n# Ranking model drift (custom endpoint)\ncurl http://model-monitor:8080/feature_drift?feature=ctr"
      },
      "callout": {
        "type": "tip",
        "title": "Alert on Silence, Not Just Noise",
        "text": "For crawler and indexer, the most dangerous signal is no data at all. If a metric stops reporting, the component may have silently died. Set up 'absence of data' alerts on all critical metrics."
      },
      "production_insight": "Don't wait for a user-facing error to detect a stale index.\nSet up a synthetic query monitor: run a set of fixed queries every minute and measure result freshness.\nIf the results are older than a threshold (e.g., >5 minutes from expected), trigger a warning.\nThat saved a team I know: they caught a failure in the incremental index writer within 30 seconds, before any user noticed.\nAnother thing: monitor the difference between P50 and P99 latency — if it grows, you have a tail latency problem that needs hedging or timeout tuning.\nAlso, always have a dashboard with the top 5 errors per component.\nDon't forget to monitor your monitoring stack's own health — a silent metrics outage can blind you.",
      "decision_tree": {
        "title": "Monitoring Response Decision Tree",
        "items": [
          {
            "condition": "Crawler frontier depth drops 50% in 10 min",
            "result": "Check DNS, per-domain rate limit, and if the frontier coordinator is healthy."
          },
          {
            "condition": "Indexer last commit more than 5 min stale",
            "result": "Check indexer process health, disk space, and Kafka consumer lag."
          },
          {
            "condition": "P99 query latency >200ms for 1 min",
            "result": "Check root node CPU, slow shards, cache hit ratio. Reduce top-K or enable hedging."
          },
          {
            "condition": "Ranking model AUC drops 0.05",
            "result": "Check for feature drift, model deployment mismatch, or sudden change in user behavior. Rollback to previous model if needed."
          },
          {
            "condition": "Synthetic query returns stale results",
            "result": "Check indexer freshness, segment checksums, and Kafka topic lag."
          }
        ]
      },
      "key_takeaway": "Monitoring is not optional — it's the difference between knowing and guessing.\nInstrument every stage independently.\nSynthetic queries catch stale indices before users do.\nAlert on data absence, not just high values.\nCapacity planning must be proactive: track storage and throughput trends weekly."
    },
    {
      "heading": "Caching and Latency Optimization",
      "content": "Caching is the single biggest lever for reducing search latency. Google employs a multi-tier cache: frontend cache (global CDN for popular queries), segment cache (in-memory posting lists for hot terms), and result cache (entire SERP for frequent queries). The frontend cache alone can serve 80% of repeat queries, reducing latency from 100ms to under 10ms.\n\nBut caching brings its own challenges: stale results, cache invalidation, and memory pressure. For the frontend cache, use a TTL of a few minutes for news queries and longer for evergreen content. Use a write-through cache for index segments: when a new segment is published, the old segment's cache entries must be invalidated. A common pattern is to use a generation counter per segment — increment it on every merge, and include it in cache keys.\n\nAnother trap: caching too aggressively can mask problems. If you cache a bad result for 5 minutes, that's 5 minutes of user-facing errors. Always cache with a fallback: if a cached result is older than a threshold, refresh it asynchronously and serve the stale version only if the refresh fails.\n\nMemory for cache is finite. Use LRU eviction and monitor cache hit ratios per tier. If the segment cache hit ratio drops below 90%, you're either caching the wrong segments or your working set is too large. Consider adding more nodes or using tiered storage where hot segments stay in memory and cold segments are on SSD.\n\nCache stampede is a real threat. When a new segment is published, many queries may miss the cache simultaneously and all try to populate it at once. This can spike CPU and overwhelm the index servers. Use a mutex per cache key (distributed lock) to allow only one process to compute the missing value. Others wait and read the newly cached result.\n\nCache stampede is a real threat. When a new segment is published, many queries miss cache simultaneously and try to populate it. Use a distributed mutex per cache key to prevent duplicate compute.\n\nOne more nuance: the result cache should have different TTLs for different query categories. A weather query might be fresh for 15 minutes; a stock price query needs seconds. Use a TTL function based on query type or category. Also, never cache results for queries that have a high rate of change (e.g., 'breaking news') — instead, use a short TTL and a background refresh.",
      "code": {
        "language": "java",
        "filename": "SegmentCache.java",
        "code": "package io.thecodeforge.search.cache;\n\nimport java.util.*;\nimport java.util.concurrent.ConcurrentHashMap;\n\n/**\n * In-memory cache for hot index segments.\n */\npublic class SegmentCache {\n    private final int maxSegments;\n    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();\n    private final Queue<String> lru = new LinkedList<>();\n\n    public SegmentCache(int maxSegments) {\n        this.maxSegments = maxSegments;\n    }\n\n    public byte[] get(String segmentId) {\n        return cache.get(segmentId);\n    }\n\n    public void put(String segmentId, byte[] data) {\n        if (cache.size() >= maxSegments) {\n            String evicted = lru.poll();\n            if (evicted != null) cache.remove(evicted);\n        }\n        cache.put(segmentId, data);\n        lru.add(segmentId);\n    }\n}"
      },
      "callout": {
        "type": "info",
        "title": "Cache Invalidation Is Harder Than Caching",
        "text": "Invalidating the segment cache after a merge requires careful coordination. Use a versioned segment ID: when a new segment replaces an old one, change the ID. The cache will naturally evict the old entry on next access."
      },
      "production_insight": "Caching reduces latency but introduces staleness risks.\nIf you cache a result for too long, users see outdated information.\nSet TTLs based on query type: news queries need minutes, stock quotes need seconds.\nCache stampede can overwhelm servers when many queries miss simultaneously — use distributed locks.\nMonitor cache hit ratios per tier: if hit ratio drops below 90%, investigate why.\nAlways have a fallback for stale cache entries — serve stale only if async refresh fails.\nAnd never cache results for highly dynamic queries without a short TTL and a background refresh mechanism.",
      "decision_tree": {
        "title": "Cache Tier Decision",
        "items": [
          {
            "condition": "Query is popular and static (e.g., 'weather London')",
            "result": "Cache at CDN level with TTL of 5-15 minutes."
          },
          {
            "condition": "Query has high rate of change (e.g., 'breaking news')",
            "result": "Short TTL (seconds) with background refresh; serve stale only on refresh failure."
          },
          {
            "condition": "Segment is frequently accessed (hot term)",
            "result": "Cache segment in memory using LRU; monitor hit ratio."
          },
          {
            "condition": "Cache hit ratio drops below 90%",
            "result": "Increase cache size, add more nodes, or migrate hot segments to faster storage."
          }
        ]
      },
      "key_takeaway": "Caching is the single biggest lever for latency reduction.\nBut caching strategies must be per-query-type to balance freshness and speed.\nCache stampede can kill your availability — always use a mutex for cache population.\nMonitor hit ratios and adjust TTLs continuously."
    }
● Production incidentPOST-MORTEMseverity: high

Index Corruption from Malformed HTML

Symptom
Users reported missing search results for queries that should have matched popular pages. The error rate on search result pages jumped from 0.02% to 3.5% within minutes.
Assumption
The indexing pipeline assumed all parsed HTML produced clean, valid tokens. No validation layer existed between the HTML parser and the index writer.
Root cause
A page containing an unclosed <a> tag with a very long attribute caused the parser to allocate an unbounded buffer, resulting in a memory stomp that corrupted the index segment being written. The corrupted segment propagated to replicas during repair.
Fix
Add a per-segment hash verification before committing to the index store. Introduce a timeout and maximum token length in the parser. Deploy a quick rollback to the previous good index version using the shard's last verified checkpoint.
Key lesson
  • Never trust external input — validate and sanitize at every boundary.
  • Always store a checksum per index segment and verify before use.
  • Corruption in one shard can silently spread if replica repair copies bad data — use incremental repair with integrity checks.
  • Set parser timeouts and hardware limits to prevent resource exhaustion from malicious or malformed input.
  • Monitor segment integrity proactively: run a background checker that samples queries every minute and alerts on missing results.
Production debug guideCommon symptoms, immediate actions, and root cause analysis6 entries
Symptom · 01
High query latency (>500ms P99)
Fix
Check index server CPU and GC logs. Use iostat to check disk IO. Verify cache hit ratio for the segment postings list. If merge I/O spikes coincide with latency spikes, throttle merge bandwidth during peak hours. Also check if a cold cache on a new segment is causing the spike — pre-warm the segment by reading its header.
Symptom · 02
Stale or missing results for recently crawled pages
Fix
Check the indexing pipeline health: is the crawler producer keeping up? Is the index writer committing segments? Look for backpressure in the Kafka topic between crawler and indexer. Also verify the freshness signal from the crawler to the indexer — if the crawler marks a page as 'recrawl needed' but the indexer ignores the signal, results go stale.
Symptom · 03
Search results inconsistent across replicas
Fix
Compare index versions across shard replicas. Run a diff on the segment manifest. Check that the cluster has converged after a recent index push. Use version vectors in query responses to prefer the most up-to-date replica. If split-brain occurs, the root node should retry on other replicas or fall back to a cached result.
Symptom · 04
Spike in 'No results' responses for common queries
Fix
Check if inverted index lookups return empty. Verify that the vocabulary dictionary is intact. Inspect the last successful index refresh timestamp. Also check if a recent merge corrupted the term dictionary — run a segment integrity check. If a segment's checksum fails, roll back to the previous good version.
Symptom · 05
High memory usage on index node causing OOM
Fix
Check mmap region sizes and JVM heap settings. Use pmap -x <pid> to see mapped files. Reduce segment cache size or increase off-heap memory limits. Watch out for native memory leaks from off-heap segment storage — monitor /proc/<pid>/smaps for RSS growth.
Symptom · 06
Query latency varies significantly by region
Fix
Check regional replica health and routing table consistency. Ensure all regions have the latest index version. If cross-region replication lag is high, route users to the nearest region and accept eventual consistency. Monitor version vector propagation.
★ Search Engine Production Debug Cheat SheetQuick commands and steps to diagnose the most common search infrastructure failures.
Index not refreshing (stale results)
Immediate action
Check indexer process health and last commit timestamp
Commands
curl http://indexer-internal:8080/metrics | grep last_commit_seconds
kubectl logs -l app=indexer --tail=50 | grep 'commit failed'
Fix now
Restart the indexer and force a full re-index from the last good checkpoint
Query latency spikes at 5-minute intervals+
Immediate action
Check if GC pauses correspond to the indexing commit cycle
Commands
jstat -gcutil <pid> 1000 5 | awk '{print $17}'
grep 'Young GC' /var/log/indexer.log | tail
Fix now
Tune the GC by adjusting -XX:NewRatio and -XX:SurvivorRatio to reduce pause times
Serving tier returns 503 for high-traffic segments+
Immediate action
Check replica health and segment distribution across nodes
Commands
curl http://health-check:8080/replicas?segment=1234
kubectl describe pod search-node-5 | grep -A5 'Conditions:'
Fix now
Replicate the hot segment to additional nodes using zone-aware shard balancing
Spike in 404 responses for valid queries+
Immediate action
Verify the inverted index dictionary and term lookup
Commands
curl http://index-server:8080/vocab_check?term=the
head -c 100 /data/index/1234.termdict | xxd
Fix now
Restore term dictionary from backup and replay the index delta from WAL
Spike in 500 errors for geographically close users+
Immediate action
Check regional load balancer health and cross-region replication lag
Commands
curl http://health-check:8080/regional?latency
kubectl describe nodes --region=us-east | grep 'Memory'
Fix now
Temporarily route traffic to healthy region; investigate replication consistency.
Query latency varies significantly by region+
Immediate action
Check regional replica health and routing table consistency
Commands
curl http://health-check:8080/regional?latency
kubectl get pods -l app=search-node --all-regions | grep -v Running
Fix now
Repair routing table propagation; ensure all regions have complete shard replicas.
🔥

That's Real World. Mark it forged?

9 min read · try the examples if you haven't

Previous
Design Uber
6 / 17 · Real World
Next
Design Amazon