Senior 4 min · March 06, 2026

Search Autocomplete — Cache Poisoning & Trie Rollback

10 IPs poisoned our autocomplete cache with offensive suggestions in one hour.

N
Naren · Founder
Plain-English first. Then code. Then the interview question.
About
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • A search autocomplete system predicts completions as users type, serving top-K suggestions within 50ms
  • Core data structure: trie for prefix lookups, often combined with prefix hashing for memory/speed trade-off
  • Ranking blends global popularity, personal history, and recency — not just frequency
  • Caching layers: CDN for static top queries, Redis for hot prefixes, in-memory L1 cache on app servers
  • Sharding must preserve prefix locality — consistent hashing on prefix hash minimizes cross-node lookups
  • Biggest mistake: assuming depends_on ordering — containers start but DB isn't ready, causing connection pool exhaustion
Plain-English First

Imagine you're writing a letter and your friend looks over your shoulder, guessing the next word before you finish. They're not psychic — they've just read thousands of letters and know what words commonly follow others. Search autocomplete is exactly that friend, except instead of one friend it's a system serving millions of people at once, finishing your search query before your fingers can type it.

Every time you type a single character into Google, YouTube, or Amazon's search bar and watch suggestions instantly appear, you're experiencing one of the most latency-sensitive distributed systems in production software. That instant feel isn't magic — it's a carefully engineered pipeline involving in-memory data structures, aggressive multi-layer caching, real-time ranking signals, and sub-50ms response time SLAs enforced globally. Getting it wrong means users feel lag, and lag means they leave. Getting it right means you're invisibly influencing what billions of people search for every day.

The core problem autocomplete solves is bridging the gap between intent and query. Users often don't know exactly what they want to type. A partial string like 'how to fix my' could complete in a hundred directions. The system must predict the most relevant and popular completions in real time, rank them by a blend of global popularity, personal history, and contextual signals — all before the user lifts their finger from the keyboard. It also needs to handle billions of daily queries, top-K ranking, typo tolerance, personalization, and graceful degradation when parts of the system fail.

By the end of this article you'll be able to walk into any senior system design interview and confidently design a production-grade autocomplete system from scratch. You'll understand why a trie beats a relational database for prefix lookups, how to scale it with sharding and caching layers, how ranking signals layer on top of raw prefix matching, and what real failure modes look like in production — including the ones that bite engineers at 3am.

What is a Search Autocomplete System?

A search autocomplete system (also called typeahead) provides real-time search suggestions as the user types. At its core, it's a prefix-matching problem: given a partial input, return the most likely completions. But production reality adds layers: handle billions of unique prefixes daily, rank by multiple signals, support multiple languages, tolerate typos, and do it all under 50ms at p99.

You'll find autocomplete in Google Search, YouTube, Amazon, Twitter, and any large-scale product where reducing friction improves user engagement. The design principles here apply equally to code completion in IDEs and command-line autocomplete.

io/thecodeforge/autocomplete/TrieNode.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
package io.thecodeforge.autocomplete;

import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class TrieNode {
    Map<Character, TrieNode> children = new HashMap<>();
    boolean isEnd;
    int frequency;
    PriorityQueue<String> topK = new PriorityQueue<>(
        (a, b) -> Integer.compare(getFrequency(b), getFrequency(a))
    );

    // For real production: maintain a bounded top-K heap per node
    // Updates require merge of child frequencies on insertion
}
Trie as a Decision Tree
  • Each node represents a character — not a string.
  • Shared prefixes share nodes — memory efficient for common prefixes.
  • Top-K suggestions per node avoid traversing entire subtree on every keystroke.
  • Frequency at leaf nodes aggregates across user sessions.
Production Insight
Pure in-memory trie for 100M unique queries consumes ~10GB+ with naive approach.
Use prefix hashing (hash of first 2-3 chars) to shard across servers — each server holds one hash range.
Rule: always bound memory per node; unbounded top-K heaps cause OOM under load.
Key Takeaway
A trie gives O(k) prefix lookup where k is input length.
Ranking requires storing precomputed top-K per node.
Production: trie alone is insufficient — layer with cache and sharding.

Data Structures: Trie vs Prefix Hashing

Two dominant approaches for prefix matching: the classic trie and prefix hashing. A trie offers deterministic lookups and easy prefix traversal, but memory can balloon. Prefix hashing uses a hash map keyed by prefix strings (e.g., first 3 characters) and stores lists of completions per key. It's simpler and more cache-friendly but loses prefix-sharing benefits.

Real systems often hybridize: use a compressed trie (radix tree) for hot prefixes and fall back to prefix hash tables for rare prefixes. Google's approach, for example, uses a multi-tiered index with a memory-mapped trie for the top million queries and disk-backed hash tables for long-tail completions. Twitter's autocomplete uses a similar hybrid — in-memory trie for trending topics, disk-based for historical queries.

io/thecodeforge/autocomplete/PrefixHashStore.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
package io.thecodeforge.autocomplete;

import java.util.concurrent.ConcurrentHashMap;
import java.util.List;
import java.util.ArrayList;

public class PrefixHashStore {
    // key: first 3 chars of query, value: sorted list of completions
    private ConcurrentHashMap<String, List<Completion>> store = new ConcurrentHashMap<>();

    public List<Completion> get(String prefix) {
        String key = prefix.length() > 3 ? prefix.substring(0, 3) : prefix;
        return store.getOrDefault(key, List.of());
    }

    public void put(String prefix, List<Completion> completions) {
        String key = prefix.length() > 3 ? prefix.substring(0, 3) : prefix;
        store.put(key, completions);
    }
}

// Note: Updates must be atomic; use replace() with CAS to avoid lost writes
Beware: Prefix Hashing Skew
Prefix hashing works well only when prefixes are uniformly distributed. 'a' prefixes dominate in English. If you use 3-char hashing, the 'a*' shard will handle 40% of traffic. Fix: use consistent hashing on the prefix hash, and replicate hot shards.
Production Insight
Trie nodes are small objects — GC pressure on JVM under high write load.
Prefix hashing uses contiguous arrays — better cache locality but no prefix sharing.
Compromise: use hash of first 2 chars as shard key, trie within shard.
Rule: profile memory and GC before deciding; never assume one-size-fits-all.
Key Takeaway
Trie: memory overhead, O(k) lookup, good for dynamic updates.
Prefix hash: less memory, O(1) lookup, bad for prefix expansion.
Hybrid: radix tree for hot paths + hash for tail queries wins in production.

Ranking: What Order to Show Suggestions?

Users don't just want any completion — they want the most relevant one. Relevance is a blend of signals: global popularity (most searched queries with that prefix), personal history (your past searches), recency (trending queries), and context (location, device). Ranking in real time at 50ms is the hard part.

Most systems precompute a ranking score per query-prefix pair offline using MapReduce or Spark jobs. The score is a weighted sum: w1 log(frequency) + w2 (1 / days_since_last_search) + w3 * personal_boost. Real-time personalization adds an extra fetch from user profile cache at query time, merging personal top queries into the global top-K list before returning.

io/thecodeforge/autocomplete/Ranker.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
package io.thecodeforge.autocomplete;

import java.util.*;
import java.util.stream.Collectors;

public class Ranker {
    private static final double FREQ_WEIGHT = 0.6;
    private static final double RECENCY_WEIGHT = 0.3;
    private static final double PERSONAL_WEIGHT = 0.1;

    public List<Completion> rank(List<Completion> global, List<Completion> personal, int k) {
        // Merge global and personal completions with personal boost
        Map<String, Double> scores = new HashMap<>();
        for (Completion c : global) {
            scores.put(c.query, FREQ_WEIGHT * c.score);
        }
        for (Completion c : personal) {
            scores.merge(c.query, PERSONAL_WEIGHT * c.score, Double::sum);
        }
        return scores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(k)
            .map(e -> new Completion(e.getKey(), e.getValue()))
            .collect(Collectors.toList());
    }
}

// In production: scores are precomputed; real-time merge only for personalization
Common Pitfall
Don't compute ranking scores online. Precompute and store the top-K per prefix in the trie node itself. Online merging should only add personal context. Otherwise your p99 will exceed 200ms.
Production Insight
Precomputed ranking takes hours — stale if trending query appears.
Real-time trend detection polls recent search logs every 1-5 minutes.
Cold start for new queries: rely on NLP model (e.g., n-gram) to bootstrap scores.
Rule: always separate batch scoring pipeline from real-time serving.
Key Takeaway
Ranking is a blend of global frequency, recency, and personalization.
Precompute scores offline; merge personal history online.
Always return top-K quickly — 50ms SLA means no complex computation per request.

Caching for Sub-50ms Latency

Autocomplete is the poster child for multi-layer caching. Every microsecond counts. You'll typically see three layers: (1) Browser cache — service worker caches recent autocomplete responses, (2) CDN — edge caches for top 1% of prefixes (e.g., 'a', 'b'), (3) Application cache — Redis or Memcached for hot prefixes not in CDN, (4) In-memory L1 cache in application server — LRU cache with TTL, (5) The trie itself as the source of truth.

Cache invalidation is tricky: when a new query becomes popular (e.g., after a news event), must invalidate all prefixes that contain it. Use a bloom filter to quickly test if a prefix might need invalidation — avoids scanning entire cache.

io/thecodeforge/autocomplete/CacheManager.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
package io.thecodeforge.autocomplete;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class CacheManager {
    // L1 cache: in-memory, 10K entries, 30 second TTL
    private Cache<String, List<Completion>> l1 = CacheBuilder.newBuilder()
        .maximumSize(10_000)
        .expireAfterWrite(30, TimeUnit.SECONDS)
        .build();

    public List<Completion> get(String prefix) {
        List<Completion> result = l1.getIfPresent(prefix);
        if (result != null) return result;
        // fallback to Redis (L2)
        result = redisClient.get(prefix);
        if (result != null) {
            l1.put(prefix, result); // populate L1 on miss
        }
        return result;
    }
}

// L3 (trie) is only hit when both L1 and L2 miss
Cache stampede on prefix miss
When a rare prefix is requested simultaneously by many users, multiple servers will miss cache and hit the trie simultaneously. Use mutex per prefix key — first one locks, others wait. Go's singleflight pattern or Java's LoadingCache handles this.
Production Insight
CDN caching for top prefixes saves 30% backend requests.
But CDN caches entire response — must version autocomplete payloads.
Stale suggestions degrade UX — TTL must align with ranking update frequency.
Rule: never cache by query alone — cache by (prefix, locale, user_segment) tuple.
Key Takeaway
Multi-layer cache: browser → CDN → Redis → in-memory → trie.
Use bloom filter for fast cache invalidation.
Mutex on cache miss prevents thundering herd.

Sharding and Scaling Across Servers

A single server can't hold the entire autocomplete index for billions of queries. You need to shard. The challenge: prefix lookups cross shards unless you shard intelligently. Common strategy: consistent hashing on the first 2-3 characters of the query. All queries with prefix 'ab' go to shard 7, 'ac' to shard 3, etc. This keeps prefix locality — most lookups hit one shard.

But prefix skew is real: 'a' prefixes are 40% of traffic. Solution: replicate hot shards (e.g., 10 copies of 'a*' shard), and use a load balancer that routes by prefix hash to one of the replicas. For cold prefixes, a single shard per hash range suffices.

To handle high write throughput (ingesting new queries), use a write-ahead log (WAL) before updating the trie, and batch updates to the shard every few seconds. This also handles replica consistency.

io/thecodeforge/autocomplete/ShardRouter.javaJAVA
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
package io.thecodeforge.autocomplete;

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class ShardRouter {
    private final List<ShardGroup> shardGroups;
    private static final int REPLICATION_FACTOR = 10;

    public String selectShard(String prefix) {
        int hash = Hashing.murmur3_32()
            .hashString(prefix.substring(0, Math.min(3, prefix.length())), StandardCharsets.UTF_8)
            .asInt();
        // hot shards get multiple nodes via replication factor
        int index = Math.abs(hash) % (shardGroups.size() * REPLICATION_FACTOR);
        int groupIndex = index / REPLICATION_FACTOR;
        ShardGroup group = shardGroups.get(groupIndex);
        // pick replica round-robin
        return group.getReplica();
    }
}

// Consistent hashing preferred for minimizing reshuffling on scale-up
Real Production: Google's Approach
Google shards by the first letter plus a uniform hash of the rest. The first letter determines the serving cluster (for locality), within cluster they use consistent hashing on the full prefix.
Production Insight
Skewed prefix distribution causes hot shard — monitor p99 latency per shard.
Autoscale replicas for hot shards, but watch for cross-shard queries (rare for autocomplete since prefix locality works).
Writes must be idempotent — retry on WAL to leader shard, then async replication.
Rule: test shard distribution with real query logs before choosing hash function.
Key Takeaway
Shard by prefix (first 2-3 chars) for locality.
Replicate hot shards for capacity.
Consistent hashing minimizes movement when adding shards.

Production Incident: Autocomplete Cache Poisoning

A real incident from a large e-commerce platform: autocomplete started showing offensive suggestions for normal queries. Root cause was a cache poisoning attack — attacker sent malicious queries that got cached, then served to all users. The fix required purging all caches and adding input sanitization before caching. This incident led to a new practice: never cache a query's suggestions without validating the output against toxicity filters.

Trust No Input
  • Always sanitize output before caching: strip profanity, PII, and SQL injection patterns.
  • Use a separate freshness score for user-generated queries vs aggregated popular queries.
  • Rate-limit query ingestion per user to prevent poisoning at input.
Production Insight
Cache poisoning spreads quickly — after first write, every read serves poisoned data.
Rollback strategies: version caches so you can revert to previous version.
Rule: treat autocomplete as a user-facing output — apply the same safety checks as search results.
Key Takeaway
Sanitize output before caching.
Version caches for instant rollback.
Rate-limit query ingestion per user.
● Production incidentPOST-MORTEMseverity: high

Autocomplete Cache Poisoning at Scale

Symptom
Users reported inappropriate suggestions for benign prefixes like 's', 'p', 'a'. Internal monitoring showed a sudden spike in query ingestion from 10 IPs.
Assumption
Engineers assumed the autocomplete trie was read-only and could not be poisoned because suggestions came from aggregated data. They missed that the system accepted new queries directly into the cache layer without validation.
Root cause
No output sanitization before caching. The autocomplete service stored any query that appeared >100 times in the last hour into the trie and cache, without checking its content. Attackers sent offensive queries with high frequency to trigger caching.
Fix
1. Purged all cached suggestions (CDN, Redis, in-memory). 2. Added a toxicity filter (TensorFlow-based model) to all new query ingestions. 3. Implemented a cache versioning system to allow rollback to a known-good state. 4. Rate-limited query ingestion per user IP to 1000/hour.
Key lesson
  • Never trust user input — sanitize before storing in any data structure.
  • Cache poisoning can happen even without direct write access if the cache is fed by user activity.
  • Always have a cache rollback mechanism — versioning is cheap, data loss is expensive.
Production debug guideStep-by-step actions for common autocomplete performance issues.4 entries
Symptom · 01
P99 response time > 200ms
Fix
Check L1 cache hit ratio. If < 80%, increase L1 cache size or warm up cache with top prefixes. Also check shard imbalance — a single shard may be overwhelmed.
Symptom · 02
Suggestions not updating for trending queries
Fix
Verify the offline ranking job completed successfully. Check logs for MapReduce/Spark failures. Manual trigger: replay last hour of query log through the batch pipeline.
Symptom · 03
Inappropriate suggestions appearing
Fix
Check recent cache writes — identify new queries with high frequency. Temporarily disable cache writes and enable toxicity filtering. Run a cache flush for affected prefixes.
Symptom · 04
Memory usage on app servers climbing
Fix
Check if L1 cache unlimited growth — enforce maximum size and eviction policy. Also check trie node count; if nodes are not pruned for stale queries, memory leaks. Add TTL for trie nodes via a background cleanup thread.
★ Autocomplete Quick Debug Cheat SheetCommands and immediate actions for autocomplete production issues.
High latency on specific prefix
Immediate action
Check if shard for that prefix is hot.
Commands
curl -s http://shard-monitor:8080/health | jq '.shard_load'
kubectl logs -l app=autocomplete -c sidecar --tail=50 | grep 'prefix=a'
Fix now
Temporarily route prefix to replica shard via config update.
Cache miss rate > 50%+
Immediate action
Check L1 cache size and TTL.
Commands
redis-cli INFO stats | grep keyspace_hit_rate
jstack <pid> | grep -A 20 'CacheManager.get'
Fix now
Increase L1 max size and reduce TTL to 15 seconds for hot prefixes.
Offensive suggestions served+
Immediate action
Block cache writes and flush CDN.
Commands
curl -X POST http://cache-admin:8080/flush?scope=autocomplete
kubectl delete pod -l app=autocomplete --force (if cache is in-memory per pod)
Fix now
Enable toxicity filter and rate-limit query ingestion.
Data Structure Comparison for Autocomplete
PropertyTriePrefix HashRadix Tree (Compressed Trie)
Lookup timeO(k) where k = input lengthO(1) for fixed-length prefixO(k) but fewer nodes
Memory usageHigh (each character is a node)Low (store strings directly)Medium (compresses common prefixes)
Prefix expansion (all completions)Natural — traverse subtreeRequires storing all completions per prefix keyNatural — similar to trie
Dynamic updatesEasy — insert nodesRequires rebuilding per prefix keyModerate — may split/merge nodes
Cache friendlinessPoor (pointer chasing)Good (array of completions)Better than trie

Key takeaways

1
Autocomplete is a prefix-matching problem solved with tries, sharded by prefix hash for scale.
2
Multi-layer caching (browser, CDN, Redis, in-memory, trie) is essential for <50ms SLA.
3
Ranking = global frequency + recency + personalization; precompute scores offline.
4
Shard by first 2-3 characters for prefix locality; replicate hot shards.
5
Always sanitize output before caching to prevent cache poisoning.
6
Rate-limit query ingestion and validate input to keep suggestions relevant and safe.

Common mistakes to avoid

4 patterns
×

Using a relational database for prefix lookups

Symptom
Queries like 'SELECT * FROM queries WHERE text LIKE 'prefix%' ORDER BY frequency DESC LIMIT 5' cause full table scans or index scans that take 200ms even with indexes, because the database must scan all matching prefixes and sort them.
Fix
Use a trie or prefix hash structure in-memory. If you must use a DB, use a specialized full-text search index (e.g., Elasticsearch with edge-gram tokenizer) that supports prefix queries efficiently.
×

Not handling personalization at query time

Symptom
All users see same suggestions — no personalization. But users expect their past searches to appear first. Conversion rates drop because users need to retype queries.
Fix
Store per-user recent searches in a Redis sorted set. At query time, merge the user's personal top 3 into the global top 5 list before returning, boosting personal results.
×

Ignoring typo tolerance

Symptom
Users who misspell 'reciepe' get no suggestions. They abandon the search or switch to a different app. Autocomplete should correct typos using a Levenshtein automaton or a bloom filter variant.
Fix
Implement a BK-tree or use an edit-distance-based approach. Store alternative spellings in the trie at a higher level. For production, use a spelling correction API as fallback when trie returns no results.
×

No rate limiting on query ingestion

Symptom
An attacker floods the autocomplete endpoint with fake queries, polluting the trie and cache with garbage. All users start seeing irrelevant suggestions.
Fix
Rate-limit query ingestion per user (or IP) to 1000 per hour. Validate query content with a classifier before storing. Use a bloom filter to deduplicate identical queries from multiple users.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How would you design a search autocomplete system for a global search en...
Q02SENIOR
What are the trade-offs between a trie and a prefix hash for autocomplet...
Q03SENIOR
Explain how you would handle cache invalidation for autocomplete when a ...
Q01 of 03SENIOR

How would you design a search autocomplete system for a global search engine like Google? Focus on data structures, scaling, and ranking.

ANSWER
Start with a trie for prefix lookups. Each node stores a top-K priority queue of completions based on global frequency (precomputed offline). For scaling, use consistent hashing on the first 2-3 characters to shard the trie across servers. Replicate hot shards (e.g., 'a' prefix). Add multi-layer caching: browser cache, CDN for top 1% prefixes, Redis for hot, in-memory LRU. Ranking blends global frequency, recency, and personalization. Personalization merges user's recent searches from Redis at query time. Updates: batch process query logs every 5 minutes to update trie nodes. Support typo tolerance via BK-tree or n-gram similarity. Ensure <50ms p99 latency.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is a search autocomplete system in simple terms?
02
Why is a trie better than a database for autocomplete?
03
How does autocomplete handle multiple languages?
04
What is the typical latency SLA for autocomplete?
05
How does autocomplete prevent offensive suggestions?
🔥

That's Real World. Mark it forged?

4 min read · try the examples if you haven't

Previous
Design a Notification System
13 / 17 · Real World
Next
Design a Payment System