Search Autocomplete — Cache Poisoning & Trie Rollback
10 IPs poisoned our autocomplete cache with offensive suggestions in one hour.
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
- A search autocomplete system predicts completions as users type, serving top-K suggestions within 50ms
- Core data structure: trie for prefix lookups, often combined with prefix hashing for memory/speed trade-off
- Ranking blends global popularity, personal history, and recency — not just frequency
- Caching layers: CDN for static top queries, Redis for hot prefixes, in-memory L1 cache on app servers
- Sharding must preserve prefix locality — consistent hashing on prefix hash minimizes cross-node lookups
- Biggest mistake: assuming depends_on ordering — containers start but DB isn't ready, causing connection pool exhaustion
Imagine you're writing a letter and your friend looks over your shoulder, guessing the next word before you finish. They're not psychic — they've just read thousands of letters and know what words commonly follow others. Search autocomplete is exactly that friend, except instead of one friend it's a system serving millions of people at once, finishing your search query before your fingers can type it.
Every time you type a single character into Google, YouTube, or Amazon's search bar and watch suggestions instantly appear, you're experiencing one of the most latency-sensitive distributed systems in production software. That instant feel isn't magic — it's a carefully engineered pipeline involving in-memory data structures, aggressive multi-layer caching, real-time ranking signals, and sub-50ms response time SLAs enforced globally. Getting it wrong means users feel lag, and lag means they leave. Getting it right means you're invisibly influencing what billions of people search for every day.
The core problem autocomplete solves is bridging the gap between intent and query. Users often don't know exactly what they want to type. A partial string like 'how to fix my' could complete in a hundred directions. The system must predict the most relevant and popular completions in real time, rank them by a blend of global popularity, personal history, and contextual signals — all before the user lifts their finger from the keyboard. It also needs to handle billions of daily queries, top-K ranking, typo tolerance, personalization, and graceful degradation when parts of the system fail.
By the end of this article you'll be able to walk into any senior system design interview and confidently design a production-grade autocomplete system from scratch. You'll understand why a trie beats a relational database for prefix lookups, how to scale it with sharding and caching layers, how ranking signals layer on top of raw prefix matching, and what real failure modes look like in production — including the ones that bite engineers at 3am.
What is a Search Autocomplete System?
A search autocomplete system (also called typeahead) provides real-time search suggestions as the user types. At its core, it's a prefix-matching problem: given a partial input, return the most likely completions. But production reality adds layers: handle billions of unique prefixes daily, rank by multiple signals, support multiple languages, tolerate typos, and do it all under 50ms at p99.
You'll find autocomplete in Google Search, YouTube, Amazon, Twitter, and any large-scale product where reducing friction improves user engagement. The design principles here apply equally to code completion in IDEs and command-line autocomplete.
- Each node represents a character — not a string.
- Shared prefixes share nodes — memory efficient for common prefixes.
- Top-K suggestions per node avoid traversing entire subtree on every keystroke.
- Frequency at leaf nodes aggregates across user sessions.
Data Structures: Trie vs Prefix Hashing
Two dominant approaches for prefix matching: the classic trie and prefix hashing. A trie offers deterministic lookups and easy prefix traversal, but memory can balloon. Prefix hashing uses a hash map keyed by prefix strings (e.g., first 3 characters) and stores lists of completions per key. It's simpler and more cache-friendly but loses prefix-sharing benefits.
Real systems often hybridize: use a compressed trie (radix tree) for hot prefixes and fall back to prefix hash tables for rare prefixes. Google's approach, for example, uses a multi-tiered index with a memory-mapped trie for the top million queries and disk-backed hash tables for long-tail completions. Twitter's autocomplete uses a similar hybrid — in-memory trie for trending topics, disk-based for historical queries.
Ranking: What Order to Show Suggestions?
Users don't just want any completion — they want the most relevant one. Relevance is a blend of signals: global popularity (most searched queries with that prefix), personal history (your past searches), recency (trending queries), and context (location, device). Ranking in real time at 50ms is the hard part.
Most systems precompute a ranking score per query-prefix pair offline using MapReduce or Spark jobs. The score is a weighted sum: w1 log(frequency) + w2 (1 / days_since_last_search) + w3 * personal_boost. Real-time personalization adds an extra fetch from user profile cache at query time, merging personal top queries into the global top-K list before returning.
Caching for Sub-50ms Latency
Autocomplete is the poster child for multi-layer caching. Every microsecond counts. You'll typically see three layers: (1) Browser cache — service worker caches recent autocomplete responses, (2) CDN — edge caches for top 1% of prefixes (e.g., 'a', 'b'), (3) Application cache — Redis or Memcached for hot prefixes not in CDN, (4) In-memory L1 cache in application server — LRU cache with TTL, (5) The trie itself as the source of truth.
Cache invalidation is tricky: when a new query becomes popular (e.g., after a news event), must invalidate all prefixes that contain it. Use a bloom filter to quickly test if a prefix might need invalidation — avoids scanning entire cache.
Sharding and Scaling Across Servers
A single server can't hold the entire autocomplete index for billions of queries. You need to shard. The challenge: prefix lookups cross shards unless you shard intelligently. Common strategy: consistent hashing on the first 2-3 characters of the query. All queries with prefix 'ab' go to shard 7, 'ac' to shard 3, etc. This keeps prefix locality — most lookups hit one shard.
But prefix skew is real: 'a' prefixes are 40% of traffic. Solution: replicate hot shards (e.g., 10 copies of 'a*' shard), and use a load balancer that routes by prefix hash to one of the replicas. For cold prefixes, a single shard per hash range suffices.
To handle high write throughput (ingesting new queries), use a write-ahead log (WAL) before updating the trie, and batch updates to the shard every few seconds. This also handles replica consistency.
Production Incident: Autocomplete Cache Poisoning
A real incident from a large e-commerce platform: autocomplete started showing offensive suggestions for normal queries. Root cause was a cache poisoning attack — attacker sent malicious queries that got cached, then served to all users. The fix required purging all caches and adding input sanitization before caching. This incident led to a new practice: never cache a query's suggestions without validating the output against toxicity filters.
- Always sanitize output before caching: strip profanity, PII, and SQL injection patterns.
- Use a separate freshness score for user-generated queries vs aggregated popular queries.
- Rate-limit query ingestion per user to prevent poisoning at input.
System Requirements: The Non-Negotiable Contract
Before you write a single line of trie code, you must know what the system owes the user. This isn't academic. It's a contract. Functional requirements are the explicit promises: instant match ideas as you type, accurate suggestions ranked by popularity, personalization based on location and history, and graceful data ingestion from billions of queries.
Non-functional requirements are the ones that will wake you up at 3 AM. Sub-50ms latency at P99. 99.99% availability. Global distribution so a user in Lagos doesn't wait longer than one in San Francisco. You don't get to hand-wave these. Every architectural decision—sharding key, caching strategy, replication factor—traces back to these numbers.
A common failure is treating non-functional requirements as aspirational. They're constraints. If your trie shard takes 200ms to respond, you've already lost. Nail these down first, or your autocomplete will be a source of shame, not delight.
Capacity Estimation: Don't Guess, Calculate
You can't scale what you haven't measured. Capacity estimation forces you to answer the hard questions before they become emergency PagerDuty alerts. Start with traffic: how many queries per second? Google processes 3.5 billion searches per day. That's 40,000 QPS. Peak bursts double that.
Storage follows. Each autocomplete suggestion is a prefix-keyed entry in a trie. Assume 200 bytes per node. For 100 million unique prefixes, that's 20 GB. Now multiply by replication factor (3x). You need 60 GB of trie data in memory—just for one shard. This is why you're not using a relational database for this.
Bandwidth is the silent killer. If each suggestion response is 500 bytes, and you serve 40,000 QPS, you need 160 Mbps outbound. At peak, you're pushing 320 Mbps. If your network interface caps at 1 Gbps, you have headroom. Barely. Do this math now, not when you're explaining to your VP why the site is loading 2 seconds of suggestions.
Clients: The Edge of the Autocomplete Pipeline
Clients are the first and last mile of autocomplete. Why focus on them? Because a server-side optimization is wasted if the client introduces latency through naive rendering or excessive network calls. The browser or mobile app must debounce user input—typically 150-300ms—to avoid flooding the API with every keystroke. On the mobile client, a local Bloom filter or small trie can cache top-1000 suggestions per prefix, cutting network round trips by 80% for repeated queries. The client must also handle stale suggestions gracefully: show cached results immediately, then patch with fresh server data. Never block typing. Use request deduplication: if a user types 'ap' then 'app' quickly, abort the 'ap' request. This prevents wasted server work and reduces perceived latency. Production trap: an unbounded local cache causes memory pressure on low-end devices. Enforce a size limit, e.g., 2MB, and LRU eviction. Key takeaway: autocomplete is a distributed system where the client is a caching node, not a dumb display.
API Gateway: Traffic Cop for Autocomplete
The API gateway is the single entry point for all autocomplete requests. Why not let clients talk directly to services? Because a gateway centralizes rate limiting, authentication, and request coalescing. For autocomplete, the gateway must enforce a rate limit per user—usually 10 requests per second—to prevent abuse and protect the trie servers. It also normalizes input: strip whitespace, convert to lowercase, and reject prefixes longer than 50 characters to avoid pathological memory lookups. The gateway should coalesce identical concurrent requests: if two clients type 'ap' at the same millisecond, the gateway issues a single lookup to the backend and broadcasts the result. This reduces backend load by up to 40% in peak traffic. Production trap: rate limiting should differentiate between a power user and a bot. Use sliding window counters per session, not per IP. IP-based rate limiting fails under carrier-grade NAT where thousands of users share one IP. The gateway must also log latency percentiles: p99 below 50ms is the contract. Key takeaway: the gateway is not a proxy; it is a rate limiter, coalescer, and sanitizer.
Autocomplete Cache Poisoning at Scale
- Never trust user input — sanitize before storing in any data structure.
- Cache poisoning can happen even without direct write access if the cache is fed by user activity.
- Always have a cache rollback mechanism — versioning is cheap, data loss is expensive.
curl -s http://shard-monitor:8080/health | jq '.shard_load'kubectl logs -l app=autocomplete -c sidecar --tail=50 | grep 'prefix=a'Key takeaways
Common mistakes to avoid
4 patternsUsing a relational database for prefix lookups
Not handling personalization at query time
Ignoring typo tolerance
No rate limiting on query ingestion
Interview Questions on This Topic
How would you design a search autocomplete system for a global search engine like Google? Focus on data structures, scaling, and ranking.
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
That's Real World. Mark it forged?
7 min read · try the examples if you haven't