Full-Text Search in SQL: Stopwords Hide 'ink' Products
Stopwords hid 'ink' from FTS results; LIKE found all products.
20+ years shipping high-throughput database systems. Drawn from code that ran under real load.
- Full-text search (FTS) uses an inverted index mapping tokens to row IDs. LIKE '%term%' forces full table scan — 1M rows = 10 seconds. FTS = 100ms.
- MySQL FTS uses InnoDB auxiliary tables; rebuild requires OPTIMIZE TABLE (table lock). PostgreSQL uses GIN indexes supports CONCURRENTLY (zero downtime).
- Ranking: MySQL uses TF-IDF (no length normalisation); PostgreSQL uses BM25 (normalises by document length — fairer for mixed-length content).
- Stopwords (the, a, ink, set) are filtered by default. If 'ink' is a product name, your customers see zero results. Always review stopword list for your domain.
- Stemming reduces 'running' → 'run'. Aggressive stemmers conflate unrelated words (marketing → market). Risk: false positives.
- Production trap: index staleness after bulk insert. FTS index doesn't auto-update immediately for all engines. Run CHECK TABLE or REINDEX after batch jobs.
- Biggest mistake: using FTS for exact identifiers (ID-123). Tokeniser splits on hyphens, ignores short tokens. Use B-Tree for IDs.
Imagine you walk into a giant library with a million books and ask the librarian for every book that mentions 'dragons'. A LIKE search is like that librarian reading every single page of every book from cover to cover. Full-text search is like the librarian pulling out a pre-built card catalogue that maps every word directly to which books contain it — the answer comes back in seconds, not hours. That card catalogue is the full-text index, and building it smartly is exactly what this article is about.
Most production applications reach a point where LIKE '%search_term%' stops cutting it. You've got a product catalogue with 500,000 rows, a blog with a decade of articles, or a support ticket system with millions of entries — and users expect Google-quality search that returns relevant results in under 100 milliseconds. LIKE with a leading wildcard can't use a B-Tree index, so the database performs a full sequential scan every single time. At scale, that's a query that grinds your entire application to a halt.
Full-text search (FTS) solves this with an entirely different data structure — an inverted index — that maps individual tokens (words) back to the rows that contain them. But it does far more than just fast lookups. It understands language: it knows that 'running', 'ran', and 'runs' are forms of the same root word. It knows that 'the', 'a', and 'is' are so common they're noise. It can rank results by relevance so that the most useful rows float to the top. These capabilities live inside your existing SQL database, no Elasticsearch cluster required.
By the end of this article you'll understand exactly how an inverted index is built and queried, how to set up and tune full-text search in both MySQL and PostgreSQL, how relevance ranking actually works under the hood, when FTS is the right tool versus when you should reach for a dedicated search engine, and every production gotcha that bites teams who ship FTS without reading the fine print.
Why Full-Text Search in SQL Is Not Just LIKE with a Fancy Index
Full-text search in SQL is a specialized indexing and querying mechanism that tokenizes text into words (terms), stores them in an inverted index, and supports linguistic operations like stemming, ranking, and stopword filtering. Unlike a B-tree index used with LIKE '%pattern%', which scans every row and cannot use an index efficiently, a full-text index maps each term to its document locations, enabling sub-second searches over millions of rows.
At its core, full-text search builds an inverted index: for each unique word, it stores a list of document IDs and positions. Queries like CONTAINS or FREETEXT (T-SQL) or MATCH ... AGAINST (MySQL) leverage this index to find documents containing specific words or phrases, rank results by relevance using TF-IDF or BM25, and apply language-specific rules such as stemming (e.g., 'running' matches 'run') and stopword removal (common words like 'the' are ignored). This is fundamentally different from pattern matching — it's a set-based retrieval, not a scan.
Use full-text search when your application needs to search natural language text — product descriptions, articles, emails — at scale. It's the right tool when you need relevance ranking, linguistic variants, or fast search over large text columns. But it's not a replacement for a dedicated search engine like Elasticsearch when you need faceted search, fuzzy matching, or real-time indexing at massive scale. In SQL databases, full-text search fills the gap between simple LIKE queries and external search infrastructure.
How the Inverted Index Works Internally
An inverted index is the data structure behind all full-text search engines — SQL's built-in FTS included. It maps every distinct word (token) to the list of rows where that word appears, along with its position and frequency inside each document.
When you insert or update a row, the database tokenizes the text, discards stopwords, applies stemming, and builds (or appends to) this mapping. The result is a structure you can query in logarithmic time: look up the token, get the row list, apply ranking, return top N.
MySQL stores its FTS index in separate internal InnoDB tables (auxiliary tables). The main table holds the data; the FTS index is in hidden tables (INNODB_FTS_INDEX_TABLE, INNODB_FTS_INDEX_CACHE). You cannot query them directly for debugging, but they exist.
PostgreSQL uses GiST or GIN indexes, which store lexemes (stemmed tokens) directly in the index structure. GIN is faster for lookups (read-heavy), GiST is faster for updates (write-heavy). You can query the tsvector column directly for debugging.
How a query is executed: When you run MATCH(col) AGAINST('search'), the database: (1) tokenizes and stems the query text, (2) looks up each token in the inverted index, (3) retrieves posting lists (list of row IDs), (4) optionally intersects them for multi-word queries, (5) computes a relevance score (TF-IDF or BM25) for each candidate row, (6) sorts by score and returns top N.
- Token = word in the book's index
- Posting list = list of page (row) numbers where that word occurs
- Position = where on the page the word appears, used for phrase search
- Frequency = how many times the word appears on that page, used for relevance ranking
- The index is pre-built — queries don't scan rows, they consult the index.
MATCH/AGAINST Syntax and Relevance Ranking
The SQL syntax for full-text search is MATCH(columns) AGAINST('query' IN NATURAL LANGUAGE MODE). The engine returns rows ordered by relevance — a floating-point score. Higher scores mean more relevant.
Relevance ranking isn't magic. Behind the scenes the database computes a variant of TF-IDF (MySQL) or BM25 (PostgreSQL).
- TF-IDF (Term Frequency-Inverse Document Frequency) rewards terms that appear frequently in a document but rarely across the whole corpus. It doesn't normalise for document length, so long documents (with more words) tend to score higher simply because they contain more text.
- BM25 (Best Matching 25) adds document length normalisation. A short document that matches a term is given a higher boost relative to its length. It also saturates term frequency so that 100 occurrences of a word doesn't give 100x the score of 1 occurrence.
IN BOOLEAN MODE gives you operators: +mandatory, -exclude, *wildcard, "phrase". It doesn't rank — it just filters. Use it when you need exact control over which terms must match, not when you need relevance ordering.
WITH QUERY EXPANSION (MySQL) adds terms from top-ranked results to the original query. It can help find related documents but can also drift off-topic.
Practical example: For a product catalogue with descriptions of varying lengths (e.g., 50-character titles vs 5000-character detailed descriptions), BM25 gives fairer ranking. MySQL's TF-IDF would rank the long descriptions higher even if the title contains the exact match.
MATCH() * weight.Stopwords, Stemming and Language Support
Stopwords are common words (the, a, in, of) that don't carry search meaning. FTS engines discard them by default. But the default stopword list is English-centric and often wrong for your domain. If you're searching a product catalogue and 'ink' is a valid product name, but 'ink' happens to be in your custom stopword list — you'll lose results. On PostgreSQL, the default list includes 'the', 'a', 'and', 'of', 'to', 'in', 'for', etc. Not 'ink', unless customised. But technical terms like 'set', 'key', 'host', 'port' are NOT in default lists, but could be added accidentally.
Stemming reduces words to root forms: 'running' becomes 'run', 'better' becomes 'good'. MySQL uses a built-in stemmer based on the language setting. PostgreSQL uses dictionaries (english, simple, snowball). You can create custom dictionaries for domain-specific terms.
Key trade-off: aggressive stemming conflates unrelated words (e.g., 'marketing' and 'market' become 'market') causing false positives. Too little stemming and you miss variants.
Language-specific considerations: MySQL supports per-column character sets and collations that affect tokenisation. For non-English languages (Chinese, Japanese, Korean), MySQL's ngram parser splits text into n-character substrings; PostgreSQL's parser uses language-specific dictionaries.
Custom dictionaries in PostgreSQL: You can chain dictionaries (e.g., a custom thesaurus that maps 'JS' to 'JavaScript', then a snowball stemmer). This is far more flexible than MySQL's global setting.
Performance Tuning and Index Maintenance
Full-text indexes grow with your data and can become bloated, fragmented, or stale.
innodb_ft_min_token_size(default 3): ignore words shorter than this. Lower to index acronyms ('AI', 'UX') but increases index size by 20-30%.innodb_ft_max_token_size(default 84): ignore words longer than this. Increase for long product names.innodb_ft_cache_size(default 8MB): size of the in-memory cache for tokenised data. Larger values improve insert performance for bulk operations.- Index maintenance:
OPTIMIZE TABLErebuilds the FTS index but LOCKS THE TABLE for writes. Schedule during low traffic.
work_mem: controls memory used for sorting during index creation. Higher values speed up large index builds.maintenance_work_mem: memory used for REINDEX and VACUUM operations.gin_pending_list_limit: for GIN indexes with fast update enabled, this controls the size of the pending list.- Zero-downtime maintenance:
CREATE INDEX CONCURRENTLYandREINDEX INDEX CONCURRENTLYrebuild indexes without blocking writes.
- MySQL:
SHOW INDEX FROM tableshows cardinality. CompareINNODB_FTS_INDEX_TABLErow count with base table row count to detect staleness. - PostgreSQL:
for GiST indexes,pgstatindex()pg_size_pretty(pg_relation_size('index_name'))for index size growth.
Practical benchmarks: On a 10 million row table with 1KB average text per row, MySQL InnoDB FTS index size is about 20-30% of data size (2-3GB). PostgreSQL GIN index is larger, about 40-50% (4-5GB), but provides faster read queries.
The staleness trap: After a bulk delete or update, FTS indexes may still point to deleted rows. Run CHECK TABLE (MySQL) or REINDEX (PostgreSQL) after any bulk operation affecting >20% of rows.
innodb_ft_cache_size high enough to avoid excessive disk flushes.MySQL vs PostgreSQL: Implementation Differences That Matter
While both MySQL and PostgreSQL offer full-text search, their internals differ significantly. Choosing the wrong one can lead to performance pain or missing features.
Index storage: MySQL stores the FTS index in hidden auxiliary tables (INNODB_FTS_INDEX_TABLE, INNODB_FTS_INDEX_CACHE) managed by InnoDB. PostgreSQL stores it as a GiST or GIN index on a tsvector column.
Tokenisation and stemming: MySQL uses a built-in tokeniser and stemmer that are less configurable. You can change the minimum token length via innodb_ft_min_token_size and set a custom stopword table, but the stemmer is fixed per language (English, French, Spanish, etc.). PostgreSQL uses text search configurations that separate tokeniser (parser) and dictionary (stemmer, stopwords, thesaurus), giving you the ability to chain dictionaries and create custom ones.
Index maintenance: MySQL requires OPTIMIZE TABLE to rebuild the FTS index (table is locked). PostgreSQL supports CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY for zero-downtime rebuilds.
Query flexibility: PostgreSQL supports ranking weights per column using setweight and can store tsvector as a generated column. MySQL's ranking is less flexible; you can use multiple MATCH clauses with custom multipliers but it's less efficient.
- MySQL InnoDB FTS: Optimised for transactional workloads with moderate write rates. Reads are fast but not as fast as PostgreSQL GIN.
- PostgreSQL GIN: Faster reads (10-20%) but slower writes (30-50% overhead due to lexeme compression). GiST is slower for reads but faster for writes.
Debugging: MySQL's FTS debugging requires querying INFORMATION_SCHEMA tables. PostgreSQL allows direct querying of the tsvector column, making debugging trivial.
Production recommendation: Use PostgreSQL if you need custom stemming, zero-downtime rebuilds, or fair ranking (BM25). Use MySQL if you're already in a MySQL environment and have simple text search requirements with moderate write rates.
Full-Text Search Queries: How They Actually Work at the Protocol Level
You've seen MATCH/AGAINST. Fine. But what happens when you type a query like '+smartphone -iphone warranty' in BOOLEAN MODE? The query parser breaks that string into tokens at the same word boundaries the indexer used. Then it traverses the inverted index looking for documents containing 'smartphone' while excluding those that also contain 'iphone'. The Boolean operators are evaluated as set operations on document ID lists.
Here's where juniors get burned: query expansion and proximity searches. MySQL's WITH QUERY EXPANSION performs a two-phase search — the first returns results, the second re-runs the query using the most relevant terms from the first result set. Sounds clever until you realise it can return garbage if your initial query matched irrelevant documents. Never use it on production user-facing search without manual relevance testing.
SQL Server's CONTAINSTABLE and FREETEXTTABLE return ranked results with internal scoring you can inspect. The difference? FREETEXT does linguistic stemming and thesaurus expansion automatically. CONTAINS gives you exact control over prefixes, proximity, and inflectional forms. Pick the right weapon.
The Full-Text Index Architecture Nobody Draws on the Whiteboard
The inverted index is the heart, but the rest of the body matters too. Every full-text index has three structural layers: the token list, the postings list, and the position list. The token list maps each unique word to a compressed integer ID. The postings list maps that ID to a list of document IDs where the word appears. The position list tracks word offsets within each document for proximity searches.
SQL Server stores these as internal system tables inside a full-text catalog — a virtual container that doesn't map to a single file. Each catalog can span multiple filegroups, and each index has its own population history. When you rebuild a full-text index, SQL Server drops all internal structures and re-parses every row through the filter daemon and word breaker. That's why a rebuild on a 10-million-row table takes hours, not seconds.
PostgreSQL stores tsvector directly in the table column. No separate catalog. This means updates are row-level — when you change a row, PostgreSQL re-parses only that row through a trigger or computed column. The trade-off: no incremental population like SQL Server's change tracking. You pay for every UPDATE with CPU to re-lex the text. Know your write pattern before choosing.
PATINDEX and CHARINDEX — Pattern Matching Beyond LIKE in SQL Server
LIKE is fine for simple wildcards, but it chokes when you need position-aware substring searches without full-text indexing. PATINDEX and CHARINDEX fill that gap. CHARINDEX is your scalpel: it returns the starting position of one string inside another, zero if missing. No wildcards, no guesswork. PATINDEX steps it up with regex-like patterns — find any numeric digit, any non-alphanumeric character, or custom patterns using % and _ with square brackets. Why this matters: you can extract specific substrings from messy data without resorting to slow CLR functions or client-side parsing. Use them when full-text search is overkill but LIKE is too weak. Both ignore collation unless you force it, so watch case sensitivity. They're also non-SARGable — indexes won't help. Keep pattern matching lean.
The Invisible Products: When Stopwords Hid an Entire Category
- Stopwords are not universal. Default lists suit general English, not e-commerce, medical, or technical domains.
- Always export and review your full-text configuration (stopwords, stemmer, min token length) before deploying to production.
- Test FTS with real user queries from day one. If a term should return results, verify it does.
- For domain-critical terms (product names, SKU prefixes, technical acronyms), ensure they are NOT in the stopword list.
- Set up monitoring: a weekly job that queries a set of known terms and alerts if any return zero results.
SELECT MAX(last_modified) FROM products; -- compare to index metadataOPTIMIZE TABLE products; -- MySQL or REINDEX INDEX CONCURRENTLY idx_products_fts; -- PostgreSQLKey takeaways
Common mistakes to avoid
7 patternsUsing default stopword list without domain review
Forgetting to rebuild the index after bulk operations
Using FTS for exact match or short codes (e.g., 'ID-123')
Using NATURAL LANGUAGE MODE for phrase searches expecting quotes to work
Ignoring stemming false positives
Not testing with production-like data volumes before deployment
Using MySQL FTS in a high-write environment without tuning `innodb_ft_cache_size`
innodb_ft_cache_size to 32MB or 64MB. Monitor INNODB_FTS_INDEX_CACHE size vs INNODB_FTS_INDEX_TABLE. For PostgreSQL, adjust gin_pending_list_limit.Interview Questions on This Topic
Explain how an inverted index is built and queried in a SQL database's full-text search engine.
Frequently Asked Questions
20+ years shipping high-throughput database systems. Drawn from code that ran under real load.
That's SQL Advanced. Mark it forged?
10 min read · try the examples if you haven't