Full-Text Search in SQL: Stopwords Hide 'ink' Products
- FTS relies on an inverted index — no B-Tree magic. It maps tokens to row IDs, enabling O(log n) lookups instead of O(n) scans.
- Ranking is algorithmic: MySQL uses TF-IDF (favours long documents), PostgreSQL uses BM25 (normalises length). Choose based on document length variance.
- Stopwords and stemming must be tailored to your domain. Default English stopwords hide technical terms. Review and customise.
- Full-text search (FTS) uses an inverted index mapping tokens to row IDs. LIKE '%term%' forces full table scan — 1M rows = 10 seconds. FTS = 100ms.
- MySQL FTS uses InnoDB auxiliary tables; rebuild requires OPTIMIZE TABLE (table lock). PostgreSQL uses GIN indexes supports CONCURRENTLY (zero downtime).
- Ranking: MySQL uses TF-IDF (no length normalisation); PostgreSQL uses BM25 (normalises by document length — fairer for mixed-length content).
- Stopwords (the, a, ink, set) are filtered by default. If 'ink' is a product name, your customers see zero results. Always review stopword list for your domain.
- Stemming reduces 'running' → 'run'. Aggressive stemmers conflate unrelated words (marketing → market). Risk: false positives.
- Production trap: index staleness after bulk insert. FTS index doesn't auto-update immediately for all engines. Run CHECK TABLE or REINDEX after batch jobs.
- Biggest mistake: using FTS for exact identifiers (ID-123). Tokeniser splits on hyphens, ignores short tokens. Use B-Tree for IDs.
Full-Text Search Debug Cheat Sheet
FTS index not picking up new data
SELECT MAX(last_modified) FROM products; -- compare to index metadataOPTIMIZE TABLE products; -- MySQL or REINDEX INDEX CONCURRENTLY idx_products_fts; -- PostgreSQLWrong ranking order, irrelevant results at top
SELECT id, MATCH(title) AGAINST('query' IN BOOLEAN MODE) AS score FROM products ORDER BY score DESC LIMIT 20;Check if stopwords are filtering: SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_DEFAULT_STOPWORD; (MySQL) or SELECT * FROM pg_catalog.pg_ts_config_map; (PostgreSQL)Full-text query times out (>5 seconds)
SHOW VARIABLES LIKE 'slow_query_log%'; (MySQL) or SHOW log_min_duration_statement; (PostgreSQL)EXPLAIN FORMAT=JSON SELECT ... MATCH AGAINST ... ; -- check if FTS index is usedShort words like 'AI', 'UX' not returning results
SHOW VARIABLES LIKE 'innodb_ft_min_token_size'; (MySQL, default 3) or SELECT cfgname, cfgoption FROM pg_catalog.pg_ts_config; (PostgreSQL)SET GLOBAL innodb_ft_min_token_size = 2; (requires restart and index rebuild)Phrase search using quotes doesn't work
SHOW VARIABLES LIKE 'ft_boolean_syntax'; (MySQL) or SELECT 'The <-> quick <-> brown'::tsquery; (PostgreSQL)SELECT * FROM products WHERE MATCH(col) AGAINST('"exact phrase"' IN BOOLEAN MODE);Production Incident
Production Debug GuideSymptom → Action for common FTS failures in production
Most production applications reach a point where LIKE '%search_term%' stops cutting it. You've got a product catalogue with 500,000 rows, a blog with a decade of articles, or a support ticket system with millions of entries — and users expect Google-quality search that returns relevant results in under 100 milliseconds. LIKE with a leading wildcard can't use a B-Tree index, so the database performs a full sequential scan every single time. At scale, that's a query that grinds your entire application to a halt.
Full-text search (FTS) solves this with an entirely different data structure — an inverted index — that maps individual tokens (words) back to the rows that contain them. But it does far more than just fast lookups. It understands language: it knows that 'running', 'ran', and 'runs' are forms of the same root word. It knows that 'the', 'a', and 'is' are so common they're noise. It can rank results by relevance so that the most useful rows float to the top. These capabilities live inside your existing SQL database, no Elasticsearch cluster required.
By the end of this article you'll understand exactly how an inverted index is built and queried, how to set up and tune full-text search in both MySQL and PostgreSQL, how relevance ranking actually works under the hood, when FTS is the right tool versus when you should reach for a dedicated search engine, and every production gotcha that bites teams who ship FTS without reading the fine print.
How the Inverted Index Works Internally
An inverted index is the data structure behind all full-text search engines — SQL's built-in FTS included. It maps every distinct word (token) to the list of rows where that word appears, along with its position and frequency inside each document.
When you insert or update a row, the database tokenizes the text, discards stopwords, applies stemming, and builds (or appends to) this mapping. The result is a structure you can query in logarithmic time: look up the token, get the row list, apply ranking, return top N.
MySQL stores its FTS index in separate internal InnoDB tables (auxiliary tables). The main table holds the data; the FTS index is in hidden tables (INNODB_FTS_INDEX_TABLE, INNODB_FTS_INDEX_CACHE). You cannot query them directly for debugging, but they exist.
PostgreSQL uses GiST or GIN indexes, which store lexemes (stemmed tokens) directly in the index structure. GIN is faster for lookups (read-heavy), GiST is faster for updates (write-heavy). You can query the tsvector column directly for debugging.
How a query is executed: When you run MATCH(col) AGAINST('search'), the database: (1) tokenizes and stems the query text, (2) looks up each token in the inverted index, (3) retrieves posting lists (list of row IDs), (4) optionally intersects them for multi-word queries, (5) computes a relevance score (TF-IDF or BM25) for each candidate row, (6) sorts by score and returns top N.
-- MySQL: InnoDB FTS uses hidden auxiliary tables -- You can inspect them (read-only) SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE; SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_CACHE; -- PostgreSQL: tsvector column stores lexemes CREATE TABLE io_thecodeforge.documents ( id SERIAL PRIMARY KEY, body TEXT, search_vector TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', body)) STORED ); -- Insert and the tsvector is auto-generated INSERT INTO io_thecodeforge.documents (body) VALUES ('The quick brown fox jumps over the lazy dog.'); -- Query the tsvector directly to see normalized terms SELECT id, body, search_vector FROM io_thecodeforge.documents; -- Output shows: -- 'brown':2 'dog':8 'fox':3 'jump':4 'lazi':7 'quick':1 -- Stopwords ('the', 'over') are omitted. Stemming: 'jumps' → 'jump', 'lazy' → 'lazi'
----+------+-------------------------------------------------------
1 | The quick brown fox jumps over the lazy dog. | 'brown':2 'dog':8 'fox':3 'jump':4 'lazi':7 'quick':1
- Token = word in the book's index
- Posting list = list of page (row) numbers where that word occurs
- Position = where on the page the word appears, used for phrase search
- Frequency = how many times the word appears on that page, used for relevance ranking
- The index is pre-built — queries don't scan rows, they consult the index.
MATCH/AGAINST Syntax and Relevance Ranking
The SQL syntax for full-text search is MATCH(columns) AGAINST('query' IN NATURAL LANGUAGE MODE). The engine returns rows ordered by relevance — a floating-point score. Higher scores mean more relevant.
Relevance ranking isn't magic. Behind the scenes the database computes a variant of TF-IDF (MySQL) or BM25 (PostgreSQL).
- TF-IDF (Term Frequency-Inverse Document Frequency) rewards terms that appear frequently in a document but rarely across the whole corpus. It doesn't normalise for document length, so long documents (with more words) tend to score higher simply because they contain more text.
- BM25 (Best Matching 25) adds document length normalisation. A short document that matches a term is given a higher boost relative to its length. It also saturates term frequency so that 100 occurrences of a word doesn't give 100x the score of 1 occurrence.
IN BOOLEAN MODE gives you operators: +mandatory, -exclude, *wildcard, "phrase". It doesn't rank — it just filters. Use it when you need exact control over which terms must match, not when you need relevance ordering.
WITH QUERY EXPANSION (MySQL) adds terms from top-ranked results to the original query. It can help find related documents but can also drift off-topic.
Practical example: For a product catalogue with descriptions of varying lengths (e.g., 50-character titles vs 5000-character detailed descriptions), BM25 gives fairer ranking. MySQL's TF-IDF would rank the long descriptions higher even if the title contains the exact match.
-- MySQL: Natural language mode (ranked) SELECT id, title, MATCH(title, body) AGAINST('Docker Compose' IN NATURAL LANGUAGE MODE) AS relevance FROM io_thecodeforge.articles ORDER BY relevance DESC; -- PostgreSQL: ts_rank with normalization SELECT id, title, ts_rank(to_tsvector('english', title || ' ' || body), plainto_tsquery('english', 'Docker Compose'), 1) AS relevance FROM io_thecodeforge.articles ORDER BY relevance DESC; -- Boolean mode with operators (MySQL/PostgreSQL) SELECT id, title FROM io_thecodeforge.articles WHERE MATCH(title, body) AGAINST('+Docker +healthcheck -deprecated' IN BOOLEAN MODE); -- Phrase search in PostgreSQL using <-> operator SELECT id, title FROM io_thecodeforge.articles WHERE to_tsvector('english', body) @@ to_tsquery('docker <-> compose');
MATCH() * weight.Stopwords, Stemming and Language Support
Stopwords are common words (the, a, in, of) that don't carry search meaning. FTS engines discard them by default. But the default stopword list is English-centric and often wrong for your domain. If you're searching a product catalogue and 'ink' is a valid product name, but 'ink' happens to be in your custom stopword list — you'll lose results. On PostgreSQL, the default list includes 'the', 'a', 'and', 'of', 'to', 'in', 'for', etc. Not 'ink', unless customised. But technical terms like 'set', 'key', 'host', 'port' are NOT in default lists, but could be added accidentally.
Stemming reduces words to root forms: 'running' becomes 'run', 'better' becomes 'good'. MySQL uses a built-in stemmer based on the language setting. PostgreSQL uses dictionaries (english, simple, snowball). You can create custom dictionaries for domain-specific terms.
Key trade-off: aggressive stemming conflates unrelated words (e.g., 'marketing' and 'market' become 'market') causing false positives. Too little stemming and you miss variants.
Language-specific considerations: MySQL supports per-column character sets and collations that affect tokenisation. For non-English languages (Chinese, Japanese, Korean), MySQL's ngram parser splits text into n-character substrings; PostgreSQL's parser uses language-specific dictionaries.
Custom dictionaries in PostgreSQL: You can chain dictionaries (e.g., a custom thesaurus that maps 'JS' to 'JavaScript', then a snowball stemmer). This is far more flexible than MySQL's global setting.
-- MySQL: set a custom stopword table SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords'; CREATE TABLE mydb.my_stopwords (value VARCHAR(30)); INSERT INTO mydb.my_stopwords VALUES ('the'), ('a'), ('of'), ('our'), ('your'); -- Rebuild index for changes to take effect OPTIMIZE TABLE mydb.articles; -- PostgreSQL: create custom text search configuration CREATE TEXT SEARCH CONFIGURATION io_thecodeforge.product_search (COPY = english); -- Replace stemmer with 'simple' (no stemming) for precise matching ALTER TEXT SEARCH CONFIGURATION io_thecodeforge.product_search ALTER MAPPING FOR word WITH simple; -- Add a custom thesaurus dictionary CREATE TEXT SEARCH DICTIONARY my_thesaurus ( TEMPLATE = thesaurus, DictFile = 'product_synonyms' ); -- Then use it in queries SELECT to_tsvector('io_thecodeforge.product_search', 'JS framework');
Performance Tuning and Index Maintenance
Full-text indexes grow with your data and can become bloated, fragmented, or stale.
innodb_ft_min_token_size(default 3): ignore words shorter than this. Lower to index acronyms ('AI', 'UX') but increases index size by 20-30%.innodb_ft_max_token_size(default 84): ignore words longer than this. Increase for long product names.innodb_ft_cache_size(default 8MB): size of the in-memory cache for tokenised data. Larger values improve insert performance for bulk operations.- Index maintenance:
OPTIMIZE TABLErebuilds the FTS index but LOCKS THE TABLE for writes. Schedule during low traffic.
work_mem: controls memory used for sorting during index creation. Higher values speed up large index builds.maintenance_work_mem: memory used for REINDEX and VACUUM operations.gin_pending_list_limit: for GIN indexes with fast update enabled, this controls the size of the pending list.- Zero-downtime maintenance:
CREATE INDEX CONCURRENTLYandREINDEX INDEX CONCURRENTLYrebuild indexes without blocking writes.
- MySQL:
SHOW INDEX FROM tableshows cardinality. CompareINNODB_FTS_INDEX_TABLErow count with base table row count to detect staleness. - PostgreSQL:
for GiST indexes,pgstatindex()pg_size_pretty(pg_relation_size('index_name'))for index size growth.
Practical benchmarks: On a 10 million row table with 1KB average text per row, MySQL InnoDB FTS index size is about 20-30% of data size (2-3GB). PostgreSQL GIN index is larger, about 40-50% (4-5GB), but provides faster read queries.
The staleness trap: After a bulk delete or update, FTS indexes may still point to deleted rows. Run CHECK TABLE (MySQL) or REINDEX (PostgreSQL) after any bulk operation affecting >20% of rows.
-- MySQL: Check FTS index fragmentation SHOW INDEX FROM io_thecodeforge.articles; SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE WHERE TABLE_ID = (SELECT TABLE_ID FROM INFORMATION_SCHEMA.INNODB_TABLES WHERE NAME = 'io_thecodeforge/articles'); -- Rebuild fulltext index (blocks writes) — schedule during maintenance OPTIMIZE TABLE io_thecodeforge.articles; -- Set min token length (requires restart) SET GLOBAL innodb_ft_min_token_size = 2; -- PostgreSQL: create index CONCURRENTLY (no downtime) CREATE INDEX CONCURRENTLY idx_fts_articles ON io_thecodeforge.articles USING GIN(to_tsvector('english', body)); -- Monitor index size SELECT pg_size_pretty(pg_relation_size('idx_fts_articles')); -- Rebuild online without locking REINDEX INDEX CONCURRENTLY idx_fts_articles; -- Increase memory for index operations SET maintenance_work_mem = '1024MB'; -- for REINDEX SET work_mem = '64MB'; -- for query-time sorting
innodb_ft_cache_size high enough to avoid excessive disk flushes.MySQL vs PostgreSQL: Implementation Differences That Matter
While both MySQL and PostgreSQL offer full-text search, their internals differ significantly. Choosing the wrong one can lead to performance pain or missing features.
Index storage: MySQL stores the FTS index in hidden auxiliary tables (INNODB_FTS_INDEX_TABLE, INNODB_FTS_INDEX_CACHE) managed by InnoDB. PostgreSQL stores it as a GiST or GIN index on a tsvector column.
Tokenisation and stemming: MySQL uses a built-in tokeniser and stemmer that are less configurable. You can change the minimum token length via innodb_ft_min_token_size and set a custom stopword table, but the stemmer is fixed per language (English, French, Spanish, etc.). PostgreSQL uses text search configurations that separate tokeniser (parser) and dictionary (stemmer, stopwords, thesaurus), giving you the ability to chain dictionaries and create custom ones.
Index maintenance: MySQL requires OPTIMIZE TABLE to rebuild the FTS index (table is locked). PostgreSQL supports CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY for zero-downtime rebuilds.
Query flexibility: PostgreSQL supports ranking weights per column using setweight and can store tsvector as a generated column. MySQL's ranking is less flexible; you can use multiple MATCH clauses with custom multipliers but it's less efficient.
- MySQL InnoDB FTS: Optimised for transactional workloads with moderate write rates. Reads are fast but not as fast as PostgreSQL GIN.
- PostgreSQL GIN: Faster reads (10-20%) but slower writes (30-50% overhead due to lexeme compression). GiST is slower for reads but faster for writes.
Debugging: MySQL's FTS debugging requires querying INFORMATION_SCHEMA tables. PostgreSQL allows direct querying of the tsvector column, making debugging trivial.
Production recommendation: Use PostgreSQL if you need custom stemming, zero-downtime rebuilds, or fair ranking (BM25). Use MySQL if you're already in a MySQL environment and have simple text search requirements with moderate write rates.
-- MySQL: Creating an FTS index ALTER TABLE io_thecodeforge.articles ADD FULLTEXT(title, body); -- MySQL: Checking FTS stats (auxiliary tables) SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE; -- MySQL: Rebuild FTS index (blocks writes) OPTIMIZE TABLE io_thecodeforge.articles; -- PostgreSQL: Creating GIN index on generated tsvector column ALTER TABLE io_thecodeforge.articles ADD COLUMN search_vector tsvector GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED; CREATE INDEX idx_articles_fts ON io_thecodeforge.articles USING GIN(search_vector); -- PostgreSQL: Query the tsvector directly for debugging SELECT id, search_vector FROM io_thecodeforge.articles; -- PostgreSQL: Rebuild without blocking REINDEX INDEX CONCURRENTLY idx_articles_fts;
| Aspect | LIKE '%term%' | Full-Text Search |
|---|---|---|
| Data structure | B-Tree (unusable with leading wildcard) | Inverted index |
| Speed (1M rows, 1KB text) | 2-10 seconds (full sequential scan) | 10-100 ms (index lookup) |
| Ranking | None (order of insertion or random) | TF-IDF or BM25 relevance score |
| Stemming | None ('running' ≠ 'run') | Built-in language-specific stemmer |
| Stopword handling | None ('the' is searched like any word) | Automatic filtering of noise words |
| Phrase search | LIKE '%exact phrase%' (scan, no ordering) | Boolean mode operators or <-> adjacency |
| Index maintenance | None (B-Tree updates automatically) | Requires periodic rebuild (OPTIMIZE or REINDEX) |
| Prefix search | LIKE 'prefix%' (uses index, but leading wildcard does not) | wildcard in boolean mode ('run') |
| Infrastructure | Built into all SQL databases, no setup | Requires index creation and configuration |
🎯 Key Takeaways
- FTS relies on an inverted index — no B-Tree magic. It maps tokens to row IDs, enabling O(log n) lookups instead of O(n) scans.
- Ranking is algorithmic: MySQL uses TF-IDF (favours long documents), PostgreSQL uses BM25 (normalises length). Choose based on document length variance.
- Stopwords and stemming must be tailored to your domain. Default English stopwords hide technical terms. Review and customise.
- Always rebuild indexes after bulk operations or you'll get stale results. PostgreSQL CONCURRENTLY allows zero downtime; MySQL OPTIMIZE TABLE locks.
- Test FTS with real user queries before production deployment. A query that works on 1k rows can fail on 1M rows due to tokenisation or ranking.
- For typos and faceted search, FTS falls short. Know when to hand off to Elasticsearch — it's not a failure, it's the right architectural decision.
- MySQL FTS is simpler but less flexible; PostgreSQL FTS offers more control and zero-downtime maintenance. Choose based on operational needs.
- Phrase search requires BOOLEAN MODE in MySQL or <-> operator in PostgreSQL. NATURAL LANGUAGE MODE ignores quotes.
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- QExplain how an inverted index is built and queried in a SQL database's full-text search engine.SeniorReveal
- QWhat is the difference between TF-IDF and BM25 ranking, and when would you choose one over the other?SeniorReveal
- QHow do you debug a full-text search that returns no results for a query that should match?SeniorReveal
- QWhen should you use a dedicated search engine like Elasticsearch over SQL FTS?SeniorReveal
- QHow would you design a search system for a multilingual product catalogue with English, Spanish, and Chinese content?SeniorReveal
Frequently Asked Questions
What is Full-Text Search in SQL in simple terms?
Full-Text Search is a database feature that lets you search large text columns for words or phrases quickly. Instead of scanning every row (like LIKE '%term%'), it builds an inverted index that maps each word to the rows containing it. This makes searches in millions of rows complete in milliseconds.
Can I use full-text search with a partial match like 'prefix*'?
Yes, in Boolean mode you can use the asterisk wildcard at the end of a word (e.g., 'run' matches 'running', 'runner', 'runs'). Leading wildcards are not supported in most databases. For prefix search, use the operator in MySQL's BOOLEAN MODE or the :* operator in PostgreSQL's tsquery.
Does full-text search support fuzzy matching (typo tolerance)?
Not natively in MySQL or PostgreSQL. You need trigram indexes (pg_trgm in PostgreSQL) or a dedicated search engine like Elasticsearch for fuzzy matching. Some workarounds exist using Levenshtein distance, but they don't scale to large tables. For typo tolerance, consider adding Elasticsearch as a specialised search layer.
How does full-text search handle multi-language content?
MySQL supports per-column language specification (CHARACTER SET and COLLATION), though stemming varies. PostgreSQL allows creating custom text search configurations per language and even per query. For mixed-language tables, consider separate language-specific columns or use a single configuration that handles multiple languages poorly — better to use an external search engine with per-language analysers.
Is full-text search better than Elasticsearch?
It depends. For simple text search on a single database table with moderate traffic (under 100GB, 10M rows), SQL FTS is simpler and requires no extra infrastructure. For advanced features like fuzzy matching, faceted search, multi-language support, and distributed scaling, Elasticsearch is better. FTS is a feature; Elasticsearch is a search platform. Use both: SQL FTS for exact keyword search, Elasticsearch for full-text catalogue search.
How do I monitor full-text index health in production?
- Compares row count with distinct doc_ids in FTS index metadata. Alert if difference >5%.
- Checks index size growth. Alert if growth exceeds 20% per week without data growth.
- Runs a query for a known term that should always return results. Alert if count=0.
- Monitors slow queries: log any MATCH/AGAINST query taking >2 seconds.
- For MySQL, monitor
INNODB_FTS_INDEX_CACHEsize vsINNODB_FTS_INDEX_TABLEto detect cache flushes.
Can I use full-text search on a database view?
Yes, but only if the view's underlying table columns have a full-text index and the view is not materialised (regular view). Materialised views can have their own full-text indexes if the DBMS supports it (PostgreSQL allows indexing on materialised views). In MySQL, you cannot create an FTS index on a view directly; create it on the base table and query via the view.
How does FTS handle NULL values?
NULL values are not indexed. If a column contains NULL, it is ignored. To include default text for NULLs, use COALESCE in the index definition or in queries. Example: ALTER TABLE t ADD FULLTEXT(COALESCE(description, '')) is not valid — you need a generated column. PostgreSQL: to_tsvector('english', coalesce(description, '')).
Does FTS support scoring for my own custom fields?
Yes, you can combine the MATCH/AGAINST score with other numeric columns (like popularity or recency) in the ORDER BY clause. In PostgreSQL, you can use the ts_rank function with setweight weights. In MySQL, you can add a weight multiplier in the ORDER BY: ORDER BY MATCH(title) AGAINST('query') 2 + MATCH(body) AGAINST('query') 1 DESC.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.