Skip to content
Home Database Full-Text Search in SQL: Stopwords Hide 'ink' Products

Full-Text Search in SQL: Stopwords Hide 'ink' Products

Where developers are forged. · Structured learning · Free forever.
📍 Part of: SQL Advanced → Topic 10 of 16
Stopwords hid 'ink' from FTS results; LIKE found all products.
🔥 Advanced — solid Database foundation required
In this tutorial, you'll learn
Stopwords hid 'ink' from FTS results; LIKE found all products.
  • FTS relies on an inverted index — no B-Tree magic. It maps tokens to row IDs, enabling O(log n) lookups instead of O(n) scans.
  • Ranking is algorithmic: MySQL uses TF-IDF (favours long documents), PostgreSQL uses BM25 (normalises length). Choose based on document length variance.
  • Stopwords and stemming must be tailored to your domain. Default English stopwords hide technical terms. Review and customise.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • Full-text search (FTS) uses an inverted index mapping tokens to row IDs. LIKE '%term%' forces full table scan — 1M rows = 10 seconds. FTS = 100ms.
  • MySQL FTS uses InnoDB auxiliary tables; rebuild requires OPTIMIZE TABLE (table lock). PostgreSQL uses GIN indexes supports CONCURRENTLY (zero downtime).
  • Ranking: MySQL uses TF-IDF (no length normalisation); PostgreSQL uses BM25 (normalises by document length — fairer for mixed-length content).
  • Stopwords (the, a, ink, set) are filtered by default. If 'ink' is a product name, your customers see zero results. Always review stopword list for your domain.
  • Stemming reduces 'running' → 'run'. Aggressive stemmers conflate unrelated words (marketing → market). Risk: false positives.
  • Production trap: index staleness after bulk insert. FTS index doesn't auto-update immediately for all engines. Run CHECK TABLE or REINDEX after batch jobs.
  • Biggest mistake: using FTS for exact identifiers (ID-123). Tokeniser splits on hyphens, ignores short tokens. Use B-Tree for IDs.
🚨 START HERE

Full-Text Search Debug Cheat Sheet

Commands to run when FTS breaks in production. Run these first.
🟡

FTS index not picking up new data

Immediate ActionCheck last rebuild timestamp and run a manual rebuild
Commands
SELECT MAX(last_modified) FROM products; -- compare to index metadata
OPTIMIZE TABLE products; -- MySQL or REINDEX INDEX CONCURRENTLY idx_products_fts; -- PostgreSQL
Fix NowForce a full rebuild: OPTIMIZE TABLE products; (MySQL, locks table) or REINDEX INDEX CONCURRENTLY idx_products_fts; (PostgreSQL, no lock)
🟡

Wrong ranking order, irrelevant results at top

Immediate ActionLog the raw relevance score and inspect ranking factors
Commands
SELECT id, MATCH(title) AGAINST('query' IN BOOLEAN MODE) AS score FROM products ORDER BY score DESC LIMIT 20;
Check if stopwords are filtering: SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_DEFAULT_STOPWORD; (MySQL) or SELECT * FROM pg_catalog.pg_ts_config_map; (PostgreSQL)
Fix NowTemporarily disable stopwords for debugging: SET GLOBAL innodb_ft_enable_stopword=0; (MySQL). For PostgreSQL, use 'simple' configuration: to_tsvector('simple', text).
🟡

Full-text query times out (>5 seconds)

Immediate ActionCheck slow query log and examine the execution plan
Commands
SHOW VARIABLES LIKE 'slow_query_log%'; (MySQL) or SHOW log_min_duration_statement; (PostgreSQL)
EXPLAIN FORMAT=JSON SELECT ... MATCH AGAINST ... ; -- check if FTS index is used
Fix NowAdd a LIMIT clause and use IN BOOLEAN MODE for faster partial matching. Increase `innodb_ft_cache_size` (MySQL) or `work_mem` (PostgreSQL).
🟡

Short words like 'AI', 'UX' not returning results

Immediate ActionCheck min token length configuration
Commands
SHOW VARIABLES LIKE 'innodb_ft_min_token_size'; (MySQL, default 3) or SELECT cfgname, cfgoption FROM pg_catalog.pg_ts_config; (PostgreSQL)
SET GLOBAL innodb_ft_min_token_size = 2; (requires restart and index rebuild)
Fix NowFor MySQL, set min token length to 2. For PostgreSQL, use a custom parser or trigram index for short words.
🟡

Phrase search using quotes doesn't work

Immediate ActionVerify you're using BOOLEAN MODE (not NATURAL LANGUAGE MODE)
Commands
SHOW VARIABLES LIKE 'ft_boolean_syntax'; (MySQL) or SELECT 'The <-> quick <-> brown'::tsquery; (PostgreSQL)
SELECT * FROM products WHERE MATCH(col) AGAINST('"exact phrase"' IN BOOLEAN MODE);
Fix NowUse BOOLEAN MODE with quotes for phrase search. PostgreSQL: `to_tsvector(body) @@ to_tsquery('quick <-> brown');`
Production Incident

The Invisible Products: When Stopwords Hid an Entire Category

A printing supply company had 500 ink products. Search for 'ink' returned zero results. The default MySQL stopword list treated 'ink' as noise. The team spent 2 days debugging query syntax before discovering the word was being filtered out.
SymptomSearch for 'ink' returns no results. Search for 'ink cartridge' returns only 'cartridge' matches, missing all ink products. LIKE '%ink%' works fine. MySQL logs show the query executes, but FTS reports no matches.
AssumptionThe team assumed the FTS index was stale or the query syntax was wrong. They ran OPTIMIZE TABLE, rebuilt the index, even switched from NATURAL LANGUAGE to BOOLEAN mode. Nothing worked. They never considered that 'ink' might be a stopword.
Root causeMySQL's default stopword list includes common English words — 'the', 'a', 'of', 'it', 'and' — but also 'ink'? No, the default list actually doesn't include 'ink'. The team's specific version had a custom stopword list from a previous DBA who added domain-specific noise words. 'Ink' was included because the DBA thought it was too common in print logs. On PostgreSQL, the default stopword list also doesn't include 'ink'. But both databases allow custom stopword tables. The team inherited an undocumented custom stopword configuration that filtered critical product terms. The real issue: no one had reviewed the stopword list for domain relevance. Technical terms like 'set', 'key', 'host', 'port' also risk being filtered depending on the list version. The team learned that default stopwords are trained on general English, not on e-commerce catalogues.
Fix1. Checked the stopword list: MySQL SHOW VARIABLES LIKE 'innodb_ft_server_stopword_table'; PostgreSQL SELECT * FROM pg_catalog.pg_ts_config_map. 2. Found 'ink' in the custom stopword list. Removed it: SET GLOBAL innodb_ft_server_stopword_table = ''; (MySQL) or ALTER TEXT SEARCH CONFIGURATION ... DROP MAPPING FOR ... (PostgreSQL). 3. Rebuilt the full-text index: OPTIMIZE TABLE products; (MySQL) or REINDEX INDEX CONCURRENTLY idx_products_fts; (PostgreSQL). 4. Added a scheduled job to export the stopword list weekly to version control and compare against domain-critical terms. 5. Created a monitoring query that periodically searches for terms that should exist (like 'ink') and alerts if count = 0. 6. Documented the stopword configuration in the team's runbook.
Key Lesson
Stopwords are not universal. Default lists suit general English, not e-commerce, medical, or technical domains.Always export and review your full-text configuration (stopwords, stemmer, min token length) before deploying to production.Test FTS with real user queries from day one. If a term should return results, verify it does.For domain-critical terms (product names, SKU prefixes, technical acronyms), ensure they are NOT in the stopword list.Set up monitoring: a weekly job that queries a set of known terms and alerts if any return zero results.
Production Debug Guide

Symptom → Action for common FTS failures in production

FTS query returns no results but LIKE finds matchesCheck if the text column is included in the full-text index using SHOW INDEX or \d+ table. Verify stopwords are not filtering your search terms by querying the stopword table directly. Check min token length — 'AI' with length 2 filtered if min=3.
Relevance ranking ignores your most important keywordsReview the ranking algorithm (BM25 vs TF-IDF). In PostgreSQL, adjust the weight column configuration using setweight. In MySQL, tweak the WITH QUERY EXPANSION option or add manual multipliers in ORDER BY.
FTS query is slow (over 1 second) on a large tableCheck index size and fragmentation. Run ANALYZE TABLE (MySQL) or VACUUM ANALYZE (PostgreSQL). Increase innodb_ft_cache_size (MySQL) or work_mem (PostgreSQL). Consider partitioning the table if it exceeds 100GB.
Search misses plurals or verb forms — 'running' doesn't match 'ran'Examine the stemming dictionary. For MySQL, set the proper parser plugin (ngram for CJK, mecab for Japanese). For PostgreSQL, choose the correct text search configuration (english vs simple) and verify the dictionary (snowball vs simple).
New products don't appear in search results after bulk insertCheck if the FTS index is stale. For MySQL InnoDB, the index updates transactionally but you may need OPTIMIZE TABLE after bulk operations. For PostgreSQL, ensure the tsvector column is generated or the index includes the column. Run REINDEX CONCURRENTLY.

Most production applications reach a point where LIKE '%search_term%' stops cutting it. You've got a product catalogue with 500,000 rows, a blog with a decade of articles, or a support ticket system with millions of entries — and users expect Google-quality search that returns relevant results in under 100 milliseconds. LIKE with a leading wildcard can't use a B-Tree index, so the database performs a full sequential scan every single time. At scale, that's a query that grinds your entire application to a halt.

Full-text search (FTS) solves this with an entirely different data structure — an inverted index — that maps individual tokens (words) back to the rows that contain them. But it does far more than just fast lookups. It understands language: it knows that 'running', 'ran', and 'runs' are forms of the same root word. It knows that 'the', 'a', and 'is' are so common they're noise. It can rank results by relevance so that the most useful rows float to the top. These capabilities live inside your existing SQL database, no Elasticsearch cluster required.

By the end of this article you'll understand exactly how an inverted index is built and queried, how to set up and tune full-text search in both MySQL and PostgreSQL, how relevance ranking actually works under the hood, when FTS is the right tool versus when you should reach for a dedicated search engine, and every production gotcha that bites teams who ship FTS without reading the fine print.

How the Inverted Index Works Internally

An inverted index is the data structure behind all full-text search engines — SQL's built-in FTS included. It maps every distinct word (token) to the list of rows where that word appears, along with its position and frequency inside each document.

When you insert or update a row, the database tokenizes the text, discards stopwords, applies stemming, and builds (or appends to) this mapping. The result is a structure you can query in logarithmic time: look up the token, get the row list, apply ranking, return top N.

MySQL stores its FTS index in separate internal InnoDB tables (auxiliary tables). The main table holds the data; the FTS index is in hidden tables (INNODB_FTS_INDEX_TABLE, INNODB_FTS_INDEX_CACHE). You cannot query them directly for debugging, but they exist.

PostgreSQL uses GiST or GIN indexes, which store lexemes (stemmed tokens) directly in the index structure. GIN is faster for lookups (read-heavy), GiST is faster for updates (write-heavy). You can query the tsvector column directly for debugging.

How a query is executed: When you run MATCH(col) AGAINST('search'), the database: (1) tokenizes and stems the query text, (2) looks up each token in the inverted index, (3) retrieves posting lists (list of row IDs), (4) optionally intersects them for multi-word queries, (5) computes a relevance score (TF-IDF or BM25) for each candidate row, (6) sorts by score and returns top N.

fts_internals_demo.sql · SQL
123456789101112131415161718192021
-- MySQL: InnoDB FTS uses hidden auxiliary tables
-- You can inspect them (read-only)
SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE;
SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_CACHE;

-- PostgreSQL: tsvector column stores lexemes
CREATE TABLE io_thecodeforge.documents (
    id SERIAL PRIMARY KEY,
    body TEXT,
    search_vector TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);

-- Insert and the tsvector is auto-generated
INSERT INTO io_thecodeforge.documents (body) VALUES ('The quick brown fox jumps over the lazy dog.');

-- Query the tsvector directly to see normalized terms
SELECT id, body, search_vector FROM io_thecodeforge.documents;

-- Output shows:
-- 'brown':2 'dog':8 'fox':3 'jump':4 'lazi':7 'quick':1
-- Stopwords ('the', 'over') are omitted. Stemming: 'jumps' → 'jump', 'lazy' → 'lazi'
▶ Output
id | body | search_vector
----+------+-------------------------------------------------------
1 | The quick brown fox jumps over the lazy dog. | 'brown':2 'dog':8 'fox':3 'jump':4 'lazi':7 'quick':1
Mental Model
Mental Model: Dictionary with Page Numbers
Think of an inverted index as the index at the back of a textbook — every key term points to every page where it appears.
  • Token = word in the book's index
  • Posting list = list of page (row) numbers where that word occurs
  • Position = where on the page the word appears, used for phrase search
  • Frequency = how many times the word appears on that page, used for relevance ranking
  • The index is pre-built — queries don't scan rows, they consult the index.
📊 Production Insight
InnoDB FTS stores the index in auxiliary tables managed internally — you cannot query them directly for most operations.
PostgreSQL's GIN index is transactional, so concurrent reads see a consistent snapshot during updates.
Failure: a server crash during a bulk insert can leave the FTS index in an inconsistent state — always test with CHECK TABLE.
Rule: For write-heavy workloads, PostgreSQL GiST has lower overhead than GIN but slower reads.
🎯 Key Takeaway
An inverted index is a reverse mapping: from words to rows, not rows to words.
It's the reason FTS queries return in milliseconds instead of seconds.
The choice of storage engine (InnoDB vs GIN) directly impacts write vs read performance.
Choose your FTS implementation based on workload
IfRead-heavy, large text columns, infrequent writes
UseUse PostgreSQL GIN index — best query speed, slower inserts, acceptable for read-mostly tables
IfWrite-heavy, many updates, small documents
UseUse MySQL InnoDB FTS — better insert performance, acceptable query speed, transactional consistency
IfNeed phrase search and fuzzy matching
UseBoth support, but PostgreSQL's phrase search is more robust with tsquery operators (<-> for adjacency)
IfZero-downtime index rebuilds required
UsePostgreSQL with CONCURRENTLY rebuilds. MySQL requires table lock during OPTIMIZE TABLE.

MATCH/AGAINST Syntax and Relevance Ranking

The SQL syntax for full-text search is MATCH(columns) AGAINST('query' IN NATURAL LANGUAGE MODE). The engine returns rows ordered by relevance — a floating-point score. Higher scores mean more relevant.

Relevance ranking isn't magic. Behind the scenes the database computes a variant of TF-IDF (MySQL) or BM25 (PostgreSQL).

  • TF-IDF (Term Frequency-Inverse Document Frequency) rewards terms that appear frequently in a document but rarely across the whole corpus. It doesn't normalise for document length, so long documents (with more words) tend to score higher simply because they contain more text.
  • BM25 (Best Matching 25) adds document length normalisation. A short document that matches a term is given a higher boost relative to its length. It also saturates term frequency so that 100 occurrences of a word doesn't give 100x the score of 1 occurrence.

IN BOOLEAN MODE gives you operators: +mandatory, -exclude, *wildcard, "phrase". It doesn't rank — it just filters. Use it when you need exact control over which terms must match, not when you need relevance ordering.

WITH QUERY EXPANSION (MySQL) adds terms from top-ranked results to the original query. It can help find related documents but can also drift off-topic.

Practical example: For a product catalogue with descriptions of varying lengths (e.g., 50-character titles vs 5000-character detailed descriptions), BM25 gives fairer ranking. MySQL's TF-IDF would rank the long descriptions higher even if the title contains the exact match.

match_against_demo.sql · SQL
12345678910111213141516171819202122
-- MySQL: Natural language mode (ranked)
SELECT id, title,
       MATCH(title, body) AGAINST('Docker Compose' IN NATURAL LANGUAGE MODE) AS relevance
FROM io_thecodeforge.articles
ORDER BY relevance DESC;

-- PostgreSQL: ts_rank with normalization
SELECT id, title,
       ts_rank(to_tsvector('english', title || ' ' || body),
               plainto_tsquery('english', 'Docker Compose'), 1) AS relevance
FROM io_thecodeforge.articles
ORDER BY relevance DESC;

-- Boolean mode with operators (MySQL/PostgreSQL)
SELECT id, title
FROM io_thecodeforge.articles
WHERE MATCH(title, body) AGAINST('+Docker +healthcheck -deprecated' IN BOOLEAN MODE);

-- Phrase search in PostgreSQL using <-> operator
SELECT id, title
FROM io_thecodeforge.articles
WHERE to_tsvector('english', body) @@ to_tsquery('docker <-> compose');
▶ Output
With ranking, documents containing the phrase in the title score higher. Boolean mode returns only documents that match all mandatory terms.
⚠ Ranking trap: short documents lose in MySQL
TF-IDF over-weights long documents. If your catalogue has product names (short) and descriptions (long), the description will often rank higher even when the name is exact match. Use PostgreSQL or implement manual boosting of title column with MATCH() * weight.
📊 Production Insight
MySQL's relevance score is normalised by total term frequency across all rows — adding a row can change all scores, causing order to shift non-deterministically.
PostgreSQL's ts_rank is deterministic per query, but the configuration (weights via setweight) controls which column matters more.
Failure: forgetting to use ORDER BY in natural language mode — rows come back in arbitrary (insertion) order, not by relevance. Your users see random results.
🎯 Key Takeaway
MATCH/AGAINST with NATURAL LANGUAGE MODE returns ranked results automatically.
Boolean mode gives you operators but no ranking — use it for filtering.
Ranking algorithms differ: MySQL uses TF-IDF (favours long docs), PostgreSQL uses BM25 (normalises length).

Stopwords, Stemming and Language Support

Stopwords are common words (the, a, in, of) that don't carry search meaning. FTS engines discard them by default. But the default stopword list is English-centric and often wrong for your domain. If you're searching a product catalogue and 'ink' is a valid product name, but 'ink' happens to be in your custom stopword list — you'll lose results. On PostgreSQL, the default list includes 'the', 'a', 'and', 'of', 'to', 'in', 'for', etc. Not 'ink', unless customised. But technical terms like 'set', 'key', 'host', 'port' are NOT in default lists, but could be added accidentally.

Stemming reduces words to root forms: 'running' becomes 'run', 'better' becomes 'good'. MySQL uses a built-in stemmer based on the language setting. PostgreSQL uses dictionaries (english, simple, snowball). You can create custom dictionaries for domain-specific terms.

Key trade-off: aggressive stemming conflates unrelated words (e.g., 'marketing' and 'market' become 'market') causing false positives. Too little stemming and you miss variants.

Language-specific considerations: MySQL supports per-column character sets and collations that affect tokenisation. For non-English languages (Chinese, Japanese, Korean), MySQL's ngram parser splits text into n-character substrings; PostgreSQL's parser uses language-specific dictionaries.

Custom dictionaries in PostgreSQL: You can chain dictionaries (e.g., a custom thesaurus that maps 'JS' to 'JavaScript', then a snowball stemmer). This is far more flexible than MySQL's global setting.

custom_stopword.sql · SQL
123456789101112131415161718192021222324
-- MySQL: set a custom stopword table
SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords';

CREATE TABLE mydb.my_stopwords (value VARCHAR(30));
INSERT INTO mydb.my_stopwords VALUES ('the'), ('a'), ('of'), ('our'), ('your');

-- Rebuild index for changes to take effect
OPTIMIZE TABLE mydb.articles;

-- PostgreSQL: create custom text search configuration
CREATE TEXT SEARCH CONFIGURATION io_thecodeforge.product_search (COPY = english);

-- Replace stemmer with 'simple' (no stemming) for precise matching
ALTER TEXT SEARCH CONFIGURATION io_thecodeforge.product_search
    ALTER MAPPING FOR word WITH simple;

-- Add a custom thesaurus dictionary
CREATE TEXT SEARCH DICTIONARY my_thesaurus (
    TEMPLATE = thesaurus,
    DictFile = 'product_synonyms'
);

-- Then use it in queries
SELECT to_tsvector('io_thecodeforge.product_search', 'JS framework');
🔥Stopword surprise: technical terms are often stopwords by accident
Words like 'set', 'key', 'host', 'port', 'null' are common in technical documentation but may be stopwords in some languages or custom lists. Always export and review your stopword list against actual product names and user queries.
📊 Production Insight
Default stopword lists are trained on general English — they hurt domain-specific search.
PostgreSQL allows per-query tokenizer overrides; MySQL requires global config changes (requires restart for fulltext index rebuild).
Failure: a legal search system that dropped 'contract' because it matched a stopword in the custom list — double-check every word you blacklist.
Rule: For domain-specific catalogues, start with no stopwords (empty list) and only add those proven to cause false positives.
🎯 Key Takeaway
Stopwords and stemming are language-specific — never trust the default for your domain.
Test your FTS with a sample of real queries to catch false positives.
PostgreSQL's per-query configuration gives more control; MySQL's global setting is simpler but less flexible.

Performance Tuning and Index Maintenance

Full-text indexes grow with your data and can become bloated, fragmented, or stale.

MySQL InnoDB FTS tuning
  • innodb_ft_min_token_size (default 3): ignore words shorter than this. Lower to index acronyms ('AI', 'UX') but increases index size by 20-30%.
  • innodb_ft_max_token_size (default 84): ignore words longer than this. Increase for long product names.
  • innodb_ft_cache_size (default 8MB): size of the in-memory cache for tokenised data. Larger values improve insert performance for bulk operations.
  • Index maintenance: OPTIMIZE TABLE rebuilds the FTS index but LOCKS THE TABLE for writes. Schedule during low traffic.
PostgreSQL GIN/GiST tuning
  • work_mem: controls memory used for sorting during index creation. Higher values speed up large index builds.
  • maintenance_work_mem: memory used for REINDEX and VACUUM operations.
  • gin_pending_list_limit: for GIN indexes with fast update enabled, this controls the size of the pending list.
  • Zero-downtime maintenance: CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY rebuild indexes without blocking writes.
Monitoring index health
  • MySQL: SHOW INDEX FROM table shows cardinality. Compare INNODB_FTS_INDEX_TABLE row count with base table row count to detect staleness.
  • PostgreSQL: pgstatindex() for GiST indexes, pg_size_pretty(pg_relation_size('index_name')) for index size growth.

Practical benchmarks: On a 10 million row table with 1KB average text per row, MySQL InnoDB FTS index size is about 20-30% of data size (2-3GB). PostgreSQL GIN index is larger, about 40-50% (4-5GB), but provides faster read queries.

The staleness trap: After a bulk delete or update, FTS indexes may still point to deleted rows. Run CHECK TABLE (MySQL) or REINDEX (PostgreSQL) after any bulk operation affecting >20% of rows.

fts_tuning.sql · SQL
1234567891011121314151617181920212223
-- MySQL: Check FTS index fragmentation
SHOW INDEX FROM io_thecodeforge.articles;
SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE
WHERE TABLE_ID = (SELECT TABLE_ID FROM INFORMATION_SCHEMA.INNODB_TABLES WHERE NAME = 'io_thecodeforge/articles');

-- Rebuild fulltext index (blocks writes) — schedule during maintenance
OPTIMIZE TABLE io_thecodeforge.articles;

-- Set min token length (requires restart)
SET GLOBAL innodb_ft_min_token_size = 2;

-- PostgreSQL: create index CONCURRENTLY (no downtime)
CREATE INDEX CONCURRENTLY idx_fts_articles ON io_thecodeforge.articles USING GIN(to_tsvector('english', body));

-- Monitor index size
SELECT pg_size_pretty(pg_relation_size('idx_fts_articles'));

-- Rebuild online without locking
REINDEX INDEX CONCURRENTLY idx_fts_articles;

-- Increase memory for index operations
SET maintenance_work_mem = '1024MB';  -- for REINDEX
SET work_mem = '64MB';  -- for query-time sorting
💡Use concurrent rebuilds to avoid downtime
PostgreSQL's CONCURRENTLY option lets you rebuild indexes without blocking writes. In MySQL, schedule OPTIMIZE TABLE during maintenance windows and set innodb_ft_cache_size high enough to avoid excessive disk flushes.
📊 Production Insight
In MySQL, OPTIMIZE TABLE on a table with a fulltext index rebuilds that index completely — on a 100GB table this can take hours.
PostgreSQL's CONCURRENTLY rebuild is slower overall but doesn't block production traffic — trade throughput for availability.
Failure: a team that ran OPTIMIZE TABLE on a 200GB table during peak hours, causing a 45-minute write outage. The marketing team couldn't add new products.
Rule: For tables >10GB, always test rebuild times on a staging replica before production. Schedule rebuilds during the lowest traffic window (e.g., Sunday 3 AM).
🎯 Key Takeaway
Index rebuilds are expensive — always do them during low traffic or use CONCURRENTLY.
Monitor index size weekly; a sudden growth spike may indicate tokenisation issues.
Test your rebuild strategy with a production-data clone before the first real run.

MySQL vs PostgreSQL: Implementation Differences That Matter

While both MySQL and PostgreSQL offer full-text search, their internals differ significantly. Choosing the wrong one can lead to performance pain or missing features.

Index storage: MySQL stores the FTS index in hidden auxiliary tables (INNODB_FTS_INDEX_TABLE, INNODB_FTS_INDEX_CACHE) managed by InnoDB. PostgreSQL stores it as a GiST or GIN index on a tsvector column.

Tokenisation and stemming: MySQL uses a built-in tokeniser and stemmer that are less configurable. You can change the minimum token length via innodb_ft_min_token_size and set a custom stopword table, but the stemmer is fixed per language (English, French, Spanish, etc.). PostgreSQL uses text search configurations that separate tokeniser (parser) and dictionary (stemmer, stopwords, thesaurus), giving you the ability to chain dictionaries and create custom ones.

Index maintenance: MySQL requires OPTIMIZE TABLE to rebuild the FTS index (table is locked). PostgreSQL supports CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY for zero-downtime rebuilds.

Query flexibility: PostgreSQL supports ranking weights per column using setweight and can store tsvector as a generated column. MySQL's ranking is less flexible; you can use multiple MATCH clauses with custom multipliers but it's less efficient.

Performance characteristics
  • MySQL InnoDB FTS: Optimised for transactional workloads with moderate write rates. Reads are fast but not as fast as PostgreSQL GIN.
  • PostgreSQL GIN: Faster reads (10-20%) but slower writes (30-50% overhead due to lexeme compression). GiST is slower for reads but faster for writes.

Debugging: MySQL's FTS debugging requires querying INFORMATION_SCHEMA tables. PostgreSQL allows direct querying of the tsvector column, making debugging trivial.

Production recommendation: Use PostgreSQL if you need custom stemming, zero-downtime rebuilds, or fair ranking (BM25). Use MySQL if you're already in a MySQL environment and have simple text search requirements with moderate write rates.

mysql_vs_postgres_fts.sql · SQL
12345678910111213141516171819
-- MySQL: Creating an FTS index
ALTER TABLE io_thecodeforge.articles ADD FULLTEXT(title, body);

-- MySQL: Checking FTS stats (auxiliary tables)
SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE;

-- MySQL: Rebuild FTS index (blocks writes)
OPTIMIZE TABLE io_thecodeforge.articles;

-- PostgreSQL: Creating GIN index on generated tsvector column
ALTER TABLE io_thecodeforge.articles ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
CREATE INDEX idx_articles_fts ON io_thecodeforge.articles USING GIN(search_vector);

-- PostgreSQL: Query the tsvector directly for debugging
SELECT id, search_vector FROM io_thecodeforge.articles;

-- PostgreSQL: Rebuild without blocking
REINDEX INDEX CONCURRENTLY idx_articles_fts;
⚠ Parser differences matter
MySQL's default parser treats URLs as separate tokens (e.g., 'https' as one token). PostgreSQL's parser handles URLs differently. Always test with your actual data to catch tokenisation surprises before production.
📊 Production Insight
In MySQL, the FTS index is not directly queryable — debugging requires INFORMATION_SCHEMA tables.
PostgreSQL allows direct querying of the tsvector column, making debugging trivial.
Failure: a team that relied on MySQL's FTS for a real-time search feature discovered the index could not be rebuilt online — they had to redesign for PostgreSQL mid-project.
🎯 Key Takeaway
MySQL FTS is simpler but less flexible; PostgreSQL FTS offers more control and zero-downtime maintenance.
Choose based on operational needs, not just syntax familiarity.
Always test index rebuild procedures before going to production.
Choose your FTS implementation based on operational needs
IfNeed zero-downtime index rebuilds
UseChoose PostgreSQL with CONCURRENTLY rebuilds. MySQL OPTIMIZE TABLE locks the table.
IfSimple text search, infrequent writes, minimal config
UseMySQL InnoDB FTS is simpler to set up and has lower operational overhead.
IfCustom stemming per language or domain
UsePostgreSQL with custom text search configurations (thesaurus, synonym dictionaries).
IfRead-heavy workload with large text columns (product catalogue, blog search)
UsePostgreSQL GIN index offers 10-20% faster reads than MySQL FTS for large datasets.
🗂 LIKE vs Full-Text Search: Head-to-Head
Choose the right tool for your text search needs
AspectLIKE '%term%'Full-Text Search
Data structureB-Tree (unusable with leading wildcard)Inverted index
Speed (1M rows, 1KB text)2-10 seconds (full sequential scan)10-100 ms (index lookup)
RankingNone (order of insertion or random)TF-IDF or BM25 relevance score
StemmingNone ('running' ≠ 'run')Built-in language-specific stemmer
Stopword handlingNone ('the' is searched like any word)Automatic filtering of noise words
Phrase searchLIKE '%exact phrase%' (scan, no ordering)Boolean mode operators or <-> adjacency
Index maintenanceNone (B-Tree updates automatically)Requires periodic rebuild (OPTIMIZE or REINDEX)
Prefix searchLIKE 'prefix%' (uses index, but leading wildcard does not)wildcard in boolean mode ('run')
InfrastructureBuilt into all SQL databases, no setupRequires index creation and configuration

🎯 Key Takeaways

  • FTS relies on an inverted index — no B-Tree magic. It maps tokens to row IDs, enabling O(log n) lookups instead of O(n) scans.
  • Ranking is algorithmic: MySQL uses TF-IDF (favours long documents), PostgreSQL uses BM25 (normalises length). Choose based on document length variance.
  • Stopwords and stemming must be tailored to your domain. Default English stopwords hide technical terms. Review and customise.
  • Always rebuild indexes after bulk operations or you'll get stale results. PostgreSQL CONCURRENTLY allows zero downtime; MySQL OPTIMIZE TABLE locks.
  • Test FTS with real user queries before production deployment. A query that works on 1k rows can fail on 1M rows due to tokenisation or ranking.
  • For typos and faceted search, FTS falls short. Know when to hand off to Elasticsearch — it's not a failure, it's the right architectural decision.
  • MySQL FTS is simpler but less flexible; PostgreSQL FTS offers more control and zero-downtime maintenance. Choose based on operational needs.
  • Phrase search requires BOOLEAN MODE in MySQL or <-> operator in PostgreSQL. NATURAL LANGUAGE MODE ignores quotes.

⚠ Common Mistakes to Avoid

    Using default stopword list without domain review
    Symptom

    Important domain terms like 'ink', 'set', or 'key' become invisible. Users complain searches return nothing for valid product names.

    Fix

    Export the default stopword list, add it to version control, and manually remove terms relevant to your domain. For product catalogues, consider using no stopwords and handling noise via ranking.

    Forgetting to rebuild the index after bulk operations
    Symptom

    New products don't appear in search results. Old deleted products still show up. The index is stale.

    Fix

    Always run OPTIMIZE TABLE (MySQL) or REINDEX (PostgreSQL) after bulk inserts/deletes affecting >10% of rows. Automate via scheduled jobs or triggers.

    Using FTS for exact match or short codes (e.g., 'ID-123')
    Symptom

    Searching for 'ID-123' returns lots of irrelevant rows or nothing. The tokeniser splits on hyphens and ignores short tokens (min length = 3).

    Fix

    For exact identifiers, use a regular B-Tree index. Reserve FTS for natural language text. For hybrid, use UNION with exact match condition.

    Using NATURAL LANGUAGE MODE for phrase searches expecting quotes to work
    Symptom

    MATCH(col) AGAINST('"exact phrase"') in natural language mode treats the quotes as part of the search string, not as a phrase operator.

    Fix

    Use BOOLEAN MODE with quotes: MATCH(col) AGAINST('"exact phrase"' IN BOOLEAN MODE). In PostgreSQL, use <-> operator.

    Ignoring stemming false positives
    Symptom

    Searches for 'marketing' return products about 'market' (price comparison). Searches for 'run' return 'runner' (shoes) and 'running' (performance).

    Fix

    Use a configuration with no stemming (PostgreSQL 'simple' dictionary, MySQL's language setting with no stemming if available). Or accept that stemming increases recall at the cost of precision.

    Not testing with production-like data volumes before deployment
    Symptom

    FTS queries that run in 50ms on development (10k rows) become 5 seconds in production (10M rows) due to index size and fragmentation.

    Fix

    Always load-test FTS with a copy of production data to estimate rebuild times, query latency, and index size. Use tools like sysbench for MySQL or pgbench for PostgreSQL.

    Using MySQL FTS in a high-write environment without tuning `innodb_ft_cache_size`
    Symptom

    After bulk inserts, FTS queries become slow or return stale data because the index cache is flushing to disk too frequently.

    Fix

    Increase innodb_ft_cache_size to 32MB or 64MB. Monitor INNODB_FTS_INDEX_CACHE size vs INNODB_FTS_INDEX_TABLE. For PostgreSQL, adjust gin_pending_list_limit.

Interview Questions on This Topic

  • QExplain how an inverted index is built and queried in a SQL database's full-text search engine.SeniorReveal
    An inverted index is built by tokenising each document's text, normalising tokens (lowercasing, stemming), discarding stopwords, and then constructing a mapping from each unique token to the list of document IDs where it appears, along with positional information and frequency. In MySQL InnoDB, this is stored in hidden auxiliary tables. In PostgreSQL, it's a GiST or GIN index on a tsvector column. Query execution: (1) tokenise and stem the query text, (2) look up each token in the inverted index (O(log n) per token), (3) retrieve posting lists (list of row IDs), (4) intersect them for multi-word queries, (5) compute a relevance score (TF-IDF or BM25) for each candidate row using term frequency and inverse document frequency, (6) sort by score and return top N. Complexity: O(k log n + m) where k = number of unique tokens in query, n = number of unique tokens in corpus, m = number of candidate rows. This is far faster than a full table scan (O(total documents × document length)).
  • QWhat is the difference between TF-IDF and BM25 ranking, and when would you choose one over the other?SeniorReveal
    TF-IDF (Term Frequency-Inverse Document Frequency) rewards terms that appear frequently in a document but rarely across the collection. Formula: TF-IDF = (term frequency in document) × log(total documents / documents containing term). It doesn't normalise for document length, so long documents tend to score higher simply because they have more words. BM25 (Best Matching 25) adds document length normalisation and saturates term frequency. Formula: BM25 = Σ IDF(qi) × (TF × (k1+1)) / (TF + k1×(1 - b + b×(doc length / avg doc length))). Parameters k1 (term frequency saturation) and b (length normalisation) are tunable. BM25 prevents a single term appearing 100 times from dominating the score and normalises by document length. Choose BM25 when documents vary widely in length (e.g., product titles vs detailed descriptions). Choose TF-IDF for homogeneous document lengths (e.g., tweets, short comments) where length normalisation is less critical. MySQL uses TF-IDF; PostgreSQL uses BM25. For mixed-length content, PostgreSQL gives fairer results.
  • QHow do you debug a full-text search that returns no results for a query that should match?SeniorReveal
    Step-by-step debugging: 1. Verify the FTS index exists: SHOW INDEX FROM table (MySQL) or \d+ table (PostgreSQL). Ensure the columns you're searching are included. 2. Check stopword filtering: Query the stopword table directly. MySQL: SELECT FROM INFORMATION_SCHEMA.INNODB_FTS_DEFAULT_STOPWORD. PostgreSQL: SELECT FROM pg_catalog.pg_ts_dict. 3. Check min token length: MySQL innodb_ft_min_token_size (default 3). Short term 'AI' (2 chars) would be ignored. PostgreSQL uses parser configuration. 4. Query the inverted index directly (PostgreSQL): SELECT search_vector FROM table WHERE id = 1 to see which tokens were actually indexed. 5. Compare with LIKE: SELECT id FROM table WHERE text LIKE '%search_term%' to confirm data exists. 6. Check index staleness: Compare row count in base table with number of distinct doc_ids in FTS index metadata. MySQL: SELECT COUNT(DISTINCT doc_id) FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE. 7. Verify stemming: Search for all forms of the word (e.g., 'run', 'running', 'ran') to see if stemming is too aggressive. 8. Check language configuration: If your data is Spanish but FTS config is English, results will be wrong.
  • QWhen should you use a dedicated search engine like Elasticsearch over SQL FTS?SeniorReveal
    Choose Elasticsearch when you need: - Fuzzy matching (typo tolerance): SQL FTS has no built-in typo handling. Elasticsearch's Levenshtein distance-based matching handles 'teh' → 'the'. - Faceted search: Combine text search with dynamic filters (price range, category, date) efficiently. SQL FTS can do this with UNION, but Elasticsearch's aggregations are far more powerful. - Multi-language support with per-field analysers: Single field can be indexed with multiple analysers (English, Spanish, Chinese) and queried with language detection. - Scale beyond billions of rows: SQL FTS indexes are per-database, single-node. Elasticsearch scales horizontally with sharding and replication. - High write throughput with near real-time search: Elasticsearch's near real-time (1 second) indexing is faster than rebuilding SQL FTS indexes. Use SQL FTS when: - Your dataset is under 100GB and fits on a single server. - You need simple keyword search without typos or faceting. - You want to avoid adding another infrastructure component (Elasticsearch cluster). Hybrid approach: Use SQL FTS for exact keyword matching on product names, Elasticsearch for full catalogue search with fuzzy and facets.
  • QHow would you design a search system for a multilingual product catalogue with English, Spanish, and Chinese content?SeniorReveal
    The key challenge is language-specific tokenisation and stemming. Option 1 — PostgreSQL with conditional tsvectors: Create separate tsvector columns per language, each with the appropriate configuration: ``sql ALTER TABLE products ADD COLUMN search_vector_en tsvector GENERATED ALWAYS AS ( CASE WHEN language_code = 'en' THEN to_tsvector('english', name || ' ' || description) WHEN language_code = 'es' THEN to_tsvector('spanish', name || ' ' || description) ELSE NULL END ) STORED; CREATE INDEX idx_products_fts_en ON products USING GIN(search_vector_en) WHERE language_code = 'en'; CREATE INDEX idx_products_fts_es ON products USING GIN(search_vector_es) WHERE language_code = 'es'; ` Then query using the appropriate language column based on user preference or language detection. Option 2 — MySQL with ngram parser for Chinese: For Chinese (no spaces), MySQL's ngram parser splits text into n-character substrings (e.g., ngarm token size 2). `sql ALTER TABLE products ADD FULLTEXT(name, description) WITH PARSER ngram; `` Better approach for mixed content: Detect user language preference (Accept-Language header or browser setting). Route query to the appropriate language-specific index. For fallback, use a single index with 'simple' tokenisation (no stemming) and accept lower precision. Scalability note: If you have many languages or large volumes, offload to Elasticsearch with per-language analysers. Elasticsearch allows multiple analysers per field and automatic language detection. Production tip: Always test with real queries in each language. English stemming works differently from Spanish — test edge cases.

Frequently Asked Questions

What is Full-Text Search in SQL in simple terms?

Full-Text Search is a database feature that lets you search large text columns for words or phrases quickly. Instead of scanning every row (like LIKE '%term%'), it builds an inverted index that maps each word to the rows containing it. This makes searches in millions of rows complete in milliseconds.

Can I use full-text search with a partial match like 'prefix*'?

Yes, in Boolean mode you can use the asterisk wildcard at the end of a word (e.g., 'run' matches 'running', 'runner', 'runs'). Leading wildcards are not supported in most databases. For prefix search, use the operator in MySQL's BOOLEAN MODE or the :* operator in PostgreSQL's tsquery.

Does full-text search support fuzzy matching (typo tolerance)?

Not natively in MySQL or PostgreSQL. You need trigram indexes (pg_trgm in PostgreSQL) or a dedicated search engine like Elasticsearch for fuzzy matching. Some workarounds exist using Levenshtein distance, but they don't scale to large tables. For typo tolerance, consider adding Elasticsearch as a specialised search layer.

How does full-text search handle multi-language content?

MySQL supports per-column language specification (CHARACTER SET and COLLATION), though stemming varies. PostgreSQL allows creating custom text search configurations per language and even per query. For mixed-language tables, consider separate language-specific columns or use a single configuration that handles multiple languages poorly — better to use an external search engine with per-language analysers.

Is full-text search better than Elasticsearch?

It depends. For simple text search on a single database table with moderate traffic (under 100GB, 10M rows), SQL FTS is simpler and requires no extra infrastructure. For advanced features like fuzzy matching, faceted search, multi-language support, and distributed scaling, Elasticsearch is better. FTS is a feature; Elasticsearch is a search platform. Use both: SQL FTS for exact keyword search, Elasticsearch for full-text catalogue search.

How do I monitor full-text index health in production?
Set up a scheduled job (hourly/daily) that
  • Compares row count with distinct doc_ids in FTS index metadata. Alert if difference >5%.
  • Checks index size growth. Alert if growth exceeds 20% per week without data growth.
  • Runs a query for a known term that should always return results. Alert if count=0.
  • Monitors slow queries: log any MATCH/AGAINST query taking >2 seconds.
  • For MySQL, monitor INNODB_FTS_INDEX_CACHE size vs INNODB_FTS_INDEX_TABLE to detect cache flushes.
Can I use full-text search on a database view?

Yes, but only if the view's underlying table columns have a full-text index and the view is not materialised (regular view). Materialised views can have their own full-text indexes if the DBMS supports it (PostgreSQL allows indexing on materialised views). In MySQL, you cannot create an FTS index on a view directly; create it on the base table and query via the view.

How does FTS handle NULL values?

NULL values are not indexed. If a column contains NULL, it is ignored. To include default text for NULLs, use COALESCE in the index definition or in queries. Example: ALTER TABLE t ADD FULLTEXT(COALESCE(description, '')) is not valid — you need a generated column. PostgreSQL: to_tsvector('english', coalesce(description, '')).

Does FTS support scoring for my own custom fields?

Yes, you can combine the MATCH/AGAINST score with other numeric columns (like popularity or recency) in the ORDER BY clause. In PostgreSQL, you can use the ts_rank function with setweight weights. In MySQL, you can add a weight multiplier in the ORDER BY: ORDER BY MATCH(title) AGAINST('query') 2 + MATCH(body) AGAINST('query') 1 DESC.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousSQL EXPLAIN and Execution PlansNext →Database Cursors
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged