Senior 10 min · March 05, 2026

Full-Text Search in SQL: Stopwords Hide 'ink' Products

Stopwords hid 'ink' from FTS results; LIKE found all products.

N
Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Drawn from code that ran under real load.

Follow
Production
production tested
May 24, 2026
last updated
1,554
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer
  • Full-text search (FTS) uses an inverted index mapping tokens to row IDs. LIKE '%term%' forces full table scan — 1M rows = 10 seconds. FTS = 100ms.
  • MySQL FTS uses InnoDB auxiliary tables; rebuild requires OPTIMIZE TABLE (table lock). PostgreSQL uses GIN indexes supports CONCURRENTLY (zero downtime).
  • Ranking: MySQL uses TF-IDF (no length normalisation); PostgreSQL uses BM25 (normalises by document length — fairer for mixed-length content).
  • Stopwords (the, a, ink, set) are filtered by default. If 'ink' is a product name, your customers see zero results. Always review stopword list for your domain.
  • Stemming reduces 'running' → 'run'. Aggressive stemmers conflate unrelated words (marketing → market). Risk: false positives.
  • Production trap: index staleness after bulk insert. FTS index doesn't auto-update immediately for all engines. Run CHECK TABLE or REINDEX after batch jobs.
  • Biggest mistake: using FTS for exact identifiers (ID-123). Tokeniser splits on hyphens, ignores short tokens. Use B-Tree for IDs.
✦ Definition~90s read
What is Full-Text Search in SQL?

Full-text search in SQL is a specialized indexing and querying mechanism designed for efficient searching of natural language text within database columns. Unlike LIKE '%term%' patterns, which force full table scans and can't rank results, full-text search uses an inverted index—a data structure that maps every distinct word (token) to the list of rows containing it.

Imagine you walk into a giant library with a million books and ask the librarian for every book that mentions 'dragons'.

This enables sub-millisecond lookups on millions of rows, relevance scoring via TF-IDF or BM25 algorithms, and linguistic features like stemming (finding 'run' from 'running') and stopword filtering (ignoring 'the', 'and'). It exists because naive pattern matching breaks at scale and can't answer 'which results are most relevant?'—a requirement for search bars, product catalogs, and document retrieval in applications like e-commerce or content management systems.

In the SQL ecosystem, full-text search sits between basic LIKE queries and external search engines like Elasticsearch or Meilisearch. It's the right choice when you need search capabilities but want to avoid operational complexity—no separate service to deploy, no data synchronization, and transactional consistency with your existing database.

However, it's not a replacement for dedicated search engines when you need faceted navigation, typo tolerance, or real-time indexing at web scale (think Google or Amazon). MySQL's implementation (InnoDB full-text) and PostgreSQL's tsvector/tsquery differ significantly: PostgreSQL offers more granular control over dictionaries, ranking functions, and custom text configurations, while MySQL's is simpler but less flexible.

Both support stopword lists, but their default lists and stemming algorithms vary—a critical detail when 'ink' disappears from search results because it's mistakenly treated as a stopword in a language configuration.

Plain-English First

Imagine you walk into a giant library with a million books and ask the librarian for every book that mentions 'dragons'. A LIKE search is like that librarian reading every single page of every book from cover to cover. Full-text search is like the librarian pulling out a pre-built card catalogue that maps every word directly to which books contain it — the answer comes back in seconds, not hours. That card catalogue is the full-text index, and building it smartly is exactly what this article is about.

Most production applications reach a point where LIKE '%search_term%' stops cutting it. You've got a product catalogue with 500,000 rows, a blog with a decade of articles, or a support ticket system with millions of entries — and users expect Google-quality search that returns relevant results in under 100 milliseconds. LIKE with a leading wildcard can't use a B-Tree index, so the database performs a full sequential scan every single time. At scale, that's a query that grinds your entire application to a halt.

Full-text search (FTS) solves this with an entirely different data structure — an inverted index — that maps individual tokens (words) back to the rows that contain them. But it does far more than just fast lookups. It understands language: it knows that 'running', 'ran', and 'runs' are forms of the same root word. It knows that 'the', 'a', and 'is' are so common they're noise. It can rank results by relevance so that the most useful rows float to the top. These capabilities live inside your existing SQL database, no Elasticsearch cluster required.

By the end of this article you'll understand exactly how an inverted index is built and queried, how to set up and tune full-text search in both MySQL and PostgreSQL, how relevance ranking actually works under the hood, when FTS is the right tool versus when you should reach for a dedicated search engine, and every production gotcha that bites teams who ship FTS without reading the fine print.

Why Full-Text Search in SQL Is Not Just LIKE with a Fancy Index

Full-text search in SQL is a specialized indexing and querying mechanism that tokenizes text into words (terms), stores them in an inverted index, and supports linguistic operations like stemming, ranking, and stopword filtering. Unlike a B-tree index used with LIKE '%pattern%', which scans every row and cannot use an index efficiently, a full-text index maps each term to its document locations, enabling sub-second searches over millions of rows.

At its core, full-text search builds an inverted index: for each unique word, it stores a list of document IDs and positions. Queries like CONTAINS or FREETEXT (T-SQL) or MATCH ... AGAINST (MySQL) leverage this index to find documents containing specific words or phrases, rank results by relevance using TF-IDF or BM25, and apply language-specific rules such as stemming (e.g., 'running' matches 'run') and stopword removal (common words like 'the' are ignored). This is fundamentally different from pattern matching — it's a set-based retrieval, not a scan.

Use full-text search when your application needs to search natural language text — product descriptions, articles, emails — at scale. It's the right tool when you need relevance ranking, linguistic variants, or fast search over large text columns. But it's not a replacement for a dedicated search engine like Elasticsearch when you need faceted search, fuzzy matching, or real-time indexing at massive scale. In SQL databases, full-text search fills the gap between simple LIKE queries and external search infrastructure.

Stopwords Can Silently Exclude Results
Default stopword lists often include short, common words like 'ink' — meaning a search for 'ink products' may return zero results if 'ink' is treated as noise.
Production Insight
A product catalog search for 'ink cartridges' returned zero results because the default English stopword list included 'ink' as a noise word.
The symptom: users see 'no results' for queries containing short, common product terms, while other queries work fine.
Rule of thumb: always review and customize stopword lists per domain — never trust defaults for business-critical search.
Key Takeaway
Full-text search uses an inverted index, not a B-tree — it's designed for word-level retrieval, not pattern matching.
Stopwords and stemming are language-specific and can silently exclude valid results if not configured for your domain.
Full-text search is not a general-purpose search engine — use it for ranked text search within a single database, not for distributed, real-time search at scale.
Full-Text Search in SQL: Inverted Index Flow THECODEFORGE.IO Full-Text Search in SQL: Inverted Index Flow From tokenization to ranked results with stopwords and stemming Tokenization & Stopword Removal Split text into tokens, filter common words Stemming & Normalization Reduce words to root form (e.g., 'running' → 'run') Inverted Index Build Map tokens to document IDs and positions MATCH/AGAINST Query Search index with relevance ranking (TF-IDF) Ranked Result Set Return documents sorted by relevance score ⚠ Stopwords can hide relevant products (e.g., 'ink') Customize stopword list or use boolean mode to bypass THECODEFORGE.IO
thecodeforge.io
Full-Text Search in SQL: Inverted Index Flow
Full Text Search Sql

How the Inverted Index Works Internally

An inverted index is the data structure behind all full-text search engines — SQL's built-in FTS included. It maps every distinct word (token) to the list of rows where that word appears, along with its position and frequency inside each document.

When you insert or update a row, the database tokenizes the text, discards stopwords, applies stemming, and builds (or appends to) this mapping. The result is a structure you can query in logarithmic time: look up the token, get the row list, apply ranking, return top N.

MySQL stores its FTS index in separate internal InnoDB tables (auxiliary tables). The main table holds the data; the FTS index is in hidden tables (INNODB_FTS_INDEX_TABLE, INNODB_FTS_INDEX_CACHE). You cannot query them directly for debugging, but they exist.

PostgreSQL uses GiST or GIN indexes, which store lexemes (stemmed tokens) directly in the index structure. GIN is faster for lookups (read-heavy), GiST is faster for updates (write-heavy). You can query the tsvector column directly for debugging.

How a query is executed: When you run MATCH(col) AGAINST('search'), the database: (1) tokenizes and stems the query text, (2) looks up each token in the inverted index, (3) retrieves posting lists (list of row IDs), (4) optionally intersects them for multi-word queries, (5) computes a relevance score (TF-IDF or BM25) for each candidate row, (6) sorts by score and returns top N.

fts_internals_demo.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- MySQL: InnoDB FTS uses hidden auxiliary tables
-- You can inspect them (read-only)
SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE;
SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_CACHE;

-- PostgreSQL: tsvector column stores lexemes
CREATE TABLE io_thecodeforge.documents (
    id SERIAL PRIMARY KEY,
    body TEXT,
    search_vector TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
);

-- Insert and the tsvector is auto-generated
INSERT INTO io_thecodeforge.documents (body) VALUES ('The quick brown fox jumps over the lazy dog.');

-- Query the tsvector directly to see normalized terms
SELECT id, body, search_vector FROM io_thecodeforge.documents;

-- Output shows:
-- 'brown':2 'dog':8 'fox':3 'jump':4 'lazi':7 'quick':1
-- Stopwords ('the', 'over') are omitted. Stemming: 'jumps' → 'jump', 'lazy' → 'lazi'
Output
id | body | search_vector
----+------+-------------------------------------------------------
1 | The quick brown fox jumps over the lazy dog. | 'brown':2 'dog':8 'fox':3 'jump':4 'lazi':7 'quick':1
Mental Model: Dictionary with Page Numbers
  • Token = word in the book's index
  • Posting list = list of page (row) numbers where that word occurs
  • Position = where on the page the word appears, used for phrase search
  • Frequency = how many times the word appears on that page, used for relevance ranking
  • The index is pre-built — queries don't scan rows, they consult the index.
Production Insight
InnoDB FTS stores the index in auxiliary tables managed internally — you cannot query them directly for most operations.
PostgreSQL's GIN index is transactional, so concurrent reads see a consistent snapshot during updates.
Failure: a server crash during a bulk insert can leave the FTS index in an inconsistent state — always test with CHECK TABLE.
Rule: For write-heavy workloads, PostgreSQL GiST has lower overhead than GIN but slower reads.
Key Takeaway
An inverted index is a reverse mapping: from words to rows, not rows to words.
It's the reason FTS queries return in milliseconds instead of seconds.
The choice of storage engine (InnoDB vs GIN) directly impacts write vs read performance.
Choose your FTS implementation based on workload
IfRead-heavy, large text columns, infrequent writes
UseUse PostgreSQL GIN index — best query speed, slower inserts, acceptable for read-mostly tables
IfWrite-heavy, many updates, small documents
UseUse MySQL InnoDB FTS — better insert performance, acceptable query speed, transactional consistency
IfNeed phrase search and fuzzy matching
UseBoth support, but PostgreSQL's phrase search is more robust with tsquery operators (<-> for adjacency)
IfZero-downtime index rebuilds required
UsePostgreSQL with CONCURRENTLY rebuilds. MySQL requires table lock during OPTIMIZE TABLE.

MATCH/AGAINST Syntax and Relevance Ranking

The SQL syntax for full-text search is MATCH(columns) AGAINST('query' IN NATURAL LANGUAGE MODE). The engine returns rows ordered by relevance — a floating-point score. Higher scores mean more relevant.

Relevance ranking isn't magic. Behind the scenes the database computes a variant of TF-IDF (MySQL) or BM25 (PostgreSQL).

  • TF-IDF (Term Frequency-Inverse Document Frequency) rewards terms that appear frequently in a document but rarely across the whole corpus. It doesn't normalise for document length, so long documents (with more words) tend to score higher simply because they contain more text.
  • BM25 (Best Matching 25) adds document length normalisation. A short document that matches a term is given a higher boost relative to its length. It also saturates term frequency so that 100 occurrences of a word doesn't give 100x the score of 1 occurrence.

IN BOOLEAN MODE gives you operators: +mandatory, -exclude, *wildcard, "phrase". It doesn't rank — it just filters. Use it when you need exact control over which terms must match, not when you need relevance ordering.

WITH QUERY EXPANSION (MySQL) adds terms from top-ranked results to the original query. It can help find related documents but can also drift off-topic.

Practical example: For a product catalogue with descriptions of varying lengths (e.g., 50-character titles vs 5000-character detailed descriptions), BM25 gives fairer ranking. MySQL's TF-IDF would rank the long descriptions higher even if the title contains the exact match.

match_against_demo.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
-- MySQL: Natural language mode (ranked)
SELECT id, title,
       MATCH(title, body) AGAINST('Docker Compose' IN NATURAL LANGUAGE MODE) AS relevance
FROM io_thecodeforge.articles
ORDER BY relevance DESC;

-- PostgreSQL: ts_rank with normalization
SELECT id, title,
       ts_rank(to_tsvector('english', title || ' ' || body),
               plainto_tsquery('english', 'Docker Compose'), 1) AS relevance
FROM io_thecodeforge.articles
ORDER BY relevance DESC;

-- Boolean mode with operators (MySQL/PostgreSQL)
SELECT id, title
FROM io_thecodeforge.articles
WHERE MATCH(title, body) AGAINST('+Docker +healthcheck -deprecated' IN BOOLEAN MODE);

-- Phrase search in PostgreSQL using <-> operator
SELECT id, title
FROM io_thecodeforge.articles
WHERE to_tsvector('english', body) @@ to_tsquery('docker <-> compose');
Output
With ranking, documents containing the phrase in the title score higher. Boolean mode returns only documents that match all mandatory terms.
Ranking trap: short documents lose in MySQL
TF-IDF over-weights long documents. If your catalogue has product names (short) and descriptions (long), the description will often rank higher even when the name is exact match. Use PostgreSQL or implement manual boosting of title column with MATCH() * weight.
Production Insight
MySQL's relevance score is normalised by total term frequency across all rows — adding a row can change all scores, causing order to shift non-deterministically.
PostgreSQL's ts_rank is deterministic per query, but the configuration (weights via setweight) controls which column matters more.
Failure: forgetting to use ORDER BY in natural language mode — rows come back in arbitrary (insertion) order, not by relevance. Your users see random results.
Key Takeaway
MATCH/AGAINST with NATURAL LANGUAGE MODE returns ranked results automatically.
Boolean mode gives you operators but no ranking — use it for filtering.
Ranking algorithms differ: MySQL uses TF-IDF (favours long docs), PostgreSQL uses BM25 (normalises length).

Stopwords, Stemming and Language Support

Stopwords are common words (the, a, in, of) that don't carry search meaning. FTS engines discard them by default. But the default stopword list is English-centric and often wrong for your domain. If you're searching a product catalogue and 'ink' is a valid product name, but 'ink' happens to be in your custom stopword list — you'll lose results. On PostgreSQL, the default list includes 'the', 'a', 'and', 'of', 'to', 'in', 'for', etc. Not 'ink', unless customised. But technical terms like 'set', 'key', 'host', 'port' are NOT in default lists, but could be added accidentally.

Stemming reduces words to root forms: 'running' becomes 'run', 'better' becomes 'good'. MySQL uses a built-in stemmer based on the language setting. PostgreSQL uses dictionaries (english, simple, snowball). You can create custom dictionaries for domain-specific terms.

Key trade-off: aggressive stemming conflates unrelated words (e.g., 'marketing' and 'market' become 'market') causing false positives. Too little stemming and you miss variants.

Language-specific considerations: MySQL supports per-column character sets and collations that affect tokenisation. For non-English languages (Chinese, Japanese, Korean), MySQL's ngram parser splits text into n-character substrings; PostgreSQL's parser uses language-specific dictionaries.

Custom dictionaries in PostgreSQL: You can chain dictionaries (e.g., a custom thesaurus that maps 'JS' to 'JavaScript', then a snowball stemmer). This is far more flexible than MySQL's global setting.

custom_stopword.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- MySQL: set a custom stopword table
SET GLOBAL innodb_ft_server_stopword_table = 'mydb/my_stopwords';

CREATE TABLE mydb.my_stopwords (value VARCHAR(30));
INSERT INTO mydb.my_stopwords VALUES ('the'), ('a'), ('of'), ('our'), ('your');

-- Rebuild index for changes to take effect
OPTIMIZE TABLE mydb.articles;

-- PostgreSQL: create custom text search configuration
CREATE TEXT SEARCH CONFIGURATION io_thecodeforge.product_search (COPY = english);

-- Replace stemmer with 'simple' (no stemming) for precise matching
ALTER TEXT SEARCH CONFIGURATION io_thecodeforge.product_search
    ALTER MAPPING FOR word WITH simple;

-- Add a custom thesaurus dictionary
CREATE TEXT SEARCH DICTIONARY my_thesaurus (
    TEMPLATE = thesaurus,
    DictFile = 'product_synonyms'
);

-- Then use it in queries
SELECT to_tsvector('io_thecodeforge.product_search', 'JS framework');
Stopword surprise: technical terms are often stopwords by accident
Words like 'set', 'key', 'host', 'port', 'null' are common in technical documentation but may be stopwords in some languages or custom lists. Always export and review your stopword list against actual product names and user queries.
Production Insight
Default stopword lists are trained on general English — they hurt domain-specific search.
PostgreSQL allows per-query tokenizer overrides; MySQL requires global config changes (requires restart for fulltext index rebuild).
Failure: a legal search system that dropped 'contract' because it matched a stopword in the custom list — double-check every word you blacklist.
Rule: For domain-specific catalogues, start with no stopwords (empty list) and only add those proven to cause false positives.
Key Takeaway
Stopwords and stemming are language-specific — never trust the default for your domain.
Test your FTS with a sample of real queries to catch false positives.
PostgreSQL's per-query configuration gives more control; MySQL's global setting is simpler but less flexible.

Performance Tuning and Index Maintenance

Full-text indexes grow with your data and can become bloated, fragmented, or stale.

MySQL InnoDB FTS tuning
  • innodb_ft_min_token_size (default 3): ignore words shorter than this. Lower to index acronyms ('AI', 'UX') but increases index size by 20-30%.
  • innodb_ft_max_token_size (default 84): ignore words longer than this. Increase for long product names.
  • innodb_ft_cache_size (default 8MB): size of the in-memory cache for tokenised data. Larger values improve insert performance for bulk operations.
  • Index maintenance: OPTIMIZE TABLE rebuilds the FTS index but LOCKS THE TABLE for writes. Schedule during low traffic.
PostgreSQL GIN/GiST tuning
  • work_mem: controls memory used for sorting during index creation. Higher values speed up large index builds.
  • maintenance_work_mem: memory used for REINDEX and VACUUM operations.
  • gin_pending_list_limit: for GIN indexes with fast update enabled, this controls the size of the pending list.
  • Zero-downtime maintenance: CREATE INDEX CONCURRENTLY and REINDEX INDEX CONCURRENTLY rebuild indexes without blocking writes.
Monitoring index health
  • MySQL: SHOW INDEX FROM table shows cardinality. Compare INNODB_FTS_INDEX_TABLE row count with base table row count to detect staleness.
  • PostgreSQL: pgstatindex() for GiST indexes, pg_size_pretty(pg_relation_size('index_name')) for index size growth.

Practical benchmarks: On a 10 million row table with 1KB average text per row, MySQL InnoDB FTS index size is about 20-30% of data size (2-3GB). PostgreSQL GIN index is larger, about 40-50% (4-5GB), but provides faster read queries.

The staleness trap: After a bulk delete or update, FTS indexes may still point to deleted rows. Run CHECK TABLE (MySQL) or REINDEX (PostgreSQL) after any bulk operation affecting >20% of rows.

fts_tuning.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
-- MySQL: Check FTS index fragmentation
SHOW INDEX FROM io_thecodeforge.articles;
SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE
WHERE TABLE_ID = (SELECT TABLE_ID FROM INFORMATION_SCHEMA.INNODB_TABLES WHERE NAME = 'io_thecodeforge/articles');

-- Rebuild fulltext index (blocks writes) — schedule during maintenance
OPTIMIZE TABLE io_thecodeforge.articles;

-- Set min token length (requires restart)
SET GLOBAL innodb_ft_min_token_size = 2;

-- PostgreSQL: create index CONCURRENTLY (no downtime)
CREATE INDEX CONCURRENTLY idx_fts_articles ON io_thecodeforge.articles USING GIN(to_tsvector('english', body));

-- Monitor index size
SELECT pg_size_pretty(pg_relation_size('idx_fts_articles'));

-- Rebuild online without locking
REINDEX INDEX CONCURRENTLY idx_fts_articles;

-- Increase memory for index operations
SET maintenance_work_mem = '1024MB';  -- for REINDEX
SET work_mem = '64MB';  -- for query-time sorting
Use concurrent rebuilds to avoid downtime
PostgreSQL's CONCURRENTLY option lets you rebuild indexes without blocking writes. In MySQL, schedule OPTIMIZE TABLE during maintenance windows and set innodb_ft_cache_size high enough to avoid excessive disk flushes.
Production Insight
In MySQL, OPTIMIZE TABLE on a table with a fulltext index rebuilds that index completely — on a 100GB table this can take hours.
PostgreSQL's CONCURRENTLY rebuild is slower overall but doesn't block production traffic — trade throughput for availability.
Failure: a team that ran OPTIMIZE TABLE on a 200GB table during peak hours, causing a 45-minute write outage. The marketing team couldn't add new products.
Rule: For tables >10GB, always test rebuild times on a staging replica before production. Schedule rebuilds during the lowest traffic window (e.g., Sunday 3 AM).
Key Takeaway
Index rebuilds are expensive — always do them during low traffic or use CONCURRENTLY.
Monitor index size weekly; a sudden growth spike may indicate tokenisation issues.
Test your rebuild strategy with a production-data clone before the first real run.

MySQL vs PostgreSQL: Implementation Differences That Matter

While both MySQL and PostgreSQL offer full-text search, their internals differ significantly. Choosing the wrong one can lead to performance pain or missing features.

Index storage: MySQL stores the FTS index in hidden auxiliary tables (INNODB_FTS_INDEX_TABLE, INNODB_FTS_INDEX_CACHE) managed by InnoDB. PostgreSQL stores it as a GiST or GIN index on a tsvector column.

Tokenisation and stemming: MySQL uses a built-in tokeniser and stemmer that are less configurable. You can change the minimum token length via innodb_ft_min_token_size and set a custom stopword table, but the stemmer is fixed per language (English, French, Spanish, etc.). PostgreSQL uses text search configurations that separate tokeniser (parser) and dictionary (stemmer, stopwords, thesaurus), giving you the ability to chain dictionaries and create custom ones.

Index maintenance: MySQL requires OPTIMIZE TABLE to rebuild the FTS index (table is locked). PostgreSQL supports CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY for zero-downtime rebuilds.

Query flexibility: PostgreSQL supports ranking weights per column using setweight and can store tsvector as a generated column. MySQL's ranking is less flexible; you can use multiple MATCH clauses with custom multipliers but it's less efficient.

Performance characteristics
  • MySQL InnoDB FTS: Optimised for transactional workloads with moderate write rates. Reads are fast but not as fast as PostgreSQL GIN.
  • PostgreSQL GIN: Faster reads (10-20%) but slower writes (30-50% overhead due to lexeme compression). GiST is slower for reads but faster for writes.

Debugging: MySQL's FTS debugging requires querying INFORMATION_SCHEMA tables. PostgreSQL allows direct querying of the tsvector column, making debugging trivial.

Production recommendation: Use PostgreSQL if you need custom stemming, zero-downtime rebuilds, or fair ranking (BM25). Use MySQL if you're already in a MySQL environment and have simple text search requirements with moderate write rates.

mysql_vs_postgres_fts.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
-- MySQL: Creating an FTS index
ALTER TABLE io_thecodeforge.articles ADD FULLTEXT(title, body);

-- MySQL: Checking FTS stats (auxiliary tables)
SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_INDEX_TABLE;

-- MySQL: Rebuild FTS index (blocks writes)
OPTIMIZE TABLE io_thecodeforge.articles;

-- PostgreSQL: Creating GIN index on generated tsvector column
ALTER TABLE io_thecodeforge.articles ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))) STORED;
CREATE INDEX idx_articles_fts ON io_thecodeforge.articles USING GIN(search_vector);

-- PostgreSQL: Query the tsvector directly for debugging
SELECT id, search_vector FROM io_thecodeforge.articles;

-- PostgreSQL: Rebuild without blocking
REINDEX INDEX CONCURRENTLY idx_articles_fts;
Parser differences matter
MySQL's default parser treats URLs as separate tokens (e.g., 'https' as one token). PostgreSQL's parser handles URLs differently. Always test with your actual data to catch tokenisation surprises before production.
Production Insight
In MySQL, the FTS index is not directly queryable — debugging requires INFORMATION_SCHEMA tables.
PostgreSQL allows direct querying of the tsvector column, making debugging trivial.
Failure: a team that relied on MySQL's FTS for a real-time search feature discovered the index could not be rebuilt online — they had to redesign for PostgreSQL mid-project.
Key Takeaway
MySQL FTS is simpler but less flexible; PostgreSQL FTS offers more control and zero-downtime maintenance.
Choose based on operational needs, not just syntax familiarity.
Always test index rebuild procedures before going to production.
Choose your FTS implementation based on operational needs
IfNeed zero-downtime index rebuilds
UseChoose PostgreSQL with CONCURRENTLY rebuilds. MySQL OPTIMIZE TABLE locks the table.
IfSimple text search, infrequent writes, minimal config
UseMySQL InnoDB FTS is simpler to set up and has lower operational overhead.
IfCustom stemming per language or domain
UsePostgreSQL with custom text search configurations (thesaurus, synonym dictionaries).
IfRead-heavy workload with large text columns (product catalogue, blog search)
UsePostgreSQL GIN index offers 10-20% faster reads than MySQL FTS for large datasets.

Full-Text Search Queries: How They Actually Work at the Protocol Level

You've seen MATCH/AGAINST. Fine. But what happens when you type a query like '+smartphone -iphone warranty' in BOOLEAN MODE? The query parser breaks that string into tokens at the same word boundaries the indexer used. Then it traverses the inverted index looking for documents containing 'smartphone' while excluding those that also contain 'iphone'. The Boolean operators are evaluated as set operations on document ID lists.

Here's where juniors get burned: query expansion and proximity searches. MySQL's WITH QUERY EXPANSION performs a two-phase search — the first returns results, the second re-runs the query using the most relevant terms from the first result set. Sounds clever until you realise it can return garbage if your initial query matched irrelevant documents. Never use it on production user-facing search without manual relevance testing.

SQL Server's CONTAINSTABLE and FREETEXTTABLE return ranked results with internal scoring you can inspect. The difference? FREETEXT does linguistic stemming and thesaurus expansion automatically. CONTAINS gives you exact control over prefixes, proximity, and inflectional forms. Pick the right weapon.

ProximitySearchKills.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// io.thecodeforge — database tutorial

-- MySQL: exact phrase with proximity
SELECT product_id, product_name
FROM products
WHERE MATCH(product_name, description)
AGAINST('"next day delivery"' IN BOOLEAN MODE);

-- SQL Server: proximity within 5 words
SELECT p.ProductID, p.Name
FROM Production.Product p
WHERE CONTAINS(p.Name, 'smartphone NEAR iphone', LANGUAGE 'English');

-- PostgreSQL: using tsquery operators
SELECT id, title
FROM documents
WHERE to_tsvector('english', content) @@
      to_tsquery('english', 'smartphone <-> iphone');
-- <-> means adjacent, <2> means within 2 words
Output
product_id | product_name
----------+-------------------
1001 | Smartphone XYZ next day delivery
2045 | Guaranteed next day delivery phones
(2 rows affected)
Production Trap: Query Expansion Bleed
WITH QUERY EXPANSION can pull in documents that share stopwords or common terms with your results, not semantically related ones. Your 'laptop battery life' query might return 'life insurance' documents. Test on a sample before enabling.
Key Takeaway
BOOLEAN MODE for AND/OR/NOT precision; NATURAL LANGUAGE MODE for ranked relevance; WITH QUERY EXPANSION only after manual testing — it's a shotgun, not a scalpel.

The Full-Text Index Architecture Nobody Draws on the Whiteboard

The inverted index is the heart, but the rest of the body matters too. Every full-text index has three structural layers: the token list, the postings list, and the position list. The token list maps each unique word to a compressed integer ID. The postings list maps that ID to a list of document IDs where the word appears. The position list tracks word offsets within each document for proximity searches.

SQL Server stores these as internal system tables inside a full-text catalog — a virtual container that doesn't map to a single file. Each catalog can span multiple filegroups, and each index has its own population history. When you rebuild a full-text index, SQL Server drops all internal structures and re-parses every row through the filter daemon and word breaker. That's why a rebuild on a 10-million-row table takes hours, not seconds.

PostgreSQL stores tsvector directly in the table column. No separate catalog. This means updates are row-level — when you change a row, PostgreSQL re-parses only that row through a trigger or computed column. The trade-off: no incremental population like SQL Server's change tracking. You pay for every UPDATE with CPU to re-lex the text. Know your write pattern before choosing.

IndexStructureDiagnostic.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
// io.thecodeforge — database tutorial

-- SQL Server: inspect full-text catalog details
SELECT c.name AS catalog_name,
       i.name AS index_name,
       i.is_active,
       i.crawl_type_description,
       i.item_count,
       i.unique_key_count
FROM sys.fulltext_catalogs c
JOIN sys.fulltext_indexes i ON c.fulltext_catalog_id = i.fulltext_catalog_id;

-- PostgreSQL: check tsvector stats per row
SELECT id,
       length(to_tsvector('english', content)) AS vector_size,
       nspname || '.' || relname AS table_name
FROM documents d
JOIN pg_class ON oid = d.tableoid
LIMIT 5;
Output
catalog_name | index_name | is_active | crawl_type | item_count
-------------+-------------------+-----------+------------+-----------
ProductFtCat | ProductDescFT | 1 | FULL | 2842154
SalesFtCat | SalesNotesFT | 1 | INCREMENTAL| 892104
(2 rows affected)
Senior Shortcut: Monitor Crawl Type
SQL Server's sys.fulltext_indexes.crawl_type_description tells you if your index did a FULL rescan (expensive) or INCREMENTAL (cheap, uses change tracking). If you see FULL on every scheduled job, you've misconfigured change tracking. Fix it before the DBAs hunt you.
Key Takeaway
Inverted index size = unique tokens * average document frequency. Each token ID is 4 bytes, each document ID is 4 bytes, each position is 2 bytes. For 1M documents with 100K unique tokens, expect about 800MB for the postings list alone. Plan your disk.

PATINDEX and CHARINDEX — Pattern Matching Beyond LIKE in SQL Server

LIKE is fine for simple wildcards, but it chokes when you need position-aware substring searches without full-text indexing. PATINDEX and CHARINDEX fill that gap. CHARINDEX is your scalpel: it returns the starting position of one string inside another, zero if missing. No wildcards, no guesswork. PATINDEX steps it up with regex-like patterns — find any numeric digit, any non-alphanumeric character, or custom patterns using % and _ with square brackets. Why this matters: you can extract specific substrings from messy data without resorting to slow CLR functions or client-side parsing. Use them when full-text search is overkill but LIKE is too weak. Both ignore collation unless you force it, so watch case sensitivity. They're also non-SARGable — indexes won't help. Keep pattern matching lean.

PatternMatching.sqlSQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// io.thecodeforge — database tutorial

-- Find position of 'urgent' (case-sensitive by collation)
SELECT CHARINDEX('urgent', body, 1) AS pos
FROM tickets
WHERE body LIKE '%urgent%';

-- PATINDEX: find first numeric digit position
SELECT PATINDEX('%[0-9]%', description) AS first_digit_pos
FROM products
WHERE PATINDEX('%[0-9]%', description) > 0;

-- PATINDEX with exclusion pattern: non-alpha start
SELECT PATINDEX('%[^a-zA-Z]%', title) AS first_non_alpha
FROM articles;
Output
pos | first_digit_pos | first_non_alpha
---- | --------------- | ---------------
12 | 4 | 1
0 | 0 | 3
Production Trap:
PATINDEX only supports a subset of regex-like patterns — don't expect full regex. Use square brackets for character groups, not quantifiers like + or *. For complex patterns, consider CLR integration or full-text search.
Key Takeaway
CHARINDEX finds positions, PATINDEX finds patterns — both outrun LIKE when you need speed and precision.
● Production incidentPOST-MORTEMseverity: high

The Invisible Products: When Stopwords Hid an Entire Category

Symptom
Search for 'ink' returns no results. Search for 'ink cartridge' returns only 'cartridge' matches, missing all ink products. LIKE '%ink%' works fine. MySQL logs show the query executes, but FTS reports no matches.
Assumption
The team assumed the FTS index was stale or the query syntax was wrong. They ran OPTIMIZE TABLE, rebuilt the index, even switched from NATURAL LANGUAGE to BOOLEAN mode. Nothing worked. They never considered that 'ink' might be a stopword.
Root cause
MySQL's default stopword list includes common English words — 'the', 'a', 'of', 'it', 'and' — but also 'ink'? No, the default list actually doesn't include 'ink'. The team's specific version had a custom stopword list from a previous DBA who added domain-specific noise words. 'Ink' was included because the DBA thought it was too common in print logs. On PostgreSQL, the default stopword list also doesn't include 'ink'. But both databases allow custom stopword tables. The team inherited an undocumented custom stopword configuration that filtered critical product terms. The real issue: no one had reviewed the stopword list for domain relevance. Technical terms like 'set', 'key', 'host', 'port' also risk being filtered depending on the list version. The team learned that default stopwords are trained on general English, not on e-commerce catalogues.
Fix
1. Checked the stopword list: MySQL SHOW VARIABLES LIKE 'innodb_ft_server_stopword_table'; PostgreSQL SELECT * FROM pg_catalog.pg_ts_config_map. 2. Found 'ink' in the custom stopword list. Removed it: SET GLOBAL innodb_ft_server_stopword_table = ''; (MySQL) or ALTER TEXT SEARCH CONFIGURATION ... DROP MAPPING FOR ... (PostgreSQL). 3. Rebuilt the full-text index: OPTIMIZE TABLE products; (MySQL) or REINDEX INDEX CONCURRENTLY idx_products_fts; (PostgreSQL). 4. Added a scheduled job to export the stopword list weekly to version control and compare against domain-critical terms. 5. Created a monitoring query that periodically searches for terms that should exist (like 'ink') and alerts if count = 0. 6. Documented the stopword configuration in the team's runbook.
Key lesson
  • Stopwords are not universal. Default lists suit general English, not e-commerce, medical, or technical domains.
  • Always export and review your full-text configuration (stopwords, stemmer, min token length) before deploying to production.
  • Test FTS with real user queries from day one. If a term should return results, verify it does.
  • For domain-critical terms (product names, SKU prefixes, technical acronyms), ensure they are NOT in the stopword list.
  • Set up monitoring: a weekly job that queries a set of known terms and alerts if any return zero results.
Production debug guideSymptom → Action for common FTS failures in production5 entries
Symptom · 01
FTS query returns no results but LIKE finds matches
Fix
Check if the text column is included in the full-text index using SHOW INDEX or \d+ table. Verify stopwords are not filtering your search terms by querying the stopword table directly. Check min token length — 'AI' with length 2 filtered if min=3.
Symptom · 02
Relevance ranking ignores your most important keywords
Fix
Review the ranking algorithm (BM25 vs TF-IDF). In PostgreSQL, adjust the weight column configuration using setweight. In MySQL, tweak the WITH QUERY EXPANSION option or add manual multipliers in ORDER BY.
Symptom · 03
FTS query is slow (over 1 second) on a large table
Fix
Check index size and fragmentation. Run ANALYZE TABLE (MySQL) or VACUUM ANALYZE (PostgreSQL). Increase innodb_ft_cache_size (MySQL) or work_mem (PostgreSQL). Consider partitioning the table if it exceeds 100GB.
Symptom · 04
Search misses plurals or verb forms — 'running' doesn't match 'ran'
Fix
Examine the stemming dictionary. For MySQL, set the proper parser plugin (ngram for CJK, mecab for Japanese). For PostgreSQL, choose the correct text search configuration (english vs simple) and verify the dictionary (snowball vs simple).
Symptom · 05
New products don't appear in search results after bulk insert
Fix
Check if the FTS index is stale. For MySQL InnoDB, the index updates transactionally but you may need OPTIMIZE TABLE after bulk operations. For PostgreSQL, ensure the tsvector column is generated or the index includes the column. Run REINDEX CONCURRENTLY.
★ Full-Text Search Debug Cheat SheetCommands to run when FTS breaks in production. Run these first.
FTS index not picking up new data
Immediate action
Check last rebuild timestamp and run a manual rebuild
Commands
SELECT MAX(last_modified) FROM products; -- compare to index metadata
OPTIMIZE TABLE products; -- MySQL or REINDEX INDEX CONCURRENTLY idx_products_fts; -- PostgreSQL
Fix now
Force a full rebuild: OPTIMIZE TABLE products; (MySQL, locks table) or REINDEX INDEX CONCURRENTLY idx_products_fts; (PostgreSQL, no lock)
Wrong ranking order, irrelevant results at top+
Immediate action
Log the raw relevance score and inspect ranking factors
Commands
SELECT id, MATCH(title) AGAINST('query' IN BOOLEAN MODE) AS score FROM products ORDER BY score DESC LIMIT 20;
Check if stopwords are filtering: SELECT * FROM INFORMATION_SCHEMA.INNODB_FTS_DEFAULT_STOPWORD; (MySQL) or SELECT * FROM pg_catalog.pg_ts_config_map; (PostgreSQL)
Fix now
Temporarily disable stopwords for debugging: SET GLOBAL innodb_ft_enable_stopword=0; (MySQL). For PostgreSQL, use 'simple' configuration: to_tsvector('simple', text).
Full-text query times out (>5 seconds)+
Immediate action
Check slow query log and examine the execution plan
Commands
SHOW VARIABLES LIKE 'slow_query_log%'; (MySQL) or SHOW log_min_duration_statement; (PostgreSQL)
EXPLAIN FORMAT=JSON SELECT ... MATCH AGAINST ... ; -- check if FTS index is used
Fix now
Add a LIMIT clause and use IN BOOLEAN MODE for faster partial matching. Increase innodb_ft_cache_size (MySQL) or work_mem (PostgreSQL).
Short words like 'AI', 'UX' not returning results+
Immediate action
Check min token length configuration
Commands
SHOW VARIABLES LIKE 'innodb_ft_min_token_size'; (MySQL, default 3) or SELECT cfgname, cfgoption FROM pg_catalog.pg_ts_config; (PostgreSQL)
SET GLOBAL innodb_ft_min_token_size = 2; (requires restart and index rebuild)
Fix now
For MySQL, set min token length to 2. For PostgreSQL, use a custom parser or trigram index for short words.
Phrase search using quotes doesn't work+
Immediate action
Verify you're using BOOLEAN MODE (not NATURAL LANGUAGE MODE)
Commands
SHOW VARIABLES LIKE 'ft_boolean_syntax'; (MySQL) or SELECT 'The <-> quick <-> brown'::tsquery; (PostgreSQL)
SELECT * FROM products WHERE MATCH(col) AGAINST('"exact phrase"' IN BOOLEAN MODE);
Fix now
Use BOOLEAN MODE with quotes for phrase search. PostgreSQL: to_tsvector(body) @@ to_tsquery('quick <-> brown');
LIKE vs Full-Text Search: Head-to-Head
AspectLIKE '%term%'Full-Text Search
Data structureB-Tree (unusable with leading wildcard)Inverted index
Speed (1M rows, 1KB text)2-10 seconds (full sequential scan)10-100 ms (index lookup)
RankingNone (order of insertion or random)TF-IDF or BM25 relevance score
StemmingNone ('running' ≠ 'run')Built-in language-specific stemmer
Stopword handlingNone ('the' is searched like any word)Automatic filtering of noise words
Phrase searchLIKE '%exact phrase%' (scan, no ordering)Boolean mode operators or <-> adjacency
Index maintenanceNone (B-Tree updates automatically)Requires periodic rebuild (OPTIMIZE or REINDEX)
Prefix searchLIKE 'prefix%' (uses index, but leading wildcard does not)wildcard in boolean mode ('run')
InfrastructureBuilt into all SQL databases, no setupRequires index creation and configuration

Key takeaways

1
FTS relies on an inverted index
no B-Tree magic. It maps tokens to row IDs, enabling O(log n) lookups instead of O(n) scans.
2
Ranking is algorithmic
MySQL uses TF-IDF (favours long documents), PostgreSQL uses BM25 (normalises length). Choose based on document length variance.
3
Stopwords and stemming must be tailored to your domain. Default English stopwords hide technical terms. Review and customise.
4
Always rebuild indexes after bulk operations or you'll get stale results. PostgreSQL CONCURRENTLY allows zero downtime; MySQL OPTIMIZE TABLE locks.
5
Test FTS with real user queries before production deployment. A query that works on 1k rows can fail on 1M rows due to tokenisation or ranking.
6
For typos and faceted search, FTS falls short. Know when to hand off to Elasticsearch
it's not a failure, it's the right architectural decision.
7
MySQL FTS is simpler but less flexible; PostgreSQL FTS offers more control and zero-downtime maintenance. Choose based on operational needs.
8
Phrase search requires BOOLEAN MODE in MySQL or <-> operator in PostgreSQL. NATURAL LANGUAGE MODE ignores quotes.

Common mistakes to avoid

7 patterns
×

Using default stopword list without domain review

Symptom
Important domain terms like 'ink', 'set', or 'key' become invisible. Users complain searches return nothing for valid product names.
Fix
Export the default stopword list, add it to version control, and manually remove terms relevant to your domain. For product catalogues, consider using no stopwords and handling noise via ranking.
×

Forgetting to rebuild the index after bulk operations

Symptom
New products don't appear in search results. Old deleted products still show up. The index is stale.
Fix
Always run OPTIMIZE TABLE (MySQL) or REINDEX (PostgreSQL) after bulk inserts/deletes affecting >10% of rows. Automate via scheduled jobs or triggers.
×

Using FTS for exact match or short codes (e.g., 'ID-123')

Symptom
Searching for 'ID-123' returns lots of irrelevant rows or nothing. The tokeniser splits on hyphens and ignores short tokens (min length = 3).
Fix
For exact identifiers, use a regular B-Tree index. Reserve FTS for natural language text. For hybrid, use UNION with exact match condition.
×

Using NATURAL LANGUAGE MODE for phrase searches expecting quotes to work

Symptom
MATCH(col) AGAINST('"exact phrase"') in natural language mode treats the quotes as part of the search string, not as a phrase operator.
Fix
Use BOOLEAN MODE with quotes: MATCH(col) AGAINST('"exact phrase"' IN BOOLEAN MODE). In PostgreSQL, use <-> operator.
×

Ignoring stemming false positives

Symptom
Searches for 'marketing' return products about 'market' (price comparison). Searches for 'run' return 'runner' (shoes) and 'running' (performance).
Fix
Use a configuration with no stemming (PostgreSQL 'simple' dictionary, MySQL's language setting with no stemming if available). Or accept that stemming increases recall at the cost of precision.
×

Not testing with production-like data volumes before deployment

Symptom
FTS queries that run in 50ms on development (10k rows) become 5 seconds in production (10M rows) due to index size and fragmentation.
Fix
Always load-test FTS with a copy of production data to estimate rebuild times, query latency, and index size. Use tools like sysbench for MySQL or pgbench for PostgreSQL.
×

Using MySQL FTS in a high-write environment without tuning `innodb_ft_cache_size`

Symptom
After bulk inserts, FTS queries become slow or return stale data because the index cache is flushing to disk too frequently.
Fix
Increase innodb_ft_cache_size to 32MB or 64MB. Monitor INNODB_FTS_INDEX_CACHE size vs INNODB_FTS_INDEX_TABLE. For PostgreSQL, adjust gin_pending_list_limit.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain how an inverted index is built and queried in a SQL database's f...
Q02SENIOR
What is the difference between TF-IDF and BM25 ranking, and when would y...
Q03SENIOR
How do you debug a full-text search that returns no results for a query ...
Q04SENIOR
When should you use a dedicated search engine like Elasticsearch over SQ...
Q05SENIOR
How would you design a search system for a multilingual product catalogu...
Q01 of 05SENIOR

Explain how an inverted index is built and queried in a SQL database's full-text search engine.

ANSWER
An inverted index is built by tokenising each document's text, normalising tokens (lowercasing, stemming), discarding stopwords, and then constructing a mapping from each unique token to the list of document IDs where it appears, along with positional information and frequency. In MySQL InnoDB, this is stored in hidden auxiliary tables. In PostgreSQL, it's a GiST or GIN index on a tsvector column. Query execution: (1) tokenise and stem the query text, (2) look up each token in the inverted index (O(log n) per token), (3) retrieve posting lists (list of row IDs), (4) intersect them for multi-word queries, (5) compute a relevance score (TF-IDF or BM25) for each candidate row using term frequency and inverse document frequency, (6) sort by score and return top N. Complexity: O(k log n + m) where k = number of unique tokens in query, n = number of unique tokens in corpus, m = number of candidate rows. This is far faster than a full table scan (O(total documents × document length)).
FAQ · 9 QUESTIONS

Frequently Asked Questions

01
What is Full-Text Search in SQL in simple terms?
02
Can I use full-text search with a partial match like 'prefix*'?
03
Does full-text search support fuzzy matching (typo tolerance)?
04
How does full-text search handle multi-language content?
05
Is full-text search better than Elasticsearch?
06
How do I monitor full-text index health in production?
07
Can I use full-text search on a database view?
08
How does FTS handle NULL values?
09
Does FTS support scoring for my own custom fields?
N
Naren Founder & Principal Engineer

20+ years shipping high-throughput database systems. Drawn from code that ran under real load.

Follow
Verified
production tested
May 24, 2026
last updated
1,554
articles · all by Naren
🔥

That's SQL Advanced. Mark it forged?

10 min read · try the examples if you haven't

Previous
SQL EXPLAIN and Execution Plans
10 / 16 · SQL Advanced
Next
Database Cursors