Senior 4 min · March 17, 2026

Elasticsearch Basics

Elasticsearch basics — what it is, when to use it, indices and documents, full-text search with match queries, aggregations, and how it differs from relational databases.

Naren · Founder
Quick Answer
  • Elasticsearch is a distributed search and analytics engine built on Apache Lucene.
  • Data is stored as JSON documents in indices — no fixed schema required.
  • Excels at full-text search with tokenisation, stemming, and relevance scoring.
  • match queries do full-text search; term queries do exact, unanalysed matching.
  • Aggregations let you run analytics on millions of documents in near-real-time.
  • Not a primary database — no transactions, eventual consistency, and no ACID guarantees.

Documents and Indices

An index is loosely comparable to a database table, but it stores JSON documents. Each document gets a unique _id, either auto-generated or assigned by you. The mapping defines the schema (field types, analysers). Unlike SQL, the schema can be dynamic: new fields are added automatically when dynamic mapping is enabled. Dynamic mapping is dangerous in production, though (see the "Mapping Explosion Caused Cluster OOM" incident in this article). Each index is composed of one or more shards that are distributed across the cluster, which is how Elasticsearch scales horizontally.

Example · BASH
# Index a document (HTTP PUT/POST to the REST API)
PUT /articles/_doc/1
{
  "title": "Understanding Big O Notation",
  "content": "Big O describes the rate of growth of an algorithm...",
  "author": "Alice Chen",
  "published": "2025-03-17",
  "tags": ["algorithms", "computer science", "performance"]
}

# Response:
# { "_index": "articles", "_id": "1", "result": "created" }

# Get document by ID
GET /articles/_doc/1

# Delete document
DELETE /articles/_doc/1

# Bulk indexing (much faster than individual requests)
POST /_bulk
{ "index": { "_index": "articles", "_id": "2" } }
{ "title": "AVL Trees", "content": "Self-balancing BST..." }
{ "index": { "_index": "articles", "_id": "3" } }
{ "title": "Huffman Coding", "content": "Variable-length encoding..." }
Output
{ "_index": "articles", "_id": "1", "result": "created" }
Index vs Database Table
Don't map too literally. An index in Elasticsearch is more like a schema namespace than a table. You can store unrelated documents in the same index, but it's usually a bad idea because they need different mappings. Stick to one kind of document per index: mapping types were deprecated in 7.x and removed in 8.0.
Production Insight
Dynamic mapping is the silent killer. It looks convenient until your mapping consumes all heap.
Always set explicit mappings for indices that receive high-cardinality fields.
Rule of thumb: every thousand fields in mapping adds ~1% heap overhead on each shard.
Fix: PUT index/_mapping { "dynamic": false } before ingesting untrusted data.
Key Takeaway
An index is a logical namespace for documents with a shared mapping.
Dynamic mapping is a trap for high-cardinality data.
Always explicitly define mappings for production indices.

Full-Text Search Queries

Elasticsearch's killer feature is full-text search. When you send a match query, Elasticsearch analyses the query string with the field's analyser (tokenising, lowercasing and, if the analyser is configured for it, stemming) and compares the resulting tokens against the analysed text in the inverted index. Relevance scoring ranks the results; BM25 has been the default since 5.0, replacing classic TF-IDF. You can boost fields, require clauses, and exclude terms using the bool query. multi_match lets you search across multiple fields with individual boosts.

Example · BASH
# match query: full-text search with relevance scoring
GET /articles/_search
{
  "query": {
    "match": {
      "content": "algorithm performance"
    }
  }
}

# multi_match: search across multiple fields
GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "binary tree",
      "fields": ["title^2", "content", "tags"]  // title weighted 2x
    }
  }
}

# bool query: combine must, should, must_not
GET /articles/_search
{
  "query": {
    "bool": {
      "must":     [ { "match": { "content": "sorting" } } ],
      "should":   [ { "match": { "tags": "algorithms" } } ],
      "must_not": [ { "match": { "title": "deprecated" } } ],
      "filter":   [ { "range": { "published": { "gte": "2024-01-01" } } } ]
    }
  }
}
Output
{ "hits": { "total": { "value": 5 }, "hits": [...] } }
How Full-Text Search Actually Works
  • Each text field is analysed: tokenised, lowercased and, depending on the analyser, stemmed.
  • The inverted index stores (token → list of document IDs + position).
  • A match query converts the search phrase into tokens, then looks up each token in the inverted index.
  • Documents matching more tokens (and rarer tokens) get higher scores.
  • The term query skips analysis — it looks for the exact token as stored in the inverted index.
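The scoring rule in the list above ("more tokens and rarer tokens score higher") can be made concrete with a small sketch. This is a simplified BM25 calculation in Python, assuming the commonly documented defaults k1 = 1.2 and b = 0.75; the function names and the tiny corpus are made up for illustration, and real Lucene scoring has more moving parts (boosts, norms encoding, per-shard statistics).

```python
import math

def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one query term against one document, BM25-style."""
    # Rarer terms (small docs_with_term) get a larger idf weight.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency saturates; b penalises documents longer than average.
    tf = freq / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf

def bm25_score(query_tokens, doc_tokens, corpus):
    """Sum the per-term scores of every query token found in the document."""
    avg_len = sum(len(d) for d in corpus) / len(corpus)
    score = 0.0
    for token in query_tokens:
        freq = doc_tokens.count(token)
        if freq == 0:
            continue
        docs_with_term = sum(1 for d in corpus if token in d)
        score += bm25_term_score(freq, len(doc_tokens), avg_len,
                                 len(corpus), docs_with_term)
    return score

# Hypothetical, already-analysed documents.
corpus = [
    ["sorting", "algorithm", "performance"],
    ["algorithm", "design"],
    ["database", "performance", "tuning"],
    ["algorithm", "analysis"],
]
query = ["algorithm", "performance"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
for doc, score in zip(corpus, scores):
    print(round(score, 3), doc)
```

The first document matches both tokens and ranks highest. The third document beats the second even though each matches one token, because "performance" appears in fewer documents than "algorithm".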
Production Insight
Using term on a text field usually returns nothing: the indexed tokens were analysed (for example lowercased and split), but the term you pass is not.
match analyses the query, term does not.
If you need exact match on a string, map it as keyword and use term.
Failure story: team searched user emails with term on a text field — zero results for hours.
Key Takeaway
match = analysed full-text search.
term = exact, unanalysed lookup on keyword fields.
Understand analysis before writing queries.

The inverted index is what makes full-text search possible without a full scan. When you index a text field, the analyser produces a list of tokens. For every unique token, the inverted index stores a sorted list of the document IDs that contain it (the postings list), plus the position(s) within each document. When a match query arrives, Elasticsearch looks up each query token in the term dictionary, which is an index lookup rather than a linear scan over documents. It then reads the postings lists, scores the matching documents, and retrieves the top hits. The process is the same whether you have 1,000 or 100,000,000 documents: the cost depends mainly on the number of query tokens and the length of their postings lists, not on the total document count. That's why Elasticsearch can respond in milliseconds on huge datasets.

Example · BASH
# Test how a text field gets analysed (produces tokens)
POST /_analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch is a search engine"
}

# Response tokens:
# ["elasticsearch", "is", "a", "search", "engine"]

# For each token, the inverted index stores doc IDs
# Token "search" -> [doc1, doc5, doc9]
# Token "engine" -> [doc1, doc3]
Output
{ "tokens": [ { "token": "elasticsearch" }, { "token": "is" }, { "token": "a" }, { "token": "search" }, { "token": "engine" } ] }
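To make the token-to-document mapping tangible, here is a toy inverted index in Python. It is only a sketch: the analyze function is a crude stand-in for the standard analyser (lowercase plus whitespace split), the document IDs and texts are invented, and a plain dictionary replaces Lucene's on-disk term dictionary and postings format.

```python
from collections import defaultdict

def analyze(text):
    """Crude stand-in for the standard analyser: lowercase, split on whitespace."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each token to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(analyze(text)):
            index[token].setdefault(doc_id, []).append(position)
    return index

def search(index, query):
    """Return doc IDs containing any query token, best match count first."""
    hits = defaultdict(int)
    for token in analyze(query):
        for doc_id in index.get(token, {}):
            hits[doc_id] += 1
    return sorted(hits, key=lambda d: (-hits[d], d))

docs = {
    1: "Elasticsearch is a search engine",
    3: "A rocket engine burns fuel",
    5: "Binary search halves the range",
}
index = build_inverted_index(docs)

print(dict(index["search"]))           # {1: [3], 5: [1]}
print(search(index, "search engine"))  # [1, 3, 5]
```

Note that the search only touches the postings of the two query tokens; documents that contain neither token are never visited.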
Inverted Index ≠ Full-Text Search on SQL
SQL databases with GIN indexes also use inverted indexes, but the tokenisation and query optimiser are far less capable. Elasticsearch's inverted index stores term frequencies, positions, and optionally payloads — all needed for BM25 scoring and phrase queries.
Production Insight
The inverted index, together with analysis, is what makes stemming, partial matches (via n-grams or prefix queries) and typo tolerance (via fuzzy queries) practical.
If a query is slow, a common cause is a query that cannot use the index efficiently, e.g., a leading wildcard on a keyword field, which has to walk the term dictionary.
Always profile slow queries: GET index/_search { "profile": true }.
Key Takeaway
The inverted index is a token → document-ID lookup structure built during indexing.
Search cost depends on the query tokens and their postings lists, not on a scan of every document.
Without it, you're scanning — that's not Elasticsearch.

Full-Text vs Keyword: When to Use Each

Elasticsearch offers two fundamental ways to handle string fields: text (analysed, full-text) and keyword (exact, unanalysed). Choosing the wrong type leads to missing results or broken filters. A text field uses an analyser to break the string into tokens; a keyword field stores the entire string as a single token. Use text for human-readable content you want to search by relevance (blog posts, product descriptions). Use keyword for identifiers, enums, tags, emails, URLs: anything you filter or aggregate on. In many production schemas, the same string is mapped both ways: as a text field for match queries, with a keyword sub-field for term queries and aggregations.

Example · BASH
# Multi-field mapping: product_name is both text and keyword
PUT /products/_mapping
{
  "properties": {
    "product_name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}

# Full-text search on product_name
GET /products/_search
{ "query": { "match": { "product_name": "wireless mouse" } } }

# Exact filter on product_name.keyword
GET /products/_search
{ "query": { "term": { "product_name.keyword": "Wireless Mouse MX Master 3" } } }

# Aggregation on the keyword field
GET /products/_search
{
  "size": 0,
  "aggs": {
    "product_names": {
      "terms": { "field": "product_name.keyword" }
    }
  }
}
When Ignore Above Saves You
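Set ignore_above to avoid indexing unreasonably long strings. Dynamic mapping sets it to 256 on the generated .keyword sub-field; an explicitly mapped keyword field has no practical limit unless you set one. Values longer than ignore_above are kept in _source but are not indexed. Very long keyword values bloat doc values and the structures used for terms aggregations, which is a common cause of slow aggregations.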
Production Insight
A common production pattern: map field as text with a .keyword sub-field.
Search with match, filter/aggregate with term on .keyword.
Avoid term on a pure text field: you will usually get zero results because the indexed tokens are analysed while the term you pass is not.
Failure: A team aggregated product names on text field (which tokenised) and got per-token counts, not per-product.
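Both failure modes above (term on a text field, and aggregating on tokens instead of whole values) can be reproduced in a few lines of Python. This is a rough model only: index_as_text imitates an analyser with lowercase plus whitespace split, and all function names are hypothetical helpers, not Elasticsearch APIs.

```python
from collections import Counter

def index_as_text(value):
    """Analysed: lowercase and split into tokens."""
    return value.lower().split()

def index_as_keyword(value):
    """Not analysed: the whole string is a single token."""
    return [value]

def term_query(indexed_tokens, term):
    """Exact lookup: the query string is NOT analysed."""
    return term in indexed_tokens

def match_query(indexed_tokens, text):
    """Analysed lookup: any query token may match (OR semantics)."""
    return any(token in indexed_tokens for token in index_as_text(text))

name = "Wireless Mouse MX Master 3"
text_tokens = index_as_text(name)
keyword_tokens = index_as_keyword(name)

print(term_query(text_tokens, name))               # False
print(term_query(keyword_tokens, name))            # True
print(match_query(text_tokens, "wireless mouse"))  # True

# Aggregating on analysed tokens counts words, not products.
products = ["Wireless Mouse", "Wireless Keyboard", "Wireless Mouse"]
per_token = Counter(t for p in products for t in index_as_text(p))
per_product = Counter(t for p in products for t in index_as_keyword(p))
print(per_token)
print(per_product)
```

The per_token counter reports "wireless" three times, which is the per-token result the team in the failure story saw; per_product gives the intended counts.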

Aggregations — Analytics

Aggregations let you compute analytics over search results without writing custom code. They're like SQL GROUP BY on steroids: nested aggregations, date histograms, percentile ranks, geo distance, and more. Aggregations run on the shards in parallel and results are merged on the coordinating node. The size: 0 in the request avoids returning documents when you only want the aggregated data.

Example · BASH
# Count articles per author and average content length
# (assumes each article was indexed with a numeric content_length field)
GET /articles/_search
{
  "size": 0,  // don't return documents, only aggregation results
  "aggs": {
    "by_author": {
      "terms": { "field": "author.keyword", "size": 10 },
      "aggs": {
        "avg_content_length": {
          "avg": { "field": "content_length" }
        }
      }
    },
    "articles_over_time": {
      "date_histogram": {
        "field": "published",
        "calendar_interval": "month"
      }
    }
  }
}
Output
{ "aggregations": { "by_author": { "buckets": [...] } } }
Aggregation Performance Trap
Aggregations on high-cardinality fields (e.g., user IDs, IPs) can consume huge amounts of memory and slow down the cluster. A terms aggregation with a large size requests many buckets from each shard, then the coordinating node sorts and merges them. If size is 10000, each shard returns up to shard_size buckets (by default size × 1.5 + 10, so 15,010), multiplied by the number of shards. Use search.max_buckets to cap the number of buckets a single response may build. Pre-filter with a query to reduce the dataset before aggregating.
Production Insight
Aggregations are expensive. They are not free analytics.
Each bucket requires memory on the coordinating node.
Limit: search.max_buckets defaults to 65,536 in recent versions (10,000 in 7.x releases before 7.9); a request that exceeds it fails with an error.
Fix for large cardinality: use a composite aggregation with pagination, not a single terms with huge size.
Key Takeaway
Aggregations are powerful but memory-hungry.
Limit bucket size with search.max_buckets and pre-filter with queries.
For high-cardinality fields, use composite aggregation.
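A quick back-of-the-envelope calculation shows why a large size hurts. The sketch below assumes the documented default shard_size of size × 1.5 + 10 for a terms aggregation and treats the product with the shard count as an upper bound on what the coordinating node has to merge; the function names are illustrative.

```python
def default_shard_size(size):
    """Default shard_size for a terms aggregation: size * 1.5 + 10."""
    return int(size * 1.5 + 10)

def buckets_at_coordinator(size, shards):
    """Upper bound on buckets the coordinating node receives and merges."""
    return default_shard_size(size) * shards

for size, shards in [(10, 5), (10000, 5), (10000, 50)]:
    print(size, shards, buckets_at_coordinator(size, shards))
# 10 5 125
# 10000 5 75050
# 10000 50 750500
```

With size 10 the merge is trivial; with size 10000 across 50 shards the coordinating node may handle three quarters of a million buckets for one request, which is the situation a paginated composite aggregation avoids.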

Mapping, Analysis and Tokenisation

Mapping defines how documents and their fields are stored and indexed. Analysis is the process of converting a text field into tokens (terms) that go into the inverted index. An analyser consists of zero or more character filters, exactly one tokeniser, and zero or more token filters. By default, Elasticsearch uses the standard analyser, which splits text on word boundaries, drops most punctuation, and lowercases; it does not remove stop words unless you configure it to. You can create custom analysers with different tokenisers (whitespace, keyword, n-gram) and filters (synonym, stemmer, stop). Understanding analysis is critical to getting search results that match user intent.

Example · BASH
# Create an index with a custom analyser
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "unique"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

# Test the analyser
POST /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Elasticsearch Search Engine"
}
# Result: tokens => ["elasticsearch", "search", "engine"]
Analysis Is Field-Level
Each text field can have its own analyser. The same string can be analysed differently in different fields. For example, a title field might use a standard analyser with synonyms, while a content field uses a simpler analyser. You can also set search_analyzer separately from analyzer, which controls index-time analysis.
Production Insight
If users complain about missing results, the first thing to check is analyser mismatch.
If the index analyser is different from the search analyser, query terms may not match indexed terms.
Failure: The team used an n-gram analyser for autocomplete on a field, but forgot to set the same analyser for search — queries returned nothing.
Key Takeaway
Analysis determines how text is tokenised and searched.
Mismatched analysers between index and search cause missing results.
Always test analysis with _analyze endpoint before deploying mappings.
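The character filter → tokeniser → token filter chain is easy to model as plain function composition. The Python sketch below mirrors the custom analyser from the example (lowercase, stop, unique) under simplifying assumptions: the tokeniser splits on non-word characters rather than following Unicode segmentation rules, and the stop-word list is a tiny invented sample.

```python
import re

def html_strip(text):
    """Character filter: replace anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)

def standard_tokenizer(text):
    """Tokeniser: split on runs of non-word characters (a simplification)."""
    return [t for t in re.split(r"\W+", text) if t]

def lowercase(tokens):
    return [t.lower() for t in tokens]

STOP_WORDS = {"a", "an", "and", "is", "the", "of", "to"}  # illustrative only

def stop(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def unique(tokens):
    """Keep only the first occurrence of each token."""
    seen = set()
    return [t for t in tokens if not (t in seen or seen.add(t))]

def analyze(text, char_filters=(), tokenizer=standard_tokenizer, token_filters=()):
    """Run the three analysis stages in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

filters = (lowercase, stop, unique)
print(analyze("Elasticsearch Search Engine", token_filters=filters))
# ['elasticsearch', 'search', 'engine']
print(analyze("<b>The</b> search is the Search",
              char_filters=(html_strip,), token_filters=filters))
# ['search']
```

Filter order matters: placing stop before lowercase would let "The" slip through, because this stop list only contains lowercase words.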

Mapping & Analyzers: Technical Reference Guide

This reference covers the essential mapping types, parameter configuration, and custom analyser components you'll use in production. The mapping is the schema definition for an index — it tells Elasticsearch how to store and index each field. Key field types: text (full-text), keyword (exact match), integer, long, float, double, boolean, date (ISO 8601), nested (array of objects), flattened (arbitrary key-value), geo_point, ip. Each type has parameters: index (true/false), store (separate stored field), doc_values (for aggregations/sorting), norms (scoring), copy_to (combine fields).

Analysers consist of: character filters (strip HTML, replace patterns), tokeniser (standard, whitespace, edge_ngram), token filters (lowercase, stemmer, stop, synonym, asciifolding). Custom analysers are defined in index settings and referenced in mappings.

Production-critical parameters: ignore_above (keyword length limit), coerce (auto-convert numbers), eager_global_ordinals (pre-load for high cardinality keyword fields used in aggregations), fields (multi-field for same string).

Example · BASH
# Full-featured index definition with custom analysers.
# Analysis settings cannot be sent to the _mapping endpoint; they are defined
# when the index is created, so settings and mappings go in one request.
PUT /logs
{
  "settings": {
    "index.mapping.total_fields.limit": 200,
    "analysis": {
      "tokenizer": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10
        }
      },
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["fast, quick", "small, little"]
        }
      },
      "analyzer": {
        "my_standard_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],  // built-in character filter
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms", "stop"]
        },
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_edge_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "dynamic": false,
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "my_standard_analyzer",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 512 },
          "ngram": { "type": "text", "analyzer": "ngram_analyzer" }
        }
      },
      "severity": {
        "type": "keyword",
        "doc_values": false,  // not aggregated or sorted on
        "norms": false        // no scoring needed
      },
      "timestamp": { "type": "date" }
    }
  }
}
Beware of Synonym Token Bloat
Synonym filters applied at index time can expand the inverted index considerably, because every synonym of a token is indexed alongside it. In the worst case, a field where every token has 20 synonyms indexes roughly 20 times as many terms. Keep synonym lists small, and for multi-word synonyms use the synonym_graph filter at search time to avoid false positives.
Production Insight
Mapping is schema-on-write, not schema-on-read.
Once a mapping is published, changing a field type requires reindexing (create new index, reindex documents, swap alias).
Use index templates (_index_template, or the legacy _template API) for index patterns to enforce mapping consistency across time-based indices.
Real incident: A team tried to change a text field to keyword via PUT mapping. Elasticsearch does not apply such a change (an existing field's type cannot be updated), and queries kept failing until they reindexed.
Key Takeaway
An existing field's type is fixed once it is mapped: you can add new fields, but you can't change a type in place.
Use index templates and explicit mappings for production.
Master analyser parameters: tokeniser, character filter, token filters.

Cluster Architecture, Sharding and Scaling

An Elasticsearch cluster consists of nodes: master-eligible nodes (handle cluster state), data nodes (store data and execute queries), ingest nodes (pre-process documents), and coordinating nodes (route requests and merge results). Data is split into shards, and each shard is a separate Lucene index. When you index a document, it is routed to a primary shard based on _routing (default: a hash of the document ID). The primary shard synchronously replicates to replica shards on other nodes. Scaling means adding data nodes; shards are automatically rebalanced. But you cannot change the number of primary shards of an existing index in place: the split and shrink APIs, like reindexing, write to a new index. So right-sizing shards at index creation time is critical.

Example · BASH
# Check cluster health
GET /_cluster/health?pretty
# Sample output: { "status": "green", "number_of_nodes": 5, "active_shards": 20 }

# View shard allocation
GET /_cat/shards?v

# Create index with custom shard count and replication
PUT /my_logs
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 2
  }
}
Output
{ "status": "green", "number_of_nodes": 5, "active_shards": 20 }
Production Insight
Too many shards is worse than too few.
Each shard uses memory for metadata, file handles, and cluster state.
A cluster with 1000+ shards per node will degrade performance regardless of data size.
A commonly cited older guideline from Elastic is to stay below roughly 20 shards per GB of JVM heap on each data node; treat it as a ceiling, not a target.
Fix: Use ILM (Index Lifecycle Management) to roll over indices and shrink shards over time.
Key Takeaway
Shards are the unit of horizontal scalability.
Primary shard count is fixed at index creation — get it right.
Too many shards hurts performance; ILM helps manage shard lifecycle.
Choosing Number of Shards
If: Time-series data, docs/day < 500K
Use: Start with 1 shard per day. Use rollover index + ILM.
If: Full-text search index, < 10GB
Use: Start with 1 shard. You can reindex later.
If: Log aggregation, > 50GB/day
Use: Use data streams with ILM. 5-20 shards per day depending on node count.
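The sizing decisions above boil down to simple arithmetic. The sketch below assumes a target of 30 GB per shard, a value picked from the 20-50GB range this article recommends; both function names are hypothetical helpers for planning, not part of any Elasticsearch client.

```python
import math

def primary_shards_for(index_size_gb, target_shard_gb=30):
    """Smallest primary count that keeps each shard at or below the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

def total_shard_copies(primaries, replicas):
    """Every primary plus its replicas counts against cluster shard overhead."""
    return primaries * (1 + replicas)

print(primary_shards_for(8))      # 1
print(primary_shards_for(50))     # 2
print(primary_shards_for(300))    # 10
print(total_shard_copies(5, 2))   # 15
```

The last line is the my_logs example from this section: 5 primaries with 2 replicas each means the cluster carries 15 shard copies for that one index.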

Shard & Replica Distribution Layout

Visualising how shards and replicas are distributed across nodes helps you understand resilience and read scalability. Each index has a set of primary shards (P0, P1, ...). Each primary shard has zero or more replica shards (R0, R1, ...) on different nodes. Elasticsearch ensures that the primary and replica copies of the same shard number are never allocated on the same node, so losing one node never loses every copy of a shard. When you have replicas, read requests can be served by any copy, distributing load. The _cat/shards output below shows a 2-node cluster with an index of 3 primary shards and 1 replica each.

Example · BASH
# Create index with 3 primary shards and 1 replica
PUT /my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

# View allocation breakdown
GET _cat/shards/my_index?v&h=index,shard,prirep,state,node
# index   shard prirep state   node
# my_index 0     p      STARTED node-1
# my_index 0     r      STARTED node-2
# my_index 1     p      STARTED node-2
# my_index 1     r      STARTED node-1
# my_index 2     p      STARTED node-1
# my_index 2     r      STARTED node-2
Shard Routing Logic
When you index a document, Elasticsearch computes shard = hash(_routing) % number_of_primary_shards. The default routing value is the document _id. If you need to group related documents on the same shard (e.g., all documents for user 42), specify a custom _routing value. But beware: the custom value is hashed in place of the _id, so every document sharing a routing value lands on the same shard. A skewed or low-cardinality routing key (a few values owning most of the documents) causes uneven shard sizes and hot shards.
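The routing formula can be simulated to see how a skewed routing key produces a hot shard. This sketch uses CRC32 from Python's standard library purely as a stand-in hash (Elasticsearch uses its own Murmur3-based function, so real shard numbers will differ), and the tenant names and document counts are invented.

```python
import zlib
from collections import Counter

def shard_for(routing, number_of_primary_shards):
    """hash(_routing) % number_of_primary_shards, with CRC32 as a stand-in hash."""
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards

SHARDS = 3

# Default routing: every document is routed by its own _id.
doc_ids = [f"doc-{i}" for i in range(9000)]
by_id = Counter(shard_for(doc_id, SHARDS) for doc_id in doc_ids)

# Custom routing by tenant, where one tenant owns 90% of the documents.
tenants = ["tenant-big"] * 8100 + [f"tenant-{i}" for i in range(900)]
by_tenant = Counter(shard_for(tenant, SHARDS) for tenant in tenants)

print("routed by _id:   ", sorted(by_id.items()))
print("routed by tenant:", sorted(by_tenant.items()))
```

Routing by _id spreads documents across the shards, while routing by tenant puts at least 8,100 of the 9,000 documents on whichever shard "tenant-big" hashes to.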
Production Insight
Replica shards balance read load. With 3 primaries and 1 replica each, every shard has two copies, and a search can use either copy of each shard.
But write throughput doesn't increase with replicas — replicas consume indexing overhead.
Failure: A team set 5 replicas on a write-heavy index, hoping to scale writes — but writes became slower because each doc was duplicated 6 times.
Fix: Only increase replicas for read-heavy workloads. Use ILM to manage replica count for time-series data (hot: high replicas, warm: low replicas).
Key Takeaway
Primaries accept writes, replicas serve reads and provide failover.
Primary and replica for the same shard never coexist on one node.
Use custom routing sparingly — it clusters documents but risks uneven shard sizes.
● Production incident · POST-MORTEM · severity: high

Mapping Explosion Caused Cluster OOM

Symptom
Elasticsearch cluster health went red. Nodes threw CircuitBreakingException errors. The cluster stopped accepting write requests. The logs showed thousands of new field mappings per minute.
Assumption
The team assumed that dynamic mapping would handle any incoming JSON, and that Elasticsearch would manage memory automatically.
Root cause
Each log event contained unique field names (like user_1234_activity). Dynamic mapping created a new field in the mapping for every unique key. Within hours, the mapping object grew to millions of fields, consuming all heap and triggering the field data circuit breaker.
Fix
Disabled dynamic mapping for high-cardinality fields by setting dynamic: false or dynamic: strict on the index template. Mapped only known fields explicitly. Used a flattened data type or nested key-value pair for unknown fields. Applied a limit on field count via index.mapping.total_fields.limit.
Key lesson
  • Dynamic mapping is safe for low-cardinality, known schemas. Never use it for user-generated keys or log fields with unbounded cardinality.
  • Always set field count limits and monitor mapping size in production.
  • A mapping explosion is silent — you'll see OOM before you see the mapping warning.
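The incident can be replayed with a toy model of a mapping. The ToyMapping class below is entirely hypothetical and only imitates three behaviours described above: dynamic mapping adding a field per unseen key, dynamic: false ignoring unknown keys, and a total-fields limit rejecting documents once it is reached.

```python
class FieldLimitExceeded(Exception):
    pass

class ToyMapping:
    """Hypothetical model of an index mapping; not Elasticsearch code."""

    def __init__(self, known_fields, dynamic=True, total_fields_limit=1000):
        self.fields = set(known_fields)
        self.dynamic = dynamic            # True, False, or "strict"
        self.limit = total_fields_limit

    def index(self, doc):
        for name in doc:
            if name in self.fields:
                continue
            if self.dynamic == "strict":
                raise ValueError(f"unknown field rejected: {name}")
            if self.dynamic is False:
                continue                  # value stays in _source, field is not mapped
            if len(self.fields) >= self.limit:
                raise FieldLimitExceeded(name)
            self.fields.add(name)

# Log events with a unique key per user, as in the incident.
events = [{"message": "login", f"user_{i}_activity": 1} for i in range(5000)]

unbounded = ToyMapping({"message"}, dynamic=True, total_fields_limit=10**6)
for event in events:
    unbounded.index(event)
print(len(unbounded.fields))   # 5001

locked = ToyMapping({"message"}, dynamic=False)
for event in events:
    locked.index(event)
print(len(locked.fields))      # 1

capped = ToyMapping({"message"}, dynamic=True, total_fields_limit=1000)
try:
    for event in events:
        capped.index(event)
except FieldLimitExceeded as exc:
    print("rejected at field:", exc)
```

With the limit raised out of the way, the mapping grows by one field per event. With dynamic disabled it never grows, and with a limit in place the failure is loud and early instead of surfacing later as memory pressure.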
Production debug guide
Symptom to root cause: the signals that matter
Symptom · 01
Query latency jumps from 50ms to 5s for the same search
Fix
Check _cat/thread_pool/search?v for queue depth. If queue > 100, nodes are overloaded. Also check _nodes/hot_threads for CPU contention.
Symptom · 02
Cluster health is yellow or red
Fix
Run GET _cluster/health?pretty. If unassigned_shards > 0, check GET _cat/shards?h=index,shard,prirep,state,node,unassigned.reason to see why shards are unassigned (e.g., node left, disk full).
Symptom · 03
Partial search results (missing documents that should exist)
Fix
Check for replica lag: GET _cat/shards?v and see if primaries and replicas are in sync. Also verify that the refresh_interval isn't set too high (default 1s, increase only if indexing throughput is critical).
Symptom · 04
CircuitBreakingException in logs
Fix
Check GET _nodes/stats/breaker?pretty. Identify which breaker tripped (field data, request, in-flight). Increase the breaker limit temporarily, but the real fix is reducing memory pressure: reduce field data cache size or limit query complexity.
★ Quick Debug Commands for Elasticsearch
Run these commands immediately when something breaks. No theory, just action.
Cluster health is red or yellow
Immediate action
Identify unassigned shards
Commands
GET _cat/shards?h=index,shard,prirep,state,node,unassigned.reason&v
GET _cluster/allocation/explain?pretty
Fix now
If disk space is low, increase cluster.routing.allocation.disk.watermark.low. If a node dropped, reroute shards manually: POST _cluster/reroute?retry_failed=true.
Query too slow
Immediate action
Enable the search slow log and check the search thread pool
Commands
PUT _settings { "index.search.slowlog.threshold.query.warn": "5s" }
GET _cat/thread_pool/search?v&h=name,active,queue,rejected,completed
Fix now
Add more replicas to distribute read load, or cache results with _search?request_cache=true for identical queries.
High JVM heap usage (over 85%)
Immediate action
Check which indices consume most memory
Commands
GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu
GET _nodes/stats/indices/fielddata?pretty
Fix now
Cap indices.fielddata.cache.size in elasticsearch.yml (it is unbounded by default; the field data circuit breaker, indices.breaker.fielddata.limit, defaults to 40% of heap). Set search.max_buckets to avoid aggregation overload.
Ingestion slowdown (indexing rejects)
Immediate action
Check the write thread pool queue and rejections
Commands
GET _cat/thread_pool/write?v&h=name,active,queue,rejected,completed
GET _nodes/hot_threads
Fix now
Throttle the indexing rate on the client side (for example, back off when the Bulk API returns 429 rejections) and temporarily reduce number_of_replicas during heavy ingest (more replicas = more indexing overhead).
Elasticsearch vs Relational Databases
Feature | Elasticsearch | Relational Database (e.g., PostgreSQL)
Data model | JSON documents | Rows in tables (normalised)
Schema enforcement | Dynamic (can be disabled) | Strict (DDL required)
Query language | RESTful JSON queries (Query DSL) | SQL
Full-text search | Native: tokenisation, scoring, fuzzy | Full-text search via GIN indexes (limited)
ACID transactions | No (document-level only) | Yes (multi-row, multi-table)
Joins | No joins: use denormalisation or nested | Native JOINs
Scaling | Horizontal (sharding) | Vertical (or read replicas, sharding adds complexity)
Aggregations | Built-in, real-time, complex | SQL GROUP BY (batch-friendly)
Use case | Search, analytics, logs | OLTP, reporting, transactions

Key takeaways

1. Elasticsearch stores JSON documents in indices with configurable mappings: dynamic mapping is dangerous in production.
2. Full-text search is Elasticsearch's core strength: analysis, inverted index, BM25 scoring.
3. match queries do full-text search; term queries do exact matching: use them on the right field types.
4. Aggregations run on shards in parallel but can be memory-intensive: pre-filter data and cap bucket sizes.
5. Shard count is fixed at index creation: right-size (20-50GB per shard) and use ILM for lifecycle management.
6. Elasticsearch is not a primary database: no ACID transactions, eventual consistency across shards.

Common mistakes to avoid


Using `term` query on `text` field

Symptom
Search returns zero results for exact string matches. User types 'Alice' but gets nothing.
Fix
Map the field as keyword (or use .keyword sub-field) and query with term on that sub-field. For full-text, use match.

Not setting a replica count on critical indices

Symptom
When one node goes down, some shards become unavailable. Queries return partial results or time out.
Fix
Set number_of_replicas >= 1 (default is 1). For mission-critical data, use 2 or more replicas. Monitor with GET _cat/indices?v.

Wildcard queries on leading edge of `text` fields

Symptom
Query like *pattern runs extremely slow because it can't use the inverted index. CPU spikes and timeouts.
Fix
Use an n-gram tokeniser for partial matching, or match_phrase_prefix for prefix-style matching. Avoid leading wildcards; if you genuinely need them, consider the wildcard field type (available since 7.9) rather than a plain text or keyword field.

Over-sharding — too many primary shards

Symptom
Cluster state grows large (1000+ shards). Heap usage on master node increases. Cluster becomes slow.
Fix
Use ILM to roll over indices and shrink old indices. Start with fewer shards (1-5 per index). Remember: you cannot reduce the shard count of an existing index in place; shrinking or reindexing writes to a new index.

Not monitoring disk thresholds

Symptom
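Elasticsearch stops allocating new shards to a node once its disk usage exceeds the low watermark (default 85%), starts relocating shards away at the high watermark (90%), and marks affected indices read-only at the flood-stage watermark (95%). At that point writes fail, and unassigned shards can turn the cluster yellow or red.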
Fix
Set cluster.routing.allocation.disk.watermark.low and high in elasticsearch.yml. Add monitoring alerts at 70% disk usage.
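An alerting check along these lines is straightforward to script. The sketch below assumes the default watermarks of 85%, 90% and 95% (low, high, flood stage) and the 70% alert threshold suggested above; the function names and the returned labels are illustrative, and in practice the usage figure would come from your monitoring system or the _cat/allocation API.

```python
def disk_state(used_percent, low=85.0, high=90.0, flood_stage=95.0):
    """Classify a node's disk usage against the three disk watermarks."""
    if used_percent >= flood_stage:
        return "flood_stage"   # indices with shards on this node become read-only
    if used_percent >= high:
        return "high"          # shards are relocated away from this node
    if used_percent >= low:
        return "low"           # no new shards are allocated to this node
    return "ok"

def should_alert(used_percent, alert_at=70.0):
    """Alert well before the low watermark so there is time to react."""
    return used_percent >= alert_at

for usage in (60, 72, 86, 92, 97):
    print(usage, disk_state(usage), should_alert(usage))
# 60 ok False
# 72 ok True
# 86 low True
# 92 high True
# 97 flood_stage True
```

The gap between the 70% alert and the 85% low watermark is the window in which you can add capacity or delete old indices before allocation is affected.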
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
Explain the difference between a `match` query and a `term` query. When ...
Q02 · SENIOR
What is an inverted index and how does it enable fast full-text search?
Q03 · SENIOR
How do shards work in Elasticsearch? Describe the role of primary and re...
Q04 · SENIOR
What causes a mapping explosion and how do you prevent it?
Q01 of 04 · JUNIOR

Explain the difference between a `match` query and a `term` query. When would you use each?

ANSWER
A match query analyses the search text with the field's analyser (tokenising, lowercasing and, where configured, stemming) and searches the analysed field. A term query does not analyse: it looks for the exact token as stored in the inverted index. Use match for full-text search on text fields. Use term for exact matches on keyword fields, IDs, enums, or status values. If you use term on a text field, you'll likely get zero results because the text was tokenised during indexing while your term was not.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. When should I use Elasticsearch vs PostgreSQL full-text search?
02. What is the difference between match and term queries?
03. How do I choose the number of shards for an index?
04. What is the difference between filter context and query context in Elasticsearch?