Elasticsearch Basics
Elasticsearch basics — what it is, when to use it, indices and documents, full-text search with match queries, aggregations, and how it differs from relational databases.
- Elasticsearch is a distributed search and analytics engine built on Apache Lucene.
- Data is stored as JSON documents in indices — no fixed schema required.
- Excels at full-text search with tokenisation, stemming, and relevance scoring.
- match queries do full-text search; term queries do exact, unanalysed matching.
- Aggregations let you run analytics on millions of documents in near-real-time.
- Not a primary database: no multi-document transactions, no ACID guarantees, and newly indexed documents become searchable only after a refresh.
Documents and Indices
An index is like a database table but stores JSON documents. Each document is automatically given a unique _id unless you assign one. The mapping defines the schema (field types, analysers). Unlike SQL, the schema can be dynamic: new fields are added automatically if dynamic mapping is enabled. But dynamic mapping is dangerous in production (see the Mapping Explosion Caused Cluster OOM incident in this article). Each index is composed of one or more shards that are distributed across the cluster, which is how Elasticsearch scales horizontally.
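As a minimal sketch of an explicit mapping, the snippet below builds the body of a create-index request in Python and prints it in console style. The index name `products` and its fields are hypothetical, chosen only for illustration; `"dynamic": "strict"` makes Elasticsearch reject documents that contain unmapped fields.

```python
import json

# Hypothetical index name and fields, chosen only for illustration.
INDEX_NAME = "products"

create_index_body = {
    "settings": {"number_of_shards": 3, "number_of_replicas": 1},
    "mappings": {
        "dynamic": "strict",  # reject documents that contain unmapped fields
        "properties": {
            "name": {"type": "text"},
            "sku": {"type": "keyword"},
            "price": {"type": "float"},
            "created_at": {"type": "date"},
        },
    },
}

# Render the request as "METHOD path" followed by the JSON body.
request_text = f"PUT {INDEX_NAME}\n{json.dumps(create_index_body, indent=2)}"
print(request_text)
```

The printed text can be pasted into a REST console; nothing here talks to a cluster.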
Run PUT index/_mapping { "dynamic": false } before ingesting untrusted data.
Full-Text Search Queries
Elasticsearch's killer feature is full-text search. When you send a match query, Elasticsearch analyses the query string with the field's analyser (tokenising, lowercasing, and stemming if the analyser is configured for it) and compares the resulting tokens against the analysed text in the inverted index. Relevance scoring ranks the results (BM25 by default since version 5.0; classic TF-IDF before that). You can boost fields, require clauses, and exclude terms using the bool query. multi_match lets you search across multiple fields with individual boosts.
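The bool and multi_match pieces described above might be combined as in the following sketch. The field names (`title`, `body`, `status`) and search terms are hypothetical, and `status` is assumed to be a keyword field so that the term filter matches exactly.

```python
import json

# Field names ("title", "body", "status") are hypothetical.
search_body = {
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "wireless headphones",
                        "fields": ["title^3", "body"],  # title boosted 3x
                    }
                }
            ],
            # filter clauses must match but do not affect the score
            "filter": [{"term": {"status": "published"}}],
            # must_not clauses exclude matching documents
            "must_not": [{"match": {"title": "refurbished"}}],
        }
    }
}
print(json.dumps(search_body, indent=2))
```

Putting exact conditions in `filter` rather than `must` keeps them out of relevance scoring.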
- Each text field is analysed: tokenised, lowercased, and, depending on the analyser, stemmed.
- The inverted index stores (token → list of document IDs + position).
- A match query converts the search phrase into tokens, then looks up each token in the inverted index.
- Documents matching more tokens (and rarer tokens) get higher scores.
- The term query skips analysis: it looks for the exact token as stored in the inverted index.
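The difference between match and term can be illustrated with a toy model. This is not how Elasticsearch is implemented: the analyser below only splits and lowercases (no stemming), and the functions `match` and `term` are hypothetical stand-ins for the real queries.

```python
import re

def analyse(text: str) -> list[str]:
    """Toy analyser: split on non-alphanumerics and lowercase (no stemming)."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]

# Toy index of one text field: token -> set of document IDs.
docs = {1: "Quick Brown Fox", 2: "Lazy brown dog"}
index: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for token in analyse(text):
        index.setdefault(token, set()).add(doc_id)

def match(query: str) -> set[int]:
    """Analyse the query, then OR together the postings of each token."""
    hits: set[int] = set()
    for token in analyse(query):
        hits |= index.get(token, set())
    return hits

def term(value: str) -> set[int]:
    """No analysis: look up the value exactly as given."""
    return set(index.get(value, set()))

print(match("Quick Brown Fox"))  # query is analysed into quick, brown, fox
print(term("Quick Brown Fox"))   # empty: no such single token was indexed
print(term("brown"))             # matches the stored lowercase token
```

The second call returns nothing because the index only holds the lowercase single-word tokens produced at index time.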
A term query on a text field often returns nothing because the field was analysed: match analyses the query, term does not. For exact matching, map the field as keyword and use term. A term query on a text field can mean zero results for hours before the cause is found.
How the Inverted Index Powers Search
The inverted index is what makes full-text search possible without a full scan. When you index a text field, the analyser produces a list of tokens. For every unique token, the inverted index stores a sorted list of document IDs that contain that token, plus the position(s) within the document. When a match query arrives, Elasticsearch looks up each query token in the index's term dictionary rather than scanning documents linearly. It then reads the postings lists for those tokens, scores the candidate documents, and retrieves the top hits. The process is the same whether you have 1,000 or 100,000,000 documents: the cost depends mainly on how many tokens the query contains and how long their postings lists are, not on the total document count. That's why Elasticsearch can respond in milliseconds on huge datasets.
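A toy version of this structure, as a sketch only: the code below maps each token to the documents and positions where it occurs, and answers a query by touching only the postings of the query's tokens. The whitespace analyser and the count-based ranking are simplifying assumptions, not Lucene's actual data structures or scoring.

```python
from collections import defaultdict

def tokenise(text: str) -> list[str]:
    """Toy analyser: lowercase and split on whitespace."""
    return text.lower().split()

def build_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map token -> {doc_id: [positions]} (a toy postings structure)."""
    index: dict[str, dict[int, list[int]]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenise(text)):
            index[token].setdefault(doc_id, []).append(pos)
    return dict(index)

def search(index, query: str) -> list[tuple[int, int]]:
    """Return (doc_id, matched_token_count), best match first."""
    counts: dict[int, int] = defaultdict(int)
    for token in tokenise(query):
        for doc_id in index.get(token, {}):  # only this token's postings
            counts[doc_id] += 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    1: "elasticsearch is a search engine",
    2: "lucene powers search",
    3: "postgres is a relational database",
}
index = build_index(docs)
print(index["search"])                 # {1: [3], 2: [2]}
print(search(index, "search engine"))  # [(1, 2), (2, 1)]
```

Document 3 is never visited for this query because none of its tokens appear in the query.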
If profiling a slow query shows it scanning far more terms or documents than expected, your mapping probably doesn't match the query, e.g., a wildcard on a keyword field. Profile with GET index/_search { "profile": true }.
Full-Text vs Keyword: When to Use Each
Elasticsearch offers two fundamental ways to handle string fields: text (analysed, full-text) and keyword (exact, unanalysed). Choosing the wrong type leads to missing results or broken filters. A text field uses an analyser to break the string into tokens; a keyword field stores the entire string as a single token. Use text for human-readable content you want to search by relevance (blog posts, product descriptions). Use keyword for identifiers, enums, tags, emails, URLs — anything you filter or aggregate on. In many production schemas, the same string field is mapped as both: a text sub-field for match queries and a keyword sub-field for term and aggregations.
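A sketch of the dual mapping described above, with one query of each kind. The field name `product_name` and the search values are hypothetical; the `fields` block is the multi-field mechanism that creates the `.keyword` sub-field.

```python
import json

# Hypothetical field "product_name": text with a keyword sub-field.
mapping_body = {
    "properties": {
        "product_name": {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword", "ignore_above": 256}
            },
        }
    }
}

# Relevance search on the analysed text field.
full_text_query = {"query": {"match": {"product_name": "running shoes"}}}

# Exact filter on the unanalysed sub-field (case and spacing must match).
exact_query = {"query": {"term": {"product_name.keyword": "Running Shoes"}}}

for body in (mapping_body, full_text_query, exact_query):
    print(json.dumps(body))
```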
Set ignore_above (256 on the .keyword sub-field that dynamic mapping creates) to avoid indexing unreasonably long strings. A 10KB string stored as keyword can bloat doc values and global ordinals, a common cause of slow aggregations. Map searchable strings as text with a .keyword sub-field. Search with match; filter and aggregate with term on .keyword. Don't use term on a pure text field: you'll get zero results because the stored tokens are analysed. Aggregating on a text field (which is tokenised) gives per-token counts, not per-product counts.
Aggregations: Analytics
Aggregations let you compute analytics over search results without writing custom code. They're like SQL GROUP BY on steroids: nested aggregations, date histograms, percentile ranks, geo distance, and more. Aggregations run on the shards in parallel and results are merged on the coordinating node. The size: 0 in the request avoids returning documents when you only want the aggregated data.
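As an illustration, the request body below combines a pre-filter, a terms aggregation with a nested average, and a date histogram. The field names (`category`, `price`, `created_at`) are hypothetical, and `calendar_interval` assumes a reasonably recent Elasticsearch version (older releases used `interval`).

```python
import json

# Hypothetical fields: "category" (keyword), "price" (numeric), "created_at" (date).
agg_body = {
    "size": 0,  # return only aggregation results, no hits
    "query": {"range": {"created_at": {"gte": "now-30d/d"}}},  # pre-filter
    "aggs": {
        "by_category": {
            "terms": {"field": "category", "size": 10},
            # nested metric computed inside each category bucket
            "aggs": {"avg_price": {"avg": {"field": "price"}}},
        },
        "orders_per_day": {
            "date_histogram": {"field": "created_at", "calendar_interval": "day"}
        },
    },
}
print(json.dumps(agg_body, indent=2))
```

Each shard computes its own buckets for this body, and the coordinating node merges them into the final response.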
A terms aggregation with a large size requests many buckets from each shard, then the coordinating node sorts and merges them. If size is 10000, each shard can send that many buckets or more (shard_size defaults to size * 1.5 + 10), multiplied by the number of shards. search.max_buckets caps the number of buckets in a response (65,536 by default in recent versions, 10,000 in older 7.x releases); hitting it fails the query. Pre-filter with a query to reduce the dataset before aggregating. For high-cardinality fields, use a composite aggregation with pagination, not a single terms aggregation with a huge size.
Mapping, Analysis and Tokenisation
Mapping defines how documents and their fields are stored and indexed. Analysis is the process of converting a text field into tokens (terms) that go into the inverted index. It consists of zero or more character filters, a tokeniser, and zero or more token filters. By default, Elasticsearch uses the standard analyser, which splits text on word boundaries, drops most punctuation, and lowercases tokens; it does not remove stop words unless you configure it to. You can create custom analysers with different tokenisers (whitespace, keyword, n-gram) and filters (synonym, stemmer, stop). Understanding analysis is critical to getting search results that match user intent.
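The three-stage pipeline can be sketched in plain Python. Every function here is a simplified, hypothetical stand-in for the corresponding Elasticsearch component (for instance, the stop-word list is tiny and the tokeniser is a regular expression), so treat it as a model of the stage order rather than of real analyser behaviour.

```python
import re
from typing import Callable

def strip_html(text: str) -> str:
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)

def word_tokeniser(text: str) -> list[str]:
    """Tokeniser: split into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens: list[str]) -> list[str]:
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens: list[str]) -> list[str]:
    """Token filter: drop a tiny, illustrative stop-word list."""
    stop = {"the", "a", "an", "and", "of"}
    return [t for t in tokens if t not in stop]

def make_analyser(char_filters, tokeniser, token_filters) -> Callable[[str], list[str]]:
    """Chain the stages: character filters, then tokeniser, then token filters."""
    def analyse(text: str) -> list[str]:
        for cf in char_filters:
            text = cf(text)
        tokens = tokeniser(text)
        for tf in token_filters:
            tokens = tf(tokens)
        return tokens
    return analyse

analyse = make_analyser([strip_html], word_tokeniser, [lowercase, remove_stopwords])
print(analyse("<p>The Quick-Brown Fox</p>"))  # ['quick', 'brown', 'fox']
```

Order matters: the stop-word filter runs after lowercasing, so "The" is removed as well as "the".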
Each text field can have its own analyser. The same string can be analysed differently in different fields. For example, a title field might use a standard analyser with synonyms, while a content field uses a simpler analyser. You can also set search_analyzer separately from the index-time analyzer.
Test analysers with the _analyze endpoint before deploying mappings.
Mapping & Analyzers: Technical Reference Guide
This reference covers the essential mapping types, parameter configuration, and custom analyser components you'll use in production. The mapping is the schema definition for an index — it tells Elasticsearch how to store and index each field. Key field types: text (full-text), keyword (exact match), integer, long, float, double, boolean, date (ISO 8601), nested (array of objects), flattened (arbitrary key-value), geo_point, ip. Each type has parameters: index (true/false), store (separate stored field), doc_values (for aggregations/sorting), norms (scoring), copy_to (combine fields).
Analysers consist of: character filters (strip HTML, replace patterns), tokeniser (standard, whitespace, edge_ngram), token filters (lowercase, stemmer, stop, synonym, asciifolding). Custom analysers are defined in index settings and referenced in mappings.
Production-critical parameters: ignore_above (keyword length limit), coerce (auto-convert numbers), eager_global_ordinals (pre-load for high cardinality keyword fields used in aggregations), fields (multi-field for same string).
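To show how these pieces fit together, here is a sketch of a create-index body that defines a custom analyser in the settings and references it from the mappings. The analyser name `html_english`, the filter name `english_stemmer`, and all field names are hypothetical; the component types (`html_strip`, `standard`, `lowercase`, `asciifolding`, `stemmer`) are built into Elasticsearch.

```python
import json

# Analyser, filter and field names here are hypothetical.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "english_stemmer": {"type": "stemmer", "language": "english"}
            },
            "analyzer": {
                "html_english": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "english_stemmer"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "description": {"type": "text", "analyzer": "html_english"},
            "tag": {
                "type": "keyword",
                "ignore_above": 256,            # skip indexing longer values
                "eager_global_ordinals": True,  # build ordinals at refresh time
            },
            "quantity": {"type": "integer", "coerce": False},
        }
    },
}
print(json.dumps(index_body, indent=2))
```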
Use the synonym_graph filter for multi-word synonyms to avoid false positives.
Use _template (or _index_template in newer versions) for index patterns to enforce mapping consistency across time-based indices.
You cannot change an existing text field to keyword via PUT mapping. A field's type is fixed once it is mapped, so queries keep failing until you create a new index with the correct mapping and reindex.
Cluster Architecture, Sharding and Scaling
An Elasticsearch cluster consists of nodes: master-eligible nodes (handle cluster state), data nodes (store data and execute queries), ingest nodes (pre-process documents), and coordinating nodes (route requests). Data is split into shards, and each shard is a separate Lucene index. When you index a document, it is routed to a primary shard based on _routing (default: a hash of the document ID). The primary shard synchronously replicates to replica shards on other nodes. Scaling means adding data nodes; shards are automatically rebalanced. But you cannot change the number of primary shards of an existing index in place: the options are the _split and _shrink APIs or a full reindex. So right-sizing shards at index creation time is critical.
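Why the primary shard count is effectively fixed can be seen from a toy version of hash-based routing. The sketch assumes routing of the form hash(routing value) modulo the number of primary shards, and uses CRC32 purely as a stand-in hash; it is not the hash function Elasticsearch uses.

```python
import zlib

def route(routing_value: str, number_of_primary_shards: int) -> int:
    """Pick a shard as hash(routing) % primaries (CRC32 is a stand-in hash)."""
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

doc_ids = [f"doc-{i}" for i in range(1000)]

# Where each document lives with 3 primaries, and where it would be
# looked up if the same index suddenly had 4 primaries.
with_3 = {d: route(d, 3) for d in doc_ids}
with_4 = {d: route(d, 4) for d in doc_ids}

moved = sum(1 for d in doc_ids if with_3[d] != with_4[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Because most documents would resolve to a different shard under the new count, lookups by ID would miss unless the data were physically redistributed.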
Shard & Replica Distribution Layout
Visualising how shards and replicas are distributed across nodes helps you understand resilience and read scalability. Each index has a set of primary shards (P0, P1, ...). Each primary shard has zero or more replica shards (R0, R1, ...) on different nodes. Elasticsearch never allocates a primary and a replica of the same shard number to the same node, which is what makes the index resilient to a node failure. When you have replicas, read requests can be served by any copy, distributing load. For example, in a 2-node cluster with an index of 3 primary shards and 1 replica each, every node holds one copy of each shard, so either node can fail without data loss.
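The 2-node, 3-primary, 1-replica layout can be reproduced with a toy allocator. This is a simplified round-robin sketch with hypothetical node names; the real allocator also weighs disk usage, shard counts, and awareness settings, and it leaves surplus replicas unassigned instead of raising an error.

```python
def allocate(primaries: int, replicas: int, nodes: list[str]) -> dict[str, list[str]]:
    """Toy round-robin allocator: copies of one shard never share a node."""
    if replicas + 1 > len(nodes):
        # Real Elasticsearch would leave the extra replicas unassigned.
        raise ValueError("not enough nodes to place every copy on a distinct node")
    layout: dict[str, list[str]] = {n: [] for n in nodes}
    for shard in range(primaries):
        for copy in range(replicas + 1):
            # Offsetting by the copy number puts each copy on a different node.
            node = nodes[(shard + copy) % len(nodes)]
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            layout[node].append(label)
    return layout

layout = allocate(primaries=3, replicas=1, nodes=["node-1", "node-2"])
for node, shards in layout.items():
    print(node, shards)
# node-1 ['P0', 'R1', 'P2']
# node-2 ['R0', 'P1', 'R2']
```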
shard = hash(_routing) % number_of_primary_shards. The default routing is the document _id. If you need to group related documents on the same shard (e.g., all user documents for user 42), specify a custom _routing value. But beware: a custom routing value is still hashed, yet every document that shares it lands on the same shard, so a skewed or low-cardinality routing key can cause uneven shard sizes.
Mapping Explosion Caused Cluster OOM
Set dynamic: false or dynamic: strict on the index template. Mapped only known fields explicitly. Used a flattened data type or nested key-value pair for unknown fields. Applied a limit on field count via index.mapping.total_fields.limit.
- Dynamic mapping is safe for low-cardinality, known schemas. Never use it for user-generated keys or log fields with unbounded cardinality.
- Always set field count limits and monitor mapping size in production.
- A mapping explosion is silent — you'll see OOM before you see the mapping warning.
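A sketch of an index template that applies these safeguards together. The template name, index pattern, field names, and the limit of 500 are hypothetical; the body shape assumes the composable `_index_template` API (older versions use `_template` with a different layout).

```python
import json

# Template name, index pattern, field names and the limit are hypothetical.
template_body = {
    "index_patterns": ["logs-*"],
    "template": {
        "settings": {"index.mapping.total_fields.limit": 500},
        "mappings": {
            "dynamic": "strict",  # unknown top-level fields are rejected
            "properties": {
                "@timestamp": {"type": "date"},
                "message": {"type": "text"},
                "service": {"type": "keyword"},
                # arbitrary keys are stored under one mapped field
                "labels": {"type": "flattened"},
            },
        },
    },
}
print("PUT _index_template/logs-template")
print(json.dumps(template_body, indent=2))
```

Unbounded keys go under `labels`, so they never add entries to the mapping.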
Check _cat/thread_pool/search?v for queue depth. If queue > 100, nodes are overloaded. Also check _nodes/hot_threads for CPU contention.
Run GET _cluster/health?pretty. If unassigned_shards > 0, check GET _cat/shards?h=index,shard,prirep,state,node,unassigned.reason to see why shards are unassigned (e.g., node left, disk full).
Run GET _cat/shards?v and see if primaries and replicas are in sync. Also verify that the refresh_interval isn't set too high (default 1s, increase only if indexing throughput is critical).
Run GET _nodes/stats/breaker?pretty. Identify which breaker tripped (field data, request, in-flight). Increase the breaker limit temporarily, but the real fix is reducing memory pressure: reduce field data cache size or limit query complexity.
Check cluster.routing.allocation.disk.watermark.low. If a node dropped, reroute shards manually: POST _cluster/reroute?retry_failed=true.
Key takeaways
- match queries do full-text search; term queries do exact matching.
Common mistakes to avoid
Using `term` query on `text` field
Map the field as keyword (or use the .keyword sub-field) and query with term on that sub-field. For full-text, use match.
Not setting a replica count on critical indices
Set number_of_replicas >= 1 (the default is 1). For mission-critical data, use 2 or more replicas. Monitor with GET _cat/indices?v.
Wildcard queries on leading edge of `text` fields
A leading wildcard such as *pattern runs extremely slowly because it cannot seek into the term dictionary and has to check every term, which causes CPU spikes and timeouts. Use an n-gram tokeniser for partial matching, or match_phrase_prefix for prefix matching. Avoid leading wildcards unless the field is mapped as keyword with a small prefix length.
Over-sharding: too many primary shards
Not monitoring disk thresholds
Set cluster.routing.allocation.disk.watermark.low and high in elasticsearch.yml. Add monitoring alerts at 70% disk usage.
Interview Questions on This Topic
Explain the difference between a `match` query and a `term` query. When would you use each?
A match query analyses the search term with the field's analyser (tokenising, lowercasing, and stemming where configured) and searches the analysed field. A term query does not analyse: it looks for the exact token as stored in the inverted index. Use match for full-text search on text fields. Use term for exact matches on keyword fields, IDs, enums, or status values. If you use term on a text field, you'll likely get zero results because the text was tokenised during indexing.
Frequently Asked Questions