Elasticsearch Basics
Elasticsearch basics: what it is, when to use it, indices and documents, full-text search with match queries, aggregations, and how it differs from relational databases.
20+ years shipping high-throughput database systems. Notes here come from systems that actually shipped.
- Elasticsearch is a distributed search and analytics engine built on Apache Lucene.
- Data is stored as JSON documents in indices — no fixed schema required.
- Excels at full-text search with tokenisation, stemming, and relevance scoring.
- match queries do full-text search; term queries do exact, unanalysed matching.
- Aggregations let you run analytics on millions of documents in near-real-time.
- Not a primary database — no transactions, eventual consistency, and no ACID guarantees.
What Elasticsearch Actually Does
Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It indexes JSON documents into inverted indices, enabling near-real-time full-text search, structured queries, and aggregations across terabytes of data. The core mechanic is sharding: each index is split into primary and replica shards, distributed across a cluster for parallelism and fault tolerance.
Documents are stored in indices, and each shard of an index is a complete Lucene index. Writes go to primary shards, then replicate; reads can be served by a primary or any of its replicas. This gives you sub-second search on billions of documents, but only if you design your mapping upfront. Dynamic mapping is convenient in dev; in production it causes field explosion, index bloat, and silent query failures.
Use Elasticsearch when you need free-text search, log analytics, or real-time aggregations on semi-structured data. It is not a primary data store — no transactions, no strong consistency guarantees. Teams that treat it as a general-purpose database hit split-brain scenarios, data loss on cluster splits, and painful reindexing. It excels as a secondary index or analytics layer behind a durable source of truth.
Documents and Indices
An index is loosely comparable to a database table, but it stores JSON documents. Each document gets a unique _id, either generated automatically or assigned by you. The mapping defines the schema (field types, analysers). Unlike SQL, the schema can be dynamic: new fields are added automatically if dynamic mapping is enabled. But dynamic mapping is dangerous in production (see the mapping explosion incident later in this article). Each index is composed of one or more shards that are distributed across the cluster. That's how Elasticsearch scales horizontally.
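A minimal sketch of an explicit mapping with dynamic mapping locked down, written as a Python dict that serialises to the body of a create-index request. The index name products and the field names are assumptions for illustration, and nothing is sent to a cluster.

```python
import json


def build_index_body(shards: int = 1, replicas: int = 1) -> dict:
    """Return a create-index body with an explicit, strict mapping."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            # "strict" rejects documents that contain unmapped fields;
            # False would accept them but leave the new fields unindexed.
            "dynamic": "strict",
            "properties": {
                "name": {"type": "text"},        # hypothetical fields
                "sku": {"type": "keyword"},
                "price": {"type": "float"},
                "created_at": {"type": "date"},
            },
        },
    }


body = build_index_body(shards=3, replicas=1)
# Body for: PUT /products  (hypothetical index name)
print(json.dumps(body, indent=2))
```

Keeping the mapping in code like this makes it reviewable and repeatable, which matters because the shard count is fixed once the index exists.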
Run PUT index/_mapping { "dynamic": false } before ingesting untrusted data.
Full-Text Search Queries
Elasticsearch's killer feature is full-text search. When you send a match query, Elasticsearch analyses the query string with the field's analyser (tokenising, lowercasing, and applying any configured stemmers) and compares the resulting tokens against the analysed text in the inverted index. Relevance scoring (BM25 by default since 5.0, TF-IDF before that) ranks results. You can boost fields, require clauses, and exclude terms using the bool query. multi_match lets you search across multiple fields with individual boosts.
- Each text field is analysed: tokenised, lowercased, and stemmed if the analyser includes a stemmer.
- The inverted index stores (token → list of document IDs + position).
- A match query converts the search phrase into tokens, then looks up each token in the inverted index.
- Documents matching more tokens (and rarer tokens) get higher scores.
- The term query skips analysis: it looks for the exact token as stored in the inverted index.
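The bool, multi_match and term pieces described above can be combined into one request body. The following is a sketch only: the field names title, description, category and brand are assumptions, with the first two imagined as text fields and the last two as keyword fields.

```python
import json


def build_search(text: str, category: str, excluded_brand: str) -> dict:
    """Build a bool query: scored full-text clause plus unscored exact filters."""
    return {
        "query": {
            "bool": {
                # must: analysed full-text search, contributes to the score.
                "must": [
                    {
                        "multi_match": {
                            "query": text,
                            # "^3" boosts matches in title over description.
                            "fields": ["title^3", "description"],
                        }
                    }
                ],
                # filter: exact, unanalysed match; does not affect the score.
                "filter": [{"term": {"category": category}}],
                # must_not: exclude documents with this exact value.
                "must_not": [{"term": {"brand": excluded_brand}}],
            }
        }
    }


query = build_search("wireless headphones", "audio", "acme")
# Body for: GET /<index>/_search
print(json.dumps(query, indent=2))
```

Putting the exact-match clauses in filter rather than must keeps them out of relevance scoring, which is usually what you want for categories and flags.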
term on a text field usually returns nothing because the stored tokens were analysed. match analyses the query; term does not. For exact matches, map the field as keyword and use term. A typical failure: term on a text field, with zero results for hours.
How the Inverted Index Powers Search
The inverted index is what makes full-text search possible without a full scan. When you index a text field, the analyser produces a list of tokens. For every unique token, the inverted index stores a sorted list of document IDs that contain that token, plus the position(s) within the document. When a match query arrives, Elasticsearch looks up each token in the index's term dictionary; that's a direct lookup, not a linear scan over documents. The search collects the matching document IDs, scores them, and retrieves the top hits. The process is the same whether you have 1,000 or 100,000,000 documents: search latency depends mostly on how many documents contain the query's tokens, not on the total document count. That's why Elasticsearch can respond in milliseconds on huge datasets.
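To make the lookup concrete, here is a toy inverted index in Python. It is a teaching sketch, not how Lucene is implemented: the analyser only lowercases and splits on non-alphanumeric characters, and the score is a plain count of matching tokens rather than BM25.

```python
import re
from collections import defaultdict


def analyse(text: str) -> list[str]:
    """Toy analyser: lowercase, then split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


class InvertedIndex:
    def __init__(self) -> None:
        # token -> {doc_id: [positions of the token in that document]}
        self.postings: dict[str, dict[int, list[int]]] = defaultdict(dict)

    def index(self, doc_id: int, text: str) -> None:
        for position, token in enumerate(analyse(text)):
            self.postings[token].setdefault(doc_id, []).append(position)

    def term(self, token: str) -> set[int]:
        """Exact lookup: the input is NOT analysed first."""
        return set(self.postings.get(token, {}))

    def match(self, query: str) -> list[int]:
        """Analyse the query, then rank documents by matching-token count."""
        hits: dict[int, int] = defaultdict(int)
        for token in analyse(query):
            for doc_id in self.postings.get(token, {}):
                hits[doc_id] += 1
        return sorted(hits, key=lambda d: (-hits[d], d))


idx = InvertedIndex()
idx.index(1, "Quick brown fox")
idx.index(2, "The quick red fox jumps")
idx.index(3, "Lazy dog")

print(idx.match("Quick FOX"))  # query is analysed, so case does not matter
print(idx.term("Quick"))       # empty: the stored token is "quick"
print(idx.term("quick"))
```

The empty result for term("Quick") is the same effect as running a term query against an analysed text field.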
If a query still ends up scanning (check with the profile option on a slow query), it means your mapping doesn't match the query, e.g., a wildcard on a keyword field. Run GET index/_search { "profile": true } to see where the time goes.
Full-Text vs Keyword: When to Use Each
Elasticsearch offers two fundamental ways to handle string fields: text (analysed, full-text) and keyword (exact, unanalysed). Choosing the wrong type leads to missing results or broken filters. A text field uses an analyser to break the string into tokens; a keyword field stores the entire string as a single token. Use text for human-readable content you want to search by relevance (blog posts, product descriptions). Use keyword for identifiers, enums, tags, emails, URLs: anything you filter or aggregate on. In many production schemas, the same string is mapped both ways: a text field for match queries, with a keyword sub-field for term queries and aggregations.
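The difference shows up most clearly when you count values. The sketch below imitates the two field types with plain Python counters; the product names are made up, and a real terms aggregation on a text field would additionally require fielddata to be enabled.

```python
from collections import Counter

# Hypothetical product names, one per document.
products = ["Red Running Shoes", "Red Running Shoes", "Blue Running Shoes"]

# keyword-style: the whole string is a single term, so buckets are per product.
keyword_buckets = Counter(products)

# text-style: each string is split into lowercase tokens first,
# so buckets are per token rather than per product.
text_buckets = Counter(tok for name in products for tok in name.lower().split())

print(keyword_buckets)
print(text_buckets)
```

The keyword counter yields two product buckets, while the text counter yields four token buckets such as running and shoes, which is rarely what a report needs.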
Set ignore_above (256 on the .keyword sub-fields that dynamic mapping creates) to avoid indexing unreasonably long strings. Very long values stored as keyword, such as a 10KB string, bloat doc values and the global ordinals that terms aggregations build in memory. This is a common cause of slow aggregations.
- Map searchable strings as text with a .keyword sub-field.
- Search with match, filter/aggregate with term on .keyword.
- Avoid term on a pure text field: you'll usually get zero results because the stored tokens were analysed.
- Aggregating on a text field (which is tokenised) gives per-token counts, not per-product.
Aggregations: Analytics
Aggregations let you compute analytics over search results without writing custom code. They're like SQL GROUP BY on steroids: nested aggregations, date histograms, percentile ranks, geo distance, and more. Aggregations run on the shards in parallel and results are merged on the coordinating node. The size: 0 in the request avoids returning documents when you only want the aggregated data.
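The shard-then-merge flow can be imitated in a few lines of Python. This is a simplified model, not Elasticsearch code: each inner list stands for the field values held by one shard, and shard_size is the number of buckets each shard is allowed to return.

```python
from collections import Counter


def shard_top_terms(values: list[str], shard_size: int) -> list[tuple[str, int]]:
    """Each shard counts its own documents and returns only its top buckets."""
    return Counter(values).most_common(shard_size)


def coordinate(shards: list[list[str]], size: int, shard_size: int) -> list[tuple[str, int]]:
    """The coordinating node sums per-shard buckets and keeps the top `size`."""
    merged = Counter()
    for values in shards:
        for term, count in shard_top_terms(values, shard_size):
            merged[term] += count
    # Sort by count descending, then by term for a stable order.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


shards = [["a", "a", "b"], ["b", "b", "c"]]

exact = coordinate(shards, size=2, shard_size=2)
lossy = coordinate(shards, size=2, shard_size=1)
print(exact)
print(lossy)
```

With shard_size=1 the first shard never reports its single b, so the merged count for b is too low. That is why shards return more buckets than the final size, and why very large sizes get expensive.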
A terms aggregation with a large size requests many buckets from each shard, then the coordinating node sorts and merges them. If size is 10000, each shard returns at least 10000 buckets; multiply that by the number of shards. Use search.max_buckets to cap memory. Pre-filter with a query to reduce the dataset before aggregating.
- search.max_buckets defaults to 10000 in older releases and is higher in recent ones; exceeding it fails the request.
- For high-cardinality fields, use a composite aggregation with pagination, not a single terms with huge size.
- Set search.max_buckets and pre-filter with queries.
- Page through large bucket sets with the composite aggregation.
Mapping, Analysis and Tokenisation
Mapping defines how documents and their fields are stored and indexed. Analysis is the process of converting a text field into tokens (terms) that go into the inverted index. An analyser consists of zero or more character filters, exactly one tokeniser, and zero or more token filters. By default, Elasticsearch uses the standard analyser, which splits on word boundaries, drops most punctuation, and lowercases; it does not remove stop words unless you configure it to. You can create custom analysers with different tokenisers (whitespace, keyword, n-gram) and filters (synonym, stemmer, stop). Understanding analysis is critical to getting search results that match user intent.
Each text field can have its own analyser. The same string can be analysed differently in different fields. For example, a title field might use a standard analyser with synonyms, while a content field uses a simpler analyser. You can also set search_analyzer separately from analyzer.
Test analysers with the _analyze endpoint before deploying mappings.
Mapping & Analyzers: Technical Reference Guide
This reference covers the essential mapping types, parameter configuration, and custom analyser components you'll use in production. The mapping is the schema definition for an index — it tells Elasticsearch how to store and index each field. Key field types: text (full-text), keyword (exact match), integer, long, float, double, boolean, date (ISO 8601), nested (array of objects), flattened (arbitrary key-value), geo_point, ip. Each type has parameters: index (true/false), store (separate stored field), doc_values (for aggregations/sorting), norms (scoring), copy_to (combine fields).
Analysers consist of: character filters (strip HTML, replace patterns), tokeniser (standard, whitespace, edge_ngram), token filters (lowercase, stemmer, stop, synonym, asciifolding). Custom analysers are defined in index settings and referenced in mappings.
Production-critical parameters: ignore_above (keyword length limit), coerce (auto-convert numbers), eager_global_ordinals (pre-load for high cardinality keyword fields used in aggregations), fields (multi-field for same string).
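As an illustration of how these pieces fit together, the sketch below defines a custom analyser in the index settings and references it from the mapping. The analyser name folded_english, the sub-field name raw, and the field names are assumptions; the char filter, tokeniser and token filter names are Elasticsearch built-ins.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical analyser name.
                "folded_english": {
                    "type": "custom",
                    "char_filter": ["html_strip"],     # strip HTML first
                    "tokenizer": "standard",           # exactly one tokeniser
                    "filter": ["lowercase", "asciifolding", "stop"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "folded_english",
                # Multi-field: exact copy for term queries and aggregations.
                "fields": {"raw": {"type": "keyword", "ignore_above": 256}},
            },
            "status": {"type": "keyword"},
            "published": {"type": "date"},
        }
    },
}

# Body for: PUT /<index>
print(json.dumps(index_body, indent=2))
```

The analyser has to exist in settings before a mapping can reference it, so both usually go into the same create-index request or index template.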
- Use the synonym_graph filter for multi-word synonyms to avoid false positives.
- Use _template (or _index_template on 7.8 and later) for index patterns to enforce mapping consistency across time-based indices.
- One team tried to change a text field to keyword via PUT mapping. The type of an existing field cannot be changed in place, and queries kept failing until they reindexed.
Cluster Architecture, Sharding and Scaling
An Elasticsearch cluster consists of nodes with different roles: master-eligible nodes (manage cluster state), data nodes (store data and execute queries), ingest nodes (pre-process documents), and coordinating nodes (route requests and merge results). Data is split into shards, and each shard is a separate Lucene index. When you index a document, it is routed to a primary shard based on _routing (default: a hash of the document ID). The primary shard then forwards the operation to its replica shards on other nodes before acknowledging it. Scaling means adding data nodes; shards are automatically rebalanced. But you cannot change the number of primary shards of an existing index in place; that takes the split or shrink API, or a reindex. So right-sizing shards at index creation time is critical.
Shard & Replica Distribution Layout
Visualising how shards and replicas are distributed across nodes helps you understand resilience and read scalability. Each index has a set of primary shards (P0, P1, ...). Each primary shard has zero or more replica shards (R0, R1, ...) on different nodes. Elasticsearch never allocates a primary and a replica of the same shard number to the same node, which is what makes the index survive a node failure. When you have replicas, read requests can be served by any copy, distributing load. Take a 2-node cluster with an index of 3 primary shards and 1 replica each: every node ends up holding one copy of each of the three shards, so either node can fail without losing data.
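Document placement follows the simplified rule shard = hash(_routing) % number_of_primary_shards. The sketch below imitates it in Python under one loud assumption: zlib.crc32 stands in for the Murmur3-based hash that Elasticsearch actually uses, so the shard numbers will not match a real cluster.

```python
import zlib
from collections import Counter


def shard_for(routing: str, num_primary_shards: int) -> int:
    """Simplified routing: hash the routing value, take it modulo shard count.

    zlib.crc32 is a stand-in hash chosen for this sketch only.
    """
    return zlib.crc32(routing.encode("utf-8")) % num_primary_shards


doc_ids = [f"doc-{i}" for i in range(1000)]

# Default routing: each document is placed by its own _id.
distribution = Counter(shard_for(d, 3) for d in doc_ids)

# Custom routing: every document routed with "user-42" lands on one shard.
user_shards = {shard_for("user-42", 3) for _ in range(5)}

# Changing the shard count changes the modulo, so documents would move.
moved = sum(1 for d in doc_ids if shard_for(d, 3) != shard_for(d, 4))

print(distribution)
print(user_shards)
print(moved)
```

The moved count is the reason the number of primary shards is fixed: with a different divisor, existing documents would no longer be found on the shard the formula points to.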
Routing formula, in simplified form: shard = hash(_routing) % number_of_primary_shards. The default routing is the document _id. If you need to group related documents on the same shard (e.g., all documents for user 42), specify a custom _routing value. But beware: the custom value is still hashed, and a skewed or low-cardinality routing key can cause uneven shard sizes, so avoid keys where a few values own most of the documents.
Why Your Elasticsearch Cluster Died at 3 AM: Heap Pressure and GC Pauses
Every production meltdown I’ve seen with Elasticsearch traces back to one thing: the JVM heap. It’s not disk space. It’s not network. It’s the garbage collector stalling because you treated Elasticsearch like a document store instead of a search engine.
Elasticsearch runs on Lucene. Lucene writes to segments on disk, not to heap. But your aggregations, parent-child relationships, and heavy terms queries build in-memory hash tables that eat young gen alive. When those promotions fail, you get long GC pauses. Nodes drop out of the cluster. Shards go missing. You wake up to a 503.
The fix starts before deployment. Set -Xmx to no more than 50% of available RAM. The rest goes to the OS page cache — that’s where Lucene actually lives. Use _cat/nodes?v&h=heap.percent,ram.percent to monitor the split. If heap percent stays above 85 under load, you’re trading speed for stability. Split the index, cache fewer fields, or push aggregations to dedicated coordinating nodes.
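The sizing rule can be written down as a small helper. It is a rough sketch: the 50% share comes from the text above, while the 31 GB ceiling is an assumption added here as a safety margin below the roughly 32 GB point where the JVM stops using compressed object pointers.

```python
def recommended_heap_gb(total_ram_gb: float, cap_gb: float = 31.0) -> float:
    """Half of RAM for the heap, capped; the rest is left to the page cache."""
    return min(total_ram_gb * 0.5, cap_gb)


def jvm_options(total_ram_gb: float) -> list[str]:
    """Return matching -Xms/-Xmx flags (equal values avoid heap resizing)."""
    heap_mb = int(recommended_heap_gb(total_ram_gb) * 1024)
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


for ram in (8, 16, 64, 128):
    print(ram, "GB RAM ->", jvm_options(ram))
```

On a 128 GB machine the helper stops at 31 GB, leaving the remaining memory for the file system cache instead of a larger heap.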
The Write Path — Why Your Indexing Throughput Is Lying to You
You watch the indexing rate on the monitoring page and think everything’s fine. Then you do a forced refresh and all those buffered writes flood the disk. Your search latency spikes. Users complain. You had 10k docs/second incoming but the disk queue depth hit 100. That’s not throughput. That’s deferred pain.
Elasticsearch doesn't write segments to disk on every _bulk request. It buffers documents in memory and appends each operation to the translog; a refresh (every second by default) turns the buffer into a searchable segment, and a flush commits segments to disk and trims the translog. That translog is your safety net if the node crashes, but replaying it on startup can take minutes if you let it grow. A shard with 1 million uncommitted docs after an unclean shutdown? Good luck.
The move: for bulk-heavy pipelines, set index.translog.durability to async and use index.translog.sync_interval to control how often the translog is fsynced. Set it to 30s and accept that a crash can lose up to that much acknowledged data. Flush frequency itself is governed by index.translog.flush_threshold_size. Never use _forcemerge until indexing is fully done: it rewrites segments and churns the file system cache. Batch your writes. Keep segment count under 150 per shard. And for the love of god, disable replicas during reindexing and turn them back on after.
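Batching writes means sending newline-delimited JSON to the _bulk endpoint: one action line, then one document line, with a trailing newline at the end. The generator below is a minimal sketch of that format; the index name logs and the batch size are assumptions, and nothing is sent over the network.

```python
import json
from typing import Iterable, Iterator


def bulk_batches(index: str, docs: Iterable[dict], batch_size: int = 1000) -> Iterator[str]:
    """Yield NDJSON payloads for POST /_bulk, batch_size documents at a time."""
    lines: list[str] = []
    count = 0
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # source line
        count += 1
        if count == batch_size:
            yield "\n".join(lines) + "\n"  # the body must end with a newline
            lines, count = [], 0
    if lines:
        yield "\n".join(lines) + "\n"


# Settings body for PUT /<index>/_settings while a large reindex runs;
# restore the replica count afterwards.
REINDEX_SETTINGS = {"index": {"number_of_replicas": 0}}

batches = list(bulk_batches("logs", ({"n": i} for i in range(5)), batch_size=2))
for payload in batches:
    print(payload, end="")
```

Five documents with a batch size of two produce three payloads, the last one holding the single leftover document.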
Use an ingest pipeline with grok or dissect to parse fields before they hit the indexing node. CPU-intensive parsing on the ingest node keeps your data nodes focused on writing segments. Split the work.
Setup and Installation: Get Elasticsearch Running Before You Burn an Hour
Most engineers waste time fighting installation when they should be fighting bad queries. Elasticsearch runs on Java; recent releases bundle their own JDK, so you only need to check the JVM version (OpenJDK 17 or 21) if you supply your own. Download the tarball from Elastic's official site; never apt or brew outside of dev. Untar, run bin/elasticsearch, and it boots on port 9200. If you want Docker, use the official image with explicit memory limits, for example ES_JAVA_OPTS="-Xms1g -Xmx1g". Defaults will eat your laptop alive. Production installs need the elasticsearch.yml file tuned for your cluster. Set discovery.type: single-node for local testing, then discovery.seed_hosts for real setups. Keep the data and logs directories on separate volumes. Don't run as root: create a dedicated elasticsearch user. Verify with curl -XGET 'localhost:9200' (on 8.x with security enabled, use https and credentials). If you see a cluster name, you're live. If you see an error, check your heap. Always.
Start with -Xms4g -Xmx4g for most workloads; bigger isn't better.
Security and Access Control: Stop Running Elasticsearch With No Pants
Before version 8, Elasticsearch ships with no authentication, and a node bound to 0.0.0.0:9200 is open to anyone who can reach it. That's a lawsuit waiting to happen. The first thing you do is enable the Elasticsearch security features. In version 8+, it's automatic, but only if you didn't override it. For older clusters, set xpack.security.enabled: true in elasticsearch.yml. Then run bin/elasticsearch-setup-passwords auto to generate credentials for the elastic superuser and the other built-in users. Store those in a vault, not a sticky note. Next, create roles per use case: read_only for dashboards, ingest for Logstash, admin for your CI. Use the security API, not the config file. TLS is non-negotiable for transport between nodes. Generate certificates with bin/elasticsearch-certutil cert. If you're on Kubernetes, use the Elastic Cloud on Kubernetes (ECK) operator: it handles secrets and certs for you. API keys are preferred over passwords for automation. Revoke them with one DELETE call. Audit logs catch the intern who drops an index at 2 AM. Turn them on with xpack.security.audit.enabled: true.
Why Data Ingestion Kills Your Cluster — The Ingestion Pipeline
Elasticsearch is only as fast as the data you feed it. Most engineers skip the ingestion pipeline and wonder why indexing blocks, field mappings drift, or documents disappear. The why: raw data from logs, APIs, or databases arrives in messy shapes. Elasticsearch can't search what it can't parse. The how: use ingest nodes with built-in processors: grok for unstructured logs, date for timestamps, and set/remove for field cleanup. A pipeline runs before indexing, transforming each document. This prevents mapping explosions, reduces storage by dropping null fields, and improves query speed by pre-computing derived values. The production trap: without a pipeline, a change in a source's format often means reindexing everything that source has written. Pipeline versioning lets you evolve schemas without downtime. Key rule: always attach a pipeline to your index template (via index.default_pipeline); it's your data's first line of defense.
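A pipeline definition using the processors named above might look like the following sketch. The pipeline name access-logs, the field names, and the grok pattern are assumptions for illustration; the processor types and the index.default_pipeline setting are real Elasticsearch names.

```python
import json

# Body for: PUT /_ingest/pipeline/access-logs  (hypothetical pipeline name)
pipeline = {
    "description": "Parse access-log lines before indexing",
    "processors": [
        {   # Extract structured fields from the raw line.
            "grok": {
                "field": "message",
                "patterns": ["%{IP:client_ip} %{WORD:method} %{URIPATHPARAM:path}"],
            }
        },
        {   # Parse the timestamp; the result goes to @timestamp by default.
            "date": {"field": "timestamp", "formats": ["ISO8601"]}
        },
        {   # Add a constant field describing the source.
            "set": {"field": "source", "value": "nginx"}
        },
        {   # Drop the raw line once it has been parsed.
            "remove": {"field": "message", "ignore_missing": True}
        },
    ],
}

# Index (or index template) settings that attach the pipeline by default.
template_settings = {"index": {"default_pipeline": "access-logs"}}

print(json.dumps(pipeline, indent=2))
print(json.dumps(template_settings, indent=2))
```

Processors run in the order listed, so the remove step has to come after grok has read the message field.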
Advanced Querying — Why bool Queries Crush Your Lucene Score
Prerequisites to Learn Elasticsearch
Before building anything useful with Elasticsearch, you need a solid grasp of RESTful APIs (curl or Postman) and JSON structure, because every operation from indexing to search is a JSON-over-HTTP call. You must understand your data's shape (fields that will be searched, aggregated, or sorted) so you can define a proper mapping. Without that, Elasticsearch will infer types, often disastrously (e.g., treating a ZIP code as a numeric field). You should also know the difference between structured (numeric, date) and unstructured (full-text) data; they need different field types, and only full-text fields go through an analyzer. Cluster concepts matter: a single-node setup hides shard/replica behavior that will bite you in production. Finally, install Java 17+ only if your Elasticsearch release does not bundle a JDK (recent releases do). A common pain point: a small default heap (1 GB on older releases) leads to OutOfMemoryErrors or long GC pauses once you index more than modest datasets. Tune heap to 50% of available RAM, and stay below roughly 32 GB so the JVM can keep using compressed object pointers. Start small: index a few documents, run a match query, then scale to bulk ingestion.
Comparisons and Differences
Elasticsearch is often confused with relational databases (MySQL, PostgreSQL) and other search engines (Solr). The fundamental difference: ES is a distributed, near-real-time document store built on inverted indices, not a set of structured tables. Joins are expensive, so ES prefers denormalized, nested, or parent-child mappings. Unlike traditional DBs, ES has no ACID transactions across documents or shards; it favors eventual consistency for speed. Compare with Solr: Solr has better built-in faceting for e-commerce, but ES wins on ecosystem (Kibana, Beats, Logstash) and on horizontal scalability with built-in cluster coordination (Zen Discovery before 7.0). Against OpenSearch? It's a fork of Elasticsearch 7.10; the APIs remain largely similar, but ES has a larger community and more aggressive feature development. The biggest difference from any SQL database: ES does not support multi-document transactions or foreign keys. If you need strong consistency, pair ES with a source-of-truth DB and treat it as a secondary index. For time-series data (logs, metrics), ES excels over MongoDB because of inverted indices and its aggregations framework. Choose ES when you need full-text search, analytics dashboards, or high-ingestion-rate logging, not for banking ledgers.
Conclusion
Elasticsearch is a powerful tool, but it demands respect for its distributed nature and memory-sensitive internals. The key takeaway from this series: mapping and analysis define how your data is stored and searched — get them wrong, and you'll chase phantom relevance issues. Cluster architecture, shard distribution, and heap tuning are not optional; they determine whether your cluster survives a traffic spike or implodes at 3 AM. The write path and ingestion pipelines are the primary bottlenecks — always measure throughput with bulk requests and monitor GC logs. Security, often an afterthought, should be configured before the first document is indexed (enable TLS, set password for elastic user). Comparisons with SQL databases or Solr clarify use cases: ES excels at full-text search and log analytics, not relational consistency. Finally, prerequisites like REST fluency and data-type awareness form the foundation; skip them and you'll wrestle with inferred mappings that corrupt search accuracy. Start small, monitor relentlessly, and treat every mapping change as a migration. Elasticsearch rewards understanding — it punishes blind optimism. Use the provided code snippets as daily checklists, and you'll push to production with confidence.
Mapping Explosion Caused Cluster OOM
Set dynamic: false or dynamic: strict on the index template. Mapped only known fields explicitly. Used a flattened data type or nested key-value pair for unknown fields. Applied a limit on field count via index.mapping.total_fields.limit.
- Dynamic mapping is safe for low-cardinality, known schemas. Never use it for user-generated keys or log fields with unbounded cardinality.
- Always set field count limits and monitor mapping size in production.
- A mapping explosion is silent — you'll see OOM before you see the mapping warning.
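A quick way to see how user-generated keys inflate a mapping is to count the distinct field paths in a sample of documents before indexing them. The sketch below is a simplification: it counts leaf fields only, whereas Elasticsearch also counts object fields toward index.mapping.total_fields.limit (default 1000), and the sample documents are invented.

```python
DEFAULT_FIELD_LIMIT = 1000  # default of index.mapping.total_fields.limit


def field_paths(doc: dict, prefix: str = "") -> set[str]:
    """Collect dotted paths of all leaf fields in a (possibly nested) document."""
    paths: set[str] = set()
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= field_paths(value, prefix=f"{path}.")
        else:
            paths.add(path)
    return paths


def mapped_fields(docs: list[dict]) -> set[str]:
    """Union of field paths that dynamic mapping would create for these docs."""
    fields: set[str] = set()
    for doc in docs:
        fields |= field_paths(doc)
    return fields


# Hypothetical sample: every document carries a different user-defined key.
docs = [{"user": "u1", "attrs": {f"key_{i}": i}} for i in range(1500)]

fields = mapped_fields(docs)
over_limit = len(fields) > DEFAULT_FIELD_LIMIT
print(len(fields), "distinct fields; over limit:", over_limit)
```

Running a check like this against a sample of real traffic shows whether a field belongs in a flattened type before the mapping grows.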
- Check _cat/thread_pool/search?v for queue depth. If queue > 100, nodes are overloaded. Also check _nodes/hot_threads for CPU contention.
- Run GET _cluster/health?pretty. If unassigned_shards > 0, check GET _cat/shards?h=index,shard,prirep,state,node,unassigned.reason to see why shards are unassigned (e.g., node left, disk full).
- Run GET _cat/shards?v and see if primaries and replicas are in sync. Also verify that the refresh_interval isn't set too high (default 1s, increase only if indexing throughput is critical).
- Run GET _nodes/stats/breaker?pretty. Identify which breaker tripped (field data, request, in-flight). Increase the breaker limit temporarily, but the real fix is reducing memory pressure: reduce field data cache size or limit query complexity.
- For unassigned shards, run GET _cat/shards?h=index,shard,prirep,state,node,unassigned.reason&v and GET _cluster/allocation/explain?pretty, and check cluster.routing.allocation.disk.watermark.low. If a node dropped, reroute shards manually: POST _cluster/reroute?retry_failed=true.
Key takeaways
- match queries do full-text search; term queries do exact matching
Common mistakes to avoid
Using `term` query on `text` field
Map the field as keyword (or use .keyword sub-field) and query with term on that sub-field. For full-text, use match.
Not setting a replica count on critical indices
Set number_of_replicas >= 1 (default is 1). For mission-critical data, use 2 or more replicas. Monitor with GET _cat/indices?v.
Wildcard queries on leading edge of `text` fields
A leading wildcard such as *pattern runs extremely slow because it can't use the inverted index efficiently. CPU spikes and timeouts.
Use an n-gram tokeniser for partial matching. Or use match_phrase_prefix which is optimised. Avoid leading wildcards unless the field is mapped as keyword with a small prefix length.
Over-sharding: too many primary shards
Not monitoring disk thresholds
Set cluster.routing.allocation.disk.watermark.low and high in elasticsearch.yml. Add monitoring alerts at 70% disk usage.
Interview Questions on This Topic
Explain the difference between a `match` query and a `term` query. When would you use each?
A match query analyses the search term (tokenises, lowercases, stems) and searches the analysed field. A term query does not analyse: it looks for the exact token as stored in the inverted index. Use match for full-text search on text fields. Use term for exact matches on keyword fields, IDs, enums, or status values. If you use term on a text field, you'll likely get zero results because the text was tokenised during indexing.
Frequently Asked Questions