ELK Stack Explained: Internals, Pipelines and Production Failures
ELK Stack internals most engineers never learn — inverted indices, Logstash pipelines that stall, and disk watermarks that kill clusters silently.
- ELK Stack = Elasticsearch (search/storage) + Logstash (ingest/transform) + Kibana (visualize)
- Elasticsearch uses inverted indices — term dictionary maps tokens to document IDs for sub-second search
- Each shard is a Lucene instance; primary data size x (1 + replicas) x roughly 1.2 for indexing overhead approximates the disk you actually need
- Logstash pipelines: inputs -> filters -> outputs; Grok is the CPU bottleneck on unstructured logs
- Kibana dashboards should answer one operational question — not display every metric you have
- Disk watermarks: low at 85% stops new shard allocation, high at 90% relocates shards, flood-stage at 95% switches indices to read-only. Monitor at 80% or you will be reacting instead of preventing.
Imagine your entire city's 911 call center receives thousands of calls a day from every neighborhood. Logstash is the operator who answers every call, cleans up the noise, and routes it to the right file. Elasticsearch is the giant filing cabinet that stores every call record in a way that lets you find any detail in milliseconds. Kibana is the big screen on the wall that turns all those records into live charts so the chief can see exactly what's happening across the city right now. The ELK Stack is that whole system — for your software.
Every production system lies. Not intentionally — but without proper observability, your application fails silently, degrades mysteriously, and wakes you at 3am with zero context. Log files exist, but a 400GB flat log file on a server nobody SSHs into anymore is just expensive noise. The ELK Stack transforms that noise into signal: structured, searchable, visualized intelligence about everything your infrastructure is doing, in real time.
The core problem ELK solves is the gap between raw log data and actionable insight. A typical microservices platform produces logs from dozens of services, each in a slightly different format, scattered across hundreds of containers. Correlating a failed payment transaction across an API gateway, an auth service, a Kafka consumer, and a Postgres adapter — without a centralized log aggregation system — is an exercise in madness. ELK gives every log line a home, a shape, and a timeline.
By the end you will understand how Elasticsearch actually indexes and retrieves documents under the hood, how to build Logstash pipelines that handle real-world log formats including multiline stacktraces, how to design Kibana dashboards that answer operational questions rather than just looking impressive in a quarterly review, and exactly where production deployments fall apart and how to prevent it. The incidents in this article are real. The fixes are the ones that actually worked.
What ELK Stack Is and How the Components Connect
ELK is not three tools bolted together. It is a data pipeline with three distinct failure domains, and understanding how data flows between them is what separates engineers who can debug it from engineers who restart services and hope.
Data originates on your hosts and containers. Filebeat — a lightweight Go agent — tails log files and ships events forward. It is stateful: Filebeat maintains a registry file tracking its read position in every file it monitors. If that registry file is corrupted by an unclean shutdown (common on spot instances), Filebeat loses its position and either re-ships everything from the start or skips forward to the current file end, depending on configuration. Always run Filebeat with its registry on a persistent volume and set close_inactive to a sensible value so file handles do not accumulate.
Filebeat ships to Logstash, or — in high-volume environments — to Kafka first. The Kafka buffer is not optional at scale. It absorbs traffic spikes so Logstash does not receive a 10x burst and OOM. It also means a Logstash restart does not lose data — Kafka holds the events until Logstash recovers and resumes consuming from its committed offset. Running Logstash reading directly from files in a high-volume environment is fragile. Add Kafka as the buffer between collection and processing.
Logstash reads from Kafka, applies filters to parse and enrich each event, and writes structured documents to Elasticsearch. The pipeline is: inputs -> filters -> outputs. Inputs run in their own threads and feed a queue; pipeline worker threads pull batches from that queue and run the filter and output stages. The filter stage is where CPU is spent and where most production problems originate.
Elasticsearch receives structured JSON documents, indexes them into an inverted index, and serves search and aggregation queries. Kibana connects to Elasticsearch and renders the results.
The triage order when logs stop flowing is always: Elasticsearch first, then Logstash, then Kibana. Storage failures cascade upstream. A healthy Logstash shipping to a broken Elasticsearch looks, from the outside, identical to a broken Logstash — events simply stop appearing in Kibana. Check ES health before anything else.
In 2026, Elastic also offers OpenTelemetry-native ingestion and the Elastic Agent as a replacement for the Filebeat plus Logstash combination. The Elastic Agent consolidates collection and processing into a single managed binary with central policy control through Fleet. For new deployments, evaluate Elastic Agent rather than defaulting to the classic Filebeat-Logstash split. For existing deployments, migration is straightforward but not mandatory — the classic stack still works and is fully supported.
Beats Family — Filebeat, Metricbeat, Packetbeat, and When to Use Each
The Beats family is Elastic's collection of lightweight data shippers. Each Beat is purpose-built for a specific data type — logs, metrics, network packets — and runs as a single binary with minimal configuration. Understanding which Beat to use for which job prevents the mistake of forcing Filebeat to collect metrics or Metricbeat to tail log files.
Filebeat is the workhorse for log collection. It tails files, follows symlinks, handles rotation, and ships raw log lines to Logstash or directly to Elasticsearch. It maintains a registry — a local file tracking read positions — so a restart does not re-ship the same lines. Filebeat supports multiline aggregation, which is critical for Java stacktraces. Configure multiline in Filebeat rather than Logstash whenever possible to reduce Logstash heap pressure.
Metricbeat collects system and service metrics. It runs modules that know how to talk to specific services: MySQL, PostgreSQL, Redis, Nginx, Kafka, Docker, Kubernetes. Metricbeat pulls metrics from each module on a configurable period. The output is numerical time-series data, not raw log lines. Do not use Filebeat to read /proc/stat; use Metricbeat with the system module.
Packetbeat captures and parses network traffic. It runs as a packet sniffer using libpcap on Linux or Npcap on Windows (WinPcap on older releases), decoding protocols like HTTP, MySQL, PostgreSQL, Redis, Thrift, and DNS. Packetbeat reconstructs full transactions from packets, so it can show you every SQL query or HTTP request/response pair that crosses your network segment.
Auditbeat collects security audit events from your Linux kernel using the Linux Audit Framework. It ships user logins, privilege escalations (sudo), file integrity events (when critical configs change), and process execution logs. Auditbeat is the right tool for compliance auditing (SOC2, PCI-DSS) and security monitoring.
Heartbeat performs uptime monitoring. It pings services (ICMP), connects to TCP ports, or checks HTTP endpoints for expected status codes and response body patterns. Heartbeat sends synthetic check results as documents, which you can alert on for service availability. It is not a log collector; the name refers to the regular check signal it emits.
Winlogbeat captures Windows Event Logs — Application, Security, Setup, System, and forwarded events. If your infrastructure includes Windows servers, Winlogbeat is the only supported way to get Windows Event Logs into Elasticsearch reliably. Do not try to tail raw .evtx files with Filebeat.
How Elasticsearch Actually Indexes Documents — Inverted Indices Under the Hood
Elasticsearch does not search documents. It searches an inverted index — a data structure that maps every unique term to the list of documents that contain it. When you index a document, Elasticsearch tokenizes the text, normalizes case, applies stemming if configured, and writes each token into a term dictionary. The term dictionary points to a postings list: document IDs, term frequency, and position offsets.
This is why Elasticsearch is fast at full-text search. You are not scanning every document. You are looking up a term in a sorted dictionary and getting back a pre-computed list of matching document IDs. BM25 scoring then ranks those matches by term frequency, inverse document frequency, and field length normalization.
Each Elasticsearch shard is an independent Lucene index. Lucene segments are immutable — once written, they never change. New or updated documents go into an in-memory buffer, then get flushed to a new segment on refresh, which defaults to every 1 second. This means there is a 1-second window where a newly indexed document is not yet searchable. If you need sub-second search freshness, the answer is not lowering the refresh interval — it will kill indexing throughput because every refresh triggers segment creation and eventual merges.
Segment merging happens in the background and is a silent performance killer when misconfigured. Too many small segments accumulate when indexing is faster than merging. The merge thread then consumes I/O and CPU, spiking latency for active searches. Monitor segment count per shard with _cat/segments — more than 100 segments per shard is a sign your merge policy needs tuning. For bulk indexing jobs, set refresh_interval to 30s or -1 during the load, force a refresh when done, then restore the interval.
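The segment check described above can be scripted. The sketch below assumes you have already fetched GET _cat/segments?format=json and parsed the response into a list of dicts. The sample rows and the over_threshold helper are illustrative names, not part of any Elasticsearch client.

```python
from collections import Counter

def segments_per_shard(rows):
    """Count segments for each (index, shard, primary-or-replica) copy."""
    return Counter((r["index"], r["shard"], r["prirep"]) for r in rows)

def over_threshold(rows, limit=100):
    """Return the shard copies whose segment count exceeds the limit."""
    counts = segments_per_shard(rows)
    return sorted(key for key, n in counts.items() if n > limit)

# Hypothetical sample: shard 0 has three segments, shard 1 has one.
sample_rows = [
    {"index": "app-logs-2024.01.01", "shard": "0", "prirep": "p", "segment": "_0"},
    {"index": "app-logs-2024.01.01", "shard": "0", "prirep": "p", "segment": "_1"},
    {"index": "app-logs-2024.01.01", "shard": "0", "prirep": "p", "segment": "_2"},
    {"index": "app-logs-2024.01.01", "shard": "1", "prirep": "p", "segment": "_0"},
]

# A low limit is used here only so the small sample trips the check.
noisy = over_threshold(sample_rows, limit=2)
```

With the default limit of 100 the sample reports nothing. In practice you would feed it the full response and alert on any non-empty result.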
Field mapping is where most teams create invisible performance problems. Every field you add increases the inverted index size and slows indexing. Use dynamic: strict in your index templates to reject unexpected fields and define only the fields you actually search or aggregate on. A common mistake is indexing full HTTP request bodies as a single text field, then wondering why searches are slow. Use index: false for fields you store but never query.
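As a rough illustration of the mapping advice, here is an index template body built as a Python dict and serialized to JSON. The index pattern and field names (service, level, message, request_body) are assumptions for the example. The structure follows the composable index template format, with dynamic set to strict and one stored-but-unsearchable field.

```python
import json

# Hypothetical template for app-logs-* indices; adjust fields to your own logs.
index_template = {
    "index_patterns": ["app-logs-*"],
    "template": {
        "settings": {"number_of_shards": 1, "number_of_replicas": 1},
        "mappings": {
            # Reject documents that contain fields not listed below.
            "dynamic": "strict",
            "properties": {
                "@timestamp": {"type": "date"},
                "service": {"type": "keyword"},
                "level": {"type": "keyword"},
                "message": {"type": "text"},
                # Kept in _source for inspection, but never searchable.
                "request_body": {"type": "text", "index": False},
            },
        },
    },
}

body = json.dumps(index_template, indent=2)
```

The resulting body is what you would send with PUT _index_template/<name>. Be aware that strict mapping turns unexpected fields into indexing rejections, so pair it with a dead letter queue.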
High-cardinality keyword fields deserve specific mention. Trace IDs, request IDs, and session tokens as keyword fields create term dictionaries with millions of unique values that cannot fit in RAM. Every search touching those fields forces disk lookups. Either do not index them as keywords, or use a separate index with appropriate settings for correlation lookups rather than mixing them into your main search index.
- Document goes in -> ES tokenizes text into individual terms and writes each to the term dictionary
- Each term gets a postings list: which docs contain it, how often, and at which position
- Search = dictionary lookup + postings list intersection — that is why it is fast on billions of documents
- Segments are immutable; updates create new segments, old ones get merged in the background by the merge thread
- Refresh interval (1s default) controls the trade-off between search freshness and indexing throughput — do not lower it below 1s
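The lookup path summarized above can be shown with a toy inverted index. This is a minimal sketch for intuition only: the tokenizer is a stand-in for a real analyzer, there is no scoring, and none of the function names come from Lucene or Elasticsearch.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on non-alphanumerics (a stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to a postings list of {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def search_all(index, query):
    """Return sorted doc IDs that contain every query term."""
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    if not postings:
        return []
    # Intersecting postings lists replaces scanning every document.
    return sorted(set.intersection(*postings))

docs = {
    1: "Payment failed: card declined",
    2: "Payment accepted",
    3: "Auth failed for user 42",
}
index = build_index(docs)
```

Searching for "payment failed" touches two dictionary entries and intersects two small sets, no matter how many documents exist that contain neither term.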
Logstash Pipelines — Ingest, Transform, Ship and Where They Break
Logstash receives raw data from inputs, transforms it through filters, and ships structured events to outputs. The pipeline is linear (input -> filter -> output), but the stages do not each get their own thread pool: inputs run in their own threads and push events onto a queue, while pipeline worker threads pull batches from that queue and run both the filter and output stages. The number of worker threads is controlled by pipeline.workers, which defaults to the number of available CPU cores. Understanding the threading model is the first step to understanding why pipelines stall.
The Grok filter is where most Logstash performance problems originate. Grok combines regular expressions with named capture groups to extract structured fields from unstructured text. A pattern like %{COMBINEDAPACHELOG} expands to a 200-character-plus regex. When your log format does not match the pattern, Grok tries every alternative before failing. In a pipeline processing 10,000 events per second with a 5% failure rate, that is 500 events per second that each run the full list of alternatives and still come out unparsed. Always add a catch-all pattern as the last alternative: %{GREEDYDATA:log_message}. It ensures events flow through even on mismatch, and you can tag those events for visibility rather than silently losing them.
Multiline event handling is the second major trap. Java stacktraces and Python tracebacks span multiple lines. Logstash's multiline codec aggregates them into a single event by buffering pending lines in JVM heap. A burst of stacktraces from a crashing service — which is exactly when you most need your logs — can spike heap usage from 500MB to 3GB in under a minute. Set -Xmx to at least 4GB when using multiline in production. Reduce max_lines to a realistic ceiling (200 is usually enough for stacktraces) so a runaway exception chain cannot consume unlimited heap.
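To make the buffering behaviour concrete, here is a simplified sketch of multiline aggregation with a line cap. It is not the Logstash codec: the continuation rule (leading whitespace or "Caused by:") and the aggregate function are assumptions chosen for Java-style stacktraces.

```python
import re

# Assumed rule: indented lines and "Caused by:" lines continue the prior event.
CONTINUATION = re.compile(r"^(\s+|Caused by:)")

def aggregate(lines, max_lines=200):
    """Group continuation lines into events, flushing when max_lines is hit."""
    events, current = [], []
    for line in lines:
        continues = bool(current) and bool(CONTINUATION.match(line))
        if continues and len(current) < max_lines:
            current.append(line)
            continue
        # Either a new event starts here or the cap forced a flush.
        if current:
            events.append("\n".join(current))
        current = [line]
    if current:
        events.append("\n".join(current))
    return events

lines = [
    "ERROR payment failed",
    "  at com.example.Pay.charge(Pay.java:42)",
    "Caused by: java.net.SocketTimeoutException",
    "INFO retry scheduled",
]
events = aggregate(lines)
capped = aggregate(lines, max_lines=2)
```

The cap bounds how much a single event can hold in memory. The price is that an oversized stacktrace is split across several events.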
Dead letter queues are the safety net that most teams skip and regret. By default, Logstash silently drops documents that Elasticsearch rejects: mapping conflicts, disk blocks, field limit breaches. Enable the dead letter queue in logstash.yml with dead_letter_queue.enable: true and dead_letter_queue.max_bytes: 1024mb. The DLQ is a local directory on the Logstash host, not an Elasticsearch index. Inspect it at the path configured by path.dead_letter_queue (default: /var/lib/logstash/dead_letter_queue). Use the dead_letter_queue input plugin to replay rejected events after fixing the root cause.
The pipeline.ordered setting deserves explicit mention. By default it is set to auto, which enables ordered processing when pipeline.workers is 1 and disables it otherwise. Set pipeline.ordered: false explicitly when event ordering between inputs does not matter — it allows workers to process events without coordination overhead, improving throughput at the cost of delivery order guarantees. For log pipelines where Elasticsearch timestamps handle ordering at query time, this is almost always the right call.
Pipeline workers and batch size interact directly with throughput. For a 16-core machine, start with 8 workers and a batch size of 250. Increasing batch size improves throughput by amortizing per-batch overhead but increases per-event latency and heap usage. A batch that fills with slow-to-process multiline events holds the worker thread for longer, starving other events. Benchmark with realistic load before committing to any setting.
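A quick way to reason about these settings is the number of events held in flight, which is the product of pipeline.workers and pipeline.batch.size. The sketch below is back-of-envelope arithmetic only. The average event size is an assumption you would measure yourself, and real heap usage includes much more than raw event bytes.

```python
def inflight_events(workers, batch_size):
    """Events held in worker batches at once: workers x batch size."""
    return workers * batch_size

def inflight_payload_mb(workers, batch_size, avg_event_kb):
    """Rough payload held in flight, in MB, for an assumed average event size."""
    return inflight_events(workers, batch_size) * avg_event_kb / 1024

# Starting point from the text: 8 workers, batch size 250.
events_in_flight = inflight_events(8, 250)
# Hypothetical 50 KB average for stacktrace-heavy events.
payload_mb = inflight_payload_mb(8, 250, 50)
```

Doubling the batch size doubles both numbers, which is why large multiline events and large batches are a risky combination.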
- Every unmatched Grok pattern tries all alternatives before giving up — this is O(alternatives) CPU per failed event
- A 5% Grok failure rate on a 10K events/sec pipeline means 500 events per second each run every alternative for nothing
- Without a catch-all pattern as the last alternative, unmatched events are tagged _grokparsefailure and may be dropped depending on your output configuration
- Always include %{GREEDYDATA:log_message} as your last alternative — it costs nothing and prevents silent data loss
- Test patterns against 50 real log lines using the Grok Debugger in Kibana Dev Tools before deploying
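The ordered-alternatives-plus-catch-all idea can be mimicked with plain regular expressions. This is a sketch of the control flow, not of Grok itself: the two log formats and the _unparsed tag are invented for the example.

```python
import re

# Specific patterns are tried in order, most common format first.
PATTERNS = [
    re.compile(
        r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) "
        r"\[(?P<service>[^\]]+)\] (?P<log_message>.*)$"
    ),
    re.compile(r"^(?P<level>[A-Z]+): (?P<log_message>.*)$"),
]
# Plays the role of %{GREEDYDATA:log_message}: it always matches.
CATCH_ALL = re.compile(r"^(?P<log_message>.*)$", re.DOTALL)

def parse(line):
    """Return extracted fields; unparsed lines are kept and tagged."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            return {**match.groupdict(), "tags": []}
    fields = CATCH_ALL.match(line).groupdict()
    fields["tags"] = ["_unparsed"]  # hypothetical tag name for visibility
    return fields

structured = parse("2024-01-01T00:00:00Z ERROR [payment-api] card declined")
fallback = parse("some line in an unknown format")
```

Every line produces an event. The tag gives you something to count and alert on when a service changes its log format.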
Logstash Filter Cheat Sheet — Grok, Date, Mutate, GeoIP
Logstash filters transform raw events into structured documents before they reach Elasticsearch. The four most frequently used filters in production pipelines are Grok, Date, Mutate, and GeoIP. Having a scannable reference makes pipeline debugging faster and reduces the guesswork when logs show up with missing fields or wrong timestamps.
Grok extracts structured fields from unstructured text using pattern matching. Built-in patterns cover common formats — %{COMBINEDAPACHELOG}, %{TIMESTAMP_ISO8601}, %{LOGLEVEL}. For custom formats, compose smaller patterns. Always add a catch-all as the last alternative to prevent dropped events.
Date parses timestamp strings from your logs into the @timestamp field. If you skip this, @timestamp holds the time Logstash received the event rather than the time it happened, making log order unreliable. The match parameter takes the source field followed by one or more format strings to try in order.
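The try-formats-in-order behaviour looks roughly like this in Python. It is an analogy rather than the filter's implementation: the format list is an assumption, and Python's strptime directives differ from the Joda-style patterns Logstash uses.

```python
from datetime import datetime, timezone

# Assumed formats: ISO 8601 with and without fractions, then Apache style.
FORMATS = [
    "%Y-%m-%dT%H:%M:%S.%f%z",
    "%Y-%m-%dT%H:%M:%S%z",
    "%d/%b/%Y:%H:%M:%S %z",
]

def parse_timestamp(raw, received_at):
    """Return (timestamp, tags); fall back to receive time on failure."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
            return parsed.astimezone(timezone.utc), []
        except ValueError:
            continue
    return received_at, ["_dateparsefailure"]

received = datetime(2024, 1, 1, tzinfo=timezone.utc)
apache_ts, apache_tags = parse_timestamp("10/Oct/2023:13:55:36 +0000", received)
bad_ts, bad_tags = parse_timestamp("yesterday-ish", received)
```

The fallback branch shows why unparsed timestamps are dangerous: the event still gets a time, just not the right one.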
Mutate modifies field values and structures — renaming, copying, converting types, removing fields, and adding static strings. Use it to normalize field names across services (e.g., renaming customer_email to email) or to add pipeline metadata.
GeoIP enriches events with geographical location data from an IP address. It adds fields like geoip.country_code2, geoip.city_name, and geoip.location. It only resolves public IPs; private ranges such as 10.0.0.0/8 return no location. On high-volume pipelines, apply it only to events that actually carry a client IP, because every lookup adds per-event overhead.
Grok failure debugging is where most teams waste time. When your pattern does not match, Logstash adds a _grokparsefailure tag to the event. Check for these tags in Kibana. Use the Grok Debugger in Kibana Dev Tools to test patterns against actual log lines before deploying.
- Run logstash --config.test_and_exit before reloading
- Monitor _grokparsefailure tags in Kibana after deployment
Kibana Dashboards That Actually Answer Questions
Most Kibana dashboards are digital art. They look impressive in a demo and answer nothing during an incident. A dashboard with 47 panels showing every possible metric is a distraction when you are trying to figure out why payments are failing at 2am.
The right approach is to start with the question, then build the visualization. 'Which services are returning 5xx errors in the last 15 minutes?' needs one metric — error count — one dimension — service name — and one filter — status code 500 or above, time range last 15 minutes. That is a single data table, not a 12-panel dashboard. The panel answers the question. Everything else is friction.
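For reference, the question above maps to a small Elasticsearch request: two range filters and one terms aggregation. The field names status_code and service are assumptions about your mapping (service is assumed to be a keyword field), and the dict is only a sketch of the body a data table panel would send.

```python
import json

# One metric (count), one dimension (service), one filter (5xx, last 15 min).
errors_by_service = {
    "size": 0,  # only the aggregation is needed, not the raw hits
    "query": {
        "bool": {
            "filter": [
                {"range": {"status_code": {"gte": 500}}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    },
    "aggs": {
        "by_service": {"terms": {"field": "service", "size": 20}}
    },
}

request_body = json.dumps(errors_by_service)
```

If a panel needs a much larger request than this to answer its question, that is a hint the question is too broad.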
Kibana's index patterns are the second major gotcha. When you create an index pattern like app-logs-*, Kibana fetches field mappings from every matching index. If you have 90 days of daily indices with dynamic mapping enabled, you can easily have 3,000-plus fields — especially when different services log different JSON structures that Elasticsearch ingests as separate mapped fields. Every time someone opens Discover or creates a visualization, Kibana loads all those field definitions. That is why your dashboard takes 20 seconds to load. The fix is upstream: use dynamic: strict in your index template and define only the fields you actually query.
For real incident response, saved searches outperform visualizations. They load faster because they do not aggregate — they list raw log lines with column selections. Pin a few key saved searches at the top of your Kibana navigation: 5xx errors, slow queries over 5 seconds, auth failures. When something breaks, open the relevant saved search rather than waiting for a complex dashboard to render aggregations across 90 days of data.
Kibana Query Language over Lucene syntax is worth the initial learning curve. KQL is more readable, less error-prone when written quickly under pressure, and better supported in autocomplete. Train the whole team on a handful of patterns — field:value, field:* wildcards, AND/OR combinations, range queries with > and < — and you will have faster incident investigation.
Time series visualizations with a large time range are a silent performance killer. Querying 30 days of data with a 1-minute bucket interval generates 43,200 buckets. Elasticsearch computes all of them. Use auto-interval on date histograms for overview dashboards and a fixed short interval only when drilling into a specific incident window. The inspect button on any Kibana panel shows the raw Elasticsearch query being executed — this is invaluable for understanding why a dashboard is slow or showing unexpected data.
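The bucket arithmetic is easy to check before you build a panel. A minimal sketch, assuming whole days and whole-minute intervals:

```python
def bucket_count(range_days, interval_minutes):
    """Number of date histogram buckets for a time range and fixed interval."""
    return range_days * 24 * 60 // interval_minutes

month_at_one_minute = bucket_count(30, 1)
month_at_one_hour = bucket_count(30, 60)
incident_window = bucket_count(1, 1)
```

The same 30-day range at a 1-hour interval needs 720 buckets instead of 43,200, which is the kind of reduction auto-interval makes for you.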
- Start with the question, then pick the visualization — not the other way around
- One dashboard per operational scenario: incident triage, capacity planning, deploy validation
- If a panel does not change your next action, it is visual noise — remove it
- Saved searches load faster than visualizations and show raw log lines — use them for active incident investigation
- The Kibana inspect button shows the raw ES query — use it to debug slow panels and unexpected results
Kibana Query Language (KQL) vs Lucene — Syntax Comparison
Kibana gives you two query language options: Kibana Query Language (KQL) and Lucene. KQL is the default in modern Kibana versions (7.0+) and is the recommended choice for most users. Lucene is the legacy syntax that powers Elasticsearch's underlying query parser. Knowing both is useful for debugging, but KQL should be your daily driver.
KQL is designed for discoverability and error resistance. It provides autocomplete suggestions, syntax highlighting, and immediate error feedback. Invalid KQL does not fail quietly: Kibana tells you where the syntax breaks. KQL supports nested fields, existence checks, and range queries with a cleaner syntax. Use KQL for all ad-hoc exploration and dashboard panels.
Lucene is more powerful but more dangerous. It supports regex, fuzzy queries, and proximity searches that KQL does not. The trade-off is that Lucene does not prevent you from writing queries that are syntactically valid but semantically wrong. A misplaced parenthesis can change the entire query meaning without an error message. Reserve Lucene for advanced use cases where KQL falls short, and always test Lucene queries in Dev Tools before adding them to dashboards.
The field:value syntax is identical in both languages. Wildcards work in both — service:payment* matches payment-api, payment-processor, payment-service. The differences appear in ranges, existence checks, and complex Boolean logic.
| Operation | KQL |
| --- | --- |
| Field match | status_code: 500 |
| Prefix wildcard | service: payment* |
| Range | duration_ms >= 5000 |
| AND | level: ERROR AND service: payment-api |
| OR | level: ERROR OR level: WARN (uppercase by convention) |
| NOT | NOT level: DEBUG |
| Exists | trace_id: * |
| Date math | @timestamp >= now-15m |
| Grouping | (status_code >= 500 OR level: ERROR) AND service: payment* |
A common trap: the filter level:WARN OR ERROR does not restrict ERROR to the level field. Only WARN is bound to level; ERROR becomes an unscoped search term, so the results are not the WARN-plus-ERROR set you expected. Writing level: WARN OR level: ERROR (or level: (WARN OR ERROR)) fixes it. Rule: write the operators AND, OR, and NOT in uppercase. Lucene syntax requires it and KQL accepts either case, so uppercase works in both. Field names are case-sensitive as indexed.
Shard Strategy and Capacity Planning — The Decisions That Haunt You Later
Shard count is the most consequential decision in an Elasticsearch deployment and it is almost always wrong on the first try. Too many shards and your cluster spends more time managing shard metadata than indexing data. Too few and you cannot distribute load or recover from node failures in a reasonable time window.
Here is the math most people skip. Each shard is a Lucene instance with its own heap overhead, roughly 1MB per shard for metadata plus per-segment structures. A cluster with 5,000 shards burns around 5GB of heap just on shard bookkeeping before indexing a single document. Elasticsearch's default limit (cluster.max_shards_per_node) is 1,000 shards per data node, but the practical ceiling is closer to 20 shards per GB of heap on data nodes. A node with 30GB of heap can handle around 600 shards before GC pressure from shard metadata starts affecting search latency.
The target shard size for log workloads is 10GB to 50GB. Below 10GB you are paying overhead on shards too small to benefit from parallelism. Above 50GB, shard recovery after a node failure requires copying the entire shard to a replacement node — a 100GB shard on a 1Gbps network takes around 13 minutes to recover during which that data has one fewer replica. For a daily index receiving 30GB of logs, one primary shard is fine. For 150GB per day, use 5 primary shards.
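The sizing rules in this section reduce to a few lines of arithmetic. The helpers below are a sketch with assumed names. They use decimal gigabytes and raw link speed, so real recovery will be slower once protocol overhead and recovery throttling are included.

```python
import math

def primary_shards(daily_gb, target_shard_gb=30):
    """Primaries needed so each shard lands near the target size."""
    return max(1, math.ceil(daily_gb / target_shard_gb))

def transfer_minutes(shard_gb, link_gbps):
    """Best-case minutes to copy a shard: gigabytes x 8 bits / link speed."""
    return shard_gb * 8 / link_gbps / 60

def shard_budget(heap_gb, shards_per_heap_gb=20):
    """Practical shard ceiling for a data node at 20 shards per GB of heap."""
    return heap_gb * shards_per_heap_gb

small_index = primary_shards(30)
large_index = primary_shards(150)
recovery_1g = transfer_minutes(100, 1)
node_budget = shard_budget(30)
```

These reproduce the figures in the text: 1 primary for 30GB per day, 5 primaries for 150GB per day, about 13 minutes to move a 100GB shard over 1Gbps, and a budget of 600 shards on a 30GB heap.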
You cannot change the number of primary shards on an existing index without reindexing. This is one of the most painful lessons in Elasticsearch operations and it is avoidable if you plan before the first document arrives. Use index templates with the correct shard count set before the cluster receives any data. If you get it wrong, reindex is the only path forward — which is a significant operational event.
Replica shards serve two purposes: fault tolerance and read parallelism. For logs, one replica is usually enough: it doubles disk usage but protects against single node failure. If you have 3 data nodes, each shard has its primary copy on one node and its replica on another, giving you tolerance for a single node going down. With 2 replicas across 3 nodes, you have triple the disk usage but every node holds a full copy, so any 2 data nodes can fail and the survivor can still serve reads. Choose based on your actual availability requirements, not aspirational ones.
Post-node-restart shard allocation can itself become a cluster bottleneck when shard counts are high. The elected master node handles all allocation decisions, and thousands of pending allocations after a node restart can keep the cluster in a recovering state for much longer than the actual data copy time. Query _cluster/allocation/explain when shards are not allocating. It gives the specific reason, from disk watermark to node attribute filtering to replica placement rules. This API saves hours of guessing.
Index Lifecycle Management — Automate Retention Before It Bites You
Without ILM, your ELK stack suffocates under its own data. Daily indices accumulate, disk fills, and someone is running curl commands at 2am to delete old indices. Index Lifecycle Management automates this: you define policies that transition indices through hot, warm, cold, and delete phases based on age, size, or document count. ILM is not optional infrastructure. It is the difference between a cluster that manages itself and one that requires constant manual intervention.
Hot phase: indices are actively written and frequently searched. Keep this short — 1 to 3 days for most log workloads. Use fast NVMe SSDs. This is your most expensive storage tier, and every day you keep an index here costs more than it should.
Warm phase: no more writes, still searchable but with lower urgency. Force-merge to 1 segment: this consolidates all the small segments from active indexing into one, reducing heap overhead and improving scan performance. If you also shrink the index in this phase, the shrink operation needs the index to be read-only and a copy of every shard on a single node. The ILM shrink action arranges both, but it can stall when no node has room for all the shards, and a stalled warm phase is easy to miss. Check GET <index>/_ilm/explain when an index stays in a phase longer than expected.
Cold phase: read-only, reduced replica count, optionally migrated to slower storage using data tiers. For compliance-driven retention requirements, this phase can extend to months or years.
Delete phase: remove indices after the retention period. Set this based on your data retention policy and test it with a very short window on a development cluster — 1 hour delete — before using real durations in production.
ILM rollover is the mechanism that prevents any single index from growing beyond your target shard size. When an index exceeds max_size or max_age, ILM rolls over to a new index with the same template settings. This keeps shard sizes predictable and recovery times bounded. Set rollover on both a size trigger and an age trigger — whichever fires first — so that low-volume days still rotate on schedule.
A critical operational detail: ILM policies attached to an index template apply to new indices only. Indices created before the policy was attached require explicit opt-in via the PUT /index/_settings API. And always test your ILM policy on a development cluster before applying it to production. A misconfigured delete phase that fires too early is an availability incident, not a configuration mistake.
- Hot: fast writes, fast searches, expensive NVMe storage — keep indices here for 1 to 3 days maximum
- Warm: no writes, acceptable search speed, force-merge to 1 segment reduces overhead — 3 to 30 days
- Cold: read-only, minimal replicas, cheap storage — for compliance or rare lookups
- Delete: remove when retention policy says it is gone — test with a short window first
- Rollover on both size and age — prevents any single index from growing beyond your target shard size
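Put together, the four phases look roughly like the policy body below, written as a Python dict so it can be serialized and sent with PUT _ilm/policy/<name>. The durations and sizes are placeholders taken from the ranges discussed above, not recommendations. Note that min_age in later phases is measured from rollover.

```python
import json

# Illustrative values only; set ages from your own retention policy.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Whichever condition is met first triggers rollover.
                    "rollover": {"max_size": "50gb", "max_age": "1d"}
                }
            },
            "warm": {
                "min_age": "3d",
                "actions": {"forcemerge": {"max_num_segments": 1}},
            },
            "cold": {
                "min_age": "30d",
                "actions": {"readonly": {}},
            },
            "delete": {
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        }
    }
}

policy_body = json.dumps(ilm_policy, indent=2)
```

For a development-cluster dry run, shrink every min_age to minutes and confirm that each phase fires before restoring real durations.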
Cluster Sizing and Hardware Selection — CPU, RAM, Disk Trade-offs
Choosing hardware for Elasticsearch is a trade-off between three resources that compete against each other. The right mix depends entirely on your workload profile — and the profile changes as your data grows.
RAM is the highest-leverage resource. Elasticsearch uses the OS filesystem cache as aggressively as the JVM heap. If your hot index fits in the filesystem cache, searches are nearly instant. If it does not, every search requires disk reads. The practical rule: allocate 50% of node RAM to the JVM heap, leave the rest for the OS. A 64GB node gets 31GB of heap for Elasticsearch and 33GB for the OS cache. Do not exceed 31GB of heap on a single node — above this threshold, the JVM switches from 4-byte compressed object pointers to 8-byte uncompressed ones, which doubles reference size and increases GC pressure. On Elasticsearch 8.x running JDK 21 with generational ZGC, the GC characteristics improve significantly over older G1GC configurations, but the 31GB ceiling on heap remains the safe practical limit. If you need more memory, add nodes rather than increasing per-node heap.
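The heap rule is simple enough to encode directly. A minimal sketch of the 50 percent rule with the 31GB ceiling described above; the function name is an assumption.

```python
def recommended_heap_gb(node_ram_gb, ceiling_gb=31):
    """Half of node RAM, capped to stay under the compressed-pointer limit."""
    return min(node_ram_gb // 2, ceiling_gb)

heap_64 = recommended_heap_gb(64)
heap_32 = recommended_heap_gb(32)
heap_128 = recommended_heap_gb(128)
```

On a 128GB node the heap still stops at 31GB. The remaining memory is not wasted, because the OS filesystem cache uses it.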
CPU matters most for indexing throughput and complex aggregations. Logstash with heavy Grok patterns can saturate CPU before Elasticsearch does. For data nodes, modern CPUs with high single-threaded clock speeds (3.5GHz or above) benefit search latency on sequential segment scans. Higher core counts improve bulk indexing throughput but have diminishing returns beyond 16 to 20 cores per node for typical log workloads.
Disk is where most teams underprovision. NVMe SSDs are non-negotiable for hot phase data nodes — the random I/O pattern that Elasticsearch generates during segment merges and concurrent searches will saturate spinning disks and cause indexing pauses. For warm and cold phases, SATA SSDs provide acceptable throughput at lower cost. In AWS, gp3 EBS volumes provide 3,000 IOPS baseline with 16,000 IOPS available at lower cost than gp2 — use gp3 for data nodes. io2 is rarely justified for log workloads.
Network is frequently the bottleneck that nobody planned for. Elasticsearch shuffles large amounts of data during shard recovery, rebalancing, and snapshot creation. On a 3-node cluster recovering a 200GB shard after a node replacement, 10GbE networking copies the shard in roughly 3 minutes. On 1GbE, that is 27 minutes during which the data has no replica. Use 10GbE or better between data nodes. In cloud environments, verify that your instance type provides dedicated network bandwidth — burstable instance types that share network bandwidth will throttle under sustained replication load.
Master nodes deserve dedicated resources. Three master-eligible nodes minimum for quorum. Give them 8GB RAM and 4 CPU cores — they manage cluster state, not data, so resource requirements are modest. Never run master-eligible and data roles on the same node in production. A GC pause on a data node caused by heavy indexing can delay master heartbeats, trigger unnecessary master elections, and destabilize the cluster at the worst possible moment. Keep the roles separated.
Elasticsearch Goes Read-Only on Black Friday — 4 Hours of Lost Logs
- Elasticsearch has three watermarks: 85% low (no new shards), 90% high (relocate shards), 95% flood-stage (read-only). It is the flood-stage that silently kills writes. Most monitoring setups watch the wrong threshold.
- Logstash dead letter queue is disabled by default. Without it, every document Elasticsearch rejects — for any reason — vanishes with no record. Enable it before the first log ever ships to production.
- Monitor disk usage with a hard alert at 80%. By the time you hit 95% and the flood-stage fires, you have no room to maneuver. At 80% you still have time to delete old indices, add nodes, or expand volumes before anything breaks.
- Black Friday, end-of-quarter, and any traffic spike compound log volume in non-linear ways. Capacity planning based on average daily ingest will fail on peak days. Calculate for 5-10x peak.
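The thresholds from this incident can be captured in a small classifier for a monitoring script. This sketch assumes the default watermark percentages listed above plus the 80% early alert. The status labels are invented for the example.

```python
def disk_status(used_pct):
    """Classify disk usage against the default Elasticsearch watermarks."""
    if used_pct >= 95:
        return "flood_stage"  # indices switch to read-only
    if used_pct >= 90:
        return "high"         # shards are relocated off the node
    if used_pct >= 85:
        return "low"          # no new shards are allocated to the node
    if used_pct >= 80:
        return "alert"        # early warning, before any watermark
    return "ok"

statuses = [disk_status(p) for p in (50, 80, 86, 92, 95)]
```

Page on "alert", not on "flood_stage". By the time the last label appears, writes are already being rejected.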
Key takeaways
Common mistakes to avoid
Creating too many small shards by defaulting to the same shard count for every index
Using Filebeat for metrics collection when Metricbeat is the correct tool
Grok patterns that silently discard logs on parse failure
Not enabling Logstash dead letter queue before production
Using dynamic mapping on high-cardinality log data from multiple services
Lowering refresh_interval below 1 second for faster search
Not configuring ILM and relying on manual index cleanup
Setting Elasticsearch JVM heap above 31GB
Running Kibana on the same node as Elasticsearch data nodes
Leaving the second value unscoped in a KQL OR filter
level:WARN OR ERROR does not return the WARN-plus-ERROR set. The query is syntactically valid but semantically wrong: ERROR is not bound to the level field, so it is treated as an unscoped search term. Write level: WARN OR level: ERROR or level: (WARN OR ERROR), and keep AND, OR, and NOT in uppercase so the same habit works in Lucene syntax, which requires it. Field names are case-sensitive as indexed: Level: ERROR will not match level: ERROR.
Interview Questions on This Topic
How does Elasticsearch's inverted index work, and why is it faster than scanning every document?
Frequently Asked Questions