Junior 20 min · March 06, 2026

ELK Stack Explained: Internals, Pipelines and Production Failures

ELK Stack internals most engineers never learn — inverted indices, Logstash pipelines that stall, and disk watermarks that kill clusters silently.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • ELK Stack = Elasticsearch (search/storage) + Logstash (ingest/transform) + Kibana (visualize)
  • Elasticsearch uses inverted indices — term dictionary maps tokens to document IDs for sub-second search
  • Each shard is a Lucene index; disk footprint is roughly raw data size x (1 + replicas) x 1.2, and shard count splits that data rather than multiplying it
  • Logstash pipelines: inputs -> filters -> outputs; Grok is the CPU bottleneck on unstructured logs
  • Kibana dashboards should answer one operational question — not display every metric you have
  • Disk watermarks: low at 85% stops new shard allocation, high at 90% relocates shards, flood-stage at 95% switches indices to read-only. Monitor at 80% or you will be reacting instead of preventing.
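The watermark thresholds in the last bullet can be sketched as a simple lookup. This is a minimal illustration that assumes the default percentages and the 80% alert line from this list; the function name and return labels are hypothetical, not part of any Elasticsearch API.

```python
def watermark_stage(disk_used_pct: float) -> str:
    """Classify a data node's disk usage against the default watermarks."""
    if disk_used_pct >= 95:
        return "flood_stage"  # indices with shards on this node go read-only
    if disk_used_pct >= 90:
        return "high"         # shards are relocated off this node
    if disk_used_pct >= 85:
        return "low"          # no new shards are allocated to this node
    if disk_used_pct >= 80:
        return "alert"        # assumed early-warning line, not an ES setting
    return "ok"


for pct in (72, 81, 86, 91, 96):
    print(f"{pct}% -> {watermark_stage(pct)}")
```

Feed it the disk.percent column from _cat/allocation and page someone on "alert", not on "flood_stage".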
Plain-English First

Imagine your entire city's 911 call center receives thousands of calls a day from every neighborhood. Logstash is the operator who answers every call, cleans up the noise, and routes it to the right file. Elasticsearch is the giant filing cabinet that stores every call record in a way that lets you find any detail in milliseconds. Kibana is the big screen on the wall that turns all those records into live charts so the chief can see exactly what's happening across the city right now. The ELK Stack is that whole system — for your software.

Every production system lies. Not intentionally — but without proper observability, your application fails silently, degrades mysteriously, and wakes you at 3am with zero context. Log files exist, but a 400GB flat log file on a server nobody SSHs into anymore is just expensive noise. The ELK Stack transforms that noise into signal: structured, searchable, visualized intelligence about everything your infrastructure is doing, in real time.

The core problem ELK solves is the gap between raw log data and actionable insight. A typical microservices platform produces logs from dozens of services, each in a slightly different format, scattered across hundreds of containers. Correlating a failed payment transaction across an API gateway, an auth service, a Kafka consumer, and a Postgres adapter — without a centralized log aggregation system — is an exercise in madness. ELK gives every log line a home, a shape, and a timeline.

By the end you will understand how Elasticsearch actually indexes and retrieves documents under the hood, how to build Logstash pipelines that handle real-world log formats including multiline stacktraces, how to design Kibana dashboards that answer operational questions rather than just looking impressive in a quarterly review, and exactly where production deployments fall apart and how to prevent it. The incidents in this article are real. The fixes are the ones that actually worked.

What ELK Stack Is and How the Components Connect

ELK is not three tools bolted together. It is a data pipeline with three distinct failure domains, and understanding how data flows between them is what separates engineers who can debug it from engineers who restart services and hope.

Data originates on your hosts and containers. Filebeat — a lightweight Go agent — tails log files and ships events forward. It is stateful: Filebeat maintains a registry file tracking its read position in every file it monitors. If that registry file is corrupted by an unclean shutdown (common on spot instances), Filebeat loses its position and either re-ships everything from the start or skips forward to the current file end, depending on configuration. Always run Filebeat with its registry on a persistent volume and set close_inactive to a sensible value so file handles do not accumulate.
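To make the registry idea concrete, here is a toy model of offset tracking. It is a sketch only: the JSON layout, file names, and the read_new_lines helper are assumptions for illustration, and the real Filebeat registry is more involved (it also tracks file identity across rotation).

```python
import json
import os
import tempfile


def read_new_lines(log_path: str, registry_path: str) -> list:
    """Return lines added since the last call and persist the new offset."""
    registry = {}
    if os.path.exists(registry_path):
        with open(registry_path) as fh:
            registry = json.load(fh)

    with open(log_path, "rb") as fh:
        fh.seek(registry.get(log_path, 0))  # unknown file: start from byte 0
        data = fh.read()
        registry[log_path] = fh.tell()

    # Write to a temp file, then rename, so a crash cannot leave a half-written registry.
    tmp_path = registry_path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump(registry, fh)
    os.replace(tmp_path, registry_path)
    return data.decode("utf-8").splitlines()


workdir = tempfile.mkdtemp()
log_file = os.path.join(workdir, "app.log")
registry_file = os.path.join(workdir, "registry.json")

with open(log_file, "w") as fh:
    fh.write("line 1\nline 2\n")
first = read_new_lines(log_file, registry_file)    # both lines

with open(log_file, "a") as fh:
    fh.write("line 3\n")
second = read_new_lines(log_file, registry_file)   # only the new line

os.remove(registry_file)                           # simulate a lost registry
third = read_new_lines(log_file, registry_file)    # everything is shipped again
```

The last call is the failure mode described above: with no registry, the reader has no position and starts over, which is why the registry belongs on a persistent volume.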

Filebeat ships to Logstash, or — in high-volume environments — to Kafka first. The Kafka buffer is not optional at scale. It absorbs traffic spikes so Logstash does not receive a 10x burst and OOM. It also means a Logstash restart does not lose data — Kafka holds the events until Logstash recovers and resumes consuming from its committed offset. Running Logstash reading directly from files in a high-volume environment is fragile. Add Kafka as the buffer between collection and processing.
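The buffering argument is easy to see in a toy simulation. This sketch assumes a consumer that drains a fixed number of events per tick; the numbers and the backlog_over_time name are illustrative and do not model Kafka itself.

```python
def backlog_over_time(arrivals, drain_per_tick):
    """Return the buffered event count after each tick."""
    backlog, history = 0, []
    for arrived in arrivals:
        backlog += arrived
        backlog -= min(backlog, drain_per_tick)  # the consumer never exceeds its own rate
        history.append(backlog)
    return history


arrivals = [100, 100, 1000, 100, 100, 100]  # one 10x burst in the third tick
history = backlog_over_time(arrivals, drain_per_tick=200)
print(history)  # [0, 0, 800, 700, 600, 500]
```

The burst turns into backlog that drains over the following ticks. The consumer keeps working at its normal rate instead of taking the full spike at once.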

Logstash reads from Kafka, applies filters to parse and enrich each event, and writes structured documents to Elasticsearch. The pipeline is: inputs -> filters -> outputs. Each input runs in its own thread and feeds a queue; a pool of pipeline worker threads then pulls batches from that queue and runs the filters and outputs for each batch. The filter stage is where CPU is spent and where most production problems originate.

Elasticsearch receives structured JSON documents, indexes them into an inverted index, and serves search and aggregation queries. Kibana connects to Elasticsearch and renders the results.

The triage order when logs stop flowing is always: Elasticsearch first, then Logstash, then Kibana. Storage failures cascade upstream. A healthy Logstash shipping to a broken Elasticsearch looks, from the outside, identical to a broken Logstash — events simply stop appearing in Kibana. Check ES health before anything else.
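The triage order can be written down as a small decision function. This is a sketch under assumptions: es_health is the parsed JSON from GET _cluster/health, and the two Logstash arguments are parsed samples of GET _node/stats/pipelines taken a few seconds apart. The function name is hypothetical.

```python
def first_failing_domain(es_health: dict, ls_before: dict, ls_after: dict) -> str:
    """Return the first domain to investigate, in the triage order above."""
    # 1. Elasticsearch: a red cluster (or no status at all) explains missing data on its own.
    if es_health.get("status") not in ("green", "yellow"):
        return "elasticsearch"

    # 2. Logstash: at least one pipeline must have emitted events between the samples.
    def events_out(stats):
        return {pid: p["events"]["out"] for pid, p in stats.get("pipelines", {}).items()}

    before, after = events_out(ls_before), events_out(ls_after)
    if not any(after[pid] > before.get(pid, 0) for pid in after):
        return "logstash"

    # 3. Both upstream domains look alive, so check Kibana last.
    return "kibana"


sample_1 = {"pipelines": {"main": {"events": {"in": 10, "out": 10}}}}
sample_2 = {"pipelines": {"main": {"events": {"in": 20, "out": 10}}}}  # input grows, output stalls
print(first_failing_domain({"status": "green"}, sample_1, sample_2))
```

A green or yellow status does not rule out write blocks or rejections, so treat a "logstash" verdict as a prompt to read the Logstash logs, which will name the Elasticsearch error if there is one.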

In 2026, Elastic also offers OpenTelemetry-native ingestion and the Elastic Agent as a replacement for the Filebeat plus Logstash combination. The Elastic Agent consolidates collection and processing into a single managed binary with central policy control through Fleet. For new deployments, evaluate Elastic Agent rather than defaulting to the classic Filebeat-Logstash split. For existing deployments, migration is straightforward but not mandatory — the classic stack still works and is fully supported.

docker-compose-elk.yml · YAML
# Minimal ELK stack for local development and testing
# Shows the actual data flow: Filebeat -> Logstash -> Elasticsearch <- Kibana
# For production, replace single-node ES with a proper cluster and add Kafka between Filebeat and Logstash

version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false       # Dev only — always enable security in production
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
      - cluster.routing.allocation.disk.watermark.low=85%
      - cluster.routing.allocation.disk.watermark.high=90%
      - cluster.routing.allocation.disk.watermark.flood_stage=95%  # Read-only at 95%
    ports:
      - "9200:9200"
    volumes:
      - esdata:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD-SHELL", "curl -sf 'localhost:9200/_cluster/health' | grep -v '\"status\":\"red\"'"]
      interval: 10s
      timeout: 5s
      retries: 10

  logstash:
    image: docker.elastic.co/logstash/logstash:8.13.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
    environment:
      - LS_JAVA_OPTS=-Xms1g -Xmx2g
    ports:
      - "5044:5044"  # Beats input
    depends_on:
      elasticsearch:
        condition: service_healthy

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.13.0
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - filebeatdata:/usr/share/filebeat/data  # Persistent registry — survives container restarts
    depends_on:
      - logstash

volumes:
  esdata:
  filebeatdata:  # Registry persistence prevents re-shipping on restart
Output
# docker compose up -d
# Creating network elk_default
# Creating elk_elasticsearch_1 ... done
# Creating elk_logstash_1 ... done
# Creating elk_kibana_1 ... done
# Creating elk_filebeat_1 ... done
#
# Verify data flow:
# curl -s 'localhost:9200/_cat/indices?v' -- should show filebeat-* indices after ~30s
# curl -s 'localhost:9200/_cluster/health?pretty' -- should show green, or yellow once indices with replicas exist (a single node cannot assign replica shards)
The Three Failure Domains
Each ELK component can fail independently and in ways that look identical from the outside. Kibana showing no data could be a Kibana configuration problem, a Logstash pipeline stall, an Elasticsearch write rejection, or a Filebeat registry corruption. The triage order is always: Elasticsearch health first, then Logstash pipeline stats, then Kibana index pattern configuration. Jumping to Kibana when ES is the problem wastes everyone's time during an incident.
Production Insight
ELK is not just three tools — it is a data pipeline with three failure domains.
When logs stop flowing, isolate which domain broke before touching anything.
Triage order: Elasticsearch first, Logstash second, Kibana last — storage failures cascade upstream.
In 2026, evaluate Elastic Agent for new deployments — it consolidates Filebeat and Logstash into a single managed binary with Fleet-based policy management.
Key Takeaway
ELK is a pipeline: Filebeat collects, Kafka buffers (at scale), Logstash transforms, Elasticsearch stores, Kibana visualizes.
Filebeat registry corruption after unclean shutdown causes silent data loss — always mount registry on a persistent volume.
Triage order when logs stop: ES health first, then Logstash stats, then Kibana config.

Beats Family — Filebeat, Metricbeat, Packetbeat, and When to Use Each

The Beats family is Elastic's collection of lightweight data shippers. Each Beat is purpose-built for a specific data type — logs, metrics, network packets — and runs as a single binary with minimal configuration. Understanding which Beat to use for which job prevents the mistake of forcing Filebeat to collect metrics or Metricbeat to tail log files.

Filebeat is the workhorse for log collection. It tails files, follows symlinks, handles rotation, and ships raw log lines to Logstash or directly to Elasticsearch. It maintains a registry — a local file tracking read positions — so a restart does not re-ship the same lines. Filebeat supports multiline aggregation, which is critical for Java stacktraces. Configure multiline in Filebeat rather than Logstash whenever possible to reduce Logstash heap pressure.

Metricbeat collects system and service metrics. It runs modules that know how to talk to specific services: MySQL, PostgreSQL, Redis, Nginx, Kafka, Docker, Kubernetes. Metricbeat pulls metrics from each module on a configurable period. The output is numerical time-series data, not raw log lines. Do not use Filebeat to read /proc/stat; use Metricbeat with the system module.

Packetbeat captures and parses network traffic. It runs as a packet sniffer using libpcap (Linux) or Npcap (Windows; older releases used WinPcap), decoding protocols like HTTP, MySQL, PostgreSQL, Redis, Thrift, and DNS. Packetbeat reconstructs full transactions from packets, so it can show you every SQL query or HTTP request/response pair that crosses your network segment.

Auditbeat collects security audit events from your Linux kernel using the Linux Audit Framework. It ships user logins, privilege escalations (sudo), file integrity events (when critical configs change), and process execution logs. Auditbeat is the right tool for compliance auditing (SOC2, PCI-DSS) and security monitoring.

Heartbeat performs uptime monitoring. It pings services (ICMP), connects to TCP ports, or checks HTTP endpoints for expected status codes and response body patterns. Heartbeat sends synthetic check results as documents, which you can alert on for service availability. It is not a log collector and has no relation to a human heartbeat — it is named for the regular 'heartbeat' signal it emits.

Winlogbeat captures Windows Event Logs: Application, Security, Setup, System, and forwarded events. If your infrastructure includes Windows servers, Winlogbeat is the purpose-built Beat for getting Windows Event Logs into Elasticsearch reliably. Do not try to tail raw .evtx files with Filebeat.

beats-comparison.yml · YAML
# beats-comparison.yml — Quick reference for choosing the right Beat

# ============================================================
# FILEBEAT: Log files (application logs, JSON logs, plaintext)
# ============================================================
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/app/*.log
  multiline:
    type: pattern
    pattern: '^\d{4}-\d{2}-\d{2}'  # New log lines start with timestamp
    negate: true
    match: after

# ============================================================
# METRICBEAT: System metrics + service metrics
# ============================================================
metricbeat.modules:
- module: system
  period: 10s
  metricsets: ["cpu", "memory", "diskio", "filesystem", "load", "process"]
- module: nginx
  period: 30s
  hosts: ["http://localhost/status"]
- module: docker
  period: 10s
  hosts: ["unix:///var/run/docker.sock"]

# ============================================================
# PACKETBEAT: Network protocol analysis
# ============================================================
packetbeat.interfaces.device: eth0
packetbeat.protocols:
- type: http
  ports: [80, 8080]
  send_headers: ["User-Agent"]  # Avoid capturing Authorization: it would ship credentials into Elasticsearch
- type: mysql
  ports: [3306]
- type: pgsql
  ports: [5432]
- type: redis
  ports: [6379]

# ============================================================
# WINLOGBEAT: Windows Event Logs
# ============================================================
winlogbeat.event_logs:
- name: Application
  ignore_older: 72h
- name: Security
  ignore_older: 72h
- name: System
- name: Setup

# ============================================================
# HEARTBEAT: Uptime monitoring
# ============================================================
heartbeat.monitors:
- type: http
  name: Production API
  urls: ["https://api.example.com/health"]
  schedule: '@every 30s'
  check.response.status: [200]
- type: tcp
  name: MySQL
  hosts: ["db.example.com:3306"]
  schedule: '@every 1m'

# ============================================================
# AUDITBEAT: Linux audit framework
# ============================================================
auditbeat.modules:
- module: auditd
  audit_rules: |
    -w /etc/passwd -p wa -k identity
    -w /etc/nginx -p wa -k config
    -a always,exit -S execve -k process_execution
Output
# Each Beat outputs JSON documents to Elasticsearch or Logstash
# Install: sudo apt install filebeat / yum install filebeat (for each Beat)
# Configure: /etc/filebeat/filebeat.yml (or metricbeat.yml, packetbeat.yml)
# Test: filebeat test config -e
# Start: sudo systemctl start filebeat
Beats Selection Rule of Thumb
| Data Type | Correct Beat | Wrong Beat |
|-----------|--------------|------------|
| Application log files (JSON, plaintext) | Filebeat | Metricbeat (not designed for logs) |
| CPU, memory, disk, process metrics | Metricbeat | Filebeat (would require reading /proc manually) |
| HTTP requests, SQL queries on wire | Packetbeat | Filebeat (not a packet sniffer) |
| Windows Event Logs | Winlogbeat | Filebeat (cannot parse .evtx natively) |
| Service uptime monitoring | Heartbeat | Metricbeat (can work but Heartbeat is purpose-built) |
| Linux security audit events | Auditbeat | Filebeat (would miss kernel audit context) |
Production Insight
A team tried to use Filebeat to collect Docker container metrics by reading /proc/stat files directly. The configuration was complex, brittle, and broke on every Docker restart. Switching to Metricbeat with the Docker module reduced configuration from 200 lines to 20 and recovered metrics that were previously missing. Rule: use the Beat designed for your data type, not the one you already have installed.
Key Takeaway
Filebeat = log files, Metricbeat = system/service metrics, Packetbeat = network traffic, Winlogbeat = Windows Event Logs, Heartbeat = uptime monitoring, Auditbeat = security audit events. Choose the right Beat for the data type — forcing a file-based Beat to collect metrics is fragile and maintenance-heavy.

How Elasticsearch Actually Indexes Documents — Inverted Indices Under the Hood

Elasticsearch does not search documents. It searches an inverted index — a data structure that maps every unique term to the list of documents that contain it. When you index a document, Elasticsearch tokenizes the text, normalizes case, applies stemming if configured, and writes each token into a term dictionary. The term dictionary points to a postings list: document IDs, term frequency, and position offsets.

This is why Elasticsearch is fast at full-text search. You are not scanning every document. You are looking up a term in a sorted dictionary and getting back a pre-computed list of matching document IDs. BM25 scoring then ranks those matches by term frequency, inverse document frequency, and field length normalization.
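A toy inverted index shows the lookup in a few lines. This is a sketch of the data structure only, assuming a simple lowercase tokenizer on letters and digits; it is not how Lucene stores terms on disk, and the function names are illustrative.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict) -> dict:
    """Map each term to a postings list: {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search_all(index: dict, query: str) -> list:
    """AND query: intersect the postings lists of every query term."""
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "Payment failed for user 4821 timeout after 30s",
    2: "Payment succeeded for user 77",
    3: "Connection timeout talking to auth service",
}
index = build_index(docs)
print(index["timeout"])                      # {1: [5], 3: [1]}
print(search_all(index, "payment timeout"))  # [1]
```

No document text is scanned at query time. The search touches two dictionary entries and intersects two small sets, which is the whole trick.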

Each Elasticsearch shard is an independent Lucene index. Lucene segments are immutable — once written, they never change. New or updated documents go into an in-memory buffer, then get flushed to a new segment on refresh, which defaults to every 1 second. This means there is a 1-second window where a newly indexed document is not yet searchable. If you need sub-second search freshness, the answer is not lowering the refresh interval — it will kill indexing throughput because every refresh triggers segment creation and eventual merges.
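The refresh window is easier to reason about with a toy model. This sketch assumes a shard is just a write buffer plus a list of immutable segments; the ToyShard class is hypothetical and skips the translog, merges, and everything else Lucene does.

```python
class ToyShard:
    """A write buffer plus immutable segments; only segments are searchable."""

    def __init__(self):
        self.buffer = []    # indexed but not yet searchable
        self.segments = []  # each segment is a tuple, so it cannot be modified

    def index(self, doc: str) -> None:
        self.buffer.append(doc)

    def refresh(self) -> None:
        """Turn the buffer into a new segment, as the periodic refresh does."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def search(self, word: str) -> list:
        return [doc for segment in self.segments for doc in segment if word in doc]


shard = ToyShard()
shard.index("payment timeout for user 4821")
before_refresh = shard.search("timeout")  # [] because the doc is still in the buffer
shard.refresh()
after_refresh = shard.search("timeout")   # the doc is now in a segment
```

Every refresh adds a segment, so refreshing more often means more small segments for the merge process to clean up later.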

Segment merging happens in the background and is a silent performance killer when misconfigured. Too many small segments accumulate when indexing is faster than merging. The merge thread then consumes I/O and CPU, spiking latency for active searches. Monitor segment count per shard with _cat/segments — more than 100 segments per shard is a sign your merge policy needs tuning. For bulk indexing jobs, set refresh_interval to 30s or -1 during the load, force a refresh when done, then restore the interval.
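The bulk-load sequence in the last sentence maps to four requests. The sketch below only builds the request list and sends nothing; the index name is an example and bulk_load_plan is a hypothetical helper.

```python
import json


def bulk_load_plan(index_name: str, normal_interval: str = "1s") -> list:
    """Return (method, path, body) steps for a bulk load with refresh paused."""
    settings = f"/{index_name}/_settings"
    return [
        ("PUT", settings, {"index": {"refresh_interval": "-1"}}),             # pause refresh
        ("POST", "/_bulk", None),                                             # your bulk requests go here
        ("POST", f"/{index_name}/_refresh", None),                            # make the load searchable
        ("PUT", settings, {"index": {"refresh_interval": normal_interval}}),  # restore the interval
    ]


for method, path, body in bulk_load_plan("app-logs-2026.04.25"):
    print(method, path, json.dumps(body) if body else "")
```

Put the restore step in a finally block in real tooling. An index left at -1 after a failed load never shows new documents until someone notices.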

Field mapping is where most teams create invisible performance problems. Every field you add increases the inverted index size and slows indexing. Use dynamic: strict in your index templates to reject unexpected fields and define only the fields you actually search or aggregate on. A common mistake is indexing full HTTP request bodies as a single text field, then wondering why searches are slow. Use index: false for fields you store but never query.
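As an illustration of those mapping rules, the sketch below builds a body you could send to PUT _index_template/<name>. The field names and index pattern are assumptions chosen for the example; adjust them to the fields you actually query.

```python
import json


def strict_log_template(pattern: str = "app-logs-*") -> dict:
    """Index template body: reject unknown fields, skip indexing for stored-only data."""
    return {
        "index_patterns": [pattern],
        "template": {
            "mappings": {
                "dynamic": "strict",  # documents with unmapped fields are rejected
                "properties": {
                    "@timestamp": {"type": "date"},
                    "level": {"type": "keyword"},
                    "service": {"type": "keyword"},
                    "message": {"type": "text"},
                    # Kept in _source for display, but never searchable (hypothetical field)
                    "request_body": {"type": "text", "index": False},
                },
            }
        },
    }


print(json.dumps(strict_log_template(), indent=2))
```

With dynamic: strict, a service that starts logging a new field gets its documents rejected, so pair this with a dead letter queue or you trade field explosion for silent loss.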

High-cardinality keyword fields deserve specific mention. Trace IDs, request IDs, and session tokens as keyword fields create term dictionaries with millions of unique values that cannot fit in RAM. Every search touching those fields forces disk lookups. Either do not index them as keywords, or use a separate index with appropriate settings for correlation lookups rather than mixing them into your main search index.

inspect-inverted-index.sh · BASH
#!/bin/bash
# See exactly how Elasticsearch tokenizes and stores a document
# The _termvectors API exposes the inverted index directly

# Index a sample document
curl -s -XPOST 'localhost:9200/logs/_doc/1' \
  -H 'Content-Type: application/json' \
  -d '{
    "message": "Payment failed for user 4821 timeout after 30s",
    "service": "payment-api",
    "level": "error",
    "@timestamp": "2026-04-25T10:30:00Z"
  }'

# Force refresh so the document is searchable immediately
curl -s -XPOST 'localhost:9200/logs/_refresh'

# See how ES tokenized the message field — this IS the inverted index
curl -s -XGET 'localhost:9200/logs/_termvectors/1?fields=message&pretty'
# Output shows each token, its frequency, and its position:
# "payment"  -> term_freq: 1, position: 0
# "failed"   -> term_freq: 1, position: 1
# "user"     -> term_freq: 1, position: 3
# "timeout"  -> term_freq: 1, position: 5

# Search uses the inverted index — no full document scan
curl -s -XGET 'localhost:9200/logs/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"message": "timeout"}}}'
Output
# _termvectors output (abbreviated):
# {
# "term_vectors": {
# "message": {
# "terms": {
# "payment": { "term_freq": 1, "tokens": [{ "position": 0 }] },
# "failed": { "term_freq": 1, "tokens": [{ "position": 1 }] },
# "timeout": { "term_freq": 1, "tokens": [{ "position": 5 }] }
# }
# }
# }
# }
Inverted Index Mental Model
  • Document goes in -> ES tokenizes text into individual terms and writes each to the term dictionary
  • Each term gets a postings list: which docs contain it, how often, and at which position
  • Search = dictionary lookup + postings list intersection — that is why it is fast on billions of documents
  • Segments are immutable; updates create new segments, old ones get merged in the background by the merge thread
  • Refresh interval (1s default) controls the trade-off between search freshness and indexing throughput — do not lower it below 1s
Production Insight
A 100GB dataset with 5 primary shards and 1 replica consumes roughly 240GB of disk, not 100GB.
The 5 primaries split the data between them, so shard count does not multiply it. Each replica is a full physical copy of its primary (100GB x 2 = 200GB), and segment merge overhead adds another 10-20%.
Rule: calculate disk as raw_data_size x (1 + replicas) x 1.2 before you allocate storage.
For high-cardinality keyword fields like trace IDs or request IDs, create a separate correlation index rather than mixing them into your main search index.
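The sizing rule is a one-line calculation. A minimal sketch, assuming the 1.2 overhead factor from the rule above; the function name is illustrative.

```python
def estimated_disk_gb(raw_data_gb: float, replicas: int, overhead: float = 1.2) -> float:
    """raw_data_size x (1 + replicas) x overhead, per the rule above."""
    return raw_data_gb * (1 + replicas) * overhead


for replicas in (0, 1, 2):
    print(f"100GB raw, {replicas} replica(s): {estimated_disk_gb(100, replicas):.0f}GB on disk")
```

Shard count is deliberately not a parameter: primaries partition the data, so only replicas and overhead change the total.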
Key Takeaway
Inverted index maps terms to document IDs and postings lists — that is the engine behind every full-text search.
Disk usage is raw_data x (1 + replicas) x 1.2 — calculate before deploying.
Lowering refresh_interval below 1s kills indexing throughput for negligible search freshness gain.
dynamic: strict in index templates prevents field explosion from bloating mappings and slowing Kibana load times.
Choosing the Right Shard Strategy
If: Daily log volume under 5GB
Use: 1 primary shard per daily index. Do not over-shard small datasets; shard overhead exceeds any parallelism benefit below this threshold.
If: Daily log volume 5GB to 50GB
Use: 3 to 5 primary shards per daily index, targeting 10 to 30GB per shard. This keeps recovery fast and gives enough parallelism for concurrent searches.
If: Daily log volume over 50GB
Use: Calculate: daily_volume_GB divided by 30 = primary shard count. Keep shard count per node bounded, since shard metadata overhead accumulates in heap at roughly 1MB per shard; Elastic's long-standing rule of thumb was no more than 20 shards per GB of JVM heap, not 20 per node.
If: Need sub-second search latency on aggregations
Use: Add 1 to 2 replicas across additional data nodes. Replicas serve read traffic in parallel. Latency improves as read load is distributed, but disk usage multiplies.
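The over-50GB rule is the only one that needs arithmetic. A minimal sketch of it follows. Rounding up is an assumption of this example, since the rule does not say how to handle fractions, and the function name is hypothetical.

```python
import math


def primary_shards_for_large_index(daily_volume_gb: float, target_shard_gb: float = 30) -> int:
    """Primary shard count for daily volumes over 50GB: volume / 30, rounded up."""
    if daily_volume_gb <= 50:
        raise ValueError("this rule only covers daily volumes over 50GB")
    return math.ceil(daily_volume_gb / target_shard_gb)


for volume in (90, 100, 300):
    print(f"{volume}GB/day -> {primary_shards_for_large_index(volume)} primary shards")
```

Rounding up keeps every shard at or under the 30GB target; rounding down would let shards grow past it.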

Logstash Pipelines — Ingest, Transform, Ship and Where They Break

Logstash receives raw data from inputs, transforms it through filters, and ships structured events to outputs. The pipeline is linear (input -> filter -> output). Inputs run in their own threads and push events onto a queue; pipeline workers pull batches from it and run the filter and output stages. The number of worker threads is controlled by pipeline.workers, which defaults to the number of available CPU cores. Understanding the threading model is the first step to understanding why pipelines stall.

The Grok filter is where most Logstash performance problems originate. Grok combines regular expressions with named capture groups to extract structured fields from unstructured text. A pattern like %{COMBINEDAPACHELOG} expands to a 200-character-plus regex. When a log line does not match, Grok tries every alternative in the match list before giving up, and failed matches are the expensive ones. In a pipeline processing 10,000 events per second with a 5% failure rate, 500 events per second pay for every alternative and still come out unparsed. Grok does not drop those events: it tags them _grokparsefailure and passes them on with only the raw message. Add a catch-all pattern as the last alternative, %{GREEDYDATA:log_message}, so every event carries a log_message field. Because the catch-all always matches, the failure tag no longer fires for those lines, so keep visibility by checking for events that lack the structured fields (for example, no level).
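The ordered-alternatives behaviour can be imitated with plain regular expressions. This is a sketch, not Grok: the two patterns are hand-written stand-ins for a structured log pattern and a GREEDYDATA catch-all, and the matched_by field is added here only to show which alternative won.

```python
import re

# Stand-in for: TIMESTAMP_ISO8601 LOGLEVEL [thread] logger - message
PRIMARY = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?)\s+"
    r"(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]*)\]\s+(?P<logger>\S+) - (?P<log_message>.*)$"
)
# Stand-in for GREEDYDATA: matches any line, so it must come last
CATCH_ALL = re.compile(r"^(?P<log_message>.*)$", re.DOTALL)


def parse(line: str) -> dict:
    """Try each alternative in order and return the fields of the first match."""
    for name, pattern in (("primary", PRIMARY), ("catch_all", CATCH_ALL)):
        match = pattern.match(line)
        if match:
            fields = match.groupdict()
            fields["matched_by"] = name
            return fields
    return {"log_message": line, "matched_by": "none"}  # unreachable while CATCH_ALL is last


good = parse("2026-04-25 10:30:00.123 ERROR [exec-1] com.example.PaymentService - Payment failed")
odd = parse("segfault at 0x0")
print(good["level"], good["matched_by"])
print(odd["matched_by"], "level" in odd)
```

The second line comes back with a log_message and no level, which is exactly the signal to count if you want to know how many lines your primary pattern is missing.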

Multiline event handling is the second major trap. Java stacktraces and Python tracebacks span multiple lines. Logstash's multiline codec aggregates them into a single event by buffering pending lines in JVM heap. A burst of stacktraces from a crashing service — which is exactly when you most need your logs — can spike heap usage from 500MB to 3GB in under a minute. Set -Xmx to at least 4GB when using multiline in production. Reduce max_lines to a realistic ceiling (200 is usually enough for stacktraces) so a runaway exception chain cannot consume unlimited heap.
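Multiline aggregation with a line cap can be sketched as follows. The example assumes events start with a date, mirroring a negate: true / match: after configuration, and it drops lines beyond max_lines. Dropping is an assumption of this sketch; check how your shipper handles the overflow. The aggregate name is illustrative.

```python
import re

EVENT_START = re.compile(r"^\d{4}-\d{2}-\d{2}")  # a new event begins with a date


def aggregate(lines, max_lines: int = 200) -> list:
    """Join continuation lines onto the preceding event, capped at max_lines."""
    events, current = [], []
    for line in lines:
        if EVENT_START.match(line) and current:
            events.append("\n".join(current))  # a new event starts: flush the pending one
            current = []
        if len(current) < max_lines:
            current.append(line)               # lines beyond the cap are dropped
    if current:
        events.append("\n".join(current))
    return events


raw = [
    "2026-04-25 10:30:00 ERROR payment failed",
    "java.lang.RuntimeException: timeout",
    "    at com.example.Payment.charge(Payment.java:42)",
    "2026-04-25 10:30:01 INFO retry scheduled",
]
events = aggregate(raw)
capped = aggregate(raw, max_lines=2)
print(len(events), "events; first has", events[0].count("\n") + 1, "lines")
```

The pending list is the memory cost: it lives in the process until the next event starts, so the cap bounds how much a single runaway stacktrace can hold.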

Dead letter queues are the safety net that most teams skip and regret. By default, Logstash logs and drops documents that Elasticsearch rejects with a 400 or 404 response, such as mapping conflicts and field limit breaches. Retryable failures such as 429 rejections are retried rather than dropped. Enable the dead letter queue (DLQ) in logstash.yml with dead_letter_queue.enable: true and dead_letter_queue.max_bytes: 1024mb. The DLQ is a local directory on the Logstash host, not an Elasticsearch index. Inspect it at the path configured by path.dead_letter_queue (default: path.data/dead_letter_queue, which is /var/lib/logstash/dead_letter_queue on package installs). Use the dead_letter_queue input plugin to replay rejected events after fixing the root cause.

The pipeline.ordered setting deserves explicit mention. By default it is set to auto, which enables ordered processing when pipeline.workers is 1 and disables it otherwise. Set pipeline.ordered: false explicitly when event ordering between inputs does not matter. With a single worker it lifts the ordering constraint and improves throughput at the cost of delivery order guarantees; with several workers auto already behaves this way, so the explicit setting documents the intent. For log pipelines where Elasticsearch timestamps handle ordering at query time, this is almost always the right call.

Pipeline workers and batch size interact directly with throughput. For a 16-core machine, start with 8 workers and a batch size of 250. Increasing batch size improves throughput by amortizing per-batch overhead but increases per-event latency and heap usage. A batch that fills with slow-to-process multiline events holds the worker thread for longer, starving other events. Benchmark with realistic load before committing to any setting.
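The interaction is simple arithmetic: events in flight are workers times batch size. The heap estimate below is a rough lower bound that assumes a known average event size and ignores JVM object overhead; both function names are illustrative.

```python
def inflight_events(workers: int, batch_size: int) -> int:
    """Events held in memory by the worker threads at any moment."""
    return workers * batch_size


def inflight_payload_mb(workers: int, batch_size: int, avg_event_kb: float) -> float:
    """Payload size of in-flight events; real heap use is higher."""
    return inflight_events(workers, batch_size) * avg_event_kb / 1024


print(inflight_events(8, 250))                     # 2000
print(round(inflight_payload_mb(8, 250, 2), 1))    # short single-line events
print(round(inflight_payload_mb(8, 250, 50), 1))   # large multiline stacktraces
```

The same settings hold about 25 times more payload when the batch fills with 50KB stacktraces instead of 2KB lines, which is why benchmarks need realistic events.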

logstash-pipeline.conf · RUBY
# /etc/logstash/conf.d/production-logs.conf
# Production Logstash pipeline for Java microservices
# Handles both structured JSON logs and raw stacktraces with multiline

input {
  # Beats input for Filebeat agents on each host
  beats {
    port => 5044
    # congestion_threshold is obsolete in current beats input releases and stops the pipeline from loading; Filebeat backs off on its own when Logstash is slow
  }
}

filter {
  # Attempt JSON parse first — structured logs from Logback/Jackson need no Grok
  json {
    source  => "message"
    target  => "parsed"
    skip_on_invalid_json => true  # Keep raw message if not JSON; do not drop it
  }

  # Grok fallback for non-JSON logs and raw stacktraces
  if ![parsed] {
    grok {
      match => {
        "message" => [
          # Primary pattern: structured log line with thread and logger
          "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:thread}\] %{DATA:logger} - %{GREEDYDATA:log_message}",
          # Catch-all: guarantees a log_message field on lines the primary pattern misses
          # Because it always matches, tag_on_failure will not fire for those lines; look for a missing [level] instead
          "%{GREEDYDATA:log_message}"
        ]
      }
      tag_on_failure => ["_grokparsefailure_custom"]
    }
  }

  # Multiline handling for Java stacktraces
  # max_bytes limits heap consumption from runaway exception chains
  # This is a codec-level setting; shown here as a reference for input configuration
  # In practice, configure multiline on the Filebeat side to reduce Logstash heap pressure

  # Normalize timestamp to ISO 8601 for Elasticsearch
  date {
    match         => ["timestamp", "ISO8601", "yyyy-MM-dd HH:mm:ss.SSS"]
    target        => "@timestamp"
    remove_field  => ["timestamp"]
  }

  # Add processing metadata. Note that [host][name] is the host that shipped the event (set by Beats), not the Logstash instance
  mutate {
    add_field => {
      "pipeline_version" => "4.0"
      "processed_by"    => "%{[host][name]}"
    }
  }
}

output {
  elasticsearch {
    hosts                  => ["es-data-01:9200", "es-data-02:9200", "es-data-03:9200"]
    index                  => "app-logs-%{+YYYY.MM.dd}"
    # Retry settings — transient ES errors should not cause data loss
    retry_max_interval     => 30
    retry_initial_interval => 2
  }

  # Dead letter queue is configured in logstash.yml, not here.
  # Add to logstash.yml:
  #   dead_letter_queue.enable: true
  #   dead_letter_queue.max_bytes: 1024mb
  #   path.dead_letter_queue: /var/lib/logstash/dead_letter_queue
  # Inspect the DLQ directory to see what ES rejected.
  # Replay with the dead_letter_queue input plugin after fixing the root cause.
}
Output
# Pipeline started successfully
# [INFO] [logstash.inputs.beats] Starting server on port 5044
# [INFO] [logstash.pipeline] Pipeline started {"pipeline.id":"main"}
# Monitor with: curl -s 'localhost:9600/_node/stats/pipelines?pretty'
Grok Failure Is Silent by Default
  • Every unmatched line makes Grok try all alternatives before giving up, so CPU cost per failed event is O(alternatives)
  • A 5% Grok failure rate on a 10K events/sec pipeline means 500 events per second each pay for every alternative
  • Without a catch-all pattern as the last alternative, unmatched events are tagged _grokparsefailure and may be dropped depending on your output configuration
  • Always include %{GREEDYDATA:log_message} as your last alternative so every event keeps a usable message field; it is cheap, but it also stops the failure tag from firing, so track unparsed lines by their missing fields
  • Test patterns against 50 real log lines using the Grok Debugger in Kibana Dev Tools before deploying
Production Insight
Multiline codec holds pending events in JVM heap.
A burst of Java stacktraces from a crashing service — exactly when you need logs most — can OOM a 1GB heap pipeline in under a minute.
Set -Xmx to at least 4GB when using multiline. Configure multiline on the Filebeat side where possible to reduce Logstash heap pressure.
pipeline.ordered: false removes cross-worker coordination overhead — set it explicitly on log pipelines where ES timestamps handle ordering at query time.
Key Takeaway
Grok is the CPU bottleneck — always add %{GREEDYDATA:log_message} as the last alternative and tag failures for visibility.
Multiline codec is the heap trap — stacktrace bursts OOM pipelines under load. Configure multiline in Filebeat where possible.
Enable dead_letter_queue before production. DLQ is a local directory, not an ES index — inspect it at /var/lib/logstash/dead_letter_queue.
pipeline.ordered: false improves throughput on log pipelines where event ordering is handled at query time.

Logstash Filter Cheat Sheet — Grok, Date, Mutate, GeoIP

Logstash filters transform raw events into structured documents before they reach Elasticsearch. The four most frequently used filters in production pipelines are Grok, Date, Mutate, and GeoIP. Having a scannable reference makes pipeline debugging faster and reduces the guesswork when logs show up with missing fields or wrong timestamps.

Grok extracts structured fields from unstructured text using pattern matching. Built-in patterns cover common formats — %{COMBINEDAPACHELOG}, %{TIMESTAMP_ISO8601}, %{LOGLEVEL}. For custom formats, compose smaller patterns. Always add a catch-all as the last alternative to prevent dropped events.

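Date parses timestamp strings from your logs into the @timestamp field. If you skip this, @timestamp records when the event was read or received rather than when it happened, so delayed or replayed logs land at the wrong point on the timeline. The match parameter takes the source field followed by the format strings to try in order; events that match none of them are tagged _dateparsefailure.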
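Trying formats in order looks like this in plain Python. It is a sketch of the idea rather than the Date filter itself: the two format strings are examples, and treating timestamps without a zone as UTC is an assumption you should replace with your services' real timezone.

```python
from datetime import datetime, timezone

FORMATS = ["%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S.%f"]  # tried in this order


def parse_timestamp(raw: str, formats=FORMATS):
    """Return a UTC datetime from the first format that matches, else None."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue  # this format did not match; try the next one
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        return parsed.astimezone(timezone.utc)
    return None  # the equivalent of a parse-failure tag


print(parse_timestamp("2026-04-25T10:30:00Z"))
print(parse_timestamp("2026-04-25 10:30:00.123"))
print(parse_timestamp("yesterday-ish"))
```

Put the most common format first. Every event that needs the second format pays for a failed attempt on the first.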

Mutate modifies field values and structures: renaming, copying, converting types, removing fields, and adding static strings. Use it to normalize field names across services (e.g., renaming customerEmail to email) or to add pipeline metadata.

GeoIP enriches events with geographical location data from an IP address. It adds fields like geoip.country_code2, geoip.city_name, and geoip.location. This only works with public IPs, so guard the filter with a conditional that skips private addresses. Apply it only to events that actually need location data, because every lookup adds per-event overhead on high-volume pipelines.
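To see which addresses are worth a GeoIP lookup, it can help to classify a few sample IPs before writing the Logstash conditional. This is a minimal sketch using Python's standard ipaddress module; the function name should_geoip is hypothetical and not part of Logstash.

```python
import ipaddress

def should_geoip(ip_text):
    """True only for publicly routable addresses; False for private, loopback or junk."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # field did not contain an IP address at all
    return ip.is_global

for sample in ["8.8.8.8", "10.1.2.3", "172.16.0.1", "172.160.0.1", "192.168.1.10"]:
    print(sample, should_geoip(sample))
```

Note that 172.160.0.1 is public while 172.16.0.1 is private: a prefix check on "172.16" alone would wrongly skip the first one.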

Grok failure debugging is where most teams waste time. When your pattern does not match, Logstash adds a _grokparsefailure tag to the event. Check for these tags in Kibana. Use the Grok Debugger in Kibana Dev Tools to test patterns against actual log lines before deploying.

logstash-filters.confRUBY
# /etc/logstash/conf.d/filters.conf
# Common Logstash filter patterns — copy, paste, modify

# ============================================================
# 1. GROK - Extract structured fields from unstructured text
# ============================================================
filter {
  grok {
    match => {
      "message" => [
        # Apache/Nginx combined log format
        "%{COMBINEDAPACHELOG}",
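        # Custom key=value log line (capture into log_message, not message,
        # so the remove_field below does not discard the parsed text)
        "timestamp=%{TIMESTAMP_ISO8601:timestamp} level=%{LOGLEVEL:level} trace_id=%{UUID:trace_id} msg=%{GREEDYDATA:log_message}",
        # Catch-all: keep the raw line in log_message when nothing else matches
        "%{GREEDYDATA:log_message}"
      ]
    }
    # Tag failures so you can see them in Kibana.
    # With the catch-all above this is only a safety net: GREEDYDATA matches any line.
    tag_on_failure => ["_grokparsefailure_custom"]
    # Remove the original message after a successful match to save disk
    remove_field => ["message"]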
  }
}

# ============================================================
# 2. DATE - Parse timestamp into @timestamp
# ============================================================
filter {
  date {
    # Try these formats in order until one matches
    match => [
      "timestamp",
      "ISO8601",
      "yyyy-MM-dd HH:mm:ss.SSS",
      "dd/MMM/yyyy:HH:mm:ss Z"  # Apache log format
    ]
    target => "@timestamp"
    # Remove the original timestamp field after parsing
    remove_field => ["timestamp"]
  }
}

# ============================================================
# 3. MUTATE - Modify, rename, convert, or add fields
# ============================================================
filter {
  mutate {
    # Rename a field to standardize across services
    rename => {
      "customerEmail" => "email"
      "source_host"   => "hostname"
    }
    # Convert string numbers to actual numeric types
    convert => {
      "status_code" => "integer"
      "duration_ms" => "float"
    }
    # Remove fields you never query
    remove_field => ["headers", "raw_body"]
    # Add static processing metadata
    add_field => {
      "pipeline_name" => "logs-prod"
      "environment"   => "production"
    }
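    # Copy a value to a new nested field (Logstash field references use [outer][inner];
    # "user.id" would create a top-level field with a literal dot in its name)
    copy => {
      "user_id" => "[user][id]"
    }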
  }
}

# ============================================================
# 4. GEOIP - Enrich with location data from IP address
# ============================================================
filter {
  geoip {
    # Source field containing the IP address
    source => "client_ip"
    # Field under which the lookup results are stored (country, city, location, etc.)
    target => "geoip"
    # Skip if IP is private (10.x.x.x, 192.168.x.x, 172.16.x.x)
    # Not a filter setting — handle with a conditional before this filter
  }
}

# ============================================================
# 5. IF - Conditional pipelines
# ============================================================
filter {
  # Only apply heavy filters to error logs (10% of traffic)
  if [level] == "ERROR" or [status_code] >= 500 {
    grok {
      # JAVASTACKTRACEPART is the stock pattern for a single "at Class.method(File:line)" frame
      match => { "stacktrace" => "%{JAVASTACKTRACEPART}" }
    }
  }
  
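  # Skip GeoIP for private IP addresses
  # (the trailing \. on the 172.x alternatives keeps public ranges such as 172.160.x.x from matching)
  if [client_ip] !~ /^(10\.|172\.1[6-9]\.|172\.2[0-9]\.|172\.3[0-1]\.|192\.168\.)/ {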
    geoip {
      source => "client_ip"
    }
  }
}

# ============================================================
# 6. KV (Key-Value) — Parse key=value pairs
# ============================================================
filter {
  kv {
    # Source field containing key=value pairs
    source => "query_string"
    # Target field for extracted fields
    target => "params"
    # Separator between pairs (here '&', as in a URL query string)
    field_split => "&"
    # Separator between a key and its value (default '=')
    value_split => "="
    # Remove the source after extraction
    remove_field => ["query_string"]
  }
}
Output
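# Validate config syntax before deploying (no events are processed):
# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/filters.conf --config.test_and_exit
#
# Run a sample event through a test pipeline. test.conf is assumed to use stdin { codec => json }
# and stdout { codec => rubydebug }; omit --config.test_and_exit so the event is actually processed:
# echo '{"message":"127.0.0.1 - - [25/Apr/2026:10:30:00 +0000] \"GET /health\" 200 12"}' | \
# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/test.conf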
Grok Debugging Workflow
  1. Find a real log line that is failing: copy it from the raw application logs
  2. Open Kibana Dev Tools -> Grok Debugger
  3. Paste the log line and your pattern
  4. Iterate on the pattern until fields extract correctly
  5. Copy the working pattern to your Logstash config
  6. Test with logstash --config.test_and_exit before reloading
  7. Monitor _grokparsefailure tags in Kibana after deployment
Production Insight
A 10K events/second pipeline with 5% Grok failures was sending 500 events per second through all 15 alternative patterns before giving up, roughly 7,500 wasted regex evaluations per second. Adding a catch-all %{GREEDYDATA:log_message} pattern and moving it to the first position reduced CPU usage by 40%, but only because the catch-all matches every line instantly: the pipeline stopped evaluating the specific patterns on any event, so no structured fields were extracted. The catch-all belongs at the end, where the specific patterns still get the first chance to match and unmatched lines keep their text in log_message. Rule: order patterns from most-specific to catch-all, and always end with %{GREEDYDATA}.
Key Takeaway
Grok extracts structure — always include %{GREEDYDATA:log_message} as the last pattern. Date sets @timestamp from log timestamps. Mutate renames, converts, and removes fields. GeoIP enriches with location from IP addresses. Test patterns in the Grok Debugger before deploying.
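Why pattern order matters is easier to see in a tiny model of first-match-wins evaluation. The sketch below uses Python's re module, not Grok; the two patterns and the parse helper are hypothetical stand-ins for a specific pattern and a catch-all.

```python
import re

# Hypothetical stand-ins for one specific Grok pattern and a GREEDYDATA-style catch-all.
SPECIFIC = ("kv_log", re.compile(
    r"^timestamp=(?P<timestamp>\S+) level=(?P<level>[A-Z]+) msg=(?P<log_message>.*)$"))
CATCH_ALL = ("catch_all", re.compile(r"^(?P<log_message>.*)$", re.DOTALL))

def parse(line, patterns):
    """Try patterns in order; return (pattern name, extracted fields, patterns tried)."""
    for tried, (name, pattern) in enumerate(patterns, start=1):
        match = pattern.match(line)
        if match:
            return name, match.groupdict(), tried
    return None, {}, len(patterns)

line = "timestamp=2026-04-25T10:30:00Z level=ERROR msg=db timeout"
print(parse(line, [SPECIFIC, CATCH_ALL]))   # specific pattern wins, fields extracted
print(parse(line, [CATCH_ALL, SPECIFIC]))   # catch-all wins first, structure is lost
```

With the catch-all first, the structured line still "matches", but only as one opaque log_message string, which is why it has to come last.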

Kibana Dashboards That Actually Answer Questions

Most Kibana dashboards are digital art. They look impressive in a demo and answer nothing during an incident. A dashboard with 47 panels showing every possible metric is a distraction when you are trying to figure out why payments are failing at 2am.

The right approach is to start with the question, then build the visualization. 'Which services are returning 5xx errors in the last 15 minutes?' needs one metric — error count — one dimension — service name — and one filter — status code 500 or above, time range last 15 minutes. That is a single data table, not a 12-panel dashboard. The panel answers the question. Everything else is friction.

Kibana's index patterns are the second major gotcha. When you create an index pattern like app-logs-*, Kibana fetches field mappings from every matching index. If you have 90 days of daily indices with dynamic mapping enabled, you can easily have 3,000-plus fields, especially when different services log different JSON structures that Elasticsearch ingests as separate mapped fields. Every time someone opens Discover or creates a visualization, Kibana loads all those field definitions. That is why your dashboard takes 20 seconds to load. The fix is upstream: use dynamic: strict in your index template and define only the fields you actually query. Be aware that strict mapping rejects any document containing an unmapped field, so watch for indexing failures after enabling it; dynamic: false is the softer option, keeping unmapped fields in _source without indexing them.
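A quick way to check whether field explosion is your problem is to count the fields in a mapping. The helper below is a rough sketch that walks the "properties" section of a mapping already loaded as a Python dict (for example from the JSON returned by a GET mapping request); count_fields and the sample mapping are illustrative, and the count is an approximation of what Kibana loads.

```python
def count_fields(properties):
    """Count leaf fields plus multi-fields (such as .keyword) in a mapping's properties."""
    total = 0
    for definition in properties.values():
        children = definition.get("properties")
        if children:
            total += count_fields(children)   # object field: count what is inside it
        else:
            total += 1                         # leaf field
        total += len(definition.get("fields", {}))  # multi-fields
    return total

sample = {
    "service": {"type": "keyword"},
    "message": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "request": {"properties": {"headers": {"properties": {
        "x-amzn-trace-id": {"type": "keyword"}}}}},
}
print(count_fields(sample))
```

Running this against a few daily indices shows quickly whether the field count is stable or growing with every new service.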

For real incident response, saved searches outperform visualizations. They load faster because they do not aggregate — they list raw log lines with column selections. Pin a few key saved searches at the top of your Kibana navigation: 5xx errors, slow queries over 5 seconds, auth failures. When something breaks, open the relevant saved search rather than waiting for a complex dashboard to render aggregations across 90 days of data.

Kibana Query Language over Lucene syntax is worth the initial learning curve. KQL is more readable, less error-prone when written quickly under pressure, and better supported in autocomplete. Train the whole team on a handful of patterns — field:value, field:* wildcards, AND/OR combinations, range queries with > and < — and you will have faster incident investigation.

Time series visualizations with a large time range are a silent performance killer. Querying 30 days of data with a 1-minute bucket interval generates 43,200 buckets. Elasticsearch computes all of them. Use auto-interval on date histograms for overview dashboards and a fixed short interval only when drilling into a specific incident window. The inspect button on any Kibana panel shows the raw Elasticsearch query being executed — this is invaluable for understanding why a dashboard is slow or showing unexpected data.
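The bucket arithmetic is simple enough to check before building a panel. This is a small sketch; bucket_count is a hypothetical helper, not a Kibana or Elasticsearch API.

```python
def bucket_count(range_days, interval_minutes):
    """Number of date-histogram buckets for a time range and a fixed interval."""
    total_minutes = range_days * 24 * 60
    return -(-total_minutes // interval_minutes)  # ceiling division

print(bucket_count(30, 1))    # 30 days at 1-minute buckets
print(bucket_count(30, 60))   # same range at 1-hour buckets
```

Moving from 1-minute to 1-hour buckets on a 30-day overview cuts the bucket count by a factor of 60, which is the kind of reduction auto-interval makes for you.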

kibana-incident-dashboard.ndjsonJSON
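// Kibana saved search for incident triage: 5xx errors by service
// Shown pretty-printed with comments for readability. An importable .ndjson file must hold
// each saved object as a single line of JSON with the comments removed.
// Import via Kibana -> Stack Management -> Saved Objects -> Import
// This is the first thing to open when someone says 'something is broken'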
{
  "attributes": {
    "title": "5xx Errors by Service — Last 15m",
    "description": "Real-time error view for incident triage. Open this first, not the overview dashboard.",
    "columns": ["@timestamp", "service", "level", "log_message", "status_code"],
    "sort": [["@timestamp", "desc"]],
    "kibanaSavedObjectMeta": {
      "searchSourceJSON": "{\"query\":{\"query\":\"status_code >= 500\",\"language\":\"kuery\"},\"filter\":[{\"meta\":{\"index\":\"app-logs-*\",\"type\":\"range\"},\"query\":{\"range\":{\"@timestamp\":{\"gte\":\"now-15m\"}}}}]}"
    }
  },
  "type": "search"
}

// Human-readable version of the embedded query above:
// {
//   query: { query: 'status_code >= 500', language: 'kuery' },
//   filter: [{ range: { '@timestamp': { gte: 'now-15m' } } }]
// }
//
// The escaped JSON string in searchSourceJSON is how Kibana stores the query inside a
// saved object, so the nested escaping is expected. The readable version above
// shows what is actually being executed against Elasticsearch.
Output
# Import: Kibana -> Stack Management -> Saved Objects -> Import -> select this file
# Access: Kibana -> Analytics -> Discover -> Open -> '5xx Errors by Service'
# During incidents: pin to Kibana sidebar for one-click access
Dashboard Design Mental Model
  • Start with the question, then pick the visualization — not the other way around
  • One dashboard per operational scenario: incident triage, capacity planning, deploy validation
  • If a panel does not change your next action, it is visual noise — remove it
  • Saved searches load faster than visualizations and show raw log lines — use them for active incident investigation
  • The Kibana inspect button shows the raw ES query — use it to debug slow panels and unexpected results
Production Insight
Index patterns with 2000-plus fields add 10-20 seconds to every Kibana page load.
Dynamic mapping on high-cardinality logs creates fields like request.headers.x-amzn-trace-id that nobody queries but Kibana loads on every page.
Rule: use dynamic: strict in your index template from day one. Every unexpected field is a performance tax paid on every dashboard load.
Key Takeaway
Start with the question, then build the visualization — not the other way around.
Saved searches load faster than visualizations and are the right tool during active incidents.
dynamic: strict prevents field explosion that makes every Kibana page slow to load.
One dashboard per operational scenario beats one dashboard attempting to show everything.

Kibana Query Language (KQL) vs Lucene — Syntax Comparison

Kibana gives you two query language options: Kibana Query Language (KQL) and Lucene. KQL is the default in modern Kibana versions (7.0+) and is the recommended choice for most users. Lucene is the legacy syntax that powers Elasticsearch's underlying query parser. Knowing both is useful for debugging, but KQL should be your daily driver.

KQL is designed for discoverability and error resistance. It provides autocomplete suggestions, syntax highlighting, and immediate error feedback. Invalid KQL does not fail quietly: Kibana tells you where the syntax breaks. KQL supports nested fields, existence checks, and range queries with a cleaner syntax. Use KQL for all ad-hoc exploration and dashboard panels.

Lucene is more powerful but more dangerous. It supports regex, fuzzy queries, and proximity searches that KQL does not. The trade-off is that Lucene does not prevent you from writing queries that are syntactically valid but semantically wrong. A misplaced parenthesis can change the entire query meaning without an error message. Reserve Lucene for advanced use cases where KQL falls short, and always test Lucene queries in Dev Tools before adding them to dashboards.

The field:value syntax is identical in both languages. Wildcards work in both — service:payment* matches payment-api, payment-processor, payment-service. The differences appear in ranges, existence checks, and complex Boolean logic.
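Ranges are where the two syntaxes diverge most, so here is a small sketch that renders the same numeric range in both. The helper names kql_range and lucene_range are made up for illustration; the output strings follow the KQL comparison operators and the Lucene bracket notation, where ] is inclusive and } is exclusive.

```python
def kql_range(field, low=None, high=None, high_inclusive=True):
    """Render a numeric range as KQL comparisons joined with AND."""
    parts = []
    if low is not None:
        parts.append(f"{field} >= {low}")
    if high is not None:
        parts.append(f"{field} {'<=' if high_inclusive else '<'} {high}")
    return " AND ".join(parts)

def lucene_range(field, low=None, high=None, high_inclusive=True):
    """Render the same range in Lucene bracket notation (* means unbounded)."""
    lower = "*" if low is None else low
    upper = "*" if high is None else high
    closing = "]" if high_inclusive else "}"
    return f"{field}:[{lower} TO {upper}{closing}"

print(kql_range("status_code", 400, 499))
print(lucene_range("status_code", 400, 499))
print(lucene_range("duration_ms", 0, 1000, high_inclusive=False))
```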

kql-vs-lucene.txtTEXT
# KQL vs Lucene - Syntax Comparison Quick Reference
# Use KQL by default. Only switch to Lucene when you need advanced features.

# ============================================================
# BASIC FIELD LOOKUP - Same in both
# ============================================================
# Match exact field value
KQL:     status_code: 500
Lucene:  status_code:500

# Match value anywhere in full-text field (message analysis applies)
KQL:     message: "timeout"
Lucene:  message:timeout

# ============================================================
# WILDCARDS - * works in both; ? is Lucene-only
# ============================================================
# Prefix match
KQL:     service: payment*
Lucene:  service:payment*

# Single character wildcard (KQL only supports *)
Lucene:  trace_id:1a2b?ef*

# ============================================================
# RANGE QUERIES - KQL is more readable
# ============================================================
# Greater than or equal
KQL:     duration_ms >= 5000
Lucene:  duration_ms:[5000 TO *]

# Between (inclusive both ends)
KQL:     status_code >= 400 and status_code <= 499
Lucene:  status_code:[400 TO 499]

# Between (exclusive upper bound)
KQL:     duration_ms >= 0 and duration_ms < 1000
Lucene:  duration_ms:[0 TO 1000}

# Date range
KQL:     @timestamp >= "2026-04-25T10:00:00"
Lucene:  @timestamp:[2026-04-25T10:00:00 TO *]

# ============================================================
# BOOLEAN LOGIC - Same keywords; only Lucene requires uppercase
# ============================================================
# AND
KQL:     level: ERROR AND service: payment-api
Lucene:  level:ERROR AND service:payment-api

# OR (KQL accepts or/OR in any case; Lucene requires uppercase OR)
KQL:     level: ERROR OR level: WARN
Lucene:  level:ERROR OR level:WARN

# NOT (Lucene also accepts the - prefix as a shorthand)
KQL:     NOT level: DEBUG
Lucene:  NOT level:DEBUG
Lucene:  -level:DEBUG

# Complex grouping
KQL:     (status_code >= 500 OR level: ERROR) AND service: payment*
Lucene:  (status_code:[500 TO *] OR level:ERROR) AND service:payment*

# ============================================================
# EXISTENCE CHECKS - KQL is more readable
# ============================================================
# Field exists (has at least one non-null value)
KQL:     trace_id: *
Lucene:  _exists_:trace_id

# Field does NOT exist
KQL:     NOT trace_id: *
Lucene:  -_exists_:trace_id

# ============================================================
# OBJECT FIELDS - Same in both (dot notation)
# ============================================================
KQL:     geoip.country_code2: US
Lucene:  geoip.country_code2:US

# ============================================================
# LUCENE-ONLY FEATURES (not available in KQL)
# ============================================================
# Regular expressions (use sparingly: expensive; flags such as /i are not supported)
Lucene:  message:/pay.ent.*/

# Fuzzy queries (character edit distance)
Lucene:  customer_name:bob~1

# Proximity searches
Lucene:  "user created"~3

# Boosting terms
Lucene:  level:ERROR^2 OR level:WARN

# ============================================================
# REAL-WORLD INCIDENT QUERIES
# ============================================================
# Find all errors from payment service in last hour (KQL)
level: ERROR AND service: payment-* AND @timestamp >= now-1h

# Find slow API calls (>10s) excluding health checks (Lucene; quote the path because / starts a regex)
duration_ms:[10000 TO *] AND NOT endpoint:"/health"

# Find any 5xx or ERROR from payment services except test users (KQL)
(status_code >= 500 OR level: ERROR) AND service: payment-* AND NOT user_id: test-*
KQL Quick Reference
| Query Pattern   | KQL Example                                                  |
|-----------------|--------------------------------------------------------------|
| Exact match     | status_code: 500                                             |
| Prefix wildcard | service: payment*                                            |
| Range           | duration_ms >= 5000                                          |
| AND             | level: ERROR AND service: payment-api                        |
| OR              | level: ERROR OR level: WARN (operators are case-insensitive) |
| NOT             | NOT level: DEBUG                                             |
| Exists          | trace_id: *                                                  |
| Date math       | @timestamp >= now-15m                                        |
| Grouping        | (status_code >= 500 OR level: ERROR) AND service: payment*   |
Production Insight
A team spent 20 minutes debugging why their dashboard filter level:WARN OR ERROR did not return the expected results. The problem was not operator case: KQL accepts and, or, and not in any case (Lucene is the syntax that requires uppercase). The field name binds only to the first term, so ERROR was searched as a bare term instead of as a value of level. After switching to level: WARN OR level: ERROR (or level: (WARN OR ERROR)), the filter worked. Rule: repeat the field name for every value or group the values in parentheses. Field names are case-sensitive as indexed.
Key Takeaway
Use KQL by default: it has autocomplete, error feedback, and cleaner syntax. Use Lucene only for regex, fuzzy queries, or proximity searches. KQL operators (and, or, not) work in any case, while Lucene requires uppercase AND, OR, NOT. Field names are case-sensitive as indexed.

Shard Strategy and Capacity Planning — The Decisions That Haunt You Later

Shard count is the most consequential decision in an Elasticsearch deployment and it is almost always wrong on the first try. Too many shards and your cluster spends more time managing shard metadata than indexing data. Too few and you cannot distribute load or recover from node failures in a reasonable time window.

Here is the math most people skip. Each shard is a Lucene instance with its own heap overhead, roughly 1MB per shard for metadata plus per-segment structures. A cluster with 5,000 shards burns around 5GB of heap just on shard bookkeeping before indexing a single document. Elasticsearch's default limit is 1,000 shards per node (the cluster.max_shards_per_node setting, which can be raised), but the practical ceiling is closer to 20 shards per GB of heap on data nodes. A node with 30GB of heap can handle around 600 shards before GC pressure from shard metadata starts affecting search latency.
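The same arithmetic as a short sketch, using the two rules of thumb from the paragraph above (about 1MB of heap per shard, about 20 shards per GB of heap). The function names are illustrative only.

```python
HEAP_PER_SHARD_MB = 1       # rule of thumb from the text above
SHARDS_PER_GB_HEAP = 20     # practical ceiling from the text above

def shard_heap_overhead_gb(shard_count):
    """Approximate heap consumed by shard bookkeeping alone."""
    return shard_count * HEAP_PER_SHARD_MB / 1024

def max_shards_for_heap(heap_gb):
    """Recommended upper bound on shards for a data node with this heap."""
    return heap_gb * SHARDS_PER_GB_HEAP

print(round(shard_heap_overhead_gb(5000), 1))  # about 5GB for 5,000 shards
print(max_shards_for_heap(30))                 # ceiling for a 30GB heap node
```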

The target shard size for log workloads is 10GB to 50GB. Below 10GB you are paying overhead on shards too small to benefit from parallelism. Above 50GB, shard recovery after a node failure requires copying the entire shard to a replacement node — a 100GB shard on a 1Gbps network takes around 13 minutes to recover during which that data has one fewer replica. For a daily index receiving 30GB of logs, one primary shard is fine. For 150GB per day, use 5 primary shards.
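Shard count from daily volume is a ceiling division. A minimal sketch, assuming the 30GB midpoint of the 10GB to 50GB target range; primary_shards is a hypothetical helper.

```python
import math

def primary_shards(daily_gb, target_shard_gb=30):
    """Primary shard count that keeps each shard near the target size."""
    return max(1, math.ceil(daily_gb / target_shard_gb))

for volume in (30, 50, 150):
    shards = primary_shards(volume)
    print(f"{volume}GB/day -> {shards} primary shard(s), ~{volume / shards:.0f}GB each")
```

The max(1, ...) guard matters for very small indices: even a 2GB daily index still needs one primary shard.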

You cannot change the number of primary shards on an existing index in place. The shrink and split APIs, like a reindex, write the data into a new index. This is one of the most painful lessons in Elasticsearch operations and it is avoidable if you plan before the first document arrives. Use index templates with the correct shard count set before the cluster receives any data. If you get it wrong, rebuilding the index is the only path forward, which is a significant operational event.

Replica shards serve two purposes: fault tolerance and read parallelism. For logs, one replica is usually enough — it doubles disk usage but protects against single node failure. If you have 3 data nodes, each primary shard has its primary on one node and its replica on another, giving you tolerance for a single node going down. With 2 replicas across 3 nodes, you have triple the disk usage but can lose any 2 nodes and still serve reads. Choose based on your actual availability requirements, not aspirational ones.
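The disk and fault-tolerance trade-off can be put in numbers. This sketch assumes the 20% segment overhead used elsewhere in this article and at least one data node per shard copy; both function names are made up for illustration.

```python
def disk_needed_gb(daily_gb, replicas=1, retention_days=1, overhead=1.2):
    """Disk for primaries plus replicas, with a 20% allowance for segment overhead."""
    return daily_gb * (1 + replicas) * overhead * retention_days

def tolerated_node_failures(replicas, data_nodes):
    """Nodes you can lose while at least one copy of every shard survives."""
    return min(replicas, data_nodes - 1)

print(round(disk_needed_gb(50, replicas=1), 1))       # one day of 50GB logs, 1 replica
print(round(disk_needed_gb(50, replicas=2), 1))       # same day with 2 replicas
print(tolerated_node_failures(replicas=1, data_nodes=3))
print(tolerated_node_failures(replicas=2, data_nodes=3))
```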

Post-node-restart shard allocation can itself become a cluster bottleneck when shard counts are high. The elected master node handles all allocation decisions, and thousands of pending allocations after a node restart can keep the cluster in a recovering state for much longer than the actual data copy time. Monitor _cluster/allocation/explain when shards are not allocating: it gives the specific reason, from disk watermark to node attribute filtering to replica placement rules. This API saves hours of guessing.

shard-capacity-planning.shBASH
#!/bin/bash
# Elasticsearch shard capacity planning and monitoring commands
# Run these before creating a new index, not after you discover a problem

echo "=== Current shard distribution across data nodes ==="
curl -s 'localhost:9200/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent'

echo ""
echo "=== Shards over 50GB — candidates for re-sharding on next reindex ==="
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node' | \
  awk 'NR==1 || ($4 ~ /gb/ && substr($4,1,length($4)-2)+0 > 50)'

echo ""
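echo "=== Per-node shard count vs recommended maximum ==="
# Rule: keep shards per node under (heap_GB x 20)
# For a 30GB heap node, maximum is ~600 shards
# Heap size comes from _cat/nodes (bytes=gb prints heap.max as a plain number of GB);
# shard counts per node come from _cat/allocation.
{
  curl -s 'localhost:9200/_cat/nodes?h=name,heap.max&bytes=gb'
  echo "---"
  curl -s 'localhost:9200/_cat/allocation?h=node,shards'
} | awk '
  $1 == "---"    { in_allocation = 1; next }
  !in_allocation { heap[$1] = $2; next }   # first listing: node name -> heap in GB
  ($1 in heap)   {                         # second listing: node name, shard count
    shards = $2
    max_shards = heap[$1] * 20
    if (shards > max_shards)
      printf "WARNING: %s has %s shards, max recommended %s\n", $1, shards, max_shards
    else
      printf "OK:      %s has %s shards (max %s)\n", $1, shards, max_shards
  }'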

echo ""
echo "=== Shard count calculator for a new daily index ==="
DAILY_GB=${1:-50}         # Pass daily volume as first argument, default 50GB
TARGET_SHARD_GB=30        # Target 10-50GB per shard
SHARDS=$(( (DAILY_GB + TARGET_SHARD_GB - 1) / TARGET_SHARD_GB ))  # Ceiling division
echo "Daily volume: ${DAILY_GB}GB"
echo "Target shard size: ${TARGET_SHARD_GB}GB"
echo "Recommended primary shards: ${SHARDS}"
echo "Total disk with 1 replica: $(( DAILY_GB * 2 )) GB (before segment overhead)"
echo "Total disk with 1 replica + 20% overhead: $(echo "$DAILY_GB * 2 * 1.2" | bc)GB"

echo ""
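echo "=== Creating index template with calculated shard count ==="
# dynamic: strict is a mapping parameter, so it goes under mappings, not settings.
# With strict mapping every field you index must be declared under properties,
# otherwise the document is rejected. The fields below are examples; list your own.
curl -s -XPUT 'localhost:9200/_index_template/app-logs' \
  -H 'Content-Type: application/json' \
  -d "{
    \"index_patterns\": [\"app-logs-*\"],
    \"template\": {
      \"settings\": {
        \"number_of_shards\": ${SHARDS},
        \"number_of_replicas\": 1,
        \"refresh_interval\": \"5s\",
        \"codec\": \"best_compression\"
      },
      \"mappings\": {
        \"dynamic\": \"strict\",
        \"properties\": {
          \"@timestamp\":  { \"type\": \"date\" },
          \"service\":     { \"type\": \"keyword\" },
          \"level\":       { \"type\": \"keyword\" },
          \"status_code\": { \"type\": \"integer\" },
          \"log_message\": { \"type\": \"text\" }
        }
      }
    }
  }"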
Output
# === Current shard distribution across data nodes ===
# node shards disk.indices disk.used disk.avail disk.percent
# es-data-01 42 180.2gb 210.5gb 789.5gb 21
# es-data-02 38 165.8gb 195.2gb 804.8gb 19
#
# === Per-node shard count vs recommended maximum ===
# OK: es-data-01 has 42 shards (max 600)
# OK: es-data-02 has 38 shards (max 600)
#
# === Shard count calculator ===
# Daily volume: 50GB
# Target shard size: 30GB
# Recommended primary shards: 2
# Total disk with 1 replica: 100GB
# Total disk with 1 replica + 20% overhead: 120GB
Shard Sizing in One Rule
Target 10 to 50GB per shard. Calculate primary shard count as ceiling(daily_volume_GB / 30). Set this in the index template before the first document arrives — you cannot change it afterward without a full reindex. Every shard adds roughly 1MB of heap overhead for metadata. Keep total shards per node under heap_GB times 20.
Production Insight
5,000 shards consume roughly 5GB of heap just on metadata — before indexing anything.
A node with 30GB heap and 600 shards is running within spec. The same node with 2,000 shards will GC constantly and degrade search latency for everything on the cluster.
Rule: calculate shard count from daily volume before creating the index, not after the cluster starts showing yellow status.
Key Takeaway
Target 10 to 50GB per shard — below 10GB wastes overhead, above 50GB makes recovery slow.
Shard count cannot be changed after index creation without a reindex — get it right in the template.
Keep total shards per node under heap_GB times 20 to avoid GC pressure from shard metadata.
Use _cluster/allocation/explain when shards will not allocate — it gives the specific blocking reason.

Index Lifecycle Management — Automate Retention Before It Bites You

Without ILM, your ELK stack suffocates under its own data. Daily indices accumulate, disk fills, and someone is running curl commands at 2am to delete old indices. Index Lifecycle Management automates this: you define policies that transition indices through hot, warm, cold, and delete phases based on age, size, or document count. ILM is not optional infrastructure. It is the difference between a cluster that manages itself and one that requires constant manual intervention.

Hot phase: indices are actively written and frequently searched. Keep this short — 1 to 3 days for most log workloads. Use fast NVMe SSDs. This is your most expensive storage tier, and every day you keep an index here costs more than it should.

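Warm phase: no more writes, still searchable but with lower urgency. Force-merge to 1 segment: this consolidates the small segments left by active indexing into one, reducing heap overhead and improving scan performance. If you also shrink the index, the shrink operation needs the index to be read-only and a copy of every shard on a single node; the ILM shrink action arranges both. Reducing replicas with the allocate action in the same phase is optional, and it cuts the amount of data that has to be relocated before the shrink. ILM runs the actions within a phase in a fixed order (allocate, then shrink, then forcemerge), so their order in the policy JSON does not matter. If a step cannot complete, the index stays in that step and the ILM explain API reports the reason.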

Cold phase: read-only, reduced replica count, optionally migrated to slower storage using data tiers. For compliance-driven retention requirements, this phase can extend to months or years.

Delete phase: remove indices after the retention period. Set this based on your data retention policy and test it with a very short window on a development cluster — 1 hour delete — before using real durations in production.
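Phase selection boils down to comparing an index's age against each phase's min_age. The sketch below uses the 3-day warm and 30-day delete thresholds from this article's example policy; phase_for_age is a hypothetical helper, and real ILM evaluates this on a polling interval rather than instantly.

```python
# (phase name, min_age in days); with rollover enabled, age is measured from rollover.
PHASES = [("hot", 0), ("warm", 3), ("delete", 30)]

def phase_for_age(days_since_rollover):
    """Return the latest phase whose min_age the index has reached."""
    current = PHASES[0][0]
    for name, min_age_days in PHASES:
        if days_since_rollover >= min_age_days:
            current = name
    return current

for age in (0, 2.5, 3, 29, 30):
    print(age, phase_for_age(age))
```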

ILM rollover is the mechanism that prevents any single index from growing beyond your target shard size. When an index exceeds max_size or max_age, ILM rolls over to a new index with the same template settings. This keeps shard sizes predictable and recovery times bounded. Set rollover on both a size trigger and an age trigger — whichever fires first — so that low-volume days still rotate on schedule.
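The "whichever fires first" rule is a plain OR over the two triggers. A minimal sketch, assuming the 50GB and 1-day limits from this article's example policy; should_rollover is an illustrative name, not an Elasticsearch API.

```python
def should_rollover(size_gb, age_hours, max_size_gb=50, max_age_hours=24):
    """Roll over when either the size trigger or the age trigger is reached."""
    return size_gb >= max_size_gb or age_hours >= max_age_hours

print(should_rollover(size_gb=52, age_hours=6))    # busy day: size fires first
print(should_rollover(size_gb=4, age_hours=24))    # quiet day: age fires
print(should_rollover(size_gb=20, age_hours=12))   # neither yet
```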

A critical operational detail: ILM policies attached to an index template apply to new indices only. Indices created before the policy was attached require explicit opt-in via the PUT /index/_settings API. And always test your ILM policy on a development cluster before applying it to production. A misconfigured delete phase that fires too early is an availability incident, not a configuration mistake.

ilm-policy.jsonJSON
// ILM policy: hot until 3 days after rollover, then warm, delete after 30 days
// PUT _ilm/policy/logs-ilm-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",  // Roll over when index hits 50GB regardless of age
            "max_age":  "1d"     // Also roll over daily even if under 50GB
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          // ILM runs the actions of a phase in a fixed order, regardless of their order in this JSON:
          // in warm, allocate runs before shrink, and shrink before forcemerge.
          // Dropping replicas to 0 is optional; it reduces the data shrink must relocate onto one node.
          "allocate": { "number_of_replicas": 0 },
          "forcemerge": { "max_num_segments": 1 },
          "shrink":    { "number_of_shards": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

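// Attach to index template: new indices inherit this policy automatically
// rollover_alias needs a bootstrap index (for example app-logs-000001) whose
// app-logs alias is marked is_write_index: true
// PUT _index_template/app-logs
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name":             "logs-ilm-policy",
      "index.lifecycle.rollover_alias":    "app-logs",
      "number_of_shards":                  2,
      "number_of_replicas":                1
    },
    // dynamic: strict is a mapping parameter; declare your fields under properties
    "mappings": { "dynamic": "strict" }
  }
}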

// Verify ILM is working on existing indices:
// GET app-logs-*/_ilm/explain
Output
# ILM policy created and attached to template
# New indices matching app-logs-* will be managed automatically
#
# Check phase progression:
# curl -s 'localhost:9200/app-logs-*/_ilm/explain?pretty' | grep -E '(phase|age|action)'
#
# If an index is stuck in a phase, check the error:
# curl -s 'localhost:9200/app-logs-000001/_ilm/explain?pretty' | grep -A5 'error'
ILM as Automated Operations
  • Hot: fast writes, fast searches, expensive NVMe storage — keep indices here for 1 to 3 days maximum
  • Warm: no writes, acceptable search speed, force-merge to 1 segment reduces overhead — 3 to 30 days
  • Cold: read-only, minimal replicas, cheap storage — for compliance or rare lookups
  • Delete: remove when retention policy says it is gone — test with a short window first
  • Rollover on both size and age — prevents any single index from growing beyond your target shard size
Production Insight
The shrink action needs a copy of every shard on one node, so it can wait for a long time if no node has enough free disk to hold them.
A warm phase that cannot complete a step does not raise an alert on its own: the index simply stays in that step, and the ILM explain API shows the reason on the affected index.
Rule: always check _ilm/explain on a test index after attaching a new ILM policy before trusting it in production.
Key Takeaway
ILM automates hot, warm, and delete phases — without it, manual cleanup is the operation someone forgets.
The warm phase shrink needs every shard on one node. Reducing replicas first is optional but shortens relocation; check _ilm/explain for indices stuck in a step.
Attach ILM to every index template. It does not retroactively apply to existing indices.
Test ILM policies with a 1-hour delete window on dev before using real durations.

Cluster Sizing and Hardware Selection — CPU, RAM, Disk Trade-offs

Choosing hardware for Elasticsearch is a trade-off between three resources that compete against each other. The right mix depends entirely on your workload profile — and the profile changes as your data grows.

RAM is the highest-leverage resource. Elasticsearch uses the OS filesystem cache as aggressively as the JVM heap. If your hot index fits in the filesystem cache, searches are nearly instant. If it does not, every search requires disk reads. The practical rule: allocate 50% of node RAM to the JVM heap, capped at 31GB, and leave the rest for the OS. A 64GB node gets 31GB of heap for Elasticsearch and 33GB for the OS cache. Do not exceed 31GB of heap on a single node: above this threshold, the JVM switches from 4-byte compressed object pointers to 8-byte uncompressed ones, which doubles reference size and increases GC pressure. Elasticsearch 8.x ships with a bundled JDK and uses G1GC by default, which handles large heaps better than the older CMS collector, but the 31GB ceiling on heap remains the safe practical limit. If you need more memory, add nodes rather than increasing per-node heap.
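The heap rule is "half of RAM, capped at 31GB". A short sketch of that split; heap_and_cache_gb is a hypothetical helper and the 31GB cap is the practical limit described above, not a value read from the JVM.

```python
def heap_and_cache_gb(node_ram_gb, heap_cap_gb=31):
    """Split node RAM into JVM heap and memory left for the OS filesystem cache."""
    heap = min(node_ram_gb // 2, heap_cap_gb)
    return heap, node_ram_gb - heap

for ram in (32, 64, 128):
    heap, cache = heap_and_cache_gb(ram)
    print(f"{ram}GB node -> {heap}GB heap, {cache}GB for the OS cache")
```

On a 128GB node the heap stays at 31GB and the extra memory goes to the filesystem cache, which is usually where it helps search latency most.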

CPU matters most for indexing throughput and complex aggregations. Logstash with heavy Grok patterns can saturate CPU before Elasticsearch does. For data nodes, modern CPUs with high single-threaded clock speeds (3.5GHz or above) benefit search latency on sequential segment scans. Higher core counts improve bulk indexing throughput but have diminishing returns beyond 16 to 20 cores per node for typical log workloads.

Disk is where most teams underprovision. NVMe SSDs are non-negotiable for hot phase data nodes: the random I/O pattern that Elasticsearch generates during segment merges and concurrent searches will saturate spinning disks and cause indexing pauses. For warm and cold phases, SATA SSDs provide acceptable throughput at lower cost. In AWS, gp3 EBS volumes provide a 3,000 IOPS baseline and can be provisioned up to 16,000 IOPS, at lower cost than gp2. Use gp3 for data nodes. io2 is rarely justified for log workloads.

Network is frequently the bottleneck that nobody planned for. Elasticsearch shuffles large amounts of data during shard recovery, rebalancing, and snapshot creation. On a 3-node cluster recovering a 200GB shard after a node replacement, 10GbE networking copies the shard in roughly 3 minutes. On 1GbE, that is 27 minutes during which the data has no replica. Use 10GbE or better between data nodes. In cloud environments, verify that your instance type provides dedicated network bandwidth — burstable instance types that share network bandwidth will throttle under sustained replication load.
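The recovery times quoted above follow from simple arithmetic. The sketch below reproduces them, assuming the link runs at full line rate with no protocol overhead. Real recoveries are usually slower, partly because Elasticsearch throttles recovery traffic through the indices.recovery.max_bytes_per_sec setting. The function name is hypothetical.

```python
def recovery_minutes(shard_gb: float, link_gbit_per_s: float) -> float:
    """Best-case copy time for one shard over a link of the given speed."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8  # gigabits/s -> bytes/s
    seconds = shard_gb * 1e9 / bytes_per_second
    return seconds / 60


if __name__ == "__main__":
    for link in (10, 1):
        minutes = recovery_minutes(200, link)
        print(f"200GB shard over {link}GbE: about {minutes:.1f} minutes")
```

The tenfold gap between the two links is the window during which the shard has no replica, which is why the link speed between data nodes matters more than it first appears.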

Master nodes deserve dedicated resources. Three master-eligible nodes minimum for quorum. Give them 8GB RAM and 4 CPU cores — they manage cluster state, not data, so resource requirements are modest. Never run master-eligible and data roles on the same node in production. A GC pause on a data node caused by heavy indexing can delay master heartbeats, trigger unnecessary master elections, and destabilize the cluster at the worst possible moment. Keep the roles separated.

cluster-sizing-check.shBASH
#!/bin/bash
# Cluster resource health check: run this daily as part of ops review.
# Assumes Elasticsearch answers on localhost:9200 without authentication.

echo "=== Node roles, heap pressure, and disk usage ==="
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent,cpu,disk.used_percent'

echo ""
echo "=== JVM heap used percent per node ==="
# The Python snippet is single-quoted for the shell and uses only double
# quotes internally, so the shell passes it to python3 unchanged.
curl -s 'localhost:9200/_nodes/stats/jvm' | \
  python3 -c '
import json, sys
nodes = json.load(sys.stdin)["nodes"]
for nid, n in nodes.items():
    name = n["name"]
    pct = n["jvm"]["mem"]["heap_used_percent"]
    print(f"{name:20s} heap: {pct}%")
'

echo ""
echo "=== Thread pool rejections (indicates resource pressure) ==="
curl -s 'localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected' | \
  awk 'NR==1 || $5+0 > 0'  # Show header and any row with rejections > 0

echo ""
echo "=== OS memory free per node ==="
# free_in_bytes is the memory the OS reports as free. This API does not break
# out the page cache separately, so treat the figure as a rough indicator only.
curl -s 'localhost:9200/_nodes/stats/os' | \
  python3 -c '
import json, sys
nodes = json.load(sys.stdin)["nodes"]
for nid, n in nodes.items():
    name = n["name"]
    mem = n["os"]["mem"]
    free_gb = mem["free_in_bytes"] / 1024**3
    total_gb = mem["total_in_bytes"] / 1024**3
    print(f"{name:20s} OS memory free: {free_gb:.1f}GB / {total_gb:.1f}GB")
'
Output
# === Node roles, heap pressure, and disk usage ===
# name node.role heap.percent ram.percent cpu disk.used_percent
# es-data-01 d 42 78 38 35
# es-data-02 d 38 75 32 32
# es-master-01 m 22 35 8 12
#
# === JVM heap used percent per node ===
# es-data-01 heap: 42%
# es-data-02 heap: 38%
#
# === Thread pool rejections (indicates resource pressure) ===
# node_name name active queue rejected
# (no rows below the header: no thread pool has rejections)
#
# === OS memory free per node ===
# es-data-01 OS memory free: 14.1GB / 64.0GB
# es-data-02 OS memory free: 16.0GB / 64.0GB
Hardware Lessons From Production
In AWS, use gp3 EBS for data nodes — it provides 3,000 IOPS baseline at lower cost than gp2 and scales to 16,000 IOPS without changing volume type. For master nodes, m7g.large (Graviton) provides enough CPU for cluster state management at low cost. Never use burstable instance types like t3 for data nodes — network bandwidth throttling under sustained replication load causes slow shard recovery and cluster instability at exactly the wrong moments.
Production Insight
Heap above 31GB switches the JVM from 4-byte to 8-byte object references, doubling reference overhead and increasing GC pressure.
The OS filesystem cache outside the heap is often more valuable than extra heap — a large cache keeps hot index data in memory without GC overhead.
Rule: 31GB heap maximum per node. Add nodes for capacity. Keep master and data roles on separate instances in production.
Key Takeaway
31GB is the practical JVM heap ceiling per node — above this, compressed OOPs disable and GC pressure increases. Add nodes instead.
Leave 50% of node RAM for OS filesystem cache — it keeps hot segments in memory with no GC cost.
NVMe SSDs are required for hot data nodes. 10GbE networking prevents shard recovery from becoming a 30-minute event.
Dedicated master nodes prevent GC pauses on data nodes from destabilizing cluster elections.
● Production incident · POST-MORTEM · severity: high

Elasticsearch Goes Read-Only on Black Friday — 4 Hours of Lost Logs

Symptom
At 10:15 AM on Black Friday, the ops team noticed Kibana dashboards stopped updating. New transaction logs vanished. Alerting rules that depended on fresh log data went silent — no alerts fired for any new errors. The on-call engineer spent the first hour assuming PagerDuty was broken.
Assumption
The team assumed Logstash had crashed or the Kafka buffer had filled. They restarted Logstash twice and checked Kafka consumer lag. Both were fine. Logstash was happily sending documents and getting 403 BLOCKED responses back, which it logged quietly and discarded because the dead letter queue was not enabled.
Root cause
Elasticsearch has three disk watermarks. The low watermark at 85% stops new shards from being allocated to that node. The high watermark at 90% starts relocating existing shards away. The flood-stage watermark at 95% — the one that bit this team — switches every index on the affected nodes to read-only mode. ES issued 403 BLOCKED on every write attempt. No exception thrown upstream, no visible crash. Logstash dropped every rejected document because dead_letter_queue.enable was set to false, which is the default. Four hours of payment service logs gone.
Fix
1. Freed disk space immediately by deleting indices past the retention window and running a force-merge on the oldest warm-phase indices to recover segment overhead.
2. Cleared the read-only flag on all affected indices: PUT /_all/_settings with index.blocks.read_only_allow_delete set to null.
3. Enabled dead_letter_queue.enable: true in logstash.yml and set dead_letter_queue.max_bytes: 1024mb so future rejections land in a local DLQ directory rather than disappearing.
4. Added a Prometheus alert firing at 80% disk usage on Elasticsearch data nodes, well before the 85% low watermark, let alone the 95% flood stage.
5. Lowered the flood-stage watermark from the default 95% to 92% so the block triggers earlier and leaves more headroom for this cluster's growth rate.
Key lesson
  • Elasticsearch has three watermarks: 85% low (no new shards), 90% high (relocate shards), 95% flood-stage (read-only). It is the flood-stage that silently kills writes. Most monitoring setups watch the wrong threshold.
  • Logstash dead letter queue is disabled by default. Without it, every document Elasticsearch rejects — for any reason — vanishes with no record. Enable it before the first log ever ships to production.
  • Monitor disk usage with a hard alert at 80%. By the time you hit 95% and the flood-stage fires, you have no room to maneuver. At 80% you still have time to delete old indices, add nodes, or expand volumes before anything breaks.
  • Black Friday, end-of-quarter, and any traffic spike compound log volume in non-linear ways. Capacity planning based on average daily ingest will fail on peak days. Plan for 5 to 10 times your average daily ingest.
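The watermark thresholds in this incident can be expressed as a small classifier, which is roughly what a disk alert rule needs to encode. This is a sketch under the default thresholds quoted above (85%, 90%, 95%) plus the 80% alert line recommended here. The function name and the returned labels are hypothetical, not Elasticsearch output.

```python
ALERT_PERCENT = 80.0   # recommended early-warning line from this article
LOW_PERCENT = 85.0     # default low watermark: no new shards allocated
HIGH_PERCENT = 90.0    # default high watermark: shards relocated away
FLOOD_PERCENT = 95.0   # default flood stage: indices switched to read-only


def watermark_stage(disk_used_percent: float) -> str:
    """Map a node's disk usage to the most severe threshold it has crossed."""
    if disk_used_percent >= FLOOD_PERCENT:
        return "flood_stage"
    if disk_used_percent >= HIGH_PERCENT:
        return "high"
    if disk_used_percent >= LOW_PERCENT:
        return "low"
    if disk_used_percent >= ALERT_PERCENT:
        return "alert"
    return "ok"


if __name__ == "__main__":
    for pct in (35, 81, 86, 91, 96):
        print(f"{pct}% used -> {watermark_stage(pct)}")
```

Paging on the "alert" label rather than on "flood_stage" is the difference between cleaning up old indices calmly and discovering the block from missing logs.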
Production debug guide
When logs stop flowing, use this triage order to isolate the broken component in under 5 minutes. (8 entries)
Symptom · 01
Kibana shows no new logs for 10 or more minutes
Fix
Check Elasticsearch cluster health first: curl -s 'localhost:9200/_cluster/health?pretty' — if status is yellow or red, the problem is ES, not Logstash. Never restart Logstash before confirming ES is healthy. A healthy Logstash shipping to a broken ES will produce no visible output but will show increased retry counts in _node/stats.
Symptom · 02
Elasticsearch cluster status is red
Fix
Find unassigned shards and why they are unassigned: curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED. Look for NODE_LEFT (node went down) or ALLOCATION_FAILED (disk full or node limit). Then run curl -s 'localhost:9200/_cluster/allocation/explain?pretty' for a specific explanation of the first unassigned shard. In that output, a can_allocate value of "no" means allocation rules (deciders) are blocking the shard.
Symptom · 03
Elasticsearch cluster status is yellow
Fix
Yellow means primaries are assigned but replicas are not. Check why: curl -s 'localhost:9200/_cluster/allocation/explain?pretty' — the most common reasons are disk watermark exceeded on the target node, not enough nodes to place replicas on different nodes than primaries, or a per-node shard limit hit. Yellow does not mean writes are failing — data is safe but not fully replicated.
Symptom · 04
Elasticsearch is green but logs are still missing
Fix
Check Logstash pipeline stats: curl -s 'localhost:9600/_node/stats/pipelines?pretty' and look for events out significantly lower than events in, or high worker_concurrency pressure. Also check whether the dead letter queue is growing: ls -lh /var/lib/logstash/dead_letter_queue/. A growing DLQ directory means Elasticsearch is rejecting documents that Logstash sends to it.
Symptom · 05
Logstash shows high CPU but low throughput
Fix
Check Grok match rate by searching for _grokparsefailure tags in Kibana. If more than 5% of events carry this tag, your patterns do not match your actual log format and Grok is exhausting all alternatives before falling through on every miss. Use the Grok Debugger in Kibana Dev Tools to test patterns against 20 real log lines before changing production config.
Symptom · 06
Logstash heap usage above 85%
Fix
Check multiline codec usage: grep -r 'multiline' /etc/logstash/conf.d/ — multiline with large max_lines buffers pending events in JVM heap. A burst of Java stacktraces from a crashing service can spike heap from 500MB to 3GB in under a minute on a pipeline processing high-volume Java services. Set -Xmx to at least 4GB when using multiline in production and reduce max_lines to a realistic maximum stacktrace depth.
Symptom · 07
Kibana dashboards load slowly or time out
Fix
Check field count on the index pattern: curl -s 'localhost:9200/logs-*/_field_caps?fields=*&pretty' | grep -c '"type"'. If you see more than 500 fields, dynamic mapping has created a field explosion. Every Kibana page load fetches all field definitions. Use dynamic: strict in your index template and define only the fields you actually query.
Symptom · 08
Kibana shows 'Could not locate that index-pattern'
Fix
Verify the underlying index still exists: curl -s 'localhost:9200/_cat/indices?v' — if an ILM delete phase removed it or it was manually deleted, update the Kibana index pattern to use a wildcard like logs-* so it matches future indices as they are created.
★ ELK Quick Debug Cheat Sheet
Production commands for the five most common ELK failures. Copy, paste, diagnose.
Cluster health yellow or red
Immediate action
Check which shards are unassigned and why before touching anything
Commands
curl -s 'localhost:9200/_cluster/health?pretty'
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
Fix now
curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true' — retries allocation for shards that failed due to transient errors. If this does not clear them, run the allocation explain API to get the specific blocking reason.
Logstash pipeline stalled: no events flowing
Immediate action
Check pipeline worker stats and JVM heap before restarting
Commands
curl -s 'localhost:9600/_node/stats/pipelines?pretty'
curl -s 'localhost:9600/_node/stats/jvm?pretty' | grep heap_used_percent
Fix now
If heap is above 90%, restart is justified: systemctl restart logstash && tail -f /var/log/logstash/logstash-plain.log. If heap is normal but events_out is zero, check ES connectivity and DLQ size before restarting.
Disk watermark breach: ES rejecting writes with 403 BLOCKED
Immediate action
Check disk allocation across nodes, then free space before clearing the read-only flag
Commands
curl -s 'localhost:9200/_cat/allocation?v'
curl -s -XPOST 'localhost:9200/old-index-*/_forcemerge?max_num_segments=1'
Fix now
After freeing space, clear the read-only flag: curl -s -XPUT 'localhost:9200/_all/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}' — do not skip the space-freeing step or ES will re-apply the block within seconds.
Kibana dashboard timeout on load
Immediate action
Check field count on the index pattern and reduce the query time range first
Commands
curl -s 'localhost:9200/logs-*/_field_caps?fields=*&pretty' | grep -c '"type"'
curl -s 'localhost:9200/logs-*/_search' -H 'Content-Type: application/json' -d '{"size":0,"query":{"range":{"@timestamp":{"gte":"now-1h"}}}}'
Fix now
Reduce time range in Kibana to last 1 hour and reload. If field count exceeds 500, add dynamic: strict to your index template immediately and plan a reindex to clean existing mappings.
Grok parse failures flooding logs
Immediate action
Sample raw logs and test Grok patterns offline before pushing any config change
Commands
ls -lh /var/lib/logstash/dead_letter_queue/
echo 'YOUR_SAMPLE_LOG_LINE' | /usr/share/logstash/bin/logstash -e 'input{stdin{}} filter{grok{match=>{"message"=>"YOUR_PATTERN"}}} output{stdout{codec=>rubydebug}}'
Fix now
Fix the Grok pattern in /etc/logstash/conf.d/ — always include a catch-all %{GREEDYDATA:log_message} as the last alternative so events that do not match still flow through rather than getting tagged and dropped.
ELK Stack Component Comparison
Component | Primary Role | Common Production Failure | Key Metric to Monitor
Elasticsearch | Store, index, and search documents at scale using inverted indices | Flood-stage disk watermark at 95% silently switches indices to read-only: no crash, no upstream error | disk_used_percent: alert at 80%, critical at 90%, flood-stage fires at 95%
Logstash | Ingest, parse, enrich, and route log events through configurable pipelines | Grok CPU saturation from unmatched patterns, or multiline codec OOM from stacktrace bursts | jvm heap_used_percent and events_in vs events_out gap in pipeline stats
Kibana | Visualize and explore log data through dashboards and saved searches | Slow load or timeout from index patterns with 2000-plus fields caused by dynamic mapping | Index field count and dashboard response time: act when fields exceed 500
Filebeat | Collect and ship logs from hosts and containers to Logstash or Elasticsearch | Registry file corruption after unclean shutdown causing position loss: either re-ships everything or skips forward, both wrong | harvesters_running count and registry file integrity: mount registry on persistent volume

Key takeaways

1. ELK is a data pipeline with three failure domains: Elasticsearch stores, Logstash transforms, Kibana visualizes. Triage order when logs stop: ES first, Logstash second, Kibana last.
2. Inverted index maps terms to document IDs and postings lists, which is why full-text search across billions of documents completes in milliseconds.
3. Disk watermarks: 85% stops new shard allocation, 90% relocates shards, 95% flood-stage switches indices read-only. Monitor at 80%; by 95% you have no room to react.
4. Enable Logstash dead letter queue before production. Without it, every document Elasticsearch rejects vanishes with zero visibility. The DLQ is a local directory, not an ES index.
5. Shard count cannot be changed after index creation without a reindex. Calculate ceiling(daily_volume_GB / 30) before creating the first index. Shard metadata costs roughly 1MB of heap per shard.
6. ILM warm phase requires allocate with number_of_replicas: 0 before the shrink action. Missing this causes the warm phase to fail silently on every affected index.
7. Kibana dashboard panels should each answer one specific operational question. dynamic: strict on index templates prevents the field explosion that makes every Kibana page load slow.
8. 31GB is the JVM heap ceiling per Elasticsearch node. Above this, compressed OOPs disable and GC pressure increases. Add nodes instead of increasing per-node heap.
9. Kafka between Filebeat and Logstash is not optional at scale: it is your replay buffer when Logstash crashes or ES goes read-only and you need to recover lost events.
10. pipeline.ordered: false removes cross-worker coordination overhead in Logstash. Set it explicitly on log pipelines where ES timestamps handle event ordering at query time.
11. Choose the right Beat: Filebeat for logs, Metricbeat for metrics, Packetbeat for network traffic, Winlogbeat for Windows Event Logs, Heartbeat for uptime, Auditbeat for security events.

Common mistakes to avoid

10 patterns
×

Creating too many small shards by defaulting to the same shard count for every index

Symptom
Cluster state updates take 5 to 10 seconds. Search latency spikes unpredictably. Yellow status even with enough nodes because shard allocation decisions pile up for the cluster manager thread.
Fix
Target 10 to 50GB per shard. Use 1 primary shard for indices under 10GB. Calculate using ceiling(daily_volume_GB / 30) for larger volumes. Check current distribution with _cat/allocation and delete or force-merge old indices before creating new ones.
×

Using Filebeat for metrics collection when Metricbeat is the correct tool

Symptom
Complex, brittle Filebeat configurations that break on system updates. Missing metrics that Metricbeat collects automatically. High CPU usage from parsing /proc files manually.
Fix
Use Filebeat for log files only. Use Metricbeat for system and service metrics. Use Packetbeat for network traffic. Use Winlogbeat for Windows Event Logs. Use Auditbeat for Linux security events. Use Heartbeat for uptime monitoring.
×

Grok patterns that silently discard logs on parse failure

Symptom
You search Kibana for a known error that appeared in the raw application logs. Nothing. The event reached Logstash but the Grok pattern did not match, the event was tagged _grokparsefailure, and without a catch-all alternative it was dropped at the output filter stage.
Fix
Always add %{GREEDYDATA:log_message} as the last Grok alternative. Add tag_on_failure => ['_grokparsefailure_custom'] to the grok filter so failures are visible in Kibana. Create an alert on that tag count exceeding 0. Test patterns against 50 real log lines using Kibana Dev Tools Grok Debugger before deploying.
×

Not enabling Logstash dead letter queue before production

Symptom
Elasticsearch rejects documents — disk block, mapping conflict, field limit — and Logstash silently drops them. Zero visibility. No errors in Logstash logs, nothing in Kibana, no DLQ accumulating.
Fix
Add dead_letter_queue.enable: true and dead_letter_queue.max_bytes: 1024mb to logstash.yml before the first event ships. The DLQ writes to /var/lib/logstash/dead_letter_queue by default. Inspect the directory size: ls -lh /var/lib/logstash/dead_letter_queue/. If it is growing, Elasticsearch is rejecting documents. They are now captured instead of lost, but you still need to fix the root cause and replay them.
×

Using dynamic mapping on high-cardinality log data from multiple services

Symptom
Kibana Discover takes 15 to 20 seconds to load. Index pattern has 3,000-plus fields. Field count grows every time a new service logs a unique JSON key that Elasticsearch maps automatically.
Fix
Set dynamic: strict in your index template. Explicitly define the 50 to 100 fields you actually query and aggregate on. Unexpected fields are rejected at ingest — fix the logging code, not the mapping. A field explosion that already occurred requires a reindex to clean.
×

Lowering refresh_interval below 1 second for faster search

Symptom
Indexing throughput drops 40 to 60%. CPU spikes from constant segment creation and merging. Bulk request latencies increase from 200ms to 2 seconds. The improvement in search freshness is marginal and often imperceptible.
Fix
Keep refresh_interval at 1s for interactive search. For archival log indices that do not need real-time search, set it to 30s or 60s to improve indexing throughput significantly. For bulk loading jobs, set to -1 during the load and restore afterward.
×

Not configuring ILM and relying on manual index cleanup

Symptom
Old indices accumulate for months. Disk fills over a holiday weekend when nobody is on call. The cluster hits the flood-stage watermark, goes read-only, and you lose hours of logs before anyone notices.
Fix
Define an ILM policy with hot, warm, and delete phases and attach it to every index template from day one. The warm phase requires allocate with number_of_replicas: 0 before shrink. Test with a 1-hour delete window on dev before production.
×

Setting Elasticsearch JVM heap above 31GB

Symptom
GC pause duration increases above acceptable thresholds. A larger heap does not translate to better performance. It gets worse because the JVM switches from 4-byte compressed object pointers to 8-byte uncompressed ones above approximately 31GB.
Fix
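Keep -Xmx at 31GB maximum per node. Use half of node RAM for heap and half for OS filesystem cache. Add data nodes for more capacity instead of increasing per-node heap. The exact compressed-pointer threshold varies by JVM and platform, so treat 26GB as the conservative value and confirm in the Elasticsearch startup log that compressed ordinary object pointers are reported as true. On JDK 21 with generational ZGC, benchmark before exceeding 26GB as GC behavior differs from G1GC.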
×

Running Kibana on the same node as Elasticsearch data nodes

Symptom
Kibana becomes unresponsive during peak indexing loads. Elasticsearch performance also degrades because Kibana's dashboard rendering competes for CPU and I/O with segment merges and active searches.
Fix
Run Kibana on a dedicated instance. Even a small instance — 4GB RAM, 2 cores — is enough for most Kibana workloads. Keep it separate from data nodes. The operational cost of a dedicated Kibana instance is far lower than debugging resource contention during an incident.
×

Writing KQL Boolean filters without repeating the field name

Symptom
A dashboard filter like level:WARN OR ERROR returns unexpected results. The query is syntactically valid but semantically wrong: the bare ERROR is not tied to the level field, so it is matched against the default search fields instead.
Fix
Write level: WARN OR level: ERROR or level: (WARN OR ERROR). KQL accepts and, or, and not in either case, whereas Lucene query syntax requires uppercase AND, OR, NOT, so uppercase is the safer habit if you switch between the two. Field names are case-sensitive as indexed: Level: ERROR will not match level: ERROR.
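The shard-sizing rule from the first mistake in this list, ceiling(daily_volume_GB / 30) with a single primary for small indices, can be sketched as below. The 30GB divisor comes from the text; the function name is hypothetical.

```python
import math


def primary_shards(daily_volume_gb: float, target_shard_gb: float = 30.0) -> int:
    """ceiling(daily_volume_GB / target), never fewer than one primary shard."""
    return max(1, math.ceil(daily_volume_gb / target_shard_gb))


if __name__ == "__main__":
    for volume in (5, 30, 95, 300):
        shards = primary_shards(volume)
        print(f"{volume}GB/day -> {shards} primary shard(s)")
```

Running the calculation per index template, rather than reusing one shard count everywhere, is what keeps shards inside the 10 to 50GB target range.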
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR: How does Elasticsearch's inverted index work, and why is it faster than ...
Q02 · SENIOR: You notice 5% of logs are missing from Kibana. Walk through your debuggi...
Q03 · SENIOR: Design a production ELK stack for a platform ingesting 50,000 log events...
Q04 · SENIOR: Your ELK cluster just went read-only at peak traffic. You have no dead l...
Q05 · JUNIOR: What is the difference between Filebeat and Metricbeat? When would you u...
Q01 of 05 · JUNIOR

How does Elasticsearch's inverted index work, and why is it faster than scanning every document?

ANSWER
Elasticsearch builds an inverted index during document indexing. It tokenizes the text fields, normalizes terms by lowercasing and optionally stemming, and creates a mapping from each unique term to a postings list. The postings list contains document IDs, term frequencies, and position offsets. When you search for a term, ES looks it up in the sorted term dictionary — a binary search — and retrieves the pre-computed list of matching document IDs. No full document scan happens. BM25 scoring then ranks matches by term frequency, inverse document frequency, and field length normalization. Each shard is an independent Lucene index with immutable segments that get periodically merged in the background by a merge thread.
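The answer above can be made concrete with a toy inverted index. This is a teaching sketch only: it uses a plain Python dictionary rather than Lucene's on-disk term dictionary, a simple lowercase tokenizer with no stemming, and it leaves out BM25 scoring and segments entirely. All names are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to a postings list: doc ID -> positions of the term."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term][doc_id].append(position)
    return index


def search(index, term):
    """Look the term up directly; no document is scanned at query time."""
    postings = index.get(term.lower(), {})
    return sorted(postings)


docs = {
    1: "Payment failed: timeout",
    2: "Payment succeeded",
    3: "Timeout contacting payment gateway",
}
index = build_index(docs)

if __name__ == "__main__":
    print(search(index, "payment"))   # [1, 2, 3]
    print(search(index, "timeout"))   # [1, 3]
    print(index["payment"][3])        # [2]: third token of document 3
```

The length of each positions list is the term frequency for that document, which is one of the inputs a scoring function such as BM25 would consume.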
FAQ · 6 QUESTIONS

Frequently Asked Questions

01. What is the ELK Stack in simple terms?
02. Why does Elasticsearch go read-only and how do I fix it?
03. How many shards should I use for my Elasticsearch index?
04. What is the difference between Filebeat and Logstash?
05. What is the difference between KQL and Lucene in Kibana?
06. Should I run Kibana on the same server as Elasticsearch?