Advanced 17 min · March 06, 2026

ELK Stack — Elasticsearch Logstash Kibana

ELK Stack Explained: Internals, Pipelines and Production Failures

Q: What is the ELK Stack in simple terms?

ELK Stack is three open-source tools working together as a data pipeline, usually fed by a lightweight shipper. Filebeat collects logs from your servers and containers. Logstash parses and enriches them. Elasticsearch stores them in an inverted index that makes any indexed field searchable in milliseconds. Kibana puts a visualization layer on top so you can explore, alert, and dashboard against live log data. The result is a centralized system that turns millions of scattered log lines into searchable, operational intelligence.

Q: Why does Elasticsearch go read-only and how do I fix it?

Elasticsearch has three disk watermarks with different behaviors. The low watermark at 85% stops new shards from being allocated to that node. The high watermark at 90% relocates existing shards away. The flood-stage watermark at 95% is what causes read-only mode: it sets index.blocks.read_only_allow_delete on every index that has a shard on the affected node. To fix: free disk space by deleting old indices. A force-merge with only_expunge_deletes on warm indices can also reclaim space, but the merge itself needs temporary headroom. Then clear the read-only block: PUT /_all/_settings with index.blocks.read_only_allow_delete set to null. Recent Elasticsearch versions release the block automatically once usage falls below the high watermark, but verify disk is below the low watermark before expecting allocation and writes to behave normally. To prevent it: monitor at 80% with a hard alert, use ILM to auto-delete old indices, and calculate for peak traffic volume, not average.
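The three thresholds and the unblock call can be sketched in a few lines of Python. This is a minimal sketch, not an official client: the function names are illustrative, the thresholds are the defaults quoted above, and the request is only built here, not sent.

```python
import json

# Default disk watermarks quoted in the text, as percent of disk used.
LOW_PCT, HIGH_PCT, FLOOD_PCT = 85.0, 90.0, 95.0


def watermark_stage(disk_used_pct):
    """Classify a node's disk usage against the three default watermarks."""
    if disk_used_pct > FLOOD_PCT:
        return "flood_stage"  # indices with shards on this node become read-only
    if disk_used_pct > HIGH_PCT:
        return "high"         # shards are relocated away from this node
    if disk_used_pct > LOW_PCT:
        return "low"          # no new shards are allocated to this node
    return "ok"


def clear_read_only_request():
    """Build (method, path, body) for removing the flood-stage block.

    Setting the value to null resets it. Free disk space first,
    otherwise the block is simply applied again.
    """
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return "PUT", "/_all/_settings", body


print(watermark_stage(92.3))
print(clear_read_only_request())
```

Feeding `watermark_stage` the disk percentage from your monitoring agent at an 80% alert threshold leaves room to act before any of the three stages is reached.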

Q: How many shards should I use for my Elasticsearch index?

Target 10 to 50GB per shard. For a daily log index: calculate ceiling(daily_volume_GB / 30) for the primary shard count. A daily index receiving 30GB uses 1 shard. One receiving 150GB uses 5 shards. Keep total shards per data node under heap_GB times 20 — a node with 30GB heap should have no more than around 600 shards total. You cannot change shard count after index creation without a full reindex, so set this in an index template before the first document arrives.
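The arithmetic above is easy to get wrong under pressure, so here is a small worked sketch. The function names are illustrative; the 30GB divisor and the 20-shards-per-GB-of-heap ceiling are the rules of thumb from the answer, not hard limits.

```python
import math

TARGET_SHARD_GB = 30       # divisor used in the text, inside the 10 to 50GB range
SHARDS_PER_HEAP_GB = 20    # rule-of-thumb ceiling quoted in the text


def primary_shards(daily_volume_gb, target_gb=TARGET_SHARD_GB):
    """ceiling(daily_volume_GB / target), never fewer than one shard."""
    return max(1, math.ceil(daily_volume_gb / target_gb))


def max_shards_per_node(heap_gb):
    """Upper bound on total shards (primaries plus replicas) for one data node."""
    return heap_gb * SHARDS_PER_HEAP_GB


for volume in (30, 150):
    print(f"{volume}GB/day -> {primary_shards(volume)} primary shard(s)")
print(f"30GB heap -> at most {max_shards_per_node(30)} shards on the node")
# 30GB/day -> 1 primary shard(s)
# 150GB/day -> 5 primary shard(s)
# 30GB heap -> at most 600 shards on the node
```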

Q: What is the difference between Filebeat and Logstash?

Filebeat is a lightweight Go agent that runs on each host. It tails log files, maintains a registry of read positions, and ships events forward with minimal processing. It is stateful — registry file corruption after an unclean shutdown can cause it to re-ship everything or skip ahead, both of which are data loss events. Logstash is a heavier JVM-based processing engine that parses, enriches, and routes events through configurable pipelines. The standard pattern for high-volume production: Filebeat on every host shipping to Kafka, which buffers the stream so Logstash can consume at its own pace, parse events, and write to Elasticsearch.

Q: What is the difference between KQL and Lucene in Kibana?

KQL (Kibana Query Language) is the default modern query syntax. It provides autocomplete suggestions, syntax highlighting, and immediate error feedback: when a query is invalid, Kibana shows where the syntax breaks before the query runs. KQL supports nested fields, existence checks, and range queries with cleaner syntax. Use KQL for all ad-hoc exploration and dashboard panels. Lucene is the legacy syntax, parsed by Elasticsearch's query_string parser. Lucene supports regex, fuzzy queries, and proximity searches that KQL does not. The trade-off is that Lucene does not prevent you from writing queries that are syntactically valid but semantically wrong. Reserve Lucene for advanced use cases where KQL falls short.

Q: Should I run Kibana on the same server as Elasticsearch?

No. Kibana can consume meaningful CPU and I/O when rendering complex dashboards, especially on index patterns with thousands of fields. Running it on the same instance as an Elasticsearch data node creates resource contention — Kibana steals cycles from segment merges and active searches, and ES GC pauses make Kibana unresponsive. Deploy Kibana on a separate instance. Even a 4GB RAM, 2-core instance is sufficient for most Kibana workloads. Keep it separate from both data nodes and master nodes.

ELK Stack internals most engineers never learn — inverted indices, Logstash pipelines that stall, and disk watermarks that kill clusters silently.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

Last updated: July 18, 2026

Before you start ⏱ 30 min

✓ Production DevOps experience
✓ Deep understanding of the tool's internals
✓ Experience debugging distributed systems

⚡Quick Answer

ELK Stack = Elasticsearch (search/storage) + Logstash (ingest/transform) + Kibana (visualize)
Elasticsearch uses inverted indices — term dictionary maps tokens to document IDs for sub-second search
Each shard is a Lucene index; disk footprint = raw data size x (1 + replicas) x 1.2, regardless of shard count
Logstash pipelines: inputs -> filters -> outputs; Grok is the CPU bottleneck on unstructured logs
Kibana dashboards should answer one operational question — not display every metric you have
Disk watermarks: low at 85% stops new shard allocation, high at 90% relocates shards, flood-stage at 95% switches indices to read-only. Monitor at 80% or you will be reacting instead of preventing.

✦ Definition~90s read

What is ELK Stack?

The ELK Stack — Elasticsearch, Logstash, Kibana — is the de facto open-source platform for centralized logging, metrics, and observability at scale. Elasticsearch handles distributed search and analytics via inverted indices and sharded clusters; Logstash provides a server-side data processing pipeline that ingests, transforms, and ships data; Kibana offers visualization and dashboarding on top of Elasticsearch.

The stack solves the fundamental problem of making machine-generated data (logs, metrics, traces) queryable and actionable across hundreds or thousands of servers, replacing ad-hoc grep-and-tail workflows with structured, real-time analysis. In production, it's the backbone for everything from debugging microservice failures to monitoring CDN edge nodes — but it's also where teams commonly hit performance cliffs, pipeline backpressure, and index mapping explosions if they don't understand the internals.

Each component has a distinct role and failure mode. Beats (Filebeat, Metricbeat, Packetbeat) are lightweight, single-purpose agents that ship data from edge nodes — Filebeat tails log files with minimal overhead, Metricbeat collects system/service metrics, Packetbeat sniffs network traffic.

They replace heavier agents like syslog-ng or custom scrapers. Elasticsearch's inverted index is the core: it tokenizes documents into term-document mappings, enabling sub-second full-text search across terabytes, but misconfigured analyzers or dynamic mapping can silently corrupt query results or bloat indices.

Logstash pipelines chain input, filter, and output plugins — the filter stage (grok, date, mutate, geoip) is where most data transformation happens and where regex backtracking kills throughput. Kibana dashboards are only as good as the underlying index patterns and aggregations; without careful design, they become unreadable scatter plots or time-series that mask root causes.

Alternatives exist: Grafana Loki, often paired with Prometheus, aggregates logs while indexing only labels rather than full text (cheaper, but weaker search), Datadog and Splunk offer managed SaaS with higher cost and less control, and ClickHouse can replace Elasticsearch for structured analytics with better compression. Don't use ELK for ephemeral debugging on a single box; use journalctl or tail -f.

Don't use it for real-time alerting on high-cardinality metrics — use Prometheus with Alertmanager. The stack shines when you need to correlate logs, metrics, and traces across distributed systems, with retention measured in weeks or months, and query latency under a second.

Production failures almost always trace back to three things: unbounded memory in Logstash pipelines, unoptimized Elasticsearch mappings causing disk I/O storms, or Kibana dashboards that try to visualize every field instead of answering a specific operational question.

Plain-English First

Imagine your entire city's 911 call center receives thousands of calls a day from every neighborhood. Logstash is the operator who answers every call, cleans up the noise, and routes it to the right file. Elasticsearch is the giant filing cabinet that stores every call record in a way that lets you find any detail in milliseconds. Kibana is the big screen on the wall that turns all those records into live charts so the chief can see exactly what's happening across the city right now. The ELK Stack is that whole system — for your software.

Every production system lies. Not intentionally — but without proper observability, your application fails silently, degrades mysteriously, and wakes you at 3am with zero context. Log files exist, but a 400GB flat log file on a server nobody SSHs into anymore is just expensive noise. The ELK Stack transforms that noise into signal: structured, searchable, visualized intelligence about everything your infrastructure is doing, in real time.

The core problem ELK solves is the gap between raw log data and actionable insight. A typical microservices platform produces logs from dozens of services, each in a slightly different format, scattered across hundreds of containers. Correlating a failed payment transaction across an API gateway, an auth service, a Kafka consumer, and a Postgres adapter — without a centralized log aggregation system — is an exercise in madness. ELK gives every log line a home, a shape, and a timeline.

By the end you will understand how Elasticsearch actually indexes and retrieves documents under the hood, how to build Logstash pipelines that handle real-world log formats including multiline stacktraces, how to design Kibana dashboards that answer operational questions rather than just looking impressive in a quarterly review, and exactly where production deployments fall apart and how to prevent it. The incidents in this article are real. The fixes are the ones that actually worked.

What ELK Stack Is and How the Components Connect

ELK is not three tools bolted together. It is a data pipeline with three distinct failure domains, and understanding how data flows between them is what separates engineers who can debug it from engineers who restart services and hope.

Data originates on your hosts and containers. Filebeat — a lightweight Go agent — tails log files and ships events forward. It is stateful: Filebeat maintains a registry file tracking its read position in every file it monitors. If that registry file is corrupted by an unclean shutdown (common on spot instances), Filebeat loses its position and either re-ships everything from the start or skips forward to the current file end, depending on configuration. Always run Filebeat with its registry on a persistent volume and set close_inactive to a sensible value so file handles do not accumulate.

Filebeat ships to Logstash, or — in high-volume environments — to Kafka first. The Kafka buffer is not optional at scale. It absorbs traffic spikes so Logstash does not receive a 10x burst and OOM. It also means a Logstash restart does not lose data — Kafka holds the events until Logstash recovers and resumes consuming from its committed offset. Running Logstash reading directly from files in a high-volume environment is fragile. Add Kafka as the buffer between collection and processing.

Logstash reads from Kafka, applies filters to parse and enrich each event, and writes structured documents to Elasticsearch. The pipeline is: inputs -> filters -> outputs. Inputs run in their own threads and push events onto a queue; pipeline workers pull batches from that queue and run the filter and output stages together. The filter stage is where CPU is spent and where most production problems originate.

Elasticsearch receives structured JSON documents, indexes them into an inverted index, and serves search and aggregation queries. Kibana connects to Elasticsearch and renders the results.

The triage order when logs stop flowing is always: Elasticsearch first, then Logstash, then Kibana. Storage failures cascade upstream. A healthy Logstash shipping to a broken Elasticsearch looks, from the outside, identical to a broken Logstash — events simply stop appearing in Kibana. Check ES health before anything else.
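The triage order can be written down as a tiny decision function. This is a sketch under stated assumptions: the inputs are presumed to come from Elasticsearch's `GET /_cluster/health`, the Logstash monitoring API (`GET /_node/stats/pipelines` on port 9600, comparing the `events.out` counter between two samples), and Kibana's `GET /api/status`. The function name and the "no events emitted" rule are illustrative, not a standard tool.

```python
def first_failing_domain(es_status, logstash_events_out_delta, kibana_reachable):
    """Return the first unhealthy domain in triage order, or None.

    es_status: "green", "yellow", "red", or None when ES did not answer
    logstash_events_out_delta: events emitted between two stats samples
    kibana_reachable: whether Kibana's status endpoint answered
    """
    if es_status in (None, "red"):
        return "elasticsearch"
    if logstash_events_out_delta <= 0:
        return "logstash"
    if not kibana_reachable:
        return "kibana"
    return None


# Logstash looks stalled, but the real fault is upstream storage.
print(first_failing_domain("red", 0, True))     # elasticsearch
print(first_failing_domain("green", 0, True))   # logstash
```

Cluster health alone does not reveal a flood-stage write block, so a real check would also look at per-node disk usage before clearing Elasticsearch.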

In 2026, Elastic also offers OpenTelemetry-native ingestion and the Elastic Agent as a replacement for the Filebeat plus Logstash combination. The Elastic Agent consolidates collection and processing into a single managed binary with central policy control through Fleet. For new deployments, evaluate Elastic Agent rather than defaulting to the classic Filebeat-Logstash split. For existing deployments, migration is straightforward but not mandatory — the classic stack still works and is fully supported.

docker-compose-elk.yml (YAML)

# Minimal ELK stack for local development and testing
# Shows the actual data flow: Filebeat -> Logstash -> Elasticsearch <- Kibana
# For production, replace single-node ES with a proper cluster and add Kafka between Filebeat and Logstash

version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false       # Dev only — always enable security in production
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
      - cluster.routing.allocation.disk.watermark.low=85%
      - cluster.routing.allocation.disk.watermark.high=90%
      - cluster.routing.allocation.disk.watermark.flood_stage=95%  # Read-only at 95%
    ports:
      - "9200:9200"
    volumes:
      - esdata:/usr/share/elasticsearch/data
    healthcheck:
      test: ["CMD-SHELL", "curl -sf 'localhost:9200/_cluster/health' | grep -v '\"status\":\"red\"'"]
      interval: 10s
      timeout: 5s
      retries: 10

  logstash:
    image: docker.elastic.co/logstash/logstash:8.13.0
    volumes:
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
    environment:
      - LS_JAVA_OPTS=-Xms2g -Xmx2g         # Equal min and max heap avoids resize pauses
    ports:
      - "5044:5044"  # Beats input
    depends_on:
      elasticsearch:
        condition: service_healthy

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      elasticsearch:
        condition: service_healthy

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.13.0
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - filebeatdata:/usr/share/filebeat/data  # Persistent registry — survives container restarts
    depends_on:
      - logstash

volumes:
  esdata:
  filebeatdata:  # Registry persistence prevents re-shipping on restart

Output

# docker compose up -d
# Creating network elk_default
# Creating elk_elasticsearch_1 ... done
# Creating elk_logstash_1 ... done
# Creating elk_kibana_1 ... done
# Creating elk_filebeat_1 ... done

# Verify data flow:
# curl -s 'localhost:9200/_cat/indices?v' -- should show filebeat-* indices after ~30s
# curl -s 'localhost:9200/_cluster/health?pretty' -- should show green, or yellow on a single node where replicas cannot be assigned

🔥The Three Failure Domains

Each ELK component can fail independently and in ways that look identical from the outside. Kibana showing no data could be a Kibana configuration problem, a Logstash pipeline stall, an Elasticsearch write rejection, or a Filebeat registry corruption. The triage order is always: Elasticsearch health first, then Logstash pipeline stats, then Kibana index pattern configuration. Jumping to Kibana when ES is the problem wastes everyone's time during an incident.

📊 Production Insight

ELK is not just three tools — it is a data pipeline with three failure domains.

When logs stop flowing, isolate which domain broke before touching anything.

Triage order: Elasticsearch first, Logstash second, Kibana last — storage failures cascade upstream.

In 2026, evaluate Elastic Agent for new deployments — it consolidates Filebeat and Logstash into a single managed binary with Fleet-based policy management.

🎯 Key Takeaway

ELK is a pipeline: Filebeat collects, Kafka buffers (at scale), Logstash transforms, Elasticsearch stores, Kibana visualizes.

Filebeat registry corruption after unclean shutdown causes silent data loss — always mount registry on a persistent volume.

Triage order when logs stop: ES health first, then Logstash stats, then Kibana config.

thecodeforge.io

Elk Stack

Beats Family — Filebeat, Metricbeat, Packetbeat, and When to Use Each

The Beats family is Elastic's collection of lightweight data shippers. Each Beat is purpose-built for a specific data type — logs, metrics, network packets — and runs as a single binary with minimal configuration. Understanding which Beat to use for which job prevents the mistake of forcing Filebeat to collect metrics or Metricbeat to tail log files.

Filebeat is the workhorse for log collection. It tails files, follows symlinks, handles rotation, and ships raw log lines to Logstash or directly to Elasticsearch. It maintains a registry — a local file tracking read positions — so a restart does not re-ship the same lines. Filebeat supports multiline aggregation, which is critical for Java stacktraces. Configure multiline in Filebeat rather than Logstash whenever possible to reduce Logstash heap pressure.

Metricbeat collects system and service metrics. It runs modules that know how to talk to specific services: MySQL, PostgreSQL, Redis, Nginx, Kafka, Docker, Kubernetes. Metricbeat pulls metrics from each module on a configurable period. The output is numerical time-series data, not raw log lines. Do not use Filebeat to read /proc/stat; use Metricbeat with the system module.

Packetbeat captures and parses network traffic. It runs as a packet sniffer using libpcap (Linux) or Npcap (Windows), decoding protocols like HTTP, MySQL, PostgreSQL, Redis, Thrift, and DNS. Packetbeat reconstructs full transactions from packets, so it can show you every SQL query or HTTP request/response pair that crosses your network segment.

Auditbeat collects security audit events from your Linux kernel using the Linux Audit Framework. It ships user logins, privilege escalations (sudo), file integrity events (when critical configs change), and process execution logs. Auditbeat is the right tool for compliance auditing (SOC2, PCI-DSS) and security monitoring.

Heartbeat performs uptime monitoring. It pings services (ICMP), connects to TCP ports, or checks HTTP endpoints for expected status codes and response body patterns. Heartbeat sends synthetic check results as documents, which you can alert on for service availability. It is not a log collector; the name refers to the periodic check signal it emits.

Winlogbeat captures Windows Event Logs: Application, Security, Setup, System, and forwarded events. If your infrastructure includes Windows servers, Winlogbeat is the purpose-built Beat for getting Windows Event Logs into Elasticsearch reliably. Do not try to tail raw .evtx files with Filebeat.

beats-comparison.yml (YAML)

# beats-comparison.yml — Quick reference for choosing the right Beat

# ============================================================
# FILEBEAT — Log files (application logs, JSON logs, plaintext)
# ============================================================
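filebeat.inputs:
- type: filestream          # Replaces the deprecated `log` input
  id: app-logs              # filestream inputs need a unique id
  enabled: true
  paths:
    - /var/log/app/*.log
  parsers:
    - multiline:
        type: pattern
        pattern: '^\d{4}-\d{2}-\d{2}'  # New log lines start with timestamp
        negate: true
        match: after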

# ============================================================
# METRICBEAT — System metrics + service metrics
# ============================================================
metricbeat.modules:
- module: system
  period: 10s
  metricsets: ["cpu", "memory", "diskio", "filesystem", "load", "process"]
- module: nginx
  period: 30s
  hosts: ["http://localhost/status"]
- module: docker
  period: 10s
  hosts: ["unix:///var/run/docker.sock"]

# ============================================================
# PACKETBEAT — Network protocol analysis
# ============================================================
packetbeat.interfaces.device: eth0
packetbeat.protocols:
- type: http
  ports: [80, 8080]
  send_headers: ["User-Agent", "Content-Type"]  # Avoid Authorization or Cookie: they carry credentials
- type: mysql
  ports: [3306]
- type: pgsql
  ports: [5432]
- type: redis
  ports: [6379]

# ============================================================
# WINLOGBEAT — Windows Event Logs
# ============================================================
winlogbeat.event_logs:
- name: Application
  ignore_older: 72h
- name: Security
  ignore_older: 72h
- name: System
- name: Setup

# ============================================================
# HEARTBEAT — Uptime monitoring
# ============================================================
heartbeat.monitors:
- type: http
  name: Production API
  urls: ["https://api.example.com/health"]
  schedule: '@every 30s'
  check.response.status: [200]
- type: tcp
  name: MySQL
  hosts: ["db.example.com:3306"]
  schedule: '@every 1m'

# ============================================================
# AUDITBEAT — Linux audit framework
# ============================================================
auditbeat.modules:
- module: auditd
  audit_rules: |
    -w /etc/passwd -p wa -k identity
    -w /etc/nginx -p wa -k config
    -a always,exit -S execve -k process_execution

Output

# Each Beat outputs JSON documents to Elasticsearch or Logstash
# Install: sudo apt install filebeat / yum install filebeat (for each Beat)
# Configure: /etc/filebeat/filebeat.yml (or metricbeat.yml, packetbeat.yml)
# Test: filebeat test config -e
# Start: sudo systemctl start filebeat

💡Beats Selection Rule of Thumb

| Data Type | Correct Beat | Wrong Beat |
|-----------|--------------|------------|
| Application log files (JSON, plaintext) | Filebeat | Metricbeat (not designed for logs) |
| CPU, memory, disk, process metrics | Metricbeat | Filebeat (would require reading /proc manually) |
| HTTP requests, SQL queries on wire | Packetbeat | Filebeat (not a packet sniffer) |
| Windows Event Logs | Winlogbeat | Filebeat (cannot parse .evtx natively) |
| Service uptime monitoring | Heartbeat | Metricbeat (can work but Heartbeat is purpose-built) |
| Linux security audit events | Auditbeat | Filebeat (would miss kernel audit context) |

📊 Production Insight

A team tried to use Filebeat to collect Docker container metrics by reading /proc/stat files directly. The configuration was complex, brittle, and broke on every Docker restart. Switching to Metricbeat with the Docker module reduced configuration from 200 lines to 20 and recovered metrics that were previously missing. Rule: use the Beat designed for your data type, not the one you already have installed.

🎯 Key Takeaway

Filebeat = log files, Metricbeat = system/service metrics, Packetbeat = network traffic, Winlogbeat = Windows Event Logs, Heartbeat = uptime monitoring, Auditbeat = security audit events. Choose the right Beat for the data type — forcing a file-based Beat to collect metrics is fragile and maintenance-heavy.

How Elasticsearch Actually Indexes Documents — Inverted Indices Under the Hood

Elasticsearch does not search documents. It searches an inverted index — a data structure that maps every unique term to the list of documents that contain it. When you index a document, Elasticsearch tokenizes the text, normalizes case, applies stemming if configured, and writes each token into a term dictionary. The term dictionary points to a postings list: document IDs, term frequency, and position offsets.

This is why Elasticsearch is fast at full-text search. You are not scanning every document. You are looking up a term in a sorted dictionary and getting back a pre-computed list of matching document IDs. BM25 scoring then ranks those matches by term frequency, inverse document frequency, and field length normalization.
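To make the term dictionary and postings list concrete, here is a toy inverted index in Python. It is a teaching sketch only: the regex tokenizer is a rough stand-in for a real analyzer, there is no stemming or BM25 scoring, and the document texts other than the first are invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a rough stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map term -> {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search_all(index, query):
    """Return ids of docs containing every query term (postings intersection)."""
    postings = [set(index.get(term, {})) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Payment failed for user 4821 timeout after 30s",
    2: "Payment accepted for user 1177",
    3: "Upstream timeout talking to auth service",
}
index = build_index(docs)
print(index["timeout"])                              # {1: [5], 3: [1]}
print(sorted(search_all(index, "payment timeout")))  # [1]
```

The search never touches the document texts: it looks up two terms and intersects two small sets, which is the property that keeps lookups fast as the corpus grows.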

Each Elasticsearch shard is an independent Lucene index. Lucene segments are immutable — once written, they never change. New or updated documents go into an in-memory buffer, then get flushed to a new segment on refresh, which defaults to every 1 second. This means there is a 1-second window where a newly indexed document is not yet searchable. If you need sub-second search freshness, the answer is not lowering the refresh interval — it will kill indexing throughput because every refresh triggers segment creation and eventual merges.

Segment merging happens in the background and is a silent performance killer when misconfigured. Too many small segments accumulate when indexing is faster than merging. The merge thread then consumes I/O and CPU, spiking latency for active searches. Monitor segment count per shard with _cat/segments — more than 100 segments per shard is a sign your merge policy needs tuning. For bulk indexing jobs, set refresh_interval to 30s or -1 during the load, force a refresh when done, then restore the interval.
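The bulk-load sequence (disable refresh, load, refresh once, restore) is worth wrapping so the restore step cannot be forgotten. The sketch below assumes a caller-supplied `send(method, path, body)` function that performs the HTTP request; that helper and the index name are hypothetical, while the `_settings` and `_refresh` endpoints are the standard Elasticsearch ones.

```python
from contextlib import contextmanager


@contextmanager
def bulk_load_mode(send, index, restore_interval="1s"):
    """Disable refresh for a bulk load, then refresh once and restore.

    `send(method, path, body)` is a caller-supplied function that performs
    the HTTP request against Elasticsearch (hypothetical, not a library API).
    """
    settings_path = f"/{index}/_settings"
    send("PUT", settings_path, {"index": {"refresh_interval": "-1"}})
    try:
        yield
    finally:
        # Runs even if the load fails, so refresh is never left disabled.
        send("POST", f"/{index}/_refresh", None)
        send("PUT", settings_path, {"index": {"refresh_interval": restore_interval}})


# Dry run with a recorder in place of a real HTTP client.
calls = []
with bulk_load_mode(lambda *request: calls.append(request), "logs-2026.04.25"):
    calls.append(("POST", "/_bulk", "...ndjson payload..."))

for method, path, _ in calls:
    print(method, path)
```

Putting the restore in a `finally` block is the design point: an index left at `-1` after a failed load silently stops showing new documents.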

Field mapping is where most teams create invisible performance problems. Every field you add increases the inverted index size and slows indexing. Use dynamic: strict in your index templates to reject unexpected fields and define only the fields you actually search or aggregate on. A common mistake is indexing full HTTP request bodies as a single text field, then wondering why searches are slow. Use index: false for fields you store but never query.
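As an illustration of the mapping advice, the sketch below builds an index template body with `dynamic: strict` and one stored-but-unsearchable field, then shows which fields of a sample document a strict mapping would reject. The `applogs-*` pattern and every field name are hypothetical; the body shape follows the composable `_index_template` API.

```python
import json

# Hypothetical template for an "applogs-*" index family; all names are illustrative.
INDEX_TEMPLATE = {
    "index_patterns": ["applogs-*"],
    "template": {
        "mappings": {
            "dynamic": "strict",  # reject documents that carry unmapped fields
            "properties": {
                "@timestamp": {"type": "date"},
                "level": {"type": "keyword"},
                "service": {"type": "keyword"},
                "message": {"type": "text"},
                # Kept in _source for display, but not searchable.
                "request_body": {"type": "text", "index": False},
            },
        }
    },
}


def unmapped_fields(doc, template=INDEX_TEMPLATE):
    """Top-level fields that a strict mapping would reject."""
    known = template["template"]["mappings"]["properties"]
    return sorted(set(doc) - set(known))


print(json.dumps(INDEX_TEMPLATE, indent=2))  # body for PUT /_index_template/<name>
print(unmapped_fields({"level": "error", "user_agent": "curl/8.0"}))  # ['user_agent']
```

Running a check like `unmapped_fields` against sample events before rollout shows which producers would start getting rejections once the strict template is live.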

High-cardinality keyword fields deserve specific mention. Trace IDs, request IDs, and session tokens as keyword fields create term dictionaries with millions of unique values that cannot fit in RAM. Every search touching those fields forces disk lookups. Either do not index them as keywords, or use a separate index with appropriate settings for correlation lookups rather than mixing them into your main search index.

inspect-inverted-index.sh (BASH)

#!/bin/bash
# See exactly how Elasticsearch tokenizes and stores a document
# The _termvectors API exposes the inverted index directly

# Index a sample document
curl -s -XPOST 'localhost:9200/logs/_doc/1' \
  -H 'Content-Type: application/json' \
  -d '{
    "message": "Payment failed for user 4821 timeout after 30s",
    "service": "payment-api",
    "level": "error",
    "@timestamp": "2026-04-25T10:30:00Z"
  }'

# Force refresh so the document is searchable immediately
curl -s -XPOST 'localhost:9200/logs/_refresh'

# See how ES tokenized the message field — this IS the inverted index
curl -s -XGET 'localhost:9200/logs/_termvectors/1?fields=message&pretty'
# Output shows each token, its frequency, and its position:
# "payment"  -> term_freq: 1, position: 0
# "failed"   -> term_freq: 1, position: 1
# "user"     -> term_freq: 1, position: 3
# "timeout"  -> term_freq: 1, position: 5

# Search uses the inverted index — no full document scan
curl -s -XGET 'localhost:9200/logs/_search?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"query": {"match": {"message": "timeout"}}}'

Output

# _termvectors output (abbreviated):
# {
#   "term_vectors": {
#     "message": {
#       "terms": {
#         "payment": { "term_freq": 1, "tokens": [{ "position": 0 }] },
#         "failed":  { "term_freq": 1, "tokens": [{ "position": 1 }] },
#         "timeout": { "term_freq": 1, "tokens": [{ "position": 5 }] }
#       }
#     }
#   }
# }

Mental Model

Inverted Index Mental Model

Think of the inverted index like a book's index at the back — it does not store the chapters, it tells you which pages contain each keyword. Searching is looking up the index, not reading every page.

Document goes in -> ES tokenizes text into individual terms and writes each to the term dictionary
Each term gets a postings list: which docs contain it, how often, and at which position
Search = dictionary lookup + postings list intersection — that is why it is fast on billions of documents
Segments are immutable; updates create new segments, old ones get merged in the background by the merge thread
Refresh interval (1s default) controls the trade-off between search freshness and indexing throughput — do not lower it below 1s

📊 Production Insight

A 100GB dataset with 5 shards and 1 replica consumes roughly 240GB of disk, not 100GB. The shard count splits the data; it does not multiply it.

Each shard copy (primary + each replica) is a full physical copy. Plus segment merge overhead adds 10-20%.

Rule: calculate disk as raw_data_size x (1 + replicas) x 1.2 before you allocate storage.

For high-cardinality keyword fields like trace IDs or request IDs, create a separate correlation index rather than mixing them into your main search index.
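The sizing rule above can be checked with a few lines of arithmetic. The function names are illustrative. The second function is an added assumption, not something the rule states: it sizes the volume so the required data sits at the 80% alert threshold instead of above it.

```python
def disk_required_gb(raw_data_gb, replicas, merge_overhead=1.2):
    """raw_data_size x (1 + replicas) x overhead, per the rule above."""
    return round(raw_data_gb * (1 + replicas) * merge_overhead, 1)


def provisioned_gb(required_gb, alert_at_pct=80.0):
    """Assumption: provision so `required_gb` lands at the alert threshold."""
    return round(required_gb * 100.0 / alert_at_pct, 1)


need = disk_required_gb(500, replicas=1)
print(need)                  # 1200.0
print(provisioned_gb(need))  # 1500.0
```

Shard count is deliberately not a parameter: splitting the same data across more primaries changes how it is distributed, not how much of it there is.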

🎯 Key Takeaway

Inverted index maps terms to document IDs and postings lists — that is the engine behind every full-text search.

Disk usage is raw_data x (1 + replicas) x 1.2 — calculate before deploying.

Lowering refresh_interval below 1s kills indexing throughput for negligible search freshness gain.

dynamic: strict in index templates prevents field explosion from bloating mappings and slowing Kibana load times.

Choosing the Right Shard Strategy

If: Daily log volume under 5GB
→ Use: 1 primary shard per daily index. Do not over-shard small datasets: shard overhead exceeds any parallelism benefit below this threshold.

If: Daily log volume 5GB to 50GB
→ Use: 3 to 5 primary shards per daily index, targeting 10 to 30GB per shard. This keeps recovery fast and gives enough parallelism for concurrent searches.

If: Daily log volume over 50GB
→ Use: ceiling(daily_volume_GB / 30) primary shards. Keep total shards per node under 20 per GB of heap; shard metadata overhead accumulates in heap at roughly 1MB per shard.

If: Need sub-second search latency on aggregations
→ Use: Add 1 to 2 replicas across additional data nodes. Replicas serve read traffic in parallel. Latency improves as read load is distributed, but disk usage multiplies.

Logstash Pipelines — Ingest, Transform, Ship and Where They Break

Logstash receives raw data from inputs, transforms it through filters, and ships structured events to outputs. The pipeline is linear: input -> filter -> output. Inputs run in their own threads and feed a queue; pipeline workers pull batches from that queue and run the filter and output stages together. The number of worker threads is controlled by pipeline.workers, which defaults to the number of available CPU cores. Understanding the threading model is the first step to understanding why pipelines stall.

The Grok filter is where most Logstash performance problems originate. Grok combines regular expressions with named capture groups to extract structured fields from unstructured text. A pattern like %{COMBINEDAPACHELOG} expands to a 200-character-plus regex. When your log format does not match the pattern, Grok tries every alternative before failing, and unanchored patterns make each failed attempt more expensive because the regex engine backtracks across the whole line. In a pipeline processing 10,000 events per second with a 5% failure rate, that is 500 events per second that each pay the full cost of every alternative. Anchor patterns with ^ and add a catch-all as the last alternative: %{GREEDYDATA:log_message}. Grok does not drop a mismatched event; it tags it _grokparsefailure and passes it on unparsed. The catch-all keeps the raw line in a known field, so tag those events yourself to keep the mismatch visible.
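The anchored-patterns-plus-catch-all idea can be shown with plain Python regexes. This is an analogy, not Logstash code: Grok uses its own pattern library and regex engine, and the two patterns and sample lines below are made up for illustration. Only the `_grokparsefailure` tag name is borrowed from Grok.

```python
import re

# Ordered from most to least specific, each anchored so a mismatch fails fast.
PATTERNS = [
    re.compile(r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\S+) (?P<level>[A-Z]+) (?P<log_message>.*)$"),
    re.compile(r"^(?P<level>[A-Z]+): (?P<log_message>.*)$"),
]


def parse(line):
    """Return structured fields; unmatched lines are kept whole and tagged."""
    for pattern in PATTERNS:
        match = pattern.match(line)
        if match:
            return {**match.groupdict(), "tags": []}
    # Catch-all: the event still flows, and the tag makes the mismatch visible.
    return {"log_message": line, "tags": ["_grokparsefailure"]}


print(parse("2026-04-25T10:30:00Z ERROR Payment failed for user 4821"))
print(parse("free-form line from a legacy service"))
```

Counting events that carry the failure tag gives a direct measure of how much traffic is falling through to the catch-all, which is the number to watch after any log format change.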

Multiline event handling is the second major trap. Java stacktraces and Python tracebacks span multiple lines. Logstash's multiline codec aggregates them into a single event by buffering pending lines in JVM heap. A burst of stacktraces from a crashing service — which is exactly when you most need your logs — can spike heap usage from 500MB to 3GB in under a minute. Set -Xmx to at least 4GB when using multiline in production. Reduce max_lines to a realistic ceiling (200 is usually enough for stacktraces) so a runaway exception chain cannot consume unlimited heap.
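The buffering behaviour behind multiline aggregation, and the reason a line cap matters, can be sketched as a small generator. It is not the codec's implementation: the date pattern, the sample stacktrace, and the flush-on-cap policy are assumptions chosen for illustration.

```python
import re

NEW_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2}")  # a new event starts with a date


def group_multiline(lines, max_lines=200):
    """Yield events: a timestamped line plus the continuation lines after it.

    When an event reaches max_lines it is flushed early, so one runaway
    stacktrace cannot grow the buffer without bound (one possible policy).
    """
    buffer = []
    for line in lines:
        starts_event = bool(NEW_EVENT.match(line))
        if buffer and (starts_event or len(buffer) >= max_lines):
            yield "\n".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "\n".join(buffer)


raw = [
    "2026-04-25 10:30:00 ERROR Payment failed",
    "java.net.SocketTimeoutException: Read timed out",
    "    at com.example.PaymentClient.call(PaymentClient.java:42)",
    "2026-04-25 10:30:01 INFO Retrying payment",
]
events = list(group_multiline(raw))
print(len(events))  # 2
```

The buffer is the memory cost: every pending line lives in it until the next timestamped line arrives, which is why a burst of long stacktraces translates directly into heap pressure.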

Dead letter queues are the safety net that most teams skip and regret. By default, Logstash drops documents that Elasticsearch rejects as unprocessable, such as mapping conflicts and field limit breaches (HTTP 400 and 404 responses), leaving only a warning in its own log. Retryable failures such as disk-watermark blocks are retried rather than dropped. Enable the DLQ in logstash.yml: dead_letter_queue.enable: true and dead_letter_queue.max_bytes: 1024mb. The DLQ is a local directory on the Logstash host, not an Elasticsearch index. Inspect it at the path configured by path.dead_letter_queue (default: /var/lib/logstash/dead_letter_queue). Use the dead_letter_queue input plugin to replay rejected events after fixing the root cause.

The pipeline.ordered setting deserves explicit mention. By default it is set to auto, which enables ordered processing when pipeline.workers is 1 and disables it otherwise, so with multiple workers ordering is already off. Setting pipeline.ordered: false explicitly documents that intent and keeps the pipeline unordered even if someone later drops it to a single worker. Unordered processing lets workers handle events without coordination overhead, improving throughput at the cost of delivery order guarantees. For log pipelines where Elasticsearch timestamps handle ordering at query time, this is almost always the right call.

Pipeline workers and batch size interact directly with throughput. For a 16-core machine, start with 8 workers and a batch size of 250. Increasing batch size improves throughput by amortizing per-batch overhead but increases per-event latency and heap usage. A batch that fills with slow-to-process multiline events holds the worker thread for longer, starving other events. Benchmark with realistic load before committing to any setting.
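The interaction between workers and batch size comes down to one product: the number of events held in memory at once is pipeline.workers multiplied by pipeline.batch.size. A minimal sketch of that arithmetic follows; the function names and the average event size are assumptions for illustration, and real in-memory events are larger than their raw bytes.

```python
def inflight_events(workers, batch_size):
    """Events held in memory at once: one batch per worker."""
    return workers * batch_size


def inflight_heap_mb(workers, batch_size, avg_event_kb):
    """Lower-bound heap estimate for in-flight events, in MB.

    avg_event_kb is an assumed average raw event size.
    """
    return inflight_events(workers, batch_size) * avg_event_kb / 1024


# Starting point from the text: 8 workers, batch size 250.
print(inflight_events(8, 250))
print(inflight_heap_mb(8, 250, avg_event_kb=2))
```

Doubling the batch size doubles both numbers, which is why batch size changes should be benchmarked together with the heap setting rather than in isolation.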

logstash-pipeline.conf (RUBY)

# /etc/logstash/conf.d/production-logs.conf
# Production Logstash pipeline for Java microservices
# Handles both structured JSON logs and raw stacktraces with multiline

input {
  # Beats input for Filebeat agents on each host
  beats {
    port => 5044
    # Backpressure is handled by the Beats protocol itself; the old
    # congestion_threshold option is obsolete and must not be set.
  }
}

filter {
  # Attempt JSON parse first — structured logs from Logback/Jackson need no Grok
  json {
    source  => "message"
    target  => "parsed"
    skip_on_invalid_json => true  # Keep raw message if not JSON — do not drop it
  }

  # Grok fallback for non-JSON logs and raw stacktraces
  if ![parsed] {
    grok {
      match => {
        "message" => [
          # Primary pattern: structured log line with thread and logger
          "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{DATA:thread}\] %{DATA:logger} - %{GREEDYDATA:log_message}",
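          # Catch-all: keeps the raw line in log_message when the primary pattern misses.
          # Grok never drops an unmatched event; without this line it would only add a failure tag.
          "%{GREEDYDATA:log_message}"
        ]
      }
      # Fires only if every alternative fails, which the catch-all makes unlikely
      tag_on_failure => ["_grokparsefailure_custom"]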
    }
  }

  # Multiline handling for Java stacktraces is deliberately absent from this file.
  # The beats input does not accept the multiline codec, so stacktraces must be
  # joined by Filebeat's multiline settings before they reach Logstash.
  # Where the codec is used on another input, cap it with max_lines and max_bytes
  # so a runaway exception chain cannot consume unbounded heap.

  # Normalize timestamp to ISO 8601 for Elasticsearch
  date {
    match         => ["timestamp", "ISO8601", "yyyy-MM-dd HH:mm:ss.SSS"]
    target        => "@timestamp"
    remove_field  => ["timestamp"]
  }

  # Add processing metadata: the pipeline version that handled the event, plus
  # [host][name], which Filebeat sets to the host that shipped the log line
  mutate {
    add_field => {
      "pipeline_version" => "4.0"
      "processed_by"    => "%{[host][name]}"
    }
  }
}

output {
  elasticsearch {
    hosts                  => ["es-data-01:9200", "es-data-02:9200", "es-data-03:9200"]
    index                  => "app-logs-%{+YYYY.MM.dd}"
    # Retry settings — transient ES errors should not cause data loss
    retry_max_interval     => 30
    retry_initial_interval => 2
  }

  # Dead letter queue is configured in logstash.yml, not here.
  # Add to logstash.yml:
  #   dead_letter_queue.enable: true
  #   dead_letter_queue.max_bytes: 1024mb
  #   path.dead_letter_queue: /var/lib/logstash/dead_letter_queue
  # Inspect the DLQ directory to see what ES rejected.
  # Replay with the dead_letter_queue input plugin after fixing the root cause.
}

Output

# Pipeline started successfully

# [INFO] [logstash.inputs.beats] Starting server on port 5044

# [INFO] [logstash.pipeline] Pipeline started {"pipeline.id":"main"}

# Monitor with: curl -s 'localhost:9600/_node/stats/pipelines?pretty'

⚠ Grok Failure Is Silent by Default

📊 Production Insight

Multiline codec holds pending events in JVM heap.

A burst of Java stacktraces from a crashing service — exactly when you need logs most — can OOM a 1GB heap pipeline in under a minute.

Set -Xmx to at least 4GB when using multiline. Configure multiline on the Filebeat side where possible to reduce Logstash heap pressure.

pipeline.ordered: false removes cross-worker coordination overhead — set it explicitly on log pipelines where ES timestamps handle ordering at query time.

🎯 Key Takeaway

Grok is the CPU bottleneck — always add %{GREEDYDATA:log_message} as the last alternative and tag failures for visibility.

Multiline codec is the heap trap — stacktrace bursts OOM pipelines under load. Configure multiline in Filebeat where possible.

Enable dead_letter_queue before production. DLQ is a local directory, not an ES index — inspect it at /var/lib/logstash/dead_letter_queue.

pipeline.ordered: false improves throughput on log pipelines where event ordering is handled at query time.

Logstash Filter Cheat Sheet — Grok, Date, Mutate, GeoIP

Logstash filters transform raw events into structured documents before they reach Elasticsearch. The four most frequently used filters in production pipelines are Grok, Date, Mutate, and GeoIP. Having a scannable reference makes pipeline debugging faster and reduces the guesswork when logs show up with missing fields or wrong timestamps.

Grok extracts structured fields from unstructured text using pattern matching. Built-in patterns cover common formats: %{COMBINEDAPACHELOG}, %{TIMESTAMP_ISO8601}, %{LOGLEVEL}. For custom formats, compose smaller patterns. Always add a catch-all as the last alternative so unmatched lines still land in a known field.

Date parses timestamp strings from your logs into the @timestamp field. If you skip this, @timestamp holds the time Logstash received the event rather than the time it was logged, making log order unreliable. The match parameter takes the source field name followed by the format strings to try in order.

Mutate modifies field values and structures — renaming, copying, converting types, removing fields, and adding static strings. Use it to normalize field names across services (e.g., renaming customer_email to email) or to add pipeline metadata.

GeoIP enriches events with geographical location data from an IP address. It adds fields like geoip.country_code2, geoip.city_name, and geoip.location (the exact names depend on the plugin's ECS compatibility mode). This only works with public IPs. Apply it selectively on high-volume pipelines: each lookup adds per-event cost, and GeoIP database updates add overhead of their own.
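The "public IPs only" rule is easy to get wrong with a hand-written regex, because the private 172.16.0.0/12 block stops at 172.31 while an address like 172.160.0.1 is public. A short sketch using Python's standard ipaddress module shows the classification; should_geoip is a hypothetical helper name, not a Logstash setting.

```python
import ipaddress


def should_geoip(ip_text):
    """Return True only for globally routable addresses worth a GeoIP lookup."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        # Not an IP address at all (empty field, hostname, garbage).
        return False
    return ip.is_global


for candidate in ["8.8.8.8", "10.1.2.3", "172.16.0.1", "172.160.0.1", "192.168.1.1"]:
    print(candidate, should_geoip(candidate))
```

In a Logstash config the same decision is made with a conditional in front of the geoip filter, so the regex used there should agree with this classification for the private ranges.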

Grok failure debugging is where most teams waste time. When your pattern does not match, Logstash adds a _grokparsefailure tag to the event. Check for these tags in Kibana. Use the Grok Debugger in Kibana Dev Tools to test patterns against actual log lines before deploying.

logstash-filters.conf (RUBY)

# /etc/logstash/conf.d/filters.conf
# Common Logstash filter patterns — copy, paste, modify

# ============================================================
# 1. GROK — Extract structured fields from unstructured text
# ============================================================
filter {
  grok {
    match => {
      "message" => [
        # Apache/Nginx combined log format
        "%{COMBINEDAPACHELOG}",
        # Custom key=value log line
        "timestamp=%{TIMESTAMP_ISO8601:timestamp} level=%{LOGLEVEL:level} trace_id=%{UUID:trace_id} msg=%{GREEDYDATA:log_message}",
        # Catch-all: keep unmatched lines in log_message instead of leaving them unparsed
        "%{GREEDYDATA:log_message}"
      ]
    }
    # Tag failures so you can see them in Kibana (fires only if every alternative fails)
    tag_on_failure => ["_grokparsefailure_custom"]
    # Remove the original message after a successful match to save disk
    remove_field => ["message"]
  }
}

# ============================================================
# 2. DATE — Parse timestamp into @timestamp
# ============================================================
filter {
  date {
    # Try these formats in order until one matches
    match => [
      "timestamp",
      "ISO8601",
      "yyyy-MM-dd HH:mm:ss.SSS",
      "dd/MMM/yyyy:HH:mm:ss Z"  # Apache log format
    ]
    target => "@timestamp"
    # Remove the original timestamp field after parsing
    remove_field => ["timestamp"]
  }
}

# ============================================================
# 3. MUTATE — Modify, rename, convert, or add fields
# ============================================================
filter {
  mutate {
    # Rename a field to standardize across services
    rename => {
      "customerEmail" => "email"
      "source_host"   => "hostname"
    }
    # Convert string numbers to actual numeric types
    convert => {
      "status_code" => "integer"
      "duration_ms" => "float"
    }
    # Remove fields you never query
    remove_field => ["headers", "raw_body"]
    # Add static processing metadata
    add_field => {
      "pipeline_name" => "logs-prod"
      "environment"   => "production"
    }
    # Copy a value into a nested field ([user][id]; "user.id" would create a literal dotted name)
    copy => {
      "user_id" => "[user][id]"
    }
  }
}

# ============================================================
# 4. GEOIP — Enrich with location data from IP address
# ============================================================
filter {
  geoip {
    # Source field containing the IP address
    source => "client_ip"
    # Fields to add (default includes country, city, location, etc.)
    target => "geoip"
    # Skip if IP is private (10.x.x.x, 192.168.x.x, 172.16.x.x)
    # Not a filter setting — handle with a conditional before this filter
  }
}

# ============================================================
# 5. IF — Conditional pipelines
# ============================================================
filter {
  # Only apply heavy filters to error logs (10% of traffic)
  if [level] == "ERROR" or [status_code] >= 500 {
    grok {
      # JAVASTACKTRACEPART matches a single "at Class.method(File:line)" frame
      match => { "stacktrace" => "%{JAVASTACKTRACEPART}" }
    }
  }
  
  # Skip GeoIP for private IP addresses (10/8, 172.16/12, 192.168/16)
  if [client_ip] !~ /^(10\.|172\.(1[6-9]|2[0-9]|3[01])\.|192\.168\.)/ {
    geoip {
      source => "client_ip"
    }
  }
}

# ============================================================
# 6. KV (Key-Value) — Parse key=value pairs
# ============================================================
filter {
  kv {
    # Source field containing key=value pairs
    source => "query_string"
    # Target field for extracted fields
    target => "params"
    # Separator between pairs (default is a space)
    field_split => "&"
    # Separator between key and value (default '=')
    value_split => "="
    # Remove the source after extraction
    remove_field => ["query_string"]
  }
}

Output

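# Validate config syntax before deploying (this parses the config only; it does not run events through it):

# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/test.conf --config.test_and_exit

# To push a sample event through the filters, drop --config.test_and_exit and use a test.conf that reads stdin with a json codec and writes to stdout:

# echo '{"message":"127.0.0.1 - - [25/Apr/2026:10:30:00 +0000] \"GET /health\" 200 12"}' | \

# /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/test.conf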

💡Grok Debugging Workflow

1. Find a real log line that is failing; copy it from the raw application logs
2. Open Kibana Dev Tools -> Grok Debugger
3. Paste the log line and your pattern
4. Iterate on the pattern until fields extract correctly
5. Copy the working pattern to your Logstash config
6. Test with logstash --config.test_and_exit before reloading
7. Monitor _grokparsefailure tags in Kibana after deployment

📊 Production Insight

A 10K events/second pipeline with 5% Grok failures had 500 events per second falling through the whole pattern list, each one running 15 alternative patterns before giving up. Moving a catch-all %{GREEDYDATA:log_message} to the first position did cut CPU usage by 40%, but only because Grok stops at the first pattern that matches (break_on_match defaults to true): the catch-all matched every event instantly, so the specific patterns never ran and no structured fields were extracted. Rule: order patterns from most-specific to catch-all, and always end with %{GREEDYDATA}.

🎯 Key Takeaway

Grok extracts structure — always include %{GREEDYDATA:log_message} as the last pattern. Date sets @timestamp from log timestamps. Mutate renames, converts, and removes fields. GeoIP enriches with location from IP addresses. Test patterns in the Grok Debugger before deploying.

Kibana Dashboards That Actually Answer Questions

Most Kibana dashboards are digital art. They look impressive in a demo and answer nothing during an incident. A dashboard with 47 panels showing every possible metric is a distraction when you are trying to figure out why payments are failing at 2am.

The right approach is to start with the question, then build the visualization. 'Which services are returning 5xx errors in the last 15 minutes?' needs one metric — error count — one dimension — service name — and one filter — status code 500 or above, time range last 15 minutes. That is a single data table, not a 12-panel dashboard. The panel answers the question. Everything else is friction.

Kibana's index patterns are the second major gotcha. When you create an index pattern like app-logs-*, Kibana fetches field mappings from every matching index. If you have 90 days of daily indices with dynamic mapping enabled, you can easily have 3,000-plus fields — especially when different services log different JSON structures that Elasticsearch ingests as separate mapped fields. Every time someone opens Discover or creates a visualization, Kibana loads all those field definitions. That is why your dashboard takes 20 seconds to load. The fix is upstream: use dynamic: strict in your index template and define only the fields you actually query.

For real incident response, saved searches outperform visualizations. They load faster because they do not aggregate — they list raw log lines with column selections. Pin a few key saved searches at the top of your Kibana navigation: 5xx errors, slow queries over 5 seconds, auth failures. When something breaks, open the relevant saved search rather than waiting for a complex dashboard to render aggregations across 90 days of data.

Choosing Kibana Query Language (KQL) over Lucene syntax is worth the initial learning curve. KQL is more readable, less error-prone when written quickly under pressure, and better supported in autocomplete. Train the whole team on a handful of patterns (field:value, field:* wildcards, AND/OR combinations, range queries with > and <) and incident investigation gets faster.

Time series visualizations with a large time range are a silent performance killer. Querying 30 days of data with a 1-minute bucket interval generates 43,200 buckets. Elasticsearch computes all of them. Use auto-interval on date histograms for overview dashboards and a fixed short interval only when drilling into a specific incident window. The inspect button on any Kibana panel shows the raw Elasticsearch query being executed — this is invaluable for understanding why a dashboard is slow or showing unexpected data.
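The bucket arithmetic is worth making explicit, because the count grows with the time range divided by the interval. The sketch below reproduces the 43,200 figure; bucket_count is a hypothetical helper name used for illustration.

```python
from datetime import timedelta


def bucket_count(time_range, interval):
    """Number of date-histogram buckets needed to cover time_range."""
    # Ceiling division: a partial trailing interval still needs a bucket.
    return -(-time_range // interval)


print(bucket_count(timedelta(days=30), timedelta(minutes=1)))   # 30 days at 1-minute buckets
print(bucket_count(timedelta(days=30), timedelta(hours=1)))     # same range, hourly buckets
print(bucket_count(timedelta(minutes=15), timedelta(seconds=10)))  # incident window
```

The same 30-day range drops from 43,200 buckets to 720 when the interval widens to one hour, which is roughly what auto-interval does for overview dashboards.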

kibana-incident-dashboard.ndjson (JSON)

// Kibana saved search for incident triage — 5xx errors by service
// Import via Kibana -> Stack Management -> Saved Objects -> Import
// This is the first thing to open when someone says 'something is broken'
{
  "attributes": {
    "title": "5xx Errors by Service — Last 15m",
    "description": "Real-time error view for incident triage. Open this first, not the overview dashboard.",
    "columns": ["@timestamp", "service", "level", "log_message", "status_code"],
    "sort": [["@timestamp", "desc"]],
    "kibanaSavedObjectMeta": {
      "searchSourceJSON": "{\"query\":{\"query\":\"status_code >= 500\",\"language\":\"kuery\"},\"filter\":[{\"meta\":{\"index\":\"app-logs-*\",\"type\":\"range\"},\"query\":{\"range\":{\"@timestamp\":{\"gte\":\"now-15m\"}}}}]}"
    }
  },
  "type": "search"
}

// Human-readable version of the embedded query above:
// {
//   query: { query: 'status_code >= 500', language: 'kuery' },
//   filter: [{ range: { '@timestamp': { gte: 'now-15m' } } }]
// }
//
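// The nested JSON escaping in kibanaSavedObjectMeta is the Kibana saved objects
// wire format and is required for import. Before importing, strip these // comments
// and put the object on a single line: NDJSON allows neither comments nor line
// breaks inside a record. The readable version above shows what is executed
// against Elasticsearch.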

Output

# Import: Kibana -> Stack Management -> Saved Objects -> Import -> select this file

# Access: Kibana -> Analytics -> Discover -> Open -> '5xx Errors by Service'

# During incidents: pin to Kibana sidebar for one-click access

Mental Model

Dashboard Design Mental Model

A dashboard is a diagnostic tool, not a status page. Every panel should answer one specific operational question that leads to an action. If you cannot name what action a panel drives, remove it.

Start with the question, then pick the visualization — not the other way around
One dashboard per operational scenario: incident triage, capacity planning, deploy validation
If a panel does not change your next action, it is visual noise — remove it
Saved searches load faster than visualizations and show raw log lines — use them for active incident investigation
The Kibana inspect button shows the raw ES query — use it to debug slow panels and unexpected results

📊 Production Insight

Index patterns with 2000-plus fields add 10-20 seconds to every Kibana page load.

Dynamic mapping on high-cardinality logs creates fields like request.headers.x-amzn-trace-id that nobody queries but Kibana loads on every page.

Rule: use dynamic: strict in your index template from day one. Every unexpected field is a performance tax paid on every dashboard load.

🎯 Key Takeaway

Start with the question, then build the visualization — not the other way around.

Saved searches load faster than visualizations and are the right tool during active incidents.

dynamic: strict prevents field explosion that makes every Kibana page slow to load.

One dashboard per operational scenario beats one dashboard attempting to show everything.

Kibana Query Language (KQL) vs Lucene — Syntax Comparison

Kibana gives you two query language options: Kibana Query Language (KQL) and Lucene. KQL is the default in modern Kibana versions (7.0+) and is the recommended choice for most users. Lucene is the legacy syntax that powers Elasticsearch's underlying query parser. Knowing both is useful for debugging, but KQL should be your daily driver.

KQL is designed for discoverability and error resistance. It provides autocomplete suggestions, syntax highlighting, and immediate error feedback: when a query is syntactically invalid, Kibana tells you where the syntax breaks. KQL supports nested fields, existence checks, and range queries with a cleaner syntax. Use KQL for all ad-hoc exploration and dashboard panels.

Lucene is more powerful but more dangerous. It supports regex, fuzzy queries, and proximity searches that KQL does not. The trade-off is that Lucene does not prevent you from writing queries that are syntactically valid but semantically wrong. A misplaced parenthesis can change the entire query meaning without an error message. Reserve Lucene for advanced use cases where KQL falls short, and always test Lucene queries in Dev Tools before adding them to dashboards.

The field:value syntax is identical in both languages. Wildcards work in both — service:payment* matches payment-api, payment-processor, payment-service. The differences appear in ranges, existence checks, and complex Boolean logic.

kql-vs-lucene.txt (TEXT)

# KQL vs Lucene — Syntax Comparison Quick Reference
# Use KQL by default. Only switch to Lucene when you need advanced features.

# ============================================================
# BASIC FIELD LOOKUP — Same in both
# ============================================================
# Match exact field value
KQL:     status_code: 500
Lucene:  status_code:500

# Match value anywhere in full-text field (message analysis applies)
KQL:     message: "timeout"
Lucene:  message:timeout

# ============================================================
# WILDCARDS: * works in both, ? is Lucene only
# ============================================================
# Prefix match
KQL:     service: payment*
Lucene:  service:payment*

# Single character wildcard (Lucene only; KQL supports only the * wildcard)
Lucene:  trace_id:1a2b?ef*

# ============================================================
# RANGE QUERIES — KQL is more readable
# ============================================================
# Greater than or equal
KQL:     duration_ms >= 5000
Lucene:  duration_ms:[5000 TO *]

# Between (inclusive both ends)
KQL:     status_code >= 400 and status_code <= 499
Lucene:  status_code:[400 TO 499]

# Between (exclusive upper bound)
KQL:     duration_ms >= 0 and duration_ms < 1000
Lucene:  duration_ms:[0 TO 1000}

# Date range
KQL:     @timestamp >= "2026-04-25T10:00:00"
Lucene:  @timestamp:[2026-04-25T10:00:00 TO *]

# ============================================================
# BOOLEAN LOGIC: same keywords; KQL is case-insensitive, Lucene needs uppercase
# ============================================================
# AND
KQL:     level: ERROR AND service: payment-api
Lucene:  level:ERROR AND service:payment-api

# OR (note: Lucene requires uppercase OR; KQL also accepts lowercase or)
KQL:     level: ERROR OR level: WARN
Lucene:  level:ERROR OR level:WARN

# NOT (Lucene also accepts the - prefix)
KQL:     NOT level: DEBUG
Lucene:  NOT level:DEBUG   (or: -level:DEBUG)

# Complex grouping
KQL:     (status_code >= 500 OR level: ERROR) AND service: payment*
Lucene:  (status_code:[500 TO *] OR level:ERROR) AND service:payment*

# ============================================================
# EXISTENCE CHECKS — KQL is more readable
# ============================================================
# Field exists (has any non-null value)
KQL:     trace_id: *
Lucene:  _exists_:trace_id

# Field does NOT exist
KQL:     NOT trace_id: *
Lucene:  -_exists_:trace_id

# ============================================================
# OBJECT FIELDS: dot notation works in both (fields mapped as type nested need KQL's nested syntax)
# ============================================================
KQL:     geoip.country_code: US
Lucene:  geoip.country_code:US

# ============================================================
# LUCENE-ONLY FEATURES (not available in KQL)
# ============================================================
# Regular expressions (use sparingly, they are expensive; flags such as /i are not supported)
Lucene:  message:/pay.ent.*/

# Fuzzy queries (character edit distance)
Lucene:  customer_name:bob~1

# Proximity searches
Lucene:  "user created"~3

# Boosting terms
Lucene:  level:ERROR^2 OR level:WARN

# ============================================================
# REAL-WORLD INCIDENT QUERIES
# ============================================================
# Find all errors from payment service in last hour (KQL)
level: ERROR AND service: payment-* AND @timestamp >= now-1h

# Find slow API calls (10s or longer) excluding health checks (Lucene; the / must be escaped)
duration_ms:[10000 TO *] AND NOT endpoint:\/health

# Find any 5xx or ERROR from payment services except test users (KQL)
(status_code >= 500 OR level: ERROR) AND service: payment-* AND NOT user_id: test-*
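The range rows above follow a mechanical rule: square brackets are inclusive bounds, curly brackets are exclusive, and * stands for an open end. A small sketch of that mapping for single comparisons is shown below. kql_comparison_to_lucene is a hypothetical helper written for this illustration; it is not a Kibana API and handles only one field, one operator and one value.

```python
import re

# One KQL comparison: field, operator, value (no spaces inside the value).
_COMPARISON = re.compile(
    r"^\s*(?P<field>[\w.@]+)\s*(?P<op>>=|<=|>|<)\s*(?P<value>\S+)\s*$"
)

# [ ] are inclusive bounds, { } are exclusive bounds, * is an open end.
_RANGE_TEMPLATES = {
    ">=": "{field}:[{value} TO *]",
    ">": "{field}:{{{value} TO *]",
    "<=": "{field}:[* TO {value}]",
    "<": "{field}:[* TO {value}}}",
}


def kql_comparison_to_lucene(expr):
    """Translate a single KQL comparison into Lucene range syntax."""
    match = _COMPARISON.match(expr)
    if not match:
        raise ValueError(f"not a simple comparison: {expr!r}")
    template = _RANGE_TEMPLATES[match["op"]]
    return template.format(field=match["field"], value=match["value"])


print(kql_comparison_to_lucene("duration_ms >= 5000"))
print(kql_comparison_to_lucene("duration_ms < 1000"))
```

Seeing the two forms side by side makes it easier to read a Lucene range in an old saved query and rewrite it as KQL.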

💡KQL Quick Reference

| Query Pattern | KQL Example |
|---------------|-------------|
| Exact match | status_code: 500 |
| Prefix wildcard | service: payment* |
| Range | duration_ms >= 5000 |
| AND | level: ERROR AND service: payment-api |
| OR | level: ERROR OR level: WARN (lowercase or also works) |
| NOT | NOT level: DEBUG |
| Exists | trace_id: * |
| Date math | @timestamp >= now-15m |
| Grouping | (status_code >= 500 OR level: ERROR) AND service: payment* |

📊 Production Insight

A team spent 20 minutes debugging why their dashboard filter level:WARN OR ERROR did not behave as expected. The problem: a field name binds only to the term next to it, so ERROR was searched as a bare term instead of as a value of level. After switching to level: WARN OR level: ERROR (or level: (WARN OR ERROR)), the filter worked. Rule: repeat the field name or group the values in parentheses. Operator case is a separate trap: Lucene requires uppercase AND, OR, NOT, while KQL accepts either case. Field names are case-sensitive as indexed.

🎯 Key Takeaway

Use KQL by default: it has autocomplete, error feedback, and cleaner syntax. Use Lucene only for regex, fuzzy queries, or proximity searches. Lucene operators (AND, OR, NOT) must be uppercase; KQL accepts either case. Repeat the field name or group values, as in level: (WARN OR ERROR). Field names are case-sensitive as indexed.

Shard Strategy and Capacity Planning — The Decisions That Haunt You Later

Shard count is the most consequential decision in an Elasticsearch deployment and it is almost always wrong on the first try. Too many shards and your cluster spends more time managing shard metadata than indexing data. Too few and you cannot distribute load or recover from node failures in a reasonable time window.

Here is the math most people skip. Each shard is a Lucene instance with its own heap overhead, roughly 1MB per shard for metadata plus per-segment structures. A cluster with 5,000 shards burns around 5GB of heap just on shard bookkeeping before indexing a single document. Elasticsearch's default limit is 1,000 shards per data node (cluster.max_shards_per_node, which is adjustable rather than hard), but the practical ceiling is closer to 20 shards per GB of heap on data nodes. A node with 30GB of heap can handle around 600 shards before GC pressure from shard metadata starts affecting search latency.
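The two rules of thumb in this paragraph reduce to a couple of multiplications. The sketch below takes the 1MB-per-shard and 20-shards-per-GB figures from the text as given; the function names are hypothetical.

```python
def max_recommended_shards(heap_gb, shards_per_gb=20):
    """Practical shard ceiling for one data node under the 20-per-GB rule."""
    return heap_gb * shards_per_gb


def metadata_heap_gb(shard_count, mb_per_shard=1.0):
    """Approximate heap spent on shard bookkeeping, in GB."""
    return shard_count * mb_per_shard / 1024


print(max_recommended_shards(30))
print(round(metadata_heap_gb(5000), 1))
```

Running the same two functions against the current shard count of each node gives a quick check of how much headroom is left before the rule of thumb is exceeded.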

The target shard size for log workloads is 10GB to 50GB. Below 10GB you are paying overhead on shards too small to benefit from parallelism. Above 50GB, shard recovery after a node failure requires copying the entire shard to a replacement node: a 100GB shard on a 1Gbps network takes around 13 minutes to copy even at full line rate, and during that time the data has one fewer replica. For a daily index receiving 30GB of logs, one primary shard is fine. For 150GB per day, use 5 primary shards.
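Both numbers in this paragraph can be derived rather than memorized. The sketch below assumes a 30GB target shard size and a link running at full line rate with decimal gigabytes; real recoveries are usually slower because Elasticsearch throttles them (see indices.recovery.max_bytes_per_sec). The function names are hypothetical.

```python
import math


def primary_shards(daily_gb, target_shard_gb=30):
    """Primary shard count for a daily index: ceiling(daily volume / target size)."""
    return max(1, math.ceil(daily_gb / target_shard_gb))


def recovery_minutes(shard_gb, link_gbps=1.0):
    """Best-case minutes to copy one shard over the network."""
    seconds = shard_gb * 8 / link_gbps  # gigabytes -> gigabits, then divide by link speed
    return seconds / 60


print(primary_shards(30), primary_shards(150))
print(round(recovery_minutes(100), 1))
```

The recovery figure is a floor, not an estimate: it ignores throttling, disk speed and competing traffic, so treat it as the minimum window during which the shard runs with reduced redundancy.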

You cannot change the number of primary shards on an existing index without reindexing. This is one of the most painful lessons in Elasticsearch operations and it is avoidable if you plan before the first document arrives. Use index templates with the correct shard count set before the cluster receives any data. If you get it wrong, reindex is the only path forward — which is a significant operational event.

Replica shards serve two purposes: fault tolerance and read parallelism. For logs, one replica is usually enough — it doubles disk usage but protects against single node failure. If you have 3 data nodes, each primary shard has its primary on one node and its replica on another, giving you tolerance for a single node going down. With 2 replicas across 3 nodes, you have triple the disk usage but can lose any 2 nodes and still serve reads. Choose based on your actual availability requirements, not aspirational ones.
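The trade-off between replicas, disk and fault tolerance can be written down directly. This is a simplified sketch that assumes one copy of each shard per node and ignores master quorum; the function names are hypothetical.

```python
def total_disk_gb(primary_gb, replicas):
    """Disk needed for the primaries plus every replica copy."""
    return primary_gb * (1 + replicas)


def node_losses_tolerated(replicas, data_nodes):
    """Data nodes that can fail while at least one copy of each shard survives."""
    # A shard never has more copies than there are nodes to hold them.
    return min(replicas, data_nodes - 1)


for replicas in (1, 2):
    print(replicas, total_disk_gb(30, replicas), node_losses_tolerated(replicas, 3))
```

For a 30GB daily index on 3 data nodes, one replica costs 60GB and survives one node loss; two replicas cost 90GB and survive two.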

Post-node-restart shard allocation can itself become a cluster bottleneck when shard counts are high. The elected master node handles all allocation decisions, and thousands of pending allocations after a node restart can keep the cluster in a recovering state for much longer than the actual data copy time. Query _cluster/allocation/explain when shards are not allocating: it gives the specific reason, from disk watermark to node attribute filtering to replica placement rules. This API saves hours of guessing.

shard-capacity-planning.sh (BASH)

#!/bin/bash
# Elasticsearch shard capacity planning and monitoring commands
# Run these before creating a new index, not after you discover a problem

echo "=== Current shard distribution across data nodes ==="
curl -s 'localhost:9200/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.percent'

echo ""
echo "=== Shards over 50GB — candidates for re-sharding on next reindex ==="
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node' | \
  awk 'NR==1 || ($4 ~ /gb/ && substr($4,1,length($4)-2)+0 > 50)'

echo ""
echo "=== Per-node shard count vs recommended maximum ==="
# Rule: keep shards per node under (heap_GB x 20)
# For a 30GB heap node, maximum is ~600 shards
# Heap comes from _cat/nodes (bytes=gb prints a bare number of GB);
# shard counts come from _cat/allocation, joined on node name.
awk 'NR==FNR { heap[$1]=$2; next }          # first input: node name -> heap GB
     ($2 in heap) {                          # second input: shard count, node name
       max_shards = heap[$2] * 20
       if ($1+0 > max_shards)
         printf "WARNING: %s has %s shards, max recommended %s\n", $2, $1, max_shards
       else
         printf "OK:      %s has %s shards (max %s)\n", $2, $1, max_shards
     }' \
  <(curl -s 'localhost:9200/_cat/nodes?h=name,heap.max&bytes=gb') \
  <(curl -s 'localhost:9200/_cat/allocation?h=shards,node')

echo ""
echo "=== Shard count calculator for a new daily index ==="
DAILY_GB=${1:-50}         # Pass daily volume as first argument, default 50GB
TARGET_SHARD_GB=30        # Target 10-50GB per shard
SHARDS=$(( (DAILY_GB + TARGET_SHARD_GB - 1) / TARGET_SHARD_GB ))  # Ceiling division
echo "Daily volume: ${DAILY_GB}GB"
echo "Target shard size: ${TARGET_SHARD_GB}GB"
echo "Recommended primary shards: ${SHARDS}"
echo "Total disk with 1 replica: $(( DAILY_GB * 2 ))GB (before segment overhead)"
echo "Total disk with 1 replica + 20% overhead: $(( DAILY_GB * 2 * 12 / 10 ))GB"

echo ""
echo "=== Creating index template with calculated shard count ==="
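# "dynamic": "strict" is a mapping option, so it lives under "mappings", not "settings".
# With strict mapping, list every field you index under "properties" (only @timestamp
# is shown here) or Elasticsearch rejects documents that contain anything else.
curl -s -XPUT 'localhost:9200/_index_template/app-logs' \
  -H 'Content-Type: application/json' \
  -d "{
    \"index_patterns\": [\"app-logs-*\"],
    \"template\": {
      \"settings\": {
        \"number_of_shards\": ${SHARDS},
        \"number_of_replicas\": 1,
        \"refresh_interval\": \"5s\",
        \"codec\": \"best_compression\"
      },
      \"mappings\": {
        \"dynamic\": \"strict\",
        \"properties\": {
          \"@timestamp\": { \"type\": \"date\" }
        }
      }
    }
  }"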

Output

# === Current shard distribution across data nodes ===

# node shards disk.indices disk.used disk.avail disk.percent

# es-data-01 42 180.2gb 210.5gb 789.5gb 21

# es-data-02 38 165.8gb 195.2gb 804.8gb 19

# === Per-node shard count vs recommended maximum ===

# OK: es-data-01 has 42 shards (max 600)

# OK: es-data-02 has 38 shards (max 600)

# === Shard count calculator ===

# Daily volume: 50GB

# Target shard size: 30GB

# Recommended primary shards: 2

# Total disk with 1 replica: 100GB

# Total disk with 1 replica + 20% overhead: 120GB

💡Shard Sizing in One Rule

Target 10 to 50GB per shard. Calculate primary shard count as ceiling(daily_volume_GB / 30). Set this in the index template before the first document arrives — you cannot change it afterward without a full reindex. Every shard adds roughly 1MB of heap overhead for metadata. Keep total shards per node under heap_GB times 20.

📊 Production Insight

5,000 shards consume roughly 5GB of heap just on metadata — before indexing anything.

A node with 30GB heap and 600 shards is running within spec. The same node with 2,000 shards will GC constantly and degrade search latency for everything on the cluster.

Rule: calculate shard count from daily volume before creating the index, not after the cluster starts showing yellow status.

🎯 Key Takeaway

Target 10 to 50GB per shard — below 10GB wastes overhead, above 50GB makes recovery slow.

Shard count cannot be changed after index creation without a reindex — get it right in the template.

Keep total shards per node under heap_GB times 20 to avoid GC pressure from shard metadata.

Use _cluster/allocation/explain when shards will not allocate — it gives the specific blocking reason.

Index Lifecycle Management — Automate Retention Before It Bites You

Without ILM, your ELK stack suffocates under its own data. Daily indices accumulate, disk fills, and someone is running curl commands at 2am to delete old indices. Index Lifecycle Management automates this: you define policies that transition indices through hot, warm, cold, and delete phases based on age, size, or document count. ILM is not optional infrastructure. It is the difference between a cluster that manages itself and one that requires constant manual intervention.

Hot phase: indices are actively written and frequently searched. Keep this short — 1 to 3 days for most log workloads. Use fast NVMe SSDs. This is your most expensive storage tier, and every day you keep an index here costs more than it should.

Warm phase: no more writes, still searchable but with lower urgency. Force-merge to 1 segment: this consolidates the many small segments left by active indexing into one, reducing heap overhead and improving scan performance. The shrink action needs a copy of every shard on a single node, and dropping replicas to 0 with the allocate action first makes that placement much easier to satisfy. If the shrink cannot proceed, the index stalls in the warm phase without raising any alert, and the error only appears in the _ilm/explain output.

Cold phase: read-only, reduced replica count, optionally migrated to slower storage using data tiers. For compliance-driven retention requirements, this phase can extend to months or years.

Delete phase: remove indices after the retention period. Set this based on your data retention policy and test it with a very short window on a development cluster — 1 hour delete — before using real durations in production.

ILM rollover is the mechanism that prevents any single index from growing beyond your target shard size. When an index exceeds max_size or max_age, ILM rolls over to a new index with the same template settings. This keeps shard sizes predictable and recovery times bounded. Set rollover on both a size trigger and an age trigger — whichever fires first — so that low-volume days still rotate on schedule.
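The "whichever fires first" behavior can be sketched as a single boolean check. This is a simplified Python model, not ILM's implementation: the class and function names are hypothetical, and real ILM evaluates rollover conditions on a periodic poll, so an index can overshoot a threshold slightly before it rolls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloverConditions:
    max_size_gb: float    # mirrors max_size in the policy
    max_age_hours: float  # mirrors max_age in the policy


def should_roll_over(cond: RolloverConditions, size_gb: float, age_hours: float) -> bool:
    """Roll over as soon as either condition is met, whichever fires first."""
    return size_gb >= cond.max_size_gb or age_hours >= cond.max_age_hours


policy = RolloverConditions(max_size_gb=50, max_age_hours=24)

print(should_roll_over(policy, size_gb=52, age_hours=6))   # busy day: size trigger fires
print(should_roll_over(policy, size_gb=4, age_hours=24))   # quiet day: age trigger fires
print(should_roll_over(policy, size_gb=4, age_hours=6))    # neither condition met yet
```

With only a size trigger, the quiet-day index in the second call would stay open indefinitely, which is exactly the case the age trigger exists to cover.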

A critical operational detail: ILM policies attached to an index template apply to new indices only. Indices created before the policy was attached require explicit opt-in by setting index.lifecycle.name through the PUT /<index>/_settings API. And always test your ILM policy on a development cluster before applying it to production. A misconfigured delete phase that fires too early is a data-loss incident, not just a configuration mistake.

ilm-policy.jsonJSON

// ILM policy: hot until 3 days after rollover, warm until day 30, then delete
// PUT _ilm/policy/logs-ilm-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_size": "50gb",  // Roll over when index hits 50GB regardless of age
            "max_age":  "1d"     // Also roll over daily even if under 50GB
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          // Drop replicas to 0 before shrink. Shrink needs a copy of every shard on a single node,
          // and fewer copies makes that placement easier. ILM runs these actions in its own fixed order.
          "allocate": { "number_of_replicas": 0 },
          "forcemerge": { "max_num_segments": 1 },
          "shrink":    { "number_of_shards": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

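// Attach to index template: new indices inherit this policy automatically
// PUT _index_template/app-logs
{
  "index_patterns": ["app-logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name":             "logs-ilm-policy",
      "index.lifecycle.rollover_alias":    "app-logs",
      "number_of_shards":                  2,
      "number_of_replicas":                1
    },
    // dynamic: strict is a mapping parameter, so it lives under "mappings", not "settings".
    // Declare every field you ship under "properties", or strict mode rejects the documents.
    "mappings": { "dynamic": "strict" }
  }
}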

// Verify ILM is working on existing indices:
// GET app-logs-*/_ilm/explain

Output

# ILM policy created and attached to template

# New indices matching app-logs-* will be managed automatically

# Check phase progression:

# curl -s 'localhost:9200/app-logs-*/_ilm/explain?pretty' | grep -E '(phase|age|action)'

# If an index is stuck in a phase, check the error:

# curl -s 'localhost:9200/app-logs-000001/_ilm/explain?pretty' | grep -A5 'error'

Mental Model

ILM as Automated Operations

ILM is the garbage truck for your log data — it moves data through tiers as it ages and removes it when the retention window closes. Without it, manual cleanup is the operation that someone eventually forgets during a holiday.

Hot: fast writes, fast searches, expensive NVMe storage — keep indices here for 1 to 3 days maximum
Warm: no writes, acceptable search speed, force-merge to 1 segment reduces overhead — 3 to 30 days
Cold: read-only, minimal replicas, cheap storage — for compliance or rare lookups
Delete: remove when retention policy says it is gone — test with a short window first
Rollover on both size and age — prevents any single index from growing beyond your target shard size
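To sanity-check a policy before attaching it, it helps to ask which phase an index of a given age should be in. A minimal Python sketch using the min_age values from the example policy above (0, 3 and 30 days, measured from rollover); the function is illustrative and only models the age thresholds, not the actions inside each phase.

```python
# (phase name, min_age in days since rollover), taken from the example policy
PHASES = [("hot", 0), ("warm", 3), ("delete", 30)]


def phase_for_age(days_since_rollover: float) -> str:
    """Return the latest phase whose min_age has been reached."""
    current = PHASES[0][0]
    for name, min_age_days in PHASES:
        if days_since_rollover >= min_age_days:
            current = name
    return current


for age in (0, 2.5, 3, 29, 30):
    print(f"{age:>4} days -> {phase_for_age(age)}")
```

Comparing this expectation against the phase reported by _ilm/explain is a quick way to spot an index that has stalled.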

📊 Production Insight

Pair the warm-phase shrink with an allocate action that reduces replicas to 0, so shrink can gather a copy of every shard on one node.

A warm phase whose shrink cannot complete raises no alert. The index simply stalls, and the only trace is an error in the ILM explain API output for that index.

Rule: always check _ilm/explain on a test index after attaching a new ILM policy before trusting it in production.

🎯 Key Takeaway

ILM automates hot, warm, and delete phases — without it, manual cleanup is the operation someone forgets.

Reduce replicas to 0 with allocate before the warm-phase shrink. A stalled shrink raises no alert and only shows up in _ilm/explain.

Attach ILM to every index template. It does not retroactively apply to existing indices.

Test ILM policies with a 1-hour delete window on dev before using real durations.

Cluster Sizing and Hardware Selection — CPU, RAM, Disk Trade-offs

Choosing hardware for Elasticsearch is a trade-off between three resources that compete against each other. The right mix depends entirely on your workload profile — and the profile changes as your data grows.

RAM is the highest-leverage resource. Elasticsearch uses the OS filesystem cache as aggressively as the JVM heap. If your hot index fits in the filesystem cache, searches are nearly instant. If it does not, every search requires disk reads. The practical rule: allocate up to 50% of node RAM to the JVM heap, capped at 31GB, and leave the rest for the OS. A 64GB node gets 31GB of heap for Elasticsearch and 33GB for the OS cache. Do not exceed 31GB of heap on a single node: above roughly 32GB, the JVM switches from 4-byte compressed object pointers to 8-byte uncompressed ones, which doubles reference size and increases GC pressure. Elasticsearch 8.x ships with a bundled JDK and defaults to G1GC. Collector improvements in recent JDKs help with pause times, but they do not move the compressed-pointer threshold, so the 31GB ceiling on heap remains the safe practical limit. If you need more memory, add nodes rather than increasing per-node heap.
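The heap rule is easy to get wrong on large machines, so here it is as arithmetic. A small Python sketch assuming the two numbers from this section (half of RAM, capped at 31GB); the function names are illustrative.

```python
HEAP_CEILING_GB = 31  # stay below the compressed-pointer threshold


def recommended_heap_gb(node_ram_gb: float) -> float:
    """Half of node RAM, but never more than the 31GB ceiling."""
    return min(node_ram_gb / 2, HEAP_CEILING_GB)


def os_cache_gb(node_ram_gb: float) -> float:
    """Whatever the heap does not take is left for the OS filesystem cache."""
    return node_ram_gb - recommended_heap_gb(node_ram_gb)


for ram in (32, 64, 128):
    print(f"{ram}GB node -> heap {recommended_heap_gb(ram):.0f}GB, "
          f"OS cache {os_cache_gb(ram):.0f}GB")
```

On the 128GB node the heap still stops at 31GB, and the extra memory goes to the filesystem cache rather than to the JVM.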

CPU matters most for indexing throughput and complex aggregations. Logstash with heavy Grok patterns can saturate CPU before Elasticsearch does. For data nodes, modern CPUs with high single-threaded clock speeds (3.5GHz or above) benefit search latency on sequential segment scans. Higher core counts improve bulk indexing throughput but have diminishing returns beyond 16 to 20 cores per node for typical log workloads.

Disk is where most teams underprovision. NVMe SSDs are non-negotiable for hot phase data nodes: the random I/O pattern that Elasticsearch generates during segment merges and concurrent searches will saturate spinning disks and cause indexing pauses. For warm and cold phases, SATA SSDs provide acceptable throughput at lower cost. In AWS, gp3 EBS volumes provide a 3,000 IOPS baseline, can be provisioned up to 16,000 IOPS, and cost less than gp2, so use gp3 for data nodes. io2 is rarely justified for log workloads.

Network is frequently the bottleneck that nobody planned for. Elasticsearch shuffles large amounts of data during shard recovery, rebalancing, and snapshot creation. On a 3-node cluster recovering 200GB of shard data after a node replacement, 10GbE networking copies it in roughly 3 minutes at line rate. On 1GbE, that is 27 minutes during which the data has no replica. Recovery is also throttled by indices.recovery.max_bytes_per_sec, so real recoveries take longer than line rate suggests. Use 10GbE or better between data nodes. In cloud environments, verify that your instance type provides dedicated network bandwidth, because burstable instance types that share network bandwidth will throttle under sustained replication load.
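The recovery times quoted above come from straightforward bandwidth arithmetic. A Python sketch of that calculation, assuming the link runs at full line rate with no protocol overhead and no recovery throttling, so treat the results as a lower bound.

```python
def transfer_minutes(data_gb: float, link_gbps: float) -> float:
    """Minutes to copy data_gb gigabytes over a link of link_gbps gigabits per second."""
    gigabits = data_gb * 8          # bytes to bits
    seconds = gigabits / link_gbps
    return seconds / 60


for link in (10, 1):
    print(f"200GB over {link}GbE: about {transfer_minutes(200, link):.1f} minutes")
```

The tenfold gap between the two links is the window during which a primary has no replica, which is why the network belongs in the sizing exercise alongside CPU, RAM and disk.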

Master nodes deserve dedicated resources. Three master-eligible nodes minimum for quorum. Give them 8GB RAM and 4 CPU cores — they manage cluster state, not data, so resource requirements are modest. Never run master-eligible and data roles on the same node in production. A GC pause on a data node caused by heavy indexing can delay master heartbeats, trigger unnecessary master elections, and destabilize the cluster at the worst possible moment. Keep the roles separated.

cluster-sizing-check.shBASH

#!/bin/bash
# Cluster resource health check — run this daily as part of ops review

echo "=== Node roles, heap pressure, and disk usage ==="
curl -s 'localhost:9200/_cat/nodes?v&h=name,node.role,heap.percent,ram.percent,cpu,disk.used_percent'

echo ""
echo "=== JVM heap used percent per node ==="
# Single-quote the Python program so its own double quotes survive the shell
curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | \
  python3 -c '
import json, sys
nodes = json.load(sys.stdin)["nodes"]
for nid, n in nodes.items():
    name = n["name"]
    heap = n["jvm"]["mem"]["heap_used_percent"]
    print(f"{name:20s} heap: {heap}%")
'

echo ""
echo "=== Thread pool rejections — indicates resource pressure ==="
curl -s 'localhost:9200/_cat/thread_pool?v&h=node_name,name,active,queue,rejected' | \
  awk 'NR==1 || $5+0 > 0'  # Show header and any row with rejections > 0

echo ""
# free_in_bytes excludes memory already used by the page cache, so low values are normal on a warm node
echo "=== OS memory free per node ==="
curl -s 'localhost:9200/_nodes/stats/os?pretty' | \
  python3 -c '
import json, sys
nodes = json.load(sys.stdin)["nodes"]
for nid, n in nodes.items():
    mem = n["os"]["mem"]
    free_gb = mem["free_in_bytes"] / 1024**3
    total_gb = mem["total_in_bytes"] / 1024**3
    name = n["name"]
    print(f"{name:20s} OS memory free: {free_gb:.1f}GB / {total_gb:.1f}GB")
'

Output

# === Node roles, heap pressure, and disk usage ===

# name node.role heap.percent ram.percent cpu disk.used_percent

# es-data-01 d 42 78 38 35

# es-data-02 d 38 75 32 32

# es-master-01 m 22 35 8 12

# === JVM heap used percent per node ===

# es-data-01 heap: 42%

# es-data-02 heap: 38%

# === Thread pool rejections ===

# (no rejections — clean cluster)

# === OS memory free per node ===

# es-data-01 OS memory free: 14.1GB / 64.0GB

# es-data-02 OS memory free: 16.0GB / 64.0GB

🔥Hardware Lessons From Production

In AWS, use gp3 EBS for data nodes — it provides 3,000 IOPS baseline at lower cost than gp2 and scales to 16,000 IOPS without changing volume type. For master nodes, m7g.large (Graviton) provides enough CPU for cluster state management at low cost. Never use burstable instance types like t3 for data nodes — network bandwidth throttling under sustained replication load causes slow shard recovery and cluster instability at exactly the wrong moments.

📊 Production Insight

Heap above 31GB switches the JVM from 4-byte to 8-byte object references, doubling reference overhead and increasing GC pressure.

The OS filesystem cache outside the heap is often more valuable than extra heap — a large cache keeps hot index data in memory without GC overhead.

Rule: 31GB heap maximum per node. Add nodes for capacity. Keep master and data roles on separate instances in production.

🎯 Key Takeaway

31GB is the practical JVM heap ceiling per node — above this, compressed OOPs disable and GC pressure increases. Add nodes instead.

Leave 50% of node RAM for OS filesystem cache — it keeps hot segments in memory with no GC cost.

NVMe SSDs are required for hot data nodes. 10GbE networking prevents shard recovery from becoming a 30-minute event.

Dedicated master nodes prevent GC pauses on data nodes from destabilizing cluster elections.

ELK Stack in DevOps — Where the Rubber Meets the Pipeline

Most teams slap ELK together because someone read a blog. Then the first on-call rotation hits and they're drowning in noise. In DevOps, the ELK stack isn't a dashboard toy. It's your single source of truth when a deploy goes sideways at 3 AM.

Your CI/CD pipeline vomits logs. Your containers restart. Your API latency spikes. ELK ingests all of that, but only if you wired it right. The missing piece most tutorials skip: alerting from Kibana. You don't need a second monitoring stack. Use Elasticsearch Watcher or Kibana alerting rules to fire webhooks into PagerDuty or Slack when error rates cross a threshold.

Pro tip: structure your Elasticsearch index mappings before your first pipeline runs. Dynamic mapping is a trap. It works fine for 100 logs/day. At 10 million events/hour it turns your cluster into a memory furnace. Define explicit field types for timestamps, IPs, and status codes. Your future self will send bourbon.

elk_devops_alerting.ymlYAML

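# io.thecodeforge — devops tutorial
# Illustrative YAML layout of a watch. Watcher itself takes the body under `watcher:` as JSON
# via PUT _watcher/watch/<watch_id>.

elasticsearch:
  watcher:
    trigger:
      schedule:
        interval: "5m"
    input:
      search:
        request:
          indices: ["nginx-access-*"]
          body:
            size: 0   # only the aggregation result is needed, not the hits
            query:
              bool:
                filter:
                  - range:
                      "@timestamp":
                        gte: "now-5m"
                  - term:
                      response_code: 500
            # aggs sits beside query, not inside it
            aggs:
              error_count:
                value_count:
                  field: "_index"
    condition:
      compare:
        "ctx.payload.aggregations.error_count.value":
          gt: 100
    actions:
      notify_slack:   # action name (hypothetical); the action type goes beneath it
        webhook:
          method: "post"
          url: "https://hooks.slack.com/services/T00/B00/xxxx"
          # Slack incoming webhooks expect a JSON payload with a "text" field
          body: '{"text": "⚠️ 500 errors exceeded 100 in 5 minutes — check nginx-access-*"}'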

Output

No direct output — webhook fires when condition met.

Slack message: ⚠️ 500 errors exceeded 100 in 5 minutes — check nginx-access-*

⚠ Production Trap:

Don't alert on every 500. Batch them. A single flapping pod can fire 10,000 alerts in an hour. Use aggregate conditions like '> 100 in 5 minutes' or you'll silence the alert and miss the real outage.
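The difference between alerting per event and alerting on an aggregate can be shown with a sliding window. This is an in-process Python analogue of the "> 100 in 5 minutes" condition, not how Watcher evaluates it (Watcher runs a scheduled query instead). The class name and its edge-triggered behavior are assumptions made for illustration.

```python
from collections import deque


class ErrorRateAlert:
    """Notify once when errors in the window exceed the threshold, not once per error."""

    def __init__(self, threshold: int = 100, window_seconds: float = 300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()
        self._firing = False

    def record(self, ts: float) -> bool:
        """Record one error at time ts (seconds). Return True only on the crossing."""
        self._timestamps.append(ts)
        cutoff = ts - self.window_seconds
        # Drop errors that have aged out of the window
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        over = len(self._timestamps) > self.threshold
        notify = over and not self._firing
        self._firing = over
        return notify


alert = ErrorRateAlert(threshold=100, window_seconds=300)
# A flapping pod emits one 500 per second for two minutes
notifications = [alert.record(float(second)) for second in range(120)]
print(f"errors: {len(notifications)}, notifications sent: {sum(notifications)}")
```

One hundred and twenty errors produce a single notification, which is the behavior you want on the receiving end of a pager.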

🎯 Key Takeaway

Wire alerting from Kibana/Elasticsearch into your incident response tool. Aggregate, don't spike.

Index Creation — Don't Let Elasticsearch Guess Your Schema

Elasticsearch will happily create an index the first time Logstash ships data to it. That's fine for a PoC. In production, that index has no custom analyzers, no explicit mapping, and every string field gets indexed as both text and keyword. That roughly doubles storage for string fields and slows queries.

Create indices manually. Define your mappings and settings upfront. Critical decisions: number_of_shards (rule of thumb: 1 shard per 10 to 50GB of data) and number_of_replicas (1 for most clusters, 2 for mission-critical). Lock the mapping with "dynamic": "strict" so that any document containing a field outside your mapping is rejected. Painful during onboarding, lifesaver after six months of log drift.

Use index templates for time-series data (nginx-access-YYYY.MM.DD). Templates apply settings automatically to new indices that match a pattern. No more manually configuring each daily index.

create_nginx_index_template.ymlYAML

// io.thecodeforge — devops tutorial

PUT _template/nginx-access-template
{
  "index_patterns": ["nginx-access-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.refresh_interval": "30s"
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "client_ip": { "type": "ip" },
      "method": { "type": "keyword" },
      "url": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
      "response_code": { "type": "integer" },
      "body_bytes": { "type": "long" },
      "user_agent": { "type": "text" }
    }
  }
}

Output

{

"acknowledged": true

}
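Before turning on strict mode, it is worth checking what your shippers actually send. A rough Python sketch that mimics the effect of "dynamic": "strict" for top-level fields, using the field names from the template above. It is a pre-flight check you could run against sample events, not Elasticsearch's own validation, and it ignores nested objects and type checking.

```python
# Top-level fields declared in the nginx-access template
MAPPED_FIELDS = {
    "@timestamp", "client_ip", "method", "url",
    "response_code", "body_bytes", "user_agent",
}


def unmapped_fields(doc: dict) -> set:
    """Fields present in the document but absent from the mapping."""
    return set(doc) - MAPPED_FIELDS


def would_be_accepted(doc: dict) -> bool:
    """Under strict mode, a single unmapped field rejects the whole document."""
    return not unmapped_fields(doc)


good = {"@timestamp": "2026-03-06T10:15:00Z", "method": "GET", "response_code": 200}
drifted = {"@timestamp": "2026-03-06T10:15:01Z", "method": "GET", "trace_id": "abc123"}

print(would_be_accepted(good))
print(would_be_accepted(drifted), sorted(unmapped_fields(drifted)))
```

Running a check like this over a day of sample logs tells you which fields to add to the template before strict mode starts rejecting documents in production.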

💡Senior Shortcut:

Use ILM (Index Lifecycle Management) with rollover. Set index.lifecycle.name to a policy that rolls at 50GB or 30 days. Combined with a strict template, zero manual index management.

🎯 Key Takeaway

Create index templates with strict mappings before your first log line hits Elasticsearch. Your future query speed depends on it.

● Production incidentPOST-MORTEMseverity: high

Elasticsearch Goes Read-Only on Black Friday — 4 Hours of Lost Logs

Symptom

At 10:15 AM on Black Friday, the ops team noticed Kibana dashboards stopped updating. New transaction logs vanished. Alerting rules that depended on fresh log data went silent — no alerts fired for any new errors. The on-call engineer spent the first hour assuming PagerDuty was broken.

Assumption

The team assumed Logstash had crashed or the Kafka buffer had filled. They restarted Logstash twice and checked Kafka consumer lag. Both were fine. Logstash was happily sending documents and getting 403 BLOCKED responses back, which it logged quietly and discarded because the dead letter queue was not enabled.

Root cause

Elasticsearch has three disk watermarks. The low watermark at 85% stops new shards from being allocated to that node. The high watermark at 90% starts relocating existing shards away. The flood-stage watermark at 95% — the one that bit this team — switches every index on the affected nodes to read-only mode. ES issued 403 BLOCKED on every write attempt. No exception thrown upstream, no visible crash. Logstash dropped every rejected document because dead_letter_queue.enable was set to false, which is the default. Four hours of payment service logs gone.

Fix

1. Freed disk space immediately by deleting indices past the retention window and force-merging the oldest warm-phase indices to recover segment overhead.
2. Cleared the read-only flag on all affected indices: PUT /_all/_settings with index.blocks.read_only_allow_delete set to null.
3. Enabled dead_letter_queue.enable: true in logstash.yml and set dead_letter_queue.max_bytes: 1024mb so future rejections land in a local DLQ directory rather than disappearing.
4. Added a Prometheus alert firing at 80% disk usage on Elasticsearch data nodes, well before the 85% low watermark, let alone the 95% flood stage.
5. Lowered the flood-stage watermark from the default 95% to 92% so the block trips earlier and leaves more headroom at this cluster's growth rate.

Key lesson

Elasticsearch has three watermarks: 85% low (no new shards), 90% high (relocate shards), 95% flood-stage (read-only). It is the flood-stage that silently kills writes. Most monitoring setups watch the wrong threshold.
Logstash dead letter queue is disabled by default. Without it, every document Elasticsearch rejects — for any reason — vanishes with no record. Enable it before the first log ever ships to production.
Monitor disk usage with a hard alert at 80%. By the time you hit 95% and the flood-stage fires, you have no room to maneuver. At 80% you still have time to delete old indices, add nodes, or expand volumes before anything breaks.
Black Friday, end-of-quarter, and any traffic spike compound log volume in non-linear ways. Capacity planning based on average daily ingest will fail on peak days. Plan for peaks of 5 to 10 times your average daily ingest.
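The thresholds in this post-mortem map cleanly onto a lookup you can drop into a monitoring script. A minimal Python sketch assuming the default watermarks (85, 90, 95) plus the 80% early alert recommended here; the state names are made up for illustration.

```python
# Checked from most to least severe; percentages are disk used on a data node
THRESHOLDS = [
    (95, "flood_stage"),  # indices on the node go read-only
    (90, "high"),         # shards are relocated away from the node
    (85, "low"),          # no new shards are allocated to the node
    (80, "alert"),        # early warning: still time to delete, add nodes, or expand
]


def disk_state(used_percent: float) -> str:
    """Classify disk usage against the watermarks and the early alert line."""
    for threshold, state in THRESHOLDS:
        if used_percent >= threshold:
            return state
    return "ok"


for pct in (72, 81, 86, 91, 96):
    print(f"{pct}% used -> {disk_state(pct)}")
```

The only state that leaves room to act calmly is "alert", which is why the 80% line matters more than any of the built-in watermarks.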

Production debug guide

When logs stop flowing, use this triage order to isolate the broken component in under 5 minutes.

Symptom · 01

Kibana shows no new logs for 10 or more minutes

→

Fix

Check Elasticsearch cluster health first: curl -s 'localhost:9200/_cluster/health?pretty' — if status is yellow or red, the problem is ES, not Logstash. Never restart Logstash before confirming ES is healthy. A healthy Logstash shipping to a broken ES will produce no visible output but will show increased retry counts in _node/stats.

Symptom · 02

Elasticsearch cluster status is red

→

Fix

Find unassigned shards and why they are unassigned: curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED. Look for NODE_LEFT (node went down) or ALLOCATION_FAILED (allocation attempts failed, often from a full disk or a per-node shard limit). Then run curl -s 'localhost:9200/_cluster/allocation/explain?pretty' for a specific explanation on the first unassigned shard, including any allocation deciders that are blocking placement.

Symptom · 03

Elasticsearch cluster status is yellow

→

Fix

Yellow means primaries are assigned but replicas are not. Check why: curl -s 'localhost:9200/_cluster/allocation/explain?pretty' — the most common reasons are disk watermark exceeded on the target node, not enough nodes to place replicas on different nodes than primaries, or a per-node shard limit hit. Yellow does not mean writes are failing — data is safe but not fully replicated.

Symptom · 04

Elasticsearch is green but logs are still missing

→

Fix

Check Logstash pipeline stats: curl -s 'localhost:9600/_node/stats/pipelines?pretty'. Look for events_out significantly lower than events_in, or high worker_concurrency pressure. Also check whether the dead letter queue is growing: ls -lh /var/lib/logstash/dead_letter_queue/. A growing DLQ directory means Elasticsearch is rejecting documents that Logstash sends.

Symptom · 05

Logstash shows high CPU but low throughput

→

Fix

Check Grok match rate by searching for _grokparsefailure tags in Kibana. If more than 5% of events carry this tag, your patterns do not match your actual log format and Grok is exhausting all alternatives before falling through on every miss. Use the Grok Debugger in Kibana Dev Tools to test patterns against 20 real log lines before changing production config.

Symptom · 06

Logstash heap usage above 85%

→

Fix

Check multiline codec usage: grep -r 'multiline' /etc/logstash/conf.d/ — multiline with large max_lines buffers pending events in JVM heap. A burst of Java stacktraces from a crashing service can spike heap from 500MB to 3GB in under a minute on a pipeline processing high-volume Java services. Set -Xmx to at least 4GB when using multiline in production and reduce max_lines to a realistic maximum stacktrace depth.

Symptom · 07

Kibana dashboards load slowly or time out

→

Fix

Check field count on the index pattern: curl -s 'localhost:9200/logs-*/_field_caps?fields=*&pretty' | grep -c type. If you see more than 500 fields, dynamic mapping has created a field explosion. Every Kibana page load fetches all field definitions. Use dynamic: strict in your index template and define only the fields you actually query.

Symptom · 08

Kibana shows 'Could not locate that index-pattern'

→

Fix

Verify the underlying index still exists: curl -s 'localhost:9200/_cat/indices?v' — if an ILM delete phase removed it or it was manually deleted, update the Kibana index pattern to use a wildcard like logs-* so it matches future indices as they are created.

★ ELK Quick Debug Cheat Sheet

Production commands for the five most common ELK failures. Copy, paste, diagnose.

Cluster health yellow or red

Immediate action

Check which shards are unassigned and why before touching anything

Commands

curl -s 'localhost:9200/_cluster/health?pretty'

curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

Fix now

curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true' — retries allocation for shards that failed due to transient errors. If this does not clear them, run the allocation explain API to get the specific blocking reason.

Logstash pipeline stalled — no events flowing

Disk watermark breach — ES rejecting writes with 403 BLOCKED

Kibana dashboard timeout on load

Grok parse failures flooding logs

ELK Stack Component Comparison

Component	Primary Role	Common Production Failure	Key Metric to Monitor
Elasticsearch	Store, index, and search documents at scale using inverted indices	Flood-stage disk watermark at 95% silently switches indices to read-only — no crash, no upstream error	disk_used_percent — alert at 80%, critical at 90%, flood-stage fires at 95%
Logstash	Ingest, parse, enrich, and route log events through configurable pipelines	Grok CPU saturation from unmatched patterns, or multiline codec OOM from stacktrace bursts	jvm heap_used_percent and events_in vs events_out gap in pipeline stats
Kibana	Visualize and explore log data through dashboards and saved searches	Slow load or timeout from index patterns with 2000-plus fields caused by dynamic mapping	index field count and dashboard response time — act when fields exceed 500
Filebeat	Collect and ship logs from hosts and containers to Logstash or Elasticsearch	Registry file corruption after unclean shutdown causing position loss — either re-ships everything or skips forward, both wrong	harvesters_running count and registry file integrity — mount registry on persistent volume

⚙ Quick Reference

12 commands from this guide

File	Command / Code	Purpose
docker-compose-elk.yml	version: '3.8'	What ELK Stack Is and How the Components Connect
beats-comparison.yml	filebeat.inputs:	Beats Family
inspect-inverted-index.sh	curl -s -XPOST 'localhost:9200/logs/_doc/1' \	How Elasticsearch Actually Indexes Documents
logstash-pipeline.conf	input {	Logstash Pipelines
logstash-filters.conf	filter {	Logstash Filter Cheat Sheet
kibana-incident-dashboard.ndjson	{	Kibana Dashboards That Actually Answer Questions
kql-vs-lucene.txt	KQL: status_code: 500	Kibana Query Language (KQL) vs Lucene
shard-capacity-planning.sh	echo "=== Current shard distribution across data nodes ==="	Shard Strategy and Capacity Planning
ilm-policy.json	{	Index Lifecycle Management
cluster-sizing-check.sh	echo "=== Node roles, heap pressure, and disk usage ==="	Cluster Sizing and Hardware Selection
elk_devops_alerting.yml	elasticsearch:	ELK Stack in DevOps
create_nginx_index_template.yml	PUT _template/nginx-access-template	Index Creation

Key takeaways

ELK is a data pipeline with three failure domains

Elasticsearch stores, Logstash transforms, Kibana visualizes. Triage order when logs stop: ES first, Logstash second, Kibana last.

Inverted index maps terms to document IDs and postings lists, which is why full-text search across billions of documents completes in milliseconds.

Disk watermarks

85% stops new shard allocation, 90% relocates shards, 95% flood-stage switches indices read-only. Monitor at 80% — by 95% you have no room to react.

Enable Logstash dead letter queue before production. Without it, every document Elasticsearch rejects vanishes with zero visibility. The DLQ is a local directory, not an ES index.

Shard count cannot be changed after index creation without a reindex. Calculate ceiling(daily_volume_GB / 30) before creating the first index. Shard metadata costs roughly 1MB of heap per shard.

ILM warm phase: pair shrink with allocate and number_of_replicas: 0 so a copy of every shard can be gathered on one node. A stalled warm phase raises no alert and only shows up in _ilm/explain on each affected index.

Kibana dashboard panels should each answer one specific operational question. dynamic: strict on index templates prevents the field explosion that makes every Kibana page load slow.

31GB is the JVM heap ceiling per Elasticsearch node. Above this, compressed OOPs disable and GC pressure increases. Add nodes instead of increasing per-node heap.

Kafka between Filebeat and Logstash is not optional at scale: it is your replay buffer when Logstash crashes or ES goes read-only and you need to recover lost events.

pipeline.ordered: false removes cross-worker coordination overhead in Logstash. Set it explicitly on log pipelines where ES timestamps handle event ordering at query time.

Choose the right Beat

Filebeat for logs, Metricbeat for metrics, Packetbeat for network traffic, Winlogbeat for Windows Event Logs, Heartbeat for uptime, Auditbeat for security events.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

How does Elasticsearch's inverted index work, and why is it faster than ...

Q02 · SENIOR

You notice 5% of logs are missing from Kibana. Walk through your debuggi...

Q03 · SENIOR

Design a production ELK stack for a platform ingesting 50,000 log events...

Q04 · SENIOR

Your ELK cluster just went read-only at peak traffic. You have no dead l...

Q05 · JUNIOR

What is the difference between Filebeat and Metricbeat? When would you u...

Q01 of 05 · JUNIOR

How does Elasticsearch's inverted index work, and why is it faster than scanning every document?

ANSWER

Elasticsearch builds an inverted index during document indexing. It tokenizes the text fields, normalizes terms by lowercasing and optionally stemming, and creates a mapping from each unique term to a postings list. The postings list contains document IDs, term frequencies, and position offsets. When you search for a term, ES looks it up in the sorted term dictionary — a binary search — and retrieves the pre-computed list of matching document IDs. No full document scan happens. BM25 scoring then ranks matches by term frequency, inverse document frequency, and field length normalization. Each shard is an independent Lucene index with immutable segments that get periodically merged in the background by a merge thread.
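A toy version makes the answer concrete. This Python sketch builds a term-to-postings map with positions and answers a single-term query by lookup. It assumes a simple lowercase-and-split tokenizer and skips everything Lucene adds on top (segments, the compressed term dictionary, BM25 scoring), so read it as an illustration of the idea rather than of Elasticsearch internals.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list:
    """Lowercase and split on non-alphanumerics (a stand-in for a real analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict) -> dict:
    """Map each term to {doc_id: [positions]}; len(positions) is the term frequency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index: dict, term: str) -> list:
    """Look the term up directly; no document is scanned at query time."""
    return sorted(index.get(term.lower(), {}))


docs = {
    1: "Payment failed: timeout",
    2: "Payment succeeded",
    3: "Timeout contacting payment gateway",
}
index = build_index(docs)

print(search(index, "payment"))
print(search(index, "timeout"))
print(search(index, "refund"))
```

The work happens once at index time. At query time the cost depends on the number of matching postings, not on the number of documents stored.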

FAQ · 6 QUESTIONS

Frequently Asked Questions

What is the ELK Stack in simple terms?

Why does Elasticsearch go read-only and how do I fix it?

How many shards should I use for my Elasticsearch index?

What is the difference between Filebeat and Logstash?

What is the difference between KQL and Lucene in Kibana?

Should I run Kibana on the same server as Elasticsearch?

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

✓ Verified

production tested

July 18, 2026

last updated

2,466

articles · all by Naren
