
Log Aggregation Best Practices: Structure, Ship, and Survive at Scale

In Plain English 🔥
Imagine every employee in a 500-person company keeps their own private diary of every mistake, decision, and event that happened at their desk. When something goes wrong, a manager has to run to 500 desks, open 500 diaries, and piece together what happened. Log aggregation is the company deciding: everyone writes their diary entries on sticky notes and posts them to one giant shared wall. Now the manager walks to one place, reads the full story in order, and finds the problem in minutes instead of days.
⚡ Quick Answer
Log aggregation collects every event from every service into one central, consistently formatted, queryable store. Do three things well: emit structured JSON logs with an organisation-wide field schema, ship them through an agent with a disk-backed buffer so nothing is lost under load, and apply tiered retention plus rate-based alerting so the pipeline stays both affordable and actionable.

Production systems are lying to you right now — not maliciously, but by omission. Every microservice, container, and serverless function is quietly writing its own story to its own local log file, and the moment something breaks at 2 a.m., that story is scattered across dozens of machines that may not even exist by morning. Logs that live only on the box they were generated on are worse than useless — they're a false sense of security.

Log aggregation solves the fundamental observability problem: getting every event from every component of your system into one place, in a consistent format, fast enough to act on. Without it, you're debugging in the dark. With it, you can trace a single user's failed checkout across a frontend service, an auth service, a payments API, and a database — in seconds, not hours. The difference between a 5-minute mean-time-to-resolution and a 5-hour one is almost always a well-designed logging pipeline.

By the time you finish this article, you'll know how to structure your logs so machines and humans can both read them, how to build a pipeline that doesn't drop messages under load, how to set retention and alerting policies that don't bankrupt you, and the three mistakes that silently kill observability in otherwise well-engineered systems. These are patterns pulled from real production environments — the kind that handle millions of events per day.

Structured Logging: Stop Writing Sentences, Start Writing Data

The single highest-leverage change you can make to your logging strategy costs zero dollars and takes one afternoon: switch from unstructured to structured logs.

Unstructured logs are prose. They look like this: ERROR: Payment failed for user 4821 after 3 retries at 14:32:01. A human can read it. A machine cannot reliably parse it. The moment you want to query 'show me all payment failures where retry_count > 2 in the last hour', you're writing fragile regex against free-form text. That breaks the moment someone changes the wording of the message.

Structured logs are data. Every log line is a JSON object (or logfmt key-value pairs) with consistent, queryable fields. The same event becomes: {"level":"error","event":"payment_failed","user_id":4821,"retry_count":3,"timestamp":"2024-01-15T14:32:01Z"}. Now your log aggregator can index retry_count as a number, and your query is a trivial filter — no regex, no fragility.

The discipline here is schema consistency. Define your fields organization-wide: service_name, trace_id, user_id, duration_ms, level. Every team uses the same names. The payoff comes when you correlate events across services — and that only works if field names match.

structured_logger.py · PYTHON
import json
import logging
import time
import uuid
from datetime import datetime, timezone

# ─── Custom JSON formatter ────────────────────────────────────────────────────
# This replaces the default plain-text log format with a structured JSON object.
# Every log line becomes a machine-parseable record with consistent field names.
class JsonFormatter(logging.Formatter):
    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        # Build the base log payload — these fields appear on EVERY log line.
        # This is your organisation-wide schema. Everyone uses the same keys.
        log_payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),  # Always UTC — never local time
            "level": record.levelname.lower(),                     # 'info', 'error', 'warning'
            "service": self.service_name,                          # Which service produced this
            "message": record.getMessage(),                        # The human-readable description
            "logger": record.name,                                 # Python logger name (e.g. 'payments.processor')
            "line": f"{record.filename}:{record.lineno}",          # Exact source location for fast debugging
        }

        # Merge any extra context fields the caller passed in.
        # This is where structured data lives: user_id, order_id, duration_ms, etc.
        if hasattr(record, 'extra_fields'):
            log_payload.update(record.extra_fields)

        # If an exception is attached, include the full traceback as a string field
        # rather than letting it spill onto multiple lines and break parsing.
        if record.exc_info:
            log_payload['exception'] = self.formatException(record.exc_info)

        return json.dumps(log_payload)  # One JSON object per line (NDJSON format)


# ─── Logger factory ───────────────────────────────────────────────────────────
def create_logger(service_name: str) -> logging.Logger:
    logger = logging.getLogger(service_name)
    logger.setLevel(logging.DEBUG)

    # getLogger returns the SAME instance for a given name, so guard against
    # attaching a second handler — that would duplicate every log line.
    if not logger.handlers:
        handler = logging.StreamHandler()  # In containers, always log to stdout — never to a file
        handler.setFormatter(JsonFormatter(service_name))
        logger.addHandler(handler)
    return logger


# ─── Context-aware logging helper ─────────────────────────────────────────────
# This wrapper adds structured fields to a log call without cluttering call sites.
class ContextLogger:
    def __init__(self, logger: logging.Logger, context: dict):
        self._logger = logger
        self._context = context  # Base context fields attached to every log from this instance

    def info(self, message: str, **extra):
        self._emit(logging.INFO, message, extra)

    def error(self, message: str, exc_info=False, **extra):
        self._emit(logging.ERROR, message, extra, exc_info=exc_info)

    def warning(self, message: str, **extra):
        self._emit(logging.WARNING, message, extra)

    def _emit(self, level: int, message: str, extra: dict, exc_info=False):
        # Merge base context with call-site extra fields.
        # Call-site values win on conflict — more specific context overrides general.
        merged = {**self._context, **extra}
        # logging's `extra` kwarg attaches each dict entry as an attribute on the
        # LogRecord, so JsonFormatter can pick up `extra_fields` directly.
        # stacklevel=3 (Python 3.8+) makes the recorded file:line point at the
        # original call site rather than at this wrapper.
        self._logger.log(level, message,
                         extra={"extra_fields": merged},
                         exc_info=exc_info,
                         stacklevel=3)


# ─── Simulated payment processing function ────────────────────────────────────
def process_payment(order_id: str, user_id: int, amount_cents: int):
    base_logger = create_logger("payments-service")

    # Attach a trace_id at the start of the request.
    # This single ID lets you find every log line for this one transaction
    # across ALL services — frontend, auth, payments, database — in one query.
    trace_id = str(uuid.uuid4())
    log = ContextLogger(base_logger, {
        "trace_id": trace_id,
        "order_id": order_id,
        "user_id": user_id,
    })

    log.info("payment_processing_started", amount_cents=amount_cents)

    start_time = time.monotonic()

    try:
        # Simulate payment gateway call
        time.sleep(0.042)  # Pretend network latency
        if amount_cents > 100_000:
            raise ValueError("Amount exceeds single-transaction limit")

        duration_ms = round((time.monotonic() - start_time) * 1000, 2)

        # duration_ms is logged as a NUMBER, not a string.
        # This matters — your aggregator can now compute p99 latency with a simple query.
        log.info("payment_succeeded",
                 amount_cents=amount_cents,
                 duration_ms=duration_ms,
                 gateway="stripe")

    except ValueError as exc:
        duration_ms = round((time.monotonic() - start_time) * 1000, 2)
        log.error("payment_failed",
                  exc_info=True,
                  amount_cents=amount_cents,
                  duration_ms=duration_ms,
                  failure_reason="limit_exceeded")
        raise


# ─── Run the simulation ───────────────────────────────────────────────────────
if __name__ == "__main__":
    print("--- Successful payment ---")
    process_payment(order_id="ORD-9921", user_id=4821, amount_cents=4999)

    print("\n--- Failed payment ---")
    try:
        process_payment(order_id="ORD-9922", user_id=4821, amount_cents=150_000)
    except ValueError:
        pass  # Error is logged inside process_payment; we just suppress the re-raise here
▶ Output
--- Successful payment ---
{"timestamp": "2024-01-15T14:32:01.102Z", "level": "info", "service": "payments-service", "message": "payment_processing_started", "logger": "payments-service", "line": "structured_logger.py:89", "trace_id": "a3f1c2d4-...", "order_id": "ORD-9921", "user_id": 4821, "amount_cents": 4999}
{"timestamp": "2024-01-15T14:32:01.144Z", "level": "info", "service": "payments-service", "message": "payment_succeeded", "logger": "payments-service", "line": "structured_logger.py:102", "trace_id": "a3f1c2d4-...", "order_id": "ORD-9921", "user_id": 4821, "amount_cents": 4999, "duration_ms": 42.1, "gateway": "stripe"}

--- Failed payment ---
{"timestamp": "2024-01-15T14:32:01.187Z", "level": "info", "service": "payments-service", "message": "payment_processing_started", ...}
{"timestamp": "2024-01-15T14:32:01.229Z", "level": "error", "service": "payments-service", "message": "payment_failed", "trace_id": "b7e2d1f5-...", "order_id": "ORD-9922", "user_id": 4821, "amount_cents": 150000, "duration_ms": 41.8, "failure_reason": "limit_exceeded", "exception": "ValueError: Amount exceeds single-transaction limit\n File ..."}
⚠️ Pro Tip: Log Events, Not Sentences
Use snake_case event names as your `message` field ('payment_failed', not 'Payment failed for user'). This makes your message field groupable and queryable — you can count occurrences of 'payment_failed' as a metric without any parsing. Prose messages are for humans; event names serve both humans and machines.

Building a Pipeline That Doesn't Lose Messages Under Load

Getting logs off the machine that produced them is harder than it sounds. Most teams get this wrong in one of two ways: they either block the application while waiting for log writes to complete, or they drop messages silently when the downstream system is slow. Both failures cost you exactly when you need observability the most — during an incident.

The canonical architecture for a production log pipeline is: Application → Local Agent → Message Buffer → Aggregator → Storage. Each arrow is an asynchronous boundary. The application never waits for a log to reach Elasticsearch. It writes to stdout. A local agent (Fluent Bit, Filebeat), in Kubernetes usually deployed as a per-node DaemonSet, tails that output and ships it forward. A message queue (Kafka, Kinesis, or even a Fluent Bit buffer) absorbs spikes. The aggregator (Logstash, Fluentd) processes and routes. Storage (Elasticsearch, Loki, CloudWatch) persists.
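
Inside the application itself, the same non-blocking principle applies. As a minimal sketch using Python's stdlib `QueueHandler`/`QueueListener` (the queue size and logger names here are illustrative, not from the article), log emission can be decoupled from delivery so a slow handler never blocks request threads:

```python
import logging
import logging.handlers
import queue

# A bounded queue is the async boundary: the application thread only ever
# enqueues a record (microseconds); a background thread does the slow I/O.
log_queue: "queue.Queue[logging.LogRecord]" = queue.Queue(maxsize=10_000)

app_logger = logging.getLogger("payments-service")
app_logger.setLevel(logging.INFO)
app_logger.addHandler(logging.handlers.QueueHandler(log_queue))

# The listener drains the queue and forwards records to the real (possibly
# slow) handler — here stdout; in production perhaps a socket/HTTP handler.
slow_handler = logging.StreamHandler()
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

app_logger.info("payment_processing_started")  # returns immediately

listener.stop()  # flushes any remaining records on shutdown
```

Note the trade-off a bounded queue implies: when it is full, `QueueHandler` drops the record rather than blocking the application. Size the queue so that only happens under pathological load.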

The local agent is your reliability contract. Configure it with a disk-backed buffer so if the aggregator goes down for 10 minutes, the agent stores messages locally and replays them when connectivity restores. Without this, a 10-minute aggregator restart means a 10-minute gap in your logs — right when you're trying to understand what caused the aggregator restart in the first place.

fluent-bit-pipeline.yaml · YAML
# Fluent Bit configuration for a node-level Kubernetes log shipping agent.
# This config implements the reliable pipeline pattern:
#   Container stdout → Fluent Bit tail → Disk buffer → Loki (with retry)
#
# Deploy this as a DaemonSet so every node runs exactly one agent.
# The agent tails /var/log/containers/* which is where Kubernetes writes
# all container stdout/stderr on the host.

[SERVICE]
    # How often Fluent Bit flushes buffered records to the output plugin (seconds).
    # Lower = lower latency, higher CPU. 5s is a sensible production default.
    Flush         5

    # Grace period on shutdown — gives the agent time to flush pending messages
    # before the process exits. Critical for log completeness during rolling deploys.
    Grace         30

    # Enable the built-in HTTP server for health checks and metrics.
    # Your orchestrator should probe /api/v1/health before marking the pod ready.
    HTTP_Server   On
    HTTP_Listen   0.0.0.0
    HTTP_Port     2020

    # Storage type 'filesystem' means Fluent Bit buffers to disk, not just RAM.
    # If the output plugin (Loki) is unreachable, messages queue on disk instead
    # of being dropped. This is your primary reliability mechanism.
    storage.type  filesystem
    storage.path  /var/log/flb-storage/
    storage.sync  normal

    # How many buffered chunks may be held in memory at once; the rest stay
    # on disk. To cap total disk usage, set storage.total_limit_size on the
    # [OUTPUT] section — and size it generously: you'd rather fill disk
    # temporarily than lose log data during an incident.
    storage.max_chunks_up  128

# ─── INPUT: Tail all Kubernetes container log files ───────────────────────────
[INPUT]
    Name              tail
    Tag               kube.*

    # This glob matches every container log file Kubernetes writes on this node.
    Path              /var/log/containers/*.log

    # Kubernetes container logs are JSON lines written by the container runtime.
    # This parser extracts the actual log text and metadata from that wrapper.
    Parser            docker

    # This SQLite DB tracks which byte offset has been read in each file.
    # On Fluent Bit restart, it resumes from where it left off — no duplicate
    # shipping and no gap, even across agent restarts or node reboots.
    DB                /var/log/flb_kube.db

    # Skip_Long_Lines skips lines longer than the input buffer instead of
    # stalling the whole file. Refresh_Interval (seconds) controls how often
    # the Path glob is re-scanned for newly created log files.
    Skip_Long_Lines   On
    Refresh_Interval  10

# ─── FILTER: Enrich logs with Kubernetes pod/namespace/container metadata ─────
[FILTER]
    Name                kubernetes
    Match               kube.*

    # This calls the Kubernetes API to look up pod metadata for each log line
    # and merges it in. Result: every log record now has namespace, pod_name,
    # container_name, and any labels/annotations you want to query on.
    Kube_URL            https://kubernetes.default.svc:443
    Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log           On   # Merge the nested JSON log string into the top-level record
    Keep_Log            Off  # Drop the raw 'log' string field after merging to save space
    K8S-Logging.Parser  On   # Honour pod annotation 'fluentbit.io/parser' for custom parsers

# ─── FILTER: Drop noisy health-check logs before they hit storage ──────────────
[FILTER]
    Name   grep
    Match  kube.*
    # Kubernetes liveness probes generate hundreds of GET /healthz lines per hour.
    # They are never useful for debugging. Drop them here, before the buffer,
    # to save storage costs and reduce query noise. Adjust the regex to your paths.
    Exclude  log  GET /healthz|GET /readyz|GET /metrics

# ─── OUTPUT: Ship to Grafana Loki with disk-backed retry ──────────────────────
[OUTPUT]
    Name            loki
    Match           kube.*
    Host            loki.monitoring.svc.cluster.local
    Port            3100

    # Labels form Loki's index — keep them LOW-cardinality and choose wisely.
    # namespace and app give you fast per-service filtering.
    # Do NOT use pod_name or container_name as labels — too high cardinality,
    # will degrade Loki performance severely (see the warning below).
    Labels          job=kubernetes, namespace=$kubernetes['namespace_name'], app=$kubernetes['labels']['app']

    # Loki expects log lines as strings. This serialises the full enriched record
    # back to JSON, so structured fields are preserved inside the log line value.
    Line_Format     json

    # Retry failed sends up to 5 times with exponential backoff.
    # Combined with the filesystem storage above, this means messages survive
    # a Loki restart of up to (buffer_size / throughput) duration.
    Retry_Limit     5
▶ Output
# Fluent Bit startup log (visible via kubectl logs -n logging fluent-bit-xxxxx):
[2024/01/15 14:32:00] [ info] [fluent bit] version=2.2.0
[2024/01/15 14:32:00] [ info] [storage] backend type = filesystem
[2024/01/15 14:32:00] [ info] [storage] storage path = /var/log/flb-storage/
[2024/01/15 14:32:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2147 watch_fd=1 name=/var/log/containers/payments-service-abc123.log
[2024/01/15 14:32:00] [ info] [output:loki:loki.0] worker #0 started

# Health check endpoint output (curl http://localhost:2020/api/v1/metrics):
{
"input": { "kube.tail.0": { "records": 15420, "bytes": 4823091 } },
"filter": { "kube.kubernetes.0": { "drop_records": 0, "add_records": 15420 } },
"output": { "loki.0": { "proc_records": 15398, "retried_records": 12, "dropped_records": 0 } }
}
⚠️ Watch Out: High-Cardinality Labels Will Kill Loki
In Loki (and Prometheus), labels create a separate stream/series per unique value combination. If you use pod_name as a label and you have 200 pods, you get 200 streams. Use pod_name as a label across 10 namespaces with 50 pods each and you've got 500 streams — all unique, all indexed — and because label values multiply combinatorially, Loki's memory use and query performance degrade sharply as streams accumulate. Keep labels to low-cardinality values like namespace, app, and environment. Put pod_name inside the log line body instead.

Retention, Alerting, and the Cost of Keeping Everything Forever

Here's the uncomfortable truth about log storage: keeping every log line forever is not observability — it's hoarding. And it will bankrupt your cloud budget while simultaneously making it harder to find what you're looking for.

A sensible retention strategy is tiered. Hot storage (Elasticsearch, Loki): the last 7-14 days, indexed and fully queryable, expensive. Warm storage (S3 or GCS, with a tool like Athena or BigQuery for ad-hoc queries): the last 30-90 days, compressed, cheap. Cold/archive (S3 Glacier, Glacier Instant Retrieval): 1-7 years, for compliance only, queried only during audits. Most debugging happens within 48 hours, and most compliance requirements are satisfied by 1 year of retention. Design your pipeline around that reality.
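
The tiering rules above can be expressed as a small policy function. This is an illustrative sketch — the thresholds and tier names follow the strategy just described, not any standard API; in practice the same policy lives in Elasticsearch ILM or S3 lifecycle rules rather than application code:

```python
from datetime import timedelta

# Illustrative thresholds taken from the tiered strategy described above.
HOT_MAX = timedelta(days=14)        # Elasticsearch / Loki: indexed, fully queryable
WARM_MAX = timedelta(days=90)       # S3 / GCS + Athena / BigQuery: compressed, ad hoc
COLD_MAX = timedelta(days=7 * 365)  # Glacier: compliance only


def storage_tier(age: timedelta) -> str:
    """Return which storage tier a log chunk of the given age belongs in."""
    if age <= HOT_MAX:
        return "hot"
    if age <= WARM_MAX:
        return "warm"
    if age <= COLD_MAX:
        return "cold"
    return "delete"  # past every compliance window: stop paying for it
```

A nightly lifecycle job would evaluate each chunk's age against this function and move (or delete) anything whose tier changed since the previous run.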

The second part of this equation is alerting on log content — and here's where teams consistently over-alert. Every ERROR log firing a PagerDuty is a recipe for alert fatigue. Instead, alert on derived signals: the error rate (errors per minute, not individual errors), the absence of expected logs (no checkout_completed events in 10 minutes is more alarming than one payment_failed), and sudden cardinality spikes in specific error types. Use your aggregator to compute these signals — that's what it's for.

loki-alert-rules.yaml · YAML
# Loki alerting rules written in LogQL, Loki's query language.
# These rules run continuously inside Loki's ruler component and fire
# alerts into your Alertmanager when conditions are met.
#
# The philosophy here is: alert on RATES and ABSENCE, not individual errors.
# One error log is noise. 50 error logs per minute is an incident.

groups:
  - name: payments-service-alerts
    interval: 1m  # Evaluate every rule in this group every 60 seconds
    rules:

      # ── Rule 1: Error RATE alert ─────────────────────────────────────────────
      # This fires when the payments service logs more than 10 errors per minute.
      # Using rate() means the alert auto-scales — a brief spike doesn't wake
      # someone up, but a sustained error rate does.
      - alert: PaymentsServiceHighErrorRate
        # LogQL explanation:
        #   {app="payments-service"}  — select only logs from this app
        #   |= `"level":"error"`      — filter to lines containing this string
        #   count_over_time(...[1m])  — count matching lines in a 1-minute window
        #   > 10                      — fire if more than 10 errors in that window
        # (rate() would give a per-SECOND rate, so count_over_time is the right
        #  function for a per-minute threshold.)
        expr: |
          sum(count_over_time({app="payments-service", namespace="production"} |= `"level":"error"` [1m])) > 10
        for: 2m  # Must be true for 2 consecutive minutes before firing (reduces flapping)
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "Payments service error rate is {{ $value | humanize }} errors/min"
          # The runbook URL is MANDATORY — on-call engineers should never need to
          # think about what to do when an alert fires. The runbook tells them.
          runbook_url: "https://wiki.internal/runbooks/payments-high-error-rate"
          # This LogQL link drops the on-call engineer directly into the relevant
          # logs in Grafana — zero time wasted navigating dashboards.
          grafana_explore_url: "https://grafana.internal/explore?orgId=1&left=[...]"

      # ── Rule 2: Absence of expected events alert ──────────────────────────────
      # This is the alert most teams forget to write — and it catches the scariest
      # failure mode: the service is running, returning 200s, but silently not
      # processing any actual business events (e.g. stuck queue consumer).
      - alert: PaymentsNoSuccessfulTransactions
        # If there are zero payment_succeeded log events in the last 5 minutes
        # during business hours, something is very wrong. Note `or vector(0)`:
        # with no matching lines the inner query returns an EMPTY result, not 0,
        # so the comparison would never fire without it (requires Loki >= 2.6).
        expr: |
          (sum(count_over_time({app="payments-service", namespace="production"} |= `"payment_succeeded"` [5m])) or vector(0)) == 0
        for: 5m
        labels:
          severity: critical
          team: payments
        annotations:
          summary: "No successful payments processed in the last 5 minutes"
          runbook_url: "https://wiki.internal/runbooks/payments-no-transactions"

      # ── Rule 3: Spike in a specific error type ────────────────────────────────
      # Alert when a specific failure_reason appears more than 5 times per minute.
      # This catches targeted issues (e.g. card network outage) before they become
      # the dominant error type in the broader error-rate alert above.
      # Pipeline stages (explained here, outside the query itself):
      #   | json                                — parse the JSON log line into fields
      #   | failure_reason = "gateway_timeout"  — filter on the structured field
      - alert: PaymentsGatewayTimeouts
        expr: |
          sum(
            count_over_time(
              {app="payments-service", namespace="production"}
                | json
                | failure_reason = "gateway_timeout"
              [1m]
            )
          ) > 5
        for: 3m
        labels:
          severity: warning
          team: payments
        annotations:
          summary: "Gateway timeout rate: {{ $value | humanize }}/min — possible Stripe outage"
          runbook_url: "https://wiki.internal/runbooks/payments-gateway-timeout"
▶ Output
# When PaymentsServiceHighErrorRate fires, Alertmanager sends:
{
"status": "firing",
"labels": {
"alertname": "PaymentsServiceHighErrorRate",
"severity": "critical",
"team": "payments",
"namespace": "production",
"app": "payments-service"
},
"annotations": {
"summary": "Payments service error rate is 23.4 errors/min",
"runbook_url": "https://wiki.internal/runbooks/payments-high-error-rate"
},
"startsAt": "2024-01-15T14:34:00Z",
"generatorURL": "http://loki-ruler:3100/..."
}

# Prometheus-format metrics exposed by Loki ruler (scraped by your Prometheus):
# loki_ruler_evaluations_total{rule_group="payments-service-alerts"} 1440
# loki_ruler_evaluation_failures_total{rule_group="payments-service-alerts"} 0
🔥 Interview Gold: The Three Tiers of Log Value
Interviewers love asking 'how do you manage log storage costs?' The answer they want: tiered retention by value. Debug/trace logs: 3 days. Info/warning: 14 days hot, 90 days warm. Error/audit: 1 year warm, 7 years cold. Compress everything older than 24 hours. The key insight is that 80% of your log volume is trace/debug data that's worthless after 72 hours — most teams keep it forever by default and pay 10x what they should.
| Aspect | ELK Stack (Elasticsearch + Logstash + Kibana) | Grafana Loki + Fluent Bit |
| --- | --- | --- |
| Storage model | Indexes every field — fast full-text search on any field | Indexes only labels — stores log lines as compressed chunks |
| Storage cost | High — full inverted index for all fields is expensive at scale | Low — chunk compression gives 5-10x storage efficiency vs ELK |
| Query speed on known labels | Fast | Fast (label queries hit the index) |
| Query speed on unindexed fields | Fast (everything is indexed) | Slow (requires scanning log chunks) |
| Kubernetes native feel | Requires Filebeat or Logstash for K8s metadata enrichment | Designed for Kubernetes — Helm chart, native pod label discovery |
| Operational complexity | High — Elasticsearch cluster tuning, shard management, JVM heap | Low — stateless queriers, object storage backend, minimal ops |
| Best for | Compliance workloads, full-text search, large enterprises | Cloud-native microservices, cost-sensitive teams, Grafana shops |
| Alerting | Kibana Alerts or ElastAlert (third-party) | Loki Ruler — native LogQL alerts, integrates with Alertmanager |
| Learning curve | Steep — KQL query language, index lifecycle management | Moderate — LogQL is similar to PromQL, intuitive for DevOps teams |

🎯 Key Takeaways

  • Structured JSON logs with consistent field schemas across all services are the foundation — without them, every query is fragile regex against prose, and cross-service correlation is impossible.
  • The local agent's disk-backed buffer is your reliability contract: the application writes to stdout, the agent handles delivery, and if the aggregator goes down, messages queue on disk and replay automatically — no code changes required.
  • Alert on derived signals (error rate per minute, absence of expected events) rather than individual log lines — one ERROR log is noise; 50 per minute is an incident. Write runbook URLs into every alert annotation.
  • Tiered retention is not optional at scale — debug logs older than 3 days are almost never useful, but teams routinely pay to store them for years. Moving aggressively to warm/cold storage tiers can cut logging costs by 60-80% without losing any meaningful observability.

⚠ Common Mistakes to Avoid

  • Mistake 1: Logging inside tight loops or hot paths without sampling — Symptom: a single high-traffic endpoint generates 80% of your total log volume, storage costs spike, and the aggregator pipeline backs up during traffic peaks. Fix: Add log sampling for debug/trace-level logs in hot paths. Log the first occurrence of a repeated event and then once every N times: track a counter in-memory and emit the log line only when counter % 100 == 0. Alternatively, use a sampling middleware that probabilistically drops a configurable percentage of lower-severity logs. Never sample error logs — those are always worth keeping.
  • Mistake 2: Using log timestamps from the application server without timezone normalisation — Symptom: you're correlating logs from two services and the events appear to be 5 hours apart even though they happened in the same second; or logs arrive at the aggregator with timestamps that are hours in the future or past. Fix: Always emit timestamps in UTC with explicit timezone offset (ISO 8601: '2024-01-15T14:32:01.123Z'). Never rely on local server time or implicit timezone. In your aggregator, configure it to use the log-line timestamp as the canonical time rather than the ingestion time — in Fluent Bit this is 'Time_Key timestamp' in your parser config.
  • Mistake 3: Putting sensitive data in log fields without a scrubbing step — Symptom: a security audit finds credit card numbers, session tokens, or PII (email addresses, phone numbers) sitting in plaintext in your Elasticsearch index — often for years, and replicated to your S3 cold tier. Fix: Add a scrubbing filter in your agent pipeline before any data leaves the host. In Fluent Bit, use the lua filter to run a regex over sensitive fields. At the application layer, never log raw request bodies or response payloads — log derived metadata instead (e.g. log 'request_body_size_bytes': 248 instead of the body itself). Treat your log pipeline as a data processing boundary with the same data-governance standards as your database.
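
The counter-based sampling described in Mistake 1 can be sketched as a standard `logging.Filter`. The class name and the every-Nth policy are illustrative; the one non-negotiable rule from the text is preserved: error-level records bypass sampling entirely.

```python
import logging
from collections import defaultdict


class SamplingFilter(logging.Filter):
    """Pass the 1st occurrence of each message, then every Nth after that.

    Applies only to records below ERROR — errors are always kept.
    """

    def __init__(self, every_n: int = 100):
        super().__init__()
        self.every_n = every_n
        self.counts = defaultdict(int)  # message template -> occurrences seen

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.ERROR:
            return True  # never sample errors
        self.counts[record.msg] += 1
        n = self.counts[record.msg]
        # Emit the first occurrence, then one out of every `every_n`.
        return n == 1 or n % self.every_n == 0
```

Attach it only to hot-path loggers — `hot_logger.addFilter(SamplingFilter(100))` — so low-traffic services keep full debug detail. In multi-process or multi-threaded servers each worker keeps its own counter, which is usually acceptable for a volume-control mechanism.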

Interview Questions on This Topic

  • Q: Walk me through the log pipeline architecture you'd design for a 50-microservice Kubernetes platform that needs to handle 500,000 log lines per second with a 30-day queryable retention. What components would you choose and why?
  • Q: We're seeing a 30-minute gap in logs for our payments service every night at 2 a.m. How would you diagnose whether this is a log production problem, a shipping problem, or a storage problem?
  • Q: A candidate earlier today said they'd add trace_id, user_id, and request_id as Loki label values to make queries faster. What would you say to them?

Frequently Asked Questions

What is the difference between log aggregation and log monitoring?

Log aggregation is the process of collecting logs from all your services and centralising them in one place — it's about data collection and storage. Log monitoring is the practice of watching those aggregated logs in real time and alerting on conditions of interest. You need aggregation before you can do meaningful monitoring; they're sequential layers in the same pipeline, not alternatives.

Should I log to a file or to stdout in a containerised application?

Always log to stdout (and stderr for errors) in containers. The container runtime captures stdout and writes it to a file on the host node, where your log agent can tail it. Logging directly to a file inside the container breaks the agent's discovery, requires volume mounts, and means logs are lost when the container is destroyed. The twelve-factor app methodology codified this as a rule for exactly this reason.

How is log aggregation different from distributed tracing, and do I need both?

Logs tell you what happened and carry full context (the entire state at a point in time). Distributed traces tell you the timing and causal flow of a single request across multiple services — they show you latency breakdowns and which service was the bottleneck. You need both. The bridge between them is the trace_id field: if you include the same trace ID in both your logs and your traces, you can jump from a trace span directly to all the log lines generated during that span — which is where the real debugging detail lives.
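
One way to keep that shared trace_id flowing into every log line, without threading it through each function call, is a context variable plus a logging filter. A minimal sketch (the `TraceIdFilter` name and the inline JSON format string are illustrative; the `trace_id` field name follows the article's schema):

```python
import contextvars
import logging
import uuid

# Holds the current request's trace ID; isolated per thread / async task.
trace_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "trace_id", default="-"
)


class TraceIdFilter(logging.Filter):
    """Stamp the active trace_id onto every record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter('{"trace_id":"%(trace_id)s","msg":"%(message)s"}')
)
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# At the edge of each request, set the ID once. Every downstream log line
# (and the trace span, if you hand the same value to your tracer) shares it.
trace_id_var.set(str(uuid.uuid4()))
logger.info("checkout_started")
```

With OpenTelemetry or a similar tracer, you would set `trace_id_var` from the active span's trace ID instead of generating a fresh UUID, which is exactly the log-to-trace bridge described above.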

TheCodeForge Editorial Team (Verified Author)

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
