
Spring Boot Actuator and Monitoring: Production-Grade Observability

Master Spring Boot Actuator and Monitoring.
⚙️ Intermediate — basic Java knowledge assumed
In this tutorial, you'll learn
  • Actuator is the bridge between application code and operational visibility — it is non-negotiable for any Spring Boot service running in production.
  • Use readiness probes to control load balancer traffic and liveness probes to signal Kubernetes when a pod needs a restart. Never put external dependency checks in liveness probes — the cascading restart pattern it creates has taken down production systems at companies with mature engineering teams.
  • Micrometer is the metrics engine. Use Counters for total counts, Gauges for current values, and Timers for latency percentiles. Mean latency is a liar — always instrument and alert on p95 or p99.
Quick Answer
  • Spring Boot Actuator exposes operational endpoints (/health, /metrics, /prometheus) that turn your running JVM from a black box into an observable system
  • Health groups (liveness, readiness, startup) map directly to Kubernetes probe types — never put external dependency checks in liveness probes
  • Micrometer is the metrics engine: Counters for totals, Gauges for current values, Timers for latency percentiles (p50/p95/p99)
  • Prometheus scrapes /actuator/prometheus every 15s — using /actuator/health for scraping creates a thundering herd at scale
  • Dynamic log level changes via /actuator/loggers turn a 30-minute 'add logging and redeploy' cycle into a 10-second API call
  • management.metrics.tags.application=${spring.application.name} is the one line most teams forget — without it, metrics from different services collide in Prometheus
  • The full stack is Actuator (sensors) + Prometheus (recorder) + Grafana (visualizer) + Alertmanager (alerter) — Actuator alone gives you endpoints, not observability
🚨 START HERE
Actuator Debug Cheat Sheet — Commands That Save Hours
Real commands for real production debugging. Run these when something is broken and you have Actuator configured.
🟡 Is the app healthy? Need a fast status check
Immediate Action: Hit the health endpoint to see component-level status
Commands
curl -s http://localhost:8080/actuator/health | jq .
curl -s http://localhost:8080/actuator/health/liveness | jq .status
Fix Now: If liveness is DOWN, the JVM is unresponsive — restart the pod. If readiness is DOWN but liveness is UP, a dependency is failing — check database or cache connectivity. Never restart a pod when only readiness is failing — let the dependency recover.
🟡 Need to see what version is running without checking CI/CD
Immediate Action: Query the info endpoint for build and Git metadata
Commands
curl -s http://localhost:8080/actuator/info | jq .git.commit.id
curl -s http://localhost:8080/actuator/info | jq .build.version
Fix Now: If the commit hash does not match what your pipeline deployed, the container is running old code — force a pod restart or re-pull the image explicitly. This is the fastest way to catch Docker layer cache issues.
🟡 Need DEBUG logs for a specific package without restarting
Immediate Action: Enable DEBUG logging dynamically via the loggers endpoint
Commands
curl -u admin:password -X POST -H 'Content-Type: application/json' -d '{"configuredLevel":"DEBUG"}' http://localhost:8080/actuator/loggers/io.thecodeforge.order
tail -f /var/log/app.log | grep 'io.thecodeforge.order'
Fix Now: After capturing the issue, reset immediately: curl -u admin:password -X POST -H 'Content-Type: application/json' -d '{"configuredLevel":null}' http://localhost:8080/actuator/loggers/io.thecodeforge.order — DEBUG logging in production generates gigabytes per minute if left running.
🟡 Prometheus is not scraping — target shows as DOWN in Prometheus UI
Immediate Action: Verify the Prometheus endpoint is reachable and returns data
Commands
curl -s http://localhost:8080/actuator/prometheus | head -20
curl -u prometheus:password -s http://localhost:8080/actuator/prometheus | wc -l
Fix Now: If empty or 404: check the micrometer-registry-prometheus dependency in pom.xml and verify management.endpoints.web.exposure.include contains prometheus. If 403: check that the Spring Security config permits the MONITORING role. If timeout: check whether scrape_interval is too aggressive for the instance count.
🟡 Docker container marked unhealthy but app seems fine from inside
Immediate Action: Test the HEALTHCHECK command manually inside the container
Commands
docker exec <container_id> wget --quiet --tries=1 --spider http://localhost:8080/actuator/health/liveness
docker inspect --format='{{json .State.Health}}' <container_id> | jq .
Fix Now: If wget is missing, you are using a distroless image. Switch to eclipse-temurin:17-jre-alpine which includes wget, or remove the Docker HEALTHCHECK entirely and use Kubernetes probes exclusively. Do not add wget to a distroless image — that defeats the purpose of using it.
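If you keep the Docker HEALTHCHECK, a working setup on a shell-equipped base image looks roughly like this — a sketch assuming an alpine JRE image, port 8080, and a jar named app.jar (all illustrative):

```dockerfile
FROM eclipse-temurin:17-jre-alpine
COPY target/app.jar /app/app.jar
# wget ships with alpine's busybox, so no extra packages are needed.
# start-period gives the JVM time to boot before failures count.
HEALTHCHECK --interval=30s --timeout=3s --start-period=60s --retries=3 \
  CMD wget --quiet --tries=1 --spider http://localhost:8080/actuator/health/liveness || exit 1
ENTRYPOINT ["java", "-jar", "/app/app.jar"]
```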
Production Incident: The $40,000 Connection Pool Exhaustion — Flying Blind Without Metrics
A payment service ran silently for three weeks before latency spiked to 15 seconds at 2 AM on a Saturday. No metrics, no health checks, no visibility. Four hours of blind debugging.
Symptom: Payment service latency spiked from 50ms to 15,000ms per request. Customers reported timeouts. Revenue dropped as transactions queued and eventually failed.
Assumption: The team assumed it was a downstream payment gateway outage or a network issue. Two hours were spent checking external dependencies and firewall rules before anyone looked at the connection pool.
Root cause: HikariCP connection pool was exhausted. A slow database query caused by a missing index on the orders table held connections for 30 seconds or longer. Under Saturday peak load, all 10 pool connections were consumed. New requests waited in the queue until timeout. No Micrometer gauge was configured to track active connections versus pool size. The metric that would have solved this in 30 seconds — hikaricp_connections_active — was available but not enabled because the micrometer-registry-prometheus dependency was missing.
Fix: Added the micrometer-registry-prometheus dependency, which auto-enables HikariCP metrics: hikaricp_connections_active, hikaricp_connections_idle, hikaricp_connections_pending, hikaricp_connections_timeout_total. Set up a Grafana alert when active connections exceed 80% of pool size. Added the missing database index. Increased pool size from 10 to 25 with connection timeout tuning. Added management.metrics.tags.application=${spring.application.name} so the metrics were attributable to this specific service in Prometheus.
Key Lesson
  • HikariCP metrics are auto-configured when micrometer-registry-prometheus is on the classpath — you do not write a single line of instrumentation code, you just add the dependency
  • A single gauge (hikaricp_connections_active) would have diagnosed the issue in 30 seconds instead of 4 hours
  • If you cannot answer 'is the connection pool saturated?' from your dashboard, you are flying blind
  • Every production service needs at minimum: request rate, error rate, latency percentiles, connection pool saturation, and JVM heap usage
  • management.metrics.tags.application is not optional — without it, metrics from multiple services are indistinguishable in Prometheus
Production Debug Guide
When Actuator is configured, here is how to go from observable symptom to resolution.
Kubernetes pods restarting in a loop every 60-90 seconds
Check the liveness probe configuration — if it hits /actuator/health (full) instead of /actuator/health/liveness, a transient DB blip triggers cascading restarts. Switch to the liveness health group immediately. The fix is one line in your Kubernetes deployment YAML.
Prometheus shows gaps in metric data — scraping appears to stop intermittently
Check if /actuator/health is used as the scrape path instead of /actuator/prometheus. Heavy health checks can time out under load, causing Prometheus to mark the target as down. Switch metrics_path to /actuator/prometheus and verify the micrometer-registry-prometheus dependency is present.
Grafana dashboard shows a flat latency line, then a sudden spike — no gradual degradation visible
You are looking at mean latency, not percentiles. Mean hides tail latency completely. Check p95 and p99 — the degradation will be visible there well before the mean moves. Add publishPercentiles(0.5, 0.95, 0.99) to your Timer metrics and alert on p99, not the mean.
Intermittent payment failures but logs show nothing useful at INFO level
POST to /actuator/loggers/io.thecodeforge.order with {"configuredLevel": "DEBUG"}. Reproduce the issue. Read the detailed logs. Reset to null after. No restart needed. If the issue is SQL-related, enable TRACE on org.hibernate.SQL to see exact queries and bind parameters.
Deployed a new version but behavior has not changed — suspect old code is still running
curl /actuator/info and verify git.commit.id matches your pipeline deployment. If it does not match, the Docker image cache served the old image or the Helm rollout did not apply. This is the most common 'ghost deployment' pattern.
Prometheus query returns duplicate time series for the same metric from different services
You are missing management.metrics.tags.application=${spring.application.name} in your application.yml. Without a global application tag, metrics from different services use identical names and collide in Prometheus. Add this to every service's base config.
/actuator/prometheus returns 404 even though the endpoint is in exposure.include
The micrometer-registry-prometheus dependency is missing from your pom.xml or build.gradle. spring-boot-starter-actuator does not include it. Add io.micrometer:micrometer-registry-prometheus explicitly. The endpoint only exists when the registry is on the classpath.
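Wiring the liveness and readiness health groups into Kubernetes looks roughly like this — a minimal deployment fragment sketch (port 8080, timings, and thresholds are illustrative, not prescriptive):

```yaml
# Kubernetes deployment fragment: wire each probe to its dedicated health group.
# Liveness stays dependency-free; readiness gates traffic on dependencies.
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3        # ~30s of unresponsiveness before a restart
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3        # pod is removed from Service endpoints, not restarted
```

Note the asymmetry: a failing readiness probe only stops traffic, while a failing liveness probe kills the pod — which is exactly why dependency checks belong only in readiness.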

In the world of microservices, 'it works on my machine' is not enough. You need to know if it works in production at 3:00 AM under peak load. Spring Boot Actuator is the industry-standard framework that turns your application from an opaque black box into an observable system by exposing HTTP and JMX endpoints that reveal the inner state of your running JVM.

I learned this the hard way. In 2021, our team deployed a Spring Boot payment service to production. It ran fine for three weeks. Then one Saturday at 2 AM, latency spiked to 15 seconds per request. We had no metrics, no health checks beyond a basic /ping, and no idea what was wrong. It took four hours to diagnose a connection pool exhaustion issue that a single Micrometer gauge would have caught in 30 seconds. That incident cost us $40,000 in lost transactions and a very uncomfortable Monday morning post-mortem.

After that, we instrumented everything. Every service got Actuator, Micrometer, Prometheus, and Grafana before it ever touched production. We have not had a mystery outage since.

This guide moves beyond basic dependency injection and covers the operational side of Java development. We will explore how to monitor application health, track custom business metrics, expose build traceability info, manage log levels dynamically, secure sensitive endpoints, wire everything into a Prometheus/Grafana stack, and build custom endpoints for operational control — all with the application.yml configuration you can actually copy into a project.

The Three Pillars of Observability

Before diving into Actuator, understand the framework that every production monitoring system is built on. Observability rests on three pillars, and knowing which pillar answers which question is the difference between a 4-hour incident and a 10-minute resolution.

  1. Metrics — Numeric measurements over time. Request rate, error rate, latency percentiles, JVM heap usage, connection pool saturation. These are what Prometheus scrapes and Grafana displays. Metrics answer 'how much' and 'how fast' and 'how often.' They are the first signal that something is wrong.
  2. Logs — Discrete events with context. A log entry says 'order #4521 failed with NullPointerException at PaymentService.java:87 for user abc123.' Logs answer 'what happened' and 'why.' Spring Boot's structured logging in JSON format feeds into ELK, Loki, or Datadog. They answer the question that the metric alert raised.
  3. Traces — A request's journey across services. In a microservice architecture, a single user action might touch 8 services. A trace connects those dots — showing that the 2-second delay happened in the inventory service on the third hop, not in the API gateway. Spring Boot integrates with Micrometer Tracing and OpenTelemetry for distributed tracing.

Spring Boot Actuator primarily addresses the metrics and health pillars. But a production-grade observability stack needs all three working together. The order of implementation matters more than most teams realize.

io/thecodeforge/monitoring/observability-stack.txt · TEXT
# io.thecodeforge: Production Observability Stack
#
#  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
#  │ Spring Boot │────▶│  Prometheus │────▶│   Grafana   │
#  │   Actuator  │     │  (Scraper)  │     │ (Dashboard) │
#  └─────────────┘     └─────────────┘     └─────────────┘
#         │                                        │
#         │              ┌─────────────┐           │
#         └─────────────▶│ Alertmanager│◀──────────┘
#                        │  (PagerDuty)│
#                        └─────────────┘
#
#  ┌─────────────┐     ┌─────────────┐
#  │ Structured  │────▶│  Loki / ELK │
#  │    Logs     │     │  (Log Agg.) │
#  └─────────────┘     └─────────────┘
#
#  ┌─────────────┐     ┌─────────────┐
#  │  Micrometer │────▶│  Tempo /    │
#  │   Tracing   │     │  Jaeger     │
#  └─────────────┘     └─────────────┘
#
# What each layer answers:
#   Metrics: Is something wrong right now?
#   Logs:    What exactly happened and why?
#   Traces:  Which service in the chain caused it?
Mental Model
You Do Not Need All Three on Day One
Observability is a maturity curve, not a checklist. Implement the pillars in the order they prevent outages, not the order they look impressive on an architecture diagram.
  • Start with metrics and health checks (Actuator + Prometheus) — this catches 80% of production issues before users notice
  • Add structured logging next when you need to debug 'what exactly happened' after an alert fires
  • Add distributed tracing when you have 3 or more microservices and need to find which service in the chain introduced latency
  • Trying to implement all three simultaneously leads to implementation paralysis and all three done badly
  • The order matters: metrics catch problems before logs do, logs explain problems before traces do
📊 Production Insight
Most teams over-engineer observability on day one and under-instrument by day 30. The pattern that consistently works: metrics plus health first — they prevent mystery outages. Add structured logging when you cannot debug from metrics alone. Add tracing when microservice count justifies the overhead. Do not skip the sequence.
🎯 Key Takeaway
Observability is a maturity curve, not a checklist. Start with metrics and health checks — they prevent 80% of production mystery outages. Implement the pillars in the order they prevent outages, not the order they look impressive on a resume.

The Anatomy of Actuator: Observability in Action

Spring Boot Actuator exists because manual health checks are a recipe for failure. Instead of writing custom endpoints to check if your database is alive or if your disk space is full, Actuator provides these out of the box. By adding the starter dependency and configuring application.yml, you instantly gain access to the /actuator base path with standardized operational endpoints.

However, most endpoints are hidden by default for security. The real power lies in the /health and /prometheus endpoints. One critical detail that trips up almost every team the first time: the /actuator/prometheus endpoint does not exist unless micrometer-registry-prometheus is on the classpath. The actuator starter and the Prometheus registry are separate dependencies. Adding spring-boot-starter-actuator without micrometer-registry-prometheus gives you health and info but not Prometheus metrics.

One thing worth saying explicitly: Actuator endpoints are not your application REST API. They are operational endpoints meant for internal monitoring infrastructure. Your security model should reflect this — your monitoring stack gets access through specific roles, your customers never touch these endpoints.
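As a sketch of that security model — assuming Spring Boot 3 with Spring Security 6, and the MONITORING/ADMIN role names used elsewhere in this guide — a dedicated filter chain for Actuator might look like:

```java
package io.thecodeforge.config;

import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

/**
 * Illustrative filter chain covering only Actuator endpoints.
 * Health stays open for Kubernetes probes; everything else needs a role.
 */
@Configuration
public class ActuatorSecurityConfig {

    @Bean
    public SecurityFilterChain actuatorSecurity(HttpSecurity http) throws Exception {
        http
            .securityMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeHttpRequests(auth -> auth
                // Kubernetes probes must reach health unauthenticated
                .requestMatchers(EndpointRequest.to(HealthEndpoint.class)).permitAll()
                // Prometheus scraper and read-only metadata
                .requestMatchers(EndpointRequest.to("prometheus", "info")).hasRole("MONITORING")
                // loggers POST, custom write operations, anything else
                .anyRequest().hasRole("ADMIN"))
            .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}
```

EndpointRequest matchers track the configured base path automatically, so this chain keeps working if you move /actuator elsewhere.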

io/thecodeforge/monitoring/application.yml · YAML
# io.thecodeforge: Canonical Actuator Configuration
# This is the base configuration every production Spring Boot service should start with.
# Copy this, adjust the exposure list to match your needs, and ship it.

management:
  endpoints:
    web:
      exposure:
        # Whitelist only what your monitoring stack actually needs.
        # Never use * in production.
        include: health,info,prometheus,loggers
        exclude: heapdump,env,threaddump,shutdown
      base-path: /actuator

  endpoint:
    health:
      # when_authorized: only show component details to authenticated users
      # never: hide component details entirely (use for public-facing services)
      # always: NEVER use in production — exposes DB versions, pool sizes, API keys
      show-details: when_authorized
      show-components: when_authorized
      probes:
        # Enables /actuator/health/liveness and /actuator/health/readiness
        # Required for Kubernetes probe integration
        enabled: true

  # Global tag applied to every metric this service emits.
  # Without this, metrics from different services collide in Prometheus.
  # This is the one line most teams forget. Do not skip it.
  metrics:
    tags:
      application: ${spring.application.name}
    export:
      prometheus:
        # Explicitly enable the Prometheus registry.
        # Redundant if micrometer-registry-prometheus is on the classpath,
        # but makes intent explicit and prevents confusion.
        # Note: on Spring Boot 3.x this key moved to
        # management.prometheus.metrics.export.enabled
        enabled: true

  # Security: never expose stack traces via the error endpoint
  server:
    add-application-context-header: false

server:
  error:
    # Never expose stack traces to clients or monitoring scrapers
    include-stacktrace: never
    include-message: never

spring:
  application:
    # This name flows into management.metrics.tags.application above.
    # Set it explicitly — never rely on the default.
    name: order-service
▶ Output
# With this configuration:
# GET /actuator/health → component-level health (authorized users only)
# GET /actuator/health/liveness → JVM-only liveness check (Kubernetes liveness probe)
# GET /actuator/health/readiness → dependency check (Kubernetes readiness probe)
# GET /actuator/prometheus → Prometheus metrics with application=order-service tag
# GET /actuator/loggers → current log levels (read)
# POST /actuator/loggers/{pkg} → change log level dynamically (ADMIN role required)
# GET /actuator/info → build + git metadata
#
# /actuator/heapdump, /actuator/env, /actuator/threaddump → blocked
⚠ Never Set exposure.include=* in Production
Setting management.endpoints.web.exposure.include=* exposes /actuator/env (which returns AWS keys, database passwords, and API tokens in plaintext), /actuator/heapdump (which dumps every object in JVM memory including user sessions and PII), and /actuator/threaddump to anyone who can reach the endpoint. An automated scanner will find this within hours. I have seen a production incident where this exact mistake exposed AWS_SECRET_ACCESS_KEY. The attacker spun up 200 GPU instances for crypto mining. The bill was $12,000 before anyone noticed. Whitelist only what your monitoring stack requires.
📊 Production Insight
A team used /actuator/health as the Prometheus scrape_path instead of /actuator/prometheus. The full health check hit the database, external APIs, and disk on every scrape — 100 instances at 15-second intervals equaled 400 heavy health checks per minute. Response times for the health endpoint started spiking, causing Prometheus to mark targets as down, which triggered false alerts at 3 AM. The fix: use /actuator/prometheus for metrics scraping (lightweight counter and gauge reads, sub-millisecond) and reserve /actuator/health for Kubernetes probes exclusively.
🎯 Key Takeaway
Actuator endpoints are operational infrastructure, not application REST APIs. The security model must reflect this. Never set exposure.include=* in production — whitelist only what Prometheus and Kubernetes actually need. The canonical base configuration is in this section — copy it and adjust from there.
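On the Prometheus side, the matching scrape job is only a few lines — a sketch with placeholder job name, credentials, and target (adjust to your service discovery):

```yaml
# prometheus.yml fragment — scrape the metrics endpoint, never /actuator/health
scrape_configs:
  - job_name: 'order-service'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    basic_auth:
      username: 'prometheus'
      password: 'changeme'        # placeholder — use password_file in production
    static_configs:
      - targets: ['order-service:8080']
```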

Custom Metrics with Micrometer — Beyond Health Checks

Health checks tell you if the system is alive. Metrics tell you how it is performing. Spring Boot uses Micrometer, a dimensional metrics instrumentation facade — think SLF4J but for metrics. It lets you track business-relevant things like order count, payment latency percentiles, and cart abandonment rate rather than just JVM garbage collection statistics.

Micrometer supports several meter types. Choosing the right one matters:

  • Counter — Monotonically increasing value. Use for total requests, total errors, and orders placed. Never goes down. Query with rate() in Prometheus to get events per second.
  • Gauge — A value that fluctuates. Use for current queue depth, active connections, and temperature. You report the current value; Micrometer samples it on each scrape.
  • Timer — Measures duration and rate simultaneously. Use for request latency, database query time, and external API call duration. Gives you percentiles (p50, p95, p99) automatically.
  • Distribution Summary — Like a Timer but for arbitrary values, not time. Use for payload sizes, batch sizes, and record counts.
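To see why percentiles matter more than the mean, here is a tiny stdlib-only sketch (no Micrometer involved, all values synthetic) comparing the two on the same latency sample:

```java
import java.util.Arrays;

public class TailLatencyDemo {

    /** Arithmetic mean of the sample. */
    static double mean(long[] latencies) {
        return Arrays.stream(latencies).average().orElse(0);
    }

    /** Nearest-rank p99: the value at the ceil(0.99 * n)-th position when sorted. */
    static long p99(long[] latencies) {
        long[] sorted = latencies.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(0.99 * sorted.length);  // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // 98 requests at 40 ms, two pathological requests at 2000 ms
        long[] latencies = new long[100];
        Arrays.fill(latencies, 0, 98, 40L);
        latencies[98] = 2000L;
        latencies[99] = 2000L;

        System.out.println("mean = " + mean(latencies) + " ms"); // 79.2 — looks fine
        System.out.println("p99  = " + p99(latencies) + " ms");  // 2000 — the real story
    }
}
```

A dashboard showing only the mean would report 79 ms while 2% of users wait two full seconds — which is exactly the failure mode the Timer's publishPercentiles output is designed to expose.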

The @Timed annotation is the fastest way to instrument a method, but note the gotcha: @Timed on an arbitrary bean method only produces metrics once a TimedAspect bean is registered, and Spring Boot does not create one automatically. For custom business logic, inject MeterRegistry directly and build meters explicitly.
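If @Timed produces no metrics, the usual fix is registering the TimedAspect bean explicitly — a minimal registration sketch (assumes spring-boot-starter-aop is on the classpath; class and package names are illustrative):

```java
package io.thecodeforge.config;

import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * Registers the AspectJ aspect that makes @Timed work on bean methods.
 * Requires spring-boot-starter-aop in addition to the actuator starter.
 */
@Configuration
public class TimedAspectConfig {

    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}
```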

io/thecodeforge/monitoring/OrderMetricsService.java · JAVA
package io.thecodeforge.monitoring;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * io.thecodeforge: Custom Business Metrics for Order Processing.
 *
 * Two patterns shown here:
 * 1. @Timed on controller methods — zero-boilerplate method-level timing
 * 2. MeterRegistry injection — explicit control for business counters and gauges
 */
@Service
public class OrderMetricsService {

    private final Counter ordersPlaced;
    private final Counter ordersFailed;
    private final Timer paymentLatency;
    private final AtomicInteger activeCarts = new AtomicInteger(0);

    public OrderMetricsService(MeterRegistry registry) {
        // Counter: Total orders placed (monotonically increasing)
        // Query in Prometheus: rate(orders_placed_total[5m]) → orders per second
        this.ordersPlaced = Counter.builder("orders.placed.total")
            .description("Total number of orders successfully placed")
            .tag("service", "order-service")
            .register(registry);

        // Counter: Total orders failed
        // Alert when rate(orders_failed_total[5m]) / rate(orders_placed_total[5m]) > 0.01
        this.ordersFailed = Counter.builder("orders.failed.total")
            .description("Total number of orders that failed processing")
            .tag("service", "order-service")
            .register(registry);

        // Timer: Payment processing latency with percentiles
        // publishPercentiles exposes p50, p95, p99 as separate Prometheus labels
        // Alert on p99 exceeding your SLA threshold — the mean will hide tail latency
        this.paymentLatency = Timer.builder("payment.processing.duration")
            .description("Time taken to process payment end to end")
            .tag("service", "order-service")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);

        // Gauge: Active shopping carts — value goes up and down
        // Prometheus samples this on every scrape — you report the current value
        Gauge.builder("carts.active.current", activeCarts, AtomicInteger::get)
            .description("Number of active shopping carts in session")
            .register(registry);
    }

    public void recordSuccessfulOrder(long paymentDurationNanos) {
        ordersPlaced.increment();
        paymentLatency.record(paymentDurationNanos, TimeUnit.NANOSECONDS);
    }

    public void recordFailedOrder() {
        ordersFailed.increment();
    }

    public void cartOpened() { activeCarts.incrementAndGet(); }
    public void cartClosed() { activeCarts.decrementAndGet(); }
}

// --- @Timed annotation pattern: zero-boilerplate controller instrumentation ---

package io.thecodeforge.controller;

import io.micrometer.core.annotation.Timed;
import io.thecodeforge.dto.OrderDto;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/orders")
public class OrderController {

    /**
     * @Timed auto-instruments this method with a Timer named 'order.create.duration'.
     * Tracks: request count, total time, and percentiles.
     * Requires a TimedAspect bean registered in your configuration (plus
     * spring-boot-starter-aop) — Spring Boot does not create one automatically.
     * extraTags: adds fixed labels to the metric for filtering in Grafana.
     */
    @Timed(
        value = "order.create.duration",
        description = "Time taken to create an order",
        extraTags = {"endpoint", "/api/orders", "method", "POST"},
        percentiles = {0.5, 0.95, 0.99},
        histogram = true
    )
    @PostMapping
    public OrderDto createOrder(@RequestBody OrderDto dto) {
        // Method execution is automatically timed.
        // A Timer named order_create_duration_seconds is created in Prometheus.
        return dto;
    }
}
▶ Output
# GET /actuator/prometheus (relevant excerpt)

# HELP orders_placed_total Total number of orders successfully placed
# TYPE orders_placed_total counter
orders_placed_total{service="order-service",} 1542.0

# HELP orders_failed_total Total number of orders that failed processing
# TYPE orders_failed_total counter
orders_failed_total{service="order-service",} 23.0

# HELP payment_processing_duration_seconds Time taken to process payment end to end
# TYPE payment_processing_duration_seconds summary
payment_processing_duration_seconds{service="order-service",quantile="0.5",} 0.045
payment_processing_duration_seconds{service="order-service",quantile="0.95",} 0.180
payment_processing_duration_seconds{service="order-service",quantile="0.99",} 0.420

# HELP carts_active_current Number of active shopping carts in session
# TYPE carts_active_current gauge
carts_active_current{application="order-service",} 87.0

# HELP order_create_duration_seconds Time taken to create an order (@Timed)
# TYPE order_create_duration_seconds summary
order_create_duration_seconds{endpoint="/api/orders",method="POST",quantile="0.99",} 0.312

# HikariCP auto-metrics (no code required — just micrometer-registry-prometheus on classpath)
hikaricp_connections_active{pool="HikariPool-1",} 8.0
hikaricp_connections_idle{pool="HikariPool-1",} 2.0
hikaricp_connections_pending{pool="HikariPool-1",} 0.0
hikaricp_connections_timeout_total{pool="HikariPool-1",} 0.0
⚠ Never Create High-Cardinality Metrics
A Counter tagged with userId creates one unique time series per user. With 100,000 users that is 100,000 time series in Prometheus — it will exhaust memory and crash your monitoring stack. The rule: keep tag values to low-cardinality dimensions — service name, region, HTTP method, HTTP status code, error type. Never use request IDs, user IDs, order IDs, or timestamps as tag values. If you need per-user visibility, that belongs in your log aggregation layer, not your metrics layer.
📊 Production Insight
A team tracked payment latency but only looked at the mean — 45ms looked healthy and no alert fired. Meanwhile p99 was silently climbing to 2 seconds for a subset of requests with large order payloads, causing intermittent timeout complaints that the team kept dismissing as user error. Adding publishPercentiles(0.5, 0.95, 0.99) to the Timer revealed the tail latency the mean was masking. The alert was set on p99 and the issue was diagnosed within one day of the next occurrence. Mean latency is a liar — always instrument and alert on p95 or p99.
🎯 Key Takeaway
Counters answer 'how many total,' Gauges answer 'how much right now,' Timers answer 'how long and how spread out.' Pick the wrong meter type and you will miss the actual problem. Use @Timed on controller methods for zero-boilerplate instrumentation. HikariCP metrics are free — just add the dependency. If you can only track five metrics, track: request rate, error rate, latency p99, connection pool saturation, and one business metric.
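The alert rules implied in this section can be expressed in PromQL — a sketch using the metric names from the examples above (all thresholds are illustrative):

```promql
# Error rate: alert when more than 1% of orders fail over 5 minutes
rate(orders_failed_total[5m]) / rate(orders_placed_total[5m]) > 0.01

# Tail latency: the p99 series published by publishPercentiles(0.5, 0.95, 0.99)
payment_processing_duration_seconds{quantile="0.99"} > 0.5

# Pool saturation: alert when active connections exceed 80% of the pool
hikaricp_connections_active / hikaricp_connections_max > 0.8
```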

Custom Actuator Endpoints — Operational Control Beyond Health

The built-in Actuator endpoints cover most operational needs. But there are legitimate cases where you need operational endpoints specific to your application: runtime feature flag status, current circuit breaker state, deployment metadata that goes beyond what /actuator/info provides, or a cache invalidation trigger that ops can call without a full deploy.

Spring Boot makes this straightforward with the @Endpoint annotation. Annotate a Spring component with @Endpoint(id='yourEndpoint') and methods with @ReadOperation (HTTP GET), @WriteOperation (HTTP POST), or @DeleteOperation (HTTP DELETE). Spring Boot automatically exposes the endpoint at /actuator/yourEndpoint and applies the same security model as built-in endpoints — whatever your SecurityFilterChain says applies here too.

The design rule: @ReadOperation methods should have no side effects. @WriteOperation methods change state and must be secured with ADMIN role — they are essentially a surgical tool for ops teams, not a general API. I have seen @WriteOperation endpoints used to toggle circuit breakers, clear Redis caches, and reload configuration from external sources — all without a redeploy.

io/thecodeforge/monitoring/DeploymentEndpoint.java · JAVA
package io.thecodeforge.monitoring;

import org.springframework.boot.actuate.endpoint.annotation.*;
import org.springframework.stereotype.Component;
import java.time.Instant;
import java.util.Map;

/**
 * io.thecodeforge: Custom Actuator Endpoint for deployment metadata and cache control.
 *
 * Exposed at: GET  /actuator/deployment  → returns deployment info
 *             POST /actuator/deployment  → triggers cache invalidation (ADMIN only)
 *
 * Spring Boot automatically applies your SecurityFilterChain to this endpoint.
 * Secure @WriteOperation methods with ADMIN role — they modify production state.
 */
@Component
@Endpoint(id = "deployment")
public class DeploymentEndpoint {

    private final String gitCommit;
    private final String buildVersion;
    private final Instant startedAt = Instant.now();
    private volatile Instant lastCacheInvalidation = null;
    private volatile String lastInvalidationReason = "none";

    public DeploymentEndpoint(
            @org.springframework.beans.factory.annotation.Value("${git.commit.id.abbrev:unknown}") String gitCommit,
            @org.springframework.beans.factory.annotation.Value("${build.version:unknown}") String buildVersion) {
        this.gitCommit = gitCommit;
        this.buildVersion = buildVersion;
    }

    /**
     * @ReadOperation → HTTP GET /actuator/deployment
     * Returns deployment metadata and cache state.
     * Safe to expose to MONITORING role — no side effects.
     */
    @ReadOperation
    public Map<String, Object> deploymentInfo() {
        return Map.of(
            "gitCommit", gitCommit,
            "buildVersion", buildVersion,
            "startedAt", startedAt,  // captured once at bean creation, not at request time
            "lastCacheInvalidation", lastCacheInvalidation != null ? lastCacheInvalidation.toString() : "never",
            "lastInvalidationReason", lastInvalidationReason
        );
    }

    /**
     * @WriteOperation → HTTP POST /actuator/deployment/{reason}
     * Triggers a cache invalidation without a redeploy.
     * Secure this with ADMIN role in your SecurityFilterChain.
     * @Selector binds the trailing path segment,
     * e.g. POST /actuator/deployment/stale-pricing-data
     */
    @WriteOperation
    public Map<String, String> invalidateCache(@Selector String reason) {
        this.lastCacheInvalidation = Instant.now();
        this.lastInvalidationReason = reason != null ? reason : "manual trigger";
        // In real code: inject CacheManager and call cache.invalidateAll()
        return Map.of(
            "status", "cache invalidated",
            "reason", lastInvalidationReason,
            "timestamp", lastCacheInvalidation.toString()
        );
    }
}
▶ Output
# GET /actuator/deployment
{
  "gitCommit": "a1b2c3d",
  "buildVersion": "2.4.1",
  "startedAt": "2026-04-18T10:00:00Z",
  "lastCacheInvalidation": "never",
  "lastInvalidationReason": "none"
}

# POST /actuator/deployment/stale-pricing-data (ADMIN role required)
{
  "status": "cache invalidated",
  "reason": "stale-pricing-data",
  "timestamp": "2026-04-18T10:05:22Z"
}

# The endpoint appears automatically in the /actuator index:
# GET /actuator
# {
#   "_links": {
#     "deployment": { "href": "/actuator/deployment" },
#     "health": { "href": "/actuator/health" },
#     ...
#   }
# }
⚠ @WriteOperation Changes Production State — Secure It
A @WriteOperation endpoint is essentially a surgical tool for production. It can clear caches, toggle features, or reload config — all without a deploy. That power requires strict access control. Always secure @WriteOperation methods with ADMIN role in your SecurityFilterChain. Log every invocation with the caller's identity so you have an audit trail. Use @WebEndpoint instead of @Endpoint if you want the endpoint to be web-only and not exposed via JMX.
📊 Production Insight
Custom endpoints are where Actuator becomes genuinely powerful beyond health checks. I have used @WriteOperation endpoints to toggle circuit breakers manually during dependency degradation events, clear pricing caches after database migrations, and reload feature flag configuration from an external source. All of these would have required a redeploy otherwise. The key discipline: document every custom endpoint in your runbook with the exact curl command ops should run and the expected response — your 3 AM on-call engineer should not be reading source code.
🎯 Key Takeaway
@Endpoint with @ReadOperation and @WriteOperation lets you build operational control surfaces specific to your application. @ReadOperation is HTTP GET with no side effects. @WriteOperation is HTTP POST that changes state and must be secured with ADMIN role. Custom endpoints inherit the same security model as built-in ones — no extra configuration needed. Document every custom endpoint in your runbook with the exact command to run.

Securing Actuator Endpoints with Spring Security

By default in Spring Boot 2.x and later, Actuator endpoints sit behind Spring Security if it is on the classpath. But behind security does not mean secure. The default configuration often allows all authenticated users to access all endpoints — including ones that dump environment variables, heap contents, and thread states.

The production pattern: restrict Actuator endpoints to a dedicated monitoring role or internal network. Your Prometheus scraper authenticates with a service account. Your developers get read-only access to health and info. Nobody outside the internal network touches env or heapdump. The 15-line SecurityFilterChain in this section has prevented two credential exfiltration incidents on teams I have worked with.

One detail most guides miss: when you add a dedicated SecurityFilterChain for /actuator/**, you need @Order(1) to give it higher priority than your application's main SecurityFilterChain. Without the ordering, Spring applies your main chain first, which may have different rules than what you intend for Actuator.

io/thecodeforge/monitoring/ActuatorSecurityConfig.java · JAVA
package io.thecodeforge.monitoring;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.annotation.Order;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

/**
 * io.thecodeforge: Actuator Security Configuration.
 *
 * @Order(1) gives this chain higher priority than your application's main chain.
 * Without it, your application's SecurityFilterChain may apply different rules
 * to /actuator/** than you intend.
 *
 * Role mapping:
 *   No role    → health, info (load balancers and Kubernetes probes)
 *   MONITORING → prometheus (Prometheus scraper service account)
 *   ADMIN      → loggers, env, heapdump, threaddump (ops team only)
 *   DENY       → shutdown, everything else not whitelisted
 */
@Configuration
public class ActuatorSecurityConfig {

    @Bean
    @Order(1)
    public SecurityFilterChain actuatorSecurityFilterChain(HttpSecurity http) throws Exception {
        http
            .securityMatcher("/actuator/**")
            .authorizeHttpRequests(auth -> auth
                // Public: health checks for load balancers and Kubernetes probes
                // These must be accessible without authentication
                .requestMatchers("/actuator/health/**").permitAll()
                .requestMatchers("/actuator/info").permitAll()

                // MONITORING role: Prometheus scraper authenticates via HTTP Basic
                .requestMatchers("/actuator/prometheus").hasRole("MONITORING")

                // ADMIN role: endpoints that expose or modify sensitive state
                // Log level changes can generate gigabytes/min if abused
                .requestMatchers(
                    "/actuator/loggers/**",
                    "/actuator/env",
                    "/actuator/heapdump",
                    "/actuator/threaddump"
                ).hasRole("ADMIN")

                // Deny everything not explicitly permitted above.
                // This includes /actuator/shutdown — never expose it.
                .anyRequest().denyAll()
            )
            // HTTP Basic: Prometheus natively supports basic_auth in scrape config
            .httpBasic(basic -> {})
            // Disable CSRF: Actuator endpoints are called by automated systems, not browsers
            .csrf(csrf -> csrf.disable());

        return http.build();
    }
}
▶ Output
# application.yml complement to the SecurityFilterChain above
# Defines the in-memory admin user; grant the Prometheus scraper its own
# MONITORING-role account rather than reusing this one.
spring:
  security:
    user:
      name: admin
      password: ${ACTUATOR_ADMIN_PASSWORD}  # inject via environment variable, never hardcode
      roles: ADMIN

# In Kubernetes, use Spring Security with LDAP or OAuth2 service accounts.
# For simple setups, environment-variable-injected credentials are acceptable
# as long as the password is rotated and not committed to source control.
⚠ Never Set show-details to 'always' in Production
When show-details=always, your /actuator/health endpoint returns database versions, JDBC connection pool sizes, external API response times, and disk usage to anyone who hits it — including unauthenticated requests if your SecurityFilterChain permits health/* publicly. An attacker uses this to fingerprint your infrastructure and find known CVEs for your specific database version. Always set it to when_authorized or never. Kubernetes probes only need the status field, not the details.
📊 Production Insight
An automated scanner found /actuator/env exposed on a staging server within 6 hours of deployment. AWS_SECRET_ACCESS_KEY was returned in plaintext in the JSON response. The attacker spun up 200 GPU instances for crypto mining. The AWS bill was $12,000 before anyone noticed. The application had a SecurityFilterChain — but it was missing @Order(1), so the main chain ran first and permitted all authenticated requests including monitoring users who had broad access. A 15-line SecurityFilterChain with correct ordering would have blocked this entirely.
🎯 Key Takeaway
If your security audit has not flagged your Actuator endpoints, you are not looking hard enough. Use @Order(1) on the Actuator SecurityFilterChain to ensure it takes priority. Never expose env, heapdump, or threaddump publicly. Restrict them to ADMIN role. Never set show-details=always in production.

Prometheus and Grafana Integration — Full Stack Setup

Actuator exposes the /actuator/prometheus endpoint in Prometheus exposition format — a text-based, human-readable format that Prometheus understands natively. But Prometheus needs to be told where to scrape. This is where most tutorials end and most teams get stuck.
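For orientation, here is what a scrape of /actuator/prometheus returns — plain text, one sample per line. The metric names below follow Micrometer's standard naming conventions; the label values are illustrative, not from a real service:

```text
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{application="order-service",area="heap",id="G1 Eden Space",} 1.2345678E7
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{application="order-service",method="GET",status="200",uri="/api/orders",} 4021.0
http_server_requests_seconds_sum{application="order-service",method="GET",status="200",uri="/api/orders",} 112.48
```

Note the application label on every sample — that is what management.metrics.tags.application provides, and why metrics from different services do not collide in Prometheus.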

On the Spring Boot side: add micrometer-registry-prometheus to your dependencies and management.endpoints.web.exposure.include=prometheus to your application.yml. That is it. The endpoint auto-configures.
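Concretely, the dependency is one block in your pom.xml — the version is managed by the Spring Boot BOM, so none is needed here:

```xml
<!-- Registry that adapts Micrometer metrics to the Prometheus exposition format.
     Combined with management.endpoints.web.exposure.include=prometheus,
     this auto-configures the /actuator/prometheus endpoint. -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```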

On the Prometheus side: add a scrape_config block pointing at your application. The two details most engineers get wrong: using /actuator/health as the metrics_path instead of /actuator/prometheus (heavy versus lightweight), and using static_configs in Kubernetes where pods get new IPs on every restart.

On the Grafana side: import dashboard ID 4701 (JVM Micrometer) from grafana.com/dashboards for a production-ready Spring Boot dashboard. It covers heap usage, GC pause time, HTTP request rates, and database connection pool saturation out of the box — no PromQL required to get started.

io/thecodeforge/monitoring/prometheus.yml · YAML
# io.thecodeforge: Prometheus Scrape Configuration
# This file tells Prometheus where to find your Spring Boot metrics.

global:
  scrape_interval: 15s      # How often Prometheus scrapes all targets
  evaluation_interval: 15s  # How often alerting rules are evaluated

scrape_configs:

  # Static targets: works for a fixed number of servers or local development.
  # In production Kubernetes, replace this with kubernetes_sd_configs below.
  - job_name: 'spring-boot-order-service'
    # IMPORTANT: Use /actuator/prometheus, NOT /actuator/health.
    # /actuator/prometheus: lightweight metric reads, sub-millisecond.
    # /actuator/health: hits the DB, external APIs, disk — thundering herd at scale.
    metrics_path: '/actuator/prometheus'
    scrape_interval: 10s  # Override global for this job
    basic_auth:
      username: 'prometheus'
      # Use password_file in production — never inline passwords in prometheus.yml
      password_file: '/etc/prometheus/secrets/prometheus_password'
    static_configs:
      - targets: ['order-service-01:8080', 'order-service-02:8080']
        labels:
          environment: 'production'
          team: 'platform'

  # Kubernetes service discovery: use this instead of static_configs in K8s.
  # Pods annotated with prometheus.io/scrape: 'true' are scraped automatically.
  # No manual target management — works with rolling deploys and autoscaling.
  - job_name: 'spring-boot-k8s'
    metrics_path: '/actuator/prometheus'
    kubernetes_sd_configs:
      - role: pod
    basic_auth:
      username: 'prometheus'
      password_file: '/etc/prometheus/secrets/prometheus_password'
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape: 'true' annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Rewrite the scrape address using the pod's prometheus.io/port annotation.
      # Keep the pod IP and swap only the port — replacing __address__ with just
      # the port number (a common mistake) breaks scraping entirely.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Add pod name and namespace as labels for Grafana filtering
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
▶ Output
# Add these annotations to your Kubernetes pod spec to enable auto-discovery:
#
# metadata:
#   annotations:
#     prometheus.io/scrape: 'true'
#     prometheus.io/port: '8080'
#     prometheus.io/path: '/actuator/prometheus'
#
#
# Prometheus will automatically discover and scrape this pod.
# When the pod is replaced (rolling deploy), Prometheus picks up the new IP.
# No prometheus.yml restart required.
💡Import Grafana Dashboard 4701 Before Building Your Own
Grafana dashboard ID 4701 (JVM Micrometer) is a production-ready Spring Boot dashboard maintained by the Micrometer team. It covers JVM heap, GC pauses, HTTP request rates, error rates, and HikariCP connection pool metrics out of the box. Import it first and use it for at least one sprint before building custom dashboards. You will learn which metrics actually matter during incidents before investing time in custom PromQL.
📊 Production Insight
Scrape intervals compound with instance count in ways most teams do not calculate until it is too late. One hundred instances scraped every 15 seconds is 400 scrapes per minute; with a 500ms health check, that is 200 seconds of health check work every minute — and that is before you account for the database calls inside each health check. Use /actuator/prometheus for Prometheus scraping (lightweight reads) and health groups for Kubernetes probes. In Kubernetes, static_configs break on every rolling deploy — pods get new IPs and Prometheus stops scraping the new instances. Use kubernetes_sd_configs with pod annotations from day one.
🎯 Key Takeaway
Prometheus scrapes metrics, not health — using /actuator/health as the scrape path is the thundering herd waiting to happen. In Kubernetes, use service discovery with pod annotations instead of static targets. Import Grafana dashboard 4701 before writing custom PromQL — it covers the metrics that actually matter during incidents.

Kubernetes Probes — Liveness, Readiness, and Startup

If you are deploying Spring Boot to Kubernetes, Actuator's health groups become the backbone of your pod lifecycle management. Misconfiguring these probes is the single most common cause of cascading restarts in Spring Boot Kubernetes deployments — and it is entirely preventable.

Spring Boot 2.3 introduced health groups — separate health endpoints for different probe types. This is critical because liveness and readiness must check different things:

  • Liveness ('Is the app alive?'): Checks only internal JVM state. If this fails, Kubernetes restarts the pod. Keep it lightweight — no database calls, no external API calls. A temporary DB blip should never restart your pods.
  • Readiness ('Can the app accept traffic?'): Checks dependencies — database reachable, cache warm, broker connected. If this fails, Kubernetes removes the pod from the load balancer but does not restart it. This is the correct behavior for a transient dependency failure.
  • Startup ('Has the app finished booting?'): For slow-starting apps. Kubernetes waits for this to pass before running liveness and readiness probes. The math matters: failureThreshold × periodSeconds = maximum startup window. A Spring Boot app with 60 seconds of startup time needs failureThreshold: 12 with periodSeconds: 10 for a 120-second safety window — always give at least 2x your measured startup time.
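The startup-window arithmetic above can be sanity-checked with a few lines of shell, using the 60-second example:

```shell
#!/bin/sh
# Startup window = failureThreshold × periodSeconds.
# Rule of thumb: window >= 2 × measured startup time.
MEASURED_STARTUP=60   # seconds, from observing real boots — measure, never guess
PERIOD=10             # periodSeconds on the startupProbe

# Ceiling division so the window is never short of the 2× target
FAILURE_THRESHOLD=$(( (2 * MEASURED_STARTUP + PERIOD - 1) / PERIOD ))
WINDOW=$(( FAILURE_THRESHOLD * PERIOD ))

echo "failureThreshold: $FAILURE_THRESHOLD"   # prints: failureThreshold: 12
echo "startup window:   ${WINDOW}s"           # prints: startup window:   120s
```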

I once debugged a 20-replica production deployment restarting every 90 seconds. The liveness probe was checking database connectivity. A 30-second DB connection spike triggered liveness failures across all 20 pods simultaneously. They all restarted, reconnected at once, overwhelmed the DB, the DB connection time spiked again, and the cycle repeated. The fix was one line in the Kubernetes deployment YAML: change the liveness probe path from /actuator/health to /actuator/health/liveness.

io/thecodeforge/monitoring/k8s-deployment.yaml · YAML
# io.thecodeforge: Kubernetes Deployment with Actuator Probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  template:
    metadata:
      annotations:
        # Enable Prometheus auto-discovery (works with kubernetes_sd_configs above)
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/actuator/prometheus'
    spec:
      containers:
        - name: order-service
          image: io.thecodeforge/order-service:latest
          ports:
            - containerPort: 8080

          # STARTUP PROBE: Prevents liveness from killing a slow-starting app.
          # Math: failureThreshold (30) × periodSeconds (10) = 300 seconds max startup.
          # Measure your actual startup time and set this to at least 2× that value.
          # If your app starts in 45 seconds, use failureThreshold: 12, periodSeconds: 10
          # for a 120-second window. Never guess — measure.
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            failureThreshold: 30      # 30 × 10s = 300 seconds max startup window
            periodSeconds: 10
            timeoutSeconds: 3

          # LIVENESS PROBE: JVM-only — no external dependency checks.
          # If this fails, Kubernetes RESTARTS the pod.
          # Never check DB or external APIs here — a transient blip restarts all pods.
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 0    # startupProbe handles the startup window
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3       # Restart after 3 consecutive failures (30 seconds)

          # READINESS PROBE: Checks dependencies — DB, cache, broker.
          # If this fails, pod is REMOVED from load balancer but NOT restarted.
          # This is the correct behavior for a transient dependency failure.
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3       # Remove from LB after 15 seconds of failures
▶ Output
# Spring Boot application.yml configuration required for health groups:
#
# management:
#   endpoint:
#     health:
#       probes:
#         enabled: true          # ← enables /actuator/health/liveness and /readiness
#       show-details: when_authorized
#   health:
#     livenessstate:
#       enabled: true
#     readinessstate:
#       enabled: true
#
# Without probes.enabled=true, /actuator/health/liveness and /readiness return 404.
# This is the most common Kubernetes probe misconfiguration in Spring Boot.
⚠ Never Put External Dependency Checks in Liveness Probes
If your liveness probe checks the database and the database has a 30-second connection spike, Kubernetes restarts every pod in your deployment simultaneously. They all reconnect to the DB at once, creating a connection storm that makes the DB spike worse. This cascading restart pattern has taken down production systems at companies with mature engineering teams. The rule is absolute: liveness probes check only JVM internal state. External dependencies belong in readiness probes.
📊 Production Insight
A 20-replica deployment was restarting every 90 seconds. The liveness probe was set to /actuator/health instead of /actuator/health/liveness. A 30-second DB connection spike triggered liveness failures across all 20 pods simultaneously. They all restarted, reconnected at once, overwhelmed the DB, which caused another spike, which failed liveness again. The cascading loop ran for 40 minutes before anyone understood what was happening. The fix was changing one YAML field. The investigation took 2 hours. The actual fix took 5 seconds.
🎯 Key Takeaway
Liveness checks the JVM — failure triggers a pod restart. Readiness checks dependencies — failure removes the pod from the load balancer without restarting it. startupProbe math: failureThreshold × periodSeconds = maximum startup window — measure your actual startup time and set this to at least 2×. Never put external dependency checks in liveness probes — the cascading restart pattern is catastrophic at scale.
Kubernetes Probe Selection
If: App is slow to start — Spring context takes 30 or more seconds to load
Use: Add a startupProbe using /actuator/health/liveness. Set failureThreshold × periodSeconds to at least 2× your measured startup time. Without a startupProbe, liveness kills the pod during initialization.
If: External dependency (DB, cache, broker) is temporarily unreachable
Use: Readiness probe fails — the pod is removed from the load balancer but NOT restarted. Traffic stops going to it. When the dependency recovers, the pod passes readiness and rejoins the load balancer automatically.
If: JVM is unresponsive — deadlock, OOM, or GC thrashing
Use: Liveness probe fails — Kubernetes restarts the pod. This is the correct behavior — the process is genuinely broken and needs a fresh start.
If: Liveness probe checks database connectivity
Use: Immediate risk — a DB blip restarts ALL pods simultaneously, a cascading failure. Change liveness to /actuator/health/liveness and move the DB check to readiness immediately.

The /actuator/info Endpoint — Deployment Traceability

The /actuator/info endpoint is the most underused feature in the Actuator suite. It lets you expose build information — Git commit hash, build timestamp, artifact version — directly from your running application. When something breaks in production, the first question is always 'what version is running?' Without this endpoint configured, answering that question means digging through CI/CD pipeline logs, which takes minutes you do not have during an active incident.

The setup requires two Maven plugins: the spring-boot-maven-plugin with the build-info goal, and the git-commit-id-plugin to embed Git metadata. Once configured, every build automatically embeds its own DNA into the artifact. Every deployment automatically reports exactly what code is running.

In production setups, every Grafana dashboard should have a deployment panel that queries /actuator/info across instances. If instances report different commit hashes during a rolling deploy, you can see the split-brain state in real time. If a rollback happened silently, it shows up immediately in this panel.

io/thecodeforge/monitoring/pom.xml · XML
<!-- io.thecodeforge: Maven plugins for /actuator/info enrichment -->
<!-- Add these inside your <build><plugins> block -->
<build>
  <plugins>

    <!-- Plugin 1: Generates build-info.properties at compile time.
         Embeds: artifact name, version, build timestamp.
         Appears under "build" key in /actuator/info response. -->
    <plugin>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-maven-plugin</artifactId>
      <executions>
        <execution>
          <goals>
            <goal>build-info</goal>
          </goals>
        </execution>
      </executions>
    </plugin>

    <!-- Plugin 2: Embeds Git metadata at build time.
         Embeds: commit hash, branch, commit time, tags.
         Appears under "git" key in /actuator/info response.
         failOnNoGitDirectory=false: prevents build failure in CI environments
         where the .git directory may not be present (Docker build layers). -->
    <plugin>
      <groupId>pl.project13.maven</groupId>
      <artifactId>git-commit-id-plugin</artifactId>
      <version>4.9.10</version>
      <executions>
        <execution>
          <goals>
            <goal>revision</goal>
          </goals>
        </execution>
      </executions>
      <configuration>
        <generateGitPropertiesFile>true</generateGitPropertiesFile>
        <!-- Prevents build failure when .git is not present (shallow clones, CI) -->
        <failOnNoGitDirectory>false</failOnNoGitDirectory>
        <!-- Only embed the abbreviated commit hash, not the full 40-char hash -->
        <abbrevLength>7</abbrevLength>
      </configuration>
    </plugin>

  </plugins>
</build>
▶ Output
# GET /actuator/info
{
  "build": {
    "artifact": "order-service",
    "name": "Order Service",
    "version": "2.4.1",
    "time": "2026-04-18T14:22:00Z"
  },
  "git": {
    "branch": "main",
    "commit": {
      "id": "a1b2c3d",
      "time": "2026-04-18T14:18:32Z"
    }
  }
}

# CI/CD post-deploy verification step:
# DEPLOYED_COMMIT=$(git rev-parse --short HEAD)
# RUNNING_COMMIT=$(curl -s http://service:8080/actuator/info | jq -r '.git.commit.id')
# if [ "$DEPLOYED_COMMIT" != "$RUNNING_COMMIT" ]; then
#   echo "ERROR: Container is running stale code. Expected $DEPLOYED_COMMIT, got $RUNNING_COMMIT"
#   exit 1
# fi
💡Use Info in Your CI/CD Verification Step
After deploying, curl /actuator/info on the new instance and verify git.commit.id matches the commit your pipeline just built. If it does not match, your deployment did not actually apply — the Docker image cache served the old layer, or the Helm rollout did not complete. This 2-second verification step has caught the 'ghost deployment' failure mode more times than I can count. Add it as a required step in your deployment pipeline before marking the deploy as successful.
📊 Production Insight
A team deployed v2.4.1 to fix a critical pricing bug but customer reports kept coming in. The app was still behaving like v2.3.9. Fifteen minutes into the incident, someone thought to curl /actuator/info. The git.commit.id showed the old version's hash. Docker had cached the intermediate image layer and the registry served the old image despite the pipeline showing green. Without the info endpoint, this debugging path would have taken an hour. With it, the ghost deployment was identified in 30 seconds.
🎯 Key Takeaway
When something breaks, the first question is 'what version is running?' The info endpoint answers that in one API call. Configure it once with the two Maven plugins — it requires no ongoing maintenance. Add a CI/CD post-deploy verification step that compares the running commit hash against what was just deployed.
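The comparison logic can live in a small reusable shell function. verify_deploy is a name invented here for illustration; in your pipeline, wire its two arguments to git rev-parse and the curl/jq call against /actuator/info:

```shell
#!/bin/sh
# Hypothetical helper: compare the commit the pipeline built against
# the commit the running instance reports via /actuator/info.
verify_deploy() {
  expected="$1"   # e.g. $(git rev-parse --short HEAD)
  running="$2"    # e.g. $(curl -s "$HOST/actuator/info" | jq -r '.git.commit.id')
  if [ "$expected" != "$running" ]; then
    echo "ERROR: running stale code (expected $expected, got $running)" >&2
    return 1      # fail the pipeline — the deploy did not actually apply
  fi
  echo "OK: running $running"
}

verify_deploy a1b2c3d a1b2c3d   # prints: OK: running a1b2c3d
```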

Dynamic Log Level Management — Debug Without Redeploying

One of Actuator's most powerful day-two features: changing log levels at runtime without restarting the application. Need to enable DEBUG logging for a specific package to troubleshoot a production issue? Hit /actuator/loggers, change the level, reproduce the issue, read the logs, then change it back. No restart. No downtime. No redeploy cycle.

This converts 'we need to add more logging and redeploy' — a 30-minute process minimum in most CI/CD pipelines — into a 10-second API call. During incident response, this is the difference between a 10-minute resolution and a 45-minute resolution while customers are actively impacted.

The endpoint supports GET to read current levels and POST to change them. Spring Security should restrict POST to ADMIN role — TRACE logging in production can generate gigabytes of log data per minute, fill disk, and cascade into other failures. Always reset the log level after you have captured what you need. Build that reset command into your debugging runbook as a mandatory step.

For Hibernate SQL debugging specifically, enable TRACE on org.hibernate.SQL to see the exact SQL being generated and org.hibernate.type.descriptor.sql to see the actual bind parameter values (those are the Hibernate 5.x logger names — Hibernate 6 moved bind-value logging to org.hibernate.orm.jdbc.bind). This combination has diagnosed more mysterious data bugs than any other technique I know.

io/thecodeforge/monitoring/loglevel_commands.sh · BASH
#!/bin/bash
# io.thecodeforge: Dynamic Log Level Management
# Run these during incident response — no restart, no redeploy required.

# --- Read current log level for a package ---
curl -s -u admin:${ACTUATOR_PASSWORD} \
  http://localhost:8080/actuator/loggers/io.thecodeforge.order | jq .
# Response: { "configuredLevel": "INFO", "effectiveLevel": "INFO" }

# --- Enable DEBUG for order package (troubleshooting) ---
curl -s -u admin:${ACTUATOR_PASSWORD} \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "DEBUG"}' \
  http://localhost:8080/actuator/loggers/io.thecodeforge.order

# --- Enable TRACE for Hibernate SQL (see exact SQL + bind parameters) ---
# Logger names below are for Hibernate 5.x; on Hibernate 6, use org.hibernate.orm.jdbc.bind for bind values.
# WARNING: This generates enormous log volume. Reset immediately after capturing the query.
curl -s -u admin:${ACTUATOR_PASSWORD} \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "TRACE"}' \
  http://localhost:8080/actuator/loggers/org.hibernate.SQL

curl -s -u admin:${ACTUATOR_PASSWORD} \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "TRACE"}' \
  http://localhost:8080/actuator/loggers/org.hibernate.type.descriptor.sql

# --- ALWAYS reset after capturing logs (mandatory step in your runbook) ---
curl -s -u admin:${ACTUATOR_PASSWORD} \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": null}' \
  http://localhost:8080/actuator/loggers/io.thecodeforge.order

# null = inherit from parent logger (returns to default)
# Setting to null is different from setting to INFO explicitly —
# null respects future configuration changes, INFO overrides them.
▶ Output
# After enabling DEBUG for io.thecodeforge.order:
2026-04-18 14:30:12 DEBUG io.thecodeforge.order.OrderService - Processing order #4521
2026-04-18 14:30:12 DEBUG io.thecodeforge.order.OrderService - Payment method: CREDIT_CARD
2026-04-18 14:30:13 DEBUG io.thecodeforge.order.OrderService - Inventory reserved: 3 items
2026-04-18 14:30:13 ERROR io.thecodeforge.order.OrderService - Payment declined: insufficient funds for order #4521
2026-04-18 14:30:13 DEBUG io.thecodeforge.order.OrderService - Rollback initiated for order #4521

# With org.hibernate.SQL at TRACE:
2026-04-18 14:30:12 TRACE org.hibernate.SQL - select * from orders where user_id=? and status=?
2026-04-18 14:30:12 TRACE org.hibernate.type.descriptor.sql - binding parameter [1] as [BIGINT] - [10045]
2026-04-18 14:30:12 TRACE org.hibernate.type.descriptor.sql - binding parameter [2] as [VARCHAR] - [PENDING]

# After resetting to null: DEBUG logs disappear immediately, zero application restart.
💡Build a Runbook Entry for Every Common Incident
Document which packages to enable DEBUG for during common incident types. Example: 'Payment processing failures: enable DEBUG on io.thecodeforge.order and io.thecodeforge.payment. If SQL is suspected, add TRACE on org.hibernate.SQL. Reset all three within 10 minutes.' Your on-call engineer at 3 AM should be running a documented command, not guessing package names from memory. The runbook entry takes 5 minutes to write and saves 20 minutes per incident.
📊 Production Insight
A team was debugging intermittent payment failures that occurred roughly once every 200 transactions. Reproducing the issue was unreliable. They enabled DEBUG on the payment package for 15 minutes, captured 4 failed transactions with full context in the logs, identified a race condition in the idempotency key generation, and disabled DEBUG before anyone else noticed the log volume. The old workflow — add a log statement, build, deploy to staging, reproduce, deploy to production, check logs — would have been 45 minutes minimum and required a deployment window approval. Dynamic log levels turned it into a 15-minute debugging session with no deployment.
🎯 Key Takeaway
Dynamic log levels are the most underused Actuator feature in production incident response. They turn debugging from a deployment cycle into a 10-second API call. Secure POST with ADMIN role — TRACE logging can fill disks in minutes. Setting configuredLevel to null (inherit) is different from setting it to INFO explicitly — null respects future configuration changes.

Micrometer Integration and Docker Deployment

To make this production-ready, you containerize the application with a focus on how Docker handles the health signal from Actuator and how the JVM is tuned for container environments.

The Docker HEALTHCHECK instruction tells the Docker daemon whether the container is healthy. By pointing it at /actuator/health/liveness, you get the same lightweight JVM-only health logic that Kubernetes uses at the liveness probe level. This matters most for Docker Compose deployments, which do not have Kubernetes probe support.

Two things to watch for in every containerized deployment. First, the HEALTHCHECK command needs wget or curl to be present in the base image. Distroless images include neither — you will need to switch to a slim base image or remove the Docker HEALTHCHECK and rely exclusively on Kubernetes probes. Second, the JVM memory flags: without -XX:+UseContainerSupport, older JDK versions read the host machine's memory rather than the container's memory limit and allocate too large a heap, causing OOM kills. Container awareness has been enabled by default since JDK 10, so JDK 17 handles this automatically, but the flag is harmless and makes the intent explicit.

io/thecodeforge/monitoring/Dockerfile · DOCKERFILE
# io.thecodeforge: Multi-stage Dockerfile with Actuator Health Integration

# Stage 1: Build
FROM eclipse-temurin:17-jdk-alpine AS build
WORKDIR /app
COPY . .
RUN ./mvnw clean package -DskipTests

# Stage 2: Runtime
# eclipse-temurin:17-jre-alpine includes wget — required for HEALTHCHECK
# Do NOT use distroless if you need Docker HEALTHCHECK
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar

# Security: run as non-root user
RUN addgroup -S forgegroup && adduser -S forgeuser -G forgegroup
USER forgeuser

# Docker HEALTHCHECK: uses liveness endpoint for lightweight JVM-only check.
# interval: how often Docker checks (30s is conservative — adjust for your SLA)
# timeout: how long Docker waits for a response before marking unhealthy
# retries: consecutive failures before marking UNHEALTHY
# Uses /actuator/health/liveness — NOT /actuator/health
# Full health check would hit DB on every HEALTHCHECK — unnecessary for Docker daemon
HEALTHCHECK \
  --interval=30s \
  --timeout=3s \
  --retries=3 \
  CMD wget --quiet --tries=1 --spider http://localhost:8080/actuator/health/liveness || exit 1

EXPOSE 8080

# JVM flags for container environments:
# -XX:+UseContainerSupport: read container memory limits, not host memory
# -XX:MaxRAMPercentage=75.0: allocate 75% of container memory to the heap
#   leaving 25% for Metaspace, thread stacks, GC overhead, and OS
ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-XX:MaxRAMPercentage=75.0", "-jar", "app.jar"]
▶ Output
# Build and verify health:
# docker build -t io.thecodeforge/order-service:latest .
# docker run -p 8080:8080 io.thecodeforge/order-service:latest
# docker ps (HEALTHY appears after 30s)
#
# CONTAINER ID IMAGE STATUS
# a1b2c3d4e5f6 io.thecodeforge/order-service Up 2 minutes (healthy)
#
# Verify the health endpoint from inside the container:
# docker exec a1b2c3d4e5f6 wget -q -O - http://localhost:8080/actuator/health/liveness
# {"status":"UP"}
⚠ Watch Out for Distroless Images
Distroless base images (gcr.io/distroless/java17) do not include wget, curl, or a shell. Docker HEALTHCHECK requires one of these. If you switch to distroless for security, remove the Docker HEALTHCHECK instruction entirely and rely on Kubernetes probes exclusively. Kubernetes probes are HTTP checks performed by the kubelet from outside the container — they do not need wget inside the container. In Kubernetes, Docker HEALTHCHECK and Kubernetes probes are independent mechanisms; you do not need both.
📊 Production Insight
A team migrated to a distroless image for the reduced attack surface — a legitimate security improvement. Docker HEALTHCHECK started failing because wget was missing. Docker marked the container UNHEALTHY, Docker Compose restarted it, which caused a restart loop. The ops team spent two hours before identifying that HEALTHCHECK was the issue, not the application. The fix: remove Docker HEALTHCHECK and rely on Kubernetes probes exclusively. Kubernetes probes are more sophisticated anyway — they support failure thresholds, initial delays, and startup protection that Docker HEALTHCHECK does not.
🎯 Key Takeaway
Docker HEALTHCHECK requires wget or curl — distroless images include neither. In Kubernetes, remove Docker HEALTHCHECK and use Kubernetes probes exclusively. Set -XX:+UseContainerSupport and -XX:MaxRAMPercentage=75.0 in your ENTRYPOINT to prevent the JVM from over-allocating heap based on host memory rather than container limits.
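For reference, the Kubernetes probe configuration that replaces Docker HEALTHCHECK can be sketched as follows (the port and timing values here are illustrative assumptions — tune periods and thresholds to your SLA):

```yaml
# Sketch: Kubernetes probes for this service's pod spec.
livenessProbe:
  httpGet:
    path: /actuator/health/liveness    # JVM-only check; failure restarts the pod
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # dependency checks; failure removes pod from the load balancer
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
```

The kubelet performs these HTTP checks from outside the container, so no wget or curl is needed in the image — which is what makes this pattern compatible with distroless bases.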
🗂 Legacy vs. Modern Monitoring Approaches
How Spring Boot Actuator replaces ad-hoc monitoring patterns with standardized, production-grade observability.
Monitoring Aspect · Legacy Approach (Manual) · Modern Approach (Actuator)

Health Checks
  Legacy: Custom /status endpoints with inconsistent JSON structure. Each developer implements their own format. You get 'OK' or nothing. No component-level detail, no Kubernetes probe integration.
  Modern: Standardized /health endpoint with nested component status, auto-aggregation (any DOWN component makes overall status DOWN), and health groups (liveness, readiness) that map directly to Kubernetes probe types.

Metrics Gathering
  Legacy: Log parsing, manual JMX MBean registration, or custom counters in Redis. Fragile, hard to query, impossible to aggregate across instances, and requires per-service implementation.
  Modern: Micrometer integration with dimensional metrics. Auto-instrumentation of JVM, HTTP requests, and HikariCP. One dependency addition exports everything to Prometheus, Datadog, InfluxDB, or New Relic.

Runtime Management
  Legacy: Requires application restart to change log levels. 'Add a log statement and redeploy' is a 30-minute process with a deployment window approval. Debugging in production is a full release cycle.
  Modern: Dynamic log level updates via /actuator/loggers — change any package's log level in 10 seconds without restarting. View environment variables and system properties via /actuator/env (secured).

Security
  Legacy: Ad-hoc security filters, often left unprotected or protected inconsistently. Actuator endpoints exposed with default Spring Security or no security at all — common source of credential exfiltration.
  Modern: Integrated with Spring Security. Fine-grained access control per endpoint using an @Order SecurityFilterChain. Role-based restrictions. HTTP Basic for the Prometheus scraper. CSRF disabled for stateless endpoints.

Deployment Traceability
  Legacy: Check CI/CD logs, SSH into the server, run git log. No programmatic way to verify which code version is running. Ghost deployments go undetected for hours during incidents.
  Modern: /actuator/info exposes Git commit hash, build timestamp, and artifact version from the running process. CI/CD post-deploy verification catches ghost deployments in 2 seconds.

Operational Control
  Legacy: Any operational action (cache clear, config reload, feature toggle) requires a code change, pull request, code review, build, and deploy. A 30-minute minimum for a surgical change.
  Modern: Custom @Endpoint with @WriteOperation provides surgical operational control without a deploy. Cache invalidation, circuit breaker toggles, and config reloads become API calls restricted to ADMIN role.

🎯 Key Takeaways

  • Actuator is the bridge between application code and operational visibility — it is non-negotiable for any Spring Boot service running in production.
  • Use readiness probes to control load balancer traffic and liveness probes to signal Kubernetes when a pod needs a restart. Never put external dependency checks in liveness probes — the cascading restart pattern it creates has taken down production systems at companies with mature engineering teams.
  • Micrometer is the metrics engine. Use Counters for total counts, Gauges for current values, and Timers for latency percentiles. Mean latency is a liar — always instrument and alert on p95 or p99.
  • The /actuator/info endpoint is the fastest way to answer 'what version is running?' during an incident. Configure it once with git-commit-id-plugin and build-info goal. Add post-deploy verification to your CI/CD pipeline that compares the running commit hash against what was just deployed.
  • Dynamic log level management via /actuator/loggers turns a 30-minute debugging deployment cycle into a 10-second API call. Build the reset command into your runbook as a mandatory step — TRACE logging left running fills disks.
  • Always secure Actuator endpoints with a dedicated SecurityFilterChain using @Order(1). Whitelist only what you need. env, heapdump, and threaddump are the three most dangerous — restrict to ADMIN role and consider never exposing them at all.
  • Add management.metrics.tags.application=${spring.application.name} to every service's application.yml — it is the one line most teams forget. Without it, metrics from different services are indistinguishable in Prometheus.
  • Never create high-cardinality metrics with dynamic tag values like userId or requestId. Low-cardinality dimensions only: service name, region, HTTP method, status code, error type.
  • The observability stack is Actuator (sensors) + Prometheus (recorder) + Grafana (visualizer) + Alertmanager (alerter). All four work together — Actuator alone gives you endpoints, not observability.

⚠ Common Mistakes to Avoid

    Exposing all endpoints via management.endpoints.web.exposure.include=* in production
    Symptom

    Automated scanners find /actuator/env within hours. AWS credentials, database passwords, and API keys are returned in JSON plaintext. /actuator/heapdump dumps all in-memory data including user sessions and PII.

    Fix

    Whitelist only what your monitoring stack needs: health,info,prometheus,loggers. Explicitly exclude heapdump,env,threaddump,shutdown. Add a SecurityFilterChain with @Order(1) and role-based access control.
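In application.yml, a whitelist along these lines (a sketch — adjust the endpoint list to what your monitoring stack actually uses):

```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,loggers    # whitelist only what you need
        exclude: heapdump,env,threaddump,shutdown  # explicit exclusion as defense in depth
  endpoint:
    health:
      show-details: when_authorized
```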

    Missing micrometer-registry-prometheus dependency — /actuator/prometheus returns 404
    Symptom

    /actuator/prometheus is listed in exposure.include but returns 404. Prometheus shows the target as having no metrics. Teams spend hours checking security configuration before realizing the dependency is missing.

    Fix

    Add io.micrometer:micrometer-registry-prometheus to pom.xml or build.gradle. This is separate from spring-boot-starter-actuator. Without it, the /actuator/prometheus endpoint does not exist regardless of what exposure.include contains.

    Performing heavy, blocking I/O inside a Custom Health Indicator
    Symptom

    Health checks take 2 to 5 seconds because they call external APIs synchronously. Kubernetes probes timeout. Prometheus marks the target as down. CPU spikes every 10 seconds when probes fire simultaneously across all instances.

    Fix

    Keep health indicators under 200ms. Use cached results with periodic background refresh via a scheduled task instead of live calls on every probe. Set connectTimeout and readTimeout to 2000ms maximum in your health indicator HTTP client.
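The cached-result pattern can be sketched in plain Java, independent of Spring (the class name and the checkDependency stub here are illustrative; in a real service this logic would sit inside a HealthIndicator bean):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: probes read a cached status instantly; a background task
// refreshes it on its own schedule, so no probe ever blocks on live I/O.
public class CachedHealthCheck {
    private final AtomicReference<String> cachedStatus = new AtomicReference<>("UNKNOWN");
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(long refreshSeconds) {
        refresh(); // prime the cache synchronously so health() is meaningful immediately
        scheduler.scheduleAtFixedRate(this::refresh, refreshSeconds, refreshSeconds, TimeUnit.SECONDS);
    }

    private void refresh() {
        cachedStatus.set(checkDependency() ? "UP" : "DOWN");
    }

    // Illustrative stub — in production this would be an HTTP or JDBC call
    // with connectTimeout/readTimeout capped at ~2000ms.
    protected boolean checkDependency() {
        return true;
    }

    // What the probe handler returns: a sub-millisecond cached read.
    public String health() {
        return cachedStatus.get();
    }

    public void stop() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        CachedHealthCheck check = new CachedHealthCheck();
        check.start(30);
        System.out.println(check.health()); // prints "UP"
        check.stop();
    }
}
```

The key property: probe latency is decoupled from dependency latency. A slow downstream call delays the next background refresh, not the probe response.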

    Leaving management.endpoint.health.show-details set to 'always'
    Symptom

    Anyone who hits /actuator/health sees database versions, connection pool sizes, external API response times, and disk usage. Attackers use this to fingerprint infrastructure and identify known CVEs for specific database or framework versions.

    Fix

    Set show-details to when_authorized or never in production. Only MONITORING or ADMIN roles should see component-level health details. Kubernetes probes only need the status field — they do not need details.

    Using /actuator/health as the Kubernetes liveness probe instead of /actuator/health/liveness
    Symptom

    A transient database blip causes the full health check to fail. Kubernetes restarts every pod simultaneously. They all reconnect at once, creating a connection storm that worsens the original blip into a cascading restart loop.

    Fix

    Use /actuator/health/liveness for liveness probes — it checks only JVM internal state. Move dependency checks to /actuator/health/readiness. Enable health groups with management.endpoint.health.probes.enabled=true.

    Not setting management.metrics.tags.application globally
    Symptom

    Prometheus shows http_server_requests_seconds from five different services all mixed together. Grafana dashboards cannot filter by service. Queries return combined meaningless numbers across the entire fleet.

    Fix

    Set management.metrics.tags.application=${spring.application.name} in every service's application.yml. Every metric emitted gets this tag automatically. Prometheus can then filter by application label in every query.
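The corresponding application.yml fragment (the service name is illustrative):

```yaml
spring:
  application:
    name: order-service          # illustrative service name
management:
  metrics:
    tags:
      application: ${spring.application.name}  # every emitted metric gets this label
```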

    Creating high-cardinality metrics by using dynamic values as tags
    Symptom

    A Counter tagged with userId creates one unique time series per user. With 100,000 users, Prometheus runs out of memory. Scrape failures increase. TSDB storage fills up in days. Prometheus restarts frequently.

    Fix

    Keep tags to low-cardinality dimensions: service name, region, HTTP method, HTTP status code, error type. Never use request IDs, user IDs, or timestamps as tag values. User-level detail belongs in log aggregation, not metrics.

    Ignoring the /actuator/info endpoint entirely
    Symptom

    During a production incident the team cannot answer 'what version is running?' without checking CI/CD logs. Ghost deployments — containers running old code despite a green pipeline — go undetected for hours.

    Fix

    Configure spring-boot-maven-plugin build-info goal and git-commit-id-plugin in pom.xml. Add a post-deploy verification step in CI/CD that curls /actuator/info and compares git.commit.id against the deployed commit hash.

Interview Questions on This Topic

  • Q: What is the difference between 'Liveness' and 'Readiness' probes in the context of Spring Boot 2.3+ Actuator? What should each probe check, and what happens when each one fails in Kubernetes? (Mid-level)
    Liveness probes check whether the application is alive — is the JVM responsive, is the event loop running? They use /actuator/health/liveness and must check only internal JVM state, never external dependencies. When a liveness probe fails, Kubernetes restarts the pod. Readiness probes check whether the application can accept traffic — is the database reachable, is the cache warm, is the broker connected? They use /actuator/health/readiness and include dependency checks. When a readiness probe fails, Kubernetes removes the pod from the service load balancer endpoints but does not restart it. The critical production rule: never put external dependency checks in liveness probes. A temporary database blip would fail liveness across all pods simultaneously, triggering a coordinated restart that creates a connection storm as all pods reconnect at once. This cascading pattern has taken down production deployments at teams with mature engineering practices. The entire architecture of the two health groups exists to prevent this one failure mode.
  • Q: How would you implement a custom metric to track the number of successful vs failed login attempts using Micrometer's MeterRegistry? What meter type would you use and why? (Mid-level)
    Use Counter meters for both successful and failed login attempts. Counters track monotonically increasing values — exactly what cumulative login counts are. Create two counters with a shared name and a differentiating tag: Counter.builder("login.attempts").tag("outcome", "success").register(registry) and Counter.builder("login.attempts").tag("outcome", "failure").register(registry). In Prometheus, query rate(login_attempts_total{outcome='failure'}[5m]) for the failure rate and calculate the ratio of failures to total attempts for the error percentage. Do not use a Gauge — login counts only go up, and Gauges are for values that fluctuate. Do not use a Timer — you are counting events, not measuring duration. Do not create two separately named counters; using the same name with a differentiating tag lets you aggregate them with a single PromQL query. The production caveat: do not tag with userId — that creates one time series per user and will exhaust Prometheus memory. The 'outcome' tag has only two possible values, making it safely low-cardinality.
  • Q: In a high-security environment, how do you restrict access to the /actuator/prometheus endpoint to only the Prometheus scraper's service account? (Senior)
    Create a dedicated SecurityFilterChain with @Order(1) that matches /actuator/**. The @Order(1) is critical — without it, your main application SecurityFilterChain may take precedence and apply different rules than intended. Configure the chain to require the MONITORING role for /actuator/prometheus. Use HTTP Basic authentication since Prometheus natively supports basic_auth in its scrape config — no custom token handling needed. In application.yml, create a monitoring service account with the MONITORING role. In the Prometheus scrape config, add basic_auth with the service account credentials using password_file rather than inline passwords. For defense in depth: use management.endpoints.web.exposure.include to whitelist only health, info, and prometheus — never use wildcard. Add Kubernetes NetworkPolicy to restrict which pods can reach the actuator port, limiting access to the Prometheus pod's IP range or namespace. In managed Kubernetes environments, this network-level restriction is more reliable than authentication alone because it prevents the endpoint from being reachable at all from unauthorized pods.
  • Q: Explain the 'Thundering Herd' problem that can occur if monitoring systems scrape /actuator/health simultaneously. How would you mitigate it? (Senior)
    The thundering herd occurs when multiple systems — Prometheus, Kubernetes probes, Docker HEALTHCHECK, uptime monitors — all hit the full /actuator/health endpoint at overlapping intervals. The full health check performs expensive operations: database queries, external API calls, disk space checks. With 100 instances scraped every 15 seconds and a 500ms health check, that is 400 connection checks and 200 seconds of health-check time per minute, plus the corresponding database load. Mitigation strategy: use /actuator/prometheus for Prometheus scraping (lightweight counter and gauge reads, sub-millisecond). Use /actuator/health/liveness for Kubernetes liveness probes (JVM-only check, no I/O). Use /actuator/health/readiness for readiness probes — it carries the dependency checks, but only the kubelet polls it, on its own per-pod period, rather than every monitoring system hitting the full endpoint. For custom health indicators that must check external dependencies, cache the result with a background refresh thread and return the cached status on every probe call rather than performing a live check each time. Set connectTimeout and readTimeout to 2000ms maximum to prevent slow health checks from compounding.
  • Q: How does the @WriteOperation annotation work in a custom @Endpoint, and what are the safety implications compared to a @ReadOperation? When would you create a custom endpoint? (Senior)
    @ReadOperation maps to HTTP GET and must have no side effects. It returns operational state — current config, deployment info, feature flag values. @WriteOperation maps to HTTP POST and changes application state — clearing a cache, toggling a feature, reloading configuration. @DeleteOperation maps to HTTP DELETE and removes something — clearing a specific cache key, deregistering a resource. The safety implication: @WriteOperation methods modify production state without a code deploy. This is powerful but requires strict access control. Always secure @WriteOperation methods with ADMIN role in your SecurityFilterChain. Log every invocation with the caller's identity for audit trail purposes. Create a custom endpoint when built-in Actuator endpoints do not cover your operational needs. Common legitimate cases: a deployment endpoint that returns richer metadata than /actuator/info, a features endpoint for runtime feature flag management and toggling, a circuit-breaker endpoint that shows the state of each circuit and allows manual open/close operations, or a cache endpoint that shows cache statistics and allows targeted invalidation without a full restart. Use @WebEndpoint instead of @Endpoint when you want the endpoint web-only with no JMX exposure.
  • Q: What is the difference between a Counter, Gauge, and Timer in Micrometer? Give a real-world example of when you would use each one in a payment processing service. (Mid-level)
    A Counter tracks a monotonically increasing value — it only ever goes up. In a payment service, use Counter for total payments processed (payments.completed.total) or total payment failures (payments.failed.total). Query with rate() in Prometheus to get payments per second or failures per second. Counter is wrong for values that can decrease. A Gauge tracks a value that fluctuates up and down. Use Gauge for the current number of pending payment approvals waiting for 3DS authentication, or the current HikariCP active connection count. You report the current value and Micrometer samples it on each Prometheus scrape. Gauge is wrong for counts that only accumulate. A Timer measures both duration and count simultaneously. Use Timer for end-to-end payment processing latency. It automatically calculates p50, p95, and p99 percentiles and the request rate. Add publishPercentiles(0.5, 0.95, 0.99) to expose them as separate Prometheus labels for alerting. Timer is the right choice for anything where both 'how long did it take' and 'how often does it happen' matter. Choosing wrong means missing the signal: using a Gauge for total payments loses the rate information entirely. Using a Counter for HikariCP active connections loses the current value. Using a Timer but only looking at the mean hides the tail latency where slow transactions actually live.
  • Q: How would you use the /actuator/loggers endpoint to troubleshoot a production issue without restarting the application? Walk through the exact steps. (Mid-level)
    Step 1: Identify the relevant package from the stack trace or component you suspect — for example io.thecodeforge.order for the order processing path. Step 2: Read the current level: GET /actuator/loggers/io.thecodeforge.order. Response shows configuredLevel and effectiveLevel — if effectiveLevel is INFO, DEBUG messages are suppressed. Step 3: Enable DEBUG: POST /actuator/loggers/io.thecodeforge.order with body {"configuredLevel":"DEBUG"}. Takes effect immediately — no restart, no propagation delay. Step 4: Reproduce the issue or wait for it to recur. Monitor your log aggregator (Loki, ELK) for the detailed DEBUG messages. Step 5: Capture what you need from the logs and identify the root cause. Step 6: Reset immediately: POST /actuator/loggers/io.thecodeforge.order with body {"configuredLevel":null}. Null means inherit from parent — returns to whatever the default configuration specifies. This is different from setting to INFO explicitly; null respects future configuration changes. For Hibernate SQL debugging: enable TRACE on org.hibernate.SQL to see exact SQL statements and on org.hibernate.type.descriptor.sql to see the actual bind parameter values. Reset both immediately after capturing — TRACE generates enormous log volume that can fill disk in minutes.
  • Q: Explain how the /actuator/info endpoint can be enriched with Git commit information. What Maven plugins are required, and how would you use this in a CI/CD verification step? (Mid-level)
    Two Maven plugins are required: the spring-boot-maven-plugin with the build-info goal, which generates build-info.properties at compile time and embeds artifact name, version, and build timestamp; and the git-commit-id-plugin, which embeds Git metadata including commit hash (abbreviated), branch name, and commit time into git.properties. Both files are read by Actuator at startup and exposed under the build and git keys in /actuator/info. In CI/CD, add a mandatory post-deploy verification step after the deployment completes: extract the git commit hash that was just built (git rev-parse --short HEAD), curl /actuator/info on the newly deployed instance, and compare git.commit.id from the response against the expected hash. If they do not match, fail the pipeline — the deployment applied the wrong image. This catches Docker image cache issues where the registry served a cached layer, Helm chart misconfigurations where the image tag was not updated, and failed rolling updates where some pods are still on the old version. This verification costs 2 seconds and prevents the entire class of ghost deployment incidents where the team believes a fix is deployed but the old code is still running.
  • Q: What is the risk of setting management.endpoints.web.exposure.include=* in production? What specific endpoints are the most dangerous and why? (Mid-level)
    Wildcard exposure makes every Actuator endpoint reachable by anyone who can reach the application's management port. The three most dangerous endpoints: /actuator/env returns all environment variables and system properties in plaintext, including AWS_SECRET_ACCESS_KEY, database connection strings, API tokens, and OAuth client secrets. Automated scanners probe /actuator/env within hours of an application being reachable. This is the most commonly exploited Actuator misconfiguration and the one with the most severe business impact. /actuator/heapdump triggers a full JVM heap dump download — a binary file containing every object currently in memory including user session data, cached database records, in-memory PII, and decrypted credential values. Downloading this file gives an attacker a complete snapshot of your application's runtime memory state. /actuator/threaddump reveals the current state of every JVM thread including stack traces that expose internal application structure, class names, and timing information useful for targeted attacks. Additionally, /actuator/shutdown can terminate the application process entirely if enabled and exposed. The fix: whitelist only the endpoints your monitoring stack actually needs — health, info, prometheus, loggers. Restrict each to the appropriate role in a dedicated SecurityFilterChain with @Order(1). Use Kubernetes NetworkPolicy to prevent these endpoints from being reachable outside the cluster's internal network.

Frequently Asked Questions

What is the difference between /actuator/health, /actuator/health/liveness, and /actuator/health/readiness?

The base /actuator/health endpoint returns the aggregated status of all health indicators — database connectivity, disk space, external APIs, message brokers, and any custom indicators. It is the full-picture health check.

Spring Boot 2.3 introduced health groups: /actuator/health/liveness checks only whether the JVM is responsive with no external dependency checks, and /actuator/health/readiness checks whether the application can serve traffic and includes dependency checks.

In Kubernetes, use liveness for the liveness probe — if it fails, Kubernetes restarts the pod. Use readiness for the readiness probe — if it fails, the pod is removed from the load balancer but not restarted. Never use the full /actuator/health for liveness probes — a temporary database blip would restart every pod simultaneously.

Requires management.endpoint.health.probes.enabled=true in application.yml — without this, the liveness and readiness paths return 404.
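A minimal application.yml sketch enabling the probe groups:

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness
```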

How do I create a custom Actuator endpoint beyond health checks?

Annotate a Spring component with @Endpoint(id = "yourEndpoint") and annotate methods with @ReadOperation (GET), @WriteOperation (POST), or @DeleteOperation (DELETE). Spring Boot automatically exposes the endpoint at /actuator/yourEndpoint.

For web-only endpoints with no JMX exposure, use @WebEndpoint instead of @Endpoint.

Custom endpoints inherit the same security model as built-in ones — they appear in your exposure.include list and respect the SecurityFilterChain you define. Secure @WriteOperation methods with ADMIN role since they change production state. Common use cases: deployment metadata richer than /actuator/info, runtime feature flag management, cache invalidation triggers, and circuit breaker state display and control.

How does Prometheus scrape Spring Boot Actuator metrics?

Prometheus pulls metrics from your application's /actuator/prometheus endpoint at a configured interval (commonly 15 seconds; Prometheus's global default is one minute). On the Spring Boot side: add the micrometer-registry-prometheus dependency and include prometheus in management.endpoints.web.exposure.include — this auto-configures the endpoint. On the Prometheus side: add a scrape_config block with metrics_path: '/actuator/prometheus' and your application's host and port.

For Kubernetes, use kubernetes_sd_configs with pod annotations instead of static targets. Pods annotated with prometheus.io/scrape: 'true' are auto-discovered. Each scrape returns all current metric values in Prometheus exposition text format. Prometheus stores these as time series in its TSDB.

Common issue: /actuator/prometheus returns 404 despite being in exposure.include — check that micrometer-registry-prometheus is in your pom.xml. It is a separate dependency from spring-boot-starter-actuator.
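A minimal scrape_config sketch (the job name, target, and interval are illustrative):

```yaml
scrape_configs:
  - job_name: 'order-service'             # illustrative job name
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:8080']       # replace with your host:port
    # If the endpoint is secured with HTTP Basic, add:
    # basic_auth:
    #   username: monitoring
    #   password_file: /etc/prometheus/actuator_password
```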

Can I change log levels in a running Spring Boot application without restarting?

Yes. POST to /actuator/loggers/{package.name} with a JSON body of {"configuredLevel": "DEBUG"}. The change takes effect immediately — no restart, no redeploy. To reset to default, POST with {"configuredLevel": null}. Null means inherit from the parent logger — this is different from explicitly setting INFO, which overrides any future configuration changes.

For Hibernate SQL debugging: enable TRACE on org.hibernate.SQL for query text and org.hibernate.type.descriptor.sql for bind parameter values.

Always secure POST with Spring Security restricting it to ADMIN role — TRACE logging generates gigabytes of output per minute and can fill disk quickly. Build the reset command into your incident runbook as a mandatory step, not an optional one.

What information does the /actuator/info endpoint show, and how do I populate it?

By default /actuator/info returns an empty JSON object — it must be explicitly populated. Two sources of data:

Build metadata: add the spring-boot-maven-plugin's build-info execution goal. This generates build-info.properties at compile time containing artifact name, version, and build timestamp. Appears under the build key in the response.

Git metadata: add git-commit-id-plugin to your Maven build. This embeds the commit hash (abbreviated), branch name, and commit time into git.properties at build time. Appears under the git key.

You can also add custom info via application.properties: info.app.description=Order processing service. These appear under the app key. Note that since Spring Boot 2.6, exposing these env-derived info.* properties requires management.info.env.enabled=true.

Use this in CI/CD post-deploy verification: compare git.commit.id from the running instance against the commit hash your pipeline just built to catch ghost deployments.

How do I secure Actuator endpoints in production?

Create a dedicated SecurityFilterChain with @Order(1) that matches /actuator/**. The @Order(1) ensures this chain has higher priority than your main application chain.

Permit /actuator/health/** and /actuator/info — needed by load balancers and Kubernetes probes without authentication. Restrict /actuator/prometheus to MONITORING role with HTTP Basic authentication. Restrict /actuator/loggers, /actuator/env, /actuator/heapdump, and /actuator/threaddump to ADMIN role. Deny everything else.

Also set management.endpoints.web.exposure.include to an explicit whitelist — health, info, prometheus, loggers — never the * wildcard. Disable CSRF for the actuator security chain, since these endpoints are called by automated systems, not browsers. Finally, set management.endpoint.health.show-details to when_authorized or never in application.yml.
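An application.yml sketch of that exposure and health-detail policy (a fragment, not a complete config):

```yaml
# application.yml fragment (illustrative)
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,loggers   # explicit whitelist, never "*"
  endpoint:
    health:
      show-details: when_authorized   # anonymous callers see only UP/DOWN
```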

What is the difference between a Counter, Gauge, and Timer in Micrometer?

A Counter tracks a monotonically increasing value — it only goes up. Use it for total requests, total errors, and orders placed. Query with rate(counter_total[5m]) in Prometheus to get events per second. A Counter is wrong for any value that can decrease.

A Gauge tracks a value that goes up and down. Use it for current queue depth, active connections, and heap usage. You report the current value on demand and Micrometer samples it on each Prometheus scrape. A Gauge is wrong for cumulative counts.

A Timer measures duration and count simultaneously. Use it for request latency, database query time, and payment processing duration. It automatically calculates count, total time, and percentiles. Add publishPercentiles(0.5, 0.95, 0.99) to expose p50, p95, and p99 as quantile-tagged series in Prometheus for alerting.
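Micrometer computes percentiles internally, but the tail-latency point is easy to see with plain Java. A minimal nearest-rank percentile sketch over synthetic numbers (class name and sample values are illustrative):

```java
import java.util.Arrays;

// Plain-Java sketch of the nearest-rank percentile math a Timer performs,
// showing why mean latency hides the tail. All numbers are synthetic.
public class PercentileDemo {

    // Nearest-rank percentile over an ascending-sorted array of latencies (ms).
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // 95 requests at 100ms, 5 outliers at 5000ms (already sorted ascending).
        double[] latencies = new double[100];
        Arrays.fill(latencies, 100);
        for (int i = 95; i < 100; i++) latencies[i] = 5000;

        double mean = Arrays.stream(latencies).average().orElse(0);
        System.out.println("mean = " + mean);                        // 345.0 -- looks tolerable
        System.out.println("p50  = " + percentile(latencies, 0.50)); // 100.0
        System.out.println("p99  = " + percentile(latencies, 0.99)); // 5000.0 -- the real signal
    }
}
```

The mean of 345ms looks tolerable while 5% of users are actually waiting 5 full seconds, which is exactly why the alerting advice above targets p95/p99 rather than averages.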

Choose the wrong type and you miss the signal: a Gauge for total orders loses the rate information, a Counter for connection pool size loses the current value, a Timer without percentiles hides the tail latency where the real problems live.

How do I handle Actuator in a Kubernetes environment with multiple replicas?

Configure all three probe types: startupProbe using /actuator/health/liveness with failureThreshold × periodSeconds set to at least 2× your measured startup time. livenessProbe using /actuator/health/liveness with no external dependency checks — JVM state only. readinessProbe using /actuator/health/readiness with dependency checks.
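A deployment-spec sketch of those three probes (port and timings are illustrative; size the startup budget against your measured boot time):

```yaml
# Pod spec fragment (illustrative): probe paths map to Actuator health groups
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  failureThreshold: 30    # 30 x 10s = 300s budget; keep this at least 2x measured startup
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /actuator/health/liveness   # JVM state only, no external dependencies
    port: 8080
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness  # includes dependency checks
    port: 8080
  periodSeconds: 10
```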

Enable health groups in application.yml with management.endpoint.health.probes.enabled=true — without this, the liveness and readiness paths return 404.

For metrics scraping, use Prometheus kubernetes_sd_configs with pod annotations rather than static targets. Pods get new IPs on every restart — static targets break on every rolling deploy. Add prometheus.io/scrape: 'true' and prometheus.io/port: '8080' annotations to your pod spec.

Set management.metrics.tags.application=${spring.application.name} so metrics from different service replicas are identifiable in Prometheus and can be filtered or aggregated by service name in Grafana dashboards.

Naren, Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
