Spring Boot Actuator and Monitoring: Production-Grade Observability
- Actuator is the bridge between application code and operational visibility — it is non-negotiable for any Spring Boot service running in production.
- Spring Boot Actuator exposes operational endpoints (/health, /metrics, /prometheus) that turn your running JVM from a black box into an observable system
- Health groups (liveness, readiness, startup) map directly to Kubernetes probe types — never put external dependency checks in liveness probes; the cascading restart pattern this creates has taken down production systems at companies with mature engineering teams
- Micrometer is the metrics engine: Counters for totals, Gauges for current values, Timers for latency percentiles (p50/p95/p99) — mean latency is a liar, so always instrument and alert on p95 or p99
- Prometheus scrapes /actuator/prometheus every 15s — using /actuator/health for scraping creates a thundering herd at scale
- Dynamic log level changes via /actuator/loggers turn a 30-minute 'add logging and redeploy' cycle into a 10-second API call
- management.metrics.tags.application=${spring.application.name} is the one line most teams forget — without it, metrics from different services collide in Prometheus
- The full stack is Actuator (sensors) + Prometheus (recorder) + Grafana (visualizer) + Alertmanager (alerter) — Actuator alone gives you endpoints, not observability
Production Debug Guide

When Actuator is configured, here is how to go from observable symptom to resolution.

Is the app healthy? Need a fast status check:
curl -s http://localhost:8080/actuator/health | jq .
curl -s http://localhost:8080/actuator/health/liveness | jq .status

Need to see what version is running without checking CI/CD:
curl -s http://localhost:8080/actuator/info | jq .git.commit.id
curl -s http://localhost:8080/actuator/info | jq .build.version

Need DEBUG logs for a specific package without restarting:
curl -u admin:password -X POST -H 'Content-Type: application/json' -d '{"configuredLevel":"DEBUG"}' http://localhost:8080/actuator/loggers/io.thecodeforge.order
tail -f /var/log/app.log | grep 'io.thecodeforge.order'

Prometheus is not scraping — target shows as DOWN in Prometheus UI:
curl -s http://localhost:8080/actuator/prometheus | head -20
curl -u prometheus:password -s http://localhost:8080/actuator/prometheus | wc -l

Docker container marked unhealthy but app seems fine from inside:
docker exec <container_id> wget --quiet --tries=1 --spider http://localhost:8080/actuator/health/liveness
docker inspect --format='{{json .State.Health}}' <container_id> | jq .
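The Docker health status queried above comes from the image's HEALTHCHECK instruction. A minimal sketch of wiring it to the liveness group (the base image and jar path are placeholders; make sure wget or curl actually exists in your image):

```dockerfile
# Illustrative Dockerfile fragment — base image and jar name are placeholders.
FROM eclipse-temurin:17-jre
COPY target/order-service.jar /app/app.jar

# Probe the JVM-only liveness group, not the full /actuator/health,
# so a transient DB outage does not mark the container unhealthy.
HEALTHCHECK --interval=10s --timeout=3s --retries=3 \
  CMD wget --quiet --tries=1 --spider http://localhost:8080/actuator/health/liveness || exit 1

ENTRYPOINT ["java", "-jar", "/app/app.jar"]
```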
In the world of microservices, 'it works on my machine' is not enough. You need to know if it works in production at 3:00 AM under peak load. Spring Boot Actuator is the industry-standard framework that transforms your dark application into an observable system by exposing HTTP and JMX endpoints that reveal the inner state of your running JVM.
I learned this the hard way. In 2021, our team deployed a Spring Boot payment service to production. It ran fine for three weeks. Then one Saturday at 2 AM, latency spiked to 15 seconds per request. We had no metrics, no health checks beyond a basic /ping, and no idea what was wrong. It took four hours to diagnose a connection pool exhaustion issue that a single Micrometer gauge would have caught in 30 seconds. That incident cost us $40,000 in lost transactions and a very uncomfortable Monday morning post-mortem.
After that, we instrumented everything. Every service got Actuator, Micrometer, Prometheus, and Grafana before it ever touched production. We have not had a mystery outage since.
This guide moves beyond basic dependency injection and covers the operational side of Java development. We will explore how to monitor application health, track custom business metrics, expose build traceability info, manage log levels dynamically, secure sensitive endpoints, wire everything into a Prometheus/Grafana stack, and build custom endpoints for operational control — all with the application.yml configuration you can actually copy into a project.
The Three Pillars of Observability
Before diving into Actuator, understand the framework that every production monitoring system is built on. Observability rests on three pillars, and knowing which pillar answers which question is the difference between a 4-hour incident and a 10-minute resolution.
- Metrics — Numeric measurements over time. Request rate, error rate, latency percentiles, JVM heap usage, connection pool saturation. These are what Prometheus scrapes and Grafana displays. Metrics answer 'how much' and 'how fast' and 'how often.' They are the first signal that something is wrong.
- Logs — Discrete events with context. A log entry says 'order #4521 failed with NullPointerException at PaymentService.java:87 for user abc123.' Logs answer 'what happened' and 'why.' Spring Boot's structured logging in JSON format feeds into ELK, Loki, or Datadog. They answer the question that the metric alert raised.
- Traces — A request's journey across services. In a microservice architecture, a single user action might touch 8 services. A trace connects those dots — showing that the 2-second delay happened in the inventory service on the third hop, not in the API gateway. Spring Boot integrates with Micrometer Tracing and OpenTelemetry for distributed tracing.
Spring Boot Actuator primarily addresses the metrics and health pillars. But a production-grade observability stack needs all three working together. The order of implementation matters more than most teams realize.
# io.thecodeforge: Production Observability Stack
#
# ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
# │ Spring Boot │────▶│ Prometheus  │────▶│   Grafana   │
# │   Actuator  │     │  (Scraper)  │     │ (Dashboard) │
# └─────────────┘     └─────────────┘     └─────────────┘
#                            │                   │
#                            │  ┌─────────────┐  │
#                            └─▶│ Alertmanager│◀─┘
#                               │ (PagerDuty) │
#                               └─────────────┘
#
# ┌─────────────┐     ┌─────────────┐
# │ Structured  │────▶│ Loki / ELK  │
# │    Logs     │     │ (Log Agg.)  │
# └─────────────┘     └─────────────┘
#
# ┌─────────────┐     ┌─────────────┐
# │ Micrometer  │────▶│   Tempo /   │
# │   Tracing   │     │   Jaeger    │
# └─────────────┘     └─────────────┘
#
# What each layer answers:
#   Metrics → Is something wrong right now?
#   Logs    → What exactly happened and why?
#   Traces  → Which service in the chain caused it?
- Start with metrics and health checks (Actuator + Prometheus) — this catches 80% of production issues before users notice
- Add structured logging next when you need to debug 'what exactly happened' after an alert fires
- Add distributed tracing when you have 3 or more microservices and need to find which service in the chain introduced latency
- Trying to implement all three simultaneously leads to implementation paralysis and all three done badly
- The order matters: metrics catch problems before logs do, logs explain problems before traces do
The Anatomy of Actuator: Observability in Action
Spring Boot Actuator exists because manual health checks are a recipe for failure. Instead of writing custom endpoints to check if your database is alive or if your disk space is full, Actuator provides these out of the box. By adding the starter dependency and configuring application.yml, you instantly gain access to the /actuator base path with standardized operational endpoints.
However, most endpoints are hidden by default for security. The real power lies in the /health and /prometheus endpoints. One critical detail that trips up almost every team the first time: the /actuator/prometheus endpoint does not exist unless micrometer-registry-prometheus is on the classpath. The actuator starter and the Prometheus registry are separate dependencies. Adding spring-boot-starter-actuator without micrometer-registry-prometheus gives you health and info but not Prometheus metrics.
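As a sketch, the pairing looks like this in Maven (the Prometheus registry version is managed by the Spring Boot BOM, so no explicit version is needed):

```xml
<!-- Operational endpoints: /health, /info, /loggers, ... -->
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

<!-- Required for /actuator/prometheus — the actuator starter alone does not provide it -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```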
One thing worth saying explicitly: Actuator endpoints are not your application REST API. They are operational endpoints meant for internal monitoring infrastructure. Your security model should reflect this — your monitoring stack gets access through specific roles, your customers never touch these endpoints.
# io.thecodeforge: Canonical Actuator Configuration
# This is the base configuration every production Spring Boot service should start with.
# Copy this, adjust the exposure list to match your needs, and ship it.
management:
  endpoints:
    web:
      exposure:
        # Whitelist only what your monitoring stack actually needs.
        # Never use * in production.
        include: health,info,prometheus,loggers
        exclude: heapdump,env,threaddump,shutdown
      base-path: /actuator
  endpoint:
    health:
      # when_authorized: only show component details to authenticated users
      # never: hide component details entirely (use for public-facing services)
      # always: NEVER use in production — exposes DB versions, pool sizes, API keys
      show-details: when_authorized
      show-components: when_authorized
      probes:
        # Enables /actuator/health/liveness and /actuator/health/readiness
        # Required for Kubernetes probe integration
        enabled: true
  # Global tag applied to every metric this service emits.
  # Without this, metrics from different services collide in Prometheus.
  # This is the one line most teams forget. Do not skip it.
  metrics:
    tags:
      application: ${spring.application.name}
    export:
      prometheus:
        # Explicitly enable the Prometheus registry.
        # Redundant if micrometer-registry-prometheus is on the classpath,
        # but makes intent explicit and prevents confusion.
        enabled: true
  # Don't leak the application context id in management response headers
  server:
    add-application-context-header: false

server:
  error:
    # Security: never expose stack traces to clients or monitoring scrapers
    include-stacktrace: never
    include-message: never

spring:
  application:
    # This name flows into management.metrics.tags.application above.
    # Set it explicitly — never rely on the default.
    name: order-service
# GET /actuator/health → component-level health (authorized users only)
# GET /actuator/health/liveness → JVM-only liveness check (Kubernetes liveness probe)
# GET /actuator/health/readiness → dependency check (Kubernetes readiness probe)
# GET /actuator/prometheus → Prometheus metrics with application=order-service tag
# GET /actuator/loggers → current log levels (read)
# POST /actuator/loggers/{pkg} → change log level dynamically (ADMIN role required)
# GET /actuator/info → build + git metadata
#
# /actuator/heapdump, /actuator/env, /actuator/threaddump → blocked
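For reference, an authorized GET /actuator/health under this configuration returns a shape like the following. The component names depend on what is on your classpath; this sample assumes a DataSource and the default disk space check:

```json
{
  "status": "UP",
  "components": {
    "db": { "status": "UP" },
    "diskSpace": { "status": "UP" },
    "livenessState": { "status": "UP" },
    "readinessState": { "status": "UP" }
  },
  "groups": ["liveness", "readiness"]
}
```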
Custom Metrics with Micrometer — Beyond Health Checks
Health checks tell you if the system is alive. Metrics tell you how it is performing. Spring Boot uses Micrometer, a dimensional metrics instrumentation facade — think SLF4J but for metrics. It lets you track business-relevant things like order count, payment latency percentiles, and cart abandonment rate rather than just JVM garbage collection statistics.
Micrometer supports several meter types. Choosing the right one matters:
- Counter — Monotonically increasing value. Use for total requests, total errors, and orders placed. Never goes down. Query with rate() in Prometheus to get events per second.
- Gauge — A value that fluctuates. Use for current queue depth, active connections, and temperature. You report the current value; Micrometer samples it on each scrape.
- Timer — Measures duration and rate simultaneously. Use for request latency, database query time, and external API call duration. Gives you percentiles (p50, p95, p99) automatically.
- Distribution Summary — Like a Timer but for arbitrary values, not time. Use for payload sizes, batch sizes, and record counts.
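To make the mapping concrete, here are the PromQL queries you would typically run against each meter type. The metric names follow the examples later in this guide; the thresholds are illustrative:

```promql
# Counter → rate(): orders per second over a 5-minute window
rate(orders_placed_total[5m])

# Counter ratio → error rate: fires when failures exceed 1% of orders
rate(orders_failed_total[5m]) / rate(orders_placed_total[5m]) > 0.01

# Gauge → read directly: current active carts per instance
carts_active_current

# Timer → p99 latency across instances
# (requires histogram buckets, i.e. histogram=true / publishPercentileHistogram)
histogram_quantile(0.99, sum(rate(payment_processing_duration_seconds_bucket[5m])) by (le))
```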
The @Timed annotation is the fastest way to instrument a controller method. One caveat most guides get wrong: Spring Boot does not auto-configure a TimedAspect bean. Register one yourself as a @Bean (with spring-boot-starter-aop on the classpath) for @Timed to work on arbitrary beans; Spring Boot 2.x additionally honors @Timed on MVC controller methods through its built-in request metrics filter. For custom business logic, inject MeterRegistry directly and build meters explicitly.
package io.thecodeforge.monitoring;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * io.thecodeforge: Custom Business Metrics for Order Processing.
 *
 * Two patterns shown here:
 *   1. @Timed on controller methods — zero-boilerplate method-level timing
 *   2. MeterRegistry injection — explicit control for business counters and gauges
 */
@Service
public class OrderMetricsService {

    private final Counter ordersPlaced;
    private final Counter ordersFailed;
    private final Timer paymentLatency;
    private final AtomicInteger activeCarts = new AtomicInteger(0);

    public OrderMetricsService(MeterRegistry registry) {
        // Counter: Total orders placed (monotonically increasing)
        // Query in Prometheus: rate(orders_placed_total[5m]) → orders per second
        this.ordersPlaced = Counter.builder("orders.placed.total")
                .description("Total number of orders successfully placed")
                .tag("service", "order-service")
                .register(registry);

        // Counter: Total orders failed
        // Alert when rate(orders_failed_total[5m]) / rate(orders_placed_total[5m]) > 0.01
        this.ordersFailed = Counter.builder("orders.failed.total")
                .description("Total number of orders that failed processing")
                .tag("service", "order-service")
                .register(registry);

        // Timer: Payment processing latency with percentiles
        // publishPercentiles exposes p50, p95, p99 as separate Prometheus labels
        // Alert on p99 exceeding your SLA threshold — the mean will hide tail latency
        this.paymentLatency = Timer.builder("payment.processing.duration")
                .description("Time taken to process payment end to end")
                .tag("service", "order-service")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);

        // Gauge: Active shopping carts — value goes up and down
        // Prometheus samples this on every scrape — you report the current value
        Gauge.builder("carts.active.current", activeCarts, AtomicInteger::get)
                .description("Number of active shopping carts in session")
                .register(registry);
    }

    public void recordSuccessfulOrder(long paymentDurationNanos) {
        ordersPlaced.increment();
        paymentLatency.record(paymentDurationNanos, TimeUnit.NANOSECONDS);
    }

    public void recordFailedOrder() {
        ordersFailed.increment();
    }

    public void cartOpened() {
        activeCarts.incrementAndGet();
    }

    public void cartClosed() {
        activeCarts.decrementAndGet();
    }
}

// --- @Timed annotation pattern: zero-boilerplate controller instrumentation ---

package io.thecodeforge.controller;

import io.micrometer.core.annotation.Timed;
import io.thecodeforge.dto.OrderDto;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/orders")
public class OrderController {

    /**
     * @Timed instruments this method with a Timer named 'order.create.duration'.
     * Tracks: request count, total time, and percentiles.
     * Requires a TimedAspect bean registered in your configuration —
     * Spring Boot does not auto-configure one, and spring-boot-starter-aop
     * must be on the classpath for the aspect to apply.
     * extraTags: adds fixed labels to the metric for filtering in Grafana.
     */
    @Timed(
            value = "order.create.duration",
            description = "Time taken to create an order",
            extraTags = {"endpoint", "/api/orders", "method", "POST"},
            percentiles = {0.5, 0.95, 0.99},
            histogram = true
    )
    @PostMapping
    public OrderDto createOrder(@RequestBody OrderDto dto) {
        // Method execution is automatically timed.
        // A Timer named order_create_duration_seconds is created in Prometheus.
        return dto;
    }
}
# HELP orders_placed_total Total number of orders successfully placed
# TYPE orders_placed_total counter
orders_placed_total{service="order-service",} 1542.0
# HELP orders_failed_total Total number of orders that failed processing
# TYPE orders_failed_total counter
orders_failed_total{service="order-service",} 23.0
# HELP payment_processing_duration_seconds Time taken to process payment end to end
# TYPE payment_processing_duration_seconds summary
payment_processing_duration_seconds{service="order-service",quantile="0.5",} 0.045
payment_processing_duration_seconds{service="order-service",quantile="0.95",} 0.180
payment_processing_duration_seconds{service="order-service",quantile="0.99",} 0.420
# HELP carts_active_current Number of active shopping carts in session
# TYPE carts_active_current gauge
carts_active_current{application="order-service",} 87.0
# HELP order_create_duration_seconds Time taken to create an order (@Timed)
# TYPE order_create_duration_seconds summary
order_create_duration_seconds{endpoint="/api/orders",method="POST",quantile="0.99",} 0.312
# HikariCP auto-metrics (no code required — just micrometer-registry-prometheus on classpath)
hikaricp_connections_active{pool="HikariPool-1",} 8.0
hikaricp_connections_idle{pool="HikariPool-1",} 2.0
hikaricp_connections_pending{pool="HikariPool-1",} 0.0
hikaricp_connections_timeout_total{pool="HikariPool-1",} 0.0
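The HikariCP gauges above are exactly what would have caught the connection pool exhaustion described earlier. A sketch of a Prometheus alerting rule for it (the file path, rule name, and thresholds are illustrative, not a standard):

```yaml
# /etc/prometheus/rules/hikaricp.yml — illustrative rule file
groups:
  - name: connection-pool
    rules:
      - alert: HikariPoolExhaustion
        # Threads are queueing for a connection — the pool is saturated
        expr: hikaricp_connections_pending > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.application }} has threads waiting on the connection pool"
          description: "hikaricp_connections_pending > 0 for 2 minutes on pool {{ $labels.pool }}."
```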
Custom Actuator Endpoints — Operational Control Beyond Health
The built-in Actuator endpoints cover most operational needs. But there are legitimate cases where you need operational endpoints specific to your application: runtime feature flag status, current circuit breaker state, deployment metadata that goes beyond what /actuator/info provides, or a cache invalidation trigger that ops can call without a full deploy.
Spring Boot makes this straightforward with the @Endpoint annotation. Annotate a Spring component with @Endpoint(id='yourEndpoint') and methods with @ReadOperation (HTTP GET), @WriteOperation (HTTP POST), or @DeleteOperation (HTTP DELETE). Spring Boot automatically exposes the endpoint at /actuator/yourEndpoint and applies the same security model as built-in endpoints — whatever your SecurityFilterChain says applies here too.
The design rule: @ReadOperation methods should have no side effects. @WriteOperation methods change state and must be secured with ADMIN role — they are essentially a surgical tool for ops teams, not a general API. I have seen @WriteOperation endpoints used to toggle circuit breakers, clear Redis caches, and reload configuration from external sources — all without a redeploy.
package io.thecodeforge.monitoring;

import org.springframework.boot.actuate.endpoint.annotation.*;
import org.springframework.stereotype.Component;

import java.time.Instant;
import java.util.Map;

/**
 * io.thecodeforge: Custom Actuator Endpoint for deployment metadata and cache control.
 *
 * Exposed at: GET  /actuator/deployment          → returns deployment info
 *             POST /actuator/deployment/{reason} → triggers cache invalidation (ADMIN only)
 *
 * Spring Boot automatically applies your SecurityFilterChain to this endpoint.
 * Secure @WriteOperation methods with ADMIN role — they modify production state.
 */
@Component
@Endpoint(id = "deployment")
public class DeploymentEndpoint {

    private final String gitCommit;
    private final String buildVersion;
    private final Instant startedAt = Instant.now();
    private volatile Instant lastCacheInvalidation = null;
    private volatile String lastInvalidationReason = "none";

    public DeploymentEndpoint(
            @org.springframework.beans.factory.annotation.Value("${git.commit.id.abbrev:unknown}") String gitCommit,
            @org.springframework.beans.factory.annotation.Value("${build.version:unknown}") String buildVersion) {
        this.gitCommit = gitCommit;
        this.buildVersion = buildVersion;
    }

    /**
     * @ReadOperation → HTTP GET /actuator/deployment
     * Returns deployment metadata and cache state.
     * Safe to expose to MONITORING role — no side effects.
     */
    @ReadOperation
    public Map<String, Object> deploymentInfo() {
        return Map.of(
                "gitCommit", gitCommit,
                "buildVersion", buildVersion,
                "startedAt", startedAt,
                "lastCacheInvalidation",
                lastCacheInvalidation != null ? lastCacheInvalidation.toString() : "never",
                "lastInvalidationReason", lastInvalidationReason
        );
    }

    /**
     * @WriteOperation → HTTP POST /actuator/deployment/{reason}
     * Triggers a cache invalidation without a redeploy.
     * Secure this with ADMIN role in your SecurityFilterChain.
     * @Selector maps the trailing path segment to the reason parameter,
     * e.g. POST /actuator/deployment/stale-pricing-data
     */
    @WriteOperation
    public Map<String, String> invalidateCache(@Selector String reason) {
        this.lastCacheInvalidation = Instant.now();
        this.lastInvalidationReason = reason != null ? reason : "manual trigger";
        // In real code: inject CacheManager and call cache.invalidateAll()
        return Map.of(
                "status", "cache invalidated",
                "reason", lastInvalidationReason,
                "timestamp", lastCacheInvalidation.toString()
        );
    }
}
{
  "gitCommit": "a1b2c3d",
  "buildVersion": "2.4.1",
  "startedAt": "2026-04-18T10:00:00Z",
  "lastCacheInvalidation": "never",
  "lastInvalidationReason": "none"
}
# POST /actuator/deployment/stale-pricing-data (ADMIN role required)
{
  "status": "cache invalidated",
  "reason": "stale-pricing-data",
  "timestamp": "2026-04-18T10:05:22Z"
}
# The endpoint appears automatically in /actuator index:
# GET /actuator
# {
# "_links": {
# "deployment": { "href": "/actuator/deployment" },
# "health": { "href": "/actuator/health" },
# ...
# }
# }
Securing Actuator Endpoints with Spring Security
By default in Spring Boot 2.x and later, Actuator endpoints sit behind Spring Security if it is on the classpath. But behind security does not mean secure. The default configuration often allows all authenticated users to access all endpoints — including ones that dump environment variables, heap contents, and thread states.
The production pattern: restrict Actuator endpoints to a dedicated monitoring role or internal network. Your Prometheus scraper authenticates with a service account. Your developers get read-only access to health and info. Nobody outside the internal network touches env or heapdump. The 15-line SecurityFilterChain in this section has prevented two credential exfiltration incidents on teams I have worked with.
One detail most guides miss: when you add a dedicated SecurityFilterChain for /actuator/**, you need @Order(1) to give it higher priority than your application's main SecurityFilterChain. Without the ordering, Spring applies your main chain first, which may have different rules than what you intend for Actuator.
package io.thecodeforge.monitoring;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.annotation.Order;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

/**
 * io.thecodeforge: Actuator Security Configuration.
 *
 * @Order(1) gives this chain higher priority than your application's main chain.
 * Without it, your application's SecurityFilterChain may apply different rules
 * to /actuator/** than you intend.
 *
 * Role mapping:
 *   No role    → health, info (load balancers and Kubernetes probes)
 *   MONITORING → prometheus (Prometheus scraper service account)
 *   ADMIN      → loggers, env, heapdump, threaddump (ops team only)
 *   DENY       → shutdown, everything else not whitelisted
 */
@Configuration
public class ActuatorSecurityConfig {

    @Bean
    @Order(1)
    public SecurityFilterChain actuatorSecurityFilterChain(HttpSecurity http) throws Exception {
        http
            .securityMatcher("/actuator/**")
            .authorizeHttpRequests(auth -> auth
                // Public: health checks for load balancers and Kubernetes probes
                // These must be accessible without authentication
                .requestMatchers("/actuator/health/**").permitAll()
                .requestMatchers("/actuator/info").permitAll()
                // MONITORING role: Prometheus scraper authenticates via HTTP Basic
                .requestMatchers("/actuator/prometheus").hasRole("MONITORING")
                // ADMIN role: endpoints that expose or modify sensitive state
                // Log level changes can generate gigabytes/min if abused
                .requestMatchers(
                        "/actuator/loggers/**",
                        "/actuator/env",
                        "/actuator/heapdump",
                        "/actuator/threaddump"
                ).hasRole("ADMIN")
                // Deny everything not explicitly permitted above.
                // This includes /actuator/shutdown — never expose it.
                .anyRequest().denyAll()
            )
            // HTTP Basic: Prometheus natively supports basic_auth in scrape config
            .httpBasic(basic -> {})
            // Disable CSRF: Actuator endpoints are called by automated systems, not browsers
            .csrf(csrf -> csrf.disable());
        return http.build();
    }
}
# Role definitions for the monitoring service account
spring:
  security:
    user:
      name: admin
      password: ${ACTUATOR_ADMIN_PASSWORD}  # inject via environment variable, never hardcode
      roles: ADMIN

# In Kubernetes, use Spring Security with LDAP or OAuth2 service accounts.
# For simple setups, environment-variable-injected credentials are acceptable
# as long as the password is rotated and not committed to source control.
Prometheus and Grafana Integration — Full Stack Setup
Actuator exposes the /actuator/prometheus endpoint in Prometheus exposition format — a text-based, human-readable format that Prometheus understands natively. But Prometheus needs to be told where to scrape. This is where most tutorials end and most teams get stuck.
On the Spring Boot side: add micrometer-registry-prometheus to your dependencies and management.endpoints.web.exposure.include=prometheus to your application.yml. That is it. The endpoint auto-configures.
On the Prometheus side: add a scrape_config block pointing at your application. The two details most engineers get wrong: using /actuator/health as the metrics_path instead of /actuator/prometheus (heavy versus lightweight), and using static_configs in Kubernetes where pods get new IPs on every restart.
On the Grafana side: import dashboard ID 4701 (JVM Micrometer) from grafana.com/dashboards for a production-ready Spring Boot dashboard. It covers heap usage, GC pause time, HTTP request rates, and database connection pool saturation out of the box — no PromQL required to get started.
# io.thecodeforge: Prometheus Scrape Configuration
# This file tells Prometheus where to find your Spring Boot metrics.

global:
  scrape_interval: 15s      # How often Prometheus scrapes all targets
  evaluation_interval: 15s  # How often alerting rules are evaluated

scrape_configs:
  # Static targets: works for a fixed number of servers or local development.
  # In production Kubernetes, replace this with kubernetes_sd_configs below.
  - job_name: 'spring-boot-order-service'
    # IMPORTANT: Use /actuator/prometheus, NOT /actuator/health.
    # /actuator/prometheus: lightweight metric reads, sub-millisecond.
    # /actuator/health: hits the DB, external APIs, disk — thundering herd at scale.
    metrics_path: '/actuator/prometheus'
    scrape_interval: 10s  # Override global for this job
    basic_auth:
      username: 'prometheus'
      # Use password_file in production — never inline passwords in prometheus.yml
      password_file: '/etc/prometheus/secrets/prometheus_password'
    static_configs:
      - targets: ['order-service-01:8080', 'order-service-02:8080']
        labels:
          environment: 'production'
          team: 'platform'

  # Kubernetes service discovery: use this instead of static_configs in K8s.
  # Pods annotated with prometheus.io/scrape: 'true' are scraped automatically.
  # No manual target management — works with rolling deploys and autoscaling.
  - job_name: 'spring-boot-k8s'
    metrics_path: '/actuator/prometheus'
    kubernetes_sd_configs:
      - role: pod
    basic_auth:
      username: 'prometheus'
      password_file: '/etc/prometheus/secrets/prometheus_password'
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape: 'true' annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Combine the pod IP with the prometheus.io/port annotation as the scrape address
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: '$1:$2'
        target_label: __address__
      # Add pod name and namespace as labels for Grafana filtering
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
# The pod template in your Deployment must carry the matching annotations:
#
# metadata:
#   annotations:
#     prometheus.io/scrape: 'true'
#     prometheus.io/port: '8080'
#     prometheus.io/path: '/actuator/prometheus'
#
# Prometheus will automatically discover and scrape this pod.
# When the pod is replaced (rolling deploy), Prometheus picks up the new IP.
# No prometheus.yml restart required.
Kubernetes Probes — Liveness, Readiness, and Startup
If you are deploying Spring Boot to Kubernetes, Actuator's health groups become the backbone of your pod lifecycle management. Misconfiguring these probes is the single most common cause of cascading restarts in Spring Boot Kubernetes deployments — and it is entirely preventable.
Spring Boot 2.3 introduced health groups — separate health endpoints for different probe types. This is critical because liveness and readiness must check different things:
- Liveness ('Is the app alive?'): Checks only internal JVM state. If this fails, Kubernetes restarts the pod. Keep it lightweight — no database calls, no external API calls. A temporary DB blip should never restart your pods.
- Readiness ('Can the app accept traffic?'): Checks dependencies — database reachable, cache warm, broker connected. If this fails, Kubernetes removes the pod from the load balancer but does not restart it. This is the correct behavior for a transient dependency failure.
- Startup ('Has the app finished booting?'): For slow-starting apps. Kubernetes waits for this to pass before running liveness and readiness probes. The math matters: failureThreshold × periodSeconds = maximum startup window. A Spring Boot app with 60 seconds of startup time needs failureThreshold: 12 with periodSeconds: 10 for a 120-second safety window — always give at least 2x your measured startup time.
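The probe arithmetic in that last bullet can be sketched as a quick sanity check. The class and method names here are hypothetical helpers for illustration, not a Kubernetes or Spring API:

```java
// Illustrative helper for the startup-probe math; names are made up for this example.
public class ProbeWindow {

    // Maximum time Kubernetes will wait before the startup probe gives up
    static int windowSeconds(int failureThreshold, int periodSeconds) {
        return failureThreshold * periodSeconds;
    }

    // Rule of thumb from the text: window must be at least 2x measured startup time
    static boolean isSafe(int measuredStartupSeconds, int failureThreshold, int periodSeconds) {
        return windowSeconds(failureThreshold, periodSeconds) >= 2 * measuredStartupSeconds;
    }

    public static void main(String[] args) {
        // 60s measured startup: failureThreshold 12 x periodSeconds 10 = 120s window
        System.out.println(windowSeconds(12, 10)); // 120
        System.out.println(isSafe(60, 12, 10));    // true
        // Liveness-style settings (3 x 10s = 30s) would kill a 60s-startup pod
        System.out.println(isSafe(60, 3, 10));     // false
    }
}
```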
I once debugged a 20-replica production deployment restarting every 90 seconds. The liveness probe was checking database connectivity. A 30-second DB connection spike triggered liveness failures across all 20 pods simultaneously. They all restarted, reconnected at once, overwhelmed the DB, the DB connection time spiked again, and the cycle repeated. The fix was one line in the Kubernetes deployment YAML: change the liveness probe path from /actuator/health to /actuator/health/liveness.
# io.thecodeforge: Kubernetes Deployment with Actuator Probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
      annotations:
        # Enable Prometheus auto-discovery (works with kubernetes_sd_configs above)
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/actuator/prometheus'
    spec:
      containers:
        - name: order-service
          image: io.thecodeforge/order-service:latest
          ports:
            - containerPort: 8080
          # STARTUP PROBE: Prevents liveness from killing a slow-starting app.
          # Math: failureThreshold (30) × periodSeconds (10) = 300 seconds max startup.
          # Measure your actual startup time and set this to at least 2× that value.
          # If your app starts in 45 seconds, use failureThreshold: 12, periodSeconds: 10
          # for a 120-second window. Never guess — measure.
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            failureThreshold: 30  # 30 × 10s = 300 seconds max startup window
            periodSeconds: 10
            timeoutSeconds: 3
          # LIVENESS PROBE: JVM-only — no external dependency checks.
          # If this fails, Kubernetes RESTARTS the pod.
          # Never check DB or external APIs here — a transient blip restarts all pods.
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 0  # startupProbe handles the startup window
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3  # Restart after 3 consecutive failures (30 seconds)
          # READINESS PROBE: Checks dependencies — DB, cache, broker.
          # If this fails, pod is REMOVED from load balancer but NOT restarted.
          # This is the correct behavior for a transient dependency failure.
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3  # Remove from LB after 15 seconds of failures
The application.yml that backs these probes:

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true              # enables /actuator/health/liveness and /readiness
      show-details: when_authorized
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true
```

Without probes.enabled=true, /actuator/health/liveness and /readiness return 404. This is the most common Kubernetes probe misconfiguration in Spring Boot.
The /actuator/info Endpoint — Deployment Traceability
The /actuator/info endpoint is the most underused feature in the Actuator suite. It lets you expose build information — Git commit hash, build timestamp, artifact version — directly from your running application. When something breaks in production, the first question is always 'what version is running?' Without this endpoint configured, answering that question means digging through CI/CD pipeline logs, which takes minutes you do not have during an active incident.
The setup requires two Maven plugins: the spring-boot-maven-plugin with the build-info goal, and the git-commit-id-plugin to embed Git metadata. Once configured, every build automatically embeds its own DNA into the artifact. Every deployment automatically reports exactly what code is running.
In production setups, every Grafana dashboard should have a deployment panel that queries /actuator/info across instances. If instances report different commit hashes during a rolling deploy, you can see the split-brain state in real time. If a rollback happened silently, it shows up immediately in this panel.
```xml
<!-- io.thecodeforge: Maven plugins for /actuator/info enrichment -->
<!-- Add these inside your <build><plugins> block -->
<build>
  <plugins>
    <!-- Plugin 1: generates build-info.properties at compile time.
         Embeds: artifact name, version, build timestamp.
         Appears under the "build" key in the /actuator/info response. -->
    <plugin>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-maven-plugin</artifactId>
      <executions>
        <execution>
          <goals>
            <goal>build-info</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <!-- Plugin 2: embeds Git metadata at build time.
         Embeds: commit hash, branch, commit time, tags.
         Appears under the "git" key in the /actuator/info response. -->
    <plugin>
      <groupId>pl.project13.maven</groupId>
      <artifactId>git-commit-id-plugin</artifactId>
      <version>4.9.10</version>
      <executions>
        <execution>
          <goals>
            <goal>revision</goal>
          </goals>
        </execution>
      </executions>
      <configuration>
        <generateGitPropertiesFile>true</generateGitPropertiesFile>
        <!-- Prevents build failure when .git is not present
             (shallow clones, CI environments, Docker build layers) -->
        <failOnNoGitDirectory>false</failOnNoGitDirectory>
        <!-- Only embed the abbreviated commit hash, not the full 40-char hash -->
        <abbrevLength>7</abbrevLength>
      </configuration>
    </plugin>
  </plugins>
</build>
```
A typical /actuator/info response with both plugins configured:

```json
{
  "build": {
    "artifact": "order-service",
    "name": "Order Service",
    "version": "2.4.1",
    "time": "2026-04-18T14:22:00Z"
  },
  "git": {
    "branch": "main",
    "commit": {
      "id": "a1b2c3d",
      "time": "2026-04-18T14:18:32Z"
    }
  }
}
```
```bash
# CI/CD post-deploy verification step
DEPLOYED_COMMIT=$(git rev-parse --short HEAD)
RUNNING_COMMIT=$(curl -s http://service:8080/actuator/info | jq -r '.git.commit.id')
if [ "$DEPLOYED_COMMIT" != "$RUNNING_COMMIT" ]; then
  echo "ERROR: Container is running stale code. Expected $DEPLOYED_COMMIT, got $RUNNING_COMMIT"
  exit 1
fi
```
Dynamic Log Level Management — Debug Without Redeploying
One of Actuator's most powerful day-two features: changing log levels at runtime without restarting the application. Need to enable DEBUG logging for a specific package to troubleshoot a production issue? Hit /actuator/loggers, change the level, reproduce the issue, read the logs, then change it back. No restart. No downtime. No redeploy cycle.
This converts 'we need to add more logging and redeploy' — a 30-minute process minimum in most CI/CD pipelines — into a 10-second API call. During incident response, this is the difference between a 10-minute resolution and a 45-minute resolution while customers are actively impacted.
The endpoint supports GET to read current levels and POST to change them. Spring Security should restrict POST to ADMIN role — TRACE logging in production can generate gigabytes of log data per minute, fill disk, and cascade into other failures. Always reset the log level after you have captured what you need. Build that reset command into your debugging runbook as a mandatory step.
For Hibernate SQL debugging specifically, enable TRACE on org.hibernate.SQL to see the exact SQL being generated and org.hibernate.type.descriptor.sql to see the actual bind parameter values. This combination has diagnosed more mysterious data bugs than any other technique I know.
```bash
#!/bin/bash
# io.thecodeforge: Dynamic Log Level Management
# Run these during incident response — no restart, no redeploy required.

# --- Read current log level for a package ---
curl -s -u admin:${ACTUATOR_PASSWORD} \
  http://localhost:8080/actuator/loggers/io.thecodeforge.order | jq .
# Response: { "configuredLevel": "INFO", "effectiveLevel": "INFO" }

# --- Enable DEBUG for the order package (troubleshooting) ---
curl -s -u admin:${ACTUATOR_PASSWORD} \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "DEBUG"}' \
  http://localhost:8080/actuator/loggers/io.thecodeforge.order

# --- Enable TRACE for Hibernate SQL (see exact SQL + bind parameters) ---
# WARNING: this generates enormous log volume. Reset immediately after capturing the query.
curl -s -u admin:${ACTUATOR_PASSWORD} \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "TRACE"}' \
  http://localhost:8080/actuator/loggers/org.hibernate.SQL

curl -s -u admin:${ACTUATOR_PASSWORD} \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": "TRACE"}' \
  http://localhost:8080/actuator/loggers/org.hibernate.type.descriptor.sql

# --- ALWAYS reset after capturing logs (mandatory step in your runbook) ---
curl -s -u admin:${ACTUATOR_PASSWORD} \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"configuredLevel": null}' \
  http://localhost:8080/actuator/loggers/io.thecodeforge.order
# null = inherit from the parent logger (returns to default).
# Setting to null is different from setting INFO explicitly —
# null respects future configuration changes, INFO overrides them.
```
Log output once the DEBUG change takes effect:

```
2026-04-18 14:30:12 DEBUG io.thecodeforge.order.OrderService - Processing order #4521
2026-04-18 14:30:12 DEBUG io.thecodeforge.order.OrderService - Payment method: CREDIT_CARD
2026-04-18 14:30:13 DEBUG io.thecodeforge.order.OrderService - Inventory reserved: 3 items
2026-04-18 14:30:13 ERROR io.thecodeforge.order.OrderService - Payment declined: insufficient funds for order #4521
2026-04-18 14:30:13 DEBUG io.thecodeforge.order.OrderService - Rollback initiated for order #4521

# With org.hibernate.SQL at TRACE:
2026-04-18 14:30:12 TRACE org.hibernate.SQL - select * from orders where user_id=? and status=?
2026-04-18 14:30:12 TRACE org.hibernate.type.descriptor.sql - binding parameter [1] as [BIGINT] - [10045]
2026-04-18 14:30:12 TRACE org.hibernate.type.descriptor.sql - binding parameter [2] as [VARCHAR] - [PENDING]

# After resetting to null: DEBUG logs disappear immediately, zero application restart.
```
Micrometer Integration and Docker Deployment
To make this production-ready, you containerize the application with a focus on how Docker handles the health signal from Actuator and how the JVM is tuned for container environments.
The Docker HEALTHCHECK instruction tells the Docker daemon whether the container is healthy. By pointing it at /actuator/health/liveness, you get the same lightweight JVM-only health logic that Kubernetes uses at the liveness probe level. This matters most for Docker Compose deployments, which do not have Kubernetes probe support.
Two things to watch for in every containerized deployment. First, the HEALTHCHECK command needs wget or curl to be present in the base image. Distroless images include neither — you will need to switch to a slim base image or remove the Docker HEALTHCHECK and rely exclusively on Kubernetes probes. Second, the JVM memory flags: without -XX:+UseContainerSupport, older JDK versions read the host machine's memory rather than the container's memory limits and allocate too large a heap, causing OOM kills. JDK 17 and later handle this automatically, but the flag is harmless and makes the intent explicit.
```dockerfile
# io.thecodeforge: Multi-stage Dockerfile with Actuator Health Integration

# Stage 1: Build
FROM eclipse-temurin:17-jdk-alpine AS build
WORKDIR /app
COPY . .
RUN ./mvnw clean package -DskipTests

# Stage 2: Runtime
# eclipse-temurin:17-jre-alpine includes wget — required for HEALTHCHECK.
# Do NOT use distroless if you need Docker HEALTHCHECK.
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar

# Security: run as non-root user
RUN addgroup -S forgegroup && adduser -S forgeuser -G forgegroup
USER forgeuser

# Docker HEALTHCHECK: uses the liveness endpoint for a lightweight JVM-only check.
#   interval: how often Docker checks (30s is conservative — adjust for your SLA)
#   timeout:  how long Docker waits for a response before marking the check failed
#   retries:  consecutive failures before marking the container UNHEALTHY
# Uses /actuator/health/liveness — NOT /actuator/health.
# The full health check would hit the DB on every HEALTHCHECK — unnecessary for the Docker daemon.
HEALTHCHECK \
  --interval=30s \
  --timeout=3s \
  --retries=3 \
  CMD wget --quiet --tries=1 --spider http://localhost:8080/actuator/health/liveness || exit 1

EXPOSE 8080

# JVM flags for container environments:
#   -XX:+UseContainerSupport : read container memory limits, not host memory
#   -XX:MaxRAMPercentage=75.0: allocate 75% of container memory to the heap,
#                              leaving 25% for Metaspace, thread stacks, GC overhead, and the OS
ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-XX:MaxRAMPercentage=75.0", "-jar", "app.jar"]
```
```bash
docker build -t io.thecodeforge/order-service:latest .
docker run -p 8080:8080 io.thecodeforge/order-service:latest
docker ps   # (healthy) appears after the first 30s interval

# CONTAINER ID   IMAGE                           STATUS
# a1b2c3d4e5f6   io.thecodeforge/order-service   Up 2 minutes (healthy)

# Verify the health endpoint from inside the container:
docker exec a1b2c3d4e5f6 wget -q -O - http://localhost:8080/actuator/health/liveness
# {"status":"UP"}
```
| Monitoring Aspect | Legacy Approach (Manual) | Modern Approach (Actuator) |
|---|---|---|
| Health Checks | Custom /status endpoints with inconsistent JSON structure. Each developer implements their own format. You get 'OK' or nothing. No component-level detail, no Kubernetes probe integration. | Standardized /health endpoint with nested component status, auto-aggregation (any DOWN component makes overall status DOWN), and health groups (liveness, readiness) that map directly to Kubernetes probe types. |
| Metrics Gathering | Log parsing, manual JMX MBean registration, or custom counters in Redis. Fragile, hard to query, impossible to aggregate across instances, and requires per-service implementation. | Micrometer integration with dimensional metrics. Auto-instrumentation of JVM, HTTP requests, and HikariCP. One dependency addition exports everything to Prometheus, Datadog, InfluxDB, or New Relic. |
| Runtime Management | Requires application restart to change log levels. 'Add a log statement and redeploy' is a 30-minute process with a deployment window approval. Debugging in production is a full release cycle. | Dynamic log level updates via /actuator/loggers — change any package's log level in 10 seconds without restarting. View environment variables and system properties via /actuator/env (secured). |
| Security | Ad-hoc security filters, often left unprotected or protected inconsistently. Actuator endpoints exposed with default Spring Security or no security at all — common source of credential exfiltration. | Integrated with Spring Security. Fine-grained access control per endpoint using @Order SecurityFilterChain. Role-based restrictions. HTTP Basic for Prometheus scraper. CSRF disabled for stateless endpoints. |
| Deployment Traceability | Check CI/CD logs, SSH into the server, run git log. No programmatic way to verify which code version is running. Ghost deployments go undetected for hours during incidents. | /actuator/info exposes Git commit hash, build timestamp, and artifact version from the running process. CI/CD post-deploy verification catches ghost deployments in 2 seconds. |
| Operational Control | Any operational action (cache clear, config reload, feature toggle) requires a code change, pull request, code review, build, and deploy. A 30-minute minimum for a surgical change. | Custom @Endpoint with @WriteOperation provides surgical operational control without a deploy. Cache invalidation, circuit breaker toggles, and config reloads become API calls restricted to ADMIN role. |
🎯 Key Takeaways
- Actuator is the bridge between application code and operational visibility — it is non-negotiable for any Spring Boot service running in production.
- Use readiness probes to control load balancer traffic and liveness probes to signal Kubernetes when a pod needs a restart. Never put external dependency checks in liveness probes — the cascading restart pattern it creates has taken down production systems at companies with mature engineering teams.
- Micrometer is the metrics engine. Use Counters for total counts, Gauges for current values, and Timers for latency percentiles. Mean latency is a liar — always instrument and alert on p95 or p99.
- The /actuator/info endpoint is the fastest way to answer 'what version is running?' during an incident. Configure it once with git-commit-id-plugin and build-info goal. Add post-deploy verification to your CI/CD pipeline that compares the running commit hash against what was just deployed.
- Dynamic log level management via /actuator/loggers turns a 30-minute debugging deployment cycle into a 10-second API call. Build the reset command into your runbook as a mandatory step — TRACE logging left running fills disks.
- Always secure Actuator endpoints with a dedicated SecurityFilterChain using @Order(1). Whitelist only what you need. env, heapdump, and threaddump are the three most dangerous — restrict to ADMIN role and consider never exposing them at all.
- Add management.metrics.tags.application=${spring.application.name} to every service's application.yml — it is the one line most teams forget. Without it, metrics from different services are indistinguishable in Prometheus.
- Never create high-cardinality metrics with dynamic tag values like userId or requestId. Low-cardinality dimensions only: service name, region, HTTP method, status code, error type.
- The observability stack is Actuator (sensors) + Prometheus (recorder) + Grafana (visualizer) + Alertmanager (alerter). All four work together — Actuator alone gives you endpoints, not observability.
Interview Questions on This Topic
- Q: What is the difference between 'Liveness' and 'Readiness' probes in the context of Spring Boot 2.3+ Actuator? What should each probe check, and what happens when each one fails in Kubernetes? (Mid-level)
- Q: How would you implement a custom metric to track the number of successful vs failed login attempts using Micrometer's MeterRegistry? What meter type would you use and why? (Mid-level)
- Q: In a high-security environment, how do you restrict access to the /actuator/prometheus endpoint to only the Prometheus scraper's service account? (Senior)
- Q: Explain the 'Thundering Herd' problem that can occur if monitoring systems scrape /actuator/health simultaneously. How would you mitigate it? (Senior)
- Q: How does the @WriteOperation annotation work in a custom @Endpoint, and what are the safety implications compared to a @ReadOperation? When would you create a custom endpoint? (Senior)
- Q: What is the difference between a Counter, Gauge, and Timer in Micrometer? Give a real-world example of when you would use each one in a payment processing service. (Mid-level)
- Q: How would you use the /actuator/loggers endpoint to troubleshoot a production issue without restarting the application? Walk through the exact steps. (Mid-level)
- Q: Explain how the /actuator/info endpoint can be enriched with Git commit information. What Maven plugins are required, and how would you use this in a CI/CD verification step? (Mid-level)
- Q: What is the risk of setting management.endpoints.web.exposure.include=* in production? What specific endpoints are the most dangerous and why? (Mid-level)
Frequently Asked Questions
What is the difference between /actuator/health, /actuator/health/liveness, and /actuator/health/readiness?
The base /actuator/health endpoint returns the aggregated status of all health indicators — database connectivity, disk space, external APIs, message brokers, and any custom indicators. It is the full-picture health check.
Spring Boot 2.3 introduced health groups: /actuator/health/liveness checks only whether the JVM is responsive with no external dependency checks, and /actuator/health/readiness checks whether the application can serve traffic and includes dependency checks.
In Kubernetes, use liveness for the liveness probe — if it fails, Kubernetes restarts the pod. Use readiness for the readiness probe — if it fails, the pod is removed from the load balancer but not restarted. Never use the full /actuator/health for liveness probes — a temporary database blip would restart every pod simultaneously.
Requires management.endpoint.health.probes.enabled=true in application.yml — without this, the liveness and readiness paths return 404.
How do I create a custom Actuator endpoint beyond health checks?
Annotate a Spring component with @Endpoint(id = "yourEndpoint") and annotate its methods with @ReadOperation (GET), @WriteOperation (POST), or @DeleteOperation (DELETE). Spring Boot automatically exposes the endpoint at /actuator/yourEndpoint.
For web-only endpoints with no JMX exposure, use @WebEndpoint instead of @Endpoint.
Custom endpoints inherit the same security model as built-in ones — they appear in your exposure.include list and respect the SecurityFilterChain you define. Secure @WriteOperation methods with ADMIN role since they change production state. Common use cases: deployment metadata richer than /actuator/info, runtime feature flag management, cache invalidation triggers, and circuit breaker state display and control.
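As a sketch of that shape — the endpoint id forgecache, the statistics it returns, and the evict operation are invented for illustration, not a real API of this site's codebase:

```java
import java.util.Map;

import org.springframework.boot.actuate.endpoint.annotation.Endpoint;
import org.springframework.boot.actuate.endpoint.annotation.ReadOperation;
import org.springframework.boot.actuate.endpoint.annotation.Selector;
import org.springframework.boot.actuate.endpoint.annotation.WriteOperation;
import org.springframework.stereotype.Component;

// Hypothetical custom endpoint, exposed at /actuator/forgecache once
// "forgecache" is added to management.endpoints.web.exposure.include.
@Component
@Endpoint(id = "forgecache")
public class CacheEndpoint {

    // GET /actuator/forgecache — read-only, safe
    @ReadOperation
    public Map<String, Object> cacheStats() {
        return Map.of("entries", 1024, "hitRate", 0.93); // placeholder values
    }

    // POST /actuator/forgecache/{name} — changes state, restrict to ADMIN role
    @WriteOperation
    public Map<String, String> evict(@Selector String name) {
        // cache invalidation for the named cache would go here
        return Map.of("evicted", name);
    }
}
```

The @Selector parameter maps the trailing path segment, so an operator can evict a single cache by name with one POST.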
How does Prometheus scrape Spring Boot Actuator metrics?
Prometheus pulls metrics from your application's /actuator/prometheus endpoint at a configured interval (default 15 seconds). On the Spring Boot side: add micrometer-registry-prometheus dependency and include prometheus in management.endpoints.web.exposure.include — this auto-configures the endpoint. On the Prometheus side: add a scrape_config block with metrics_path: '/actuator/prometheus' and your application's host and port.
For Kubernetes, use kubernetes_sd_configs with pod annotations instead of static targets. Pods annotated with prometheus.io/scrape: 'true' are auto-discovered. Each scrape returns all current metric values in Prometheus exposition text format. Prometheus stores these as time series in its TSDB.
Common issue: /actuator/prometheus returns 404 despite being in exposure.include — check that micrometer-registry-prometheus is in your pom.xml. It is a separate dependency from spring-boot-starter-actuator.
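For the Prometheus side, a minimal static-target job might look like this sketch — the job name and target host are placeholders for your environment:

```yaml
# prometheus.yml — minimal static scrape job (hostnames are placeholders)
scrape_configs:
  - job_name: 'order-service'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
    static_configs:
      - targets: ['order-service:8080']
```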
Can I change log levels in a running Spring Boot application without restarting?
Yes. POST to /actuator/loggers/{package.name} with a JSON body of {"configuredLevel": "DEBUG"}. The change takes effect immediately — no restart, no redeploy. To reset to default, POST with {"configuredLevel": null}. Null means inherit from the parent logger — this is different from explicitly setting INFO, which overrides any future configuration changes.
For Hibernate SQL debugging: enable TRACE on org.hibernate.SQL for query text and org.hibernate.type.descriptor.sql for bind parameter values.
Always secure POST with Spring Security restricting it to ADMIN role — TRACE logging generates gigabytes of output per minute and can fill disk quickly. Build the reset command into your incident runbook as a mandatory step, not an optional one.
What information does the /actuator/info endpoint show, and how do I populate it?
By default /actuator/info returns an empty JSON object — it must be explicitly populated. Two sources of data:
Build metadata: add the spring-boot-maven-plugin's build-info execution goal. This generates build-info.properties at compile time containing artifact name, version, and build timestamp. Appears under the build key in the response.
Git metadata: add git-commit-id-plugin to your Maven build. This embeds the commit hash (abbreviated), branch name, and commit time into git.properties at build time. Appears under the git key.
You can also add custom info via application.properties: info.app.description=Order processing service. These appear under the app key.
Use this in CI/CD post-deploy verification: compare git.commit.id from the running instance against the commit hash your pipeline just built to catch ghost deployments.
How do I secure Actuator endpoints in production?
Create a dedicated SecurityFilterChain with @Order(1) that matches /actuator/**. The @Order(1) ensures this chain has higher priority than your main application chain.
Permit /actuator/health/** and /actuator/info — needed by load balancers and Kubernetes probes without authentication. Restrict /actuator/prometheus to MONITORING role with HTTP Basic authentication. Restrict /actuator/loggers, /actuator/env, /actuator/heapdump, and /actuator/threaddump to ADMIN role. Deny everything else.
Also set management.endpoints.web.exposure.include to a whitelist — health, info, prometheus, loggers — and never use wildcard. Disable CSRF for the actuator security chain since these endpoints are called by automated systems, not browsers. Set management.endpoint.health.show-details to when_authorized or never in application.yml.
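Put together, the dedicated chain might look like this sketch for Spring Security 6 — role names follow the article; adapt the matched paths to your own whitelist:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.annotation.Order;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class ActuatorSecurityConfig {

    @Bean
    @Order(1) // evaluated before the main application chain
    public SecurityFilterChain actuatorChain(HttpSecurity http) throws Exception {
        http
            .securityMatcher("/actuator/**")
            .authorizeHttpRequests(auth -> auth
                // Probes and info: needed unauthenticated by LBs and Kubernetes
                .requestMatchers("/actuator/health/**", "/actuator/info").permitAll()
                // Prometheus scraper: HTTP Basic with a dedicated role
                .requestMatchers("/actuator/prometheus").hasRole("MONITORING")
                // Dangerous endpoints: ADMIN only
                .requestMatchers("/actuator/loggers/**", "/actuator/env/**",
                                 "/actuator/heapdump", "/actuator/threaddump").hasRole("ADMIN")
                // Whitelist, then deny everything else
                .anyRequest().denyAll())
            .httpBasic(Customizer.withDefaults())
            // Stateless, machine-called endpoints: CSRF off for this chain only
            .csrf(csrf -> csrf.disable());
        return http.build();
    }
}
```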
What is the difference between a Counter, Gauge, and Timer in Micrometer?
A Counter tracks a monotonically increasing value — it only goes up. Use it for total requests, total errors, and orders placed. Query with rate(counter_total[5m]) in Prometheus to get events per second. A Counter is wrong for any value that can decrease.
A Gauge tracks a value that goes up and down. Use it for current queue depth, active connections, and heap usage. You report the current value on demand and Micrometer samples it on each Prometheus scrape. A Gauge is wrong for cumulative counts.
A Timer measures duration and count simultaneously. Use it for request latency, database query time, and payment processing duration. It automatically calculates count, total time, and percentiles. Add publishPercentiles(0.5, 0.95, 0.99) to expose p50, p95, and p99 as Prometheus labels for alerting.
Choose the wrong type and you miss the signal: a Gauge for total orders loses the rate information, a Counter for connection pool size loses the current value, a Timer without percentiles hides the tail latency where the real problems live.
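The tail-hiding effect is easy to demonstrate with plain JDK code, no Micrometer required; the latency numbers below are invented:

```java
import java.util.Arrays;

public class TailLatency {
    // Value at percentile p (0 < p <= 100) of an already-sorted sample,
    // using the nearest-rank method.
    static long percentile(long[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        // 100 requests: 95 take 20 ms, 5 take 2000 ms
        long[] latencies = new long[100];
        Arrays.fill(latencies, 0, 95, 20L);
        Arrays.fill(latencies, 95, 100, 2000L);
        Arrays.sort(latencies);

        double mean = Arrays.stream(latencies).average().orElse(0);
        System.out.println("mean = " + mean + " ms");                      // 119.0: looks healthy
        System.out.println("p95  = " + percentile(latencies, 95) + " ms"); // 20: the happy path
        System.out.println("p99  = " + percentile(latencies, 99) + " ms"); // 2000: the real problem
    }
}
```

One in twenty users waits two full seconds, yet the mean barely moves. That is why the alert belongs on p95 or p99, not the average.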
How do I handle Actuator in a Kubernetes environment with multiple replicas?
Configure all three probe types: startupProbe using /actuator/health/liveness with failureThreshold × periodSeconds set to at least 2× your measured startup time. livenessProbe using /actuator/health/liveness with no external dependency checks — JVM state only. readinessProbe using /actuator/health/readiness with dependency checks.
Enable health groups in application.yml with management.endpoint.health.probes.enabled=true — without this, the liveness and readiness paths return 404.
For metrics scraping, use Prometheus kubernetes_sd_configs with pod annotations rather than static targets. Pods get new IPs on every restart — static targets break on every rolling deploy. Add prometheus.io/scrape: 'true' and prometheus.io/port: '8080' annotations to your pod spec.
Set management.metrics.tags.application=${spring.application.name} so metrics from different service replicas are identifiable in Prometheus and can be filtered or aggregated by service name in Grafana dashboards.
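The corresponding application.yml fragment is a single line of substance:

```yaml
management:
  metrics:
    tags:
      application: ${spring.application.name}  # disambiguates this service's metrics in Prometheus
```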