Spring Boot Actuator exposes operational endpoints (/health, /metrics, /prometheus) that turn your running JVM from a black box into an observable system
Health groups (liveness, readiness, startup) map directly to Kubernetes probe types — never put external dependency checks in liveness probes
Micrometer is the metrics engine: Counters for totals, Gauges for current values, Timers for latency percentiles (p50/p95/p99)
Prometheus scrapes /actuator/prometheus every 15s — using /actuator/health for scraping creates a thundering herd at scale
Dynamic log level changes via /actuator/loggers turn a 30-minute 'add logging and redeploy' cycle into a 10-second API call
management.metrics.tags.application=${spring.application.name} is the one line most teams forget — without it, metrics from different services collide in Prometheus
The full stack is Actuator (sensors) + Prometheus (recorder) + Grafana (visualizer) + Alertmanager (alerter) — Actuator alone gives you endpoints, not observability
What is Spring Boot Actuator and Monitoring?
Spring Boot Actuator is a production-ready module that exposes operational information about your running application via HTTP endpoints or JMX. It solves the fundamental problem of black-box applications in production: you need to know what's happening inside without attaching a debugger or sifting through logs.
Actuator gives you live insight into health, metrics, environment properties, thread dumps, and more — essentially a built-in observability layer that would otherwise require third-party agents or custom code. It's part of the Spring Boot ecosystem and integrates natively with Micrometer for metrics, meaning you can ship data to Prometheus, Datadog, or Graphite with zero additional instrumentation in most cases.
Actuator is not a monitoring system itself; it is the data source. You pair it with external tools such as Prometheus and Grafana for dashboards, or with Micrometer Tracing (the successor to Spring Cloud Sleuth) for distributed tracing. Do not leave Actuator unsecured: endpoints like /actuator/env or /actuator/beans leak sensitive internals if they are reachable without authorization.
In practice, you expose only health and metrics to your monitoring infrastructure and lock down everything else behind Spring Security or network policies. The real power comes from custom endpoints and Micrometer metrics, for example connection pool usage (HikariCP), thread pool depth, or queue lengths, which is exactly what you need to catch connection pool exhaustion in production.
Plain-English First
Think of Spring Boot Actuator as the black box and dashboard of a modern airplane. Without it, you are flying blind, hoping the engine is running fine. With it, you have a cockpit full of real-time gauges telling you the fuel level (memory usage), engine temperature (CPU load), and whether the landing gear is locked (database connectivity). Monitoring is the air traffic control tower that watches these signals from a distance to ensure you do not crash before you even realize there is a problem.
But Actuator alone is not the full picture. Actuator is the sensor array. Prometheus is the flight recorder that stores the readings over time. Grafana is the dashboard that makes those readings visual and actionable. And Alertmanager is the air traffic controller who radios you at 3 AM when something goes wrong. You need all four working together to get real production observability.
In the world of microservices, 'it works on my machine' is not enough. You need to know if it works in production at 3:00 AM under peak load. Spring Boot Actuator is the industry-standard framework that transforms your dark application into an observable system by exposing HTTP and JMX endpoints that reveal the inner state of your running JVM.
I learned this the hard way. In 2021, our team deployed a Spring Boot payment service to production. It ran fine for three weeks. Then one Saturday at 2 AM, latency spiked to 15 seconds per request. We had no metrics, no health checks beyond a basic /ping, and no idea what was wrong. It took four hours to diagnose a connection pool exhaustion issue that a single Micrometer gauge would have caught in 30 seconds. That incident cost us $40,000 in lost transactions and a very uncomfortable Monday morning post-mortem.
After that, we instrumented everything. Every service got Actuator, Micrometer, Prometheus, and Grafana before it ever touched production. We have not had a mystery outage since.
This guide moves beyond basic dependency injection and covers the operational side of Java development. We will explore how to monitor application health, track custom business metrics, expose build traceability info, manage log levels dynamically, secure sensitive endpoints, wire everything into a Prometheus/Grafana stack, and build custom endpoints for operational control — all with the application.yml configuration you can actually copy into a project.
What Spring Boot Actuator Actually Exposes
Spring Boot Actuator is a production-ready library that exposes operational endpoints (health, metrics, info, env, etc.) over HTTP or JMX. Its core mechanic is the autoconfiguration of these endpoints based on the beans and dependencies in your application context. You enable it with a single dependency and a few properties, and it immediately gives you a /actuator/health endpoint that reports the status of your database, disk space, and other critical subsystems.
In practice, Actuator works by registering a set of HealthIndicator beans, one per subsystem it detects. Each indicator runs a lightweight check (e.g., the DataSourceHealthIndicator validates a connection, typically with a query such as SELECT 1). The aggregated result is a JSON response whose status is UP, DOWN, OUT_OF_SERVICE, or UNKNOWN; a status such as DEGRADED is not built in and has to be defined as a custom status. The key property is management.endpoint.health.show-details=when-authorized (or always). Without it, you only get a bare aggregate status, which is useless for debugging. The /actuator/metrics endpoint exposes JVM, thread, and connection pool metrics, which is where connection pool exhaustion becomes visible.
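The aggregation step described above can be sketched in plain Java. This is a simplified model, not Actuator's own classes: the enum and the `aggregate` method are hypothetical names, and the severity order (DOWN, OUT_OF_SERVICE, UP, UNKNOWN) is assumed to match Actuator's default ordering.

```java
import java.util.List;

/** Simplified model of health aggregation; these are not Actuator's classes. */
final class HealthAggregationSketch {

    // Declared worst-first so that a lower ordinal means a more severe status.
    enum Status { DOWN, OUT_OF_SERVICE, UP, UNKNOWN }

    /** Returns the most severe status among the component results. */
    static Status aggregate(List<Status> components) {
        Status worst = Status.UNKNOWN; // no components reported: status is unknown
        for (Status status : components) {
            if (status.ordinal() < worst.ordinal()) {
                worst = status;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        // One failing component (for example "db") drags the overall status down.
        System.out.println(aggregate(List.of(Status.UP, Status.DOWN, Status.UP)));
    }
}
```

The point of the model: a single DOWN component is enough to flip the aggregate, which is why the choice of indicators in each health group matters.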
Use Actuator when you need to monitor application health in a load-balanced or containerized environment — which is every production system. It matters because a health check that doesn't actually verify database connectivity is worse than no check: it gives false confidence. The real value is in wiring custom health indicators for your critical dependencies and exposing metrics to Prometheus or similar. Without that, you're flying blind.
Default health endpoint is a lie
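Out of the box, /actuator/health returns only an aggregate status, and the auto-configured indicators cover only what Spring Boot detects (a DataSource, disk space, and similar). It can keep reporting UP while the connection pool is exhausted or a dependency it does not know about is down, unless you add indicators for those cases and enable show-details so you can see which component failed.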
Production Insight
Teams that rely on the default health checks for their Kubernetes probes run into this failure mode: the pod stays alive, but every request fails because the connection pool is exhausted.
Symptom: /actuator/health returns UP, but /api/orders hangs or returns 503 with "Connection is not available, request timed out after 30000ms".
Rule: Always wire a custom HealthIndicator that checks pool utilization (e.g., HikariPoolMXBean.getActiveConnections() > threshold) and include it in the readiness group rather than liveness. If you use an AbstractRoutingDataSource, set management.health.db.ignore-routing-data-sources=true so the db health indicator skips the routing wrapper.
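The threshold logic behind such an indicator is small enough to sketch on its own. The class and method names below are hypothetical, and the sketch deliberately leaves out the Spring wiring: in a real service the `active` value would come from something like HikariPoolMXBean.getActiveConnections(), and the boolean would decide whether the indicator reports UP or DOWN.

```java
/** Threshold logic for a pool-utilization health check (illustrative names). */
final class PoolSaturationCheck {

    /**
     * @param active    connections currently in use
     * @param maxSize   configured maximum pool size
     * @param threshold fraction of the pool (0.0 to 1.0) above which the check fails
     * @return true when utilization is strictly above the threshold
     */
    static boolean isSaturated(int active, int maxSize, double threshold) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        return (double) active / maxSize > threshold;
    }

    public static void main(String[] args) {
        // 19 of 20 connections busy with a 90% threshold: report not ready.
        System.out.println(isSaturated(19, 20, 0.9));
        // 10 of 20 connections busy: healthy.
        System.out.println(isSaturated(10, 20, 0.9));
    }
}
```

A fractional threshold is used here rather than an absolute count so the same check keeps working when the pool size is tuned.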
Key Takeaway
Actuator is not a monitoring solution — it's a data source for your monitoring solution.
The default health endpoint is a security and reliability trap; always configure show-details and custom indicators.
Connection pool metrics are the single most actionable metric for preventing cascading failures in production.
[Figure: Spring Boot Actuator Observability Flow]
The Three Pillars of Observability
Before diving into Actuator, understand the framework that every production monitoring system is built on. Observability rests on three pillars, and knowing which pillar answers which question is the difference between a 4-hour incident and a 10-minute resolution.
Metrics — Numeric measurements over time. Request rate, error rate, latency percentiles, JVM heap usage, connection pool saturation. These are what Prometheus scrapes and Grafana displays. Metrics answer 'how much' and 'how fast' and 'how often.' They are the first signal that something is wrong.
Logs — Discrete events with context. A log entry says 'order #4521 failed with NullPointerException at PaymentService.java:87 for user abc123.' Logs answer 'what happened' and 'why.' Spring Boot's structured logging in JSON format feeds into ELK, Loki, or Datadog. They answer the question that the metric alert raised.
Traces — A request's journey across services. In a microservice architecture, a single user action might touch 8 services. A trace connects those dots — showing that the 2-second delay happened in the inventory service on the third hop, not in the API gateway. Spring Boot integrates with Micrometer Tracing and OpenTelemetry for distributed tracing.
Spring Boot Actuator primarily addresses the metrics and health pillars. But a production-grade observability stack needs all three working together. The order of implementation matters more than most teams realize.
Start with metrics and health checks (Actuator + Prometheus) — this catches 80% of production issues before users notice
Add structured logging next when you need to debug 'what exactly happened' after an alert fires
Add distributed tracing when you have 3 or more microservices and need to find which service in the chain introduced latency
Trying to implement all three simultaneously leads to implementation paralysis and all three done badly
The order matters: metrics catch problems before logs do, logs explain problems before traces do
Production Insight
Most teams over-engineer observability on day one and under-instrument by day 30. The pattern that consistently works: metrics plus health first — they prevent mystery outages. Add structured logging when you cannot debug from metrics alone. Add tracing when microservice count justifies the overhead. Do not skip the sequence.
Key Takeaway
Observability is a maturity curve, not a checklist. Start with metrics and health checks — they prevent 80% of production mystery outages. Implement the pillars in the order they prevent outages, not the order they look impressive on a resume.
The Anatomy of Actuator: Observability in Action
Spring Boot Actuator exists because manual health checks are a recipe for failure. Instead of writing custom endpoints to check if your database is alive or if your disk space is full, Actuator provides these out of the box. By adding the starter dependency and configuring application.yml, you instantly gain access to the /actuator base path with standardized operational endpoints.
However, most endpoints are hidden by default for security. The real power lies in the /health and /prometheus endpoints. One critical detail that trips up almost every team the first time: the /actuator/prometheus endpoint does not exist unless micrometer-registry-prometheus is on the classpath. The actuator starter and the Prometheus registry are separate dependencies. Adding spring-boot-starter-actuator without micrometer-registry-prometheus gives you health and info but not Prometheus metrics.
One thing worth saying explicitly: Actuator endpoints are not your application REST API. They are operational endpoints meant for internal monitoring infrastructure. Your security model should reflect this — your monitoring stack gets access through specific roles, your customers never touch these endpoints.
io/thecodeforge/monitoring/application.yml (YAML)
# io.thecodeforge: Canonical Actuator Configuration
# This is the base configuration every production Spring Boot service should start with.
# Copy this, adjust the exposure list to match your needs, and ship it.
management:
  endpoints:
    web:
      exposure:
        # Whitelist only what your monitoring stack actually needs.
        # Never use * in production.
        include: health,info,prometheus,loggers
        exclude: heapdump,env,threaddump,shutdown
      base-path: /actuator
  endpoint:
    health:
      # when_authorized: only show component details to authenticated users
      # never: hide component details entirely (use for public-facing services)
      # always: NEVER use in production; exposes component internals to every caller
      show-details: when_authorized
      show-components: when_authorized
      probes:
        # Enables /actuator/health/liveness and /actuator/health/readiness
        # Required for Kubernetes probe integration
        enabled: true
  # Global tag applied to every metric this service emits.
  # Without this, metrics from different services collide in Prometheus.
  # This is the one line most teams forget. Do not skip it.
  metrics:
    tags:
      application: ${spring.application.name}
  prometheus:
    metrics:
      export:
        # Explicitly enable the Prometheus registry.
        # Redundant if micrometer-registry-prometheus is on the classpath,
        # but makes intent explicit and prevents confusion.
        # Spring Boot 3.x key; on 2.x it is management.metrics.export.prometheus.enabled.
        enabled: true
  server:
    # Suppresses the X-Application-Context response header.
    # Legacy property from older Boot releases; drop it if your version no longer recognizes it.
    add-application-context-header: false
server:
  error:
    # Never expose stack traces or exception messages to clients or monitoring scrapers
    include-stacktrace: never
    include-message: never
spring:
  application:
    # This name flows into management.metrics.tags.application above.
    # Set it explicitly; never rely on the default.
    name: order-service
Output
# With this configuration:
# GET /actuator/health → component-level health (authorized users only)
# GET /actuator/health/liveness → JVM-only liveness check (Kubernetes liveness probe)
# GET /actuator/health/readiness → dependency check (Kubernetes readiness probe)
# GET /actuator/prometheus → Prometheus metrics with application=order-service tag
# GET /actuator/loggers → current log levels (read)
# POST /actuator/loggers/{pkg} → change log level dynamically (ADMIN role required)
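The POST /actuator/loggers call listed above can be sketched with the JDK's built-in HTTP client types. This is a minimal sketch under assumptions: the base URL and logger name are placeholders, authentication is omitted (a real call would need ADMIN credentials under the security model used in this guide), and the request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds the request that changes a logger's level at runtime (illustrative names). */
final class LogLevelRequestSketch {

    /** The loggers endpoint expects a JSON body with a "configuredLevel" field. */
    static HttpRequest build(String baseUrl, String loggerName, String level) {
        String body = "{\"configuredLevel\":\"" + level + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + loggerName))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder host and package name; substitute your own.
        HttpRequest request = build("http://localhost:8080", "io.thecodeforge", "DEBUG");
        System.out.println(request.method() + " " + request.uri());
        // To send it: HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding())
    }
}
```

Setting the level back to its inherited value is done the same way with a null configuredLevel; remember to revert DEBUG once the investigation is over.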
Setting management.endpoints.web.exposure.include=* exposes /actuator/env (which returns AWS keys, database passwords, and API tokens in plaintext), /actuator/heapdump (which dumps every object in JVM memory including user sessions and PII), and /actuator/threaddump to anyone who can reach the endpoint. An automated scanner will find this within hours. I have seen a production incident where this exact mistake exposed AWS_SECRET_ACCESS_KEY. The attacker spun up 200 GPU instances for crypto mining. The bill was $12,000 before anyone noticed. Whitelist only what your monitoring stack requires.
Production Insight
A team used /actuator/health as the Prometheus metrics_path instead of /actuator/prometheus. The full health check hit the database, external APIs, and disk on every scrape: 100 instances at 15-second intervals added up to 400 heavy health checks per minute (about 6.7 per second), each one touching several downstream systems. Response times for the health endpoint started spiking, causing Prometheus to mark targets as down, which triggered false alerts at 3 AM. The fix: use /actuator/prometheus for metrics scraping (lightweight counter and gauge reads, sub-millisecond) and reserve /actuator/health for Kubernetes probes exclusively.
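The scrape-load arithmetic is worth checking before choosing an interval. A small sketch, with hypothetical names: each instance is scraped 60 / interval times per minute, so the total is instances multiplied by that figure.

```java
/** Back-of-the-envelope scrape load: how many requests a scrape job generates. */
final class ScrapeLoadSketch {

    /** Total scrape requests per minute across all instances. */
    static long requestsPerMinute(int instances, int scrapeIntervalSeconds) {
        if (scrapeIntervalSeconds <= 0) {
            throw new IllegalArgumentException("scrape interval must be positive");
        }
        return (long) instances * 60 / scrapeIntervalSeconds;
    }

    public static void main(String[] args) {
        // 100 instances scraped every 15 seconds.
        System.out.println(requestsPerMinute(100, 15));
        // The same fleet at a 10-second interval.
        System.out.println(requestsPerMinute(100, 10));
    }
}
```

The request count itself is modest; what makes a health endpoint expensive as a scrape target is that every one of those requests fans out to the database and other dependencies.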
Key Takeaway
Actuator endpoints are operational infrastructure, not application REST APIs. The security model must reflect this. Never set exposure.include=* in production — whitelist only what Prometheus and Kubernetes actually need. The canonical base configuration is in this section — copy it and adjust from there.
Custom Metrics with Micrometer — Beyond Health Checks
Health checks tell you if the system is alive. Metrics tell you how it is performing. Spring Boot uses Micrometer, a dimensional metrics instrumentation facade — think SLF4J but for metrics. It lets you track business-relevant things like order count, payment latency percentiles, and cart abandonment rate rather than just JVM garbage collection statistics.
Micrometer supports several meter types. Choosing the right one matters:
Counter — Monotonically increasing value. Use for total requests, total errors, and orders placed. Never goes down. Query with rate() in Prometheus to get events per second.
Gauge — A value that fluctuates. Use for current queue depth, active connections, and temperature. You report the current value; Micrometer samples it on each scrape.
Timer — Measures duration and rate simultaneously. Use for request latency, database query time, and external API call duration. Gives you percentiles (p50, p95, p99) automatically.
Distribution Summary — Like a Timer but for arbitrary values, not time. Use for payload sizes, batch sizes, and record counts.
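Because a Counter only ever grows, the useful number is its rate of change. The sketch below is a simplified model of what a rate() query computes from two samples; the names are hypothetical, and Prometheus's real implementation also extrapolates across the query window, which is omitted here.

```java
/** Simplified per-second rate between two samples of a monotonically increasing counter. */
final class CounterRateSketch {

    /**
     * @param earlier        counter value at the start of the window
     * @param later          counter value at the end of the window
     * @param elapsedSeconds time between the two samples
     */
    static double perSecondRate(double earlier, double later, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        // A counter that went down was reset (process restart): count from zero.
        double increase = later >= earlier ? later - earlier : later;
        return increase / elapsedSeconds;
    }

    public static void main(String[] args) {
        // 300 new orders over a 5-minute window.
        System.out.println(perSecondRate(100, 400, 300));
        // The service restarted in between: the counter dropped from 900 to 30.
        System.out.println(perSecondRate(900, 30, 60));
    }
}
```

This is also why a Gauge must not be queried with rate(): a gauge going down is a normal reading, not a reset.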
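The @Timed annotation is the fastest way to instrument a controller method. It relies on Micrometer's TimedAspect, which needs AOP support on the classpath: Spring Boot 3.2 and later register the aspect when management.observations.annotations.enabled=true, and on older versions you declare a TimedAspect bean yourself. Once the aspect is active, you annotate the method and Micrometer handles the rest. For custom business logic, inject MeterRegistry directly and build meters explicitly.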
package io.thecodeforge.monitoring;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * io.thecodeforge: Custom business metrics for order processing.
 *
 * Two patterns shown here:
 * 1. @Timed on controller methods: zero-boilerplate method-level timing
 * 2. MeterRegistry injection: explicit control for business counters and gauges
 */
@Service
public class OrderMetricsService {

    private final Counter ordersPlaced;
    private final Counter ordersFailed;
    private final Timer paymentLatency;
    private final AtomicInteger activeCarts = new AtomicInteger(0);

    public OrderMetricsService(MeterRegistry registry) {
        // Counter: total orders placed (monotonically increasing).
        // Query in Prometheus: rate(orders_placed_total[5m]) -> orders per second
        this.ordersPlaced = Counter.builder("orders.placed.total")
                .description("Total number of orders successfully placed")
                .tag("service", "order-service")
                .register(registry);

        // Counter: total orders failed.
        // Alert when rate(orders_failed_total[5m]) / rate(orders_placed_total[5m]) > 0.01
        this.ordersFailed = Counter.builder("orders.failed.total")
                .description("Total number of orders that failed processing")
                .tag("service", "order-service")
                .register(registry);

        // Timer: payment processing latency with percentiles.
        // publishPercentiles exposes p50, p95, p99 as separate quantile label values.
        // Alert on p99 exceeding your SLA threshold; the mean will hide tail latency.
        this.paymentLatency = Timer.builder("payment.processing.duration")
                .description("Time taken to process payment end to end")
                .tag("service", "order-service")
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry);

        // Gauge: active shopping carts; the value goes up and down.
        // The current value is read on every scrape.
        Gauge.builder("carts.active.current", activeCarts, AtomicInteger::get)
                .description("Number of active shopping carts in session")
                .register(registry);
    }

    public void recordSuccessfulOrder(long paymentDurationNanos) {
        ordersPlaced.increment();
        paymentLatency.record(paymentDurationNanos, TimeUnit.NANOSECONDS);
    }

    public void recordFailedOrder() {
        ordersFailed.increment();
    }

    public void cartOpened() { activeCarts.incrementAndGet(); }

    public void cartClosed() { activeCarts.decrementAndGet(); }
}

// --- @Timed annotation pattern: zero-boilerplate controller instrumentation ---
// (separate file: io/thecodeforge/controller/OrderController.java)
package io.thecodeforge.controller;

import io.micrometer.core.annotation.Timed;
import io.thecodeforge.dto.OrderDto;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/orders")
public class OrderController {

    /**
     * @Timed instruments this method with a Timer named 'order.create.duration'.
     * Tracks: request count, total time, and percentiles.
     * Requires an active TimedAspect bean (see the note on @Timed above);
     * without it the annotation is silently ignored.
     * extraTags: adds fixed labels to the metric for filtering in Grafana.
     */
    @Timed(
        value = "order.create.duration",
        description = "Time taken to create an order",
        extraTags = {"endpoint", "/api/orders", "method", "POST"},
        percentiles = {0.5, 0.95, 0.99},
        histogram = true
    )
    @PostMapping
    public OrderDto createOrder(@RequestBody OrderDto dto) {
        // Method execution is timed automatically.
        // The Timer shows up in Prometheus as order_create_duration_seconds.
        return dto;
    }
}
Output
# GET /actuator/prometheus (relevant excerpt)
# HELP orders_placed_total Total number of orders successfully placed
A Counter tagged with userId creates one unique time series per user. With 100,000 users that is 100,000 time series in Prometheus — it will exhaust memory and crash your monitoring stack. The rule: keep tag values to low-cardinality dimensions — service name, region, HTTP method, HTTP status code, error type. Never use request IDs, user IDs, order IDs, or timestamps as tag values. If you need per-user visibility, that belongs in your log aggregation layer, not your metrics layer.
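The cardinality rule follows from simple multiplication: in the worst case a metric produces one time series per combination of tag values. A small sketch with hypothetical names and illustrative tag counts:

```java
import java.util.Map;

/** Worst-case time series count for one metric, given distinct values per tag. */
final class TagCardinalitySketch {

    /** Multiplies the number of distinct values of every tag. */
    static long maxSeries(Map<String, Integer> distinctValuesPerTag) {
        long series = 1;
        for (int distinct : distinctValuesPerTag.values()) {
            series = Math.multiplyExact(series, (long) distinct); // fail loudly on overflow
        }
        return series;
    }

    public static void main(String[] args) {
        // Low-cardinality tags: 4 HTTP methods x 5 status codes.
        System.out.println(maxSeries(Map.of("method", 4, "status", 5)));
        // Add a userId tag with 100,000 users and the count explodes.
        System.out.println(maxSeries(Map.of("method", 4, "status", 5, "userId", 100_000)));
    }
}
```

Running the numbers like this before adding a tag is cheap; removing a high-cardinality tag after it has filled the Prometheus head block is not.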
Production Insight
A team tracked payment latency but only looked at the mean — 45ms looked healthy and no alert fired. Meanwhile p99 was silently climbing to 2 seconds for a subset of requests with large order payloads, causing intermittent timeout complaints that the team kept dismissing as user error. Adding publishPercentiles(0.5, 0.95, 0.99) to the Timer revealed the tail latency the mean was masking. The alert was set on p99 and the issue was diagnosed within one day of the next occurrence. Mean latency is a liar — always instrument and alert on p95 or p99.
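How a mean hides tail latency is easy to see with made-up numbers. The sketch below uses hypothetical sample data (98 fast requests and 2 slow ones) and a nearest-rank percentile; Micrometer computes its published percentiles differently, so treat this only as an illustration of the effect.

```java
import java.util.Arrays;

/** Compares the mean with a nearest-rank percentile on the same latency samples. */
final class TailLatencySketch {

    static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(0.0);
    }

    /** Nearest-rank percentile: the value at position ceil(p * n) in sorted order. */
    static double percentile(double[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Illustrative data: 98 requests at 25 ms, 2 requests at 2000 ms.
        double[] latenciesMs = new double[100];
        Arrays.fill(latenciesMs, 25.0);
        latenciesMs[98] = 2000.0;
        latenciesMs[99] = 2000.0;

        System.out.println("mean = " + mean(latenciesMs) + " ms");
        System.out.println("p99  = " + percentile(latenciesMs, 0.99) + " ms");
    }
}
```

With this data the mean stays in the tens of milliseconds while p99 sits at the full two seconds, which is the gap an alert on the mean never sees.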
Key Takeaway
Counters answer 'how many total,' Gauges answer 'how much right now,' Timers answer 'how long and how spread out.' Pick the wrong meter type and you will miss the actual problem. Use @Timed on controller methods for zero-boilerplate instrumentation. HikariCP metrics are free — just add the dependency. If you can only track five metrics, track: request rate, error rate, latency p99, connection pool saturation, and one business metric.
Custom Actuator Endpoints — Operational Control Beyond Health
The built-in Actuator endpoints cover most operational needs. But there are legitimate cases where you need operational endpoints specific to your application: runtime feature flag status, current circuit breaker state, deployment metadata that goes beyond what /actuator/info provides, or a cache invalidation trigger that ops can call without a full deploy.
Spring Boot makes this straightforward with the @Endpoint annotation. Annotate a Spring component with @Endpoint(id = "yourEndpoint") and its methods with @ReadOperation (HTTP GET), @WriteOperation (HTTP POST), or @DeleteOperation (HTTP DELETE). Once the ID is listed in management.endpoints.web.exposure.include, Spring Boot exposes the endpoint at /actuator/yourEndpoint and applies the same security model as built-in endpoints: whatever your SecurityFilterChain says applies here too.
The design rule: @ReadOperation methods should have no side effects. @WriteOperation methods change state and must be secured with ADMIN role — they are essentially a surgical tool for ops teams, not a general API. I have seen @WriteOperation endpoints used to toggle circuit breakers, clear Redis caches, and reload configuration from external sources — all without a redeploy.
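The read/write split can be sketched as a plain class. To keep the sketch compilable without Spring on the classpath, the Actuator annotations appear only as comments marking where they would go; the class name, the `featureflags` endpoint ID, and the flag names are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// In a Spring Boot application this class would carry:
//   @Component
//   @Endpoint(id = "featureflags")
final class FeatureFlagsEndpointSketch {

    private final Map<String, Boolean> flags = new ConcurrentHashMap<>();

    // @ReadOperation -> GET /actuator/featureflags
    // Returns a sorted snapshot copy, so reading never changes or leaks internal state.
    Map<String, Boolean> flags() {
        return new TreeMap<>(flags);
    }

    // @WriteOperation -> POST /actuator/featureflags/{name}
    // With Actuator, 'name' would be a @Selector path segment and 'enabled' a JSON body field.
    // Changes production state: restrict to ADMIN and log the caller.
    void setFlag(String name, boolean enabled) {
        flags.put(name, enabled);
    }

    public static void main(String[] args) {
        FeatureFlagsEndpointSketch endpoint = new FeatureFlagsEndpointSketch();
        endpoint.setFlag("newCheckout", true);
        System.out.println(endpoint.flags());
    }
}
```

Returning a copy from the read operation is a deliberate choice: a caller holding the result cannot mutate the live flag map, which keeps the no-side-effects rule enforceable.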
@WriteOperation Changes Production State — Secure It
A @WriteOperation endpoint is essentially a surgical tool for production. It can clear caches, toggle features, or reload config — all without a deploy. That power requires strict access control. Always secure @WriteOperation methods with ADMIN role in your SecurityFilterChain. Log every invocation with the caller's identity so you have an audit trail. Use @WebEndpoint instead of @Endpoint if you want the endpoint to be web-only and not exposed via JMX.
Production Insight
Custom endpoints are where Actuator becomes genuinely powerful beyond health checks. I have used @WriteOperation endpoints to toggle circuit breakers manually during dependency degradation events, clear pricing caches after database migrations, and reload feature flag configuration from an external source. All of these would have required a redeploy otherwise. The key discipline: document every custom endpoint in your runbook with the exact curl command ops should run and the expected response — your 3 AM on-call engineer should not be reading source code.
Key Takeaway
@Endpoint with @ReadOperation and @WriteOperation lets you build operational control surfaces specific to your application. @ReadOperation is HTTP GET with no side effects. @WriteOperation is HTTP POST that changes state and must be secured with ADMIN role. Custom endpoints inherit the same security model as built-in ones, but they still have to be listed in management.endpoints.web.exposure.include before they are reachable over HTTP. Document every custom endpoint in your runbook with the exact command to run.
Securing Actuator Endpoints with Spring Security
By default in Spring Boot 2.x and later, Actuator endpoints sit behind Spring Security if it is on the classpath. But behind security does not mean secure. The default configuration often allows all authenticated users to access all endpoints — including ones that dump environment variables, heap contents, and thread states.
The production pattern: restrict Actuator endpoints to a dedicated monitoring role or internal network. Your Prometheus scraper authenticates with a service account. Your developers get read-only access to health and info. Nobody outside the internal network touches env or heapdump. The 15-line SecurityFilterChain in this section has prevented two credential exfiltration incidents on teams I have worked with.
One detail most guides miss: when you add a dedicated SecurityFilterChain for /actuator/**, you need @Order(1) to give it higher priority than your application's main SecurityFilterChain. Without the ordering, Spring applies your main chain first, which may have different rules than what you intend for Actuator.
package io.thecodeforge.monitoring;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.annotation.Order;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

/**
 * io.thecodeforge: Actuator security configuration.
 *
 * @Order(1) gives this chain higher priority than your application's main chain.
 * Without it, your application's SecurityFilterChain may apply different rules
 * to /actuator/** than you intend.
 *
 * Role mapping:
 *   No role    -> health, info (load balancers and Kubernetes probes)
 *   MONITORING -> prometheus (Prometheus scraper service account)
 *   ADMIN      -> loggers, env, heapdump, threaddump (ops team only)
 *   DENY       -> shutdown, everything else not whitelisted
 */
@Configuration
public class ActuatorSecurityConfig {

    @Bean
    @Order(1)
    public SecurityFilterChain actuatorSecurityFilterChain(HttpSecurity http) throws Exception {
        http
            .securityMatcher("/actuator/**")
            .authorizeHttpRequests(auth -> auth
                // Public: health checks for load balancers and Kubernetes probes.
                // These must be accessible without authentication.
                .requestMatchers("/actuator/health/**").permitAll()
                .requestMatchers("/actuator/info").permitAll()
                // MONITORING role: Prometheus scraper authenticates via HTTP Basic
                .requestMatchers("/actuator/prometheus").hasRole("MONITORING")
                // ADMIN role: endpoints that expose or modify sensitive state.
                // Log level changes can generate gigabytes/min if abused.
                .requestMatchers(
                    "/actuator/loggers/**",
                    "/actuator/env",
                    "/actuator/heapdump",
                    "/actuator/threaddump"
                ).hasRole("ADMIN")
                // Deny everything not explicitly permitted above.
                // This includes /actuator/shutdown; never expose it.
                .anyRequest().denyAll()
            )
            // HTTP Basic: Prometheus natively supports basic_auth in scrape config
            .httpBasic(basic -> {})
            // Disable CSRF: Actuator endpoints are called by automated systems, not browsers
            .csrf(csrf -> csrf.disable());
        return http.build();
    }
}
Output
# application.yml complement to the SecurityFilterChain above
# Role definitions for the monitoring service account
spring:
  security:
    user:
      name: admin
      password: ${ACTUATOR_ADMIN_PASSWORD}  # inject via environment variable, never hardcode
      roles: ADMIN
# In Kubernetes, use Spring Security with LDAP or OAuth2 service accounts.
# For simple setups, environment-variable-injected credentials are acceptable
# as long as the password is rotated and not committed to source control.
Never Set show-details to 'always' in Production
When show-details=always, your /actuator/health endpoint returns database versions, JDBC connection pool sizes, external API response times, and disk usage to anyone who hits it — including unauthenticated requests if your SecurityFilterChain permits health/* publicly. An attacker uses this to fingerprint your infrastructure and find known CVEs for your specific database version. Always set it to when_authorized or never. Kubernetes probes only need the status field, not the details.
Production Insight
An automated scanner found /actuator/env exposed on a staging server within 6 hours of deployment. AWS_SECRET_ACCESS_KEY was returned in plaintext in the JSON response. The attacker spun up 200 GPU instances for crypto mining. The AWS bill was $12,000 before anyone noticed. The application had a SecurityFilterChain — but it was missing @Order(1), so the main chain ran first and permitted all authenticated requests including monitoring users who had broad access. A 15-line SecurityFilterChain with correct ordering would have blocked this entirely.
Key Takeaway
If your security audit has not flagged your Actuator endpoints, you are not looking hard enough. Use @Order(1) on the Actuator SecurityFilterChain to ensure it takes priority. Never expose env, heapdump, or threaddump publicly. Restrict them to ADMIN role. Never set show-details=always in production.
Prometheus and Grafana Integration — Full Stack Setup
Actuator exposes the /actuator/prometheus endpoint in Prometheus exposition format — a text-based, human-readable format that Prometheus understands natively. But Prometheus needs to be told where to scrape. This is where most tutorials end and most teams get stuck.
On the Spring Boot side: add micrometer-registry-prometheus to your dependencies and management.endpoints.web.exposure.include=prometheus to your application.yml. That is it. The endpoint auto-configures.
On the Prometheus side: add a scrape_config block pointing at your application. The two details most engineers get wrong: using /actuator/health as the metrics_path instead of /actuator/prometheus (heavy versus lightweight), and using static_configs in Kubernetes where pods get new IPs on every restart.
On the Grafana side: import dashboard ID 4701 (JVM Micrometer) from grafana.com/dashboards for a production-ready Spring Boot dashboard. It covers heap usage, GC pause time, HTTP request rates, and database connection pool saturation out of the box — no PromQL required to get started.
io/thecodeforge/monitoring/prometheus.yml (YAML)
# io.thecodeforge: Prometheus Scrape Configuration
# This file tells Prometheus where to find your Spring Boot metrics.
global:
  scrape_interval: 15s      # How often Prometheus scrapes all targets
  evaluation_interval: 15s  # How often alerting rules are evaluated
scrape_configs:
  # Static targets: works for a fixed number of servers or local development.
  # In production Kubernetes, replace this with kubernetes_sd_configs below.
  - job_name: 'spring-boot-order-service'
    # IMPORTANT: Use /actuator/prometheus, NOT /actuator/health.
    # /actuator/prometheus: lightweight metric reads, sub-millisecond.
    # /actuator/health: hits the DB, external APIs, disk; thundering herd at scale.
    metrics_path: '/actuator/prometheus'
    scrape_interval: 10s  # Override global for this job
    basic_auth:
      username: 'prometheus'
      # Use password_file in production; never inline passwords in prometheus.yml
      password_file: '/etc/prometheus/secrets/prometheus_password'
    static_configs:
      - targets: ['order-service-01:8080', 'order-service-02:8080']
        labels:
          environment: 'production'
          team: 'platform'
  # Kubernetes service discovery: use this instead of static_configs in K8s.
  # Pods annotated with prometheus.io/scrape: 'true' are scraped automatically.
  # No manual target management; works with rolling deploys and autoscaling.
  - job_name: 'spring-boot-k8s'
    metrics_path: '/actuator/prometheus'
    kubernetes_sd_configs:
      - role: pod
    basic_auth:
      username: 'prometheus'
      password_file: '/etc/prometheus/secrets/prometheus_password'
    relabel_configs:
      # Only scrape pods with prometheus.io/scrape: 'true' annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      # Combine the pod address with the prometheus.io/port annotation.
      # Both labels are needed: the port alone is not a valid scrape address.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
      # Add pod name and namespace as labels for Grafana filtering
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
Output
# Add these annotations to your Kubernetes pod spec to enable auto-discovery:
#
# metadata:
#   annotations:
#     prometheus.io/scrape: 'true'
#     prometheus.io/port: '8080'
#     prometheus.io/path: '/actuator/prometheus'
#
# Prometheus will automatically discover and scrape this pod.
# When the pod is replaced (rolling deploy), Prometheus picks up the new IP.
# No prometheus.yml restart required.
Import Grafana Dashboard 4701 Before Building Your Own
Grafana dashboard ID 4701 (JVM Micrometer) is a production-ready Spring Boot dashboard maintained by the Micrometer team. It covers JVM heap, GC pauses, HTTP request rates, error rates, and HikariCP connection pool metrics out of the box. Import it first and use it for at least one sprint before building custom dashboards. You will learn which metrics actually matter during incidents before investing time in custom PromQL.
Production Insight
Scrape intervals compound with instance count in ways most teams do not calculate until it is too late. One hundred instances scraped every 15 seconds is 400 scrapes per minute; against a 500ms health check that is 200 seconds of health check work per minute, and that is before you account for the database calls inside each health check. Use /actuator/prometheus for Prometheus scraping (lightweight reads) and health groups for Kubernetes probes. In Kubernetes, static_configs break on every rolling deploy: pods get new IPs and Prometheus stops scraping the new instance. Use kubernetes_sd_configs with pod annotations from day one.
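The load arithmetic in this insight is easy to rerun for your own fleet. The following is a minimal sketch in plain Java; the class and method names are illustrative and not part of any library, and it assumes every scrape costs the same fixed time.

```java
// Hypothetical helper: estimates how much endpoint work a scrape schedule generates.
final class ScrapeLoad {

    /** Seconds of endpoint work per minute, summed across all instances. */
    static double workSecondsPerMinute(int instances, int scrapeIntervalSeconds, double secondsPerCall) {
        double scrapesPerMinutePerInstance = 60.0 / scrapeIntervalSeconds;
        return instances * scrapesPerMinutePerInstance * secondsPerCall;
    }

    public static void main(String[] args) {
        // 100 instances, 15-second interval, 500 ms per health check call
        System.out.println(workSecondsPerMinute(100, 15, 0.5) + " seconds of work per minute");
    }
}
```

Plugging in a lightweight endpoint cost (a few milliseconds per call) next to a health check cost makes the difference between the two scrape paths concrete.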
Key Takeaway
Prometheus scrapes metrics, not health — using /actuator/health as the scrape path is the thundering herd waiting to happen. In Kubernetes, use service discovery with pod annotations instead of static targets. Import Grafana dashboard 4701 before writing custom PromQL — it covers the metrics that actually matter during incidents.
Kubernetes Probes — Liveness, Readiness, and Startup
If you are deploying Spring Boot to Kubernetes, Actuator's health groups become the backbone of your pod lifecycle management. Misconfiguring these probes is the single most common cause of cascading restarts in Spring Boot Kubernetes deployments — and it is entirely preventable.
Spring Boot 2.3 added dedicated liveness and readiness health groups (general-purpose health groups arrived in 2.2), giving each probe type its own endpoint. This is critical because liveness and readiness must check different things:
Liveness ('Is the app alive?'): Checks only internal JVM state. If this fails, Kubernetes restarts the pod. Keep it lightweight — no database calls, no external API calls. A temporary DB blip should never restart your pods.
Readiness ('Can the app accept traffic?'): Checks dependencies — database reachable, cache warm, broker connected. If this fails, Kubernetes removes the pod from the load balancer but does not restart it. This is the correct behavior for a transient dependency failure.
Startup ('Has the app finished booting?'): For slow-starting apps. Kubernetes waits for this to pass before running liveness and readiness probes. The math matters: failureThreshold × periodSeconds = maximum startup window. A Spring Boot app with 60 seconds of startup time needs failureThreshold: 12 with periodSeconds: 10 for a 120-second safety window — always give at least 2x your measured startup time.
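The startup window arithmetic above can be captured in a few lines. This is a sketch only; the class name and the safetyFactor parameter are assumptions introduced for illustration, and the measured startup time has to come from your own deployment.

```java
// Hypothetical helper: derives a startupProbe failureThreshold from a measured startup time.
final class StartupProbeMath {

    /**
     * Smallest failureThreshold such that failureThreshold * periodSeconds
     * covers measuredStartupSeconds * safetyFactor.
     */
    static int failureThreshold(int measuredStartupSeconds, int periodSeconds, double safetyFactor) {
        double requiredWindowSeconds = measuredStartupSeconds * safetyFactor;
        return (int) Math.ceil(requiredWindowSeconds / periodSeconds);
    }

    public static void main(String[] args) {
        // 60 seconds measured, probing every 10 seconds, 2x safety margin
        int threshold = failureThreshold(60, 10, 2.0);
        System.out.println("failureThreshold: " + threshold + " (window " + threshold * 10 + "s)");
    }
}
```

Rounding up matters: a threshold that lands slightly under the target window defeats the purpose of the safety margin.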
I once debugged a 20-replica production deployment restarting every 90 seconds. The liveness probe was checking database connectivity. A 30-second DB connection spike triggered liveness failures across all 20 pods simultaneously. They all restarted, reconnected at once, overwhelmed the DB, the DB connection time spiked again, and the cycle repeated. The fix was one line in the Kubernetes deployment YAML: change the liveness probe path from /actuator/health to /actuator/health/liveness.
# io.thecodeforge: Kubernetes Deployment with Actuator probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels.
  # The app label value here is illustrative.
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
      annotations:
        # Enable Prometheus auto-discovery (works with kubernetes_sd_configs above)
        prometheus.io/scrape: 'true'
        prometheus.io/port: '8080'
        prometheus.io/path: '/actuator/prometheus'
    spec:
      containers:
        - name: order-service
          image: io.thecodeforge/order-service:latest
          ports:
            - containerPort: 8080
          # STARTUP PROBE: Prevents liveness from killing a slow-starting app.
          # Math: failureThreshold (30) × periodSeconds (10) = 300 seconds max startup.
          # Measure your actual startup time and set this to at least 2× that value.
          # If your app starts in 45 seconds, use failureThreshold: 12, periodSeconds: 10
          # for a 120-second window. Never guess; measure.
          startupProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            failureThreshold: 30  # 30 × 10s = 300 seconds max startup window
            periodSeconds: 10
            timeoutSeconds: 3
          # LIVENESS PROBE: JVM-only, no external dependency checks.
          # If this fails, Kubernetes RESTARTS the pod.
          # Never check DB or external APIs here: a transient blip restarts all pods.
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            initialDelaySeconds: 0  # startupProbe handles the startup window
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3  # Restart after 3 consecutive failures (30 seconds)
          # READINESS PROBE: Checks dependencies such as DB, cache, broker.
          # If this fails, the pod is REMOVED from the load balancer but NOT restarted.
          # This is the correct behavior for a transient dependency failure.
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            initialDelaySeconds: 0
            periodSeconds: 5
            timeoutSeconds: 3
            failureThreshold: 3  # Remove from LB after 15 seconds of failures
Output
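# Spring Boot application.yml configuration required for health groups:
#
# management:
#   endpoint:
#     health:
#       probes:
#         enabled: true   # enables /actuator/health/liveness and /actuator/health/readiness
#       show-details: when_authorized
#       group:
#         readiness:
#           include: readinessState,db   # the readiness group holds only readinessState by default
#   health:
#     livenessstate:
#       enabled: true
#     readinessstate:
#       enabled: true
#
# Spring Boot enables the probe endpoints automatically when it detects a Kubernetes
# environment. Anywhere else (local runs, Docker Compose), /actuator/health/liveness
# and /actuator/health/readiness return 404 until probes.enabled=true is set.
# Setting it explicitly avoids the most common Kubernetes probe misconfiguration in Spring Boot.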
Never Put External Dependency Checks in Liveness Probes
If your liveness probe checks the database and the database has a 30-second connection spike, Kubernetes restarts every pod in your deployment simultaneously. They all reconnect to the DB at once, creating a connection storm that makes the DB spike worse. This cascading restart pattern has taken down production systems at companies with mature engineering teams. The rule is absolute: liveness probes check only JVM internal state. External dependencies belong in readiness probes.
Production Insight
A 20-replica deployment was restarting every 90 seconds. The liveness probe was set to /actuator/health instead of /actuator/health/liveness. A 30-second DB connection spike triggered liveness failures across all 20 pods simultaneously. They all restarted, reconnected at once, overwhelmed the DB, which caused another spike, which failed liveness again. The cascading loop ran for 40 minutes before anyone understood what was happening. The fix was changing one YAML field. The investigation took 2 hours. The actual fix took 5 seconds.
Key Takeaway
Liveness checks the JVM — failure triggers a pod restart. Readiness checks dependencies — failure removes the pod from the load balancer without restarting it. startupProbe math: failureThreshold × periodSeconds = maximum startup window — measure your actual startup time and set this to at least 2×. Never put external dependency checks in liveness probes — the cascading restart pattern is catastrophic at scale.
Kubernetes Probe Selection
If: App is slow to start (Spring context takes 30 or more seconds to load)
→ Use: Add a startupProbe using /actuator/health/liveness. Set failureThreshold × periodSeconds to at least 2× your measured startup time. Without startupProbe, liveness kills the pod during initialization.
If: External dependency (DB, cache, broker) is temporarily unreachable
→ Use: Readiness probe fails and the pod is removed from the load balancer but NOT restarted. Traffic stops going to it. When the dependency recovers, the pod passes readiness and rejoins the load balancer automatically.
If: JVM is unresponsive (deadlock, OOM, or GC thrashing)
→ Use: Liveness probe fails and Kubernetes restarts the pod. This is the correct behavior: the process is genuinely broken and needs a fresh start.
If: Liveness probe checks database connectivity
→ Use: Immediate risk: a DB blip restarts ALL pods simultaneously, a cascading failure. Change liveness to /actuator/health/liveness and move the DB check to readiness immediately.
The /actuator/info Endpoint — Deployment Traceability
The /actuator/info endpoint is the most underused feature in the Actuator suite. It lets you expose build information — Git commit hash, build timestamp, artifact version — directly from your running application. When something breaks in production, the first question is always 'what version is running?' Without this endpoint configured, answering that question means digging through CI/CD pipeline logs, which takes minutes you do not have during an active incident.
The setup requires two Maven plugins: the spring-boot-maven-plugin with the build-info goal, and the git-commit-id-plugin to embed Git metadata. Once configured, every build automatically embeds its own DNA into the artifact. Every deployment automatically reports exactly what code is running.
In production setups, every Grafana dashboard should have a deployment panel that queries /actuator/info across instances. If instances report different commit hashes during a rolling deploy, you can see the split-brain state in real time. If a rollback happened silently, it shows up immediately in this panel.
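Detecting a mixed-version rollout comes down to grouping instances by the commit hash each one reports. The sketch below assumes you have already collected those hashes (for example by calling /actuator/info on each instance); the class name, method names, and pod names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups instances by the commit hash they report.
final class RolloutCheck {

    /** Returns commit hash -> sorted list of instances reporting that hash. */
    static Map<String, List<String>> byCommit(Map<String, String> commitByInstance) {
        Map<String, List<String>> groups = new TreeMap<>();
        commitByInstance.forEach((instance, commit) ->
                groups.computeIfAbsent(commit, key -> new ArrayList<>()).add(instance));
        groups.values().forEach(Collections::sort);
        return groups;
    }

    /** True when instances disagree about which commit is running. */
    static boolean isMixed(Map<String, String> commitByInstance) {
        return byCommit(commitByInstance).size() > 1;
    }

    public static void main(String[] args) {
        Map<String, String> reported = Map.of(
                "pod-a", "abc1234",
                "pod-b", "abc1234",
                "pod-c", "def5678");
        System.out.println("mixed versions: " + isMixed(reported));
        System.out.println(byCommit(reported));
    }
}
```

A mixed result is expected for a short time during a rolling deploy; it becomes a signal only when it persists after the rollout should have finished.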
io/thecodeforge/monitoring/pom.xml (XML)
<!-- io.thecodeforge: Maven plugins for /actuator/info enrichment -->
<!-- Merge these <plugin> entries into your existing <build><plugins> block -->
<build>
  <plugins>
    <!-- Plugin 1: generates build-info.properties during the build.
         Embeds: artifact name, version, build timestamp.
         Appears under the "build" key in the /actuator/info response. -->
    <plugin>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-maven-plugin</artifactId>
      <executions>
        <execution>
          <goals>
            <goal>build-info</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <!-- Plugin 2: embeds Git metadata at build time.
         Embeds: commit hash, branch, commit time, tags.
         Appears under the "git" key in the /actuator/info response.
         failOnNoGitDirectory=false: prevents build failure in CI environments
         where the .git directory may not be present (Docker build layers). -->
    <plugin>
      <groupId>pl.project13.maven</groupId>
      <artifactId>git-commit-id-plugin</artifactId>
      <version>4.9.10</version>
      <executions>
        <execution>
          <goals>
            <goal>revision</goal>
          </goals>
        </execution>
      </executions>
      <configuration>
        <generateGitPropertiesFile>true</generateGitPropertiesFile>
        <!-- Prevents build failure when .git is not present (for example a Docker build context that excludes it) -->
        <failOnNoGitDirectory>false</failOnNoGitDirectory>
        <!-- Length of the abbreviated commit hash (git.commit.id.abbrev); the full hash is still recorded -->
        <abbrevLength>7</abbrevLength>
      </configuration>
    </plugin>
  </plugins>
</build>
# Post-deploy check (shell): DEPLOYED_COMMIT and RUNNING_COMMIT are assumed to be set earlier in the pipeline
if [ "$DEPLOYED_COMMIT" != "$RUNNING_COMMIT" ]; then
  echo "ERROR: Container is running stale code. Expected $DEPLOYED_COMMIT, got $RUNNING_COMMIT"
  exit 1
fi
Use Info in Your CI/CD Verification Step
After deploying, curl /actuator/info on the new instance and verify git.commit.id matches the commit your pipeline just built. If it does not match, your deployment did not actually apply — the Docker image cache served the old layer, or the Helm rollout did not complete. This 2-second verification step has caught the 'ghost deployment' failure mode more times than I can count. Add it as a required step in your deployment pipeline before marking the deploy as successful.
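The verification step can also be written as a small Java check. This is a sketch under stated assumptions: it expects the default /actuator/info layout, where the abbreviated hash appears at git.commit.id, and it uses a regular expression instead of a JSON library to stay dependency-free. The class and method names are hypothetical, and fetching the JSON from the running instance is left out.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-deploy check: compares the commit reported by /actuator/info
// with the commit the pipeline just built.
final class DeployVerifier {

    // Matches "commit": { ... "id": "<hash>" ... } inside the info response.
    private static final Pattern COMMIT_ID =
            Pattern.compile("\"commit\"\\s*:\\s*\\{[^}]*?\"id\"\\s*:\\s*\"([^\"]+)\"");

    /** Extracts git.commit.id from the info JSON, if present. */
    static Optional<String> runningCommit(String infoJson) {
        Matcher matcher = COMMIT_ID.matcher(infoJson);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    /** True when the expected (full) hash starts with the running (possibly abbreviated) hash. */
    static boolean matches(String infoJson, String expectedCommit) {
        return runningCommit(infoJson).map(expectedCommit::startsWith).orElse(false);
    }

    public static void main(String[] args) {
        String sample = "{\"git\":{\"branch\":\"main\",\"commit\":{\"id\":\"a1b2c3d\"}}}";
        String built = "a1b2c3d4e5f60718293a4b5c6d7e8f9012345678";
        System.out.println(matches(sample, built) ? "deploy verified" : "stale code running");
    }
}
```

A missing git block is treated as a failure on purpose: if the endpoint cannot say what is running, the deploy should not be marked successful.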
Production Insight
A team deployed v2.4.1 to fix a critical pricing bug but customer reports kept coming in. The app was still behaving like v2.3.9. Fifteen minutes into the incident, someone thought to curl /actuator/info. The git.commit.id showed the old version's hash. Docker had cached the intermediate image layer and the registry served the old image despite the pipeline showing green. Without the info endpoint, this debugging path would have taken an hour. With it, the ghost deployment was identified in 30 seconds.
Key Takeaway
When something breaks, the first question is 'what version is running?' The info endpoint answers that in one API call. Configure it once with the two Maven plugins — it requires no ongoing maintenance. Add a CI/CD post-deploy verification step that compares the running commit hash against what was just deployed.
Dynamic Log Level Management — Debug Without Redeploying
One of Actuator's most powerful day-two features: changing log levels at runtime without restarting the application. Need to enable DEBUG logging for a specific package to troubleshoot a production issue? Hit /actuator/loggers, change the level, reproduce the issue, read the logs, then change it back. No restart. No downtime. No redeploy cycle.
This converts 'we need to add more logging and redeploy' — a 30-minute process minimum in most CI/CD pipelines — into a 10-second API call. During incident response, this is the difference between a 10-minute resolution and a 45-minute resolution while customers are actively impacted.
The endpoint supports GET to read current levels and POST to change them. Spring Security should restrict POST to ADMIN role — TRACE logging in production can generate gigabytes of log data per minute, fill disk, and cascade into other failures. Always reset the log level after you have captured what you need. Build that reset command into your debugging runbook as a mandatory step.
For Hibernate SQL debugging specifically, enable DEBUG on org.hibernate.SQL to see the exact SQL being generated, and TRACE on the bind-parameter logger to see the actual bind values: org.hibernate.type.descriptor.sql on Hibernate 5, org.hibernate.orm.jdbc.bind on Hibernate 6. This combination has diagnosed more mysterious data bugs than any other technique I know.
# After resetting to null: DEBUG logs disappear immediately, zero application restart.
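The change-and-reset cycle described above maps onto two POST requests against /actuator/loggers/{name}. The sketch below builds them with the JDK's java.net.http API and does not send anything. The base URL and logger name are placeholders, and authentication is omitted, so treat it as a starting point rather than a finished client.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds the requests for changing and resetting a logger level.
final class LoggerLevelRequests {

    /** JSON body for the loggers endpoint; a null level resets the logger to its inherited level. */
    static String levelBody(String level) {
        return level == null
                ? "{\"configuredLevel\":null}"
                : "{\"configuredLevel\":\"" + level + "\"}";
    }

    /** POST request that sets (or resets, when level is null) the level of one logger. */
    static HttpRequest setLevel(String baseUrl, String loggerName, String level) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + loggerName))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(levelBody(level)))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder host and package name; send with java.net.http.HttpClient plus credentials.
        HttpRequest enable = setLevel("http://localhost:8080", "io.thecodeforge.payment", "DEBUG");
        HttpRequest reset = setLevel("http://localhost:8080", "io.thecodeforge.payment", null);
        System.out.println(enable.method() + " " + enable.uri() + " " + levelBody("DEBUG"));
        System.out.println(reset.method() + " " + reset.uri() + " " + levelBody(null));
    }
}
```

Keeping the reset request next to the enable request in the same helper makes it harder to forget the second half of the runbook step.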
Build a Runbook Entry for Every Common Incident
Document which packages to enable DEBUG for during common incident types. Example: 'Payment processing failures: enable DEBUG on io.thecodeforge.order and io.thecodeforge.payment. If SQL is suspected, add TRACE on org.hibernate.SQL. Reset all three within 10 minutes.' Your on-call engineer at 3 AM should be running a documented command, not guessing package names from memory. The runbook entry takes 5 minutes to write and saves 20 minutes per incident.
Production Insight
A team was debugging intermittent payment failures that occurred roughly once every 200 transactions. Reproducing the issue was unreliable. They enabled DEBUG on the payment package for 15 minutes, captured 4 failed transactions with full context in the logs, identified a race condition in the idempotency key generation, and disabled DEBUG before anyone else noticed the log volume. The old workflow — add a log statement, build, deploy to staging, reproduce, deploy to production, check logs — would have been 45 minutes minimum and required a deployment window approval. Dynamic log levels turned it into a 15-minute debugging session with no deployment.
Key Takeaway
Dynamic log levels are the most underused Actuator feature in production incident response. They turn debugging from a deployment cycle into a 10-second API call. Secure POST with ADMIN role — TRACE logging can fill disks in minutes. Setting configuredLevel to null (inherit) is different from setting it to INFO explicitly — null respects future configuration changes.
Micrometer Integration and Docker Deployment
To make this production-ready, you containerize the application with a focus on how Docker handles the health signal from Actuator and how the JVM is tuned for container environments.
The Docker HEALTHCHECK instruction tells the Docker daemon whether the container is healthy. By pointing it at /actuator/health/liveness, you get the same lightweight JVM-only health logic that Kubernetes uses at the liveness probe level. This matters most for Docker Compose deployments, which do not have Kubernetes probe support.
Two things to watch for in every containerization. First, the HEALTHCHECK command needs wget or curl to be present in the base image. Distroless images include neither, so you will need to switch to a slim base image or remove Docker HEALTHCHECK and rely exclusively on Kubernetes probes. Second, the JVM memory flags: JVMs without container support (before JDK 10 and 8u191) read the host machine's memory rather than the container's memory limits and allocate too large a heap, causing OOM kills. JDK 10 and later, and JDK 8u191 and later, enable -XX:+UseContainerSupport by default, but the flag is harmless and makes the intent explicit.
io/thecodeforge/monitoring/Dockerfile (DOCKERFILE)
# io.thecodeforge: Multi-stage Dockerfile with Actuator health integration

# Stage 1: Build
FROM eclipse-temurin:17-jdk-alpine AS build
WORKDIR /app
COPY . .
RUN ./mvnw clean package -DskipTests

# Stage 2: Runtime
# eclipse-temurin:17-jre-alpine includes wget, which HEALTHCHECK needs.
# Do NOT use distroless if you need Docker HEALTHCHECK.
FROM eclipse-temurin:17-jre-alpine
WORKDIR /app
COPY --from=build /app/target/*.jar app.jar

# Security: run as non-root user
RUN addgroup -S forgegroup && adduser -S forgeuser -G forgegroup
USER forgeuser

# Docker HEALTHCHECK: uses the liveness endpoint for a lightweight JVM-only check.
# interval: how often Docker checks (30s is conservative; adjust for your SLA)
# timeout: how long Docker waits for a response before counting a failure
# retries: consecutive failures before marking UNHEALTHY
# Uses /actuator/health/liveness, NOT /actuator/health.
# A full health check would hit the DB on every HEALTHCHECK, which the Docker daemon does not need.
# Outside Kubernetes the liveness endpoint requires management.endpoint.health.probes.enabled=true.
HEALTHCHECK \
    --interval=30s \
    --timeout=3s \
    --retries=3 \
    CMD wget --quiet --tries=1 --spider http://localhost:8080/actuator/health/liveness || exit 1

EXPOSE 8080

# JVM flags for container environments:
# -XX:+UseContainerSupport: read container memory limits, not host memory
# -XX:MaxRAMPercentage=75.0: allocate 75% of container memory to the heap,
# leaving 25% for Metaspace, thread stacks, GC overhead, and the OS
ENTRYPOINT ["java", "-XX:+UseContainerSupport", "-XX:MaxRAMPercentage=75.0", "-jar", "app.jar"]
Distroless base images (gcr.io/distroless/java17) do not include wget, curl, or a shell. Docker HEALTHCHECK requires one of these. If you switch to distroless for security, remove the Docker HEALTHCHECK instruction entirely and rely on Kubernetes probes exclusively. Kubernetes probes are HTTP checks performed by the kubelet from outside the container, so they do not need wget inside the container. Kubernetes does not act on an image's HEALTHCHECK instruction at all, so in a cluster the two mechanisms are independent and you do not need both.
Production Insight
A team migrated to a distroless image for the reduced attack surface — a legitimate security improvement. Docker HEALTHCHECK started failing because wget was missing. Docker marked the container UNHEALTHY, Docker Compose restarted it, which caused a restart loop. The ops team spent two hours before identifying that HEALTHCHECK was the issue, not the application. The fix: remove Docker HEALTHCHECK and rely on Kubernetes probes exclusively. Kubernetes probes are more sophisticated anyway — they support failure thresholds, initial delays, and startup protection that Docker HEALTHCHECK does not.
Key Takeaway
Docker HEALTHCHECK requires wget or curl — distroless images include neither. In Kubernetes, remove Docker HEALTHCHECK and use Kubernetes probes exclusively. Set -XX:+UseContainerSupport and -XX:MaxRAMPercentage=75.0 in your ENTRYPOINT to prevent the JVM from over-allocating heap based on host memory rather than container limits.
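To confirm the heap sizing actually took effect, compare what the JVM reports with what the flags should produce. The sketch below is illustrative: the class name is hypothetical, the container limit is passed in by hand, and the value reported by Runtime.maxMemory() can come out somewhat below the computed figure depending on the garbage collector.

```java
// Hypothetical helper: compares the expected heap ceiling with what the JVM reports.
final class ContainerHeap {

    /** Heap ceiling implied by a container memory limit and -XX:MaxRAMPercentage. */
    static long expectedMaxHeapBytes(long containerLimitBytes, double maxRamPercentage) {
        return (long) (containerLimitBytes * (maxRamPercentage / 100.0));
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024L * 1024L;
        long expected = expectedMaxHeapBytes(oneGiB, 75.0);
        long reported = Runtime.getRuntime().maxMemory();
        System.out.println("expected for a 1 GiB limit at 75%: " + expected + " bytes");
        System.out.println("reported by this JVM:              " + reported + " bytes");
    }
}
```

If the reported value is close to 75% of the host's memory rather than the container's limit, the JVM is not seeing the container limit.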
The /health Endpoint: What Your Monitoring Dashboard Never Told You
Most teams treat /actuator/health as a glorified ping. They wire it to a load balancer and call it observability. That’s naive, and it’s how you get paged at 3 AM when a downstream database connection pool silently exhausts.
The health endpoint aggregates every HealthIndicator registered in the context — DataSourceHealthIndicator, RedisHealthIndicator, DiskSpaceHealthIndicator, and your own custom ones. When any indicator returns DOWN, the entire endpoint returns 503. But here’s the pain point: by default, the HTTP response body only shows "{"status":"DOWN"}" to external callers. You get zero detail on what actually broke.
Spring Boot hides the details behind the management.endpoint.health.show-details property. Set it to when_authorized (with Spring Security) or always for internal-only endpoints. Pair that with management.endpoint.health.show-components to expose per-indicator status without leaking stack traces. This isn’t about security theatre; it’s about giving your ops team the signal they need to triage before the second alert fires.
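The aggregation rule (one DOWN component takes the whole endpoint to 503) can be modelled in a few lines of plain Java. This is a model for reasoning about the behavior, not Spring Boot's own classes; the class name is hypothetical, and the severity order and status-to-HTTP mapping follow Spring Boot's documented defaults.

```java
import java.util.Collection;
import java.util.List;

// Plain-Java model of default health aggregation; not Spring Boot's implementation.
final class HealthAggregationModel {

    // Most severe first, following the documented default status order.
    private static final List<String> SEVERITY_ORDER =
            List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    /** Overall status is the most severe status reported by any component. */
    static String aggregate(Collection<String> componentStatuses) {
        return SEVERITY_ORDER.stream()
                .filter(componentStatuses::contains)
                .findFirst()
                .orElse("UNKNOWN");
    }

    /** Default HTTP mapping: DOWN and OUT_OF_SERVICE become 503, everything else 200. */
    static int httpStatus(String status) {
        return (status.equals("DOWN") || status.equals("OUT_OF_SERVICE")) ? 503 : 200;
    }

    public static void main(String[] args) {
        String overall = aggregate(List.of("UP", "DOWN", "UP")); // db UP, redis DOWN, disk UP
        System.out.println(overall + " -> HTTP " + httpStatus(overall));
    }
}
```

The model shows why the body needs component detail: the overall status alone cannot tell you which of the components pulled it down.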
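Exposing full health details to the internet without authentication is a gift to attackers. They can map your entire infrastructure's dependency graph. Always pair show-details=when_authorized with Spring Security, and keep the endpoint off the public network. management.endpoints.web.exposure.include only selects which endpoints are exposed; to restrict access to internal CIDR ranges, move Actuator to its own port with management.server.port (optionally bound to an internal interface with management.server.address) and firewall that port.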
Key Takeaway
Never let /actuator/health be a black box. Expose component-level details internally, and check every downstream dependency in the full health endpoint and the readiness group, never in liveness.
HTTP Tracing: The Silent Debugger You're Not Using
You’ve been debugging latency spikes by grepping logs. Stop. Spring Boot Actuator includes an HTTP tracing endpoint that records recent request-response exchanges: method, URI, status code, time taken, and headers. That’s enough to catch a sudden increase in 5xx responses or a slow endpoint before it becomes a fire. Two version notes: since Spring Boot 2.2 the endpoint is only active once you register an HttpTraceRepository bean, and in Spring Boot 3 it was renamed to /actuator/httpexchanges, backed by HttpExchangeRepository.
The bundled InMemoryHttpTraceRepository keeps the last 100 entries in an in-memory buffer. In production, that’s not enough for a busy service. Raise the limit by calling setCapacity(1000) or more on the repository bean if you have headroom. You can also implement HttpTraceRepository yourself to persist to a database or to Redis, so traces from every pod land in one place.
Crucial gotcha: the httptrace endpoint excludes request and response bodies by design (security). If you need that, implement Filter to capture body payloads yourself, but be warned — logging full request bodies for every call will trash your heap. Use it selectively via a feature flag or a sampling rate.
In high-throughput endpoints (>1000 req/s), avoid storing every trace in memory. Implement a sampling rate of 1% or 0.1% using a ThreadLocal counter. Your CPU and GC will thank you, and you'll still catch anomalies via aggregated error rate spikes.
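A counter-based sampler of the kind mentioned above might look like the following. It is a sketch: the class name is hypothetical, each thread samples every Nth request it handles (so the overall rate is approximate across a thread pool), and wiring it into a trace repository or filter is left out.

```java
// Hypothetical sampler: each thread records one request out of every N it handles.
final class TraceSampler {

    private final int oneInN;
    // One counter per thread avoids contention on a shared atomic under high throughput.
    private final ThreadLocal<int[]> counter = ThreadLocal.withInitial(() -> new int[1]);

    TraceSampler(int oneInN) {
        if (oneInN < 1) {
            throw new IllegalArgumentException("oneInN must be at least 1");
        }
        this.oneInN = oneInN;
    }

    /** Returns true for every Nth call made on the current thread. */
    boolean shouldSample() {
        int[] calls = counter.get();
        calls[0]++;
        if (calls[0] >= oneInN) {
            calls[0] = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TraceSampler sampler = new TraceSampler(100); // roughly 1% of requests
        int sampled = 0;
        for (int request = 0; request < 1000; request++) {
            if (sampler.shouldSample()) {
                sampled++;
            }
        }
        System.out.println("sampled " + sampled + " of 1000 requests");
    }
}
```

Deterministic counting keeps the cost to an increment and a comparison per request; a random sampler would give a smoother distribution at slightly higher cost.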
Key Takeaway
HTTP tracing is your first line of defense against regressions. Set a sensible buffer size, persist to a backend, and never expose trace data to the public.
Actuator's Bean and Configuration Dump: Why You Should Audit Before Every Deploy
You’ve had the meeting: "The database connection pool is half the size it should be, and nobody changed a thing." The liar is your YAML file. The /actuator/beans endpoint shows every bean in the context, its scope, its type, and its dependencies. /actuator/configprops dumps every @ConfigurationProperties bean with its currently bound values. Run these after every deploy before you declare success.
Use /actuator/beans to verify that custom beans like DataSource or RestTemplate actually register with the correct qualifier and scope. A bean of type DataSource that’s accidentally @Scope("prototype") will create a new connection pool per injection point. You’ll exhaust your database sockets in minutes.
/actuator/configprops is your best friend for catching typos in property names. Spring Boot silently ignores unknown properties by default; to fail fast instead, set ignoreUnknownFields = false on your own @ConfigurationProperties classes. Dump the configprops endpoint, compare it against your expected values programmatically in an integration test, and block the deploy if mismatches exist.
at ConfigPropertyAuditTest.verifyDatabasePoolSize(ConfigPropertyAuditTest.java:22)
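The comparison step of such an audit is small. The sketch below assumes you have already flattened the /actuator/configprops response into a key-to-value map, which is the part that depends on your JSON library; the class name and property values are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical audit helper: reports expected properties whose live value differs.
final class ConfigAudit {

    /** Returns key -> description for every expected property that does not match. */
    static Map<String, String> mismatches(Map<String, String> expected, Map<String, String> actual) {
        Map<String, String> problems = new TreeMap<>();
        expected.forEach((key, wanted) -> {
            String found = actual.get(key); // null when the property was never bound
            if (!wanted.equals(found)) {
                problems.put(key, "expected " + wanted + " but was " + found);
            }
        });
        return problems;
    }

    public static void main(String[] args) {
        Map<String, String> expected = Map.of("spring.datasource.hikari.maximum-pool-size", "20");
        Map<String, String> actual = Map.of("spring.datasource.hikari.maximum-pool-size", "10");
        mismatches(expected, actual).forEach((key, problem) -> System.out.println(key + ": " + problem));
    }
}
```

Only the keys you list are checked, so the expected map doubles as documentation of the settings you consider critical.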
Production Trap: Silent Defaults
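HikariCP defaults to a maximum pool size of 10. If you expect 20 but set the property under the wrong prefix (e.g., spring.datasource.maximum-pool-size instead of spring.datasource.hikari.maximum-pool-size), Spring Boot will ignore it and silently use 10. Relaxed binding means maximum-pool-size and maximumPoolSize are treated as the same name, so the prefix is where typos bite. Always verify via /actuator/configprops after deployment.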
Key Takeaway
The beans and configprops endpoints are your deploy-time smoke test. Always audit them programmatically before declaring a deployment successful.
Scheduled Tasks Endpoint: Stop Guessing What's Running When
Your production logs are full of mysterious delays. You know something runs on a cron schedule, but nobody remembers when or with what trigger. The /actuator/scheduledtasks endpoint ends that guesswork.
This endpoint dumps every scheduled task in your context: fixed rate, fixed delay, cron expressions, and the current trigger state. It's read-only. No starting or stopping tasks here. But it tells you exactly what Spring's @Scheduled annotation created and how often it fires.
Why you care: I've debugged cascading failures caused by a cron job running every minute when the team thought it ran hourly. One curl against this endpoint caught it in seconds. The endpoint also surfaces tasks registered programmatically through a SchedulingConfigurer; Quartz jobs are reported by the separate /actuator/quartz endpoint. Wire it into your CI pipeline's pre-deploy audit. Know what's scheduled before you ship a change that conflicts.
Before Spring Boot 3.4, this endpoint does not expose the last execution time or next fire time. It only shows the trigger configuration. For real-time scheduling data on those versions, pair it with Micrometer's executor metrics or your APM tool.
Key Takeaway
Always dump scheduled tasks before and after a deployment to catch configuration drift in cron expressions and intervals.
Excluding IDs/Endpoints: Poison Pill for Loose Actuators
Exposing all Actuator endpoints by default is the fastest way to get paged at 3 AM by your security team. The fix isn't complex ACLs — it's a single property that kills endpoints you don't need. Use management.endpoints.web.exposure.exclude.
Think of it as a blocklist. You enable specific endpoints via include, but the smarter move is to include everything safe (health, info, metrics) and explicitly exclude the dangerous ones: env, configprops, threaddump, heapdump, loggers (yes, even loggers can leak internal structure).
I run this in every service: include health,info,metrics,prometheus and exclude everything else by name. No regex magic. No guessing. Production code doesn't need /actuator/env to stay up. If a dev needs it, they can tunnel through the bastion host. The exclude property takes precedence over include, so be explicit. One misconfigured service leaking heapdump to the internet is how you get a breach report on your desk.
application.yml (YAML)
# io.thecodeforge: Actuator endpoint exposure
management:
  endpoints:
    web:
      exposure:
        # include: "*"  # DANGER: uncomment this only in dev
        include: health,info,metrics,prometheus
        exclude: env,configprops,heapdump,loggers,threaddump
  endpoint:
    env:
      enabled: false  # belt-and-suspenders
    configprops:
      enabled: false
Output
After applying this config, GET /actuator returns:
{
  "_links": {
    "self": { "href": "/actuator" },
    "health": { "href": "/actuator/health" },
    "info": { "href": "/actuator/info" },
    "metrics": { "href": "/actuator/metrics" },
    "prometheus": { "href": "/actuator/prometheus" }
  }
}
Accessing /actuator/env returns 404 Not Found.
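The precedence rule (exclude wins over include) is worth modelling once so the edge cases are obvious. The following is a plain-Java model of the rule as described here, not Spring Boot's implementation; the class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of include/exclude resolution; not Spring Boot's implementation.
final class ExposureModel {

    /** An endpoint is exposed when it is included and not excluded; "*" matches every id. */
    static Set<String> exposed(Set<String> available, Set<String> include, Set<String> exclude) {
        Set<String> result = new TreeSet<>();
        for (String id : available) {
            boolean included = include.contains("*") || include.contains(id);
            boolean excluded = exclude.contains("*") || exclude.contains(id);
            if (included && !excluded) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> available = Set.of("health", "info", "metrics", "env", "heapdump");
        System.out.println(exposed(available, Set.of("*"), Set.of("env", "heapdump")));
        System.out.println(exposed(available, Set.of("health", "info"), Set.of("*")));
    }
}
```

The second call is the case to watch for: because exclude takes precedence, a wildcard exclude hides every endpoint, including the ones named in include.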
Production Trap:
The exclude list in exposure only blocks the HTTP endpoint. The endpoint itself is still enabled in the application context and can remain reachable through other channels such as JMX. For full removal, also set management.endpoint.<id>.enabled=false. This stops internal wiring from leaking through alternate paths.
Key Takeaway
Default-deny actuator endpoints. Include only what you need, exclude the rest by name. A service that doesn't expose /env can't leak your database password through a URL.
Extending Existing Endpoints — Add Data Without Breaking Contracts
Spring Boot Actuator endpoints are not final. You can extend them to inject custom data without rewriting the endpoint or breaking its contract. Why do this? Health checks, metrics, and info endpoints are consumed by monitoring tools, Kubernetes probes, and dashboards. Adding fields to these endpoints ensures your custom diagnostics flow into existing pipelines without new integrations. To extend, implement HealthIndicator for the health endpoint, or use @EndpointExtension for any Actuator endpoint. For the /info endpoint, add keys to application.properties under info.* (on Spring Boot 2.6 and later this also needs management.info.env.enabled=true), or implement InfoContributor in a @Component bean. This keeps your extensions version-safe and compatible with Spring Boot's autoconfiguration. The trap: extending an endpoint without checking upstream dependencies can cause serialization errors or break Prometheus scraping if you add non-numeric fields to metrics. Always test against your monitoring stack after extending.
Extending the health endpoint with mutable state or blocking I/O slows every caller of /actuator/health, and it delays Kubernetes probes if the indicator is included in the liveness or readiness group. Keep extensions fast and stateless.
Key Takeaway
Extend, don't replace — use HealthIndicator or InfoContributor to add data to existing Actuator endpoints without breaking consumer contracts.
Overview — What Actuator Solves Before You Write Code
Spring Boot Actuator is not an optional add-on. It is the production interface for your running application. Before you write a single custom endpoint, Actuator exposes health, metrics, environment properties, thread dumps, and HTTP traces through HTTP or JMX. The why: observability is not about adding code — it's about exposing runtime state without modifying business logic. Actuator turns your Spring Boot app into a self-diagnosing system by default. The /health endpoint aggregates database status, disk space, and external service health. /metrics gives JVM, CPU, and memory counters via Micrometer. /env reveals configuration sources for debugging. /loggers allows runtime log level changes. These endpoints are your first line of defense before building custom dashboards. The trap: exposing all endpoints without authentication leaks sensitive data — environment variables, passwords, and internal URLs. Always start secure: expose only health and info by default, then enable others with role-based access.
application.properties (PROPERTIES)
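# io.thecodeforge: Actuator baseline
# Expose only health and info over HTTP; every other endpoint stays unexposed
management.endpoints.web.exposure.include=health,info
# Custom info metadata (Maven resource filtering fills in the @...@ placeholders)
info.app.name=@project.name@
info.app.version=@project.version@
info.java.version=17
# Spring Boot 2.6 and later: info.* properties only appear in /actuator/info with this enabled
management.info.env.enabled=true
# Do not add management.endpoints.web.exposure.exclude=* here:
# exclude takes precedence over include and would hide health and info as well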
Output
// /actuator returns only health and info endpoints
// /actuator/health returns UP status with components
// /actuator/info returns app name, version, java version
Production Trap:
Never expose env or configprops in production without authentication — they leak database passwords and secret keys in plain text.
Key Takeaway
Actuator solves runtime introspection by default — start secure, expose only health and info, then add endpoints as needed for debugging.
Production incident · Post-mortem · Severity: high
The $40,000 Connection Pool Exhaustion — Flying Blind Without Metrics
Symptom
Payment service latency spiked from 50ms to 15,000ms per request. Customers reported timeouts. Revenue dropped as transactions queued and eventually failed.
Assumption
The team assumed it was a downstream payment gateway outage or a network issue. Two hours were spent checking external dependencies and firewall rules before anyone looked at the connection pool.
Root cause
HikariCP connection pool was exhausted. A slow database query caused by a missing index on the orders table held connections for 30 seconds or longer. Under Saturday peak load, all 10 pool connections were consumed. New requests waited in the queue until timeout. No Micrometer gauge was configured to track active connections versus pool size. The metric that would have solved this in 30 seconds — hikaricp_connections_active — was available but not enabled because the micrometer-registry-prometheus dependency was missing.
Fix
Added micrometer-registry-prometheus dependency which auto-enables HikariCP metrics: hikaricp_connections_active, hikaricp_connections_idle, hikaricp_connections_pending, hikaricp_connections_timeout_total. Set up a Grafana alert when active connections exceed 80% of pool size. Added the missing database index. Increased pool size from 10 to 25 with connection timeout tuning. Added management.metrics.tags.application=${spring.application.name} so the metrics were attributable to this specific service in Prometheus.
Key lesson
HikariCP metrics are auto-configured when micrometer-registry-prometheus is on the classpath — you do not write a single line of instrumentation code, you just add the dependency
A single gauge (hikaricp_connections_active) would have diagnosed the issue in 30 seconds instead of 4 hours
If you cannot answer 'is the connection pool saturated?' from your dashboard, you are flying blind
Every production service needs at minimum: request rate, error rate, latency percentiles, connection pool saturation, and JVM heap usage
management.metrics.tags.application is not optional — without it, metrics from multiple services are indistinguishable in Prometheus
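The arithmetic behind the exhaustion is worth spelling out. The following is a minimal plain-Java sketch, assuming each request holds exactly one connection for the full duration of its query and treating the 50ms request latency from the symptom as the healthy hold time. The class and method names are illustrative and are not part of HikariCP or Micrometer.

```java
/** Illustrative arithmetic only; these names are not part of HikariCP or Micrometer. */
final class PoolCapacity {

    /** Upper bound on requests per second when each request holds one connection for holdSeconds. */
    static double maxRequestsPerSecond(int poolSize, double holdSeconds) {
        if (poolSize <= 0 || holdSeconds <= 0) {
            throw new IllegalArgumentException("poolSize and holdSeconds must be positive");
        }
        return poolSize / holdSeconds;
    }

    /** True when active connections exceed the given fraction of the pool, e.g. 0.8 for an 80% alert. */
    static boolean exceedsThreshold(int active, int poolSize, double fraction) {
        return active > poolSize * fraction;
    }

    public static void main(String[] args) {
        // Healthy case: connections are held for roughly 50 ms (assumption, see lead-in).
        System.out.printf("pool=10, hold=50ms -> %.1f req/s%n", maxRequestsPerSecond(10, 0.050));
        // Incident case: the slow query holds each connection for 30 s.
        System.out.printf("pool=10, hold=30s  -> %.2f req/s%n", maxRequestsPerSecond(10, 30.0));
        // A larger pool alone barely helps while the query is still slow.
        System.out.printf("pool=25, hold=30s  -> %.2f req/s%n", maxRequestsPerSecond(25, 30.0));
        // The 80% alert from the fix, applied to the enlarged pool.
        System.out.println("21 of 25 active breaches 80%: " + exceedsThreshold(21, 25, 0.8));
    }
}
```

Under these assumptions a pool of 10 sustains about 200 requests per second at 50ms per query but only about one request every three seconds at 30 seconds per query, which is why the missing index mattered more than the pool size.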
Production debug guide
When Actuator is configured, here is how to go from observable symptom to resolution.
Symptom · 01
Kubernetes pods restarting in a loop every 60-90 seconds
→
Fix
Check liveness probe configuration — if it hits /actuator/health (full) instead of /actuator/health/liveness, a transient DB blip triggers cascading restarts. Switch to the liveness health group immediately. The fix is one line in your Kubernetes deployment YAML.
Symptom · 02
Prometheus shows gaps in metric data — scraping appears to stop intermittently
→
Fix
Check if /actuator/health is used as the scrape path instead of /actuator/prometheus. Heavy health checks can timeout under load, causing Prometheus to mark the target as down. Switch metrics_path to /actuator/prometheus and verify the micrometer-registry-prometheus dependency is present.
Symptom · 03
Grafana dashboard shows flat latency line, then sudden spike — no gradual degradation visible
→
Fix
You are looking at mean latency, not percentiles. Mean hides tail latency completely. Check p95 and p99 — the degradation will be visible there well before the mean moves. Add publishPercentiles(0.5, 0.95, 0.99) to your Timer metrics and alert on p99, not the mean.
Symptom · 04
Intermittent payment failures but logs show nothing useful at INFO level
→
Fix
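POST to /actuator/loggers/io.thecodeforge.order with {"configuredLevel": "DEBUG"}. Reproduce the issue. Read the detailed logs. Reset configuredLevel to null afterwards. No restart needed. If the issue is SQL-related, enable DEBUG on org.hibernate.SQL to see the exact queries; bind parameter values need TRACE on a separate logger (org.hibernate.type.descriptor.sql in Hibernate 5, org.hibernate.orm.jdbc.bind in Hibernate 6).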
Symptom · 05
Deployed new version but behavior has not changed — suspect old code is still running
→
Fix
curl /actuator/info and verify git.commit.id matches your pipeline deployment. If it does not match, the Docker image cache served the old image or the Helm rollout did not apply. This is the most common 'ghost deployment' pattern.
Symptom · 06
Prometheus query returns duplicate time series for the same metric from different services
→
Fix
You are missing management.metrics.tags.application=${spring.application.name} in your application.yml. Without a global application tag, metrics from different services use identical names and collide in Prometheus. Add this to every service's base config.
Symptom · 07
/actuator/prometheus returns 404 even though the endpoint is in exposure.include
→
Fix
The micrometer-registry-prometheus dependency is missing from your pom.xml or build.gradle. spring-boot-starter-actuator does not include it. Add io.micrometer:micrometer-registry-prometheus explicitly. The endpoint only exists when the registry is on the classpath.
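Symptom 03 above rests on the claim that the mean hides tail latency. A small worked example makes that concrete. This is a plain-Java sketch using a nearest-rank percentile over raw samples; it is not how Micrometer computes percentiles, and the sample values (98 requests at 50ms, 2 at 5,000ms) are invented for illustration.

```java
import java.util.Arrays;

/** Illustrative only: nearest-rank percentiles over raw samples, not Micrometer's implementation. */
final class TailLatency {

    /** Arithmetic mean of the samples in milliseconds. */
    static double mean(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    /** Nearest-rank percentile: the smallest sample such that at least percent% of samples are <= it. */
    static long percentile(long[] samplesMs, double percent) {
        if (samplesMs.length == 0 || percent <= 0 || percent > 100) {
            throw new IllegalArgumentException("need samples and 0 < percent <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percent * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Hypothetical traffic: 98 fast requests and 2 very slow ones. */
    static long[] sample() {
        long[] samples = new long[100];
        Arrays.fill(samples, 0, 98, 50L);
        Arrays.fill(samples, 98, 100, 5_000L);
        return samples;
    }

    public static void main(String[] args) {
        long[] samples = sample();
        System.out.println("mean = " + mean(samples) + " ms");
        System.out.println("p50  = " + percentile(samples, 50) + " ms");
        System.out.println("p95  = " + percentile(samples, 95) + " ms");
        System.out.println("p99  = " + percentile(samples, 99) + " ms");
    }
}
```

With this sample the mean is 149ms while p99 is 5,000ms: a dashboard showing only the mean looks slightly elevated, while two percent of users are already waiting five seconds.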
Actuator Debug Cheat Sheet: Commands That Save Hours
Real commands for real production debugging. Run these when something is broken and you have Actuator configured.
Is the app healthy? Need a fast status check
Immediate action
Hit the health endpoint to see component-level status
If liveness is DOWN, the JVM is unresponsive — restart the pod. If readiness is DOWN but liveness is UP, a dependency is failing — check database or cache connectivity. Never restart a pod when only readiness is failing — let the dependency recover.
Need to see what version is running without checking CI/CD
Immediate action
Query the info endpoint for build and Git metadata
If the commit hash does not match what your pipeline deployed, the container is running old code — force a pod restart or re-pull the image explicitly. This is the fastest way to catch Docker layer cache issues.
Need DEBUG logs for a specific package without restarting
Immediate action
Enable DEBUG logging dynamically via the loggers endpoint
After capturing the issue, reset immediately: curl -u admin:password -X POST -H 'Content-Type: application/json' -d '{"configuredLevel":null}' http://localhost:8080/actuator/loggers/io.thecodeforge.order — DEBUG logging in production generates gigabytes per minute if left running.
Prometheus is not scraping: target shows as DOWN in Prometheus UI
Immediate action
Verify the Prometheus endpoint is reachable and returns data
Commands
curl -s http://localhost:8080/actuator/prometheus | head -20
If empty or 404: check micrometer-registry-prometheus dependency in pom.xml and verify management.endpoints.web.exposure.include contains prometheus. If 403: check Spring Security config permits MONITORING role. If timeout: check if scrape_interval is too aggressive for the instance count.
Docker container marked unhealthy but app seems fine from inside
Immediate action
Test the HEALTHCHECK command manually inside the container
If wget is missing, you are using a distroless image. Switch to eclipse-temurin:17-jre-alpine which includes wget, or remove Docker HEALTHCHECK entirely and use Kubernetes probes exclusively. Do not add wget to a distroless image — that defeats the purpose of using it.
Legacy vs. Modern Monitoring Approaches
Health Checks
Legacy approach (manual): Custom /status endpoints with inconsistent JSON structure. Each developer implements their own format. You get 'OK' or nothing. No component-level detail, no Kubernetes probe integration.
Modern approach (Actuator): Standardized /health endpoint with nested component status, auto-aggregation (any DOWN component makes overall status DOWN), and health groups (liveness, readiness) that map directly to Kubernetes probe types.
Metrics Gathering
Legacy approach (manual): Log parsing, manual JMX MBean registration, or custom counters in Redis. Fragile, hard to query, impossible to aggregate across instances, and requires per-service implementation.
Modern approach (Actuator): Micrometer integration with dimensional metrics. Auto-instrumentation of JVM, HTTP requests, and HikariCP. One dependency addition exports everything to Prometheus, Datadog, InfluxDB, or New Relic.
Runtime Management
Legacy approach (manual): Requires application restart to change log levels. 'Add a log statement and redeploy' is a 30-minute process with a deployment window approval. Debugging in production is a full release cycle.
Modern approach (Actuator): Dynamic log level updates via /actuator/loggers: change any package's log level in 10 seconds without restarting. View environment variables and system properties via /actuator/env (secured).
Security
Legacy approach (manual): Ad-hoc security filters, often left unprotected or protected inconsistently. Actuator endpoints exposed with default Spring Security or no security at all, a common source of credential exfiltration.
Modern approach (Actuator): Integrated with Spring Security. Fine-grained access control per endpoint using @Order SecurityFilterChain. Role-based restrictions. HTTP Basic for Prometheus scraper. CSRF disabled for stateless endpoints.
Deployment Traceability
Legacy approach (manual): Check CI/CD logs, SSH into the server, run git log. No programmatic way to verify which code version is running. Ghost deployments go undetected for hours during incidents.
Modern approach (Actuator): /actuator/info exposes Git commit hash, build timestamp, and artifact version from the running process. CI/CD post-deploy verification catches ghost deployments in 2 seconds.
Operational Control
Legacy approach (manual): Any operational action (cache clear, config reload, feature toggle) requires a code change, pull request, code review, build, and deploy. A 30-minute minimum for a surgical change.
Modern approach (Actuator): Custom @Endpoint with @WriteOperation provides surgical operational control without a deploy. Cache invalidation, circuit breaker toggles, and config reloads become API calls restricted to ADMIN role.
Key takeaways
1. Actuator is the bridge between application code and operational visibility; it is non-negotiable for any Spring Boot service running in production.
2. Use readiness probes to control load balancer traffic and liveness probes to signal Kubernetes when a pod needs a restart. Never put external dependency checks in liveness probes: the cascading restart pattern it creates has taken down production systems at companies with mature engineering teams.
3. Micrometer is the metrics engine. Use Counters for total counts, Gauges for current values, and Timers for latency percentiles. Mean latency is a liar: always instrument and alert on p95 or p99.
4. The /actuator/info endpoint is the fastest way to answer 'what version is running?' during an incident. Configure it once with git-commit-id-plugin and build-info goal. Add post-deploy verification to your CI/CD pipeline that compares the running commit hash against what was just deployed.
5. Dynamic log level management via /actuator/loggers turns a 30-minute debugging deployment cycle into a 10-second API call. Build the reset command into your runbook as a mandatory step, because TRACE logging left running fills disks.
6. Always secure Actuator endpoints with a dedicated SecurityFilterChain using @Order(1). Whitelist only what you need. env, heapdump, and threaddump are the three most dangerous: restrict them to ADMIN role and consider never exposing them at all.
7. Add management.metrics.tags.application=${spring.application.name} to every service's application.yml; it is the one line most teams forget. Without it, metrics from different services are indistinguishable in Prometheus.
8. Never create high-cardinality metrics with dynamic tag values like userId or requestId. Low-cardinality dimensions only: service name, region, HTTP method, status code, error type.
9. The observability stack is Actuator (sensors) + Prometheus (recorder) + Grafana (visualizer) + Alertmanager (alerter). All four work together; Actuator alone gives you endpoints, not observability.
Common mistakes to avoid
8 patterns
×
Exposing all endpoints via management.endpoints.web.exposure.include=* in production
Symptom
Automated scanners find /actuator/env within hours. AWS credentials, database passwords, and API keys are returned in JSON plaintext. /actuator/heapdump dumps all in-memory data including user sessions and PII.
Fix
Whitelist only what your monitoring stack needs: health,info,prometheus,loggers. Explicitly exclude heapdump,env,threaddump,shutdown. Add a SecurityFilterChain with @Order(1) and role-based access control.
×
Forgetting the micrometer-registry-prometheus dependency
Symptom
/actuator/prometheus is listed in exposure.include but returns 404. Prometheus shows the target as having no metrics. Teams spend hours checking security configuration before realizing the dependency is missing.
Fix
Add io.micrometer:micrometer-registry-prometheus to pom.xml or build.gradle. This is separate from spring-boot-starter-actuator. Without it, the /actuator/prometheus endpoint does not exist regardless of what exposure.include contains.
×
Performing heavy, blocking I/O inside a Custom Health Indicator
Symptom
Health checks take 2 to 5 seconds because they call external APIs synchronously. Kubernetes probes timeout. Prometheus marks the target as down. CPU spikes every 10 seconds when probes fire simultaneously across all instances.
Fix
Keep health indicators under 200ms. Use cached results with periodic background refresh via a scheduled task instead of live calls on every probe. Set connectTimeout and readTimeout to 2000ms maximum in your health indicator HTTP client.
×
Leaving management.endpoint.health.show-details set to 'always'
Symptom
Anyone who hits /actuator/health sees database versions, connection pool sizes, external API response times, and disk usage. Attackers use this to fingerprint infrastructure and identify known CVEs for specific database or framework versions.
Fix
Set show-details to when_authorized or never in production. Only MONITORING or ADMIN roles should see component-level health details. Kubernetes probes only need the status field — they do not need details.
×
Using /actuator/health as the Kubernetes liveness probe instead of /actuator/health/liveness
Symptom
A transient database blip causes the full health check to fail. Kubernetes restarts every pod simultaneously. They all reconnect at once, creating a connection storm that worsens the original blip into a cascading restart loop.
Fix
Use /actuator/health/liveness for liveness probes — it checks only JVM internal state. Move dependency checks to /actuator/health/readiness. Enable health groups with management.endpoint.health.probes.enabled=true.
×
Not setting management.metrics.tags.application globally
Symptom
Prometheus shows http_server_requests_seconds from five different services all mixed together. Grafana dashboards cannot filter by service. Queries return combined meaningless numbers across the entire fleet.
Fix
Set management.metrics.tags.application=${spring.application.name} in every service's application.yml. Every metric emitted gets this tag automatically. Prometheus can then filter by application label in every query.
×
Creating high-cardinality metrics by using dynamic values as tags
Symptom
A Counter tagged with userId creates one unique time series per user. With 100,000 users, Prometheus runs out of memory. Scrape failures increase. TSDB storage fills up in days. Prometheus restarts frequently.
Fix
Keep tags to low-cardinality dimensions: service name, region, HTTP method, HTTP status code, error type. Never use request IDs, user IDs, or timestamps as tag values. User-level detail belongs in log aggregation, not metrics.
×
Ignoring the /actuator/info endpoint entirely
Symptom
During a production incident the team cannot answer 'what version is running?' without checking CI/CD logs. Ghost deployments — containers running old code despite a green pipeline — go undetected for hours.
Fix
Configure spring-boot-maven-plugin build-info goal and git-commit-id-plugin in pom.xml. Add a post-deploy verification step in CI/CD that curls /actuator/info and compares git.commit.id against the deployed commit hash.
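The high-cardinality mistake above is easier to respect once the multiplication is visible. The following plain-Java sketch computes the worst-case number of time series for one metric as the product of distinct values per tag. The tag names and the counts for method, status, and region are hypothetical; the 100,000 users figure is the one from the symptom above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative arithmetic only; not a Micrometer or Prometheus API. */
final class SeriesCardinality {

    /** Worst-case series count for one metric: the product of distinct values across all tags. */
    static long worstCaseSeries(Map<String, Long> distinctValuesPerTag) {
        long series = 1L;
        for (long distinct : distinctValuesPerTag.values()) {
            if (distinct < 1) {
                throw new IllegalArgumentException("each tag needs at least one value");
            }
            series = Math.multiplyExact(series, distinct); // throws on overflow instead of wrapping
        }
        return series;
    }

    public static void main(String[] args) {
        Map<String, Long> tags = new LinkedHashMap<>();
        tags.put("method", 5L);   // hypothetical: GET, POST, PUT, DELETE, PATCH
        tags.put("status", 10L);  // hypothetical: ten distinct status codes
        tags.put("region", 3L);   // hypothetical: three regions
        System.out.println("low-cardinality tags: " + worstCaseSeries(tags) + " series");

        tags.put("userId", 100_000L); // one value per user
        System.out.println("after adding userId:  " + worstCaseSeries(tags) + " series");
    }
}
```

In this sketch the three bounded tags give at most 150 series, and adding userId multiplies that to 15,000,000 for a single metric name.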
Interview Questions on This Topic
Q01 of 09 · SENIOR
What is the difference between 'Liveness' and 'Readiness' probes in the context of Spring Boot 2.3+ Actuator? What should each probe check, and what happens when each one fails in Kubernetes?
ANSWER
Liveness probes check whether the application is alive — is the JVM responsive, is the event loop running? They use /actuator/health/liveness and must check only internal JVM state, never external dependencies. When a liveness probe fails, Kubernetes restarts the pod.
Readiness probes check whether the application can accept traffic — is the database reachable, is the cache warm, is the broker connected? They use /actuator/health/readiness and include dependency checks. When a readiness probe fails, Kubernetes removes the pod from the service load balancer endpoints but does not restart it.
The critical production rule: never put external dependency checks in liveness probes. A temporary database blip would fail liveness across all pods simultaneously, triggering a coordinated restart that creates a connection storm as all pods reconnect at once. This cascading pattern has taken down production deployments at teams with mature engineering practices. The entire architecture of the two health groups exists to prevent this one failure mode.
Q02 of 09 · SENIOR
How would you implement a custom metric to track the number of successful vs failed login attempts using Micrometer's MeterRegistry? What meter type would you use and why?
ANSWER
Use Counter meters for both successful and failed login attempts. Counters track monotonically increasing values, which is exactly what cumulative login counts are. Create two counters with a shared name and a differentiating tag: Counter.builder("login.attempts").tag("outcome", "success").register(registry) and Counter.builder("login.attempts").tag("outcome", "failure").register(registry).
In Prometheus, query rate(login_attempts_total{outcome='failure'}[5m]) for the failure rate and calculate the ratio of failures to total attempts for the error percentage.
Do not use a Gauge — login counts only go up, and Gauges are for values that fluctuate. Do not use a Timer — you are counting events, not measuring duration. Do not create two separately named counters; using the same name with a differentiating tag lets you aggregate them with a single PromQL query.
The production caveat: do not tag with userId — that creates one time series per user and will exhaust Prometheus memory. The 'outcome' tag has only two possible values, making it safely low-cardinality.
Q03 of 09 · SENIOR
In a high-security environment, how do you restrict access to the /actuator/prometheus endpoint to only the Prometheus scraper's service account?
ANSWER
Create a dedicated SecurityFilterChain with @Order(1) that matches /actuator/**. The @Order(1) is critical — without it, your main application SecurityFilterChain may take precedence and apply different rules than intended. Configure the chain to require the MONITORING role for /actuator/prometheus. Use HTTP Basic authentication since Prometheus natively supports basic_auth in its scrape config — no custom token handling needed.
In application.yml, create a monitoring service account with the MONITORING role. In the Prometheus scrape config, add basic_auth with the service account credentials using password_file rather than inline passwords.
For defense in depth: use management.endpoints.web.exposure.include to whitelist only health, info, and prometheus — never use wildcard. Add Kubernetes NetworkPolicy to restrict which pods can reach the actuator port, limiting access to the Prometheus pod's IP range or namespace. In managed Kubernetes environments, this network-level restriction is more reliable than authentication alone because it prevents the endpoint from being reachable at all from unauthorized pods.
Q04 of 09 · SENIOR
Explain the 'Thundering Herd' problem that can occur if monitoring systems scrape /actuator/health simultaneously. How would you mitigate it?
ANSWER
The thundering herd occurs when multiple systems (Prometheus, Kubernetes probes, Docker HEALTHCHECK, uptime monitors) all hit the full /actuator/health endpoint at overlapping intervals. The full health check performs expensive operations: database queries, external API calls, disk space checks. With 100 instances scraped every 15 seconds, each instance is checked 4 times per minute, or 400 checks per minute across the fleet. At 500ms per check, that is 400 × 0.5 = 200 seconds of health check time per minute, plus the database load from 400 connection checks per minute.
Mitigation strategy: use /actuator/prometheus for Prometheus scraping (lightweight counter and gauge reads, sub-millisecond). Use /actuator/health/liveness for Kubernetes liveness probes (JVM-only check, no I/O). Use /actuator/health/readiness for readiness probes (the kubelet polls it every periodSeconds, so keep any dependency indicators in that group cheap or cached). For custom health indicators that must check external dependencies, cache the result with a background refresh thread and return the cached status on every probe call rather than performing a live check each time. Set connectTimeout and readTimeout to 2000ms maximum to prevent slow health checks from compounding.
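The caching advice in this answer can be sketched without any framework code. The class below is a hypothetical plain-Java helper, not a Spring Boot API: it runs an expensive dependency check on a background schedule and serves the last known result to every caller. In a real service the value of isUp() would back a HealthIndicator's health() method, and that wiring is deliberately left out here.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: caches the result of an expensive check and refreshes it in the background. */
final class CachedDependencyCheck implements AutoCloseable {
    private final BooleanSupplier liveCheck;
    private volatile boolean lastResult = false; // reported as down until the first refresh completes
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(runnable -> {
                Thread thread = new Thread(runnable, "health-refresh");
                thread.setDaemon(true); // never keeps the JVM alive on shutdown
                return thread;
            });

    CachedDependencyCheck(BooleanSupplier liveCheck) {
        this.liveCheck = liveCheck;
    }

    /** Starts the periodic refresh; probes never trigger the live check themselves. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(this::refresh, 0, period, unit);
    }

    /** Runs the live check once and stores the outcome; any exception counts as down. */
    void refresh() {
        try {
            lastResult = liveCheck.getAsBoolean();
        } catch (RuntimeException e) {
            lastResult = false;
        }
    }

    /** Cheap read of the cached status; safe to call on every probe. */
    boolean isUp() {
        return lastResult;
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        AtomicInteger liveCalls = new AtomicInteger();
        try (CachedDependencyCheck check = new CachedDependencyCheck(() -> {
            liveCalls.incrementAndGet(); // stands in for a database or HTTP call
            return true;
        })) {
            check.refresh(); // in production, call start(10, TimeUnit.SECONDS) instead
            for (int probe = 0; probe < 400; probe++) {
                check.isUp();
            }
            System.out.println("probes served: 400, live checks run: " + liveCalls.get());
        }
    }
}
```

The design choice is that probe traffic and dependency traffic are decoupled: however many systems poll the status, the dependency sees one check per refresh period.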
Q05 of 09 · SENIOR
How does the @WriteOperation annotation work in a custom @Endpoint, and what are the safety implications compared to a @ReadOperation? When would you create a custom endpoint?
ANSWER
@ReadOperation maps to HTTP GET and must have no side effects. It returns operational state — current config, deployment info, feature flag values. @WriteOperation maps to HTTP POST and changes application state — clearing a cache, toggling a feature, reloading configuration. @DeleteOperation maps to HTTP DELETE and removes something — clearing a specific cache key, deregistering a resource.
The safety implication: @WriteOperation methods modify production state without a code deploy. This is powerful but requires strict access control. Always secure @WriteOperation methods with ADMIN role in your SecurityFilterChain. Log every invocation with the caller's identity for audit trail purposes.
Create a custom endpoint when built-in Actuator endpoints do not cover your operational needs. Common legitimate cases: a deployment endpoint that returns richer metadata than /actuator/info, a features endpoint for runtime feature flag management and toggling, a circuit-breaker endpoint that shows the state of each circuit and allows manual open/close operations, or a cache endpoint that shows cache statistics and allows targeted invalidation without a full restart.
Use @WebEndpoint instead of @Endpoint when you want the endpoint web-only with no JMX exposure.
Q06 of 09 · SENIOR
What is the difference between a Counter, Gauge, and Timer in Micrometer? Give a real-world example of when you would use each one in a payment processing service.
ANSWER
A Counter tracks a monotonically increasing value — it only ever goes up. In a payment service, use Counter for total payments processed (payments.completed.total) or total payment failures (payments.failed.total). Query with rate() in Prometheus to get payments per second or failures per second. Counter is wrong for values that can decrease.
A Gauge tracks a value that fluctuates up and down. Use Gauge for the current number of pending payment approvals waiting for 3DS authentication, or the current HikariCP active connection count. You report the current value and Micrometer samples it on each Prometheus scrape. Gauge is wrong for counts that only accumulate.
A Timer measures both duration and count simultaneously. Use Timer for end-to-end payment processing latency. By default it records the count, total time, and maximum, which gives you request rate and mean latency. Percentiles are opt-in: add publishPercentiles(0.5, 0.95, 0.99) to publish p50, p95, and p99 as separate Prometheus series distinguished by a quantile label, which you can then alert on. Timer is the right choice for anything where both 'how long did it take' and 'how often does it happen' matter.
Choosing wrong means missing the signal: using a Gauge for total payments loses the rate information entirely. Using a Counter for HikariCP active connections loses the current value. Using a Timer but only looking at the mean hides the tail latency where slow transactions actually live.
Q07 of 09 · SENIOR
How would you use the /actuator/loggers endpoint to troubleshoot a production issue without restarting the application? Walk through the exact steps.
ANSWER
Step 1: Identify the relevant package from the stack trace or component you suspect — for example io.thecodeforge.order for the order processing path.
Step 2: Read the current level: GET /actuator/loggers/io.thecodeforge.order. Response shows configuredLevel and effectiveLevel — if effectiveLevel is INFO, DEBUG messages are suppressed.
Step 3: Enable DEBUG: POST /actuator/loggers/io.thecodeforge.order with body {"configuredLevel":"DEBUG"}. Takes effect immediately — no restart, no propagation delay.
Step 4: Reproduce the issue or wait for it to recur. Monitor your log aggregator (Loki, ELK) for the detailed DEBUG messages.
Step 5: Capture what you need from the logs and identify the root cause.
Step 6: Reset immediately: POST /actuator/loggers/io.thecodeforge.order with body {"configuredLevel":null}. Null means inherit from parent — returns to whatever the default configuration specifies. This is different from setting to INFO explicitly; null respects future configuration changes.
For Hibernate SQL debugging: enable TRACE on org.hibernate.SQL to see exact SQL statements and on org.hibernate.type.descriptor.sql to see the actual bind parameter values. Reset both immediately after capturing — TRACE generates enormous log volume that can fill disk in minutes.
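The read, enable, and reset calls from the steps above can be sketched with the JDK's built-in HTTP client types. This is a minimal example that only builds the requests; the base URL http://localhost:8080 matches the cheat sheet earlier in this article, the class name is hypothetical, and authentication is omitted and would need to be added for a secured endpoint.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Set;

/** Hypothetical helper that builds requests for the /actuator/loggers endpoint. */
final class LoggerLevelRequests {
    private static final Set<String> LEVELS =
            Set.of("TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF");

    /** JSON body for a level change; a null level resets the logger to inherit from its parent. */
    static String body(String level) {
        if (level == null) {
            return "{\"configuredLevel\":null}";
        }
        if (!LEVELS.contains(level)) {
            throw new IllegalArgumentException("unknown level: " + level);
        }
        return "{\"configuredLevel\":\"" + level + "\"}";
    }

    /** Step 2: read the configured and effective level of one logger. */
    static HttpRequest read(String baseUrl, String logger) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + logger))
                .GET()
                .build();
    }

    /** Steps 3 and 6: change the level, or reset it by passing null. */
    static HttpRequest change(String baseUrl, String logger, String level) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + logger))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(level)))
                .build();
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080";
        String logger = "io.thecodeforge.order";
        HttpRequest enable = change(base, logger, "DEBUG");
        HttpRequest reset = change(base, logger, null);
        System.out.println(read(base, logger).method() + " " + read(base, logger).uri());
        System.out.println(enable.method() + " " + enable.uri() + " " + body("DEBUG"));
        System.out.println(reset.method() + " " + reset.uri() + " " + body(null));
    }
}
```

To actually send a request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()). Keeping the reset request next to the enable request in the same helper makes it harder to forget step 6.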
Q08 of 09 · SENIOR
Explain how the /actuator/info endpoint can be enriched with Git commit information. What Maven plugins are required, and how would you use this in a CI/CD verification step?
ANSWER
Two Maven plugins are required: the spring-boot-maven-plugin with the build-info goal, which generates build-info.properties at compile time and embeds artifact name, version, and build timestamp; and the git-commit-id-plugin, which embeds Git metadata including commit hash (abbreviated), branch name, and commit time into git.properties. Both files are read by Actuator at startup and exposed under the build and git keys in /actuator/info.
In CI/CD, add a mandatory post-deploy verification step after the deployment completes: extract the git commit hash that was just built (git rev-parse --short HEAD), curl /actuator/info on the newly deployed instance, and compare git.commit.id from the response against the expected hash. If they do not match, fail the pipeline — the deployment applied the wrong image. This catches Docker image cache issues where the registry served a cached layer, Helm chart misconfigurations where the image tag was not updated, and failed rolling updates where some pods are still on the old version.
This verification costs 2 seconds and prevents the entire class of ghost deployment incidents where the team believes a fix is deployed but the old code is still running.
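The comparison step can be sketched as follows. This is a hypothetical plain-Java helper, assuming the pipeline has already extracted git.commit.id from the /actuator/info response. Because the abbreviated hash produced by git rev-parse --short and the one embedded by the plugin can differ in length, the sketch treats two hashes as equal when one is a prefix of the other and both have at least 7 characters. The hashes in main are made up.

```java
import java.util.Locale;

/** Hypothetical post-deploy check: compares the expected commit with the one the app reports. */
final class DeployVerifier {
    private static final int MIN_HASH_LENGTH = 7;

    /** True when both values are hex hashes of sufficient length and one is a prefix of the other. */
    static boolean sameCommit(String expected, String running) {
        if (expected == null || running == null) {
            return false;
        }
        String a = expected.trim().toLowerCase(Locale.ROOT);
        String b = running.trim().toLowerCase(Locale.ROOT);
        if (!a.matches("[0-9a-f]+") || !b.matches("[0-9a-f]+")) {
            return false; // not a commit hash at all
        }
        if (Math.min(a.length(), b.length()) < MIN_HASH_LENGTH) {
            return false; // too short to compare safely
        }
        return a.startsWith(b) || b.startsWith(a);
    }

    public static void main(String[] args) {
        // Made-up hashes for illustration only.
        String expected = "a1b2c3d";
        System.out.println("matching deploy: " + sameCommit(expected, "a1b2c3d4e5f6"));
        System.out.println("ghost deploy:    " + sameCommit(expected, "9f8e7d6c5b4a"));
    }
}
```

A pipeline step would fail the build whenever sameCommit returns false for any pod it queries.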
Q09 of 09 · SENIOR
What is the risk of setting management.endpoints.web.exposure.include=* in production? What specific endpoints are the most dangerous and why?
ANSWER
Wildcard exposure makes every Actuator endpoint reachable by anyone who can reach the application's management port. The three most dangerous endpoints:
/actuator/env returns all environment variables and system properties in plaintext, including AWS_SECRET_ACCESS_KEY, database connection strings, API tokens, and OAuth client secrets. Automated scanners probe /actuator/env within hours of an application being reachable. This is the most commonly exploited Actuator misconfiguration and the one with the most severe business impact.
/actuator/heapdump triggers a full JVM heap dump download — a binary file containing every object currently in memory including user session data, cached database records, in-memory PII, and decrypted credential values. Downloading this file gives an attacker a complete snapshot of your application's runtime memory state.
/actuator/threaddump reveals the current state of every JVM thread including stack traces that expose internal application structure, class names, and timing information useful for targeted attacks.
Additionally, /actuator/shutdown can terminate the application process entirely if enabled and exposed.
The fix: whitelist only the endpoints your monitoring stack actually needs — health, info, prometheus, loggers. Restrict each to the appropriate role in a dedicated SecurityFilterChain with @Order(1). Use Kubernetes NetworkPolicy to prevent these endpoints from being reachable outside the cluster's internal network.
FAQ · 8 QUESTIONS
Frequently Asked Questions
01
What is the difference between /actuator/health, /actuator/health/liveness, and /actuator/health/readiness?
The base /actuator/health endpoint returns the aggregated status of all health indicators — database connectivity, disk space, external APIs, message brokers, and any custom indicators. It is the full-picture health check.
Spring Boot 2.3 introduced health groups: /actuator/health/liveness checks only whether the JVM is responsive with no external dependency checks, and /actuator/health/readiness checks whether the application can serve traffic and includes dependency checks.
In Kubernetes, use liveness for the liveness probe — if it fails, Kubernetes restarts the pod. Use readiness for the readiness probe — if it fails, the pod is removed from the load balancer but not restarted. Never use the full /actuator/health for liveness probes — a temporary database blip would restart every pod simultaneously.
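The liveness and readiness paths are enabled automatically when Spring Boot detects that it is running in Kubernetes. In any other environment (local runs, Docker Compose, VMs), set management.endpoint.health.probes.enabled=true in application.yml; without it, both paths return 404.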
02
How do I create a custom Actuator endpoint beyond health checks?
Annotate a Spring component with @Endpoint(id = "yourEndpoint") and annotate its methods with @ReadOperation (GET), @WriteOperation (POST), or @DeleteOperation (DELETE). Once the ID is listed in management.endpoints.web.exposure.include, Spring Boot serves the endpoint at /actuator/yourEndpoint.
For web-only endpoints with no JMX exposure, use @WebEndpoint instead of @Endpoint.
Custom endpoints inherit the same security model as built-in ones — they appear in your exposure.include list and respect the SecurityFilterChain you define. Secure @WriteOperation methods with ADMIN role since they change production state. Common use cases: deployment metadata richer than /actuator/info, runtime feature flag management, cache invalidation triggers, and circuit breaker state display and control.
03
How does Prometheus scrape Spring Boot Actuator metrics?
Prometheus pulls metrics from your application's /actuator/prometheus endpoint at a configured interval (15 seconds is a common setting; Prometheus's built-in default scrape_interval is 1 minute). On the Spring Boot side: add the micrometer-registry-prometheus dependency and include prometheus in management.endpoints.web.exposure.include, which auto-configures the endpoint. On the Prometheus side: add a scrape_config block with metrics_path: '/actuator/prometheus' and your application's host and port.
For Kubernetes, use kubernetes_sd_configs with pod annotations instead of static targets. Pods annotated with prometheus.io/scrape: 'true' are then discovered automatically, provided your relabel_configs honor that annotation; it is a widely used convention, not built-in Prometheus behavior. Each scrape returns all current metric values in Prometheus exposition text format. Prometheus stores these as time series in its TSDB.
Common issue: /actuator/prometheus returns 404 despite being in exposure.include — check that micrometer-registry-prometheus is in your pom.xml. It is a separate dependency from spring-boot-starter-actuator.
04
Can I change log levels in a running Spring Boot application without restarting?
Yes. POST to /actuator/loggers/{package.name} with a JSON body of {"configuredLevel": "DEBUG"}. The change takes effect immediately — no restart, no redeploy. To reset to default, POST with {"configuredLevel": null}. Null means inherit from the parent logger — this is different from explicitly setting INFO, which overrides any future configuration changes.
For Hibernate SQL debugging: set org.hibernate.SQL to DEBUG for query text. For bind parameter values, set org.hibernate.type.descriptor.sql to TRACE on Hibernate 5, or org.hibernate.orm.jdbc.bind to TRACE on Hibernate 6.
Always secure POST with Spring Security restricting it to ADMIN role — TRACE logging generates gigabytes of output per minute and can fill disk quickly. Build the reset command into your incident runbook as a mandatory step, not an optional one.
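The set-and-reset calls described above can be sketched with the JDK's built-in HTTP client types. This is a minimal illustration, not an official client: the class name LoggerLevelRequests, the base URL, and the 5-second timeout are assumptions, and authentication is left out even though a secured endpoint would need it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

/** Builds (but does not send) requests for the /actuator/loggers endpoint. */
final class LoggerLevelRequests {

    private LoggerLevelRequests() {}

    /** JSON body for a level change; a null level resets the logger to inherit from its parent. */
    static String body(String level) {
        return level == null
                ? "{\"configuredLevel\": null}"
                : "{\"configuredLevel\": \"" + level + "\"}";
    }

    /** POST request that sets (or, with a null level, resets) a single logger. */
    static HttpRequest setLevel(String baseUrl, String loggerName, String level) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/actuator/loggers/" + loggerName))
                .timeout(Duration.ofSeconds(5)) // assumed value; tune for your network
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(level)))
                .build();
    }
}
```

To run it against a live instance, pass the request to HttpClient.newHttpClient().send(...) with a credential header for your ADMIN user. Building the reset request with a null level next to the DEBUG request makes it harder to forget the runbook's reset step.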
05
What information does the /actuator/info endpoint show, and how do I populate it?
By default /actuator/info returns an empty JSON object — it must be explicitly populated. Two sources of data:
Build metadata: add the spring-boot-maven-plugin's build-info execution goal. This generates build-info.properties at compile time containing artifact name, version, and build timestamp. Appears under the build key in the response.
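Git metadata: add the git-commit-id Maven plugin (io.github.git-commit-id:git-commit-id-maven-plugin; older builds use pl.project13.maven:git-commit-id-plugin). It writes the commit hash, branch name, and commit time into git.properties at build time. The data appears under the git key, with the hash abbreviated by default.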
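You can also add custom info via application.properties: info.app.description=Order processing service. These appear under the app key. On Spring Boot 2.6 and later the env info contributor is disabled by default, so also set management.info.env.enabled=true.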
Use this in CI/CD post-deploy verification: compare git.commit.id from the running instance against the commit hash your pipeline just built to catch ghost deployments.
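The post-deploy comparison could look roughly like the sketch below. It assumes the default response shape, where git.commit.id is a short hex string nested under "commit". The class name DeployedCommitCheck is hypothetical, and the regex only keeps the example dependency-free; a real pipeline would more likely use a JSON parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Post-deploy check: does the running instance report the commit the pipeline built? */
final class DeployedCommitCheck {

    // Matches the "id" string inside the "commit" object of the /actuator/info response.
    private static final Pattern COMMIT_ID =
            Pattern.compile("\"commit\"\\s*:\\s*\\{[^}]*?\"id\"\\s*:\\s*\"([0-9a-fA-F]+)\"");

    private DeployedCommitCheck() {}

    /** Extracts the reported commit hash, if the response contains one. */
    static Optional<String> deployedCommit(String infoJson) {
        Matcher m = COMMIT_ID.matcher(infoJson);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** True when the (possibly abbreviated) deployed hash is a prefix of the expected full hash. */
    static boolean matches(String infoJson, String expectedFullHash) {
        return deployedCommit(infoJson)
                .map(id -> expectedFullHash.toLowerCase().startsWith(id.toLowerCase()))
                .orElse(false);
    }
}
```

A pipeline step would fetch /actuator/info from the new instance, call matches(responseBody, builtCommitHash), and fail the deploy when it returns false. A missing git block also returns false, which catches builds where the plugin never ran.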
06
How do I secure Actuator endpoints in production?
Create a dedicated SecurityFilterChain with @Order(1) that matches /actuator/**. The @Order(1) ensures this chain has higher priority than your main application chain.
Permit /actuator/health/** and /actuator/info — needed by load balancers and Kubernetes probes without authentication. Restrict /actuator/prometheus to MONITORING role with HTTP Basic authentication. Restrict /actuator/loggers, /actuator/env, /actuator/heapdump, and /actuator/threaddump to ADMIN role. Deny everything else.
Also set management.endpoints.web.exposure.include to an explicit list (health, info, prometheus, loggers) and never use the wildcard. Disable CSRF for the actuator security chain since these endpoints are called by automated systems, not browsers. Set management.endpoint.health.show-details to when-authorized or never in application.yml.
07
What is the difference between a Counter, Gauge, and Timer in Micrometer?
A Counter tracks a monotonically increasing value — it only goes up. Use it for total requests, total errors, and orders placed. Query with rate(counter_total[5m]) in Prometheus to get events per second. A Counter is wrong for any value that can decrease.
A Gauge tracks a value that goes up and down. Use it for current queue depth, active connections, and heap usage. You report the current value on demand and Micrometer samples it on each Prometheus scrape. A Gauge is wrong for cumulative counts.
A Timer measures duration and count simultaneously. Use it for request latency, database query time, and payment processing duration. It always records count, total time, and maximum; percentiles are opt-in. Add publishPercentiles(0.5, 0.95, 0.99) to expose p50, p95, and p99 as series with a quantile label that you can alert on.
Choose the wrong type and you miss the signal: a Gauge for total orders loses the rate information, a Counter for connection pool size loses the current value, a Timer without percentiles hides the tail latency where the real problems live.
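To see why a Timer without percentiles hides the tail, here is a small plain-Java sketch that compares the mean with nearest-rank percentiles. It is an illustration only, not how Micrometer computes percentiles, and the class name LatencyPercentiles is invented for the example.

```java
import java.util.Arrays;

/** Compares the mean of latency samples with nearest-rank percentiles. */
final class LatencyPercentiles {

    private LatencyPercentiles() {}

    /** Nearest-rank percentile: the smallest sample with at least fraction p of samples at or below it. */
    static long percentile(long[] samplesMillis, double p) {
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean, which is all you can derive from a Timer's count and total time. */
    static double mean(long[] samplesMillis) {
        return Arrays.stream(samplesMillis).average().orElse(0.0);
    }
}
```

With 90 requests at 20 ms and 10 requests at 2000 ms, the mean is 218 ms and p50 is 20 ms, while p95 and p99 are both 2000 ms. The mean suggests a mildly slow service; the percentiles show that one request in ten takes two seconds.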
08
How do I handle Actuator in a Kubernetes environment with multiple replicas?
Configure all three probe types. For the startupProbe, use /actuator/health/liveness and set failureThreshold × periodSeconds to at least 2× your measured startup time. For the livenessProbe, use /actuator/health/liveness, which reflects JVM state only and runs no external dependency checks. For the readinessProbe, use /actuator/health/readiness with dependency checks.
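The startup probe sizing rule is simple arithmetic, sketched below. The class and parameter names are illustrative, and the safety factor of 2 in the examples comes from the guideline above.

```java
/** Sizes a Kubernetes startupProbe from a measured startup time. */
final class StartupProbeBudget {

    private StartupProbeBudget() {}

    /**
     * Smallest failureThreshold such that
     * failureThreshold * periodSeconds >= safetyFactor * measuredStartupSeconds.
     */
    static int failureThreshold(int measuredStartupSeconds, int periodSeconds, int safetyFactor) {
        if (periodSeconds <= 0) {
            throw new IllegalArgumentException("periodSeconds must be positive");
        }
        long budgetSeconds = (long) measuredStartupSeconds * safetyFactor;
        return (int) ((budgetSeconds + periodSeconds - 1) / periodSeconds); // integer ceiling
    }
}
```

For a service that starts in 45 seconds with periodSeconds: 10, the budget is 90 seconds, so failureThreshold is 9. At 47 seconds the budget is 94 seconds and the threshold rounds up to 10.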
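Spring Boot enables the liveness and readiness groups automatically when it detects Kubernetes. Set management.endpoint.health.probes.enabled=true in application.yml if you also need them elsewhere, such as in local testing; without it, those paths return 404 outside Kubernetes.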
For metrics scraping, use Prometheus kubernetes_sd_configs with pod annotations rather than static targets. Pods get new IPs on every restart, so static targets break on every rolling deploy. Add prometheus.io/scrape: 'true' and prometheus.io/port: '8080' annotations to your pod spec, and make sure your relabel_configs act on them.
Set management.metrics.tags.application=${spring.application.name} so metrics from different services are distinguishable in Prometheus and can be filtered or aggregated by service name in Grafana dashboards.