Advanced 6 min · May 23, 2026

Microservices Interview Questions That Actually Filter Senior Engineers

Q: What’s the difference between a circuit breaker and a bulkhead in resilience patterns?

A circuit breaker prevents calling a downstream service when it’s likely to fail. A bulkhead limits the number of concurrent calls to a downstream service to prevent resource exhaustion in the caller. Think of a circuit breaker as a fuse that opens after too many failures. A bulkhead is like a separate thread pool per downstream service so that one slow service doesn’t consume all available threads. In practice, you use both together.
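To make the bulkhead half of that answer concrete, here is a minimal sketch using only the JDK. The class name SimpleBulkhead and its call method are hypothetical names for illustration, not Resilience4j's API; the point is only that a full compartment rejects immediately instead of queuing more callers.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Minimal semaphore bulkhead (hypothetical name): caps concurrent calls to one downstream. */
final class SimpleBulkhead {
    private final Semaphore permits;

    SimpleBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free; otherwise returns the rejection value immediately. */
    <T> T call(Supplier<T> downstreamCall, T rejectedValue) {
        if (!permits.tryAcquire()) {
            return rejectedValue; // fail fast instead of queuing more callers
        }
        try {
            return downstreamCall.get();
        } finally {
            permits.release(); // always hand the permit back, even if the call throws
        }
    }
}
```

A circuit breaker would sit around the same call and decide whether to attempt it at all; the bulkhead only decides how many attempts may be in flight at once.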

Q: How do I implement distributed tracing in Spring Boot 3.x without Sleuth?

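Use Micrometer Tracing, which Spring Boot 3.x auto-configures through spring-boot-starter-actuator once a tracer bridge is on the classpath. Add io.micrometer:micrometer-tracing-bridge-otel for OpenTelemetry. Configure a span exporter for a backend such as Zipkin or Jaeger. For async context propagation, use a TaskDecorator built on ContextSnapshot. For Kafka, the old spring.sleuth.* properties no longer apply; enable observation with spring.kafka.template.observation-enabled=true and spring.kafka.listener.observation-enabled=true. Virtual threads do not inherit ThreadLocal-based context on their own, so the same decorator is still needed.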

Q: When should I use a saga instead of a distributed transaction?

Almost always, once the transaction crosses service boundaries. Distributed transactions (2PC, JTA) fit microservices poorly because they hold locks across the network and reduce availability. Sagas trade atomicity and isolation for availability. Use a saga when a business process spans multiple services and must either complete fully or be compensated consistently. Examples: order-to-payment-to-inventory, booking flights and hotels.

Q: What is the strangler fig pattern?

A migration pattern for moving from a monolith to microservices. You incrementally replace parts of the monolith with new microservices. The monolith still handles some requests; the microservices handle others. You route traffic via a proxy or API gateway. Over time, the microservices “strangle” the monolith until it’s fully replaced. This reduces risk compared to a big-bang rewrite.

Q: How do you handle idempotency in a microservices system?

Use an idempotency key (a unique string, often a UUID) sent with every mutating request. The server checks if it’s already processed this key. If yes, return the previous response. If no, process the request and store the key with the result. Use a database table with a unique constraint on the key column. For sagas, use the saga ID as the idempotency key for each step and compensation.
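The check-then-store flow above can be sketched in a few lines. This is an in-memory stand-in, assuming a single JVM; IdempotencyStore and handle are hypothetical names, and the map plays the role of the database table with a unique constraint on the key column.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** In-memory stand-in (hypothetical) for a table with a unique constraint on the idempotency key. */
final class IdempotencyStore {
    private final Map<String, String> responsesByKey = new ConcurrentHashMap<>();

    /** Processes the request once per key; replays the stored response on repeats. */
    String handle(String idempotencyKey, Supplier<String> processRequest) {
        // computeIfAbsent runs the supplier at most once per key, even under concurrent calls.
        // Note: a null response would not be stored, so the request would run again.
        return responsesByKey.computeIfAbsent(idempotencyKey, key -> processRequest.get());
    }
}
```

In a real service the store has to be shared across instances and written in the same transaction as the business change; the map only shows the shape of the check.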

Stop asking what a circuit breaker is.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Lessons pulled from things that broke in production.

✓ Production

production tested

July 04, 2026

last updated

1,697

articles · all by Naren

Before you start · ⏱ 30 min

✓ Deep production experience
✓ Understanding of internals and trade-offs
✓ Experience debugging complex systems


⚡Quick Answer

Distributed transactions don't exist; you need sagas or eventual consistency
Service mesh adds latency; don't deploy it until you have 20+ services
Circuit breakers hide failures; always log the fallback reason
Readiness checks must be distinct from liveness probes; don't mix them
Request tracing is useless without proper span propagation in async code

✦ Definition · ~90s read

What Are Microservices Interview Questions?

Microservices architecture breaks a monolith into independently deployable services. Each service owns its data, communicates over the network, and fails in isolation. That's the theory. In production, you trade code complexity for infrastructure complexity.


You get independent scaling, team autonomy, and faster deploys. You also get network latency, partial failures, and debugging nightmares. Spring Boot 3.x with its virtual threads and native support helps, but it doesn't fix bad architecture. You need to understand patterns like sagas, CQRS, and strangler fig.

Most interview questions test buzzwords. The real test is how you debug a cascading failure at 3 AM.

Plain-English First

Imagine a restaurant kitchen where each chef only makes one dish. When an order comes in, the chefs pass tickets back and forth. If one chef is slow, the whole kitchen backs up. Microservices work the same way — services talk to each other, and when one fails, the failure cascades. You need systems to detect slow chefs (circuit breakers), track the full order (distributed tracing), and decide when to just tell the customer the kitchen is busy (graceful degradation).

Last Thursday, 2:47 AM. My phone buzzed with a PagerDuty alert. Order service latency spiked to 30 seconds. Users were seeing 503s across the checkout flow. My first thought: database is slow. Second thought: someone pushed a bad query. Third thought: the truth. The circuit breaker on the payment service had opened, but the fallback method in the order service threw a NullPointerException. Nobody caught it because the logs only showed “Circuit breaker OPEN” — the team assumed that was normal. That incident cost us $40k in lost revenue and a pissed-off VP of Engineering. Here’s the thing: circuit breakers, bulkheads, retries — these patterns are supposed to make your system resilient. But every abstraction leaks. If you don’t understand what happens when the fallback itself fails, you’re building a house of cards. This article isn’t theory. It’s the questions I ask when hiring. It’s the things I’ve seen break. It’s the production debug steps that actually move the needle. Stop reading if you want textbook answers. Keep reading if you want to survive a real outage.

Circuit Breakers: The Fallback That Broke Production

Everyone on my team knew “circuit breaker prevents cascading failures.” They aced the interview question. Then they coded a fallback that returned null. The calling code didn’t check for null. NullPointerException. 500 to the customer. The circuit breaker was working exactly as configured — it stopped calls to the failing downstream. But it didn’t stop the fallback from failing. That’s the part nobody teaches. In Spring Boot 3.x with Resilience4j, the fallback method is just another bean method. It can throw exceptions. It can have its own bugs. You must test the fallback as thoroughly as the primary method. Use @CircuitBreaker(name = "service", fallbackMethod = "fallback") and write a unit test that opens the breaker and calls the method. Assert that fallback returns a valid response. Log the fallback invocation and result. Use a specific log level (WARN) so you can alert on it. And for the love of god, don’t swallow the exception in the fallback. Log it. Your 3 AM self will thank you.
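Here is a rough sketch of what "test and guard the fallback" can look like, using only the JDK so it runs anywhere. GuardedFallback is a hypothetical helper, not part of Resilience4j; in a Spring service the same try/catch and logging would live inside the method you name in fallbackMethod.

```java
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: a broken or null-returning fallback still yields a safe default. */
final class GuardedFallback {
    private static final Logger LOG = Logger.getLogger(GuardedFallback.class.getName());

    static <T> T call(Supplier<T> primary, Function<Throwable, T> fallback, T safeDefault) {
        try {
            return primary.get();
        } catch (RuntimeException primaryFailure) {
            // WARN level so the fallback invocation itself can be alerted on
            LOG.log(Level.WARNING, "Primary call failed, invoking fallback", primaryFailure);
            try {
                T result = fallback.apply(primaryFailure);
                if (result == null) {
                    LOG.warning("Fallback returned null, using safe default");
                    return safeDefault;
                }
                return result;
            } catch (RuntimeException fallbackFailure) {
                // The fallback itself broke: log it loudly instead of swallowing it
                LOG.log(Level.SEVERE, "Fallback failed", fallbackFailure);
                return safeDefault;
            }
        }
    }
}
```

The three branches (fallback works, fallback returns null, fallback throws) are exactly the cases a unit test for your real fallback should cover.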

Production Trap:

If your fallback method throws an exception, Resilience4j does NOT catch it. The exception propagates to the caller. You’ll get a 500, and the circuit breaker stays open because the failure was NOT in the primary call — it was in the fallback. The circuit breaker never learns about it. You’re now silently returning errors.

Production Insight

I’ve seen teams spend weeks tuning circuit breaker thresholds when the real fix was a null check in the fallback.

Key Takeaway

Test the fallback method as a first-class citizen. Add an actuator endpoint that lets you manually open the circuit breaker and verify behavior in staging.


Distributed Tracing: Why Your Spans Disappear Into the Void

You set up Spring Cloud Sleuth (replaced by Micrometer Tracing in Spring Boot 3.x). You see traces across your services. Then you switch to async processing. Spans disappear. You check logs: no trace ID. You check Zipkin: incomplete traces. The culprit is thread pool executors that don’t propagate the tracing context. When you use @Async or CompletableFuture.supplyAsync(), the worker thread doesn’t inherit the parent’s trace ID. Micrometer Tracing relies on io.micrometer.context.ContextSnapshot to capture and restore that context, and you close the gap by configuring a TaskDecorator that does the capture at submit time. Virtual threads (Project Loom) don’t remove the need for this: a new virtual thread does not inherit plain ThreadLocal values, so the executor still has to be decorated. But here’s the real kicker: if you’re using Kafka, the consumer side won’t continue the trace automatically. The old spring.sleuth.* properties no longer apply in 3.x; you turn on observation with spring.kafka.template.observation-enabled=true and spring.kafka.listener.observation-enabled=true. Even then, if your Kafka listener uses a manual ack mode or retry template, the headers can get lost. I’ve debugged this exact problem at 2 AM. The fix is to manually forward the trace context in the producer’s headers and reconstruct it on the consumer side using an interceptor.
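The capture-and-restore idea behind a TaskDecorator can be shown without any tracing library. The sketch below is a JDK-only stand-in: TraceContext and propagate are hypothetical names, and the ThreadLocal plays the role of the real tracing context. In a Spring Boot service the same wrapping would be done by a TaskDecorator built on ContextSnapshot.

```java
/** Hypothetical stand-in for a ThreadLocal tracing context plus a decorator that carries it across threads. */
final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) { TRACE_ID.set(traceId); }
    static String get() { return TRACE_ID.get(); }
    static void clear() { TRACE_ID.remove(); }

    /** Captures the caller's trace ID now and restores it around the task on the worker thread. */
    static Runnable propagate(Runnable task) {
        String captured = TRACE_ID.get(); // evaluated on the submitting thread
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                task.run();
            } finally {
                // Restore whatever the pooled thread had, so contexts do not leak between tasks
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }
}
```

Submitting a bare Runnable to an executor sees no trace ID on the worker thread; submitting TraceContext.propagate(task) sees the caller's ID. That difference is the whole bug described above.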

Senior Shortcut:

Virtual threads (spring.threads.virtual.enabled=true, Spring Boot 3.2+) remove the thread pool sizing problem, but they do not propagate context for you: a new virtual thread starts with empty ThreadLocals, so keep the context-propagating TaskDecorator on the executor. On JDK 21 through 23, also avoid synchronized blocks around blocking calls; they pin the virtual thread to its carrier and ruin throughput. JDK 24 removes most of that pinning.

Production Insight

If your trace has gaps longer than 500ms, you’re losing context in async code. Don’t blame the network.

Key Takeaway

Distributed tracing is only as good as your context propagation. Test with async methods, Kafka consumers, and scheduled tasks. If the trace doesn’t connect, you don’t have observability.

Health Checks: Liveness vs. Readiness — You’re Doing It Wrong

I see this in every interview. Candidate says: “We use Spring Boot Actuator for health checks.” I ask: “What happens when your database goes down?” They say: “The health check fails and Kubernetes restarts the pod.” Wrong. That’s the exact opposite of what you want. If your database goes down, you don’t want Kubernetes to kill your pod. You want the pod to stay alive, return 503s, and wait for the database to recover. Killing the pod just shifts the problem to the new pod: it will also fail health checks and get killed. That’s a restart loop. The right pattern: liveness probe checks JVM health (heap, thread deadlocks). Readiness probe checks downstream dependency health (database, Redis, downstream services). When the database is down, the readiness probe fails, and Kubernetes removes the pod from the Service’s endpoints. No traffic hits it. When the database recovers, the readiness probe succeeds, and the pod starts receiving traffic. No restarts. No cascading failures. In Spring Boot 3.x, you configure this with health groups rather than annotations: set management.endpoint.health.probes.enabled=true to expose /actuator/health/liveness and /actuator/health/readiness, then add dependency indicators to the readiness group only, for example management.endpoint.health.group.readiness.include=readinessState,db. LivenessState and ReadinessState are availability enums that you publish through AvailabilityChangeEvent, not annotations.
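The split is easier to reason about as code. This is a plain-Java model, not Actuator's API; ProbeChecks and its method names are hypothetical, and the suppliers stand in for whatever real checks you run.

```java
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical model of the two probes: liveness looks only at the process, readiness adds dependencies. */
final class ProbeChecks {
    private final BooleanSupplier processHealthy;         // e.g. no deadlock detected
    private final List<BooleanSupplier> dependencyChecks; // e.g. database ping, Redis ping

    ProbeChecks(BooleanSupplier processHealthy, List<BooleanSupplier> dependencyChecks) {
        this.processHealthy = processHealthy;
        this.dependencyChecks = List.copyOf(dependencyChecks);
    }

    /** Failing this tells the platform to restart the pod, so it must ignore dependencies. */
    boolean liveness() {
        return processHealthy.getAsBoolean();
    }

    /** Failing this only takes the pod out of rotation until dependencies recover. */
    boolean readiness() {
        return liveness() && dependencyChecks.stream().allMatch(BooleanSupplier::getAsBoolean);
    }
}
```

With the database down, liveness() stays true and readiness() turns false, which is the "stay alive, stop taking traffic" behavior described above.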

Interview Gold:

The most common mistake is having the liveness probe call the same health check as readiness. If they’re the same, a database failure kills the pod. Always separate them. A senior engineer explains why, not just what.

Production Insight

I’ve seen a production outage where 200 pods were restarting every 30 seconds because the liveness check included a downstream Redis call that was slow. The fix: rewrite liveness to only check JVM metrics.

Key Takeaway

Liveness = Is the JVM alive? Readiness = Is the service ready to accept traffic? Never conflate them.


Database Connections: The Silent Throttle in Your Microservices

Your payment service has a HikariCP pool of 10 connections. Your Tomcat thread pool has 200 threads. You get higher traffic, and suddenly all 10 database connections are held by slow queries. The remaining 190 threads are stuck waiting for a connection. Your latency spikes to 10 seconds. No CPU spike. No memory error. Just a wall of “Connection is not available, request timed out after 30000ms.” The fix isn’t to increase the pool size blindly. If you set pool size to 200, you’ll just overwhelm the database with 200 concurrent connections. The real fix: size the pool for the number of queries the database can actually execute concurrently at peak, plus a small buffer. Use a tool like pgbouncer or RDS Proxy to multiplex connections. Or use a reactive driver (R2DBC), which still needs connections but doesn’t park a thread while waiting for one. But here’s the trick most people miss: if you’re using virtual threads, the connection pool is even more important because virtual threads don’t block the OS thread, but they still block on the database connection. You can have thousands of virtual threads waiting on a pool of 10 connections. The pool becomes the bottleneck. Monitor the HikariCP metrics actively: hikaricp.connections.active, hikaricp.connections.pending, hikaricp.connections.timeout. Alert on pending connections > 0 for more than 5 seconds.

Never Do This:

Setting maximumPoolSize to 200 “because we have 200 threads.” Bad idea. Database connections are expensive. A common starting point, from HikariCP’s pool-sizing guidance, is (core_count * 2) + effective_spindle_count, where the cores are the database server’s. For a 4-core database with one disk, start with 9. Measure. Adjust.
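The rule of thumb is simple enough to write down. This is a sketch of the arithmetic only, assuming the formula from HikariCP's pool-sizing guidance; PoolSizing and startingPoolSize are hypothetical names, and the result is a first guess to measure against, not a tuned value.

```java
/** Hypothetical helper for the starting-point formula: (core_count * 2) + effective_spindle_count. */
final class PoolSizing {
    static int startingPoolSize(int dbCoreCount, int effectiveSpindleCount) {
        if (dbCoreCount < 1 || effectiveSpindleCount < 0) {
            throw new IllegalArgumentException("core count must be >= 1 and spindle count >= 0");
        }
        // Cores are the database server's cores, not the application's
        return dbCoreCount * 2 + effectiveSpindleCount;
    }
}
```

For example, PoolSizing.startingPoolSize(4, 1) returns 9, the figure quoted for a 4-core machine.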

Production Insight

I watched a team increase pool size from 10 to 100 to fix latency. Latency got worse because the database couldn’t handle 100 concurrent connections. The throughput dropped 40%. The fix was a 2-line config change: set pool size back to 10 and add a read replica.

Key Takeaway

Connection pool size is not a tuning knob for throughput. It’s a safety valve. Keep it small. Scale horizontally instead.

Sagas: The Distributed Transaction Myth

Business says: “Create order, deduct inventory, charge payment — all or nothing.” Developers hear: “Distributed transaction.” They reach for two-phase commit (2PC) or try to use JTA across microservices. Both are terrible ideas. 2PC locks resources for the duration of the transaction. If any participant fails, you have in-doubt transactions that require manual recovery. In microservices, you use sagas — a sequence of local transactions with compensating actions for rollback. Orchestration-based sagas use a central coordinator (like Axon or Temporal). Choreography-based sagas use events (each service publishes an event when done, another service reacts). Both patterns trade consistency for availability. You face the problem of handling the “last mile”: the saga says “payment captured” but then the inventory service crashes before updating stock. You’re in an inconsistent state. The solution: idempotency keys and retry queues. Each step must be idempotent. Use a unique saga ID in every request. Store the saga state in a database. Use a dead-letter queue for failed steps, with a retry mechanism and exponential backoff. And always design compensating actions that are idempotent too.

The Classic Bug:

Compensating actions must be commutative. If you’re compensating a compensation, you get infinite loops. Use a saga state machine with explicit transitions to prevent cycles.

Production Insight

I’ve seen a saga that refunded the same payment 5 times because the inventory service failed 5 times, each time triggering a refund compensation. The fix: check if the compensation was already applied before executing.
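That "check before executing" guard might look like the sketch below. It is an in-memory illustration with hypothetical names (CompensationLog, compensateOnce); a production saga would record the marker in its state store, ideally in the same transaction as the compensation itself.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory record of which saga compensations have already run. */
final class CompensationLog {
    private final Set<String> applied = ConcurrentHashMap.newKeySet();

    /** Runs the compensation only the first time this saga/step pair is seen. Returns true if it ran. */
    boolean compensateOnce(String sagaId, String step, Runnable compensation) {
        // Set.add is atomic here: only one caller wins for a given saga/step pair
        if (!applied.add(sagaId + ":" + step)) {
            return false; // already compensated, skip
        }
        compensation.run();
        return true;
    }
}
```

Five inventory failures for the same saga then trigger one refund, not five. Note the simplification: if the compensation throws, this sketch still marks it as applied, which a real implementation must handle.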

Key Takeaway

Sagas work when every step is idempotent, retryable, and has a well-defined compensating action. If you can’t achieve that, you need a different architecture.

Inter-Service Communication: REST vs. gRPC vs. Events — When Each Breaks

REST is the default. It’s simple. But it shows its limits under load. Latency adds up across hops. Serialization costs CPU. There is no built-in retry or backpressure. gRPC fixes some of this: binary protocol, streaming, deadline propagation. But it introduces its own problems: its long-lived HTTP/2 connections multiplex many requests, so a connection-level load balancer can pin most traffic to a few backends unless you balance per request. And gRPC doesn’t shed load for you: if a service is overwhelmed, it sends back RESOURCE_EXHAUSTED, and the caller has to decide what to do. Events (Kafka, RabbitMQ) solve the coupling problem. Services don’t call each other; they publish events that other services consume. This gives you async processing and inherent buffering. But it introduces complexity: event ordering, at-least-once vs. exactly-once semantics, and schema evolution. The trade-off is clear: synchronous calls (REST/gRPC) are for real-time interactions where you need a response. Events are for background processing and eventual consistency. In a real system, you use both. And you need a retry mechanism for both. Don’t assume REST will work because you added a timeout. I’ve seen a payment gateway timeout cause a 10-second thread block on every request, cascading through 5 services. The fix: set a 2-second call timeout, put a circuit breaker around the call, and test both.
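The "don't wait on a slow dependency" part can be sketched with the JDK alone. FailFast and callWithDeadline are hypothetical names; this bounds how long the caller waits, which is only one piece of the fix, and it is not a circuit breaker.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper that bounds how long a caller waits on a slow dependency. */
final class FailFast {
    /** Returns the call's result, or the fallback if the call fails or misses the deadline. */
    static <T> T callWithDeadline(Supplier<T> slowCall, long timeoutMillis, T fallback) {
        return CompletableFuture.supplyAsync(slowCall)
                .completeOnTimeout(fallback, timeoutMillis, TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback)
                .join();
    }
}
```

One caveat with this approach: the timed-out call keeps running in the background until it finishes on its own. The caller is released, but the downstream request is not cancelled, so a client-level timeout on the HTTP or gRPC call is still needed.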

Senior Shortcut:

For internal services, use gRPC with a deadline of 2 seconds and circuit breaker. For external-facing APIs, use REST with rate limiting. For async work, use Kafka. Don’t mix the patterns: a Kafka consumer should never make a synchronous REST call that blocks the consumer thread.

Production Insight

I replaced a REST endpoint with Kafka events between two services. Latency went from 600ms to 5ms. The trade-off: eventual consistency. The payment confirmation now arrives 1–5 seconds after the order. Marketing hated it. Engineering loved it. We kept both: a REST call for immediate response, then an event for the actual processing.

Key Takeaway

Use synchronous communication for reads and immediate writes. Use events for everything else. The hardest part is saying “no” to a synchronous call that a business stakeholder insists on.

Configuration Management: Why Your Dev Environment Works but Production Doesn't

You've seen it. The app that hums along locally but combusts in staging. Nine times out of ten, it's configuration. Not code. In a monolith, you had one application.properties. In microservices, you have dozens of services, each needing database URLs, feature flags, encryption keys. The moment you hardcode a value, you've created a time bomb.

Spring Cloud Config or Kubernetes ConfigMaps are the standard answers. But the real trick is layering: externalize everything that changes between environments. Use Spring Profiles to override defaults. Never put secrets in Git. Use Vault or sealed secrets.

The WHY is simple: configuration is the number one source of silent failures. A missing config key doesn't throw a compile error. It throws a NullPointerException in production at 3 AM. Your deployment pipeline should fail fast if required config keys are missing. Use @ConfigurationProperties with @Validated to catch this at startup, not runtime.

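DatabaseConfig.java (Java)

// io.thecodeforge - java tutorial
package com.thecodeforge.shipment.config;

import jakarta.validation.constraints.NotBlank;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.validation.annotation.Validated;

// Needs spring-boot-starter-validation on the classpath, and registration via
// @ConfigurationPropertiesScan or @EnableConfigurationProperties(DatabaseConfig.class).
@Validated
@ConfigurationProperties(prefix = "app.database")
public class DatabaseConfig {

    @NotBlank(message = "Database URL is required. Check your environment variables.")
    private String url;

    private int maxPoolSize = 10; // sensible default

    // JavaBean binding needs setters; getters expose the bound values
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }
    public int getMaxPoolSize() { return maxPoolSize; }
    public void setMaxPoolSize(int maxPoolSize) { this.maxPoolSize = maxPoolSize; }
}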

Output

APPLICATION FAILED TO START
***************************

Description:

Binding to target org.springframework.boot.context.properties.bind.BindException: Failed to bind properties under 'app.database' to com.thecodeforge.shipment.config.DatabaseConfig failed:

    Property: app.database.url
    Value: null
    Reason: Database URL is required. Check your environment variables.

Action:

Update your application's configuration

Production Trap:

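Don't rely on @Value annotations for required config. A bare @Value("${key}") does fail at startup, but the common @Value("${key:}") form with a default silently masks a missing key, and neither form validates the value. Use @ConfigurationProperties with validation. Your CI pipeline should reject deploys that lack required keys.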

Key Takeaway

Fail fast on missing configuration at startup. Externalize everything environment-specific. Never let a late-night config typo become a P1 incident.

Service Discovery: How Your Pods Learn to Find Each Other Without Hard-Coded IPs

In a monolith, you knew where everything lived. One server, one port, one database. In microservices, services die, scale up, and move across nodes constantly. Hard-coding service addresses is like giving someone directions with landmarks that change every hour.

Service discovery solves this with a registry. When a service starts, it registers itself with a registry (Eureka, Consul, or Kubernetes DNS). When another service needs to talk to it, it asks the registry for a healthy instance. The client gets a dynamic list of endpoints and load-balances across them.

The WHY: without service discovery, a failed instance means dead traffic. With it, the registry removes unhealthy instances automatically. But here's where it gets spicy: caching. If your client caches the registry response too long, it'll route to dead pods. Too short, and you DDoS the registry. A 30-second cache TTL with a circuit breaker on registry calls is a reasonable starting point.

Spring Cloud DiscoveryClient abstracts this. Use @LoadBalanced RestTemplate to get automatic service-aware calls. But always add a fallback: what happens if the registry is down? Your services should have a last-known-good cache, not throw 500s.

OrderServiceClient.java (Java)

// io.thecodeforge - java tutorial
package com.thecodeforge.inventory.client;

import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.discovery.DiscoveryClient;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;
import java.util.List;

@Component
public class OrderServiceClient {

    private final DiscoveryClient discoveryClient;
    private final RestTemplate restTemplate; // plain RestTemplate: the URL below uses a resolved host

    public OrderServiceClient(DiscoveryClient discoveryClient, RestTemplate restTemplate) {
        this.discoveryClient = discoveryClient;
        this.restTemplate = restTemplate;
    }

    public String fetchOrder(String orderId) {
        // Ask the registry for the currently registered instances
        List<ServiceInstance> instances = discoveryClient.getInstances("order-service");
        if (instances.isEmpty()) {
            // Last-known-good cache or fallback logic here
            throw new RuntimeException("No healthy order-service instances found. Registry might be down.");
        }
        // Naive choice for illustration: always the first instance, no load balancing
        ServiceInstance instance = instances.get(0);
        // getUri() carries scheme, host, and port; RestTemplate expands and encodes {orderId}
        String url = instance.getUri() + "/orders/{orderId}";
        return restTemplate.getForObject(url, String.class, orderId);
    }
}

Output

curl http://inventory-service:8080/orders/123

Response: 200 OK
{
  "orderId": "123",
  "status": "SHIPPED",
  "serviceSource": "order-service-7f8c9e0a1b"
}

Tactical Note:

Don't use Eureka for inter-service calls in Kubernetes. Kubernetes already has a built-in DNS-based service discovery with health probes. Use it. Adding Eureka on top is redundant complexity. The K8s API server is your registry.

Key Takeaway

Service discovery decouples your services from infrastructure topology. Cache responses with a short TTL, and always have a fallback plan when the registry itself is unavailable.
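A short TTL plus a last-known-good fallback fits in one small class. This is a JDK-only sketch under stated assumptions: CachedRegistry is a hypothetical name, the Supplier stands in for the real registry call, and the injected clock exists only to make the TTL testable.

```java
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical TTL cache over a registry lookup that keeps serving the last good answer. */
final class CachedRegistry {
    private final Supplier<List<String>> registryLookup; // stand-in for the real registry call
    private final LongSupplier clockMillis;
    private final long ttlMillis;
    private List<String> lastKnownGood = List.of();
    private long fetchedAt;
    private boolean hasFetched;

    CachedRegistry(Supplier<List<String>> registryLookup, LongSupplier clockMillis, long ttlMillis) {
        this.registryLookup = registryLookup;
        this.clockMillis = clockMillis;
        this.ttlMillis = ttlMillis;
    }

    synchronized List<String> instances() {
        long now = clockMillis.getAsLong();
        if (hasFetched && now - fetchedAt < ttlMillis) {
            return lastKnownGood; // still fresh, do not hit the registry
        }
        try {
            List<String> fresh = registryLookup.get();
            if (!fresh.isEmpty()) {
                lastKnownGood = List.copyOf(fresh);
                fetchedAt = now;
                hasFetched = true;
            }
        } catch (RuntimeException registryDown) {
            // Registry unavailable: fall through and serve the last known good list
        }
        return lastKnownGood;
    }
}
```

The design choice worth noting is that an empty or failed lookup never overwrites a good list. The cost is that a genuinely empty service can look alive for a while, which is why the calls themselves still need timeouts and a circuit breaker.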

● Production incident · POST-MORTEM · severity: high

The Circuit Breaker That Killed Checkout

Symptom

Users saw “Something went wrong” on checkout. Logs showed a periodic “CircuitBreaker ‘paymentService’ is OPEN and does not permit further calls” message. The team thought this was normal circuit breaker behavior. Actually, the fallback function threw a NullPointerException because a dependency injection was lazy-initialized.

Assumption

First thought: Payment service is down. Second thought: Rate limiting on the payment gateway. Third thought: Network issue between pods.

Root cause

The circuit breaker on payment service opened after 5 consecutive timeouts. The fallback method called a utility class that was lazily initialized. Because the circuit breaker was open, the fallback ran on every request. The lazy init happened on the first call, and that call ran fine. But the utility had a bug: it modified a shared HashMap without synchronization, causing a ConcurrentModificationException on subsequent calls. This uncaught exception bubbled up, causing a 500 to the client.

Fix

1) Changed the utility class to use a synchronized block or ConcurrentHashMap.compute()
2) Wrapped the fallback logic in try-catch with explicit error logging
3) Added a health check endpoint that tests the fallback path
4) Changed circuit breaker configuration from COUNT_BASED to TIME_BASED with a 10-second window
5) Wrote a unit test that opens the circuit breaker and verifies the fallback response
6) Added a Prometheus alert on circuit breaker open state duration > 30 seconds

Key lesson

A circuit breaker’s fallback is not a safety net.
It’s a code path that must be tested as rigorously as the primary path.
Log the fallback result — both success and failure.
Your pager will thank you.

Production debug guide

Symptom -> root cause -> fix for the failures that actually happen · 4 entries

Symptom · 01

Service returns 503 but health endpoint returns 200

→

Fix

Check if the health endpoint is isolated from dependencies. Most teams configure Spring Boot actuator health to ping databases and downstream services. If the real endpoint relies on a downstream service but health doesn’t, you get false positives. Fix: Create separate liveness and readiness probes: liveness = basic JVM check, readiness = downstream dependency check. In Spring Boot 3.x, set management.endpoint.health.probes.enabled=true and add the dependency indicators to the readiness health group only.

Symptom · 02

Slow response times but no CPU or memory spike

→

Fix

Check thread pool exhaustion. If you’re using Tomcat, look for thread dumps showing many threads in BLOCKED or WAITING state. Common cause: blocking calls inside a reactive pipeline, or virtual threads pinned by synchronized blocks. Fix: Switch to virtual threads with Spring Boot 3.2+, but avoid synchronized. Use ReentrantLock or structured concurrency. Also check database connection pool size — if it’s smaller than Tomcat’s thread pool, you get connection starvation.

Symptom · 03

Distributed trace shows gaps between spans

→

Fix

Context propagation failed. This happens when work moves to another thread without the tracing context. Common in async methods (@Async), CompletableFuture, or Kafka listeners. Fix: Use Spring’s TaskDecorator to propagate MDC and tracing context. In Spring Boot 3.x, register a TaskDecorator bean that captures the current context so the auto-configured task executor applies it. For Kafka, enable spring.kafka.template.observation-enabled=true and spring.kafka.listener.observation-enabled=true, and propagate headers manually wherever a retry or custom listener path drops them.

Symptom · 04

Database deadlock in a saga

→

Fix

Two services updated the same row in different order. Sagas don’t use distributed transactions; they rely on eventual consistency and compensation. If you get a deadlock, it means two services are competing for the same resource without coordination. Fix: Use optimistic locking with version columns. Or redesign your saga to use a “one writer” pattern per entity. Add retry logic with exponential backoff on OptimisticLockException. For true database deadlocks, read the deadlock report in the PostgreSQL server log and inspect pg_locks joined with pg_stat_activity to see which sessions were waiting on each other.
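The retry-with-backoff step could be sketched like this. BackoffRetry is a hypothetical helper using only the JDK; it retries on any RuntimeException for brevity, whereas real code should retry only on the optimistic-lock failure and let everything else propagate.

```java
import java.util.function.Supplier;

/** Hypothetical retry helper: waits base, 2*base, 4*base, ... between attempts. */
final class BackoffRetry {
    static <T> T retry(Supplier<T> action, int maxAttempts, long baseDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = baseDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException failure) {
                if (attempt >= maxAttempts) {
                    throw failure; // out of attempts: surface the last failure
                }
                sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted during backoff", e);
        }
    }
}
```

Adding random jitter to the delay is a common refinement when many callers retry at once; it is left out here to keep the sketch short.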

★ Debug Cheat Sheet

Commands for fast diagnosis in production

Circuit breaker opens too frequently

Immediate action

Check circuit breaker state and failure count

Commands

kubectl exec -it <pod> -- curl -s http://localhost:8081/actuator/health | jq '.components.circuitBreakers'

kubectl logs --tail=100 <pod> | grep -E 'CircuitBreaker|Hystrix|resilience4j' | grep -v 'closed'

Fix now

Increase slidingWindowSize from 10 to 20 in application.yml: resilience4j.circuitbreaker.configs.default.slidingWindowSize: 20


Inter-Service Communication Patterns

Characteristic	REST/HTTP	gRPC	Events (Kafka)
Latency per call	10-100ms	1-10ms	0.5-5ms (async)
Throughput	Moderate (HTTP overhead)	High (binary)	Very high (batch)
Coupling	Tight (client needs endpoint)	Tight (client needs proto)	Loose (event schema)
Error handling	HTTP status codes	gRPC status codes	Retry + DLQ pattern
Idempotency	Manual (idempotency key)	Manual (idempotency key)	Manual (at-least-once delivery; dedupe in the consumer)
Backpressure	None (pessimistic)	RESOURCE_EXHAUSTED	Partition lag (monitoring)
Streaming	No (polling or WebSocket)	Yes (bidirectional)	Yes (consumer group)


Key takeaways

Circuit breaker fallbacks are code paths that must be tested, logged, and alerted on. A failing fallback is worse than no fallback.

Liveness and readiness probes are not interchangeable. Kubernetes uses liveness to restart pods; readiness to route traffic. One serves the platform, the other serves the customer.

Database connection pool size should be derived from what the database can serve (roughly its core count), not from your thread pool size. Oversizing the pool kills the database. Add replicas for scale, not bigger pools.

Sagas work only when compensations are idempotent and commutative. If you can refund the same payment twice, you haven’t designed a saga; you designed a bug.

Distributed tracing is useless without proper context propagation. Test with async, Kafka, and scheduled tasks. If the trace doesn’t connect, you’re flying blind.

Common mistakes to avoid

5 patterns

Using @Async without configuring a TaskDecorator for trace context

Symptom

Distributed traces show gaps or missing spans after async boundaries

Fix

Configure a TaskDecorator that captures ContextSnapshot from Micrometer and restores it on the worker thread. This still applies with virtual threads (spring.threads.virtual.enabled=true): they don’t inherit ThreadLocal context on their own.

Same health check for liveness and readiness probes

Symptom

Pods restart in a loop when database goes down

Fix

Separate probes: liveness checks JVM health (heap, deadlocks), readiness checks downstream dependencies

Circuit breaker fallback method throws an exception

Symptom

Client receives 500 even though circuit breaker is open, but logs only show “Circuit breaker OPEN”

Fix

Wrap fallback in try-catch, return a safe default, log the exception. Never let fallback throw unchecked exceptions.

Saga compensating action is not idempotent

Symptom

Refund is applied multiple times, duplicate compensations

Fix

Use a saga state repository to track which compensations were already applied. Check before executing compensation.

Setting HikariCP maximumPoolSize equal to Tomcat thread pool size

Symptom

Database overwhelmed with connections, latency spikes

Fix

Start from (core_count * 2) + effective_spindle_count for the database server, then measure. Add read replicas for scalability. Use connection pooling middleware (pgbouncer).

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

You have two services communicating over REST. The downstream service go...

Q02 · SENIOR

What’s the difference between liveness and readiness probes? Give a conc...

Q03 · SENIOR

You’re using a saga pattern for an order workflow. The inventory service...

Q04 · SENIOR

Why is two-phase commit (2PC) a bad fit for microservices?

Q05 · SENIOR

Your distributed tracing shows gaps of 500ms–1s between spans. What do y...

Q06 · SENIOR

When would you choose gRPC over REST for inter-service communication?

Q07 · JUNIOR

You notice that a service’s response time is 2 seconds, but the CPU is a...

Q08 · SENIOR

How do you handle schema evolution in an event-driven system?

Q01 of 08 · SENIOR

You have two services communicating over REST. The downstream service goes down. The client starts seeing 5-second timeouts. How do you prevent this from cascading to upstream services?

ANSWER

Add a circuit breaker and a fast call timeout (2 seconds). The breaker opens once the failure rate in its sliding window crosses the configured threshold (for example, 50% of the last 10 calls), and the fallback returns a cached response or a 503. Also configure a bulkhead to limit the number of concurrent calls to the downstream. In Spring Boot 3.x, use Resilience4j with @CircuitBreaker and @Bulkhead annotations, plus @TimeLimiter or a client-level timeout for the deadline. The key is to fail fast: don’t let the client thread wait 5 seconds.

