Senior 6 min · May 23, 2026

Microservices Interview Questions That Actually Filter Senior Engineers

Stop asking what a circuit breaker is.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Distributed transactions don't exist; you need sagas or eventual consistency
  • Service mesh adds latency; don't deploy it until you have 20+ services
  • Circuit breakers hide failures; always log the fallback reason
  • Readiness checks must be distinct from liveness probes; don't mix them
  • Request tracing is useless without proper span propagation in async code
Definition
What is a microservices architecture?

Microservices architecture breaks a monolith into independently deployable services. Each service owns its data, communicates over the network, and fails in isolation. That's the theory. In production, you trade code complexity for infrastructure complexity.

You get independent scaling, team autonomy, and faster deploys. You also get network latency, partial failures, and debugging nightmares. Spring Boot 3.x helps with virtual threads (3.2+ on Java 21) and GraalVM native image support, but it doesn't fix bad architecture. You need to understand patterns like sagas, CQRS, and strangler fig.

Most interview questions test buzzwords. The real test is how you debug a cascading failure at 3 AM.

Plain-English First

Imagine a restaurant kitchen where each chef only makes one dish. When an order comes in, the chefs pass tickets back and forth. If one chef is slow, the whole kitchen backs up. Microservices work the same way — services talk to each other, and when one fails, the failure cascades. You need systems to detect slow chefs (circuit breakers), track the full order (distributed tracing), and decide when to just tell the customer the kitchen is busy (graceful degradation).

Last Thursday, 2:47 AM. My phone buzzed with a PagerDuty alert. Order service latency spiked to 30 seconds. Users were seeing 503s across the checkout flow. My first thought: database is slow. Second thought: someone pushed a bad query. Third thought: the truth. The circuit breaker on the payment service had opened, but the fallback method in the order service threw a NullPointerException. Nobody caught it because the logs only showed “Circuit breaker OPEN” and the team assumed that was normal. That incident cost us $40k in lost revenue and a pissed-off VP of Engineering.

Here’s the thing: circuit breakers, bulkheads, retries are all supposed to make your system resilient. But every abstraction leaks. If you don’t understand what happens when the fallback itself fails, you’re building a house of cards. This article isn’t theory. It’s the questions I ask when hiring. It’s the things I’ve seen break. It’s the production debug steps that actually move the needle. Stop reading if you want textbook answers. Keep reading if you want to survive a real outage.

Circuit Breakers: The Fallback That Broke Production

Everyone on my team knew “circuit breaker prevents cascading failures.” They aced the interview question. Then they coded a fallback that returned null. The calling code didn’t check for null. NullPointerException. 500 to the customer. The circuit breaker was working exactly as configured — it stopped calls to the failing downstream. But it didn’t stop the fallback from failing. That’s the part nobody teaches. In Spring Boot 3.x with Resilience4j, the fallback method is just another bean method. It can throw exceptions. It can have its own bugs. You must test the fallback as thoroughly as the primary method. Use @CircuitBreaker(name = "service", fallbackMethod = "fallback") and write a unit test that opens the breaker and calls the method. Assert that fallback returns a valid response. Log the fallback invocation and result. Use a specific log level (WARN) so you can alert on it. And for the love of god, don’t swallow the exception in the fallback. Log it. Your 3 AM self will thank you.

Production Trap:
If your fallback method throws an exception, Resilience4j does NOT catch it. The exception propagates to the caller and you get a 500. The circuit breaker never learns about it, because it only records the outcome of the primary call, not the fallback. Its state and metrics look normal while you silently return errors.
Production Insight
I’ve seen teams spend weeks tuning circuit breaker thresholds when the real fix was a null check in the fallback.
Key Takeaway
Test the fallback method as a first-class citizen. Add an actuator endpoint that lets you manually open the circuit breaker and verify behavior in staging.
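
A minimal sketch of a fallback path that cannot fail the request, written in plain Java with no Resilience4j dependency so it runs standalone. The names SafeFallback, call, and lastResort are hypothetical and only for illustration; in a real service the same guard and logging would sit inside the method referenced by fallbackMethod.

```java
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: runs a primary call and guards the fallback path too. */
final class SafeFallback {
    private static final Logger LOG = Logger.getLogger(SafeFallback.class.getName());

    private SafeFallback() {}

    static <T> T call(Supplier<T> primary, Supplier<T> fallback, T lastResort) {
        try {
            return primary.get();
        } catch (RuntimeException primaryFailure) {
            // WARNING level so the fallback rate can be alerted on.
            LOG.log(Level.WARNING, "Primary call failed, invoking fallback", primaryFailure);
        }
        try {
            T value = fallback.get();
            if (value == null) {
                // The null-returning fallback from the story above ends here, not in an NPE.
                LOG.warning("Fallback returned null, using last-resort value");
                return lastResort;
            }
            LOG.warning("Fallback result: " + value);
            return value;
        } catch (RuntimeException fallbackFailure) {
            // Logged, never swallowed: this is the failure nobody sees otherwise.
            LOG.log(Level.SEVERE, "Fallback itself failed", fallbackFailure);
            return lastResort;
        }
    }
}
```

A unit test for this shape forces the primary to throw and then asserts on the three fallback outcomes: a valid value, a null, and an exception. Each one should produce a usable response and a log line.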

Distributed Tracing: Why Your Spans Disappear Into the Void

You set up Spring Cloud Sleuth (replaced by Micrometer Tracing in Spring Boot 3.x). You see traces across your services. Then you switch to async processing. Spans disappear. You check logs: no trace ID. You check Zipkin: incomplete traces. The culprit: thread pool executors that don’t propagate the tracing context. When you use @Async or CompletableFuture.supplyAsync(), the worker thread doesn’t inherit the parent’s trace ID, because the context lives in a ThreadLocal. Spring Boot 3.x uses Micrometer Tracing, which relies on io.micrometer.context.ContextSnapshot to capture and restore that context. You can fix this by configuring a TaskDecorator that wraps each task in a snapshot. Virtual threads (Project Loom) are thread-per-task, so stale context from a reused pool thread goes away, but a fresh thread still starts without the submitter’s plain ThreadLocal values, so keep the decorator. But here’s the real kicker: if you’re using Kafka, the consumer side won’t continue the trace automatically unless observation is enabled on both sides (in Spring Boot 3.x, spring.kafka.template.observation-enabled=true and spring.kafka.listener.observation-enabled=true; the Sleuth-era settings no longer apply). Even then, if your Kafka listener uses a manual ack mode or retry template, the headers can get lost. I’ve debugged this exact problem at 2 AM. The fix is to manually forward the trace context in the producer’s headers and reconstruct it on the consumer side using an interceptor.

Senior Shortcut:
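Switch to virtual threads (spring.threads.virtual.enabled=true, Spring Boot 3.2+ on Java 21+). Each task gets a fresh thread, so stale context can’t leak between tasks the way it does on a reused pool thread. A fresh thread does not copy the submitter’s plain ThreadLocal values, though, so pair this with a context-propagating TaskDecorator and verify that the trace connects. Avoid blocking inside synchronized blocks: on JDK 21 through 23 that pins the virtual thread to its carrier and ruins performance.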
Production Insight
If your trace has gaps longer than 500ms, you’re losing context in async code. Don’t blame the network.
Key Takeaway
Distributed tracing is only as good as your context propagation. Test with async methods, Kafka consumers, and scheduled tasks. If the trace doesn’t connect, you don’t have observability.
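
A minimal sketch of what a context-propagating decorator does, using a plain ThreadLocal instead of Micrometer so it runs without dependencies. TraceContext and propagate are hypothetical names; ContextSnapshot plays this role for real tracing context, and this sketch only shows the capture-on-submit, restore-on-run mechanics.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a tracing context held in a ThreadLocal. */
final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    private TraceContext() {}

    static void set(String traceId) { TRACE_ID.set(traceId); }
    static String get() { return TRACE_ID.get(); }
    static void clear() { TRACE_ID.remove(); }

    /** Captures the caller's trace ID now and restores it around the task later. */
    static <T> Supplier<T> propagate(Supplier<T> task) {
        String captured = TRACE_ID.get();            // runs on the submitting thread
        return () -> {
            String previous = TRACE_ID.get();        // runs on the worker thread
            TRACE_ID.set(captured);
            try {
                return task.get();
            } finally {
                // Put the worker back as we found it so pooled threads don't leak context.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }
}
```

Usage would look like CompletableFuture.supplyAsync(TraceContext.propagate(() -> doWork()), executor), where doWork is whatever the task does. Without the wrapper, TraceContext.get() inside the task returns null on the worker thread, which is exactly the gap that shows up in Zipkin.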

Health Checks: Liveness vs. Readiness — You’re Doing It Wrong

I see this in every interview. Candidate says: “We use Spring Boot Actuator for health checks.” I ask: “What happens when your database goes down?” They say: “The health check fails and Kubernetes restarts the pod.” Wrong. That’s the exact opposite of what you want. If your database goes down, you don’t want Kubernetes to kill your pod. You want the pod to stay alive, return 503s, and wait for the database to recover. Killing the pod just shifts the problem to the new pod: it will also fail health checks and get killed. That’s a restart loop. The right pattern: liveness probe checks JVM health (heap, thread deadlocks). Readiness probe checks downstream dependency health (database, Redis, downstream services). When the database is down, the readiness probe fails, and Kubernetes removes the pod from the Service’s endpoints. No traffic hits it. When the database recovers, the readiness probe succeeds, and the pod starts receiving traffic. No restarts. No cascading failures. In Spring Boot 3.x, set management.endpoint.health.probes.enabled=true to expose /actuator/health/liveness and /actuator/health/readiness, then add dependency indicators to the readiness group only, for example management.endpoint.health.group.readiness.include=readinessState,db,redis. LivenessState and ReadinessState are availability states you publish with AvailabilityChangeEvent, not annotations.

Interview Gold:
The most common mistake is having the liveness probe call the same health check as readiness. If they’re the same, a database failure kills the pod. Always separate them. A senior engineer explains why, not just what.
Production Insight
I’ve seen a production outage where 200 pods were restarting every 30 seconds because the liveness check included a downstream Redis call that was slow. The fix: rewrite liveness to only check JVM metrics.
Key Takeaway
Liveness = Is the JVM alive? Readiness = Is the service ready to accept traffic? Never conflate them.
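
A minimal sketch of the separation, in plain Java rather than Actuator so it runs standalone. Probes, markBroken, and the dependency map are hypothetical names for illustration; the point is only that the liveness answer never consults a dependency.

```java
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical model of two probes that answer different questions. */
final class Probes {
    private final Map<String, BooleanSupplier> dependencies;
    private volatile boolean internalStateBroken = false;

    Probes(Map<String, BooleanSupplier> dependencies) {
        this.dependencies = Map.copyOf(dependencies);
    }

    /** Called when the process detects something only a restart can fix. */
    void markBroken() {
        internalStateBroken = true;
    }

    /** Liveness: the process's own state only. No database, no Redis, no downstream calls. */
    boolean isLive() {
        return !internalStateBroken;
    }

    /** Readiness: live AND every dependency reachable. Failing this only stops traffic. */
    boolean isReady() {
        return isLive()
                && dependencies.values().stream().allMatch(BooleanSupplier::getAsBoolean);
    }
}
```

With the database check returning false, isReady() is false and isLive() stays true, so the platform stops routing traffic but has no reason to restart anything.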

Database Connections: The Silent Throttle in Your Microservices

Your payment service has a HikariCP pool of 10 connections. Your Tomcat thread pool has 200 threads. You get higher traffic, and suddenly all 10 database connections are held by slow queries. The remaining 190 threads are stuck waiting for a connection. Your latency spikes to 10 seconds. No CPU spike. No memory error. Just a wall of “Connection is not available, request timed out after 30000ms.” The fix isn’t to increase the pool size blindly. If you set pool size to 200, you’ll just overwhelm the database with 200 concurrent connections. The real fix: set pool size to match the number of concurrent requests that can hit the database at peak, plus a small buffer. Use a tool like pgbouncer or RDS Proxy to multiplex connections. Or use a reactive driver (R2DBC) that doesn’t hold connections per request. But here’s the trick most people miss: if you’re using virtual threads, the connection pool is even more important because virtual threads don’t block the OS thread, but they still block the database connection. You can have thousands of virtual threads waiting on a pool of 10 connections. The pool becomes the bottleneck. Monitor the HikariCP metrics actively: hikaricp.connections.active, hikaricp.connections.pending, hikaricp.connections.timeout. Alert on pending connections > 0 for more than 5 seconds.

Never Do This:
Setting maximumPoolSize to 200 “because we have 200 threads.” Bad idea. Database connections are expensive. A good starting pool size is (cpu_cores * 2) + 1. For a 4-core machine, start with 9. Measure. Adjust.
Production Insight
I watched a team increase pool size from 10 to 100 to fix latency. Latency got worse because the database couldn’t handle 100 concurrent connections. The throughput dropped 40%. The fix was a 2-line config change: set pool size back to 10 and add a read replica.
Key Takeaway
Connection pool size is not a tuning knob for throughput. It’s a safety valve. Keep it small. Scale horizontally instead.
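
A minimal sketch of the sizing arithmetic and of why a small pool turns into a queue, using a Semaphore as a stand-in for the connection pool. BoundedPool and its methods are hypothetical names, not HikariCP API. One assumption worth stating: HikariCP's own pool sizing guidance applies the cores-based formula to the database server's cores, so treat the number as a starting point to measure against.

```java
import java.time.Duration;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a connection pool: permits play the role of connections. */
final class BoundedPool {
    private final Semaphore permits;

    BoundedPool(int size) {
        this.permits = new Semaphore(size, true); // fair, so waiters are served in order
    }

    /** Starting size from the rule of thumb above: (cpu_cores * 2) + 1. */
    static int startingSize(int cpuCores) {
        if (cpuCores < 1) {
            throw new IllegalArgumentException("cpuCores must be >= 1");
        }
        return cpuCores * 2 + 1;
    }

    /** Returns false when no connection frees up in time, like a pool acquisition timeout. */
    boolean tryBorrow(Duration timeout) {
        try {
            return permits.tryAcquire(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    void giveBack() {
        permits.release();
    }

    /** Callers currently waiting for a connection: the number to alert on. */
    int pending() {
        return permits.getQueueLength();
    }
}
```

The pending() count is the analogue of hikaricp.connections.pending. Whether the waiters are 200 platform threads or thousands of virtual threads, they all line up behind the same handful of permits.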

Sagas: The Distributed Transaction Myth

Business says: “Create order, deduct inventory, charge payment — all or nothing.” Developers hear: “Distributed transaction.” They reach for two-phase commit (2PC) or try to use JTA across microservices. Both are terrible ideas. 2PC locks resources for the duration of the transaction. If any participant fails, you have in-doubt transactions that require manual recovery. In microservices, you use sagas — a sequence of local transactions with compensating actions for rollback. Orchestration-based sagas use a central coordinator (like Axon or Temporal). Choreography-based sagas use events (each service publishes an event when done, another service reacts). Both patterns trade consistency for availability. You face the problem of handling the “last mile”: the saga says “payment captured” but then the inventory service crashes before updating stock. You’re in an inconsistent state. The solution: idempotency keys and retry queues. Each step must be idempotent. Use a unique saga ID in every request. Store the saga state in a database. Use a dead-letter queue for failed steps, with a retry mechanism and exponential backoff. And always design compensating actions that are idempotent too.

The Classic Bug:
Compensating actions must be commutative. If you’re compensating a compensation, you get infinite loops. Use a saga state machine with explicit transitions to prevent cycles.
Production Insight
I’ve seen a saga that refunded the same payment 5 times because the inventory service failed 5 times, each time triggering a refund compensation. The fix: check if the compensation was already applied before executing.
Key Takeaway
Sagas work when every step is idempotent, retryable, and has a well-defined compensating action. If you can’t achieve that, you need a different architecture.
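
A minimal sketch of an idempotent compensation guard, kept in memory so it runs standalone. CompensationLog and compensateOnce are hypothetical names; in production the applied-set would live in the saga state table and be written in the same local transaction as the compensation itself.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical guard that runs each compensation at most once per saga step. */
final class CompensationLog {
    private final Set<String> applied = ConcurrentHashMap.newKeySet();

    /**
     * Runs the compensation only if it has not been applied for this saga and step.
     * Returns true if it ran now, false if it was already applied.
     */
    boolean compensateOnce(String sagaId, String step, Runnable compensation) {
        String key = sagaId + ":" + step;
        if (!applied.add(key)) {
            return false;               // already compensated: a retry becomes a no-op
        }
        try {
            compensation.run();
            return true;
        } catch (RuntimeException failure) {
            applied.remove(key);        // not applied after all, so a later retry may run it
            throw failure;
        }
    }
}
```

With this guard, five inventory failures that each trigger a refund for the same saga ID result in one refund and four no-ops.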

Inter-Service Communication: REST vs. gRPC vs. Events — When Each Breaks

REST is the default. It’s simple. But it fails in every way that matters under load. Latency adds up. Serialization costs. No built-in retry or backpressure. gRPC fixes some of this: binary protocol, streaming, deadline propagation. But it introduces its own problems: HTTP/2 multiplexing can cause head-of-line blocking with misconfigured load balancers. And gRPC doesn’t handle load shedding well — if a service is overwhelmed, it just sends back RESOURCE_EXHAUSTED, and the caller has to decide what to do. Events (Kafka, RabbitMQ) solve the coupling problem. Services don’t call each other; they publish events that other services consume. This gives you async processing and inherent buffering. But it introduces complexity: event ordering, at-least-once vs. exactly-once semantics, and schema evolution. The trade-off is clear: synchronous calls (REST/gRPC) are for real-time interactions where you need a response. Events are for background processing and eventual consistency. In a real system, you use both. And you need a retry mechanism for both. Don’t assume REST will work because you added a timeout. I’ve seen a payment gateway timeout cause a 10-second thread block on every request, cascading through 5 services. The fix: implement a circuit breaker with a 2-second timeout. And test it.

Senior Shortcut:
For internal services, use gRPC with a deadline of 2 seconds and circuit breaker. For external-facing APIs, use REST with rate limiting. For async work, use Kafka. Don’t mix the patterns: a Kafka consumer should never make a synchronous REST call that blocks the consumer thread.
Production Insight
I replaced a REST endpoint with Kafka events between two services. Latency went from 600ms to 5ms. The trade-off: eventual consistency. The payment confirmation now arrives 1–5 seconds after the order. Marketing hated it. Engineering loved it. We kept both: a REST call for immediate response, then an event for the actual processing.
Key Takeaway
Use synchronous communication for reads and immediate writes. Use events for everything else. The hardest part is saying “no” to a synchronous call that a business stakeholder insists on.
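
A minimal sketch of failing fast on a slow synchronous call, using only CompletableFuture from the JDK. Deadline and callWithin are hypothetical names. One caveat under this approach: the abandoned call keeps running on its thread, so the underlying HTTP or gRPC client still needs its own timeout.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: the caller never waits longer than the deadline. */
final class Deadline {
    private Deadline() {}

    static <T> T callWithin(Duration deadline, Supplier<T> remoteCall, T fallback) {
        return CompletableFuture.supplyAsync(remoteCall)
                // If the call has not finished by the deadline, complete with the fallback.
                .completeOnTimeout(fallback, deadline.toMillis(), TimeUnit.MILLISECONDS)
                // If the call failed outright, also answer with the fallback.
                .exceptionally(failure -> fallback)
                .join();
    }
}
```

A call such as Deadline.callWithin(Duration.ofSeconds(2), () -> gateway.charge(order), PENDING) would cap the caller's wait at two seconds; gateway, charge, order, and PENDING are placeholders for whatever the real client exposes.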
Production incident · Post-mortem · Severity: high

The Circuit Breaker That Killed Checkout

Symptom
Users saw “Something went wrong” on checkout. Logs showed a periodic “CircuitBreaker ‘paymentService’ is OPEN and does not permit further calls” message. The team thought this was normal circuit breaker behavior. In reality, the fallback function threw a NullPointerException because one of its injected dependencies was lazily initialized.
Assumption
First thought: Payment service is down. Second thought: Rate limiting on the payment gateway. Third thought: Network issue between pods.
Root cause
The circuit breaker on payment service opened after 5 consecutive timeouts. The fallback method called a utility class that was lazily initialized. Because the circuit breaker was open, the fallback ran on every request. The lazy init happened on the first call, and that call ran fine. But the utility had a bug: it modified a shared, non-thread-safe map without synchronization, causing a ConcurrentModificationException on subsequent calls. This uncaught exception bubbled up, causing a 500 to the client.
Fix
1) Changed the utility class to use a synchronized block or ConcurrentHashMap.compute()
2) Wrapped the fallback logic in try-catch with explicit error logging
3) Added a health check endpoint that tests the fallback path
4) Changed circuit breaker configuration from COUNT_BASED to TIME_BASED with a 10-second window
5) Wrote a unit test that opens the circuit breaker and verifies the fallback response
6) Added a Prometheus alert on circuit breaker open state duration > 30 seconds
Key lesson
  • A circuit breaker’s fallback is not a safety net.
  • It’s a code path that must be tested as rigorously as the primary path.
  • Log the fallback result — both success and failure.
  • Your pager will thank you.
Production debug guide
Symptom -> root cause -> fix for the failures that actually happen
Symptom · 01
Service returns 503 but health endpoint returns 200
Fix
Check how the health endpoint relates to the real request path. If real requests depend on a downstream service but the health check doesn’t look at it, health stays green while users get 503s. The opposite mistake is wiring every dependency into the liveness check, which turns a dependency outage into pod restarts. Fix: create separate liveness and readiness probes. Liveness = basic JVM check, readiness = downstream dependency check. In Spring Boot 3.x, set management.endpoint.health.probes.enabled=true and add dependency indicators to the readiness group only (management.endpoint.health.group.readiness.include=readinessState,db).
Symptom · 02
Slow response times but no CPU or memory spike
Fix
Check thread pool exhaustion. If you’re using Tomcat, look for thread dumps showing many threads in BLOCKED or WAITING state. Common cause: blocking calls inside a reactive pipeline, or virtual threads pinned by synchronized blocks. Fix: Switch to virtual threads with Spring Boot 3.2+, but avoid synchronized. Use ReentrantLock or structured concurrency. Also check database connection pool size — if it’s smaller than Tomcat’s thread pool, you get connection starvation.
Symptom · 03
Distributed trace shows gaps between spans
Fix
Context propagation failed. This happens when work moves to another thread without the tracing context. Common in async methods (@Async), CompletableFuture, or Kafka listeners. Fix: use Spring’s TaskDecorator to propagate MDC and tracing context. In Spring Boot 3.x, give your ThreadPoolTaskExecutor bean a TaskDecorator that captures and restores the current context (Spring Framework 6.1+ ships ContextPropagatingTaskDecorator). For Kafka, enable observation on both sides (spring.kafka.template.observation-enabled=true and spring.kafka.listener.observation-enabled=true); if headers still get lost, propagate them manually.
Symptom · 04
Database deadlock in a saga
Fix
Two services updated the same rows in a different order. Sagas don’t use distributed transactions; they rely on eventual consistency and compensation. If you get a deadlock, it means two services are competing for the same resource without coordination. Fix: Use optimistic locking with version columns. Or redesign your saga to use a “one writer” pattern per entity. Add retry logic with exponential backoff on OptimisticLockException. PostgreSQL writes the deadlock detail to its server log; query pg_locks joined with pg_stat_activity to see which sessions are blocking each other.
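
A minimal sketch of retrying a version-conflicted update with exponential backoff, in plain Java. StaleVersionException is a hypothetical stand-in for a JPA OptimisticLockException so the sketch has no dependencies, and Backoff is an illustrative name. Jitter is left out for brevity; a real retry loop would usually add it.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for an optimistic locking failure. */
final class StaleVersionException extends RuntimeException {
    StaleVersionException(String message) {
        super(message);
    }
}

final class Backoff {
    private Backoff() {}

    /** Delay before the next try: base * 2^(attempt - 1), capped. Attempts are 1-based. */
    static long delayMillis(int attempt, long baseMillis, long capMillis) {
        int doublings = Math.min(Math.max(attempt - 1, 0), 20); // bounded so typical bases cannot overflow
        return Math.min(baseMillis << doublings, capMillis);
    }

    /** Retries only on version conflicts; any other exception propagates immediately. */
    static <T> T retry(int maxAttempts, long baseMillis, long capMillis, Supplier<T> update) {
        for (int attempt = 1; ; attempt++) {
            try {
                return update.get();
            } catch (StaleVersionException conflict) {
                if (attempt >= maxAttempts) {
                    throw conflict;     // out of attempts: surface the conflict
                }
                sleep(delayMillis(attempt, baseMillis, capMillis));
            }
        }
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", e);
        }
    }
}
```

The update passed in should re-read the row and its version on every attempt; retrying with the stale version just fails again.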
Debug Cheat Sheet
Commands for fast diagnosis in production
Circuit breaker opens too frequently
Immediate action
Check circuit breaker state and failure count
Commands
kubectl exec -it <pod> -- curl -s http://localhost:8081/actuator/health | jq '.components.circuitBreakers'
kubectl logs --tail=100 <pod> | grep -E 'CircuitBreaker|Hystrix|resilience4j' | grep -v 'closed'
Fix now
Increase slidingWindowSize from 10 to 20 in application.yml: resilience4j.circuitbreaker.configs.default.slidingWindowSize: 20
High latency but no errors
Immediate action
Check thread dumps and database connection pool
Commands
kubectl exec -it <pod> -- jcmd 1 Thread.print > threads.txt && grep -c 'BLOCKED' threads.txt
kubectl exec -it <pod> -- curl -s http://localhost:8081/actuator/metrics/hikaricp.connections.active
Fix now
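Don’t raise maximumPoolSize to match the thread pool. Check hikaricp.connections.pending, find the slow queries holding connections, and keep the pool near (cpu_cores * 2) + 1, for example spring.datasource.hikari.maximumPoolSize: 9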
Distributed trace missing spans
Immediate action
Check if async context is propagated
Commands
kubectl logs <pod> | grep -E 'TraceId|spanId' | head -5
kubectl exec -it <pod> -- curl -s http://localhost:8081/actuator/beans | jq '.contexts[].beans | to_entries[] | select(.value.type | test("Executor")) | .key'
Fix now
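Add a TaskDecorator to the executor bean: @Bean public Executor asyncExecutor() { ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor(); executor.setTaskDecorator(new TraceDecorator()); return executor; } (TraceDecorator is your own TaskDecorator implementation.)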
Inter-Service Communication Patterns
Characteristic | REST/HTTP | gRPC | Events (Kafka)
Latency per call | 10-100ms | 1-10ms | 0.5-5ms (async)
Throughput | Moderate (HTTP overhead) | High (binary) | Very high (batch)
Coupling | Tight (client needs endpoint) | Tight (client needs proto) | Loose (event schema)
Error handling | HTTP status codes | gRPC status codes | Retry + DLQ pattern
Idempotency | Manual (idempotency key) | Manual (idempotency key) | Manual (idempotent consumer; offsets only give at-least-once)
Backpressure | None built in | RESOURCE_EXHAUSTED | Partition lag (monitoring)
Streaming | No (polling or WebSocket) | Yes (bidirectional) | Yes (consumer group)

Key takeaways

1. Circuit breaker fallbacks are code paths that must be tested, logged, and alerted on. A failing fallback is worse than no fallback.
2. Liveness and readiness probes are not interchangeable. Kubernetes uses liveness to restart pods; readiness to route traffic. One serves the platform, the other serves the customer.
3. Database connection pool size should match CPU cores, not thread pool size. Oversizing the pool kills the database. Add replicas for scale, not bigger pools.
4. Sagas work only when compensations are idempotent and commutative. If you can refund the same payment twice, you haven’t designed a saga; you’ve designed a bug.
5. Distributed tracing is useless without proper context propagation. Test with async, Kafka, and scheduled tasks. If the trace doesn’t connect, you’re flying blind.

Common mistakes to avoid


Using @Async without configuring a TaskDecorator for trace context

Symptom
Distributed traces show gaps or missing spans after async boundaries
Fix
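Configure a TaskDecorator that captures a Micrometer ContextSnapshot (Spring Framework 6.1+ ships ContextPropagatingTaskDecorator). Virtual threads (spring.threads.virtual.enabled=true) avoid stale context on reused pool threads, but still need the decorator to carry plain ThreadLocal context.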

Same health check for liveness and readiness probes

Symptom
Pods restart in a loop when database goes down
Fix
Separate probes: liveness checks JVM health (heap, deadlocks), readiness checks downstream dependencies

Circuit breaker fallback method throws an exception

Symptom
Client receives 500 even though circuit breaker is open, but logs only show “Circuit breaker OPEN”
Fix
Wrap fallback in try-catch, return a safe default, log the exception. Never let fallback throw unchecked exceptions.

Saga compensating action is not idempotent

Symptom
Refund is applied multiple times, duplicate compensations
Fix
Use a saga state repository to track which compensations were already applied. Check before executing compensation.

Setting HikariCP maximumPoolSize equal to Tomcat thread pool size

Symptom
Database overwhelmed with connections, latency spikes
Fix
Set pool size to (cpu_cores * 2) + 1. Add read replicas for scalability. Use connection pooling middleware (pgbouncer).
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR · You have two services communicating over REST. The downstream service go...
Q02 · SENIOR · What’s the difference between liveness and readiness probes? Give a conc...
Q03 · SENIOR · You’re using a saga pattern for an order workflow. The inventory service...
Q04 · SENIOR · Why is two-phase commit (2PC) a bad fit for microservices?
Q05 · SENIOR · Your distributed tracing shows gaps of 500ms–1s between spans. What do y...
Q06 · SENIOR · When would you choose gRPC over REST for inter-service communication?
Q07 · JUNIOR · You notice that a service’s response time is 2 seconds, but the CPU is a...
Q08 · SENIOR · How do you handle schema evolution in an event-driven system?
Q01 of 08 · SENIOR

You have two services communicating over REST. The downstream service goes down. The client starts seeing 5-second timeouts. How do you prevent this from cascading to upstream services?

ANSWER
Set a fast timeout on the call (2 seconds, via the HTTP client or Resilience4j’s TimeLimiter) and wrap it in a circuit breaker. The circuit breaker opens after 5 consecutive failures, and the fallback returns a cached response or 503. Also configure a bulkhead to limit the number of concurrent calls to the downstream. In Spring Boot 3.x, use Resilience4j with @CircuitBreaker and @Bulkhead annotations. The key is to fail fast: don’t let the client thread wait 5 seconds.
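
A minimal sketch of the fail-fast behavior in that answer, using the simplest possible rule: open after N consecutive failures. ConsecutiveFailureBreaker is a hypothetical class for illustration, not Resilience4j, which decides on a failure rate over a sliding window and adds a half-open state. Here a manual reset() stands in for half-open recovery.

```java
import java.util.function.Supplier;

/** Hypothetical, deliberately tiny circuit breaker: counts consecutive failures only. */
final class ConsecutiveFailureBreaker {
    private final int failureThreshold;
    private int consecutiveFailures = 0;

    ConsecutiveFailureBreaker(int failureThreshold) {
        if (failureThreshold < 1) {
            throw new IllegalArgumentException("failureThreshold must be >= 1");
        }
        this.failureThreshold = failureThreshold;
    }

    synchronized boolean isOpen() {
        return consecutiveFailures >= failureThreshold;
    }

    /** Stand-in for half-open recovery, which this sketch does not implement. */
    synchronized void reset() {
        consecutiveFailures = 0;
    }

    // Holding the lock during the call serializes callers; acceptable for a sketch only.
    synchronized <T> T call(Supplier<T> downstream, Supplier<T> fallback) {
        if (isOpen()) {
            return fallback.get();          // fail fast: the downstream is not touched
        }
        try {
            T result = downstream.get();
            consecutiveFailures = 0;        // any success closes the streak
            return result;
        } catch (RuntimeException failure) {
            consecutiveFailures++;
            return fallback.get();
        }
    }
}
```

Once the threshold is reached, further calls return the fallback immediately instead of waiting on the dead service, which is what keeps the upstream threads free.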
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What’s the difference between a circuit breaker and a bulkhead in resilience patterns?
02. How do I implement distributed tracing in Spring Boot 3.x without Sleuth?
03. When should I use a saga instead of a distributed transaction?
04. What is the strangler fig pattern?
05. How do you handle idempotency in a microservices system?