Microservices Interview Questions That Actually Filter Senior Engineers
Stop asking what a circuit breaker is.
- Distributed transactions don't exist; you need sagas or eventual consistency
- Service mesh adds latency; don't deploy it until you have 20+ services
- Circuit breakers hide failures; always log the fallback reason
- Readiness checks must be distinct from liveness probes; don't mix them
- Request tracing is useless without proper span propagation in async code
Imagine a restaurant kitchen where each chef only makes one dish. When an order comes in, the chefs pass tickets back and forth. If one chef is slow, the whole kitchen backs up. Microservices work the same way — services talk to each other, and when one fails, the failure cascades. You need systems to detect slow chefs (circuit breakers), track the full order (distributed tracing), and decide when to just tell the customer the kitchen is busy (graceful degradation).
Last Thursday, 2:47 AM. My phone buzzed with a PagerDuty alert. Order service latency spiked to 30 seconds. Users were seeing 503s across the checkout flow. My first thought: database is slow. Second thought: someone pushed a bad query. Third thought: the truth. The circuit breaker on the payment service had opened, but the fallback method in the order service threw a NullPointerException. Nobody caught it because the logs only showed “Circuit breaker OPEN”; the team assumed that was normal. That incident cost us $40k in lost revenue and a pissed-off VP of Engineering.
Here’s the thing: circuit breakers, bulkheads, retries: these patterns are supposed to make your system resilient. But every abstraction leaks. If you don’t understand what happens when the fallback itself fails, you’re building a house of cards.
This article isn’t theory. It’s the questions I ask when hiring. It’s the things I’ve seen break. It’s the production debug steps that actually move the needle. Stop reading if you want textbook answers. Keep reading if you want to survive a real outage.
Circuit Breakers: The Fallback That Broke Production
Everyone on my team knew “circuit breaker prevents cascading failures.” They aced the interview question. Then they coded a fallback that returned null. The calling code didn’t check for null. NullPointerException. 500 to the customer. The circuit breaker was working exactly as configured — it stopped calls to the failing downstream. But it didn’t stop the fallback from failing. That’s the part nobody teaches. In Spring Boot 3.x with Resilience4j, the fallback method is just another bean method. It can throw exceptions. It can have its own bugs. You must test the fallback as thoroughly as the primary method. Use @CircuitBreaker(name = "service", fallbackMethod = "fallback") and write a unit test that opens the breaker and calls the method. Assert that fallback returns a valid response. Log the fallback invocation and result. Use a specific log level (WARN) so you can alert on it. And for the love of god, don’t swallow the exception in the fallback. Log it. Your 3 AM self will thank you.
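To make the failure mode concrete, here is a minimal sketch in plain Java. It is a hand-rolled stand-in, not the Resilience4j API: the class name GuardedCall, the consecutive-failure threshold, and the "last resort" value are all assumptions made for illustration, and there is no half-open state or thread safety. What it shows is the part from the incident: the fallback runs inside its own try-catch, a null result is treated as a failure, and every fallback outcome is logged.

```java
import java.util.Objects;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hand-rolled stand-in for a circuit breaker; NOT the Resilience4j API. Not thread-safe. */
final class GuardedCall<T> {
    private static final Logger LOG = Logger.getLogger(GuardedCall.class.getName());

    private final int failureThreshold; // consecutive failures before the breaker opens
    private final T lastResort;         // static, known-good value used if the fallback itself fails
    private int consecutiveFailures = 0;

    GuardedCall(int failureThreshold, T lastResort) {
        this.failureThreshold = failureThreshold;
        this.lastResort = Objects.requireNonNull(lastResort);
    }

    boolean isOpen() {
        return consecutiveFailures >= failureThreshold;
    }

    T call(Supplier<T> primary, Supplier<T> fallback) {
        if (isOpen()) {
            LOG.warning("breaker OPEN, skipping primary and using fallback");
        } else {
            try {
                T result = primary.get();
                consecutiveFailures = 0;
                return result;
            } catch (RuntimeException e) {
                consecutiveFailures++;
                LOG.log(Level.WARNING, "primary failed, using fallback", e);
            }
        }
        return runFallback(fallback);
    }

    /** The fallback is treated as code that can fail: guarded, null-checked, and logged. */
    private T runFallback(Supplier<T> fallback) {
        try {
            T value = fallback.get();
            if (value == null) {
                LOG.warning("fallback returned null, using last-resort value");
                return lastResort;
            }
            LOG.log(Level.WARNING, "fallback result: {0}", value);
            return value;
        } catch (RuntimeException e) {
            LOG.log(Level.SEVERE, "fallback itself failed", e); // never swallow this
            return lastResort;
        }
    }

    public static void main(String[] args) {
        GuardedCall<String> guard = new GuardedCall<>(2, "PAYMENT_UNAVAILABLE");
        Supplier<String> paymentDown = () -> { throw new IllegalStateException("payment service down"); };
        // A buggy fallback that returns null, like the one in the incident.
        System.out.println(guard.call(paymentDown, () -> null));
    }
}
```

With a real Resilience4j breaker the same idea applies: the unit test should force the breaker open, call the method, and assert on the fallback's return value rather than only on the breaker state.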
Distributed Tracing: Why Your Spans Disappear Into the Void
You set up Spring Cloud Sleuth (replaced by Micrometer Tracing in Spring Boot 3.x). You see traces across your services. Then you switch to async processing. Spans disappear. You check logs: no trace ID. You check Zipkin: incomplete traces. The culprit: thread pool executors that don’t propagate the tracing context. When you use @Async or CompletableFuture.supplyAsync(), the worker thread doesn’t inherit the parent’s trace ID, because the context lives in thread-local state. Spring Boot 3.x uses Micrometer Tracing, which relies on io.micrometer.context.ContextSnapshot to capture and restore that state. You can fix this by configuring a TaskDecorator on the executor (Spring Framework 6.1 ships ContextPropagatingTaskDecorator for this). Virtual threads (Project Loom) don’t fix it on their own: a virtual thread starts with empty thread-locals like any other thread, so the context still has to be captured and restored. But here’s the real kicker: if you’re using Kafka, the consumer side won’t continue the trace unless instrumentation is switched on at both ends. With Micrometer in Spring Boot 3.x that means spring.kafka.template.observation-enabled=true on the producer and spring.kafka.listener.observation-enabled=true on the consumer; with Sleuth it meant enabling its Kafka messaging integration. Even then, if your Kafka listener uses a manual ack mode or retry template, the headers can get lost. I’ve debugged this exact problem at 2 AM. The fix is to manually forward the trace context in the producer’s headers and reconstruct it on the consumer side using an interceptor.
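The mechanism is easy to reproduce without any tracing library. The sketch below uses a bare ThreadLocal as a stand-in for the tracer's context (TRACE_ID and propagating() are hypothetical names, not Micrometer APIs) and shows what a TaskDecorator does in spirit: capture the context on the submitting thread, restore it on the worker, and clean up afterwards.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class TraceContextDemo {
    // Stand-in for the tracer's thread-bound context (hypothetical, not a Micrometer type).
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Captures the caller's trace ID now and restores it around the task on the worker thread. */
    static <T> Callable<T> propagating(Callable<T> task) {
        String captured = TRACE_ID.get(); // runs on the submitting thread
        return () -> {
            String previous = TRACE_ID.get(); // runs on the worker thread
            TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Restore the worker's previous state so pooled threads don't leak trace IDs.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            TRACE_ID.set("trace-abc123");
            Callable<String> readTraceId = () -> TRACE_ID.get();
            Future<String> lost = pool.submit(readTraceId);
            Future<String> kept = pool.submit(propagating(readTraceId));
            System.out.println("plain submit: " + lost.get());
            System.out.println("wrapped task: " + kept.get());
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }
}
```

The plain submit prints null and the wrapped task prints trace-abc123. The finally block matters as much as the capture: without it, a pooled thread keeps the last request's trace ID and stamps it onto unrelated work.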
Health Checks: Liveness vs. Readiness — You’re Doing It Wrong
I see this in every interview. Candidate says: “We use Spring Boot Actuator for health checks.” I ask: “What happens when your database goes down?” They say: “The health check fails and Kubernetes restarts the pod.” Wrong. That’s the exact opposite of what you want. If your database goes down, you don’t want Kubernetes to kill your pod. You want the pod to stay alive, return 503s, and wait for the database to recover. Killing the pod just shifts the problem to the new pod: it will also fail health checks and get killed. That’s a restart loop. The right pattern: liveness probe checks JVM health (heap, thread deadlocks). Readiness probe checks downstream dependency health (database, Redis, downstream services). When the database is down, the readiness probe fails, and Kubernetes removes the pod from the Service’s endpoints. No traffic hits it. When the database recovers, the readiness probe succeeds, and the pod starts receiving traffic. No restarts. No cascading failures. In Spring Boot 3.x, LivenessState and ReadinessState are availability states (enums), not annotations. You enable the probe endpoints with management.endpoint.health.probes.enabled=true (on by default when Spring Boot detects Kubernetes) and add dependency indicators to the readiness group only, for example management.endpoint.health.group.readiness.include=readinessState,db. Point the liveness probe at /actuator/health/liveness and the readiness probe at /actuator/health/readiness.
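The split is simple enough to model in a few lines of plain Java. This is a sketch of the idea only, not Actuator code: ProbeModel and its dependency check are hypothetical, and the monitor-deadlock check is just one example of a process-internal signal. The point is which inputs each probe is allowed to look at.

```java
import java.lang.management.ManagementFactory;
import java.util.function.BooleanSupplier;

/** Toy model of the two probes; names are illustrative, not Spring Boot APIs. */
final class ProbeModel {
    private final BooleanSupplier dependencyCheck; // hypothetical: e.g. a database ping

    ProbeModel(BooleanSupplier dependencyCheck) {
        this.dependencyCheck = dependencyCheck;
    }

    /** Liveness: process-internal state only. A failing database must not change this. */
    int livenessStatus() {
        long[] deadlocked = ManagementFactory.getThreadMXBean().findMonitorDeadlockedThreads();
        return deadlocked == null ? 200 : 503; // null means no deadlocked threads were found
    }

    /** Readiness: can this instance serve traffic right now? Depends on downstream health. */
    int readinessStatus() {
        try {
            return dependencyCheck.getAsBoolean() ? 200 : 503;
        } catch (RuntimeException e) {
            return 503; // a check that throws means "not ready", never "restart me"
        }
    }

    public static void main(String[] args) {
        ProbeModel probes = new ProbeModel(() -> false); // database is down
        System.out.println("liveness=" + probes.livenessStatus()
                + " readiness=" + probes.readinessStatus());
    }
}
```

With the database down, liveness stays at 200 and readiness drops to 503, which is exactly the combination that takes the pod out of rotation without restarting it.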
Database Connections: The Silent Throttle in Your Microservices
Your payment service has a HikariCP pool of 10 connections. Your Tomcat thread pool has 200 threads. Traffic climbs, and suddenly all 10 database connections are held by slow queries. The remaining 190 threads are stuck waiting for a connection. Your latency spikes to 10 seconds. No CPU spike. No memory error. Just a wall of “Connection is not available, request timed out after 30000ms.” The fix isn’t to increase the pool size blindly. If you set pool size to 200, you’ll just overwhelm the database with 200 concurrent connections. The real fix: size the pool for the number of queries actually in flight at peak (roughly peak queries per second times average query time), plus a small buffer. Use a tool like pgbouncer or RDS Proxy to multiplex connections. Or use a reactive driver (R2DBC), which doesn’t park a thread while a query runs, though each running query still occupies a connection. But here’s the trick most people miss: if you’re using virtual threads, the connection pool is even more important because virtual threads don’t block the OS thread, but they still block the database connection. You can have thousands of virtual threads waiting on a pool of 10 connections. The pool becomes the bottleneck. Monitor the HikariCP metrics actively: hikaricp.connections.active, hikaricp.connections.pending, hikaricp.connections.timeout. Alert on pending connections > 0 for more than 5 seconds.
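Here is a toy model of that throttle, assuming nothing about HikariCP internals: a fair Semaphore plays the pool, and three counters mirror the active, pending, and timeout metrics named above. PoolModel and its method names are made up for this sketch.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a fixed-size connection pool; not HikariCP. */
final class PoolModel {
    private final int size;
    private final long acquireTimeoutMs;
    private final Semaphore free;
    private final AtomicInteger pending = new AtomicInteger();
    private final AtomicInteger timeouts = new AtomicInteger();

    PoolModel(int size, long acquireTimeoutMs) {
        this.size = size;
        this.acquireTimeoutMs = acquireTimeoutMs;
        this.free = new Semaphore(size, true); // fair: waiters are served in arrival order
    }

    /** Returns true if a connection was obtained before the timeout; the caller must then release(). */
    boolean acquire() throws InterruptedException {
        pending.incrementAndGet();
        try {
            boolean acquired = free.tryAcquire(acquireTimeoutMs, TimeUnit.MILLISECONDS);
            if (!acquired) {
                timeouts.incrementAndGet();
            }
            return acquired;
        } finally {
            pending.decrementAndGet();
        }
    }

    void release() { free.release(); }

    int active()   { return size - free.availablePermits(); }
    int pending()  { return pending.get(); }
    int timeouts() { return timeouts.get(); }

    public static void main(String[] args) throws InterruptedException {
        PoolModel pool = new PoolModel(2, 100);
        pool.acquire();
        pool.acquire();                 // both connections now held by "slow queries"
        boolean third = pool.acquire(); // waits 100 ms, then gives up
        System.out.println("third acquired=" + third
                + " active=" + pool.active() + " timeouts=" + pool.timeouts());
    }
}
```

The third caller burns its whole timeout doing nothing, which is the latency spike with no CPU spike. Swap the caller for a virtual thread and nothing changes: the semaphore, not the thread count, is the limit.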
Sagas: The Distributed Transaction Myth
Business says: “Create order, deduct inventory, charge payment — all or nothing.” Developers hear: “Distributed transaction.” They reach for two-phase commit (2PC) or try to use JTA across microservices. Both are terrible ideas. 2PC locks resources for the duration of the transaction. If any participant fails, you have in-doubt transactions that require manual recovery. In microservices, you use sagas — a sequence of local transactions with compensating actions for rollback. Orchestration-based sagas use a central coordinator (like Axon or Temporal). Choreography-based sagas use events (each service publishes an event when done, another service reacts). Both patterns trade consistency for availability. You face the problem of handling the “last mile”: the saga says “payment captured” but then the inventory service crashes before updating stock. You’re in an inconsistent state. The solution: idempotency keys and retry queues. Each step must be idempotent. Use a unique saga ID in every request. Store the saga state in a database. Use a dead-letter queue for failed steps, with a retry mechanism and exponential backoff. And always design compensating actions that are idempotent too.
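A minimal sketch of the idempotency part, under heavy assumptions: the two services are in-memory classes with hypothetical names, the maps stand in for durable idempotency tables, and retries, backoff, and the dead-letter queue are left out. It covers the “payment captured, inventory failed” case from above, with the saga ID as the idempotency key on both the action and its compensation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SagaDemo {
    /** Hypothetical payment service; the collections stand in for a durable idempotency table. */
    static final class Payments {
        private final Map<String, Integer> charges = new HashMap<>();
        private final Set<String> refunded = new HashSet<>();
        int capturedCents = 0;

        void charge(String sagaId, int cents) {
            if (refunded.contains(sagaId)) return;      // late retry after compensation: ignore
            if (charges.putIfAbsent(sagaId, cents) == null) {
                capturedCents += cents;                 // first delivery only
            }
        }

        /** Compensating action; calling it twice has the same effect as calling it once. */
        void refund(String sagaId) {
            boolean firstRefund = refunded.add(sagaId);
            Integer cents = charges.get(sagaId);
            if (firstRefund && cents != null) {
                capturedCents -= cents;
            }
        }
    }

    /** Hypothetical inventory service with the same saga-ID-keyed bookkeeping. */
    static final class Inventory {
        private final Map<String, Integer> reservations = new HashMap<>();
        int stock;

        Inventory(int stock) { this.stock = stock; }

        boolean reserve(String sagaId, int qty) {
            if (reservations.containsKey(sagaId)) return true; // duplicate delivery: already applied
            if (stock < qty) return false;
            stock -= qty;
            reservations.put(sagaId, qty);
            return true;
        }
    }

    /** Orchestrated saga: charge, then reserve; compensate the charge if the reservation fails. */
    static boolean placeOrder(String sagaId, int cents, int qty, Payments payments, Inventory inventory) {
        payments.charge(sagaId, cents);
        if (inventory.reserve(sagaId, qty)) {
            return true;
        }
        payments.refund(sagaId);
        return false;
    }

    public static void main(String[] args) {
        Payments payments = new Payments();
        Inventory inventory = new Inventory(1);
        System.out.println(placeOrder("saga-1", 500, 1, payments, inventory));
        System.out.println(placeOrder("saga-1", 500, 1, payments, inventory)); // retried delivery
        System.out.println(placeOrder("saga-2", 700, 1, payments, inventory)); // out of stock
        System.out.println("captured=" + payments.capturedCents + " stock=" + inventory.stock);
    }
}
```

Because every step keys on the saga ID, the orchestrator can blindly re-run placeOrder after a crash: the retried saga-1 neither double-charges nor double-reserves, and the refund for saga-2 is safe to repeat.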
Inter-Service Communication: REST vs. gRPC vs. Events — When Each Breaks
REST is the default. It’s simple. But its weaknesses show up under load. Latency adds up across hops. Serialization costs add up too. There’s no built-in retry or backpressure. gRPC fixes some of this: binary protocol, streaming, deadline propagation. But it introduces its own problems: calls are multiplexed over long-lived HTTP/2 connections, so a connection-level (L4) load balancer pins all of a client’s traffic to one backend, and a stalled TCP connection holds up every stream on it. And gRPC doesn’t shed load for you: if a service is overwhelmed, it just sends back RESOURCE_EXHAUSTED, and the caller has to decide what to do. Events (Kafka, RabbitMQ) solve the coupling problem. Services don’t call each other; they publish events that other services consume. This gives you async processing and inherent buffering. But it introduces complexity: event ordering, at-least-once vs. exactly-once semantics, and schema evolution. The trade-off is clear: synchronous calls (REST/gRPC) are for real-time interactions where you need a response. Events are for background processing and eventual consistency. In a real system, you use both. And you need a retry mechanism for both. Don’t assume REST will work because you added a timeout. I’ve seen a payment gateway timeout cause a 10-second thread block on every request, cascading through 5 services. The fix: put a 2-second timeout on the call and a circuit breaker around it. And test it.
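The timeout half of that fix can be sketched with nothing but CompletableFuture (Java 9+). Deadline and callWithDeadline are hypothetical names, and the Supplier stands in for whatever HTTP or gRPC client you use. One caveat the sketch does not solve: the slow call keeps running on its pool thread after the deadline; only the caller is released. A real client should also set its own connect and read timeouts so the underlying request is abandoned too.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class Deadline {
    /**
     * Runs remoteCall on the common pool and waits at most timeoutMs for it.
     * Returns fallback on timeout or on any exception from the call.
     */
    static <T> T callWithDeadline(Supplier<T> remoteCall, long timeoutMs, T fallback) {
        return CompletableFuture.supplyAsync(remoteCall)
                .completeOnTimeout(fallback, timeoutMs, TimeUnit.MILLISECONDS)
                .exceptionally(error -> fallback)
                .join();
    }

    public static void main(String[] args) {
        // Stand-in for a payment gateway that takes 10 seconds to answer.
        Supplier<String> slowGateway = () -> {
            try {
                Thread.sleep(10_000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "CHARGED";
        };
        long start = System.nanoTime();
        String result = callWithDeadline(slowGateway, 2_000, "GATEWAY_TIMEOUT");
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(result + " after ~" + elapsedMs + " ms");
    }
}
```

The caller gets GATEWAY_TIMEOUT after about two seconds instead of blocking for ten. The circuit breaker then sits on top of this: repeated timeouts trip it, and later calls skip the two-second wait entirely.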
The Circuit Breaker That Killed Checkout
ConcurrentHashMap.compute()
2) Wrapped the fallback logic in try-catch with explicit error logging
3) Added a health check endpoint that tests the fallback path
4) Changed circuit breaker configuration from COUNT_BASED to TIME_BASED with a 10-second window
5) Wrote a unit test that opens the circuit breaker and verifies the fallback response
6) Added a Prometheus alert on circuit breaker open state duration > 30 seconds
- A circuit breaker’s fallback is not a safety net.
- It’s a code path that must be tested as rigorously as the primary path.
- Log the fallback result — both success and failure.
- Your pager will thank you.
kubectl exec -it <pod> -- curl -s http://localhost:8081/actuator/health | jq '.components.circuitBreakers'
kubectl logs --tail=100 <pod> | grep -E 'CircuitBreaker|Hystrix|resilience4j' | grep -v 'closed'
Key takeaways
Common mistakes to avoid
- Using @Async without configuring a TaskDecorator for trace context
- Same health check for liveness and readiness probes
- Circuit breaker fallback method throws an exception
- Saga compensating action is not idempotent
- Setting HikariCP maximumPoolSize equal to Tomcat thread pool size
Interview Questions on This Topic
You have two services communicating over REST. The downstream service goes down. The client starts seeing 5-second timeouts. How do you prevent this from cascading to upstream services?