Microservices decompose a monolith into independently deployable services that each own their data and expose APIs
Service decomposition follows bounded contexts from Domain-Driven Design — not arbitrary splits
Inter-service communication patterns: synchronous (HTTP/REST/gRPC) and asynchronous (message queues/events)
Performance: sync calls within the same cluster typically add 1–10ms per hop. The real danger is amplification: across five synchronous hops the latencies add up and the failure probabilities compound, so availability is coupled end to end
Production: without circuit breakers, a single slow service exhausts upstream thread pools and cascades failures across the entire system within minutes
Biggest mistake: splitting by technical layers (e.g., UI service, DB service) instead of business domains
Plain-English First
Imagine a restaurant. In a tiny diner, one cook does everything — takes orders, grills the steak, makes dessert, handles the bill. That works fine until Saturday night when 200 people walk in and the single cook collapses. A large restaurant splits those jobs: a host, waiters, a grill chef, a pastry chef, a cashier. Each person can be replaced, trained, or given help independently. If the pastry chef calls in sick, the rest of the kitchen keeps running. Microservices are exactly that split — instead of one giant program doing everything, you divide your software into small, focused services that each do one job really well and talk to each other over a network. The tricky part is deciding where to draw the lines between jobs — cut in the wrong place and you end up with a grill chef who can't cook without calling the pastry chef first, which defeats the whole point.
Every company that has scaled past a certain size has hit the same wall: the monolith becomes unmaintainable. A change to the payment logic requires a full deployment of the entire application. One memory leak in the recommendation engine takes down the checkout flow. A single team of 200 engineers all committing to the same codebase creates merge conflicts, broken builds, and release bottlenecks that kill velocity. Netflix, Amazon, Uber, and Spotify all hit this wall and made the same architectural pivot — they decomposed their systems into microservices, and it changed how the entire industry thinks about building software at scale.
Microservices solve a specific, painful problem: the coupling problem. In a monolith, components share memory, share a database, and share a deployment lifecycle. That shared fate is the enemy of scale and resilience. Microservices decouple those dimensions — each service owns its data, deploys independently, scales independently, and fails independently. When the recommendation service goes down, users can still check out. When checkout traffic spikes on Black Friday, you scale just that service, not the entire application.
By the end of this article you'll understand not just what microservices are, but how to decompose a domain correctly using bounded contexts, which communication patterns to choose and why, how to handle distributed failure gracefully with circuit breakers and bulkheads, and the production-level trade-offs nobody talks about until it's 2am and your pager is going off. This is the article you read before your next system design interview — and before you propose microservices to your team.
What Is Microservices Architecture?
Microservices architecture breaks an application into small, loosely coupled services that each handle a specific business function. Each service runs its own process, communicates over a network, and can be deployed, scaled, and maintained independently. It's not about size — it's about autonomy. A single service might be 50 lines or 50,000 lines; what matters is that it owns its domain end-to-end and can be changed without asking permission from every other team.
The contrast with a monolith is worth making concrete. In a monolith, the order module and the payment module share the same JVM process. A method call from order to payment is a function invocation — nanoseconds, no network, no serialization. In a microservices system, that same interaction crosses a network boundary. It involves HTTP or gRPC, serialization, a timeout budget, and a retry policy. That cost is real, and it is worth paying only when the decoupling benefit justifies it.
Why does it matter? Because a monolith fails to scale in two distinct ways that often get conflated. First, technical scaling: you cannot scale just the bottleneck. If your payment processing is slow and your product catalog is fine, a monolith forces you to scale both. Second, organizational scaling: with a monolith, every change requires coordination across the entire team. A 200-person engineering organization committing to a single codebase is a coordination disaster. Microservices give each team a codebase they can change and deploy without a cross-team meeting.
One thing I've seen consistently across migrations: the teams that succeed with microservices treat it as an organizational change first and a technical change second. The migrations that fail, the ones that create more meetings than they eliminate, are almost always the ones where the team structure didn't change when the codebase did.
package io.thecodeforge.microservices.boundary;

// ── What a monolith looks like: direct in-process method call ─────────────────
// Fast (nanoseconds), simple, but tightly coupled.
// A change to PaymentService requires redeploying the entire application.
// A crash in PaymentService brings down OrderService too: they share one JVM process.
class MonolithOrderFlow {
    private final PaymentService paymentService; // direct reference, same JVM
    private final InventoryService inventoryService;

    public OrderResult placeOrder(OrderRequest request) {
        // In-process call: no network, no serialization, no timeout
        PaymentResult payment = paymentService.charge(request.amount(), request.card());
        inventoryService.reserve(request.items());
        return new OrderResult(payment.transactionId());
    }
}

// ── What the same flow looks like as microservices ────────────────────────────
// Each service is a separate process, potentially on a different machine.
// OrderService does NOT hold a reference to PaymentService; it makes a network call.
// The cost: network latency (~1–5ms same cluster), serialization, timeout budget.
// The benefit: PaymentService can be deployed, scaled, and failed independently.
class MicroserviceOrderFlow {
    private final PaymentClient paymentClient; // HTTP/gRPC client, not a direct ref
    private final InventoryClient inventoryClient;
    private final EventPublisher eventPublisher;

    public OrderResult placeOrder(OrderRequest request) {
        // Network call: PaymentService is a separate deployed process.
        // If PaymentService is slow, only this thread blocks, not the entire application.
        PaymentResult payment = paymentClient.charge(request.amount(), request.card());

        // Async event: InventoryService consumes this when it's ready.
        // OrderService does not wait for inventory to confirm (decoupled in time).
        eventPublisher.publish(new OrderPlacedEvent(request.orderId(), request.items()));
        return new OrderResult(payment.transactionId());
    }
}

// ── The key trade-off in one line ─────────────────────────────────────────────
// Monolith: fast, simple, coupled (one failure domain)
// Microservices: slower, complex, decoupled (isolated failure domains)
// Choose based on team size and scaling bottleneck, not on what's fashionable.
// The 2ms cost IS worth it when:
// - A different team owns PaymentService and deploys on their own schedule
// - A PaymentService crash should NOT crash OrderService
//
// The 2ms cost is NOT worth it when:
// - Your team has fewer than 10 engineers
// - You don't yet know where the scaling bottleneck is
// - You don't have automated CI/CD for each service
Microservices Are an Organizational Pattern First
The most reliable signal that microservices are working: each team can ship a feature without scheduling a meeting with another team. If your services are split but your deployment calendar still requires cross-team coordination, you've paid the cost of microservices without getting the benefit. Split teams before you split code — Conway's Law will handle the rest.
Production Insight
A team decomposed their monolith into eight services but kept the same organizational structure — one team owned all eight services.
Every service change still went through the same shared review, because the schema was shared and the team hadn't reorganized around bounded contexts.
Six months later they merged six of the eight services back together. The two that survived the merger were the ones that had genuinely independent teams and independent data stores.
Rule: microservices without organizational decoupling are just distributed chaos with extra network hops.
Key Takeaway
Microservices are about autonomy and decoupling, not service size.
The network cost is real — pay it only when the independence benefit justifies it.
Organize teams around business capabilities first. The architecture will follow.
Should You Start with Microservices?
If: Team size fewer than 10 engineers, domain is not yet fully understood
→ Use: Start with a well-structured monolith. Extract services only when you hit a concrete scaling bottleneck or a team coordination problem that the monolith is causing. You will learn the domain boundaries better from the monolith before you cut.
If: Multiple teams, each owning a distinct business capability with different deployment cadences
→ Use: Microservices align naturally. Use bounded contexts from Domain-Driven Design to define service boundaries. Invest in CI/CD and observability before the first service goes to production.
If: Rapidly growing team, frequent deployments, one part of the system is consistently the bottleneck
→ Use: Extract the bottleneck as a microservice first. Do not split everything at once. Validate the pattern with one service before committing the entire system to the migration.
Figure: Microservices Architecture. A client layer (web, mobile, third-party) connects through an API Gateway that handles auth, rate limiting, routing, and SSL termination. Three services, User (JWT/login), Order (cart/checkout), and Notification (email/SMS), communicate asynchronously via a Kafka or RabbitMQ message bus. Each service owns its own database (database-per-service pattern). Cross-cutting concerns include service discovery (Eureka/Consul), observability (ELK, Prometheus, Jaeger), and centralised config (Spring Cloud Config).
Service Decomposition — The Bounded Context Rule
The hardest part of microservices isn't the technology — it's deciding where to cut. Most teams get this wrong by splitting services along technical layers: a database service, a UI service, a middleware service. That creates a distributed monolith: every business feature still touches all three services and requires coordinated deployment. The correct approach is Domain-Driven Design's bounded context: each microservice models a single business capability, owns all the data and logic for that capability, and communicates with other services via explicit interfaces.
For an e-commerce system, that means separate services for Product Catalog, Order Management, Payment, Inventory, and Shipping. Not a 'backend service' and a 'frontend service'. Each bounded context has its own database schema, its own deployment pipeline, and ideally its own team. The boundaries are driven by the business domain, not the technical stack.
The most useful heuristic I know: if two pieces of functionality change for the same business reason, they belong in the same service. If they change for different reasons — different teams, different stakeholders, different release cadences — they should be separated. This is the Single Responsibility Principle applied at the service level rather than the class level.
Event storming is the most effective technique for finding these boundaries in practice. Get domain experts, engineers, and product managers in a room. Write business events on sticky notes (OrderPlaced, PaymentFailed, ItemShipped). Group the events by the business process they belong to. Those groups are your bounded contexts. The boundaries where events cross between groups are where your service APIs will live.
Let's be honest about something: you will get the boundaries wrong the first time. Domain knowledge accretes over years, and the first cut is always made with incomplete information. That is acceptable — as long as each service has its own data and its own deployment pipeline, merging two services is a manageable refactor. The cost of merging two over-split services is much lower than the cost of untangling a prematurely split service that shares a database with its neighbor.
Use the strangler fig pattern when extracting services from an existing monolith: route specific request types to the new service while the monolith handles everything else, then gradually expand the new service's responsibility until the monolith no longer handles that capability. Never attempt a big-bang rewrite — extract one bounded context at a time, validate it in production, then move to the next.
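As a rough illustration of the routing step in the strangler fig pattern, the sketch below sends requests whose path starts with an already-extracted prefix to the new service and everything else to the monolith. The class name `StranglerRouter`, the backend labels, and the `/orders` prefix are assumptions made for this example; in practice this logic usually lives in an API gateway or reverse-proxy configuration rather than in application code.

```java
import java.util.List;

/** Hypothetical routing facade for a strangler fig migration. */
final class StranglerRouter {
    // Path prefixes whose handling has already moved out of the monolith.
    private final List<String> extractedPrefixes;

    StranglerRouter(List<String> extractedPrefixes) {
        this.extractedPrefixes = List.copyOf(extractedPrefixes);
    }

    /** Returns the label of the backend that should serve the given request path. */
    String backendFor(String path) {
        for (String prefix : extractedPrefixes) {
            if (path.startsWith(prefix)) {
                return "new-service";
            }
        }
        // Anything not yet extracted keeps going to the monolith.
        return "monolith";
    }

    public static void main(String[] args) {
        StranglerRouter router = new StranglerRouter(List.of("/orders"));
        System.out.println(router.backendFor("/orders/42"));
        System.out.println(router.backendFor("/catalog/7"));
    }
}
```

Migration then proceeds by adding one prefix at a time to the extracted list and watching production before adding the next, which keeps rollback as simple as removing the prefix again.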
package io.thecodeforge.microservices;

// OrderService owns the Order bounded context entirely.
// It knows about orders, order lifecycle events, and order state.
// It does NOT know about inventory levels, payment processing details,
// or shipping logistics; those belong to other bounded contexts.
//
// The service's API surface is what it publishes to other services:
// - Events: OrderCreatedEvent, OrderCancelledEvent, OrderFulfilledEvent
// - Queries: getOrder(orderId), listOrdersByUser(userId)
//
// Nothing outside this service queries the orders table directly.
// That boundary is what makes independent deployment possible.
public class OrderService {
    private final OrderRepository orderRepository;
    private final EventPublisher eventPublisher;

    public OrderService(OrderRepository orderRepository, EventPublisher eventPublisher) {
        this.orderRepository = orderRepository;
        this.eventPublisher = eventPublisher;
    }

    public Order createOrder(CreateOrderRequest request) {
        // OrderService validates and persists the order.
        // It does NOT call InventoryService or PaymentService synchronously here.
        // Those services will react to the OrderCreatedEvent asynchronously.
        // This means: order creation never blocks on inventory availability.
        Order order = new Order(
            request.userId(),
            request.items(),
            request.shippingAddress(),
            OrderStatus.PENDING
        );
        orderRepository.save(order);

        // Publish the event; other bounded contexts react at their own pace.
        // PaymentService will attempt to charge. InventoryService will reserve stock.
        // If either fails, they publish their own compensating events.
        eventPublisher.publish(new OrderCreatedEvent(
            order.id(),
            order.userId(),
            order.items(),
            order.shippingAddress(),
            order.totalAmount()
        ));
        return order;
    }

    public Order getOrder(String orderId) {
        // Other services that need order data call this method via REST/gRPC.
        // They do NOT query the orders table directly.
        return orderRepository.findById(orderId)
            .orElseThrow(() -> new OrderNotFoundException(orderId));
    }
}
Output
// When createOrder() is called:
// 1. Order saved to orders table (owned exclusively by OrderService)
// 2. OrderCreatedEvent published to message broker
// 3. Response returned to caller immediately — no waiting for inventory or payment
// If PaymentService is down: the event sits in the queue. Order creation still succeeds.
// OrderService's availability is independent of PaymentService's availability.
The Shared Database Anti-Pattern Kills Independent Deployability
If two microservices read from or write to the same database table, you do not have microservices — you have a distributed monolith. That shared schema becomes the coupling point that defeats everything microservices are supposed to give you. A schema change in one service requires coordinated deployment of every service that touches that table. A slow query in one service exhausts the connection pool for all others. The database becomes your single point of failure, your single point of coupling, and your deployment bottleneck — simultaneously.
Production Insight
A team split their monolith into services but kept a single MySQL instance with all schemas on it.
A developer ran ALTER TABLE inventory.products ADD COLUMN during business hours.
The DDL needed an exclusive metadata lock on the table. Queries from order_service and product_service that touched it queued behind the lock, each one holding a connection on the shared server.
Within four minutes, the connection pool was exhausted. Three services started returning 503. Health checks failed. Kubernetes restarted the pods, which immediately tried to reconnect to the overloaded database, making it worse.
The team had separate codebases and separate deployments, but shared fate via shared infrastructure.
Rule: separate schemas on the same database server still share its connections, memory, and I/O, so one blocked table can starve every service on that server. Only separate instances give you true data isolation.
Key Takeaway
Bounded context is the only reliable decomposition strategy — split by business capability, not by technology layer.
Use event storming to find boundaries before writing any code.
Expect to get the first cut wrong. That is fine if each service owns its data and deployment.
Use the strangler fig pattern for extraction — never a big-bang rewrite.
When to Split a Service
If: Two components always change together and are owned by the same team
→ Use: Keep them in the same service. Premature splitting adds network hops, serialization overhead, and distributed failure modes without any autonomy benefit.
If: Components change at different rates, have different SLA requirements, or are owned by different teams
→ Use: Split into separate services with independent deployability and independent data stores. Validate that each service can deploy without the other being updated.
If: One bounded context can be fully owned by a single cross-functional team of five to nine people
→ Use: Split. Team ownership is the signal that the boundary is correct. If no single team can fully own the service end-to-end, the boundary is wrong.
Communication Patterns — Sync vs Async
Microservices need to talk to each other. The two dominant patterns are synchronous (HTTP/REST, gRPC) and asynchronous (message queues, event streaming). Each comes with trade-offs that are not symmetric — choosing the wrong one for a given interaction is one of the most common causes of production incidents in microservices systems.
Synchronous calls are simple to implement and reason about: service A calls service B, waits for a response, and returns it to the caller. But they introduce tight temporal coupling. If service B is slow, service A blocks. If service B is down, service A fails. A chain of five synchronous hops means five points of failure, and the latencies add up: 5ms per hop becomes 25ms minimum for a request that touches five services, and under load that can easily become 250ms. Thread pool exhaustion is the failure mode — service B holding 200ms responses consumes all of service A's worker threads, and service A starts returning 503 to its callers before service B actually goes down.
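The arithmetic behind that paragraph can be sketched in a few lines. Latencies along a sequential chain add, availabilities multiply, and the number of busy threads follows Little's law (arrival rate times the time each request holds a thread). The 99.9% per-hop availability and the 100 requests per second used in `main` are illustrative assumptions, not figures from a real system.

```java
/** Back-of-the-envelope numbers for a chain of synchronous calls. */
final class SyncChainMath {

    /** End-to-end latency of a strictly sequential chain: per-hop latencies add. */
    static double chainLatencyMs(int hops, double perHopMs) {
        return hops * perHopMs;
    }

    /** Availability when every hop must succeed: per-hop availabilities multiply. */
    static double chainAvailability(int hops, double perHopAvailability) {
        return Math.pow(perHopAvailability, hops);
    }

    /** Little's law: requests in flight = arrival rate x time each request holds a thread. */
    static double threadsInUse(double requestsPerSecond, double latencySeconds) {
        return requestsPerSecond * latencySeconds;
    }

    public static void main(String[] args) {
        System.out.printf("5 hops x 5 ms = %.0f ms%n", chainLatencyMs(5, 5.0));
        System.out.printf("5 hops at 99.9%% each = %.2f%% end to end%n",
                100 * chainAvailability(5, 0.999));
        // Assumed load: 100 requests per second against a fast and a slow downstream.
        System.out.printf("threads busy at 40 ms: %.0f, at 4 s: %.0f%n",
                threadsInUse(100, 0.040), threadsInUse(100, 4.0));
    }
}
```

Under these assumptions five hops at 99.9% each give roughly 99.5% end to end, and the same 100 requests per second that kept about 4 threads busy at 40ms keep about 400 busy at 4 seconds, which is how a fixed-size pool runs dry while the downstream is still technically up.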
Asynchronous communication decouples services in time: service A publishes an event to a message broker (Kafka, RabbitMQ, AWS SQS) and continues executing without waiting. Service B consumes the event when it is ready. If service B is slow, the message queue absorbs the backlog. If service B restarts, it picks up where it left off. The caller's added latency is small: handing a record to the Kafka producer's in-memory buffer takes microseconds, and a replicated broker acknowledgement typically arrives within a few milliseconds. But you have accepted eventual consistency. The caller cannot know the outcome of the operation immediately. You now manage topics, consumer groups, dead-letter queues, and message schema evolution.
gRPC deserves a specific mention for synchronous calls. It uses HTTP/2 (multiplexed connections, header compression, binary framing) and Protocol Buffers (compact binary serialization). For internal service-to-service calls where both sides are controlled, gRPC usually beats JSON-over-HTTP/1.1 REST. Figures of 5x to 10x in throughput and 30% to 50% in latency are often quoted, but the margin depends heavily on payload size and workload, so benchmark your own traffic. The cost is tooling complexity: you need protobuf schema management, plus a gateway (for example gRPC-Web or a REST transcoding proxy) if browser clients need access. For high-throughput internal APIs in 2026, gRPC is a strong default; REST is for external-facing APIs where developer experience and wide tooling support matter more than raw performance.
The rule that holds in production: default to async for most interactions, especially when the calling service does not need an immediate result. Use sync only when you truly need a real-time, consistent response — payment authorization, user authentication, read queries where the client is waiting. Even for synchronous calls, always set explicit timeouts. A downstream that never responds is worse than one that responds with an error, because it holds your thread indefinitely.
package io.thecodeforge.microservices.event;

import io.thecodeforge.microservices.OrderCreatedEvent;

// InventoryEventHandler is completely decoupled from OrderService.
// OrderService does not know InventoryService exists.
// InventoryService does not know when OrderService deployed last.
// They are connected only by the shape of the OrderCreatedEvent message.
//
// This is the async pattern in practice:
// - OrderService publishes → returns immediately (no wait)
// - Kafka durably stores the event
// - InventoryService consumes at its own pace
// - If InventoryService is down: event waits in Kafka (retention: configurable, default 7 days)
// - When InventoryService restarts: it processes the backlog
// - No data loss, no cascading failure into OrderService
public class InventoryEventHandler {
    private final InventoryService inventoryService;
    private final DeadLetterPublisher deadLetterPublisher;

    public InventoryEventHandler(InventoryService inventoryService,
                                 DeadLetterPublisher deadLetterPublisher) {
        this.inventoryService = inventoryService;
        this.deadLetterPublisher = deadLetterPublisher;
    }

    // Called by the Kafka consumer when an OrderCreatedEvent arrives.
    // This method must be idempotent: Kafka delivers at-least-once,
    // so this handler may be called more than once for the same event.
    // Use the orderId as an idempotency key to detect and skip duplicates.
    public void handleOrderCreated(OrderCreatedEvent event) {
        try {
            for (var item : event.items()) {
                // idempotent: reserve does nothing if already reserved for this orderId
                inventoryService.reserve(item.sku(), item.quantity(), event.orderId());
            }
        } catch (InsufficientStockException e) {
            // Business failure: do not retry. Publish to the DLQ for manual review.
            deadLetterPublisher.publish(event, e.getMessage());
            // Saga compensation (not shown in this handler): a StockUnavailableEvent
            // must also be published so OrderService can cancel the order.
        } catch (TransientException e) {
            // Transient failure (DB timeout, network blip): rethrow so Kafka retries
            // with the consumer's configured retry policy and backoff.
            throw e;
        }
    }
}
Output
// Async flow timeline:
// T+0ms: OrderService publishes OrderCreatedEvent to Kafka topic 'order.created'
// T+0ms: OrderService returns OrderResult to the HTTP caller — does not wait
// T+5ms: Kafka replicates the event to followers (default acks=all)
// T+50ms: InventoryEventHandler.handleOrderCreated() called by consumer thread
// T+52ms: inventoryService.reserve() called — stock reserved in inventory DB
// T+52ms: If InventoryService was down at T+0ms: event stays in Kafka
// InventoryService processes it when it comes back — no data loss
// The same order handled with a synchronous call instead, for contrast:
// T+0ms: OrderService thread BLOCKS waiting for InventoryService response
// T+5000ms: InventoryService times out — OrderService returns 503 to caller
// If 200 concurrent orders: 200 threads blocked for 5 seconds each
// Thread pool exhausted — all subsequent orders fail immediately
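Because delivery is at-least-once, the consumer has to tolerate seeing the same event twice. One common way to do that is to remember which reservations have already been applied, sketched here with an in-memory set keyed on order ID plus SKU. The class and method names are hypothetical, and a real service would store the processed-key record in the same database transaction as the stock update so the two cannot diverge.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory sketch of an idempotent stock reservation handler. */
final class IdempotentReservationHandler {
    // Idempotency keys (orderId + "/" + sku) that have already been applied.
    private final Set<String> appliedKeys = ConcurrentHashMap.newKeySet();
    private final Map<String, Integer> reservedBySku = new ConcurrentHashMap<>();

    /** Returns true if the reservation was applied, false if it was a duplicate delivery. */
    boolean reserve(String orderId, String sku, int quantity) {
        String key = orderId + "/" + sku;
        // Set.add returns false when the key is already present, so a redelivered
        // event falls through here without touching the stock counters.
        if (!appliedKeys.add(key)) {
            return false;
        }
        reservedBySku.merge(sku, quantity, Integer::sum);
        return true;
    }

    int reserved(String sku) {
        return reservedBySku.getOrDefault(sku, 0);
    }
}
```

Keying on the order ID alone would skip the second item of a multi-item order, which is why the sketch includes the SKU in the key.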
Production Insight
A team used synchronous HTTP for every inter-service call: order called product, inventory, payment, and shipping in sequence.
When the product service experienced a cache stampede and started responding in 4 seconds instead of 40ms, the order service thread pool (200 threads) was exhausted within 45 seconds.
All subsequent order requests failed with HTTP 503 — not because of anything wrong with order, payment, or shipping, but because product was slow and every thread was waiting for it.
Switching product enrichment (non-critical data added to the order response) to async reduced the blast radius: product slowness no longer affected order creation.
Rule: ask yourself 'does the caller actually need this response right now to do its job?' If the answer is no, use async.
Key Takeaway
Sync couples availability: if the downstream is slow, you are slow.
Async decouples availability but requires eventual consistency and idempotent consumers.
Default to async for most inter-service interactions.
For synchronous calls: always set explicit timeouts, use gRPC for internal APIs, and never let the absence of a timeout be the failure mode.
Sync vs Async Decision
If: The caller needs an immediate, consistent response to complete its own operation (payment authorization, user authentication, read queries)
→ Use: Synchronous calls. Prefer gRPC for internal service-to-service calls (binary protocol, HTTP/2, lower latency). Use REST for external-facing APIs where broad client compatibility matters.
If: The response can be processed later, or the call is fire-and-forget (send confirmation email, update inventory, trigger shipment)
→ Use: Async via message queue or event stream. Kafka for high-throughput ordered events with replay capability. SQS for simpler queue semantics with managed infrastructure.
If: The downstream service is owned by another team with a different SLA or deployment schedule
→ Use: Default to async. Cross-team sync calls couple your availability to their availability. An async boundary lets each team deploy and fail independently.
Failure Isolation — Circuit Breakers, Bulkheads, and Retry
In a microservices architecture, failure is not an edge case — it is the normal operating condition at scale. Services crash, networks partition, databases slow down, and dependencies time out. Without explicit isolation patterns, a single failing service can exhaust upstream thread pools and cascade failures through your entire system within seconds. These patterns are not optional features you add when something breaks — they are the load-bearing walls of a resilient distributed system.
The three essential patterns work together:
A circuit breaker monitors the error rate on calls to a downstream service. When errors exceed a configured threshold (say, 50% of calls failing in a 10-second window), the breaker opens: subsequent calls fail immediately without touching the downstream. After a cooldown period, the breaker enters half-open state and allows a small number of test calls through. If they succeed, the breaker closes and normal traffic resumes. If they fail, the breaker reopens for another cooldown. The critical benefit: thread pools stop being consumed by calls that will time out anyway, and the failing downstream gets time to recover without being hammered by a thundering herd.
A bulkhead isolates resource consumption so that a failure in one dependency cannot consume all available threads or connections. In practice: separate thread pools (or semaphore-based concurrency limits) for each downstream dependency. If your payment service is slow and consuming all threads in its pool, calls to your product catalog service still have their own pool to use. The payment slowness is contained. Without bulkheads, one slow downstream starves every other dependency simultaneously.
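The semaphore-based variant of a bulkhead can be sketched with nothing but the JDK. This is a minimal illustration under assumed names (`SemaphoreBulkhead`, the "bulkhead full" message), not a replacement for a library implementation: each dependency gets its own instance, and a call that finds no free permit is rejected at once instead of queueing.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical concurrency limiter: one instance per downstream dependency. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs the call if a permit is free; otherwise rejects immediately. */
    <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return call.get();
        } finally {
            // Always hand the permit back, even when the call throws.
            permits.release();
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

With one bulkhead for payment and another for the product catalog, a slow payment service can hold at most its own permits; catalog calls keep flowing because they draw from a different semaphore.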
Retry with exponential backoff handles transient failures — network blips, temporary overload, brief restarts. The key word is exponential: first retry after 200ms, second after 400ms, third after 800ms, with jitter added to prevent synchronized retry storms when hundreds of clients retry simultaneously. Never retry without backoff. Never retry more than three times for most operations.
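The delay schedule described above is easy to compute by hand. The sketch below doubles a base delay on each retry, caps it, and then applies "full jitter" by picking a random delay between zero and the capped value. The class name, the cap parameter, and the choice of full jitter (rather than some other jitter scheme) are assumptions made for the example.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper that computes retry delays. */
final class Backoff {

    /** Delay before retry number `retry` (1-based): initial * 2^(retry-1), capped at capMs. */
    static long baseDelayMs(long initialMs, int retry, long capMs) {
        // Bound the shift so very large retry counts cannot overflow the long.
        long uncapped = initialMs << Math.min(retry - 1, 20);
        return Math.min(uncapped, capMs);
    }

    /** Full jitter: a uniformly random delay in [0, baseDelay], inclusive. */
    static long jitteredDelayMs(long initialMs, int retry, long capMs) {
        long base = baseDelayMs(initialMs, retry, capMs);
        // nextLong(bound) excludes the bound, hence the + 1.
        return ThreadLocalRandom.current().nextLong(base + 1);
    }

    public static void main(String[] args) {
        for (int retry = 1; retry <= 3; retry++) {
            System.out.println("retry " + retry + ": base "
                    + baseDelayMs(200, retry, 10_000) + " ms, jittered "
                    + jitteredDelayMs(200, retry, 10_000) + " ms");
        }
    }
}
```

With a 200ms initial delay the base values come out as 200, 400, and 800ms, matching the schedule in the text; the jittered values differ on every run, which is the point, since it spreads out clients that failed at the same moment.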
The correct decorator order matters and is a common implementation mistake: the circuit breaker must wrap the retry, not the other way around. If retry wraps the circuit breaker, every attempt passes through the breaker separately, so one user request can register several failures, and once the breaker is open each retry is rejected immediately yet still waits out its backoff. If the circuit breaker wraps the retry, the breaker sees one outcome per request, and an open breaker short-circuits before any retry fires, which is the correct behavior.
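One way to see why the order matters is to count how many failures a breaker observes for a single request that fails on every attempt. The sketch below uses a stand-in `CountingBreaker` that only counts failures and a naive `retry` helper without backoff; both are hypothetical simplifications for illustration and do not reproduce any library's internals.

```java
import java.util.function.Supplier;

final class DecoratorOrderDemo {

    /** Stand-in for a breaker: it only counts the failures it observes. */
    static final class CountingBreaker {
        int recordedFailures = 0;

        <T> Supplier<T> decorate(Supplier<T> inner) {
            return () -> {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    recordedFailures++;
                    throw e;
                }
            };
        }
    }

    /** Naive retry without backoff: tries up to maxAttempts times, rethrows the last failure. */
    static <T> Supplier<T> retry(int maxAttempts, Supplier<T> inner) {
        return () -> {
            RuntimeException last = new IllegalArgumentException("maxAttempts must be >= 1");
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                try {
                    return inner.get();
                } catch (RuntimeException e) {
                    last = e;
                }
            }
            throw last;
        };
    }

    /** Failures the breaker records for ONE request against a downstream that always fails. */
    static int failuresRecorded(boolean breakerOutside) {
        CountingBreaker breaker = new CountingBreaker();
        Supplier<String> alwaysFails = () -> {
            throw new IllegalStateException("downstream down");
        };
        Supplier<String> call = breakerOutside
                ? breaker.decorate(retry(3, alwaysFails))
                : retry(3, breaker.decorate(alwaysFails));
        try {
            call.get();
        } catch (RuntimeException expected) {
            // The request fails either way; only the breaker's count differs.
        }
        return breaker.recordedFailures;
    }
}
```

With the breaker on the outside it sees one failed request; with the retry on the outside it sees three failures for the same request, so its failure rate reflects attempts rather than requests.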
Timeouts are the cheapest isolation mechanism and the one most often forgotten. A downstream that never responds is worse than one that responds with an error — it holds your thread indefinitely. Set explicit timeouts on every network call. In Resilience4j, configure TimeLimiter alongside your circuit breaker. In gRPC, set deadlines on every RPC. A reasonable starting point: 200ms for internal calls on the same cluster, 1 second for calls crossing availability zones.
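For code that does not use a resilience library, the JDK's own `CompletableFuture.orTimeout` (Java 9 and later) is enough to put a hard deadline on a call. The sketch below is a minimal illustration; the class name and the `"TIMED_OUT"` marker value are assumptions for the example, and a real client would map the timeout to its own error type.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class TimeoutDemo {

    /** Runs the call asynchronously and gives up waiting after timeoutMs. */
    static String callWithTimeout(Supplier<String> call, long timeoutMs) {
        try {
            return CompletableFuture.supplyAsync(call)
                    .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                    .join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                // The caller stops waiting here. The task itself is NOT interrupted
                // and keeps its worker thread until the underlying call returns.
                return "TIMED_OUT";
            }
            throw e;
        }
    }
}
```

The comment in the catch block is the important caveat: a future-level timeout frees the caller, not the worker, so the HTTP or gRPC client underneath still needs its own deadline.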
Don't treat these as separate decisions. The production-proven combination is: timeout (cheapest, first line) + retry with backoff (transient failures) + circuit breaker (persistent failures) + bulkhead (resource isolation). Each one handles a different failure mode. Together they give you a system that degrades gracefully under partial failure rather than failing completely.
package io.thecodeforge.microservices.resilience;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import io.github.resilience4j.timelimiter.TimeLimiter;
import io.github.resilience4j.timelimiter.TimeLimiterConfig;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.function.Supplier;

public class ResilientPaymentClient {

    // ── Bulkhead: dedicated thread pool for payment calls ─────────────────────
    // Payment slowness consumes only THESE threads.
    // Product, inventory, and shipping calls use their own pools.
    // Without this: one slow dependency starves all others.
    private final ExecutorService paymentThreadPool = Executors.newFixedThreadPool(10);

    // TimeLimiter.executeCompletionStage needs a ScheduledExecutorService to fire
    // the timeout, so the worker pool above cannot double as the scheduler.
    private final ScheduledExecutorService timeoutScheduler =
        Executors.newSingleThreadScheduledExecutor();

    // ── Timeout: 300ms hard limit on payment calls ────────────────────────────
    // A payment service that never responds is worse than one returning errors.
    // Without this: callers wait indefinitely, the pool exhausts, the system fails.
    private final TimeLimiter timeLimiter = TimeLimiter.of(
        TimeLimiterConfig.custom()
            .timeoutDuration(Duration.ofMillis(300))
            .build()
    );

    // ── Retry: up to 3 attempts (initial call plus 2 retries) ─────────────────
    // Handles network blips and brief restarts.
    // Exponential backoff with jitter prevents thundering herd on recovery.
    // Only retries on transient exceptions, not on business logic failures.
    private final Retry retry = Retry.of("payment",
        RetryConfig.custom()
            .maxAttempts(3)
            // 200ms initial delay, doubled per attempt, randomized by a factor of 0.5
            .intervalFunction(IntervalFunction.ofExponentialRandomBackoff(200, 2.0, 0.5))
            .retryExceptions(TransientPaymentException.class)
            .ignoreExceptions(PaymentDeclinedException.class) // business failure, don't retry
            .build()
    );

    // ── Circuit Breaker: opens when 50% of the last 10 calls fail ─────────────
    // Prevents thread pool exhaustion during persistent downstream failures.
    // After 30s cooldown, allows 2 test calls (half-open) to check recovery.
    private final CircuitBreaker circuitBreaker = CircuitBreaker.of("payment",
        CircuitBreakerConfig.custom()
            .failureRateThreshold(50)                        // open if 50% of last 10 calls fail
            .waitDurationInOpenState(Duration.ofSeconds(30)) // cooldown before half-open
            .permittedNumberOfCallsInHalfOpenState(2)        // test calls in half-open
            .slidingWindowSize(10)                           // count-based window (the default type)
            .build()
    );

    // ── Decorator order: circuit breaker WRAPS retry ──────────────────────────
    // If retry wrapped circuit breaker (wrong):
    //   every attempt passes through the breaker and is recorded separately;
    //   open breaker → attempt rejected → retry waits, tries again → rejected again
    //
    // Circuit breaker wrapping retry (correct):
    //   open breaker → fails immediately, no retry fires
    //   closed breaker → retry handles transient failures within the closed state
    public CompletableFuture<PaymentResponse> charge(PaymentRequest request) {
        Supplier<PaymentResponse> supplier =
            CircuitBreaker.decorateSupplier(circuitBreaker,
                Retry.decorateSupplier(retry,
                    () -> callPaymentService(request)
                )
            );

        // Combine timeout + bulkhead: supplier runs in the dedicated pool with a hard deadline
        return timeLimiter.executeCompletionStage(
            timeoutScheduler,
            () -> CompletableFuture.supplyAsync(supplier, paymentThreadPool)
        ).toCompletableFuture();
    }

    private PaymentResponse callPaymentService(PaymentRequest request) {
        // actual gRPC or HTTP call to payment service
        throw new UnsupportedOperationException("implement with actual gRPC client");
    }
}
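// Attempt 3 fails → CircuitBreaker records 1 failure for the whole charge() call
//   (it wraps the retry, so it sees one outcome per call) → may trigger OPEN if rate threshold met
//
// Bulkhead: if all 10 threads are occupied (payment is very slow)
//   → new charge() calls wait in the pool's queue and surface as timeouts after 300ms
//     (Executors.newFixedThreadPool queues without bound; a bounded queue or
//     Resilience4j's ThreadPoolBulkhead rejects immediately with BulkheadFullException)
//   → other services (product, inventory) are unaffected because they have their own pools
//
// Timeout: if callPaymentService() takes > 300ms
//   → the returned future completes with TimeoutException
//   → the worker thread stays busy until the underlying call returns, so the
//     gRPC/HTTP client needs its own deadline for the thread to be released
//   → the breaker sits inside the time limiter here, so it records a failure
//     only when the supplier itself throws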
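The retry settings above call for exponential backoff with jitter. As a rough illustration of what such a delay schedule looks like, here is a minimal library-free sketch. The `Backoff` class, its method names, and the 2-second cap are assumptions made for this example; they are not Resilience4j APIs.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of Resilience4j.
final class Backoff {
    private Backoff() {}

    // Largest delay allowed before retry number `attempt` (1-based):
    // baseMillis doubled (attempt - 1) times, capped at maxMillis.
    static long ceilingMillis(int attempt, long baseMillis, long maxMillis) {
        if (attempt < 1 || baseMillis < 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseMillis;
        for (int i = 1; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    // "Full jitter": a uniformly random delay in [0, ceiling], so clients that
    // failed at the same moment do not all retry at the same moment.
    static long jitteredMillis(int attempt, long baseMillis, long maxMillis) {
        long ceiling = ceilingMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }
}
```

With a 200ms base, the ceilings for attempts 1 to 3 are 200ms, 400ms, and 800ms. The jitter spreads the actual waits below those ceilings, which is what keeps a recovering service from being hit by synchronized retries.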
Think of Circuit Breakers as Electrical Trip Switches
CLOSED: calls pass through normally, the breaker tracks error count in a sliding window.
OPEN: the breaker has tripped — calls fail immediately without touching the downstream. The downstream gets breathing room to recover.
HALF-OPEN: after the cooldown, the breaker allows a small number of test calls. Success resets it; failure sends it back to OPEN.
Critical: the circuit breaker must be the outer wrapper. If retry is outside the breaker, retries fire against an open breaker and generate more failure events, keeping it open even after the downstream recovers.
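The three states described above can be sketched as a small state machine. This is a deliberately simplified illustration, not how Resilience4j implements its breaker: it counts consecutive failures instead of a sliding-window failure rate, is not thread-safe, and does not cap the number of probe calls in HALF_OPEN. The `SimpleBreaker` name and its constructor parameters are assumptions for the example.

```java
import java.util.function.LongSupplier;

// Simplified, single-threaded illustration of CLOSED / OPEN / HALF_OPEN.
final class SimpleBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long cooldownMillis;
    private final LongSupplier clock;   // injectable so tests do not need to sleep
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    SimpleBreaker(int failureThreshold, long cooldownMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
        this.clock = clock;
    }

    // Returns true if the caller may attempt the downstream call.
    boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= cooldownMillis) {
            state = State.HALF_OPEN;    // cooldown elapsed: let probe calls through
        }
        return state != State.OPEN;
    }

    // A success in any state resets the breaker to CLOSED.
    void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    // A failed probe, or too many failures in a row, trips the breaker.
    void recordFailure() {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    State state() {
        return state;
    }
}
```

A caller checks `allowRequest()` before each downstream call and reports the outcome with `recordSuccess()` or `recordFailure()`. While the breaker is OPEN, the caller fails fast without touching the downstream.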
Production Insight
A team deployed circuit breakers with retry enabled — but the retry decorator was outside the circuit breaker.
When the payment service went down, the breaker opened after 5 failures.
But the retry configuration fired three attempts per request, each hitting the open breaker, each registering as a failure, resetting the cooldown timer.
The breaker never had a chance to enter half-open state. Error rates stayed at 100% for 8 minutes after the payment service had fully recovered.
Fixing the decorator order — circuit breaker wrapping retry — dropped the recovery time from 8 minutes to 32 seconds.
Rule: circuit breaker outside, retry inside. Test the failure recovery, not just the failure detection.
Key Takeaway
Use all three: they handle different failure modes and are not substitutes for each other.
Decorator order is a correctness issue, not a style preference: circuit breaker must wrap retry.
Set explicit timeouts on every network call. The absence of a timeout is a latent outage waiting to happen.
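To make the timeout rule concrete, here is a sketch using the JDK's built-in `java.net.http` client, which has no request timeout unless one is set. The `PaymentHttp` class, the endpoint URL, and the 200ms/300ms values are illustrative assumptions, not part of the service shown earlier.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical factory methods showing where the two deadlines are configured.
final class PaymentHttp {
    private PaymentHttp() {}

    // Deadline for establishing the connection, shared by all requests on this client.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(200))
                .build();
    }

    // Deadline for receiving the response to this one request. If it passes,
    // send() throws HttpTimeoutException and sendAsync() completes exceptionally.
    static HttpRequest chargeRequest(String jsonBody) {
        return HttpRequest.newBuilder(
                        URI.create("http://payment-service.internal/charges")) // hypothetical URL
                .timeout(Duration.ofMillis(300))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }
}
```

Setting the deadline on the client call itself is what frees the calling thread when the downstream hangs. A timeout applied only to a wrapping future stops the caller from waiting but leaves the worker thread blocked.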
Choosing the Right Failure Isolation Strategy
If: Downstream has a high sustained error rate (>50%) and is slow to recover
→
Use: Circuit breaker with a tight timeout (200–300ms) and a meaningful cooldown (30–60s). The timeout stops thread consumption; the breaker stops you from hammering a recovering service.
If: Downstream has a low error rate with occasional transient failures
→
Use: Retry with exponential backoff and jitter (3 attempts, starting at 200ms). No circuit breaker needed for low error rates; it would open too aggressively.
If: One slow dependency must not starve resources from other dependencies
→
Use: Bulkhead with a separate thread pool or semaphore per downstream dependency. Size each pool based on the expected concurrency for that dependency, not a shared global pool.
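The semaphore variant of the bulkhead mentioned above can be sketched in a few lines of plain Java. This is an illustrative sketch, not Resilience4j's `Bulkhead`: the `SemaphoreBulkhead` class and its choice of `IllegalStateException` for rejection are assumptions made for the example.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// One instance per downstream dependency, so each has its own concurrency budget.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Runs the call if a permit is free; otherwise rejects immediately
    // instead of queueing behind a slow dependency.
    <T> T execute(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full");
        }
        try {
            return call.get();
        } finally {
            permits.release();   // always return the permit, even on failure
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }
}
```

Unlike a dedicated thread pool, the semaphore runs the call on the caller's thread. It limits how many callers can be inside the dependency at once but does not by itself impose a timeout.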
Observability — Distributed Tracing, Structured Logging, and Metrics
You cannot debug what you cannot see. In a monolith, a single log file and a profiler were enough. In a microservices system, a single user request may travel through ten services across fifty containers. Standard logging gives you noise — you see errors in service A but cannot tell which upstream request caused them or which downstream call caused the delay. Centralized metrics tell you something is wrong but not where. Without proper observability, you are debugging a distributed system with a flashlight.
The three pillars work together, and the key is understanding which tool answers which question:
Distributed tracing answers the question 'what happened to this specific request?' Every request receives a unique trace ID at the entry point. That ID propagates — via HTTP headers, Kafka message headers, or gRPC metadata — through every service the request touches. Each service creates a span (a unit of work with start time, end time, and metadata) and attaches it to the trace. The result is a waterfall view of the entire request path: you can see that the checkout request spent 2ms in order service, 180ms waiting for a payment service response, and 4ms in the shipping service. The bottleneck is obvious.
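As a rough illustration of how a trace makes the bottleneck obvious, the sketch below models spans as plain objects and picks the one with the longest duration. The `TraceSpan` and `TraceAnalysis` types are hypothetical and far simpler than a real tracing data model: they ignore parent/child nesting and treat the trace as a flat list.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical, flattened view of a span: which service did the work, and when.
final class TraceSpan {
    final String service;
    final long startMillis;
    final long endMillis;

    TraceSpan(String service, long startMillis, long endMillis) {
        this.service = service;
        this.startMillis = startMillis;
        this.endMillis = endMillis;
    }

    long durationMillis() {
        return endMillis - startMillis;
    }
}

final class TraceAnalysis {
    private TraceAnalysis() {}

    // Returns the span that took the longest: the first place to look for a bottleneck.
    static TraceSpan slowest(List<TraceSpan> spans) {
        return spans.stream()
                .max(Comparator.comparingLong(TraceSpan::durationMillis))
                .orElseThrow(() -> new IllegalArgumentException("empty trace"));
    }
}
```

Fed the checkout example from the paragraph above (2ms in the order service, 180ms in the payment service, 4ms in the shipping service), `slowest` returns the payment span.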
In 2026, the instrumentation standard is OpenTelemetry. It is vendor-neutral, language-agnostic, and supported natively by every major cloud provider and observability vendor. You instrument your services once with the OpenTelemetry SDK and choose a backend separately. Jaeger and Grafana Tempo are the dominant self-hosted backends. Honeycomb, Datadog, and Grafana Cloud are the managed options. Do not confuse OpenTelemetry (the instrumentation layer) with the backend (the storage and query layer) — they are separate concerns. Choose OpenTelemetry for instrumentation regardless of which backend you use.
A critical implementation detail: context propagation across async boundaries. When an HTTP request handler publishes a Kafka message and returns, the trace context must be serialized into the Kafka message headers. When the consumer picks up that message, it must extract the trace context and continue the same trace. The OpenTelemetry Kafka instrumentation handles this automatically if you use the provided producer and consumer wrappers. If you use a custom producer, you must manually inject the context using W3C Trace Context headers. Miss this step and your traces fragment at every async boundary — you see half a trace on the HTTP side and a disconnected trace on the consumer side.
Structured logging answers 'what was this service doing at this time?' Log as JSON with a consistent schema: timestamp, level, service name, trace_id, span_id, message, and any business context (order_id, user_id). The trace_id field is what ties your logs to your traces — when a slow span appears in Jaeger, you copy the trace_id and filter your log aggregator (Grafana Loki, ELK) to see every log line from every service for that specific request.
Metrics answer 'is the system healthy right now?' Instrument every service with RED metrics: Rate (requests per second), Errors (error rate as a percentage), and Duration (latency distribution — p50, p95, p99). Use Prometheus to scrape and Grafana to alert. Alert on p99 latency and error rate, not on individual error logs. One error log is noise; a sustained 5% error rate is an incident.
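To show what the three RED numbers mean, the sketch below computes them from raw samples. In practice Prometheus derives them from counters and histograms, not from raw latency arrays; the `RedMetrics` class and its nearest-rank percentile method are assumptions chosen to keep the example self-contained.

```java
import java.util.Arrays;

// Illustrative calculations only; a metrics backend normally does this for you.
final class RedMetrics {
    private RedMetrics() {}

    // Rate: requests per second over the observation window.
    static double ratePerSecond(long requests, long windowSeconds) {
        if (windowSeconds <= 0) throw new IllegalArgumentException("window must be positive");
        return (double) requests / windowSeconds;
    }

    // Errors: failed requests as a percentage of all requests.
    static double errorRatePercent(long errors, long total) {
        return total == 0 ? 0.0 : 100.0 * errors / total;
    }

    // Duration: nearest-rank percentile, e.g. p = 99 means 99% of requests
    // completed at or below the returned latency.
    static long percentile(long[] latenciesMillis, double p) {
        if (latenciesMillis.length == 0) throw new IllegalArgumentException("no samples");
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

The percentile is why alerts target p99 and not the average: a handful of multi-second requests barely moves the mean but shows up directly in the tail.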
The workflow that works in production: an alert fires on p99 latency for checkout. You open the Grafana dashboard and see the latency spike started four minutes ago. You jump to Jaeger and filter for slow traces in the checkout service from that window. You find a trace where the payment span took 2.8 seconds instead of the usual 180ms. You copy the trace_id, filter Loki for that ID, and see the payment service logs show a database connection timeout. Root cause found in under five minutes. Without tracing, this investigation takes hours.
// Structured log entry from order-service during a payment failure.
// Every field is queryable in Loki/Elasticsearch — no grep through prose.
// The trace_id ties this log line to the distributed trace in Jaeger/Tempo.
// The span_id identifies this specific unit of work within the trace.
// A developer sees the error alert, filters Loki by trace_id, and gets
// every log line from every service for this one request, instantly.
{
"timestamp": "2026-04-22T10:30:15.123Z",
"level": "ERROR",
"service": "order-service",
"version": "2.14.1",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"message": "payment_charge_failed",
"order_id": "ord-98765",
"user_id": "usr-4421",
"amount_cents": 4999,
"error": {
"type": "ServiceUnavailable",
"code": "PAYMENT_CIRCUIT_OPEN",
"downstream": "payment-service",
"duration_ms": 301,
"retry_attempts": 3
},
"action_taken": "order_held_pending_payment_recovery"
}
// What you can query with this structure (Loki LogQL examples):
//
// All errors for a specific order:
// {service="order-service"} | json | order_id="ord-98765" | level="ERROR"
//
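// All log lines across ALL services for one user request
// (Loki requires at least one non-empty label matcher, so {} alone is rejected):
// {service=~".+"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736"
//
// More than 10 payment failures in the last 5 minutes:
// sum(count_over_time({service=~".+"} | json | message="payment_charge_failed" [5m])) > 10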
//
// Without structured logging and trace_id propagation:
// grep -r "ord-98765" /var/log/* ← across 10 services, 50 containers
// ... good luck
Output
// In Grafana Loki — filter by trace_id across all services:
Context Propagation Breaks at Async Boundaries Unless You Explicitly Handle It
When a service publishes a Kafka message and returns, the OpenTelemetry trace context must be serialized into the Kafka message headers — otherwise the trace stops at the producer. When the consumer picks up the message, it extracts the trace context from the headers and continues the same trace. The OpenTelemetry Kafka instrumentation handles this automatically. If you use a custom producer or a library not covered by auto-instrumentation, you must inject the context manually using the W3C Trace Context format (traceparent header). Miss this step and your traces break at every queue boundary — exactly the places where latency problems are most likely to hide.
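For the manual case, the sketch below shows roughly what injecting and extracting a W3C `traceparent` value involves. It is a hand-rolled illustration under stated assumptions: the `TraceParent` class is hypothetical, a `Map<String, String>` stands in for real message headers, and all-zero IDs are not rejected. In a real service, the propagators that ship with OpenTelemetry are the safer choice.

```java
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Manual illustration of the traceparent header format.
final class TraceParent {
    // version "00" - 32 hex trace-id - 16 hex parent span-id - 2 hex flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    final String traceId;
    final String spanId;
    final boolean sampled;

    TraceParent(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    String toHeader() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    // Producer side: copy the current context into the outgoing message headers.
    static void inject(TraceParent context, Map<String, String> headers) {
        headers.put("traceparent", context.toHeader());
    }

    // Consumer side: recover the context so new spans join the same trace.
    // Returns empty when the header is missing or malformed.
    static Optional<TraceParent> extract(Map<String, String> headers) {
        String value = headers.get("traceparent");
        if (value == null) {
            return Optional.empty();
        }
        Matcher m = FORMAT.matcher(value);
        if (!m.matches()) {
            return Optional.empty();
        }
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 1) == 1;
        return Optional.of(new TraceParent(m.group(1), m.group(2), sampled));
    }
}
```

When `extract` returns empty, the consumer has no parent context and starts a new trace, which is exactly the fragmentation described above.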
Production Insight
A team had Prometheus metrics and centralized logging but no distributed tracing.
When a high-value user complained of a 3-second checkout, the team spent two days correlating timestamps manually across eight service log files.
The root cause: a new gRPC call to a recommendation service that had a 2-second p99 tail latency on cold cache hits. It was invisible from any single service's logs.
After adding Jaeger with OpenTelemetry auto-instrumentation (one afternoon of work), the next latency complaint was diagnosed in under ten minutes.
Rule: add distributed tracing before your first production incident, not after. Retrofitting is a week of work across every service simultaneously.
Key Takeaway
Instrument with OpenTelemetry — it is the vendor-neutral standard in 2026. Choose your backend separately.
A trace_id in every log line is what connects your metrics alerts to your traces and your logs.
Context propagation across Kafka and async boundaries does not happen automatically unless you use the OpenTelemetry instrumentation wrappers.
Implement distributed tracing before your first major incident. The retrofit cost is high.
Observability Maturity Path
If: Fewer than 3 microservices, same team owns all of them
→
Use: Centralized logging with trace_id correlation is sufficient. Add structured logging now even if you skip tracing; you will need the queryable fields soon.
If: 3 to 10 microservices, cross-service latency issues appearing
→
Use: Distributed tracing with the OpenTelemetry SDK. Choose Grafana Tempo (if already on the Grafana stack) or Jaeger (self-hosted). Wire trace_id into your log schema. The combination of traces and logs is where debugging becomes fast.
If: More than 10 services, multiple teams, SLA commitments
→
Use: Full observability stack: OpenTelemetry instrumentation, distributed tracing backend, structured logging with trace correlation, RED metrics per service with Prometheus, and per-service Grafana dashboards with alerts. Dashboards are owned at the service level: each team runs its own.
Data Ownership and Database-Per-Service
One of the biggest shifts when moving from monolith to microservices is data management. In a monolith, all components share one database — easy to query across entities with joins, but impossible to decouple. In microservices, each service must own its data exclusively: its own database instance, its own schema, credentials that no other service can use. This is the rule that most teams resist and most teams eventually learn the hard way.
Why does it matter? A shared database creates the worst kind of coupling: schema coupling. If two services read from the same table, you cannot change that table without coordinating across both teams. A migration becomes a multi-service, multi-team deployment. A slow query from one service exhausts connection pools that another service depends on. A metadata lock from a DDL operation on one service brings down queries from every other service on that database server. You get all the operational complexity of microservices — distributed deployment, network calls, serialization — with none of the independence.
But what happens when Service A needs data that Service B owns? This is the question teams ask when they first encounter the database-per-service rule, and the answer has two parts.
For real-time queries where Service A needs current data from Service B: use an API call. Service A calls Service B's REST or gRPC endpoint. This adds latency (a network hop) but maintains the boundary. If the data is read frequently and changes infrequently, cache it locally in Service A with a TTL. With a 60-second TTL, staleness is bounded at 60 seconds, and reads served from the cache no longer depend on Service B being available.
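A local cache with a TTL can be sketched as follows. This is a minimal illustration, assuming a hypothetical `TtlCache` class whose `loader` stands in for the remote call to Service B. It has no eviction or size limit, and concurrent misses for the same key may each trigger a load.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Caches values fetched from another service for a bounded time.
final class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAt;

        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock;     // injectable so tests do not need to sleep
    private final Function<K, V> loader;  // stand-in for the API call to Service B

    TtlCache(long ttlMillis, LongSupplier clock, Function<K, V> loader) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        this.loader = loader;
    }

    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> cached = entries.get(key);
        if (cached != null && now < cached.expiresAt) {
            return cached.value;          // fresh enough: no network hop
        }
        V fresh = loader.apply(key);      // miss or expired: call the owning service
        entries.put(key, new Entry<>(fresh, now + ttlMillis));
        return fresh;
    }
}
```

Note that an expired entry still forces a call to the owning service. Serving the stale value when that call fails is a possible extension, at the cost of unbounded staleness during an outage.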
For analytics queries, reporting, and cases where Service A needs data from multiple services combined: use the CQRS pattern with a read model. Each service publishes its state changes as events. A read service (sometimes called a query service or materialized view service) consumes those events and maintains a denormalised projection optimised for the query patterns you need. You can run cross-service joins on this read model because it is a dedicated view, not the source of truth. The source of truth for each entity remains in the owning service's database.
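The read model described above can be sketched as an in-memory projection. The `OrderShipmentView` class and the `ShipmentStatusChanged` event are hypothetical names for this example, and a real projection would persist its rows in a database. The handlers upsert because events from different services can arrive in any order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Denormalised projection fed by events from two services.
final class OrderShipmentView {
    // One row per order, combining fields owned by different services.
    static final class Row {
        final String orderId;
        String customerId;                  // from OrderService events
        String shippingStatus = "UNKNOWN";  // from ShippingService events

        Row(String orderId) {
            this.orderId = orderId;
        }
    }

    private final Map<String, Row> rows = new LinkedHashMap<>();

    void onOrderCreated(String orderId, String customerId) {
        row(orderId).customerId = customerId;
    }

    // Hypothetical event published by the shipping service.
    void onShipmentStatusChanged(String orderId, String status) {
        row(orderId).shippingStatus = status;
    }

    private Row row(String orderId) {
        return rows.computeIfAbsent(orderId, Row::new);
    }

    // The "cross-service join", answered from the projection alone.
    List<String> orderIdsWithStatus(String status) {
        List<String> result = new ArrayList<>();
        for (Row r : rows.values()) {
            if (r.shippingStatus.equals(status)) {
                result.add(r.orderId);
            }
        }
        return result;
    }
}
```

A query such as "orders with delayed shipment" reads only this view. Neither the order database nor the shipping database is touched, and each remains the source of truth for its own fields.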
Data duplication across services via events is not a bug — it is a deliberate design decision that buys you deployment independence. The shipping service maintains its own copy of the order's shipping address, derived from the OrderCreatedEvent, because it cannot afford to call the order service every time it needs that address. If the order service is down, shipments must still be processable. Accept the duplication. The alternative — shared database — is far more expensive when it breaks.
package io.thecodeforge.microservices.data;
import io.thecodeforge.microservices.event.OrderCreatedEvent;
import java.util.List;
import java.util.stream.Collectors;
// ShippingService owns the shipping_info table.
// It has ZERO knowledge of the orders table; that is OrderService's domain.
// It does NOT query OrderService's database directly.
// It does NOT call OrderService's API to get shipping information.
//
// Instead: it consumes the OrderCreatedEvent, which contains enough data
// for ShippingService to do its job without ever talking to OrderService again.
//
// This means:
// - ShippingService can be deployed without OrderService being up
// - ShippingService can be scaled without affecting OrderService
// - OrderService can change its internal data model without affecting ShippingService
//   (as long as the event schema remains backward-compatible)
public class ShippingService {

    private final ShippingRepository shippingRepository;

    public ShippingService(ShippingRepository shippingRepository) {
        this.shippingRepository = shippingRepository;
    }

    // Consumes OrderCreatedEvent; called by the Kafka consumer.
    // The event payload contains all data ShippingService needs.
    // This is deliberate data duplication: the shipping address exists in
    // both the orders table (OrderService) and shipping_info table (ShippingService).
    // That duplication is the price of independence. It is worth it.
    public void handleOrderCreated(OrderCreatedEvent event) {
        List<String> skus = event.items().stream()
            .map(item -> item.sku())
            .collect(Collectors.toList());

        ShippingInfo info = new ShippingInfo(
            event.orderId(),
            event.shippingAddress(), // copied from event; ShippingService owns this copy
            skus,
            ShippingStatus.PENDING
        );
        shippingRepository.save(info);
        // ShippingInfo is now persisted in ShippingService's own database.
        // If OrderService goes down 5 minutes later: ShippingService continues
        // processing shipments from its own data store.
    }

    // When another service needs shipping data: they call this API.
    // They do NOT query ShippingService's database directly.
    public ShippingInfo getShippingInfo(String orderId) {
        return shippingRepository.findByOrderId(orderId)
            .orElseThrow(() -> new ShippingInfoNotFoundException(orderId));
    }
}
Output
// Data ownership in practice:
//
// OrderService database (owned exclusively by OrderService):
// Cross-service analytics (e.g., 'orders with delayed shipment'):
// Do NOT join orders and shipping_info tables across databases.
// Use a read model: both services publish events to an analytics consumer
// that maintains a denormalised view optimised for reporting queries.
// The analytics database is read-only — the source of truth stays in each service.
Data Duplication Across Services Is a Feature, Not a Problem
The instinct to avoid data duplication is correct for relational databases but wrong for microservices. When ShippingService stores a copy of the shipping address from the order event, that duplication is what allows ShippingService to function when OrderService is down. Eventual consistency — where ShippingService's copy may be seconds behind OrderService's truth — is the trade-off you accept. The alternative (shared database or synchronous API calls for every read) reintroduces the coupling you were trying to eliminate. Accept the duplication as the cost of independence.
Production Insight
A team allowed all services to read from a shared products table owned by the catalog team.
When the catalog team added a new field and renamed an existing column in the same migration, queries from three other services broke simultaneously.
No service had been notified. Each had its own deployment pipeline, but all failed at the same database boundary.
The fix required emergency rollbacks across four codebases, coordinated by three teams, at 11pm.
Rule: if two services read the same table, you have one service split across two deployments — not two services. The boundary is the database, not the codebase.
Key Takeaway
Database-per-service is non-negotiable for true microservices independence.
Data duplication via events is the correct pattern — it is the price of autonomy.
For cross-service reads: API call (real-time) or local replica via events (tolerable staleness).
For cross-service analytics: CQRS with a dedicated read model. Never direct cross-database joins.
Data Sharing Strategies
If: Service A needs current, real-time data from Service B for a user-facing query
→
Use: A synchronous API call (gRPC or REST). If the data changes infrequently, add a local cache in Service A with an appropriate TTL. Do not access Service B's database directly under any circumstances.
If: Service A needs data from Service B but can tolerate seconds or minutes of staleness
→
Use: Have Service B publish state-change events. Service A consumes those events and maintains a local replica. This is the pattern that makes Service A's availability independent of Service B's.
If: You need to query and join data across multiple services (reporting, analytics, dashboards)
→
Use: Build a dedicated read model using CQRS. Each service publishes events to an analytics consumer that maintains a denormalised projection. This is the only pattern that scales for cross-service reporting without reintroducing shared databases.
If: Two services currently share a database due to legacy constraints
→
Use: This is a distributed monolith. Apply the strangler fig pattern: identify the table ownership, add API endpoints to the owning service, migrate consumers one at a time to use the API instead of direct database access, then move the table to a dedicated instance.
Organizational Alignment — Conway's Law in Practice
Conway's Law states that organizations which design systems are constrained to produce designs that mirror their own communication structures. It was observed in 1967 and has been validated by every large-scale software organization since. In practical terms: your system architecture will look like your org chart. If your teams are organized by technical function — a frontend team, a backend team, a database team — your architecture will produce a frontend service, a backend service, and a database service, regardless of what your technical documents say you are building. You do not choose whether Conway's Law applies. You only choose whether you use it deliberately.
The inverse also holds, and this is the actionable version: if you want a particular architecture, first build the org structure that produces it. Amazon's two-pizza team rule — no team larger than can be fed by two pizzas — predates their service-oriented architecture. The small team size was a deliberate design decision that forced service boundaries: a team can only own what it can fully understand and operate. The services emerged from the team constraints, not the other way around.
The failure mode I have seen most consistently in enterprise microservices migrations: a company decomposes its codebase into fifteen services but keeps the same five teams. Each team owns three services. The services are technically separate deployments, but the teams are not independent — a change to the payment flow still requires coordination between the checkout team and the payments team and the fraud team. Deployment calendars replace merge conflicts as the bottleneck. The coordination overhead is identical to the monolith, but now you also have distributed systems complexity. This is the worst possible outcome.
The fix is to reorganize around business capabilities before splitting the codebase. Each team should own one bounded context end-to-end: the code, the database, the deployment pipeline, the on-call rotation, and the runbooks. The team size should stay between five and nine people — large enough to have cross-functional skills (backend, frontend, data, ops), small enough to move fast without internal coordination overhead. When a team exceeds nine people, split it along sub-domain lines and create a new bounded context.
One practical technique: before any code is split, draw the target team structure and ask whether each team can write, test, deploy, and operate their service without scheduling a meeting with another team. If the answer is no for any team, the boundaries are wrong. Fix the org chart first, then split the code.
package io.thecodeforge.microservices.org;
// This file is not executable. It is a documentation artifact showing
// how team ownership maps to service ownership in a correctly aligned org.
//
// The rule: one team, one bounded context, one service (or a small cluster
// of tightly related services that always deploy together).
//
// ── WRONG: Technical layer teams ─────────────────────────────────────────────
// Team: Frontend   owns: checkout-ui, product-ui, account-ui
// Team: Backend    owns: checkout-api, product-api, account-api, order-api
// Team: Database   owns: all schema migrations across all services
//
// Result (Conway's Law, unavoidable):
// - Every feature requires coordination across Frontend + Backend + Database
// - The 'database team' becomes a bottleneck for every service's migrations
// - Services cannot deploy independently because the database team controls schemas
// - Architecture: layered monolith, distributed across three codebases
//
// ── CORRECT: Business capability teams ───────────────────────────────────────
// Team: Checkout (7 people)  owns: checkout-service, checkout DB, checkout UI components
// Team: Payments (6 people)  owns: payment-service, payment DB, payment gateway integrations
// Team: Catalog (8 people)   owns: product-service, catalog DB, search indexing
// Team: Orders (6 people)    owns: order-service, order DB, order history UI
// Team: Shipping (5 people)  owns: shipping-service, shipping DB, carrier integrations
//
// Result:
// - Checkout team ships a feature without talking to Payments team
//   (they publish an OrderReadyForPayment event; Payments team owns the handler)
// - Payments team deploys a new gateway integration without a change freeze
// - Catalog team runs a schema migration on their own schedule
// - Architecture mirrors the team structure, so bounded contexts are clean

// The test for correct alignment: can this team respond to a production incident
// in their service at 2am without waking up another team?
// If yes: the boundary is correct.
// If no: the boundary is wrong, because some dependency crosses team lines.
class TeamBoundaryValidator {
    public static boolean isBoundaryCorrect(Team team, Service service) {
        return team.canDeploy(service)          // team controls the pipeline
            && team.ownsDatabase(service)       // team owns the schema
            && team.respondsToAlerts(service)   // team is on-call for this service
            && team.size() >= 5                 // large enough to be cross-functional
            && team.size() <= 9;                // small enough to move fast
    }
}
Output
// Org structure signal → Architecture signal:
//
// Symptom: releasing a checkout feature requires approval from 3 teams
// Diagnosis: the checkout bounded context is spread across multiple teams
// Fix: reorganize teams so one team owns the entire checkout flow end-to-end
//
// Symptom: the database team is a bottleneck for every deployment
// Diagnosis: schema ownership is centralized (technical layer org)
// Fix: move schema ownership to each business capability team
// each team runs its own migrations on its own schedule
//
// Symptom: a payments service bug requires a checkout team engineer to fix it
// Diagnosis: payments and checkout share code or data without a clean API boundary
// Fix: identify the coupling, expose it as an explicit API, redraw the team boundary
//
// The 2am test:
// 'Can the on-call engineer for this service resolve a production incident
// without waking up an engineer from another team?'
// Pass → boundary is correct.
// Fail → find the cross-boundary dependency and eliminate it.
Conway's Law Is Not a Guideline — It Is a Constraint
Conway's Law does not say you should align teams and architecture. It says you cannot avoid alignment; the only choice is whether the alignment is deliberate or accidental. If your teams are split by technology layer (frontend, backend, DBA), your services will end up coupled along those same lines, regardless of what the architecture diagram says. Any attempt to maintain an architecture that differs from the one your team structure naturally produces will erode over time as people take the path of least coordination. Reorganize teams first. The architecture will follow.
Production Insight
A financial services company migrated to microservices over 18 months, decomposing their monolith into 15 services.
They kept the same 5 teams. Each team owned 3 services. The services were technically independent deployments.
But every cross-service feature — which was almost every feature — required a planning meeting with representatives from 3 teams, a shared deployment calendar, and a coordinated release window.
Deployment coordination took longer than the development work. Incident response required paging two teams instead of one.
They had the cost of microservices (distributed systems complexity, observability overhead, network failures) without the benefit (team autonomy, independent deployment).
The fix took another 9 months: reorganizing into product-aligned teams before the architecture actually started working as intended.
Key Takeaway
Conway's Law is not optional — it is a structural constraint on every engineering organization.
Your system architecture will mirror your team communication structure, whether you plan it or not.
Align teams to business capabilities before splitting code — the architecture follows automatically.
The 2am test: if an on-call engineer needs to wake up another team to resolve an incident, the service boundary is wrong.
When to Reorganize Teams
If: One team owns multiple services that serve different business goals and have different stakeholders
→
Use: Split the team along bounded context lines. Each resulting team should be able to deploy its service without coordinating with the other.
If: Multiple teams share ownership of one service, with multiple on-call rotations and multiple deployment approvers
→
Use: Either merge the teams (if the service boundary is correct but the team split is wrong) or decompose the service into sub-services aligned with each team's capability.
If: Team size has grown beyond 9 people and coordination within the team is visibly slowing delivery
→
Use: Split the team. Identify a sub-domain within the current bounded context that can stand alone, form a new team around it, and extract the corresponding service or service cluster.
Production incident · Post-mortem · Severity: high
How a Shared Database Took Down Five Services at Once — The Cascade Nobody Expected
Symptom
Users started seeing 500 errors and timeouts across product listing, checkout, and order history pages simultaneously. Error rates jumped from 0.1% to 100% across three services over ten minutes. The services had separate codebases, separate deployment pipelines, and separate teams — but they all went down at exactly the same moment.
Assumption
The team believed they had microservices. They had separate codebases and separate CI/CD pipelines. What they had not separated was the database: all services shared a single MySQL server, each with its own schema but all on the same instance. They assumed schema-level separation was equivalent to service-level isolation. It is not.
Root cause
A developer on the inventory team ran ALTER TABLE inventory.products ADD COLUMN during business hours. MySQL's default DDL behavior acquires a metadata lock on the table for the duration of the operation. On this server, the products table was referenced by queries from order_service and product_service (both on the same MySQL instance). Those queries queued behind the DDL lock. The queue grew to hundreds of connections within two minutes, exhausting the MySQL max_connections limit. Services that could not acquire a connection started failing health checks. Kubernetes restarted the pods — which immediately attempted to reconnect to the same overloaded database server, creating a crash loop that sustained the outage long after the DDL had completed. Note: separate schemas on the same MySQL server share the same lock manager and the same connection pool limit. Schema-level separation does not prevent server-level metadata locks.
Fix
1. Kill the stuck DDL with KILL QUERY on the blocking thread_id to release the metadata lock.
2. Restart all services in crash loops once the database connection queue cleared.
3. Migrate each service to its own dedicated MySQL instance over the following two sprints, starting with the service that had caused the incident.
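4. Enforce a policy: all DDL changes must use pt-online-schema-change or gh-ost. Both build the new table as a shadow copy and hold the metadata lock only briefly at cut-over instead of for the whole operation (pt-online-schema-change keeps the copy in sync with triggers; gh-ost tails the binary log).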
5. Add database connection pool monitoring and alert at 80% pool utilization — this would have caught the growing queue before services started failing.
Key lesson
Separate schemas on the same database server do not give you service isolation: the lock manager, the max_connections limit, and the I/O subsystem are still shared. Only separate database instances provide true isolation.
Synchronous DDL operations in MySQL can cause cascading failures across every service on that server. Use pt-online-schema-change or gh-ost for all schema changes in production.
Monitor database connection pool utilization per service. A pool approaching its limit is an early warning — add an alert at 80% capacity before the first service starts failing.
Any shared infrastructure (database server, message broker, cache cluster) is a single point of failure for every service that depends on it. Plan for this explicitly in your capacity and failure mode analysis.
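The 80% alert described above amounts to a threshold check per service. The following is a minimal sketch in Python, not a production alerting rule: the `PoolSample` type, the function name, and the sample numbers are hypothetical, and in a real setup the readings would come from your metrics system rather than hard-coded values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSample:
    """One reading of a service's connection pool (hypothetical shape)."""
    service: str
    active: int    # connections currently in use
    max_size: int  # configured pool ceiling


def pools_over_threshold(samples, threshold=0.80):
    """Return (service, utilization) pairs at or above the alert threshold."""
    alerts = []
    for sample in samples:
        if sample.max_size <= 0:
            raise ValueError(f"{sample.service}: max_size must be positive")
        utilization = sample.active / sample.max_size
        if utilization >= threshold:
            alerts.append((sample.service, round(utilization, 2)))
    return alerts


# Illustrative readings only: order_service is at 85%, product_service at 30%.
samples = [
    PoolSample("order_service", active=17, max_size=20),
    PoolSample("product_service", active=6, max_size=20),
]
alerts = pools_over_threshold(samples)
print(alerts)
```

Evaluating the check per service rather than per server matters here: it tells you which service is queueing, which is the first question during an incident like the one above.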
Production debug guide
From observable symptom to targeted debug command
Symptom · 01
HTTP 503 across multiple services simultaneously
Fix
When multiple services fail at the same time, suspect shared infrastructure before individual service bugs. Check database connection usage: run SELECT * FROM information_schema.processlist; and compare SHOW STATUS LIKE 'Threads_connected' against SHOW VARIABLES LIKE 'max_connections'. Check if a shared cache or message broker is degraded. Look for sessions in state 'Waiting for table metadata lock' in the processlist (on MySQL 5.7 and later, performance_schema.metadata_locks shows which session holds the lock when that instrument is enabled). Simultaneous multi-service failure almost always traces to a shared dependency.
Symptom · 02
Circuit breaker open on a specific downstream
Fix
Check circuit breaker metrics: curl localhost:8080/actuator/circuitbreakers (Spring Boot with Actuator) or query the Prometheus metric resilience4j_circuitbreaker_state. Identify which downstream is open and its current error rate. Then check that downstream service's health endpoint and logs. Do not reset the breaker manually until the downstream is confirmed healthy — resetting an open breaker against an unhealthy service restarts the failure cascade.
Symptom · 03
High latency on a specific API endpoint, other endpoints normal
Fix
Find a slow trace in Jaeger or Grafana Tempo. Filter by service and minimum duration. Look for the span with the longest duration: that is your bottleneck. Check whether the time is spent waiting on a downstream service (one long child span for the outbound call, or many small sequential child spans adding up) or in local processing (stretches of the parent span that no child span covers). The trace waterfall makes this immediately visible.
Symptom · 04
Database connection pool exhausted
Fix
Run SELECT * FROM information_schema.processlist WHERE command != 'Sleep' ORDER BY time DESC to find long-running queries. Look specifically for state 'Waiting for table metadata lock', which indicates that a DDL operation or a long-running transaction is blocking subsequent queries on that table. Identify the blocking session with SELECT * FROM sys.schema_table_lock_waits (MySQL 5.7 and later, with the metadata lock instrument enabled) and kill it with KILL QUERY <id>, or KILL <id> if the blocker is an idle transaction. Then check connection pool metrics (hikaricp_connections_active in Prometheus) to confirm the pool recovers.
Symptom · 05
Messages accumulating in Kafka topic but consumer is running
Fix
Check consumer group lag with kafka-consumer-groups --bootstrap-server $KAFKA_BROKER --group your-group --describe. If lag is growing, the consumer is processing slower than the producer is publishing. Check consumer logs for errors or slow processing. Common causes: downstream DB is slow (consumer is waiting on each message), deserialization errors causing retries, or a poison-pill message that the consumer cannot process causing it to retry indefinitely. Check the dead-letter queue topic for failed messages.
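To tell slow processing apart from a stuck consumer, it helps to compare two lag snapshots taken a short time apart. The sketch below assumes you have already extracted, per partition, the committed offset and the log end offset (the CURRENT-OFFSET and LOG-END-OFFSET columns of the describe output); the function names and the sample numbers are hypothetical.

```python
def total_lag(partitions):
    """Sum lag over partitions; each value is (committed_offset, log_end_offset)."""
    return sum(max(end - committed, 0) for committed, end in partitions.values())


def lag_trend(earlier, later):
    """Classify how total lag changed between two snapshots of one consumer group."""
    before, after = total_lag(earlier), total_lag(later)
    if after > before:
        return "growing"
    if after < before:
        return "draining"
    return "steady"


def stuck_partitions(earlier, later):
    """Partitions whose committed offset did not move while messages were pending.

    A partition that stays on the same offset across snapshots is a hint
    that the consumer keeps failing on one message (a poison pill).
    """
    return sorted(
        partition
        for partition, (committed, end) in later.items()
        if partition in earlier
        and earlier[partition][0] == committed
        and end > committed
    )


# Illustrative snapshots: {partition: (committed_offset, log_end_offset)}
snapshot_t0 = {0: (1000, 1200), 1: (980, 1000)}
snapshot_t1 = {0: (1000, 1500), 1: (1100, 1110)}

trend = lag_trend(snapshot_t0, snapshot_t1)
stuck = stuck_partitions(snapshot_t0, snapshot_t1)
print(trend, stuck)
```

Growing lag with every partition still advancing points at throughput; growing lag with one partition frozen on the same offset points at a single bad message.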
5-Minute Microservices Emergency Debug Card
When your pager goes off at 3am, run these commands in order before touching any configuration
All endpoints returning 503 across multiple services
Immediate action
Suspect shared infrastructure — check database and message broker first, not individual services
Commands
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -30
mysql -h $DB_HOST -u $DB_USER -p -e 'SHOW FULL PROCESSLIST;' | grep -v Sleep | head -30
Fix now
If you see 'Waiting for table metadata lock' in the processlist: KILL QUERY the blocking thread_id. If connection count is near max_connections, identify the service causing the pool exhaustion and restart it after the lock clears.
Circuit breaker open: specific downstream unavailable
Immediate action
Verify the downstream is actually unhealthy before taking any action
Fix now
If downstream is healthy but breaker is open (stale state): POST to /actuator/circuitbreakers/{name}/reset, but only after confirming the downstream health endpoint returns UP. Resetting against an unhealthy downstream immediately reopens the breaker and restarts the failure cycle.
Request latency spike: p99 goes from 50ms to 5s
Immediate action
Find the slow span in distributed tracing before guessing at the cause
Fix now
If a downstream service span is slow: check that service's metrics and logs using the trace_id from the slow span. If it is a database span: run SHOW FULL PROCESSLIST on that service's database. Do not scale horizontally until you know whether the problem is throughput or latency.
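Kafka consumer lag growing: messages stuck in queue
Immediate action
Check consumer group lag and distinguish between slow processing and a crash loop
Fix now
If consumer is crashing on a specific message (same offset failing repeatedly): the message is likely malformed or triggers a bug. Copy it to the dead-letter topic manually: kafka-console-consumer.sh --bootstrap-server $KAFKA_BROKER --topic order.created --partition 0 --offset <bad_offset> --max-messages 1 | kafka-console-producer.sh --bootstrap-server $KAFKA_BROKER --topic order.created.dlq (on Kafka versions older than 2.5 the producer takes --broker-list instead). This copies only the message value; keys and headers are not preserved. Then stop the consumer and move the group past the bad offset: kafka-consumer-groups.sh --bootstrap-server $KAFKA_BROKER --group your-group --topic order.created:0 --reset-offsets --to-offset <bad_offset + 1> --execute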
Map which services require cross-team coordination for a single feature change. Those are your mis-aligned boundaries. Plan to either move service ownership to align with one team, or refactor the API boundary so the teams can work independently. This is a sprint of work, not a hotfix.
Monolith vs Microservices — Key Differences
| Aspect | Monolith | Microservices |
| --- | --- | --- |
| Deployability | One deployment per release; any change requires redeploying the entire application; low-risk in isolation but high-blast-radius | Each service deploys independently; a payment service fix ships without touching order or catalog; blast radius is bounded to one service |
| Scalability | Scale the entire application as a unit; if checkout is the bottleneck, you scale everything including the parts that are not bottlenecked | Scale only the services under load; scale checkout independently on Black Friday without touching catalog or account services |
| Team Organization | Single team typically owns the entire codebase; coordination overhead grows with team size; above 20 engineers, merge conflicts and release contention become significant | Each service owned by a small cross-functional team of 5–9 people; team can deploy without asking permission from other teams |
| Failure Isolation | A crash in any component takes down the entire application; a memory leak in recommendations takes down checkout | Failure is contained to the failing service; other services continue operating; requires explicit circuit breakers and bulkheads to enforce isolation |
| Communication | In-process function calls; nanosecond latency; no serialization overhead; no network failure modes | Network calls with 1–10ms latency per hop on the same cluster; serialization overhead; timeout and retry logic required; partial failure is a new failure mode you must design for |
| Data Management | Single shared database; cross-entity joins are trivial SQL; schema changes are coordinated in one place | Database-per-service; cross-service data access via API calls or event-driven replication; cross-service analytics requires CQRS with a dedicated read model |
| Complexity | High internal complexity in one place; lower infrastructure and operational complexity; one log file, one deployment, one database to monitor | Each service is simpler in isolation; total system complexity is higher: distributed failure modes, network partitions, eventual consistency, distributed tracing, and per-service deployment pipelines all add operational surface area |
| Observability | Single log file, standard profiler, one database to query; debugging is local and deterministic | Requires distributed tracing (OpenTelemetry + Jaeger/Tempo), structured logging with trace_id propagation, and RED metrics per service; without this stack, debugging a slow request across 10 services takes days |
Key takeaways
1. Microservices are about autonomy and decoupling: each service deploys, scales, and fails independently. Size is irrelevant.
2. Use bounded contexts from Domain-Driven Design and event storming to find service boundaries. Expect to get the first cut wrong.
3. Default to asynchronous communication (Kafka, SQS) for most inter-service interactions. Use synchronous calls only when the caller truly needs an immediate, consistent response.
4. Circuit breakers, bulkheads, and retries with exponential backoff are load-bearing patterns, not optional features. Decorator order matters: circuit breaker wraps retry.
5. Each microservice must own its own database instance. Separate schemas on a shared server do not prevent metadata lock cascades.
6. Instrument with OpenTelemetry from day one. Add trace_id to every log line. Distributed tracing is the difference between a 5-minute diagnosis and a 2-day investigation.
7. Conway's Law is not optional. Reorganize teams around business capabilities before splitting code.
8. Start with a monolith. Extract services when you hit a concrete scaling bottleneck or team coordination problem, not because microservices are fashionable.
Common mistakes to avoid
Splitting services by technical layers instead of business domains
Symptom
Services called 'database-service', 'backend-service', and 'frontend-service' appear. Every feature change touches all three services and requires a coordinated release. Deployment calendar becomes the bottleneck instead of development.
Fix
Re-decompose using bounded contexts: each service maps to a business capability (OrderService, PaymentService, InventoryService). Run event storming with domain experts to find the boundaries before writing any code. Test the boundary: can this service be deployed without changing any other service?
Sharing a single database server across multiple services
Symptom
Schema changes in one service cause outages in other services. A DDL operation triggers a metadata lock that queues queries from all services on the same server. Teams cannot run migrations independently.
Fix
Each service must have its own database instance, not just its own schema on a shared server. Separate instances keep metadata lock queues, connection exhaustion, and I/O contention from crossing service boundaries. Use pt-online-schema-change or gh-ost for all DDL operations so that lock time stays minimal even within a service's own database.
Using synchronous HTTP calls for every inter-service interaction
Symptom
A single slow downstream service exhausts all upstream worker threads within seconds, causing the upstream service to return 503 even though it has no problems of its own. Latency compounds across synchronous call chains.
Fix
Default to asynchronous communication (Kafka, SQS, RabbitMQ) for interactions where the caller does not need an immediate response. Reserve synchronous calls for real-time queries where the caller must wait for the result. For all synchronous calls: set explicit timeouts, add circuit breakers with Resilience4j, and size the thread pool with a bulkhead.
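How quickly a blocking thread pool fills when a downstream slows can be estimated with Little's law (average concurrency = arrival rate × time per call). The sketch below is a back-of-the-envelope model, not a capacity planner: the function names are hypothetical, the 100 requests per second figure is an assumed load, and it assumes a pool that starts nearly idle with one blocked thread per in-flight call.

```python
def required_threads(arrival_rate_rps, latency_s):
    """Little's law: average concurrent calls = arrival rate x time per call."""
    return arrival_rate_rps * latency_s


def seconds_until_exhausted(pool_size, arrival_rate_rps, latency_s):
    """Time for a blocking pool to fill after latency jumps, or None if it never fills.

    Assumes the pool starts nearly idle and each call holds a thread for latency_s.
    """
    if required_threads(arrival_rate_rps, latency_s) <= pool_size:
        return None  # steady-state demand fits inside the pool
    # Threads are claimed at the arrival rate and none is released
    # before the pool is already full.
    return pool_size / arrival_rate_rps


# Assumed load of 100 requests/second against a 200-thread pool.
healthy = seconds_until_exhausted(200, 100, 0.05)  # 50ms calls need ~5 threads
degraded = seconds_until_exhausted(200, 100, 4.0)  # 4s calls need 400 threads
print(healthy, degraded)
```

The same arithmetic sizes a bulkhead: cap the threads any single downstream may hold at a little above arrival rate × timeout, so one slow dependency cannot claim the whole pool.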
Skipping distributed tracing until after the first major production incident
Symptom
A slow request travels through 8 services. Engineers spend days manually correlating timestamps across log files trying to find which service introduced the latency. The root cause turns out to be a third-party API call in an unexpected service.
Fix
Instrument all services with OpenTelemetry from the first day of microservices. Choose a backend (Grafana Tempo for teams already on Grafana, Jaeger for self-hosted simplicity). Ensure context propagates across async boundaries by using the OpenTelemetry Kafka and messaging instrumentation. Add trace_id to every log line.
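The "trace_id on every log line" rule can be illustrated with the Python standard library alone. This is a simplified stand-in, not OpenTelemetry itself: with a real tracing SDK the id would come from the active span context, and the logger name, log format, and `handle_request` function here are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace id for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace id onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("orders")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, incoming_trace_id=None):
    # Reuse the caller's trace id when one was propagated, otherwise start a new one.
    token = trace_id_var.set(incoming_trace_id or uuid.uuid4().hex)
    try:
        logger.info("order received")
        logger.info("payment requested")
    finally:
        trace_id_var.reset(token)


stream = io.StringIO()
log = build_logger(stream)
handle_request(log, incoming_trace_id="abc123")
log.info("outside any request")
output = stream.getvalue()
print(output, end="")
```

Because the id lives in a context variable rather than in function arguments, every log call made while handling the request picks it up without the business code passing it around.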
Not implementing circuit breakers on synchronous inter-service calls
Symptom
A downstream service slows down. Calls from the upstream service take 4 seconds instead of 50ms. A 200-thread upstream pool that handled the load comfortably at 50ms per call can now complete only 50 requests per second (200 threads / 4 s), so at any higher arrival rate every thread is blocked within a few seconds. The upstream service starts returning 503 and failing health checks. Kubernetes restarts it. The restarted pods immediately make more slow calls. The outage is now self-sustaining.
Fix
Add circuit breakers using Resilience4j on every synchronous downstream call. Configure timeouts (200–300ms for internal calls), error rate thresholds (50% over a 10-call window), and cooldown periods (30 seconds). Set the decorator order correctly: circuit breaker wraps retry, not the reverse. Test the failure and recovery behavior in staging before the first production deployment.
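To make the state machine concrete, here is a minimal, single-threaded circuit breaker sketch in Python using the thresholds from the fix above (50% failures over a 10-call window, 30-second cooldown). It is an illustration of the pattern, not Resilience4j's implementation: the class and exception names are hypothetical, it allows exactly one trial call in half-open state, and it is not thread-safe.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when a call is rejected without reaching the downstream."""


class CircuitBreaker:
    def __init__(self, window=10, failure_rate=0.5, cooldown_s=30.0,
                 clock=time.monotonic):
        self.window = window
        self.failure_rate = failure_rate
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.results = deque(maxlen=window)  # True means the call failed
        self.state = "closed"
        self.opened_at = None

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("downstream call rejected")
            self.state = "half_open"  # cooldown elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        if self.state == "half_open":
            # The trial call alone decides whether to close or reopen.
            if failed:
                self._open()
            else:
                self.state = "closed"
                self.results.clear()
            return
        self.results.append(failed)
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window >= self.failure_rate:
            self._open()

    def _open(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()


# Demo with a fake clock so the cooldown can be skipped instantly.
now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])


def failing():
    raise TimeoutError("downstream too slow")


for _ in range(10):
    try:
        breaker.call(failing)
    except TimeoutError:
        pass
state_after_failures = breaker.state

try:
    breaker.call(lambda: "ok")
    rejected = False
except CircuitOpenError:
    rejected = True

now[0] = 31.0  # cooldown has elapsed
recovered = breaker.call(lambda: "ok")
print(state_after_failures, rejected, recovered, breaker.state)
```

The injectable clock is the design choice worth copying: it lets you test the open, half-open, and closed transitions in staging or unit tests without waiting out real cooldowns.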
Interview Questions on This Topic
Q01 of 03 · SENIOR
Explain how you would decompose a monolithic e-commerce application into microservices. What methodology would you use?
ANSWER
I'd start with Domain-Driven Design's bounded context methodology rather than a technical decomposition. First, I'd run event storming sessions with domain experts and engineers: map every business event (OrderPlaced, PaymentFailed, ItemShipped, StockReserved), group them by the business process they belong to, and identify the boundaries where events cross between groups. Those boundaries are the service APIs.
For an e-commerce system, the natural bounded contexts are: Product Catalog (products, categories, pricing), Order Management (order lifecycle, order state machine), Payment (charge, refund, reconciliation), Inventory (stock levels, reservation, replenishment), Shipping (shipment creation, carrier integration, tracking), and User Accounts (authentication, profile, addresses).
I'd validate each boundary by asking: can this service be deployed without deploying any other service? If no, the boundary is wrong. I'd use the strangler fig pattern for extraction — route specific request types to the new service while the monolith handles everything else, then expand incrementally. Never a big-bang rewrite.
The boundaries I'd expect to get wrong on the first cut: the line between Order Management and Payment often ends up too thin; orders and payments are tightly coupled in most domains. I'd start with them merged and extract Payment only when the team ownership or scaling needs make the separation worth the complexity.
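The strangler fig routing described in this answer can be sketched as a prefix table in front of the monolith. This is a simplified illustration in Python: the hostnames, the route table, and the `route` function are hypothetical, and in practice this logic usually lives in an API gateway or reverse proxy configuration rather than application code.

```python
MONOLITH = "http://monolith.internal"  # hypothetical hostname

# Path prefixes that have already been extracted into their own service.
EXTRACTED_ROUTES = {
    "/catalog": "http://catalog-service.internal",  # hypothetical hostname
}


def route(path, extracted=EXTRACTED_ROUTES, default=MONOLITH):
    """Send a request to an extracted service if its prefix matches, else the monolith."""
    best = ""
    for prefix in extracted:
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best):
            best = prefix  # prefer the most specific extracted prefix
    return extracted[best] if best else default


print(route("/catalog/items/42"))  # served by the extracted catalog service
print(route("/orders/7"))          # still served by the monolith

# Extracting the next bounded context is a one-line routing change.
after_next_extraction = dict(EXTRACTED_ROUTES, **{"/orders": "http://order-service.internal"})
print(route("/orders/7", extracted=after_next_extraction))
```

Each extraction is reversible: removing the prefix from the table sends traffic back to the monolith, which is what makes the incremental approach safer than a big-bang rewrite.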
Q02 of 03 · SENIOR
How do you handle distributed transactions in microservices?
ANSWER
The honest answer is: you avoid distributed transactions wherever possible, and when you cannot avoid them you use sagas with compensating transactions.
Two-phase commit (2PC) is an anti-pattern in microservices because it requires a transaction coordinator, and every participant holds its locks until the coordinator announces the outcome. The coordinator becomes a single point of failure and a scalability bottleneck. The moment it fails mid-transaction, participants are left blocked in an in-doubt state with no clean way to resolve it.
The Saga pattern is the alternative. In choreography-based sagas, each service publishes events that trigger the next step — OrderCreated triggers PaymentService to charge, PaymentSucceeded triggers InventoryService to reserve, and so on. In orchestration-based sagas, a saga coordinator sends commands to each service and tracks the state machine.
Three things that textbook answers miss:
First, idempotency is mandatory. Saga steps will be retried: network failures, consumer restarts, and at-least-once delivery guarantees mean any step can execute more than once. Every saga step must detect and skip duplicate executions, typically by checking whether an idempotency key (for example, the order_id combined with the step name) has already been processed.
Second, sagas do not provide isolation. Between saga steps, other processes can read intermediate state — an order that has been placed but not yet paid is visible to the reporting system. This is a semantic difference from ACID transactions that must be explicitly designed for.
Third, compensating transactions are not always reversible. A payment charge can be refunded, but a physical shipment that has already left the warehouse cannot be un-shipped. Design your sagas with this asymmetry in mind — some compensations are approximate, not exact.
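The three points above can be seen together in a small orchestration-style sketch. It is a toy model under stated assumptions, not a saga framework: `run_saga`, the step names, and the in-memory `processed` set are hypothetical, and a real system would keep the idempotency keys in durable storage and drive the steps through messages rather than direct function calls.

```python
class SagaFailed(Exception):
    """Raised after a step fails and completed steps have been compensated."""


def run_saga(order_id, steps, processed):
    """Run (name, action, compensation) steps in order.

    `processed` holds idempotency keys for steps that already succeeded, so a
    redelivered saga skips them. If a step fails, compensations for the steps
    completed so far run in reverse order.
    """
    completed = []
    for name, action, compensate in steps:
        key = (order_id, name)
        if key in processed:
            completed.append((name, compensate))  # done earlier; still undoable
            continue
        try:
            action(order_id)
        except Exception as exc:
            for done_name, undo in reversed(completed):
                undo(order_id)
                processed.discard((order_id, done_name))
            raise SagaFailed(f"{name} failed for {order_id}") from exc
        processed.add(key)
        completed.append((name, compensate))


events = []


def charge(order_id):
    events.append(f"charge {order_id}")


def refund(order_id):
    events.append(f"refund {order_id}")


def reserve(order_id):
    raise RuntimeError("out of stock")


def release(order_id):
    events.append(f"release {order_id}")


processed = set()

# Inventory fails after payment succeeded, so the charge is compensated.
try:
    run_saga("order-1", [("payment", charge, refund), ("inventory", reserve, release)], processed)
except SagaFailed:
    pass

# A redelivered saga does not charge the customer twice.
run_saga("order-2", [("payment", charge, refund)], processed)
run_saga("order-2", [("payment", charge, refund)], processed)
print(events)
```

Note that the refund here is a new business action, not a rollback: between the charge and the refund, other readers could observe the charged state, which is the isolation gap described above.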
Q03 of 03 · SENIOR
What is the difference between a circuit breaker and a retry mechanism? Can they be used together?
ANSWER
They handle different failure modes and must be used together, but the composition order matters.
A retry handles transient failures such as a network blip, a brief service restart, or a momentary overload. The assumption is that the call will succeed if you try again after a short wait. Exponential backoff (base delays of 200ms, 400ms, 800ms) spaces the attempts out, and random jitter added to each delay prevents synchronized retry storms when hundreds of callers retry simultaneously.
A circuit breaker handles persistent failures — a service that is down or degraded for seconds or minutes. Instead of retrying a call that will definitely fail, it short-circuits: calls fail immediately without touching the downstream. This prevents thread pool exhaustion in the caller and gives the downstream time to recover without being hammered.
Used together: the circuit breaker must be the outer decorator and the retry must be the inner decorator. If you invert this, with retry wrapping the circuit breaker, each retry attempt hits the open breaker and is rejected immediately, so the retry budget and backoff time are spent on calls that cannot succeed. In breaker implementations that count rejected calls as failures, those attempts can also keep restarting the cooldown timer, so the breaker never gets a chance to enter half-open state, even after the downstream has recovered. I have seen this mistake extend a 2-minute outage to 8 minutes.
Correct order: circuit breaker → retry → actual call. When the breaker is open: fail immediately, no retries fire. When the breaker is closed: retries handle transient failures inside the breaker, and the breaker records only the final outcome of each protected call. After the cooldown: the breaker enters half-open and allows a limited number of trial calls through (a single call in the simplest implementations) to decide whether to close or reopen.
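The composition order can be sketched in a few lines. This is a minimal illustration, not a library API: `retry` and `CountingBreaker` are hypothetical names, the jitter formula (50% to 100% of the exponential delay) is just one of several common strategies, and the breaker here only counts outcomes so that the effect of the ordering is visible.

```python
import random
import time


def retry(fn, attempts=3, base_s=0.2, sleep=time.sleep, rng=random.random):
    """Call fn up to `attempts` times with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # retry budget spent: surface the failure
            ceiling = base_s * (2 ** attempt)        # 0.2s, 0.4s, 0.8s, ...
            sleep(ceiling * (0.5 + 0.5 * rng()))     # jitter: 50-100% of ceiling


class CountingBreaker:
    """Stand-in for a real breaker: records one outcome per protected call."""

    def __init__(self):
        self.successes = 0
        self.failures = 0

    def call(self, fn):
        try:
            result = fn()
        except Exception:
            self.failures += 1
            raise
        self.successes += 1
        return result


calls = {"flaky": 0, "down": 0}


def flaky():
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise ConnectionError("transient blip")
    return "ok"


def down():
    calls["down"] += 1
    raise ConnectionError("still down")


slept = []
breaker = CountingBreaker()

# Breaker outside, retry inside: two transient failures are absorbed by the
# retry and the breaker sees a single success.
result = breaker.call(lambda: retry(flaky, sleep=slept.append, rng=lambda: 1.0))

# A persistent failure costs three attempts but counts as one breaker failure.
try:
    breaker.call(lambda: retry(down, sleep=slept.append, rng=lambda: 1.0))
except ConnectionError:
    pass
print(result, breaker.successes, breaker.failures, slept)
```

Passing `sleep` and `rng` as parameters keeps the example deterministic; with the defaults, the delays are randomized and real time elapses between attempts.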
Frequently Asked Questions
01
How many microservices should a system have?
There is no target number. Some systems work well with five services; others need five hundred. The right number is determined by your bounded contexts and your team structure: one service (or a small cluster of tightly related services) per business capability that a team of five to nine people can fully own, operate, and deploy independently. Start with fewer services and split only when a concrete problem demands it — scaling bottleneck, team coordination friction, or significantly different deployment cadences between two parts of the same service.
02
Can microservices share a database?
No — not even on separate schemas on the same database server. Separate schemas on the same instance share the lock manager, the connection pool, and the I/O subsystem. A metadata lock from a DDL operation in one service can queue queries from every other service on that server. A runaway query from one service can exhaust the max_connections limit for all others. Each service must have its own database instance for true isolation. If that is not immediately achievable due to legacy constraints, use the strangler fig pattern to migrate one service at a time to dedicated instances.
03
How do you handle configuration management across many services?
Use centralized configuration management: Spring Cloud Config or Consul for application configuration, and HashiCorp Vault or AWS Secrets Manager for secrets and credentials. Store all non-secret configuration in version control; secrets stay in the secrets manager. Each service fetches its configuration at startup and optionally subscribes to dynamic updates. The critical rule: never hardcode service URLs, credentials, or feature flags. In Kubernetes environments, ConfigMaps handle non-sensitive configuration, and credentials come from Kubernetes Secrets or are injected into pods from Vault by the Vault Agent Injector. Each service's configuration namespace is owned by that service's team.
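On the service side, "never hardcode" usually means reading every endpoint and setting from the environment at startup and refusing to start when something required is missing. The sketch below shows that shape in Python; the variable names (`PAYMENT_SERVICE_URL`, `DB_HOST`, `REQUEST_TIMEOUT_MS`) and the default timeout are hypothetical, and the values are assumed to be injected by whatever config system you use, for example as environment variables from a ConfigMap.

```python
import os


class MissingConfig(Exception):
    """Raised at startup when a required setting is absent."""


REQUIRED = ("PAYMENT_SERVICE_URL", "DB_HOST")


def load_config(env=os.environ):
    """Build the service config from the environment, failing fast on gaps."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise MissingConfig("missing required settings: " + ", ".join(missing))
    return {
        "payment_service_url": env["PAYMENT_SERVICE_URL"],
        "db_host": env["DB_HOST"],
        # Optional setting with an explicit default.
        "request_timeout_ms": int(env.get("REQUEST_TIMEOUT_MS", "300")),
    }


# Illustrative environment; in a real deployment these values are injected.
example_env = {
    "PAYMENT_SERVICE_URL": "http://payment-service.internal",
    "DB_HOST": "orders-db.internal",
}
config = load_config(example_env)
print(config)
```

Failing at startup rather than on the first request turns a misconfiguration into a failed rollout instead of a partial outage.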
04
What is the biggest operational challenge in microservices?
Observability — by a significant margin. With ten or more services, you cannot debug a user-facing latency problem without distributed tracing. You cannot correlate an error in service A with its root cause in service D without trace_id propagation through every log line. The second biggest challenge is deployment discipline: each team must be able to deploy independently, which requires automated CI/CD, backward-compatible API changes, and feature flags for gradual rollouts. These are not tools you add later — they are prerequisites for microservices to work.
05
When should I avoid microservices?
When your team has fewer than ten engineers, when your domain is not yet well understood, or when you do not have automated CI/CD and observability infrastructure in place. Microservices add real operational complexity — distributed failure modes, network latency, serialization, and a full observability stack. Before that complexity pays off, you need a scaling or team coordination problem that a well-structured monolith cannot solve. The most expensive microservices mistake is doing the migration before hitting that wall.