Senior · 7 min · March 29, 2026

Architecture Patterns: Saga Compensation Rate Limits

Customers charged for unfulfilled orders: saga compensation hit 429 rate limit with no retry.

Naren · Founder
Plain-English first. Then code. Then the interview question.
 ● Production Incident 🔎 Debug Guide
Quick Answer
  • Software architecture patterns are reusable solutions to common system design problems
  • Each pattern solves a specific problem but introduces new trade-offs
  • Monolith: simple but lacks failure isolation and independent scaling
  • Microservices: independent scaling but distributed transactions are painful
  • Event-driven: decouples producers and consumers but adds consumer lag monitoring
  • CQRS: separates read/write models but introduces eventual consistency
  • Production insight: choosing the wrong pattern costs months of rework; always start with the simplest pattern that meets your constraints
Plain-English First

Think of software architecture like designing a restaurant kitchen. A food truck has one chef doing everything — fast to set up, impossible to scale when 200 people show up. A Michelin-star kitchen has a strict brigade system: saucier, pastry chef, expeditor — each station independent, but the coordination overhead is brutal. Your architecture is that kitchen design. Pick the wrong one and your 'chef' either burns out solo or your 'brigade' spends more time talking than cooking. The food is the same. The choice of how to organise the kitchen determines whether you survive dinner service.

A startup I consulted for rewrote their monolith into 47 microservices in eight months. By month nine, they had more engineers debugging inter-service latency than building features, their P99 checkout latency had tripled, and their on-call rotation had become a trauma ward. The monolith had been slow. The microservices were broken in ways nobody could trace.

Architecture decisions are permanent in a way that code decisions aren't. You can refactor a function in an afternoon. Rearchitecting a distributed system costs quarters, sometimes years. The tragedy is that most teams make these decisions by copying what Netflix or Uber did — without copying the 500-engineer platform team that makes those patterns survivable. The pattern isn't the hard part. Knowing when it fits your actual constraints is the whole game.

After this, you'll be able to look at a system's requirements and map them to a concrete architectural pattern — not because you memorised a definition, but because you understand the exact failure modes each pattern introduces and the specific conditions under which those failure modes hurt you. You'll know when a monolith is the right call, when event-driven architecture saves you, when CQRS is overkill, and what questions to ask before committing to any of them.

Monolith vs. Microservices: The Decision Nobody Makes Honestly

Every team says they're 'moving to microservices for scalability.' What they usually mean is they read a Martin Fowler post and their CTO saw a conference talk. Let's be precise about what each architecture actually costs you, because the decision is irreversible for 18-36 months once you commit.

A monolith is a single deployable unit. All your business logic, data access, and API surface lives in one process. The wins are real: in-process function calls instead of network hops, a single transaction boundary, one deployment pipeline, and a debugger that actually works. The failure mode is equally real: a memory leak in your image-processing module takes down your payment API. One bad deploy nukes everything. Your release cadence is bottlenecked by the slowest team's merge.

Microservices split that single process into independently deployable services communicating over a network. You get independent deployability and isolated failure domains. You also get distributed systems problems you didn't have before: network partitions, eventual consistency, service discovery, distributed tracing, and the operational complexity of running 20+ services instead of one. I've watched teams spend three months building the infrastructure to support microservices before writing a single line of business logic.

The honest rule: if you have fewer than 15 engineers, a monolith is almost certainly correct. If your scaling bottleneck is genuinely a specific bounded domain — say, your video transcoding is hammering CPU while your API servers idle — extract that one service. Don't extract everything because you might need to scale it someday. That day may never come, and you'll have paid the distributed systems tax in advance for nothing.

OrderServiceMonolith.py (Python)
# io.thecodeforge — System Design tutorial

# Scenario: E-commerce order service.
# This is the monolith version. Notice what you get for FREE
# that microservices make you rebuild from scratch.

from dataclasses import dataclass
from decimal import Decimal
import sqlite3

# Single database connection — one transaction wraps everything.
# In a distributed system, this atomicity is GONE unless you implement
# two-phase commit or saga patterns. Both are painful.
DB_PATH = "orders.db"

@dataclass
class OrderItem:
    product_id: str
    quantity: int
    unit_price: Decimal

@dataclass
class Order:
    order_id: str
    customer_id: str
    items: list[OrderItem]
    status: str = "pending"

class InventoryService:
    """In a monolith, this is just a class. In microservices, it's a network call
    that can timeout, return stale data, or be temporarily unavailable."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def reserve_stock(self, product_id: str, quantity: int) -> bool:
        cursor = self.conn.cursor()
        # The SELECT ... FOR UPDATE equivalent: SQLite serialises writes for us.
        # In a distributed inventory service, this becomes a distributed lock
        # (Redis SETNX, or a dedicated locking service), adding latency and failure modes.
        cursor.execute(
            "SELECT stock_count FROM inventory WHERE product_id = ?",
            (product_id,)
        )
        row = cursor.fetchone()
        if not row or row[0] < quantity:
            return False  # Not enough stock — fail the whole operation cleanly

        cursor.execute(
            "UPDATE inventory SET stock_count = stock_count - ? WHERE product_id = ?",
            (quantity, product_id)
        )
        return True

class PaymentService:
    """Again — a class. Zero network calls. Zero timeout handling needed.
    The moment this becomes a separate service, you need circuit breakers,
    retry logic, idempotency keys, and dead-letter queues. All of that
    is real engineering effort — not free."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def charge_customer(self, customer_id: str, amount: Decimal) -> bool:
        cursor = self.conn.cursor()
        cursor.execute(
            "SELECT balance FROM accounts WHERE customer_id = ?",
            (customer_id,)
        )
        row = cursor.fetchone()
        if not row or row[0] < amount:
            return False  # Insufficient funds — atomic rollback handles cleanup

        cursor.execute(
            "UPDATE accounts SET balance = balance - ? WHERE customer_id = ?",
            # sqlite3 can't bind Decimal directly; production would store cents as integers
            (float(amount), customer_id)
        )
        return True

class OrderOrchestrator:
    """The core of the monolith advantage: ONE database transaction covers
    inventory reservation, payment, AND order creation. If payment fails,
    inventory is automatically unreserved. No compensation logic. No sagas.
    No 'sorry, your order is stuck in PENDING forever' bugs."""

    def __init__(self):
        # isolation_level=None puts the connection in autocommit mode, so the
        # explicit BEGIN/commit in place_order is the only transaction control in play.
        self.conn = sqlite3.connect(DB_PATH, isolation_level=None)
        self.inventory = InventoryService(self.conn)
        self.payment = PaymentService(self.conn)

    def place_order(self, order: Order) -> dict:
        total = sum(
            item.unit_price * item.quantity for item in order.items
        )

        try:
            # Explicit BEGIN: steps 1-3 below commit or roll back as a single unit
            self.conn.execute("BEGIN")

            # Step 1: Reserve all stock items.
            # If ANY item fails, we roll back everything. One line of code.
            for item in order.items:
                if not self.inventory.reserve_stock(item.product_id, item.quantity):
                    self.conn.rollback()
                    return {
                        "success": False,
                        "reason": f"Insufficient stock for product {item.product_id}"
                    }

            # Step 2: Charge the customer.
            # In a microservices world, payment already happened in a different
            # service. If inventory reservation then fails, you're issuing refunds.
            # That's a SUPPORT TICKET. Here? It's a rollback.
            if not self.payment.charge_customer(order.customer_id, total):
                self.conn.rollback()
                return {"success": False, "reason": "Payment failed"}

            # Step 3: Persist the order record.
            self.conn.execute(
                "INSERT INTO orders (order_id, customer_id, status, total) VALUES (?, ?, ?, ?)",
                (order.order_id, order.customer_id, "confirmed", str(total))
            )

            self.conn.commit()  # All three operations committed atomically
            return {"success": True, "order_id": order.order_id, "total": str(total)}

        except Exception as e:
            self.conn.rollback()  # Something unexpected? Clean slate.
            raise RuntimeError(f"Order placement failed: {e}") from e


# --- Demonstrate the flow ---
if __name__ == "__main__":
    # Minimal schema and seed data so the demo runs end-to-end.
    # Normally you'd have migrations. Simplified for illustration.
    setup = sqlite3.connect(DB_PATH)
    setup.executescript("""
        CREATE TABLE IF NOT EXISTS inventory (product_id TEXT PRIMARY KEY, stock_count INTEGER);
        CREATE TABLE IF NOT EXISTS accounts  (customer_id TEXT PRIMARY KEY, balance REAL);
        CREATE TABLE IF NOT EXISTS orders    (order_id TEXT PRIMARY KEY, customer_id TEXT, status TEXT, total TEXT);
        INSERT OR REPLACE INTO inventory VALUES ('SKU-HEADPHONES-XZ3', 10), ('SKU-USB-CABLE-C', 50);
        INSERT OR REPLACE INTO accounts  VALUES ('CUST-789', 500.00);
    """)
    setup.close()

    orchestrator = OrderOrchestrator()

    order = Order(
        order_id="ORD-20240315-001",
        customer_id="CUST-789",
        items=[
            OrderItem(product_id="SKU-HEADPHONES-XZ3", quantity=1, unit_price=Decimal("149.99")),
            OrderItem(product_id="SKU-USB-CABLE-C", quantity=2, unit_price=Decimal("12.50"))
        ]
    )

    result = orchestrator.place_order(order)
    print(f"Order result: {result}")
Production Trap: The Distributed Transaction Debt
When you split this into microservices, every conn.rollback() in the monolith becomes a saga with compensating transactions. Teams that don't plan for this end up with orders stuck in PENDING state indefinitely when a downstream service times out. The symptom: customer support tickets saying 'I was charged but my order never arrived.' The fix before you split: define the saga pattern and write the compensation handlers FIRST, before extracting a single service.
Production Insight
The real cost of microservices isn't the code—it's the operational overhead.
A six-engineer team I worked with spent 40% of their sprint time on CI/CD pipeline maintenance, service discovery, and distributed tracing setup.
That's time they could have spent building features; a modular monolith would have given them clear module ownership without the network tax.
Key Takeaway
If you have fewer than 15 engineers, a monolith is almost certainly correct.
If your scaling bottleneck is a specific bounded domain, extract that one service—not everything.
The distributed systems tax is real; don't pay it until you need the benefits.
Monolith or Microservices?
If: Fewer than 15 engineers, single bounded context
Use: Monolith — you don't need the distributed systems tax yet.
If: Team > 15 engineers, multiple bounded contexts with clear ownership
Use: Microservices — but only after you've built the platform (CI/CD, tracing, service mesh).
If: Scaling bottleneck in one specific service (e.g., video transcoding)
Use: Extract only that service. Keep the rest in the monolith. Pay the tax only where it pays off.

Event-Driven Architecture: Power, Poison, and When to Reach for It

Event-driven architecture (EDA) solves a specific problem: you need multiple systems to react to something that happened, without the producer caring who's listening. The classic alternative — direct synchronous calls — creates a dependency spider web. Your order service calls inventory, which calls the warehouse, which calls shipping. Now your order service's uptime is the product of everyone else's uptime. At 99.9% each, four services in a chain give you 99.6% overall. That's roughly 35 hours of downtime per year from services that are individually 'highly available.'
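The arithmetic above is worth making concrete. A quick check of the chained-availability numbers (the 99.9% figure is the hypothetical per-service SLA from the example):

```python
# Four services in a synchronous chain, each individually 99.9% available.
per_service = 0.999
chain = per_service ** 4                      # uptime of the whole chain

hours_per_year = 24 * 365
downtime_hours = (1 - chain) * hours_per_year

print(f"chain availability: {chain:.4%}")     # ~99.60%
print(f"downtime per year:  {downtime_hours:.0f} hours")
```

Each extra synchronous hop multiplies in another availability factor, which is why long call chains quietly erode an SLA that every individual service still meets.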

EDA decouples that chain. The order service publishes an OrderConfirmed event. Inventory, warehouse, fraud detection, and email notification all consume it independently. The order service doesn't know they exist. New consumers can subscribe without touching the producer. This is real decoupling — not just the dependency injection kind.

The poison pill is invisible failure. In a synchronous call, if the warehouse service is down, you know immediately — your caller gets a 503. In EDA, your event is published to the queue, the producer returns success, and the warehouse consumer is silently dead. Events accumulate in the dead-letter queue. You discover it when a customer calls saying their package never shipped. I've seen this exact scenario play out in a logistics company where a misconfigured consumer group caused 6 hours of orders to pile up unprocessed while dashboards showed everything green.

Use EDA when: you have genuinely independent consumers, eventual consistency is acceptable for the domain, and you have the operational maturity to monitor queue depth and consumer lag. Don't use it for anything that needs synchronous confirmation — payment authorisation, stock reservation at checkout time, authentication.

EventDrivenOrderPipeline.py (Python)
# io.thecodeforge — System Design tutorial

# Scenario: Post-checkout event pipeline.
# The payment is already confirmed. Now we need to notify inventory,
# fraud, shipping, and email — all independently, all without
# the checkout service caring about any of them.

import json
import time
import threading
from dataclasses import dataclass, asdict
from typing import Callable
from collections import defaultdict, deque
from datetime import datetime

@dataclass
class OrderConfirmedEvent:
    event_id: str
    order_id: str
    customer_id: str
    product_ids: list[str]
    total_amount: float
    occurred_at: str  # ISO 8601 — always timestamp your events at creation time

    def to_json(self) -> str:
        return json.dumps(asdict(self))


class InMemoryEventBus:
    """
    Production equivalent: Apache Kafka, AWS SQS+SNS, or RabbitMQ.
    The contract is the same: publish once, consume independently.
    This in-memory version makes the pattern visible without Kafka setup.
    """

    def __init__(self):
        # Each topic maps to a list of independent consumer queues.
        # In Kafka terms: each consumer GROUP gets its own queue.
        # This is what enables independent consumption and replay.
        self._topics: dict[str, list[deque]] = defaultdict(list)
        self._handlers: dict[str, list[Callable]] = defaultdict(list)
        self._lock = threading.Lock()

    def subscribe(self, topic: str, handler: Callable) -> None:
        """Register a handler with its own queue and start a polling consumer thread.
        The loop busy-polls; production Kafka consumers use long-polling with
        configurable fetch.min.bytes and fetch.max.wait.ms instead."""
        queue: deque = deque()
        with self._lock:
            self._topics[topic].append(queue)
            self._handlers[topic].append(handler)

        def poll() -> None:
            while True:
                if queue:
                    event = queue.popleft()
                    try:
                        handler(event)
                    except Exception as e:
                        # In production: route to a dead-letter queue and alert. Never
                        # silently swallow: that's how orders pile up unprocessed.
                        print(f"[EventBus] CONSUMER ERROR - routing to DLQ: {e}")
                else:
                    time.sleep(0.01)  # Back off when idle, don't spin-burn CPU

        threading.Thread(target=poll, daemon=True).start()

    def publish(self, topic: str, event: dict) -> None:
        """Fan the event out to every subscriber's queue and return immediately.
        In Kafka terms: append to the topic log; each consumer group reads at its own pace."""
        with self._lock:
            for queue in self._topics[topic]:
                queue.append(event)


# --- CONSUMERS ---
# Each of these would be a separate service in production.
# They share zero state. They don't know about each other.

class InventoryConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Inventory] Reserving stock for order {event['order_id']} "
              f"— products: {event['product_ids']}")
        # In reality: update stock counts, trigger reorder if threshold hit

class FraudDetectionConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Fraud] Running fraud score for customer {event['customer_id']} "
              f"— amount: ${event['total_amount']}")
        # In reality: call ML model, flag order if score > threshold, publish FraudFlaggedEvent

class ShippingConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Shipping] Creating shipment manifest for order {event['order_id']}")
        # In reality: call 3PL API, generate label, store tracking number

class EmailNotificationConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Email] Queueing confirmation email to customer {event['customer_id']} "
              f"for order {event['order_id']}")
        # In reality: render template, call SES/SendGrid, record sent timestamp


# --- PRODUCER ---

class CheckoutService:
    """The checkout service knows about the event bus and the event schema.
    It does NOT know about inventory, fraud, shipping, or email.
    Adding a new downstream consumer requires ZERO changes here."""

    def __init__(self, event_bus: InMemoryEventBus):
        self.event_bus = event_bus

    def complete_checkout(self, order_id: str, customer_id: str, product_ids: list[str], total: float) -> dict:
        # Payment authorisation would happen here synchronously BEFORE this point.
        # EDA handles post-payment side effects — not the payment itself.
        event = OrderConfirmedEvent(
            event_id=f"evt-{order_id}-{int(time.time())}",
            order_id=order_id,
            customer_id=customer_id,
            product_ids=product_ids,
            total_amount=total,
            occurred_at=datetime.utcnow().isoformat() + "Z"
        )

        self.event_bus.publish("order.confirmed", asdict(event))

        # Returns IMMEDIATELY — doesn't wait for inventory, fraud, or shipping.
        # This is your sub-100ms checkout response time.
        return {"status": "confirmed", "order_id": order_id}
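To see the decoupling contract end-to-end without threads, here is a condensed, synchronous stand-in for the bus above (a simplified sketch: the topic name mirrors the example, everything else is illustrative):

```python
from collections import defaultdict

class SyncEventBus:
    """Synchronous stand-in for the threaded bus: same publish/subscribe
    contract, but handlers run inline so the fan-out is easy to trace."""
    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._handlers[topic]:
            try:
                handler(event)            # each consumer is independent...
            except Exception as e:        # ...so one failure never blocks the rest
                print(f"[DLQ] {type(e).__name__}: {e}")

received = []
bus = SyncEventBus()
bus.subscribe("order.confirmed", lambda e: received.append(("inventory", e["order_id"])))
bus.subscribe("order.confirmed", lambda e: received.append(("email", e["order_id"])))

bus.publish("order.confirmed", {"order_id": "ORD-1"})
print(received)  # both consumers saw the same event; the producer named neither
```

Adding a third consumer is one more `subscribe` call; the producer's code does not change at all.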
Never Do This: Publish Events Without Monitoring Consumer Lag
Consumer lag is the silent killer in EDA. Your producer is healthy, your event bus is healthy, but your consumer group fell behind 3 hours ago because of a bad deployment. In Kafka: alert when consumer group lag exceeds 1,000 messages per partition. In SQS: alert on ApproximateNumberOfMessages growing without bound. I've seen teams discover six-figure inventory discrepancies because nobody set this alert. Set it on day one, before your first consumer hits production.
Production Insight
Consumer lag is the silent killer—your dashboards show green while orders pile up unprocessed.
A misconfigured Kafka consumer group caused a 45-minute order processing backlog at a logistics company.
We added PagerDuty alerts on consumer lag above 1,000 messages per partition; that night, we caught the next incident before customers did.
Key Takeaway
EDA decouples producers from consumers but couples you to your monitoring.
If you can't monitor consumer lag, you're flying blind.
Use DLQ with alerts—don't let failed events disappear silently.
Synchronous or Event-Driven?
If: Consumer needs immediate confirmation (e.g., payment auth)
Use: Synchronous call (REST/gRPC). EDA's eventual consistency will cause business problems.
If: Multiple independent systems need to react to the same event
Use: Event-driven. Publish once, consume anywhere. New consumers subscribe without producer changes.
If: You cannot afford to lose or delay events (e.g., order shipping)
Use: EDA is fine, but you must monitor consumer lag and DLQ depth. No exceptions.

Saga Pattern: Coordinating Distributed Workflows Without the Pain

When you split a transaction across microservices, ACID goes away. The saga pattern is the answer: a sequence of local transactions where each step publishes an event that triggers the next step. If a step fails, the saga runs compensating transactions to undo the previous steps. But compensations fail too, and that's where the real pain begins.

There are two flavours: choreography and orchestration. Choreography: each service emits events and listens for others. Simple but hard to trace—a five-step saga means five events, five consumers, and you need to correlate them manually. Orchestration: a central coordinator tells each service what to do and handles failures. More code upfront but you get a single place to debug and monitor.

The honest truth: sagas add significant operational complexity. Every compensation transaction must be idempotent and retriable. You need dead-letter queues for non-recoverable failures. You need manual remediation runbooks for when compensations can't recover automatically. Only use sagas when you absolutely must split a transaction across service boundaries. If you can keep the transaction within a single service (or use a modular monolith), do that instead.
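Idempotency is the non-negotiable property here: a compensation that gets retried, or redelivered off a queue, must be a safe no-op the second time. A minimal sketch using the transaction ID as the idempotency key (the class and IDs are illustrative, not from a real payment API):

```python
class IdempotentRefundHandler:
    """Remembers which transactions have already been compensated, so a
    redelivered or retried refund request never double-refunds a customer."""
    def __init__(self):
        self._processed: set[str] = set()   # production: a DB table with a unique key
        self.refunds_issued = 0

    def refund(self, transaction_id: str) -> bool:
        if transaction_id in self._processed:
            return False                    # replay detected: safe no-op
        self._processed.add(transaction_id)
        self.refunds_issued += 1            # the real refund API call goes here
        return True

handler = IdempotentRefundHandler()
print(handler.refund("PAY-TXN-67890"))  # True  - refund issued
print(handler.refund("PAY-TXN-67890"))  # False - duplicate delivery ignored
print(handler.refunds_issued)           # 1
```

In production the processed-set lives in the same database transaction as the refund record, so the dedup check and the side effect commit atomically.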

OrderSagaOrchestrator.py (Python)
# io.thecodeforge — System Design tutorial

# Scenario: Order processing saga using an orchestrator.
# The orchestrator coordinates three microservices: Inventory, Payment, Shipping.
# If any step fails, it runs compensations for all completed steps.

from dataclasses import dataclass
from typing import Optional, Callable
from enum import Enum
import json

class SagaStepStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATED = "compensated"

@dataclass
class SagaStep:
    name: str
    action: Callable
    compensate: Callable
    status: SagaStepStatus = SagaStepStatus.PENDING
    output: Optional[dict] = None
    error: Optional[str] = None

@dataclass
class SagaContext:
    order_id: str
    customer_id: str
    products: list[dict]
    payment_amount: float
    # Store outputs of each step for potential compensation
    inventory_reservation_id: Optional[str] = None
    payment_transaction_id: Optional[str] = None
    shipping_label_id: Optional[str] = None


class SagaOrchestrator:
    """Central coordinator for the order processing saga.
    In production, this would be a stateful service using a database or event store
    to persist saga state so it can survive restarts."""

    def __init__(self):
        self._steps: list[SagaStep] = []
        self._context: Optional[SagaContext] = None

    def add_step(self, name: str, action: Callable, compensate: Callable) -> None:
        self._steps.append(SagaStep(name=name, action=action, compensate=compensate))

    def execute(self, context: SagaContext) -> dict:
        self._context = context
        completed_steps = []

        print(f"[Saga] Starting saga for order {context.order_id}")

        try:
            for step in self._steps:
                print(f"[Saga] Executing step: {step.name}")
                # Call the action with context; action returns a dict with output
                result = step.action(self._context)
                step.status = SagaStepStatus.COMPLETED
                step.output = result
                completed_steps.append(step)
                print(f"[Saga] Step {step.name} completed: {result}")

            print(f"[Saga] All steps succeeded for order {context.order_id}")
            return {"status": "success", "order_id": context.order_id}

        except Exception as e:
            print(f"[Saga] Step failed: {e}")
            # Compensate completed steps in reverse order
            for step in reversed(completed_steps):
                print(f"[Saga] Compensating step: {step.name}")
                try:
                    step.compensate(self._context, step.output)
                    step.status = SagaStepStatus.COMPENSATED
                except Exception as comp_err:
                    # Non-recoverable compensation failure—requires manual intervention
                    print(f"[Saga] COMPENSATION FAILED for step {step.name}: {comp_err}")
                    # In production: publish to DLQ, alert ops
                    step.error = str(comp_err)
            return {"status": "failed", "order_id": context.order_id, "error": str(e)}


# --- Dummy microservice actions (simulate actual service calls) ---

def reserve_inventory(ctx: SagaContext) -> dict:
    """Call inventory service. Can raise exception on failure."""
    print(f"[Inventory] Reserving stock for order {ctx.order_id}")
    # Simulate success
    ctx.inventory_reservation_id = "INV-RES-12345"
    return {"reservation_id": ctx.inventory_reservation_id}

def compensate_inventory(ctx: SagaContext, step_output: dict) -> None:
    """Cancel inventory reservation."""
    print(f"[Inventory] Compensation: releasing reservation {step_output['reservation_id']}")
    # Simulate success

def process_payment(ctx: SagaContext) -> dict:
    """Call payment gateway. Can raise exception."""
    print(f"[Payment] Charging ${ctx.payment_amount} for order {ctx.order_id}")
    # Simulate success
    ctx.payment_transaction_id = "PAY-TXN-67890"
    return {"transaction_id": ctx.payment_transaction_id}

def compensate_payment(ctx: SagaContext, step_output: dict) -> None:
    """Issue refund."""
    print(f"[Payment] Compensation: refunding transaction {step_output['transaction_id']}")
    # Simulate success

def create_shipment(ctx: SagaContext) -> dict:
    """Create shipment with shipping provider."""
    print(f"[Shipping] Creating shipment for order {ctx.order_id}")
    # Simulate success
    ctx.shipping_label_id = "SHP-LBL-11111"
    return {"label_id": ctx.shipping_label_id}

def compensate_shipment(ctx: SagaContext, step_output: dict) -> None:
    """Cancel shipment."""
    print(f"[Shipping] Compensating: cancelling label {step_output['label_id']}")
    # Simulate success


# --- Wire it up ---

if __name__ == "__main__":
    orchestrator = SagaOrchestrator()

    # Define saga steps in order
    orchestrator.add_step("Reserve Inventory", reserve_inventory, compensate_inventory)
    orchestrator.add_step("Process Payment", process_payment, compensate_payment)
    orchestrator.add_step("Create Shipment", create_shipment, compensate_shipment)

    ctx = SagaContext(
        order_id="ORD-20241105-089",
        customer_id="CUST-321",
        products=[{"sku": "SKU-LAPTOP", "qty": 1}],
        payment_amount=1299.99
    )

    result = orchestrator.execute(ctx)
    print(f"\n[Saga] Final result: {json.dumps(result, indent=2)}")
The Non-Recoverable Saga Trap
If a compensation fails after you've already started rolling back, you're in a non-recoverable state—partial rollback, no automated recovery path. Mitigation: make all compensations idempotent and retry with exponential backoff. If retries exhaust, route to a dead-letter queue and alert ops immediately. Build a manual remediation dashboard. Never let a failed compensation disappear silently; it becomes a financial reconciliation problem later.
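This is the exact shape of the incident in this article's headline: a compensation refund hit a 429 rate limit and had no retry. A sketch of the retry-then-DLQ mitigation described above (the exception type and delay values are illustrative):

```python
import time

class RateLimited(Exception):
    """Stands in for an HTTP 429 from a downstream refund API."""

def run_compensation(compensate, max_attempts=5, base_delay=0.01, dlq=None):
    """Retry an idempotent compensation with exponential backoff; on
    exhaustion, park it in a DLQ for manual remediation. Never drop it."""
    for attempt in range(max_attempts):
        try:
            compensate()
            return True
        except RateLimited:
            time.sleep(base_delay * (2 ** attempt))  # 10ms, 20ms, 40ms, ...
    if dlq is not None:
        dlq.append(compensate)   # alert ops; a human finishes the rollback
    return False

# A refund that is rate-limited twice, then succeeds on the third attempt:
attempts = {"n": 0}
def flaky_refund():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited("429 Too Many Requests")

dlq = []
ok = run_compensation(flaky_refund, dlq=dlq)
print(ok, attempts["n"], len(dlq))  # True 3 0
```

Because the compensation is idempotent, the retries are safe; because exhaustion lands in a DLQ, a persistent 429 becomes an ops alert instead of a customer charged for nothing.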
Production Insight
Saga compensation failures are silent unless you monitor DLQ depth.
I've seen a team lose $50k in refunds because their compensation retry loop had a bug that caused it to stop retrying after the first failure.
They only noticed when customers started complaining a week later—alert on DLQ depth on day one.
Key Takeaway
Sagas are for when you have no other choice.
Always test compensation under failure scenarios.
Monitor dead-letter queue depth for all saga topics.
Saga or Single Transaction?
If: Transaction fits within one database
Use: Local transaction (ACID). Simpler, safer, faster.
If: Transaction must span multiple services and you can accept eventual consistency
Use: Saga with orchestration. Choreography is harder to debug.
If: Compensation is not idempotent and cannot be made idempotent
Use: Don't use a saga. Rethink your service boundaries to keep the transaction local.

CQRS and the Read/Write Split: When Your Query Patterns Are Killing Your Writes

CQRS — Command Query Responsibility Segregation — is one of the most cargo-culted patterns in the industry. Teams add it because it sounds senior-level. Here's the honest version: you need CQRS when your read and write access patterns are so different that a single model optimised for both is actually optimised for neither.

Consider a product catalogue. Writes are rare, structured, and come from an admin tool — one product update at a time, full validation, transactional integrity. Reads are constant, require different field combinations per client (mobile wants a summary, web wants full detail, search wants keywords only), and need to be fast under high concurrency. A single relational model with indexes for everything is a compromise that serves nobody well. Your write path trips over read indexes. Your read path joins five tables to answer a query that could be a single document lookup.

CQRS splits this: the write side (Commands) uses a normalised transactional model. The read side (Queries) uses one or more denormalised read models, potentially different databases entirely — PostgreSQL for writes, Elasticsearch for search, Redis for session data. When a command succeeds, you publish an event (or a projection job runs) to update the read models. Those read models are eventually consistent — they lag behind by milliseconds to seconds.

That 'eventually consistent' part is where teams get burned. I've seen a fintech ship CQRS across their account balance domain. Write side was Postgres. Read side was a Redis projection. A deployment bug caused the projection to stop updating. For 40 minutes, customers saw stale balances. Nobody noticed until a customer tried to spend money the read model said they had but the write model had already debited. Don't use CQRS for anything where reading stale data causes a financial or safety consequence. It's a pattern for scale, not for correctness.

CQRSProductCatalogue.py (Python)
# io.thecodeforge — System Design tutorial

# Scenario: Product catalogue with very different read/write patterns.
# Write side: admin updates one product at a time, needs validation + atomicity.
# Read side: 10,000 RPS product page loads, need sub-10ms response.
# A single DB model with indexes for both is a bottleneck in both directions.

from dataclasses import dataclass
from typing import Optional
import time

# ─── WRITE SIDE (Command Model) ────────────────────────────────────────────
# Normalised. Strict validation. Transactional. Think PostgreSQL.

@dataclass
class ProductWriteModel:
    """The authoritative source of truth. Commands land here.
    This model is optimised for integrity, not query speed."""
    product_id: str
    sku: str
    name: str
    description: str
    price_cents: int       # Store money as integers — never floats. Ever.
    stock_count: int
    category_id: str
    brand_id: str
    is_active: bool = True
    version: int = 1       # Optimistic locking — detect concurrent modification

@dataclass
class UpdateProductPriceCommand:
    product_id: str
    new_price_cents: int
    updated_by: str        # Always audit who changed what in a write model
    reason: str            # Force callers to explain why — reduces lazy changes

@dataclass
class ProductPriceUpdatedEvent:
    product_id: str
    old_price_cents: int
    new_price_cents: int
    occurred_at: float


class ProductCommandHandler:
    """Handles all writes. Returns an event that downstream projections consume
    to update read models. The command handler does NOT update read models directly —
    that's the projection layer's job."""

    def __init__(self, write_store: dict):
        # In production: PostgreSQL with row-level locking
        self.write_store = write_store
        self.event_log: list[ProductPriceUpdatedEvent] = []

    def handle(self, command: UpdateProductPriceCommand) -> ProductPriceUpdatedEvent:
        product = self.write_store.get(command.product_id)
        if not product:
            raise ValueError(f"Product {command.product_id} not found")
        if command.new_price_cents <= 0:
            raise ValueError("Price must be a positive number of cents")
        old_price = product.price_cents
        product.price_cents = command.new_price_cents
        product.version += 1
        # Store updated product (in production: UPDATE with WHERE version = old_version)
        self.write_store[command.product_id] = product

        event = ProductPriceUpdatedEvent(
            product_id=command.product_id,
            old_price_cents=old_price,
            new_price_cents=command.new_price_cents,
            occurred_at=time.time()
        )
        self.event_log.append(event)
        return event


# ─── READ SIDE (Projection / Query Model) ──────────────────────────────────
# Denormalised. Pre-joined. Optimised for fast lookups. Think Elasticsearch or Redis.

class ProductReadProjection:
    """Consumes events from the command side and builds a denormalised read model.
    This is eventually consistent. In production, this would run as a separate
    service or job that reads from an event queue."""

    def __init__(self):
        # In-memory dictionary simulating a fast key-value store or search index
        self._store: dict[str, dict] = {}

    def apply(self, event: ProductPriceUpdatedEvent):
        """Update the read model based on the event.
        Idempotent: applying the same event twice should yield same result."""
        # In production: check event idempotency (e.g., event_id dedup)
        if event.product_id not in self._store:
            # Initialise with default fields; in reality, you'd join with other data
            self._store[event.product_id] = {
                "product_id": event.product_id,
                "current_price_cents": event.new_price_cents,
                "last_updated": event.occurred_at
            }
        else:
            self._store[event.product_id]["current_price_cents"] = event.new_price_cents
            self._store[event.product_id]["last_updated"] = event.occurred_at

    def get_product(self, product_id: str) -> Optional[dict]:
        """Query side — instant, no joins, no locks."""
        return self._store.get(product_id)

    def get_all_prices(self) -> dict:
        """Bulk read: returns all current prices. Useful for caching."""
        return {pid: data["current_price_cents"] for pid, data in self._store.items()}


# ─── WIRING ─────────────────────────────────────────────────────────────────
# In a real system, the event would be published to a message broker,
# and the projection would consume asynchronously. Here we do it inline for demo.

if __name__ == "__main__":
    # Seed write store
    write_store = {
        "PROD-001": ProductWriteModel(
            product_id="PROD-001", sku="SKU-123", name="Widget",
            description="A widget", price_cents=1999, stock_count=100,
            category_id="CAT-1", brand_id="BRAND-A"
        )
    }

    command_handler = ProductCommandHandler(write_store)
    projection = ProductReadProjection()

    # Simulate a command
    cmd = UpdateProductPriceCommand(
        product_id="PROD-001",
        new_price_cents=1499,
        updated_by="admin@example.com",
        reason="Seasonal discount"
    )
    event = command_handler.handle(cmd)
    print(f"Command executed: price changed from ${event.old_price_cents/100:.2f} to ${event.new_price_cents/100:.2f}")

    # Apply event to projection
    projection.apply(event)
    print(f"Read model shows price: ${projection.get_product('PROD-001')['current_price_cents']/100:.2f}")

    # Consistency note: in a real async system, there's a delay.
    print("\nNote: In production, the read model update would be asynchronous.")
    print("The command handler does NOT wait for the projection to finish.")
CQRS + Eventually Consistent Balances: A Fintech Disaster Waiting to Happen
When the read model lags behind the write model in a financial context, users see outdated balances. If they act on that stale data, you have a reconciliation problem. Rule: never use CQRS for read models that drive financial decisions (e.g., 'Can I withdraw this amount?'). Always read from the write model for those queries, or use a synchronously updated materialised view within the same transaction.
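To make that rule concrete, here's a minimal sketch of routing consistency-critical reads to the write model while stale-tolerant reads hit the projection. The class and store names are illustrative, not from the incident's codebase:

```python
# Sketch: plain dicts stand in for the transactional store (Postgres)
# and the projection (Redis). All names here are hypothetical.

class BalanceQueryService:
    def __init__(self, write_store: dict, read_projection: dict):
        self.write_store = write_store          # authoritative, transactional
        self.read_projection = read_projection  # fast, eventually consistent

    def balance_for_display(self, account_id: str) -> int:
        # Stale-tolerant: dashboards, summaries. The projection is fine here.
        return self.read_projection.get(account_id, 0)

    def balance_for_withdrawal(self, account_id: str) -> int:
        # Consistency-critical: always read the source of truth.
        return self.write_store.get(account_id, 0)
```

The split forces callers to declare, at the call site, whether staleness is acceptable — which is exactly the decision the fintech team never made explicit.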
Production Insight
CQRS projection lag is silent until a customer sees a wrong balance.
In the fintech case, the projection stopped for 40 minutes — zero alarms.
We added a synthetic heartbeat: every 10 seconds, the write side publishes a 'heartbeat' event; if the read model doesn't see it within 30 seconds, it triggers a critical alert.
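A minimal sketch of that heartbeat check, assuming the heartbeat arrives as a timestamped event on the same stream the projection already consumes; the intervals match the ones above, but the class and method names are illustrative:

```python
import time

HEARTBEAT_INTERVAL_S = 10  # write side publishes a heartbeat this often
ALERT_THRESHOLD_S = 30     # projection must see one within this window

class ProjectionHeartbeatMonitor:
    """Flags the projection as stalled when no heartbeat event
    has been processed within the alert threshold."""

    def __init__(self):
        self.last_heartbeat_at = time.time()

    def on_heartbeat(self, occurred_at):
        # Called by the projection whenever it processes a heartbeat event
        self.last_heartbeat_at = occurred_at

    def is_stalled(self, now=None):
        now = time.time() if now is None else now
        return (now - self.last_heartbeat_at) > ALERT_THRESHOLD_S
```

The key property: the check exercises the same consumption path as real events, so a stalled projection cannot report itself healthy.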
Key Takeaway
CQRS is a scale pattern, not a correctness pattern.
If stale reads cause real-world damage, don't use eventual consistency.
Start with a materialised view or read replica before committing to full CQRS.
CQRS or Single Model?
If: Read queries require complex joins or different representations per client
Use: Consider CQRS with denormalised read models (Elasticsearch, materialised views).

If: Stale data could cause financial or safety issues
Use: Do NOT use CQRS with eventual consistency. Keep the single transactional model.

If: Read throughput is very high and writes are low, but both use the same table
Use: Optimise with read replicas or materialised views first. Only split to CQRS if those fail.

API Gateway and Backend for Frontend (BFF): Your System's Front Door

As your architecture grows beyond a single service, you need a single entry point for clients. That's the API gateway. It handles authentication, rate limiting, request routing, and response aggregation. The alternative — letting each client talk directly to microservices — means every client has to know service topology, handle multiple authentication schemes, and reimplement retry logic. It's chaos.

But there's a nuance. A generic API gateway works when all clients (web, mobile, third-party) need roughly the same API. When they don't, you need the Backend for Frontend (BFF) pattern: a separate gateway per client type, each exposing exactly the API that client needs. Mobile doesn't need the same data as web. A third-party API needs versioning and rate limits. Each BFF owns its own aggregation logic, and you avoid the 'one gateway to serve them all' bloat.

The trade-off: more deployable units, more duplication, and tighter coupling between each BFF and its client. But for teams with multiple distinct products (web app, mobile app, public API), it's often the right call. Without it, your API gateway becomes a monolith in its own right — a thousand routes, every endpoint dependent on every other, and a single deploy takes down every client.

bff_pattern_example.js · JavaScript
// io.thecodeforge — System Design tutorial

// Simplified example: BFF for mobile vs web

class MobileBFF {
  constructor(orderService, userService, productService) {
    this.orderService = orderService;
    this.userService = userService;
    this.productService = productService;
  }

  // Mobile needs a lightweight order summary
  async getOrderSummary(orderId) {
    const order = await this.orderService.getOrder(orderId);
    // Only return what mobile needs — no full product details
    return {
      orderId: order.id,
      status: order.status,
      total: order.total,
      itemCount: order.items.length,
      lastUpdated: order.updatedAt
    };
  }
}

class WebBFF {
  constructor(orderService, userService, productService) {
    this.orderService = orderService;
    this.userService = userService;
    this.productService = productService;
  }

  // Web needs full details for product pages
  async getOrderDetail(orderId) {
    const order = await this.orderService.getOrder(orderId);
    // Enhance with product details
    const productDetails = await Promise.all(
      order.items.map(item => this.productService.getProduct(item.productId))
    );
    return {
      ...order,
      products: productDetails,
      user: await this.userService.getUser(order.userId)
    };
  }
}

// In production, these BFFs run as separate services,
// each with their own rate limits, auth, and circuit breakers.
How to Think About API Gateways
  • Single generic gateway works for one client type or similar clients.
  • BFFs prevent one client's API changes from affecting others.
  • Each BFF duplicates some concern (auth, logging) — that's okay if it prevents coupling.
  • Without a gateway, every client directly depends on your internal service topology.
  • Start with a generic gateway; extract BFFs only when client needs diverge.
Production Insight
The shared API gateway anti-pattern: one team's new endpoint requires re-deploy of the whole gateway, affecting all other teams.
BFFs solve this: each team deploys their own gateway alongside their own services.
Downside: 2x-3x more gateway instances to manage. Kubernetes + Istio makes this manageable.
Key Takeaway
An API gateway is essential once you have more than one service.
BFFs prevent client-specific logic from bloating the gateway.
Start with a single gateway; extract BFFs when your clients demand different data shapes.
API Gateway or BFF?
If: Single client type (e.g., only a web app)
Use: One generic API gateway. Simple, centralised, less operational overhead.

If: Multiple client types with different data requirements
Use: A BFF per client. Mobile gets a compact response; web gets full detail.

If: Third-party API with strict versioning and rate limits
Use: A dedicated third-party BFF. Don't mix internal and external APIs in the same gateway.
● Production Incident · Post-Mortem · Severity: High

The Saga That Didn't Compensate: A Payment Pipeline Nightmare

Symptom
Customers reported being charged for orders that were never fulfilled. Customer support tickets surged with 'I was charged but order cancelled' complaints.
Assumption
The team assumed that idempotent compensation transactions would always succeed because the compensating service had no dependencies.
Root cause
The compensation handler for payment reversal called an external payment gateway that had a rate limit. Under load, the rate limit triggered a 429 response, and the compensation was not retried. The saga orchestration treated the failure as terminal and logged it to a dead-letter queue with no alert.
Fix
Implemented exponential backoff retry with a circuit breaker for the compensation call. Added a monitoring alert on dead-letter queue depth for the saga topic. Built a manual remediation dashboard for ops to replay failed compensations.
Key lesson
  • Compensations are critical paths—treat them like primary operations.
  • Always test compensations under failure scenarios, not just happy path.
  • Alert on dead-letter queue depth for sagas; silent failures in sagas become financial losses.
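The retry part of the fix can be sketched as follows. `RateLimitError`, `send_to_dlq`, and `page_on_call` are stand-ins rather than the team's actual code, and the real implementation also wrapped the gateway call in a circuit breaker:

```python
import time

class RateLimitError(Exception):
    """Stand-in for a 429 from the payment gateway."""

def send_to_dlq(event):
    print(f"DLQ: {event}")        # hypothetical: persist for manual replay

def page_on_call(event):
    print(f"ALERT: compensation failed for {event}")  # never fail silently

def compensate_with_retry(compensate, event, max_attempts=5, base_delay_s=0.5):
    """Retry an idempotent compensation with exponential backoff.
    On exhaustion, route to the DLQ and alert instead of dropping the event."""
    for attempt in range(max_attempts):
        try:
            compensate(event)
            return True
        except RateLimitError:
            time.sleep(base_delay_s * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    send_to_dlq(event)
    page_on_call(event)
    return False
```

Note that the retries are only safe because the compensation is idempotent: replaying a refund that already succeeded must be a no-op at the gateway.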
Production Debug Guide · Symptom → Action guide for common architecture-pattern-related issues · 4 entries
Symptom · 01
Distributed transaction leaves data inconsistent (e.g., order confirmed but inventory not decremented)
Fix
Check saga orchestration logs for compensation execution. Verify dead-letter queue for failed compensation events. Run manual reconciliation script.
Symptom · 02
Event consumer lag > 10 minutes
Fix
Check consumer group status (kafka-consumer-groups --describe). Inspect consumer logs for exceptions. Scale consumer threads if partition count allows.
Symptom · 03
CQRS read model shows stale data for more than 5 seconds
Fix
Verify projection service health. Check event processing latency in the projection. Ensure read model update is idempotent and handles duplicates.
Symptom · 04
Microservice dependency chain leads to cascading failures
Fix
Implement circuit breakers in each client (e.g., Resilience4j). Set fallback responses or cached data. Review dependency graph to eliminate unnecessary synchronous calls.
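A bare-bones version of that circuit breaker, to show the mechanics; in production you'd reach for Resilience4j (JVM) or an equivalent library rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures; while open,
    calls return the fallback immediately instead of hitting the dependency."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout_s:
                return fallback()       # open: fail fast, serve cached/fallback
            self.opened_at = None       # half-open: let one probe through
            self.failures = 0
        try:
            result = fn()
            self.failures = 0           # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()
            return fallback()
```

The point of the open state is that a struggling downstream service gets breathing room instead of a retry storm, which is what turns one slow dependency into a cascading failure.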
★ Quick Debug Cheat Sheet for Architecture Decisions
When your chosen architecture pattern starts causing problems, use these commands and actions to diagnose and fix fast.
Saga compensation failure
Immediate action
Check dead-letter queue for saga topic
Commands
kafka-console-consumer --bootstrap-server localhost:9092 --topic saga.dlq --from-beginning --max-messages 10
curl -X GET http://saga-orchestrator:8080/actuator/health
Fix now
Manually replay failed saga events using a recovery script; implement retry with exponential backoff in the compensation handler
Event consumer lag spikes
Immediate action
Check consumer group lag
Commands
kafka-consumer-groups --bootstrap-server localhost:9092 --group order-confirmed-consumer --describe
docker logs order-confirmed-consumer --tail 100
Fix now
Scale consumer instances; check for blocking calls in consumer logic (e.g., synchronous DB writes); ensure consumer processing time is within SLA
CQRS projection lag > 30 seconds
Immediate action
Check projection service CPU and memory
Commands
top -b -n 1 | grep projection
curl http://projection-service:8080/actuator/metrics/projection.lag
Fix now
Add read replicas for the projection; optimise projection queries; if lag is persistent, increase the batch size for event processing
Pattern Trade-offs at a Glance
Monolith
  When to use: Small team (< 15), single bounded context, simple deployment
  When to avoid: Need for independent scaling of components
  Operational cost: Low — one pipeline, one binary
  Key risk: No failure isolation; a bad deploy takes everything down

Microservices
  When to use: Large team, clear bounded contexts, independent scale needed
  When to avoid: Team < 15, no separate deployment needs
  Operational cost: High — CI/CD per service, service mesh, observability
  Key risk: Distributed transaction complexity, network failures

Event-Driven
  When to use: Multiple independent consumers, eventual consistency OK
  When to avoid: Need synchronous confirmations (payment, auth)
  Operational cost: Medium — message broker overhead, monitoring required
  Key risk: Silent consumer failures, lag detection

Saga
  When to use: Distributed transaction required, compensation is idempotent
  When to avoid: Transaction fits in one service
  Operational cost: High — compensation logic, DLQ monitoring
  Key risk: Compensation failure leads to data inconsistency

CQRS
  When to use: Read and write patterns are very different, high read throughput
  When to avoid: Stale data causes safety/financial issues
  Operational cost: Medium — dual models, projection management
  Key risk: Eventual consistency, projection lag

API Gateway / BFF
  When to use: Multiple services need a single entry point
  When to avoid: Only one service exists
  Operational cost: Low to medium — one more service to manage
  Key risk: Becoming a monolith (too many routes) or too many BFFs

Key takeaways

1. Architecture decisions are permanent; code can be refactored in a day, but rearchitecting costs quarters.
2. Start with the simplest pattern that meets your constraints: monolith first, then extract services only when you need independent scaling.
3. Event-driven architecture decouples producers and consumers but couples you to your monitoring; consumer lag is the silent killer.
4. Sagas introduce significant operational complexity; only use them when a transaction absolutely must span multiple services.
5. CQRS is for scale, not correctness; never apply it to domains where stale data causes financial or safety consequences.
6. API gateways and BFFs prevent client chaos; start with one gateway, extract BFFs when client needs diverge.

Common mistakes to avoid

5 patterns
× Adopting microservices before the team is ready

Symptom
Team spends 60% of time on infrastructure (service discovery, tracing, CI/CD) instead of business logic. On-call rotations become unsustainable.
Fix
Stay with a monolith until you have at least 15 engineers and a clear bounded context to extract. Invest in a modular monolith first.
× Using Event-Driven Architecture without monitoring consumer lag

Symptom
Orders pile up unprocessed for hours; customers complain; dashboards show green because producer is healthy.
Fix
Day one: set alerts on consumer group lag (e.g., kafka.consumer.group.lag > 1000). If lag exceeds threshold, page on-call.
× Implementing a saga without idempotent compensations

Symptom
Compensation calls fail under load (rate limits, timeouts) and leave the system in a partially rolled-back state. Financial reconciliation becomes a manual nightmare.
Fix
Make every compensation idempotent and retriable. Use exponential backoff. If retries exhaust, route to DLQ and alert immediately.
× Applying CQRS to a domain where stale data causes real-world damage

Symptom
Users see outdated account balances or product prices; financial disputes arise.
Fix
For any domain where correctness depends on up-to-date data, do NOT use eventually consistent read models. Either read from the write model or use synchronously updated materialised views within the same transaction boundary.
× Building a single API gateway that tries to serve all client types

Symptom
Gateway becomes a monolith: thousands of routes, every change requires full redeployment, one team's endpoint affects all clients.
Fix
Extract Backend for Frontend (BFF) services per client type (web, mobile, third-party). Each BFF owns its own aggregation and deployment lifecycle.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
When would you choose a monolith over microservices?
Q02 · SENIOR
What's the biggest risk of Event-Driven Architecture, and how do you mitigate it?
Q03 · SENIOR
Explain the saga pattern and when you would NOT use it.
Q04 · SENIOR
What is the difference between an API Gateway and a Backend for Frontend (BFF)?
Q05 · SENIOR
How would you design a system that uses CQRS and needs to handle a domai...
Q01 of 05 · SENIOR

When would you choose a monolith over microservices?

ANSWER
Choose a monolith when your team is under 15 engineers, you have a single bounded context, and you don't need independent scaling of individual components. The distributed systems tax (service discovery, tracing, eventual consistency, multiple CI/CD pipelines) adds significant operational overhead that a small team can't absorb. A modular monolith gives you many of the same structural benefits without the network complexity.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01 · What's the #1 mistake teams make when adopting microservices?
02 · Can I use CQRS with a single database?
03 · When should I choose choreography over orchestration for sagas?
04 · How do I decide between an API Gateway and a Service Mesh?
05 · Is a modular monolith a viable alternative to microservices?