Senior 16 min · March 29, 2026
Software Architecture Explained: Patterns, Trade-offs and Real Decisions

Architecture Patterns: Saga Compensation Rate Limits

Customers charged for unfulfilled orders: saga compensation hit 429 rate limit with no retry.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

Last updated: June 25, 2026
Quick Answer
  • Software architecture patterns are reusable solutions to common system design problems
  • Each pattern solves a specific problem but introduces new trade-offs
  • Monolith: simple but lacks failure isolation and independent scaling
  • Microservices: independent scaling but distributed transactions are painful
  • Event-driven: decouples producers and consumers but adds consumer lag monitoring
  • CQRS: separates read/write models but introduces eventual consistency
  • Production insight: choosing the wrong pattern costs months of rework; always start with the simplest pattern that meets your constraints
Definition (~90s read)
What is Software Architecture?

This article is a no-bullshit tour of the architectural decisions that actually break production systems, framed around the specific pain point of saga compensation rate limits. Sagas are the standard pattern for managing distributed transactions across microservices—each step publishes an event, and if something fails, compensating actions roll back the work.

But when those compensations fire too fast, they can overwhelm downstream services, trigger cascading failures, or saturate message queues. Rate-limiting compensation logic isn't a nice-to-have; it's what keeps your saga from turning a single failed order into a five-minute outage across three services.

The article uses that concrete problem as a lens to examine the broader architectural choices that create or mitigate such risks.

It starts by calling out the elephant in the room: the monolith vs. microservices decision that teams rarely make honestly. Most organizations default to microservices because they're trendy, not because they've measured actual coupling or scaling needs.

From there, it moves into event-driven architecture—the power of decoupling producers from consumers, and the poison of unmanaged event storms, exactly-once delivery guarantees, and debugging hell. The saga pattern gets a practical treatment: choreography vs. orchestration, when to use each, and why compensation rate limits are the safety valve nobody configures until it's too late.

CQRS gets a reality check—splitting reads from writes solves real problems when your query patterns are hammering the same tables that handle orders, but it also introduces eventual consistency and operational complexity that most teams underestimate. The article closes with API gateways and BFFs, explaining how your system's front door can either enforce rate limits cleanly or become a bottleneck that makes your sagas worse.

Real tools like Kafka, RabbitMQ, and AWS Step Functions are referenced throughout, with concrete numbers where relevant (e.g., 'a 10x spike in compensation events can saturate a RabbitMQ consumer configured with a prefetch count of 100'). This isn't theory; it's what you'll debug at 2 AM.

Plain-English First

Think of software architecture like designing a restaurant kitchen. A food truck has one chef doing everything — fast to set up, impossible to scale when 200 people show up. A Michelin-star kitchen has a strict brigade system: saucier, pastry chef, expeditor — each station independent, but the coordination overhead is brutal. Your architecture is that kitchen design. Pick the wrong one and your 'chef' either burns out solo or your 'brigade' spends more time talking than cooking. The food is the same. The choice of how to organise the kitchen determines whether you survive dinner service.

A startup I consulted for rewrote their monolith into 47 microservices in eight months. By month nine, they had more engineers debugging inter-service latency than building features, their P99 checkout latency had tripled, and their on-call rotation was a rotating trauma ward. The monolith had been slow. The microservices were broken in ways nobody could trace.

Architecture decisions are permanent in a way that code decisions aren't. You can refactor a function in an afternoon. Rearchitecting a distributed system costs quarters, sometimes years. The tragedy is that most teams make these decisions by copying what Netflix or Uber did — without copying the 500-engineer platform team that makes those patterns survivable. The pattern isn't the hard part. Knowing when it fits your actual constraints is the whole game.

After this, you'll be able to look at a system's requirements and map them to a concrete architectural pattern — not because you memorised a definition, but because you understand the exact failure modes each pattern introduces and the specific conditions under which those failure modes hurt you. You'll know when a monolith is the right call, when event-driven architecture saves you, when CQRS is overkill, and what questions to ask before committing to any of them.

Why Saga Compensation Rate Limits Matter

Saga compensation rate limits govern how quickly a distributed transaction can undo its steps after a failure. In a saga, each completed local transaction has a compensating action that is triggered if the saga aborts. Without rate limits, a burst of failures can trigger a cascade of compensations that overwhelm downstream services, leading to further failures and data inconsistency. The core mechanic is a throttle on the number of compensating actions per time window, typically enforced via a token bucket or sliding window counter.
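To make the token bucket concrete, here is a minimal sketch in Python. The names `TokenBucket` and `drain_compensations` are hypothetical, and the rate and capacity are illustrative values rather than recommendations. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if one is available. Never blocks."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


def drain_compensations(pending: list[str], bucket: TokenBucket) -> tuple[list[str], list[str]]:
    """Split pending compensation IDs into (run now, deferred for later)."""
    executed: list[str] = []
    deferred: list[str] = []
    for compensation_id in pending:
        if bucket.try_acquire():
            executed.append(compensation_id)  # a real system would call the service here
        else:
            deferred.append(compensation_id)  # stays queued until tokens refill
    return executed, deferred
```

With `rate=5` and `capacity=10`, a burst of 25 compensations runs 10 immediately and defers 15; one second later, 5 more are allowed through. Deferred work has to stay in a durable queue, otherwise throttling turns into silently dropped compensations.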

Rate limits apply per saga type, per service, or globally. Key properties: they must be independent of normal request rate limits, because compensations often have different resource profiles (e.g., database rollbacks vs. writes). They also need to account for retry storms — if a compensation fails, it may be retried, consuming additional capacity. Practical implementations use a dedicated compensation queue with a max concurrency setting, not just a rate limiter on the API gateway.
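A dedicated compensation queue with a concurrency cap might look like the sketch below. It assumes a thread-based worker pool from the standard library; `CompensationQueue` and `peak_concurrency` are hypothetical names, and the sleep stands in for the real downstream call.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class CompensationQueue:
    """Dedicated executor for compensations, separate from request-handling threads."""

    def __init__(self, max_concurrency: int = 10):
        # The worker count is the concurrency cap. Extra submissions wait in the
        # pool's internal queue instead of hitting the downstream service.
        self._pool = ThreadPoolExecutor(
            max_workers=max_concurrency, thread_name_prefix="compensation"
        )

    def submit(self, compensate: Callable, *args) -> Future:
        return self._pool.submit(compensate, *args)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


def peak_concurrency(total: int, max_concurrency: int) -> int:
    """Fire `total` fake compensations; return the most that ran at the same time."""
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def fake_compensation(_order_id: int) -> None:
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.005)  # stand-in for the downstream call
        with lock:
            state["running"] -= 1

    queue = CompensationQueue(max_concurrency)
    for future in [queue.submit(fake_compensation, i) for i in range(total)]:
        future.result()  # re-raises if a compensation failed
    queue.close()
    return state["peak"]
```

Calling `peak_concurrency(total=100, max_concurrency=10)` never reports more than 10 in flight, however large the burst. The pool's internal queue is unbounded and in-memory, so a production version would sit behind a durable broker queue instead.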

Use saga compensation rate limits in any system with long-running transactions and asynchronous compensation, especially when downstream services are shared across multiple sagas. Without them, a single failing saga can trigger a thundering herd of compensations that degrades system availability. They are critical for maintaining stability under partial failures, which are the norm in distributed systems.

Compensations Are Not Idempotent by Default
Rate limits alone won't save you if compensations are not idempotent — a retried compensation can double-debit an account or delete a resource twice.
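One common way to make a compensation idempotent is to key it on the saga and step, and to record which keys have already been applied. The sketch below uses a hypothetical in-memory `RefundLedger`; a real implementation would persist the key with a unique constraint so the check and the write are atomic.

```python
class RefundLedger:
    """Hypothetical in-memory stand-in for a payments store."""

    def __init__(self, balance_cents: int = 0):
        self.balance_cents = balance_cents
        self._applied: set[str] = set()  # idempotency keys already processed

    def refund(self, idempotency_key: str, amount_cents: int) -> bool:
        """Apply the refund once per key. Returns False for a duplicate."""
        if idempotency_key in self._applied:
            return False  # retried compensation: do nothing, report success upstream
        self.balance_cents += amount_cents
        self._applied.add(idempotency_key)
        return True


ledger = RefundLedger()
key = "saga-42:refund-payment"  # assumed key format: saga ID plus step name
first = ledger.refund(key, 1500)
retry = ledger.refund(key, 1500)  # same key, so the balance is not touched again
```

After both calls, `ledger.balance_cents` is 1500, not 3000: the retry is recognised and skipped.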
Production Insight
A payment saga fails for 100 orders simultaneously due to a downstream timeout. Without rate limits, 1000 compensation requests hit the inventory service in 2 seconds, causing it to crash and lose stock reservations.
Symptom: inventory service becomes unresponsive, then all subsequent sagas fail with 'inventory unavailable' even though stock exists.
Rule of thumb: set compensation rate limits to 20% of the normal request rate for the same endpoint, and use a dedicated queue with max concurrency = 10.
Key Takeaway
Compensation rate limits prevent cascading failures from saga aborts, not from normal traffic.
Always decouple compensation rate limits from request rate limits — they protect different failure modes.
Test compensation storms in chaos engineering: a single saga failure should never take down a shared service.
[Figure: Saga Compensation Rate Limits Architecture. Flow from monolith to distributed saga with rate-limited compensation. Panels: Monolith vs. Microservices (decision impacts compensation complexity); Event-Driven Architecture (powers async workflows but risks cascading failures); Saga Pattern (coordinates distributed workflows with compensating actions); Compensation Rate Limits (prevent overload from rapid retries or rollbacks); Consistency vs. Availability (trade-off affects saga design and compensation strategy); Scaled Database with CQRS (read/write split improves query performance). Warning: unbounded compensation retries can cascade to system failure; always cap retry rate and use circuit breakers for compensation flows.]

[Figure: Saga Compensation Rate Limit Flow. How rate limits prevent cascading undo failures. Steps: Step Fails (local transaction in saga aborts); Emit Compensate (publish compensating event for step); Rate Limit Check (throttle compensation bursts); Execute Compensate (run compensating transaction); Saga Aborted (all steps undone safely). Warning: without limits, one failure triggers a cascade of compensations.]

Monolith vs. Microservices: The Decision Nobody Makes Honestly

Every team says they're 'moving to microservices for scalability.' What they usually mean is they read a Martin Fowler post and their CTO saw a conference talk. Let's be precise about what each architecture actually costs you, because the decision is irreversible for 18-36 months once you commit.

A monolith is a single deployable unit. All your business logic, data access, and API surface lives in one process. The wins are real: in-process function calls instead of network hops, a single transaction boundary, one deployment pipeline, and a debugger that actually works. The failure mode is equally real: a memory leak in your image-processing module takes down your payment API. One bad deploy nukes everything. Your release cadence is bottlenecked by the slowest team's merge.

Microservices split that single process into independently deployable services communicating over a network. You get independent deployability and isolated failure domains. You also get distributed systems problems you didn't have before: network partitions, eventual consistency, service discovery, distributed tracing, and the operational complexity of running 20+ services instead of one. I've watched teams spend three months building the infrastructure to support microservices before writing a single line of business logic.

The honest rule: if you have fewer than 15 engineers, a monolith is almost certainly correct. If your scaling bottleneck is genuinely a specific bounded domain — say, your video transcoding is hammering CPU while your API servers idle — extract that one service. Don't extract everything because you might need to scale it someday. That day may never come, and you'll have paid the distributed systems tax in advance for nothing.

OrderServiceMonolith.py (Python)
# io.thecodeforge — System Design tutorial

# Scenario: E-commerce order service.
# This is the monolith version. Notice what you get for FREE
# that microservices make you rebuild from scratch.

from dataclasses import dataclass
from decimal import Decimal
from typing import Optional
import sqlite3

# Single database connection — one transaction wraps everything.
# In a distributed system, this atomicity is GONE unless you implement
# two-phase commit or saga patterns. Both are painful.
DB_PATH = "orders.db"

@dataclass
class OrderItem:
    product_id: str
    quantity: int
    unit_price: Decimal

@dataclass
class Order:
    order_id: str
    customer_id: str
    items: list[OrderItem]
    status: str = "pending"

class InventoryService:
    """In a monolith, this is just a class. In microservices, it's a network call
    that can timeout, return stale data, or be temporarily unavailable."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def reserve_stock(self, product_id: str, quantity: int) -> bool:
        cursor = self.conn.cursor()
        # SQLite has no SELECT ... FOR UPDATE; the database-level write lock
        # serialises writers instead.
        # In a distributed inventory service, this becomes a distributed lock:
        # Redis SETNX, or a dedicated locking service. Both add latency and failure modes.
        cursor.execute(
            "SELECT stock_count FROM inventory WHERE product_id = ?",
            (product_id,)
        )
        row = cursor.fetchone()
        if not row or row[0] < quantity:
            return False  # Not enough stock — fail the whole operation cleanly

        cursor.execute(
            "UPDATE inventory SET stock_count = stock_count - ? WHERE product_id = ?",
            (quantity, product_id)
        )
        return True

class PaymentService:
    """Again — a class. Zero network calls. Zero timeout handling needed.
    The moment this becomes a separate service, you need circuit breakers,
    retry logic, idempotency keys, and dead-letter queues. All of that
    is real engineering effort — not free."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def charge_customer(self, customer_id: str, amount: Decimal) -> bool:
        cursor = self.conn.cursor()
        cursor.execute(
            "SELECT balance FROM accounts WHERE customer_id = ?",
            (customer_id,)
        )
        row = cursor.fetchone()
        if not row or row[0] < amount:
            return False  # Insufficient funds — atomic rollback handles cleanup

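        # sqlite3 cannot bind a Decimal directly; pass it as text and let
        # SQLite's arithmetic coerce it to a number.
        cursor.execute(
            "UPDATE accounts SET balance = balance - ? WHERE customer_id = ?",
            (str(amount), customer_id)
        )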
        return True

class OrderOrchestrator:
    """The core of the monolith advantage: ONE database transaction covers
    inventory reservation, payment, AND order creation. If payment fails,
    inventory is automatically unreserved. No compensation logic. No sagas.
    No 'sorry, your order is stuck in PENDING forever' bugs."""

    def __init__(self):
        self.conn = sqlite3.connect(DB_PATH)
        self.inventory = InventoryService(self.conn)
        self.payment = PaymentService(self.conn)

    def place_order(self, order: Order) -> dict:
        total = sum(
            item.unit_price * item.quantity for item in order.items
        )

        try:
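            # Open the transaction explicitly. BEGIN IMMEDIATE takes SQLite's write
            # lock up front, so the stock check and the update cannot interleave
            # with another writer.
            self.conn.execute("BEGIN IMMEDIATE")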

            # Step 1: Reserve all stock items.
            # If ANY item fails, we roll back everything. One line of code.
            for item in order.items:
                if not self.inventory.reserve_stock(item.product_id, item.quantity):
                    self.conn.rollback()
                    return {
                        "success": False,
                        "reason": f"Insufficient stock for product {item.product_id}"
                    }

            # Step 2: Charge the customer.
            # In a microservices world, payment already happened in a different
            # service. If inventory reservation then fails, you're issuing refunds.
            # That's a SUPPORT TICKET. Here? It's a rollback.
            if not self.payment.charge_customer(order.customer_id, total):
                self.conn.rollback()
                return {"success": False, "reason": "Payment failed"}

            # Step 3: Persist the order record.
            self.conn.execute(
                "INSERT INTO orders (order_id, customer_id, status, total) VALUES (?, ?, ?, ?)",
                (order.order_id, order.customer_id, "confirmed", str(total))
            )

            self.conn.commit()  # All three operations committed atomically
            return {"success": True, "order_id": order.order_id, "total": str(total)}

        except Exception as e:
            self.conn.rollback()  # Something unexpected? Clean slate.
            raise RuntimeError(f"Order placement failed: {e}") from e


# --- Demonstrate the flow ---
if __name__ == "__main__":
    # Normally you'd have migrations. Simplified for illustration.
    orchestrator = OrderOrchestrator()

    order = Order(
        order_id="ORD-20240315-001",
        customer_id="CUST-789",
        items=[
            OrderItem(product_id="SKU-HEADPHONES-XZ3", quantity=1, unit_price=Decimal("149.99")),
            OrderItem(product_id="SKU-USB-CABLE-C", quantity=2, unit_price=Decimal("12.50"))
        ]
    )

    result = orchestrator.place_order(order)
    print(f"Order result: {result}")
Production Trap: The Distributed Transaction Debt
When you split this into microservices, every conn.rollback() in the monolith becomes a saga with compensating transactions. Teams that don't plan for this end up with orders stuck in PENDING state indefinitely when a downstream service times out. The symptom: customer support tickets saying 'I was charged but my order never arrived.' The fix before you split: define the saga pattern and write the compensation handlers FIRST, before extracting a single service.
Production Insight
The real cost of microservices isn't the code—it's the operational overhead.
A six-engineer team I worked with spent 40% of their sprint time on CI/CD pipeline maintenance, service discovery, and distributed tracing setup.
That's time they could have spent building features; a modular monolith gave them clear module boundaries without the network tax.
Key Takeaway
If you have fewer than 15 engineers, a monolith is almost certainly correct.
If your scaling bottleneck is a specific bounded domain, extract that one service—not everything.
The distributed systems tax is real; don't pay it until you need the benefits.
Monolith or Microservices?
If: Fewer than 15 engineers, single bounded context
Use: Monolith. You don't need the distributed systems tax yet.
If: Team > 15 engineers, multiple bounded contexts with clear ownership
Use: Microservices, but only after you've built the platform (CI/CD, tracing, service mesh).
If: Scaling bottleneck in one specific service (e.g., video transcoding)
Use: Extract only that service. Keep the rest in the monolith. Pay the tax only where it pays off.
[Figure: Monolith vs. Microservices Reality. What each architecture actually costs you. Monolith: single deployable unit, shared database schema, simple inter-module calls, lower operational overhead. Microservices: independent deploy units, database per service, network calls between services, higher ops and complexity. The decision is irreversible for 18-36 months.]

Event-Driven Architecture: Power, Poison, and When to Reach for It

Event-driven architecture (EDA) solves a specific problem: you need multiple systems to react to something that happened, without the producer caring who's listening. The classic alternative, direct synchronous calls, creates a dependency spider web. Your order service calls inventory, which calls the warehouse, which calls shipping. Now your order service's uptime is the product of every service's uptime in the chain. At 99.9% each, four services in a chain gives you about 99.6% overall. That's roughly 35 hours of downtime per year from services that are individually 'highly available.'
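The availability arithmetic is worth working through once. This sketch assumes failures are independent and that a request needs every service in the chain; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def chain_availability(per_service: float, services: int) -> float:
    """Availability of a synchronous chain where every hop must succeed."""
    return per_service ** services


def downtime_hours_per_year(availability: float) -> float:
    return (1 - availability) * HOURS_PER_YEAR


single = downtime_hours_per_year(0.999)
chain = chain_availability(0.999, 4)
chained = downtime_hours_per_year(chain)

print(f"one service:   {single:.1f} hours/year")   # 8.8 hours/year
print(f"four in chain: {chain:.4%} available")     # 99.6006% available
print(f"four in chain: {chained:.1f} hours/year")  # 35.0 hours/year
```

Each service on its own is down under 9 hours a year; chaining four of them synchronously multiplies that by roughly four.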

EDA decouples that chain. The order service publishes an OrderConfirmed event. Inventory, warehouse, fraud detection, and email notification all consume it independently. The order service doesn't know they exist. New consumers can subscribe without touching the producer. This is real decoupling — not just the dependency injection kind.

The poison pill is invisible failure. In a synchronous call, if the warehouse service is down, you know immediately: your caller gets a 503. In EDA, your event is published to the queue, the producer returns success, and the warehouse consumer is silently dead. Events accumulate unconsumed in the queue. You discover it when a customer calls saying their package never shipped. I've seen this exact scenario play out in a logistics company where a misconfigured consumer group caused 6 hours of orders to pile up unprocessed while dashboards showed everything green.

Use EDA when: you have genuinely independent consumers, eventual consistency is acceptable for the domain, and you have the operational maturity to monitor queue depth and consumer lag. Don't use it for anything that needs synchronous confirmation — payment authorisation, stock reservation at checkout time, authentication.

EventDrivenOrderPipeline.py (Python)
# io.thecodeforge — System Design tutorial

# Scenario: Post-checkout event pipeline.
# The payment is already confirmed. Now we need to notify inventory,
# fraud, shipping, and email — all independently, all without
# the checkout service caring about any of them.

import json
import time
import threading
from dataclasses import dataclass, asdict
from typing import Callable
from collections import defaultdict, deque
from datetime import datetime

@dataclass
class OrderConfirmedEvent:
    event_id: str
    order_id: str
    customer_id: str
    product_ids: list[str]
    total_amount: float
    occurred_at: str  # ISO 8601 — always timestamp your events at creation time

    def to_json(self) -> str:
        return json.dumps(asdict(self))


class InMemoryEventBus:
    """
    Production equivalent: Apache Kafka, AWS SQS+SNS, or RabbitMQ.
    The contract is the same: publish once, consume independently.
    This in-memory version makes the pattern visible without Kafka setup.
    """

    def __init__(self):
        # Each topic maps to a list of independent consumer queues.
        # In Kafka terms: each consumer GROUP gets its own queue.
        # This is what enables independent consumption and replay.
        self._topics: dict[str, list[deque]] = defaultdict(list)
        self._handlers: dict[str, list[Callable]] = defaultdict(list)
        self._lock = threading.Lock()

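    def subscribe(self, topic: str, handler: Callable) -> None:
        """Give the handler its own queue and start a daemon consumer thread for it."""
        queue: deque = deque()
        with self._lock:
            self._topics[topic].append(queue)
            self._handlers[topic].append(handler)
        threading.Thread(
            target=self._consume_loop, args=(queue, handler), daemon=True
        ).start()

    def publish(self, topic: str, event: dict) -> None:
        """Fan the event out to every subscriber queue on the topic."""
        with self._lock:
            for queue in self._topics[topic]:
                queue.append(event)

    def _consume_loop(self, queue: deque, handler: Callable) -> None:
        """Busy-polls the queue. In production Kafka consumers use long-polling
        with configurable fetch.min.bytes and fetch.max.wait.ms for efficiency."""
        while True:
            if queue:
                event = queue.popleft()
                try:
                    handler(event)
                except Exception as e:
                    # In production: route to dead-letter queue, alert, do NOT silently swallow.
                    # Silent swallowing here is how you get the 'orders piling up unprocessed' nightmare.
                    print(f"[EventBus] CONSUMER ERROR, routing to DLQ: {e}")
            else:
                time.sleep(0.01)  # Back off when idle; don't spin-burn CPU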


# --- CONSUMERS ---
# Each of these would be a separate service in production.
# They share zero state. They don't know about each other.

class InventoryConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Inventory] Reserving stock for order {event['order_id']} "
              f"— products: {event['product_ids']}")
        # In reality: update stock counts, trigger reorder if threshold hit

class FraudDetectionConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Fraud] Running fraud score for customer {event['customer_id']} "
              f"— amount: ${event['total_amount']}")
        # In reality: call ML model, flag order if score > threshold, publish FraudFlaggedEvent

class ShippingConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Shipping] Creating shipment manifest for order {event['order_id']}")
        # In reality: call 3PL API, generate label, store tracking number

class EmailNotificationConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Email] Queueing confirmation email to customer {event['customer_id']} "
              f"for order {event['order_id']}")
        # In reality: render template, call SES/SendGrid, record sent timestamp


# --- PRODUCER ---

class CheckoutService:
    """The checkout service knows about the event bus and the event schema.
    It does NOT know about inventory, fraud, shipping, or email.
    Adding a new downstream consumer requires ZERO changes here."""

    def __init__(self, event_bus: InMemoryEventBus):
        self.event_bus = event_bus

    def complete_checkout(self, order_id: str, customer_id: str, product_ids: list[str], total: float) -> dict:
        # Payment authorisation would happen here synchronously BEFORE this point.
        # EDA handles post-payment side effects — not the payment itself.
        event = OrderConfirmedEvent(
            event_id=f"evt-{order_id}-{int(time.time())}",
            order_id=order_id,
            customer_id=customer_id,
            product_ids=product_ids,
            total_amount=total,
            occurred_at=datetime.utcnow().isoformat() + "Z"
        )

        self.event_bus.publish("order.confirmed", asdict(event))

        # Returns IMMEDIATELY — doesn't wait for inventory, fraud, or shipping.
        # This is your sub-100ms checkout response time.
        return {"status": "confirmed", "order_id": order_id}
Never Do This: Publish Events Without Monitoring Consumer Lag
Consumer lag is the silent killer in EDA. Your producer is healthy, your event bus is healthy, but your consumer group fell behind 3 hours ago because of a bad deployment. In Kafka: alert on kafka.consumer.group.lag > 1000 per partition. In SQS: alert on ApproximateNumberOfMessagesVisible and ApproximateAgeOfOldestMessage, because a stalled consumer leaves messages sitting visible in the queue. I've seen teams discover six-figure inventory discrepancies because nobody set this alert. Set it on day one, before your first consumer hits production.
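Lag itself is simple arithmetic: for each partition, the log end offset minus the consumer group's committed offset. The sketch below works on plain dictionaries of hypothetical offsets rather than a real Kafka client, and it assumes a partition with no committed offset is read from offset 0.

```python
def partition_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Lag per partition = log end offset - committed offset (never negative)."""
    return {
        partition: max(0, end - committed.get(partition, 0))
        for partition, end in end_offsets.items()
    }


def partitions_over_threshold(lag: dict[int, int], threshold: int = 1000) -> list[int]:
    """Partitions whose lag should page someone."""
    return sorted(p for p, behind in lag.items() if behind > threshold)


# Hypothetical snapshot: partition 2's consumer stalled after a bad deploy.
end_offsets = {0: 50_000, 1: 48_200, 2: 51_300}
committed = {0: 49_990, 1: 48_200, 2: 12_000}

lag = partition_lag(end_offsets, committed)
alerts = partitions_over_threshold(lag)
```

Here `lag` is `{0: 10, 1: 0, 2: 39300}` and `alerts` is `[2]`. Alerting per partition matters: averaged across the group, one stalled partition can hide behind healthy ones.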
Production Insight
Consumer lag is the silent killer—your dashboards show green while orders pile up unprocessed.
A misconfigured Kafka consumer group caused a 45-minute order processing backlog at a logistics company.
We added PagerDuty alerts on consumer lag > 1000 messages per partition; that night, we caught the next incident before customers did.
Key Takeaway
EDA decouples producers from consumers but couples you to your monitoring.
If you can't monitor consumer lag, you're flying blind.
Use DLQ with alerts—don't let failed events disappear silently.
Synchronous or Event-Driven?
If: Consumer needs immediate confirmation (e.g., payment auth)
Use: Synchronous call (REST/gRPC). EDA's eventual consistency will cause business problems.
If: Multiple independent systems need to react to the same event
Use: Event-driven. Publish once, consume anywhere. New consumers subscribe without producer changes.
If: You cannot afford to lose or delay events (e.g., order shipping)
Use: EDA is fine, but you must monitor consumer lag and DLQ depth. No exception.

Saga Pattern: Coordinating Distributed Workflows Without the Pain

When you split a transaction across microservices, ACID goes away. The saga pattern is the answer: a sequence of local transactions where each step publishes an event that triggers the next step. If a step fails, the saga runs compensating transactions to undo the previous steps. But compensations fail too, and that's where the real pain begins.

There are two flavours: choreography and orchestration. Choreography: each service emits events and listens for others. Simple but hard to trace—a five-step saga means five events, five consumers, and you need to correlate them manually. Orchestration: a central coordinator tells each service what to do and handles failures. More code upfront but you get a single place to debug and monitor.
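Choreography is easier to judge once you see how the flow is spread across handlers. This is a minimal single-process sketch with a hypothetical `Bus` class and made-up event names; in a real system each handler would live in a different service and the bus would be a broker.

```python
from collections import defaultdict, deque


class Bus:
    """Tiny synchronous event bus that records every event it delivers."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log: list[str] = []
        self._pending: deque = deque()

    def on(self, event_type: str, handler) -> None:
        self.handlers[event_type].append(handler)

    def emit(self, event_type: str, payload: dict) -> None:
        self._pending.append((event_type, payload))

    def run(self) -> None:
        while self._pending:
            event_type, payload = self._pending.popleft()
            self.log.append(event_type)
            for handler in self.handlers[event_type]:
                handler(payload)


def wire_services(bus: Bus, payment_ok: bool) -> None:
    """No coordinator: each service only knows which events it reacts to."""
    # Inventory reserves stock when an order is placed.
    bus.on("OrderPlaced", lambda order: bus.emit("StockReserved", order))
    # Payment charges once stock is reserved, and reports the outcome.
    bus.on(
        "StockReserved",
        lambda order: bus.emit("PaymentCharged" if payment_ok else "PaymentFailed", order),
    )
    # Shipping reacts to a successful charge.
    bus.on("PaymentCharged", lambda order: bus.emit("ShipmentCreated", order))
    # Compensation: inventory releases stock when payment fails.
    bus.on("PaymentFailed", lambda order: bus.emit("StockReleased", order))


def run_saga(payment_ok: bool) -> list[str]:
    bus = Bus()
    wire_services(bus, payment_ok)
    bus.emit("OrderPlaced", {"order_id": "ORD-1"})
    bus.run()
    return bus.log
```

`run_saga(True)` yields `OrderPlaced, StockReserved, PaymentCharged, ShipmentCreated`; `run_saga(False)` yields `OrderPlaced, StockReserved, PaymentFailed, StockReleased`. Nothing in the code states the overall workflow in one place, which is exactly the tracing problem described above.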

The honest truth: sagas add significant operational complexity. Every compensation transaction must be idempotent and retriable. You need dead-letter queues for non-recoverable failures. You need manual remediation runbooks for when compensations can't recover automatically. Only use sagas when you absolutely must split a transaction across service boundaries. If you can keep the transaction within a single service (or use a modular monolith), do that instead.
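"Retriable" has a concrete shape. The sketch below retries a compensation that was rate limited, using exponential backoff with full jitter, and hands the payload to a dead-letter list once attempts run out. `RateLimited` is a hypothetical exception standing in for whatever your client raises on HTTP 429, and the attempt count and delays are illustrative.

```python
import random
import time
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error raised by a client when the service answers 429."""


def compensate_with_retry(
    compensate: Callable[[dict], None],
    payload: dict,
    dead_letters: list,
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the compensation ran; False if it was dead-lettered."""
    for attempt in range(1, max_attempts + 1):
        try:
            compensate(payload)
            return True
        except RateLimited:
            if attempt == max_attempts:
                break
            # Full jitter: wait a random time up to base_delay * 2^(attempt - 1),
            # so retries from many sagas do not arrive in lockstep.
            sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
    dead_letters.append(payload)  # in production: publish to a DLQ and alert
    return False
```

Only `RateLimited` is retried here; any other exception propagates to the caller, on the assumption that it is not transient. Because a retry can land after a request that actually succeeded, this only works safely when the compensation is idempotent.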

OrderSagaOrchestrator.py (Python)
# io.thecodeforge — System Design tutorial

# Scenario: Order processing saga using an orchestrator.
# The orchestrator coordinates three microservices: Inventory, Payment, Shipping.
# If any step fails, it runs compensations for all completed steps.

from dataclasses import dataclass, asdict
from typing import Optional, Callable
from enum import Enum
import json
import time

class SagaStepStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"
    COMPENSATED = "compensated"

@dataclass
class SagaStep:
    name: str
    action: Callable
    compensate: Callable
    status: SagaStepStatus = SagaStepStatus.PENDING
    output: Optional[dict] = None
    error: Optional[str] = None

@dataclass
class SagaContext:
    order_id: str
    customer_id: str
    products: list[dict]
    payment_amount: float
    # Store outputs of each step for potential compensation
    inventory_reservation_id: Optional[str] = None
    payment_transaction_id: Optional[str] = None
    shipping_label_id: Optional[str] = None


class SagaOrchestrator:
    """Central coordinator for the order processing saga.
    In production, this would be a stateful service using a database or event store
    to persist saga state so it can survive restarts."""

    def __init__(self):
        self._steps: list[SagaStep] = []
        self._context: Optional[SagaContext] = None

    def add_step(self, name: str, action: Callable, compensate: Callable) -> None:
        self._steps.append(SagaStep(name=name, action=action, compensate=compensate))

    def execute(self, context: SagaContext) -> dict:
        self._context = context
        completed_steps = []

        print(f"[Saga] Starting saga for order {context.order_id}")

        try:
            for step in self._steps:
                print(f"[Saga] Executing step: {step.name}")
                # Call the action with context; action returns a dict with output
                result = step.action(self._context)
                step.status = SagaStepStatus.COMPLETED
                step.output = result
                completed_steps.append(step)
                print(f"[Saga] Step {step.name} completed: {result}")

            print(f"[Saga] All steps succeeded for order {context.order_id}")
            return {"status": "success", "order_id": context.order_id}

        except Exception as e:
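            # `step` is still bound to the step whose action raised.
            step.status = SagaStepStatus.FAILED
            step.error = str(e)
            print(f"[Saga] Step {step.name} failed: {e}")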
            # Compensate completed steps in reverse order
            for step in reversed(completed_steps):
                print(f"[Saga] Compensating step: {step.name}")
                try:
                    step.compensate(self._context, step.output)
                    step.status = SagaStepStatus.COMPENSATED
                except Exception as comp_err:
                    # Non-recoverable compensation failure—requires manual intervention
                    print(f"[Saga] COMPENSATION FAILED for step {step.name}: {comp_err}")
                    # In production: publish to DLQ, alert ops
                    step.error = str(comp_err)
            return {"status": "failed", "order_id": context.order_id, "error": str(e)}


# --- Dummy microservice actions (simulate actual service calls) ---

def reserve_inventory(ctx: SagaContext) -> dict:
    """Call inventory service. Can raise exception on failure."""
    print(f"[Inventory] Reserving stock for order {ctx.order_id}")
    # Simulate success
    ctx.inventory_reservation_id = "INV-RES-12345"
    return {"reservation_id": ctx.inventory_reservation_id}

def compensate_inventory(ctx: SagaContext, step_output: dict) -> None:
    """Cancel inventory reservation."""
    print(f"[Inventory] Compensation: releasing reservation {step_output['reservation_id']}")
    # Simulate success

def process_payment(ctx: SagaContext) -> dict:
    """Call payment gateway. Can raise exception."""
    print(f"[Payment] Charging ${ctx.payment_amount} for order {ctx.order_id}")
    # Simulate success
    ctx.payment_transaction_id = "PAY-TXN-67890"
    return {"transaction_id": ctx.payment_transaction_id}

def compensate_payment(ctx: SagaContext, step_output: dict) -> None:
    """Issue refund."""
    print(f"[Payment] Compensation: refunding transaction {step_output['transaction_id']}")
    # Simulate success

def create_shipment(ctx: SagaContext) -> dict:
    """Create shipment with shipping provider."""
    print(f"[Shipping] Creating shipment for order {ctx.order_id}")
    # Simulate success
    ctx.shipping_label_id = "SHP-LBL-11111"
    return {"label_id": ctx.shipping_label_id}

def compensate_shipment(ctx: SagaContext, step_output: dict) -> None:
    """Cancel shipment."""
    print(f"[Shipping] Compensating: cancelling label {step_output['label_id']}")
    # Simulate success


# --- Wire it up ---

if __name__ == "__main__":
    orchestrator = SagaOrchestrator()

    # Define saga steps in order
    orchestrator.add_step("Reserve Inventory", reserve_inventory, compensate_inventory)
    orchestrator.add_step("Process Payment", process_payment, compensate_payment)
    orchestrator.add_step("Create Shipment", create_shipment, compensate_shipment)

    ctx = SagaContext(
        order_id="ORD-20241105-089",
        customer_id="CUST-321",
        products=[{"sku": "SKU-LAPTOP", "qty": 1}],
        payment_amount=1299.99
    )

    result = orchestrator.execute(ctx)
    print(f"\n[Saga] Final result: {json.dumps(result, indent=2)}")
The Non-Recoverable Saga Trap
If a compensation fails after you've already started rolling back, you're in a non-recoverable state—partial rollback, no automated recovery path. Mitigation: make all compensations idempotent and retry with exponential backoff. If retries exhaust, route to a dead-letter queue and alert ops immediately. Build a manual remediation dashboard. Never let a failed compensation disappear silently; it becomes a financial reconciliation problem later.
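The mitigation above (idempotent compensation, exponential backoff, then a dead-letter queue) can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual retry logic: `RateLimitedError`, `compensate_with_retry`, and the `dead_letter` callback are hypothetical names, and the sketch assumes the service client surfaces an HTTP 429 as an exception carrying an optional retry-after hint.

```python
import random
import time
from typing import Callable, Optional


class RateLimitedError(Exception):
    """Hypothetical error a service client raises on HTTP 429."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after


def compensate_with_retry(
    compensate: Callable[[], None],
    dead_letter: Callable[[Exception], None],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run an idempotent compensation, retrying with exponential backoff.

    Returns True on success. After max_attempts failures the last error is
    handed to dead_letter (e.g. publish to a DLQ and alert) and False is returned.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            compensate()
            return True
        except Exception as err:  # demo only: catch narrower errors in real code
            last_error = err
            if attempt == max_attempts - 1:
                break
            # Exponential backoff with full jitter, capped at max_delay
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            # Honour the server's retry-after hint when it asks for longer
            if isinstance(err, RateLimitedError) and err.retry_after:
                delay = max(delay, err.retry_after)
            sleep(delay)
    dead_letter(last_error)
    return False


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_refund() -> None:
        # Simulated refund: rate-limited twice, then succeeds
        calls["n"] += 1
        if calls["n"] < 3:
            raise RateLimitedError(retry_after=0.01)

    ok = compensate_with_retry(flaky_refund, dead_letter=print, base_delay=0.01)
    print(f"refund compensated={ok} after {calls['n']} attempts")
```

The retry is only safe because the compensation is assumed to be idempotent: a refund that is retried after a timeout must not refund twice. Injecting `sleep` keeps the function testable without real delays.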
Production Insight
Saga compensation failures are silent unless you monitor DLQ depth.
I've seen a team lose $50k in refunds because their compensation retry loop had a bug that caused it to stop retrying after the first failure.
They only noticed when customers started complaining a week later—alert on DLQ depth on day one.
Key Takeaway
Sagas are for when you have no other choice.
Always test compensation under failure scenarios.
Monitor dead-letter queue depth for all saga topics.
Saga or Single Transaction?
If: Transaction fits within one database
Then: Use local transaction (ACID). Simpler, safer, faster.
If: Transaction must span multiple services and you need eventual consistency
Then: Use saga with orchestration. Choreography is harder to debug.
If: Compensation is not idempotent or cannot be made idempotent
Then: Don't use saga. Rethink your service boundaries to keep the transaction local.

CQRS and the Read/Write Split: When Your Query Patterns Are Killing Your Writes

CQRS — Command Query Responsibility Segregation — is one of the most cargo-culted patterns in the industry. Teams add it because it sounds senior-level. Here's the honest version: you need CQRS when your read and write access patterns are so different that a single model optimised for both is actually optimised for neither.

Consider a product catalogue. Writes are rare, structured, and come from an admin tool — one product update at a time, full validation, transactional integrity. Reads are constant, require different field combinations per client (mobile wants a summary, web wants full detail, search wants keywords only), and need to be fast under high concurrency. A single relational model with indexes for everything is a compromise that serves nobody well. Your write path trips over read indexes. Your read path joins five tables to answer a query that could be a single document lookup.

CQRS splits this: the write side (Commands) uses a normalised transactional model. The read side (Queries) uses one or more denormalised read models, potentially different databases entirely — PostgreSQL for writes, Elasticsearch for search, Redis for session data. When a command succeeds, you publish an event (or a projection job runs) to update the read models. Those read models are eventually consistent — they lag behind by milliseconds to seconds.

That 'eventually consistent' part is where teams get burned. I've seen a fintech ship CQRS across their account balance domain. Write side was Postgres. Read side was a Redis projection. A deployment bug caused the projection to stop updating. For 40 minutes, customers saw stale balances. Nobody noticed until a customer tried to spend money the read model said they had but the write model had already debited. Don't use CQRS for anything where reading stale data causes a financial or safety consequence. It's a pattern for scale, not for correctness.

CQRSProductCatalogue.pyPYTHON
# io.thecodeforge — System Design tutorial

# Scenario: Product catalogue with very different read/write patterns.
# Write side: admin updates one product at a time, needs validation + atomicity.
# Read side: 10,000 RPS product page loads, need sub-10ms response.
# A single DB model with indexes for both is a bottleneck in both directions.

from dataclasses import dataclass, field
from typing import Optional
import json
import time

# ─── WRITE SIDE (Command Model) ────────────────────────────────────────────
# Normalised. Strict validation. Transactional. Think PostgreSQL.

@dataclass
class ProductWriteModel:
    """The authoritative source of truth. Commands land here.
    This model is optimised for integrity, not query speed."""
    product_id: str
    sku: str
    name: str
    description: str
    price_cents: int       # Store money as integers — never floats. Ever.
    stock_count: int
    category_id: str
    brand_id: str
    is_active: bool = True
    version: int = 1       # Optimistic locking — detect concurrent modification

@dataclass
class UpdateProductPriceCommand:
    product_id: str
    new_price_cents: int
    updated_by: str        # Always audit who changed what in a write model
    reason: str            # Force callers to explain why — reduces lazy changes

@dataclass
class ProductPriceUpdatedEvent:
    product_id: str
    old_price_cents: int
    new_price_cents: int
    occurred_at: float


class ProductCommandHandler:
    """Handles all writes. Returns an event that downstream projections consume
    to update read models. The command handler does NOT update read models directly —
    that's the projection layer's job."""

    def __init__(self, write_store: dict):
        # In production: PostgreSQL with row-level locking
        self.write_store = write_store
        self.event_log: list[ProductPriceUpdatedEvent] = []

    def handle(self, command: UpdateProductPriceCommand) -> ProductPriceUpdatedEvent:
        product = self.write_store.get(command.product_id)
        if not product:
            raise ValueError(f"Product {command.product_id} not found")
        old_price = product.price_cents
        product.price_cents = command.new_price_cents
        product.version += 1
        # Store updated product (in production: UPDATE with WHERE version = old_version)
        self.write_store[command.product_id] = product

        event = ProductPriceUpdatedEvent(
            product_id=command.product_id,
            old_price_cents=old_price,
            new_price_cents=command.new_price_cents,
            occurred_at=time.time()
        )
        self.event_log.append(event)
        return event


# ─── READ SIDE (Projection / Query Model) ──────────────────────────────────
# Denormalised. Pre-joined. Optimised for fast lookups. Think Elasticsearch or Redis.

class ProductReadProjection:
    """Consumes events from the command side and builds a denormalised read model.
    This is eventually consistent. In production, this would run as a separate
    service or job that reads from an event queue."""

    def __init__(self):
        # In-memory dictionary simulating a fast key-value store or search index
        self._store: dict[str, dict] = {}

    def apply(self, event: ProductPriceUpdatedEvent):
        """Update the read model based on the event.
        Idempotent: applying the same event twice should yield same result."""
        # In production: check event idempotency (e.g., event_id dedup)
        if event.product_id not in self._store:
            # Initialise with default fields; in reality, you'd join with other data
            self._store[event.product_id] = {
                "product_id": event.product_id,
                "current_price_cents": event.new_price_cents,
                "last_updated": event.occurred_at
            }
        else:
            self._store[event.product_id]["current_price_cents"] = event.new_price_cents
            self._store[event.product_id]["last_updated"] = event.occurred_at

    def get_product(self, product_id: str) -> Optional[dict]:
        """Query side — instant, no joins, no locks."""
        return self._store.get(product_id)

    def get_all_prices(self) -> dict:
        """Bulk read: returns all current prices. Useful for caching."""
        return {pid: data["current_price_cents"] for pid, data in self._store.items()}


# ─── WIRING ─────────────────────────────────────────────────────────────────
# In a real system, the event would be published to a message broker,
# and the projection would consume asynchronously. Here we do it inline for demo.

if __name__ == "__main__":
    # Seed write store
    write_store = {
        "PROD-001": ProductWriteModel(
            product_id="PROD-001", sku="SKU-123", name="Widget",
            description="A widget", price_cents=1999, stock_count=100,
            category_id="CAT-1", brand_id="BRAND-A"
        )
    }

    command_handler = ProductCommandHandler(write_store)
    projection = ProductReadProjection()

    # Simulate a command
    cmd = UpdateProductPriceCommand(
        product_id="PROD-001",
        new_price_cents=1499,
        updated_by="admin@example.com",
        reason="Seasonal discount"
    )
    event = command_handler.handle(cmd)
    print(f"Command executed: price changed from ${event.old_price_cents/100:.2f} to ${event.new_price_cents/100:.2f}")

    # Apply event to projection
    projection.apply(event)
    print(f"Read model shows price: ${projection.get_product('PROD-001')['current_price_cents']/100:.2f}")

    # Consistency note: in a real async system, there's a delay.
    print("\nNote: In production, the read model update would be asynchronous.")
    print("The command handler does NOT wait for the projection to finish.")
CQRS + Eventually Consistent Balances: A Fintech Disaster Waiting to Happen
When the read model lags behind the write model in a financial context, users see outdated balances. If they act on that stale data, you have a reconciliation problem. Rule: never use CQRS for read models that drive financial decisions (e.g., 'Can I withdraw this amount?'). Always read from the write model for those queries, or use a synchronously updated materialised view within the same transaction.
Production Insight
CQRS projection lag is silent until a customer sees a wrong balance.
In the fintech case, the projection stopped for 40 minutes — zero alarms.
We added a synthetic heartbeat: every 10 seconds, the write side publishes a 'heartbeat' event; if the read model doesn't see it within 30 seconds, it triggers a critical alert.
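A rough sketch of that heartbeat check, under stated assumptions: `ProjectionHeartbeatMonitor` and its method names are hypothetical, the write side is assumed to stamp each heartbeat event with its publish time, and the projection is assumed to call `on_heartbeat` whenever it applies one. The alerting hook itself is left out.

```python
import time
from typing import Callable, Optional


class ProjectionHeartbeatMonitor:
    """Tracks the newest heartbeat event the read-side projection has applied."""

    def __init__(self, max_staleness_s: float = 30.0,
                 clock: Callable[[], float] = time.time):
        self.max_staleness_s = max_staleness_s
        self._clock = clock
        self._last_seen: Optional[float] = None  # publish time of newest heartbeat

    def on_heartbeat(self, published_at: float) -> None:
        """Called by the projection whenever it applies a heartbeat event."""
        if self._last_seen is None or published_at > self._last_seen:
            self._last_seen = published_at

    def staleness(self) -> Optional[float]:
        """Seconds since the newest applied heartbeat was published (None if none yet)."""
        if self._last_seen is None:
            return None
        return self._clock() - self._last_seen

    def is_stale(self) -> bool:
        lag = self.staleness()
        # A projection that has never applied a heartbeat is treated as stale too
        return lag is None or lag > self.max_staleness_s


if __name__ == "__main__":
    now = [1000.0]  # fake clock so the demo runs instantly
    monitor = ProjectionHeartbeatMonitor(max_staleness_s=30.0, clock=lambda: now[0])
    monitor.on_heartbeat(published_at=1000.0)
    now[0] = 1010.0
    print("stale after 10s:", monitor.is_stale())
    now[0] = 1045.0
    print("stale after 45s:", monitor.is_stale())
```

The point of measuring from the heartbeat's publish time, rather than from when the projection last ran, is that a projection which is alive but stuck on old events still shows up as stale.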
Key Takeaway
CQRS is a scale pattern, not a correctness pattern.
If stale reads cause real-world damage, don't use eventual consistency.
Start with a materialised view or read replica before committing to full CQRS.
CQRS or Single Model?
If: Read queries require complex joins or different representations per client
Then: Consider CQRS with denormalised read models (Elasticsearch, materialised views).
If: Stale data could cause financial or safety issues
Then: Do NOT use CQRS with eventual consistency. Keep the single transactional model.
If: Read throughput is very high and writes are low, but both use the same table
Then: Optimise with read replicas or materialised views first. Only split to CQRS if those fail.

API Gateway and Backend for Frontend (BFF): Your System's Front Door

As your architecture grows beyond a single service, you need a single entry point for clients. That's the API gateway. It handles authentication, rate limiting, request routing, and response aggregation. The alternative — letting each client talk directly to microservices — means every client has to know service topology, handle multiple authentication schemes, and reimplement retry logic. It's chaos.

But there's a nuance. A generic API gateway works when all clients (web, mobile, third-party) need roughly the same API. When they don't, you need the Backend for Frontend (BFF) pattern: a separate gateway per client type, each exposing exactly the API that client needs. Mobile doesn't need the same data as web. A third-party API needs versioning and rate limits. Each BFF owns its own aggregation logic, and you avoid the 'one gateway to serve them all' bloat.

The trade-off: more deployable units, more duplication, and tighter coupling between each BFF and its client. But for teams with multiple distinct products (web app, mobile app, public API), it's often the right call. Without it, your API gateway becomes a monolith in its own right — a thousand routes, every endpoint dependent on every other, and a single deploy takes down every client.

bff_pattern_example.jsJAVASCRIPT
// io.thecodeforge — System Design tutorial

// Simplified example: BFF for mobile vs web

class MobileBFF {
  constructor(orderService, userService, productService) {
    this.orderService = orderService;
    this.userService = userService;
    this.productService = productService;
  }

  // Mobile needs a lightweight order summary
  async getOrderSummary(orderId) {
    const order = await this.orderService.getOrder(orderId);
    // Only return what mobile needs — no full product details
    return {
      orderId: order.id,
      status: order.status,
      total: order.total,
      itemCount: order.items.length,
      lastUpdated: order.updatedAt
    };
  }
}

class WebBFF {
  constructor(orderService, userService, productService) {
    this.orderService = orderService;
    this.userService = userService;
    this.productService = productService;
  }

  // Web needs full details for product pages
  async getOrderDetail(orderId) {
    const order = await this.orderService.getOrder(orderId);
    // Enhance with product details
    const productDetails = await Promise.all(
      order.items.map(item => this.productService.getProduct(item.productId))
    );
    return {
      ...order,
      products: productDetails,
      user: await this.userService.getUser(order.userId)
    };
  }
}

// In production, these BFFs run as separate services,
// each with their own rate limits, auth, and circuit breakers.
How to Think About API Gateways
  • Single generic gateway works for one client type or similar clients.
  • BFFs prevent one client's API changes from affecting others.
  • Each BFF duplicates some concern (auth, logging) — that's okay if it prevents coupling.
  • Without a gateway, every client directly depends on your internal service topology.
  • Start with a generic gateway; extract BFFs only when client needs diverge.
Production Insight
The shared API gateway anti-pattern: one team's new endpoint requires re-deploy of the whole gateway, affecting all other teams.
BFFs solve this: each team deploys their own gateway alongside their own services.
Downside: 2x-3x more gateway instances to manage. Kubernetes + Istio makes this manageable.
Key Takeaway
An API gateway is essential once you have more than one service.
BFFs prevent client-specific logic from bloating the gateway.
Start with a single gateway; extract BFFs when your clients demand different data shapes.
API Gateway or BFF?
If: Single client type (e.g., only a web app)
Then: One generic API gateway. Simple, centralized, less operational overhead.
If: Multiple client types with different data requirements
Then: BFF per client. Mobile gets a compact response; web gets full detail.
If: Third-party API with strict versioning and rate limits
Then: Dedicated third-party BFF. Don't mix internal and external APIs in the same gateway.

Scaling Databases: Why Your Indexes Are Lying to You

You slapped an index on that JOIN column and called it a day. Now your production DB is swapping like it's 1999. Indexes aren't a silver bullet—they're a trade-off. Every index you add speeds up reads but slows writes and eats RAM. The real fight is understanding your access patterns before you touch a DDL statement.

Start with your slow query log. I don't care about your ORM's generated SQL; capture the raw queries hitting Postgres or MySQL. Sort by total time, not just latency. A query that runs 10ms but fires 10,000 times a second is your enemy. Build composite indexes around the columns you actually filter on: equality predicates first, then the range or sort column. The order of conditions in your WHERE clause doesn't matter to the planner; the order of columns in the index does. That (user_id, created_at) index? It only helps queries that constrain user_id, because a B-tree can't efficiently skip its leading column.

But here's the part teams overlook: partial indexes. If you only ever query for active users, don't index the whole table. A partial index on users WHERE status = 'active' contains only the matching rows, so if 20% of your users are active it is roughly 80% smaller, cheaper to maintain, and more likely to stay in memory. In production, that can be the difference between a 200ms query and a 2ms one. Measure before and after. If the index doesn't cut latency by at least 90%, drop it.

DatabasePartialIndexCheck.pyPYTHON
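# io.thecodeforge — system-design tutorial

import psycopg2
import time

conn = psycopg2.connect("dbname=production_analytics")
# CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
# and psycopg2 opens one implicitly unless autocommit is enabled
conn.autocommit = True
cur = conn.cursor()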

# Log slow query before index
start = time.perf_counter()
cur.execute("""
    SELECT * FROM users 
    WHERE status = 'active' AND last_login > NOW() - INTERVAL '7 days'
""")
duration = time.perf_counter() - start
print(f"Without partial index: {duration*1000:.2f}ms")

# Create partial index (should only happen after review in prod)
cur.execute("""
    CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_active_users_last_login
    ON users (last_login)
    WHERE status = 'active'
""")

# Re-run query after index
start = time.perf_counter()
cur.execute("""
    SELECT * FROM users 
    WHERE status = 'active' AND last_login > NOW() - INTERVAL '7 days'
""")
duration = time.perf_counter() - start
print(f"With partial index: {duration*1000:.2f}ms")

conn.close()
Output
Without partial index: 342.10ms
With partial index: 8.47ms
Production Trap: Indexing the Whole Table
Adding an index on every column you might query is a fast track to write amplification and index bloat. A full-table index on a 50GB table can add several gigabytes of storage, and every extra index adds work to each insert and update. Partial indexes are your escape hatch, but only if your queries can be scoped to a constant filter.
Key Takeaway
Every index is a write tax. Only pay it if your query latency is unacceptable after measuring. Partial indexes beat full-table indexes whenever your queries share a constant filter.
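Wall-clock timing alone can mislead, because the second run of a query also benefits from warm caches. A complementary check is to ask the planner whether it uses the partial index at all. The sketch below swaps in SQLite (via Python's standard `sqlite3` module) purely so it runs without a server; the table contents are made up, and PostgreSQL's `EXPLAIN` output looks different, though the matching rule is similar: the query's WHERE clause has to imply the index predicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        status TEXT NOT NULL,
        last_login INTEGER NOT NULL  -- Unix timestamp
    )
""")
# Synthetic data: one user in five is active
conn.executemany(
    "INSERT INTO users (status, last_login) VALUES (?, ?)",
    [("active" if i % 5 == 0 else "inactive", i) for i in range(1000)],
)
conn.execute("""
    CREATE INDEX idx_active_users_last_login
    ON users (last_login)
    WHERE status = 'active'
""")


def plan_for(sql: str) -> str:
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the detail text


# Repeats the index predicate, so the partial index is eligible
matching = plan_for(
    "SELECT id FROM users WHERE status = 'active' AND last_login > 900"
)
# Omits the predicate, so the partial index cannot be used
non_matching = plan_for("SELECT id FROM users WHERE last_login > 900")

print("with predicate:   ", matching)
print("without predicate:", non_matching)
```

The second query is the trap: drop the `status = 'active'` condition from a query and it silently falls back to a full scan, even though the index still exists.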

Consistency vs. Availability: The Trade-Off That Kills Late-Night Pager Duty

You think you need strong consistency because the business says 'data must be accurate.' I promise you, what they actually want is 'data must be accurate enough that nobody complains on Monday morning.' There's a difference, and choosing strong consistency when eventual would work is why your system falls over under load.

Here's the reality: CAP theorem isn't academic. Every time a network partition cuts a node off from the rest of a distributed system, you're choosing between returning a possibly stale response or returning an error. Your users will forgive a stale cart total for 2 seconds. They will not forgive a 500 error during Black Friday. The question isn't 'do I want consistency?'; it's 'how stale can my reads be before the business screams?'

If you're building a credit card transaction system, sure, go with a consensus algorithm and take the latency hit. But for a product catalog? Use read replicas with a 1-second replication lag. Your cache invalidation strategy is more important than your consistency model. Set TTLs aggressively, and use version vectors to detect conflicts at write time. Your monitoring should alert when replication lag exceeds 5 seconds, not when a read returns data from 200ms ago.

ReadReplicaLagMonitor.pyPYTHON
# io.thecodeforge — system-design tutorial

import redis
import time
from datetime import datetime, timezone

r = redis.Redis(host='replica-1.internal', port=6379, decode_responses=True)

# Track replication lag by comparing a timestamp written to primary
primary_write_key = "system:last_sync_timestamp"

# Simulate checking lag every 10 seconds
for _ in range(3):
    replica_timestamp = r.get(primary_write_key)
    if not replica_timestamp:
        print("Replica not synced yet — skip check")
        time.sleep(10)
        continue
    
    # Assumes the primary writes an ISO-8601 UTC timestamp to this key
    last_sync = datetime.fromisoformat(replica_timestamp)
    if last_sync.tzinfo is None:
        last_sync = last_sync.replace(tzinfo=timezone.utc)
    lag_seconds = (datetime.now(timezone.utc) - last_sync).total_seconds()
    
    if lag_seconds > 5:
        print(f"ALERT: Replication lag {lag_seconds:.2f}s — exceeds 5s threshold")
    else:
        print(f"Lag nominal: {lag_seconds:.2f}s")
    
    time.sleep(10)

Output
Lag nominal: 1.23s
Lag nominal: 2.89s
ALERT: Replication lag 6.42s — exceeds 5s threshold
Senior Shortcut: Measure What Breaks, Not What's Perfect
Don't aim for linearizable consistency everywhere. Measure the business impact of a stale read. If nobody complains when a user sees a 5-second-old product price, your system is fine. Invest your engineering time in observability and graceful degradation, not in distributed consensus protocols that add 200ms of latency.
Key Takeaway
Strong consistency is a feature, not a default. If you can tolerate seconds of staleness, use read replicas and caching. Reserve consensus protocols for financial transactions only.
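The section mentions version vectors for detecting conflicts at write time. A minimal sketch of the comparison, assuming each replica keeps a counter of the writes it has accepted (the function names and replica ids are illustrative, not from any particular database):

```python
from typing import Dict

VersionVector = Dict[str, int]  # replica id -> number of writes seen from it


def compare(a: VersionVector, b: VersionVector) -> str:
    """Return 'equal', 'a_newer', 'b_newer', or 'conflict'."""
    keys = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in keys)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in keys)
    if a_ahead and b_ahead:
        return "conflict"  # concurrent writes: neither history contains the other
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


def merge(a: VersionVector, b: VersionVector) -> VersionVector:
    """Element-wise maximum: the vector to store after a conflict is resolved."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}


if __name__ == "__main__":
    stored = {"replica-1": 3, "replica-2": 1}
    incoming_ok = {"replica-1": 4, "replica-2": 1}    # built on top of stored
    incoming_fork = {"replica-1": 2, "replica-2": 2}  # written without seeing stored

    print(compare(incoming_ok, stored))
    print(compare(incoming_fork, stored))
    print(sorted(merge(incoming_fork, stored).items()))
```

A write whose vector dominates the stored one can simply overwrite it. A `conflict` result means two clients wrote concurrently, and the application has to pick a resolution policy (last-writer-wins, merge, or ask the user) instead of silently losing one update.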

Clean Architecture: Stop Wasting Time on Layers That Don't Enforce Boundaries

Why does every enterprise codebase look like a bowl of spaghetti six months in? Because you drew layers in a diagram but never encoded them in the dependency rule. Clean Architecture isn't about folders named controllers, services, repositories. That's decoration. It's about ensuring your business logic has zero compile-time dependencies on frameworks, databases, or HTTP.

The inner circles define rules. Outer circles are plugins. Your database could be PostgreSQL or a flat file — your OrderService shouldn't know or care. The moment you import Django ORM or SQLAlchemy inside your domain entity, you've lost. You've coupled your business value to a vendor. Tests become integration tests. Refactors take weeks.

Production framing: Any new hire should be able to swap the persistence layer in two days. If they can't, your architecture is lying to you. Enforce dependency inversion with interfaces, not hope. Package by component, not layer. Your build tool should fail if an inner circle imports an outer circle. That's not theory. That's a compile-time gate.

OrderDomain.pyPYTHON
# io.thecodeforge — system-design tutorial

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Order:
    id: str
    total: float

# Domain interface — no framework import
class OrderRepository(ABC):
    @abstractmethod
    def save(self, order: Order) -> None: ...

class PlaceOrderUseCase:
    def __init__(self, repo: OrderRepository):
        self._repo = repo

    def execute(self, order: Order) -> None:
        # Business rule: no negative totals
        if order.total < 0:
            raise ValueError("Negative total")
        self._repo.save(order)
Output
(no output: the module only defines the domain types and the use case; nothing imports a framework)
Warning:
A folder named 'services' with 5,000 lines is not Clean Architecture. It's a namespace with no guardrails.
Key Takeaway
Layers don't enforce dependency direction. Interfaces and compiler errors do.
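Python has no compiler to reject a forbidden import, so the "build should fail" gate has to be a check you run in CI. Here is one possible sketch using the standard `ast` module. The list of forbidden packages and the function name are assumptions for illustration; in a real project you would run this over every file in the domain package and exit non-zero on any hit.

```python
import ast
from typing import Iterable, List

# Hypothetical list of outer-circle packages the domain must never import
FORBIDDEN_IN_DOMAIN = ("django", "sqlalchemy", "flask", "requests")


def forbidden_imports(source: str,
                      forbidden: Iterable[str] = FORBIDDEN_IN_DOMAIN) -> List[str]:
    """Return the forbidden modules imported by a piece of Python source code."""
    banned = set(forbidden)
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports stay in-package
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Compare on the top-level package: "sqlalchemy.orm" -> "sqlalchemy"
            if module.split(".")[0] in banned:
                found.append(module)
    return found


if __name__ == "__main__":
    clean_domain = "from abc import ABC, abstractmethod\nfrom dataclasses import dataclass\n"
    leaky_domain = "from sqlalchemy.orm import Session\nimport django.db\n"

    print("clean:", forbidden_imports(clean_domain))
    print("leaky:", sorted(forbidden_imports(leaky_domain)))
```

This only catches static imports. It will not see `importlib.import_module("sqlalchemy")` or a framework object smuggled in through a function argument, so it complements the repository interface rather than replacing it.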

Common Closure Principle: Group What Changes Together, Or Pay the Merge Tax

Everyone loves microservices until they need to change three services to ship one feature. The Common Closure Principle (CCP) says: classes that change for the same reason should be in the same package. Sounds obvious. Nobody does it.

If you have to touch PaymentValidator, PaymentGateway, and PaymentAudit for every payment rule change, they belong together. Splitting them into separate packages — or worse, separate microservices — because 'they have different responsibilities' is cargo-cult separation of concerns. You're optimizing for theoretical reuse while burning actual developer hours on cross-package releases.

Production framing: When a compliance update hits on Friday, you want one package to change, one PR, one deploy. Not a distributed commit dance. CCP is the principle that tells you where your monorepo boundaries should live. Ignore it, and you'll spend your Monday mornings in merge conflict hell. The test for correct grouping: the same person makes the changes, and they happen in lockstep.

PaymentPackage.pyPYTHON
# io.thecodeforge — system-design tutorial

# All payment-related changes happen together
class PaymentValidator:
    def validate(self, amount: float) -> bool:
        return amount > 0

class PaymentGateway:
    def charge(self, amount: float) -> str:
        return f"txn_{amount}"

class PaymentAudit:
    def log(self, txn_id: str) -> None:
        print(f"Audited: {txn_id}")

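# One package, one reason to change: payment logic
# Violation: putting these in separate packages causes cross-package sync

if __name__ == "__main__":
    # Exercise the three classes together, as a payment rule change would
    amount = 49.99
    if PaymentValidator().validate(amount):
        PaymentAudit().log(PaymentGateway().charge(amount))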
Output
Audited: txn_49.99
Senior Shortcut:
Ask your team: 'Do we deploy these two classes together or separately?' If together, they belong in the same package.
Key Takeaway
Package by change reason, not by function. Your deploy cadence is the real architecture.

Common Reuse Principle: Don't Force Consumers to Carry Your Dead Weight

You know that shared utility library everyone depends on? The one with StringUtils, DateUtils, and a MySQL connection pool helper? It's a landmine. The Common Reuse Principle (CRP) says: the classes you put in a package are inseparable. If a consumer depends on one, they depend on all. That means every change to your beloved StringUtils forces a redeploy of every downstream service that just wanted the date formatter.

CRP is the counterbalance to CCP. Where CCP says 'group what changes together', CRP says 'don't group things that could be used independently'. If your PaymentValidator could be reused by a billing service but your PaymentGateway exposes a dangerous bulk_charge() method, they should not share a package. The consumer would inherit risk they never asked for.

Production framing: A package is a contract. Every public class is a promise to maintain it. Split aggressively. A package two lines long is better than a package with one dependency you can't shake. Your pip install or npm install footprint should match actual usage. Stop shipping libraries where 90% of the code is irrelevant to 90% of users.

ReuseSplit.pyPYTHON
# io.thecodeforge — system-design tutorial

# BAD: bundled together, forces transitive dependency on risky code
# (commented out so the example runs without the bundled package installed)
# from payment_package import PaymentValidator, PaymentGateway

# GOOD: split into independent packages
# payment_validator.py — safe to use anywhere
class PaymentValidator:
    def validate(self, amount: float) -> bool:
        return amount > 0

# payment_gateway.py — risky, don't force consumers to carry it
class PaymentGateway:
    def charge(self, amount: float) -> str:
        # imagine this calls an external API
        return f"txn_{amount}"

# Consumer only needs validation
validator = PaymentValidator()
print(validator.validate(99.99))
Output
True
Production Trap:
That 'commons' library with 400 classes? You've built a distributed dependency bomb. Split it today.
Key Takeaway
Every public class is a deployment commitment. Don't make consumers pay for classes they never use.

Content Delivery Networks: Stop Serving Static Assets from Your Origin

Your origin server should not be serving images, CSS, or JavaScript to end users directly; that's what CDNs are for. A CDN caches static content at edge locations close to users, which cuts latency and offloads your infrastructure. The WHY: every uncached asset request consumes bandwidth, CPU, and connections on your origin. Without a CDN, your scaling costs balloon and user-perceived latency spikes.

The HOW: set cache-control headers, pick an invalidation strategy, and enable origin shielding so cache misses from many edge locations collapse into a single request to your origin. For private or paid content, use signed URLs or token-based authentication at the edge. DNS routing alone won't cut it.

Measure cache hit ratios and set TTLs intelligently; expect 80%+ cache hits for static assets after tuning. Your CDN is not a set-it-and-forget-it tool: monitor for cache misses and purge stale content promptly.

cdn_cache_headers.pyPYTHON
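# io.thecodeforge — system-design tutorial

from flask import Flask, send_from_directory

# static_folder=None disables Flask's built-in /static route so ours is used
app = Flask(__name__, static_folder=None)

@app.route('/static/<path:filename>')
def serve_static(filename):
    # send_from_directory rejects paths that escape the directory and
    # serves binary files (images, fonts) correctly
    resp = send_from_directory('static', filename)
    # Browsers: cache for a year; only safe with fingerprinted filenames
    resp.headers['Cache-Control'] = 'public, max-age=31536000, immutable'
    # CDN edge: cache for a day (honoured by CDNs that support this header)
    resp.headers['CDN-Cache-Control'] = 'max-age=86400'
    return resp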
Output
Responses now carry cache directives. CDN caches aggressively.
Origin load drops ~90% for static assets.
Production Trap:
Setting long cache TTLs without a versioning strategy (e.g., filename hash) means stale assets persist until manual purge. Always fingerprint static filenames.
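Fingerprinting means putting a hash of the file's contents into its name, so a changed file gets a new URL and the year-long cache entry for the old one simply stops being requested. A minimal sketch, assuming your build step can rename files and rewrite references to them (the function name and the 8-character digest length are arbitrary choices):

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(filename: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the extension: app.css -> app.<hash>.css"""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


if __name__ == "__main__":
    v1 = fingerprinted_name("css/app.css", b"body { color: black; }")
    v2 = fingerprinted_name("css/app.css", b"body { color: navy; }")
    print(v1)
    print(v2)
    print("URL changed with content:", v1 != v2)
```

Because the name is derived from the bytes, redeploying an unchanged file keeps the same URL and the CDN keeps its cached copy; only files that actually changed cause cache misses.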
Key Takeaway
CDNs are load-multipliers. Cache aggressively, version all assets, and monitor cache hit ratios.
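The fingerprinting advice in the trap above can be sketched as follows. This is a minimal illustration, not a build-tool replacement: `fingerprint_name` and the 8-character digest length are assumptions made for the example, and real pipelines usually do this in the bundler.

```python
import hashlib
from pathlib import PurePosixPath

def fingerprint_name(filename: str, content: bytes, digest_len: int = 8) -> str:
    """Return e.g. 'css/app.<hash>.css', where <hash> is derived from the file content."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))

v1 = fingerprint_name("css/app.css", b"body { color: black; }")
v2 = fingerprint_name("css/app.css", b"body { color: navy; }")
print(v1)
print(v2)
print(v1 != v2)  # changed content yields a new URL, so no purge is needed
```

Because the URL changes whenever the content changes, the year-long `max-age` never serves a stale file; old URLs simply stop being referenced.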

WebSockets & Proxies: Full-Duplex Without Breaking Your Load Balancer

WebSockets give you persistent bidirectional communication, essential for real-time apps like chat or live dashboards. The WHY: HTTP polling wastes bandwidth and adds latency; a WebSocket keeps a single TCP connection open. The HOW: the client sends an HTTP Upgrade request, the server answers 101 Switching Protocols, and both sides then exchange frames over the same socket. The problem: many load balancers and proxies do not forward the hop-by-hop Upgrade and Connection headers by default, or they close connections they consider idle. Configure them to pass Upgrade: websocket and Connection: Upgrade and to keep long-lived connections open. Use sticky sessions (session affinity) or a shared pub/sub backend (e.g., Redis) so messages reach the server that holds the client's connection, including after a reconnect. Without that, users get dropped messages. Also handle backpressure: a client on a slow network makes the server's send buffer grow, so bound it. Set timeouts and heartbeat pings to detect dead connections. One server can handle thousands of concurrent sockets if you avoid blocking I/O.

websocket_proxy_config.pyPYTHON
# io.thecodeforge - system-design tutorial

import asyncio
import websockets

async def handler(ws):
    async for msg in ws:
        await ws.send(f"Echo: {msg}")

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())
Output
WebSocket server starts on port 8765.
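Proxy must pass the Upgrade header. Test the handshake through the proxy with curl -i -N -H 'Connection: Upgrade' -H 'Upgrade: websocket' -H 'Sec-WebSocket-Version: 13' -H 'Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==' http://<proxy-host>/ and expect an HTTP 101 response.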
Production Trap:
If your proxy's idle timeout is shorter than the gap between WebSocket messages, connections drop silently. Send heartbeat pings more often than the proxy idle timeout (or raise the timeout, e.g., to 300 seconds or more) and implement client-side reconnect logic.
Key Takeaway
Proxies must explicitly support WebSocket upgrade. Without sticky sessions or pub/sub, state gets lost on reconnect.
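To make the upgrade handshake concrete, here is a small sketch of how a server derives the Sec-WebSocket-Accept value it returns with the 101 response, following RFC 6455. The function name `websocket_accept` is chosen for this example; the `websockets` library does this internally, so you would only write it yourself when debugging a proxy.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header for a client's Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

# Sample key from RFC 6455
print(websocket_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

If a handshake through your proxy returns anything other than 101 with the matching accept value, the proxy is likely rewriting or dropping the Upgrade headers.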

Database Sharding: Horizontal Scaling That Saves Your Writes

When a single database can't handle write throughput, sharding splits data across multiple instances by a shard key. The WHY: vertical scaling (bigger machines) hits cost and hardware limits; sharding distributes load. The HOW: choose a shard key that distributes writes evenly, such as user ID, tenant ID, or geohash. Range-based sharding risks hot spots; hash-based sharding balances better. Each shard operates independently: queries that hit one shard are fast, while cross-shard queries need scatter-gather, which kills performance. The hard part is rebalancing when you add shards; consistent hashing minimizes data movement. Never range-shard on a monotonically increasing key like an auto-increment ID, because every new row lands on the last shard. Also handle shard discovery: clients or middleware map keys to shards via a lookup service or hash ring. Write throughput scales with shard count only when keys are evenly spread and writes stay on a single shard, so measure the gain per added shard rather than assuming it, and expect cross-shard joins to become your bottleneck.

shard_key_routing.pyPYTHON
# io.thecodeforge - system-design tutorial

import hashlib

def get_shard(user_id: str, num_shards: int) -> int:
    hash_val = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return hash_val % num_shards

# Usage
shard_id = get_shard("user_42", 16)
print(f"Shard for user_42: {shard_id}")
Output
Shard for user_42: 11
Shard key hashes evenly; add shards via consistent hashing (e.g., Ketama) to minimize rebalance.
Production Trap:
Range-sharding on auto-increment IDs sends every new write to the last shard. Hash the key, or partition on a natural key or random UUID, to avoid hotspots.
Key Takeaway
Shard keys must distribute writes evenly. Cross-shard queries are expensive—design your schema to avoid them.
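The rebalancing point is easier to see with a toy hash ring. The sketch below is an illustration under assumptions, not Ketama itself: `HashRing`, the shard names, and the choice of 100 virtual nodes per shard are all made up for the example.

```python
import bisect
import hashlib

def _point(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions of all virtual nodes
        self._owner = {}   # ring position -> shard name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        for i in range(self._vnodes):
            p = _point(f"{node}#{i}")
            self._owner[p] = node
            bisect.insort(self._points, p)

    def get_node(self, key: str) -> str:
        # First virtual node clockwise from the key, wrapping at the end of the ring
        idx = bisect.bisect(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[idx]]

keys = [f"user_{i}" for i in range(2000)]
ring = HashRing([f"shard-{i}" for i in range(4)])
before = {k: ring.get_node(k) for k in keys}
ring.add_node("shard-4")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
print(all(after[k] == "shard-4" for k in moved))  # True: keys only move to the new shard
```

With plain `hash % num_shards`, going from 4 to 5 shards relocates roughly four out of five keys; on the ring, only the keys claimed by the new shard move, which is about one in five here.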

Basics: The Foundation of Every Distributed System

Before you design a system, you must understand what you're building and why. Basics are not just definitions; they are the first principles that govern every architecture decision. Start by identifying the core components: clients, servers, databases, caches, load balancers, and message queues. Each has a specific role and constraint. Define your functional requirements (what the system must do) and nonfunctional requirements (how it must behave in terms of latency, availability, and durability). Without this clarity, you'll build something that technically works but fails in production. The why here is survival: a system that doesn't meet throughput or consistency needs will crumble under load. Recognize that every trade-off starts at the basics. A simple rule: understand the problem before reaching for a solution. For example, if you need at-most-once messaging for logging, don't design for exactly-once order guarantees. Keep your architecture minimal until complexity is justified. Master the basics, and you master the art of avoiding overengineering.

basics_health_check.pyPYTHON
# io.thecodeforge - system-design tutorial
# Check if your system's basics are solid
class SystemReadiness:
    def evaluate_basics(self, requirements):
        basics = {
            "latency_ok": requirements.get("max_latency_ms", 200) < 500,
            "throughput_ok": requirements.get("rps", 1000) <= 5000,
            "consistency_ok": requirements.get("consistency", "eventual") in ["strong", "eventual"]
        }
        return all(basics.values()), basics

evaluator = SystemReadiness()
all_good, checks = evaluator.evaluate_basics({})
print(f"Design ready? {all_good}")  # output if defaults pass
print(checks)
Output
Design ready? True
{'latency_ok': True, 'throughput_ok': True, 'consistency_ok': True}
Production Trap:
Skipping basics leads to premature optimization. You'll waste time on scaling when the real problem is a missing cache or misconfigured queue.
Key Takeaway
Define functional and nonfunctional requirements first; they dictate every architectural choice.
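A nonfunctional requirement only becomes useful once it is turned into a number you can budget against. As a small worked example (the helper name `downtime_budget_minutes` and the 30-day window are assumptions for illustration), an availability target translates into allowed downtime like this:

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min per 30 days")
```

Three nines leaves about 43 minutes per 30 days, and four nines leaves a little over 4, which is often less than a single manual rollback. That difference is the kind of constraint that should drive the architecture, not the other way around.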

Testing: Why Your System Crumbles Without a Verification Strategy

Testing in system design isn't about unit tests—it's about proving your architecture behaves correctly under failure and load. Start with integration tests that simulate real network partitions, database lag, and service outages. Use chaos engineering to deliberately inject failures (e.g., kill a service) and verify your system self-heals. Why this order? Because without testing for edge cases like eventual consistency drift or timeout retries, your system will fail unpredictably. Add load testing to confirm throughput and latency meet nonfunctional requirements. A common mistake is only testing happy paths: you must test degraded modes. For example, if your cache goes down, does your database handle the spike? Testing also validates your reliability metrics, like SLIs and SLOs. Automate these tests in your CI/CD pipeline so every deployment is validated against production-like scenarios. Remember: if you can't test it, you can't trust it. Testing is the only way to prove that your trade-offs (e.g., sacrificing strong consistency) actually work in practice.

test_failover.pyPYTHON
# io.thecodeforge - system-design tutorial
# Simulate a cache failure and check fallback
class CacheSimulator:
    def __init__(self, fallback_active=True):
        self.fallback = fallback_active

    def fetch_user_data(self, user_id):
        if self.fallback:
            return f"db_fallback_user_{user_id}"
        raise ConnectionError("Cache unavailable")

# Test under failure
sim = CacheSimulator(fallback_active=False)
try:
    result = sim.fetch_user_data(42)
    print(result)
except ConnectionError as e:
    print(f"Failover required: {e}")  # output
Output
Failover required: Cache unavailable
Production Trap:
Testing only the happy path means your system will break when a database replica lags or a cache node fails—guaranteed.
Key Takeaway
Test for failure scenarios and load spikes before production; chaos engineering proves your resilience.
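The cache-outage question raised above ("does your database handle the spike?") can be turned into an automated degraded-mode test. The following is a sketch with hand-written test doubles; `DownCache`, `CountingDb`, and `UserService` are hypothetical names, and a real suite would inject the failure into the actual client or network.

```python
class DownCache:
    """Test double that behaves like an unreachable cache node."""
    def get(self, key):
        raise ConnectionError("cache unavailable")

class CountingDb:
    """Test double that records how many reads fall through to the database."""
    def __init__(self):
        self.reads = 0

    def load_user(self, user_id):
        self.reads += 1
        return {"id": user_id, "source": "db"}

class UserService:
    def __init__(self, cache, db):
        self.cache, self.db = cache, db

    def get_user(self, user_id):
        try:
            cached = self.cache.get(f"user:{user_id}")
            if cached is not None:
                return cached
        except ConnectionError:
            pass  # degraded mode: treat a cache outage as a miss
        return self.db.load_user(user_id)

def test_cache_outage_falls_back_to_db():
    db = CountingDb()
    service = UserService(DownCache(), db)
    results = [service.get_user(i) for i in range(100)]
    assert all(r["source"] == "db" for r in results)
    assert db.reads == 100  # every request now hits the database
    return db.reads

print(test_cache_outage_falls_back_to_db())  # 100
```

The read count is the interesting number: it tells you the load multiplier the database must absorb during the outage, which you can then compare against its measured capacity.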
● Production incident · POST-MORTEM · severity: high

The Saga That Didn't Compensate: A Payment Pipeline Nightmare

Symptom
Customers reported being charged for orders that were never fulfilled. Customer support tickets surged with 'I was charged but order cancelled' complaints.
Assumption
The team assumed that idempotent compensation transactions would always succeed because the compensating service had no dependencies.
Root cause
The compensation handler for payment reversal called an external payment gateway that had a rate limit. Under load, the rate limit triggered a 429 response, and the compensation was not retried. The saga orchestration treated the failure as terminal and logged it to a dead-letter queue with no alert.
Fix
Implemented exponential backoff retry with a circuit breaker for the compensation call. Added a monitoring alert on dead-letter queue depth for the saga topic. Built a manual remediation dashboard for ops to replay failed compensations.
Key lessons
  • Compensations are critical paths—treat them like primary operations.
  • Always test compensations under failure scenarios, not just happy path.
  • Alert on dead-letter queue depth for sagas; silent failures in sagas become financial losses.
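The fix described in this post-mortem can be sketched as a retry loop around the compensation call. This is a minimal illustration, not the team's actual code: `RateLimited`, `run_compensation`, and the delay values are assumptions, and the circuit breaker mentioned in the fix is left out to keep the example short.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical error raised when the payment gateway answers HTTP 429."""

def run_compensation(refund, dead_letter, max_attempts=5, base_delay=0.5,
                     max_delay=30.0, sleep=time.sleep):
    """Call refund() until it succeeds; once attempts run out, dead-letter and report failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            refund()
            return True
        except RateLimited as exc:
            if attempt == max_attempts:
                dead_letter(exc)  # this path must raise an alert, never just a log line
                return False
            # Exponential backoff with full jitter so retries do not synchronize
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))

# Usage: a fake gateway that rejects the first two attempts
calls = {"n": 0}

def flaky_refund():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited("429 Too Many Requests")

dlq = []
ok = run_compensation(flaky_refund, dlq.append, sleep=lambda s: None)
print(ok, calls["n"], dlq)  # True 3 []
```

Retrying is only safe because the refund is assumed to be idempotent; if the gateway sends a Retry-After header, honoring it is usually better than a computed delay.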
Production debug guide · Symptom → Action guide for common architecture-pattern-related issues · 4 entries
Symptom · 01
Distributed transaction leaves data inconsistent (e.g., order confirmed but inventory not decremented)
Fix
Check saga orchestration logs for compensation execution. Verify dead-letter queue for failed compensation events. Run manual reconciliation script.
Symptom · 02
Event consumer lag > 10 minutes
Fix
Check consumer group status (kafka-consumer-groups --describe). Inspect consumer logs for exceptions. Scale consumer threads if partition count allows.
Symptom · 03
CQRS read model shows stale data for more than 5 seconds
Fix
Verify projection service health. Check event processing latency in the projection. Ensure read model update is idempotent and handles duplicates.
Symptom · 04
Microservice dependency chain leads to cascading failures
Fix
Implement circuit breakers in each client (e.g., Resilience4j). Set fallback responses or cached data. Review dependency graph to eliminate unnecessary synchronous calls.
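For the cascading-failure entry, the circuit breaker idea can be sketched in a few lines. Resilience4j is a Java library; the Python class below is a simplified stand-in written for this example (no thread safety, no sliding window), with `CircuitBreaker`, its thresholds, and the fake clock all being assumptions.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was supplied."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback=None):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of piling onto a sick dependency
                if fallback is not None:
                    return fallback()
                raise CircuitOpenError("circuit open")
            # Timeout elapsed: half-open, let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            if fallback is not None:
                return fallback()
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result

# Usage with a fake clock so the example runs instantly
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10, clock=lambda: now[0])
attempts = []

def failing():
    attempts.append(1)
    raise TimeoutError("inventory service timed out")

results = [breaker.call(failing, lambda: "cached-stock-level") for _ in range(4)]
print(results, len(attempts))  # four fallbacks, but only 2 real calls were made
now[0] = 11.0
print(breaker.call(lambda: "fresh-stock-level"))  # fresh-stock-level
```

The point of the open state is visible in the counts: after two failures the dependency stops receiving traffic, which gives it room to recover instead of being hammered by retries.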
★ Quick Debug Cheat Sheet for Architecture Decisions
When your chosen architecture pattern starts causing problems, use these commands and actions to diagnose and fix fast.
Saga compensation failure
Immediate action
Check dead-letter queue for saga topic
Commands
kafka-console-consumer --bootstrap-server localhost:9092 --topic saga.dlq --from-beginning --max-messages 10
curl -X GET http://saga-orchestrator:8080/actuator/health
Fix now
Manually replay failed saga events using a recovery script; implement retry with exponential backoff in the compensation handler
Event consumer lag spikes
Immediate action
Check consumer group lag
Commands
kafka-consumer-groups --bootstrap-server localhost:9092 --group order-confirmed-consumer --describe
docker logs order-confirmed-consumer --tail 100
Fix now
Scale consumer instances; check for blocking calls in consumer logic (e.g., synchronous DB writes); ensure consumer processing time is within SLA
CQRS projection lag > 30 seconds
Immediate action
Check projection service CPU and memory
Commands
top -b -n 1 | grep projection
curl http://projection-service:8080/actuator/metrics/projection.lag
Fix now
Add read replicas for projection; optimize projection queries; if lag is persistent, increase batch size for event processing
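Consumer lag, the number behind the cheat-sheet entries above, is simply the log end offset minus the group's committed offset, per partition. The sketch below works on plain dictionaries rather than a real Kafka client; `partition_lag`, `should_page`, and the sample offsets are hypothetical, and treating a partition with no commit as fully unread is an assumption.

```python
def partition_lag(end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition = log end offset minus the group's committed offset."""
    lag = {}
    for partition, end in end_offsets.items():
        committed = committed_offsets.get(partition, 0)  # assumption: no commit means all unread
        lag[partition] = max(end - committed, 0)
    return lag

def should_page(lag_by_partition: dict, threshold: int = 1000) -> bool:
    """Page on-call when total lag across partitions exceeds the threshold."""
    return sum(lag_by_partition.values()) > threshold

end = {0: 15_200, 1: 14_950, 2: 15_010}
committed = {0: 15_180, 1: 13_700, 2: 15_010}
lag = partition_lag(end, committed)
print(lag)               # {0: 20, 1: 1250, 2: 0}
print(should_page(lag))  # True
```

Looking at lag per partition rather than only the total matters: here a single partition carries almost all of the backlog, which points to a stuck or slow consumer instance rather than a general capacity problem.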
Pattern Trade-offs at a Glance
| Pattern | When to Use | When to Avoid | Operational Cost | Key Risk |
| --- | --- | --- | --- | --- |
| Monolith | Small team (< 15), single bounded context, simple deployment | Need for independent scaling of components | Low: one pipeline, one binary | No failure isolation; bad deploy takes everything down |
| Microservices | Large team, clear bounded contexts, independent scale needed | Team < 15, no separate deployment needs | High: CI/CD per service, service mesh, observability | Distributed transaction complexity, network failures |
| Event-Driven | Multiple independent consumers, eventual consistency OK | Need synchronous confirmations (payment, auth) | Medium: message broker overhead, monitoring required | Silent consumer failures, lag detection |
| Saga | Distributed transaction required, compensation is idempotent | Transaction fits in one service | High: compensation logic, DLQ monitoring | Compensation failure leads to data inconsistency |
| CQRS | Read and write patterns are very different, high read throughput | Stale data causes safety/financial issues | Medium: dual models, projection management | Eventual consistency, projection lag |
| API Gateway / BFF | Multiple services need a single entry point | Only one service exists | Low to medium: one more service to manage | Becoming a monolith (too many routes) or too many BFFs |

Key takeaways

1. Architecture decisions are permanent; code can be refactored in a day, but rearchitecting costs quarters.
2. Start with the simplest pattern that meets your constraints: monolith first, then extract services only when you need independent scaling.
3. Event-driven architecture decouples producers and consumers but couples you to your monitoring; consumer lag is the silent killer.
4. Sagas introduce significant operational complexity; only use them when a transaction absolutely must span multiple services.
5. CQRS is for scale, not correctness; never apply it to domains where stale data causes financial or safety consequences.
6. API gateways and BFFs prevent client chaos; start with one gateway, extract BFFs when client needs diverge.

Common mistakes to avoid

5 patterns

Adopting microservices before the team is ready

Symptom
Team spends 60% of time on infrastructure (service discovery, tracing, CI/CD) instead of business logic. On-call rotations become unsustainable.
Fix
Stay with a monolith until you have at least 15 engineers and a clear bounded context to extract. Invest in a modular monolith first.

Using Event-Driven Architecture without monitoring consumer lag

Symptom
Orders pile up unprocessed for hours; customers complain; dashboards show green because producer is healthy.
Fix
Day one: set alerts on consumer group lag (e.g., kafka.consumer.group.lag > 1000). If lag exceeds threshold, page on-call.

Implementing a saga without idempotent compensations

Symptom
Compensation calls fail under load (rate limits, timeouts) and leave the system in a partially rolled-back state. Financial reconciliation becomes a manual nightmare.
Fix
Make every compensation idempotent and retriable. Use exponential backoff. If retries exhaust, route to DLQ and alert immediately.

Applying CQRS to a domain where stale data causes real-world damage

Symptom
Users see outdated account balances or product prices; financial disputes arise.
Fix
For any domain where correctness depends on up-to-date data, do NOT use eventually consistent read models. Either read from the write model or use synchronously updated materialised views within the same transaction boundary.

Building a single API gateway that tries to serve all client types

Symptom
Gateway becomes a monolith: thousands of routes, every change requires full redeployment, one team's endpoint affects all clients.
Fix
Extract Backend for Frontend (BFF) services per client type (web, mobile, third-party). Each BFF owns its own aggregation and deployment lifecycle.
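For the saga mistake in the list above, "idempotent and retriable" usually comes down to an idempotency key per saga step. The sketch below uses an in-memory dictionary as a stand-in for a durable table; `RefundLedger`, the key format, and the amounts are assumptions for illustration only.

```python
class RefundLedger:
    """In-memory stand-in for a durable table keyed by idempotency key."""
    def __init__(self):
        self._applied = {}  # idempotency key -> refund record

    def refund_once(self, saga_id: str, step: str, amount_cents: int) -> dict:
        key = f"{saga_id}:{step}"  # the same saga step always maps to the same key
        if key in self._applied:
            return self._applied[key]  # replay: return the earlier result, move no money
        record = {"key": key, "amount_cents": amount_cents, "status": "refunded"}
        self._applied[key] = record
        return record

    def total_refunded(self) -> int:
        return sum(r["amount_cents"] for r in self._applied.values())

ledger = RefundLedger()
first = ledger.refund_once("saga-123", "reverse-payment", 9999)
retry = ledger.refund_once("saga-123", "reverse-payment", 9999)
print(first is retry, ledger.total_refunded())  # True 9999
```

In a real system the check-and-insert must be atomic (for example, a unique constraint on the key), otherwise two concurrent retries can both pass the lookup and refund twice.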
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
When would you choose a monolith over microservices?
Q02 · SENIOR
What's the biggest risk of Event-Driven Architecture, and how do you mit...
Q03 · SENIOR
Explain the saga pattern and when you would NOT use it.
Q04 · SENIOR
What is the difference between an API Gateway and a Backend for Frontend...
Q05 · SENIOR
How would you design a system that uses CQRS and needs to handle a domai...
Q01 of 05 · SENIOR

When would you choose a monolith over microservices?

ANSWER
Choose a monolith when your team is under 15 engineers, you have a single bounded context, and you don't need independent scaling of individual components. The distributed systems tax (service discovery, tracing, eventual consistency, multiple CI/CD pipelines) adds significant operational overhead that a small team can't absorb. A modular monolith gives you many of the same structural benefits without the network complexity.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What's the #1 mistake teams make when adopting microservices?
02. Can I use CQRS with a single database?
03. When should I choose choreography over orchestration for sagas?
04. How do I decide between an API Gateway and a Service Mesh?
05. Is a modular monolith a viable alternative to microservices?