
Software Architecture Patterns: Trade-offs and Real Decisions

πŸ“ Part of: Architecture β†’ Topic 13 of 13
Software architecture patterns explained with real trade-offs, production war stories, and the decision frameworks senior engineers actually use.
βš™οΈ Intermediate β€” basic System Design knowledge assumed
In this tutorial, you'll learn:
  • Architecture patterns solve specific pain points — adopt them when you feel the pain, not when you anticipate it. Paying the distributed systems complexity tax without needing the distributed systems benefits is how teams slow down.
  • The hidden cost of microservices isn't the services — it's losing ACID transactions. Every distributed write that used to be a conn.rollback() is now a saga with compensating transactions, a dead-letter queue, and a support runbook.
  • Reach for CQRS when your read and write access patterns are genuinely incompatible — different query shapes, different consistency needs, order-of-magnitude different throughput. If your reads and writes look similar, a well-indexed relational database with read replicas handles 90% of real-world scale.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
Think of software architecture like designing a restaurant kitchen. A food truck has one chef doing everything — fast to set up, impossible to scale when 200 people show up. A Michelin-star kitchen has a strict brigade system: saucier, pastry chef, expeditor — each station independent, but the coordination overhead is brutal. Your architecture is that kitchen design. Pick the wrong one and your 'chef' either burns out solo or your 'brigade' spends more time talking than cooking. The food is the same. The choice of how to organise the kitchen determines whether you survive dinner service.

A startup I consulted for rewrote their monolith into 47 microservices in eight months. By month nine, they had more engineers debugging inter-service latency than building features, their P99 checkout latency had tripled, and their on-call rotation was a rotating trauma ward. The monolith had been slow. The microservices were broken in ways nobody could trace.

Architecture decisions are permanent in a way that code decisions aren't. You can refactor a function in an afternoon. Rearchitecting a distributed system costs quarters, sometimes years. The tragedy is that most teams make these decisions by copying what Netflix or Uber did — without copying the 500-engineer platform team that makes those patterns survivable. The pattern isn't the hard part. Knowing when it fits your actual constraints is the whole game.

After this, you'll be able to look at a system's requirements and map them to a concrete architectural pattern — not because you memorised a definition, but because you understand the exact failure modes each pattern introduces and the specific conditions under which those failure modes hurt you. You'll know when a monolith is the right call, when event-driven architecture saves you, when CQRS is overkill, and what questions to ask before committing to any of them.

Monolith vs. Microservices: The Decision Nobody Makes Honestly

Every team says they're 'moving to microservices for scalability.' What they usually mean is they read a Martin Fowler post and their CTO saw a conference talk. Let's be precise about what each architecture actually costs you, because the decision is irreversible for 18-36 months once you commit.

A monolith is a single deployable unit. All your business logic, data access, and API surface lives in one process. The wins are real: in-process function calls instead of network hops, a single transaction boundary, one deployment pipeline, and a debugger that actually works. The failure mode is equally real: a memory leak in your image-processing module takes down your payment API. One bad deploy nukes everything. Your release cadence is bottlenecked by the slowest team's merge.

Microservices split that single process into independently deployable services communicating over a network. You get independent deployability and isolated failure domains. You also get distributed systems problems you didn't have before: network partitions, eventual consistency, service discovery, distributed tracing, and the operational complexity of running 20+ services instead of one. I've watched teams spend three months building the infrastructure to support microservices before writing a single line of business logic.

The honest rule: if you have fewer than 15 engineers, a monolith is almost certainly correct. If your scaling bottleneck is genuinely a specific bounded domain — say, your video transcoding is hammering CPU while your API servers idle — extract that one service. Don't extract everything because you might need to scale it someday. That day may never come, and you'll have paid the distributed systems tax in advance for nothing.

OrderServiceMonolith.py · PYTHON
# io.thecodeforge — System Design tutorial

# Scenario: E-commerce order service.
# This is the monolith version. Notice what you get for FREE
# that microservices make you rebuild from scratch.

from dataclasses import dataclass
from decimal import Decimal
from typing import Optional
import sqlite3

# Single database connection — one transaction wraps everything.
# In a distributed system, this atomicity is GONE unless you implement
# two-phase commit or saga patterns. Both are painful.
DB_PATH = "orders.db"

@dataclass
class OrderItem:
    product_id: str
    quantity: int
    unit_price: Decimal

@dataclass
class Order:
    order_id: str
    customer_id: str
    items: list[OrderItem]
    status: str = "pending"

class InventoryService:
    """In a monolith, this is just a class. In microservices, it's a network call
    that can timeout, return stale data, or be temporarily unavailable."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def reserve_stock(self, product_id: str, quantity: int) -> bool:
        cursor = self.conn.cursor()
        # SQLite's equivalent of SELECT ... FOR UPDATE — the write path is serialised.
        # In a distributed inventory service, this becomes a distributed lock.
        # Redis SETNX, or a dedicated locking service. Both add latency and failure modes.
        cursor.execute(
            "SELECT stock_count FROM inventory WHERE product_id = ?",
            (product_id,)
        )
        row = cursor.fetchone()
        if not row or row[0] < quantity:
            return False  # Not enough stock — fail the whole operation cleanly

        cursor.execute(
            "UPDATE inventory SET stock_count = stock_count - ? WHERE product_id = ?",
            (quantity, product_id)
        )
        return True

class PaymentService:
    """Again — a class. Zero network calls. Zero timeout handling needed.
    The moment this becomes a separate service, you need circuit breakers,
    retry logic, idempotency keys, and dead-letter queues. All of that
    is real engineering effort — not free."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def charge_customer(self, customer_id: str, amount: Decimal) -> bool:
        cursor = self.conn.cursor()
        cursor.execute(
            "SELECT balance FROM accounts WHERE customer_id = ?",
            (customer_id,)
        )
        row = cursor.fetchone()
        if not row or row[0] < amount:
            return False  # Insufficient funds — atomic rollback handles cleanup

        cursor.execute(
            "UPDATE accounts SET balance = balance - ? WHERE customer_id = ?",
            (amount, customer_id)
        )
        return True

class OrderOrchestrator:
    """The core of the monolith advantage: ONE database transaction covers
    inventory reservation, payment, AND order creation. If payment fails,
    inventory is automatically unreserved. No compensation logic. No sagas.
    No 'sorry, your order is stuck in PENDING forever' bugs."""

    def __init__(self):
        self.conn = sqlite3.connect(DB_PATH)
        self.inventory = InventoryService(self.conn)
        self.payment = PaymentService(self.conn)

    def place_order(self, order: Order) -> dict:
        total = sum(
            item.unit_price * item.quantity for item in order.items
        )

        try:
            # Explicit BEGIN — everything from here to commit()/rollback() is one atomic transaction
            self.conn.execute("BEGIN")

            # Step 1: Reserve all stock items.
            # If ANY item fails, we roll back everything. One line of code.
            for item in order.items:
                if not self.inventory.reserve_stock(item.product_id, item.quantity):
                    self.conn.rollback()
                    return {
                        "success": False,
                        "reason": f"Insufficient stock for product {item.product_id}"
                    }

            # Step 2: Charge the customer.
            # In a microservices world, payment already happened in a different
            # service. If inventory reservation then fails, you're issuing refunds.
            # That's a SUPPORT TICKET. Here? It's a rollback.
            if not self.payment.charge_customer(order.customer_id, total):
                self.conn.rollback()
                return {"success": False, "reason": "Payment failed"}

            # Step 3: Persist the order record.
            self.conn.execute(
                "INSERT INTO orders (order_id, customer_id, status, total) VALUES (?, ?, ?, ?)",
                (order.order_id, order.customer_id, "confirmed", str(total))
            )

            self.conn.commit()  # All three operations committed atomically
            return {"success": True, "order_id": order.order_id, "total": str(total)}

        except Exception as e:
            self.conn.rollback()  # Something unexpected? Clean slate.
            raise RuntimeError(f"Order placement failed: {e}") from e


# --- Demonstrate the flow ---
if __name__ == "__main__":
    # Normally you'd have migrations. Simplified for illustration.
    orchestrator = OrderOrchestrator()

    order = Order(
        order_id="ORD-20240315-001",
        customer_id="CUST-789",
        items=[
            OrderItem(product_id="SKU-HEADPHONES-XZ3", quantity=1, unit_price=Decimal("149.99")),
            OrderItem(product_id="SKU-USB-CABLE-C", quantity=2, unit_price=Decimal("12.50"))
        ]
    )

    result = orchestrator.place_order(order)
    print(f"Order result: {result}")
▶ Output
Order result: {'success': True, 'order_id': 'ORD-20240315-001', 'total': '174.99'}
⚠️
Production Trap: The Distributed Transaction Debt
When you split this into microservices, every conn.rollback() in the monolith becomes a saga with compensating transactions. Teams that don't plan for this end up with orders stuck in PENDING state indefinitely when a downstream service times out. The symptom: customer support tickets saying 'I was charged but my order never arrived.' The fix before you split: define the saga pattern and write the compensation handlers FIRST, before extracting a single service.
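The compensation shape is worth seeing concretely. Here is a minimal sketch, with invented step and state names, of how each write gains a paired undo once the single transaction boundary is gone:

```python
# Hypothetical saga sketch: each step pairs an action with a compensation.
# Step names and the state dict are illustrative, not a real API.

class SagaStep:
    def __init__(self, name, action, compensation):
        self.name = name
        self.action = action              # callable() -> bool (did the step succeed?)
        self.compensation = compensation  # callable() -> None (undo the step)

def run_saga(steps):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for step in steps:
        if step.action():
            completed.append(step)
        else:
            for done in reversed(completed):
                done.compensation()   # undo what already happened
            return {"success": False, "failed_step": step.name}
    return {"success": True}

# Demo: payment fails after stock was reserved, so stock must be released.
state = {"stock": 5, "charged": False}

steps = [
    SagaStep("reserve_stock",
             action=lambda: (state.update(stock=state["stock"] - 1) or True),
             compensation=lambda: state.update(stock=state["stock"] + 1)),
    SagaStep("charge_payment",
             action=lambda: False,  # simulate a payment failure
             compensation=lambda: state.update(charged=False)),
]

result = run_saga(steps)
print(result)          # failed at charge_payment
print(state["stock"])  # stock released back to 5
```

The point of the sketch: every `rollback()` you had for free now requires an explicit, tested compensation handler, and the reverse-order unwind is your responsibility.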

Event-Driven Architecture: Power, Poison, and When to Reach for It

Event-driven architecture (EDA) solves a specific problem: you need multiple systems to react to something that happened, without the producer caring who's listening. The classic alternative — direct synchronous calls — creates a dependency spider web. Your order service calls inventory, which calls the warehouse, which calls shipping. Now your order service's uptime is the product of everyone else's uptime. At 99.9% each, four services in a chain gives you 99.6% overall. That's roughly 35 hours of downtime per year from services that are individually 'highly available.'
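The multiplication is worth checking by hand; a few lines make the arithmetic explicit:

```python
# Chained availability: a request succeeds only if every service in the
# chain is up, so the availabilities multiply.
per_service = 0.999                       # "three nines" per service
chain = per_service ** 4                  # four services in series
downtime_hours = (1 - chain) * 24 * 365   # expected downtime per year

print(f"Chain availability: {chain:.4%}")
print(f"Downtime per year: {downtime_hours:.1f} hours")  # ~35 hours
```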

EDA decouples that chain. The order service publishes an OrderConfirmed event. Inventory, warehouse, fraud detection, and email notification all consume it independently. The order service doesn't know they exist. New consumers can subscribe without touching the producer. This is real decoupling — not just the dependency injection kind.

The poison pill is invisible failure. In a synchronous call, if the warehouse service is down, you know immediately — your caller gets a 503. In EDA, your event is published to the queue, the producer returns success, and the warehouse consumer is silently dead. Events accumulate in the dead-letter queue. You discover it when a customer calls saying their package never shipped. I've seen this exact scenario play out in a logistics company where a misconfigured consumer group caused 6 hours of orders to pile up unprocessed while dashboards showed everything green.
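One defence against that silent death is making failure land somewhere visible. A minimal sketch, with illustrative names, of wrapping a consumer handler so exhausted retries park the event in a dead-letter queue instead of dropping it:

```python
from collections import deque

# Illustrative sketch: a handler wrapper that retries, then routes the
# failing event to a dead-letter queue you can alert on and replay from.
class DeadLetterQueue:
    def __init__(self):
        self.entries = deque()

    def push(self, event, error):
        self.entries.append({"event": event, "error": str(error)})

def with_dlq(handler, dlq, max_attempts=3):
    def wrapped(event):
        for attempt in range(1, max_attempts + 1):
            try:
                return handler(event)
            except Exception as e:
                if attempt == max_attempts:
                    dlq.push(event, e)  # exhausted retries: park it, don't lose it
    return wrapped

# Demo: a handler that always fails ends up in the DLQ, not the void.
dlq = DeadLetterQueue()

def flaky_handler(event):
    raise RuntimeError("warehouse service unreachable")

safe = with_dlq(flaky_handler, dlq)
safe({"order_id": "ORD-1"})
print(len(dlq.entries))   # 1 parked event to alert on
```

The DLQ itself then needs a depth alert, otherwise you have only moved the invisibility one queue over.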

Use EDA when: you have genuinely independent consumers, eventual consistency is acceptable for the domain, and you have the operational maturity to monitor queue depth and consumer lag. Don't use it for anything that needs synchronous confirmation — payment authorisation, stock reservation at checkout time, authentication.

EventDrivenOrderPipeline.py · PYTHON
# io.thecodeforge — System Design tutorial

# Scenario: Post-checkout event pipeline.
# The payment is already confirmed. Now we need to notify inventory,
# fraud, shipping, and email — all independently, all without
# the checkout service caring about any of them.

import json
import time
import threading
from dataclasses import dataclass, asdict
from typing import Callable
from collections import defaultdict, deque
from datetime import datetime

@dataclass
class OrderConfirmedEvent:
    event_id: str
    order_id: str
    customer_id: str
    product_ids: list[str]
    total_amount: float
    occurred_at: str  # ISO 8601 — always timestamp your events at creation time

    def to_json(self) -> str:
        return json.dumps(asdict(self))


class InMemoryEventBus:
    """
    Production equivalent: Apache Kafka, AWS SQS+SNS, or RabbitMQ.
    The contract is the same: publish once, consume independently.
    This in-memory version makes the pattern visible without Kafka setup.
    """

    def __init__(self):
        # Each topic maps to a list of independent consumer queues.
        # In Kafka terms: each consumer GROUP gets its own queue.
        # This is what enables independent consumption and replay.
        self._topics: dict[str, list[deque]] = defaultdict(list)
        self._handlers: dict[str, list[Callable]] = defaultdict(list)
        self._lock = threading.Lock()

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        """Register a consumer for a topic. Each subscriber gets ALL events
        published to that topic — not round-robin like a work queue."""
        with self._lock:
            consumer_queue: deque = deque()
            self._topics[topic].append(consumer_queue)
            # Start a background thread per consumer — simulates async consumption
            thread = threading.Thread(
                target=self._consume,
                args=(consumer_queue, handler),
                daemon=True  # Dies when main thread dies — fine for demo, use proper lifecycle in prod
            )
            thread.start()

    def publish(self, topic: str, event: dict) -> None:
        """Producer publishes and returns immediately. It does NOT wait for consumers.
        This is the core of EDA's decoupling — and the source of its observability challenges."""
        with self._lock:
            for consumer_queue in self._topics[topic]:
                consumer_queue.append(event)  # Each consumer gets its own copy
        print(f"[EventBus] Published to '{topic}': event_id={event.get('event_id')}")

    def _consume(self, queue: deque, handler: Callable) -> None:
        """Busy-polls the queue. In production Kafka consumers use long-polling
        with configurable fetch.min.bytes and fetch.max.wait.ms for efficiency."""
        while True:
            if queue:
                event = queue.popleft()
                try:
                    handler(event)
                except Exception as e:
                    # In production: route to dead-letter queue, alert, do NOT silently swallow.
                    # Silent swallowing here is how you get the 'orders piling up unprocessed' nightmare.
                    print(f"[EventBus] CONSUMER ERROR — routing to DLQ: {e}")
            else:
                time.sleep(0.01)  # Back off when idle — don't spin-burn CPU


# --- CONSUMERS ---
# Each of these would be a separate service in production.
# They share zero state. They don't know about each other.

class InventoryConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Inventory] Reserving stock for order {event['order_id']} "
              f"— products: {event['product_ids']}")
        # In reality: update stock counts, trigger reorder if threshold hit

class FraudDetectionConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Fraud] Running fraud score for customer {event['customer_id']} "
              f"— amount: ${event['total_amount']}")
        # In reality: call ML model, flag order if score > threshold, publish FraudFlaggedEvent

class ShippingConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Shipping] Creating shipment manifest for order {event['order_id']}")
        # In reality: call 3PL API, generate label, store tracking number

class EmailNotificationConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Email] Queueing confirmation email to customer {event['customer_id']} "
              f"for order {event['order_id']}")
        # In reality: render template, call SES/SendGrid, record sent timestamp


# --- PRODUCER ---

class CheckoutService:
    """The checkout service knows about the event bus and the event schema.
    It does NOT know about inventory, fraud, shipping, or email.
    Adding a new downstream consumer requires ZERO changes here."""

    def __init__(self, event_bus: InMemoryEventBus):
        self.event_bus = event_bus

    def complete_checkout(self, order_id: str, customer_id: str,
                          product_ids: list[str], total: float) -> dict:
        # Payment authorisation would happen here synchronously BEFORE this point.
        # EDA handles post-payment side effects — not the payment itself.
        event = OrderConfirmedEvent(
            event_id=f"evt-{order_id}-{int(time.time())}",
            order_id=order_id,
            customer_id=customer_id,
            product_ids=product_ids,
            total_amount=total,
            occurred_at=datetime.utcnow().isoformat() + "Z"
        )

        self.event_bus.publish("order.confirmed", asdict(event))

        # Returns IMMEDIATELY — doesn't wait for inventory, fraud, or shipping.
        # This is your sub-100ms checkout response time.
        return {"status": "confirmed", "order_id": order_id}


# --- Wire it up ---

if __name__ == "__main__":
    bus = InMemoryEventBus()

    # Register consumers — order doesn't matter, they're all independent
    inventory = InventoryConsumer()
    fraud = FraudDetectionConsumer()
    shipping = ShippingConsumer()
    email = EmailNotificationConsumer()

    bus.subscribe("order.confirmed", inventory.handle)
    bus.subscribe("order.confirmed", fraud.handle)
    bus.subscribe("order.confirmed", shipping.handle)
    bus.subscribe("order.confirmed", email.handle)

    checkout = CheckoutService(bus)

    # Simulate a completed checkout
    result = checkout.complete_checkout(
        order_id="ORD-20240315-002",
        customer_id="CUST-456",
        product_ids=["SKU-MONITOR-27", "SKU-HDMI-CABLE"],
        total=389.98
    )

    print(f"\n[Checkout] Response: {result}")
    time.sleep(0.1)  # Give async consumers time to process in this demo
▶ Output
[EventBus] Published to 'order.confirmed': event_id=evt-ORD-20240315-002-1710460800

[Checkout] Response: {'status': 'confirmed', 'order_id': 'ORD-20240315-002'}
[Inventory] Reserving stock for order ORD-20240315-002 — products: ['SKU-MONITOR-27', 'SKU-HDMI-CABLE']
[Fraud] Running fraud score for customer CUST-456 — amount: $389.98
[Shipping] Creating shipment manifest for order ORD-20240315-002
[Email] Queueing confirmation email to customer CUST-456 for order ORD-20240315-002
⚠️
Never Do This: Publish Events Without Monitoring Consumer Lag
Consumer lag is the silent killer in EDA. Your producer is healthy, your event bus is healthy, but your consumer group fell behind 3 hours ago because of a bad deployment. In Kafka: alert on kafka.consumer.group.lag > 1000 per partition. In SQS: alert on ApproximateNumberOfMessagesNotVisible. I've seen teams discover six-figure inventory discrepancies because nobody set this alert. Set it on day one, before your first consumer hits production.
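The arithmetic behind those alerts is simple: per-partition lag is the log end offset minus the consumer's last committed offset. A sketch with invented offset numbers:

```python
# Sketch of the lag computation (offset values are invented for illustration):
# lag per partition = log end offset - committed consumer offset.
def consumer_lag(end_offsets, committed_offsets):
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

end = {0: 1_500_200, 1: 1_498_990}        # latest offset written per partition
committed = {0: 1_500_180, 1: 1_493_200}  # last offset each consumer committed

lag = consumer_lag(end, committed)
alerts = [p for p, n in lag.items() if n > 1000]
print(lag)      # {0: 20, 1: 5790}
print(alerts)   # [1] -- partition 1 has fallen behind
```

A healthy partition hovers near zero; a partition whose lag grows monotonically means its consumer is dead or too slow, even while every health check stays green.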

CQRS and the Read/Write Split: When Your Query Patterns Are Killing Your Writes

CQRS — Command Query Responsibility Segregation — is one of the most cargo-culted patterns in the industry. Teams add it because it sounds senior-level. Here's the honest version: you need CQRS when your read and write access patterns are so different that a single model optimised for both is actually optimised for neither.

Consider a product catalogue. Writes are rare, structured, and come from an admin tool — one product update at a time, full validation, transactional integrity. Reads are constant, require different field combinations per client (mobile wants a summary, web wants full detail, search wants keywords only), and need to be fast under high concurrency. A single relational model with indexes for everything is a compromise that serves nobody well. Your write path trips over read indexes. Your read path joins five tables to answer a query that could be a single document lookup.

CQRS splits this: the write side (Commands) uses a normalised transactional model. The read side (Queries) uses one or more denormalised read models, potentially different databases entirely — PostgreSQL for writes, Elasticsearch for search, Redis for session data. When a command succeeds, you publish an event (or a projection job runs) to update the read models. Those read models are eventually consistent — they lag behind by milliseconds to seconds.

That 'eventually consistent' part is where teams get burned. I've seen a fintech ship CQRS across their account balance domain. Write side was Postgres. Read side was a Redis projection. A deployment bug caused the projection to stop updating. For 40 minutes, customers saw stale balances. Nobody noticed until a customer tried to spend money the read model said they had but the write model had already debited. Don't use CQRS for anything where reading stale data causes a financial or safety consequence. It's a pattern for scale, not for correctness.

CQRSProductCatalogue.py · PYTHON
# io.thecodeforge — System Design tutorial

# Scenario: Product catalogue with very different read/write patterns.
# Write side: admin updates one product at a time, needs validation + atomicity.
# Read side: 10,000 RPS product page loads, need sub-10ms response.
# A single DB model with indexes for both is a bottleneck in both directions.

from dataclasses import dataclass, field
from typing import Optional
import json
import time

# ─── WRITE SIDE (Command Model) ────────────────────────────────────────────
# Normalised. Strict validation. Transactional. Think PostgreSQL.

@dataclass
class ProductWriteModel:
    """The authoritative source of truth. Commands land here.
    This model is optimised for integrity, not query speed."""
    product_id: str
    sku: str
    name: str
    description: str
    price_cents: int       # Store money as integers — never floats. Ever.
    stock_count: int
    category_id: str
    brand_id: str
    is_active: bool = True
    version: int = 1       # Optimistic locking — detect concurrent modification

@dataclass
class UpdateProductPriceCommand:
    product_id: str
    new_price_cents: int
    updated_by: str        # Always audit who changed what in a write model
    reason: str            # Force callers to explain why — reduces lazy changes

@dataclass
class ProductPriceUpdatedEvent:
    product_id: str
    old_price_cents: int
    new_price_cents: int
    occurred_at: float


class ProductCommandHandler:
    """Handles all writes. Returns an event that downstream projections consume
    to update read models. The command handler does NOT update read models directly —
    that's the projection layer's job."""

    def __init__(self, write_store: dict):
        # In production: PostgreSQL with row-level locking
        self.write_store = write_store
        self.event_log: list[ProductPriceUpdatedEvent] = []

    def handle_update_price(
        self, command: UpdateProductPriceCommand
    ) -> ProductPriceUpdatedEvent:

        product = self.write_store.get(command.product_id)
        if not product:
            raise ValueError(f"Product {command.product_id} not found")

        if command.new_price_cents <= 0:
            raise ValueError("Price must be positive — write model enforces this invariant")

        old_price = product.price_cents
        product.price_cents = command.new_price_cents
        product.version += 1  # Increment version for optimistic lock detection

        event = ProductPriceUpdatedEvent(
            product_id=command.product_id,
            old_price_cents=old_price,
            new_price_cents=command.new_price_cents,
            occurred_at=time.time()
        )

        self.event_log.append(event)  # In production: append to Kafka/outbox table
        print(f"[CommandHandler] Price updated: {command.product_id} "
              f"${old_price/100:.2f} → ${command.new_price_cents/100:.2f} "
              f"(v{product.version}) by {command.updated_by}")
        return event


# ─── READ SIDE (Query Model) ───────────────────────────────────────────────
# Denormalised. Pre-computed. Optimised for the specific query pattern.
# NOT the source of truth — it's a projection of the write side.

@dataclass
class ProductReadModel:
    """Flattened, denormalised view of a product.
    Fields are exactly what the product page API endpoint needs — nothing more.
    In production: this lives in Redis or Elasticsearch, not Postgres."""
    product_id: str
    display_name: str
    price_display: str     # Pre-formatted: '$149.99' — computed at write time, not read time
    category_name: str     # Joined and baked in — no JOIN at query time
    brand_name: str
    is_available: bool
    search_keywords: list[str]  # Pre-extracted for search — no LIKE query at runtime


class ProductReadModelProjection:
    """Listens to events from the write side and updates read models.
    This is the 'eventually consistent' part — it runs async after the command.
    If this falls behind, reads show stale data. Monitor it."""

    def __init__(self, read_store: dict, category_lookup: dict, brand_lookup: dict):
        self.read_store = read_store
        self.category_lookup = category_lookup  # Pre-loaded reference data
        self.brand_lookup = brand_lookup

    def project_price_update(self, event: ProductPriceUpdatedEvent,
                              write_model: ProductWriteModel) -> None:
        """Rebuild only the affected fields in the read model.
        In production with Kafka: this handler is idempotent — if it runs twice
        (at-least-once delivery), the result is the same."""

        existing_read_model = self.read_store.get(event.product_id)
        if not existing_read_model:
            # First time building this read model — construct from write model
            existing_read_model = ProductReadModel(
                product_id=write_model.product_id,
                display_name=write_model.name,
                price_display="",  # Will be set below
                category_name=self.category_lookup.get(write_model.category_id, "Unknown"),
                brand_name=self.brand_lookup.get(write_model.brand_id, "Unknown"),
                is_available=write_model.is_active and write_model.stock_count > 0,
                search_keywords=write_model.name.lower().split()
            )

        # Update ONLY the price field — targeted, not full rebuild
        existing_read_model.price_display = f"${event.new_price_cents / 100:.2f}"
        self.read_store[event.product_id] = existing_read_model

        print(f"[Projection] Read model updated for {event.product_id}: "
              f"price now {existing_read_model.price_display}")


class ProductQueryHandler:
    """Handles all reads. Hits the read store only — never touches the write DB.
    This is why reads are fast: the read store is pre-computed and query-shaped."""

    def __init__(self, read_store: dict):
        self.read_store = read_store

    def get_product_page_data(self, product_id: str) -> Optional[dict]:
        model = self.read_store.get(product_id)
        if not model:
            return None
        # Return exactly what the product page needs — no transformation at read time
        return {
            "id": model.product_id,
            "name": model.display_name,
            "price": model.price_display,
            "category": model.category_name,
            "brand": model.brand_name,
            "available": model.is_available
        }


# ─── Wire it all together ──────────────────────────────────────────────────

if __name__ == "__main__":
    # Simulate backing stores
    write_store = {
        "PROD-MONITOR-4K-32": ProductWriteModel(
            product_id="PROD-MONITOR-4K-32",
            sku="MON-32-4K-BK",
            name="32-inch 4K Monitor",
            description="IPS panel, 144Hz, USB-C PD 96W",
            price_cents=59999,
            stock_count=47,
            category_id="CAT-DISPLAYS",
            brand_id="BRD-DELL"
        )
    }
    read_store = {}
    category_lookup = {"CAT-DISPLAYS": "Monitors & Displays"}
    brand_lookup = {"BRD-DELL": "Dell"}

    command_handler = ProductCommandHandler(write_store)
    projection = ProductReadModelProjection(read_store, category_lookup, brand_lookup)
    query_handler = ProductQueryHandler(read_store)

    # 1. Admin issues a price update command
    update_command = UpdateProductPriceCommand(
        product_id="PROD-MONITOR-4K-32",
        new_price_cents=54999,  # Sale price
        updated_by="admin-jenna@example.com",
        reason="Black Friday promotional pricing"
    )

    event = command_handler.handle_update_price(update_command)

    # 2. Projection layer processes the event (async in prod, sync here for clarity)
    projection.project_price_update(event, write_store["PROD-MONITOR-4K-32"])

    # 3. Query handler serves the product page β€” hits read store only
    product_page = query_handler.get_product_page_data("PROD-MONITOR-4K-32")
    print(f"\n[QueryHandler] Product page response: {json.dumps(product_page, indent=2)}")
β–Ά Output
[CommandHandler] Price updated: PROD-MONITOR-4K-32 $599.99 β†’ $549.99 (v2) by admin-jenna@example.com
[Projection] Read model updated for PROD-MONITOR-4K-32: price now $549.99

[QueryHandler] Product page response: {
"id": "PROD-MONITOR-4K-32",
"name": "32-inch 4K Monitor",
"price": "$549.99",
"category": "Monitors & Displays",
"brand": "Dell",
"available": true
πŸ”₯
Interview Gold: The Projection Lag Question
Interviewers love asking: 'What happens between the command succeeding and the read model updating in CQRS?' The answer distinguishes seniors from juniors. The write returns success. For a window (milliseconds to seconds), a read query returns stale data. Your system must either accept this (most product catalogues can), implement read-your-writes consistency (return the write model immediately after a write for that specific user session), or choose a different pattern for that domain entirely. Know which domains in your system can tolerate eventual consistency and which cannot before you adopt CQRS.
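One way to implement the read-your-writes option is a small session-scoped cache that shadows the read store. The sketch below is illustrative, not part of the tutorial's code above: `SessionAwareQueryHandler` and its `recent_writes` structure are hypothetical names, and the staleness window is an assumed tuning knob.

```python
import time

class SessionAwareQueryHandler:
    """Serves reads from the read store, but prefers the caller's own
    recent writes so a user always sees their latest change, even while
    the projection is still catching up."""

    def __init__(self, read_store: dict, staleness_window_s: float = 5.0):
        self.read_store = read_store
        self.staleness_window_s = staleness_window_s
        # session_id -> {product_id: (written_at, write_snapshot)}
        self.recent_writes: dict = {}

    def record_write(self, session_id: str, product_id: str, snapshot: dict):
        """Called by the command path right after a successful write."""
        self.recent_writes.setdefault(session_id, {})[product_id] = (
            time.monotonic(), snapshot
        )

    def get(self, session_id: str, product_id: str):
        entry = self.recent_writes.get(session_id, {}).get(product_id)
        if entry:
            written_at, snapshot = entry
            if time.monotonic() - written_at < self.staleness_window_s:
                return snapshot  # this session's write wins over a stale projection
        return self.read_store.get(product_id)

handler = SessionAwareQueryHandler(read_store={"P1": {"price": "$599.99"}})
handler.record_write("sess-42", "P1", {"price": "$549.99"})
print(handler.get("sess-42", "P1"))  # the writing session sees the new price
print(handler.get("sess-99", "P1"))  # other sessions may still see the old one
```

The window only needs to cover your typical projection lag; once the projection lands, the read store serves everyone the same data.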

When Simpler Beats Clever: The Architecture Decisions You'll Regret

Here's the thing senior engineers know that junior engineers don't: every architectural pattern solves a real problem and creates three new ones. The art isn't picking the most sophisticated pattern β€” it's picking the one whose new problems you can actually manage given your team size, operational maturity, and product stage.

The strangler fig pattern is genuinely useful for migrating legacy systems incrementally, but I've seen teams spend 18 months building the 'strangling' infrastructure around a system they could have rewritten in six. The strangler makes sense when the legacy system is too risky to replace wholesale and you need continuity. It's overkill when you have a six-week-old Node.js app and you just don't like the original structure.
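The mechanical core of the strangler fig is just a routing facade in front of both systems. Here's a minimal sketch under stated assumptions: `legacy_app`, `new_service`, and the migrated path prefixes are hypothetical stand-ins for a real reverse proxy config.

```python
# Paths already migrated to the new system β€” the facade grows this
# tuple one bounded context at a time until the legacy app is empty.
MIGRATED_PREFIXES = ("/billing", "/invoices")

def legacy_app(path: str) -> str:
    # Stand-in for forwarding to the legacy monolith
    return f"legacy handled {path}"

def new_service(path: str) -> str:
    # Stand-in for forwarding to the incremental replacement
    return f"new service handled {path}"

def route(path: str) -> str:
    """Facade in front of both systems: migrated paths go to the new
    service, everything else still hits the legacy monolith."""
    if path.startswith(MIGRATED_PREFIXES):
        return new_service(path)
    return legacy_app(path)

print(route("/billing/123"))  # served by the new system
print(route("/orders/9"))     # still served by the legacy system
```

In production this facade is usually an API gateway or nginx location block, not application code, but the decision logic is the same: clients never learn which system answered.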

Event sourcing is similarly seductive. Store every state change as an immutable event. Replay history. Perfect audit logs. The operational reality: your event store becomes the most critical piece of infrastructure you own. Replaying 3 years of events to rebuild a read model takes hours. Schema migrations are now event migrations β€” every consumer must handle every historical event format. One team I know has events in 11 different schema versions. Their onboarding documentation for new engineers is 40 pages just on event version compatibility.
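The standard defence against multiplying schema versions is an upcaster: a single function that normalises every historical event to the latest shape before any consumer sees it, so handlers only ever deal with one format. The version chain below is invented for illustration; the field names are assumptions, not from the tutorial's domain.

```python
def upcast(event: dict) -> dict:
    """Normalise any historical event version to the latest (V3).
    Each step handles exactly one version hop, so a V1 event flows
    V1 -> V2 -> V3 in sequence."""
    event = dict(event)  # never mutate the stored event
    if event.get("version", 1) == 1:
        # Hypothetical hop: V1 stored a float dollar price;
        # V2 switched to integer cents
        event["price_cents"] = round(event.pop("price") * 100)
        event["version"] = 2
    if event["version"] == 2:
        # Hypothetical hop: V3 added an explicit currency,
        # defaulted for historical events
        event["currency"] = "USD"
        event["version"] = 3
    return event

old_event = {"version": 1, "product_id": "P1", "price": 549.99}
print(upcast(old_event))  # a V3-shaped event; the stored V1 is untouched
```

The upcaster centralises the 40 pages of version-compatibility knowledge into one audited module β€” consumers stay version-blind, which is the only way the scheme stays manageable past a handful of versions.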

The heuristic that's served me across a decade of production systems: reach for a pattern when you're feeling specific pain, not when you're anticipating it. Slow reads under concurrent load? Now consider CQRS. Tight coupling causing deployment coordination? Now consider event-driven architecture (EDA). Teams that architect for pain they don't have yet pay the complexity tax without ever collecting the benefit.

ArchitectureDecisionRecord.py Β· PYTHON
# io.thecodeforge β€” System Design tutorial

# Architecture Decision Records (ADRs) are the most underused practice
# in software teams. This isn't just documentation β€” it's the thing that
# stops your team from having the same architectural argument six months later
# when half the context has walked out the door with someone who quit.

# An ADR captures: what decision was made, what alternatives were rejected,
# what problem it solves, and β€” critically β€” when to revisit it.
# Format based on Michael Nygard's widely adopted template.

from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional

class ADRStatus(Enum):
    PROPOSED = "Proposed"       # Under discussion β€” not committed
    ACCEPTED = "Accepted"       # Decision made, team aligned
    DEPRECATED = "Deprecated"   # Superseded by a later ADR
    SUPERSEDED = "Superseded"   # Replaced β€” link to the new ADR

@dataclass
class Alternative:
    name: str
    why_rejected: str  # Honest, specific reason β€” not 'it was worse'

@dataclass
class ArchitectureDecisionRecord:
    adr_number: int
    title: str
    status: ADRStatus
    date_decided: date
    deciders: list[str]           # Who was in the room β€” accountability
    context: str                  # What problem forced a decision?
    decision: str                 # What did you decide? One clear statement.
    consequences_positive: list[str]   # What you gain
    consequences_negative: list[str]   # What you sacrifice β€” be honest
    alternatives_rejected: list[Alternative]
    revisit_trigger: str          # Specific condition that should re-open this decision
    superseded_by: Optional[int] = None  # ADR number that replaces this one

    def render(self) -> str:
        lines = [
            f"# ADR-{self.adr_number:04d}: {self.title}",
            f"**Status:** {self.status.value}   **Date:** {self.date_decided}   "
            f"**Deciders:** {', '.join(self.deciders)}",
            "",
            "## Context",
            self.context,
            "",
            "## Decision",
            self.decision,
            "",
            "## Consequences",
            "**Gains:**",
        ]
        for item in self.consequences_positive:
            lines.append(f"  + {item}")
        lines.append("**Costs:**")
        for item in self.consequences_negative:
            lines.append(f"  - {item}")
        lines.append("")
        lines.append("## Alternatives Rejected")
        for alt in self.alternatives_rejected:
            lines.append(f"**{alt.name}:** {alt.why_rejected}")
        lines.append("")
        lines.append("## Revisit When")
        lines.append(self.revisit_trigger)
        if self.superseded_by:
            lines.append(f"\n> **Superseded by ADR-{self.superseded_by:04d}**")
        return "\n".join(lines)


# ─── Real example: the decision NOT to go microservices ──────────────────

adr_001 = ArchitectureDecisionRecord(
    adr_number=1,
    title="Deploy as a modular monolith, not microservices",
    status=ADRStatus.ACCEPTED,
    date_decided=date(2024, 3, 15),
    deciders=["sarah-cto@example.com", "marcus-lead@example.com", "priya-senior@example.com"],
    context=(
        "We are a 6-engineer team building a B2B SaaS product with ~200 customers. "
        "Current P99 API latency is 180ms against a 500ms SLA β€” we have headroom. "
        "Peak load is 150 concurrent users. Engineering asked whether to start with "
        "microservices to 'build for scale' before our Series A."
    ),
    decision=(
        "Ship as a modular monolith with clear internal boundaries: orders/, billing/, "
        "inventory/, notifications/ as distinct Python packages with no cross-package "
        "imports except through defined interface classes. Single PostgreSQL database "
        "with schema-per-module. Single deployment unit."
    ),
    consequences_positive=[
        "Single deployment pipeline β€” ~20min CI vs estimated 2hr for 8+ microservices",
        "ACID transactions across modules β€” no saga patterns needed at this scale",
        "Local debugging with a single process β€” no distributed tracing setup",
        "New engineer onboarding: clone one repo, run one docker-compose",
    ],
    consequences_negative=[
        "A crash in one module affects all modules β€” mitigated by process-level health checks",
        "Shared database means schema migrations are global β€” coordinate carefully",
        "Harder to scale modules independently if one becomes a bottleneck",
    ],
    alternatives_rejected=[
        Alternative(
            name="Microservices from day one",
            why_rejected=(
                "At 6 engineers, we don't have the operational bandwidth to run "
                "distributed tracing, per-service deployment pipelines, service meshes, "
                "or manage eventual consistency. The complexity cost exceeds the scaling "
                "benefit at our current load. Revisit when load requires independent scaling."
            )
        ),
        Alternative(
            name="Serverless functions (AWS Lambda)",
            why_rejected=(
                "Cold start latency is incompatible with our interactive API SLA. "
                "Local development story is poor. State management for our transaction-heavy "
                "workflows requires significant workarounds."
            )
        ),
    ],
    revisit_trigger=(
        "Re-evaluate when: (a) any single module exceeds 40% of total DB CPU, "
        "(b) engineering team exceeds 20 engineers, or "
        "(c) two teams are blocked on the same deployment more than 3 times per sprint."
    )
)

if __name__ == "__main__":
    print(adr_001.render())
β–Ά Output
# ADR-0001: Deploy as a modular monolith, not microservices
**Status:** Accepted **Date:** 2024-03-15 **Deciders:** sarah-cto@example.com, marcus-lead@example.com, priya-senior@example.com

## Context
We are a 6-engineer team building a B2B SaaS product with ~200 customers. Current P99 API latency is 180ms against a 500ms SLA β€” we have headroom. Peak load is 150 concurrent users. Engineering asked whether to start with microservices to 'build for scale' before our Series A.

## Decision
Ship as a modular monolith with clear internal boundaries: orders/, billing/, inventory/, notifications/ as distinct Python packages with no cross-package imports except through defined interface classes. Single PostgreSQL database with schema-per-module. Single deployment unit.

## Consequences
**Gains:**
  + Single deployment pipeline β€” ~20min CI vs estimated 2hr for 8+ microservices
  + ACID transactions across modules β€” no saga patterns needed at this scale
  + Local debugging with a single process β€” no distributed tracing setup
  + New engineer onboarding: clone one repo, run one docker-compose
**Costs:**
  - A crash in one module affects all modules β€” mitigated by process-level health checks
  - Shared database means schema migrations are global β€” coordinate carefully
  - Harder to scale modules independently if one becomes a bottleneck

## Alternatives Rejected
**Microservices from day one:** At 6 engineers, we don't have the operational bandwidth to run distributed tracing, per-service deployment pipelines, service meshes, or manage eventual consistency. The complexity cost exceeds the scaling benefit at our current load. Revisit when load requires independent scaling.
**Serverless functions (AWS Lambda):** Cold start latency is incompatible with our interactive API SLA. Local development story is poor. State management for our transaction-heavy workflows requires significant workarounds.

## Revisit When
Re-evaluate when: (a) any single module exceeds 40% of total DB CPU, (b) engineering team exceeds 20 engineers, or (c) two teams are blocked on the same deployment more than 3 times per sprint.
⚠️
Senior Shortcut: Write the ADR Before the Architecture
The discipline of filling in 'Alternatives Rejected' and 'Revisit When' before you commit to a pattern forces you to confront the trade-offs you're accepting. If you can't articulate a specific, measurable condition under which you'd revisit the decision, you haven't thought it through enough. The ADR isn't bureaucracy β€” it's the thing that stops your team rebuilding the same system from scratch in 18 months because nobody remembers why the original decision was made.
| Attribute | Monolith | Microservices | Modular Monolith | Event-Driven |
|---|---|---|---|---|
| Deployment complexity | Single artifact, simple | Per-service CI/CD pipelines required | Single artifact, moderate | Broker infra required (Kafka/SQS) |
| Transaction handling | Native ACID β€” no extra work | Requires sagas or 2PC β€” complex | Native ACID within service boundary | Eventual consistency only |
| Failure isolation | None β€” one crash, full outage | Strong β€” faults contained per service | None β€” same process | Partial β€” producer isolated from consumers |
| Debugging / tracing | Standard debugger, one process | Requires distributed tracing (Jaeger/Zipkin) | Standard debugger, one process | Requires correlation IDs + trace tooling |
| Team scaling | Bottleneck past ~15 engineers | Scales well with Conway's Law alignment | Good up to ~20 engineers | Scales well for async workflows |
| Appropriate team size | 1-12 engineers | 20+ engineers with platform team | 6-20 engineers | Any β€” add when coupling pain is real |
| Read/write query optimisation | Same model β€” compromises both | Per-service, can optimise independently | Same model β€” compromises both | Pairs naturally with CQRS read models |
| Operational overhead | Low β€” one service to monitor | Very high β€” N services, N dashboards | Low-medium | Medium β€” queue depth + consumer lag alerts |

🎯 Key Takeaways

  • Architecture patterns solve specific pain points β€” adopt them when you feel the pain, not when you anticipate it. Paying the distributed systems complexity tax without needing the distributed systems benefits is how teams slow down.
  • The hidden cost of microservices isn't the services β€” it's losing ACID transactions. Every distributed write that used to be a conn.rollback() is now a saga with compensating transactions, a dead-letter queue, and a support runbook.
  • Reach for CQRS when your read and write access patterns are genuinely incompatible β€” different query shapes, different consistency needs, order-of-magnitude different throughput. If your reads and writes look similar, a well-indexed relational database with read replicas handles 90% of real-world scale.
  • The teams that succeed with complex architecture are the ones who write Architecture Decision Records before they build, not after. A revisit trigger β€” a specific, measurable condition β€” is what separates a decision from a commitment you can never get out of.

⚠ Common Mistakes to Avoid

  • βœ•Mistake 1: Extracting microservices by technical layer (e.g., 'frontend service', 'database service', 'API service') instead of by business domain β€” Symptom: every new feature requires changes across 4+ services simultaneously and your deployments are co-ordinated anyway, defeating the entire purpose β€” Fix: use Domain-Driven Design bounded contexts to draw service boundaries; a service owns a business capability end-to-end (orders, inventory, billing), not a technical tier.
  • βœ•Mistake 2: Publishing events without idempotency keys and assuming at-most-once delivery from Kafka or SQS β€” Symptom: duplicate order confirmations, double-charged customers, duplicate inventory decrements after consumer rebalances or retries β€” Fix: every event must carry a stable event_id; every consumer handler must implement INSERT ... ON CONFLICT DO NOTHING or a processed-event deduplication table keyed on that event_id before taking any action.
  • βœ•Mistake 3: Building a CQRS read model that the write path queries back synchronously to validate business rules β€” Symptom: you've re-introduced coupling, your read model's eventual consistency window now causes validation failures, and your architecture has the complexity of CQRS with none of the benefits β€” Fix: all business rule validation must run against the write model exclusively; the read model is for query responses only, never for command validation.

Interview Questions on This Topic

  • QYour checkout service calls inventory, payment, and shipping synchronously. The shipping service degrades under Black Friday load, adding 8 seconds to every checkout. Walk me through the architectural options for isolating this failure, and what you'd change about the call structure.
  • QWhen would you choose event sourcing over standard CQRS with a relational write store in a production system? What operational capabilities does your team need before event sourcing is viable?
  • QYou have a CQRS system where the read model projection falls 45 minutes behind during a deployment. A customer calls saying they can see their account balance from 45 minutes ago and it doesn't reflect a payment they just made. What are the three systemic changes you make to prevent this class of incident β€” not just the hotfix?

Frequently Asked Questions

When should I actually use microservices instead of a monolith?

Use microservices when independent deployability genuinely matters β€” different teams deploy different capabilities without coordinating, or specific bounded domains have radically different scaling needs. The concrete signal: if two or more teams are blocked on each other's deployments more than once per sprint, or if one module is consuming 80%+ of your infrastructure resources while others idle, you have a real problem microservices solve. Below 15 engineers or below ~500 concurrent users with uniform load, a well-structured monolith is almost always faster to build and cheaper to operate.

What's the difference between CQRS and event sourcing β€” aren't they the same thing?

They're completely separate patterns that are often used together, which causes the confusion. CQRS separates your read and write models β€” different data stores, different schemas, optimised for their respective access patterns. Event sourcing stores state as a sequence of immutable events rather than current state. You can have CQRS without event sourcing (write to Postgres, project to Redis), and you can have event sourcing without CQRS (single event log, no read model split). The rule of thumb: adopt CQRS when your read/write query patterns diverge sharply. Only add event sourcing if you specifically need full audit history, temporal queries, or event replay β€” and only if your team can handle the operational weight of an event store.

How do I handle a saga if one of the compensating transactions also fails in a microservices system?

This is the distributed systems question that most saga tutorials skip entirely. A compensation failure means you're in an inconsistent state with no automated recovery path β€” this is called a 'non-recoverable saga.' Your mitigation stack, in order: first, make all compensating transactions idempotent and retry them with exponential backoff up to a configured limit. Second, route non-recoverable failures to a dead-letter queue with a full event payload and trigger an alert. Third, build a manual remediation runbook for ops β€” the system should surface these explicitly, not silently drop them. Fourth, design your sagas so compensation is 'soft' where possible: instead of reversing a payment, issue a credit. Partial rollbacks are safer than hard reversals when compensations can fail.
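The first three rungs of that mitigation ladder can be sketched in one function: retry an idempotent compensation with exponential backoff, then park the failure on a dead-letter queue for manual remediation. Every name here is a hypothetical stand-in β€” real systems would use their broker's DLQ and an alerting hook rather than a Python list.

```python
import time

dead_letter_queue: list = []  # stand-in for a real broker DLQ

def compensate_with_retries(compensate, payload: dict,
                            max_attempts: int = 4,
                            base_delay_s: float = 0.01) -> str:
    """Run a compensating transaction with exponential backoff.
    The compensation MUST be idempotent β€” a retry may re-run a step
    that actually succeeded before its acknowledgement was lost."""
    for attempt in range(1, max_attempts + 1):
        try:
            compensate(payload)
            return "compensated"
        except Exception as exc:
            if attempt == max_attempts:
                # Non-recoverable: surface it loudly, never drop it
                dead_letter_queue.append({"payload": payload, "error": str(exc)})
                return "dead-lettered"  # alert fires; ops runbook takes over
            time.sleep(base_delay_s * 2 ** (attempt - 1))

calls = {"n": 0}
def flaky_refund(payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("payment gateway timeout")

print(compensate_with_retries(flaky_refund, {"order_id": "ORD-1"}))
# succeeds on the third attempt
```

The fourth rung β€” preferring 'soft' compensations like issuing a credit over hard reversals β€” is a design decision, not code: it shrinks the set of compensations that can fail in the first place.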

What happens to your event-driven system when the event schema changes and you have consumers on old versions?

This is where most EDA implementations accumulate quiet technical debt until it becomes a crisis. Schema changes in events break consumers silently β€” the consumer doesn't crash immediately, it either ignores unknown fields (if your deserialisation is lenient) or throws a deserialization exception and routes to the dead-letter queue. The production-hardened approach: treat event schemas like public APIs and version them explicitly β€” OrderConfirmedV1, OrderConfirmedV2. Producers publish to a new topic version; old consumers keep reading from the old topic until migrated. Use a schema registry (Confluent Schema Registry with Avro, or AWS Glue Schema Registry) to enforce compatibility rules β€” BACKWARD compatibility means new schema can read old messages, FORWARD means old schema can read new messages. Enforce backward compatibility as the minimum standard; never ship a breaking schema change to a shared topic.
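The backward-compatibility rule a schema registry enforces boils down to one check: every field the new schema requires must either exist in the old schema or carry a default. This is a toy illustration of that rule only β€” it is not the Confluent or Glue API, and the field-spec format is invented.

```python
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """BACKWARD compatibility: a reader using the new schema must be
    able to decode messages written with the old schema.
    Field spec format (hypothetical): name -> {'required': bool, 'default': ...}"""
    for name, spec in new_fields.items():
        if spec.get("required") and "default" not in spec and name not in old_fields:
            return False  # a new-schema reader would fail on an old message
    return True

v1 = {"order_id": {"required": True}, "amount": {"required": True}}
v2_ok = {**v1, "currency": {"required": True, "default": "USD"}}   # safe addition
v2_bad = {**v1, "currency": {"required": True}}                    # breaking change

print(is_backward_compatible(v1, v2_ok))   # True  β€” new field has a default
print(is_backward_compatible(v1, v2_bad))  # False β€” old messages lack the field
```

A registry runs exactly this kind of check at publish time and rejects the incompatible schema before it can reach a shared topic β€” turning a silent consumer outage into a failed CI step.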

πŸ”₯
Naren Β· Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: Backend for Frontend Pattern
Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged