Software Architecture Patterns: Trade-offs and Real Decisions
- Architecture patterns solve specific pain points - adopt them when you feel the pain, not when you anticipate it. Paying the distributed-systems complexity tax without needing the distributed-systems benefits is how teams slow down.
- The hidden cost of microservices isn't the services - it's losing ACID transactions. Every distributed write that used to be a conn.rollback() is now a saga with compensating transactions, a dead-letter queue, and a support runbook.
- Reach for CQRS when your read and write access patterns are genuinely incompatible - different query shapes, different consistency needs, order-of-magnitude different throughput. If your reads and writes look similar, a well-indexed relational database with read replicas handles 90% of real-world scale.
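To make the saga bullet concrete: a saga replaces the transaction's automatic rollback with hand-written compensation steps that undo completed work in reverse order. A minimal sketch - the step names and runner here are hypothetical illustrations, not a real library:

```python
from typing import Callable

class SagaStep:
    def __init__(self, name: str, action: Callable[[], bool], compensate: Callable[[], None]):
        self.name = name
        self.action = action          # Forward operation (e.g. reserve stock)
        self.compensate = compensate  # Undo handler (e.g. release stock)

def run_saga(steps: list[SagaStep]) -> bool:
    completed: list[SagaStep] = []
    for step in steps:
        if step.action():
            completed.append(step)
        else:
            # Failure: compensate everything already done, in reverse order.
            # This loop is the hand-written replacement for conn.rollback().
            for done in reversed(completed):
                done.compensate()
            return False
    return True

# Demo: the charge succeeds, shipping fails, so the charge is compensated (refunded).
log: list[str] = []
steps = [
    SagaStep("charge", lambda: log.append("charged") or True, lambda: log.append("refunded")),
    SagaStep("ship", lambda: False, lambda: log.append("ship-cancelled")),
]
ok = run_saga(steps)
print(ok, log)  # False ['charged', 'refunded']
```

Note what the sketch cannot give you: atomicity. Between the charge and the refund there is a window where the customer has been billed for nothing, which is exactly why every compensation handler needs idempotency and a dead-letter path.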
A startup I consulted for rewrote their monolith into 47 microservices in eight months. By month nine, they had more engineers debugging inter-service latency than building features, their P99 checkout latency had tripled, and their on-call rotation was a rotating trauma ward. The monolith had been slow. The microservices were broken in ways nobody could trace.
Architecture decisions are permanent in a way that code decisions aren't. You can refactor a function in an afternoon. Rearchitecting a distributed system costs quarters, sometimes years. The tragedy is that most teams make these decisions by copying what Netflix or Uber did - without copying the 500-engineer platform team that makes those patterns survivable. The pattern isn't the hard part. Knowing when it fits your actual constraints is the whole game.
After this, you'll be able to look at a system's requirements and map them to a concrete architectural pattern - not because you memorised a definition, but because you understand the exact failure modes each pattern introduces and the specific conditions under which those failure modes hurt you. You'll know when a monolith is the right call, when event-driven architecture saves you, when CQRS is overkill, and what questions to ask before committing to any of them.
Monolith vs. Microservices: The Decision Nobody Makes Honestly
Every team says they're 'moving to microservices for scalability.' What they usually mean is they read a Martin Fowler post and their CTO saw a conference talk. Let's be precise about what each architecture actually costs you, because the decision is irreversible for 18-36 months once you commit.
A monolith is a single deployable unit. All your business logic, data access, and API surface lives in one process. The wins are real: in-process function calls instead of network hops, a single transaction boundary, one deployment pipeline, and a debugger that actually works. The failure mode is equally real: a memory leak in your image-processing module takes down your payment API. One bad deploy nukes everything. Your release cadence is bottlenecked by the slowest team's merge.
Microservices split that single process into independently deployable services communicating over a network. You get independent deployability and isolated failure domains. You also get distributed systems problems you didn't have before: network partitions, eventual consistency, service discovery, distributed tracing, and the operational complexity of running 20+ services instead of one. I've watched teams spend three months building the infrastructure to support microservices before writing a single line of business logic.
The honest rule: if you have fewer than 15 engineers, a monolith is almost certainly correct. If your scaling bottleneck is genuinely a specific bounded domain - say, your video transcoding is hammering CPU while your API servers idle - extract that one service. Don't extract everything because you might need to scale it someday. That day may never come, and you'll have paid the distributed systems tax in advance for nothing.
# io.thecodeforge - System Design tutorial
# Scenario: E-commerce order service.
# This is the monolith version. Notice what you get for FREE
# that microservices make you rebuild from scratch.

from dataclasses import dataclass
from decimal import Decimal
import sqlite3

# Single database connection - one transaction wraps everything.
# In a distributed system, this atomicity is GONE unless you implement
# two-phase commit or saga patterns. Both are painful.
DB_PATH = "orders.db"


@dataclass
class OrderItem:
    product_id: str
    quantity: int
    unit_price: Decimal


@dataclass
class Order:
    order_id: str
    customer_id: str
    items: list[OrderItem]
    status: str = "pending"


class InventoryService:
    """In a monolith, this is just a class. In microservices, it's a network
    call that can timeout, return stale data, or be temporarily unavailable."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def reserve_stock(self, product_id: str, quantity: int) -> bool:
        cursor = self.conn.cursor()
        # This is the SELECT ... FOR UPDATE equivalent in SQLite - a serialised write.
        # In a distributed inventory service, this becomes a distributed lock:
        # Redis SETNX, or a dedicated locking service. Both add latency and failure modes.
        cursor.execute(
            "SELECT stock_count FROM inventory WHERE product_id = ?",
            (product_id,)
        )
        row = cursor.fetchone()
        if not row or row[0] < quantity:
            return False  # Not enough stock - fail the whole operation cleanly
        cursor.execute(
            "UPDATE inventory SET stock_count = stock_count - ? WHERE product_id = ?",
            (quantity, product_id)
        )
        return True


class PaymentService:
    """Again - a class. Zero network calls. Zero timeout handling needed.
    The moment this becomes a separate service, you need circuit breakers,
    retry logic, idempotency keys, and dead-letter queues.
    All of that is real engineering effort - not free."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def charge_customer(self, customer_id: str, amount: Decimal) -> bool:
        cursor = self.conn.cursor()
        cursor.execute(
            "SELECT balance FROM accounts WHERE customer_id = ?",
            (customer_id,)
        )
        row = cursor.fetchone()
        if not row or row[0] < amount:
            return False  # Insufficient funds - atomic rollback handles cleanup
        cursor.execute(
            "UPDATE accounts SET balance = balance - ? WHERE customer_id = ?",
            (float(amount), customer_id)  # sqlite3 can't bind Decimal directly - convert for the demo
        )
        return True


class OrderOrchestrator:
    """The core of the monolith advantage: ONE database transaction covers
    inventory reservation, payment, AND order creation. If payment fails,
    inventory is automatically unreserved. No compensation logic. No sagas.
    No 'sorry, your order is stuck in PENDING forever' bugs."""

    def __init__(self):
        self.conn = sqlite3.connect(DB_PATH)
        self.inventory = InventoryService(self.conn)
        self.payment = PaymentService(self.conn)

    def place_order(self, order: Order) -> dict:
        total = sum(
            item.unit_price * item.quantity for item in order.items
        )
        try:
            # BEGIN TRANSACTION - implicit in SQLite when autocommit is off
            self.conn.execute("BEGIN")

            # Step 1: Reserve all stock items.
            # If ANY item fails, we roll back everything. One line of code.
            for item in order.items:
                if not self.inventory.reserve_stock(item.product_id, item.quantity):
                    self.conn.rollback()
                    return {
                        "success": False,
                        "reason": f"Insufficient stock for product {item.product_id}"
                    }

            # Step 2: Charge the customer.
            # In a microservices world, payment already happened in a different
            # service. If inventory reservation then fails, you're issuing refunds.
            # That's a SUPPORT TICKET. Here? It's a rollback.
            if not self.payment.charge_customer(order.customer_id, total):
                self.conn.rollback()
                return {"success": False, "reason": "Payment failed"}

            # Step 3: Persist the order record.
            self.conn.execute(
                "INSERT INTO orders (order_id, customer_id, status, total) VALUES (?, ?, ?, ?)",
                (order.order_id, order.customer_id, "confirmed", str(total))
            )
            self.conn.commit()  # All three operations committed atomically
            return {"success": True, "order_id": order.order_id, "total": str(total)}
        except Exception as e:
            self.conn.rollback()  # Something unexpected? Clean slate.
            raise RuntimeError(f"Order placement failed: {e}") from e


# --- Demonstrate the flow ---
if __name__ == "__main__":
    # Normally you'd have migrations. Simplified for illustration.
    orchestrator = OrderOrchestrator()
    order = Order(
        order_id="ORD-20240315-001",
        customer_id="CUST-789",
        items=[
            OrderItem(product_id="SKU-HEADPHONES-XZ3", quantity=1, unit_price=Decimal("149.99")),
            OrderItem(product_id="SKU-USB-CABLE-C", quantity=2, unit_price=Decimal("12.50"))
        ]
    )
    result = orchestrator.place_order(order)
    print(f"Order result: {result}")
Every conn.rollback() in the monolith becomes a saga with compensating transactions once you split. Teams that don't plan for this end up with orders stuck in PENDING state indefinitely when a downstream service times out. The symptom: customer support tickets saying 'I was charged but my order never arrived.' The fix before you split: define the saga pattern and write the compensation handlers FIRST, before extracting a single service.

Event-Driven Architecture: Power, Poison, and When to Reach for It
Event-driven architecture (EDA) solves a specific problem: you need multiple systems to react to something that happened, without the producer caring who's listening. The classic alternative - direct synchronous calls - creates a dependency spider web. Your order service calls inventory, which calls the warehouse, which calls shipping. Now your order service's uptime is the product of everyone else's uptime. At 99.9% each, four services in a chain gives you 99.6% overall. That's roughly 35 hours of downtime per year from services that are individually 'highly available.'
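That compounding-availability arithmetic is worth verifying yourself. Assuming independent failures, the chain availability is simply the product of the per-service availabilities:

```python
# Availability of N services in a synchronous call chain (independent failures assumed).
availability_each = 0.999           # 99.9% per service
chain = availability_each ** 4      # Four services in series
downtime_hours = (1 - chain) * 24 * 365

print(f"chain availability: {chain:.4%}")                # ~99.60%
print(f"downtime per year: {downtime_hours:.1f} hours")  # ~35 hours
```

Each extra hop in the chain multiplies in another 0.999, so the budget erodes linearly in services but the downtime hours add up fast.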
EDA decouples that chain. The order service publishes an OrderConfirmed event. Inventory, warehouse, fraud detection, and email notification all consume it independently. The order service doesn't know they exist. New consumers can subscribe without touching the producer. This is real decoupling - not just the dependency injection kind.
The poison pill is invisible failure. In a synchronous call, if the warehouse service is down, you know immediately - your caller gets a 503. In EDA, your event is published to the queue, the producer returns success, and the warehouse consumer is silently dead. Events accumulate in the dead-letter queue. You discover it when a customer calls saying their package never shipped. I've seen this exact scenario play out in a logistics company where a misconfigured consumer group caused 6 hours of orders to pile up unprocessed while dashboards showed everything green.
Use EDA when: you have genuinely independent consumers, eventual consistency is acceptable for the domain, and you have the operational maturity to monitor queue depth and consumer lag. Don't use it for anything that needs synchronous confirmation - payment authorisation, stock reservation at checkout time, authentication.
# io.thecodeforge - System Design tutorial
# Scenario: Post-checkout event pipeline.
# The payment is already confirmed. Now we need to notify inventory,
# fraud, shipping, and email - all independently, all without
# the checkout service caring about any of them.

import json
import time
import threading
from dataclasses import dataclass, asdict
from typing import Callable
from collections import defaultdict, deque
from datetime import datetime


@dataclass
class OrderConfirmedEvent:
    event_id: str
    order_id: str
    customer_id: str
    product_ids: list[str]
    total_amount: float
    occurred_at: str  # ISO 8601 - always timestamp your events at creation time

    def to_json(self) -> str:
        return json.dumps(asdict(self))


class InMemoryEventBus:
    """
    Production equivalent: Apache Kafka, AWS SQS+SNS, or RabbitMQ.
    The contract is the same: publish once, consume independently.
    This in-memory version makes the pattern visible without Kafka setup.
    """

    def __init__(self):
        # Each topic maps to a list of independent consumer queues.
        # In Kafka terms: each consumer GROUP gets its own queue.
        # This is what enables independent consumption and replay.
        self._topics: dict[str, list[deque]] = defaultdict(list)
        self._lock = threading.Lock()

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        """Register a consumer for a topic. Each subscriber gets ALL events
        published to that topic - not round-robin like a work queue."""
        with self._lock:
            consumer_queue: deque = deque()
            self._topics[topic].append(consumer_queue)
        # Start a background thread per consumer - simulates async consumption
        thread = threading.Thread(
            target=self._consume,
            args=(consumer_queue, handler),
            daemon=True  # Dies when main thread dies - fine for a demo, use proper lifecycle in prod
        )
        thread.start()

    def publish(self, topic: str, event: dict) -> None:
        """Producer publishes and returns immediately. It does NOT wait for consumers.
        This is the core of EDA's decoupling - and the source of its observability challenges."""
        with self._lock:
            for consumer_queue in self._topics[topic]:
                consumer_queue.append(event)  # Each consumer gets its own copy
        print(f"[EventBus] Published to '{topic}': event_id={event.get('event_id')}")

    def _consume(self, queue: deque, handler: Callable) -> None:
        """Busy-polls the queue. In production, Kafka consumers use long-polling
        with configurable fetch.min.bytes and fetch.max.wait.ms for efficiency."""
        while True:
            if queue:
                event = queue.popleft()
                try:
                    handler(event)
                except Exception as e:
                    # In production: route to a dead-letter queue, alert, do NOT silently swallow.
                    # Silent swallowing here is how you get the 'orders piling up unprocessed' nightmare.
                    print(f"[EventBus] CONSUMER ERROR - routing to DLQ: {e}")
            else:
                time.sleep(0.01)  # Back off when idle - don't spin-burn CPU


# --- CONSUMERS ---
# Each of these would be a separate service in production.
# They share zero state. They don't know about each other.

class InventoryConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Inventory] Reserving stock for order {event['order_id']} "
              f"- products: {event['product_ids']}")
        # In reality: update stock counts, trigger reorder if threshold hit


class FraudDetectionConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Fraud] Running fraud score for customer {event['customer_id']} "
              f"- amount: ${event['total_amount']}")
        # In reality: call ML model, flag order if score > threshold, publish FraudFlaggedEvent


class ShippingConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Shipping] Creating shipment manifest for order {event['order_id']}")
        # In reality: call 3PL API, generate label, store tracking number


class EmailNotificationConsumer:
    def handle(self, event: dict) -> None:
        print(f"[Email] Queueing confirmation email to customer {event['customer_id']} "
              f"for order {event['order_id']}")
        # In reality: render template, call SES/SendGrid, record sent timestamp


# --- PRODUCER ---

class CheckoutService:
    """The checkout service knows about the event bus and the event schema.
    It does NOT know about inventory, fraud, shipping, or email.
    Adding a new downstream consumer requires ZERO changes here."""

    def __init__(self, event_bus: InMemoryEventBus):
        self.event_bus = event_bus

    def complete_checkout(self, order_id: str, customer_id: str,
                          product_ids: list[str], total: float) -> dict:
        # Payment authorisation would happen here synchronously BEFORE this point.
        # EDA handles post-payment side effects - not the payment itself.
        event = OrderConfirmedEvent(
            event_id=f"evt-{order_id}-{int(time.time())}",
            order_id=order_id,
            customer_id=customer_id,
            product_ids=product_ids,
            total_amount=total,
            occurred_at=datetime.utcnow().isoformat() + "Z"
        )
        self.event_bus.publish("order.confirmed", asdict(event))
        # Returns IMMEDIATELY - doesn't wait for inventory, fraud, or shipping.
        # This is your sub-100ms checkout response time.
        return {"status": "confirmed", "order_id": order_id}


# --- Wire it up ---
if __name__ == "__main__":
    bus = InMemoryEventBus()

    # Register consumers - order doesn't matter, they're all independent
    inventory = InventoryConsumer()
    fraud = FraudDetectionConsumer()
    shipping = ShippingConsumer()
    email = EmailNotificationConsumer()

    bus.subscribe("order.confirmed", inventory.handle)
    bus.subscribe("order.confirmed", fraud.handle)
    bus.subscribe("order.confirmed", shipping.handle)
    bus.subscribe("order.confirmed", email.handle)

    checkout = CheckoutService(bus)

    # Simulate a completed checkout
    result = checkout.complete_checkout(
        order_id="ORD-20240315-002",
        customer_id="CUST-456",
        product_ids=["SKU-MONITOR-27", "SKU-HDMI-CABLE"],
        total=389.98
    )
    print(f"\n[Checkout] Response: {result}")
    time.sleep(0.1)  # Give async consumers time to process in this demo
[Checkout] Response: {'status': 'confirmed', 'order_id': 'ORD-20240315-002'}
[Inventory] Reserving stock for order ORD-20240315-002 - products: ['SKU-MONITOR-27', 'SKU-HDMI-CABLE']
[Fraud] Running fraud score for customer CUST-456 - amount: $389.98
[Shipping] Creating shipment manifest for order ORD-20240315-002
[Email] Queueing confirmation email to customer CUST-456 for order ORD-20240315-002
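Notice that the producer returned before any consumer ran - which is exactly why queue monitoring matters in EDA. Consumer lag is just the gap between the broker's latest offset and the consumer group's committed offset, per partition. A sketch with hypothetical offset numbers (real deployments read these via the Kafka admin/consumer APIs):

```python
# Hypothetical offsets: end_offsets come from the broker, committed from the
# consumer group. The arithmetic is the same whatever client library you use.
end_offsets = {0: 15230, 1: 14990, 2: 18400}   # Latest offset per partition
committed   = {0: 15225, 1: 14990, 2: 17100}   # Consumer group's committed offset

LAG_ALERT_THRESHOLD = 1000  # Per partition

lag = {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}
alerting = [p for p, l in lag.items() if l > LAG_ALERT_THRESHOLD]

print(lag)       # {0: 5, 1: 0, 2: 1300}
print(alerting)  # [2]
```

A dead consumer looks exactly like partition 2 here: offsets keep advancing on the broker while the committed position freezes, and nothing errors anywhere.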
The non-negotiable alert is consumer lag. In Kafka: alert when kafka.consumer.group.lag exceeds roughly 1000 per partition. In SQS: alert on ApproximateNumberOfMessagesVisible climbing. I've seen teams discover six-figure inventory discrepancies because nobody set this alert. Set it on day one, before your first consumer hits production.

CQRS and the Read/Write Split: When Your Query Patterns Are Killing Your Writes
CQRS - Command Query Responsibility Segregation - is one of the most cargo-culted patterns in the industry. Teams add it because it sounds senior-level. Here's the honest version: you need CQRS when your read and write access patterns are so different that a single model optimised for both is actually optimised for neither.
Consider a product catalogue. Writes are rare, structured, and come from an admin tool β one product update at a time, full validation, transactional integrity. Reads are constant, require different field combinations per client (mobile wants a summary, web wants full detail, search wants keywords only), and need to be fast under high concurrency. A single relational model with indexes for everything is a compromise that serves nobody well. Your write path trips over read indexes. Your read path joins five tables to answer a query that could be a single document lookup.
CQRS splits this: the write side (Commands) uses a normalised transactional model. The read side (Queries) uses one or more denormalised read models, potentially different databases entirely - PostgreSQL for writes, Elasticsearch for search, Redis for session data. When a command succeeds, you publish an event (or a projection job runs) to update the read models. Those read models are eventually consistent - they lag behind by milliseconds to seconds.
That 'eventually consistent' part is where teams get burned. I've seen a fintech ship CQRS across their account balance domain. Write side was Postgres. Read side was a Redis projection. A deployment bug caused the projection to stop updating. For 40 minutes, customers saw stale balances. Nobody noticed until a customer tried to spend money the read model said they had but the write model had already debited. Don't use CQRS for anything where reading stale data causes a financial or safety consequence. It's a pattern for scale, not for correctness.
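A cheap guard against that silent-staleness failure is a projection freshness probe: the write side tracks the position of the last event it emitted, the projection tracks the last one it applied, and a monitor alerts when the gap grows or the projection's heartbeat goes quiet. A minimal sketch - the counters and thresholds here are hypothetical:

```python
import time

# Hypothetical freshness probe for a CQRS projection.
write_head_position = 84210               # Last event sequence number emitted by the write side
projection_position = 84210               # Last event sequence number applied by the projection
projection_heartbeat = time.time() - 95   # When the projection last applied anything (95s ago)

MAX_EVENT_GAP = 100        # Tolerated backlog before alerting
MAX_HEARTBEAT_AGE_S = 60   # A healthy projection applies (or heartbeats) every minute

event_gap = write_head_position - projection_position
heartbeat_age = time.time() - projection_heartbeat

stale = event_gap > MAX_EVENT_GAP or heartbeat_age > MAX_HEARTBEAT_AGE_S
print(f"gap={event_gap}, heartbeat_age={heartbeat_age:.0f}s, stale={stale}")
```

The heartbeat check matters because a fully stopped projection can show a zero gap during quiet periods - as in the sketch, where the positions match but the probe still flags the projection as stale.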
# io.thecodeforge - System Design tutorial
# Scenario: Product catalogue with very different read/write patterns.
# Write side: admin updates one product at a time, needs validation + atomicity.
# Read side: 10,000 RPS product page loads, need sub-10ms response.
# A single DB model with indexes for both is a bottleneck in both directions.

from dataclasses import dataclass
from typing import Optional
import json
import time

# --- WRITE SIDE (Command Model) -------------------------------------------
# Normalised. Strict validation. Transactional. Think PostgreSQL.

@dataclass
class ProductWriteModel:
    """The authoritative source of truth. Commands land here.
    This model is optimised for integrity, not query speed."""
    product_id: str
    sku: str
    name: str
    description: str
    price_cents: int  # Store money as integers - never floats. Ever.
    stock_count: int
    category_id: str
    brand_id: str
    is_active: bool = True
    version: int = 1  # Optimistic locking - detect concurrent modification


@dataclass
class UpdateProductPriceCommand:
    product_id: str
    new_price_cents: int
    updated_by: str  # Always audit who changed what in a write model
    reason: str      # Force callers to explain why - reduces lazy changes


@dataclass
class ProductPriceUpdatedEvent:
    product_id: str
    old_price_cents: int
    new_price_cents: int
    occurred_at: float


class ProductCommandHandler:
    """Handles all writes. Returns an event that downstream projections
    consume to update read models. The command handler does NOT update
    read models directly - that's the projection layer's job."""

    def __init__(self, write_store: dict):
        # In production: PostgreSQL with row-level locking
        self.write_store = write_store
        self.event_log: list[ProductPriceUpdatedEvent] = []

    def handle_update_price(
        self, command: UpdateProductPriceCommand
    ) -> ProductPriceUpdatedEvent:
        product = self.write_store.get(command.product_id)
        if not product:
            raise ValueError(f"Product {command.product_id} not found")
        if command.new_price_cents <= 0:
            raise ValueError("Price must be positive - write model enforces this invariant")

        old_price = product.price_cents
        product.price_cents = command.new_price_cents
        product.version += 1  # Increment version for optimistic lock detection

        event = ProductPriceUpdatedEvent(
            product_id=command.product_id,
            old_price_cents=old_price,
            new_price_cents=command.new_price_cents,
            occurred_at=time.time()
        )
        self.event_log.append(event)  # In production: append to Kafka/outbox table

        print(f"[CommandHandler] Price updated: {command.product_id} "
              f"${old_price/100:.2f} -> ${command.new_price_cents/100:.2f} "
              f"(v{product.version}) by {command.updated_by}")
        return event


# --- READ SIDE (Query Model) ----------------------------------------------
# Denormalised. Pre-computed. Optimised for the specific query pattern.
# NOT the source of truth - it's a projection of the write side.

@dataclass
class ProductReadModel:
    """Flattened, denormalised view of a product. Fields are exactly what
    the product page API endpoint needs - nothing more.
    In production: this lives in Redis or Elasticsearch, not Postgres."""
    product_id: str
    display_name: str
    price_display: str          # Pre-formatted: '$149.99' - computed at write time, not read time
    category_name: str          # Joined and baked in - no JOIN at query time
    brand_name: str
    is_available: bool
    search_keywords: list[str]  # Pre-extracted for search - no LIKE query at runtime


class ProductReadModelProjection:
    """Listens to events from the write side and updates read models.
    This is the 'eventually consistent' part - it runs async after the command.
    If this falls behind, reads show stale data. Monitor it."""

    def __init__(self, read_store: dict, category_lookup: dict, brand_lookup: dict):
        self.read_store = read_store
        self.category_lookup = category_lookup  # Pre-loaded reference data
        self.brand_lookup = brand_lookup

    def project_price_update(self, event: ProductPriceUpdatedEvent,
                             write_model: ProductWriteModel) -> None:
        """Rebuild only the affected fields in the read model.
        In production with Kafka: this handler is idempotent - if it runs
        twice (at-least-once delivery), the result is the same."""
        existing_read_model = self.read_store.get(event.product_id)
        if not existing_read_model:
            # First time building this read model - construct from write model
            existing_read_model = ProductReadModel(
                product_id=write_model.product_id,
                display_name=write_model.name,
                price_display="",  # Will be set below
                category_name=self.category_lookup.get(write_model.category_id, "Unknown"),
                brand_name=self.brand_lookup.get(write_model.brand_id, "Unknown"),
                is_available=write_model.is_active and write_model.stock_count > 0,
                search_keywords=write_model.name.lower().split()
            )
        # Update ONLY the price field - targeted, not a full rebuild
        existing_read_model.price_display = f"${event.new_price_cents / 100:.2f}"
        self.read_store[event.product_id] = existing_read_model
        print(f"[Projection] Read model updated for {event.product_id}: "
              f"price now {existing_read_model.price_display}")


class ProductQueryHandler:
    """Handles all reads. Hits the read store only - never touches the write DB.
    This is why reads are fast: the read store is pre-computed and query-shaped."""

    def __init__(self, read_store: dict):
        self.read_store = read_store

    def get_product_page_data(self, product_id: str) -> Optional[dict]:
        model = self.read_store.get(product_id)
        if not model:
            return None
        # Return exactly what the product page needs - no transformation at read time
        return {
            "id": model.product_id,
            "name": model.display_name,
            "price": model.price_display,
            "category": model.category_name,
            "brand": model.brand_name,
            "available": model.is_available
        }


# --- Wire it all together -------------------------------------------------
if __name__ == "__main__":
    # Simulate backing stores
    write_store = {
        "PROD-MONITOR-4K-32": ProductWriteModel(
            product_id="PROD-MONITOR-4K-32",
            sku="MON-32-4K-BK",
            name="32-inch 4K Monitor",
            description="IPS panel, 144Hz, USB-C PD 96W",
            price_cents=59999,
            stock_count=47,
            category_id="CAT-DISPLAYS",
            brand_id="BRD-DELL"
        )
    }
    read_store = {}
    category_lookup = {"CAT-DISPLAYS": "Monitors & Displays"}
    brand_lookup = {"BRD-DELL": "Dell"}

    command_handler = ProductCommandHandler(write_store)
    projection = ProductReadModelProjection(read_store, category_lookup, brand_lookup)
    query_handler = ProductQueryHandler(read_store)

    # 1. Admin issues a price update command
    update_command = UpdateProductPriceCommand(
        product_id="PROD-MONITOR-4K-32",
        new_price_cents=54999,  # Sale price
        updated_by="admin-jenna@example.com",
        reason="Black Friday promotional pricing"
    )
    event = command_handler.handle_update_price(update_command)

    # 2. Projection layer processes the event (async in prod, sync here for clarity)
    projection.project_price_update(event, write_store["PROD-MONITOR-4K-32"])

    # 3. Query handler serves the product page - hits read store only
    product_page = query_handler.get_product_page_data("PROD-MONITOR-4K-32")
    print(f"\n[QueryHandler] Product page response: {json.dumps(product_page, indent=2)}")
[Projection] Read model updated for PROD-MONITOR-4K-32: price now $549.99
[QueryHandler] Product page response: {
"id": "PROD-MONITOR-4K-32",
"name": "32-inch 4K Monitor",
"price": "$549.99",
"category": "Monitors & Displays",
"brand": "Dell",
"available": true
}
When Simpler Beats Clever: The Architecture Decisions You'll Regret
Here's the thing senior engineers know that junior engineers don't: every architectural pattern solves a real problem and creates three new ones. The art isn't picking the most sophisticated pattern - it's picking the one whose new problems you can actually manage given your team size, operational maturity, and product stage.
The strangler fig pattern is genuinely useful for migrating legacy systems incrementally, but I've seen teams spend 18 months building the 'strangling' infrastructure around a system they could have rewritten in six. The strangler makes sense when the legacy system is too risky to replace wholesale and you need continuity. It's overkill when you have a six-week-old Node.js app and you just don't like the original structure.
Event sourcing is similarly seductive. Store every state change as an immutable event. Replay history. Perfect audit logs. The operational reality: your event store becomes the most critical piece of infrastructure you own. Replaying 3 years of events to rebuild a read model takes hours. Schema migrations are now event migrations β every consumer must handle every historical event format. One team I know has events in 11 different schema versions. Their onboarding documentation for new engineers is 40 pages just on event version compatibility.
The heuristic that's served me across a decade of production systems: reach for a pattern when you're feeling specific pain, not when you're anticipating it. Slow reads under concurrent load? Now consider CQRS. Tight coupling causing deployment coordination? Now consider EDA. Teams that architect for pain they don't have yet pay the complexity tax up front without ever collecting the benefit.
# io.thecodeforge - System Design tutorial
# Architecture Decision Records (ADRs) are the most underused practice
# in software teams. This isn't just documentation - it's the thing that
# stops your team from having the same architectural argument six months later
# when half the context has walked out the door with someone who quit.
# An ADR captures: what decision was made, what alternatives were rejected,
# what problem it solves, and - critically - when to revisit it.
# Format based on Michael Nygard's widely adopted template.

from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class ADRStatus(Enum):
    PROPOSED = "Proposed"      # Under discussion - not committed
    ACCEPTED = "Accepted"      # Decision made, team aligned
    DEPRECATED = "Deprecated"  # Superseded by a later ADR
    SUPERSEDED = "Superseded"  # Replaced - link to the new ADR


@dataclass
class Alternative:
    name: str
    why_rejected: str  # Honest, specific reason - not 'it was worse'


@dataclass
class ArchitectureDecisionRecord:
    adr_number: int
    title: str
    status: ADRStatus
    date_decided: date
    deciders: list[str]  # Who was in the room - accountability
    context: str         # What problem forced a decision?
    decision: str        # What did you decide? One clear statement.
    consequences_positive: list[str]  # What you gain
    consequences_negative: list[str]  # What you sacrifice - be honest
    alternatives_rejected: list[Alternative]
    revisit_trigger: str  # Specific condition that should re-open this decision
    superseded_by: Optional[int] = None  # ADR number that replaces this one

    def render(self) -> str:
        lines = [
            f"# ADR-{self.adr_number:04d}: {self.title}",
            f"**Status:** {self.status.value} **Date:** {self.date_decided} "
            f"**Deciders:** {', '.join(self.deciders)}",
            "",
            "## Context",
            self.context,
            "",
            "## Decision",
            self.decision,
            "",
            "## Consequences",
            "**Gains:**",
        ]
        for item in self.consequences_positive:
            lines.append(f"  + {item}")
        lines.append("**Costs:**")
        for item in self.consequences_negative:
            lines.append(f"  - {item}")
        lines.append("")
        lines.append("## Alternatives Rejected")
        for alt in self.alternatives_rejected:
            lines.append(f"**{alt.name}:** {alt.why_rejected}")
        lines.append("")
        lines.append("## Revisit When")
        lines.append(self.revisit_trigger)
        if self.superseded_by:
            lines.append(f"\n> **Superseded by ADR-{self.superseded_by:04d}**")
        return "\n".join(lines)


# --- Real example: the decision NOT to go microservices -------------------
adr_001 = ArchitectureDecisionRecord(
    adr_number=1,
    title="Deploy as a modular monolith, not microservices",
    status=ADRStatus.ACCEPTED,
    date_decided=date(2024, 3, 15),
    deciders=["sarah-cto@example.com", "marcus-lead@example.com", "priya-senior@example.com"],
    context=(
        "We are a 6-engineer team building a B2B SaaS product with ~200 customers. "
        "Current P99 API latency is 180ms against a 500ms SLA - we have headroom. "
        "Peak load is 150 concurrent users. Engineering asked whether to start with "
        "microservices to 'build for scale' before our Series A."
    ),
    decision=(
        "Ship as a modular monolith with clear internal boundaries: orders/, billing/, "
        "inventory/, notifications/ as distinct Python packages with no cross-package "
        "imports except through defined interface classes. Single PostgreSQL database "
        "with schema-per-module. Single deployment unit."
    ),
    consequences_positive=[
        "Single deployment pipeline - ~20min CI vs estimated 2hr for 8+ microservices",
        "ACID transactions across modules - no saga patterns needed at this scale",
        "Local debugging with a single process - no distributed tracing setup",
        "New engineer onboarding: clone one repo, run one docker-compose",
    ],
    consequences_negative=[
        "A crash in one module affects all modules - mitigated by process-level health checks",
        "Shared database means schema migrations are global - coordinate carefully",
        "Harder to scale modules independently if one becomes a bottleneck",
    ],
    alternatives_rejected=[
        Alternative(
            name="Microservices from day one",
            why_rejected=(
                "At 6 engineers, we don't have the operational bandwidth to run "
                "distributed tracing, per-service deployment pipelines, service meshes, "
                "or manage eventual consistency. The complexity cost exceeds the scaling "
                "benefit at our current load. Revisit when load requires independent scaling."
            )
        ),
        Alternative(
            name="Serverless functions (AWS Lambda)",
            why_rejected=(
                "Cold start latency is incompatible with our interactive API SLA. "
                "Local development story is poor. State management for our transaction-heavy "
                "workflows requires significant workarounds."
            )
        ),
    ],
    revisit_trigger=(
        "Re-evaluate when: (a) any single module exceeds 40% of total DB CPU, "
        "(b) engineering team exceeds 20 engineers, or "
        "(c) two teams are blocked on the same deployment more than 3 times per sprint."
    )
)

if __name__ == "__main__":
    print(adr_001.render())
Rendered output:

```markdown
# ADR-0001: Deploy as a modular monolith, not microservices
**Status:** Accepted **Date:** 2024-03-15 **Deciders:** sarah-cto@example.com, marcus-lead@example.com, priya-senior@example.com

## Context
We are a 6-engineer team building a B2B SaaS product with ~200 customers. Current P99 API latency is 180ms against a 500ms SLA – we have headroom. Peak load is 150 concurrent users. Engineering asked whether to start with microservices to 'build for scale' before our Series A.

## Decision
Ship as a modular monolith with clear internal boundaries: orders/, billing/, inventory/, notifications/ as distinct Python packages with no cross-package imports except through defined interface classes. Single PostgreSQL database with schema-per-module. Single deployment unit.

## Consequences
**Gains:**
  + Single deployment pipeline – ~20min CI vs estimated 2hr for 8+ microservices
  + ACID transactions across modules – no saga patterns needed at this scale
  + Local debugging with a single process – no distributed tracing setup
  + New engineer onboarding: clone one repo, run one docker-compose
**Costs:**
  - A crash in one module affects all modules – mitigated by process-level health checks
  - Shared database means schema migrations are global – coordinate carefully
  - Harder to scale modules independently if one becomes a bottleneck

## Alternatives Rejected
**Microservices from day one:** At 6 engineers, we don't have the operational bandwidth to run distributed tracing, per-service deployment pipelines, service meshes, or manage eventual consistency. The complexity cost exceeds the scaling benefit at our current load. Revisit when load requires independent scaling.
**Serverless functions (AWS Lambda):** Cold start latency is incompatible with our interactive API SLA. Local development story is poor. State management for our transaction-heavy workflows requires significant workarounds.

## Revisit When
Re-evaluate when: (a) any single module exceeds 40% of total DB CPU, (b) engineering team exceeds 20 engineers, or (c) two teams are blocked on the same deployment more than 3 times per sprint.
```
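The "no cross-package imports except through defined interface classes" rule in ADR-0001 is only worth writing down if something enforces it. Below is a minimal sketch of a boundary check built on Python's `ast` module – the module names come from the ADR, but the `interfaces` subpackage convention and the function name are illustrative assumptions:

```python
import ast

# Module layout from ADR-0001; the "interfaces" subpackage is an assumed convention
MODULES = {"orders", "billing", "inventory", "notifications"}

def boundary_violations(source: str, owning_module: str) -> list[str]:
    """Return imports that reach into a sibling module without going
    through its public interfaces subpackage."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        targets = []
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        for name in targets:
            parts = name.split(".")
            top = parts[0]
            if top in MODULES and top != owning_module:
                # Cross-module imports are allowed only via <module>.interfaces
                if len(parts) < 2 or parts[1] != "interfaces":
                    violations.append(name)
    return violations

# Example: billing code importing orders internals directly vs. via the interface
bad = "from orders.models import Order"
ok = "from orders.interfaces import OrderService"
print(boundary_violations(bad, "billing"))  # ['orders.models']
print(boundary_violations(ok, "billing"))   # []
```

In a real codebase a dedicated tool such as import-linter does this job properly; the point is that the boundary is checkable in CI, not just documented.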
| Attribute | Monolith | Microservices | Modular Monolith | Event-Driven |
|---|---|---|---|---|
| Deployment complexity | Single artifact, simple | Per-service CI/CD pipelines required | Single artifact, moderate | Broker infra required (Kafka/SQS) |
| Transaction handling | Native ACID – no extra work | Requires sagas or 2PC – complex | Native ACID across modules | Eventual consistency only |
| Failure isolation | None – one crash, full outage | Strong – faults contained per service | None – same process | Partial – producer isolated from consumers |
| Debugging / tracing | Standard debugger, one process | Requires distributed tracing (Jaeger/Zipkin) | Standard debugger, one process | Requires correlation IDs + trace tooling |
| Team scaling | Bottleneck past ~15 engineers | Scales well with Conway's Law alignment | Good up to ~20 engineers | Scales well for async workflows |
| Appropriate team size | 1-12 engineers | 20+ engineers with platform team | 6-20 engineers | Any – add when coupling pain is real |
| Read/write query optimisation | Same model – compromises both | Per-service, can optimise independently | Same model – compromises both | Pairs naturally with CQRS read models |
| Operational overhead | Low – one service to monitor | Very high – N services, N dashboards | Low-medium | Medium – queue depth + consumer lag alerts |
🎯 Key Takeaways
- Architecture patterns solve specific pain points – adopt them when you feel the pain, not when you anticipate it. Paying the distributed systems complexity tax without needing the distributed systems benefits is how teams slow down.
- The hidden cost of microservices isn't the services – it's losing ACID transactions. Every distributed write that used to be a `conn.rollback()` is now a saga with compensating transactions, a dead-letter queue, and a support runbook.
- Reach for CQRS when your read and write access patterns are genuinely incompatible – different query shapes, different consistency needs, order-of-magnitude different throughput. If your reads and writes look similar, a well-indexed relational database with read replicas handles 90% of real-world scale.
- The teams that succeed with complex architecture are the ones who write Architecture Decision Records before they build, not after. A revisit trigger – a specific, measurable condition – is what separates a decision from a commitment you can never get out of.
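The `conn.rollback()` point is easiest to see in code. A minimal sketch of the monolith case, using SQLite in place of the single shared database – table names and the insufficient-stock rule are invented for illustration:

```python
import sqlite3

# One transaction spans the orders and inventory modules; a failure in
# either rolls back both - no saga, no compensating transactions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT)")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('WIDGET', 5)")
conn.commit()

def place_order(sku: str, qty: int) -> bool:
    try:
        with conn:  # commits on success, rolls back the whole block on error
            conn.execute("INSERT INTO orders (sku) VALUES (?)", (sku,))
            cur = conn.execute(
                "UPDATE inventory SET qty = qty - ? WHERE sku = ? AND qty >= ?",
                (qty, sku, qty),
            )
            if cur.rowcount == 0:
                raise RuntimeError("insufficient stock")  # triggers rollback
        return True
    except RuntimeError:
        return False

print(place_order("WIDGET", 3))   # True - both writes commit together
print(place_order("WIDGET", 99))  # False - the order insert is rolled back too
orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(orders)  # 1: the failed order left no partial state behind
```

In a microservices split, the failed second write would instead require a compensating delete of the already-committed order in another service.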
❌ Common Mistakes to Avoid
- ❌ Mistake 1: Extracting microservices by technical layer (e.g., 'frontend service', 'database service', 'API service') instead of by business domain. **Symptom:** every new feature requires changes across 4+ services simultaneously and your deployments are co-ordinated anyway, defeating the entire purpose. **Fix:** use Domain-Driven Design bounded contexts to draw service boundaries; a service owns a business capability end-to-end (orders, inventory, billing), not a technical tier.
- ❌ Mistake 2: Publishing events without idempotency keys and assuming at-most-once delivery from Kafka or SQS. **Symptom:** duplicate order confirmations, double-charged customers, duplicate inventory decrements after consumer rebalances or retries. **Fix:** every event must carry a stable `event_id`; every consumer handler must implement `INSERT ... ON CONFLICT DO NOTHING` or a processed-event deduplication table keyed on that `event_id` before taking any action.
- ❌ Mistake 3: Building a CQRS read model that the write path queries back synchronously to validate business rules. **Symptom:** you've re-introduced coupling, your read model's eventual consistency window now causes validation failures, and your architecture has the complexity of CQRS with none of the benefits. **Fix:** all business rule validation must run against the write model exclusively; the read model is for query responses only, never for command validation.
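The fix for Mistake 2 fits in a few lines. In this sketch SQLite's `INSERT OR IGNORE` stands in for Postgres's `INSERT ... ON CONFLICT DO NOTHING`; the table and event names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE confirmations (event_id TEXT, order_id TEXT)")

def handle_order_confirmed(event: dict) -> bool:
    """Process the event at most once per event_id.
    Returns False when this event_id was already handled (duplicate delivery)."""
    with conn:
        # INSERT OR IGNORE plays the role of ON CONFLICT DO NOTHING here
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["event_id"],),
        )
        if cur.rowcount == 0:
            return False  # dedup row already existed: do nothing
        # Side effect runs in the same transaction as the dedup insert,
        # so a crash can never record the event without its effect
        conn.execute(
            "INSERT INTO confirmations (event_id, order_id) VALUES (?, ?)",
            (event["event_id"], event["order_id"]),
        )
    return True

event = {"event_id": "evt-42", "order_id": "ord-7"}
print(handle_order_confirmed(event))  # True - first delivery processed
print(handle_order_confirmed(event))  # False - redelivery deduplicated
count = conn.execute("SELECT COUNT(*) FROM confirmations").fetchone()[0]
print(count)  # 1
```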
Interview Questions on This Topic
- Q: Your checkout service calls inventory, payment, and shipping synchronously. The shipping service degrades under Black Friday load, adding 8 seconds to every checkout. Walk me through the architectural options for isolating this failure, and what you'd change about the call structure.
- Q: When would you choose event sourcing over standard CQRS with a relational write store in a production system? What operational capabilities does your team need before event sourcing is viable?
- Q: You have a CQRS system where the read model projection falls 45 minutes behind during a deployment. A customer calls saying they can see their account balance from 45 minutes ago and it doesn't reflect a payment they just made. What are the three systemic changes you make to prevent this class of incident – not just the hotfix?
Frequently Asked Questions
When should I actually use microservices instead of a monolith?
Use microservices when independent deployability genuinely matters – different teams deploy different capabilities without coordinating, or specific bounded domains have radically different scaling needs. The concrete signal: if two or more teams are blocked on each other's deployments more than once per sprint, or if one module is consuming 80%+ of your infrastructure resources while others idle, you have a real problem microservices solve. Below 15 engineers or below ~500 concurrent users with uniform load, a well-structured monolith is almost always faster to build and cheaper to operate.
What's the difference between CQRS and event sourcing – aren't they the same thing?
They're completely separate patterns that are often used together, which causes the confusion. CQRS separates your read and write models – different data stores, different schemas, optimised for their respective access patterns. Event sourcing stores state as a sequence of immutable events rather than current state. You can have CQRS without event sourcing (write to Postgres, project to Redis), and you can have event sourcing without CQRS (single event log, no read model split). The rule of thumb: adopt CQRS when your read/write query patterns diverge sharply. Only add event sourcing if you specifically need full audit history, temporal queries, or event replay – and only if your team can handle the operational weight of an event store.
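The "write to Postgres, project to Redis" combination can be sketched in a few lines – here SQLite stands in for the relational write store and a plain dict for the Redis read model, and all names are illustrative:

```python
import sqlite3

# Command side: a normalised, transactional write model
write_db = sqlite3.connect(":memory:")
write_db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, customer TEXT, total REAL)")

# Query side: a denormalised per-customer summary (Redis stand-in)
read_model: dict[str, dict] = {}

def create_order(order_id: str, customer: str, total: float) -> None:
    with write_db:  # command handled transactionally against the write model
        write_db.execute("INSERT INTO orders VALUES (?, ?, ?)",
                         (order_id, customer, total))
    project_order_created(customer)  # in production this step would be async

def project_order_created(customer: str) -> None:
    # Shape the data for its read pattern instead of re-querying on every read
    row = write_db.execute(
        "SELECT COUNT(*), COALESCE(SUM(total), 0) FROM orders WHERE customer = ?",
        (customer,),
    ).fetchone()
    read_model[customer] = {"order_count": row[0], "lifetime_value": row[1]}

create_order("o1", "acme", 120.0)
create_order("o2", "acme", 80.0)
print(read_model["acme"])  # {'order_count': 2, 'lifetime_value': 200.0}
```

Note there is no event log anywhere: this is CQRS without event sourcing, exactly the combination described above.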
How do I handle a saga if one of the compensating transactions also fails in a microservices system?
This is the distributed systems question that most saga tutorials skip entirely. A compensation failure means you're in an inconsistent state with no automated recovery path – this is called a 'non-recoverable saga.' Your mitigation stack, in order: first, make all compensating transactions idempotent and retry them with exponential backoff up to a configured limit. Second, route non-recoverable failures to a dead-letter queue with a full event payload and trigger an alert. Third, build a manual remediation runbook for ops – the system should surface these explicitly, not silently drop them. Fourth, design your sagas so compensation is 'soft' where possible: instead of reversing a payment, issue a credit. Partial rollbacks are safer than hard reversals when compensations can fail.
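The first two layers of that mitigation stack – bounded retries with exponential backoff, then the dead-letter queue – can be sketched as follows (the queue, delays, and function names are illustrative, not a real saga framework):

```python
import time

dead_letter_queue: list[dict] = []

def run_compensation(compensate, event: dict, max_attempts: int = 4,
                     base_delay: float = 0.01) -> bool:
    """Retry an idempotent compensating transaction with exponential backoff;
    route non-recoverable failures to the DLQ for manual remediation."""
    for attempt in range(max_attempts):
        try:
            compensate(event)
            return True
        except Exception as exc:
            if attempt == max_attempts - 1:
                # Non-recoverable saga: surface it explicitly, never drop it
                dead_letter_queue.append({"event": event, "error": str(exc)})
                return False
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s, ...
    return False

calls = {"n": 0}
def flaky_refund(event):  # transient failure: succeeds on the third attempt
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("payment provider timeout")

def always_fails(event):  # non-recoverable failure
    raise ConnectionError("provider down")

print(run_compensation(flaky_refund, {"saga_id": "s-1"}))   # True
print(run_compensation(always_fails, {"saga_id": "s-2"}))   # False
print(len(dead_letter_queue))  # 1
```

The compensations themselves must be idempotent (see the deduplication pattern above) so that a retry after a partial success does not apply the reversal twice.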
What happens to your event-driven system when the event schema changes and you have consumers on old versions?
This is where most EDA implementations accumulate quiet technical debt until it becomes a crisis. Schema changes in events break consumers silently – the consumer doesn't crash immediately, it either ignores unknown fields (if your deserialisation is lenient) or throws a deserialisation exception and routes to the dead-letter queue. The production-hardened approach: treat event schemas like public APIs and version them explicitly – `OrderConfirmedV1`, `OrderConfirmedV2`. Producers publish to a new topic version; old consumers keep reading from the old topic until migrated. Use a schema registry (Confluent Schema Registry with Avro, or AWS Glue Schema Registry) to enforce compatibility rules – BACKWARD compatibility means the new schema can read old messages, FORWARD means the old schema can read new messages. Enforce backward compatibility as the minimum standard; never ship a breaking schema change to a shared topic.
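Explicit version dispatch can be sketched without a schema registry. The hypothetical consumer below upgrades `OrderConfirmedV1` payloads to the V2 shape by defaulting the field V1 lacks – the `currency` field and its `"USD"` default are invented for illustration:

```python
import json
from dataclasses import dataclass

@dataclass
class OrderConfirmedV2:  # current shape; V1 lacked the currency field
    order_id: str
    total: float
    currency: str

def parse_event(raw: str) -> OrderConfirmedV2:
    """Dispatch on an explicit version field instead of guessing from shape.
    Unknown versions fail loudly; in production they'd route to the DLQ."""
    msg = json.loads(raw)
    version = msg.get("type")
    payload = msg["payload"]
    if version == "OrderConfirmedV1":
        # Upgrade path: default the field the old schema did not carry
        return OrderConfirmedV2(payload["order_id"], payload["total"],
                                currency="USD")
    if version == "OrderConfirmedV2":
        return OrderConfirmedV2(**payload)
    raise ValueError(f"unknown event type: {version}")

old = '{"type": "OrderConfirmedV1", "payload": {"order_id": "o1", "total": 9.5}}'
new = '{"type": "OrderConfirmedV2", "payload": {"order_id": "o2", "total": 4.0, "currency": "EUR"}}'
print(parse_event(old).currency)  # USD - defaulted for the old version
print(parse_event(new).currency)  # EUR
```

This is the consumer-side half of the story; the registry's compatibility checks exist to guarantee that such an upgrade path always exists before a producer ships the new schema.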
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.