Mid-level 9 min · March 06, 2026

E-commerce System Design — Flash Sale Race Conditions

Q: What is the most common cause of overselling in e-commerce platforms?

The most common cause is a race condition in inventory decrement: reading the current quantity, then updating it. This is a classic read-modify-write race. The fix is to use an atomic update (e.g., `UPDATE inventory SET quantity = quantity - 1 WHERE id = ? AND quantity > 0`) and check affected rows.

Q: How do you ensure an order is not lost after payment succeeds?

Use the outbox pattern: the order service writes the order event to a database table (outbox) within the same transaction as the order creation. A separate worker reads from the outbox and publishes the event to a message queue. This ensures the order is durable even if the queue is down temporarily. The worker retries until the message is published.

Q: Should I use a cache for inventory data?

Inventory data is consistency-critical: never make a sell decision from a cached stock count that may be stale, because that leads to overselling. However, you can use a cache as a fast gate: decrement in Redis atomically, then asynchronously update the database. Serve checkout only from Redis, and reconcile with the database frequently. For product page display (e.g., an 'in stock' indicator), you can cache a boolean flag with a short TTL (e.g., 30 seconds) to reduce load.

Q: What is the difference between a distributed transaction (2PC) and a Saga?

2PC (two-phase commit) involves a coordinator that locks resources across all participants until all agree to commit. It provides strong consistency but blocks resources, scales poorly, and is fragile under failure: if the coordinator fails after participants have voted, they stay locked until it recovers. A Saga splits the transaction into a sequence of local transactions, each with a compensating action. If a step fails, the saga runs compensation steps to undo previous actions. Sagas provide eventual consistency, are more scalable, and are better suited for microservices. However, you must design idempotent compensating actions and handle partial failures yourself.

SELECT-then-UPDATE inventory causes double-bookings under load.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

Production tested · Last updated May 24, 2026 · 1,554 articles, all by Naren


⚡Quick Answer

E-commerce platforms are distributed systems managing product discovery, cart, checkout, payments, and inventory under high concurrency
Key components: product catalog service, cart service, checkout orchestration, payment gateway, inventory service
Performance insight: Product search must return in <200ms; use Elasticsearch with caching
Production insight: Without idempotency in payments, a single retry can charge a customer twice — use idempotency keys
Biggest mistake: Keeping catalog and inventory in the same service; their access patterns differ (read-heavy vs. consistency-critical writes), so coupling them leads to checkout failures

✦ Definition~90s read

What is Design an E-commerce Platform?

This article dissects the architectural challenges of building an e-commerce platform that survives flash sales—those high-traffic, limited-time events where demand spikes 10x–100x in seconds. Naive designs fail because they treat inventory, cart, checkout, and payment as a single synchronous transaction, creating race conditions where two customers grab the last item, orders are charged but never fulfilled, or the database collapses under write contention.

You'll learn why optimistic locking, queue-based decoupling, and idempotency keys are non-negotiable, and how companies like Shopify and Amazon separate read-heavy catalog services from write-constrained inventory and payment systems to maintain consistency without sacrificing throughput. The article walks through each core component (search, cart, checkout, payment) and the specific race conditions each faces, from stale reads in Elasticsearch to duplicate charges through Stripe.

It then maps scaling strategies (sharding, caching, circuit breakers) against real trade-offs: eventual consistency for search vs. strict serializability for payments, or why you might accept stale inventory counts for 50ms response times. By the end, you'll have a mental model for designing a system that doesn't just handle load, but guarantees correctness under pressure—without over-engineering for traffic you'll never see.

Plain-English First

Imagine you're running the world's biggest flea market. You've got thousands of sellers, millions of buyers, and everyone wants to browse, pick something, pay, and get it delivered — all at the same time, without chaos. Building an e-commerce platform is exactly that: designing the invisible plumbing that makes sure the right product gets to the right buyer, the money moves safely, and nothing crashes when a flash sale hits at midnight.

An e-commerce platform isn’t just a shopping cart with a database. It’s a distributed system that must survive flash sales, maintain consistent checkout states, and keep search fast under millions of SKUs. Without deliberate architectural separation and decoupled services, your naive monolith will collapse under the first real traffic spike—losing orders, payments, and trust.

Why Flash Sales Break Naive E-Commerce Systems

A flash sale is a high-concurrency event where a limited inventory is offered at a steep discount for a short window. The core mechanic is a race condition: thousands of users compete for the same few items, and the system must correctly decrement inventory exactly once per successful purchase. Without careful design, overselling (selling more than available stock) or underselling (rejecting valid purchases) is inevitable.

In practice, the key properties that matter are atomic inventory updates, idempotent payment processing, and graceful degradation under load. A typical flash sale sees 100x normal traffic within seconds. The system must handle concurrent writes to the same product row — naive locking (e.g., database row locks) becomes a bottleneck, while optimistic locking with retries can cause cascading failures. Distributed locks (Redis, ZooKeeper) or queue-based throttling are common solutions, but each introduces trade-offs in consistency and latency.

Use a dedicated flash sale architecture when the expected concurrency exceeds what your normal checkout pipeline can handle — roughly >10 concurrent requests per product SKU. It matters because a single oversell incident can trigger chargebacks, customer trust erosion, and regulatory fines. Real systems like Alibaba's Double 11 or Amazon's Lightning Deals rely on pre-allocated inventory pools, request queuing, and idempotency keys to survive the stampede.

Overselling Is Not a Bug — It's a Design Failure

Overselling isn't caused by bad code; it's caused by assuming reads are safe. Always validate inventory at the point of write, not read.
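Validating at the point of write can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 as a stand-in for the production database; the `inventory` table layout and the `try_decrement` helper are assumptions made for the example, not part of the system described here.

```python
import sqlite3

def try_decrement(conn: sqlite3.Connection, product_id: int) -> bool:
    """Atomically take one unit of stock; return True only if a unit was taken."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE inventory SET quantity = quantity - 1 "
            "WHERE product_id = ? AND quantity > 0",
            (product_id,),
        )
    # rowcount is 1 if the guarded UPDATE matched the row, 0 if stock was already 0
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES (42, 2)")
conn.commit()

results = [try_decrement(conn, 42) for _ in range(3)]
print(results)  # [True, True, False]
```

The check and the write happen in one statement, so there is no window in which another request can observe a stale quantity. The caller decides success from the affected-row count, never from an earlier read.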

Production Insight

A major retailer used database row locks for flash sale inventory — during a 50k concurrent burst, the database connection pool saturated, causing a 30-second timeout cascade that took down the entire checkout.

Symptom: connection pool exhaustion, not overselling — the system became unavailable before it could corrupt data.

Rule of thumb: never hold a database lock across a network call (payment, notification). Use a dedicated inventory service with atomic decrement and a timeout circuit breaker.

Key Takeaway

Atomic inventory decrement is non-negotiable: use Redis DECR (checking that the result is not negative) or database row-level locking with retry limits.

Idempotency keys on every purchase request prevent double-charges when clients retry.

Queue incoming requests and process inventory updates asynchronously — synchronous processing at flash sale scale is a recipe for cascading failure.
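As a rough sketch of the queueing idea, the example below admits purchase requests into a bounded in-process queue and lets a single worker apply them in arrival order. In production the queue would typically be a message broker; the function name `run_flash_sale`, the `max_pending` limit, and the buyer IDs are illustrative assumptions.

```python
import queue
import threading

def run_flash_sale(stock: int, buyers: list[str],
                   max_pending: int = 1000) -> tuple[list[str], list[str]]:
    """Admit requests into a bounded queue; one worker applies them in order."""
    pending: queue.Queue = queue.Queue(maxsize=max_pending)
    confirmed: list[str] = []
    rejected: list[str] = []

    def worker() -> None:
        nonlocal stock
        while True:
            buyer = pending.get()
            if buyer is None:          # sentinel: no more requests
                break
            if stock > 0:              # only the worker touches stock, so no race
                stock -= 1
                confirmed.append(buyer)
            else:
                rejected.append(buyer)

    t = threading.Thread(target=worker)
    t.start()
    for buyer in buyers:
        try:
            pending.put_nowait(buyer)  # shed load instead of blocking when full
        except queue.Full:
            rejected.append(buyer)
    pending.put(None)
    t.join()
    return confirmed, rejected

confirmed, rejected = run_flash_sale(stock=3, buyers=[f"user-{i}" for i in range(10)])
print(len(confirmed), len(rejected))  # 3 7
```

Because stock is mutated by exactly one consumer, correctness does not depend on locks. The bounded queue turns overload into fast rejections rather than a pile-up of blocked requests.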

[Figure: Flash Sale Race Condition Flow]

Core Components & Service Separation

An e-commerce platform is a set of loosely coupled services, each responsible for one domain. The six core services are:

Product Catalog Service: Manages product metadata (name, description, images, categories, prices). This is read-heavy and benefits from caching and Elasticsearch.
Cart Service: Manages user sessions, add/remove items, coupon application. It's write-heavy for the current session but read-only for historical data.
Checkout Orchestrator: Coordinates the actual purchase — validates cart, locks inventory, calls payment gateway, creates order. This is the most failure-sensitive service.
Inventory Service: Tracks stock levels across warehouses, reserves items during checkout, handles restocks.
Payment Service: Interacts with external gateways (Stripe, PayPal), stores idempotency keys, handles retries.
Order Service: Records completed orders, sends notifications, manages returns.

The mistake everyone makes is bundling inventory with the catalog. These have completely different access patterns: catalog is read-heavy and stale-ok, inventory is write-heavy and consistency-critical. Keep them separate from the start.

Service Boundaries Mental Model

Catalog: Fast reads, eventual consistency acceptable.
Cart: Temporary state, can be lost without financial impact.
Inventory: Strong consistency, cannot oversell.
Payment: Must be idempotent and auditable.
Order: Immutable after creation, source of truth.

Production Insight

A single-service monolith works for <10k products and <100 concurrent users.

Once you hit 100k products and 1k concurrent users, the catalog's read load kills inventory write throughput.

Rule: Strip inventory from catalog at the very start — you'll never untangle it later.

Key Takeaway

Service separation is about access pattern differences.

Read-heavy vs write-heavy vs consistency-critical — each belongs in its own service.

Never share a database between catalog and inventory.

When to Split Services

If: Product count < 10k, users < 100 concurrent → Use: Monolith with separate modules is fine. Focus on clean code.

If: Product count > 50k, traffic spikes expected → Use: Split catalog and inventory immediately. Use Elasticsearch for search.

If: Multiple payment gateways or complex promotions → Use: Split checkout orchestrator from order service for independent scaling.

Product Search & Catalog Performance

Product search is the gateway to purchase. Users expect results in under 200ms, with filters for category, price range, rating, and sorting by relevance or newest. Achieving this at scale means you cannot query the primary database directly.

Architecture:
- Use Elasticsearch as the search index. It supports full-text search, faceted aggregation, and fuzzy matching out of the box.
- Keep a read-through cache (Redis) for product detail pages (PDP). The cache key is product_id:locale:version.
- For autocomplete, use a prefix-based Trie in memory or Elasticsearch's completion suggester.

The search index is built from the product catalog database using change data capture (CDC) with Debezium. Updates propagate within seconds — eventual consistency is acceptable here because a stale product in search is better than a failing search.

Caveats:
- Sorting by combined fields (e.g., relevance * price) requires careful mapping in Elasticsearch.
- Facet counts can be expensive; cache them separately and invalidate on product updates.
- Avoid deep pagination (>100th page); use search_after instead of from/size.
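To illustrate the pagination caveat, here is a small sketch that builds Elasticsearch request bodies using `search_after` rather than `from`/`size`. It only constructs the JSON body and does not call a cluster. The helper name `build_search_page` and the `product_id` tiebreaker field are assumptions for the example (the mapping in this article does not define `product_id`).

```python
from typing import Any, Optional

def build_search_page(category_id: str, page_size: int = 24,
                      search_after: Optional[list[Any]] = None) -> dict[str, Any]:
    """Build an Elasticsearch request body that pages with search_after."""
    body: dict[str, Any] = {
        "size": page_size,
        "query": {"term": {"category_id": category_id}},
        # A unique tiebreaker keeps the sort order stable between pages.
        # "product_id" is an assumed keyword field, not part of the article's mapping.
        "sort": [{"created_at": "desc"}, {"product_id": "asc"}],
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page
        body["search_after"] = search_after
    return body

first_page = build_search_page("shoes")
next_page = build_search_page("shoes", search_after=[1767225600000, "SKU-42"])
print("search_after" in first_page, next_page["search_after"])
# False [1767225600000, 'SKU-42']
```

Each page request carries the sort values of the last hit from the previous page, so the cluster never has to collect and discard thousands of earlier hits the way a large `from` offset would force it to.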

elasticsearch-mapping.json · JSON

{
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "standard" },
      "description": { "type": "text", "analyzer": "english" },
      "category_id": { "type": "keyword" },
      "price": { "type": "float" },
      "rating": { "type": "float" },
      "stock_status": { "type": "keyword" },
      "created_at": { "type": "date" }
    }
  },
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 2
  }
}

Search Performance Trap

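Using 'wildcard' queries for partial matching is a common mistake. Wildcard queries in Elasticsearch are slow, especially with a leading wildcard, because they must scan a large share of the terms in the index. Use 'match_phrase_prefix' on a field mapped with 'index_prefixes' instead; it is typically much faster, though the exact gain depends on your data.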

Production Insight

A flash sale can cause the product cache to miss for newly popular items.

If the cache isn't warmed, the catalog DB sees a thundering herd of queries.

Rule: Pre-warm cache with top-1000 products before any known traffic spike.

Invest in a circuit breaker on catalog DB reads — 500 errors are better than a 30-minute DB outage.

Key Takeaway

Search is not a database query: serve it from Elasticsearch, kept in sync via CDC.

Cache product pages aggressively, but invalidate on update.

Pre-warm, never let a spike hit the DB cold.

Cart & Checkout Consistency

The cart seems innocuous — items, quantities, maybe a promo code. But checkout is where distributed systems meet financial reality. The cart state must be consistent while the checkout orchestrator runs a mini-saga across inventory, payment, and order services.

Cart Design:
- Store cart in Redis as a hash with TTL (e.g., 24 hours). This is fast and transient.
- On checkout initiation, move cart data to a persistent checkout session in PostgreSQL.
- Lock the cart to prevent modifications during checkout.

Checkout Orchestrator Steps:
1. Validate cart (prices, stock, promo codes).
2. Reserve inventory items (atomic decrement in inventory service).
3. Call payment gateway with idempotency key.
4. On payment success, create order record.
5. If payment fails, release inventory reservations (compensating transaction).

This is the Saga pattern: a sequence of local transactions with compensating actions. Avoid distributed transactions (2PC) — they don't scale and break across services.

Consistency Guarantee:
- Use an outbox pattern: the order service writes an event to a database table, and a background worker publishes it reliably to a message queue.
- This ensures no order is lost even if the message broker is down.

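io/thecodeforge/checkout/CheckoutOrchestrator.java · JAVA

package io.thecodeforge.checkout;

// Collaborator types (CartService, InventoryService, PaymentService, OrderService)
// and the helpers generateIdempotencyKey/validatePrices are assumed to exist elsewhere.
public class CheckoutOrchestrator {
    private final CartService cartService;
    private final InventoryService inventoryService;
    private final PaymentService paymentService;
    private final OrderService orderService;

    public CheckoutOrchestrator(CartService cartService, InventoryService inventoryService,
                                PaymentService paymentService, OrderService orderService) {
        this.cartService = cartService;
        this.inventoryService = inventoryService;
        this.paymentService = paymentService;
        this.orderService = orderService;
    }

    // Saga: each step is a local transaction with a compensating action
    public OrderResult checkout(CheckoutRequest request) {
        // Derived only from stable inputs so that a retry reuses the same key
        String idempotencyKey = generateIdempotencyKey(request.userId, request.cartId);

        // 1. Validate cart (locked so it cannot change mid-checkout)
        Cart cart = cartService.getLockedCart(request.cartId);
        validatePrices(cart);

        // 2. Reserve inventory
        boolean reserved = inventoryService.reserveItems(cart.items());
        if (!reserved) {
            return OrderResult.failure("Out of stock");
        }

        // 3. Charge payment
        PaymentResult payment = paymentService.charge(
            request.paymentToken,
            cart.total(),
            idempotencyKey
        );
        if (!payment.success()) {
            // Compensating action for step 2
            inventoryService.releaseReservation(cart.items());
            return OrderResult.failure("Payment declined");
        }

        // 4. Create order (written together with its outbox event).
        // If this throws, the customer has already been charged: do not release
        // stock here; surface the failure for reconciliation instead.
        Order order = orderService.createOrder(cart, payment.transactionId);
        return OrderResult.success(order);
    }
}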

Idempotency Key Mental Model

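Generate a deterministic key: hash(userId + cartId + the timestamp at which the checkout session was created). The timestamp must be fixed once per checkout attempt; recomputing it on each retry would produce a new key and defeat the purpose.
The payment gateway must return the original result for a duplicate key with the same payload, and reject the key if the payload differs.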
Store key in a database table with a unique constraint to enforce idempotency.
If the request times out, retry with the same key — no double charge.
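The points above can be sketched in a few lines. This is an illustration only: it uses a checkout session ID as the stable third input to the hash (an assumption, chosen because it does not change between retries), and an in-memory SQLite table named `idempotency_keys` stands in for the real store.

```python
import hashlib
import sqlite3

def make_idempotency_key(user_id: str, cart_id: str, checkout_session_id: str) -> str:
    """Same inputs always give the same key, so a retry reuses it."""
    raw = f"{user_id}:{cart_id}:{checkout_session_id}".encode()
    return hashlib.sha256(raw).hexdigest()

def claim_key(conn: sqlite3.Connection, key: str, payload_hash: str) -> bool:
    """Return True for the first request with this key, False for a retry."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO idempotency_keys (idem_key, payload_hash) VALUES (?, ?)",
                (key, payload_hash),
            )
        return True
    except sqlite3.IntegrityError:  # unique constraint: key already claimed
        return False

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE idempotency_keys (idem_key TEXT PRIMARY KEY, payload_hash TEXT NOT NULL)"
)

key = make_idempotency_key("user-1", "cart-9", "session-abc")
print(claim_key(conn, key, "h1"), claim_key(conn, key, "h1"))  # True False
```

When `claim_key` returns False, the caller should look up and return the stored outcome of the first request instead of charging again. Storing the payload hash alongside the key makes it possible to detect a reused key with a different payload.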

Production Insight

The most common checkout failure is not handling the gap between payment success and order creation.

If the payment callback is processed but order creation fails, the customer is charged but gets nothing.

Rule: Use an outbox pattern to make order creation durable before acknowledging payment.

Always log payment callback with raw payload for manual reconciliation.
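The outbox rule can be illustrated with a small self-contained sketch. SQLite stands in for the order database, and the table names (`orders`, `outbox`), the event shape, and the `publish` callback are assumptions made for the example rather than the article's actual schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")

def create_order(order_id: str, total_cents: int) -> None:
    """Write the order and its event in ONE transaction: both commit or neither."""
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        event = {"type": "OrderCreated", "order_id": order_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))

def relay_once(publish) -> int:
    """Publish pending events; a row is marked only after publish() succeeds."""
    sent = 0
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        try:
            publish(payload)
        except ConnectionError:
            break  # broker down: leave the row for the next run
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent

def broker_down(payload: str) -> None:
    raise ConnectionError("queue unavailable")

delivered: list[str] = []
create_order("order-1", 4999)
print(relay_once(broker_down), relay_once(delivered.append))  # 0 1
```

If the relay crashes after publishing but before marking the row, the event is sent again on the next run. Delivery is therefore at-least-once, and consumers need to deduplicate.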

Key Takeaway

Checkout is a saga — not a distributed transaction.

Idempotency keys prevent duplicate charges.

Outbox pattern guarantees order durability.

Payment System Reliability

Payment is the most critical subsystem because it moves real money. A successful payment must result in exactly one order and one charge. Payment systems at scale rely on three pillars: idempotency, retry with backoff, and circuit breakers around the gateway.

Idempotency:
- Before calling a payment gateway, generate a unique idempotency key (e.g., UUID per order attempt).
- Store the key and the request payload in a database table with a unique constraint.
- On a timeout or network error, retry with the same key. The gateway returns the original result.

Retry Strategy:
- Use exponential backoff with jitter: first retry after 1s, then 2s, 4s, up to 60s max.
- After 3 retries, escalate to a dead-letter queue for manual review.
- Monitor the rate of payment timeouts; a sudden spike may indicate a gateway issue.

Failure Modes:
- Dual charge: Happens when idempotency is missing or the gateway doesn't support it. Always use a payment provider that supports idempotency keys (Stripe, Braintree, Adyen).
- Silent failures: Payment fails but the customer doesn't get an error; the system marks the order as pending. Use a reconciliation job that compares pending orders with gateway transactions daily.
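The reconciliation job mentioned for silent failures could look roughly like this. It is a pure-function sketch: the input shapes (order ID to idempotency key, and idempotency key to gateway status) and the status strings are assumptions, and a real job would fetch them from the order database and the gateway's reporting API.

```python
def reconcile(pending_orders: dict[str, str],
              gateway_status: dict[str, str]) -> dict[str, list[str]]:
    """Compare locally pending orders with what the gateway reports.

    pending_orders maps order_id -> idempotency key;
    gateway_status maps idempotency key -> "succeeded" or "failed".
    """
    actions: dict[str, list[str]] = {"confirm": [], "cancel": [], "investigate": []}
    for order_id, key in sorted(pending_orders.items()):
        status = gateway_status.get(key)
        if status == "succeeded":
            actions["confirm"].append(order_id)      # charged: finish the order
        elif status == "failed":
            actions["cancel"].append(order_id)       # not charged: release stock
        else:
            actions["investigate"].append(order_id)  # gateway has no record
    return actions

report = reconcile(
    {"order-1": "key-a", "order-2": "key-b", "order-3": "key-c"},
    {"key-a": "succeeded", "key-b": "failed"},
)
print(report)
# {'confirm': ['order-1'], 'cancel': ['order-2'], 'investigate': ['order-3']}
```

Orders the gateway has never heard of go to a separate bucket instead of being cancelled automatically, since a missing record may only mean the gateway's report is lagging.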

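io/thecodeforge/payment/PaymentService.java · JAVA

package io.thecodeforge.payment;

import java.util.concurrent.TimeoutException;

// PaymentGateway, DeadLetterQueue, PaymentResult, PaymentFailureEvent and
// NetworkException are assumed to be defined elsewhere in the codebase.
public class PaymentService {
    private static final int MAX_RETRIES = 3;
    private static final long BASE_DELAY_MS = 1000;
    private static final long MAX_DELAY_MS = 60_000;

    private final PaymentGateway gateway;
    private final DeadLetterQueue deadLetterQueue;

    public PaymentService(PaymentGateway gateway, DeadLetterQueue deadLetterQueue) {
        this.gateway = gateway;
        this.deadLetterQueue = deadLetterQueue;
    }

    // amountMinorUnits is in the smallest currency unit (e.g., cents) to avoid
    // floating-point rounding errors on money.
    public PaymentResult charge(String token, long amountMinorUnits, String idempotencyKey) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                // The same idempotency key is sent on every attempt
                PaymentResult result = gateway.charge(token, amountMinorUnits, idempotencyKey);
                if (result.isSuccess() || result.isDecline()) {
                    return result; // success, or a non-retriable decline
                }
            } catch (TimeoutException | NetworkException e) {
                if (attempt == MAX_RETRIES) {
                    // Push to dead-letter queue for manual review
                    deadLetterQueue.send(new PaymentFailureEvent(token, amountMinorUnits, idempotencyKey, e));
                    return PaymentResult.failure("Payment processing delayed; contact support");
                }
            }
            if (attempt < MAX_RETRIES && !sleepWithBackoff(attempt)) {
                return PaymentResult.failure("Payment interrupted");
            }
        }
        return PaymentResult.failure("Max retries exceeded");
    }

    // Exponential backoff with jitter, capped at MAX_DELAY_MS; returns false if interrupted
    private boolean sleepWithBackoff(int attempt) {
        long delay = (long) (BASE_DELAY_MS * Math.pow(2, attempt - 1) * (1 + Math.random()));
        try {
            Thread.sleep(Math.min(delay, MAX_DELAY_MS));
            return true;
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }
}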

Idempotency Key Storage

Store the idempotency key with a unique constraint in a dedicated table. If a retry arrives before the first request completed, the second request will block on the constraint until the first transaction commits or times out. Use a retry loop with a short timeout to handle this contention gracefully.

Production Insight

A payment gateway outage can cascade into a full site crash if your payment service doesn't have a circuit breaker.

Without it, every checkout request blocks waiting for a timeout, exhausting the HTTP connection pool and affecting other services.

Rule: Wrap payment gateway calls with a circuit breaker — fail fast after 3 consecutive timeouts within 30 seconds.
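A minimal sketch of that rule is shown below. The thresholds (3 timeouts, 30-second window) come from the rule above; the 30-second cooldown before a trial call is allowed again, the class names, and the injectable clock are assumptions added for the example. After the cooldown this simplified breaker resets fully, rather than implementing a strict half-open state.

```python
import time
from typing import Callable, Optional

class CircuitOpenError(Exception):
    """Raised instead of calling the gateway while the breaker is open."""

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, window_s: float = 30.0,
                 cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.window_s = window_s
        self.cooldown_s = cooldown_s          # assumed value, not from the article
        self.clock = clock
        self.failures: list[float] = []       # timestamps of consecutive timeouts
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], object]) -> object:
        now = self.clock()
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("payment gateway circuit is open")
            self.opened_at = None             # cooldown over: allow a trial call
            self.failures.clear()
        try:
            result = fn()
        except TimeoutError:
            # keep only timeouts inside the window, then record this one
            self.failures = [t for t in self.failures if now - t <= self.window_s] + [now]
            if len(self.failures) >= self.max_failures:
                self.opened_at = now
            raise
        self.failures.clear()                 # any success resets the streak
        return result

def timing_out() -> None:
    raise TimeoutError("gateway timed out")

fake_now = [0.0]
breaker = CircuitBreaker(clock=lambda: fake_now[0])
outcomes = []
for _ in range(4):
    try:
        breaker.call(timing_out)
    except TimeoutError:
        outcomes.append("timeout")
    except CircuitOpenError:
        outcomes.append("fast-fail")
    fake_now[0] += 1.0
print(outcomes)  # ['timeout', 'timeout', 'timeout', 'fast-fail']
```

The fourth call never reaches the gateway, which is the point: checkout threads fail in microseconds instead of each waiting out a full timeout.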

Key Takeaway

Idempotency keys are non-negotiable.

Retry with exponential backoff and jitter.

Circuit breakers prevent gateway failures from cascading.

Scaling Strategies & Trade-offs

Scaling an e-commerce platform is not just adding more servers — it's about understanding where bottlenecks appear at each growth stage.

Stage 1: Up to 100k daily active users
- Monolithic architecture with separate read replicas for catalog.
- Redis cache for product pages and session data.
- Single PostgreSQL database with connection pooling.

Stage 2: 100k to 1M DAU
- Break out catalog and inventory services (as discussed).
- Use Elasticsearch for search, read replicas for orders.
- Asynchronous payment callbacks (webhooks).
- Message queue (RabbitMQ / Kafka) for order processing and inventory sync.

Stage 3: 1M+ DAU with flash sales
- Full microservices architecture with event sourcing.
- Each service has its own database (database-per-service).
- CDN for static assets and cached product pages.
- Pre-warming inventory cache for top products.
- Auto-scaling infrastructure with Kubernetes.
- Feature flags to quickly disable payment gateways or checkout during incidents.

The critical trade-off: consistency vs availability. During a flash sale, you might accept reduced availability for the checkout service to avoid overselling. Use a strong consistency model for inventory but allow reads from cache for product pages.

Database-Per-Service Pitfall

When each service has its own database, you lose the ability to do cross-service JOINs. Instead, use data duplication (denormalization) and eventual consistency. For example, the order service stores product snapshots (price, name at time of purchase) so order history doesn't depend on the catalog service being up.

Production Insight

Auto-scaling works great for stateless services (catalog, cart) but backfires for stateful services (inventory, orders).

Inventory service stores stock counts in a database that cannot be sharded at runtime easily.

Rule: Overprovision inventory database capacity for peak and use connection pooling to absorb spikes.

Also use a fast in-memory gate (Redis) for the hot inventory items during flash sales, then reconcile with the database asynchronously.
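One common way to build such a gate is "decrement first, then check". The sketch below shows the logic against a small in-memory stand-in so it can run without a server. The stand-in mirrors the `decr`/`incr` method names of the redis-py client, but `InMemoryCounter`, `try_reserve`, and the key name are assumptions for illustration, and asynchronous reconciliation with the database is left out.

```python
import threading

class InMemoryCounter:
    """Stand-in exposing the same decr/incr calls as a Redis client."""
    def __init__(self, initial: dict[str, int]):
        self._values = dict(initial)
        self._lock = threading.Lock()   # Redis runs each command atomically

    def decr(self, name: str, amount: int = 1) -> int:
        with self._lock:
            self._values[name] = self._values.get(name, 0) - amount
            return self._values[name]

    def incr(self, name: str, amount: int = 1) -> int:
        with self._lock:
            self._values[name] = self._values.get(name, 0) + amount
            return self._values[name]

def try_reserve(client, stock_key: str) -> bool:
    """DECR first, then check: a negative result means the stock was already gone."""
    remaining = client.decr(stock_key)
    if remaining < 0:
        client.incr(stock_key)          # undo our decrement and reject
        return False
    return True

gate = InMemoryCounter({"stock:SKU-42": 100})
wins: list[bool] = []

def buyer() -> None:
    wins.append(try_reserve(gate, "stock:SKU-42"))

threads = [threading.Thread(target=buyer) for _ in range(500)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sum(wins), gate.decr("stock:SKU-42", 0))  # 100 0
```

The counter can dip below zero for a moment while rejected buyers undo their decrement, but a request only succeeds when its own decrement returned a non-negative value, so no more than the initial stock is ever granted.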

Key Takeaway

Scale in stages: don't start with full microservices on day one.

Consistency vs availability: choose based on the operation.

Stateful services (inventory) cannot auto-scale as easily — plan capacity ahead.

When to Scale Which Component

If: Product page loading slow → Use: CDN + Redis cache, horizontal scaling of catalog service, add Elasticsearch nodes.

If: Cart save/update slow → Use: Redis cluster for cart data, shard by user ID region.

If: Checkout failing during high load → Use: Scale checkout orchestration pods, add connection pooling to inventory DB, use Redis gate for stock.

If: Payment gateway timeouts → Use: Circuit breaker, dead-letter queue, manual reconciliation script.

E-Commerce Architecture Types: Matching Topology to Traffic

Most naive designs start with a monolithic client-server setup. Client talks to server, server talks to database. Fine for a Shopify store with 50 visitors. But when your flash sale pushes 50k concurrent users, that single server becomes a bottleneck.

Two-tier architecture (client + database directly) is a death sentence for any real system. You expose raw DB queries to the network. Every SQL injection nightmare becomes real. I've seen production postmortems from teams who thought this was "simple."

Three-tier architecture is the baseline for any serious e-commerce platform. Presentation tier (React/Vue), application tier (REST/gRPC services), data tier (sharded databases). This gives you isolation. Your search service can have its own scaling policy and database without taking down checkout.

The real lesson: pick your tiers based on failure isolation, not just separation of concerns. Each tier must be independently deployable, testable, and scalable. If changing a product description requires redeploying your payment system, you've already lost.

TierSelector.py · PYTHON

# io.thecodeforge: system-design tutorial
# Map traffic patterns to tier topology

def recommend_architecture(daily_visitors: int) -> str:
    if daily_visitors < 1000:
        return "Three-tier with monolith app server"
    elif daily_visitors < 100_000:
        return "Three-tier with read replicas"
    else:
        return "Micro-frontend + partitioned services"

# Real production data from 2023 migration
print(recommend_architecture(500))       # threshold test
print(recommend_architecture(50_000))    # mid-traffic retailer
print(recommend_architecture(5_000_000)) # flash sale peak

Output

Three-tier with monolith app server

Three-tier with read replicas

Micro-frontend + partitioned services

Production Trap:

Never skip the application tier for "simplicity." Direct client-to-DB access in two-tier is how you get pwned during a traffic spike. Always route through a service layer that validates, throttles, and audits.

Key Takeaway

Match your architecture tier count to your traffic ceiling, not your MVP requirements. Three-tier is the floor for production.

Components That Tear Under Load: The Hidden Coupling

Every e-commerce platform has the same skeleton: product catalog, search, cart, checkout, payment, inventory, order management. The question is how tightly you've glued them together.

The biggest mistake I see? Sharing a single database across all components. Cart service locks rows on the same table that search queries. Now your search index refresh deadlocks against a checkout. I've debugged that at 3 AM. It's not fun.

Decompose by data ownership. Inventory owns stock counts. Cart owns session state. Payment owns transaction logs. They should only communicate through events or APIs, never through shared tables. Use an event bus (Kafka, RabbitMQ) to push inventory changes to search and to trigger order fulfillment.

Search needs its own index, preferably Elasticsearch or Meilisearch. Don't query product DB for search. That's for lookup by ID only. Cart should live in Redis with TTL for session expiration. Payment needs a ledger-level audit trail in something append-only like DynamoDB or Cassandra.

Your component boundaries define your blast radius. When the payment service goes down, you want users still able to browse products. If they're coupled at the DB level, everything burns together.

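ComponentIsolation.py · PYTHON

# io.thecodeforge: system-design tutorial
# Decouple components by data store. Plain dicts and an in-memory bus stand in
# for PostgreSQL, Elasticsearch, and Kafka so the sketch runs on its own.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        print("Event published.")
        for handler in self.subscribers[topic]:
            handler(event)

class InventoryService:
    def __init__(self, bus):
        self.db = {}  # stock counts, owned by inventory only
        self.bus = bus

    def update_stock(self, sku: str, count: int):
        self.db[sku] = count
        print("Inventory updated.")
        self.bus.publish("inventory_updates", {"sku": sku, "count": count})

class SearchService:
    def __init__(self, bus):
        self.index = {}  # search documents, owned by search only
        bus.subscribe("inventory_updates", self.update_product_availability)

    def update_product_availability(self, event):
        # Receives the event and updates its own index independently
        self.index[event["sku"]] = {"in_stock": event["count"] > 0}
        print("Search index refreshed.")

# An inventory change triggers an event, not a direct call into search
bus = EventBus()
search = SearchService(bus)
inventory = InventoryService(bus)
inventory.update_stock("SKU-42", 5)

Output

Inventory updated.

Event published.

Search index refreshed.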

Key Takeaway

Components that share a database are not independent. Each service should own its data store and only communicate via events or APIs.

Importance of Domain Knowledge

E-commerce platforms fail when engineers treat them as generic CRUD apps. Domain knowledge separates a working system from one that collapses under Black Friday traffic. Understanding retail concepts like inventory buffers, payment settlement cycles, and fulfillment zoning directly impacts architectural decisions. Without domain knowledge, you build abstractions that leak complexity rather than contain it. You model a "Product" as a simple database row, missing that it has different states during checkout vs. restocking. Domain knowledge guides you to enforce invariants (like never overselling inventory) as business rules, not afterthoughts. When senior engineers lack domain fluency, they optimize for the wrong constraints—caching product catalogs aggressively while ignoring that price changes propagate slowly across supplier systems. Master domain knowledge first; technology second.

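domain_knowledge.py · PYTHON

# io.thecodeforge: system-design tutorial
class OverSaleException(Exception):
    """Raised when a reservation would exceed physical stock."""

class InventoryService:
    def __init__(self, db):
        self.db = db  # assumed DB handle whose execute() returns a cursor

    def reserve_stock(self, sku: str, qty: int) -> bool:
        # Domain rule: never sell more than physical stock. Check and reserve in
        # ONE statement so concurrent requests cannot both pass a stale check.
        # Assumes qty_available is physical stock and qty_reserved counts active holds.
        cur = self.db.execute(
            'UPDATE inventory SET qty_reserved = qty_reserved + %s '
            'WHERE sku = %s AND qty_available - qty_reserved >= %s',
            (qty, sku, qty))
        if cur.rowcount == 0:
            raise OverSaleException('Domain constraint violated')
        return True

Output

No output on success; OverSaleException is raised when the reservation would exceed stock.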

Production Trap:

Engineers without domain knowledge often treat inventory as a single integer, missing multi-warehouse reservations and time-of-purchase lock expiration.
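To make the trap concrete, here is a small sketch of inventory modelled per warehouse with time-limited holds instead of a single integer. All names (`Reservation`, `available`, the warehouse labels) and the numbers are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reservation:
    sku: str
    warehouse: str
    qty: int
    expires_at: float  # epoch seconds; the hold lapses after this time

def available(on_hand: dict[tuple[str, str], int], reservations: list[Reservation],
              sku: str, now: float) -> dict[str, int]:
    """Sellable units per warehouse = on-hand stock minus unexpired holds."""
    result: dict[str, int] = {}
    for (item_sku, warehouse), qty in on_hand.items():
        if item_sku != sku:
            continue
        held = sum(r.qty for r in reservations
                   if r.sku == sku and r.warehouse == warehouse and r.expires_at > now)
        result[warehouse] = qty - held
    return result

stock = {("SKU-42", "berlin"): 5, ("SKU-42", "austin"): 2}
holds = [
    Reservation("SKU-42", "berlin", 3, expires_at=1_000.0),
    Reservation("SKU-42", "austin", 2, expires_at=500.0),   # already expired at t=600
]
print(available(stock, holds, "SKU-42", now=600.0))  # {'berlin': 2, 'austin': 2}
```

An expired hold returns its units to the sellable pool without any cleanup job having to run first, which is one reason to store the expiry on the reservation itself.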

Key Takeaway

Model business realities before technical abstractions.

Strategic Design in DDD — Bounded Contexts & Ubiquitous Language

Strategic DDD partitions the e-commerce platform into Bounded Contexts: Catalog, Ordering, Inventory, Payments, Shipping. Each context owns its models and language. The "Customer" in Billing means billing address; in Shipping, it means delivery preferences. Ubiquitous Language ensures every team member—product, QA, and backend—uses terms like "reservation" not "temporary hold" and "shipment" not "package group." This eliminates translation errors during requirement gathering. Bounded Contexts communicate through context maps (e.g., Customer Service sends events, not direct DB reads). A Catalog context produces ProductCreated events; Inventory subscribes to initialize stock. This strategic design prevents god classes and allows independent deployability—critical when flash sales spike one context without cascading failure.

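bounded_context.py · PYTHON

# io.thecodeforge: system-design tutorial
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class OrderPlaced:
    order_id: str
    total: float

class MessageBus:  # in-memory stand-in for a real broker client
    def publish(self, topic, event):
        print(f"Events fired: {topic} with {type(event).__name__} payload.")
message_bus = MessageBus()

@dataclass
class Order:
    total: float
    id: str = field(default_factory=lambda: uuid4().hex)
    status: str = "pending_payment"  # Ubiquitous Language: not 'waiting'
    def place(self) -> str:
        # Communicates with other contexts via events, not direct DB reads
        message_bus.publish("orders.placed", OrderPlaced(self.id, self.total))
        return self.id

Order(total=59.98).place()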

Output

Events fired: orders.placed with OrderPlaced payload.

Production Trap:

Mixing languages between teams causes subtle bugs—Shipping team's 'delivery_date' vs. Ordering team's 'commit_date' cause missed SLAs.

Key Takeaway

Bound contexts by business capability, not technical layers.

Tactical DDD Patterns — Entity, Value Object, Aggregate, Repository, Factory

Tactical patterns enforce consistency within a Bounded Context. An Entity (e.g., Cart) has identity: two carts are distinct even with the same items. A Value Object (e.g., Address) has no identity and is defined only by its attributes; it is immutable, so changing the street means replacing it with a new Address. An Aggregate (e.g., Order) clusters entities and value objects under a single root and defines the transactional boundary. For example, a Cart aggregate includes CartItems (entities) and a ShippingAddress (value object). Repositories provide persistence abstractions over aggregates and never expose raw SQL. Factories encapsulate instantiation complexity (e.g., creating a bulk order vs. a single-item order). Benefits: tactical DDD prevents anemic domain models by keeping business logic in domain objects, not services. It reduces cognitive load for new engineers because rules live where they belong: on the order, not scattered across controllers.

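ddd_patterns.py · PYTHON

# io.thecodeforge: tutorial (assumes table order_items(order_id, sku, qty, price_cents))
import sqlite3
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Money:  # Value Object: immutable, compared by value
    cents: int
@dataclass
class OrderItem:  # Entity owned by the aggregate
    sku: str
    qty: int
    price: Money
@dataclass
class Order:  # Aggregate root
    id: str
    items: list = field(default_factory=list)
    def add_item(self, sku: str, qty: int, price: Money) -> None:
        self.items.append(OrderItem(sku, qty, price))

class OrderRepository:
    def save(self, conn: sqlite3.Connection, order: Order) -> None:
        with conn:  # single transaction: every item commits or none do
            conn.executemany("INSERT INTO order_items VALUES (?, ?, ?, ?)",
                [(order.id, i.sku, i.qty, i.price.cents) for i in order.items])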

Output

Order aggregate saved atomically with all invariants enforced.

Production Trap:

Loading partial aggregates via repository leaks business logic to services—always load the whole aggregate root.

Key Takeaway

Aggregates define consistency boundaries—never bypass them.

● Production incident · POST-MORTEM · severity: high

Flash Sale Double-Book Disaster

Symptom

More customers received order confirmation emails for the same limited-edition product than there were units in stock, even though inventory showed zero within the first few seconds. Support tickets flooded in.

Assumption

Assumed that reading the quantity, checking it was greater than zero in application code, and then issuing a separate UPDATE would prevent overselling. But two concurrent requests read the same row before either wrote, so both passed the check.

Root cause

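Read-modify-write race condition. The 'SELECT quantity then UPDATE ... SET quantity = quantity - 1' pattern is not atomic: nothing stops a second request from reading the row between the first request's SELECT and its UPDATE. MySQL's default isolation level (REPEATABLE READ) does not help here, because a plain SELECT in InnoDB is a non-locking snapshot read, so two transactions can both see quantity = 1 and both go on to decrement.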

Fix

Switched to atomic decrement: 'UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ? AND quantity > 0'. Then checked affected_rows in application code. Also added a Redis atomic counter as a fast gate before hitting the DB.
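As a minimal sketch of that fix, the snippet below runs the guarded UPDATE and checks the affected-row count. It uses Python's built-in sqlite3 as a stand-in for the production database; the helper names reserve_one and make_demo_db and the two-column table layout are assumptions for illustration, not the incident's real schema.

```python
import sqlite3


def make_demo_db(stock: int) -> sqlite3.Connection:
    """Build an in-memory inventory table with one product (illustrative schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO inventory VALUES (1, ?)", (stock,))
    conn.commit()
    return conn


def reserve_one(conn: sqlite3.Connection, product_id: int) -> bool:
    """Take one unit in a single statement; True if reserved, False if sold out."""
    cur = conn.execute(
        "UPDATE inventory SET quantity = quantity - 1 "
        "WHERE product_id = ? AND quantity > 0",
        (product_id,),
    )
    conn.commit()
    # rowcount is the affected-rows check: 0 means the guard rejected the update.
    return cur.rowcount == 1
```

The check and the decrement live in one statement, so there is no window between reading the quantity and writing it. With two units in stock, three calls to reserve_one succeed twice and then return False.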

Key lesson

Inventory decrement must be atomic — never SELECT then UPDATE separately.
Use row-level locks or optimistic locking for critical stock operations.
Always test race conditions with simultaneous curl scripts before a flash sale.
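The optimistic-locking option mentioned in the lessons above can be sketched as follows. This assumes a version column on the inventory row (not part of the schema shown in this article) and again uses sqlite3 as a stand-in; all function names are hypothetical.

```python
import sqlite3


def make_versioned_db(stock: int) -> sqlite3.Connection:
    """In-memory table with an assumed 'version' column for optimistic locking."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        "product_id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL, "
        "version INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute("INSERT INTO inventory (product_id, quantity) VALUES (1, ?)", (stock,))
    conn.commit()
    return conn


def read_stock(conn, product_id):
    """Return (quantity, version), or None if the product does not exist."""
    return conn.execute(
        "SELECT quantity, version FROM inventory WHERE product_id = ?", (product_id,)
    ).fetchone()


def write_if_unchanged(conn, product_id, new_quantity, seen_version) -> bool:
    """Write only if nobody else bumped the version since we read it."""
    cur = conn.execute(
        "UPDATE inventory SET quantity = ?, version = version + 1 "
        "WHERE product_id = ? AND version = ?",
        (new_quantity, product_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1


def reserve_one_optimistic(conn, product_id, max_retries: int = 3) -> bool:
    """Read, then attempt a versioned write; retry if another writer won the race."""
    for _ in range(max_retries):
        row = read_stock(conn, product_id)
        if row is None or row[0] <= 0:
            return False
        if write_if_unchanged(conn, product_id, row[0] - 1, row[1]):
            return True
    return False
```

If two requests read the same (quantity, version) pair, only the first write_if_unchanged call succeeds; the second matches zero rows and has to re-read. Under heavy contention the retries add up, which is why the single-statement decrement is usually the simpler choice for flash sales.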

Production debug guide

Symptom → Action for the most frequent production issues (4 entries)

Symptom · 01

User gets 'Item out of stock' after adding to cart

→

Fix

Check if inventory reserve was released on cart expiry or failed payment. Look at inventory_reserve_logs for the user's session. Verify TTL on cart reservation.
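To make the TTL check concrete, here is a small in-memory sketch of a reservation book that returns units to stock once a hold expires. The class name ReservationBook, the one-unit-per-session rule, and the 300-second default are assumptions; a real system would keep holds in Redis or the database, and this version is not thread-safe.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ReservationBook:
    available: int
    ttl_seconds: float = 300.0
    holds: dict = field(default_factory=dict)  # session_id -> expiry timestamp

    def release_expired(self, now=None) -> int:
        """Return expired holds to stock; reports how many were released."""
        now = time.time() if now is None else now
        expired = [sid for sid, expiry in self.holds.items() if expiry <= now]
        for sid in expired:
            del self.holds[sid]
            self.available += 1
        return len(expired)

    def reserve(self, session_id: str, now=None) -> bool:
        """Hold one unit for a session; False if nothing is available."""
        now = time.time() if now is None else now
        self.release_expired(now)
        if session_id in self.holds:
            return True  # this session already holds a unit
        if self.available <= 0:
            return False
        self.available -= 1
        self.holds[session_id] = now + self.ttl_seconds
        return True
```

If release_expired never runs (or the TTL is far longer than intended), abandoned carts keep stock locked and later shoppers see 'Item out of stock' even though nothing was sold.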

Symptom · 02

Payment succeeded but order not created

→

Fix

Check payment gateway callback logs. The webhook may have been delivered but the order service failed to process it. Verify that the idempotency key was stored: if it was not, the callback was never recorded and can safely be replayed.
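One way to make that replay safe is to record the idempotency key and create the order in the same transaction. The sketch below assumes two illustrative tables (processed_callbacks and orders) and uses sqlite3 as a stand-in; none of these names come from a real gateway SDK.

```python
import sqlite3


def make_callback_db() -> sqlite3.Connection:
    """In-memory tables for the sketch; the schema is an assumption."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_callbacks (idempotency_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.commit()
    return conn


def handle_payment_callback(conn, idempotency_key: str, order_id: str) -> str:
    """Create the order once per key; replays are detected by the unique constraint."""
    try:
        # 'with conn' commits both inserts together, or rolls both back on error.
        with conn:
            conn.execute(
                "INSERT INTO processed_callbacks (idempotency_key) VALUES (?)",
                (idempotency_key,),
            )
            conn.execute(
                "INSERT INTO orders (order_id, status) VALUES (?, 'PAID')",
                (order_id,),
            )
    except sqlite3.IntegrityError:
        return "duplicate"
    return "created"
```

Because the key and the order commit together, a crash between the two cannot leave a stored key without an order, and a replayed webhook returns "duplicate" instead of creating a second order.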

Symptom · 03

Duplicate charge on credit card

→

Fix

Your payment idempotency key is missing or not enforced on the gateway side. Check payment_intent.idempotency_key in Stripe logs. Ensure your payment service generates a deterministic key per order attempt.
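A deterministic key can be as simple as a hash of the order ID and an attempt counter. The function name and the "order:attempt" input format below are assumptions for illustration, not a format required by any gateway.

```python
import hashlib


def payment_idempotency_key(order_id: str, attempt: int) -> str:
    """Same order and attempt always yield the same key; a new attempt yields a new one."""
    raw = f"{order_id}:{attempt}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```

Network retries of the same attempt reuse the key, so the gateway can deduplicate them. Only a deliberate new attempt (for example, the customer retrying with a different card) should increment the counter.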

Symptom · 04

Cart total is inconsistent with price after promo code

→

Fix

Promo code validation may have used stale product prices. Recalculate cart on checkout initiation, never rely on cached prices. Log all applied promos with timestamps.
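Recalculating at checkout might look like the sketch below. It assumes a percentage promo and a price lookup fetched fresh at checkout time; recalculate_cart and its parameters are hypothetical names, and Decimal is used to avoid float rounding drift on money.

```python
from decimal import Decimal, ROUND_HALF_UP


def recalculate_cart(items, current_prices, promo_percent=Decimal("0")) -> Decimal:
    """items: list of (sku, qty); current_prices: sku -> Decimal read at checkout."""
    subtotal = sum(
        (current_prices[sku] * qty for sku, qty in items),
        Decimal("0"),
    )
    # Round the discount to cents once, after it is applied to the fresh subtotal.
    discount = (subtotal * promo_percent / Decimal("100")).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return subtotal - discount
```

The key point is the input: current_prices comes from the source of truth at checkout initiation, not from whatever price was cached when the item was added to the cart.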

★ Flash Sale Performance Debug Cheat Sheet

When traffic spikes 10x, these commands pinpoint the bottleneck in seconds.

Product page loads >1s

Immediate action

Check cache hit ratio on product catalog CDN and Redis

Commands

redis-cli info stats | grep -E 'keyspace_(hits|misses)'

curl -s -w '%{time_total}' -o /dev/null https://your-cdn-endpoint/product/123

Fix now

Warm cache with popular product IDs one hour before sale. Pre-generate page HTML and serve from CDN.

Checkout API returning 503

Orders failing due to inventory race

E-commerce Architecture Comparison

Architecture Style	Best For	Consistency Model	Operational Complexity	Cost at Scale
Monolith	Startups, <100k DAU, simple catalog	Strong (single DB)	Low	Low — single server or small cluster
Microservices (per domain)	Medium to large platforms, 100k-10M DAU	Eventual consistency across services	High — requires DevOps, monitoring, and CI/CD maturity	Medium — more services, but better scaling
Serverless (Lambda, Fargate, etc.)	Variable traffic, low ops overhead	Eventual (function per service)	Low to medium — but cold starts affect latency	Low for low traffic, high for steady high traffic

Key takeaways

Service boundaries should follow access patterns: read-heavy (catalog) vs write-heavy (inventory) vs consistency-critical (payment).

Idempotency keys prevent duplicate payments: store them with a unique constraint and use them in every payment gateway call.

Checkout is a Saga: atomic inventory reservation, idempotent payment, compensating release on failure.

Search is not a database query: use Elasticsearch with CDC for product search; cache product pages aggressively.

Scaling is stage-dependent: start with a monolith, split services when bottlenecks appear, and never share a database between inventory and catalog.
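The "Checkout is a Saga" takeaway can be sketched as a tiny orchestrator: each step pairs an action with a compensating action, and a failure undoes the completed steps in reverse order. The name run_checkout_saga and the (action, compensation) tuple shape are assumptions; a production saga would also persist its progress so it can resume after a crash.

```python
def run_checkout_saga(steps) -> bool:
    """steps: list of (action, compensation) callables. True if every action succeeds."""
    completed = []  # compensations for the steps that have already succeeded
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Undo completed steps in reverse order, e.g. release the reservation.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

With steps such as (reserve_inventory, release_inventory) followed by (charge_payment, refund_payment), a failed charge triggers release_inventory but not refund_payment, because the charge itself never completed. Compensations should be idempotent, since a retry after a crash may run them more than once.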

Common mistakes to avoid

3 patterns

Keeping inventory and catalog in the same database

Symptom

Product page queries cause deadlocks on inventory rows during checkout. Checkout fails with 'could not serialize access' errors.

Fix

Split into two separate databases or schemas. Catalog uses read replicas, inventory uses strongly consistent writes. Consider different engines: MySQL for catalog, PostgreSQL for inventory if needed.

Not using idempotency keys for payment

Symptom

Users are charged twice for a single order. Customer support spends hours processing refunds.

Fix

Generate a unique idempotency key per payment attempt. Send it to the payment gateway. Store the key in a table with a unique constraint. Retry with the same key on timeout.

Implementing cart as a server-side session only

Symptom

Users lose their cart when switching devices or after session timeout. Cart abandonment rate increases by 30%.

Fix

Persist cart to a database (or Redis) linked to the user ID after login. For guest users, use a client-side cart stored in localStorage and sync on login.
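The sync-on-login step needs a merge rule. The sketch below sums quantities for SKUs present in both carts, which is only one possible policy and an assumption here; merge_carts and the sku-to-quantity dict shape are illustrative.

```python
def merge_carts(guest_cart: dict, user_cart: dict) -> dict:
    """Combine a guest cart into the stored user cart (sku -> quantity)."""
    merged = dict(user_cart)  # copy so the stored cart is not mutated in place
    for sku, qty in guest_cart.items():
        merged[sku] = merged.get(sku, 0) + qty
    return merged
```

After the merge, persist the result under the user ID and clear the client-side copy so the same guest items are not merged again on the next login.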

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How would you design the checkout flow in a high-traffic e-commerce plat...

Q02 · SENIOR

How do you handle a payment gateway timeout in a customer-facing checkou...

Q03 · SENIOR

Explain the trade-offs between using a single database for all e-commerc...

Q01 of 03 · SENIOR

How would you design the checkout flow in a high-traffic e-commerce platform to prevent overselling?

ANSWER

The checkout flow must guarantee that inventory is decremented atomically before payment, and that payment is idempotent. Use an atomic UPDATE on the inventory table: UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ? AND quantity > 0. Then check affected_rows. If zero, reject. For high contention, use a Redis Lua script for the decrement and fall back to database. After reserving inventory, call payment gateway with an idempotency key. If payment fails, release the reservation by incrementing quantity back. Use a Saga pattern with compensating transactions. For flash sales, consider a pre-reservation step where the cart holds items for a short time (e.g., 5 minutes) before finalizing.


Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.

May 24, 2026

last updated
