E-commerce System Design — Flash Sale Race Conditions
SELECT-then-UPDATE inventory causes double-bookings under load.
20+ years shipping large-scale distributed systems. Everything here is grounded in real deployments.
- E-commerce platforms are distributed systems managing product discovery, cart, checkout, payments, and inventory under high concurrency
- Key components: product catalog service, cart service, checkout orchestration, payment gateway, inventory service
- Performance insight: Product search must return in <200ms; use Elasticsearch with caching
- Production insight: Without idempotency in payments, a single retry can charge a customer twice — use idempotency keys
- Biggest mistake: Keeping cart and inventory in the same service — leads to tight coupling and checkout failures
Imagine you're running the world's biggest flea market. You've got thousands of sellers, millions of buyers, and everyone wants to browse, pick something, pay, and get it delivered — all at the same time, without chaos. Building an e-commerce platform is exactly that: designing the invisible plumbing that makes sure the right product gets to the right buyer, the money moves safely, and nothing crashes when a flash sale hits at midnight.
An e-commerce platform isn’t just a shopping cart with a database. It’s a distributed system that must survive flash sales, maintain consistent checkout states, and keep search fast under millions of SKUs. Without deliberate architectural separation and decoupled services, your naive monolith will collapse under the first real traffic spike—losing orders, payments, and trust.
Why Flash Sales Break Naive E-Commerce Systems
A flash sale is a high-concurrency event where a limited inventory is offered at a steep discount for a short window. The core mechanic is a race condition: thousands of users compete for the same few items, and the system must correctly decrement inventory exactly once per successful purchase. Without careful design, overselling (selling more than available stock) or underselling (rejecting valid purchases) is inevitable.
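The race described above can be replayed deterministically without a database. The sketch below is only an illustration: the `stock` dict, the SKU name, and the two helper functions are hypothetical stand-ins for a SELECT and an UPDATE issued as separate statements.

```python
# Deterministic replay of the SELECT-then-UPDATE race (no real database).
stock = {"sku-1": 1}  # hypothetical SKU with a single unit left


def read_stock(sku):
    """Stand-in for: SELECT quantity FROM inventory WHERE product_id = ?"""
    return stock[sku]


def write_stock(sku, value):
    """Stand-in for: UPDATE inventory SET quantity = ? WHERE product_id = ?"""
    stock[sku] = value


# Both buyers run their SELECT before either one runs its UPDATE.
seen_by_a = read_stock("sku-1")
seen_by_b = read_stock("sku-1")

orders = []
for buyer, seen in (("A", seen_by_a), ("B", seen_by_b)):
    if seen > 0:                         # the stale check passes for both buyers
        write_stock("sku-1", seen - 1)   # both write 0
        orders.append(buyer)

print(orders, stock)
```

Two orders are recorded for one unit, and the final stock of 0 hides the oversell. That is why the check and the decrement have to happen in one atomic step.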
In practice, the key properties that matter are atomic inventory updates, idempotent payment processing, and graceful degradation under load. A typical flash sale sees 100x normal traffic within seconds. The system must handle concurrent writes to the same product row — naive locking (e.g., database row locks) becomes a bottleneck, while optimistic locking with retries can cause cascading failures. Distributed locks (Redis, ZooKeeper) or queue-based throttling are common solutions, but each introduces trade-offs in consistency and latency.
Use a dedicated flash sale architecture when the expected concurrency exceeds what your normal checkout pipeline can handle — roughly >10 concurrent requests per product SKU. It matters because a single oversell incident can trigger chargebacks, customer trust erosion, and regulatory fines. Real systems like Alibaba's Double 11 or Amazon's Lightning Deals rely on pre-allocated inventory pools, request queuing, and idempotency keys to survive the stampede.
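One way to picture a pre-allocated inventory pool is a thread-safe queue holding one token per sellable unit. This is a minimal single-process sketch, not how any of the named systems implement it; `InventoryPool`, `try_claim`, and `release` are hypothetical names.

```python
import queue


class InventoryPool:
    """Pre-allocated stock for one SKU: each token is one sellable unit."""

    def __init__(self, units):
        self._tokens = queue.Queue()
        for token in range(units):
            self._tokens.put(token)

    def try_claim(self):
        """Return a token, or None when the pool is sold out (never blocks)."""
        try:
            return self._tokens.get_nowait()
        except queue.Empty:
            return None

    def release(self, token):
        """Return a unit to the pool, e.g. after a failed payment."""
        self._tokens.put(token)


pool = InventoryPool(units=3)
claims = [pool.try_claim() for _ in range(5)]
admitted = [c for c in claims if c is not None]
print(len(admitted), "admitted,", claims.count(None), "rejected")
```

Requests that fail to claim a token can be rejected immediately, which keeps the stampede away from the checkout pipeline.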
Core Components & Service Separation
An e-commerce platform is a set of loosely coupled services, each responsible for one domain. The core services are:
- Product Catalog Service: Manages product metadata (name, description, images, categories, prices). This is read-heavy and benefits from caching and Elasticsearch.
- Cart Service: Manages user sessions, add/remove items, coupon application. It's write-heavy for the current session but read-only for historical data.
- Checkout Orchestrator: Coordinates the actual purchase — validates cart, locks inventory, calls payment gateway, creates order. This is the most failure-sensitive service.
- Inventory Service: Tracks stock levels across warehouses, reserves items during checkout, handles restocks.
- Payment Service: Interacts with external gateways (Stripe, PayPal), stores idempotency keys, handles retries.
- Order Service: Records completed orders, sends notifications, manages returns.
The mistake everyone makes is bundling inventory with the catalog. These have completely different access patterns: catalog is read-heavy and stale-ok, inventory is write-heavy and consistency-critical. Keep them separate from the start.
- Catalog: Fast reads, eventual consistency acceptable.
- Cart: Temporary state, can be lost without financial impact.
- Inventory: Strong consistency, cannot oversell.
- Payment: Must be idempotent and auditable.
- Order: Immutable after creation, source of truth.
Product Search & Catalog Performance
Product search is the gateway to purchase. Users expect results in under 200ms, with filters for category, price range, rating, and sorting by relevance or newest. Achieving this at scale means you cannot query the primary database directly.
Architecture:
- Use Elasticsearch as the search index. It supports full-text search, faceted aggregation, and fuzzy matching out of the box.
- Keep a read-through cache (Redis) for product detail pages (PDP). The cache key is product_id:locale:version.
- For autocomplete, use a prefix-based Trie in memory or Elasticsearch's completion suggester.
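The read-through cache with the product_id:locale:version key could look roughly like this. A plain dict stands in for Redis so the sketch runs anywhere; `ReadThroughCache` and `load_from_db` are hypothetical names.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Read-through cache keyed by product_id:locale:version."""

    def __init__(self, loader: Callable[[str, str], dict]):
        self._loader = loader
        self._store: Dict[str, dict] = {}  # stand-in for Redis
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(product_id: str, locale: str, version: int) -> str:
        return f"{product_id}:{locale}:{version}"

    def get(self, product_id: str, locale: str, version: int) -> dict:
        k = self.key(product_id, locale, version)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        value = self._loader(product_id, locale)  # fall through to the database
        self._store[k] = value
        return value


db_calls = []


def load_from_db(product_id, locale):
    """Hypothetical catalog lookup; records each call for the demo."""
    db_calls.append(product_id)
    return {"id": product_id, "locale": locale}


cache = ReadThroughCache(load_from_db)
cache.get("p1", "en-US", 1)   # miss: loads from the database
cache.get("p1", "en-US", 1)   # hit
cache.get("p1", "en-US", 2)   # new version means a new key, so a miss
print(cache.hits, cache.misses, len(db_calls))
```

Putting the version in the key means an update never has to find and delete old entries; they simply stop being read and would age out under a TTL in a real Redis deployment.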
The search index is built from the product catalog database using change data capture (CDC) with Debezium. Updates propagate within seconds — eventual consistency is acceptable here because a stale product in search is better than a failing search.
Caveats:
- Sorting by combined fields (e.g., relevance * price) requires careful mapping in Elasticsearch.
- Facet counts can be expensive; cache them separately and invalidate on product updates.
- Avoid deep pagination (>100th page); use search_after instead of from/size.
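For the search_after caveat, the request body changes only slightly between pages. The sketch below builds the body as a plain dict and sends nothing; the field names `name`, `price`, and `product_id` are assumptions about the index mapping, with `product_id` acting as a unique tiebreaker so the sort order is total.

```python
import json


def build_search_body(text, page_size=20, search_after=None):
    """Build an Elasticsearch search body that pages with search_after."""
    body = {
        "size": page_size,
        "query": {"match": {"name": text}},
        # The unique tiebreaker field keeps the ordering stable across pages.
        "sort": [{"price": "asc"}, {"product_id": "asc"}],
    }
    if search_after is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = search_after
    return body


first_page = build_search_body("running shoes")
# Pretend the last hit of page one had these sort values.
next_page = build_search_body("running shoes", search_after=[59.99, "sku-1042"])
print(json.dumps(next_page, indent=2))
```

Each page costs about the same regardless of depth, because the cursor replaces the growing from offset.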
Cart & Checkout Consistency
The cart seems innocuous — items, quantities, maybe a promo code. But checkout is where distributed systems meet financial reality. The cart state must be consistent while the checkout orchestrator runs a mini-saga across inventory, payment, and order services.
Cart Design:
- Store cart in Redis as a hash with TTL (e.g., 24 hours). This is fast and transient.
- On checkout initiation, move cart data to a persistent checkout session in PostgreSQL.
- Lock the cart to prevent modifications during checkout.
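The cart lifecycle above (TTL, snapshot at checkout, lock) can be sketched in memory. This is not Redis code: `CartStore` is a hypothetical stand-in for a Redis hash per cart, with an injectable clock so that expiry is testable.

```python
import time


class CartStore:
    """In-memory stand-in for one Redis hash per cart, with a TTL."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._carts = {}      # cart_id -> (expires_at, {sku: quantity})
        self._locked = set()  # carts currently in checkout

    def _live(self, cart_id):
        entry = self._carts.get(cart_id)
        if entry is None or entry[0] <= self._clock():
            self._carts.pop(cart_id, None)  # expired carts are dropped
            return None
        return entry[1]

    def add_item(self, cart_id, sku, quantity):
        if cart_id in self._locked:
            raise RuntimeError("cart is locked for checkout")
        items = self._live(cart_id) or {}
        items[sku] = items.get(sku, 0) + quantity
        # Every write refreshes the TTL.
        self._carts[cart_id] = (self._clock() + self._ttl, items)

    def begin_checkout(self, cart_id):
        items = self._live(cart_id)
        if not items:
            raise KeyError("cart is empty or expired")
        self._locked.add(cart_id)
        return dict(items)  # snapshot to persist in the checkout session


now = [0.0]  # fake clock for the demo
store = CartStore(ttl_seconds=10, clock=lambda: now[0])
store.add_item("cart-1", "sku-1", 2)
store.add_item("cart-1", "sku-1", 1)
snapshot = store.begin_checkout("cart-1")
print(snapshot)
```

The snapshot is what would be written to the persistent checkout session, so later cart expiry cannot change an order that is already in flight.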
Checkout Orchestrator Steps:
1. Validate cart (prices, stock, promo codes).
2. Reserve inventory items (atomic decrement in inventory service).
3. Call payment gateway with idempotency key.
4. On payment success, create order record.
5. If payment fails, release inventory reservations (compensating transaction).
This is the Saga pattern: a sequence of local transactions with compensating actions. Avoid distributed transactions (2PC) — they don't scale and break across services.
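The orchestrator steps and their compensating action can be sketched in a single process. Real services would sit behind network calls; here `inventory` and `orders` are plain Python containers, and `run_checkout`, `PaymentDeclined`, and the `charge` callable are hypothetical names.

```python
class PaymentDeclined(Exception):
    """Raised by the (hypothetical) payment step when a charge fails."""


def run_checkout(cart, inventory, charge, orders):
    """Reserve stock, charge, create the order; undo reservations on failure."""
    reserved = []
    try:
        for sku, quantity in cart.items():
            if inventory.get(sku, 0) < quantity:
                raise ValueError(f"insufficient stock for {sku}")
            inventory[sku] -= quantity          # local transaction 1: reserve
            reserved.append((sku, quantity))
        charge(cart)                            # local transaction 2: pay
        orders.append(dict(cart))               # local transaction 3: order
        return True
    except (ValueError, PaymentDeclined):
        for sku, quantity in reversed(reserved):
            inventory[sku] += quantity          # compensating action
        return False


def declining_gateway(cart):
    raise PaymentDeclined("card declined")


inventory = {"sku-1": 5, "sku-2": 1}
orders = []
ok = run_checkout({"sku-1": 2, "sku-2": 1}, inventory, declining_gateway, orders)
print(ok, inventory, orders)
```

After the declined payment the stock levels are back where they started and no order exists, which is the outcome the saga is meant to guarantee.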
Consistency Guarantee:
- Use an outbox pattern: the order service writes the order and an event row to its own database in the same local transaction, and a background worker later publishes the event reliably to a message queue.
- This ensures no order event is lost even if the message broker is down.
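A minimal outbox sketch, using Python's built-in sqlite3 in place of the order service's database. The table names, `create_order`, and `relay_once` are hypothetical, and the `publish` callable stands in for a message-queue client.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id TEXT, total_cents INTEGER);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0);
""")


def create_order(user_id, total_cents):
    """Write the order and its event in one transaction: both commit or neither."""
    with conn:
        cur = conn.execute(
            "INSERT INTO orders (user_id, total_cents) VALUES (?, ?)",
            (user_id, total_cents),
        )
        order_id = cur.lastrowid
        event = {"type": "OrderCreated", "order_id": order_id}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return order_id


def relay_once(publish):
    """Publish pending events; a row is marked only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # raises if the broker is down
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))


order_id = create_order("user-1", 1999)
sent = []
relay_once(sent.append)
print(sent)
```

Delivery here is at-least-once: a crash between publishing and marking the row would publish the event again, so consumers should deduplicate on the order id.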
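- Generate a deterministic idempotency key for the payment call: hash(userId + cartId + timestamp). The timestamp must be fixed once, when the checkout session is created; a fresh timestamp on each retry would produce a new key and defeat deduplication.
- The payment gateway must never process the same key twice: a repeat with the same payload returns the original result, and the same key with a different payload is rejected.
- Store the key in a database table with a unique constraint to enforce idempotency.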
- If the request times out, retry with the same key — no double charge.
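The key generation and the unique-constraint table can be sketched together. This is an illustration under assumptions: `idempotency_key`, `charge_once`, the table layout, and the `gateway_charge` callable are hypothetical, and sqlite3 stands in for the real database. The key is claimed by an INSERT before the gateway is called, so the check and the claim are one atomic step.

```python
import hashlib
import sqlite3


def idempotency_key(user_id, cart_id, checkout_started_at):
    """Deterministic key; the timestamp is fixed when checkout starts."""
    raw = f"{user_id}|{cart_id}|{checkout_started_at}".encode()
    return hashlib.sha256(raw).hexdigest()


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE payment_attempts ("
    "idem_key TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL, result TEXT)"
)


def charge_once(key, amount_cents, gateway_charge):
    """Claim the key first; a replay returns the stored result instead of charging."""
    try:
        with db:
            db.execute(
                "INSERT INTO payment_attempts (idem_key, amount_cents) VALUES (?, ?)",
                (key, amount_cents),
            )
    except sqlite3.IntegrityError:  # key already claimed by an earlier attempt
        stored_amount, result = db.execute(
            "SELECT amount_cents, result FROM payment_attempts WHERE idem_key = ?",
            (key,),
        ).fetchone()
        if stored_amount != amount_cents:
            raise ValueError("idempotency key reused with a different payload")
        if result is not None:
            return result
        # No result recorded (e.g. an earlier timeout): retry with the same key.
    result = gateway_charge(key, amount_cents)
    with db:
        db.execute(
            "UPDATE payment_attempts SET result = ? WHERE idem_key = ?", (result, key)
        )
    return result


gateway_calls = []


def fake_gateway(key, amount_cents):
    gateway_calls.append(key)
    return "succeeded"


key = idempotency_key("user-1", "cart-9", "2024-01-01T00:00:00Z")
print(charge_once(key, 500, fake_gateway), charge_once(key, 500, fake_gateway))
print(len(gateway_calls))
```

The key is also passed to the gateway, because the local table only protects against duplicates this service knows about; the provider has to deduplicate retries whose first attempt timed out.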
Payment System Reliability
Payment is the most critical subsystem because it moves real money. A successful payment must result in exactly one order and one charge. Payment systems at scale rely on three pillars: idempotency keys generated by the caller, retry with backoff, and duplicate detection enforced at the gateway level.
Idempotency:
- Before calling a payment gateway, generate a unique idempotency key (e.g., UUID per order attempt).
- Store the key and the request payload in a database table with a unique constraint.
- On a timeout or network error, retry with the same key. The gateway returns the original result.
Retry Strategy:
- Use exponential backoff with jitter: first retry after 1s, then 2s, 4s, up to 60s max.
- After 3 retries, escalate to a dead-letter queue for manual review.
- Monitor the rate of payment timeouts; a sudden spike may indicate a gateway issue.
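The retry schedule can be sketched as a small generator plus a retry loop. Jitter comes in several variants; this sketch assumes "full jitter" (a uniform delay between 0 and the current window), which is one common choice rather than the only one. `backoff_delays` and `call_with_retries` are hypothetical names, and `sleep` is injected so nothing actually waits in a test.

```python
import random


def backoff_delays(max_retries=3, base=1.0, cap=60.0, rng=random.random):
    """Yield one delay per retry: full jitter over a capped, doubling window."""
    for attempt in range(max_retries):
        window = min(cap, base * 2 ** attempt)  # 1s, 2s, 4s, ... capped at 60s
        yield rng() * window


def call_with_retries(operation, sleep, max_retries=3):
    """Try once, retry on TimeoutError, re-raise once retries are exhausted."""
    delays = backoff_delays(max_retries)
    while True:
        try:
            return operation()
        except TimeoutError:
            delay = next(delays, None)
            if delay is None:
                raise  # the caller routes this to the dead-letter queue
            sleep(delay)


attempts = {"count": 0}


def flaky_gateway():
    """Hypothetical gateway call that times out twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("gateway timeout")
    return "charged"


sleeps = []
outcome = call_with_retries(flaky_gateway, sleeps.append)
print(outcome, len(sleeps))
```

Every retry must reuse the same idempotency key, otherwise backoff only spaces out the duplicate charges.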
Failure Modes:
- Dual charge: Happens when idempotency is missing or the gateway doesn't support it. Always use a payment provider that supports idempotency keys (Stripe, Braintree, Adyen).
- Silent failures: Payment fails but the customer doesn't get an error, and the system leaves the order marked as pending. Use a reconciliation job that compares pending orders with gateway transactions daily.
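The daily reconciliation job boils down to a comparison between two data sets. In this sketch both sides are plain dicts, and the status strings and action names are hypothetical; a real job would page through the provider's transaction list.

```python
def reconcile(pending_orders, gateway_transactions):
    """Decide what to do with each pending order.

    pending_orders: {order_id: idempotency_key}
    gateway_transactions: {idempotency_key: status}
    """
    actions = {}
    for order_id, key in pending_orders.items():
        status = gateway_transactions.get(key)
        if status == "succeeded":
            actions[order_id] = "mark_paid"
        elif status == "failed":
            actions[order_id] = "cancel_and_release_stock"
        else:
            # The gateway has no final record: a human should look at it.
            actions[order_id] = "manual_review"
    return actions


pending = {101: "key-a", 102: "key-b", 103: "key-c"}
gateway = {"key-a": "succeeded", "key-b": "failed"}
actions = reconcile(pending, gateway)
print(actions)
```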
Scaling Strategies & Trade-offs
Scaling an e-commerce platform is not just adding more servers — it's about understanding where bottlenecks appear at each growth stage.
Stage 1: Up to 100k daily active users
- Monolithic architecture with separate read replicas for catalog.
- Redis cache for product pages and session data.
- Single PostgreSQL database with connection pooling.
Stage 2: 100k to 1M DAU
- Break out catalog and inventory services (as discussed).
- Use Elasticsearch for search, read replicas for orders.
- Asynchronous payment callbacks (webhooks).
- Message queue (RabbitMQ / Kafka) for order processing and inventory sync.
Stage 3: 1M+ DAU with flash sales
- Full microservices architecture with event sourcing.
- Each service has its own database (database-per-service).
- CDN for static assets and cached product pages.
- Pre-warming inventory cache for top products.
- Auto-scaling infrastructure with Kubernetes.
- Feature flags to quickly disable payment gateways or checkout during incidents.
The critical trade-off: consistency vs availability. During a flash sale, you might accept reduced availability for the checkout service to avoid overselling. Use a strong consistency model for inventory but allow reads from cache for product pages.
E-Commerce Architecture Types: Matching Topology to Traffic
Most naive designs start with a monolithic client-server setup. Client talks to server, server talks to database. Fine for a Shopify store with 50 visitors. But when your flash sale pushes 50k concurrent users, that single server becomes a bottleneck.
Two-tier architecture (client + database directly) is a death sentence for any real system. You expose raw DB queries to the network. Every SQL injection nightmare becomes real. I've seen production postmortems from teams who thought this was "simple."
Three-tier architecture is the baseline for any serious e-commerce platform. Presentation tier (React/Vue), application tier (REST/gRPC services), data tier (sharded databases). This gives you isolation. Your search service can have its own scaling policy and database without taking down checkout.
The real lesson: pick your tiers based on failure isolation, not just separation of concerns. Each tier must be independently deployable, testable, and scalable. If changing a product description requires redeploying your payment system, you've already lost.
Components That Tear Under Load: The Hidden Coupling
Every e-commerce platform has the same skeleton: product catalog, search, cart, checkout, payment, inventory, order management. The question is how tightly you've glued them together.
The biggest mistake I see? Sharing a single database across all components. Cart service locks rows on the same table that search queries. Now your search index refresh deadlocks against a checkout. I've debugged that at 3 AM. It's not fun.
Decompose by data ownership. Inventory owns stock counts. Cart owns session state. Payment owns transaction logs. They should only communicate through events or APIs, never through shared tables. Use an event bus (Kafka, RabbitMQ) to push inventory changes to search and to trigger order fulfillment.
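The event-driven decoupling can be shown with a tiny in-process bus. It is synchronous and non-durable, unlike Kafka or RabbitMQ, so treat it as a picture of the data flow only; `EventBus` and the topic names are hypothetical.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publish/subscribe bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._handlers[topic]:
            handler(event)


bus = EventBus()
search_index = {}       # owned by the search service: sku -> in stock?
fulfillment_queue = []  # owned by the fulfillment service


def update_search_availability(event):
    search_index[event["sku"]] = event["quantity"] > 0


bus.subscribe("inventory.changed", update_search_availability)
bus.subscribe("order.created", fulfillment_queue.append)

# The inventory and order services only publish; they never touch the
# tables or indexes owned by the subscribers.
bus.publish("inventory.changed", {"sku": "sku-1", "quantity": 0})
bus.publish("order.created", {"order_id": 42})
print(search_index, fulfillment_queue)
```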
Search needs its own index, preferably Elasticsearch or Meilisearch. Don't query the product DB for search; that's for lookup by ID only. Cart should live in Redis with TTL for session expiration. Payment needs a ledger-level audit trail written append-only (insert-only rows, never updated in place), for example in DynamoDB or Cassandra.
Your component boundaries define your blast radius. When the payment service goes down, you want users still able to browse products. If they're coupled at the DB level, everything burns together.
Importance of Domain Knowledge
E-commerce platforms fail when engineers treat them as generic CRUD apps. Domain knowledge separates a working system from one that collapses under Black Friday traffic. Understanding retail concepts like inventory buffers, payment settlement cycles, and fulfillment zoning directly impacts architectural decisions. Without domain knowledge, you build abstractions that leak complexity rather than contain it. You model a "Product" as a simple database row, missing that it has different states during checkout vs. restocking. Domain knowledge guides you to enforce invariants (like never overselling inventory) as business rules, not afterthoughts. When senior engineers lack domain fluency, they optimize for the wrong constraints—caching product catalogs aggressively while ignoring that price changes propagate slowly across supplier systems. Master domain knowledge first; technology second.
Strategic Design in DDD — Bounded Contexts & Ubiquitous Language
Strategic DDD partitions the e-commerce platform into Bounded Contexts: Catalog, Ordering, Inventory, Payments, Shipping. Each context owns its models and language. The "Customer" in Billing means billing address; in Shipping, it means delivery preferences. Ubiquitous Language ensures every team member—product, QA, and backend—uses terms like "reservation" not "temporary hold" and "shipment" not "package group." This eliminates translation errors during requirement gathering. Bounded Contexts communicate through context maps (e.g., Customer Service sends events, not direct DB reads). A Catalog context produces ProductCreated events; Inventory subscribes to initialize stock. This strategic design prevents god classes and allows independent deployability—critical when flash sales spike one context without cascading failure.
Tactical DDD Patterns — Entity, Value Object, Aggregate, Repository, Factory
Tactical patterns enforce consistency within a Bounded Context. An Entity (e.g., Cart) has identity: two carts are distinct even with the same items. A Value Object (e.g., Address) has no identity and is defined only by its attributes, so changing the street yields a different Address rather than mutating the old one. An Aggregate (e.g., Order) clusters entities and value objects under a single root entity and defines a transactional boundary. The Cart Aggregate, for example, includes CartItems (entities) and a ShippingAddress (value object). Repositories provide persistence abstractions over aggregates and never expose raw SQL. Factories encapsulate instantiation complexity (e.g., creating a bulk order vs. a single-item order). Benefits: tactical DDD prevents anemic domain models by keeping business logic in domain objects, not services. It reduces cognitive load for new engineers because rules live where they belong: on the order, not scattered across controllers.
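These patterns map fairly directly onto Python dataclasses. The classes below are a hypothetical sketch of an Order aggregate, not a prescribed model: `Address` is an immutable value object, `Order` is an entity and aggregate root compared by identity, and the invariants live on the aggregate.

```python
import uuid
from dataclasses import dataclass, field, replace
from typing import List


@dataclass(frozen=True)
class Address:
    """Value object: immutable, equal when all attributes are equal."""
    street: str
    city: str


@dataclass
class OrderLine:
    sku: str
    quantity: int


@dataclass(eq=False)  # entity: equality is identity, not attribute comparison
class Order:
    """Aggregate root: all changes to lines and address go through it."""
    shipping_address: Address
    order_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    lines: List[OrderLine] = field(default_factory=list)
    placed: bool = False

    def add_line(self, sku, quantity):
        if self.placed:
            raise ValueError("a placed order cannot be modified")
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self.lines.append(OrderLine(sku, quantity))

    def change_street(self, street):
        if self.placed:
            raise ValueError("a placed order cannot be modified")
        # Value objects are replaced, never mutated.
        self.shipping_address = replace(self.shipping_address, street=street)

    def place(self):
        if not self.lines:
            raise ValueError("cannot place an empty order")
        self.placed = True


home = Address("1 Main St", "Springfield")
first = Order(home)
second = Order(home)
first.add_line("sku-1", 2)
first.change_street("2 Oak Ave")
first.place()
print(first.shipping_address, first == second, home == Address("1 Main St", "Springfield"))
```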
Flash Sale Double-Book Disaster
- Inventory decrement must be atomic — never SELECT then UPDATE separately.
- Use row-level locks or optimistic locking for critical stock operations.
- Always test race conditions with simultaneous curl scripts before a flash sale.
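The same kind of race test can be rehearsed in-process before pointing scripts at a staging endpoint. In this sketch a barrier releases all buyers at once and the assertions check the invariant that matters: successful reservations never exceed stock. `Inventory` and `stampede` are hypothetical names; in a real test the `reserve` callable would issue the HTTP request.

```python
import threading


class Inventory:
    """Check and decrement happen under one lock, so they are atomic."""

    def __init__(self, quantity):
        self._quantity = quantity
        self._lock = threading.Lock()

    def try_reserve(self):
        with self._lock:
            if self._quantity > 0:
                self._quantity -= 1
                return True
            return False

    @property
    def quantity(self):
        return self._quantity


def stampede(reserve, buyers):
    """Start `buyers` threads that all call reserve() at the same moment."""
    barrier = threading.Barrier(buyers)
    results = []
    results_lock = threading.Lock()

    def worker():
        barrier.wait()  # hold every thread until all are ready
        succeeded = reserve()
        with results_lock:
            results.append(succeeded)

    threads = [threading.Thread(target=worker) for _ in range(buyers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


inventory = Inventory(quantity=5)
results = stampede(inventory.try_reserve, buyers=50)
print(sum(results), "sold,", inventory.quantity, "left")
```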
redis-cli info stats | grep keyspace_hits
curl -s -w '%{time_total}' -o /dev/null https://your-cdn-endpoint/product/123
Common mistakes to avoid
- Keeping inventory and catalog in the same database
- Not using idempotency keys for payment
- Implementing cart as a server-side session only
Interview Questions on This Topic
How would you design the checkout flow in a high-traffic e-commerce platform to prevent overselling?
Decrement stock with a single atomic statement: UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ? AND quantity > 0. Then check affected_rows; if it is zero, the item is sold out and the request is rejected. For high contention, use a Redis Lua script for the decrement and fall back to the database. After reserving inventory, call the payment gateway with an idempotency key. If payment fails, release the reservation by incrementing quantity back. Use a Saga pattern with compensating transactions. For flash sales, consider a pre-reservation step where the cart holds items for a short time (e.g., 5 minutes) before finalizing.
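The conditional UPDATE from this answer can be tried with Python's built-in sqlite3 module. The table layout and the `reserve` and `release` helpers are hypothetical, and sqlite3 exposes the affected-row count as `cursor.rowcount`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id TEXT PRIMARY KEY, quantity INTEGER NOT NULL)"
)
conn.execute("INSERT INTO inventory VALUES ('sku-1', 2)")
conn.commit()


def reserve(product_id):
    """Atomic check-and-decrement: succeeds only while quantity > 0."""
    with conn:
        cur = conn.execute(
            "UPDATE inventory SET quantity = quantity - 1 "
            "WHERE product_id = ? AND quantity > 0",
            (product_id,),
        )
    return cur.rowcount == 1  # zero affected rows means sold out


def release(product_id):
    """Compensating action: give the unit back after a failed payment."""
    with conn:
        conn.execute(
            "UPDATE inventory SET quantity = quantity + 1 WHERE product_id = ?",
            (product_id,),
        )


attempts = [reserve("sku-1") for _ in range(3)]
print(attempts)
```

The WHERE clause carries the stock check, so there is no window between reading the quantity and writing it.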