E-commerce System Design — Flash Sale Race Conditions
SELECT-then-UPDATE inventory causes double-bookings under load.
- E-commerce platforms are distributed systems managing product discovery, cart, checkout, payments, and inventory under high concurrency
- Key components: product catalog service, cart service, checkout orchestration, payment gateway, inventory service
- Performance insight: Product search must return in <200ms; use Elasticsearch with caching
- Production insight: Without idempotency in payments, a single retry can charge a customer twice — use idempotency keys
- Biggest mistake: Keeping catalog and inventory in the same service, which leads to tight coupling and checkout failures
Imagine you're running the world's biggest flea market. You've got thousands of sellers, millions of buyers, and everyone wants to browse, pick something, pay, and get it delivered — all at the same time, without chaos. Building an e-commerce platform is exactly that: designing the invisible plumbing that makes sure the right product gets to the right buyer, the money moves safely, and nothing crashes when a flash sale hits at midnight.
Amazon processes over 66,000 orders per minute at peak. Shopify powers over 4 million stores. These aren't just databases with a shopping cart bolted on — they're distributed systems solving some of the hardest problems in engineering: consistency under concurrency, sub-second search over millions of products, payment reliability, and inventory accuracy across warehouses. If you're designing an e-commerce platform from scratch, every architectural decision you make will either hold up under that load or quietly become technical debt that kills you at scale.
The core problem e-commerce platforms solve is deceptively simple on the surface: let someone find a product, add it to a cart, pay for it, and receive it. But underneath that user flow are a dozen non-trivial challenges — you need to prevent two buyers from purchasing the last item simultaneously, ensure a failed payment never charges a card twice, serve product search results in under 200ms, and handle a 10x traffic spike the moment a celebrity tweets about your product. Each of these requires a deliberate architectural choice, and the wrong choice doesn't just slow things down — it loses money or erodes customer trust instantly.
By the end of this article, you'll be able to walk into a system design interview and confidently sketch the architecture of a production-grade e-commerce platform. You'll understand why the product catalog and inventory services must be separated, how to handle the distributed transaction problem at checkout, what caching strategy keeps product pages fast, and how to design a payment system that is both idempotent and fault-tolerant. This is the article I wish existed when I was preparing for those interviews.
Core Components & Service Separation
An e-commerce platform is a set of loosely coupled services, each responsible for one domain. The core services are:
- Product Catalog Service: Manages product metadata (name, description, images, categories, prices). This is read-heavy and benefits from caching and Elasticsearch.
- Cart Service: Manages user sessions, add/remove items, coupon application. It's write-heavy for the current session but read-only for historical data.
- Checkout Orchestrator: Coordinates the actual purchase — validates cart, locks inventory, calls payment gateway, creates order. This is the most failure-sensitive service.
- Inventory Service: Tracks stock levels across warehouses, reserves items during checkout, handles restocks.
- Payment Service: Interacts with external gateways (Stripe, PayPal), stores idempotency keys, handles retries.
- Order Service: Records completed orders, sends notifications, manages returns.
The mistake everyone makes is bundling inventory with the catalog. These have completely different access patterns: catalog is read-heavy and stale-ok, inventory is write-heavy and consistency-critical. Keep them separate from the start.
- Catalog: Fast reads, eventual consistency acceptable.
- Cart: Temporary state, can be lost without financial impact.
- Inventory: Strong consistency, cannot oversell.
- Payment: Must be idempotent and auditable.
- Order: Immutable after creation, source of truth.
Product Search & Catalog Performance
Product search is the gateway to purchase. Users expect results in under 200ms, with filters for category, price range, rating, and sorting by relevance or newest. Achieving this at scale means you cannot query the primary database directly.
Architecture:
- Use Elasticsearch as the search index. It supports full-text search, faceted aggregation, and fuzzy matching out of the box.
- Keep a read-through cache (Redis) for product detail pages (PDP). The cache key is product_id:locale:version.
- For autocomplete, use a prefix-based trie in memory or Elasticsearch's completion suggester.
The search index is built from the product catalog database using change data capture (CDC) with Debezium. Updates propagate within seconds — eventual consistency is acceptable here because a stale product in search is better than a failing search.
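The read-through cache for product detail pages can be sketched roughly as follows. This is a minimal illustration, not a production design: a plain dict stands in for Redis, and the class name `ReadThroughCache`, the `loader` callback, and the `misses` counter are hypothetical names introduced only for this example. The one detail taken from the text is the key layout product_id:locale:version.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """In-memory stand-in for Redis (assumption: real code would call a Redis client)."""

    def __init__(self, loader: Callable[[str, str], dict]):
        self._loader = loader              # loads a product from the catalog database
        self._store: Dict[str, dict] = {}
        self.misses = 0                    # counts how often the loader was needed

    @staticmethod
    def key(product_id: str, locale: str, version: int) -> str:
        # Key layout from the article: product_id:locale:version
        return f"{product_id}:{locale}:{version}"

    def get(self, product_id: str, locale: str, version: int) -> dict:
        k = self.key(product_id, locale, version)
        if k not in self._store:
            # Cache miss: read through to the source of truth, then remember the result.
            self.misses += 1
            self._store[k] = self._loader(product_id, locale)
        return self._store[k]
```

Putting the version in the key means a product update naturally produces a new key, so stale entries are never served for the new version and can simply expire.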
Caveats:
- Sorting by combined fields (e.g., relevance * price) requires careful mapping in Elasticsearch.
- Facet counts can be expensive; cache them separately and invalidate on product updates.
- Avoid deep pagination (>100th page); use search_after instead of from/size.
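To make the search_after advice concrete, here is a hedged sketch of how the request bodies might be built. It only constructs the JSON body as a Python dict and does not talk to a cluster. The field names `name` and `product_id` are assumptions about the index mapping (`product_id` would need to be a sortable keyword field acting as a tiebreaker), and both function names are hypothetical.

```python
from typing import Any, Dict, List, Optional


def build_search_body(text: str, page_size: int = 20,
                      last_sort_values: Optional[List[Any]] = None) -> Dict[str, Any]:
    """Build a search request body that pages with search_after instead of from/size."""
    body: Dict[str, Any] = {
        "size": page_size,
        "query": {"match": {"name": text}},   # assumed field name
        # A unique tiebreaker keeps the sort order stable between pages.
        "sort": [{"_score": "desc"}, {"product_id": "asc"}],
    }
    if last_sort_values is not None:
        # Sort values of the last hit on the previous page.
        body["search_after"] = last_sort_values
    return body


def last_sort_values(response: Dict[str, Any]) -> Optional[List[Any]]:
    """Extract the cursor for the next page from a search response."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None
```

The first page is requested without a cursor; each later page passes the `sort` array of the previous page's last hit, so the cost of a page does not grow with its depth.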
Cart & Checkout Consistency
The cart seems innocuous — items, quantities, maybe a promo code. But checkout is where distributed systems meet financial reality. The cart state must be consistent while the checkout orchestrator runs a mini-saga across inventory, payment, and order services.
Cart Design:
- Store cart in Redis as a hash with TTL (e.g., 24 hours). This is fast and transient.
- On checkout initiation, move cart data to a persistent checkout session in PostgreSQL.
- Lock the cart to prevent modifications during checkout.
Checkout Orchestrator Steps:
1. Validate cart (prices, stock, promo codes).
2. Reserve inventory items (atomic decrement in inventory service).
3. Call payment gateway with idempotency key.
4. On payment success, create order record.
5. If payment fails, release inventory reservations (compensating transaction).
This is the Saga pattern: a sequence of local transactions with compensating actions. Avoid distributed transactions (2PC): they scale poorly, and a single slow or unavailable participant blocks every other service in the transaction.
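The orchestrator steps above can be sketched as a small saga. This is an illustrative sketch under stated assumptions: `Inventory` is an in-memory stand-in for the inventory service (a real one would use an atomic decrement), and `charge` and `create_order` are hypothetical callbacks representing the payment and order services.

```python
class PaymentDeclined(Exception):
    """Raised by the (hypothetical) payment callback when a charge fails."""


class Inventory:
    """In-memory stand-in for the inventory service."""

    def __init__(self, stock):
        self.stock = dict(stock)

    def reserve(self, sku, qty):
        if self.stock.get(sku, 0) < qty:
            raise ValueError(f"out of stock: {sku}")
        self.stock[sku] -= qty

    def release(self, sku, qty):
        self.stock[sku] += qty


def checkout(cart, inventory, charge, create_order, idempotency_key):
    """Run reserve -> charge -> create order, compensating on failure."""
    reserved = []
    try:
        for sku, qty in cart.items():
            inventory.reserve(sku, qty)          # local transaction 1..n
            reserved.append((sku, qty))
        charge(idempotency_key, cart)            # payment step
    except Exception:
        # Compensating transactions: undo reservations in reverse order.
        for sku, qty in reversed(reserved):
            inventory.release(sku, qty)
        raise
    return create_order(cart)                    # only reached after a successful charge
```

The sketch deliberately leaves out what happens if order creation fails after a successful charge; a real system would need a refund or a retry path for that case.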
Consistency Guarantee:
- Use an outbox pattern: the order service writes an event to a database table, and a background worker publishes it reliably to a message queue.
- This ensures no order is lost even if the message broker is down.
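A minimal sketch of the outbox pattern follows, using SQLite from the Python standard library so it runs anywhere. The table layout, the function names, and the `publish` callback (standing in for a message-queue client) are assumptions made for illustration.

```python
import json
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def create_order(conn: sqlite3.Connection, order_id: str, total_cents: int) -> None:
    # One local transaction: the order row and its event commit together or not at all.
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        event = json.dumps({"type": "order_created", "order_id": order_id})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (event,))


def publish_pending(conn: sqlite3.Connection, publish) -> int:
    """Background-worker step: publish unpublished events in insertion order."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, payload in rows:
        publish(payload)  # if this raises, the row stays unpublished and is retried later
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

Because an event is marked as published only after `publish` returns, a crash between the two steps can deliver the same event twice. Consumers should therefore treat events as at-least-once and deduplicate on the order ID.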
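Idempotency Key:
- Generate a deterministic key: hash(userId + cartId + timestamp), where the timestamp is fixed once when the checkout session is created so that every retry produces the same key.
- The payment gateway must recognize a repeated key with the same payload and return the original result instead of charging again; the same key sent with a different payload should be rejected.
- Store the key in a database table with a unique constraint to enforce idempotency.
- If the request times out, retry with the same key, so there is no double charge.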
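The key generation and the unique-constraint check might look like the sketch below. It uses SQLite and hashlib from the standard library. The table name `payment_attempts`, the function names, and the `call_gateway` callback are hypothetical, and the timestamp is assumed to be the checkout session's creation time rather than the current time.

```python
import hashlib
import sqlite3


def idempotency_key(user_id: str, cart_id: str, checkout_started_at: str) -> str:
    # Deterministic: the same checkout session always hashes to the same key.
    raw = f"{user_id}|{cart_id}|{checkout_started_at}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE payment_attempts ("
        " idem_key TEXT PRIMARY KEY,"   # unique constraint enforces idempotency
        " payload TEXT NOT NULL,"
        " result TEXT)"
    )


def charge_once(conn: sqlite3.Connection, key: str, payload: str, call_gateway) -> str:
    try:
        with conn:
            conn.execute(
                "INSERT INTO payment_attempts (idem_key, payload) VALUES (?, ?)",
                (key, payload),
            )
    except sqlite3.IntegrityError:
        stored_payload, result = conn.execute(
            "SELECT payload, result FROM payment_attempts WHERE idem_key = ?", (key,)
        ).fetchone()
        if stored_payload != payload:
            raise ValueError("idempotency key reused with a different payload")
        if result is not None:
            return result  # already charged: return the original result
        # No result recorded (e.g. earlier timeout): retry with the same key.
    result = call_gateway(key, payload)
    with conn:
        conn.execute(
            "UPDATE payment_attempts SET result = ? WHERE idem_key = ?", (result, key)
        )
    return result
```

The local table only prevents this service from issuing a second charge for a completed attempt; the retry path after a timeout still depends on the gateway honoring the same key.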
Payment System Reliability
Payment is the most critical subsystem because it moves real money. A successful payment must result in exactly one order and one charge. Payment systems at scale rely on three pillars: idempotency keys generated by the caller, retry with backoff, and idempotency enforcement at the gateway level.
Idempotency:
- Before calling a payment gateway, generate a unique idempotency key (e.g., UUID per order attempt).
- Store the key and the request payload in a database table with a unique constraint.
- On a timeout or network error, retry with the same key. The gateway returns the original result.
Retry Strategy:
- Use exponential backoff with jitter: first retry after 1s, then 2s, 4s, up to 60s max.
- After 3 retries, escalate to a dead-letter queue for manual review.
- Monitor the rate of payment timeouts; a sudden spike may indicate a gateway issue.
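The retry strategy can be sketched as a small helper. This is one possible variant, not a prescribed one: it jitters each delay between half and all of the exponential ceiling. The function name `call_with_retry`, the injectable `sleep` and `rng` parameters (there to make it testable), and the `dead_letter` callback are all assumptions for this example.

```python
import random
import time


def call_with_retry(operation, max_retries=3, base=1.0, cap=60.0,
                    sleep=time.sleep, rng=random.random, dead_letter=None):
    """Call operation(); retry timeouts and connection errors with backoff and jitter."""
    attempt = 0
    while True:
        try:
            return operation()
        except (TimeoutError, ConnectionError) as exc:
            if attempt >= max_retries:
                # Retries exhausted: hand over for manual review, then surface the error.
                if dead_letter is not None:
                    dead_letter(exc)
                raise
            ceiling = min(cap, base * 2 ** attempt)   # 1s, 2s, 4s, ... capped at 60s
            sleep(ceiling * (0.5 + rng() / 2))        # jitter in [ceiling/2, ceiling)
            attempt += 1
```

The operation passed in should reuse the same idempotency key on every attempt; otherwise retrying a timed-out charge is exactly how a double charge happens.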
Failure Modes:
- Dual charge: Happens when idempotency is missing or the gateway doesn't support it. Always use a payment provider that supports idempotency keys (Stripe, Braintree, Adyen).
- Silent failures: Payment fails but the customer doesn't get an error, and the system marks the order as pending. Use a reconciliation job that compares pending orders with gateway transactions daily.
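The daily reconciliation job mentioned above could start from something like this sketch. The data shapes are assumptions: pending orders are given as a mapping from order ID to idempotency key, gateway transactions as a mapping from key to a status string, and the status and action names are hypothetical.

```python
from typing import Dict


def reconcile(pending_orders: Dict[str, str],
              gateway_transactions: Dict[str, str]) -> Dict[str, str]:
    """Decide what to do with each pending order based on the gateway's view."""
    actions: Dict[str, str] = {}
    for order_id, idem_key in pending_orders.items():
        status = gateway_transactions.get(idem_key)
        if status == "succeeded":
            actions[order_id] = "mark_paid"       # money moved, order was stuck
        elif status == "failed":
            actions[order_id] = "mark_failed"     # release stock, notify customer
        else:
            # Unknown to the gateway or still in flight: do not guess.
            actions[order_id] = "manual_review"
    return actions
```

Keeping the decision logic pure, with no database or network calls, makes it easy to test; the surrounding job would load both inputs and apply the returned actions.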
Scaling Strategies & Trade-offs
Scaling an e-commerce platform is not just adding more servers — it's about understanding where bottlenecks appear at each growth stage.
Stage 1: Up to 100k daily active users
- Monolithic architecture with separate read replicas for catalog.
- Redis cache for product pages and session data.
- Single PostgreSQL database with connection pooling.
Stage 2: 100k to 1M DAU
- Break out catalog and inventory services (as discussed).
- Use Elasticsearch for search, read replicas for orders.
- Asynchronous payment callbacks (webhooks).
- Message queue (RabbitMQ / Kafka) for order processing and inventory sync.
Stage 3: 1M+ DAU with flash sales
- Full microservices architecture with event sourcing.
- Each service has its own database (database-per-service).
- CDN for static assets and cached product pages.
- Pre-warming inventory cache for top products.
- Auto-scaling infrastructure with Kubernetes.
- Feature flags to quickly disable payment gateways or checkout during incidents.
The critical trade-off: consistency vs availability. During a flash sale, you might accept reduced availability for the checkout service to avoid overselling. Use a strong consistency model for inventory but allow reads from cache for product pages.
Flash Sale Double-Book Disaster
- Inventory decrement must be atomic — never SELECT then UPDATE separately.
- Use row-level locks or optimistic locking for critical stock operations.
- Always test race conditions with simultaneous curl scripts before a flash sale.
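As a sketch of the atomic decrement, the following uses SQLite so it runs without any setup. The table layout and function names are assumptions. The point is that the stock check and the decrement happen in one UPDATE statement, so no other writer can slip in between a separate read and write.

```python
import sqlite3


def init_db(conn: sqlite3.Connection, product_id: str, quantity: int) -> None:
    conn.execute(
        "CREATE TABLE inventory (product_id TEXT PRIMARY KEY, quantity INTEGER NOT NULL)"
    )
    with conn:
        conn.execute("INSERT INTO inventory VALUES (?, ?)", (product_id, quantity))


def reserve_one(conn: sqlite3.Connection, product_id: str) -> bool:
    """Atomically take one unit; returns False when the item is sold out or unknown."""
    with conn:
        cur = conn.execute(
            "UPDATE inventory SET quantity = quantity - 1 "
            "WHERE product_id = ? AND quantity > 0",
            (product_id,),
        )
    # rowcount is 1 if a row matched the stock check and was decremented, else 0.
    return cur.rowcount == 1
```

With one unit in stock, the first `reserve_one` call succeeds and the second is rejected, regardless of how the two requests interleave, because the database serializes the two UPDATE statements on that row.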
Common mistakes to avoid
- Keeping inventory and catalog in the same database
- Not using idempotency keys for payment
- Implementing cart as a server-side session only
Interview Questions on This Topic
How would you design the checkout flow in a high-traffic e-commerce platform to prevent overselling?
Make the inventory decrement a single conditional statement: UPDATE inventory SET quantity = quantity - 1 WHERE product_id = ? AND quantity > 0. Then check affected_rows; if it is zero, the item is sold out and the request is rejected. For high contention, use a Redis Lua script for the decrement and fall back to the database. After reserving inventory, call the payment gateway with an idempotency key. If payment fails, release the reservation by incrementing quantity back. Use a Saga pattern with compensating transactions. For flash sales, consider a pre-reservation step where the cart holds items for a short time (e.g., 5 minutes) before finalizing.