Mid-level 15 min · March 06, 2026

Design Amazon — S3 Blast Radius and Checkout Races

A mistyped S3 command took down Amazon.com.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

Last updated: May 23, 2026
Quick Answer
  • Amazon is a multi-service distributed system: catalog, cart, orders, payments, search, recs, logistics — all independent but coordinated
  • Product catalog is read-optimised with caching; inventory is real-time with strong consistency in a separate store
  • Cart lives in a low-latency KV store (DynamoDB); order history in a relational DB (Aurora)
  • Search is powered by a dedicated search engine (Elasticsearch) separate from the OLTP DB
  • Payments must be idempotent and strongly consistent: one duplicate charge can cost millions
  • The biggest mistake: designing for consistency everywhere — you'll kill availability and latency
Definition
What is Design Amazon?

This article tackles the two hardest problems in designing a system like Amazon at scale: containing the blast radius of S3 failures and preventing checkout race conditions that lose money. S3 is not a database: it gives strong read-after-write consistency per object but no transactions across objects, so a single corrupted catalog entry or a misconfigured bucket policy can cascade into product unavailability for millions.

You'll learn how to isolate failure domains using separate buckets per service, implement cross-region replication with eventual consistency awareness, and use versioning with lifecycle policies to prevent data loss without creating global bottlenecks. The checkout race problem is equally brutal: when two customers attempt to buy the last item simultaneously, naive locking kills throughput, but optimistic concurrency with idempotency keys and conditional writes to DynamoDB can handle 100k+ transactions per second without corruption.
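As a rough illustration of the last-item race, here is a minimal in-memory sketch of optimistic concurrency with a conditional write. `ConditionalStore`, `put_if_version`, and `buy_one` are hypothetical names invented for this example; a real store such as DynamoDB exposes conditional writes through its own API, which is not modelled here.

```python
import threading


class ConditionalStore:
    """In-memory stand-in for a store that supports conditional writes."""

    def __init__(self):
        self._items = {}  # key -> (value, version)
        self._mutex = threading.Lock()  # models the store's per-item atomicity

    def put(self, key, value):
        with self._mutex:
            _, version = self._items.get(key, (None, 0))
            self._items[key] = (value, version + 1)

    def get(self, key):
        with self._mutex:
            return self._items[key]

    def put_if_version(self, key, value, expected_version):
        """Write only if nobody else has written since we read."""
        with self._mutex:
            _, version = self._items[key]
            if version != expected_version:
                return False
            self._items[key] = (value, version + 1)
            return True


def buy_one(store, item_id, max_attempts=3):
    """Try to take one unit; returns True if this caller got it."""
    for _ in range(max_attempts):
        stock, version = store.get(item_id)
        if stock <= 0:
            return False
        if store.put_if_version(item_id, stock - 1, version):
            return True
    return False  # gave up after repeated conflicts


store = ConditionalStore()
store.put("ASIN-LAST", 1)  # one unit left
results = [buy_one(store, "ASIN-LAST"), buy_one(store, "ASIN-LAST")]
```

Only one of the two buyers wins, and stock never goes negative. The retry loop is bounded on purpose: under heavy contention most attempts conflict, which is the cost of the optimistic approach.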

This isn't theory — these are patterns proven at Amazon's actual scale, where a single unhandled race condition costs millions per minute. You'll also see why you should never use S3 for hot-path checkout state, and when to reach for ElastiCache or DynamoDB instead.

The article assumes you know basic distributed systems concepts and want the hard-won lessons from operating at planetary scale.

Plain-English First

Imagine a massive warehouse with millions of shelves, thousands of cashiers, a personal shopping assistant who remembers everything you've ever bought, and a delivery network that spans the globe. Amazon is exactly that — but built from software. Every time you search for headphones, add them to a cart, pay, and track a package, dozens of separate systems are quietly talking to each other to make it feel seamless. This article is about how those systems are actually designed — and the real trade-offs that keep them running under peak load.

Amazon processes over 66,000 orders per minute at peak, serves hundreds of millions of customers across 20+ countries, and runs one of the most complex distributed systems ever built — all while most transactions complete in under a second. Understanding how to design a system at this scale isn't just an interview exercise; it's a masterclass in the real trade-offs that define modern software engineering: consistency vs. availability, latency vs. accuracy, operational simplicity vs. raw performance.

The core problem Amazon solves is multi-dimensional. It's not just a database with a shopping cart on top. It's a real-time inventory system, a personalization engine, a payments processor, a logistics orchestrator, a search engine, and a seller marketplace — all running simultaneously, all needing to agree on the state of the world, and all needing to survive individual component failures without the customer ever noticing. The challenge isn't writing any one of these systems; it's making them work together under crushing load.

By the end of this article, you'll be able to walk into a system design interview and articulate a coherent, production-realistic Amazon architecture. You'll understand why the product catalog is separated from inventory, why the cart lives in a different data store than order history, how search is decoupled from the relational database, and what actually happens between you clicking 'Buy Now' and your order appearing on screen. You'll know the real trade-offs, not just the happy path.

What Designing Amazon Really Means — S3 Blast Radius and Checkout Races

Designing Amazon means decomposing a monolithic e-commerce platform into hundreds of loosely coupled, fault-tolerant services, each owning a single business capability. The core mechanic is service-oriented architecture (SOA) with explicit contracts, asynchronous communication via message queues, and eventual consistency for non-critical paths. You trade strong consistency for availability and partition tolerance — the CAP theorem in practice.

Key properties: each service (e.g., Cart, Order, Payment, Inventory) runs independently, scales horizontally, and fails without cascading. S3 stores product images and static assets with 99.999999999% durability, but a misconfigured bucket policy can expose all objects globally — the blast radius of a single IAM mistake. Checkout involves a distributed transaction: reserve inventory, charge payment, create order. If any step fails, you must compensate (e.g., release inventory) via a saga pattern, not a two-phase commit.

Use this architecture when you need massive scale, independent deployability, and fault isolation. It matters because a single region outage or a race condition in checkout can cost millions per minute. Real systems use idempotency keys, circuit breakers, and dead-letter queues to handle partial failures gracefully.

Blast Radius Is Not Just About Code
A misconfigured S3 bucket policy can expose all customer data globally — blast radius includes IAM, network ACLs, and monitoring, not just service boundaries.
Production Insight
During a Black Friday event, a misconfigured S3 bucket policy allowed public read access to product images, exposing metadata of 2 million items.
Symptom: security scan flagged 's3:GetObject' without 'Condition:IpAddress' — blast radius included all objects in the bucket.
Rule: always apply least-privilege bucket policies with explicit deny for public access, and enable S3 Block Public Access at account level.
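One way to catch this class of mistake before deployment is a small policy lint. The sketch below scans a bucket policy document for `Allow` statements that target every principal with no `Condition`. The function name, bucket name, and role ARN are made up for illustration, and the check is deliberately simplistic: it is not a replacement for S3 Block Public Access or a real policy analyzer.

```python
def public_statements(policy):
    """Return the Allow statements that grant access to any principal
    without a Condition block narrowing them."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    flagged = []
    for stmt in statements:
        principal = stmt.get("Principal")
        is_wildcard = principal == "*" or (
            isinstance(principal, dict) and principal.get("AWS") == "*"
        )
        if stmt.get("Effect") == "Allow" and is_wildcard and not stmt.get("Condition"):
            flagged.append(stmt)
    return flagged


# Hypothetical policy: one public statement, one scoped to a service role.
risky_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-product-images/*",
        },
        {
            "Sid": "CatalogServiceRead",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/catalog-service"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-product-images/*",
        },
    ],
}

flagged_sids = [s["Sid"] for s in public_statements(risky_policy)]
```

Here only `PublicRead` is flagged. Running a check like this in CI keeps the blast radius of a policy typo to a failed build.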
Key Takeaway
Design for failure: every service must degrade gracefully and never cascade.
Eventual consistency is a tool, not a bug — use idempotency keys and sagas for checkout.
Blast radius is bounded by IAM, network, and data access — audit every cross-service permission.
[Figure: Amazon S3 Blast Radius & Checkout Race Conditions. Architecture decomposition from S3 failure isolation to checkout consistency. Panels: S3 Blast Radius (failure isolation per tenant/object prefix); Service Decomposition (microservices with bounded contexts); Read-Optimized Path (search and recs via CDN + caching); Checkout Flow (cart to confirmation with idempotency); Data Consistency Trade-offs (eventual vs strong across services). Warning: a race condition on checkout can cause a double charge or a lost order; use idempotency keys and optimistic locking per cart.]

Core Architecture Principles

Amazon's architecture is built on a few non-negotiable principles. First, data ownership is absolute: each microservice owns its data exclusively — no shared tables between services. Second, communication is asynchronous where possible: use events (Kafka) for order creation, inventory updates, and shipping triggers. Synchronous calls are reserved for operations that need immediate confirmation, like payment gateway interaction. Third, cache everything that can be stale. The product catalog, search results, recommendations — all served from cached layers that accept minutes of staleness. Fourth, fail gracefully: if a downstream service is down, the system degrades, it doesn't crash. The homepage might show fewer recommendations, but the site stays up.

These principles are not theoretical — they were earned through real production failures. The 2017 S3 outage showed that shared infrastructure can bring down the entire site. The 2020 DynamoDB throttling event during Prime Day taught them to provision for 3x peak traffic. Every principle has a scar.
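The fourth principle, failing gracefully, is often implemented with a circuit breaker plus a fallback. The following is a minimal sketch, not a production library: `CircuitBreaker`, its parameters, and the two recommendation functions are hypothetical names chosen for this example.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_seconds:
                return fallback()  # open: skip the dependency entirely
            # Cool-down over: allow one trial call. The failure count is kept,
            # so a single failed trial re-opens the circuit immediately.
            self._opened_at = None
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0  # success closes the circuit fully
        return result


calls = []


def flaky_recommendations():
    calls.append(1)
    raise TimeoutError("recommendation service unavailable")


def empty_recommendations():
    return []  # degraded homepage: no recommendations, but the page renders


breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=60.0)
pages = [breaker.call(flaky_recommendations, empty_recommendations)
         for _ in range(5)]
```

After two failures the breaker opens, so the remaining three page loads never touch the failing service and still return the degraded result.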

io/thecodeforge/order/EventPublisher.java (Java)
package io.thecodeforge.order;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class EventPublisher {
    private static final Logger log = LoggerFactory.getLogger(EventPublisher.class);

    private final KafkaTemplate<String, OrderEvent> kafka;

    public EventPublisher(KafkaTemplate<String, OrderEvent> kafka) {
        this.kafka = kafka;
    }

    public void publishOrderCreated(Order order) {
        OrderEvent event = new OrderEvent(
            order.getId(),
            order.getCustomerId(),
            order.getTotal(),
            "CREATED"
        );
        // Key by order ID so all events for one order land on the same partition.
        // Assumes Spring Kafka 3.x, where send() returns a CompletableFuture.
        kafka.send("order_events", order.getId(), event)
            .whenComplete((result, ex) -> {
                if (ex != null) {
                    // Fallback: log to dead letter queue or retry
                    log.error("Failed to publish order event for {}", order.getId(), ex);
                }
            });
    }
}
Boxes and Arrows
  • Each service owns its data — no shared DB tables between services.
  • Communicate through events for most flows; use synchronous calls only for idempotent, critical paths.
  • If two services need to share a database table, merge them into one service.
  • Design for partial failure: every external call can fail, and the system must survive.
Production Insight
Shared databases are the top cause of cascading failures.
When the inventory service goes down, it shouldn't take the catalog service with it.
Rule: if you share a database, you've actually built one service with two codebases.
Key Takeaway
Data ownership boundaries define your architecture.
Shared data = shared fate.
Own your data, own your availability.

Requirements & Estimation

Before drawing boxes, we need numbers. Amazon serves ~200M active customers and processes ~66,000 orders/min at peak, which is ~1,100 orders every second. Each order generates writes to the cart, inventory, payment, order, and logistics services. The read-to-write ratio for the product catalog is roughly 100:1, while for the cart it's about 1:1 (every add is followed by a read during checkout). Storage: the product catalog holds ~100M items, each with 10-50 KB of metadata, so 1-5 TB in the DB. Images live in object storage (S3) and total petabytes. Network bandwidth: each page load transfers ~2 MB (HTML, JS, images). At 200M DAU and an average of 10 pages per session, that is 2B page loads/day, or ~4 PB/day outbound, which is why you need a CDN and aggressive caching.

These numbers drive every architecture decision. You don't design for 66K orders/min without knowing your bottleneck: database writes per second, queue throughput, payment latency SLA. A common mistake: designing for average load, not peak. Prime Day traffic spikes 5-10x above average. So you need to provision for at least 2x your estimated peak, and then use auto-scaling to handle surges.
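The arithmetic is easy to get wrong by a factor of a thousand, so it helps to write it down. This short script recomputes the estimates from the inputs stated above (200M users, 10 pages each, ~2 MB per page, 66,000 orders/min, 5 writes per order); the variable names are just for this sketch.

```python
SECONDS_PER_DAY = 86_400

daily_active_users = 200_000_000
pages_per_user = 10
page_size_bytes = 2 * 10**6      # ~2 MB per page load
peak_orders_per_minute = 66_000
writes_per_order = 5             # cart, inventory, payment, order, logistics

page_loads_per_day = daily_active_users * pages_per_user
avg_page_loads_per_sec = page_loads_per_day / SECONDS_PER_DAY
egress_bytes_per_day = page_loads_per_day * page_size_bytes
avg_egress_gbytes_per_sec = egress_bytes_per_day / SECONDS_PER_DAY / 10**9
peak_orders_per_sec = peak_orders_per_minute / 60
peak_order_writes_per_sec = peak_orders_per_sec * writes_per_order

print(f"page loads/day:          {page_loads_per_day:,}")
print(f"avg page loads/sec:      {avg_page_loads_per_sec:,.0f}")
print(f"egress/day (PB):         {egress_bytes_per_day / 10**15:.1f}")
print(f"avg egress (GB/s):       {avg_egress_gbytes_per_sec:.1f}")
print(f"peak orders/sec:         {peak_orders_per_sec:,.0f}")
print(f"peak order writes/sec:   {peak_order_writes_per_sec:,.0f}")
```

These are daily averages except for the order figures; peak read and egress rates would be the averages multiplied by whatever spike factor you assume.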

estimates.txt (text)
Daily active users: 200M
Orders per second (peak): ~1,100
Reads/sec on catalog (average): ~23,000 (200M * 10 pages / 86,400 seconds)
Writes/sec to order DB: ~1,100 (plus cart, inventory, payment)
Storage: catalog 1-5 TB, images petabyte-scale
CDN egress (average): ~46 GB/s, about 370 Gbps (2 MB * 2B page loads / 86,400 seconds)
Back-of-Envelope Estimation
  • Assume 2x growth over the next 2 years: 66K * 2 = ~130K orders/min to design for.
  • Every order creates 5 writes (cart, inventory, payment, order, shipping), so ~5,500 writes/sec at today's peak.
  • Catalog reads: ~23K/sec on average; with a 5-10x peak, plan for roughly 200K reads/sec. Cache the top 5% hottest items (Pareto).
  • Bandwidth: 2 MB per page * 10 pages/user * 200M users = 4 PB/day, which is ~46 GB/s averaged over 86,400 seconds, and several times that at peak. CDN is non-negotiable.
Production Insight
Most teams under-estimate write throughput by 10x.
They design for average load, not peak (Black Friday × 3x).
Rule: always multiply peak by 2x for safety margin.
Key Takeaway
Numbers force trade-offs.
If you don't know your write throughput, you'll pick the wrong database.
Estimate before you architect — it's cheaper than re-architecting.

High-Level Architecture — Service Decomposition

Amazon's architecture is a collection of hundreds of microservices. The core ones for an e-commerce platform:

  • Product Catalog Service: read-heavy, exposes product details, categories, images. Uses a read replica with a CDN cache for images.
  • Inventory Service: tracks stock per warehouse. Must be strongly consistent to avoid overselling. Usually a separate database (Aurora with row-level locking).
  • Cart Service: low-latency, high-write. Uses DynamoDB with eventual consistency for add/remove operations; cart read during checkout uses strong consistency.
  • Order Service: receives checkout request, orchestrates the saga: reserve inventory, process payment, create order, trigger shipping. Uses a queue for decoupling.
  • Payment Service: idempotent, integrates with external gateways. Stores transaction logs in a relational DB.
  • Search Service: Elasticsearch cluster indexed from catalog and inventory changes via CDC.
  • Recommendation Service: ML pipeline producing real-time recommendations served via a separate read-optimised cache.
  • Shipping Service: async, watches order completion events and sends to logistics.

Each service has its own database, communicates via HTTP/REST or async events (Kafka). API Gateway routes requests, handles authentication, rate limiting.
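Rate limiting at the gateway is commonly done with a token bucket per client or per route. Below is a minimal single-process sketch; `TokenBucket` and its parameters are illustrative names, and a real gateway would keep this state in shared storage rather than in one process.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last_refill = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        elapsed = now - self._last_refill
        self._last_refill = now
        # Refill for the time that passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_sec)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# A fake clock makes the behaviour deterministic for the example.
fake_now = [0.0]
bucket = TokenBucket(rate_per_sec=2, capacity=3, clock=lambda: fake_now[0])

burst = [bucket.allow() for _ in range(4)]         # 3 tokens, 4 requests
fake_now[0] = 1.0                                   # one second later: +2 tokens
after_refill = [bucket.allow() for _ in range(3)]
```

The fourth request in the burst is rejected, and after one second two more requests pass. Rejected calls would typically be answered with HTTP 429 so that clients back off.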

high-level-arch.txt (text)
Client -> CloudFront (CDN) -> API Gateway -> Services:
  - Product Catalog (Aurora, Redis cache)
  - Inventory (Aurora, strong consistency)
  - Cart (DynamoDB)
  - Order (Aurora + SQS/Kafka)
  - Payment (Aurora, external gateway via idempotent HTTP)
  - Search (Elasticsearch)
  - Recommendations (Redis / ML predictions)
  - Shipping (DynamoDB + SQS -> logistics)
Each service owns its DB. Async events flow through Kafka topics (order_created, payment_completed, etc.)
Don't Over-Split
Each new service adds complexity: network calls, consistency headaches, debugging difficulty. Amazon has hundreds because they had thousands of engineers. For a startup, start with 3-4 services and split when the deployment bottleneck hurts.
Production Insight
Service boundaries defined by data ownership, not business nouns.
If two services need to share a database, they're really one service.
Rule: each microservice owns its data exclusively — no shared tables, only shared events.
Key Takeaway
Decompose by data boundaries.
Shared databases defeat the purpose of microservices.
If you need transactions across services, use saga pattern, not distributed transactions.

Data Consistency & Trade-offs Across Services

Amazon must maintain consistency where it matters (inventory, payments) and accepts eventual consistency where it doesn't (product catalog updates, recommendations). The key trade-offs:

  • Product Catalog: writes are rare (admin updates), reads are massive. Use a single-writer database with read replicas and the cache-aside pattern. A catalog update can take minutes to propagate to all edge caches, and that's fine.
  • Inventory: overselling is unacceptable. When a customer adds an item to cart, we reserve inventory for 15 minutes; if they don't check out, the reservation expires. The reservation takes a row-level lock on the inventory row in Aurora, so concurrent buyers of the same product queue behind each other, and under high contention we risk lock waits and deadlocks. This limits throughput to ~1,000 inventory reservations per second per row. Sharding inventory by product ID (each product gets its own partition) spreads that load across products, although a single hot product still serializes on its own row.
  • Cart & Order: The cart service uses eventual consistency for add/remove, but during checkout, the order service reads the cart with strong consistency and then runs a saga: reserve inventory (idempotent), charge payment (idempotent), create order, decrement inventory. If any step fails, compensate: release inventory, void payment.
  • Search: Elasticsearch is eventually consistent with the inventory DB. If you add an item, it might take seconds to appear in search results. Acceptable for most queries, but for sellers pushing inventory updates, we provide a synchronous fallback: if a seller uses 'update inventory API', we directly update a cache that search reads with low latency.

Use gossip protocols and CRDTs where possible for coordination-free eventual consistency.
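To make the CRDT idea concrete, here is a minimal grow-only counter (G-Counter), one of the simplest CRDTs. It suits things like per-product view counts that replicas can update independently. It does not suit inventory, because a stock level must not drop below zero and that invariant needs coordination. The class and replica names are invented for this sketch.

```python
class GCounter:
    """Grow-only counter: each replica increments only its own slot."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> increments seen from that replica

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        """Take the per-replica maximum; safe to apply in any order, any number of times."""
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())


us_east = GCounter("us-east")
eu_west = GCounter("eu-west")
us_east.increment(3)   # three product views recorded in one region
eu_west.increment(2)   # two recorded in another

# Replicas exchange state (for example via gossip) and converge.
us_east.merge(eu_west)
eu_west.merge(us_east)
us_east.merge(eu_west)  # a duplicate delivery changes nothing
```

Both replicas end at 5 regardless of the order or repetition of merges, which is what makes the approach coordination-free.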

io/thecodeforge/order/PaymentSaga.java (Java)
package io.thecodeforge.order;

import java.util.concurrent.TimeoutException;

public class PaymentSaga {
    private final PaymentGateway paymentGateway;
    private final InventoryService inventoryService;
    // Hypothetical store that persists exactly one idempotency key per order
    private final IdempotencyKeyStore keyStore;

    public PaymentSaga(PaymentGateway paymentGateway,
                       InventoryService inventoryService,
                       IdempotencyKeyStore keyStore) {
        this.paymentGateway = paymentGateway;
        this.inventoryService = inventoryService;
        this.keyStore = keyStore;
    }

    // Called by the order saga orchestrator. The amount is in minor units (cents)
    // to avoid floating-point rounding on money.
    public SagaResult processPayment(String orderId, long amountCents) {
        // Load the key saved for this order, creating it only on the first attempt.
        // Generating a fresh UUID on every call would defeat idempotency on retry.
        String idempotencyKey = keyStore.getOrCreate(orderId);
        PaymentRequest request = new PaymentRequest(orderId, amountCents, idempotencyKey);
        try {
            PaymentResponse response = paymentGateway.charge(request);
            if (response.isSuccess()) {
                return SagaResult.success(orderId, response.getTransactionId());
            }
            // Compensate: release inventory
            inventoryService.releaseReservation(orderId);
            return SagaResult.failure(orderId, "Payment declined");
        } catch (TimeoutException e) {
            // Outcome unknown: retry later with the same key so the gateway can deduplicate
            return SagaResult.retryLater(orderId, idempotencyKey);
        }
    }
}
Idempotency is not optional
Every payment mutation must be idempotent. Store the idempotency key in your database before making the external call. If the call times out, you can safely retry with the same key — the gateway will return the original response.
Production Insight
The deadliest bug is a duplicate charge.
It happens when you have a payment timeout and the user retries without idempotency.
Rule: never accept a payment request without an idempotency key, and never process the same key twice.
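The receiving side of that rule can be sketched as a response store keyed by idempotency key. This is a toy in-memory version with made-up names (`IdempotentPayments`, `fake_gateway`); a production service would persist the key with a unique constraint in its database instead of holding a lock around the charge.

```python
import threading


class IdempotentPayments:
    """Replays the stored response when the same key is seen again."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn
        self._responses = {}  # idempotency_key -> response of the first attempt
        self._lock = threading.Lock()

    def charge(self, idempotency_key, order_id, amount_cents):
        # The lock is held across the charge to keep the sketch simple;
        # it serializes all payments, which a real service would not do.
        with self._lock:
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            response = self._charge_fn(order_id, amount_cents)
            self._responses[idempotency_key] = response
            return response


gateway_calls = []


def fake_gateway(order_id, amount_cents):
    gateway_calls.append(order_id)
    return {
        "order_id": order_id,
        "transaction_id": f"txn-{len(gateway_calls)}",
        "amount_cents": amount_cents,
    }


payments = IdempotentPayments(fake_gateway)
first = payments.charge("key-order-42", "order-42", 1999)
retry = payments.charge("key-order-42", "order-42", 1999)  # e.g. after a client timeout
```

The retry returns the original response and the gateway is called once, so the customer is charged once.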
Key Takeaway
Eventual consistency is the default, not strong consistency.
Only use strong consistency where real money is at stake.
For everything else, accept staleness and design idempotent compensation.

Search & Recommendations — The Read-Optimised Path

Search and recommendations are the two features with the highest read load on Amazon. Both are served entirely from caches and search indices, never touching the main OLTP databases.

Search: Users type a query, the API Gateway routes it to the Search Service, which queries Elasticsearch (ES). ES returns product IDs, then the service fetches product details from a local Redis cache (or falls back to the catalog DB). The search index is updated asynchronously via Kafka Connect from the inventory and catalog databases. Latency target: under 100ms P99.

Recommendations: For each page load, the frontend sends user context (user ID, page category, recent searches). The Recommendation Service does not run the model on the request path: an ML model (e.g., collaborative filtering with matrix factorisation) is retrained every 6 hours, and its outputs are pre-computed for each user and stored in Redis with a TTL of 12 hours. The service returns a list of product IDs, and the frontend fetches details from the same cache layer as search. Latency target: under 50ms P99.

To scale search, we use a tiered approach: popular queries are cached in a local CDN node (Varnish) with 5-minute TTL. Hot product details are in Redis with sharding across nodes. Cold products go to Elasticsearch with a larger shard count.

io/thecodeforge/search/SearchQuery.java (JSON)
// TheCodeForge search query flow
// 1. Client sends query to API Gateway
// 2. API Gateway checks CDN cache for this query
// 3. Cache miss -> route to Search Service
// 4. Search Service queries Elasticsearch with:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "product_name": "wireless headphones" } },
        { "term": { "in_stock": true } }
      ]
    }
  },
  "size": 20,
  "_source": ["product_id"]
}
// 5. Get list of product IDs
// 6. Fetch full product details from Redis (batch get)
// 7. Return combined response to client
Cache Invalidation is Hard
If a product goes out of stock, it should disappear from search results quickly. Use a TTL of 5 minutes for popular queries. For inventory-driven removals, use a push model: when inventory changes, send an event to the search service to reindex that product immediately (within seconds).
Production Insight
Search index staleness kills conversion.
If a product shows as 'in stock' but actually isn't, the cart service will reject it and the customer leaves.
We solved this by having a real-time inventory sync flag: if a product's stock drops to zero, we trigger an immediate reindex before the customer can add it.
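A push-based update of this kind can be sketched as an event handler that flips the `in_stock` flag as soon as an inventory event arrives. `SearchIndex` here is an in-memory stand-in with a hypothetical interface, not the Elasticsearch client, and the event shape is assumed.

```python
class SearchIndex:
    """In-memory stand-in for the search index (hypothetical interface)."""

    def __init__(self):
        self.docs = {}  # product_id -> document fields

    def upsert(self, product_id, fields):
        self.docs.setdefault(product_id, {}).update(fields)

    def search_in_stock(self):
        return sorted(pid for pid, doc in self.docs.items() if doc.get("in_stock"))


def on_inventory_changed(index, event):
    """Handle one inventory event by updating the in_stock flag right away."""
    index.upsert(event["product_id"], {"in_stock": event["quantity"] > 0})


index = SearchIndex()
index.upsert("ASIN-A", {"product_name": "wireless headphones", "in_stock": True})
index.upsert("ASIN-B", {"product_name": "wired headphones", "in_stock": True})

# Stock for ASIN-A hits zero: the handler runs on the event, not on a timer.
on_inventory_changed(index, {"product_id": "ASIN-A", "quantity": 0})
visible = index.search_in_stock()
```

Only the changed product is touched, which keeps the update cheap enough to run on every stock transition to or from zero.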
Key Takeaway
Search and recommendations are read-optimised tiers.
Cache aggressively, accept staleness in minutes, but never show an 'in stock' product that's actually out.
That's a hard requirement — it needs real-time push, not periodic polling.

Caching & CDN Strategy

Amazon's read volume is staggering — millions of requests per second for catalog pages, images, search results. Without a multi-tier caching strategy, the origin databases would collapse. The caching layers, from edge to database:

  1. CDN (CloudFront): Caches static assets (product images, CSS, JS) at edge locations. TTL of 24 hours for assets, invalidated on new uploads. For dynamic content (search results, recommendations), CDN caches only popular queries with short TTL (5 minutes).
  2. API Gateway Cache: Regional cache for identical API responses. Works well for product details that don't change often.
  3. Service-level Cache (Redis): Each service has its own Redis cluster. Catalog service caches product details by ID (LRU eviction). Cart service uses Redis for session data. Recommendation service caches precomputed user recommendations.
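  4. Database Read Replicas: Aurora read replicas handle cache misses. In extreme cases, add more replicas (Aurora supports up to 15 per cluster) to absorb the extra read load; promoting a replica only changes which instance is the writer.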

The design principle: the top 5% of hottest products receive 80% of traffic (Pareto). Cache those aggressively. Long-tail products are served from Elasticsearch or read replicas with lower priority.

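io/thecodeforge/cache/CatalogService.java (Java)
package io.thecodeforge.cache;

import java.time.Duration;

public class CatalogService {
    private final RedisCache cache;
    private final ProductRepository db;
    // Hypothetical publisher for catalog change events
    private final ProductEventPublisher eventPublisher;

    public CatalogService(RedisCache cache, ProductRepository db,
                          ProductEventPublisher eventPublisher) {
        this.cache = cache;
        this.db = db;
        this.eventPublisher = eventPublisher;
    }

    // Cache-aside read: try the cache, fall back to the DB, then populate the cache
    public Product getProduct(String productId) {
        Product cached = cache.get(productId);
        if (cached != null) {
            return cached;
        }
        Product fromDb = db.findById(productId);
        if (fromDb != null) {
            cache.set(productId, fromDb, Duration.ofMinutes(10));
        }
        return fromDb;
    }

    // Write path: update the DB first, then invalidate so the next read reloads
    public void updateProduct(Product product) {
        db.save(product);
        cache.invalidate(product.getId());
        // Also send event to search service to reindex
        eventPublisher.publishProductUpdated(product);
    }
}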
Pareto Cache
  • Track access frequency per product. Promote hot items to faster cache tiers.
  • Use Redis with maxmemory-policy allkeys-lru for automatic eviction of cold items.
  • In CDN, cache popular query results but invalidate on inventory change.
  • Warm the cache before major sales events by pre-loading top products.
Production Insight
Cache stampedes kill more services than demand spikes.
When a popular product's cache entry expires and thousands of requests hit the DB simultaneously, you get a thundering herd.
Solution: use locking cache updates (setnx) or serve stale data while refreshing.
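Within a single process, the locking idea looks like the sketch below: only one caller per key loads from the database while the rest wait and reuse the result. `SingleFlightCache` and `slow_db_lookup` are names invented for this example, and TTL handling is left out. Across many servers the same role would be played by a shared lock such as the Redis one mentioned above.

```python
import threading
import time


class SingleFlightCache:
    """Lets only one caller per key run the loader on a cache miss."""

    def __init__(self, loader):
        self._loader = loader
        self._values = {}
        self._key_locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def get(self, key):
        if key in self._values:
            return self._values[key]
        with self._guard:
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        with key_lock:
            # Re-check: another thread may have filled the entry while we waited.
            if key not in self._values:
                self._values[key] = self._loader(key)
            return self._values[key]


load_count = []


def slow_db_lookup(product_id):
    load_count.append(product_id)
    time.sleep(0.05)  # simulate a slow database read
    return {"id": product_id}


cache = SingleFlightCache(slow_db_lookup)
threads = [threading.Thread(target=cache.get, args=("ASIN-HOT",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Twenty concurrent requests for the same hot product result in a single database read instead of twenty.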
Key Takeaway
Cache in layers, invalidate carefully.
The CDN protects the API gateway. The gateway cache protects the services.
A cache miss on a hot product should never cascade into a database meltdown.

Checkout Flow — From Cart to Confirmation

When the user clicks 'Place Order', this is the most critical path. Here's the real sequence:

  1. Cart Service retrieves the user's cart with strong consistency (gets latest items and their IDs).
  2. Order Service receives the checkout request and starts a saga:
     • Reserve Inventory: for each item, call Inventory Service to reserve quantity. If any item is insufficient, fail the entire order (release other reservations).
     • Process Payment: call Payment Service with the total amount and an idempotency key. The payment service interacts with the external gateway. If timeout, retry (idempotency prevents double charge).
     • Create Order: insert order record into Order DB.
     • Decrement Inventory: final decrement of reserved quantities.
     • Send to Shipping: publish order_created event to Kafka, which the Shipping Service picks up.
  3. If any step fails after payment, a compensation transaction is run: refund payment, release remaining inventory. This compensation is also idempotent.
  4. The frontend polls the Order Service for the order status (every 2 seconds until confirmed) and then redirects to the order confirmation page.

All services use asynchronous communication where possible to reduce end-to-end latency. The entire saga typically completes in under 500ms for 95% of orders.
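The saga steps above can be sketched as an ordered list of actions, each paired with its compensation. This is a bare-bones in-memory orchestrator with hypothetical names; a production one would persist saga state between steps so that a crash can be resumed, and would require each compensation to be idempotent.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order; on failure, undo
    the completed steps in reverse order."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            for _, undo in reversed(completed):
                undo()
            return {"status": "FAILED", "failed_step": name, "reason": str(exc)}
        completed.append((name, compensate))
    return {"status": "CONFIRMED"}


log = []


def decline_payment():
    raise RuntimeError("payment declined")


steps = [
    ("reserve_inventory", lambda: log.append("reserved"), lambda: log.append("released")),
    ("charge_payment", decline_payment, lambda: log.append("refunded")),
    ("create_order", lambda: log.append("order created"), lambda: log.append("order cancelled")),
]
outcome = run_saga(steps)
```

When the payment step fails, only the inventory reservation is compensated: the failed step itself and the steps after it never ran, so they have nothing to undo.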

checkout-sequence.txt (text)
Client -> API Gateway: POST /checkout
  -> Order Service: start saga
    -> Inventory Service: reserve(item1, qty1), reserve(item2, qty2)
    <- OK all reserved
    -> Payment Service: charge(amount, idempotency_key)
    <- OK transaction_id
    -> Order Service: insert order (status: CONFIRMED)
    -> Inventory Service: decrement(item1, qty1), decrement(item2, qty2)
    -> Kafka: emit OrderCreated event
  <- 200 OK

Compensation path (if payment fails after inventory reserve):
  -> Inventory Service: releaseReservation(orderId)
  -> Order Service: update order status to FAILED
Watch for Partial Failures
An order might need 5 items reserved but only 3 of the reservations succeed. The saga must handle that partial reservation by rolling back all of them, so no item stays reserved indefinitely. Add a timeout as a backstop: if the saga doesn't complete in 30 seconds, the inventory reservation expires automatically.
Production Insight
The biggest operational pain is zombie reservations from failed checkouts.
If the order service crashes halfway through the saga, inventory is reserved but never released.
We use a background job that runs every minute and expires reservations older than 15 minutes.
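A sweeper of that kind can be quite small. The sketch below assumes reservations are records with an item, a quantity, and a creation time; the function name and record layout are illustrative, and a real job would run this as a single database transaction rather than over in-memory dicts.

```python
from datetime import datetime, timedelta


def expire_stale_reservations(reservations, stock, now, max_age=timedelta(minutes=15)):
    """Return stock held by reservations older than max_age and drop them.

    reservations: reservation_id -> {"item_id", "quantity", "created_at"}
    stock:        item_id -> available quantity
    """
    expired = [rid for rid, r in reservations.items() if now - r["created_at"] > max_age]
    for rid in expired:
        r = reservations.pop(rid)
        stock[r["item_id"]] = stock.get(r["item_id"], 0) + r["quantity"]
    return expired


now = datetime(2026, 1, 1, 12, 0)
reservations = {
    "r1": {"item_id": "ASIN-A", "quantity": 2, "created_at": datetime(2026, 1, 1, 11, 40)},
    "r2": {"item_id": "ASIN-A", "quantity": 1, "created_at": datetime(2026, 1, 1, 11, 55)},
}
stock = {"ASIN-A": 5}

expired_ids = expire_stale_reservations(reservations, stock, now)
```

The 20-minute-old reservation is released and its two units go back to stock, while the 5-minute-old one is left alone.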
Key Takeaway
Checkout is a saga, not a transaction.
Each step must be idempotent and compensatable.
Build timeout-driven cleanup for zombie reservations — they'll eat your inventory.

Inventory & Fulfillment — The Distributed State Nightmare

Everyone talks about the checkout flow. Nobody talks about what happens after you click 'Buy Now'. Amazon's inventory system isn't a single database — it's a distributed state machine spanning warehouses, fulfillment centers, and last-mile carriers. Each item lives in multiple locations with different availability statuses: reserved, in-transit, damaged, pending return.

The hard part isn't decrementing stock. It's doing it without overselling when two customers grab the same item in different regions. Amazon uses a pessimistic locking approach at the warehouse level — each fulfillment center owns its inventory partition. When you checkout, the system picks a specific FC and locks that item's slot. No optimistic retry bullshit. If the lock fails, you get 'Currently Unavailable'.

But here's where it gets brutal: returns don't immediately re-add to inventory. They go through a separate inspection pipeline. That 'In Stock' badge you see? It's a cached projection, not real-time truth. The staleness window is typically 5-15 minutes depending on item velocity.

InventoryLock.py (Python)
# io.thecodeforge: system-design tutorial

import threading
from datetime import datetime, timedelta

class FulfillmentCenter:
    def __init__(self, fc_id: str):
        self.fc_id = fc_id
        self._lock = threading.Lock()
        # (user_id, item_id) -> (quantity, expires_at); keyed per user so that
        # two users reserving the same item do not overwrite each other
        self._reservations = {}
        self._stock = {}  # item_id -> quantity available

    def attempt_reservation(self, user_id: str, item_id: str, quantity: int) -> bool:
        with self._lock:
            available = self._stock.get(item_id, 0)
            if available < quantity:
                return False
            self._stock[item_id] = available - quantity
            held, _ = self._reservations.get((user_id, item_id), (0, None))
            expires_at = datetime.now() + timedelta(minutes=15)
            self._reservations[(user_id, item_id)] = (held + quantity, expires_at)
            return True

    def confirm_shipment(self, user_id: str, item_id: str):
        with self._lock:
            self._reservations.pop((user_id, item_id), None)
            # item leaves FC, no return to stock yet
Output
reservation = fc.attempt_reservation('user_99', 'ASIN-B0X1', 1)
# reservation=True, stock decremented from 42 to 41
Production Trap:
Your inventory service is not a counter service. It's a concurrency-bound state machine. If you use optimistic locking with retries, you will oversell during flash sales. Peak events such as Prime Day are where teams learn this the hard way.
Key Takeaway
Inventory systems need pessimistic locks at the fulfillment center granularity. Optimistic concurrency fails under high contention.

AWS Tooling — Why You Don't Actually Run Amazon's Architecture

Every system design interview answer for 'Design Amazon' throws around S3, DynamoDB, and Lambda like candy. Here's the reality: Amazon's internal architecture barely touches those services the way you think. S3 powers product images and static assets. That's it. The product catalog lives on a custom distributed key-value store called 'Pegasus' that predates DynamoDB by half a decade.

What does run on real AWS? The search indexing pipeline. It's a massive Spark cluster on EMR that churns through clickstream data and recomputes relevance scores every 15 minutes. The search serving layer uses OpenSearch (Amazon's managed ES), but with a custom routing layer that shards by product category — not by hash. Why? Because 'electronics' queries should never affect 'groceries' latency.

The checkout system runs on a combination of RDS (PostgreSQL with read replicas for payment reconciliation) and ElastiCache (Redis) for session carts. The payment dead-letter queue? Standard SQS with a custom redrive policy that defers before retry — exponential backoff with jitter. Don't use the default retry.
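Exponential backoff with jitter is simple to compute. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name and default values are assumptions for illustration, and how the delay is applied to the queue is left out.

```python
import random


def backoff_delays(max_attempts, base_seconds=1.0, cap_seconds=60.0, rng=random.random):
    """Full-jitter backoff: delay i is rng() * min(cap, base * 2**i).

    With the default rng (random.random) each delay is uniform in
    [0, ceiling), so retries from many clients spread out instead of
    arriving in synchronized waves.
    """
    delays = []
    for attempt in range(max_attempts):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        delays.append(rng() * ceiling)
    return delays


# Passing rng=lambda: 1.0 exposes the ceilings themselves.
ceilings = backoff_delays(8, rng=lambda: 1.0)
sample = backoff_delays(8)
```

The ceilings double from 1 second until they hit the 60-second cap, and every sampled delay falls below its ceiling. Capping matters: without it, a message retried a dozen times would wait for hours.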

If you're designing Amazon on AWS, the real question isn't 'which service'. It's 'what's the failure mode of each service and how do you degrade gracefully'.

SearchRouting.py (Python)
# io.thecodeforge: system-design tutorial

class CategoryAwareRouter:
    def __init__(self):
        self._shards = {
            'electronics': 'search-electronic-001',
            'books': 'search-book-002',
            'groceries': 'search-grocery-003'
        }
        self._default_shard = 'search-default-004'

    def route(self, query: str, category: str) -> str:
        # Cross-category queries hit a fanout layer
        if query.startswith('all'):
            return 'search-all-fanout'

        # Unknown categories fall back to the default shard
        return self._shards.get(category, self._default_shard)

    def classify(self, query: str) -> str:
        # simple classifier; real one uses NLP + historical click data
        keywords = query.lower().split()
        if 'laptop' in keywords or 'tv' in keywords:
            return 'electronics'
        if 'book' in keywords or 'novel' in keywords:
            return 'books'
        # No keyword matched: return a category with no dedicated shard,
        # so route() sends the query to the default shard
        return 'unknown'
Output
router = CategoryAwareRouter()
shard = router.route('noise cancelling headphones', 'electronics')
# shard = 'search-electronic-001'
Senior Shortcut:
Don't default to DynamoDB for everything. Amazon's early success came from custom-built systems. AWS managed services are great for startups. At Amazon's scale, you need to understand the failure characteristics of each before committing.
Key Takeaway
AWS managed services solve commodity problems. Amazon's core systems use custom databases. When designing, match the service to the failure pattern, not the hype.

Step 4: Scalability Isn't Optional — It's The Whole Point

You don't design Amazon's checkout for 1 user. You design for 100 million users hitting 'Buy Now' on Prime Day. Scalability starts with data partitioning, not server count.

Shard by customer_id for cart and orders, and by product_id for inventory, because stock contention follows the item, not the buyer. Everything else becomes a fan-out query problem. Orders go into a write-ahead log before anything touches the database. That log is your scalability safety net: it decouples the rush from the database write rate.
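To make the decoupling concrete, here is a rough sketch of an append-only order log with a writer that drains it at its own pace. `OrderLog` is a hypothetical in-memory stand-in; a real write-ahead log would be durable and replicated.

```python
import json
from collections import deque


class OrderLog:
    """Append-only in-memory log standing in for a durable WAL (hypothetical)."""

    def __init__(self):
        self._entries = deque()
        self._next_offset = 0

    def append(self, order):
        # Accepting an order is only an append, so it stays cheap under burst load.
        offset = self._next_offset
        self._entries.append((offset, json.dumps(order)))
        self._next_offset += 1
        return offset

    def drain(self, max_batch):
        # The database writer pulls in log order, at whatever rate it can sustain.
        batch = []
        while self._entries and len(batch) < max_batch:
            offset, payload = self._entries.popleft()
            batch.append((offset, json.loads(payload)))
        return batch

    def backlog(self):
        return len(self._entries)


log = OrderLog()
for i in range(10):
    log.append({"order_id": f"o-{i}", "customer_id": "c-1"})

database = {}
for offset, order in log.drain(max_batch=4):
    database[order["order_id"]] = order

print(len(database), log.backlog())  # 4 6
```

Ten orders arrive in a burst, the writer commits four, and six wait in the log instead of timing out at the database.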

The real trap? Scaling reads is easy. Scaling transactional writes across 10,000 nodes is where dreams die. Use a distributed consensus protocol (Raft/Paxos) for critical path writes like payment authorization. Everything else can be eventually consistent. Your catalog service? Read-replicas behind a cache layer. Your checkout service? Linearizable writes or you're debugging ghost charges at 3 AM.

ShardKeyStrategy.py (Python)
# io.thecodeforge — system-design tutorial
import hashlib

VIRTUAL_NODES = 4096  # fixed number of virtual buckets

def get_shard_key(user_id: str) -> int:
    # Hash the user ID into one of 4096 fixed virtual buckets.
    # Physical shards own sets of buckets, so adding a shard only
    # moves some buckets instead of rehashing every key.
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % VIRTUAL_NODES

# Shard assignment: s3://amazon-orders/shard-{key}/orders.parquet
print(get_shard_key("prime_user_90210"))  # deterministic value in 0..4095
Output
An integer in the range 0-4095 (the same user_id always maps to the same bucket)
Production Trap:
Don't hash timestamp as primary shard key. Every query becomes a cross-shard scan. Use customer_id or order_id. Timestamp belongs in range queries, not partition keys.
Key Takeaway
Partition by customer_id for write-heavy services. Read replicas and caching handle the rest. Never make a transactional write span more than 3 shards.

Step 6: Trade-offs Will Get You Fired — Pick The Right One

Every system design interview question is a trade-off trap. Amazon's architecture screams 'Amazon chooses availability over consistency in the catalog, but consistency over availability in checkout.' That's not a bug — it's a business decision.

Your search results can be stale by 500ms. Nobody dies. But if inventory confirms a purchase for a sold-out item, you've got a pissed-off customer and a logistics nightmare. That's why inventory writes go through a distributed lock (DynamoDB conditional updates) while search reads from an Elasticsearch cluster updated via async streams.
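The conditional-update idea can be sketched without any AWS dependency. `InventoryTable` below is a hypothetical in-memory stand-in that mimics the semantics of "decrement only if enough stock remains"; a real store evaluates the condition server-side as part of the write.

```python
import threading


class ConditionFailed(Exception):
    """Raised when the write condition does not hold."""


class InventoryTable:
    """In-memory stand-in for a table that supports conditional updates."""

    def __init__(self, items):
        self._items = dict(items)
        self._mutex = threading.Lock()

    def decrement_if_available(self, sku, qty):
        # The check and the write happen as one atomic step.
        with self._mutex:
            stock = self._items.get(sku, 0)
            if stock < qty:
                raise ConditionFailed(sku)
            self._items[sku] = stock - qty
            return self._items[sku]


def reserve(table, sku, qty):
    try:
        remaining = table.decrement_if_available(sku, qty)
        return {"reserved": True, "remaining": remaining}
    except ConditionFailed:
        # The caller shows "out of stock" instead of confirming the order.
        return {"reserved": False, "remaining": None}


table = InventoryTable({"sku-42": 1})
print(reserve(table, "sku-42", 1))  # {'reserved': True, 'remaining': 0}
print(reserve(table, "sku-42", 1))  # {'reserved': False, 'remaining': None}
```

The second buyer is rejected at write time, so the sold-out item never reaches the confirmation page.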

The second trade-off: latency vs. durability. When a user clicks 'Place Order', do you wait for 3 of 3 replicas to confirm? That's 200ms added. Or do you write to 2 of 3 and risk a rollback? Amazon picks 2-of-3 for checkout because 99.99% availability matters more than a 0.01% rollback cost. Write it down: better to have a rare rollback than a frequent timeout.

WriteConsistencyTradeoff.py (Python)
# io.thecodeforge — system-design tutorial

def write_with_timeout(replica: str, order: dict, timeout_ms: int) -> bool:
    # Stand-in for a real replica write; always succeeds in this demo.
    # A real client would raise TimeoutError after timeout_ms.
    return True

def place_order_writer(user_id: str, order: dict) -> str:
    # Quorum-based write: 2 out of 3 replicas must confirm
    replicas = ["replica-a", "replica-b", "replica-c"]
    confirmations = 0
    timeout_ms = 150

    for replica in replicas:
        try:
            if write_with_timeout(replica, order, timeout_ms):
                confirmations += 1
        except TimeoutError:
            continue  # Trade-off: tolerate one slow replica for speed

    if confirmations >= 2:
        return "ORDER_CONFIRMED"
    # Quorum not reached: the caller should retry or fail the order
    return "WRITE_FAILED"

print(place_order_writer("user_1", {"item": "laptop", "qty": 1}))
Output
ORDER_CONFIRMED
Senior Shortcut:
When an interviewer asks 'Why not strong consistency everywhere?', answer: 'Because the CAP theorem says I have to choose, and my business says uptime > absolute accuracy for non-critical reads.' That's how you sound like you've shipped in production.
Key Takeaway
Trade-offs are business decisions coded in infrastructure. Catalog: eventual consistency. Checkout: quorum writes (2/3). Payment: linearizable. Never apply one consistency model to the entire system.

Trade-offs That Shape Amazon's Architecture

Amazon's design is a continuous series of deliberate trade-offs. Consistency vs. availability is the most brutal: S3 historically accepted eventual consistency for listing operations in exchange for availability (it has provided strong read-after-write consistency since December 2020), while the checkout service uses conditional writes in DynamoDB to guarantee exactly one charge per order. Latency vs. accuracy in search: product ranking tolerates stale index updates for 30 seconds to keep query latency under 50ms. Write cost vs. read cost in fulfillment: inventory snapshots are recomputed every 15 minutes instead of reading live stock, which prevents hot partitions but risks overselling during flash sales. The pattern: Amazon never optimizes for all attributes. It picks the one that causes the least customer pain per service and lives with the rest. Your job is to make those trade-offs explicit in diagrams and defend them with hard numbers.

TradeOffMatrix.py (Python)
# io.thecodeforge — system-design tutorial

# Models a consistency vs availability trade-off per service.
# Returns the incident risk (double charge or oversell) for the chosen model.
# The rates are illustrative, not measured.

def trade_off_analysis(service: str, consistency_model: str, rto: float):
    risks = {
        "checkout": {
            "eventual": 0.001,  # 0.1% double-charge risk
            "strong": 0.000001  # 0.0001% but 5x latency
        },
        "inventory": {
            "eventual": 0.05,   # 5% oversell risk
            "strong": 0.0001     # higher write contention
        }
    }
    return risks.get(service, {}).get(consistency_model, 1.0)

print(trade_off_analysis("checkout", "eventual", 30.0))
Output
0.001
Production Trap:
Picking eventual consistency for checkout will cause support tickets from double-billed customers. Always simulate the financial impact of each trade-off before committing.
Key Takeaway
Every architectural choice is a bet. Explicitly document what you lose.
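One way to put a number on the bet is a back-of-the-envelope expected-loss calculation. The order volume and the $25 handling cost per double charge are made-up inputs; the two rates are the illustrative checkout rates from the matrix above.

```python
def expected_daily_loss(orders_per_day, incident_rate, cost_per_incident):
    """Expected daily cost = orders * probability of an incident * cost of each incident."""
    return orders_per_day * incident_rate * cost_per_incident


# Hypothetical inputs: 1M orders/day, $25 to refund and handle each double charge.
eventual = expected_daily_loss(1_000_000, 0.001, 25.0)
strong = expected_daily_loss(1_000_000, 0.000001, 25.0)

print(f"eventual: ${eventual:,.0f}/day, strong: ${strong:,.0f}/day")
```

With these inputs the gap is three orders of magnitude, which is the kind of figure that ends an argument about whether checkout can tolerate eventual consistency.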

Limitations and Challenges in Real Amazon Design

No system survives contact with production traffic unchanged. Amazon's architecture faces three hard limits. First, hot partitions in DynamoDB: a celebrity product page can spike read traffic 1000x within seconds. Adding table capacity doesn't help because the partition key (product ID) concentrates the load on one partition, and a single partition has a hard throughput ceiling. Solution: add a random suffix to partition keys so the item is spread across partitions, but that complicates range queries and forces every write to update all suffixed copies. Second, checkout race conditions despite all locks: network partitions between payment service and order service cause orphan orders. Amazon uses idempotency keys but still sees 0.01% duplicate orders at scale. Third, search index rebuild latency: product catalog updates propagate to Elasticsearch only after 5 minutes. During Prime Day, new products are invisible for millions of customers. Mitigations exist, such as pre-warming caches and throttling aggressive clients, but none eliminate the problem. Acknowledge these limitations in your design document to show production readiness.

HotPartitionSim.py (Python)
# io.thecodeforge — system-design tutorial

# Simulates read throughput collapse under a hot partition.
# Returns the even per-key share and the capped throughput of the hot key.

PARTITION_READ_CAP = 3000  # reads/sec a single partition can serve

def hot_partition_throughput(requests_per_sec, num_partition_keys):
    base_tps = requests_per_sec / num_partition_keys
    # Hot key gets 90% of traffic, caps at 3000
    hot_tps = min(0.9 * requests_per_sec, PARTITION_READ_CAP)
    return {"base_per_key": base_tps, "hot_key_tps": hot_tps}

print(hot_partition_throughput(50000, 10))
Output
{'base_per_key': 5000.0, 'hot_key_tps': 3000}
Production Trap:
Ignoring hot partitions during design review is the most common reason for on-call pages. Always stress-test your partition key choice with real traffic patterns.
Key Takeaway
Every distributed system has failure modes. Document them before fixing them.
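The random-suffix mitigation from this section, sketched under assumptions: `SUFFIX_COUNT`, the `#` separator, and the product ID are all hypothetical, and the simulation only counts reads per key rather than talking to a real table.

```python
import random

SUFFIX_COUNT = 10  # hypothetical number of copies kept for a hot item


def sharded_key(product_id, rng=random):
    """Pick one of SUFFIX_COUNT suffixed keys so reads spread across partitions."""
    return f"{product_id}#{rng.randrange(SUFFIX_COUNT)}"


def all_sharded_keys(product_id):
    """Every suffixed key. Writers must update all of them: the cost of this scheme."""
    return [f"{product_id}#{i}" for i in range(SUFFIX_COUNT)]


def simulate_reads(product_id, reads, seed=7):
    rng = random.Random(seed)
    counts = {key: 0 for key in all_sharded_keys(product_id)}
    for _ in range(reads):
        counts[sharded_key(product_id, rng)] += 1
    return counts


counts = simulate_reads("B00HOTITEM", 45_000)
print(max(counts.values()))  # roughly 4,500 per key instead of 45,000 on one key
```

Reads get cheaper and writes get more expensive, so this only pays off for items that are read far more often than they change.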

What Interviewers Expect From Your Amazon Design

Senior engineers interviewing for Amazon-style system design are evaluated on four axes. First, scope: you must clarify ambiguous requirements — ask if it's the entire Amazon or just the shopping flow. Never assume. Second, trade-off reasoning: explain why you chose DynamoDB over Spanner for inventory (write throughput vs. global consistency) with numbers. Third, failure handling: describe what happens when your checkout service loses connection to the payment gateway. Idempotency keys and dead-letter queues must appear in your diagram. Fourth, scalability estimates: derive read/write ratios from business logic, not guesswork — 10 million daily active buyers generate roughly 200 million page views. Interviewers watch for candidates who jump to solution without understanding constraints. Start with requirements, then capacity estimation, then service decomposition. The winning answer always includes a whiteboard-ready diagram of your consistency boundaries.

InterviewCheck.py (Python)
# io.thecodeforge — system-design tutorial

# Validates if your design covers the 4 evaluation axes

def evaluate_design_coverage(has_scope, has_tradeoffs, has_failures, has_estimates):
    score = sum([has_scope, has_tradeoffs, has_failures, has_estimates])
    return "hire" if score >= 3 else "revisit"

print(evaluate_design_coverage(True, True, False, True))
Output
hire
Production Trap:
Candidates who skip failure scenarios often get re-interviewed. Always dedicate 5 minutes to 'what breaks and how we recover'.
Key Takeaway
Design interviews reward process, not memorization. Show your reasoning, not your recall.
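The capacity estimate from this section can be worked through as plain arithmetic. The 20 views per buyer follows from the 200 million page views over 10 million buyers; the 100:1 read-to-write ratio and the 10x peak factor are taken from elsewhere in this article, and treating every page view as one read is a simplifying assumption.

```python
SECONDS_PER_DAY = 86_400


def capacity_estimate(daily_buyers, views_per_buyer, read_write_ratio, peak_factor):
    page_views = daily_buyers * views_per_buyer
    avg_read_qps = page_views / SECONDS_PER_DAY
    return {
        "page_views_per_day": page_views,
        "avg_read_qps": round(avg_read_qps),
        "peak_read_qps": round(avg_read_qps * peak_factor),
        "avg_write_qps": round(avg_read_qps / read_write_ratio),
    }


print(capacity_estimate(10_000_000, 20, 100, 10))
# {'page_views_per_day': 200000000, 'avg_read_qps': 2315, 'peak_read_qps': 23148, 'avg_write_qps': 23}
```

Showing the derivation on the whiteboard, even this roughly, is what separates an estimate from a guess.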

Your Amazon system design interview doesn't stop at drawing boxes. Interviewers probe for depth — where you learned it, and whether you've read the real papers. Start with Amazon's Dynamo paper (2007) for understanding distributed key-value stores under high write loads. Then read Google's Spanner (2012) to contrast global consistency with Amazon's eventual consistency model. For search, Elasticsearch's "From 20 to 20 Billion Queries Per Day" reveals how they scale inverted indexes. The AWS Well-Architected Framework whitepaper covers reliability pillars Amazon teams literally use. For video streaming, Apple's HLS specification and DASH-IF guidelines explain how video splitting and packaging work at scale. Don't just memorize — understand the trade-off each paper accepts. Interviewers can smell recitation a mile away.

Production Trap:
One candidate cited a 2015 blog post on microservices — the interviewer worked on the real service and shut them down. Always verify your sources against AWS documentation.
Key Takeaway
Own your references. If you cite a paper, understand its failure modes, not just its successes.

Q 10. Analyze Image Quality from a URL — Anti-Pattern Detector

When a user uploads a product image, Amazon must flag low-resolution, blurry, or watermarked images before they appear. The naive approach: download the image, run a Python script using OpenCV (Laplacian variance for blur detection), and reject. This fails at scale. Instead, build a producer-consumer pipeline. A URL queue (SQS) feeds workers that download images in parallel. Each worker extracts metadata (dimensions, EXIF), runs a lightweight blur score, and checks against a watermark model (ResNet-18 trained on Amazon's catalog). Results write to a DynamoDB table keyed by image ID. The tricky part: some images are fine for thumbnails but fail at full resolution. Use a tiered scoring system: thumbnail quality, zoom quality, and print quality. A production trap: workers often hit download timeouts for large images from slow seller servers. Implement exponential backoff with a max of 3 retries. If still failing, mark as "needs manual review" — never block the seller's listing entirely.

image_quality_pipeline.py (Python)
# io.thecodeforge — system-design tutorial
import cv2
import numpy as np
import requests

def analyze_image_from_url(url: str) -> dict:
    resp = requests.get(url, timeout=5)
    resp.raise_for_status()
    img = cv2.imdecode(np.frombuffer(resp.content, np.uint8), cv2.IMREAD_COLOR)
    if img is None:  # the bytes were not a decodable image
        return {"pass": False, "reason": "undecodable"}
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Variance of the Laplacian: low variance means few sharp edges (blurry)
    blur_score = float(cv2.Laplacian(gray, cv2.CV_64F).var())
    height, width = img.shape[:2]
    return {"blur_score": blur_score, "dimensions": f"{width}x{height}",
            "pass": blur_score > 100 and width >= 1000}
Output (illustrative values for a sharp 1920x1080 image)
{'blur_score': 145.3, 'dimensions': '1920x1080', 'pass': True}
Production Trap:
Blur scores are lighting-dependent. A dark image with sharp text scores low (false negative). Always combine with EXIF analysis — check ISO and exposure time before rejecting.
Key Takeaway
Don't rely on a single metric. Combine structural analysis (blur, dimensions) with semantic checks (watermark detection) for robust image quality gates.
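The retry rule for slow seller servers (exponential backoff, at most 3 retries, then manual review) might look like the sketch below. The `fetch` and `sleep` parameters are injected so the logic can run without a network or real waiting; `flaky_fetch_factory` and the example URL are purely illustrative.

```python
import time

MAX_RETRIES = 3


def fetch_with_backoff(url, fetch, sleep=time.sleep, base_delay=1.0):
    """Try the download, then retry up to MAX_RETRIES times, doubling the wait each time."""
    for attempt in range(MAX_RETRIES + 1):
        try:
            return {"status": "downloaded", "body": fetch(url)}
        except TimeoutError:
            if attempt == MAX_RETRIES:
                break
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s
    # Never block the listing: hand the image to a human instead.
    return {"status": "needs_manual_review", "body": None}


def flaky_fetch_factory(failures):
    """Build a fake fetch that times out `failures` times, then succeeds."""
    state = {"left": failures}

    def fetch(url):
        if state["left"] > 0:
            state["left"] -= 1
            raise TimeoutError(url)
        return b"image-bytes"

    return fetch


waits = []
result = fetch_with_backoff("https://example.com/a.jpg", flaky_fetch_factory(2), sleep=waits.append)
print(result["status"], waits)  # downloaded [1.0, 2.0]
```

A server that never responds costs four attempts and seven seconds of waiting before the image lands in the review queue.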
● Production incident · POST-MORTEM · severity: high

S3 Outage That Took Down Amazon.com

Symptom
Amazon.com returned 503 errors for both search and checkout. Customer support immediately flooded with complaints.
Assumption
The team assumed the issue was a traffic spike or a gradual performance degradation.
Root cause
An engineer mistyped a command while debugging a billing issue, removing too many servers from the S3 cluster that served the product catalog images and static assets. Since the retail site depends on S3 for product images, the CDN had nothing to serve, and the frontend couldn't render pages.
Fix
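The S3 team restarted the affected subsystems and restored the removed capacity. Recovery took hours because those subsystems had to complete a full restart before they could serve requests again. Amazon.com recovered as S3 came back.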
Key lesson
  • Blast radius: any admin command on shared infrastructure can take down unrelated services. Always use change management and runbooks.
  • Defense in depth: the frontend should degrade gracefully when static assets are unavailable — show text-only product descriptions instead of failing entirely.
  • Monitoring: alarm on sudden capacity loss in critical storage systems, not just traffic drops.
Production debug guide: common symptoms and the actions that actually fix them (3 entries)
Symptom · 01
Customer sees 'Item out of stock' after adding to cart and proceeding to checkout
Fix
Check inventory service logs for reservation success. If inventory service is healthy, verify cart's item list matches the latest inventory snapshot — inventory might have been decremented for another order between cart add and checkout. Implement optimistic locking with version numbers.
Symptom · 02
Payment gateway returns 500 on payment attempt
Fix
Check payment service idempotency key. The payment gateway might have processed the charge already and returned a timeout. The payment service should retry with the same idempotency key to avoid duplicate charges. Also verify that the payment service's circuit breaker hasn't tripped.
Symptom · 03
Order confirmation page times out, but order eventually appears
Fix
The order service likely uses asynchronous processing (queue + worker). Check the order queue depth, worker health, and the saga orchestrator status. The frontend should poll an order status endpoint instead of waiting for a synchronous response.
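The retry-with-the-same-key behavior from Symptom 02 is easy to get subtly wrong, so here is a small sketch. `PaymentGateway` is a hypothetical in-memory fake; a real gateway would persist keys and handle requests that are still in flight.

```python
import uuid


class PaymentGateway:
    """In-memory fake gateway that remembers the result for each idempotency key."""

    def __init__(self):
        self._results = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch-{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


gateway = PaymentGateway()
key = str(uuid.uuid4())  # generated once and stored with the order
first = gateway.charge(key, 4_999)
retry = gateway.charge(key, 4_999)  # e.g. after the first response timed out

print(first == retry, gateway.charges_made)  # True 1
```

The key has to be created once per order and reused on every retry. Generating a fresh key inside the retry loop defeats the whole mechanism.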
★ Latency Spikes in Product Search: when search response time jumps from 50ms to 2s at peak, here's the fast playbook.
P99 search latency >1s
Immediate action
Check Elasticsearch cluster CPU and GC activity via Kibana/Elasticsearch monitoring dashboard.
Commands
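GET _cluster/health
GET _nodes/stats?level=indices
Enable the search slow log on the affected index (this is an index-level setting, so it goes through the index settings API; replace <index> with your index name): PUT /<index>/_settings { "index.search.slowlog.threshold.query.warn": "500ms" }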
Fix now
If GC pressure >20% of CPU: add more data nodes or increase heap. If due to hot shards: split the overloaded index or redistribute shards with shard allocation filtering.
Search results are stale (5+ minutes behind inventory changes)
Immediate action
Check the CDC pipeline from the inventory database to Elasticsearch. Look at Kafka consumer lag or Debezium connector status.
Commands
kafka-consumer-groups --bootstrap-server localhost:9092 --group inventory-search-sync --describe
Check Debezium connector status: GET /connectors/inventory-connector/status
Fix now
If consumer lag is high, restart the consumer with increased parallelism. If connector failed, restart Debezium and re-snapshot from a recent offset.
Amazon's Core Services: Data Store Choice & Trade-offs
Service | Database | Consistency Model | Key Trade-off
Product Catalog | Aurora (MySQL-compatible) + Redis cache | Eventual for reads; strong for writes (admin updates) | High read throughput vs. stale cache; cache invalidation complexity
Inventory | Aurora (with row-level locking) | Strong consistency (serializable isolation) | Throughput limited by lock contention; shard by product ID to scale
Cart | DynamoDB | Eventual for add/remove; strong for checkout reads | Low latency at scale vs. occasional stale cart items (rarely an issue)
Order History | Aurora | Strong after order creation (user expects immediate visibility) | Write throughput bottleneck; use write sharding by customer ID region
Search | Elasticsearch | Eventual (seconds of staleness) | Search accuracy vs. indexing latency; need real-time sync for inventory changes

Key takeaways

1. Decompose services by data ownership: each service owns its database exclusively.
2. Consistency is not one-size-fits-all: use strong consistency for inventory and payments, eventual for everything else.
3. Idempotency is non-negotiable for any payment or reservation operation: it prevents duplicates and enables safe retries.
4. Caching is your best friend for read-heavy services, but you must plan for invalidation and thundering herds.
5. Saga patterns handle distributed transactions without two-phase commit, but require careful compensation logic.
6. Design for peak traffic, not average. Prime Day can spike 10x above normal.
7. Every external call can fail. Build idempotent retries and graceful degradation into every service.
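The compensation logic behind the saga takeaway can be sketched as a tiny orchestrator. The step names and the `run_saga` helper are hypothetical; the point is only the shape: run steps in order, and on failure undo the completed ones in reverse.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation). Returns (ok, log)."""
    log, completed = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            # Undo everything that already succeeded, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"undone:{done_name}")
            return False, log
    return True, log


def decline():
    raise RuntimeError("card declined")


state = {"reserved": 0}
steps = [
    ("reserve_inventory", lambda: state.update(reserved=1), lambda: state.update(reserved=0)),
    ("charge_payment", decline, lambda: None),
    ("create_shipment", lambda: None, lambda: None),
]

print(run_saga(steps))
# (False, ['done:reserve_inventory', 'failed:charge_payment', 'undone:reserve_inventory'])
print(state)  # {'reserved': 0}
```

The sketch assumes compensations never fail. In production they do, which is why compensations need to be idempotent and retried as well.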

Common mistakes to avoid

5 patterns
×

Designing for strong consistency everywhere

Symptom
High latency on every read, frequent timeout errors, database contention causing deadlocks in the order service.
Fix
Identify which operations absolutely need strong consistency (inventory reserve, payment). For everything else, use eventual consistency with caching and idempotent writes.
×

Treating the cart as a simple key-value store without conflict resolution

Symptom
Customers see stale cart items, items disappearing after adding them on another device, or duplicate items after concurrent adds.
Fix
Use last-write-wins CRDTs for cart items with version vectors. Or, simpler: let the cart service use DynamoDB conditional writes to update items, and on conflict, the latest timestamp wins.
×

Not planning for idempotency in payment processing

Symptom
Duplicate charges when users retry payment after a timeout, leading to chargebacks and customer complaints.
Fix
Always include an idempotency key (e.g., a random UUID stored with the order) in every payment request. The payment gateway must return the same response for the same key, and your service must not proceed with a second charge if the first is in progress.
×

Building an unbounded cache without an eviction policy

Symptom
Redis runs out of memory, evicts all keys including critical session data, causing login failures and cart losses.
Fix
Always configure maxmemory and an eviction policy (allkeys-lru for product cache, volatile-ttl for session data). Monitor memory usage and set alerts at 80% capacity.
×

Building a monolith and decomposing too late

Symptom
Deployments become slow and risky, teams step on each other's code, and scaling one component means scaling the whole application.
Fix
Split by data boundary early. Start with 3-4 services (product, cart, order, payment) and add more as the team grows. Don't wait for the pain to become unbearable.
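For the cart conflict case in this list, the simpler "latest timestamp wins" rule can be sketched as a per-item merge. The cart shape (SKU mapped to a quantity and an update time) and the use of quantity 0 as a removal marker are assumptions for illustration, and the rule only works if device clocks are close enough to compare.

```python
def merge_carts(cart_a, cart_b):
    """Each cart maps sku -> (quantity, updated_at). The newer entry per SKU wins."""
    merged = dict(cart_a)
    for sku, (qty, updated_at) in cart_b.items():
        if sku not in merged or updated_at > merged[sku][1]:
            merged[sku] = (qty, updated_at)
    # Quantity 0 marks a removal, so hide those entries from the customer's view.
    return {sku: entry for sku, entry in merged.items() if entry[0] > 0}


phone = {"sku-1": (1, 100), "sku-2": (2, 105)}
laptop = {"sku-1": (3, 110), "sku-2": (0, 103), "sku-3": (1, 101)}

print(merge_carts(phone, laptop))
# {'sku-1': (3, 110), 'sku-2': (2, 105), 'sku-3': (1, 101)}
```

The laptop's removal of `sku-2` loses because the phone re-added it later. Merging per item rather than per cart is what keeps a stale device from wiping out newer changes.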
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How would you design the product catalog service to handle 200M daily ac...
Q02SENIOR
How would you ensure inventory consistency without sacrificing availabil...
Q03SENIOR
Design the order processing pipeline from when a user clicks 'Place Orde...
Q04SENIOR
How would you handle the situation where the external payment gateway re...
Q05SENIOR
How would you monitor and debug a sudden spike in checkout failures?
Q01 of 05 · SENIOR

How would you design the product catalog service to handle 200M daily active users with a 100:1 read-to-write ratio?

ANSWER
Given the read-heavy workload, I'd use a single-leader architecture with read replicas. Writes go to the Aurora primary, which replicates asynchronously to multiple read replicas. For reads, use a cache-aside pattern with Redis (multi-AZ for availability). The cache stores the top 5% of hottest products by access frequency (Pareto principle). For cache misses, fetch from a read replica and populate the cache. To handle spikes, add a CDN in front for static product images. Warm the cache before major sales events. For write throughput, product updates are admin-only and batched, so a single Aurora instance with proper indexing is sufficient.
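The cache-aside read path in this answer, as a minimal sketch. Plain dictionaries stand in for Redis and the read replica, and `CatalogReader` is a hypothetical name; TTLs, eviction, and stampede protection are deliberately left out.

```python
class CatalogReader:
    """Cache-aside reads: check the cache, fall back to the replica, then populate."""

    def __init__(self, replica, cache=None):
        self._replica = replica                        # dict standing in for a read replica
        self._cache = {} if cache is None else cache   # dict standing in for Redis
        self.replica_reads = 0

    def get_product(self, product_id):
        if product_id in self._cache:
            return self._cache[product_id]
        # Cache miss: read from the replica and remember the result.
        self.replica_reads += 1
        product = self._replica.get(product_id)
        if product is not None:
            self._cache[product_id] = product
        return product

    def invalidate(self, product_id):
        # Called after an admin update so the next read refetches.
        self._cache.pop(product_id, None)


reader = CatalogReader({"B001": {"title": "Headphones", "price_cents": 19_900}})
reader.get_product("B001")
reader.get_product("B001")
print(reader.replica_reads)  # 1
```

Two reads cost one replica hit. After `invalidate`, the next read goes back to the replica, which is where the stale-cache trade-off from the table shows up.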
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is Design Amazon in simple terms?
02
Why do we need separate services for cart and inventory?
03
How does Amazon handle the 'out of stock' race condition?
04
Can you build a similar architecture with open-source tools?
05
What is the most common mistake in designing Amazon-like systems?