Design Amazon — S3 Blast Radius and Checkout Races
A mistyped S3 command took down Amazon.com.
20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.
- Amazon is a multi-service distributed system: catalog, cart, orders, payments, search, recs, logistics — all independent but coordinated
- Product catalog is read-optimised with caching; inventory is real-time with strong consistency in a separate store
- Cart lives in a low-latency KV store (DynamoDB); order history in a relational DB (Aurora)
- Search is powered by a dedicated search engine (Elasticsearch) separate from the OLTP DB
- Payments must be idempotent and strongly consistent: one duplicate charge can cost millions
- The biggest mistake: designing for consistency everywhere — you'll kill availability and latency
Imagine a massive warehouse with millions of shelves, thousands of cashiers, a personal shopping assistant who remembers everything you've ever bought, and a delivery network that spans the globe. Amazon is exactly that — but built from software. Every time you search for headphones, add them to a cart, pay, and track a package, dozens of separate systems are quietly talking to each other to make it feel seamless. This article is about how those systems are actually designed — and the real trade-offs that keep them running under peak load.
Amazon processes over 66,000 orders per minute at peak, serves hundreds of millions of customers across 20+ countries, and runs one of the most complex distributed systems ever built — all while most transactions complete in under a second. Understanding how to design a system at this scale isn't just an interview exercise; it's a masterclass in the real trade-offs that define modern software engineering: consistency vs. availability, latency vs. accuracy, operational simplicity vs. raw performance.
The core problem Amazon solves is multi-dimensional. It's not just a database with a shopping cart on top. It's a real-time inventory system, a personalization engine, a payments processor, a logistics orchestrator, a search engine, and a seller marketplace — all running simultaneously, all needing to agree on the state of the world, and all needing to survive individual component failures without the customer ever noticing. The challenge isn't writing any one of these systems; it's making them work together under crushing load.
By the end of this article, you'll be able to walk into a system design interview and articulate a coherent, production-realistic Amazon architecture. You'll understand why the product catalog is separated from inventory, why the cart lives in a different data store than order history, how search is decoupled from the relational database, and what actually happens between you clicking 'Buy Now' and your order appearing on screen. You'll know the real trade-offs, not just the happy path.
What Designing Amazon Really Means — S3 Blast Radius and Checkout Races
Designing Amazon means decomposing a monolithic e-commerce platform into hundreds of loosely coupled, fault-tolerant services, each owning a single business capability. The core mechanic is service-oriented architecture (SOA) with explicit contracts, asynchronous communication via message queues, and eventual consistency for non-critical paths. You trade strong consistency for availability and partition tolerance — the CAP theorem in practice.
Key properties: each service (e.g., Cart, Order, Payment, Inventory) runs independently, scales horizontally, and fails without cascading. S3 stores product images and static assets with 99.999999999% durability, but a misconfigured bucket policy can expose all objects globally — the blast radius of a single IAM mistake. Checkout involves a distributed transaction: reserve inventory, charge payment, create order. If any step fails, you must compensate (e.g., release inventory) via a saga pattern, not a two-phase commit.
Use this architecture when you need massive scale, independent deployability, and fault isolation. It matters because a single region outage or a race condition in checkout can cost millions per minute. Real systems use idempotency keys, circuit breakers, and dead-letter queues to handle partial failures gracefully.
Core Architecture Principles
Amazon's architecture is built on a few non-negotiable principles. First, data ownership is absolute: each microservice owns its data exclusively — no shared tables between services. Second, communication is asynchronous where possible: use events (Kafka) for order creation, inventory updates, and shipping triggers. Synchronous calls are reserved for operations that need immediate confirmation, like payment gateway interaction. Third, cache everything that can be stale. The product catalog, search results, recommendations — all served from cached layers that accept minutes of staleness. Fourth, fail gracefully: if a downstream service is down, the system degrades, it doesn't crash. The homepage might show fewer recommendations, but the site stays up.
These principles are not theoretical — they were earned through real production failures. The 2017 S3 outage showed that shared infrastructure can bring down the entire site. The 2020 DynamoDB throttling event during Prime Day taught them to provision for 3x peak traffic. Every principle has a scar.
- Each service owns its data — no shared DB tables between services.
- Communicate through events for most flows; use synchronous calls only for idempotent, critical paths.
- If two services need to share a database table, merge them into one service.
- Design for partial failure: every external call can fail, and the system must survive.
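One way to picture "degrade, don't crash" is a circuit breaker wrapped around a downstream call. The sketch below is a minimal illustration, not Amazon's implementation: the class name, the thresholds, and the injected clock are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: after repeated failures, skip the
    downstream call for a cool-down period and serve a fallback."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.reset_after_s = reset_after_s          # hypothetical default
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()      # open: do not touch the downstream service
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # success closes the circuit again
        return result
```

A homepage handler could call `breaker.call(fetch_recommendations, lambda: [])`, so a broken recommendation service costs the page a widget rather than the whole response.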
Requirements & Estimation
Before drawing boxes, we need numbers. Amazon serves ~200M active customers and processes ~66,000 orders/min at peak, which is ~1,100 orders every second. Each order generates writes to cart, inventory, payment, order, and logistics services. Read-to-write ratio for the product catalog is roughly 100:1, while for cart it's 1:1 (every add is followed by a read during checkout). Storage: product catalog ~100M items, each with 10-50 KB metadata, so roughly 1-5 TB in the DB. Images are stored in object storage (S3), petabyte-scale in total. Network bandwidth: each page load transfers ~2 MB (HTML, JS, images). At 200M DAU and an average of 10 pages per session, that is 2B page loads/day, or ~4 PB/day outbound (about 46 GB/s averaged over the day). That's why you need a CDN and aggressive caching.
These numbers drive every architecture decision. You don't design for 66K orders/min without knowing your bottleneck: database writes per second, queue throughput, payment latency SLA. A common mistake: designing for average load, not peak. Prime Day traffic spikes 5-10x above average. So you need to provision for at least 2x your estimated peak, and then use auto-scaling to handle surges.
- Assume 2x growth over the next 2 years, plus headroom: design for 200K orders/min (about 3x today's peak).
- Every order creates 5 writes (cart, inventory, payment, order, shipping). So 5500 writes/sec at peak.
- Catalog reads: 100:1 read/write -> 200K reads/sec. Cache the top 5% hottest items (Pareto).
- Bandwidth: 2 MB per page × 10 pages/user × 200M users = 4 PB/day. Divided by 86,400 s, that is ~46 GB/s on average, and sale-day spikes push it several times higher. CDN is non-negotiable.
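The arithmetic in this section is easy to get wrong by a factor of a thousand, so it helps to check it in code. The sketch below only re-derives figures from the inputs quoted above (66,000 orders/min, 5 writes per order, 200M users, 10 pages each, 2 MB per page); the function and key names are made up for the example, and units are decimal (1 PB = 10^9 MB).

```python
PEAK_ORDERS_PER_MIN = 66_000
WRITES_PER_ORDER = 5          # cart, inventory, payment, order, shipping
DAILY_USERS = 200_000_000
PAGES_PER_USER = 10
PAGE_SIZE_MB = 2
SECONDS_PER_DAY = 86_400


def estimate():
    """Return the back-of-envelope numbers used in this section."""
    orders_per_sec = PEAK_ORDERS_PER_MIN / 60
    page_loads_per_day = DAILY_USERS * PAGES_PER_USER
    outbound_mb_per_day = page_loads_per_day * PAGE_SIZE_MB
    return {
        "orders_per_sec": orders_per_sec,
        "writes_per_sec": orders_per_sec * WRITES_PER_ORDER,
        "page_loads_per_day": page_loads_per_day,
        "outbound_pb_per_day": outbound_mb_per_day / 1e9,             # MB -> PB
        "avg_outbound_gb_per_sec": outbound_mb_per_day / 1e3 / SECONDS_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.1f}")
```

Note that the bandwidth figure is a daily average; peak-hour traffic sits well above it, which is the argument for pushing as many bytes as possible to the CDN.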
High-Level Architecture — Service Decomposition
Amazon's architecture is a collection of hundreds of microservices. The core ones for an e-commerce platform:
- Product Catalog Service: read-heavy, exposes product details, categories, images. Uses a read replica with a CDN cache for images.
- Inventory Service: tracks stock per warehouse. Must be strongly consistent to avoid overselling. Usually a separate database (Aurora with row-level locking).
- Cart Service: low-latency, high-write. Uses DynamoDB with eventual consistency for add/remove operations; cart read during checkout uses strong consistency.
- Order Service: receives checkout request, orchestrates the saga: reserve inventory, process payment, create order, trigger shipping. Uses a queue for decoupling.
- Payment Service: idempotent, integrates with external gateways. Stores transaction logs in a relational DB.
- Search Service: Elasticsearch cluster indexed from catalog and inventory changes via CDC.
- Recommendation Service: ML pipeline producing real-time recommendations served via a separate read-optimised cache.
- Shipping Service: async, watches order completion events and sends to logistics.
Each service has its own database, communicates via HTTP/REST or async events (Kafka). API Gateway routes requests, handles authentication, rate limiting.
Data Consistency & Trade-offs Across Services
Amazon must maintain consistency where it matters (inventory, payments) and accepts eventual consistency where it doesn't (product catalog updates, recommendations). The key trade-offs:
- Product Catalog: writes are rare (admin updates), reads are massive. Use a single-writer database with read replicas and the cache-aside pattern. A catalog update can take minutes to propagate to all edge caches, and that's fine.
- Inventory: overselling is unacceptable. When a customer adds an item to cart, we reserve inventory for 15 minutes. If not checked out, the reservation expires. The reservation is pessimistic: it takes a row-level lock on the inventory row in Aurora, so under high contention writers queue behind each other and can deadlock if two orders lock the same items in different orders. A single row tops out at roughly 1,000 reservations per second. Sharding inventory by product ID spreads different products across partitions; for one very hot product, split its stock across several rows so reservations do not all serialize on a single lock.
- Cart & Order: The cart service uses eventual consistency for add/remove, but during checkout, the order service reads the cart with strong consistency and then runs a saga: reserve inventory (idempotent), charge payment (idempotent), decrement inventory, create order. If any step fails, compensate: release inventory, void payment.
- Search: Elasticsearch is eventually consistent with the inventory DB. If you add an item, it might take seconds to appear in search results. Acceptable for most queries, but for sellers pushing inventory updates, we provide a synchronous fallback: if a seller uses 'update inventory API', we directly update a cache that search reads with low latency.
Use gossip protocols and CRDTs where possible for coordination-free eventual consistency.
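The reservation rule from the Inventory bullet (hold stock for 15 minutes, release it if checkout never happens) can be sketched in memory as follows. This is an illustration under stated assumptions: the class and method names are hypothetical, the clock is injected for testing, and a real service would enforce the same checks inside a database transaction rather than in a Python dict.

```python
import time


class InventoryItem:
    """Stock for one product with time-limited, idempotent reservations."""

    def __init__(self, stock, hold_seconds=15 * 60, clock=time.monotonic):
        self.stock = stock
        self.hold_seconds = hold_seconds
        self.clock = clock
        self.reservations = {}  # reservation_id -> (quantity, expires_at)

    def _expire(self):
        # Drop holds whose 15-minute window has passed.
        now = self.clock()
        for rid, (_, expires_at) in list(self.reservations.items()):
            if expires_at <= now:
                del self.reservations[rid]

    def available(self):
        self._expire()
        return self.stock - sum(qty for qty, _ in self.reservations.values())

    def reserve(self, reservation_id, quantity):
        self._expire()
        if reservation_id in self.reservations:
            return True   # retry of an earlier request: idempotent success
        if quantity > self.available():
            return False  # would oversell
        self.reservations[reservation_id] = (quantity, self.clock() + self.hold_seconds)
        return True

    def commit(self, reservation_id):
        # Final decrement at order creation; fails if the hold already expired.
        self._expire()
        held = self.reservations.pop(reservation_id, None)
        if held is None:
            return False
        self.stock -= held[0]
        return True
```

The key design choice is that availability is always computed as stock minus live holds, so an expired reservation frees the item without a separate cleanup job.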
Search & Recommendations — The Read-Optimised Path
Search and recommendations are the two features with the highest read load on Amazon. Both are served entirely from caches and search indices, never touching the main OLTP databases.
Search: Users type a query, API Gateway routes to Search Service, which queries Elasticsearch (ES). ES returns product IDs, then the service fetches product details from a local Redis cache (or falls back to the catalog DB). The search index is updated asynchronously via Kafka Connect from the inventory and catalog databases. Latency target: under 100ms P99.
Recommendations: For each page load, the frontend sends user context (user ID, page category, recent searches). The Recommendation Service runs an ML model (e.g., collaborative filtering with matrix factorisation) tuned every 6 hours. Model outputs are pre-computed for each user and stored in Redis with a TTL of 12 hours. The service returns a list of product IDs, and the frontend fetches details from the same cache layer as search. Latency target: under 50ms P99.
To scale search, we use a tiered approach: popular queries are cached in a local CDN node (Varnish) with 5-minute TTL. Hot product details are in Redis with sharding across nodes. Cold products go to Elasticsearch with a larger shard count.
Caching & CDN Strategy
Amazon's read volume is staggering — millions of requests per second for catalog pages, images, search results. Without a multi-tier caching strategy, the origin databases would collapse. The caching layers, from edge to database:
- CDN (CloudFront): Caches static assets (product images, CSS, JS) at edge locations. TTL of 24 hours for assets, invalidated on new uploads. For dynamic content (search results, recommendations), CDN caches only popular queries with short TTL (5 minutes).
- API Gateway Cache: Regional cache for identical API responses. Works well for product details that don't change often.
- Service-level Cache (Redis): Each service has its own Redis cluster. Catalog service caches product details by ID (LRU eviction). Cart service uses Redis for session data. Recommendation service caches precomputed user recommendations.
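- Database Read Replicas: Aurora read replicas handle cache misses. In extreme cases, add more replicas (Aurora supports up to 15 per cluster) to absorb the extra read load; promoting a replica makes it the writer and does not add read capacity.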
The design principle: the top 5% of hottest products receive 80% of traffic (Pareto). Cache those aggressively. Long-tail products are served from Elasticsearch or read replicas with lower priority.
- Track access frequency per product. Promote hot items to faster cache tiers.
- Use Redis with maxmemory-policy allkeys-lru for automatic eviction of cold items.
- In CDN, cache popular query results but invalidate on inventory change.
- Warm the cache before major sales events by pre-loading top products.
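The service-level tier described above combines two ideas: cache-aside (check the cache, fall back to the database on a miss, then populate the cache) and LRU eviction. Here is a small in-process sketch of both; the class name and the `load_from_db` callback are assumptions for illustration, standing in for Redis and a read replica.

```python
from collections import OrderedDict


class LRUCacheAside:
    """Cache-aside lookup with least-recently-used eviction."""

    def __init__(self, capacity, load_from_db):
        self.capacity = capacity
        self.load_from_db = load_from_db  # called only on a cache miss
        self.entries = OrderedDict()      # oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, product_id):
        if product_id in self.entries:
            self.entries.move_to_end(product_id)  # mark as recently used
            self.hits += 1
            return self.entries[product_id]
        self.misses += 1
        value = self.load_from_db(product_id)
        self.entries[product_id] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict the coldest entry
        return value

    def invalidate(self, product_id):
        # Call this when the product changes so the next read reloads it.
        self.entries.pop(product_id, None)
```

Warming the cache before a sale is then just a loop calling `get` over the expected top products.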
Checkout Flow — From Cart to Confirmation
When the user clicks 'Place Order', this is the most critical path. Here's the real sequence:
- Cart Service retrieves the user's cart with strong consistency (gets latest items and their IDs).
- Order Service receives the checkout request and starts a saga:
- - Reserve Inventory: for each item, call Inventory Service to reserve quantity. If any item is insufficient, fail the entire order (release other reservations).
- - Process Payment: call Payment Service with the total amount and an idempotency key. The payment service interacts with the external gateway. If timeout, retry (idempotency prevents double charge).
- - Create Order: insert order record into Order DB.
- - Decrement Inventory: final decrement of reserved quantities.
- - Send to Shipping: publish an order_created event to Kafka, which the Shipping Service picks up.
- If any step fails after payment, a compensation transaction is run: refund payment, release remaining inventory. This compensation is also idempotent.
- The frontend polls the Order Service for the order status (every 2 seconds until confirmed) and then redirects to the order confirmation page.
All services use asynchronous communication where possible to reduce end-to-end latency. The entire saga typically completes in under 500ms for 95% of orders.
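The saga above boils down to two mechanics: run steps in order and undo the completed ones in reverse when a later step fails, and make the payment call safe to retry with an idempotency key. The sketch below shows both in miniature. Everything here is hypothetical scaffolding (`run_saga`, `PaymentLedger`, `SagaFailed`); a production orchestrator would also persist saga state so it can resume after a crash.

```python
class SagaFailed(Exception):
    """Raised after compensations have run for a failed saga."""


class PaymentLedger:
    """Hypothetical payment stub that deduplicates on an idempotency key."""

    def __init__(self):
        self.charges = {}  # idempotency_key -> amount charged

    def charge(self, idempotency_key, amount):
        # A retry with the same key returns the earlier result: no second charge.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount
        return self.charges[idempotency_key]

    def refund(self, idempotency_key):
        # Refunding twice, or refunding an unknown key, is a harmless no-op.
        self.charges.pop(idempotency_key, None)


def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If a step raises, run the compensations of the steps that already
    succeeded, newest first, then raise SagaFailed.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception as exc:
            for _, undo in reversed(completed):
                undo()
            raise SagaFailed(f"step failed: {name}") from exc
        completed.append((name, compensation))
```

Each step would be passed as a tuple such as `("charge", lambda: ledger.charge(order_id, total), lambda: ledger.refund(order_id))`, with the order ID doubling as the idempotency key.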
Inventory & Fulfillment — The Distributed State Nightmare
Everyone talks about the checkout flow. Nobody talks about what happens after you click 'Buy Now'. Amazon's inventory system isn't a single database — it's a distributed state machine spanning warehouses, fulfillment centers, and last-mile carriers. Each item lives in multiple locations with different availability statuses: reserved, in-transit, damaged, pending return.
The hard part isn't decrementing stock. It's doing it without overselling when two customers grab the same item in different regions. Amazon uses a pessimistic locking approach at the warehouse level — each fulfillment center owns its inventory partition. When you checkout, the system picks a specific FC and locks that item's slot. No optimistic retry bullshit. If the lock fails, you get 'Currently Unavailable'.
But here's where it gets brutal: returns don't immediately re-add to inventory. They go through a separate inspection pipeline. That 'In Stock' badge you see? It's a cached projection, not real-time truth. The staleness window is typically 5-15 minutes depending on item velocity.
AWS Tooling — Why You Don't Actually Run Amazon's Architecture
Every system design interview answer for 'Design Amazon' throws around S3, DynamoDB, and Lambda like candy. Here's the reality: Amazon's internal architecture barely touches those services the way you think. S3 powers product images and static assets. That's it. The product catalog lives on a custom distributed key-value store called 'Pegasus' that predates DynamoDB by half a decade.
What does run on real AWS? The search indexing pipeline. It's a massive Spark cluster on EMR that churns through clickstream data and recomputes relevance scores every 15 minutes. The search serving layer uses OpenSearch (Amazon's managed ES), but with a custom routing layer that shards by product category — not by hash. Why? Because 'electronics' queries should never affect 'groceries' latency.
The checkout system runs on a combination of RDS (PostgreSQL with read replicas for payment reconciliation) and ElastiCache (Redis) for session carts. The payment dead-letter queue? Standard SQS with a custom redrive policy that defers before retry — exponential backoff with jitter. Don't use the default retry.
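"Exponential backoff with jitter" is easier to reason about once you see the delays it produces. The sketch below uses the common "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name and default values are assumptions for the example, and it only computes delays; wiring them into a queue consumer is left out.

```python
import random


def backoff_delays(max_retries=3, base_s=1.0, cap_s=60.0, rng=random.random):
    """Return one delay (in seconds) per retry attempt, with full jitter.

    The ceiling doubles on every attempt (base_s, 2*base_s, 4*base_s, ...)
    up to cap_s, and the actual delay is a random fraction of that ceiling.
    """
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        delays.append(rng() * ceiling)
    return delays


if __name__ == "__main__":
    print(backoff_delays())
```

The randomness matters: without it, every consumer that failed at the same moment retries at the same moment, and the downstream service is hit by the same spike again.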
If you're designing Amazon on AWS, the real question isn't 'which service'. It's 'what's the failure mode of each service and how do you degrade gracefully'.
Scalability Isn't Optional: It's The Whole Point
You don't design Amazon's checkout for 1 user. You design for 100 million users hitting 'Buy Now' on Prime Day. Scalability starts with data partitioning, not server count.
Shard cart and orders by customer_id, and inventory by product ID. Everything else becomes a fan-out query problem. Orders go into a write-ahead log before anything touches the database. That log is your scalability safety net: it decouples the rush from the database write rate.
The real trap? Scaling reads is easy. Scaling transactional writes across 10,000 nodes is where dreams die. Use a distributed consensus protocol (Raft/Paxos) for critical path writes like payment authorization. Everything else can be eventually consistent. Your catalog service? Read-replicas behind a cache layer. Your checkout service? Linearizable writes or you're debugging ghost charges at 3 AM.
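Partitioning by key usually means hashing the key and taking it modulo the shard count. A minimal sketch, assuming a fixed number of shards and a hypothetical `shard_for` helper:

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key (for example a customer_id for carts) to a shard index.

    A stable hash is used instead of Python's built-in hash(), which is
    randomized per process for strings and so cannot be used for routing.
    """
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for customer_id in ("customer-1", "customer-2", "customer-3"):
        print(customer_id, "->", shard_for(customer_id, 16))
```

Plain modulo routing reshuffles most keys when the shard count changes; if you expect to add shards often, mention consistent hashing as the next step.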
Trade-offs Will Get You Fired: Pick The Right One
Every system design interview question is a trade-off trap. Amazon's architecture screams 'Amazon chooses availability over consistency in the catalog, but consistency over availability in checkout.' That's not a bug — it's a business decision.
Your search results can be stale by a few seconds. Nobody dies. But if inventory confirms a purchase for a sold-out item, you've got a pissed-off customer and a logistics nightmare. That's why inventory writes go through a guarded write (a row lock in Aurora, or a conditional update if the store is DynamoDB) while search reads from an Elasticsearch cluster updated via async streams.
The second trade-off: latency vs. durability. When a user clicks 'Place Order', do you wait for 3 of 3 replicas to confirm? That's 200ms added. Or do you write to 2 of 3 and risk a rollback? Amazon picks 2-of-3 for checkout because 99.99% availability matters more than a 0.01% rollback cost. Write it down: better to have a rare rollback than a frequent timeout.
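The latency side of this trade-off has a simple model: if the write is sent to all replicas in parallel, the caller waits for the W-th fastest acknowledgement. The sketch below encodes just that; the function name and the sample latencies are invented for illustration.

```python
def quorum_write_latency_ms(ack_latencies_ms, write_quorum):
    """Latency of a parallel write that waits for `write_quorum` acks.

    With replicas written in parallel, the wait ends when the W-th
    fastest replica answers, so one slow replica only hurts if W == N.
    """
    if not 1 <= write_quorum <= len(ack_latencies_ms):
        raise ValueError("write_quorum must be between 1 and the replica count")
    return sorted(ack_latencies_ms)[write_quorum - 1]


if __name__ == "__main__":
    acks = [12, 15, 210]  # hypothetical: one replica is having a bad day
    print("2 of 3:", quorum_write_latency_ms(acks, 2), "ms")
    print("3 of 3:", quorum_write_latency_ms(acks, 3), "ms")
```

With these sample numbers a 2-of-3 write ignores the straggler entirely, while 3-of-3 inherits its full latency, which is the whole case for the smaller quorum.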
Trade-offs That Shape Amazon's Architecture
Amazon's design is a continuous series of deliberate trade-offs. Consistency vs. availability is the most brutal: S3 historically accepted eventual consistency for listing operations in exchange for availability (it has offered strong read-after-write consistency, including for lists, since December 2020), while the checkout service uses conditional writes and idempotency keys to guarantee exactly one charge per order. Latency vs. accuracy in search: product ranking tolerates stale index updates for 30 seconds to keep query latency under 100ms. Write cost vs. read cost in fulfillment: inventory snapshots are recomputed every 15 minutes instead of reading live stock, which prevents hot partitions but risks overselling during flash sales. The pattern: Amazon never optimizes for all attributes. It picks the one that causes the least customer pain per service and lives with the rest. Your job is to make those trade-offs explicit in diagrams and defend them with hard numbers.
Limitations and Challenges in Real Amazon Design
No system survives contact with production traffic unchanged. Amazon's architecture faces three hard limits. First, hot partitions in DynamoDB: a celebrity product page can spike read traffic 1000x within seconds. Auto-scaling fails because the partition key (product ID) concentrates load. Solution: add a random suffix to partition keys, but that complicates range queries. Second, checkout race conditions despite all locks: network partitions between payment service and order service cause orphan orders. Amazon uses idempotency keys but still sees 0.01% duplicate orders at scale. Third, search index rebuild latency: product catalog updates propagate to Elasticsearch only after 5 minutes. During Prime Day, new products are invisible for millions of customers. Mitigations exist — pre-warming caches, throttling aggressive clients — but none eliminate the problem. Acknowledge these limitations in your design document to show production readiness.
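The "random suffix" fix for hot partitions is worth seeing concretely, because it shows both the benefit and the cost. In the sketch below a plain dict stands in for the table, and the suffix count and function names are assumptions: writes for one hot product are spread across several keys, and every read has to fan out across all of them.

```python
import random
from collections import defaultdict

NUM_SUFFIXES = 10  # hypothetical fan-out per hot product


def write_key(product_id, rng=random):
    """Pick one of NUM_SUFFIXES partition keys at random for a write."""
    return f"{product_id}#{rng.randrange(NUM_SUFFIXES)}"


def read_keys(product_id):
    """Every key a reader must query to see the full picture."""
    return [f"{product_id}#{i}" for i in range(NUM_SUFFIXES)]


def record_view(table, product_id, rng=random):
    table[write_key(product_id, rng)] += 1


def total_views(table, product_id):
    # The cost of write sharding: one logical read becomes NUM_SUFFIXES reads.
    return sum(table.get(key, 0) for key in read_keys(product_id))


if __name__ == "__main__":
    table = defaultdict(int)
    for _ in range(1000):
        record_view(table, "B000HOT")
    print(len(table), "partition keys,", total_views(table, "B000HOT"), "views")
```

This helps write-heavy hot keys; for read-heavy hot keys, a cache in front of the table is usually the first thing to reach for.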
What Interviewers Expect From Your Amazon Design
Senior engineers interviewing for Amazon-style system design are evaluated on four axes. First, scope: you must clarify ambiguous requirements — ask if it's the entire Amazon or just the shopping flow. Never assume. Second, trade-off reasoning: explain why you chose DynamoDB over Spanner for inventory (write throughput vs. global consistency) with numbers. Third, failure handling: describe what happens when your checkout service loses connection to the payment gateway. Idempotency keys and dead-letter queues must appear in your diagram. Fourth, scalability estimates: derive read/write ratios from business logic, not guesswork — 10 million daily active buyers generate roughly 200 million page views. Interviewers watch for candidates who jump to solution without understanding constraints. Start with requirements, then capacity estimation, then service decomposition. The winning answer always includes a whiteboard-ready diagram of your consistency boundaries.
Further Reading: Links That Will Save You in the Interview
Your Amazon system design interview doesn't stop at drawing boxes. Interviewers probe for depth: where you learned it, and whether you've read the real papers. Start with Amazon's Dynamo paper (2007) for understanding distributed key-value stores under high write loads. Then read Google's Spanner (2012) to contrast global consistency with Amazon's eventual consistency model. For search, Elasticsearch's "From 20 to 20 Billion Queries Per Day" reveals how they scale inverted indexes. The AWS Well-Architected Framework whitepaper covers reliability pillars Amazon teams literally use. Don't just memorize: understand the trade-off each paper accepts. Interviewers can smell recitation a mile away.
Q 10. Analyze Image Quality from a URL — Anti-Pattern Detector
When a user uploads a product image, Amazon must flag low-resolution, blurry, or watermarked images before they appear. The naive approach: download the image, run a Python script using OpenCV (Laplacian variance for blur detection), and reject. This fails at scale. Instead, build a producer-consumer pipeline. A URL queue (SQS) feeds workers that download images in parallel. Each worker extracts metadata (dimensions, EXIF), runs a lightweight blur score, and checks against a watermark model (ResNet-18 trained on Amazon's catalog). Results write to a DynamoDB table keyed by image ID. The tricky part: some images are fine for thumbnails but fail at full resolution. Use a tiered scoring system: thumbnail quality, zoom quality, and print quality. A production trap: workers often hit download timeouts for large images from slow seller servers. Implement exponential backoff with a max of 3 retries. If still failing, mark as "needs manual review" — never block the seller's listing entirely.
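The "lightweight blur score" mentioned above is typically the variance of the image's Laplacian: sharp images have strong edges and a high variance, blurry ones do not. The sketch below computes that score in pure Python on a grayscale grid so it runs without OpenCV. It is a stand-in for illustration only: the function names and the threshold of 100 are assumptions, and a real worker would use an optimized library on the decoded image.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    `gray` is a list of equal-length rows of grayscale values.
    Low variance means few sharp edges, which suggests a blurry image.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses) if responses else 0.0


def is_blurry(gray, threshold=100.0):
    # The threshold is a placeholder; tune it per tier (thumbnail, zoom, print).
    return laplacian_variance(gray) < threshold
```

Running the same score at each tier's target resolution is one way to get the tiered verdict: an image can pass at thumbnail size and still fail at zoom size.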
S3 Outage That Took Down Amazon.com
- Blast radius: any admin command on shared infrastructure can take down unrelated services. Always use change management and runbooks.
- Defense in depth: the frontend should degrade gracefully when static assets are unavailable — show text-only product descriptions instead of failing entirely.
- Monitoring: alarm on sudden capacity loss in critical storage systems, not just traffic drops.
GET _cluster/health
GET _nodes/stats?level=indices
Check slow query logs (the threshold is an index-level setting, so set it per index): PUT <index>/_settings { "index.search.slowlog.threshold.query.warn": "500ms" }
Common mistakes to avoid
- Designing for strong consistency everywhere
- Treating the cart as a simple key-value store without conflict resolution
- Not planning for idempotency in payment processing
- Building an unbounded cache without an eviction policy
- Building a monolith and decomposing too late
Interview Questions on This Topic
How would you design the product catalog service to handle 200M daily active users with a 100:1 read-to-write ratio?