Mid-level 7 min · March 06, 2026

Design Amazon — S3 Blast Radius and Checkout Races

A mistyped S3 command took down Amazon.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Amazon is a multi-service distributed system: catalog, cart, orders, payments, search, recs, logistics — all independent but coordinated
  • Product catalog is read-optimised with caching; inventory is real-time with strong consistency in a separate store
  • Cart lives in a low-latency KV store (DynamoDB); order history in a relational DB (Aurora)
  • Search is powered by a dedicated search engine (Elasticsearch) separate from the OLTP DB
  • Payments must be idempotent and strongly consistent: at this volume, duplicate charges can cost millions
  • The biggest mistake: designing for consistency everywhere — you'll kill availability and latency
Plain-English First

Imagine a massive warehouse with millions of shelves, thousands of cashiers, a personal shopping assistant who remembers everything you've ever bought, and a delivery network that spans the globe. Amazon is exactly that — but built from software. Every time you search for headphones, add them to a cart, pay, and track a package, dozens of separate systems are quietly talking to each other to make it feel seamless. This article is about how those systems are actually designed — and the real trade-offs that keep them running under peak load.

Amazon processes over 66,000 orders per minute at peak, serves hundreds of millions of customers across 20+ countries, and runs one of the most complex distributed systems ever built — all while most transactions complete in under a second. Understanding how to design a system at this scale isn't just an interview exercise; it's a masterclass in the real trade-offs that define modern software engineering: consistency vs. availability, latency vs. accuracy, operational simplicity vs. raw performance.

The core problem Amazon solves is multi-dimensional. It's not just a database with a shopping cart on top. It's a real-time inventory system, a personalization engine, a payments processor, a logistics orchestrator, a search engine, and a seller marketplace — all running simultaneously, all needing to agree on the state of the world, and all needing to survive individual component failures without the customer ever noticing. The challenge isn't writing any one of these systems; it's making them work together under crushing load.

By the end of this article, you'll be able to walk into a system design interview and articulate a coherent, production-realistic Amazon architecture. You'll understand why the product catalog is separated from inventory, why the cart lives in a different data store than order history, how search is decoupled from the relational database, and what actually happens between you clicking 'Buy Now' and your order appearing on screen. You'll know the real trade-offs, not just the happy path.

Core Architecture Principles

Amazon's architecture is built on a few non-negotiable principles. First, data ownership is absolute: each microservice owns its data exclusively — no shared tables between services. Second, communication is asynchronous where possible: use events (Kafka) for order creation, inventory updates, and shipping triggers. Synchronous calls are reserved for operations that need immediate confirmation, like payment gateway interaction. Third, cache everything that can be stale. The product catalog, search results, recommendations — all served from cached layers that accept minutes of staleness. Fourth, fail gracefully: if a downstream service is down, the system degrades, it doesn't crash. The homepage might show fewer recommendations, but the site stays up.

These principles are not theoretical — they were earned through real production failures. The 2017 S3 outage showed that shared infrastructure can bring down the entire site. The 2020 DynamoDB throttling event during Prime Day taught them to provision for 3x peak traffic. Every principle has a scar.

io/thecodeforge/order/EventPublisher.javaJAVA
package io.thecodeforge.order;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class EventPublisher {
    private static final Logger log = LoggerFactory.getLogger(EventPublisher.class);

    private final KafkaTemplate<String, OrderEvent> kafka;

    public EventPublisher(KafkaTemplate<String, OrderEvent> kafka) {
        this.kafka = kafka;
    }

    public void publishOrderCreated(Order order) {
        OrderEvent event = new OrderEvent(
            order.getId(),
            order.getCustomerId(),
            order.getTotal(),
            "CREATED"
        );
        // Key by order ID so all events for one order land on the same partition, in order.
        // send() is asynchronous; in Spring Kafka 3.x it returns a CompletableFuture.
        kafka.send("order_events", order.getId(), event)
            .whenComplete((result, ex) -> {
                if (ex != null) {
                    // Publish failed: log it here, then hand off to a retry or dead-letter path
                    log.error("Failed to publish order event for {}", order.getId(), ex);
                }
            });
    }
}
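The fourth principle above (fail gracefully) can be sketched in a few lines. This is a minimal illustration, not Amazon's implementation: `GracefulDegradation` and `callWithFallback` are hypothetical names, and the time budget is whatever you pass in. The idea is that a call to a non-critical dependency, such as recommendations, gets a time budget and a fallback value, so a slow or failing dependency shrinks the page instead of breaking it.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: wraps a call to a non-critical dependency with a time budget and a fallback.
final class GracefulDegradation {
    private GracefulDegradation() {}

    static <T> T callWithFallback(Supplier<T> remoteCall, long timeoutMillis, T fallback) {
        try {
            // Run the call off-thread so the caller can stop waiting after the budget
            return CompletableFuture.supplyAsync(remoteCall)
                    .get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (Exception e) {
            // Timeout, interruption, or downstream failure: degrade instead of failing the page.
            // Note: on timeout the background task keeps running; it is only abandoned here.
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            return fallback;
        }
    }
}
```

A homepage handler might call `callWithFallback(() -> recsClient.forUser(userId), 50, List.of())` (where `recsClient` is assumed) and simply render fewer widgets when the list comes back empty.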
Boxes and Arrows
  • Each service owns its data — no shared DB tables between services.
  • Communicate through events for most flows; use synchronous calls only for idempotent, critical paths.
  • If two services need to share a database table, merge them into one service.
  • Design for partial failure: every external call can fail, and the system must survive.
Production Insight
Shared databases are the top cause of cascading failures.
When the inventory service goes down, it shouldn't take the catalog service with it.
Rule: if you share a database, you've actually built one service with two codebases.
Key Takeaway
Data ownership boundaries define your architecture.
Shared data = shared fate.
Own your data, own your availability.

Requirements & Estimation

Before drawing boxes, we need numbers. Amazon serves ~200M active customers and processes ~66,000 orders/min at peak, which is ~1,100 orders per second. Each order generates writes to the cart, inventory, payment, order, and logistics services. The read-to-write ratio for the product catalog is roughly 100:1, while for the cart it's closer to 1:1 (every add is followed by a read during checkout). Storage: the product catalog holds ~100M items, each with 10-50 KB of metadata, so 1-5 TB in the DB. Images live in object storage (S3) and are petabyte-scale in total. Network bandwidth: each page load transfers ~2 MB (HTML, JS, images). At 200M DAU and an average of 10 pages per session, that's 2B page loads/day, or ~4 PB/day outbound (~46 GB/s averaged over the day), which is why you need a CDN and aggressive caching.

These numbers drive every architecture decision. You don't design for 66K orders/min without knowing your bottleneck: database writes per second, queue throughput, payment latency SLA. A common mistake: designing for average load, not peak. Prime Day traffic spikes 5-10x above average. So you need to provision for at least 2x your estimated peak, and then use auto-scaling to handle surges.
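The arithmetic in the paragraphs above is easy to get wrong by a factor of a thousand, so it helps to write it down as code. The sketch below only restates the article's inputs (66,000 orders/min, 5 writes per order, 200M DAU, 10 pages per session, 2 MB per page); `CapacityEstimate` and its method names are made up for illustration, and decimal units are assumed (1 PB = 10^9 MB).

```java
// Back-of-envelope helpers; all inputs come from the estimates in the text.
final class CapacityEstimate {
    static final long SECONDS_PER_DAY = 86_400L;

    private CapacityEstimate() {}

    static long peakOrdersPerSecond(long peakOrdersPerMinute) {
        return peakOrdersPerMinute / 60;
    }

    static long peakWritesPerSecond(long ordersPerSecond, int writesPerOrder) {
        return ordersPerSecond * writesPerOrder;
    }

    static long pageLoadsPerDay(long dailyActiveUsers, int pagesPerSession) {
        return dailyActiveUsers * pagesPerSession;
    }

    static long averagePageLoadsPerSecond(long pageLoadsPerDay) {
        return pageLoadsPerDay / SECONDS_PER_DAY;
    }

    // Decimal units: 1 PB = 1e9 MB
    static double outboundPetabytesPerDay(long pageLoadsPerDay, double megabytesPerPage) {
        return pageLoadsPerDay * megabytesPerPage / 1e9;
    }

    // Decimal units: 1 GB = 1e3 MB; this is the daily average, not the peak
    static double averageOutboundGigabytesPerSecond(long pageLoadsPerDay, double megabytesPerPage) {
        return pageLoadsPerDay * megabytesPerPage / 1e3 / SECONDS_PER_DAY;
    }
}
```

With the article's inputs this gives 1,100 orders/sec, 5,500 writes/sec, 2B page loads/day, 4 PB/day outbound, and roughly 46 GB/s on average. Multiply the averages by your expected peak factor before sizing anything.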

estimates.txtTEXT
Daily active users: 200M
Orders per second (peak): ~1,100 (66,000 / 60)
Catalog page views/sec (average): ~23,000 (200M * 10 pages / 86,400 seconds); several times higher at peak
Writes/sec to order DB (peak): ~1,100 (plus cart, inventory, payment)
Storage: catalog 1-5 TB, images petabyte-scale
CDN bandwidth (average): ~46 GB/s, about 370 Gbps (2 MB * 2B page loads / 86,400 seconds)
Back-of-Envelope Estimation
  • Assume 2x growth over the next 2 years: 66K -> ~132K orders/min at peak. Design for that, not for today's number.
  • Every order creates 5 writes (cart, inventory, payment, order, shipping), so ~5,500 writes/sec at today's peak.
  • Catalog reads: ~23K page views/sec on average; at a 5-10x peak that is roughly 115K-230K reads/sec. Cache the hottest 5% of items (Pareto).
  • Bandwidth: 2 MB per page * 10 pages/user * 200M users = 4 PB/day; divided by 86,400 seconds that is ~46 GB/s on average. A CDN is non-negotiable.
Production Insight
Most teams under-estimate write throughput by 10x.
They design for average load, not for peak events like Black Friday, which can run at several times normal traffic.
Rule: always multiply peak by 2x for safety margin.
Key Takeaway
Numbers force trade-offs.
If you don't know your write throughput, you'll pick the wrong database.
Estimate before you architect — it's cheaper than re-architecting.

High-Level Architecture — Service Decomposition

Amazon's architecture is a collection of hundreds of microservices. The core ones for an e-commerce platform:

  • Product Catalog Service: read-heavy, exposes product details, categories, images. Uses a read replica with a CDN cache for images.
  • Inventory Service: tracks stock per warehouse. Must be strongly consistent to avoid overselling. Usually a separate database (Aurora with row-level locking).
  • Cart Service: low-latency, high-write. Uses DynamoDB with eventual consistency for add/remove operations; cart read during checkout uses strong consistency.
  • Order Service: receives checkout request, orchestrates the saga: reserve inventory, process payment, create order, trigger shipping. Uses a queue for decoupling.
  • Payment Service: idempotent, integrates with external gateways. Stores transaction logs in a relational DB.
  • Search Service: Elasticsearch cluster indexed from catalog and inventory changes via CDC.
  • Recommendation Service: ML pipeline producing real-time recommendations served via a separate read-optimised cache.
  • Shipping Service: async, watches order completion events and sends to logistics.

Each service has its own database, communicates via HTTP/REST or async events (Kafka). API Gateway routes requests, handles authentication, rate limiting.

high-level-arch.txtTEXT
Client -> CloudFront (CDN) -> API Gateway -> Services:
  - Product Catalog (Aurora, Redis cache)
  - Inventory (Aurora, strong consistency)
  - Cart (DynamoDB)
  - Order (Aurora + SQS/Kafka)
  - Payment (Aurora, external gateway via idempotent HTTP)
  - Search (Elasticsearch)
  - Recommendations (Redis / ML predictions)
  - Shipping (DynamoDB + SQS -> logistics)
Each service owns its DB. Async events flow through Kafka topics (order_created, payment_completed, etc.)
Don't Over-Split
Each new service adds complexity: network calls, consistency headaches, debugging difficulty. Amazon has hundreds because they had thousands of engineers. For a startup, start with 3-4 services and split when the deployment bottleneck hurts.
Production Insight
Service boundaries defined by data ownership, not business nouns.
If two services need to share a database, they're really one service.
Rule: each microservice owns its data exclusively — no shared tables, only shared events.
Key Takeaway
Decompose by data boundaries.
Shared databases defeat the purpose of microservices.
If you need transactions across services, use saga pattern, not distributed transactions.

Data Consistency & Trade-offs Across Services

Amazon must maintain consistency where it matters (inventory, payments) and accepts eventual consistency where it doesn't (product catalog updates, recommendations). The key trade-offs:

  • Product Catalog: writes are rare (admin updates), reads are massive. Use a single-leader setup (one primary for writes, many read replicas) with the cache-aside pattern. A catalog update can take minutes to propagate to all edge caches, and that's fine.
  • Inventory: overselling is unacceptable. When a customer starts checkout, we reserve inventory for up to 15 minutes; if the order isn't completed, the reservation expires. Reservations take a row-level lock on the inventory row in Aurora, which is pessimistic: it guarantees correctness, but under high contention writers queue behind the lock and can deadlock when orders lock several rows in different sequences. A single row tops out at roughly 1,000 reservations per second. Sharding inventory by product ID spreads the load across products; a single very hot product still contends on its own row, so its stock may need to be split across several rows.
  • Cart & Order: The cart service uses eventual consistency for add/remove, but during checkout the order service reads the cart with strong consistency and then runs a saga: reserve inventory (idempotent), charge payment (idempotent), create order, decrement inventory. If any step fails, compensate: release inventory, void or refund payment.
  • Search: Elasticsearch is eventually consistent with the inventory DB. If you add an item, it might take seconds to appear in search results. Acceptable for most queries, but for sellers pushing inventory updates, we provide a synchronous fallback: if a seller uses 'update inventory API', we directly update a cache that search reads with low latency.
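The no-overselling rule for inventory comes down to one atomic "check and decrement" per product. The sketch below is an in-memory stand-in for the locked inventory row, assuming one counter per product; `InventoryLedger` and its methods are hypothetical names. In a relational store the same idea is usually written as a conditional update, along the lines of `UPDATE inventory SET stock = stock - ? WHERE product_id = ? AND stock >= ?`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the inventory table: one available-stock counter per product.
final class InventoryLedger {
    private final ConcurrentMap<String, Integer> available = new ConcurrentHashMap<>();

    void setStock(String productId, int quantity) {
        available.put(productId, quantity);
    }

    /** Atomically reserves the quantity; returns false and changes nothing if stock is insufficient. */
    boolean tryReserve(String productId, int quantity) {
        if (quantity <= 0) {
            throw new IllegalArgumentException("quantity must be positive");
        }
        boolean[] reserved = {false};
        // computeIfPresent runs atomically per key, so two callers cannot both take the last unit
        available.computeIfPresent(productId, (id, stock) -> {
            if (stock >= quantity) {
                reserved[0] = true;
                return stock - quantity;
            }
            return stock;
        });
        return reserved[0];
    }

    /** Compensation: returns previously reserved units to the pool. */
    void release(String productId, int quantity) {
        available.merge(productId, quantity, Integer::sum);
    }

    int availableFor(String productId) {
        return available.getOrDefault(productId, 0);
    }
}
```

The check and the decrement happen in one atomic step, which is the property that matters; whether it is enforced by a row lock, a conditional write, or a compare-and-set is an implementation choice.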

Use gossip protocols and CRDTs where possible for coordination-free eventual consistency.
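As one concrete illustration of a coordination-free merge, here is a last-write-wins merge of two cart replicas, for example the same cart edited on a phone and a laptop. This is a sketch under assumptions: `CartEntry` and `CartMerge` are hypothetical names, each item carries the timestamp of its last update, and a removal is modelled as an entry with quantity 0 rather than a deleted key.

```java
import java.util.HashMap;
import java.util.Map;

// One cart line: quantity plus the time of the last update (quantity 0 acts as a tombstone).
final class CartEntry {
    final int quantity;
    final long updatedAtMillis;

    CartEntry(int quantity, long updatedAtMillis) {
        this.quantity = quantity;
        this.updatedAtMillis = updatedAtMillis;
    }
}

final class CartMerge {
    private CartMerge() {}

    /** Per item, keeps the most recently updated entry. Inputs are not modified. */
    static Map<String, CartEntry> merge(Map<String, CartEntry> a, Map<String, CartEntry> b) {
        Map<String, CartEntry> merged = new HashMap<>(a);
        b.forEach((itemId, entry) -> merged.merge(itemId, entry, CartMerge::newer));
        return merged;
    }

    // Ties on timestamp prefer the larger quantity so the result does not depend on argument order
    private static CartEntry newer(CartEntry x, CartEntry y) {
        if (x.updatedAtMillis != y.updatedAtMillis) {
            return x.updatedAtMillis > y.updatedAtMillis ? x : y;
        }
        return x.quantity >= y.quantity ? x : y;
    }
}
```

Because the merge gives the same answer in either order, replicas can exchange state in any sequence and still converge. The price is that a concurrent update with an older timestamp is silently dropped, which is usually acceptable for a cart and never acceptable for inventory or payments.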

io/thecodeforge/order/PaymentSaga.javaJAVA
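package io.thecodeforge.order;

import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.concurrent.TimeoutException;

public class PaymentSaga {
    private static final String IDEMPOTENCY_KEY_HEADER = "X-Idempotency-Key";

    // Collaborator types are assumed to be defined elsewhere in the order module
    private final PaymentGateway paymentGateway;
    private final InventoryService inventoryService;

    public PaymentSaga(PaymentGateway paymentGateway, InventoryService inventoryService) {
        this.paymentGateway = paymentGateway;
        this.inventoryService = inventoryService;
    }

    // Called by the order saga orchestrator. Use BigDecimal or integer minor units
    // for money in production; double is kept here for brevity.
    public SagaResult processPayment(String orderId, double amount) {
        // Derive the key from the order ID so every retry sends the same key.
        // A fresh random UUID per call would defeat idempotency.
        // Persist this key with the order before calling the gateway.
        String idempotencyKey = UUID.nameUUIDFromBytes(
            ("payment:" + orderId).getBytes(StandardCharsets.UTF_8)).toString();
        PaymentRequest request = new PaymentRequest(orderId, amount, idempotencyKey);
        try {
            PaymentResponse response = paymentGateway.charge(request);
            if (response.isSuccess()) {
                return SagaResult.success(orderId, response.getTransactionId());
            } else {
                // Compensate: release inventory
                inventoryService.releaseReservation(orderId);
                return SagaResult.failure(orderId, "Payment declined");
            }
        } catch (TimeoutException e) {
            // The charge may or may not have gone through. Do not compensate yet:
            // schedule a retry with the same idempotency key to learn the outcome.
            return SagaResult.retryLater(orderId, idempotencyKey);
        }
    }
}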
Idempotency is not optional
Every payment mutation must be idempotent. Store the idempotency key in your database before making the external call. If the call times out, you can safely retry with the same key — the gateway will return the original response.
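The gateway-side half of that contract ("same key, same response") can be sketched as a small keyed store. This is an in-memory illustration with a hypothetical class name, `IdempotentExecutor`; a real payment service would keep the key and response in a durable store, but the shape is the same: the operation runs at most once per key, and every retry gets the original result.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Runs an operation at most once per idempotency key and replays the stored response on retries.
final class IdempotentExecutor<R> {
    private final ConcurrentMap<String, R> responsesByKey = new ConcurrentHashMap<>();

    R execute(String idempotencyKey, Supplier<R> operation) {
        // computeIfAbsent is atomic per key: concurrent retries with the same key
        // wait for the first call instead of running the operation again.
        // If the operation throws, nothing is stored, so a later retry can run it.
        return responsesByKey.computeIfAbsent(idempotencyKey, key -> operation.get());
    }
}
```

Calling `execute("order-42", chargeCard)` twice charges once and returns the same transaction ID both times, which is exactly what makes a retry after a timeout safe.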
Production Insight
The deadliest bug is a duplicate charge.
It happens when you have a payment timeout and the user retries without idempotency.
Rule: never accept a payment request without an idempotency key, and never process the same key twice.
Key Takeaway
Eventual consistency is the default, not strong consistency.
Only use strong consistency where real money is at stake.
For everything else, accept staleness and design idempotent compensation.

Search & Recommendations — The Read-Optimised Path

Search and recommendations are the two features with the highest read load on Amazon. Both are served entirely from caches and search indices, never touching the main OLTP databases.

Search: Users type a query, API Gateway routes to Search Service, which queries Elasticsearch (ES). ES returns product IDs, then the service fetches product details from a local Redis cache (or falls back to catalog DB). The search index is updated asynchronously via Kafka Connect from the inventory and catalog databases. Latency target: under 100ms P99.

Recommendations: For each page load, the frontend sends user context (user ID, page category, recent searches). The Recommendation Service runs an ML model (e.g., collaborative filtering with matrix factorisation) tuned every 6 hours. Model outputs are pre-computed for each user and stored in Redis with a TTL of 12 hours. The service returns a list of product IDs, and the frontend fetches details from the same cache layer as search. Latency target: under 50ms P99.

To scale search, we use a tiered approach: popular queries are cached in a local CDN node (Varnish) with 5-minute TTL. Hot product details are in Redis with sharding across nodes. Cold products go to Elasticsearch with a larger shard count.

io/thecodeforge/search/SearchQuery.javaJSON
// TheCodeForge search query flow
// 1. Client sends query; the request hits the CDN first
// 2. CDN checks its cache for this query
// 3. Cache miss -> API Gateway routes to Search Service
// 4. Search Service queries Elasticsearch with:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "product_name": "wireless headphones" } }
      ],
      "filter": [
        { "term": { "in_stock": true } }
      ]
    }
  },
  "size": 20,
  "_source": ["product_id"]
}
// The in_stock check sits in "filter": it is a yes/no condition that should not affect relevance scoring
// 5. Get list of product IDs
// 6. Fetch full product details from Redis (batch get)
// 7. Return combined response to client
Cache Invalidation is Hard
If a product goes out of stock, it should disappear from search results quickly. Use a TTL of 5 minutes for popular queries. For inventory-driven removals, use a push model: when inventory changes, send an event to the search service to reindex that product immediately (within seconds).
Production Insight
Search index staleness kills conversion.
If a product shows as 'in stock' but actually isn't, the cart service will reject it and the customer leaves.
We solved this by having a real-time inventory sync flag: if a product's stock drops to zero, we trigger an immediate reindex before the customer can add it.
Key Takeaway
Search and recommendations are read-optimised tiers.
Cache aggressively, accept staleness in minutes, but never show an 'in stock' product that's actually out.
That's a hard requirement — it needs real-time push, not periodic polling.

Caching & CDN Strategy

Amazon's read volume is staggering — millions of requests per second for catalog pages, images, search results. Without a multi-tier caching strategy, the origin databases would collapse. The caching layers, from edge to database:

  1. CDN (CloudFront): Caches static assets (product images, CSS, JS) at edge locations. TTL of 24 hours for assets, invalidated on new uploads. For dynamic content (search results, recommendations), CDN caches only popular queries with short TTL (5 minutes).
  2. API Gateway Cache: Regional cache for identical API responses. Works well for product details that don't change often.
  3. Service-level Cache (Redis): Each service has its own Redis cluster. Catalog service caches product details by ID (LRU eviction). Cart service uses Redis for session data. Recommendation service caches precomputed user recommendations.
  4. Database Read Replicas: Aurora read replicas handle cache misses. Under sustained read pressure, add more replicas (Aurora supports up to 15 per cluster) rather than sending reads to the primary.

The design principle: the top 5% of hottest products receive 80% of traffic (Pareto). Cache those aggressively. Long-tail products are served from Elasticsearch or read replicas with lower priority.

io/thecodeforge/cache/CacheAsidePattern.javaJAVA
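package io.thecodeforge.cache;

import java.time.Duration;

public class CatalogService {
    private final RedisCache cache;
    private final ProductRepository db;
    // Assumed to expose publishProductUpdated(Product); defined elsewhere
    private final EventPublisher eventPublisher;

    public CatalogService(RedisCache cache, ProductRepository db, EventPublisher eventPublisher) {
        this.cache = cache;
        this.db = db;
        this.eventPublisher = eventPublisher;
    }

    // Read path (cache-aside): try the cache, fall back to the DB, then populate the cache
    public Product getProduct(String productId) {
        Product cached = cache.get(productId);
        if (cached != null) {
            return cached;
        }
        Product fromDb = db.findById(productId);
        if (fromDb != null) {
            cache.set(productId, fromDb, Duration.ofMinutes(10));
        }
        return fromDb;
    }

    // Write path: update the DB first, then invalidate so the next read reloads fresh data
    public void updateProduct(Product product) {
        db.save(product);
        cache.invalidate(product.getId());
        // Also send event to search service to reindex
        eventPublisher.publishProductUpdated(product);
    }
}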
Pareto Cache
  • Track access frequency per product. Promote hot items to faster cache tiers.
  • Use Redis with maxmemory-policy allkeys-lru for automatic eviction of cold items.
  • In CDN, cache popular query results but invalidate on inventory change.
  • Warm the cache before major sales events by pre-loading top products.
Production Insight
Cache stampedes kill more services than demand spikes.
When a popular product's cache entry expires and thousands of requests hit the DB simultaneously, you get a thundering herd.
Solution: use locking cache updates (setnx) or serve stale data while refreshing.
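Inside a single service instance, the same protection can be had by collapsing concurrent loads for one key into a single database call, often called single-flight or request coalescing. The sketch below is one way to do it with standard `java.util.concurrent` types; `SingleFlightLoader` is a hypothetical name, and across instances you would still want a distributed lock or stale-while-revalidate as described above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Collapses concurrent loads of the same key into one call to the underlying loader.
final class SingleFlightLoader<K, V> {
    private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    SingleFlightLoader(Function<K, V> loader) {
        this.loader = loader;
    }

    V load(K key) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Someone else is already loading this key: wait for their result.
            // join() throws CompletionException if that load failed.
            return existing.join();
        }
        try {
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e);
            throw e;
        } finally {
            // Only in-flight loads are shared; the result itself belongs in the cache
            inFlight.remove(key, mine);
        }
    }
}
```

Wrapping the database call of a cache-aside read in `load(productId)` means that when a hot key expires, one thread hits the database and the rest wait for its answer.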
Key Takeaway
Cache in layers, invalidate carefully.
The CDN protects the API gateway. The gateway cache protects the services.
A cache miss on a hot product should never cascade into a database meltdown.

Checkout Flow — From Cart to Confirmation

When the user clicks 'Place Order', this is the most critical path. Here's the real sequence:

  1. Cart Service retrieves the user's cart with strong consistency (gets latest items and their IDs).
  2. Order Service receives the checkout request and starts a saga:
     - Reserve Inventory: for each item, call Inventory Service to reserve quantity. If any item is insufficient, fail the entire order (release other reservations).
     - Process Payment: call Payment Service with the total amount and an idempotency key. The payment service interacts with the external gateway. If timeout, retry (idempotency prevents double charge).
     - Create Order: insert order record into Order DB.
     - Decrement Inventory: final decrement of reserved quantities.
     - Send to Shipping: publish order_created event to Kafka, which the Shipping Service picks up.
  3. If any step fails after payment, a compensation transaction is run: refund payment, release remaining inventory. This compensation is also idempotent.
  4. The frontend polls the Order Service for the order status (every 2 seconds until confirmed) and then redirects to the order confirmation page.

All services use asynchronous communication where possible to reduce end-to-end latency. The entire saga typically completes in under 500ms for 95% of orders.

checkout-sequence.txtTEXT
Client -> API Gateway: POST /checkout
  -> Order Service: start saga
    -> Inventory Service: reserve(item1, qty1), reserve(item2, qty2)
    <- OK all reserved
    -> Payment Service: charge(amount, idempotency_key)
    <- OK transaction_id
    -> Order Service: insert order (status: CONFIRMED)
    -> Inventory Service: decrement(item1, qty1), decrement(item2, qty2)
    -> Kafka: emit OrderCreated event
  <- 200 OK

Compensation path (if payment fails after inventory reserve):
  -> Inventory Service: releaseReservation(orderId)
  -> Order Service: update order status to FAILED
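The sequence above follows one rule: run the steps in order, and if one fails, undo the completed ones in reverse. A minimal orchestrator sketch, assuming each step is a pair of action and compensation; `SagaStep` and `SagaOrchestrator` are hypothetical names, and a production orchestrator would also persist its progress so it can resume after a crash.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// One saga step: the forward action and the compensation that undoes it.
final class SagaStep {
    final String name;
    final Runnable action;
    final Runnable compensation;

    SagaStep(String name, Runnable action, Runnable compensation) {
        this.name = name;
        this.action = action;
        this.compensation = compensation;
    }
}

final class SagaOrchestrator {
    private SagaOrchestrator() {}

    /** Runs steps in order; on failure compensates completed steps in reverse. Returns true on success. */
    static boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.action.run();
                completed.push(step);
            } catch (RuntimeException failure) {
                // The failed step did not complete, so only earlier steps are undone.
                // Compensations are assumed to be idempotent and not to throw.
                while (!completed.isEmpty()) {
                    completed.pop().compensation.run();
                }
                return false;
            }
        }
        return true;
    }
}
```

For checkout the list would be reserve/release, charge/refund, createOrder/cancel, and so on. If the charge step throws, only the reservation is released, which matches the compensation path shown above.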
Watch for Partial Failures
The inventory service might be asked to reserve 5 items but only 3 succeed. The saga must handle partial reservation by rolling back all of them; don't leave items reserved indefinitely. Use a timeout as well: if the saga doesn't complete in 30 seconds, abort it and release its reservations, with reservation expiry as the backstop if the orchestrator itself has died.
Production Insight
The biggest operational pain is zombie reservations from failed checkouts.
If the order service crashes halfway through the saga, inventory is reserved but never released.
We use a background job that runs every minute and expires reservations older than 15 minutes.
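The cleanup job itself is small. Here is a sketch of the expiry logic, assuming reservations are tracked by order ID with a creation time; `ReservationSweeper` is a hypothetical name and the in-memory map stands in for a reservations table. Passing `now` in as a parameter keeps the logic testable without waiting 15 minutes.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Tracks open reservations and expires the ones that outlive maxAge.
final class ReservationSweeper {
    private final Map<String, Instant> createdAtByOrder = new ConcurrentHashMap<>();
    private final Duration maxAge;

    ReservationSweeper(Duration maxAge) {
        this.maxAge = maxAge;
    }

    void recordReservation(String orderId, Instant createdAt) {
        createdAtByOrder.put(orderId, createdAt);
    }

    /** Called when the saga finishes normally, so the sweeper leaves the order alone. */
    void markCompleted(String orderId) {
        createdAtByOrder.remove(orderId);
    }

    /** Removes and returns the order IDs whose reservation is older than maxAge as of now. */
    List<String> expireStale(Instant now) {
        Instant cutoff = now.minus(maxAge);
        List<String> expired = new ArrayList<>();
        createdAtByOrder.forEach((orderId, createdAt) -> {
            // remove(key, value) only succeeds if the entry is unchanged, so a
            // reservation completed or re-created mid-sweep is not expired by mistake
            if (createdAt.isBefore(cutoff) && createdAtByOrder.remove(orderId, createdAt)) {
                expired.add(orderId);
            }
        });
        return expired;
    }
}
```

A scheduler would call `expireStale(Instant.now())` once a minute and release inventory for each returned order ID. The release must be idempotent, because the saga's own compensation may have run already.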
Key Takeaway
Checkout is a saga, not a transaction.
Each step must be idempotent and compensatable.
Build timeout-driven cleanup for zombie reservations — they'll eat your inventory.
● Production incident · POST-MORTEM · severity: high

S3 Outage That Took Down Amazon.com

Symptom
Amazon.com returned 503 errors for both search and checkout. Customer support immediately flooded with complaints.
Assumption
The team assumed the issue was a traffic spike or a gradual performance degradation.
Root cause
An engineer debugging slowness in the S3 billing system mistyped a command and removed far more servers than intended, including capacity behind core S3 subsystems in the affected region. Since the retail site depends on S3 for product images and static assets, the CDN had nothing to serve on cache misses, and the frontend couldn't render pages.
Fix
The S3 team restored the removed capacity and restarted the affected subsystems, which took hours at that scale. Amazon.com recovered as S3 came back.
Key lesson
  • Blast radius: any admin command on shared infrastructure can take down unrelated services. Always use change management and runbooks.
  • Defense in depth: the frontend should degrade gracefully when static assets are unavailable — show text-only product descriptions instead of failing entirely.
  • Monitoring: alarm on sudden capacity loss in critical storage systems, not just traffic drops.
Production debug guide · Common symptoms and the actions that actually fix them · 3 entries
Symptom · 01
Customer sees 'Item out of stock' after adding to cart and proceeding to checkout
Fix
Check inventory service logs for reservation success. If inventory service is healthy, verify cart's item list matches the latest inventory snapshot — inventory might have been decremented for another order between cart add and checkout. Implement optimistic locking with version numbers.
Symptom · 02
Payment gateway returns 500 on payment attempt
Fix
Check payment service idempotency key. The payment gateway might have processed the charge already and returned a timeout. The payment service should retry with the same idempotency key to avoid duplicate charges. Also verify that the payment service's circuit breaker hasn't tripped.
Symptom · 03
Order confirmation page times out, but order eventually appears
Fix
The order service likely uses asynchronous processing (queue + worker). Check the order queue depth, worker health, and the saga orchestrator status. The frontend should poll an order status endpoint instead of waiting for a synchronous response.
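The optimistic locking suggested for Symptom 01 can be sketched with a version number and a compare-and-set. This is an in-memory illustration with hypothetical names (`StockSnapshot`, `VersionedStock`); against a database the equivalent is an update guarded by `WHERE version = ?`, followed by a check on the affected row count.

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable view of a stock row: the quantity plus the version it was read at.
final class StockSnapshot {
    final int quantity;
    final long version;

    StockSnapshot(int quantity, long version) {
        this.quantity = quantity;
        this.version = version;
    }
}

final class VersionedStock {
    private final AtomicReference<StockSnapshot> current;

    VersionedStock(int initialQuantity) {
        this.current = new AtomicReference<>(new StockSnapshot(initialQuantity, 0L));
    }

    StockSnapshot read() {
        return current.get();
    }

    /**
     * Applies the decrement only if nobody has written since `expected` was read.
     * `expected` must be the object returned by read(); the comparison is by reference.
     */
    boolean decrementIfUnchanged(StockSnapshot expected, int quantity) {
        if (expected.quantity < quantity) {
            return false;
        }
        StockSnapshot next = new StockSnapshot(expected.quantity - quantity, expected.version + 1);
        return current.compareAndSet(expected, next);
    }
}
```

If two checkouts read the same snapshot of the last unit, only the first write succeeds. The loser gets `false`, re-reads, and sees the real stock, which is where the customer-facing 'out of stock' message should come from.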
★ Latency Spikes in Product Search · When search response time jumps from 50ms to 2s at peak, here's the fast playbook.
P99 search latency >1s
Immediate action
Check Elasticsearch cluster CPU and GC activity via Kibana/Elasticsearch monitoring dashboard.
Commands
GET _cluster/health
GET _nodes/stats?level=indices
Enable the search slow log on the affected index (it is an index-level setting): PUT /<index>/_settings { "index.search.slowlog.threshold.query.warn": "500ms" }
Fix now
If GC pressure >20% of CPU: add more data nodes or increase heap. If due to hot shards: split the overloaded index or redistribute shards with shard allocation filtering.
Search results are stale (5+ minutes behind inventory changes)
Immediate action
Check the CDC pipeline from the inventory database to Elasticsearch. Look at Kafka consumer lag or Debezium connector status.
Commands
kafka-consumer-groups --bootstrap-server localhost:9092 --group inventory-search-sync --describe
Check Debezium connector status: GET /connectors/inventory-connector/status
Fix now
If consumer lag is high, restart the consumer with increased parallelism. If connector failed, restart Debezium and re-snapshot from a recent offset.
Amazon's Core Services: Data Store Choice & Trade-offs
Service | Database | Consistency Model | Key Trade-off
Product Catalog | Aurora (MySQL-compatible) + Redis cache | Eventual for reads; strong for writes (admin updates) | High read throughput vs. stale cache; cache invalidation complexity
Inventory | Aurora (with row-level locking) | Strong consistency (serializable isolation) | Throughput limited by lock contention; shard by product ID to scale
Cart | DynamoDB | Eventual for add/remove; strong for checkout reads | Low latency at scale vs. occasional stale cart items (rarely an issue)
Order History | Aurora | Strong after order creation (user expects immediate visibility) | Write throughput bottleneck; use write sharding by customer ID region
Search | Elasticsearch | Eventual (seconds of staleness) | Search accuracy vs. indexing latency; need real-time sync for inventory changes

Key takeaways

  1. Decompose services by data ownership: each service owns its database exclusively.
  2. Consistency is not one-size-fits-all: use strong consistency for inventory and payments, eventual for everything else.
  3. Idempotency is non-negotiable for any payment or reservation operation: it prevents duplicates and enables safe retries.
  4. Caching is your best friend for read-heavy services, but you must plan for invalidation and thundering herds.
  5. Saga patterns handle distributed transactions without two-phase commit, but require careful compensation logic.
  6. Design for peak traffic, not average. Prime Day can spike 10x above normal.
  7. Every external call can fail. Build idempotent retries and graceful degradation into every service.

Common mistakes to avoid

Designing for strong consistency everywhere

Symptom
High latency on every read, frequent timeout errors, database contention causing deadlocks in the order service.
Fix
Identify which operations absolutely need strong consistency (inventory reserve, payment). For everything else, use eventual consistency with caching and idempotent writes.

Treating the cart as a simple key-value store without conflict resolution

Symptom
Customers see stale cart items, items disappearing after adding them on another device, or duplicate items after concurrent adds.
Fix
Treat the cart as a mergeable structure: keep a timestamp (or version) per item and resolve conflicts per item with last-write-wins, which is a simple CRDT. Or, simpler: let the cart service use DynamoDB conditional writes to update items, and on conflict, the latest timestamp wins.

Not planning for idempotency in payment processing

Symptom
Duplicate charges when users retry payment after a timeout, leading to chargebacks and customer complaints.
Fix
Always include an idempotency key (e.g., a random UUID stored with the order) in every payment request. The payment gateway must return the same response for the same key, and your service must not proceed with a second charge if the first is in progress.

Building an unbounded cache without an eviction policy

Symptom
Redis runs out of memory, evicts all keys including critical session data, causing login failures and cart losses.
Fix
Always configure maxmemory and an eviction policy (allkeys-lru for product cache, volatile-ttl for session data). Monitor memory usage and set alerts at 80% capacity.

Building a monolith and decomposing too late

Symptom
Deployments become slow and risky, teams step on each other's code, and scaling one component means scaling the whole application.
Fix
Split by data boundary early. Start with 3-4 services (product, cart, order, payment) and add more as the team grows. Don't wait for the pain to become unbearable.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: How would you design the product catalog service to handle 200M daily ac...
Q02 · SENIOR: How would you ensure inventory consistency without sacrificing availabil...
Q03 · SENIOR: Design the order processing pipeline from when a user clicks 'Place Orde...
Q04 · SENIOR: How would you handle the situation where the external payment gateway re...
Q05 · SENIOR: How would you monitor and debug a sudden spike in checkout failures?
Q01 of 05 · SENIOR

How would you design the product catalog service to handle 200M daily active users with a 100:1 read-to-write ratio?

ANSWER
Given the read-heavy workload, I'd use a single-leader architecture with read replicas. Writes go to Aurora (primary), which replicates asynchronously to multiple read replicas. For reads, use a cache-aside pattern with Redis (multi-AZ for availability). The cache stores the top 5% of hottest products by access frequency (Pareto principle). For cache misses, fetch from a read replica and populate the cache. To handle spikes, add a CDN in front for static product images. Warm the cache before major sales events. For write throughput, product updates are admin-only and batched, so a single Aurora primary with proper indexing is sufficient.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What is Design Amazon in simple terms?
02. Why do we need separate services for cart and inventory?
03. How does Amazon handle the 'out of stock' race condition?
04. Can you build a similar architecture with open-source tools?
05. What is the most common mistake in designing Amazon-like systems?