Senior 6 min · June 25, 2026

Design Object Storage (S3): Build a Petabyte-Scale System That Doesn't Fall Over

Q: How do I design an object storage system like Amazon S3?

Start with a flat key namespace, partition keys using consistent hashing, replicate objects across failure domains, and use a strongly consistent metadata store. The hard parts are consistency, failure recovery, and performance at scale.

Q: What's the difference between object storage and block storage?

Object storage stores data as objects with a key, accessed via HTTP. Block storage stores data in fixed-size blocks, accessed as a mounted drive. Object storage is cheaper and scales better but has higher latency.

Q: How do I handle concurrent writes to the same object in object storage?

Use last-writer-wins with timestamps or version numbers. For stronger consistency, use conditional writes (compare-and-swap) on the metadata store. AWS S3 uses a last-writer-wins model.

Q: What happens when a storage node fails in an object storage cluster?

The failure detector marks the node as dead. A repair coordinator replicates objects from replicas to new nodes. During repair, reads are served from remaining replicas. Writes continue to other nodes.

Design object storage (S3) from scratch.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Design object storage by partitioning the key space, replicating objects across failure domains, and using a metadata store to map keys to locations. The hard part is consistency under concurrent writes and recovery from node failures.

✦ Definition~90s read

What is Design Object Storage (S3)?

Object storage is a flat namespace of immutable objects identified by a key, stored across distributed nodes with built-in redundancy. Unlike block or file storage, objects are accessed via HTTP, not mounted filesystems.

★

Think of object storage like a giant warehouse with numbered shelves.

Plain-English First

Think of object storage like a giant warehouse with numbered shelves. Each shelf holds a box (the object) with a label (the key). You don't care which shelf your box is on — you just tell the warehouse manager the label, and they fetch it. The manager keeps a catalog (metadata store) of where every box is. If a shelf breaks, the manager copies a spare box from another shelf. That's object storage: flat, scalable, and self-healing.

Everyone thinks object storage is just PUT and GET over HTTP. Then they try to serve 10,000 concurrent requests and the whole thing implodes. I've seen a startup's S3-compatible service fall over at 500 requests per second because they used a single PostgreSQL table for metadata. Don't be that team.

Object storage solves the problem of storing massive amounts of unstructured data — backups, media, logs — without worrying about filesystem limits or RAID arrays. It's the backbone of cloud storage, but building your own means wrestling with consistency, durability, and performance at scale.

By the end of this, you'll be able to design an object storage system that handles petabyte-scale data, survives node failures without data loss, and serves requests with predictable latency. You'll know exactly where the bottlenecks hide and how to kill them.

The Core Abstraction: Keys, Objects, and Flat Namespace

Object storage is deceptively simple. You have a key (a string, often a URL path) and an object (the data blob plus metadata). The namespace is flat — no directories, just keys. But flat doesn't mean simple. The key is the only way to locate an object, so key design is critical. Prefix-based partitioning (e.g., first 4 hex chars) is common. But if your keys are sequential (like timestamps), you'll hotspot one partition. Use hash prefixes or reverse the key to distribute load.

In production, you'll store metadata separately from the data. Metadata (key, size, checksum, location) goes in a fast, consistent store like etcd or a distributed SQL database. Data goes on cheap, high-capacity disks. This separation lets you scale metadata and data independently.

KeyPartitioning.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Partition key space into 256 buckets using first byte of SHA256 hash
func partition(key string) int {
    hash := sha256.Sum256([]byte(key))
    return int(hash[0]) // 0-255
}

// Store object metadata in etcd with key as etcd key
func putMetadata(key string, location string, size int64) error {
    // Use etcd transaction to ensure atomicity
    txn := etcd.Txn().If(etcd.Compare(etcd.ModRevision(key), "=", 0)).Then(etcd.Put(key, location))
    _, err := txn.Commit()
    return err
}

Output

No direct output. This is a design pattern, not a runnable program.

Production Trap: Sequential Keys

Using timestamps as keys (e.g., '2024-01-01/object') will hammer a single partition. Always hash or reverse the key. I've seen a cluster melt because all writes went to one node.

thecodeforge.io

Object Storage (S3) Architecture Overview

Design Object Storage S3

Data Placement and Replication: Where Does the Object Live?

Once you have the key, you need to decide which storage nodes hold the object. The naive approach is a central location service — but that's a single point of failure and a bottleneck. Instead, use consistent hashing with virtual nodes. Each node gets multiple points on the hash ring. When you write an object, you hash the key and find the next N nodes on the ring (the 'preference list'). Replicate the object to those N nodes.

Why virtual nodes? They spread load evenly even when nodes have different capacities. A node with 2TB gets twice as many virtual nodes as a 1TB node. When a node fails, its load is redistributed evenly across all other nodes, not just its neighbor.

In production, N=3 is standard for durability. But watch out: if you replicate synchronously, write latency is the max of the three nodes. Asynchronous replication is faster but risks data loss on node failure before replication completes.

ConsistentHashRing.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Consistent hash ring with virtual nodes
class Ring {
    struct VNode { nodeID string; hash uint64 }
    vnodes []VNode // sorted by hash

    func (r *Ring) AddNode(nodeID string, weight int) {
        for i := 0; i < weight*100; i++ {
            h := hash(nodeID + strconv.Itoa(i))
            r.vnodes = append(r.vnodes, VNode{nodeID, h})
        }
        sort.Slice(r.vnodes, func(i, j int) bool { return r.vnodes[i].hash < r.vnodes[j].hash })
    }

    func (r *Ring) GetNodes(key string, count int) []string {
        h := hash(key)
        idx := sort.Search(len(r.vnodes), func(i int) bool { return r.vnodes[i].hash >= h })
        var result []string
        seen := map[string]bool{}
        for len(result) < count {
            v := r.vnodes[idx%len(r.vnodes)]
            if !seen[v.nodeID] {
                result = append(result, v.nodeID)
                seen[v.nodeID] = true
            }
            idx++
        }
        return result
    }
}

Output

No direct output. This is a design pattern.

Senior Shortcut: Replication Factor

Always make replication factor configurable per bucket. Some data needs N=5 (financial logs), some can tolerate N=2 (cached thumbnails). Don't hardcode N=3.

thecodeforge.io

Object Write Path

Design Object Storage S3

Metadata Store: The Brain of the System

The metadata store maps keys to object locations, sizes, checksums, and version info. It must be strongly consistent — if you PUT an object and immediately GET it, you should get the new version. That rules out eventually consistent stores like Cassandra (unless you use quorum reads, which are slow).

Production choices: etcd (for small clusters, <10K ops/sec), CockroachDB (for geo-distributed), or a custom sharded MySQL with Vitess. The key insight: metadata operations are small (a few KB), so you can afford strong consistency. Data operations are large (MB to TB), so they use eventual consistency with checksum verification.

Never store object data in the metadata store. That's how you run out of space and crash the cluster. I've seen a team store 100MB objects as etcd values — etcd has a 1.5MB limit per key. They learned the hard way.

MetadataStore.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Metadata schema (CockroachDB)
CREATE TABLE objects (
    key STRING PRIMARY KEY,
    size INT8 NOT NULL,
    checksum STRING NOT NULL, -- SHA256 hex
    locations STRING[] NOT NULL, -- list of node IDs
    version INT8 NOT NULL DEFAULT 1,
    created_at TIMESTAMP NOT NULL DEFAULT now()
);

// Atomic compare-and-swap for conditional PUT
UPDATE objects SET version = version + 1, locations = $2
WHERE key = $1 AND version = $3;

Output

No direct output. This is a schema design.

Never Do This: Storing Data in Metadata

Metadata stores are not blob stores. Keep metadata small (<10KB per entry). Use them only for pointers and attributes.

Consistency Models: Strong vs. Eventual — Pick Your Poison

Object storage often promises 'read-after-write consistency' for new objects, but 'eventual consistency' for overwrites. Why? Because new objects are easy — you write metadata and data, and subsequent reads see the new metadata. Overwrites are hard: you have to coordinate updating metadata across replicas.

In practice, use a last-writer-wins (LWW) strategy with wall clocks. Each replica stores a version timestamp. When reading, the client fetches from all replicas and picks the latest version. This is simple but relies on clock synchronization. NTP skew can cause stale reads. Mitigation: use logical clocks (vector clocks) or monotonic clocks (like Google's TrueTime).

For strong consistency, use a quorum-based approach: write to all N replicas, read from at least one. But this increases latency. I've seen teams use strong consistency for metadata and eventual for data — best of both worlds.

LWWReconciliation.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Last-writer-wins reconciliation on read
func (n *Node) GetObject(key string) ([]byte, error) {
    replicas := ring.GetNodes(key, 3)
    var best []byte
    var bestTime int64
    for _, r := range replicas {
        data, ts, err := n.fetchFromPeer(r, key)
        if err != nil { continue }
        if ts > bestTime {
            best = data
            bestTime = ts
        }
    }
    if best == nil { return nil, ErrNotFound }
    return best, nil
}

Output

No direct output. This is a design pattern.

Interview Gold: Clock Skew

Interviewers love asking about clock skew. Answer: use monotonic clocks or vector clocks. TrueTime gives tight error bounds but requires special hardware.

Handling Failures: Replication, Repair, and Rebalancing

Nodes fail. Disks die. Network partitions happen. Your system must detect failures and repair automatically. Use a gossip protocol for failure detection — each node periodically pings a random subset of others. If a node is unreachable for T seconds (e.g., 30s), mark it as 'suspected'. After another T seconds, mark it as 'dead' and trigger repair.

Repair means replicating the objects that were on the dead node to other nodes. The metadata store knows which objects had replicas on that node. A repair coordinator picks new nodes from the ring and copies the data. This is bandwidth-intensive — throttle repair to avoid saturating the network.

Rebalancing happens when you add or remove nodes. The ring changes, so some objects need to move. Use a 'virtual node' approach to minimize movement: only objects whose hash falls between the old and new node positions need to move. In practice, about 1/N of objects move when adding a node.

FailureDetection.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Gossip-based failure detector (simplified)
type Node struct {
    ID string
    heartbeat int64
    suspected bool
}

func (n *Node) Gossip(peers []*Node) {
    for _, peer := range peers {
        if peer.heartbeat < time.Now().Unix()-30 {
            peer.suspected = true
        }
        if peer.suspected && peer.heartbeat < time.Now().Unix()-60 {
            // Declare dead, trigger repair
            repairCoordinator.Repair(peer.ID)
        }
    }
}

Output

No direct output.

Production Trap: Repair Storm

When a node fails, all its replicas need repair simultaneously. This can saturate network and disk I/O, causing cascading failures. Throttle repair to 10% of bandwidth.

Performance: Latency, Throughput, and Hotspots

Object storage performance is dominated by network and disk I/O. For small objects (<1MB), latency is network-bound. For large objects (>100MB), throughput is disk-bound. Use SSDs for metadata nodes, HDDs for data nodes. But even HDDs can saturate with enough concurrent requests — stripe across multiple disks per node.

Hotspots happen when a popular object is requested thousands of times per second. Solution: cache popular objects at the edge (CDN) or in an in-memory cache like Redis. But caching large objects is expensive — use a bloom filter to decide what to cache.

Another hotspot: a single partition gets too many writes. Mitigation: split the partition into smaller sub-partitions when it exceeds a threshold (e.g., 10K objects). This is called 'splitting' in DynamoDB terms.

HotspotMitigation.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Bloom filter for cache eligibility
cacheFilter := bloom.New(1000000, 5) // 1M items, 5 hash functions
if cacheFilter.Test(key) {
    // Serve from cache
} else {
    // Serve from storage, add to cache if popular
    if requestCount[key] > 100 {
        cacheFilter.Add(key)
    }
}

Output

No direct output.

Senior Shortcut: CDN for Hot Objects

Don't build your own cache for hot objects. Use a CDN like CloudFront or Cloudflare. They handle scale better than you ever will.

Security: Authentication, Authorization, and Encryption

Object storage is often exposed via HTTP APIs. You need to authenticate requests — typically using HMAC signatures (like AWS Signature V4). The client signs the request with a secret key, and the server verifies the signature. This prevents tampering and replay attacks.

Authorization is done via access control lists (ACLs) or bucket policies. Each bucket has a policy that specifies who can read/write. Policies are evaluated on every request — cache them to avoid hitting the metadata store.

Encryption: data at rest should be encrypted using AES-256. Either server-side (the storage node encrypts before writing to disk) or client-side (the client encrypts before sending). Server-side is easier but you manage the keys. Use a key management service (KMS) to rotate keys regularly.

Never store encryption keys in the same cluster as the data. That's like locking your house and leaving the key under the mat.

HMACAuth.systemdesignSYSTEMDESIGN

// io.thecodeforge — System Design tutorial

// Verify AWS Signature V4 (simplified)
func verifySignature(secretKey string, r *http.Request) bool {
    // Recompute signature from request headers
    canonicalRequest := buildCanonicalRequest(r)
    stringToSign := "AWS4-HMAC-SHA256\n" + r.Header.Get("X-Amz-Date") + "\n" + canonicalRequest
    expectedSig := hmacSHA256(secretKey, stringToSign)
    return expectedSig == r.Header.Get("X-Amz-Signature")
}

Output

No direct output.

Never Do This: Hardcoded Keys

I've seen production code with AWS keys hardcoded in a config file committed to Git. Use a secrets manager like Vault or AWS Secrets Manager.

When Not to Use Object Storage

Object storage is not a filesystem. Don't use it for databases, logs that need tailing, or applications that require random writes. The latency is too high (tens of milliseconds) compared to block storage (sub-millisecond). Also, object storage is terrible for small files — the overhead of HTTP requests and metadata lookups kills throughput. If you have billions of tiny files (<1KB), use a key-value store like Redis or Cassandra instead.

Another anti-pattern: using object storage as a message queue. SQS exists for a reason. Polling for new objects is inefficient and introduces latency.

For high-throughput sequential writes (like video streaming), object storage works well. For random-access workloads, use block storage.

Interview Gold: When Not to Use S3

Interviewers love asking when object storage is a bad fit. Answer: small files, high IOPS, low latency, or random writes. Always have a counterexample ready.

thecodeforge.io

Object Storage vs. Filesystem

Design Object Storage S3

● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom

A storage node crashed every 6 hours. Logs showed 'OutOfMemoryError: Java heap space' but the heap was only 60% used.

Assumption

Memory leak in the object replication code.

Root cause

The node was buffering entire objects in memory during replication. A 4GB object caused the JVM to exceed its container memory limit (4GB) and get OOM-killed by the kernel.

Fix

Changed replication to stream objects in 1MB chunks instead of loading them entirely. Also added a per-node memory limit for replication buffers: 512MB max.

Key lesson

Never assume your code handles large objects gracefully.
Always stream, never buffer.

Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries

Symptom · 01

PUT requests timeout after 30 seconds

→

Fix

1. Check disk I/O on storage nodes with iostat -x 1. 2. If %util > 90%, add nodes or throttle uploads. 3. Check network latency between client and storage nodes. 4. If >100ms, move storage nodes closer to clients.

Symptom · 02

GET requests return 404 for existing objects

→

Fix

1. Query metadata store directly to verify key exists. 2. If missing, run reconciliation job: scan all storage nodes and rebuild metadata. 3. If metadata exists but data missing, restore from replica or backup.

Symptom · 03

Cluster health check shows nodes flapping

→

Fix

1. Check network latency between nodes with ping. 2. If >50ms, increase failure detection timeout from 30s to 60s. 3. Check for resource exhaustion (CPU, memory, disk). 4. If OK, check for network partitions or firewall drops.

★ Design Object Storage (S3) Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.

`connection refused` on storage node−

Immediate action

Check if node process is running

Commands

systemctl status storage-node

journalctl -u storage-node -n 50

Fix now

Restart node: systemctl restart storage-node

`disk full` errors+

`metadata store timeout`+

`replication lag` alerts+

Feature / Aspect	Object Storage (S3)	Block Storage (EBS)
Access pattern	HTTP GET/PUT	Block device (mountable)
Latency	10-100ms	<1ms
Max object size	5TB (single object)	16TB (volume)
Consistency	Read-after-write (new), eventual (overwrite)	Strong
Cost per GB	$0.023/GB	$0.10/GB
Use case	Backups, media, logs	Databases, OS volumes

Key takeaways

Object storage is a flat namespace of keys and objects

no directories, just hash-based partitioning.

Separate metadata (strongly consistent) from data (eventually consistent) to scale independently.

Use consistent hashing with virtual nodes for even load distribution and minimal data movement on topology changes.

Never buffer entire objects in memory

stream in chunks to avoid OOM kills on large objects.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

How does consistent hashing handle node additions without massive data m...

Q02SENIOR

When would you choose eventual consistency over strong consistency for o...

Q03SENIOR

What happens when a storage node fails during a write operation, and how...

Q04JUNIOR

What is object storage and how does it differ from block storage?

Q05SENIOR

A node is marked as dead but is actually alive. How do you diagnose and ...

Q06SENIOR

How would you design object storage to serve 1 billion objects with 99.9...

Q01 of 06SENIOR

How does consistent hashing handle node additions without massive data movement?

ANSWER

Consistent hashing with virtual nodes ensures only a fraction of keys (roughly 1/N) need to move when a node is added or removed. The hash ring is cut at the new node's position; only keys between the new node and its successor move.

FAQ · 4 QUESTIONS

Frequently Asked Questions

How do I design an object storage system like Amazon S3?

What's the difference between object storage and block storage?

How do I handle concurrent writes to the same object in object storage?

What happens when a storage node fails in an object storage cluster?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.

✓ Verified

production tested

June 25, 2026

last updated

1,663

articles · all by Naren

🔥

That's Real World. Mark it forged?

6 min read · try the examples if you haven't