Senior 6 min · June 25, 2026

Design Object Storage (S3): Build a Petabyte-Scale System That Doesn't Fall Over

Design object storage (S3) from scratch.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.

Last updated: June 25, 2026
Quick Answer

Design object storage by partitioning the key space, replicating objects across failure domains, and using a metadata store to map keys to locations. The hard part is consistency under concurrent writes and recovery from node failures.

Definition
What is Design Object Storage (S3)?

Object storage is a flat namespace of immutable objects identified by a key, stored across distributed nodes with built-in redundancy. Unlike block or file storage, objects are accessed via HTTP, not mounted filesystems.

Plain-English First

Think of object storage like a giant warehouse with numbered shelves. Each shelf holds a box (the object) with a label (the key). You don't care which shelf your box is on — you just tell the warehouse manager the label, and they fetch it. The manager keeps a catalog (metadata store) of where every box is. If a shelf breaks, the manager copies a spare box from another shelf. That's object storage: flat, scalable, and self-healing.

Everyone thinks object storage is just PUT and GET over HTTP. Then they try to serve 10,000 concurrent requests and the whole thing implodes. I've seen a startup's S3-compatible service fall over at 500 requests per second because they used a single PostgreSQL table for metadata. Don't be that team.

Object storage solves the problem of storing massive amounts of unstructured data — backups, media, logs — without worrying about filesystem limits or RAID arrays. It's the backbone of cloud storage, but building your own means wrestling with consistency, durability, and performance at scale.

By the end of this, you'll be able to design an object storage system that handles petabyte-scale data, survives node failures without data loss, and serves requests with predictable latency. You'll know exactly where the bottlenecks hide and how to kill them.

The Core Abstraction: Keys, Objects, and Flat Namespace

Object storage is deceptively simple. You have a key (a string, often a URL path) and an object (the data blob plus metadata). The namespace is flat: no directories, just keys. But flat doesn't mean simple. The key is the only way to locate an object, so key design is critical. A common scheme partitions on a short prefix of the key's hash (e.g., the first 4 hex characters). Partitioning on the raw key instead will hotspot one partition whenever keys are sequential (like timestamps), so use hash prefixes or reverse the key to distribute load.

In production, you'll store metadata separately from the data. Metadata (key, size, checksum, location) goes in a fast, consistent store like etcd or a distributed SQL database. Data goes on cheap, high-capacity disks. This separation lets you scale metadata and data independently.

KeyPartitioning.systemdesign
// io.thecodeforge — System Design tutorial

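package metadata

import (
	"context"
	"crypto/sha256"
	"errors"
	clientv3 "go.etcd.io/etcd/client/v3"
)

// partition maps a key to one of 256 buckets using the first byte of its SHA-256 hash.
func partition(key string) int {
	hash := sha256.Sum256([]byte(key))
	return int(hash[0]) // 0-255
}

// putMetadata stores the location under the object key, but only if the key does not exist yet.
func putMetadata(ctx context.Context, cli *clientv3.Client, key, location string, size int64) error {
	// ModRevision == 0 means the key does not exist, so the Put is create-only and atomic.
	resp, err := cli.Txn(ctx).If(clientv3.Compare(clientv3.ModRevision(key), "=", 0)).Then(clientv3.OpPut(key, location)).Commit()
	if err != nil {
		return err
	}
	if !resp.Succeeded { // the comparison failed: another writer created the key first
		return errors.New("metadata already exists for key: " + key)
	}
	return nil
}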
Output
No direct output. This is a design pattern, not a runnable program.
Production Trap: Sequential Keys
Using timestamps as keys (e.g., '2024-01-01/object') will hammer a single partition. Always hash or reverse the key. I've seen a cluster melt because all writes went to one node.
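To make the trap concrete, here is a minimal sketch that counts how many partitions a batch of timestamp-style keys lands in under two schemes. The function names (rawPrefixPartition, hashPartition, spread, timestampKeys) and the key format are illustrative assumptions, not part of any real system.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// rawPrefixPartition buckets by the first byte of the key itself.
// Hypothetical naive scheme, shown only for contrast.
func rawPrefixPartition(key string) int {
	if key == "" {
		return 0
	}
	return int(key[0])
}

// hashPartition buckets by the first byte of the key's SHA-256 hash (0-255).
func hashPartition(key string) int {
	sum := sha256.Sum256([]byte(key))
	return int(sum[0])
}

// spread counts how many distinct partitions a set of keys lands in.
func spread(keys []string, partition func(string) int) int {
	used := map[int]bool{}
	for _, k := range keys {
		used[partition(k)] = true
	}
	return len(used)
}

// timestampKeys generates n sequential keys such as "2024-01-01/object-0042".
func timestampKeys(n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = fmt.Sprintf("2024-01-01/object-%04d", i)
	}
	return keys
}

func main() {
	keys := timestampKeys(1000)
	fmt.Println("partitions used, raw prefix:", spread(keys, rawPrefixPartition))
	fmt.Println("partitions used, hashed:    ", spread(keys, hashPartition))
}
```

Every key starts with the same character, so the raw-prefix scheme sends all 1,000 writes to a single partition, while the hashed scheme scatters them across most of the 256 buckets.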
Object Storage (S3) Architecture Overview THECODEFORGE.IO Object Storage (S3) Architecture Overview Core components and data flow for petabyte-scale object storage Keys, Objects, Flat Namespace Unique key per object, no hierarchy Data Placement & Replication Distribute replicas across nodes/racks Metadata Store Central brain: maps keys to locations Consistency Model Strong or eventual — trade-offs Failure Handling Replication, repair, rebalance Performance & Security Latency, throughput, auth, encryption ⚠ Eventual consistency can cause stale reads Use strong consistency for critical metadata THECODEFORGE.IO
thecodeforge.io
Object Storage (S3) Architecture Overview
Design Object Storage S3

Data Placement and Replication: Where Does the Object Live?

Once you have the key, you need to decide which storage nodes hold the object. The naive approach is a central location service — but that's a single point of failure and a bottleneck. Instead, use consistent hashing with virtual nodes. Each node gets multiple points on the hash ring. When you write an object, you hash the key and find the next N nodes on the ring (the 'preference list'). Replicate the object to those N nodes.

Why virtual nodes? They spread load evenly even when nodes have different capacities. A node with 2TB gets twice as many virtual nodes as a 1TB node. When a node fails, its load is redistributed evenly across all other nodes, not just its neighbor.

In production, N=3 is standard for durability. But watch out: if you replicate synchronously, write latency is that of the slowest of the three nodes. Asynchronous replication is faster but risks data loss if a node fails before replication completes.
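The latency trade-off can be sketched by making the number of acknowledgements to wait for a parameter. This is a simplified model, not a real replication protocol: replicaWrite, simulatedReplica, and writeWithAcks are hypothetical names, and the "replicas" are just functions that sleep.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// replicaWrite stands in for sending the object to one storage node.
type replicaWrite func() error

// simulatedReplica returns a write that takes `delay` and optionally fails.
func simulatedReplica(delay time.Duration, fail bool) replicaWrite {
	return func() error {
		time.Sleep(delay)
		if fail {
			return errors.New("replica unavailable")
		}
		return nil
	}
}

// writeWithAcks starts all replica writes in parallel and returns as soon as
// `required` of them succeed. required == len(writes) is fully synchronous;
// a smaller value acknowledges early and lets the rest finish in the background.
func writeWithAcks(writes []replicaWrite, required int) error {
	results := make(chan error, len(writes)) // buffered so late writers never block
	for _, w := range writes {
		go func(w replicaWrite) { results <- w() }(w)
	}
	acks, failures := 0, 0
	for range writes {
		if err := <-results; err != nil {
			failures++
		} else {
			acks++
		}
		if acks >= required {
			return nil
		}
		if failures > len(writes)-required {
			break // too many failures to ever reach the required count
		}
	}
	return errors.New("not enough replicas acknowledged the write")
}

func main() {
	for _, required := range []int{3, 2, 1} {
		writes := []replicaWrite{
			simulatedReplica(5*time.Millisecond, false),
			simulatedReplica(10*time.Millisecond, false),
			simulatedReplica(50*time.Millisecond, false),
		}
		start := time.Now()
		err := writeWithAcks(writes, required)
		fmt.Printf("acks required=%d err=%v elapsed=%v\n",
			required, err, time.Since(start).Round(time.Millisecond))
	}
}
```

Waiting for all three acknowledgements costs roughly the latency of the slowest replica; waiting for fewer returns sooner but leaves a window in which the acknowledged write exists on fewer nodes.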

ConsistentHashRing.systemdesign
// io.thecodeforge — System Design tutorial

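package ring

import (
	"crypto/sha256"
	"encoding/binary"
	"sort"
	"strconv"
)

// VNode is one virtual node: a point on the ring owned by a physical node.
type VNode struct {
	nodeID string
	hash   uint64
}

// Ring is a consistent hash ring with virtual nodes.
type Ring struct {
	vnodes []VNode // sorted by hash
}

// hashKey is an assumed hash function: the first 8 bytes of SHA-256.
func hashKey(s string) uint64 {
	sum := sha256.Sum256([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

// AddNode places weight*100 virtual nodes for nodeID on the ring.
func (r *Ring) AddNode(nodeID string, weight int) {
	for i := 0; i < weight*100; i++ {
		// The "#" separator keeps "node1"+"10" distinct from "node11"+"0".
		h := hashKey(nodeID + "#" + strconv.Itoa(i))
		r.vnodes = append(r.vnodes, VNode{nodeID, h})
	}
	sort.Slice(r.vnodes, func(i, j int) bool { return r.vnodes[i].hash < r.vnodes[j].hash })
}

// GetNodes walks clockwise from the key's hash and returns up to count distinct physical nodes.
func (r *Ring) GetNodes(key string, count int) []string {
	h := hashKey(key)
	idx := sort.Search(len(r.vnodes), func(i int) bool { return r.vnodes[i].hash >= h })
	var result []string
	seen := map[string]bool{}
	// Visit each vnode at most once so the loop terminates even when count
	// exceeds the number of physical nodes on the ring.
	for i := 0; i < len(r.vnodes) && len(result) < count; i++ {
		v := r.vnodes[(idx+i)%len(r.vnodes)]
		if !seen[v.nodeID] {
			result = append(result, v.nodeID)
			seen[v.nodeID] = true
		}
	}
	return result
}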
Output
No direct output. This is a design pattern.
Senior Shortcut: Replication Factor
Always make replication factor configurable per bucket. Some data needs N=5 (financial logs), some can tolerate N=2 (cached thumbnails). Don't hardcode N=3.
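A minimal sketch of what "configurable per bucket" can look like, assuming a hypothetical BucketConfig type and a cluster-wide default of 3. The bucket names and field names are illustrative only.

```go
package main

import "fmt"

// defaultReplicas is the assumed cluster-wide default.
const defaultReplicas = 3

// BucketConfig holds per-bucket settings. Field names are illustrative.
type BucketConfig struct {
	Name     string
	Replicas int // 0 means "use the cluster default"
}

// replicationFactor returns the bucket's replica count, falling back to the default.
func replicationFactor(cfg BucketConfig) int {
	if cfg.Replicas <= 0 {
		return defaultReplicas
	}
	return cfg.Replicas
}

func main() {
	fmt.Println(replicationFactor(BucketConfig{Name: "financial-logs", Replicas: 5})) // 5
	fmt.Println(replicationFactor(BucketConfig{Name: "thumbnails", Replicas: 2}))     // 2
	fmt.Println(replicationFactor(BucketConfig{Name: "misc"}))                        // 3
}
```

The write path would then pass this value as the count when asking the ring for a preference list, instead of a hardcoded 3.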
Figure: Object Write Path, from PUT request to durable storage. Steps: client PUT (HMAC-signed request with key and data), hash ring lookup (consistent hash on key to virtual node), primary node (accepts object, writes to local disk), replica nodes (async replication to N-1 peers), metadata commit (strongly consistent key-to-location map). Warning: replication must complete before the ack to the client, or read-after-write fails.

Metadata Store: The Brain of the System

The metadata store maps keys to object locations, sizes, checksums, and version info. It must be strongly consistent: if you PUT an object and immediately GET it, you should get the new version. That rules out eventually consistent stores like Cassandra (unless you use quorum reads and quorum writes, which cost latency).

Production choices: etcd (for small clusters, <10K ops/sec), CockroachDB (for geo-distributed), or a custom sharded MySQL with Vitess. The key insight: metadata operations are small (a few KB), so you can afford strong consistency. Data operations are large (MB to TB), so they use eventual consistency with checksum verification.

Never store object data in the metadata store. That's how you run out of space and crash the cluster. I've seen a team store 100MB objects as etcd values; etcd's default request size limit is 1.5 MiB. They learned the hard way.

MetadataStore.systemdesign
-- io.thecodeforge - System Design tutorial

-- Metadata schema (CockroachDB)
CREATE TABLE objects (
    key STRING PRIMARY KEY,
    size INT8 NOT NULL,
    checksum STRING NOT NULL, -- SHA256 hex
    locations STRING[] NOT NULL, -- list of node IDs
    version INT8 NOT NULL DEFAULT 1,
    created_at TIMESTAMP NOT NULL DEFAULT now()
);

-- Atomic compare-and-swap for conditional PUT.
-- $1 = key, $2 = new locations, $3 = the version the writer last read.
-- If zero rows are updated, another writer won the race: re-read and retry.
UPDATE objects SET version = version + 1, locations = $2
WHERE key = $1 AND version = $3;
Output
No direct output. This is a schema design.
Never Do This: Storing Data in Metadata
Metadata stores are not blob stores. Keep metadata small (<10KB per entry). Use them only for pointers and attributes.
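The compare-and-swap semantics of a conditional PUT can be illustrated without a database. The sketch below is an in-memory stand-in, assuming hypothetical MetaStore and ObjectMeta types; a real metadata store would enforce the same check inside a transaction.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var ErrVersionConflict = errors.New("version conflict: object was modified concurrently")

// ObjectMeta mirrors the small attributes a metadata entry would carry.
type ObjectMeta struct {
	Size      int64
	Checksum  string
	Locations []string
	Version   int64
}

// MetaStore is an in-memory stand-in for the real metadata database.
type MetaStore struct {
	mu   sync.Mutex
	rows map[string]ObjectMeta
}

func NewMetaStore() *MetaStore {
	return &MetaStore{rows: map[string]ObjectMeta{}}
}

// Put writes meta only if the stored version equals expectedVersion
// (0 means "the key must not exist yet"). On success the version is bumped by one.
func (s *MetaStore) Put(key string, meta ObjectMeta, expectedVersion int64) (int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	current := s.rows[key].Version // zero value when the key is absent
	if current != expectedVersion {
		return current, ErrVersionConflict
	}
	meta.Version = current + 1
	s.rows[key] = meta
	return meta.Version, nil
}

func main() {
	store := NewMetaStore()
	v1, _ := store.Put("photos/cat.jpg", ObjectMeta{Size: 1024, Locations: []string{"node-a"}}, 0)
	_, err := store.Put("photos/cat.jpg", ObjectMeta{Size: 2048}, 0) // stale writer still expects "absent"
	fmt.Println("first write version:", v1)
	fmt.Println("stale write error:  ", err)
}
```

The stale writer gets a conflict instead of silently overwriting, which is the behavior a version-guarded update gives you.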

Consistency Models: Strong vs. Eventual — Pick Your Poison

Object stores have often promised 'read-after-write consistency' for new objects but only 'eventual consistency' for overwrites (Amazon S3 itself worked this way until December 2020, when it moved to strong read-after-write consistency for all operations). Why the split? Because new objects are easy: you write metadata and data, and subsequent reads see the new metadata. Overwrites are hard: you have to coordinate updating metadata across replicas.

In practice, the simplest approach is a last-writer-wins (LWW) strategy with wall clocks. Each replica stores a version timestamp. When reading, the reader fetches from all replicas and picks the latest version. This is simple but relies on clock synchronization: clock skew between nodes can cause stale reads or silently discard the newer of two writes. Mitigation: use logical clocks (such as vector clocks) or clocks with explicitly bounded uncertainty (like Google's TrueTime).

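For strong consistency, use a quorum-based approach: pick a write quorum W and a read quorum R such that R + W > N, so every read overlaps the most recent write. Writing to all N replicas and reading from one is the extreme case; majority quorums (W = R = 2 for N = 3) are the usual compromise. Either way, latency goes up. I've seen teams use strong consistency for metadata and eventual consistency for data, which gets the best of both worlds.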
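The standard rule behind quorum configurations is that a read quorum of size R and a write quorum of size W are guaranteed to share at least one replica when R + W > N. A small sketch of that check follows; quorumOverlaps is a hypothetical helper name.

```go
package main

import "fmt"

// quorumOverlaps reports whether every read quorum of size r must intersect
// every write quorum of size w among n replicas.
func quorumOverlaps(n, r, w int) bool {
	return r >= 1 && w >= 1 && r <= n && w <= n && r+w > n
}

func main() {
	fmt.Println(quorumOverlaps(3, 1, 3)) // true: write to all, read from one
	fmt.Println(quorumOverlaps(3, 2, 2)) // true: majority reads and writes
	fmt.Println(quorumOverlaps(3, 1, 1)) // false: a read can miss the latest write
}
```

Validating this at configuration time is cheap and catches settings that quietly downgrade a bucket to eventual consistency.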

LWWReconciliation.systemdesign
// io.thecodeforge — System Design tutorial

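// Last-writer-wins reconciliation on read.
// Assumed to exist elsewhere: ring (the consistent hash ring), the Node type,
// and fetchFromPeer, which returns a replica's bytes and its version timestamp.
// Requires: import "errors"

var ErrNotFound = errors.New("object not found")

func (n *Node) GetObject(key string) ([]byte, error) {
	replicas := ring.GetNodes(key, 3)
	var best []byte
	var bestTime int64
	found := false
	for _, r := range replicas {
		data, ts, err := n.fetchFromPeer(r, key)
		if err != nil {
			continue // skip replicas that are unreachable or lack the object
		}
		// Track success explicitly so zero-byte objects and timestamp 0 still count.
		if !found || ts > bestTime {
			best, bestTime, found = data, ts, true
		}
	}
	if !found {
		return nil, ErrNotFound
	}
	return best, nil
}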
Output
No direct output. This is a design pattern.
Interview Gold: Clock Skew
Interviewers love asking about clock skew. Answer: order writes with logical clocks (Lamport or vector clocks) instead of wall clocks. TrueTime gives tight error bounds but depends on GPS receivers and atomic clocks.
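For reference, a vector clock comparison can be sketched in a few lines. The VectorClock type, the Ordering constants, and Compare are illustrative names; the point is that two versions can be detected as concurrent rather than forced into an order by wall-clock time.

```go
package main

import "fmt"

// VectorClock maps a node ID to the number of updates that node has made.
type VectorClock map[string]uint64

// Ordering describes how two vector clocks relate.
type Ordering int

const (
	Equal Ordering = iota
	Before
	After
	Concurrent
)

// Compare reports how clock a relates to clock b.
// Missing entries count as zero.
func Compare(a, b VectorClock) Ordering {
	aAhead, bAhead := false, false
	for node, av := range a {
		if av > b[node] {
			aAhead = true
		}
	}
	for node, bv := range b {
		if bv > a[node] {
			bAhead = true
		}
	}
	switch {
	case aAhead && bAhead:
		return Concurrent // neither version has seen the other's update
	case aAhead:
		return After
	case bAhead:
		return Before
	default:
		return Equal
	}
}

func main() {
	v1 := VectorClock{"node-a": 2, "node-b": 1}
	v2 := VectorClock{"node-a": 1, "node-b": 2}
	fmt.Println(Compare(v1, v2) == Concurrent)                      // true
	fmt.Println(Compare(VectorClock{"node-a": 2}, v1) == Before)    // true
}
```

When Compare returns Concurrent, the system has a genuine conflict to resolve (for example by keeping both versions), which last-writer-wins would have hidden.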

Handling Failures: Replication, Repair, and Rebalancing

Nodes fail. Disks die. Network partitions happen. Your system must detect failures and repair automatically. Use a gossip protocol for failure detection — each node periodically pings a random subset of others. If a node is unreachable for T seconds (e.g., 30s), mark it as 'suspected'. After another T seconds, mark it as 'dead' and trigger repair.

Repair means replicating the objects that were on the dead node to other nodes. The metadata store knows which objects had replicas on that node. A repair coordinator picks new nodes from the ring and copies the data. This is bandwidth-intensive — throttle repair to avoid saturating the network.

Rebalancing happens when you add or remove nodes. The ring changes, so some objects need to move. Virtual nodes keep that movement minimal: when a node joins, only objects whose hash falls between each of its new ring positions and the position just before it need to move, and they all move to the new node. In practice, adding a node to a cluster that then has M nodes moves about 1/M of the objects.
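The claim about limited movement can be checked empirically. The sketch below builds a ring before and after adding a fifth node and counts how many keys change owner. It assumes SHA-256 as the ring hash and 100 virtual nodes per node; the helper names (buildRing, owner, movedFraction) and key format are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

type vnode struct {
	hash  uint64
	owner string
}

// hash64 takes the first 8 bytes of SHA-256 as the ring position.
func hash64(s string) uint64 {
	sum := sha256.Sum256([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

// buildRing places vnodesPerNode points on the ring for every node.
func buildRing(nodes []string, vnodesPerNode int) []vnode {
	var ring []vnode
	for _, n := range nodes {
		for i := 0; i < vnodesPerNode; i++ {
			ring = append(ring, vnode{hash64(fmt.Sprintf("%s#%d", n, i)), n})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].hash < ring[j].hash })
	return ring
}

// owner returns the node owning the first vnode at or after the key's hash,
// wrapping around to the start of the ring.
func owner(ring []vnode, key string) string {
	h := hash64(key)
	i := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= h })
	return ring[i%len(ring)].owner
}

// movedFraction reports the share of keys whose owner changes between two rings
// and whether every moved key landed on newNode.
func movedFraction(before, after []vnode, keys int, newNode string) (float64, bool) {
	moved, onlyToNew := 0, true
	for i := 0; i < keys; i++ {
		k := fmt.Sprintf("object-%d", i)
		if a := owner(after, k); a != owner(before, k) {
			moved++
			if a != newNode {
				onlyToNew = false
			}
		}
	}
	return float64(moved) / float64(keys), onlyToNew
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-d"}
	before := buildRing(nodes, 100)
	after := buildRing(append(nodes, "node-e"), 100)
	frac, onlyToNew := movedFraction(before, after, 10000, "node-e")
	fmt.Printf("moved %.1f%% of keys; all moved keys went to the new node: %v\n", frac*100, onlyToNew)
}
```

Going from four to five nodes should move roughly a fifth of the keys, and every key that moves lands on the new node; no data shuffles between the existing four.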

FailureDetection.systemdesign
// io.thecodeforge — System Design tutorial

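// Gossip-based failure detector (simplified).
// repairCoordinator is assumed to exist elsewhere. Requires: import "time"

const (
	suspectAfter = 30 // seconds of silence before a peer is suspected
	deadAfter    = 60 // seconds of silence before a peer is declared dead
)

// Node is a peer as seen by the local failure detector.
type Node struct {
	ID        string
	heartbeat int64 // Unix seconds of the last heartbeat received
	suspected bool
	dead      bool
}

// Gossip checks each peer's last heartbeat and escalates: suspected, then dead.
func (n *Node) Gossip(peers []*Node) {
	now := time.Now().Unix()
	for _, peer := range peers {
		silence := now - peer.heartbeat
		switch {
		case silence <= suspectAfter:
			peer.suspected, peer.dead = false, false // heartbeat is fresh again
		case silence <= deadAfter:
			peer.suspected = true
		case !peer.dead:
			// Declare dead and trigger repair once, not on every gossip round.
			peer.suspected, peer.dead = true, true
			repairCoordinator.Repair(peer.ID)
		}
	}
}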
Output
No direct output.
Production Trap: Repair Storm
When a node fails, all its replicas need repair simultaneously. This can saturate network and disk I/O, causing cascading failures. Throttle repair to 10% of bandwidth.
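Throttling has a cost worth estimating up front: the tighter the cap, the longer the cluster runs with reduced redundancy. A back-of-the-envelope sketch, assuming a dead node that held 10 TB and a 10 Gbit/s link (both made-up figures), with repairDuration as a hypothetical helper:

```go
package main

import (
	"fmt"
	"time"
)

// repairDuration estimates how long re-replicating dataBytes takes when repair
// traffic is capped at `fraction` of a link running at linkBitsPerSec.
func repairDuration(dataBytes, linkBitsPerSec, fraction float64) time.Duration {
	seconds := dataBytes * 8 / (linkBitsPerSec * fraction)
	return time.Duration(seconds * float64(time.Second))
}

func main() {
	// Assumed figures: 10 TB to re-replicate over a 10 Gbit/s link, capped at 10%.
	d := repairDuration(10e12, 10e9, 0.10)
	fmt.Printf("estimated repair time: %.1f hours\n", d.Hours())
}
```

At a 10% cap this single-link model takes about 22 hours, which is the window during which a second failure is more dangerous. Real clusters spread repair across many source and destination nodes, so treat the number as an upper bound for one bottleneck link, not a prediction.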

Performance: Latency, Throughput, and Hotspots

Object storage performance is dominated by network and disk I/O. For small objects (<1MB), latency is network-bound. For large objects (>100MB), throughput is disk-bound. Use SSDs for metadata nodes, HDDs for data nodes. But even HDDs can saturate with enough concurrent requests — stripe across multiple disks per node.

Hotspots happen when a popular object is requested thousands of times per second. Solution: cache popular objects at the edge (CDN) or in an in-memory cache like Redis. But caching large objects is expensive — use a bloom filter to decide what to cache.

Another hotspot: a single partition gets too many writes. Mitigation: split the partition into smaller sub-partitions when it exceeds a threshold (e.g., 10K objects). This is called 'splitting' in DynamoDB terms.

HotspotMitigation.systemdesign
// io.thecodeforge — System Design tutorial

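// Bloom filter for cache eligibility.
// Assumes github.com/bits-and-blooms/bloom/v3, where New(m, k) takes a size in
// bits; NewWithEstimates sizes the filter for an expected item count instead.
cacheFilter := bloom.NewWithEstimates(1_000_000, 0.01) // ~1M keys, 1% false positives

requestCount[key]++
if cacheFilter.Test([]byte(key)) {
    // Probably cached: try the cache, but fall back to storage on a miss,
    // because a Bloom filter can return false positives.
} else {
    // Definitely not marked: serve from storage, then admit the key once popular
    if requestCount[key] > 100 {
        cacheFilter.Add([]byte(key))
    }
}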
Output
No direct output.
Senior Shortcut: CDN for Hot Objects
Don't build your own cache for hot objects. Use a CDN like CloudFront or Cloudflare. They handle scale better than you ever will.

Security: Authentication, Authorization, and Encryption

Object storage is often exposed via HTTP APIs. You need to authenticate requests, typically using HMAC signatures (like AWS Signature V4). The client signs the request with a secret key, and the server recomputes and verifies the signature. The signature prevents tampering, and the signed timestamp limits how long a captured request can be replayed.

Authorization is done via access control lists (ACLs) or bucket policies. Each bucket has a policy that specifies who can read/write. Policies are evaluated on every request — cache them to avoid hitting the metadata store.

Encryption: data at rest should be encrypted using AES-256. Either server-side (the storage node encrypts before writing to disk) or client-side (the client encrypts before sending). Server-side is easier but you manage the keys. Use a key management service (KMS) to rotate keys regularly.

Never store encryption keys in the same cluster as the data. That's like locking your house and leaving the key under the mat.
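As an illustration of encryption at rest, here is a minimal AES-256-GCM sketch using Go's standard library. It is a sketch under assumptions: the key is generated locally for the demo, whereas a real deployment would fetch a data key from a KMS, and the layout of prepending the nonce to the ciphertext is one common convention, not a requirement.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM cipher; the key must be exactly 32 bytes.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt seals plaintext and prepends the random nonce so decrypt can recover it.
func encrypt(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt splits off the nonce, then authenticates and decrypts the rest.
func decrypt(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // demo only: in production this comes from a KMS
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := encrypt(key, []byte("object bytes"))
	if err != nil {
		panic(err)
	}
	plain, err := decrypt(key, sealed)
	fmt.Println(string(plain), err)
}
```

GCM also authenticates the data, so a flipped bit on disk surfaces as a decryption error rather than silently corrupted plaintext.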

HMACAuth.systemdesign
// io.thecodeforge — System Design tutorial

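// Verify an HMAC-signed request, loosely modeled on AWS Signature V4.
// Heavily simplified: real SigV4 also hashes the canonical request, includes a
// credential scope, and derives a per-day signing key. buildCanonicalRequest
// and hmacSHA256 are assumed helpers. Requires: import "crypto/hmac", "net/http"
func verifySignature(secretKey string, r *http.Request) bool {
    // Recompute signature from request headers
    canonicalRequest := buildCanonicalRequest(r)
    stringToSign := "AWS4-HMAC-SHA256\n" + r.Header.Get("X-Amz-Date") + "\n" + canonicalRequest
    expectedSig := hmacSHA256(secretKey, stringToSign)
    // Compare in constant time so the check does not leak the signature through timing
    return hmac.Equal([]byte(expectedSig), []byte(r.Header.Get("X-Amz-Signature")))
}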
Output
No direct output.
Never Do This: Hardcoded Keys
I've seen production code with AWS keys hardcoded in a config file committed to Git. Use a secrets manager like Vault or AWS Secrets Manager.
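The core of request signing, independent of any AWS specifics, is computing an HMAC over the request and comparing it in constant time. A self-contained sketch follows; the message format and the sign/verify helper names are assumptions made for illustration, and the secret is a placeholder.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// sign returns the hex-encoded HMAC-SHA256 of message under secret.
func sign(secret, message string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(message))
	return hex.EncodeToString(mac.Sum(nil))
}

// verify recomputes the signature and compares it in constant time.
func verify(secret, message, signature string) bool {
	expected := sign(secret, message)
	return hmac.Equal([]byte(expected), []byte(signature))
}

func main() {
	secret := "example-secret" // placeholder: load real secrets from a secrets manager
	msg := "PUT\n/bucket/photos/cat.jpg\n20260625T120000Z"
	sig := sign(secret, msg)

	fmt.Println(verify(secret, msg, sig))                                        // true
	fmt.Println(verify(secret, "PUT\n/bucket/other.jpg\n20260625T120000Z", sig)) // false
}
```

Changing any part of the signed message, including the path or timestamp, invalidates the signature, which is what stops a tampered request from being accepted.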

When Not to Use Object Storage

Object storage is not a filesystem. Don't use it for databases, logs that need tailing, or applications that require random writes. The latency is too high (tens of milliseconds) compared to block storage (sub-millisecond). Also, object storage is terrible for small files — the overhead of HTTP requests and metadata lookups kills throughput. If you have billions of tiny files (<1KB), use a key-value store like Redis or Cassandra instead.

Another anti-pattern: using object storage as a message queue. SQS exists for a reason. Polling for new objects is inefficient and introduces latency.

For high-throughput sequential writes (like video streaming), object storage works well. For random-access workloads, use block storage.

Interview Gold: When Not to Use S3
Interviewers love asking when object storage is a bad fit. Answer: small files, high IOPS, low latency, or random writes. Always have a counterexample ready.
Figure: Object Storage vs. Filesystem, when to choose which storage paradigm. Object storage: flat namespace with key-based lookup; high latency (10-100ms per op); great for large blobs and archives; no random writes or tailing. Filesystem (POSIX): hierarchical directories; sub-millisecond latency; ideal for DBs, logs, and small files; supports random writes and appends. Use object storage for durability at scale, not for low-latency workloads.
Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom
A storage node crashed every 6 hours. Logs showed 'OutOfMemoryError: Java heap space' but the heap was only 60% used.
Assumption
Memory leak in the object replication code.
Root cause
The node was buffering entire objects in memory during replication. A 4GB object caused the JVM to exceed its container memory limit (4GB) and get OOM-killed by the kernel.
Fix
Changed replication to stream objects in 1MB chunks instead of loading them entirely. Also added a per-node memory limit for replication buffers: 512MB max.
Key lesson
  • Never assume your code handles large objects gracefully.
  • Always stream, never buffer.
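A minimal sketch of the streaming fix: copy through one fixed 1 MiB buffer so memory use stays constant regardless of object size. The countingWriter and streamDemo helpers exist only to make the behavior observable, and the in-memory source string is a stand-in for a network or disk reader.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

const chunkSize = 1 << 20 // 1 MiB, matching the chunk size from the fix

// countingWriter records how many writes it receives and the largest one.
type countingWriter struct {
	writes  int
	largest int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.writes++
	if len(p) > c.largest {
		c.largest = len(p)
	}
	return len(p), nil
}

// streamObject copies src to dst through a single reusable buffer,
// so at most chunkSize bytes of the object are held at any time.
func streamObject(dst io.Writer, src io.Reader) (int64, error) {
	buf := make([]byte, chunkSize)
	var total int64
	for {
		n, err := src.Read(buf)
		if n > 0 {
			if _, werr := dst.Write(buf[:n]); werr != nil {
				return total, werr
			}
			total += int64(n)
		}
		if err == io.EOF {
			return total, nil
		}
		if err != nil {
			return total, err
		}
	}
}

// streamDemo streams a synthetic object of `size` bytes and reports the total
// copied and the largest single write the destination saw.
func streamDemo(size int) (total int64, largest int) {
	src := strings.NewReader(strings.Repeat("x", size)) // stand-in for a remote object
	dst := &countingWriter{}
	total, _ = streamObject(dst, src)
	return total, dst.largest
}

func main() {
	total, largest := streamDemo(5*chunkSize + 123)
	fmt.Println("bytes copied:", total, "largest write:", largest)
}
```

The destination never sees a write larger than one chunk, so a 4GB object costs the same replication memory as a 4MB one.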
Production debug guide
Systematic recovery paths for the failure modes engineers actually hit. 3 entries.
Symptom · 01
PUT requests timeout after 30 seconds
Fix
1. Check disk I/O on storage nodes with iostat -x 1.
2. If %util > 90%, add nodes or throttle uploads.
3. Check network latency between client and storage nodes.
4. If >100ms, move storage nodes closer to clients.
Symptom · 02
GET requests return 404 for existing objects
Fix
1. Query the metadata store directly to verify the key exists.
2. If it is missing, run a reconciliation job: scan all storage nodes and rebuild metadata.
3. If metadata exists but the data is missing, restore from a replica or backup.
Symptom · 03
Cluster health check shows nodes flapping
Fix
1. Check network latency between nodes with ping.
2. If >50ms, increase the failure detection timeout from 30s to 60s.
3. Check for resource exhaustion (CPU, memory, disk).
4. If those are OK, check for network partitions or firewall drops.
Design Object Storage (S3) Triage Cheat Sheet
First-response commands for when things go wrong, copy-paste ready.
`connection refused` on storage node
Immediate action
Check if node process is running
Commands
systemctl status storage-node
journalctl -u storage-node -n 50
Fix now
Restart node: systemctl restart storage-node
`disk full` errors
Immediate action
Check disk usage
Commands
df -h /data
du -sh /data/* | sort -rh | head
Fix now
Delete old objects or add more disks. Use lifecycle policies to expire old data.
`metadata store timeout`
Immediate action
Check metadata store health
Commands
etcdctl endpoint health
etcdctl check perf
Fix now
Add more metadata nodes or increase request timeout from 5s to 10s.
`replication lag` alerts
Immediate action
Check replication queue depth
Commands
curl http://node:8080/metrics | grep replication_queue
Check network bandwidth: `iftop`
Fix now
Throttle client writes to reduce replication backlog.
Feature / Aspect | Object Storage (S3) | Block Storage (EBS)
Access pattern | HTTP GET/PUT | Block device (mountable)
Latency | 10-100ms | <1ms
Max size | 5TB (single object) | 16TB (volume)
Consistency | Strong read-after-write (since December 2020; previously eventual for overwrites) | Strong
Cost per GB per month | $0.023 | $0.10
Use case | Backups, media, logs | Databases, OS volumes

Key takeaways

1. Object storage is a flat namespace of keys and objects: no directories, just hash-based partitioning.
2. Separate metadata (strongly consistent) from data (eventually consistent) to scale independently.
3. Use consistent hashing with virtual nodes for even load distribution and minimal data movement on topology changes.
4. Never buffer entire objects in memory: stream in chunks to avoid OOM kills on large objects.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR · How does consistent hashing handle node additions without massive data m...
Q02 · SENIOR · When would you choose eventual consistency over strong consistency for o...
Q03 · SENIOR · What happens when a storage node fails during a write operation, and how...
Q04 · JUNIOR · What is object storage and how does it differ from block storage?
Q05 · SENIOR · A node is marked as dead but is actually alive. How do you diagnose and ...
Q06 · SENIOR · How would you design object storage to serve 1 billion objects with 99.9...

Q01 of 06 · SENIOR

How does consistent hashing handle node additions without massive data movement?

ANSWER
Consistent hashing with virtual nodes ensures only a fraction of keys (roughly 1/N for an N-node cluster) need to move when a node is added or removed. Each new ring position splits an existing range: only keys between the new position and its predecessor move, from the successor that used to own them to the new node.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. How do I design an object storage system like Amazon S3?
02. What's the difference between object storage and block storage?
03. How do I handle concurrent writes to the same object in object storage?
04. What happens when a storage node fails in an object storage cluster?