Design Object Storage (S3): Build a Petabyte-Scale System That Doesn't Fall Over
Design object storage (S3) from scratch.
20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.
Design object storage by partitioning the key space, replicating objects across failure domains, and using a metadata store to map keys to locations. The hard part is consistency under concurrent writes and recovery from node failures.
Think of object storage like a giant warehouse with numbered shelves. Each shelf holds a box (the object) with a label (the key). You don't care which shelf your box is on — you just tell the warehouse manager the label, and they fetch it. The manager keeps a catalog (metadata store) of where every box is. If a shelf breaks, the manager copies a spare box from another shelf. That's object storage: flat, scalable, and self-healing.
Everyone thinks object storage is just PUT and GET over HTTP. Then they try to serve 10,000 concurrent requests and the whole thing implodes. I've seen a startup's S3-compatible service fall over at 500 requests per second because they used a single PostgreSQL table for metadata. Don't be that team.
Object storage solves the problem of storing massive amounts of unstructured data — backups, media, logs — without worrying about filesystem limits or RAID arrays. It's the backbone of cloud storage, but building your own means wrestling with consistency, durability, and performance at scale.
By the end of this, you'll be able to design an object storage system that handles petabyte-scale data, survives node failures without data loss, and serves requests with predictable latency. You'll know exactly where the bottlenecks hide and how to kill them.
The Core Abstraction: Keys, Objects, and Flat Namespace
Object storage is deceptively simple. You have a key (a string, often a URL path) and an object (the data blob plus metadata). The namespace is flat — no directories, just keys. But flat doesn't mean simple. The key is the only way to locate an object, so key design is critical. Prefix-based partitioning (e.g., first 4 hex chars) is common. But if your keys are sequential (like timestamps), you'll hotspot one partition. Use hash prefixes or reverse the key to distribute load.
In production, you'll store metadata separately from the data. Metadata (key, size, checksum, location) goes in a fast, consistent store like etcd or a distributed SQL database. Data goes on cheap, high-capacity disks. This separation lets you scale metadata and data independently.
Data Placement and Replication: Where Does the Object Live?
Once you have the key, you need to decide which storage nodes hold the object. The naive approach is a central location service — but that's a single point of failure and a bottleneck. Instead, use consistent hashing with virtual nodes. Each node gets multiple points on the hash ring. When you write an object, you hash the key and find the next N nodes on the ring (the 'preference list'). Replicate the object to those N nodes.
Why virtual nodes? They spread load evenly even when nodes have different capacities. A node with 2TB gets twice as many virtual nodes as a 1TB node. When a node fails, its load is redistributed evenly across all other nodes, not just its neighbor.
In production, N=3 is standard for durability. But watch out: if you replicate synchronously, write latency is the max of the three nodes. Asynchronous replication is faster but risks data loss on node failure before replication completes.
Metadata Store: The Brain of the System
The metadata store maps keys to object locations, sizes, checksums, and version info. It must be strongly consistent — if you PUT an object and immediately GET it, you should get the new version. That rules out eventually consistent stores like Cassandra (unless you use quorum reads, which are slow).
Production choices: etcd (for small clusters, <10K ops/sec), CockroachDB (for geo-distributed), or a custom sharded MySQL with Vitess. The key insight: metadata operations are small (a few KB), so you can afford strong consistency. Data operations are large (MB to TB), so they use eventual consistency with checksum verification.
Never store object data in the metadata store. That's how you run out of space and crash the cluster. I've seen a team store 100MB objects as etcd values — etcd has a 1.5MB limit per key. They learned the hard way.
Consistency Models: Strong vs. Eventual — Pick Your Poison
Object storage often promises 'read-after-write consistency' for new objects, but 'eventual consistency' for overwrites. Why? Because new objects are easy — you write metadata and data, and subsequent reads see the new metadata. Overwrites are hard: you have to coordinate updating metadata across replicas.
In practice, use a last-writer-wins (LWW) strategy with wall clocks. Each replica stores a version timestamp. When reading, the client fetches from all replicas and picks the latest version. This is simple but relies on clock synchronization. NTP skew can cause stale reads. Mitigation: use logical clocks (vector clocks) or monotonic clocks (like Google's TrueTime).
For strong consistency, use a quorum-based approach: write to all N replicas, read from at least one. But this increases latency. I've seen teams use strong consistency for metadata and eventual for data — best of both worlds.
Handling Failures: Replication, Repair, and Rebalancing
Nodes fail. Disks die. Network partitions happen. Your system must detect failures and repair automatically. Use a gossip protocol for failure detection — each node periodically pings a random subset of others. If a node is unreachable for T seconds (e.g., 30s), mark it as 'suspected'. After another T seconds, mark it as 'dead' and trigger repair.
Repair means replicating the objects that were on the dead node to other nodes. The metadata store knows which objects had replicas on that node. A repair coordinator picks new nodes from the ring and copies the data. This is bandwidth-intensive — throttle repair to avoid saturating the network.
Rebalancing happens when you add or remove nodes. The ring changes, so some objects need to move. Use a 'virtual node' approach to minimize movement: only objects whose hash falls between the old and new node positions need to move. In practice, about 1/N of objects move when adding a node.
Performance: Latency, Throughput, and Hotspots
Object storage performance is dominated by network and disk I/O. For small objects (<1MB), latency is network-bound. For large objects (>100MB), throughput is disk-bound. Use SSDs for metadata nodes, HDDs for data nodes. But even HDDs can saturate with enough concurrent requests — stripe across multiple disks per node.
Hotspots happen when a popular object is requested thousands of times per second. Solution: cache popular objects at the edge (CDN) or in an in-memory cache like Redis. But caching large objects is expensive — use a bloom filter to decide what to cache.
Another hotspot: a single partition gets too many writes. Mitigation: split the partition into smaller sub-partitions when it exceeds a threshold (e.g., 10K objects). This is called 'splitting' in DynamoDB terms.
Security: Authentication, Authorization, and Encryption
Object storage is often exposed via HTTP APIs. You need to authenticate requests — typically using HMAC signatures (like AWS Signature V4). The client signs the request with a secret key, and the server verifies the signature. This prevents tampering and replay attacks.
Authorization is done via access control lists (ACLs) or bucket policies. Each bucket has a policy that specifies who can read/write. Policies are evaluated on every request — cache them to avoid hitting the metadata store.
Encryption: data at rest should be encrypted using AES-256. Either server-side (the storage node encrypts before writing to disk) or client-side (the client encrypts before sending). Server-side is easier but you manage the keys. Use a key management service (KMS) to rotate keys regularly.
Never store encryption keys in the same cluster as the data. That's like locking your house and leaving the key under the mat.
When Not to Use Object Storage
Object storage is not a filesystem. Don't use it for databases, logs that need tailing, or applications that require random writes. The latency is too high (tens of milliseconds) compared to block storage (sub-millisecond). Also, object storage is terrible for small files — the overhead of HTTP requests and metadata lookups kills throughput. If you have billions of tiny files (<1KB), use a key-value store like Redis or Cassandra instead.
Another anti-pattern: using object storage as a message queue. SQS exists for a reason. Polling for new objects is inefficient and introduces latency.
For high-throughput sequential writes (like video streaming), object storage works well. For random-access workloads, use block storage.
The 4GB Container That Kept Dying
- Never assume your code handles large objects gracefully.
- Always stream, never buffer.
iostat -x 1. 2. If %util > 90%, add nodes or throttle uploads. 3. Check network latency between client and storage nodes. 4. If >100ms, move storage nodes closer to clients.ping. 2. If >50ms, increase failure detection timeout from 30s to 60s. 3. Check for resource exhaustion (CPU, memory, disk). 4. If OK, check for network partitions or firewall drops.systemctl status storage-nodejournalctl -u storage-node -n 50systemctl restart storage-nodeKey takeaways
Interview Questions on This Topic
How does consistent hashing handle node additions without massive data movement?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.
That's Real World. Mark it forged?
6 min read · try the examples if you haven't