Dropbox Design — Hash Collisions That Corrupted User Files
A chunking bug caused SHA-256 collisions, returning garbage for 4 MB blocks.
- Dropbox uses client-server sync with block-level chunking (4 MB blocks) for efficient uploads
- Deduplication via SHA-256 hashes eliminates duplicate storage across all users
- Metadata service stores file hierarchy in a horizontally sharded relational database (MySQL, sharded by user ID)
- Delta sync transfers only changed parts of files, not entire files
- Conflict resolution keeps the first uploaded version as canonical and saves a conflicting later edit as a conflict copy; automatic merge exists only for some file types
- Notification system uses long-polling to detect remote changes within seconds
Imagine you have a magic folder on your desk. Whatever paper you drop in it instantly appears in the exact same folder on your friend's desk across the world — and on your phone too. If you both edit the same paper at the same time, the magic folder figures out how to combine your changes without losing either person's work. Dropbox is that magic folder, built for hundreds of millions of people simultaneously.
File synchronization sounds deceptively simple until you're the one building it at scale. Dropbox processes over 1.2 billion file syncs per day, maintains over 500 petabytes of user data, and must deliver sub-second sync latency while handling everything from a 2 KB sticky note to a 50 GB video file. The gap between 'copy a file to the cloud' and 'build a production sync platform' is enormous, and every corner of that gap has killed startups.
The core problem is elegant to state and brutal to solve: multiple clients, on different networks, with different OS file systems, modifying a shared namespace — and every client must converge to the same state, eventually, without data loss, even when the network disappears for days. Throw in deduplication to save petabytes of storage, delta sync to save bandwidth, and conflict resolution that doesn't confuse non-technical users, and you have a genuinely hard distributed systems challenge.
By the end of this article you'll be able to walk into a senior system design interview and draw the complete Dropbox architecture from memory — the client sync engine, the block store, the metadata service, the notification system, and the conflict resolution strategy. More importantly, you'll understand why each component exists and what breaks if you cut corners on any of them.
What Is the Dropbox Design Problem?
You don't start with a dry definition. You start with the problem: multiple clients on different networks, editing the same namespace, going offline for days. Dropbox's design is a classic distributed file synchronization system. The core challenges: handling offline edits, conflict resolution, efficient storage via deduplication, and scaling to billions of files. In this article, we build the complete architecture from the client sync engine to the backend block store and metadata service.
Core Architecture: Client ↔ Server Sync Model
Dropbox uses a simple but robust pull-based sync model. The client maintains a local file system watcher (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows). When a change is detected, the client builds a local file tree and compares it with the server's tree.
The server stores metadata in a horizontally sharded MySQL cluster. Each user's files are partitioned by user ID. The metadata schema includes: file_id, parent_id, name, hash (SHA-256 of file content), size, and mtime. The block store is an Amazon S3-compatible object store, with blocks referenced by content hash.
The client syncs in three phases: 1) Upload changed blocks (only if hash not in block store), 2) Update metadata (send new checksums to server), 3) Poll for remote changes (every 3 seconds via long-polling HTTP). Server notifies clients of changes by returning the updated file tree delta.
When the metadata update succeeds, the server broadcasts a notification to all connected clients via the long-poll notification service, indicating that the user's file tree has changed.
But here's the nuance: the server doesn't push. It holds the HTTP response open (long-poll) until there's a change or timeout. This keeps connection overhead low. If you implement this naively, you'll hit connection limits on your load balancer. Dropbox's notification servers use a consistent hash ring to route the same user to the same server, so the server can track which users are connected without synchronising state across all servers.
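The routing idea can be sketched with a small consistent hash ring. This is a minimal illustration, not Dropbox's implementation: the class name `HashRing`, the server names, and the choice of 64 virtual nodes per server are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hash ring mapping user IDs to notification servers (sketch)."""

    def __init__(self, servers, vnodes=64):
        # Each server is placed on the ring at `vnodes` pseudo-random points
        # so load stays roughly even when servers join or leave.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        # First 8 bytes of SHA-256 give a stable 64-bit position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def server_for(self, user_id):
        # Walk clockwise to the first point at or after the user's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._points, self._hash(str(user_id)))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical server names, for illustration only.
ring = HashRing(["notify-1", "notify-2", "notify-3"])
print(ring.server_for(42))
```

Because a user's position on the ring never changes, removing one server only reassigns the users that were on it; everyone else keeps their long-poll connection on the same machine.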
Block Store: Chunking, Deduplication & Delta Sync
Every file is split into 4 MB blocks. The last block is often smaller. Each block gets a SHA-256 hash. The block store is a content-addressable store: blocks are stored at paths like /blocks/{hash[0:2]}/{hash[2:4]}/{hash}. This two-level prefix avoids one enormous flat directory and spreads keys across prefixes in an S3-style store.
Deduplication is trivial: if the hash already exists in the block store, we skip upload. Since Dropbox stores over 500 PB with only ~200 PB of unique blocks (60% dedup ratio), this saves billions of dollars in storage costs.
Delta sync: when a file is edited, the client recomputes block boundaries and uploads only the blocks that changed. With fixed-size blocks, however, inserting or deleting even a few bytes shifts every subsequent block boundary, so every later block gets a new hash. Dropbox uses content-defined chunking (CDC), which picks block boundaries with a rolling hash over the content itself, so boundaries stay stable across edits. This means editing one byte in a 500 MB video changes only the block containing that byte (occasionally its neighbour), not all blocks after it.
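To make boundary stability concrete, here is a small sketch of content-defined chunking. It uses a gear-style rolling hash (the approach popularised by FastCDC) as a stand-in, since the article does not say which rolling hash Dropbox uses. The function names, the seeded lookup table, and the tiny chunk sizes (bytes rather than megabytes) are assumptions chosen so the example runs quickly.

```python
import hashlib
import random

_rng = random.Random(0)  # fixed seed so every client derives the same table
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data, min_size=256, avg_bits=10, max_size=8192):
    """Split bytes into chunks whose boundaries depend on content, not offset."""
    boundary_mask = (1 << avg_bits) - 1  # roughly one cut per 2**avg_bits bytes
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Rolling update: the low bits of h depend only on the last few bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if (length >= min_size and (h & boundary_mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # final, usually shorter, chunk
    return chunks


def chunk_hashes(data):
    """SHA-256 hex digest of every chunk, in order."""
    return [hashlib.sha256(chunk).hexdigest() for chunk in cdc_chunks(data)]
```

Inserting a byte at the start of a file changes the first chunk, but later cut points are chosen from the same content and so land in the same places; comparing `chunk_hashes(data)` with `chunk_hashes(b"X" + data)` shows most digests unchanged, which is exactly what lets the client skip re-uploading them.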
- If 10 users upload the same cat video, only one copy is stored.
- Block store is a giant map from hash → data. Uploading a block with an existing hash is a no-op.
- The 60% dedup ratio means every 100 PB of logical storage costs only 40 PB of physical storage.
- But dedup has a hidden cost: integrity checks. A hash collision can destroy data (see production incident).
- Always add a CRC or byte comparison on the first few bytes before returning cached data.
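The points in this list can be sketched as an in-memory content-addressable store. This is an illustrative model only: the `BlockStore` class, the dictionary standing in for the object store, and the CRC-32 stored next to each block are assumptions for the example, not a description of the production block store.

```python
import hashlib
import zlib


class BlockStore:
    """In-memory content-addressable block store with dedup and read checks."""

    def __init__(self):
        self._blocks = {}  # path -> (crc32, data)

    @staticmethod
    def path_for(digest):
        # Two-level prefix layout: /blocks/{hash[0:2]}/{hash[2:4]}/{hash}
        return f"/blocks/{digest[0:2]}/{digest[2:4]}/{digest}"

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        path = self.path_for(digest)
        existing = self._blocks.get(path)
        if existing is None:
            self._blocks[path] = (zlib.crc32(data), data)
        elif existing[1] != data:
            # Same hash but different bytes: never silently dedup.
            raise ValueError(f"hash collision or corrupt block at {path}")
        return digest  # re-uploading an existing block is a no-op

    def get(self, digest):
        crc, data = self._blocks[self.path_for(digest)]
        # Verify on read: both the stored CRC and the content hash must match.
        if zlib.crc32(data) != crc or hashlib.sha256(data).hexdigest() != digest:
            raise IOError(f"block {digest} failed integrity check")
        return data


store = BlockStore()
first = store.put(b"cat video bytes")
second = store.put(b"cat video bytes")  # second uploader: nothing new is stored
print(first == second, len(store._blocks))
```

The byte comparison in `put` is the expensive part; a real system might compare only a prefix or a second checksum, as the last bullet suggests.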
Metadata Service: File Tree Storage & Synchronization
The metadata service is the single source of truth for file hierarchy. Each user has a file tree stored in a relational database (MySQL, sharded by user_id). The tree is represented as adjacency list: each row has file_id, parent_id, name, and content_hash. The root of each user's tree is a special entry with parent_id = NULL.
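As a rough illustration of the adjacency-list layout, the rows below model one user's tree and two typical queries: listing a folder and reconstructing a full path. The sample rows, the placeholder hash value, and the helper names are made up for the example.

```python
# Each dict stands in for one row of the metadata table.
rows = [
    {"file_id": 1, "parent_id": None, "name": "", "content_hash": None},  # root
    {"file_id": 2, "parent_id": 1, "name": "docs", "content_hash": None},
    {"file_id": 3, "parent_id": 2, "name": "report.docx", "content_hash": "ab12"},
]
rows_by_id = {row["file_id"]: row for row in rows}


def children(parent_id):
    """List the names directly inside a folder (one indexed lookup in SQL)."""
    return [row["name"] for row in rows if row["parent_id"] == parent_id]


def full_path(file_id):
    """Walk parent_id links up to the root to rebuild the path."""
    parts = []
    node = rows_by_id[file_id]
    while node["parent_id"] is not None:
        parts.append(node["name"])
        node = rows_by_id[node["parent_id"]]
    return "/" + "/".join(reversed(parts))


print(children(2))   # ['report.docx']
print(full_path(3))  # /docs/report.docx
```

Listing a folder is cheap with this layout, while resolving a deep path costs one lookup per level, which is one reason tree lookups are worth caching.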
When the client uploads new blocks and gets their hashes, it sends a transaction to the metadata service: "replace the content_hash of file X with new_hash Y". The server validates that the new hash actually exists in the block store (otherwise reject). Then it logs the change in a journal table.
The journal table is the key to conflict resolution and delta sync. Each change is a row: (change_id, user_id, file_id, new_hash, timestamp). The client syncs by requesting changes after a known change_id. The server returns all changes since that ID. This is a classic changelog pattern.
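A minimal in-memory version of this changelog pattern might look like the following. The `Journal` class and its method names are hypothetical; the row fields mirror the columns listed above.

```python
import itertools
import time
from typing import NamedTuple


class Change(NamedTuple):
    change_id: int
    user_id: int
    file_id: int
    new_hash: str
    timestamp: float


class Journal:
    """Append-only change log; clients catch up from their last seen change_id."""

    def __init__(self):
        self._rows = []
        self._ids = itertools.count(1)  # monotonically increasing change_id

    def append(self, user_id, file_id, new_hash):
        change = Change(next(self._ids), user_id, file_id, new_hash, time.time())
        self._rows.append(change)
        return change.change_id

    def changes_since(self, user_id, last_change_id):
        # Equivalent to: WHERE user_id = ? AND change_id > ? ORDER BY change_id
        return [
            c for c in self._rows
            if c.user_id == user_id and c.change_id > last_change_id
        ]


journal = Journal()
cursor = journal.append(user_id=1, file_id=10, new_hash="h1")
journal.append(user_id=1, file_id=11, new_hash="h2")
print([c.file_id for c in journal.changes_since(1, cursor)])  # [11]
```

The client only has to persist one integer (its cursor), which makes recovery after days offline the same code path as a normal poll.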
Read operations (browsing folders) are served from a read replica to reduce load on the primary. Write operations go to the primary. Eventual consistency means a write might take up to 100ms to propagate to read replicas — acceptable because the client's next poll will see the latest state.
Conflict Resolution: When Two Clients Edit the Same File
The classic problem: user A and user B both edit the same file while offline. When they come online, the server has two versions. Dropbox uses a simple strategy: the first uploaded version wins as the canonical copy. The second version is saved next to it as a conflict copy (e.g., "report (A's conflicted copy 2026-04-22).docx").
This works because it never loses data, and users can manually merge if needed. For office documents, Dropbox offers automatic merge via a custom diff engine (similar to 3-way merge in version control). But that's only for specific file types (Office docs, Google Docs, etc.). For plain text or binaries, there is no automatic merge: the first upload stays canonical and the later one becomes a conflict copy.
The resolution happens at the metadata level: every write carries the file version (an incrementing counter) that the client last saw. If that version doesn't match the current server version, it's a conflict. The server keeps the current version as canonical and creates a new file entry for the incoming update as the conflict copy.
What about concurrent writes while both are online? The server's database transaction ensures serializability. One client's update succeeds, the other gets a 409 Conflict response. The client must then fetch the new version and offer to merge.
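The version check can be sketched as optimistic concurrency control. Everything here is illustrative: the `MetadataStore` class, its `write` signature, and the exact conflict-copy naming are assumptions, and a real server would run the check inside a database transaction.

```python
import os
from dataclasses import dataclass


@dataclass
class FileEntry:
    name: str
    content_hash: str
    version: int = 1


class MetadataStore:
    """Toy metadata store that detects stale writes by version number."""

    def __init__(self):
        self.files = {}  # name -> FileEntry

    def write(self, name, new_hash, base_version, author, date):
        """Apply a write; return the name the content ended up under."""
        current = self.files.get(name)
        if current is None:
            self.files[name] = FileEntry(name, new_hash)
            return name
        if base_version == current.version:
            # Client edited the latest version: normal update.
            current.content_hash = new_hash
            current.version += 1
            return name
        # Stale base version: keep the canonical file, add a conflict copy.
        stem, ext = os.path.splitext(name)
        copy_name = f"{stem} ({author}'s conflicted copy {date}){ext}"
        self.files[copy_name] = FileEntry(copy_name, new_hash)
        return copy_name


store = MetadataStore()
store.write("report.docx", "h1", 0, "A", "2026-04-22")         # creates version 1
store.write("report.docx", "h2", 1, "A", "2026-04-22")         # version 1 -> 2
print(store.write("report.docx", "h3", 1, "B", "2026-04-22"))  # stale write from B
```

Note that the conflict copy only adds a metadata entry pointing at a content hash; no block data is copied.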
One detail that often trips people up: conflict copies are created at the metadata level, not the block level. The server simply creates a new file row with the conflicting content_hash. No duplicate block storage is needed because deduplication already handles the identical blocks. The only cost is the metadata row and the filename.
- Each conflict copy is a new file entry with the same content_hash.
- Block store already has the data; no extra bytes are stored.
- But metadata storage grows linearly with conflict copies — clean them up periodically.
- Non-technical users often don't notice conflict copies; surface them clearly in the UI.
- Add a 'sync history' feature to show all versions including conflicts.
Scaling to 700M+ Files per Day: The Infrastructure Behind Dropbox
Dropbox's infrastructure runs across multiple data centers and AWS (for block storage). Key scaling numbers (as of the 2020s):
- 500+ PB of user data stored.
- 700 million+ files uploaded per day.
- 1.2 billion sync operations per day.
- Metadata stored in 100+ MySQL shards (each ~5 TB).
- Block store: custom object store (called 'Magic Pocket') built on top of JBOD servers with replication.
Scaling challenges:
1. Metadata sharding: Users are mapped to shards by user_id hash. Hot users (with millions of files) are split across multiple shards via sub-sharding. This required a custom rebalancing tool that moves user chunks between shards without downtime.
2. Block store throughput: A single 10 GB file upload generates 2560 blocks (4 MB each). For large files, clients upload blocks in parallel (up to 10 concurrent uploads per file). The block store must handle millions of small PUT requests per second. Solution: use a distributed key-value store (such as a Dynamo-inspired database) with in-memory tiers for hot blocks.
3. Notification scalability: Long-polling connections are handled by a dedicated notification service (not the metadata service). Each notification server handles 500k+ connections. They use a consistent hash ring to route the same user to the same notification server, so the server can track which users are connected.
4. Cache layer: The block store uses a CDN-like edge cache for frequently accessed blocks. The metadata service uses memcached clusters to cache file tree lookups.
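Two of the numbers above are easy to check in code: the block count for a large file and the mapping from user to shard. The shard count of 128 and the use of SHA-256 for the user hash are assumptions for illustration; the article only says there are 100+ shards keyed by a hash of user_id.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB blocks, as described above


def block_count(file_size):
    """Number of blocks for a file; the last block may be partial."""
    return -(-file_size // BLOCK_SIZE)  # ceiling division on integers


def shard_for_user(user_id, num_shards=128):
    """Map a user to a metadata shard via a stable hash of the user ID."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(block_count(10 * 1024**3))  # 2560
print(shard_for_user(12345))
```

A plain modulo mapping like this moves most users when `num_shards` changes, which is why resharding needs a dedicated rebalancing tool or an extra level of indirection between users and physical shards.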
- Top 1% of users account for 40% of block storage.
- 10% of users generate 80% of sync operations due to automated software syncing (e.g., iOS backups).
- Caching works well because reads are heavily skewed: most files are read once and never again (long tail), so a small cache of hot blocks absorbs most repeat reads.
- Hot blocks (popular shared files) are cached aggressively.
- Cold blocks (personal archives) live in slow, cheap storage.
Client Sync Engine: Local File Monitoring and Upload Pipeline
The Dropbox client runs as a background process on each device. It uses operating system file system event APIs (inotify on Linux, FSEvents on macOS, ReadDirectoryChangesW on Windows) to detect file changes in the designated Dropbox folder. When a change is detected, the client does not immediately upload. It waits for a quiet period (typically 100ms) to batch rapid edits (e.g., during save-as). Then it computes the file tree diff and determines which blocks have changed.
For new files, it chunks the entire file and uploads all blocks. For modifications, it uses content-defined chunking to detect changed block boundaries. Uploads are parallelized (up to 10 concurrent connections per file) and retried with exponential backoff. Each block upload includes a SHA-256 hash and file offset. The block store responds with a success or conflict.
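The parallel upload with retries can be sketched with a thread pool. The function names are hypothetical, `upload_block` stands for whatever call actually sends one block to the block store, and the jittered delay is a common addition to exponential backoff rather than something the article specifies.

```python
import concurrent.futures
import random
import time


def upload_with_retry(upload_block, block, max_attempts=5, base_delay=0.1,
                      sleep=time.sleep):
    """Call upload_block(block), retrying transient failures with backoff."""
    for attempt in range(max_attempts):
        try:
            return upload_block(block)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles on each attempt; jitter avoids synchronized retries.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))


def upload_file_blocks(upload_block, blocks, max_workers=10, **retry_opts):
    """Upload all blocks of one file, at most max_workers at a time."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(upload_with_retry, upload_block, block, **retry_opts)
            for block in blocks
        ]
        # Results come back in block order; any exhausted retry raises here.
        return [future.result() for future in futures]
```

Only after `upload_file_blocks` returns without raising should the client send the metadata update, since the server rejects hashes it cannot find in the block store.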
After all blocks are uploaded, the client sends a metadata update request to the server with the new file hash. The server validates that all block hashes exist in the block store, then commits the metadata change and appends to the changelog.
One critical detail: the client must handle the case where the server rejects the metadata update because another client already updated the same file. The client then receives the current server state and must merge or create a conflict copy.
- A 100ms quiet period batches multiple saves from the same application (e.g., auto-save in editors).
- If set too low, each keystroke triggers a full file scan and upload wave.
- If set too high, the user sees a delay between saving and the file appearing on other devices.
- Dropbox's default 100ms quiet period works well for most document types.
- Rule: make the quiet period configurable per device based on file type patterns.
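The quiet-period logic amounts to debouncing file events per path. The sketch below passes timestamps in explicitly so the behaviour is easy to follow; the class and method names are hypothetical, and a real client would drive this from a timer and a monotonic clock.

```python
class QuietPeriodBatcher:
    """Hold file events until a path has been quiet for `quiet_period` seconds."""

    def __init__(self, quiet_period=0.1):
        self.quiet_period = quiet_period
        self._last_event = {}  # path -> time of most recent change

    def on_event(self, path, now):
        # Every new event for a path restarts its quiet-period timer.
        self._last_event[path] = now

    def ready(self, now):
        """Return (and forget) paths with no events for a full quiet period."""
        done = sorted(
            path for path, last in self._last_event.items()
            if now - last >= self.quiet_period
        )
        for path in done:
            del self._last_event[path]
        return done


batcher = QuietPeriodBatcher(quiet_period=0.1)
batcher.on_event("notes.txt", now=0.00)  # editor writes a temp save
batcher.on_event("notes.txt", now=0.05)  # then the real save
print(batcher.ready(now=0.10))  # [] because only 50 ms have passed since the last event
print(batcher.ready(now=0.16))  # ['notes.txt']
```

Making `quiet_period` a constructor argument is what allows the per-device or per-file-type tuning suggested in the last bullet.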
The 4 MB Block That Crashed the Sync Engine
- Even with strong hashing, always verify block content on read for mission-critical data.
- Never trust that deduplication is lossless without an integrity check layer.
- Add a block-level CRC checksum stored alongside the hash for double verification.
- Always implement a background integrity check service that periodically verifies blocks against their hashes.
To measure raw throughput to the block store, run: dd if=/dev/zero bs=1M count=10 | nc -w5 blockstore-host 443
Common mistakes to avoid
Memorising syntax before understanding the concept
Skipping practice and only reading theory
Assuming deduplication is always lossless
Using push notifications instead of polling for sync
Not planning for clients that disappear for months
Using a single metadata database without sharding
Not handling simultaneous upload of the same block by two clients
Interview Questions on This Topic
How would you design a file synchronization service like Dropbox? Focus on the sync algorithm and conflict resolution.