Database Sharding — Why Sequential IDs Cause Hot Shards
One shard hit 95% CPU while others idled at 15% during a flash sale.
20+ years shipping high-throughput database systems. Drawn from code that ran under real load.
- Sharding splits data across multiple independent database instances (shards) to scale beyond a single server.
- A shard key determines which shard stores each row; queries must include it to avoid broadcasting.
- Hash sharding distributes writes evenly; range sharding supports efficient range queries but risks hot shards.
- Cross-shard queries are slow and distributed transactions are hard — design the shard key to keep related data together.
- Most apps should not shard — optimize indexes, add replicas, and scale vertically first.
Imagine a filing cabinet (one database) that gets too full. Sharding is like buying multiple cabinets and splitting your files across them. You need a rule (the shard key) to decide which file goes where. Looking up a file is fast if you know the rule. But if you need to search across all cabinets — like finding all files from a certain date — it's slow because you have to open every drawer.
Most applications never need to shard. A properly indexed PostgreSQL or MySQL instance can handle tens of millions of rows and thousands of queries per second. But when you hit the limits of vertical scaling — larger instances give diminishing returns — sharding becomes the path forward.
The tradeoff is real: sharding gives you horizontal scale but introduces significant operational complexity. Cross-shard queries are slow, transactions spanning shards are hard, and resharding (changing the number of shards) requires careful migration. Get the shard key wrong, and you'll trade one bottleneck for another.
Database Sharding — The Only Scalability Pattern That Actually Splits Data
Database sharding is a horizontal partitioning strategy where you split a single logical dataset across multiple independent database instances (shards). Each shard holds a subset of the data, determined by a shard key — a column or hash of a column that maps every row to exactly one shard. The core mechanic is that no single database node holds the full dataset; instead, the application or a proxy routes each query to the correct shard based on the shard key.
In practice, sharding trades consistency and query flexibility for write throughput and storage capacity. A well-chosen shard key distributes writes evenly across shards — ideally O(1) routing per operation. But if the shard key is poorly chosen (e.g., a monotonically increasing sequential ID), writes concentrate on a single shard, creating a "hot shard" that becomes a bottleneck. This defeats the purpose of sharding: instead of scaling horizontally, you end up with one overloaded node and many idle ones.
Use sharding when your dataset exceeds a single node's storage capacity (e.g., >1 TB) or when write throughput surpasses a single node's IOPS (e.g., >10k writes/second). It's not a default choice — it adds operational complexity (backups, resharding, cross-shard joins). But for systems like social feeds, IoT event ingestion, or multi-tenant SaaS, sharding is the difference between a system that scales and one that falls over at 2x growth.
How Sharding Works
The shard key is the column used to route data to a specific shard. Every query must include the shard key to avoid broadcasting to all shards. The application or a proxy layer uses a routing function (e.g., hash modulo) to determine the target shard. Each shard is an independent database instance with its own compute and storage.
In production, the shard key is usually the primary dimension of access — something you filter on in every query. User ID, tenant ID, or region ID are common choices. The key must be present in all queries; otherwise the system either fails or performs a full scan across all shards.
- A dictionary is effectively sharded by the first letter: 'apple' belongs to 'A', 'banana' to 'B'.
- Queries that don't specify a letter (e.g., 'find all fruits with 5 letters') require scanning every page — that's a cross-shard query.
- Hot shards happen when one letter (like 'S') contains vastly more words than others.
- Hash sharding is like assigning words to pages randomly — even distribution but you lose the ability to easily browse consecutive words.
Scalability Decision Tree: Sharding vs Vertical Scaling
Before deciding to shard, consider whether vertical scaling (upgrading the database server) or read replicas can solve your problem. Sharding adds operational complexity that is rarely justified for data sets under 10TB or write throughput under 10k writes/second. The following decision tree helps evaluate your options.
When to use vertical scaling: If your database fits on a single server and you can increase CPU/RAM/IOPS to meet demand, vertical scaling is simpler. Modern cloud databases (like RDS, Cloud SQL) offer instances with up to 128 vCPUs and 4TB RAM – enough for most workloads. Always benchmark your query patterns first.
When to consider read replicas: If your bottleneck is read queries (e.g., reporting, dashboards), adding replicas can absorb read traffic without sharding. This works well when writes are manageable on a single primary.
When sharding becomes necessary: Your data grows beyond what a single server can store (e.g., > 10TB). Your write throughput exceeds the I/O capacity of the largest instance. Read latency is unacceptable even with replicas because the working set doesn't fit in memory.
The decision tree below summarizes the evaluation:
- Is data size > 10TB? If no, use vertical scaling.
- If yes, is write throughput > single server capacity? If no, use read replicas + vertical scaling.
- If yes, is read latency acceptable with replicas? If yes, use read replicas. If no, consider sharding.
- If sharding, can related data be co-located on one shard? If yes, use hash sharding by primary access key. If no, evaluate denormalization or a search engine.
Hash vs Range Sharding
Hash sharding applies a hash function to the shard key and takes modulo the number of shards. It distributes rows evenly across shards, preventing hot shards. The trade-off: range queries (e.g., 'users joined in January') require broadcasting to all shards because the hash destroys order.
Range sharding assigns contiguous ranges of the shard key to each shard. Range queries can be routed to a single shard if the range fits within one boundary. The downside: hot shards appear when new data falls disproportionately on one range — for example, sequential user IDs always go to the last shard.
Consistent hashing (used by Cassandra, DynamoDB) maps data to a ring of hash values. When a shard is added or removed, only the data from the adjacent shard moves — no full rehash. This minimizes migration during resharding.
Sharding Strategy Comparison Matrix
Beyond the basic hash vs range dichotomy, there are several sharding strategies used in production. This section compares the most common ones: Hash, Range, Directory-based, and Consistent Hashing.
| Strategy | Write Distribution | Range Query Support | Resharding Cost | Complexity | Use Case |
|---|---|---|---|---|---|
| Hash | Even | Poor | High (full rehash) | Low | High write volume, any access pattern |
| Range | Skewed | Good | Medium (boundary split) | Medium | Read-heavy, predictable range queries |
| Directory | Configurable | Depends on key | Low (update lookup) | High | Multi-tenant dynamic allocation, frequent rebalancing |
| Consistent Hashing | Even | Poor | Low (minimal move) | Medium | Elastic scaling, frequent resharding |
Directory-based sharding uses a lookup table that maps each entity to a shard. This decouples the shard key from the physical shard, allowing you to reassign shards without touching the data. The lookup table must be replicated or distributed to avoid a single point of failure. This approach is common in multi-tenant SaaS platforms where tenants have different sizes and can be moved between shards.
Consistent hashing, used by Cassandra and DynamoDB, maps data to a ring of hash values. When a shard is added or removed, only data from adjacent shards moves, minimizing migration. However, it requires virtual nodes to handle load imbalance and adds complexity in failure scenarios.
When choosing a strategy, consider your write/read ratio, need for range queries, and how often you plan to add or remove shards. No single strategy fits all – the right choice depends on your specific workload patterns and operational constraints.
Shard Key Selection and Trade-offs
Choosing the wrong shard key is the most common sharding mistake. A good shard key has three properties: high cardinality (many distinct values), even distribution, and inclusion in most queries. Common choices are user_id, tenant_id, order_id, or a hash of email.
If you need both even distribution and efficient range queries over time, consider a composite shard key (e.g., hash(user_id) + month). The primary key includes the hash, and a secondary index supports the time range scan. This adds storage and complexity but may be worthwhile.
Another approach: use a lookup table that maps each shard key to a shard ID. This decouples routing from the key value, allowing you to change shards without modifying the key. However, the lookup becomes a potential bottleneck and single point of failure.
Cross-Shard Queries and Transactions
When a query doesn't include the shard key, the system must broadcast it to all shards and merge results — this is slow and resource-intensive. For range queries across shards, consider a secondary index or a search engine (Elasticsearch) that pre-indexes the data.
Distributed transactions (e.g., transfer money between users on different shards) require coordination — typically two-phase commit (2PC). 2PC is slow, blocks on failure, and reduces availability. Most production systems avoid cross-shard transactions by designing the data model so that related data lives on the same shard. For example, put all orders for a customer on the same shard as the customer record. If absolute consistency is needed, use saga patterns with compensating actions.
- A user's profile, orders, and payments should all land on the same shard via user_id shard key.
- If you need to join user and product data, either keep products on every shard (replicated) or accept cross-shard joins via materialized views.
- Cross-shard transactions are always slower and less reliable. Prefer eventual consistency with idempotent operations.
Distributed Transactions in Sharded Databases
When a transaction spans multiple shards, atomicity becomes a distributed systems problem. Two-Phase Commit (2PC) is the classic solution: a coordinator sends a prepare request to each participant; if all agree, it sends a commit. 2PC has well-known drawbacks. The coordinator becomes a single point of failure. If the coordinator crashes after prepare but before commit, participants remain locked until recovery. This can cause cascading failures and timeouts. The protocol also adds latency due to the two-round-trip overhead.
Given these issues, most production systems avoid distributed transactions by designing for data locality. If cross-shard transactions are unavoidable, the saga pattern is the preferred alternative. A saga breaks a transaction into a sequence of local transactions, each with a compensating action. If a step fails, the saga runs the compensating actions for all completed steps, rolling back to a consistent state. Sagas can be choreographed (each step triggers the next via events) or orchestrated (a central coordinator manages the flow).
When evaluating whether to use distributed transactions, consider the CAP trade-off. 2PC ensures strong consistency but sacrifices availability under partition (CP). Sagas provide availability and eventual consistency (AP with human reconciliation). In practice, AP is often acceptable for financial transfers with reconciliation, while CP is required for inventory deduplication in critical systems.
If you must use 2PC, minimize the number of participants and keep their prepare phases fast. Use low timeouts and have a manual override process for stuck transactions.
Resharding and Consistent Hashing
Resharding — changing the number of shards — is the most painful operation in a sharded system. With simple modulo hash, changing N to N+1 remaps nearly all rows (about 87.5% for 4->8). The system must read all data, rehash it, and write to new locations. This requires downtime or a complex double-write migration.
Consistent hashing dramatically reduces the amount of data that moves: only the data in the immediate neighbor of a new node is affected. This is why Cassandra and DynamoDB use it. However, consistent hashing adds complexity in handling node failures and virtual nodes (vnodes).
- Pre-splitting: Start with many virtual shards (e.g., 1024) that map to physical shards. To add a physical shard, reassign some virtual shards.
- Mirroring: Write all new data to both old and new shards simultaneously, then backfill old data.
- Read-repair: After migration, inconsistent reads are fixed lazily during read operations (Cassandra's approach).
Resharding Architectural Flow
Resharding involves three phases: preparation, migration, and cutover. The exact steps depend on the strategy and whether you use consistent hashing or modulo. Below is a typical resharding flow using a double-write pattern, which is the safest approach for production.
Phase 1: Preparation – Deploy the new shards. Set up replication so they are empty targets. Instrument your application with a feature flag to enable double-write. Create a migration coordination table to track progress.
Phase 2: Migration – Start writing new data to both old and new shards simultaneously. Then backfill existing data from old shards to new shards in batches. Use checkpointing to allow resumption after failures. Validate data integrity by comparing row counts and checksums.
Phase 3: Cutover – Once migration is complete and validated, point all reads to the new shards. Keep double-write active for a grace period in case of rollback. Finally, disable writes to old shards and decommission them.
The mermaid diagram below illustrates the flow with a double-write strategy.
Why Most Teams Get Shard Key Selection Wrong (And How to Fix It)
The shard key is the single most important decision in your sharding architecture. Pick wrong, and you'll rebuild everything. Most teams default to user_id or customer_id because it's obvious — but that's not always right.
The real question: what access pattern drives 90% of your reads? In a SaaS product, maybe it's tenant_id. In a messaging app, maybe it's conversation_id. Your shard key must align with your hottest query path. If every query crosses shards, you've built a distributed system that acts like a single database — except slower.
There's a second trap: uneven distribution. A range-based shard on "created_at" sounds smart until all new data lands on one shard. That's not sharding. That's a hot partition with extra complexity.
Choose a key with high cardinality and uniform distribution. Hash it. Route by the hash. Then design your shard key so that 95% of your queries resolve to one shard. That's not theory — that's the difference between a system that works at 10x scale and one that collapses at 2x.
Read-Through Caches: The Band-Aid That Became a Production Standard
Sharding solves write scaling, but reads still hurt when every shard handles its own cache. The fix: read-through caching with a distributed cache layer like Redis or Memcached positioned in front of your sharded database.
Here's the pattern: application queries cache first. On cache miss, fetch from the correct shard, populate cache, return. This keeps hot data off your database shards entirely. For read-heavy workloads (80/20 read/write split), this reduces database load by 60-80%.
But there's a catch — cache invalidation across shards is brutal. When a write updates a user's profile on shard 4, you must invalidate that user's cached record everywhere. Implement a TTL-based strategy (5-15 minutes) and use cache-aside with explicit invalidation on writes.
Don't try perfect consistency. Accept eventual consistency. Your shards have enough work to do without coordinating cache purges at transaction commit time. In production, we've seen teams burn weeks building distributed invalidation systems that a 10-minute TTL would have solved.
The Hot Shard That Took Down Checkout at Midnight
- Always simulate write patterns before choosing a shard key — sequential IDs kill range sharding.
- Use hash sharding or consistent hashing for write-heavy workloads.
- Monitor per-shard CPU and throughput; set alerts for imbalance.
SELECT shard FROM shard_map WHERE user_id = ?Check the router logs: tail -100 /var/log/shard_router.log | grep 'user_id=123'Key takeaways
Common mistakes to avoid
4 patternsChoosing a low-cardinality shard key
Forgetting to include the shard key in every query
Using modulo sharding with a future expectation to add shards
Not monitoring per-shard load and skew
Interview Questions on This Topic
What is a shard key and how do you choose a good one?
Frequently Asked Questions
20+ years shipping high-throughput database systems. Drawn from code that ran under real load.
That's Database Design. Mark it forged?
11 min read · try the examples if you haven't