Skip to content
Home Database Database Sharding

Database Sharding

Where developers are forged. · Structured learning · Free forever.
📍 Part of: Database Design → Topic 6 of 16
Database sharding explained — horizontal scaling, shard keys, range vs hash sharding, cross-shard queries, resharding challenges, and when to shard.
🔥 Advanced — solid Database foundation required
In this tutorial, you'll learn
Database sharding explained — horizontal scaling, shard keys, range vs hash sharding, cross-shard queries, resharding challenges, and when to shard.
  • Sharding enables horizontal scale — add more servers as data grows.
  • The shard key must be included in all queries — otherwise all shards must be scanned.
  • Hash sharding distributes evenly; range sharding allows range queries but risks hot shards.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer

Sharding splits data across multiple independent database instances (shards) to scale beyond what a single server can handle. Each shard holds a subset of the data. A shard key determines which shard holds each row. The application (or a proxy) routes queries to the correct shard based on the key.

How Sharding Works

The shard key is the column used to route data to a specific shard. Every query must include the shard key to avoid broadcasting to all shards.

Example · PYTHON
123456789101112131415161718192021222324
# Package: io.thecodeforge.python.system_design

# Simple shard router based on user_id
class ShardRouter:
    def __init__(self, shards):
        self.shards = shards  # list of database connections
        self.n = len(shards)

    def get_shard(self, user_id: int):
        """Hash sharding: user_id % number_of_shards."""
        return self.shards[user_id % self.n]

    def query_user(self, user_id: int, sql: str):
        shard = self.get_shard(user_id)
        return shard.execute(sql, user_id)

# Example: 4 shards
# user_id 1001 → shard 1001 % 4 = 1
# user_id 1002 → shard 1002 % 4 = 2
# user_id 1000 → shard 1000 % 4 = 0

# Cross-shard query — must broadcast to all shards (expensive)
# SELECT * FROM orders WHERE order_date = '2025-01-01'
# This hits ALL 4 shards and merges results in application layer
▶ Output
# user_id 1001 → shard index 1

Hash vs Range Sharding

Example · PYTHON
123456789101112131415161718192021222324
# Hash sharding: even distribution, no hot shards
# Weakness: range queries require broadcast
def hash_shard(user_id, n_shards):
    return user_id % n_shards

# Range sharding: range queries efficient, but hot shard risk
# Shard 0: user_ids 1 to 1,000,000
# Shard 1: user_ids 1,000,001 to 2,000,000
# etc.
RANGE_BOUNDARIES = [0, 1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id, boundaries):
    for i, boundary in enumerate(boundaries[1:], 1):
        if user_id < boundary:
            return i - 1
    return len(boundaries) - 1

# Hot shard problem with range sharding:
# If all new users have sequential IDs, new users always go to the LAST shard
# That shard gets all the write load — called a hot shard

# Consistent hashing: used by Cassandra, DynamoDB
# Solves resharding without reassigning all data
# When adding a shard, only the adjacent shard's data needs to move
▶ Output
# hash_shard(1001, 4) = 1

🎯 Key Takeaways

  • Sharding enables horizontal scale — add more servers as data grows.
  • The shard key must be included in all queries — otherwise all shards must be scanned.
  • Hash sharding distributes evenly; range sharding allows range queries but risks hot shards.
  • Cross-shard joins and transactions are hard — design your shard key to minimise them.
  • Most applications should NOT shard — optimise indexes, add read replicas, and scale vertically first.

Interview Questions on This Topic

  • QWhat is a shard key and how do you choose a good one?
  • QWhat is the hot shard problem and how do you prevent it?
  • QHow does consistent hashing help with resharding?

Frequently Asked Questions

How do you handle transactions that span multiple shards?

Two-phase commit (2PC) is the formal approach — a coordinator asks all shards to prepare, then commits if all agree. But 2PC is slow and reduces availability (blocked if any shard fails). Most large-scale systems avoid cross-shard transactions by designing the data model so transactions stay within one shard. Eventual consistency or saga patterns handle cases where cross-shard coordination is unavoidable.

What is resharding and why is it painful?

Resharding is changing the number of shards (e.g., 4 → 8). With simple hash sharding (user_id % n), changing n means nearly all data maps to a different shard — you must migrate ~87.5% of data. Consistent hashing minimises this: adding a shard only moves adjacent data. Most cloud databases (DynamoDB, Cassandra) handle resharding automatically.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← PreviousDatabase RelationshipsNext →Database Replication
Forged with 🔥 at TheCodeForge.io — Where Developers Are Forged