Senior 6 min · June 25, 2026

Design Reddit: Building a Scalable Social News Platform from Scratch

Q: How does Reddit handle millions of votes per second?

Reddit offloads votes to a message queue (Kafka) and batch updates the database every few seconds. The displayed score is eventually consistent. A Redis counter provides immediate read. This avoids write contention on the post row.

Q: What's the difference between fan-out-on-write and pull-on-read for feeds?

Fan-out-on-write pushes new posts to each follower's cache at write time, giving real-time feeds but high write overhead. Pull-on-read generates the feed on request by merging top posts from subscribed subreddits, which is simpler but slower. Use fan-out for active users, pull for lurkers.

Q: How do I design a comment tree that scales?

Use a materialized path (e.g., 'root.parent.child') stored as an ltree column. Index it with a GiST index. This allows fetching an entire thread with a single index scan. Avoid recursive CTEs — they don't scale.

Q: What happens if a subreddit gets too hot?

The subreddit's shard becomes a bottleneck. Mitigate by splitting the subreddit into multiple partitions using a hash of post_id. Use a two-level partition: subreddit_id + hash(post_id). Also, scale the shard vertically or add read replicas.

Learn how to design Reddit's core features: scalable feed, voting, comments, and subreddits.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

✓ Production

production tested

June 25, 2026

last updated

1,663

articles · all by Naren


⚡Quick Answer

Design Reddit by separating read and write paths: use a distributed database for posts/comments, a caching layer for hot feeds, and a message queue for asynchronous vote aggregation. Subreddits are partitions. The feed is precomputed and cached per user session.

✦ Definition · ~90s read

What is Design Reddit?

Design Reddit is the system design exercise of building a social news aggregation platform where users can submit content, vote, comment, and organize into communities (subreddits). It tests your ability to handle high write throughput, real-time feeds, and hierarchical data at scale.


Plain-English First

Imagine a giant corkboard where anyone can pin a note. People walk by and either thumbs-up or thumbs-down each note. The most popular notes float to the top. Now imagine millions of people doing this every second — you need a system that doesn't collapse. That's Reddit. You need a way to collect notes fast, count votes without fighting over the same paper, and show each person a personalized board without recalculating everything from scratch.

Reddit serves 430 million monthly active users. Every second, someone submits a post, casts a vote, or writes a comment. The naive approach — a single SQL database with joins — would collapse under its own weight within minutes. The real challenge isn't the feature set; it's the scale. This article walks you through building a Reddit clone that won't fall over when you hit the front page of Hacker News. You'll learn how to design the data model, handle the hot path of voting, serve personalized feeds, and partition subreddits — all with production-tested patterns. By the end, you'll be able to architect a social platform that handles millions of concurrent users without breaking a sweat.

Data Model: The Foundation That Won't Crumble

Start with the data model. Reddit's core entities are users, posts, comments, votes, and subreddits. The naive approach is a normalized relational schema with foreign keys everywhere. That works for a few thousand users. At Reddit's scale, joins become death. You need to denormalize strategically. For posts, store the subreddit ID, author ID, title, URL or text, and a score (denormalized from votes). For comments, use a materialized path — a string like 'root_id.parent_id.child_id' — so you can fetch an entire thread with a single index scan. Votes are the hottest path: every upvote/downvote is a write. Store them in a separate table with a composite primary key (user_id, post_id) to enforce one vote per user per post. But don't update the post score synchronously — that's a write bottleneck. Instead, use a message queue to batch vote updates. The subreddit is a partition key: shard posts and comments by subreddit ID. This keeps hot subreddits from starving cold ones.

schema.sql · SQL

-- io.thecodeforge — System Design tutorial

-- Posts table: denormalized score, partitioned by subreddit_id
CREATE TABLE posts (
    id BIGSERIAL,
    subreddit_id INT NOT NULL,
    author_id INT NOT NULL,
    title TEXT NOT NULL,
    url TEXT,
    body TEXT,
    score INT NOT NULL DEFAULT 0,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (subreddit_id, id)
) PARTITION BY HASH (subreddit_id);

-- Comments with materialized path
CREATE EXTENSION IF NOT EXISTS ltree;
CREATE TABLE comments (
    id BIGSERIAL,
    post_id BIGINT NOT NULL,
    author_id INT NOT NULL,
    body TEXT NOT NULL,
    path LTREE NOT NULL,  -- e.g., 'root.child.grandchild'
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (post_id, id)
);
CREATE INDEX idx_comments_path ON comments USING GIST (path);

-- Votes: one per user per post
CREATE TABLE votes (
    user_id INT NOT NULL,
    post_id BIGINT NOT NULL,
    vote SMALLINT NOT NULL CHECK (vote IN (-1, 1)),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, post_id)
);

Output

Tables created successfully.
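To make the materialized path idea concrete, here is a minimal in-memory sketch in Python. It assumes each comment's path is the dot-joined chain of ancestor IDs ending in its own ID. The names `Comment`, `child_path`, and `fetch_subtree` are hypothetical, and the prefix match only stands in for what a descendant query on the ltree column would do inside the database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Comment:
    id: int
    path: str  # dot-joined ancestor IDs ending in this comment's ID, e.g. "1.2.5"


def child_path(parent_path: Optional[str], comment_id: int) -> str:
    """Build the path for a new comment from its parent's path (None for top level)."""
    return str(comment_id) if parent_path is None else f"{parent_path}.{comment_id}"


def fetch_subtree(comments: List[Comment], root_path: str) -> List[Comment]:
    """Return the comment at root_path plus all descendants, parents before children."""
    prefix = root_path + "."
    matches = [c for c in comments if c.path == root_path or c.path.startswith(prefix)]
    # Sorting by the numeric labels gives a depth-first, parent-first order.
    return sorted(matches, key=lambda c: [int(label) for label in c.path.split(".")])


comments = [
    Comment(1, child_path(None, 1)),
    Comment(2, child_path("1", 2)),
    Comment(3, child_path("1", 3)),
    Comment(5, child_path("1.2", 5)),
    Comment(7, child_path(None, 7)),
]
thread = fetch_subtree(comments, "1.2")
```

The point of the design is that a subtree is a contiguous range of paths, so no per-level lookup is needed to walk from a comment to its replies.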

Production Trap:

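Don't use a single auto-increment ID for posts across all subreddits. You'll get hotspot writes on the index. Partition by subreddit_id to distribute write load, and treat the BIGSERIAL column in the schema above as a single-database simplification: once posts span multiple shards, IDs must be generated so they stay unique across shards.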

[Figure: Scalable Social News Platform Architecture]

[Figure: Normalized vs Denormalized Schema]

Voting: The Hot Path That Burns the Naive

Voting is the highest-write-throughput operation on Reddit. Every upvote or downvote is a write to the votes table and an update to the post's score. If you do both synchronously in a transaction, you'll serialize all writes to the same post. At scale, that's a bottleneck. The fix: decouple the vote recording from the score update. Write the vote to a fast append-only log (Kafka, Pulsar) and let a consumer batch-update the post score every few seconds. This means the displayed score is eventually consistent, but users don't notice a 5-second lag. For the vote count, use a counter in Redis (INCR/DECR) as a hot cache, and persist to Postgres asynchronously. This pattern is called a 'write-behind cache'. The classic rookie mistake is updating the post score in the same transaction as the vote insert. Every such transaction takes a row-level lock on the post row, so concurrent votes on a popular post queue up behind each other, and if transactions touch several rows in different orders you'll also see 'deadlock detected' errors.

vote_handler.py · Python

# io.thecodeforge — System Design tutorial

import json

import redis
from kafka import KafkaProducer

r = redis.Redis(host='redis-cluster', decode_responses=True)
producer = KafkaProducer(bootstrap_servers='kafka:9092',
                         value_serializer=lambda v: json.dumps(v).encode())

def cast_vote(user_id, post_id, vote_value):
    # Reject anything other than an upvote (1) or a downvote (-1)
    if vote_value not in (1, -1):
        raise ValueError("vote_value must be 1 or -1")
    # 1. Write to Kafka for durability and async processing
    producer.send('votes', {'user_id': user_id, 'post_id': post_id, 'vote': vote_value})
    # 2. Update Redis counter for immediate read.
    #    The counter is approximate: a repeated or changed vote is counted again here
    #    and is only reconciled when the consumer applies the batch to the database.
    key = f"post:{post_id}:score"
    r.incrby(key, vote_value)
    # 3. Return success immediately; don't wait for the DB
    return {"status": "accepted"}

Output

{"status": "accepted"}
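The consumer side of this pipeline is not shown above, so here is a minimal sketch of the batching step, assuming events arrive as dicts shaped like the ones produced in `cast_vote`. The function name `aggregate_votes` and the in-memory `current_votes` dict (standing in for the votes table) are assumptions for illustration. It collapses a batch into one net delta per post, so a repeated vote counts once and a changed vote swings the score by two.

```python
from collections import defaultdict


def aggregate_votes(events, current_votes):
    """Collapse a batch of vote events into one score delta per post.

    events: iterable of dicts with 'user_id', 'post_id', 'vote' (-1 or 1), in arrival order.
    current_votes: dict mapping (user_id, post_id) -> stored vote; updated in place.
    """
    deltas = defaultdict(int)
    for event in events:
        key = (event["user_id"], event["post_id"])
        previous = current_votes.get(key, 0)
        change = event["vote"] - previous
        if change:
            deltas[event["post_id"]] += change
            current_votes[key] = event["vote"]
    # Drop posts whose votes cancelled out within the batch.
    return {post_id: delta for post_id, delta in deltas.items() if delta}


batch = [
    {"user_id": 1, "post_id": 10, "vote": 1},
    {"user_id": 1, "post_id": 10, "vote": 1},   # duplicate, ignored
    {"user_id": 2, "post_id": 10, "vote": -1},
    {"user_id": 3, "post_id": 11, "vote": 1},
    {"user_id": 3, "post_id": 11, "vote": -1},  # changed vote
]
stored = {}
deltas = aggregate_votes(batch, stored)
```

Each entry in the result would then become a single `score = score + delta` update, which is what keeps the post row from being locked once per vote.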

Senior Shortcut:

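Use Redis INCR/DECR for the hot score cache. If you set a TTL (say 1 hour), reseed the key from the database score before incrementing once it has expired, because INCR on a missing key starts from 0. If Redis goes down, fall back to the database score. The eventual consistency is fine for a social platform.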

[Figure: Vote Write Path: Sync vs Async]

Feed Generation: Precompute or Die

Reddit's front page and subreddit feeds are the most-read data. Generating them on the fly by scanning all posts and sorting by score is impossible at scale. Instead, precompute the feed for each subreddit every few minutes and cache it. For the home feed (aggregate of subscribed subreddits), merge the top N posts from each subscribed subreddit's cached feed. Use a fan-out-on-write pattern for high-engagement users: when a user is active, push new posts from their subscribed subreddits into a per-user Redis list. For the rest, use a pull-based approach: on request, fetch the top posts from each subscribed subreddit's cache and merge them. The trade-off: fan-out writes more data but gives real-time feeds for active users. Pull is simpler and works for 99% of users. Reddit uses a hybrid: active users get fan-out, others get pull. The key insight: don't sort the entire post set. Use a hotness score that decays over time, and only keep the top 1000 posts per subreddit in cache.

feed_service.py · Python

# io.thecodeforge — System Design tutorial

import heapq
import json

import redis

r = redis.Redis(host='redis-cluster', decode_responses=True)

def get_home_feed(user_id, limit=25):
    # Try user-specific fan-out cache first
    feed_key = f"user:{user_id}:feed"
    cached = r.lrange(feed_key, 0, limit-1)
    if cached:
        return [json.loads(p) for p in cached]
    # Fall back to pull: merge top from each subscribed subreddit.
    # get_subscribed_subreddits is assumed to be defined elsewhere (reads from DB).
    subreddits = get_subscribed_subreddits(user_id)
    candidates = []
    for sub_id in subreddits:
        sub_feed = r.lrange(f"subreddit:{sub_id}:top", 0, 99)
        candidates.extend(json.loads(post) for post in sub_feed)
    # nlargest ranks by the key only, so posts with equal scores
    # never trigger a dict-to-dict comparison (which would raise TypeError)
    return heapq.nlargest(limit, candidates, key=lambda post: post['score'])

Output

[{'id': 123, 'title': '...', 'score': 4500}, ...]
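The hotness score mentioned above can be sketched as follows. This is modeled on the ranking function from Reddit's formerly open-sourced codebase as I understand it, so treat the two constants and the name `hot_score` as assumptions, not as a statement of what Reddit runs today. Votes contribute logarithmically, and post age contributes linearly, so newer posts steadily outrank older ones with the same votes.

```python
import math
from datetime import datetime, timezone

EPOCH_OFFSET = 1134028003  # assumed reference timestamp, in seconds
DECAY_SECONDS = 45000      # assumed: 12.5 hours of age is worth a 10x change in votes


def hot_score(ups: int, downs: int, created_at: datetime) -> float:
    """Rank value where votes count logarithmically and recency counts linearly."""
    net = ups - downs
    order = math.log10(max(abs(net), 1))
    sign = (net > 0) - (net < 0)
    seconds = created_at.timestamp() - EPOCH_OFFSET
    return round(sign * order + seconds / DECAY_SECONDS, 7)


older = datetime.fromtimestamp(EPOCH_OFFSET, tz=timezone.utc)
newer = datetime.fromtimestamp(EPOCH_OFFSET + DECAY_SECONDS, tz=timezone.utc)
# A post with 10 net votes that is 12.5 hours newer ties a post with 100 net votes.
tie = hot_score(100, 0, older) == hot_score(10, 0, newer)
```

Because the score is computed once from the vote counts and the creation time, it can be stored with the post and used directly as the sort key for the cached top-N list.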

Interview Gold:

When asked about feed generation, mention the fan-out-on-write vs pull trade-off. Say: 'Fan-out works for active users but wastes resources on lurkers. Reddit uses a hybrid: active users get pushed, others pull.' That shows depth.

Subreddits as Partitions: The Scalability Multiplier

Subreddits are the natural partition key. Each subreddit is independent: posts, comments, and votes are scoped to a subreddit. Shard your database by subreddit_id so that load on one subreddit stays on one shard. Use consistent hashing to map subreddit IDs to shards, which limits how much data moves when you add a shard. Hashing alone can still place a hot subreddit (like r/AskReddit) next to others, so keep a lookup table (subreddit_id -> shard_id) that can be updated without downtime and use it to pin hot subreddits to dedicated shards. For cross-subreddit queries (home feed), you query all relevant shards in parallel and merge results. This is a scatter-gather pattern. For the comment tree, comments are keyed by post_id, and a post belongs to exactly one subreddit, so the entire tree lives on one shard. That's fine: a single post's comment tree fits on one node. The gotcha: if a subreddit grows too hot (e.g., r/place), you may need to split it further. Use a two-level partition: subreddit_id, then hash of post_id.

shard_lookup.sql · SQL

-- io.thecodeforge — System Design tutorial

-- Lookup table for subreddit to shard mapping
CREATE TABLE subreddit_shard (
    subreddit_id INT PRIMARY KEY,
    shard_id INT NOT NULL
);

-- On application startup, load the mapping into memory
-- When a subreddit grows too hot, update its shard_id and migrate data asynchronously

Output

Table created.
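The mapping described above, consistent hashing with a lookup table for exceptions, can be sketched in application code roughly like this. The class name `ShardRouter`, the `pin` method, the virtual-node count, and the choice of SHA-256 are all assumptions for illustration, and shard IDs are assumed to be integers as in the table above.

```python
import bisect
import hashlib


class ShardRouter:
    """Maps subreddit IDs to shards via a hash ring, with explicit overrides."""

    def __init__(self, shard_ids, vnodes=64):
        self._vnodes = vnodes
        self._points = []     # sorted (hash, shard_id) pairs on the ring
        self._hashes = []     # the hashes alone, for bisect
        self._overrides = {}  # subreddit_id -> shard_id, mirrors the lookup table
        for shard_id in shard_ids:
            self.add_shard(shard_id)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add_shard(self, shard_id):
        # Each shard owns several points on the ring to even out the distribution.
        for i in range(self._vnodes):
            self._points.append((self._hash(f"shard:{shard_id}:{i}"), shard_id))
        self._points.sort()
        self._hashes = [h for h, _ in self._points]

    def pin(self, subreddit_id, shard_id):
        """Force a (hot) subreddit onto a specific shard."""
        self._overrides[subreddit_id] = shard_id

    def shard_for(self, subreddit_id):
        if subreddit_id in self._overrides:
            return self._overrides[subreddit_id]
        h = self._hash(f"subreddit:{subreddit_id}")
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._hashes, h) % len(self._hashes)
        return self._points[idx][1]


router = ShardRouter([0, 1, 2])
before = {s: router.shard_for(s) for s in range(1000)}
router.add_shard(3)
after = {s: router.shard_for(s) for s in range(1000)}
moved = [s for s in before if before[s] != after[s]]
```

In this sketch, adding shard 3 only moves subreddits onto the new shard. Nothing shuffles between the existing three, which is the property that keeps rebalancing cheap.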

Never Do This:

Don't use auto-increment IDs across shards. Use UUIDs or snowflake IDs for global uniqueness. Otherwise, you'll get collisions when merging data.
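For readers who have not seen a snowflake-style ID, here is a minimal sketch, assuming a 41-bit millisecond timestamp, a 10-bit worker ID, and a 12-bit per-millisecond sequence. The custom epoch, the bit widths, and the `SnowflakeGenerator` name are assumptions chosen for illustration. Real deployments pick their own layout and need a way to hand out unique worker IDs.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1_700_000_000_000  # assumed custom epoch, in milliseconds
WORKER_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


class SnowflakeGenerator:
    """Generates 63-bit, roughly time-ordered IDs that are unique per worker."""

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - CUSTOM_EPOCH_MS
            return (
                (timestamp << (WORKER_BITS + SEQUENCE_BITS))
                | (self._worker_id << SEQUENCE_BITS)
                | self._sequence
            )


ticks = iter([CUSTOM_EPOCH_MS + 1, CUSTOM_EPOCH_MS + 1, CUSTOM_EPOCH_MS + 2])
gen = SnowflakeGenerator(worker_id=5, clock=lambda: next(ticks))
a, b, c = gen.next_id(), gen.next_id(), gen.next_id()
```

Two generators with different worker IDs can never produce the same value, even in the same millisecond, because the worker ID occupies its own bits.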

Caching Strategy: Three Layers Deep

Reddit's read-to-write ratio is about 80:20. Caching is critical. Use three layers: CDN for static assets (images, CSS), Redis for hot data (feed, scores, user sessions), and Memcached for less hot data (user profiles, subreddit info). The feed cache is the most important. Cache the top 1000 posts per subreddit with a TTL of 5 minutes. For the home feed, cache the merged result per user for 1 minute. Use cache-aside pattern: on miss, load from DB and populate cache. For write-through, update cache on vote — but only for the post's score, not the entire feed. The feed is eventually consistent. The classic mistake: caching entire post objects in the feed. Cache only post IDs and scores, then fetch full post data on demand (with a separate cache). This reduces cache size and avoids cache invalidation on post edits.

cache_layer.py · Python

# io.thecodeforge — System Design tutorial

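import json

import redis

r = redis.Redis(host='redis-cluster', decode_responses=True)

def get_post_metadata(post_id):
    key = f"post:{post_id}:meta"
    cached = r.get(key)
    if cached:
        return json.loads(cached)
    # Cache miss: load from DB. `db` is assumed to be an application-level
    # database handle whose query() returns a JSON-serializable dict, or None.
    meta = db.query("SELECT id, title, score, author_id FROM posts WHERE id = %s", post_id)
    if meta is None:
        return None  # don't cache a missing post as the string "null"
    r.setex(key, 300, json.dumps(meta))  # 5 min TTL
    return meta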

Output

{'id': 123, 'title': '...', 'score': 4500, 'author_id': 42}

Production Trap:

Cache stampede: when a popular cache key expires, thousands of requests hit the DB simultaneously. Use a mutex lock or 'probabilistic early expiration' to refresh the cache before it expires.
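Probabilistic early expiration can be sketched in a few lines. The idea is that each request rolls a weighted die, and the closer the key is to expiry (and the more expensive the recompute), the more likely that one request refreshes it ahead of time. The function name `should_refresh_early` and the `beta` tuning knob are assumptions for illustration, and the caller is assumed to store the expiry time and the last recompute duration alongside the cached value.

```python
import math
import random
import time


def should_refresh_early(expires_at, recompute_seconds, beta=1.0, now=None, rng=random.random):
    """Decide whether this request should refresh the cache before it expires.

    expires_at: absolute expiry time of the cached value, in seconds.
    recompute_seconds: how long the last recompute took.
    beta: values above 1 refresh earlier, values below 1 refresh later.
    """
    now = time.time() if now is None else now
    # 1 - rng() lies in (0, 1], which keeps log() defined; the result is >= 0.
    head_start = -recompute_seconds * beta * math.log(1.0 - rng())
    return now + head_start >= expires_at


# With no randomness (rng returns 0), the check reduces to plain expiry.
plain_miss = should_refresh_early(expires_at=110, recompute_seconds=2, now=100, rng=lambda: 0.0)
plain_hit = should_refresh_early(expires_at=110, recompute_seconds=2, now=110, rng=lambda: 0.0)
```

Because the head start is random per request, only a small fraction of requests refresh early, which spreads the recompute out instead of letting every request miss at the same instant.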

Real-Time Comments and Notifications

Reddit needs real-time updates for comments and notifications. Use WebSockets for live comment threads. When a user posts a comment, the server broadcasts it to all clients viewing that post via a pub/sub channel (Redis Pub/Sub or Kafka). For notifications (replies, mentions), use a message queue to decouple the notification sending from the comment write. The notification service consumes from the queue and pushes to the user via WebSocket or push notification. The gotcha: if a post has 10,000 viewers, broadcasting to all of them can overwhelm the server. Use a fan-out approach: each WebSocket server subscribes to a channel per post, and the pub/sub system distributes the message to all servers. Don't send the full comment object — send only the comment ID, and let the client fetch the details.

comment_broadcast.py · Python

# io.thecodeforge — System Design tutorial

import json

# The asyncio client ships with redis-py 4.2 and later
import redis.asyncio as redis

r = redis.Redis(host='redis-cluster', decode_responses=True)

async def broadcast_comment(post_id, comment_id):
    channel = f"post:{post_id}:comments"
    # Publish comment ID to all subscribers
    await r.publish(channel, json.dumps({'comment_id': comment_id}))

# On WebSocket connect, subscribe to the post's channel
async def handle_websocket(ws, post_id):
    channel = f"post:{post_id}:comments"
    # One PubSub object per connection, so subscriptions aren't shared across clients
    async with r.pubsub() as pubsub:
        await pubsub.subscribe(channel)
        async for message in pubsub.listen():
            if message['type'] == 'message':
                await ws.send(message['data'])

Output

Comment broadcasted to all viewers.

Senior Shortcut:

Use Redis Streams instead of Pub/Sub for comment broadcasting. Streams persist messages, so if a client disconnects and reconnects, they can catch up on missed comments.

Search: Full-Text Search at Scale

Reddit's search needs to index posts, comments, and subreddits. Use Elasticsearch for full-text search. Index posts with fields: title, body, subreddit_id, author_id, score, created_at. Use a separate index for comments. For subreddit search, use a lightweight autocomplete index (e.g., Elasticsearch's completion suggester). The challenge: keeping the index in sync with the database. Use a change data capture (CDC) pipeline: Debezium reads the Postgres WAL and pushes changes to Kafka, then a consumer updates Elasticsearch. This ensures near-real-time search without dual-write complexity. The classic mistake: indexing every comment immediately. That's expensive. Instead, index comments in batches every 30 seconds. Search results can be slightly stale — users don't notice.

search_indexer.py · Python

# io.thecodeforge — System Design tutorial

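import json

from elasticsearch import Elasticsearch
from kafka import KafkaConsumer

es = Elasticsearch(['http://elasticsearch:9200'])
consumer = KafkaConsumer('post-changes', bootstrap_servers='kafka:9092')

def index_post(post):
    doc = {
        'title': post['title'],
        'body': post['body'],
        'subreddit_id': post['subreddit_id'],
        'author_id': post['author_id'],
        'score': post['score'],
        'created_at': post['created_at']
    }
    # `document=` is the keyword in recent client versions; older 7.x clients use `body=`
    es.index(index='posts', id=post['id'], document=doc)

for msg in consumer:
    if msg.value is None:
        continue  # tombstone record (emitted after a delete); nothing to index
    # Assumes the topic carries the flattened row; a raw Debezium envelope
    # would need the row extracted from its "after" field first
    post = json.loads(msg.value)
    index_post(post)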

Output

Post indexed in Elasticsearch.
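The batching advice for comments can be sketched as a small buffer that flushes when it is either full or old enough. The `BatchBuffer` name, the thresholds, and the injected `flush` callable are assumptions for illustration. In a real indexer the callable would hand the batch to a bulk request, for example the bulk helper in the Elasticsearch Python client.

```python
import time


class BatchBuffer:
    """Collects items and flushes them in one call when full or older than max_age."""

    def __init__(self, flush, max_items=500, max_age_seconds=30.0, clock=time.monotonic):
        self._flush = flush
        self._max_items = max_items
        self._max_age = max_age_seconds
        self._clock = clock
        self._items = []
        self._oldest = None  # time the first item of the current batch arrived

    def add(self, item):
        if not self._items:
            self._oldest = self._clock()
        self._items.append(item)
        self.flush_if_due()

    def flush_if_due(self):
        """Call this from add() and from a periodic timer so quiet periods still flush."""
        if not self._items:
            return
        too_many = len(self._items) >= self._max_items
        too_old = self._clock() - self._oldest >= self._max_age
        if too_many or too_old:
            batch, self._items = self._items, []
            self._flush(batch)


now = [0.0]
flushed = []
buf = BatchBuffer(flushed.append, max_items=3, max_age_seconds=30, clock=lambda: now[0])
buf.add("a")
buf.add("b")
now[0] = 31.0
buf.flush_if_due()  # the batch is older than 30 seconds, so it is flushed
```

The buffer trades a bounded delay (30 seconds here) for far fewer indexing requests, which matches the staleness the text already accepts for search.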

Production Trap:

Don't use the same Elasticsearch cluster for logging and search. They have different performance characteristics. Logging is write-heavy, search is read-heavy. Separate clusters prevent resource contention.

When Not to Use This Architecture

This architecture is overkill for a small community (<10k users). For a small Reddit clone, a single Postgres instance with proper indexing and caching (Redis) is sufficient. The partitioning, Kafka, and Elasticsearch add operational complexity. Only invest in this when you have >1M DAU or anticipate rapid growth. Also, if your content is ephemeral (e.g., disappearing posts), you can skip the search index and use a simpler in-memory store. Another case: if you don't need real-time feeds, you can generate feeds on request with a simple SQL query and cache the result. The fan-out-on-write pattern is only justified when users expect sub-second feed updates. For most apps, a 5-minute stale feed is acceptable.

Interview Gold:

When asked 'When would you not use this design?', say: 'For a small community, a monolith with Postgres and Redis is fine. The distributed architecture is for scale, not for simplicity.' Shows you understand trade-offs.

● Production incident · Post-mortem · Severity: high

The 4GB Container That Kept Dying

Symptom

Postgres CPU at 100%, queries timing out after 30s. The team saw 'ERROR: out of shared memory' in logs.

Assumption

Assumed it was a memory leak in the application layer. Restarted pods multiple times.

Root cause

A single subreddit (r/pics) had 2 million comments on one post. The recursive CTE used to fetch the comment tree consumed all work_mem (default 4MB) and spilled to disk, thrashing I/O.

Fix

Set work_mem = '64MB' and replaced recursive CTE with a materialized path approach using ltree extension. Added an index on path column.

Key lesson

Recursive queries on hierarchical data at scale are a trap.
Always use materialized paths or nested sets for comment trees.

Production debug guide

Systematic recovery paths for the failure modes engineers actually hit. 3 entries.

Symptom · 01

Feed returning stale data for >5 minutes

Fix

1. Check Redis TTL on subreddit feed keys.
2. Verify feed generation cron job is running.
3. Check Kafka consumer lag for vote updates.
4. Restart feed generator if lag >1000.

Symptom · 02

Vote not reflected after 10 seconds

Fix

1. Check Kafka producer for errors.
2. Check Redis score key exists.
3. Check vote consumer is processing.
4. If consumer stuck, restart it and replay from last offset.

Symptom · 03

Comment tree loading slowly

Fix

1. Check if ltree index exists on comments.path.
2. Run EXPLAIN ANALYZE on the query.
3. If index scan is missing, create index.
4. If still slow, increase work_mem.

★ Design Reddit Triage Cheat Sheet

First-response commands for when things go wrong, copy-paste ready.

Feed empty for all users

Immediate action

Check Redis connectivity and key existence

Commands

redis-cli -h redis-cluster --scan --pattern 'subreddit:*:top' | head -5

redis-cli -h redis-cluster lrange 'subreddit:1:top' 0 4

Fix now

Restart feed generator: docker-compose restart feed-generator


Feature / Aspect	Fan-Out on Write	Pull on Read
Latency	Real-time (<1s)	Minutes stale
Write overhead	High (write to many caches)	Low (write once)
Read overhead	Low (cache hit)	High (merge from many sources)
Best for	Active users, high engagement	Lurkers, low engagement
Implementation complexity	High	Low

Key takeaways

Decouple vote writes from score updates using a message queue: this prevents row locks and deadlocks.

Use materialized paths (ltree) for comment trees: avoid recursive CTEs that kill performance.

Partition by subreddit_id to isolate hot subreddits and scale horizontally.

Cache feed aggressively with a hybrid fan-out/pull strategy: precompute for active users, pull for the rest.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How would you handle a sudden spike in votes on a single post (e.g., a front-page post)?

ANSWER

Offload votes to a message queue (Kafka) and batch update the score. Use Redis INCR for immediate read. The database write is async. This prevents the post row from becoming a hot spot.

Q02 · SENIOR

When would you choose fan-out-on-write over pull-on-read for feed genera...

Q03 · SENIOR

What happens when a subreddit becomes extremely hot (e.g., r/place)? How...

Q04 · JUNIOR

How does Reddit ensure one vote per user per post?

Q05 · SENIOR

A user reports that their comment disappeared after posting. How do you ...

Q06 · SENIOR

Design a system to show the top 10 posts of all time across all subreddi...

