Advanced 4 min · March 06, 2026

YouTube System Design — Surviving Hot-Key Cache Meltdowns

Q: What is Design YouTube in simple terms?

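Design YouTube is a classic system design exercise: sketch the architecture behind a video platform, covering upload, transcoding, storage, global delivery, and recommendations. Interviewers use it to see how you trade off consistency, availability, latency, and cost at very large scale.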

Q: Why is transcoding needed for YouTube videos?

Transcoding converts the uploaded video into multiple resolutions and codecs so it can be played back on different devices (phone, tablet, TV) and network conditions (4G, WiFi). Without it, a video would only work in the exact format it was uploaded, which may not be supported on all devices or may buffer excessively on slow connections.

Q: How does YouTube handle a video that goes viral instantly?

YouTube uses request collapsing to prevent a cache stampede — only one edge node fetches a video segment from the origin; others wait in queue. Additionally, CDN pre-positions predicted popular content, and per-video rate limits at the origin protect the metadata database from being overwhelmed. Circuit breakers can drop new requests for a hot video if load exceeds a threshold.

Q: What database does YouTube use for video metadata?

YouTube historically used MySQL sharded across many servers (with Vitess as a management layer). They also use a distributed Memcached cache in front to handle the massive read load. For watch history and other high-throughput, eventually-consistent data, they use Bigtable-style storage.

80% cache-miss rates on viral videos collapse origin servers.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

✓ Production

production tested

July 27, 2026

last updated

1,713

articles · all by Naren

Before you start ⏱ 30 min

✓ Deep production experience
✓ Understanding of internals and trade-offs
✓ Experience debugging complex systems

⚡Quick Answer

Video upload pipeline ingests 500+ hours/min using chunked uploads and resumable protocols.
Distributed transcoding farm converts each video into multiple resolutions, codecs, and bitrates.
CDN with edge caches delivers video segments globally; hot keys need request collapsing.
Metadata stored in horizontally sharded MySQL with a distributed cache (Memcached/Redis) for reads.
Recommendation engine uses a two-tower neural network trained on watch history, likes, and real-time signals.
Production insight: one viral video can trigger a cache stampede — design for hot-key isolation and circuit breakers.

✦ Definition · ~90s read

What is Design YouTube?

Design YouTube is a system design exercise that forces you to reason about every tier of a modern distributed system: massive ingestion, compute-heavy processing, global content delivery, high-volume metadata storage, and a machine-learning driven feed. It's not about building a video player — it's about how you keep the entire pipeline running when 500 hours of new video arrive every minute and 2 billion people expect those videos to load in under 2 seconds.

The interviewers aren't testing your knowledge of video codecs; they're testing your ability to make trade-offs between consistency, availability, latency, and cost at planetary scale.

Plain-English First

Imagine YouTube is a massive TV station where anyone can be a broadcaster. When you record a show and send it in, a team of editors converts it into dozens of different formats (for old TVs, new 4K TVs, slow internet connections). Then copies of your show get shipped to warehouses all over the world so your neighbor can watch it instantly without the signal having to travel from Hollywood every time. The website itself is like a giant card catalogue that helps 2 billion people find the right show at the right time.

YouTube ingests over 500 hours of new video every single minute and streams to more than 2 billion logged-in users per month. It is one of the most infrastructure-intensive products ever built, combining a real-time ingest pipeline, a distributed transcoding farm, a globally replicated CDN, a petabyte-scale metadata store, and a machine-learning recommendation engine, all working in concert. Getting any one of those layers wrong at scale means buffering wheels, failed uploads, or a recommendation feed that drives users away. This is exactly why 'Design YouTube' is a staple in senior engineering interviews at Google, Meta, Amazon, and Netflix.

What is Design YouTube?

upload-service.yaml · YAML

# io.thecodeforge — Upload service configuration
appName: upload-service
port: 8080
protocol: http
chunkedUpload:
  enabled: true
  maxChunkSize: 5MB
  resumable: true
storage:
  type: cloudStorage
  bucket: youtube-uploads
  region: us-east1
dependencies:
  metadataDB:
    host: metadata-cluster-proxy
    port: 3306
  transcodingQueue:
    type: pubsub
    topic: transcoding-jobs
    subscriptionPrefix: upload-

Output

Service configured with chunked upload at 5MB segments, resumable via session tokens.

🔥Forge Tip:

Don't just memorize this pipeline — reason about failure modes. Ask yourself: what happens when the object store goes down? When a transcoding job hangs? When a region loses power?

📊 Production Insight

The upload service is the first point of failure. If it goes down, no new content enters the system.

A single network partition between the upload service and the metadata DB can cause silent data loss.

Rule: always store the uploaded chunk in blob storage before writing metadata — never the reverse order.

🎯 Key Takeaway

Design for upload durability first.

Metadata consistency comes second.

If you lose the video blob, you lose everything — the metadata is meaningless.

thecodeforge.io

Design Youtube

Video Upload Pipeline: Handling 500 Hours Per Minute

The upload pipeline must accept a stream of bytes from an unreliable client (the user's browser or mobile app), verify integrity, store it durably, and then hand it to the transcoding system. YouTube uses chunked, resumable uploads: the client splits the video into 5 MB chunks and sends each one with a session ID and a byte offset. The upload service writes each chunk to a blob store (like Google Cloud Storage or S3) and then records progress in the session store. If the connection drops, the client resumes from the last acknowledged offset. The upload service itself is stateless: session state lives in a distributed cache (Redis), so any server can continue the session. At peak, YouTube handles millions of concurrent uploads, which requires the blob store to scale horizontally and the upload service to apply back-pressure well.
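The offset handshake described above can be sketched in a few lines. This is a minimal in-memory illustration, not YouTube's protocol: `UploadService`, `UploadSession`, and `put_chunk` are hypothetical names, and plain dicts stand in for Redis and the blob store. It shows the two properties that matter: the blob write happens before progress is recorded, and a retried chunk is answered with the offset to resume from.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, the chunk size used in the text


@dataclass
class UploadSession:
    """Hypothetical session record; in production this would live in Redis."""
    session_id: str
    total_size: int
    acked_offset: int = 0                       # next byte the server expects
    chunks: dict = field(default_factory=dict)  # offset -> bytes (blob store stand-in)


class UploadService:
    def __init__(self):
        self.sessions = {}  # stand-in for the shared session store

    def start(self, session_id: str, total_size: int) -> UploadSession:
        self.sessions[session_id] = UploadSession(session_id, total_size)
        return self.sessions[session_id]

    def put_chunk(self, session_id: str, offset: int, data: bytes) -> int:
        """Accept one chunk and return the next expected offset."""
        s = self.sessions[session_id]
        if offset != s.acked_offset:
            # Duplicate or out-of-order chunk: tell the client where to resume.
            return s.acked_offset
        s.chunks[offset] = data              # 1. durable blob write first
        s.acked_offset = offset + len(data)  # 2. only then record progress
        return s.acked_offset

    def is_complete(self, session_id: str) -> bool:
        s = self.sessions[session_id]
        return s.acked_offset == s.total_size


svc = UploadService()
payload = bytes(12 * 1024 * 1024)  # 12 MB of zeros as a fake video
svc.start("sess-1", len(payload))

next_offset = svc.put_chunk("sess-1", 0, payload[:CHUNK_SIZE])
# Simulate a dropped connection: the client re-sends the same chunk.
retry_offset = svc.put_chunk("sess-1", 0, payload[:CHUNK_SIZE])
while not svc.is_complete("sess-1"):
    chunk = payload[next_offset:next_offset + CHUNK_SIZE]
    next_offset = svc.put_chunk("sess-1", next_offset, chunk)

print(next_offset == len(payload), retry_offset == CHUNK_SIZE)
```

Because the server owns the acknowledged offset, the client never has to guess what was persisted; it simply asks and continues from there.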

📊 Production Insight

Upload services often fail under high concurrency because they hold open HTTP connections for minutes.

Memory per connection adds up — at 100K concurrent uploads, 10MB per connection = 1TB RAM.

Rule: use asynchronous I/O (non-blocking) and stream chunks directly to blob storage without buffering the whole file in memory.

🎯 Key Takeaway

Chunked upload with resumable offsets is the only way to handle unreliable clients at scale.

The upload service must be stateless — all session state in Redis.

Never buffer a whole video in application memory.

Transcoding at Scale: Encoding Pipeline and Job Distribution

Once a video is stored in blob storage, it must be transcoded into dozens of output formats: multiple resolutions (144p to 4K), codecs (H.264, H.265, VP9, AV1), and adaptive bitrate renditions. YouTube runs a distributed transcoding farm: a pool of workers that pull jobs from a message queue (Pub/Sub or Kafka). Each job describes the input path, the output profiles, and a callback to invoke when it is done. Workers are typically GPU- or CPU-optimized instances that run FFmpeg or custom encoders. The orchestrator monitors job progress, handles retries on failure, and triggers a webhook when all renditions are ready. The key challenge is parallelism: a 1-hour video can take 30 minutes to transcode serially. YouTube splits the video into short, GOP-aligned segments (e.g., about 6 seconds each), transcodes them in parallel, then merges the outputs with a concat demuxer.
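A rough sketch of the split, fan-out, and regroup flow described above. Everything here is illustrative: `plan_segments`, `transcode_segment`, and the three-entry `PROFILES` list are assumptions, and the "transcode" step only returns an output name where a real worker would invoke an encoder. A thread pool stands in for the queue-fed worker fleet.

```python
from concurrent.futures import ThreadPoolExecutor

SEGMENT_SECONDS = 6
PROFILES = ["144p", "720p", "2160p"]  # illustrative subset of the rendition ladder


def plan_segments(duration_s, segment_s=SEGMENT_SECONDS):
    """Split [0, duration) into (index, start, end) tuples."""
    segments, start, index = [], 0.0, 0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((index, start, end))
        start, index = end, index + 1
    return segments


def transcode_segment(job):
    """Stand-in for an encoder invocation; returns the output object name."""
    index, _start, _end, profile = job
    return f"seg-{index:05d}-{profile}.mp4"


def transcode(duration_s, workers=8):
    # One job per (segment, profile) pair, so every pair can run in parallel.
    jobs = [(i, s, e, p) for (i, s, e) in plan_segments(duration_s) for p in PROFILES]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(transcode_segment, jobs))  # map preserves job order
    # Group per profile in segment order, ready for a concat step.
    return {p: [o for o in outputs if o.endswith(f"-{p}.mp4")] for p in PROFILES}


renditions = transcode(duration_s=20)
print(len(plan_segments(20)), renditions["720p"])
```

A 20-second input yields four segments (the last one is 2 seconds long), and each profile ends up with its four outputs in playback order.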

📊 Production Insight

FFmpeg on a memory-constrained worker can OOM — limit concurrent jobs per worker.

Network timeouts in blob storage reads during transcoding cause aborted jobs that waste compute.

Rule: segment videos before transcoding, and use distributed caching (e.g., memcached) for intermediate segment results.

🎯 Key Takeaway

Segmented parallel transcoding is mandatory for large videos.

Use a message queue with at-least-once delivery and retry with backoff.

Monitor job processing time per segment; outliers indicate node issues.
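At-least-once delivery means a worker can receive the same job twice, so retries are only safe when the handler is idempotent. The sketch below pairs capped exponential backoff (with full jitter) and a dedupe-by-job-ID handler. The names `backoff_delays` and `IdempotentWorker`, the job ID format, and the default delays are assumptions for illustration; a dict stands in for a durable completion store.

```python
import random


def backoff_delays(max_attempts, base_s=1.0, cap_s=60.0, rng=random.random):
    """Yield one capped exponential delay with full jitter per retry."""
    for attempt in range(max_attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng() * ceiling


class IdempotentWorker:
    """Dedupe on job ID so a redelivered job does not redo expensive work."""

    def __init__(self):
        self.completed = {}  # job_id -> result (stand-in for a durable store)

    def handle(self, job_id, run):
        if job_id in self.completed:
            return self.completed[job_id]  # redelivery: return the stored result
        result = run()
        self.completed[job_id] = result
        return result


calls = []
worker = IdempotentWorker()
worker.handle("video42/seg3/720p", lambda: calls.append(1) or "done")
worker.handle("video42/seg3/720p", lambda: calls.append(1) or "done")  # redelivered

# Passing rng=lambda: 1.0 removes the jitter so the ceilings are visible.
ceilings = [round(d, 1) for d in backoff_delays(5, rng=lambda: 1.0)]
print(len(calls), ceilings)
```

With jitter enabled, each delay is drawn uniformly between zero and the ceiling, which keeps a fleet of failed workers from retrying in lockstep.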

CDN and Global Delivery: Getting Video to 2 Billion Users

YouTube serves most video bytes directly from its CDN, which has thousands of edge nodes worldwide. Each video is split into segments (typically 6 seconds). When a user hits play, the player requests a manifest (M3U8 or DASH) and then fetches segments sequentially. The CDN routes the request to the nearest edge cache; if missing, it fetches from the origin server or a peer edge. To avoid cache stampedes on hot videos, YouTube uses request collapsing — only one request per segment goes to the origin; others wait in a queue. Additionally, YouTube pre-positions popular content on edge caches during off-peak hours. The delivery also includes several layers: DNS routing to the best edge, TCP optimization (BBR congestion control), and QUIC protocol for faster connection establishment.
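Request collapsing is easiest to see in code. The following is a minimal single-process sketch using `asyncio`; `SingleFlight`, `fetch_segment`, and the segment key format are hypothetical, and a real edge cache would add timeouts, error handling, and a cache write after the fetch. The first caller for a key starts the origin fetch, and every concurrent caller awaits that same task.

```python
import asyncio


class SingleFlight:
    """Collapse concurrent fetches for the same key into one origin call."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._in_flight = {}  # key -> asyncio.Task

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First caller becomes the leader and starts the origin fetch.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later misses refetch.
            task.add_done_callback(lambda _t: self._in_flight.pop(key, None))
        # The leader and all followers await the same task.
        return await task


origin_calls = 0


async def fetch_segment(key):
    global origin_calls
    origin_calls += 1
    await asyncio.sleep(0.01)  # simulated origin latency
    return f"bytes-of-{key}".encode()


async def main():
    edge = SingleFlight(fetch_segment)
    requests = (edge.get("video42/seg-00007") for _ in range(1000))
    return await asyncio.gather(*requests)


results = asyncio.run(main())
print(origin_calls, len(results))
```

A thousand simultaneous viewers produce a single origin request here, which is the property that protects the origin during a viral spike.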

📊 Production Insight

A single hot video can cause a cache stampede that takes down the entire CDN origin infrastructure.

Cross-region origin fetches add 50-200ms latency — enough to cause rebuffering.

Rule: implement request collapsing at every cache layer, and set per-video rate limits at the origin.

🎯 Key Takeaway

CDN is the backbone of video delivery; design for cache misses, not hits.

Pre-positioning of predicted popular content can reduce the cache-miss rate by as much as 80%.

Always measure segment-level cache hit ratio, not just aggregate.

Metadata Storage: Database Architecture for 2B Users

YouTube's metadata layer stores video metadata (title, description, tags), user profiles, watch history, comments, and likes. The write volume is massive: every second, users upload, comment, like, and update playlists. Read volume is even larger, because each view triggers multiple metadata reads. YouTube uses a horizontally sharded MySQL database (Vitess is a common choice) with range-based sharding on video ID. Caching is critical: a distributed Memcached layer (or Redis) absorbs the majority of reads. Writes go through a write-back cache to handle spikes. Consistency is traded for availability: a comment may not appear for a few seconds after posting. For watch history, YouTube uses Bigtable-like storage for high throughput and eventual consistency. The metadata layer must also handle fan-out writes: when a celebrity uploads, their subscribers' feeds need updating. YouTube uses a hybrid push-pull model: push to active subscribers, pull for inactive ones.
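The hybrid push-pull fan-out can be illustrated with a small in-memory model. `FeedService` and its fields are hypothetical, and "active" is reduced to a simple set membership; a real system would derive it from recent logins and keep these structures in sharded storage. On publish, only active subscribers get an inbox write; inactive users have their feed assembled from channel uploads when they come back.

```python
from collections import defaultdict


class FeedService:
    def __init__(self):
        self.subscribers = defaultdict(set)    # channel -> user IDs
        self.subscriptions = defaultdict(set)  # user -> channels
        self.active_users = set()
        self.inbox = defaultdict(list)         # user -> pushed video IDs
        self.uploads = defaultdict(list)       # channel -> video IDs (pull source)

    def subscribe(self, user, channel, active=True):
        self.subscribers[channel].add(user)
        self.subscriptions[user].add(channel)
        if active:
            self.active_users.add(user)

    def publish(self, channel, video_id):
        self.uploads[channel].append(video_id)
        for user in self.subscribers[channel]:
            if user in self.active_users:
                # Push: precompute the feed entry for users likely to read it.
                self.inbox[user].append(video_id)

    def feed(self, user):
        if user in self.active_users:
            return list(self.inbox[user])
        # Pull: inactive users pay the read cost only when they return.
        return [v for ch in sorted(self.subscriptions[user]) for v in self.uploads[ch]]


svc = FeedService()
svc.subscribe("alice", "celeb", active=True)
svc.subscribe("bob", "celeb", active=False)
svc.publish("celeb", "v1")

print(svc.feed("alice"), svc.feed("bob"), len(svc.inbox))
```

Both users see the new video, but only one inbox write happened. For a channel with millions of mostly dormant subscribers, that difference is the whole point of the hybrid model.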

📊 Production Insight

Cache invalidation is the hardest problem — stale metadata (e.g., old video title) can persist for minutes.

Shard rebalancing when adding new nodes can cause cascading failures if not done with live migration.

Rule: always use a write-back cache with bounded staleness (e.g., 5 seconds TTL).

🎯 Key Takeaway

Shard your metadata store by video ID and use memcached for reads.

Cache invalidation is the source of most bugs — accept eventual consistency and design for it.

Monitor cache hit rate and shard utilization daily; rebalance before hotspots form.
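Bounded staleness on the read path can be sketched with a small TTL cache. Note that this is a simplified read-through cache, not the write-back cache mentioned above, and `TTLCache` with its injectable `clock` is a hypothetical helper. The injectable clock exists only so the example can advance time without sleeping.

```python
import time


class TTLCache:
    def __init__(self, ttl_s=5.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, load):
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]          # fresh enough: serve from cache
        value = load(key)          # miss or expired: go to the database
        self._entries[key] = (now + self.ttl_s, value)
        return value


db = {"v1": "Old title"}   # stand-in for the sharded metadata store
fake_now = [0.0]
cache = TTLCache(ttl_s=5.0, clock=lambda: fake_now[0])

first = cache.get("v1", db.get)
db["v1"] = "New title"     # the creator edits the title
fake_now[0] = 3.0
stale = cache.get("v1", db.get)   # still inside the 5-second window
fake_now[0] = 5.5
fresh = cache.get("v1", db.get)   # TTL expired, reloaded from the database

print(first, "|", stale, "|", fresh)
```

The old title survives for at most one TTL, which is the "bounded" part: readers may see stale data, but never for longer than the configured window.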

Recommendation System: How YouTube Knows What You Want

YouTube's recommendation system is a massive two-tower neural network that learns user and video embeddings. One tower encodes user signals (watch history, search history, time-of-day, device) into a fixed-size vector; the other tower encodes video features (title, description, uploader, viewing patterns). The dot product of these vectors scores relevance. At serving time, YouTube retrieves the top-N candidate videos from a nearest neighbor index (e.g., ScaNN) over billions of videos. Then a second-stage deep ranking model re-ranks the candidates using richer features (like predicted watch time, like probability, and user satisfaction signals). Training is continuous: new user interactions are fed back into the model daily. The system also accounts for freshness (new videos get a temporal boost) and diversity (avoiding same-channel saturation).
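To make the two stages concrete, here is a toy version with hand-written two-dimensional embeddings. It is a sketch under heavy assumptions: the vectors, watch-time scores, and channel names are invented, `retrieve` does a brute-force scan where production would use an approximate nearest neighbor index, and `rank` sorts by a single precomputed score where production would run a learned model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec, video_vecs, top_n):
    """Stage 1: dot-product retrieval (brute force stands in for an ANN index)."""
    scored = sorted(video_vecs.items(), key=lambda kv: dot(user_vec, kv[1]), reverse=True)
    return [video_id for video_id, _ in scored[:top_n]]


def rank(candidates, predicted_watch_time, channel_of, max_per_channel=1):
    """Stage 2: re-rank by a richer score, capping videos per channel for diversity."""
    ordered = sorted(candidates, key=lambda v: predicted_watch_time[v], reverse=True)
    per_channel, result = {}, []
    for video_id in ordered:
        channel = channel_of[video_id]
        if per_channel.get(channel, 0) < max_per_channel:
            result.append(video_id)
            per_channel[channel] = per_channel.get(channel, 0) + 1
    return result


user = [0.9, 0.1]
videos = {
    "cats1": [1.0, 0.0],
    "cats2": [0.9, 0.1],
    "cooking": [0.5, 0.5],
    "news": [0.0, 1.0],
}
watch_time = {"cats1": 40.0, "cats2": 55.0, "cooking": 120.0, "news": 10.0}
channels = {"cats1": "catTV", "cats2": "catTV", "cooking": "chef", "news": "wire"}

candidates = retrieve(user, videos, top_n=3)
print(candidates, rank(candidates, watch_time, channels))
```

Retrieval favors the two cat videos because their embeddings sit closest to the user vector. Ranking then promotes the cooking video on predicted watch time and drops the second cat video to avoid same-channel saturation.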

📊 Production Insight

The retrieval stage is the speed bottleneck — scanning billions of embeddings per user request is expensive.

Cold-start for new videos with no interaction data leads to poor recommendations.

Rule: use a two-stage cascade — first retrieve via approximate nearest neighbor, then re-rank with a small model. Pre-compute user embeddings offline and cache them.

🎯 Key Takeaway

Two-stage recommendation (retrieval → ranking) balances latency and accuracy.

Freshness boost and diversity penalties prevent stale, monotonous feeds.

Monitor recommendation diversity per user — if entropy drops, retrain the ranking model.

The Latency Tax: Why Your Video Player Needs a Client-Side Buffer Orchestrator

Every millisecond of rebuffering bleeds user retention. YouTube doesn't just push bits—it orchestrates a client-side buffer with predictive prefetching. The naive approach downloads chunks sequentially. That fails under variable bandwidth. The real solution is a dynamic buffer that adapts to throughput and playback speed. We implement a sliding window buffer controller that prioritizes initial load speed then shifts to smooth playback. The algorithm tracks the moving average of download speed and adjusts chunk size requests to the CDN. If bandwidth drops below a threshold, it degrades quality per segment instead of stalling. This is the difference between a user clicking away and them watching for an hour.

buffer_orchestrator.py · PYTHON

import asyncio
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Chunk:
    url: str
    bitrate: int     # encoded bitrate in bits per second
    duration: float  # playback duration in seconds


class AdaptiveBuffer:
    def __init__(self, target_buffer_s=30, min_buffer_s=5):
        self.buffer = deque()
        self.target = target_buffer_s
        self.min_buffer = min_buffer_s
        self.download_speeds = deque(maxlen=10)  # recent throughput samples, bits/s

    def buffered_seconds(self) -> float:
        return sum(c.duration for c in self.buffer)

    async def fetch_chunk(self, chunk: Chunk, simulated_latency_s: float = 0.01):
        start = time.monotonic()
        # Stand-in for the real HTTP download.
        await asyncio.sleep(simulated_latency_s)
        elapsed = time.monotonic() - start
        # Throughput = bits transferred / time taken.
        speed = chunk.bitrate * chunk.duration / elapsed if elapsed > 0 else 0.0
        self.download_speeds.append(speed)
        self.buffer.append(chunk)
        return chunk

    def avg_speed(self) -> float:
        # Moving average over the last few downloads.
        if not self.download_speeds:
            return 0.0
        return sum(self.download_speeds) / len(self.download_speeds)

    def should_prefetch(self) -> bool:
        # Keep fetching until the target is reached; an empty buffer must prefetch.
        return self.buffered_seconds() < self.target

    def is_critical(self) -> bool:
        # Below the minimum, request a lower bitrate instead of stalling.
        return self.buffered_seconds() < self.min_buffer


# Usage
buf = AdaptiveBuffer()
print(f"Prefetch needed: {buf.should_prefetch()}")

Output

Prefetch needed: True
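The buffer class above tracks throughput but does not show how a quality level is chosen. One way to do it, sketched under assumptions: `BitrateSelector`, the `LADDER_BPS` values, and the 0.8 safety factor are all illustrative, not YouTube's actual ladder or algorithm. The selector picks the highest rendition that fits within a fraction of the moving-average throughput and falls back to the lowest rung when nothing fits.

```python
from collections import deque

# Hypothetical bitrate ladder in bits per second, lowest to highest.
LADDER_BPS = [300_000, 1_000_000, 2_500_000, 5_000_000, 16_000_000]


class BitrateSelector:
    def __init__(self, window=5, safety=0.8):
        self.samples = deque(maxlen=window)  # recent throughput samples, bits/s
        self.safety = safety                 # headroom so small dips do not stall playback

    def record(self, throughput_bps):
        self.samples.append(throughput_bps)

    def choose(self):
        if not self.samples:
            return LADDER_BPS[0]  # no data yet: start low for a fast first frame
        budget = self.safety * sum(self.samples) / len(self.samples)
        affordable = [b for b in LADDER_BPS if b <= budget]
        return affordable[-1] if affordable else LADDER_BPS[0]


sel = BitrateSelector(window=3)
startup = sel.choose()

for _ in range(3):
    sel.record(8_000_000)   # healthy connection
steady = sel.choose()

for _ in range(3):
    sel.record(1_000_000)   # bandwidth collapses; old samples age out of the window
degraded = sel.choose()

print(startup, steady, degraded)
```

The short window is a deliberate trade-off: it reacts quickly to a bandwidth drop at the cost of some quality flapping on noisy connections.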

⚠ Production Trap:

Don't fire-and-forget prefetch requests. If you fill the buffer faster than the player consumes it, you'll cause memory pressure on low-end devices. Always cap the buffer ceiling at 60 seconds for mobile.

🎯 Key Takeaway

Your player is only as good as its buffer logic. Prefetch greedily but degrade gracefully.

Thumbnail Heat: How We Serve 2 Million Thumbnails Per Second Without Crashing

Thumbnails are the silent killer of YouTube's edge. Every video load triggers 4-8 thumbnail requests before the user even clicks play. The typical approach is to store them as blobs in S3 and serve them via CDN. That works until thumbnail traffic reaches millions of requests per second, at which point cache misses tank latency. At TheCodeForge, we split thumbnails into two tiers: a hot tier in Redis (for the top 1% of videos) and a cold tier in a distributed object store with precomputed ETags. The Redis tier holds the actual image bytes (max 50KB each), sharded by video ID hash. On a miss, a background job warms the cache from CDN access logs. This cuts P99 thumbnail latency from 200ms to 8ms. The real insight: thumbnails aren't static. They're dynamic heat maps of user attention.

thumbnail_cache.py · PYTHON

import redis
import hashlib

class ThumbnailCache:
    def __init__(self, redis_client: redis.Redis):
        self.r = redis_client
        self.shard_count = 1024

    def _shard_key(self, video_id: str, resolution: str):
        shard = int(hashlib.md5(video_id.encode()).hexdigest(), 16) % self.shard_count
        return f"thumb:{shard}:{video_id}:{resolution}"

    def get(self, video_id: str, resolution: str) -> bytes | None:
        key = self._shard_key(video_id, resolution)
        return self.r.get(key)

    def set(self, video_id: str, resolution: str, data: bytes):
        key = self._shard_key(video_id, resolution)
        self.r.setex(key, 3600, data)  # TTL 1 hour

# Example: cache a thumbnail (assumes a Redis server on localhost:6379, the redis-py default)
cache = ThumbnailCache(redis.Redis())
cache.set("abc123", "480p", b"\x89PNG...")
result = cache.get("abc123", "480p")
print(f"Cache hit: {result is not None} Size: {len(result) if result else 0} bytes")

Output

Cache hit: True Size: 7 bytes

🔥Pro Tip:

Use Redis Streams to pipeline thumbnail preheating. When a video enters trending detection (e.g., 10K views/hour), push a stream event that triggers 4 background workers to generate and cache all resolution variants before the traffic spike.

🎯 Key Takeaway

Thumbnails are the first request your user makes. Optimize them like you would your homepage.

● Production incident · POST-MORTEM · severity: high

The Hot-Key Meltdown: When a Viral Video Takes Down the Site

Symptom

Buffering spinner on most videos, uploads timing out, recommendation feed showing 5-hour-old content.

Assumption

The CDN would handle traffic spikes automatically; no per-video rate limiting was needed.

Root cause

A single video became a hot key: every viewer requested the same segment at the same time. Edge caches had a cache-miss rate of 80% because the video was new and not pre-positioned. The origin server collapsed under the load, and the cache-fill requests overwhelmed the metadata store.

Fix

1. Deployed request collapsing: only one thread per video segment fetches from origin; others wait on a promise.
2. Added a local bloom filter per edge node to deduplicate requests.
3. Implemented a per-video circuit breaker that drops new requests after a threshold and returns a cached placeholder.

Key lesson

Always assume the next viral video is already live. Design for hot-key isolation at every layer.
Cache-fill storms are more dangerous than the traffic itself — request collapsing is mandatory, not optional.
Monitor per-video request rate, not just aggregate CDN traffic.
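The per-video breaker from the fix list can be approximated with a sliding-window counter keyed by video ID. This is a simplified sketch: `PerVideoBreaker`, `fetch_guarded`, the threshold of 3, and the placeholder bytes are assumptions, and it implements only the threshold-and-shed behavior described in the post-mortem, not a full closed/open/half-open breaker. The clock is injectable so the example runs without sleeping.

```python
import time
from collections import defaultdict, deque


class PerVideoBreaker:
    def __init__(self, max_requests, window_s=1.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_s = window_s
        self.clock = clock
        self._hits = defaultdict(deque)  # video_id -> timestamps of admitted requests

    def allow(self, video_id):
        now = self.clock()
        hits = self._hits[video_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # this video is over budget; other videos are unaffected
        hits.append(now)
        return True


def fetch_guarded(video_id, breaker, origin, placeholder=b"placeholder"):
    """Go to origin only while the video is under its budget."""
    return origin(video_id) if breaker.allow(video_id) else placeholder


fake_now = [0.0]
breaker = PerVideoBreaker(max_requests=3, window_s=1.0, clock=lambda: fake_now[0])
origin = lambda video_id: b"segment"

hot = [fetch_guarded("hot", breaker, origin) for _ in range(5)]
cold = fetch_guarded("cold", breaker, origin)   # isolated from the hot key
fake_now[0] = 1.5
recovered = fetch_guarded("hot", breaker, origin)

print(hot.count(b"segment"), hot.count(b"placeholder"), cold, recovered)
```

Keying the budget by video ID is what provides hot-key isolation: the viral video gets shed while every other video keeps its full path to origin.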

Production debug guide: Symptom → Action flow for production video failures (4 entries)

Symptom · 01

Buffering during playback

→

Fix

Check CDN cache hit ratio for the video. If < 90%, examine segment availability and edge POP coverage.

Symptom · 02

Video fails to transcode (uploaded but never available)

→

Fix

Check transcoding job queue depth. If jobs backlogged, scale worker pods. Look for failed jobs in the orchestrator log.

Symptom · 03

Video loads but has no audio or wrong subtitles

→

Fix

Verify the manifest file (M3U8/DASH) is generated correctly. Check audio track selection logic in the packaging service.

Symptom · 04

High upload failure rate

→

Fix

Check object store (e.g., GCS/S3) write errors. If rate limited, switch upload service to a secondary bucket with cross-region replication.

★ Quick Debug Cheat Sheet: Video Upload Failures

When uploads stall or fail, these commands locate the bottleneck in under 2 minutes.

Upload stalls at 95%

Immediate action

Check client network and server bandwidth

Commands

curl -X POST 'https://upload.youtube.com/upload?part=5' --data-binary @video_part5.mp4 -w '%{http_code}'

tail -100 /var/log/upload-service/access.log | grep 'part=5'

Fix now

Enable resumable upload API; response should include a session ID and offset.

Upload returns 502 after completion

Architecture Layer	Key Scaling Challenge	YouTube's Approach
Upload Pipeline	Concurrent client connections	Chunked resumable upload + stateless service
Transcoding Pipeline	CPU/GPU-intensive processing	Parallel segmented transcoding + message queue
CDN Delivery	Cache stampedes on hot keys	Request collapsing + pre-positioning + per-video rate limit
Metadata Storage	Read/write imbalance and shard hotspots	Sharded MySQL + Memcached + write-back cache
Recommendation System	Billion-scale embedding search	Two-tower neural net + ScaNN approximate nearest neighbor

⚙ Quick Reference

3 commands from this guide

File	Command / Code	Purpose
upload-service.yaml	appName: upload-service	What is Design YouTube?
buffer_orchestrator.py	from dataclasses import dataclass	The Latency Tax
thumbnail_cache.py	class ThumbnailCache:	Thumbnail Heat

Key takeaways

YouTube's architecture is a multi-layered pipeline: upload → transcode → store → deliver → recommend.

Hot keys and cache stampedes are the most dangerous failure patterns: always design for them.

Video requires segmented processing for both transcoding and caching to enable parallelism and fault tolerance.

The recommendation system uses a two-stage cascade (retrieval + ranking) to balance latency and accuracy.

Metadata stores must be sharded and cached aggressively; accept eventual consistency to survive write spikes.

Common mistakes to avoid

3 patterns

Neglecting request collapsing for hot keys

Symptom

During a traffic spike on a single video, origin servers overwhelm metadata DB; uploads time out and site becomes partially unavailable.

Fix

Implement request collapsing at every cache layer — only one thread per segment fetches from origin; others wait and share the result. Use a local bloom filter to deduplicate requests.

Using synchronous transcoding pipelines

Symptom

Transcoding takes hours for long videos; user sees 'processing' indefinitely; upload abandonment increases.

Fix

Segment the video into GOP-aligned chunks (6-10 seconds) and transcode them in parallel across workers. Use a distributed message queue (e.g., Pub/Sub, Kafka) to distribute jobs.

Underestimating CDN egress costs

Symptom

Monthly cloud bill is 3x expected; most cost comes from serving video bytes, not storage or compute.

Fix

Negotiate direct peering with ISPs, use a multi-CDN strategy for competition, and cache aggressively with high TTLs. Set per-video egress quotas to limit accidental streaming from origin.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

How would you upload a 1GB video from a mobile device with unreliable co...

Q02 · SENIOR

How would you design the transcoding system to handle 500 hours of video...

Q03 · SENIOR

Explain how YouTube's recommendation system retrieves candidate videos f...

Q04 · SENIOR

How would you design the metadata database to handle a sudden spike in r...

Q01 of 04 · SENIOR

How would you upload a 1GB video from a mobile device with unreliable connectivity? Describe the upload protocol.

ANSWER

Use chunked upload with resumable support. Split the video into chunks of 5MB. The client sends a POST to initiate an upload session, gets back a session ID and offset. Each subsequent request includes the chunk and the offset. The server writes the chunk to blob storage and updates the session progress in Redis. If the connection drops, the client resumes from the last acknowledged offset. This ensures no data is lost and the upload can survive disconnects.
