Senior 4 min · March 06, 2026

YouTube System Design — Surviving Hot-Key Cache Meltdowns

80% cache-miss rates on viral videos collapse origin servers.

Naren · Founder
Quick Answer
  • Video upload pipeline ingests 500+ hours/min using chunked uploads and resumable protocols.
  • Distributed transcoding farm converts each video into multiple resolutions, codecs, and bitrates.
  • CDN with edge caches delivers video segments globally; hot keys need request collapsing.
  • Metadata stored in horizontally sharded MySQL with a distributed cache (Memcached/Redis) for reads.
  • Recommendation engine uses a two-tower neural network trained on watch history, likes, and real-time signals.
  • Production insight: one viral video can trigger a cache stampede — design for hot-key isolation and circuit breakers.
Plain-English First

Imagine YouTube is a massive TV station where anyone can be a broadcaster. When you record a show and send it in, a team of editors converts it into dozens of different formats (for old TVs, new 4K TVs, slow internet connections). Then copies of your show get shipped to warehouses all over the world so your neighbor can watch it instantly without the signal having to travel from Hollywood every time. The website itself is like a giant card catalogue that helps 2 billion people find the right show at the right time.

More than 500 hours of video are uploaded to YouTube every minute, and the platform streams to more than 2 billion logged-in users per month. It is one of the most infrastructure-intensive products ever built, combining a real-time ingest pipeline, a distributed transcoding farm, a globally replicated CDN, a petabyte-scale metadata store, and a machine-learning recommendation engine, all working in concert. Getting any one of those layers wrong at scale means buffering wheels, failed uploads, or a recommendation feed that drives users away. This is exactly why 'Design YouTube' is a staple in senior engineering interviews at Google, Meta, Amazon, and Netflix.

What is Design YouTube?

Design YouTube is a system design exercise that forces you to reason about every tier of a modern distributed system: massive ingestion, compute-heavy processing, global content delivery, high-volume metadata storage, and a machine-learning driven feed. It's not about building a video player — it's about how you keep the entire pipeline running when 500 hours of new video arrive every minute and 2 billion people expect those videos to load in under 2 seconds. The interviewers aren't testing your knowledge of video codecs; they're testing your ability to make trade-offs between consistency, availability, latency, and cost at planetary scale.

upload-service.yaml (YAML)
# io.thecodeforge — Upload service configuration
appName: upload-service
port: 8080
protocol: http
chunkedUpload:
  enabled: true
  maxChunkSize: 5MB
  resumable: true
storage:
  type: cloudStorage
  bucket: youtube-uploads
  region: us-east1
dependencies:
  metadataDB:
    host: metadata-cluster-proxy
    port: 3306
  transcodingQueue:
    type: pubsub
    topic: transcoding-jobs
    subscriptionPrefix: upload-
Output
Service configured with chunked upload at 5MB segments, resumable via session tokens.
Forge Tip:
Don't just memorize this pipeline — reason about failure modes. Ask yourself: what happens when the object store goes down? When a transcoding job hangs? When a region loses power?
Production Insight
The upload service is the first point of failure. If it goes down, no new content enters the system.
A single network partition between the upload service and the metadata DB can cause silent data loss.
Rule: always store the uploaded chunk in blob storage before writing metadata — never the reverse order.
Key Takeaway
Design for upload durability first.
Metadata consistency comes second.
If you lose the video blob, you lose everything — the metadata is meaningless.
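
The blob-first ordering rule above can be sketched in a few lines. This is a minimal illustration, not YouTube's implementation: `commit_chunk`, `InMemoryMetadata`, and `MetadataUnavailable` are hypothetical names, and a plain dict stands in for the blob store.

```python
class MetadataUnavailable(Exception):
    """Raised by the hypothetical metadata store when it cannot be reached."""


class InMemoryMetadata:
    """Stand-in for the metadata DB; flip `available` to simulate a partition."""

    def __init__(self):
        self.rows = {}
        self.available = True

    def record(self, video_id, offset, size):
        if not self.available:
            raise MetadataUnavailable(video_id)
        self.rows[(video_id, offset)] = size


def commit_chunk(blob_store, metadata_db, video_id, offset, data):
    """Write the bytes first, then the metadata row that points at them."""
    blob_store[f"{video_id}/{offset}"] = data  # step 1: durable bytes
    try:
        metadata_db.record(video_id, offset, len(data))  # step 2: pointer
    except MetadataUnavailable:
        # The blob is already safe, so the metadata write can be retried later.
        return False
    return True
```

With this order, a partition leaves at worst an orphaned blob that a retry or cleanup job can reconcile. The reverse order could leave a metadata row pointing at bytes that were never stored.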

Video Upload Pipeline: Handling 500 Hours Per Minute

The upload pipeline must accept a stream of bytes from an unreliable client (the user's browser or mobile app), verify integrity, store it durably, and then hand it to the transcoding system. YouTube uses chunked, resumable uploads: the client splits the video into 5 MB chunks and sends each one with a session ID and a byte offset. The upload service writes each chunk to a blob store (like Google Cloud Storage or S3) and records the acknowledged offset in the session store. If the connection drops, the client resumes from the last acknowledged offset. The upload service itself is stateless: session state lives in a distributed cache (Redis), so any server can continue the session. At peak, YouTube handles millions of concurrent uploads, which requires a blob store that scales horizontally and an upload service with solid back-pressure handling.
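
The chunk-and-offset protocol described above can be modeled roughly as follows. This is a sketch under stated assumptions: `UploadSession` and `split_into_chunks` are hypothetical names, the session lives in process memory rather than Redis, and the blob write is only marked by a comment.

```python
CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, as in the text


class UploadSession:
    """Tracks the last acknowledged offset for one upload (hypothetical model)."""

    def __init__(self, session_id, total_size):
        self.session_id = session_id
        self.total_size = total_size
        self.acked_offset = 0

    def accept_chunk(self, offset, data):
        """Accept a chunk only if it starts at the acknowledged offset.

        Returns the offset the client should send next.
        """
        if offset != self.acked_offset:
            # Duplicate or out-of-order chunk: ignore it and report where to resume.
            return self.acked_offset
        if offset + len(data) > self.total_size:
            raise ValueError("chunk exceeds declared upload size")
        # A real service would write `data` to blob storage here before acknowledging.
        self.acked_offset += len(data)
        return self.acked_offset

    @property
    def complete(self):
        return self.acked_offset == self.total_size


def split_into_chunks(payload, chunk_size=CHUNK_SIZE):
    """Yield (offset, chunk) pairs covering the payload."""
    for offset in range(0, len(payload), chunk_size):
        yield offset, payload[offset:offset + chunk_size]
```

Because a chunk is accepted only at the acknowledged offset, a client that re-sends after a dropped connection cannot corrupt the upload; it simply learns where to continue.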

Production Insight
Upload services often fail under high concurrency because they hold open HTTP connections for minutes.
Memory per connection adds up — at 100K concurrent uploads, 10MB per connection = 1TB RAM.
Rule: use asynchronous I/O (non-blocking) and stream chunks directly to blob storage without buffering the whole file in memory.
Key Takeaway
Chunked upload with resumable offsets is the only way to handle unreliable clients at scale.
The upload service must be stateless — all session state in Redis.
Never buffer a whole video in application memory.
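
To make the memory arithmetic and the "never buffer the whole file" rule concrete, here is a small sketch. The function names are hypothetical, and the copy loop is synchronous for brevity; it only illustrates bounded buffering, not non-blocking I/O.

```python
import io


def buffered_memory_bytes(concurrent_uploads, bytes_per_connection):
    """Total RAM needed if every connection buffers this many bytes."""
    return concurrent_uploads * bytes_per_connection


def stream_copy(source, sink, piece_size=64 * 1024):
    """Copy source to sink in fixed-size pieces; returns total bytes copied.

    At most one piece is held in memory at a time, whatever the file size.
    """
    total = 0
    while True:
        piece = source.read(piece_size)
        if not piece:
            return total
        sink.write(piece)
        total += len(piece)


# 100K connections x 10 MB each = 10**12 bytes, i.e. 1 TB.
one_terabyte = buffered_memory_bytes(100_000, 10 * 10**6)

# Usage: BytesIO objects stand in for the client socket and the blob store.
copied = stream_copy(io.BytesIO(b"x" * 200_000), io.BytesIO())
```

With a 64 KB piece size, the same 100K connections would need roughly 6.5 GB of buffer space instead of 1 TB.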

Transcoding at Scale: Encoding Pipeline and Job Distribution

Once a video is stored in blob storage, it must be transcoded into dozens of output formats: multiple resolutions (144p to 4K), codecs (H.264, H.265, VP9, AV1), and adaptive bitrate renditions. YouTube runs a distributed transcoding farm, a pool of workers that pull jobs from a message queue (Pub/Sub or Kafka). Each job describes the input path, the output profiles, and a callback for when it's done. Workers are typically GPU- or CPU-optimized instances that run FFmpeg or custom encoders. The orchestrator monitors job progress, handles retries on failure, and triggers a webhook when all renditions are ready. The key challenge is parallelism: a 1-hour video can take 30 minutes to transcode serially. YouTube splits the video into short, GOP-aligned segments (e.g., about 6 seconds each), transcodes them in parallel, then merges the outputs with a concat demuxer.
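
The split, transcode-in-parallel, merge-in-order flow described above might look like this in miniature. It is a sketch only: `fake_transcode` is a placeholder that returns a label instead of invoking an encoder, and a thread pool stands in for the worker farm and message queue.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_segments(duration_s, segment_s=6.0):
    """Return (start, end) second ranges covering the whole video."""
    duration_s = float(duration_s)
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


def fake_transcode(segment, profile):
    """Stand-in for an encoder call; returns a label instead of video bytes."""
    start, end = segment
    return f"{profile}:{start:g}-{end:g}"


def transcode_parallel(duration_s, profile, workers=4):
    """Transcode all segments concurrently and return outputs in playback order."""
    segments = plan_segments(duration_s)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the merge step sees
        # segments in sequence even if they finish out of order.
        return list(pool.map(lambda seg: fake_transcode(seg, profile), segments))
```

A 1-hour video yields 600 six-second segments, so wall-clock time is bounded by the slowest segment and the number of workers rather than by the video's length.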

Production Insight
FFmpeg on a memory-constrained worker can OOM — limit concurrent jobs per worker.
Network timeouts in blob storage reads during transcoding cause aborted jobs that waste compute.
Rule: segment videos before transcoding, and use distributed caching (e.g., memcached) for intermediate segment results.
Key Takeaway
Segmented parallel transcoding is mandatory for large videos.
Use a message queue with at-least-once delivery and retry with backoff.
Monitor job processing time per segment; outliers indicate node issues.
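
Retry with backoff, mentioned in the takeaway above, can be sketched as follows. The helper names are hypothetical, jitter is left out for brevity, and the `sleep` parameter is injectable so the schedule can be checked without waiting.

```python
import time


def backoff_delays(max_attempts, base_s=1.0, cap_s=60.0):
    """Delays between attempts: base * 2**n, capped at cap_s."""
    return [min(cap_s, base_s * 2 ** n) for n in range(max_attempts - 1)]


def run_with_retry(job, max_attempts=5, sleep=time.sleep):
    """Run job() until it succeeds or attempts run out; re-raise the last error."""
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            return job()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(delays[attempt])
```

With at-least-once delivery the same job may run more than once, so the job itself should be idempotent, for example by writing each segment's output to a deterministic path.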

CDN and Global Delivery: Getting Video to 2 Billion Users

YouTube serves most video bytes directly from its CDN, which has thousands of edge nodes worldwide. Each video is split into segments (typically 6 seconds). When a user hits play, the player requests a manifest (an HLS M3U8 playlist or a DASH MPD) and then fetches segments sequentially. The CDN routes the request to the nearest edge cache; if the segment is missing, the edge fetches it from the origin server or a peer edge. To avoid cache stampedes on hot videos, YouTube uses request collapsing: only one request per segment goes to the origin, and the others wait for its result. Additionally, YouTube pre-positions popular content on edge caches during off-peak hours. The delivery path also relies on several supporting layers: DNS routing to the best edge, TCP optimization (BBR congestion control), and the QUIC protocol for faster connection establishment.
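
Request collapsing, as described above, is often called "single flight". The sketch below shows one way to do it with threads. `SingleFlight` is a hypothetical class, not a CDN API, and it collapses only concurrent requests; caching the result afterwards is a separate concern.

```python
import threading


class SingleFlight:
    """Collapse concurrent fetches for the same key into one origin call."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> shared entry for the fetch in progress
        self.collapsed = 0    # number of requests that waited instead of fetching

    def get(self, key):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None, "error": None}
                self._in_flight[key] = entry
            else:
                self.collapsed += 1
        if leader:
            try:
                entry["value"] = self._fetch(key)
            except Exception as exc:
                entry["error"] = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                entry["done"].set()  # wake every waiting follower
        else:
            entry["done"].wait()
        if entry["error"] is not None:
            raise entry["error"]
        return entry["value"]
```

The first caller for a key becomes the leader and goes to origin; everyone who arrives while that fetch is in flight shares its result or its error.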

Production Insight
A single hot video can cause a cache stampede that takes down the entire CDN origin infrastructure.
Cross-region origin fetches add 50-200ms latency — enough to cause rebuffering.
Rule: implement request collapsing at every cache layer, and set per-video rate limits at the origin.
Key Takeaway
CDN is the backbone of video delivery; design for cache misses, not hits.
Pre-positioning predicted-popular content can reduce the cache-miss rate by as much as 80%.
Always measure segment-level cache hit ratio, not just aggregate.
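
A quick sketch of why the segment-level ratio matters. The event format `(video_id, segment_index, was_hit)` is an assumption made for illustration, not a real CDN log schema.

```python
from collections import defaultdict


def hit_ratios(events):
    """Return (aggregate_ratio, per_segment_ratios) for cache access events.

    events: iterable of (video_id, segment_index, was_hit) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for video_id, segment, was_hit in events:
        key = (video_id, segment)
        totals[key] += 1
        if was_hit:
            hits[key] += 1
    if not totals:
        return 0.0, {}
    per_segment = {key: hits[key] / totals[key] for key in totals}
    aggregate = sum(hits.values()) / sum(totals.values())
    return aggregate, per_segment
```

If 98 requests hit a well-cached old video and 2 requests miss on a brand-new one, the aggregate ratio is a healthy-looking 0.98 while the new video's segment sits at 0.0, which is exactly the hot-key case an aggregate dashboard hides.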

Metadata Storage: Database Architecture for 2B Users

YouTube's metadata layer stores video metadata (title, description, tags), user profiles, watch history, comments, and likes. The write volume is massive: every second, users upload, comment, like, and update playlists. Read volume is even larger, because each view triggers multiple metadata reads. YouTube uses a horizontally sharded MySQL database (Vitess is a common choice) with range-based sharding on video ID. Caching is critical: a distributed Memcached layer (or Redis) absorbs the majority of reads. Writes go through a write-back cache to handle spikes. Consistency is traded for availability: a comment may not appear for a few seconds after posting. For watch history, YouTube uses Bigtable-like storage for high throughput and eventual consistency. The metadata layer must also handle fan-out writes: when a celebrity uploads, their subscribers' feeds need updating. YouTube uses a hybrid push-pull model: push to active subscribers, pull for inactive ones.
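
Range-based sharding on video ID, mentioned above, reduces to a sorted-boundary lookup. The sketch below is illustrative only: `RangeShardRouter`, the boundary values, and the shard names are hypothetical and not how Vitess is configured.

```python
import bisect


class RangeShardRouter:
    """Map a video ID to a shard using sorted, exclusive upper bounds."""

    def __init__(self, upper_bounds, shard_names):
        # upper_bounds[i] is the first ID that no longer belongs to shard_names[i];
        # the final shard takes everything at or above the last bound.
        if len(shard_names) != len(upper_bounds) + 1:
            raise ValueError("need exactly one more shard than bounds")
        if list(upper_bounds) != sorted(upper_bounds):
            raise ValueError("bounds must be sorted")
        self._bounds = list(upper_bounds)
        self._shards = list(shard_names)

    def shard_for(self, video_id):
        return self._shards[bisect.bisect_right(self._bounds, video_id)]


# Usage: three shards split at "h" and "q".
router = RangeShardRouter(["h", "q"], ["shard-0", "shard-1", "shard-2"])
```

Range sharding keeps neighboring IDs together, which helps range scans but can concentrate load if IDs are assigned sequentially; that is one reason shard hotspots need monitoring.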

Production Insight
Cache invalidation is the hardest problem — stale metadata (e.g., old video title) can persist for minutes.
Shard rebalancing when adding new nodes can cause cascading failures if not done with live migration.
Rule: bound staleness explicitly. Give cached reads a short TTL (e.g., 5 seconds) and flush the write-back cache on a similarly bounded interval.
Key Takeaway
Shard your metadata store by video ID and use Memcached for reads.
Cache invalidation is the source of most bugs — accept eventual consistency and design for it.
Monitor cache hit rate and shard utilization daily; rebalance before hotspots form.
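
Bounded staleness with a short TTL can be sketched as a read-through cache. `TTLCache` is a hypothetical in-process class standing in for Memcached or Redis, and the clock is injectable so expiry can be exercised without sleeping.

```python
import time


class TTLCache:
    """Read-through cache whose entries expire after ttl_s seconds."""

    def __init__(self, loader, ttl_s=5.0, clock=time.monotonic):
        self._loader = loader  # called on a miss, e.g. a database read
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}     # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]
        value = self._loader(key)
        self._entries[key] = (value, now + self._ttl_s)
        return value

    def invalidate(self, key):
        """Drop a key early, e.g. right after a write to the database."""
        self._entries.pop(key, None)
```

With a 5-second TTL, a renamed video can show its old title for at most 5 seconds even if the explicit invalidation is lost.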

Recommendation System: How YouTube Knows What You Want

YouTube's recommendation system is a massive two-tower neural network that learns user and video embeddings. One tower encodes user signals (watch history, search history, time-of-day, device) into a fixed-size vector; the other tower encodes video features (title, description, uploader, viewing patterns). The dot product of these vectors scores relevance. At serving time, YouTube retrieves the top-N candidate videos from a nearest neighbor index (e.g., ScaNN) over billions of videos. Then a second-stage deep ranking model re-ranks the candidates using richer features (like predicted watch time, like probability, and user satisfaction signals). Training is continuous: new user interactions are fed back into the model daily. The system also accounts for freshness (new videos get a temporal boost) and diversity (avoiding same-channel saturation).
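
The retrieval-then-ranking flow described above can be shown with toy vectors. This is a sketch under heavy simplification: brute-force dot products stand in for an approximate nearest neighbor index such as ScaNN, and a lookup table of predicted watch times stands in for the ranking model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(user_vec, video_vecs, top_n):
    """Stage 1: score every video by dot product and keep the top_n IDs.

    Brute force is used here; a production system would query an ANN index.
    """
    ranked = sorted(
        video_vecs,
        key=lambda vid: dot(user_vec, video_vecs[vid]),
        reverse=True,
    )
    return ranked[:top_n]


def rerank(candidates, predicted_watch_time):
    """Stage 2: reorder the small candidate set using a richer signal."""
    return sorted(candidates, key=lambda vid: predicted_watch_time[vid], reverse=True)


# Usage with made-up two-dimensional embeddings.
videos = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
candidates = retrieve([1.0, 0.0], videos, top_n=2)
feed = rerank(candidates, {"a": 30.0, "b": 120.0})
```

Retrieval is cheap per item and runs over the whole catalog; ranking is expensive per item and runs only over the handful of survivors.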

Production Insight
The retrieval stage is the speed bottleneck — scanning billions of embeddings per user request is expensive.
Cold-start for new videos with no interaction data leads to poor recommendations.
Rule: use a two-stage cascade. First retrieve candidates via approximate nearest neighbor search, then re-rank only that small candidate set with the richer ranking model. Pre-compute user embeddings offline and cache them.
Key Takeaway
Two-stage recommendation (retrieval → ranking) balances latency and accuracy.
Freshness boost and diversity penalties prevent stale, monotonous feeds.
Monitor recommendation diversity per user — if entropy drops, retrain the ranking model.
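
One way to put a number on "diversity per user" is the Shannon entropy of the channels in a feed. This is an assumption about how the metric could be computed, not a description of YouTube's monitoring.

```python
import math
from collections import Counter


def channel_entropy(recommended_channels):
    """Shannon entropy (in bits) of the channel distribution in one feed."""
    counts = Counter(recommended_channels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A feed drawn entirely from one channel scores 0 bits, while four equally represented channels score 2 bits, so a falling average is an early sign of same-channel saturation.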
● Production incident · POST-MORTEM · severity: high

The Hot-Key Meltdown: When a Viral Video Takes Down the Site

Symptom
Buffering spinner on most videos, uploads timing out, recommendation feed showing 5-hour-old content.
Assumption
The CDN would handle traffic spikes automatically; no per-video rate limiting was needed.
Root cause
A single video became a hot key: every viewer requested the same segment at the same time. Edge caches had a cache-miss rate of 80% because the video was new and not pre-positioned. The origin server collapsed under the load, and the cache-fill requests overwhelmed the metadata store.
Fix
1. Deployed request collapsing: only one thread per video segment fetches from origin; others wait on a promise.
2. Added a local bloom filter per edge node to deduplicate requests.
3. Implemented a per-video circuit breaker that drops new requests after a threshold and returns a cached placeholder.
Key lesson
  • Always assume the next viral video is already live. Design for hot-key isolation at every layer.
  • Cache-fill storms are more dangerous than the traffic itself — request collapsing is mandatory, not optional.
  • Monitor per-video request rate, not just aggregate CDN traffic.
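
The per-video threshold behavior from the fix above can be sketched with a sliding window. `PerVideoBreaker` is a hypothetical class; strictly speaking it behaves like a sliding-window limiter, which is the simplest form of the "drop after a threshold" breaker the post-mortem describes. The clock is injectable for testing.

```python
import time
from collections import defaultdict, deque


class PerVideoBreaker:
    """Reject requests for a video once its recent request count hits a threshold."""

    def __init__(self, max_requests, window_s=1.0, clock=time.monotonic):
        self._max = max_requests
        self._window_s = window_s
        self._clock = clock
        self._recent = defaultdict(deque)  # video_id -> timestamps of admitted requests

    def allow(self, video_id):
        now = self._clock()
        recent = self._recent[video_id]
        # Forget admissions that have aged out of the window.
        while recent and now - recent[0] >= self._window_s:
            recent.popleft()
        if len(recent) >= self._max:
            return False  # caller should serve a cached placeholder instead
        recent.append(now)
        return True
```

Because state is keyed by video ID, one viral video tripping its threshold does not affect requests for any other video.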
Production debug guide: Symptom → Action flow for production video failures (4 entries)
Symptom · 01
Buffering during playback
Fix
Check CDN cache hit ratio for the video. If < 90%, examine segment availability and edge POP coverage.
Symptom · 02
Video fails to transcode (uploaded but never available)
Fix
Check transcoding job queue depth. If jobs backlogged, scale worker pods. Look for failed jobs in the orchestrator log.
Symptom · 03
Video loads but has no audio or wrong subtitles
Fix
Verify the manifest file (M3U8/DASH) is generated correctly. Check audio track selection logic in the packaging service.
Symptom · 04
High upload failure rate
Fix
Check object store (e.g., GCS/S3) write errors. If rate limited, switch upload service to a secondary bucket with cross-region replication.
★ Quick Debug Cheat Sheet: Video Upload Failures
When uploads stall or fail, these commands locate the bottleneck in under 2 minutes.
Upload stalls at 95%
Immediate action
Check client network and server bandwidth
Commands
curl -X POST 'https://upload.youtube.com/upload?part=5' --data-binary @video_part5.mp4 -w '%{http_code}'
tail -100 /var/log/upload-service/access.log | grep 'part=5'
Fix now
Enable resumable upload API; response should include a session ID and offset.
Upload returns 502 after completion
Immediate action
Verify object store is reachable
Commands
aws s3api head-object --bucket youtube-uploads --key <video_id>
kubectl logs -l app=upload-service --tail=50
Fix now
Check if the upload service is behind on writing metadata; clear the write queue.
| Architecture Layer | Key Scaling Challenge | YouTube's Approach |
| --- | --- | --- |
| Upload Pipeline | Concurrent client connections | Chunked resumable upload + stateless service |
| Transcoding Pipeline | CPU/GPU-intensive processing | Parallel segmented transcoding + message queue |
| CDN Delivery | Cache stampedes on hot keys | Request collapsing + pre-positioning + per-video rate limit |
| Metadata Storage | Read/write imbalance and shard hotspots | Sharded MySQL + Memcached + write-back cache |
| Recommendation System | Billion-scale embedding search | Two-tower neural net + ScaNN approximate nearest neighbor |

Key takeaways

1. YouTube's architecture is a multi-layered pipeline: upload → transcode → store → deliver → recommend.
2. Hot keys and cache stampedes are the most dangerous failure patterns: always design for them.
3. Video requires segmented processing for both transcoding and caching to enable parallelism and fault tolerance.
4. The recommendation system uses a two-stage cascade (retrieval + ranking) to balance latency and accuracy.
5. Metadata stores must be sharded and cached aggressively; accept eventual consistency to survive write spikes.

Common mistakes to avoid

Neglecting request collapsing for hot keys

Symptom
During a traffic spike on a single video, origin servers overwhelm metadata DB; uploads time out and site becomes partially unavailable.
Fix
Implement request collapsing at every cache layer — only one thread per segment fetches from origin; others wait and share the result. Use a local bloom filter to deduplicate requests.

Using synchronous transcoding pipelines

Symptom
Transcoding takes hours for long videos; user sees 'processing' indefinitely; upload abandonment increases.
Fix
Segment the video into GOP-aligned chunks (6-10 seconds) and transcode them in parallel across workers. Use a distributed message queue (e.g., Pub/Sub, Kafka) to distribute jobs.

Underestimating CDN egress costs

Symptom
Monthly cloud bill is 3x expected; most cost comes from serving video bytes, not storage or compute.
Fix
Negotiate direct peering with ISPs, use a multi-CDN strategy for competition, and cache aggressively with high TTLs. Set per-video egress quotas to limit accidental streaming from origin.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: How would you upload a 1GB video from a mobile device with unreliable co...
Q02 · SENIOR: How would you design the transcoding system to handle 500 hours of video...
Q03 · SENIOR: Explain how YouTube's recommendation system retrieves candidate videos f...
Q04 · SENIOR: How would you design the metadata database to handle a sudden spike in r...
Q01 of 04 · SENIOR

How would you upload a 1GB video from a mobile device with unreliable connectivity? Describe the upload protocol.

ANSWER
Use chunked upload with resumable support. Split the video into chunks of 5MB. The client sends a POST to initiate an upload session, gets back a session ID and offset. Each subsequent request includes the chunk and the offset. The server writes the chunk to blob storage and updates the session progress in Redis. If the connection drops, the client resumes from the last acknowledged offset. This ensures no data is lost and the upload can survive disconnects.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. What is Design YouTube in simple terms?
02. Why is transcoding needed for YouTube videos?
03. How does YouTube handle a video that goes viral instantly?
04. What database does YouTube use for video metadata?