YouTube System Design — Surviving Hot-Key Cache Meltdowns
80% cache-miss rates on viral videos collapse origin servers.
- Video upload pipeline ingests 500+ hours/min using chunked uploads and resumable protocols.
- Distributed transcoding farm converts each video into multiple resolutions, codecs, and bitrates.
- CDN with edge caches delivers video segments globally; hot keys need request collapsing.
- Metadata stored in horizontally sharded MySQL with a distributed cache (Memcached/Redis) for reads.
- Recommendation engine uses a two-tower neural network trained on watch history, likes, and real-time signals.
- Production insight: one viral video can trigger a cache stampede — design for hot-key isolation and circuit breakers.
Imagine YouTube is a massive TV station where anyone can be a broadcaster. When you record a show and send it in, a team of editors converts it into dozens of different formats (for old TVs, new 4K TVs, slow internet connections). Then copies of your show get shipped to warehouses all over the world so your neighbor can watch it instantly without the signal having to travel from Hollywood every time. The website itself is like a giant card catalogue that helps 2 billion people find the right show at the right time.
YouTube serves over 500 hours of video every single minute and streams to more than 2 billion logged-in users per month. It is one of the most infrastructure-intensive products ever built — combining a real-time ingest pipeline, a distributed transcoding farm, a globally replicated CDN, a petabyte-scale metadata store, and a machine-learning recommendation engine, all working in concert. Getting any one of those layers wrong at scale means buffering wheels, failed uploads, or a recommendation feed that drives users away forever. This is exactly why 'Design YouTube' is a staple in senior engineering interviews at Google, Meta, Amazon, and Netflix.
What is Design YouTube?
Design YouTube is a system design exercise that forces you to reason about every tier of a modern distributed system: massive ingestion, compute-heavy processing, global content delivery, high-volume metadata storage, and a machine-learning driven feed. It's not about building a video player — it's about how you keep the entire pipeline running when 500 hours of new video arrive every minute and 2 billion people expect those videos to load in under 2 seconds. The interviewers aren't testing your knowledge of video codecs; they're testing your ability to make trade-offs between consistency, availability, latency, and cost at planetary scale.
Video Upload Pipeline: Handling 500 Hours Per Minute
The upload pipeline must accept a stream of bytes from an unreliable client (the user's browser or mobile app), verify integrity, store it durably, and then hand it to the transcoding system. YouTube uses chunked upload with resumable support — the client splits the video into 5 MB chunks, sends each with a session ID and offset. The upload service writes each chunk to a blob store (like Google Cloud Storage or S3) and records progress in a fast relational store. If the connection drops, the client resumes from the last acknowledged offset. The upload service itself is stateless — session state lives in a distributed cache (Redis) so any server can continue the session. At peak, YouTube handles millions of concurrent uploads; that requires the blob store to scale horizontally and the upload service to have excellent back-pressure handling.
Transcoding at Scale: Encoding Pipeline and Job Distribution
Once a video is stored in blob storage, it must be transcoded into dozens of output formats: multiple resolutions (144p to 4K), codecs (H.264, H.265, VP9, AV1), and adaptive bitrate renditions. YouTube runs a distributed transcoding farm — a pool of workers that pull jobs from a message queue (Pub/Sub or Kafka). Each job describes input path, output profiles, and a callback for when it's done. Workers are typically GPU or CPU-optimized instances that run FFmpeg or custom encoders. The orchestrator monitors job progress, handles retries on failure, and triggers a webhook when all renditions are ready. The key challenge is parallelism: a 1-hour video can take 30 minutes to transcode serially. YouTube splits the video into short segments (e.g., 6-second GOPs), transcodes them in parallel, then merges the outputs with a concat demuxer.
CDN and Global Delivery: Getting Video to 2 Billion Users
YouTube serves most video bytes directly from its CDN, which has thousands of edge nodes worldwide. Each video is split into segments (typically 6 seconds). When a user hits play, the player requests a manifest (M3U8 or DASH) and then fetches segments sequentially. The CDN routes the request to the nearest edge cache; if missing, it fetches from the origin server or a peer edge. To avoid cache stampedes on hot videos, YouTube uses request collapsing — only one request per segment goes to the origin; others wait in a queue. Additionally, YouTube pre-positions popular content on edge caches during off-peak hours. The delivery also includes several layers: DNS routing to the best edge, TCP optimization (BBR congestion control), and QUIC protocol for faster connection establishment.
Metadata Storage: Database Architecture for 2B Users
YouTube's metadata layer stores video metadata (title, description, tags), user profiles, watch history, comments, and likes. The write volume is massive: every second, users upload, comment, like, and update playlists. Read volume is even larger — each view triggers multiple metadata reads. YouTube uses a horizontally sharded MySQL database (Vitess is a common choice) with range-based sharding on video ID. Caching is critical: a distributed Memcached layer (or Redis) absorbs the majority of reads. Writes go through a write-back cache to handle spikes. Consistency is traded for availability: a comment may not appear for a few seconds after posting. For watch history, YouTube uses bigtable-like storage for high throughput and eventual consistency. The metadata layer must also handle fan-out writes: when a celebrity uploads, their subscribers' feeds need updating. YouTube uses a hybrid push-pull model: push to active subscribers, pull for inactive ones.
Recommendation System: How YouTube Knows What You Want
YouTube's recommendation system is a massive two-tower neural network that learns user and video embeddings. One tower encodes user signals (watch history, search history, time-of-day, device) into a fixed-size vector; the other tower encodes video features (title, description, uploader, viewing patterns). The dot product of these vectors scores relevance. At serving time, YouTube retrieves the top-N candidate videos from a nearest neighbor index (e.g., ScaNN) over billions of videos. Then a second-stage deep ranking model re-ranks the candidates using richer features (like predicted watch time, like probability, and user satisfaction signals). Training is continuous: new user interactions are fed back into the model daily. The system also accounts for freshness (new videos get a temporal boost) and diversity (avoiding same-channel saturation).
The Hot-Key Meltdown: When a Viral Video Takes Down the Site
- Always assume the next viral video is already live. Design for hot-key isolation at every layer.
- Cache-fill storms are more dangerous than the traffic itself — request collapsing is mandatory, not optional.
- Monitor per-video request rate, not just aggregate CDN traffic.
Key takeaways
Common mistakes to avoid
3 patternsNeglecting request collapsing for hot keys
Using synchronous transcoding pipelines
Underestimating CDN egress costs
Interview Questions on This Topic
How would you upload a 1GB video from a mobile device with unreliable connectivity? Describe the upload protocol.
Frequently Asked Questions
That's Real World. Mark it forged?
4 min read · try the examples if you haven't