Netflix System Design — CDN Cache Miss Storms at Scale
CDN cache hit ratio dropped from 95% to 40%, causing widespread user buffering.
20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.
- Netflix architecture: tiered CDN delivery + fault-tolerant microservices + ML recommendations
- CDN caches video chunks at ISP edges — travel distance <10km for most users
- Adaptive bitrate (ABR) chunks video into 2-second segments at multiple bitrates
- Client-side ABR algorithm selects bitrate based on current bandwidth and buffer
- Microservices are isolated with circuit breakers — one service failing doesn't cascade
- Chaos Engineering (Chaos Monkey) proactively kills instances to test resilience
Imagine a massive video rental store with 250 million members worldwide. Instead of one giant store, Netflix secretly runs thousands of mini-stores hidden inside internet provider buildings near your home — so when you press play, the movie travels just a few blocks, not across the ocean. Behind the scenes, a team of invisible coordinators constantly watches what everyone's watching, pre-loads the movies they'll probably pick next, and quietly reroutes traffic when any one mini-store goes down. That invisible coordination system — built to never drop a single frame for 250 million simultaneous viewers — is what we're designing today.
Netflix streams to over 250 million subscribers across 190 countries, serving more than 100 million hours of video every single day. At peak hours in North America, Netflix alone accounts for roughly 15% of all downstream internet traffic. When you're designing a system at that scale, every architectural decision — from how you encode a video file to how you route a DNS query — has a measurable dollar cost and a direct impact on whether someone abandons their Friday-night movie before the credits roll. This isn't an academic exercise; it's the kind of problem where a 500ms latency increase costs millions in subscriber churn.
The core challenge Netflix solves is fundamentally different from a typical web application. A social media post weighs a few kilobytes; a 4K HDR episode of a prestige drama weighs 15–20 gigabytes. You can't serve that from a handful of origin servers the way you'd serve a JSON API. You need a tiered delivery hierarchy, intelligent client-side adaptation, and fault-tolerant microservices that can independently fail without taking the whole platform down. On top of the delivery problem sits the discovery problem: recommending the right title to the right person at the right moment, at a scale where even a 1% improvement in recommendation accuracy translates to hundreds of millions in retained subscriptions.
By the end of this article you'll be able to walk into a system design interview and confidently reason through Netflix's full architecture — from the moment a user clicks play to the moment a video frame renders on screen. You'll understand the tradeoffs behind each major subsystem, know which database engines Netflix actually uses and why, recognize the failure modes that keep SREs awake at night, and speak credibly about how adaptive bitrate streaming actually works under the hood. Let's build it from the ground up.
What Netflix System Design Really Is — CDN Cache Miss Storms at Scale
Netflix system design is the architecture of a global streaming platform that delivers video to 200M+ subscribers with sub-second startup latency. The core mechanic is a multi-tier caching hierarchy: CDN edge caches hold popular content, regional caches handle misses, and the origin store serves the full catalog. When a CDN cache miss storm occurs — a sudden flood of requests for the same unpopular asset — it cascades through every tier, overwhelming origin storage and degrading quality of experience for all users.
In practice, the system uses a CDN like Open Connect, which Netflix deploys inside ISP networks. Each edge node caches only the most requested titles (typically the top 1-5% of the catalog). A miss triggers a fetch from a regional cache, which itself may miss and hit the AWS origin. The key property is that a single unpopular title can generate thousands of simultaneous requests during a flash crowd (e.g., a new episode drop), turning a cold start into a thundering herd. The system must absorb this with pre-warming, request coalescing, and adaptive bitrate fallback.
Use this design when you must serve latency-sensitive, read-heavy workloads with a long tail of rarely accessed objects. It matters because a naive cache-only approach collapses under the Zipf distribution of video popularity: the top 1% of titles get 80% of views, but the remaining 99% still generate enough traffic to cause origin outages if not carefully managed. Real systems like Netflix, YouTube, and Twitch all implement some form of predictive pre-fetching and cache admission control to survive miss storms.
System Requirements and Scale
Start with the numbers that drive every decision. Netflix serves 250M+ subscribers, each watching an average of 2 hours per day. That's 500 million hours of streaming daily — 57,000 years of video every 24 hours.
Storage: A 4K movie is roughly 20GB. With 20,000+ titles, the master library is 400TB. But you don't serve masters — you serve 50+ encoded versions per title (different resolutions, codecs, bitrates). That pushes total storage to tens of petabytes.
Bandwidth: Peak hours see 15% of all downstream internet traffic in North America. You need multi-terabit-per-second capacity at the CDN edge. Every 1% improvement in compression efficiency saves millions in bandwidth costs.
Latency: A 500ms increase in start-up time drops viewer retention by 20%. Your architecture must deliver the first video frame in under 2 seconds.
- Ingest: write-heavy, batch, offline
- Serving: read-heavy, real-time, latency-sensitive
- Intelligence: compute-heavy, hybrid online/offline
High-Level Architecture
Netflix's architecture breaks into four layers: Client Apps (web, mobile, TV), CDN Layer (Open Connect Appliances), Backend Services (microservices running on AWS), and Data Layer (Cassandra, EVCache, S3).
Client apps are thin — they fetch manifests, chunks, and metadata from APIs. The CDN is a massive network of custom appliances deployed inside ISP data centers, serving 95%+ of traffic. Backend services are hundreds of microservices, each owned by a small team, communicating via REST or gRPC. Data is spread across multiple database technologies: Cassandra for user viewing history and profiles, EVCache (Memcached-based) for session and recommendation caches, and S3 for raw video and assets.
The glue that holds it together: a service mesh (Envoy) for traffic management, Hystrix for circuit breaking, and Chaos Monkey to continuously test resilience.
Content Delivery: CDN and Adaptive Bitrate Streaming
Netflix doesn't serve a single video file. Each title is encoded into 50+ versions (resolutions from 360p to 4K, codecs like H.264, AV1, VP9). These are sliced into 2-second chunks and placed on CDN edge servers (Open Connect Appliances). When you press play, the client fetches a manifest file (MPD or M3U8) listing available bitrates and chunk URLs. The client's ABR algorithm — typically BOLA or a custom buffer-based scheme — monitors download speed and buffer fill level. If the buffer drops below 5 seconds, it requests a lower bitrate chunk. If the buffer is healthy, it might try the next higher bitrate.
Open Connect is Netflix's secret weapon. They lease space in ISP data centers, install custom servers pre-loaded with popular content. A user's DNS request resolves to the nearest Open Connect appliance. If the content isn't there (cache miss), the appliance fetches from a regional hub, which may fetch from Netflix's own origin in AWS. The hop count is at most 2 or 3, keeping latency low.
- Manifest = menu
- ABR algorithm = your dining strategy
- Cache miss = dish is empty, need to wait for refill
Microservices and Fault Tolerance
Netflix runs hundreds of microservices, each responsible for a narrow domain (user profile, playback, billing, recommendations, etc.). They communicate via synchronous REST or gRPC, and asynchronously via Kafka for events like 'new title added' or 'user watched'.
The critical pattern here is the Circuit Breaker, implemented via Netflix's Hystrix library. Each remote call is wrapped in a HystrixCommand. If the call fails or times out (e.g., 90% of calls fail within 10 seconds), the circuit breaker opens. Subsequent calls fail immediately without hitting the backend. After a configurable sleep window (e.g., 30 seconds), a single probe is allowed through. If it succeeds, the circuit closes. This prevents cascading failures when one service degrades.
Chaos Monkey is an extension of this philosophy: it randomly terminates instances in production during business hours, forcing engineers to build systems that survive individual failures. Over time, this has made the platform remarkably resilient — Netflix can lose an entire AWS Availability Zone and still serve users seamlessly.
Data Storage and Caching
Netflix uses multiple database engines, each chosen for a specific workload:
- Cassandra: Used for all user metadata — profiles, viewing history, ratings. Netflix runs one of the largest Cassandra clusters in the world (~2000 nodes). The data model is heavily denormalized for fast reads. Each row is keyed by user ID, with columns for watch history, recommendations metadata, etc.
- EVCache: A Memcached-based distributed cache layer. Used for session data, short-lived recommendations, and any data that needs sub-millisecond latency. EVCache is built on top of Amazon EC2 instances with SSDs, and data is replicated across two AZs for durability.
- S3 (Simple Storage Service): All raw video masters, encoded chunks, thumbnail images, and static assets live in S3. The CDN cache is populated from S3 via a pre-fetch pipeline.
- Elasticsearch: Used for log aggregation and search within internal tools (e.g., finding playback errors per content ID).
The key tradeoff: Cassandra provides high write throughput (ingesting viewing events) and tunable consistency, but doesn't support complex queries. That's why Netflix uses separate services for search and recommendations.
Recommendations Engine
Netflix's recommendations are responsible for 80% of content discovery. The system uses a mix of collaborative filtering, content-based filtering, and deep learning models. The pipeline has two parts:
Offline training: Machine learning models (e.g., factorization machines, neural networks) are trained on historical viewing data (Cassandra + S3). This produces user embeddings and item embeddings. Training runs daily on large Spark clusters.
Online serving: When a user opens Netflix, the recommendation service fetches their embeddings from EVCache, scores all available titles using nearest-neighbor search (approximate via FAISS or similar), and returns the top 20-40 titles. This must happen in under 200ms.
Personalization also considers time of day, device type, and recent searches. A/B testing is continuous — Netflix runs thousands of experiments simultaneously to evaluate new models.
- User embeddings shift over time (new likes, new devices)
- Item embeddings are static until re-trained
- Nearest neighbor search is approximate (FAISS) to keep latency under 200ms
Video Uploading: The Real Bottleneck Isn't Bandwidth
Bandwidth is cheap. The real pain is video processing. When a user uploads a 4K video, you can't just throw raw bytes at your CDN. Netflix transcodes every upload into dozens of renditions—different resolutions, codecs, and bitrates. That's not a simple encode job; it's a distributed pipeline. The naive approach is a monolithic encoder that queues jobs sequentially. That fails when a 2-hour 4K HDR movie takes 12 hours to process. Netflix breaks transcoding into chunks: split the video into 2-second segments, process each segment independently on separate workers, then stitch the results. This is why their upload pipeline uses Apache Kafka for job distribution and S3 for intermediate storage. Each segment goes through a directed acyclic graph (DAG) of operations: audio normalization, subtitle burning, resolution scaling, encryption. If one segment fails, you retry only that segment, not the whole video. The key metric isn't upload speed; it's Time To First Playable Segment—how fast can we serve the first 2 seconds of any rendition.
Video Streaming: Why Your Player Buffers While Netflix Doesn't
Netflix doesn't stream one file. It streams a playlist. The player fetches a manifest file (MPD for DASH, or M3U8 for HLS) that lists available renditions: resolutions, codecs, bitrates, and segment URLs. The client doesn't just pick the highest resolution and stick with it. A rate adaptation algorithm on the player continuously monitors TCP throughput and buffer occupancy. If your phone drops from 5G to LTE, the player requests a lower bitrate segment before the buffer drains. This is why Netflix feels smooth even on congested networks. The server-side part is less obvious: Netflix pre-positions edge servers closer to ISPs. They call this Open Connect. Each edge server holds the top 20% of content by popularity locally. Everything else comes from regional caches. The critical insight: the player avoids making requests that deliver data the client doesn't need. If the user scrubs to minute 30, the player requests the segment at minute 30, not the whole file. That's why video service architecture is a serverless event-driven system: each segment fetch is an independent HTTP request that a CDN edge can serve without talking to the origin.
The Recommendation Engine: How Netflix Knows You'll Binge at 2 AM
Netflix's recommendation system is a multi-stage funnel, not a single model. Stage one: candidate generation. From your 2000-title library, generate 500 candidates using collaborative filtering (people who watched X also watched Y), content-based filtering (genre, actors, director), and trending signals. Stage two: ranking. A deep neural network scores those 500 candidates based on predicted engagement—probability you'll click, watch at least 70%, or finish the title. Stage three: re-ranking. Apply business constraints: diversity (don't show 5 Marvel movies in a row), freshness (recently added content gets a boost), and personalization (time of day matters; you watch comedy at night, documentaries in morning). The architecture is batch + real-time. Batch jobs (Spark on EMR) compute candidate sets nightly. Real-time inference (a TensorFlow Serving cluster) ranks on-demand when you refresh the homepage. The trap: cold start. New users have no history. Netflix solves this with a two-phase model: first, collaborative filtering from similar users (based on sign-up demographics), then, after 10 views, switch to personalized models. The key metric isn't precision—it's hours streamed per session. If your model suggests a movie the user finishes vs. abandons after 10 minutes, that's a win.
Super Bowl CDN Cache Miss Storm
- CDN cache hit ratio is a leading indicator of user experience — monitor it per region.
- Origin servers must handle worst-case cache miss storms, not just average load.
- Pre-warming caches with predicted popular content reduces origin load by 60%.
- Never assume the CDN will absorb all traffic patterns — have a fallback rate-limiting strategy.
curl https://netflix.com/cdn-cgi/trace | grep loc (to check CDN edge location)kubectl logs -l app=cache-proxy --tail 100 (CDN edge logs)Key takeaways
Common mistakes to avoid
3 patternsSingle database for both writes and reads
No circuit breaker on downstream calls
Assuming CDN cache hit ratio is always high (99%)
Interview Questions on This Topic
How would you design the CDN cache hierarchy for Netflix?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.
That's Real World. Mark it forged?
9 min read · try the examples if you haven't