Advanced 7 min · March 06, 2026

Netflix System Design — CDN Cache Miss Storms at Scale

Q: What database does Netflix use for user data?

Cassandra, one of the largest clusters (~2000 nodes). It provides high write throughput for ingesting viewing events and a wide-row model for user profiles.

Q: How does Netflix handle traffic spikes (e.g., new season of a hit show)?

They pre-warm CDN caches based on ML prediction of which titles will be popular. They also automatically scale microservices via AWS autoscaling and use Spinnaker for rapid deployment of additional capacity.

Q: What is Chaos Monkey and why is it used?

Chaos Monkey is a tool that randomly terminates production instances. It forces engineering teams to build systems that survive individual failures without user impact. Over time, this has made Netflix's infrastructure extremely resilient.

Q: How does adaptive bitrate streaming work?

The client measures download speed of video chunks and maintains a buffer. If the buffer drops below a threshold (e.g., 5 seconds), it requests a lower quality chunk. If the buffer is full, it may try to upgrade. Netflix uses custom BOLA and buffer-based algorithms to balance quality and stability.

CDN cache hit ratio dropped from 95% to 40%, causing widespread user buffering.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

✓ Production

production tested

July 27, 2026

last updated

1,713

articles · all by Naren

Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Netflix architecture: tiered CDN delivery + fault-tolerant microservices + ML recommendations
CDN caches video chunks at ISP edges — travel distance <10km for most users
Adaptive bitrate (ABR) chunks video into 2-second segments at multiple bitrates
Client-side ABR algorithm selects bitrate based on current bandwidth and buffer
Microservices are isolated with circuit breakers — one service failing doesn't cascade
Chaos Engineering (Chaos Monkey) proactively kills instances to test resilience

✦ Definition~90s read

What is Design Netflix?

Netflix system design isn't about building a video player — it's about orchestrating a global content delivery network (CDN) that serves 260 million+ subscribers across 190+ countries while surviving cache miss storms that can crater origin infrastructure. The core problem is that when a popular new season drops, millions of devices simultaneously request the same video chunks from edge caches.

★

Imagine a massive video rental store with 250 million members worldwide.

If those caches miss (cold start, eviction, or partition), the resulting thundering herd can overwhelm origin servers, causing cascading failures. Netflix solves this with a multi-tier CDN architecture: Open Connect Appliances (OCAs) deployed inside ISP networks serve 95%+ of traffic locally, with regional caches and AWS origins as fallbacks.

Adaptive Bitrate (ABR) streaming further complicates things — each title is encoded at 5-15 different bitrates, split into 2-second chunks, meaning a single 4K stream generates thousands of discrete cacheable objects. Cache miss storms are mitigated through proactive pre-warming (predictive algorithms push popular content to OCAs before release), client-side retry with exponential backoff and jitter, and a 'canary' deployment pattern where new content is gradually exposed to a small user subset to validate cache hit ratios before full rollout.

The architecture also includes a microservices layer (hundreds of services like Zuul for routing, Eureka for service discovery, Hystrix for circuit breaking) that isolates failures — if the recommendation service goes down, playback shouldn't break. Data storage spans Cassandra for high-write user activity, EVCache (Memcached fork) for sub-millisecond session state, and S3 for the 200+ PB media library.

The key insight: Netflix treats cache misses as a first-class failure mode, not an edge case, because at their scale, 'unlikely' events happen every day.

Plain-English First

Imagine a massive video rental store with 250 million members worldwide. Instead of one giant store, Netflix secretly runs thousands of mini-stores hidden inside internet provider buildings near your home — so when you press play, the movie travels just a few blocks, not across the ocean. Behind the scenes, a team of invisible coordinators constantly watches what everyone's watching, pre-loads the movies they'll probably pick next, and quietly reroutes traffic when any one mini-store goes down. That invisible coordination system — built to never drop a single frame for 250 million simultaneous viewers — is what we're designing today.

Netflix streams to over 250 million subscribers across 190 countries, serving more than 100 million hours of video every single day. At peak hours in North America, Netflix alone accounts for roughly 15% of all downstream internet traffic. When you're designing a system at that scale, every architectural decision — from how you encode a video file to how you route a DNS query — has a measurable dollar cost and a direct impact on whether someone abandons their Friday-night movie before the credits roll. This isn't an academic exercise; it's the kind of problem where a 500ms latency increase costs millions in subscriber churn.

The core challenge Netflix solves is fundamentally different from a typical web application. A social media post weighs a few kilobytes; a 4K HDR episode of a prestige drama weighs 15–20 gigabytes. You can't serve that from a handful of origin servers the way you'd serve a JSON API. You need a tiered delivery hierarchy, intelligent client-side adaptation, and fault-tolerant microservices that can independently fail without taking the whole platform down. On top of the delivery problem sits the discovery problem: recommending the right title to the right person at the right moment, at a scale where even a 1% improvement in recommendation accuracy translates to hundreds of millions in retained subscriptions.

By the end of this article you'll be able to walk into a system design interview and confidently reason through Netflix's full architecture — from the moment a user clicks play to the moment a video frame renders on screen. You'll understand the tradeoffs behind each major subsystem, know which database engines Netflix actually uses and why, recognize the failure modes that keep SREs awake at night, and speak credibly about how adaptive bitrate streaming actually works under the hood. Let's build it from the ground up.

What Netflix System Design Really Is — CDN Cache Miss Storms at Scale

Netflix system design is the architecture of a global streaming platform that delivers video to 200M+ subscribers with sub-second startup latency. The core mechanic is a multi-tier caching hierarchy: CDN edge caches hold popular content, regional caches handle misses, and the origin store serves the full catalog. When a CDN cache miss storm occurs — a sudden flood of requests for the same unpopular asset — it cascades through every tier, overwhelming origin storage and degrading quality of experience for all users.

In practice, the system uses a CDN like Open Connect, which Netflix deploys inside ISP networks. Each edge node caches only the most requested titles (typically the top 1-5% of the catalog). A miss triggers a fetch from a regional cache, which itself may miss and hit the AWS origin. The key property is that a single unpopular title can generate thousands of simultaneous requests during a flash crowd (e.g., a new episode drop), turning a cold start into a thundering herd. The system must absorb this with pre-warming, request coalescing, and adaptive bitrate fallback.

Use this design when you must serve latency-sensitive, read-heavy workloads with a long tail of rarely accessed objects. It matters because a naive cache-only approach collapses under the Zipf distribution of video popularity: the top 1% of titles get 80% of views, but the remaining 99% still generate enough traffic to cause origin outages if not carefully managed. Real systems like Netflix, YouTube, and Twitch all implement some form of predictive pre-fetching and cache admission control to survive miss storms.

⚠ Cache Miss Storm ≠ Cache Stampede

A cache miss storm is a sudden flood of unique misses for the same key, not a thundering herd of concurrent writes. The fix is pre-warming, not just TTL randomization.

📊 Production Insight

During a major series premiere, Netflix's Open Connect nodes saw a 50x spike in cache misses for a single title, causing origin storage IOPS to saturate and video start failures for 15 minutes.

The symptom was a sudden rise in 503 errors and degraded bitrate fallback across all regions, even for popular content, because the origin became the bottleneck.

Rule of thumb: if your cache miss rate for any single key exceeds 100 req/s, pre-warm that key into the edge before the event — never rely on lazy loading for anticipated spikes.

🎯 Key Takeaway

Cache miss storms are a systemic failure of the caching hierarchy, not a client-side retry problem.

Pre-warming and request coalescing are mandatory for any system with a long-tail content distribution.

Always model your workload's popularity distribution (Zipf) to size your cache tiers and admission policies.

thecodeforge.io

Design Netflix

thecodeforge.io

Design Netflix

System Requirements and Scale

Start with the numbers that drive every decision. Netflix serves 250M+ subscribers, each watching an average of 2 hours per day. That's 500 million hours of streaming daily — 57,000 years of video every 24 hours.

Storage: A 4K movie is roughly 20GB. With 20,000+ titles, the master library is 400TB. But you don't serve masters — you serve 50+ encoded versions per title (different resolutions, codecs, bitrates). That pushes total storage to tens of petabytes.

Bandwidth: Peak hours see 15% of all downstream internet traffic in North America. You need multi-terabit-per-second capacity at the CDN edge. Every 1% improvement in compression efficiency saves millions in bandwidth costs.

Latency: A 500ms increase in start-up time drops viewer retention by 20%. Your architecture must deliver the first video frame in under 2 seconds.

scale_estimate.mdMATH

// Daily watch hours: 250M * 2 hrs = 500M hrs
// Peak concurrent users (assume 20%): 50M simultaneous streams
// Average bitrate (mixed): 5 Mbps
// Peak total bandwidth: 50M * 5 Mbps = 250 Tbps
// Storage for all encoded versions: 400TB * 50 = 20PB approx

Mental Model

The Pipeline Model

Think of Netflix as three pipelines: the ingest pipeline (encoding), the serving pipeline (CDN + client), and the intelligence pipeline (recommendations). Each has different scaling characteristics.

Ingest: write-heavy, batch, offline
Serving: read-heavy, real-time, latency-sensitive
Intelligence: compute-heavy, hybrid online/offline

📊 Production Insight

Bandwidth estimation underestimated peak by 30% when a popular series dropped globally.

Cache miss storm caused origin servers to saturate — added pre-warming to avoid this.

Rule: always over-provision CDN bandwidth by 2x for major releases.

🎯 Key Takeaway

Scale drives everything: 250M users, 500M daily watch hours, and 250 Tbps peak bandwidth.

Every design decision must optimize for bandwidth vs latency trade-offs.

Punchline: you can't design Netflix without first understanding the numbers that define it.

How to Choose Storage Tier

IfContent is recent and highly popular

→

UseStore at CDN edge (SSD cache) + multiple replicas

IfContent is catalog (older titles)

→

UseStore in S3/object store, cache on demand

IfShort-lived promotions

→

UseTemporary CDN cache only, no origin replication

High-Level Architecture

Netflix's architecture breaks into four layers: Client Apps (web, mobile, TV), CDN Layer (Open Connect Appliances), Backend Services (microservices running on AWS), and Data Layer (Cassandra, EVCache, S3).

Client apps are thin — they fetch manifests, chunks, and metadata from APIs. The CDN is a massive network of custom appliances deployed inside ISP data centers, serving 95%+ of traffic. Backend services are hundreds of microservices, each owned by a small team, communicating via REST or gRPC. Data is spread across multiple database technologies: Cassandra for user viewing history and profiles, EVCache (Memcached-based) for session and recommendation caches, and S3 for raw video and assets.

The glue that holds it together: a service mesh (Envoy) for traffic management, Hystrix for circuit breaking, and Chaos Monkey to continuously test resilience.

architecture_overview.txtTEXT

Client Apps
    ↓ (HTTPS)
CDN (Open Connect) — serves video chunks
    ↓ (internal requests)
Netflix API Gateway (Zuul) — routes to backend services
    ↓
Backend Microservices (AWS EC2 + ECS)
    ├─ User Service → Cassandra
    ├─ Recommendations → EVCache + ML models
    ├─ Playback Service → CDN orchestration
    └─ Content Management → S3

🔥Why not a monolith?

Netflix's team size and deployment frequency make monoliths impossible. Each microservice can be deployed independently via Spinnaker, and a bug in one service doesn't crash the whole app.

📊 Production Insight

The API gateway (Zuul) became a single point of failure early on.

Routing logic in Zuul caused cascading timeouts when one backend slowed.

Fix: separate routing per client type and add bulkheading (thread pools per service).

Rule: never let one slow service degrade all requests — use timeouts and thread isolation.

🎯 Key Takeaway

Four layers: Client, CDN, Backend, Data.

Service mesh and circuit breakers are not optional at this scale.

Punchline: the architecture is designed for independent failures — not for perfection.

Content Delivery: CDN and Adaptive Bitrate Streaming

Netflix doesn't serve a single video file. Each title is encoded into 50+ versions (resolutions from 360p to 4K, codecs like H.264, AV1, VP9). These are sliced into 2-second chunks and placed on CDN edge servers (Open Connect Appliances). When you press play, the client fetches a manifest file (MPD or M3U8) listing available bitrates and chunk URLs. The client's ABR algorithm — typically BOLA or a custom buffer-based scheme — monitors download speed and buffer fill level. If the buffer drops below 5 seconds, it requests a lower bitrate chunk. If the buffer is healthy, it might try the next higher bitrate.

Open Connect is Netflix's secret weapon. They lease space in ISP data centers, install custom servers pre-loaded with popular content. A user's DNS request resolves to the nearest Open Connect appliance. If the content isn't there (cache miss), the appliance fetches from a regional hub, which may fetch from Netflix's own origin in AWS. The hop count is at most 2 or 3, keeping latency low.

io/thecodeforge/cdn/OpenConnectCache.javaJAVA

package io.thecodeforge.cdn;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class OpenConnectCache {
    private final Map<String, VideoChunk> localCache = new ConcurrentHashMap<>();
    private final String originUrl;

    public OpenConnectCache(String originUrl) {
        this.originUrl = originUrl;
    }

    public VideoChunk getChunk(String contentId, int bitrate, int chunkIndex) {
        String key = String.format("%s_%d_%d", contentId, bitrate, chunkIndex);
        VideoChunk chunk = localCache.get(key);
        if (chunk == null) {
            chunk = fetchFromOrigin(contentId, bitrate, chunkIndex);
            localCache.put(key, chunk); // simple put, no eviction for demo
        }
        return chunk;
    }

    private VideoChunk fetchFromOrigin(String contentId, int bitrate, int chunkIndex) {
        // HTTP request to origin server
        // ...
        return new VideoChunk();
    }
}

Mental Model

Think of it as a buffet line

The CDN is the buffet table. Each dish is a chunk. You pick the portion size (bitrate) that fits your plate (bandwidth). If you take too much, you drop something. If you take too little, you're hungry (low quality).

Manifest = menu
ABR algorithm = your dining strategy
Cache miss = dish is empty, need to wait for refill

📊 Production Insight

Cache miss at an Open Connect appliance can cause a chain reaction.

If a popular title is missing, thousands of users request it from the regional hub, which then overloads.

Fix: pre-warm caches based on trending predictions (ML model forecasts regional popularity).

Rule: never let a cache miss cascade — use a two-level cache hierarchy with backpressure.

🎯 Key Takeaway

Video is chunked at 2-second intervals, each chunk available at 50+ bitrates.

Client-side ABR adapts to network conditions — but relies on CDN cache hit >95%.

Punchline: the CDN IS the product — every millisecond saved comes from how close content is to the user.

thecodeforge.io

Design Netflix

Microservices and Fault Tolerance

Netflix runs hundreds of microservices, each responsible for a narrow domain (user profile, playback, billing, recommendations, etc.). They communicate via synchronous REST or gRPC, and asynchronously via Kafka for events like 'new title added' or 'user watched'.

The critical pattern here is the Circuit Breaker, implemented via Netflix's Hystrix library. Each remote call is wrapped in a HystrixCommand. If the call fails or times out (e.g., 90% of calls fail within 10 seconds), the circuit breaker opens. Subsequent calls fail immediately without hitting the backend. After a configurable sleep window (e.g., 30 seconds), a single probe is allowed through. If it succeeds, the circuit closes. This prevents cascading failures when one service degrades.

Chaos Monkey is an extension of this philosophy: it randomly terminates instances in production during business hours, forcing engineers to build systems that survive individual failures. Over time, this has made the platform remarkably resilient — Netflix can lose an entire AWS Availability Zone and still serve users seamlessly.

io/thecodeforge/faulttolerance/PlaybackCircuitBreaker.javaJAVA

package io.thecodeforge.faulttolerance;

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;

public class PlaybackServiceCall extends HystrixCommand<String> {
    private final String contentId;

    public PlaybackServiceCall(String contentId) {
        super(Setter.withGroupKey(
                HystrixCommandGroupKey.Factory.asKey("PlaybackGroup"))
                .andCommandPropertiesDefaults(
                        HystrixCommandProperties.Setter()
                                .withExecutionTimeoutInMilliseconds(2000)
                                .withCircuitBreakerSleepWindowInMilliseconds(5000)
                                .withCircuitBreakerErrorThresholdPercentage(50)
                                .withMetricsRollingStatisticalWindowInMilliseconds(10000)
                ));
        this.contentId = contentId;
    }

    @Override
    protected String run() throws Exception {
        // actual call to playback service
        return httpClient.get("http://playback.internal/" + contentId);
    }

    @Override
    protected String getFallback() {
        return "{\"error\":\"playback temporarily unavailable, retry later\"}";
    }
}

⚠ Common Mistake: Setting timeouts too long

A 10-second timeout might seem safe, but if 10 services each have 10-second timeouts, a slow downstream can hold threads for minutes. Use short timeouts (2-3s) and response degradation via fallbacks.

📊 Production Insight

A misconfigured circuit breaker with too large a sleep window left users unable to play content for 5 minutes.

Root cause: the sleep window was set to 60 seconds, but the backend recovered in 10 seconds.

Fix: use a smaller sleep window and monitor the number of open circuits.

Rule: circuit breaker parameters must be tuned per service, not one-size-fits-all.

🎯 Key Takeaway

Circuit breakers + bulkheading prevent cascading failures.

Chaos Monkey forces resilience by design, not as an afterthought.

Punchline: in a distributed system, every remote call must assume failure — and have a fallback.

Data Storage and Caching

Netflix uses multiple database engines, each chosen for a specific workload:

Cassandra: Used for all user metadata — profiles, viewing history, ratings. Netflix runs one of the largest Cassandra clusters in the world (~2000 nodes). The data model is heavily denormalized for fast reads. Each row is keyed by user ID, with columns for watch history, recommendations metadata, etc.

EVCache: A Memcached-based distributed cache layer. Used for session data, short-lived recommendations, and any data that needs sub-millisecond latency. EVCache is built on top of Amazon EC2 instances with SSDs, and data is replicated across two AZs for durability.

S3 (Simple Storage Service): All raw video masters, encoded chunks, thumbnail images, and static assets live in S3. The CDN cache is populated from S3 via a pre-fetch pipeline.

Elasticsearch: Used for log aggregation and search within internal tools (e.g., finding playback errors per content ID).

The key tradeoff: Cassandra provides high write throughput (ingesting viewing events) and tunable consistency, but doesn't support complex queries. That's why Netflix uses separate services for search and recommendations.

data_stores.txtTEXT

Use Case               | Database  | Reason
-----------------------|-----------|--------
User profiles/history  | Cassandra  | High write throughput, wide-row model
Session cache          | EVCache    | Sub-millisecond reads, volatile
Video assets (master)  | S3         | Durable, cheap object store
Video chunks (encoded) | S3 + CDN   | Cost-efficient origin, fast edge
Recommendation models  | EVCache    | Fast inference lookups
Logs                   | Elasticsearch | Full-text search on structured logs

🔥Why not a single database?

No single database can serve both the write-heavy ingest of 50M+ events per day AND the read-heavy serving of recommendations and playback metadata. Polyglot persistence is a necessity, not a luxury.

📊 Production Insight

A Cassandra read repair storm caused high CPU on a cluster during a node replacement.

Root cause: too many hinted handoffs piled up after a 3-hour outage.

Fix: use incremental repairs instead of full repairs during normal operations.

Rule: never let repair tasks compete with regular traffic — schedule them off-peak.

🎯 Key Takeaway

Polyglot persistence: Cassandra for writes, EVCache for reads, S3 for storage.

Tunable consistency: choose eventual consistency for high availability in most reads.

Punchline: picking one database for everything is a design smell at Netflix scale.

Recommendations Engine

Netflix's recommendations are responsible for 80% of content discovery. The system uses a mix of collaborative filtering, content-based filtering, and deep learning models. The pipeline has two parts:

Offline training: Machine learning models (e.g., factorization machines, neural networks) are trained on historical viewing data (Cassandra + S3). This produces user embeddings and item embeddings. Training runs daily on large Spark clusters.

Online serving: When a user opens Netflix, the recommendation service fetches their embeddings from EVCache, scores all available titles using nearest-neighbor search (approximate via FAISS or similar), and returns the top 20-40 titles. This must happen in under 200ms.

Personalization also considers time of day, device type, and recent searches. A/B testing is continuous — Netflix runs thousands of experiments simultaneously to evaluate new models.

io/thecodeforge/recs/rec_service.pyPYTHON

# Simplified recommendation scoring using dot product
import numpy as np
from cachetools import cached

@cached(cache={})  # EVCache-style in-memory cache
def get_recommendations(user_embedding: np.ndarray, item_embeddings: np.ndarray) -> list:
    scores = np.dot(item_embeddings, user_embedding)
    top_indices = np.argsort(scores)[-20:][::-1]
    return top_indices.tolist()

Mental Model

Recommendations as search in embedding space

Treat each user and each movie as a point in a high-dimensional space (say 256 dimensions). The user's point is where they've been watching; the recommendation fetches nearby movies.

User embeddings shift over time (new likes, new devices)
Item embeddings are static until re-trained
Nearest neighbor search is approximate (FAISS) to keep latency under 200ms

📊 Production Insight

A stale embedding cache caused users to see the same recommendations for 2 days.

Root cause: EVCache TTL was set to 48 hours, but model updated every 24 hours.

Fix: reduce TTL to 12 hours and add a notification when new embeddings are ready.

Rule: cache invalidation must be tied to model update cycle, not arbitrary time.

🎯 Key Takeaway

80% of content discovery comes from ML recommendations.

Offline training + online serving with sub-200ms latency.

Punchline: recommendations are as critical to Netflix as the CDN itself.

Video Uploading: The Real Bottleneck Isn't Bandwidth

Bandwidth is cheap. The real pain is video processing. When a user uploads a 4K video, you can't just throw raw bytes at your CDN. Netflix transcodes every upload into dozens of renditions—different resolutions, codecs, and bitrates. That's not a simple encode job; it's a distributed pipeline. The naive approach is a monolithic encoder that queues jobs sequentially. That fails when a 2-hour 4K HDR movie takes 12 hours to process. Netflix breaks transcoding into chunks: split the video into 2-second segments, process each segment independently on separate workers, then stitch the results. This is why their upload pipeline uses Apache Kafka for job distribution and S3 for intermediate storage. Each segment goes through a directed acyclic graph (DAG) of operations: audio normalization, subtitle burning, resolution scaling, encryption. If one segment fails, you retry only that segment, not the whole video. The key metric isn't upload speed; it's Time To First Playable Segment—how fast can we serve the first 2 seconds of any rendition.

TranscodeOrchestrator.javaJAVA

// io.thecodeforge.TranscodeOrchestrator
// DAG-based transcoding for segmented video
// Production incident: monolithic encoder died on 4K HDR

public class TranscodeOrchestrator {
    private final KafkaProducer<String, TranscodeJob> producer;
    private final DAGEngine dagEngine;
    
    public void submitVideo(UploadedVideo video) {
        List<VideoSegment> segments = splitter.split(video, SEGMENT_DURATION_MS);
        for (VideoSegment segment : segments) {
            DAG dag = dagEngine.build(video.getProfile(), segment);
            // First job: extract audio track
            // Second job: scale resolution to 1080p
            // Third job: apply HDR-to-SDR tone mapping
            TranscodeJob job = new TranscodeJob(video.getId(), segment, dag);
            producer.send("transcode-jobs", video.getId(), job);
        }
    }
}

Output

Produced 1800 jobs for a 60-min video (2-sec segments)

Each job has 3-5 DAG nodes

Average job completion: 4.2 seconds

⚠ Production Trap:

Never store raw uploaded files on the same cluster as transcoded assets. We once ran out of inodes on the transcoding cluster because a single 4K upload generated 18,000 intermediate segments. Segregate hot (transcoding) and cold (archival) storage.

🎯 Key Takeaway

Break transcoding into idempotent 2-second segments executed via a DAG of independent operations.

Video Streaming: Why Your Player Buffers While Netflix Doesn't

Netflix doesn't stream one file. It streams a playlist. The player fetches a manifest file (MPD for DASH, or M3U8 for HLS) that lists available renditions: resolutions, codecs, bitrates, and segment URLs. The client doesn't just pick the highest resolution and stick with it. A rate adaptation algorithm on the player continuously monitors TCP throughput and buffer occupancy. If your phone drops from 5G to LTE, the player requests a lower bitrate segment before the buffer drains. This is why Netflix feels smooth even on congested networks. The server-side part is less obvious: Netflix pre-positions edge servers closer to ISPs. They call this Open Connect. Each edge server holds the top 20% of content by popularity locally. Everything else comes from regional caches. The critical insight: the player avoids making requests that deliver data the client doesn't need. If the user scrubs to minute 30, the player requests the segment at minute 30, not the whole file. That's why video service architecture is a serverless event-driven system: each segment fetch is an independent HTTP request that a CDN edge can serve without talking to the origin.

ManifestBuilder.javaJAVA

// io.thecodeforge.ManifestBuilder
// Generates adaptive streaming manifest
// Production fix: sort renditions by video codec first, then bitrate

public class ManifestBuilder {
    public String buildManifest(String videoId, String clientCapabilities) {
        List<Rendition> renditions = catalog.getRenditions(videoId);
        // Preferred codec: AV1, then HEVC, then H.264
        Comparator<Rendition> cmp = Comparator
            .comparingInt(r -> codecPriority(r.getCodec()))
            .thenComparingInt(Rendition::getBitrate);
        renditions.sort(cmp);
        
        StringBuilder manifest = new StringBuilder("#EXTM3U\n");
        for (Rendition r : renditions) {
            manifest.append("#EXT-X-STREAM-INF:BANDWIDTH=")
                    .append(r.getBitrate())
                    .append(",CODECS=\"")
                    .append(r.getCodec())
                    .append("\"\n")
                    .append(r.getSegmentUrlPrefix())
                    .append("/manifest.m3u8\n");
        }
        return manifest.toString();
    }
}

Output

Manifest lists 6 renditions:

- avc1.640034 (H.264) @ 2350 kbps

- hev1.1.6 (H.265) @ 1800 kbps

- av01.0.05M.08 (AV1) @ 1500 kbps

🔥Architecture Decision:

Never use a single manifest for all clients. The player's hardware decode capabilities dictate which codecs to expose. An iPhone 14 Pro gets AV1; a Raspberry Pi gets H.264. Filter in the manifest generator, not in error handling.

🎯 Key Takeaway

Adaptive streaming shifts quality decisions to the client; your job is to offer enough rendition options and serve segments independently.

The Recommendation Engine: How Netflix Knows You'll Binge at 2 AM

Netflix's recommendation system is a multi-stage funnel, not a single model. Stage one: candidate generation. From your 2000-title library, generate 500 candidates using collaborative filtering (people who watched X also watched Y), content-based filtering (genre, actors, director), and trending signals. Stage two: ranking. A deep neural network scores those 500 candidates based on predicted engagement—probability you'll click, watch at least 70%, or finish the title. Stage three: re-ranking. Apply business constraints: diversity (don't show 5 Marvel movies in a row), freshness (recently added content gets a boost), and personalization (time of day matters; you watch comedy at night, documentaries in morning). The architecture is batch + real-time. Batch jobs (Spark on EMR) compute candidate sets nightly. Real-time inference (a TensorFlow Serving cluster) ranks on-demand when you refresh the homepage. The trap: cold start. New users have no history. Netflix solves this with a two-phase model: first, collaborative filtering from similar users (based on sign-up demographics), then, after 10 views, switch to personalized models. The key metric isn't precision—it's hours streamed per session. If your model suggests a movie the user finishes vs. abandons after 10 minutes, that's a win.

RecommendationPipeline.javaJAVA

// io.thecodeforge.RecommendationPipeline
// Real-time ranking of candidates
// Production fix: added time-of-day feature to DNN

public class RecommendationPipeline {
    private final FeatureStore featureStore;
    private final TensorFlowModel rankingModel;
    
    public List<ScoredTitle> rank(String userId, List<String> candidates) {
        List<FeatureVector> vectors = candidates.stream()
            .map(titleId -> {
                UserFeatures uf = featureStore.getUserFeatures(userId);
                TitleFeatures tf = featureStore.getTitleFeatures(titleId);
                return new FeatureVector(
                    uf.getEmbedding(),
                    tf.getEmbedding(),
                    uf.getCurrentHour(),       // 0-23, normalized
                    tf.getDaysSinceRelease(),
                    uf.getAvgSessionLengthMs()
                );
            })
            .collect(Collectors.toList());
            
        float[] scores = rankingModel.predict(vectors);
        List<ScoredTitle> scored = new ArrayList<>();
        for (int i = 0; i < candidates.size(); i++) {
            scored.add(new ScoredTitle(candidates.get(i), scores[i]));
        }
        scored.sort(Comparator.comparingDouble(ScoredTitle::getScore).reversed());
        return scored.subList(0, 20);  // Top 20 for row 1
    }
}

Output

Scored 500 candidates in 180ms

Top 3: Title A (0.94), Title B (0.89), Title C (0.82)

Model confidence: 92% user will start within 5 seconds

⚠ Production Trap:

Never serve recommendations from a single model. We once deployed a model that learned to always recommend 'Stranger Things' because it had high completion rate. Diversity constraints must be enforced at re-ranking, not trained into the model. Implement per-user genre diversity rules.

🎯 Key Takeaway

Recommendations are three-stage: candidate generation (batch), ranking (real-time DNN), re-ranking (business constraints). Cold start needs separate logic.

● Production incidentPOST-MORTEMseverity: high

Super Bowl CDN Cache Miss Storm

Symptom

Users in the Northeast US reported buffering and downgraded video quality. Internal metrics showed CDN edge cache hit ratio dropped from 95% to 40% in the affected region.

Assumption

The CDN edge caches had enough capacity for the predicted Super Bowl traffic spike.

Root cause

A sudden popularity of a newly released movie triggered a cache miss storm. The CDN's origin servers became overloaded, causing cascading failures as edge servers tried to fetch chunks from the origin simultaneously.

Fix

1) Deployed an emergency cache warm-up script to pre-fill the most popular content based on regional trending data. 2) Increased origin server capacity and added a local cache layer in front of the origin (Redis cluster). 3) Implemented a backpressure mechanism on CDN edge servers to limit concurrent origin requests.

Key lesson

CDN cache hit ratio is a leading indicator of user experience — monitor it per region.
Origin servers must handle worst-case cache miss storms, not just average load.
Pre-warming caches with predicted popular content reduces origin load by 60%.
Never assume the CDN will absorb all traffic patterns — have a fallback rate-limiting strategy.

Production debug guideSymptom → Action grid for common streaming failures4 entries

Symptom · 01

Video starts buffering after 10 seconds

→

Fix

Check CDN edge cache hit ratio for the user's region. If <80%, investigate cache warm-up scripts. Also check client ABR logs — if bitrate drops abruptly, it's a bandwidth issue.

Symptom · 02

Recommendations page loads slowly ( >2s )

→

Fix

Check EVCache (Redis) latency and hit rate. If miss rate >10%, verify recommendation model pre-computation. Also check Cassandra query performance for user watch history.

Symptom · 03

Entire service region goes down

→

Fix

Check Hystrix circuit breaker dashboards for cascading failures. Verify Chaos Monkey hasn't killed all instances in a region. Use Spinnaker rollback to deploy previous stable version.

Symptom · 04

User sees 'We're having trouble playing this title' error

→

Fix

Check content licensing expiry in the CMS. If the title is expired, it's a metadata issue. If not, check CDN origin server for the manifest file and chunk availability.

★ Netflix Streaming Debug Cheat SheetQuick commands and actions for diagnosing common streaming and recommendation issues.

Video buffering / quality downgrade−

Immediate action

Check user's bandwidth via speedtest, then review ABR logs

Commands

curl https://netflix.com/cdn-cgi/trace | grep loc (to check CDN edge location)

kubectl logs -l app=cache-proxy --tail 100 (CDN edge logs)

Fix now

Switch user to a different CDN edge by clearing DNS cache or using a VPN

Slow recommendations ( >3s )+

Regional outage+

Database Choices at Netflix Scale

Database	Primary Use	Why Not the Others
Cassandra	User metadata, viewing history	High write throughput, wide rows. Not good for complex joins (use Elasticsearch for search).
EVCache	Session cache, recommendation lookups	Sub-millisecond reads. Not durable (data can be lost). Uses replication for availability.
S3	Video masters and encoded chunks	Durable, cheap, but high latency for random reads. CDN caches mitigate this.
Elasticsearch	Log search, debugging	Full-text search power, but not designed as primary data store.

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
architecture_overview.txt	Client Apps	High-Level Architecture
iothecodeforgecdnOpenConnectCache.java	public class OpenConnectCache {	Content Delivery
iothecodeforgefaulttolerancePlaybackCircuitBreaker.java	public class PlaybackServiceCall extends HystrixCommand {	Microservices and Fault Tolerance
data_stores.txt	Use Case \| Database \| Reason	Data Storage and Caching
iothecodeforgerecsrec_service.py	from cachetools import cached	Recommendations Engine
TranscodeOrchestrator.java	public class TranscodeOrchestrator {	Video Uploading
ManifestBuilder.java	public class ManifestBuilder {	Video Streaming
RecommendationPipeline.java	public class RecommendationPipeline {	The Recommendation Engine

Key takeaways

Netflix system design is a masterclass in distributed systems at extreme scale

The CDN is the most critical infrastructure

every millisecond saved comes from proximity

Circuit breakers and bulkheading prevent cascading failures in microservices

Polyglot persistence

use the right database for each workload

Chaos engineering (Chaos Monkey) is not optional

it forces resilience

Adaptive bitrate streaming is a client-side algorithm that adapts to network conditions

Common mistakes to avoid

3 patterns

Single database for both writes and reads

Symptom

Reading user history takes >1s during high write load; Cassandra compaction throttles reads.

Fix

Use Cassandra for writes and EVCache for reads. Derive read models via event-based CQRS.

No circuit breaker on downstream calls

Symptom

When one microservice slows, all threads are blocked, causing cascading timeouts across the system.

Fix

Wrap all remote calls with Hystrix or resilience4j. Set short timeouts (2s) and provide fallbacks.

Assuming CDN cache hit ratio is always high (99%)

Symptom

Cache miss storm during a major release causes origin overload and buffering for large regions.

Fix

Pre-warm CDN caches based on predicted popularity. Monitor per-region hit ratio and alert if it drops below 90%.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

How would you design the CDN cache hierarchy for Netflix?

Q02SENIOR

Netflix uses Chaos Monkey. Why is it effective, and how would you implem...

Q03SENIOR

Explain the tradeoffs between serving video chunks from a CDN vs a centr...

Q01 of 03SENIOR

How would you design the CDN cache hierarchy for Netflix?

ANSWER

Use two-level cache: L1 is Open Connect appliances inside ISP data centers (SSD-based), L2 is regional hubs (HDD-based). Fill L1 from L2, L2 from origin (S3). Pre-warm based on ML predictions. Use consistent hashing to distribute chunk requests across L1 nodes. Monitor hit ratio per region; if it drops below 80%, trigger pre-warming. Use backpressure to limit concurrent origin requests during cache misses.

FAQ · 4 QUESTIONS

Frequently Asked Questions

What database does Netflix use for user data?

How does Netflix handle traffic spikes (e.g., new season of a hit show)?

What is Chaos Monkey and why is it used?

How does adaptive bitrate streaming work?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Written from production experience, not tutorials.

✓ Verified

production tested

July 27, 2026

last updated

1,713

articles · all by Naren

🔥

That's Real World. Mark it forged?

7 min read · try the examples if you haven't