Load Balancing Components Explained — How Traffic Gets Distributed at Scale
- A load balancer distributes incoming traffic across multiple servers so no single machine becomes a bottleneck or single point of failure
- Health checks are the heartbeat — they probe servers continuously and remove dead ones from rotation automatically; TCP-only checks are a trap
- Algorithms (Round Robin, Least Connections, IP Hash, Weighted, Least Response Time) decide which server gets each request based on different signals
- Layer 4 (TCP/UDP) is faster; Layer 7 (HTTP/HTTPS) is smarter — it can inspect cookies, headers, URL paths, and make content-aware routing decisions
- Session persistence (sticky sessions) keeps users on one server but creates hotspots and causes mass session loss if that server dies
- The biggest trap: skipping health checks or using TCP-only probes means your LB becomes a black hole, routing traffic into servers that can't respond
- In production, no single load balancer handles everything — DNS, edge, Layer 7 gateway, and service mesh each own a different tier
Production Debug Guide: Symptom → Action Mapping for Common LB Failures

- Upstream servers showing as down in LB logs:
  - `curl -v http://<server-ip>:8080/healthz`
  - `kubectl get pods -l app=backend -o wide`
- One server receiving disproportionate traffic:
  - `nginx -T 2>&1 | grep -A5 'upstream'`
  - `ss -tnp | grep :8080 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn`
- SSL errors appearing intermittently at the load balancer:
  - `openssl s_client -connect <lb-host>:443 -tls1_2`
  - `openssl x509 -in /etc/ssl/cert.pem -noout -dates`
- Requests timing out but servers appear healthy:
  - `curl -w '@curl-format.txt' -o /dev/null -s http://<lb-host>/api/test`
  - `tail -f /var/log/nginx/access.log | grep ' 0.0[0-9]\{2\} '`
Every time you tap 'Buy Now' on Amazon or start a video on Netflix, your request hits one of hundreds or thousands of servers — chosen in milliseconds by a load balancer you never see. Without it, modern internet-scale applications simply couldn't exist.
The core problem is deceptively simple: distribute work across many machines so no single machine becomes a bottleneck, a single point of failure, or a performance nightmare. Without load balancing, one server handles everything until it buckles under the weight. With it, traffic is spread intelligently, failed servers are automatically removed from the pool, and new capacity can be added without touching the rest of the system.
But 'load balancer' is not a single thing. It's a tier — sometimes multiple tiers — of components that each own a different slice of the problem. DNS-level routing decides which data center gets your request. A network load balancer handles the raw TCP connection at line rate. An application load balancer inspects your HTTP headers and routes you to the right microservice. A service mesh sidecar manages the connection between that microservice and the next one in the chain. Understanding where each layer sits, what decisions it can make, and what its failure modes look like is what separates engineers who can configure a load balancer from engineers who can design a system that stays up when things go wrong.
By the end of this article you'll understand what load balancers are, which components make them tick, when to use Round Robin vs Least Connections vs Least Response Time, why sticky sessions can be a trap at exactly the wrong moment, and how to answer the load balancing questions that trip people up in system design interviews at senior level.
The Core Mechanics: How a Load Balancer Decides Where to Send Traffic
A load balancer sits between the client and your server pool. When a request arrives, it has to make a routing decision in milliseconds — which server gets this connection, right now, given the current state of the cluster.
That decision happens at one of two layers, and the layer matters more than most people realize. Layer 4 load balancers operate at the transport layer — they see IP addresses, TCP/UDP ports, and packet counts. They don't open the envelope. Layer 7 load balancers operate at the application layer — they can read the HTTP method, URL path, headers, cookies, and request body. They know the difference between a GET /api/images request and a POST /api/payments request and can route them to entirely different server pools.
The trade-off is straightforward: Layer 4 is faster because there's almost nothing to parse. Layer 7 is more expensive computationally because it has to terminate the connection, parse the HTTP request, make a routing decision, and then establish a new connection (or reuse a keepalive connection) to the backend. In practice, that overhead is typically 0.5–2ms per request — negligible for most applications, meaningful for high-frequency trading or real-time gaming.
In production, you usually don't pick one. The standard architecture is a Layer 4 network load balancer at the edge handling raw TCP connections at line rate, with Layer 7 application load balancers behind it doing content-aware routing to specific service pools. AWS calls these NLB and ALB. On-premise, you'd see HAProxy in TCP mode in front of NGINX instances.
```nginx
# Upstream pool with mixed weights and a backup server
upstream forge_backend_cluster {
    # Least Connections: best for workloads with variable request processing times
    # Round Robin (default) assumes all requests take roughly the same time — often wrong
    least_conn;

    server 10.0.0.1:8080 weight=3;  # Larger instance — gets 3x the traffic
    server 10.0.0.2:8080 weight=1;  # Standard instance
    server 10.0.0.3:8080 backup;    # Only receives traffic when primary servers are down

    # Keepalive: reuse connections to backends instead of opening new TCP connections per request
    # Without this, high-traffic scenarios create connection exhaustion
    keepalive 32;

    # Passive health checks: mark server down after 3 consecutive failures, retry after 30s
    # Active health checks (requires nginx_upstream_check_module or NGINX Plus)
    # server 10.0.0.1:8080 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name thecodeforge.io;

    location /api/ {
        proxy_pass http://forge_backend_cluster;

        # Preserve real client IP through the proxy
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $host;

        # Timeout configuration — tune for your backend's actual SLA
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;

        # Retry on failure — but only for idempotent methods to avoid double-posting
        proxy_next_upstream error timeout http_503;
        proxy_next_upstream_tries 2;
    }
}
```
# Server 10.0.0.1 receives ~3x requests vs 10.0.0.2 due to weight=3
# 10.0.0.3 stays idle unless both primaries fail health checks
# Keepalive reuses existing connections — avoids TCP handshake overhead on every request
- Layer 4: No packet inspection. Lowest latency (~microseconds). Ideal for raw TCP/UDP traffic like gaming servers, video streaming, or any protocol that isn't HTTP.
- Layer 7: Reads cookies, headers, URL paths, and HTTP methods. Enables A/B testing, canary deployments, microservice routing by path, and authentication offloading.
- Rule of thumb: if your routing decision requires knowing anything about the request content, you need Layer 7. If you only need to balance load across identical servers, Layer 4 is enough.
- Performance cost of Layer 7: typically 0.5–2ms additional latency per request due to connection termination, TLS handling, and HTTP parsing.
- Standard production architecture: Layer 4 NLB at the edge absorbs raw connection volume, Layer 7 ALB/NGINX behind it makes content-aware routing decisions per service.
Load Balancing Algorithms in Depth: Choosing the Right Strategy for Your Workload
The algorithm your load balancer uses to select a backend is not a configuration detail — it's a decision that directly affects your latency distribution, your server utilization, and what happens when servers become slow rather than fully dead.
Round Robin is the simplest: request 1 goes to server 1, request 2 to server 2, and so on, cycling back to the beginning. It works well when all servers are identical and all requests take roughly the same time to process. Both of those assumptions break in practice. Servers are rarely perfectly identical after weeks of different memory allocations and GC histories. And requests are almost never uniform — a request that triggers a complex database join takes 50x longer than one hitting a cache.
Least Connections routes each new request to whichever server currently has the fewest active connections. This self-corrects automatically: a slow server accumulates connections faster, so the LB naturally sends it fewer new ones. This is why Least Connections is the safer default for most production web applications.
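That self-correction is easy to see in code. Here is a minimal sketch of the selection rule (class and method names are illustrative, not taken from any particular load balancer; a real implementation would also handle health state and weights):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal Least Connections selector: pick the backend with the fewest
// in-flight requests. Counters go up on dispatch and down on completion,
// so a slow server holds connections longer and attracts less new traffic.
public class LeastConnectionsBalancer {
    private final Map<String, AtomicInteger> activeConnections = new ConcurrentHashMap<>();

    public LeastConnectionsBalancer(String... servers) {
        for (String s : servers) {
            activeConnections.put(s, new AtomicInteger(0));
        }
    }

    public String acquire() {
        String best = null;
        int fewest = Integer.MAX_VALUE;
        for (Map.Entry<String, AtomicInteger> e : activeConnections.entrySet()) {
            int count = e.getValue().get();
            if (count < fewest) {
                fewest = count;
                best = e.getKey();
            }
        }
        activeConnections.get(best).incrementAndGet(); // request dispatched
        return best;
    }

    public void release(String server) {
        activeConnections.get(server).decrementAndGet(); // response completed
    }

    public static void main(String[] args) {
        LeastConnectionsBalancer lb = new LeastConnectionsBalancer("fast-01", "slow-01");
        for (int i = 0; i < 6; i++) {
            String chosen = lb.acquire();
            // fast-01 finishes instantly; slow-01 never releases its connections
            if (chosen.equals("fast-01")) lb.release(chosen);
            System.out.println("request " + i + " -> " + chosen);
        }
    }
}
```

Because fast-01 releases each connection immediately while slow-01 holds them open, slow-01 attracts at most one of the six requests. Round Robin would have sent it three.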
Least Response Time takes the next step: instead of counting connections, it measures actual backend latency (TTFB) and routes to whichever server is responding fastest right now. This is more accurate but requires the LB to actively probe or measure response times, which adds some overhead. It's the right choice for latency-sensitive workloads where a server can be 'healthy' but slow.
IP Hash routes each client to the same backend based on a hash of their source IP. This provides a form of session affinity without application-level cookies. The significant risk: if all your users are behind a corporate NAT gateway, they all hash to the same backend. Also, when a backend is added or removed, the hash changes and users get redistributed — breaking any state you were relying on affinity to preserve.
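The redistribution problem is easy to demonstrate with the naive modulo form of IP Hash. A toy sketch (the 203.0.113.x addresses are documentation-range examples, and the class is hypothetical):

```java
import java.util.List;

// Naive IP Hash: route a client to servers.get(hash(ip) % poolSize).
// Deterministic per client, but the mapping depends on the pool size,
// so adding or removing a single server reshuffles most clients.
public class IpHashDemo {
    public static String pick(String clientIp, List<String> servers) {
        int index = Math.abs(clientIp.hashCode() % servers.size());
        return servers.get(index);
    }

    public static void main(String[] args) {
        List<String> before = List.of("srv-1", "srv-2", "srv-3", "srv-4");
        List<String> after  = List.of("srv-1", "srv-2", "srv-3"); // srv-4 removed

        int moved = 0, total = 50;
        for (int i = 0; i < total; i++) {
            String ip = "203.0.113." + i;
            if (!pick(ip, before).equals(pick(ip, after))) moved++;
        }
        // With modulo hashing, roughly (n-1)/n of clients land on a new
        // server after a pool change; consistent hashing cuts that to ~1/n.
        System.out.printf("%d of %d clients remapped after removing one server%n",
            moved, total);
    }
}
```

This is exactly the affinity breakage described above: most clients change backends even though only one server left the pool, which is why consistent hashing exists as a refinement.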
Weighted variants of Round Robin and Least Connections let you express that some servers have more capacity than others. A server with weight=3 gets three times the share of a server with weight=1. This is essential in mixed hardware environments.
```java
package io.thecodeforge.loadbalancer;

import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.CopyOnWriteArrayList;

/**
 * Production-grade Weighted Round-Robin implementation.
 *
 * Design decisions worth understanding:
 * - AtomicInteger for the index counter: load balancers handle concurrent requests.
 *   A plain int here is a data race waiting to happen.
 * - CopyOnWriteArrayList for the server pool: allows safe dynamic weight updates
 *   without locking the hot path (getNextServer).
 * - Collections.shuffle() on construction: prevents all initial traffic from
 *   hitting server[0] in a predictable burst during startup.
 * - Math.abs() on the modulo result: AtomicInteger.getAndIncrement() eventually
 *   overflows to negative values. Without abs(), you get an ArrayIndexOutOfBoundsException
 *   at 2^31 requests — a production bug you will not find in testing.
 */
public class WeightedRoundRobinBalancer {

    private final List<String> serverPool;
    private final AtomicInteger currentIndex = new AtomicInteger(0);

    public WeightedRoundRobinBalancer(Map<String, Integer> serversWithWeights) {
        List<String> pool = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : serversWithWeights.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                pool.add(entry.getKey());
            }
        }
        // Shuffle to prevent predictable startup burst on the first server
        Collections.shuffle(pool);
        // CopyOnWriteArrayList: safe reads on the hot path, supports dynamic updates
        this.serverPool = new CopyOnWriteArrayList<>(pool);
    }

    public String getNextServer() {
        if (serverPool.isEmpty()) {
            throw new IllegalStateException(
                "No active servers in the pool. Check health checks and server registration.");
        }
        // Math.abs handles integer overflow at 2^31 requests
        int index = Math.abs(currentIndex.getAndIncrement() % serverPool.size());
        return serverPool.get(index);
    }

    /**
     * Dynamically update a server's weight — e.g., temporarily reduce weight
     * for a server showing elevated GC pause times or increased error rate.
     * In production, this would be called by your health monitoring system.
     */
    public synchronized void updateWeight(String server, int newWeight, int oldWeight) {
        // Remove existing entries for this server
        serverPool.removeIf(s -> s.equals(server));
        // Re-add with new weight
        for (int i = 0; i < newWeight; i++) {
            serverPool.add(server);
        }
        System.out.printf("Weight updated: %s %d → %d (pool size: %d)%n",
            server, oldWeight, newWeight, serverPool.size());
    }

    public static void main(String[] args) {
        Map<String, Integer> config = new LinkedHashMap<>();
        config.put("app-server-large-01", 5); // High-capacity instance
        config.put("app-server-std-01", 2);   // Standard instance
        config.put("app-server-std-02", 2);   // Standard instance

        WeightedRoundRobinBalancer balancer = new WeightedRoundRobinBalancer(config);

        System.out.println("Initial distribution across 18 requests:");
        Map<String, Integer> distribution = new LinkedHashMap<>();
        for (int i = 0; i < 18; i++) {
            String server = balancer.getNextServer();
            distribution.merge(server, 1, Integer::sum);
        }
        distribution.forEach((server, count) ->
            System.out.printf("  %-25s → %d requests%n", server, count));

        // Simulate degraded server — reduce its weight
        System.out.println("\nSimulating GC pressure on app-server-large-01...");
        balancer.updateWeight("app-server-large-01", 1, 5);
        System.out.println("Server temporarily downweighted. Rebalancing traffic.");
    }
}
```
app-server-large-01 → 10 requests
app-server-std-01 → 4 requests
app-server-std-02 → 4 requests
Simulating GC pressure on app-server-large-01...
Weight updated: app-server-large-01 5 → 1 (pool size: 5)
Server temporarily downweighted. Rebalancing traffic.
The Math.abs() call in the Java implementation is not defensive programming theater — it's fixing a real production bug. AtomicInteger.getAndIncrement() overflows to Integer.MIN_VALUE after 2^31 calls. Without Math.abs(), the modulo of a negative number is negative, which throws ArrayIndexOutOfBoundsException on the next line.

Health Checks: The Component That Makes Everything Else Work
Health checks are the mechanism by which a load balancer knows which servers are actually capable of serving traffic right now. Everything else — algorithm, weights, session persistence — is irrelevant if the LB doesn't have accurate information about server state.
There are three types of health checks in common use, and understanding their trade-offs matters:
TCP health checks open a connection to the server's port and consider it healthy if the connection succeeds. Fast, low overhead, and completely inadequate for detecting application-level failures. The server's OS can accept a TCP connection while the application is in a GC pause, crashed internally, or waiting on a database connection that will never arrive.
HTTP health checks send an actual HTTP request to a designated endpoint (typically /health or /healthz) and validate the response code. This is the minimum acceptable standard for production. The endpoint must return a non-200 response if the application isn't ready to serve traffic — not just if the process is running.
Deep health checks go further: the /healthz endpoint actively validates downstream dependencies — can we connect to the database, is the cache reachable, are critical feature flags loaded. These checks are more expensive to run but catch a class of failures that HTTP-only checks miss: the application process is up, the port responds, but the database connection pool is exhausted and every request will fail.
The health check configuration details matter as much as the type. Check interval (how often), timeout (how long to wait for a response), unhealthy threshold (how many consecutive failures before removal), and healthy threshold (how many consecutive successes before re-addition) all interact. Misconfigure any of these and you get either flapping — servers rapidly cycling in and out of rotation — or a slow response to actual failures.
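The hysteresis those thresholds create can be sketched in a few lines (threshold values and class names here are illustrative, not defaults from any specific load balancer):

```java
// Consecutive-failure / consecutive-success hysteresis: one bad probe does
// not eject a server, and one good probe does not restore it. This is what
// prevents flapping when a server is intermittently slow.
public class HealthTracker {
    private final int unhealthyThreshold; // consecutive failures before removal
    private final int healthyThreshold;   // consecutive successes before re-addition
    private int consecutiveFailures = 0;
    private int consecutiveSuccesses = 0;
    private boolean inRotation = true;

    public HealthTracker(int unhealthyThreshold, int healthyThreshold) {
        this.unhealthyThreshold = unhealthyThreshold;
        this.healthyThreshold = healthyThreshold;
    }

    public void recordProbe(boolean success) {
        if (success) {
            consecutiveSuccesses++;
            consecutiveFailures = 0;
            if (!inRotation && consecutiveSuccesses >= healthyThreshold) {
                inRotation = true;  // re-add only after sustained health
            }
        } else {
            consecutiveFailures++;
            consecutiveSuccesses = 0;
            if (inRotation && consecutiveFailures >= unhealthyThreshold) {
                inRotation = false; // remove only after sustained failure
            }
        }
    }

    public boolean inRotation() { return inRotation; }

    public static void main(String[] args) {
        HealthTracker t = new HealthTracker(3, 2);
        boolean[] probes = {false, true, false, false, false, true, true};
        for (boolean p : probes) {
            t.recordProbe(p);
            System.out.printf("probe=%-5s inRotation=%s%n", p, t.inRotation());
        }
        // One isolated failure does not eject the server; three consecutive
        // failures do; two consecutive successes restore it.
    }
}
```

Note how a success resets the failure counter and vice versa: that reset is the hysteresis. Tighten the thresholds and you react faster but flap more; loosen them and you route into dead servers for longer.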
```javascript
const express = require('express');
const { createClient } = require('redis');
const { Pool } = require('pg');

const app = express();

// Initialize dependencies
const redisClient = createClient({ url: process.env.REDIS_URL });
const pgPool = new Pool({ connectionString: process.env.DATABASE_URL });
redisClient.connect().catch(console.error);

/**
 * Shallow health check — fast, for high-frequency LB probing.
 * Returns 200 if the process is alive. Does not check dependencies.
 * Use this for the LB's frequent interval check (every 5s).
 */
app.get('/health/live', (req, res) => {
  res.status(200).json({
    status: 'alive',
    pid: process.pid,
    uptime: process.uptime(),
  });
});

/**
 * Deep readiness check — validates all dependencies.
 * Returns 200 only when the application can actually serve traffic.
 * Use this as the LB's readiness gate during startup and deployment.
 * Check interval should be longer (every 10–15s) due to dependency I/O.
 */
app.get('/health/ready', async (req, res) => {
  const checks = {};
  let allHealthy = true;

  // Check database connectivity
  try {
    const client = await pgPool.connect();
    await client.query('SELECT 1');
    client.release();
    checks.database = { status: 'healthy' };
  } catch (err) {
    checks.database = { status: 'unhealthy', error: err.message };
    allHealthy = false;
  }

  // Check Redis connectivity
  try {
    await redisClient.ping();
    checks.redis = { status: 'healthy' };
  } catch (err) {
    checks.redis = { status: 'unhealthy', error: err.message };
    allHealthy = false;
  }

  // Memory pressure check — prevent routing to a server about to OOM
  const memUsage = process.memoryUsage();
  const heapUsedPercent = memUsage.heapUsed / memUsage.heapTotal;
  if (heapUsedPercent > 0.90) {
    checks.memory = {
      status: 'degraded',
      heapUsedPercent: (heapUsedPercent * 100).toFixed(1) + '%',
    };
    allHealthy = false;
  } else {
    checks.memory = {
      status: 'healthy',
      heapUsedPercent: (heapUsedPercent * 100).toFixed(1) + '%',
    };
  }

  const statusCode = allHealthy ? 200 : 503;
  res.status(statusCode).json({
    status: allHealthy ? 'ready' : 'not_ready',
    checks,
    pid: process.pid,
    timestamp: new Date().toISOString(),
  });
});

app.listen(8080, () => console.log('Health server on :8080'));
```
// GET /health/live → 200
// { "status": "alive", "pid": 12801, "uptime": 347.2 }
// GET /health/ready (all healthy) → 200
// {
// "status": "ready",
// "checks": {
// "database": { "status": "healthy" },
// "redis": { "status": "healthy" },
// "memory": { "status": "healthy", "heapUsedPercent": "42.1%" }
// },
// "pid": 12801
// }
// GET /health/ready (database down) → 503
// {
// "status": "not_ready",
// "checks": {
// "database": { "status": "unhealthy", "error": "connect ECONNREFUSED" },
// "redis": { "status": "healthy" },
// "memory": { "status": "healthy", "heapUsedPercent": "41.8%" }
// }
// }
- Liveness: Is the process alive and not deadlocked? If this fails, the orchestrator should restart the container. Fast to evaluate, check every 5 seconds.
- Readiness: Can this server actually handle a request right now? If this fails, the LB should stop routing traffic here but not restart the process. Slower to evaluate due to dependency I/O, check every 10-15 seconds.
- A server can be live but not ready — JVM warmup, connecting to database, loading configuration. Don't send traffic to it yet.
- A server can be ready but degraded — one dependency responding slowly. You might want to keep routing but alert on the degradation.
- Kubernetes formalizes this split with livenessProbe and readinessProbe as first-class configurations. Use both.
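In Kubernetes terms, the split looks roughly like this (the paths follow the /health/live and /health/ready endpoints used above; the intervals and thresholds are illustrative starting points, not recommendations):

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 5        # cheap check, probe frequently
  timeoutSeconds: 2
  failureThreshold: 3     # 3 consecutive failures -> restart the container
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 15       # dependency I/O, probe less often
  timeoutSeconds: 5
  failureThreshold: 2     # stop routing traffic here, but do not restart
  successThreshold: 2     # require sustained health before re-adding
```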
Liveness checks should stay cheap: process.uptime() or a simple 200 response. No I/O, no dependency checks. Failure here means restart the process.

| Algorithm | Strategy | Best For | Watch Out For |
|---|---|---|---|
| Round Robin | Sequential distribution — each server gets the next request in rotation | Clusters with identical server specs and uniform request processing time | Falls apart when requests have variable processing times. A server tied up with slow requests still receives its full share of new ones. |
| Weighted Round Robin | Round robin with proportional traffic share based on configured weight | Mixed hardware environments — legacy vs new instances, different instance types | Static weights drift over time. A server's effective capacity changes with memory pressure and GC history. Weights must be updated dynamically. |
| Least Connections | Routes to whichever server currently has the fewest active connections | Long-lived requests — streaming, heavy database queries, WebSocket connections | Connection count doesn't equal load. A server with 5 slow connections may be more loaded than one with 20 fast ones. Still better than Round Robin for most workloads. |
| Least Response Time | Routes to the server with the lowest current TTFB (Time to First Byte) | Latency-sensitive workloads where response time variance between servers matters | Requires the LB to actively probe or measure backend latency — adds overhead. Can cause traffic oscillation if server speeds fluctuate rapidly. |
| IP Hash | Routes each client to a consistent backend based on a hash of their source IP | Stateful applications that need session affinity without cookie-based sticky sessions | All traffic behind a corporate NAT or proxy hashes to the same backend. Adding or removing servers changes the hash distribution and breaks affinity. |
| Random with Two Choices | Pick two servers at random, route to whichever has fewer connections | Very large server pools where maintaining full state is expensive | Less predictable distribution than Least Connections. Better than pure random but doesn't match Least Connections for accuracy. |
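The last row of the table, Random with Two Choices, is simple enough to sketch directly. A toy model (assumes a pool of at least two servers; the Random seed is fixed only to make the demo reproducible):

```java
import java.util.Random;

// "Power of two choices": sample two distinct backends at random and send
// the request to whichever has fewer active connections. Each decision reads
// only two counters instead of scanning the full pool — which is why this
// scales to very large server pools.
public class TwoChoicesBalancer {
    private final int[] activeConnections;
    private final Random random;

    public TwoChoicesBalancer(int poolSize, long seed) {
        this.activeConnections = new int[poolSize];
        this.random = new Random(seed);
    }

    public int pick() {
        int n = activeConnections.length;
        int a = random.nextInt(n);
        int b = random.nextInt(n - 1);
        if (b >= a) b++; // guarantee two distinct candidates
        int chosen = activeConnections[a] <= activeConnections[b] ? a : b;
        activeConnections[chosen]++; // request dispatched
        return chosen;
    }

    public void release(int server) { activeConnections[server]--; }

    public int load(int server) { return activeConnections[server]; }

    public static void main(String[] args) {
        TwoChoicesBalancer lb = new TwoChoicesBalancer(10, 42);
        for (int i = 0; i < 10_000; i++) lb.pick(); // worst case: nothing released

        int min = Integer.MAX_VALUE, max = 0;
        for (int s = 0; s < 10; s++) {
            min = Math.min(min, lb.load(s));
            max = Math.max(max, lb.load(s));
        }
        // Two samples are enough to keep the spread between the busiest and
        // idlest server dramatically tighter than pure random assignment.
        System.out.printf("min=%d max=%d across 10 servers%n", min, max);
    }
}
```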
🎯 Key Takeaways
- Load balancing is the primary mechanism for horizontal scalability — it's what lets you add servers instead of just making one server bigger. But it only works correctly if health checks are accurate, algorithms match the workload, and state is externalized.
- Layer 4 is faster and simpler. Layer 7 is smarter and more flexible. Most production systems at scale use both: Layer 4 at the network edge for raw throughput, Layer 7 internally for content-aware microservice routing.
- Health checks are the foundation everything else depends on. TCP-only checks are inadequate — they validate process existence, not application readiness. Deep HTTP health checks that validate dependencies are the minimum acceptable standard for production.
- Least Connections is the safest default algorithm for modern web applications with variable request processing times. Round Robin's implicit assumption — that all requests take roughly the same time — is almost never true in practice.
- Sticky sessions are a trap at scale. They defeat load balancing, create hotspots, and cause mass session loss when a pinned server dies. The correct answer is stateless application servers backed by Redis for session state.
Interview Questions on This Topic
- [Senior] Design a system that handles 1 million concurrent users. Where do you place the load balancers and what type at each tier?
- [Senior] How does the 'Least Connections' algorithm differ from 'Least Response Time'?
- [Mid-level] What is SSL Termination and why is it used at the Load Balancer level?
- [Junior] What happens if the health check fails, and how do you prevent a server from flapping in and out of rotation?
Frequently Asked Questions
What is the difference between Horizontal and Vertical Scaling?
Vertical Scaling (scaling up) means adding more CPU, RAM, or faster storage to a single machine. It has a hard ceiling — you can only make one machine so large — and it typically requires downtime to resize. Horizontal Scaling (scaling out) means adding more machines to your server pool. Load balancers are what make horizontal scaling work: without one, you can't distribute traffic across multiple machines transparently. Horizontal scaling is the model that enables the kind of elastic capacity that cloud infrastructure is built around — add machines when load increases, remove them when it drops.
Can a Load Balancer become a Single Point of Failure?
Yes, absolutely — and this is one of the first questions worth asking when evaluating any load balancing architecture. A single load balancer that goes down takes the entire service with it. The standard mitigation is an active-passive or active-active high availability pair. Two load balancers share a Virtual IP (VIP). If the primary fails, the secondary detects the failure via heartbeat and takes ownership of the VIP — traffic continues flowing within seconds. On-premise implementations use VRRP (Virtual Router Redundancy Protocol) with tools like Keepalived. Cloud-managed load balancers (AWS ALB, GCP LB) handle HA internally and are effectively transparent to this problem.
What happens if the health check fails?
The load balancer marks the server as unhealthy and stops routing new connections to it. If connection draining is configured, in-flight requests are allowed to complete up to the drain timeout — typically 30–60 seconds. The LB continues probing the unhealthy server at the configured interval. The server only returns to rotation after passing a configured number of consecutive successful health checks — typically 2 or 3 — to prevent flapping. A server that passes one check and fails the next should not cycle in and out of rotation on every check cycle. The consecutive-success threshold is what creates the hysteresis needed for stable behavior.
When should I use IP Hash instead of session cookies for affinity?
IP Hash is useful when you need session affinity but can't or won't modify the application to set a cookie — for example, with third-party clients or binary protocols that don't support cookies. The significant limitations: if users are behind a NAT gateway or corporate proxy, all of them share the same source IP and all hash to the same backend, creating a severe hotspot. Adding or removing servers from the pool changes the hash distribution, so existing users get rerouted to different servers — breaking the affinity you were trying to maintain. For most applications, Redis-backed sessions with stateless servers is the right answer. Cookie-based affinity at the LB is a second option. IP Hash is a distant third, appropriate only when the first two genuinely aren't possible.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.