
Load Balancing Components Explained — How Traffic Gets Distributed at Scale

📍 Part of: Components → Topic 1 of 18
Load balancing components demystified: learn how health checks, algorithms, session persistence, and load balancers work together to keep systems fast and resilient.
⚙️ Intermediate — basic System Design knowledge assumed
In this tutorial, you'll learn
  • Load balancing is the primary mechanism for horizontal scalability — it's what lets you add servers instead of just making one server bigger. But it only works correctly if health checks are accurate, algorithms match the workload, and state is externalized.
  • Layer 4 is faster and simpler. Layer 7 is smarter and more flexible. Most production systems at scale use both: Layer 4 at the network edge for raw throughput, Layer 7 internally for content-aware microservice routing.
  • Health checks are the foundation everything else depends on. TCP-only checks are inadequate — they validate process existence, not application readiness. Deep HTTP health checks that validate dependencies are the minimum acceptable standard for production.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
Quick Answer
  • A load balancer distributes incoming traffic across multiple servers so no single machine becomes a bottleneck or single point of failure
  • Health checks are the heartbeat — they probe servers continuously and remove dead ones from rotation automatically; TCP-only checks are a trap
  • Algorithms (Round Robin, Least Connections, IP Hash, Weighted, Least Response Time) decide which server gets each request based on different signals
  • Layer 4 (TCP/UDP) is faster; Layer 7 (HTTP/HTTPS) is smarter — it can inspect cookies, headers, URL paths, and make content-aware routing decisions
  • Session persistence (sticky sessions) keeps users on one server but creates hotspots and causes mass session loss if that server dies
  • The biggest trap: skipping health checks or using TCP-only probes means your LB becomes a black hole, routing traffic into servers that can't respond
  • In production, no single load balancer handles everything — DNS, edge, Layer 7 gateway, and service mesh each own a different tier
🚨 START HERE
Load Balancing Quick Debug Cheat Sheet
Immediate diagnostic commands when load balancing breaks in production.
🟡 Upstream servers showing as down in LB logs
Immediate Action: Verify the health endpoint directly, completely bypassing the load balancer. This tells you immediately whether the problem is the server or the LB's health check configuration.
Commands
curl -v http://<server-ip>:8080/healthz
kubectl get pods -l app=backend -o wide
Fix Now: If the health endpoint fails when curled directly, restart the pod or investigate the application startup logs. If it succeeds directly but the LB marks it unhealthy, check the LB's health check timeout — it may be shorter than your application's response time for /healthz. Also confirm the LB is checking the right port and path.
🟡 One server receiving disproportionate traffic
Immediate Action: Check actual connection distribution across upstream servers right now, not what the algorithm says should happen in theory.
Commands
nginx -T 2>&1 | grep -A5 'upstream'
ss -tnp | grep :8080 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn
Fix Now: If using IP Hash and all traffic originates from a single corporate NAT gateway or proxy, every request will hash to the same backend. Switch to Least Connections. If using sticky sessions, verify that a single session cookie value isn't being shared across users — this happens with misconfigured session middleware and pins all those users to one server.
🟡 SSL errors appearing intermittently at the load balancer
Immediate Action: Test the TLS handshake directly against the LB to see the full certificate chain and negotiated cipher.
Commands
openssl s_client -connect <lb-host>:443 -tls1_2
openssl x509 -in /etc/ssl/cert.pem -noout -dates
Fix Now: If the certificate is expired or expiring within 7 days, renew immediately — this is usually the cause of intermittent failures as some clients cache the old cert. If the cipher suite doesn't match what backends expect during TLS re-wrap, update ssl_ciphers to include the required suites. Check if errors correlate with specific client versions — TLS 1.0/1.1 clients may be hitting a policy block.
🟡 Requests timing out but servers appear healthy
Immediate Action: Check whether the timeout is happening at the LB or at the backend, and how far into the request lifecycle it occurs.
Commands
curl -w '@curl-format.txt' -o /dev/null -s http://<lb-host>/api/test
tail -f /var/log/nginx/access.log | grep ' 0.0[0-9]\{2\} '
Fix Now: If TTFB is consistently near your proxy_read_timeout value, the backend is taking too long and the LB is cutting the connection. Either increase the timeout for that route specifically, or investigate why the backend is slow. If TTFB is fast but total time is high, the backend is sending a large response slowly — check backend connection limits and network throughput.
Production Incident
The Health Check Black Hole: 40% of Requests Vanish Into Healthy-Looking Dead Servers
A misconfigured health check marked servers as healthy based on port reachability alone, not application readiness. Servers accepting TCP connections but stuck in JVM warmup received live traffic they couldn't process.
Symptom: Monitoring shows 40% of HTTP requests returning 503 errors despite every server reporting green in the load balancer dashboard. CPU on the affected servers is near zero — they're not processing anything. Application logs show no incoming requests at all, which rules out application-level errors. The LB logs show connections being established and immediately dropped on the backend side.
Assumption: The load balancer dashboard is green for all backends, so the engineering team assumes the problem must be downstream — maybe the database is down, or a dependent microservice is timing out. Two engineers spend 25 minutes digging through database connection pool metrics before someone thinks to curl a backend server directly.
Root cause: The health check was configured as a simple TCP connect probe on port 8080. Three things were happening simultaneously after the deployment. First, servers that had crashed internally but whose OS still held the socket open passed the TCP check — the kernel accepted the connection, but the application wasn't there to handle it. Second, servers in a long JVM warmup phase also passed — they accepted the TCP connection but couldn't serve HTTP requests before the health check timeout. Third, two servers had entered a stop-the-world GC pause that lasted longer than the health check interval, so they appeared healthy between pauses but were unresponsive during them. The LB had no visibility into any of this because it was only checking 'can I open a socket.'
Fix:
1. Replace all TCP health checks with HTTP GET /healthz endpoints that validate database connectivity, cache reachability, and the readiness of critical dependencies — not just that the process is running.
2. Add a readiness gate: the /healthz endpoint must return 200 only after the application has fully initialized, completed warmup, and successfully connected to its dependencies. During startup, return 503.
3. Configure consecutive failure thresholds — 3 consecutive failures before marking a server unhealthy, 2 consecutive successes before returning it to rotation. This prevents flapping during transient network hiccups.
4. Implement connection draining — when a server is removed from rotation, wait for in-flight requests to complete (up to a configurable drain timeout, typically 30 seconds) before cutting the connection. Abrupt removal mid-request is a guaranteed user-facing error.
5. Add health check transition alerting — alert when a server's health state changes, not just when it's unhealthy. Frequent transitions are a signal of instability that steady-state monitoring won't catch.
Key Lesson
  • A TCP port accepting connections does not mean the application behind it is ready or capable of serving traffic. These are completely different things.
  • Health checks must validate end-to-end application readiness — database connected, cache reachable, dependencies healthy — not just socket availability.
  • JVM warmup and GC pauses are real, predictable events. Your health check design must account for them or they'll cause exactly this kind of incident.
  • Always configure connection draining — abruptly cutting traffic to a server mid-request causes user-facing errors that are entirely preventable.
  • Monitor health check state transitions, not just current state. A server that flips between healthy and unhealthy 20 times per hour is a problem your dashboard's green dot will never show you.
Production Debug Guide
Symptom → Action mapping for common LB failures
Traffic black hole — requests return 503 despite servers appearing healthy in the dashboard
Bypass the load balancer completely and curl the backend servers directly on their health endpoint. If the direct request succeeds and the LB-routed request fails, you have a health check misconfiguration — the LB is either checking the wrong endpoint, using TCP instead of HTTP, or the timeout is too short for your application's response time. Switch to HTTP-level health checks that validate actual application readiness including database connectivity.
Uneven load distribution — one server at 95% CPU while others idle at 10%
Check for two common culprits: sticky session misconfiguration pinning a disproportionate share of users to one server, and long-lived connections (WebSockets, streaming, gRPC) accumulating on whichever server happened to get them first. Inspect current connection counts per backend using ss or netstat. If using IP Hash, verify whether all traffic originates from a single NAT gateway — if so, every request hashes to the same server. Switch to Least Connections for long-lived connection workloads.
Intermittent SSL handshake failures at the load balancer
Test the TLS handshake directly against the LB with openssl s_client. Check certificate expiration and chain completeness — an intermediate certificate missing from the chain causes failures in some clients but not others, making it intermittent. Verify TLS version and cipher suite compatibility between LB and backends if you're doing TLS re-wrapping. Check whether failures correlate with specific client TLS versions or spike during high traffic, which would point to CPU exhaustion on the LB.
Connection pool exhaustion under moderate load
Inspect keepalive settings between the LB and backends. Without connection reuse, every request opens a new TCP connection — expensive under load. Ensure keepalive is enabled and configured correctly (keepalive 32 in NGINX means 32 idle keepalive connections per worker to each upstream). Check for connection leaks in application code — connections that are opened but not properly returned to the pool. Monitor the LB's active vs idle connection counts over time.
Backend servers healthy but response times spiking significantly
Health checks confirm reachability, not performance. A server can be healthy and slow simultaneously. Check whether the LB algorithm accounts for response time — if using Round Robin, a slow server gets the same traffic as a fast one. Consider switching to Least Response Time or Least Connections. Also check whether connection draining from a recent deployment is causing a traffic imbalance as some backends handle both old and new connections.

Every time you tap 'Buy Now' on Amazon or start a video on Netflix, your request hits one of hundreds or thousands of servers — chosen in milliseconds by a load balancer you never see. Without it, modern internet-scale applications simply couldn't exist.

The core problem is deceptively simple: distribute work across many machines so no single machine becomes a bottleneck, a single point of failure, or a performance nightmare. Without load balancing, one server handles everything until it buckles under the weight. With it, traffic is spread intelligently, failed servers are automatically removed from the pool, and new capacity can be added without touching the rest of the system.

But 'load balancer' is not a single thing. It's a tier — sometimes multiple tiers — of components that each own a different slice of the problem. DNS-level routing decides which data center gets your request. A network load balancer handles the raw TCP connection at line rate. An application load balancer inspects your HTTP headers and routes you to the right microservice. A service mesh sidecar manages the connection between that microservice and the next one in the chain. Understanding where each layer sits, what decisions it can make, and what its failure modes look like is what separates engineers who can configure a load balancer from engineers who can design a system that stays up when things go wrong.

By the end of this article you'll understand what load balancers are, which components make them tick, when to use Round Robin vs Least Connections vs Least Response Time, why sticky sessions can be a trap at exactly the wrong moment, and how to answer the load balancing questions that trip people up in system design interviews at senior level.

The Core Mechanics: How a Load Balancer Decides Where to Send Traffic

A load balancer sits between the client and your server pool. When a request arrives, it has to make a routing decision in milliseconds — which server gets this connection, right now, given the current state of the cluster.

That decision happens at one of two layers, and the layer matters more than most people realize. Layer 4 load balancers operate at the transport layer — they see IP addresses, TCP/UDP ports, and packet counts. They don't open the envelope. Layer 7 load balancers operate at the application layer — they can read the HTTP method, URL path, headers, cookies, and request body. They know the difference between a GET /api/images request and a POST /api/payments request and can route them to entirely different server pools.

The trade-off is straightforward: Layer 4 is faster because there's almost nothing to parse. Layer 7 is more expensive computationally because it has to terminate the connection, parse the HTTP request, make a routing decision, and then establish a new connection (or reuse a keepalive connection) to the backend. In practice, that overhead is typically 0.5–2ms per request — negligible for most applications, meaningful for high-frequency trading or real-time gaming.

In production, you usually don't pick one. The standard architecture is a Layer 4 network load balancer at the edge handling raw TCP connections at line rate, with Layer 7 application load balancers behind it doing content-aware routing to specific service pools. AWS calls these NLB and ALB. On-premise, you'd see HAProxy in TCP mode in front of NGINX instances.

io/thecodeforge/nginx/upstream.conf · NGINX
# Upstream pool with mixed weights and a backup server
upstream forge_backend_cluster {
    # Least Connections: best for workloads with variable request processing times
    # Round Robin (default) assumes all requests take roughly the same time — often wrong
    least_conn;

    server 10.0.0.1:8080 weight=3;  # Larger instance — gets 3x the traffic
    server 10.0.0.2:8080 weight=1;  # Standard instance
    server 10.0.0.3:8080 backup;    # Only receives traffic when primary servers are down

    # Keepalive: reuse connections to backends instead of opening new TCP connections per request
    # Without this, high-traffic scenarios create connection exhaustion
    keepalive 32;

    # Passive health checks (built into open-source NGINX): add max_fails/fail_timeout
    # to each server line above, e.g.  server 10.0.0.1:8080 weight=3 max_fails=3 fail_timeout=30s;
    # Active health checks require NGINX Plus or the nginx_upstream_check_module.
}

server {
    listen 80;
    server_name thecodeforge.io;

    location /api/ {
        proxy_pass http://forge_backend_cluster;

        # Preserve real client IP through the proxy
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $host;

        # Timeout configuration — tune for your backend's actual SLA
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;

        # Retry on failure — but only for idempotent methods to avoid double-posting
        proxy_next_upstream error timeout http_503;
        proxy_next_upstream_tries 2;
    }
}
▶ Output
# Traffic flows: client → NGINX (L7) → least-connected backend from pool
# Server 10.0.0.1 receives ~3x requests vs 10.0.0.2 due to weight=3
# 10.0.0.3 stays idle unless both primaries fail health checks
# Keepalive reuses existing connections — avoids TCP handshake overhead on every request
Mental Model
Layer 4 vs Layer 7 — The Speed vs Smarts Trade-off
Layer 4 routes packets without reading them — fast, but completely blind to request content. Layer 7 reads the actual HTTP request — slightly slower, but capable of making intelligent routing decisions.
  • Layer 4: No packet inspection. Lowest latency (~microseconds). Ideal for raw TCP/UDP traffic like gaming servers, video streaming, or any protocol that isn't HTTP.
  • Layer 7: Reads cookies, headers, URL paths, and HTTP methods. Enables A/B testing, canary deployments, microservice routing by path, and authentication offloading.
  • Rule of thumb: if your routing decision requires knowing anything about the request content, you need Layer 7. If you only need to balance load across identical servers, Layer 4 is enough.
  • Performance cost of Layer 7: typically 0.5–2ms additional latency per request due to connection termination, TLS handling, and HTTP parsing.
  • Standard production architecture: Layer 4 NLB at the edge absorbs raw connection volume, Layer 7 ALB/NGINX behind it makes content-aware routing decisions per service.
📊 Production Insight
Layer 4 load balancers are blind to URL paths — /api/payments and /api/images look identical at the TCP layer.
In a microservices architecture, routing different paths to different service pools requires Layer 7. Layer 4 alone means one pool per port, which doesn't scale.
The NLB → ALB pattern solves this: NLB handles the connection volume and TLS termination at the edge, ALB handles path-based routing internally. Both AWS and GCP make this pattern straightforward with managed offerings.
🎯 Key Takeaway
Layer 4 is the motorcycle — fast and direct, but you can't read road signs at that speed. Layer 7 is the GPS-equipped car — slightly slower off the line, but it knows exactly where every request needs to go. Most production systems at scale need both, with clear ownership of which tier makes which routing decisions.
Choosing Between Layer 4 and Layer 7
If: Routing decision requires inspecting URL path, HTTP headers, cookies, or request body
Use: Layer 7 (NGINX, HAProxy in HTTP mode, AWS ALB, GCP Application Load Balancer). No way around it — Layer 4 cannot see this information.
If: Raw TCP/UDP traffic with no HTTP semantics — game servers, video streaming, gRPC without HTTP/2 inspection, DNS
Use: Layer 4 (AWS NLB, HAProxy in TCP mode, GCP Network Load Balancer). Lower latency, higher throughput, no parsing overhead.
If: Extreme latency sensitivity — sub-millisecond routing required, financial trading, real-time bidding
Use: Layer 4. The 0.5–2ms overhead of HTTP parsing is real and will show up in your p99 latency.
If: Need both maximum throughput at the edge and intelligent content routing for microservices
Use: Layer 4 (NLB) at the edge → Layer 7 (NGINX/Envoy/ALB) internally. Standard architecture for anything at significant scale.

Load Balancing Algorithms in Depth: Choosing the Right Strategy for Your Workload

The algorithm your load balancer uses to select a backend is not a configuration detail — it's a decision that directly affects your latency distribution, your server utilization, and what happens when servers become slow rather than fully dead.

Round Robin is the simplest: request 1 goes to server 1, request 2 to server 2, and so on, cycling back to the beginning. It works well when all servers are identical and all requests take roughly the same time to process. Both of those assumptions break in practice. Servers are rarely perfectly identical after weeks of different memory allocations and GC histories. And requests are almost never uniform — a request that triggers a complex database join takes 50x longer than one hitting a cache.

Least Connections routes each new request to whichever server currently has the fewest active connections. This self-corrects automatically: a slow server accumulates connections faster, so the LB naturally sends it fewer new ones. This is why Least Connections is the safer default for most production web applications.
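
The self-correcting behavior is easy to see in a minimal sketch. This is illustrative only, not a real LB data path — the server names and the acquire/release API are assumptions made for the example:

```javascript
// Minimal sketch of Least Connections selection (illustrative only).
class LeastConnectionsBalancer {
  constructor(servers) {
    // Active connection count per server
    this.connections = new Map(servers.map((s) => [s, 0]));
  }

  // Pick the server with the fewest active connections right now
  acquire() {
    let best = null;
    for (const [server, count] of this.connections) {
      if (best === null || count < this.connections.get(best)) best = server;
    }
    this.connections.set(best, this.connections.get(best) + 1);
    return best;
  }

  // Call when the request completes — this is what makes the algorithm self-correct:
  // a slow server releases connections slowly, so it accumulates them and gets fewer new ones
  release(server) {
    this.connections.set(server, this.connections.get(server) - 1);
  }
}

const lb = new LeastConnectionsBalancer(['s1', 's2']);
const a = lb.acquire(); // s1 (both at 0, first wins)
const b = lb.acquire(); // s2
lb.release(b);          // s2 finished its request quickly
const c = lb.acquire(); // s2 again — it has fewer active connections
console.log(a, b, c);   // s1 s2 s2
```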

Least Response Time takes the next step: instead of counting connections, it measures actual backend latency (TTFB) and routes to whichever server is responding fastest right now. This is more accurate but requires the LB to actively probe or measure response times, which adds some overhead. It's the right choice for latency-sensitive workloads where a server can be 'healthy' but slow.
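
One common way to smooth noisy latency measurements is an exponentially weighted moving average (EWMA). The sketch below is a hedged illustration of that idea — the alpha value, server names, and record/pick API are assumptions, not any particular LB's implementation:

```javascript
// Hedged sketch: Least Response Time selection using an EWMA of observed latency.
class LeastResponseTimeBalancer {
  constructor(servers, alpha = 0.3) {
    this.alpha = alpha; // weight given to the newest measurement (assumed value)
    this.ewma = new Map(servers.map((s) => [s, 0])); // 0 = no data yet
  }

  // Route to whichever server currently has the lowest smoothed latency
  pick() {
    let best = null;
    for (const [server, latency] of this.ewma) {
      if (best === null || latency < this.ewma.get(best)) best = server;
    }
    return best;
  }

  // Feed back the measured latency (e.g. TTFB in ms) after each request completes
  record(server, latencyMs) {
    const prev = this.ewma.get(server);
    this.ewma.set(
      server,
      prev === 0 ? latencyMs : this.alpha * latencyMs + (1 - this.alpha) * prev
    );
  }
}

const lb = new LeastResponseTimeBalancer(['fast', 'slow']);
lb.record('fast', 20);
lb.record('slow', 200);
console.log(lb.pick()); // fast
lb.record('fast', 500); // the fast server degrades...
lb.record('fast', 500);
console.log(lb.pick()); // slow — the EWMA now favors the other server
```

The EWMA matters because routing on a single latency sample would make the balancer flap on every GC pause or cache miss.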

IP Hash routes each client to the same backend based on a hash of their source IP. This provides a form of session affinity without application-level cookies. The significant risk: if all your users are behind a corporate NAT gateway, they all hash to the same backend. Also, when a backend is added or removed, the hash changes and users get redistributed — breaking any state you were relying on affinity to preserve.

Weighted variants of Round Robin and Least Connections let you express that some servers have more capacity than others. A server with weight=3 gets three times the share of a server with weight=1. This is essential in mixed hardware environments.

io/thecodeforge/loadbalancer/WeightedRoundRobinBalancer.java · JAVA
package io.thecodeforge.loadbalancer;

import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.CopyOnWriteArrayList;

/**
 * Production-grade Weighted Round-Robin implementation.
 *
 * Design decisions worth understanding:
 * - AtomicInteger for the index counter: load balancers handle concurrent requests.
 *   A plain int here is a data race waiting to happen.
 * - CopyOnWriteArrayList for the server pool: allows safe dynamic weight updates
 *   without locking the hot path (getNextServer).
 * - Collections.shuffle() on construction: prevents all initial traffic from
 *   hitting server[0] in a predictable burst during startup.
 * - Math.abs() on the modulo result: AtomicInteger.getAndIncrement() eventually
 *   overflows to negative values. Without abs(), you get an ArrayIndexOutOfBoundsException
 *   at 2^31 requests — a production bug you will not find in testing.
 */
public class WeightedRoundRobinBalancer {

    private final List<String> serverPool;
    private final AtomicInteger currentIndex = new AtomicInteger(0);

    public WeightedRoundRobinBalancer(Map<String, Integer> serversWithWeights) {
        List<String> pool = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : serversWithWeights.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                pool.add(entry.getKey());
            }
        }
        // Shuffle to prevent predictable startup burst on the first server
        Collections.shuffle(pool);
        // CopyOnWriteArrayList: safe reads on the hot path, supports dynamic updates
        this.serverPool = new CopyOnWriteArrayList<>(pool);
    }

    public String getNextServer() {
        if (serverPool.isEmpty()) {
            throw new IllegalStateException(
                "No active servers in the pool. Check health checks and server registration."
            );
        }
        // Math.abs handles integer overflow at 2^31 requests
        int index = Math.abs(currentIndex.getAndIncrement() % serverPool.size());
        return serverPool.get(index);
    }

    /**
     * Dynamically update a server's weight — e.g., temporarily reduce weight
     * for a server showing elevated GC pause times or increased error rate.
     * In production, this would be called by your health monitoring system.
     */
    public synchronized void updateWeight(String server, int newWeight, int oldWeight) {
        // Remove existing entries for this server
        serverPool.removeIf(s -> s.equals(server));
        // Re-add with new weight
        for (int i = 0; i < newWeight; i++) {
            serverPool.add(server);
        }
        System.out.printf("Weight updated: %s  %d → %d (pool size: %d)%n",
            server, oldWeight, newWeight, serverPool.size());
    }

    public static void main(String[] args) {
        Map<String, Integer> config = new LinkedHashMap<>();
        config.put("app-server-large-01", 5);  // High-capacity instance
        config.put("app-server-std-01",   2);  // Standard instance
        config.put("app-server-std-02",   2);  // Standard instance

        WeightedRoundRobinBalancer balancer = new WeightedRoundRobinBalancer(config);

        System.out.println("Initial distribution across 18 requests:");
        Map<String, Integer> distribution = new LinkedHashMap<>();
        for (int i = 0; i < 18; i++) {
            String server = balancer.getNextServer();
            distribution.merge(server, 1, Integer::sum);
        }
        distribution.forEach((server, count) ->
            System.out.printf("  %-25s → %d requests%n", server, count));

        // Simulate degraded server — reduce its weight
        System.out.println("\nSimulating GC pressure on app-server-large-01...");
        balancer.updateWeight("app-server-large-01", 1, 5);
        System.out.println("Server temporarily downweighted. Rebalancing traffic.");
    }
}
▶ Output
Initial distribution across 18 requests:
app-server-large-01 → 10 requests
app-server-std-01 → 4 requests
app-server-std-02 → 4 requests

Simulating GC pressure on app-server-large-01...
Weight updated: app-server-large-01 5 → 1 (pool size: 5)
Server temporarily downweighted. Rebalancing traffic.
⚠ The Sticky Session Trap — It Fails Exactly When You Need It Most
Sticky sessions sound like a reasonable solution to stateful applications. In practice, they create two problems that compound each other. First, they defeat load balancing — if 30% of your users all happen to hash to the same server, that server gets 30% of traffic regardless of its current load. Second, when that server dies, every user pinned to it loses their session simultaneously — a mass logout event during your highest-traffic moment. The correct fix is to store session state in Redis and make your application servers stateless. Sticky sessions are a band-aid that delays this conversation until the worst possible moment.
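
A minimal sketch of the externalized-session alternative — a plain Map stands in for Redis here (in production you'd use a Redis client or session middleware backed by it), and the server and session names are invented for the example:

```javascript
// Hedged sketch: externalized sessions make every app server interchangeable.
const sessionStore = new Map(); // stand-in for Redis — shared by all servers

function handleRequest(serverId, sessionId) {
  // Any server can load the same session — no stickiness required,
  // and losing a server loses no session state.
  const session = sessionStore.get(sessionId) || { views: 0 };
  session.views += 1;
  session.lastServer = serverId;
  sessionStore.set(sessionId, session);
  return session;
}

// The same user hits three different servers; state survives every hop
handleRequest('server-a', 'sess-42');
handleRequest('server-b', 'sess-42');
const s = handleRequest('server-c', 'sess-42');
console.log(s.views, s.lastServer); // 3 server-c
```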
📊 Production Insight
The Math.abs() call in the Java implementation is not defensive programming theater — it's fixing a real production bug.
AtomicInteger.getAndIncrement() overflows to Integer.MIN_VALUE after 2^31 calls. Without Math.abs(), the modulo of a negative number is negative, which throws ArrayIndexOutOfBoundsException on the next line.
On a moderately loaded API server handling 1,000 requests/second, you hit 2^31 in roughly 24 days. You won't catch this in load testing unless you run it for a very long time.
Dynamic weight adjustment is equally important in production. A server with weight=5 that enters a long GC pause should drop to weight=1 automatically. Static weights set at deployment time are a snapshot of server capacity at one moment — they drift.
🎯 Key Takeaway
Weighted Round Robin is not 'set and forget.' Static weights reflect server capacity at deployment time and drift as memory pressure and GC behavior evolve. Production implementations monitor per-server response time and error rate and adjust weights dynamically. A server with weight=5 that starts returning errors at 10% should not receive 5x the traffic.

Health Checks: The Component That Makes Everything Else Work

Health checks are the mechanism by which a load balancer knows which servers are actually capable of serving traffic right now. Everything else — algorithm, weights, session persistence — is irrelevant if the LB doesn't have accurate information about server state.

There are three types of health checks in common use, and understanding their trade-offs matters:

TCP health checks open a connection to the server's port and consider it healthy if the connection succeeds. Fast, low overhead, and completely inadequate for detecting application-level failures. The server's OS can accept a TCP connection while the application is in a GC pause, crashed internally, or waiting on a database connection that will never arrive.

HTTP health checks send an actual HTTP request to a designated endpoint (typically /health or /healthz) and validate the response code. This is the minimum acceptable standard for production. The endpoint must return a non-200 response if the application isn't ready to serve traffic — not just if the process is running.

Deep health checks go further: the /healthz endpoint actively validates downstream dependencies — can we connect to the database, is the cache reachable, are critical feature flags loaded. These checks are more expensive to run but catch a class of failures that HTTP-only checks miss: the application process is up, the port responds, but the database connection pool is exhausted and every request will fail.

The health check configuration details matter as much as the type. Check interval (how often), timeout (how long to wait for a response), unhealthy threshold (how many consecutive failures before removal), and healthy threshold (how many consecutive successes before re-addition) all interact. Misconfigure any of these and you get either flapping — servers rapidly cycling in and out of rotation — or a slow response to actual failures.
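
The threshold interaction is easier to see in code. A minimal sketch of the consecutive-failure logic described above (the 3/2 defaults mirror the incident fix earlier; the class and API are assumptions for illustration):

```javascript
// Hedged sketch: per-server health state with consecutive-failure thresholds.
// 3 consecutive failures to mark unhealthy, 2 consecutive successes to restore —
// this hysteresis is what prevents flapping on transient network blips.
class HealthTracker {
  constructor({ unhealthyThreshold = 3, healthyThreshold = 2 } = {}) {
    this.unhealthyThreshold = unhealthyThreshold;
    this.healthyThreshold = healthyThreshold;
    this.healthy = true;
    this.failStreak = 0;
    this.okStreak = 0;
  }

  // Feed in each probe result; returns the server's current state
  observe(probeSucceeded) {
    if (probeSucceeded) {
      this.okStreak += 1;
      this.failStreak = 0;
      if (!this.healthy && this.okStreak >= this.healthyThreshold) this.healthy = true;
    } else {
      this.failStreak += 1;
      this.okStreak = 0;
      if (this.healthy && this.failStreak >= this.unhealthyThreshold) this.healthy = false;
    }
    return this.healthy;
  }
}

const t = new HealthTracker();
console.log(t.observe(false)); // true  — one blip doesn't remove the server
console.log(t.observe(false)); // true
console.log(t.observe(false)); // false — 3rd consecutive failure, out of rotation
console.log(t.observe(true));  // false — one success isn't enough to return
console.log(t.observe(true));  // true  — back in rotation
```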

io/thecodeforge/health/healthcheck.js · JAVASCRIPT
const express = require('express');
const { createClient } = require('redis');
const { Pool } = require('pg');

const app = express();

// Initialize dependencies
const redisClient = createClient({ url: process.env.REDIS_URL });
const pgPool = new Pool({ connectionString: process.env.DATABASE_URL });

redisClient.connect().catch(console.error);

/**
 * Shallow health check — fast, for high-frequency LB probing.
 * Returns 200 if the process is alive. Does not check dependencies.
 * Use this for the LB's frequent interval check (every 5s).
 */
app.get('/health/live', (req, res) => {
  res.status(200).json({
    status: 'alive',
    pid: process.pid,
    uptime: process.uptime(),
  });
});

/**
 * Deep readiness check — validates all dependencies.
 * Returns 200 only when the application can actually serve traffic.
 * Use this as the LB's readiness gate during startup and deployment.
 * Check interval should be longer (every 10–15s) due to dependency I/O.
 */
app.get('/health/ready', async (req, res) => {
  const checks = {};
  let allHealthy = true;

  // Check database connectivity
  try {
    const client = await pgPool.connect();
    await client.query('SELECT 1');
    client.release();
    checks.database = { status: 'healthy' };
  } catch (err) {
    checks.database = { status: 'unhealthy', error: err.message };
    allHealthy = false;
  }

  // Check Redis connectivity
  try {
    await redisClient.ping();
    checks.redis = { status: 'healthy' };
  } catch (err) {
    checks.redis = { status: 'unhealthy', error: err.message };
    allHealthy = false;
  }

  // Memory pressure check — prevent routing to a server about to OOM
  const memUsage = process.memoryUsage();
  const heapUsedPercent = memUsage.heapUsed / memUsage.heapTotal;
  if (heapUsedPercent > 0.90) {
    checks.memory = {
      status: 'degraded',
      heapUsedPercent: (heapUsedPercent * 100).toFixed(1) + '%',
    };
    allHealthy = false;
  } else {
    checks.memory = {
      status: 'healthy',
      heapUsedPercent: (heapUsedPercent * 100).toFixed(1) + '%',
    };
  }

  const statusCode = allHealthy ? 200 : 503;
  res.status(statusCode).json({
    status: allHealthy ? 'ready' : 'not_ready',
    checks,
    pid: process.pid,
    timestamp: new Date().toISOString(),
  });
});

app.listen(8080, () => console.log('Health server on :8080'));
▶ Output
// GET /health/live → 200
// { "status": "alive", "pid": 12801, "uptime": 347.2 }

// GET /health/ready (all healthy) → 200
// {
// "status": "ready",
// "checks": {
// "database": { "status": "healthy" },
// "redis": { "status": "healthy" },
// "memory": { "status": "healthy", "heapUsedPercent": "42.1%" }
// },
// "pid": 12801
// }

// GET /health/ready (database down) → 503
// {
// "status": "not_ready",
// "checks": {
// "database": { "status": "unhealthy", "error": "connect ECONNREFUSED" },
// "redis": { "status": "healthy" },
// "memory": { "status": "healthy", "heapUsedPercent": "41.8%" }
// }
// }
Mental Model
Liveness vs Readiness — Two Different Questions
These are not the same check and should not be the same endpoint. Conflating them causes either unnecessary restarts or invisible traffic black holes.
  • Liveness: Is the process alive and not deadlocked? If this fails, the orchestrator should restart the container. Fast to evaluate, check every 5 seconds.
  • Readiness: Can this server actually handle a request right now? If this fails, the LB should stop routing traffic here but not restart the process. Slower to evaluate due to dependency I/O, check every 10-15 seconds.
  • A server can be live but not ready — JVM warmup, connecting to database, loading configuration. Don't send traffic to it yet.
  • A server can be ready but degraded — one dependency responding slowly. You might want to keep routing but alert on the degradation.
  • Kubernetes formalizes this split with livenessProbe and readinessProbe as first-class configurations. Use both.
📊 Production Insight
The memory pressure check in the health endpoint above is not hypothetical — it's something I've added to services after watching a server get progressively slower as it approached OOM, continuing to receive full traffic because the LB had no visibility into heap state.
At 90% heap utilization, GC frequency increases sharply, response times climb, and eventually the server stops responding entirely. By that point, the LB has marked it unhealthy and is in the process of draining connections — but you've already returned slow or failed responses to real users.
The check costs microseconds and can save you minutes of degraded production traffic.
🎯 Key Takeaway
A health check that only checks whether a port is open is not a health check — it's a process existence check with false confidence. Real health checks validate the ability to serve traffic: database connected, cache reachable, heap not exhausted, application initialized. The extra cost of a deep health check is a few milliseconds every 10 seconds. The cost of a TCP-only health check is a production incident.
Configuring Health Check Parameters
If: High-frequency liveness check (every 5 seconds)
Use: /health/live — just process.uptime() or a simple 200 response. No I/O, no dependency checks. Failure here means restart the process.

If: LB readiness check for traffic routing decisions
Use: /health/ready with dependency validation. Check interval 10–15 seconds. 3 consecutive failures before removing from rotation. 2 consecutive successes before re-adding.

If: Server oscillating between healthy and unhealthy (flapping)
Use: Increase the unhealthy threshold (require more consecutive failures) and add a minimum healthy duration before re-admission. Flapping is a signal of deeper instability — investigate the cause, don't just tune the thresholds.

If: Long application startup (JVM warmup, ML model loading)
Use: A startup probe with a generous initial delay and longer timeout, separate from the ongoing readiness check. Kubernetes startupProbe is purpose-built for this — it buys time for slow-starting containers without making the readiness check slow for the normal case.
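The consecutive-failure and consecutive-success thresholds above amount to a small state machine with hysteresis. A minimal sketch (class name and defaults are mine; real load balancers implement this internally):

```javascript
// Health-state tracker with hysteresis: a backend is removed only after N
// consecutive failures and re-added only after M consecutive successes.
// Defaults (3 down / 2 up) match the thresholds suggested in the text.
class HealthTracker {
  constructor({ unhealthyThreshold = 3, healthyThreshold = 2 } = {}) {
    this.unhealthyThreshold = unhealthyThreshold;
    this.healthyThreshold = healthyThreshold;
    this.healthy = true;
    this.failStreak = 0;
    this.passStreak = 0;
  }

  // Record one probe result; returns whether the backend is in rotation.
  record(checkPassed) {
    if (checkPassed) {
      this.passStreak += 1;
      this.failStreak = 0;
      if (!this.healthy && this.passStreak >= this.healthyThreshold) {
        this.healthy = true;
      }
    } else {
      this.failStreak += 1;
      this.passStreak = 0;
      if (this.healthy && this.failStreak >= this.unhealthyThreshold) {
        this.healthy = false;
      }
    }
    return this.healthy;
  }
}

const t = new HealthTracker();
t.record(false); t.record(false);
console.log(t.healthy); // true — only 2 consecutive failures so far
t.record(false);
console.log(t.healthy); // false — 3rd consecutive failure removes it
t.record(true);
console.log(t.healthy); // false — one success is not enough to re-admit
t.record(true);
console.log(t.healthy); // true — 2 consecutive successes re-admit it
```

Note how a single intermittent success resets neither direction fully: one pass after removal does not re-admit the server, which is exactly the hysteresis that suppresses flapping.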
🗂 Load Balancing Algorithm Comparison
Choose the algorithm that matches your workload profile — not just the one that sounds right
Round Robin
  • Strategy: Sequential distribution — each server gets the next request in rotation
  • Best for: Clusters with identical server specs and uniform request processing time
  • Watch out for: Falls apart when requests have variable processing times. A server bogged down by a slow request still receives its full share of new requests.

Weighted Round Robin
  • Strategy: Round robin with proportional traffic share based on configured weight
  • Best for: Mixed hardware environments — legacy vs new instances, different instance types
  • Watch out for: Static weights drift over time. A server's effective capacity changes with memory pressure and GC history. Weights must be updated dynamically.

Least Connections
  • Strategy: Routes to whichever server currently has the fewest active connections
  • Best for: Long-lived requests — streaming, heavy database queries, WebSocket connections
  • Watch out for: Connection count doesn't equal load. A server with 5 slow connections may be more loaded than one with 20 fast ones. Still better than Round Robin for most workloads.

Least Response Time
  • Strategy: Routes to the server with the lowest current TTFB (Time to First Byte)
  • Best for: Latency-sensitive workloads where response time variance between servers matters
  • Watch out for: Requires the LB to actively probe or measure backend latency — adds overhead. Can cause traffic oscillation if server speeds fluctuate rapidly.

IP Hash
  • Strategy: Routes each client to a consistent backend based on a hash of their source IP
  • Best for: Stateful applications that need session affinity without cookie-based sticky sessions
  • Watch out for: All traffic behind a corporate NAT or proxy hashes to the same backend. Adding or removing servers changes the hash distribution and breaks affinity.

Random with Two Choices
  • Strategy: Pick two servers at random, route to whichever has fewer connections
  • Best for: Very large server pools where maintaining full state is expensive
  • Watch out for: Less predictable distribution than Least Connections. Better than pure random but doesn't match Least Connections for accuracy.
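Two of these strategies are simple enough to sketch directly. The snippet below (server names and connection counts are made up; a real LB tracks counts as requests start and finish) shows Least Connections and Random with Two Choices side by side:

```javascript
// Least Connections: scan the whole pool for the lowest active count.
function leastConnections(servers) {
  return servers.reduce((best, s) => (s.conns < best.conns ? s : best));
}

// Random with Two Choices: sample two servers, take the less loaded one.
// Avoids scanning the full pool — useful when the pool is very large.
function twoRandomChoices(servers, rand = Math.random) {
  const a = servers[Math.floor(rand() * servers.length)];
  const b = servers[Math.floor(rand() * servers.length)];
  return a.conns <= b.conns ? a : b;
}

const pool = [
  { name: 'web-1', conns: 12 },
  { name: 'web-2', conns: 3 },
  { name: 'web-3', conns: 7 },
];

console.log(leastConnections(pool).name); // web-2
console.log(twoRandomChoices(pool).name); // one of the two sampled servers
```

The design difference is cost versus accuracy: Least Connections is O(n) per decision and always optimal against its own metric; two-choices is O(1) and merely very good, which is why it shows up in large-pool deployments.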

🎯 Key Takeaways

  • Load balancing is the primary mechanism for horizontal scalability — it's what lets you add servers instead of just making one server bigger. But it only works correctly if health checks are accurate, algorithms match the workload, and state is externalized.
  • Layer 4 is faster and simpler. Layer 7 is smarter and more flexible. Most production systems at scale use both: Layer 4 at the network edge for raw throughput, Layer 7 internally for content-aware microservice routing.
  • Health checks are the foundation everything else depends on. TCP-only checks are inadequate — they validate process existence, not application readiness. Deep HTTP health checks that validate dependencies are the minimum acceptable standard for production.
  • Least Connections is the safest default algorithm for modern web applications with variable request processing times. Round Robin's implicit assumption — that all requests take roughly the same time — is almost never true in practice.
  • Sticky sessions are a trap at scale. They defeat load balancing, create hotspots, and cause mass session loss when a pinned server dies. The correct answer is stateless application servers backed by Redis for session state.

⚠ Common Mistakes to Avoid

    Using TCP-only health checks in production
    Symptom

    The load balancer dashboard shows all backends as healthy while 503 errors climb in your application monitoring. Servers that have crashed internally but whose OS still holds the socket open pass the TCP check. Servers in JVM warmup or GC pause accept the TCP connection but can't process HTTP requests. Users see failures that your LB has no visibility into.

    Fix

    Replace TCP health checks with HTTP health checks on a /healthz or /health/ready endpoint that validates actual application readiness — database connected, cache reachable, dependencies healthy. Return 200 only when the server can genuinely handle a request. Configure 3 consecutive failures before removing from rotation and 2 consecutive successes before re-adding. Never go back to TCP-only checks.

    Ignoring SSL termination overhead until it becomes a crisis
    Symptom

    The load balancer CPU spikes to 100% under TLS-heavy traffic while backend servers sit idle at 15%. TLS handshake latency increases dramatically during traffic bursts. The LB becomes the bottleneck, and increasing backend capacity does nothing to fix it because the constraint is at the LB layer.

    Fix

    Size the LB instance for the computational cost of TLS termination — this is frequently under-provisioned because it's invisible until load hits. Use TLS 1.3 which requires fewer round trips than 1.2. Enable TLS session resumption via session tickets to avoid full handshakes for returning clients. For AWS deployments, NLB with TLS offloading uses hardware acceleration that sidesteps the CPU constraint entirely.

    Hardcoding server IPs in the upstream configuration
    Symptom

    A server is replaced during a scaling event or a failed instance is rebuilt with a new IP. The LB still routes traffic to the old IP — health checks eventually mark it unhealthy, but the new instance at the new IP receives zero traffic. Scaling events require manual config changes and a reload. In a cloud environment where IPs change constantly, this becomes a persistent operational burden.

    Fix

    Use DNS-based service discovery to populate the LB's server pool dynamically. Kubernetes Services handle this automatically. For non-Kubernetes environments, Consul or similar service registries provide DNS records that update as instances come and go. Configure the LB to resolve DNS names rather than cache IPs — set resolver directives in NGINX to control TTL.
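The refresh loop itself is simple. A minimal sketch (the hostname is hypothetical, and the resolver is injected so the pool logic is visible on its own; in production you would pass `dns.promises.resolve4` and run this on a timer honoring the record's TTL):

```javascript
// Replace the LB pool's membership with whatever DNS currently returns.
// The resolver is a parameter: dns.promises.resolve4 in production,
// a stub here so the example is self-contained.
async function refreshPool(pool, hostname, resolve4) {
  const ips = await resolve4(hostname);
  pool.clear();
  for (const ip of ips) pool.add(ip);
  return pool;
}

const pool = new Set(['10.0.0.1', '10.0.0.2']);

// Stand-in resolver simulating 10.0.0.2 being replaced by 10.0.0.3
const fakeResolve = async () => ['10.0.0.1', '10.0.0.3'];

refreshPool(pool, 'api.internal.example', fakeResolve).then((p) => {
  console.log([...p]); // [ '10.0.0.1', '10.0.0.3' ]
});
```

A real implementation also needs to drain connections to removed IPs rather than dropping them, and should treat an empty DNS answer as suspicious instead of emptying the pool.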

    Over-reliance on sticky sessions as a substitute for stateless application design
    Symptom

    One server is at 95% CPU while others idle at 10% — a subset of high-traffic users are all pinned to the same backend. When that server dies, every user pinned to it gets logged out simultaneously. Scaling the cluster doesn't help because new servers receive no traffic from existing sessions. Deployments require careful session migration planning.

    Fix

Move session state to Redis and make application servers stateless. With no local session data, any server can handle any request, the cluster scales freely, and servers can be replaced without user impact. If sticky sessions are truly unavoidable for a legacy system, set a TTL on the session cookie to bound the maximum duration of affinity, and monitor per-server connection distribution actively.
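The shape of an externalized session store is small. In this sketch a Map stands in for Redis so the example runs anywhere (the class and its interface are illustrative — a Redis-backed version would use GET/SET with an EX expiry behind the same methods):

```javascript
// Externalized session state: the app server keeps nothing in memory,
// so any backend can serve any request. The Map is a stand-in for Redis.
class SessionStore {
  constructor(backend = new Map()) {
    this.backend = backend; // e.g. a Redis client in production
  }
  async get(sessionId) {
    return this.backend.get(sessionId) ?? null;
  }
  async set(sessionId, data, ttlSeconds = 3600) {
    // With Redis this would be SET key value EX ttlSeconds.
    // The Map stand-in ignores the TTL.
    this.backend.set(sessionId, data);
  }
}

// Two "servers" sharing one store: either can read what the other wrote.
const shared = new SessionStore();
(async () => {
  await shared.set('sess-abc', { userId: 42 }); // written by server A
  console.log(await shared.get('sess-abc'));    // read by server B
})();
```

Once the store is shared, the load balancer needs no affinity at all: the session follows the cookie, not the server.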

    No connection draining on server removal
    Symptom

    During deployments or auto-scaling scale-in events, in-flight requests to servers being removed are abruptly terminated. Users experience random errors mid-request — form submissions that don't complete, API calls that return connection reset errors. The error rate spikes exactly when you're deploying, making it easy to blame the new code rather than the removal mechanics.

    Fix

    Configure connection draining on your LB — a grace period during which the server is removed from rotation for new requests but allowed to complete in-flight ones. AWS ALB calls this deregistration delay (default 300 seconds, often should be tuned lower to 30-60 seconds based on your request SLA). NGINX uses the drain flag. Set this to at least your p99 request duration, not the default.
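The draining behavior can be sketched independently of any particular LB. This toy version (the helper and its API are mine, not a real library; a production Node service would wire the same logic to SIGTERM and `http.Server#close`) rejects new work while waiting out in-flight requests up to a deadline:

```javascript
// Connection draining sketch: stop accepting new requests, then wait for
// in-flight ones to finish or for the drain deadline to pass.
function createDrainer() {
  let draining = false;
  let inFlight = 0;
  return {
    accept() {
      if (draining) return false; // removed from rotation: refuse new work
      inFlight += 1;
      return true;
    },
    finish() {
      inFlight = Math.max(0, inFlight - 1);
    },
    async drain(timeoutMs, pollMs = 50) {
      draining = true;
      const deadline = Date.now() + timeoutMs;
      while (inFlight > 0 && Date.now() < deadline) {
        await new Promise((r) => setTimeout(r, pollMs));
      }
      return inFlight === 0; // true = clean drain, false = timed out
    },
  };
}

(async () => {
  const d = createDrainer();
  d.accept();                        // one request in flight
  const drained = d.drain(500);      // deployment begins draining
  setTimeout(() => d.finish(), 100); // the request completes mid-drain
  console.log(d.accept());           // false — no new work while draining
  console.log(await drained);        // true — drained cleanly
})();
```

The timeout argument is where the advice in the text lands: set it to at least your p99 request duration, so a clean drain is the common case and the hard cutoff is the exception.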

Interview Questions on This Topic

  • Q: Design a system that handles 1 million concurrent users. Where do you place the load balancers and what type at each tier? (Senior)
    This needs a multi-tier approach — no single load balancer handles everything at this scale.
    Tier 1 — DNS-level routing: Route 53 or equivalent with latency-based or geolocation routing directs users to the nearest regional data center. This handles continent-level distribution and provides automatic failover between regions. Not a traditional LB, but it performs the same conceptual function at global scale.
    Tier 2 — Edge/Network LB (Layer 4): AWS NLB or GCP Network LB at the edge of each region. These handle raw TCP/UDP at line rate with sub-millisecond overhead. Their job is to absorb the raw connection volume, terminate TLS if needed (using hardware acceleration), and distribute traffic across the next tier. At 1M concurrent users, this tier needs to be sized for connection rate, not just bandwidth — each new connection requires CPU for the TLS handshake.
    Tier 3 — Application LB (Layer 7): NGINX, HAProxy, or AWS ALB sitting behind the NLB. This is where content-aware routing happens — /api/payments routes to the payments service pool, /api/media to the media service pool. This tier also handles authentication offloading, canary deployments, and A/B testing. Use the Least Connections algorithm here.
    Tier 4 — Service mesh (east-west): Envoy sidecars or similar for service-to-service communication inside the cluster: circuit breaking, retry logic, and mTLS between microservices. This isn't usually what people think of as 'load balancing', but it is distributing traffic across service instances.
    The critical design constraint at every tier: no single LB is a single point of failure. Active-active pairs with shared virtual IPs (VRRP/Keepalived on-premise, cloud-managed for AWS/GCP) at each tier. The architecture must survive the failure of any single component without user impact.
  • Q: How does the 'Least Connections' algorithm differ from 'Least Response Time'? (Senior)
    Least Connections routes to whichever server has the fewest active connections at the moment the routing decision is made. It's a volume metric — it assumes that fewer connections means less load, which is a reasonable approximation but not always accurate. A server with 5 connections each doing a 10-second database query might be more loaded than a server with 20 connections each serving cached responses in 5ms.
    Least Response Time is more sophisticated. Instead of counting connections, it measures the actual time to first byte from each backend and routes new requests to whichever server is currently responding fastest. This accounts for load, GC pauses, database wait time, and network conditions — not just connection count. A server with 5 slow connections that is visibly struggling will have a high TTFB and receive fewer new requests automatically.
    The practical difference matters in mixed workload scenarios. If your server pool handles both fast cached requests and slow database-heavy requests simultaneously, Least Connections will underestimate the load on servers handling the slow requests. Least Response Time will naturally route away from those servers.
    The cost: Least Response Time requires the LB to maintain latency measurements for each backend, either through active probing or by measuring response times from recent requests. This adds some overhead and can cause oscillation if server speeds fluctuate rapidly — the LB routes a burst to the 'fastest' server, that server becomes temporarily slower, the LB routes away, the first server recovers, and the cycle repeats.
    For most production web applications, Least Connections is the right default. Switch to Least Response Time when you have measurable latency variance between backends and that variance actually affects user experience.
  • Q: What is SSL Termination and why is it used at the Load Balancer level? (Mid-level)
    SSL Termination is the process of decrypting HTTPS traffic at the load balancer. The LB holds the TLS certificate, performs the handshake with the client, decrypts the request, and forwards plain HTTP to the backend servers over the private internal network. Why at the LB:
    CPU offloading: TLS handshake and symmetric encryption/decryption are computationally expensive. Centralizing this at one LB (ideally with hardware acceleration) prevents every application server from spending CPU cycles on crypto. At scale, this matters — TLS overhead on 1000 app servers adds up to significant capacity.
    Certificate management: You manage one certificate at the LB instead of distributing, rotating, and monitoring certificates across hundreds of backends. Certificate rotation on a single LB is a 2-minute operation. On 300 backend servers, it's a deployment pipeline and a risk.
    Inspection capability: Once decrypted, a Layer 7 load balancer can inspect cookies, headers, and URL paths for routing decisions. You can't do content-aware routing on encrypted traffic.
    The trade-off is that traffic between the LB and backends travels unencrypted over the private network. In most traditional architectures this is acceptable because the internal network is trusted. In zero-trust environments or regulated industries (PCI-DSS, HIPAA), this isn't acceptable — you need TLS re-wrapping for the backend hop, which adds latency (one more TLS handshake per request) but maintains end-to-end encryption. AWS ALB supports both modes.
  • Q: What happens if the health check fails, and how do you prevent a server from flapping in and out of rotation? (Junior)
    When a health check fails, the load balancer marks the server as unhealthy and stops routing new requests to it. Existing connections in flight are handled based on whether connection draining is configured — with draining, they complete; without it, they may be abruptly terminated. The LB continues probing the unhealthy server at the configured interval. The server is returned to rotation only after it passes a configured number of consecutive successful checks — not after a single success. This consecutive-success threshold is what prevents the first half of the flapping problem.
    Flapping — a server rapidly cycling between healthy and unhealthy — happens when the underlying problem is intermittent. A server with a database connection that's borderline timing out might pass a check, fail the next, pass again, fail again. Without hysteresis, the LB adds and removes the server on every cycle, which itself amplifies instability: users get routed to the server, their connections fail, the LB removes it, users get rerouted, and load on the other servers spikes from the sudden shift.
    Fix: require 3 consecutive failures before removal and 2 consecutive successes before re-addition. Add a minimum healthy duration — the server must pass checks for at least 30 seconds continuously before re-admission. Alert on state transitions, not just steady-state health — frequent transitions are a signal that investigating the underlying instability should be the on-call priority, not just stabilizing the cluster.

Frequently Asked Questions

What is the difference between Horizontal and Vertical Scaling?

Vertical Scaling (scaling up) means adding more CPU, RAM, or faster storage to a single machine. It has a hard ceiling — you can only make one machine so large — and it typically requires downtime to resize. Horizontal Scaling (scaling out) means adding more machines to your server pool. Load balancers are what make horizontal scaling work: without one, you can't distribute traffic across multiple machines transparently. Horizontal scaling is the model that enables the kind of elastic capacity that cloud infrastructure is built around — add machines when load increases, remove them when it drops.

Can a Load Balancer become a Single Point of Failure?

Yes, absolutely — and this is one of the first questions worth asking when evaluating any load balancing architecture. A single load balancer that goes down takes the entire service with it. The standard mitigation is an active-passive or active-active high availability pair. Two load balancers share a Virtual IP (VIP). If the primary fails, the secondary detects the failure via heartbeat and takes ownership of the VIP — traffic continues flowing within seconds. On-premise implementations use VRRP (Virtual Router Redundancy Protocol) with tools like Keepalived. Cloud-managed load balancers (AWS ALB, GCP LB) handle HA internally and are effectively transparent to this problem.
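The failover decision at the heart of that heartbeat scheme is tiny. A toy sketch (names, timestamps, and the tolerance value are illustrative; real deployments use VRRP/Keepalived rather than hand-rolled logic like this):

```javascript
// The secondary LB watches heartbeat timestamps from the primary and
// claims the virtual IP once the primary has been silent too long.
function shouldTakeover(lastHeartbeatMs, nowMs, toleranceMs = 3000) {
  return nowMs - lastHeartbeatMs > toleranceMs;
}

let vipOwner = 'lb-primary';
const lastHeartbeat = 10_000; // primary last seen at t = 10s

// At t = 11s the primary has been silent only 1s — no takeover
console.log(shouldTakeover(lastHeartbeat, 11_000)); // false

// At t = 14s it has been silent 4s > 3s tolerance — secondary claims the VIP
if (shouldTakeover(lastHeartbeat, 14_000)) vipOwner = 'lb-secondary';
console.log(vipOwner); // lb-secondary
```

The tolerance is the same trade-off as health-check thresholds: too tight and a GC pause on the primary triggers a spurious failover; too loose and the outage window grows.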

What happens if the health check fails?

The load balancer marks the server as unhealthy and stops routing new connections to it. If connection draining is configured, in-flight requests are allowed to complete up to the drain timeout — typically 30–60 seconds. The LB continues probing the unhealthy server at the configured interval. The server only returns to rotation after passing a configured number of consecutive successful health checks — typically 2 or 3 — to prevent flapping. A server that passes one check and fails the next should not cycle in and out of rotation on every check cycle. The consecutive-success threshold is what creates the hysteresis needed for stable behavior.

When should I use IP Hash instead of session cookies for affinity?

IP Hash is useful when you need session affinity but can't or won't modify the application to set a cookie — for example, with third-party clients or binary protocols that don't support cookies. The significant limitations: if users are behind a NAT gateway or corporate proxy, all of them share the same source IP and all hash to the same backend, creating a severe hotspot. Adding or removing servers from the pool changes the hash distribution, so existing users get rerouted to different servers — breaking the affinity you were trying to maintain. For most applications, Redis-backed sessions with stateless servers is the right answer. Cookie-based affinity at the LB is a second option. IP Hash is a distant third, appropriate only when the first two genuinely aren't possible.
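The NAT hotspot is easy to demonstrate. The hash below is a deliberately simple sketch — not what any particular load balancer actually uses — but the failure mode is the same for any deterministic hash of the source IP:

```javascript
// Map a source IP to a backend index deterministically.
function ipHash(ip, poolSize) {
  let h = 0;
  for (const ch of ip) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % poolSize;
}

const poolSize = 4;

// Three distinct users, but all egress through one corporate NAT IP
const natClients = ['203.0.113.7', '203.0.113.7', '203.0.113.7'];
const backends = natClients.map((ip) => ipHash(ip, poolSize));

console.log(backends);               // all three land on the same index
console.log(new Set(backends).size); // 1 — every NAT user hits one backend
```

The second pitfall from the answer above follows from the same code: `h % poolSize` changes for almost every IP when `poolSize` changes, so adding or removing a server reshuffles affinity across the whole pool (consistent hashing mitigates this, at the cost of more machinery).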

Naren — Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
