Junior 8 min · March 05, 2026

Load Balancing Health Check Black Hole — 40% Vanish

TCP health checks miss JVM GC pauses and warmup, causing 40% 503 errors.

Naren · Founder
Quick Answer
  • A load balancer distributes incoming traffic across multiple servers so no single machine becomes a bottleneck or single point of failure
  • Health checks are the heartbeat — they probe servers continuously and remove dead ones from rotation automatically; TCP-only checks are a trap
  • Algorithms (Round Robin, Least Connections, IP Hash, Weighted, Least Response Time) decide which server gets each request based on different signals
  • Layer 4 (TCP/UDP) is faster; Layer 7 (HTTP/HTTPS) is smarter — it can inspect cookies, headers, URL paths, and make content-aware routing decisions
  • Session persistence (sticky sessions) keeps users on one server but creates hotspots and causes mass session loss if that server dies
  • The biggest trap: skipping health checks or using TCP-only probes means your LB becomes a black hole, routing traffic into servers that can't respond
  • In production, no single load balancer handles everything — DNS, edge, Layer 7 gateway, and service mesh each own a different tier
Plain-English First

Imagine a busy McDonald's with 6 cashiers. A greeter at the door watches all the lines and sends you to whichever cashier is least busy — not just the first one in rotation. If one cashier goes on break or calls in sick, the greeter stops sending people their way entirely. If one cashier is twice as fast as the others, the greeter sends them twice as many customers. That greeter IS the load balancer. The cashiers are your servers. The system that checks whether a cashier is available and actually working — not just standing at their register staring at a frozen screen — is the health check. And the strategy the greeter uses to pick a cashier — shortest queue, round-robin rotation, same cashier you had last time, or the fastest one right now — is the load balancing algorithm. Everything else in this article is just the details of how that greeter makes smarter decisions at internet scale.

Every time you tap 'Buy Now' on Amazon or start a video on Netflix, your request hits one of hundreds or thousands of servers — chosen in milliseconds by a load balancer you never see. Without it, modern internet-scale applications simply couldn't exist.

The core problem is deceptively simple: distribute work across many machines so no single machine becomes a bottleneck, a single point of failure, or a performance nightmare. Without load balancing, one server handles everything until it buckles under the weight. With it, traffic is spread intelligently, failed servers are automatically removed from the pool, and new capacity can be added without touching the rest of the system.

But 'load balancer' is not a single thing. It's a tier — sometimes multiple tiers — of components that each own a different slice of the problem. DNS-level routing decides which data center gets your request. A network load balancer handles the raw TCP connection at line rate. An application load balancer inspects your HTTP headers and routes you to the right microservice. A service mesh sidecar manages the connection between that microservice and the next one in the chain. Understanding where each layer sits, what decisions it can make, and what its failure modes look like is what separates engineers who can configure a load balancer from engineers who can design a system that stays up when things go wrong.

By the end of this article you'll understand what load balancers are, which components make them tick, when to use Round Robin vs Least Connections vs Least Response Time, why sticky sessions can be a trap at exactly the wrong moment, and how to answer the load balancing questions that trip people up in system design interviews at senior level.

The Core Mechanics: How a Load Balancer Decides Where to Send Traffic

A load balancer sits between the client and your server pool. When a request arrives, it has to make a routing decision in milliseconds — which server gets this connection, right now, given the current state of the cluster.

That decision happens at one of two layers, and the layer matters more than most people realize. Layer 4 load balancers operate at the transport layer — they see IP addresses, TCP/UDP ports, and packet counts. They don't open the envelope. Layer 7 load balancers operate at the application layer — they can read the HTTP method, URL path, headers, cookies, and request body. They know the difference between a GET /api/images request and a POST /api/payments request and can route them to entirely different server pools.
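
To make the difference concrete, here is a minimal sketch of what each layer has to work with. The pool names, addresses, and the `pickL4` / `pickL7` functions are hypothetical and exist only for illustration; they are not any real load balancer's API.

```javascript
// Hypothetical pools and addresses, for illustration only.
const pools = {
  images: ['10.0.1.1:8080', '10.0.1.2:8080'],
  payments: ['10.0.2.1:8080', '10.0.2.2:8080'],
  default: ['10.0.0.1:8080', '10.0.0.2:8080'],
};

// Small deterministic string hash (32-bit FNV-1a).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Layer 4: only the connection tuple is visible, so every connection
// arriving on this port is spread over one pool.
function pickL4(conn) {
  const key = `${conn.srcIp}:${conn.srcPort}->${conn.dstIp}:${conn.dstPort}`;
  const pool = pools.default;
  return pool[fnv1a(key) % pool.length];
}

// Layer 7: the parsed request is visible, so the URL path can select the pool.
function pickL7(req) {
  let pool = pools.default;
  if (req.path.startsWith('/api/images')) pool = pools.images;
  else if (req.path.startsWith('/api/payments')) pool = pools.payments;
  return pool[fnv1a(req.clientIp) % pool.length];
}
```

`pickL4` never receives a path at all, which is the whole point: a Layer 4 balancer cannot base a decision on information it does not parse.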

The trade-off is straightforward: Layer 4 is faster because there's almost nothing to parse. Layer 7 is more expensive computationally because it has to terminate the connection, parse the HTTP request, make a routing decision, and then establish a new connection (or reuse a keepalive connection) to the backend. In practice, that overhead is typically 0.5–2ms per request — negligible for most applications, meaningful for high-frequency trading or real-time gaming.

In production, you usually don't pick one. The standard architecture is a Layer 4 network load balancer at the edge handling raw TCP connections at line rate, with Layer 7 application load balancers behind it doing content-aware routing to specific service pools. AWS calls these NLB and ALB. On-premise, you'd see HAProxy in TCP mode in front of NGINX instances.

io/thecodeforge/nginx/upstream.conf · NGINX
# Upstream pool with mixed weights and a backup server
upstream forge_backend_cluster {
    # Least Connections: best for workloads with variable request processing times
    # Round Robin (default) assumes all requests take roughly the same time — often wrong
    least_conn;

    server 10.0.0.1:8080 weight=3;  # Larger instance — gets 3x the traffic
    server 10.0.0.2:8080 weight=1;  # Standard instance
    server 10.0.0.3:8080 backup;    # Only receives traffic when primary servers are down

    # Keepalive: reuse connections to backends instead of opening new TCP connections per request
    # Without this, high-traffic scenarios create connection exhaustion
    keepalive 32;

    # Passive health checks are set per server with max_fails and fail_timeout:
    # after 3 failed attempts within 30s the server is marked unavailable for 30s, e.g.
    # server 10.0.0.1:8080 weight=3 max_fails=3 fail_timeout=30s;
    # Active health checks require nginx_upstream_check_module or NGINX Plus.
}

server {
    listen 80;
    server_name thecodeforge.io;

    location /api/ {
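        proxy_pass http://forge_backend_cluster;

        # Needed for the upstream keepalive pool to take effect:
        # speak HTTP/1.1 to backends and clear the Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";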

        # Preserve real client IP through the proxy
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $host;

        # Timeout configuration — tune for your backend's actual SLA
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;

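        # Retry the next backend on connection errors, timeouts, or a 503.
        # By default NGINX does not retry non-idempotent requests (POST, LOCK, PATCH)
        # once they have been sent to a backend, which avoids double-posting.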
        proxy_next_upstream error timeout http_503;
        proxy_next_upstream_tries 2;
    }
}
Output
# Traffic flows: client → NGINX (L7) → least-connected backend from pool
# Server 10.0.0.1 receives ~3x requests vs 10.0.0.2 due to weight=3
# 10.0.0.3 stays idle unless both primaries fail health checks
# Keepalive reuses existing connections — avoids TCP handshake overhead on every request
Layer 4 vs Layer 7 — The Speed vs Smarts Trade-off
  • Layer 4: No payload inspection; routing uses only IP addresses and ports. Lowest latency (~microseconds). Ideal for raw TCP/UDP traffic like gaming servers, video streaming, or any protocol that isn't HTTP.
  • Layer 7: Reads cookies, headers, URL paths, and HTTP methods. Enables A/B testing, canary deployments, microservice routing by path, and authentication offloading.
  • Rule of thumb: if your routing decision requires knowing anything about the request content, you need Layer 7. If you only need to balance load across identical servers, Layer 4 is enough.
  • Performance cost of Layer 7: typically 0.5–2ms additional latency per request due to connection termination, TLS handling, and HTTP parsing.
  • Standard production architecture: Layer 4 NLB at the edge absorbs raw connection volume, Layer 7 ALB/NGINX behind it makes content-aware routing decisions per service.
Production Insight
Layer 4 load balancers are blind to URL paths — /api/payments and /api/images look identical at the TCP layer.
In a microservices architecture, routing different paths to different service pools requires Layer 7. Layer 4 alone means one pool per port, which doesn't scale.
The NLB → ALB pattern solves this: the NLB absorbs connection volume at the edge and forwards plain TCP, while the ALB terminates TLS and handles path-based routing internally. Both AWS and GCP make this pattern straightforward with managed offerings.
Key Takeaway
Layer 4 is the motorcycle — fast and direct, but you can't read road signs at that speed. Layer 7 is the GPS-equipped car — slightly slower off the line, but it knows exactly where every request needs to go. Most production systems at scale need both, with clear ownership of which tier makes which routing decisions.
Choosing Between Layer 4 and Layer 7
If: Routing decision requires inspecting URL path, HTTP headers, cookies, or request body
Use: Layer 7 (NGINX, HAProxy in HTTP mode, AWS ALB, GCP Application Load Balancer). No way around it: Layer 4 cannot see this information.
If: Raw TCP/UDP traffic with no HTTP semantics, such as game servers, video streaming, gRPC without HTTP/2 inspection, DNS
Use: Layer 4 (AWS NLB, HAProxy in TCP mode, GCP Network Load Balancer). Lower latency, higher throughput, no parsing overhead.
If: Extreme latency sensitivity, such as sub-millisecond routing, financial trading, real-time bidding
Use: Layer 4. The 0.5–2ms overhead of HTTP parsing is real and will show up in your p99 latency.
If: Need both maximum throughput at the edge and intelligent content routing for microservices
Use: Layer 4 (NLB) at the edge → Layer 7 (NGINX/Envoy/ALB) internally. Standard architecture for anything at significant scale.

OSI Layer 4 vs Layer 7 — Visual Comparison

Understanding the OSI layer at which a load balancer operates is fundamental to designing your architecture. Layer 4 (transport layer) and Layer 7 (application layer) operate at completely different levels of abstraction, and the decision between them determines what information is available for routing, how much overhead is added, and what failure modes you must design for.

The diagram below shows the two layers side-by-side, with their respective capabilities, overhead, and typical use cases. At Layer 4, the load balancer sees only IP addresses, ports, and TCP flags — packets are forwarded without inspection, making it extremely fast but completely unaware of request content. At Layer 7, the load balancer terminates the TCP connection, performs TLS termination, and then parses the HTTP request to extract cookies, headers, URL paths, and even the request body. This enables content-aware routing, but adds latency from connection termination and parsing.

In practice, production systems rarely pick one over the other — they use both in a tiered architecture. A Layer 4 NLB at the network edge handles raw connection volume at line rate and distributes traffic to a pool of Layer 7 application load balancers (NGINX, HAProxy, Envoy) that perform content-aware routing to specific microservices. This hybrid approach gives you the throughput of Layer 4 at the edge with the intelligence of Layer 7 inside.

Production Insight
One common antipattern: placing a single Layer 7 load balancer at the edge for all traffic. Layer 7 LBs carry per-connection memory overhead because they terminate each TCP connection and buffer the request while parsing it. Under a flood of short-lived connections, a Layer 7 LB can run out of memory or CPU before its backend capacity is even tapped. The usual fix is a Layer 4 LB in front to absorb the connection volume: it keeps only a small flow-table entry per connection and can handle millions of concurrent connections with minimal resource usage.
Key Takeaway
Layer 4 keeps almost no per-connection state and is blazing fast, which makes it a good fit for absorbing edge traffic. Layer 7 terminates connections and understands requests, which makes it a good fit for content-aware routing. A production system that uses only one is either giving up intelligent routing (all L4) or risking capacity exhaustion (all L7 at the edge).

Load Balancing Algorithms in Depth: Choosing the Right Strategy for Your Workload

The algorithm your load balancer uses to select a backend is not a configuration detail — it's a decision that directly affects your latency distribution, your server utilization, and what happens when servers become slow rather than fully dead.

Round Robin is the simplest: request 1 goes to server 1, request 2 to server 2, and so on, cycling back to the beginning. It works well when all servers are identical and all requests take roughly the same time to process. Both of those assumptions break in practice. Servers are rarely perfectly identical after weeks of different memory allocations and GC histories. And requests are almost never uniform — a request that triggers a complex database join takes 50x longer than one hitting a cache.

Least Connections routes each new request to whichever server currently has the fewest active connections. This self-corrects automatically: a slow server holds each connection open longer, so its active count stays high and the LB naturally sends it fewer new ones. This is why Least Connections is the safer default for most production web applications.
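
As a minimal sketch of that selection rule, assuming each backend is a plain object with a hypothetical `active` counter of in-flight requests (the names are illustrative, not a real library's API):

```javascript
// Pick the backend with the fewest in-flight requests.
// Ties go to the backend that appears first in the list.
function pickLeastConnections(backends) {
  if (backends.length === 0) throw new Error('no backends available');
  let best = backends[0];
  for (const backend of backends) {
    if (backend.active < best.active) best = backend;
  }
  return best;
}

// Wrap a request so the counter rises when it starts and falls when it ends,
// whether the handler succeeds or throws.
async function dispatch(backends, handler) {
  const backend = pickLeastConnections(backends);
  backend.active += 1;
  try {
    return await handler(backend);
  } finally {
    backend.active -= 1;
  }
}
```

The `finally` block matters: if a failed request never decrements the counter, the backend looks permanently busy and slowly drops out of rotation.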

Least Response Time takes the next step: instead of counting connections, it measures actual backend latency (TTFB) and routes to whichever server is responding fastest right now. This is more accurate but requires the LB to actively probe or measure response times, which adds some overhead. It's the right choice for latency-sensitive workloads where a server can be 'healthy' but slow.
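
One common way to track "responding fastest right now" is an exponentially weighted moving average (EWMA) of observed latency. The sketch below assumes that approach; the `LatencyTracker` class and the smoothing factor `alpha = 0.3` are illustrative choices, not something the algorithm prescribes.

```javascript
class LatencyTracker {
  // alpha close to 1 reacts quickly to new samples; close to 0 smooths more.
  constructor(alpha = 0.3) {
    this.alpha = alpha;
    this.ewmaMs = new Map(); // backend name -> smoothed latency in ms
  }

  // Fold one observed response time into the backend's moving average.
  record(backend, latencyMs) {
    const previous = this.ewmaMs.get(backend);
    const next = previous === undefined
      ? latencyMs
      : this.alpha * latencyMs + (1 - this.alpha) * previous;
    this.ewmaMs.set(backend, next);
  }

  // Return the backend with the lowest smoothed latency.
  // Unmeasured backends count as 0 ms so they get tried at least once.
  pickFastest(backends) {
    let best = null;
    let bestMs = Infinity;
    for (const backend of backends) {
      const ms = this.ewmaMs.get(backend) ?? 0;
      if (ms < bestMs) {
        best = backend;
        bestMs = ms;
      }
    }
    if (best === null) throw new Error('no backends available');
    return best;
  }
}
```

Smoothing is what keeps traffic from swinging wildly after a single slow response, at the cost of reacting a little later when a server really does degrade.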

IP Hash routes each client to the same backend based on a hash of their source IP. This provides a form of session affinity without application-level cookies. The significant risk: if all your users are behind a corporate NAT gateway, they share one source IP and all hash to the same backend. Also, when a backend is added or removed, a plain modulo hash remaps most clients to a different backend, breaking any state you were relying on affinity to preserve. Consistent hashing limits the remapping to roughly the clients of the changed backend, but it does not eliminate it.
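
A small sketch of plain modulo IP hashing shows both risks. The hash function (32-bit FNV-1a) and the helper names are assumptions chosen for illustration; real load balancers use their own hash functions.

```javascript
// Small deterministic string hash (32-bit FNV-1a).
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// The same source IP always lands on the same backend for a given pool.
function pickByIpHash(clientIp, backends) {
  return backends[fnv1a(clientIp) % backends.length];
}

// Count how many clients change backend when the pool changes.
function countRemapped(clientIps, before, after) {
  return clientIps.filter(
    (ip) => pickByIpHash(ip, before) !== pickByIpHash(ip, after)
  ).length;
}
```

Every user behind one NAT address produces the same `clientIp`, so they all receive the same `pickByIpHash` result. Calling `countRemapped` with a pool before and after removing one backend typically reports moves for clients that were never on the removed backend, because the modulus itself changed.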

Weighted variants of Round Robin and Least Connections let you express that some servers have more capacity than others. A server with weight=3 gets three times the share of a server with weight=1. This is essential in mixed hardware environments.
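
Weighted selection can be implemented in more than one way. The sketch below is a different technique from the expanded-pool Java class that follows: it is the "smooth" weighted round robin scheme popularized by NGINX, which interleaves picks instead of sending bursts to the heaviest server. The class name and server names are hypothetical.

```javascript
class SmoothWeightedRoundRobin {
  // weights: object mapping server name -> positive integer weight
  constructor(weights) {
    this.peers = Object.entries(weights).map(([name, weight]) => ({
      name,
      weight,
      current: 0, // running score, adjusted on every pick
    }));
    if (this.peers.length === 0) throw new Error('no servers configured');
    this.total = this.peers.reduce((sum, peer) => sum + peer.weight, 0);
  }

  next() {
    let best = null;
    for (const peer of this.peers) {
      // Every peer earns its weight each round...
      peer.current += peer.weight;
      if (best === null || peer.current > best.current) best = peer;
    }
    // ...and the winner pays back the total, which pushes it down the order.
    best.current -= this.total;
    return best.name;
  }
}

const balancer = new SmoothWeightedRoundRobin({ a: 5, b: 1, c: 1 });
const sequence = [];
for (let i = 0; i < 7; i++) sequence.push(balancer.next());
// sequence.join('') is 'aabacaa': five picks for a, spread out rather than back to back
```

Over any full cycle of 7 picks each server still gets exactly its weighted share; only the ordering differs from the expanded-pool approach.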

io/thecodeforge/loadbalancer/WeightedRoundRobinBalancer.java · JAVA
package io.thecodeforge.loadbalancer;

import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Production-grade Weighted Round-Robin implementation.
 *
 * Design decisions worth understanding:
 * - AtomicInteger for the index counter: load balancers handle concurrent requests.
 *   A plain int here is a data race waiting to happen.
 * - Immutable pool behind a volatile reference: getNextServer() reads one consistent
 *   snapshot without locking, and updateWeight() publishes a complete new pool in a
 *   single write, so readers never see a half-updated or temporarily empty list.
 * - Collections.shuffle() on construction: prevents all initial traffic from
 *   hitting server[0] in a predictable burst during startup.
 * - Math.abs() on the modulo result: AtomicInteger.getAndIncrement() eventually
 *   overflows to negative values. Without abs(), you get an ArrayIndexOutOfBoundsException
 *   at 2^31 requests, a production bug you will not find in testing.
 */
public class WeightedRoundRobinBalancer {

    private volatile List<String> serverPool;
    private final AtomicInteger currentIndex = new AtomicInteger(0);

    public WeightedRoundRobinBalancer(Map<String, Integer> serversWithWeights) {
        List<String> pool = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : serversWithWeights.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                pool.add(entry.getKey());
            }
        }
        // Shuffle to prevent predictable startup burst on the first server
        Collections.shuffle(pool);
        // Immutable snapshot: lock-free reads on the hot path
        this.serverPool = List.copyOf(pool);
    }

    public String getNextServer() {
        // Read the volatile field once so size() and get() see the same snapshot
        List<String> pool = serverPool;
        if (pool.isEmpty()) {
            throw new IllegalStateException(
                "No active servers in the pool. Check health checks and server registration."
            );
        }
        // Math.abs handles integer overflow at 2^31 requests
        int index = Math.abs(currentIndex.getAndIncrement() % pool.size());
        return pool.get(index);
    }

    /**
     * Dynamically update a server's weight, e.g. temporarily reduce weight
     * for a server showing elevated GC pause times or increased error rate.
     * In production, this would be called by your health monitoring system.
     */
    public synchronized void updateWeight(String server, int newWeight, int oldWeight) {
        // Build the new pool off to the side, then publish it in one volatile write
        List<String> updated = new ArrayList<>(serverPool);
        updated.removeIf(s -> s.equals(server));
        for (int i = 0; i < newWeight; i++) {
            updated.add(server);
        }
        this.serverPool = List.copyOf(updated);
        System.out.printf("Weight updated: %s %d → %d (pool size: %d)%n",
            server, oldWeight, newWeight, updated.size());
    }

    public static void main(String[] args) {
        Map<String, Integer> config = new LinkedHashMap<>();
        config.put("app-server-large-01", 5);  // High-capacity instance
        config.put("app-server-std-01",   2);  // Standard instance
        config.put("app-server-std-02",   2);  // Standard instance

        WeightedRoundRobinBalancer balancer = new WeightedRoundRobinBalancer(config);

        System.out.println("Initial distribution across 18 requests:");
        Map<String, Integer> distribution = new LinkedHashMap<>();
        for (int i = 0; i < 18; i++) {
            String server = balancer.getNextServer();
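            distribution.merge(server, 1, Integer::sum);
        }
        distribution.forEach((server, count) ->
            System.out.printf("%s → %d requests%n", server, count));

        System.out.println("Simulating GC pressure on app-server-large-01...");
        balancer.updateWeight("app-server-large-01", 1, 5);
        System.out.println("Server temporarily downweighted. Rebalancing traffic.");
    }
}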
Output
Initial distribution across 18 requests:
app-server-large-01 → 10 requests
app-server-std-01 → 4 requests
app-server-std-02 → 4 requests
Simulating GC pressure on app-server-large-01...
Weight updated: app-server-large-01 5 → 1 (pool size: 5)
Server temporarily downweighted. Rebalancing traffic.
The Sticky Session Trap — It Fails Exactly When You Need It Most
Sticky sessions sound like a reasonable solution to stateful applications. In practice, they create two problems that compound each other. First, they defeat load balancing — if 30% of your users all happen to hash to the same server, that server gets 30% of traffic regardless of its current load. Second, when that server dies, every user pinned to it loses their session simultaneously — a mass logout event during your highest-traffic moment. The correct fix is to store session state in Redis and make your application servers stateless. Sticky sessions are a band-aid that delays this conversation until the worst possible moment.
Production Insight
The Math.abs() call in the Java implementation is not defensive programming theater — it's fixing a real production bug.
AtomicInteger.getAndIncrement() wraps around to Integer.MIN_VALUE after 2^31 calls. Without Math.abs(), the remainder of a negative counter is zero or negative, and a negative index throws ArrayIndexOutOfBoundsException on the next line.
On a moderately loaded API server handling 1,000 requests/second, you hit 2^31 in roughly 25 days (2,147,483,648 / 1,000 / 86,400 ≈ 24.9). You won't catch this in load testing unless you run it for a very long time.
Dynamic weight adjustment is equally important in production. A server with weight=5 that enters a long GC pause should drop to weight=1 automatically. Static weights set at deployment time are a snapshot of server capacity at one moment — they drift.
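
The overflow arithmetic is easy to check. The sketch below uses JavaScript's `| 0` to emulate Java's signed 32-bit wraparound; the pool size of 9 matches the 5 + 2 + 2 weights in the Java example, and the helper names are made up for illustration.

```javascript
// How long until a counter incremented once per request reaches 2^31?
const requestsPerSecond = 1000;
const callsUntilOverflow = 2 ** 31;
const daysUntilOverflow = callsUntilOverflow / requestsPerSecond / 86400;
// daysUntilOverflow is about 24.86

// `| 0` wraps a number to a signed 32-bit value, like a Java int.
function indexWithoutAbs(counter, poolSize) {
  return (counter | 0) % poolSize;
}

function indexWithAbs(counter, poolSize) {
  return Math.abs((counter | 0) % poolSize);
}

const wrapped = 2 ** 31;                   // becomes -2147483648 as an int
const bad = indexWithoutAbs(wrapped, 9);   // -2: not a valid list index
const good = indexWithAbs(wrapped, 9);     //  2: back inside 0..8
```

Note that the absolute value is applied to the remainder, not to the counter. The remainder always lies strictly between `-poolSize` and `poolSize`, so taking its absolute value cannot itself overflow.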
Key Takeaway
Weighted Round Robin is not 'set and forget.' Static weights reflect server capacity at deployment time and drift as memory pressure and GC behavior evolve. Production implementations monitor per-server response time and error rate and adjust weights dynamically. A server with weight=5 that starts returning errors at 10% should not receive 5x the traffic.

Load Balancing Algorithm Matrix — Best Use Case Selection Guide

Choosing the right load balancing algorithm is not a one-size-fits-all decision. The matrix below maps each algorithm to its ideal workload profile, along with the key risks when the assumptions behind the algorithm are violated. Use this as a quick reference when designing or debugging a load-balanced system.

io/thecodeforge/loadbalancer/algorithm_matrix.md · MARKDOWN
| Algorithm | Strategy | Best Use Case | Watch Out For |
|---|---|---|---|
| Round Robin | Sequential distribution | Identical servers with uniform request processing time | Variable request times cause uneven load despite uniform distribution |
| Weighted Round Robin | Distribution proportional to configured weight | Mixed hardware (big vs small instances) | Static weights drift over time; need dynamic adjustment |
| Least Connections | Routes to server with fewest active connections | Long-lived request workloads (streaming, heavy queries) | Connection count doesn't equal load; slow servers accumulate connections |
| Least Response Time | Routes to server with lowest TTFB | Latency-sensitive applications | Traffic oscillation if server speed fluctuates rapidly; requires LB to probe |
| IP Hash | Consistent hash of source IP | Stateful apps requiring session affinity without cookies | NAT/proxy makes all users hash to same server; pool changes break affinity |
| Random with Two Choices | Pick two random servers, choose the one with fewer connections | Very large server pools (reduces state tracking overhead) | Less predictable than Least Connections; still vulnerable to slow servers |
Production Insight
The 'Best Use Case' column is not a prescriptive rule — it's a starting point. In production, the actual best algorithm often emerges from observability data: if you see one server accumulating connections faster than others despite Least Connections, your requests have variable processing time that Least Response Time would handle better. Treat algorithm selection as an iterative tuning process, not a one-time decision.
Key Takeaway
No algorithm is universally best. Least Connections is the safest default for most web applications, but latency-sensitive workloads need Least Response Time, and mixed hardware demands weighted variants. Always validate your choice with production metrics.
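
The last row of the matrix, Random with Two Choices, is the least familiar of the six, so here is a minimal sketch. The `active` field and the injectable `random` parameter are assumptions made so the example is easy to test; they are not part of any specific load balancer.

```javascript
// Sample two distinct backends at random and keep the less loaded one.
function pickTwoChoices(backends, random = Math.random) {
  if (backends.length === 0) throw new Error('no backends available');
  if (backends.length === 1) return backends[0];

  const first = Math.floor(random() * backends.length);
  // Draw the second index from the remaining n - 1 slots so it always differs.
  let second = Math.floor(random() * (backends.length - 1));
  if (second >= first) second += 1;

  const a = backends[first];
  const b = backends[second];
  return a.active <= b.active ? a : b;
}
```

Because every pick compares two different servers, the single most loaded server is never chosen while any less loaded server exists, yet the balancer only reads two counters per request instead of scanning the whole pool.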

Health Checks: The Component That Makes Everything Else Work

Health checks are the mechanism by which a load balancer knows which servers are actually capable of serving traffic right now. Everything else — algorithm, weights, session persistence — is irrelevant if the LB doesn't have accurate information about server state.

There are three types of health checks in common use, and understanding their trade-offs matters:

TCP health checks open a connection to the server's port and consider it healthy if the connection succeeds. Fast, low overhead, and completely inadequate for detecting application-level failures. The server's OS can accept a TCP connection while the application is in a GC pause, crashed internally, or waiting on a database connection that will never arrive.

HTTP health checks send an actual HTTP request to a designated endpoint (typically /health or /healthz) and validate the response code. This is the minimum acceptable standard for production. The endpoint must return a non-200 response if the application isn't ready to serve traffic — not just if the process is running.

Deep health checks go further: the /healthz endpoint actively validates downstream dependencies — can we connect to the database, is the cache reachable, are critical feature flags loaded. These checks are more expensive to run but catch a class of failures that HTTP-only checks miss: the application process is up, the port responds, but the database connection pool is exhausted and every request will fail.
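
The gap between a TCP check and an HTTP check can be reproduced locally. The sketch below starts a stand-in for a hung application (a socket that accepts connections and never replies) and probes it both ways. The function names, the `/health` path, and the 300 ms timeout are all illustrative assumptions.

```javascript
const net = require('node:net');
const http = require('node:http');

// Resolves true if a TCP connection opens within timeoutMs.
function tcpProbe(host, port, timeoutMs) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (ok) => { socket.destroy(); resolve(ok); };
    socket.setTimeout(timeoutMs, () => finish(false));
    socket.once('connect', () => finish(true));
    socket.once('error', () => finish(false));
  });
}

// Resolves true only if the endpoint answers with a 2xx status within timeoutMs.
function httpProbe(host, port, path, timeoutMs) {
  return new Promise((resolve) => {
    const req = http.get({ host, port, path, agent: false, timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body so the socket can close
      resolve(res.statusCode >= 200 && res.statusCode < 300);
    });
    req.once('timeout', () => { resolve(false); req.destroy(); });
    req.on('error', () => resolve(false));
  });
}

// Stand-in for a hung app: the kernel accepts connections, nothing ever replies.
function startStalledServer() {
  return new Promise((resolve) => {
    const sockets = new Set();
    const server = net.createServer((socket) => {
      sockets.add(socket);
      socket.on('error', () => {});
      socket.on('close', () => sockets.delete(socket));
    });
    server.listen(0, '127.0.0.1', () => resolve({ server, sockets }));
  });
}

// Probe the stalled server both ways, then clean up.
async function demo() {
  const { server, sockets } = await startStalledServer();
  const { port } = server.address();
  const tcp = await tcpProbe('127.0.0.1', port, 300);
  const httpOk = await httpProbe('127.0.0.1', port, '/health', 300);
  for (const socket of sockets) socket.destroy();
  server.close();
  return { tcp, http: httpOk };
}
```

Calling `demo()` should resolve to `{ tcp: true, http: false }`: the TCP probe reports a healthy server while the HTTP probe correctly times out.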

The health check configuration details matter as much as the type. Check interval (how often), timeout (how long to wait for a response), unhealthy threshold (how many consecutive failures before removal), and healthy threshold (how many consecutive successes before re-addition) all interact. Misconfigure any of these and you get either flapping — servers rapidly cycling in and out of rotation — or a slow response to actual failures.
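
The interaction between the two thresholds is easiest to see as a small state machine. This is a sketch under assumed defaults (3 failures to remove, 2 successes to re-add); the `HealthTracker` class is hypothetical and not taken from any particular load balancer.

```javascript
class HealthTracker {
  constructor({ unhealthyThreshold = 3, healthyThreshold = 2 } = {}) {
    this.unhealthyThreshold = unhealthyThreshold;
    this.healthyThreshold = healthyThreshold;
    this.healthy = true;
    this.streak = 0; // consecutive probe results that contradict the current state
  }

  // Feed in one probe result; returns whether the server is in rotation afterwards.
  record(probePassed) {
    if (probePassed === this.healthy) {
      // Result agrees with the current state: any contradicting streak is broken.
      this.streak = 0;
      return this.healthy;
    }
    this.streak += 1;
    const needed = this.healthy ? this.unhealthyThreshold : this.healthyThreshold;
    if (this.streak >= needed) {
      this.healthy = !this.healthy;
      this.streak = 0;
    }
    return this.healthy;
  }
}
```

With these settings a single failed probe never removes a server, which is what prevents flapping. The price is detection time: a server that dies is only removed after roughly the check interval multiplied by the unhealthy threshold, for example about 15 seconds with a 5-second interval.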

io/thecodeforge/health/healthcheck.js · JAVASCRIPT
const express = require('express');
const { createClient } = require('redis');
const { Pool } = require('pg');

const app = express();

// Initialize dependencies
const redisClient = createClient({ url: process.env.REDIS_URL });
const pgPool = new Pool({ connectionString: process.env.DATABASE_URL });

redisClient.connect().catch(console.error);

/**
 * Shallow health check — fast, for high-frequency LB probing.
 * Returns 200 if the process is alive. Does not check dependencies.
 * Use this for the LB's frequent interval check (every 5s).
 */
app.get('/health/live', (req, res) => {
  res.status(200).json({
    status: 'alive',
    pid: process.pid,
    uptime: process.uptime(),
  });
});

/**
 * Deep readiness check — validates all dependencies.
 * Returns 200 only when the application can actually serve traffic.
 * Use this as the LB's readiness gate during startup and deployment.
 * Check interval should be longer (every 10–15s) due to dependency I/O.
 */
app.get('/health/ready', async (req, res) => {
  const checks = {};
  let allHealthy = true;

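  // Check database connectivity.
  // pool.query() acquires and releases the client itself, so a failed query cannot leak it.
  try {
    await pgPool.query('SELECT 1');
    checks.database = { status: 'healthy' };
  } catch (err) {
    checks.database = { status: 'unhealthy', error: err.message };
    allHealthy = false;
  }

  // Check cache connectivity
  try {
    await redisClient.ping();
    checks.redis = { status: 'healthy' };
  } catch (err) {
    checks.redis = { status: 'unhealthy', error: err.message };
    allHealthy = false;
  }

  // Check heap pressure against the V8 heap limit.
  // The 90% threshold is an illustrative assumption; tune it for your service.
  const { used_heap_size, heap_size_limit } = require('node:v8').getHeapStatistics();
  const heapUsedPercent = (used_heap_size / heap_size_limit) * 100;
  const memoryHealthy = heapUsedPercent < 90;
  checks.memory = {
    status: memoryHealthy ? 'healthy' : 'unhealthy',
    heapUsedPercent: `${heapUsedPercent.toFixed(1)}%`,
  };
  if (!memoryHealthy) allHealthy = false;

  // 503 tells the load balancer to keep this instance out of rotation
  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'ready' : 'not_ready',
    checks,
    pid: process.pid,
  });
});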
Output
// GET /health/live → 200
// { "status": "alive", "pid": 12801, "uptime": 347.2 }

// GET /health/ready (all healthy) → 200
// {
//   "status": "ready",
//   "checks": {
//     "database": { "status": "healthy" },
//     "redis": { "status": "healthy" },
//     "memory": { "status": "healthy", "heapUsedPercent": "42.1%" }
//   },
//   "pid": 12801
// }

// GET /health/ready (database down) → 503
// {
//   "status": "not_ready",
//   "checks": {
//     "database": { "status": "unhealthy", "error": "connect ECONNREFUSED" },
//     "redis": { "status": "healthy" },
//     "memory": { "status": "healthy", "heapUsedPercent": "41.8%" }
//   }
// }

Global Load Balancing (GSLB) — Routing Across Data Centers and Regions

Global Server Load Balancing (GSLB) extends load balancing beyond a single data center to distribute traffic across multiple geographic regions or cloud availability zones. At this scale, load balancing decisions are based on factors like proximity, latency, and data center health, often using DNS as the control plane rather than packet forwarding.

GSLB operates differently from local load balancing. Instead of inspecting individual packets, it manipulates DNS responses: when a client requests the IP address for your service (e.g., api.thecodeforge.io), the GSLB-enabled DNS server returns the IP of the nearest or healthiest data center. This happens at the DNS resolution step, before any TCP connection is established. The client then connects directly to that data center's front-end load balancer.

The diagram below shows the typical GSLB architecture. DNS resolution is the first routing decision point; it determines which regional cluster the client will hit. Within each cluster, standard Layer 4 and Layer 7 load balancers handle the traffic. If a data center goes down, the GSLB controller removes its IPs from DNS responses, and clients eventually (after TTL expiry) resolve to a healthy region.
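
The DNS-side decision can be sketched in a few lines. The region table, the `rttMs` field (latency as seen from the client's vantage point), and both function names are hypothetical; the addresses come from the documentation-only IP ranges.

```javascript
// Choose the healthy region with the lowest latency for this client.
function resolveRegion(regions) {
  const healthy = regions.filter((region) => region.healthy);
  if (healthy.length === 0) throw new Error('no healthy region');
  return healthy.reduce((best, region) => (region.rttMs < best.rttMs ? region : best));
}

// Build the DNS answer: one address plus the TTL that bounds how long it is cached.
function answerDnsQuery(regions, ttlSeconds) {
  const region = resolveRegion(regions);
  return { address: region.address, ttl: ttlSeconds };
}

const regions = [
  { name: 'us-east', address: '192.0.2.10', rttMs: 20, healthy: true },
  { name: 'eu-west', address: '198.51.100.10', rttMs: 90, healthy: true },
];

const normal = answerDnsQuery(regions, 60);   // points at us-east (192.0.2.10)
regions[0].healthy = false;                   // us-east fails its health probes
const failover = answerDnsQuery(regions, 60); // now points at eu-west (198.51.100.10)
```

Nothing in this path touches an existing connection. Clients that already resolved the old address keep using it until their cached answer expires, which is why the TTL matters so much.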

Production Insight
GSLB is deceptively simple in theory but has two critical gotchas in practice. First, DNS TTL: even with a short TTL (e.g., 30 seconds) set for fast failover, some clients and recursive resolvers cache answers longer than the TTL and keep sending traffic to old IPs. And because one recursive resolver answers for many clients, a single cached response can steer a large block of users to the same region at once. Second, DNS-based GSLB has no visibility into TCP-level health: it relies on health probes from each data center's controller, which can miss intermittent failures. Always pair GSLB with health check-driven removal at the local LB level, and consider using anycast routing (as Cloudflare does) for sub-10-second failover.
Key Takeaway
GSLB is the first routing decision in a multi-region architecture. It uses DNS to direct clients to the nearest healthy data center. The TTL of your DNS records directly controls failover speed: too short and you multiply the query load on your authoritative servers; too long and failover takes minutes. A common production compromise is TTL=60 seconds with a pre-warming strategy for planned failovers.
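
The TTL trade-off can be put into rough numbers. The estimate below assumes that resolvers honor the TTL and that the GSLB controller uses consecutive-failure health probes; the probe interval and threshold are example values, not recommendations.

```javascript
// Rough upper bound on how long clients may keep hitting a dead region.
function worstCaseFailoverSeconds({ probeIntervalSeconds, unhealthyThreshold, ttlSeconds }) {
  // Time for the controller to declare the region down...
  const detectionSeconds = probeIntervalSeconds * unhealthyThreshold;
  // ...plus the longest a well-behaved resolver may keep serving the old answer.
  return detectionSeconds + ttlSeconds;
}

const withShortTtl = worstCaseFailoverSeconds({
  probeIntervalSeconds: 10,
  unhealthyThreshold: 3,
  ttlSeconds: 60,
}); // 90 seconds

const withLongTtl = worstCaseFailoverSeconds({
  probeIntervalSeconds: 10,
  unhealthyThreshold: 3,
  ttlSeconds: 300,
}); // 330 seconds
```

Resolvers that ignore the TTL push the real figure higher, so treat the result as a planning estimate rather than a guarantee.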

Load Balancer Deployment Models — Hardware, Software, and Cloud

Load balancers come in three fundamental deployment forms: dedicated hardware appliances, software running on commodity servers, and cloud-managed services. The choice between them affects your upfront cost, operational complexity, scalability ceiling, and failure domain. The table below compares the three models across the dimensions that matter in production.

io/thecodeforge/loadbalancer/deployment_models.md · MARKDOWN
| Dimension | Hardware (e.g., F5 BIG-IP) | Software (e.g., HAProxy, NGINX) | Cloud Managed (e.g., AWS ALB, GCP LB) |
|---|---|---|---|
| **Deployment Model** | Dedicated appliance in the data center | Installed on a VM or bare-metal server | API-provisioned, fully managed by cloud provider |
| **Performance Ceiling** | Very high, custom ASICs for SSL and packet processing | Limited by the host's CPU and NIC (but can be scaled horizontally) | Scales automatically within region; no manual capacity planning |
| **Configuration Flexibility** | Moderate; vendor-specific UI/config language | Extremely flexible; config files can be version-controlled, templated (Ansible, etc.) | Good; limited to cloud provider's feature set; no access to kernel tuning |
| **TLS Acceleration** | Hardware offload for RSA/ECC, high throughput (hundreds of thousands of handshakes/sec) | CPU-bound; can use kernel TLS (kTLS) or hardware acceleration on instance types with Intel QAT | Built-in, often using hardware behind the scenes; simple to enable |
| **Operational Overhead** | High: firmware upgrades, redundant appliances, vendor lock-in | Medium: OS patching, high availability setup (Keepalived, VRRP), monitoring | Low: no OS to manage; health checks and scaling are automated |
| **Cost Model** | High upfront CAPEX + annual maintenance license | Upfront cost = server + software (or open source free); OPEX for operations | Pay-per-use (per hour or per LB-capacity-unit); no upfront cost |
| **Failure Domain** | The appliance itself is a single point of failure; need active/passive pair | Two software instances with VIP failover; same failure domain as the host | Cloud provider guarantees high availability across availability zones |
| **Best For** | Large enterprises with existing data center footprint, compliance requirements, or need for hardware crypto | Teams that want full control over load balancer behavior, need custom health checks, or run on-premise | Teams building on cloud that want to minimize operational overhead and scale without capacity planning |
Production Insight
The choice between these models often comes down to operational maturity, not just performance. Cloud-managed LBs are spectacularly easy to set up but tie you to the provider's feature set and pricing model. Software LBs give you maximum control and portability — you can run the same HAProxy config on bare metal, in a container, or in the cloud. Hardware LBs are increasingly rare outside regulated industries because their cost and complexity rarely justify the performance difference. In most cases, a well-tuned software LB running on modern hardware with kTLS can match hardware performance at a fraction of the cost.
Key Takeaway
Hardware LBs are legacy technology for environments that require physical separation or hardware crypto. Software LBs like HAProxy and NGINX provide the best balance of performance, flexibility, and cost for most teams. Cloud-managed LBs are the right choice when you want to outsource operations entirely and don't need custom L7 logic beyond what the cloud provider offers.
Production incident · Post-mortem · Severity: high

The Health Check Black Hole: 40% of Requests Vanish Into Healthy-Looking Dead Servers

Symptom
Monitoring shows 40% of HTTP requests returning 503 errors despite every server reporting green in the load balancer dashboard. CPU on the affected servers is near zero — they're not processing anything. Application logs show no incoming requests at all, which rules out application-level errors. The LB logs show connections being established and immediately dropped on the backend side.
Assumption
The load balancer dashboard is green for all backends, so the engineering team assumes the problem must be downstream — maybe the database is down, or a dependent microservice is timing out. Two engineers spend 25 minutes digging through database connection pool metrics before someone thinks to curl a backend server directly.
Root cause
The health check was configured as a simple TCP connect probe on port 8080. Three things were happening simultaneously after the deployment. First, servers whose application had failed internally while the listening socket stayed open passed the TCP check: the kernel accepted the connection, but the application wasn't there to handle it. Second, servers in a long JVM warmup phase also passed: they accepted the TCP connection but could not yet serve HTTP requests. Third, two servers had entered stop-the-world GC pauses that lasted longer than the health check interval. Because the kernel completes the TCP handshake even while the JVM is paused, the probes kept passing while the servers were unresponsive. The LB had no visibility into any of this because it was only checking 'can I open a socket.'
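The gap between the two probe types can be reproduced locally. The sketch below is an illustration, not the incident's actual tooling: `hung_server`, `tcp_probe`, and `http_probe` are hypothetical names, and the "hung application" is simulated by a socket that listens but never calls `accept()`.

```python
import socket

def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection can be opened (a connect-only check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def http_probe(host, port, path="/healthz", timeout=1.0):
    """Return True only if the server answers with an HTTP 200 status line."""
    request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(request.encode("ascii"))
            status_line = conn.recv(1024).split(b"\r\n", 1)[0]
    except OSError:  # includes timeouts: nobody answered in time
        return False
    parts = status_line.split()
    return len(parts) >= 2 and parts[1] == b"200"

def hung_server():
    """Simulate a hung app: the kernel listens, but nothing ever calls accept()."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(8)
    return listener

if __name__ == "__main__":
    listener = hung_server()
    host, port = listener.getsockname()
    print("tcp probe :", tcp_probe(host, port))
    print("http probe:", http_probe(host, port, timeout=0.5))
    listener.close()
```

On a typical Linux host the TCP probe should succeed, because the kernel completes the handshake from the listen backlog, while the HTTP probe should time out and report failure.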
Fix
1. Replace all TCP health checks with HTTP GET /healthz endpoints that validate database connectivity, cache reachability, and the readiness of critical dependencies, not just that the process is running.
2. Add a readiness gate: the /healthz endpoint must return 200 only after the application has fully initialized, completed warmup, and successfully connected to its dependencies. During startup, return 503.
3. Configure consecutive failure thresholds: 3 consecutive failures before marking a server unhealthy, 2 consecutive successes before returning it to rotation. This prevents flapping during transient network hiccups.
4. Implement connection draining: when a server is removed from rotation, wait for in-flight requests to complete (up to a configurable drain timeout, typically 30 seconds) before cutting the connection. Abrupt removal mid-request is a guaranteed user-facing error.
5. Add health check transition alerting: alert when a server's health state changes, not just when it's unhealthy. Frequent transitions are a signal of instability that steady-state monitoring won't catch.
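A readiness-gated /healthz endpoint along the lines of the first two fixes might look like the following. This is a minimal sketch using only the Python standard library; the `Readiness` class, the check names, and the `serve` helper are assumptions for illustration, and real dependency checks (a database ping, a cache round trip) would replace the placeholder callables.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
import json
import threading

class Readiness:
    """Tracks warmup state plus named dependency checks (all hypothetical)."""
    def __init__(self):
        self.warmed_up = threading.Event()  # set once startup and warmup finish
        self.checks = {}  # name -> zero-argument callable, True when healthy

    def evaluate(self):
        results = {"warmup": self.warmed_up.is_set()}
        for name, check in self.checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False  # a check that raises counts as failed
        return all(results.values()), results

def make_handler(readiness):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/healthz":
                self.send_error(404)
                return
            ready, results = readiness.evaluate()
            body = json.dumps(results).encode("utf-8")
            self.send_response(200 if ready else 503)  # 503 until fully ready
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep probe traffic out of stderr

    return HealthHandler

def serve(readiness, host="127.0.0.1", port=8080):
    """Blocking helper: serve /healthz until interrupted."""
    ThreadingHTTPServer((host, port), make_handler(readiness)).serve_forever()
```

The application would register checks such as `readiness.checks["database"] = ping_db` (a hypothetical function) and call `readiness.warmed_up.set()` only after warmup completes, so the LB sees 503 for the whole startup window.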
Key lesson
  • A TCP port accepting connections does not mean the application behind it is ready or capable of serving traffic. These are completely different things.
  • Health checks must validate end-to-end application readiness — database connected, cache reachable, dependencies healthy — not just socket availability.
  • JVM warmup and GC pauses are real, predictable events. Your health check design must account for them or they'll cause exactly this kind of incident.
  • Always configure connection draining — abruptly cutting traffic to a server mid-request causes user-facing errors that are entirely preventable.
  • Monitor health check state transitions, not just current state. A server that flips between healthy and unhealthy 20 times per hour is a problem your dashboard's green dot will never show you.
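The threshold and transition-monitoring lessons above can be sketched as a small state machine. The class and parameter names (`BackendHealth`, `fall`, `rise`, `flapping`) are illustrative assumptions; the 3-failure and 2-success defaults come from the fix described earlier, while the flapping window and limit are arbitrary.

```python
from dataclasses import dataclass, field

@dataclass
class BackendHealth:
    """Health state with consecutive-failure and consecutive-success thresholds."""
    fall: int = 3            # consecutive failures before marking unhealthy
    rise: int = 2            # consecutive successes before returning to rotation
    healthy: bool = True
    streak: int = 0          # consecutive results that disagree with current state
    transitions: list = field(default_factory=list)  # timestamps of state changes

    def record(self, probe_ok: bool, now: float) -> bool:
        """Feed one probe result; return True if the health state changed."""
        if probe_ok == self.healthy:
            self.streak = 0  # result agrees with current state, reset the streak
            return False
        self.streak += 1
        needed = self.rise if probe_ok else self.fall
        if self.streak < needed:
            return False
        self.healthy = probe_ok
        self.streak = 0
        self.transitions.append(now)
        return True

    def flapping(self, now: float, window: float = 3600.0, limit: int = 6) -> bool:
        """True if more than `limit` transitions occurred in the last `window` seconds."""
        recent = [t for t in self.transitions if now - t <= window]
        return len(recent) > limit
```

An alerting hook would fire whenever `record` returns True, and a separate alert on `flapping` would catch servers that oscillate even though they look green at any given instant.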
Production debug guide: symptom → action mapping for common LB failures
Symptom · 01
Traffic black hole — requests return 503 despite servers appearing healthy in the dashboard
Fix
Bypass the load balancer completely and curl the backend servers directly on their health endpoint. If the direct request succeeds and the LB-routed request fails, you have a health check misconfiguration — the LB is either checking the wrong endpoint, using TCP instead of HTTP, or the timeout is too short for your application's response time. Switch to HTTP-level health checks that validate actual application readiness including database connectivity.
Symptom · 02
Uneven load distribution — one server at 95% CPU while others idle at 10%
Fix
Check for two common culprits: sticky session misconfiguration pinning a disproportionate share of users to one server, and long-lived connections (WebSockets, streaming, gRPC) accumulating on whichever server happened to get them first. Inspect current connection counts per backend using ss or netstat. If using IP Hash, verify whether all traffic originates from a single NAT gateway — if so, every request hashes to the same server. Switch to Least Connections for long-lived connection workloads.
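The NAT effect on IP Hash is easy to see in a simulation. This is a simplified sketch: it hashes the full source address with SHA-256, which is an assumption for illustration rather than any particular load balancer's hashing scheme, and the backend names and addresses are made up.

```python
import hashlib
from collections import Counter

def ip_hash(client_ip, backends):
    """Map a source IP to a backend deterministically."""
    digest = hashlib.sha256(client_ip.encode("ascii")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

def distribution(client_ips, backends):
    """Count how many requests each backend would receive."""
    return Counter(ip_hash(ip, backends) for ip in client_ips)

backends = ["app-1", "app-2", "app-3", "app-4"]  # hypothetical backend names

# 1000 clients with distinct source addresses
direct = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
# the same 1000 clients seen through one NAT gateway address
behind_nat = ["203.0.113.7"] * 1000

print("distinct IPs:", dict(distribution(direct, backends)))
print("behind NAT  :", dict(distribution(behind_nat, backends)))
```

With distinct source addresses the requests spread across all four backends; behind a single NAT address, every request lands on one backend regardless of how many servers are in the pool.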
Symptom · 03
Intermittent SSL handshake failures at the load balancer
Fix
Test the TLS handshake directly against the LB with openssl s_client. Check certificate expiration and chain completeness — an intermediate certificate missing from the chain causes failures in some clients but not others, making it intermittent. Verify TLS version and cipher suite compatibility between LB and backends if you're doing TLS re-wrapping. Check whether failures correlate with specific client versions or spike during high traffic, which would point to CPU exhaustion on the LB.
Symptom · 04
Connection pool exhaustion under moderate load
Fix
Inspect keepalive settings between the LB and backends. Without connection reuse, every request opens a new TCP connection, which is expensive under load. Ensure keepalive is enabled and configured correctly: keepalive 32 in an NGINX upstream block keeps up to 32 idle connections to that upstream group cached per worker process, and it only takes effect for HTTP when proxy_http_version is set to 1.1 and the Connection header is cleared on proxied requests. Check for connection leaks in application code, meaning connections that are opened but not properly returned to the pool. Monitor the LB's active vs idle connection counts over time.
Symptom · 05
Backend servers healthy but response times spiking significantly
Fix
Health checks confirm reachability, not performance. A server can be healthy and slow simultaneously. Check whether the LB algorithm accounts for response time — if using Round Robin, a slow server gets the same traffic as a fast one. Consider switching to Least Response Time or Least Connections. Also check whether connection draining from a recent deployment is causing a traffic imbalance as some backends handle both old and new connections.
Load Balancing Quick Debug Cheat Sheet
Immediate diagnostic commands when load balancing breaks in production.
Upstream servers showing as down in LB logs
Immediate action
Verify the health endpoint directly, completely bypassing the load balancer. This tells you immediately whether the problem is the server or the LB's health check configuration.
Commands
curl -v http://<server-ip>:8080/healthz
kubectl get pods -l app=backend -o wide
Fix now
If the health endpoint fails when curled directly, restart the pod or investigate the application startup logs. If it succeeds directly but the LB marks it unhealthy, check the LB's health check timeout — it may be shorter than your application's response time for /healthz. Also confirm the LB is checking the right port and path.
One server receiving disproportionate traffic
Immediate action
Check actual connection distribution across upstream servers right now, not what the algorithm says should happen in theory.
Commands
nginx -T 2>&1 | grep -A5 'upstream'
ss -tnp | grep :8080 | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn
Fix now
If using IP Hash and all traffic originates from a single corporate NAT gateway or proxy, every request will hash to the same backend. Switch to Least Connections. If using sticky sessions, verify that a single session cookie value isn't being shared across users — this happens with misconfigured session middleware and pins all those users to one server.
SSL errors appearing intermittently at the load balancer
Immediate action
Test the TLS handshake directly against the LB to see the full certificate chain and negotiated cipher.
Commands
openssl s_client -connect <lb-host>:443 -tls1_2
openssl x509 -in /etc/ssl/cert.pem -noout -dates
Fix now
If the certificate is expired or expiring within 7 days, renew immediately. When failures are intermittent rather than total, check whether only some LB nodes or listeners are still serving the old certificate, or whether an incomplete chain is breaking some clients but not others. If the cipher suite doesn't match what backends expect during TLS re-wrap, update ssl_ciphers to include the required suites. Check if errors correlate with specific client versions: TLS 1.0/1.1 clients may be hitting a policy block.
Requests timing out but servers appear healthy
Immediate action
Check whether the timeout is happening at the LB or at the backend, and how far into the request lifecycle it occurs.
Commands
curl -w '@curl-format.txt' -o /dev/null -s http://<lb-host>/api/test
tail -f /var/log/nginx/access.log | grep -E ' [1-9][0-9]*\.[0-9]{3}( |$)'  # requests taking 1s or longer; assumes $request_time is part of your log_format
Fix now
If TTFB is consistently near your proxy_read_timeout value, the backend is taking too long and the LB is cutting the connection. Either increase the timeout for that route specifically, or investigate why the backend is slow. If TTFB is fast but total time is high, the backend is sending a large response slowly — check backend connection limits and network throughput.
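If a curl format file isn't at hand, the same TTFB versus total-time split can be measured with a few lines of Python. This is a rough sketch using the standard library's `http.client`; the function name `time_request` is an assumption, and the TTFB figure here includes connection setup because the clock starts before the connection is opened.

```python
import http.client
import time

def time_request(host, port, path="/", timeout=10.0):
    """Return (status, ttfb_seconds, total_seconds) for one plain-HTTP GET."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        start = time.perf_counter()
        conn.request("GET", path)          # connects, then sends the request
        response = conn.getresponse()      # returns once status line and headers arrive
        ttfb = time.perf_counter() - start
        response.read()                    # drain the body
        total = time.perf_counter() - start
        return response.status, ttfb, total
    finally:
        conn.close()
```

Running it once against the LB and once directly against a backend (for example `time_request("<lb-host>", 80, "/api/test")`) shows which hop contributes the delay: a high TTFB points at backend processing or the LB queue, while a low TTFB with a high total points at slow body transfer.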
Load Balancing Algorithm Comparison
| Algorithm | Strategy | Best For | Watch Out For |
|---|---|---|---|
| Round Robin | Sequential distribution: each server gets the next request in rotation | Clusters with identical server specs and uniform request processing time | Falls apart when requests have variable processing times. A slow request on Server 1 doesn't reduce what it receives next. |
| Weighted Round Robin | Round robin with proportional traffic share based on configured weight | Mixed hardware environments: legacy vs new instances, different instance types | Static weights drift over time. A server's effective capacity changes with memory pressure and GC history. Weights must be updated dynamically. |
| Least Connections | Routes to whichever server currently has the fewest active connections | Long-lived requests: streaming, heavy database queries, WebSocket connections | Connection count doesn't equal load. A server with 5 slow connections may be more loaded than one with 20 fast ones. Still better than Round Robin for most workloads. |
| Least Response Time | Routes to the server with the lowest current TTFB (Time to First Byte) | Latency-sensitive workloads where response time variance between servers matters | Requires the LB to actively probe or measure backend latency, which adds overhead. Can cause traffic oscillation if server speeds fluctuate rapidly. |
| IP Hash | Routes each client to a consistent backend based on a hash of their source IP | Stateful applications that need session affinity without cookie-based sticky sessions | All traffic behind a corporate NAT or proxy hashes to the same backend. Adding or removing servers changes the hash distribution and breaks affinity. |
| Random with Two Choices | Pick two servers at random, route to whichever has fewer connections | Very large server pools where maintaining full state is expensive | Less predictable distribution than Least Connections. Better than pure random but doesn't match Least Connections for accuracy. |
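Three of the strategies in the comparison can be written in a few lines each, which makes the differences concrete. The `Pool` class below is a toy model under stated assumptions: backend names are hypothetical, connection counts are tracked in a plain dict, and nothing here reflects how a specific load balancer implements these algorithms internally.

```python
import itertools
import random

class Pool:
    """Tracks active connection counts per backend and picks one per strategy."""
    def __init__(self, backends):
        self.active = {name: 0 for name in backends}  # name -> open connections
        self._rotation = itertools.cycle(backends)

    def round_robin(self):
        """Next backend in rotation, ignoring current load."""
        return next(self._rotation)

    def least_connections(self):
        """Backend with the fewest active connections right now."""
        return min(self.active, key=self.active.get)

    def two_choices(self, rng=random):
        """Sample two distinct backends, keep the one with fewer connections."""
        first, second = rng.sample(list(self.active), 2)
        return first if self.active[first] <= self.active[second] else second

pool = Pool(["app-1", "app-2", "app-3"])
pool.active.update({"app-1": 40, "app-2": 3, "app-3": 12})  # app-1 holds long-lived connections

print("round robin      :", [pool.round_robin() for _ in range(3)])
print("least connections:", pool.least_connections())
print("two choices      :", pool.two_choices())
```

Round Robin keeps sending every third request to the overloaded app-1, Least Connections avoids it entirely, and Two Choices never picks the most loaded backend while only inspecting two counters per decision.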

Key takeaways

1. Load balancing is the primary mechanism for horizontal scalability: it's what lets you add servers instead of just making one server bigger. But it only works correctly if health checks are accurate, algorithms match the workload, and state is externalized.
2. Layer 4 is faster and simpler. Layer 7 is smarter and more flexible. Most production systems at scale use both: Layer 4 at the network edge for raw throughput, Layer 7 internally for content-aware microservice routing.
3. Health checks are the foundation everything else depends on. TCP-only checks are inadequate: they validate process existence, not application readiness. Deep HTTP health checks that validate dependencies are the minimum acceptable standard for production.
4. Least Connections is the safest default algorithm for modern web applications with variable request processing times. Round Robin's implicit assumption, that all requests take roughly the same time, is almost never true in practice.
5. Sticky sessions are a trap at scale. They defeat load balancing, create hotspots, and cause mass session loss when a pinned server dies. The correct answer is stateless application servers backed by Redis for session state.

Common mistakes to avoid


Using TCP-only health checks in production

Symptom
The load balancer dashboard shows all backends as healthy while 503 errors climb in your application monitoring. Servers that have crashed internally but whose OS still holds the socket open pass the TCP check. Servers in JVM warmup or GC pause accept the TCP connection but can't process HTTP requests. Users see failures that your LB has no visibility into.
Fix
Replace TCP health checks with HTTP health checks on a /healthz or /health/ready endpoint that validates actual application readiness — database connected, cache reachable, dependencies healthy. Return 200 only when the server can genuinely handle a request. Configure 3 consecutive failures before removing from rotation and 2 consecutive successes before re-adding. Never go back to TCP-only checks.

Ignoring SSL termination overhead until it becomes a crisis

Symptom
The load balancer CPU spikes to 100% under TLS-heavy traffic while backend servers sit idle at 15%. TLS handshake latency increases dramatically during traffic bursts. The LB becomes the bottleneck, and increasing backend capacity does nothing to fix it because the constraint is at the LB layer.
Fix
Size the LB instance for the computational cost of TLS termination. This is frequently under-provisioned because it's invisible until load hits. Use TLS 1.3, which requires fewer round trips than 1.2. Enable TLS session resumption via session tickets to avoid full handshakes for returning clients. For AWS deployments, terminating TLS on an NLB TLS listener moves the handshake cost onto the managed service instead of a load balancer instance you have to size yourself.

Hardcoding server IPs in the upstream configuration

Symptom
A server is replaced during a scaling event or a failed instance is rebuilt with a new IP. The LB still routes traffic to the old IP — health checks eventually mark it unhealthy, but the new instance at the new IP receives zero traffic. Scaling events require manual config changes and a reload. In a cloud environment where IPs change constantly, this becomes a persistent operational burden.
Fix
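Use DNS-based service discovery to populate the LB's server pool dynamically. Kubernetes Services handle this automatically. For non-Kubernetes environments, Consul or similar service registries provide DNS records that update as instances come and go. Configure the LB to re-resolve DNS names rather than cache IPs indefinitely. In NGINX, a hostname in an upstream block is resolved when the configuration is loaded, so runtime re-resolution has to be set up explicitly; the resolver directive's valid parameter controls how long answers are cached, overriding the record's TTL.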
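The re-resolution idea can be sketched independently of any particular load balancer. In the snippet below, `DynamicPool` and its `refresh` method are hypothetical names; the default resolver uses the standard library's `socket.getaddrinfo`, and a real deployment would call `refresh` on a timer and feed the added and removed sets into the LB's pool.

```python
import socket

def resolve(hostname, port):
    """Return the set of IP addresses the name currently resolves to."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}

class DynamicPool:
    """Backend membership driven by periodic DNS lookups instead of fixed IPs."""
    def __init__(self, hostname, port, resolver=resolve):
        self.hostname = hostname
        self.port = port
        self.resolver = resolver  # injectable so it can be faked in tests
        self.members = set()

    def refresh(self):
        """Re-resolve the name; return (added, removed) and update membership."""
        try:
            current = set(self.resolver(self.hostname, self.port))
        except OSError:
            return set(), set()  # keep the last known members if the lookup fails
        added, removed = current - self.members, self.members - current
        self.members = current
        return added, removed
```

Keeping the last known members on a failed lookup is a deliberate choice in this sketch: a brief DNS outage should not empty the pool.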

Over-reliance on sticky sessions as a substitute for stateless application design

Symptom
One server is at 95% CPU while others idle at 10% — a subset of high-traffic users are all pinned to the same backend. When that server dies, every user pinned to it gets logged out simultaneously. Scaling the cluster doesn't help because new servers receive no traffic from existing sessions. Deployments require careful session migration planning.
Fix
Move session state to Redis and make application servers stateless. Stateless servers can handle any request from any user, scale freely, and be replaced without user impact. If sticky sessions are truly unavoidable for a legacy system, set a TTL on the session cookie to bound the maximum duration of affinity, and monitor per-server connection distribution actively.
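What "stateless" means in practice is that the session lives outside the server process. The sketch below uses an in-memory `SessionStore` class as a stand-in for an external store such as Redis; that class, `AppServer`, and the cart example are all illustrative assumptions rather than a real client API.

```python
import copy
import secrets

class SessionStore:
    """In-memory stand-in for an external store shared by every app server."""
    def __init__(self):
        self._data = {}

    def load(self, session_id):
        # deep copy so callers never mutate stored state by accident
        return copy.deepcopy(self._data.get(session_id, {}))

    def save(self, session_id, session):
        self._data[session_id] = copy.deepcopy(session)

class AppServer:
    """Keeps no per-user state of its own; everything lives in the shared store."""
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user):
        session_id = secrets.token_hex(16)
        self.store.save(session_id, {"user": user, "cart": []})
        return session_id

    def add_to_cart(self, session_id, item):
        session = self.store.load(session_id)
        session.setdefault("cart", []).append(item)
        self.store.save(session_id, session)
        return session["cart"]

store = SessionStore()
app1, app2 = AppServer("app-1", store), AppServer("app-2", store)
sid = app1.login("alice")
app1.add_to_cart(sid, "book")
print(app2.add_to_cart(sid, "pen"))  # a different server continues the same session
```

Because either server can pick up the session, the load balancer is free to route each request anywhere, and losing app-1 costs no user their cart.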

No connection draining on server removal

Symptom
During deployments or auto-scaling scale-in events, in-flight requests to servers being removed are abruptly terminated. Users experience random errors mid-request — form submissions that don't complete, API calls that return connection reset errors. The error rate spikes exactly when you're deploying, making it easy to blame the new code rather than the removal mechanics.
Fix
Configure connection draining on your LB: a grace period during which the server is removed from rotation for new requests but allowed to complete in-flight ones. AWS ALB calls this deregistration delay (default 300 seconds, often should be tuned lower to 30-60 seconds based on your request SLA). NGINX Plus provides a drain parameter on upstream servers; with open-source NGINX, a common approach is to mark the server down and reload, which lets the old worker processes finish their in-flight requests. Set the drain period to at least your p99 request duration, not the default.
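The draining behavior itself is a small piece of bookkeeping: stop admitting new requests, then wait for the in-flight count to reach zero or the timeout to expire. The `DrainingBackend` class below is a hypothetical model of that logic, not any load balancer's implementation.

```python
import threading
import time

class DrainingBackend:
    """Tracks in-flight requests so a backend can be removed gracefully."""
    def __init__(self, name):
        self.name = name
        self.accepting = True
        self._in_flight = 0
        self._cond = threading.Condition()

    def begin_request(self):
        """Return False if the backend is draining and must not get new work."""
        with self._cond:
            if not self.accepting:
                return False
            self._in_flight += 1
            return True

    def end_request(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()  # wake a waiting drain() call

    def drain(self, timeout=30.0):
        """Stop new requests, wait for in-flight ones; True if drained in time."""
        deadline = time.monotonic() + timeout
        with self._cond:
            self.accepting = False
            while self._in_flight > 0:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False  # timeout hit: remaining requests get cut
                self._cond.wait(remaining)
            return True
```

A deployment script would call `drain()` before stopping the process and only force termination when it returns False, which is where a drain timeout based on p99 request duration matters.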
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Design a system that handles 1 million concurrent users. Where do you pl...
Q02 · SENIOR
How does the 'Least Connections' algorithm differ from 'Least Response T...
Q03 · SENIOR
What is SSL Termination and why is it used at the Load Balancer level?
Q04 · JUNIOR
What happens if the health check fails, and how do you prevent a server ...
Q01 of 04 · SENIOR

Design a system that handles 1 million concurrent users. Where do you place the load balancers and what type at each tier?

ANSWER
This needs a multi-tier approach: no single load balancer handles everything at this scale.
Tier 1, DNS-level routing: Route 53 or equivalent with latency-based or geolocation routing directs users to the nearest regional data center. This handles continent-level distribution and provides automatic failover between regions. It is not a traditional LB but performs the same conceptual function at global scale.
Tier 2, Edge/Network LB (Layer 4): AWS NLB or GCP Network LB at the edge of each region. These handle raw TCP/UDP with very low per-connection overhead. Their job is to absorb the raw connection volume, terminate TLS if needed, and distribute traffic across the next tier. At 1M concurrent users, this tier needs to be sized for connection rate, not just bandwidth, because each new connection requires CPU for the TLS handshake.
Tier 3, Application LB (Layer 7): NGINX, HAProxy, or AWS ALB sitting behind the NLB. This is where content-aware routing happens: /api/payments routes to the payments service pool, /api/media to the media service pool. This tier also handles authentication offloading, canary deployments, and A/B testing. Use the Least Connections algorithm here.
Tier 4, Service mesh (east-west): Envoy sidecars or similar for service-to-service communication inside the cluster, providing circuit breaking, retry logic, and mTLS between microservices. This isn't usually what people think of as 'load balancing', but it is distributing traffic across service instances.
The critical design constraint at every tier: no single LB is a single point of failure. Use redundant instances with virtual IP failover (VRRP/Keepalived on-premise, the provider's managed redundancy on AWS/GCP) at each tier. The architecture must survive the failure of any single component without user impact.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01. What is the difference between Horizontal and Vertical Scaling?
02. Can a Load Balancer become a Single Point of Failure?
03. What happens if the health check fails?
04. When should I use IP Hash instead of session cookies for affinity?