Junior 11 min · March 05, 2026

Load Balancing Health Check Black Hole — 40% Vanish

Q: What is the difference between Horizontal and Vertical Scaling?

Vertical Scaling (scaling up) means adding more CPU, RAM, or faster storage to a single machine. It has a hard ceiling — you can only make one machine so large — and it typically requires downtime to resize. Horizontal Scaling (scaling out) means adding more machines to your server pool. Load balancers are what make horizontal scaling work: without one, you can't distribute traffic across multiple machines transparently. Horizontal scaling is the model that enables the kind of elastic capacity that cloud infrastructure is built around — add machines when load increases, remove them when it drops.

Q: Can a Load Balancer become a Single Point of Failure?

Yes, absolutely, and this is one of the first questions worth asking when evaluating any load balancing architecture. A single load balancer that goes down takes the entire service with it. The standard mitigation is an active-passive or active-active high availability pair. Two load balancers share a Virtual IP (VIP). If the primary fails, the secondary detects the failure via heartbeat and takes ownership of the VIP, so traffic continues flowing within seconds. On-premises implementations use VRRP (Virtual Router Redundancy Protocol) with tools like Keepalived. Cloud-managed load balancers (AWS ALB, GCP LB) handle HA internally, so you do not have to build this failover yourself.

Q: What happens if the health check fails?

The load balancer marks the server as unhealthy and stops routing new connections to it. If connection draining is configured, in-flight requests are allowed to complete up to the drain timeout — typically 30–60 seconds. The LB continues probing the unhealthy server at the configured interval. The server only returns to rotation after passing a configured number of consecutive successful health checks — typically 2 or 3 — to prevent flapping. A server that passes one check and fails the next should not cycle in and out of rotation on every check cycle. The consecutive-success threshold is what creates the hysteresis needed for stable behavior.
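
The threshold behaviour described above can be sketched as a small state tracker. This is an illustrative sketch only, not how any particular load balancer implements it: `createHealthTracker`, `unhealthyThreshold`, and `healthyThreshold` are hypothetical names, and the defaults (3 failures to remove, 2 successes to restore) are assumptions chosen to match the typical values mentioned in the answer.

```javascript
// Illustrative sketch: consecutive-threshold hysteresis for one backend.
// All names and default values are assumptions, not a real load balancer API.
function createHealthTracker({ unhealthyThreshold = 3, healthyThreshold = 2 } = {}) {
  let healthy = true;
  let streak = 0; // consecutive probe results that contradict the current state

  return {
    // Record one probe result and return the (possibly updated) state.
    record(probePassed) {
      if (probePassed === healthy) {
        streak = 0; // result agrees with the current state: reset the counter
        return healthy;
      }
      streak += 1;
      const needed = healthy ? unhealthyThreshold : healthyThreshold;
      if (streak >= needed) {
        healthy = !healthy; // flip only after enough consecutive contradictions
        streak = 0;
      }
      return healthy;
    },
    isHealthy: () => healthy,
  };
}
```

Because a single contradicting probe resets nothing but also flips nothing, a server that alternates pass and fail stays in whatever state it was already in, which is the anti-flapping property.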

Q: When should I use IP Hash instead of session cookies for affinity?

IP Hash is useful when you need session affinity but can't or won't modify the application to set a cookie, for example with third-party clients or binary protocols that don't support cookies. The significant limitations: if users are behind a NAT gateway or corporate proxy, all of them share the same source IP and all hash to the same backend, creating a severe hotspot. Adding or removing servers from the pool changes the hash distribution, so existing users get rerouted to different servers, breaking the affinity you were trying to maintain. For most applications, Redis-backed sessions with stateless servers are the right answer. Cookie-based affinity at the LB is a second option. IP Hash is a distant third, appropriate only when the first two genuinely aren't possible.

TCP health checks miss JVM GC pauses and warmup, causing 40% 503 errors.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.

Last updated: May 24, 2026

⚡Quick Answer

A load balancer distributes incoming traffic across multiple servers so no single machine becomes a bottleneck or single point of failure
Health checks are the heartbeat — they probe servers continuously and remove dead ones from rotation automatically; TCP-only checks are a trap
Algorithms (Round Robin, Least Connections, IP Hash, Weighted, Least Response Time) decide which server gets each request based on different signals
Layer 4 (TCP/UDP) is faster; Layer 7 (HTTP/HTTPS) is smarter — it can inspect cookies, headers, URL paths, and make content-aware routing decisions
Session persistence (sticky sessions) keeps users on one server but creates hotspots and causes mass session loss if that server dies
The biggest trap: skipping health checks or using TCP-only probes means your LB becomes a black hole, routing traffic into servers that can't respond
In production, no single load balancer handles everything — DNS, edge, Layer 7 gateway, and service mesh each own a different tier

✦ Definition~90s read

What is Load Balancing?

A load balancer is a traffic cop that sits between clients and a pool of backend servers, distributing incoming requests to prevent any single server from being overwhelmed. It exists to solve two fundamental problems: scalability (handling more traffic than one server can manage) and reliability (keeping the service running when servers fail).

Without load balancing, you'd hit the limits of vertical scaling — throwing more CPU and RAM at a single machine — which is expensive and has hard ceilings. Instead, load balancers enable horizontal scaling, letting you add or remove servers on the fly, and they're the backbone of every major web service you use: Netflix, Google, Amazon, all route traffic through them.

In practice, a load balancer doesn't just blindly forward packets — it makes decisions based on algorithms like round-robin, least connections, or IP hash, and it operates at different OSI layers. Layer 4 load balancers (like HAProxy in TCP mode or AWS Network Load Balancer) work at the transport layer, routing traffic based on IP and port without inspecting the payload.

Layer 7 load balancers (like NGINX, Envoy, or AWS Application Load Balancer) understand HTTP, HTTPS, and even gRPC, allowing them to route based on URL paths, headers, or cookies. The choice between them is a trade-off: Layer 4 is faster and simpler, Layer 7 gives you fine-grained control but adds latency and overhead.

The critical component that makes load balancing actually work in production is the health check — without it, you're just guessing. A health check is a periodic probe (HTTP GET, TCP connect, or ICMP ping) that the load balancer sends to each backend to verify it's alive and ready to serve traffic.

When a health check fails, the load balancer marks that server as 'down' and stops sending traffic to it. The 'black hole' problem emerges when the check keeps passing on a server that cannot actually serve. If your health check is too lenient (e.g., only checking that TCP port 80 accepts connections, without verifying the app responds), the load balancer will keep sending requests to a server that's returning 500s or stuck in a deadlock, and those requests effectively vanish.

Conversely, overly aggressive health checks can cause cascading failures by taking servers out of rotation prematurely. The worst case, and the source of the '40% vanish' problem, is when misconfigured health checks silently drop traffic without alerting anyone, turning your load balancer into a request shredder.

Plain-English First

Imagine a busy McDonald's with 6 cashiers. A greeter at the door watches all the lines and sends you to whichever cashier is least busy — not just the first one in rotation. If one cashier goes on break or calls in sick, the greeter stops sending people their way entirely. If one cashier is twice as fast as the others, the greeter sends them twice as many customers. That greeter IS the load balancer. The cashiers are your servers. The system that checks whether a cashier is available and actually working — not just standing at their register staring at a frozen screen — is the health check. And the strategy the greeter uses to pick a cashier — shortest queue, round-robin rotation, same cashier you had last time, or the fastest one right now — is the load balancing algorithm. Everything else in this article is just the details of how that greeter makes smarter decisions at internet scale.

Every time you tap 'Buy Now' on Amazon or start a video on Netflix, your request hits one of hundreds or thousands of servers — chosen in milliseconds by a load balancer you never see. Without it, modern internet-scale applications simply couldn't exist.

The core problem is deceptively simple: distribute work across many machines so no single machine becomes a bottleneck, a single point of failure, or a performance nightmare. Without load balancing, one server handles everything until it buckles under the weight. With it, traffic is spread intelligently, failed servers are automatically removed from the pool, and new capacity can be added without touching the rest of the system.

But 'load balancer' is not a single thing. It's a tier — sometimes multiple tiers — of components that each own a different slice of the problem. DNS-level routing decides which data center gets your request. A network load balancer handles the raw TCP connection at line rate. An application load balancer inspects your HTTP headers and routes you to the right microservice. A service mesh sidecar manages the connection between that microservice and the next one in the chain. Understanding where each layer sits, what decisions it can make, and what its failure modes look like is what separates engineers who can configure a load balancer from engineers who can design a system that stays up when things go wrong.

By the end of this article you'll understand what load balancers are, which components make them tick, when to use Round Robin vs Least Connections vs Least Response Time, why sticky sessions can be a trap at exactly the wrong moment, and how to answer the load balancing questions that trip people up in system design interviews at senior level.

How Load Balancers Actually Distribute Traffic — And Why Health Checks Fail

A load balancer sits between clients and servers, distributing incoming requests across a pool of backend instances. Its core mechanic is a scheduling algorithm — round-robin, least connections, or consistent hashing — that selects a healthy target for each request. Without health checks, the balancer is blind: it will route traffic to a crashed or degraded server, causing errors or timeouts.

In practice, a load balancer periodically probes each backend with a health check (e.g., HTTP GET /health, TCP port check). If a backend fails N consecutive checks, it is removed from the pool. The balancer then redistributes its share of traffic to the remaining instances. This is not instant: there is a window between failure and detection during which traffic still hits the dead node. The check interval and failure threshold together bound that detection window, which makes them the key tuning parameters for how long a dead node keeps receiving requests.
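
The window between failure and detection can be estimated with simple arithmetic. The sketch below is a rough worst-case estimate under stated assumptions: the function and parameter names are hypothetical, the probe timeout is added on top of interval × threshold as an assumption about how the last failing probe is counted, and traffic is assumed to be spread evenly across backends.

```javascript
// Rough worst-case estimate; all names here are illustrative assumptions.
function detectionWindowMs({ intervalMs, unhealthyThreshold, timeoutMs = 0 }) {
  // Probes are spaced intervalMs apart, and the final failing probe
  // may take up to timeoutMs before it is declared failed.
  return unhealthyThreshold * intervalMs + timeoutMs;
}

function requestsAtRisk({ windowMs, requestsPerSecond, backendCount }) {
  // Assumes requests are spread evenly across backends during the window.
  return Math.round((windowMs / 1000) * (requestsPerSecond / backendCount));
}

const windowMs = detectionWindowMs({ intervalMs: 5000, unhealthyThreshold: 3, timeoutMs: 2000 });
console.log(windowMs); // 17000
console.log(requestsAtRisk({ windowMs, requestsPerSecond: 1000, backendCount: 10 })); // 1700
```

Halving the interval or the threshold halves the exposure, at the cost of more probe traffic and a higher chance of removing a server over a transient blip.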

Use a load balancer whenever you have multiple application instances and need high availability or horizontal scaling. It is not optional for any production system serving traffic: without it, a single instance failure takes down the entire service. The real value is not just distributing load, but isolating failures so that users see little or no disruption even when individual servers die.

Health Check Black Hole

A load balancer with misconfigured health checks can route traffic to a zombie server that accepts connections but returns 500s — users see errors while the balancer thinks the backend is healthy.

Production Insight

A payment service used a TCP port health check on a Java app that accepted connections but was stuck in a full GC pause — the balancer saw port open, routed traffic, and 40% of requests timed out.

The symptom was intermittent 504s on a subset of requests, with no pattern in logs because the balancer itself had no visibility into application-level health.

Rule: Always use application-level health checks (e.g., /health endpoint) that verify the service can actually process a request, not just that the socket is open.

Key Takeaway

Health checks are the load balancer's only source of truth — a bad check is worse than no check.

The time between failure and detection is bounded by (interval × threshold) — tune these to your acceptable error budget.

Always add a circuit breaker on the client side as a second line of defense against a balancer that routes to a dead node.
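
A client-side circuit breaker of the kind mentioned in the last takeaway can be sketched in a few lines. This is a minimal illustration, not a production library: `createCircuitBreaker`, `failureThreshold`, and `resetAfterMs` are hypothetical names with assumed defaults, and the injectable `now` clock exists only to make the behaviour easy to test.

```javascript
// Minimal client-side circuit breaker sketch. Names and defaults are assumptions.
function createCircuitBreaker({ failureThreshold = 5, resetAfterMs = 10000, now = Date.now } = {}) {
  let failures = 0;
  let openedAt = null; // timestamp when the breaker opened, or null while closed

  return {
    // True when a call may be attempted.
    canRequest() {
      if (openedAt === null) return true;
      // After the cool-down, allow a trial call (the "half-open" state).
      return now() - openedAt >= resetAfterMs;
    },
    // A successful call closes the breaker and clears the failure count.
    onSuccess() {
      failures = 0;
      openedAt = null;
    },
    // A failed call counts toward the threshold and (re)opens the breaker.
    onFailure() {
      failures += 1;
      if (failures >= failureThreshold) openedAt = now();
    },
  };
}
```

The caller checks `canRequest()` before each call and reports the outcome with `onSuccess()` or `onFailure()`, so requests fail fast while the balancer is still routing to a dead node.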

The Core Mechanics: How a Load Balancer Decides Where to Send Traffic

A load balancer sits between the client and your server pool. When a request arrives, it has to make a routing decision in milliseconds — which server gets this connection, right now, given the current state of the cluster.

That decision happens at one of two layers, and the layer matters more than most people realize. Layer 4 load balancers operate at the transport layer — they see IP addresses, TCP/UDP ports, and packet counts. They don't open the envelope. Layer 7 load balancers operate at the application layer — they can read the HTTP method, URL path, headers, cookies, and request body. They know the difference between a GET /api/images request and a POST /api/payments request and can route them to entirely different server pools.
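
To make the Layer 7 idea concrete, here is a small sketch of a content-aware routing decision. It is illustrative only: the pool names, the `routes` table, and `choosePool` are hypothetical, and a real Layer 7 balancer expresses this in its own configuration rather than application code.

```javascript
// Illustrative Layer 7 routing table; pool names and prefixes are hypothetical.
const routes = [
  { method: 'POST', prefix: '/api/payments', pool: 'payments-pool' },
  { method: 'GET', prefix: '/api/images', pool: 'images-pool' },
];

// Pick a backend pool from the HTTP method and URL path.
function choosePool(request, defaultPool = 'default-pool') {
  const match = routes.find(
    (r) => r.method === request.method && request.path.startsWith(r.prefix)
  );
  return match ? match.pool : defaultPool;
}

console.log(choosePool({ method: 'GET', path: '/api/images/42.png' })); // images-pool
console.log(choosePool({ method: 'POST', path: '/api/payments' }));     // payments-pool
console.log(choosePool({ method: 'GET', path: '/status' }));            // default-pool
```

A Layer 4 balancer cannot make this decision because it never sees `method` or `path`; it only has addresses and ports to work with.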

The trade-off is straightforward: Layer 4 is faster because there's almost nothing to parse. Layer 7 is more expensive computationally because it has to terminate the connection, parse the HTTP request, make a routing decision, and then establish a new connection (or reuse a keepalive connection) to the backend. In practice, that overhead is typically 0.5–2ms per request — negligible for most applications, meaningful for high-frequency trading or real-time gaming.

In production, you usually don't pick one. The standard architecture is a Layer 4 network load balancer at the edge handling raw TCP connections at line rate, with Layer 7 application load balancers behind it doing content-aware routing to specific service pools. AWS calls these NLB and ALB. On-premise, you'd see HAProxy in TCP mode in front of NGINX instances.

io/thecodeforge/nginx/upstream.conf (NGINX)

# Upstream pool with mixed weights and a backup server
upstream forge_backend_cluster {
    # Least Connections: best for workloads with variable request processing times
    # Round Robin (default) assumes all requests take roughly the same time — often wrong
    least_conn;

    server 10.0.0.1:8080 weight=3;  # Larger instance — gets 3x the traffic
    server 10.0.0.2:8080 weight=1;  # Standard instance
    server 10.0.0.3:8080 backup;    # Only receives traffic when primary servers are down

    # Keepalive: reuse connections to backends instead of opening new TCP connections per request
    # Without this, high-traffic scenarios create connection exhaustion
    keepalive 32;

    # Passive health checks are configured per server, for example:
    #   server 10.0.0.1:8080 weight=3 max_fails=3 fail_timeout=30s;
    # That marks the server unavailable for 30s after 3 failed attempts within 30s.
    # Active health checks require NGINX Plus or a third-party module
    # such as nginx_upstream_check_module.
}

server {
    listen 80;
    server_name thecodeforge.io;

    location /api/ {
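        proxy_pass http://forge_backend_cluster;

        # Required for the upstream keepalive pool to be used:
        # speak HTTP/1.1 to the backend and clear the Connection header
        proxy_http_version 1.1;
        proxy_set_header Connection "";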

        # Preserve real client IP through the proxy
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $host;

        # Timeout configuration — tune for your backend's actual SLA
        proxy_connect_timeout 5s;
        proxy_read_timeout 60s;
        proxy_send_timeout 60s;

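        # Retry the next upstream on connection errors, timeouts, and 503s.
        # NGINX skips this retry for non-idempotent methods (POST, LOCK, PATCH)
        # unless non_idempotent is added, which avoids double-posting.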
        proxy_next_upstream error timeout http_503;
        proxy_next_upstream_tries 2;
    }
}

Output

# Traffic flows: client → NGINX (L7) → least-connected backend from pool

# Server 10.0.0.1 receives ~3x requests vs 10.0.0.2 due to weight=3

# 10.0.0.3 stays idle unless both primaries fail health checks

# Keepalive reuses existing connections — avoids TCP handshake overhead on every request

Layer 4 vs Layer 7 — The Speed vs Smarts Trade-off

Layer 4: No packet inspection. Lowest latency (~microseconds). Ideal for raw TCP/UDP traffic like gaming servers, video streaming, or any protocol that isn't HTTP.
Layer 7: Reads cookies, headers, URL paths, and HTTP methods. Enables A/B testing, canary deployments, microservice routing by path, and authentication offloading.
Rule of thumb: if your routing decision requires knowing anything about the request content, you need Layer 7. If you only need to balance load across identical servers, Layer 4 is enough.
Performance cost of Layer 7: typically 0.5–2ms additional latency per request due to connection termination, TLS handling, and HTTP parsing.
Standard production architecture: Layer 4 NLB at the edge absorbs raw connection volume, Layer 7 ALB/NGINX behind it makes content-aware routing decisions per service.

Production Insight

Layer 4 load balancers are blind to URL paths — /api/payments and /api/images look identical at the TCP layer.

In a microservices architecture, routing different paths to different service pools requires Layer 7. Layer 4 alone means one pool per port, which doesn't scale.

The NLB → ALB pattern solves this: NLB handles the connection volume and TLS termination at the edge, ALB handles path-based routing internally. Both AWS and GCP make this pattern straightforward with managed offerings.

Key Takeaway

Layer 4 is the motorcycle — fast and direct, but you can't read road signs at that speed. Layer 7 is the GPS-equipped car — slightly slower off the line, but it knows exactly where every request needs to go. Most production systems at scale need both, with clear ownership of which tier makes which routing decisions.

Choosing Between Layer 4 and Layer 7

If: Routing decision requires inspecting URL path, HTTP headers, cookies, or request body
→ Use: Layer 7 (NGINX, HAProxy in HTTP mode, AWS ALB, GCP Application Load Balancer). No way around it: Layer 4 cannot see this information.

If: Raw TCP/UDP traffic with no HTTP semantics, such as game servers, video streaming, gRPC without HTTP/2 inspection, DNS
→ Use: Layer 4 (AWS NLB, HAProxy in TCP mode, GCP Network Load Balancer). Lower latency, higher throughput, no parsing overhead.

If: Extreme latency sensitivity, such as sub-millisecond routing, financial trading, real-time bidding
→ Use: Layer 4. The 0.5–2ms overhead of HTTP parsing is real and will show up in your p99 latency.

If: Need both maximum throughput at the edge and intelligent content routing for microservices
→ Use: Layer 4 (NLB) at the edge → Layer 7 (NGINX/Envoy/ALB) internally. Standard architecture for anything at significant scale.

OSI Layer 4 vs Layer 7 — Visual Comparison

Understanding the OSI layer at which a load balancer operates is fundamental to designing your architecture. Layer 4 (transport layer) and Layer 7 (application layer) operate at completely different levels of abstraction, and the decision between them determines what information is available for routing, how much overhead is added, and what failure modes you must design for.

The diagram below shows the two layers side-by-side, with their respective capabilities, overhead, and typical use cases. At Layer 4, the load balancer sees only IP addresses, ports, and TCP flags — packets are forwarded without inspection, making it extremely fast but completely unaware of request content. At Layer 7, the load balancer terminates the TCP connection, performs TLS termination, and then parses the HTTP request to extract cookies, headers, URL paths, and even the request body. This enables content-aware routing, but adds latency from connection termination and parsing.

In practice, production systems rarely pick one over the other — they use both in a tiered architecture. A Layer 4 NLB at the network edge handles raw connection volume at line rate and distributes traffic to a pool of Layer 7 application load balancers (NGINX, HAProxy, Envoy) that perform content-aware routing to specific microservices. This hybrid approach gives you the throughput of Layer 4 at the edge with the intelligence of Layer 7 inside.

Production Insight

One common antipattern: placing a single Layer 7 load balancer at the edge for all traffic. Layer 7 LBs have a per-connection memory overhead because they maintain TCP state. Under a flood of short-lived connections, a Layer 7 LB can run out of memory or CPU before its backend capacity is even tapped. The fix is always a Layer 4 LB in front to absorb the connection volume — it has near-zero per-packet state and can handle millions of concurrent connections with minimal resource usage.

Key Takeaway

Layer 4 keeps very little state per connection and is blazing fast, which makes it well suited to absorbing edge traffic. Layer 7 terminates connections and parses requests, which makes it well suited to content-aware routing. A production system that uses only one is either leaving performance on the table (all L4, no intelligent routing) or risking capacity exhaustion (all L7 at the edge).

Layer 4 vs Layer 7 — Decision Flow

Load Balancing Algorithms in Depth: Choosing the Right Strategy for Your Workload

The algorithm your load balancer uses to select a backend is not a configuration detail — it's a decision that directly affects your latency distribution, your server utilization, and what happens when servers become slow rather than fully dead.

Round Robin is the simplest: request 1 goes to server 1, request 2 to server 2, and so on, cycling back to the beginning. It works well when all servers are identical and all requests take roughly the same time to process. Both of those assumptions break in practice. Servers are rarely perfectly identical after weeks of different memory allocations and GC histories. And requests are almost never uniform — a request that triggers a complex database join takes 50x longer than one hitting a cache.

Least Connections routes each new request to whichever server currently has the fewest active connections. This self-corrects automatically: a slow server accumulates connections faster, so the LB naturally sends it fewer new ones. This is why Least Connections is the safer default for most production web applications.
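
The selection rule itself is small. The sketch below assumes each backend is an object with a hypothetical `activeConnections` counter that the caller increments when a connection opens and decrements when it closes; real balancers track this internally.

```javascript
// Least Connections selection sketch; the backend shape is an assumption.
function pickLeastConnections(backends) {
  if (backends.length === 0) throw new Error('no healthy backends');
  // Keep the first backend with the lowest count (ties go to the earlier entry).
  return backends.reduce((best, b) =>
    b.activeConnections < best.activeConnections ? b : best
  );
}

const pool = [
  { name: 'app-1', activeConnections: 12 },
  { name: 'app-2', activeConnections: 3 },
  { name: 'app-3', activeConnections: 7 },
];
const chosen = pickLeastConnections(pool);
chosen.activeConnections += 1; // the caller tracks connection open and close
console.log(chosen.name); // app-2
```

A slow backend holds its connections open longer, so its counter stays high and it is chosen less often without any explicit latency measurement.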

Least Response Time takes the next step: instead of counting connections, it measures actual backend latency (TTFB) and routes to whichever server is responding fastest right now. This is more accurate but requires the LB to actively probe or measure response times, which adds some overhead. It's the right choice for latency-sensitive workloads where a server can be 'healthy' but slow.
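
One common way to track "responding fastest right now" is an exponentially weighted moving average (EWMA) of observed latency. The sketch below is an assumption about how this could be done, not a description of any specific load balancer: `createLatencyTracker`, the smoothing factor `alpha = 0.2`, and the choice to treat unsampled backends as 0 ms are all illustrative.

```javascript
// EWMA-based latency tracking sketch; names and alpha are assumptions.
function createLatencyTracker(alpha = 0.2) {
  const ewma = new Map(); // backend name -> smoothed latency in ms

  return {
    // Fold one observed latency sample into the backend's moving average.
    observe(name, latencyMs) {
      const prev = ewma.get(name);
      ewma.set(name, prev === undefined ? latencyMs : alpha * latencyMs + (1 - alpha) * prev);
    },
    // Return the backend with the lowest smoothed latency.
    // Backends with no samples yet count as 0 ms so they get tried first.
    fastest(names) {
      return names.reduce((best, n) =>
        (ewma.get(n) ?? 0) < (ewma.get(best) ?? 0) ? n : best
      );
    },
    get: (name) => ewma.get(name),
  };
}
```

A higher `alpha` reacts faster to a backend that slows down but also makes traffic oscillate more when latency is noisy, which is the trade-off behind this algorithm's extra overhead.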

IP Hash routes each client to the same backend based on a hash of their source IP. This provides a form of session affinity without application-level cookies. The significant risk: if all your users are behind a corporate NAT gateway, they all hash to the same backend. Also, when a backend is added or removed, the hash changes and users get redistributed — breaking any state you were relying on affinity to preserve.
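
Both risks are easy to see in a sketch. The code below assumes the simplest scheme, a hash of the source IP taken modulo the pool size; the FNV-1a hash, the function names, and the sample addresses are illustrative choices, not what any particular load balancer uses.

```javascript
// 32-bit FNV-1a string hash, used here only as a simple deterministic hash.
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Plain modulo IP hash: the same IP always maps to the same backend
// for a given pool, but the mapping depends on the pool size.
function pickByIpHash(clientIp, backends) {
  return backends[fnv1a(clientIp) % backends.length];
}

// Count clients whose backend changes when the pool changes.
function countRemapped(clientIps, before, after) {
  return clientIps.filter((ip) => pickByIpHash(ip, before) !== pickByIpHash(ip, after)).length;
}

const clients = Array.from({ length: 1000 }, (_, i) => `10.0.${Math.floor(i / 250)}.${i % 250}`);
const before = ['app-1', 'app-2', 'app-3', 'app-4'];
const after = [...before, 'app-5'];
console.log(`${countRemapped(clients, before, after)} of ${clients.length} clients moved`);
```

With a uniform hash, growing a pool from 4 to 5 backends under plain modulo hashing is expected to move about 80% of clients, since a client stays put only when its hash gives the same remainder for both pool sizes. Every user behind one NAT address passes the same `clientIp`, so they all land on one backend regardless of pool size.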

Weighted variants of Round Robin and Least Connections let you express that some servers have more capacity than others. A server with weight=3 gets three times the share of a server with weight=1. This is essential in mixed hardware environments.

io/thecodeforge/loadbalancer/WeightedRoundRobinBalancer.java (JAVA)

package io.thecodeforge.loadbalancer;

import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.CopyOnWriteArrayList;

/**
 * Production-grade Weighted Round-Robin implementation.
 *
 * Design decisions worth understanding:
 * - AtomicInteger for the index counter: load balancers handle concurrent requests.
 *   A plain int here is a data race waiting to happen.
 * - CopyOnWriteArrayList for the server pool: allows safe dynamic weight updates
 *   without locking the hot path (getNextServer).
 * - Collections.shuffle() on construction: prevents all initial traffic from
 *   hitting server[0] in a predictable burst during startup.
 * - Math.abs() on the modulo result: AtomicInteger.getAndIncrement() eventually
 *   overflows to negative values. Without abs(), you get an ArrayIndexOutOfBoundsException
 *   at 2^31 requests — a production bug you will not find in testing.
 */
public class WeightedRoundRobinBalancer {

    private final List<String> serverPool;
    private final AtomicInteger currentIndex = new AtomicInteger(0);

    public WeightedRoundRobinBalancer(Map<String, Integer> serversWithWeights) {
        List<String> pool = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : serversWithWeights.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                pool.add(entry.getKey());
            }
        }
        // Shuffle to prevent predictable startup burst on the first server
        Collections.shuffle(pool);
        // CopyOnWriteArrayList: safe reads on the hot path, supports dynamic updates
        this.serverPool = new CopyOnWriteArrayList<>(pool);
    }

    public String getNextServer() {
        if (serverPool.isEmpty()) {
            throw new IllegalStateException(
                "No active servers in the pool. Check health checks and server registration."
            );
        }
        // Math.abs handles integer overflow at 2^31 requests
        int index = Math.abs(currentIndex.getAndIncrement() % serverPool.size());
        return serverPool.get(index);
    }

    /**
     * Dynamically update a server's weight — e.g., temporarily reduce weight
     * for a server showing elevated GC pause times or increased error rate.
     * In production, this would be called by your health monitoring system.
     */
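    public synchronized void updateWeight(String server, int newWeight, int oldWeight) {
        // Remove existing entries for this server
        serverPool.removeIf(s -> s.equals(server));
        // Re-add with the new weight in one call: each individual add() on a
        // CopyOnWriteArrayList copies the whole backing array.
        // Note: readers can briefly see the pool without this server between the two steps.
        serverPool.addAll(Collections.nCopies(newWeight, server));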
        System.out.printf("Weight updated: %s  %d → %d (pool size: %d)%n",
            server, oldWeight, newWeight, serverPool.size());
    }

    public static void main(String[] args) {
        Map<String, Integer> config = new LinkedHashMap<>();
        config.put("app-server-large-01", 5);  // High-capacity instance
        config.put("app-server-std-01",   2);  // Standard instance
        config.put("app-server-std-02",   2);  // Standard instance

        WeightedRoundRobinBalancer balancer = new WeightedRoundRobinBalancer(config);

        System.out.println("Initial distribution across 18 requests:");
        Map<String, Integer> distribution = new LinkedHashMap<>();
        for (int i = 0; i < 18; i++) {
            String server = balancer.getNextServer();
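            distribution.merge(server, 1, Integer::sum);
        }
        // Print in configuration order so the output is stable despite the shuffle
        for (String server : config.keySet()) {
            System.out.printf("%s → %d requests%n", server, distribution.get(server));
        }

        System.out.println("Simulating GC pressure on app-server-large-01...");
        balancer.updateWeight("app-server-large-01", 1, 5);
        System.out.println("Server temporarily downweighted. Rebalancing traffic.");
    }
}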

Output

Initial distribution across 18 requests:

app-server-large-01 → 10 requests

app-server-std-01 → 4 requests

app-server-std-02 → 4 requests

Simulating GC pressure on app-server-large-01...

Weight updated: app-server-large-01 5 → 1 (pool size: 5)

Server temporarily downweighted. Rebalancing traffic.

The Sticky Session Trap — It Fails Exactly When You Need It Most

Sticky sessions sound like a reasonable solution to stateful applications. In practice, they create two problems that compound each other. First, they defeat load balancing — if 30% of your users all happen to hash to the same server, that server gets 30% of traffic regardless of its current load. Second, when that server dies, every user pinned to it loses their session simultaneously — a mass logout event during your highest-traffic moment. The correct fix is to store session state in Redis and make your application servers stateless. Sticky sessions are a band-aid that delays this conversation until the worst possible moment.

Production Insight

The Math.abs() call in the Java implementation is not defensive programming theater — it's fixing a real production bug.

AtomicInteger.getAndIncrement() overflows to Integer.MIN_VALUE after 2^31 calls. Without Math.abs(), the modulo of a negative number is negative, which throws ArrayIndexOutOfBoundsException on the next line.

On a moderately loaded API server handling 1,000 requests/second, you hit 2^31 in roughly 25 days. You won't catch this in load testing unless you run it for a very long time.

Dynamic weight adjustment is equally important in production. A server with weight=5 that enters a long GC pause should drop to weight=1 automatically. Static weights set at deployment time are a snapshot of server capacity at one moment — they drift.

Key Takeaway

Weighted Round Robin is not 'set and forget.' Static weights reflect server capacity at deployment time and drift as memory pressure and GC behavior evolve. Production implementations monitor per-server response time and error rate and adjust weights dynamically. A server with weight=5 that starts returning errors at 10% should not receive 5x the traffic.
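
As a rough illustration of dynamic adjustment, the sketch below drops a server to the minimum weight when either of two signals crosses a limit. Everything here is an assumption for illustration: the function name, the `errorRate` and `p99Ms` inputs, and the 5% and 500 ms limits are not taken from any real monitoring system.

```javascript
// Dynamic weight adjustment sketch; signal names and limits are assumptions.
function adjustWeight(baseWeight, { errorRate, p99Ms }, { maxErrorRate = 0.05, maxP99Ms = 500 } = {}) {
  // Drop to the minimum weight when either signal breaches its limit,
  // otherwise keep the weight configured at deployment time.
  if (errorRate >= maxErrorRate || p99Ms >= maxP99Ms) return 1;
  return baseWeight;
}

console.log(adjustWeight(5, { errorRate: 0.10, p99Ms: 120 })); // 1
console.log(adjustWeight(5, { errorRate: 0.01, p99Ms: 120 })); // 5
```

A monitoring loop would call something like this per server on each evaluation cycle and push the result into the balancer's weight update path.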

Load Balancing Algorithm Matrix — Best Use Case Selection Guide

Choosing the right load balancing algorithm is not a one-size-fits-all decision. The matrix below maps each algorithm to its ideal workload profile, along with the key risks when the assumptions behind the algorithm are violated. Use this as a quick reference when designing or debugging a load-balanced system.

io/thecodeforge/loadbalancer/algorithm_matrix.md (MARKDOWN)

| Algorithm | Strategy | Best Use Case | Watch Out For |
|---|---|---|---|
| Round Robin | Sequential distribution | Identical servers with uniform request processing time | Variable request times cause uneven load despite uniform distribution |
| Weighted Round Robin | Distribution proportional to configured weight | Mixed hardware (big vs small instances) | Static weights drift over time; need dynamic adjustment |
| Least Connections | Routes to server with fewest active connections | Long-lived request workloads (streaming, heavy queries) | Connection count doesn't equal load; slow servers accumulate connections |
| Least Response Time | Routes to server with lowest TTFB | Latency-sensitive applications | Traffic oscillation if server speed fluctuates rapidly; requires LB to probe |
| IP Hash | Hash of source IP | Stateful apps requiring session affinity without cookies | NAT/proxy makes all users hash to same server; pool changes break affinity |
| Random with Two Choices | Pick two random servers, choose the one with fewer connections | Very large server pools (reduces state tracking overhead) | Less predictable than Least Connections; still vulnerable to slow servers |
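
The last row of the matrix, Random with Two Choices, can be sketched as follows. This is an illustrative sketch: the backend shape with `activeConnections` is an assumption, and the `random` parameter is injectable only so the behaviour can be tested deterministically.

```javascript
// "Power of two choices" sketch; the backend shape is an assumption.
function pickTwoChoices(backends, random = Math.random) {
  if (backends.length === 0) throw new Error('no healthy backends');
  if (backends.length === 1) return backends[0];

  const i = Math.floor(random() * backends.length);
  // Draw the second index from the remaining n-1 slots so it differs from i.
  let j = Math.floor(random() * (backends.length - 1));
  if (j >= i) j += 1;

  const a = backends[i];
  const b = backends[j];
  // Of the two sampled backends, take the one with fewer active connections.
  return a.activeConnections <= b.activeConnections ? a : b;
}
```

Only two counters are read per request instead of the whole pool, which is why this approach suits very large pools. The trade-off is that the globally least-loaded server is not always among the two candidates.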

Production Insight

The 'Best Use Case' column is not a prescriptive rule — it's a starting point. In production, the actual best algorithm often emerges from observability data: if you see one server accumulating connections faster than others despite Least Connections, your requests have variable processing time that Least Response Time would handle better. Treat algorithm selection as an iterative tuning process, not a one-time decision.

Key Takeaway

No algorithm is universally best. Least Connections is the safest default for most web applications, but latency-sensitive workloads need Least Response Time, and mixed hardware demands weighted variants. Always validate your choice with production metrics.

Algorithm Selection Flow

Health Checks: The Component That Makes Everything Else Work

Health checks are the mechanism by which a load balancer knows which servers are actually capable of serving traffic right now. Everything else — algorithm, weights, session persistence — is irrelevant if the LB doesn't have accurate information about server state.

There are three types of health checks in common use, and understanding their trade-offs matters:

TCP health checks open a connection to the server's port and consider it healthy if the connection succeeds. Fast, low overhead, and completely inadequate for detecting application-level failures. The server's OS can accept a TCP connection while the application is in a GC pause, crashed internally, or waiting on a database connection that will never arrive.

HTTP health checks send an actual HTTP request to a designated endpoint (typically /health or /healthz) and validate the response code. This is the minimum acceptable standard for production. The endpoint must return a non-200 response if the application isn't ready to serve traffic — not just if the process is running.

Deep health checks go further: the /healthz endpoint actively validates downstream dependencies — can we connect to the database, is the cache reachable, are critical feature flags loaded. These checks are more expensive to run but catch a class of failures that HTTP-only checks miss: the application process is up, the port responds, but the database connection pool is exhausted and every request will fail.

The health check configuration details matter as much as the type. Check interval (how often), timeout (how long to wait for a response), unhealthy threshold (how many consecutive failures before removal), and healthy threshold (how many consecutive successes before re-addition) all interact. Misconfigure any of these and you get either flapping — servers rapidly cycling in and out of rotation — or a slow response to actual failures.
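The way the two thresholds interact can be sketched in a few lines. This is a simplified model of the bookkeeping, not the implementation of any particular load balancer; `HealthTracker` is a hypothetical class, and the defaults of 3 failures and 2 successes are illustrative values.

```javascript
// Sketch of the threshold logic an LB applies to probe results.
// Defaults (3 failures to remove, 2 successes to re-add) are illustrative.
class HealthTracker {
  constructor({ unhealthyThreshold = 3, healthyThreshold = 2 } = {}) {
    this.unhealthyThreshold = unhealthyThreshold;
    this.healthyThreshold = healthyThreshold;
    this.healthy = true;
    this.streak = 0; // consecutive results that disagree with the current state
  }

  // Record one probe result and return the (possibly updated) health state.
  record(probeOk) {
    if (probeOk === this.healthy) {
      this.streak = 0; // result agrees with the current state: reset the counter
      return this.healthy;
    }
    this.streak += 1;
    const needed = this.healthy ? this.unhealthyThreshold : this.healthyThreshold;
    if (this.streak >= needed) {
      this.healthy = !this.healthy;
      this.streak = 0;
    }
    return this.healthy;
  }
}

const tracker = new HealthTracker();
const probeResults = [false, true, false, false, false, true, true];
console.log(probeResults.map((ok) => tracker.record(ok)));
// [ true, true, true, true, false, false, true ]
```

A single failed probe followed by a success never changes the state, which is what prevents flapping. The cost is detection latency: with a 5-second interval and an unhealthy threshold of 3, removal takes roughly 15 seconds after the first failed probe.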

io/thecodeforge/health/healthcheck.jsJAVASCRIPT

const express = require('express');
const { createClient } = require('redis');
const { Pool } = require('pg');

const app = express();

// Initialize dependencies
const redisClient = createClient({ url: process.env.REDIS_URL });
const pgPool = new Pool({ connectionString: process.env.DATABASE_URL });

redisClient.connect().catch(console.error);

/**
 * Shallow health check — fast, for high-frequency LB probing.
 * Returns 200 if the process is alive. Does not check dependencies.
 * Use this for the LB's frequent interval check (every 5s).
 */
app.get('/health/live', (req, res) => {
  res.status(200).json({
    status: 'alive',
    pid: process.pid,
    uptime: process.uptime(),
  });
});

/**
 * Deep readiness check — validates all dependencies.
 * Returns 200 only when the application can actually serve traffic.
 * Use this as the LB's readiness gate during startup and deployment.
 * Check interval should be longer (every 10–15s) due to dependency I/O.
 */
app.get('/health/ready', async (req, res) => {
  const checks = {};
  let allHealthy = true;

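  // Check database connectivity
  try {
    const client = await pgPool.connect();
    try {
      await client.query('SELECT 1');
    } finally {
      client.release(); // always return the connection, even if the query fails
    }
    checks.database = { status: 'healthy' };
  } catch (err) {
    checks.database = { status: 'unhealthy', error: err.message };
    allHealthy = false;
  }

  // NOTE: the original listing is cut off at this point. Everything below is
  // reconstructed from the sample output that follows and may differ from the
  // author's code.

  // Check Redis connectivity
  try {
    await redisClient.ping();
    checks.redis = { status: 'healthy' };
  } catch (err) {
    checks.redis = { status: 'unhealthy', error: err.message };
    allHealthy = false;
  }

  // Check heap usage (the 90% threshold is an assumed value)
  const { heapUsed, heapTotal } = process.memoryUsage();
  const heapUsedPercent = (heapUsed / heapTotal) * 100;
  const memoryOk = heapUsedPercent < 90;
  checks.memory = {
    status: memoryOk ? 'healthy' : 'unhealthy',
    heapUsedPercent: `${heapUsedPercent.toFixed(1)}%`,
  };
  if (!memoryOk) allHealthy = false;

  // 200 when everything passes, 503 otherwise so the LB removes the server
  res.status(allHealthy ? 200 : 503).json({
    status: allHealthy ? 'ready' : 'not_ready',
    checks,
    ...(allHealthy && { pid: process.pid }), // pid only when ready, as in the sample output
  });
});

// Port 8080 is an assumption; use whatever port the LB probes.
app.listen(process.env.PORT || 8080);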

Output

// GET /health/live → 200
// { "status": "alive", "pid": 12801, "uptime": 347.2 }

// GET /health/ready (all healthy) → 200
// {
//   "status": "ready",
//   "checks": {
//     "database": { "status": "healthy" },
//     "redis": { "status": "healthy" },
//     "memory": { "status": "healthy", "heapUsedPercent": "42.1%" }
//   },
//   "pid": 12801
// }

// GET /health/ready (database down) → 503
// {
//   "status": "not_ready",
//   "checks": {
//     "database": { "status": "unhealthy", "error": "connect ECONNREFUSED" },
//     "redis": { "status": "healthy" },
//     "memory": { "status": "healthy", "heapUsedPercent": "41.8%" }
//   }
// }

Global Load Balancing (GSLB) — Routing Across Data Centers and Regions

Global Server Load Balancing (GSLB) extends load balancing beyond a single data center to distribute traffic across multiple geographic regions or cloud availability zones. At this scale, load balancing decisions are based on factors like proximity, latency, and data center health, often using DNS as the control plane rather than packet forwarding.

GSLB operates differently from local load balancing. Instead of inspecting individual packets, it manipulates DNS responses: when a client requests the IP address for your service (e.g., api.thecodeforge.io), the GSLB-enabled DNS server returns the IP of the nearest or healthiest data center. This happens at the DNS resolution step, before any TCP connection is established. The client then connects directly to that data center's front-end load balancer.

The diagram below shows the typical GSLB architecture. DNS resolution is the first routing decision point; it determines which regional cluster the client will hit. Within each cluster, standard Layer 4 and Layer 7 load balancers handle the traffic. If a data center goes down, the GSLB controller removes its IPs from DNS responses, and clients eventually (after TTL expiry) resolve to a healthy region.

Production Insight

GSLB is deceptively simple in theory but has two critical gotchas in practice. First, DNS TTL: even if you set a short TTL (e.g., 30 seconds) to enable fast failover, some resolvers and clients cache answers for longer than the TTL and keep sending traffic to old IPs. An answer cached by a large recursive resolver is also reused by every client behind it, so a single DNS response can steer a flood of traffic to one region. Second, DNS-based GSLB has no visibility into TCP-level health. It relies on periodic health probes of each data center, which can miss intermittent failures. Always pair GSLB with health check-driven removal at the local LB level, and consider using anycast routing (as Cloudflare does) for sub-10-second failover.

Key Takeaway

GSLB is the first routing decision in a multi-region architecture. It uses DNS to direct clients to the nearest healthy data center. The TTL of your DNS records directly controls failover speed: too short and you multiply the query load on your authoritative servers; too long and failover takes minutes. A common production compromise is TTL=60 seconds with a pre-warming strategy for planned failovers.
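The DNS-side decision and the TTL trade-off can be sketched as follows. This is a simplified model under stated assumptions: the region names, the IPs (taken from documentation ranges), and the latency figures are invented, a real GSLB estimates latency per client or per resolver rather than using one static number, and `resolveGslb` and `worstCaseFailoverSeconds` are hypothetical helpers.

```javascript
// Hypothetical regions; latencies and IPs (documentation ranges) are made up.
const regions = [
  { name: 'us-east', ip: '192.0.2.10', healthy: true, latencyMs: 40 },
  { name: 'eu-west', ip: '198.51.100.10', healthy: true, latencyMs: 95 },
  { name: 'ap-south', ip: '203.0.113.10', healthy: false, latencyMs: 20 },
];

// Build the DNS answer: lowest-latency region among those passing health checks.
function resolveGslb(regionList, ttlSeconds = 60) {
  const candidates = regionList.filter((r) => r.healthy);
  if (candidates.length === 0) return null; // nothing healthy to hand out
  const best = candidates.reduce((a, b) => (b.latencyMs < a.latencyMs ? b : a));
  return { region: best.name, ip: best.ip, ttl: ttlSeconds };
}

// Rough upper bound on failover time for clients that honor the TTL:
// time to detect the failure plus the time a cached answer can live.
function worstCaseFailoverSeconds({ probeIntervalSeconds, unhealthyThreshold, ttlSeconds }) {
  return probeIntervalSeconds * unhealthyThreshold + ttlSeconds;
}

console.log(resolveGslb(regions));
// { region: 'us-east', ip: '192.0.2.10', ttl: 60 }
console.log(
  worstCaseFailoverSeconds({ probeIntervalSeconds: 10, unhealthyThreshold: 3, ttlSeconds: 60 })
); // 90
```

Note that `ap-south` has the lowest latency but is skipped because it is unhealthy. The second function shows why the TTL dominates failover time once health detection itself is reasonably fast.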

GSLB Architecture

Load Balancer Deployment Models — Hardware, Software, and Cloud

Load balancers come in three fundamental deployment forms: dedicated hardware appliances, software running on commodity servers, and cloud-managed services. The choice between them affects your upfront cost, operational complexity, scalability ceiling, and failure domain. The table below compares the three models across the dimensions that matter in production.

io/thecodeforge/loadbalancer/deployment_models.mdMARKDOWN

| Dimension | Hardware (e.g., F5 BIG-IP) | Software (e.g., HAProxy, NGINX) | Cloud Managed (e.g., AWS ALB, GCP LB) |
|---|---|---|---|
| **Deployment Model** | Dedicated appliance in the data center | Installed on a VM or bare-metal server | API-provisioned, fully managed by cloud provider |
| **Performance Ceiling** | Very high, custom ASICs for SSL and packet processing | Limited by the host's CPU and NIC (but can be scaled horizontally) | Scales automatically within region; no manual capacity planning |
| **Configuration Flexibility** | Moderate; vendor-specific UI/config language | Extremely flexible; config files can be version-controlled, templated (Ansible, etc.) | Good; limited to cloud provider's feature set; no access to kernel tuning |
| **TLS Acceleration** | Hardware offload for RSA/ECC, high throughput (hundreds of thousands of handshakes/sec) | CPU-bound; can use kernel TLS (kTLS) or hardware acceleration on instance types with Intel QAT | Built-in, often using hardware behind the scenes; simple to enable |
| **Operational Overhead** | High: firmware upgrades, redundant appliances, vendor lock-in | Medium: OS patching, high availability setup (Keepalived, VRRP), monitoring | Low: no OS to manage; health checks and scaling are automated |
| **Cost Model** | High upfront CAPEX + annual maintenance license | Upfront cost = server + software (or open source free); OPEX for operations | Pay-per-use (per hour or per LB-capacity-unit); no upfront cost |
| **Failure Domain** | The appliance itself is a single point of failure; need active/passive pair | Two software instances with VIP failover; same failure domain as the host | Cloud provider guarantees high availability across availability zones |
| **Best For** | Large enterprises with existing data center footprint, compliance requirements, or need for hardware crypto | Teams that want full control over load balancer behavior, need custom health checks, or run on-premise | Teams building on cloud that want to minimize operational overhead and scale without capacity planning |

Production Insight

The choice between these models often comes down to operational maturity, not just performance. Cloud-managed LBs are spectacularly easy to set up but tie you to the provider's feature set and pricing model. Software LBs give you maximum control and portability — you can run the same HAProxy config on bare metal, in a container, or in the cloud. Hardware LBs are increasingly rare outside regulated industries because their cost and complexity rarely justify the performance difference. In most cases, a well-tuned software LB running on modern hardware with kTLS can match hardware performance at a fraction of the cost.

Key Takeaway

Hardware LBs are legacy technology for environments that require physical separation or hardware crypto. Software LBs like HAProxy and NGINX provide the best balance of performance, flexibility, and cost for most teams. Cloud-managed LBs are the right choice when you want to outsource operations entirely and don't need custom L7 logic beyond what the cloud provider offers.

Why Your Load Balancer Configuration Is a Security Liability

Most devs treat load balancers as traffic cops, not security gates. That's a mistake. A misconfigured load balancer is a backdoor into your network. The biggest trap? Leaving management ports exposed to the internet. I've seen a production cluster go down because someone forgot to restrict access to the HAProxy admin socket. Another classic: terminating TLS at the load balancer but sending unencrypted traffic to backend servers. That's a data leak waiting to happen. Always enforce end-to-end encryption, even within your VPC. Use Web Application Firewall (WAF) features at Layer 7 to filter malicious requests before they hit your app. And for the love of everything, never put your load balancer in the same subnet as your backend servers: a flat network is how lateral movement happens. The why is simple: load balancers are your network's front door. If you leave it unlocked, you're not distributing traffic, you're distributing risk.

nginx-hardened.ymlYAML

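# io.thecodeforge
# Example NGINX config blocking admin access and enforcing TLS
events {}

http {
    # Backend pool (address is a placeholder)
    upstream backend {
        server 10.0.1.10:8080;
    }

    server {
        listen 443 ssl;
        # Certificate paths are placeholders
        ssl_certificate     /etc/nginx/tls/example.crt;
        ssl_certificate_key /etc/nginx/tls/example.key;

        # Reject common attack patterns
        if ($http_user_agent ~* (nikto|sqlmap|nmap)) {
            return 403;
        }

        # Only allow internal IPs to access /status
        location /status {
            allow 10.0.0.0/8;
            deny all;
            stub_status;
        }

        # proxy_pass must live inside a location block
        location / {
            proxy_pass http://backend;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }
}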

Output

Blocked 403 - malicious user agent detected

Production Trap:

Never expose the load balancer's admin interface (e.g., HAProxy stats or NGINX stub_status) to the internet. Always restrict to internal CIDR blocks.

Key Takeaway

A load balancer is a security appliance first, traffic distributor second. Lock it down like it's the perimeter.
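The same allowlist idea can be enforced in application code for any admin or status endpoint you expose yourself. The following is a minimal sketch for IPv4 only; `ipv4ToInt`, `ipInCidr`, and `isAdminAllowed` are hypothetical helper names, and the default `10.0.0.0/8` range simply mirrors the NGINX example.

```javascript
// Convert a dotted IPv4 string to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
    throw new Error(`Invalid IPv4 address: ${ip}`);
  }
  const [a, b, c, d] = parts.map(Number);
  return ((a << 24) | (b << 16) | (c << 8) | d) >>> 0;
}

// True when `ip` falls inside `cidr` (for example '10.0.0.0/8').
function ipInCidr(ip, cidr) {
  const [base, prefix] = cidr.split('/');
  const bits = Number(prefix);
  // A /0 prefix needs special handling because JS shifts wrap at 32 bits.
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// Allow admin endpoints only for callers inside the internal ranges.
function isAdminAllowed(remoteIp, allowlist = ['10.0.0.0/8']) {
  return allowlist.some((cidr) => ipInCidr(remoteIp, cidr));
}

console.log(isAdminAllowed('10.1.2.3'));    // true
console.log(isAdminAllowed('203.0.113.9')); // false
```

The check should run against the socket's peer address. A client-supplied header such as X-Forwarded-For is only trustworthy when it was set by a proxy you control.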

How to Debug a Failing Load Balancer Without Losing Your Mind

When a load balancer starts dropping traffic, the panic sets in fast. Stop guessing. Follow a systematic approach. First, check health checks: not just if they pass, but how. A 200 OK from a health check endpoint doesn't mean the service is functional. I've seen apps where the health check endpoint returns 200 but the actual API has a dead database connection. Use deep health checks that validate dependencies. Second, look at connection metrics. If you see a spike in SYN_SENT on the load balancer, backends are not completing handshakes, which usually means the pool is saturated or unreachable and connections are timing out. That's a capacity problem, not a configuration issue. A pile-up of TIME_WAIT points to connection churn instead: many short-lived connections, typically because keepalive to the backends is off. Third, use tcpdump or strace on the load balancer itself. Layer-7 load balancers (NGINX, HAProxy) log request-level data. Parse those logs. If you see 502s, your backend is dying. 504s? The upstream is too slow. 499s (an NGINX-specific code)? The client disconnected first. The why is simple: a load balancer is a state machine. If you don't understand its state, you'll chase ghosts. Logs are the truth.

debug-lb.shBASH

#!/bin/bash
# io.thecodeforge
# Quick health-check deep dive for HAProxy
# Run from load balancer host

echo "=== Checking backend health via stats socket ==="
# "show stat" prints per-server CSV rows that include the UP/DOWN status
echo "show stat" | socat stdio /var/run/haproxy/haproxy.sock | grep -E "(UP|DOWN)"

echo "=== Checking connection states ==="
ss -s | head -10

echo "=== Recent 5xx errors in logs ==="
# Path assumes an NGINX access log; adjust for your load balancer
grep -E '" 5[0-9]{2} ' /var/log/nginx/access.log | tail -20

Output

=== Checking backend health via stats socket ===

app-backend UP 1/5

app-backend DOWN 2/5

=== Checking connection states ===

ESTAB 142

TIME_WAIT 89

SYN_SENT 34

Debugging Flow:

Health checks → connection metrics → load balancer logs. Never skip step 1. A failed health check usually means the backend, or the check's own configuration, is the problem rather than the load balancer's routing.

Key Takeaway

When debugging a load balancer, start with health checks. Always. If health checks pass but traffic fails, you have an application problem, not a network problem.
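The log-parsing step can be sketched as a small status-code tally. This is a minimal example assuming the NGINX "combined" access log format; the sample lines are invented, and `tallyStatuses` and `explain` are hypothetical helpers.

```javascript
// Sample lines in the NGINX "combined" log format; the requests are made up.
const sampleLog = [
  '203.0.113.5 - - [05/Mar/2026:10:00:01 +0000] "GET /api/orders HTTP/1.1" 200 512 "-" "curl/8.0"',
  '203.0.113.6 - - [05/Mar/2026:10:00:02 +0000] "GET /api/orders HTTP/1.1" 502 157 "-" "curl/8.0"',
  '203.0.113.7 - - [05/Mar/2026:10:00:03 +0000] "POST /api/pay HTTP/1.1" 502 157 "-" "curl/8.0"',
  '203.0.113.8 - - [05/Mar/2026:10:00:04 +0000] "GET /api/report HTTP/1.1" 504 160 "-" "curl/8.0"',
  '203.0.113.9 - - [05/Mar/2026:10:00:05 +0000] "GET /api/report HTTP/1.1" 499 0 "-" "curl/8.0"',
];

// What each interesting status usually points to, per the text above.
const HINTS = {
  502: 'backend is dying (bad gateway)',
  504: 'upstream is too slow (gateway timeout)',
  499: 'client disconnected first (NGINX-specific)',
};

// Count responses per status code.
function tallyStatuses(lines) {
  const counts = {};
  for (const line of lines) {
    const match = line.match(/" (\d{3}) /); // status follows the quoted request line
    if (!match) continue;
    counts[match[1]] = (counts[match[1]] || 0) + 1;
  }
  return counts;
}

// Turn the tally into readable hints for the codes we know how to interpret.
function explain(counts) {
  return Object.keys(counts)
    .filter((code) => HINTS[code])
    .map((code) => `${code} x${counts[code]}: ${HINTS[code]}`);
}

const counts = tallyStatuses(sampleLog);
console.log(counts); // { '200': 1, '499': 1, '502': 2, '504': 1 }
console.log(explain(counts).join('\n'));
```

In practice you would feed this from the real access log and group by upstream address as well, so a single failing backend stands out.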

Production incident · Post-mortem · Severity: high

The Health Check Black Hole: 40% of Requests Vanish Into Healthy-Looking Dead Servers

Symptom

Monitoring shows 40% of HTTP requests returning 503 errors despite every server reporting green in the load balancer dashboard. CPU on the affected servers is near zero — they're not processing anything. Application logs show no incoming requests at all, which rules out application-level errors. The LB logs show connections being established and immediately dropped on the backend side.

Assumption

The load balancer dashboard is green for all backends, so the engineering team assumes the problem must be downstream — maybe the database is down, or a dependent microservice is timing out. Two engineers spend 25 minutes digging through database connection pool metrics before someone thinks to curl a backend server directly.

Root cause

The health check was configured as a simple TCP connect probe on port 8080. Three things were happening simultaneously after the deployment. First, servers that had crashed internally but whose OS still held the socket open passed the TCP check — the kernel accepted the connection, but the application wasn't there to handle it. Second, servers in a long JVM warmup phase also passed — they accepted the TCP connection but couldn't serve HTTP requests before the health check timeout. Third, two servers had entered a stop-the-world GC pause that lasted longer than the health check interval, so they appeared healthy between pauses but were unresponsive during them. The LB had no visibility into any of this because it was only checking 'can I open a socket.'

Fix

1. Replace all TCP health checks with HTTP GET /healthz endpoints that validate database connectivity, cache reachability, and the readiness of critical dependencies, not just that the process is running.
2. Add a readiness gate: the /healthz endpoint must return 200 only after the application has fully initialized, completed warmup, and successfully connected to its dependencies. During startup, return 503.
3. Configure consecutive failure thresholds: 3 consecutive failures before marking a server unhealthy, 2 consecutive successes before returning it to rotation. This prevents flapping during transient network hiccups.
4. Implement connection draining: when a server is removed from rotation, wait for in-flight requests to complete (up to a configurable drain timeout, typically 30 seconds) before cutting the connection. Abrupt removal mid-request is a guaranteed user-facing error.
5. Add health check transition alerting: alert when a server's health state changes, not just when it's unhealthy. Frequent transitions are a signal of instability that steady-state monitoring won't catch.

Key lesson

A TCP port accepting connections does not mean the application behind it is ready or capable of serving traffic. These are completely different things.
Health checks must validate end-to-end application readiness — database connected, cache reachable, dependencies healthy — not just socket availability.
JVM warmup and GC pauses are real, predictable events. Your health check design must account for them or they'll cause exactly this kind of incident.
Always configure connection draining — abruptly cutting traffic to a server mid-request causes user-facing errors that are entirely preventable.
Monitor health check state transitions, not just current state. A server that flips between healthy and unhealthy 20 times per hour is a problem your dashboard's green dot will never show you.
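Transition monitoring, the last lesson above, can be sketched as a sliding-window counter. This is an illustrative model rather than a feature of any specific monitoring tool: `FlapDetector` is a hypothetical class, and the one-hour window and limit of five transitions are assumed values.

```javascript
// Counts health-state transitions per server inside a sliding time window.
class FlapDetector {
  constructor({ windowMs = 60 * 60 * 1000, maxTransitions = 5 } = {}) {
    this.windowMs = windowMs;
    this.maxTransitions = maxTransitions;
    this.lastState = new Map();   // server -> last observed health state
    this.transitions = new Map(); // server -> timestamps of recent state changes
  }

  // Record an observation; returns true when the server should be flagged as flapping.
  observe(server, healthy, nowMs = Date.now()) {
    const previous = this.lastState.get(server);
    this.lastState.set(server, healthy);
    if (previous === undefined || previous === healthy) return false;

    const cutoff = nowMs - this.windowMs;
    const recent = (this.transitions.get(server) || []).filter((t) => t > cutoff);
    recent.push(nowMs);
    this.transitions.set(server, recent);
    return recent.length > this.maxTransitions;
  }
}

const detector = new FlapDetector({ maxTransitions: 2 });
let healthy = true;
for (let second = 0; second < 4; second += 1) {
  const flapping = detector.observe('app-1', healthy, second * 1000);
  console.log(`t=${second}s healthy=${healthy} flapping=${flapping}`);
  healthy = !healthy; // simulate a server that flips on every check
}
// The final observation (t=3s) reports flapping=true: three transitions exceed the limit of two.
```

A server that stays in one state never triggers the alert, no matter whether that state is healthy or unhealthy, which is exactly the gap a current-state dashboard leaves open.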

Production debug guide: Symptom → Action mapping for common LB failures (5 entries)

Symptom · 01

Traffic black hole — requests return 503 despite servers appearing healthy in the dashboard

→

Fix

Bypass the load balancer completely and curl the backend servers directly on their health endpoint. If the direct request succeeds and the LB-routed request fails, you have a health check misconfiguration — the LB is either checking the wrong endpoint, using TCP instead of HTTP, or the timeout is too short for your application's response time. Switch to HTTP-level health checks that validate actual application readiness including database connectivity.

Symptom · 02

Uneven load distribution — one server at 95% CPU while others idle at 10%

→

Fix

Check for two common culprits: sticky session misconfiguration pinning a disproportionate share of users to one server, and long-lived connections (WebSockets, streaming, gRPC) accumulating on whichever server happened to get them first. Inspect current connection counts per backend using ss or netstat. If using IP Hash, verify whether all traffic originates from a single NAT gateway — if so, every request hashes to the same server. Switch to Least Connections for long-lived connection workloads.

Symptom · 03

Intermittent SSL handshake failures at the load balancer

→

Fix

Test the TLS handshake directly against the LB with openssl s_client. Check certificate expiration and chain completeness — an intermediate certificate missing from the chain causes failures in some clients but not others, making it intermittent. Verify TLS version and cipher suite compatibility between LB and backends if you're doing TLS re-wrapping. Check whether failures correlate with specific client versions or spike during high traffic, which would point to CPU exhaustion on the LB.

Symptom · 04

Connection pool exhaustion under moderate load

→

Fix

Inspect keepalive settings between the LB and backends. Without connection reuse, every request opens a new TCP connection — expensive under load. Ensure keepalive is enabled and configured correctly (keepalive 32 in NGINX means 32 idle keepalive connections per worker to each upstream). Check for connection leaks in application code — connections that are opened but not properly returned to the pool. Monitor the LB's active vs idle connection counts over time.

Symptom · 05

Backend servers healthy but response times spiking significantly

→

Fix

Health checks confirm reachability, not performance. A server can be healthy and slow simultaneously. Check whether the LB algorithm accounts for response time — if using Round Robin, a slow server gets the same traffic as a fast one. Consider switching to Least Response Time or Least Connections. Also check whether connection draining from a recent deployment is causing a traffic imbalance as some backends handle both old and new connections.

Load Balancing Quick Debug Cheat Sheet: Immediate diagnostic commands when load balancing breaks in production.

Upstream servers showing as down in LB logs

Immediate action

Verify the health endpoint directly, completely bypassing the load balancer. This tells you immediately whether the problem is the server or the LB's health check configuration.

Commands

curl -v http://<server-ip>:8080/healthz

kubectl get pods -l app=backend -o wide

Fix now

If the health endpoint fails when curled directly, restart the pod or investigate the application startup logs. If it succeeds directly but the LB marks it unhealthy, check the LB's health check timeout — it may be shorter than your application's response time for /healthz. Also confirm the LB is checking the right port and path.

Load Balancing Algorithm Comparison

| Algorithm | Strategy | Best For | Watch Out For |
|---|---|---|---|
| Round Robin | Sequential distribution: each server gets the next request in rotation | Clusters with identical server specs and uniform request processing time | Falls apart when requests have variable processing times. A slow request on Server 1 doesn't reduce what it receives next. |
| Weighted Round Robin | Round robin with proportional traffic share based on configured weight | Mixed hardware environments: legacy vs new instances, different instance types | Static weights drift over time. A server's effective capacity changes with memory pressure and GC history. Weights must be updated dynamically. |
| Least Connections | Routes to whichever server currently has the fewest active connections | Long-lived requests: streaming, heavy database queries, WebSocket connections | Connection count doesn't equal load. A server with 5 slow connections may be more loaded than one with 20 fast ones. Still better than Round Robin for most workloads. |
| Least Response Time | Routes to the server with the lowest current TTFB (Time to First Byte) | Latency-sensitive workloads where response time variance between servers matters | Requires the LB to actively probe or measure backend latency, which adds overhead. Can cause traffic oscillation if server speeds fluctuate rapidly. |
| IP Hash | Routes each client to a consistent backend based on a hash of their source IP | Stateful applications that need session affinity without cookie-based sticky sessions | All traffic behind a corporate NAT or proxy hashes to the same backend. Adding or removing servers changes the hash distribution and breaks affinity. |
| Random with Two Choices | Pick two servers at random, route to whichever has fewer connections | Very large server pools where maintaining full state is expensive | Less predictable distribution than Least Connections. Better than pure random but doesn't match Least Connections for accuracy. |
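The two IP Hash caveats in the table are easy to reproduce. The sketch below uses the simplest possible variant, a hash taken modulo the pool size; real load balancers may use consistent hashing to soften the second problem. The server names and client IPs are invented, and `pickByIpHash` and `countRemapped` are hypothetical helpers.

```javascript
const crypto = require('crypto');

// Map a client IP to a backend by hashing the IP and taking it modulo the pool size.
function pickByIpHash(clientIp, servers) {
  const digest = crypto.createHash('md5').update(clientIp).digest();
  return servers[digest.readUInt32BE(0) % servers.length];
}

const fourServers = ['app-1', 'app-2', 'app-3', 'app-4'];
const threeServers = fourServers.slice(0, 3); // app-4 removed from the pool

// Caveat 1: every user behind one NAT gateway shares a source IP,
// so they all land on the same backend.
const natIp = '203.0.113.50';
const natUsers = ['alice', 'bob', 'carol'];
const natBackends = new Set(natUsers.map(() => pickByIpHash(natIp, fourServers)));
console.log(`NAT users spread across ${natBackends.size} backend(s)`); // 1

// Caveat 2: count how many of 1,000 synthetic client IPs land on a different
// server once the pool shrinks from four servers to three.
function countRemapped(before, after) {
  let moved = 0;
  for (let i = 0; i < 1000; i += 1) {
    const ip = `10.${Math.floor(i / 256)}.${i % 256}.7`;
    if (pickByIpHash(ip, before) !== pickByIpHash(ip, after)) moved += 1;
  }
  return moved;
}

console.log(`${countRemapped(fourServers, threeServers)} of 1000 clients changed backend`);
```

With plain modulo hashing, most clients move when the pool size changes, including clients that were never on the removed server. That is the affinity break the table warns about.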

Key takeaways

Load balancing is the primary mechanism for horizontal scalability: it's what lets you add servers instead of just making one server bigger. But it only works correctly if health checks are accurate, algorithms match the workload, and state is externalized.

Layer 4 is faster and simpler. Layer 7 is smarter and more flexible. Most production systems at scale use both: Layer 4 at the network edge for raw throughput, Layer 7 internally for content-aware microservice routing.

Health checks are the foundation everything else depends on. TCP-only checks are inadequate: they validate that a socket accepts connections, not application readiness. Deep HTTP health checks that validate dependencies are the minimum acceptable standard for production.

Least Connections is the safest default algorithm for modern web applications with variable request processing times. Round Robin's implicit assumption, that all requests take roughly the same time, is almost never true in practice.

Sticky sessions are a trap at scale. They defeat load balancing, create hotspots, and cause mass session loss when a pinned server dies. The correct answer is stateless application servers backed by Redis for session state.
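The stateless alternative to sticky sessions can be sketched with a shared store. To keep the example self-contained, an in-memory `Map` stands in for Redis; `SharedSessionStore` and `createAppServer` are hypothetical names, and in a real deployment every application server would talk to the same external store over the network.

```javascript
// In-memory stand-in for a shared store such as Redis. In production every
// app server would talk to the same external store instead of a local Map.
class SharedSessionStore {
  constructor() {
    this.data = new Map();
  }
  async get(sessionId) {
    return this.data.get(sessionId) ?? null;
  }
  async set(sessionId, session) {
    this.data.set(sessionId, session);
  }
}

// A "server" keeps no session state of its own; it only reads and writes the store.
function createAppServer(name, store) {
  return {
    name,
    async handle(sessionId, item) {
      const session = (await store.get(sessionId)) ?? { cart: [] };
      session.cart.push(item);
      await store.set(sessionId, session);
      return { servedBy: name, cart: [...session.cart] };
    },
  };
}

async function demo() {
  const store = new SharedSessionStore();
  const serverA = createAppServer('app-1', store);
  const serverB = createAppServer('app-2', store);
  await serverA.handle('sess-42', 'book'); // first request lands on app-1
  return serverB.handle('sess-42', 'pen'); // next request lands on app-2
}

demo().then((result) => console.log(result));
// { servedBy: 'app-2', cart: [ 'book', 'pen' ] }
```

Because the cart lives in the store, the second request succeeds on a different server, and losing `app-1` would not log anyone out.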

Common mistakes to avoid

5 patterns

Using TCP-only health checks in production

Symptom

The load balancer dashboard shows all backends as healthy while 503 errors climb in your application monitoring. Servers that have crashed internally but whose OS still holds the socket open pass the TCP check. Servers in JVM warmup or GC pause accept the TCP connection but can't process HTTP requests. Users see failures that your LB has no visibility into.

Fix

Replace TCP health checks with HTTP health checks on a /healthz or /health/ready endpoint that validates actual application readiness — database connected, cache reachable, dependencies healthy. Return 200 only when the server can genuinely handle a request. Configure 3 consecutive failures before removing from rotation and 2 consecutive successes before re-adding. Never go back to TCP-only checks.

Ignoring SSL termination overhead until it becomes a crisis

Symptom

The load balancer CPU spikes to 100% under TLS-heavy traffic while backend servers sit idle at 15%. TLS handshake latency increases dramatically during traffic bursts. The LB becomes the bottleneck, and increasing backend capacity does nothing to fix it because the constraint is at the LB layer.

Fix

Size the LB instance for the computational cost of TLS termination. This is frequently under-provisioned because the cost is invisible until load hits. Use TLS 1.3, which requires fewer round trips than 1.2. Enable TLS session resumption via session tickets to avoid full handshakes for returning clients. For AWS deployments, terminating TLS on an NLB moves the handshake work onto AWS-managed infrastructure, so it no longer competes for CPU on instances you size yourself.

Hardcoding server IPs in the upstream configuration

Symptom

A server is replaced during a scaling event or a failed instance is rebuilt with a new IP. The LB still routes traffic to the old IP — health checks eventually mark it unhealthy, but the new instance at the new IP receives zero traffic. Scaling events require manual config changes and a reload. In a cloud environment where IPs change constantly, this becomes a persistent operational burden.

Fix

Use DNS-based service discovery to populate the LB's server pool dynamically. Kubernetes Services handle this automatically. For non-Kubernetes environments, Consul or similar service registries provide DNS records that update as instances come and go. Configure the LB to resolve DNS names rather than cache IPs — set resolver directives in NGINX to control TTL.

Over-reliance on sticky sessions as a substitute for stateless application design

Symptom

One server is at 95% CPU while others idle at 10% — a subset of high-traffic users are all pinned to the same backend. When that server dies, every user pinned to it gets logged out simultaneously. Scaling the cluster doesn't help because new servers receive no traffic from existing sessions. Deployments require careful session migration planning.

Fix

Move session state to Redis and make application servers stateless. Any stateless server can handle any user's request, so servers can scale freely and be replaced without user impact. If sticky sessions are truly unavoidable for a legacy system, set a TTL on the session cookie to bound the maximum duration of affinity, and monitor per-server connection distribution actively.

No connection draining on server removal

Symptom

During deployments or auto-scaling scale-in events, in-flight requests to servers being removed are abruptly terminated. Users experience random errors mid-request — form submissions that don't complete, API calls that return connection reset errors. The error rate spikes exactly when you're deploying, making it easy to blame the new code rather than the removal mechanics.

Fix

Configure connection draining on your LB: a grace period during which the server is removed from rotation for new requests but allowed to complete in-flight ones. AWS ALB calls this deregistration delay (default 300 seconds, often should be tuned lower to 30-60 seconds based on your request SLA). NGINX Plus supports a drain parameter on upstream servers. Set this to at least your p99 request duration, not the default.
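On the application side, draining pairs naturally with the readiness endpoint: start failing readiness so the LB stops sending new requests, then wait for in-flight work to finish. The following is a minimal sketch of that bookkeeping; `DrainController` is a hypothetical class, and wiring it into an HTTP server and a SIGTERM handler is left out.

```javascript
// Tracks in-flight requests and lets shutdown wait for them to finish.
class DrainController {
  constructor() {
    this.draining = false;
    this.inFlight = 0;
    this.waiters = [];
  }

  // Status the readiness endpoint should return right now.
  readinessStatus() {
    return this.draining ? 503 : 200;
  }

  requestStarted() {
    this.inFlight += 1;
  }

  requestFinished() {
    this.inFlight -= 1;
    if (this.inFlight === 0) {
      this.waiters.splice(0).forEach((notify) => notify());
    }
  }

  // Begin draining. Resolves 'drained' once in-flight work reaches zero,
  // or 'timeout' if the deadline passes first.
  drain(timeoutMs = 30000) {
    this.draining = true;
    if (this.inFlight === 0) return Promise.resolve('drained');
    return new Promise((resolve) => {
      const timer = setTimeout(() => resolve('timeout'), timeoutMs);
      this.waiters.push(() => {
        clearTimeout(timer);
        resolve('drained');
      });
    });
  }
}

const controller = new DrainController();
controller.requestStarted();                 // one request is still running
const finished = controller.drain(1000);     // shutdown begins
console.log(controller.readinessStatus());   // 503
controller.requestFinished();                // the request completes
finished.then((outcome) => console.log(outcome)); // drained
```

The timeout exists so a stuck request cannot block a deployment forever. It plays the same role as the LB-side drain timeout and should be set with the same p99 reasoning.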

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Design a system that handles 1 million concurrent users. Where do you pl...

Q02 · SENIOR

How does the 'Least Connections' algorithm differ from 'Least Response T...

Q03 · SENIOR

What is SSL Termination and why is it used at the Load Balancer level?

Q04 · JUNIOR

What happens if the health check fails, and how do you prevent a server ...

Q01 of 04 · SENIOR

Design a system that handles 1 million concurrent users. Where do you place the load balancers and what type at each tier?

ANSWER

This needs a multi-tier approach: no single load balancer handles everything at this scale.

Tier 1 (DNS-level routing): Route 53 or equivalent with latency-based or geolocation routing directs users to the nearest regional data center. This handles continent-level distribution and provides automatic failover between regions. Not a traditional LB but performs the same conceptual function at global scale.

Tier 2 (Edge/Network LB, Layer 4): AWS NLB or GCP Network LB at the edge of each region. These handle raw TCP/UDP at line rate with sub-millisecond overhead. Their job is to absorb the raw connection volume, terminate TLS if needed (using hardware acceleration), and distribute traffic across the next tier. At 1M concurrent users, this tier needs to be sized for connection rate, not just bandwidth, because each new connection requires CPU for the TLS handshake.

Tier 3 (Application LB, Layer 7): NGINX, HAProxy, or AWS ALB sitting behind the NLB. This is where content-aware routing happens: /api/payments routes to the payments service pool, /api/media to the media service pool. This tier also handles authentication offloading, canary deployments, and A/B testing. Use the Least Connections algorithm here.

Tier 4 (Service mesh, east-west): Envoy sidecars or similar for service-to-service communication inside the cluster. Circuit breaking, retry logic, and mTLS between microservices. This isn't usually what people think of as 'load balancing' but it's distributing traffic across service instances.

The critical design constraint at every tier: no single LB is a single point of failure. Active-active pairs with shared virtual IPs (VRRP/Keepalived for on-premise, cloud-managed for AWS/GCP) at each tier. The architecture must survive the failure of any single component without user impact.
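The content-aware routing done at the Layer 7 tier can be sketched as a longest-prefix match over a route table. The two API prefixes come from the answer above; the catch-all `web` pool and the `selectPool` helper are assumptions added for illustration.

```javascript
// Route table for the Layer 7 tier. The '/' catch-all pool is hypothetical.
const routes = [
  { prefix: '/api/payments', pool: 'payments' },
  { prefix: '/api/media', pool: 'media' },
  { prefix: '/', pool: 'web' },
];

// Pick the pool whose prefix is the longest match for the request path.
function selectPool(path, routeTable) {
  const matches = routeTable.filter((route) => path.startsWith(route.prefix));
  if (matches.length === 0) return null;
  return matches.reduce((a, b) => (b.prefix.length > a.prefix.length ? b : a)).pool;
}

console.log(selectPool('/api/payments/charge', routes)); // payments
console.log(selectPool('/api/media/42.jpg', routes));    // media
console.log(selectPool('/index.html', routes));          // web
```

Once a pool is chosen, the balancing algorithm (Least Connections in the answer above) picks a server inside that pool, so routing and balancing stay separate decisions.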

FAQ · 4 QUESTIONS

Frequently Asked Questions

What is the difference between Horizontal and Vertical Scaling?

Can a Load Balancer become a Single Point of Failure?

What happens if the health check fails?

When should I use IP Hash instead of session cookies for affinity?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.


May 24, 2026

last updated
