Load Balancing Health Check Black Hole — 40% Vanish
TCP health checks miss JVM GC pauses and warmup, so 40% of requests fail with 503 errors.
- A load balancer distributes incoming traffic across multiple servers so no single machine becomes a bottleneck or single point of failure
- Health checks are the heartbeat — they probe servers continuously and remove dead ones from rotation automatically; TCP-only checks are a trap
- Algorithms (Round Robin, Least Connections, IP Hash, Weighted, Least Response Time) decide which server gets each request based on different signals
- Layer 4 (TCP/UDP) is faster; Layer 7 (HTTP/HTTPS) is smarter — it can inspect cookies, headers, URL paths, and make content-aware routing decisions
- Session persistence (sticky sessions) keeps users on one server but creates hotspots and causes mass session loss if that server dies
- The biggest trap: skipping health checks or using TCP-only probes means your LB becomes a black hole, routing traffic into servers that can't respond
- In production, no single load balancer handles everything — DNS, edge, Layer 7 gateway, and service mesh each own a different tier
Imagine a busy McDonald's with 6 cashiers. A greeter at the door watches all the lines and sends you to whichever cashier is least busy — not just the first one in rotation. If one cashier goes on break or calls in sick, the greeter stops sending people their way entirely. If one cashier is twice as fast as the others, the greeter sends them twice as many customers. That greeter IS the load balancer. The cashiers are your servers. The system that checks whether a cashier is available and actually working — not just standing at their register staring at a frozen screen — is the health check. And the strategy the greeter uses to pick a cashier — shortest queue, round-robin rotation, same cashier you had last time, or the fastest one right now — is the load balancing algorithm. Everything else in this article is just the details of how that greeter makes smarter decisions at internet scale.
Every time you tap 'Buy Now' on Amazon or start a video on Netflix, your request hits one of hundreds or thousands of servers — chosen in milliseconds by a load balancer you never see. Without it, modern internet-scale applications simply couldn't exist.
The core problem is deceptively simple: distribute work across many machines so no single machine becomes a bottleneck, a single point of failure, or a performance nightmare. Without load balancing, one server handles everything until it buckles under the weight. With it, traffic is spread intelligently, failed servers are automatically removed from the pool, and new capacity can be added without touching the rest of the system.
But 'load balancer' is not a single thing. It's a tier — sometimes multiple tiers — of components that each own a different slice of the problem. DNS-level routing decides which data center gets your request. A network load balancer handles the raw TCP connection at line rate. An application load balancer inspects your HTTP headers and routes you to the right microservice. A service mesh sidecar manages the connection between that microservice and the next one in the chain. Understanding where each layer sits, what decisions it can make, and what its failure modes look like is what separates engineers who can configure a load balancer from engineers who can design a system that stays up when things go wrong.
By the end of this article you'll understand what load balancers are, which components make them tick, when to use Round Robin vs Least Connections vs Least Response Time, why sticky sessions can be a trap at exactly the wrong moment, and how to answer the load balancing questions that trip people up in system design interviews at senior level.
The Core Mechanics: How a Load Balancer Decides Where to Send Traffic
A load balancer sits between the client and your server pool. When a request arrives, it has to make a routing decision in milliseconds — which server gets this connection, right now, given the current state of the cluster.
That decision happens at one of two layers, and the layer matters more than most people realize. Layer 4 load balancers operate at the transport layer — they see IP addresses, TCP/UDP ports, and packet counts. They don't open the envelope. Layer 7 load balancers operate at the application layer — they can read the HTTP method, URL path, headers, cookies, and request body. They know the difference between a GET /api/images request and a POST /api/payments request and can route them to entirely different server pools.
The trade-off is straightforward: Layer 4 is faster because there's almost nothing to parse. Layer 7 is more expensive computationally because it has to terminate the connection, parse the HTTP request, make a routing decision, and then establish a new connection (or reuse a keepalive connection) to the backend. In practice, that overhead is typically 0.5–2ms per request — negligible for most applications, meaningful for high-frequency trading or real-time gaming.
In production, you usually don't pick one. The standard architecture is a Layer 4 network load balancer at the edge handling raw TCP connections at line rate, with Layer 7 application load balancers behind it doing content-aware routing to specific service pools. AWS calls these NLB and ALB. On premises, you'd see HAProxy in TCP mode in front of NGINX instances.
- Layer 4: No payload inspection; routing uses only IP addresses and ports. Lowest latency (~microseconds). Ideal for raw TCP/UDP traffic like gaming servers, video streaming, or any protocol that isn't HTTP.
- Layer 7: Reads cookies, headers, URL paths, and HTTP methods. Enables A/B testing, canary deployments, microservice routing by path, and authentication offloading.
- Rule of thumb: if your routing decision requires knowing anything about the request content, you need Layer 7. If you only need to balance load across identical servers, Layer 4 is enough.
- Performance cost of Layer 7: typically 0.5–2ms additional latency per request due to connection termination, TLS handling, and HTTP parsing.
- Standard production architecture: Layer 4 NLB at the edge absorbs raw connection volume, Layer 7 ALB/NGINX behind it makes content-aware routing decisions per service.
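The rule of thumb above can be sketched as a tiny Layer 7 routing table. This is illustrative only: the class name `PathRouter`, the pool names, and the path prefixes are hypothetical, and a real gateway (NGINX, HAProxy, Envoy) would express the same idea in configuration rather than application code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of Layer 7 content-aware routing: choose a backend pool
// from the URL path. A Layer 4 balancer never sees the path, so it cannot do this.
final class PathRouter {
    // Insertion order matters: the first matching prefix wins.
    private final Map<String, String> prefixToPool = new LinkedHashMap<>();
    private final String defaultPool;

    PathRouter(String defaultPool) {
        this.defaultPool = defaultPool;
    }

    PathRouter route(String pathPrefix, String pool) {
        prefixToPool.put(pathPrefix, pool);
        return this;
    }

    String poolFor(String requestPath) {
        for (Map.Entry<String, String> rule : prefixToPool.entrySet()) {
            if (requestPath.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return defaultPool; // no rule matched
    }
}
```

With rules for `/api/images` and `/api/payments`, the two request types from the earlier example land in different pools, while everything else falls through to the default pool.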
OSI Layer 4 vs Layer 7 — Visual Comparison
Understanding the OSI layer at which a load balancer operates is fundamental to designing your architecture. Layer 4 (transport layer) and Layer 7 (application layer) operate at completely different levels of abstraction, and the decision between them determines what information is available for routing, how much overhead is added, and what failure modes you must design for.
The diagram below shows the two layers side-by-side, with their respective capabilities, overhead, and typical use cases. At Layer 4, the load balancer sees only IP addresses, ports, and TCP flags — packets are forwarded without inspection, making it extremely fast but completely unaware of request content. At Layer 7, the load balancer terminates the TCP connection, performs TLS termination, and then parses the HTTP request to extract cookies, headers, URL paths, and even the request body. This enables content-aware routing, but adds latency from connection termination and parsing.
In practice, production systems rarely pick one over the other — they use both in a tiered architecture. A Layer 4 NLB at the network edge handles raw connection volume at line rate and distributes traffic to a pool of Layer 7 application load balancers (NGINX, HAProxy, Envoy) that perform content-aware routing to specific microservices. This hybrid approach gives you the throughput of Layer 4 at the edge with the intelligence of Layer 7 inside.
Load Balancing Algorithms in Depth: Choosing the Right Strategy for Your Workload
The algorithm your load balancer uses to select a backend is not a configuration detail — it's a decision that directly affects your latency distribution, your server utilization, and what happens when servers become slow rather than fully dead.
Round Robin is the simplest: request 1 goes to server 1, request 2 to server 2, and so on, cycling back to the beginning. It works well when all servers are identical and all requests take roughly the same time to process. Both of those assumptions break in practice. Servers are rarely perfectly identical after weeks of different memory allocations and GC histories. And requests are almost never uniform — a request that triggers a complex database join takes 50x longer than one hitting a cache.
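A minimal Round Robin sketch, assuming backends are identified by plain strings (the class name `RoundRobinSelector` is hypothetical). It uses `Math.floorMod` so the index stays valid even after the counter wraps around to a negative value.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of Round Robin selection over a fixed list of backends.
final class RoundRobinSelector {
    private final List<String> backends;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinSelector(List<String> backends) {
        if (backends.isEmpty()) {
            throw new IllegalArgumentException("at least one backend is required");
        }
        this.backends = List.copyOf(backends);
    }

    String next() {
        // floorMod keeps the index non-negative even after the counter
        // wraps past Integer.MAX_VALUE to Integer.MIN_VALUE.
        int index = Math.floorMod(counter.getAndIncrement(), backends.size());
        return backends.get(index);
    }
}
```

Note what this sketch ignores: it knows nothing about how busy or how slow each backend is, which is exactly the weakness described above.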
Least Connections routes each new request to whichever server currently has the fewest active connections. This self-corrects automatically: a slow server holds its connections longer, so its count stays high and the LB naturally sends it fewer new ones. This is why Least Connections is the safer default for most production web applications.
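Here is one way Least Connections could be sketched, assuming the caller reports when each request finishes. The class and method names (`LeastConnectionsSelector`, `acquire`, `release`) are hypothetical, and ties are broken by configuration order for simplicity.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of Least Connections: track in-flight requests per backend
// and send each new request to the backend with the lowest count.
final class LeastConnectionsSelector {
    private final Map<String, Integer> active = new LinkedHashMap<>();

    LeastConnectionsSelector(Iterable<String> backends) {
        for (String backend : backends) {
            active.put(backend, 0);
        }
        if (active.isEmpty()) {
            throw new IllegalArgumentException("at least one backend is required");
        }
    }

    // Picks the least-loaded backend and counts the new request against it.
    synchronized String acquire() {
        String best = null;
        int bestCount = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : active.entrySet()) {
            if (e.getValue() < bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        active.merge(best, 1, Integer::sum);
        return best;
    }

    // Must be called when the request completes, or counts drift upward.
    synchronized void release(String backend) {
        active.computeIfPresent(backend, (name, count) -> Math.max(0, count - 1));
    }
}
```

A slow backend releases its requests late, so its count stays high and `acquire()` steers new work elsewhere without any explicit notion of "slow".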
Least Response Time takes the next step: instead of counting connections, it measures actual backend latency, typically time to first byte (TTFB), and routes to whichever server is responding fastest right now. This is more accurate but requires the LB to actively probe or measure response times, which adds some overhead. It's the right choice for latency-sensitive workloads where a server can be 'healthy' but slow.
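A rough sketch of Least Response Time, assuming latency is smoothed with an exponentially weighted moving average (EWMA). The smoothing weight of 0.2 and the class name `LeastResponseTimeSelector` are assumptions for illustration; real load balancers use their own measurement schemes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of Least Response Time: keep a smoothed latency per backend
// and route to the backend with the lowest smoothed value.
final class LeastResponseTimeSelector {
    private static final double ALPHA = 0.2; // assumed weight of the newest sample
    private final Map<String, Double> ewmaMillis = new LinkedHashMap<>();

    LeastResponseTimeSelector(Iterable<String> backends) {
        for (String backend : backends) {
            ewmaMillis.put(backend, 0.0); // 0.0 means "not measured yet"
        }
    }

    // Record an observed response time (for example, TTFB) for a backend.
    synchronized void record(String backend, double observedMillis) {
        ewmaMillis.computeIfPresent(backend, (name, old) ->
                old == 0.0 ? observedMillis
                           : ALPHA * observedMillis + (1 - ALPHA) * old);
    }

    // Returns the backend with the lowest smoothed latency, or null if none exist.
    // Unmeasured backends (0.0) are tried first so every backend gets sampled.
    synchronized String pick() {
        String best = null;
        double bestMillis = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : ewmaMillis.entrySet()) {
            if (e.getValue() < bestMillis) {
                best = e.getKey();
                bestMillis = e.getValue();
            }
        }
        return best;
    }
}
```

The smoothing is a design choice: routing on the single latest sample would make the selector overreact to one slow request.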
IP Hash routes each client to the same backend based on a hash of their source IP. This provides a form of session affinity without application-level cookies. The significant risk: if all your users are behind a corporate NAT gateway, they share one source IP and all hash to the same backend. Also, when a backend is added or removed, a simple hash-modulo mapping changes for most clients and users get redistributed, breaking any state you were relying on affinity to preserve.
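The hash-modulo form of IP Hash can be sketched in a few lines. `String.hashCode()` stands in for whatever hash a real load balancer uses, and the class name `IpHashSelector` is hypothetical.

```java
import java.util.List;

// Sketch of IP Hash: the same client IP always maps to the same backend
// as long as the backend list does not change.
final class IpHashSelector {
    static String pick(String clientIp, List<String> backends) {
        // floorMod avoids a negative index when hashCode() is negative.
        int index = Math.floorMod(clientIp.hashCode(), backends.size());
        return backends.get(index);
    }
}
```

Calling `pick` with the same IP and a four-backend list always returns the same backend. Call it again with one backend removed and many IPs land somewhere new, which is the redistribution risk described above. Every client behind one NAT address presents the same `clientIp`, so they all get the same answer.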
Weighted variants of Round Robin and Least Connections let you express that some servers have more capacity than others. A server with weight=3 gets three times the share of a server with weight=1. This is essential in mixed hardware environments.
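One possible sketch of Weighted Round Robin uses the "smooth" variant, which interleaves picks instead of sending bursts to the heaviest server. This is an illustration under assumptions, not the article's own implementation; the class name `SmoothWeightedRoundRobin` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of smooth weighted round robin: on every pick, each backend's
// running score grows by its weight, the highest score wins, and the
// winner's score is reduced by the total weight.
final class SmoothWeightedRoundRobin {
    private final Map<String, Integer> weights = new LinkedHashMap<>();
    private final Map<String, Integer> current = new LinkedHashMap<>();
    private int totalWeight;

    // Each backend is expected to be added once, before traffic starts.
    synchronized SmoothWeightedRoundRobin add(String backend, int weight) {
        if (weight <= 0) {
            throw new IllegalArgumentException("weight must be positive");
        }
        weights.put(backend, weight);
        current.put(backend, 0);
        totalWeight += weight;
        return this;
    }

    synchronized String next() {
        if (weights.isEmpty()) {
            throw new IllegalStateException("no backends configured");
        }
        String best = null;
        int bestScore = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            int score = current.get(e.getKey()) + e.getValue();
            current.put(e.getKey(), score);
            if (score > bestScore) {
                best = e.getKey();
                bestScore = score;
            }
        }
        current.put(best, bestScore - totalWeight);
        return best;
    }
}
```

With weights 3 and 1, every cycle of four picks sends three requests to the heavier backend and one to the lighter one.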
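The Math.abs() call in a Java round-robin implementation is not defensive programming theater; it fixes a real production bug. AtomicInteger.getAndIncrement() wraps around to Integer.MIN_VALUE after 2^31 calls. Without a guard, the modulo of a negative counter can be negative, and using it as an array index throws ArrayIndexOutOfBoundsException. Note that Math.abs() has to wrap the modulo result rather than the raw counter, because Math.abs(Integer.MIN_VALUE) is itself negative; Math.floorMod() avoids the problem entirely.
Load Balancing Algorithm Matrix: Best Use Case Selection Guide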
Choosing the right load balancing algorithm is not a one-size-fits-all decision. The matrix below maps each algorithm to its ideal workload profile, along with the key risks when the assumptions behind the algorithm are violated. Use this as a quick reference when designing or debugging a load-balanced system.
Health Checks: The Component That Makes Everything Else Work
Health checks are the mechanism by which a load balancer knows which servers are actually capable of serving traffic right now. Everything else — algorithm, weights, session persistence — is irrelevant if the LB doesn't have accurate information about server state.
There are three types of health checks in common use, and understanding their trade-offs matters:
TCP health checks open a connection to the server's port and consider it healthy if the connection succeeds. Fast, low overhead, and completely inadequate for detecting application-level failures. The server's OS can accept a TCP connection while the application is in a GC pause, crashed internally, or waiting on a database connection that will never arrive.
HTTP health checks send an actual HTTP request to a designated endpoint (typically /health or /healthz) and validate the response code. This is the minimum acceptable standard for production. The endpoint must return a non-200 response if the application isn't ready to serve traffic — not just if the process is running.
Deep health checks go further: the /healthz endpoint actively validates downstream dependencies — can we connect to the database, is the cache reachable, are critical feature flags loaded. These checks are more expensive to run but catch a class of failures that HTTP-only checks miss: the application process is up, the port responds, but the database connection pool is exhausted and every request will fail.
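A deep health check can be sketched as a set of named dependency probes that together decide the status code. The class name `DeepHealthCheck` and the probe names in the usage note are hypothetical; wiring this into an actual `/healthz` HTTP handler depends on your framework and is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Sketch of a deep health check: the endpoint reports healthy only if
// every registered dependency probe succeeds.
final class DeepHealthCheck {
    private final Map<String, BooleanSupplier> probes = new LinkedHashMap<>();

    DeepHealthCheck probe(String name, BooleanSupplier check) {
        probes.put(name, check);
        return this;
    }

    // HTTP status the health endpoint should return: 200 or 503.
    int statusCode() {
        for (BooleanSupplier probe : probes.values()) {
            boolean ok;
            try {
                ok = probe.getAsBoolean();
            } catch (RuntimeException e) {
                ok = false; // a probe that throws counts as a failed dependency
            }
            if (!ok) {
                return 503;
            }
        }
        return 200;
    }
}
```

You might register probes such as `"database"` and `"cache"`, each doing a cheap round trip with a short timeout. Keep the probes cheap: the load balancer calls this endpoint on every check interval.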
The health check configuration details matter as much as the type. Check interval (how often), timeout (how long to wait for a response), unhealthy threshold (how many consecutive failures before removal), and healthy threshold (how many consecutive successes before re-addition) all interact. Misconfigure any of these and you get either flapping — servers rapidly cycling in and out of rotation — or a slow response to actual failures.
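The interaction between the two thresholds is easier to see as a small state machine. This is a sketch under assumptions: the class name `HealthStateTracker` is hypothetical, and the interval and timeout are left to whatever schedules the probes.

```java
// Sketch of threshold-based health state: a backend changes state only
// after enough consecutive probe results contradict its current state.
final class HealthStateTracker {
    private final int unhealthyThreshold;
    private final int healthyThreshold;
    private boolean healthy = true;
    private int streak = 0; // consecutive results that contradict the current state

    HealthStateTracker(int unhealthyThreshold, int healthyThreshold) {
        if (unhealthyThreshold < 1 || healthyThreshold < 1) {
            throw new IllegalArgumentException("thresholds must be at least 1");
        }
        this.unhealthyThreshold = unhealthyThreshold;
        this.healthyThreshold = healthyThreshold;
    }

    // Feed one probe result; returns whether the backend should be in rotation.
    synchronized boolean onProbe(boolean success) {
        if (success == healthy) {
            streak = 0; // result agrees with the current state
        } else {
            streak++;
            int needed = healthy ? unhealthyThreshold : healthyThreshold;
            if (streak >= needed) {
                healthy = !healthy;
                streak = 0;
            }
        }
        return healthy;
    }
}
```

A single failed probe does not remove the server, which damps flapping. The cost is detection time: with an assumed 10-second interval and an unhealthy threshold of 3, a dead server can keep receiving traffic for roughly 30 seconds before removal.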
Global Load Balancing (GSLB) — Routing Across Data Centers and Regions
Global Server Load Balancing (GSLB) extends load balancing beyond a single data center to distribute traffic across multiple geographic regions or cloud availability zones. At this scale, load balancing decisions are based on factors like proximity, latency, and data center health, often using DNS as the control plane rather than packet forwarding.
GSLB operates differently from local load balancing. Instead of inspecting individual packets, it manipulates DNS responses: when a client requests the IP address for your service (e.g., api.thecodeforge.io), the GSLB-enabled DNS server returns the IP of the nearest or healthiest data center. This happens at the DNS resolution step, before any TCP connection is established. The client then connects directly to that data center's front-end load balancer.
The diagram below shows the typical GSLB architecture. DNS resolution is the first routing decision point; it determines which regional cluster the client will hit. Within each cluster, standard Layer 4 and Layer 7 load balancers handle the traffic. If a data center goes down, the GSLB controller removes its IPs from DNS responses, and clients eventually (after TTL expiry) resolve to a healthy region.
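The DNS-level decision can be sketched as "pick the lowest-latency region that is currently healthy". Everything here is illustrative: the class name `GslbResolver`, the region names in the usage note, and the idea that per-region latency is already known are assumptions, and no real DNS API is involved.

```java
import java.util.Map;
import java.util.Optional;

// Sketch of a GSLB decision: choose the healthy region with the lowest
// estimated latency for this client. Unhealthy regions are never returned.
final class GslbResolver {
    static Optional<String> resolve(Map<String, Integer> regionLatencyMs,
                                    Map<String, Boolean> regionHealthy) {
        String best = null;
        int bestMs = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : regionLatencyMs.entrySet()) {
            // Regions with unknown health are treated as unhealthy.
            boolean healthy = regionHealthy.getOrDefault(e.getKey(), false);
            if (healthy && e.getValue() < bestMs) {
                best = e.getKey();
                bestMs = e.getValue();
            }
        }
        return Optional.ofNullable(best); // empty when no region is healthy
    }
}
```

If a region such as `eu-west` is marked unhealthy, new lookups resolve to the next-best region, but clients holding a cached answer keep using the old one until the TTL expires.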
Load Balancer Deployment Models — Hardware, Software, and Cloud
Load balancers come in three fundamental deployment forms: dedicated hardware appliances, software running on commodity servers, and cloud-managed services. The choice between them affects your upfront cost, operational complexity, scalability ceiling, and failure domain. The table below compares the three models across the dimensions that matter in production.
The Health Check Black Hole: 40% of Requests Vanish Into Healthy-Looking Dead Servers
- A TCP port accepting connections does not mean the application behind it is ready or capable of serving traffic. These are completely different things.
- Health checks must validate end-to-end application readiness — database connected, cache reachable, dependencies healthy — not just socket availability.
- JVM warmup and GC pauses are real, predictable events. Your health check design must account for them or they'll cause exactly this kind of incident.
- Always configure connection draining — abruptly cutting traffic to a server mid-request causes user-facing errors that are entirely preventable.
- Monitor health check state transitions, not just current state. A server that flips between healthy and unhealthy 20 times per hour is a problem your dashboard's green dot will never show you.
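Connection draining, mentioned in the list above, can be sketched as a per-backend gate: stop admitting new requests first, then remove the server only once in-flight work has finished. The class name `DrainingBackend` is hypothetical, and a production version would also enforce a maximum drain timeout.

```java
// Sketch of connection draining for a single backend.
final class DrainingBackend {
    private int inFlight = 0;
    private boolean draining = false;

    // Returns false once draining has started: no new requests are admitted.
    synchronized boolean tryAcquire() {
        if (draining) {
            return false;
        }
        inFlight++;
        return true;
    }

    // Called when a request finishes, whether it succeeded or failed.
    synchronized void release() {
        if (inFlight > 0) {
            inFlight--;
        }
    }

    synchronized void startDraining() {
        draining = true;
    }

    // True only after draining started and all in-flight requests completed.
    synchronized boolean isSafeToRemove() {
        return draining && inFlight == 0;
    }
}
```

The deploy script or orchestrator would call `startDraining()`, poll `isSafeToRemove()`, and only then shut the server down, so requests already in progress are allowed to complete.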
Common mistakes to avoid
- Using TCP-only health checks in production
- Ignoring SSL termination overhead until it becomes a crisis
- Hardcoding server IPs in the upstream configuration
- Over-reliance on sticky sessions as a substitute for stateless application design
- No connection draining on server removal
Interview Questions on This Topic
Design a system that handles 1 million concurrent users. Where do you place the load balancers and what type at each tier?