
What Is a Load Balancer? Types, Algorithms, and How They Work

πŸ“ Part of: Components β†’ Topic 2 of 18
Learn what a load balancer is, how it distributes traffic across servers, and the types and algorithms used.
πŸ§‘β€πŸ’» Beginner-friendly β€” no prior System Design experience needed
In this tutorial, you'll learn
  • Load balancers distribute traffic across servers for availability, scalability, and fault tolerance
  • Layer 4 routes by IP/port — fast and protocol-agnostic. Layer 7 routes by HTTP content — flexible but slower
  • Algorithm choice matters: least-connections for variable workloads, round-robin for uniform requests
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
  • A load balancer distributes incoming network traffic across multiple backend servers
  • It prevents any single server from becoming overwhelmed and improves availability
  • Layer 4 (transport) balances by IP and port; Layer 7 (application) balances by HTTP content
  • Health checks remove unhealthy servers from rotation automatically
  • Production outages often trace back to misconfigured health checks or missing connection draining
  • Biggest mistake: treating load balancing as set-and-forget without monitoring distribution skew
🚨 START HERE
Load Balancer Quick Debug Reference
Symptom-based guide to diagnosing load balancer issues
🟡 Uneven traffic distribution across backends
Immediate Action: Check load balancer algorithm and session persistence settings
Commands
kubectl get endpoints <service-name> -o wide
aws elbv2 describe-target-health --target-group-arn <arn>
Fix Now: Switch from sticky sessions to round-robin unless session affinity is explicitly required
🟡 Health checks failing across all backends simultaneously
Immediate Action: Test the health check endpoint directly on each backend server
Commands
curl -v http://localhost:8080/health
kubectl logs <pod-name> --tail=50 | grep -i health
Fix Now: Verify the health check path, port, and timeout are correct. Ensure the endpoint does not depend on external services
🟡 Connection refused errors after scaling event
Immediate Action: Verify new instances are registered and passing health checks
Commands
aws elbv2 describe-target-health --target-group-arn <arn> --query 'TargetHealthDescriptions[*].TargetHealth.State'
ss -tlnp | grep <backend-port>
Fix Now: Add a startup probe with a longer initial delay to prevent premature traffic routing
Production Incident: Load Balancer Health Check Misconfiguration Causes Complete Outage
A deployment triggered all backend servers to fail health checks simultaneously, removing every server from the load balancer pool.
Symptom: All HTTP requests returned 503 Service Unavailable for 12 minutes during a routine deployment. Zero servers were listed as healthy in the load balancer dashboard.
Assumption: The deployment introduced a bug in the application code that broke the health check endpoint.
Root cause: The health check endpoint performed a database query. During deployment, a schema migration locked the users table for 90 seconds. Every health check query timed out, marking all servers unhealthy simultaneously. The load balancer had no minimum healthy server threshold configured.
Fix: Changed the health check endpoint to a lightweight liveness probe that does not query the database. Added a separate readiness probe for database connectivity. Configured the load balancer to maintain at least one server in rotation even if unhealthy, using the minimum_healthy_hosts setting. Implemented connection draining with a 30-second grace period during deployments.
Key Lesson
  • Health check endpoints must be lightweight — never depend on external services
  • Separate liveness checks from readiness checks to prevent cascading removal
  • Configure minimum_healthy_hosts to prevent complete pool exhaustion
  • Always use connection draining during deployments to preserve in-flight requests
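The liveness/readiness split from this incident can be sketched with Python's standard library. The paths `/health/live` and `/health/ready` and the `DB_AVAILABLE` flag are hypothetical stand-ins for a real dependency check:

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_AVAILABLE = True  # hypothetical flag standing in for a real database ping

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health/live":
            # Liveness: the process is up. No external dependencies, so a
            # locked database table can never fail this check.
            self.send_response(200)
        elif self.path == "/health/ready":
            # Readiness: safe to receive traffic. Fails during the migration,
            # removing the server from rotation without restarting it.
            self.send_response(200 if DB_AVAILABLE else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the example output quiet

server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

def status(path):
    try:
        return urllib.request.urlopen(f"http://127.0.0.1:{port}{path}").status
    except urllib.error.HTTPError as e:
        return e.code

live_ok = status("/health/live")      # 200
DB_AVAILABLE = False                  # simulate the 90-second table lock
ready_down = status("/health/ready")  # 503: drained from rotation
live_still = status("/health/live")   # 200: no restart, no pool exhaustion
server.shutdown()
print(live_ok, ready_down, live_still)
```

With this split, the migration makes every backend temporarily not-ready (traffic pauses), but nothing is killed or restarted, so recovery is immediate once the lock releases.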
Production Debug Guide
Common symptoms when load balancing behaves unexpectedly
One server receives significantly more traffic than others → Check if session persistence (sticky sessions) is enabled. Verify the load balancing algorithm matches your traffic pattern. Inspect connection pooling behavior in clients.
Intermittent 502 or 503 errors during deployments → Enable connection draining with an adequate grace period. Verify health check frequency and thresholds allow for deployment lag. Check if new instances pass health checks before receiving traffic.
Latency spikes correlate with specific backend servers → Compare per-server request rates and response times. Check for noisy neighbor issues on shared infrastructure. Verify instance types are identical across the pool.
All requests fail after adding new servers to the pool → Verify new servers pass health checks before traffic is routed. Check security group and network ACL rules allow the load balancer to reach new instances. Confirm the application is fully started on new servers.

Load balancers are critical infrastructure components that distribute client requests across a pool of backend servers. They improve application availability, enable horizontal scaling, and provide fault tolerance by routing traffic away from failed instances. Every production web service behind more than one server requires a load balancing layer.

Misunderstanding load balancer algorithms, health check configurations, and session persistence mechanisms causes some of the most common production incidents. A misconfigured health check can remove all servers from rotation simultaneously, causing a complete outage. An incorrect algorithm choice can create hotspots where one server handles 80% of traffic while others sit idle.
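One rough way to catch that kind of skew in monitoring is the coefficient of variation of per-backend request counts over a window; a small sketch (the counts and alert interpretation are illustrative, not standard thresholds):

```python
from statistics import mean, pstdev

def distribution_skew(request_counts):
    """Coefficient of variation of per-backend request counts.
    Near 0 means an even distribution; values near or above 1 mean a hotspot."""
    m = mean(request_counts)
    return pstdev(request_counts) / m if m else 0.0

print(round(distribution_skew([1000, 980, 1020, 995]), 3))  # 0.014 - balanced
print(round(distribution_skew([4000, 100, 100, 100]), 3))   # 1.571 - one server dominating
```

Alerting on this single number per target group surfaces sticky-session pileups and bad weights long before the hot server saturates.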

What Is a Load Balancer?

A load balancer is a device or software component that distributes incoming network traffic across multiple backend servers. It acts as a single entry point for client requests and routes them to available servers based on a configured algorithm.

Load balancers solve three fundamental problems: availability by removing failed servers from rotation, scalability by enabling horizontal addition of servers, and performance by preventing any single server from becoming a bottleneck. Without a load balancer, every client would need to know individual server addresses, and a single server failure would cause service disruption.

io.thecodeforge.loadbalancer.core.py · PYTHON
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional
import time
import threading
from io.thecodeforge.loadbalancer.health import HealthChecker
from io.thecodeforge.loadbalancer.algorithms import LoadBalancingAlgorithm

class BackendState(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    DRAINING = "draining"

@dataclass
class BackendServer:
    host: str
    port: int
    weight: int = 1
    state: BackendState = BackendState.HEALTHY
    active_connections: int = 0
    last_health_check: float = 0.0
    consecutive_failures: int = 0
    
    @property
    def address(self) -> str:
        return f"{self.host}:{self.port}"
    
    def is_available(self) -> bool:
        return self.state in (BackendState.HEALTHY, BackendState.DRAINING)


class LoadBalancer:
    """
    Production-grade load balancer with health checking,
    connection draining, and multiple routing algorithms.
    """
    
    def __init__(self, algorithm: LoadBalancingAlgorithm, health_check_interval: float = 5.0):
        self.backends: List[BackendServer] = []
        self.algorithm = algorithm
        self.health_checker = HealthChecker(interval=health_check_interval)
        self._lock = threading.Lock()
        self._minimum_healthy_hosts: int = 0
    
    def add_backend(self, host: str, port: int, weight: int = 1) -> BackendServer:
        """
        Register a new backend server with the load balancer.
        """
        with self._lock:
            backend = BackendServer(host=host, port=port, weight=weight)
            self.backends.append(backend)
            self.health_checker.register(backend)
            return backend
    
    def remove_backend(self, backend: BackendServer) -> None:
        """
        Gracefully remove a backend with connection draining.
        """
        with self._lock:
            backend.state = BackendState.DRAINING
            self.health_checker.unregister(backend)
    
    def select_backend(self) -> Optional[BackendServer]:
        """
        Select next backend using configured algorithm.
        Respects minimum_healthy_hosts threshold.
        """
        with self._lock:
            available = [b for b in self.backends if b.is_available()]
            healthy = [b for b in available if b.state == BackendState.HEALTHY]
            
            if len(healthy) < self._minimum_healthy_hosts and available:
                return self.algorithm.select(available)
            
            if not healthy:
                return None
            
            return self.algorithm.select(healthy)
    
    def set_minimum_healthy_hosts(self, count: int) -> None:
        """
        Configure minimum healthy backends before routing stops.
        Prevents complete pool exhaustion during failures.
        """
        self._minimum_healthy_hosts = count


# Example usage
from io.thecodeforge.loadbalancer.algorithms import RoundRobinAlgorithm

lb = LoadBalancer(algorithm=RoundRobinAlgorithm(), health_check_interval=5.0)
lb.add_backend("10.0.1.10", 8080)
lb.add_backend("10.0.1.11", 8080)
lb.add_backend("10.0.1.12", 8080)
lb.set_minimum_healthy_hosts(1)

for i in range(6):
    backend = lb.select_backend()
    if backend:
        print(f"Request {i} -> {backend.address}")
Mental Model
Load Balancer as Traffic Director
A load balancer is a single entry point that fans out requests to multiple backends based on routing rules and server health.
  • Clients connect to the load balancer, never directly to backend servers
  • The balancer decides which backend receives each request
  • Failed servers are removed automatically via health checks
  • New servers are added without client-side changes
  • The balancer itself must be redundant to avoid becoming a single point of failure
📊 Production Insight
Load balancers become the single point of entry for all traffic.
If the balancer fails, all backends become unreachable.
Rule: always deploy load balancers in redundant pairs or use managed services.
🎯 Key Takeaway
Load balancers distribute traffic across servers for availability and scale.
They are the single entry point — making them redundant is critical.
Health checks and connection draining prevent cascading failures.
Load Balancer Deployment Decision
If: Simple HTTP traffic with standard routing needs
→
Use: a managed load balancer like AWS ALB or GCP HTTP(S) LB
If: TCP/UDP traffic or non-HTTP protocols
→
Use: a Layer 4 load balancer like AWS NLB or HAProxy in TCP mode
If: Need full control over routing logic
→
Use: HAProxy, NGINX, or Envoy as a self-managed load balancer
If: Kubernetes-based microservices
→
Use: an ingress controller (NGINX Ingress, Istio Gateway) with a service mesh

Types of Load Balancers

Load balancers operate at different layers of the network stack, each with distinct capabilities and trade-offs. The two primary categories are Layer 4 (transport) and Layer 7 (application) load balancers.

Layer 4 load balancers make routing decisions based on IP address and port information. They are fast and protocol-agnostic but cannot inspect request content. Layer 7 load balancers operate at the application layer and can route based on HTTP headers, URLs, cookies, and request content. They enable sophisticated routing but add latency from content inspection.

io.thecodeforge.loadbalancer.types.py · PYTHON
from abc import ABC, abstractmethod
from typing import Optional, Dict, Any
from io.thecodeforge.loadbalancer.core import BackendServer
from io.thecodeforge.loadbalancer.models import Request, Connection

class LoadBalancerType(ABC):
    """
    Abstract base for Layer 4 and Layer 7 load balancer types.
    """
    
    @abstractmethod
    def route(self, request: Any) -> Optional[BackendServer]:
        pass


class Layer4LoadBalancer(LoadBalancerType):
    """
    Transport-layer load balancer.
    Routes based on source/destination IP and port.
    Does not inspect packet contents.
    """
    
    def __init__(self, balancer):
        self.balancer = balancer
    
    def route(self, connection: Connection) -> Optional[BackendServer]:
        """
        Route based on 5-tuple: src_ip, src_port, dst_ip, dst_port, protocol.
        """
        backend = self.balancer.select_backend()
        if backend:
            backend.active_connections += 1
        return backend
    
    def get_capabilities(self) -> Dict[str, bool]:
        return {
            "protocol_agnostic": True,
            "url_routing": False,
            "header_inspection": False,
            "cookie_persistence": False,
            "content_based_routing": False,
            "ssl_termination": False,
            "websocket_support": True,
            "latency_overhead": "minimal"
        }


class Layer7LoadBalancer(LoadBalancerType):
    """
    Application-layer load balancer.
    Routes based on HTTP headers, URL path, hostname, cookies.
    """
    
    def __init__(self, balancer):
        self.balancer = balancer
        self.rules: list = []
    
    def add_routing_rule(self, condition: callable, target_pool: str) -> None:
        """
        Add content-based routing rule.
        """
        self.rules.append({"condition": condition, "pool": target_pool})
    
    def route(self, request: Request) -> Optional[BackendServer]:
        """
        Route based on HTTP request content.
        """
        for rule in self.rules:
            if rule["condition"](request):
                pool = rule["pool"]
                return self.balancer.select_backend_from_pool(pool)
        
        return self.balancer.select_backend()
    
    def get_capabilities(self) -> Dict[str, bool]:
        return {
            "protocol_agnostic": False,
            "url_routing": True,
            "header_inspection": True,
            "cookie_persistence": True,
            "content_based_routing": True,
            "ssl_termination": True,
            "websocket_support": True,
            "latency_overhead": "moderate"
        }


# Example: Layer 7 routing rules
from io.thecodeforge.loadbalancer.core import LoadBalancer
from io.thecodeforge.loadbalancer.algorithms import WeightedRoundRobinAlgorithm

l7 = Layer7LoadBalancer(balancer=LoadBalancer(algorithm=WeightedRoundRobinAlgorithm()))

# Route API traffic to API servers
l7.add_routing_rule(
    condition=lambda req: req.path.startswith("/api/"),
    target_pool="api-servers"
)

# Route static assets to CDN-backed servers
l7.add_routing_rule(
    condition=lambda req: req.path.startswith("/static/"),
    target_pool="static-servers"
)

# Route by hostname
l7.add_routing_rule(
    condition=lambda req: req.host == "api.example.com",
    target_pool="api-servers"
)
⚠ Layer 4 vs Layer 7 Trade-offs
📊 Production Insight
Layer 7 load balancers add latency from HTTP parsing.
For latency-sensitive paths, consider Layer 4 with client-side routing.
Rule: measure added latency from the load balancer tier independently.
🎯 Key Takeaway
Layer 4 routes by IP and port — fast and protocol-agnostic.
Layer 7 routes by HTTP content — flexible but adds latency.
Choose based on routing requirements, not default preference.

Load Balancing Algorithms

The load balancing algorithm determines how the balancer selects a backend server for each incoming request. Algorithm choice directly impacts traffic distribution, server utilization, and response latency. No single algorithm is optimal for all workloads.

The most common algorithms are round-robin (sequential distribution), weighted round-robin (proportional to server capacity), least connections (route to server with fewest active connections), and IP hash (consistent routing based on client IP). Each algorithm makes different assumptions about server capacity, request duration, and client behavior.

io.thecodeforge.loadbalancer.algorithms.py · PYTHON
from abc import ABC, abstractmethod
from typing import List, Optional
import hashlib
import random
from io.thecodeforge.loadbalancer.core import BackendServer


class LoadBalancingAlgorithm(ABC):
    """
    Abstract base for all load balancing algorithms.
    """
    
    @abstractmethod
    def select(self, backends: List[BackendServer]) -> Optional[BackendServer]:
        pass


class RoundRobinAlgorithm(LoadBalancingAlgorithm):
    """
    Distributes requests sequentially across all healthy backends.
    Simple and fair when servers have equal capacity.
    """
    
    def __init__(self):
        self._index = 0
    
    def select(self, backends: List[BackendServer]) -> Optional[BackendServer]:
        if not backends:
            return None
        backend = backends[self._index % len(backends)]
        self._index += 1
        return backend


class WeightedRoundRobinAlgorithm(LoadBalancingAlgorithm):
    """
    Distributes requests proportionally based on server weights.
    Higher weight servers receive proportionally more requests.
    """
    
    def __init__(self):
        self._current_weights: dict = {}
    
    def select(self, backends: List[BackendServer]) -> Optional[BackendServer]:
        if not backends:
            return None
        
        total_weight = sum(b.weight for b in backends)
        
        for backend in backends:
            addr = backend.address
            if addr not in self._current_weights:
                self._current_weights[addr] = 0
            self._current_weights[addr] += backend.weight
        
        selected = max(backends, key=lambda b: self._current_weights[b.address])
        self._current_weights[selected.address] -= total_weight
        return selected


class LeastConnectionsAlgorithm(LoadBalancingAlgorithm):
    """
    Routes to the server with the fewest active connections.
    Best for workloads with variable request durations.
    """
    
    def select(self, backends: List[BackendServer]) -> Optional[BackendServer]:
        if not backends:
            return None
        return min(backends, key=lambda b: b.active_connections)


class IpHashAlgorithm(LoadBalancingAlgorithm):
    """
    Routes based on hash of client IP address.
    Provides session affinity without cookies.
    """
    
    def __init__(self, client_ip_getter: callable = None):
        self._get_client_ip = client_ip_getter or (lambda: "127.0.0.1")
    
    def select(self, backends: List[BackendServer]) -> Optional[BackendServer]:
        if not backends:
            return None
        
        client_ip = self._get_client_ip()
        hash_value = int(hashlib.md5(client_ip.encode()).hexdigest(), 16)
        index = hash_value % len(backends)
        return backends[index]


class P2CLeastConnectionsAlgorithm(LoadBalancingAlgorithm):
    """
    Power of Two Choices: randomly pick two backends,
    then route to the one with fewer connections.
    Near-optimal load distribution with O(1) selection.
    """
    
    def select(self, backends: List[BackendServer]) -> Optional[BackendServer]:
        if not backends:
            return None
        if len(backends) == 1:
            return backends[0]
        
        a, b = random.sample(backends, 2)
        return a if a.active_connections <= b.active_connections else b


# Algorithm comparison
algorithms = {
    "Round Robin": "Simple sequential distribution. Assumes equal server capacity.",
    "Weighted Round Robin": "Proportional distribution based on server weight. For heterogeneous pools.",
    "Least Connections": "Routes to fewest active connections. Best for variable-duration requests.",
    "IP Hash": "Consistent routing by client IP. Provides session affinity without cookies.",
    "P2C Least Connections": "Near-optimal distribution with O(1) complexity. Used by Envoy and gRPC."
}
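A standalone rerun of the smooth weighted round-robin logic above, with two hypothetical backends weighted 3:1, shows the interleaved sequence it produces:

```python
# Hypothetical pool: "a" has three times the capacity of "b"
weights = {"a": 3, "b": 1}
current = {name: 0 for name in weights}
total = sum(weights.values())

order = []
for _ in range(8):
    for name, w in weights.items():
        current[name] += w            # every backend gains its weight
    pick = max(current, key=current.get)
    current[pick] -= total            # the winner pays the total weight
    order.append(pick)

print(order)  # ['a', 'a', 'b', 'a', 'a', 'a', 'b', 'a'] - 3:1, but interleaved
```

Note that "b" is not starved to the end of each cycle: the smooth variant spreads the lower-weight server through the sequence instead of producing a,a,a,b bursts.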
Mental Model
Algorithm Selection Heuristic
The right algorithm depends on whether your requests are short-lived, long-lived, or require session affinity.
  • Short uniform requests: round-robin is simple and effective
  • Variable-length requests (WebSockets, streams): least connections prevents hotspots
  • Session-dependent state: IP hash or cookie-based persistence
  • Heterogeneous server capacities: weighted algorithms respect capacity differences
  • High-scale random routing: P2C least connections gives near-optimal distribution in O(1)
📊 Production Insight
Round-robin creates hotspots with variable-duration requests.
Long-running connections tie up server capacity unevenly.
Rule: use least-connections for any workload where request duration varies significantly.
🎯 Key Takeaway
Algorithm choice determines traffic distribution pattern.
Round-robin fails with variable request durations.
P2C least connections provides near-optimal distribution at scale.
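To make the hotspot concrete, compare both policies against the same snapshot of per-backend load (the connection counts here are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    active_connections: int  # long-lived requests still in flight

# Hypothetical snapshot: "a" is stuck serving slow requests
pool = [Backend("a", 12), Backend("b", 3), Backend("c", 7)]

# Round-robin at index 0 picks "a" anyway: it is blind to load
rr_index = 0
rr_pick = pool[rr_index % len(pool)]

# Least-connections routes around the hotspot
lc_pick = min(pool, key=lambda b: b.active_connections)

print(rr_pick.name, lc_pick.name)  # a b
```

Round-robin keeps feeding the already-loaded server because its only state is a counter; least-connections consults live load and sends the request where capacity actually exists.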
Algorithm Selection Guide
If: All servers have equal capacity and requests are uniform
→
Use: round-robin for simplicity
If: Servers have different capacities
→
Use: weighted round-robin with capacity-based weights
If: Request durations vary significantly
→
Use: least connections or P2C least connections
If: Session affinity is required without cookies
→
Use: IP hash with consistent hashing for stable routing

Health Checks and Connection Draining

Health checks are the mechanism by which a load balancer determines whether a backend server is capable of handling traffic. Without health checks, the balancer would route requests to failed servers, causing errors for clients. Connection draining ensures in-flight requests complete before a server is removed from rotation.

Health checks come in two types: active checks where the balancer periodically probes the backend, and passive checks where the balancer monitors real request failures. Active checks detect failures proactively but add load. Passive checks detect failures only after real client requests fail.

io.thecodeforge.loadbalancer.health.py · PYTHON
import time
import threading
import requests
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional
from io.thecodeforge.loadbalancer.core import BackendServer, BackendState


class HealthCheckType(Enum):
    HTTP = "http"
    TCP = "tcp"
    GRPC = "grpc"


@dataclass
class HealthCheckConfig:
    check_type: HealthCheckType
    path: str = "/health"
    port: Optional[int] = None
    interval_seconds: float = 5.0
    timeout_seconds: float = 2.0
    healthy_threshold: int = 2
    unhealthy_threshold: int = 3
    expected_status_codes: list = None
    
    def __post_init__(self):
        if self.expected_status_codes is None:
            self.expected_status_codes = [200]


class HealthChecker:
    """
    Production health checker with configurable thresholds,
    grace periods, and passive failure detection.
    """
    
    def __init__(self, config: Optional[HealthCheckConfig] = None, interval: float = 5.0):
        self.config = config or HealthCheckConfig(
            check_type=HealthCheckType.HTTP,
            path="/health",
            interval_seconds=interval
        )
        self._backends: dict = {}
        self._running = False
        self._thread: Optional[threading.Thread] = None
    
    def register(self, backend: BackendServer) -> None:
        """
        Register a backend for health checking.
        """
        self._backends[backend.address] = {
            "backend": backend,
            "consecutive_successes": 0,
            "consecutive_failures": 0,
            "last_check_time": 0.0
        }
    
    def unregister(self, backend: BackendServer) -> None:
        """
        Remove a backend from health checking.
        """
        self._backends.pop(backend.address, None)
    
    def check_http(self, backend: BackendServer) -> bool:
        """
        Perform HTTP health check against backend.
        """
        port = self.config.port or backend.port
        url = f"http://{backend.host}:{port}{self.config.path}"
        
        try:
            response = requests.get(
                url,
                timeout=self.config.timeout_seconds
            )
            return response.status_code in self.config.expected_status_codes
        except (requests.ConnectionError, requests.Timeout):
            return False
    
    def check_tcp(self, backend: BackendServer) -> bool:
        """
        Perform TCP connection check against backend.
        """
        import socket
        port = self.config.port or backend.port
        
        try:
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sock.settimeout(self.config.timeout_seconds)
            result = sock.connect_ex((backend.host, port))
            sock.close()
            return result == 0
        except socket.error:
            return False
    
    def run_check(self, backend: BackendServer) -> bool:
        """
        Execute health check and update backend state based on thresholds.
        """
        if self.config.check_type == HealthCheckType.HTTP:
            is_healthy = self.check_http(backend)
        else:
            is_healthy = self.check_tcp(backend)
        
        entry = self._backends.get(backend.address)
        if not entry:
            return is_healthy
        
        if is_healthy:
            entry["consecutive_failures"] = 0
            entry["consecutive_successes"] += 1
            
            if entry["consecutive_successes"] >= self.config.healthy_threshold:
                backend.state = BackendState.HEALTHY
                backend.consecutive_failures = 0
        else:
            entry["consecutive_successes"] = 0
            entry["consecutive_failures"] += 1
            
            if entry["consecutive_failures"] >= self.config.unhealthy_threshold:
                backend.state = BackendState.UNHEALTHY
                backend.consecutive_failures = entry["consecutive_failures"]
        
        entry["last_check_time"] = time.time()
        return is_healthy
    
    def start(self) -> None:
        """
        Start background health checking thread.
        """
        self._running = True
        self._thread = threading.Thread(target=self._check_loop, daemon=True)
        self._thread.start()
    
    def stop(self) -> None:
        """
        Stop background health checking.
        """
        self._running = False
        if self._thread:
            self._thread.join(timeout=10.0)
    
    def _check_loop(self) -> None:
        """
        Continuous health check loop.
        """
        while self._running:
            for entry in list(self._backends.values()):
                backend = entry["backend"]
                if backend.state != BackendState.DRAINING:
                    self.run_check(backend)
            time.sleep(self.config.interval_seconds)


# Example: configure health checks
config = HealthCheckConfig(
    check_type=HealthCheckType.HTTP,
    path="/health/ready",
    interval_seconds=5.0,
    timeout_seconds=2.0,
    healthy_threshold=2,
    unhealthy_threshold=3,
    expected_status_codes=[200, 204]
)

checker = HealthChecker(config=config)
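The class above implements only active probing. A passive check, which this section also mentions, can be sketched as a rolling window over real request outcomes; the window size and `failure_ratio_threshold` are illustrative values, not defaults from any particular product:

```python
from collections import deque

class PassiveHealthMonitor:
    """Marks a backend unhealthy when too many real client requests fail.
    No probe traffic: failures are only observed after clients hit them."""

    def __init__(self, window: int = 20, failure_ratio_threshold: float = 0.5):
        self._outcomes = deque(maxlen=window)  # True = success, False = failure
        self._threshold = failure_ratio_threshold

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def is_healthy(self) -> bool:
        if len(self._outcomes) < self._outcomes.maxlen:
            return True  # not enough data yet: assume healthy
        failures = self._outcomes.count(False)
        return failures / len(self._outcomes) < self._threshold

monitor = PassiveHealthMonitor(window=10, failure_ratio_threshold=0.5)
for _ in range(10):
    monitor.record(True)
ok_before = monitor.is_healthy()   # True: all recent requests succeeded

for _ in range(6):
    monitor.record(False)          # real 5xx responses observed by the balancer
ok_after = monitor.is_healthy()    # False: 6 of the last 10 requests failed
print(ok_before, ok_after)
```

In practice the two are combined: active checks catch dead hosts without client impact, while passive checks catch failure modes the probe endpoint cannot see, such as errors on one specific route.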
⚠ Health Check Pitfalls
📊 Production Insight
Health checks that depend on external services cause cascade failures.
A database lock can mark all backends unhealthy simultaneously.
Rule: health check endpoints must test only the process, not dependencies.
🎯 Key Takeaway
Health checks remove failed servers before clients see errors.
Consecutive threshold prevents flapping from transient failures.
Connection draining preserves in-flight requests during server removal.
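Connection draining itself can be sketched as a wait loop. The grace period and poll interval are illustrative, and the minimal `Backend` class here stands in for the `BackendServer`/`BackendState` types defined earlier so the snippet runs on its own:

```python
import time
from dataclasses import dataclass

@dataclass
class Backend:
    address: str
    state: str = "healthy"
    active_connections: int = 0

def drain(backend: Backend, grace_seconds: float = 30.0,
          poll_seconds: float = 0.05) -> bool:
    """Stop new traffic, then wait for in-flight requests to finish.
    Returns True if the backend drained cleanly within the grace period."""
    backend.state = "draining"  # selection logic must skip draining backends
    deadline = time.monotonic() + grace_seconds
    while backend.active_connections > 0 and time.monotonic() < deadline:
        time.sleep(poll_seconds)  # in-flight requests decrement the counter
    return backend.active_connections == 0

idle = Backend("10.0.1.10:8080")
busy = Backend("10.0.1.11:8080", active_connections=4)

drained_idle = drain(idle)                       # True: nothing in flight
drained_busy = drain(busy, grace_seconds=0.2)    # False: grace period expired
print(drained_idle, drained_busy)
```

When the grace period expires with connections still open, real balancers force-close them; the return value lets the deployment tooling decide whether that happened.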

Session Persistence and Sticky Sessions

Session persistence, also called sticky sessions, ensures that requests from the same client are consistently routed to the same backend server. This is required when backend servers maintain in-memory session state that is not shared across the pool.

Sticky sessions are implemented through three mechanisms: cookie-based persistence (the balancer sets a cookie identifying the backend), IP-based persistence (routing by client IP hash), or application-controlled persistence (the application signals which backend to use). Each mechanism has different trade-offs for reliability and scalability.

io.thecodeforge.loadbalancer.persistence.py · PYTHON
import hashlib
import time
from typing import Dict, Optional
from dataclasses import dataclass
from io.thecodeforge.loadbalancer.core import BackendServer


@dataclass
class SessionEntry:
    backend_address: str
    created_at: float
    last_used: float
    ttl_seconds: float
    
    def is_expired(self) -> bool:
        return time.time() - self.last_used > self.ttl_seconds


class SessionPersistenceManager:
    """
    Manages session affinity between clients and backends.
    Supports cookie-based and IP-based persistence.
    """
    
    def __init__(self, ttl_seconds: int = 3600, cookie_name: str = "SERVERID"):
        self._sessions: Dict[str, SessionEntry] = {}
        self._ttl = ttl_seconds
        self._cookie_name = cookie_name
    
    def get_backend_for_client(
        self,
        client_id: str,
        healthy_backends: list
    ) -> Optional[BackendServer]:
        """
        Look up persisted backend for client.
        Returns None if session expired or backend unhealthy.
        """
        entry = self._sessions.get(client_id)
        
        if entry is None or entry.is_expired():
            return None
        
        for backend in healthy_backends:
            if backend.address == entry.backend_address:
                entry.last_used = time.time()
                return backend
        
        del self._sessions[client_id]
        return None
    
    def persist_session(self, client_id: str, backend: BackendServer) -> None:
        """
        Create or update session affinity for a client.
        """
        self._sessions[client_id] = SessionEntry(
            backend_address=backend.address,
            created_at=time.time(),
            last_used=time.time(),
            ttl_seconds=self._ttl
        )
    
    def extract_client_id_from_cookie(self, cookies: dict) -> Optional[str]:
        """
        Extract client identifier from load balancer cookie.
        """
        return cookies.get(self._cookie_name)
    
    def create_session_cookie(self, client_id: str, backend: BackendServer) -> dict:
        """
        Create cookie header for session persistence.
        """
        return {
            "name": self._cookie_name,
            "value": backend.address,
            "max_age": self._ttl,
            "path": "/",
            "http_only": True,
            "secure": True
        }
    
    def cleanup_expired(self) -> int:
        """
        Remove expired session entries.
        Returns count of removed entries.
        """
        expired = [
            k for k, v in self._sessions.items()
            if v.is_expired()
        ]
        for key in expired:
            del self._sessions[key]
        return len(expired)
    
    @property
    def active_sessions(self) -> int:
        return len(self._sessions)


class ConsistentHashPersistence:
    """
    IP-based persistence using consistent hashing.
    Minimizes redistribution when backends are added or removed.
    """
    
    def __init__(self, virtual_nodes: int = 150):
        self._virtual_nodes = virtual_nodes
        self._ring: Dict[int, str] = {}
    
    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
    
    def add_backend(self, backend: BackendServer) -> None:
        for i in range(self._virtual_nodes):
            vnode_key = f"{backend.address}#{i}"
            hash_val = self._hash(vnode_key)
            self._ring[hash_val] = backend.address
    
    def remove_backend(self, backend: BackendServer) -> None:
        for i in range(self._virtual_nodes):
            vnode_key = f"{backend.address}#{i}"
            hash_val = self._hash(vnode_key)
            self._ring.pop(hash_val, None)
    
    def get_backend(self, client_ip: str) -> Optional[str]:
        if not self._ring:
            return None
        
        # Walk clockwise to the first virtual node at or past the client's
        # hash (linear scan for clarity; production code would bisect a
        # pre-sorted list instead of re-sorting on every lookup)
        hash_val = self._hash(client_ip)
        sorted_hashes = sorted(self._ring.keys())
        
        for h in sorted_hashes:
            if h >= hash_val:
                return self._ring[h]
        
        # Wrapped past the largest hash: circle back to the ring's start
        return self._ring[sorted_hashes[0]]
πŸ’‘ When to Avoid Sticky Sessions
  • Sticky sessions create uneven load distribution when some clients are more active
  • Server failure loses all sessions bound to that server
  • Horizontal scaling is limited β€” new servers get no existing traffic
  • Prefer shared session stores (Redis, Memcached) over sticky sessions when possible
  • If sticky sessions are required, set reasonable TTLs and monitor session distribution
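The shared-store alternative can be sketched as follows. This is a minimal in-memory stand-in for a store such as Redis or Memcached; the class name and its `put`/`get` API are illustrative, not a real library interface:

```python
import json
import time


class SharedSessionStore:
    """In-memory stand-in for a shared session store such as Redis.
    Because session state lives outside the backends, any server can
    handle any client, and sticky routing is no longer required."""

    def __init__(self, ttl_seconds: int = 3600):
        self._data: dict = {}
        self._ttl = ttl_seconds

    def put(self, session_id: str, payload: dict) -> None:
        # Store the serialized payload alongside an absolute expiry time
        self._data[session_id] = (json.dumps(payload), time.time() + self._ttl)

    def get(self, session_id: str):
        entry = self._data.get(session_id)
        if entry is None or time.time() > entry[1]:
            self._data.pop(session_id, None)  # lazily expire stale entries
            return None
        return json.loads(entry[0])


store = SharedSessionStore(ttl_seconds=60)
store.put("sess-42", {"user": "alice", "cart_items": 3})
print(store.get("sess-42"))  # {'user': 'alice', 'cart_items': 3}
```

With this design the load balancer is free to use any algorithm, since no request depends on landing on a particular backend.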
πŸ“Š Production Insight
Sticky sessions create hotspots when power-law clients exist.
A single active client can saturate one backend while others idle.
Rule: monitor per-backend connection counts and alert on distribution skew exceeding 2x.
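That skew check is a one-liner worth having in your monitoring code. A minimal sketch (the function name and the 2x alert threshold are illustrative):

```python
def distribution_skew(per_backend_requests: dict) -> float:
    """Ratio of the busiest backend's load to the mean load across backends."""
    counts = list(per_backend_requests.values())
    mean = sum(counts) / len(counts)
    return max(counts) / mean if mean > 0 else 0.0


# One hot backend handling 80% of traffic against three near-idle peers
loads = {"10.0.0.1": 800, "10.0.0.2": 70, "10.0.0.3": 70, "10.0.0.4": 60}
skew = distribution_skew(loads)
print(f"skew = {skew:.2f}")              # skew = 3.20
print("ALERT" if skew > 2.0 else "ok")   # ALERT
```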
🎯 Key Takeaway
Sticky sessions route repeat clients to the same backend.
Cookie-based persistence is more reliable than IP-based.
Shared session stores eliminate the need for sticky sessions entirely.
πŸ—‚ Load Balancing Algorithm Comparison
Choosing the right algorithm for your workload
| Algorithm | Distribution | Session Affinity | Best For | Drawback |
| --- | --- | --- | --- | --- |
| Round Robin | Sequential, equal | None | Uniform short requests | Hotspots with variable-duration requests |
| Weighted Round Robin | Proportional to weight | None | Heterogeneous server capacities | Requires accurate weight configuration |
| Least Connections | Fewest active connections | None | Variable request durations | Slightly higher selection overhead |
| IP Hash | Consistent by client IP | Yes (implicit) | Session affinity without cookies | Uneven distribution with few clients |
| P2C Least Connections | Random pair, pick fewer | None | High-scale uniform distribution | Randomness can cause temporary imbalance |
| Cookie-based | Consistent by cookie | Yes (explicit) | Stateful web applications | Session loss on server failure |

🎯 Key Takeaways

  • Load balancers distribute traffic across servers for availability, scalability, and fault tolerance
  • Layer 4 routes by IP/port β€” fast and protocol-agnostic. Layer 7 routes by HTTP content β€” flexible but slower
  • Algorithm choice matters: least-connections for variable workloads, round-robin for uniform requests
  • Health checks must be lightweight and independent β€” never depend on shared resources
  • Connection draining and minimum_healthy_hosts prevent cascading failures during deployments

⚠ Common Mistakes to Avoid

    βœ•Health check endpoint depends on database or external service
    Symptom

    All backends marked unhealthy simultaneously during database maintenance, causing complete outage

    Fix

    Use a lightweight liveness endpoint that checks only process status. Add a separate readiness endpoint for dependency checks.
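A sketch of that split, with plain functions standing in for HTTP handlers (the paths and the `db_reachable` helper are illustrative):

```python
def liveness() -> tuple:
    # /health/live: process-only check; never touches the database
    return ("ok", 200)


def readiness(db_reachable) -> tuple:
    # /health/ready: verifies dependencies; the load balancer uses this
    # to decide routing eligibility
    if db_reachable():
        return ("ready", 200)
    return ("not ready", 503)


# During database maintenance: liveness stays green, readiness goes red,
# so the orchestrator does NOT restart pods and the outage stays partial.
print(liveness())                # ('ok', 200)
print(readiness(lambda: False))  # ('not ready', 503)
```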

    βœ•No connection draining configured during deployments
    Symptom

    In-flight requests fail with connection reset errors when servers are removed from the pool

    Fix

    Configure connection draining with at least 30-second grace period. Verify load balancer waits for active connections to complete before removing backends.
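The draining wait can be sketched as a simple poll loop. The function and parameter names here are illustrative, not a real load balancer API:

```python
import time


def drain_backend(active_connections, grace_seconds: float = 30.0,
                  poll_interval: float = 0.5) -> bool:
    """Stop sending new traffic first (not shown), then wait for in-flight
    requests to finish or the grace period to expire.
    Returns True if the backend drained cleanly."""
    deadline = time.monotonic() + grace_seconds
    while active_connections() > 0 and time.monotonic() < deadline:
        time.sleep(poll_interval)
    return active_connections() == 0


# Simulated backend whose in-flight requests complete over two polls
remaining = [2]
def fake_count():
    count = remaining[0]
    remaining[0] = max(0, remaining[0] - 1)
    return count

print(drain_backend(fake_count, grace_seconds=5, poll_interval=0.01))  # True
```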

    βœ•Using round-robin with long-lived WebSocket connections
    Symptom

    First few servers accumulate all WebSocket connections while later servers receive no traffic

    Fix

    Switch to least-connections algorithm for workloads with persistent connections. Monitor per-backend connection counts.
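Both selection strategies are small enough to sketch directly. The dict-of-connection-counts shape is illustrative:

```python
import random


def least_connections(conn_counts: dict) -> str:
    # Scan all backends and pick the one with the fewest active connections
    return min(conn_counts, key=conn_counts.get)


def p2c_least_connections(conn_counts: dict) -> str:
    # Power of two choices: sample two backends at random and keep the
    # less loaded one; near-optimal balance without a full scan
    a, b = random.sample(list(conn_counts), 2)
    return a if conn_counts[a] <= conn_counts[b] else b


counts = {"ws-1": 512, "ws-2": 498, "ws-3": 12, "ws-4": 500}
print(least_connections(counts))            # ws-3
print(p2c_least_connections(counts) in counts)  # True
```

Unlike round-robin, both strategies account for the connections a backend is already holding, which is exactly what long-lived WebSocket workloads need.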

    βœ•No minimum healthy hosts configured
    Symptom

    Brief health check failures remove all servers from rotation, causing total outage instead of partial degradation

    Fix

    Set minimum_healthy_hosts to at least 1. Accept degraded service with unhealthy backends rather than complete failure.
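The fail-open floor can be sketched as a filter over the pool (the function and parameter names are illustrative):

```python
def effective_pool(backends: list, is_healthy: dict,
                   minimum_healthy: int = 1) -> list:
    """Return the routable set of backends, failing open when health
    checks would otherwise empty the pool entirely."""
    healthy = [b for b in backends if is_healthy.get(b, False)]
    if len(healthy) >= minimum_healthy:
        return healthy
    # Below the floor: route to the full pool rather than to nothing.
    # Degraded service beats a guaranteed total outage.
    return backends


pool = ["a", "b", "c"]
print(effective_pool(pool, {"a": True, "b": False, "c": False}))   # ['a']
print(effective_pool(pool, {"a": False, "b": False, "c": False}))  # ['a', 'b', 'c']
```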

    βœ• Sticky sessions with no TTL or cleanup
    Symptom

    Session table grows unbounded, consuming memory. Removed servers still referenced in session entries causing routing failures

    Fix

    Set explicit TTL on session entries. Implement periodic cleanup of expired sessions. Remove session entries when their backend is decommissioned.

Interview Questions on This Topic

  • Q (Junior): What is the difference between a Layer 4 and Layer 7 load balancer?
    A Layer 4 load balancer operates at the transport layer and makes routing decisions based on IP address and TCP/UDP port information. It does not inspect packet contents, making it fast and protocol-agnostic. It works for any TCP or UDP traffic including non-HTTP protocols. A Layer 7 load balancer operates at the application layer and can inspect HTTP headers, URLs, cookies, and request content. This enables sophisticated routing rules like directing /api requests to API servers and /static requests to CDN servers. It can also perform SSL termination, modify headers, and implement content-based routing. The trade-off is performance versus flexibility. Layer 4 adds minimal latency. Layer 7 adds latency from content parsing but enables routing decisions impossible at Layer 4.
  • Q (Mid-level): How would you design health checks for a microservices architecture?
    I would implement two separate health check endpoints per service: 1. Liveness probe at /health/live β€” checks only that the process is running and responsive. This endpoint does NOT query databases or call other services. If this fails, the orchestrator restarts the pod. 2. Readiness probe at /health/ready β€” checks that the service can handle traffic by verifying critical dependencies are reachable. This is what the load balancer uses to determine routing eligibility. Configuration: healthy_threshold of 2 consecutive successes before adding to rotation, unhealthy_threshold of 3 consecutive failures before removing. Interval of 5 seconds with a 2-second timeout. Critical rules: health check endpoints must return in under 100ms. They must not perform write operations. They must not depend on services that depend on this service, to avoid circular dependency deadlocks during cascading failures.
  • Q (Senior): A production system shows one backend server handling 80% of traffic while three other servers handle 20% combined. How do you diagnose and fix this?
    First, I would identify the root cause by checking several dimensions: 1. Algorithm: Is the load balancer using IP hash or sticky sessions? A small number of high-traffic clients would concentrate on one backend. Check the session persistence configuration and per-backend connection counts. 2. Health checks: Are some backends intermittently failing health checks and being removed? Check health check logs for threshold crossings. A flapping backend would cause traffic to pile onto remaining servers. 3. Connection type: Are long-lived connections (WebSockets, gRPC streams) accumulating on the first server that was available? Round-robin only distributes new connections, not existing ones. Switch to least-connections. 4. DNS caching: Clients may have cached the first server IP. Check DNS TTL values. 5. Instance heterogeneity: Are all backends the same instance type? A smaller instance would naturally handle fewer connections. Fix approach: Switch to least-connections or P2C least-connections algorithm. Disable sticky sessions unless explicitly required. Verify health check configuration prevents flapping. Add monitoring for per-backend request rate distribution with alerts when skew exceeds 2x the mean.
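The Layer 4 versus Layer 7 distinction from the first answer above comes down to what the balancer can see. A minimal sketch of Layer 7 content-based routing (the pool names and path prefixes are illustrative):

```python
def route_by_path(path: str) -> str:
    # Layer 7: inspect the HTTP path to choose a pool. This is
    # impossible at Layer 4, which only sees IP addresses and ports.
    if path.startswith("/api"):
        return "api-pool"
    if path.startswith("/static"):
        return "static-pool"
    return "web-pool"


print(route_by_path("/api/v1/users"))     # api-pool
print(route_by_path("/static/logo.png"))  # static-pool
print(route_by_path("/checkout"))         # web-pool
```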

Frequently Asked Questions

What is a load balancer in simple terms?

A load balancer is a system that sits in front of your servers and distributes incoming traffic across them. Instead of all users hitting one server, the load balancer spreads the load so no single server gets overwhelmed. If one server goes down, the load balancer automatically stops sending traffic to it.

What is the difference between hardware and software load balancers?

Hardware load balancers are dedicated physical appliances, while software load balancers run as applications on commodity servers. Modern production systems overwhelmingly use software load balancers or managed cloud load balancers (AWS ALB/NLB, GCP Load Balancer) for flexibility, cost, and scalability.

What is the best load balancing algorithm?

There is no universally best algorithm. Round-robin works well for simple, uniform workloads. Least connections is best when request durations vary. IP hash provides session affinity without cookies. P2C least connections offers near-optimal distribution at high scale with O(1) complexity. The right choice depends on your traffic pattern, session requirements, and server capacity.

Can a load balancer itself be a single point of failure?

Yes, a single load balancer is a single point of failure. Production systems deploy load balancers in redundant pairs using active-passive or active-active configurations. Cloud providers offer managed load balancers with built-in redundancy across availability zones. DNS-based load balancing across multiple load balancer instances provides another layer of fault tolerance.

What is connection draining?

Connection draining is the process of allowing in-flight requests to complete before removing a server from the load balancer pool. When a server is marked for removal (during deployment or scaling), the load balancer stops sending new requests but waits for existing connections to finish. This prevents users from experiencing connection reset errors during deployments.

Naren Β· Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Forged with πŸ”₯ at TheCodeForge.io β€” Where Developers Are Forged