Advanced 12 min · March 05, 2026

Service Discovery Trap — DNS TTL Killed Black Friday

The DNS TTL trap that killed Black Friday: 20% of payment requests failed with 503 due to stale DNS cache.

Naren · Founder
Quick Answer
  • Service discovery lets services find each other by name, not IP:port
  • Two core operations: registration (service tells registry 'I'm here') and resolution (client asks 'where is service X?')
  • Client-side discovery: client queries the registry and load-balances itself
  • Server-side discovery: a load balancer or proxy (e.g., API gateway) handles both lookup and routing
  • Health checks separate liveness (is the process alive?) from readiness (is it ready to serve?) — confusing them causes cascading failures
  • DNS-based discovery has a TTL trap: cached stale records continue routing to dead instances after a crash
Plain-English First

Imagine you move to a new city and need a plumber. You don't have their number memorised — you look them up in a directory, get their current address, and call them directly. Service Discovery is that directory for software services. Instead of hardcoding 'Service B lives at 192.168.1.42:8080', every service registers itself in a central registry when it starts up, and looks up others by name when it needs them. The directory always stays fresh, even when services crash, scale up, or move to a new machine. Without it, auto-scaling and container rescheduling would break every client that cached a stale IP.

In a monolith, your code calls a function — it's right there in the same process. In a microservices architecture running across hundreds of containers on dynamically scheduled cloud infrastructure, that luxury disappears overnight. Pods get rescheduled. Auto-scaling fires up three new instances of your payment service at 11pm on Black Friday. IPs change. Ports shift. If you hardcoded any of that, your system falls apart the moment the environment breathes. Service Discovery is the infrastructure primitive that makes dynamic, self-healing distributed systems actually work in production.

The problem it solves is deceptively simple to state and brutally hard to get right: how does Service A know where to send its request to Service B, right now, with a healthy instance, without a human operator updating a config file? The naive solution — a static config map — breaks the moment you deploy more than once a week. The production solution requires a registry, a health-check protocol, and a resolution strategy that can handle partial failures, network partitions, and stale data without cascading into an outage.

By the end of this article you'll understand the two fundamental discovery patterns (client-side and server-side), how health checks work under the hood, why DNS-based discovery has a hidden TTL trap that bites almost every team, how Consul, Eureka, and Kubernetes each implement the registry differently, and what you need to think about before choosing one. You'll also walk away with concrete production gotchas that most tutorials skip entirely.

What is Service Discovery?

Service Discovery is a core concept in System Design. Rather than starting with a dry definition, let's see it in action and understand why it exists.

At its simplest, service discovery has two jobs: registration — when a service starts, it tells a central registry 'I am running at this IP:port and I am healthy'. And resolution — when another service needs to call it, it asks the registry 'give me a healthy instance of service X'. The registry acts as the source of truth for the current state of all services in the distributed system.

This pattern is more than a century old. Telephone operators maintained switchboards to connect callers. DNS maps domain names to IPs. The difference in microservices is the rate of change: instances come and go every second due to auto-scaling, rolling updates, and failures. A static phone book would be outdated before it's printed.

io/thecodeforge/discovery/ConsulRegistration.javaJAVA
// TheCodeForge — Register a service with Consul using its HTTP API
package io.thecodeforge.discovery;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ConsulRegistration {
    private final HttpClient client = HttpClient.newHttpClient();

    public void register(String serviceName, String address, int port) throws Exception {
        String payload = """
            {
                "Name": "%s",
                "Address": "%s",
                "Port": %d,
                "Check": {
                    "HTTP": "http://%s:%d/health/ready",
                    "Interval": "5s",
                    "DeregisterCriticalServiceAfter": "30s"
                }
            }
            """.formatted(serviceName, address, port, address, port);

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8500/v1/agent/service/register"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(payload))
            .build();

        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() != 200) {
            throw new RuntimeException("Registration failed: " + response.statusCode());
        }
        System.out.println("Registered " + serviceName + " at " + address + ":" + port);
    }
}
Output
Registered payment at 10.0.1.5:8080

How Registration & Resolution Actually Work

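Registration is not just posting a key-value pair. The registry must decide when to remove an instance. It does this through heartbeats and health checks. A Eureka client sends a heartbeat every 30s by default, and the server evicts the instance after three missed heartbeats (90s with the defaults). Consul instead runs checks at whatever interval you configure (such as the 5s used above), or expects the service to renew a TTL check; it marks the instance critical as soon as a check fails and removes it only after DeregisterCriticalServiceAfter elapses.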
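The lease idea can be sketched with a small in-memory registry. This is an illustration only: `LeaseRegistry` and its methods are hypothetical names, not the API of Eureka or Consul, and time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory registry; a sketch of lease expiry, not a real registry client.
class LeaseRegistry {
    private final long heartbeatIntervalMs;
    private final int missedBeatsBeforeEviction;
    // instance address -> timestamp of the last heartbeat received
    private final Map<String, Long> lastHeartbeatMs = new ConcurrentHashMap<>();

    LeaseRegistry(long heartbeatIntervalMs, int missedBeatsBeforeEviction) {
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.missedBeatsBeforeEviction = missedBeatsBeforeEviction;
    }

    // Registration and renewal are the same operation: both refresh the lease.
    void heartbeat(String instance, long nowMs) {
        lastHeartbeatMs.put(instance, nowMs);
    }

    // Evict every instance whose lease has lapsed, then return the survivors in sorted order.
    List<String> liveInstances(long nowMs) {
        long leaseMs = heartbeatIntervalMs * missedBeatsBeforeEviction;
        lastHeartbeatMs.entrySet().removeIf(e -> nowMs - e.getValue() > leaseMs);
        List<String> live = new ArrayList<>(lastHeartbeatMs.keySet());
        Collections.sort(live);
        return live;
    }
}
```

With a 5s interval and three allowed misses, an instance that stops renewing disappears from `liveInstances` once more than 15s have passed since its last heartbeat.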

Resolution can happen in two ways: client-side — the client queries the registry for all healthy instances and picks one using a load balancing strategy (round-robin, random, least connections). Server-side — the client sends the request to a known endpoint (load balancer or proxy), which uses the registry to find a healthy backend.

Crucially, resolution is a distributed read — every client reads a copy of the registry state. Different clients may see different subsets of instances due to caching and eventual consistency. That's fine for high-level load distribution but causes trouble during rapid failover scenarios.

Heartbeat tuning is a subtle art. Too short and you'll get false deregistrations from GC pauses. Too long and dead instances stay in the pool. Start with 3x the expected interval for expiry, then measure your JVM GC pause times — if you see pauses over 500ms, your heartbeat interval should be at least 1.5 seconds to avoid false positives.
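The arithmetic behind that rule of thumb, as a minimal sketch. The factor of three for the minimum interval is read off the 500ms to 1.5s example above; treat it as an assumption, not a universal constant.

```java
// Hypothetical helper that encodes the two rules of thumb from the text.
class HeartbeatTuning {
    static final int MISSED_BEATS_BEFORE_EXPIRY = 3;

    // Expiry time = 3 x heartbeat interval.
    static long expiryMs(long heartbeatIntervalMs) {
        return heartbeatIntervalMs * MISSED_BEATS_BEFORE_EXPIRY;
    }

    // Smallest interval that keeps a worst-case GC pause from looking like a dead instance,
    // assuming the interval should be at least 3 x the pause (500 ms pause -> 1.5 s interval).
    static long minIntervalMs(long worstGcPauseMs) {
        return worstGcPauseMs * 3;
    }
}
```

For a measured worst-case pause of 500ms this gives a 1,500ms interval and a 4,500ms expiry.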

io/thecodeforge/discovery/RegistrationExample.javaJAVA
// TheCodeForge — Registration with a registry like Eureka or Consul
package io.thecodeforge.discovery;

import com.netflix.appinfo.ApplicationInfoManager;
import com.netflix.appinfo.HealthCheckHandler;
import com.netflix.appinfo.InstanceInfo.InstanceStatus;
import com.netflix.discovery.EurekaClient;

public class RegistrationExample {
    private final EurekaClient eurekaClient;
    private final String serviceId = "payment";
    private final int port = 8080;

    public RegistrationExample(EurekaClient client) {
        this.eurekaClient = client;
    }

    public void register() {
        // The Eureka client registers the instance itself, using the app name, host and port
        // from its configuration. Application code only flips the status to UP once it is ready.
        ApplicationInfoManager manager = eurekaClient.getApplicationInfoManager();
        manager.setInstanceStatus(InstanceStatus.UP);
        System.out.println("Registered " + serviceId + " on port " + port);
    }

    public void reportHealth(HealthCheckHandler handler) {
        // Heartbeats (lease renewals) are sent by the Eureka client on a background thread.
        // What the application can supply is a handler that decides which status
        // each renewal reports, for example DOWN while a critical dependency is unavailable.
        eurekaClient.registerHealthCheck(handler);
    }
}
Output
Registered payment on port 8080
(plus periodic heartbeat logs)
Mental Model: A Switchboard Operator
  • When a new line is connected (service starts), the operator plugs it into the board (registers).
  • The operator keeps the board updated by calling each line periodically (heartbeat).
  • When you want to call someone, you ask the operator to connect you (resolution).
  • If the operator doesn't get an answer for a while, they unplug the line (deregistration).
  • Multiple operators may have slightly different views of which lines are live (eventual consistency).
Production Insight
Heartbeat timeouts are a tuning minefield. Too low: false positives due to GC pauses. Too high: dead instances remain in rotation.
Start with 3 * heartbeatInterval = expiryTime. Then measure JVM GC pause times.
Rule: always separate liveness from readiness — heartbeat should only prove the process is alive, not that it can serve traffic.
Key Takeaway
Registration is a lease: the registry grants a temporary slot that must be renewed.
Resolution is a read — it sees a snapshot that may be seconds stale.
The tighter the heartbeat window, the more false deregistrations you'll trigger.
Heartbeat Interval Tuning Decision
If: Application has frequent GC pauses >500ms
Use: Increase heartbeat interval to at least 2s, set expiry to 6s
If: Services restart frequently (rolling update every minute)
Use: Lower heartbeat interval to 1s, expiry 3s for faster convergence
If: Running on JVM with ZGC (sub-millisecond pauses)
Use: The registry's default interval is fine; no need to adjust
If: Using Consul with gossip protocol
Use: Heartbeat is separate from gossip; tune both independently

Client-Side vs Server-Side Discovery

Client-side discovery puts the burden on each service's code to query the registry and pick an instance. Spring Cloud Eureka, Netflix OSS, and Consul's client library are common examples. The client gets a list of all healthy instances for the target service and chooses one using a load balancing policy (e.g., Ribbon).

Server-side discovery removes that responsibility from the client. The client sends a request to a well-known load balancer (e.g., AWS ALB, HAProxy, or a sidecar proxy in a service mesh). The load balancer queries the registry and forwards the request to an appropriate backend.

Which one to choose? Client-side gives you lower latency (no extra hop) and more control over routing logic (canary, retry, circuit breaking). Server-side simplifies the client code and centralises routing control, which is critical for security and compliance. Cloud-native environments often use server-side via Kubernetes Services combined with Ingress or a service mesh.

A less obvious trade-off: client-side discovery creates a fan-out of registry queries, because each client polls the registry. With 1,000 services each discovering 50 others and refreshing each lookup once a minute, that's 50,000 queries per minute. Server-side discovery concentrates that load on the balancer tier, which is easier to scale. Measure your registry's capacity before committing to client-side at scale.
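To see how sensitive that number is to the refresh period, here is a back-of-the-envelope estimator. It assumes every client refreshes every dependency on a fixed period; `RegistryLoadEstimate` is a hypothetical helper, not part of any registry client.

```java
// Hypothetical estimator for registry query load under client-side discovery.
class RegistryLoadEstimate {
    // Queries per minute when each client refreshes each dependency once per refresh period.
    static long queriesPerMinute(int clients, int dependenciesPerClient, int refreshSeconds) {
        if (refreshSeconds <= 0) {
            throw new IllegalArgumentException("refreshSeconds must be positive");
        }
        return (long) clients * dependenciesPerClient * 60 / refreshSeconds;
    }
}
```

Under these assumptions, 1,000 clients with 50 dependencies each generate 50,000 queries per minute at a 60s refresh and 300,000 at a 10s refresh, which is why the cache TTL matters as much as the client count.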

io/thecodeforge/discovery/ClientSideDiscovery.javaJAVA
// TheCodeForge — Client-side discovery with Consul
package io.thecodeforge.discovery;

import com.google.common.net.HostAndPort;
import com.orbitz.consul.Consul;
import com.orbitz.consul.model.health.ServiceHealth;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ClientSideDiscovery {
    private final Consul consulClient;
    private final AtomicInteger nextIndex = new AtomicInteger();

    public ClientSideDiscovery(String consulHost) {
        // withHostAndPort takes a Guava HostAndPort rather than separate host and port arguments
        this.consulClient = Consul.builder()
            .withHostAndPort(HostAndPort.fromParts(consulHost, 8500))
            .build();
    }

    public String findHealthyInstance(String serviceName) {
        List<ServiceHealth> passings = consulClient.healthClient()
            .getHealthyServiceInstances(serviceName)
            .getResponse();
        if (passings.isEmpty()) throw new RuntimeException("No healthy instances for " + serviceName);
        // Round-robin: rotate through the healthy list (floorMod keeps the index valid after overflow)
        int index = Math.floorMod(nextIndex.getAndIncrement(), passings.size());
        ServiceHealth chosen = passings.get(index);
        return chosen.getService().getAddress() + ":" + chosen.getService().getPort();
    }
}
Output
Returns e.g. "10.0.1.5:8080"
Production Insight
Client-side discovery creates a fan-out: each client polls the registry. With 1000 services discovering 50 others once a minute each, that's 50,000 queries per minute.
Tune cache expiry (10-30s TTL) to reduce load — but accept staleness.
Rule: always cache the instance list locally to avoid synchronous registry calls on the critical path.
Key Takeaway
Client-side = low latency, high client complexity. Server-side = higher latency, simple clients.
Cloud-native defaults trend toward server-side via service mesh.
Don't mix both — you'll double your latency and complexity for no benefit.
Client-Side vs Server-Side: Decision Matrix
If: You need low latency (no extra network hop)
Use: Client-side discovery
If: You want simple clients that just send requests to a fixed endpoint
Use: Server-side discovery
If: You need per-request routing logic (canary, version)
Use: Client-side (or service mesh)
If: You're in Kubernetes and want minimal code changes
Use: Kubernetes Service DNS (server-side) or ingress controller

Health Checks: The Silent Failure Point

Health checks are the single most misconfigured feature in service discovery. Most teams use a simple TCP port check (is the port open?) or a generic HTTP endpoint (/health) that always returns 200. These only verify that the process is alive — they don't tell you if the service can actually handle requests.

Production-ready health checks should differentiate between liveness (is the process running?) and readiness (is the service ready to serve traffic?). Kubernetes, Consul, and Eureka all support this distinction, but few teams configure both correctly.

A common pitfall: the readiness check passes even when a critical dependency (database, cache) is down. The service keeps receiving traffic, fails every request, and the outage appears as random 500 errors. The health check should cascade: if the database is unreachable, the service reports itself as unhealthy for readiness, and traffic is redirected to healthy instances.

Another subtlety: health checks can cause a thundering herd during startup. If all instances of a new deployment become ready simultaneously and all start reporting themselves as healthy, the registry may broadcast a sudden surge of new endpoints to all clients, causing a wave of reconnections and potential CPU spikes.

Mitigation: stagger readiness by adding a random delay (e.g., 0-5 seconds) after the readiness check passes before advertising the instance. In Kubernetes, use minReadySeconds on the Deployment to force a grace period.
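A minimal sketch of the random-delay idea, assuming the service controls when it advertises itself to the registry. `StaggeredReadiness` is a hypothetical helper; the `Random` is injected so the jitter can be seeded.

```java
import java.util.Random;

// Hypothetical helper: delays advertising by a random amount after readiness first passes.
class StaggeredReadiness {
    private final long advertiseAtMs;

    // readyAtMs is the moment the readiness check first passed.
    StaggeredReadiness(long readyAtMs, int maxJitterMs, Random random) {
        // nextInt(bound) excludes the bound, so +1 makes the full 0..maxJitterMs range possible.
        this.advertiseAtMs = readyAtMs + random.nextInt(maxJitterMs + 1);
    }

    // True once the jitter window for this instance has elapsed.
    boolean shouldAdvertise(long nowMs) {
        return nowMs >= advertiseAtMs;
    }

    long advertiseAtMs() {
        return advertiseAtMs;
    }
}
```

Each instance draws its own delay, so a deployment of many replicas spreads its registrations across the jitter window instead of landing in the same instant.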

io/thecodeforge/discovery/HealthCheckExample.javaJAVA
// TheCodeForge — Proper readiness health check with dependency cascading
package io.thecodeforge.discovery;

import java.util.concurrent.atomic.AtomicBoolean;

public class HealthCheckExample {
    private final AtomicBoolean databaseAvailable = new AtomicBoolean(false);
    private final AtomicBoolean redisAvailable = new AtomicBoolean(false);

    // This endpoint is called by /health/ready
    public boolean isReady() {
        if (!databaseAvailable.get()) return false;
        if (!redisAvailable.get()) return false;
        // optionally check if recent uptime > grace period
        return true;
    }

    // This endpoint is called by /health/live — just checks process is alive
    public boolean isAlive() {
        return true; // process alive if this method is reachable
    }

    public void setDatabaseAvailable(boolean state) {
        this.databaseAvailable.set(state);
    }

    public void setRedisAvailable(boolean state) {
        this.redisAvailable.set(state);
    }
}
Production Warning: Stagger Readiness to Avoid Thundering Herd
When a new deployment rolls out, all instances may become ready at nearly the same moment. This floods the registry with simultaneous health updates, potentially overwhelming clients that cache instance lists. Add a random delay (0-5s) between readiness and advertising. In Kubernetes, set minReadySeconds to 10–30 to spread the registration window.
Production Insight
A readiness check that only returns HTTP 200 without verifying dependencies will pass even when the DB is down.
Traffic continues to a broken instance, depleting error budgets in minutes.
Rule: readiness checks must fail if any critical dependency is unavailable.
Key Takeaway
Liveness = process alive. Readiness = can serve requests.
Never use liveness checks for routing decisions.
Make readiness checks dependency-aware — but only for critical dependencies.
Health Check Configuration
If: Service is stateless and has no critical dependencies
Use: Simple TCP liveness check is sufficient for discovery
If: Service depends on a database (most real services)
Use: Readiness check must verify DB connectivity, separate from liveness
If: Service has multiple downstream dependencies
Use: Readiness should fail if ANY critical dependency is unavailable
If: High startup time (>10s) due to cache warming
Use: Use startupProbe (K8s) to delay readiness until warmup completes

DNS-Based Discovery and the Hidden TTL Trap

Kubernetes and many cloud providers use DNS for service discovery. A service name like payment.prod.svc.cluster.local resolves to the Service's ClusterIP or, for a headless Service, directly to the IPs of the backend pods that are ready at that moment. DNS is familiar, free, and requires no extra infrastructure.

Here's the trap: DNS responses are cached aggressively, both by the OS resolver and by intermediate DNS servers. The TTL (Time To Live) on the DNS record controls how long the cache lives. Kubernetes DNS records for services have a default TTL of 30 seconds. That means if a pod crashes, up to 30 seconds can pass before all clients stop sending requests to the dead pod.
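For Java clients there is one more cache layer to account for: the JVM keeps its own cache of resolved names, controlled by the `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` security properties. The sketch below caps both; the 5s and 1s values in the usage note are illustrative assumptions, and the defaults vary by JDK and configuration, so check yours.

```java
import java.security.Security;

// Hypothetical helper; call it at startup, before the first DNS lookup is made.
class JvmDnsCacheConfig {
    static void capDnsCache(int positiveTtlSeconds, int negativeTtlSeconds) {
        // How long successful lookups are cached inside the JVM.
        Security.setProperty("networkaddress.cache.ttl", Integer.toString(positiveTtlSeconds));
        // How long failed lookups are cached inside the JVM.
        Security.setProperty("networkaddress.cache.negative.ttl", Integer.toString(negativeTtlSeconds));
    }
}
```

Calling `JvmDnsCacheConfig.capDnsCache(5, 1)` early in `main` keeps the JVM from holding an address longer than roughly five seconds, so a short record TTL is not silently masked by the client process.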

In staging, where you restart one pod at a time and monitor manually, the 30-second window is invisible. In production with auto-scaling groups of 10 pods and rolling updates, a single pod crash causes a cascade of failures as requests pile up on dead instances. The TTL trap is especially dangerous when combined with a slow health check that doesn't detect the failure quickly.

To mitigate, reduce the TTL to 5–10 seconds for critical services, and ensure your readiness check is fast enough to detect failures within that window. Also implement client-side retry against a different instance on the first failure. For Kubernetes, consider a headless Service (clusterIP: None): DNS then returns all ready pod IPs directly, so the client holds the full list and can fail over to another IP without waiting for a new lookup. You lose the load balancing that kube-proxy provides, so you'll need your own client-side balancing, and resolver caches still apply to those records.
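Retrying against a different instance can be sketched as follows. `FailoverCaller` is a hypothetical helper, and the `call` predicate stands in for whatever request the client actually makes.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper: walks a list of resolved addresses until one call succeeds.
class FailoverCaller {
    // Returns the first address the call succeeds against; throws if every address fails.
    static String callWithFailover(List<String> addresses, Predicate<String> call) {
        for (String address : addresses) {
            try {
                if (call.test(address)) {
                    return address;
                }
            } catch (RuntimeException e) {
                // Treat an exception like a failed call and move on to the next address.
            }
        }
        throw new IllegalStateException("No reachable instance among " + addresses);
    }
}
```

With a headless Service, the address list could come from `InetAddress.getAllByName(host)`, which returns every address the name resolves to, so a stale entry costs one failed attempt rather than a failed request.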

check_dns_ttl.shBASH
# TheCodeForge - Inspect DNS TTL for a Kubernetes service
# Assuming kube-dns is at 10.100.0.10 (check your cluster's DNS IP)
# and that servicex lives in the "default" namespace

# Query the service name
$ dig +nocmd +noall +answer +ttlid servicex.default.svc.cluster.local @10.100.0.10
servicex.default.svc.cluster.local. 30 IN A 10.96.1.5
servicex.default.svc.cluster.local. 30 IN A 10.96.1.6

# TTL is 30 seconds. It is set cluster-wide in the DNS server's configuration
# (for CoreDNS, the ttl option of the kubernetes plugin in the Corefile), not per Service.
# For a headless service: dig +nocmd +noall +answer servicex.default.svc.cluster.local @10.100.0.10
# That returns the actual pod IPs instead of a single ClusterIP.
Output
TTL=30 (shown in the dig output)
Production Warning: TTL Thundering Herd
If you reduce TTL too low (e.g., 1 second), every client will re-resolve DNS on every request, overwhelming the DNS server. Kubernetes' kube-dns or CoreDNS will throttle requests, causing resolution failures. A TTL of 5–10 seconds balances freshness and server load. Additionally, at very low TTL, the DNS server becomes a single point of failure. Implement retry with fallback to a cached IP on failure.
Production Insight
DNS-based discovery with a 60-second TTL can leave up to a 60-second window of errors after each pod termination during rolling updates.
Each pod swap creates a window of stale routing, compounding over the entire deployment.
Rule: reduce TTL to 5 seconds for services under rolling updates or auto-scaling.
Key Takeaway
DNS caching = stale endpoints = dead instances that still receive traffic.
Short TTLs (5s) fix staleness but increase DNS server load.
Headless services in Kubernetes expose every pod IP to the client; use them with client-side failover when freshness matters more than load.
DNS TTL Strategy
If: Service has very dynamic endpoints (auto-scaling every minute)
Use: Reduce TTL to 5s or use headless service with client-side LB
If: Service endpoints change infrequently (deployments every hour)
Use: Default 30s TTL is fine; add client-side retry as safety net
If: You have high traffic volume and want to reduce DNS server load
Use: Keep TTL at 10-30s, but ensure readiness checks are fast (<5s)
If: You need zero staleness for critical payments
Use: Avoid DNS entirely: use a registry with immediate push notifications (e.g., Consul watches)

Consul, Eureka, and Kubernetes: Registry Implementations Compared

Three major registries dominate production deployments: Consul (HashiCorp), Eureka (Netflix), and Kubernetes native service discovery. Each takes a different philosophical approach.

Consul uses a gossip protocol (Serf) for health dissemination. This means health changes propagate quickly across all nodes via peer-to-peer updates, not centralised polling. Consul also provides a DNS interface (port 8600) that respects health status — unhealthy services are automatically removed from DNS responses. This makes it ideal for multi-cloud or hybrid environments where a central registry is required.

Eureka was designed by Netflix for their AWS-centric architecture. It uses a peer-to-peer pattern where each Eureka server replicates state. Eureka has a 'self-preservation' mode that kicks in when a large number of heartbeats are missed — it stops evicting instances, effectively assuming a network partition rather than an actual mass failure. This prevents a cascading removal of instances, but it also means stale entries survive longer. Eureka is best for environments with high churn and network instability.
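The self-preservation decision can be modelled in a few lines. This is a simplified sketch rather than Eureka's actual code: it assumes the commonly documented defaults of one renewal every 30s (two per instance per minute) and a renewal threshold of 0.85, both of which are configurable, so verify them against your Eureka version.

```java
// Simplified, hypothetical model of Eureka's self-preservation check.
class SelfPreservationCheck {
    static final double RENEWAL_PERCENT_THRESHOLD = 0.85;  // assumed server default
    static final int RENEWALS_PER_INSTANCE_PER_MINUTE = 2; // one heartbeat every 30 s

    // Eviction stays enabled only while enough heartbeats are arriving.
    static boolean evictionAllowed(int registeredInstances, int renewalsLastMinute) {
        double expected = registeredInstances * RENEWALS_PER_INSTANCE_PER_MINUTE;
        double threshold = expected * RENEWAL_PERCENT_THRESHOLD;
        return renewalsLastMinute > threshold;
    }
}
```

With 100 registered instances the threshold is 170 renewals per minute: at 180 the server keeps evicting expired leases, while at 150 it assumes a network problem and keeps every entry, including any that are truly dead.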

Kubernetes does not have a central registry by default. It uses DNS and the API server for service resolution. The EndpointSlice controller keeps track of all pod IPs behind a service. The kube-proxy on each node programs iptables or IPVS rules to forward traffic. This design is decentralised and extremely scalable but gives the team less control over routing logic without a service mesh.

Choosing between them depends on your infrastructure homogeneity, desired control, and operational maturity. Avoid a 'best of breed' mixture that forces every service to implement two discovery mechanisms simultaneously.

To query a registry programmatically, you can use the following example with Consul's HTTP API:

consul_query.shBASH
# TheCodeForge - Query Consul for healthy instances of a service
$ curl -s 'http://consul:8500/v1/health/service/payment?passing' | jq '.[] | {address: .Service.Address}'

Service Mesh: The Evolution of Server-Side Discovery

As microservices grow beyond 50 services, managing discovery, load balancing, and retries in each service's code becomes unsustainable. A service mesh (e.g., Istio, Linkerd) moves these responsibilities out of the application and into a sidecar proxy. Each service has a proxy injected alongside it (Envoy in Istio, linkerd2-proxy in Linkerd). The proxy handles all service-to-service communication: it discovers the target via a control plane, load balances, retries on failure, handles circuit breaking, and captures metrics. The application code becomes completely discovery-unaware. This is the ultimate server-side discovery pattern.

The trade-off is complexity: you now need to deploy, scale, and monitor a mesh infrastructure. For large organisations, the operational overhead is worth the decoupling it provides.

istio-sidecar.yamlYAML
# TheCodeForge - Istio VirtualService for weighted routing
# The v1 and v2 subsets must be defined in a matching DestinationRule.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment
spec:
  hosts:
  - payment
  http:
  - route:
    - destination:
        host: payment
        subset: v1
      weight: 90
    - destination:
        host: payment
        subset: v2
      weight: 10
When Not to Use a Service Mesh
For systems with fewer than 10 services, a service mesh adds unnecessary complexity. Stick with client-side or simple DNS discovery until you need advanced routing, strict mTLS, or canary deployments at scale.
Production Insight
Service mesh sidecars add ~2-5ms per hop in latency. If your latency budget is tight (e.g., <50ms total), measure the overhead.
Sidecars also consume memory: each Envoy proxy can use 50-100MB. With 200 pods, that's 10-20GB of extra memory.
Rule: do not introduce a service mesh until you have a proven scaling problem and the team to operate it.
Key Takeaway
Service mesh moves discovery from application code to infrastructure.
Latency overhead and operational cost are real.
Only adopt when the scale justifies the complexity.
Service Mesh Adoption Decision
If: You have <20 microservices and simple routing needs
Use: Skip service mesh; use client-side discovery with a registry
If: You need mTLS between all services and canary deployments
Use: Adopt Istio or Linkerd; the security and routing features justify the complexity
If: You are on Kubernetes and want to standardise discovery across teams
Use: Consider a mesh for consistent policy and observability

Production Hardening: Retry, Caching & Circuit Breakers

Even with a perfectly tuned registry and health checks, failures happen. A network partition can separate the registry from your clients. A slow GC pause can cause a false deregistration. The key to production hardening is assuming the registry will occasionally lie to you.

First, implement client-side caching of the instance list. Cache it locally for at least the health check interval (e.g., 10 seconds). When a resolution request arrives, return the cached list immediately and refresh asynchronously. This prevents every request from becoming a synchronous RPC to the registry.
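A sketch of "return the cached list immediately and refresh asynchronously". `CachedInstanceList` is a hypothetical class; the registry call is an injected `Supplier` and the clock is passed in, so nothing here depends on a particular registry client.

```java
import java.util.List;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical cache that serves the last known instance list while refreshing in the background.
class CachedInstanceList {
    private final Supplier<List<String>> registryLookup; // stands in for the real registry call
    private final Executor refreshExecutor;
    private final long maxAgeMs;
    private final AtomicBoolean refreshing = new AtomicBoolean(false);
    private volatile List<String> instances; // null until the first load
    private volatile long loadedAtMs;

    CachedInstanceList(Supplier<List<String>> registryLookup, Executor refreshExecutor, long maxAgeMs) {
        this.registryLookup = registryLookup;
        this.refreshExecutor = refreshExecutor;
        this.maxAgeMs = maxAgeMs;
    }

    List<String> get(long nowMs) {
        if (instances == null) {
            // First call: nothing to serve yet, so load synchronously once.
            reload(nowMs);
        } else if (nowMs - loadedAtMs >= maxAgeMs && refreshing.compareAndSet(false, true)) {
            // Stale: hand the refresh to the executor; only one refresh runs at a time.
            refreshExecutor.execute(() -> {
                try {
                    reload(nowMs);
                } finally {
                    refreshing.set(false); // a failed refresh keeps the old list in place
                }
            });
        }
        return instances;
    }

    private void reload(long nowMs) {
        instances = List.copyOf(registryLookup.get());
        loadedAtMs = nowMs;
    }
}
```

In production the executor would be a small background thread pool and `nowMs` would come from `System.currentTimeMillis()`. The design choice is that a slow or unreachable registry delays freshness, never the request itself.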

Second, add retry with exponential backoff. If the first resolved instance fails (connection refused, timeout, 5xx), retry with the next instance from the cached list. Set a maximum retry count (e.g., 3) and a backoff multiplier (e.g., 100ms initial, double each time). This handles transient failures during TTL windows.

Third, use a circuit breaker per service. If a particular service returns errors on >50% of requests within a sliding window (e.g., 10 seconds), open the circuit — stop sending requests entirely for a cooldown period. This prevents cascading failures when the registry is still pointing to a bad instance.
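The error-rate rule described here differs from a simple consecutive-failure counter, so a sketch of the sliding-window variant may help. `ErrorRateCircuitBreaker` is hypothetical, and the `minCalls` guard (do not trip on a handful of requests) is an added assumption, not something the text prescribes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sliding-window circuit breaker keyed on error rate.
class ErrorRateCircuitBreaker {
    private final long windowMs;
    private final long cooldownMs;
    private final double maxErrorRate;
    private final int minCalls;
    private final Deque<long[]> calls = new ArrayDeque<>(); // each entry: {timestampMs, failed ? 1 : 0}
    private long openedAtMs = -1; // -1 means the circuit is closed

    ErrorRateCircuitBreaker(long windowMs, long cooldownMs, double maxErrorRate, int minCalls) {
        this.windowMs = windowMs;
        this.cooldownMs = cooldownMs;
        this.maxErrorRate = maxErrorRate;
        this.minCalls = minCalls;
    }

    synchronized boolean allowRequest(long nowMs) {
        if (openedAtMs < 0) {
            return true;
        }
        if (nowMs - openedAtMs >= cooldownMs) {
            // Cooldown over: close the circuit and start with an empty window.
            openedAtMs = -1;
            calls.clear();
            return true;
        }
        return false;
    }

    synchronized void record(boolean success, long nowMs) {
        calls.addLast(new long[] {nowMs, success ? 0 : 1});
        // Drop outcomes that have fallen out of the sliding window.
        while (!calls.isEmpty() && nowMs - calls.peekFirst()[0] > windowMs) {
            calls.removeFirst();
        }
        long failures = calls.stream().filter(c -> c[1] == 1).count();
        if (calls.size() >= minCalls && (double) failures / calls.size() > maxErrorRate) {
            openedAtMs = nowMs;
        }
    }
}
```

One breaker instance per downstream service matches the "per service" advice: a failing payment backend opens only its own circuit.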

These three patterns — cache, retry, circuit break — are not optional for production service discovery. They transform a fragile central registry into a system that degrades gracefully.

io/thecodeforge/discovery/ResilientDiscoveryClient.javaJAVA
// TheCodeForge — Client-side discovery with caching, retry and circuit breaker
package io.thecodeforge.discovery;

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class ResilientDiscoveryClient {
    private final List<String> cachedEndpoints = new CopyOnWriteArrayList<>();
    private final AtomicInteger failureCount = new AtomicInteger(0);
    private final AtomicLong lastFailureTime = new AtomicLong(0);
    private static final int CIRCUIT_THRESHOLD = 5;
    private static final long COOLDOWN_MS = 30_000;

    public String getEndpoint(String serviceName) {
        if (isCircuitOpen()) {
            throw new RuntimeException("Circuit open for " + serviceName + ". Retry later.");
        }
        if (cachedEndpoints.isEmpty()) {
            refreshCache(serviceName);
        }
        // Retry with backoff on failure
        for (int attempt = 0; attempt < 3; attempt++) {
            if (!cachedEndpoints.isEmpty()) {
                String endpoint = cachedEndpoints.get(0); // simplified: always try the head of the list
                if (tryCall(endpoint)) {
                    failureCount.set(0);
                    return endpoint;
                }
                recordFailure();
                cachedEndpoints.remove(endpoint); // drop the failed endpoint so the next attempt uses another
            }
            sleep(Math.min(100 * (1 << attempt), 2000)); // exponential backoff: 100ms, 200ms, 400ms
        }
        throw new RuntimeException("All instances failed for " + serviceName);
    }

    private boolean isCircuitOpen() {
        if (failureCount.get() >= CIRCUIT_THRESHOLD) {
            long elapsed = System.currentTimeMillis() - lastFailureTime.get();
            if (elapsed > COOLDOWN_MS) {
                failureCount.set(0); // half-open
                return false;
            }
            return true;
        }
        return false;
    }

    private void recordFailure() {
        failureCount.incrementAndGet();
        lastFailureTime.set(System.currentTimeMillis());
    }

    private void refreshCache(String serviceName) {
        // Query registry and populate cachedEndpoints
        // This should run asynchronously off the critical path
    }

    private boolean tryCall(String endpoint) {
        // return true if call succeeds
        return false; // placeholder
    }

    private void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}

Kubernetes Service Discovery Mechanisms Under the Hood

Kubernetes offers multiple discovery mechanisms: DNS through CoreDNS, environment variables, and the Kubernetes API. Understanding how they work helps you choose the right one for each scenario.

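DNS-based discovery is the default. Each Service gets a DNS name (e.g., my-svc.my-namespace.svc.cluster.local). CoreDNS resolves it to the ClusterIP (virtual IP) of the Service, which kube-proxy then load balances to healthy pods. The record's TTL (30 seconds in the examples here) is set in the cluster DNS configuration (for CoreDNS, the ttl option of the kubernetes plugin in the Corefile) and applies cluster-wide. Kubernetes has no per-Service TTL setting, and the pod-level dnsConfig field controls nameservers, search domains, and resolver options rather than TTL.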

Headless services (with clusterIP: None) bypass the virtual IP entirely. DNS returns all pod IPs directly. This is useful for stateful workloads (StatefulSets) or when you need client-side load balancing. But you lose the load balancing from kube-proxy, so you must implement retry and backoff in your client.

Environment variables are injected at pod creation time. Kubernetes writes the ClusterIP and port of every Service in the same namespace into each pod's environment. This is simple but stale — updates after pod creation won't be reflected.
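Reading those variables looks like this. Kubernetes names them `{SVCNAME}_SERVICE_HOST` and `{SVCNAME}_SERVICE_PORT`, with the Service name upper-cased and dashes turned into underscores. `EnvServiceLookup` is a hypothetical helper, and the environment is passed in as a map so the lookup can be exercised outside a cluster.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for resolving a Service from pod environment variables.
class EnvServiceLookup {
    // "my-svc" -> "MY_SVC"
    static String envPrefix(String serviceName) {
        return serviceName.toUpperCase(Locale.ROOT).replace('-', '_');
    }

    // In a pod, pass System.getenv(); returns empty if the Service did not exist at pod start.
    static Optional<String> resolve(String serviceName, Map<String, String> env) {
        String host = env.get(envPrefix(serviceName) + "_SERVICE_HOST");
        String port = env.get(envPrefix(serviceName) + "_SERVICE_PORT");
        if (host == null || port == null) {
            return Optional.empty();
        }
        return Optional.of(host + ":" + port);
    }
}
```

The empty result is the staleness problem in miniature: a Service created after the pod started simply has no variables, and nothing updates them later.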

Kubernetes API is the most powerful but also the riskiest: every watching client needs RBAC permissions and adds load on the API server. You can watch EndpointSlice resources for real-time updates. This is how service meshes like Istio pick up pod changes almost immediately.

Production recommendation: for most stateless apps, use DNS with a TTL of 5-10 seconds and add client-side retry. For stateful apps or low-latency requirements, use headless services with client-side caching. For mesh, let the sidecar proxy handle discovery via API watches.

k8s_discovery_check.shBASH
# TheCodeForge - Inspect Kubernetes service discovery details

# Check DNS resolution for a service
kubectl run -it --rm dns-test --image=busybox:1.28 -- nslookup my-svc

# List endpoints directly
kubectl get endpoints my-svc -n my-namespace

# Watch endpoint changes in real-time (useful for debugging)
kubectl get endpoints my-svc -n my-namespace -w

# For headless services, DNS returns pod IPs directly
# Use dig +short to see multiple A records
Headless Service Tip
When you use a headless service, each client must handle load balancing itself. That means your client code needs retry logic with fallback to another IP. The trade-off is lower latency (no virtual IP hop), but DNS answers are still cached, so keep TTLs short and re-resolve after a connection failure.
Production Insight
Using environment variables for discovery is a common mistake in clusters where Services are created or recreated after their consumers are already running.
Variables are set at pod creation and never updated, so a pod started before the Service existed (or before its ClusterIP changed) holds missing or stale values.
Rule: avoid environment variables for service resolution unless every Service is guaranteed to exist, unchanged, before its client pods start.
Key Takeaway
Kubernetes gives you three discovery roads: DNS (simple), headless (fast but non-trivial), API (powerful).
Pick the one that matches your traffic pattern and operational comfort.
Environment variables are a trap for production — they go stale without warning.
Kubernetes Discovery Method Selection
If: Simple stateless API with moderate traffic
Use: Regular DNS-based discovery (ClusterIP) with TTL 10s
If: Stateful workload needing stable network identity
Use: Headless service (clusterIP: None) with client-side load balancing
If: Need real-time pod changes (e.g., for canary routing)
Use: Kubernetes API watches via a service mesh or custom controller
If: Low-throughput internal tooling
Use: Environment variables are acceptable (simple and fast)

Monitoring and Observing Service Discovery Health

Service discovery failures often manifest as intermittent errors that are hard to trace. You need proactive monitoring to detect issues before they cause outages. Key metrics to monitor: registration count, health check pass/fail ratio, DNS query latency, and registry request volume.

Set up alerts for: sudden drop in registered instances (indicates mass deregistration or partition), increase in health check failures, high DNS resolution latency, or registry response time spikes.
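
As a rough sketch of the "sudden drop in registered instances" alert, the helpers below compare two consecutive instance counts. The function names and the threshold are assumptions for illustration; the counts would come from whatever your registry reports (for example, the length of a Consul health query result).

```bash
# instance_drop_pct PREVIOUS CURRENT
# Prints the percentage drop in registered instances (0 if the count held or grew).
instance_drop_pct() {
  prev=$1
  curr=$2
  if [ "$prev" -le 0 ] || [ "$curr" -ge "$prev" ]; then
    echo 0
    return 0
  fi
  echo $(( (prev - curr) * 100 / prev ))
}

# should_alert PREVIOUS CURRENT THRESHOLD_PCT
# Succeeds when the drop meets or exceeds the threshold.
should_alert() {
  [ "$(instance_drop_pct "$1" "$2")" -ge "$3" ]
}
```

Comparing against the previous sample rather than a fixed floor keeps the alert meaningful for services that scale up and down during the day.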

Use distributed tracing (e.g., Jaeger) to correlate service calls with discovery queries. When a call fails, can you see if the client resolved the wrong IP? Traces coupled with metrics give you the full picture.

Logging: discovery operations should emit structured logs with service name, outcome, and latency. In high-throughput systems, do not log every successful resolution; log failures and cache misses.
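
A minimal sketch of such a log line, assuming JSON output and a hypothetical outcome label cache_hit for resolutions served from cache. The field names are illustrative, and the function does no JSON escaping, so it assumes service names and outcomes contain no quotes.

```bash
# log_discovery SERVICE OUTCOME LATENCY_MS
# Emits one JSON line per discovery event. Cache hits are skipped to keep
# log volume down; every other outcome (failures, cache misses) is logged.
log_discovery() {
  service=$1
  outcome=$2
  latency_ms=$3
  if [ "$outcome" = "cache_hit" ]; then
    return 0
  fi
  printf '{"event":"discovery","service":"%s","outcome":"%s","latency_ms":%s}\n' \
    "$service" "$outcome" "$latency_ms"
}
```

Timestamps are left to the log shipper here; add one in the function if your pipeline does not stamp lines on ingestion.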

Dashboards: build a service discovery dashboard showing number of healthy instances per service, health check pass rate, DNS TTL adherence, and registry response time. This helps you spot trends before they become incidents.

monitor_discovery.sh (BASH)
# Quick monitoring commands for service discovery

# Consul: count passing and critical instances for every registered service
consul catalog services | while read -r svc; do
  passing=$(curl -s "http://localhost:8500/v1/health/service/$svc?passing" | jq length)
  # any(...) counts each instance once, even when several of its checks are critical
  critical=$(curl -s "http://localhost:8500/v1/health/service/$svc" | jq '[.[] | select(any(.Checks[]; .Status == "critical"))] | length')
  echo "$svc: $passing passing, $critical critical"
done

# Kubernetes: watch the number of ready addresses behind a specific service
kubectl get endpoints my-svc -n my-namespace -w -o json | jq '[.subsets[]?.addresses[]?] | length'

# Eureka: list registered applications and the status of each instance
curl -s http://eureka:8761/eureka/apps | xmllint --format - | grep -E '<status>|<app>'
Mental Model: The Registry as a Health Report
  • Each service's health check is like a vital sign (pulse, temperature).
  • A sudden drop in registered instances is like a mass fainting in a crowd.
  • Health check failure spikes indicate a systemic issue (like a virus).
  • DNS resolution latency is like a slow phone book lookup.
  • Distributed tracing connects the symptoms to the root cause across services.
Production Insight
A team once missed a slow registry degradation because they only monitored HTTP error rates.
By the time errors spiked, the registry was already serving stale data for 20 minutes.
Rule: monitor registry health directly — don't rely on downstream service errors as a proxy.
Key Takeaway
Monitor the registry, not just the services.
Track registration count, health check pass rate, and DNS latency.
Use distributed traces to connect discovery decisions to request outcomes.
Monitoring Configuration Priorities
If: You have no monitoring yet
Use: Start with health check pass rate per service and registry response time
If: You see intermittent 503s but don't know why
Use: Add DNS resolution metrics: latency, cache hit rate, TTL adherence
If: You want to catch cascading failures early
Use: Implement distributed tracing with discovery context (resolved IP, retry count)
If: You need capacity planning for discovery infrastructure
Use: Monitor query rate per registry node and cache hit ratio

Testing Service Discovery Failure Scenarios

You cannot assume discovery will work in production just because it worked in development. You need to test failure scenarios deliberately. Chaos engineering for discovery: kill a registry node, pause heartbeats from a set of services, induce network partitions between services and registry, and simulate slow health check responses.

Each test should validate: clients fall back to cached instances, new instances can still register with a partially available registry, the system does not degrade into a hard failure, and recovery is automatic after the fault is removed.

Automated integration tests: in a test environment, spin up your discovery infrastructure, register services, then abruptly kill one and measure how long it takes for clients to stop sending traffic to it. That's your convergence time.

Performance test your registry under load: simulate the maximum number of services and heartbeats you expect in production. Measure CPU and memory per node. Registry servers that handle 100 services can look fine but buckle under 10,000.

Don't forget to test TTL behaviour: set your DNS TTL to the production value, kill a pod, and measure the actual error window with client-side retry enabled. You'll be surprised how much longer than TTL the errors can last (due to client caching layers).

test_discovery_failures.sh (BASH)
# Chaos testing for service discovery convergence time

# Record the target pod's name and IP BEFORE deleting it
POD=$(kubectl get pod -n my-namespace -l app=payment -o jsonpath='{.items[0].metadata.name}')
OLD_IP=$(kubectl get pod -n my-namespace "$POD" -o jsonpath='{.status.podIP}')

# Simulate a pod crash (only the recorded pod, not every payment pod)
kubectl delete pod -n my-namespace "$POD" --force --grace-period=0

# Measure convergence: check when DNS stops returning the old IP.
# Pod IPs only appear in DNS for headless services; a ClusterIP service always
# resolves to its virtual IP. Replace 10.100.0.10 with your cluster DNS address.
echo "Watching for $OLD_IP to disappear from DNS..."

# Poll DNS until IP is gone (or timeout after 30 seconds)
for i in $(seq 1 30); do
  RESULT=$(dig +short my-svc.my-namespace.svc.cluster.local @10.100.0.10)
  # -x and -F match the whole line literally, so 10.0.0.1 does not match 10.0.0.10
  if echo "$RESULT" | grep -qxF "$OLD_IP"; then
    echo "Second $i: $OLD_IP still in DNS"
  else
    echo "Second $i: old IP removed from DNS"
    break
  fi
  sleep 1
done

# Also check health check removal
# In Consul, list the nodes that currently have a passing payment instance
# (re-run, or pass a handler command, to follow changes over time)
consul watch -type=service -service=payment -passingonly=true | jq '.[].Node.Node'
Automate Convergence Tests in Your CI Pipeline
Every time you change discovery configuration (TTL, health check interval, deregistration timeout), run a convergence test as part of your CI pipeline. It's the only way to catch unexpected regressions.
Production Insight
Many teams discover their TTL trap only during a production incident.
A 30-second DNS TTL plus a 10-second health check interval adds up to a worst-case window of about 40 seconds in which dead instances can still receive traffic.
Rule: measure actual convergence time in your environment — it's always longer than the TTL you configured.
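
The arithmetic behind that window can be written down as a tiny helper. This is an estimate under stated assumptions, not a measurement: the function name is illustrative, and the optional failure threshold (how many consecutive failed checks are needed before an instance is marked unhealthy) is an added assumption that defaults to 1.

```bash
# worst_case_stale_seconds DNS_TTL CHECK_INTERVAL [FAILURE_THRESHOLD]
# Estimate: the health check needs up to CHECK_INTERVAL * FAILURE_THRESHOLD
# seconds to notice the failure, then cached DNS answers can live for
# another DNS_TTL seconds.
worst_case_stale_seconds() {
  ttl=$1
  interval=$2
  threshold=${3:-1}
  echo $(( interval * threshold + ttl ))
}
```

Calling worst_case_stale_seconds 30 10 prints 40, matching the figure above; with a failure threshold of 3 the same settings give 60. Extra client-side caching layers push the real number higher still, which is why it has to be measured.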
Key Takeaway
Test discovery failures deliberately — don't wait for production to teach you.
Measure convergence time, fallback behaviour, and registry capacity.
Make convergence tests part of your CI pipeline for every configuration change.
Testing Priority by Failure Type
If: Unclear how long traffic continues to dead instances
Use: Test pod crash and measure DNS removal plus health check deregistration time
If: Registry becomes slow or unresponsive
Use: Test client-side caching fallback: stop registry and verify service calls still succeed
If: Network partition splits clients from registry
Use: Test that clients use cached instance lists and do not degrade into all-or-nothing behaviour
If: New instances don't receive traffic after scaling
Use: Test registration timing: start a new instance and measure when it first appears in DNS/resolution
● Production incident · POST-MORTEM · severity: high

The DNS TTL Trap That Killed Black Friday Traffic

Symptom
About 20% of requests to the payment service returned 503 errors while the service itself was healthy and responding to direct IP requests.
Assumption
The team assumed that DNS-based service discovery with a 30-second TTL would be fast enough to react to pod failures. They tested it in staging with a single pod restart, and it worked fine.
Root cause
When the payment service pod restarted (due to an OOM kill), the old IP was removed from DNS. However, many clients had cached the old DNS record for up to 30 seconds. During that window, they kept hitting the dead pod. Worse, because the health check was liveness-only (TCP port check), the old pod was marked healthy until the TCP connection actually timed out. The combination of stale DNS and slow health check convergence caused routing to a dead instance.
Fix
Reduced DNS TTL to 5 seconds for service discovery records. Changed health checks to application-level (HTTP /health/ready) with a shorter interval (5s). Implemented client-side retry with a different instance on failure. Added a circuit breaker that stops sending requests to an unhealthy instance after a configurable failure count.
Key lesson
  • DNS TTL is not a joke: it's a consistency latency. Treat it as a staleness window that each caching layer can extend, not a freshness guarantee.
  • Liveness health checks alone are dangerous for discovery. Use readiness probes that reflect the service's ability to handle requests.
  • Client-side retry with fallback to another instance is essential when using DNS-based discovery with moderate TTLs.
  • Test failure scenarios with the actual TTL and multiple pods to expose cascading timing issues.
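
The failure-count circuit breaker from the fix can be sketched in a few shell functions. This is a deliberately minimal version under stated assumptions: the function names, CB_DIR, and CB_MAX_FAILURES are invented for the example, state is kept in one small file per instance, and there is no cool-down or half-open state, which a real breaker would normally add.

```bash
# Minimal circuit breaker keyed by instance address.
CB_DIR=${CB_DIR:-$(mktemp -d)}
CB_MAX_FAILURES=${CB_MAX_FAILURES:-3}   # assumed threshold, tune per service

# Prints the recorded consecutive failures for an instance (0 if none).
cb_failures() {
  cat "$CB_DIR/$1" 2>/dev/null || echo 0
}

# Succeeds when the breaker is open, i.e. the instance should be skipped.
cb_is_open() {
  [ "$(cb_failures "$1")" -ge "$CB_MAX_FAILURES" ]
}

cb_record_failure() {
  echo $(( $(cb_failures "$1") + 1 )) > "$CB_DIR/$1"
}

# Any success closes the breaker again.
cb_record_success() {
  rm -f "$CB_DIR/$1"
}

# pick_instance IP... -> prints the first instance whose breaker is closed.
pick_instance() {
  for ip in "$@"; do
    if ! cb_is_open "$ip"; then
      echo "$ip"
      return 0
    fi
  done
  return 1
}
```

A caller would pick an instance, send the request, and then call cb_record_success or cb_record_failure depending on the outcome, so that a dead pod drops out of rotation after the configured number of failures.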
Production debug guide
Symptom → Action mapping for the most common discovery outages (5 entries)
Symptom · 01
Requests to service X timeout or get connection refused intermittently
Fix
Check if the target service is registered in the registry (e.g., curl registry:8500/v1/health/service/X for Consul). Verify its health check status. Look for 'passing' vs 'critical'.
Symptom · 02
DNS resolves to an old IP that doesn't respond
Fix
Check the TTL on the DNS record. Run dig +short serviceX.<namespace>.svc.cluster.local @cluster-dns-ip and compare with current pod IPs. Flush the local DNS cache by restarting the client container or, if nscd is running, with nscd -i hosts.
Symptom · 03
Health check shows passing but service still returns errors
Fix
Your health check is too shallow. Use an HTTP readiness endpoint that actually exercises the service's dependencies (DB, cache). Check if the health check is liveness (process alive) or readiness (ready to serve).
Symptom · 04
Newly scaled instances are not receiving traffic
Fix
Verify registration timing: after startup, does the service wait for a readiness confirmation before advertising itself? In Kubernetes, check the startupProbe and readinessProbe timing. In Consul/Eureka, check the initialStatus and heartbeatInterval.
Symptom · 05
Registry returns inconsistent health status across nodes
Fix
Check for network partitions between registry nodes. In Consul, verify Serf gossip status with consul members and look for left or failed nodes. In Eureka, verify all peers are reachable and not in self-preservation mode.
★ Quick Debug Cheat Sheet for Service Discovery
Five common failure symptoms with the exact commands and fixes to resolve them in under a minute.
Service not found in registry
Immediate action
Check if the service process is running and registered with correct service name.
Commands
curl -s http://localhost:8500/v1/agent/services | jq '.[] | select(.Service == "payment")'
kubectl get pod -l app=payment -o wide
Fix now
If service missing, check EUREKA_INSTANCE_HOSTNAME or CONSUL_HTTP_ADDR environment variable. In K8s, check service labels and endpoint readiness.
Requests hit wrong IP or port
Immediate action
Check the DNS resolution for the service name and compare with actual pod IPs.
Commands
dig +short serviceX.<namespace>.svc.cluster.local @10.100.0.10
kubectl get endpoints serviceX -o yaml
Fix now
If DNS returns stale IPs, reduce DNS TTL or switch to a polling-based registry. If K8s endpoints are correct but DNS is wrong, restart coredns or kube-dns.
Health check passing but service returns 503
Immediate action
Check if the service's dependencies (DB, cache) are reachable from the instance.
Commands
kubectl exec -it <pod> -- curl localhost:8080/health/ready
kubectl exec -it <pod> -- nc -zv db-service 3306
Fix now
Update health check endpoint to include dependency checks. For readiness failures, configure startupProbe to delay readiness until after dependency init.
New instance not receiving traffic after scaling
Immediate action
Check when the new instance was registered and if it passed readiness.
Commands
kubectl describe pod <new-pod> | grep -A5 Readiness
curl -s http://consul:8500/v1/health/checks/serviceX | jq '.[] | select(.Node == "<new-node>")'
Fix now
Increase readiness probe initial delay seconds to match actual startup time. In Consul, lower the deregister critical service timeout to quickly remove dead instances.
Multiple instances show as unhealthy after brief network blip
Immediate action
Check if the registry entered self-preservation mode (Eureka) or if gossip convergence is slow (Consul).
Commands
curl -s http://eureka:8761/eureka/apps | grep -oE 'selfPreservation" : "[^"]*"'
consul members -detailed | grep -E 'left|failed'
Fix now
For Eureka, disable self-preservation during planned network events. For Consul, increase gossip interval or tweak failure detection timing.
Service Discovery Patterns at a Glance
| Aspect | Client-Side Discovery | Server-Side Discovery | DNS-Based Discovery |
| --- | --- | --- | --- |
| Client complexity | High (needs registry client + load balancer logic) | Low (just send request to fixed endpoint) | Medium (DNS lookup + retry logic often needed) |
| Latency | Low (no extra hop) | Medium (+1 network hop through load balancer) | Low to Medium (DNS resolution time, typically <10ms cached) |
| Control over routing | Full (can implement canary, retry, circuit breaking per client) | Limited (routing logic is centralised in balancer or proxy) | Minimal (DNS round-robin only; no per-request control) |
| Scalability under load | Registry query load proportional to number of clients | Load balancer can become bottleneck, but easy to scale horizontally | DNS servers scale well but TTL and caching cause staleness |
| Resilience to registry failure | Only cached instances work until cache expiry | Load balancer chooses an alternative backend as long as it has recent cache | DNS caching works for the TTL duration; after that, resolution fails |
| Example implementations | Eureka, Consul (client libraries), Ribbon | Kubernetes Services + Ingress, AWS ALB, Envoy sidecar | CoreDNS, kube-dns, Consul DNS interface |

Common mistakes to avoid

9 patterns
×

Memorising syntax before understanding the concept

Symptom
Unable to adapt discovery pattern to different runtime environments; when the registry changes, engineers start from scratch.
Fix
Focus on the two core operations: registration and resolution. Learn why each step exists before memorising API calls.
×

Skipping hands-on practice with a real registry

Symptom
First production incident with discovery leads to panic and long debugging sessions; no experience with failure modes.
Fix
Set up a local Consul or use Kubernetes service discovery in minikube. Deliberately kill a pod and watch how long it takes for clients to stop sending traffic.
×

Using only a liveness check for discovery routing

Symptom
Traffic continues to a service whose dependency (DB, cache) is down; all requests fail but the health check shows healthy.
Fix
Implement a separate readiness endpoint that verifies critical dependencies. Use Kubernetes readinessProbe or Consul check with HTTP endpoint.
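
One possible shape for such a readiness check, written as a script that could back an exec-style readiness probe. It is a sketch under assumptions: the function names are invented, and the dependency hostnames and ports are placeholders (db-service:3306 mirrors the debug command used elsewhere in this guide; cache-service:6379 is hypothetical).

```bash
# readiness_status CHECK...
# Each CHECK is a command name (function or binary) that exits 0 when one
# critical dependency is reachable. Prints 200 and returns 0 if all pass,
# prints 503 and returns 1 at the first failure.
readiness_status() {
  for check in "$@"; do
    if ! "$check" >/dev/null 2>&1; then
      echo 503
      return 1
    fi
  done
  echo 200
}

# Hypothetical dependency probes; hostnames and ports are placeholders.
check_db()    { nc -z -w 2 db-service 3306; }
check_cache() { nc -z -w 2 cache-service 6379; }
```

Running readiness_status check_db check_cache as the probe command means the instance is only advertised while both dependencies answer; a liveness check would deliberately leave these out so a database outage does not restart healthy processes.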
×

Setting DNS TTL too high (e.g., 60 seconds) for frequently-changing endpoints

Symptom
After a crash or scaling event, clients continue hitting dead pods for the entire TTL window.
Fix
Reduce TTL to 5–10 seconds for service discovery records. Headless services in Kubernetes expose exact pod IPs, but their DNS answers are still cached, so pair them with short TTLs and re-resolution on connection failure.
×

Running multiple registries (e.g., Consul + Eureka) without coordination

Symptom
Some clients route through Consul, others through Eureka, leading to inconsistent views of healthy instances.
Fix
Standardise on one registry per environment. If you must have two, build a synchronisation layer that propagates health state between them.
×

Not caching registry queries on the client side

Symptom
Each request makes a synchronous call to the registry; when the registry slows down, entire system latency spikes.
Fix
Cache the list of healthy instances locally for at least 10 seconds. Update asynchronously off the critical path.
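
A file-backed sketch of that cache follows. The names cached_instances, CACHE_DIR, and CACHE_TTL are assumptions for illustration. For brevity it refreshes inline when the entry has expired rather than asynchronously, so it only approximates the advice above; a fuller version would run the refresh in a background loop. It does serve the stale copy when the registry call fails.

```bash
CACHE_DIR=${CACHE_DIR:-$(mktemp -d)}
CACHE_TTL=${CACHE_TTL:-10}   # seconds a cached instance list stays fresh

# cached_instances SERVICE FETCH_CMD
# FETCH_CMD is any command that prints the current healthy instances for
# SERVICE, one per line, and exits non-zero when the registry is unavailable.
cached_instances() {
  service=$1
  fetch=$2
  file="$CACHE_DIR/$service"
  now=$(date +%s)
  if [ -f "$file" ]; then
    fetched_at=$(head -n 1 "$file")
    if [ $((now - fetched_at)) -lt "$CACHE_TTL" ]; then
      tail -n +2 "$file"        # fresh enough: skip the registry entirely
      return 0
    fi
  fi
  if fresh=$("$fetch" "$service"); then
    { echo "$now"; echo "$fresh"; } > "$file"
    echo "$fresh"
  elif [ -f "$file" ]; then
    tail -n +2 "$file"          # registry down: serve the stale copy
  else
    return 1                    # nothing cached and registry unavailable
  fi
}
```

The first line of each cache file holds the fetch time and the remaining lines hold the instances, which keeps the state inspectable with cat during an incident.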
×

Using health check interval longer than DNS TTL

Symptom
A pod becomes unhealthy but the health check doesn't detect it until after the DNS TTL expires, causing stale routes for the entire window.
Fix
Ensure health check interval is shorter than DNS TTL (e.g., health check every 5s, DNS TTL 10s) so that unhealthy pods are removed before DNS cache expires.
×

Omitting DeregisterCriticalServiceAfter in Consul

Symptom
After a pod crashes, the critical service remains in the registry indefinitely, causing DNS to continue returning dead IPs for the full TTL window.
Fix
Set DeregisterCriticalServiceAfter to a reasonable timeout (e.g. 1m, the minimum Consul honours) so Consul automatically removes instances whose checks stay critical.
×

Using default heartbeat intervals without tuning for GC pauses

Symptom
Services with long GC pauses (>1s) get falsely deregistered, causing flapping availability and confusing alerts.
Fix
Measure JVM GC pause times, then increase heartbeat interval to at least 2s and expiry to 6s. Or switch to a GC algorithm with lower pause times.