Service Discovery Trap — DNS TTL Killed Black Friday
The DNS TTL trap that killed Black Friday: 20% of payment requests failed with 503 due to stale DNS cache.
- Service discovery lets services find each other by name, not IP:port
- Two core operations: registration (service tells registry 'I'm here') and resolution (client asks 'where is service X?')
- Client-side discovery: client queries the registry and load-balances itself
- Server-side discovery: a load balancer or proxy (e.g., API gateway) handles both lookup and routing
- Health checks separate liveness (is the process alive?) from readiness (is it ready to serve?) — confusing them causes cascading failures
- DNS-based discovery has a TTL trap: cached stale records continue routing to dead instances after a crash
Imagine you move to a new city and need a plumber. You don't have their number memorised — you look them up in a directory, get their current address, and call them directly. Service Discovery is that directory for software services. Instead of hardcoding 'Service B lives at 192.168.1.42:8080', every service registers itself in a central registry when it starts up, and looks up others by name when it needs them. The directory always stays fresh, even when services crash, scale up, or move to a new machine. Without it, auto-scaling and container rescheduling would break every client that cached a stale IP.
In a monolith, your code calls a function — it's right there in the same process. In a microservices architecture running across hundreds of containers on dynamically scheduled cloud infrastructure, that luxury disappears overnight. Pods get rescheduled. Auto-scaling fires up three new instances of your payment service at 11pm on Black Friday. IPs change. Ports shift. If you hardcoded any of that, your system falls apart the moment the environment breathes. Service Discovery is the infrastructure primitive that makes dynamic, self-healing distributed systems actually work in production.
The problem it solves is deceptively simple to state and brutally hard to get right: how does Service A know where to send its request to Service B, right now, with a healthy instance, without a human operator updating a config file? The naive solution — a static config map — breaks the moment you deploy more than once a week. The production solution requires a registry, a health-check protocol, and a resolution strategy that can handle partial failures, network partitions, and stale data without cascading into an outage.
By the end of this article you'll understand the two fundamental discovery patterns (client-side and server-side), how health checks work under the hood, why DNS-based discovery has a hidden TTL trap that bites almost every team, how Consul, Eureka, and Kubernetes each implement the registry differently, and what you need to think about before choosing one. You'll also walk away with concrete production gotchas that most tutorials skip entirely.
What is Service Discovery?
Service discovery is a core building block of distributed system design, and it is easiest to understand through what it does rather than through a formal definition.
At its simplest, service discovery has two jobs: registration — when a service starts, it tells a central registry 'I am running at this IP:port and I am healthy'. And resolution — when another service needs to call it, it asks the registry 'give me a healthy instance of service X'. The registry acts as the source of truth for the current state of all services in the distributed system.
This pattern is more than a century old. Telephone operators maintained switchboards to connect callers. DNS maps domain names to IPs. The difference in microservices is the rate of change: instances come and go every second due to auto-scaling, rolling updates, and failures. A static phone book would be outdated before it's printed.
How Registration & Resolution Actually Work
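Registration is not just posting a key-value pair. The registry must also decide when to remove an instance, and it does this through heartbeats or health checks. In Eureka, each instance sends a heartbeat (lease renewal) every 30s by default, and the lease expires after 90s, that is, after three missed heartbeats. In Consul, the local agent runs the checks you define at the interval you configure (10s is a common choice), or the service reports in through a TTL check. Once an instance misses its window, the registry stops returning it.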
Resolution can happen in two ways: client-side — the client queries the registry for all healthy instances and picks one using a load balancing strategy (round-robin, random, least connections). Server-side — the client sends the request to a known endpoint (load balancer or proxy), which uses the registry to find a healthy backend.
Crucially, resolution is a distributed read — every client reads a copy of the registry state. Different clients may see different subsets of instances due to caching and eventual consistency. That's fine for high-level load distribution but causes trouble during rapid failover scenarios.
Heartbeat tuning is a subtle art. Too short and you'll get false deregistrations from GC pauses. Too long and dead instances stay in the pool. Start with an expiry window of 3x the heartbeat interval, then measure your JVM GC pause times: the expiry window must comfortably exceed the longest pause you observe. If you see pauses over 500ms, for example, keep the heartbeat interval at 1.5 seconds or more so that a single pause cannot cause a false positive.
- When a new line is connected (service starts), the operator plugs it into the board (registers).
- The operator keeps the board updated by calling each line periodically (heartbeat).
- When you want to call someone, you ask the operator to connect you (resolution).
- If the operator doesn't get an answer for a while, they unplug the line (deregistration).
- Multiple operators may have slightly different views of which lines are live (eventual consistency).
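The register, heartbeat, and expire cycle described above can be sketched as a toy in-memory registry. This is a minimal illustration rather than how Eureka or Consul are implemented: the class name, the default values, and the injectable clock are all assumptions made for the example.

```python
import time
from typing import Callable


class Registry:
    """Toy in-memory registry: instances drop out when heartbeats stop."""

    def __init__(self, heartbeat_interval: float = 30.0, missed_allowed: int = 3,
                 clock: Callable[[], float] = time.monotonic):
        self._interval = heartbeat_interval   # seconds between expected heartbeats
        self._missed_allowed = missed_allowed # missed beats tolerated before removal
        self._clock = clock                   # injectable so expiry can be tested
        self._last_seen: dict[tuple[str, str], float] = {}

    def register(self, service: str, address: str) -> None:
        self._last_seen[(service, address)] = self._clock()

    def heartbeat(self, service: str, address: str) -> None:
        # A heartbeat just refreshes the timestamp, same as registering again.
        self.register(service, address)

    def resolve(self, service: str) -> list[str]:
        """Return addresses whose last heartbeat is inside the expiry window."""
        now = self._clock()
        expiry = self._interval * self._missed_allowed
        # Deregister anything that has been silent for longer than the window.
        self._last_seen = {key: seen for key, seen in self._last_seen.items()
                           if now - seen <= expiry}
        return sorted(addr for (svc, addr) in self._last_seen if svc == service)
```

A real registry would also replicate this state between nodes, which is where the eventual-consistency caveat in the last bullet comes from.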
Client-Side vs Server-Side Discovery
Client-side discovery puts the burden on each service's code to query the registry and pick an instance. Spring Cloud Eureka, Netflix OSS, and Consul's client library are common examples. The client gets a list of all healthy instances for the target service and chooses one using a load balancing policy (e.g., Ribbon).
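As a rough sketch of the client-side pattern, the snippet below fetches the instance list and rotates through it round-robin. The `fetch_instances` callable stands in for whatever registry client you use; it is a hypothetical hook, not part of Eureka, Ribbon, or Consul.

```python
import itertools
from typing import Callable, Sequence


class RoundRobinClient:
    """Client-side discovery: look up the instance list, then pick one per call."""

    def __init__(self, fetch_instances: Callable[[str], Sequence[str]]):
        self._fetch = fetch_instances   # hypothetical registry lookup function
        self._counters: dict[str, itertools.count] = {}

    def pick(self, service: str) -> str:
        instances = self._fetch(service)
        if not instances:
            raise LookupError(f"no healthy instances for {service!r}")
        counter = self._counters.setdefault(service, itertools.count())
        # The modulo keeps the rotation valid even if the list size changes.
        return instances[next(counter) % len(instances)]
```

Swapping the last line for a random choice or a least-connections lookup changes the policy without touching the lookup step, which is the control the client-side approach buys you.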
Server-side discovery removes that responsibility from the client. The client sends a request to a well-known load balancer (e.g., AWS ALB, HAProxy, or a sidecar proxy in a service mesh). The load balancer queries the registry and forwards the request to an appropriate backend.
Which one to choose? Client-side gives you lower latency (no extra hop) and more control over routing logic (canary, retry, circuit breaking). Server-side simplifies the client code and centralises routing control, which is critical for security and compliance. Cloud-native environments often use server-side via Kubernetes Services combined with Ingress or a service mesh.
A less obvious trade-off: client-side discovery creates a fan-out of registry queries, because each client polls the registry itself. With 1,000 services each discovering 50 others and refreshing once a minute, that's 50,000 queries per minute. Server-side discovery concentrates that load on the load balancer tier, which is easier to scale. Measure your registry's capacity before committing to client-side at scale.
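The fan-out figure is simple arithmetic, and it is worth redoing with your own refresh interval. The helper below assumes every client refreshes every target on a fixed timer; the 50,000 figure corresponds to one refresh per target per minute, which is an assumption since the polling interval is not stated.

```python
def registry_queries_per_minute(clients: int, targets_per_client: int,
                                refresh_seconds: float) -> float:
    """Polling load on the registry when each client refreshes each target on a timer."""
    refreshes_per_minute = 60.0 / refresh_seconds
    return clients * targets_per_client * refreshes_per_minute
```

With a 60-second refresh this gives 50,000 queries per minute for 1,000 clients and 50 targets; dropping the refresh to 10 seconds multiplies the load by six.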
Health Checks: The Silent Failure Point
Health checks are the single most misconfigured feature in service discovery. Most teams use a simple TCP port check (is the port open?) or a generic HTTP endpoint (/health) that always returns 200. These only verify that the process is alive — they don't tell you if the service can actually handle requests.
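Production-ready health checks should differentiate between liveness (is the process running?) and readiness (is the service ready to serve traffic?). Kubernetes models this distinction explicitly with separate liveness and readiness probes; in Consul and Eureka you express it through the checks you define and the instance status you report. Few teams configure both correctly.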
A common pitfall: the readiness check passes even when a critical dependency (database, cache) is down. The service keeps receiving traffic, fails every request, and the outage appears as random 500 errors. The health check should cascade: if the database is unreachable, the service reports itself as unhealthy for readiness, and traffic is redirected to healthy instances.
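One way to express that cascade is to keep liveness trivial and make readiness depend on every critical dependency. The sketch below is framework-agnostic; the dependency names and the probe callables are hypothetical placeholders for real database or cache pings.

```python
from typing import Callable, Mapping


def liveness() -> bool:
    """Liveness: the process is up and able to answer at all."""
    return True


def readiness(dependency_checks: Mapping[str, Callable[[], bool]]) -> tuple[bool, list[str]]:
    """Readiness: every critical dependency must respond before we accept traffic."""
    failing = []
    for name, check in dependency_checks.items():
        try:
            ok = check()
        except Exception:
            ok = False   # a probe that raises counts as a failed dependency
        if not ok:
            failing.append(name)
    return (not failing, failing)
```

Wiring `readiness` to the endpoint the registry polls means an instance with an unreachable database drops out of rotation while its process keeps running, instead of being restarted by the liveness check.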
Another subtlety: health checks can cause a thundering herd during startup. If all instances of a new deployment become ready simultaneously and all start reporting themselves as healthy, the registry may broadcast a sudden surge of new endpoints to all clients, causing a wave of reconnections and potential CPU spikes.
Mitigation: stagger readiness by adding a random delay (e.g., 0-5 seconds) after the readiness check passes before advertising the instance. In Kubernetes, use minReadySeconds on the Deployment to force a grace period.
Set minReadySeconds to 10–30 to spread the registration window.
DNS-Based Discovery and the Hidden TTL Trap
Kubernetes and many cloud providers use DNS for service discovery. A service name like payment.prod.svc.cluster.local resolves to the Service's stable ClusterIP or, for a headless Service, directly to the IPs of the pods that are ready at that moment. DNS is familiar, free, and requires no extra infrastructure.
Here's the trap: DNS responses are cached aggressively, both by the OS resolver and by intermediate DNS servers. The TTL (Time To Live) on the DNS record controls how long the cache lives. Kubernetes DNS records for services commonly carry a TTL of 30 seconds (the value in the stock kubeadm CoreDNS configuration; the CoreDNS kubernetes plugin itself defaults to 5 seconds). That means if a pod crashes, up to 30 seconds can pass before all clients that resolved its IP stop sending requests to the dead pod.
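The staleness window is easy to reproduce with a toy resolver cache. In the sketch below, `lookup` is a hypothetical stand-in for the upstream DNS query and the clock is injectable so the behaviour can be stepped through by hand; real resolvers add further caching layers on top of this.

```python
import time
from typing import Callable


class TtlDnsCache:
    """Resolver cache that honours a record TTL, like an OS or library resolver."""

    def __init__(self, lookup: Callable[[str], str], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._lookup = lookup    # hypothetical upstream DNS query
        self._ttl = ttl
        self._clock = clock
        self._cache: dict[str, tuple[str, float]] = {}   # name -> (ip, expires_at)

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._cache.get(name)
        if cached and now < cached[1]:
            return cached[0]     # possibly stale: upstream is not consulted
        ip = self._lookup(name)
        self._cache[name] = (ip, now + self._ttl)
        return ip
```

If the record changes one second after it was cached with a 30-second TTL, this client keeps returning the old IP for another 29 seconds, regardless of how quickly the registry noticed the failure.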
In staging, where you restart one pod at a time and monitor manually, the 30-second window is invisible. In production with auto-scaling groups of 10 pods and rolling updates, a single pod crash causes a cascade of failures as requests pile up on dead instances. The TTL trap is especially dangerous when combined with a slow health check that doesn't detect the failure quickly.
To mitigate, reduce the TTL to 5–10 seconds for critical services, and ensure your readiness check is fast enough to detect failures within that window. Also implement client-side retry against a different instance on the first failure. For Kubernetes, consider using a headless Service (clusterIP: None): DNS then returns all ready pod IPs directly, so a client can fail over to another address from the same answer. This does not bypass DNS caching, and you lose the load balancing that kube-proxy provides, so you'll need your own client-side balancing.
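A minimal sketch of retrying against a different instance might look like this, assuming the name resolves to several addresses (as a headless Service does). `send` is a hypothetical callable that performs the request against one address and raises an `OSError` subclass on connection-level failure.

```python
import socket
from typing import Callable, Iterable


def resolve_all(host: str, port: int) -> list[str]:
    """Return every distinct IP the name resolves to."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def call_with_failover(addresses: Iterable[str], send: Callable[[str], str]) -> str:
    """Try each address in turn; move to the next on a connection-level failure."""
    last_error = None
    for address in addresses:
        try:
            return send(address)
        except OSError as exc:   # covers connection refused and socket timeouts
            last_error = exc
    raise ConnectionError("all instances failed") from last_error
```

Typical use would be `call_with_failover(resolve_all(host, port), send)`. The failover only helps if the resolved list still contains at least one live address, so it complements a short TTL rather than replacing it.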
Consul, Eureka, and Kubernetes: Registry Implementations Compared
Three major registries dominate production deployments: Consul (HashiCorp), Eureka (Netflix), and Kubernetes native service discovery. Each takes a different philosophical approach.
Consul uses a gossip protocol (Serf) for cluster membership and node failure detection, while the service catalog itself is replicated across the Consul servers with Raft. Health checks run on the local agent next to each service, so health changes are detected at the edge rather than by centralised polling. Consul also provides a DNS interface (port 8600) that respects health status: unhealthy services are automatically removed from DNS responses. This makes it a good fit for multi-cloud or hybrid environments where a central registry is required.
Eureka was designed by Netflix for their AWS-centric architecture. It uses a peer-to-peer pattern where each Eureka server replicates state. Eureka has a 'self-preservation' mode that kicks in when a large number of heartbeats are missed — it stops evicting instances, effectively assuming a network partition rather than an actual mass failure. This prevents a cascading removal of instances, but it also means stale entries survive longer. Eureka is best for environments with high churn and network instability.
Kubernetes needs no separate registry component: the API server, backed by etcd, is the source of truth, and DNS sits in front of it for service resolution. The EndpointSlice controller keeps track of all pod IPs behind a service. The kube-proxy on each node programs iptables or IPVS rules to forward traffic. Because the data path is distributed across every node, this design scales extremely well, but it gives the team less control over routing logic without a service mesh.
Choosing between them depends on your infrastructure homogeneity, desired control, and operational maturity. Avoid a 'best of breed' mixture that forces every service to implement two discovery mechanisms simultaneously.
You can also query a registry programmatically. Consul, for example, exposes health-filtered service lookups over its HTTP API at /v1/health/service/<name>.
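A minimal sketch of such a lookup against Consul's health endpoint, using only the standard library, could look like the following. The agent URL `http://localhost:8500` is an assumed default for a local agent, and the function names are made up for the example; adjust both to your setup.

```python
import json
import urllib.parse
import urllib.request


def parse_healthy_instances(payload: list[dict]) -> list[tuple[str, int]]:
    """Extract (address, port) pairs from a /v1/health/service response body."""
    instances = []
    for entry in payload:
        service = entry["Service"]
        # Service.Address is empty when the service uses its node's address.
        address = service.get("Address") or entry["Node"]["Address"]
        instances.append((address, service["Port"]))
    return instances


def healthy_instances(service: str,
                      consul_url: str = "http://localhost:8500") -> list[tuple[str, int]]:
    """Ask Consul for instances whose health checks are all passing."""
    name = urllib.parse.quote(service)
    url = f"{consul_url}/v1/health/service/{name}?passing=true"
    with urllib.request.urlopen(url, timeout=2) as response:
        return parse_healthy_instances(json.load(response))
```

Keeping the parsing separate from the HTTP call makes it possible to test the extraction logic without a running Consul agent.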
Service Mesh: The Evolution of Server-Side Discovery
As microservices grow beyond 50 services, managing discovery, load balancing, and retries in each service's code becomes unsustainable. A service mesh (e.g., Istio, Linkerd) moves these responsibilities out of the application and into a sidecar proxy. Each service has a proxy injected alongside it (Envoy in Istio, Linkerd's own linkerd2-proxy in Linkerd). The proxy handles all service-to-service communication: it discovers the target via a control plane, load balances, retries on failure, handles circuit breaking, and captures metrics. The application code becomes completely discovery-unaware. This is the ultimate server-side discovery pattern.
The trade-off is complexity: you now need to deploy, scale, and monitor a mesh infrastructure. For large organisations, the operational overhead is worth the decoupling it provides.
Production Hardening: Retry, Caching & Circuit Breakers
Even with a perfectly tuned registry and health checks, failures happen. A network partition can separate the registry from your clients. A slow GC pause can cause a false deregistration. The key to production hardening is assuming the registry will occasionally lie to you.
First, implement client-side caching of the instance list. Cache it locally for at least the health check interval (e.g., 10 seconds). When a resolution request arrives, return the cached list immediately and refresh asynchronously. This prevents every request from becoming a synchronous RPC to the registry.
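The return-cached-then-refresh behaviour can be sketched with a background thread. This is an illustrative outline, not a production cache: `fetch` is a hypothetical registry call, and the 10-second `max_age` simply mirrors the example interval above.

```python
import threading
import time
from typing import Callable, Optional, Sequence


class InstanceCache:
    """Serve the cached instance list immediately; refresh in the background when stale."""

    def __init__(self, fetch: Callable[[], Sequence[str]], max_age: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch                 # hypothetical registry call
        self._max_age = max_age
        self._clock = clock
        self._lock = threading.Lock()
        self._instances = list(fetch())     # first load is synchronous: nothing cached yet
        self._fetched_at = clock()
        self._refresher: Optional[threading.Thread] = None

    def _refresh(self) -> None:
        try:
            fresh = list(self._fetch())
        except Exception:
            return                          # registry unreachable: keep the old list
        with self._lock:
            self._instances = fresh
            self._fetched_at = self._clock()

    def get(self) -> list[str]:
        with self._lock:
            stale = self._clock() - self._fetched_at >= self._max_age
            busy = self._refresher is not None and self._refresher.is_alive()
            if stale and not busy:
                self._refresher = threading.Thread(target=self._refresh, daemon=True)
                self._refresher.start()
            return list(self._instances)    # never blocks on the registry
```

Because a failed refresh leaves the previous list in place, the same structure also gives you the fallback to cached instances when the registry is partitioned away.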
Second, add retry with exponential backoff. If the first resolved instance fails (connection refused, timeout, 5xx), retry with the next instance from the cached list. Set a maximum retry count (e.g., 3) and an initial backoff (e.g., 100ms) that doubles on each attempt. This handles transient failures during TTL windows.
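A compact sketch of that retry loop follows. It assumes a hypothetical `send` callable that raises an `OSError` subclass for refused connections, timeouts, and 5xx responses; the `sleep` parameter is injectable only so the backoff schedule can be observed without waiting.

```python
import time
from typing import Callable, Sequence


def call_with_retry(instances: Sequence[str], send: Callable[[str], str],
                    max_retries: int = 3, initial_backoff: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Make up to 1 + max_retries attempts, doubling the wait between them."""
    if not instances:
        raise LookupError("no instances to call")
    backoff = initial_backoff
    last_error = None
    for attempt in range(max_retries + 1):
        target = instances[attempt % len(instances)]   # rotate to the next cached instance
        try:
            return send(target)
        except OSError as exc:                         # refused, timed out, or mapped 5xx
            last_error = exc
        if attempt < max_retries:
            sleep(backoff)
            backoff *= 2
    raise ConnectionError(f"gave up after {max_retries + 1} attempts") from last_error
```

In practice you would usually add random jitter to each wait so that many clients retrying at once do not hit the next instance in lockstep.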
Third, use a circuit breaker per service. If a particular service returns errors on >50% of requests within a sliding window (e.g., 10 seconds), open the circuit — stop sending requests entirely for a cooldown period. This prevents cascading failures when the registry is still pointing to a bad instance.
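The sliding-window rule can be sketched as below. The 30-second cooldown and the `min_calls` floor are assumptions added for the example, and the sketch omits the half-open probing state that production breakers normally include.

```python
import time
from collections import deque
from typing import Callable


class CircuitBreaker:
    """Open when the error rate inside a sliding time window exceeds a threshold."""

    def __init__(self, window: float = 10.0, error_threshold: float = 0.5,
                 cooldown: float = 30.0, min_calls: int = 5,
                 clock: Callable[[], float] = time.monotonic):
        self._window, self._threshold = window, error_threshold
        self._cooldown, self._min_calls, self._clock = cooldown, min_calls, clock
        self._events: deque = deque()    # (timestamp, succeeded) per recorded call
        self._opened_at = None

    def allow(self) -> bool:
        """False while the circuit is open; True again once the cooldown has elapsed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._cooldown:
            self._opened_at = None       # cooldown over: let traffic through again
            self._events.clear()
            return True
        return False

    def record(self, succeeded: bool) -> None:
        now = self._clock()
        self._events.append((now, succeeded))
        while self._events and now - self._events[0][0] > self._window:
            self._events.popleft()       # forget calls that fell out of the window
        failures = sum(1 for _, ok in self._events if not ok)
        enough = len(self._events) >= self._min_calls
        if enough and failures / len(self._events) > self._threshold:
            self._opened_at = now
```

Callers check `allow()` before each request and call `record()` with the outcome afterwards, keeping one breaker instance per downstream service.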
These three patterns — cache, retry, circuit break — are not optional for production service discovery. They transform a fragile central registry into a system that degrades gracefully.
Kubernetes Service Discovery Mechanisms Under the Hood
Kubernetes offers multiple discovery mechanisms: DNS through CoreDNS, environment variables, and the Kubernetes API. Understanding how they work helps you choose the right one for each scenario.
DNS-based discovery is the default. Each Service gets a DNS name (e.g., my-svc.my-namespace.svc.cluster.local). CoreDNS resolves it to the ClusterIP (virtual IP) of the Service, which kube-proxy then load balances to ready pods. The record TTL is a cluster-wide CoreDNS setting: the ttl option of the kubernetes plugin in the Corefile, which defaults to 5 seconds and is set to 30 in the stock kubeadm configuration. The pod-level dnsConfig field controls nameservers, search domains, and resolver options, not the TTL.
Headless services (with clusterIP: None) bypass the virtual IP entirely. DNS returns all pod IPs directly. This is useful for stateful workloads (StatefulSets) or when you need client-side load balancing. But you lose the load balancing from kube-proxy, so your client must balance across the returned IPs itself and retry with backoff on failure.
Environment variables are injected at pod creation time. Kubernetes writes the ClusterIP and port of every Service in the same namespace into each pod's environment. This is simple but stale — updates after pod creation won't be reflected.
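The Kubernetes API is the most powerful option but also the riskiest: it couples your application to the API server, requires RBAC permissions, and adds watch load to the control plane. You can watch EndpointSlice resources for real-time updates. This is how service meshes like Istio learn about pod changes almost immediately.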
Production recommendation: for most stateless apps, use DNS with a TTL of 5-10 seconds and add client-side retry. For stateful apps or low-latency requirements, use headless services with client-side caching. For mesh, let the sidecar proxy handle discovery via API watches.
Monitoring and Observing Service Discovery Health
Service discovery failures often manifest as intermittent errors that are hard to trace. You need proactive monitoring to detect issues before they cause outages. Key metrics to monitor: registration count, health check pass/fail ratio, DNS query latency, and registry request volume.
Set up alerts for: sudden drop in registered instances (indicates mass deregistration or partition), increase in health check failures, high DNS resolution latency, or registry response time spikes.
Use distributed tracing (e.g., Jaeger) to correlate service calls with discovery queries. When a call fails, can you see if the client resolved the wrong IP? Traces coupled with metrics give you the full picture.
Logging: every discovery operation should emit a structured log with service name, outcome, and latency. But be careful not to log every successful resolution in high-throughput systems. Log failures and cache misses.
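One possible shape for that logging rule, using only the standard library, is sketched here. The field names and the "discovery" logger name are illustrative choices, not a convention from any particular registry client.

```python
import json
import logging

logger = logging.getLogger("discovery")


def log_resolution(service: str, outcome: str, latency_ms: float, cache_hit: bool) -> bool:
    """Emit one structured line for failures and cache misses; skip routine cache hits."""
    if outcome == "ok" and cache_hit:
        return False   # successful cached lookups are too frequent to log
    record = {"event": "resolve", "service": service, "outcome": outcome,
              "latency_ms": round(latency_ms, 1), "cache_hit": cache_hit}
    level = logging.INFO if outcome == "ok" else logging.WARNING
    logger.log(level, json.dumps(record, sort_keys=True))
    return True
```

The boolean return value is there only to make the sampling decision easy to assert on in tests.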
Dashboards: build a service discovery dashboard showing number of healthy instances per service, health check pass rate, DNS TTL adherence, and registry response time. This helps you spot trends before they become incidents.
- Each service's health check is like a vital sign (pulse, temperature).
- A sudden drop in registered instances is like a mass fainting in a crowd.
- Health check failure spikes indicate a systemic issue (like a virus).
- DNS resolution latency is like a slow phone book lookup.
- Distributed tracing connects the symptoms to the root cause across services.
Testing Service Discovery Failure Scenarios
You cannot assume discovery will work in production just because it worked in development. You need to test failure scenarios deliberately. Chaos engineering for discovery: kill a registry node, pause heartbeats from a set of services, induce network partitions between services and registry, and simulate slow health check responses.
Each test should validate: clients fall back to cached instances, new instances can still register with a partially available registry, the system does not degrade into a hard failure, and recovery is automatic after the fault is removed.
Automated integration tests: in a test environment, spin up your discovery infrastructure, register services, then abruptly kill one and measure how long it takes for clients to stop sending traffic to it. That's your convergence time.
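Measuring convergence time can be as simple as polling what a client resolves until the killed instance disappears. In this sketch, `resolve` is a hypothetical callable returning the addresses a client currently sees; the clock and sleep functions are injectable so the helper itself can be tested without real waiting.

```python
import time
from typing import Callable, Sequence


def measure_convergence(resolve: Callable[[], Sequence[str]], dead_instance: str,
                        timeout: float = 120.0, poll_interval: float = 0.5,
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> float:
    """Seconds until `dead_instance` disappears from what clients resolve."""
    started = clock()
    while True:
        if dead_instance not in resolve():
            return clock() - started
        if clock() - started >= timeout:
            raise TimeoutError(f"{dead_instance} still resolvable after {timeout}s")
        sleep(poll_interval)
```

Start the measurement at the moment you kill the instance, and run it from inside a real client container so that every caching layer between the client and the registry is included in the number.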
Performance test your registry under load: simulate the maximum number of services and heartbeats you expect in production. Measure CPU and memory per node. Registry servers that handle 100 services can look fine but buckle under 10,000.
Don't forget to test TTL behaviour: set your DNS TTL to the production value, kill a pod, and measure the actual error window with client-side retry enabled. You'll be surprised how much longer than TTL the errors can last (due to client caching layers).
The DNS TTL Trap That Killed Black Friday Traffic
- DNS TTL is a consistency delay, not a formality. Treat it as a staleness window you have to design around, not a freshness guarantee.
- Liveness health checks alone are dangerous for discovery. Use readiness probes that reflect the service's ability to handle requests.
- Client-side retry with fallback to another instance is essential when using DNS-based discovery with moderate TTLs.
- Test failure scenarios with the actual TTL and multiple pods to expose cascading timing issues.
Query the registry for the instance (e.g. curl registry:8500/v1/health/service/X for Consul). Verify its health check status. Look for 'passing' vs 'critical'.
Run dig +short serviceX.svc.cluster.local @cluster-dns-ip and compare with current pod IPs. Flush local DNS cache by restarting the client container or using nscd.
Check startupProbe and readinessProbe timing. In Consul/Eureka, check the initialStatus and heartbeatInterval.
Run consul members and look for left or failed nodes. In Eureka, verify all peers are reachable and not in self-preservation mode.
Check the EUREKA_INSTANCE_HOSTNAME or CONSUL_HTTP_ADDR environment variable. In K8s, check service labels and endpoint readiness.
Common mistakes to avoid
Memorising syntax before understanding the concept
Skipping hands-on practice with a real registry
Using only a liveness check for discovery routing
Setting DNS TTL too high (e.g., 60 seconds) for frequently-changing endpoints
Running multiple registries (e.g., Consul + Eureka) without coordination
Not caching registry queries on the client side
Using health check interval longer than DNS TTL
Omitting DeregisterCriticalServiceAfter in Consul. Set DeregisterCriticalServiceAfter to a reasonable timeout (e.g. 1m; Consul enforces a minimum of one minute) so that Consul automatically removes instances whose checks stay critical.
Using default heartbeat intervals without tuning for GC pauses