Senior 4 min · March 06, 2026

DNS TTL Killed a Migration — Computer Networks Interview

A 24-hour DNS TTL caused 30% traffic failure during a migration.

Naren · Founder
Quick Answer
  • OSI model isn't theory — it's a fault-isolation map for debugging network problems.
  • TCP guarantees delivery (at a cost); UDP trades reliability for speed — choose based on data criticality.
  • DNS resolution walks a cached hierarchy: browser → OS → resolver → root → TLD → authoritative.
  • HTTPS = HTTP + TLS; the extra handshake adds 1 RTT with TLS 1.3 (2 RTT with TLS 1.2) but protects data in transit.
  • Subnetting with CIDR is how cloud providers isolate networks and control traffic flow.
Plain-English First

Imagine the internet is a global postal system. Your computer is a house with an address (IP address), the postal routes are the network cables and Wi-Fi signals, and the rules about how letters get packed, addressed, and delivered are the protocols. When you visit google.com, you're essentially writing a letter, dropping it in a mailbox, watching it get sorted through multiple post offices (routers), and getting a reply back — all in milliseconds. Computer networking is the science of making that postal system fast, reliable, and secure.

Every backend engineer, DevOps engineer, and full-stack developer eventually sits across from an interviewer who asks 'What happens when you type a URL into a browser?' That question alone can make or break a senior-level interview. Networking isn't just a theoretical subject — it's the invisible infrastructure that your APIs, databases, and microservices live on. Understanding it deeply separates candidates who just write code from engineers who understand systems.

The OSI Model — Why 7 Layers Actually Matter in Practice

The OSI (Open Systems Interconnection) model is a framework that breaks network communication into 7 distinct layers. Most people memorize the names ('Please Do Not Throw Sausage Pizza Away') and stop there. That's a mistake. Understanding what each layer is responsible for helps you debug real problems.

[Image of the 7 layers of the OSI model]

When your HTTP request fails, is it a DNS issue (Layer 7), a TCP connection problem (Layer 4), or a routing issue (Layer 3)? Knowing the layers lets you mentally narrow down where the fault is, just like a doctor using anatomy to diagnose illness.

In practice, you rarely work directly with Layers 1 and 2 (Physical and Data Link) unless you're writing embedded systems or kernel code. But you absolutely need to understand Layers 3, 4, and 7 (IP addressing, TCP/UDP, and application protocols) because they appear in every production debugging scenario, from a failing API call to a slow database connection.

Here's the critical insight: layers are about separation of concerns. Each layer only talks to the layer directly above and below it. That's why you can swap out Wi-Fi for Ethernet (Layer 1/2 change) without rewriting your HTTP code (Layer 7). The abstraction is intentional and powerful.

io/thecodeforge/networking/OsiLayerInspector.java · JAVA
package io.thecodeforge.networking;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.InetAddress;
import java.net.Socket;
import java.net.UnknownHostException;

/**
 * Demonstration of OSI Layers 3, 4, and 7 in a production Java context.
 */
public class OsiLayerInspector {
    public static void main(String[] args) {
        String host = "example.com";
        int port = 80; // Layer 4 (Transport) port

        try {
            // DNS lookup (itself a Layer 7 protocol) yields the Layer 3 (Network) IP address
            InetAddress address = InetAddress.getByName(host);
            System.out.println("[L3 - Network] Resolved " + host + " to " + address.getHostAddress());

            // Layer 4 (Transport): the Socket constructor returns once the TCP handshake completes
            try (Socket socket = new Socket(address, port);
                 PrintWriter out = new PrintWriter(socket.getOutputStream());
                 BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {

                System.out.println("[L4 - Transport] TCP connection established (Handshake complete)");

                // Layer 7 (Application): raw HTTP; the protocol requires CRLF line endings,
                // so write them explicitly instead of relying on println's platform separator
                out.print("GET / HTTP/1.1\r\n");
                out.print("Host: " + host + "\r\n");
                out.print("Connection: close\r\n");
                out.print("\r\n");
                out.flush();

                System.out.println("[L7 - Application] HTTP Request Sent");
                String responseLine = in.readLine(); // status line only
                System.out.println("[L7 - Application] Server Response: " + responseLine);
            }
        } catch (UnknownHostException e) {
            // The name never resolved, so no packet reached the target
            System.err.println("DNS resolution failed: " + e.getMessage());
        } catch (IOException e) {
            // Connect or read/write failure: the fault is at Layer 4 or above
            System.err.println("Connection or I/O failed: " + e.getMessage());
        }
    }
}
Output
[L3 - Network] Resolved example.com to 93.184.216.34
[L4 - Transport] TCP connection established (Handshake complete)
[L7 - Application] HTTP Request Sent
[L7 - Application] Server Response: HTTP/1.1 200 OK
Interview Gold:
When asked about OSI layers, anchor your answer in debugging. Say: 'If ping works but HTTP doesn't, Layer 3 is fine, so the fault is at Layer 4 or Layer 7.' That shows you understand the model operationally, not just academically.
Production Insight
In cloud environments, ping (ICMP) is often blocked by security groups while HTTP works fine.
Don't assume ICMP reachability equals application reachability — test at the right layer.
Rule: always test from bottom up: L3 (ping), L4 (telnet), L7 (curl), then check firewall logs.
Key Takeaway
OSI model is a fault-isolation framework.
When debugging, verify each layer from the bottom up.
If your app works locally but fails over the network, suspect configuration and the layers below your code (DNS, firewalls, routing) before the code itself.
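
To make the fault-isolation idea concrete, here is a minimal sketch that maps the exception a Java client sees to the layer where the fault most likely sits. The class name FailureLayerClassifier and the wording of the labels are illustrative assumptions, and the mapping is a heuristic, not a guarantee: a timeout, for instance, can also come from an overloaded server.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Hypothetical helper: names and messages are for illustration only.
final class FailureLayerClassifier {

    /** Maps a connection exception to the layer where the fault most likely sits. */
    static String classify(Exception e) {
        if (e instanceof UnknownHostException) {
            return "L7 (DNS): the name did not resolve";
        }
        if (e instanceof NoRouteToHostException) {
            return "L3 (Network): no route to the host";
        }
        if (e instanceof ConnectException) {
            return "L4 (Transport): connection refused, nothing listening on that port";
        }
        if (e instanceof SocketTimeoutException) {
            return "L3/L4: packets dropped or filtered, check firewalls and security groups";
        }
        return "Unclassified: inspect the stack trace";
    }

    public static void main(String[] args) {
        System.out.println(classify(new UnknownHostException("api.example.com")));
        System.out.println(classify(new ConnectException("Connection refused")));
    }
}
```

In practice you would call classify from the catch block around your socket or HTTP client call and log the result next to the raw exception.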

TCP vs UDP — Choosing the Right Delivery Guarantee

TCP (Transmission Control Protocol) and UDP (User Datagram Protocol) are the two workhorses of the Transport layer, and choosing between them is one of the most consequential decisions in system design.

TCP is like sending a package with signature confirmation. Before any data moves, there's a 3-way handshake (SYN, SYN-ACK, ACK). Every packet is numbered, acknowledged, and retransmitted if lost. Order is guaranteed. This reliability costs time — that handshake adds latency, and the acknowledgment mechanism adds overhead.

UDP is like dropping a flyer through every door in the neighbourhood. You send it and forget it. No handshake, no acknowledgment, no guarantee of delivery or order. But it's blazingly fast, which is exactly what you need for real-time applications.

In modern systems, QUIC (used by HTTP/3) is effectively UDP with reliability built on top of it — proof that the TCP/UDP choice isn't always binary.

io/thecodeforge/networking/ProtocolComparison.java · JAVA
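package io.thecodeforge.networking;

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ProtocolComparison {

    // TCP: Reliable delivery for sensitive data.
    // Needs a server listening on localhost:9001; the constructor blocks until the handshake completes.
    public void tcpTransmission(String message) throws Exception {
        try (Socket socket = new Socket("localhost", 9001)) {
            // Encode explicitly so both ends agree on the charset
            socket.getOutputStream().write(message.getBytes(StandardCharsets.UTF_8));
        }
    }

    // UDP: Unreliable but fast for high-frequency updates (gaming/telemetry).
    // send() returns once the datagram is handed to the OS, whether or not anyone listens on 9002.
    public void udpTransmission(String message) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            byte[] buf = message.getBytes(StandardCharsets.UTF_8);
            DatagramPacket packet = new DatagramPacket(
                buf, buf.length, InetAddress.getByName("localhost"), 9002
            );
            socket.send(packet);
        }
    }
}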
Output
// TCP: Established connection, verified receipt.
// UDP: Sent packet to network buffer without verification.
Real-World Mapping:
DNS uses UDP for queries (fast, small payloads) but falls back to TCP when the response is too large (over 512 bytes for classic DNS; EDNS0 raises that limit). HTTP/1.1 and HTTP/2 use TCP. HTTP/3 uses QUIC (UDP-based). Knowing these specifics in an interview is a strong signal.
Production Insight
Loss-based TCP congestion control interacts badly with oversized router buffers ('bufferbloat'): queues fill up before any packet is dropped, so latency climbs long before loss appears.
UDP doesn't have congestion control, so a misbehaving UDP application can starve TCP flows sharing the same link.
Rule: for latency-sensitive apps (VoIP, gaming), use UDP with application-level retransmission — don't let TCP's reliability destroy your real-time experience.
Key Takeaway
TCP is for reliability — use it when data integrity matters more than latency.
UDP is for speed — use it when missing a packet is better than being late.
But remember: QUIC proves that you can have both — just not with the raw protocol alone.
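
The 'UDP with application-level retransmission' idea can be sketched as a stop-and-wait sender: send a datagram, wait briefly for an acknowledgement, and resend on timeout. This is a deliberately minimal illustration, not how QUIC works internally. The class name ReliableUdpSketch, the one-byte ACK format, and the timeout values are all assumptions made for the example, and a real protocol would also need sequence numbers to match ACKs to packets.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: stop-and-wait reliability layered on plain UDP.
final class ReliableUdpSketch {

    /** Sends the payload and waits for a one-byte ACK. Returns the attempt that succeeded, or -1. */
    static int sendWithRetry(DatagramSocket socket, InetAddress host, int port,
                             byte[] payload, int timeoutMs, int maxAttempts) throws IOException {
        socket.setSoTimeout(timeoutMs);
        DatagramPacket out = new DatagramPacket(payload, payload.length, host, port);
        byte[] ackBuf = new byte[1];
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            socket.send(out);
            try {
                socket.receive(new DatagramPacket(ackBuf, ackBuf.length));
                return attempt; // ACK arrived
            } catch (SocketTimeoutException lost) {
                // No ACK within the timeout: assume loss and retransmit
            }
        }
        return -1; // gave up after maxAttempts
    }

    public static void main(String[] args) throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (DatagramSocket receiver = new DatagramSocket(0, loopback);
             DatagramSocket sender = new DatagramSocket()) {
            // Receiver thread: read one datagram and reply with a single-byte ACK
            Thread acker = new Thread(() -> {
                try {
                    byte[] buf = new byte[1024];
                    DatagramPacket in = new DatagramPacket(buf, buf.length);
                    receiver.receive(in);
                    receiver.send(new DatagramPacket(new byte[]{1}, 1, in.getSocketAddress()));
                } catch (IOException ignored) {
                    // Socket closed or I/O error: nothing to acknowledge
                }
            });
            acker.start();
            int attempts = sendWithRetry(sender, loopback, receiver.getLocalPort(),
                    "tick".getBytes(StandardCharsets.UTF_8), 200, 3);
            acker.join();
            System.out.println("Delivered after " + attempts + " attempt(s)");
        }
    }
}
```

The design point is that the application decides how long to wait and how often to retry, so a real-time app can give up on a stale update instead of stalling behind it the way a TCP stream would.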

DNS Deep Dive — What Actually Happens When You Type a URL

DNS (Domain Name System) is the internet's phonebook. You know the name (google.com), and DNS finds the phone number (IP address). But the process behind that lookup is more fascinating than most people realise — and it's a classic interview question.

When your browser needs to resolve 'api.github.com', it doesn't just ask one server. It walks a hierarchy. First, it checks its local cache. If that's empty, it asks your OS's resolver. If that misses, it queries your ISP's recursive resolver. That resolver then walks the DNS tree: it asks a Root Name Server for the authoritative server for '.com', then asks that server for 'github.com', then finally asks GitHub's authoritative DNS server for 'api.github.com'. The answer comes back and gets cached at every step.

io/thecodeforge/networking/dns_trace.sh · BASH
# Using 'dig' to trace the iterative resolution process (standard interview tool)
# Trace github.com from the root servers down
dig +trace github.com

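# Inspect the TTL (Time To Live) to understand caching behavior
# (+noall +answer prints only the answer section; the second column is the TTL in seconds.
#  dig separates columns with tabs, so grepping for "IN A" with a space would match nothing.)
dig +noall +answer github.com A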
Output
;; Received 759 bytes from 192.5.5.241#53(f.root-servers.net)
github.com. 60 IN A 140.82.121.4
Watch Out:
Never change a DNS record without first lowering the TTL to 60–300 seconds at least one full old-TTL period in advance (24 hours for a TTL of 86400). If you change an IP while the TTL is still 86400, resolvers that cached the old record will keep sending clients to the wrong server for up to a day, and you can't force them to flush their cache.
Production Insight
DNS resolution failures are a common cause of 'it works on my machine' in microservices.
If your service depends on another service via DNS name, a resolver timeout can cascade quickly under load (connection pool exhaustion, increased latency).
Rule: always configure a short TTL for service discovery DNS records and implement client-side fallback (e.g., cached IP list).
Key Takeaway
DNS is a distributed cache hierarchy.
TTL is your primary scaling lever — short TTLs for dynamic endpoints, long TTLs for stable ones.
Never assume DNS changes propagate instantly — plan for the old TTL duration.
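
To see why a long TTL keeps stale answers alive, here is a minimal sketch of what a caching resolver does with the TTL. The class TtlDnsCacheSketch is hypothetical, time is passed in as plain seconds so the behaviour is easy to follow, and 203.0.113.10 is a documentation-range address used only as a stand-in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of TTL-based caching; real resolvers are far more involved.
final class TtlDnsCacheSketch {

    private static final class Entry {
        final String ip;
        final long expiresAtSeconds;

        Entry(String ip, long expiresAtSeconds) {
            this.ip = ip;
            this.expiresAtSeconds = expiresAtSeconds;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    /** Stores an answer together with the moment its TTL runs out. */
    void put(String name, String ip, long ttlSeconds, long nowSeconds) {
        cache.put(name, new Entry(ip, nowSeconds + ttlSeconds));
    }

    /** Returns the cached IP, or null once the TTL has expired (forcing a fresh lookup). */
    String get(String name, long nowSeconds) {
        Entry entry = cache.get(name);
        if (entry == null || nowSeconds >= entry.expiresAtSeconds) {
            cache.remove(name);
            return null;
        }
        return entry.ip;
    }

    public static void main(String[] args) {
        TtlDnsCacheSketch resolver = new TtlDnsCacheSketch();
        // Cached at t=0 with a 24-hour TTL, just before the A record changes
        resolver.put("api.example.com", "203.0.113.10", 86400, 0);
        // One hour later the resolver still serves the old address
        System.out.println(resolver.get("api.example.com", 3600));  // prints 203.0.113.10
        // Only after the full 24 hours does it go back upstream
        System.out.println(resolver.get("api.example.com", 86400)); // prints null
    }
}
```

Nothing in the cache ever checks whether the record changed upstream, which is why the only lever you have is lowering the TTL before the change.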

HTTP vs HTTPS, Status Codes, and Subnetting — The Interview Essentials

These three topics appear in virtually every networking interview, so let's cover them with precision.

HTTP vs HTTPS: HTTP sends everything in plaintext. HTTPS wraps HTTP inside TLS (Transport Layer Security). The TLS handshake happens after the TCP handshake. After that, all data is encrypted.
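
The cost of stacking the two handshakes is easy to estimate. The sketch below counts round trips on a fresh connection: one for the TCP handshake, one or two for a full TLS handshake (TLS 1.3 and TLS 1.2 respectively), and one for the request itself. The class name and the 50 ms round-trip time are assumptions for illustration, and the model ignores DNS, server processing time, and session resumption.

```java
// Hypothetical back-of-the-envelope model of connection setup cost.
final class HandshakeCostSketch {

    /** Round trips before the first response byte: TCP handshake + TLS handshake + the request. */
    static int roundTrips(int tlsHandshakeRtts) {
        int tcpHandshake = 1;
        int httpRequest = 1;
        return tcpHandshake + tlsHandshakeRtts + httpRequest;
    }

    static long timeToFirstByteMs(long rttMs, int tlsHandshakeRtts) {
        return rttMs * roundTrips(tlsHandshakeRtts);
    }

    public static void main(String[] args) {
        long rttMs = 50; // assumed round-trip time, for illustration only
        System.out.println("HTTP (no TLS):  " + timeToFirstByteMs(rttMs, 0) + " ms"); // 100 ms
        System.out.println("HTTPS, TLS 1.2: " + timeToFirstByteMs(rttMs, 2) + " ms"); // 200 ms
        System.out.println("HTTPS, TLS 1.3: " + timeToFirstByteMs(rttMs, 1) + " ms"); // 150 ms
    }
}
```

This is also why connection reuse (keep-alive, connection pools) matters so much for HTTPS: the setup cost is paid once per connection, not once per request.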

HTTP Status Codes: These are a language. 2xx means success. 3xx means redirect. 4xx means the client made an error. 5xx means the server failed.
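
Because the first digit carries the meaning, a client can triage a response with integer division. A minimal sketch, with the class name and message wording as illustrative assumptions:

```java
// Hypothetical helper that maps a status code to its class.
final class StatusCodeSketch {

    /** The hundreds digit identifies the status class. */
    static String describe(int status) {
        switch (status / 100) {
            case 1: return "informational";
            case 2: return "success";
            case 3: return "redirect";
            case 4: return "client error: fix the request";
            case 5: return "server error: fix the server";
            default: return "not a valid HTTP status";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(204)); // success
        System.out.println(describe(403)); // client error: fix the request
        System.out.println(describe(503)); // server error: fix the server
    }
}
```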

Subnetting: An IP address like 192.168.1.100/24 means the first 24 bits identify the network and the last 8 bits identify the host. /24 gives you 256 addresses (254 usable).

io/thecodeforge/networking/SubnetCalculator.java · JAVA
package io.thecodeforge.networking;

/**
 * Simulating CIDR mask logic for interview discussions.
 */
public class SubnetCalculator {
    public static void main(String[] args) {
        int prefix = 24;
        // 2^(32 - prefix) addresses; an integer shift is exact, and long avoids overflow for prefix 0
        long totalAddresses = 1L << (32 - prefix);
        // Subtract the network and broadcast addresses (valid for prefixes up to /30)
        long usableHosts = totalAddresses - 2;

        System.out.println("CIDR /" + prefix + " allows for " + usableHosts + " usable hosts.");
    }
}
Output
CIDR /24 allows for 254 usable hosts.
Pro Tip:
When asked 'what's the difference between 401 and 403?', say: '401 means unauthenticated: we don't know who you are. 403 means forbidden: we know exactly who you are, you just don't have permission.' That distinction shows you think about security design, not just HTTP syntax.
Production Insight
In cloud environments, many '403 Forbidden' errors come from misconfigured IAM or resource policies. (A security group block usually shows up as a timeout, not a 403.)
The error message rarely tells you what's missing, so you need to audit permissions at every layer (network, identity, application).
Rule: when debugging 403s, check not just the web server logs but also the cloud provider's audit trail (for example, AWS CloudTrail) for API deny events.
Key Takeaway
HTTPS = HTTP + TLS: the security comes from the TLS layer underneath HTTP, not from HTTP itself.
Status codes are a quick diagnostic tool: 4xx means fix your request, 5xx means fix the server.
Subnetting is a design tool: choose CIDR prefixes that allow growth without renumbering.
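
A follow-up interviewers often ask is 'is this IP inside that subnet?'. The check is a bitwise AND with the network mask, as in this sketch. The class CidrMatchSketch and its method names are assumptions for illustration; it handles IPv4 only and does no input validation.

```java
// Hypothetical sketch of CIDR membership for IPv4 (no input validation).
final class CidrMatchSketch {

    /** Packs a dotted-quad IPv4 string into a 32-bit int. */
    static int toInt(String ip) {
        int value = 0;
        for (String octet : ip.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    /** True if ip falls inside the block, e.g. contains("192.168.1.0/24", "192.168.1.100"). */
    static boolean contains(String cidr, String ip) {
        String[] pieces = cidr.split("/");
        int prefix = Integer.parseInt(pieces[1]);
        // Java masks shift counts to 5 bits, so a shift by 32 would be a no-op: handle /0 explicitly
        int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        return (toInt(pieces[0]) & mask) == (toInt(ip) & mask);
    }

    public static void main(String[] args) {
        System.out.println(contains("192.168.1.0/24", "192.168.1.100")); // true
        System.out.println(contains("192.168.1.0/24", "192.168.2.1"));   // false
        System.out.println(contains("10.0.0.0/8", "10.200.3.4"));        // true
    }
}
```

The same mask logic is what a router or security group rule applies when it decides whether a packet's source address matches a CIDR range.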

Production Network Debugging: Tools Every Engineer Should Know

Knowing theory is one thing. Being able to diagnose a real outage under pressure is what separates senior engineers. Here are the tools that matter in production:

dig — The DNS Swiss Army knife. dig +trace shows you the full resolution path. dig -x does reverse lookup.

curl — Every engineer's first tool for HTTP debugging. Verbose mode (-v) shows the entire handshake. -k bypasses certificate validation (for testing only!).

tcpdump — Raw packet capture. Filter by host, port, or protocol. -A prints ASCII payload. Critical for diagnosing retransmissions and dropped packets.

traceroute/mtr — Shows the path packets take and where latency spikes. mtr combines ping and traceroute in real-time.

netstat/ss — Check open ports, connection states, and socket buffers. ss -tuln lists all listening TCP/UDP ports. ss -s shows overall statistics.

In an interview, being able to describe a real debugging session (e.g., 'I used tcpdump to spot TCP retransmissions, then mtr to find a congested router') is worth more than reciting the OSI layers.

io/thecodeforge/networking/debug_session.sh · BASH
# Real scenario: slow API response
# Step 1: Check DNS (time_namelookup is printed in seconds, so 0.012 means 12 ms)
curl -s -o /dev/null -w "%{time_namelookup}\n" https://api.example.com
# If above 0.005 (5 ms), DNS is slow

# Step 2: Trace the route
mtr --report-wide api.example.com

# Step 3: Capture traffic; the same sequence number appearing twice indicates a retransmission
# (options go before the filter expression; adjust the interface and port to your setup)
tcpdump -i eth0 -nn -c 100 'host api.example.com and tcp port 443'

# Step 4: Check connection state
ss -tn state all dst api.example.com
Output
DNS lookup: 12ms
mtr: first hop 1ms, second hop 45ms (packet loss 2%)
tcpdump: repeated sequence numbers (retransmissions)
=> Loss starts at the second hop, likely a congested router: escalate to network team, add retry logic to client
The Debugging Pyramid
  • L1/L2: Physical link up? Check cables, carrier detect, interface stats.
  • L3: IP connectivity? Ping the target (but remember ICMP may be blocked).
  • L4: Port reachable? Telnet or curl against the port.
  • L7: Application responding? Check HTTP status, response body, latency.
  • If all layers pass locally but fail in production, the issue is likely configuration (firewall, DNS, load balancer rules).
Production Insight
I once spent two hours debugging a 'connection refused' error. Turned out the application container had crashed but Kubernetes hadn't restarted it yet.
Network tools told me the port was closed — the real fix was checking pod status, not firewall rules.
Rule: when debugging network problems, always check the health of the target process first.
Key Takeaway
The best network debugger is systematic — verify each layer from bottom up.
Learn one tool per layer: ping for L3, telnet/nc for L4, curl for L7, dig for DNS.
Most 'network' problems are actually application or configuration problems — use the tools to prove where the fault is.
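
When telnet or nc isn't installed on a locked-down host, the same Layer 4 check can be done from Java. This is a minimal sketch; the class name PortProbeSketch and the 2-second timeout are assumptions for illustration. It only tells you whether a TCP handshake completes, not whether the application behind the port is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical equivalent of `nc -zv host port`.
final class PortProbeSketch {

    /** Returns true if a TCP handshake to host:port completes within the timeout. */
    static boolean isPortOpen(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, filtered, unresolved, or timed out
        }
    }

    public static void main(String[] args) {
        System.out.println("example.com:443 open? " + isPortOpen("example.com", 443, 2000));
    }
}
```

A quick 'false' usually means the connection was refused (nothing listening), while a 'false' that takes the full timeout points to a firewall silently dropping packets.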
● Production incident · POST-MORTEM · severity: high

The DNS TTL That Killed a Migration

Symptom
After changing the IP of api.example.com, roughly 30% of users still hit the old server. The old server was decommissioned, causing requests to fail with connection timeouts.
Assumption
The team assumed DNS changes propagate instantly after updating the A record.
Root cause
The original TTL was set to 86400 seconds (24 hours). DNS resolvers cached the old IP for up to 24 hours, so a large slice of traffic kept routing to the decommissioned server.
Fix
Lowered the TTL to 60 seconds 48 hours before the migration. After the cutover, monitored traffic until the old IP had zero requests. Then decommissioned the old server.
Key lesson
  • Always lower TTL to 60–300 seconds at least 24 hours before any IP change.
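  • Monitor DNS propagation by querying several public resolvers (for example dig @8.8.8.8 api.example.com) or with whatsmydns.net. dig +trace only shows what the authoritative servers return, not what caches still hold.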
  • Keep the old server running until traffic drops to zero — not just until you flip the record.
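
The timing in these lessons is simple arithmetic, sketched below with java.time. The class name and the cutover timestamp are hypothetical. The result is a lower bound: some resolvers and clients hold records longer than the TTL says, so traffic monitoring on the old IP is still the final check.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical planner for a DNS cutover; timestamps are illustrative.
final class TtlMigrationPlanSketch {

    /** Latest moment to lower the TTL so every cache holding the old TTL has expired by cutover. */
    static Instant lowerTtlBy(Instant cutover, Duration oldTtl) {
        return cutover.minus(oldTtl);
    }

    /** Earliest moment after which TTL-respecting resolvers no longer serve the old IP. */
    static Instant earliestDecommission(Instant cutover, Duration ttlAtCutover) {
        return cutover.plus(ttlAtCutover);
    }

    public static void main(String[] args) {
        Instant cutover = Instant.parse("2026-03-06T10:00:00Z"); // assumed cutover time
        Duration oldTtl = Duration.ofSeconds(86400);
        Duration loweredTtl = Duration.ofSeconds(60);

        System.out.println(lowerTtlBy(cutover, oldTtl));               // 2026-03-05T10:00:00Z
        System.out.println(earliestDecommission(cutover, loweredTtl)); // 2026-03-06T10:01:00Z
        // Without lowering the TTL first, the old server must stay up a full day:
        System.out.println(earliestDecommission(cutover, oldTtl));     // 2026-03-07T10:00:00Z
    }
}
```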
Production debug guide: match symptoms to root causes and immediate actions (5 entries)
Symptom · 01
ping fails to a host
Fix
Check if ICMP is blocked by firewall (common in cloud environments). Use tcping or curl against the actual service port to test L4+ connectivity.
Symptom · 02
HTTP request hangs or times out
Fix
Check DNS resolution (dig +short), then test TCP connectivity with telnet or nc. If DNS resolves but TCP fails, check security groups/firewall rules on the target.
Symptom · 03
Slow file transfer or API response
Fix
Use tcpdump to capture packets and look for TCP retransmissions or duplicate ACKs — indicates packet loss. Check network congestion or MTU issues.
Symptom · 04
Client gets 403 Forbidden
Fix
Verify authentication token or API key. For cloud instances, check instance metadata (IAM roles) or VPC endpoint policies.
Symptom · 05
Random connection resets
Fix
Inspect logs for 'Connection reset by peer'. Could be a load balancer idle timeout, a proxy closing the connection, or a client-side socket timeout mismatch.
★ Quick Command Reference for Network Troubleshooting
Run these commands in order when an application can't connect.
Cannot reach a server
Immediate action
Test local network stack
Commands
ping 8.8.8.8 # L3 connectivity test
dig +short google.com # DNS resolution test
Fix now
If ping works but DNS fails, check /etc/resolv.conf or DNS server settings.
Port is not open
Immediate action
Test TCP connectivity to the specific port
Commands
curl -v http://host:port # L7 health check
nc -zv host port # L4 port scan
Fix now
If nc fails, the port is not listening or a firewall blocks it. Use iptables -L or cloud security group console.
High latency
Immediate action
Measure round-trip time
Commands
mtr host # combines traceroute and ping
tcpdump -i eth0 port 80 # capture traffic
Fix now
Look for high latency hops or packet loss in mtr output. Contact ISP or check for network saturation.
HTTPS certificate error
Immediate action
Inspect the certificate chain
Commands
openssl s_client -connect host:443 -showcerts
curl -vI https://host # verbose SSL handshake
Fix now
Check expiration, intermediate certificate inclusion, and SNI configuration on the server.
TCP vs UDP Quick Reference
Aspect | TCP | UDP
Connection | Connection-oriented (3-way handshake) | Connectionless (no handshake)
Reliability | Guaranteed delivery & ordering | No delivery guarantee, no ordering
Speed | Slower (overhead from ACKs) | Faster (fire and forget)
Error Checking | Full: retransmits lost packets | Checksum only: no retransmission
Use Cases | HTTP, HTTPS, SSH, FTP, SMTP | DNS, video streaming, VoIP, gaming
Header Size | 20–60 bytes | 8 bytes fixed
Flow Control | Yes (sliding window) | No
Congestion Control | Yes (slow start, AIMD) | No (app must handle it)
HTTP Version | HTTP/1.1, HTTP/2 | HTTP/3 (via QUIC)

Key takeaways

1. The OSI model is a debugging framework: use it to isolate faults between physical, network, and application layers.
2. Reliability (TCP) vs Speed (UDP) is the fundamental trade-off of the transport layer.
3. DNS is a distributed, hierarchical database where caching (TTL) is the primary scaling mechanism.
4. HTTPS is TLS-wrapped HTTP; the security happens after the TCP connection is established.
5. Subnetting is the primary tool for network isolation and IP management in modern cloud architectures.
6. Production debugging requires systematic layer-by-layer diagnosis: never skip L3 and L4 before blaming L7.
7. HTTP status codes are a universal language: 4xx means fix the request, 5xx means fix the server.

Common mistakes to avoid

5 patterns
×

Thinking a 'ping' failure always means the server is down

Symptom
You run ping, get no response, assume the server is offline. You start deploying a replacement while the real server is actually running fine.
Fix
Remember that ICMP (ping) can be blocked by firewalls, security groups, or ACLs. Always verify with a higher-layer tool like curl, telnet, or a health check endpoint.
×

Confusing the 3-way handshake (TCP) with the SSL handshake (TLS/HTTPS)

Symptom
During an interview, you say 'the handshake takes 3 packets' but the interviewer follows up with TLS and you realise you conflated the two.
Fix
TCP handshake: SYN, SYN-ACK, ACK. TLS handshake happens after TCP is established: ClientHello, ServerHello, Certificate, KeyExchange, Finished (the TLS 1.2 flow; TLS 1.3 condenses it into a single round trip). HTTPS requires both handshakes.
×

Not knowing the difference between a Recursive and Iterative DNS query

Symptom
When asked to explain DNS resolution, you describe a single query to a server. The interviewer expects you to mention that the resolver may do iterative queries starting from root servers.
Fix
Recursive: resolver does all the work for you (typical ISP/cloud resolver). Iterative: the client follows referrals (like dig +trace shows). Know both.
×

Ignoring the 'Ephemeral Port' range when debugging why a server can't make new outgoing connections

Symptom
Your application suddenly can't make any new outbound TCP connections. You check everything — DNS, routing, firewall — but it's actually port exhaustion.
Fix
Each outbound TCP connection uses a temporary source port from the ephemeral range (32768–60999 by default on Linux). If you exhaust those, new connections to the same destination fail. Monitor socket counts with ss -s (watch the timewait figure) and tune net.ipv4.ip_local_port_range if needed.
×

Assuming HTTP 503 means the server is overloaded

Symptom
You get a 503 Service Unavailable and immediately start scaling the web servers. But the real issue is that the load balancer's health check is failing because the database is down.
Fix
503 means the server is temporarily unable to handle the request. It could be due to overload, but also because of dependency failures (database, cache, downstream API). Check those dependencies before you scale anything.
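
For the ephemeral port mistake above, a rough budget shows how quickly exhaustion can happen. The sketch assumes the default Linux port range, that your side closes each connection first (so it holds the port in TIME_WAIT for 60 seconds), and that no connection pooling is used. The class and method names are illustrative.

```java
// Hypothetical back-of-the-envelope budget for outbound connections to one destination.
final class EphemeralPortBudgetSketch {

    static int portsInRange(int low, int high) {
        return high - low + 1;
    }

    /** Upper bound on new connections per second to a single destination ip:port. */
    static int maxNewConnectionsPerSecond(int ports, int timeWaitSeconds) {
        return ports / timeWaitSeconds;
    }

    public static void main(String[] args) {
        int ports = portsInRange(32768, 60999);
        System.out.println("Ephemeral ports: " + ports);                                      // 28232
        System.out.println("Max new conns/sec: " + maxNewConnectionsPerSecond(ports, 60));    // 470
    }
}
```

A few hundred short-lived connections per second to one backend is enough to hit the ceiling, which is why connection pooling or keep-alive is usually the first fix rather than widening the port range.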
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
What is the difference between an IP address and a MAC address, and at w...
Q02 · SENIOR
Explain the 'Head-of-Line Blocking' problem in TCP and how HTTP/3 (QUIC)...
Q03 · SENIOR
Describe the full lifecycle of an HTTP request, starting from the DNS lo...
Q04 · SENIOR
What is MTU (Maximum Transmission Unit), and what happens when a packet ...
Q05 · SENIOR
How does a Load Balancer (Layer 4 vs Layer 7) differ in how it handles i...
Q01 of 05 · JUNIOR

What is the difference between an IP address and a MAC address, and at which OSI layers do they operate?

ANSWER
An IP address is a logical address at Layer 3 (Network) — it identifies a device on a network and can change as the device moves. A MAC address is a physical address at Layer 2 (Data Link) — it's burned into the network interface card and rarely changes. IP addresses are used for end-to-end routing across networks; MAC addresses are used for hop-to-hop delivery within a local network segment (Ethernet). ARP (Address Resolution Protocol) maps IP addresses to MAC addresses.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What is 'Anycast' routing in DNS?
02
Why do we say TCP has 'Congestion Control' but UDP doesn't?
03
What is the 'Default Gateway'?
04
What is the difference between a hub, a switch, and a router?
05
How does NAT (Network Address Translation) work?