
DNS Outage: A Deleted A Record Took Down an E-Commerce Site

📍 Part of: Computer Networks → Topic 1 of 22
Main site down ('Server Not Found') but staging worked.
🧑‍💻 Beginner-friendly — no prior CS Fundamentals experience needed
In this tutorial, you'll learn
  • A computer network is a system of interconnected devices that exchange data using protocols.
  • Data is broken into packets; each packet includes headers for addressing, routing, and error recovery.
  • DNS and DHCP are critical services — misconfigurations cause silent outages.
Quick Answer
  • Computer networks are interconnected devices sharing data using protocols
  • Data travels in packets through layers (OSI/TCP/IP) with headers and payload
  • DNS translates domain names to IPs; DHCP assigns addresses dynamically
  • Every hop adds latency (microseconds inside a LAN, tens of milliseconds across continents); sustained packet loss of even ~1% can sharply cut TCP throughput
  • Production network failures often stem from DNS misconfig or subnet overlap
  • Biggest mistake: assuming the network is reliable — it's not, and it drops silently
🚨 START HERE

Network Troubleshooting Cheat Sheet

Common symptoms, immediate actions, and exact commands to diagnose network issues fast.
🟡

No network connectivity at all

Immediate Action: Check physical link and interface status
Commands:
ip link show (or ifconfig)
ping -c 4 8.8.8.8
Fix Now: Restart the interface: sudo ip link set dev eth0 down; sudo ip link set dev eth0 up
🟡

DNS resolution fails

Immediate Action: Test direct IP connectivity
Commands:
nslookup example.com
dig +trace example.com
Fix Now: Add a fallback nameserver in /etc/resolv.conf: nameserver 1.1.1.1
🟡

Application-specific timeout (e.g., database)

Immediate Action: Check port reachability
Commands:
nc -zv db-server 3306
ss -tunap | grep 3306
Fix Now: Update security group or iptables rule to allow the port
Production Incident

The DNS Misconfiguration That Took Down an E-Commerce Site

A single missing A record caused a 4-hour outage during Black Friday.
Symptom: Users received "Server Not Found" errors when visiting the main website, but internal systems and staging environments worked fine.
Assumption: The team assumed the DNS changes they pushed the night before had propagated correctly. The TTL was set to 300 seconds, so they expected resolution within minutes.
Root cause: The new A record for the primary domain was accidentally deleted during a bulk update script. The DNS query fell through to a stale CNAME pointing to an old load balancer that had been decommissioned.
Fix: Restored the A record from a backup zone file and manually flushed the DNS caches on the resolvers the team controlled (authoritative nameservers serve zone data directly and hold no resolver cache to flush). Set up a pre-deployment DNS verification script.
Key Lesson
  • Always use DNS transaction logs to verify changes immediately after deployment.
  • Don't rely solely on TTL for recovery; have a rollback plan for DNS.
  • Monitor DNS resolution from multiple geographic locations during major events.
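The "pre-deployment DNS verification script" mentioned in the fix could look something like the sketch below. It is a minimal illustration, not the team's actual script: the hostnames, the addresses (taken from the documentation range 203.0.113.0/24), and the function names `resolve_ipv4` and `verify_records` are all assumptions. The default resolver asks the system resolver, so it reports what a cache-backed client would see rather than what the authoritative server holds.

```python
import socket
from typing import Callable, Dict, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    except socket.gaierror:
        return set()  # NXDOMAIN or resolver failure: treat as "no records"
    return {info[4][0] for info in infos}


def verify_records(
    expected: Dict[str, Set[str]],
    resolve: Callable[[str], Set[str]] = resolve_ipv4,
) -> Dict[str, str]:
    """Return hostname -> problem description; an empty dict means all good."""
    problems = {}
    for hostname, want in expected.items():
        got = resolve(hostname)
        if not got:
            problems[hostname] = "no A record returned"
        elif got != want:
            problems[hostname] = f"expected {sorted(want)}, got {sorted(got)}"
    return problems


# Demo with a fake resolver so the example runs without network access.
# (Hypothetical data: "www" lost its A record, as in the incident above.)
fake_zone = {"shop.example.com": {"203.0.113.10"}, "www.example.com": set()}
expected = {
    "shop.example.com": {"203.0.113.10"},
    "www.example.com": {"203.0.113.10"},
}
print(verify_records(expected, resolve=lambda host: fake_zone.get(host, set())))
# {'www.example.com': 'no A record returned'}
```

In a deployment pipeline, a non-empty result would fail the step before traffic is affected. Passing the resolver in as a parameter keeps the check testable offline.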
Production Debug Guide

Quick symptom-to-action map for the most common network failures

  • Application throws connection timeout on external API: Check firewall rules and outbound security groups. Run telnet api.example.com 443 from the server.
  • Hostnames resolve to wrong IP or fail intermittently: Verify DNS records with dig +short example.com and check TTL values. Compare against authoritative NS responses.
  • High latency or packet loss in logs: Run mtr --report target-ip to identify the hop with loss. Check for bandwidth saturation or misconfigured MTU on that link.
  • One server cannot reach another on the same subnet: Check ARP table on both hosts (arp -a). Verify subnet mask consistency. Look for VLAN misconfig on the switch.

Every time you open Instagram, pay for something online, or video-call a friend on the other side of the world, a computer network is the invisible plumbing making it happen. Networks are not a niche topic for network engineers; they're the foundation of almost every piece of software ever built. If you don't understand how devices communicate, you'll spend your career confused about why your app is slow, why a request times out, or what an API call actually does on the wire. This article breaks down the essentials: how data moves from your laptop to a server across the globe, what protocols are, and the real-world failures you'll hit when the network breaks.

What is a Computer Network?

A computer network is a collection of interconnected devices — laptops, servers, routers, switches — that exchange data using agreed-upon protocols. Networks come in different sizes: LAN (Local Area Network) connects devices within a single building, WAN (Wide Area Network) stretches across cities or continents, and the Internet itself is the biggest WAN of all. The core job of a network is to move data from source to destination reliably and efficiently. That means handling addressing (who gets the data), routing (which path it takes), and error recovery (what happens when a packet is lost).

At the simplest level, every device gets a unique identifier — an IP address — and data is split into packets. Each packet carries the destination IP, the source IP, and a payload. Routers along the way inspect the destination and forward the packet toward its target. This is the fundamental mechanism behind everything from loading a webpage to streaming a video.
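To make "routers inspect the destination and forward the packet" concrete, here is a small sketch of the lookup a router performs: among all routes whose prefix contains the destination, it picks the most specific one (longest-prefix match). The routing table and the next-hop names (`isp-gateway`, `core-router`, `rack-switch`) are made up for illustration.

```python
import ipaddress
from typing import List, Optional, Tuple

Route = Tuple[ipaddress.IPv4Network, str]  # (destination prefix, next hop)


def next_hop(routes: List[Route], destination: str) -> Optional[str]:
    """Pick the next hop using longest-prefix match; None if nothing matches."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in routes if dest in net]
    if not matches:
        return None
    # The route with the longest prefix is the most specific match.
    _, best_hop = max(matches, key=lambda match: match[0].prefixlen)
    return best_hop


# Hypothetical routing table: a default route plus two more specific prefixes.
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "isp-gateway"),
    (ipaddress.ip_network("10.0.0.0/8"), "core-router"),
    (ipaddress.ip_network("10.1.2.0/24"), "rack-switch"),
]

print(next_hop(ROUTES, "10.1.2.7"))  # rack-switch
print(next_hop(ROUTES, "10.9.9.9"))  # core-router
print(next_hop(ROUTES, "8.8.8.8"))   # isp-gateway
```

Real routers use specialised data structures (tries, TCAM) rather than a linear scan, but the selection rule is the same.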

network_essentials.sh · BASH
#!/bin/bash
# TheCodeForge - basic network diagnostics
# Check local IP and connectivity
ip addr show eth0
echo "---"
ping -c 2 google.com

# Trace route to a host
traceroute 8.8.8.8
Mental Model
The Postal System Analogy
A network works like the postal service — you don't know every sorting office, you just trust the system to deliver your letter.
  • Your device (house) has a return address (IP).
  • DNS is the phone book: it tells you the address of "google.com".
  • TCP is registered mail — it confirms delivery and retries if lost.
  • Routers are sorting offices that decide the next hop.
📊 Production Insight
A misconfigured subnet mask can make two servers on the same physical segment appear unreachable.
Always verify netmask consistency: a /24 vs /16 mismatch silently breaks communication.
Rule: never assume layer 2 connectivity works just because both hosts have IPs.
🎯 Key Takeaway
Networks are unreliable by design.
Packets can be dropped, delayed, or duplicated.
Build applications that handle network failures gracefully.
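One common way to "handle network failures gracefully" is to retry transient errors with exponential backoff and jitter. The sketch below is one possible shape for that, not a prescribed implementation; the function name `retry_with_backoff` and its default values are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 0.2,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on connection errors and timeouts."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # The delay ceiling doubles each attempt: 0.2s, 0.4s, 0.8s, ...
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # "Full jitter": sleep a random amount up to the ceiling so that
            # many clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    raise ValueError("attempts must be at least 1")


# Demo: an operation that fails twice, then succeeds.
state = {"calls": 0}


def flaky_request() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated dropped connection")
    return "200 OK"


print(retry_with_backoff(flaky_request, sleep=lambda seconds: None))  # 200 OK
print(state["calls"])  # 3
```

Only retry operations that are safe to repeat (idempotent), and pair retries with a per-request timeout so a silently dropped packet cannot hang the caller.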

How Data Travels: The OSI and TCP/IP Models

Data travels through multiple layers, each adding its own header. The OSI model defines seven layers: Physical, Data Link, Network, Transport, Session, Presentation, Application. In practice, TCP/IP collapses these into four: Link, Internet, Transport, Application.

When you send an HTTP request, the application layer (e.g., browser) creates the payload. The transport layer (TCP) splits the data into segments, adds a header with source and destination ports and sequence numbers, and provides reliable delivery through acknowledgements and retransmission. The internet layer (IP) wraps each segment into a packet with source and destination IP addresses. Finally, the link layer adds MAC addresses and sends the frame over the wire.

Each intermediate router strips the link-layer header and adds a new one for the next hop, leaving the IP packet intact apart from a decremented TTL and a recomputed header checksum (NAT devices also rewrite addresses). The destination host unwraps the layers in reverse order, reassembles the segments, and delivers the data to the application.

tcp_packet_structure.py · PYTHON
# TheCodeForge - simulate packet encapsulation
def encapsulate(data, src_port, dst_port, src_ip, dst_ip):
    # Transport layer: TCP segment
    segment = f"{src_port}:{dst_port}|{data}"
    # Network layer: IP packet
    packet = f"{src_ip}->{dst_ip}|{segment}"
    # Link layer: Ethernet frame (simplified)
    frame = f"[MAC src->MAC dst]{packet}"
    return frame

print(encapsulate("GET /index.html", 54321, 80, "192.168.1.5", "142.250.80.46"))
🔥Forge Tip:
You don't need to memorise every OSI layer. Focus on the TCP/IP stack — it's what maps to real headers you see in a packet capture (Wireshark).
📊 Production Insight
MTU mismatches cause silent packet fragmentation and performance degradation.
Path MTU discovery (PMTUD) often fails when ICMP is blocked by firewalls.
Set TCP MSS clamping at the router to avoid fragmentation over VPNs.
🎯 Key Takeaway
Each layer adds overhead.
A minimal TCP header is 20 bytes, a minimal IPv4 header is 20 bytes, and an Ethernet header is 14 bytes (plus a 4-byte frame check sequence).
That is roughly 54 bytes of headers per packet, so factor it into bandwidth calculations.
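The header arithmetic above can be turned into a quick estimate of how much of each frame is actual data. This is a rough sketch assuming minimal headers (no TCP or IP options, IPv4, and ignoring the Ethernet frame check sequence and preamble); the function name `wire_efficiency` is illustrative.

```python
ETHERNET_HEADER = 14  # bytes, excluding the 4-byte FCS trailer
IPV4_HEADER = 20      # bytes, no options
TCP_HEADER = 20       # bytes, no options


def wire_efficiency(payload_bytes: int) -> float:
    """Fraction of the transmitted frame that is application payload."""
    overhead = ETHERNET_HEADER + IPV4_HEADER + TCP_HEADER  # 54 bytes
    return payload_bytes / (payload_bytes + overhead)


for payload in (1460, 512, 64):
    print(f"{payload:>5} B payload -> {wire_efficiency(payload):.1%} payload")
#  1460 B payload -> 96.4% payload
#   512 B payload -> 90.5% payload
#    64 B payload -> 54.2% payload
```

The takeaway: full-sized segments lose only a few percent to headers, while chatty protocols that send many tiny packets can spend nearly half their bandwidth on overhead.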

IP Addressing and Subnetting

Every device on a network needs a unique IP address. IPv4 addresses are 32-bit numbers, usually written as four octets (e.g., 192.168.1.1). IPv6 uses 128 bits to solve address exhaustion. Subnetting divides a network into smaller logical segments. A subnet mask (e.g., 255.255.255.0 or /24) defines which part of the address is the network prefix and which part identifies the host.

CIDR (Classless Inter-Domain Routing) notation replaces classful addressing. For instance, 10.0.0.0/16 means the first 16 bits are the network, giving 65,534 usable host addresses. Subnetting allows efficient use of IP space and improves security by isolating broadcast domains. In production, misconfiguring subnet masks is a common cause of connectivity issues — two hosts with different subnet masks may think the other is on a different network and send traffic to the default gateway, even though they're on the same physical segment.

subnet_calculator.py · PYTHON
# TheCodeForge - simple subnet calculator
def subnet_info(ip_cidr):
    ip, prefix = ip_cidr.split('/')
    prefix = int(prefix)
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    mask_str = '.'.join(str((mask >> (24 - 8*i)) & 0xFF) for i in range(4))
    return f"{ip}/{prefix} subnet mask: {mask_str}"

print(subnet_info("10.0.0.0/16"))
▶ Output
10.0.0.0/16 subnet mask: 255.255.0.0
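For comparison, Python's standard-library `ipaddress` module can do the same calculation and also validate the input. The sketch below is one way to use it; the helper name `describe` is an assumption for illustration.

```python
import ipaddress


def describe(cidr: str) -> dict:
    """Summarise an IPv4 network given in CIDR notation."""
    # strict=True (the default) raises ValueError if host bits are set,
    # e.g. "10.0.1.5/16" is rejected as a network address.
    net = ipaddress.ip_network(cidr, strict=True)
    return {
        "netmask": str(net.netmask),
        "broadcast": str(net.broadcast_address),
        # Subtract the network and broadcast addresses.
        # (This convention does not apply to /31 and /32 networks.)
        "usable_hosts": max(net.num_addresses - 2, 0),
    }


print(describe("10.0.0.0/16"))
# {'netmask': '255.255.0.0', 'broadcast': '10.0.255.255', 'usable_hosts': 65534}

# Membership test: is this address inside the network?
print(ipaddress.ip_address("10.0.200.7") in ipaddress.ip_network("10.0.0.0/16"))  # True
```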
⚠ Subnet Mask Trap
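A common mistake: setting a subnet mask of /24 on one host and /16 on another in the same physical LAN. Whenever the other host's address falls outside its own /24, the /24 host sends packets to the default gateway, thinking the peer is on a different network, even though they're directly connected. The /16 host, meanwhile, tries to reach it directly, so traffic flows asymmetrically.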
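The trap comes down to one decision every host makes before sending: "is the destination inside my own subnet?" The sketch below reproduces that decision with the `ipaddress` module. The two host addresses are hypothetical and chosen so that the mismatch actually shows up.

```python
import ipaddress


def delivery_path(own_interface: str, peer_ip: str) -> str:
    """Decide how a host with the given address/mask would reach peer_ip."""
    iface = ipaddress.ip_interface(own_interface)
    peer = ipaddress.ip_address(peer_ip)
    if peer in iface.network:
        return "direct (ARP on the LAN)"
    return "via default gateway"


# Host A is configured as /24, host B as /16, both on the same switch.
print(delivery_path("192.168.1.5/24", "192.168.2.10"))   # via default gateway
print(delivery_path("192.168.2.10/16", "192.168.1.5"))   # direct (ARP on the LAN)

# If both addresses fall inside the same /24, the mismatch stays hidden:
print(delivery_path("192.168.1.5/24", "192.168.1.10"))   # direct (ARP on the LAN)
```

The last line is why this misconfiguration can go unnoticed for a long time: it only bites for peers outside the narrower subnet.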
📊 Production Insight
Overlapping subnets in cloud VPCs cause routing black holes.
Always reserve a contiguous CIDR block during initial design.
Use RFC 1918 private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) for internal networks.
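Overlaps are easy to catch before they reach production. As a minimal sketch, the following checks a list of planned CIDR blocks pairwise using `ipaddress`; the block list is invented for the example and `find_overlaps` is an illustrative name.

```python
import ipaddress
from itertools import combinations
from typing import List, Tuple


def find_overlaps(cidrs: List[str]) -> List[Tuple[str, str]]:
    """Return every pair of CIDR blocks that share at least one address."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]


# Hypothetical VPC plan: the second block sits inside the first.
planned = ["10.0.0.0/16", "10.0.128.0/20", "10.1.0.0/16", "192.168.0.0/24"]
print(find_overlaps(planned))
# [('10.0.0.0/16', '10.0.128.0/20')]
```

Running a check like this in CI against infrastructure definitions is one way to enforce the "reserve a contiguous block up front" rule.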
🎯 Key Takeaway
Subnet mask determines whether a destination is local or through a gateway.
Incorrect masks create hard-to-debug connectivity issues.
Always double-check subnet masks during network changes.

Key Network Services: DNS and DHCP

DNS (Domain Name System) translates human-readable domain names (e.g., google.com) into IP addresses. It's a hierarchical, distributed database. When your browser looks up a domain, it queries a resolver (usually your ISP or a public DNS like 8.8.8.8), which walks the chain of root, TLD, and authoritative name servers to find the IP. DNS uses UDP on port 53 for queries, with TCP for zone transfers and large responses.
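To see what a DNS query actually looks like on the wire, the sketch below builds a standard A-record query by hand, following the message layout in RFC 1035: a 12-byte header, then the name as length-prefixed labels, then the query type and class. The function name and the fixed query ID are assumptions for illustration; real clients use a random ID and usually leave this to a resolver library.

```python
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a DNS query message asking for the A record of hostname."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question,
    # 0 answers, 0 authority records, 0 additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed with its length; a zero byte ends the name.
    # (Labels longer than 63 bytes are invalid and not handled here.)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A record, QCLASS 1 = IN (Internet).
    question = qname + struct.pack("!HH", 1, 1)
    return header + question


query = build_dns_query("example.com")
print(len(query))         # 29
print(query[12:25])       # b'\x07example\x03com\x00'
```

Sending these 29 bytes in a UDP datagram to port 53 of a resolver is what tools like `dig` do under the hood; the response uses the same header layout with the answer section filled in.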

DHCP (Dynamic Host Configuration Protocol) automatically assigns IP addresses, subnet masks, default gateways, and DNS servers to devices when they join a network. Without DHCP, every device would need manual configuration. In production, DHCP lease times affect address availability; short leases (e.g., 5 minutes) cause churn, long leases (e.g., 24 hours) can exhaust the pool during scale-out events.

dns_query.sh · BASH
# TheCodeForge - resolve a domain and see the query path
dig +trace thecodeforge.com

# Check DHCP lease
ip addr show | grep dynamic
Mental Model
DNS as a Phonebook
You know the name "Alice", but to call her you need her phone number — DNS gives you the number from the name.
  • Root servers (.) know where .com lives.
  • TLD servers (.com) know where authoritative nameservers are.
  • Authoritative servers return the actual IP for example.com.
  • DNS resolvers cache results to speed up subsequent lookups.
📊 Production Insight
DNS caching can mask failures for the duration of the TTL.
Before a DNS migration, lower the TTL to 60 seconds at least one full old-TTL period in advance (a day ahead is a common choice) so a rollback takes effect quickly.
A stale DNS record after a server migration can send traffic to the old IP for up to the TTL period.
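The effect of TTL on stale answers can be shown with a toy resolver cache. This is a simplified sketch, not how any particular resolver is implemented: the class name `TtlCache`, the hostname, and the address are assumptions, and the clock is injected so the example runs instantly.

```python
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """Toy DNS cache: entries are served until their TTL runs out."""

    def __init__(self, clock: Callable[[], float]):
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry)

    def put(self, name: str, ip: str, ttl: float) -> None:
        self._entries[name] = (ip, self._clock() + ttl)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        ip, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: force a fresh upstream lookup
            return None
        return ip


now = [0.0]  # fake clock, in seconds
cache = TtlCache(clock=lambda: now[0])
cache.put("shop.example.com", "203.0.113.10", ttl=300)

now[0] = 299
print(cache.get("shop.example.com"))  # 203.0.113.10 (old answer still served)
now[0] = 300
print(cache.get("shop.example.com"))  # None (expired; next lookup goes upstream)
```

Whatever changed at the authoritative server during those 300 seconds is invisible to clients of this cache, which is exactly why a short TTL before a change shortens the rollback window.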
🎯 Key Takeaway
DNS is critical: a single wrong or missing record can take your service offline.
Always monitor DNS resolution from multiple locations.
Use short TTLs for critical records during changes.

Common Network Failures and Debugging

Network failures are inevitable in production. The most common: DNS failures (domain not resolving), routing issues (packets taking wrong path), firewall blocks (silent drops), ARP cache poisoning, MTU mismatches, and bandwidth saturation. Debugging requires a systematic approach: start at the application layer and work downward.

Essential tools: ping (basic reachability), traceroute/mtr (path analysis), nslookup/dig (DNS), netstat/ss (listening ports), tcpdump/Wireshark (packet inspection), and curl/wget (HTTP layer). Many silent failures happen because ICMP is blocked — path MTU discovery and traceroute rely on it.

A real story: a team deployed a Kubernetes cluster with overlay network MTU 1450 on a physical network with MTU 1500. Applications experienced intermittent timeouts because packets were fragmented at the IP layer and the fragments were dropped by the AWS network load balancer. The fix was to lower the overlay MTU to 1430, leaving headroom beyond the roughly 50 bytes that VXLAN over IPv4 adds, or to make sure path MTU discovery works end to end (for example, by not blocking the ICMP messages it depends on).
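The MTU arithmetic behind overlay networks is worth spelling out. VXLAN over IPv4 wraps each inner Ethernet frame in an outer IP header, a UDP header, and a VXLAN header, which adds up to 50 bytes. The sketch below computes the resulting overlay MTU; the `extra_overhead` parameter is a hypothetical knob for any additional encapsulation on the path (for example a tunnel), not something taken from the story above.

```python
OUTER_IPV4_HEADER = 20   # bytes
OUTER_UDP_HEADER = 8     # bytes
VXLAN_HEADER = 8         # bytes
INNER_ETHERNET = 14      # bytes: the encapsulated frame's own Ethernet header


def vxlan_overlay_mtu(underlay_mtu: int, extra_overhead: int = 0) -> int:
    """Largest inner IP packet that fits without fragmenting the outer packet."""
    vxlan_overhead = (
        OUTER_IPV4_HEADER + OUTER_UDP_HEADER + VXLAN_HEADER + INNER_ETHERNET
    )  # 50 bytes
    return underlay_mtu - vxlan_overhead - extra_overhead


print(vxlan_overlay_mtu(1500))                     # 1450
print(vxlan_overlay_mtu(1500, extra_overhead=20))  # 1430
```

If the overlay MTU is set higher than this result, full-sized packets get fragmented or dropped somewhere along the path, and the symptom is usually intermittent timeouts on large responses while small requests work fine.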

debug_network.sh · BASH
#!/bin/bash
# TheCodeForge - systematic network debug
echo "1. Check local interface and IP"
ip addr show
echo "2. Check default gateway reachability"
ping -c 2 "$(ip route show default | awk '{print $3; exit}')"  # first default gateway only
echo "3. DNS resolution"
nslookup google.com
echo "4. Port reachability to remote"
nc -zv db.example.com 5432
echo "5. Full path analysis"
mtr --report github.com
💡Debug Order
Always start at the application layer and work down:
  1. Is the server responding? (curl)
  2. Is the port open? (nc, nmap)
  3. Is DNS correct? (dig)
  4. Is the route working? (traceroute)
  5. Is the link up? (ip link show)
📊 Production Insight
Firewall logs are your best friend — but they're often the last place people look.
When a connection times out and the server is healthy, check the firewall first.
Rule: a dropped packet has no error message; only a timeout tells you something is wrong.
🎯 Key Takeaway
Networks drop silently.
Timeouts are the only symptom of a block or misroute.
Learn to use tcpdump — it sees what applications cannot.
🗂 Network Types
LAN vs WAN vs MAN
Type | Scope | Typical Speed | Example
LAN | Single building / campus | 1 Gbps – 10 Gbps | Office network, home network
WAN | Cities / continents | 10 Mbps – 10 Gbps | Internet, corporate MPLS
MAN | City-wide | 100 Mbps – 10 Gbps | ISP backbone, municipal Wi-Fi

🎯 Key Takeaways

  • A computer network is a system of interconnected devices that exchange data using protocols.
  • Data is broken into packets; each packet includes headers for addressing, routing, and error recovery.
  • DNS and DHCP are critical services — misconfigurations cause silent outages.
  • Always design applications to handle network failures; networks are not reliable.
  • Debug network issues systematically: application → transport → internet → link layer.

⚠ Common Mistakes to Avoid

    Assuming the network is reliable
    Symptom

    Applications crash or hang under packet loss; retries not implemented.

    Fix

    Design with network failures in mind — implement retries with exponential backoff, timeouts, and circuit breakers.

    Misconfiguring subnet masks
    Symptom

    Two hosts on the same physical switch cannot communicate directly; traffic goes through default gateway unnecessarily.

    Fix

    Ensure all hosts on the same subnet have identical subnet masks. Use a configuration management tool to enforce consistency.

    Using DNS with long TTLs during changes
    Symptom

    After a server migration, users still hit the old IP for hours despite DNS record update.

    Fix

    Before planned changes, lower TTL to 60 seconds. After the change, verify propagation, then restore normal TTL.

    Ignoring MTU mismatches
    Symptom

    Intermittent connectivity issues, especially with VPN or overlay networks (e.g., Docker, Kubernetes).

    Fix

    Set a consistent MTU on all network segments. For overlays, reduce the MTU to account for encapsulation overhead (e.g., 1450 for VXLAN on a 1500-byte underlay).

Interview Questions on This Topic

  • Q: Explain how a client connects to a server using TCP. What happens during the three-way handshake? (Junior)
    The client sends a SYN packet with a random sequence number. The server responds with SYN-ACK, acknowledging the client's sequence number and sending its own. The client then sends an ACK. After this, a full-duplex connection is established. The handshake ensures both sides are willing to communicate and synchronizes sequence numbers for reliable data transfer.
  • Q: What happens when you type a URL into a browser and press Enter? Describe the network flow. (Mid-level)
    1. Browser checks if the hostname is in its cache or OS DNS cache. 2. If not, it makes a DNS query (recursive) to the configured resolver. 3. The resolver queries root, TLD, and authoritative nameservers to get the IP. 4. Browser opens a TCP connection to that IP (three-way handshake). 5. If HTTPS, a TLS handshake occurs. 6. Browser sends an HTTP GET request. 7. Server processes and returns HTTP response. 8. Browser renders the page. Key network layers: DNS (UDP), TCP (SYN/SYN-ACK/ACK), TLS, HTTP (TCP payload).
  • Q: How does a subnet mask affect communication between two hosts? Give an example of a misconfiguration. (Senior)
    A subnet mask defines which part of the IP is the network prefix and which is the host. A host sends directly only to addresses inside its own subnet; everything else goes to the default gateway. If host A has IP 192.168.1.5/24 and host B has IP 192.168.2.10/16 on the same switch, A computes its network as 192.168.1.0/24, sees B as outside it, and sends the packet to the default gateway. B computes its network as 192.168.0.0/16, sees A as local, and replies directly. The traffic is asymmetric and partly detours through the router, adding latency and load, and it fails outright if the gateway does not forward it. Fix: ensure consistent subnet masks.
  • Q: Describe a production incident you debugged that was caused by a network issue. How did you diagnose and fix it? (Senior)
    In a previous role, the API stopped returning responses every few minutes. We checked application logs — no errors. We used ping and traceroute and found a 2% packet loss at a specific hop, which was a load balancer with high CPU due to a DDoS attack. We identified the attack via netstat showing many half-open connections. Mitigation: we rate-limited on the upstream firewall and added more capacity to the load balancer. The lesson: always check the network layer when you see intermittent timeouts.

Frequently Asked Questions

What is the difference between a hub, a switch, and a router?

A hub broadcasts all data to all ports (simple, insecure). A switch learns MAC addresses and forwards data only to the intended port (layer 2, efficient). A router forwards packets between different networks using IP addresses (layer 3, connects LAN to WAN/Internet). In most production networks, you'll use switches for internal LAN and routers for WAN connectivity.

Why does my application sometimes get 'Connection refused' vs 'Connection timed out'?

Connection refused means the server actively rejected the connection (no service listening on that port, or firewall sent a RST). Connection timed out means the server didn't respond at all (network path broken, firewall dropped the packet silently, or the server is overloaded and not accepting connections). The two errors have very different root causes: 'refused' is usually a server-side port issue, while 'timeout' is a network or load issue.
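The difference is easy to observe from code. The sketch below attempts a TCP connection and classifies the result; the function name `probe` and the two-second timeout are assumptions chosen for illustration.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Try a TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote side answered with a RST: host reachable, nothing listening
        # (or a firewall configured to reject rather than drop).
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: packet dropped somewhere,
        # or the host is down or overloaded.
        return "timed out"
    except OSError as exc:
        # DNS failure, no route to host, and similar errors.
        return f"error: {exc}"


# Example usage (results depend on your environment):
# print(probe("db.example.com", 5432))
```

A "refused" result points at the service or its port; a "timed out" result points at the path, which is the cue to check firewalls, security groups, and routing.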

What is NAT and why is it needed?

NAT (Network Address Translation) allows multiple devices on a private network (e.g., 192.168.x.x) to share a single public IP address when accessing the Internet. It rewrites the source IP and port in outgoing packets and remembers the mapping so return traffic is forwarded to the correct internal device. NAT conserves IPv4 address space and adds a layer of security (external hosts cannot directly reach internal devices). Drawback: it breaks end-to-end connectivity and complicates protocols that embed IP addresses (e.g., SIP, FTP).
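The "rewrites and remembers the mapping" behaviour can be sketched as a small translation table. This is a deliberately simplified model: real NAT devices also track the protocol and remote endpoint and expire idle mappings. The class name `NatTable`, the starting port, and all addresses are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP address, port)


class NatTable:
    """Toy port-address translation: many private endpoints, one public IP."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._next_port = first_port
        self._outbound: Dict[Endpoint, int] = {}  # private endpoint -> public port
        self._inbound: Dict[int, Endpoint] = {}   # public port -> private endpoint

    def translate_outbound(self, source: Endpoint) -> Endpoint:
        """Rewrite a private source endpoint to the shared public one."""
        if source not in self._outbound:
            self._outbound[source] = self._next_port
            self._inbound[self._next_port] = source
            self._next_port += 1
        return (self.public_ip, self._outbound[source])

    def translate_inbound(self, public_port: int) -> Optional[Endpoint]:
        """Find the internal endpoint for return traffic, if a mapping exists."""
        return self._inbound.get(public_port)


nat = NatTable("203.0.113.1")
print(nat.translate_outbound(("192.168.1.10", 51000)))  # ('203.0.113.1', 40000)
print(nat.translate_outbound(("192.168.1.11", 51000)))  # ('203.0.113.1', 40001)
print(nat.translate_inbound(40001))                     # ('192.168.1.11', 51000)
print(nat.translate_inbound(45000))                     # None
```

The final `None` is the "security" side effect mentioned above: unsolicited inbound traffic has no mapping, so there is nowhere to forward it.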

What is the difference between TCP and UDP? When would you use each?

TCP is connection-oriented and provides reliable, in-order delivery with flow control and error recovery via retransmission. It has higher overhead (larger headers plus a handshake). Use TCP for applications that require all data to arrive correctly and in order: HTTP, email, file transfers. UDP is connectionless and fire-and-forget, with no guarantees on delivery or order. Use UDP for real-time applications where speed matters more than completeness: live video streaming, VoIP, DNS queries, online gaming.

🔥
Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
