DNS Outage: A Deleted A Record Took Down an E-Commerce Site
Main site down ('Server Not Found') but staging worked.
20+ years shipping production systems from the metal up. Written from production experience, not tutorials.
- Computer networks are interconnected devices sharing data using protocols
- Data travels in packets through layers (OSI/TCP/IP) with headers and payload
- DNS translates domain names to IPs; DHCP assigns addresses dynamically
- Every hop adds latency (microseconds on a LAN, tens of milliseconds across a WAN); sustained packet loss above ~1% badly degrades TCP throughput
- Production network failures often stem from DNS misconfig or subnet overlap
- Biggest mistake: assuming the network is reliable — it's not, and it drops silently
Imagine you and your friends live in different houses on the same street. You want to share a pizza recipe, so you pass a note from house to house until it reaches your friend. A computer network works exactly the same way — devices (houses) are connected by wires or wireless signals (the street), and data (the note) travels between them following agreed-upon rules so it arrives at the right place. That's it. Every time you send a message, load a webpage, or stream a video, you're just passing very fast, very organised notes.
Every single time you open Instagram, pay for something online, or video-call a friend on the other side of the world, a computer network is the invisible plumbing making it happen. Networks are not just a niche topic for network engineers — they're the foundation of almost every piece of software ever built. If you don't understand how devices communicate, you'll spend your career confused about why your app is slow, why a request times out, or what an API even is at a physical level. This article breaks down the essentials: how data actually moves from your laptop to a server across the globe, what protocols are, and the real-world failures you'll hit when the network breaks.
What is a Computer Network?
A computer network is a collection of interconnected devices — laptops, servers, routers, switches — that exchange data using agreed-upon protocols. Networks come in different sizes: LAN (Local Area Network) connects devices within a single building, WAN (Wide Area Network) stretches across cities or continents, and the Internet itself is the biggest WAN of all. The core job of a network is to move data from source to destination reliably and efficiently. That means handling addressing (who gets the data), routing (which path it takes), and error recovery (what happens when a packet is lost).
At the simplest level, every device gets a unique identifier — an IP address — and data is split into packets. Each packet carries the destination IP, the source IP, and a payload. Routers along the way inspect the destination and forward the packet toward its target. This is the fundamental mechanism behind everything from loading a webpage to streaming a video.
- Your device (house) has a return address (IP).
- DNS is the phone book: it tells you the address of "google.com".
- TCP is registered mail — it confirms delivery and retries if lost.
- Routers are sorting offices that decide the next hop.
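To make the "split into packets" idea concrete, here is a minimal sketch in Python. The `Packet` class, the `packetize` and `reassemble` helpers, the 8-byte payload limit, and the addresses are all made up for illustration; real IP packets carry far more header fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    seq: int        # position of this chunk in the original message
    payload: bytes


def packetize(data: bytes, src_ip: str, dst_ip: str, max_payload: int = 8) -> List[Packet]:
    """Split data into chunks, each tagged with addresses and a sequence number."""
    return [
        Packet(src_ip, dst_ip, seq, data[offset:offset + max_payload])
        for seq, offset in enumerate(range(0, len(data), max_payload))
    ]


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message even if the packets arrived out of order."""
    return b"".join(p.payload for p in sorted(packets, key=lambda p: p.seq))


message = b"pizza recipe: dough, sauce, cheese"
packets = packetize(message, "192.168.1.10", "192.168.1.20")
shuffled = list(reversed(packets))          # simulate out-of-order arrival
print(len(packets), reassemble(shuffled) == message)
```

The sequence number is what lets the receiver put the note back together when the pieces take different routes.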
How Data Travels: The OSI and TCP/IP Models
Data travels through multiple layers, each adding its own header. The OSI model defines seven layers: Physical, Data Link, Network, Transport, Session, Presentation, Application. In practice, TCP/IP collapses these into four: Link, Internet, Transport, Application.
When you send an HTTP request, the application layer (e.g., browser) creates the payload. The transport layer (TCP) adds a header with source and destination ports, splits data into segments, and guarantees delivery. The internet layer (IP) wraps each segment into a packet with source and destination IP addresses. Finally, the link layer adds MAC addresses and sends the frame over the wire.
Each intermediate router strips and re-adds the link-layer header but keeps the IP packet intact. The destination host unwraps layers in reverse order, reassembles the segments, and delivers the data to the application.
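The wrap-and-unwrap flow described above can be sketched in a few lines of Python. The headers here are deliberately simplified (ports only, addresses only, MACs only), and the function names, MAC addresses, and IPs are assumptions for illustration, not real TCP, IP, or Ethernet header layouts.

```python
import socket
import struct


def tcp_wrap(payload: bytes, src_port: int, dst_port: int) -> bytes:
    # Simplified transport header: two 16-bit ports (real TCP has many more fields).
    return struct.pack("!HH", src_port, dst_port) + payload


def ip_wrap(segment: bytes, src_ip: str, dst_ip: str) -> bytes:
    # Simplified internet header: 4-byte source and destination IPv4 addresses.
    return socket.inet_aton(src_ip) + socket.inet_aton(dst_ip) + segment


def link_wrap(packet: bytes, src_mac: bytes, dst_mac: bytes) -> bytes:
    # Simplified link header: destination MAC, then source MAC (6 bytes each).
    return dst_mac + src_mac + packet


def unwrap(frame: bytes) -> dict:
    """Peel the headers off in reverse order, as the destination host would."""
    packet = frame[12:]                                   # drop the link header
    src_ip = socket.inet_ntoa(packet[:4])
    dst_ip = socket.inet_ntoa(packet[4:8])
    segment = packet[8:]                                  # drop the internet header
    src_port, dst_port = struct.unpack("!HH", segment[:4])
    return {"src_ip": src_ip, "dst_ip": dst_ip,
            "src_port": src_port, "dst_port": dst_port,
            "payload": segment[4:]}                       # drop the transport header


request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
segment = tcp_wrap(request, 50000, 80)
packet = ip_wrap(segment, "192.168.1.10", "203.0.113.5")
frame = link_wrap(packet, bytes.fromhex("020000000001"), bytes.fromhex("020000000002"))
print(len(request), len(segment), len(packet), len(frame))
print(unwrap(frame)["payload"] == request)
```

Each layer only reads its own header and treats everything after it as opaque payload, which is why a router can swap the link header without touching the IP packet inside.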
IP Addressing and Subnetting
Every device on a network needs a unique IP address. IPv4 addresses are 32-bit numbers, usually written as four octets (e.g., 192.168.1.1). IPv6 uses 128 bits to solve address exhaustion. Subnetting divides a network into smaller logical segments. A subnet mask (e.g., 255.255.255.0 or /24) defines which part of the address is the network prefix and which part identifies the host.
CIDR (Classless Inter-Domain Routing) notation replaces classful addressing. For instance, 10.0.0.0/16 means the first 16 bits are the network, giving 65,534 usable host addresses. Subnetting allows efficient use of IP space and improves security by isolating broadcast domains. In production, misconfiguring subnet masks is a common cause of connectivity issues — two hosts with different subnet masks may think the other is on a different network and send traffic to the default gateway, even though they're on the same physical segment.
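Both claims above are easy to check with Python's standard `ipaddress` module. The `on_link` helper and the host addresses are hypothetical, chosen only to show two hosts on the same wire disagreeing about who is local.

```python
import ipaddress

net = ipaddress.ip_network("10.0.0.0/16")
usable_hosts = net.num_addresses - 2      # minus the network and broadcast addresses
print(net.netmask, usable_hosts)          # 255.255.0.0 65534


def on_link(local_ip: str, prefix_len: int, remote_ip: str) -> bool:
    """Would a host with this address and mask treat remote_ip as directly reachable?"""
    local_net = ipaddress.ip_interface(f"{local_ip}/{prefix_len}").network
    return ipaddress.ip_address(remote_ip) in local_net


# Host A is configured with /16, host B (same physical segment) with /24.
a_sees_b = on_link("10.0.1.10", 16, "10.0.2.20")   # A sends straight to B
b_sees_a = on_link("10.0.2.20", 24, "10.0.1.10")   # B sends to its default gateway
print(a_sees_b, b_sees_a)
```

When the two answers differ, traffic works in one direction and detours (or dies) in the other, which is what makes mask mismatches so confusing to debug.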
Key Network Services: DNS and DHCP
DNS (Domain Name System) translates human-readable domain names (e.g., google.com) into IP addresses. It's a hierarchical, distributed database. When your browser looks up a domain, it queries a resolver (usually your ISP or a public DNS like 8.8.8.8), which walks the chain of root, TLD, and authoritative name servers to find the IP. DNS uses UDP on port 53 for queries, with TCP for zone transfers and large responses.
DHCP (Dynamic Host Configuration Protocol) automatically assigns IP addresses, subnet masks, default gateways, and DNS servers to devices when they join a network. Without DHCP, every device would need manual configuration. In production, DHCP lease times affect address availability; short leases (e.g., 5 minutes) cause churn, long leases (e.g., 24 hours) can exhaust the pool during scale-out events.
- Root servers (.) know where .com lives.
- TLD servers (.com) know where authoritative nameservers are.
- Authoritative servers return the actual IP for example.com.
- DNS resolvers cache results to speed up subsequent lookups.
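The lookup chain and the caching behaviour can be sketched as a toy resolver. This is a simulation, not real DNS: the zone tables, server names, the 300-second TTL, and the `Resolver` class are all invented for illustration, and no network traffic is involved.

```python
import time

# Hypothetical zone data; names and addresses are made up for illustration.
ROOT = {"com": "tld-com"}
TLD = {"tld-com": {"example.com": "ns-example"}}
AUTHORITATIVE = {"ns-example": {"example.com": ("203.0.113.10", 300)}}  # (IP, TTL in seconds)


class Resolver:
    def __init__(self):
        self.cache = {}              # name -> (ip, expiry timestamp)
        self.upstream_queries = 0    # how many servers we had to ask

    def resolve(self, name, now=None):
        now = time.time() if now is None else now
        cached = self.cache.get(name)
        if cached and cached[1] > now:
            return cached[0]                          # still fresh: answer from cache
        tld_server = ROOT[name.rsplit(".", 1)[-1]]    # root: who handles .com?
        auth_server = TLD[tld_server][name]           # TLD: who is authoritative?
        ip, ttl = AUTHORITATIVE[auth_server][name]    # authoritative: the actual record
        self.upstream_queries += 3
        self.cache[name] = (ip, now + ttl)
        return ip


resolver = Resolver()
print(resolver.resolve("example.com", now=0), resolver.upstream_queries)
print(resolver.resolve("example.com", now=100), resolver.upstream_queries)  # served from cache
print(resolver.resolve("example.com", now=301), resolver.upstream_queries)  # TTL expired, asks again
```

The cache is also why a bad or deleted record keeps hurting after you fix it: resolvers that cached the old answer keep serving it until the TTL runs out.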
Common Network Failures and Debugging
Network failures are inevitable in production. The most common: DNS failures (domain not resolving), routing issues (packets taking wrong path), firewall blocks (silent drops), ARP cache poisoning, MTU mismatches, and bandwidth saturation. Debugging requires a systematic approach: start at the application layer and work downward.
Essential tools: ping (basic reachability), traceroute/mtr (path analysis), nslookup/dig (DNS), netstat/ss (listening ports), tcpdump/Wireshark (packet inspection), and curl/wget (HTTP layer). Many silent failures happen because ICMP is blocked — path MTU discovery and traceroute rely on it.
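When none of those tools are installed on a box, a basic TCP reachability check is a few lines of Python. This is a minimal sketch: `tcp_port_open` is a hypothetical helper name, and the demo talks to a throwaway listener on localhost so it runs without touching a real service.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:       # covers refused connections, timeouts, and DNS failures
        return False


# Demo against a temporary local listener on an OS-assigned port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]
print(tcp_port_open("127.0.0.1", demo_port))
listener.close()
```

A refused connection returns quickly, while a firewall that silently drops packets shows up as the full timeout elapsing, so timing the call tells you which of the two you are dealing with.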
A real story: a team deployed a Kubernetes cluster with overlay network MTU 1450, but the physical network had MTU 1500. Applications experienced intermittent timeouts because packets were fragmented at the IP layer and the fragments were dropped by the AWS network load balancer. The fix was to lower the overlay MTU to 1430 so that encapsulated packets (VXLAN plus any other headers added along the path) fit without fragmentation, or to make sure path MTU discovery works end to end, which means not filtering the ICMP "fragmentation needed" messages it depends on.
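The arithmetic behind overlay MTU sizing is worth writing down. The sketch below assumes plain VXLAN over IPv4 with no IP options and no outer VLAN tag, which costs 50 bytes per packet; `max_overlay_mtu` and its `extra_overhead` parameter are hypothetical names for whatever else your path adds.

```python
# Header sizes in bytes for VXLAN over IPv4 (no IP options, no outer VLAN tag).
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14
VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET  # 50


def max_overlay_mtu(physical_mtu: int, extra_overhead: int = 0) -> int:
    """Largest inner IP packet that fits in one outer packet without fragmentation."""
    return physical_mtu - VXLAN_OVERHEAD - extra_overhead


print(max_overlay_mtu(1500))       # 1450
print(max_overlay_mtu(1500, 20))   # 1430, if something else on the path adds 20 bytes
```

If the overlay MTU you need comes out lower than this calculation suggests, something else in the path is adding headers, and it is worth finding out what before picking a number.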
The Physical Layer: Where Your Bits Actually Live
Every network conversation eventually hits the wire. Or the air. The Physical Layer is layer 1 in the OSI model, and it's where all your carefully crafted packets become voltage levels, light pulses, or radio waves. Your TCP handshake doesn't mean anything if the cable is crushed under a server rack.
This layer defines the electrical, mechanical, and procedural interface to the transmission medium. Copper wire? That's Ethernet over twisted pair. Fiber? That's light bouncing through glass. Wi-Fi? That's a specific radio frequency with collision avoidance baked in. The Physical Layer also governs encoding schemes: how a '1' and a '0' actually look on the medium. Manchester encoding, NRZ, 4B/5B. These matter when you're debugging why a 10-meter run of Cat5e works but an 11-meter run doesn't.
Why this matters in production: you can't fix a network problem you can't see. If your switchport shows 'err-disabled', start at the bottom. Layer 2 protections such as port security or BPDU guard can trigger it, but so can link flaps and bad cabling. Cable test first. Always cable test first. I've watched engineers burn three hours on ARP cache issues only to find a bent pin on an RJ45 connector.
The Data Link Layer: Switching, MACs, and Why Your ARP Table Matters
Layer 2 is where frames become real. The Data Link Layer takes raw bits from the physical layer and packages them into frames with MAC addresses. This is the domain of switches, not routers. If you've ever wondered why 'arp -a' shows nonsense, this is the layer to understand.
The Data Link Layer is split into two sublayers: LLC (Logical Link Control) and MAC (Media Access Control). LLC handles flow control and error checking at the frame level. MAC is where the 48-bit hardware address lives and where CSMA/CD (Carrier Sense Multiple Access with Collision Detection) runs for Ethernet. VLAN tagging also happens here — that 802.1Q header you see in packet captures is pure layer 2.
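To see where the MAC addresses and the 802.1Q tag sit in a frame, here is a small parser sketch. `parse_ethernet` is a hypothetical helper, and the sample frame is hand-built rather than captured; it handles a single VLAN tag only.

```python
import struct


def parse_ethernet(frame: bytes) -> dict:
    """Parse an Ethernet II header, including one optional 802.1Q VLAN tag."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    vlan_id = None
    offset = 14
    if ethertype == 0x8100:                      # 0x8100 marks an 802.1Q tag
        tci, ethertype = struct.unpack("!HH", frame[14:18])
        vlan_id = tci & 0x0FFF                   # VLAN ID is the low 12 bits of the TCI
        offset = 18
    return {"dst": dst.hex(":"), "src": src.hex(":"),
            "vlan": vlan_id, "ethertype": ethertype,
            "payload": frame[offset:]}


# Hand-built frame: broadcast destination, VLAN 100, EtherType 0x0806 (ARP).
sample = bytes.fromhex("ffffffffffff" "020000000001" "8100" "0064" "0806") + b"arp-payload"
parsed = parse_ethernet(sample)
print(parsed["dst"], parsed["src"], parsed["vlan"], hex(parsed["ethertype"]))
```

The tag adds 4 bytes between the source MAC and the real EtherType, which is why a tagged frame looks "shifted" when you read a hex dump expecting an untagged one.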
Why this bites you on the job: broadcast storms. When a switch learns a MAC address, it updates its CAM table. If there's a loop, broadcast and unknown unicast frames circulate endlessly, the CAM table thrashes as the same MAC shows up on different ports, and you get a broadcast storm. Spanning Tree Protocol (STP) prevents this, but only if it's configured correctly. I've seen a junior bring down an entire office floor by plugging a patch cable into two ports on the same switch. Layer 2 loops don't care about your TCP retransmission timers.
Network Performance: Why Your Packets Are Late and What to Do About It
Latency, bandwidth, jitter, packet loss. These four metrics define whether your app feels snappy or your users throw their laptops out the window.
Bandwidth is the pipe size — how much data you can shove through per second. Latency is the travel time. High bandwidth doesn't fix high latency. You can't outrun the speed of light. Jitter is latency's unpredictable cousin — it kills real-time audio and video. Packet loss forces retransmits, which makes everything worse.
When you're debugging, measure all four. Don't just check ping. Run iperf for throughput. Measure jitter with a UDP test. If you see 1% packet loss on a VoIP call, that's 1% of your conversation gone — and your users will notice. Fix the physical link, upgrade the switch, or throttle traffic before your app dies.
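Given a series of probe results, three of the four metrics fall out of a few lines of arithmetic. This is a rough sketch: `summarize` is a hypothetical helper, the sample values are invented, and jitter is taken here as the mean absolute difference between consecutive round-trip times, which is one common definition among several.

```python
from statistics import mean


def summarize(rtts_ms):
    """rtts_ms: one entry per probe, an RTT in ms, or None if the probe was lost."""
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    # Jitter: mean absolute difference between consecutive received RTTs.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "avg_ms": mean(received),    # raises if every probe was lost
        "jitter_ms": mean(diffs) if diffs else 0.0,
        "loss_pct": loss_pct,
    }


probes = [20.0, 22.0, None, 35.0, 21.0]   # made-up sample: one lost probe, one spike
stats = summarize(probes)
print(stats)
```

Note how the average alone hides the problem: one spike barely moves it, while the jitter figure makes the instability obvious.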
Modern Networking: SDN, Overlays, and Why You Can't Ignore Virtual Networks
Software-Defined Networking (SDN) separates the control plane from the data plane. That means you manage network policies in software, not by SSHing into switches. Overlays like VXLAN let you build virtual networks on top of physical ones by encapsulating traffic, while VLANs carve one physical layer 2 network into isolated segments. Real talk: your cloud runs on this. AWS VPCs, Azure VNets, and most Kubernetes CNI plugins are virtual networks layered over a physical fabric you never see.
Why should you care? Because physical topology no longer constrains you. You can spin up isolated networks in seconds. You can migrate workloads across data centers without re-cabling. But that freedom comes with cost: encapsulation overhead, MTU headaches, and debugging complexity.
When your container can't reach a service, check the overlay first. Is the tunnel up? Does the MTU leave room for the encapsulation headers? Is your CNI plugin leaking routes? Modern networking demands you think in layers: physical, virtual, and policy. Master that stack or your microservices will fail silently.
The Bridge: Why You Need It and How It Segments Your Network
Bridges operate at Layer 2, the Data Link layer. They connect two or more network segments and split them into separate collision domains by learning which MAC addresses live on each side. Unlike a hub that blindly repeats every frame out of every port, a bridge forwards only the traffic that needs to cross the segment boundary. This means devices on segment A don't see unicast traffic meant for segment B, which reduces unnecessary load and improves performance. Bridges also buffer frames to handle speed mismatches between segments, e.g., a 100 Mbps Ethernet segment talking to a 1 Gbps one.
Modern switches are essentially multi-port bridges with high-speed backplanes, but the core principle remains: isolate traffic to where it belongs. Note what a bridge does not do: broadcasts such as ARP requests are still flooded out of every port, so a bridge or switch on its own does not shrink the broadcast domain. Production networks pair switches with VLANs and routers to contain broadcasts, and with Spanning Tree to keep loops from turning them into storms. Without that segmentation, every ARP request hits every host in a flat network, which is a recipe for congestion.
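The learning behaviour fits in a dozen lines. The sketch below is a toy model, not a real switch: `LearningBridge`, the port numbers, and the shortened MAC strings are made up, and there is no aging of table entries.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"


class LearningBridge:
    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}          # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Return the set of ports the frame is sent out of."""
        self.mac_table[src_mac] = in_port              # learn where the sender lives
        out_port = self.mac_table.get(dst_mac)
        if dst_mac == BROADCAST or out_port is None:
            return self.ports - {in_port}              # broadcast or unknown: flood
        if out_port == in_port:
            return set()                               # same segment: filter, don't forward
        return {out_port}                              # known destination: one port only


bridge = LearningBridge([1, 2, 3])
print(bridge.receive(1, "aa", "bb"))        # bb unknown yet, so flood to ports 2 and 3
print(bridge.receive(2, "bb", "aa"))        # aa was learned on port 1
print(bridge.receive(1, "aa", "bb"))        # bb was learned on port 2
print(bridge.receive(3, "cc", BROADCAST))   # broadcasts always flood
```

The flood-on-unknown rule is what makes a loop so destructive: with two paths between bridges, flooded frames come back around and get flooded again.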
The Modem: Why It Exists and What Happens When Your Bits Leave the LAN
The DNS Misconfiguration That Took Down an E-Commerce Site
- Always use DNS transaction logs to verify changes immediately after deployment.
- Don't rely solely on TTL for recovery — have a rollback plan for DNS.
- Monitor DNS resolution from multiple geographic locations during major events.
- telnet api.example.com 443 from the server.
- dig +short example.com and check TTL values. Compare against authoritative NS responses.
- mtr --report target-ip to identify the hop with loss. Check for bandwidth saturation or misconfigured MTU on that link.
- arp -a. Verify subnet mask consistency. Look for VLAN misconfig on the switch.
- ip link show (or ifconfig)
- ping -c 4 8.8.8.8
Key takeaways
Common mistakes to avoid
- Assuming the network is reliable
- Misconfiguring subnet masks
- Using DNS with long TTLs during changes
- Ignoring MTU mismatches
Interview Questions on This Topic
Explain how a client connects to a server using TCP. What happens during the three-way handshake?
Frequently Asked Questions