ARP Cache Timeout — Why 300s Default Breaks HA Failover
Clients and switches may ignore gratuitous ARP until the cached entry expires.
- ARP maps IPs to MACs on local Ethernet/Wi-Fi. Hardware doesn't understand IPs.
- Request: broadcast 'who has 192.168.1.1?'; reply: target responds with its MAC.
- Cache entries live 60-300 seconds. That delay is the #1 cause of slow failover.
- Performance: first packet to new IP triggers a broadcast; subsequent packets use cached MAC.
- Production trap: ARP has zero authentication. Spoofing trivially redirects traffic.
- Biggest mistake: assuming gratuitous ARP instantly updates all caches (it doesn't).
Imagine you move to a new neighborhood and you know your friend Sarah's house number (42 Maple Street) but you don't know what her front door looks like. So you stand outside and shout 'Hey, who lives at number 42?' — Sarah hears you, waves, and now you know exactly which door to knock on. ARP does the same thing on a network: your computer knows the IP address it wants to reach, but it needs the physical MAC address to actually deliver the data. It broadcasts a 'who has this IP?' question to everyone on the local network, and the right machine shouts back its MAC address.
Every time you load a webpage, send a Slack message, or ping a server, your operating system has to solve a puzzle before a single byte leaves your machine: it knows the destination's IP address, but your network hardware — your Ethernet card, your Wi-Fi adapter — doesn't understand IP addresses. It only speaks in MAC addresses, those 48-bit hardware identifiers burned into every network interface at the factory. Without a way to bridge that gap, your packets go nowhere.
This is the exact problem ARP was designed to solve back in 1982 (RFC 826), and it's still doing that job on virtually every LAN on the planet. It sits at the boundary between Layer 2 (Data Link) and Layer 3 (Network) of the OSI model, acting as a live translation service that maps 'logical' IP addresses to 'physical' MAC addresses. When it works, it's invisible. When it breaks — or gets exploited — things get interesting fast.
By the end you'll understand exactly how ARP request and reply packets are constructed, why the ARP cache exists and what happens when it goes stale, how ARP spoofing works at a packet level so you can reason about network security, and how to inspect and manipulate ARP behavior on a real Linux or macOS machine. This is the kind of depth that separates engineers who just use networks from engineers who actually understand them.
What is ARP — Address Resolution Protocol?
ARP (Address Resolution Protocol) is a core networking mechanism that bridges Layer 2 (MAC) and Layer 3 (IP). Instead of a dry definition, let's see it in action. When your machine wants to send a packet to another machine on the same Ethernet segment, it needs the destination's MAC address. It broadcasts an ARP request: 'who-has 192.168.1.42? Tell 192.168.1.1'. The target unicasts its MAC back. Your OS caches that mapping so future packets don't need to broadcast again.
That's the entire protocol in one paragraph. The details — packet format, cache behavior, timeouts — are where production issues live.
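To make the request format concrete, here is a minimal Python sketch that packs the 28-byte ARP payload for IPv4 over Ethernet, following the field order in RFC 826. The function name build_arp_request and the sample addresses are illustrative assumptions. The sketch only builds the bytes and does not send anything on the wire.

```python
import socket
import struct

def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Return the 28-byte ARP payload for an IPv4-over-Ethernet who-has request."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length (MAC)
        4,            # protocol address length (IPv4)
        1,            # opcode: 1 = request, 2 = reply
        sender_mac,
        socket.inet_aton(sender_ip),
        b"\x00" * 6,  # target MAC is unknown, so it is zeroed
        socket.inet_aton(target_ip),
    )

# Illustrative addresses only: 192.168.1.1 asks who has 192.168.1.42.
payload = build_arp_request(bytes.fromhex("aabbccddeeff"), "192.168.1.1", "192.168.1.42")
print(len(payload), payload.hex())
```

On the wire this payload would sit inside an Ethernet frame addressed to ff:ff:ff:ff:ff:ff with EtherType 0x0806. A reply uses the same layout with opcode 2 and the target MAC filled in.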
arp -a for same IP with different MACs across multiple queries.
gc_stale_time and switch aging time. Test with arping -U on failover.
ARP Spoofing: The Attack That Redirects Traffic Without Routes
ARP spoofing (ARP cache poisoning) exploits the fact that ARP has no authentication. An attacker sends unsolicited ARP replies (gratuitous ARP) claiming to own the IP address of the default gateway or another host. The victim's ARP cache updates with the attacker's MAC, and all traffic destined for that IP is sent to the attacker instead.
How it works: attacker sends "192.168.1.1 is at aa:bb:cc:dd:ee:ff" (where aa:bb:cc:dd:ee:ff is attacker's MAC). The target believes this unsolicited update and forwards all traffic. The attacker can then inspect, modify, or block the traffic — a classic man-in-the-middle attack.
Mitigations: dynamic ARP inspection (DAI) on switches validates ARP packets against DHCP snooping bindings. Port security limits MAC addresses per port. Static ARP entries for critical IPs (gateway, DNS, NTP) prevent poisoning but are administratively heavy. On Linux, leave arp_accept at 0 so gratuitous ARP cannot create new cache entries; arp_filter and arp_ignore only control which interfaces answer ARP requests and do not filter incoming unsolicited ARP.
Detection: use arpwatch (logs ARP changes) or arp-scan to detect duplicate IP claims. On Linux, arp -a may show the same IP with different MACs over time. Anomaly detection can alert when gateway MAC changes outside maintenance windows.
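The "same IP, different MACs over time" check can be sketched in a few lines of Python. This is only an illustration of the idea, not how arpwatch is implemented. The function name find_mac_flips and the sample observations are assumptions, and in practice the (IP, MAC) pairs would come from repeated cache snapshots or a packet capture.

```python
from collections import defaultdict

def find_mac_flips(observations):
    """Return {ip: [macs in first-seen order]} for IPs claimed by more than one MAC."""
    seen = defaultdict(list)
    for ip, mac in observations:
        mac = mac.lower()  # normalise case so the same MAC is not counted twice
        if mac not in seen[ip]:
            seen[ip].append(mac)
    return {ip: macs for ip, macs in seen.items() if len(macs) > 1}

# Hypothetical observations: the gateway IP shows up with two different MACs.
sample = [
    ("192.168.1.1", "00:11:22:33:44:55"),
    ("192.168.1.42", "66:77:88:99:aa:bb"),
    ("192.168.1.1", "AA:BB:CC:DD:EE:FF"),
]
flips = find_mac_flips(sample)
print(flips)
```

A flip is not proof of an attack, since a legitimate failover produces the same pattern. That is why the alerting rule above is tied to maintenance windows.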
arpwatch to detect 'flip' events.
arpwatch is the standard monitoring tool.
Gratuitous ARP: The Double-Edged Sword
Gratuitous ARP (GARP) is an ARP announcement sent without a corresponding request. It's used for IP address takeover (failover), MAC address updates, and duplicate address detection (DAD).
In gratuitous ARP, the sender puts its own IP in both the sender IP and target IP fields, unlike a normal request, where the target IP belongs to another host. The message says 'this IP is now at this MAC'. Recipients may update their ARP cache immediately, even though they didn't ask.
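A short Python sketch shows what distinguishes a gratuitous ARP at the packet level: the sender IP and target IP carry the same address. The helper names parse_arp and is_gratuitous and the sample VIP are assumptions made for illustration. The payload layout is the standard 28-byte IPv4-over-Ethernet ARP format.

```python
import socket
import struct

ARP_FORMAT = "!HHBBH6s4s6s4s"  # IPv4-over-Ethernet ARP payload, 28 bytes

def parse_arp(payload: bytes) -> dict:
    """Decode the fields this example needs from a 28-byte ARP payload."""
    _htype, _ptype, _hlen, _plen, op, sha, spa, _tha, tpa = struct.unpack(
        ARP_FORMAT, payload[:28]
    )
    return {
        "op": op,
        "sender_mac": sha.hex(":"),
        "sender_ip": socket.inet_ntoa(spa),
        "target_ip": socket.inet_ntoa(tpa),
    }

def is_gratuitous(arp: dict) -> bool:
    """A gratuitous ARP announces the sender's own IP: sender IP equals target IP."""
    return arp["sender_ip"] == arp["target_ip"]

# Hypothetical failover: the new owner of 192.168.1.100 announces itself.
vip = socket.inet_aton("192.168.1.100")
new_owner = bytes.fromhex("aabbccddeeff")
garp = struct.pack(ARP_FORMAT, 1, 0x0800, 6, 4, 1, new_owner, vip, b"\xff" * 6, vip)
parsed = parse_arp(garp)
print(parsed["sender_mac"], is_gratuitous(parsed))
```

The same check works on captured traffic, which makes it easy to confirm that a failover script really emitted an announcement for the VIP.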
Common uses:
- HA failover (VRRP, CARP): The standby server sends GARP to update switch MAC tables and client caches when the VIP moves.
- MAC address change: If a NIC MAC changes (rare, but possible with virtual machines), GARP can notify the network.
- Duplicate IP detection: A node that receives GARP claiming an IP it already owns can detect the conflict.
Why GARP fails in production:
- Many switches and client OSes ignore unsolicited ARP updates (security hardening). They only update the cache in response to requests.
- Even when accepted, some implementations only update if the entry doesn't already exist or is stale.
- The solution is to send a series of ARP requests for the same IP, forcing a cache refresh via reply.
sysctl -w net.ipv4.conf.eth0.arp_accept=1 makes Linux create new cache entries from unsolicited ARP. The default is 0, which means gratuitous ARP for an IP that is not already in the cache is ignored. Many distributions leave it at 0 for security. Always test if your GARP is actually being accepted in your environment.
net.ipv4.neigh.default.gc_stale_time = 30 and net.ipv4.neigh.default.proxy_qlen = 96
ARP Cache Internals: Aging, GC and Production Tuning
The ARP cache is a simple key-value store: IP → MAC. But its behavior is governed by several timers and thresholds that directly impact production reliability.
Key Linux sysctl parameters:
- gc_stale_time (default 60s): how long an entry can be stale before it's considered for garbage collection. A stale entry means the MAC hasn't been verified recently, but the entry still exists.
- gc_thresh1 (default 128): if the cache has fewer entries than this, GC doesn't run.
- gc_thresh2 (default 512): if the cache exceeds this, GC runs more aggressively.
- gc_thresh3 (default 1024): hard limit. Once reached, new ARP resolutions fail with "neighbour table overflow".
- base_reachable_time (default 30s): base time for an entry to be considered reachable; the actual reachable time is randomized between base_reachable_time/2 and 3*base_reachable_time/2.
- delay_first_probe_time (default 5s): time to wait before the first probe after an entry becomes stale.
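The three gc_thresh watermarks are easier to reason about as a small classifier. The Python sketch below encodes only the rules stated above, using the listed defaults. The function name gc_pressure and the four labels it returns are hypothetical and are not kernel terminology.

```python
def gc_pressure(entries: int, thresh1: int = 128, thresh2: int = 512, thresh3: int = 1024) -> str:
    """Classify neighbour-table fill level against the gc_thresh watermarks."""
    if entries >= thresh3:
        return "overflow"       # hard limit reached: new resolutions fail
    if entries > thresh2:
        return "aggressive-gc"  # above the soft limit: GC runs more aggressively
    if entries >= thresh1:
        return "normal-gc"      # GC is allowed to run
    return "no-gc"              # below gc_thresh1: GC does not run

for count in (100, 300, 800, 1024):
    print(count, gc_pressure(count))
```

Feeding it the current entry count from a monitoring job gives an early warning before the table reaches gc_thresh3.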
Windows ARP cache: netsh interface ip delete arpcache flushes it. The default timeout is often quoted as 300 seconds, but Windows actually uses a different mechanism, neighbor unreachability detection (NUD), where the relevant timer is closer to 60 seconds.
Switch MAC aging: Layer-2 switches keep a MAC address table that maps MACs to ports. The default aging time is often 300 seconds. MAC learning happens on any frame, based on its source MAC. If a frame arrives on a different port than the current entry, the switch updates the entry immediately (MAC flapping), and a GARP triggers exactly this. However, the ARP cache on the switch (if it's a layer-3 switch) is separate and may not update from GARP.
Tuning for HA:
- Reduce gc_stale_time to 15-30 seconds for faster failover.
- Increase gc_thresh3 if you have many neighbors (e.g., container hosts).
- Set arp_accept=1 if you trust GARP from your failover script.
- Always test: send GARP and verify the cache update on the target with ip neigh show.
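The "always test" step can be automated by parsing ip neigh show output and checking which MAC the VIP currently resolves to. The sketch below is a minimal Python version under stated assumptions. The helper names parse_ip_neigh and vip_points_at are hypothetical, and the sample text stands in for real command output, which may carry extra fields on some systems.

```python
def parse_ip_neigh(output: str) -> dict:
    """Map IP -> (mac, state) from `ip neigh show` style text."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        if "lladdr" not in fields:
            continue  # FAILED or INCOMPLETE entries carry no MAC
        mac = fields[fields.index("lladdr") + 1]
        table[fields[0]] = (mac.lower(), fields[-1])
    return table

def vip_points_at(output: str, vip: str, expected_mac: str) -> bool:
    """True if the cache entry for the VIP holds the expected MAC."""
    entry = parse_ip_neigh(output).get(vip)
    return entry is not None and entry[0] == expected_mac.lower()

# Assumed sample output captured on a client after a failover.
sample = """\
192.168.1.100 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE
192.168.1.1 dev eth0 lladdr 00:11:22:33:44:55 STALE
192.168.1.77 dev eth0  FAILED
"""
print(vip_points_at(sample, "192.168.1.100", "AA:BB:CC:DD:EE:FF"))
```

Running the check on a client before and after a test failover shows whether the GARP was accepted or whether the old MAC is still cached.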
- Entries have a 'reachable' state and 'stale' state. Stale entries are still usable but need verification.
- GC runs periodically to purge entries that haven't been used. Not all stale entries are removed immediately.
- gc_stale_time sets how long an entry can stay stale before GC considers it for deletion.
- gc_thresh1/2/3 set the watermarks for GC aggression. Overflow causes ARP failures.
- Tuning is a trade-off: faster failover vs more ARP broadcasts.
gc_thresh3) causes 'neighbour table overflow' errors: new connections fail silently.
cat /proc/net/stat/arp_cache for table fullness. Increase gc_thresh3 if you have >1000 neighbours.
gc_thresh to avoid silent failures. Reduce gc_stale_time for HA.
gc_stale_time to 15s, set arp_accept=1, reduce switch MAC aging to 30s. Use VRRP for sub-second.
gc_thresh3 to 4096 or higher. Monitor cache usage. Enable neigh/default/gc_interval if needed.
gc_thresh3 is too low. Increase it. Also consider reducing gc_stale_time to flush stale entries faster.
gc_stale_time might be too low, causing frequent re-resolutions. Increase it to reduce broadcasts, but balance with failover needs.
arp_accept=0), they will not update cache until next resolution (could be minutes). Set arp_accept=1 on critical clients or use gratuitous ARP with request mode.
Proxy ARP and ARP in Virtualized/Cloud Environments
Proxy ARP is a technique where a device (usually a router) answers ARP requests on behalf of another host. It's used in scenarios like VPNs, virtual IPs, and transparent bridging. The router sees an ARP request for an IP that belongs to a host behind it, and it replies with its own MAC address. This tricks the sender into forwarding traffic to the router, which then forwards the packet to the real destination.
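The router's decision can be sketched as a simple lookup: if the requested IP falls inside a range the router fronts, it answers with its own MAC, otherwise it stays silent. This Python sketch is an illustration under assumptions. The function name proxy_arp_reply, the proxied_networks list, and all addresses are hypothetical, and real implementations typically derive the answer from the routing table instead of a static list.

```python
import ipaddress

def proxy_arp_reply(target_ip, router_mac, proxied_networks):
    """Return the MAC to answer with, or None if the router should stay silent."""
    ip = ipaddress.ip_address(target_ip)
    for net in proxied_networks:
        if ip in ipaddress.ip_network(net):
            return router_mac  # the router answers with its own MAC, not the host's
    return None

# Hypothetical setup: VPN clients live in 192.168.1.128/28 behind the router.
router_mac = "00:11:22:33:44:55"
proxied = ["192.168.1.128/28"]
print(proxy_arp_reply("192.168.1.130", router_mac, proxied))
print(proxy_arp_reply("192.168.1.42", router_mac, proxied))
```

The sender never learns the real host's MAC, which is exactly why proxy ARP hides the topology from anyone reading an ARP cache.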
When to use Proxy ARP:
- VPN clients on a subnet need to appear as local hosts.
- Load balancers that proxy connections to backend servers.
- Containers in host-networking mode where the host answers for container IPs.
Production pitfalls:
- Proxy ARP can cause routing loops if misconfigured. The router answers for an IP that is on the same subnet but behind itself, leading to a cycle.
- It hides the true topology, making debugging harder.
- Many security teams disable proxy ARP to prevent spoofing.
ARP in cloud environments (AWS, GCP, Azure):
- Cloud providers use Software-Defined Networking (SDN) in place of broadcast ARP. ARP requests from an instance are answered by the virtual network layer and never reach other instances.
- The hypervisor handles MAC-to-IP mapping. Even if you see MACs in arp -a, they are virtual MACs assigned by the cloud controller.
- Gratuitous ARP is ignored. Failover must use cloud-specific mechanisms: health checks, load balancers, Elastic IPs (AWS), etc.
- In AWS, if you move an Elastic IP to another instance, the network mapping updates in seconds, but it's not ARP-based. It's a control plane update.
- Key rule: In the cloud, forget everything you know about ARP. It doesn't work the same way.
arp_ignore/arp_announce to prevent servers from responding on VIP.
The 7-Minute Failover That Cost $400k
arping -U -I eth0 -c 3 <VIP> (unsolicited ARP): some OSes accept this with arp_accept=1 sysctl. Implemented link-layer networking with VRRP (which sends multicast GARP with proper MAC). Used send_arp in keepalived to emit unsolicited ARPs. After tuning, failover dropped to 3 seconds.
- Gratuitous ARP is a hint, not a command. Switches and clients ignore it unless configured to accept unsolicited updates. Never rely on it as your only failover mechanism.
- Always tune ARP cache timeout for your failover requirement. 300 seconds (a common switch default) is too long for HA. 30-60 seconds is safer; use VRRP or BFD for sub-second failover.
- Test failover with packet capture. Look for ARP requests/replies during transition. If you see GARP but traffic still goes to wrong MAC, cache is the culprit.
- In cloud environments (AWS, GCP), ARP is disabled or replaced with SDN forwarding rules. Use health checks and load balancers, not VIPs with ARP.
arp -a | grep <destination_IP>. If the MAC is incomplete or wrong, ARP resolution failed or the entry is stale. Clear cache: ip neigh flush dev eth0 or arp -d <IP>. Watch tcpdump: tcpdump -i eth0 arp, and see whether requests go out and replies come back.
mac address-table aging-time. Use arping -U from the new owner. For Linux clients, set net.ipv4.neigh.default.gc_stale_time = 30
arpwatch or arp-scan to detect anomalies. arp -a may show duplicate IP with different MACs. Mitigate: port security on switches (static ARP entries for critical IPs). On Linux, keep arp_accept at 0 so unsolicited ARP cannot create new cache entries; arp_filter and arp_ignore only affect which interfaces answer requests. For high-security, configure static ARP entries or use layer-3 routing not layer-2.
arp -a | grep <IP> and see multiple MAC entries. Use arping -D -I eth0 <IP> to detect duplicate address. Fix by renumbering the conflict or shutting down the rogue device.
tcpdump -i eth0 arp and host <target_IP>. Ensure both machines are on same VLAN/physical segment.
arp -a on affected clients. Clear the entry and force re-resolution: arp -d <VIP>. If GARP sent but ignored, check arp_accept on Linux clients.
arp -d <IP> or ip neigh del <IP> dev eth0. Then retry ping to refresh cache.
Key takeaways
arp_accept=0 on Linux). Always test with packet capture.
Common mistakes to avoid
6 patterns
Assuming ARP works for cross-subnet communication
arp -a shows incomplete or no entry for remote IP; pings fail despite correct routes.
Leaving ARP cache timeout too long for failover environment
gc_stale_time on Linux clients (default 60s). Reduce switch MAC aging time from 300s to 30s. For Linux, set arp_accept=1 if unsolicited ARP is safe in your environment.
Trusting ARP for security, with no additional controls
arpwatch. Encrypt traffic end-to-end (TLS, IPsec) as defense-in-depth.
Forgetting to clear ARP cache after moving IP to new MAC
arp -d <IP> on Linux or arp -d <IP> on Windows. On switch, clear mac address-table for that VLAN.
Using `arping -U` without testing acceptance on target OS
arp -a on receiver doesn't update. Failover scripts think update succeeded.
arp_accept sysctl on Linux (sysctl net.ipv4.conf.eth0.arp_accept). If 0, receiver ignores unsolicited ARP. Change to 1 or rely on arping -c 3 request mode, not -U (gratuitous).
Assuming ARP works in cloud as it does on-prem
Interview Questions on This Topic
Explain the difference between ARP request, ARP reply, and gratuitous ARP. When is gratuitous ARP used?
who-has 192.168.1.1?) sent to find MAC of an IP. ARP reply: unicast response from the owner of that IP, containing its MAC address. Gratuitous ARP (GARP): unsolicited broadcast (or unicast) where sender puts its own IP in the target field, announcing 'this IP is now at this MAC'. GARP is used for: (1) HA failover, where the new owner of a virtual IP announces itself; (2) Duplicate address detection, where a node can check if its IP is already in use; (3) MAC address change, after NIC swap or VM migration. The problem: many network stacks ignore GARP unless configured otherwise (arp_accept=1 on Linux). So GARP is not reliable for failover without additional tuning.
Frequently Asked Questions