AWS VPC NAT Gateway — AZ Failure Modes and HA Routing
Single NAT Gateway took down all private subnets during AZ failure.
20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.
- VPC is your logically isolated network in AWS — you control IP ranges, subnets, routing, and access.
- Subnets slice the VPC CIDR block; each lives in one AZ and can be public or private.
- Route tables direct traffic; one misconfigured entry silently black-holes packets.
- NAT Gateway enables outbound internet from private subnets — costs ~$32/month plus data.
- Security groups are stateful instance firewalls; NACLs are stateless subnet filter lists.
- VPC Peering and Transit Gateway connect VPCs; TGW scales better for multi-VPC topologies.
Think of an AWS VPC like building your own private office complex inside a giant shared skyscraper (AWS's data center). You get to decide which floors are public-facing (lobbies anyone can walk into) and which are private back-offices (only internal staff allowed). The hallways between floors are your route tables. The security desk at each door is a security group. And the master building directory that controls who even gets onto your floors from outside is your Network ACL. Your VPC is your building — fully yours — inside a building that belongs to everyone.
Every production AWS workload lives inside a VPC, and networking mistakes are one of the top three causes of outages, security breaches, and unexplained latency spikes in cloud infrastructure. Yet most engineers treat VPC config as a checkbox — pick the wizard defaults, click through, and move on. That works until it catastrophically doesn't. A misconfigured route table silently black-holes traffic. A security group rule that's too permissive exposes your RDS instance to the internet. A NAT gateway in the wrong AZ becomes a single point of failure that takes down your entire application tier at 2am on a Friday.
AWS VPC (Virtual Private Cloud) exists because the alternative — putting all your EC2 instances on a flat, shared network with every other AWS customer — is obviously untenable. VPC gives you a logically isolated section of the AWS cloud where you control IP addresses, subnets, routing, and access control completely. It's not just a network; it's the security and topology foundation everything else sits on. Get it right and your architecture is clean, scalable, and defensible. Get it wrong and you're debugging mysterious connection timeouts in production while your users are screaming.
By the end of this article you'll understand how VPC traffic actually flows end-to-end — from an internet request hitting your load balancer all the way to a database query and back — including exactly what each component does, why it exists, how the pieces interact at the packet level, and the specific production decisions that separate well-architected systems from ones that quietly accumulate technical debt and security risk.
Why NAT Gateways Are a Single-Zone Risk
An AWS VPC NAT Gateway enables outbound internet connectivity for instances in private subnets while blocking inbound traffic. It sits in a public subnet, translates private IPs to its Elastic IP, and forwards responses back. This is a stateful, managed service — you don't patch or scale it, but you pay per hour and per GB processed.
Each NAT Gateway is deployed in one Availability Zone. If that AZ goes down, all private instances using that gateway lose outbound access. There is no automatic failover. The gateway's Elastic IP stays with the failed resource until you manually replace it. Subnets in other AZs that route through this gateway lose outbound access too, and even while the gateway is healthy, that cross-AZ path incurs inter-AZ data transfer charges and extra latency.
Use multiple NAT Gateways (one per AZ) for high availability in production. Route tables must be AZ-specific: each private subnet sends 0.0.0.0/0 traffic to the NAT Gateway in its own AZ. This eliminates a single point of failure and avoids cross-AZ data charges. For non-critical workloads, a single NAT Gateway with a failover script may suffice, but expect downtime during AZ outages.
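The AZ-specific routing rule above is easy to check mechanically. Here is a minimal Python sketch, assuming you have already collected each private subnet's AZ and the NAT Gateway its 0.0.0.0/0 route targets. All IDs, the PrivateSubnet type, and both function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivateSubnet:
    subnet_id: str          # hypothetical ID, for illustration only
    az: str                 # AZ the subnet lives in
    default_route_nat: str  # NAT Gateway targeted by the subnet's 0.0.0.0/0 route


def cross_az_dependencies(subnets, nat_az):
    """Return (subnet_id, subnet_az, nat_az) for subnets whose default route leaves their AZ."""
    findings = []
    for subnet in subnets:
        target_az = nat_az[subnet.default_route_nat]
        if target_az != subnet.az:
            findings.append((subnet.subnet_id, subnet.az, target_az))
    return findings


def subnets_losing_egress(subnets, nat_az, failed_az):
    """Return the subnets that lose outbound internet if failed_az goes down."""
    return [s.subnet_id for s in subnets if nat_az[s.default_route_nat] == failed_az]


# Which AZ each (hypothetical) NAT Gateway lives in.
nat_az = {"nat-a": "us-east-1a", "nat-b": "us-east-1b"}

# Anti-pattern: both private subnets route through the NAT Gateway in us-east-1a.
single_nat = [
    PrivateSubnet("subnet-app-a", "us-east-1a", "nat-a"),
    PrivateSubnet("subnet-app-b", "us-east-1b", "nat-a"),
]

# Recommended: each private subnet routes to the NAT Gateway in its own AZ.
per_az_nat = [
    PrivateSubnet("subnet-app-a", "us-east-1a", "nat-a"),
    PrivateSubnet("subnet-app-b", "us-east-1b", "nat-b"),
]

print(cross_az_dependencies(single_nat, nat_az))
print(subnets_losing_egress(per_az_nat, nat_az, "us-east-1a"))
```

With the single-gateway layout, an outage in us-east-1a takes egress away from both subnets; with one gateway per AZ, only the subnet inside the failed AZ is affected.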
VPC Fundamentals and CIDR Design
A VPC is a virtual network dedicated to your AWS account. It's logically isolated from other VPCs in the same region. When you create a VPC, you specify an IPv4 CIDR block — a private IP range (RFC 1918) like 10.0.0.0/16. That address space is yours. Every resource inside gets an IP from this range.
Choose your CIDR block carefully. It must not overlap with any other network you'll connect (on-premises, other VPCs). A VPC CIDR can be as large as /16 (65,536 IPs), which is enough for most use cases, and as small as /28. AWS reserves 5 IPs in every subnet, so plan for that loss: a /28 subnet leaves only 11 usable IPs, so never use one unless you're sure that's all you need. The most common mistakes are picking a CIDR that's too large or one that overlaps with an existing on-premises range. You cannot change the primary VPC CIDR after creation; you can only add secondary CIDR blocks or rebuild.
DNS behaviour inside a VPC is controlled by two VPC attributes: 'DNS resolution' (enableDnsSupport) and 'DNS hostnames' (enableDnsHostnames). Enable both for production VPCs. This lets you use private DNS names for EC2 instances, which makes internal service discovery clean.
- Choose a /16 or /20 for production — leaves room for growth without waste.
- Avoid overlapping with on-premises IP ranges if you'll ever use VPN or Direct Connect.
- AWS reserves 5 IPs per subnet — plan for that in capacity estimates.
- You can add secondary CIDRs after creation, but the primary CIDR is forever.
- Use predictable CIDRs per environment: 10.0.0.0/16 for dev, 10.1.0.0/16 for staging, 10.2.0.0/16 for prod.
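The CIDR rules above are plain arithmetic, so they can be checked before anything is created. The sketch below uses only Python's standard ipaddress module. The on_prem range is an invented example, and the function names are mine, not part of any AWS tooling.

```python
import ipaddress
from itertools import islice

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in every subnet


def usable_ips(cidr):
    """Addresses left for your resources in a subnet of this size."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def find_overlaps(named_cidrs):
    """Return pairs of names whose CIDR blocks overlap."""
    nets = [(name, ipaddress.ip_network(cidr)) for name, cidr in named_cidrs.items()]
    overlaps = []
    for i, (name_a, net_a) in enumerate(nets):
        for name_b, net_b in nets[i + 1:]:
            if net_a.overlaps(net_b):
                overlaps.append((name_a, name_b))
    return overlaps


def first_subnets(vpc_cidr, new_prefix, count):
    """Carve the first `count` subnets of size /new_prefix out of the VPC block."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return [str(net) for net in islice(vpc.subnets(new_prefix=new_prefix), count)]


plan = {
    "dev": "10.0.0.0/16",
    "staging": "10.1.0.0/16",
    "prod": "10.2.0.0/16",
    "on_prem": "10.2.128.0/17",  # hypothetical on-premises range, chosen to collide
}

print(usable_ips("10.0.0.0/28"))          # 16 addresses minus 5 reserved
print(find_overlaps(plan))                # the prod / on_prem collision
print(first_subnets("10.2.0.0/16", 20, 3))
```

Running a check like this against every range you might ever connect over VPN, Direct Connect, or peering is cheap insurance compared with re-addressing a live VPC.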
VPC Architecture and Component Relationship Visual
A VPC is more than a collection of isolated resources — it's a structured network with well-defined relationships between components. Understanding these relationships is essential for debugging and designing resilient architectures. Below is a high-level diagram that shows how the core VPC components connect and interact.
The VPC itself contains subnets, route tables, and security boundaries. Each subnet is tied to a single Availability Zone. Route tables control traffic between subnets, to the internet, and between connected networks. Internet Gateways (IGW) attach to the VPC and provide a path to the internet for public subnets. NAT Gateways sit in a public subnet and enable outbound internet for private subnets. VPC Endpoints provide private connectivity to AWS services. VPC Peering and Transit Gateway allow inter-VPC traffic.
Security Groups act as virtual firewalls attached to ENIs (Elastic Network Interfaces) of instances. Network ACLs provide an additional layer at the subnet boundary. Together they form a defense-in-depth strategy.
[Diagram omitted: logical placement of the VPC components and the traffic flows between them]
Subnets and Route Tables in Production
Subnets divide your VPC IP range into smaller segments, each anchored to a single Availability Zone. This is how you achieve multi-AZ redundancy: deploy resources across subnets in different AZs. Subnets can be public (with a route to an Internet Gateway) or private (no direct internet access). The subnet's route table determines traffic flow. Every subnet is associated with exactly one route table; a subnet with no explicit association falls back to the VPC's main route table.
Route tables contain entries (routes) that specify where to send traffic based on destination. When several routes match a destination, the most specific one (longest prefix) wins. The most important route is the local route, which is added automatically for the VPC CIDR. For internet access, add 0.0.0.0/0 -> Internet Gateway (public subnet) or 0.0.0.0/0 -> NAT Gateway (private subnet). Misconfigured route tables are a leading cause of network outages in VPCs. A missing route silently drops traffic.
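To make the route lookup concrete, here is a small Python model of how a route table picks a target: the most specific matching prefix wins, and no match means the packet goes nowhere. It is a simplified sketch, and the route targets (local, nat-a) are placeholder strings rather than real resource IDs.

```python
import ipaddress


def select_route(routes, destination):
    """Return the target of the most specific matching route, or None if nothing matches."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if dest in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no route: the packet is silently dropped
    return max(matches)[1]  # highest prefix length wins


# Route table for a private subnet in a 10.0.0.0/16 VPC.
private_routes = [
    ("10.0.0.0/16", "local"),  # added automatically for the VPC CIDR
    ("0.0.0.0/0", "nat-a"),    # default route to the NAT Gateway
]

# The same table with the default route missing.
broken_routes = [("10.0.0.0/16", "local")]

print(select_route(private_routes, "10.0.3.7"))
print(select_route(private_routes, "198.51.100.20"))
print(select_route(broken_routes, "198.51.100.20"))
```

The last call returns None, which is the black-hole case: nothing errors, the traffic just never arrives.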
Production tip: Use explicit subnet associations. Avoid using the main route table for anything — it's a common source of accidents. Create custom route tables per tier (web, app, db) and associate them explicitly. For highly available architectures, create at least two subnets per function (one per AZ) and spread resources across them.
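One way to audit the explicit-association rule is to look for subnets that have no association of their own and are therefore riding on the main route table. The sketch below assumes input shaped like the RouteTables list returned by aws ec2 describe-route-tables (Associations entries with SubnetId and Main keys). The sample data and IDs are invented.

```python
def subnets_on_main_route_table(subnet_ids, route_tables):
    """Return subnets with no explicit route table association (they use the main table)."""
    explicit = {
        assoc["SubnetId"]
        for table in route_tables
        for assoc in table.get("Associations", [])
        if "SubnetId" in assoc  # the main-table association carries no SubnetId
    }
    return sorted(set(subnet_ids) - explicit)


# Invented sample, trimmed to the keys this check needs.
route_tables = [
    {"RouteTableId": "rtb-main", "Associations": [{"Main": True}]},
    {
        "RouteTableId": "rtb-app",
        "Associations": [
            {"Main": False, "SubnetId": "subnet-app-a"},
            {"Main": False, "SubnetId": "subnet-app-b"},
        ],
    },
]

all_subnets = ["subnet-app-a", "subnet-app-b", "subnet-db-a"]

print(subnets_on_main_route_table(all_subnets, route_tables))
```

Any subnet this returns will change behaviour the moment someone edits the main route table, which is exactly the kind of accident the tip is warning about.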
Public vs Private Subnet Connectivity Checklist
Misclassifying a subnet as public or private is one of the most common VPC misconfigurations. Use the following checklist to validate connectivity assumptions for each subnet type.
For a Public Subnet (instances reachable from the internet):
- Route table has a default route (0.0.0.0/0) pointing to an Internet Gateway (igw-xxx).
- The Internet Gateway is attached to the VPC and is in the 'available' state.
- Each instance has a public IPv4 address: either auto-assign public IPv4 is enabled at the subnet level or an Elastic IP is attached.
- Security group inbound rules allow the desired traffic (e.g., port 22 for SSH, port 80/443 for web).
- NACL inbound and outbound rules allow the necessary traffic (including ephemeral ports 1024-65535 for outbound responses).
For a Private Subnet (instances cannot be directly reached from the internet):
- Route table has a default route (0.0.0.0/0) pointing to a NAT Gateway (nat-xxx) and no route to an Internet Gateway.
- The NAT Gateway is in a public subnet (route to IGW), is in 'available' state, and has an Elastic IP.
- Security group outbound rules allow traffic to the internet (e.g., all traffic to 0.0.0.0/0).
- NACL inbound rules for the subnet allow return traffic on ephemeral ports (1024-65535) from the internet.
- NACL outbound rules allow traffic to the internet (e.g., 0.0.0.0/0 on ports 80 and 443).
- VPC Endpoints (Gateway or Interface) are used for AWS services instead of routing through NAT to reduce cost and latency. Gateway endpoints add their own prefix-list routes to the route table; they are not a target for the default route.
Verification Commands:
```bash
# Check subnet's route table
aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=subnet-123
# Check if subnet auto-assigns public IP
aws ec2 describe-subnets --subnet-ids subnet-123 --query 'Subnets[0].MapPublicIpOnLaunch'
# Verify NAT Gateway status
aws ec2 describe-nat-gateways --nat-gateway-ids nat-456
```
Verify the route with aws ec2 describe-route-tables and test with curl from an instance.
NAT Gateways and Internet Connectivity for Private Subnets
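If you save the describe-route-tables output as JSON, the public/private question can be answered in code rather than by eye. This is a rough Python sketch that assumes one route table entry with the Routes keys the EC2 API returns (DestinationCidrBlock, GatewayId, NatGatewayId, State). The classification labels and sample tables are my own.

```python
def classify_subnet(route_table):
    """Classify a subnet by where its route table sends 0.0.0.0/0 traffic."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            return "blackholed"  # target was deleted; traffic is dropped
        if route.get("GatewayId", "").startswith("igw-"):
            return "public"
        if route.get("NatGatewayId"):
            return "private-with-nat"
    return "isolated"  # no default route at all


# Invented samples, trimmed to the keys this check reads.
public_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-xxx", "State": "active"},
]}
private_table = {"Routes": [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-xxx", "State": "active"},
]}
stale_table = {"Routes": [
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-xxx", "State": "blackhole"},
]}

for table in (public_table, private_table, stale_table):
    print(classify_subnet(table))
```

The blackholed case is the one worth alerting on: the route still exists in the table, but its NAT Gateway is gone.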
NAT (Network Address Translation) Gateway allows instances in private subnets to initiate outbound traffic to the internet (e.g., for software updates, calling external APIs) while blocking inbound traffic from the internet. It's a managed AWS service that starts at 5 Gbps and scales automatically up to 100 Gbps. You pay per hour and per GB of data processed (about $0.045/hour + $0.045/GB in us-east-1). That adds up: a single NAT Gateway costs roughly $32/month (0.045 x 730 hours) before any data processing charges.
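The pricing above turns into a quick estimate. A minimal Python sketch, assuming the rates quoted in this article and a 730-hour month; actual rates vary by region, and inter-AZ or internet data transfer charges are not included.

```python
HOURLY_RATE_USD = 0.045   # per NAT Gateway hour, as quoted above
PER_GB_RATE_USD = 0.045   # per GB processed, as quoted above
HOURS_PER_MONTH = 730     # assumption: average month length


def monthly_nat_cost(gateways, gb_processed):
    """Estimated monthly NAT Gateway bill in USD (hourly plus data processing only)."""
    hourly = gateways * HOURLY_RATE_USD * HOURS_PER_MONTH
    processing = gb_processed * PER_GB_RATE_USD
    return hourly + processing


# One idle gateway versus three gateways (one per AZ) pushing 1 TB in total.
print(round(monthly_nat_cost(1, 0), 2))
print(round(monthly_nat_cost(3, 1000), 2))
```

Going from one gateway to one per AZ only multiplies the hourly part. The per-GB part depends on traffic volume, which is why VPC Endpoints for S3 and DynamoDB traffic are usually the bigger saving.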
The critical production rule: deploy one NAT Gateway per AZ. If you put a single NAT Gateway in us-east-1a and your private subnets in us-east-1b route through it, an outage in us-east-1a kills internet access for all those instances. The fix is to create a NAT Gateway in each AZ and route each private subnet's 0.0.0.0/0 traffic to the NAT Gateway in its own AZ. Cross-AZ NAT is technically possible but adds latency and defeats the purpose of multi-AZ HA.
Every public NAT Gateway has an Elastic IP (EIP), and all outbound traffic appears to come from it. If a partner or upstream firewall allowlists your traffic, it must list the EIP of every NAT Gateway you run, not just one. Also note that a NAT Gateway sits in a public subnet, so that subnet must have a route to an Internet Gateway. If you're cost-conscious and the workload is non-critical, consider a NAT Instance (a custom EC2 AMI), which can be cheaper but requires patching and failover management.
- Private subnet traffic to internet goes through NAT Gateway, which replaces the source IP with its Elastic IP.
- Return traffic is forwarded back because the NAT Gateway tracked the connection (stateful).
- Unsolicited inbound connections from the internet cannot reach private subnets through the NAT Gateway, because there is no tracked connection to map them to.
- This asymmetry is by design: private means no unsolicited inbound.
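The translate-and-track behaviour in the bullets above can be shown with a toy model. This Python sketch is not how AWS implements NAT Gateway; it only illustrates the idea of a connection table. The class, its port allocation scheme, and the addresses (taken from documentation ranges) are all assumptions.

```python
class NatTable:
    """Toy model of stateful source NAT: outbound flows create entries, inbound must match one."""

    def __init__(self, elastic_ip):
        self.elastic_ip = elastic_ip
        self._next_port = 1024
        self._by_public_port = {}

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the source to the Elastic IP and remember the original sender."""
        public_port = self._next_port
        self._next_port += 1
        self._by_public_port[public_port] = (src_ip, src_port, dst_ip, dst_port)
        return (self.elastic_ip, public_port, dst_ip, dst_port)

    def inbound(self, remote_ip, remote_port, public_port):
        """Return the private (ip, port) to forward to, or None if the packet is unsolicited."""
        entry = self._by_public_port.get(public_port)
        if entry is None or entry[2:] != (remote_ip, remote_port):
            return None
        return entry[:2]


nat = NatTable("203.0.113.10")

# A private instance calls an external API over HTTPS.
translated = nat.outbound("10.0.1.5", 40000, "198.51.100.7", 443)
print(translated)

# The API's response matches the tracked connection and is forwarded back.
print(nat.inbound("198.51.100.7", 443, translated[1]))

# A stranger probing the same port has no matching entry and is dropped.
print(nat.inbound("192.0.2.1", 443, translated[1]))
```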
Security Groups vs Network ACLs: The Two Layers of Defense
Security Groups (SGs) and Network ACLs (NACLs) are both virtual firewalls, but they operate at different levels and have fundamentally different behaviours. Understanding the difference is critical to designing a secure VPC without introducing confusing behaviour.
Security Groups are stateful, instance-level firewalls. If you allow inbound traffic on port 443, the return traffic is automatically allowed regardless of outbound rules. They support allow rules only (no explicit deny). You attach SGs to ENIs (Elastic Network Interfaces) of EC2 instances, RDS, ELB, etc. Changes take effect immediately. This is what you should use for controlling access between application components — e.g., allow web tier to talk to app tier on port 8080.
Network ACLs are stateless, subnet-level firewalls. They have separate inbound and outbound rules — both must be explicitly allowed for traffic to flow. They support allow and deny rules, evaluated in order (lowest number first). NACLs are useful for defense-in-depth, e.g., blocking known bad IPs at the subnet boundary. But they're stateless — if you allow inbound HTTP (port 80), you must also allow outbound ephemeral ports (1024-65535) for the response.
Production gotcha: When you allow ping (ICMP echo request) inbound, a security group automatically returns the reply. A NACL requires both inbound ICMP request and outbound ICMP reply rules — otherwise ping fails. This catches everyone at least once.
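The stateless, ordered behaviour of NACLs is easier to see in code than in prose. Below is a deliberately simplified Python model: it matches on port only (real NACL rules also match protocol and CIDR), evaluates rules lowest number first, and falls through to the implicit deny. The NaclRule type and function name are made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NaclRule:
    number: int     # lower numbers are evaluated first
    action: str     # "allow" or "deny"
    port_from: int
    port_to: int


def nacl_decision(rules, port):
    """First matching rule wins; if nothing matches, the implicit catch-all denies."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.action
    return "deny"


# Inbound HTTP is allowed, but the outbound side was forgotten.
inbound = [NaclRule(100, "allow", 80, 80)]
outbound_missing = []
outbound_fixed = [NaclRule(100, "allow", 1024, 65535)]

client_port = 49152  # the ephemeral port the response is sent to

print(nacl_decision(inbound, 80))                     # request gets in
print(nacl_decision(outbound_missing, client_port))   # response is blocked
print(nacl_decision(outbound_fixed, client_port))     # response gets out

# Ordering matters: a low-numbered deny beats a later, broader allow.
ordered = [NaclRule(100, "allow", 0, 65535), NaclRule(90, "deny", 443, 443)]
print(nacl_decision(ordered, 443))
```

A security group needs none of the outbound bookkeeping, because it tracks the connection and lets the response through on its own.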
NACL vs Security Group Comparison Table
When designing your VPC security posture, you need to decide where to place each rule. The table below provides a side-by-side comparison of Security Groups and Network ACLs across key dimensions.
| Feature | Security Group | Network ACL |
|---|---|---|
| Scope | Instance-level (ENI) | Subnet-level |
| Statefulness | Stateful (return traffic automatically allowed) | Stateless (return traffic must be explicitly allowed) |
| Rule types | Allow only | Allow and Deny |
| Rule evaluation | All rules evaluated (no order) | Rules evaluated in order (lowest number first) |
| Default rules | New SG: inbound deny all, outbound allow all | Default NACL: allow all in both directions; custom NACL: deny all until rules are added |
| Supports source/destination by | CIDR, security group ID, prefix list | CIDR only |
| Number of rules | Up to 60 inbound + 60 outbound per SG | Up to 20 inbound + 20 outbound per NACL (before limit increase) |
| Applies to | EC2, ELB, RDS, Lambda (via VPC), etc. | All instances in the associated subnet |
| Changes | Apply immediately to attached instances | Apply immediately to subnet traffic |
| Use case | Fine-grained control between services | Broad network boundaries, IP blacklisting |
Use Security Groups as your primary access control mechanism — they are simpler, stateful, and more granular. Use NACLs as a secondary layer for defenses such as blocking known malicious IPs or preventing traffic to/from specific ports at the subnet boundary. Because NACLs are stateless, always verify that both inbound and outbound rules cover the necessary traffic, especially ephemeral ports.
Advanced Connectivity: VPC Peering, Transit Gateway, and VPN
As your AWS footprint grows, you'll need to connect VPCs to each other and to on-premises networks. AWS offers three primary mechanisms: VPC Peering, Transit Gateway (TGW), and AWS VPN/Direct Connect. Each has different trade-offs for scale, cost, and operational overhead.
VPC Peering connects two VPCs (within the same or different accounts/regions) via a 1:1 relationship. Traffic stays on the AWS backbone and never crosses the public internet. Peering is not transitive; if VPC A is peered with B, and B with C, A cannot talk to C unless a direct peering exists. It's great for small-scale inter-VPC communication but becomes unwieldy beyond a handful of VPCs, because a full mesh of n VPCs needs n*(n-1)/2 connections.
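Both properties of peering, the quadratic growth and the lack of transitivity, fit in a few lines of Python. This is an illustrative sketch with made-up VPC names, not a model of any AWS API.

```python
def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC can reach every other VPC directly."""
    return vpc_count * (vpc_count - 1) // 2


def can_talk(peerings, vpc_a, vpc_b):
    """Peering is not transitive: only a direct connection counts."""
    return frozenset((vpc_a, vpc_b)) in peerings


for count in (3, 5, 20, 50):
    print(count, "VPCs need", full_mesh_peerings(count), "peering connections")

# A is peered with B, and B with C, but there is no A-to-C peering.
peerings = {frozenset(("vpc-a", "vpc-b")), frozenset(("vpc-b", "vpc-c"))}
print(can_talk(peerings, "vpc-a", "vpc-b"))
print(can_talk(peerings, "vpc-a", "vpc-c"))
```

With a hub-and-spoke design the count grows linearly instead: each VPC needs one attachment to the hub.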
Transit Gateway (TGW) is a hub-and-spoke router that connects up to thousands of VPCs and VPNs. It supports transitive routing — one attachment to TGW connects to all others (with route table controls). TGW simplifies network management at scale. You pay per attachment ($0.05/hour) and per GB processed. For large enterprises with many VPCs and hybrid connectivity, TGW is the standard.
AWS Site-to-Site VPN creates an IPsec tunnel between your VPC and on-premises network. It's often used as a backup to Direct Connect. A VPN connection goes through the Internet, so latency and bandwidth vary. Direct Connect provides dedicated private connectivity, but requires physical cross-connects and longer lead times.
Production consideration: Combine VPC Peering for high-bandwidth, low-latency needs between a small number of VPCs, and TGW for everything else. Use VPN as a cost-effective backup or for burst traffic that doesn't require SLA bandwidth.
- Each VPC attachment is like a network interface card on the router.
- Route tables within TGW control which attachments can talk to each other.
- Use separate route tables for production vs non-production attachments (isolation).
- Propagation automatically populates routes from attachments into TGW route tables — reduces manual entries.
- TGW supports multicast, which is not possible with VPC Peering.
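The association and propagation rules in the list above decide who can reach whom through the hub. Here is a rough Python model under simplifying assumptions: each attachment is associated with exactly one TGW route table, and a route table only knows the attachments that propagate into it. Names such as rt-prod and shared-vpc are hypothetical.

```python
def can_reach(association, propagations, src, dst):
    """src can send to dst only if src's associated route table has dst's routes propagated."""
    table = association[src]
    return dst in propagations.get(table, set())


# Which TGW route table each attachment is associated with.
association = {
    "prod-vpc": "rt-prod",
    "dev-vpc": "rt-nonprod",
    "shared-vpc": "rt-shared",
}

# Which attachments propagate their routes into each route table.
propagations = {
    "rt-prod": {"shared-vpc"},
    "rt-nonprod": {"shared-vpc"},
    "rt-shared": {"prod-vpc", "dev-vpc"},
}

print(can_reach(association, propagations, "prod-vpc", "shared-vpc"))
print(can_reach(association, propagations, "shared-vpc", "dev-vpc"))
print(can_reach(association, propagations, "prod-vpc", "dev-vpc"))
```

In this layout prod and dev can both use shared services, but neither has a route to the other, which is the isolation the separate route tables are there to provide. Remember that a working connection needs a route in both directions.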
The VPC Is Not a Data Center: How Cloud Networking Breaks Your Assumptions
If you treat your VPC like a physical data center, you're going to get burned. I've seen it happen. A junior architect once told me "our VPC is just like the on-prem network." Six hours later, a misconfigured route table took down production. The VPC is a software-defined network. It has no cables, no switches you can touch, and no latency guarantees between AZs. The single biggest mistake I see is over-provisioning CIDR blocks. You think you need a /16 because "that's what on-prem used." You don't. AWS limits you to 5 VPCs per region by default. Start with a /20. You can always add secondary CIDRs. The real constraint isn't IP space; it's route table limits. Each route table holds 50 non-propagated routes by default, a quota you can raise on request. Design for that. Your future self, debugging a route propagation issue at 2 AM, will thank you.
Flow Logs Are Not Optional: Wiring a Firehose to Your Incident Response
Every time I audit a VPC that suffered a data breach, the first question is "did you have flow logs?" The answer is almost always no. Or: "they were on, but we never looked at them." Flow logs are your only source of truth for network traffic at the VPC level. CloudTrail tells you who made API calls. Flow logs tell you what packets actually moved. Set them up before you launch a single instance. Aggregation interval matters: the default 10 minutes is cheap but useless during an active attack. Use 1 minute for production subnets. Ship logs to S3, then query them with Athena or stream them to OpenSearch. Don't put them in CloudWatch Logs alone, because the query cost will bankrupt you. The real trick: tag each flow log with the subnet name and environment, and use a custom log format that includes the vpc-id and subnet-id fields. When you have an incident, you query "source IP came from where" and get the subnet in the record itself, not an account number you have to cross-reference.
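For a sense of what you get back during an incident, here is a minimal Python sketch that parses flow log records in the default (version 2) format and counts rejected traffic per source address. The sample records are invented, and a custom log format would need a different field list.

```python
from collections import Counter

# Field order of the default (version 2) flow log record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()

NUMERIC = ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end")


def parse_record(line):
    """Parse one space-separated flow log record into a dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for key in NUMERIC:
        if record[key] != "-":  # NODATA / SKIPDATA records use "-" placeholders
            record[key] = int(record[key])
    return record


def rejected_sources(lines):
    """Count REJECT records per source address, most frequent first."""
    records = (parse_record(line) for line in lines)
    counts = Counter(r["srcaddr"] for r in records if r["action"] == "REJECT")
    return counts.most_common()


# Invented sample records for illustration.
sample = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49762 3389 6 3 180 1418530070 1418530130 REJECT OK",
    "2 123456789010 eni-1235b8ca123456789 - - - - - - - 1431280876 1431280934 - NODATA",
]

print(rejected_sources(sample))
```

At real volumes you would run the equivalent query in Athena rather than in a script, but the record layout and the question being asked are the same.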
The Silent NAT Gateway Single Point of Failure
- NAT Gateway is AZ-specific — route traffic to the one in the same AZ as your resources.
- Cross-AZ NAT traffic is possible but adds latency and couples availability to a single AZ.
- For high availability, always provision one NAT Gateway per AZ that contains private resources needing outbound internet.
- Consider NAT Instance as a cost-effective alternative for dev environments, but accept the management overhead.
```bash
# Describe the NAT Gateways referenced by route tables that have a default route
aws ec2 describe-nat-gateways --nat-gateway-ids $(aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[?Routes[?DestinationCidrBlock==`0.0.0.0/0`]].Routes[?NatGatewayId].NatGatewayId' --output text)
# Show the deny entries in the NACL associated with a subnet
aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=subnet-xxxxx --query 'NetworkAcls[].Entries[?RuleAction==`deny`]' --output table
```
Key takeaways
Common mistakes to avoid
- Using a single NAT Gateway for all AZs
- Forgetting to allow outbound ephemeral ports in NACLs
- Relying on the main route table instead of custom explicit associations
- Choosing a VPC CIDR that overlaps with on-premises or other VPCs
- Not enabling DNS hostnames and DNS resolution
Interview Questions on This Topic
What is the difference between a Security Group and a Network ACL? When would you use each?