AWS VPC NAT Gateway — AZ Failure Modes and HA Routing
- VPC is your logically isolated network in AWS — you control IP ranges, subnets, routing, and access.
- Subnets slice the VPC CIDR block; each lives in one AZ and can be public or private.
- Route tables direct traffic; one misconfigured entry silently black-holes packets.
- NAT Gateway enables outbound internet from private subnets — costs ~$32/month plus data.
- Security groups are stateful instance firewalls; NACLs are stateless subnet filter lists.
- VPC Peering and Transit Gateway connect VPCs; TGW scales better for multi-VPC topologies.
VPC Debug Commands and Quick Fixes
Outbound internet fails from private subnet

    aws ec2 describe-nat-gateways --nat-gateway-ids $(aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[?Routes[?DestinationCidrBlock==`0.0.0.0/0`]].Routes[?NatGatewayId].NatGatewayId' --output text)
    aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=subnet-xxxxx --query 'NetworkAcls[].Entries[?RuleAction==`deny`]' --output table

Cross-VPC connectivity broken

    aws ec2 describe-vpc-peering-connections --vpc-peering-connection-ids pcx-xxxxx --query 'VpcPeeringConnections[].Status.Code' --output text
    aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[].Routes[?DestinationCidrBlock==`10.0.0.0/16`]' --output table

Application latency spikes with VPC endpoints

    aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=vpc-xxxxx --query 'VpcEndpoints[?ServiceName==`com.amazonaws.us-east-1.s3`].VpcEndpointId' --output text
    aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[].Routes[?DestinationPrefixListId==`pl-xxxxx`]' --output table

Every production AWS workload lives inside a VPC, and networking mistakes are among the most common causes of outages, security breaches, and unexplained latency spikes in cloud infrastructure. Yet most engineers treat VPC config as a checkbox: pick the wizard defaults, click through, and move on. That works until it catastrophically doesn't. A misconfigured route table silently black-holes traffic. A security group rule that's too permissive exposes your RDS instance to the internet. A NAT gateway in the wrong AZ becomes a single point of failure that takes down your entire application tier at 2am on a Friday.
AWS VPC (Virtual Private Cloud) exists because the alternative — putting all your EC2 instances on a flat, shared network with every other AWS customer — is obviously untenable. VPC gives you a logically isolated section of the AWS cloud where you control IP addresses, subnets, routing, and access control completely. It's not just a network; it's the security and topology foundation everything else sits on. Get it right and your architecture is clean, scalable, and defensible. Get it wrong and you're debugging mysterious connection timeouts in production while your users are screaming.
By the end of this article you'll understand how VPC traffic actually flows end-to-end — from an internet request hitting your load balancer all the way to a database query and back — including exactly what each component does, why it exists, how the pieces interact at the packet level, and the specific production decisions that separate well-architected systems from ones that quietly accumulate technical debt and security risk.
VPC Fundamentals and CIDR Design
A VPC is a virtual network dedicated to your AWS account. It's logically isolated from other VPCs in the same region. When you create a VPC, you specify an IPv4 CIDR block — a private IP range (RFC 1918) like 10.0.0.0/16. That address space is yours. Every resource inside gets an IP from this range.
Choose your CIDR block carefully. It must not overlap with any other network you'll connect (on-premises, other VPCs). A /16 gives 65,536 IPs, enough for most use cases. AWS reserves 5 IPs in every subnet (the first 4 and the last 1), so plan for that loss: a /28 subnet has 16 addresses but only 11 usable ones, so avoid it unless you're sure 11 is enough. The most common mistakes are picking a CIDR that's too large or one that overlaps with an existing on-premises range. You cannot change a VPC's primary CIDR after creation. You can add secondary CIDR blocks later, but replacing the primary one means rebuilding the VPC.
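To make the reserved-address arithmetic concrete, here is a minimal shell sketch. The helper name `usable_ips` is hypothetical (it is not an AWS tool); it only applies 2^(32 - prefix) - 5 to a subnet prefix length.

```bash
#!/bin/bash
# usable_ips PREFIX_LEN
# Prints the number of assignable IPv4 addresses in an AWS subnet:
# total addresses in the block minus the 5 that AWS reserves per subnet.
# (usable_ips is a hypothetical helper name used for illustration.)
usable_ips() {
  echo $(( (1 << (32 - $1)) - 5 ))
}

for prefix in 28 24 20; do
  echo "/$prefix subnet: $(usable_ips "$prefix") usable IPs"
done
```

Running this for a few prefix lengths before carving up a VPC is a quick way to sanity-check capacity estimates.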
Two VPC attributes control DNS inside the VPC: DNS resolution (enableDnsSupport) and DNS hostnames (enableDnsHostnames). Enable both for production VPCs. This lets you use private DNS names for EC2 instances, which makes internal service discovery clean.

    #!/bin/bash
    set -euo pipefail

    # Create a VPC with the IPv4 CIDR 10.0.0.0/16 plus an Amazon-provided IPv6 block
    VPC_ID=$(aws ec2 create-vpc \
      --cidr-block 10.0.0.0/16 \
      --amazon-provided-ipv6-cidr-block \
      --query 'Vpc.VpcId' --output text)

    # Enable DNS resolution (on by default; DNS hostnames depend on it)
    aws ec2 modify-vpc-attribute \
      --vpc-id "$VPC_ID" \
      --enable-dns-support "{\"Value\": true}"

    # Enable DNS hostnames (one attribute per modify-vpc-attribute call)
    aws ec2 modify-vpc-attribute \
      --vpc-id "$VPC_ID" \
      --enable-dns-hostnames "{\"Value\": true}"

    echo "VPC created: $VPC_ID"

- Choose a /16 or /20 for production — leaves room for growth without waste.
- Avoid overlapping with on-premises IP ranges if you'll ever use VPN or Direct Connect.
- AWS reserves 5 IPs per subnet — plan for that in capacity estimates.
- You can add secondary CIDRs after creation, but the primary CIDR is forever.
- Use predictable CIDRs per environment: 10.0.0.0/16 for dev, 10.1.0.0/16 for staging, 10.2.0.0/16 for prod.
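Overlap is easy to check before a CIDR is committed. The following is a sketch under stated assumptions: `ip_to_int` and `cidrs_overlap` are hypothetical helper names, and the check relies on the fact that two CIDR blocks overlap exactly when they agree on the bits covered by the shorter of the two prefixes.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; not part of the AWS CLI.

# ip_to_int A.B.C.D -> prints the address as a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# cidrs_overlap CIDR_A CIDR_B -> exit status 0 if the ranges share any address.
cidrs_overlap() (
  net_a=${1%/*}; len_a=${1#*/}
  net_b=${2%/*}; len_b=${2#*/}
  # Compare only the bits covered by the shorter prefix.
  len=$len_a
  if [ "$len_b" -lt "$len" ]; then len=$len_b; fi
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$net_a") & mask )) -eq $(( $(ip_to_int "$net_b") & mask )) ]
)

# Candidate VPC range vs. a range inside it, then vs. a neighbouring /16.
if cidrs_overlap 10.0.0.0/16 10.0.128.0/20; then echo "overlap"; else echo "ok"; fi
if cidrs_overlap 10.0.0.0/16 10.1.0.0/16; then echo "overlap"; else echo "ok"; fi
```

The same check can be run against every on-premises and peer-VPC range you know about before creating the VPC.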
Subnets and Route Tables in Production
Subnets divide your VPC IP range into smaller segments, each anchored to a single Availability Zone. This is how you achieve multi-AZ redundancy — deploy resources across subnets in different AZs. Subnets can be public (with a route to an Internet Gateway) or private (no direct internet access). The subnet's route table determines traffic flow. Every subnet must be associated with exactly one route table.
Route tables contain entries (routes) that specify where to send traffic based on destination. The most important route is the local route, added automatically for the VPC CIDR. For internet access, add 0.0.0.0/0 -> Internet Gateway (public subnet) or 0.0.0.0/0 -> NAT Gateway (private subnet). When several routes match a destination, the most specific one (longest prefix) wins. Misconfigured route tables are a leading cause of network outages in VPCs: when no route matches, traffic is dropped silently, with no error returned to the sender.
Production tip: Use explicit subnet associations. Avoid using the main route table for anything — it's a common source of accidents. Create custom route tables per tier (web, app, db) and associate them explicitly. For highly available architectures, create at least two subnets per function (one per AZ) and spread resources across them.

    #!/bin/bash
    VPC_ID="vpc-0a1b2c3d"

    # Create public subnets in two AZs
    SUBNET_PUBLIC_A=$(aws ec2 create-subnet \
      --vpc-id "$VPC_ID" --cidr-block 10.0.1.0/24 \
      --availability-zone us-east-1a \
      --query 'Subnet.SubnetId' --output text)
    SUBNET_PUBLIC_B=$(aws ec2 create-subnet \
      --vpc-id "$VPC_ID" --cidr-block 10.0.2.0/24 \
      --availability-zone us-east-1b \
      --query 'Subnet.SubnetId' --output text)

    # Create Internet Gateway and attach
    IGW_ID=$(aws ec2 create-internet-gateway \
      --query 'InternetGateway.InternetGatewayId' --output text)
    aws ec2 attach-internet-gateway --vpc-id "$VPC_ID" --internet-gateway-id "$IGW_ID"

    # Create custom route table for public subnets and associate
    RTB_PUBLIC=$(aws ec2 create-route-table --vpc-id "$VPC_ID" \
      --query 'RouteTable.RouteTableId' --output text)
    aws ec2 associate-route-table --route-table-id "$RTB_PUBLIC" --subnet-id "$SUBNET_PUBLIC_A"
    aws ec2 associate-route-table --route-table-id "$RTB_PUBLIC" --subnet-id "$SUBNET_PUBLIC_B"

    # Add default route to Internet Gateway
    aws ec2 create-route --route-table-id "$RTB_PUBLIC" \
      --destination-cidr-block 0.0.0.0/0 --gateway-id "$IGW_ID"

    echo "Public subnets ready"

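How a route table picks a target can be simulated locally. The sketch below assumes the usual rule that the most specific matching route (longest prefix) wins. The helper names and the `CIDR=target` argument format are hypothetical, chosen only for this illustration.

```bash
#!/bin/bash
# Hypothetical helpers for illustration; not part of the AWS CLI.

# ip_to_int A.B.C.D -> prints the address as a 32-bit integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# cidr_contains CIDR IP -> exit status 0 if IP falls inside CIDR.
cidr_contains() (
  net=${1%/*}; len=${1#*/}
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$net") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
)

# pick_route IP "CIDR=target"...
# Prints the target of the most specific matching route,
# or a marker showing the packet has nowhere to go.
pick_route() (
  ip=$1; shift
  best_len=-1
  best="no-route (dropped)"
  for entry in "$@"; do
    cidr=${entry%%=*}
    len=${cidr#*/}
    if cidr_contains "$cidr" "$ip" && [ "$len" -gt "$best_len" ]; then
      best_len=$len
      best=${entry#*=}
    fi
  done
  echo "$best"
)

# A private subnet's table: the local route plus a default route to a NAT Gateway.
pick_route 10.0.3.7 "10.0.0.0/16=local" "0.0.0.0/0=nat-gw"   # prints: local
pick_route 8.8.8.8  "10.0.0.0/16=local" "0.0.0.0/0=nat-gw"   # prints: nat-gw
# Same table without the default route: internet-bound traffic has no match.
pick_route 8.8.8.8  "10.0.0.0/16=local"                      # prints: no-route (dropped)
```

The last call is the silent-drop case: nothing is wrong with the instance, the route simply does not exist.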
NAT Gateways and Internet Connectivity for Private Subnets
NAT (Network Address Translation) Gateway allows instances in private subnets to initiate outbound traffic to the internet (e.g., for software updates, calling external APIs) while blocking inbound traffic from the internet. It's a managed AWS service that scales automatically (up to 100 Gbps, per current AWS documentation). You pay per hour and per GB of data processed (about $0.045/hour + $0.045/GB in us-east-1). That adds up: at roughly 730 hours a month, a single NAT Gateway costs roughly $32/month before any data is processed.
The critical production rule: deploy one NAT Gateway per AZ. If you put a single NAT Gateway in us-east-1a and your private subnets in us-east-1b route through it, an outage in us-east-1a kills internet access for all those instances. The fix is to create a NAT Gateway in each AZ and route each private subnet's 0.0.0.0/0 traffic to the NAT Gateway in its own AZ. Cross-AZ NAT is technically possible but adds latency and defeats the purpose of multi-AZ HA.
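The per-AZ rule is easy to audit once you know, for each private subnet, which AZ it lives in and which AZ hosts the NAT Gateway its default route points at. The sketch below assumes you have already collected that mapping (for example from `describe-subnets`, `describe-route-tables`, and `describe-nat-gateways`). The function name and the `subnet:subnet-az:nat-az` input format are hypothetical.

```bash
#!/bin/bash
# check_nat_alignment "subnet-id:subnet-az:nat-az"...
# Prints every subnet whose default route crosses into another AZ.
# Exit status is 0 only when all subnets use a NAT Gateway in their own AZ.
# (Hypothetical helper; the input triples are assumed to be gathered beforehand.)
check_nat_alignment() (
  status=0
  for entry in "$@"; do
    subnet=${entry%%:*}
    rest=${entry#*:}
    subnet_az=${rest%%:*}
    nat_az=${rest#*:}
    if [ "$subnet_az" != "$nat_az" ]; then
      echo "CROSS-AZ: $subnet in $subnet_az routes via NAT in $nat_az"
      status=1
    fi
  done
  exit "$status"
)

# subnet-b depends on us-east-1a staying healthy, which is the failure mode described above.
check_nat_alignment \
  "subnet-a:us-east-1a:us-east-1a" \
  "subnet-b:us-east-1b:us-east-1a" \
  || echo "At least one private subnet depends on another AZ's NAT Gateway"
```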
Each public NAT Gateway has an Elastic IP (EIP), and all outbound traffic appears to come from that address, so make sure any external allowlists or partner firewalls permit it. The NAT Gateway itself sits in a public subnet, whose route table must have a route to an Internet Gateway. If you're cost-conscious and the workload is non-critical, consider a NAT instance (an EC2 instance you run yourself as the NAT device), which can be cheaper but requires patching and failover management.

    #!/bin/bash
    # Assumes public subnets (tagged Type=public) exist in us-east-1a and us-east-1b,
    # an Internet Gateway is attached, and each AZ's private route table is tagged Zone=<az>.
    for AZ in a b; do
      # Allocate an Elastic IP for this AZ's NAT Gateway
      EIP_ALLOC=$(aws ec2 allocate-address --domain vpc \
        --query 'AllocationId' --output text)

      # Find the public subnet of this AZ (scoped to the VPC to avoid matching other VPCs)
      PUBLIC_SUBNET=$(aws ec2 describe-subnets \
        --filters "Name=vpc-id,Values=vpc-xxxxx" \
                  "Name=availability-zone,Values=us-east-1$AZ" \
                  "Name=tag:Type,Values=public" \
        --query 'Subnets[0].SubnetId' --output text)

      # Create NAT Gateway in the public subnet of this AZ
      NAT_GW_ID=$(aws ec2 create-nat-gateway \
        --subnet-id "$PUBLIC_SUBNET" \
        --allocation-id "$EIP_ALLOC" \
        --query 'NatGateway.NatGatewayId' --output text)

      # Wait for NAT Gateway to become available
      aws ec2 wait nat-gateway-available --nat-gateway-ids "$NAT_GW_ID"

      # Get the private route table for this AZ's private subnets
      RTB_PRIV=$(aws ec2 describe-route-tables \
        --filters "Name=vpc-id,Values=vpc-xxxxx" "Name=tag:Zone,Values=us-east-1$AZ" \
        --query 'RouteTables[0].RouteTableId' --output text)

      # Route this AZ's outbound traffic through its own NAT Gateway
      aws ec2 create-route --route-table-id "$RTB_PRIV" \
        --destination-cidr-block 0.0.0.0/0 \
        --nat-gateway-id "$NAT_GW_ID"

      echo "NAT gateway for az $AZ: $NAT_GW_ID"
    done

Example output (last line shown):

    NAT gateway for az b: nat-0fedcba9876543210

- Private subnet traffic to internet goes through NAT Gateway, which replaces the source IP with its Elastic IP.
- Return traffic is forwarded back because the NAT Gateway tracked the connection (stateful).
- Inbound traffic from internet to private subnets is impossible — no established connection.
- This asymmetry is by design: private means no unsolicited inbound.
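Since the per-AZ rule multiplies the hourly charge, it helps to estimate the bill up front. This is a rough sketch using the rates quoted above ($0.045/hour and $0.045/GB, assumed here to be us-east-1 pricing) and a 730-hour month. `nat_monthly_cost` is a hypothetical helper name, and the result is truncated to whole cents.

```bash
#!/bin/bash
# nat_monthly_cost GATEWAYS GB_PROCESSED
# Prints an estimated monthly cost in dollars.
# Works in tenths of a cent so plain integer arithmetic is enough:
# $0.045 = 45 tenths of a cent; 730 hours approximates one month.
# (Hypothetical helper; rates are the ones quoted in this article.)
nat_monthly_cost() (
  mills=$(( $1 * 45 * 730 + $2 * 45 ))
  printf '%d.%02d\n' $(( mills / 1000 )) $(( (mills % 1000) / 10 ))
)

echo "1 gateway, idle:       \$$(nat_monthly_cost 1 0)"
echo "2 gateways, 500 GB:    \$$(nat_monthly_cost 2 500)"
echo "3 gateways, 1000 GB:   \$$(nat_monthly_cost 3 1000)"
```

The first figure matches the roughly $32/month baseline; the others show how quickly the data processing charge overtakes the hourly fee.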
Security Groups vs Network ACLs: The Two Layers of Defense
Security Groups (SGs) and Network ACLs (NACLs) are both virtual firewalls, but they operate at different levels and have fundamentally different behaviours. Understanding the difference is critical to designing a secure VPC without introducing confusing behaviour.
Security Groups are stateful, instance-level firewalls. If you allow inbound traffic on port 443, the return traffic is automatically allowed regardless of outbound rules. They support allow rules only (no explicit deny). You attach SGs to ENIs (Elastic Network Interfaces) of EC2 instances, RDS, ELB, etc. Changes take effect immediately. This is what you should use for controlling access between application components — e.g., allow web tier to talk to app tier on port 8080.
Network ACLs are stateless, subnet-level firewalls. They have separate inbound and outbound rules — both must be explicitly allowed for traffic to flow. They support allow and deny rules, evaluated in order (lowest number first). NACLs are useful for defense-in-depth, e.g., blocking known bad IPs at the subnet boundary. But they're stateless — if you allow inbound HTTP (port 80), you must also allow outbound ephemeral ports (1024-65535) for the response.
Production gotcha: when you allow ping (ICMP echo request) inbound, a security group automatically allows the echo reply back out. A NACL needs both an inbound rule for the echo request and an outbound rule for the echo reply, otherwise ping fails. This catches everyone at least once.

    #!/bin/bash
    # Example: allow the web tier to reach the app tier on port 4000

    # Security group for app instances
    SG_APP_ID=$(aws ec2 create-security-group \
      --group-name app-sg --description "App tier SG" \
      --vpc-id vpc-xxxxx --query 'GroupId' --output text)

    # Allow inbound from the web security group (by reference)
    SG_WEB_ID="sg-xxxxxxxx"
    aws ec2 authorize-security-group-ingress \
      --group-id "$SG_APP_ID" \
      --protocol tcp --port 4000 \
      --source-group "$SG_WEB_ID"

    # Network ACL for the private subnet (example: deny SSH from 0.0.0.0/0 at the subnet level).
    # A custom NACL denies everything until allow rules are added, and it has no effect
    # until it is associated with a subnet.
    NACL_ID=$(aws ec2 create-network-acl \
      --vpc-id vpc-xxxxx --query 'NetworkAcl.NetworkAclId' --output text)

    # Inbound rule 100: deny SSH. Lower rule numbers are evaluated first,
    # so this takes precedence over any higher-numbered allow rule.
    aws ec2 create-network-acl-entry \
      --network-acl-id "$NACL_ID" --rule-number 100 \
      --protocol tcp --port-range From=22,To=22 \
      --cidr-block 0.0.0.0/0 --rule-action deny --ingress

    # Outbound rule 100: allow TCP return traffic on ephemeral ports (NACLs are stateless)
    aws ec2 create-network-acl-entry \
      --network-acl-id "$NACL_ID" --rule-number 100 \
      --protocol tcp --port-range From=1024,To=65535 \
      --cidr-block 0.0.0.0/0 --rule-action allow --egress

    echo "SG created: $SG_APP_ID"
    echo "NACL created: $NACL_ID"

Example output (last line shown):

    NACL created: acl-12345678

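NACL rule ordering can be checked with a small local simulation. The sketch below is deliberately simplified: it matches on destination port only and ignores protocol and source CIDR. It assumes the behaviour described above, where rules are evaluated from the lowest number upward, the first match decides, and anything unmatched hits the implicit deny. The function name and the `number:action:from-to` rule format are hypothetical.

```bash
#!/bin/bash
# nacl_eval PORT "number:action:from-to"...
# Prints the decision for PORT and the rule that made it.
# (Hypothetical helper; simplified to port matching only.)
nacl_eval() (
  port=$1; shift
  # Sort by rule number so the lowest-numbered matching rule is found first.
  printf '%s\n' "$@" | sort -t: -k1,1n | {
    while IFS=: read -r num action range; do
      [ -n "$range" ] || continue
      from=${range%-*}
      to=${range#*-}
      if [ "$port" -ge "$from" ] && [ "$port" -le "$to" ]; then
        echo "$action (rule $num)"
        exit 0
      fi
    done
    # No numbered rule matched: the implicit catch-all rule denies the packet.
    echo "deny (rule *)"
  }
)

# Rules listed out of order on purpose; evaluation order depends on the number only.
nacl_eval 22  "200:allow:0-65535" "100:deny:22-22"   # prints: deny (rule 100)
nacl_eval 443 "200:allow:0-65535" "100:deny:22-22"   # prints: allow (rule 200)
nacl_eval 443 "100:deny:22-22"                       # prints: deny (rule *)
```

The third call shows why a NACL containing only deny rules blocks everything: nothing ever reaches an allow.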
Advanced Connectivity: VPC Peering, Transit Gateway, and VPN
As your AWS footprint grows, you'll need to connect VPCs to each other and to on-premises networks. AWS offers three primary mechanisms: VPC Peering, Transit Gateway (TGW), and AWS VPN/Direct Connect. Each has different trade-offs for scale, cost, and operational overhead.
VPC Peering connects two VPCs (within same or different accounts/regions) via a 1:1 relationship. Traffic stays on the AWS backbone — no internet. Peering is not transitive; if VPC A is peered with B, and B with C, A cannot talk to C unless a direct peering exists. It's great for small-scale inter-VPC communication but becomes unwieldy beyond a handful of VPCs (n*(n-1)/2 connections).
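The full-mesh formula grows faster than intuition suggests. A minimal sketch, with `mesh_links` as a hypothetical helper name:

```bash
#!/bin/bash
# mesh_links N
# Prints how many peering connections a full mesh of N VPCs needs: N*(N-1)/2.
# (Hypothetical helper for illustration.)
mesh_links() {
  echo $(( $1 * ($1 - 1) / 2 ))
}

for n in 3 5 10 20; do
  echo "$n VPCs -> $(mesh_links "$n") peering connections"
done
```

Each of those connections also needs a route entry on both sides, which is where the operational overhead comes from.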
Transit Gateway (TGW) is a hub-and-spoke router that connects up to thousands of VPCs and VPNs. It supports transitive routing — one attachment to TGW connects to all others (with route table controls). TGW simplifies network management at scale. You pay per attachment ($0.05/hour) and per GB processed. For large enterprises with many VPCs and hybrid connectivity, TGW is the standard.
AWS Site-to-Site VPN creates an IPsec tunnel between your VPC and on-premises network. It's often used as a backup to Direct Connect. A VPN connection goes through the Internet, so latency and bandwidth vary. Direct Connect provides dedicated private connectivity, but requires physical cross-connects and longer lead times.
Production consideration: Combine VPC Peering for high-bandwidth, low-latency needs between a small number of VPCs, and TGW for everything else. Use VPN as a cost-effective backup or for burst traffic that doesn't require SLA bandwidth.

    #!/bin/bash
    # Peer VPC A (vpc-aaaaa) with VPC B (vpc-bbbbb) in the same account
    VPC_A="vpc-aaaaa"
    VPC_B="vpc-bbbbb"

    # Request peering connection from VPC A
    PEERING_ID=$(aws ec2 create-vpc-peering-connection \
      --vpc-id "$VPC_A" --peer-vpc-id "$VPC_B" \
      --query 'VpcPeeringConnection.VpcPeeringConnectionId' \
      --output text)

    # Accept the request as the owner of VPC B
    # (an explicit accept is required even when both VPCs are in the same account)
    aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id "$PEERING_ID"

    # Add routes in both VPCs. Assumes rtb-11111 belongs to VPC A (10.0.0.0/16)
    # and rtb-22222 belongs to VPC B (10.1.0.0/16); each side routes to the other's CIDR.
    aws ec2 create-route --route-table-id rtb-11111 \
      --destination-cidr-block 10.1.0.0/16 \
      --vpc-peering-connection-id "$PEERING_ID"
    aws ec2 create-route --route-table-id rtb-22222 \
      --destination-cidr-block 10.0.0.0/16 \
      --vpc-peering-connection-id "$PEERING_ID"

    echo "Peering created: $PEERING_ID"

- Each VPC attachment is like a network interface card on the router.
- Route tables within TGW control which attachments can talk to each other.
- Use separate route tables for production vs non-production attachments (isolation).
- Propagation automatically populates routes from attachments into TGW route tables — reduces manual entries.
- TGW supports multicast, which is not possible with VPC Peering.
| Feature | VPC Peering | Transit Gateway | AWS VPN |
|---|---|---|---|
| Transitive routing | No | Yes | Yes (with TGW) |
| Max connections | 125 per VPC | 1000s | Multiple tunnels per VPN |
| Latency | Low (AWS backbone) | Low (AWS backbone) | Medium (internet) |
| Bandwidth | No aggregate limit (bounded by instance bandwidth) | Up to 50 Gbps per attachment | 1.25 Gbps per tunnel |
| Cost | No hourly fee (data transfer only) | $0.05/hour per attachment + data | $0.05/hour per connection + data |
| Management overhead | High for many VPCs (n*(n-1)/2) | Low (centralised) | Low (managed service) |
| Use case | Few VPCs, high bandwidth | Many VPCs, hybrid | Backup connectivity, dev/test |
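The hourly column in the table translates into a fixed monthly floor for Transit Gateway. The sketch below uses the $0.05/hour per-attachment rate from the table and a 730-hour month. It leaves out data processing charges, since no per-GB rate is given here. `tgw_attachment_fee` is a hypothetical helper name.

```bash
#!/bin/bash
# tgw_attachment_fee ATTACHMENTS
# Prints the monthly hourly-fee total in dollars for that many TGW attachments.
# $0.05 = 50 tenths of a cent; 730 hours approximates one month.
# Data processing charges are NOT included.
# (Hypothetical helper; the rate is the one quoted in the table above.)
tgw_attachment_fee() (
  mills=$(( $1 * 50 * 730 ))
  printf '%d.%02d\n' $(( mills / 1000 )) $(( (mills % 1000) / 10 ))
)

for n in 1 5 10; do
  echo "$n attachments -> \$$(tgw_attachment_fee "$n") per month before data"
done
```

Peering has no equivalent fixed fee, which is why it stays attractive for a small number of VPCs.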
🎯 Key Takeaways
- VPC is the root of your AWS network — CIDR choices are permanent, so plan ahead.
- Subnets and route tables are the traffic engineers; missing routes drop packets silently.
- NAT Gateways must be deployed per AZ to survive AZ failures.
- Security Groups are stateful and instance-level; NACLs are stateless and subnet-level — use both with understanding.
- For multi-VPC connectivity, Transit Gateway beats VPC Peering beyond a handful of VPCs.
- Always test connectivity at the packet level with tools like curl, telnet, and tcpdump.
Interview Questions on This Topic
- Q: What is the difference between a Security Group and a Network ACL? When would you use each? (Mid-level)
- Q: Explain the 'one NAT Gateway per AZ' rule. Why is a single NAT Gateway insufficient for high availability? (Senior)
- Q: How does VPC Peering handle transitive routing? What are the limitations? (Mid-level)
- Q: You are troubleshooting a scenario where an EC2 instance in a private subnet cannot access the internet (e.g., yum update fails). Walk through your debugging steps. (Senior)
- Q: What is a VPC Endpoint? When would you use a Gateway Endpoint vs an Interface Endpoint? (Senior)
Frequently Asked Questions
What is an AWS VPC in simple terms?
A Virtual Private Cloud (VPC) is your own private section of the AWS cloud where you can launch resources in a virtual network that you define. You control IP addresses, subnets, route tables, and access controls — just like a traditional on-premises network, but virtualised and managed by AWS.
How many VPCs can I have per region?
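By default, AWS allows up to 5 VPCs per region. You can request an increase through Service Quotas or the AWS Support Center. Each VPC can have up to 200 subnets and one attached Internet Gateway (the default quota is 5 Internet Gateways per region, matching the VPC quota), and each AZ can have up to 5 NAT Gateways. These are default quotas, and most can be raised.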
What is the difference between a public subnet and a private subnet?
A public subnet has a route to an Internet Gateway, meaning instances can be directly reachable from the internet (if they have public IPs). A private subnet does not have a direct route to the internet: instances can only reach the internet through a NAT Gateway, and can reach supported AWS services privately through VPC Endpoints. Private subnets are used for application and database tiers to keep them isolated from direct external access.
Can I change the CIDR block of a VPC after creation?
You cannot change the primary CIDR block of an existing VPC. However, you can add secondary CIDR blocks to the same VPC (by default up to 5 IPv4 CIDR blocks per VPC in total, counting the primary), as long as they don't overlap with existing CIDRs or connected networks. If you need a different primary CIDR, you must create a new VPC and migrate your resources.
What is the purpose of a Network ACL?
A Network Access Control List (NACL) provides an additional layer of security at the subnet level. It's stateless, meaning you must explicitly allow both inbound and outbound traffic. NACLs support allow and deny rules, evaluated in order. They're useful for blocking specific IPs or protocols at the subnet boundary, complementing security groups which are stateful and instance-level.
When should I use VPC Peering vs Transit Gateway?
Use VPC Peering when you need to connect a small number of VPCs (2-5) with high-bandwidth requirements and you don't need transitive routing. Use Transit Gateway when you have more than a handful of VPCs, need to connect to on-premises networks, or want transitive routing capabilities. Transit Gateway centralises management but has an hourly cost per attachment.
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.