AWS VPC NAT Gateway — AZ Failure Modes and HA Routing

Single NAT Gateway took down all private subnets during AZ failure.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn
  • VPC is the root of your AWS network — CIDR choices are permanent, so plan ahead.
  • Subnets and route tables are the traffic engineers; missing routes drop packets silently.
  • NAT Gateways must be deployed per AZ to survive AZ failures.
Quick Answer
  • VPC is your logically isolated network in AWS — you control IP ranges, subnets, routing, and access.
  • Subnets slice the VPC CIDR block; each lives in one AZ and can be public or private.
  • Route tables direct traffic; one misconfigured entry silently black-holes packets.
  • NAT Gateway enables outbound internet from private subnets — costs ~$32/month plus data.
  • Security groups are stateful instance firewalls; NACLs are stateless subnet filter lists.
  • VPC Peering and Transit Gateway connect VPCs; TGW scales better for multi-VPC topologies.
🚨 START HERE

VPC Debug Commands and Quick Fixes

Run these commands and checks immediately when facing network issues in your VPC.

Outbound internet fails from private subnet

Immediate Action: SSH (via a bastion host if needed) into an instance in the affected private subnet and run `curl -v https://checkip.amazonaws.com` to test the outbound path.
Commands
aws ec2 describe-nat-gateways --nat-gateway-ids $(aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[?Routes[?DestinationCidrBlock==`0.0.0.0/0`]].Routes[?NatGatewayId].NatGatewayId' --output text)
aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=subnet-xxxxx --query 'NetworkAcls[].Entries[?RuleAction==`deny`]' --output table
Fix Now: If the NAT Gateway is missing, make sure an Internet Gateway is attached to the VPC, create a NAT Gateway in a public subnet of each AZ, and point each private route table's 0.0.0.0/0 route at the NAT Gateway in its own AZ. Immediate workaround: launch a NAT instance and route through it.

Cross-VPC connectivity broken

Immediate Action: Ping the private IP of an instance in the other VPC (if ICMP is allowed) to test layer 3 reachability.
Commands
aws ec2 describe-vpc-peering-connections --vpc-peering-connection-ids pcx-xxxxx --query 'VpcPeeringConnections[].Status.Code' --output text
aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[].Routes[?DestinationCidrBlock==`10.0.0.0/16`]' --output table
Fix Now: Add missing routes in both VPCs. Ensure security groups allow traffic from the peer VPC CIDR. If using Transit Gateway, check TGW route table associations.

Application latency spikes with VPC endpoints

Immediate Action: Check whether traffic is going through the NAT Gateway instead of the VPC Endpoint (look for an increase in bytes processed by the NAT Gateway).
Commands
aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=vpc-xxxxx --query 'VpcEndpoints[?ServiceName==`com.amazonaws.us-east-1.s3`].VpcEndpointId' --output text
aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[].Routes[?DestinationPrefixListId==`pl-xxxxx`]' --output table
Fix Now: Associate the private route tables with the S3 Gateway Endpoint so that a route to the S3 prefix list (pl-xxxxx) is added. That route is more specific than 0.0.0.0/0, so S3 traffic bypasses the NAT Gateway while the default route stays in place for other internet traffic.
Production Incident

The Silent NAT Gateway Single Point of Failure

A single NAT Gateway in one AZ took down an entire microservices fleet during an AZ failure.
Symptom: All instances in private subnets lost outbound internet access. Health checks failed, ECS tasks stopped pulling images, and the application tier reported broken external API dependencies.
Assumption: NAT Gateway is a managed service, so AWS handles availability. One NAT Gateway in any AZ should be enough.
Root cause: A NAT Gateway is deployed in a specific AZ. If that AZ goes down, all traffic routed through it is lost. The team had only one NAT Gateway shared across all private subnets in all AZs.
Fix: Deploy one NAT Gateway per AZ, and point the default route in each private subnet's route table at the NAT Gateway in the same AZ. This removes cross-AZ dependencies.
Key Lesson
  • NAT Gateway is AZ-specific: route traffic to the one in the same AZ as your resources.
  • Cross-AZ NAT traffic is possible but adds latency and couples availability to a single AZ.
  • For high availability, always provision one NAT Gateway per AZ that contains private resources needing outbound internet.
  • Consider NAT Instance as a cost-effective alternative for dev environments, but accept the management overhead.
Production Debug Guide

Quick diagnostic steps when traffic doesn't flow as expected

Cannot SSH into EC2 instance from the internet
Check security group inbound rules (allow port 22 from your IP). Check NACL inbound/outbound rules (allow ephemeral ports). Verify the subnet's route table has a route to an Internet Gateway (0.0.0.0/0 -> igw-xxx). Ensure the instance has a public IPv4 address or Elastic IP (for example via the subnet's auto-assign public IPv4 setting).

Private instance cannot download packages (yum/apt) from the internet
Verify NAT Gateway exists in the same AZ and is in 'Available' state. Check route table of the private subnet: 0.0.0.0/0 target must be the NAT Gateway (nat-xxx). Confirm NAT Gateway's Elastic IP is not blocked by destination. Test with curl -v https://awscli.amazonaws.com.

Cross-VPC communication fails (peering or Transit Gateway)
Ensure VPC peering connection is 'Active'. Verify route tables in both VPCs have routes to the peer CIDR via the peering ID (pcx-xxx). Check NACL and security groups allow traffic from the peer VPC CIDR. If using Transit Gateway, check TGW route table associations and propagation.

Application cannot reach an RDS database in the same VPC
Security group on RDS must allow inbound from the application's security group (by group ID). NACL must allow return traffic (ephemeral ports). Check that the RDS subnet group spans multiple AZs. Test connectivity with telnet rds-endpoint 3306 from the app instance.

Every production AWS workload lives inside a VPC, and networking mistakes are a common cause of outages, security breaches, and unexplained latency spikes in cloud infrastructure. Yet many engineers treat VPC config as a checkbox: pick the wizard defaults, click through, and move on. That works until it catastrophically doesn't. A misconfigured route table silently black-holes traffic. A security group rule that's too permissive exposes your RDS instance to the internet. A NAT gateway in the wrong AZ becomes a single point of failure that takes down your entire application tier at 2am on a Friday.

AWS VPC (Virtual Private Cloud) exists because the alternative — putting all your EC2 instances on a flat, shared network with every other AWS customer — is obviously untenable. VPC gives you a logically isolated section of the AWS cloud where you control IP addresses, subnets, routing, and access control completely. It's not just a network; it's the security and topology foundation everything else sits on. Get it right and your architecture is clean, scalable, and defensible. Get it wrong and you're debugging mysterious connection timeouts in production while your users are screaming.

By the end of this article you'll understand how VPC traffic actually flows end-to-end — from an internet request hitting your load balancer all the way to a database query and back — including exactly what each component does, why it exists, how the pieces interact at the packet level, and the specific production decisions that separate well-architected systems from ones that quietly accumulate technical debt and security risk.

VPC Fundamentals and CIDR Design

A VPC is a virtual network dedicated to your AWS account. It's logically isolated from other VPCs in the same region. When you create a VPC, you specify an IPv4 CIDR block — a private IP range (RFC 1918) like 10.0.0.0/16. That address space is yours. Every resource inside gets an IP from this range.

Choose your CIDR block carefully. It must not overlap with any other network you'll connect (on-premises, other VPCs). A /16, the largest block a VPC allows, gives 65,536 IPs, which is enough for most use cases. AWS reserves 5 IPs per subnet (the first 4 and the last 1), so plan for that loss: a /28 subnet, the smallest allowed, leaves only 11 usable IPs. The most common mistakes are picking a CIDR that is too small to grow into or one that overlaps with an existing on-premises range. You cannot change the primary VPC CIDR after creation. You can add secondary CIDR blocks later, but fixing a bad primary range means rebuilding the VPC.
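To make the reserved-address arithmetic concrete, here is a minimal sketch in plain bash. The function name `usable_ips` is made up for this illustration; the only rule it encodes is the 5 reserved addresses per subnet described above.

```shell
#!/usr/bin/env bash
# Usable addresses in an AWS subnet of a given prefix length.
# AWS reserves 5 addresses per subnet (the first four and the last one).
usable_ips() {
  local prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

# Print the usable capacity for a few common subnet sizes.
for prefix in 16 24 28; do
  echo "/$prefix -> $(usable_ips "$prefix") usable addresses"
done
```

Running it for /28 gives the 11 usable addresses mentioned above, which is a quick sanity check before you settle on subnet sizes.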

DNS behaviour inside the VPC is controlled by two VPC attributes: 'DNS resolution' (enableDnsSupport) and 'DNS hostnames' (enableDnsHostnames). Enable both for production VPCs. This lets you use private DNS names for EC2 instances, which makes internal service discovery clean.
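As a rough illustration of what those private DNS names look like, the sketch below builds the IP-based hostname for an instance. The helper name `private_dns_name` is hypothetical, and the naming pattern is an assumption: IP-based naming, with `ec2.internal` in us-east-1 and `<region>.compute.internal` elsewhere.

```shell
#!/usr/bin/env bash
# Build the IP-based private DNS hostname for an instance.
# Assumption: IP-based naming; us-east-1 uses ec2.internal,
# other regions use <region>.compute.internal.
private_dns_name() {
  local ip=$1 region=$2
  local label="ip-${ip//./-}"   # 10.0.1.5 -> ip-10-0-1-5
  if [ "$region" = "us-east-1" ]; then
    echo "${label}.ec2.internal"
  else
    echo "${label}.${region}.compute.internal"
  fi
}

private_dns_name 10.0.1.5 us-east-1
private_dns_name 10.2.3.4 eu-west-1
```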

create-vpc.sh · BASH
#!/bin/bash
# Create VPC with CIDR 10.0.0.0/16
VPC_ID=$(aws ec2 create-vpc \
  --cidr-block 10.0.0.0/16 \
  --amazon-provided-ipv6-cidr-block \
  --query 'Vpc.VpcId' --output text)

# Enable DNS hostnames
aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-hostnames "{\"Value\": true}"

# Enable DNS resolution
aws ec2 modify-vpc-attribute \
  --vpc-id $VPC_ID \
  --enable-dns-support "{\"Value\": true}"

echo "VPC created: $VPC_ID"
▶ Output
VPC created: vpc-0a1b2c3d
Mental Model
Why CIDR matters like a floor plan
A VPC CIDR is like the plot of land you buy — every building (subnet) sits inside it, and you can't change the plot boundaries later.
  • Choose a /16 or /20 for production — leaves room for growth without waste.
  • Avoid overlapping with on-premises IP ranges if you'll ever use VPN or Direct Connect.
  • AWS reserves 5 IPs per subnet — plan for that in capacity estimates.
  • You can add secondary CIDRs after creation, but the primary CIDR is forever.
  • Use predictable CIDRs per environment: 10.0.0.0/16 for dev, 10.1.0.0/16 for staging, 10.2.0.0/16 for prod.
📊 Production Insight
A /16 VPC gives 65,536 IPs — but subnets waste 5 each, and ELB/ENI consumption can surprise you.
Always reserve a secondary CIDR in your IP plan (e.g., a spare /16 that no other environment uses) for future expansion before you need it.
Rule: No VPC CIDR should overlap with any connected network — check before creation.
🎯 Key Takeaway
Choose your VPC CIDR once, carefully, and never overlap.
Primary CIDR is immutable; add secondary CIDRs proactively.
The 5 reserved IPs per subnet will catch you if you don't count them.
Choosing the Right VPC CIDR Block
If: Single application, no on-premises connectivity
Use: 10.0.0.0/16. Simple, room to grow.
If: Multiple environments with peering
Use: A distinct /16 per environment (10.x.0.0/16). No overlap.
If: Connecting to on-premises via VPN/Direct Connect
Use: A range that does not overlap with on-premises CIDRs. Choose something like 172.20.0.0/16.
If: Limited IP space available
Use: /20 or /22, but you will likely need secondary CIDRs later.
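The "no overlap" rule is easy to check mechanically before you create anything. The following is a minimal sketch in pure bash, not an AWS tool: the helper names (`ip_to_int`, `cidr_range`, `cidrs_overlap`, `check`) are made up for this example, and it assumes well-formed IPv4 CIDRs without leading zeros.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print "first last": the integer bounds of the addresses covered by a CIDR block.
cidr_range() {
  local ip=${1%/*} prefix=${1#*/}
  local size=$(( 1 << (32 - prefix) ))
  local base=$(( $(ip_to_int "$ip") & ~(size - 1) & 0xFFFFFFFF ))
  echo "$base $(( base + size - 1 ))"
}

# Succeed (exit 0) when the two CIDR blocks share at least one address.
cidrs_overlap() {
  local a_first a_last b_first b_last
  read -r a_first a_last <<< "$(cidr_range "$1")"
  read -r b_first b_last <<< "$(cidr_range "$2")"
  (( a_first <= b_last && b_first <= a_last ))
}

# Small reporting wrapper for the examples below.
check() {
  if cidrs_overlap "$1" "$2"; then
    echo "$1 vs $2: OVERLAP"
  else
    echo "$1 vs $2: ok"
  fi
}

check 10.0.0.0/16 10.0.128.0/20    # second block sits inside the first
check 10.0.0.0/16 10.1.0.0/16      # adjacent /16s, no shared addresses
check 172.20.0.0/16 10.0.0.0/8     # different RFC 1918 ranges
```

Two blocks overlap exactly when each one starts before the other ends, so the comparison in `cidrs_overlap` covers containment as well as partial overlap.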

Subnets and Route Tables in Production

Subnets divide your VPC IP range into smaller segments, each anchored to a single Availability Zone. This is how you achieve multi-AZ redundancy — deploy resources across subnets in different AZs. Subnets can be public (with a route to an Internet Gateway) or private (no direct internet access). The subnet's route table determines traffic flow. Every subnet must be associated with exactly one route table.

Route tables contain entries (routes) that specify where to send traffic based on destination. The most important route is the local route, which is added automatically for the VPC CIDR. For internet access, add 0.0.0.0/0 -> Internet Gateway (public subnet) or 0.0.0.0/0 -> NAT Gateway (private subnet). When several routes match a destination, the most specific (longest prefix) route wins. Misconfigured route tables are a frequent cause of network outages in VPCs. A missing route silently drops traffic.
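To see how a route table picks a target, here is a small simulation in bash. It is only a sketch of longest-prefix matching, not how AWS implements routing; the function names and the route targets (`nat-aaaa`, `pcx-bbbb`) are placeholders invented for the example.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# route_lookup DEST_IP: reads "CIDR TARGET" lines on stdin and prints the
# target of the most specific matching route, or "blackhole" if none match.
route_lookup() {
  local dest best_prefix=-1 best_target="blackhole"
  local cidr target prefix mask net
  dest=$(ip_to_int "$1")
  while read -r cidr target; do
    [ -n "$cidr" ] || continue
    prefix=${cidr#*/}
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    net=$(( $(ip_to_int "${cidr%/*}") & mask ))
    # Keep this route only if it matches and is more specific than the best so far.
    if (( (dest & mask) == net && prefix > best_prefix )); then
      best_prefix=$prefix
      best_target=$target
    fi
  done
  echo "$best_target"
}

# Hypothetical private-subnet route table.
PRIVATE_ROUTES="10.0.0.0/16 local
0.0.0.0/0 nat-aaaa
10.1.0.0/16 pcx-bbbb"

route_lookup 10.0.5.9    <<< "$PRIVATE_ROUTES"    # inside the VPC
route_lookup 10.1.2.3    <<< "$PRIVATE_ROUTES"    # peer VPC
route_lookup 52.95.110.1 <<< "$PRIVATE_ROUTES"    # internet destination
route_lookup 8.8.8.8     <<< "10.0.0.0/16 local"  # no default route at all
```

The last call shows the silent failure mode: with only the local route present, an internet destination matches nothing and the packet has nowhere to go.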

Production tip: Use explicit subnet associations. Avoid using the main route table for anything — it's a common source of accidents. Create custom route tables per tier (web, app, db) and associate them explicitly. For highly available architectures, create at least two subnets per function (one per AZ) and spread resources across them.

create-subnets.sh · BASH
#!/bin/bash
VPC_ID="vpc-0a1b2c3d"

# Create public subnets in two AZs
SUBNET_PUBLIC_A=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID --cidr-block 10.0.1.0/24 \
  --availability-zone us-east-1a \
  --query 'Subnet.SubnetId' --output text)

SUBNET_PUBLIC_B=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID --cidr-block 10.0.2.0/24 \
  --availability-zone us-east-1b \
  --query 'Subnet.SubnetId' --output text)

# Create Internet Gateway and attach
IGW_ID=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW_ID

# Create custom route table for public subnets and associate
RTB_PUBLIC=$(aws ec2 create-route-table --vpc-id $VPC_ID --query 'RouteTable.RouteTableId' --output text)
aws ec2 associate-route-table --route-table-id $RTB_PUBLIC --subnet-id $SUBNET_PUBLIC_A
aws ec2 associate-route-table --route-table-id $RTB_PUBLIC --subnet-id $SUBNET_PUBLIC_B

# Add default route to Internet Gateway
aws ec2 create-route --route-table-id $RTB_PUBLIC \
  --destination-cidr-block 0.0.0.0/0 --gateway-id $IGW_ID
echo "Public subnets ready"
▶ Output
Public subnets ready
⚠ Main Route Table Trap
The main route table is the default for any new subnet. If you don't explicitly associate a custom route table, the subnet is implicitly associated with the main one. This causes accidental exposure of private subnets to the internet (if the main table has a route to an IGW) or private subnets without outbound access (if the main table lacks a NAT Gateway route). Always keep the main route table as a 'black hole' (only the local route) and associate custom tables explicitly.
📊 Production Insight
A missing route in a private subnet's route table sends traffic nowhere — packets vanish without error.
Always use explicit subnet associations; the main route table should be a bare minimum (local route only).
Rule: For every subnet, check its route table for a default route to IGW (public) or NAT (private).
🎯 Key Takeaway
Subnets are AZ-scoped; route tables are VPC-scoped.
Explicit associations prevent routing leaks — never rely on the main route table.
Always verify route table content: one wrong entry can isolate an entire tier.
Route Table Design Decisions
If: Subnet contains public-facing load balancers or bastions
Use: A route table that has 0.0.0.0/0 -> Internet Gateway.
If: Subnet contains application or database servers
Use: A route table that has 0.0.0.0/0 -> NAT Gateway (same AZ).
If: Subnet used for internal services only (no internet needed)
Use: A route table that has only the local route. Consider VPC Endpoints for AWS services.
If: Subnet must talk to peered VPC
Use: A route to the peer VPC CIDR via the VPC Peering ID (pcx-xxx) in the route table.

NAT Gateways and Internet Connectivity for Private Subnets

NAT (Network Address Translation) Gateway allows instances in private subnets to initiate outbound traffic to the internet (e.g., for software updates, calling external APIs) while blocking connections initiated from the internet. It's a managed AWS service that scales automatically (AWS documents a 5 Gbps baseline that scales up to 100 Gbps). You pay per hour and per GB of data processed (about $0.045/hour + $0.045/GB in us-east-1). That adds up: at roughly 730 hours per month, a single NAT Gateway costs about $32/month before data processing charges.
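A quick way to budget for this is to do the arithmetic up front. The sketch below uses the rates quoted above and assumes 730 hours per month; the function name `nat_monthly_cost` is made up, and real prices vary by region, so treat the numbers as an estimate only.

```shell
#!/usr/bin/env bash
# Estimate monthly NAT Gateway cost in USD.
# Assumptions: $0.045/hour and $0.045/GB (rates quoted in the text),
# 730 hours per month. Integer math in thousandths of a dollar avoids
# floating-point rounding surprises.
nat_monthly_cost() {
  local gateways=$1 gb_processed=$2
  local hourly_millis=45 per_gb_millis=45 hours=730
  local total=$(( gateways * hourly_millis * hours + gb_processed * per_gb_millis ))
  printf '%d.%03d\n' $(( total / 1000 )) $(( total % 1000 ))
}

echo "1 gateway, no data:      \$$(nat_monthly_cost 1 0)"
echo "3 gateways, 500 GB data: \$$(nat_monthly_cost 3 500)"
```

The second line is the more realistic case for a three-AZ deployment: the hourly charge triples, while the data processing charge depends only on total traffic.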

The critical production rule: deploy one NAT Gateway per AZ. If you put a single NAT Gateway in us-east-1a and your private subnets in us-east-1b route through it, an outage in us-east-1a kills internet access for all those instances. The fix is to create a NAT Gateway in each AZ and route each private subnet's 0.0.0.0/0 traffic to the NAT Gateway in its own AZ. Cross-AZ NAT is technically possible but adds latency and defeats the purpose of multi-AZ HA.
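One way to catch this before an outage does is to audit which AZ each private subnet's NAT Gateway lives in. The sketch below works on a plain-text inventory rather than calling AWS; the function name `find_cross_az_nat`, the input format, and the subnet names are all hypothetical, and in practice you would build the inventory from your own subnet, route table, and NAT Gateway listings.

```shell
#!/usr/bin/env bash
# Reads "SUBNET_ID SUBNET_AZ NAT_AZ" lines on stdin and prints every subnet
# whose default route points at a NAT Gateway in a different AZ.
find_cross_az_nat() {
  local subnet subnet_az nat_az
  while read -r subnet subnet_az nat_az; do
    [ -n "$subnet" ] || continue
    if [ "$subnet_az" != "$nat_az" ]; then
      echo "$subnet ($subnet_az) routes via NAT in $nat_az"
    fi
  done
}

# Hypothetical inventory: one private subnet in us-east-1b still points
# at the NAT Gateway in us-east-1a.
find_cross_az_nat <<'EOF'
subnet-app-a us-east-1a us-east-1a
subnet-app-b us-east-1b us-east-1a
subnet-db-b  us-east-1b us-east-1b
EOF
```

Any line this prints is a subnet that would lose internet access if the other AZ failed.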

A NAT Gateway sends traffic from its Elastic IP (EIP), so any external allowlist or partner firewall must permit that IP. The NAT Gateway itself sits in a public subnet, which must have a route to an Internet Gateway. If you're cost-conscious and the workload is non-critical, consider a NAT Instance (an EC2 instance configured to perform NAT), which can be cheaper but requires patching and failover management.

create-nat-gateway.sh · BASH
#!/bin/bash
# Assumes public subnets exist in us-east-1a and us-east-1b, tagged Type=public,
# that each AZ's private route table is tagged Zone=<az>,
# and that an Internet Gateway is attached to the VPC.
VPC_ID="vpc-xxxxx"

for AZ in a b; do
  ZONE="us-east-1$AZ"

  # Allocate Elastic IP
  EIP_ALLOC=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)

  # Find the public subnet of this AZ
  PUBLIC_SUBNET=$(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=availability-zone,Values=$ZONE" "Name=tag:Type,Values=public" \
    --query 'Subnets[0].SubnetId' --output text)

  # Create NAT Gateway in that public subnet
  NAT_GW_ID=$(aws ec2 create-nat-gateway \
    --subnet-id "$PUBLIC_SUBNET" \
    --allocation-id "$EIP_ALLOC" \
    --query 'NatGateway.NatGatewayId' --output text)

  # Wait for NAT Gateway to become available
  aws ec2 wait nat-gateway-available --nat-gateway-ids "$NAT_GW_ID"

  # Get the private route table for this AZ's private subnets
  RTB_PRIV=$(aws ec2 describe-route-tables \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:Zone,Values=$ZONE" \
    --query 'RouteTables[0].RouteTableId' --output text)

  # Add default route to the same-AZ NAT Gateway
  aws ec2 create-route --route-table-id "$RTB_PRIV" \
    --destination-cidr-block 0.0.0.0/0 \
    --nat-gateway-id "$NAT_GW_ID"

  echo "NAT gateway for az $AZ: $NAT_GW_ID"
done
▶ Output
NAT gateway for az a: nat-1234567890abcdef0
NAT gateway for az b: nat-0fedcba9876543210
Mental Model
NAT Gateway = One-way mirror for private subnets
Think of a NAT Gateway as a one-way mirror in a security room — people inside can look out, but outsiders cannot look in.
  • Private subnet traffic to internet goes through NAT Gateway, which replaces the source IP with its Elastic IP.
  • Return traffic is forwarded back because the NAT Gateway tracked the connection (stateful).
  • Unsolicited inbound traffic from the internet cannot reach private subnets through the NAT Gateway, because there is no tracked connection for it to match.
  • This asymmetry is by design: private means no unsolicited inbound.
📊 Production Insight
One NAT Gateway per AZ prevents a single-AZ failure from taking down all internet access.
Cross-AZ NAT routes add latency (typically low single-digit milliseconds) and cross-AZ data transfer charges, and they violate the intended HA pattern.
NAT Gateway costs $32/month + data; estimate costs before committing — use NAT Instance for dev/test.
🎯 Key Takeaway
NAT Gateway is AZ-bound — deploy one per AZ with private resources.
Route each private subnet's default traffic to its zone's NAT Gateway.
Costs add up — budget $30-40/month per gateway before data transfer.

Security Groups vs Network ACLs: The Two Layers of Defense

Security Groups (SGs) and Network ACLs (NACLs) are both virtual firewalls, but they operate at different levels and have fundamentally different behaviours. Understanding the difference is critical to designing a secure VPC without introducing confusing behaviour.

Security Groups are stateful, instance-level firewalls. If you allow inbound traffic on port 443, the return traffic is automatically allowed regardless of outbound rules. They support allow rules only (no explicit deny). You attach SGs to ENIs (Elastic Network Interfaces) of EC2 instances, RDS, ELB, etc. Changes take effect immediately. This is what you should use for controlling access between application components — e.g., allow web tier to talk to app tier on port 8080.

Network ACLs are stateless, subnet-level firewalls. They have separate inbound and outbound rules — both must be explicitly allowed for traffic to flow. They support allow and deny rules, evaluated in order (lowest number first). NACLs are useful for defense-in-depth, e.g., blocking known bad IPs at the subnet boundary. But they're stateless — if you allow inbound HTTP (port 80), you must also allow outbound ephemeral ports (1024-65535) for the response.
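Rule ordering and the implicit deny are easier to reason about with a toy evaluator. The sketch below is a simplified model in bash that matches on port only and ignores protocol and CIDR; the function name `nacl_eval` and the rule format are invented for the illustration.

```shell
#!/usr/bin/env bash
# nacl_eval PORT: reads "RULE_NUMBER ACTION FROM_PORT TO_PORT" lines on stdin,
# evaluates them in ascending rule-number order, and prints the action of the
# first rule whose port range contains PORT. No match means the implicit deny.
nacl_eval() {
  local port=$1 rules num action from to
  rules=$(sort -n)   # lowest rule number is evaluated first
  while read -r num action from to; do
    [ -n "$num" ] || continue
    if (( port >= from && port <= to )); then
      echo "$action"
      return 0
    fi
  done <<< "$rules"
  echo "deny"
}

# Inbound: rule 100 denies SSH before rule 200 allows everything else.
INBOUND="200 allow 0 65535
100 deny 22 22"
nacl_eval 22  <<< "$INBOUND"
nacl_eval 443 <<< "$INBOUND"

# Outbound without an ephemeral range: a response leaving on port 50000
# matches no rule and hits the implicit deny.
OUTBOUND="100 allow 443 443"
nacl_eval 50000 <<< "$OUTBOUND"
```

The last case is the stateless trap in miniature: the inbound request was allowed, but nothing permits the reply.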

Production gotcha: When you allow ping (ICMP echo request) inbound, a security group automatically returns the reply. A NACL requires both inbound ICMP request and outbound ICMP reply rules — otherwise ping fails. This catches everyone at least once.

security-groups-vs-nacls.sh · BASH
#!/bin/bash
# Example: Allow web tier to access app tier on port 4000

# Security group for app instances
SG_APP_ID=$(aws ec2 create-security-group \
  --group-name app-sg --description "App tier SG" \
  --vpc-id vpc-xxxxx --query 'GroupId' --output text)

# Allow inbound from web security group (by reference)
SG_WEB_ID="sg-xxxxxxxx"
aws ec2 authorize-security-group-ingress \
  --group-id $SG_APP_ID \
  --protocol tcp --port 4000 \
  --source-group $SG_WEB_ID

# Network ACL for the private subnet (example: deny SSH from 0.0.0.0/0 at the subnet level).
# Note: a newly created custom NACL denies all traffic until allow rules are added,
# and it has no effect until it is associated with a subnet.
NACL_ID=$(aws ec2 create-network-acl \
  --vpc-id vpc-xxxxx --query 'NetworkAcl.NetworkAclId' --output text)

# Inbound rule: deny SSH (rule number 100; lower rule numbers are evaluated first)
aws ec2 create-network-acl-entry \
  --network-acl-id $NACL_ID --rule-number 100 \
  --protocol tcp --port-range From=22,To=22 \
  --cidr-block 0.0.0.0/0 --rule-action deny --ingress

# Outbound rule: allow TCP return traffic on ephemeral ports (stateless!)
aws ec2 create-network-acl-entry \
  --network-acl-id $NACL_ID --rule-number 100 \
  --protocol tcp --port-range From=1024,To=65535 \
  --cidr-block 0.0.0.0/0 --rule-action allow --egress

echo "SG created: $SG_APP_ID"
echo "NACL created: $NACL_ID"
▶ Output
SG created: sg-0123456789abcdef0
NACL created: acl-12345678
⚠ NACL Ephemeral Port Trap
A stateless NACL needs explicit outbound rules for return traffic. Common mistake: allow inbound SSH (port 22) but forget outbound ephemeral ports (1024-65535). The client's SYN reaches the instance, but the reply is dropped on the way out, so the TCP handshake never completes. Result: SSH 'connection timeout' with no error on the instance.
📊 Production Insight
Security groups are stateful — use them for application-level access between tiers.
NACLs are stateless and rule-order-based — use them for broad subnet-level filtering or IP blacklisting.
Rule: Never mix SG and NACL allow/deny without understanding the stateless nature; test with curl or telnet.
🎯 Key Takeaway
SGs are stateful and instance-level; NACLs are stateless and subnet-level.
Stateless means you must explicitly allow return traffic on ephemeral ports.
Default NACL allows all traffic — change it to explicit deny-all then allow minimum.
Security Layer Decision Guide
If: You need to allow traffic between specific instances or services
Use: Security groups with source/destination by group ID.
If: You need to block specific IPs at the subnet boundary
Use: NACL deny rules (lower numbered rules override higher).
If: You need to allow traffic to a load balancer from the internet
Use: The load balancer's security group with appropriate inbound rules.
If: You notice intermittent connection failures on ephemeral ports
Use: Check NACL outbound rules for missing ephemeral port ranges. Usually 1024-65535 covering TCP/UDP.

Advanced Connectivity: VPC Peering, Transit Gateway, and VPN

As your AWS footprint grows, you'll need to connect VPCs to each other and to on-premises networks. AWS offers three primary mechanisms: VPC Peering, Transit Gateway (TGW), and AWS VPN/Direct Connect. Each has different trade-offs for scale, cost, and operational overhead.

VPC Peering connects two VPCs (within same or different accounts/regions) via a 1:1 relationship. Traffic stays on the AWS backbone — no internet. Peering is not transitive; if VPC A is peered with B, and B with C, A cannot talk to C unless a direct peering exists. It's great for small-scale inter-VPC communication but becomes unwieldy beyond a handful of VPCs (n*(n-1)/2 connections).
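The n*(n-1)/2 growth is worth seeing with real numbers. A minimal sketch, with a made-up function name:

```shell
#!/usr/bin/env bash
# Number of peering connections needed for a full mesh of N VPCs: N*(N-1)/2.
peering_connections() {
  local n=$1
  echo $(( n * (n - 1) / 2 ))
}

for n in 3 5 10 20; do
  echo "$n VPCs -> $(peering_connections "$n") peering connections"
done
```

Each of those connections also needs routes in both VPCs, so the route table work grows at the same rate.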

Transit Gateway (TGW) is a hub-and-spoke router that connects up to thousands of VPCs and VPNs. It supports transitive routing — one attachment to TGW connects to all others (with route table controls). TGW simplifies network management at scale. You pay per attachment ($0.05/hour) and per GB processed. For large enterprises with many VPCs and hybrid connectivity, TGW is the standard.

AWS Site-to-Site VPN creates an IPsec tunnel between your VPC and on-premises network. It's often used as a backup to Direct Connect. A VPN connection goes through the Internet, so latency and bandwidth vary. Direct Connect provides dedicated private connectivity, but requires physical cross-connects and longer lead times.

Production consideration: Combine VPC Peering for high-bandwidth, low-latency needs between a small number of VPCs, and TGW for everything else. Use VPN as a cost-effective backup or for burst traffic that doesn't require SLA bandwidth.

create-vpc-peering.sh · BASH
#!/bin/bash
# Peer VPC (vpc-aaaaa) with another VPC (vpc-bbbbb) in the same account
export VPC_A="vpc-aaaaa"
export VPC_B="vpc-bbbbb"

# Request peering connection from VPC A
PEERING_ID=$(aws ec2 create-vpc-peering-connection \
  --vpc-id $VPC_A --peer-vpc-id $VPC_B \
  --query 'VpcPeeringConnection.VpcPeeringConnectionId' \
  --output text)

# Accept the peering request (same account here; for a cross-account peering, run this with the accepter account's credentials)
aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id $PEERING_ID

# Add routes in both VPCs
aws ec2 create-route --route-table-id rtb-11111 \
  --destination-cidr-block 10.1.0.0/16 \
  --vpc-peering-connection-id $PEERING_ID

aws ec2 create-route --route-table-id rtb-22222 \
  --destination-cidr-block 10.0.0.0/16 \
  --vpc-peering-connection-id $PEERING_ID

echo "Peering created: $PEERING_ID"
▶ Output
Peering created: pcx-1234567890abcdef0
Mental Model
Transit Gateway as a network router
Think of Transit Gateway as a central router — every VPC plugs into it, and the TGW decides where each packet goes based on route tables and propagation.
  • Each VPC attachment is like a network interface card on the router.
  • Route tables within TGW control which attachments can talk to each other.
  • Use separate route tables for production vs non-production attachments (isolation).
  • Propagation automatically populates routes from attachments into TGW route tables — reduces manual entries.
  • TGW supports multicast, which is not possible with VPC Peering.
📊 Production Insight
VPC Peering is non-transitive — each pair needs its own connection. At 10+ VPCs, manageability collapses.
Transit Gateway scales to hundreds of VPCs but costs $0.05/hour/attachment — budget for >$30/month per attachment.
VPN over the internet is cheap, but its latency and throughput vary; Direct Connect offers dedicated, higher bandwidth but needs physical provisioning with long lead times.
Rule: Prefer TGW for any greenfield multi-VPC architecture; use VPC Peering only for a few VPCs with high bandwidth needs.
🎯 Key Takeaway
VPC Peering = simple but non-transitive; Transit Gateway = scalable hub.
Always plan for future growth — TGW scales with you; VPC Peering does not.
On-premises connectivity: Direct Connect for production, VPN for backup or burst.
Connectivity Pattern Selection
If: Less than 5 VPCs, no on-premises
Use: VPC Peering. Simple, low cost, low latency.
If: 5+ VPCs or multiple accounts
Use: Transit Gateway. Centralised management and transitive routing.
If: Need to connect on-premises (primary)
Use: Direct Connect (high bandwidth, low latency) + VPN backup.
If: Temporary or low-bandwidth on-premises connectivity
Use: AWS Site-to-Site VPN. Quick setup, runs over internet.
If: Need multicast support
Use: Transit Gateway multicast is the only option (VPC Peering does not support multicast).
🗂 AWS VPC Connectivity Options
When to use VPC Peering, Transit Gateway, or VPN
| Feature | VPC Peering | Transit Gateway | AWS VPN |
| --- | --- | --- | --- |
| Transitive routing | No | Yes | Yes (with TGW) |
| Max connections | 125 per VPC | 1000s | Multiple tunnels per VPN |
| Latency | Low (AWS backbone) | Low (AWS backbone) | Medium (internet) |
| Bandwidth | Up to 10 Gbps (depends on instance) | Up to 50 Gbps per attachment | 1.25 Gbps per tunnel |
| Cost | No hourly fee (data transfer only) | $0.05/hour per attachment + data | $0.05/hour per connection + data |
| Management overhead | High for many VPCs (n*(n-1)/2) | Low (centralised) | Low (managed service) |
| Use case | Few VPCs, high bandwidth | Many VPCs, hybrid | Backup connectivity, dev/test |

🎯 Key Takeaways

  • VPC is the root of your AWS network — CIDR choices are permanent, so plan ahead.
  • Subnets and route tables are the traffic engineers; missing routes drop packets silently.
  • NAT Gateways must be deployed per AZ to survive AZ failures.
  • Security Groups are stateful and instance-level; NACLs are stateless and subnet-level — use both with understanding.
  • For multi-VPC connectivity, Transit Gateway beats VPC Peering beyond a handful of VPCs.
  • Always test connectivity at the packet level with tools like curl, telnet, and tcpdump.

⚠ Common Mistakes to Avoid

    Using a single NAT Gateway for all AZs
    Symptom

    Outbound internet from private subnets fails entirely when the NAT Gateway's AZ goes down.

    Fix

    Deploy one NAT Gateway per AZ and configure each private subnet route table to use the NAT Gateway in its own AZ.

    Forgetting to allow outbound ephemeral ports in NACLs
    Symptom

    Inbound traffic (e.g., HTTP from internet) reaches the instance, but responses are dropped. Intermittent timeouts.

    Fix

    Add an outbound NACL rule allowing TCP/UDP on ports 1024-65535 for the source CIDR. Stateless means both directions need rules.

    Relying on the main route table instead of custom explicit associations
    Symptom

    New subnets inadvertently get internet access or are left without NAT because the main table is incorrect.

    Fix

    Set the main route table to have only the local route (10.0.0.0/16 -> local). Create custom route tables for public and private subnets, and associate them explicitly.

    Choosing a VPC CIDR that overlaps with on-premises or other VPCs
    Symptom

    VPC Peering or VPN connection fails due to overlapping CIDRs. Traffic cannot be routed correctly.

    Fix

    Plan your CIDR allocation carefully before creating VPCs. Use a central IP address management (IPAM) tool, or at least maintain a spreadsheet of all CIDRs.

    Not enabling DNS hostnames and DNS resolution
    Symptom

    EC2 instances get private IPs but cannot resolve private DNS names internally. Service discovery fails.

    Fix

    Enable 'DNS hostnames' and 'DNS resolution' on the VPC. This allows instances to use private DNS names (e.g., ip-10-0-1-5.ec2.internal).

Interview Questions on This Topic

  • Q: What is the difference between a Security Group and a Network ACL? When would you use each? (Mid-level)
    Security Groups are stateful, instance-level firewalls with allow rules only. Changes take effect immediately. Use them to control traffic between application tiers (e.g., web to app). NACLs are stateless, subnet-level firewalls with allow and deny rules evaluated in numbered order. Use them for broad IP blocking at the subnet boundary or for defense-in-depth. Because NACLs are stateless, you must explicitly allow return traffic on ephemeral ports. A typical pattern: use SGs for most access control, and NACLs to deny known bad actors or to provide an additional layer.
  • Q: Explain the 'one NAT Gateway per AZ' rule. Why is a single NAT Gateway insufficient for high availability? (Senior)
    A NAT Gateway is deployed in a specific Availability Zone. If that AZ becomes unavailable, all traffic routed through that NAT Gateway is lost. If you have private subnets in multiple AZs all pointing to a single NAT Gateway, an AZ outage takes down internet access for all of them. The fix: deploy one NAT Gateway per AZ that contains private resources. Configure route tables in each private subnet to route 0.0.0.0/0 to the NAT Gateway in the same AZ. This ensures that an AZ failure only affects resources in that AZ, not the entire VPC.
  • Q: How does VPC Peering handle transitive routing? What are the limitations? (Mid-level)
    VPC Peering is non-transitive. If VPC A is peered with VPC B, and VPC B is peered with VPC C, traffic from A cannot reach C unless there is a direct peering connection between A and C. Each pair must be explicitly peered. This makes large-scale mesh architectures difficult — you end up with n*(n-1)/2 connections. Additionally, you cannot have overlapping CIDRs in peered VPCs. For transitive routing, use Transit Gateway (a hub-and-spoke model) or a third-party virtual appliance.
  • Q: You are troubleshooting a scenario where an EC2 instance in a private subnet cannot access the internet (e.g., yum update fails). Walk through your debugging steps. (Senior)
    1. Check the route table associated with the instance's subnet: 0.0.0.0/0 should point to a NAT Gateway (nat-xxx), not to an Internet Gateway. 2. Ensure the NAT Gateway is in the 'available' state and has an Elastic IP. 3. Check that the NAT Gateway sits in a public subnet whose route table has a route to an Internet Gateway. 4. Verify the security group's outbound rules allow the traffic (the default SG allows all outbound). 5. Check the subnet's NACL in both directions: outbound rules must allow the destination port (e.g., 443), and inbound rules must allow the ephemeral port range (1024-65535) for the response, because NACLs are stateless. 6. Test from the instance: curl -v https://checkip.amazonaws.com confirms reachability and shows the source IP (it should be the NAT Gateway's EIP). 7. If all else fails, check the NACL on the NAT Gateway's public subnet and whether the destination IP or port is blocked by an external firewall.
  • Q: What is a VPC Endpoint? When would you use a Gateway Endpoint vs an Interface Endpoint? (Senior)
    A VPC Endpoint allows you to privately connect to AWS services (like S3 or DynamoDB) without going through the public internet or a NAT Gateway. Gateway Endpoints are used for S3 and DynamoDB — they appear as a prefix list in your route table, are free, and scale seamlessly. Interface Endpoints (AWS PrivateLink) are used for many other services (SQS, SNS, Lambda, API Gateway) — they create an ENI in your subnet with a private IP. Interface Endpoints cost per hour and per GB processed. Use Gateway Endpoints for S3/DynamoDB (simpler, cheaper), and Interface Endpoints for other services when you need private connectivity without internet egress.
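The 'one NAT Gateway per AZ' answer above can be turned into a simple audit: flag every private subnet whose default route targets a NAT Gateway in a different AZ. The sketch below works on plain dictionaries; the subnet IDs, NAT Gateway IDs, and the `cross_az_nat_routes` helper are hypothetical, and in practice the inputs would be assembled from your own inventory or API output.

```python
def cross_az_nat_routes(subnets, nat_gateways, default_routes):
    """List private subnets whose 0.0.0.0/0 route uses a NAT Gateway in another AZ.

    subnets:        subnet id -> AZ name
    nat_gateways:   NAT Gateway id -> AZ name
    default_routes: subnet id -> NAT Gateway id used for 0.0.0.0/0
    """
    findings = []
    for subnet_id, nat_id in sorted(default_routes.items()):
        subnet_az = subnets[subnet_id]
        nat_az = nat_gateways[nat_id]
        if subnet_az != nat_az:
            findings.append((subnet_id, subnet_az, nat_id, nat_az))
    return findings


# Hypothetical layout: two private subnets sharing one NAT Gateway.
subnets = {"subnet-a": "us-east-1a", "subnet-b": "us-east-1b"}
nat_gateways = {"nat-a": "us-east-1a"}
default_routes = {"subnet-a": "nat-a", "subnet-b": "nat-a"}

for subnet_id, subnet_az, nat_id, nat_az in cross_az_nat_routes(
    subnets, nat_gateways, default_routes
):
    print(f"{subnet_id} ({subnet_az}) depends on {nat_id} in {nat_az}")
```

With this layout the audit reports `subnet-b`, which would lose outbound access if `us-east-1a` failed even though its own AZ is healthy.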

Frequently Asked Questions

What is an AWS VPC in simple terms?

A Virtual Private Cloud (VPC) is your own private section of the AWS cloud where you can launch resources in a virtual network that you define. You control IP addresses, subnets, route tables, and access controls — just like a traditional on-premises network, but virtualised and managed by AWS.

How many VPCs can I have per region?

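By default, AWS allows up to 5 VPCs per region. You can request a quota increase through the Service Quotas console or the AWS Support Center. Each VPC can have up to 200 subnets by default, and the default NAT Gateway quota is 5 per AZ; both are adjustable. A VPC can have only one Internet Gateway attached at a time.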

What is the difference between a public subnet and a private subnet?

A public subnet has a route to an Internet Gateway, meaning instances can be directly reachable from the internet (if they have public IPs). A private subnet has no direct route to the internet: instances reach the internet only through a NAT Gateway, and can reach supported AWS services privately through a VPC Endpoint. Private subnets are used for application and database tiers to keep them isolated from direct external access.
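The public/private distinction comes down to where the default route points. As a rough sketch, the function below classifies a subnet from its route entries; the dictionaries mirror the `DestinationCidrBlock`, `GatewayId`, and `NatGatewayId` fields of route table output, the `classify_subnet` helper and the IDs are hypothetical, and only the IPv4 default route is considered.

```python
def classify_subnet(routes):
    """Return 'public' if the IPv4 default route targets an Internet Gateway."""
    for route in routes:
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        targets_igw = route.get("GatewayId", "").startswith("igw-")
        if is_default and targets_igw:
            return "public"
    return "private"


# Hypothetical route tables for two subnets.
public_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0abc"},
]
private_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0def"},
]

print(classify_subnet(public_routes))   # public
print(classify_subnet(private_routes))  # private
```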

Can I change the CIDR block of a VPC after creation?

You cannot change the primary CIDR block of an existing VPC. However, you can add secondary CIDR blocks (by default a VPC can hold up to 5 IPv4 CIDR blocks, counting the primary), as long as they don't overlap with existing CIDRs or connected networks. If you need a different primary CIDR, you must create a new VPC and migrate your resources.

What is the purpose of a Network ACL?

A Network Access Control List (NACL) provides an additional layer of security at the subnet level. It's stateless, so you must explicitly allow traffic in both directions, including return traffic. NACLs support allow and deny rules, evaluated in ascending rule-number order; the first matching rule wins. They're useful for blocking specific IPs or protocols at the subnet boundary, complementing security groups, which are stateful and instance-level.
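To make the ordered evaluation concrete, here is a small simulation of how a numbered rule list resolves a packet: rules are checked in ascending number and the first match decides, with an implicit deny if nothing matches. It is a simplified sketch that ignores protocol and direction; the rule set and the `evaluate_nacl` helper are hypothetical.

```python
import ipaddress


def evaluate_nacl(rules, source_ip, port):
    """Return 'allow' or 'deny' using first-match on ascending rule numbers."""
    addr = ipaddress.ip_address(source_ip)
    for rule in sorted(rules, key=lambda r: r["number"]):
        low, high = rule["ports"]
        if addr in ipaddress.ip_network(rule["cidr"]) and low <= port <= high:
            return rule["action"]
    return "deny"  # nothing matched: the implicit final rule denies


# Hypothetical inbound rules: block one range, then allow HTTPS and return traffic.
inbound_rules = [
    {"number": 200, "action": "allow", "cidr": "0.0.0.0/0", "ports": (443, 443)},
    {"number": 100, "action": "deny", "cidr": "203.0.113.0/24", "ports": (0, 65535)},
    {"number": 300, "action": "allow", "cidr": "0.0.0.0/0", "ports": (1024, 65535)},
]

print(evaluate_nacl(inbound_rules, "203.0.113.7", 443))      # deny (rule 100 wins)
print(evaluate_nacl(inbound_rules, "198.51.100.10", 443))    # allow (rule 200)
print(evaluate_nacl(inbound_rules, "198.51.100.10", 22))     # deny (no match)
```

Because the deny rule carries the lowest number, it takes effect even though a later rule would allow the same port.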

When should I use VPC Peering vs Transit Gateway?

Use VPC Peering when you need to connect a small number of VPCs (2-5) with high-bandwidth requirements and you don't need transitive routing. Use Transit Gateway when you have more than a handful of VPCs, need to connect to on-premises networks, or want transitive routing capabilities. Transit Gateway centralises management but has an hourly cost per attachment, plus a charge for data processed.
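The scaling difference is easy to see with arithmetic: a full peering mesh needs n*(n-1)/2 connections, while a hub-and-spoke design needs roughly one attachment per VPC. The sketch below assumes a single Transit Gateway with exactly one attachment per VPC, which is a simplification; the function names are hypothetical.

```python
def peering_connections(vpc_count):
    """Connections needed for a full mesh of VPC peerings."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count):
    """Attachments needed when every VPC connects to one Transit Gateway (assumed)."""
    return vpc_count


for n in (3, 5, 10, 20):
    print(f"{n} VPCs: {peering_connections(n)} peerings vs {tgw_attachments(n)} attachments")
# 3 VPCs: 3 peerings vs 3 attachments
# 5 VPCs: 10 peerings vs 5 attachments
# 10 VPCs: 45 peerings vs 10 attachments
# 20 VPCs: 190 peerings vs 20 attachments
```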

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
