Senior 12 min · March 06, 2026

AWS VPC NAT Gateway — AZ Failure Modes and HA Routing

Single NAT Gateway took down all private subnets during AZ failure.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

Last updated: May 24, 2026
Quick Answer
  • VPC is your logically isolated network in AWS — you control IP ranges, subnets, routing, and access.
  • Subnets slice the VPC CIDR block; each lives in one AZ and can be public or private.
  • Route tables direct traffic; one misconfigured entry silently black-holes packets.
  • NAT Gateway enables outbound internet from private subnets — costs ~$32/month plus data.
  • Security groups are stateful instance firewalls; NACLs are stateless subnet filter lists.
  • VPC Peering and Transit Gateway connect VPCs; TGW scales better for multi-VPC topologies.
Definition
What is AWS VPC and Networking?

A NAT Gateway is a managed AWS service that enables outbound internet connectivity for instances in private subnets while preventing unsolicited inbound connections. It solves the problem of private resources needing to download updates, access APIs, or send telemetry without exposing them to the public internet.

NAT Gateways are deployed in a single Availability Zone (AZ) and are inherently a single-point-of-failure risk — if that AZ goes down, all private subnet traffic loses internet access unless you architect for multi-AZ redundancy with separate NAT Gateways per AZ and corresponding route table entries. In production VPC designs, you typically pair NAT Gateways with public subnets in each AZ, using route tables to direct 0.0.0.0/0 traffic from private subnets to the NAT Gateway in the same AZ.

Alternatives include NAT instances (self-managed EC2, cheaper but require HA configuration) and VPC endpoints for specific AWS services (more secure, no NAT needed). The key tradeoff is cost versus availability: at ~$0.045/hour per gateway plus data processing charges, running one per AZ adds up, but failing to do so means your private workloads lose internet during an AZ outage — a common oversight in production architectures.

Plain-English First

Think of an AWS VPC like building your own private office complex inside a giant shared skyscraper (AWS's data center). You get to decide which floors are public-facing (lobbies anyone can walk into) and which are private back-offices (only internal staff allowed). The hallways between floors are your route tables. The security desk at each door is a security group. And the master building directory that controls who even gets onto your floors from outside is your Network ACL. Your VPC is your building — fully yours — inside a building that belongs to everyone.

Every production AWS workload lives inside a VPC, and networking mistakes are a recurring cause of outages, security breaches, and unexplained latency spikes in cloud infrastructure. Yet most engineers treat VPC config as a checkbox: pick the wizard defaults, click through, and move on. That works until it catastrophically doesn't. A misconfigured route table silently black-holes traffic. A security group rule that's too permissive exposes your RDS instance to the internet. A NAT gateway in the wrong AZ becomes a single point of failure that takes down your entire application tier at 2am on a Friday.

AWS VPC (Virtual Private Cloud) exists because the alternative — putting all your EC2 instances on a flat, shared network with every other AWS customer — is obviously untenable. VPC gives you a logically isolated section of the AWS cloud where you control IP addresses, subnets, routing, and access control completely. It's not just a network; it's the security and topology foundation everything else sits on. Get it right and your architecture is clean, scalable, and defensible. Get it wrong and you're debugging mysterious connection timeouts in production while your users are screaming.

By the end of this article you'll understand how VPC traffic actually flows end-to-end — from an internet request hitting your load balancer all the way to a database query and back — including exactly what each component does, why it exists, how the pieces interact at the packet level, and the specific production decisions that separate well-architected systems from ones that quietly accumulate technical debt and security risk.

Why NAT Gateways Are a Single-Zone Risk

An AWS VPC NAT Gateway enables outbound internet connectivity for instances in private subnets while blocking inbound traffic. It sits in a public subnet, translates private IPs to its Elastic IP, and forwards responses back. This is a stateful, managed service — you don't patch or scale it, but you pay per hour and per GB processed.

Each NAT Gateway is deployed in one Availability Zone. If that AZ goes down, all private instances using that gateway lose outbound access. There is no automatic failover. The gateway's Elastic IP stays with the failed resource until you manually replace it. Traffic from other AZs routed through this gateway breaks as well, and even while the gateway is healthy, that cross-AZ path incurs additional data transfer cost and latency.

Use multiple NAT Gateways (one per AZ) for high availability in production. Route tables must be AZ-specific: each private subnet sends 0.0.0.0/0 traffic to the NAT Gateway in its own AZ. This eliminates a single point of failure and avoids cross-AZ data charges. For non-critical workloads, a single NAT Gateway with a failover script may suffice, but expect downtime during AZ outages.
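One way to check the per-AZ routing rule is to compare each private subnet's AZ with the AZ of the NAT Gateway its default route targets. The sketch below is illustrative only: the `find_cross_az_routes` helper and its three-column input (subnet ID, subnet AZ, NAT Gateway AZ) are assumptions made for this example, not an AWS CLI feature. In practice you would assemble that input from `describe-subnets`, `describe-route-tables`, and `describe-nat-gateways` output.

```bash
#!/bin/bash
# Hypothetical helper: reads "subnet-id subnet-az nat-az" lines on stdin and
# prints every subnet whose default route targets a NAT Gateway in another AZ.
find_cross_az_routes() {
  while read -r subnet_id subnet_az nat_az; do
    # Skip blank lines
    if [ -z "$subnet_id" ]; then
      continue
    fi
    if [ "$subnet_az" != "$nat_az" ]; then
      echo "CROSS-AZ: $subnet_id ($subnet_az) routes via NAT in $nat_az"
    fi
  done
}

# Example input with made-up IDs: the second subnet depends on us-east-1a
find_cross_az_routes <<'EOF'
subnet-aaa us-east-1a us-east-1a
subnet-bbb us-east-1b us-east-1a
EOF
```

Any line this prints marks a subnet that would lose outbound access if the other AZ failed, even though its own AZ is healthy.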

Single NAT Gateway = Single Point of Failure
A single NAT Gateway in one AZ will drop all outbound traffic if that AZ fails. Always deploy one per AZ for HA.
Production Insight
A team lost all outbound access for 45 minutes when us-east-1a suffered a power event — their single NAT Gateway was in that AZ.
Symptom: all private instances could not reach package repositories, S3 endpoints, or external APIs; health checks failed.
Rule: deploy one NAT Gateway per AZ used by private subnets, and route each subnet to its local gateway only.
Key Takeaway
NAT Gateways are zone-scoped — one per AZ is the minimum for HA.
Cross-AZ routing through a NAT Gateway adds cost and latency; avoid it.
Always test AZ failure by simulating a gateway outage in staging before production.
[Figure: AWS VPC NAT Gateway HA Routing. Single-zone risk and high-availability routing design. A NAT Gateway in AZ-A is a single point of failure for private subnets whose route table default route points to it; when AZ-A fails, the NAT Gateway becomes unreachable and private instances lose internet connectivity. The fix is a NAT Gateway in each AZ with separate route tables. Source: thecodeforge.io]

VPC Fundamentals and CIDR Design

A VPC is a virtual network dedicated to your AWS account. It's logically isolated from other VPCs in the same region. When you create a VPC, you specify an IPv4 CIDR block — a private IP range (RFC 1918) like 10.0.0.0/16. That address space is yours. Every resource inside gets an IP from this range.

Choose your CIDR block carefully. It must not overlap with any other network you'll connect (on-premises, other VPCs). A /16 gives 65,536 IPs, which is enough for most use cases. AWS reserves 5 IPs per subnet, so plan for that loss: a /28 subnet has 16 addresses but only 11 usable ones, so avoid it unless that is all you need. The most common mistake is picking a CIDR that's too large or that overlaps with an existing on-premises range. You cannot change the primary CIDR after creation. You can add secondary CIDRs, but fixing a badly chosen primary range means rebuilding the VPC.
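The reserved-address arithmetic is easy to sanity-check. A minimal sketch, assuming a helper named `usable_ips` (a name invented for this example) that takes a subnet prefix length:

```bash
#!/bin/bash
# Usable IPv4 addresses in a subnet of the given prefix length:
# 2^(32 - prefix) total addresses minus the 5 that AWS reserves per subnet.
usable_ips() {
  prefix=$1
  echo $(( (1 << (32 - prefix)) - 5 ))
}

for p in 28 24 20; do
  echo "/$p -> $(usable_ips "$p") usable IPs"
done
```

The reservation applies per subnet, so splitting the same address space into many small subnets loses proportionally more addresses than a few large ones.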

DNS behavior inside a VPC is controlled by two VPC attributes: 'DNS resolution' (enableDnsSupport) and 'DNS hostnames' (enableDnsHostnames). Enable both for production VPCs. This lets you use private DNS names for EC2 instances, which makes internal service discovery clean.

create-vpc.sh (Bash)
#!/bin/bash
# Create VPC with CIDR 10.0.0.0/16
VPC_ID=$(aws ec2 create-vpc \
  --cidr-block 10.0.0.0/16 \
  --amazon-provided-ipv6-cidr-block \
  --query 'Vpc.VpcId' --output text)

# Enable DNS resolution (on by default; DNS hostnames can only be enabled when it is on)
aws ec2 modify-vpc-attribute \
  --vpc-id "$VPC_ID" \
  --enable-dns-support '{"Value":true}'

# Enable DNS hostnames
aws ec2 modify-vpc-attribute \
  --vpc-id "$VPC_ID" \
  --enable-dns-hostnames '{"Value":true}'

echo "VPC created: $VPC_ID"
Output
VPC created: vpc-0a1b2c3d
Why CIDR matters like a floor plan
  • Choose a /16 or /20 for production — leaves room for growth without waste.
  • Avoid overlapping with on-premises IP ranges if you'll ever use VPN or Direct Connect.
  • AWS reserves 5 IPs per subnet — plan for that in capacity estimates.
  • You can add secondary CIDRs after creation, but the primary CIDR is forever.
  • Use predictable CIDRs per environment: 10.0.0.0/16 for dev, 10.1.0.0/16 for staging, 10.2.0.0/16 for prod.
Production Insight
A /16 VPC gives 65,536 IPs — but subnets waste 5 each, and ELB/ENI consumption can surprise you.
Set aside a secondary CIDR for future expansion before you need it, and pick a range that no other environment or connected network uses.
Rule: No VPC CIDR should overlap with any connected network — check before creation.
Key Takeaway
Choose your VPC CIDR once, carefully, and never overlap.
Primary CIDR is immutable; add secondary CIDRs proactively.
The 5 reserved IPs per subnet will catch you if you don't count them.
Choosing the Right VPC CIDR Block
If: Single application, no on-premises connectivity
Use: 10.0.0.0/16. Simple, room to grow.
If: Multiple environments with peering
Use: A distinct /16 per environment (10.x.0.0/16). No overlap.
If: Connecting to on-premises via VPN/Direct Connect
Use: Must not overlap with on-premises CIDRs. Choose something like 172.20.0.0/16.
If: Limited IP space available
Use: /20 or /22, but you will likely need secondary CIDRs later.
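The no-overlap rule can be checked mechanically before a VPC is created. The following is a sketch under stated assumptions: the function names (`ip_to_int`, `cidr_range`, `cidrs_overlap`) are invented for this example, it handles IPv4 only, and it does no input validation.

```bash
#!/bin/bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Print "start end" (as integers) for a CIDR block such as 10.0.0.0/16.
cidr_range() {
  ip=${1%/*}
  prefix=${1#*/}
  base=$(ip_to_int "$ip")
  size=$(( 1 << (32 - prefix) ))
  start=$(( base / size * size ))   # drop any host bits
  echo "$start $(( start + size - 1 ))"
}

# Exit status 0 when the two CIDR blocks share at least one address.
cidrs_overlap() {
  set -- $(cidr_range "$1") $(cidr_range "$2")
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

if cidrs_overlap 10.0.0.0/16 10.0.128.0/20; then
  echo "10.0.0.0/16 and 10.0.128.0/20 overlap"
fi
if ! cidrs_overlap 10.0.0.0/16 10.1.0.0/16; then
  echo "10.0.0.0/16 and 10.1.0.0/16 do not overlap"
fi
```

Running a check like this against every on-premises and peered range is cheap compared with rebuilding a VPC later.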

VPC Architecture and Component Relationship Visual

A VPC is more than a collection of isolated resources — it's a structured network with well-defined relationships between components. Understanding these relationships is essential for debugging and designing resilient architectures. Below is a high-level diagram that shows how the core VPC components connect and interact.

The VPC itself contains subnets, route tables, and security boundaries. Each subnet is tied to a single Availability Zone. Route tables control traffic between subnets, to the internet, and between connected networks. Internet Gateways (IGW) attach to the VPC and provide a path to the internet for public subnets. NAT Gateways sit in a public subnet and enable outbound internet for private subnets. VPC Endpoints provide private connectivity to AWS services. VPC Peering and Transit Gateway allow inter-VPC traffic.

Security Groups act as virtual firewalls attached to ENIs (Elastic Network Interfaces) of instances. Network ACLs provide an additional layer at the subnet boundary. Together they form a defense-in-depth strategy.

The following diagram captures the logical placement and traffic flows:

Production Insight
When debugging connectivity issues, start at the diagram level: trace the expected path from source to destination. A missing route, misattached IGW, or wrong NACL rule is visible in the relationship between components. Always verify that NAT Gateways, IGWs, and VPC Endpoints are in the correct subnets and that route tables explicitly associate with those subnets.
Key Takeaway
Visualise the VPC as a layered graph: subnets, route tables, gateways, and security controls. A broken link in that graph explains most connectivity failures.
[Diagram: VPC Component Relationships and Traffic Flow. Nodes: VPC, Public Subnet, Private Subnet, Route Table, Internet Gateway (inbound/outbound), NAT Gateway (outbound), EC2 with Public IP, EC2 without Public IP, VPC Endpoint, AWS Service, VPC Peering, Peer VPC, Transit Gateway, On-premises / Other VPCs, Security Group (instance-level), Network ACL (subnet-level).]

Subnets and Route Tables in Production

Subnets divide your VPC IP range into smaller segments, each anchored to a single Availability Zone. This is how you achieve multi-AZ redundancy — deploy resources across subnets in different AZs. Subnets can be public (with a route to an Internet Gateway) or private (no direct internet access). The subnet's route table determines traffic flow. Every subnet must be associated with exactly one route table.

Route tables contain entries (routes) that specify where to send traffic based on destination. The most important route is the local route, which is added automatically for the VPC CIDR. For internet access, add 0.0.0.0/0 -> Internet Gateway (public subnet) or 0.0.0.0/0 -> NAT Gateway (private subnet). Misconfigured route tables are among the most common causes of VPC connectivity outages. When no route matches a destination, the packet is dropped and the sender sees no error, only a timeout.

Production tip: Use explicit subnet associations. Avoid using the main route table for anything — it's a common source of accidents. Create custom route tables per tier (web, app, db) and associate them explicitly. For highly available architectures, create at least two subnets per function (one per AZ) and spread resources across them.
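The "two subnets per function, one per AZ" layout can be planned on paper before any subnet is created. A minimal sketch: the `plan_subnets` helper, the tier names, and the sequential /24 numbering inside 10.0.0.0/16 are all assumptions chosen for illustration.

```bash
#!/bin/bash
# Print one /24 per tier per AZ, carved sequentially from 10.0.0.0/16.
# Tier names and the AZ list are illustrative assumptions.
plan_subnets() {
  n=1
  for tier in public app db; do
    for az in us-east-1a us-east-1b; do
      echo "$tier $az 10.0.$n.0/24"
      n=$(( n + 1 ))
    done
  done
}

plan_subnets
```

Each printed line maps to one subnet and, for private tiers, to the route table of the matching AZ.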

create-subnets.sh (Bash)
#!/bin/bash
VPC_ID="vpc-0a1b2c3d"

# Create public subnets in two AZs
SUBNET_PUBLIC_A=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID --cidr-block 10.0.1.0/24 \
  --availability-zone us-east-1a \
  --query 'Subnet.SubnetId' --output text)

SUBNET_PUBLIC_B=$(aws ec2 create-subnet \
  --vpc-id $VPC_ID --cidr-block 10.0.2.0/24 \
  --availability-zone us-east-1b \
  --query 'Subnet.SubnetId' --output text)

# Create Internet Gateway and attach
IGW_ID=$(aws ec2 create-internet-gateway --query 'InternetGateway.InternetGatewayId' --output text)
aws ec2 attach-internet-gateway --vpc-id $VPC_ID --internet-gateway-id $IGW_ID

# Create custom route table for public subnets and associate
RTB_PUBLIC=$(aws ec2 create-route-table --vpc-id $VPC_ID --query 'RouteTable.RouteTableId' --output text)
aws ec2 associate-route-table --route-table-id $RTB_PUBLIC --subnet-id $SUBNET_PUBLIC_A
aws ec2 associate-route-table --route-table-id $RTB_PUBLIC --subnet-id $SUBNET_PUBLIC_B

# Add default route to Internet Gateway
aws ec2 create-route --route-table-id $RTB_PUBLIC \
  --destination-cidr-block 0.0.0.0/0 --gateway-id $IGW_ID
echo "Public subnets ready"
Output
Public subnets ready
Main Route Table Trap
The main route table is the default for any new subnet. If you don't explicitly associate a custom route table, the subnet inherits the main one. This causes accidental exposure of private subnets to the internet (if main route has an IGW) or private subnets without NAT (if main route lacks a NAT Gateway). Always set the main route table to be a 'black hole' (only local route) and associate custom tables explicitly.
Production Insight
A missing route in a private subnet's route table sends traffic nowhere — packets vanish without error.
Always use explicit subnet associations; the main route table should be a bare minimum (local route only).
Rule: For every subnet, check its route table for a default route to IGW (public) or NAT (private).
Key Takeaway
Subnets are AZ-scoped; route tables are VPC-scoped.
Explicit associations prevent routing leaks — never rely on the main route table.
Always verify route table content: one wrong entry can isolate an entire tier.
Route Table Design Decisions
If: Subnet contains public-facing load balancers or bastions
Use: Associate with a route table that has 0.0.0.0/0 -> Internet Gateway.
If: Subnet contains application or database servers
Use: Associate with a route table that has 0.0.0.0/0 -> NAT Gateway (same AZ).
If: Subnet used for internal services only (no internet needed)
Use: Associate with a route table that has only the local route. Consider VPC Endpoints for AWS services.
If: Subnet must talk to peered VPC
Use: Add route to peer VPC CIDR via VPC Peering ID (pcx-xxx) in the route table.

Public vs Private Subnet Connectivity Checklist

Misclassifying a subnet as public or private is one of the most common VPC misconfigurations. Use the following checklist to validate connectivity assumptions for each subnet type.

For a Public Subnet (instances reachable from the internet):
  • Route table has a default route (0.0.0.0/0) pointing to an Internet Gateway (igw-xxx).
  • The Internet Gateway is attached to the VPC and is in the 'available' state.
  • Auto-assign public IPv4 address is enabled at the subnet level (or the instance has an Elastic IP).
  • Security group inbound rules allow the desired traffic (e.g., port 22 for SSH, port 80/443 for web).
  • NACL inbound and outbound rules allow the necessary traffic (including ephemeral ports 1024-65535 for outbound responses).
  • Instance has a public IP assigned or an Elastic IP attached.

For a Private Subnet (instances cannot be directly reached from the internet):
  • Route table has a default route (0.0.0.0/0) pointing to a NAT Gateway (nat-xxx). Subnets that only need AWS services can rely on VPC Endpoints instead of a default route.
  • The NAT Gateway is in a public subnet (route to IGW), is in 'available' state, and has an Elastic IP.
  • Security group outbound rules allow traffic to the internet (e.g., all traffic to 0.0.0.0/0).
  • NACL inbound rules for the subnet allow return traffic on ephemeral ports (1024-65535) from the internet.
  • NACL outbound rules allow traffic to the internet (e.g., 0.0.0.0/0 on ports 80, 443, or ephemeral).
  • VPC Endpoints (Gateway or Interface) are used for AWS services instead of routing through NAT to reduce cost and latency.

Verification commands (Bash):
# Check subnet's route table
aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=subnet-123
# Check if subnet auto-assigns public IP
aws ec2 describe-subnets --subnet-ids subnet-123 --query 'Subnets[0].MapPublicIpOnLaunch'
# Verify NAT Gateway status
aws ec2 describe-nat-gateways --nat-gateway-ids nat-456

Production Insight
A common production mistake: a subnet is marked as 'public' but the route table lacks the IGW route, or the IGW is not attached. Always test connectivity by launching a test instance and attempting an outbound curl. For private subnets, verify that the NAT Gateway is in a public subnet and that the private subnet's route table points to the correct NAT Gateway in the same AZ.
Key Takeaway
Public vs private is determined solely by the route table's default route. Verify with aws ec2 describe-route-tables and test with curl from an instance.
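Since the classification hinges on the target of the 0.0.0.0/0 route, it can be expressed as a small lookup. This is a sketch only: `classify_subnet` is a hypothetical helper, and it assumes you have already extracted the default route's target ID (the GatewayId or NatGatewayId field in `describe-route-tables` output) or an empty string when no default route exists.

```bash
#!/bin/bash
# Classify a subnet from the target ID of its 0.0.0.0/0 route.
# Pass an empty string when the route table has no default route.
classify_subnet() {
  case "$1" in
    igw-*) echo "public" ;;
    nat-*) echo "private (outbound via NAT)" ;;
    "")    echo "isolated (local routes only)" ;;
    *)     echo "other default target: $1" ;;
  esac
}

# Made-up IDs for illustration
classify_subnet "igw-0abc"
classify_subnet "nat-0def"
classify_subnet ""
```

A subnet labelled "public" in its name tag but classified as anything else here is exactly the mismatch described above.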

NAT Gateways and Internet Connectivity for Private Subnets

NAT (Network Address Translation) Gateway allows instances in private subnets to initiate outbound traffic to the internet (e.g., for software updates, calling external APIs) while blocking inbound traffic from the internet. It's a managed AWS service that scales automatically: current AWS documentation lists 5 Gbps of baseline bandwidth, scaling up to 100 Gbps. You pay per hour and per GB of data processed (about $0.045/hour + $0.045/GB in us-east-1). That adds up: a single NAT Gateway costs roughly $32/month before data processing charges.

The critical production rule: deploy one NAT Gateway per AZ. If you put a single NAT Gateway in us-east-1a and your private subnets in us-east-1b route through it, an outage in us-east-1a kills internet access for all those instances. The fix is to create a NAT Gateway in each AZ and route each private subnet's 0.0.0.0/0 traffic to the NAT Gateway in its own AZ. Cross-AZ NAT is technically possible but adds latency and defeats the purpose of multi-AZ HA.

A NAT Gateway sends all outbound traffic from its Elastic IP (EIP), so any third-party allowlist or upstream firewall must permit that IP. With one gateway per AZ, that means one EIP per AZ to allowlist. The NAT Gateway itself sits in a public subnet, so that subnet must have a route to an Internet Gateway. If you're cost-conscious and the workload is non-critical, consider a NAT Instance (a custom EC2 AMI), which can be cheaper but requires patching and failover management.

create-nat-gateway.sh (Bash)
#!/bin/bash
# Assumes public subnets (tagged Type=public) exist in us-east-1a and us-east-1b,
# an Internet Gateway is attached, and each AZ's private route table is tagged Zone=<az>.
set -euo pipefail
VPC_ID="vpc-xxxxx"

for AZ in a b; do
  ZONE="us-east-1$AZ"

  # Allocate Elastic IP
  EIP_ALLOC=$(aws ec2 allocate-address --domain vpc --query 'AllocationId' --output text)

  # Find the public subnet in this AZ
  PUBLIC_SUBNET=$(aws ec2 describe-subnets \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=availability-zone,Values=$ZONE" "Name=tag:Type,Values=public" \
    --query 'Subnets[0].SubnetId' --output text)

  # Create NAT Gateway in that public subnet
  NAT_GW_ID=$(aws ec2 create-nat-gateway \
    --subnet-id "$PUBLIC_SUBNET" \
    --allocation-id "$EIP_ALLOC" \
    --query 'NatGateway.NatGatewayId' --output text)

  # Wait for NAT Gateway to become available
  aws ec2 wait nat-gateway-available --nat-gateway-ids "$NAT_GW_ID"

  # Get the private route table for this AZ's private subnets
  RTB_PRIV=$(aws ec2 describe-route-tables \
    --filters "Name=vpc-id,Values=$VPC_ID" "Name=tag:Zone,Values=$ZONE" \
    --query 'RouteTables[0].RouteTableId' --output text)

  # Add default route to the same-AZ NAT Gateway
  aws ec2 create-route --route-table-id "$RTB_PRIV" \
    --destination-cidr-block 0.0.0.0/0 \
    --nat-gateway-id "$NAT_GW_ID"

  echo "NAT gateway for az $AZ: $NAT_GW_ID"
done
Output
NAT gateway for az a: nat-1234567890abcdef0
NAT gateway for az b: nat-0fedcba9876543210
NAT Gateway = One-way mirror for private subnets
  • Private subnet traffic to internet goes through NAT Gateway, which replaces the source IP with its Elastic IP.
  • Return traffic is forwarded back because the NAT Gateway tracked the connection (stateful).
  • Unsolicited inbound traffic from the internet cannot reach private subnets: the NAT Gateway only forwards responses to connections that were initiated from inside.
  • This asymmetry is by design: private means no unsolicited inbound.
Production Insight
One NAT Gateway per AZ prevents a single-AZ failure from taking down all internet access.
Cross-AZ NAT routes add inter-AZ latency and cross-AZ data transfer charges, and they violate the intended HA pattern.
A NAT Gateway costs about $32/month plus data processing; estimate costs before committing, and consider a NAT Instance for dev/test.
Key Takeaway
NAT Gateway is AZ-bound — deploy one per AZ with private resources.
Route each private subnet's default traffic to its zone's NAT Gateway.
Costs add up — budget $30-40/month per gateway before data transfer.
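A quick estimate makes the per-AZ trade-off concrete. The sketch below uses the rates quoted in this article ($0.045 per gateway-hour and $0.045 per GB processed) and assumes a 730-hour month. The `nat_monthly_cost` helper is invented for this example, rates vary by region, and cross-AZ or internet data transfer charges are not included.

```bash
#!/bin/bash
# Rough monthly NAT Gateway cost in USD.
# Arguments: number of gateways, GB processed per month (all gateways combined).
nat_monthly_cost() {
  gateways=$1
  gb_processed=$2
  awk -v g="$gateways" -v gb="$gb_processed" \
    'BEGIN { printf "%.2f\n", g * 0.045 * 730 + gb * 0.045 }'
}

echo "1 gateway, idle:         \$$(nat_monthly_cost 1 0)"
echo "3 gateways, 1000 GB/mo:  \$$(nat_monthly_cost 3 1000)"
```

The hourly charge scales with the number of AZs, while the processing charge depends only on traffic volume, so going from one gateway to one per AZ raises the fixed part of the bill and leaves the per-GB part unchanged.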

Security Groups vs Network ACLs: The Two Layers of Defense

Security Groups (SGs) and Network ACLs (NACLs) are both virtual firewalls, but they operate at different levels and have fundamentally different behaviours. Understanding the difference is critical to designing a secure VPC without introducing confusing behaviour.

Security Groups are stateful, instance-level firewalls. If you allow inbound traffic on port 443, the return traffic is automatically allowed regardless of outbound rules. They support allow rules only (no explicit deny). You attach SGs to ENIs (Elastic Network Interfaces) of EC2 instances, RDS, ELB, etc. Changes take effect immediately. This is what you should use for controlling access between application components — e.g., allow web tier to talk to app tier on port 8080.

Network ACLs are stateless, subnet-level firewalls. They have separate inbound and outbound rules — both must be explicitly allowed for traffic to flow. They support allow and deny rules, evaluated in order (lowest number first). NACLs are useful for defense-in-depth, e.g., blocking known bad IPs at the subnet boundary. But they're stateless — if you allow inbound HTTP (port 80), you must also allow outbound ephemeral ports (1024-65535) for the response.

Production gotcha: When you allow ping (ICMP echo request) inbound, a security group automatically allows the echo reply back out. A NACL needs both an inbound rule for the ICMP echo request and an outbound rule for the ICMP echo reply; otherwise ping fails. This catches everyone at least once.

security-groups-vs-nacls.sh (Bash)
#!/bin/bash
# Example: Allow web tier to access app tier on port 4000

# Security group for app instances
SG_APP_ID=$(aws ec2 create-security-group \
  --group-name app-sg --description "App tier SG" \
  --vpc-id vpc-xxxxx --query 'GroupId' --output text)

# Allow inbound from web security group (by reference)
SG_WEB_ID="sg-xxxxxxxx"
aws ec2 authorize-security-group-ingress \
  --group-id $SG_APP_ID \
  --protocol tcp --port 4000 \
  --source-group $SG_WEB_ID

# Network ACL for the private subnet (example: deny SSH from 0.0.0.0/0 at the subnet level).
# Note: a newly created custom NACL denies all traffic until allow rules are added.
NACL_ID=$(aws ec2 create-network-acl \
  --vpc-id vpc-xxxxx --query 'NetworkAcl.NetworkAclId' --output text)

# Inbound rule: deny SSH (rule 100; lower rule numbers are evaluated first)
aws ec2 create-network-acl-entry \
  --network-acl-id "$NACL_ID" --rule-number 100 \
  --protocol tcp --port-range From=22,To=22 \
  --cidr-block 0.0.0.0/0 --rule-action deny --ingress

# Outbound rule: allow TCP responses on ephemeral ports
# (NACLs are stateless, so return traffic needs its own rule)
aws ec2 create-network-acl-entry \
  --network-acl-id "$NACL_ID" --rule-number 100 \
  --protocol tcp --port-range From=1024,To=65535 \
  --cidr-block 0.0.0.0/0 --rule-action allow --egress

echo "SG created: $SG_APP_ID"
echo "NACL created: $NACL_ID"
Output
SG created: sg-0x1234567890abcde
NACL created: acl-12345678
NACL Ephemeral Port Trap
A stateless NACL needs explicit outbound rules for return traffic. Common mistake: allow inbound SSH (port 22) but forget outbound ephemeral ports (1024-65535). The client's SYN reaches the instance, but the SYN-ACK is dropped on the way out, so the handshake never completes. Result: SSH 'connection timeout'.
Production Insight
Security groups are stateful — use them for application-level access between tiers.
NACLs are stateless and rule-order-based — use them for broad subnet-level filtering or IP blacklisting.
Rule: Never mix SG and NACL allow/deny without understanding the stateless nature; test with curl or telnet.
Key Takeaway
SGs are stateful and instance-level; NACLs are stateless and subnet-level.
Stateless means you must explicitly allow return traffic on ephemeral ports.
Default NACL allows all traffic — change it to explicit deny-all then allow minimum.
Security Layer Decision Guide
If: You need to allow traffic between specific instances or services
Use: Security groups with source/destination by group ID.
If: You need to block specific IPs at the subnet boundary
Use: NACL deny rules. Lower-numbered rules are evaluated first, so give the denies the lowest numbers.
If: You need to allow traffic to a load balancer from the internet
Use: The load balancer's security group with appropriate inbound rules.
If: You notice intermittent connection failures on ephemeral ports
Use: Check NACL outbound rules for missing ephemeral port ranges. Usually 1024-65535 covering TCP/UDP.
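The ordering behaviour of NACLs is easier to reason about with a toy model. The sketch below is not how AWS implements NACLs and ignores protocol, CIDR, and direction; it only demonstrates the first-match-wins rule and the implicit final deny. The `evaluate_nacl` helper and its "number action from_port to_port" rule format are assumptions made for this example.

```bash
#!/bin/bash
# Toy model of NACL evaluation for a single port.
# Rules arrive on stdin as "number action from_port to_port"; they are checked
# in ascending rule-number order, the first match wins, and anything
# unmatched is denied.
evaluate_nacl() {
  port=$1
  sort -n | {
    verdict="deny (default)"
    while read -r number action from to; do
      if [ "$port" -ge "$from" ] && [ "$port" -le "$to" ]; then
        verdict="$action (rule $number)"
        break
      fi
    done
    echo "$verdict"
  }
}

rules='200 allow 0 65535
100 deny 22 22'

echo "port 22:  $(echo "$rules" | evaluate_nacl 22)"
echo "port 443: $(echo "$rules" | evaluate_nacl 443)"
```

Swapping the two rule numbers would let SSH through, because the broad allow would match before the deny is ever reached.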

NACL vs Security Group Comparison Table

When designing your VPC security posture, you need to decide where to place each rule. The table below provides a side-by-side comparison of Security Groups and Network ACLs across key dimensions.

| Feature | Security Group | Network ACL |
| --- | --- | --- |
| Scope | Instance-level (ENI) | Subnet-level |
| Statefulness | Stateful (return traffic automatically allowed) | Stateless (return traffic must be explicitly allowed) |
| Rule types | Allow only | Allow and Deny |
| Rule evaluation | All rules evaluated (no order) | Rules evaluated in order (lowest number first) |
| Default rules | Inbound: deny all; Outbound: allow all | Default NACL: allow all inbound and outbound; a custom NACL denies all until rules are added |
| Supports source/destination by | CIDR, security group ID, prefix list | CIDR only |
| Number of rules | Up to 60 inbound + 60 outbound per SG | Up to 20 inbound + 20 outbound per NACL (before limit increase) |
| Applies to | EC2, ELB, RDS, Lambda (via VPC), etc. | All instances in the associated subnet |
| Changes | Apply immediately to attached instances | Apply immediately to subnet traffic |
| Use case | Fine-grained control between services | Broad network boundaries, IP blacklisting |

Use Security Groups as your primary access control mechanism — they are simpler, stateful, and more granular. Use NACLs as a secondary layer for defenses such as blocking known malicious IPs or preventing traffic to/from specific ports at the subnet boundary. Because NACLs are stateless, always verify that both inbound and outbound rules cover the necessary traffic, especially ephemeral ports.

Production Insight
In production, rely on Security Groups for most traffic control—they are stateful and easier to manage. Use NACLs only when you need to deny specific IPs or when a subnet-level rule is required (e.g., blocking SSH from the internet while allowing it from within the VPC). Remember that NACL order matters; place deny rules before allows to ensure they are evaluated first.
Key Takeaway
Security Groups are your primary firewall (stateful, instance-level). NACLs are your secondary filter (stateless, subnet-level). Understand the differences to avoid connectivity surprises.

Advanced Connectivity: VPC Peering, Transit Gateway, and VPN

As your AWS footprint grows, you'll need to connect VPCs to each other and to on-premises networks. AWS offers three primary mechanisms: VPC Peering, Transit Gateway (TGW), and AWS VPN/Direct Connect. Each has different trade-offs for scale, cost, and operational overhead.

VPC Peering connects two VPCs (within same or different accounts/regions) via a 1:1 relationship. Traffic stays on the AWS backbone — no internet. Peering is not transitive; if VPC A is peered with B, and B with C, A cannot talk to C unless a direct peering exists. It's great for small-scale inter-VPC communication but becomes unwieldy beyond a handful of VPCs (n*(n-1)/2 connections).

Transit Gateway (TGW) is a hub-and-spoke router that connects up to thousands of VPCs and VPNs. It supports transitive routing — one attachment to TGW connects to all others (with route table controls). TGW simplifies network management at scale. You pay per attachment ($0.05/hour) and per GB processed. For large enterprises with many VPCs and hybrid connectivity, TGW is the standard.

AWS Site-to-Site VPN creates an IPsec tunnel between your VPC and on-premises network. It's often used as a backup to Direct Connect. A VPN connection goes through the Internet, so latency and bandwidth vary. Direct Connect provides dedicated private connectivity, but requires physical cross-connects and longer lead times.

Production consideration: Combine VPC Peering for high-bandwidth, low-latency needs between a small number of VPCs, and TGW for everything else. Use VPN as a cost-effective backup or for burst traffic that doesn't require SLA bandwidth.

create-vpc-peering.sh (Bash)
#!/bin/bash
# Peer VPC (vpc-aaaaa) with another VPC (vpc-bbbbb) in the same account
export VPC_A="vpc-aaaaa"
export VPC_B="vpc-bbbbb"

# Request peering connection from VPC A
PEERING_ID=$(aws ec2 create-vpc-peering-connection \
  --vpc-id $VPC_A --peer-vpc-id $VPC_B \
  --query 'VpcPeeringConnection.VpcPeeringConnectionId' \
  --output text)

# Accept the request. This is required even when both VPCs are in the same account;
# for cross-account peering, run it with the accepter account's credentials.
aws ec2 accept-vpc-peering-connection --vpc-peering-connection-id "$PEERING_ID"

# Add routes in both VPCs (each side needs a route to the other's CIDR)
# VPC A's route table -> VPC B's CIDR
aws ec2 create-route --route-table-id rtb-11111 \
  --destination-cidr-block 10.1.0.0/16 \
  --vpc-peering-connection-id "$PEERING_ID"

# VPC B's route table -> VPC A's CIDR
aws ec2 create-route --route-table-id rtb-22222 \
  --destination-cidr-block 10.0.0.0/16 \
  --vpc-peering-connection-id "$PEERING_ID"

echo "Peering created: $PEERING_ID"
Output
Peering created: pcx-1234567890abcdef0
Transit Gateway as a network router
  • Each VPC attachment is like a network interface card on the router.
  • Route tables within TGW control which attachments can talk to each other.
  • Use separate route tables for production vs non-production attachments (isolation).
  • Propagation automatically populates routes from attachments into TGW route tables — reduces manual entries.
  • TGW supports multicast, which is not possible with VPC Peering.
Production Insight
VPC Peering is non-transitive: each pair needs its own connection. At 10+ VPCs, manageability collapses.
Transit Gateway scales to hundreds of VPCs but costs $0.05/hour/attachment, which is about $36.50/month per attachment (730 hours) before per-GB data processing charges.
VPN over the internet is cheap but its latency and throughput vary; Direct Connect offers dedicated ports from 1 Gbps to 100 Gbps (versus 1.25 Gbps per VPN tunnel) but typically takes weeks to months to provision.
Rule: Prefer TGW for any greenfield multi-VPC architecture; use VPC Peering only for a few VPCs with high bandwidth needs.
Key Takeaway
VPC Peering = simple but non-transitive; Transit Gateway = scalable hub.
Always plan for future growth — TGW scales with you; VPC Peering does not.
On-premises connectivity: Direct Connect for production, VPN for backup or burst.
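To make the scaling argument concrete, here is a minimal Python sketch that compares the number of peering connections a full mesh needs with the attachment cost of a Transit Gateway. The $0.05/hour rate is the figure quoted in this article; the 730-hour month and the function names are assumptions for illustration, and per-GB data processing is left out.

```python
HOURS_PER_MONTH = 730        # assumption: average hours in a month
TGW_ATTACHMENT_RATE = 0.05   # USD per attachment-hour, the rate quoted in this article


def full_mesh_peerings(vpc_count: int) -> int:
    """Peering connections needed so that every VPC can reach every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_monthly_attachment_cost(vpc_count: int) -> float:
    """Hourly attachment charges only; per-GB data processing is excluded."""
    return round(vpc_count * HOURS_PER_MONTH * TGW_ATTACHMENT_RATE, 2)


if __name__ == "__main__":
    for n in (3, 5, 10, 20):
        print(f"{n:>2} VPCs: {full_mesh_peerings(n):>3} peerings vs "
              f"{n} TGW attachments (~${tgw_monthly_attachment_cost(n):.2f}/month)")
```

The peering count grows quadratically while the attachment count grows linearly, which is why the break-even point tends to be operational effort rather than the hourly fee.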
Connectivity Pattern Selection
If: Fewer than 5 VPCs, no on-premises
Use: VPC Peering. Simple, low cost, low latency.
If: 5+ VPCs or multiple accounts
Use: Transit Gateway. Centralised management and transitive routing.
If: Need to connect on-premises (primary)
Use: Direct Connect (high bandwidth, low latency) + VPN backup.
If: Temporary or low-bandwidth on-premises connectivity
Use: AWS Site-to-Site VPN. Quick setup, runs over internet.
If: Need multicast support
Use: Transit Gateway. Its multicast support is the only option (VPC Peering does not support multicast).
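The selection rules above can be written down as a small function, which is handy if you want the choice documented in a design review or checked in tooling. This is only a sketch: the function name, the `on_prem` labels, and the order in which the rules are checked are assumptions, not something the list itself prescribes.

```python
def choose_connectivity(vpc_count: int, multi_account: bool = False,
                        on_prem: str = "none", needs_multicast: bool = False) -> str:
    """Return a connectivity pattern for the given requirements.

    on_prem is one of "none", "primary", "temporary" (hypothetical labels).
    Precedence (multicast, then on-premises, then VPC count) is an assumption.
    """
    if on_prem not in ("none", "primary", "temporary"):
        raise ValueError(f"unknown on_prem value: {on_prem!r}")
    if needs_multicast:
        return "Transit Gateway (multicast)"
    if on_prem == "primary":
        return "Direct Connect + VPN backup"
    if on_prem == "temporary":
        return "Site-to-Site VPN"
    if vpc_count >= 5 or multi_account:
        return "Transit Gateway"
    return "VPC Peering"


if __name__ == "__main__":
    print(choose_connectivity(3))
    print(choose_connectivity(8))
    print(choose_connectivity(3, on_prem="primary"))
```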

The VPC Is Not a Data Center: How Cloud Networking Breaks Your Assumptions

If you treat your VPC like a physical data center, you're going to get burned. I've seen it happen. A junior architect once told me "our VPC is just like the on-prem network." Six hours later, a misconfigured route table took down production. The VPC is a software-defined network. It has no cables, no switches you can touch, and no latency guarantees between AZs.

The single biggest mistake I see is over-provisioning CIDR blocks. You think you need a /16 because "that's what on-prem used." You don't. AWS limits you to 5 VPCs per Region by default (an adjustable quota). Start with a /20. The primary CIDR cannot be changed after creation, but you can add secondary CIDRs when you outgrow it. The real constraint isn't IP space, it's route table limits. Each route table holds 50 static routes by default. Design for that. Your future self, debugging a route propagation issue at 2 AM, will thank you.

vpc_cidr_design.tf (HCL)
// io.thecodeforge
// Don't oversize your VPC. Production patterns from incident postmortems.
variable "need_secondary_cidr" {
  description = "Set to true once the primary CIDR is close to exhaustion"
  type        = bool
  default     = false
}

resource "aws_vpc" "production" {
  cidr_block       = "10.0.0.0/20"         # 4,096 IPs. Enough. Grow with secondary CIDRs.
  instance_tenancy = "default"

  tags = {
    Name = "prod-vpc"
    // Mandatory: who to page when traffic drops
    PagerDuty = "sre-prod-net"
  }
}

# Secondary CIDR when you need it, don't pre-allocate
resource "aws_vpc_ipv4_cidr_block_association" "secondary" {
  count      = var.need_secondary_cidr ? 1 : 0
  vpc_id     = aws_vpc.production.id
  cidr_block = "10.1.0.0/20"
}
Output
aws_vpc.production: Creation complete after 2s [id=vpc-0a1b2c3d4e5f67890]
# Route table limits are now your bottleneck, not IP space.
Production Trap:
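Routes propagated from a virtual private gateway (VPN or Direct Connect), plus the static routes you add for Transit Gateway and peering destinations, fill a route table fast. Track the number of routes per table (for example, by counting the entries returned by aws ec2 describe-route-tables) against your quota. Default limit: 50 static routes per table. You can request more, but each route is a potential misconfiguration vector.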
Key Takeaway
Your VPC CIDR should be as small as realistically possible. /20 for production. /16 is a trap. Secondary CIDRs are your safety valve.
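A quick way to sanity-check a CIDR plan before writing any Terraform is Python's standard `ipaddress` module. The sketch below sizes the /20 from the example and carves it into /24 subnets; the /24 subnet size is an assumption chosen for illustration. AWS reserves five addresses in every subnet, so usable capacity is slightly lower than the raw block size.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four and the last address of each subnet

vpc = ipaddress.ip_network("10.0.0.0/20")
subnets = list(vpc.subnets(new_prefix=24))  # assumption: /24 per subnet


def usable_hosts(subnet: ipaddress.IPv4Network) -> int:
    """Addresses you can actually assign to ENIs in an AWS subnet."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    print(f"VPC {vpc}: {vpc.num_addresses} addresses")
    print(f"/24 subnets available: {len(subnets)}")
    print(f"Usable addresses per /24: {usable_hosts(subnets[0])}")
    print(f"First three subnets: {[str(s) for s in subnets[:3]]}")
```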

Flow Logs Are Not Optional: Wiring a Firehose to Your Incident Response

Every time I audit a VPC that suffered a data breach, the first question is "did you have flow logs?" The answer is always no. Or: "they were on, but we never looked at them." Flow logs are your only source of truth for network traffic at the VPC level. CloudTrail tells you who made API calls. Flow logs tell you which flows were accepted or rejected, and between which addresses and ports. Set them up before you launch a single instance.

Aggregation interval matters: the default 10 minutes is cheap but useless during an active attack. Use 1 minute for production subnets. Ship logs to S3, then query them with Athena or index them in OpenSearch. Don't put them in CloudWatch Logs alone; the ingestion and query costs add up fast at production volume. The real trick: make every record identify its subnet. Tags on the flow log resource (subnet name, environment) help you find the right log, but they do not appear in the records themselves, so add the subnet-id field through a custom log format. When you have an incident, you query "source IP came from where" and get the subnet in the result, not an account number you have to cross-reference.

vpc_flow_logs.tf (HCL)
// io.thecodeforge
// 1-minute aggregation. Production only. S3 + Athena, not CloudWatch cost trap.
resource "aws_flow_log" "production_subnets" {
  # Assumes aws_subnet.production is declared with for_each over these same keys.
  for_each = toset([
    "subnet-web-01",
    "subnet-app-01",
    "subnet-db-01"
  ])

  log_destination_type = "s3"  # the default is cloud-watch-logs; no IAM role is needed for S3
  log_destination      = aws_s3_bucket.flow_logs.arn
  traffic_type         = "ALL"

  # Set exactly one of vpc_id, subnet_id, or eni_id per flow log.
  subnet_id = aws_subnet.production[each.value].id

  max_aggregation_interval = 60  # 1 minute for security

  tags = {
    Environment = "production"
    SubnetName  = each.value
  }
}

# Athena table definition (run once)
# CREATE EXTERNAL TABLE vpc_flow_logs (...) LOCATION 's3://prod-vpc-flow-logs/AWSLogs/';
Output
aws_flow_log.production_subnets["subnet-web-01"]: Creation complete after 3s
# Query in Athena: SELECT * FROM vpc_flow_logs WHERE dstport = 22 AND action = 'REJECT' LIMIT 10;
Production Trap:
Flow logs do not capture traffic to the Amazon DNS server (169.254.169.253) or DHCP traffic. If your app relies on those for service discovery, flow logs will show nothing for that path. Use Route 53 Resolver query logging to see the DNS queries instead.
Key Takeaway
Production VPCs must have 1-minute flow logs to S3, queryable via Athena. No exceptions. This is your network black box recorder.
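When Athena is not set up yet, a few lines of Python are enough to triage a downloaded log file. The sketch below parses records in the default (version 2) flow log format and picks out rejected SSH attempts, mirroring the Athena query above. The sample records are made up, and header lines or custom log formats are not handled.

```python
# Field order of the default (version 2) VPC flow log format.
FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

# Made-up sample records for illustration only.
SAMPLE = [
    "2 123456789012 eni-0a1b2c3d4e5f67890 203.0.113.12 10.0.1.5 49152 22 6 3 180 1716500000 1716500060 REJECT OK",
    "2 123456789012 eni-0a1b2c3d4e5f67890 10.0.1.5 10.0.2.8 44321 5432 6 10 840 1716500000 1716500060 ACCEPT OK",
    "2 123456789012 eni-0a1b2c3d4e5f67890 198.51.100.7 10.0.1.5 50211 22 6 5 300 1716500000 1716500060 ACCEPT OK",
]


def parse_record(line: str) -> dict:
    """Split one space-separated record into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected_ssh(lines):
    """Yield records for rejected traffic to destination port 22."""
    for line in lines:
        record = parse_record(line)
        if record["dstport"] == "22" and record["action"] == "REJECT":
            yield record


if __name__ == "__main__":
    for rec in rejected_ssh(SAMPLE):
        print(f'{rec["srcaddr"]} -> {rec["dstaddr"]}:{rec["dstport"]} {rec["action"]}')
```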
● Production incident · POST-MORTEM · severity: high

The Silent NAT Gateway Single Point of Failure

Symptom
All instances in private subnets lost outbound internet access. Health checks failed, ECS tasks stopped pulling images, and the application tier reported broken external API dependencies.
Assumption
NAT Gateway is a managed service — AWS handles availability. One NAT Gateway in any AZ should be enough.
Root cause
NAT Gateway is deployed in a specific AZ. If that AZ goes down, all traffic routed through it is lost. The team had only one NAT Gateway shared across all private subnets in all AZs.
Fix
Deploy one NAT Gateway per AZ, and give each private subnet's route table a 0.0.0.0/0 route to the NAT Gateway in its own AZ. This removes the cross-AZ dependency: losing one AZ then only affects the private subnets in that AZ.
Key lesson
  • NAT Gateway is AZ-specific — route traffic to the one in the same AZ as your resources.
  • Cross-AZ NAT traffic is possible but adds latency and couples availability to a single AZ.
  • For high availability, always provision one NAT Gateway per AZ that contains private resources needing outbound internet.
  • Consider NAT Instance as a cost-effective alternative for dev environments, but accept the management overhead.
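The lesson can be turned into an audit check that flags private subnets whose default route points at a NAT Gateway in a different AZ. The sketch below works on plain dictionaries so it runs without AWS access; the data shapes, IDs, and function name are hypothetical, and in practice you would populate them from describe-subnets, describe-route-tables, and describe-nat-gateways.

```python
def find_nat_az_problems(subnets: dict, nat_gateways: dict) -> list:
    """Return (subnet_id, problem) tuples for subnets that break the per-AZ NAT rule.

    subnets:      {subnet_id: {"az": str, "default_route_nat": str or None}}
    nat_gateways: {nat_gateway_id: az}
    """
    problems = []
    for subnet_id, info in sorted(subnets.items()):
        nat_id = info.get("default_route_nat")
        if nat_id is None:
            problems.append((subnet_id, "no default route to a NAT Gateway"))
        elif nat_id not in nat_gateways:
            problems.append((subnet_id, f"default route targets unknown {nat_id}"))
        elif nat_gateways[nat_id] != info["az"]:
            problems.append(
                (subnet_id, f"routes to {nat_id} in {nat_gateways[nat_id]}, subnet is in {info['az']}"))
    return problems


# Hypothetical inventory reproducing the incident: one NAT Gateway shared by all AZs.
SUBNETS = {
    "subnet-app-a": {"az": "us-east-1a", "default_route_nat": "nat-aaa"},
    "subnet-app-b": {"az": "us-east-1b", "default_route_nat": "nat-aaa"},
    "subnet-db-c": {"az": "us-east-1c", "default_route_nat": None},
}
NAT_GATEWAYS = {"nat-aaa": "us-east-1a"}

if __name__ == "__main__":
    for subnet_id, problem in find_nat_az_problems(SUBNETS, NAT_GATEWAYS):
        print(f"{subnet_id}: {problem}")
```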
Production debug guide: quick diagnostic steps when traffic doesn't flow as expected (4 entries)
Symptom · 01
Cannot SSH into EC2 instance from the internet
Fix
Check security group inbound rules (allow port 22 from your IP). Check NACL rules: inbound must allow port 22 and outbound must allow the ephemeral ports used for replies. Verify the subnet's route table has a route to an Internet Gateway (0.0.0.0/0 -> igw-xxx), which is what makes the subnet public. Ensure the instance has a public IPv4 address or an Elastic IP.
Symptom · 02
Private instance cannot download packages (yum/apt) from the internet
Fix
Verify the NAT Gateway serving this subnet (ideally in the same AZ) is in the 'available' state and sits in a public subnet with a route to an Internet Gateway. Check the route table of the private subnet: the 0.0.0.0/0 target must be the NAT Gateway (nat-xxx). Confirm the destination is not blocking the NAT Gateway's Elastic IP. Test with curl -v https://awscli.amazonaws.com.
Symptom · 03
Cross-VPC communication fails (peering or Transit Gateway)
Fix
Ensure VPC peering connection is 'Active'. Verify route tables in both VPCs have routes to the peer CIDR via the peering ID (pcx-xxx). Check NACL and security groups allow traffic from the peer VPC CIDR. If using Transit Gateway, check TGW route table associations and propagation.
Symptom · 04
Application cannot reach an RDS database in the same VPC
Fix
Security group on RDS must allow inbound from the application's security group (by group ID). NACL must allow return traffic (ephemeral ports). Check that the RDS subnet group spans multiple AZs. Test connectivity with telnet rds-endpoint 3306 from the app instance.
★ VPC Debug Commands and Quick Fixes: run these commands and checks immediately when facing network issues in your VPC.
Outbound internet fails from private subnet
Immediate action
From an instance in the affected private subnet (reach it through a bastion host or Session Manager), run `curl -v https://checkip.amazonaws.com` to confirm the outbound path.
Commands
aws ec2 describe-nat-gateways --nat-gateway-ids $(aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[?Routes[?DestinationCidrBlock==`0.0.0.0/0`]].Routes[?NatGatewayId].NatGatewayId' --output text)
aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=subnet-xxxxx --query 'NetworkAcls[].Entries[?RuleAction==`deny`]' --output table
Fix now
If the NAT Gateway is missing, make sure the VPC has an Internet Gateway attached, create a NAT Gateway in a public subnet of each AZ, and point each private route table's 0.0.0.0/0 route at the NAT Gateway in its own AZ. Immediate workaround: launch a NAT instance (AMI) and route through it.
Cross-VPC connectivity broken
Immediate action
Ping the private IP of an instance in the other VPC (if ICMP is allowed) to test layer 3 reachability.
Commands
aws ec2 describe-vpc-peering-connections --vpc-peering-connection-ids pcx-xxxxx --query 'VpcPeeringConnections[].Status.Code' --output text
aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[].Routes[?DestinationCidrBlock==`10.0.0.0/16`]' --output table
Fix now
Add missing routes in both VPCs. Ensure security groups allow traffic from the peer VPC CIDR. If using Transit Gateway, check TGW route table associations.
Application latency spikes with VPC endpoints
Immediate action
Check if traffic is going through NAT Gateway instead of VPC Endpoint (increase in network bytes for NAT GW).
Commands
aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=vpc-xxxxx --query 'VpcEndpoints[?ServiceName==`com.amazonaws.us-east-1.s3`].VpcEndpointId' --output text
aws ec2 describe-route-tables --filters Name=vpc-id,Values=vpc-xxxxx --query 'RouteTables[].Routes[?DestinationPrefixListId==`pl-xxxxx`]' --output table
Fix now
Associate the private subnets' route tables with the S3 gateway endpoint so each one gets a route to the S3 prefix list (pl-xxxxx -> vpce-xxx). That route is more specific than 0.0.0.0/0, so S3 traffic stops going through the NAT Gateway. Keep the default route for all other outbound traffic.
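Route selection in a VPC route table uses longest-prefix match, so a more specific route takes precedence over 0.0.0.0/0. The sketch below models that behaviour with the standard `ipaddress` module. The route targets are placeholder IDs and the single CIDR standing in for the S3 prefix list is an assumption; a real prefix list contains many ranges.

```python
import ipaddress

# Hypothetical route table: (destination CIDR, target).
ROUTES = [
    ("10.0.0.0/20", "local"),
    ("0.0.0.0/0", "nat-xxxxx"),
    ("52.216.0.0/15", "vpce-xxxxx"),  # stand-in for one entry of the S3 prefix list
]


def select_route(routes, destination_ip: str):
    """Return the target of the most specific route that contains destination_ip."""
    dest = ipaddress.ip_address(destination_ip)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if dest in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None
    # Longest prefix (largest prefix length) wins.
    return max(matches)[1]


if __name__ == "__main__":
    for ip in ("10.0.1.5", "52.216.10.1", "8.8.8.8"):
        print(f"{ip:>12} -> {select_route(ROUTES, ip)}")
```

Traffic to the stand-in S3 range goes to the endpoint while everything else still follows the default route through the NAT Gateway.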
AWS VPC Connectivity Options

| Feature | VPC Peering | Transit Gateway | AWS VPN |
| --- | --- | --- | --- |
| Transitive routing | No | Yes | Yes (with TGW) |
| Max connections | Up to 125 per VPC (default quota 50) | 1000s | Multiple tunnels per VPN |
| Latency | Low (AWS backbone) | Low (AWS backbone) | Medium (internet) |
| Bandwidth | No peering-level cap (bounded by instance bandwidth) | Up to 50 Gbps per attachment | 1.25 Gbps per tunnel |
| Cost | No hourly fee (data transfer only) | $0.05/hour per attachment + data | $0.05/hour per connection + data |
| Management overhead | High for many VPCs (n*(n-1)/2 connections) | Low (centralised) | Low (managed service) |
| Use case | Few VPCs, high bandwidth | Many VPCs, hybrid | Backup connectivity, dev/test |

Key takeaways

1. VPC is the root of your AWS network. The primary CIDR cannot be changed after creation, so plan ahead.
2. Subnets and route tables are the traffic engineers; missing routes drop packets silently.
3. NAT Gateways must be deployed per AZ to survive AZ failures.
4. Security Groups are stateful and instance-level; NACLs are stateless and subnet-level. Use both with understanding.
5. For multi-VPC connectivity, Transit Gateway beats VPC Peering beyond a handful of VPCs.
6. Always test connectivity at the packet level with tools like curl, telnet, and tcpdump.

Common mistakes to avoid

5 patterns

Using a single NAT Gateway for all AZs

Symptom
Outbound internet from private subnets fails entirely when the NAT Gateway's AZ goes down.
Fix
Deploy one NAT Gateway per AZ and configure each private subnet route table to use the NAT Gateway in its own AZ.

Forgetting to allow outbound ephemeral ports in NACLs

Symptom
Inbound traffic (e.g., HTTP from internet) reaches the instance, but responses are dropped. Intermittent timeouts.
Fix
Add an outbound NACL rule allowing TCP/UDP on ports 1024-65535 to the client CIDR. Stateless means both directions need rules.

Relying on the main route table instead of custom explicit associations

Symptom
New subnets inadvertently get internet access or are left without NAT because the main table is incorrect.
Fix
Set the main route table to have only the local route (10.0.0.0/16 -> local). Create custom route tables for public and private subnets, and associate them explicitly.

Choosing a VPC CIDR that overlaps with on-premises or other VPCs

Symptom
VPC Peering or VPN connection fails due to overlapping CIDRs. Traffic cannot be routed correctly.
Fix
Plan your CIDR allocation carefully before creating VPCs. Use a central IP address management (IPAM) tool, or at least maintain a spreadsheet of all CIDRs.

Not enabling DNS hostnames and DNS resolution

Symptom
EC2 instances get private IPs but cannot resolve private DNS names internally. Service discovery fails.
Fix
Enable 'DNS hostnames' and 'DNS resolution' on the VPC. This allows instances to use private DNS names (e.g., ip-10-0-1-5.ec2.internal).
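For the overlapping-CIDR mistake in particular, the check is easy to automate before any VPC is created. The sketch below compares every pair of planned ranges with the standard `ipaddress` module; the network names and ranges are made-up examples, and a real inventory would come from your IPAM tool or spreadsheet.

```python
import ipaddress
from itertools import combinations


def find_overlaps(cidrs: dict) -> list:
    """Return sorted (name_a, name_b) pairs whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


# Hypothetical allocation plan.
PLAN = {
    "prod-vpc": "10.0.0.0/20",
    "dev-vpc": "10.0.8.0/21",     # sits inside prod-vpc's range
    "shared-vpc": "10.1.0.0/20",
    "on-prem": "192.168.0.0/16",
}

if __name__ == "__main__":
    for a, b in find_overlaps(PLAN):
        print(f"overlap: {a} ({PLAN[a]}) and {b} ({PLAN[b]})")
```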
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR: What is the difference between a Security Group and a Network ACL? When ...
Q02 · SENIOR: Explain the 'one NAT Gateway per AZ' rule. Why is a single NAT Gateway i...
Q03 · SENIOR: How does VPC Peering handle transitive routing? What are the limitations...
Q04 · SENIOR: You are troubleshooting a scenario where an EC2 instance in a private su...
Q05 · SENIOR: What is a VPC Endpoint? When would you use a Gateway Endpoint vs an Inte...

Q01 of 05 · SENIOR

What is the difference between a Security Group and a Network ACL? When would you use each?

ANSWER
Security Groups are stateful, instance-level firewalls with allow rules only. Changes take effect immediately. Use them to control traffic between application tiers (e.g., web to app). NACLs are stateless, subnet-level firewalls with allow and deny rules evaluated in numbered order. Use them for broad IP blocking at the subnet boundary or for defense-in-depth. Because NACLs are stateless, you must explicitly allow return traffic on ephemeral ports. A typical pattern: use SGs for most access control, and NACLs to deny known bad actors or to provide an additional layer.
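The stateless behaviour is easier to see in a small model. The sketch below evaluates NACL rules in numbered order with an implicit final deny, and shows an HTTPS request failing until the outbound side allows the client's ephemeral port. It is a simplification under stated assumptions: rules match on port only, and protocol and CIDR matching are left out.

```python
def evaluate_nacl(rules, port: int) -> str:
    """Return the action of the lowest-numbered rule matching port, else 'deny'.

    rules: list of (rule_number, port_low, port_high, action) tuples.
    """
    for _number, low, high, action in sorted(rules):
        if low <= port <= high:
            return action
    return "deny"  # the implicit '*' rule at the end of every NACL


def request_succeeds(inbound, outbound, server_port: int, client_port: int) -> bool:
    """A request works only if the inbound rules allow the server port AND
    the outbound rules allow the reply to the client's ephemeral port."""
    return (evaluate_nacl(inbound, server_port) == "allow"
            and evaluate_nacl(outbound, client_port) == "allow")


INBOUND = [(100, 443, 443, "allow")]
OUTBOUND_BROKEN = [(100, 443, 443, "allow")]                       # no ephemeral range
OUTBOUND_FIXED = OUTBOUND_BROKEN + [(110, 1024, 65535, "allow")]   # replies can leave

if __name__ == "__main__":
    print("broken:", request_succeeds(INBOUND, OUTBOUND_BROKEN, 443, 50000))
    print("fixed: ", request_succeeds(INBOUND, OUTBOUND_FIXED, 443, 50000))
```

A security group would need only the inbound rule, because it tracks the connection and lets the reply out automatically.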
FAQ · 6 QUESTIONS

Frequently Asked Questions

01. What is an AWS VPC in simple terms?
02. How many VPCs can I have per region?
03. What is the difference between a public subnet and a private subnet?
04. Can I change the CIDR block of a VPC after creation?
05. What is the purpose of a Network ACL?
06. When should I use VPC Peering vs Transit Gateway?