Mid-level 14 min · March 06, 2026

AWS EC2 Basics

EC2 Security Groups — SSH 0.0.0.0/0 to Miner in 24h

Q: What is the difference between stopping and terminating an EC2 instance?

Stopping an instance shuts it down but keeps the EBS volumes and instance metadata intact. You stop paying for compute but still pay for attached EBS storage. You can start the instance again later. Terminating deletes the instance and, by default, the root EBS volume. You cannot recover a terminated instance (unless you have a snapshot). Use stop for temporary pauses; use terminate when you're done permanently.

Q: How do I reduce EC2 costs?

1. Use Auto Scaling to match demand: don't run 10 instances when 3 will do.
2. Purchase Compute Savings Plans (or Reserved Instances) for steady workloads.
3. Use Spot Instances for stateless, fault-tolerant work.
4. Tag all resources and review usage weekly in AWS Cost Explorer.
5. Schedule non-production instances to stop overnight and on weekends.
6. Right-size after profiling: downgrade over-provisioned instances.
7. Migrate gp2 EBS volumes to gp3, which is priced about 20% lower per GB.

Q: What is a security group, and how is it different from a network ACL?

A security group (SG) is a stateful virtual firewall attached to an EC2 instance (or ENI). It only has allow rules; traffic not explicitly allowed is denied. Stateful means response traffic is automatically allowed regardless of outbound rules. A network ACL (NACL) is a stateless firewall attached to a subnet. It has both allow and deny rules (evaluated in order). Response traffic must be explicitly allowed. NACLs are used for broad subnet-level control; SGs are for fine-grained instance-level control.

Q: What is the difference between an AMI and a snapshot?

An Amazon Machine Image (AMI) is a pre-configured template that contains the OS, software, and configuration needed to launch an instance. You can create custom AMIs from existing instances. A snapshot is a point-in-time backup of an EBS volume. AMIs are built from snapshots (of the root volume plus any attached volumes). Use AMIs to launch new instances quickly; use snapshots for backup and disaster recovery.

Q: How do I increase the disk space on an EC2 instance?

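1. Increase the EBS volume size: `aws ec2 modify-volume --volume-id <volume-id> --size <new-size-in-GiB>`.
2. Wait for the modification to reach the 'optimizing' state (a few seconds to minutes); the new capacity is usable from that point.
3. If the filesystem sits on a partition, grow the partition first: `sudo growpart /dev/xvda 1`. Device names differ on Nitro instances (for example `/dev/nvme0n1p1`), so check with `lsblk`.
4. Extend the filesystem:
   - For ext4 (Linux): `sudo resize2fs /dev/xvda1`
   - For xfs: `sudo xfs_growfs /`
5. On current-generation instances, root and data volumes can both be resized online. If the root filesystem is completely full, `growpart` may fail for lack of scratch space, so free a little space first.
6. Set a CloudWatch alarm on disk usage (this needs the CloudWatch agent) to catch the next full disk before it freezes the instance.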
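The resize steps can be sketched as a dry-run helper that only prints the commands, so the order is visible before anything touches a real volume. `resize_plan` is a hypothetical name, and the sketch assumes Xen-style device naming (`/dev/xvda` plus a partition number) and a root filesystem mounted at `/`; Nitro instances use names such as `/dev/nvme0n1p1`, which this helper does not handle.

```shell
#!/bin/bash
# Hypothetical dry-run helper: prints (does not execute) the commands for
# growing an EBS volume, its partition, and its filesystem.
# Assumes Xen-style naming, where partition 1 of /dev/xvda is /dev/xvda1.
resize_plan() {
  local volume_id="$1" new_size_gib="$2" device="$3" partition="$4" fstype="$5"

  echo "aws ec2 modify-volume --volume-id $volume_id --size $new_size_gib"
  echo "sudo growpart $device $partition"

  case "$fstype" in
    ext4) echo "sudo resize2fs ${device}${partition}" ;;
    xfs)  echo "sudo xfs_growfs /" ;;
    *)    echo "unsupported filesystem: $fstype" >&2; return 1 ;;
  esac
}

# Example: plan a resize of a placeholder volume to 100 GiB with ext4.
resize_plan vol-0123456789abcdef0 100 /dev/xvda 1 ext4
```

Printing the plan first is a deliberate choice: the partition and filesystem commands run on the instance, while `modify-volume` runs wherever the AWS CLI is configured.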

Q: Should I use t3.micro for production?

No. t3.micro is a burstable instance that earns CPU credits when idle and burns them under load. In standard mode, once credits run out under sustained load, the CPU is throttled to the baseline (10% per vCPU on a t3.micro), and your app becomes slow and unpredictable. In unlimited mode, which is the default for T3, the instance keeps bursting and you are billed for the surplus credits instead.

Use t3.micro for:
- Development and testing environments
- Lightweight web servers with sporadic traffic (<5% CPU average)
- Instances that are stopped outside business hours

For production with steady traffic, use m6i.large or larger. For databases, never use burstable instances.

t3.micro CPU credits hit zero, 200GB exfiltrated via open SSH port 22.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

✓ Production

production tested

May 24, 2026

last updated


⚡Quick Answer

EC2 = virtual server rental. Pay per second. Stop = compute stops billing, EBS continues. Terminate = delete instance and root volume.
Instance families: t (burstable credits), c (compute), r (RAM), i (storage), g (GPU). Never use t3.micro for production sustained load.
Security groups = stateful firewall. 0.0.0.0/0 on SSH = crypto mining within 24 hours. Always restrict to your IP.
EBS: gp3 is cheaper than gp2 and decouples IOPS from size. Default gp2 is a trap for new accounts.
Pricing: On-Demand (no commitment), Savings Plans (flexible; up to 66% off for Compute plans, up to 72% for EC2 Instance plans), Reserved Instances (up to 72% off, tied to an instance family or type), Spot (up to 90% off, can be interrupted).

✦ Definition~90s read

What is AWS EC2 Basics?

EC2 (Elastic Compute Cloud) is AWS's core IaaS offering: virtual servers you provision in minutes, paying only for what you use. It solves the fundamental problem of needing compute capacity without buying and maintaining physical hardware. You pick an instance type (CPU, memory, storage combo), an AMI (OS image), a key pair for SSH access, and a security group—a stateful firewall that controls inbound/outbound traffic.


Security groups are the single most common misconfiguration vector in AWS, responsible for countless breaches where attackers find port 22 (SSH) open to 0.0.0.0/0 and install crypto miners within hours. If you're running anything less than a production load, EC2 is often overkill—consider Lightsail for fixed-price simplicity or Lambda for event-driven workloads.

For persistent servers, you'll also need EBS volumes (block storage attached over the network). gp3 is now the usual choice for general-purpose volumes, because gp2's burst credit model can throttle your I/O silently when credits run out.

Plain-English First

Imagine you need a powerful gaming PC to run a tournament, but you only need it for one weekend. Instead of buying one, you rent it from a warehouse that has thousands of PCs in every size. AWS EC2 is that warehouse — except instead of gaming PCs, it's servers. You rent exactly the computing power you need, for exactly as long as you need it, and when you're done, you hand it back and stop paying. That's it.

Every app you've ever built eventually hits the same wall: where does it actually run? Your laptop can't serve production traffic, a shared hosting plan falls over under load, and buying physical servers means you're locked into hardware that's obsolete in three years. The cloud exists to solve this, and AWS EC2 is where most teams start — and for good reason. It's the backbone of thousands of production systems running right now, from early-stage startups to Fortune 500 backends.

EC2 (Elastic Compute Cloud) solves the problem of unpredictable infrastructure needs. 'Elastic' is the key word — you can spin up 50 servers at 9am for a product launch and terminate 48 of them by noon when the traffic spike passes. You're billed by the second. No contracts, no idle hardware, no datacenter lease. The underlying model shifts infrastructure from a capital expense (buy servers) to an operational one (rent compute), which changes how engineering teams think about scaling entirely.

By the end of this article you'll understand what an EC2 instance actually is under the hood, how to choose the right instance type for your workload, how to launch and connect to a real server using the AWS CLI, and how to lock it down with security groups. You'll also see the exact mistakes that burn people — including accidental bills from instances left running and SSH connections that silently refuse to work.

Why EC2 Security Groups Are Your First Line of Defense — and How to Get It Wrong

An EC2 security group is a virtual firewall that controls inbound and outbound traffic at the instance level. It operates as a stateful packet filter: if you allow inbound traffic on port 22 (SSH) from a specific IP, the return traffic is automatically allowed regardless of outbound rules. This statefulness is the core mechanic that makes security groups simple but also easy to misconfigure.

Each security group consists of a set of allow rules; there are no deny rules. Traffic that isn't explicitly allowed is implicitly denied. You can assign up to 5 security groups per network interface by default (an adjustable quota), and rules are evaluated collectively: any matching allow rule in any assigned group permits the traffic. Changes take effect immediately, with no instance restart required. This means a single overly permissive rule, like SSH from 0.0.0.0/0, opens your instance to the entire internet within seconds.

Use security groups to enforce least-privilege access for your EC2 instances. In production, you should never use 0.0.0.0/0 for SSH, RDP, or database ports. Instead, restrict access to specific IP ranges (e.g., your office VPN CIDR) or use a bastion host. Security groups are free, fast to modify, and essential for compliance — but they are not a substitute for host-level firewalls like iptables or network ACLs for subnet-level control.
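The "never 0.0.0.0/0 on management or database ports" rule can be expressed as a small lint check. This is a minimal pure-Bash sketch, not an AWS API call: `is_risky_rule` and the `RISKY_PORTS` list are hypothetical names chosen for illustration, and the rules are fed in as plain text rather than read from a real security group.

```shell
#!/bin/bash
# Hypothetical lint helper: flags an ingress rule that exposes a management
# or database port to the whole internet. The port list is an assumption.
RISKY_PORTS="22 3389 3306 5432 1433 27017"

is_risky_rule() {
  local from_port="$1" to_port="$2" cidr="$3" port

  # Only world-open sources are considered risky here.
  if [ "$cidr" != "0.0.0.0/0" ] && [ "$cidr" != "::/0" ]; then
    return 1
  fi

  for port in $RISKY_PORTS; do
    # A rule covers a port when the port falls inside its from/to range.
    if [ "$from_port" -le "$port" ] && [ "$to_port" -ge "$port" ]; then
      return 0
    fi
  done
  return 1
}

# Example rules, one per line: "from_port to_port cidr"
while read -r from to cidr; do
  if is_risky_rule "$from" "$to" "$cidr"; then
    echo "RISKY  $from-$to from $cidr"
  else
    echo "ok     $from-$to from $cidr"
  fi
done <<'EOF'
22 22 0.0.0.0/0
443 443 0.0.0.0/0
22 22 203.0.113.0/24
0 65535 0.0.0.0/0
EOF
```

Checking the port range rather than a single port matters: a rule that opens 0-65535 to the world exposes SSH just as surely as a rule for port 22 alone.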

Statefulness Trap

Security groups are stateful: if you allow inbound SSH from 0.0.0.0/0, the outbound response is automatically allowed — even if your outbound rules block all traffic.

Production Insight

A team opened SSH to 0.0.0.0/0 for a quick debug session and forgot to revert. Within 24 hours, a cryptominer was installed via a brute-forced SSH login. The symptom was 100% CPU usage and a $10,000 AWS bill from spot instance spikes. Rule of thumb: never use 0.0.0.0/0 for any management protocol; always scope to a specific IP or use a VPN.

Key Takeaway

Security groups are stateful — inbound allow automatically permits outbound response, which can mask overly permissive rules.

Implicit deny means no explicit deny rules exist; you can only whitelist, so a single mistake exposes the instance.

Changes apply instantly — there is no grace period, so a bad rule is live the moment you save it.


EC2 Instance Types and Pricing — Stop Paying for What You Don't Need

AWS divides instance types into families based on the ratio of CPU, memory, storage, and network. Pick wrong and you overpay or underperform.

General purpose (t3, m6i) — balanced CPU/memory. Good for web servers, dev/test environments. The t3 family includes burstable credits: you earn credits when idle and burn them under load. Exhaust them and the instance throttles.

Compute optimized (c6i, c7g) — higher CPU-to-memory ratio. For batch processing, video encoding, high-performance web servers. c6i is Intel, c7g is Graviton (ARM) — ~20% better price/performance.

Memory optimized (r6i, x2iedn) — massive RAM per vCPU. For in-memory databases (Redis, SAP HANA), large caches. The x2iedn gives up to 4 TB RAM.

Storage optimized (i3, d2) — high local NVMe SSD. For data warehouses, log processing. d2 has spinning disks for cold storage.

GPU instances (p4, g5) — for machine learning, graphic rendering. p4 uses A100 GPUs; g5 uses NVIDIA A10G.

In production, you usually start with a small general-purpose for your app, then use CloudWatch to profile actual resource usage and right-size after a week. Most teams over-provision by 2–3x initially.

Pricing models:
- On-Demand: Pay per second, no commitment. Highest per-hour cost. Best for short-lived workloads.
- Savings Plans: Commit to a $/hour spend for 1 or 3 years. Compute Savings Plans (up to 66% off) apply across instance families and regions; EC2 Instance Savings Plans (up to 72% off) are tied to one family in one region.
- Reserved Instances: Commit to a specific instance configuration in a region (or a single AZ for zonal RIs) for 1 or 3 years. Up to 72% off, but the least flexible option. Lock-in risk.
- Spot Instances: Spare capacity, up to 90% off, but can be reclaimed with a 2-minute warning. Use for batch, CI/CD, and stateless workloads.

Cost optimisation trap: buying a 3-year RI for a project that gets cancelled after 6 months. You're stuck paying for the entire term. Start with Savings Plans — they're flexible.
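To make the discount percentages concrete, here is a rough sketch that turns an hourly rate into a monthly figure. `monthly_cost` is a hypothetical helper; the 730 hours per month, the 40% Savings Plan discount, and the 70% Spot discount are illustrative assumptions, not quoted prices. The $0.096 rate is the m6i.large On-Demand figure used elsewhere in this article.

```shell
#!/bin/bash
# Hypothetical helper: monthly cost for an hourly rate after a percentage
# discount. 730 is the average number of hours in a month.
monthly_cost() {
  local hourly_rate="$1" discount_pct="$2" hours="${3:-730}"
  awk -v r="$hourly_rate" -v d="$discount_pct" -v h="$hours" \
    'BEGIN { printf "%.2f\n", r * h * (1 - d / 100) }'
}

RATE=0.096   # m6i.large On-Demand rate quoted in this article (USD/hour)

echo "On-Demand:            \$$(monthly_cost "$RATE" 0)"
echo "Savings Plan (~40%):  \$$(monthly_cost "$RATE" 40)"   # assumed discount
echo "Spot (~70%):          \$$(monthly_cost "$RATE" 70)"   # assumed discount
```

Multiplying the result by the instance count gives a quick sanity check before committing to a 1- or 3-year term.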

One nuance: the Graviton-based types (c7g, m7g, r7g) offer better price-performance for most workloads, but you'll need ARM-compatible software. Most container images and modern language runtimes work fine, but legacy binaries might not. Test before you commit.

io/thecodeforge/aws/ec2_cost_check.shBASH

#!/bin/bash
# TheCodeForge — Check EC2 cost and right-sizing recommendations

echo "=== EC2 Cost Analysis ==="

# List all running instances with type and launch time
aws ec2 describe-instances --filters Name=instance-state-name,Values=running \
  --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,LaunchTime,PublicIpAddress]' \
  --output table

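# Check CloudWatch CPU utilization for one instance (replace with your instance ID).
# Note: the `date -d` syntax below is GNU date (Linux); on macOS use `date -u -v-24H` instead.
INSTANCE_ID="i-0123456789abcdef0"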
echo "=== CPU Utilization for $INSTANCE_ID (last 24h) ==="
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --statistics Average \
  --period 3600 \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'Datapoints[*].[Timestamp,Average]' \
  --output table

# Check if instance is t-family and burst credits
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --statistics Minimum \
  --period 3600 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'Datapoints[0].Minimum' \
  --output text

Output

=== EC2 Cost Analysis ===
----------------------------------------------------------------
|                      DescribeInstances                       |
+----------------------+-----------+-----------------------+
| InstanceId           | Type      | LaunchTime            |
+----------------------+-----------+-----------------------+
| i-0abcd1234efgh5678  | t3.micro  | 2026-03-01T14:23:00Z  |
| i-0efgh5678ijkl9012  | m6i.large | 2026-03-10T09:15:00Z  |
+----------------------+-----------+-----------------------+
=== CPU Utilization for i-0abcd1234efgh5678 (last 24h) ===
-------------------------------------------------
|              GetMetricStatistics              |
+---------------------------+--------------------+
| Timestamp                 | Average            |
+---------------------------+--------------------+
| 2026-03-18T15:00:00Z      | 92.3               | <- sustained high CPU
| 2026-03-18T14:00:00Z      | 88.7               |
+---------------------------+--------------------+
CPUCreditBalance: 0.0 <- credits exhausted, instance throttling
Recommendation: Migrate from t3.micro to m6i.large for sustained workload.

Burstable Credits = Idle Save, Load Pay

t3.micro baseline: 10% per vCPU. Burst to 100% for short periods.
CPUCreditBalance = 0 in standard mode → instance is throttled to baseline. Your app slows down.
For sustained >20% CPU for more than a few hours, switch to m6i.large.
T3 launches in 'unlimited' mode by default: bursting past the credit balance is billed as surplus credits instead of being throttled. Rely on it only for unexpected spikes.
Production databases on t3.micro = slow queries + unhappy customers.
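The credit arithmetic behind that callout can be worked through directly. The sketch below assumes a t3.micro in standard mode with the published figures of 2 vCPUs, 12 credits earned per hour, and a maximum balance of 288 credits, where one credit is one vCPU at 100% for one minute. `hours_until_throttle` is a hypothetical helper name, and the input is the average utilisation across vCPUs, as CloudWatch's CPUUtilization reports it.

```shell
#!/bin/bash
# Sketch: hours until a t3.micro drains a full CPU credit balance.
# Assumed t3.micro figures: 2 vCPUs, earns 12 credits/hour, max balance 288.
# One CPU credit = one vCPU running at 100% for one minute.
hours_until_throttle() {
  local avg_cpu_pct="$1"   # average utilisation across vCPUs, 0-100
  awk -v u="$avg_cpu_pct" -v v=2 -v e=12 -v b=288 '
    BEGIN {
      spend = v * 60 * u / 100   # credits burned per hour
      drain = spend - e          # net credits lost per hour
      if (drain <= 0) { print "never"; exit }
      printf "%.1f\n", b / drain
    }'
}

for pct in 10 25 50 100; do
  echo "avg CPU ${pct}%: hours until throttled = $(hours_until_throttle "$pct")"
done
```

At exactly 10% the instance spends what it earns, which is why 10% is the baseline: anything above it slowly eats the balance, and anything well above it eats the balance within hours.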

Production Insight

We once saw a team running a Redis cluster on t3.micro instances.

The credit balance kept draining during peak hours, causing random latency spikes.

Sometimes 100ms, sometimes 2 seconds. Hard to debug.

Rule: never use burstable instances for sustained workloads — use m6i.large at minimum.

Burstable (t3) is for sporadic traffic (testing, staging, development), not production databases.

Key Takeaway

Instance types: t (burstable), c (compute), r (RAM), i (storage), g (GPU).

For sustained production workloads, skip t3 — use m6i.large or c6i.large.

Pricing: On-Demand (flexible), Savings Plans (recommended), RI (inflexible), Spot (interruptible).

The biggest cost savings: terminate idle instances and downgrade over-provisioned ones.

Choose EC2 instance family and pricing model

If: Sustained CPU > 20% for > 4 hours/day
→ Use: Avoid t3. Use m6i.large or c6i.large. t3 will throttle and performance will be inconsistent.

If: Workload runs 24/7 for foreseeable future
→ Use: Purchase Compute Savings Plans (1-year, no upfront). 40-50% savings over On-Demand. Flexible across families.

If: Workload is stateless and interruptible (batch, CI/CD, data processing)
→ Use: Spot Instances with Auto Scaling group. 60-90% savings. Handle interruption gracefully (2-min warning).

If: Experimentally migrating to ARM (Graviton)
→ Use: Test with c7g.large or m7g.large first. Verify all dependencies have ARM-compatible binaries. Expect 20% price/performance improvement.

Instance Family Comparison Table — C, M, R, T, I, G Series at a Glance

When selecting an instance, start with the family. Each family optimises a different resource ratio. The table below shows the common families, representative types, the vCPU-to-memory ratio, typical use cases, and approximate on-demand hourly pricing (us-east-1, as of May 2026). Prices are for the smallest size in each family and increase with size.

Family	Example Types	vCPU:RAM Ratio	Common Use Cases	~Min Hourly Price (On-Demand)
T (Burstable)	t3.nano - t3.2xlarge	Varies by size (1:0.25 at t3.nano, 1:4 from t3.large up)	Dev/test, low-traffic web servers, microservices with sporadic load	$0.0052 (t3.nano)
M (General)	m6i.large - m6i.32xlarge	1:4	Web servers, application servers, small databases	$0.096 (m6i.large)
C (Compute)	c6i.large - c6i.32xlarge	1:2	Batch processing, video encoding, high-performance web servers	$0.085 (c6i.large)
R (Memory)	r6i.large - r6i.32xlarge	1:8	In-memory databases (Redis, Memcached), real-time analytics	$0.126 (r6i.large)
I (Storage)	i3.large - i3.16xlarge	~1:8 + NVMe	Data warehouses, log processing, NoSQL databases	$0.156 (i3.large)
G (GPU)	g5.xlarge - g5.48xlarge	1:4 + GPU	Machine learning training/inference, graphics rendering	$1.006 (g5.xlarge)
X (Memory-optimised)	x2iedn.xlarge - x2iedn.32xlarge	1:32	SAP HANA, large in-memory databases	$0.557 (x2iedn.xlarge)

Graviton (ARM) variants (c7g, m7g, r7g) offer roughly 20% better price/performance than equivalent Intel/AMD types for most workloads. For example, c7g.large costs ~$0.0725/hr vs c6i.large at $0.085/hr. Check software compatibility before migrating.

How to choose:
- If your CPU is pegged at 100% but memory is under 50%, you need a compute-optimised type (C or better).
- If memory is near capacity but CPU is idle, move up to R series.
- If you need high local I/O (e.g., for temporary data processing), pick I series with instance store.
- For GPU-accelerated workloads, G or P series. P (p4, p5) are even more powerful but more expensive.

Use the AWS CLI to list the instance types of a family that are available in your region: `aws ec2 describe-instance-types --filters "Name=instance-type,Values=t3.*" --query 'InstanceTypes[].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' --output table`

io/thecodeforge/aws/ec2_list_families.shBASH

#!/bin/bash
# TheCodeForge — List EC2 instance families and their characteristics

echo "=== Listing all instance families with example types ==="

# Describe instance types for each family
for family in t3 m6i c6i r6i i3 g5 x2iedn; do
  echo ""
  echo "--- $family ---"
  aws ec2 describe-instance-types \
    --filters Name=instance-type,Values="${family}.*" \
    --query 'InstanceTypes[*].[InstanceType, VCpuInfo.DefaultVCpus, MemoryInfo.SizeInMiB, InstanceStorageInfo.TotalSizeInGB]' \
    --output table --max-items 5
done

Output

=== Listing all instance families with example types ===

--- t3 ---
-----------------------------------------
| InstanceType     | vCPUs  | Mem(MiB)  |
+------------------+--------+-----------+
| t3.nano          | 2      | 512       |
| t3.micro         | 2      | 1024      |
| t3.small         | 2      | 2048      |
| t3.medium        | 2      | 4096      |
| t3.large         | 2      | 8192      |
+------------------+--------+-----------+

--- m6i ---
-----------------------------------------
| InstanceType     | vCPUs  | Mem(MiB)  |
+------------------+--------+-----------+
| m6i.large        | 2      | 8192      |
| m6i.xlarge       | 4      | 16384     |
| m6i.2xlarge      | 8      | 32768     |
| m6i.4xlarge      | 16     | 65536     |
| m6i.8xlarge      | 32     | 131072    |
+------------------+--------+-----------+

... (output truncated for brevity)

Right-size with CloudWatch before committing to a family

Don't pick a family based on guesswork. Run your workload on a small general-purpose instance for 48 hours, then analyse CPU, memory, and disk metrics. Memory and disk usage are not reported by default; they require the CloudWatch agent. If CPU is pegged at 100% and memory is 20%, switch to compute-optimised. If memory is 90% and CPU is 10%, go memory-optimised. The table above is a starting point; your actual metrics tell the story.
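That rule of thumb can be written down as a tiny classifier. This is a sketch only: `suggest_family` is a hypothetical helper, and the 80/50/20 thresholds are assumptions chosen to bracket the examples in the text, not AWS guidance. Inputs are whole-number percentages taken from your own profiling.

```shell
#!/bin/bash
# Hypothetical classifier: maps average CPU and memory utilisation
# (whole-number percentages) to a suggested instance family.
# The 80/50/20 thresholds are illustrative assumptions.
suggest_family() {
  local cpu_pct="$1" mem_pct="$2"

  if [ "$cpu_pct" -ge 80 ] && [ "$mem_pct" -lt 50 ]; then
    echo "compute-optimised (C)"
  elif [ "$mem_pct" -ge 80 ] && [ "$cpu_pct" -lt 50 ]; then
    echo "memory-optimised (R)"
  elif [ "$cpu_pct" -lt 20 ] && [ "$mem_pct" -lt 20 ]; then
    echo "downsize or burstable (T)"
  else
    echo "general purpose (M)"
  fi
}

suggest_family 100 20   # CPU-bound profile from the text
suggest_family 10 90    # memory-bound profile from the text
suggest_family 55 60    # balanced profile
```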

Production Insight

A client ran a production API on a t3.medium for months because 'it worked fine'. What they didn't see was that CPU credit balance was draining every day during peak hours, causing sporadic 5-second response times. Switching to an m6i.large eliminated the issue and actually reduced total cost because they didn't need to over-provision for credit exhaustion. Always profile before choosing a family.

Key Takeaway

Instance families encode the resource ratio: T (burst), M (balanced), C (compute), R (memory), I (storage), G (GPU). Profile your workload with CloudWatch before picking a family. Consider Graviton (c7g/m7g/r7g) for 20% price-performance improvement if your software is ARM-compatible.

Pricing Model Decision Matrix — Choose Between On-Demand, Reserved, Spot, and Savings Plans

EC2 offers multiple pricing models, and choosing the wrong one can double your costs or leave you locked into a commitment you can't change. This decision matrix maps each model's commitment level, discount depth, interruptibility, and ideal use case.

Pricing Model	Commitment Required	Max Discount	Interruptible?	Best For
On-Demand	None	0%	No	Short-lived workloads, spiky traffic, development, workloads with unpredictable duration
Reserved Instance (Standard)	1 or 3 years, specific instance type + region (regional RIs allow size changes within a family)	Up to 72%	No	Steady-state production workloads where instance type and region are certain for 1-3 years
Reserved Instance (Convertible)	1 or 3 years, but can be exchanged for a different family, OS, or tenancy in the same region	Up to 66%	No	Workloads that are steady but may need to migrate instance family (e.g., upgrade from M to R)
Compute Savings Plans	1 or 3 years, $/hour commitment across any EC2 instance family and region	Up to 66%	No	Most flexible commitment; best for diversified workloads where instance types vary
Spot Instances	None (but may be interrupted)	Up to 90%	Yes (2-min warning)	Stateless, fault-tolerant tasks: batch processing, CI/CD, big data, image processing, microservices with graceful shutdown

Decision flow:
1. Will the workload run continuously for 1-3 years? -> Yes -> Consider Savings Plans or RIs. Start with Compute Savings Plans for flexibility.
2. Can the workload tolerate interruption? -> Yes -> Use Spot Instances for maximum savings. If no, use On-Demand or Savings Plans.
3. Is the instance type and region known to be fixed? -> Yes -> Standard RI gives the highest discount (up to 72%) but locks you in. For most teams, Savings Plans are safer.
4. Is the workload short or unpredictable? -> On-Demand is the simplest. Never commit to RI/Savings Plans for short projects.
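One possible reading of the decision flow, written as a function. `pick_pricing_model` is a hypothetical helper, and checking interruptibility first is an assumption about priority (Spot gives the deepest discount whenever the workload can tolerate it); the source lists the questions without a strict order.

```shell
#!/bin/bash
# Hypothetical encoding of the decision flow. Each argument is "yes" or "no".
# Checking interruptibility first is an assumption about priority.
pick_pricing_model() {
  local long_running="$1" interruptible="$2" fixed_footprint="$3"

  if [ "$interruptible" = "yes" ]; then
    echo "Spot"
  elif [ "$long_running" = "yes" ] && [ "$fixed_footprint" = "yes" ]; then
    echo "Standard RI or Savings Plans"
  elif [ "$long_running" = "yes" ]; then
    echo "Compute Savings Plans"
  else
    echo "On-Demand"
  fi
}

# Arguments: long_running interruptible fixed_footprint
pick_pricing_model yes no no    # steady API fleet, instance types may change
pick_pricing_model no yes no    # nightly batch job
pick_pricing_model no no no     # short-lived experiment
```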

Common pitfall: A developer committed to a 3-year Standard RI for an application that migrated to containers 6 months later. The RI was tied to a specific instance type and region, so it became useless. The cost: $50,000 paid for compute they never used.

Best practice: Use a mix: On-Demand for baseline flexibility, Savings Plans for steady-state capacity (covering about 60-80% of your expected usage), and Spot for elastic workloads that can be restarted. This blend gives you cost efficiency without lock-in risk.

io/thecodeforge/aws/ec2_pricing_advisor.shBASH

#!/bin/bash
# TheCodeForge — Compare pricing models for a given instance type

INSTANCE_TYPE="m6i.large"
REGION="us-east-1"

echo "=== Pricing for $INSTANCE_TYPE in $REGION ==="

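# Get On-Demand price.
# The Pricing API is only served from a few regions (us-east-1 is one of them),
# and PriceList entries are JSON strings, so fetch as text and parse with jq.
aws pricing get-products \
  --region us-east-1 \
  --service-code AmazonEC2 \
  --filters "Type=TERM_MATCH,Field=instanceType,Value=$INSTANCE_TYPE" \
            "Type=TERM_MATCH,Field=regionCode,Value=$REGION" \
            "Type=TERM_MATCH,Field=operatingSystem,Value=Linux" \
            "Type=TERM_MATCH,Field=tenancy,Value=Shared" \
            "Type=TERM_MATCH,Field=preInstalledSw,Value=NA" \
            "Type=TERM_MATCH,Field=capacitystatus,Value=Used" \
  --query 'PriceList[0]' --output text \
  | jq -r '.terms.OnDemand | to_entries[0].value.priceDimensions | to_entries[0].value.pricePerUnit.USD'

# 1-year Compute Savings Plan rate is not queried here;
# ~40% off On-Demand is used as a rough approximation.

# Get Spot price history for the last day (GNU date syntax)
aws ec2 describe-spot-price-history \
  --region "$REGION" \
  --instance-types "$INSTANCE_TYPE" \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'SpotPriceHistory[*].[SpotPrice,Timestamp]' \
  --output table | head -n 8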

Output

=== Pricing for m6i.large in us-east-1 ===
On-Demand price per hour: 0.096
Spot price history (last 24h):
-------------------------------------
| SpotPrice   | Timestamp           |
+-------------+---------------------+
| 0.0288      | 2026-05-11T15:00Z   |
| 0.0301      | 2026-05-11T12:00Z   |
| 0.0275      | 2026-05-11T09:00Z   |
+-------------+---------------------+
Estimated 1-year Compute Savings Plan price: ~0.058/hr (40% off On-Demand)

Reserved Instances: Lock-in Trap

Standard RIs lock you into a specific instance type (or family, for regional RIs with size flexibility) in one region or AZ for 1-3 years. If your workload changes, you're stuck paying for capacity you don't use, or you have to resell the reservation. Compute Savings Plans carry a somewhat smaller discount but cover all EC2 families and regions. Unless you are absolutely certain about your instance footprint for 3 years, choose Savings Plans over RIs.

Production Insight

I've seen teams buy 3-year Standard RIs for 'production servers' that were later migrated to containers on Fargate. The RI purchase was a $100k sunk cost. Now we always recommend Compute Savings Plans instead: they are still a 1- or 3-year spend commitment, but the 40-60% discount follows your usage across instance families, regions, and Fargate. For truly elastic workloads, we layer Spot on top of Savings Plans. The combined strategy can cut costs by 60-80% compared to pure On-Demand.

Key Takeaway

Use On-Demand for flexibility, Savings Plans for steady-state discount (most flexible), Spot for interruptible workloads. Never buy Standard RIs unless you are 100% certain about instance type and region for the entire term. Start with Compute Savings Plans for any committed spend.

Security Groups — The Firewall That Saves You from Crypto Miners

Security groups (SGs) are stateful virtual firewalls attached to EC2 (and other AWS resources). They control inbound and outbound traffic based on rules you define.

Key properties:
- Stateful: if you allow inbound on port 80, response traffic is automatically allowed outbound regardless of outbound rules.
- Explicit allow: no deny rules, only allow. Traffic that isn't allowed is implicitly denied.
- Reference other security groups: you can allow inbound from another SG (e.g., allow HTTP from ALB SG), which is more secure than IP ranges.
- You can attach up to 5 SGs per network interface by default.

Common patterns:
- Web tier: allow HTTP (80) and HTTPS (443) from 0.0.0.0/0; allow SSH from your office IP only.
- App tier: allow traffic only from the web tier SG on application port.
- Database tier: allow traffic only from app tier SG on DB port (e.g., 3306 for MySQL). Never allow DB ports from 0.0.0.0/0.

Mistake to avoid: Opening all ports to 0.0.0.0/0 for 'convenience'. That's how your instance becomes a crypto mining node overnight.

SSH-specific best practices:
- Never open 0.0.0.0/0 on port 22. Ever. Bots scan every IP every 15 minutes.
- Use AWS Systems Manager Session Manager instead of SSH: no public IP needed, no SSH keys, audit logs built-in.
- If you must use SSH, restrict to your office IP using `--cidr "$(curl -s https://checkip.amazonaws.com)/32"`.
- Use EC2 Instance Connect (temporary SSH key pushed via IAM).
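Before pasting a looked-up address into a `--cidr` argument, it is worth checking that the lookup actually returned an IPv4 address and not an error page. A minimal sketch follows; `to_host_cidr` is a hypothetical helper, and the address used is a documentation placeholder rather than a live lookup.

```shell
#!/bin/bash
# Hypothetical helper: validates an IPv4 address and prints it as a /32 CIDR.
# Returns non-zero (and prints nothing on stdout) for anything else.
to_host_cidr() {
  local ip="$1" octet count=0 old_ifs

  # Reject empty input, characters other than digits and dots, and empty octets.
  case "$ip" in
    ""|*[!0-9.]*|.*|*.|*..*) echo "not an IPv4 address: $ip" >&2; return 1 ;;
  esac

  old_ifs="$IFS"; IFS=.
  for octet in $ip; do
    count=$((count + 1))
    if [ "${#octet}" -gt 3 ] || [ "$octet" -gt 255 ]; then
      IFS="$old_ifs"
      echo "octet out of range: $ip" >&2
      return 1
    fi
  done
  IFS="$old_ifs"

  if [ "$count" -ne 4 ]; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
  echo "$ip/32"
}

# Placeholder address; in practice this would be the output of the IP lookup.
to_host_cidr 203.0.113.45
```

The validated value can then be passed to `authorize-security-group-ingress` in place of the raw lookup output.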

Worst-case scenario: A client opened port 3306 (MySQL) to 0.0.0.0/0 for 'easy testing' and forgot to revert. Within 3 hours, their database was publicly accessible and a script dumped all tables. The data breach cost $500k in fines.

io/thecodeforge/aws/ec2_security_group_fix.shBASH

#!/bin/bash
# TheCodeForge — Fix dangerous security group rules

# Find security groups with SSH (port 22) open to 0.0.0.0/0.
# The query requires the port range and the open CIDR to sit on the same rule.
# Rules for "all traffic" (protocol -1) carry no port fields and are not matched here.
echo "=== Security groups with SSH open to all ==="
aws ec2 describe-security-groups \
  --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
  --query 'SecurityGroups[?IpPermissions[?FromPort<=`22` && ToPort>=`22` && IpRanges[?CidrIp==`0.0.0.0/0`]]].[GroupId,GroupName]' \
  --output table

# Revoke dangerous rule (replace SG_ID with actual)
# aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 0.0.0.0/0

# Add rule for your current IP only
MY_IP=$(curl -s https://checkip.amazonaws.com)
echo "=== Adding SSH rule for your IP: $MY_IP ==="
# aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr "$MY_IP/32"

# Check for database ports open to internet
DB_PORTS=(3306 5432 1433 27017)
for port in "${DB_PORTS[@]}"; do
  echo "=== Checking port $port ==="
  aws ec2 describe-security-groups \
    --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
    --query "SecurityGroups[?IpPermissions[?FromPort<=\`$port\` && ToPort>=\`$port\` && IpRanges[?CidrIp=='0.0.0.0/0']]].[GroupId,GroupName]" \
    --output table
done

echo "=== Recommendation: Use Systems Manager Session Manager instead of SSH ==="
echo "aws ssm start-session --target i-0123456789abcdef0"

Output

=== Security groups with SSH open to all ===
----------------------------------------------
| GroupId                 | GroupName         |
+-------------------------+-------------------+
| sg-0abcd1234efgh5678    | default           |
| sg-0efgh5678ijkl9012    | web-app-sg        |
+-------------------------+-------------------+
=== Adding SSH rule for your IP: 203.0.113.45 ===
=== Checking port 3306 (MySQL) ===
[No results — good]
=== Recommendation: Use Systems Manager Session Manager instead of SSH ===
aws ssm start-session --target i-0123456789abcdef0

The 0.0.0.0/0 Trap

SSH open to 0.0.0.0/0 means the entire internet can attempt to connect to your instance. Attackers scan the entire IPv4 space every 15 minutes. A weak password or unpatched SSH version = compromise within hours. Fix: use Systems Manager Session Manager (no public IP needed) or restrict SSH to your IP range.

Production Insight

A client once opened port 3306 to 0.0.0.0/0 for 'easy testing' and forgot to revert.

Within 3 hours, their database was publicly accessible and a script dumped all tables.

The data breach cost $500k in fines and customer compensation.

Rule: use security group references, not IP ranges, for intra-VPC communication.

And never, ever open database ports to the internet — use a bastion host or SSM port forwarding.

Key Takeaway

Security groups are stateful, allow-only firewalls.

Never open 0.0.0.0/0 on SSH (port 22). Use Session Manager instead.

Use security group references for intra-VPC traffic (e.g., allow app tier to talk to DB tier by SG ID).

Start with no inbound rules, add only what you need.

EBS Volumes — gp3 vs gp2 and the Burst Credit Trap

Amazon Elastic Block Store (EBS) provides block-level storage volumes that persist independently from your EC2 instance. Think of it as an external hard drive you can attach/detach at will.

Volume types:
- gp3 (General Purpose SSD): baseline 3,000 IOPS and 125 MiB/s at any size, with more IOPS (up to 16,000) provisionable for an extra charge. No burst credits. Good for most workloads. Cost-optimised. Recommended for new deployments.
- gp2 (older): IOPS tied to volume size (3 IOPS per GB, minimum 100). Volumes under 1 TB can burst to 3,000 IOPS using burst credits. Avoid for new deployments unless you need compatibility.
- io1/io2 (Provisioned IOPS): guaranteed IOPS, expensive. For databases requiring consistent, high IOPS.
- st1 (Throughput Optimized HDD): cheap, high throughput. For log processing, big data.
- sc1 (Cold HDD): lowest cost, infrequent access.

The gp2 burst trap: gp2 earns I/O credits when idle (like CPU credits). Under sustained high I/O, credits exhaust, and performance drops from 3000 IOPS to baseline (3 IOPS per GB). A 100GB gp2 volume would drop to 300 IOPS.
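The gp2 numbers can be worked through with a little arithmetic. The sketch below assumes the published gp2 model: 3 IOPS per GiB with a floor of 100 and a ceiling of 16,000, a burst rate of 3,000 IOPS, and a full burst bucket of 5.4 million I/O credits. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch of the gp2 performance model (assumed figures: 3 IOPS/GiB,
# floor 100, ceiling 16000, burst 3000 IOPS, bucket of 5.4M I/O credits).
gp2_baseline_iops() {
  local size_gib="$1" iops
  iops=$((size_gib * 3))
  if [ "$iops" -lt 100 ]; then iops=100; fi
  if [ "$iops" -gt 16000 ]; then iops=16000; fi
  echo "$iops"
}

# Seconds a full burst bucket lasts at a sustained 3000 IOPS.
# The bucket refills at the baseline rate while it drains.
gp2_burst_seconds() {
  local size_gib="$1" baseline
  baseline=$(gp2_baseline_iops "$size_gib")
  if [ "$baseline" -ge 3000 ]; then
    echo "no burst needed"
    return
  fi
  echo $((5400000 / (3000 - baseline)))
}

for size in 10 100 500 1000; do
  echo "${size} GiB: baseline $(gp2_baseline_iops "$size") IOPS, full-burst time: $(gp2_burst_seconds "$size")"
done
```

For the 100 GB volume in the example above, the bucket lasts about 2,000 seconds (roughly 33 minutes) of sustained 3,000 IOPS before performance falls to 300 IOPS.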

gp3 eliminates burst credits: baseline 3000 IOPS regardless of size, and it's often cheaper than gp2. Migrate any gp2 volumes to gp3 for consistent performance and lower cost.

Performance tip: gp3 decouples IOPS from size and is cheaper than gp2. For high-performance databases, use io2 Block Express (up to 256,000 IOPS).

Encryption warning: By default, EBS encryption is not enabled in a new account. Enable it at the account level via the EC2 console settings. Otherwise, any snapshot you share accidentally could leak data.

io/thecodeforge/aws/ec2_ebs_migrate_gp3.sh · BASH

#!/bin/bash
# TheCodeForge — Migrate gp2 volumes to gp3 for cost and performance

# List all gp2 volumes
echo "=== gp2 volumes in account ==="
aws ec2 describe-volumes --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].[VolumeId,Size,Attachments[0].InstanceId,CreateTime]' \
  --output table

# Check burst credit balance for a gp2 volume
VOLUME_ID="vol-0123456789abcdef0"
echo "=== gp2 burst credit balance for $VOLUME_ID ==="
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=$VOLUME_ID \
  --statistics Minimum \
  --period 3600 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --output text

# Migrate volume from gp2 to gp3 (no downtime, online operation)
echo "=== Migrating $VOLUME_ID from gp2 to gp3 ==="
aws ec2 modify-volume \
  --volume-id $VOLUME_ID \
  --volume-type gp3 \
  --iops 3000

echo "=== Migration started. Monitor with: ==="
aws ec2 describe-volumes-modifications --volume-ids $VOLUME_ID

# Enable EBS encryption by default for future volumes
echo "=== Enabling EBS encryption by default ==="
aws ec2 enable-ebs-encryption-by-default

Output

=== gp2 volumes in account ===
-------------------------------------------------
+--------------------+------+--------------------+-------------------+
| vol-0abcd1234efgh | 100 | i-0abcd1234efgh | 2025-01-15T10:00Z |
| vol-0efgh5678ijkl | 50 | i-0efgh5678ijkl | 2025-01-20T14:30Z |
+--------------------+------+--------------------+-------------------+

=== gp2 burst credit balance for vol-0abcd1234efgh ===
0.0 <- credits exhausted, performance degraded

=== Migrating vol-0abcd1234efgh from gp2 to gp3 ===
{
  "VolumeModification": {
    "VolumeId": "vol-0abcd1234efgh",
    "TargetVolumeType": "gp3",
    "Progress": 0
  }
}

=== Enabling EBS encryption by default ===
{
  "EbsEncryptionByDefault": true
}

gp2 Burst Credits Exhaust = Slow Database

gp2 volumes have burst credits that deplete under sustained high I/O. A 100GB gp2 volume baseline is only 300 IOPS. If your database needs 2000 IOPS consistently, you'll exhaust credits and performance tanks. Migrate to gp3: baseline 3000 IOPS regardless of size. Often cheaper too.

Production Insight

We had an incident where a large batch job filled the root volume.

The server froze: no space for logs, no SSH.

The fix: resize via API and extend filesystem — 2 minutes.

But the outage cost $12,000 in missed SLAs.

Rule: set CloudWatch alarms on disk space at 80% and 90%.

Also have a runbook for quick resizing.

The gp2 burst trap: a database on gp2 worked fine for months, then a Black Friday spike exhausted credits. Performance dropped from 3000 IOPS to 300 IOPS. Queries took 30 seconds. This is why we recommend gp3 for production.

Key Takeaway

gp3 is now the default — baseline 3000 IOPS, no burst credits, often cheaper than gp2.

Migrate existing gp2 volumes to gp3 (online, no downtime).

Set CloudWatch alarms on disk space at 80% and 90% — this prevents freezes.

Encrypt all EBS volumes by default at the account level.
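The 80% and 90% disk thresholds can be sketched locally before wiring them into CloudWatch. EC2 does not publish disk usage on its own, so in production the number would come from the CloudWatch agent; the helper names below are hypothetical and only illustrate the threshold logic.

```bash
#!/bin/bash
# Map a usage percentage to the 80% / 90% thresholds from the text.
disk_alert_level() {
  used=$1
  if [ "$used" -ge 90 ]; then
    echo "critical"
  elif [ "$used" -ge 80 ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

# Read the root filesystem's usage from df (POSIX output format):
# column 5 of the second line looks like "42%"; strip the percent sign.
root_used_percent() {
  df -P / | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

used=$(root_used_percent)
echo "root filesystem: ${used}% used -> $(disk_alert_level "$used")"
```

Running this from cron is a stopgap at best; the same two thresholds map directly onto two CloudWatch alarms once the agent reports disk usage.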

Choose your EBS volume type

If: Boot volume or general-purpose app (< 16k IOPS needed)
→ Use: gp3, the default choice. Baseline 3,000 IOPS with more available by provisioning, no burst credits. Cheaper than gp2.

If: Production database (Oracle, MySQL, PostgreSQL) requiring consistent high IOPS
→ Use: io2 or io2 Block Express. Provision IOPS based on DB workload profile (e.g., 10000 IOPS for 2TB database).

If: Data warehouse / analytics with sequential reads
→ Use: st1 for high throughput at low cost. Not for random I/O.

If: Infrequently accessed archives (backups, logs older than 90 days)
→ Use: sc1 for the lowest cost. Acceptable for archival data.

If: Existing gp2 volume (legacy)
→ Use: Migrate to gp3 immediately. Online operation, no downtime. Lower cost + better performance.

Storage Comparison — EBS vs Instance Store

EC2 instances have two storage options: Elastic Block Store (EBS) volumes and Instance Store volumes (ephemeral). They differ fundamentally in persistence, performance, and pricing. Choosing incorrectly can lead to data loss or unexpected costs.

EBS (Elastic Block Store):
- Network-attached block storage that persists independently of the instance.
- Can be detached and reattached to another instance in the same Availability Zone.
- Data survives instance stop and start. On terminate, a volume survives only if its 'Delete on termination' flag is off; the flag is on by default for the root volume and off by default for additional volumes.
- Multiple volume types (gp3, io2, st1, sc1) with different performance/cost profiles.
- Billed per GB-month plus IOPS/throughput provisions.
- Typical latency: 1-5 ms.

Instance Store (Ephemeral Storage):
- Physically attached to the host server that runs the instance.
- Data survives a reboot but is lost when the instance is stopped, hibernated, or terminated, or when the underlying host fails.
- Included in the instance price, with no separate billing.
- Extremely low latency (sub-millisecond) and very high IOPS (millions on NVMe).
- Only available on certain instance types (i3, i4i, m5d, c5d, r5d, etc.; outside the storage-optimized families, look for the 'd' suffix for local NVMe).
- Cannot be detached or moved to another instance.

Feature	EBS	Instance Store
Persistence	Survives stop/start; survives terminate unless 'Delete on termination' is set (default for root)	Lost on stop/terminate/host failure
Performance	Network-attached, 1-5 ms latency	Direct-attached, sub-ms latency, millions of IOPS
Cost	Pay per GB-month + provisioned IOPS/throughput	Included in instance price (no extra)
Size limit	Up to 64 TiB per volume (io2 Block Express; lower limits for other types)	Up to ~60 TB per instance (multiple NVMe)
Snapshot support	Yes (snapshots stored in S3)	No
Encryption	KMS-backed encryption (AWS-managed or customer-managed keys)	NVMe volumes are encrypted at the hardware level
Detach/reattach	Yes	No
Use cases	Databases, OS boot volumes, persistent application data	Temporary storage, caches, scratch data, log processing, swap

Best practices: - Always use EBS for persistent data like databases, application state, and logs you need to keep. - Use Instance Store for temporary data that can be regenerated: build caches, intermediate processing results, swap space, or data replicated from another source. - Many production architectures combine both: boot from EBS (gp3), and mount instance store NVMe for high-performance scratch space (e.g., database temp tables, MapReduce shuffle). - If you use Instance Store for anything important, replicate it across instances or to a shared EBS/EFS/S3 to avoid single-point-of-failure.

Common mistake: A developer used an instance store volume as the primary data store for a stateful application. When the instance was stopped for a security patch, all data was lost. Recovery took days from backups. Always review the 'Delete on termination' flag and instance type storage options before launching.

io/thecodeforge/aws/ec2_storage_check.sh · BASH

#!/bin/bash
# TheCodeForge: check instance storage details
set -euo pipefail

echo "=== Check if instance has Instance Store volumes ==="
INSTANCE_ID="i-0123456789abcdef0"

# Describe instance to see block device mappings.
# Note: this output lists attached EBS volumes only. Instance store volumes
# do not appear here, so rely on the instance-type check at the end.
aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].BlockDeviceMappings[*]'

echo ""
echo "=== List all EBS volumes attached to instance ==="
aws ec2 describe-volumes --filters "Name=attachment.instance-id,Values=$INSTANCE_ID" \
  --query 'Volumes[*].[VolumeId,Size,VolumeType,State,Attachments[0].Device]' \
  --output table

echo ""
echo "=== Check if instance type supports Instance Store ==="
INSTANCE_TYPE=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].InstanceType' --output text)

echo "Instance type: $INSTANCE_TYPE"
# InstanceStorageSupported is the authoritative answer for this instance type
aws ec2 describe-instance-types --instance-types "$INSTANCE_TYPE" \
  --query 'InstanceTypes[0].[InstanceType, InstanceStorageSupported, InstanceStorageInfo.TotalSizeInGB]' \
  --output table

Output

=== Check if instance has Instance Store volumes ===
[
  {
    "DeviceName": "/dev/xvda",
    "Ebs": {
      "VolumeId": "vol-0abcd1234efgh",
      "Status": "attached"
    }
  }
]

=== List all EBS volumes attached to instance ===
--------------------------------------------------
+--------------------+------+------------+-------+--------+
+--------------------+------+------------+-------+--------+

=== Check if instance type supports Instance Store ===
Instance type: m6i.large
No instance store volumes. (InstanceStorageSupported: false)

For instance store support, choose a storage-optimized type or one with a 'd' suffix: i3.large, m5d.large, c5d.large, r5d.large, etc.

Instance Store is Ephemeral — Don't Treat It Like a Hard Drive

Instance store volumes are physically attached to the host server. If the instance stops, terminates, or the host fails, all data on instance store is permanently lost. There is no 'undelete' and no snapshot. Use it only for cache, temp files, or data that can be regenerated from another source (e.g., S3, database replica). Never use it as the primary storage for databases or user data.

Production Insight

A team ran a Redis cache on an i3 instance's NVMe instance store. The cache was rebuildable from the primary database, so it was a good fit. But they also stored user session data in the same NVMe. When AWS scheduled a host replacement for hardware maintenance, all sessions were lost — users were logged out globally. They had to add a warm-up routine to rebuild the cache from backups. Lesson: know what lives on instance store and have a recovery plan for its potential data loss.

Key Takeaway

EBS is persistent, detachable, and billable; Instance Store is ephemeral, fast, and free with the instance. Use EBS for anything you need to keep. Use Instance Store for temporary, high-I/O workloads that can tolerate loss. Always have a data replication or recovery strategy if using Instance Store.

User Data and Bootstrapping: Stop Hand-Configuring Your Instances

Manually SSHing into every new instance to install packages is a production incident waiting to happen. User Data is a script you pass at launch; by default it runs once, on the instance's first boot. Use it to install agents, pull the latest app code, or register with your configuration management. The trap: User Data runs as root and is idempotent only if you write it that way. On Windows, it runs under the local Administrator or SYSTEM account, depending on how the launch agent is configured. On Linux, cloud-init executes it late in the boot sequence, before every service has finished starting. Because it does not re-run on reboot, a script that calls yum install and is interrupted by a reboot leaves the instance half-provisioned. If you configure User Data to run on every boot, or re-run it by hand, wrap your logic in a conditional check against a lock file or cloud-init state so completed steps are skipped. Never hardcode secrets in User Data; fetch them from Parameter Store or Secrets Manager. The golden rule: treat your User Data like a one-shot provisioning trigger, not a daemon.

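bootstrap.sh · BASH

#!/bin/bash
# io.thecodeforge
# One-shot bootstrap for production EC2
set -euo pipefail

# Guard against re-runs if User Data is configured to execute on every boot
LOCKFILE="/var/lib/cloud/instance/booted.lock"
if [[ -f "$LOCKFILE" ]]; then
  echo "Bootstrap already ran. Exiting."
  exit 0
fi

echo "Installing CloudWatch agent..."
# Package availability varies by distro; the apt fallback assumes a repo that provides it
yum install -y amazon-cloudwatch-agent || apt-get install -y amazon-cloudwatch-agent

echo "Fetching app config from Parameter Store..."
# IMDSv2: request a session token first, then send it with every metadata call
TOKEN=$(curl -sf -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
REGION=$(curl -sf -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/placement/region)

# Create the target directory first; the redirect fails if it is missing
mkdir -p /etc/app
# get-parameter exits non-zero when the parameter does not exist
aws ssm get-parameter --name "/app/prod/db_url" --with-decryption --region "$REGION" \
  --query "Parameter.Value" --output text > /etc/app/db_url.conf
# Restrict access to the decrypted value; adjust ownership for your app user
chmod 600 /etc/app/db_url.conf

echo "Registering with ASG lifecycle hook..."
# The lifecycle hook call itself is omitted from this example
touch "$LOCKFILE"
echo "Bootstrap complete."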

Output

Instance boots, pulls config, registers, and locks itself. No re-runs on restart.

Production Trap:

If your User Data script tries to install packages on every boot, your instances will become inconsistent. Always use a lock file or cloud-init modules flag to ensure idempotence. Also: never store API keys in the script body.
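Lock-file idempotence can also be applied per step rather than per script, so a failed step is retried while completed ones are skipped. The sketch below is one way to do that; `run_once` and the marker directory are hypothetical names, not part of cloud-init.

```bash
#!/bin/bash
# Hypothetical helper: run a provisioning step once and record success
# with a marker file. A failed step leaves no marker, so it is retried
# on the next run.
MARKER_DIR=${MARKER_DIR:-/var/lib/bootstrap-markers}   # assumed path

run_once() {
  step=$1
  shift
  marker="$MARKER_DIR/$step.done"
  if [ -e "$marker" ]; then
    echo "skip $step (already done)"
    return 0
  fi
  mkdir -p "$MARKER_DIR"
  if "$@"; then
    touch "$marker"
    echo "done $step"
  else
    echo "FAILED $step" >&2
    return 1
  fi
}

# Example usage inside User Data (commands are illustrative):
# run_once install-agent yum install -y amazon-cloudwatch-agent
# run_once fetch-config  /usr/local/bin/fetch-config.sh
```

The design choice here is that the marker is written only after the command succeeds, which keeps a half-finished step from being recorded as complete.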

Key Takeaway

User Data runs once. Wrap it in idempotent logic. Fetch secrets from Parameter Store, never from the script.

Instance Metadata — The Backdoor You Didn't Lock

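Every EC2 instance has a metadata endpoint at 169.254.169.254. It hands out instance ID, IAM role credentials, security group info, and even the User Data you sent. If an attacker can make the instance issue an HTTP request on their behalf (a server-side request forgery, or SSRF, bug in your app is enough), they hit that endpoint and steal your IAM credentials. That's how crypto miners get a foothold. The fix: restrict IMDS to version 2, which requires a session token obtained with a PUT request and sent back in a header. Version 1 answers any plain GET, which is exactly what most SSRF bugs can produce. Also, disable IMDS entirely on instances that don't need it, like self-managed NAT or proxy instances with no IAM role. For IAM roles, let the SDK or CLI pick up credentials through the IMDSv2 token flow instead of hardcoding credentials. CloudTrail does not record metadata requests themselves, so audit how the role's credentials are used in CloudTrail and watch the MetadataNoToken CloudWatch metric to find instances still making IMDSv1 calls. The single worst EC2 security mistake is leaving IMDSv1 enabled on a public-facing instance with an admin role attached. Lock it down before your next deployment.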
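The version 2 token flow is short enough to show directly. This is a minimal sketch: the function names are hypothetical, the 300-second TTL is an arbitrary choice, and the calls only succeed when run on an EC2 instance.

```bash
#!/bin/bash
# Minimal IMDSv2 client sketch. Function names are hypothetical.
IMDS_BASE="http://169.254.169.254"

# Step 1: PUT request to obtain a short-lived session token.
imds_token() {
  curl -sf --max-time 2 -X PUT "$IMDS_BASE/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300"
}

# Step 2: GET a metadata path, presenting the token in a header.
imds_get() {
  token=$(imds_token) || return 1
  curl -sf --max-time 2 "$IMDS_BASE/latest/meta-data/$1" \
    -H "X-aws-ec2-metadata-token: $token"
}

# Example (only works on an EC2 instance):
# imds_get instance-id
# imds_get placement/region
```

A plain GET without the token header is what IMDSv1 accepts and what an instance with tokens required rejects, which is why a simple SSRF that cannot issue a PUT or set headers stops working.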

disable_imdsv1.sh · BASH

#!/bin/bash
# io.thecodeforge
# Force IMDSv2 on a running instance
# Requires AWS CLI and valid credentials

INSTANCE_ID="i-0abcdef1234567890"

# Require session tokens (IMDSv2) for all metadata calls; this changes the setting
aws ec2 modify-instance-metadata-options \
  --instance-id "$INSTANCE_ID" \
  --http-tokens required \
  --http-endpoint enabled \
  --http-put-response-hop-limit 2 \
  --region us-east-1

echo "IMDSv2 enforced. Token required for all calls."

# Verify change
aws ec2 describe-instances \
  --instance-ids "$INSTANCE_ID" \
  --region us-east-1 \
  --query "Reservations[0].Instances[0].MetadataOptions.HttpTokens"

# Expected output: "required"

Output

Metadata options updated. Instances now reject IMDSv1 requests.

Blocker:

Some legacy workloads (like custom AMIs with old cloud-init) break under IMDSv2. Test in staging first. If your app uses IMDSv1, update the SDK — most major language SDKs support v2 natively.

Key Takeaway

Enforce IMDSv2 on every instance. IMDSv1 is a free credential leak. Audit role credential use with CloudTrail. No exceptions.

Placement Groups — How to Sabotage Your Own Latency

When you launch instances without a placement strategy, AWS spreads them across hardware to reduce correlated failures. That's fine for web servers. For high-performance computing, distributed databases, or Kafka clusters, you may need low latency between instances or explicit control over which hardware they share. Placement groups give you that control. Three flavors: cluster packs instances close together in a single AZ for low-latency, high-throughput networking; spread places each instance on distinct hardware for maximum separation; partition divides instances into groups that don't share racks, for large distributed workloads like Hadoop. The gotcha: cluster placement groups don't span Availability Zones. If the AZ fails, everything goes down. You can't move a running instance into a placement group: launch it there, or stop it first and then change its placement. Also, if capacity is tight, your cluster group may get an insufficient capacity error, so launch all the instances in a single request where possible and consider an On-Demand Capacity Reservation for critical cluster groups. Avoid mixing instance types in a cluster group; AWS recommends a single type, and mixed types make both capacity and network performance less predictable.

launch_cluster_group.sh · BASH

#!/bin/bash
# io.thecodeforge
# Create a cluster placement group and launch instances into it

PLACEMENT_GROUP="production-kafka-cluster"

# Create the group
aws ec2 create-placement-group \
  --group-name "$PLACEMENT_GROUP" \
  --strategy cluster \
  --region us-east-1

echo "Placement group created: $PLACEMENT_GROUP"

# Launch instances into it (example: c5n.4xlarge for network-optimized Kafka)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type c5n.4xlarge \
  --placement GroupName="$PLACEMENT_GROUP" \
  --count 3 \
  --region us-east-1 \
  --query "Instances[*].InstanceId"

echo "3 instances launched into cluster group. Check inter-instance latency."

Output

Placement group created. 3 c5n.4xlarge instances launched. Expect sub-millisecond latency between group members.

Production Trap:

Cluster placement groups are single-AZ. One AZ outage takes down the entire group. For multi-AZ redundancy, use partition groups and spread your partitions across AZs. Also, remember: you can't add a running instance to a placement group. Launch it in the group, or stop it, change its placement with aws ec2 modify-instance-placement, and start it again.

Key Takeaway

Placement groups buy latency at the cost of availability. Cluster for HPC, spread for HA, partition for distributed systems. Plan your AZ failure mode first.

● Production incident · POST-MORTEM · severity: high

The t3.micro That Cost $2,300 in One Weekend

Symptom

Instance CPU pegged at 100% for 48 hours straight. Network egress spikes to unknown IPs. CloudWatch shows CPUUtilization at 100% but t3.micro credits exhausted (CPUCreditBalance = 0). Unknown processes consuming memory. AWS sends 'suspicious activity' notification.

Assumption

The developer assumed 'light testing' meant the instance would be terminated. They didn't set termination protection, didn't monitor billing, and thought t3.micro was 'too small to matter'. They also assumed security group with 0.0.0.0/0 on SSH was fine because 'no one will find this random IP'. Bots sweep the entire IPv4 space continuously, so a newly exposed port 22 is usually probed within minutes.

Root cause

Security group inbound rule allowed SSH (port 22) from 0.0.0.0/0. Bots brute-forced a weak password within 24 hours. Attacker installed crypto miner that consumed 100% CPU. The t3.micro exhausted its CPU credits and, because T3 instances run in 'unlimited' mode by default, kept bursting and was billed for surplus credits. Data transfer to mining pool cost $1,800. The attacker used the instance's IAM role, which had S3 read access, to exfiltrate 200GB of data to an external server. The instance had no CloudWatch alarm on CPU, disk, or billing. No one noticed for 48 hours until AWS suspended the account.

Fix

1. Terminated compromised instance immediately.
2. Rotated all IAM credentials associated with the instance's role.
3. Revoked all access keys that were ever on the instance.
4. Deleted security group rule allowing 0.0.0.0/0 on SSH.
5. Added CloudWatch alarm: CPUUtilization > 80% for 5 minutes → page on-call.
6. Added billing alarm: EstimatedCharges > $100 → email.
7. Implemented AWS Systems Manager Session Manager for prod access (no SSH keys, no public IP).
8. Enabled VPC Flow Logs to audit traffic patterns.
9. Added mandatory tagging (CostCenter, Environment, ExpirationDate) for all instances.
10. Created Lambda function that auto-terminates instances older than 7 days with 'testing' tag.
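The two alarms from steps 5 and 6 could be created along these lines. This is a sketch under assumptions: the alarm names, instance ID, and SNS topic ARNs are placeholders, and the `run` wrapper is a hypothetical dry-run helper that prints each command unless `DRY_RUN=0` is set.

```bash
#!/bin/bash
# Dry-run wrapper: prints the command unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY-RUN: $*"
  else
    "$@"
  fi
}

# CPU alarm: average CPUUtilization above 80% over one 5-minute period.
create_cpu_alarm() {
  instance_id=$1
  topic_arn=$2
  run aws cloudwatch put-metric-alarm \
    --alarm-name "cpu-high-$instance_id" \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$instance_id" \
    --statistic Average --period 300 --evaluation-periods 1 \
    --threshold 80 --comparison-operator GreaterThanThreshold \
    --alarm-actions "$topic_arn"
}

# Billing alarm: estimated charges above $100. Billing metrics are
# published in us-east-1 only and must be enabled in billing preferences.
create_billing_alarm() {
  topic_arn=$1
  run aws cloudwatch put-metric-alarm \
    --region us-east-1 \
    --alarm-name "billing-over-100-usd" \
    --namespace AWS/Billing --metric-name EstimatedCharges \
    --dimensions "Name=Currency,Value=USD" \
    --statistic Maximum --period 21600 --evaluation-periods 1 \
    --threshold 100 --comparison-operator GreaterThanThreshold \
    --alarm-actions "$topic_arn"
}

# Placeholder ID and ARNs below are illustrative, not real resources.
create_cpu_alarm "i-0123456789abcdef0" "arn:aws:sns:us-east-1:123456789012:oncall"
create_billing_alarm "arn:aws:sns:us-east-1:123456789012:billing"
```

Review the printed commands first, then re-run with `DRY_RUN=0` once the topic ARNs point at real SNS topics.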

Key lesson

0.0.0.0/0 on SSH is not 'low risk': with password logins enabled, expect brute-force attempts within hours and a compromise within days if any password is weak.
Every account needs a CloudWatch billing alarm, and every instance needs CPU/disk monitoring.
t3.micro is fine for testing, but set termination protection and expiration tags.
Don't attach IAM roles with S3 read access to public-facing instances.
Use AWS Systems Manager Session Manager instead of SSH for production access.

Production debug guide

Debug connectivity, performance, and configuration issues fast.

Symptom · 01

SSH connection times out or 'Connection refused'

→

Fix

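Check security group inbound rules for port 22 and ensure your IP is allowed. A timeout usually points to the network path (security group, network ACL, routing); 'Connection refused' means the packet arrived but nothing is listening on port 22, so check that sshd is running. Verify the instance has a public IP (or you're using a bastion host). Check network ACLs and route tables: the VPC needs an internet gateway attached, and the subnet's route table needs a route to it.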

Symptom · 02

Instance status checks show 2/2 but app is slow

→

Fix

Check CloudWatch metrics: CPU utilisation, memory (install CloudWatch agent), EBS burst balance (for gp2). If gp2 burst credits exhausted, switch to gp3. For memory pressure, consider a larger instance type or enable swap.

Symptom · 03

Instance stops responding after a few days

→

Fix

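Check whether a T2/T3 instance in standard mode has exhausted its CPU credits (CPUCreditBalance near 0) and is being throttled to baseline. Verify OS-level disk usage with df -h; EBS volume might be full. Also check for OOM killer activity in /var/log/kern.log (Debian/Ubuntu) or with dmesg or journalctl -k.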

Symptom · 04

Can't attach an EBS volume — 'Invalid volume' or 'Attachment limit'

→

Fix

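Check that the volume is in the same Availability Zone as the instance and is not already attached to another instance. Each instance has a maximum number of attachments that depends on the instance type (many Nitro instances share a limit of 28 across EBS volumes, network interfaces, and NVMe instance store volumes). Detach unused volumes. If volume is in 'error' state, restore a new volume from a recent snapshot.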

Symptom · 05

Instance launched but never passes system checks

→

Fix

Review the instance's system log and console screenshot (EC2 console, or aws ec2 get-console-output) for boot errors. Common causes: missing kernel, corrupt AMI, or wrong architecture. Try launching with a different AMI or instance type.

★ EC2 Quick Debug Cheat Sheet

Five common EC2 issues and the exact commands to diagnose and fix them.

SSH timeout

Immediate action

Check security group inbound rules

Commands

aws ec2 describe-security-groups --group-ids <sg-id> --query 'SecurityGroups[0].IpPermissions[?ToPort==`22`]'

nslookup <public-dns> or ping <public-ip>

Fix now

Add your current IP to the SSH rule: aws ec2 authorize-security-group-ingress --group-id <sg-id> --protocol tcp --port 22 --cidr $(curl -s http://checkip.amazonaws.com)/32

High CPU: suspected crypto mining

Disk full: instance frozen

Instance terminates unexpectedly

EBS gp2 volume performance degraded

EC2 vs Container (ECS/EKS) vs Lambda – When to Use What

Criterion	EC2	ECS/EKS (Containers)	Lambda (Serverless)
Control over OS and runtime	Full control (OS, kernel, packages)	Moderate (container image + host OS)	None (managed runtime)
Cold start latency	None (always on)	Low (container warm-up)	High (first invocation ~200ms–1s)
Cost for steady 24/7 workload	Low (with Savings Plans)	Moderate (no OS licensing, but cluster cost)	High (charged per request + duration)
Scaling granularity	Manual or via Auto Scaling (minutes)	Fast (seconds to minutes)	Instant (per request)
Persistence	EBS, instance store	EBS, EFS (attached per task)	Stateless (use S3, DynamoDB for state)
Ideal use case	Legacy apps, databases, long-running services	Microservices, batch jobs	Event-driven APIs, scheduled tasks, data processing

Key takeaways

EC2 is virtual servers in AWS: you pay per second, stop when not needed.

Instance types encode workload: t for burstable, c for compute, r for memory, i for storage, g for GPU.

Security groups are stateful, allow-only firewalls; restrict SSH to your IP and use SG references for intra-VPC traffic.

EBS volumes persist separately; gp3 is the default and cheaper than gp2; always snapshot before resizing.

Pricing mix: On-Demand for flexibility, Savings Plans for steady state, Spot for disposable workloads.

The three biggest mistakes: opening SSH to 0.0.0.0/0, leaving instances running, and storing secrets in plaintext.

Set CloudWatch billing alarm at $100. Set disk space alarms at 80% and 90%. You'll catch 90% of surprises.

Common mistakes to avoid

7 patterns

Opening SSH port to 0.0.0.0/0 and using default security groups

Symptom

Instance gets compromised via brute force. CPU spikes due to crypto mining. Data exfiltration. Surprise bills.

Fix

Restrict SSH to your IP using --cidr $(curl -s http://checkip.amazonaws.com)/32. Use Systems Manager Session Manager for production access. Create a custom security group with minimal rules.

Leaving instances running after testing

Symptom

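Unbudgeted charges appear in AWS monthly bill. A single t3.medium left running costs roughly $30 a month at us-east-1 On-Demand rates, but forgotten large or GPU instances, their attached volumes, and data transfer can reach thousands of dollars.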

Fix

Tag all instances with expiration tag (e.g., expiration-date: 2026-05-01). Set up CloudWatch alarms on billing. Use AWS Instance Scheduler to auto-stop instances outside work hours. Terminate instead of stop when done.
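The expiration-tag check itself is simple date comparison. A minimal sketch, assuming the tag value uses the YYYY-MM-DD format from the example above; `is_expired` is a hypothetical helper, and fetching the tag from the EC2 API is left out.

```bash
#!/bin/bash
# Hypothetical helper: decide whether an expiration-date tag value
# (YYYY-MM-DD) is in the past. ISO dates compare correctly as plain
# integers once the dashes are removed.
is_expired() {
  expiry=$(printf '%s' "$1" | sed 's/-//g')
  today=$(printf '%s' "${2:-$(date -u +%Y-%m-%d)}" | sed 's/-//g')
  [ "$today" -gt "$expiry" ]
}

# The second argument pins "today" so the example is reproducible.
if is_expired "2026-05-01" "2026-05-02"; then
  echo "expired: candidate for stop/terminate"
fi
if ! is_expired "2026-05-01" "2026-04-30"; then
  echo "still valid"
fi
```

An instance counts as expired from the day after its tag date; a scheduled job could apply this check to every tagged instance and stop or terminate the matches.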

Using t2/t3 micro for production services

Symptom

Performance degrades unpredictably under load. CPU credits exhausted, instance throttled. Latency spikes and timeouts.

Fix

Use burstable instances (t3) only for non-production or variable-load work. For production, choose m6i.large or larger. If you must use burstable, run in unlimited mode (the default for T3, opt-in for T2) and alarm on the CPUSurplusCreditsCharged metric, because it costs extra when credits are negative.
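How quickly credits run out in standard mode is easy to estimate. The sketch below assumes the credit model AWS documents (one credit is one vCPU at 100% for one minute) and uses t3.micro figures from the AWS burstable-instance tables; the function name is illustrative.

```bash
#!/bin/bash
# Minutes until a burstable instance in standard mode drains its
# CPU credit balance. One credit = one vCPU at 100% for one minute.
credit_drain_minutes() {
  balance=$1        # current CPUCreditBalance
  earn_per_hour=$2  # credits earned per hour for the instance size
  vcpus=$3
  util_percent=$4   # sustained CPU utilization across all vCPUs
  spend_per_hour=$(( vcpus * 60 * util_percent / 100 ))
  if [ "$spend_per_hour" -le "$earn_per_hour" ]; then
    echo "never"    # at or below baseline: the balance does not shrink
    return 0
  fi
  echo $(( balance * 60 / (spend_per_hour - earn_per_hour) ))
}

# t3.micro figures (2 vCPUs, 12 credits/hour, 288 max balance) are
# assumptions taken from AWS tables; verify them for your instance size.
echo "t3.micro at 100% CPU: $(credit_drain_minutes 288 12 2 100) minutes"
echo "t3.micro at 10% CPU:  $(credit_drain_minutes 288 12 2 10)"
```

With those figures, a full balance lasts 160 minutes at 100% CPU, after which a standard-mode instance is throttled to its 10% baseline.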

Storing secrets (API keys, passwords) in user data or AMIs

Symptom

If instance is compromised, attacker gains access to secret keys, potentially across accounts.

Fix

Store secrets in AWS Secrets Manager or Parameter Store. Use IAM roles attached to the instance (instance profile) to grant permissions without embedded keys. Never put secrets in plaintext in scripts.

Not encrypting EBS volumes by default

Symptom

If an EBS snapshot is shared accidentally or a volume is detached, raw data is exposed.

Fix

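Enable EBS encryption by default in each Region you use (EC2 console settings, or aws ec2 enable-ebs-encryption-by-default). Use KMS keys for customer-managed encryption. All new volumes in that Region will be encrypted automatically; existing unencrypted volumes have to be re-created from encrypted snapshot copies.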

Not enabling termination protection on production instances

Symptom

Someone accidentally terminates a production instance via console or CLI, causing an application outage.

Fix

Enable termination protection at launch or via aws ec2 modify-instance-attribute --instance-id <id> --attribute disableApiTermination --value true. Set IAM policies that require MFA to disable termination protection.

Using the default VPC without understanding its limits

Symptom

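Unexpected connectivity issues when trying to peer VPCs, and resources exposed by accident. The default VPC uses the same 172.31.0.0/16 range in every account and Region, with one public /20 subnet per AZ. That range overlaps with every other default VPC, so peering between them fails, and instances launched there get public IPs unless you say otherwise.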

Fix

Create a custom VPC with sufficient CIDR block (e.g., /16) for current and future needs. Use /21 or /20 for subnets based on projected growth. Never use the default VPC for production.
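The sizing advice can be checked with a little arithmetic. A minimal sketch, assuming the AWS rule that 5 addresses in every subnet are reserved; the function names are illustrative.

```bash
#!/bin/bash
# Number of subnets of a given prefix that fit in a VPC CIDR block.
subnets_in_vpc() {
  vpc_prefix=$1
  subnet_prefix=$2
  echo $(( 1 << (subnet_prefix - vpc_prefix) ))
}

# Usable addresses in a subnet: AWS reserves 5 addresses per subnet.
usable_ips() {
  echo $(( (1 << (32 - $1)) - 5 ))
}

echo "/20 subnets in a /16 VPC: $(subnets_in_vpc 16 20)"
echo "usable IPs in a /20: $(usable_ips 20)"
echo "usable IPs in a /21: $(usable_ips 21)"
```

A /16 VPC holds 16 subnets of size /20 with 4,091 usable addresses each, or 32 subnets of size /21 with 2,043 each.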

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR

Explain the different EC2 instance purchase options and when you would u...

Q02 · SENIOR

How do you debug an EC2 instance that is unreachable via SSH?

Q03 · SENIOR

What is an EC2 instance profile, and how does it relate to IAM roles?

Q04 · SENIOR

What happens when a Spot Instance is interrupted? How do you design for ...

Q05 · SENIOR

How do you migrate a running EC2 instance from one instance type to anot...

Q06 · SENIOR

What is the difference between gp2 and gp3 EBS volumes? Why should you m...

Q01 of 06 · SENIOR

Explain the different EC2 instance purchase options and when you would use each.

ANSWER

EC2 offers On-Demand, Reserved Instances (Standard/Convertible), Spot Instances, and Savings Plans.
- On-Demand: No commitment, pay per second. Best for uncertain workloads or short-term needs.
- Reserved Instances: 1- or 3-year commitment to a specific instance type, scoped to a Region or an AZ. Standard RIs offer up to 72% discount but are inflexible. Convertible RIs allow changing instance families within the same Region.
- Spot Instances: Use spare AWS capacity at up to 90% discount. Can be interrupted with a 2-minute warning. Use for stateless, fault-tolerant workloads like batch processing, CI/CD, or big data.
- Savings Plans: Flexible commitment in $/hour across EC2, Fargate, and Lambda. Compute Savings Plans cover any region/instance family. Easier to manage than RIs for dynamic environments.
In production, you typically use a mix: On-Demand for baseline flexibility, Savings Plans for steady-state, Spot for elastic workloads. Always use Auto Scaling to match capacity to demand.

FAQ · 6 QUESTIONS

Frequently Asked Questions

What is the difference between stopping and terminating an EC2 instance?

How do I reduce EC2 costs?

What is a security group, and how is it different from a network ACL?

What is the difference between an AMI and a snapshot?

How do I increase the disk space on an EC2 instance?

Should I use t3.micro for production?

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

✓ Verified: production tested · Last updated: May 24, 2026
