Mid-level 11 min · March 06, 2026

EC2 Security Groups — SSH 0.0.0.0/0 to Miner in 24h

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • EC2 = virtual server rental. Pay per second. Stop = compute stops billing, EBS continues. Terminate = delete the instance and, by default, its root volume.
  • Instance families: t (burstable credits), c (compute), r (RAM), i (storage), g (GPU). Never use t3.micro for production sustained load.
  • Security groups = stateful firewall. 0.0.0.0/0 on SSH = crypto mining within 24 hours. Always restrict to your IP.
  • EBS: gp3 is cheaper than gp2 and decouples IOPS from size. gp2 is still the default when a volume is created through the API or CLI without a volume type, which is a trap for new accounts.
  • Pricing: On-Demand (no commitment), Savings Plans (flexible, up to 72% off), Reserved Instances (inflexible, locks instance type, up to 72% off), Spot (up to 90% off, can be interrupted).
Plain-English First

Imagine you need a powerful gaming PC to run a tournament, but you only need it for one weekend. Instead of buying one, you rent it from a warehouse that has thousands of PCs in every size. AWS EC2 is that warehouse — except instead of gaming PCs, it's servers. You rent exactly the computing power you need, for exactly as long as you need it, and when you're done, you hand it back and stop paying. That's it.

Every app you've ever built eventually hits the same wall: where does it actually run? Your laptop can't serve production traffic, a shared hosting plan falls over under load, and buying physical servers means you're locked into hardware that's obsolete in three years. The cloud exists to solve this, and AWS EC2 is where most teams start — and for good reason. It's the backbone of thousands of production systems running right now, from early-stage startups to Fortune 500 backends.

EC2 (Elastic Compute Cloud) solves the problem of unpredictable infrastructure needs. 'Elastic' is the key word — you can spin up 50 servers at 9am for a product launch and terminate 48 of them by noon when the traffic spike passes. You're billed by the second. No contracts, no idle hardware, no datacenter lease. The underlying model shifts infrastructure from a capital expense (buy servers) to an operational one (rent compute), which changes how engineering teams think about scaling entirely.

By the end of this article you'll understand what an EC2 instance actually is under the hood, how to choose the right instance type for your workload, how to launch and connect to a real server using the AWS CLI, and how to lock it down with security groups. You'll also see the exact mistakes that burn people — including accidental bills from instances left running and SSH connections that silently refuse to work.

EC2 Instance Types and Pricing — Stop Paying for What You Don't Need

AWS divides instance types into families based on the ratio of CPU, memory, storage, and network. Pick wrong and you overpay or underperform.

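General purpose (t3, m6i): balanced CPU/memory. Good for web servers, dev/test environments. The t3 family is burstable: you earn CPU credits while running below the baseline and spend them under load. Exhaust them in standard mode and the instance is throttled to its baseline; in unlimited mode (the T3 launch default) it keeps bursting and you are billed for the surplus credits.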

Compute optimized (c6i, c7g) — higher CPU-to-memory ratio. For batch processing, video encoding, high-performance web servers. c6i is Intel, c7g is Graviton (ARM) — ~20% better price/performance.

Memory optimized (r6i, x2iedn) — massive RAM per vCPU. For in-memory databases (Redis, SAP HANA), large caches. The x2iedn gives up to 4 TB RAM.

Storage optimized (i3, d2) — high local NVMe SSD. For data warehouses, log processing. d2 has spinning disks for cold storage.

GPU instances (p4, g5) — for machine learning, graphic rendering. p4 uses A100 GPUs; g5 uses NVIDIA A10G.

In production, you usually start with a small general-purpose for your app, then use CloudWatch to profile actual resource usage and right-size after a week. Most teams over-provision by 2–3x initially.

Pricing models:
  • On-Demand: pay per second, no commitment. Highest per-hour cost. Best for short-lived workloads.
  • Savings Plans: commit to a $/hour spend for 1 or 3 years. Compute Savings Plans (up to 66% off) apply across instance families and regions; EC2 Instance Savings Plans (up to 72% off) are tied to one family in one region.
  • Reserved Instances: commit to a specific instance family in a region (or a specific type in one AZ for zonal RIs) for 1 or 3 years. Inflexible, but up to 72% off. Lock-in risk.
  • Spot Instances: spare capacity, up to 90% off, but can be reclaimed with a 2-minute warning. Use for batch, CI/CD, stateless workloads.

Cost optimisation trap: buying a 3-year RI for a project that gets cancelled after 6 months. You're stuck paying for the entire term. Start with Savings Plans — they're flexible.
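The trap above comes down to simple arithmetic, sketched below under one assumption: the whole commitment is paid whether or not the capacity is used. The function name `break_even_months` is made up for this illustration, and the discount percentages are example inputs, not quoted AWS rates.

```bash
#!/bin/bash
# Sketch: how many months must a commitment be used before it beats On-Demand?
# Assumption: the full commitment is paid even if the workload goes away.
#
#   committed cost = price * (1 - discount) * term
#   on-demand cost = price * months_used
#   break-even     = term * (1 - discount)

# break_even_months TERM_MONTHS DISCOUNT_PCT
break_even_months() {
  local term_months=$1 discount_pct=$2
  awk -v t="$term_months" -v d="$discount_pct" \
    'BEGIN { printf "%.1f\n", t * (100 - d) / 100 }'
}

break_even_months 12 40   # 1-year term at 40% off -> 7.2
break_even_months 36 60   # 3-year term at 60% off -> 14.4
```

Read against the cancelled-project example: a 3-year commitment at 60% off only pays for itself after about 14 months of use, so a project that ends at month 6 would have been cheaper on On-Demand.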

One nuance: the Graviton-based types (c7g, m7g, r7g) offer better price-performance for most workloads, but you'll need ARM-compatible software. Most container images and modern language runtimes work fine, but legacy binaries might not. Test before you commit.

io/thecodeforge/aws/ec2_cost_check.sh (Bash)
#!/bin/bash
# TheCodeForge: Check EC2 cost and right-sizing recommendations

echo "=== EC2 Cost Analysis ==="

# List all running instances with type and launch time
aws ec2 describe-instances --filters Name=instance-state-name,Values=running \
  --query 'Reservations[*].Instances[*].[InstanceId,InstanceType,LaunchTime,PublicIpAddress]' \
  --output table

# Check CloudWatch CPU utilization for one instance
INSTANCE_ID="i-0123456789abcdef0"
echo "=== CPU Utilization for $INSTANCE_ID (last 24h) ==="
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --statistics Average \
  --period 3600 \
  --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'Datapoints[*].[Timestamp,Average]' \
  --output table

# Check if instance is t-family and burst credits
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=$INSTANCE_ID \
  --statistics Minimum \
  --period 3600 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'Datapoints[0].Minimum' \
  --output text
Output
=== EC2 Cost Analysis ===
----------------------------------------------------------------
| DescribeInstances |
+----------------------+----------+----------------------+
| InstanceId | Type | LaunchTime |
+----------------------+----------+----------------------+
| i-0abcd1234efgh5678 | t3.micro| 2026-03-01T14:23:00Z|
| i-0efgh5678ijkl9012 | m6i.large| 2026-03-10T09:15:00Z|
+----------------------+----------+----------------------+
=== CPU Utilization for i-0abcd1234efgh5678 (last 24h) ===
-------------------------------------------------
| GetMetricStatistics |
+---------------------------+--------------------+
| Timestamp | Average |
+---------------------------+--------------------+
| 2026-03-18T15:00:00Z | 92.3 | <- sustained high CPU
| 2026-03-18T14:00:00Z | 88.7 |
+---------------------------+--------------------+
CPUCreditBalance: 0.0 <- credits exhausted, instance throttling
Recommendation: Migrate from t3.micro to m6i.large for sustained workload.
Burstable Credits = Idle Save, Load Pay
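  • t3.micro baseline: 10% of each vCPU (baselines range from 5% on t3.nano to 40% on the largest sizes). Burst to 100% for short periods.
  • CPUCreditBalance = 0 in standard mode → instance is throttled to its baseline. Your app slows down.
  • For sustained >20% CPU for more than a few hours, switch to m6i.large.
  • T3 launches in 'unlimited' mode by default: it keeps bursting past a zero balance and bills the surplus credits. Treat that as cover for unexpected spikes, not as a substitute for right-sizing.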
  • Production databases on t3.micro = slow queries + unhappy customers.
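The credit mechanics in this box can be sketched with integer arithmetic. This is a rough model, not AWS's exact accounting: it assumes standard mode, a constant utilisation, and the published t3.micro figures (2 vCPUs, 10% baseline, 288-credit maximum balance). The function names are invented for the example.

```bash
#!/bin/bash
# Sketch of CPU credit arithmetic for a burstable instance in standard mode.
# One CPU credit = one vCPU running at 100% for one minute.

# net_credits_per_hour VCPUS BASELINE_PCT UTIL_PCT
# Positive result: the balance grows. Negative: the balance drains.
net_credits_per_hour() {
  local vcpus=$1 baseline_pct=$2 util_pct=$3
  echo $(( vcpus * (baseline_pct - util_pct) * 60 / 100 ))
}

# hours_until_empty BALANCE VCPUS BASELINE_PCT UTIL_PCT
# Whole hours (rounded down) until the balance reaches zero.
hours_until_empty() {
  local balance=$1 net
  net=$(net_credits_per_hour "$2" "$3" "$4")
  if (( net >= 0 )); then
    echo "never"
  else
    echo $(( balance / -net ))
  fi
}

net_credits_per_hour 2 10 50    # t3.micro at 50% CPU -> -48 credits/hour
hours_until_empty 288 2 10 50   # full balance lasts 6 hours
```

So a t3.micro with a full balance survives roughly one working morning at 50% CPU before it is throttled, which is why daily peak traffic keeps draining it.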
Production Insight
We once saw a team running a Redis cluster on t3.micro instances.
The credit balance kept draining during peak hours, causing random latency spikes.
Sometimes 100ms, sometimes 2 seconds. Hard to debug.
Rule: never use burstable instances for sustained workloads — use m6i.large at minimum.
Burstable (t3) is for sporadic traffic (testing, staging, development), not production databases.
Key Takeaway
Instance types: t (burstable), c (compute), r (RAM), i (storage), g (GPU).
For sustained production workloads, skip t3 — use m6i.large or c6i.large.
Pricing: On-Demand (flexible), Savings Plans (recommended), RI (inflexible), Spot (interruptible).
The biggest cost savings: terminate idle instances and downgrade over-provisioned ones.
Choose EC2 instance family and pricing model
If: Sustained CPU > 20% for > 4 hours/day
Use: Avoid t3. Use m6i.large or c6i.large. t3 will throttle and performance will be inconsistent.
If: Workload runs 24/7 for foreseeable future
Use: Purchase Compute Savings Plans (1-year, no upfront). 40-50% savings over On-Demand. Flexible across families.
If: Workload is stateless and interruptible (batch, CI/CD, data processing)
Use: Spot Instances with an Auto Scaling group. 60-90% savings. Handle interruption gracefully (2-min warning).
If: Experimentally migrating to ARM (Graviton)
Use: Test with c7g.large or m7g.large first. Verify all dependencies have ARM-compatible binaries. Expect 20% price/performance improvement.

Instance Family Comparison Table — C, M, R, T, I, G Series at a Glance

When selecting an instance, start with the family. Each family optimises a different resource ratio. The table below shows the common families, representative types, the vCPU-to-memory ratio, typical use cases, and approximate on-demand hourly pricing (us-east-1, as of May 2026). Prices are for the smallest size in each family and increase with size.

| Family | Example Types | vCPU:RAM (GiB) Ratio | Common Use Cases | ~Min Hourly Price (On-Demand) |
|---|---|---|---|---|
| T (Burstable) | t3.nano - t3.2xlarge | Varies by size (t3.micro: 2 vCPU, 1 GiB) | Dev/test, low-traffic web servers, microservices with sporadic load | $0.0052 (t3.nano) |
| M (General) | m6i.large - m6i.32xlarge | 1:4 | Web servers, application servers, small databases | $0.096 (m6i.large) |
| C (Compute) | c6i.large - c6i.32xlarge | 1:2 | Batch processing, video encoding, high-performance web servers | $0.085 (c6i.large) |
| R (Memory) | r6i.large - r6i.32xlarge | 1:8 | In-memory databases (Redis, Memcached), real-time analytics | $0.126 (r6i.large) |
| I (Storage) | i3.large - i3.16xlarge | ~1:8 + NVMe | Data warehouses, log processing, NoSQL databases | $0.156 (i3.large) |
| G (GPU) | g5.xlarge - g5.48xlarge | 1:4 + GPU | Machine learning training/inference, graphics rendering | $1.006 (g5.xlarge) |
| X (Memory-optimised) | x2iedn.xlarge - x2iedn.32xlarge | 1:32 | SAP HANA, large in-memory databases | $0.557 (x2iedn.xlarge) |

Graviton (ARM) variants (c7g, m7g, r7g) offer up to about 20% better price/performance than equivalent Intel/AMD types for most workloads. For example, c7g.large costs roughly $0.07/hr vs c6i.large at $0.085/hr. Check software compatibility before migrating.

How to choose:
  • If your CPU is pegged at 100% but memory is under 50%, you need a compute-optimised type (C or better).
  • If memory is near capacity but CPU is idle, move up to R series.
  • If you need high local I/O (e.g., for temporary data processing), pick I series with instance store.
  • For GPU-accelerated workloads, G or P series. P (p4, p5) are even more powerful but more expensive.
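The CPU and memory rules above can be written down as a small helper. This is only a sketch: the function name `suggest_family` and the 90%/50% cut-offs are assumptions chosen to mirror "pegged" and "idle" in the text, and real right-sizing should look at a full utilisation history rather than two averages.

```bash
#!/bin/bash
# Sketch: map average CPU and memory utilisation to an instance family.
# Thresholds (90 and 50) are illustrative assumptions, not AWS guidance.

# suggest_family CPU_PCT MEM_PCT   (whole-number percentages)
suggest_family() {
  local cpu=$1 mem=$2
  if (( cpu >= 90 && mem < 50 )); then
    echo "C (compute optimised)"
  elif (( mem >= 90 && cpu < 50 )); then
    echo "R (memory optimised)"
  else
    echo "M (general purpose)"
  fi
}

suggest_family 98 30   # CPU-bound
suggest_family 10 92   # memory-bound
suggest_family 45 55   # balanced
```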

Use the AWS CLI to list the instance types of a family that are available in your region (the wildcard after the dot matches every size):
aws ec2 describe-instance-types --filters "Name=instance-type,Values=t3.*" --query 'InstanceTypes[].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' --output table

io/thecodeforge/aws/ec2_list_families.sh (Bash)
#!/bin/bash
# TheCodeForge: List EC2 instance families and their characteristics

echo "=== Listing all instance families with example types ==="

# Describe instance types for each family
for family in t3 m6i c6i r6i i3 g5 x2iedn; do
  echo ""
  echo "--- $family ---"
  aws ec2 describe-instance-types \
    --filters Name=instance-type,Values="${family}.*" \
    --query 'InstanceTypes[*].[InstanceType, VCpuInfo.DefaultVCpus, MemoryInfo.SizeInMiB, InstanceStorageInfo.TotalSizeInGB]' \
    --output table --max-items 5
done
Output
=== Listing all instance families with example types ===
--- t3 ---
-----------------------------------------
| InstanceType | vCPUs | Mem(MiB) |
+------------------+--------+-----------+
| t3.nano | 2 | 512 |
| t3.micro | 2 | 1024 |
| t3.small | 2 | 2048 |
| t3.medium | 2 | 4096 |
| t3.large | 2 | 8192 |
+------------------+--------+-----------+
--- m6i ---
-----------------------------------------
| InstanceType | vCPUs | Mem(MiB) |
+------------------+--------+-----------+
| m6i.large | 2 | 8192 |
| m6i.xlarge | 4 | 16384 |
| m6i.2xlarge | 8 | 32768 |
| m6i.4xlarge | 16 | 65536 |
| m6i.8xlarge | 32 | 131072 |
+------------------+--------+-----------+
... (output truncated for brevity)
Right-size with CloudWatch before committing to a family
Don't pick a family based on guesswork. Run your workload on a small general-purpose instance for 48 hours, then analyse CPU, memory, and disk metrics (memory and disk usage need the CloudWatch agent; EC2 publishes CPU, network, and disk I/O on its own). If CPU is pegged at 100% and memory is 20%, switch to compute-optimised. If memory is 90% and CPU is 10%, go memory-optimised. The table above is a starting point; your actual metrics tell the story.
Production Insight
A client ran a production API on a t3.medium for months because 'it worked fine'. What they didn't see was that CPU credit balance was draining every day during peak hours, causing sporadic 5-second response times. Switching to an m6i.large eliminated the issue and actually reduced total cost because they didn't need to over-provision for credit exhaustion. Always profile before choosing a family.
Key Takeaway
Instance families encode the resource ratio: T (burst), M (balanced), C (compute), R (memory), I (storage), G (GPU). Profile your workload with CloudWatch before picking a family. Consider Graviton (c7g/m7g/r7g) for 20% price-performance improvement if your software is ARM-compatible.

Pricing Model Decision Matrix — Choose Between On-Demand, Reserved, Spot, and Savings Plans

EC2 offers multiple pricing models, and choosing the wrong one can double your costs or leave you locked into a commitment you can't change. This decision matrix maps each model's commitment level, discount depth, interruptibility, and ideal use case.

| Pricing Model | Commitment Required | Max Discount | Interruptible? | Best For |
|---|---|---|---|---|
| On-Demand | None | 0% | No | Short-lived workloads, spiky traffic, development, workloads with unpredictable duration |
| Reserved Instance (Standard) | 1 or 3 years, specific instance family + region | Up to 72% | No | Steady-state production workloads where instance type and region are certain for 1-3 years |
| Reserved Instance (Convertible) | 1 or 3 years, but can be exchanged for a different family, OS, or tenancy | Up to 66% | No | Workloads that are steady but may need to migrate instance family (e.g., upgrade from M to R) |
| Compute Savings Plans | 1 or 3 years, $/hour commitment across any EC2 instance family and region | Up to 66% (EC2 Instance Savings Plans: up to 72%) | No | Most flexible commitment; best for diversified workloads where instance types vary |
| Spot Instances | None (but may be interrupted) | Up to 90% | Yes (2-min warning) | Stateless, fault-tolerant tasks: batch processing, CI/CD, big data, image processing, microservices with graceful shutdown |

Decision flow:
1. Will the workload run continuously for 1-3 years? Yes: consider Savings Plans or RIs. Start with Compute Savings Plans for flexibility.
2. Can the workload tolerate interruption? Yes: use Spot Instances for maximum savings. No: use On-Demand or Savings Plans.
3. Is the instance type and region known to be fixed? Yes: a Standard RI gives the highest discount (up to 72%) but locks you in. For most teams, Savings Plans are safer.
4. Is the workload short or unpredictable? On-Demand is the simplest. Never commit to RIs or Savings Plans for short projects.

Common pitfall: A developer committed to a 3-year Standard RI for an application that migrated to containers 6 months later. The RI was tied to a specific instance type and region, so it became useless. The cost: $50,000 paid for compute they never used.

Best practice: Use a mix: On-Demand for baseline flexibility, Savings Plans for steady-state capacity (covering about 60-80% of your expected usage), and Spot for elastic workloads that can be restarted. This blend gives you cost efficiency without lock-in risk.
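Spot only fits this blend if the workload reacts to the 2-minute warning. One common way to do that is to poll the instance metadata service from the instance itself; the sketch below assumes IMDSv2 is enabled and that `curl` is installed. `drain_and_checkpoint` is a hypothetical hook you would supply, and nothing runs until `watch_for_interruption` is called.

```bash
#!/bin/bash
# Sketch: watch for a Spot interruption notice via instance metadata (IMDSv2).
# Only meaningful when run on an EC2 Spot Instance.
IMDS="http://169.254.169.254/latest"

# Prints the HTTP status of the spot/instance-action endpoint:
# 404 = no interruption scheduled, 200 = interruption notice present.
spot_action_status() {
  local token
  token=$(curl -s --max-time 2 -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  curl -s --max-time 2 -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/meta-data/spot/instance-action"
}

# interruption_pending STATUS -> succeeds only when a notice is present
interruption_pending() {
  [ "$1" = "200" ]
}

# Poll every 5 seconds; on a notice, run the shutdown hook and stop polling.
watch_for_interruption() {
  while true; do
    if interruption_pending "$(spot_action_status)"; then
      echo "Spot interruption notice received; draining"
      # drain_and_checkpoint   # hypothetical: deregister from the LB, flush state
      break
    fi
    sleep 5
  done
}
```

Polling every few seconds leaves most of the two minutes for the drain itself. EventBridge rules are an alternative when the reaction should happen outside the instance.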

io/thecodeforge/aws/ec2_pricing_advisor.sh (Bash)
#!/bin/bash
# TheCodeForge: Compare pricing models for a given instance type

INSTANCE_TYPE="m6i.large"
REGION="us-east-1"

echo "=== Pricing for $INSTANCE_TYPE in $REGION ==="

# Get On-Demand price (Linux, shared tenancy).
# The Pricing API is served from only a few regions, so call it in us-east-1.
# Each PriceList entry is a JSON document encoded as a string: emit it as text, then parse with jq.
aws pricing get-products \
  --region us-east-1 \
  --service-code AmazonEC2 \
  --filters "Type=TERM_MATCH,Field=instanceType,Value=$INSTANCE_TYPE" \
            "Type=TERM_MATCH,Field=regionCode,Value=$REGION" \
            "Type=TERM_MATCH,Field=operatingSystem,Value=Linux" \
            "Type=TERM_MATCH,Field=tenancy,Value=Shared" \
            "Type=TERM_MATCH,Field=preInstalledSw,Value=NA" \
            "Type=TERM_MATCH,Field=capacitystatus,Value=Used" \
  --query "PriceList[0]" --output text \
  | jq -r '.terms.OnDemand | to_entries[0].value.priceDimensions | to_entries[0].value.pricePerUnit.USD'

# 1-year Compute Savings Plan price
# Savings Plans rates are not part of this price list; as a rough approximation assume ~40% off On-Demand

# Get Spot price history for the last 24 hours (GNU date syntax)
aws ec2 describe-spot-price-history \
  --region "$REGION" \
  --instance-types "$INSTANCE_TYPE" \
  --product-descriptions "Linux/UNIX" \
  --start-time "$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --query 'SpotPriceHistory[*].[SpotPrice,Timestamp]' \
  --output table | head -5
Output
=== Pricing for m6i.large in us-east-1 ===
On-Demand price per hour: 0.096
Spot price history (last 24h):
--------------------------
| SpotPrice | Timestamp |
+-------------+---------------------+
| 0.0288 | 2026-05-11T15:00Z |
| 0.0301 | 2026-05-11T12:00Z |
| 0.0275 | 2026-05-11T09:00Z |
+-------------+---------------------+
Estimated 1-year Compute Savings Plan price: ~0.058/hr (40% off On-Demand)
Reserved Instances: Lock-in Trap
Standard RIs lock you into a specific instance family and region (or a single AZ for zonal RIs) for 1 or 3 years. If your workload changes, you're stuck paying for capacity you don't use, or trying to resell the remainder on the Reserved Instance Marketplace. Compute Savings Plans discount slightly less but cover all EC2 families and regions. Unless you are absolutely certain about your instance footprint for 3 years, choose Savings Plans over RIs.
Production Insight
I've seen teams buy 3-year Standard RIs for 'production servers' that were later migrated to containers on Fargate. The RI purchase was a $100k sunk cost. Now we always recommend Compute Savings Plans instead: still a spend commitment, but one that follows you across instance families and regions, with 40-60% savings. For truly elastic workloads, we layer Spot on top of Savings Plans. The combined strategy can cut costs by 60-80% compared to pure On-Demand.
Key Takeaway
Use On-Demand for flexibility, Savings Plans for steady-state discount (most flexible), Spot for interruptible workloads. Never buy Standard RIs unless you are 100% certain about instance type and region for the entire term. Start with Compute Savings Plans for any committed spend.

Security Groups — The Firewall That Saves You from Crypto Miners

Security groups (SGs) are stateful virtual firewalls attached to EC2 (and other AWS resources). They control inbound and outbound traffic based on rules you define.

Key properties:
  • Stateful: if you allow inbound on port 80, response traffic is automatically allowed outbound regardless of outbound rules.
  • Explicit allow: there are no deny rules, only allow. Traffic that isn't allowed is implicitly denied.
  • Reference other security groups: you can allow inbound from another SG (e.g., allow HTTP from ALB SG), which is more secure than IP ranges.
  • You can attach up to 5 SGs per network interface by default (an adjustable quota).

Common patterns:
  • Web tier: allow HTTP (80) and HTTPS (443) from 0.0.0.0/0; allow SSH from your office IP only.
  • App tier: allow traffic only from the web tier SG on application port.
  • Database tier: allow traffic only from app tier SG on DB port (e.g., 3306 for MySQL). Never allow DB ports from 0.0.0.0/0.
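The tier patterns above translate into one ingress rule per hop, each naming the upstream security group instead of an IP range. The sketch below builds the command as text so it can be reviewed before anything changes; `tier_rule_cmd` is a made-up helper and both SG IDs are placeholders.

```bash
#!/bin/bash
# Sketch: build (not execute) the CLI call that lets one tier reach another
# by security group ID rather than by CIDR.

# tier_rule_cmd TARGET_SG SOURCE_SG PORT
tier_rule_cmd() {
  local target_sg=$1 source_sg=$2 port=$3
  echo "aws ec2 authorize-security-group-ingress" \
       "--group-id $target_sg --protocol tcp --port $port" \
       "--source-group $source_sg"
}

APP_SG="sg-0aaaaaaaaaaaaaaaa"   # placeholder: app-tier security group
DB_SG="sg-0bbbbbbbbbbbbbbbb"    # placeholder: database-tier security group

# Print the command that opens MySQL on the DB tier to the app tier only.
tier_rule_cmd "$DB_SG" "$APP_SG" 3306
```

Because the rule references the group, instances can be added to or removed from the app tier without touching the database rule.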

Mistake to avoid: Opening all ports to 0.0.0.0/0 for 'convenience'. That's how your instance becomes a crypto mining node overnight.

SSH-specific best practices:
  • Never open 0.0.0.0/0 on port 22. Ever. Automated scanners sweep the public IPv4 space around the clock.
  • Use AWS Systems Manager Session Manager instead of SSH: no public IP needed, no SSH keys, audit logs built-in.
  • If you must use SSH, restrict to your office IP using --cidr $(curl -s https://checkip.amazonaws.com)/32.
  • Use EC2 Instance Connect (temporary SSH key pushed via IAM).
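Building the /32 straight from a `curl` call has one weak spot: if the lookup fails or returns something unexpected, the rule is created from garbage. A small guard, sketched below, validates the address first. The function name `ssh_cidr_for` is an assumption for this example.

```bash
#!/bin/bash
# Sketch: turn a detected public IP into a /32 CIDR, refusing anything that
# is not a plain IPv4 address (empty string, error page, 0.0.0.0).

# ssh_cidr_for IP -> prints IP/32, or returns non-zero with a message on stderr
ssh_cidr_for() {
  local ip=$1 octet
  local re='^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$'
  [[ $ip =~ $re ]] || { echo "not an IPv4 address: '$ip'" >&2; return 1; }
  for octet in "${BASH_REMATCH[@]:1}"; do
    # 10# forces base 10 so an octet like 08 is not read as octal
    (( 10#$octet <= 255 )) || { echo "octet out of range: $ip" >&2; return 1; }
  done
  [ "$ip" != "0.0.0.0" ] || { echo "refusing 0.0.0.0" >&2; return 1; }
  echo "$ip/32"
}

ssh_cidr_for 203.0.113.45   # -> 203.0.113.45/32
```

In use, the result feeds the ingress rule only when validation passes, for example `CIDR=$(ssh_cidr_for "$(curl -s https://checkip.amazonaws.com)") && aws ec2 authorize-security-group-ingress --group-id "$SG_ID" --protocol tcp --port 22 --cidr "$CIDR"`, where `SG_ID` is your own group.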

Worst-case scenario: A client opened port 3306 (MySQL) to 0.0.0.0/0 for 'easy testing' and forgot to revert. Within 3 hours, their database was publicly accessible and a script dumped all tables. The data breach cost $500k in fines.

io/thecodeforge/aws/ec2_security_group_fix.sh (Bash)
#!/bin/bash
# TheCodeForge: Fix dangerous security group rules

# Find security groups with SSH open to 0.0.0.0/0
echo "=== Security groups with SSH open to all ==="
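# The query requires a single rule to both cover port 22 and allow 0.0.0.0/0.
# All-traffic rules (protocol -1) carry no port fields and need a separate check.
aws ec2 describe-security-groups \
  --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
  --query 'SecurityGroups[?IpPermissions[?FromPort<=`22` && ToPort>=`22` && IpRanges[?CidrIp==`0.0.0.0/0`]]].[GroupId,GroupName]' \
  --output table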

# Revoke dangerous rule (replace SG_ID with actual)
# aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 0.0.0.0/0

# Add rule for your current IP only
MY_IP=$(curl -s https://checkip.amazonaws.com)
echo "=== Adding SSH rule for your IP: $MY_IP ==="
# aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr "$MY_IP/32"

# Check for database ports open to internet
DB_PORTS=(3306 5432 1433 27017)
for port in "${DB_PORTS[@]}"; do
  echo "=== Checking port $port ==="
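  # Match only rules that cover this port AND allow 0.0.0.0/0
  aws ec2 describe-security-groups \
    --filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
    --query "SecurityGroups[?IpPermissions[?FromPort<=\`${port}\` && ToPort>=\`${port}\` && IpRanges[?CidrIp=='0.0.0.0/0']]].[GroupId,GroupName]" \
    --output table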
done

echo "=== Recommendation: Use Systems Manager Session Manager instead of SSH ==="
echo "aws ssm start-session --target i-0123456789abcdef0"
Output
=== Security groups with SSH open to all ===
----------------------------------------------
| GroupId | GroupName |
+-------------------------+-------------------+
| sg-0abcd1234efgh5678 | default |
| sg-0efgh5678ijkl9012 | web-app-sg |
+-------------------------+-------------------+
=== Adding SSH rule for your IP: 203.0.113.45 ===
=== Checking port 3306 (MySQL) ===
[No results — good]
=== Recommendation: Use Systems Manager Session Manager instead of SSH ===
aws ssm start-session --target i-0123456789abcdef0
The 0.0.0.0/0 Trap
SSH open to 0.0.0.0/0 means the entire internet can attempt to connect to your instance. Automated scanners sweep the IPv4 address space continuously, so a fresh public IP is often probed within minutes. A weak password or unpatched SSH version = compromise within hours. Fix: use Systems Manager Session Manager (no public IP needed) or restrict SSH to your IP range.
Production Insight
A client once opened port 3306 to 0.0.0.0/0 for 'easy testing' and forgot to revert.
Within 3 hours, their database was publicly accessible and a script dumped all tables.
The data breach cost $500k in fines and customer compensation.
Rule: use security group references, not IP ranges, for intra-VPC communication.
And never, ever open database ports to the internet — use a bastion host or SSM port forwarding.
Key Takeaway
Security groups are stateful, allow-only firewalls.
Never open 0.0.0.0/0 on SSH (port 22). Use Session Manager instead.
Use security group references for intra-VPC traffic (e.g., allow app tier to talk to DB tier by SG ID).
Start with no inbound rules, add only what you need.

EBS Volumes — gp3 vs gp2 and the Burst Credit Trap

Amazon Elastic Block Store (EBS) provides block-level storage volumes that persist independently from your EC2 instance. Think of it as an external hard drive you can attach/detach at will.

Volume types:
  • gp3 (General Purpose SSD): baseline 3000 IOPS at any size, with additional IOPS (up to 16000) provisioned separately for a fee. No burst credits. Good for most workloads. Cost-optimised. Recommended for new deployments.
  • gp2 (older): IOPS tied to volume size. Baseline 3 IOPS per GB (minimum 100), with burst credits that lift smaller volumes to 3000 IOPS. Avoid for new deployments unless you need compatibility.
  • io1/io2 (Provisioned IOPS): guaranteed IOPS, expensive. For databases requiring consistent, high IOPS.
  • st1 (Throughput Optimized HDD): cheap, high throughput. For log processing, big data.
  • sc1 (Cold HDD): lowest cost, infrequent access.

The gp2 burst trap: gp2 earns I/O credits when idle (like CPU credits). Under sustained high I/O, credits exhaust, and performance drops from 3000 IOPS to baseline (3 IOPS per GB). A 100GB gp2 volume would drop to 300 IOPS.
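The drop described above can be estimated ahead of time. The sketch below assumes the gp2 figures AWS documents (3 IOPS per GiB with a floor of 100 and a ceiling of 16,000, a 3,000 IOPS burst rate, and a credit bucket of 5.4 million I/O credits) and a workload that drives the volume at the full burst rate. Function names are illustrative.

```bash
#!/bin/bash
# Sketch: gp2 baseline IOPS and how long a full burst bucket lasts.

# gp2_baseline_iops SIZE_GIB -> 3 IOPS per GiB, floor 100, ceiling 16000
gp2_baseline_iops() {
  local iops=$(( $1 * 3 ))
  if (( iops < 100 )); then iops=100; fi
  if (( iops > 16000 )); then iops=16000; fi
  echo "$iops"
}

# gp2_burst_seconds SIZE_GIB -> seconds a full bucket sustains 3000 IOPS
gp2_burst_seconds() {
  local baseline bucket=5400000 burst=3000
  baseline=$(gp2_baseline_iops "$1")
  if (( baseline >= burst )); then
    echo "unlimited"   # baseline already meets the burst rate
  else
    # credits drain at (burst - baseline) per second while bursting
    echo $(( bucket / (burst - baseline) ))
  fi
}

gp2_baseline_iops 100   # -> 300
gp2_burst_seconds 100   # -> 2000 (about 33 minutes)
```

A 100 GB volume can therefore hold 3,000 IOPS for only about half an hour before it falls back to 300, which matches the pattern of a database that looks healthy until a long batch job or traffic peak arrives.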

gp3 eliminates burst credits: baseline 3000 IOPS regardless of size, and it's often cheaper than gp2. Migrate any gp2 volumes to gp3 for consistent performance and lower cost.

Performance tip: for high-performance databases, use io2 Block Express (up to 256,000 IOPS).

Encryption warning: By default, EBS encryption is not enabled in a new account. Enable it at the account level via the EC2 console settings; the setting is per Region, so repeat it in every Region you use. Otherwise, any snapshot you share accidentally could leak data.

io/thecodeforge/aws/ec2_ebs_migrate_gp3.sh (Bash)
#!/bin/bash
# TheCodeForge: Migrate gp2 volumes to gp3 for cost and performance

# List all gp2 volumes
echo "=== gp2 volumes in account ==="
aws ec2 describe-volumes --filters Name=volume-type,Values=gp2 \
  --query 'Volumes[*].[VolumeId,Size,Attachments[0].InstanceId,CreateTime]' \
  --output table

# Check burst credit balance for a gp2 volume
VOLUME_ID="vol-0123456789abcdef0"
echo "=== gp2 burst credit balance for $VOLUME_ID ==="
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS \
  --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=$VOLUME_ID \
  --statistics Minimum \
  --period 3600 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --output text

# Migrate volume from gp2 to gp3 (no downtime, online operation)
echo "=== Migrating $VOLUME_ID from gp2 to gp3 ==="
aws ec2 modify-volume \
  --volume-id $VOLUME_ID \
  --volume-type gp3 \
  --iops 3000

echo "=== Migration started. Monitor with: ==="
aws ec2 describe-volumes-modifications --volume-ids $VOLUME_ID

# Enable EBS encryption by default for future volumes
echo "=== Enabling EBS encryption by default ==="
aws ec2 enable-ebs-encryption-by-default
Output
=== gp2 volumes in account ===
-------------------------------------------------
| VolumeId | Size | InstanceId | CreateTime |
+--------------------+------+--------------------+-------------------+
| vol-0abcd1234efgh | 100 | i-0abcd1234efgh | 2025-01-15T10:00Z |
| vol-0efgh5678ijkl | 50 | i-0efgh5678ijkl | 2025-01-20T14:30Z |
+--------------------+------+--------------------+-------------------+
=== gp2 burst credit balance for vol-0abcd1234efgh ===
0.0 <- credits exhausted, performance degraded
=== Migrating vol-0abcd1234efgh from gp2 to gp3 ===
{
"VolumeModification": {
"VolumeId": "vol-0abcd1234efgh",
"TargetVolumeType": "gp3",
"Progress": 0
}
}
=== Enabling EBS encryption by default ===
{
"EbsEncryptionByDefault": true
}
gp2 Burst Credits Exhaust = Slow Database
gp2 volumes have burst credits that deplete under sustained high I/O. A 100GB gp2 volume baseline is only 300 IOPS. If your database needs 2000 IOPS consistently, you'll exhaust credits and performance tanks. Migrate to gp3: baseline 3000 IOPS regardless of size. Often cheaper too.
Production Insight
We had an incident where a large batch job filled the root volume.
The server froze: no space for logs, no SSH.
The fix: resize via API and extend filesystem — 2 minutes.
But the outage cost $12,000 in missed SLAs.
Rule: set CloudWatch alarms on disk space at 80% and 90% (disk usage is reported by the CloudWatch agent, not by the default EC2 metrics).
Also have a runbook for quick resizing.
The gp2 burst trap: a database on gp2 worked fine for months, then a Black Friday spike exhausted credits. Performance dropped from 3000 IOPS to 300 IOPS. Queries took 30 seconds. This is why we recommend gp3 for production.
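As a stopgap until the CloudWatch alarms exist, the same 80% and 90% thresholds can be checked on the host itself, for example from cron. This is a local sketch only: the function names are invented, and it reads `df` rather than CloudWatch.

```bash
#!/bin/bash
# Sketch: classify disk usage against the 80% / 90% thresholds.

# disk_used_pct MOUNTPOINT -> integer percentage used (POSIX df output)
disk_used_pct() {
  df -P "$1" | awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# disk_alert_level PCT -> ok | warning | critical
disk_alert_level() {
  local pct=$1
  if (( pct >= 90 )); then
    echo "critical"
  elif (( pct >= 80 )); then
    echo "warning"
  else
    echo "ok"
  fi
}

# Report the root filesystem, skipping the check if df returned nothing.
pct=$(disk_used_pct /)
if [ -n "$pct" ]; then
  echo "/ is ${pct}% full: $(disk_alert_level "$pct")"
fi
```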
Key Takeaway
gp3 is the default choice for new volumes: baseline 3000 IOPS, no burst credits, often cheaper than gp2.
Migrate existing gp2 volumes to gp3 (online, no downtime).
Set CloudWatch alarms on disk space at 80% and 90% — this prevents freezes.
Encrypt all EBS volumes by default at the account level.
Choose your EBS volume type
If: Boot volume or general purpose app (< 16k IOPS needed)
Use: gp3, the default choice. Baseline 3000 IOPS, provisionable up to 16000. Cheaper than gp2.
If: Production database (Oracle, MySQL, PostgreSQL) requiring consistent high IOPS
Use: io2 or io2 Block Express. Provision IOPS based on DB workload profile (e.g., 10000 IOPS for 2TB database).
If: Data warehouse / analytics with sequential reads
Use: st1. High throughput at low cost. Not for random I/O.
If: Infrequently accessed archives (backups, logs older than 90 days)
Use: sc1. Lowest cost. Acceptable for archival data.
If: Existing gp2 volume (legacy)
Use: Migrate to gp3 immediately. Online operation, no downtime. Lower cost + better performance.

Storage Comparison — EBS vs Instance Store

EC2 instances have two storage options: Elastic Block Store (EBS) volumes and Instance Store volumes (ephemeral). They differ fundamentally in persistence, performance, and pricing. Choosing incorrectly can lead to data loss or unexpected costs.

EBS (Elastic Block Store):
  • Network-attached block storage that persists independently of the instance.
  • Can be detached and reattached to another instance.
  • Data survives instance stop and start. On terminate, the root volume is deleted by default ('Delete on termination' is on), while additional volumes are kept unless that flag is set.
  • Multiple volume types (gp3, io2, st1, sc1) with different performance/cost profiles.
  • Billed per GB-month plus any provisioned IOPS/throughput.
  • Typical latency: 1-5 ms.

Instance Store (Ephemeral Storage):
  • Physically attached to the host server that runs the instance.
  • Data is lost when the instance is stopped, hibernated, or terminated, or when the underlying host fails. It survives a reboot.
  • Included in the instance price, with no separate billing.
  • Extremely low latency (sub-millisecond) and very high IOPS (millions on NVMe).
  • Only available on certain instance types (i3, i4i, m5d, c5d, r5d, etc.; a 'd' in the type name, as in m5d, marks local NVMe).
  • Cannot be detached or moved to another instance.

| Feature | EBS | Instance Store |
|---|---|---|
| Persistence | Survives stop/terminate | Lost on stop/terminate/host failure |
| Performance | Network-attached, 1-5ms latency | Direct-attached, sub-ms latency, millions IOPS |
| Cost | Pay per GB-month + IOPS | Included in instance price (no extra) |
| Size limit | Up to 64 TiB per volume, depending on volume type | Up to ~60 TB per instance (multiple NVMe) |
| Snapshot support | Yes (snapshots to S3) | No |
| Encryption | Supports KMS/SSE | Supports instance-level encryption |
| Detach/reattach | Yes | No |
| Use cases | Databases, OS boot volumes, persistent application data | Temporary storage, caches, scratch data, log processing, swap |

Best practices:
  • Always use EBS for persistent data like databases, application state, and logs you need to keep.
  • Use Instance Store for temporary data that can be regenerated: build caches, intermediate processing results, swap space, or data replicated from another source.
  • Many production architectures combine both: boot from EBS (gp3), and mount instance store NVMe for high-performance scratch space (e.g., database temp tables, MapReduce shuffle).
  • If you use Instance Store for anything important, replicate it across instances or to a shared EBS/EFS/S3 to avoid single-point-of-failure.

Common mistake: A developer used an instance store volume as the primary data store for a stateful application. When the instance was stopped for a security patch, all data was lost. Recovery took days from backups. Always review the 'Delete on termination' flag and instance type storage options before launching.
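Reviewing the 'Delete on termination' flag is easier when changing it is a one-liner. The sketch below only builds the JSON that `aws ec2 modify-instance-attribute --block-device-mappings` expects; `dot_mapping` is a made-up helper, and the device name and instance ID are placeholders to be replaced with values from `describe-instances`.

```bash
#!/bin/bash
# Sketch: build the block-device mapping JSON that flips DeleteOnTermination.

# dot_mapping DEVICE_NAME true|false
dot_mapping() {
  local device=$1 flag=$2
  case "$flag" in
    true|false) ;;
    *) echo "flag must be true or false" >&2; return 1 ;;
  esac
  printf '[{"DeviceName":"%s","Ebs":{"DeleteOnTermination":%s}}]\n' "$device" "$flag"
}

# Keep the root volume when the instance is terminated.
dot_mapping /dev/xvda false

# Apply it (placeholder instance ID):
# aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
#   --block-device-mappings "$(dot_mapping /dev/xvda false)"
```

This covers EBS volumes only. Instance store data has no equivalent flag and is gone as soon as the instance stops.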

io/thecodeforge/aws/ec2_storage_check.shBASH
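#!/bin/bash
# TheCodeForge: Check instance storage details

# Placeholder instance ID; replace with your own
INSTANCE_ID="i-0123456789abcdef0"

echo "=== Check if instance has Instance Store volumes ==="
# Block device mappings list EBS devices only; instance store volumes
# never appear here, so the type-level check at the end is what counts.
aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].BlockDeviceMappings[*]'

echo "=== List all EBS volumes attached to instance ==="
aws ec2 describe-volumes --filters "Name=attachment.instance-id,Values=$INSTANCE_ID" \
  --query 'Volumes[*].[VolumeId,Size,VolumeType,State,Attachments[0].Device]' \
  --output table

echo "=== Check if instance type supports Instance Store ==="
INSTANCE_TYPE=$(aws ec2 describe-instances --instance-ids "$INSTANCE_ID" \
  --query 'Reservations[0].Instances[0].InstanceType' --output text)

echo "Instance type: $INSTANCE_TYPE"
# With --output text the boolean is printed as True or False
SUPPORTED=$(aws ec2 describe-instance-types --instance-types "$INSTANCE_TYPE" \
  --query 'InstanceTypes[0].InstanceStorageSupported' --output text)

case "$SUPPORTED" in
  [Tt]rue)
    # Show the total local storage for types that have it
    aws ec2 describe-instance-types --instance-types "$INSTANCE_TYPE" \
      --query 'InstanceTypes[0].[InstanceType, InstanceStorageInfo.TotalSizeInGB]' \
      --output table
    ;;
  *)
    echo "No instance store volumes. (InstanceStorageSupported: false)"
    echo "For instance store support, choose types with 'd' suffix: i3.large, m5d.large, c5d.large, r5d.large, etc."
    ;;
esac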
Output
=== Check if instance has Instance Store volumes ===
[
{
"DeviceName": "/dev/xvda",
"Ebs": {
"VolumeId": "vol-0abcd1234efgh",
"Status": "attached"
}
}
]
=== List all EBS volumes attached to instance ===
+-------------------+------+------------+--------+-----------+
| VolumeId          | Size | VolumeType | State  | Device    |
+-------------------+------+------------+--------+-----------+
| vol-0abcd1234efgh | 100  | gp3        | in-use | /dev/xvda |
+-------------------+------+------------+--------+-----------+
=== Check if instance type supports Instance Store ===
Instance type: m6i.large
No instance store volumes. (InstanceStorageSupported: false)
For instance store support, choose types with 'd' suffix: i3.large, m5d.large, c5d.large, r5d.large, etc.
Instance Store is Ephemeral — Don't Treat It Like a Hard Drive
Instance store volumes are physically attached to the host server. If the instance stops, terminates, or the host fails, all data on instance store is permanently lost. There is no 'undelete' and no snapshot to restore from. Use it only for cache, temp files, or data that can be regenerated from another source (e.g., S3, database replica). Never use it as the primary storage for databases or user data.
Production Insight
A team ran a Redis cache on an i3 instance's NVMe instance store. The cache was rebuildable from the primary database, so it was a good fit. But they also stored user session data in the same NVMe. When AWS scheduled a host replacement for hardware maintenance, all sessions were lost — users were logged out globally. They had to add a warm-up routine to rebuild the cache from backups. Lesson: know what lives on instance store and have a recovery plan for its potential data loss.
Key Takeaway
EBS is persistent, detachable, and billed separately; Instance Store is ephemeral, fast, and included in the instance price. Use EBS for anything you need to keep. Use Instance Store for temporary, high-I/O workloads that can tolerate loss. Always have a data replication or recovery strategy if using Instance Store.
● Production incident · POST-MORTEM · severity: high

The t3.micro That Cost $2,300 in One Weekend

Symptom
Instance CPU pegged at 100% for 48 hours straight. Network egress spikes to unknown IPs. CloudWatch shows CPUUtilization at 100% but t3.micro credits exhausted (CPUCreditBalance = 0). Unknown processes consuming memory. AWS sends 'suspicious activity' notification.
Assumption
The developer assumed 'light testing' meant the instance would be short-lived and terminated afterwards. They set no expiration tag or auto-termination, didn't monitor billing, and thought t3.micro was 'too small to matter'. They also assumed a security group with 0.0.0.0/0 on SSH was fine because 'no one will find this random IP'. In practice, automated scanners sweep the entire IPv4 space continuously, so a new public IP is usually probed within minutes.
Root cause
Security group inbound rule allowed SSH (port 22) from 0.0.0.0/0. Bots brute-forced a weak password within 24 hours. The attacker installed a crypto miner that consumed 100% CPU. The t3.micro ran out of CPU credits but, because T3 instances launch in 'unlimited' mode by default, it kept bursting and accrued surplus-credit charges instead of being throttled. Data transfer to the mining pool cost $1,800. The attacker also used the instance's IAM role, which had S3 read access, to exfiltrate 200GB of data to an external server. The instance had no CloudWatch alarm on CPU, disk, or billing. No one noticed for 48 hours until AWS suspended the account.
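To put the 'unlimited' mode charge in proportion, here is a rough arithmetic sketch. The inputs are assumptions and not figures from the incident: 2 vCPUs and a 10% per-vCPU baseline for t3.micro, and a Linux surplus rate of $0.05 per vCPU-hour (check current pricing for your Region). The helper name `surplus_cost` is hypothetical, and the estimate ignores any credits banked before the load started.

```bash
#!/usr/bin/env bash
# Estimate T3 unlimited surplus-credit charges for sustained CPU load.
# $1 = vCPUs, $2 = baseline %, $3 = actual utilization %,
# $4 = hours, $5 = surplus rate in $ per vCPU-hour (assumed, see above)
surplus_cost() {
  awk -v v="$1" -v b="$2" -v u="$3" -v h="$4" -v r="$5" 'BEGIN {
    over = (u - b) / 100            # fraction of each vCPU above baseline
    if (over < 0) over = 0          # below baseline there is no surplus
    printf "%.2f\n", v * over * h * r
  }'
}

# 48 hours at 100% CPU on a t3.micro under the assumptions above
surplus_cost 2 10 100 48 0.05   # prints 4.32
```

Under these assumptions the surplus CPU charge is a few dollars, which suggests the bulk of a bill like this one comes from data transfer rather than from the burst pricing itself.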
Fix
1. Terminated the compromised instance immediately.
2. Rotated all IAM credentials associated with the instance's role.
3. Revoked all access keys that were ever on the instance.
4. Deleted the security group rule allowing 0.0.0.0/0 on SSH.
5. Added CloudWatch alarm: CPUUtilization > 80% for 5 minutes → page on-call.
6. Added billing alarm: EstimatedCharges > $100 → email.
7. Implemented AWS Systems Manager Session Manager for prod access (no SSH keys, no public IP).
8. Enabled VPC Flow Logs to audit traffic patterns.
9. Added mandatory tagging (CostCenter, Environment, ExpirationDate) for all instances.
10. Created a Lambda function that auto-terminates instances older than 7 days with the 'testing' tag.
Key lesson
  • 0.0.0.0/0 on SSH is not 'low risk': with password logins or weak credentials, expect compromise within 24-48 hours.
  • Every account needs a CloudWatch billing alarm, and every instance needs CPU/disk monitoring.
  • t3.micro is fine for testing, but tag test instances with an expiration date and terminate them automatically.
  • Don't attach IAM roles with S3 read access to public-facing instances.
  • Use AWS Systems Manager Session Manager instead of SSH for production access.
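The first lesson lends itself to an automated check. The sketch below assumes a hypothetical input format of one `port cidr` pair per line (for example, extracted from your security group rules by whatever tooling you already use); the function names are illustrative. It flags SSH and RDP ports that are open to the whole internet over IPv4 or IPv6.

```bash
#!/usr/bin/env bash
# Succeeds (exit 0) when a CIDR allows the whole internet.
is_world_open() {
  case "$1" in
    0.0.0.0/0|::/0) return 0 ;;
    *)              return 1 ;;
  esac
}

# Reads "port cidr" pairs on stdin and prints the risky admin-port rules.
flag_open_admin_ports() {
  local port cidr
  while read -r port cidr; do
    case "$port" in
      22|3389) is_world_open "$cidr" && echo "RISK: port $port open to $cidr" ;;
    esac
  done
  return 0
}

# Example input: only the first rule is flagged.
printf '22 0.0.0.0/0\n443 0.0.0.0/0\n22 203.0.113.7/32\n' | flag_open_admin_ports
```

Port 443 open to the world is normal for a public web server, which is why the check is limited to administrative ports.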
Production debug guide · Debug connectivity, performance, and configuration issues fast. · 5 entries
Symptom · 01
SSH connection times out or 'Connection refused'
Fix
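A timeout usually means the traffic never arrived: check security group inbound rules for port 22 and ensure your IP is allowed. Verify the instance has a public IP (or you're using a bastion host). Check network ACLs and route tables, and ensure the VPC has an internet gateway attached and the subnet's route table has a 0.0.0.0/0 route to it. 'Connection refused' means the packet reached the instance but nothing is listening on port 22, so check that sshd is running.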
Symptom · 02
Instance status checks show 2/2 but app is slow
Fix
Check CloudWatch metrics: CPU utilisation, memory (install CloudWatch agent), EBS burst balance (for gp2). If gp2 burst credits exhausted, switch to gp3. For memory pressure, consider a larger instance type or enable swap.
Symptom · 03
Instance stops responding after a few days
Fix
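Check whether a T2/T3 instance in standard mode has exhausted its CPU credits (CPUCreditBalance near 0) and is being throttled to baseline; unlimited mode keeps bursting but bills for surplus credits. Verify OS-level disk usage with df -h; the EBS volume might be full. Also check for the OOM killer in /var/log/kern.log (or /var/log/messages and dmesg, depending on the distribution).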
Symptom · 04
Can't attach an EBS volume — 'Invalid volume' or 'Attachment limit'
Fix
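Check if the volume is already attached to another instance, and confirm it is in the same Availability Zone as the instance. Each instance type has a maximum number of attachments (for example, most Nitro instances share a limit of 28 across EBS volumes, network interfaces, and NVMe instance store volumes). Detach unused volumes. If the volume is in 'error' state, restore a new volume from its most recent snapshot.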
Symptom · 05
Instance launched but never passes system checks
Fix
Review the instance's system log (aws ec2 get-console-output) or its console screenshot for boot errors. Common causes: missing kernel, corrupt AMI, or wrong architecture. Try launching with a different AMI or instance type.
★ EC2 Quick Debug Cheat Sheet · Five common EC2 issues and the exact commands to diagnose and fix them.
SSH timeout
Immediate action
Check security group inbound rules
Commands
aws ec2 describe-security-groups --group-ids <sg-id> --query 'SecurityGroups[0].IpPermissions[?ToPort==`22`]'
nslookup <public-dns> or nc -vz -w 5 <public-ip> 22 (ping only works if the security group allows ICMP)
Fix now
Add your current IP to the SSH rule: aws ec2 authorize-security-group-ingress --group-id <sg-id> --protocol tcp --port 22 --cidr "$(curl -s https://checkip.amazonaws.com)/32"
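The one-liner above trusts whatever the IP lookup returns. If the lookup fails or returns an error page, the `--cidr` value is garbage. A minimal sketch of a guard, with the hypothetical helper name `to_cidr32`, validates the address before it is used:

```bash
#!/usr/bin/env bash
# Turn a plain IPv4 address into a /32 CIDR, rejecting anything else.
to_cidr32() {
  local ip="$1" octet old_ifs
  # Shape check: four dot-separated groups of 1-3 digits.
  printf '%s\n' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  # Range check: every octet must be 0-255.
  old_ifs="$IFS"; IFS=.
  set -- $ip
  IFS="$old_ifs"
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  printf '%s/32\n' "$ip"
}

# Intended use (left commented out so the sketch runs without network access):
# my_ip="$(curl -s https://checkip.amazonaws.com)"
# cidr="$(to_cidr32 "$my_ip")" || { echo "could not determine public IP" >&2; exit 1; }
# aws ec2 authorize-security-group-ingress --group-id <sg-id> \
#   --protocol tcp --port 22 --cidr "$cidr"

to_cidr32 203.0.113.7
```

Failing closed here is deliberate: a missing rule is an inconvenience, while a malformed or overly broad one is an exposure.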
High CPU (suspected crypto mining)
Immediate action
Identify top processes via AWS SSM or serial console
Commands
ssh -i key.pem ec2-user@<ip> 'top -bn1 | head -20'
aws cloudwatch get-metric-statistics --metric-name CPUUtilization --namespace AWS/EC2 --statistics Average --period 300 --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --dimensions Name=InstanceId,Value=<instance-id>
Fix now
If crypto-mining suspected (unknown process 'xmrig', 'minerd', CPU 100% constant), take snapshot and terminate immediately. Do not 'stop' — terminate.
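CloudWatch expects `--start-time` and `--end-time` as ISO 8601 timestamps, and `date -d` with a relative offset is GNU-specific. The following sketch builds the time window in a way that should work with both GNU and BSD/macOS `date`; the helper name `iso_utc_ago` is an illustrative choice.

```bash
#!/usr/bin/env bash
# Print a UTC timestamp N seconds in the past, in ISO 8601 form.
iso_utc_ago() {
  local target
  target=$(( $(date +%s) - $1 ))
  # GNU date understands "-d @epoch"; BSD/macOS date uses "-r epoch".
  date -u -d "@$target" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
    || date -u -r "$target" +%Y-%m-%dT%H:%M:%SZ
}

START="$(iso_utc_ago 3600)"   # one hour ago
END="$(iso_utc_ago 0)"        # now
echo "--start-time $START --end-time $END"
```

The echoed flags can be pasted into a `get-metric-statistics` call, or the two variables can be passed to it directly.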
Disk full (instance frozen)
Immediate action
Check disk usage and find large files
Commands
ssh <instance> 'df -h && du -sh /* 2>/dev/null | sort -rh | head -10'
aws ec2 modify-volume --volume-id <vol> --size <new-size> --region <region>
Fix now
Resize the EBS volume (increase size), grow the partition with growpart if the volume is partitioned, then extend the filesystem (xfs_growfs or resize2fs). Set a CloudWatch alarm on disk space at 80%.
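Until a CloudWatch disk alarm is in place, the 80% threshold can be checked on the instance itself. This is a minimal sketch that parses `df -P` output; the function name `mounts_over` is hypothetical, and mount points containing spaces are not handled.

```bash
#!/usr/bin/env bash
# Read `df -P` output on stdin and print mounts at or above a usage threshold.
mounts_over() {
  awk -v limit="$1" 'NR > 1 {          # skip the header row
    pct = $5
    sub(/%/, "", pct)                  # "83%" -> 83
    if (pct + 0 >= limit) print $6 " " pct "%"
  }'
}

# Report every filesystem that is at least 80% full.
df -P | mounts_over 80
```

Run from cron or a systemd timer, the same function can feed a notification script of your choice.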
Instance terminates unexpectedly
Immediate action
Check if termination protection was enabled
Commands
aws ec2 describe-instance-attribute --instance-id <id> --attribute disableApiTermination --query 'DisableApiTermination.Value'
aws ec2 describe-instances --instance-ids <id> --query 'Reservations[0].Instances[0].StateReason'
Fix now
Enable termination protection: aws ec2 modify-instance-attribute --instance-id <id> --attribute disableApiTermination --value true. Check CloudTrail for who terminated.
EBS gp2 volume performance degraded
Immediate action
Check burst credit balance
Commands
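aws cloudwatch get-metric-statistics --metric-name BurstBalance --namespace AWS/EBS --statistics Average --period 300 --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --dimensions Name=VolumeId,Value=<vol-id>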
aws ec2 modify-volume --volume-id <vol-id> --volume-type gp3 --region <region>
Fix now
Migrate from gp2 to gp3. gp3 has baseline 3000 IOPS regardless of size, no burst credits. Cheaper and more predictable.
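Why small gp2 volumes degrade is easier to see with numbers. gp2 baseline performance scales with size at 3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000, while gp3 starts at 3,000 regardless of size. The helper name `gp2_baseline_iops` below is illustrative.

```bash
#!/usr/bin/env bash
# gp2 baseline IOPS: 3 per GiB, floor 100, ceiling 16000.
gp2_baseline_iops() {
  local iops=$(( $1 * 3 ))
  [ "$iops" -lt 100 ] && iops=100
  [ "$iops" -gt 16000 ] && iops=16000
  echo "$iops"
}

GP3_BASELINE=3000
for size in 20 100 1000 6000; do
  echo "${size} GiB: gp2 baseline $(gp2_baseline_iops "$size") IOPS, gp3 baseline ${GP3_BASELINE} IOPS"
done
```

Any gp2 volume under 1,000 GiB has a baseline below gp3's and depends on burst credits to reach 3,000 IOPS, which is exactly what BurstBalance tracks.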
EC2 vs Container (ECS/EKS) vs Lambda – When to Use What
| Criterion | EC2 | ECS/EKS (Containers) | Lambda (Serverless) |
|---|---|---|---|
| Control over OS and runtime | Full control (OS, kernel, packages) | Moderate (container image + host OS) | None (managed runtime) |
| Cold start latency | None (always on) | Low (container warm-up) | High (first invocation ~200ms–1s) |
| Cost for steady 24/7 workload | Low (with Savings Plans) | Moderate (no OS licensing, but cluster cost) | High (charged per request + duration) |
| Scaling granularity | Manual or via Auto Scaling (minutes) | Fast (seconds to minutes) | Instant (per request) |
| Persistence | EBS, instance store | EBS, EFS (attached per task) | Stateless (use S3, DynamoDB for state) |
| Ideal use case | Legacy apps, databases, long-running services | Microservices, batch jobs | Event-driven APIs, scheduled tasks, data processing |

Key takeaways

1. EC2 is virtual servers in AWS: you pay per second and stop paying for compute when the instance is stopped.
2. Instance types encode workload: t for burstable, c for compute, r for memory, i for storage, g for GPU.
3. Security groups are stateful, allow-only firewalls; restrict SSH to your IP and use SG references for intra-VPC traffic.
4. EBS volumes persist separately; prefer gp3, which is cheaper than gp2; always snapshot before resizing.
5. Pricing mix: On-Demand for flexibility, Savings Plans for steady state, Spot for disposable workloads.
6. The three biggest mistakes: opening SSH to 0.0.0.0/0, leaving instances running, and storing secrets in plaintext.
7. Set a CloudWatch billing alarm at $100. Set disk space alarms at 80% and 90%. You'll catch most surprises early.
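For the billing alarm in the last takeaway, here is a sketch that prints the `put-metric-alarm` command rather than running it, so it can be reviewed first. The alarm name and the SNS topic ARN are placeholders, and the function name `billing_alarm_cmd` is hypothetical. Billing metrics live in us-east-1 only and require billing alerts to be enabled in the account's billing preferences.

```bash
#!/usr/bin/env bash
# Print (do not run) a put-metric-alarm command for an account billing alarm.
# $1 = threshold in USD, $2 = SNS topic ARN to notify (placeholder below)
billing_alarm_cmd() {
  local threshold="$1" topic_arn="$2"
  cat <<EOF
aws cloudwatch put-metric-alarm \\
  --region us-east-1 \\
  --alarm-name billing-over-${threshold}-usd \\
  --namespace AWS/Billing --metric-name EstimatedCharges \\
  --dimensions Name=Currency,Value=USD \\
  --statistic Maximum --period 21600 --evaluation-periods 1 \\
  --threshold ${threshold} --comparison-operator GreaterThanThreshold \\
  --alarm-actions ${topic_arn}
EOF
}

billing_alarm_cmd 100 "arn:aws:sns:us-east-1:123456789012:billing-alerts"
```

The six-hour period reflects how infrequently the EstimatedCharges metric is published; a shorter period would mostly evaluate missing data.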

Common mistakes to avoid

7 patterns

Opening SSH port to 0.0.0.0/0 and using default security groups

Symptom
Instance gets compromised via brute force. CPU spikes due to crypto mining. Data exfiltration. Surprise bills.
Fix
Restrict SSH to your IP using --cidr $(curl -s http://checkip.amazonaws.com)/32. Use Systems Manager Session Manager for production access. Create a custom security group with minimal rules.

Leaving instances running after testing

Symptom
Unbudgeted charges appear in the monthly AWS bill. A forgotten t3.medium costs roughly $30 a month at typical On-Demand rates; a forgotten GPU or large-memory instance can reach thousands.
Fix
Tag all instances with expiration tag (e.g., expiration-date: 2026-05-01). Set up CloudWatch alarms on billing. Use AWS Instance Scheduler to auto-stop instances outside work hours. Terminate instead of stop when done.
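An expiration tag is only useful if something evaluates it. The sketch below shows the comparison step alone, assuming the tag value uses the YYYY-MM-DD format from the example above; the function name `is_expired` is hypothetical, and fetching tags or terminating instances is left to your own tooling.

```bash
#!/usr/bin/env bash
# Succeeds when the expiration date is strictly before "today".
# $1 = expiration date (YYYY-MM-DD), $2 = today (optional, defaults to UTC now)
is_expired() {
  local expiry today
  # Dropping the hyphens turns each date into a comparable integer (20260501).
  expiry="$(printf '%s' "$1" | tr -d '-')"
  today="$(printf '%s' "${2:-$(date -u +%Y-%m-%d)}" | tr -d '-')"
  [ "$expiry" -lt "$today" ]
}

if is_expired "2026-05-01" "2026-06-01"; then
  echo "expired: safe to terminate"
fi
```

Passing the current date as an optional second argument keeps the check easy to test and to dry-run against future dates.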

Using t2/t3 micro for production services

Symptom
Performance degrades unpredictably under load. CPU credits exhausted, instance throttled. Latency spikes and timeouts.
Fix
Use burstable instances (t3) only for non-production or variable-load work. For production, choose m6i.large or larger. If you must use burstable, enable unlimited mode (costs extra when credits are negative).
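How quickly a burstable instance runs dry can be estimated with simple credit arithmetic. One CPU credit is one vCPU at 100% for one minute. The figures used below for t3.micro (12 credits earned per hour, a maximum balance of 288, 2 vCPUs) are taken from AWS's published tables and should be treated as assumptions to verify; the function name `minutes_until_throttle` is illustrative. The estimate applies to standard mode, where an empty balance means throttling.

```bash
#!/usr/bin/env bash
# Minutes of sustained load before the credit balance reaches zero.
# $1 = starting credit balance, $2 = credits earned per hour,
# $3 = vCPUs, $4 = sustained utilization in percent
minutes_until_throttle() {
  local spend net
  spend=$(( $3 * 60 * $4 / 100 ))   # credits spent per hour
  net=$(( spend - $2 ))             # net drain per hour
  if [ "$net" -le 0 ]; then
    echo never                      # at or below baseline, the balance never drains
  else
    echo $(( $1 * 60 / net ))
  fi
}

minutes_until_throttle 288 12 2 100   # prints 160
minutes_until_throttle 288 12 2 10    # prints never
```

Even starting from a full balance, a t3.micro at full load has well under three hours before it drops to baseline, which is why sustained production traffic belongs on a fixed-performance type.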

Storing secrets (API keys, passwords) in user data or AMIs

Symptom
If instance is compromised, attacker gains access to secret keys, potentially across accounts.
Fix
Store secrets in AWS Secrets Manager or Parameter Store. Use IAM roles attached to the instance (instance profile) to grant permissions without embedded keys. Never put secrets in plaintext in scripts.

Not encrypting EBS volumes by default

Symptom
If an EBS snapshot is shared accidentally or a volume is detached, raw data is exposed.
Fix
Enable EBS encryption by default for the account in the EC2 console settings. The setting is per Region, so enable it in every Region you use. Use KMS keys for customer-managed encryption. All new volumes will be encrypted automatically.

Not enabling termination protection on production instances

Symptom
Someone accidentally terminates a production instance via console or CLI, causing an application outage.
Fix
Enable termination protection at launch or via aws ec2 modify-instance-attribute --instance-id <id> --attribute disableApiTermination --value true. Set IAM policies that require MFA to disable termination protection.

Using the default VPC without understanding its limits

Symptom
Subnets run out of addresses because each default subnet is only a /20 inside the fixed 172.31.0.0/16 range. VPC peering fails or behaves unexpectedly because every default VPC uses that same range, so the CIDRs overlap.
Fix
Create a custom VPC with sufficient CIDR block (e.g., /16) for current and future needs. Use /21 or /20 for subnets based on projected growth. Never use the default VPC for production.
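When sizing subnets, remember that AWS reserves five addresses in every subnet (the first four and the last one). A small sketch of the arithmetic, with the hypothetical helper name `usable_ips`, for the /16 to /28 range that VPC subnets allow:

```bash
#!/usr/bin/env bash
# Usable IPv4 addresses in an AWS subnet of the given prefix length.
usable_ips() {
  local prefix="$1"
  if [ "$prefix" -lt 16 ] || [ "$prefix" -gt 28 ]; then
    echo "prefix must be between /16 and /28" >&2
    return 1
  fi
  # Total addresses in the block, minus the five AWS reserves per subnet.
  echo $(( (1 << (32 - prefix)) - 5 ))
}

for p in 16 20 21 24 28; do
  echo "/$p -> $(usable_ips "$p") usable addresses"
done
```

The reserved addresses matter most for small subnets: a /28 leaves only 11 of its 16 addresses for your resources.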
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the different EC2 instance purchase options and when you would u...
Q02 · SENIOR
How do you debug an EC2 instance that is unreachable via SSH?
Q03 · SENIOR
What is an EC2 instance profile, and how does it relate to IAM roles?
Q04 · SENIOR
What happens when a Spot Instance is interrupted? How do you design for ...
Q05 · SENIOR
How do you migrate a running EC2 instance from one instance type to anot...
Q06 · SENIOR
What is the difference between gp2 and gp3 EBS volumes? Why should you m...
Q01 of 06 · SENIOR

Explain the different EC2 instance purchase options and when you would use each.

ANSWER
EC2 offers On-Demand, Reserved Instances (Standard/Convertible), Spot Instances, and Savings Plans.
- On-Demand: No commitment, pay per second. Best for uncertain workloads or short-term needs.
- Reserved Instances: 1- or 3-year commitment to a specific instance type, scoped to a Region or an AZ. Standard RIs offer up to 72% discount but are inflexible. Convertible RIs can be exchanged for a different instance family, OS, or tenancy, at a lower discount.
- Spot Instances: Use spare AWS capacity at up to 90% discount. Can be interrupted with a 2-minute warning. Use for stateless, fault-tolerant workloads like batch processing, CI/CD, or big data.
- Savings Plans: Flexible commitment in $/hour across EC2, Fargate, and Lambda. Compute Savings Plans apply across Regions and instance families. Easier to manage than RIs for dynamic environments.
In production, you typically use a mix: On-Demand for baseline flexibility, Savings Plans for steady-state, Spot for elastic workloads. Always use Auto Scaling to match capacity to demand.
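To make the discounts concrete, the sketch below compares monthly cost under each option. The $0.10 hourly rate is a made-up On-Demand price, the percentages are the 'up to' ceilings quoted in the answer (real discounts vary by term, payment option, and instance type), and `monthly_cost` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Monthly cost for a given hourly rate and discount.
# $1 = On-Demand $/hour, $2 = discount %, $3 = hours per month (default 730)
monthly_cost() {
  awk -v rate="$1" -v off="$2" -v hrs="${3:-730}" \
    'BEGIN { printf "%.2f\n", rate * (1 - off / 100) * hrs }'
}

RATE=0.10   # assumed On-Demand price, for illustration only
echo "On-Demand:           \$$(monthly_cost "$RATE" 0)"
echo "Savings Plan (72%):  \$$(monthly_cost "$RATE" 72)"
echo "Spot (90%):          \$$(monthly_cost "$RATE" 90)"
```

The same function works for partial usage: pass a smaller hour count as the third argument to see when On-Demand for a few hours a day beats a commitment for the full month.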
FAQ · 6 QUESTIONS

Frequently Asked Questions

01. What is the difference between stopping and terminating an EC2 instance?
02. How do I reduce EC2 costs?
03. What is a security group, and how is it different from a network ACL?
04. What is the difference between an AMI and a snapshot?
05. How do I increase the disk space on an EC2 instance?
06. Should I use t3.micro for production?