Mid-level 24 min · March 06, 2026
Introduction to AWS

Lambda Timeout at 15 Minutes — Migration Nightmare

Lambda's 15-minute hard timeout aborts migrations; most tutorials miss it.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

Last updated: May 23, 2026
Quick Answer
  • EC2: virtual machine for continuous workloads; charges per second
  • Lambda: event-driven function; charges per 1ms execution, zero at rest
  • S3: object storage with 11 nines durability
  • RDS: managed relational database with automated backups
  • IAM: identity and permissions layer — everything is denied by default
  • Performance insight: Lambda cold start adds 100ms–1s to first request
  • Production insight: Lambda + RDS without a connection pool exhausts DB connections
  • Biggest mistake: using the root account for daily work
What is Introduction to AWS?

Lambda Timeout at 15 Minutes is the maximum execution duration allowed for an AWS Lambda function before it is forcibly terminated by the service. When a Lambda function is invoked, it runs for up to 900 seconds (15 minutes) of wall-clock time; if the function has not completed and returned a response within that period, AWS Lambda stops execution and throws a timeout error.

This limit applies to synchronous and asynchronous invocations alike, and it is a hard cap—no configuration can extend it beyond 15 minutes.

Plain-English First

Imagine you're opening a pizza restaurant. You could buy your own building, ovens, and delivery vans — or you could rent a kitchen by the hour, use a shared delivery fleet, and only pay when orders come in. AWS is that rental model, but for computing. Instead of buying servers, storage, and networking hardware, you rent exactly what you need, scale it up on a busy Friday night, and scale it back down when things are quiet. You pay for what you use, nothing more.

Every app you use daily — Netflix streaming your show, Airbnb finding you a room, even NASA processing telescope images — runs on someone else's hardware. That hardware is overwhelmingly likely to be Amazon Web Services. AWS controls roughly 31% of the global cloud market, and understanding it isn't optional for a modern developer. Whether you're deploying your first side project or designing a system that serves millions, AWS is the environment you'll be working in.

Before cloud computing existed, launching a product meant buying physical servers, installing them in a data centre, estimating your peak traffic years in advance, and paying for that capacity whether you used it or not. A startup that went viral overnight would crash under load with no way to recover quickly. AWS solved this by turning infrastructure into software — things you provision with an API call in seconds, pay for by the minute, and throw away when you're done.

By the end of this article you'll understand the five services every AWS project touches — EC2, S3, RDS, Lambda, and IAM — why each one exists, when to reach for it over the alternatives, and how they wire together into a real production architecture. You'll also walk away with the vocabulary and mental models that make AWS job interviews approachable.

What AWS Lambda's 15-Minute Timeout Really Means

AWS Lambda lets you configure a function timeout of up to 900 seconds (15 minutes; the default is 3 seconds), and that maximum is a hard platform ceiling that no configuration extends. The clock runs from the moment the function receives an invocation event to the moment it returns a response or throws an unhandled error. Once the timeout fires, Lambda stops the invocation, discards any in-flight work, and reports a "Task timed out after 900.00 seconds" error to the caller. For a synchronous Invoke call this arrives as a function error in the response, not as an HTTP 408. The function gets no chance to finish: the invocation is killed.

The timeout is enforced by the Lambda service from outside your code, so the function cannot catch or defer it. The function's allocated CPU and memory do not pause or slow down as the deadline approaches; execution continues until the wall clock hits the limit, then the process is terminated immediately. Any cleanup logic, database commits, or external API calls that haven't completed by the 900-second mark will never execute. The function's response payload is also discarded; the caller receives only the timeout error.

You must design for this constraint from day one. If your workload regularly exceeds 15 minutes — such as large ETL jobs, video transcoding, or bulk database migrations — Lambda is the wrong compute service. Use AWS Fargate, Batch, or EC2 instead. For workloads that fit within the limit, implement idempotency and checkpointing so that a timeout does not corrupt state or lose data. The 15-minute ceiling is not negotiable; it is a design boundary.
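One way to apply the checkpointing advice is to watch the remaining time and stop early with a resume point. The following is a minimal sketch, assuming the work is a list of items and that whatever invokes the function (for example a Step Functions loop) calls it again with the returned `next_index`. `process_item`, `SAFETY_MARGIN_MS`, the event shape, and `FakeContext` are hypothetical names for illustration; `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object.

```python
SAFETY_MARGIN_MS = 30_000  # assumed buffer: stop 30 s before the deadline


def process_item(item):
    """Hypothetical unit of work; replace with the real per-item logic."""
    return item * 2


def lambda_handler(event, context):
    items = event["items"]
    index = event.get("next_index", 0)  # resume point from a previous run
    results = []

    while index < len(items):
        # Stop while there is still time to return a checkpoint cleanly.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return {"done": False, "next_index": index, "results": results}
        results.append(process_item(items[index]))
        index += 1

    return {"done": True, "next_index": index, "results": results}


class FakeContext:
    """Local stand-in for the Lambda context, for trying the handler offline."""

    def __init__(self, remaining_ms_sequence):
        self._remaining = iter(remaining_ms_sequence)

    def get_remaining_time_in_millis(self):
        return next(self._remaining)
```

Returning the checkpoint in the response keeps the sketch small; a real job would more likely persist it (for example in DynamoDB) so that a crash between invocations does not lose the position.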

Timeout ≠ Retry
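Lambda does not retry a synchronous invocation that times out; the caller receives the error and must decide whether to retry. Asynchronous invocations are retried up to two more times by default, and a dead-letter queue (DLQ) or on-failure destination captures the event only after those attempts are exhausted. Either way, a retry re-runs the function from the start, so partial work from the timed-out attempt is still there.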
Production Insight
A team migrated a 20-minute CSV processing job to Lambda without splitting the work, hitting the 900-second wall in production.
The symptom: every invocation returned Task timed out after 15 minutes, but the partial data was already written to S3, causing duplicate records on retry.
Rule of thumb: if a single invocation cannot complete within 15 minutes, split the work into sub-minute chunks using Step Functions or SQS batch windows.
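The duplicate-record symptom described here usually comes from writes that append on every attempt. A minimal sketch of the alternative, assuming each chunk of the input can be identified by its source file and chunk index: derive a deterministic key so that a retry overwrites the earlier partial write instead of adding a second copy. The in-memory `store` dict stands in for S3 or DynamoDB, and every name below is hypothetical.

```python
import hashlib
import json


def idempotency_key(source_file, chunk_index):
    """Deterministic key: the same chunk always maps to the same output key."""
    raw = f"{source_file}:{chunk_index}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def write_chunk(store, source_file, chunk_index, rows):
    """Write a processed chunk; a retry overwrites rather than duplicates."""
    key = idempotency_key(source_file, chunk_index)
    already_written = key in store
    store[key] = json.dumps(rows)
    return key, already_written


# In-memory stand-in for an object store (assumption for local testing).
store = {}
first_key, first_seen = write_chunk(store, "orders.csv", 0, [{"id": 1}])
retry_key, retry_seen = write_chunk(store, "orders.csv", 0, [{"id": 1}])
```

After the simulated retry the store still holds a single entry, which is the property a timed-out and re-run invocation needs.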
Key Takeaway
Lambda's 15-minute timeout is a hard platform limit — not a configurable suggestion.
Design every function to complete within 900 seconds, or choose a different compute service.
Always implement idempotency and checkpointing for any stateful operation that could be interrupted by a timeout.
Figure: AWS Compute & Architecture Decision Flow (thecodeforge.io). From Lambda limits to EC2, Fargate, and IAM best practices. Panels: Lambda 15-minute timeout; AWS global infrastructure (Regions and AZs for high availability); EC2 vs Lambda vs Fargate (choose based on duration, state, control); IAM least-privilege access; shared responsibility model; pricing models (On-Demand, Reserved, Spot). Callout: the Lambda timeout at 15 minutes breaks long-running tasks; migrate tasks longer than 15 minutes to EC2 or Fargate.

AWS Global Infrastructure: Regions, Availability Zones, and Edge Locations

Before you provision a single resource, understand where it lives. AWS runs more than 30 geographic Regions worldwide, each containing at least three Availability Zones (AZs). An AZ is one or more data centres: physically separate, each with independent power, cooling, and networking. AZs are connected by high-speed, low-latency links, but a disaster that takes out one AZ leaves the others functional.

When you deploy an EC2 instance or an RDS database, you choose both a Region (e.g., us-east-1 in Northern Virginia) and an AZ within that Region. For high availability, you spread across multiple AZs. Multi-AZ architectures are the standard for production workloads. If one AZ goes offline, traffic shifts to the others.
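To make the multi-AZ idea concrete, here is a small sketch that assigns instances to AZs round-robin so that no single AZ holds them all. The instance names and the `spread_across_azs` helper are hypothetical; in practice an Auto Scaling group or a Multi-AZ RDS deployment does this placement for you.

```python
from collections import Counter
from itertools import cycle


def spread_across_azs(instance_names, availability_zones):
    """Assign instances to AZs round-robin so no single AZ holds them all."""
    if len(availability_zones) < 2:
        raise ValueError("use at least two AZs for high availability")
    az_cycle = cycle(availability_zones)
    return {name: next(az_cycle) for name in instance_names}


placement = spread_across_azs(
    ["web-1", "web-2", "web-3", "web-4"],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)
per_az = Counter(placement.values())
```

With four instances over three AZs, losing any one AZ still leaves at least two instances serving traffic.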

Edge Locations extend AWS's footprint beyond Regions. These are points of presence (POPs) in major cities around the world, used by CloudFront (the CDN) and Route 53 (DNS) to cache content and respond to DNS queries from the closest edge. There are over 400 Edge Locations — far more than Regions — because it's cheaper to deploy a cache than a full data centre.

When building for global audiences, pick Regions closest to your users, and use CloudFront to cache static assets at edge locations. For disaster recovery, replicate data to a second Region hundreds of miles away.

Production Insight
Choosing a Region matters for cost, latency, and compliance. AWS pricing varies by Region: for example, eu-west-1 (Ireland) is typically 10-15% more expensive than us-east-1 (N. Virginia). Some regulations restrict where data may be stored or transferred (GDPR for EU personal data is the usual example). Within those constraints, select the Region physically closest to your users to minimise latency. For fault tolerance, deploy across at least 2 AZs in the same Region; for disaster recovery, replicate to a second Region.
A common blunder: deploying all resources in a single AZ because it's simpler, until that AZ goes down.
Debug hint: check the AWS Health Dashboard for Region and service status during an outage.
Key Takeaway
Regions are geographic areas, AZs are isolated data centers within a Region. Edge Locations accelerate content delivery. For production, spread across at least two AZs in the same region.
Always design for AZ failure from day one.
Figure: AWS Global Infrastructure Structure. AWS Global contains Regions (for example us-east-1, eu-west-1, ap-southeast-1); each Region contains Availability Zones (1a, 1b, 1c), each backed by one or more data centres. Edge Locations (for example Mumbai, São Paulo, Tokyo) sit outside the Regions.

The Five Core Services Every AWS Project Uses — and Why They Were Built

AWS has over 200 services, which is overwhelming until you realise that almost every architecture starts with the same five building blocks. Think of them as the five trades in construction: electricity, plumbing, walls, a roof, and a lock on the door. Everything else is finishing work.

EC2 (Elastic Compute Cloud) is your rented computer. It runs your application code exactly as a physical server would, but you can resize it, clone it, or delete it in minutes.

S3 (Simple Storage Service) is unlimited file storage. Not a database — a place to put files. Images, videos, backups, static websites, data exports. It's so reliable (eleven 9s of durability) that AWS themselves use it internally.

RDS (Relational Database Service) runs a managed PostgreSQL, MySQL, or other SQL engine. You don't patch it, back it up, or handle failover — AWS does. You just query it.

Lambda runs a function without a server. Upload code, define a trigger, done. No EC2 instance sitting idle waiting for work.

IAM (Identity and Access Management) is the lock on the door. Every call to every AWS service checks IAM first. Get this wrong and either nothing works or everything is exposed.

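aws_core_services_setup.sh (Bash)
#!/bin/bash
# Prerequisites: AWS CLI installed and configured with `aws configure`.
# This script confirms your IAM identity, then creates a private S3
# bucket for static assets and uploads a test page.

# ── Step 1: Confirm who you are (IAM) ────────────────────────────────
# Always run this first. If the wrong profile is active you'll create
# resources in the wrong account, which is a very expensive mistake.
echo "Current IAM identity:"
aws sts get-caller-identity
# Output shows Account ID, IAM User ARN, and User ID.
# If you see 'Unable to locate credentials', run `aws configure` first.

# ── Step 2: Create a private bucket (listing was truncated; rebuilt from the output) ──
BUCKET_NAME="myapp-static-assets-$(date +%s)"
echo "Creating S3 bucket: $BUCKET_NAME"
aws s3api create-bucket --bucket "$BUCKET_NAME" --region us-east-1
aws s3api put-public-access-block --bucket "$BUCKET_NAME" \
  --public-access-block-configuration \
  "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"
echo "Bucket $BUCKET_NAME created and locked down."
echo "<h1>Hello, world</h1>" > index.html
aws s3 cp index.html "s3://$BUCKET_NAME/index.html" --quiet && echo "File uploaded successfully."
echo "Bucket contents:"
aws s3 ls "s3://$BUCKET_NAME/"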
Output
Current IAM identity:
{
"UserId": "AIDA4EXAMPLE7USERID",
"Account": "123456789012",
"Arn": "arn:aws:iam::123456789012:user/sarah-dev"
}
Creating S3 bucket: myapp-static-assets-1718123456
{
    "Location": "/myapp-static-assets-1718123456"
}
Bucket myapp-static-assets-1718123456 created and locked down.
File uploaded successfully.
Bucket contents:
2024-06-11 14:22:01         22 index.html

EC2 vs Lambda: Choosing the Right Compute Model Before You Write a Line of Code

The most consequential architectural decision in AWS isn't which database to use or how to structure your VPC. It's whether your code runs on EC2 or Lambda. Getting this wrong means either paying for idle servers 24/7 or hitting cold-start timeouts on user-facing requests.

Use EC2 when: your workload is continuous and predictable, you need full OS control, you're running long-running processes (video encoding, ML training), or you're lifting-and-shifting an existing app. An EC2 instance is just a VM — it starts up and stays up until you stop it.

Use Lambda when: your workload is event-driven and intermittent. An API endpoint that gets 50 requests per minute, a function that fires when a file lands in S3, a nightly data transform. Lambda charges per 1ms of execution. If the function doesn't run, you pay nothing.

The trap beginners fall into is using Lambda for everything because it sounds cheaper and more modern. Lambda has a hard 15-minute execution timeout. Put a 20-minute database migration in a Lambda and it will die mid-run, leaving your schema in a broken state. Put a CPU-intensive image processor in Lambda and the cold start latency will frustrate your users.

The sweet spot is using Lambda for glue — the code that reacts to events and orchestrates other services — while EC2 or containers handle the persistent, long-running workloads.
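A rough cost comparison makes the "pay for idle" trade-off easier to see. The sketch below assumes illustrative prices (a per-GB-second and per-request Lambda rate, and an hourly rate for one small EC2 instance), ignores the free tier, and uses the 50-requests-per-minute endpoint from the text with an assumed 200 ms duration and 512 MB of memory. All constants are assumptions to be replaced with current figures for your Region.

```python
# Illustrative prices only (assumptions, not current quotes); check the
# AWS pricing pages for your Region before relying on any of these.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20
EC2_PRICE_PER_HOUR = 0.0208  # assumed small general-purpose instance
HOURS_PER_MONTH = 730


def lambda_monthly_cost(requests_per_month, avg_duration_ms, memory_mb):
    """Compute charge (GB-seconds) plus request charge for one month."""
    gb_seconds = requests_per_month * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = requests_per_month / 1_000_000 * LAMBDA_PRICE_PER_MILLION_REQUESTS
    return gb_seconds * LAMBDA_PRICE_PER_GB_SECOND + request_cost


def ec2_monthly_cost(instance_count=1):
    """An always-on instance is billed for every hour, busy or idle."""
    return instance_count * EC2_PRICE_PER_HOUR * HOURS_PER_MONTH


# 50 requests per minute for a 30-day month, at 200 ms and 512 MB.
requests = 50 * 60 * 24 * 30
lambda_cost = lambda_monthly_cost(requests, avg_duration_ms=200, memory_mb=512)
ec2_cost = ec2_monthly_cost()
print(f"Lambda: ${lambda_cost:.2f}/month, EC2: ${ec2_cost:.2f}/month")
```

Under these assumptions Lambda comes out cheaper; raising the request rate or duration in the call shows where the balance tips towards an always-on instance.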

lambda_s3_image_thumbnail.py (Python)
# This is a complete AWS Lambda function that triggers whenever a new image
# is uploaded to an S3 bucket, generates a thumbnail, and saves it to a
# second 'thumbnails' bucket. This is one of the most common Lambda patterns.

import io
import json
import os
from urllib.parse import unquote_plus

import boto3            # AWS SDK for Python, included in the Lambda Python runtime
from PIL import Image   # Requires a Lambda Layer or packaging Pillow with your deploy

# boto3 clients are created outside the handler so they are reused across
# warm invocations. This is a real performance optimisation, not just style.
s3_client = boto3.client('s3')

# The THUMBNAILS_BUCKET env var is set in the Lambda config, not hardcoded.
# Hardcoding bucket names is a common mistake that breaks staging/prod parity.
THUMBNAILS_BUCKET = os.environ['THUMBNAILS_BUCKET']
THUMBNAIL_SIZE = (128, 128)  # maximum width x height in pixels


def lambda_handler(event, context):
    """
    AWS calls this function automatically when a new object is created in
    the source S3 bucket. 'event' contains the bucket name and object key.
    """
    # Extract the source bucket and file key from the S3 event payload.
    # Keys arrive URL-encoded (spaces become '+'), so decode before use.
    record = event['Records'][0]
    source_bucket = record['s3']['bucket']['name']
    source_key = unquote_plus(record['s3']['object']['key'])

    # Skip files under thumbnails/ to avoid infinite loops if thumbnails land
    # in the same bucket as originals (a classic footgun).
    if source_key.startswith('thumbnails/'):
        print(f"Skipping thumbnail file to prevent recursion: {source_key}")
        return {'statusCode': 200, 'body': json.dumps('Skipped thumbnail')}

    # The original listing is truncated at this point; the rest is a minimal
    # completion of the pattern described in the header comment.
    original = s3_client.get_object(Bucket=source_bucket, Key=source_key)
    image = Image.open(io.BytesIO(original['Body'].read()))
    image.thumbnail(THUMBNAIL_SIZE)  # resizes in place, keeps aspect ratio

    buffer = io.BytesIO()
    image.convert('RGB').save(buffer, format='JPEG')
    buffer.seek(0)

    thumbnail_key = f"thumbnails/{os.path.splitext(source_key)[0]}.jpg"
    s3_client.put_object(
        Bucket=THUMBNAILS_BUCKET,
        Key=thumbnail_key,
        Body=buffer,
        ContentType='image/jpeg',
    )
    return {'statusCode': 200, 'body': json.dumps(f'Thumbnail saved: {thumbnail_key}')}
Pro Tip: Initialise boto3 clients outside the handler
Boto3 client initialisation takes ~50ms. On a warm Lambda invocation, code outside the handler function is NOT re-executed — AWS reuses the same execution environment. Moving client creation outside the handler is free performance. On a high-traffic Lambda processing 10,000 requests/day, this saves roughly 8 minutes of billed compute time per day.
Production Insight
A team once used Lambda for a daily ETL job that processed 500MB CSV files. The function ran for 14 minutes each time, barely under the 15-minute limit. When data volume grew, jobs started timing out. Switched to AWS Batch with EC2 spot instances and cut costs by 60%.
Cold starts: For latency-sensitive APIs, a Lambda hitting cold start (>500ms) can cause user frustration. Use Provisioned Concurrency for predictable latency, but it costs.
Know your workload profile before choosing compute.
Debug tip: enable Lambda Insights to monitor cold start frequency and duration.
Key Takeaway
Continuous + predictable → EC2. Intermittent + event-driven → Lambda.
If it runs longer than 15 minutes, it can't be Lambda.
If it's latency-critical, cold starts matter.
Measure before you decide — use CloudWatch metrics for invocation patterns.
EC2 vs Lambda decision guide
If: Workload runs longer than 15 minutes → Use: EC2 or containers
If: Workload runs intermittently, with idle periods → Use: Lambda, to avoid paying for idle time
If: Need to control the OS or install custom software → Use: EC2
If: Just want to react to events (S3 upload, API call) → Use: Lambda; glue code is its sweet spot

IAM Done Right: Why Least-Privilege Access Is Not Optional

IAM is the part of AWS that most tutorials rush through to get to the 'interesting' stuff, and it's the part that causes the most expensive real-world incidents. The 2019 Capital One breach that exposed 100 million customer records was an IAM misconfiguration. Understanding IAM isn't bureaucracy — it's engineering.

Every entity in AWS (a user, an EC2 instance, a Lambda function) has an identity. Every action on every resource is authorised by checking IAM policies attached to that identity. By default, everything is denied. You grant access explicitly.

The three concepts you must internalise are: Users (humans), Roles (services and applications — an EC2 instance assumes a role, not a user), and Policies (JSON documents that say what is allowed or denied on which resources).

The golden rule is **least privilege**: grant only the exact permissions needed for a specific task, scoped to the specific resource. Not s3:* on *, which is every S3 action on every bucket in your account. Instead: s3:GetObject on arn:aws:s3:::myapp-assets/*.

If a Lambda function only needs to read from one S3 bucket, its execution role should be able to do exactly that — nothing else. If that Lambda is compromised, the blast radius is one bucket in read-only mode, not your entire AWS account.
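A quick way to keep yourself honest about least privilege is to scan policy documents for wildcards before they ship. The sketch below is a deliberately crude check, assuming the policy is already parsed into a dict with a list of statements; `find_overbroad_statements` and the two sample policies are hypothetical, and this is no substitute for IAM Access Analyzer.

```python
def find_overbroad_statements(policy):
    """Return Sids of Allow statements that use wildcard actions or resources."""
    flagged = []
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        resources = statement.get("Resource", [])
        # IAM accepts either a single string or a list for both fields.
        actions = [actions] if isinstance(actions, str) else actions
        resources = [resources] if isinstance(resources, str) else resources
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = any(r == "*" for r in resources)
        if wildcard_action or wildcard_resource:
            flagged.append(statement.get("Sid", "<no Sid>"))
    return flagged


too_broad = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "EverythingOnS3", "Effect": "Allow", "Action": "s3:*", "Resource": "*"}
    ],
}
scoped = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadAssetsOnly",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": ["arn:aws:s3:::myapp-assets/*"],
        }
    ],
}
```

Calling `find_overbroad_statements(too_broad)` flags the wildcard statement, while the scoped policy passes. A check like this fits naturally into a CI step that runs over policy files.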

lambda_s3_readonly_policy.json (JSON)
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowReadingFromSpecificBucketOnly",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::myapp-static-assets-1718123456",
        "arn:aws:s3:::myapp-static-assets-1718123456/*"
      ]
    },
    {
      "Sid": "AllowCloudWatchLoggingForDebugging",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/thumbnail-processor:*"
    }
  ]
}

# To attach this policy to a Lambda execution role via CLI:
#
# 1. Create the policy in IAM:
# aws iam create-policy \
#   --policy-name LambdaThumbnailS3ReadPolicy \
#   --policy-document file://lambda_s3_readonly_policy.json
#
# 2. Create the role (Lambda needs permission to assume it):
# aws iam create-role \
#   --role-name LambdaThumbnailRole \
#   --assume-role-policy-document '{
#     "Version": "2012-10-17",
#     "Statement": [{
#       "Effect": "Allow",
#       "Principal": {"Service": "lambda.amazonaws.com"},
#       "Action": "sts:AssumeRole"
#     }]
#   }'
#
# 3. Attach the policy to the role:
# aws iam attach-role-policy \
#   --role-name LambdaThumbnailRole \
#   --policy-arn arn:aws:iam::123456789012:policy/LambdaThumbnailS3ReadPolicy
Output
# After running create-policy:
{
"Policy": {
"PolicyName": "LambdaThumbnailS3ReadPolicy",
"PolicyId": "ANPA4EXAMPLEPOLICYID",
"Arn": "arn:aws:iam::123456789012:policy/LambdaThumbnailS3ReadPolicy",
"CreateDate": "2024-06-11T14:30:00+00:00",
"AttachmentCount": 0,
"IsAttachable": true
}
}
# After attach-role-policy: (no output means success — this is intentional CLI behaviour)
Interview Gold: The Two-ARN S3 Pattern
Notice the policy has two Resource ARNs for the bucket: one without /* and one with. The ListBucket action applies to the bucket itself (no /*), while GetObject applies to objects inside it (with /*). Using only the /* ARN is a very common mistake: ListBucket then fails with an AccessDenied error even though the policy looks correct. Mentioning this in an interview signals real hands-on experience.
Production Insight
The Capital One breach in 2019 involved an overly permissive IAM role attached to a misconfigured WAF instance. The role could list and read S3 buckets that stored customer data, and the attacker used an SSRF vulnerability to obtain the role's temporary credentials from the instance metadata service.
Least privilege sounds bureaucratic until it's your breach. Scope actions to specific resources.
Always enable AWS CloudTrail to audit who did what. Without it, you're blind.
Debug tip: use IAM Access Analyzer to find policies that are too permissive.
Key Takeaway
IAM is the only service that can grant or deny every other service.
Default deny is the only safe starting point.
Wildcard actions and wildcard resources together = disaster waiting to happen.
Audit your policies quarterly with IAM Access Analyzer.
When to use an IAM Role vs User
If: An EC2 instance needs to access S3 → Use: an IAM Role attached to the instance through an instance profile
If: A human developer needs AWS Console access → Use: an IAM User with MFA
If: A Lambda function needs to write to DynamoDB → Use: an IAM Role with a trust policy for lambda.amazonaws.com

AWS Shared Responsibility Model: What AWS Secures and What You Must Secure

A common misconception for new AWS users is that AWS is fully responsible for security. In reality, security is shared: AWS secures the cloud infrastructure (data centres, hardware, networking, hypervisors), while you secure everything inside that infrastructure — your data, applications, operating systems, network configurations, IAM policies, and encryption.

This is often described as Security OF the Cloud (AWS's responsibility) vs Security IN the Cloud (your responsibility). The boundary shifts depending on the service. For EC2, you manage the OS, patches, and firewall; for RDS, AWS manages the OS and database engine patching, but you manage database access, user accounts, and data encryption at rest. With Lambda, AWS manages the runtime environment, but you manage function code, environment variables, and execution role permissions.

In practice, this means: always encrypt data at rest (S3 SSE, EBS encryption, RDS encryption), encrypt data in transit (SSL/TLS), use IAM roles instead of long-lived access keys, regularly patch your EC2 AMIs, and never open more ports than necessary. Assume AWS will protect the physical data centre; assume everything else is your problem.
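As one example of "encrypt at rest" in practice, the sketch below builds the configuration that S3's `put_bucket_encryption` API expects and applies it through a client passed in by the caller. It assumes `s3_client` is a `boto3.client('s3')`; the helper names, the bucket name, and the `RecordingClient` stand-in (used so the sketch runs without AWS credentials) are hypothetical.

```python
def default_encryption_config(kms_key_id=None):
    """Build the configuration dict for S3 bucket default encryption."""
    if kms_key_id:
        rule = {"SSEAlgorithm": "aws:kms", "KMSMasterKeyID": kms_key_id}
    else:
        rule = {"SSEAlgorithm": "AES256"}  # SSE-S3, keys managed by S3
    return {"Rules": [{"ApplyServerSideEncryptionByDefault": rule}]}


def enable_default_encryption(s3_client, bucket, kms_key_id=None):
    """Apply default encryption; s3_client is expected to be boto3.client('s3')."""
    s3_client.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=default_encryption_config(kms_key_id),
    )


class RecordingClient:
    """Offline stand-in that records the call instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_bucket_encryption(self, **kwargs):
        self.calls.append(kwargs)


client = RecordingClient()
enable_default_encryption(client, "myapp-assets", kms_key_id="alias/myapp")
```

Passing the client in, instead of creating it inside the function, keeps the helper testable and lets the same code run against different accounts or Regions.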

Common Mistake: Assuming AWS Encrypts Everything by Default
Don't assume every service encrypts by default. S3 has applied SSE-S3 encryption to all new objects automatically since January 2023, but EBS volumes are NOT encrypted unless you request it or turn on EBS encryption by default, which is an account setting you enable per Region. For RDS, encryption can only be enabled at creation time; you cannot encrypt an existing unencrypted RDS instance in place. You have to copy a snapshot with encryption enabled and restore it to a new instance.
Production Insight
A real incident: a company stored unencrypted logs in S3 containing customer PII. An S3 bucket policy misconfiguration made the logs publicly readable. Since the bucket was unencrypted, anyone who accessed it got plaintext data. The shared responsibility model means you own data classification and encryption. Use S3 Block Public Access, enable S3 Server-Side Encryption, and use CloudTrail to monitor bucket operations.
Key takeaway: always encrypt at rest and in transit by default.
Key Takeaway
AWS secures the infrastructure; you secure your data, access controls, and configurations. Never assume encryption is on by default. Always audit your resources with AWS Config and use IAM least privilege.
Enable encryption at rest for all new resources via account settings.

AWS Pricing Models: On-Demand, Reserved, Spot, and Savings Plans — When to Use Each

AWS offers four primary pricing models, and choosing the wrong one is like paying first-class for a cargo flight — you get the same seat but at a wildly different price. Understanding when to use each can cut your compute costs by 50-70% with zero architectural change.

On-Demand: pay per hour or per second with no commitment. Best for short-term, spiky, or unpredictable workloads — development environments, new applications still being evaluated, or workloads that cannot tolerate interruption. You pay a premium for flexibility.

Reserved Instances (RIs): commit to 1 or 3 years of specific instance usage in a specific region. You save up to 72% compared to On-Demand. Best for steady-state, predictable workloads — your production web server running 24/7, an RDS instance for a core database. Convertible RIs allow some flexibility in instance family.

Spot Instances: use spare EC2 capacity at up to 90% discount. There is no bidding any more; you pay the current Spot price and can optionally set a maximum. AWS can reclaim the instance with a 2-minute warning. Best for fault-tolerant, stateless, or batch workloads: data processing, image rendering, CI/CD workers, or any workload that can be interrupted and resumed. Never use Spot for databases or stateful app servers.

Savings Plans: a flexible discount model in exchange for a commitment to a consistent amount of compute usage (measured in $/hour) for 1 or 3 years. Compute Savings Plans apply across EC2, Lambda, and Fargate regardless of instance family or Region; EC2 Instance Savings Plans give a deeper discount but are tied to one instance family in one Region. They offer similar savings to RIs but with more flexibility. Best for organisations with diverse compute usage across multiple services.

The practical strategy: Use On-Demand for anything temporary or variable. For baseline, always-on workloads, buy Reserved Instances or Savings Plans. For batch processing, test environments, or non-critical services, use Spot. Never use On-Demand for predictable, long-running workloads — you're throwing money away.
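The gap between the models is easier to judge with numbers. This sketch applies the maximum discounts quoted in this section (72% for Reserved, 90% for Spot) to an assumed On-Demand rate of $0.10/hour, so it shows best-case figures only; the rate and the helper name are assumptions, and real discounts depend on term, payment option, and Spot market conditions.

```python
HOURS_PER_MONTH = 730

# Maximum discounts quoted in the text; real discounts are usually lower.
MAX_DISCOUNT = {"on_demand": 0.0, "reserved": 0.72, "spot": 0.90}


def monthly_cost(on_demand_hourly_rate, model, hours=HOURS_PER_MONTH):
    """Best-case monthly cost for one instance under a pricing model."""
    discount = MAX_DISCOUNT[model]
    return on_demand_hourly_rate * (1 - discount) * hours


# Assumed On-Demand rate of $0.10/hour, chosen only to keep the sums easy.
rate = 0.10
costs = {model: round(monthly_cost(rate, model), 2) for model in MAX_DISCOUNT}
print(costs)
```

Passing a smaller `hours` value models an instance that is stopped overnight, which is often the cheapest fix for workloads that only run during business hours.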

Production Insight
A common cost pitfall: leaving a large On-Demand EC2 instance running 24/7 when it's only used during business hours. The fix: use AWS Instance Scheduler to stop instances overnight, or switch to a Spot instance if the workload is tolerant. Another: using On-Demand for a multi-AZ RDS deployment that never scales down. Convert to Reserved Instances or Savings Plan to cut 30-40%. Use AWS Cost Explorer and Athena on Cost and Usage Reports to identify the biggest savings opportunities.
Debug tip: set up AWS Budgets to get alerts when costs exceed thresholds.
Key Takeaway
On-Demand for flexibility, Reserved/Savings Plans for baseline, Spot for batch/cost-sensitive. A mix of all three, aligned to workload patterns, optimises cost without sacrificing reliability.
Use AWS Cost Explorer to visualise where your money goes.

EC2 vs Lambda vs Fargate: When to Use Each Compute Service

Beyond EC2 and Lambda, Fargate is a third compute option that sits between them — it runs containers without managing servers. Here's how to decide among all three.

EC2 (Elastic Compute Cloud): Full control over the OS and runtime. You manage everything from the kernel up. Best when you need custom AMIs, specific kernel modules, or direct hardware access (GPU). Also ideal for lift-and-shift migrations, long-running jobs (>15 min), and workloads requiring persistent storage attachments (EBS).

Lambda: Zero management. You upload code, set a trigger, and AWS runs it. Best for event-driven, short-lived tasks (<15 min), intermittent workloads with idle periods, and functions that need to scale to thousands of concurrent invocations instantly. Cold start can be an issue for latency-sensitive APIs.

Fargate: Run Docker containers without managing EC2 instances. You define the task (CPU, memory, container image), and AWS provisions and manages the underlying servers. Best for microservices that run continuously, batch jobs longer than Lambda's timeout, or workloads that need consistent performance without the overhead of EC2 management. Fargate is more expensive than EC2 for large, predictable workloads (you pay a ~10-20% markup for the managed infrastructure).

The practical guidance: use Lambda as glue for event-driven tasks, use Fargate for persistent containerised microservices (especially if you already use ECS/EKS), and use EC2 for any workload that needs full control, runs very long, or is cost-sensitive at scale. Many architectures mix all three: Lambda for processing S3 uploads, Fargate for the API server, and EC2 spot fleets for data processing jobs.

Production Insight
A team ran a Python ML inference service on Lambda, but inference took 10-12 minutes per request — often timing out. They moved to Fargate, which handled the long-running tasks with stable CPU and memory. Costs went up slightly, but reliability improved. Another team used Fargate for a low-traffic API, but the “always running” cost exceeded what Lambda would have charged for the same volume of requests. Always calculate cost: Lambda is cheapest at low utilisation, Fargate is competitive at moderate utilisation, EC2 is cheapest at high utilisation (especially with Reserved Instances).
Debug tip: use AWS Compute Optimizer to get recommendations for right-sizing.
Key Takeaway
Lambda: short event-driven. Fargate: managed containers for persistent services. EC2: full control, long-running, or cost-optimised at scale. Mix and match based on workload profile.
Use AWS Compute Optimizer to validate your choices.
EC2, Lambda, or Fargate Decision Guide
If: Need full OS or hardware control? → Use: EC2
If: Workload runs less than 15 min, event-driven, intermittent? → Use: Lambda
If: Running containers, don't want to manage servers, workload longer than 15 min? → Use: Fargate
If: Need GPU or custom kernel? → Use: EC2
If: Large, predictable, cost-sensitive container workload? → Use: EC2 (self-managed ECS/EKS)
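The decision guide can be written down as a small function, which is handy for design reviews or as a lint rule in an internal platform tool. This is one possible encoding, assuming the rules are checked in the order shown and that anything the guide does not cover falls back to EC2; the function name and parameters are hypothetical.

```python
def choose_compute(
    max_runtime_minutes,
    needs_os_or_gpu_control=False,
    containerised=False,
    event_driven=False,
    large_predictable_load=False,
):
    """Encode the decision guide above; a starting point, not a verdict."""
    if needs_os_or_gpu_control:
        return "EC2"
    if containerised:
        # Self-managed capacity tends to win on cost for big, steady loads.
        return "EC2 (self-managed ECS/EKS)" if large_predictable_load else "Fargate"
    if event_driven and max_runtime_minutes < 15:
        return "Lambda"
    return "EC2"  # assumed fallback for cases the guide does not cover


print(choose_compute(5, event_driven=True))
print(choose_compute(20, containerised=True))
```

The ordering matters: hardware or OS requirements override everything else, and the 15-minute check only applies once containers have been ruled out.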

How a Real Production Architecture Wires These Services Together

Seeing services in isolation is useful for learning. But AWS's real power emerges when services compose. Here's how a production web application typically connects the pieces we've covered.

A user hits your domain. Route 53 (AWS DNS) resolves it to a CloudFront distribution (CDN). CloudFront serves static assets (HTML, CSS, JS) directly from S3 — zero server involved, globally cached, essentially free at scale. For dynamic API requests, CloudFront forwards to an Application Load Balancer, which distributes traffic across EC2 instances (or ECS containers) running your application.

The application reads and writes to RDS for structured data, and stores uploaded files directly to S3 using pre-signed URLs (so files go direct from the browser to S3 — never through your server). When a file lands in S3, an event triggers a Lambda function for async processing: thumbnail generation, virus scanning, metadata extraction.
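The pre-signed URL step might look like the sketch below. It assumes `s3_client` is a `boto3.client('s3')`, whose `generate_presigned_url` method signs a URL for a named client operation; the bucket, key, expiry, and the `FakeS3Client` stand-in (which returns a dummy URL so the sketch runs without credentials) are hypothetical.

```python
def create_upload_url(s3_client, bucket, key, expires_in_seconds=300):
    """Return a short-lived URL the browser can PUT a file to directly."""
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in_seconds,
    )


class FakeS3Client:
    """Offline stand-in so the sketch runs without AWS credentials."""

    def generate_presigned_url(self, ClientMethod, Params=None, ExpiresIn=3600):
        return (
            f"https://{Params['Bucket']}.s3.amazonaws.com/{Params['Key']}"
            f"?X-Amz-Expires={ExpiresIn}&X-Amz-Signature=FAKE"
        )


url = create_upload_url(FakeS3Client(), "myapp-uploads", "avatars/user-42.png")
```

The application server only signs the request; the file bytes travel from the browser straight to S3, which keeps large uploads off your compute tier. A short expiry limits how long a leaked URL stays usable.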

Everything runs inside a VPC (Virtual Private Cloud) — a private network. The RDS instance has no public IP. The EC2 instances live in private subnets. Only the Load Balancer is internet-facing. IAM roles control exactly which service can talk to which resource.

This pattern — static assets on S3/CloudFront, compute on EC2/Lambda, data on RDS, files on S3, security via IAM and VPC — handles everything from a startup's MVP to a Fortune 500 platform without fundamentally changing shape.

rds_setup_and_connect.shBASH
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
#!/bin/bash
# This script creates a minimal RDS PostgreSQL instance for a web app
# and demonstrates connecting to it securely from an EC2 instance.
# Cost note: even the smallest RDS instance (~$15/month) is NOT free tier
# eligible after the first 12 months. Use RDS Proxy in production for
# connection pooling — Lambda functions can exhaust DB connections instantly.

DB_IDENTIFIER="myapp-postgres-prod"
DB_NAME="myapp"
DB_USER="myapp_admin"
# In real usage, pull this from AWS Secrets Manager — never hardcode passwords
DB_PASSWORD="$(aws secretsmanager get-secret-value \n  --secret-id myapp/db/password \n  --query SecretString \n  --output text)"

# ── Create the RDS instance ───────────────────────────────────────────
echo "Creating RDS PostgreSQL instance..."
aws rds create-db-instance \n  --db-instance-identifier "$DB_IDENTIFIER" \n  --db-instance-class db.t3.micro \n  --engine postgres \n  --engine-version "15.4" \n  --master-username "$DB_USER" \n  --master-user-password "$DB_PASSWORD" \n  --db-name "$DB_NAME" \n  --allocated-storage 20 \n  --storage-type gp3 \n  --no-publicly-accessible \n  --backup-retention-period 7 \n  --deletion-protection \n  --region us-east-1

# --no-publicly-accessible: the DB only accepts connections from within the VPC
# --deletion-protection: prevents accidental deletion with a single CLI call
# --backup-retention-period 7: keeps 7 days of automated backups

echo "Waiting for instance to become available (this takes ~5 minutes)..."
aws rds wait db-instance-available \
  --db-instance-identifier "$DB_IDENTIFIER"

# ── Retrieve the endpoint once the instance is ready ──────────────────
DB_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "$DB_IDENTIFIER" \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

echo "RDS instance ready at: $DB_ENDPOINT"

# ── Connect (run this from inside your EC2 instance, not your laptop) ─
# psql client on Amazon Linux 2023: sudo dnf install -y postgresql15
echo "Connecting to database..."
PGPASSWORD="$DB_PASSWORD" psql \
  --host="$DB_ENDPOINT" \
  --port=5432 \
  --username="$DB_USER" \
  --dbname="$DB_NAME" \
  --command="SELECT version();"
Output
Creating RDS PostgreSQL instance...
{
"DBInstance": {
"DBInstanceIdentifier": "myapp-postgres-prod",
"DBInstanceClass": "db.t3.micro",
"Engine": "postgres",
"DBInstanceStatus": "creating",
"Endpoint": null
}
}
Waiting for instance to become available (this takes ~5 minutes)...
RDS instance ready at: myapp-postgres-prod.cxyz1234abcd.us-east-1.rds.amazonaws.com
Connecting to database...
version
------------------------------------------------------------------------
PostgreSQL 15.4 on x86_64-pc-linux-gnu, compiled by gcc 7.3.1, 64-bit
(1 row)
Watch Out: Lambda + RDS Without a Connection Pool
Each Lambda execution environment opens its own database connection, and Lambda runs one environment per concurrent request. At 1,000 concurrent executions, that is up to 1,000 simultaneous DB connections. Stock PostgreSQL defaults max_connections to 100, and on RDS the default scales with instance memory, so a small instance such as db.t3.micro lands in the same range. Past that limit the database refuses new connections and requests fail. The fix: put RDS Proxy between Lambda and RDS. It pools and reuses connections, and it's purpose-built for this exact scenario. This is a very common production incident for teams new to serverless.
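RDS Proxy pools connections on the database side. A complementary habit inside the function is to open the connection once per execution environment, at module scope, instead of once per invocation. The sketch below illustrates that pattern only; `FakeConnection` and `get_connection` are hypothetical stand-ins for a real database driver so the example runs anywhere.

```python
"""Reuse one DB connection per Lambda execution environment (sketch)."""

_connection = None  # module scope: survives across warm invocations


class FakeConnection:
    """Stand-in for a real driver connection (hypothetical, for illustration)."""

    opened = 0  # counts how many connections were ever created

    def __init__(self):
        FakeConnection.opened += 1

    def query(self, sql):
        return f"ran: {sql}"


def get_connection(connect=FakeConnection):
    """Open a connection on first use, then hand back the same one."""
    global _connection
    if _connection is None:
        _connection = connect()
    return _connection


def handler(event, context):
    # Warm invocations reuse the connection opened by the first one.
    return get_connection().query("SELECT 1")


if __name__ == "__main__":
    for _ in range(5):
        handler({}, None)
    print("connections opened:", FakeConnection.opened)
```

This only removes the per-invocation churn. Every concurrent execution environment still holds its own connection, so 1,000 concurrent executions can still mean 1,000 connections, which is the gap RDS Proxy covers.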
Production Insight
When using Lambda + RDS without RDS Proxy, a traffic spike can open thousands of database connections and hit PostgreSQL's max_connections (around 100 on small instances). The database refuses new connections, causing application errors.
For Lambda at any real concurrency, treat RDS Proxy as mandatory: it reuses connections and reduces database load.
VPC design: placing RDS in a private subnet with no public IP is essential. An EC2 in the same VPC can connect via internal DNS.
Debug tip: monitor RDS connections with CloudWatch metric DatabaseConnections.
Key Takeaway
Static content → S3 + CloudFront. API → ALB + EC2/Lambda. Database → RDS with RDS Proxy.
Network isolation via VPC subnets. IAM policies enforce service-to-resource permissions.
This pattern scales from MVP to enterprise.
Always use RDS Proxy with Lambda — it's not optional.
Database connection strategy with Lambda
If: Lambda + RDS with high concurrency
Use: RDS Proxy
If: Lambda + DynamoDB
Use: No connection pooling needed; DynamoDB is serverless and accessed over HTTPS
If: EC2 + RDS
Use: A standard connection pool, either in the application (HikariCP) or alongside it (PgBouncer)

Networking and Security: VPC, Security Groups, and NACLs — The Invisible Backbone

Every AWS resource lives inside a Virtual Private Cloud (VPC) — your private slice of the AWS network. Without understanding VPC basics, you'll struggle to connect services securely. The VPC is where you define subnets (public and private), route tables, internet gateways, and NAT gateways.

Security Groups are stateful firewalls attached to individual resources (EC2, RDS, Lambda). If you allow inbound on port 443, the response traffic for those connections is automatically allowed out, regardless of outbound rules. This is convenient but can mask misconfigurations.

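Network ACLs (NACLs) are stateless firewalls applied to entire subnets. You must define both inbound and outbound rules explicitly; rules are evaluated in ascending rule-number order and the first match wins. If you allow inbound HTTP but forget an outbound rule for the return traffic (which goes to the client's ephemeral port, typically somewhere in 1024-65535), the connection fails silently.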
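The stateless behaviour is easier to see in a toy model. The sketch below is not the AWS implementation: it ignores protocols and CIDR ranges and only models port-based rules, evaluated lowest rule number first with an implicit final deny. The `Rule` class, the rule numbers, and the client port are all made up for illustration.

```python
"""Toy model of stateless NACL evaluation (sketch, not the AWS implementation)."""
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    number: int     # lower numbers are evaluated first
    port_from: int
    port_to: int
    allow: bool


def evaluate(rules, port):
    """Return True if traffic on `port` is allowed; the first matching rule wins."""
    for rule in sorted(rules, key=lambda r: r.number):
        if rule.port_from <= port <= rule.port_to:
            return rule.allow
    return False  # implicit final rule: deny everything that matched nothing


CLIENT_PORT = 50000  # hypothetical ephemeral port picked by the client

inbound = [Rule(100, 80, 80, True)]              # allow inbound HTTP
outbound_missing = []                            # forgot the return path
outbound_fixed = [Rule(100, 1024, 65535, True)]  # allow ephemeral ports out


def round_trip(inbound_rules, outbound_rules):
    """A request succeeds only if both directions are allowed independently."""
    return evaluate(inbound_rules, 80) and evaluate(outbound_rules, CLIENT_PORT)


if __name__ == "__main__":
    print("without outbound rule:", round_trip(inbound, outbound_missing))
    print("with outbound rule:   ", round_trip(inbound, outbound_fixed))
```

A Security Group would need only the inbound rule, because it tracks the connection and lets the reply out on its own.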

The standard pattern: place your load balancer in a public subnet with a Security Group allowing HTTP/HTTPS from the internet. Place EC2 instances and RDS in private subnets with Security Groups that only allow traffic from the load balancer's security group. A bastion host (jump box) in a public subnet provides secure SSH access for administrators.

Enable VPC Flow Logs to capture metadata about the IP traffic flowing through your VPC (source, destination, ports, protocol, and whether it was accepted or rejected). They are invaluable for debugging connectivity issues and security incidents.

create_vpc_basic.sh (Bash)
#!/bin/bash
# Creates a VPC with one public and one private subnet, an Internet Gateway,
# and a NAT Gateway (costs ~$0.045/hour when running).

VPC_NAME="myapp-vpc"
VPC_CIDR="10.0.0.0/16"
PUBLIC_SUBNET_CIDR="10.0.1.0/24"
PRIVATE_SUBNET_CIDR="10.0.2.0/24"
REGION="us-east-1"

# ── Create VPC ────────────────────────────────────────────────────────
VPC_ID=$(aws ec2 create-vpc \
  --cidr-block "$VPC_CIDR" \
  --region "$REGION" \
  --query 'Vpc.VpcId' \
  --output text)
aws ec2 create-tags --resources "$VPC_ID" --tags Key=Name,Value="$VPC_NAME"
echo "Created VPC: $VPC_ID"

# ── Enable DNS hostnames ──────────────────────────────────────────────
aws ec2 modify-vpc-attribute \
  --vpc-id "$VPC_ID" \
  --enable-dns-hostnames "{\"Value\":true}"

# ── Create subnets ────────────────────────────────────────────────────
PUBLIC_SUBNET_ID=$(aws ec2 create-subnet \
  --vpc-id "$VPC_ID" \
  --cidr-block "$PUBLIC_SUBNET_CIDR" \
  --region "$REGION" \
  --query 'Subnet.SubnetId' \
  --output text)
aws ec2 create-tags --resources "$PUBLIC_SUBNET_ID" --tags Key=Name,Value="${VPC_NAME}-public"

echo "Created public subnet: $PUBLIC_SUBNET_ID"

PRIVATE_SUBNET_ID=$(aws ec2 create-subnet \
  --vpc-id "$VPC_ID" \
  --cidr-block "$PRIVATE_SUBNET_CIDR" \
  --region "$REGION" \
  --query 'Subnet.SubnetId' \
  --output text)
aws ec2 create-tags --resources "$PRIVATE_SUBNET_ID" --tags Key=Name,Value="${VPC_NAME}-private"
echo "Created private subnet: $PRIVATE_SUBNET_ID"

# ── Internet Gateway ──────────────────────────────────────────────────
IGW_ID=$(aws ec2 create-internet-gateway \
  --region "$REGION" \
  --query 'InternetGateway.InternetGatewayId' \
  --output text)
aws ec2 attach-internet-gateway \
  --internet-gateway-id "$IGW_ID" \
  --vpc-id "$VPC_ID"
echo "Attached Internet Gateway: $IGW_ID"

# ── Route table for public subnet ─────────────────────────────────────
PUBLIC_RT_ID=$(aws ec2 create-route-table \
  --vpc-id "$VPC_ID" \
  --region "$REGION" \
  --query 'RouteTable.RouteTableId' \
  --output text)
aws ec2 create-route \
  --route-table-id "$PUBLIC_RT_ID" \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id "$IGW_ID"
aws ec2 associate-route-table \
  --route-table-id "$PUBLIC_RT_ID" \
  --subnet-id "$PUBLIC_SUBNET_ID"
echo "Public route table configured."

# ── NAT Gateway (for private subnet outbound) ─────────────────────────
# First allocate an Elastic IP
EIP_ALLOC=$(aws ec2 allocate-address \
  --domain vpc \
  --region "$REGION" \
  --query 'AllocationId' \
  --output text)

NAT_GW_ID=$(aws ec2 create-nat-gateway \
  --subnet-id "$PUBLIC_SUBNET_ID" \
  --allocation-id "$EIP_ALLOC" \
  --region "$REGION" \
  --query 'NatGateway.NatGatewayId' \
  --output text)
# NAT Gateway takes time to become available; do not proceed until it's active
echo "Waiting for NAT Gateway to become available..."
aws ec2 wait nat-gateway-available --nat-gateway-ids "$NAT_GW_ID"
echo "NAT Gateway ready: $NAT_GW_ID"

# Private subnet route table (default one created with VPC, but we'll create a dedicated one)
PRIVATE_RT_ID=$(aws ec2 create-route-table \
  --vpc-id "$VPC_ID" \
  --region "$REGION" \
  --query 'RouteTable.RouteTableId' \
  --output text)
aws ec2 create-route \
  --route-table-id "$PRIVATE_RT_ID" \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id "$NAT_GW_ID"
aws ec2 associate-route-table \
  --route-table-id "$PRIVATE_RT_ID" \
  --subnet-id "$PRIVATE_SUBNET_ID"
echo "Private route table with NAT Gateway configured."

echo "VPC setup complete. Summary:"
echo "  VPC: $VPC_ID"
echo "  Public subnet: $PUBLIC_SUBNET_ID"
echo "  Private subnet: $PRIVATE_SUBNET_ID"
echo "  Internet Gateway: $IGW_ID"
echo "  NAT Gateway: $NAT_GW_ID"
Output
Created VPC: vpc-0a1b2c3d4e5f67890
Created public subnet: subnet-12345678
Created private subnet: subnet-87654321
Attached Internet Gateway: igw-12345678
Public route table configured.
Waiting for NAT Gateway to become available...
NAT Gateway ready: nat-0abcdef1234567890
Private route table with NAT Gateway configured.
VPC setup complete. Summary:
  VPC: vpc-0a1b2c3d4e5f67890
  Public subnet: subnet-12345678
  Private subnet: subnet-87654321
  Internet Gateway: igw-12345678
  NAT Gateway: nat-0abcdef1234567890

AWS Certification Roadmap: Which Exams to Take and in What Order

If you're a working developer aiming to validate your AWS skills, the certification path can be confusing. Here's a brief, opinionated roadmap based on what matters for real engineering roles.

Start with: AWS Certified Cloud Practitioner (CLF-C02). This foundational exam covers basic cloud concepts, pricing, and core services. It's non-technical but gives you a broad overview. Skip it if you already have 6+ months of hands-on experience — go straight to Associate.

Then: AWS Certified Solutions Architect – Associate (SAA-C03). This is the gold standard for developers and architects. It tests your ability to design secure, resilient, cost-optimised architectures using core services. Most job postings list this as a preferred certification. Study focus: VPC, S3, EC2, Lambda, RDS, IAM, CloudFront, Route 53, and the right patterns for each use case.

Optional but valuable: AWS Certified Developer – Associate (DVA-C02). Overlaps with Solutions Architect but dives deeper into CI/CD, CloudFormation, Lambda, DynamoDB, and application deployment. If you write code on AWS, this validates development-specific skills.

Advanced: AWS Certified Solutions Architect – Professional (SAP-C02). For senior engineers who design multi-account, hybrid, and large-scale architectures. Expect scenario-based questions about migration, cost control, and security at enterprise scale.

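Specialty certifications (such as Security or Advanced Networking) are for focused roles. The lineup changes over time (the Data Analytics and Database specialties were retired in 2024), so check the current list before planning around one. Don't chase them unless your daily work demands it.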

The practical path: Cloud Practitioner (optional) → Solutions Architect Associate (mandatory) → Developer Associate (if you build apps) → Solutions Architect Professional (after 2+ years of AWS experience). This sequence gives you the vocabulary, design principles, and confidence to architect and debug production systems.

Production Insight
Certifications don't replace experience. The most valuable learning comes from breaking something in a dev account and fixing it. Use the exam as a structured study guide, then apply the concepts by building real projects. The AWS re:Invent videos, official documentation, and labs at AWS Skill Builder are excellent preparation resources. Combine certifications with a side project — like hosting a static site on S3+CloudFront or building a serverless API — to cement the concepts.
Practical advice: set aside time each week for labs; AWS Skill Builder offers hands-on labs, though some require a paid subscription.
Key Takeaway
Start with Solutions Architect Associate (SAA-C03) for the broadest ROI. Add Developer Associate if you write code. Professional is for senior architects. Use exams as a study roadmap, not end goals.
Build a real project alongside studying to make concepts stick.

Hands-On Practice: 5 AWS Exercises to Build Real Skills

Reading about AWS is not enough. You must provision resources, misconfigure them, break them, and fix them. These five exercises cover the services discussed in this article and will give you the hands-on confidence to tackle production issues.

Exercise 1: S3 Bucket Policy – Public vs Private. Create an S3 bucket and upload a file. New buckets have S3 Block Public Access turned on by default, so disable it for this bucket first, then make the file publicly readable by adding a bucket policy. Next, re-enable Block Public Access and verify that public access fails. Finally, create a pre-signed URL that grants temporary access. This exercise teaches bucket policies, public access blocks, and signed URLs, a common production pattern for file sharing.

Exercise 2: Create an IAM Role and Attach a Policy via CLI. Use the AWS CLI to create an IAM role for EC2 with a trust policy that allows ec2.amazonaws.com to assume it. Attach a managed policy (e.g., AmazonS3ReadOnlyAccess). Create an instance profile, add the role to it, then launch an EC2 instance with that profile and verify it can list S3 buckets without any access key. This exercise demonstrates instance profiles and role assumption, the foundation of secure AWS usage.

Exercise 3: Trigger a Lambda Function from an S3 Upload. Create a simple Lambda function (e.g., in Python) that logs the bucket and key of uploaded objects. Create an S3 bucket and add a trigger that invokes the Lambda function on s3:ObjectCreated:*. Upload a file and check CloudWatch Logs to confirm the invocation. This exercise is the building block for event-driven architectures.
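For Exercise 3, a minimal handler could look like the sketch below. It assumes the standard S3 notification shape (a `Records` list with `s3.bucket.name` and `s3.object.key`); the bucket name and key in the sample event are made up.

```python
"""Log the bucket and key of each uploaded object (sketch for Exercise 3)."""
import json
import logging
from urllib.parse import unquote_plus

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handler(event, context):
    """Return a list of {bucket, key} dicts, one per record in the S3 event."""
    uploaded = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (a space becomes '+'), so decode them.
        key = unquote_plus(record["s3"]["object"]["key"])
        logger.info("New object: s3://%s/%s", bucket, key)
        uploaded.append({"bucket": bucket, "key": key})
    return uploaded


if __name__ == "__main__":
    # Hypothetical event, trimmed to the fields the handler reads.
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "example-uploads"},
                    "object": {"key": "photos/my+cat.jpg"}}}
        ]
    }
    print(json.dumps(handler(sample_event, None)))
```

Running the file locally with the sample event is a quick check before wiring up the real trigger; in Lambda, the log lines end up in CloudWatch Logs.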

Exercise 4: Launch an RDS Instance and Connect from an EC2 Instance. Create an RDS PostgreSQL instance in a private subnet. Create an EC2 instance in the same VPC (public subnet) and install the PostgreSQL client. Connect to the RDS instance using its endpoint hostname, which resolves to a private IP inside the VPC. Then enable deletion protection and attempt to delete the instance via CLI to see the error. This exercise covers VPC networking, security groups, and database management basics.

Exercise 5: Build a Two-Tier VPC with Public and Private Subnets. Create a VPC with CIDR 10.0.0.0/16. Add a public subnet and a private subnet. Set up an Internet Gateway for the public subnet and a NAT Gateway for the private subnet. Test connectivity: launch an EC2 in the public subnet (should have internet access) and another in the private subnet (should have outbound internet via NAT but no direct inbound). This exercise is the foundation of any secure network in AWS.
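Before creating anything for Exercise 5, the address plan can be sanity-checked offline. The sketch below uses Python's standard `ipaddress` module and the subnet ranges from this article's scripts (10.0.1.0/24 and 10.0.2.0/24); `check_plan` and `usable_hosts` are illustrative helper names, not AWS tooling.

```python
"""Offline sanity check for a VPC address plan (sketch)."""
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four and the last address


def check_plan(vpc, subnets):
    """Return a list of human-readable problems; an empty list means the plan is fine."""
    problems = []
    names = list(subnets)
    for name in names:
        if not subnets[name].subnet_of(vpc):
            problems.append(f"{name} is outside the VPC range")
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if subnets[first].overlaps(subnets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


def usable_hosts(subnet):
    """Addresses left for your resources after AWS's reserved ones."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private": ipaddress.ip_network("10.0.2.0/24"),
}

if __name__ == "__main__":
    print("problems:", check_plan(vpc, subnets))
    for name, net in subnets.items():
        print(name, net, "usable hosts:", usable_hosts(net))
```

Catching an overlap here is cheaper than discovering it when `create-subnet` rejects the CIDR halfway through a script.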

exercise5_vpc_two_tier.sh (Bash)
#!/bin/bash
# Complete VPC with public and private subnets, Internet and NAT Gateways.
# Reuses logic from the VPC section but combines into a single script.

VPC_CIDR="10.0.0.0/16"
PUBLIC_CIDR="10.0.1.0/24"
PRIVATE_CIDR="10.0.2.0/24"
REGION="us-east-1"

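# Create VPC
VPC_ID=$(aws ec2 create-vpc --cidr-block "$VPC_CIDR" --region "$REGION" --query 'Vpc.VpcId' --output text)

# Enable DNS hostnames
aws ec2 modify-vpc-attribute --vpc-id "$VPC_ID" --region "$REGION" --enable-dns-hostnames '{"Value":true}'

# The remaining steps (subnets, Internet Gateway, NAT Gateway, route tables)
# follow create_vpc_basic.sh from the VPC section.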

What is DevOps — And Why AWS Doesn't Give a Damn About Your Job Titles

DevOps isn't a role you hire for. It's a contract between developers and operators that says: 'We stop throwing code over the wall and start owning what we ship together.' The core principles — automation, CI/CD, monitoring, feedback loops — existed long before someone put a buzzword on a slide deck.

Here's what actually matters: you automate everything that hurts. You build pipelines that catch failures before they hit production. You monitor in real-time because users won't file tickets — they'll just leave. AWS doesn't care if your team is called 'DevOps' or 'SRE' or 'Site Reliability Wizards.' It gives you the tools to enforce this contract. CodePipeline for CI/CD. CloudWatch for monitoring. Systems Manager for operational automation.

The real question isn't 'what is DevOps.' It's: does your team own the outcome, or just the code? AWS forces you to answer that question the first time a deployment breaks at 3 AM.

DevOpsContract.yml (YAML)

# This pipeline enforces the DevOps contract — no manual approvals, no hero deploys
name: enforce-devops-contract

on:
  push:
    branches:
      - main

env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY: payments-api
  ECS_CLUSTER: prod-cluster

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Run unit tests
        run: |
          pytest tests/ --cov=src --cov-fail-under=80
          echo 'Tests passed — you can deploy.'

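      - name: Build and push Docker image
        env:
          # Assumption: a repository secret holding the registry host,
          # e.g. <account-id>.dkr.ecr.us-east-1.amazonaws.com
          ECR_REGISTRY: ${{ secrets.ECR_REGISTRY }}
        run: |
          # Requires AWS credentials from an earlier step or the runner's role
          aws ecr get-login-password --region "$AWS_REGION" \
            | docker login --username AWS --password-stdin "$ECR_REGISTRY"
          docker build -t "$ECR_REGISTRY/$ECR_REPOSITORY:latest" .
          docker push "$ECR_REGISTRY/$ECR_REPOSITORY:latest"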

      - name: Notify Slack on deploy
        run: |
          curl -X POST -H 'Content-type: application/json' \
            --data '{"text":"New image pushed to ECR"}' \
            ${{ secrets.SLACK_WEBHOOK }}
Output
Tests passed — you can deploy.
Pipeline executes automatically on main branch push.
Senior Shortcut:
Don't buy the 'we need a DevOps engineer' argument. You need developers who understand operations and operators who can write code. Hire for the mindset, not the title. AWS tools won't fix a broken culture — they'll just expose it faster.
Key Takeaway
DevOps is a contract, not a job title. Automate everything that hurts, or your 3 AM pager will remind you.

Getting Started: Your First AWS DevOps Account Without Getting Fired

Setting up an AWS account is the easy part. Keeping it from becoming a security incident waiting to happen is where most people fail. Here's the bare minimum: create the account, enable MFA on the root user immediately, do not generate access keys for root, then lock the root user in a drawer. You don't deploy from root. Period.

Next: create an IAM user for yourself with administrator access — but only for initial setup. Then build a least-privilege user for your actual work. The principle is simple: if a deployment doesn't need S3 delete permissions, it shouldn't have them. AWS IAM Access Analyzer will tell you when you're being sloppy. Listen to it.

For the actual DevOps setup: enable CloudTrail from day one. It's your audit log when something breaks at 2 AM and you need to know who deleted the production database. Set up a budget alert. Trust me — the first month's bill will be a shock if you don't. Start with the Free Tier, learn in a single region (us-east-1 is fine), and never leave a playground environment running overnight.

SecureAccountSetup.yml (YAML)

# CloudFormation template for a secure initial setup
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Lock down your AWS account — no exceptions'

Resources:
  DevOpsUser:
    Type: AWS::IAM::User
    Properties:
      UserName: devops-engineer
      Policies:
        - PolicyName: minimal-deploy-permissions
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - ecs:UpdateService
                  - ecs:DescribeServices
                  - codebuild:StartBuild
                Resource: '*'
              - Effect: Deny
                Action:
                  - s3:DeleteBucket
                  - iam:*
                Resource: '*'

  CloudTrail:
    Type: AWS::CloudTrail::Trail
    Properties:
      IsLogging: true
      S3BucketName: audit-logs-prod-2024
      IncludeGlobalServiceEvents: true
Output
CloudFormation stack created successfully.
DevOps user now has: ECS update and describe, CodeBuild start, and nothing else.
Production Trap:
A root access key created 'just for testing' will end up in a GitHub repo within 72 hours. It always does. Don't create one; if one already exists, delete it. For programmatic access, use IAM roles or scoped IAM users with separate keys for each environment, and store and rotate any remaining secrets with AWS Secrets Manager.
Key Takeaway
Root account is a loaded weapon. Lock it away. Create IAM users with scalpel permissions: small, precise, and only what the job needs.

Accessing AWS: The API, Console, and CLI Are Not Interchangeable

You will access AWS through three doors: the web console, the CLI, and the SDK/API. They all talk to the same backend, but they are not interchangeable for production work.

The console is for exploration, debugging, and one-off tasks. It lulls you into clicking through wizard UIs. That is fine for learning. Dangerous for operations. Every click is a manual step that cannot be version-controlled, audited, or repeated reliably. If you are building infrastructure, you should be writing code, not clicking buttons.

The CLI and SDK are your production tools. The CLI is for scripting and ad-hoc automation. The SDK is for embedding AWS calls into your application code. Both authenticate through the same IAM credentials — never hardcode them. Use environment variables, instance profiles, or AWS Secrets Manager. The moment you paste an access key into a config file committed to Git, you have a security incident waiting to happen.

aws-access-example.yml (shell commands)

# Never do this in production: `aws configure` writes long-lived keys
# to ~/.aws/credentials in plaintext
# aws configure
#   AWS Access Key ID: AKIAIOSFODNN7EXAMPLE
#   AWS Secret Access Key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

# Do this instead: set environment variables
# (these calls themselves need an already-authenticated session)
export AWS_ACCESS_KEY_ID="$(aws secretsmanager get-secret-value --secret-id prod-aws-keys --query SecretString --output text | jq -r '.access_key')"
export AWS_SECRET_ACCESS_KEY="$(aws secretsmanager get-secret-value --secret-id prod-aws-keys --query SecretString --output text | jq -r '.secret_key')"

# Or use an IAM role (EC2 instance profile), with no keys at all
aws s3 ls --region us-east-1
Output
2024-11-20 14:32:01 my-production-bucket
2024-11-20 14:32:02 logs-bucket
2024-11-20 14:32:03 backups-bucket
Production Trap:
Some console surfaces, such as the IAM Identity Center access portal's 'Command line or programmatic access' option, copy live temporary credentials to your clipboard. Paste that into a shared terminal session or a chat window and you have just leaked them.
Key Takeaway
The console is for humans, the CLI is for scripts, the SDK is for code. Never use the console for repeatable tasks.

IAM Users vs Roles: Stop Creating Users for Machines

IAM users have long-term credentials. IAM roles have temporary credentials that rotate automatically. If you are tempted to create an IAM user for your EC2 instance or Lambda function, stop.

IAM roles are the only correct way to give AWS resources permissions. You attach a role to the resource, and AWS hands it temporary credentials that it rotates automatically; a role session lasts at most 12 hours. If those credentials leak, they expire. An IAM user's access keys live until you revoke them, and you will forget to rotate them.

The same logic applies to cross-account access. Never create a user in Account A and share keys with Account B. Instead, establish a trust policy on Account A's role that allows Account B's role to assume it. Now Account B can act in Account A without any long-lived secrets. This is the foundation of AWS Organizations and multi-account security.

One exception: human developers doing local testing. They need long-lived keys. Use an IAM user with a strong password and MFA. But the moment that code runs on a server, the user goes away and the role takes over.

iam-role-assume.yml (YAML)

# Role in Account A (production) that allows Account B (CI/CD) to assume it
Resources:
  ProductionReadOnlyRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
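          - Effect: Allow
            Principal:
              # Hypothetical role name: scope trust to one CI/CD role in Account B
              AWS: arn:aws:iam::222222222222:role/ci-deploy-role
            Action: sts:AssumeRole
            # No aws:MultiFactorAuthPresent condition here: a Bool test on that key
            # does not match when the key is absent (calls made with long-term keys)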
      Policies:
        - PolicyName: ReadOnlyAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: s3:GetObject
                Resource: arn:aws:s3:::my-production-bucket/*
Output
Account B assumes the role. Temporary credentials issued. No shared keys. No rotation needed.
Senior Shortcut:
When you see an IAM user with access keys older than 90 days, rotate them immediately — or delete the user and switch to a role.
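The 90-day check is easy to automate. The sketch below assumes the key metadata has already been fetched (for example from `aws iam list-access-keys`, whose entries carry `AccessKeyId`, `Status`, and `CreateDate`) and parsed into timezone-aware datetimes; the key IDs and dates are made up.

```python
"""Flag active IAM access keys older than 90 days (sketch)."""
from datetime import datetime, timedelta, timezone

MAX_KEY_AGE = timedelta(days=90)


def stale_keys(key_metadata, now=None):
    """Return the IDs of active keys created more than MAX_KEY_AGE ago."""
    now = now or datetime.now(timezone.utc)
    return [
        key["AccessKeyId"]
        for key in key_metadata
        if key["Status"] == "Active" and now - key["CreateDate"] > MAX_KEY_AGE
    ]


if __name__ == "__main__":
    # Hypothetical metadata in the shape described above.
    keys = [
        {"AccessKeyId": "AKIAEXAMPLEOLD", "Status": "Active",
         "CreateDate": datetime(2025, 6, 1, tzinfo=timezone.utc)},
        {"AccessKeyId": "AKIAEXAMPLENEW", "Status": "Active",
         "CreateDate": datetime(2025, 12, 1, tzinfo=timezone.utc)},
    ]
    fixed_now = datetime(2026, 1, 1, tzinfo=timezone.utc)
    print(stale_keys(keys, now=fixed_now))
```

Inactive keys are skipped on purpose: they cannot be used, so the action for them is deletion rather than rotation.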
Key Takeaway
IAM users are for humans. IAM roles are for machines. Never give a server permanent keys.

Why AWS CodeCommit Is Not Just a Git Clone

AWS CodeCommit is a fully managed source control service that hosts private Git repositories. Unlike GitHub or GitLab, it uses AWS IAM for access control: with the credential helper or git-remote-codecommit, your AWS credentials become your Git credentials, so there are no separate personal access tokens to manage (SSH keys and HTTPS Git credentials remain available if you prefer them). This reduces attack surface and simplifies compliance audits. CodeCommit automatically encrypts repositories at rest and in transit, and scales without provisioning servers. Its tight integration with CodeBuild, CodeDeploy, and CodePipeline makes it a natural fit for AWS-native CI/CD. The tradeoff is fewer community features: pull requests and approval rules exist, but the review tooling is basic and there is no forking. Use CodeCommit when your team already operates within AWS and needs audit trails, VPC isolation, or cross-account access. Avoid it if you rely on GitHub Actions or third-party integrations that AWS doesn't mirror. The real value is not the Git features; it's the IAM-powered security boundary that eliminates credential sprawl.

codecommit-repo.yml (YAML)

Resources:
  MyRepo:
    Type: AWS::CodeCommit::Repository
    Properties:
      RepositoryName: MyAppRepo
      RepositoryDescription: Production microservice
      Code:
        S3:
          Bucket: my-bootstrap-bucket
          Key: initial-commit.zip
      Tags:
        - Key: Environment
          Value: Production
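# Outputs is a top-level section, not a property of the repository resource
Outputs:
  CloneUrlSsh:
    Value: !GetAtt MyRepo.CloneUrlSsh
  CloneUrlHttps:
    # The attribute is named CloneUrlHttp, even though the URL it returns is HTTPS
    Value: !GetAtt MyRepo.CloneUrlHttp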
Production Trap:
CodeCommit evaluates IAM policies on every push. If you grant broad access, any developer can delete or overwrite branches. Protect important branches with a Deny statement that uses the codecommit:References condition key to block direct pushes and branch deletion.
Key Takeaway
CodeCommit's advantage is IAM-native security, not feature parity with public Git hosts.

AWS CodeDeploy: Why You Deploy to Instances, Not to Environments

AWS CodeDeploy automates application deployments to EC2, ECS, Lambda, or on-premises servers. It does not care about your environment labels; it deploys to compute targets based on deployment groups. You define a deployment group (e.g., production-asg) containing Auto Scaling group instances or Lambda aliases. CodeDeploy then rolls out your revision (AppSpec + artifacts) using a deployment configuration: AllAtOnce for dev, OneAtATime or HalfAtATime for rolling production updates, or Blue/Green for near-zero downtime. The AppSpec file defines lifecycle hooks (BeforeInstall, AfterInstall, ValidateService) where you run validation scripts. If a hook fails, the deployment fails, and CodeDeploy rolls back automatically when automatic rollback is enabled on the deployment group. This removes manual SSH and error-prone bash scripts from deployments. Critical detail: deployment groups can be updated without redeploying, and you can stop a deployment mid-rollout. Use with CodePipeline for fully automated rollouts. Avoid manual tagging or environment checks; let CodeDeploy manage the target mapping. The missing link most engineers overlook: hook scripts must be idempotent, because they run again on every redeployment and rollback.

appspec.yml (YAML)

version: 0.0
os: linux
files:
  - source: /app
    destination: /var/www/myapp
hooks:
  BeforeInstall:
    - location: scripts/stop_server.sh
      timeout: 60
      runas: root
  AfterInstall:
    - location: scripts/start_server.sh
      timeout: 60
      runas: root
  ValidateService:
    - location: scripts/health_check.sh
      timeout: 30
      runas: ec2-user
Production Trap:
The deployment configuration sets your blast radius. For EC2, CodeDeploy defaults to CodeDeployDefault.OneAtATime; switching to AllAtOnce for speed updates every instance in the deployment group simultaneously. For mission-critical apps, keep OneAtATime.
Key Takeaway
Deploy to deployment groups, not environments. Lifecycle hooks are your safety net—make them idempotent.
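The AppSpec above points ValidateService at scripts/health_check.sh, which the article does not show. As a rough idea of what such a check might do, here is a sketch in Python: poll a health endpoint a few times and report success or failure. The URL, attempt count, and delay are assumptions; they are chosen so that the worst case (5 attempts with a 3-second timeout and 2-second pauses) stays under the hook's 30-second timeout.

```python
"""Retrying health check for a ValidateService hook (sketch)."""
import time
import urllib.error
import urllib.request


def http_probe(url="http://localhost:8080/health", timeout=3):
    """Return True if the (hypothetical) health endpoint answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(probe, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call `probe` until it returns True or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if probe():
            return True
        if attempt < attempts:
            sleep(delay_seconds)  # give the service time to finish starting
    return False


# In the hook script itself, the result becomes the exit code, for example:
#   import sys
#   sys.exit(0 if wait_until_healthy(http_probe) else 1)
# A non-zero exit marks the lifecycle event, and so the deployment, as failed.
```

The check only reads state, so running it twice is harmless, which is the property the hooks need on redeployments.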

1. Introduction

DevOps is not a tool — it's a cultural shift that collapses the wall between development and operations. AWS provides the raw infrastructure to automate that collapse, but understanding the why is critical before touching any button. The core promise of DevOps on AWS is this: infrastructure becomes code, deployments become automated, and feedback loops shrink from weeks to minutes. When you define your entire stack in CloudFormation or Terraform, you eliminate configuration drift and manual errors. AWS services like CodePipeline, CodeBuild, and CodeDeploy exist to formalize that automation, but they are meaningless without first understanding the problem they solve — namely that manual deployment is the single biggest source of production downtime. The real metric of success is not how many AWS services you know, but how fast you can recover from a failed deployment. Before you build pipelines, ask yourself: what does 'done' mean for your team? If you cannot answer that without mentioning a business outcome, you are not ready for AWS DevOps.

6. Implementing CI/CD on AWS

Continuous Integration and Continuous Deployment on AWS is not a pipeline — it is a feedback mechanism. The why is simple: every code commit should be deployable to production without human intervention, because hands on keyboards introduce variability and delay. AWS CodePipeline orchestrates the flow, but the real intelligence lives in the triggers. Instead of polling Git every hour, configure webhooks so that CodePipeline activates on every push to main. CodeBuild compiles, tests, and packages your application in isolated containers — this catches dependency hell before it reaches staging. CodeDeploy then pushes artifacts to EC2, Lambda, or ECS using the same deployment strategy every time, eliminating the 'works on my machine' syndrome. The key trap: don't deploy to environments; deploy to instances. Use blue/green or rolling deployments with health checks. A failed deployment should automatically roll back, not require a midnight Slack message. Remember: CI/CD is not about speed — it is about repeatable, auditable, safe delivery. If your pipeline takes 10 minutes but never breaks production, you win.

10. Hands-On Projects for Learning

Theory evaporates without practice. Do not start with a complex microservice architecture; start with a single Lambda function behind API Gateway, deployed automatically on every CodeCommit push. Project 1: Build a static site on S3 behind CloudFront with a CI pipeline that invalidates the cache on deploy. You will learn CloudFormation, IAM roles, and the pain of SSL certs. Project 2: Containerize a Node.js app with Docker, push to ECR, and deploy on ECS Fargate using CodePipeline. You will touch task definitions, service auto-scaling, and load balancer health checks. Project 3: Implement blue/green deployment on EC2 with CodeDeploy, then deliberately break the health check to see an automatic rollback. Each project should be fully destroyed and rebuilt from code in under 30 minutes. If it takes longer, your automation is wrong. The goal is not the application; it is the infrastructure as code that deploys it. Track your time: every minute spent clicking in the AWS console is a minute stolen from learning real automation.

12. Conclusion

AWS DevOps is not a certification or a resume bullet — it is the discipline of eliminating trust in humans and placing trust in code. If you walk away with one thing, let it be this: every manual action you take in AWS is a future incident waiting to happen. Script everything. Automate rollbacks before you automate deployments. Never use root credentials, never hardcode secrets, never approve a production deploy on a Friday. The tools — CodePipeline, CloudFormation, EKS — are just syntax. The real skill is knowing when to say 'no' to a process that cannot be automated. Production architectures fail not because of bad code, but because of bad processes. AWS gives you the raw power to build anything; it also gives you the power to destroy your entire account in one command. Use Infrastructure as Code, enforce least-privilege IAM, and treat every deployment like it will fail. Do that, and you will not just be a DevOps engineer on AWS — you will be someone who never gets paged at 3 AM.

Kubernetes on AWS: The Orchestrator You Didn't Know You Needed

Kubernetes (K8s) is not a deployment tool; it is a declarative operating system for your containers. On AWS, you have two paths: EKS (managed) or self-hosted on EC2. Choose EKS unless you enjoy patching control plane nodes at 2 AM. The reason is simple: Kubernetes gives you self-healing, scaling, and rolling updates out of the box. When a pod dies, its controller replaces it. When traffic spikes, the Horizontal Pod Autoscaler adds replicas. When you push a new image, a rolling update replaces old pods without downtime. The trap is the added complexity: you now manage Ingress controllers, RBAC, ConfigMaps, and persistent volumes. Do not run K8s just to run containers. Use it when you need multi-service orchestration, blue/green deploys across nodes, or fine-grained resource limits. On AWS, use the AWS Load Balancer Controller (formerly the ALB Ingress Controller) for traffic routing, and store secrets in AWS Secrets Manager, not in plaintext YAML. The golden rule: if your app fits on one ECS service with Fargate, do not touch Kubernetes.
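
To make the autoscaling behaviour concrete, here is a minimal sketch of the replica calculation that the Kubernetes documentation describes for the Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric). The function name, the default bounds, and the sample numbers are illustrative assumptions; the real controller also applies a tolerance band and stabilization windows that this sketch leaves out.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Approximate the HPA rule: ceil(current * current_metric / target_metric).

    min_replicas and max_replicas are hypothetical bounds for illustration.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds, as minReplicas/maxReplicas would.
    return max(min_replicas, min(max_replicas, raw))


# 4 pods averaging 90% CPU against a 60% target scale out to 6 pods.
print(desired_replicas(4, 90.0, 60.0))
```

The clamp matters in practice: without an upper bound, a metric spike could request far more pods than the cluster's nodes can schedule.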

● Production incident · POST-MORTEM · severity: high

Lambda Timeout Wrecks Database Migration

Symptom
Migration job fails after exactly 15 minutes. Logs show: "Task timed out after 900.00 seconds". No partial rollback; database left in mid-migration state.
Assumption
Lambda can run any workload because it's serverless and scales automatically.
Root cause
Lambda has a hard 15-minute execution timeout. The migration script took 20 minutes to complete. The function was killed before finishing.
Fix
Switch to EC2, ECS, or AWS Batch for long-running tasks. Alternatively, break the migration into smaller chunks processed by sequential Lambda invocations using Step Functions.
Key lesson
  • Know Lambda's limits before choosing compute. 15-minute max is a hard wall.
  • Long-running batch jobs belong on persistent compute — EC2 or containers.
  • Always test with realistic data volume — development migrations are fast, production ones are not.
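
The chunked alternative mentioned in the fix could look roughly like the sketch below. Each invocation processes batches until it is done or until the remaining time drops below a safety margin, then returns a cursor that a Step Functions loop can pass to the next invocation. `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime provides on the context object; everything else here (`migrate_batch`, the `cursor`/`done` event shape, the margin, the batch size, and `TOTAL_ROWS`) is a hypothetical placeholder, not the code from the incident.

```python
SAFETY_MARGIN_MS = 60_000  # assumed buffer: stop well before the hard timeout
BATCH_SIZE = 500           # hypothetical batch size
TOTAL_ROWS = 2_000         # stand-in for the real table size


def migrate_batch(start: int, size: int) -> int:
    """Hypothetical placeholder: migrate rows [start, start + size) in one
    transaction and return the next cursor position."""
    return min(start + size, TOTAL_ROWS)


def handler(event, context):
    """Process batches until finished or short on time, then return a cursor
    that the state machine can feed into the next invocation."""
    cursor = event.get("cursor", 0)
    while cursor < TOTAL_ROWS:
        # Provided by the Lambda runtime on the context object.
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            return {"cursor": cursor, "done": False}
        cursor = migrate_batch(cursor, BATCH_SIZE)
    return {"cursor": cursor, "done": True}
```

In a Step Functions definition, a Choice state that checks `done` could route back to the same task until the migration completes. Because each batch commits in its own transaction, a killed invocation loses at most one batch instead of leaving the database in an unknown state.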
Production debug guide
When AWS returns AccessDenied errors, use these steps to find the missing policy.
3 entries
Symptom · 01
AccessDenied when calling s3:GetObject on bucket my-bucket
Fix
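Check the identity's policies: aws iam list-attached-role-policies --role-name YourRole lists managed policies, and aws iam list-role-policies --role-name YourRole lists inline ones. Look for a statement that allows s3:GetObject on arn:aws:s3:::my-bucket/* (the object ARN ending in /*, not just the bucket ARN).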
Symptom · 02
AccessDenied on ec2:StartInstances for a specific instance
Fix
Check if the policy uses resource-level permissions. EC2 actions may require specifying the instance ARN. Also check for explicit Deny statements in the same policy or in Service Control Policies (SCPs).
Symptom · 03
Role assumed from EC2 cannot write to CloudWatch Logs
Fix
Verify the trust policy allows sts:AssumeRole for ec2.amazonaws.com. Then check the permissions policy includes logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents on the correct log group ARN.
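
When reading policies during this kind of debugging, it helps to keep the evaluation order in mind: a request is denied by default, an explicit Deny always wins, and only a matching Allow with no matching Deny grants access. The sketch below models just that core rule for identity-policy statements. It is a deliberate simplification: it ignores conditions, SCPs, permission boundaries, and resource policies, and it borrows Python's `fnmatch` wildcards as a stand-in for IAM's matching. The sample statements are illustrative.

```python
from fnmatch import fnmatchcase


def _matches(patterns, value):
    """True if value matches any pattern ('*' and '?' used as wildcards)."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, p) for p in patterns)


def evaluate(statements, action, resource):
    """Return 'ExplicitDeny', 'Allow', or 'ImplicitDeny' for one request."""
    allowed = False
    for stmt in statements:
        if not (_matches(stmt["Action"], action)
                and _matches(stmt["Resource"], resource)):
            continue
        if stmt["Effect"] == "Deny":
            return "ExplicitDeny"  # an explicit deny overrides any allow
        allowed = True
    return "Allow" if allowed else "ImplicitDeny"


# Illustrative statements: read access to the bucket, except one prefix.
POLICY = [
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::my-bucket/*"},
    {"Effect": "Deny", "Action": "s3:*",
     "Resource": "arn:aws:s3:::my-bucket/secret/*"},
]
```

An `ImplicitDeny` result points to a missing Allow statement, while an `ExplicitDeny` means adding more Allow statements will not help: the Deny has to be found and removed or narrowed.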
★ Lambda Debug Cheat Sheet
Quick commands to diagnose Lambda execution issues.
Lambda function times out
Immediate action
Check CloudWatch logs for "Task timed out" message at the exact timeout mark.
Commands
aws logs filter-log-events --log-group-name /aws/lambda/your-function --filter-pattern '"Task timed out"'
aws lambda get-function-configuration --function-name your-function | jq .Timeout
Fix now
Increase the timeout (in seconds, maximum 900) via console or CLI: aws lambda update-function-configuration --function-name your-function --timeout 30
Lambda returns AccessDenied on S3
Immediate action
Identify the Lambda execution role ARN from the function configuration.
Commands
aws lambda get-function-configuration --function-name your-function | jq .Role
aws iam list-attached-role-policies --role-name RoleNameFromARN
Fix now
Add the missing S3 permission to the role's policy, or update the bucket policy to allow the Lambda role.
Lambda invocation fails with "ResourceNotFoundException"
Immediate action
Check that the event source (S3 bucket, SQS queue, etc.) still exists and the Lambda trigger is configured correctly.
Commands
aws lambda get-event-source-mapping --uuid your-uuid
aws s3api get-bucket-notification-configuration --bucket your-bucket
Fix now
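Reconfigure the trigger: delete and recreate the event source mapping. For push sources such as S3, ensure the function's resource-based policy allows the source service to invoke it; for polled sources such as SQS, ensure the execution role can read from the source.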
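
The AccessDenied entry in this cheat sheet passes RoleNameFromARN to the IAM command, which glosses over one step: `get-function-configuration` returns a full role ARN, while `--role-name` expects only the name. A small helper for that step might look like this. The function name and the sample ARNs are illustrative assumptions.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Extract the bare role name that `aws iam` commands expect."""
    prefix, _, resource = role_arn.partition(":role/")
    if not prefix.startswith("arn:") or not resource:
        raise ValueError(f"not an IAM role ARN: {role_arn!r}")
    # Roles may carry a path (e.g. service-role/); the name is the last segment.
    return resource.rsplit("/", 1)[-1]


print(role_name_from_arn(
    "arn:aws:iam::123456789012:role/service-role/my-function-role"))
```

Stripping the path matters because Lambda roles created through the console are often placed under `service-role/`, and passing the path along with the name makes the IAM lookup fail.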

Key takeaways

1. AWS services are region-based; always choose the closest region for low latency.
2. The five core services (EC2, S3, RDS, Lambda, IAM) cover 90% of architectures.
3. Lambda has a hard 15-minute timeout; use EC2 or containers for long-running jobs.
4. IAM least privilege is critical; scope permissions to specific resources.
5. Security is shared: AWS secures infrastructure, you secure your data and access.
6. Use a mix of On-Demand, Reserved, and Spot instances to optimise cost.
7. VPC design with public/private subnets and proper firewalls is non-negotiable.
8. Certifications (especially SAA-C03) accelerate learning, but hands-on practice is essential.

Common mistakes to avoid

4 patterns

Using root account for daily operations

Symptom
Accidental deletion of resources or policy changes that lock you out; no audit trail per-person.
Fix
Create an IAM user for yourself with the 'PowerUserAccess' policy, enable MFA, and store root credentials securely. Use root only for account-level actions like closing the account.

Leaving S3 buckets publicly accessible

Symptom
Sensitive data is discoverable by anyone on the internet, leading to data breaches or unexpected charges.
Fix
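Enable S3 Block Public Access at the account level. Audit bucket policies using AWS IAM Access Analyzer. For legitimate public content, serve it through CloudFront with Origin Access Control (OAC, the successor to the legacy OAI) so the bucket itself stays private.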

Not setting up CloudTrail and monitoring

Symptom
When an incident occurs, you have no logs to trace who did what. Debugging permissions failures becomes guesswork.
Fix
Enable AWS CloudTrail in all regions with log file validation. Send logs to a centralized S3 bucket and analyze with Athena or a third-party SIEM.

Putting RDS in a public subnet

Symptom
Database instance has a public IP, increasing attack surface. A single misconfigured security group can expose data.
Fix
Always place RDS in a private subnet. Use a bastion host or VPN for admin access. Do not assign a public IP.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
What is the difference between a Security Group and a Network ACL?
Q02 · SENIOR
How would you design a highly available and fault-tolerant architecture ...
Q03 · JUNIOR
What is the maximum execution timeout for AWS Lambda? What alternatives ...
Q04 · SENIOR
Explain the difference between an IAM user, role, and policy. When would...
Q01 of 04 · SENIOR

What is the difference between a Security Group and a Network ACL?

ANSWER
Security Groups are stateful, operate at the instance level, and support allow rules only. Network ACLs are stateless, operate at the subnet level, and support both allow and deny rules. Use Security Groups for granular per-resource control and NACLs for subnet-wide rule enforcement (e.g., blocking an IP range).
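
The rule-evaluation half of that answer can be sketched in a few lines. A Network ACL checks its numbered rules in ascending order and the first match decides, with an implicit deny at the end; a Security Group has only allow rules, so any match permits the traffic. The class and function names, the single-port matching, and the sample addresses are simplifying assumptions, and statefulness (automatic return traffic) is not modelled here.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class NaclRule:
    number: int   # lower numbers are evaluated first
    cidr: str
    port: int
    allow: bool


def nacl_permits(rules, source_ip: str, port: int) -> bool:
    """Subnet-level check: the first matching rule in number order decides."""
    for rule in sorted(rules, key=lambda r: r.number):
        if ip_address(source_ip) in ip_network(rule.cidr) and rule.port == port:
            return rule.allow
    return False  # the implicit final rule denies everything else


def sg_permits(allow_rules, source_ip: str, port: int) -> bool:
    """Instance-level check: allow rules only, so any match permits traffic."""
    return any(ip_address(source_ip) in ip_network(cidr) and p == port
               for cidr, p in allow_rules)


# Block one range at the subnet edge, allow HTTPS from everyone else.
NACL = [NaclRule(100, "203.0.113.0/24", 443, False),
        NaclRule(200, "0.0.0.0/0", 443, True)]
SG = [("0.0.0.0/0", 443)]
```

With these sample rules, a client in 203.0.113.0/24 passes the Security Group but is stopped by NACL rule 100, which is the "blocking an IP range" case the answer mentions: a Security Group alone cannot express that deny.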
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the difference between S3 Standard and S3 Glacier storage classes?
02
Can AWS Lambda use Docker container images?
03
How do I monitor costs in AWS?
04
What is the free tier in AWS and what does it include?