Terraform declares cloud infrastructure in HCL and manages the lifecycle via plan-apply
State file maps logical names to real resource IDs; remote state + locking prevents corruption
Providers are plugins that translate HCL into cloud API calls; always pin major versions
The core loop: desired state → current state (from .tfstate) → diff → apply
Performance insight: Terraform reads all state into memory; large state (>20 MB) causes ~5s plan startup
Production insight: manual cloud changes create silent drift, so schedule drift detection runs with terraform plan -refresh-only
Biggest mistake: storing state locally on a shared filesystem — leads to forced unlocks and ghost resources
Plain-English First
Imagine you're building a LEGO city. Instead of just building it by hand and hoping you remember every piece, you write down the exact instructions — 'place a red 2x4 brick here, a blue window there.' Terraform is that instruction manual for cloud infrastructure. You write down exactly what servers, databases, and networks you want, and Terraform builds it. Tear it down and rebuild it tomorrow? Same instructions, identical city — every single time.
Every company running in the cloud eventually hits the same wall: someone clicks around the AWS console to spin up a server, another person does it slightly differently, and six months later nobody knows what's running or why. Servers become 'pets' — hand-crafted, irreplaceable, and terrifying to touch. Terraform exists to end that chaos by letting you describe your entire infrastructure in version-controlled code, the same way you describe your application logic.
Before Terraform, teams either wrote brittle bash scripts full of AWS CLI commands or relied entirely on cloud-specific tools like CloudFormation (which only works on AWS) or Azure ARM templates (which only work on Azure). Terraform solved the vendor lock-in problem by introducing a single declarative language — HCL — that works across AWS, GCP, Azure, and hundreds of other providers. You write your intent ('I want three EC2 instances'), and Terraform figures out the sequence of API calls to make it real.
By the end of this article you'll understand why the Terraform state file is both its superpower and its biggest footgun, how providers and modules keep your code DRY at scale, and how a real-world multi-environment setup actually looks — not a toy example, but the kind of structure you'd find in a production codebase at a fast-growing startup or enterprise engineering team.
HCL Syntax Quick Reference Table
HashiCorp Configuration Language (HCL) is Terraform's declarative language for defining infrastructure. Unlike JSON or YAML, HCL is designed to be human-readable and supports blocks, labels, and expressions. The table below summarizes the most common HCL constructs you'll encounter in every Terraform project. Understanding these primitives — especially the distinction between resources and data sources — is the foundation of writing safe, reusable infrastructure code.
Resources are the core building blocks: they map to real cloud objects (VPCs, instances, databases). Data sources read existing objects without managing their lifecycle. Variables make your configuration parameterizable, and outputs expose values to callers or other configurations. The terraform block sets global settings like required provider versions and backend configuration.
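Blocks take zero or more labels depending on their type: resource and data blocks take two (the type and a local name), variable, output, and module blocks take one, and terraform and locals blocks take none. Inside the block, arguments are key-value pairs. Expressions use Terraform's built-in functions, references, and string interpolation ("${...}"); since Terraform 0.12, a bare reference such as var.name is preferred when the whole value is a single reference. Comments use # or // for single-line and /* ... */ for multi-line.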
One common mistake: confusing locals (computed values that don't require user input) with variables (values supplied by the user at runtime). Locals are defined with locals { ... } and referenced as local.my_local; variables are defined with variable "name" { ... } and referenced as var.name.
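To make the variables-versus-locals distinction concrete, here is a minimal sketch that needs no provider. The names project_name, environment, and name_prefix are made up for illustration, and the three comment styles are used on purpose.

```hcl
# Hypothetical names throughout; no provider is needed to evaluate this.

# A variable is an input: the caller supplies it (or accepts the default).
variable "project_name" {
  type        = string
  description = "Short project identifier supplied by the caller"
  default     = "demo"
}

variable "environment" {
  type        = string
  description = "Deployment environment supplied by the caller"
  default     = "staging"
}

/* A local is computed inside the configuration.
   Callers cannot set it directly; it only derives from other values. */
locals {
  name_prefix = "${var.environment}-${var.project_name}"
}

// Referenced as local.<name>, never var.<name>.
output "name_prefix" {
  value = local.name_prefix
}
```

With the defaults, terraform apply reports name_prefix = "staging-demo". Passing -var='environment=production' changes the local's value without anyone setting the local itself.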
syntax_reference.tf (HCL)
# ── Block types and their structure ──────────────────────────────
# terraform block: global settings (provider requirements, backend)
terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
  backend "s3" { # only one backend block allowed
    bucket = "my-state-bucket"
    key    = "project.tfstate"
    region = "us-east-1"
  }
}

# variable block: user-supplied values
variable "instance_type" {
  type        = string
  description = "EC2 instance size"
  default     = "t3.micro" # optional
  validation {
    condition     = can(regex("^t3\\.", var.instance_type))
    error_message = "Must be t3 family."
  }
}

variable "environment" {
  type        = string
  description = "Environment name used by the locals below"
}

# locals block: computed values
locals {
  name_prefix = "${var.environment}-app"
  common_tags = {
    ManagedBy = "terraform"
    Env       = var.environment
  }
}

# resource block: creates and manages infrastructure
resource "aws_instance" "web" {
  ami           = data.aws_ami.amazon_linux.id # reference a data source
  instance_type = var.instance_type            # reference a variable
  subnet_id     = aws_subnet.main.id           # reference another resource (defined elsewhere)
  tags          = local.common_tags            # reference a local
}

# data source block: read-only access to existing resources
data "aws_ami" "amazon_linux" {
  most_recent = true
  owners      = ["amazon"]
  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# output block: expose values after apply
output "instance_ip" {
  value       = aws_instance.web.public_ip
  description = "Public IP of the web server"
  sensitive   = false # if true, hides value in CLI output
}

# module block: call a reusable module
module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
  name   = local.name_prefix
}

# ── Expressions and functions ─────────────────────────────────────
# String interpolation: "web-${aws_instance.web.id}"
# Direct attribute access: aws_instance.web.id
# Built-in functions: format, join, split, lower, upper, length, concat, etc.
# Conditional: condition ? true_val : false_val
# For expression: [for k, v in var.map : upper(k)]
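The expression forms listed in the reference comments can be tried without any cloud account. The following sketch, using hypothetical variable names, evaluates a conditional and two for expressions purely from inputs.

```hcl
variable "environment" {
  type    = string
  default = "staging"
}

variable "team_tags" {
  type = map(string)
  default = {
    owner = "platform"
    tier  = "web"
  }
}

locals {
  # Conditional: pick a size based on the environment.
  instance_type = var.environment == "production" ? "t3.large" : "t3.micro"

  # For expression producing a list: upper-case every key.
  tag_keys_upper = [for k, v in var.team_tags : upper(k)]

  # For expression producing a map: prefix every value with the environment.
  prefixed_tags = { for k, v in var.team_tags : k => "${var.environment}-${v}" }
}

output "instance_type" {
  value = local.instance_type
}

output "tag_keys_upper" {
  value = local.tag_keys_upper
}

output "prefixed_tags" {
  value = local.prefixed_tags
}
```

With the defaults, instance_type evaluates to "t3.micro" and tag_keys_upper to ["OWNER", "TIER"]. Running terraform console in the same directory is a quick way to experiment with variations.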
Terraform's command lifecycle is the sequence of steps you run to manage infrastructure. Understanding this flow — and the safety checks built into each step — is critical to avoiding production outages. The four primary commands are init, plan, apply, and destroy, but validate, fmt, refresh, and import also play important roles in a robust workflow.
The diagram below visualises the lifecycle: you start with code and state, then move through initialization, planning, and finally applying changes. The state file is the persistent memory that connects each run. If you skip steps — like running apply without first reviewing the plan — you risk unintended changes.
terraform init is the first command you run in a new checkout. It downloads provider plugins and configures the backend (local or remote). If you change provider versions or backend configuration, you must re-run init. terraform validate checks syntax and internal consistency without contacting any remote API, so use it in CI to catch typos fast. terraform plan computes the diff between desired state and current state without changing any infrastructure. It's the most important command for safety. terraform apply executes the plan. terraform destroy is a convenience alias for terraform apply -destroy: it removes every resource managed by the configuration. Always review the destroy plan before confirming.
Note: terraform refresh updates the state file with real-world resource attributes without proposing changes. The command is deprecated in current releases; terraform plan -refresh-only shows detected drift for review, and terraform apply -refresh-only writes it to state.
This lifecycle is the same whether you run it locally or in CI/CD, but in CI/CD the plan output is reviewed via pull request comments and apply is gated by approvals.
commands.sh (Bash)
# Typical local workflow: always start with init
$ terraform init

# Validate syntax (fast, no API calls)
$ terraform validate

# See what will change
$ terraform plan -out=plan.tfplan

# Apply the plan file (safe: exact same plan)
$ terraform apply plan.tfplan

# Destroy everything (be careful!)
$ terraform plan -destroy -out=destroy.tfplan
$ terraform apply destroy.tfplan

# Detect drift without proposing changes to resources
$ terraform plan -refresh-only

# Import existing resource into state
$ terraform import aws_instance.web i-1234567890abcdef0

# Remove resource from state without destroying
$ terraform state rm aws_instance.web

# List all resources in state
$ terraform state list
Output
# Plan output example (abbreviated)
$ terraform plan

Terraform will perform the following actions:

  # aws_instance.web will be created
  + resource "aws_instance" "web" {
      + ami           = "ami-0abcdef1234567890"
      + instance_type = "t3.micro"
      + tags          = {
          + "Name" = "example-web"
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.
Always run `terraform plan` before any apply — even for small changes
The plan output is the last chance to catch mistakes before they hit production. In a team setting, post the plan as a PR comment and require another engineer to approve it. Never auto-apply without human review.
Production Insight
In production pipelines, always use the -out flag to save the plan file. This guarantees that the apply uses exactly the same plan that was reviewed. Running apply without a saved plan file re-evaluates the config, which could produce different results if the state or provider versions changed in the meantime.
Key Takeaway
The lifecycle is init → validate → plan → apply. Each step has a safety purpose: init ensures provider compatibility, plan shows the diff, and apply executes. Never skip plan review in production.
How Terraform's Core Loop Actually Works — Plan, Apply, State
Most tutorials show you terraform apply and move on. But the real magic — and the real danger — lives in the three-step loop Terraform runs every single time you touch your infrastructure.
First, Terraform reads your .tf files and builds the desired state: a model of what you want the world to look like. Then it reads the state file (more on this shortly) to recall what it already built and, by default, refreshes each tracked resource through your cloud provider's APIs. Finally, it computes the diff between those two pictures. That diff is your plan.
This is fundamentally different from imperative tools like Ansible where you say 'run these steps.' Terraform is declarative — you say 'here's the destination' and it plots the route. The benefit is idempotency: running terraform apply ten times on an unchanged config does nothing after the first run, because the desired state already matches reality.
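The critical thing to internalise is that Terraform only knows about what is recorded in the state file. During plan it refreshes the resources it already tracks, so a manual change to a managed attribute shows up as drift on the next run; until then, or whenever refresh is skipped with -refresh=false, your state file lies. Anything created by hand in the AWS console never enters state at all, so Terraform cannot see it. That gap between state and reality is the source of more production incidents than almost any other Terraform mistake.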
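The loop can be observed without a cloud account. This sketch assumes Terraform 1.4 or later, where the built-in terraform_data resource is available; the resource name release_marker and the variable release_label are hypothetical.

```hcl
terraform {
  required_version = ">= 1.4.0" # terraform_data is built in from 1.4
}

variable "release_label" {
  type    = string
  default = "v1"
}

# terraform_data stores its input in state and manages nothing external,
# which makes the desired-versus-recorded comparison easy to watch.
resource "terraform_data" "release_marker" {
  input = var.release_label
}

output "recorded_label" {
  value = terraform_data.release_marker.output
}
```

The first terraform apply adds one resource. A second apply with the same input reports no changes, because the desired state and the recorded state already match. Passing -var='release_label=v2' produces a plan with a single change, since only that one value differs.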
main.tf (HCL)
# main.tf: a minimal but complete AWS setup that demonstrates the core loop
# This creates a VPC and a single EC2 instance inside it.
# Run: terraform init -> terraform plan -> terraform apply

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # Pin to major version to avoid surprise breaking changes
    }
  }
}

# The provider block tells Terraform WHERE to build. Credentials come from
# environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) or an IAM role.
# Never hardcode credentials in .tf files: they'll end up in version control.
provider "aws" {
  region = var.aws_region
}

# Variables make this config reusable across environments.
# Actual values live in terraform.tfvars (git-ignored) or are passed via -var flags.
variable "aws_region" {
  type        = string
  description = "AWS region where all resources will be created"
  default     = "us-east-1"
}

variable "environment_name" {
  type        = string
  description = "Environment tag applied to every resource (e.g. staging, production)"
}

# The VPC is our private network: everything else lives inside it.
resource "aws_vpc" "primary_network" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true # Needed so EC2 instances get resolvable DNS names

  tags = {
    Name        = "${var.environment_name}-vpc"
    Environment = var.environment_name
    ManagedBy   = "terraform" # Tagging as Terraform-managed helps ops teams know NOT to edit manually
  }
}

# A public subnet within that VPC.
resource "aws_subnet" "public_web_subnet" {
  vpc_id                  = aws_vpc.primary_network.id # Reference to the VPC above; Terraform builds the dependency graph from this
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "${var.aws_region}a"
  map_public_ip_on_launch = true

  tags = {
    Name        = "${var.environment_name}-public-subnet"
    Environment = var.environment_name
  }
}

# Data source: reads EXISTING resources rather than creating new ones.
# Here we fetch the latest Amazon Linux 2023 AMI ID dynamically,
# so we're never hardcoding an AMI that gets deprecated.
data "aws_ami" "amazon_linux_2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"] # Amazon Linux 2023 naming pattern
  }
}

# The EC2 instance. Notice it references both the subnet and the AMI data source.
resource "aws_instance" "web_server" {
  ami           = data.aws_ami.amazon_linux_2023.id # Dynamic AMI from the data source above
  instance_type = "t3.micro"
  subnet_id     = aws_subnet.public_web_subnet.id

  tags = {
    Name        = "${var.environment_name}-web-server"
    Environment = var.environment_name
  }
}

# Outputs let you extract values after apply: useful for feeding into CI/CD pipelines
# or just confirming what was built.
output "web_server_public_ip" {
  description = "Public IP of the web server; use this to SSH in or configure DNS"
  value       = aws_instance.web_server.public_ip
}

output "vpc_id" {
  description = "ID of the created VPC; needed if other Terraform workspaces reference this network"
  value       = aws_vpc.primary_network.id
}
Output
$ terraform plan -var='environment_name=staging'

Terraform will perform the following actions:

  # aws_instance.web_server will be created
  + resource "aws_instance" "web_server" {
      + ami           = "ami-0abcdef1234567890"
      + instance_type = "t3.micro"
      + tags          = {
          + "Environment" = "staging"
          + "Name"        = "staging-web-server"
        }
      ...
    }

  # aws_subnet.public_web_subnet will be created
  + resource "aws_subnet" "public_web_subnet" { ... }

  # aws_vpc.primary_network will be created
  + resource "aws_vpc" "primary_network" { ... }

Plan: 3 to add, 0 to change, 0 to destroy.

$ terraform apply -var='environment_name=staging' -auto-approve

aws_vpc.primary_network: Creating...
aws_vpc.primary_network: Creation complete after 2s [id=vpc-0a1b2c3d4e5f67890]
aws_subnet.public_web_subnet: Creating...
aws_subnet.public_web_subnet: Creation complete after 1s [id=subnet-0f9e8d7c6b5a43210]
aws_instance.web_server: Creating...
aws_instance.web_server: Creation complete after 32s [id=i-0123456789abcdef0]

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.

Outputs:

vpc_id = "vpc-0a1b2c3d4e5f67890"
web_server_public_ip = "54.210.167.83"
The State File — Why It's the Heart of Terraform and How to Not Kill It
The state file (terraform.tfstate) is a JSON document that maps your HCL resource names to real cloud resource IDs. When you write aws_instance.web_server, Terraform stores the fact that this logical name corresponds to i-0123456789abcdef0 in AWS. Without it, Terraform would have no idea what it already built and would try to create duplicates on every apply.
Here's the problem: by default the state file sits on your local machine. The moment two engineers on a team both run terraform apply, you have a race condition. Whoever writes their state file last wins — and the loser's changes get orphaned in AWS with no state record. Those resources become ghost infrastructure: real, billing you, invisible to Terraform.
The solution is remote state — storing the state file in a shared, locked backend like S3 with DynamoDB locking (for AWS teams) or Terraform Cloud. The DynamoDB lock table is what prevents two simultaneous applies: the first engineer acquires the lock, the second gets a clear error message and must wait.
You should also never manually edit the state file. If state and reality diverge, use the state commands instead: terraform import brings a resource that exists in the cloud but is missing from state under management, and terraform state rm drops a resource from state without destroying it (for example, when it was already deleted outside of Terraform or should no longer be managed).
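Recent Terraform versions also offer declarative counterparts to these commands, which lets state changes go through plan review. The sketch below assumes Terraform 1.7 or later (import blocks arrived in 1.5, removed blocks in 1.7); the instance ID, the AMI ID, and the legacy_worker address are hypothetical placeholders.

```hcl
terraform {
  required_version = ">= 1.7.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Adopt an instance that exists in AWS but is missing from state.
# Comparable in effect to: terraform import aws_instance.web_server <id>
import {
  to = aws_instance.web_server
  id = "i-0123456789abcdef0" # placeholder instance ID
}

# The configuration the imported instance is expected to match.
resource "aws_instance" "web_server" {
  ami           = "ami-0abcdef1234567890" # placeholder AMI ID
  instance_type = "t3.micro"
}

# Stop managing a resource without destroying it.
# Comparable in effect to: terraform state rm aws_instance.legacy_worker
# The matching resource block must already be deleted from the configuration.
removed {
  from = aws_instance.legacy_worker

  lifecycle {
    destroy = false
  }
}
```

Both operations then appear in terraform plan output before anything is written to state, so a reviewer can catch a wrong ID or address.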
backend.tf (HCL)
# backend.tf: remote state configuration using S3 + DynamoDB
# This file MUST be committed to version control so the entire team
# uses the same backend. The S3 bucket and DynamoDB table themselves
# are usually bootstrapped manually (or via a separate 'bootstrap' Terraform workspace)
# because you can't store state for the thing that stores your state.

terraform {
  backend "s3" {
    bucket = "my-company-terraform-state"                 # Must already exist; Terraform won't create it
    key    = "services/web-app/staging/terraform.tfstate" # Path within the bucket; use a consistent naming scheme
    region = "us-east-1"

    # DynamoDB table provides state locking and prevents concurrent applies
    # Table must have a partition key named exactly 'LockID' (string type)
    dynamodb_table = "terraform-state-locks"

    # Encrypt the state file at rest: your state contains sensitive values
    # like database passwords and private IPs
    encrypt = true
  }
}

# ─── How to bootstrap the S3 bucket and DynamoDB table themselves ───
# The resources below are meant to run ONCE in a dedicated 'bootstrap' workspace
# that uses local state (committed to git for reference).
# After running this once, you never touch it again.

resource "aws_s3_bucket" "terraform_state_store" {
  bucket = "my-company-terraform-state"

  # Prevent accidental deletion of this bucket: if it's gone, all your state is gone
  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Name      = "Terraform Remote State"
    ManagedBy = "terraform-bootstrap"
  }
}

# Versioning on the bucket means you can recover from a botched state write
# by rolling back to a previous version; this has saved production more than once
resource "aws_s3_bucket_versioning" "state_store_versioning" {
  bucket = aws_s3_bucket.terraform_state_store.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Block all public access: state files contain secrets, never make them public
resource "aws_s3_bucket_public_access_block" "state_store_access_block" {
  bucket                  = aws_s3_bucket.terraform_state_store.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# The DynamoDB table for distributed locking
resource "aws_dynamodb_table" "terraform_lock_table" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST" # No need to provision capacity for a low-traffic lock table
  hash_key     = "LockID"          # Must be exactly 'LockID'; Terraform expects this name

  attribute {
    name = "LockID"
    type = "S" # String type
  }

  lifecycle {
    prevent_destroy = true # Losing this table means losing state locking; never delete it
  }

  tags = {
    Name      = "TerraformStateLockTable"
    ManagedBy = "terraform-bootstrap"
  }
}
Output
$ terraform init

Initializing the backend...

Successfully configured the backend "s3"! Terraform will automatically
use this backend unless the backend configuration changes.

Initializing provider plugins...
- Finding hashicorp/aws versions matching "~> 5.0"...
- Installing hashicorp/aws v5.31.0...
- Installed hashicorp/aws v5.31.0 (signed by HashiCorp)

Terraform has been successfully initialized!

# When a second engineer tries to apply at the same time:
$ terraform apply
Acquiring state lock. This may take a few moments...

Error: Error acquiring the state lock

  Error message: ConditionalCheckFailedException: The conditional request failed
  Lock Info:
    ID:        f2a1b3c4-d5e6-7890-abcd-ef1234567890
    Path:      my-company-terraform-state/services/web-app/staging/terraform.tfstate
    Operation: OperationTypeApply
    Who:       alice@build-server-01
    Created:   2024-03-15 14:22:01 UTC

Terraform acquires a state lock to protect from concurrent modifications.
Another Terraform process is currently running. Wait for it to complete,
or use `terraform force-unlock f2a1b3c4-d5e6-7890-abcd-ef1234567890` if it crashed.
Setting up a remote backend with S3 and DynamoDB is the single most impactful change you can make for team safety. This guide walks you through the bootstrap process: creating the S3 bucket for state storage and the DynamoDB table for locking. Once configured, Terraform will automatically use this backend for all operations.
Step 1: Create the S3 bucket with versioning enabled, public access blocked, and server-side encryption. Use a unique name (e.g., company-terraform-state-2026). The bucket must exist before you configure the backend block.
Step 2: Create the DynamoDB table with a hash key named LockID (string type). Use PAY_PER_REQUEST billing since lock operations are infrequent. The table name must match the dynamodb_table value in your backend config.
Step 3: Write the backend configuration in your backend.tf file. This tells Terraform where to store state. The backend block cannot use interpolation (no variables, no locals) — values must be literal strings or provided via -backend-config flags on terraform init.
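Step 4: Run terraform init to migrate from local state to the remote backend. Terraform will ask for confirmation to copy existing state. Once done, state is read from and written to the bucket; the local .terraform/terraform.tfstate file only records which backend is in use, so verify the migration with terraform state list before deleting any leftover local state file.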
Step 5: Test locking by running terraform apply in one terminal and a second terraform apply in another. The second should fail with a lock error.
Best practice: create the bucket and table using a bootstrap Terraform configuration (with local state) or via CloudFormation/AWS CLI. The bootstrapping infrastructure is small and rarely changes. Only IAM roles that need to run Terraform should have access to the state bucket and lock table.
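Because the backend block accepts only literal values, one common way to reuse the same code across environments is partial configuration: leave the environment-specific settings out of backend.tf and supply them at init time. This is a sketch, not the only layout; the file name staging.s3.tfbackend is an assumption, and the bucket, key, and table values are carried over from the examples in this article.

```hcl
# backend.tf: settings shared by every environment
terraform {
  backend "s3" {
    region  = "us-east-1"
    encrypt = true
  }
}

# staging.s3.tfbackend: a separate file containing only key = value pairs.
# Shown here as comments because it is not part of the .tf configuration:
#
#   bucket         = "my-company-terraform-state"
#   key            = "services/web-app/staging/terraform.tfstate"
#   dynamodb_table = "terraform-state-locks"
#
# Supplied when initializing:
#   terraform init -backend-config=staging.s3.tfbackend
```

The trade-off is that the chosen backend file is no longer visible in the code itself, so CI pipelines should pass the -backend-config flag explicitly rather than rely on whatever a previous init left behind.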
bootstrap.sh (Bash)
#!/usr/bin/env bash
# Step-by-step bootstrap script for S3 + DynamoDB backend
# Run this ONCE in a dedicated AWS account (or the same account with caution).
# Requirements: AWS CLI v2, proper IAM permissions.
set -euo pipefail # stop at the first failed command

# Step 1: Create S3 bucket with versioning and encryption
BUCKET="company-terraform-state-$(date +%Y%m%d%H%M%S)" # unique name
aws s3api create-bucket --bucket "$BUCKET" --region us-east-1

aws s3api put-bucket-versioning \
  --bucket "$BUCKET" \
  --versioning-configuration Status=Enabled

aws s3api put-public-access-block \
  --bucket "$BUCKET" \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true

aws s3api put-bucket-encryption \
  --bucket "$BUCKET" \
  --server-side-encryption-configuration \
  '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'

# Step 2: Create DynamoDB table for locking
DYNAMO_TABLE="terraform-state-locks"
aws dynamodb create-table \
  --table-name "$DYNAMO_TABLE" \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --billing-mode PAY_PER_REQUEST

# Step 3: Apply bucket policy for secure access (optional but recommended)
# Grants the Terraform runner role access to the state bucket.
# The account ID and role name are placeholders. The bucket ARN is listed
# alongside the object ARN because bucket-level actions such as ListBucket need it.
aws s3api put-bucket-policy --bucket "$BUCKET" --policy '
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/TerraformRunnerRole"
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::'"$BUCKET"'",
        "arn:aws:s3:::'"$BUCKET"'/*"
      ]
    }
  ]
}'

echo "Bootstrap complete!"
echo "Bucket: $BUCKET"
echo "DynamoDB Table: $DYNAMO_TABLE"
echo "Add the following to backend.tf:"
echo "terraform {"
echo "  backend \"s3\" {"
echo "    bucket         = \"$BUCKET\""
echo "    key            = \"my-project/terraform.tfstate\""
echo "    region         = \"us-east-1\""
echo "    dynamodb_table = \"$DYNAMO_TABLE\""
echo "    encrypt        = true"
echo "  }"
echo "}"
Output
# Expected output after running bootstrap.sh
Bootstrap complete!
Bucket: company-terraform-state-20260512123456
DynamoDB Table: terraform-state-locks
Add the following to backend.tf:
terraform {
  backend "s3" {
    bucket         = "company-terraform-state-20260512123456"
    key            = "my-project/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}
Bucket name must be globally unique
S3 bucket names are globally unique. If the bucket name you choose already exists, the creation will fail. Use a naming convention that includes your company name and a unique suffix (e.g., a date-timestamp) to avoid collisions.
Production Insight
Once the backend is configured, a single misconfigured IAM policy can lock out all Terraform operations. Always test the backend setup in a non-production environment first. Store the bootstrap script in a version-controlled repository with limited access; losing this script means you can't easily recreate the backend. Enable S3 versioning on the state bucket BEFORE any Terraform apply; otherwise, you lose the ability to recover from corrupt state.
Key Takeaway
Set up the S3 + DynamoDB backend before your second engineer runs terraform. Use a bootstrap script to create the bucket and table, then configure backend.tf with literal values. Always test locking by running two concurrent applies.
Modules and Workspaces — Structuring Terraform for Real Teams at Scale
Once you move beyond a single environment, two problems emerge fast: you're copy-pasting .tf files between staging and production (violating DRY), and you're terrified of running terraform apply in the wrong directory.
Modules solve the DRY problem. A module is just a folder of .tf files with defined inputs (variables) and outputs. You write the VPC setup once as a module, then call it from your staging config with environment_name = staging and from your production config with environment_name = production. Changes to the VPC logic happen in one place.
Workspaces solve the isolation problem, but with a caveat. Terraform workspaces let you maintain separate state files for the same configuration, switching between them with terraform workspace select staging. They're great for lightweight environment separation, but they use the same backend bucket and the same code, so a misconfigured variable in terraform.tfvars can still nuke production.
For serious multi-environment setups, most teams graduate to a directory-based structure instead: environments/staging/ and environments/production/ each have their own main.tf that calls shared modules. Each directory has its own state file with its own backend key. It's more files, but it makes applying to the wrong environment much harder: a run in environments/staging/ can only touch the staging state, so you would have to be standing in the wrong directory to hit production.
modules/web_application/main.tf (HCL)
# ─── Project structure ───────────────────────────────────────────
# terraform-infrastructure/
# ├── modules/
# │   └── web_application/
# │       ├── main.tf          ← You are here
# │       ├── variables.tf
# │       └── outputs.tf
# ├── environments/
# │   ├── staging/
# │   │   ├── main.tf          ← Calls the module with staging values
# │   │   └── terraform.tfvars
# │   └── production/
# │       ├── main.tf          ← Calls the module with production values
# │       └── terraform.tfvars
# └── backend.tf
# ─────────────────────────────────────────────────────────────────

# modules/web_application/variables.tf
variable "environment_name" {
  type        = string
  description = "Deployment environment: controls naming and sizing"
  validation {
    condition     = contains(["staging", "production"], var.environment_name)
    error_message = "environment_name must be either 'staging' or 'production'." # Catches typos before they hit AWS
  }
}

variable "instance_type" {
  type        = string
  description = "EC2 instance size: use t3.micro for staging, t3.large for production"
  default     = "t3.micro"
}

variable "vpc_cidr_block" {
  type        = string
  description = "CIDR block for the VPC; must not overlap with other environments"
}

# modules/web_application/main.tf
resource "aws_vpc" "app_network" {
  cidr_block = var.vpc_cidr_block

  tags = {
    Name        = "${var.environment_name}-app-vpc"
    Environment = var.environment_name
  }
}

resource "aws_subnet" "app_subnet" {
  vpc_id     = aws_vpc.app_network.id
  cidr_block = cidrsubnet(var.vpc_cidr_block, 8, 1) # cidrsubnet carves a /24 out of the /16 automatically

  tags = {
    Name        = "${var.environment_name}-app-subnet"
    Environment = var.environment_name
  }
}

# modules/web_application/outputs.tf
output "vpc_id" {
  value       = aws_vpc.app_network.id
  description = "VPC ID: expose this so callers can attach other resources to the same network"
}

output "subnet_id" {
  value       = aws_subnet.app_subnet.id
  description = "Subnet ID for the primary application subnet"
}

# ─── environments/staging/main.tf ────────────────────────────────
# This is how you CALL the module from an environment directory.
# The source argument points to the relative path of the module folder.

terraform {
  required_version = ">= 1.5.0"
  backend "s3" {
    bucket         = "my-company-terraform-state"
    key            = "environments/staging/terraform.tfstate" # Unique key per environment
    region         = "us-east-1"
    dynamodb_table = "terraform-state-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = "us-east-1"
}

module "staging_web_app" {
  source = "../../modules/web_application" # Relative path to the module

  environment_name = "staging"
  instance_type    = "t3.micro"    # Cheaper instance for non-production
  vpc_cidr_block   = "10.1.0.0/16" # Non-overlapping CIDR: staging uses 10.1.x.x
}

# ─── environments/production/main.tf ─────────────────────────────

module "production_web_app" {
  source = "../../modules/web_application"

  environment_name = "production"
  instance_type    = "t3.large"    # Larger instance for production load
  vpc_cidr_block   = "10.2.0.0/16" # Production uses 10.2.x.x, so no CIDR collision
}

# Outputs from a module are accessed via module.<module_name>.<output_name>
output "production_vpc_id" {
  value = module.production_web_app.vpc_id
}
Output
# Running from environments/staging/
$ terraform init && terraform apply

Initializing modules...
- staging_web_app in ../../modules/web_application

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

# Running from environments/production/
$ terraform init && terraform apply

Initializing modules...
- production_web_app in ../../modules/web_application

Apply complete! Resources: 2 added, 0 changed, 0 destroyed.

Outputs:
production_vpc_id = "vpc-0b2c3d4e5f67890a1"

# If you accidentally type the wrong environment name in a tfvars file:
$ terraform plan

Error: Invalid value for variable

  on ../../modules/web_application/variables.tf line 4, in variable "environment_name":
   4: validation {

Validation failed: environment_name must be either 'staging' or 'production'.

# The validation block caught a typo ('Staging' vs 'staging') before any AWS API call was made.
Interview Gold: Modules vs Workspaces
Interviewers love this distinction. Workspaces share code and differ only by state, so they're ideal for feature branch testing where the infrastructure topology is identical. Directory-based modules share logic but have fully independent configurations, backends, and state files, which makes them the right choice for staging vs production where sizing, redundancy, and access controls genuinely differ. Most mature teams use both: modules for DRY logic, directories for environment isolation.
Production Insight
Workspaces share code but not state, so it is easy to accidentally target prod. Directory-based environments with separate backends eliminate this risk. Rule: use workspaces for ephemeral branches, directories for long-lived envs.
Key Takeaway
Modules DRY out infrastructure code. Directories isolate environments. Workspaces are for short-lived copies, not production safety.
{
"heading": "Providers and Dependencies — How Terraform Talks to Clouds and APIs",
"content": "Every Terraform operation that touches a cloud resource goes through a **provider plugin**. The provider is the bridge between Terraform's HCL and the cloud provider's API. When you write `resource \"aws_instance\"`, Terraform calls the AWS provider, which authenticates via environment variables or IAM roles, and translates your config into a series of AWSSDKcalls (CreateInstance, DescribeInstances, etc.).\n\nProvider versioning matters more than you think. HashiCorp can introduce breaking changes in minor versions — a widely reported incident in 2023 where the AWS provider v5.0 changed the defaultfor `encrypt` on certain resources, leading to unreviewed plan changes that recreated encrypted resources. Pin your provider versions with `~> 5.0` (pessimistic operator) to allow patch-level updates while preventing major and minor surprises. Use `terraform init -upgrade` explicitly when you intend to upgrade.\n\nAnother hidden gotcha: **provider caching**. Terraform downloads provider binaries into `.terraform/providers/` on init. If your CI pipeline runs in an air-gapped environment, you must either mirror the provider registry or bundle providers in a container image. Without that, `terraform init` fails with network errors. The `terraform providers mirror` command downloads all required providers for offline use.",
"code": {
"language": "hcl",
"filename": "providers.tf",
"code": "# providers.tf — Managing provider versions and aliases\n# Pin provider major versions to avoid surprise re-creation\n# Use provider aliases to manage resources across multiple regions\n\nterraform {\n required_version = \">= 1.5.0\"\n\n required_providers {\n aws = {\n source = \"hashicorp/aws\"\n version = \"~> 5.0\" # Accepts 5.x but not 4.x or 6.x\n }\n random = {\n source = \"hashicorp/random\"\n version = \"~> 3.5\"\n }\n }\n}\n\n# Primary provider for us-east-1\nprovider \"aws\" {\n region = \"us-east-1\"\n alias = \"primary\" # Not strictly needed for a single provider, but good practice\n}\n\n# Secondary provider for us-west-2 — used for cross-region resources like Route53 health checks\nprovider \"aws\" {\n region = \"us-west-2\"\n alias = \"secondary\"\n}\n\n# Usage: resource \"aws_s3_bucket\" \"replica\" {\n# provider = aws.secondary\n# bucket = \"my-replica-us-west-2\"\n# }\n\n# For air-gapped CI: pre-download providers\n# $ terraform providers mirror ./terraform-mirror\n# Then in CI:\n# $ terraform init -plugin-dir=./terraform-mirror\n\n# If you need to use a custom provider not on the public registry:\n# terraform {\n# required_providers {\n# mycloud = {\n# source = \"example.com/myorg/mycloud\"\n# version = \">= 1.0\"\n# }\n# }\n# }"
},
"callout": {
"type": "warning",
"title": "Provider Version Pin Is Not a Suggestion",
"text": "Hashicorp released provider SDK v2.0 for the AWS provider which changed the default value of `enable_dns_hostnames` to `false`. Countless teams had their VPCs silently recreated because they were using `version = \">= 3.0\"` without an upper bound. Always pin with `~> 5.0` (or exact version) and review the changelog before bumping."
},
"production_insight": "Provider upgrades can silently change resource attributes, triggering re-creation. Air-gapped CI needs offline provider mirroring — without it, pipelines fail on init. Rule: pin provider versions explicitly and test upgrades in staging first.",
"decision_tree": {
"title": "When to Use Provider Aliases vs Separate Terraform Configurations",
"items": [
{
"condition": "Resources need to be created in a different region within the same account",
"result": "Use provider alias — keeps all resources in one state file and simplifies dependency management."
},
{
"condition": "Resources need to be created in a completely different AWS account",
"result": "Use separate Terraform configurations with separate backends — never share state across accounts for security."
},
{
"condition": "You need to manage both AWS and Azure resources in the same project",
"result": "Use multiple providers in the same configuration — Terraform handles cross-provider dependencies natively."
},
{
"condition": "You need to use a custom or community provider not on the public registry",
"result": "Specify the custom source in `required_providers` and run `terraform init` with appropriate network access."
}
]
},
"key_takeaway": "Providers are the bridge between HCL and cloud APIs. Pessimistic version pinning prevents surprise re-creation. Offline mirroring is essential for air-gapped CI/CD environments."
},
{
"heading": "CI/CD with Terraform — Automating Apply Safely in Production Pipelines",
"content": "Running `terraform apply` from a laptop is fine for a personal project. In a team setting, it's a disaster waiting to happen. The industry standard is to automate Terraform in a CI/CD pipeline where every plan is reviewed, every apply is audited, and destructive changes require explicit approval.\n\nThe gold standard pipeline looks like this:\n1. **Pull Request** triggers `terraform init` and `terraform plan`.\n2. **Plan output** is posted as a comment on the PR — no human should approve a change without reading it.\n3. A **second engineer reviews** both the code and the plan output.\n4. **Merge to main** triggers `terraform apply` automatically (or with a manual approval step for production).\n5. **Post-apply** artifacts include the final state file, outputs, and a link to the cloud console.\n\nKey tools: GitHub Actions, GitLab CI, Atlantis (a dedicated Terraform CI runner that comments on PRs), or Terraform Cloud's native run workflows. Atlantis is particularly popular because it embeds plan/apply directly into the PR workflow — no separate dashboard needed.\n\nWhere it goes wrong: teams skip the plan review step and auto-apply on merge. If someone accidentally merges a change that destroys a database, the pipeline doesn't catch it. Always enforce plan review for production. Use Sentinel or OPA policies to enforce rules like 'no destruction of stateful resources' or 'instance types must be within approved list'.",
"code": {
"language": "yaml",
"filename": ".github/workflows/terraform.yml",
"code": "# .github/workflows/terraform.yml — Production CI/CD for Terraform\n# This workflow runs terraform plan on PRs and apply on merges to main.\n# It assumes an OIDC role for AWS credentials — no hardcoded secrets.\n\nname: 'Terraform CI/CD'\n\non:\n push:\n branches: [ \"main\" ]\n pull_request:\n branches: [ \"main\" ]\n\npermissions:\n id-token: write\n contents: read\n pull-requests: write # Needed to post plan comments\n\njobs:\n terraform:\n name: 'Terraform Plan'\n runs-on: ubuntu-latest\n defaults:\n run:\n working-directory: ./environments/production # Change to your env\n\n steps:\n - name: Checkout\n uses: actions/checkout@v4\n\n - name: ConfigureAWSCredentials (OIDC)\n uses: aws-actions/configure-aws-credentials@v4\n with:\n role-to-assume: arn:aws:iam::123456789012:role/terraform-ci-role\n role-session-name: TerraformPipeline\n aws-region: us-east-1\n\n - name: SetupTerraform\n uses: hashicorp/setup-terraform@v3\n with:\n terraform_version: 1.5.0\n\n - name: TerraformInit\n id: init\n run: terraform init -input=false\n\n - name: TerraformFormatCheck\n run: terraform fmt -check -diff\n\n - name: TerraformValidate\n run: terraform validate\n\n - name: TerraformPlan\n id: plan\n run: terraform plan -input=false -no-color\n continue-on-error: true\n\n - name: PostPlanComment to PR\n if: github.event_name == 'pull_request'\n uses: actions/github-script@v7\n with:\n script: |\n const output = `#### TerraformPlan 📖\n <details><summary>ShowPlan</summary>\n \\\`\\\`\\\`\\n${process.env.PLAN_OUTPUT}\\n\\\`\\\`\\\`\n </details>\n *Pusher: @${{ github.actor }}, Action: ${{ github.event_name }}*`;\n github.rest.issues.createComment({\n issue_number: context.issue.number,\n owner: context.repo.owner,\n repo: context.repo.repo,\n body: output\n });\n env:\n PLAN_OUTPUT: ${{ steps.plan.outputs.stdout }}\n\n - name: TerraformApply (on push to main only)\n if: github.ref == 'refs/heads/main' && github.event_name == 'push'\n run: terraform apply -input=false 
-auto-approve"
},
"callout": {
"type": "tip",
"title": "Use Atlantis for Native PR Integration",
"content": "Atlantis (runatlantis.io) is an open-source tool that replaces the custom GitHub Actions workflow above. It runs Terraform commands in Docker containers and posts plan/apply results directly as PR comments. It supports multiple projects, workspaces, and custom workflows. Many teams prefer it over Terraform Cloud for cost reasons and self-hosted flexibility."
},
"production_insight": "Skipping plan review is the most common CI/CD mistake. Auto-apply on merge without approval leads to database deletions. Rule: approve plans via PR comments, use policy-as-code to block dangerous changes.",
"decision_tree": {
"title": "CI/CD Strategy Decision Tree",
"items": [
{
"condition": "Team size < 5, only one environment (staging)",
"result": "Simple GitHub Actions with plan on PR, apply on merge — no external tooling needed."
},
{
"condition": "Multiple environments, single team",
"result": "Atlantis with per-environment workflows — plan all, apply via approved comments."
},
{
"condition": "Enterprise with compliance requirements",
"result": "Terraform Cloud with Sentinel policies — enforced approvals, audit logs, cost estimation."
},
{
"condition": "GitOps-driven organization (ArgoCD, Flux)",
"result": "Use Terraform as a tool within GitOps — store Terraform outputs in ConfigMaps, manage infrastructure as part of the CD pipeline."
}
]
},
"key_takeaway": "Plan on every PR, approve manually, apply only on merge. Policy-as-code catches destructive changes before they run. Never auto-apply production without a second pair of eyes."
}
]
● Production incident · POST-MORTEM · severity: high
Accidental terraform destroy on Production — State Lock vs Human Error
Symptom
All production services became unreachable. The AWS console showed VPC, subnets, and EC2 instances being deleted in real time.
Assumption
The team assumed that remote state locking would prevent destructive operations. They had not separated CI/CD pipelines per environment.
Root cause
The engineer had local Terraform CLI access with production-level AWS credentials. The directory structure was using workspaces, so a simple terraform workspace select prod followed by terraform destroy was all it took. No second pair of eyes, no plan approval.
Fix
Immediately restored from a recent state backup (S3 bucket versioning was enabled). Then restructured into separate directories: environments/staging/ and environments/production/, each with its own state file and IAM role. Production credentials were removed from developer machines, so terraform destroy can no longer be run locally against production; destructive changes go through the CI/CD pipeline via approved PRs only. Added prevent_destroy = true on all stateful resources.
Key lesson
Never give developers direct CLI write access to production environments.
Remote state locking only prevents concurrent applies, not destructive applies.
Directory-based environments with separate backends are safer than workspaces for long-lived environments.
Versioning on the state bucket is not optional — it saved the team here.
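The two guardrails named in the fix, prevent_destroy and state bucket versioning, might look like the following minimal sketch. It assumes AWS provider 5.x; the resource labels are illustrative, and the bucket name is the one used in the backend examples earlier in this article. One caveat worth stating: prevent_destroy only rejects plans while the resource block is still present in the configuration, so it complements restricted credentials rather than replacing them.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# State bucket (label "terraform_state" is an assumption for this sketch).
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-company-terraform-state"

  lifecycle {
    # Any plan that would destroy this bucket fails with an error
    # instead of being offered for approval.
    prevent_destroy = true
  }
}

# Keeps every previous version of terraform.tfstate so that a bad apply
# or destroy can be rolled back to an earlier state object.
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

The same lifecycle block can be added to databases and other stateful resources; it is shown on the bucket here only because that resource needs no further required arguments.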
Production debug guide
Common symptoms and actions to resolve state drift, plan mismatches, and provider problems fast. (5 entries)
Symptom · 01
terraform plan shows changes for resources that were not modified
→
Fix
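Run terraform plan -refresh-only to see how real cloud resource attributes differ from the state file, then terraform apply -refresh-only to accept them into state (the standalone terraform refresh command is deprecated in favor of these). If changes persist, check whether resource tags, descriptions, or default values are being set by the cloud provider that Terraform doesn't know about. Use terraform state show <resource> to inspect the state.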
Symptom · 02
terraform apply fails with 'Error acquiring the state lock'
→
Fix
Identify who holds the lock first: the error message prints the lock ID, the holder, and the creation time. Run terraform force-unlock <LOCK_ID> only if you are absolutely certain the previous apply crashed (e.g., CI runner killed). Otherwise wait for the lock to release; terraform plan -lock=false is available for read-only checks in the meantime.
Symptom · 03
Provider plugin installation fails or version mismatch
→
Fix
Verify required_providers versions in your terraform block. Run terraform init -upgrade to redownload provider plugins. If using a private registry, check ~/.terraformrc or TF_CLI_CONFIG_FILE. Common cause: network restrictions blocking registry.terraform.io.
Symptom · 04
terraform apply succeeds but resources don't appear in cloud console
→
Fix
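First confirm the console is showing the same account and region that the provider block targets. Then run terraform state list to confirm the resource is in state. If the state entry exists but the resource was deleted outside Terraform (manual console deletion), the entry is stale: remove it with terraform state rm and re-apply to recreate the resource. If the resource exists in the cloud but is missing from state, use terraform import to bring it under management.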
Symptom · 05
terraform plan takes longer than 30 seconds or memory spikes
→
Fix
Large state files (>50 MB) cause slow plans. Check state file size in S3. Reduce state size by splitting into multiple workspaces or using remote state for shared resources (like VPCs). Use terraform state list | wc -l to count resources. If >2000, consider refactoring into separate Terraform stacks.
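Two of the fixes in this guide can be sketched in HCL. The fragment below is a hedged illustration, not a drop-in configuration: it assumes a separately managed network stack whose state lives under a hypothetical key (shared/network/terraform.tfstate) and which exports a hypothetical output named primary_subnet_id. The AMI ID is a placeholder. The ignore_changes setting is one possible answer to Symptom 01 when an external process owns the tags; the terraform_remote_state data source is the "remote state for shared resources" approach from Symptom 05.

```hcl
provider "aws" {
  region = "us-east-1"
}

# Read outputs from the network stack instead of managing the VPC here.
# The key and the output name used below are assumptions for this sketch.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "my-company-terraform-state"
    key    = "shared/network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"
  subnet_id     = data.terraform_remote_state.network.outputs.primary_subnet_id

  lifecycle {
    # Stop reporting drift on tags that something outside Terraform manages.
    ignore_changes = [tags]
  }
}
```

Splitting this way keeps each state file small, and the consumer stack only needs read access to the network stack's state object. Use ignore_changes sparingly, since it also hides tag edits you might have wanted to catch.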
★ Terraform Quick Debug Cheat Sheet
Five production scenarios that will save your on-call shift.
State drift: cloud console changes not reflected in plan
Immediate action
Run `terraform plan -refresh-only` to see what changed outside Terraform without proposing changes.
Commands
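terraform plan -refresh-only
terraform apply -refresh-only
Fix now
If a managed resource was modified manually, either update the configuration to match or run terraform apply to revert the change. Use terraform import only for resources that were created outside Terraform and are not yet in state.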
Lock error on apply: another process holding the lock
Immediate action
Check who holds the lock: `terraform plan -lock=false` to see plan but not apply. Identify the lock holder via AWS DynamoDB console or Terraform Cloud.
Commands
terraform force-unlock <LOCK_ID>
terraform plan
Fix now
Only force-unlock if you verified the holding process crashed. Never force-unlock during an active apply.
Resource not found but state says it exists
Immediate action
Run `terraform state show <resource>` to confirm the state entry. Then check actual existence with the matching AWS CLI describe call, for example `aws ec2 describe-instances --instance-ids <id>` for an EC2 instance.
Commands
terraform state rm <resource>
terraform import <resource> <resource_id>
Fix now
If resource is gone, remove from state and recreate with fresh apply. If resource exists but wrong ID, correct the import.
`terraform init` fails with provider download error
Immediate action
Check network connectivity to registry.terraform.io. If behind a proxy, set `HTTP_PROXY` and `HTTPS_PROXY` env vars.
Commands
terraform init -upgrade
TF_LOG=DEBUG terraform init
Fix now
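If no proxy is available, run terraform providers mirror <dir> on a machine with registry access, copy that directory across, and run terraform init -plugin-dir=<dir>. This is more reliable than hand-placing provider zips under .terraform/providers/.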
Plan shows all resources as 'force replacement': depends_on missing
Immediate action
Check if a resource attribute that cannot be updated in-place has changed (e.g., VPC CIDR). If no meaningful change, likely a provider bug or state mismatch.