Advanced 12 min · 2026-06-21

Ansible AWS Automation: Production Patterns & Gotchas with amazon.aws Collection

Master Ansible AWS automation with the amazon.aws collection.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

Last updated: June 21, 2026
Quick Answer

  • Use the amazon.aws collection (>=5.0.0) and fully qualified module names; many modules that used to live in community.aws have been migrated into it.
  • Always set state: present and the required parameters for idempotency; cloud modules are stateful.
  • For EC2 instances, use ec2_instance with exact_count and instance_role for production.
  • S3 bucket creation is idempotent, but block public access explicitly with public_access instead of relying on defaults.
  • IAM role creation is idempotent, but the assume role policy document must be valid JSON.
  • RDS instances: set skip_final_snapshot: true only for testing; production must handle snapshots.
  • VPC subnets: use ec2_vpc_net and ec2_vpc_subnet with tags for idempotent lookup.
  • Dynamic inventory with the aws_ec2 plugin: use cache: yes and cache_plugin: jsonfile to avoid API rate limits.
  • Handle eventual consistency with retries, delay, and until on ec2_instance_info after creation.
  • Store secrets in SSM Parameter Store as SecureString parameters and set no_log: true on every task that handles them.

What is Ansible AWS Automation?

Ansible AWS automation is the practice of using Ansible playbooks to manage AWS infrastructure as code. The amazon.aws collection (version 5.0.0+) is the set of core AWS modules maintained by the Ansible cloud team; many modules that used to live in community.aws have been migrated into it. It provides modules for EC2, S3, IAM, RDS, VPC, and many other AWS services, all designed with idempotency in mind.


In the Ansible ecosystem, the amazon.aws collection fits as the primary interface between Ansible and AWS. It leverages the boto3 and botocore Python libraries to make API calls. The key advantage over writing raw aws CLI commands or using CloudFormation is that Ansible modules handle state management — you declare the desired end state, and Ansible figures out what actions (create, update, delete) are needed to reach that state.

The problem it solves is the complexity of AWS API interactions: handling pagination, eventual consistency, error retries, and idempotency. Without Ansible, you'd have to write scripts with error handling, retries, and state checks. The amazon.aws collection encapsulates these patterns, allowing you to focus on infrastructure design rather than API quirks.

Plain-English First

Think of Ansible for AWS like a smart remote control for your cloud infrastructure. Instead of clicking buttons in the AWS console, you write a recipe (playbook) that says 'I want exactly 3 servers of this type, with these security settings, and this S3 bucket for logs.' Ansible talks to AWS APIs to make it happen, and if you run the recipe again, it checks what's already there and only changes what's needed — that's idempotency. But AWS is a distributed system, so sometimes when you create a server, it takes a moment for the list of servers to update. Ansible has a 'wait and retry' feature to handle that. And for secrets like database passwords, you store them in AWS SSM Parameter Store, a secure vault, and Ansible pulls them at runtime without exposing them in your code.

I still remember the 3 AM wake-up call. Our production deployment had been running smoothly for months, but that night, a seemingly innocuous change to an Ansible playbook caused a 45-minute outage. The root cause? I had used the deprecated ec2 module instead of ec2_instance, and the module didn't handle idempotency correctly — it terminated all existing instances and created new ones, thinking they were 'extra'. That incident taught me the hard way that Ansible AWS automation requires deep understanding of module behavior, API consistency, and cloud state management.

Historically, Ansible's AWS support started with basic modules like ec2 and s3, which were monolithic and often inconsistent. The community developed workarounds, but the real game-changer was the amazon.aws collection (introduced in Ansible 2.9, now the standard). This collection provides focused, idempotent modules like ec2_instance, s3_bucket, iam_role, and rds_instance, designed to work with the AWS API's eventual consistency model.

In this article, I'll share production patterns I've developed over years of managing thousands of AWS resources with Ansible. We'll cover the essential modules from the amazon.aws collection, dynamic inventory with the aws_ec2 plugin, handling idempotency and eventual consistency, and securing secrets with AWS SSM Parameter Store. Every code example is battle-tested in production environments.

By the end, you'll have a practical playbook (pun intended) for building robust, scalable AWS automation with Ansible that won't cause 3 AM phone calls.

Setting Up the amazon.aws Collection for Production

The first step is ensuring you have the correct collection version. The amazon.aws collection holds the core AWS modules, including many that were migrated out of community.aws. Install it with:

```bash
ansible-galaxy collection install amazon.aws:==5.0.0
```

For repeatable installs, pin the major version in requirements.yml:

```yaml
---
collections:
  - name: amazon.aws
    version: '>=5.0.0,<6.0.0'
```

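Then run ansible-galaxy collection install -r requirements.yml. The collection depends on the boto3 and botocore Python libraries on the control node. The minimum versions change between collection releases, so check the README of the release you pin. For example:

```bash
pip install 'boto3>=1.21.0' 'botocore>=1.24.0'
```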

For authentication, use IAM instance profiles on EC2 or environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY). In production, avoid hardcoding secrets; use Ansible Vault or SSM Parameter Store (covered later).

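Keep the region and retry settings in one place, for example group_vars/all.yml:

```yaml
---
ansible_aws_region: us-east-1
ansible_aws_retry_max_attempts: 10
ansible_aws_retry_delay: 5
```

These are ordinary playbook variables, not settings the collection reads by itself: they take effect only where a task references them (for example region: "{{ ansible_aws_region }}" or retries: "{{ ansible_aws_retry_max_attempts }}"). Retries inside boto3 are configured separately through the AWS_MAX_ATTEMPTS and AWS_RETRY_MODE environment variables. The default retries are often insufficient for eventual consistency.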
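As a rough sketch of how a region and botocore retry settings can be applied to every AWS task in a play, the example below uses module_defaults with the aws action group and a play-level environment block. The play name and the final debug task are illustrative only, and ansible_aws_region is assumed to be defined in group_vars as shown earlier.

```yaml
- name: Provision with shared AWS defaults
  hosts: localhost
  connection: local
  gather_facts: false
  # botocore normally reads these when each module runs on the control node
  environment:
    AWS_MAX_ATTEMPTS: "10"
    AWS_RETRY_MODE: standard
  # region is passed to every module in the collection's aws action group
  module_defaults:
    group/aws:
      region: "{{ ansible_aws_region }}"
  tasks:
    - name: Confirm which identity the play is using
      amazon.aws.aws_caller_info:
      register: caller

    - name: Show account and ARN before changing anything
      ansible.builtin.debug:
        msg: "Account {{ caller.account }} as {{ caller.arn }}"
```

Running the identity check first is a cheap way to catch a wrong profile or account before any resource is touched.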

Version Compatibility
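Prefer the amazon.aws version of a module whenever one exists, and call every module by its fully qualified name (for example amazon.aws.ec2_instance). Modules that were migrated out of community.aws remain there only as redirects, and a redirect can disappear in a later release.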
Production Insight
In one incident, we had a playbook using community.aws.ec2_instance which was actually a redirect to amazon.aws.ec2_instance. After upgrading the collection, the redirect broke and tasks failed with 'module not found'. We fixed it by explicitly using amazon.aws.ec2_instance and removing community.aws from requirements.
Key Takeaway
Always use amazon.aws collection pinned to a major version, and ensure boto3/botocore are up to date on the control node.

Managing EC2 Instances with ec2_instance Module

The ec2_instance module is the modern way to manage EC2 instances. It supports exact_count for idempotent instance management. Here's a production playbook snippet:

```yaml
- name: Launch web servers
  amazon.aws.ec2_instance:
    name: web
    instance_type: t3.medium
    image_id: ami-0abcdef1234567890
    key_name: my-key
    security_group: "sg-xxxx"
    vpc_subnet_id: "subnet-xxxx"
    # One task owns all three instances. Looping over names with
    # exact_count: 3 would launch three instances per name.
    exact_count: 3
    instance_role: my-instance-profile   # profile name as a plain string
    network:
      assign_public_ip: true
    tags:
      Environment: production
      Role: web
    wait: yes
    wait_timeout: 600
```

Key parameters
  • exact_count: Ensures exactly that many matching instances exist. If fewer, it creates; if more, it terminates extras. Without it, or without a unique name to match on, repeated runs can launch duplicates.
  • instance_role: Attaches an IAM instance profile. Pass the profile name or ARN as a string.
  • wait: yes and wait_timeout: Crucial for production — waits for instance to reach running state.
  • network: Allows specifying network interfaces. For multiple ENIs, use network_interfaces.

Gotcha: The exact_count parameter works by filtering instances based on name tag and other filters you provide. If you don't set name, it may count unrelated instances. Always set name and tags to scope the count.

For updating instances (e.g., change instance type), use state: running and modify parameters. However, not all attributes are updatable in place; some require replacement. Use instance_ids to target specific instances for operations like stop/start.
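The two ideas above, scoping the count with explicit filters and targeting a known instance by ID, might look like the following sketch. The tag values, AMI, subnet, and instance ID are placeholders, not values from a real account.

```yaml
# Scope the count with explicit filters so unrelated instances are ignored
- name: Keep exactly three production web servers
  amazon.aws.ec2_instance:
    name: web
    instance_type: t3.medium
    image_id: ami-0abcdef1234567890
    vpc_subnet_id: subnet-xxxx
    exact_count: 3
    filters:
      "tag:Role": web
      "tag:Environment": production
      instance-state-name: running
    tags:
      Role: web
      Environment: production

# Act on one known instance instead of a counted group
- name: Stop a specific instance by ID
  amazon.aws.ec2_instance:
    instance_ids:
      - i-xxxx
    state: stopped
    wait: yes
```

Keeping the filters and the tags in agreement matters: instances launched by the task should match the same filters that the next run counts.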

Idempotency with exact_count
Always use exact_count when you want a fixed number of instances. Without it, or without a unique name for the module to match on, repeated runs can launch additional instances, leading to drift and cost overruns.
Production Insight
We once had a playbook that created instances without exact_count. After a few runs, we had 50 instances instead of 3. Adding exact_count: 3 with proper name and tags filters immediately terminated the extras and prevented future drift.
Key Takeaway
Use ec2_instance with exact_count, name, and tags for idempotent EC2 management. Always set wait: yes.

Creating S3 Buckets and Objects with Idempotency

The s3_bucket and s3_object modules manage S3 resources. s3_bucket is idempotent by default: if the bucket exists and is owned by you, it reports ok. If it exists but is owned by another account, it fails with BucketAlreadyExists. For production:

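```yaml
- name: Create application bucket
  amazon.aws.s3_bucket:
    name: my-app-bucket
    state: present
    region: "{{ ansible_aws_region }}"
    public_access:            # block every form of public access explicitly
      block_public_acls: true
      block_public_policy: true
      ignore_public_acls: true
      restrict_public_buckets: true
    versioning: yes
    tags:
      Environment: production
```

  • public_access: Blocks public ACLs and public policies at the bucket level. s3_bucket has no permission parameter (that one belongs to s3_object), so set public access explicitly rather than relying on account defaults.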
  • versioning: Enable for critical data.

Upload objects with s3_object:

```yaml
- name: Upload configuration file
  amazon.aws.s3_object:
    bucket: my-app-bucket
    object: /config/app.conf
    src: /local/path/app.conf
    mode: put
    permission: bucket-owner-full-control
```

Gotcha: Bucket policies are managed through the policy parameter of s3_bucket, which takes a JSON document. iam_policy handles IAM inline policies, not bucket policies. Keep the bucket policy in a file and load it with lookup('file', ...).

Idempotency for objects: With mode: put, the overwrite parameter decides whether an existing key is replaced. overwrite: always uploads on every run, while overwrite: different compares the MD5 of the local file with the object's ETag and uploads only when they differ:

```yaml
- name: Upload config only if changed
  amazon.aws.s3_object:
    bucket: my-app-bucket
    object: /config/app.conf
    src: /local/path/app.conf
    mode: put
    overwrite: different
```

S3 Bucket Naming
S3 bucket names must be globally unique and DNS-compliant. Use a naming convention like company-app-environment-region to avoid collisions.
Production Insight
We had a bucket creation fail because another team had already created a bucket with the same name in a different account. We switched to using account-specific prefixes (e.g., myapp-{{ aws_account_id }}-bucket) to guarantee uniqueness.
Key Takeaway
Use s3_bucket with state: present and an explicit public_access block. For objects, use overwrite: different to avoid unnecessary uploads.
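The account-specific prefix mentioned in the insight above can be derived at runtime rather than hardcoded. A minimal sketch, assuming the myapp prefix and the caller variable name are free to choose:

```yaml
- name: Look up the current AWS account ID
  amazon.aws.aws_caller_info:
  register: caller

- name: Create a bucket whose name embeds the account ID
  amazon.aws.s3_bucket:
    name: "myapp-{{ caller.account }}-bucket"
    state: present
```

Because the account ID is unique, the same playbook can run in several accounts without the bucket names colliding.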

Creating IAM Roles and Instance Profiles

IAM role management is critical for security. The iam_role module creates roles and attaches policies. Production example:

```yaml
- name: Create EC2 service role
  amazon.aws.iam_role:
    name: ec2-service-role
    assume_role_policy_document: "{{ lookup('file', 'assume-role-policy.json') }}"
    managed_policies:
      - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
      - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    state: present
    create_instance_profile: yes
```

assume_role_policy_document must be a valid JSON string. Use lookup('file', ...) to load from file. Example assume-role-policy.json:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Gotcha: The iam_role module reports a change whenever assume_role_policy_document differs from what is attached, so editing the file updates the role. managed_policies is authoritative by default: with purge_policies: true (the default), any managed policy missing from the list is detached. Set purge_policies: false if other tooling attaches policies to the same role. Inline policies are handled separately by the iam_policy module.

For instance profiles, create_instance_profile: yes creates a profile with the same name as the role. To attach the profile to an EC2 instance, use ec2_instance with instance_role parameter.

Policy Document Format
The assume_role_policy_document must be a JSON string, not a YAML dict. Use lookup('file', ...) or lookup('template', ...) to ensure proper formatting.
Production Insight
We once had a role update fail because the JSON file had a trailing comma. The module reported 'MalformedPolicyDocument'. We added a validation step: - name: Validate JSON | set_fact: policy_json={{ lookup('file', 'policy.json') | from_json }} before the IAM task.
Key Takeaway
Use iam_role with assume_role_policy_document from a file, and create_instance_profile: yes. Validate JSON before applying.
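The validation step described in the insight above could be written as two tasks placed before the iam_role task. This is a sketch: the trust_policy variable name and the failure message are illustrative.

```yaml
- name: Parse the trust policy so malformed JSON fails early
  ansible.builtin.set_fact:
    trust_policy: "{{ lookup('file', 'assume-role-policy.json') | from_json }}"

- name: Check the fields an IAM trust policy needs
  ansible.builtin.assert:
    that:
      - trust_policy.Version is defined
      - trust_policy.Statement | length > 0
    fail_msg: "assume-role-policy.json is missing Version or Statement"
```

A trailing comma or stray quote makes from_json fail on the control node, before any IAM API call is made.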

Provisioning RDS Instances with rds_instance

The rds_instance module manages RDS databases. Production example:

```yaml
- name: Create PostgreSQL RDS instance
  amazon.aws.rds_instance:
    db_instance_identifier: mydb
    engine: postgres
    engine_version: "14.6"
    db_instance_class: db.t3.medium
    allocated_storage: 100
    storage_type: gp3
    master_username: "{{ db_master_username }}"
    master_user_password: "{{ db_master_password }}"
    vpc_security_group_ids:
      - sg-xxxx
    db_subnet_group_name: my-db-subnet-group
    publicly_accessible: no
    storage_encrypted: yes
    backup_retention_period: 7
    skip_final_snapshot: no
    final_snapshot_identifier: mydb-final-{{ ansible_date_time.epoch }}
    wait: yes
    wait_timeout: 1200
    state: present
```

Critical parameters
  • skip_final_snapshot: In production, set to no and provide final_snapshot_identifier to avoid data loss on deletion.
  • wait_timeout: RDS creation can take 10-20 minutes. Set to 1200 seconds (20 minutes).
  • master_user_password: Use Ansible Vault or SSM Parameter Store (see section on secrets).

Gotcha: The module does not support changing master_username after creation. If you need to change, you must delete and recreate.

Idempotency: The module checks for an existing instance with the same db_instance_identifier. If found, it compares parameters and updates if necessary. Not all parameters are updatable in place; some require replacement.

Skip Final Snapshot in Development Only
Never set skip_final_snapshot: yes in production. Always create a final snapshot before deletion. Use a unique identifier with timestamp to avoid conflicts.
Production Insight
We once had a playbook that deleted a production RDS instance because state: absent was triggered accidentally. The final_snapshot_identifier allowed us to restore within minutes. Without it, we would have lost 2 TB of data.
Key Takeaway
Always set skip_final_snapshot: no with a unique final snapshot identifier. Use wait: yes and wait_timeout: 1200.

Building VPC Networks with ec2_vpc_net and Subnet Modules

VPC modules allow you to define network infrastructure as code. Production example:

```yaml
- name: Create VPC
  amazon.aws.ec2_vpc_net:
    name: my-vpc
    cidr_block: 10.0.0.0/16
    tags:
      Environment: production
    state: present
    region: "{{ ansible_aws_region }}"
  register: vpc_result   # later tasks read vpc_result.vpc.id

- name: Create public subnets
  amazon.aws.ec2_vpc_subnet:
    vpc_id: "{{ vpc_result.vpc.id }}"
    # The loop items are AZ letters, so the CIDR octet comes from the loop index:
    # 10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24
    cidr: "10.0.{{ idx + 1 }}.0/24"
    az: "{{ ansible_aws_region }}{{ item }}"
    tags:
      Name: "public-{{ item }}"
      Tier: public
    state: present
  loop:
    - a
    - b
    - c
  loop_control:
    index_var: idx
  register: subnet_results
```

Idempotency: These modules use tags and cidr_block to identify existing resources. If you change the CIDR, it creates a new resource (the old one is not deleted). To delete, set state: absent.

Gotcha: The ec2_vpc_subnet module requires vpc_id or vpc_name. Using vpc_name is convenient but can be ambiguous if multiple VPCs have the same name. Prefer vpc_id.

Internet gateway and public route table:

```yaml
- name: Create Internet Gateway
  amazon.aws.ec2_vpc_igw:
    vpc_id: "{{ vpc_result.vpc.id }}"
    tags:
      Name: my-igw
    state: present
  register: igw_result   # the route below reads igw_result.gateway_id

- name: Create public route table
  amazon.aws.ec2_vpc_route_table:
    vpc_id: "{{ vpc_result.vpc.id }}"
    tags:
      Name: public-rt
    subnets:
      - "{{ subnet_results.results[0].subnet.id }}"
      - "{{ subnet_results.results[1].subnet.id }}"
      - "{{ subnet_results.results[2].subnet.id }}"
    routes:
      - dest: 0.0.0.0/0
        gateway_id: "{{ igw_result.gateway_id }}"
    state: present
```

Production insight: Use ec2_vpc_route_table with subnets list to associate subnets. The module is idempotent: if routes and associations match, it does nothing.

VPC Name vs ID
When referencing VPCs, use vpc_id instead of vpc_name to avoid ambiguity. You can retrieve the VPC ID using ec2_vpc_net_info with filters.
Production Insight
We had a situation where two VPCs had the same name 'production' (different regions). Using vpc_name caused the playbook to modify the wrong VPC. We switched to using vpc_id obtained from ec2_vpc_net_info with region filter.
Key Takeaway
Use vpc_id for idempotent operations. Tag all resources with Name and Environment for easy identification.
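One way to turn a VPC name into an unambiguous ID is to query a single region and refuse to continue unless exactly one VPC matches. In this sketch the region, the tag value, and the vpc_lookup and target_vpc_id variable names are assumptions for illustration.

```yaml
- name: Look up the VPC by Name tag in one region
  amazon.aws.ec2_vpc_net_info:
    region: us-east-1
    filters:
      "tag:Name": production
  register: vpc_lookup

- name: Refuse to continue unless exactly one VPC matched
  ansible.builtin.assert:
    that: vpc_lookup.vpcs | length == 1
    fail_msg: "Expected one VPC named production, found {{ vpc_lookup.vpcs | length }}"

- name: Remember the unambiguous VPC ID
  ansible.builtin.set_fact:
    target_vpc_id: "{{ vpc_lookup.vpcs[0].vpc_id }}"
```

Later tasks can then pass target_vpc_id as vpc_id, so a second VPC with the same name stops the play instead of being modified by mistake.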

Using Dynamic Inventory with AWS EC2 Plugin

The aws_ec2 inventory plugin dynamically builds inventory from AWS EC2 instances. Configure it in inventory/aws_ec2.yml:

```yaml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - us-west-2
filters:
  tag:Environment: production
  instance-state-name: running
hostnames:
  - tag:Name
  - dns-name
  - private-dns-name
compose:
  ansible_host: public_ip_address
keyed_groups:
  - key: tags.Role
    prefix: role
  - key: placement.region
    prefix: aws_region
cache: yes
cache_plugin: jsonfile
# The jsonfile cache plugin needs a directory to write to; this path is an example
cache_connection: /tmp/aws_ec2_inventory_cache
cache_timeout: 3600
```

Key settings
  • cache: yes and cache_plugin: jsonfile: Avoids hitting AWS API on every playbook run. Set cache_timeout to 3600 (1 hour) for production.
  • filters: Use tags to limit scope. Avoid pulling all instances in an account.
  • hostnames: Use tag:Name for human-readable hostnames. Fallback to DNS names.
  • keyed_groups: Create groups based on tags or attributes.

Run a playbook against the dynamic inventory:

```bash
ansible-playbook -i inventory/aws_ec2.yml playbook.yml
```

Gotcha: The plugin caches inventory. If you create new instances, they won't appear until cache expires. Force refresh with --flush-cache.

Production pattern: Use cache: yes but set a low cache_timeout (e.g., 300) during deployments, then increase for normal operations.

Cache Flushing
Use ansible-inventory -i inventory/aws_ec2.yml --list --flush-cache to refresh the cache on demand.
Production Insight
We once had a deployment that failed because the dynamic inventory still showed old instances after a scale-down. The cache had a 1-hour timeout. We added a pre-task to flush cache: - name: Flush cache | meta: refresh_inventory.
Key Takeaway
Enable caching with cache: yes to avoid API rate limits, but flush cache during deployments with --flush-cache or meta: refresh_inventory.
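The refresh step from the insight above might sit between a play that changes the fleet and a play that configures it. This is a sketch: the role_web group follows from the keyed_groups settings shown earlier, and the launch parameters are placeholders.

```yaml
- name: Change the fleet, then re-read inventory
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Resize the web tier
      amazon.aws.ec2_instance:
        name: web
        instance_type: t3.medium
        image_id: ami-0abcdef1234567890
        vpc_subnet_id: subnet-xxxx
        exact_count: 3
        tags:
          Environment: production
          Role: web
        wait: yes

    - name: Reload inventory sources after instances change
      ansible.builtin.meta: refresh_inventory

- name: Configure whatever the refreshed inventory reports
  hosts: role_web
  tasks:
    - name: Confirm the host is reachable
      ansible.builtin.ping:
```

The second play only sees new instances if the refresh happens first, which is why the meta task is the last task of the provisioning play.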

Ensuring Idempotency in Cloud Modules

Idempotency means running a playbook multiple times produces the same result. AWS modules in amazon.aws are designed to be idempotent, but there are pitfalls.

Pattern 1: Use state: present with unique identifiers. For example, ec2_instance with name and tags uniquely identifies instances. Without name, the module may create duplicates.

Pattern 2: Use exact_count for EC2. This ensures exactly N instances exist. Without it, each run creates a new instance.

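Pattern 3: Leave force: false (the default) on s3_bucket. force only matters with state: absent, where it empties the bucket before deleting it; it plays no part in creation.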

Pattern 4: For IAM roles, the assume_role_policy_document must match exactly. If you use a template that changes every run (e.g., with timestamps), the role will be updated every time. Avoid dynamic content in policy documents.

Pattern 5: Use register and when to skip tasks if resource already exists. For example:

```yaml
- name: Check if bucket exists
  amazon.aws.aws_s3_bucket_info:
    name: my-bucket
  register: bucket_info
  ignore_errors: yes

- name: Create bucket if not exists
  amazon.aws.s3_bucket:
    name: my-bucket
    state: present
  when: bucket_info is failed
```

This pattern is useful for modules that are not fully idempotent (e.g., some community modules).

Gotcha: Some modules like ec2_instance with exact_count can be slow because they query all instances matching filters. Use specific filters to limit scope.

Idempotency and Tags
Tags are often used to identify resources. If you don't set tags, the module may not find existing resources and create duplicates. Always set tags with a Name key.
Production Insight
We had a playbook that created security groups without tags. Each run created a new SG because the module couldn't find the existing one. Adding tags: { Name: my-sg } fixed it.
Key Takeaway
Use unique identifiers (name, tags) and state: present for idempotency. For EC2, use exact_count. Avoid dynamic content in policy documents.
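A simple way to check idempotency in a test run is to apply the same definition twice and assert that the second pass reports no change. The sketch below uses the bucket from earlier; the second_run variable name is illustrative.

```yaml
- name: Apply the bucket definition a second time
  amazon.aws.s3_bucket:
    name: my-app-bucket
    state: present
    versioning: yes
    tags:
      Environment: production
  register: second_run

- name: Fail if the second run still reports a change
  ansible.builtin.assert:
    that: second_run is not changed
    fail_msg: "s3_bucket task is not idempotent for my-app-bucket"
```

The same pattern works for any module: a task that keeps reporting changed on an untouched resource usually has dynamic content or a missing identifier.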

Handling Eventual Consistency with Retries

AWS APIs are eventually consistent: after creating a resource, it may not be immediately visible in other APIs. This causes failures in subsequent tasks. Ansible's retries, delay, and until task keywords let a task poll until the resource shows up.

Shared retry settings: Define them once in group_vars/all.yml. They are plain variables, so they only take effect where a task references them:

```yaml
ansible_aws_retry_max_attempts: 10
ansible_aws_retry_delay: 5
```

Per-task retries: Set the retries, delay, and until task keywords on the task that polls:

```yaml
- name: Wait for EC2 instance to be ready
  amazon.aws.ec2_instance_info:
    instance_ids:
      - i-xxxx
    region: "{{ ansible_aws_region }}"
  register: instance_info
  retries: 15
  delay: 10
  until: instance_info.instances[0].state.name == "running"
```

Common patterns
  • After creating an RDS instance, use rds_instance_info with retries until db_instance_status is available.
  • After creating a security group, use ec2_security_group_info with retries before using it in other tasks.
  • After creating an IAM role, wait for it to propagate before attaching policies.

Gotcha: The until condition must be a boolean expression. Use | default('') to handle missing keys.
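For the RDS pattern listed above, a polling task with guarded lookups might look like this sketch. The identifier mydb matches the earlier example, and the retry counts are arbitrary starting points rather than tuned values.

```yaml
- name: Wait until the database reports available
  amazon.aws.rds_instance_info:
    db_instance_identifier: mydb
  register: db_info
  retries: 20
  delay: 30
  # default() keeps the expression boolean while the instance is not listed yet
  until: >-
    (db_info.instances | default([]) | length > 0) and
    (db_info.instances[0].db_instance_status | default('') == 'available')
```

The length check comes first so that the index lookup is only evaluated once at least one instance has been returned.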

Production pattern: Use a custom retry wrapper:

```yaml
- name: Retry until resource is found
  amazon.aws.ec2_vpc_net_info:
    filters:
      "tag:Name": my-vpc
  register: vpc_info
  retries: 10
  delay: 5
  until: vpc_info.vpcs | length > 0
```

Retry on 404 Errors
Many AWS modules return 404 if the resource doesn't exist yet. Use retries and until with a condition that checks for existence (e.g., vpcs | length > 0).
Production Insight
We had a playbook that created a VPC and then immediately tried to create subnets. The subnet creation failed because the VPC wasn't visible yet. Adding retries: 10, delay: 5 on the VPC info check before subnet creation fixed it.
Key Takeaway
Always add retries and delay after resource creation, especially for VPC, RDS, and IAM resources. Use until with existence checks.

Storing Secrets with AWS SSM Parameter Store

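Never hardcode secrets in playbooks. Store them in AWS SSM Parameter Store as SecureString parameters and read them at runtime with the SSM lookup plugin. Parameters are written with the ssm_parameter module, which ships in community.aws (older releases call it aws_ssm_parameter_store); the lookup ships in amazon.aws as aws_ssm and is named ssm_parameter from 6.0.0 onward.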

Storing a secret:

```yaml
- name: Store database password in SSM
  community.aws.ssm_parameter:
    name: /myapp/dbpassword
    value: "{{ db_password }}"
    string_type: SecureString
    overwrite_value: changed   # update only when the stored value differs
    region: "{{ ansible_aws_region }}"
  no_log: true
```

no_log: true prevents the value from being logged.

Retrieving a secret in a playbook:

```yaml
- name: Get database password from SSM
  ansible.builtin.set_fact:
    db_password: "{{ lookup('amazon.aws.aws_ssm', '/myapp/dbpassword', decrypt=True, region=ansible_aws_region) }}"
  no_log: true
```

Then use {{ db_password }} in subsequent tasks.

Gotcha: The lookup plugin returns the value as a string. If the parameter is SecureString, you must set decrypt=True and have permission to decrypt.

Production pattern: Use environment-specific paths:

```yaml
- name: Get environment-specific secret
  ansible.builtin.set_fact:
    db_password: "{{ lookup('amazon.aws.aws_ssm', '/myapp/' + env + '/dbpassword', decrypt=True, region=ansible_aws_region) }}"
  no_log: true
```

Permissions: The IAM role executing Ansible must have ssm:GetParameter and kms:Decrypt (if using KMS) permissions.

Alternative: Use Ansible Vault, but SSM is better for centralized secret management across teams.

SSM Parameter Naming
Use hierarchical paths like /app/env/parameter for organization. Ensure the IAM role has ssm:GetParametersByPath permission to list parameters.
Production Insight
We once had a secret leak because a developer forgot no_log: true on a set_fact task. The password appeared in CI logs. We added a pre-commit hook to check for no_log: true on tasks that use the SSM lookup.
Key Takeaway
Store secrets in SSM Parameter Store as SecureString parameters and retrieve them with the SSM lookup. Always use no_log: true on tasks handling secrets.

Advanced: Combining Modules for Multi-Tier Deployments

Production applications often require multiple AWS resources. Here's a playbook that creates a VPC, subnets, security groups, RDS, and EC2 instances in a coordinated way.

```yaml
---
- name: Provision multi-tier application
  hosts: localhost
  connection: local
  gather_facts: no
  vars:
    vpc_cidr: 10.0.0.0/16
    public_subnets:
      - cidr: 10.0.1.0/24
        az: "{{ ansible_aws_region }}a"
      - cidr: 10.0.2.0/24
        az: "{{ ansible_aws_region }}b"
    private_subnets:
      - cidr: 10.0.10.0/24
        az: "{{ ansible_aws_region }}a"
      - cidr: 10.0.11.0/24
        az: "{{ ansible_aws_region }}b"
  tasks:
    - name: Create VPC
      amazon.aws.ec2_vpc_net:
        name: myapp-vpc
        cidr_block: "{{ vpc_cidr }}"
        tags:
          Environment: production
        state: present
      register: vpc

    - name: Create subnets
      amazon.aws.ec2_vpc_subnet:
        vpc_id: "{{ vpc.vpc.id }}"
        cidr: "{{ item.cidr }}"
        az: "{{ item.az }}"
        tags:
          Name: "{{ item.name }}"
        state: present
      loop:
        - { cidr: "{{ public_subnets[0].cidr }}", az: "{{ public_subnets[0].az }}", name: "public-a" }
        - { cidr: "{{ public_subnets[1].cidr }}", az: "{{ public_subnets[1].az }}", name: "public-b" }
        - { cidr: "{{ private_subnets[0].cidr }}", az: "{{ private_subnets[0].az }}", name: "private-a" }
        - { cidr: "{{ private_subnets[1].cidr }}", az: "{{ private_subnets[1].az }}", name: "private-b" }
      register: subnets

    - name: Create security group for web
      amazon.aws.ec2_security_group:
        name: web-sg
        description: Security group for web servers
        vpc_id: "{{ vpc.vpc.id }}"
        rules:
          - proto: tcp
            ports: 80
            cidr_ip: 0.0.0.0/0
          - proto: tcp
            ports: 443
            cidr_ip: 0.0.0.0/0
        tags:
          Name: web-sg
        state: present
      register: sg_result   # the RDS task below reads sg_result.group_id

    - name: Create RDS subnet group
      amazon.aws.rds_subnet_group:
        name: myapp-db-subnet
        description: Subnet group for RDS
        subnet_ids:
          - "{{ subnets.results[2].subnet.id }}"
          - "{{ subnets.results[3].subnet.id }}"
        state: present

    - name: Create RDS instance
      amazon.aws.rds_instance:
        db_instance_identifier: myapp-db
        engine: postgres
        engine_version: "14.6"
        db_instance_class: db.t3.medium
        allocated_storage: 100
        master_username: "{{ db_user }}"
        master_user_password: "{{ db_password }}"
        db_subnet_group_name: myapp-db-subnet
        vpc_security_group_ids:
          - "{{ sg_result.group_id }}"
        wait: yes
        wait_timeout: 1200
        state: present
      no_log: true   # the task carries the master password

    - name: Launch EC2 instances
      amazon.aws.ec2_instance:
        # One task with exact_count: 2; looping over names with
        # exact_count: 2 would launch two instances per name.
        name: web
        instance_type: t3.medium
        image_id: ami-0abcdef1234567890
        vpc_subnet_id: "{{ subnets.results[0].subnet.id }}"
        security_group: web-sg
        exact_count: 2
        tags:
          Environment: production
          Role: web
        wait: yes
```
Key points
  • Use register to capture resource IDs for later use.
  • Use loop to create multiple subnets.
  • Wait for RDS before proceeding to EC2.
  • Use no_log: true on tasks with secrets.
Orchestration Order
Create VPC first, then subnets, security groups, RDS subnet group, RDS, and finally EC2. Use wait: yes on RDS to ensure it's ready before EC2 tries to connect.
Production Insight
We once had a playbook that created RDS and EC2 in parallel using async. The EC2 instances booted before RDS was ready, causing application failures. We switched to sequential with wait: yes on RDS.
Key Takeaway
Orchestrate resource creation in dependency order. Use wait: yes and register to pass IDs between tasks.

Testing and Validating AWS Playbooks Locally

Testing AWS playbooks without affecting real infrastructure is crucial. Use --check mode and --diff to preview changes.

```bash
ansible-playbook -i inventory/aws_ec2.yml playbook.yml --check --diff
```

Limitations: --check mode does not actually call AWS APIs for creation tasks; it simulates. Some modules return 'changed' even in check mode. To validate syntax:

```bash
ansible-playbook playbook.yml --syntax-check
```

Integration testing with Molecule: Use the molecule tool with the ec2 driver to spin up temporary instances for testing. Example molecule.yml:

```yaml
---
dependency:
  name: galaxy
driver:
  name: ec2
  region: us-east-1
  instance_type: t2.micro
  image_id: ami-0abcdef1234567890
  vpc_subnet_id: subnet-xxxx
  security_group: sg-xxxx
platforms:
  - name: instance
    groups:
      - web
provisioner:
  name: ansible
  inventory:
    group_vars:
      all:
        ansible_aws_region: us-east-1
verifier:
  name: ansible
```

Run molecule test to create, test, and destroy instances.

Gotcha: Molecule with EC2 driver incurs costs. Use --destroy=never during development to keep instances for debugging.

Production pattern: Use a separate AWS account for testing. Implement check_mode: yes in playbooks with conditional logic to skip destructive tasks.

```yaml
- name: Create EC2 instance (check mode safe)
  amazon.aws.ec2_instance:
    ...
  when: not ansible_check_mode
```

Check Mode Limitations
Some modules (e.g., iam_role) do not support check mode fully. They may report 'changed' even when no changes would occur. Always verify with a dry run in a test environment.
Production Insight
We had a playbook that passed --check but failed in production because of a missing IAM permission. We added a pre-validation task that calls aws iam simulate-principal-policy to check permissions before running the main playbook.
Key Takeaway
Use --check and --diff for dry runs. Use Molecule with EC2 driver for integration testing. Always test in a separate AWS account.
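The permission pre-validation mentioned in the insight above could be sketched as a pair of tasks that call aws iam simulate-principal-policy and stop the play on any denied action. The deploy_role_arn variable and the two action names are assumptions; list the actions your own playbook needs.

```yaml
- name: Simulate the actions the playbook needs
  ansible.builtin.command:
    argv:
      - aws
      - iam
      - simulate-principal-policy
      - --policy-source-arn
      - "{{ deploy_role_arn }}"
      - --action-names
      - ec2:RunInstances
      - rds:CreateDBInstance
      - --output
      - json
  register: sim
  changed_when: false   # read-only call
  check_mode: false     # run the simulation even under --check

- name: Fail early if any action is not allowed
  ansible.builtin.assert:
    that: item.EvalDecision == "allowed"
    fail_msg: "{{ item.EvalActionName }} is {{ item.EvalDecision }}"
  loop: "{{ (sim.stdout | from_json).EvaluationResults }}"
```

Because the simulation is read-only, it is safe to run as the first tasks of the main playbook as well as in CI.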
● Production incident · Post-mortem · Severity: high

The Idempotency Fail: ec2 Module vs ec2_instance

Symptom
After running the playbook, all EC2 instances in the auto scaling group were terminated and new ones launched with different IPs, causing a full outage.
Assumption
The engineer assumed the ec2 module was idempotent and would only create instances if the count was insufficient.
Root cause
The ec2 module does not have an exact_count parameter that properly handles existing instances. It treated all running instances as 'extra' and terminated them before creating new ones.
Fix
Replaced the ec2 module with ec2_instance using exact_count: 3 and instance_role parameters. Also added instance_ids to target specific instances for updates.
Key lesson
  • Always use the latest module from amazon.aws collection (e.g., ec2_instance over ec2).
  • The old modules are deprecated for a reason — they lack proper idempotency and state management.
Production debug guide: Symptom → Root cause → Fix (4 entries)
Symptom · 01
Playbook hangs at 'ec2_instance' task for 5+ minutes
Fix
Root cause: AWS API rate limiting or network timeout. Fix: Add the timeout: 120 task keyword to the task and use retries: 5, delay: 10 for eventual consistency tasks.
Symptom · 02
S3 bucket creation fails with 'BucketAlreadyOwnedByYou' error
Fix
Root cause: The bucket already exists in your account and something repeated the create call. Fix: Use the s3_bucket module with state: present, which reports ok for a bucket you already own. If the error comes from another tool, check bucket existence with aws_s3_bucket_info first instead of ignoring errors.
Symptom · 03
IAM role creation fails with 'MalformedPolicyDocument'
Fix
Root cause: JSON policy document has a trailing comma or invalid quotes. Fix: Use lookup('file', 'policy.json') and validate it with the from_json filter in a set_fact before the task.
Symptom · 04
RDS instance creation succeeds but rds_instance_info doesn't find it for 30 seconds
Fix
Root cause: AWS eventual consistency. RDS is not immediately visible in all APIs. Fix: Add wait: yes, wait_timeout: 600 to the rds_instance task, then use retries: 10, delay: 10 on subsequent tasks that query RDS.
★ Ansible AWS Automation Quick Reference (print this for your desk)
EC2 instance creation times out
Immediate action
Check AWS CloudTrail for API errors
Commands
ansible-playbook -i inventory/aws_ec2.yml playbook.yml -vvv | grep -i 'ec2_instance'
aws ec2 describe-instances --instance-ids i-xxx --region us-east-1
Fix now
Add wait: yes, wait_timeout: 600 to ec2_instance task
S3 bucket already exists error
Immediate action
Verify bucket ownership
Commands
aws s3api head-bucket --bucket my-bucket --region us-east-1
ansible-playbook playbook.yml --check
Fix now
Use s3_bucket with state: present and force: false; it's idempotent
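As an illustration of the idempotent bucket task, here is a minimal sketch. The bucket name matches the commands above; the public_access block and the tag are assumptions added to show how public access can be blocked explicitly.

```yaml
- name: Ensure the bucket exists with public access blocked
  amazon.aws.s3_bucket:
    name: my-bucket
    region: us-east-1
    state: present
    public_access:                  # assumption: block all public access
      block_public_acls: true
      block_public_policy: true
      ignore_public_acls: true
      restrict_public_buckets: true
    tags:
      Name: my-bucket               # hypothetical tag
```

Running this twice should report changed only on the first run, provided the bucket is owned by the same account.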
IAM role policy document invalid
Immediate action
Validate JSON locally
Commands
python -m json.tool policy.json
ansible-playbook playbook.yml --syntax-check
Fix now
Use lookup('file', 'policy.json') and ensure valid JSON
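One way to catch a malformed document before it reaches AWS is to parse it in a separate step. The sketch below assumes a policy.json file next to the playbook and a hypothetical role name. The short module name is used on purpose, because whether iam_role resolves to amazon.aws or community.aws depends on the installed collection versions.

```yaml
- name: Load and parse the trust policy (fails here if the JSON is malformed)
  ansible.builtin.set_fact:
    trust_policy: "{{ lookup('ansible.builtin.file', 'policy.json') | from_json }}"

- name: Create the role from the parsed document
  iam_role:                         # short name; FQCN depends on collection version (assumption)
    name: app-role                  # hypothetical role name
    assume_role_policy_document: "{{ trust_policy | to_json }}"
    state: present
```

Because the parse happens in its own task, a trailing comma shows up as a templating error in the first task instead of a MalformedPolicyDocument response from the API.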
RDS instance not found after creation
Immediate action
Check RDS console
Commands
aws rds describe-db-instances --db-instance-identifier mydb --region us-east-1
ansible-playbook playbook.yml -e 'ansible_aws_retry_max_attempts=10'
Fix now
Add wait: yes, wait_timeout: 600 and retries: 10, delay: 10 on subsequent tasks
SSM parameter not found in playbook
Immediate action
Check parameter path and permissions
Commands
aws ssm get-parameter --name /myapp/dbpassword --with-decryption --region us-east-1
ansible-playbook playbook.yml -e 'ssm_region=us-east-1'
Fix now
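Use lookup('amazon.aws.aws_ssm', '/myapp/dbpassword', decrypt=True, region=ssm_region). The lookup plugin is named aws_ssm (ssm_parameter in newer amazon.aws releases), not aws_ssm_parameter.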
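A sketch of reading that parameter inside a play, assuming the ssm_region variable passed on the command line above. The lookup name amazon.aws.aws_ssm is an assumption that depends on the installed collection version (newer releases call it amazon.aws.ssm_parameter).

```yaml
- name: Read the database password from SSM Parameter Store
  ansible.builtin.set_fact:
    # decrypt=true returns the plaintext of a SecureString parameter
    db_password: "{{ lookup('amazon.aws.aws_ssm', '/myapp/dbpassword', decrypt=true, region=ssm_region) }}"
  no_log: true                      # keep the value out of logs and -v output
```

Lookups run on the control node, so the credentials used to run ansible-playbook need ssm:GetParameter on the path, plus decrypt permission on the KMS key for SecureString values.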
Comparison of EC2 Instance Modules

| Feature | ec2 (legacy, removed in amazon.aws 4.0.0) | ec2_instance (amazon.aws) | Notes |
|---|---|---|---|
| Idempotent | No (creates every run) | Yes (with exact_count) | Use ec2_instance for production |
| Count management | count parameter (additive) | exact_count (absolute) | exact_count prevents drift |
| Tags | instance_tags parameter | Built-in tags parameter | Simpler playbooks |
| Wait for ready | wait=yes not reliable | wait=yes with timeout | Avoids race conditions |
| Instance profile | instance_profile_name parameter | instance_role parameter | Simplifies IAM integration |
| Network interfaces | Complex | network parameter | Easier multi-ENI setup |
| Stateful updates | No (replace only) | Yes (modify in place) | Reduces downtime |

Key takeaways

1
Use amazon.aws collection (>=5.0.0) for all AWS modules; avoid community.aws.
2
EC2
Use ec2_instance with exact_count, name, tags, and wait=yes.
3
S3
Use s3_bucket with state=present and public_access block settings; for objects use s3_object with permission=private and overwrite=different.
4
IAM
Use iam_role with assume_role_policy_document from file; validate JSON first.
5
RDS
Use rds_instance with skip_final_snapshot=no and final_snapshot_identifier.
6
VPC
Use ec2_vpc_net, ec2_vpc_subnet with tags for idempotency; prefer vpc_id over vpc_name.
7
Dynamic inventory
Use aws_ec2 plugin with cache=yes and keyed_groups.
8
Eventual consistency
Add retries, delay, and until conditions after resource creation.
9
Secrets
Store in SSM Parameter Store with SecureString; retrieve via lookup with no_log: true.
10
Test with --check, --diff, and Molecule in separate AWS account.
11
Always set tags on resources for idempotent identification.
12
Orchestrate resource creation in dependency order with wait and register.
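For the dynamic inventory takeaway, a minimal inventory file might look like this. The Role tag, the cache path, and the timeout are assumptions chosen for illustration.

```yaml
# inventory/aws_ec2.yml (the file name must end in aws_ec2.yml or aws_ec2.yaml)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  instance-state-name: running
keyed_groups:
  - key: tags.Role                  # hypothetical tag; yields groups such as role_web
    prefix: role
cache: true
cache_plugin: ansible.builtin.jsonfile
cache_connection: /tmp/aws_inventory_cache   # assumed cache directory
cache_timeout: 3600                          # seconds before the cache is refreshed
```

Running ansible-inventory -i inventory/aws_ec2.yml --graph shows the resulting groups, and adding --flush-cache forces a fresh query when instances have changed.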

Common mistakes to avoid

6 patterns

Using the deprecated `ec2` module instead of `ec2_instance`

Symptom
Instances are created every playbook run, causing duplicates or terminations.
Fix
Replace ec2 with amazon.aws.ec2_instance and use exact_count.

Not setting `wait: yes` on EC2 or RDS creation

Symptom
Subsequent tasks fail because resource is not ready.
Fix
Add wait: yes and wait_timeout to creation tasks.

Hardcoding secrets in playbooks

Symptom
Secrets exposed in logs or version control.
Fix
Read secrets with the aws_ssm lookup plugin and set no_log: true on the task.

Not using `exact_count` for EC2 instances

Symptom
Instance count grows on each playbook run.
Fix
Use exact_count with name and tags filters.

Omitting tags on resources

Symptom
Idempotency fails; resources created every run.
Fix
Always set tags with a Name key on all resources.

Not handling eventual consistency with retries

Symptom
Tasks fail with 'ResourceNotFound' after creation.
Fix
Add retries, delay, and until conditions to subsequent tasks.
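Several of the fixes above come together when building a network: tag every resource and pass IDs between tasks through register rather than looking them up by name. The sketch below assumes hypothetical names, CIDR ranges, and an availability zone.

```yaml
- name: Create the VPC
  amazon.aws.ec2_vpc_net:
    name: app-vpc                   # hypothetical name
    cidr_block: 10.0.0.0/16
    region: us-east-1
    tags:
      Environment: demo             # hypothetical tag
    state: present
  register: vpc_result

- name: Create a subnet in that VPC
  amazon.aws.ec2_vpc_subnet:
    vpc_id: "{{ vpc_result.vpc.id }}"   # ID taken from the previous task's result
    cidr: 10.0.1.0/24
    az: us-east-1a
    region: us-east-1
    tags:
      Name: app-subnet-a
    state: present
  register: subnet_result
```

Later tasks can use subnet_result.subnet.id in the same way, which keeps the dependency order explicit in the playbook.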
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · Senior: How do you ensure idempotency when creating EC2 instances with Ansible?
Q02 · Senior: What is the difference between `ec2` and `ec2_instance` modules?
Q03 · Senior: How do you handle eventual consistency when creating an RDS instance in ...
Q04 · Senior: How do you securely manage database passwords in Ansible AWS playbooks?
Q05 · Senior: What is the best practice for dynamic inventory with AWS EC2?
Q06 · Senior: How do you create a VPC with subnets using Ansible?
Q07 · Senior: What are common pitfalls when using Ansible to manage IAM roles?
Q08 · Senior: How do you test Ansible AWS playbooks without affecting production?

Q01 of 08 · Senior

How do you ensure idempotency when creating EC2 instances with Ansible?

ANSWER
Use the amazon.aws.ec2_instance module with the exact_count parameter, and provide filters (for example on tags) to scope which instances are counted. Avoid the count parameter, which launches new instances on every run. Also set state: present and use wait: yes to ensure the instances are running before proceeding.
FAQ · 8 QUESTIONS

Frequently Asked Questions

01
What is the difference between amazon.aws and community.aws collections?
02
How do I install the amazon.aws collection?
03
Why does my EC2 instance creation fail with 'wait timeout'?
04
How do I make S3 bucket creation idempotent?
05
Can I update an IAM role's managed policies with Ansible?
06
How do I pass the VPC ID from one task to another?
07
What is the best way to handle secrets in Ansible for AWS?
08
How do I refresh the dynamic inventory cache?