Ansible Dynamic Inventory: AWS EC2 Plugin Gotchas and Production Patterns
Master Ansible dynamic inventory with AWS EC2 plugin.
20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.
Use inventory plugins (like amazon.aws.aws_ec2) over legacy scripts; they are faster, cache-aware, and natively integrated.
Configure plugin with plugin: amazon.aws.aws_ec2 in a YAML file named aws_ec2.yml.
Tagging strategy: use keyed_groups to create groups from tags (e.g., keyed_groups: [{prefix: tag, key: tags.Name}]).
Always enable caching: set cache: true with cache_plugin: jsonfile, and set cache_timeout: 3600 to avoid API rate limits.
Test inventory with ansible-inventory -i aws_ec2.yml --list to verify groups and variables.
For GCP, use gcp_compute plugin; for Azure, use azure_rm plugin. Both support keyed_groups and caching.
Avoid using ansible-inventory --graph for debugging; use --list with --export for clean output.
Set compose to add custom variables, e.g., compose: { ansible_host: public_ip_address }.
Imagine you have a huge box of Lego bricks, and you need to build a specific model. Instead of digging through the box every time, you create a map that tells you exactly where each brick is and which set it belongs to. Ansible dynamic inventory is that map for your cloud servers. It automatically finds all your servers (like EC2 instances) and organizes them into groups based on tags (like 'web' or 'database'). This way, you can run commands on all web servers at once without manually listing them. The inventory plugin is like a smart assistant that updates the map every time you ask, but it can also remember the map for a while to save time.
I still remember the Monday morning when our deployment pipeline failed for the second week in a row. The playbook ran fine on Friday, but on Monday it couldn't find any EC2 instances. The error was cryptic: SKIPPED: No hosts matched. After an hour of head-scratching, I discovered that the legacy dynamic inventory script we were using had a hardcoded AWS region that no longer existed. That was the day I decided to switch to inventory plugins.
Historically, Ansible dynamic inventory was done via executable scripts (Python, shell) that output JSON. They worked, but were brittle: no caching, no native error handling, and each script was a snowflake. The Ansible team introduced inventory plugins in Ansible 2.4, and they became the recommended approach by 2.9. Plugins are faster, support caching out of the box, and integrate deeply with Ansible's group and variable logic.
This article covers everything you need to use dynamic inventory in production: from the AWS EC2 plugin configuration to tagging strategies, caching, and debugging. We'll also touch on GCP and Azure plugins. I'll share real incidents and patterns from running Ansible at scale across hundreds of instances.
Inventory Plugins vs Scripts: Why Plugins Win in Production
Legacy dynamic inventory scripts are standalone executables (Python, shell, etc.) that output JSON to stdout. They work, but they have no standard way to handle caching, errors, or configuration. Ansible 2.4 introduced inventory plugins, which are Python modules that integrate with Ansible's internals. They support caching, configuration via YAML, and advanced features like keyed_groups and compose.
- Caching: Built-in support via cache_plugin (e.g., jsonfile, redis). Scripts require manual caching.
- Error handling: Plugins raise Ansible errors with clear messages. Scripts often fail silently.
- Performance: Plugins use persistent connections and batch API calls. Scripts may make many individual API calls.
- Maintainability: Configuration is YAML, not code. No need to manage dependencies for each script.
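For contrast, here is roughly what a legacy script has to do by hand. This is a minimal Python sketch, assuming the usual script contract (a `--list` call that returns every group plus a `_meta.hostvars` block, and a `--host <name>` call that returns one host's variables). The host data and the function name `run` are made up for illustration.

```python
import json

# Hypothetical, hardcoded data; a real script would call the cloud API here.
HOSTS = {
    "web-1": {"ansible_host": "10.0.0.11"},
    "db-1": {"ansible_host": "10.0.0.21"},
}
GROUPS = {"web": ["web-1"], "database": ["db-1"]}


def run(argv):
    """Return the JSON document a script would print for the given arguments."""
    if argv[:1] == ["--host"] and len(argv) == 2:
        return json.dumps(HOSTS.get(argv[1], {}))
    # Default to --list: every group, plus _meta so Ansible can skip per-host calls
    inventory = {name: {"hosts": hosts} for name, hosts in GROUPS.items()}
    inventory["_meta"] = {"hostvars": HOSTS}
    return json.dumps(inventory, indent=2)


print(run(["--list"]))
```

Everything beyond this (caching, retries, error reporting) is code you would have to write and maintain yourself, which is the gap plugins close.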
To use a plugin, create a YAML file (e.g., aws_ec2.yml) with:
```yaml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - us-west-2
filters:
  instance-state-name: running
keyed_groups:
  - prefix: tag
    key: tags.Name
compose:
  ansible_host: public_ip_address
```
Then reference it with -i aws_ec2.yml.
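Migration: Legacy scripts still run through the built-in script inventory plugin (ansible.builtin.script), but it only executes the script, so you get none of the plugin options such as keyed_groups or compose. If you have to keep a script for now, layer the constructed plugin on top of it to add groups; otherwise rewrite the logic as a custom plugin. In most cases, though, the simplest path is to switch to the cloud-specific plugin.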
script inventory source. Always use plugins for new projects. Scripts may break in future versions.
aws_ec2 plugin with caching reduced this to 2 seconds. The script also had a bug that missed instances in certain regions because it hardcoded the endpoint URL.
Configuring amazon.aws.aws_ec2 Plugin for Production
The amazon.aws.aws_ec2 plugin is the gold standard for AWS dynamic inventory. Install the collection: ansible-galaxy collection install amazon.aws. Then create a YAML file, typically named aws_ec2.yml.
- plugin: amazon.aws.aws_ec2 (required)
- regions: list of AWS regions. Leave it empty to query all regions (but be careful with API limits).
- filters: dict of EC2 filters (e.g., {"instance-state-name": "running"}). Supports all EC2 API filters.
- hostnames: list of hostname sources (e.g., ["dns-name", "private-dns-name", "ip-address"]). The first match becomes the inventory hostname.
- keyed_groups: create groups from tags or attributes.
- compose: set variables like ansible_host.
- cache: set to true to switch caching on (it is off by default).
- cache_plugin: which cache backend to use (e.g., jsonfile).
- cache_timeout: seconds to cache (e.g., 300).
- strict: boolean; when true, invalid keyed_groups or compose expressions are fatal errors instead of being skipped (default false).
Production example:
```yaml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
  - eu-west-1
filters:
  instance-state-name: running
  tag:Environment: production
hostnames:
  - dns-name
  - private-dns-name
keyed_groups:
  - prefix: tag
    key: tags.Name
  - prefix: env
    key: tags.Environment
compose:
  ansible_host: public_ip_address
cache: true  # caching stays off unless this is set
cache_plugin: jsonfile
cache_timeout: 300
strict: false
```
Gotcha: The hostnames list order matters. If you use dns-name first, but the instance has no public DNS, it falls back to the next. Always include private-dns-name for VPC instances.
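The fallback behaviour can be pictured with a few lines of Python. This is only a sketch of the "first non-empty match wins" rule, not the plugin's code; the function `pick_hostname` and the sample instance are hypothetical, with keys named after the hostnames values above.

```python
def pick_hostname(instance, preferences):
    """Return the first non-empty attribute named in `preferences`, or None."""
    for attribute in preferences:
        value = instance.get(attribute)
        if value:  # skips missing keys, None, and empty strings
            return value
    return None


# Hypothetical instance in a private subnet: no public DNS name or public IP.
private_instance = {
    "dns-name": "",
    "private-dns-name": "ip-10-0-1-5.ec2.internal",
    "ip-address": None,
}

print(pick_hostname(private_instance, ["dns-name", "private-dns-name"]))
print(pick_hostname(private_instance, ["dns-name"]))
```

With only dns-name in the list, nothing matches for this instance, which is the situation a private-dns-name fallback avoids.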
Authentication: The plugin uses boto3. Ensure AWS credentials are available via environment variables, IAM role, or ~/.aws/credentials. Use aws sts get-caller-identity to verify.
strict: true to catch misconfigured keyed_groups or compose expressions early. In production, set to false to avoid failures from unexpected data.
hostnames: ["dns-name"] but our instances were in a private subnet with no public DNS. Ansible created hosts with empty names, causing failures. We fixed it by adding private-dns-name as a fallback.
hostnames with fallbacks, and use compose to set ansible_host explicitly.
Tagging Strategy for Dynamic Groups with keyed_groups
Tags are the backbone of dynamic grouping. With keyed_groups, you can automatically create Ansible groups based on EC2 tags, instance attributes, or any metadata.
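Syntax:
```yaml
keyed_groups:
  - prefix: tag
    key: tags.Environment
    separator: '_'
```
This creates groups like tag_production, tag_staging: the prefix, then the separator (which defaults to an underscore), then the tag value. Set separator: '' only when you also leave the prefix empty; otherwise the prefix and value run together (tagproduction).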
Production tagging strategy: Use consistent tag keys across all instances. Common tags: - Name: instance name (often unique) - Environment: production, staging, development - Role: web, database, cache - Tier: frontend, backend - Project: project name
Then configure groups (the default separator gives env_production, role_web, and so on):
```yaml
keyed_groups:
  - prefix: env
    key: tags.Environment
  - prefix: role
    key: tags.Role
  - prefix: tag
    key: tags.Name
    parent_group: all
```
Nested groups: Use parent_group in a keyed_groups entry to place the generated groups under a parent and build a hierarchy. An instance tagged Environment=prod and Role=web lands in both env_prod and role_web. To get a single group for all prod web servers, either target the intersection in the play (hosts: env_prod:&role_web) or define a conditional group with the groups option (for example prod_web: tags.Environment == 'prod' and tags.Role == 'web'), which the constructed plugin provides as well.
Gotcha: Tag values with spaces or special characters become invalid Ansible group names. Set strict: false to ignore them, or sanitize with compose.
Best practice: Use lowercase tag values and avoid spaces. E.g., Environment: production not Environment: Production.
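To see why messy tag values hurt, here is a rough Python model of how one keyed_groups entry turns into a group name. It is a sketch under assumptions, not Ansible's implementation: the helper `keyed_group_name` is hypothetical, and replacing invalid characters with underscores is modeled on Ansible's default group-name sanitizing, which in practice depends on your version and the TRANSFORM_INVALID_GROUP_CHARS setting.

```python
import re

# Assumption: anything outside letters, digits, and underscore is not allowed.
_INVALID = re.compile(r"[^A-Za-z0-9_]")


def keyed_group_name(value, prefix="", separator="_"):
    """Hypothetical helper: approximate the group name built from one entry."""
    raw = f"{prefix}{separator}{value}"
    return _INVALID.sub("_", raw)


print(keyed_group_name("production", prefix="env"))
print(keyed_group_name("Production (US)", prefix="tag"))
print(keyed_group_name("production_web", prefix="", separator=""))
```

A clean value gives a predictable name, while a value with spaces and parentheses ends up as something nobody would guess when writing a hosts: pattern.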
[a-zA-Z_][a-zA-Z0-9_]*. Tags with hyphens or dots will cause errors if strict: true. Use separator: '_' or sanitize via compose.
Environment: Production (US) with parentheses. This created a group tag_Production (US) which was invalid. Ansible failed silently, and the instances were not grouped. We fixed it by enforcing tag values to be alphanumeric only.
Advanced Grouping with keyed_groups and compose
The keyed_groups directive can create groups from any key returned by the plugin, not just tags. For example, you can group by instance type, region, or VPC ID.
```yaml
keyed_groups:
  - prefix: instance_type
    key: instance_type
  - prefix: region
    key: placement.region
```
But keyed_groups only creates groups based on exact values. For more complex logic, use compose to create custom variables, then group on those.
Example: Group instances by whether they have a public IP.
```yaml
compose:
  has_public_ip: public_ip_address is defined
keyed_groups:
  - key: has_public_ip
    prefix: public
```
This creates groups public_True and public_False.
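Another pattern: Create groups based on multiple tags. compose values are bare Jinja2 expressions (Ansible adds the {{ }} itself), so build the string with the ~ concatenation operator:
```yaml
compose:
  env_role: "(tags.Environment | default('unknown')) ~ '_' ~ (tags.Role | default('unknown'))"
keyed_groups:
  - key: env_role
    prefix: ''
    separator: ''
```
This creates groups like production_web. The empty separator matters here: with an empty prefix and the default separator, the name would start with an underscore.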
Gotcha: compose is evaluated before keyed_groups, so composed variables can be used as keyed_groups keys. The order is: compose sets host variables, then keyed_groups reads them.
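The evaluation order can be modeled in plain Python. This is an illustrative sketch only: `build_groups` is a hypothetical function, and the compose expressions are Python callables standing in for the Jinja2 expressions the plugin would evaluate.

```python
def build_groups(hostvars, compose, keyed_groups):
    """Hypothetical model: apply compose first, then derive groups from the result."""
    variables = dict(hostvars)
    for name, expression in compose.items():
        variables[name] = expression(variables)  # step 1: compose
    groups = []
    for entry in keyed_groups:  # step 2: keyed_groups sees the composed variables
        value = variables.get(entry["key"])
        if value is not None:
            groups.append(f"{entry['prefix']}_{value}")
    return variables, groups


host = {"tags": {"Environment": "production", "Role": "web"}}
compose = {
    "env_role": lambda v: "{}_{}".format(
        v["tags"].get("Environment", "unknown"), v["tags"].get("Role", "unknown")
    ),
}
keyed = [{"key": "env_role", "prefix": "combo"}]

variables, groups = build_groups(host, compose, keyed)
print(groups)
```

If the two steps were swapped, env_role would not exist yet when the group is built and the host would get no group from that entry.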
Performance: Complex compose expressions are evaluated for each host. For large inventories (thousands of hosts), keep expressions simple. Avoid expensive filters like regex_replace.
{{ tags.Environment | default('unknown') }} can mask missing tags. Use | mandatory in development to catch missing tags.
compose to set ansible_user based on AMI owner: ansible_user: "'ec2-user' if 'amazon' in image_id else 'ubuntu'". This worked but was slow for 500 instances. We moved this logic to a playbook task instead.
compose for simple variable mappings; for complex logic, prefer playbook tasks to keep inventory fast.
Caching Inventory to Avoid API Rate Limits
Without caching, each ansible-inventory call hits the AWS API. For large environments, this can trigger rate limits (e.g., RequestLimitExceeded). Caching stores the inventory locally and refreshes it periodically.
Configure caching in the plugin YAML. Note that cache: true is what switches caching on; setting cache_plugin alone is not enough:
```yaml
cache: true
cache_plugin: jsonfile
cache_timeout: 300
cache_connection: ~/.ansible/tmp/inventory_cache
```
Or set globally in ansible.cfg:
```ini
[inventory]
cache = True
cache_plugin = jsonfile
cache_timeout = 300
cache_connection = ~/.ansible/tmp/inventory_cache
```
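Cache plugins: the built-in default is memory, which does not persist between runs, so pick a persistent backend explicitly, such as jsonfile, redis, or memcached. For single-controller setups, jsonfile is fine. For multi-controller, use redis to share cache.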
Cache invalidation: The cache is invalidated after cache_timeout seconds. To force refresh, delete the cache directory: rm -rf ~/.ansible/tmp/inventory_cache/. Or run with --flush-cache:
```bash
ansible-inventory -i aws_ec2.yml --list --flush-cache
```
Production pattern: Set cache_timeout based on how often your infrastructure changes. For auto-scaling groups, set to 60 seconds. For static environments, 300 seconds is fine.
Gotcha: If using cache_plugin: jsonfile, ensure the cache directory is writable. Also, the cache file can become corrupt if multiple processes write simultaneously. Use redis for concurrent access.
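Debugging cache: Inspect the cache files. With jsonfile, their names start with the cache prefix (ansible_inventory_ by default):
```bash
ls ~/.ansible/tmp/inventory_cache/
jq 'keys' ~/.ansible/tmp/inventory_cache/ansible_inventory_*
```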
--flush-cache to ensure fresh inventory. Otherwise, stale cache can cause deployments to fail.
cache_timeout: 120 (2 minutes) solved it. We also moved to redis cache to handle parallel jobs.
Testing Dynamic Inventory with ansible-inventory --list
The ansible-inventory command is your best friend for debugging. Use --list to output the full inventory as JSON.
```bash
# With prefix: tag and key: tags.Name, an instance named "web" lands in tag_web
ansible-inventory -i aws_ec2.yml --list | jq '.tag_web'
```
- --list: output all hosts and groups.
- --graph: output ASCII graph of group hierarchy (less detailed).
- --export: with --list, produce output optimized for export to external tools rather than an exact picture of how Ansible processed the inventory.
- --flush-cache: ignore cache and refresh.
- -vvv: verbose logging (use with 2>&1 | grep -i error).
Testing specific groups:
```bash
ansible-inventory -i aws_ec2.yml --list | jq '.env_production'
```
Testing host variables:
```bash
ansible-inventory -i aws_ec2.yml --list | jq '._meta.hostvars["i-12345"]'
```
Using with a playbook:
```bash
ansible-playbook -i aws_ec2.yml site.yml --list-hosts
```
Gotcha: The --list output includes _meta with hostvars, and --export does not strip it. If you want a pure group listing, drop the key yourself, for example with jq 'del(._meta)'.
Automated testing: In CI, run:
```bash
ansible-inventory -i aws_ec2.yml --list > /dev/null && echo "Inventory valid"
```
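An exit-code check passes even when a filter quietly matches nothing, so a stricter CI gate can parse the JSON and insist that the groups you deploy to actually contain hosts. The following is a minimal sketch: `missing_groups` is a hypothetical helper, and the sample document stands in for real `ansible-inventory --list` output, reusing group names from the earlier examples.

```python
import json


def missing_groups(inventory_json, required):
    """Return the required groups that are absent or have no hosts."""
    inventory = json.loads(inventory_json)
    problems = []
    for group in required:
        hosts = inventory.get(group, {}).get("hosts", [])
        if not hosts:
            problems.append(group)
    return problems


# Hypothetical output captured from: ansible-inventory -i aws_ec2.yml --list
sample = json.dumps({
    "_meta": {"hostvars": {"i-0abc": {"ansible_host": "10.0.0.11"}}},
    "all": {"children": ["env_production", "role_web"]},
    "env_production": {"hosts": ["i-0abc"]},
    "role_web": {"hosts": []},
})

print(missing_groups(sample, ["env_production", "role_web", "role_database"]))
```

In a pipeline you would feed it the real command output and fail the job when the returned list is not empty.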
- Empty output: check credentials and filters.
- Missing groups: check keyed_groups syntax.
- Wrong hostnames: check hostnames list.
Pro tip: Pipe through jq to extract specific fields. For example, list all hostnames:
```bash
ansible-inventory -i aws_ec2.yml --list | jq '._meta.hostvars | keys'
```
ansible-inventory for debugging, ansible-playbook --list-hosts to verify playbook targeting. The latter respects playbook host patterns.
filters section: instance-state-name: runnning (three n's). ansible-inventory --list returned empty, but the error was not obvious. Raising verbosity with -vvv showed the invalid filter.
ansible-inventory --list after changing inventory configuration to catch errors early.
GCP Inventory Plugin: gcp_compute
For Google Cloud Platform, use the gcp_compute plugin from the google.cloud collection. Install: ansible-galaxy collection install google.cloud.
Configuration example (`gcp_compute.yml`):
```yaml
plugin: gcp_compute
projects:
  - my-project
zones:
  - us-central1-a
  - us-east1-b
filters:
  - status = RUNNING
keyed_groups:
  - prefix: gcp
    key: labels.environment
compose:
  ansible_host: networkInterfaces[0].accessConfigs[0].natIP
hostnames:
  - name
  - networkInterfaces[0].networkIP
cache_plugin: jsonfile
cache_timeout: 300
```
- Uses labels instead of tags.
- zones instead of regions.
- filters use GCE filter syntax (e.g., status = RUNNING).
- hostnames uses GCE instance properties.
Authentication: Use application default credentials or service account JSON file via GCP_SERVICE_ACCOUNT_FILE environment variable.
Gotcha: The gcp_compute plugin does not support hostnames with fallback like AWS. You must specify a list; the first match is used. If the instance has no public IP, natIP will be undefined, and the hostname will be empty.
Production pattern: Use compose to set ansible_host to the internal IP if public IP is missing. The value is a bare Jinja2 expression, without {{ }}:
```yaml
compose:
  ansible_host: networkInterfaces[0].accessConfigs[0].natIP | default(networkInterfaces[0].networkIP)
```
gcp_compute plugin was slow because it queried all zones. We added zones: [us-central1-a] to limit scope and enabled caching. The inventory generation time dropped from 30s to 3s.
zones and projects to only what you need, and always enable caching for GCP inventory.
Azure Inventory Plugin: azure_rm
For Microsoft Azure, use the azure_rm plugin from the azure.azcollection collection. Install: ansible-galaxy collection install azure.azcollection.
Configuration example (`azure_rm.yml`):
```yaml
plugin: azure_rm
include_vm_resource_groups:
  - my-resource-group
  - another-rg
auth_source: auto
keyed_groups:
  - prefix: azure
    key: tags.environment
  - prefix: location
    key: location
compose:
  ansible_host: public_ip_address | default(private_ip_address)
hostnames:
  - name
  - private_ip_address
cache_plugin: jsonfile
cache_timeout: 300
```
- Uses include_vm_resource_groups or include_vmss_resource_groups to scope.
- auth_source: auto uses Azure CLI or environment variables.
- tags are key-value pairs (like AWS).
- location is the Azure region.
Authentication: Use az login or service principal via AZURE_CLIENT_ID, AZURE_SECRET, AZURE_TENANT environment variables.
Gotcha: The azure_rm plugin can be slow for large subscriptions. Use include_vm_resource_groups to limit scope. Also, the plugin does not support hostnames fallback as elegantly; you may need to use compose.
Production pattern: Use tags to organize VMs. Example:
```yaml
keyed_groups:
  - prefix: environment
    key: tags.Environment
  - prefix: role
    key: tags.Role
```
Performance: For large Azure environments, consider using azure_rm with cache_plugin: redis to share cache across controllers.
Without include_vm_resource_groups, the plugin queries all resource groups in the subscription, which can take minutes. Always scope to specific resource groups.
Combining Multiple Inventory Sources
In production, you often need hosts from multiple clouds or sources. Ansible supports multiple inventory sources by specifying a directory with multiple files, or by listing multiple -i flags.
Using a directory: Place all inventory YAML files in a directory (e.g., inventory/). Then run:
```bash
ansible-playbook -i inventory/ site.yml
```
Ansible will merge all inventories. Hosts with the same name are merged (variables from later sources override earlier ones).
Using multiple -i flags:
```bash
ansible-playbook -i aws_ec2.yml -i gcp_compute.yml -i azure_rm.yml site.yml
```
Merging logic: Groups from all sources are combined. If a host appears in multiple sources, its variables are merged (last source wins). To avoid conflicts, ensure hostnames are unique across clouds (e.g., use instance ID or FQDN).
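A small Python model makes the "last source wins" rule concrete. It is a deliberately simplified sketch, not Ansible's merge code: `merge_inventories` and the two sample sources are hypothetical, and real inventories also carry group variables and child groups that are ignored here.

```python
def merge_inventories(sources):
    """Merge inventories in order; later sources override earlier host variables."""
    merged = {"groups": {}, "hostvars": {}}
    for source in sources:
        for group, hosts in source.get("groups", {}).items():
            members = merged["groups"].setdefault(group, [])
            members.extend(h for h in hosts if h not in members)
        for host, variables in source.get("hostvars", {}).items():
            merged["hostvars"].setdefault(host, {}).update(variables)
    return merged


# Two hypothetical sources that both define a host called app-1.
aws = {
    "groups": {"web": ["app-1"]},
    "hostvars": {"app-1": {"ansible_host": "10.0.0.11", "cloud": "aws"}},
}
gcp = {
    "groups": {"web": ["app-1", "app-2"]},
    "hostvars": {"app-1": {"cloud": "gcp"}, "app-2": {"cloud": "gcp"}},
}

result = merge_inventories([aws, gcp])
print(result["groups"]["web"])
print(result["hostvars"]["app-1"])
```

The merged app-1 keeps the AWS address but reports the GCP cloud value, which is the kind of silent mix-up that unique hostnames prevent.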
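Using constructed plugin: The constructed plugin can add groups and variables based on existing inventory data. Example:
```yaml
plugin: constructed
strict: false
compose:
  # Bare Jinja2 expression; assumes AWS hosts sit in the aws_ec2 group that the plugin creates
  cloud_type: "'aws' if 'aws_ec2' in group_names else 'gcp'"
keyed_groups:
  - prefix: cloud
    key: cloud_type
```
compose is evaluated before keyed_groups, so cloud_type is available as a key. The constructed source has to be parsed after the cloud inventory files (sources in a directory are read in alphabetical order), otherwise group_names is still empty when the expression runs.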
Production pattern: Use separate inventory files per cloud, and a directory to combine them. Then use a playbook-level hosts: all to target all.
aws_web and gcp_web or use instance IDs.
hostnames: ["gcp-{{ name }}"].
Custom Inventory Plugins: When and How
Sometimes the built-in plugins don't meet your needs (e.g., custom API, on-premise servers). You can write a custom inventory plugin. This is advanced but powerful.
Structure: A plugin is a Python module in a collection. It must inherit from ansible.plugins.inventory.BaseInventoryPlugin and implement parse(), and optionally verify_file(); plugin options are read with the inherited get_option().
Minimal example:
```python
from ansible.plugins.inventory import BaseInventoryPlugin

DOCUMENTATION = '''
    name: my_custom
    plugin_type: inventory
    short_description: Example custom inventory source
    options:
      my_option:
        description: Example option
        required: true
        type: str
'''


class InventoryModule(BaseInventoryPlugin):
    NAME = 'my_custom'

    def verify_file(self, path):
        # Accept only readable files whose name ends with .my.yml
        return super().verify_file(path) and path.endswith('.my.yml')

    def parse(self, inventory, loader, path, cache=True):
        super().parse(inventory, loader, path, cache)
        # Load the YAML source file and populate the plugin options from it
        self._read_config_data(path)
        # Fetch data from external source (hardcoded here for illustration)
        hosts = [{'name': 'host1', 'groups': ['web'], 'vars': {'ansible_host': '10.0.0.1'}}]
        for host in hosts:
            inventory.add_host(host['name'])
            for group in host['groups']:
                inventory.add_group(group)
                # add_child attaches an existing host to an existing group
                inventory.add_child(group, host['name'])
            for k, v in host['vars'].items():
                inventory.set_variable(host['name'], k, v)
```
- Implement caching similar to built-in plugins.
- Use self.display.vvv() for debug logging.
- Use self.get_option() to read configuration.
- Test with ansible-inventory.
- You have a custom CMDB.
- You need to query an API that doesn't have a plugin.
- You need complex logic not supported by compose.
Otherwise, stick with existing plugins.
constructed plugin with a script that fed data via a static source.
compose and keyed_groups.
Common Pitfalls and How to Avoid Them
Here are the most common mistakes I've seen with dynamic inventory:
1. Not setting ansible_host: Without compose: { ansible_host: public_ip_address }, Ansible uses the hostname as the connection address. If the hostname is not resolvable, SSH fails.
2. Over-filtering: Using filters that are too restrictive (e.g., tag:Environment: production) but forgetting to tag new instances. Always include a fallback group like ungrouped.
3. Ignoring cache: Not enabling caching leads to slow runs and API rate limits. Always set cache: true together with cache_plugin and cache_timeout.
4. Using deprecated inventory_script: Legacy scripts still work but are deprecated. They lack caching and error handling. Migrate to plugins.
5. Not testing with --list: Skipping ansible-inventory --list leads to surprises in production. Always test after changes.
6. Group name collisions: If two inventory sources create groups with the same name, they merge. This can cause unexpected host membership.
7. Missing dependencies: For AWS, install the amazon.aws collection plus boto3 and botocore. For GCP, install the google.cloud collection and google-auth. For Azure, install azure.azcollection and the Python packages listed in the collection's requirements file (azure-cli is only needed if you authenticate through az login).
8. Not handling missing tags: If a tag is missing, keyed_groups will fail (if strict: true) or skip. Use the default filter in compose to provide defaults.
ansible-inventory --list --flush-cache on every change to catch issues early.
ansible_host: private_ip_address but forgot that the controller was outside the VPC. All playbooks failed with timeout. We added a conditional: ansible_host: public_ip_address | default(private_ip_address).
ping module after configuration changes.
Production Deployment: Putting It All Together
Here's a production-grade setup for dynamic inventory:
Directory structure:
```
inventory/
  aws_ec2.yml
  gcp_compute.yml
  azure_rm.yml
  group_vars/
    all.yml
    env_production.yml
    role_web.yml
```
ansible.cfg:
```ini
[defaults]
inventory = inventory/
host_key_checking = False

[inventory]
# cache = True turns inventory caching on; the other keys only configure it
cache = True
cache_plugin = jsonfile
cache_timeout = 300
cache_connection = ~/.ansible/tmp/inventory_cache
```
CI/CD pipeline:
```bash
# Install collections
ansible-galaxy collection install amazon.aws google.cloud azure.azcollection

# Clear cache and test
ansible-inventory -i inventory/ --list --flush-cache > /dev/null

# Run playbook
ansible-playbook -i inventory/ site.yml
```
Monitoring: Use ansible-inventory --list to export inventory to a monitoring system. For example, dump to JSON and push to Prometheus.
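One possible shape for that export, assuming you use Prometheus file-based service discovery rather than a push, is sketched below. Everything specific here is an assumption for illustration: the helper `to_file_sd`, the `ansible_group` label name, and port 9100 (the usual node exporter port) are not part of Ansible or of the setup described above.

```python
import json


def to_file_sd(inventory, port=9100):
    """Convert `ansible-inventory --list` data into Prometheus file_sd target groups."""
    hostvars = inventory.get("_meta", {}).get("hostvars", {})
    target_groups = []
    for group, body in sorted(inventory.items()):
        if group in ("_meta", "all") or not isinstance(body, dict):
            continue
        # Prefer ansible_host when set, otherwise fall back to the inventory name
        addresses = [
            f"{hostvars.get(host, {}).get('ansible_host', host)}:{port}"
            for host in body.get("hosts", [])
        ]
        if addresses:
            target_groups.append({"targets": addresses, "labels": {"ansible_group": group}})
    return target_groups


# Hypothetical inventory data, shaped like the --list output.
sample_inventory = {
    "_meta": {"hostvars": {"i-0abc": {"ansible_host": "10.0.0.11"}}},
    "all": {"children": ["role_web", "role_database"]},
    "role_web": {"hosts": ["i-0abc"]},
    "role_database": {"hosts": ["db.internal.example"]},
}

print(json.dumps(to_file_sd(sample_inventory), indent=2))
```

Writing the result to a file that a file_sd_configs entry points at lets Prometheus pick up inventory changes on its own refresh cycle.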
- Store credentials in a vault (e.g., Ansible Vault, HashiCorp Vault).
- Use IAM roles for AWS, service accounts for GCP, and managed identities for Azure.
- Never commit credentials to version control.
- For 1000+ hosts, use redis cache to share across controllers.
- Use --forks in playbooks to parallelize.
- Consider using ansible-pull for agent-based models.
- Keep a static inventory as a fallback (e.g., inventory/static.yml) with critical hosts.
- Test disaster recovery by running playbooks against static inventory.
--export flag outputs inventory in a form optimized for export (it still contains _meta), suitable for feeding into monitoring or CMDBs.
redis cache, which solved the issue and improved performance.
The Missing EC2 Instance: A Caching Disaster
ansible-inventory --list returned old data.
cache_timeout: 86400 (24 hours) and the cache file was never invalidated. The cache plugin was jsonfile with a persistent directory.
cache_timeout: 300 (5 minutes) and cleared the cache: rm -rf ~/.ansible/tmp/inventory_cache/.
- Always set a reasonable cache timeout for dynamic inventories.
- Cloud resources change frequently; treat cache as a speed optimization, not a source of truth.
ansible-inventory --list returns empty or no hosts
aws sts get-caller-identity. Verify the plugin YAML file has correct regions or filters. Raise verbosity to surface errors: ansible-inventory -i aws_ec2.yml --list -vvv 2>&1 | grep -i error.
keyed_groups configuration. Ensure tag keys match exactly (case-sensitive). Test with ansible-inventory -i aws_ec2.yml --list | jq '._meta.hostvars'.
ansible.cfg or -i flag. Run ansible-inventory --graph to see group hierarchy. Ensure ansible_host is set via compose.
RequestLimitExceeded)
cache_plugin: jsonfile and cache_timeout: 300. Increase max_attempts in the AWS config. Use boto3 retries: export AWS_MAX_ATTEMPTS=10.
aws sts get-caller-identity
ansible-inventory -i aws_ec2.yml --list -vvv 2>&1 | grep error
regions in plugin YAML
Key takeaways
ansible_host via compose to ensure connectivity.
and cache_timeout: 300 to avoid API rate limits.
keyed_groups to automatically create groups from tags or labels.
ansible-inventory --list after every change.
ansible-inventory --list.
Common mistakes to avoid
Not setting ansible_host in compose
compose: { ansible_host: public_ip_address | default(private_ip_address) }
Using legacy script instead of plugin
Not enabling caching
cache_plugin: jsonfile and cache_timeout: 300
Overly broad filters (e.g., no region filter)
regions: [us-east-1, eu-west-1] or use filters
Using invalid group names from tags
strict: false
Forgetting to install required collections
ansible-galaxy collection install amazon.aws (or gcp, azure)
Interview Questions on This Topic
What is the difference between an inventory plugin and an inventory script?
Frequently Asked Questions