Ansible Variable Precedence — The 22-Level Silent Override
- Ansible is agentless configuration management — it connects via SSH, pushes small modules, and cleans up after itself
- Three core components: Inventory (what servers), Modules (how to act), Playbooks (when to act)
- Idempotency means running the same playbook 100 times produces the same result as running it once
- Performance trade-off: agentless means zero maintenance on servers but higher control node load (forks control parallelism)
- Production trap: variable precedence has 22 levels — your dev environment works but prod breaks because host_vars silently overrides group_vars with no warning
- Biggest mistake: a host_vars file left over from a debugging session six months ago quietly overrides your group-level config in production. It passes syntax checks, deploys fine, and serves the wrong value
Ansible Production Debug Cheat Sheet
Playbook fails with Host unreachable or SSH timeout
    ansible -i inventory.ini all -m ping -vvv
    ssh -v -i ~/.ssh/your_key user@target_host echo connected
Task shows changed every run when nothing actually changes
    ansible-playbook playbook.yml --check --diff > /tmp/ansible_diff.txt
    grep -B 5 -A 15 'changed:' /tmp/ansible_diff.txt
Variable value is correct in vars_files but resolves to something different at runtime
    ansible-inventory -i inventory.ini --host $TARGET_HOST --vars | jq '.nginx_port, .environment, .db_password'
    ansible -m debug -a 'var=nginx_port' -i inventory.ini $TARGET_HOST
Handler runs on every playbook execution, not just when config actually changes
    ansible-playbook playbook.yml --list-tasks | grep -A 5 handler_name
    grep -r 'notify: handler_name' roles/ --include='*.yml'
Playbook works manually from your laptop but fails consistently in the CI pipeline
    env | grep -E 'ANSIBLE|PYTHON|SSH' > local_env.txt
    ansible --version && python3 --version
Before configuration management tools, sysadmins maintained hundreds of servers by hand — logging in, running commands, hoping nothing went wrong. I lived this. In 2015, I managed a fleet of 80 web servers at a mid-size SaaS company, and every deploy night was a three-hour marathon of SSH sessions, copy-pasted commands, and prayer. One night, someone restarted the wrong database server. We lost four hours of customer data. That was the last straw.
Ansible was created by Michael DeHaan in 2012 and acquired by Red Hat in 2015 (now part of IBM). Today it runs infrastructure at NASA JPL, Capital One, and thousands of companies from Series A startups to Fortune 50 enterprises. Not because it's the most powerful automation tool, but because it's the simplest one that actually gets used.
What makes Ansible different from competitors like Chef and Puppet is that it is agentless. There is no daemon running on your managed servers, no SSL certificates to exchange, and no extra ports to open beyond standard SSH (or WinRM for Windows). Ansible runs from your control node, pushes small programs called Ansible Modules to the remote nodes, executes them, and then cleans up after itself.
One important nuance that comes up in almost every team adopting Ansible: Ansible and Terraform are not competitors — they solve different problems at different points in a server's life. Terraform creates infrastructure: it provisions the EC2 instance, creates the VPC, registers the DNS record. Ansible configures that infrastructure: it installs software, deploys application code, manages services, and corrects configuration drift on day 2, day 30, and day 300. Terraform's user_data and cloud-init can run a script at first boot, but they can't re-run idempotently when you need to update a config three months later. Ansible can. That's the real distinction — Terraform builds the house once, Ansible keeps it clean indefinitely.
In this guide, we'll break down Ansible's core architecture — inventories, playbooks, modules, and roles — cover ad-hoc commands for quick fleet operations, and build production-grade automation with real error handling, secret management, and reusable patterns. Every section includes the production detail that most tutorials skip.
Inventory, Playbooks, and Modules — The Three Core Concepts
Ansible's architecture relies on three primary building blocks. Get these right and everything else follows. Get any one of them wrong and you'll spend your time debugging instead of automating.
- The Inventory: A file (INI or YAML) that lists the servers you want to manage, organized into groups like [webservers] or [databases]. The inventory is your single source of truth about what exists. In production, you'll almost always use dynamic inventory — pulling host lists directly from AWS, GCP, or Azure APIs so your inventory stays accurate as servers are created and destroyed by autoscaling. Static inventories work for learning and small fixed fleets under 20 servers, but once you have autoscaling groups or spot instances, a static file becomes a liability. Stale IPs, terminated instances, missing new nodes — a static inventory in an elastic environment is a disaster on a timer.
- The Playbook: Your automation blueprint, written in YAML. A playbook maps groups of hosts to sequences of tasks and describes desired state rather than step-by-step instructions. This distinction matters operationally: if Nginx is already installed and running at the right version, Ansible confirms it and moves on. It doesn't reinstall. It doesn't restart unnecessarily. It checks and reports 'ok'.
- Modules: The tools in the toolbox. Instead of writing bash scripts, you use modules like apt, yum, service, copy, or template. These modules are idempotent: they check the current state of the server and only make changes when the server doesn't match your desired state. The shell and command modules are the notable exceptions. By default they run every time and always report 'changed', unless you guard them with creates, removes, or changed_when. That is exactly why experienced Ansible engineers avoid them unless there is genuinely no dedicated module alternative.
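When there really is no dedicated module, a command task can still be made to behave idempotently. The following is a minimal sketch, not a recommended pattern for every case; the script path and marker file are hypothetical and exist only for illustration.

```yaml
---
# Sketch only: /opt/app/bin/init-storage and the marker file are hypothetical.
- name: Guard command tasks so they stay idempotent
  hosts: webservers
  become: true
  tasks:
    - name: Run the one-time initialiser only if its marker file is missing
      ansible.builtin.command:
        cmd: /opt/app/bin/init-storage       # hypothetical script
        creates: /var/lib/app/.initialised   # task is skipped when this path exists

    - name: Read the running kernel version without reporting a change
      ansible.builtin.command:
        cmd: uname -r
      register: kernel_version
      changed_when: false   # read-only command, so never report 'changed'
```

The creates argument works only if the script itself writes the marker file; if it does not, the task runs on every play as before.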
For dynamic inventory specifically — here's what it looks like in practice. You create a plugin configuration file (aws_ec2.yml) that Ansible reads instead of a static hosts file. It queries the AWS EC2 API, groups instances by their tags, and returns a live host list. The inventory is never stale because it's rebuilt from the API on every run.
    # io.thecodeforge: Static Inventory for Project Forge
    # Use this for fixed infrastructure under 20 servers.
    # For elastic/cloud environments, use dynamic inventory (aws_ec2.yml below).

    [webservers]
    web-01.thecodeforge.io ansible_host=192.168.1.10 ansible_user=ubuntu
    web-02.thecodeforge.io ansible_host=192.168.1.11 ansible_user=ubuntu

    [databases]
    db-01.thecodeforge.io ansible_host=192.168.1.20 ansible_user=ubuntu

    [production:children]
    webservers
    databases

    [production:vars]
    ansible_ssh_private_key_file=~/.ssh/forge_deploy_key

    # ──────────────────────────────────────────────────────────────
    # io.thecodeforge: Dynamic Inventory Plugin Config (aws_ec2.yml)
    # Save this as inventories/production/aws_ec2.yml
    # Run: ansible-inventory -i inventories/production/ --list
    # ──────────────────────────────────────────────────────────────
    # plugin: amazon.aws.aws_ec2
    # regions:
    #   - eu-west-1
    # filters:
    #   instance-state-name: running
    #   tag:Environment: production
    # keyed_groups:
    #   - key: tags.Role
    #     prefix: role
    #     separator: '_'
    #   - key: tags.Environment
    #     prefix: env
    #     separator: '_'
    # hostnames:
    #   - private-ip-address
    # compose:
    #   ansible_user: "'ubuntu'"
    #   ansible_ssh_private_key_file: "'~/.ssh/forge_deploy_key'"
    # cache: true
    # cache_plugin: jsonfile
    # cache_connection: /tmp/ansible_aws_cache
    # cache_timeout: 300
    #
    # With this config:
    # - Instances tagged Role=webserver appear in group role_webserver
    # - Instances tagged Environment=production appear in group env_production
    # - Cache prevents hammering the EC2 API on every run (5-minute TTL)
    # - New instances appear automatically, with no manual inventory updates
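Since the inventory can be written in INI or YAML, here is roughly what the same static inventory looks like in YAML form. Treat it as a sketch: the file path in the comment is an assumption, and the hosts and variables are copied from the INI version above.

```yaml
# inventories/static/hosts.yml (hypothetical path)
all:
  children:
    production:
      vars:
        ansible_ssh_private_key_file: ~/.ssh/forge_deploy_key
      children:
        webservers:
          hosts:
            web-01.thecodeforge.io:
              ansible_host: 192.168.1.10
              ansible_user: ubuntu
            web-02.thecodeforge.io:
              ansible_host: 192.168.1.11
              ansible_user: ubuntu
        databases:
          hosts:
            db-01.thecodeforge.io:
              ansible_host: 192.168.1.20
              ansible_user: ubuntu
```

The YAML form makes the group nesting explicit, which tends to be easier to review once groups have children and their own variables.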
Your First Production Playbook — and the 22-Level Precedence Ladder
A playbook is a collection of plays. Each play targets a specific group from your inventory and executes a sequence of tasks in order, top to bottom. If a task fails on a specific host, Ansible stops executing for that host but continues for the others. To handle configuration changes — like restarting a web server only when a config file actually changes — Ansible uses Handlers: special tasks that only run when notified by another task that reported 'changed'.
The playbook below is a production pattern we actually use. Notice: update the package cache, install the binary, deploy a templated config, ensure the service is running. Every task is idempotent. Every task uses a dedicated module. No shell commands.
But here's what the Ansible documentation buries in a footnote that causes more production incidents than anything else: variable precedence has 22 levels, and Ansible enforces them silently. The most important levels to internalize — from highest to lowest priority:
- Extra vars (-e on the command line): highest, overrides everything
- set_fact and registered vars
- include_vars
- Task vars (set directly on a task)
- Block vars
- Role vars (vars/main.yml inside a role)
- Playbook vars (vars, vars_prompt, and vars_files on the play)
- host_vars/hostname.yml: this is where the production incident in this article came from
- group_vars/groupname.yml
- group_vars/all.yml
- Role defaults (defaults/main.yml): lowest, easily overridden by anything above
The rule that causes the most surprises: host_vars always overrides group_vars. Always. Without any warning. Without any log entry. If prod-web-01.yml exists in your host_vars directory, it wins over group_vars/all.yml, group_vars/webservers.yml, and every role default, silently. The only things that beat it sit higher on the ladder: the play's vars block, role vars, set_fact, and -e.
The diagnostic you need to run before every production deploy where variables are involved: ansible-inventory -i inventory.ini --host prod-web-01. This prints the merged inventory variables for that host, meaning host_vars and group_vars after precedence has been applied. Not what you think you set. Not what's in the group file you were editing. The ground truth for the inventory layer. Anything higher on the ladder (play vars, role vars, set_fact, -e) is applied at runtime, so pair it with a debug task when those are in play.
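To make the host_vars-over-group_vars rule concrete, here is a small sketch of the three files involved. The port values and the check_vars.yml file name are hypothetical, and the example assumes prod-web-01 is a member of the webservers group and that nginx_port is not defined anywhere higher on the ladder.

```yaml
# group_vars/webservers.yml (what the team believes is in effect)
nginx_port: 80
---
# host_vars/prod-web-01.yml (hypothetical leftover from a debugging session)
nginx_port: 8081
---
# check_vars.yml: print the value each host will actually receive
- name: Show resolved nginx_port per host
  hosts: webservers
  gather_facts: false
  tasks:
    - name: Print resolved value
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} -> nginx_port={{ nginx_port }}"
```

Under those assumptions, prod-web-01 reports 8081 while every other web server reports 80, and nothing in the run output flags the difference unless you print it like this.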
    ---
    # io.thecodeforge: Standard Nginx Deployment Playbook
    # Variable precedence reminder (highest to lowest, the levels that matter most):
    #   1. Extra vars (-e)           <- overrides EVERYTHING, use with extreme care in CI
    #   2. set_fact / registered     <- runtime-computed values
    #   3. Playbook vars block       <- what you see below; beats every inventory variable
    #   4. host_vars/hostname.yml    <- PER-HOST OVERRIDE, silent, highest inventory precedence
    #   5. group_vars/groupname.yml  <- group-specific values
    #   6. group_vars/all.yml        <- global defaults
    #   7. Role defaults/main.yml    <- weakest, easily overridden
    #
    # Debug tip: ansible-inventory -i inventory.ini --host prod-web-01
    # shows the merged inventory variables (host_vars + group_vars) before the playbook runs.

    - name: Deploy and Configure Nginx
      hosts: webservers
      become: true

      vars:
        nginx_port: 80
        server_name: "thecodeforge.io"
        # NOTE: These vars sit at precedence level 3 (playbook vars).
        # They override host_vars and group_vars for every targeted host.
        # If a value must be overridable per host, define it in group_vars or role
        # defaults instead, and verify with ansible-inventory --host <hostname>.

      tasks:
        - name: Verify expected variable state before making any changes
          ansible.builtin.debug:
            msg: "nginx_port resolved to {{ nginx_port }} on {{ inventory_hostname }}"
          # Add this debug task during onboarding or when variables behave unexpectedly.
          # Remove or tag it once the team trusts the variable sources.

        - name: Ensure apt cache is updated
          ansible.builtin.apt:
            update_cache: yes
            cache_valid_time: 3600
          # cache_valid_time: 3600 means: skip the update if cache is less than 1 hour old.
          # Trade-off: saves 5-10 seconds per run but means security updates won't appear
          # for up to an hour. Acceptable for app servers; lower this for security-sensitive roles.

        - name: Install Nginx production package
          ansible.builtin.apt:
            name: nginx
            state: present
          # state: present = install if missing. state: latest = upgrade if a newer version exists.
          # Use present in production unless you explicitly want automatic upgrades.

        - name: Deploy custom Nginx configuration
          ansible.builtin.template:
            src: templates/nginx.conf.j2
            dest: /etc/nginx/sites-available/default
            owner: root
            group: root
            mode: '0644'
          notify: Reload Nginx service
          # notify only fires when this task reports 'changed'.
          # If the rendered template is byte-for-byte identical to the existing file,
          # no notification is sent and Nginx is not reloaded. This is idempotency in action.

        - name: Ensure Nginx service is enabled and running
          ansible.builtin.service:
            name: nginx
            state: started
            enabled: yes

      handlers:
        - name: Reload Nginx service
          ansible.builtin.service:
            name: nginx
            state: reloaded
          # reloaded sends SIGHUP: Nginx reloads config without dropping connections.
          # restarted kills and restarts, dropping all active connections.
          # Always use reloaded for config changes. Use restarted only for binary upgrades.
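The debug task in the playbook above only prints the resolved value; someone still has to read it. One way to turn that into a hard gate is an assert in a separate pre-flight play. This is a sketch under assumptions: the allow-list variable allowed_nginx_ports and its values are hypothetical.

```yaml
- name: Pre-flight variable checks
  hosts: webservers
  gather_facts: false
  vars:
    allowed_nginx_ports: [80, 8080]   # hypothetical allow-list
  tasks:
    - name: Fail early if nginx_port resolved to an unexpected value
      ansible.builtin.assert:
        that:
          - nginx_port is defined
          - nginx_port | int in allowed_nginx_ports
        fail_msg: >-
          nginx_port={{ nginx_port | default('undefined') }} on
          {{ inventory_hostname }} is not in {{ allowed_nginx_ports }}
```

Because nginx_port is deliberately not set in this play's vars block, the check sees whatever inventory-level value the host would otherwise receive.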
Ad-hoc Commands — Quick Fleet Operations Without a Playbook
Not everything needs a playbook. Sometimes you need to run a single command across your fleet right now — check disk space before a deploy, restart a hung service on 50 app servers, verify a kernel patch applied across the fleet, kill a runaway process that's consuming memory. That's what ad-hoc commands are for.
Ad-hoc commands are Ansible's underrated superpower for day-two operations. They're the reason senior SREs reach for Ansible instead of writing SSH for-loops. An SSH for-loop runs the command on every server sequentially and gives you raw unstructured output. Ansible ad-hoc runs in parallel across as many hosts as your forks setting allows, returns structured output per host, handles failures gracefully, and respects your inventory groups so you don't accidentally run something against the wrong environment.
Syntax: ansible <host-pattern> -i <inventory> -m <module> -a '<arguments>'
- -b or --become: run as root (sudo)
- -u or --user: specify the SSH username
- --limit 'web-01': restrict execution to a subset of the matched hosts — critical for safe fleet operations
- --check: dry run — show what would change without actually changing anything
- -f 50 or --forks 50: override the default parallelism for this single command
- -v, -vv, -vvv, -vvvv: increasing verbosity. -v shows task results. -vvv shows connection details, including the exact SSH command Ansible runs. -vvvv adds connection-plugin debugging, including the ssh client's own debug output. Use this when debugging SSH hangs.
In production I use ad-hoc commands daily. Checking disk space on 200 servers before a deploy: one-liner, 10 seconds, structured output. Restarting a hung worker process across 50 app servers: one-liner. Verifying that a security patch actually applied to every host in the fleet: one-liner. These replace what used to be 20-minute SSH marathons with copy-pasted commands and manually collated output.
    #!/usr/bin/env bash
    # io.thecodeforge: Ad-hoc Command Reference
    # These replace SSH for-loops. Run these, not bash loops.

    # ── Connectivity and fact-checking ──────────────────────────────
    # Verify SSH connectivity to all production hosts before a major deploy
    ansible production -i inventory.ini -m ping

    # Check disk space across all web servers before a deploy
    # -o: one-line output mode, easier to scan for problems
    ansible webservers -i inventory.ini -m command -a "df -h /" -o

    # Gather full system facts from a single host (OS, IPs, memory, CPU)
    # Useful for debugging environment differences between hosts
    ansible db-01.thecodeforge.io -i inventory.ini -m setup

    # Gather only a subset of facts to speed up the call
    # gather_subset=min returns OS, hostname, IP and skips disk/CPU details
    ansible webservers -i inventory.ini -m setup -a 'gather_subset=min' -o

    # ── Safe fleet operations with --limit ──────────────────────────
    # The --limit flag restricts execution to a subset of the target group.
    # ALWAYS use --limit when you want to test on one host before hitting the fleet.
    # This is the most important safety habit for ad-hoc fleet operations.

    # Restart Nginx on ONE host first to verify the command is correct
    ansible webservers -i inventory.ini -m service \
      -a "name=nginx state=restarted" --become \
      --limit web-01.thecodeforge.io

    # Once verified, restart Nginx across all web servers
    ansible webservers -i inventory.ini -m service \
      -a "name=nginx state=restarted" --become

    # ── Security and maintenance ────────────────────────────────────
    # Apply a security patch across the entire fleet in parallel
    # -f 20: process 20 hosts at a time (tune based on control node resources)
    ansible production -i inventory.ini \
      -m apt -a "name=openssl state=latest update_cache=yes" \
      --become -f 20

    # Verify the patch was applied: check the installed version on every host
    # Uses shell, not command: the command module does not run pipes through a shell
    ansible production -i inventory.ini \
      -m shell -a "dpkg -l openssl | grep '^ii'" -o

    # ── Dry run before any destructive operation ────────────────────
    # --check: show what WOULD happen without actually doing it
    # Use this before any ad-hoc command that modifies state
    ansible webservers -i inventory.ini \
      -m apt -a "name=nginx state=absent" \
      --become --check

    # ── Verbosity for SSH debugging ─────────────────────────────────
    # -v:    show task result summary
    # -vv:   show connection parameters
    # -vvv:  show SSH connection details (use this when a host is unreachable)
    # -vvvv: show raw SSH protocol output (use this when SSH itself is misbehaving)
    ansible web-01.thecodeforge.io -i inventory.ini -m ping -vvv
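When an ad-hoc check such as the pre-deploy disk space one-liner turns into a recurring gate, it can be captured as a small play that fails instead of relying on someone to read the output. A minimal sketch follows; the 2 GiB threshold and the min_free_bytes variable name are assumptions, and it presumes gathered facts include a mount entry for /.

```yaml
- name: Pre-deploy disk space check
  hosts: webservers
  gather_facts: true
  vars:
    min_free_bytes: 2147483648   # 2 GiB, hypothetical threshold
  tasks:
    - name: Find the root filesystem in gathered facts
      ansible.builtin.set_fact:
        root_mount: "{{ ansible_mounts | selectattr('mount', 'equalto', '/') | first }}"

    - name: Fail if the root filesystem is short on space
      ansible.builtin.assert:
        that:
          - root_mount.size_available | int >= min_free_bytes
        fail_msg: >-
          {{ inventory_hostname }} has only {{ root_mount.size_available }}
          bytes free on /
```

The ad-hoc form is still the right tool for a quick look; the play form is for checks that a pipeline should enforce.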
Roles — Reusable Automation at Scale
Once your playbooks grow beyond 50 lines, you'll start copying tasks between files. That's when you need roles. A role is a self-contained unit of automation — tasks, handlers, templates, default variables, and static files — packaged in a standardized directory structure that Ansible knows how to load automatically. Roles are how Ansible scales from 'one playbook' to 'an entire infrastructure codebase that multiple teams can contribute to.'
The directory structure is Ansible's loading convention, not optional decoration. When you reference a role in a playbook, Ansible automatically loads tasks/main.yml, handlers/main.yml, defaults/main.yml, templates/, and files/ if they exist. The structure is the contract — deviate from it and things silently don't load.
Roles come from two sources: you write your own for application-specific automation, or you pull community roles from Ansible Galaxy (ansible-galaxy install geerlingguy.nginx). Galaxy has thousands of pre-built roles for common infrastructure software. For Nginx, Docker, PostgreSQL, certbot, Redis — a battle-tested community role saves hours and handles edge cases your first draft won't. For deploying your Java application, configuring your monitoring stack, or enforcing your company's specific security baseline — you write your own.
Critically, community roles must be version-pinned in a requirements.yml file. Not a branch name, not latest: a specific version tag. I've watched a Galaxy role change a default variable in a minor version update and restart PostgreSQL during a maintenance window without any warning. The role's changelog mentioned it. Nobody read the changelog because nobody expected a minor version to change default behavior. Pin the version. Test the upgrade in staging. Treat a Galaxy role update the same way you treat a library dependency upgrade, with the same caution and the same verification process.
    ---
    # io.thecodeforge: Reusable Nginx Role
    #
    # Role directory structure (Ansible's loading convention, not optional):
    # roles/nginx/
    # ├── defaults/
    # │   └── main.yml       <- weakest variable precedence, safe defaults
    # ├── handlers/
    # │   └── main.yml       <- service reload/restart handlers
    # ├── tasks/
    # │   └── main.yml       <- this file, core task logic
    # ├── templates/
    # │   └── vhost.conf.j2  <- Jinja2 config templates
    # └── files/
    #     └── (static files if needed)
    #
    # Use this role in a playbook:
    #   - hosts: webservers
    #     roles:
    #       - role: nginx
    #         vars:
    #           server_name: api.thecodeforge.io
    #           nginx_port: 8080

    - name: Install Nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: yes

    - name: Deploy virtual host configuration from template
      ansible.builtin.template:
        src: vhost.conf.j2
        dest: "/etc/nginx/sites-available/{{ server_name }}.conf"
        owner: root
        group: root
        mode: '0644'
      notify: Reload Nginx

    - name: Enable virtual host by creating symlink
      ansible.builtin.file:
        src: "/etc/nginx/sites-available/{{ server_name }}.conf"
        dest: "/etc/nginx/sites-enabled/{{ server_name }}.conf"
        state: link
      notify: Reload Nginx

    - name: Validate the complete Nginx configuration
      ansible.builtin.command:
        cmd: /usr/sbin/nginx -t
      changed_when: false
      # The template module's validate option checks the rendered file on its own,
      # and a vhost file (a bare server block) is not a valid top-level config,
      # so 'nginx -t -c %s' would reject every render. Testing the full config here
      # fails the host before the Reload Nginx handler runs, so a broken vhost
      # is never loaded into the running server.

    - name: Ensure Nginx is running and enabled on boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: yes

    ---
    # io.thecodeforge: requirements.yml (Galaxy role version pinning)
    # Install with: ansible-galaxy install -r requirements.yml
    # ALWAYS pin to a specific version. Never use 'latest'.
    # Treat a version bump the same as a library dependency upgrade:
    # test in staging, read the changelog, verify behavior before deploying to prod.
    # roles:
    #   - name: geerlingguy.nginx
    #     version: 3.2.0
    #     # Pinned: tested against Ubuntu 22.04 LTS on 2026-03-01
    #     # Upgrade checklist: test in staging, verify default variable changes
    #
    #   - name: geerlingguy.docker
    #     version: 6.1.0
    #     # Pinned: confirmed compatible with Docker 25.x on 2026-02-15
    #
    #   - name: geerlingguy.postgresql
    #     version: 3.4.0
    #     # Pinned: restart behavior tested, does NOT restart on minor config changes
    #
    # Install all roles:
    #   ansible-galaxy install -r requirements.yml --roles-path roles/
    #
    # Upgrade a single role safely:
    #   ansible-galaxy install geerlingguy.nginx,3.3.0 --force
    #   # Then test in staging before updating the version in requirements.yml
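The role tasks above notify a handler named Reload Nginx and rely on server_name and nginx_port, but the matching defaults and handlers files are only named in the directory tree. A plausible sketch of those two files is below; the default values are assumptions chosen for illustration, not values taken from the role.

```yaml
# roles/nginx/defaults/main.yml
# Lowest-precedence values: any inventory, play, or role parameter overrides them.
nginx_port: 80
server_name: example.internal   # hypothetical placeholder
---
# roles/nginx/handlers/main.yml
# The handler name must match the string used in 'notify' exactly.
- name: Reload Nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded
```

Keeping tunables in defaults/main.yml rather than vars/main.yml is what lets callers override them from group_vars or role parameters.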
Production Patterns — Error Handling, Vault, and Rolling Deploys
The playbook we built above works correctly for a single server in a controlled environment. Production is messier. Databases fail mid-migration. Network blips cause intermittent SSH timeouts. You need to deploy to 50 servers without taking all 50 offline simultaneously. And you absolutely cannot store database passwords in plain-text YAML committed to Git. That is not about policy: production credentials in version control are a breach waiting to happen.
Error Handling with block/rescue/always: Ansible has a try/catch equivalent. Wrap risky tasks in a block. If anything inside fails, the rescue section runs — rollback, alert, log. The always section runs regardless — cleanup, notifications. Without this pattern, a failed database migration leaves your server in a half-configured state with no automatic recovery and no notification that anything went wrong.
Rolling Deploys with serial: The serial keyword controls how many hosts go through the play at a time. serial: 3 means update 3 servers, verify they're healthy, then move to the next 3. Without serial, every host in the play moves through each task together (in parallel, up to the forks limit), which is acceptable for config management but catastrophic for application deploys where you need zero downtime.
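serial also accepts a list of batch sizes, which allows a canary-style ramp: one host first, then progressively larger batches. The sketch below uses hypothetical batch sizes and a placeholder task standing in for the real deploy steps.

```yaml
- name: Canary-style rolling deploy
  hosts: webservers
  serial:
    - 1        # canary: a single host goes first
    - 3        # then a small batch
    - "50%"    # then half of the play's hosts per batch until none remain
  max_fail_percentage: 0
  tasks:
    - name: Placeholder for the real deploy tasks
      ansible.builtin.debug:
        msg: >-
          Deploying to {{ inventory_hostname }}
          in a batch of {{ ansible_play_batch | length }}
```

The last entry in the list is reused for any remaining hosts, so the ramp never needs to enumerate every batch.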
Ansible Vault for Secrets: Vault encrypts variables or entire files using AES256. Create an encrypted file with ansible-vault create group_vars/production/vault.yml, add your secrets, and commit the encrypted file to Git. Without the vault password, the file is gibberish — safe to store in version control. In CI/CD, pass the vault password via a file written from a CI secret: echo "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass, then ansible-playbook deploy.yml --vault-password-file /tmp/vault_pass. Never use --ask-vault-pass in CI — it expects interactive input and hangs silently.
For different environments, use different vault password files — one for staging, one for production. The vault file contents can be identical in structure but different in values (different database passwords per environment), while the passwords to decrypt them are stored separately in your CI secrets manager.
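As an illustration of the CI side, here is a minimal sketch assuming GitLab CI. The job name, inventory path, and the PREVIOUS_RELEASE variable are hypothetical; ANSIBLE_VAULT_PASSWORD is the CI secret described above and would hold the production vault password in this job.

```yaml
# .gitlab-ci.yml fragment (sketch, assuming GitLab CI)
deploy_production:
  stage: deploy
  rules:
    - if: $CI_COMMIT_TAG            # run only for tagged releases
  script:
    - umask 077                     # keep the password file private to this user
    - echo "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass
    - >-
      ansible-playbook -i inventories/production/ deploy.yml
      --vault-password-file /tmp/vault_pass
      -e release_version="$CI_COMMIT_TAG"
      -e previous_release="$PREVIOUS_RELEASE"
  after_script:
    - rm -f /tmp/vault_pass         # remove the password file even if the deploy failed
```

A staging job would look the same apart from its inventory path and the CI secret it reads, which keeps the two vault passwords fully separate.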
    ---
    # io.thecodeforge: Production Deploy with Error Handling, Rolling Deploy, and Vault
    #
    # Before running:
    #   1. Create vault file: ansible-vault create group_vars/production/vault.yml
    #      Add: db_password: "your_real_password"
    #           webhook_url: "https://hooks.slack.com/your/webhook"
    #   2. Commit the encrypted vault file to Git (safe: AES256 encrypted)
    #   3. Store vault password in CI secrets as ANSIBLE_VAULT_PASSWORD
    #   4. CI runs with: ansible-playbook deploy.yml --vault-password-file /tmp/vault_pass

    - name: Deploy Application with Safety Rails
      hosts: webservers
      become: true
      serial: 3
      # Rolling deploy: process 3 servers at a time
      # For 30 servers: 10 sequential batches of 3
      # Trade-off: 10x longer than parallel, 0 simultaneous downtime
      max_fail_percentage: 0
      # Stop the entire deploy if ANY server in a batch fails
      # max_fail_percentage: 30 would allow 30% failure before aborting
      # For database migrations, use 0: one failure should stop everything

      vars_files:
        - group_vars/production/vault.yml
        # Encrypted with ansible-vault, safe in Git
        # vault.yml defines db_password and webhook_url (see step 1 above)
        # Reference in tasks as: {{ db_password }}
        # Ansible decrypts at runtime using the vault password file and never writes plaintext to disk

      tasks:
        - name: Deploy application release with rollback on failure
          block:
            # ── Step 1: Pull the new code ───────────────────────────
            - name: Pull latest application code
              ansible.builtin.git:
                repo: "https://github.com/thecodeforge/app.git"
                dest: /opt/app
                version: "{{ release_version }}"
              # release_version passed via -e on the command line:
              # ansible-playbook deploy.yml -e release_version=v2.4.1

            # ── Step 2: Run database migrations ─────────────────────
            - name: Run database migrations
              ansible.builtin.command:
                cmd: /opt/app/bin/migrate --env production
                chdir: /opt/app
              environment:
                DATABASE_URL: "postgres://app:{{ db_password }}@db-01:5432/appdb"
                # db_password comes from the vault file, never hardcoded
              register: migration_result
              # register: captures the command output for use in later tasks or the rescue block

            # ── Step 3: Verify the application is healthy ───────────
            - name: Verify application health endpoint responds 200
              ansible.builtin.uri:
                url: "http://localhost:8080/health"
                status_code: 200
              register: health_check
              until: health_check is succeeded
              retries: 5   # Try up to 5 times
              delay: 3     # Wait 3 seconds between retries
              # until makes the retry loop explicit; some Ansible versions ignore retries without it.
              # If the health check still fails after 5 retries, the block fails
              # and rescue runs automatically

          rescue:
            # Runs only if any task in the block above fails
            - name: Log deployment failure with context
              ansible.builtin.debug:
                msg: >
                  Deploy FAILED on {{ inventory_hostname }}.
                  Release: {{ release_version }}.
                  Rolling back to: {{ previous_release }}.
                  Migration output: {{ migration_result.stdout | default('N/A') }}

            - name: Rollback to previous known-good release
              ansible.builtin.git:
                repo: "https://github.com/thecodeforge/app.git"
                dest: /opt/app
                version: "{{ previous_release }}"
              # previous_release passed alongside release_version:
              # ansible-playbook deploy.yml -e release_version=v2.4.1 -e previous_release=v2.4.0

            - name: Mark this host as failed after rollback
              ansible.builtin.fail:
                msg: "Deploy of {{ release_version }} failed on {{ inventory_hostname }}; rolled back."
              # A rescue that completes cleanly counts as a recovered host, so without
              # this task max_fail_percentage: 0 would never stop the remaining batches.

          always:
            # Runs regardless of success or failure: use for notifications and cleanup
            - name: Send deployment status notification
              ansible.builtin.uri:
                url: "{{ webhook_url }}"
                method: POST
                body_format: json
                body:
                  host: "{{ inventory_hostname }}"
                  release: "{{ release_version }}"
                  status: "{{ 'success' if ansible_failed_task is not defined else 'failed' }}"
                  environment: production
              # webhook_url comes from the vault file
              # ansible_failed_task is set by Ansible when a task in the block fails
| Tool | Agent Required | Language | Learning Curve | Best For |
|---|---|---|---|---|
| Ansible | No (agentless — SSH only) | YAML + Jinja2 | Low — most engineers are productive within a day | Configuration management, application deployment, ad-hoc fleet operations, and orchestration across mixed environments. The fastest path from zero automation to everything automated. Best choice for teams that don't have dedicated infrastructure engineers. |
| Chef | Yes (chef-client daemon running on every managed node) | Ruby DSL | High — requires Ruby knowledge and Chef Server administration | Complex, policy-based configuration in large enterprise fleets where teams have Ruby expertise and need a pull-based model. Chef Server handles 10,000+ nodes better than Ansible's push model at extreme scale. |
| Puppet | Yes (puppet agent daemon, certificate-based auth) | Puppet DSL | High — Puppet DSL is its own language with its own idioms | Long-term compliance enforcement and drift remediation in regulated industries (finance, healthcare, government) where continuous automated enforcement matters more than on-demand execution. Puppet's pull model means servers self-correct without a human initiating a run. |
| Terraform | No | HCL | Medium — HCL is readable but state management has a learning curve | Infrastructure provisioning — creating servers, VPCs, load balancers, DNS records, IAM roles, and managed services. Complementary to Ansible, not a replacement. Terraform creates the server. Ansible configures it. Most mature DevOps teams use both in sequence: Terraform provisions, Ansible configures on first boot and on every subsequent config change. |
🎯 Key Takeaways
- Ansible is agentless — it connects over SSH requiring no software installation on managed nodes. Zero maintenance overhead on servers, instant onboarding for new infrastructure, and a smaller security footprint than agent-based tools.
- Playbooks describe desired state in human-readable YAML — not step-by-step scripts. Run them once or a hundred times and the outcome is identical. This idempotency is what makes Ansible safe to run in CI/CD pipelines and on scheduled crons.
- Variable precedence has 22 levels enforced silently. host_vars always overrides group_vars. Extra vars (-e) override everything. Run ansible-inventory --host before every production deploy where variables matter — the resolved variable state is ground truth.
- Prioritize dedicated modules (apt, systemd, git, copy, template) over shell and command. Dedicated modules check state before acting. Shell and command run unconditionally every time and report 'changed' on every run — breaking your CI dashboard's signal-to-noise ratio.
- Roles are how Ansible scales from 10 servers to 1000. The directory structure is Ansible's loading contract — deviate from it and files silently don't load. Pin Galaxy community roles to specific versions in requirements.yml and treat upgrades like dependency upgrades.
- Use block/rescue/always for any playbook that modifies persistent state. Without error handling, a failed migration on server 3 of 20 leaves your fleet in split-brain configuration with no automatic recovery and no notification.
- Ansible Vault is non-negotiable for secrets. Create the file with ansible-vault create, commit the encrypted version to Git, store the decryption password in CI secrets, and pass it with --vault-password-file. Never use --ask-vault-pass in automation, and never put plain-text credentials in playbooks.
- Ansible and Terraform are complementary tools in the same pipeline: Terraform provisions the server, Ansible configures it. Terraform's user_data runs once at first boot. Ansible runs idempotently on day 1, day 30, and day 300 — correcting drift every time.
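The point about dedicated modules versus shell and command can be sketched as a short play. This is a minimal illustration, not a recommended layout: the web group, the archive path, and the /opt/app/VERSION marker file are hypothetical names chosen for the example.

```yaml
---
- name: Idempotent tasks versus unguarded commands
  hosts: web            # hypothetical inventory group
  become: true
  tasks:
    # Dedicated module: checks package state first, so it reports
    # "changed" only on the run that actually installs nginx.
    - name: Ensure nginx is installed
      ansible.builtin.apt:
        name: nginx
        state: present

    # command with "creates": skipped once the marker file exists,
    # so repeat runs report "ok" instead of "changed".
    - name: Unpack the release archive once
      ansible.builtin.command:
        cmd: tar -xzf /tmp/release.tar.gz -C /opt/app   # hypothetical paths
        creates: /opt/app/VERSION                        # hypothetical marker file

    # Read-only command: changed_when false stops it from
    # reporting "changed" on every run.
    - name: Collect uptime for the run log
      ansible.builtin.command: uptime
      changed_when: false
```

Without the creates and changed_when guards, the last two tasks would report 'changed' on every run even when nothing on the host differs.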
⚠ Common Mistakes to Avoid
Interview Questions on This Topic
- Q: Explain the agentless architecture of Ansible. How does it compare to agent-based tools like Puppet or Chef in terms of security footprint, operational overhead, and onboarding friction for new servers? (Mid-level)
- Q: What is idempotency in the context of Ansible modules? Can you name a module that is not idempotent by default, and explain when you'd intentionally use it? (Mid-level)
- Q: How does Ansible handle parallel execution? What is a fork in ansible.cfg, and how does tuning it impact performance on a 500-node fleet? (Senior)
- Q: What is the difference between a task and a handler? In what scenario would a handler be skipped even if it is notified by a task that reported changed? (Mid-level)
- Q: How would you use Ansible Vault to manage environment-specific secrets in a CI/CD pipeline? Walk through the workflow from encrypting the variable to injecting it during a Jenkins or GitLab CI run. (Senior)
- Q: What are Ansible facts? How can you disable fact gathering to speed up playbook execution, and when do you actually need them? (Mid-level)
- Q: Explain how dynamic inventory works with a cloud provider like AWS. What are the advantages over a static inventory file, and what challenges does it introduce? (Senior)
- Q: Describe the difference between include_role and import_role. When would you choose one over the other, and how does each affect task execution order and variable scope? (Senior)
- Q: How would you structure an Ansible project to manage 500+ servers across dev, staging, and production environments? Describe your directory layout, variable hierarchy, and how you'd prevent production changes from accidentally running against dev. (Senior)
Frequently Asked Questions
What is the difference between an ad-hoc command and a playbook in Ansible?
An ad-hoc command is a single one-liner executed directly from the command line — ideal for quick checks or one-off operations like restarting a service or checking disk space across your fleet. A playbook is a reusable, version-controlled YAML file that defines a sequence of tasks with variables, handlers, and error handling. Think of ad-hoc commands as shouting instructions across the room, and playbooks as writing a detailed runbook that anyone can execute repeatedly with the same result. The rule of thumb: if you've run the same ad-hoc command twice, it belongs in a playbook.
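To make the contrast concrete, here is a rough sketch of a one-off service restart promoted to a playbook. The web group and the nginx service are assumptions for illustration; the ad-hoc form is shown in the leading comment.

```yaml
---
# Ad-hoc equivalent (typed once, not version-controlled):
#   ansible web -m ansible.builtin.service -a "name=nginx state=restarted" --become
- name: Restart nginx across the web group
  hosts: web            # hypothetical inventory group
  become: true
  tasks:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx     # hypothetical service name
        state: restarted
```

The playbook form can be reviewed, committed, and re-run by anyone on the team, which is the practical difference the rule of thumb is pointing at.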
How does Ansible handle secrets and sensitive data?
Ansible provides Ansible Vault, which encrypts variables or entire files using AES256. Encrypt individual strings with ansible-vault encrypt_string and paste them into your playbooks, or encrypt entire variable files with ansible-vault encrypt. At runtime, provide the vault password via --vault-password-file pointing to a file written from a CI secret. Vault-encrypted content is safe to commit to Git: without the password it's gibberish. For larger teams, integrate Ansible with HashiCorp Vault using the hashi_vault lookup plugin from the community.hashi_vault collection, which fetches secrets at runtime from a centralized secrets manager rather than storing them in encrypted files.
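As one possible shape for the CI side of this workflow, the GitLab CI job below writes the vault password from a CI secret to a file and passes it with --vault-password-file. It is a sketch under assumptions: ANSIBLE_VAULT_PASSWORD is a hypothetical masked CI/CD variable, and the inventory path and site.yml are placeholder names.

```yaml
# .gitlab-ci.yml (fragment)
deploy_production:
  stage: deploy
  script:
    # Write the password from the CI secret to a file only this job can read.
    - umask 077
    - printf '%s' "$ANSIBLE_VAULT_PASSWORD" > .vault_pass   # hypothetical CI variable
    # Decrypt vault-encrypted variables at runtime; nothing is prompted for.
    - ansible-playbook -i inventories/production site.yml --vault-password-file .vault_pass
  after_script:
    # after_script runs even when the job fails, so the file is always removed.
    - rm -f .vault_pass
```

The same pattern carries over to Jenkins or other CI systems: the secret lives in the CI secret store, and only a short-lived file ever holds it on the runner.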
What is dynamic inventory in Ansible, and when should you use it?
Dynamic inventory queries an external source, typically a cloud provider API like AWS EC2, GCP, or Azure, at runtime instead of reading a static file. Ansible builds the host list from live API data based on tags, regions, and instance states. Use dynamic inventory when your infrastructure is elastic: autoscaling groups, spot instances, or any environment where servers are created and destroyed regularly. Static inventory works for fixed infrastructure under 20 servers with stable hostnames. Beyond that, a static file becomes a liability: stale IPs, missing new instances, terminated hosts that are still listed. Enable the inventory cache (cache: true with cache_timeout: 300) to avoid hitting the cloud API's rate limits on every run.
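A dynamic inventory source for AWS might look roughly like the following, assuming the amazon.aws collection is installed. The region, the Environment and Role tag names, and the cache directory are illustrative assumptions, not values from this article.

```yaml
# inventories/production/aws_ec2.yml
# (the aws_ec2 plugin expects the file name to end in aws_ec2.yml or aws_ec2.yaml)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                      # hypothetical region
filters:
  instance-state-name: running     # skip stopped and terminated instances
  tag:Environment: production      # hypothetical tag
keyed_groups:
  # Build groups such as role_web or role_db from a hypothetical Role tag.
  - key: tags.Role
    prefix: role
# Cache API results for five minutes to stay under rate limits.
cache: true
cache_plugin: ansible.builtin.jsonfile
cache_timeout: 300
cache_connection: /tmp/aws_inventory_cache   # hypothetical cache directory
```

Running ansible-inventory -i inventories/production/aws_ec2.yml --graph is a quick way to see which hosts and groups the plugin actually produced.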
How do you handle errors and rollbacks in Ansible playbooks?
Ansible provides a block/rescue/always construct that works like try/catch/finally. Wrap risky operations in a block. If any task inside fails, the rescue section executes: roll back to a known-good state, send an alert, log the failure context. The always section runs regardless of success or failure and is the place for cleanup and status notifications. For rolling deployments, combine this with serial (how many hosts to update at once) and max_fail_percentage (abort the entire deploy if too many hosts fail). Set max_fail_percentage: 0 for database migrations, because any failure should stop everything. Without block/rescue, a failed migration on server 3 of 20 in a one-at-a-time rollout leaves 2 servers on the new schema, 1 in a failed state, and 17 on the old schema, with the application broken and no automatic recovery.
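The construct described above can be sketched as a play like this one. The db_servers group and the /opt/app/bin/migrate commands are hypothetical stand-ins for whatever migration tooling you use.

```yaml
---
- name: Rolling schema migration with rollback
  hosts: db_servers                 # hypothetical inventory group
  serial: 1                         # one host at a time
  max_fail_percentage: 0            # any failed host aborts the rollout
  tasks:
    - name: Migrate with automatic rollback
      block:
        - name: Apply the migration
          ansible.builtin.command: /opt/app/bin/migrate up      # hypothetical command
      rescue:
        - name: Roll back to the previous schema
          ansible.builtin.command: /opt/app/bin/migrate down    # hypothetical command

        # A rescue that succeeds clears the failure, so fail explicitly
        # to mark the host as failed and stop the remaining batches.
        - name: Mark the host as failed after rollback
          ansible.builtin.fail:
            msg: "Migration failed on {{ inventory_hostname }}; rolled back"
      always:
        - name: Report the outcome for this host
          ansible.builtin.debug:
            msg: "Migration attempt finished on {{ inventory_hostname }}"
```

The explicit fail task is a design choice: without it, a successful rollback would let the play continue to the next host as if the migration had worked.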
What is the difference between Ansible and Terraform? Do I need both?
They solve different problems at different points in a server's life. Terraform provisions infrastructure — it creates EC2 instances, VPCs, load balancers, DNS records, and IAM roles. Ansible configures that infrastructure — it installs software, deploys application code, manages services, and corrects configuration drift. Terraform's user_data and cloud-init can run a script at first boot, but they can't re-run idempotently three months later when you need to update a config file. Ansible can. Most production teams use Terraform to build the infrastructure and Ansible to configure and maintain it. They're complementary tools in the same pipeline, not alternatives.
How do you test Ansible playbooks before running them in production?
Use --check mode for a dry run — Ansible shows what would change without applying anything. Combine it with --diff to see exact file content differences. For automated testing, use Molecule: it spins up Docker containers or VMs, runs your role, verifies the result with Testinfra assertions, and tears everything down. Run Molecule in CI to catch regressions before they reach any environment. Also run ansible-lint on all playbooks and roles to catch deprecated modules, style violations, and common structural mistakes. The combination of --check, --diff, Molecule, and ansible-lint catches the vast majority of problems before a human needs to review them.
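For the Molecule part, a minimal scenario file could look like the sketch below. It assumes the Docker driver and Testinfra are available in your environment (depending on your Molecule version they may need to be installed separately), and the container image is only an example of a prebuilt, Ansible-ready image.

```yaml
# molecule/default/molecule.yml
---
dependency:
  name: galaxy          # install role dependencies before converging
driver:
  name: docker          # assumes the Docker driver is installed
platforms:
  - name: instance
    image: geerlingguy/docker-ubuntu2204-ansible:latest   # example image, swap for your own
    pre_build_image: true
provisioner:
  name: ansible         # runs the scenario's converge playbook against the container
verifier:
  name: testinfra       # assumes Testinfra is installed alongside Molecule
```

Running molecule test from the role directory then creates the container, applies the role, runs the verifier, and destroys the container.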
What is Ansible Galaxy, and should I use community roles?
Ansible Galaxy is a repository of community-contributed roles for common infrastructure software — Nginx, Docker, PostgreSQL, certbot, Redis, and hundreds more. Install with ansible-galaxy install -r requirements.yml. Community roles save hours for commodity software and are often more battle-tested than what you'd write from scratch. For application-specific automation — deploying your Java app, configuring your monitoring stack — write custom roles. The mandatory practice: pin every Galaxy role to a specific version in requirements.yml. A community role is a dependency you don't control. A minor version update can change default behavior in ways that affect production. Pin it, test upgrades in staging, read the changelog before bumping the version.
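Version pinning in requirements.yml might look like this. The role names are real community roles used as examples, but the version numbers are illustrative placeholders; check each role's release list for the tags that actually exist.

```yaml
# requirements.yml
---
roles:
  - name: geerlingguy.nginx
    version: "3.1.4"      # illustrative version, pin to a real release tag
  - name: geerlingguy.docker
    version: "7.1.0"      # illustrative version, pin to a real release tag
```

With the versions fixed, ansible-galaxy install -r requirements.yml installs the same role code on every machine and CI run until someone deliberately bumps a pin.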
How does Ansible perform on very large fleets (1000+ servers)?
Ansible's parallelism scales with the forks setting in ansible.cfg (default: 5, which is too low for large fleets). For 1000 servers, start at forks=50 and monitor control node CPU, memory, and open file descriptor counts. Enable pipelining=True to reduce SSH round-trips per module from 3 to 1 — this alone can cut playbook runtime by 30-40%. Disable fact gathering for playbooks that don't need system facts, or use gather_subset=min to collect only essential information. For operational visibility at scale — job scheduling, RBAC, audit logging, workflow orchestration, and a web UI — deploy AWX (the open-source version) or Ansible Automation Platform. Plain Ansible from the command line works at 1000+ nodes, but AWX gives you the operational control that large teams need to manage concurrent jobs safely.
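The fact-gathering advice can be expressed at the play level, roughly as below. The group names and the health check script are hypothetical; forks and pipelining are left to ansible.cfg as described above.

```yaml
---
# Play that needs no system facts: skip gathering entirely.
- name: Trigger a health check everywhere
  hosts: all
  gather_facts: false
  tasks:
    - name: Run the health check script
      ansible.builtin.command: /usr/local/bin/healthcheck   # hypothetical script
      changed_when: false

# Play that needs only basic facts: collect the minimal subset.
- name: Print the OS family of the web tier
  hosts: web                       # hypothetical inventory group
  gather_facts: true
  gather_subset:
    - min
  tasks:
    - name: Show the OS family
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} runs {{ ansible_facts['os_family'] }}"
```

On a large fleet the first play avoids one setup call per host, which is where most of the saving comes from.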
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.