Ansible Variable Precedence — The 22-Level Silent Override

A forgotten host_vars file overrode group_vars with zero warnings, breaking prod.
🧑‍💻 Beginner-friendly — no prior DevOps experience needed
In this tutorial, you'll learn
  • Ansible is agentless — it connects over SSH requiring no software installation on managed nodes. Zero maintenance overhead on servers, instant onboarding for new infrastructure, and a smaller security footprint than agent-based tools.
  • Playbooks describe desired state in human-readable YAML — not step-by-step scripts. Run them once or a hundred times and the outcome is identical. This idempotency is what makes Ansible safe to run in CI/CD pipelines and on scheduled crons.
  • Variable precedence has 22 levels enforced silently. host_vars always overrides group_vars. Extra vars (-e) override everything. Run ansible-inventory --host before every production deploy where variables matter — the resolved variable state is ground truth.
Quick Answer
  • Ansible is agentless configuration management — it connects via SSH, pushes small modules, and cleans up after itself
  • Three core components: Inventory (what servers), Modules (how to act), Playbooks (when to act)
  • Idempotency means running the same playbook 100 times produces the same result as running it once
  • Performance trade-off: agentless means zero maintenance on servers but higher control node load (forks control parallelism)
  • Production trap: variable precedence has 22 levels — your dev environment works but prod breaks because host_vars silently overrides group_vars with no warning
  • Biggest mistake: a host_vars file left over from a debugging session six months ago quietly overrides your group-level config in production. The playbook parses fine, deploys fine, and serves the wrong value
🚨 START HERE

Ansible Production Debug Cheat Sheet

The five commands that solve 90% of Ansible production issues. Run these before opening a ticket or waking someone up.
🟡

Playbook fails with Host unreachable or SSH timeout

Immediate Action: Verify SSH connectivity independently before touching Ansible configuration
Commands
ansible -i inventory.ini all -m ping -vvv
ssh -v -i ~/.ssh/your_key user@target_host echo connected
Fix Now: Check that the security group allows port 22 inbound from the control node CIDR. Ensure ansible_user in your inventory matches the actual SSH username on the target. For ephemeral environments like CI runners or short-lived EC2 instances, set ANSIBLE_HOST_KEY_CHECKING=False or pre-populate known_hosts with ssh-keyscan in your pipeline prep step.
🟡

Task shows changed every run when nothing actually changes

Immediate Action: Identify exactly which task is reporting changed and why
Commands
ansible-playbook playbook.yml --check --diff > /tmp/ansible_diff.txt
grep -B 5 -A 15 'changed:' /tmp/ansible_diff.txt
Fix Now: Replace shell or command with a dedicated idempotent module. If using copy or template, normalize line endings and trailing whitespace: ansible.builtin.copy: content="{{ config | trim }}". If you genuinely cannot avoid shell and the command's side effects are truly undetectable, use changed_when: false explicitly rather than letting it mislead your CI dashboard.
🟡

Variable value is correct in vars_files but resolves to something different at runtime

Immediate Action: Dump the merged inventory variables for the specific failing host, then print the final runtime value
Commands
ansible-inventory -i inventory.ini --host "$TARGET_HOST" | jq '.nginx_port, .environment, .db_password'
ansible -m debug -a 'var=nginx_port' -i inventory.ini "$TARGET_HOST"
Fix Now: Remove conflicting host_vars files. Consolidate all environment-specific variables into group_vars/production.yml. Audit your CI pipeline for -e flags that inject variable overrides. These sit at the top of the precedence ladder and override everything else silently.
🟡

Handler runs on every playbook execution, not just when config actually changes

Immediate Action: Find which task is notifying the handler and why it reports changed every time
Commands
ansible-playbook playbook.yml --list-tasks | grep -A 5 handler_name
grep -r 'notify: handler_name' roles/ --include='*.yml'
Fix Now: A task notifying the handler is reporting changed on every run, almost always a shell or command task running unconditionally. Convert that task to an idempotent module. If the change is genuinely undetectable (for example, an API call with no readable state), use changed_when: false on that specific task and document why.
🟡

Playbook works manually from your laptop but fails consistently in the CI pipeline

Immediate Action: Compare the execution environment between your local shell and the CI runner
Commands
env | grep -E 'ANSIBLE|PYTHON|SSH' > local_env.txt
ansible --version && python3 --version
Fix Now: CI runs without an interactive terminal, so set ANSIBLE_HOST_KEY_CHECKING=False and ANSIBLE_SSH_RETRIES=3 as CI environment variables. Use absolute paths to inventory files since CI working directories vary by runner. Pass the vault password via --vault-password-file pointing to a file written from a CI secret, not --ask-vault-pass, which expects interactive input and hangs silently.
Production Incident

The Variable Precedence Nightmare

A playbook worked perfectly in staging but configured the wrong port in production. The database team spent three hours blaming the application. The root cause was Ansible's 22-level variable precedence ladder and a forgotten host_vars file from a debugging session six months earlier.
Symptom: The playbook read nginx_port: 8080 from group_vars/all.yml. Production servers were supposed to listen on 8080. Staging worked correctly. The same playbook, same inventory structure, different outcome on prod. No errors in Ansible output, just the wrong config deployed silently. The first sign of trouble was a load balancer health check failure, not Ansible.
Assumption: The team assumed variables defined in group_vars/all.yml applied to all hosts uniformly. They had no mental model of variable precedence. They didn't know host_vars overrides group_vars, and they had no process for auditing what variable values Ansible actually resolved at runtime versus what was declared in the playbook.
Root cause: One production host had a host_vars/prod-web-01.yml file with nginx_port: 80 left over from a troubleshooting session six months earlier. The engineer who created it had long since left the team. Ansible applied host_vars over the group_vars value silently: no warning, no log entry, no diff in the playbook output. The 22-level precedence ladder worked exactly as designed, exactly opposite of what the team expected. The fix took four minutes. Finding the cause took three hours.
Fix: Run ansible-inventory -i inventory.ini --host prod-web-01 to see the merged inventory variables for any host before the playbook runs. Remove the orphaned host_vars file. Add a CI step that runs ansible-inventory --list and diffs the resolved variables against a known-good baseline on every merge to main. Treat host_vars as a code smell that requires a documented justification comment. If a host genuinely needs unique config, the file should say why.
Key Lesson
Variable precedence is not a suggestion. It is a hard 22-level ladder that Ansible enforces silently. Learn the top eight levels. host_vars overrides group_vars. Always. Without exception.
ansible-inventory --host is your variable debug command. Run it against the specific failing host before touching the playbook. The resolved variable state is the ground truth, not what you think you set.
Treat host_vars files as a code smell. Unless a host genuinely needs unique configuration that no other host in its group shares, keep variables at group level and delete host_vars files when the reason for them disappears.
Your staging environment not mirroring production in inventory structure and variable sources is a disaster waiting to happen. The variable that breaks prod will always be the one that staging silently resolved differently.
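To make the "documented justification" idea concrete, here is a minimal sketch of what a deliberate host_vars file could look like. The reason, owner, and removal condition are hypothetical placeholders, not details from the incident; only the file name and the nginx_port variable come from the story above.

```yaml
# host_vars/prod-web-01.yml
#
# WHY THIS FILE EXISTS (hypothetical example):
#   prod-web-01 sits behind a legacy load balancer that only forwards to port 80.
# OWNER: platform team (hypothetical)
# REMOVE WHEN: the legacy load balancer is decommissioned
#
# Everything below overrides group_vars for this one host, silently.
nginx_port: 80
```

A reviewer who finds a host_vars file without a header like this has a reasonable signal that it is a leftover and a candidate for deletion.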
Production Debug Guide

These three failure modes account for 80% of Ansible incidents. Here's exactly how to diagnose each one.

Playbook hangs indefinitely with no output or error
Add -vvvv to your command immediately. Look for 'ESTABLISH SSH CONNECTION' in the output. If nothing appears past that line, your control node cannot reach the target host. Check security groups and firewall rules for port 22 inbound from the control node's CIDR. The default SSH timeout is 10 seconds but retry logic makes it look like an infinite hang. Also check whether the target host's SSH daemon is running at all; a recently rebooted host may not have sshd back up yet.
Task shows changed status on every run even when nothing actually changes
You are almost certainly using shell or command instead of a dedicated idempotent module. Replace with the module version: apt, service, copy, template, file. If no dedicated module exists for your use case, add a creates or removes argument to the command module so Ansible can determine whether the operation is necessary. Run ansible-playbook playbook.yml --check --diff to see exactly what is changing between runs.
Variables have different values in prod than dev with the same playbook
Run ansible-inventory --host [hostname] on the broken host first. Compare the output against a working host. Look specifically for host_vars files that a previous engineer may have created and forgotten, -e overrides injected by your CI pipeline environment variables, and include_vars statements inside roles that load different files based on environment name. ansible-inventory shows the merged inventory-level variables (inventory file, group_vars, host_vars), so trust it over what you think you set. For values set later by play vars, roles, include_vars, or -e, print the variable with a debug task as well.

Before configuration management tools, sysadmins maintained hundreds of servers by hand — logging in, running commands, hoping nothing went wrong. I lived this. In 2015, I managed a fleet of 80 web servers at a mid-size SaaS company, and every deploy night was a three-hour marathon of SSH sessions, copy-pasted commands, and prayer. One night, someone restarted the wrong database server. We lost four hours of customer data. That was the last straw.

Ansible was created by Michael DeHaan in 2012 and acquired by Red Hat in 2015 (now part of IBM). Today it runs infrastructure at NASA JPL, Capital One, and thousands of companies from Series A startups to Fortune 50 enterprises. Not because it's the most powerful automation tool, but because it's the simplest one that actually gets used.

What makes Ansible different from competitors like Chef and Puppet is that it is agentless. There is no daemon running on your managed servers, no SSL certificates to exchange, and no extra ports to open beyond standard SSH (or WinRM for Windows). Ansible runs from your control node, pushes small programs called Ansible Modules to the remote nodes, executes them, and then cleans up after itself.

One important nuance that comes up in almost every team adopting Ansible: Ansible and Terraform are not competitors — they solve different problems at different points in a server's life. Terraform creates infrastructure: it provisions the EC2 instance, creates the VPC, registers the DNS record. Ansible configures that infrastructure: it installs software, deploys application code, manages services, and corrects configuration drift on day 2, day 30, and day 300. Terraform's user_data and cloud-init can run a script at first boot, but they can't re-run idempotently when you need to update a config three months later. Ansible can. That's the real distinction — Terraform builds the house once, Ansible keeps it clean indefinitely.

In this guide, we'll break down Ansible's core architecture — inventories, playbooks, modules, and roles — cover ad-hoc commands for quick fleet operations, and build production-grade automation with real error handling, secret management, and reusable patterns. Every section includes the production detail that most tutorials skip.

Inventory, Playbooks, and Modules — The Three Core Concepts

Ansible's architecture relies on three primary building blocks. Get these right and everything else follows. Get any one of them wrong and you'll spend your time debugging instead of automating.

  1. The Inventory: A file (INI or YAML) that lists the servers you want to manage, organized into groups like [webservers] or [databases]. The inventory is your single source of truth about what exists. In production, you'll almost always use dynamic inventory — pulling host lists directly from AWS, GCP, or Azure APIs so your inventory stays accurate as servers are created and destroyed by autoscaling. Static inventories work for learning and small fixed fleets under 20 servers, but once you have autoscaling groups or spot instances, a static file becomes a liability. Stale IPs, terminated instances, missing new nodes — a static inventory in an elastic environment is a disaster on a timer.
  2. The Playbook: Your automation blueprint, written in YAML. A playbook maps groups of hosts to sequences of tasks and describes desired state rather than step-by-step instructions. This distinction matters operationally: if Nginx is already installed and running at the right version, Ansible confirms it and moves on. It doesn't reinstall. It doesn't restart unnecessarily. It checks and reports 'ok'.
  3. Modules: The tools in the toolbox. Instead of writing bash scripts, you use modules like apt, yum, service, copy, or template. These modules are idempotent — they check the current state of the server and only make changes when the server doesn't match your desired state. The shell and command modules are the notable exceptions. They run unconditionally every time, which is exactly why experienced Ansible engineers avoid them unless there is genuinely no dedicated module alternative.

For dynamic inventory specifically — here's what it looks like in practice. You create a plugin configuration file (aws_ec2.yml) that Ansible reads instead of a static hosts file. It queries the AWS EC2 API, groups instances by their tags, and returns a live host list. The inventory is never stale because it's rebuilt from the API on every run.

io/thecodeforge/ansible/inventory.ini · INI
# io.thecodeforge: Static Inventory for Project Forge
# Use this for fixed infrastructure under 20 servers.
# For elastic/cloud environments, use dynamic inventory (aws_ec2.yml below).

[webservers]
web-01.thecodeforge.io ansible_host=192.168.1.10 ansible_user=ubuntu
web-02.thecodeforge.io ansible_host=192.168.1.11 ansible_user=ubuntu

[databases]
db-01.thecodeforge.io  ansible_host=192.168.1.20 ansible_user=ubuntu

[production:children]
webservers
databases

[production:vars]
ansible_ssh_private_key_file=~/.ssh/forge_deploy_key

# ──────────────────────────────────────────────────────────────────────────────
# io.thecodeforge: Dynamic Inventory Plugin Config (aws_ec2.yml)
# Save this as inventories/production/aws_ec2.yml
# Run: ansible-inventory -i inventories/production/ --list
# ──────────────────────────────────────────────────────────────────────────────

# plugin: amazon.aws.aws_ec2
# regions:
#   - eu-west-1
# filters:
#   instance-state-name: running
#   tag:Environment: production
# keyed_groups:
#   - key: tags.Role
#     prefix: role
#     separator: '_'
#   - key: tags.Environment
#     prefix: env
#     separator: '_'
# hostnames:
#   - private-ip-address
# compose:
#   ansible_user: "'ubuntu'"
#   ansible_ssh_private_key_file: "'~/.ssh/forge_deploy_key'"
# cache: true
# cache_plugin: jsonfile
# cache_connection: /tmp/ansible_aws_cache
# cache_timeout: 300
#
# With this config:
#   - Instances tagged Role=webserver appear in group role_webserver
#   - Instances tagged Environment=production appear in group env_production
#   - Cache prevents hammering the EC2 API on every run (5-minute TTL)
#   - New instances appear automatically — no manual inventory updates
💡Test Connectivity Before Anything Else
Always run ansible all -m ping before running playbooks. If ping fails, fix SSH connectivity before debugging anything else. 90% of Ansible problems are SSH or permissions issues, not playbook logic. I've watched engineers spend two hours debugging a 'module error' that was really a missing SSH key or a security group rule blocking port 22. The ping module is your pre-flight check — make it a habit.
📊 Production Insight
The biggest inventory mistake is treating it as write-once. Hostnames change, IPs rotate, instances get replaced by autoscaling.
Dynamic inventory from cloud APIs solves stale host lists but introduces API rate limits and 2-5 seconds of startup latency per run — mitigate with the cache_timeout setting shown above.
Rule: if you cannot run ansible all -m ping successfully every time, your inventory is broken. Fix that before writing any playbook logic.
🎯 Key Takeaway
Inventory tells Ansible what servers exist. Modules tell it what to do. Playbooks tell it when and in what order.
You cannot have reliable automation without all three working correctly — and the inventory is the foundation everything else depends on.
For elastic cloud infrastructure, dynamic inventory is not optional. A stale static inventory is a silent failure waiting to happen.
Static vs Dynamic Inventory — When to Switch
If: Fixed infrastructure, under 20 servers, no autoscaling, hostnames don't change
Use: Static INI or YAML inventory is fine. It is simple, fast, and has no API dependencies.
If: Cloud infrastructure with autoscaling groups, spot instances, or servers that get replaced regularly
Use: Dynamic inventory is mandatory. Use the aws_ec2, gcp_compute, or azure_rm plugin. Static inventory becomes stale within days.
If: Mixed environment, with some fixed servers and some cloud instances
Use: Dynamic inventory for the cloud portion and a static file for fixed servers. Ansible can merge multiple inventory sources from a directory.
If: Dynamic inventory is causing API rate limit errors or slow startup
Use: Enable the inventory cache (cache: true, cache_timeout: 300). This rebuilds the host list from the API every 5 minutes instead of every run.
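
For the mixed-environment row, a static YAML inventory can sit in the same directory as the aws_ec2.yml plugin config so that pointing -i at the directory loads both. The sketch below mirrors the INI example from earlier; the file name static_hosts.yml is a hypothetical choice, and it assumes the fixed servers are the same web-01 and db-01 hosts.

```yaml
# inventories/production/static_hosts.yml  (hypothetical file name)
# Static YAML inventory for the fixed servers. Lives next to aws_ec2.yml so that
#   ansible-inventory -i inventories/production/ --list
# merges these hosts with the dynamically discovered cloud instances.
all:
  children:
    production:
      children:
        webservers:
          hosts:
            web-01.thecodeforge.io:
              ansible_host: 192.168.1.10
              ansible_user: ubuntu
        databases:
          hosts:
            db-01.thecodeforge.io:
              ansible_host: 192.168.1.20
              ansible_user: ubuntu
      vars:
        ansible_ssh_private_key_file: "~/.ssh/forge_deploy_key"
```

Keeping the group names identical across the static and dynamic sources matters here: group_vars files apply by group name, regardless of which source contributed the host.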

Your First Production Playbook — and the 22-Level Precedence Ladder

A playbook is a collection of plays. Each play targets a specific group from your inventory and executes a sequence of tasks in order, top to bottom. If a task fails on a specific host, Ansible stops executing for that host but continues for the others. To handle configuration changes — like restarting a web server only when a config file actually changes — Ansible uses Handlers: special tasks that only run when notified by another task that reported 'changed'.

The playbook below is a production pattern we actually use. Notice: update the package cache, install the binary, deploy a templated config, ensure the service is running. Every task is idempotent. Every task uses a dedicated module. No shell commands.

But here's what the Ansible documentation buries in a footnote that causes more production incidents than anything else: variable precedence has 22 levels, and Ansible enforces them silently. The most important levels to internalize, from highest to lowest priority:

  1. Extra vars (-e on the command line): highest, overrides everything
  2. set_fact and registered vars
  3. include_vars
  4. Task vars (set directly on a task)
  5. Block vars
  6. Role vars (vars/main.yml inside a role)
  7. Play vars_files, vars_prompt, and the play's vars block
  8. host_vars/hostname.yml: this is where the production incident in this article came from
  9. group_vars/groupname.yml
  10. group_vars/all.yml
  11. Role defaults (defaults/main.yml): lowest, easily overridden by anything above

The rule that causes the most surprises: host_vars always overrides group_vars. Always. Without any warning. Without any log entry. If prod-web-01.yml exists in your host_vars directory, it wins over group_vars/all.yml and group_vars/webservers.yml, silently. The reverse surprise also exists: anything in your playbook's vars block outranks both host_vars and group_vars, so a play-level value can mask the inventory value you meant to use.
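
The upper part of the ladder can be observed directly with a throwaway playbook. The following is a minimal sketch, not part of the original deployment: the file name precedence_demo.yml and the variable demo_port are hypothetical, and the values noted in the comments are what the documented precedence order implies rather than captured output.

```yaml
---
# precedence_demo.yml  (hypothetical file name)
# Runs against localhost only; touches nothing on any server.
- name: Show which variable source wins
  hosts: localhost
  connection: local
  gather_facts: false

  vars:
    demo_port: 8080            # play vars: outrank host_vars and group_vars

  tasks:
    - name: Play vars are in effect
      ansible.builtin.debug:
        var: demo_port         # should resolve to 8080

    - name: Block vars outrank play vars
      vars:
        demo_port: 8081
      block:
        - name: Inside the block
          ansible.builtin.debug:
            var: demo_port     # should resolve to 8081

        - name: Task vars outrank block vars
          vars:
            demo_port: 8082
          ansible.builtin.debug:
            var: demo_port     # should resolve to 8082

    - name: set_fact outranks play, block, and task vars
      ansible.builtin.set_fact:
        demo_port: 8083

    - name: Task vars no longer win once set_fact has run
      vars:
        demo_port: 8084
      ansible.builtin.debug:
        var: demo_port         # should still resolve to 8083
```

Running the same file with -e demo_port=9999 should make every debug task report the extra-vars value, which is the behaviour to keep in mind when a CI pipeline injects -e flags.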

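The diagnostic you need to run before every production deploy where variables are involved: ansible-inventory -i inventory.ini --host prod-web-01. This shows you the merged variable set the inventory hands to Ansible for that host, after group_vars and host_vars have been resolved. Not what you think you set. Not what's in one file. It does not include values set later by play vars, role vars, set_fact, or -e, so when any of those are involved, print the final value with a debug task as well.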

io/thecodeforge/ansible/site_setup.yml · YAML
---
# io.thecodeforge: Standard Nginx Deployment Playbook
# Variable precedence reminder (highest to lowest, the levels that matter most):
#   1. Extra vars (-e)           <- overrides EVERYTHING, use with extreme care in CI
#   2. set_fact / registered     <- runtime-computed values
#   3. Role vars (vars/main.yml) <- stronger than anything in the play or inventory
#   4. Play vars block           <- what you see below; outranks host_vars and group_vars
#   5. host_vars/hostname.yml    <- PER-HOST OVERRIDE, silent, strongest inventory-level source
#   6. group_vars/groupname.yml  <- group-specific values
#   7. group_vars/all.yml        <- global defaults
#   8. Role defaults/main.yml    <- weakest, easily overridden
#
# Debug tip: ansible-inventory -i inventory.ini --host prod-web-01
# shows the merged inventory variables (group_vars + host_vars) before the playbook runs.

- name: Deploy and Configure Nginx
  hosts: webservers
  become: true

  vars:
    nginx_port: 80
    server_name: "thecodeforge.io"
    # NOTE: play vars outrank inventory variables. The values above override
    # group_vars and host_vars for every targeted host. For settings that must
    # differ per environment or host, define them in group_vars instead of here.

  tasks:
    - name: Verify expected variable state before making any changes
      ansible.builtin.debug:
        msg: "nginx_port resolved to {{ nginx_port }} on {{ inventory_hostname }}"
      # Add this debug task during onboarding or when variables behave unexpectedly.
      # Remove or tag it once the team trusts the variable sources.

    - name: Ensure apt cache is updated
      ansible.builtin.apt:
        update_cache: yes
        cache_valid_time: 3600
        # cache_valid_time: 3600 means: skip the update if cache is less than 1 hour old.
        # Trade-off: saves 5-10 seconds per run but means security updates won't appear
        # for up to an hour. Acceptable for app servers; lower this for security-sensitive roles.

    - name: Install Nginx production package
      ansible.builtin.apt:
        name: nginx
        state: present
        # state: present = install if missing. state: latest = upgrade if a newer version exists.
        # Use present in production unless you explicitly want automatic upgrades.

    - name: Deploy custom Nginx configuration
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/default
        owner: root
        group: root
        mode: '0644'
      notify: Reload Nginx service
      # notify only fires when this task reports 'changed'.
      # If the rendered template is byte-for-byte identical to the existing file,
      # no notification is sent and Nginx is not reloaded. This is idempotency in action.

    - name: Ensure Nginx service is enabled and running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: yes

  handlers:
    - name: Reload Nginx service
      ansible.builtin.service:
        name: nginx
        state: reloaded
        # reloaded sends SIGHUP: Nginx reloads config without dropping connections.
        # restarted kills and restarts — drops all active connections.
        # Always use reloaded for config changes. Use restarted only for binary upgrades.
🔥Idempotency Is the Entire Point
Run this playbook 10 times — the result is identical to running it once. If Nginx is already installed at the right version with the right config, every task shows 'ok' and nothing changes. This is what makes Ansible safe to run on a 30-minute cron job in production. I've had this pattern running on a cron every 30 minutes for two years. It silently corrects configuration drift — when someone SSH'd in and manually changed something, the next cron run fixes it. The only time it shows 'changed' is when something genuinely changed.
📊 Production Insight
The debug task at the top showing the resolved nginx_port value adds negligible run time and has saved hours of variable precedence debugging. Add it to every playbook that uses environment-specific variables.
The template task will report 'changed' every run if your Jinja2 template includes dynamic content like {{ ansible_date_time.iso8601 }} — remove timestamps from templates unless they're genuinely needed.
Rule: a handler that uses state: restarted drops active connections. Use state: reloaded for config changes. The distinction matters at 3am when you're applying a TLS certificate update to a live API.
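
A debug task only helps if someone reads the output. One possible next step is to turn the check into a hard pre-flight failure with ansible.builtin.assert. The sketch below is an illustration, not the author's playbook: expected_nginx_port is a hypothetical variable introduced here, and 8080 is taken from the incident described earlier.

```yaml
---
- name: Pre-flight variable check
  hosts: webservers
  gather_facts: false

  tasks:
    - name: Fail fast if nginx_port resolved to an unexpected value
      vars:
        expected_nginx_port: 8080   # hypothetical: the value production should use
      ansible.builtin.assert:
        that:
          - nginx_port | int == expected_nginx_port | int
        fail_msg: >-
          nginx_port resolved to {{ nginx_port }} on {{ inventory_hostname }},
          expected {{ expected_nginx_port }}.
          Check host_vars, group_vars, play vars, and -e flags.
        success_msg: "nginx_port is {{ nginx_port }} as expected"
```

Placed as the first play in a deploy, a check like this stops the affected host before any configuration is written, instead of leaving the mismatch to surface as a load balancer health check failure.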
🎯 Key Takeaway
Idempotency is not a feature — it is the entire reason Ansible is safe to run in production automation.
If a task shows 'changed' on every run, you have broken idempotency. Fix it.
The 22-level variable precedence ladder is enforced silently — learn the top 8 levels and run ansible-inventory --host before every production deploy.
Shell vs Dedicated Module — The Decision That Determines Idempotency
If: Installing a package (apt, yum, dnf, pip)
Use: ansible.builtin.apt / yum / pip. These are idempotent and check installed state before acting.
If: Managing a service (start, stop, restart, enable on boot)
Use: ansible.builtin.service or ansible.builtin.systemd. These are idempotent and check current service state.
If: Copying a file or rendering a template
Use: ansible.builtin.copy or ansible.builtin.template. These compare checksums and only write if content differs.
If: Running a command that has no dedicated Ansible module
Use: ansible.builtin.command with creates or removes to make it conditional. Add changed_when with a specific condition. Document why no module exists.
If: Running a shell pipeline with pipes, redirects, or shell built-ins
Use: ansible.builtin.shell only as a last resort. Add changed_when: false if the output is not meaningful, or parse stdout to determine whether a real change occurred.
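
The last two rows are easier to apply with an example in hand. This is a minimal sketch under stated assumptions: /opt/forge/bin/init-db, the marker file, and the forgectl CLI with its "already enabled" message are all hypothetical stand-ins for whatever tool in your stack lacks a module.

```yaml
---
- name: Idempotent use of command and shell
  hosts: webservers
  become: true

  tasks:
    # creates: the task is skipped when the marker file already exists,
    # so the init script runs at most once per host.
    - name: Initialise the application data directory only once
      ansible.builtin.command:
        cmd: /opt/forge/bin/init-db --data-dir /var/lib/forge   # hypothetical script
        creates: /var/lib/forge/.initialised                    # hypothetical marker file

    # A read-only command never changes state, so say so explicitly.
    - name: Read the running kernel version
      ansible.builtin.command:
        cmd: uname -r
      register: kernel_version
      changed_when: false

    # shell is needed here only because of the redirect.
    # changed_when parses the output instead of always reporting 'changed'.
    - name: Enable a feature flag through a CLI that has no module
      ansible.builtin.shell:
        cmd: forgectl feature enable fast-path 2>&1              # hypothetical CLI
      register: flag_result
      changed_when: "'already enabled' not in flag_result.stdout"
```

The first task relies on the script leaving the marker file behind; if your tool does not create one, a follow-up ansible.builtin.file task with state: touch is one way to provide it.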

Ad-hoc Commands — Quick Fleet Operations Without a Playbook

Not everything needs a playbook. Sometimes you need to run a single command across your fleet right now — check disk space before a deploy, restart a hung service on 50 app servers, verify a kernel patch applied across the fleet, kill a runaway process that's consuming memory. That's what ad-hoc commands are for.

Ad-hoc commands are Ansible's underrated superpower for day-two operations. They're the reason senior SREs reach for Ansible instead of writing SSH for-loops. An SSH for-loop runs the command on every server sequentially and gives you raw unstructured output. Ansible ad-hoc runs in parallel across as many hosts as your forks setting allows, returns structured output per host, handles failures gracefully, and respects your inventory groups so you don't accidentally run something against the wrong environment.

Syntax: ansible <host-pattern> -i <inventory> -m <module> -a '<arguments>'

The flags you'll use daily
  • -b or --become: run as root (sudo)
  • -u or --user: specify the SSH username
  • --limit 'web-01': restrict execution to a subset of the matched hosts — critical for safe fleet operations
  • --check: dry run — show what would change without actually changing anything
  • -f 50 or --forks 50: override the default parallelism for this single command
  • -v, -vv, -vvv, -vvvv: increasing verbosity. -v shows task results. -vvv adds connection details and the arguments each module was invoked with. -vvvv adds connection-level debugging, including the SSH client's own verbose output. Use it when debugging SSH hangs.

In production I use ad-hoc commands daily. Checking disk space on 200 servers before a deploy: one-liner, 10 seconds, structured output. Restarting a hung worker process across 50 app servers: one-liner. Verifying that a security patch actually applied to every host in the fleet: one-liner. These replace what used to be 20-minute SSH marathons with copy-pasted commands and manually collated output.

io/thecodeforge/ansible/adhoc_examples.sh · BASH
#!/usr/bin/env bash
# io.thecodeforge: Ad-hoc Command Reference
# These replace SSH for-loops. Run these, not bash loops.

# ── Connectivity and fact-checking ───────────────────────────────────────────

# Verify SSH connectivity to all production hosts before a major deploy
ansible production -i inventory.ini -m ping

# Check disk space across all web servers before a deploy
# -o: one-line output mode — easier to scan for problems
ansible webservers -i inventory.ini -m command -a "df -h /" -o

# Gather full system facts from a single host (OS, IPs, memory, CPU)
# Useful for debugging environment differences between hosts
ansible db-01.thecodeforge.io -i inventory.ini -m setup

# Gather only a subset of facts to speed up the call
# gather_subset=min returns a minimal fact set (OS, distribution, hostname) and skips hardware and network details
ansible webservers -i inventory.ini -m setup -a 'gather_subset=min' -o

# ── Safe fleet operations with --limit ────────────────────────────────────────

# The --limit flag restricts execution to a subset of the target group.
# ALWAYS use --limit when you want to test on one host before hitting the fleet.
# This is the most important safety habit for ad-hoc fleet operations.

# Restart Nginx on ONE host first to verify the command is correct
ansible webservers -i inventory.ini -m service \
  -a "name=nginx state=restarted" --become \
  --limit web-01.thecodeforge.io

# Once verified, restart Nginx across all web servers
ansible webservers -i inventory.ini -m service \
  -a "name=nginx state=restarted" --become

# ── Security and maintenance ──────────────────────────────────────────────────

# Apply a security patch across the entire fleet in parallel
# -f 20: process 20 hosts at a time (tune based on control node resources)
ansible production -i inventory.ini \
  -m apt -a "name=openssl state=latest update_cache=yes" \
  --become -f 20

# Verify the patch was applied: check the installed version on every host
# The pipe requires the shell module; the command module does not run through a shell
ansible production -i inventory.ini \
  -m shell -a "dpkg -l openssl | grep '^ii'" -o

# ── Dry run before any destructive operation ─────────────────────────────────

# --check: show what WOULD happen without actually doing it
# Use this before any ad-hoc command that modifies state
ansible webservers -i inventory.ini \
  -m apt -a "name=nginx state=absent" \
  --become --check

# ── Verbosity for SSH debugging ───────────────────────────────────────────────

# -v:    show task result summary
# -vv:   show connection parameters
# -vvv:  show SSH connection details (use this when a host is unreachable)
# -vvvv: show raw SSH protocol output (use this when SSH itself is misbehaving)
ansible web-01.thecodeforge.io -i inventory.ini -m ping -vvv
⚠ Ad-hoc Is Not Idempotent by Default — and --limit Is Your Safety Net
The command module runs every time regardless of state. For one-off operations like checking disk space or restarting a service, this is fine. But always use --limit when testing a new ad-hoc command — run it against one host, verify the output is what you expected, then remove --limit to hit the fleet. I've seen an ad-hoc apt remove command accidentally run against the entire production fleet because someone forgot to add --limit during testing. The --limit flag is not optional for fleet operations — it's the difference between 'I tested this on one server' and 'I just removed a package from 200 servers simultaneously.'
📊 Production Insight
Parallel execution is great until it overwhelms your control node. Default forks=5 is too low for 100 servers — raise it to 50 for most fleet operations.
Each fork consumes memory, a file handle, and an SSH socket. I've seen Ansible crash with OOM errors at forks=200 on a t2.micro control node running a large fleet operation.
Rule: monitor control node CPU and memory when you increase forks. Start at 50, increase slowly, watch for SSH connection failures in the -vvv output which indicate the control node is hitting file descriptor limits.
🎯 Key Takeaway
Ad-hoc commands are for day-two fleet operations — not for automation you'll run twice.
Always use --limit to test against one host before running against the fleet. This is not optional.
If you're about to paste an ad-hoc command into a wiki page or a runbook, turn it into a playbook instead.
Ad-hoc Command vs Playbook — When to Write It Down
If: One-off check or emergency operation you'll never run again
Use: Ad-hoc is appropriate. It's fast, there's no file to maintain, and results are visible immediately
If: Operation you've run twice already or pasted into a wiki page
Use: Write a playbook. You've already proven this is repeatable work that deserves automation
If: Fleet-wide state change during an incident (restart services, apply patch, kill process)
Use: Ad-hoc with --limit on one host first, then full fleet. Document the command in your incident postmortem.
If: Routine maintenance you run weekly or monthly
Use: Write a playbook and schedule it in AWX or cron. Ad-hoc commands don't have audit trails or scheduled execution
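
As an illustration of "write it down", the OpenSSL patch and verification commands from the ad-hoc examples could be captured in a small playbook. This is a minimal sketch, not a definitive implementation: the filename patch_openssl.yml is hypothetical, and it assumes Debian/Ubuntu hosts in a production group, as in the ad-hoc commands.

```yaml
---
# patch_openssl.yml (hypothetical filename)
- name: Patch OpenSSL across the fleet
  hosts: production
  become: true

  tasks:
    - name: Upgrade openssl to the latest packaged version
      ansible.builtin.apt:
        name: openssl
        state: latest
        update_cache: true

    # package_facts fills ansible_facts.packages with installed packages
    - name: Collect installed package versions
      ansible.builtin.package_facts:
        manager: auto

    - name: Show the installed openssl version per host
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }}: openssl {{ ansible_facts.packages['openssl'][0].version }}"
```

The same one-host-first habit applies: run it with ansible-playbook -i inventory.ini patch_openssl.yml --limit web-01.thecodeforge.io, check the output, then drop --limit.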

Roles — Reusable Automation at Scale

Once your playbooks grow beyond 50 lines, you'll start copying tasks between files. That's when you need roles. A role is a self-contained unit of automation — tasks, handlers, templates, default variables, and static files — packaged in a standardized directory structure that Ansible knows how to load automatically. Roles are how Ansible scales from 'one playbook' to 'an entire infrastructure codebase that multiple teams can contribute to.'

The directory structure is Ansible's loading convention, not optional decoration. When you reference a role in a playbook, Ansible automatically loads tasks/main.yml, handlers/main.yml, defaults/main.yml, templates/, and files/ if they exist. The structure is the contract — deviate from it and things silently don't load.

Roles come from two sources: you write your own for application-specific automation, or you pull community roles from Ansible Galaxy (ansible-galaxy install geerlingguy.nginx). Galaxy has thousands of pre-built roles for common infrastructure software. For Nginx, Docker, PostgreSQL, certbot, Redis — a battle-tested community role saves hours and handles edge cases your first draft won't. For deploying your Java application, configuring your monitoring stack, or enforcing your company's specific security baseline — you write your own.

Critically, community roles must be version-pinned in a requirements.yml file. Not a branch name, not latest: a specific version tag. I've watched a Galaxy role change a default variable in a minor version update and restart PostgreSQL during a maintenance window without any warning. The role's changelog mentioned it. Nobody read the changelog because nobody expected a minor version to change default behavior. Pin the version. Test the upgrade in staging. Treat a Galaxy role update the same way you treat a library dependency upgrade, with the same caution and the same verification process.

io/thecodeforge/ansible/roles/nginx/tasks/main.yml · YAML
---
# io.thecodeforge: Reusable Nginx Role
#
# Role directory structure (Ansible's loading convention — not optional):
# roles/nginx/
#   ├── defaults/
#   │   └── main.yml       <- weakest variable precedence, safe defaults
#   ├── handlers/
#   │   └── main.yml       <- service reload/restart handlers
#   ├── tasks/
#   │   └── main.yml       <- this file, core task logic
#   ├── templates/
#   │   └── vhost.conf.j2  <- Jinja2 config templates
#   └── files/
#       └── (static files if needed)
#
# Use this role in a playbook:
#   - hosts: webservers
#     roles:
#       - role: nginx
#         vars:
#           server_name: api.thecodeforge.io
#           nginx_port: 8080

- name: Install Nginx
  ansible.builtin.apt:
    name: nginx
    state: present
    update_cache: yes

- name: Deploy virtual host configuration from template
  ansible.builtin.template:
    src: vhost.conf.j2
    dest: "/etc/nginx/sites-available/{{ server_name }}.conf"
    owner: root
    group: root
    mode: '0644'
    # No validate here: 'nginx -t -c %s' treats the file as a complete nginx.conf,
    # and a vhost fragment (a bare server block) fails that test on its own.
    # Test the assembled config instead, e.g. run 'nginx -t' before reloading.
  notify: Reload Nginx

- name: Enable virtual host by creating symlink
  ansible.builtin.file:
    src: "/etc/nginx/sites-available/{{ server_name }}.conf"
    dest: "/etc/nginx/sites-enabled/{{ server_name }}.conf"
    state: link
  notify: Reload Nginx

- name: Ensure Nginx is running and enabled on boot
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: yes

---
# io.thecodeforge: requirements.yml — Galaxy role version pinning
# Install with: ansible-galaxy install -r requirements.yml
# ALWAYS pin to a specific version. Never use 'latest'.
# Treat a version bump the same as a library dependency upgrade:
# test in staging, read the changelog, verify behavior before deploying to prod.

# roles:
#   - name: geerlingguy.nginx
#     version: 3.2.0
#     # Pinned: tested against Ubuntu 22.04 LTS on 2026-03-01
#     # Upgrade checklist: test in staging, verify default variable changes
#
#   - name: geerlingguy.docker
#     version: 6.1.0
#     # Pinned: confirmed compatible with Docker 25.x on 2026-02-15
#
#   - name: geerlingguy.postgresql
#     version: 3.4.0
#     # Pinned: restart behavior tested — does NOT restart on minor config changes
#
# Install all roles:
#   ansible-galaxy install -r requirements.yml --roles-path roles/
#
# Upgrade a single role safely:
#   ansible-galaxy install geerlingguy.nginx,3.3.0 --force
#   # Then test in staging before updating the version in requirements.yml
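
The tasks above notify a handler named Reload Nginx and read the variables server_name and nginx_port, but the companion files are not shown. A minimal sketch of what they might contain follows; the default values are assumptions chosen for illustration, not the author's actual settings.

```yaml
---
# roles/nginx/defaults/main.yml
# Lowest-precedence values: inventory, play vars, or -e can all override them.
server_name: localhost   # assumed placeholder, override per host or group
nginx_port: 80           # assumed default listen port

---
# roles/nginx/handlers/main.yml
# The handler name must match the string used in 'notify' exactly.
- name: Reload Nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded
```

Keeping the values in defaults/ rather than vars/ is deliberate: role defaults sit at the bottom of the precedence order, so any playbook that uses the role can override them without editing the role.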
💡Use Galaxy for Commodity Software — Pin the Version
Don't write your own Nginx, Docker, or PostgreSQL role from scratch. ansible-galaxy install geerlingguy.nginx gives you a battle-tested role maintained by one of the most prolific Ansible contributors in the community. Save your custom role-writing energy for application-specific automation that Galaxy can't provide. But pin the version in requirements.yml every time. A community role is a dependency you don't fully control — treat it with the same caution as any third-party library.
📊 Production Insight
Community roles save time but introduce supply chain risk. A Galaxy role that changes its default restart behavior in a minor version can restart your database during business hours with no warning in the Ansible output.
Pin Galaxy roles to specific versions in requirements.yml. Read the role's CHANGELOG before upgrading. Test in staging with the same inventory structure as production.
Rule: ansible-galaxy install geerlingguy.nginx without a version pin in requirements.yml is the same as npm install without a lockfile. Don't do it.
🎯 Key Takeaway
Roles are how Ansible scales from 10 to 1000 servers. The directory structure is the loading contract — Ansible silently skips files that don't follow it.
Community roles for infrastructure software, custom roles for application logic. Pin community role versions in requirements.yml every time.
A role you didn't write is a dependency you don't fully control. Version-pin it, test upgrades in staging, and read the changelog before deploying to production.
Custom Role vs Community Role — The Decision Criteria
If: Common infrastructure software: Nginx, Docker, PostgreSQL, Redis, certbot, Node.js
Use: A community Galaxy role, pinned to a specific version in requirements.yml. Don't reinvent the wheel
If: Application deployment, business-specific configuration, company security baseline
Use: Write a custom role. This is your domain-specific logic that Galaxy cannot provide
If: Community role exists but doesn't support a configuration option you need
Use: Fork the role or wrap it: add a custom task after the community role that applies your specific config. Do not modify the community role in-place.
If: Playbook is importing more than three roles
Use: Create a higher-level wrapper role that includes the sub-roles. This makes the top-level playbook readable and keeps the role composition organized

Production Patterns — Error Handling, Vault, and Rolling Deploys

The playbook we built above works correctly for a single server in a controlled environment. Production is messier. Databases fail mid-migration. Network blips cause intermittent SSH timeouts. You need to deploy to 50 servers without taking all 50 offline simultaneously. And you absolutely cannot store database passwords in plain text YAML committed to Git — not because of policy, but because production credentials in version control is a breach waiting to happen.

Error Handling with block/rescue/always: Ansible has a try/catch equivalent. Wrap risky tasks in a block. If anything inside fails, the rescue section runs — rollback, alert, log. The always section runs regardless — cleanup, notifications. Without this pattern, a failed database migration leaves your server in a half-configured state with no automatic recovery and no notification that anything went wrong.

Rolling Deploys with serial: The serial keyword sets the batch size for a play. serial: 3 means Ansible runs the whole play (including any health checks in it) on 3 servers, then moves to the next 3. Without serial, each task runs across all targeted hosts, up to the forks limit at a time, before the next task starts. That is acceptable for config management but catastrophic for application deploys where you need zero downtime.

Ansible Vault for Secrets: Vault encrypts variables or entire files using AES256. Create an encrypted file with ansible-vault create group_vars/production/vault.yml, add your secrets, and commit the encrypted file to Git. Without the vault password, the file is gibberish — safe to store in version control. In CI/CD, pass the vault password via a file written from a CI secret: echo "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass, then ansible-playbook deploy.yml --vault-password-file /tmp/vault_pass. Never use --ask-vault-pass in CI — it expects interactive input and hangs silently.

For different environments, use different vault password files — one for staging, one for production. The vault file contents can be identical in structure but different in values (different database passwords per environment), while the passwords to decrypt them are stored separately in your CI secrets manager.

io/thecodeforge/ansible/deploy_with_safety.yml · YAML
---
# io.thecodeforge: Production Deploy with Error Handling, Rolling Deploy, and Vault
#
# Before running:
#   1. Create vault file: ansible-vault create group_vars/production/vault.yml
#      Add: db_password: "your_real_password"
#           webhook_url: "https://hooks.slack.com/your/webhook"
#   2. Commit the encrypted vault file to Git (safe — AES256 encrypted)
#   3. Store vault password in CI secrets as ANSIBLE_VAULT_PASSWORD
#   4. CI runs with: ansible-playbook deploy.yml --vault-password-file /tmp/vault_pass

- name: Deploy Application with Safety Rails
  hosts: webservers
  become: true
  serial: 3              # Rolling deploy: process 3 servers at a time
                         # For 30 servers: 10 sequential batches of 3
                         # Trade-off: 10x longer than parallel, 0 simultaneous downtime
  max_fail_percentage: 0 # Stop the entire deploy if ANY server in a batch fails
                         # max_fail_percentage: 30 would allow 30% failure before aborting
                         # For database migrations, use 0 — one failure should stop everything

  vars_files:
    - group_vars/production/vault.yml  # Encrypted with ansible-vault — safe in Git
    # vault.yml contains (encrypted at rest):
    #   db_password: "your_real_password"
    #   webhook_url: "https://hooks.slack.com/your/webhook"
    # Reference in tasks as: {{ db_password }} and {{ webhook_url }}
    # Ansible decrypts in memory at runtime using the vault password file

  tasks:
    - name: Deploy application release with rollback on failure
      block:
        # ── Step 1: Pull the new code ─────────────────────────────────────────
        - name: Pull latest application code
          ansible.builtin.git:
            repo: "https://github.com/thecodeforge/app.git"
            dest: /opt/app
            version: "{{ release_version }}"
            # release_version passed via -e on the command line:
            # ansible-playbook deploy.yml -e release_version=v2.4.1

        # ── Step 2: Run database migrations ──────────────────────────────────
        - name: Run database migrations
          ansible.builtin.command:
            cmd: /opt/app/bin/migrate --env production
            chdir: /opt/app
          environment:
            DATABASE_URL: "postgres://app:{{ db_password }}@db-01:5432/appdb"
            # db_password comes from the vault file — never hardcoded
          register: migration_result
          # register: captures the command output for use in later tasks or rescue block

        # ── Step 3: Verify the application is healthy ─────────────────────────
        - name: Verify application health endpoint responds 200
          ansible.builtin.uri:
            url: "http://localhost:8080/health"
            status_code: 200
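          register: health_check
          until: health_check.status == 200
          retries: 5      # Try up to 5 times
          delay: 3        # Wait 3 seconds between retries
          # until makes the retry condition explicit; older Ansible releases
          # ignore retries when no until is set.
          # If the health check still fails after the retries, the block fails
          # and rescue runs automatically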

      rescue:
        # Runs only if any task in the block above fails
        - name: Log deployment failure with context
          ansible.builtin.debug:
            msg: >
              Deploy FAILED on {{ inventory_hostname }}.
              Release: {{ release_version }}.
              Rolling back to: {{ previous_release }}.
              Migration output: {{ migration_result.stdout | default('N/A') }}

        - name: Rollback to previous known-good release
          ansible.builtin.git:
            repo: "https://github.com/thecodeforge/app.git"
            dest: /opt/app
            version: "{{ previous_release }}"
            # previous_release passed alongside release_version:
            # ansible-playbook deploy.yml -e release_version=v2.4.1 -e previous_release=v2.4.0

      always:
        # Runs regardless of success or failure — use for notifications and cleanup
        - name: Send deployment status notification
          ansible.builtin.uri:
            url: "{{ webhook_url }}"
            method: POST
            body_format: json
            body:
              host: "{{ inventory_hostname }}"
              release: "{{ release_version }}"
              status: "{{ 'success' if ansible_failed_task is not defined else 'failed' }}"
              environment: production
          # webhook_url comes from the vault file
          # ansible_failed_task is set by Ansible when a task in the block fails
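
One behaviour worth checking before relying on max_fail_percentage: 0 with this pattern: when a rescue section finishes without error, Ansible treats the host as recovered rather than failed, so a rolling deploy moves on to the next batch. A minimal sketch, assuming you want a rolled-back host to still count as a failure, ends the rescue section with ansible.builtin.fail. The play name, the simulated failure, and the messages are hypothetical placeholders.

```yaml
---
- name: Rescue that still fails the host (sketch)
  hosts: webservers
  serial: 3
  max_fail_percentage: 0
  gather_facts: false

  tasks:
    - name: Deploy with rollback
      block:
        - name: Simulate a failing deploy step (placeholder for the real tasks)
          ansible.builtin.fail:
            msg: "Simulated deploy failure"

      rescue:
        - name: Roll back (placeholder for the real rollback tasks)
          ansible.builtin.debug:
            msg: "Rolling back {{ inventory_hostname }}"

        # Without this task the host would count as recovered
        - name: Re-raise the failure so the host counts against max_fail_percentage
          ansible.builtin.fail:
            msg: "Deploy failed on {{ inventory_hostname }}; rollback completed."

      always:
        - name: Runs on success and on failure
          ansible.builtin.debug:
            msg: "Notification for {{ inventory_hostname }} goes here"
```

Whether to re-raise is a design choice: clearing the failure suits deploys where a rolled-back host is an acceptable end state, while re-raising suits migrations where one failure should halt the remaining batches.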
⚠ Never Skip Error Handling in Production
I watched a team deploy without block/rescue during a database schema migration. A migration script failed on server 3 of 20. Ansible stopped for that host but continued for the remaining 17. Result: 19 servers running the new application code against the new schema, 1 server running old code against the old schema, and the load balancer routing 5% of traffic to the old server. The application broke in spectacular and inconsistent ways for three hours while the team figured out what happened. Always use block/rescue for any playbook that modifies persistent state. The rescue block should be your incident response automated.
📊 Production Insight
serial: 3 on a 300-server fleet means 100 sequential batches. With a 30-second health check per batch, that's 50 minutes for a full deploy. Plan your maintenance windows accordingly.
Vault decryption adds about 200ms of startup overhead per playbook run. Cache the vault password file in your CI agent's workspace — don't write it on every task.
Rule: set max_fail_percentage: 0 for database migrations and schema changes. Set max_fail_percentage: 20 for stateless config deployments where partial failure is tolerable. Never rely on the default, under which Ansible keeps moving to the next batch as long as at least one host in the current batch has not failed.
🎯 Key Takeaway
Production deploys need serial for safety, block/rescue for recovery, and Vault for secrets. Without all three, you're gambling on every deploy.
Vault workflow: ansible-vault create the file, commit the encrypted version to Git, store the password in CI secrets, pass it with --vault-password-file. Never --ask-vault-pass in automation.
A rollback in your rescue block is worth more than any monitoring alert. By the time an alert fires, the rescue block has already run.
Choosing serial Batch Size for Rolling Deploys
If: Stateless application servers, zero-downtime deploy, load balancer in front
Use: serial: 25%, which updates one quarter of the fleet at a time. Fast enough to complete in reasonable time, safe enough to catch problems before they hit all servers.
If: Database migration included in the deploy
Use: serial: 1 with max_fail_percentage: 0. Migrations must succeed on every server before moving to the next. One failure stops everything.
If: Configuration change only, no code deploy, service remains running
Use: serial: 50% or higher. Config changes are low-risk and faster completion is better.
If: Unknown risk level or first time running this playbook in production
Use: serial: 1 with --limit to start on a single non-critical host. Verify manually. Then increase serial gradually.
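
Increasing serial gradually does not have to mean editing the playbook between runs: serial also accepts a list of batch sizes, applied in order, with the last entry repeated until every host has been processed. The sketch below assumes a webservers group; the batch sizes and the placeholder task are illustrative only.

```yaml
---
- name: Rolling deploy with a ramp-up (sketch)
  hosts: webservers
  serial:
    - 1        # first batch: a single canary host
    - "25%"    # second batch: a quarter of the play's hosts
    - "50%"    # remaining batches: up to half of the play's hosts each
  max_fail_percentage: 0

  tasks:
    - name: Placeholder for the real deploy tasks
      ansible.builtin.debug:
        msg: "Deploying to {{ inventory_hostname }}"
```

With max_fail_percentage: 0, a failure on the canary host stops the play before the larger batches start.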
🗂 Ansible vs Chef, Puppet, and Terraform
Choosing the right tool depends on fleet size, team expertise, push vs pull model, and whether you need infrastructure provisioning or configuration management
Tool | Agent Required | Language | Learning Curve | Best For
Ansible | No (agentless, SSH only) | YAML + Jinja2 | Low: most engineers are productive within a day | Configuration management, application deployment, ad-hoc fleet operations, and orchestration across mixed environments. The fastest path from zero automation to everything automated. Best choice for teams that don't have dedicated infrastructure engineers.
Chef | Yes (chef-client daemon running on every managed node) | Ruby DSL | High: requires Ruby knowledge and Chef Server administration | Complex, policy-based configuration in large enterprise fleets where teams have Ruby expertise and need a pull-based model. Chef Server handles 10,000+ nodes better than Ansible's push model at extreme scale.
Puppet | Yes (puppet agent daemon, certificate-based auth) | Puppet DSL | High: Puppet DSL is its own language with its own idioms | Long-term compliance enforcement and drift remediation in regulated industries (finance, healthcare, government) where continuous automated enforcement matters more than on-demand execution. Puppet's pull model means servers self-correct without a human initiating a run.
Terraform | No | HCL | Medium: HCL is readable but state management has a learning curve | Infrastructure provisioning: creating servers, VPCs, load balancers, DNS records, IAM roles, and managed services. Complementary to Ansible, not a replacement. Terraform creates the server. Ansible configures it. Most mature DevOps teams use both in sequence: Terraform provisions, Ansible configures on first boot and on every subsequent config change.

🎯 Key Takeaways

  • Ansible is agentless — it connects over SSH requiring no software installation on managed nodes. Zero maintenance overhead on servers, instant onboarding for new infrastructure, and a smaller security footprint than agent-based tools.
  • Playbooks describe desired state in human-readable YAML — not step-by-step scripts. Run them once or a hundred times and the outcome is identical. This idempotency is what makes Ansible safe to run in CI/CD pipelines and on scheduled crons.
  • Variable precedence has 22 levels enforced silently. host_vars always overrides group_vars. Extra vars (-e) override everything. Run ansible-inventory --host <hostname> before every production deploy where variables matter. The resolved inventory variables it prints are ground truth for what host_vars and group_vars actually produce.
  • Prioritize dedicated modules (apt, systemd, git, copy, template) over shell and command. Dedicated modules check state before acting. Shell and command run unconditionally every time and report 'changed' on every run — breaking your CI dashboard's signal-to-noise ratio.
  • Roles are how Ansible scales from 10 servers to 1000. The directory structure is Ansible's loading contract — deviate from it and files silently don't load. Pin Galaxy community roles to specific versions in requirements.yml and treat upgrades like dependency upgrades.
  • Use block/rescue/always for any playbook that modifies persistent state. Without error handling, a failed migration on server 3 of 20 leaves your fleet in split-brain configuration with no automatic recovery and no notification.
  • Ansible Vault is non-negotiable for secrets. ansible-vault create the file, commit the encrypted version to Git, store the decryption password in CI secrets, pass it with --vault-password-file. Never --ask-vault-pass in automation and never plain-text credentials in playbooks.
  • Ansible and Terraform are complementary tools in the same pipeline: Terraform provisions the server, Ansible configures it. Terraform's user_data runs once at first boot. Ansible runs idempotently on day 1, day 30, and day 300 — correcting drift every time.

⚠ Common Mistakes to Avoid

    Using ignore_errors: yes as a band-aid for tasks that matter
    Symptom

    A failing task is silenced with ignore_errors: yes because it was intermittently failing during development and the engineer wanted to move on. Three months later, SSL certificate renewal is silently failing on 12 servers. Customers see browser security warnings. Nobody noticed because the error was suppressed. The playbook reported 'ok' on every run.

    Fix

    Use block/rescue/always instead. If a task fails, the rescue block runs rollback and sends an alert immediately. If you genuinely expect a task to fail in a specific known way, use failed_when with a condition that checks the actual error message — not ignore_errors which swallows everything. Reserve ignore_errors for genuinely non-critical operations and document exactly why in a comment. Never use it on tasks that touch TLS, auth, or persistent state.
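
    The failed_when approach can be sketched as follows. The script path and the 'already exists' message are hypothetical stand-ins for whatever known, harmless error your own task produces.

```yaml
---
- name: Fail on everything except one known, harmless error (sketch)
  hosts: webservers
  become: true

  tasks:
    - name: Create the application database user
      ansible.builtin.command:
        cmd: /opt/app/bin/create-db-user app   # hypothetical script
      register: create_user
      # List items are AND-ed: the task fails only when the exit code is
      # non-zero AND the error is not the expected "already exists" case.
      failed_when:
        - create_user.rc != 0
        - "'already exists' not in create_user.stderr"
      # Report 'changed' only when the user was actually created
      changed_when: create_user.rc == 0
```

    Unlike ignore_errors, any other error (a wrong password, an unreachable database) still fails the task and surfaces in the run output.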

    Committing plain-text secrets to version control
    Symptom

    Database passwords and API keys appear in Git history — often in an early commit before the engineer realized they'd done something wrong. A former employee with repo access now has production credentials. A security audit flags the repository. The credentials must be rotated across every system that uses them.

    Fix

    Use ansible-vault encrypt_string 'your_secret' --name 'db_password' and paste the encrypted output into your playbook. Better: put all secrets in group_vars/production/vault.yml and encrypt the entire file with ansible-vault encrypt. Commit the encrypted file — it's safe in Git without the password. Store the vault password in your CI secrets manager (GitHub Actions secrets, GitLab CI variables, Jenkins credentials). Rotate secrets by re-encrypting with a new value, not by changing the vault password.

    Using shell or command modules when a dedicated module exists
    Symptom

    The CI dashboard shows 'changed' on every single run for the same task. The deploy pipeline always reports 1 changed even when nothing was deployed. The team loses trust in the changed indicator because it's always on — which means they also miss genuine changes.

    Fix

    Replace ansible.builtin.shell: apt install nginx with ansible.builtin.apt: name=nginx state=present. The apt module checks whether nginx is already installed at the correct version before acting. It only reports 'changed' when it actually installs or upgrades something. Apply the same pattern for service management, file operations, and package management — there is almost always a dedicated module.

    Not disabling host key checking in CI/CD environments
    Symptom

    The CI job hangs indefinitely with no error output. The last log line is about establishing an SSH connection. The job eventually times out after the CI runner's maximum job duration. The engineer reruns it and it hangs again.

    Fix

    Set ANSIBLE_HOST_KEY_CHECKING=False as a CI environment variable for ephemeral environments. For production stability, use ssh-keyscan in your CI pipeline prep step to pre-populate known_hosts before Ansible runs: ssh-keyscan -H target_host >> ~/.ssh/known_hosts. This maintains the security benefit of host key verification without the interactive hang.

    Forgetting become: true and spending an hour debugging the wrong thing
    Symptom

    A task that modifies /etc/nginx/conf.d/ fails with 'permission denied' or 'file not found' depending on the module. The engineer spends time checking whether the directory exists, whether the path is spelled correctly, whether the disk is full — none of which is the actual problem.

    Fix

    Add become: true at the play level for any play that touches system files, package managers, or services. Make it explicit and global: hosts: all then become: true on the next line. If only specific tasks need root, add become: true at the task level. But the most common mistake is forgetting it for an entire play — add it at the play level and override downward if needed.
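
    A minimal sketch of play-level become with a task-level override; the tasks themselves are illustrative placeholders.

```yaml
---
- name: Configure web servers
  hosts: all
  become: true            # every task in this play runs with privilege escalation

  tasks:
    - name: Install Nginx (needs root)
      ansible.builtin.apt:
        name: nginx
        state: present

    - name: Check which account runs a task when become is overridden
      ansible.builtin.command:
        cmd: whoami
      become: false       # task-level override: run as the SSH login user
      register: login_user
      changed_when: false # read-only check, never reports 'changed'
```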

    Ignoring YAML indentation and spending time on cryptic parse errors
    Symptom

    Ansible returns ERROR! Syntax Error while loading YAML or expected <block end>, but found '<block mapping start>'. The line number points to a line that looks visually correct. The error message is not helpful.

    Fix

    Run yamllint playbook.yml before ansible-playbook. Configure your editor to show invisible whitespace characters — spaces as dots, tabs as arrows. YAML requires spaces exclusively — tab characters are always invalid regardless of how they look in your editor. A missing space after a colon breaks the entire file. Install the ansible extension for VS Code which highlights YAML errors inline.

Interview Questions on This Topic

  • Q: Explain the agentless architecture of Ansible. How does it compare to agent-based tools like Puppet or Chef in terms of security footprint, operational overhead, and onboarding friction for new servers? (Mid-level)
    Agentless means no daemon runs on managed servers. Ansible connects via SSH, pushes a small Python module (or a PowerShell module over WinRM for Windows hosts), executes it, and removes it. The security footprint is smaller than agent-based tools: one fewer daemon running as root, one fewer open port, one fewer set of certificates to manage. Operational overhead is lower: no agent upgrades, no agent crashes, no certificate rotations, no 'the agent lost connection to Chef Server' incidents at 3am. Onboarding new servers is minimal. They need SSH access and a Python interpreter, which most mainstream Linux distributions ship by default (some minimal and container images do not). The trade-off: Ansible's push model from a control node doesn't scale as elegantly as Chef or Puppet's pull model for very large fleets (5,000+ nodes) where you need continuous automated enforcement without human initiation. Chef's pull model handles constant drift correction at extreme scale more efficiently. For most teams (under 500 servers, mixed OS environments, no dedicated infrastructure engineers) agentless is simpler, faster to adopt, and operationally safer.
  • Q: What is idempotency in the context of Ansible modules? Can you name a module that is not idempotent by default, and explain when you'd intentionally use it? (Mid-level)
    Idempotency means running an operation multiple times produces the same result as running it once. Ansible modules check current state before making changes. The apt module checks if a package is installed before installing. The template module compares checksums before writing. The service module checks whether the service is already in the desired state before acting. These modules report 'ok' when the desired state already exists and 'changed' only when they actually modify something. The shell and command modules are not idempotent by default — they execute the command unconditionally on every run and always report 'changed'. You'd intentionally use command for truly one-off operations where no dedicated module exists, but even then you add creates or removes flags to make it conditional. The only time I use shell without idempotency guards is in ad-hoc commands for emergency fleet debugging — never in a playbook that runs in CI.
  • Q: How does Ansible handle parallel execution? What is a fork in ansible.cfg, and how does tuning it impact performance on a 500-node fleet? (Senior)
    Ansible uses forks to control parallelism. Each fork is a separate worker process on the control node that handles one host connection at a time. The default is forks=5, which means Ansible talks to at most 5 hosts simultaneously for a given task and works through the rest as workers free up. As a rough model, on a 500-node fleet with forks=5 and 10 seconds per host batch: 100 batches × 10 seconds = ~17 minutes. With forks=50: 10 batches × 10 seconds = ~100 seconds. The speedup is roughly linear up to the control node's resource limits. The trade-offs of higher forks: each fork holds an SSH socket (file descriptor), module output in memory, and a Python subprocess. On a t2.medium control node with forks=100, I've seen OOM kills when processing large setup module output from 100 hosts simultaneously. The safe starting point for 500 nodes is forks=50 in ansible.cfg, combined with pipelining=True which reduces the number of SSH round-trips per module from 3 to 1. Monitor control node CPU and memory. Raise forks by 10 at a time and watch for SSH connection failures in the output, which indicate file descriptor exhaustion.
  • Q: What is the difference between a task and a handler? In what scenario would a handler be skipped even if it is notified by a task that reported changed? (Mid-level)
    Tasks run in the order written, unconditionally (unless a when clause prevents it). Handlers run at the end of a play, only once per play regardless of how many times they're notified, and only if at least one notifying task reported 'changed'. Scenarios where a handler you expected does not take effect: first, the notifying task reports 'ok' instead of 'changed'. Idempotency prevented the change, so the notification is never sent. Second, the play fails before reaching the handler execution phase. Handlers are deferred to the end of the play, so a mid-play failure means handlers never run unless you pass --force-handlers. Third, you use --check mode. Notified handlers still run, but in check mode themselves, so a reload handler reports what it would do without actually reloading the service. The meta: flush_handlers trick is important for production: if you have a config change that must be applied before the next task runs (for example, Nginx must reload before a subsequent task checks the listening port), you insert meta: flush_handlers in the task list to force immediate handler execution at that point rather than waiting for the end of the play.
  • Q: How would you use Ansible Vault to manage environment-specific secrets in a CI/CD pipeline? Walk through the workflow from encrypting the variable to injecting it during a Jenkins or GitLab CI run. (Senior)
    Step 1: Create the encrypted secrets file per environment: ansible-vault create group_vars/production/vault.yml. Add db_password: real_password and webhook_url: https://hooks.slack.com/your/webhook. The file is AES256-encrypted as soon as you save it. Step 2: Commit the encrypted file to Git. Without the vault password it's unreadable, so it is safe in version control. Step 3: Store the vault password in CI secrets. Jenkins: create a 'Secret text' credential named ANSIBLE_VAULT_PASSWORD. GitLab CI: add a masked CI/CD variable named ANSIBLE_VAULT_PASSWORD. Step 4: In the CI pipeline, write the password to a temporary file before the Ansible run, setting a restrictive umask first so the file is never readable by other users, even briefly: umask 077; echo "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass. Then run: ansible-playbook deploy.yml --vault-password-file /tmp/vault_pass, and delete the file when the job finishes. Step 5: For multiple environments, use a separate vault password per environment (ANSIBLE_VAULT_PASSWORD_STAGING and ANSIBLE_VAULT_PASSWORD_PRODUCTION) and select the correct one based on the target environment in your pipeline logic. Never use --ask-vault-pass in CI: it expects interactive input and hangs silently. Never put the password inline on the ansible-playbook command line: it appears in process listings and CI logs.
  • Q: What are Ansible facts? How can you disable fact gathering to speed up playbook execution, and when do you actually need them? (Mid-level)
    Facts are system information collected by the setup module at the start of every play unless you turn gathering off: OS distribution, IP addresses, disk partitions, memory, CPU, uptime, Python interpreter path. Ansible gathers facts before the first task runs, which means one extra module run per host before any work starts. For 500 hosts, that's 500 additional setup runs, which can add 15-30 seconds of pure overhead before anything useful happens. Disable with gather_facts: no at the play level. You lose nothing if your playbook doesn't use fact variables. When you need facts: conditional task execution based on OS (when: ansible_os_family == 'Debian'), using the primary IP address in templates ({{ ansible_default_ipv4.address }}), checking available memory before a memory-intensive operation, or selecting the correct package manager. When you need only a subset of facts, use gather_subset: min, which collects basics such as OS family, hostname, package manager, and Python details but skips the hardware, network, and virtual machine subsets, and is noticeably faster than full fact gathering. Note that ansible_default_ipv4 comes from the network subset, so add network if your templates need it. For large fleets where you need facts, cache them: set gathering = smart and fact_caching = jsonfile in ansible.cfg, with fact_caching_connection pointing at a cache directory and fact_caching_timeout set to a sensible lifetime. Facts are then re-gathered only when the cache expires, not on every run.
  • Q: Explain how dynamic inventory works with a cloud provider like AWS. What are the advantages over a static inventory file, and what challenges does it introduce? (Senior)
    Dynamic inventory uses a plugin (amazon.aws.aws_ec2, gcp_compute, azure_rm) that queries the cloud provider API at runtime and builds Ansible's host and group list from the response. You configure the plugin with a YAML file (aws_ec2.yml) that specifies regions, filters (running instances only, specific environment tags), and keyed_groups (group instances by tag values like Role or Environment). Advantages: the inventory is never stale. New instances appear automatically. Terminated instances disappear. Autoscaling group members are always correct. You can target specific subsets with tag filters without maintaining any files. Challenges: API rate limits, because hitting EC2 DescribeInstances on every playbook run can get throttled, especially with multiple pipelines running simultaneously. Startup latency, because a static file is instant while dynamic inventory takes 2-5 seconds per API call. Credential management, because the control node needs IAM permissions or access keys configured. API availability, because if the cloud provider API is slow or returns an error, your inventory fails and no playbook runs. Mitigation: enable the inventory cache in the plugin config (cache: true, cache_timeout: 300, plus a cache_plugin such as jsonfile). The API is then queried at most once every 5 minutes and results are stored locally. New instances may take up to 5 minutes to appear, which is acceptable for most workflows and largely removes the rate limit and latency problems.
  • Q: Describe the difference between include_role and import_role. When would you choose one over the other, and how does each affect task execution order and variable scope? (Senior)
    import_role is static: Ansible processes the role at playbook parsing time, before any tasks execute. All tasks, variables, and handlers from the role are loaded into the play's task list immediately. include_role is dynamic: the role is processed at runtime when execution reaches that task. The practical consequences: a when condition on import_role is not evaluated once for the role as a whole. It is copied onto every task inside the role and evaluated task by task, and import_role cannot be used in a loop. A when condition on include_role is evaluated once, at runtime, to decide whether the role is included at all, and include_role can be looped to apply the same role with different variables multiple times. Variable scope: import_role exposes the role's defaults and vars to the rest of the play, so later tasks in the same play can reference them. include_role keeps them private to the role by default; they're not visible outside it unless you set public: true on the include_role task. Use import_role for roles that are always needed unconditionally and whose variables should be available to the whole play. Use include_role when the role is conditionally applied, used in a loop, or when you want to apply it with different parameters across multiple invocations. If you're not sure, import_role is the safer default: its behavior is more predictable because it's resolved at parse time.
  • Q: How would you structure an Ansible project to manage 500+ servers across dev, staging, and production environments? Describe your directory layout, variable hierarchy, and how you'd prevent production changes from accidentally running against dev. (Senior)
    Directory structure:

    ansible/
    ├── ansible.cfg                 (forks=50, pipelining=True, roles_path=roles/)
    ├── inventories/
    │   ├── dev/
    │   │   ├── aws_ec2.yml         (dynamic inventory plugin config)
    │   │   └── group_vars/
    │   │       ├── all.yml         (dev-wide defaults)
    │   │       └── webservers.yml  (dev webserver-specific vars)
    │   ├── staging/                (same structure as dev)
    │   └── production/
    │       ├── aws_ec2.yml
    │       └── group_vars/
    │           ├── all/
    │           │   ├── vars.yml
    │           │   └── vault.yml   (ansible-vault encrypted secrets)
    │           └── webservers.yml
    ├── roles/
    │   ├── nginx/
    │   ├── app/
    │   └── requirements.yml        (Galaxy roles with pinned versions)
    └── playbooks/
        └── site.yml

    Variable hierarchy: each environment's group_vars/all holds environment-wide defaults, group_vars/webservers.yml holds group-specific values, and vault.yml holds secrets. The vault file lives inside a group_vars/all/ directory because a file placed directly under group_vars/ applies only to the group that shares its name, so a bare group_vars/vault.yml would apply only to a group called vault. Treat host_vars as a code smell requiring a documented justification comment. Preventing production accidents: CI pipelines are branch-scoped. Commits to feature branches can only trigger runs against the dev inventory. Merges to main can only trigger staging. Only tags matching the v* pattern can trigger production, and production runs require a manual approval step. The inventory path is never hardcoded in playbooks. It's always passed as -i inventories/$ENV, where ENV is set by the CI pipeline based on branch or tag. For extra safety, add a task at the top of site.yml that fails when the run targets production without a --limit: fail when ansible_limit is not defined and lookup('env', 'CI_ENVIRONMENT') == 'production'. Use the env lookup rather than ansible_env here, because ansible_env describes the managed host's environment, not the CI runner's.
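
The production guard described in the last answer can be sketched as a short play at the top of site.yml. This is a minimal sketch, not a definitive implementation: CI_ENVIRONMENT is an assumed environment variable that your pipeline would have to export, and the failure message is illustrative.

```yaml
# First play in playbooks/site.yml: refuse an unscoped production run.
- name: Guard production runs
  hosts: all
  gather_facts: false
  tasks:
    - name: Require --limit when the pipeline targets production
      ansible.builtin.assert:
        that:
          # ansible_limit is only defined when --limit was passed
          - ansible_limit is defined
        fail_msg: "Production runs must pass --limit; refusing to touch every host."
      # The env lookup reads the control node's environment (the CI runner),
      # not the managed host's. CI_ENVIRONMENT is an assumed variable name.
      when: lookup('ansible.builtin.env', 'CI_ENVIRONMENT') == 'production'
      run_once: true
```

Because the check runs before any other play, a forgotten --limit stops the run before a single change is made.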

Frequently Asked Questions

What is the difference between an ad-hoc command and a playbook in Ansible?

An ad-hoc command is a single one-liner executed directly from the command line — ideal for quick checks or one-off operations like restarting a service or checking disk space across your fleet. A playbook is a reusable, version-controlled YAML file that defines a sequence of tasks with variables, handlers, and error handling. Think of ad-hoc commands as shouting instructions across the room, and playbooks as writing a detailed runbook that anyone can execute repeatedly with the same result. The rule of thumb: if you've run the same ad-hoc command twice, it belongs in a playbook.
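
To make the contrast concrete, here is a sketch of the same operation in both forms. The webservers group and the nginx service name are assumptions for illustration.

```yaml
# Ad-hoc form, typed once at the shell:
#   ansible webservers --become -m ansible.builtin.service -a "name=nginx state=restarted"
#
# Playbook form: the same action, now reviewable and repeatable.
- name: Restart nginx on the web tier
  hosts: webservers
  become: true
  gather_facts: false   # no fact variables are used below
  tasks:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```

The playbook version costs a few more lines, but it lives in Git and can be run by anyone on the team with the same result.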

How does Ansible handle secrets and sensitive data?

Ansible provides Ansible Vault, which encrypts variables or entire files using AES256. Encrypt individual strings with ansible-vault encrypt_string and paste them into your playbooks, or encrypt entire variable files with ansible-vault encrypt. At runtime, provide the vault password via --vault-password-file pointing to a file written from a CI secret. Vault-encrypted content is safe to commit to Git: without the password it's gibberish. For larger teams, consider pairing Ansible with HashiCorp Vault through the community.hashi_vault.hashi_vault lookup plugin, which fetches secrets at runtime from a centralized secrets manager rather than storing them in encrypted files.
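
As an illustration of the --vault-password-file approach, here is a minimal GitLab CI job sketch. It assumes a masked CI/CD variable named ANSIBLE_VAULT_PASSWORD, an inventory under inventories/production, and a playbook at playbooks/site.yml; the job name and the .vault_pass file name are hypothetical.

```yaml
# .gitlab-ci.yml (fragment)
deploy_production:
  stage: deploy
  script:
    # Restrictive umask first, so the password file is never readable by others
    - umask 077
    - printf '%s' "$ANSIBLE_VAULT_PASSWORD" > "$CI_PROJECT_DIR/.vault_pass"
    - ansible-playbook -i inventories/production playbooks/site.yml --vault-password-file "$CI_PROJECT_DIR/.vault_pass"
  after_script:
    # Runs even when the deploy fails, so the password file never lingers
    - rm -f "$CI_PROJECT_DIR/.vault_pass"
```

The password only ever travels through a file and an environment variable, never through a command-line argument that would show up in a process listing.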

What is dynamic inventory in Ansible, and when should you use it?

Dynamic inventory queries an external source — typically a cloud provider API like AWS EC2, GCP, or Azure — at runtime instead of reading a static file. Ansible builds the host list from live API data based on tags, regions, and instance states. Use dynamic inventory when your infrastructure is elastic: autoscaling groups, spot instances, or any environment where servers are created and destroyed regularly. Static inventory works for fixed infrastructure under 20 servers with stable hostnames. Beyond that, a static file becomes a liability — stale IPs, missing new instances, terminated hosts that are still listed. Enable the inventory cache (cache_timeout: 300) to avoid rate limiting the cloud API on every run.
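
A plugin configuration along these lines is a reasonable starting point for AWS. Treat it as a sketch under stated assumptions: the region, the tag names (Environment, Role), and the cache directory are placeholders to replace with your own.

```yaml
# inventories/production/aws_ec2.yml
# The file name must end in aws_ec2.yml or aws_ec2.yaml for the plugin to load it.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                        # assumption: your region(s)
filters:
  instance-state-name: running       # skip stopped and terminated instances
  "tag:Environment": production      # assumed tag name and value
keyed_groups:
  - key: tags.Role                   # e.g. Role=webserver becomes group role_webserver
    prefix: role
cache: true
cache_plugin: ansible.builtin.jsonfile
cache_connection: /tmp/aws_inventory_cache   # assumed cache directory
cache_timeout: 300                   # seconds; query the API at most every 5 minutes
```

Running ansible-inventory -i inventories/production/aws_ec2.yml --graph shows the groups the plugin built, which is a quick way to confirm the filters match what you expect.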

How do you handle errors and rollbacks in Ansible playbooks?

Ansible provides a block/rescue/always construct that works like try/catch/finally. Wrap risky operations in a block. If any task inside fails, the rescue section executes: roll back to a known-good state, send an alert, log the failure context. The always section runs regardless of success or failure and is the place for cleanup and status notifications. For rolling deployments, combine this with serial (how many hosts to update at once) and max_fail_percentage (abort the entire deploy if too many hosts fail). Set max_fail_percentage: 0 for database migrations, where any failure should stop everything. Without these safeguards, a failed migration on server 3 of 20 simply drops that host from the run while the other 19 carry on, leaving the fleet split between old and new versions with the application broken and no automatic recovery.
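
The pieces fit together roughly as follows. This is a sketch, assuming hypothetical migrate and rollback scripts under /opt/app/bin; substitute whatever your application actually ships.

```yaml
- name: Rolling deploy with rollback
  hosts: webservers
  serial: 5                  # update five hosts per batch
  max_fail_percentage: 0     # any failed host aborts the remaining batches
  tasks:
    - name: Deploy and migrate, rolling back on failure
      block:
        - name: Run the migration (hypothetical script)
          ansible.builtin.command: /opt/app/bin/migrate --to latest
      rescue:
        - name: Roll back to the previous release (hypothetical script)
          ansible.builtin.command: /opt/app/bin/rollback
        - name: Re-fail the host so the play still counts it as failed
          ansible.builtin.fail:
            msg: "Migration failed on {{ inventory_hostname }}; rolled back."
      always:
        - name: Record the outcome
          ansible.builtin.debug:
            msg: "Deploy attempt finished on {{ inventory_hostname }}"
```

The explicit fail inside rescue matters: a rescue section that completes successfully clears the host's failed status, so without it max_fail_percentage would never trigger.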

What is the difference between Ansible and Terraform? Do I need both?

They solve different problems at different points in a server's life. Terraform provisions infrastructure — it creates EC2 instances, VPCs, load balancers, DNS records, and IAM roles. Ansible configures that infrastructure — it installs software, deploys application code, manages services, and corrects configuration drift. Terraform's user_data and cloud-init can run a script at first boot, but they can't re-run idempotently three months later when you need to update a config file. Ansible can. Most production teams use Terraform to build the infrastructure and Ansible to configure and maintain it. They're complementary tools in the same pipeline, not alternatives.

How do you test Ansible playbooks before running them in production?

Use --check mode for a dry run — Ansible shows what would change without applying anything. Combine it with --diff to see exact file content differences. For automated testing, use Molecule: it spins up Docker containers or VMs, runs your role, verifies the result with Testinfra assertions, and tears everything down. Run Molecule in CI to catch regressions before they reach any environment. Also run ansible-lint on all playbooks and roles to catch deprecated modules, style violations, and common structural mistakes. The combination of --check, --diff, Molecule, and ansible-lint catches the vast majority of problems before a human needs to review them.
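
For the Molecule part, a scenario file might look like the sketch below. It assumes the Docker driver plugin and Testinfra are installed alongside Molecule, and the container image is an assumption: any image with Python available for Ansible should work.

```yaml
# molecule/default/molecule.yml
dependency:
  name: galaxy               # installs role dependencies from requirements.yml
driver:
  name: docker
platforms:
  - name: instance
    image: geerlingguy/docker-ubuntu2204-ansible:latest   # assumed image
    pre_build_image: true    # use the image as-is instead of building one
provisioner:
  name: ansible              # runs the scenario's converge.yml against the container
verifier:
  name: testinfra            # runs the Python assertions in the scenario's tests
```

With this in place, molecule test runs the whole create, converge, verify, destroy cycle, which is the command to wire into CI.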

What is Ansible Galaxy, and should I use community roles?

Ansible Galaxy is a repository of community-contributed roles for common infrastructure software — Nginx, Docker, PostgreSQL, certbot, Redis, and hundreds more. Install with ansible-galaxy install -r requirements.yml. Community roles save hours for commodity software and are often more battle-tested than what you'd write from scratch. For application-specific automation — deploying your Java app, configuring your monitoring stack — write custom roles. The mandatory practice: pin every Galaxy role to a specific version in requirements.yml. A community role is a dependency you don't control. A minor version update can change default behavior in ways that affect production. Pin it, test upgrades in staging, read the changelog before bumping the version.
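
Pinning looks like this in practice. The role names are real Galaxy roles, but the version numbers are placeholders, not recommendations: pin to whichever release you have actually tested.

```yaml
# roles/requirements.yml
roles:
  - name: geerlingguy.nginx
    version: "3.2.0"         # placeholder version: pin to the release you tested
  - name: geerlingguy.postgresql
    version: "3.5.0"         # placeholder version
```

An entry without a version key installs whatever is newest at install time, which is exactly the uncontrolled drift that pinning is meant to prevent.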

How does Ansible perform on very large fleets (1000+ servers)?

Ansible's parallelism scales with the forks setting in ansible.cfg (default: 5, which is too low for large fleets). For 1000 servers, start at forks=50 and monitor control node CPU, memory, and open file descriptor counts. Enable pipelining=True to reduce SSH round-trips per module from 3 to 1 — this alone can cut playbook runtime by 30-40%. Disable fact gathering for playbooks that don't need system facts, or use gather_subset=min to collect only essential information. For operational visibility at scale — job scheduling, RBAC, audit logging, workflow orchestration, and a web UI — deploy AWX (the open-source version) or Ansible Automation Platform. Plain Ansible from the command line works at 1000+ nodes, but AWX gives you the operational control that large teams need to manage concurrent jobs safely.
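
The fact-gathering advice can be sketched as two plays. The webservers group and the nginx service are assumptions; the forks and pipelining settings live in ansible.cfg and are not shown here.

```yaml
# Play 1: no fact variables are used, so skip gathering entirely.
- name: Restart nginx without gathering facts
  hosts: webservers
  become: true
  gather_facts: false
  tasks:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted

# Play 2: only the OS family is needed, so the minimal subset is enough.
- name: Branch on OS family using the minimal fact subset
  hosts: webservers
  gather_facts: true
  gather_subset:
    - min                    # skips the hardware, network, and virtual subsets
  tasks:
    - name: Show the detected OS family
      ansible.builtin.debug:
        msg: "{{ ansible_facts['os_family'] }}"
```

On a large fleet the first play saves one setup run per host, and the second trades full fact collection for a much smaller one.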

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
