Ansible is agentless configuration management — it connects via SSH, pushes small modules, and cleans up after itself
Three core components: Inventory (what servers), Modules (how to act), Playbooks (when to act)
Idempotency means running the same playbook 100 times produces the same result as running it once
Performance trade-off: agentless means zero maintenance on servers but higher control node load (forks control parallelism)
Production trap: variable precedence has 22 levels — your dev environment works but prod breaks because host_vars silently overrides group_vars with no warning
Biggest mistake: a host_vars file left over from a debugging session six months ago quietly overrides your group-level config in production. It parses fine, deploys fine, and serves the wrong value
Definition
What is Ansible?
Ansible is an open-source IT automation engine that eliminates manual toil by letting you define infrastructure as code — no agents required, just SSH and Python on the target. It solves the problem of configuring thousands of servers consistently without writing shell scripts that rot.
You describe the desired state in YAML (playbooks), and Ansible figures out the diff and applies only what's needed. Its agentless architecture means you don't install anything on managed nodes, which is why it dominates in heterogeneous environments where you can't control the OS.
The trade-off: there is no daemon watching for drift, so nothing is corrected between runs, and pushing over SSH can be slower at scale than agent-based tools like Puppet or Salt. Large fleets usually compensate by batching runs and raising forks.
At its core, Ansible has three concepts: inventory (what you manage), playbooks (how you manage it), and modules (the actual work). Inventory can be static files or dynamic sources like AWS EC2 or vSphere. Playbooks are ordered lists of tasks, each calling a module — think of modules as idempotent functions that ensure a package is installed or a service is running.
The sharpest edge is variable precedence: a 22-level ladder in which higher rungs silently override lower ones, from role defaults at the bottom to command-line extra vars at the top. Most teams get burned when a group_vars entry in inventory overrides a role default without warning. You'll learn to pin variables at the right rung or use assert to catch surprises.
For production, you layer roles (reusable task bundles), Ansible Vault for secrets, and rolling update patterns with serial and max_fail_percentage. Error handling uses ignore_errors, failed_when, and block/rescue — but the real pattern is pre-flight validation with assert before touching state.
Ad-hoc commands (ansible all -m ping) let you run one-off operations across fleets without writing a playbook, useful for quick health checks or reboots. When not to use Ansible: for real-time configuration drift detection (use Chef or a monitoring stack), or for complex orchestration with cross-host dependencies (Terraform or a workflow engine handles that better).
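The block/rescue pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not a recommended deployment: the service name app, the paths under /etc/app, and the template templates/app.conf.j2 are all hypothetical, and it assumes /etc/app/app.conf already exists on the target.

```yaml
---
# Sketch only: "app", /etc/app/*, and templates/app.conf.j2 are hypothetical names.
- name: Deploy a config change with a rollback path
  hosts: webservers
  become: true
  tasks:
    - name: Apply the new config, roll back if anything in the block fails
      block:
        - name: Back up the current config on the managed host
          ansible.builtin.copy:
            src: /etc/app/app.conf
            dest: /etc/app/app.conf.bak
            remote_src: true
            mode: '0644'

        - name: Render the new config from a template
          ansible.builtin.template:
            src: templates/app.conf.j2
            dest: /etc/app/app.conf
            mode: '0644'

        - name: Restart the service with the new config
          ansible.builtin.service:
            name: app
            state: restarted

      rescue:
        - name: Restore the previous config
          ansible.builtin.copy:
            src: /etc/app/app.conf.bak
            dest: /etc/app/app.conf
            remote_src: true
            mode: '0644'

        - name: Restart the service with the restored config
          ansible.builtin.service:
            name: app
            state: restarted

        - name: Mark the host as failed so the rollback is visible in the run summary
          ansible.builtin.fail:
            msg: "Config deploy failed on {{ inventory_hostname }}; previous config restored"
```

Without the final fail task, a successful rescue leaves the host reported as recovered rather than failed, which can hide the rollback from whoever reads the play recap.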
Plain-English First
Managing 100 servers by logging into each one and typing commands is like calling 100 employees individually to give the same instruction. Ansible is like sending one company-wide email that everyone acts on simultaneously. You describe the desired state of your servers in plain English-like YAML, and Ansible connects over SSH and makes it happen — on all servers at once, with no software installed on them.
Think of it this way: if your server is a hotel room, Ansible is the housekeeping checklist pinned to the door. It doesn't live in the room. It walks in, checks what needs fixing, fixes only what's broken, and walks out. The room doesn't even know Ansible was there — it just ends up clean.
And unlike calling each employee individually, if you send the same company-wide email again tomorrow, nothing bad happens. Everyone already followed the instructions. They'll read the email, confirm nothing needs doing, and get back to work. That's idempotency — the property that makes Ansible safe to run on a schedule, in a CI pipeline, or in a panic at 2am.
Before configuration management tools, sysadmins maintained hundreds of servers by hand — logging in, running commands, hoping nothing went wrong. I lived this. In 2015, I managed a fleet of 80 web servers at a mid-size SaaS company, and every deploy night was a three-hour marathon of SSH sessions, copy-pasted commands, and prayer. One night, someone restarted the wrong database server. We lost four hours of customer data. That was the last straw.
Ansible was created by Michael DeHaan in 2012 and acquired by Red Hat in 2015 (now part of IBM). Today it runs infrastructure at NASA JPL, Capital One, and thousands of companies from Series A startups to Fortune 50 enterprises. Not because it's the most powerful automation tool, but because it's the simplest one that actually gets used.
What makes Ansible different from competitors like Chef and Puppet is that it is agentless. There is no daemon running on your managed servers, no SSL certificates to exchange, and no extra ports to open beyond standard SSH (or WinRM for Windows). Ansible runs from your control node, pushes small programs called Ansible Modules to the remote nodes, executes them, and then cleans up after itself.
One important nuance that comes up in almost every team adopting Ansible: Ansible and Terraform are not competitors — they solve different problems at different points in a server's life. Terraform creates infrastructure: it provisions the EC2 instance, creates the VPC, registers the DNS record. Ansible configures that infrastructure: it installs software, deploys application code, manages services, and corrects configuration drift on day 2, day 30, and day 300. Terraform's user_data and cloud-init can run a script at first boot, but they can't re-run idempotently when you need to update a config three months later. Ansible can. That's the real distinction — Terraform builds the house once, Ansible keeps it clean indefinitely.
In this guide, we'll break down Ansible's core architecture — inventories, playbooks, modules, and roles — cover ad-hoc commands for quick fleet operations, and build production-grade automation with real error handling, secret management, and reusable patterns. Every section includes the production detail that most tutorials skip.
How Ansible Variable Precedence Really Works
Ansible variable precedence is a 22-level hierarchy that determines which value wins when the same variable is defined in multiple places. At its core, it's a deterministic override chain that runs from lowest priority (role defaults) to highest (command-line -e extra vars). The mechanic is simple: the definition that sits highest in the chain wins. The chain itself is long and easy to misread.
In practice, this means a variable set in group_vars/all (levels 4-5) will be silently overridden by a host_vars entry (levels 9-10), which in turn can be overridden by a --extra-vars flag (level 22). The hierarchy is fixed and cannot be modified. Most teams only use 5–7 levels, but the remaining 15 create invisible traps when variables collide across inventories, roles, playbooks, and includes.
You need this hierarchy to separate concerns: default values in roles, environment-specific overrides in inventory, and emergency overrides via CLI. Without understanding the full chain, you'll debug 'why is my variable wrong?' for hours — only to find a forgotten vars/main.yml in a nested role silently winning over your carefully set inventory variable.
Silent Override Trap
A variable set in group_vars/all is not the final value. It sits near the bottom of the ladder, at levels 4-5 of 22. Host vars, play vars, role vars, includes, and CLI extra vars can all override it without warning.
Production Insight
A team deployed a config change via group_vars/production but a nested role's vars/main.yml (level 15) silently overrode the database hostname, causing all production writes to hit a staging database.
The symptom was intermittent 500 errors and corrupted data — no Ansible error, no warning.
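Rule: ansible-inventory --host <hostname> shows only inventory-sourced variables, so it cannot reveal an override from a role's vars/main.yml. Print the resolved value with a debug task, or assert on it, inside the play before trusting it. Never assume a variable's source is the one you set.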
Key Takeaway
Variable precedence is a fixed 22-level chain — memorize the top 5 levels that actually bite you.
The last definition wins, but 'last' is defined by hierarchy, not order of execution.
Validate inventory variables with ansible-inventory --host <hostname>, and confirm role-level and play-level values with a debug or assert task, before trusting a playbook's behavior.
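One way to catch a silent override before it touches state is a pre-flight play that prints and asserts on the resolved value. The sketch below assumes a hypothetical variable db_host and a hypothetical naming convention in which production database hosts start with prod-db-; adapt both to your own inventory.

```yaml
---
# Sketch only: db_host and the "prod-db-" prefix are illustrative assumptions.
- name: Pre-flight variable validation
  hosts: production
  gather_facts: false
  tasks:
    - name: Show the value that actually won the precedence ladder
      ansible.builtin.debug:
        msg: "db_host resolved to {{ db_host | default('UNDEFINED') }} on {{ inventory_hostname }}"

    - name: Stop before any change if the resolved value looks wrong
      ansible.builtin.assert:
        that:
          - db_host is defined
          - db_host is match('^prod-db-')
        fail_msg: "db_host does not look like a production database host"
        success_msg: "db_host looks like a production database host"
```

Because the assert runs inside the play, it sees the value after role vars, play vars, and extra vars have been applied, which ansible-inventory alone cannot show.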
[Figure: Ansible Variable Precedence, the 22-Level Override]
Inventory, Playbooks, and Modules — The Three Core Concepts
Ansible's architecture relies on three primary building blocks. Get these right and everything else follows. Get any one of them wrong and you'll spend your time debugging instead of automating.
The Inventory: A file (INI or YAML) that lists the servers you want to manage, organized into groups like [webservers] or [databases]. The inventory is your single source of truth about what exists. In production, you'll almost always use dynamic inventory — pulling host lists directly from AWS, GCP, or Azure APIs so your inventory stays accurate as servers are created and destroyed by autoscaling. Static inventories work for learning and small fixed fleets under 20 servers, but once you have autoscaling groups or spot instances, a static file becomes a liability. Stale IPs, terminated instances, missing new nodes — a static inventory in an elastic environment is a disaster on a timer.
The Playbook: Your automation blueprint, written in YAML. A playbook maps groups of hosts to sequences of tasks and describes desired state rather than step-by-step instructions. This distinction matters operationally: if Nginx is already installed and running at the right version, Ansible confirms it and moves on. It doesn't reinstall. It doesn't restart unnecessarily. It checks and reports 'ok'.
Modules: The tools in the toolbox. Instead of writing bash scripts, you use modules like apt, yum, service, copy, or template. These modules are idempotent — they check the current state of the server and only make changes when the server doesn't match your desired state. The shell and command modules are the notable exceptions. They run unconditionally every time, which is exactly why experienced Ansible engineers avoid them unless there is genuinely no dedicated module alternative.
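As a rough illustration of the inventory concept, a small static inventory in YAML form might look like the sketch below. The hostnames reuse the thecodeforge.io examples from this article; the group layout and the nginx_port value are assumptions for illustration.

```yaml
---
# inventory/hosts.yml (hypothetical file name)
all:
  children:
    webservers:
      hosts:
        web-01.thecodeforge.io:
        web-02.thecodeforge.io:
      vars:
        nginx_port: 80          # group-level variable, applies to both web hosts
    databases:
      hosts:
        db-01.thecodeforge.io:
    production:                 # a parent group built from other groups
      children:
        webservers:
        databases:
```

With this file, ansible webservers -i inventory/hosts.yml -m ping targets only the two web hosts, while the production pattern reaches all three.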
For dynamic inventory specifically — here's what it looks like in practice. You create a plugin configuration file (aws_ec2.yml) that Ansible reads instead of a static hosts file. It queries the AWS EC2 API, groups instances by their tags, and returns a live host list. The inventory is never stale because it's rebuilt from the API on every run.
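A plugin configuration along these lines is one plausible shape for that file. It is a sketch under assumptions: the region, the Environment and Role tag names, and the cache path are hypothetical, and it presumes the amazon.aws collection and AWS credentials are already available on the control node.

```yaml
---
# inventory/aws_ec2.yml
# The file name must end in aws_ec2.yml or aws_ec2.yaml for the plugin to accept it.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                       # assumption: adjust to your regions
filters:
  instance-state-name: running      # skip stopped and terminated instances
  "tag:Environment": production     # assumption: instances carry an Environment tag
keyed_groups:
  - key: tags.Role                  # assumption: a Role tag such as "webservers"
    prefix: role                    # produces groups like role_webservers
hostnames:
  - private-ip-address              # connect over the private IP
cache: true
cache_plugin: ansible.builtin.jsonfile
cache_timeout: 300                  # seconds before the host list is refreshed from the API
cache_connection: /tmp/ansible_aws_inventory_cache
```

Running ansible-inventory -i inventory/aws_ec2.yml --graph is a quick way to see which groups and hosts the plugin actually returns before pointing a playbook at it.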
Always run ansible all -m ping before running playbooks. If ping fails, fix SSH connectivity before debugging anything else. 90% of Ansible problems are SSH or permissions issues, not playbook logic. I've watched engineers spend two hours debugging a 'module error' that was really a missing SSH key or a security group rule blocking port 22. The ping module is your pre-flight check — make it a habit.
Production Insight
The biggest inventory mistake is treating it as write-once. Hostnames change, IPs rotate, instances get replaced by autoscaling.
Dynamic inventory from cloud APIs solves stale host lists but introduces API rate limits and 2-5 seconds of startup latency per run. Mitigate both with the inventory plugin's cache and cache_timeout settings.
Rule: if you cannot run ansible all -m ping successfully every time, your inventory is broken. Fix that before writing any playbook logic.
Key Takeaway
Inventory tells Ansible what servers exist. Modules tell it what to do. Playbooks tell it when and in what order.
You cannot have reliable automation without all three working correctly — and the inventory is the foundation everything else depends on.
For elastic cloud infrastructure, dynamic inventory is not optional. A stale static inventory is a silent failure waiting to happen.
Static vs Dynamic Inventory — When to Switch
If: Fixed infrastructure, under 20 servers, no autoscaling, hostnames don't change → Use: Static INI or YAML inventory is fine. It is simple, fast, and has no API dependencies.
If: Cloud infrastructure with autoscaling groups, spot instances, or servers that get replaced regularly → Use: Dynamic inventory is mandatory. Use the aws_ec2, gcp_compute, or azure_rm plugin. Static inventory becomes stale within days.
If: Mixed environment, with some fixed servers and some cloud instances → Use: Dynamic inventory for the cloud portion and a static file for fixed servers. Ansible can merge multiple inventory sources from a directory.
If: Dynamic inventory is causing API rate limit errors or slow startup → Use: Enable the inventory cache (cache: true, cache_timeout: 300). This rebuilds the host list from the API every 5 minutes instead of every run.
Your First Production Playbook — and the 22-Level Precedence Ladder
A playbook is a collection of plays. Each play targets a specific group from your inventory and executes a sequence of tasks in order, top to bottom. If a task fails on a specific host, Ansible stops executing for that host but continues for the others. To handle configuration changes — like restarting a web server only when a config file actually changes — Ansible uses Handlers: special tasks that only run when notified by another task that reported 'changed'.
The playbook below is a production pattern we actually use. Notice: update the package cache, install the binary, deploy a templated config, ensure the service is running. Every task is idempotent. Every task uses a dedicated module. No shell commands.
But here's the detail that is easy to skim past in the Ansible documentation and that causes more production incidents than anything else: variable precedence has 22 levels, and Ansible enforces them silently. The most important levels to internalize, from highest to lowest priority:
Extra vars (-e on the command line): highest, overrides everything
set_fact and registered vars
include_vars
Task vars (set directly on a task)
Block vars
Role vars (vars/main.yml)
Playbook vars (the vars block of a play)
host_vars/hostname.yml: this is where the production incident in this article came from
group_vars/groupname.yml
group_vars/all.yml
Role defaults (defaults/main.yml): lowest, easily overridden by anything above
The rule that causes the most surprises: host_vars always overrides group_vars. Always. Without any warning. Without any log entry. If prod-web-01.yml exists in your host_vars directory, it silently wins over group_vars/all.yml and group_vars/webservers.yml. It does not win over your playbook's vars block, role vars, or -e, all of which sit higher on the ladder.
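The diagnostic to run before every production deploy where inventory variables are involved: ansible-inventory -i inventory.ini --host prod-web-01. This prints the merged host_vars and group_vars that Ansible will hand to the play for that host. Not what you think you set. The ground truth for inventory. (The --vars flag only affects --graph output, so it is not needed with --host.) The output does not include play vars, role vars, or extra vars, so confirm those with a debug task inside the play.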
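The host_vars over group_vars rule is easier to see with two small files side by side. The paths and values below are hypothetical, and the resolved values in the closing comments hold only if nothing higher on the ladder (play vars, role vars, -e) also defines nginx_port.

```yaml
# inventory/group_vars/webservers.yml (hypothetical path and values)
---
nginx_port: 8080
upstream_timeout: 30

# inventory/host_vars/prod-web-01.yml (a leftover from an old debugging session)
---
nginx_port: 9090        # overrides the group value, on prod-web-01 only

# Resolved on prod-web-01:            nginx_port=9090, upstream_timeout=30
# Resolved on every other webserver:  nginx_port=8080, upstream_timeout=30
```

Nothing in the Ansible run output flags that prod-web-01 differs from its siblings, which is why a stale host_vars file can survive unnoticed for months.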
io/thecodeforge/ansible/site_setup.yml (YAML)
---
# io.thecodeforge: Standard Nginx Deployment Playbook
# Variable precedence reminder (highest to lowest, the levels that matter most):
# 1. Extra vars (-e)            <- overrides EVERYTHING, use with extreme care in CI
# 2. set_fact / registered      <- runtime-computed values
# 3. Play vars block            <- what you see below; outranks all inventory variables
# 4. host_vars/hostname.yml     <- PER-HOST OVERRIDE, silent, highest inventory precedence
# 5. group_vars/groupname.yml   <- group-specific values
# 6. group_vars/all.yml         <- global defaults
# 7. Role defaults/main.yml     <- weakest, easily overridden
#
# Debug tip: ansible-inventory -i inventory.ini --host prod-web-01
# shows the merged inventory variables (host_vars + group_vars) before the playbook runs.
- name: Deploy and Configure Nginx
  hosts: webservers
  become: true
  vars:
    nginx_port: 80
    server_name: "thecodeforge.io"
    # NOTE: Play vars outrank inventory variables, so these values win over
    # host_vars and group_vars entries with the same names.
    # Define a variable here only when it must be identical on every target host;
    # keep per-host and per-environment values in inventory instead.

  tasks:
    - name: Verify expected variable state before making any changes
      ansible.builtin.debug:
        msg: "nginx_port resolved to {{ nginx_port }} on {{ inventory_hostname }}"
      # Add this debug task during onboarding or when variables behave unexpectedly.
      # Remove or tag it once the team trusts the variable sources.

    - name: Ensure apt cache is updated
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 3600
      # cache_valid_time: 3600 means: skip the update if cache is less than 1 hour old.
      # Trade-off: saves 5-10 seconds per run but means security updates won't appear
      # for up to an hour. Acceptable for app servers; lower this for security-sensitive roles.

    - name: Install Nginx production package
      ansible.builtin.apt:
        name: nginx
        state: present
      # state: present = install if missing. state: latest = upgrade if a newer version exists.
      # Use present in production unless you explicitly want automatic upgrades.

    - name: Deploy custom Nginx configuration
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/default
        owner: root
        group: root
        mode: '0644'
      notify: Reload Nginx service
      # notify only fires when this task reports 'changed'.
      # If the rendered template is byte-for-byte identical to the existing file,
      # no notification is sent and Nginx is not reloaded. This is idempotency in action.

    - name: Ensure Nginx service is enabled and running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

  handlers:
    # The handler name must match the notify string above exactly.
    - name: Reload Nginx service
      ansible.builtin.service:
        name: nginx
        state: reloaded
      # reloaded sends SIGHUP: Nginx reloads config without dropping connections.
      # restarted kills and restarts: drops all active connections.
      # Always use reloaded for config changes. Use restarted only for binary upgrades.
Idempotency Is the Entire Point
Run this playbook 10 times — the result is identical to running it once. If Nginx is already installed at the right version with the right config, every task shows 'ok' and nothing changes. This is what makes Ansible safe to run on a 30-minute cron job in production. I've had this pattern running on a cron every 30 minutes for two years. It silently corrects configuration drift — when someone SSH'd in and manually changed something, the next cron run fixes it. The only time it shows 'changed' is when something genuinely changed.
Production Insight
The debug task at the top showing the resolved nginx_port value costs next to nothing per run and has saved hours of variable precedence debugging. Add it to every playbook that uses environment-specific variables.
The template task will report 'changed' every run if your Jinja2 template includes dynamic content like {{ ansible_date_time.iso8601 }} — remove timestamps from templates unless they're genuinely needed.
Rule: a handler that uses state: restarted drops active connections. Use state: reloaded for config changes. The distinction matters at 3am when you're applying a TLS certificate update to a live API.
Key Takeaway
Idempotency is not a feature — it is the entire reason Ansible is safe to run in production automation.
If a task shows 'changed' on every run, you have broken idempotency. Fix it.
The 22-level variable precedence ladder is enforced silently — learn the top 8 levels and run ansible-inventory --host before every production deploy.
Shell vs Dedicated Module — The Decision That Determines Idempotency
If: Installing a package (apt, yum, dnf, pip) → Use: ansible.builtin.apt / yum / pip. They are idempotent and check installed state before acting.
If: Managing a service (start, stop, restart, enable on boot) → Use: ansible.builtin.service or ansible.builtin.systemd. They are idempotent and check current service state.
If: Copying a file or rendering a template → Use: ansible.builtin.copy or ansible.builtin.template. They compare checksums and only write if content differs.
If: Running a command that has no dedicated Ansible module → Use: ansible.builtin.command with creates or removes to make it conditional. Add changed_when with a specific condition. Document why no module exists.
If: Running a shell pipeline with pipes, redirects, or shell built-ins → Use: ansible.builtin.shell only as a last resort. Add changed_when: false if the output is not meaningful, or parse stdout to determine whether a real change occurred.
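The last two rows of the table can be sketched as tasks. The init-db binary, its config path, and the marker file are hypothetical, and the first task assumes the command itself writes the marker file on success.

```yaml
---
# Task-list fragment (for example, a tasks file pulled in with include_tasks).
# /opt/app/bin/init-db and the marker file are illustrative assumptions.
- name: Initialise the application database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db --config /etc/app/app.conf
    creates: /var/lib/app/.db_initialised
  # creates: the command is skipped when this path already exists,
  # so repeat runs report 'ok' instead of 'changed'.

- name: Read the running kernel version
  ansible.builtin.command:
    cmd: uname -r
  register: kernel_version
  changed_when: false
  # Read-only command: never report it as a change.

- name: Count failed systemd units
  ansible.builtin.shell:
    cmd: set -o pipefail && systemctl list-units --state=failed --no-legend | wc -l
    executable: /bin/bash
  register: failed_units
  changed_when: false
  failed_when: failed_units.stdout | int > 0
  # shell is needed here only because of the pipe; pipefail keeps a
  # failing systemctl from being masked by wc.
```

The common thread is that the task author, not the module, supplies the idempotency signal through creates, changed_when, and failed_when.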
Ad-hoc Commands — Quick Fleet Operations Without a Playbook
Not everything needs a playbook. Sometimes you need to run a single command across your fleet right now — check disk space before a deploy, restart a hung service on 50 app servers, verify a kernel patch applied across the fleet, kill a runaway process that's consuming memory. That's what ad-hoc commands are for.
Ad-hoc commands are Ansible's underrated superpower for day-two operations. They're the reason senior SREs reach for Ansible instead of writing SSH for-loops. An SSH for-loop runs the command on every server sequentially and gives you raw unstructured output. Ansible ad-hoc runs in parallel across as many hosts as your forks setting allows, returns structured output per host, handles failures gracefully, and respects your inventory groups so you don't accidentally run something against the wrong environment.
Syntax: ansible <host-pattern> -i <inventory> -m <module> -a '<arguments>'
The flags you'll use daily
-b or --become: run as root (sudo)
-u or --user: specify the SSH username
--limit 'web-01': restrict execution to a subset of the matched hosts — critical for safe fleet operations
--check: dry run — show what would change without actually changing anything
-f 50 or --forks 50: override the default parallelism for this single command
-v, -vv, -vvv, -vvvv: increasing verbosity. -v shows task results. -vvv shows SSH connection details. -vvvv adds connection-plugin debugging on top of that. Use it when debugging SSH hangs.
In production I use ad-hoc commands daily. Checking disk space on 200 servers before a deploy: one-liner, 10 seconds, structured output. Restarting a hung worker process across 50 app servers: one-liner. Verifying that a security patch actually applied to every host in the fleet: one-liner. These replace what used to be 20-minute SSH marathons with copy-pasted commands and manually collated output.
io/thecodeforge/ansible/adhoc_examples.sh (BASH)
#!/usr/bin/env bash
# io.thecodeforge: Ad-hoc Command Reference
# These replace SSH for-loops. Run these, not bash loops.

# ── Connectivity and fact-checking ───────────────────────────────────────────

# Verify SSH connectivity to all production hosts before a major deploy
ansible production -i inventory.ini -m ping

# Check disk space across all web servers before a deploy
# -o: one-line output mode, easier to scan for problems
ansible webservers -i inventory.ini -m command -a "df -h /" -o

# Gather full system facts from a single host (OS, IPs, memory, CPU)
# Useful for debugging environment differences between hosts
ansible db-01.thecodeforge.io -i inventory.ini -m setup

# Gather only a subset of facts to speed up the call
# gather_subset=min returns the minimal fact set (OS, hostname, Python, user)
# and skips the slower hardware and network collectors
ansible webservers -i inventory.ini -m setup -a 'gather_subset=min' -o

# ── Safe fleet operations with --limit ────────────────────────────────────────

# The --limit flag restricts execution to a subset of the target group.
# ALWAYS use --limit when you want to test on one host before hitting the fleet.
# This is the most important safety habit for ad-hoc fleet operations.

# Restart Nginx on ONE host first to verify the command is correct
ansible webservers -i inventory.ini -m service \
  -a "name=nginx state=restarted" --become \
  --limit web-01.thecodeforge.io

# Once verified, restart Nginx across all web servers
ansible webservers -i inventory.ini -m service \
  -a "name=nginx state=restarted" --become

# ── Security and maintenance ──────────────────────────────────────────────────

# Apply a security patch across the entire fleet in parallel
# -f 20: process 20 hosts at a time (tune based on control node resources)
ansible production -i inventory.ini \
  -m apt -a "name=openssl state=latest update_cache=yes" \
  --become -f 20

# Verify the patch was applied: print the installed version on every host.
# The command module does not run a shell, so pipes like "| grep" do not work;
# dpkg-query prints the package name and version without needing one.
ansible production -i inventory.ini \
  -m command -a "dpkg-query -W openssl" -o

# ── Dry run before any destructive operation ─────────────────────────────────

# --check: show what WOULD happen without actually doing it
# Use this before any ad-hoc command that modifies state
ansible webservers -i inventory.ini \
  -m apt -a "name=nginx state=absent" \
  --become --check

# ── Verbosity for SSH debugging ───────────────────────────────────────────────

# -v:    show task result summary
# -vv:   show connection parameters
# -vvv:  show SSH connection details (use this when a host is unreachable)
# -vvvv: add connection-plugin debugging (use this when SSH itself is misbehaving)
ansible web-01.thecodeforge.io -i inventory.ini -m ping -vvv
Ad-hoc Is Not Idempotent by Default — and --limit Is Your Safety Net
The command module runs every time regardless of state. For one-off operations like checking disk space or restarting a service, this is fine. But always use --limit when testing a new ad-hoc command — run it against one host, verify the output is what you expected, then remove --limit to hit the fleet. I've seen an ad-hoc apt remove command accidentally run against the entire production fleet because someone forgot to add --limit during testing. The --limit flag is not optional for fleet operations — it's the difference between 'I tested this on one server' and 'I just removed a package from 200 servers simultaneously.'
Production Insight
Parallel execution is great until it overwhelms your control node. Default forks=5 is too low for 100 servers — raise it to 50 for most fleet operations.
Each fork consumes memory, a file handle, and an SSH socket. I've seen Ansible crash with OOM errors at forks=200 on a t2.micro control node running a large fleet operation.
Rule: monitor control node CPU and memory when you increase forks. Start at 50, increase slowly, watch for SSH connection failures in the -vvv output which indicate the control node is hitting file descriptor limits.
Key Takeaway
Ad-hoc commands are for day-two fleet operations — not for automation you'll run twice.
Always use --limit to test against one host before running against the fleet. This is not optional.
If you're about to paste an ad-hoc command into a wiki page or a runbook, turn it into a playbook instead.
Ad-hoc Command vs Playbook — When to Write It Down
If: One-off check or emergency operation you'll never run again → Use: Ad-hoc is appropriate. It is fast, there is no file to maintain, and results are visible immediately.
If: Operation you've run twice already or pasted into a wiki page → Use: Write a playbook. You've already proven this is repeatable work that deserves automation.
If: Fleet-wide state change during an incident (restart services, apply patch, kill process) → Use: Ad-hoc with --limit on one host first, then full fleet. Document the command in your incident postmortem.
If: Routine maintenance you run weekly or monthly → Use: Write a playbook, schedule it in AWX or cron. Ad-hoc commands don't have audit trails or scheduled execution.
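As an illustration of "write it down", the ad-hoc OpenSSL patch from the reference script could become a small playbook along these lines. The batch size and failure threshold are assumptions chosen for the example, not recommendations.

```yaml
---
# Sketch only: serial and max_fail_percentage values are illustrative.
- name: Patch OpenSSL across production in batches
  hosts: production
  become: true
  serial: 20                   # work through the fleet 20 hosts at a time
  max_fail_percentage: 10      # abort the run if more than 10% of a batch fails
  tasks:
    - name: Upgrade openssl to the latest packaged version
      ansible.builtin.apt:
        name: openssl
        state: latest
        update_cache: true

    - name: Read the installed version
      ansible.builtin.command:
        cmd: dpkg-query -W openssl
      register: openssl_version
      changed_when: false

    - name: Show the installed version per host
      ansible.builtin.debug:
        var: openssl_version.stdout
```

Unlike the one-liner, this version lives in version control, stops itself when a batch goes badly, and can be scheduled from AWX or cron.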
Roles — Reusable Automation at Scale
Once your playbooks grow beyond 50 lines, you'll start copying tasks between files. That's when you need roles. A role is a self-contained unit of automation — tasks, handlers, templates, default variables, and static files — packaged in a standardized directory structure that Ansible knows how to load automatically. Roles are how Ansible scales from 'one playbook' to 'an entire infrastructure codebase that multiple teams can contribute to.'
The directory structure is Ansible's loading convention, not optional decoration. When you reference a role in a playbook, Ansible automatically loads tasks/main.yml, handlers/main.yml, defaults/main.yml, templates/, and files/ if they exist. The structure is the contract — deviate from it and things silently don't load.
Roles come from two sources: you write your own for application-specific automation, or you pull community roles from Ansible Galaxy (ansible-galaxy install geerlingguy.nginx). Galaxy has thousands of pre-built roles for common infrastructure software. For Nginx, Docker, PostgreSQL, certbot, Redis — a battle-tested community role saves hours and handles edge cases your first draft won't. For deploying your Java application, configuring your monitoring stack, or enforcing your company's specific security baseline — you write your own.
Critically, community roles must be version-pinned in a requirements.yml file. Not a moving branch, not latest: a specific version tag. I've watched a Galaxy role change a default variable in a minor version update and restart PostgreSQL during a maintenance window without any warning. The role's changelog mentioned it. Nobody read the changelog because nobody expected a minor version to change default behavior. Pin the version. Test the upgrade in staging. Treat a Galaxy role update the same way you treat a library dependency upgrade, with the same caution and the same verification process.
---
# io.thecodeforge: Reusable Nginx Role
#
# Role directory structure (Ansible's loading convention, not optional):
# roles/nginx/
# ├── defaults/
# │   └── main.yml        <- weakest variable precedence, safe defaults
# ├── handlers/
# │   └── main.yml        <- service reload/restart handlers
# ├── tasks/
# │   └── main.yml        <- this file, core task logic
# ├── templates/
# │   └── vhost.conf.j2   <- Jinja2 config templates
# └── files/
#     └── (static files if needed)
#
# Use this role in a playbook:
# - hosts: webservers
#   roles:
#     - role: nginx
#       vars:
#         server_name: api.thecodeforge.io
#         nginx_port: 8080

- name: Install Nginx
  ansible.builtin.apt:
    name: nginx
    state: present
    update_cache: true

- name: Deploy virtual host configuration from template
  ansible.builtin.template:
    src: vhost.conf.j2
    dest: "/etc/nginx/sites-available/{{ server_name }}.conf"
    owner: root
    group: root
    mode: '0644'
    validate: '/usr/sbin/nginx -t -c %s'
    # validate: runs nginx -t on the rendered config before writing it.
    # If the config is invalid, Ansible rejects it and the file is not updated.
    # This prevents deploying a broken Nginx config that would fail on reload.
    # Caveat: nginx -t -c expects a complete nginx.conf. If the template renders
    # only a server block, this check fails; in that case drop validate and run
    # nginx -t as a separate task before the reload.
  notify: Reload Nginx

- name: Enable virtual host by creating symlink
  ansible.builtin.file:
    src: "/etc/nginx/sites-available/{{ server_name }}.conf"
    dest: "/etc/nginx/sites-enabled/{{ server_name }}.conf"
    state: link
  notify: Reload Nginx

- name: Ensure Nginx is running and enabled on boot
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true
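The tasks file above relies on two sibling files that the role loads automatically. A minimal sketch of both follows; the default values are assumptions, and the handler name has to match the string used in the tasks' notify lines exactly, so adjust it if your tasks file spells it differently.

```yaml
# roles/nginx/handlers/main.yml
---
- name: Reload Nginx            # must match the notify string in tasks/main.yml
  ansible.builtin.service:
    name: nginx
    state: reloaded

# roles/nginx/defaults/main.yml (illustrative values, overridden by almost anything)
---
server_name: localhost
nginx_port: 80
```

Keeping these values in defaults/main.yml rather than vars/main.yml is what lets inventory and playbooks override them, since role defaults sit at the bottom of the precedence ladder.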
---
# io.thecodeforge: requirements.yml, Galaxy role version pinning
# Install with: ansible-galaxy install -r requirements.yml
# ALWAYS pin to a specific version. Never use 'latest'.
# Treat a version bump the same as a library dependency upgrade:
# test in staging, read the changelog, verify behavior before deploying to prod.
roles:
  - name: geerlingguy.nginx
    version: 3.2.0
    # Pinned: tested against Ubuntu 22.04 LTS on 2026-03-01
    # Upgrade checklist: test in staging, verify default variable changes

  - name: geerlingguy.docker
    version: 6.1.0
    # Pinned: confirmed compatible with Docker 25.x on 2026-02-15

  - name: geerlingguy.postgresql
    version: 3.4.0
    # Pinned: restart behavior tested; does NOT restart on minor config changes

# Install all roles:
#   ansible-galaxy install -r requirements.yml --roles-path roles/
#
# Upgrade a single role safely:
#   ansible-galaxy install geerlingguy.nginx,3.3.0 --force
#   # Then test in staging before updating the version in requirements.yml
Use Galaxy for Commodity Software — Pin the Version
Don't write your own Nginx, Docker, or PostgreSQL role from scratch. ansible-galaxy install geerlingguy.nginx gives you a battle-tested role maintained by one of the most prolific Ansible contributors in the community. Save your custom role-writing energy for application-specific automation that Galaxy can't provide. But pin the version in requirements.yml every time. A community role is a dependency you don't fully control — treat it with the same caution as any third-party library.
Production Insight
Community roles save time but introduce supply chain risk. A Galaxy role that changes its default restart behavior in a minor version can restart your database during business hours with no warning in the Ansible output.
Pin Galaxy roles to specific versions in requirements.yml. Read the role's CHANGELOG before upgrading. Test in staging with the same inventory structure as production.
Rule: ansible-galaxy install geerlingguy.nginx without a version pin in requirements.yml is the same as npm install without a lockfile. Don't do it.
Key Takeaway
Roles are how Ansible scales from 10 to 1000 servers. The directory structure is the loading contract — Ansible silently skips files that don't follow it.
Community roles for infrastructure software, custom roles for application logic. Pin community role versions in requirements.yml every time.
A role you didn't write is a dependency you don't fully control. Version-pin it, test upgrades in staging, and read the changelog before deploying to production.
Custom Role vs Community Role — The Decision Criteria
If: Commodity infrastructure software such as Nginx, Docker, or PostgreSQL
→ Use: A community Galaxy role, pinned to a specific version in requirements.yml. Don't reinvent the wheel.
If: Application deployment, business-specific configuration, company security baseline
→ Use: Write a custom role. This is your domain-specific logic that Galaxy cannot provide.
If: Community role exists but doesn't support a configuration option you need
→ Use: Fork the role or wrap it. Add a custom task after the community role that applies your specific config. Do not modify the community role in-place.
If: Playbook is importing more than three roles
→ Use: Create a higher-level wrapper role that includes the sub-roles. This makes the top-level playbook readable and keeps the role composition organized.
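A wrapper role of the kind described in the last two rows might look roughly like the sketch below. The role names web_stack and nginx_vhost are hypothetical, and geerlingguy.nginx is assumed to be installed from a pinned requirements.yml.

```yaml
# roles/web_stack/tasks/main.yml
# web_stack and nginx_vhost are hypothetical role names used for illustration.

# Run the pinned community role first, unmodified.
- name: Install Nginx using the community role
  ansible.builtin.include_role:
    name: geerlingguy.nginx

# Layer company-specific configuration on top instead of editing the community role.
- name: Apply the custom virtual host configuration
  ansible.builtin.include_role:
    name: nginx_vhost
  vars:
    server_name: api.thecodeforge.io
```

The top-level playbook then lists only web_stack, and the community role stays untouched so version upgrades remain a one-line change.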
Production Patterns — Error Handling, Vault, and Rolling Deploys
The playbook we built above works correctly for a single server in a controlled environment. Production is messier. Databases fail mid-migration. Network blips cause intermittent SSH timeouts. You need to deploy to 50 servers without taking all 50 offline simultaneously. And you absolutely cannot store database passwords in plain text YAML committed to Git. That is not about policy: production credentials in version control are a breach waiting to happen.
Error Handling with block/rescue/always: Ansible has a try/catch equivalent. Wrap risky tasks in a block. If anything inside fails, the rescue section runs — rollback, alert, log. The always section runs regardless — cleanup, notifications. Without this pattern, a failed database migration leaves your server in a half-configured state with no automatic recovery and no notification that anything went wrong.
Rolling Deploys with serial: The serial keyword controls how many hosts Ansible takes through the play at a time. serial: 3 means update 3 servers, verify they're healthy, then move to the next 3. Without serial, every task runs against the whole host group before the next task starts (parallelism is capped only by forks), so a bad release reaches every server in the same pass. That is acceptable for config management but catastrophic for application deploys where you need zero downtime.
Ansible Vault for Secrets: Vault encrypts variables or entire files using AES256. Create an encrypted file with ansible-vault create group_vars/production/vault.yml, add your secrets, and commit the encrypted file to Git. Without the vault password, the file is gibberish — safe to store in version control. In CI/CD, pass the vault password via a file written from a CI secret: echo "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass, then ansible-playbook deploy.yml --vault-password-file /tmp/vault_pass. Never use --ask-vault-pass in CI — it expects interactive input and hangs silently.
For different environments, use different vault password files — one for staging, one for production. The vault file contents can be identical in structure but different in values (different database passwords per environment), while the passwords to decrypt them are stored separately in your CI secrets manager.
---
# io.thecodeforge: Production Deploy with Error Handling, Rolling Deploy, and Vault
#
# Before running:
# 1. Create vault file: ansible-vault create group_vars/production/vault.yml
#    Add: db_password: "your_real_password"
#         webhook_url: "https://hooks.slack.com/your/webhook"
# 2. Commit the encrypted vault file to Git (safe — AES256 encrypted)
# 3. Store vault password in CI secrets as ANSIBLE_VAULT_PASSWORD
# 4. CI runs with: ansible-playbook deploy.yml --vault-password-file /tmp/vault_pass
- name: Deploy Application with Safety Rails
  hosts: webservers
  become: true
  serial: 3  # Rolling deploy: process 3 servers at a time
             # For 30 servers: 10 sequential batches of 3
             # Trade-off: 10x longer than parallel, 0 simultaneous downtime
  max_fail_percentage: 0  # Stop the entire deploy if ANY server in a batch fails
                          # max_fail_percentage: 30 would allow 30% failure before aborting
                          # For database migrations, use 0 — one failure should stop everything
  vars_files:
    - group_vars/production/vault.yml  # Encrypted with ansible-vault — safe in Git
    # vault.yml defines db_password and webhook_url (see step 1 above)
    # Reference in tasks as: {{ db_password }}
    # Ansible decrypts at runtime using the vault password file — never stores plaintext
  tasks:
    - name: Deploy application release with rollback on failure
      block:
        # ── Step 1: Pull the new code ─────────────────────────────────────────
        - name: Pull latest application code
          ansible.builtin.git:
            repo: "https://github.com/thecodeforge/app.git"
            dest: /opt/app
            version: "{{ release_version }}"
            # release_version passed via -e on the command line:
            # ansible-playbook deploy.yml -e release_version=v2.4.1

        # ── Step 2: Run database migrations ──────────────────────────────────
        - name: Run database migrations
          ansible.builtin.command:
            cmd: /opt/app/bin/migrate --env production
            chdir: /opt/app
          environment:
            DATABASE_URL: "postgres://app:{{ db_password }}@db-01:5432/appdb"
            # db_password comes from the vault file — never hardcoded
          register: migration_result
          # register: captures the command output for use in later tasks or rescue block

        # ── Step 3: Verify the application is healthy ─────────────────────────
        - name: Verify application health endpoint responds 200
          ansible.builtin.uri:
            url: "http://localhost:8080/health"
            status_code: 200
          register: health_check
          until: health_check.status == 200  # retries only loop reliably when paired with until
          retries: 5  # Try up to 5 times
          delay: 3    # Wait 3 seconds between retries
          # If the health check fails after 5 retries, the block fails
          # and rescue runs automatically

      rescue:
        # Runs only if any task in the block above fails
        - name: Log deployment failure with context
          ansible.builtin.debug:
            msg: >
              Deploy FAILED on {{ inventory_hostname }}.
              Release: {{ release_version }}.
              Rolling back to: {{ previous_release }}.
              Migration output: {{ migration_result.stdout | default('N/A') }}

        - name: Rollback to previous known-good release
          ansible.builtin.git:
            repo: "https://github.com/thecodeforge/app.git"
            dest: /opt/app
            version: "{{ previous_release }}"
            # previous_release passed alongside release_version:
            # ansible-playbook deploy.yml -e release_version=v2.4.1 -e previous_release=v2.4.0

        # A rescued host counts as recovered, so max_fail_percentage would never trigger.
        # Failing explicitly here keeps the "one failure stops everything" behavior.
        - name: Mark the host as failed after rollback
          ansible.builtin.fail:
            msg: "Deploy of {{ release_version }} failed on {{ inventory_hostname }} and was rolled back."

      always:
        # Runs regardless of success or failure — use for notifications and cleanup
        - name: Send deployment status notification
          ansible.builtin.uri:
            url: "{{ webhook_url }}"
            method: POST
            body_format: json
            body:
              host: "{{ inventory_hostname }}"
              release: "{{ release_version }}"
              status: "{{ 'success' if ansible_failed_task is not defined else 'failed' }}"
              environment: production
          # webhook_url comes from the vault file
          # ansible_failed_task is set by Ansible when a task in the block fails
Never Skip Error Handling in Production
I watched a team deploy without block/rescue during a database schema migration. A migration script failed on server 3 of 20. Ansible stopped for that host but continued for the remaining 17. Result: 19 servers running the new application code against the new schema, 1 server running old code against the old schema, and the load balancer routing 5% of traffic to the old server. The application broke in spectacular and inconsistent ways for three hours while the team figured out what happened. Always use block/rescue for any playbook that modifies persistent state. The rescue block should be your automated incident response.
Production Insight
serial: 3 on a 300-server fleet means 100 sequential batches. With a 30-second health check per batch, that's 50 minutes for a full deploy. Plan your maintenance windows accordingly.
Vault decryption adds about 200ms of startup overhead per playbook run. Cache the vault password file in your CI agent's workspace — don't write it on every task.
Rule: set max_fail_percentage: 0 for database migrations and schema changes. Set max_fail_percentage: 20 for stateless config deployments where partial failure is tolerable. Never leave it at the default (which allows 100% failure before stopping).
Key Takeaway
Production deploys need serial for safety, block/rescue for recovery, and Vault for secrets. Without all three, you're gambling on every deploy.
Vault workflow: ansible-vault create the file, commit the encrypted version to Git, store the password in CI secrets, pass it with --vault-password-file. Never --ask-vault-pass in automation.
A rollback in your rescue block is worth more than any monitoring alert. By the time an alert fires, the rescue block has already run.
Choosing serial Batch Size for Rolling Deploys
If: Stateless application servers, zero-downtime deploy, load balancer in front
→ Use: serial: 25%. Update one quarter of the fleet at a time. Fast enough to complete in reasonable time, safe enough to catch problems before they hit all servers.
If: Database migration included in the deploy
→ Use: serial: 1 with max_fail_percentage: 0. Migrations must succeed on every server before moving to the next. One failure stops everything.
If: Configuration change only, no code deploy, service remains running
→ Use: serial: 50% or higher. Config changes are low-risk and faster completion is better.
If: Unknown risk level or first time running this playbook in production
→ Use: serial: 1 with --limit to start on a single non-critical host. Verify manually. Then increase serial gradually.
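Increasing the batch size gradually does not require editing the playbook between runs, because serial also accepts a list. The sketch below assumes a hypothetical service named app; the batch sizes are illustrative, not a recommendation for any particular fleet.

```yaml
# Canary-style rollout: one host first, then a quarter of the fleet per batch.
# When the list is exhausted, Ansible keeps using the last value for the remaining hosts.
- name: Rolling deploy with a canary batch
  hosts: webservers
  become: true
  serial:
    - 1
    - "25%"
  max_fail_percentage: 0   # abort the rollout if the canary (or any later host) fails
  tasks:
    - name: Restart the application service
      ansible.builtin.service:
        name: app          # hypothetical service name
        state: restarted
```

If the single canary host fails, the play stops before the larger batches ever start.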
Key Features of Ansible — What Actually Matters in Production
Forget the marketing fluff. Here's what makes Ansible worth your time when you're firefighting at 3 AM.
Agentless. No daemons to install, no certificates to rotate, no agents to patch. Linux nodes just need SSH and Python; Windows nodes need WinRM and PowerShell. That's it. When a node goes belly-up, you don't debug a dead agent — you fix the node.
Idempotency isn't a feature, it's a contract. Ansible modules are built to declare state, not run commands. Run a playbook twice — the second run changes nothing if the system already matches your declaration. This isn't a nice-to-have; it's what stops you from crashing production with a forgotten restart.
Declarative YAML, not imperative scripts. You write what the end state looks like — "Nginx should be installed and running on port 8080." Ansible figures out the how. This shifts your brain from "I need to write an if-else tower" to "I need to describe the target state." That's the difference between a script that rots and a playbook that survives.
Extensible via Python modules. Need to manage a proprietary API? Write a custom module. The framework is trivial — return a JSON dict with changed and msg. No special SDK to learn.
idempotency_demo.yml (YAML)
# io.thecodeforge — devops tutorial
# Proving idempotency — run this twice
- name: Ensure Nginx is at the right state
  hosts: webservers
  become: true  # installing packages with apt requires root
  gather_facts: false
  tasks:
    - name: Install Nginx package
      ansible.builtin.apt:
        name: nginx
        state: present  # Declarative, not "apt-get install"
      register: install_result

    - name: Report if Nginx was freshly installed
      ansible.builtin.debug:
        msg: "Nginx installed this run"
      when: install_result.changed

    - name: Report if Nginx was already present
      ansible.builtin.debug:
        msg: "Nginx was already installed — no change"
      when: not install_result.changed
Output
First run:
changed: [web-01] => changed=true
msg: Nginx installed this run
Second run:
ok: [web-01] => changed=false
msg: Nginx was already installed — no change
Production Trap:
Idempotency breaks when you use shell or command modules without creates/removes guards. Those are imperative escape hatches — treat them like surgery. Use them only when no module exists.
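As a rough illustration of those guards, the sketch below shows a command task that Ansible can skip on later runs, plus a read-only command that should never report a change. The script path and marker file are hypothetical; the script itself is assumed to create the marker when it succeeds.

```yaml
# Guarded command: Ansible skips the task when the file named in "creates" already exists.
- name: Initialise the application database once
  ansible.builtin.command:
    cmd: /opt/app/bin/init-db            # hypothetical script
    creates: /opt/app/.db_initialised    # hypothetical marker written by the script

# Read-only command: it changes nothing, so tell Ansible not to report "changed".
- name: Read the running kernel version
  ansible.builtin.command:
    cmd: uname -r
  register: kernel_version
  changed_when: false
```

Without creates or changed_when, both tasks would show as changed on every run and hide real changes in the noise.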
Key Takeaway
Ansible's agentless design and idempotent modules eliminate agent management overhead and prevent state drift. Use declarative modules over imperative commands every time.
Ansible Architecture — The Minimal Moving Parts You Must Understand
Ansible's architecture is brutally simple compared to Puppet or Chef. That's the point. Fewer moving parts means fewer failure modes.
Control Node. This is where you install Ansible. Your laptop. A bastion host. A CI runner. Ansible sends commands from here to managed nodes. Note: Windows cannot be a control node natively — use WSL or a Linux jump box.
Managed Nodes. The servers, containers, or network devices you control. They need SSH (Linux), WinRM (Windows), or a network API target. That's it. No agent, no daemon. You push commands to them, or they pull via ansible-pull if you're doing scale-out without a central server.
Inventory. A file listing your managed nodes, grouped logically. Static or dynamic — you can pull from AWS EC2, GCP, or a CMDB. An inventory can be a flat INI file or a YAML file with variables. Critical mistake: hardcoding IPs instead of using group variables.
Modules. The actual workhorses. Each module is a Python script that runs on the managed node, returns JSON, and exits. copy, file, service, template, uri, package — learn these cold. Everything else is syntactic sugar around these core primitives.
Playbooks. YAML files that orchestrate modules in order. They define which hosts, which tasks, what variables, and how to handle failures. A playbook without error handling is a fire drill waiting to happen.
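Plugins. Extend Ansible's core — connection plugins, callback plugins, filter plugins. You'll rarely write one, but you'll use them daily: every SSH session goes through a connection plugin, the console output comes from a callback plugin, and ansible.builtin.debug is implemented as an action plugin. Modules such as community.docker.docker_container are distributed the same way, through collections.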
minimal_architecture_inventory.ini (INI)
# io.thecodeforge — devops tutorial
# A production inventory with group separation
[webservers]
web-01 ansible_host=10.0.1.10
web-02 ansible_host=10.0.1.11

[databases]
db-primary ansible_host=10.0.2.20
db-replica ansible_host=10.0.2.21

[loadbalancers]
lb-01 ansible_host=10.0.3.30

# Group variables — apply to all webservers
[webservers:vars]
http_port=8080
nginx_config_path=/etc/nginx/nginx.conf
Output
No direct output — inventory is a configuration file.
Common check command:
$ ansible-inventory --list --yaml
all:
  children:
    webservers:
      hosts:
        web-01:
          ansible_host: 10.0.1.10
        web-02:
          ansible_host: 10.0.1.11
    databases:
      hosts:
        db-primary:
          ansible_host: 10.0.2.20
        db-replica:
          ansible_host: 10.0.2.21
    loadbalancers:
      hosts:
        lb-01:
          ansible_host: 10.0.3.30
Senior Shortcut:
Forget dynamic inventory scripts unless your fleet is volatile. Static YAML inventory with group vars is faster to debug, easier to version control, and avoids the 'inventory plugin broke at 2 AM' problem. Only use dynamic inventory when nodes spin up/down automatically.
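For comparison, the same hosts and group variables from the INI inventory above could be written as a static YAML inventory along these lines. The file name inventory.yml is an assumption; the hosts and values are copied from the INI example.

```yaml
# inventory.yml (assumed file name): YAML form of the INI inventory shown earlier
all:
  children:
    webservers:
      hosts:
        web-01:
          ansible_host: 10.0.1.10
        web-02:
          ansible_host: 10.0.1.11
      vars:                      # group variables, equivalent to [webservers:vars]
        http_port: 8080
        nginx_config_path: /etc/nginx/nginx.conf
    databases:
      hosts:
        db-primary:
          ansible_host: 10.0.2.20
        db-replica:
          ansible_host: 10.0.2.21
    loadbalancers:
      hosts:
        lb-01:
          ansible_host: 10.0.3.30
```

Running ansible-inventory -i inventory.yml --list against it is a quick way to confirm that the groups and variables resolve as intended.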
Key Takeaway
Ansible's architecture has five components: control node, managed nodes, inventory, modules, playbooks. Master the inventory structure first — it's the foundation your playbooks run on.
How Ansible Works — The SSH Handshake and Module Execution Path
Here's the cold, hard execution path when you run ansible-playbook deploy.yml:
1. Parse the playbook. Ansible reads YAML, resolves variable precedence (remember the ladder?), compiles tasks into a list.
2. Build the inventory. It resolves host patterns, applies group vars, and expands host ranges. This is where your -l limit flag filters the host list.
3. SSH connection (default). Ansible opens an SSH connection to each managed node. It uses ControlPersist to reuse connections, which is why the first task is slow and subsequent tasks are fast. For Windows, it uses WinRM via pywinrm.
4. Module transfer. Ansible packages the module (a Python script) together with its JSON-encoded arguments. It copies that payload to the managed node over sftp or scp, by default under ~/.ansible/tmp/.... Yes, it lands on disk temporarily.
5. Execute and collect. The control node runs the module script via SSH. The module executes, makes changes (e.g., writes a config file), and returns a JSON result dict: { "changed": true, "msg": "file created" }.
6. Cleanup. The module script is deleted from the managed node. Ansible stores the result in memory for use in later tasks (via register: result).
7. Report. Ansible formats the results (with colors, if enabled), prints them to stdout, and writes them to log files if configured.
This happens per task, per host. That's why a 50-host fleet with 20 tasks means at least 1000 module executions, each with its own SSH round trips. Mitigation? Use pipelining=True to reduce SSH overhead — cuts execution time by up to 40%.
ansible.cfg (INI)
# io.thecodeforge — devops tutorial
# Enable SSH pipelining in ansible.cfg for faster execution
[ssh_connection]
pipelining = True
# Without pipelining: the module is copied to a temp file on the node, then executed
# With pipelining: the module is piped to the remote Python interpreter over the
#   existing SSH session, so fewer SSH operations are needed per task
#
# Requirement when using become (sudo): managed nodes must NOT set
#   Defaults requiretty
# in /etc/sudoers.
#
# With requiretty enabled, pipelined tasks fail, which is why pipelining is off by default
real 0m1.810s (saves 2.4s per host = minutes on a fleet)
Production Trap:
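If you see SSH timeout failures on large fleets, check MaxSessions in sshd_config on the machine accepting the connections, which is usually a bastion or jump host that every session funnels through. The OpenSSH default is 10; raise it there, for example to 100. Separately, keep connections reusable from the control node with ansible_ssh_common_args: '-o ControlMaster=auto -o ControlPersist=600s'. Otherwise Ansible will serialize connections and execute slower than a junior dev on Monday morning.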
Key Takeaway
Ansible works by SSH-pushing Python modules to managed nodes, executing them, and collecting JSON results. SSH pipelining and ControlPersist are the two levers that turn a 30-minute playbook into a 5-minute one.
Security and Compliance Enforcement — Automate Your Audits, Don't Just Check Boxes
Security isn't something you bolt on after deployment. It's either baked into your playbooks from the start or you're firefighting breaches. Compliance enforcement in Ansible means writing idempotent policies that fail closed, not open. The WHY: you need to prove to auditors that SELinux is enforcing, fail2ban is running, and SSH root login is disabled — without SSH'ing into every box manually.
The HOW: Use the assert module to gate your deployments. Check kernel parameters with sysctl, verify file permissions with stat, and hold packages at their approved versions with dpkg_selections. Combine this with failed_when conditions that halt execution if a security control is misconfigured. For compliance frameworks like CIS or PCI-DSS, write dedicated roles that map to control IDs. Then run these roles in check mode as part of your CI pipeline — your build should fail before a non-compliant node ever sees production.
Senior shortcut: Don't just check for the presence of a file. Verify its contents, owner, and permissions. Auditors love sha256sum comparisons. Give them receipts.
fatal: [prod-web-01]: FAILED! => {"assertion": "file_stat.stat.mode == '0600'", "evaluated_to": false, "msg": "Ensure permissions on /etc/ssh/sshd_config are 600 — mode is 0644, expected 0600"}
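A failure like the one above could come from a check along the following lines. This is a minimal sketch: the play layout and the root-ownership check are assumptions, while the register name file_stat and the 0600 expectation are taken from the error message.

```yaml
- name: Audit SSH daemon configuration
  hosts: all
  become: true
  gather_facts: false
  tasks:
    # stat only reads metadata, so this is safe to run in check mode.
    - name: Read metadata for sshd_config
      ansible.builtin.stat:
        path: /etc/ssh/sshd_config
        checksum_algorithm: sha256   # checksum is returned for audit evidence
      register: file_stat

    # assert fails the host (and the pipeline) when any condition is false.
    - name: Ensure permissions on /etc/ssh/sshd_config are 600
      ansible.builtin.assert:
        that:
          - file_stat.stat.exists
          - file_stat.stat.mode == '0600'
          - file_stat.stat.pw_name == 'root'
        fail_msg: "sshd_config must be owned by root with mode 0600 (found mode {{ file_stat.stat.mode | default('missing') }})"
        success_msg: "sshd_config ownership and permissions are compliant"
```

The registered file_stat.stat.checksum value can be logged or compared against a known-good hash when auditors ask for proof of file contents.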
Production Trap:
Never use ignore_errors: true on security checks. If you silence compliance failures, you're hiding breaches. Let the playbook burn — you'll thank yourself during the post-mortem.
Key Takeaway
Security enforcement means failing the deployment, not just logging a warning. Idempotent assertions are your audit trail.
Dynamic Inventories — Stop Hardcoding Server Lists in 2025
Hardcoding IP addresses in a static inventory file is a rookie move that scales to exactly zero production environments. The WHY: cloud instances auto-scale, containers get recycled, and on-prem servers get migrated. Your inventory must reflect reality, not a stale text file someone committed six months ago. Dynamic inventories query your infrastructure provider (AWS, GCP, vSphere) and return live groups and variables.
The HOW: Inventory plugins for AWS EC2, Azure, GCP, OpenStack, and VMware ship in their respective collections (for example amazon.aws.aws_ec2); the older standalone inventory scripts still work but are legacy. You point the -i flag at a script or at a YAML config for the plugin. With keyed_groups, instance tags become your group names. Want to target all production web servers with the tag Environment:prod and Role:web? Ansible builds that group automatically. No manual maintenance. If a new instance spins up with the right tags, it's in the next playbook run. Dead instances? Dropped automatically.
Senior shortcut: Use the keyed_groups plugin option to create nested groups from tags or custom variables. This lets you write targeted playbooks like rolling_update:frontend without touching inventory files.
Test your dynamic inventory with ansible-inventory -i aws_ec2.yml --list before running any playbook. Catch missing tags or wrong filters when it costs nothing.
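A plugin config for the aws_ec2.yml file mentioned above might look like the sketch below. The region, the tag names Environment and Role, and the choice of private IPs are assumptions for illustration; the amazon.aws collection and AWS credentials are assumed to be available on the control node.

```yaml
# aws_ec2.yml: the file name must end in aws_ec2.yml or aws_ec2.yaml for the plugin to accept it
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                      # assumed region
filters:
  tag:Environment: prod            # only instances tagged Environment=prod
  instance-state-name: running     # ignore stopped or terminated instances
keyed_groups:
  - key: tags.Role                 # Role=web becomes group "role_web"
    prefix: role
  - key: tags.Environment          # Environment=prod becomes group "env_prod"
    prefix: env
hostnames:
  - private-ip-address             # use the private IP as the inventory hostname
compose:
  ansible_host: private_ip_address # connect over the private network
```

A playbook can then target hosts: role_web, and newly tagged instances join that group on the next run.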
Key Takeaway
Your inventory must be alive. Dynamic inventory plugins eliminate stale host lists and enable auto-scaling automation without script changes.
Provisioning — Why Infrastructure Must Exist Before Automation Runs
Ansible is often used to configure running systems, but those systems must first exist. Provisioning is the act of creating infrastructure — VMs, containers, network interfaces, storage volumes — before any playbook touches them. Without provisioning, your automation is solving a problem on a machine that doesn't exist. Ansible provisions through cloud modules: amazon.aws.ec2_instance, azure.azcollection.azure_rm_virtualmachine, or community.digitalocean.digital_ocean_droplet. These modules send API calls to your cloud provider, wait for resource creation, and return facts like IP addresses. Do not hardcode IPs. Use add_host to dynamically insert new nodes into the in-memory inventory for downstream plays. Production pattern: separate provisioning into its own playbook or role, run it first, then target the fresh hosts with configuration. This keeps creation logic separate from configuration logic, making both auditable and reusable. Idempotency matters here: your provisioning playbook should detect existing resources and skip creation, not fail or duplicate.
TASK [Add new host to in-memory inventory] ***********************************
changed: [localhost] => (item=54.123.45.67)
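The provision-then-configure pattern that produces output like the above can be sketched as two plays in one run. Every identifier below (AMI ID, subnet ID, key name, region, the new_webservers group) is a placeholder assumption, and the instance is assumed to receive a public IP.

```yaml
# Play 1: create the instance from the control node and register it in memory.
- name: Provision a web server
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Launch the EC2 instance (reuses an existing instance with the same Name tag)
      amazon.aws.ec2_instance:
        name: web-01
        region: us-east-1                          # placeholder
        instance_type: t3.micro
        image_id: ami-0123456789abcdef0            # placeholder
        key_name: deploy-key                       # placeholder
        vpc_subnet_id: subnet-0123456789abcdef0    # placeholder
        state: running
        wait: true
        tags:
          Role: web
      register: ec2

    - name: Add new host to in-memory inventory
      ansible.builtin.add_host:
        name: "{{ item.public_ip_address }}"
        groups: new_webservers
      loop: "{{ ec2.instances }}"
      loop_control:
        label: "{{ item.public_ip_address }}"

# Play 2: configure only the hosts that play 1 just registered.
- name: Configure the freshly provisioned hosts
  hosts: new_webservers
  become: true
  gather_facts: false
  tasks:
    - name: Wait until SSH is reachable
      ansible.builtin.wait_for_connection:
        timeout: 300

    - name: Install Nginx
      ansible.builtin.apt:
        name: nginx
        state: present
        update_cache: true
```

Credentials are deliberately absent from the playbook; the module is expected to pick them up from the environment or an instance role.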
Production Trap:
Never put your cloud provider credentials in a playbook. Use environment variables or Ansible Vault + AWS IAM instance roles. Hardcoded keys in version control are a breach waiting to happen.
Key Takeaway
Provision infrastructure first, configure it second — always separate concerns into distinct playbooks.
Orchestration — Coordinating Multi-Node Workflows That Fail Gracefully
Orchestration is about sequencing and dependencies across multiple hosts, not just running the same command everywhere. When one service must start only after another database is ready, or when you need a rolling update across 50 web servers without dropping traffic, you need orchestration. Ansible orchestration uses serial, order, throttle, and wait_for. For example, a three-tier app: bring up the databases, then the app servers, then the load balancer, with each stage waiting for the previous one to pass health checks. Use delegate_to to run tasks from one host that check another. Use run_once for idempotent setup tasks (e.g., creating database schemas) that must execute only once across a group. For rolling updates, set serial: 1 or serial: 20% and include wait_for after restarts to verify service health before proceeding to the next batch. This pattern prevents cascading failures. Orchestration fails safely when you design for retries: set retries: 5 with delay: 10 on critical health checks.
rolling-update.yml (YAML)
# io.thecodeforge — devops tutorial
---
- name: Rolling update of web servers
  hosts: webservers
  become: true
  serial: 1
  tasks:
    # community.general has no nginx_upstream module; this version assumes
    # HAProxy runs on lb01 and uses the community.general.haproxy module instead.
    - name: Take server out of load balancer
      community.general.haproxy:
        state: disabled
        host: "{{ inventory_hostname }}"
        backend: backend
      delegate_to: lb01

    - name: Update application
      ansible.builtin.git:
        repo: https://github.com/example/app.git
        dest: /var/www/app
        version: "{{ git_tag }}"

    - name: Restart web service
      ansible.builtin.systemd:
        name: nginx
        state: restarted

    - name: Wait for health check
      ansible.builtin.wait_for:
        port: 80
        host: "{{ inventory_hostname }}"
        timeout: 30

    - name: Re-add to load balancer
      community.general.haproxy:
        state: enabled
        host: "{{ inventory_hostname }}"
        backend: backend
      delegate_to: lb01
Output
PLAY [Rolling update of web servers] ****************************************
TASK [Take server out of load balancer] **************************************
ok: [web01 -> lb01]
TASK [Wait for health check] *************************************************
ok: [web01]
TASK [Re-add to load balancer] ***********************************************
changed: [web01 -> lb01]
... continues for web02, web03 ...
Production Trap:
Orchestration without health checks is gambling. A service that starts but doesn't respond correctly will take down your entire update. Always verify with wait_for or uri module before proceeding to the next batch.
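wait_for only proves that a port is open. A stricter check, sketched below, asks the application for an HTTP 200 and retries before giving up; the /health path is an assumed endpoint, and the retry values match the retries: 5 and delay: 10 suggested earlier.

```yaml
# Drop-in task for a rolling update: the batch fails if the app never answers 200.
- name: Verify the application answers before moving to the next batch
  ansible.builtin.uri:
    url: "http://{{ inventory_hostname }}:80/health"   # assumed health endpoint
    status_code: 200
  register: health
  until: health.status == 200
  retries: 5
  delay: 10
```

If the endpoint still fails after the last retry, the task fails, and with serial set the play does not continue to the next batch for that host.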
Key Takeaway
Orchestration enforces order and health verification across hosts — your playbook should stop, not continue, when a node fails its check.
Introduction
Ansible is a radically simple IT automation engine that eliminates manual toil and human error from infrastructure operations. Unlike configuration management tools that require agents installed on every node, Ansible operates over standard SSH—meaning your servers remain untouched until execution. This architecture makes Ansible uniquely suited for heterogeneous environments where installing a permanent daemon is impractical or prohibited by security policy. The core philosophy is 'mechanism, not magic': every operation is a straightforward YAML description of system state, not a cryptic DSL. For teams drowning in repetitive firewall updates, user account provisioning, or application deployments, Ansible offers a path to repeatability without complexity. Before evaluating playbooks or roles, understand that Ansible's primary value is reducing the cognitive load of fleet management. It transforms tribal knowledge into executable, version-controlled specifications. This article assumes you manage more than three servers—beyond that number, manual processes break. Ansible restores sanity by making automation a side effect of documentation.
Never use root SSH keys. Create a service account with sudo escalation limited to specific commands. Unrestricted root access in Ansible is a compliance violation waiting to happen.
Key Takeaway
Ansible's agentless architecture over SSH reduces attack surface and makes automation possible in locked-down environments.
When Not to Use Ansible
Ansible excels at configuration management, application deployment, and task automation—but it is not a universal hammer. Avoid using Ansible for real-time event-driven automation where sub-second latency matters; tools like SaltStack or event-driven frameworks are better suited. Similarly, Ansible is not a container orchestrator—Kubernetes handles pod lifecycle and scaling natively. For stateful services requiring continuous convergence (e.g., ensuring a process stays running indefinitely), Ansible's push model falls short compared to a daemon-based tool like Puppet or Chef. Lastly, Ansible's Python dependency on control nodes can be a constraint in minimal environments like embedded systems or restricted CI runners. The golden rule: if your task fits in a cron job or a single shell script, Ansible is overkill. If you are managing 100+ servers with versioned, auditable state, Ansible is the right tool. Choose purpose-built tools for purpose-built problems; Ansible fills the midrange sweet spot between shell scripts and full-blown Kubernetes.
Architectural Guidance:
Teams often bolt Ansible onto Kubernetes clusters for config management. Instead, use ConfigMaps and Operators. Ansible's strength is outside the cluster—server OS configuration, network appliances, and ephemeral cloud provisioning.
Key Takeaway
Ansible is a configuration and automation tool, not a runtime system. Use it where you need periodic, idempotent changes—not continuous convergence or real-time event handling.
Production incident · POST-MORTEM · severity: high
The Variable Precedence Nightmare
Symptom
Playbook using nginx_port: 8080 in group_vars/all.yml. Production servers were supposed to listen on 8080. Staging worked correctly. The same playbook, same inventory structure, different outcome on prod. No errors in Ansible output — just wrong config deployed silently. The first sign of trouble was a load balancer health check failure, not Ansible.
Assumption
The team assumed variables defined in group_vars/all.yml applied to all hosts uniformly. They had no mental model of variable precedence. They didn't know host_vars overrides group_vars, and they had no process for auditing what variable values Ansible actually resolved at runtime versus what was declared in the playbook.
Root cause
One production host had a host_vars/prod-web-01.yml file with nginx_port: 80 left over from a troubleshooting session six months earlier. The engineer who created it had long since left the team. Ansible applied host_vars over the group_vars value silently — no warning, no log entry, no diff in the playbook output. The 22-level precedence ladder worked exactly as designed, exactly opposite of what the team expected. The fix took four minutes. Finding the cause took three hours.
Fix
Run ansible-inventory -i inventory.ini --host prod-web-01 to see the fully merged variable set for any host before the playbook runs (the --vars flag only affects --graph output, so it is not needed here). Remove the orphaned host_vars file. Add a CI step that runs ansible-inventory --list and diffs the resolved variables against a known-good baseline on every merge to main. Treat host_vars as a code smell that requires a documented justification comment — if a host genuinely needs unique config, the file should say why.
Key lesson
Variable precedence is not a suggestion — it is a hard 22-level ladder that Ansible enforces silently. Learn the top eight levels. host_vars overrides group_vars. Always. Without exception.
ansible-inventory --host is your variable debug command. Run it against the specific failing host before touching the playbook. The resolved variable state is the ground truth — not what you think you set.
Treat host_vars files as a code smell. Unless a host genuinely needs unique configuration that no other host in its group shares, keep variables at group level and delete host_vars files when the reason for them disappears.
Your staging environment not mirroring production in inventory structure and variable sources is a disaster waiting to happen. The variable that breaks prod will always be the one that staging silently resolved differently.
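One way to catch this class of problem before anything is deployed is a pre-flight play that asserts on the resolved value. The sketch below is an illustration rather than the team's actual fix; expected_nginx_port is a hypothetical variable introduced here.

```yaml
- name: Pre-flight check of resolved variables
  hosts: webservers
  gather_facts: false
  vars:
    expected_nginx_port: 8080    # hypothetical: the value the group-level config intends
  tasks:
    - name: Show the value Ansible actually resolved for this host
      ansible.builtin.debug:
        var: nginx_port

    - name: Stop before deploying if a stray override changed the port
      ansible.builtin.assert:
        that:
          - nginx_port | int == expected_nginx_port
        fail_msg: >-
          nginx_port resolved to {{ nginx_port }} on {{ inventory_hostname }},
          expected {{ expected_nginx_port }}. Check host_vars for this host.
```

With the orphaned host_vars/prod-web-01.yml in place, this play would fail on prod-web-01 and pass on every other web server, pointing straight at the override.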
Production debug guide
These three failure modes account for 80% of Ansible incidents. Here's exactly how to diagnose each one.
Symptom · 01
Playbook hangs indefinitely with no output or error
Fix
Add -vvvv to your command immediately. Look for 'ESTABLISH SSH CONNECTION' in the output: if nothing appears past that line, your control node cannot complete the SSH connection to the target host. Check security groups and firewall rules for port 22 inbound from the control node's CIDR. The default connection timeout is 10 seconds, but with several unreachable hosts or SSH retries configured, the run can look like an indefinite hang. Also check whether the target host's SSH daemon is running at all; a recently rebooted host may not have sshd back up yet.
Symptom · 02
Task shows changed status on every run even when nothing actually changes
Fix
You are almost certainly using shell or command instead of a dedicated idempotent module. Replace with the module version — apt, service, copy, template, file. If no dedicated module exists for your use case, add a creates or removes argument to the command module so Ansible can determine whether the operation is necessary. Run ansible-playbook playbook.yml --check --diff to see exactly what is changing between runs.
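As a rough illustration of both options in this fix, the fragment below pairs a dedicated module with a command task guarded by creates. The installer path and marker file are hypothetical; the sketch assumes the installer itself writes the marker when it succeeds.

```yaml
# tasks/main.yml (fragment)
- name: Install nginx with a dedicated module (changed only when it installs)
  ansible.builtin.apt:
    name: nginx
    state: present

- name: Run a one-time installer only if its marker file is missing
  ansible.builtin.command:
    cmd: /opt/app/install.sh          # hypothetical script
    creates: /opt/app/.installed      # hypothetical marker; task is skipped once it exists
```

On a second run, the first task should report ok and the second should be skipped, so the play recap shows changed=0.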
Symptom · 03
Variables have different values in prod than dev with the same playbook
Fix
Run ansible-inventory --host [hostname] on the broken host first (the --vars flag only applies to --graph output). Compare the output against a working host. Look specifically for host_vars files that a previous engineer may have created and forgotten. If the inventory output matches, the difference is introduced later in the run: check for -e overrides injected by your CI pipeline and for include_vars statements inside roles that load different files based on environment name, neither of which appears in ansible-inventory output. Trust the resolved values over what you think you set.
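To see the value a host ends up with inside a play, including any -e overrides, a throwaway debug playbook is one option. This is a minimal sketch; the file name debug-vars.yml is hypothetical, and prod-web-01 and nginx_port are taken from the incident above.

```yaml
# debug-vars.yml: print the value this play sees for one host
- name: Show resolved nginx_port
  hosts: prod-web-01
  gather_facts: false
  tasks:
    - name: Print the value after inventory, play, and extra-vars precedence
      ansible.builtin.debug:
        var: nginx_port
```

Run it with the same -i and -e arguments your pipeline uses, otherwise the result reflects your shell rather than CI. Variables defined inside a role only show up if that role is part of the play.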
★ Ansible Production Debug Cheat Sheet
The five commands that solve 90% of Ansible production issues. Run these before opening a ticket or waking someone up.
Playbook fails with Host unreachable or SSH timeout
Immediate action
Verify SSH connectivity independently before touching Ansible configuration
Check security group: port 22 inbound from control node CIDR. Ensure ansible_user in your inventory matches the actual SSH username on the target. For ephemeral environments like CI runners or short-lived EC2 instances, set ANSIBLE_HOST_KEY_CHECKING=False or pre-populate known_hosts with ssh-keyscan in your pipeline prep step.
Task shows changed every run when nothing actually changes
Immediate action
Identify exactly which task is reporting changed and why
Replace shell or command with a dedicated idempotent module. If using copy or template, normalize line endings and trailing whitespace: ansible.builtin.copy: content="{{ config | trim }}". If you genuinely cannot avoid shell and the command's side effects are truly undetectable, use changed_when: false explicitly rather than letting it mislead your CI dashboard.
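When a command task cannot be replaced, changed_when can be tied to the command's output rather than forced to false. The sketch below assumes a hypothetical migration tool that prints the word Applied when it changes something; both binary paths are illustrative.

```yaml
- name: Run database migrations (hypothetical command)
  ansible.builtin.command:
    cmd: /opt/app/bin/migrate --apply
  register: migrate_result
  # Report changed only when the tool says it did something;
  # the 'Applied' marker is an assumption about this tool's output
  changed_when: "'Applied' in migrate_result.stdout"

- name: Read the current app version (read-only, never a change)
  ansible.builtin.command:
    cmd: /opt/app/bin/app --version
  register: app_version
  changed_when: false
```

The first form keeps the changed signal meaningful; the second is the documented opt-out for commands that only read state.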
Variable value is correct in vars_files but resolves to something different at runtime
Immediate action
Dump the fully resolved variable state for the specific failing host
ansible -m debug -a 'var=nginx_port' -i inventory.ini $TARGET_HOST
Fix now
Remove conflicting host_vars files. Consolidate all environment-specific variables into group_vars/production.yml. Audit your CI pipeline for -e flags that inject variable overrides — these sit at the top of the precedence ladder and override everything else silently.
Handler runs on every playbook execution, not just when config actually changes
Immediate action
Find which task is notifying the handler and why it reports changed every time
Commands
grep -rn -B 8 'handler_name' playbook.yml roles/
A task notifying the handler is reporting changed on every run — almost always a shell or command task running unconditionally. Convert that task to an idempotent module. If the change is genuinely undetectable (for example, an API call with no readable state), use changed_when: false on that specific task and document why.
Playbook works manually from your laptop but fails consistently in the CI pipeline
Immediate action
Compare the execution environment between your local shell and the CI runner
CI runs without an interactive terminal — set ANSIBLE_HOST_KEY_CHECKING=False and ANSIBLE_SSH_RETRIES=3 as CI environment variables. Use absolute paths to inventory files since CI working directories vary by runner. Pass the vault password via --vault-password-file pointing to a file written from a CI secret, not --ask-vault-pass which expects interactive input and hangs silently.
Ansible vs Chef, Puppet, and Terraform
| Tool | Agent Required | Language | Learning Curve | Best For |
| --- | --- | --- | --- | --- |
| Ansible | No (agentless, SSH only) | YAML + Jinja2 | Low: most engineers are productive within a day | Configuration management, application deployment, ad-hoc fleet operations, and orchestration across mixed environments. The fastest path from zero automation to everything automated. Best choice for teams that don't have dedicated infrastructure engineers. |
| Chef | Yes (chef-client daemon running on every managed node) | Ruby DSL | High: requires Ruby knowledge and Chef Server administration | Complex, policy-based configuration in large enterprise fleets where teams have Ruby expertise and need a pull-based model. Chef Server handles 10,000+ nodes better than Ansible's push model at extreme scale. |
| Puppet | Yes (puppet agent daemon, certificate-based auth) | Puppet DSL | High: Puppet DSL is its own language with its own idioms | Long-term compliance enforcement and drift remediation in regulated industries (finance, healthcare, government) where continuous automated enforcement matters more than on-demand execution. Puppet's pull model means servers self-correct without a human initiating a run. |
| Terraform | No | HCL | Medium: HCL is readable but state management has a learning curve | Infrastructure provisioning: creating servers, VPCs, load balancers, DNS records, IAM roles, and managed services. Complementary to Ansible, not a replacement. Terraform creates the server. Ansible configures it. Most mature DevOps teams use both in sequence: Terraform provisions, Ansible configures on first boot and on every subsequent config change. |
Key takeaways
1. Ansible is agentless: it connects over SSH and needs no agent on managed nodes, only SSH access and Python. Zero agent maintenance on servers, fast onboarding for new infrastructure, and a smaller security footprint than agent-based tools.
2. Playbooks describe desired state in human-readable YAML, not step-by-step scripts. Run them once or a hundred times and the outcome is identical. This idempotency is what makes Ansible safe to run in CI/CD pipelines and on scheduled crons.
3. Variable precedence has 22 levels enforced silently. host_vars always overrides group_vars. Extra vars (-e) override everything. Run ansible-inventory --host before every production deploy where variables matter: the resolved variable state is ground truth.
4. Prioritize dedicated modules (apt, systemd, git, copy, template) over shell and command. Dedicated modules check state before acting. Shell and command run unconditionally every time and, by default, report 'changed' on every run, breaking your CI dashboard's signal-to-noise ratio.
5. Roles are how Ansible scales from 10 servers to 1000. The directory structure is Ansible's loading contract: deviate from it and files silently don't load. Pin Galaxy community roles to specific versions in requirements.yml and treat upgrades like dependency upgrades.
6. Use block/rescue/always for any playbook that modifies persistent state. Without error handling, a failed migration on server 3 of 20 leaves your fleet in split-brain configuration with no automatic recovery and no notification.
7. Ansible Vault is non-negotiable for secrets. Create the file with ansible-vault create, commit the encrypted version to Git, store the decryption password in CI secrets, and pass it with --vault-password-file. Never use --ask-vault-pass in automation and never put plain-text credentials in playbooks.
8. Ansible and Terraform are complementary tools in the same pipeline: Terraform provisions the server, Ansible configures it. Terraform's user_data runs once at first boot. Ansible runs idempotently on day 1, day 30, and day 300, correcting drift every time.
Common mistakes to avoid
Using ignore_errors: yes as a band-aid for tasks that matter
Symptom
A failing task is silenced with ignore_errors: yes because it was intermittently failing during development and the engineer wanted to move on. Three months later, SSL certificate renewal is silently failing on 12 servers. Customers see browser security warnings. Nobody noticed because the error was suppressed. The playbook exited successfully on every run, with the failure visible only as an '...ignoring' line in output that nobody reads.
Fix
Use block/rescue/always instead. If a task fails, the rescue block runs rollback and sends an alert immediately. If you genuinely expect a task to fail in a specific known way, use failed_when with a condition that checks the actual error message — not ignore_errors which swallows everything. Reserve ignore_errors for genuinely non-critical operations and document exactly why in a comment. Never use it on tasks that touch TLS, auth, or persistent state.
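A minimal sketch of that pattern follows. The renewal script and the 'not yet due for renewal' message are hypothetical stand-ins for whatever your renewal tool does, and the debug task stands in for a real alerting step.

```yaml
- name: Renew certificate with explicit failure handling
  block:
    - name: Run the renewal command
      ansible.builtin.command:
        cmd: /usr/local/bin/renew-cert.sh       # hypothetical script
      register: renew
      # Both conditions must hold for the task to fail, so only the one
      # known-benign outcome is tolerated instead of every error
      failed_when:
        - renew.rc != 0
        - "'not yet due for renewal' not in renew.stderr"
  rescue:
    - name: Surface the failure (replace with a real alert)
      ansible.builtin.debug:
        msg: "Certificate renewal failed on {{ inventory_hostname }}"

    - name: Keep the host marked as failed after alerting
      ansible.builtin.fail:
        msg: "Renewal failed; see the output of the renewal task"
  always:
    - name: Record that a renewal attempt ran
      ansible.builtin.debug:
        msg: "Renewal attempt finished on {{ inventory_hostname }}"
```

The explicit fail in rescue matters: a rescue section that completes successfully clears the failure, which would recreate the silent-failure problem in a different form.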
Committing plain-text secrets to version control
Symptom
Database passwords and API keys appear in Git history — often in an early commit before the engineer realized they'd done something wrong. A former employee with repo access now has production credentials. A security audit flags the repository. The credentials must be rotated across every system that uses them.
Fix
Use ansible-vault encrypt_string 'your_secret' --name 'db_password' and paste the encrypted output into your playbook. Better: put all secrets in group_vars/production/vault.yml and encrypt the entire file with ansible-vault encrypt. Commit the encrypted file — it's safe in Git without the password. Store the vault password in your CI secrets manager (GitHub Actions secrets, GitLab CI variables, Jenkins credentials). Rotate secrets by re-encrypting with a new value, not by changing the vault password.
Using shell or command modules when a dedicated module exists
Symptom
The CI dashboard shows 'changed' on every single run for the same task. The deploy pipeline always reports 1 changed even when nothing was deployed. The team loses trust in the changed indicator because it's always on — which means they also miss genuine changes.
Fix
Replace ansible.builtin.shell: apt install nginx with ansible.builtin.apt: name=nginx state=present. The apt module checks whether nginx is already installed before acting. It only reports 'changed' when it actually installs something (or upgrades it, if you use state=latest). Apply the same pattern for service management, file operations, and package management: there is almost always a dedicated module.
Not disabling host key checking in CI/CD environments
Symptom
The CI job hangs indefinitely with no error output. The last log line is about establishing an SSH connection. The job eventually times out after the CI runner's maximum job duration. The engineer reruns it and it hangs again.
Fix
Set ANSIBLE_HOST_KEY_CHECKING=False as a CI environment variable for ephemeral environments. For production stability, use ssh-keyscan in your CI pipeline prep step to pre-populate known_hosts before Ansible runs: ssh-keyscan -H target_host >> ~/.ssh/known_hosts. This maintains the security benefit of host key verification without the interactive hang.
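The ssh-keyscan prep step might look like this in a GitLab CI job. It is a sketch: the job name, stage, and the TARGET_HOST variable are hypothetical, and other CI systems have equivalent pre-run hooks.

```yaml
# .gitlab-ci.yml (fragment)
deploy:
  stage: deploy
  before_script:
    - mkdir -p ~/.ssh && chmod 700 ~/.ssh
    # TARGET_HOST is a hypothetical CI variable naming the host to deploy to
    - ssh-keyscan -H "$TARGET_HOST" >> ~/.ssh/known_hosts
  script:
    - ansible-playbook -i inventory.ini playbook.yml
```

Note that ssh-keyscan accepts whatever key the host presents at scan time, so this is trust-on-first-use per job. Committing known host keys to the repository or a CI secret is stricter if the hosts are long-lived.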
Forgetting become: true and spending an hour debugging the wrong thing
Symptom
A task that modifies /etc/nginx/conf.d/ fails with 'permission denied' or 'file not found' depending on the module. The engineer spends time checking whether the directory exists, whether the path is spelled correctly, whether the disk is full — none of which is the actual problem.
Fix
Add become: true at the play level for any play that touches system files, package managers, or services. Make it explicit and global: hosts: all then become: true on the next line. If only specific tasks need root, add become: true at the task level. But the most common mistake is forgetting it for an entire play — add it at the play level and override downward if needed.
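A short sketch of play-level become with a task-level override; the template name is hypothetical.

```yaml
- name: Configure web servers
  hosts: all
  become: true                  # every task in this play escalates by default
  tasks:
    - name: Deploy nginx site config (needs root to write under /etc)
      ansible.builtin.template:
        src: site.conf.j2                     # hypothetical template
        dest: /etc/nginx/conf.d/site.conf

    - name: Check which user the connection uses (no root needed)
      ansible.builtin.command:
        cmd: whoami
      become: false             # task-level override of the play default
      changed_when: false
```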
Ignoring YAML indentation and spending time on cryptic parse errors
Symptom
Ansible returns ERROR! Syntax Error while loading YAML or expected <block end>, but found '<block mapping start>'. The line number points to a line that looks visually correct. The error message is not helpful.
Fix
Run yamllint playbook.yml before ansible-playbook. Configure your editor to show invisible whitespace characters — spaces as dots, tabs as arrows. YAML requires spaces exclusively — tab characters are always invalid regardless of how they look in your editor. A missing space after a colon breaks the entire file. Install the ansible extension for VS Code which highlights YAML errors inline.
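If the default yamllint rules are too noisy for playbooks, a small project config can help. The values below are illustrative choices, not recommendations from the yamllint or Ansible projects.

```yaml
# .yamllint (project root)
extends: default
rules:
  indentation:
    spaces: 2          # assumed team convention: two-space indents
  line-length:
    max: 160           # long Jinja2 expressions often exceed the default of 80
```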
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 09 · SENIOR
Explain the agentless architecture of Ansible. How does it compare to agent-based tools like Puppet or Chef in terms of security footprint, operational overhead, and onboarding friction for new servers?
ANSWER
Agentless means no daemon runs on managed servers. Ansible connects via SSH, pushes a small Python module (or a PowerShell module for Windows hosts, over WinRM), executes it, and removes it. The security footprint is smaller than agent-based tools: one fewer daemon running as root, one fewer open port, one fewer set of certificates to manage. Operational overhead is lower: no agent upgrades, no agent crashes, no certificate rotations, no 'the agent lost connection to Chef Server' incidents at 3am. Onboarding new servers is minimal: they just need SSH access and Python installed, which most Linux distributions ship by default (minimal images may need it added).
The trade-off: Ansible's push model from a control node doesn't scale as elegantly as Chef or Puppet's pull model for very large fleets (5,000+ nodes) where you need continuous automated enforcement without human initiation. Chef's pull model handles constant drift correction at extreme scale more efficiently. For most teams — under 500 servers, mixed OS environments, teams without dedicated infrastructure engineers — agentless is simpler, faster to adopt, and operationally safer.
Q02 of 09 · SENIOR
What is idempotency in the context of Ansible modules? Can you name a module that is not idempotent by default, and explain when you'd intentionally use it?
ANSWER
Idempotency means running an operation multiple times produces the same result as running it once. Ansible modules check current state before making changes. The apt module checks if a package is installed before installing. The template module compares checksums before writing. The service module checks whether the service is already in the desired state before acting. These modules report 'ok' when the desired state already exists and 'changed' only when they actually modify something.
The shell and command modules are not idempotent by default — they execute the command unconditionally on every run and always report 'changed'. You'd intentionally use command for truly one-off operations where no dedicated module exists, but even then you add creates or removes flags to make it conditional. The only time I use shell without idempotency guards is in ad-hoc commands for emergency fleet debugging — never in a playbook that runs in CI.
Q03 of 09 · SENIOR
How does Ansible handle parallel execution? What is a fork in ansible.cfg, and how does tuning it impact performance on a 500-node fleet?
ANSWER
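Ansible uses forks to control parallelism. Each fork is a separate worker process on the control node that handles one host connection. The default is forks=5, which means that for each task Ansible works on at most 5 hosts at a time; with the default linear strategy it finishes the task on every host before moving to the next task. Treating this as batches of 5 is a useful approximation for estimating runtime.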
On a 500-node fleet with forks=5 and 10 seconds per host batch: 100 batches × 10 seconds = ~17 minutes. With forks=50: 10 batches × 10 seconds = ~100 seconds. The speedup is roughly linear up to the control node's resource limits.
The trade-offs of higher forks: each fork holds an SSH socket (file descriptor), module output in memory, and a Python subprocess. On a t2.medium control node with forks=100, I've seen OOM kills when processing large setup module output from 100 hosts simultaneously. The safe starting point for 500 nodes is forks=50 in ansible.cfg, combined with pipelining=True which reduces the number of SSH round-trips per module from 3 to 1. Monitor control node CPU and memory. Raise forks by 10 at a time and watch for SSH connection failures in the output, which indicate file descriptor exhaustion.
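When the control node is a CI runner, the same two settings can be supplied as environment variables instead of editing ansible.cfg. The fragment below is a sketch in GitLab CI syntax; the comment restates the batch arithmetic from this answer.

```yaml
# .gitlab-ci.yml (fragment): equivalent of forks=50 and pipelining=True
variables:
  # 500 hosts at 50 forks is about 10 batches per task, versus 100 batches at the default of 5
  ANSIBLE_FORKS: "50"
  # Pipelining reduces SSH operations per module; it requires that sudo on the
  # managed hosts does not enforce requiretty
  ANSIBLE_PIPELINING: "True"
```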
Q04 of 09 · SENIOR
What is the difference between a task and a handler? In what scenario would a handler be skipped even if it is notified by a task that reported changed?
ANSWER
Tasks run in the order written, unconditionally (unless a when clause prevents it). Handlers run at the end of a play, only once per play regardless of how many times they're notified, and only if at least one notifying task reported 'changed'.
Scenarios where a notified handler does not do its job: first, the notifying task reports 'ok' instead of 'changed'. Idempotency prevented the change, so the notification is never sent. Second, the play fails on that host before reaching the handler execution phase. Handlers are deferred to the end of the play, so a mid-play failure means they never run unless you pass --force-handlers or set force_handlers: true. Third, in --check mode a notified handler still fires, but it runs in check mode too, so the service is not actually restarted.
The meta: flush_handlers trick is important for production: if you have a config change that must be applied before the next task runs (for example, Nginx must reload before a subsequent task checks the listening port), you insert meta: flush_handlers in the task list to force immediate handler execution at that point rather than waiting for the end of the play.
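The Nginx example can be sketched as follows. The webservers group, template name, and nginx_port variable are assumed to exist in your project.

```yaml
- name: Reload nginx before verifying the port
  hosts: webservers
  become: true
  tasks:
    - name: Deploy nginx config
      ansible.builtin.template:
        src: nginx.conf.j2                # hypothetical template
        dest: /etc/nginx/nginx.conf
      notify: Reload nginx

    - name: Run any pending handlers now instead of at the end of the play
      ansible.builtin.meta: flush_handlers

    - name: Verify nginx is listening on the configured port
      ansible.builtin.wait_for:
        port: "{{ nginx_port }}"
        timeout: 30

  handlers:
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

Without the flush, the wait_for task would check the port while nginx is still running the old configuration.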
Q05 of 09 · SENIOR
How would you use Ansible Vault to manage environment-specific secrets in a CI/CD pipeline? Walk through the workflow from encrypting the variable to injecting it during a Jenkins or GitLab CI run.
ANSWER
Step 1: Create the encrypted secrets file per environment. ansible-vault create group_vars/production/vault.yml. Add db_password: real_password and webhook_url: https://hooks.slack.com/your/webhook. The file is AES256-encrypted immediately.
Step 2: Commit the encrypted file to Git. Without the vault password it's unreadable — safe in version control.
Step 3: Store the vault password in CI secrets. Jenkins: create a 'Secret text' credential named ANSIBLE_VAULT_PASSWORD. GitLab CI: add a masked CI/CD variable named ANSIBLE_VAULT_PASSWORD.
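Step 4: In the CI pipeline, write the password to a temporary file with restrictive permissions before the Ansible run: (umask 077 && echo "$ANSIBLE_VAULT_PASSWORD" > /tmp/vault_pass). Setting the umask first means the file is never briefly world-readable. Then run: ansible-playbook deploy.yml --vault-password-file /tmp/vault_pass, and delete the file when the job finishes.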
Step 5: For multiple environments, use separate vault password files — ANSIBLE_VAULT_PASSWORD_STAGING and ANSIBLE_VAULT_PASSWORD_PRODUCTION — and select the correct one based on the target environment in your pipeline logic.
Never use --ask-vault-pass in CI — it expects interactive input and hangs silently. Never echo the password directly into the ansible-playbook command — it appears in process listings and CI logs.
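Steps 3 and 4 might translate into a GitLab CI job along these lines. This is a sketch under assumptions: the job name, the .vault_pass file name, and the tag rule are illustrative, and ANSIBLE_VAULT_PASSWORD is the masked CI variable from Step 3.

```yaml
# .gitlab-ci.yml (fragment)
deploy_production:
  stage: deploy
  script:
    # Files created after this line are readable by the job user only
    - umask 077
    - printf '%s' "$ANSIBLE_VAULT_PASSWORD" > "$CI_PROJECT_DIR/.vault_pass"
    - ansible-playbook -i inventories/production playbooks/site.yml --vault-password-file "$CI_PROJECT_DIR/.vault_pass"
  after_script:
    # Runs even when the deploy fails, so the password file does not linger
    - rm -f "$CI_PROJECT_DIR/.vault_pass"
  rules:
    # Assumed release convention: only v* tags deploy, and only after manual approval
    - if: $CI_COMMIT_TAG =~ /^v/
      when: manual
```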
Q06 of 09 · SENIOR
What are Ansible facts? How can you disable fact gathering to speed up playbook execution, and when do you actually need them?
ANSWER
Facts are system information collected automatically by the setup module at the start of every play — OS distribution, IP addresses, disk partitions, memory, CPU, uptime, Python interpreter path. Ansible gathers facts before the first task runs, which means one SSH call per host before any work starts. For 500 hosts, that's 500 additional SSH calls adding 15-30 seconds of pure overhead before anything useful happens.
Disable with gather_facts: no at the play level. You lose nothing if your playbook doesn't use fact variables.
When you need facts: conditional task execution based on OS (when: ansible_os_family == 'Debian'), using the primary IP address in templates ({{ ansible_default_ipv4.address }}), checking available memory before a memory-intensive operation, or selecting the correct package manager. When you need only a subset of facts, use gather_subset: min, which collects basics such as OS family, hostname, and package manager but skips network, hardware, and virtual machine details, and is noticeably faster than full fact gathering. Note that ansible_default_ipv4 comes from the network subset, so add network if a template needs it.
For large fleets where you need facts, cache them: set gathering = smart and fact_caching = jsonfile in ansible.cfg, with fact_caching_connection pointing at a cache directory and fact_caching_timeout set to the cache lifetime in seconds. Facts are re-gathered only when the cache expires, not on every run.
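The two play-level options can be sketched side by side. Group names, the service name, and the package are illustrative.

```yaml
# Play 1: no fact variables used, so skip gathering entirely
- name: Restart the application
  hosts: appservers
  gather_facts: false
  become: true
  tasks:
    - name: Restart app service
      ansible.builtin.service:
        name: app                 # hypothetical service name
        state: restarted

# Play 2: only the OS family is needed, so gather the minimal subset
- name: Install a package per OS family
  hosts: all
  become: true
  gather_facts: true
  gather_subset:
    - min
  tasks:
    - name: Install curl on Debian-family hosts
      ansible.builtin.apt:
        name: curl
        state: present
      when: ansible_os_family == 'Debian'
```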
Q07 of 09 · SENIOR
Explain how dynamic inventory works with a cloud provider like AWS. What are the advantages over a static inventory file, and what challenges does it introduce?
ANSWER
Dynamic inventory uses a plugin (amazon.aws.aws_ec2, gcp_compute, azure_rm) that queries the cloud provider API at runtime and returns host lists in Ansible's expected JSON format. You configure the plugin with a YAML file (aws_ec2.yml) that specifies regions, filters (running instances only, specific environment tags), and keyed_groups (group instances by tag values like Role or Environment).
Advantages: the inventory is never stale. New instances appear automatically. Terminated instances disappear. Autoscaling group members are always correct. You can target specific subsets with tag filters without maintaining any files.
Challenges: API rate limits — hitting EC2 DescribeInstances on every playbook run can throttle, especially with multiple pipelines running simultaneously. Startup latency — a static file is instant; dynamic inventory takes 2-5 seconds per API call. Credential management — the control node needs IAM permissions or access keys configured. API availability — if the cloud provider API is slow or returns an error, your inventory fails and no playbook runs.
Mitigation: enable the inventory cache in the plugin config (cache: true, cache_timeout: 300). The API is queried once every 5 minutes and results are stored locally. New instances may take up to 5 minutes to appear, which is acceptable for most workflows and eliminates the rate limit and latency problems.
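A plugin configuration along the lines described might look like this. The region, tag names, and cache directory are assumptions for illustration; the file name must end in aws_ec2.yml or aws_ec2.yaml for the plugin to pick it up.

```yaml
# inventories/production/aws_ec2.yml
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                       # assumed region
filters:
  instance-state-name: running      # running instances only
  tag:Environment: production       # assumed tagging scheme
keyed_groups:
  # Creates groups such as role_web or role_db from each instance's Role tag
  - key: tags.Role
    prefix: role
# Cache API results locally for 5 minutes to avoid throttling
cache: true
cache_plugin: ansible.builtin.jsonfile
cache_connection: /tmp/aws_inventory_cache   # hypothetical cache directory
cache_timeout: 300
```

Running ansible-inventory -i inventories/production/aws_ec2.yml --graph is a quick way to confirm that the filters and groups resolve as expected.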
Q08 of 09 · SENIOR
Describe the difference between include_role and import_role. When would you choose one over the other, and how does each affect task execution order and variable scope?
ANSWER
import_role is static: Ansible processes the role at playbook parsing time, before any tasks execute. All tasks, variables, and handlers from the role are loaded into the play's task list immediately. include_role is dynamic: the role is processed at runtime when the task queue reaches that line.
The practical consequences: a when condition on import_role is not evaluated once for the role as a whole. It is copied onto every task inside the role and evaluated per task, which surprises people when an early task in the role changes the value the condition depends on. import_role also cannot be used in a loop. include_role evaluates when once, at the point of inclusion, and can be used in loops to apply the same role with different variables multiple times.
Variable scope: import_role exposes the role's defaults and vars to the rest of the play, so later tasks in the same play can reference them. include_role keeps them private to the role execution by default; set public: true on the include_role task (or use set_fact inside the role) to make them visible afterwards.
Use import_role for roles that are always needed unconditionally and whose variables should be globally available. Use include_role when the role is conditionally applied, used in a loop, or when you want to apply it with different parameters across multiple invocations. If you're not sure, import_role is the safer default — its behavior is more predictable because it's resolved at parse time.
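A compact sketch of the three cases follows. The role names come from the project layout used elsewhere in this article, while deploy_app and app_component are hypothetical variables.

```yaml
- name: Static and dynamic role usage
  hosts: webservers
  tasks:
    - name: Always apply nginx (resolved at parse time)
      ansible.builtin.import_role:
        name: nginx

    - name: Apply the app role only when asked (condition checked once, at runtime)
      ansible.builtin.include_role:
        name: app
        public: true              # expose the role's vars to later tasks
      when: deploy_app | default(false) | bool

    - name: Apply the app role once per component
      ansible.builtin.include_role:
        name: app
      vars:
        app_component: "{{ item }}"
      loop:
        - api
        - worker
```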
Q09 of 09 · SENIOR
How would you structure an Ansible project to manage 500+ servers across dev, staging, and production environments? Describe your directory layout, variable hierarchy, and how you'd prevent production changes from accidentally running against dev.
ANSWER
Directory structure:
ansible/
├── ansible.cfg                (forks=50, pipelining=True, roles_path=roles/)
├── inventories/
│   ├── dev/
│   │   ├── aws_ec2.yml        (dynamic inventory plugin config)
│   │   └── group_vars/
│   │       ├── all.yml        (dev-wide defaults)
│   │       └── webservers.yml (dev webserver-specific vars)
│   ├── staging/               (same structure as dev)
│   └── production/
│       ├── aws_ec2.yml
│       └── group_vars/
│           ├── all/
│           │   ├── vars.yml   (production-wide defaults)
│           │   └── vault.yml  (ansible-vault encrypted secrets)
│           └── webservers.yml
├── roles/
│   ├── nginx/
│   ├── app/
│   └── requirements.yml       (Galaxy roles with pinned versions)
└── playbooks/
    └── site.yml
Variable hierarchy: all.yml for environment-wide defaults, webservers.yml for group-specific values, and a vault-encrypted file for secrets. Keep the encrypted file somewhere Ansible loads it for every host, such as group_vars/all/vault.yml; a bare group_vars/vault.yml only applies to a group named vault. Treat host_vars as a code smell requiring a documented justification comment.
Preventing production accidents: CI pipelines are branch-scoped. Commits to feature branches can only trigger runs against dev inventory. Merges to main can only trigger staging. Only tags with the v* pattern can trigger production, and production runs require a manual approval step. The inventory path is never hardcoded in playbooks; it's always passed as -i inventories/$ENV where ENV is set by the CI pipeline based on branch or tag. For extra safety, add a task at the top of site.yml that fails a production run when no --limit was given: fail when ansible_limit is not defined and the control node's CI_ENVIRONMENT variable equals 'production'. Read that variable with lookup('env', 'CI_ENVIRONMENT'), since ansible_env describes the managed host, not the control node.
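The --limit guard could be written as a first play in site.yml. This is a sketch: CI_ENVIRONMENT is the variable name used in this answer and is assumed to be exported by the pipeline on the control node.

```yaml
# playbooks/site.yml (first play)
- name: Refuse unscoped production runs
  hosts: all
  gather_facts: false
  any_errors_fatal: true          # one failed assertion stops the run for every host
  tasks:
    - name: Require --limit when CI targets production
      ansible.builtin.assert:
        that:
          - ansible_limit is defined
        fail_msg: "Production runs must pass --limit"
      run_once: true
      # lookup() reads the control node's environment, not the managed host's
      when: lookup('ansible.builtin.env', 'CI_ENVIRONMENT') == 'production'
```

The play targets all hosts with run_once rather than localhost because --limit also filters localhost out of a run, which would silently skip the guard.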
FAQ · 8 QUESTIONS
Frequently Asked Questions
01
What is the difference between an ad-hoc command and a playbook in Ansible?
An ad-hoc command is a single one-liner executed directly from the command line — ideal for quick checks or one-off operations like restarting a service or checking disk space across your fleet. A playbook is a reusable, version-controlled YAML file that defines a sequence of tasks with variables, handlers, and error handling. Think of ad-hoc commands as shouting instructions across the room, and playbooks as writing a detailed runbook that anyone can execute repeatedly with the same result. The rule of thumb: if you've run the same ad-hoc command twice, it belongs in a playbook.
02
How does Ansible handle secrets and sensitive data?
Ansible provides Ansible Vault, which encrypts variables or entire files using AES256. Encrypt individual strings with ansible-vault encrypt_string and paste them into your playbooks, or encrypt entire variable files with ansible-vault encrypt. At runtime, provide the vault password via --vault-password-file pointing to a file written from a CI secret. Vault-encrypted content is safe to commit to Git: without the password it's gibberish. For larger teams, integrate Ansible with HashiCorp Vault using the community.hashi_vault.hashi_vault lookup plugin, which fetches secrets at runtime from a centralized secrets manager rather than storing them in encrypted files.
03
What is dynamic inventory in Ansible, and when should you use it?
Dynamic inventory queries an external source, typically a cloud provider API like AWS EC2, GCP, or Azure, at runtime instead of reading a static file. Ansible builds the host list from live API data based on tags, regions, and instance states. Use dynamic inventory when your infrastructure is elastic: autoscaling groups, spot instances, or any environment where servers are created and destroyed regularly. Static inventory works for fixed infrastructure under 20 servers with stable hostnames. Beyond that, a static file becomes a liability: stale IPs, missing new instances, terminated hosts that are still listed. Enable the inventory cache in the plugin config (cache: true, cache_timeout: 300) to avoid rate limiting the cloud API on every run.
04
How do you handle errors and rollbacks in Ansible playbooks?
Ansible provides a block/rescue/always construct that works like try/catch/finally. Wrap risky operations in a block. If any task inside fails, the rescue section executes: roll back to a known-good state, send an alert, log the failure context. The always section runs regardless of success or failure: cleanup, status notifications. For rolling deployments, combine this with serial (how many hosts to update at once) and max_fail_percentage (abort the entire deploy if too many hosts fail). Set max_fail_percentage: 0 for database migrations, where any failure should stop everything. Without block/rescue, a one-at-a-time migration that fails on server 3 of 20 leaves 2 servers on the new schema, 1 half-migrated, and 17 still on the old one, with the application broken and no automatic recovery.
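A rolling deploy that combines these keywords might be sketched like this. The deploy and rollback scripts are hypothetical placeholders for your own release tooling.

```yaml
- name: Rolling deploy with a hard stop on any failure
  hosts: webservers
  serial: 2                       # update two hosts at a time
  max_fail_percentage: 0          # any failed host aborts the remaining batches
  tasks:
    - name: Deploy the release, rolling back this host on failure
      block:
        - name: Run the deploy script
          ansible.builtin.command:
            cmd: /opt/app/bin/deploy.sh          # hypothetical script
      rescue:
        - name: Roll this host back to the previous release
          ansible.builtin.command:
            cmd: /opt/app/bin/rollback.sh        # hypothetical script

        - name: Re-raise so the batch counts as failed
          ansible.builtin.fail:
            msg: "Deploy failed on {{ inventory_hostname }}; rolled back"
```

The re-raise is what lets max_fail_percentage see the failure; a rescue that succeeds quietly would let the rollout continue to the next batch.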
05
What is the difference between Ansible and Terraform? Do I need both?
They solve different problems at different points in a server's life. Terraform provisions infrastructure — it creates EC2 instances, VPCs, load balancers, DNS records, and IAM roles. Ansible configures that infrastructure — it installs software, deploys application code, manages services, and corrects configuration drift. Terraform's user_data and cloud-init can run a script at first boot, but they can't re-run idempotently three months later when you need to update a config file. Ansible can. Most production teams use Terraform to build the infrastructure and Ansible to configure and maintain it. They're complementary tools in the same pipeline, not alternatives.
06
How do you test Ansible playbooks before running them in production?
Use --check mode for a dry run: Ansible shows what would change without applying anything. Combine it with --diff to see exact file content differences. For automated testing, use Molecule: it spins up Docker containers or VMs, runs your role, verifies the result (with Ansible assertions by default, or with Testinfra), and tears everything down. Run Molecule in CI to catch regressions before they reach any environment. Also run ansible-lint on all playbooks and roles to catch deprecated modules, style violations, and common structural mistakes. The combination of --check, --diff, Molecule, and ansible-lint catches the vast majority of problems before a human needs to review them.
07
What is Ansible Galaxy, and should I use community roles?
Ansible Galaxy is a repository of community-contributed roles for common infrastructure software — Nginx, Docker, PostgreSQL, certbot, Redis, and hundreds more. Install with ansible-galaxy install -r requirements.yml. Community roles save hours for commodity software and are often more battle-tested than what you'd write from scratch. For application-specific automation — deploying your Java app, configuring your monitoring stack — write custom roles. The mandatory practice: pin every Galaxy role to a specific version in requirements.yml. A community role is a dependency you don't control. A minor version update can change default behavior in ways that affect production. Pin it, test upgrades in staging, read the changelog before bumping the version.
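A pinned requirements file might look like the sketch below. The version numbers are placeholders, not tested or recommended releases; substitute versions you have validated in staging.

```yaml
# roles/requirements.yml
roles:
  - name: geerlingguy.nginx
    version: "1.2.3"        # placeholder; pin to a release you have tested
  - name: geerlingguy.docker
    version: "4.5.6"        # placeholder; pin to a release you have tested
```

Bumping a version then becomes a reviewable one-line change, the same way a dependency upgrade would be in application code.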
08
How does Ansible perform on very large fleets (1000+ servers)?
Ansible's parallelism scales with the forks setting in ansible.cfg (default: 5, which is too low for large fleets). For 1000 servers, start at forks=50 and monitor control node CPU, memory, and open file descriptor counts. Enable pipelining=True to reduce SSH round-trips per module from 3 to 1 — this alone can cut playbook runtime by 30-40%. Disable fact gathering for playbooks that don't need system facts, or use gather_subset=min to collect only essential information. For operational visibility at scale — job scheduling, RBAC, audit logging, workflow orchestration, and a web UI — deploy AWX (the open-source version) or Ansible Automation Platform. Plain Ansible from the command line works at 1000+ nodes, but AWX gives you the operational control that large teams need to manage concurrent jobs safely.