Senior 17 min · March 06, 2026

Ansible state:latest — One Task Broke Payments for 47 Min

Nginx 1.24→1.26 broke TLS handshakes with payment processors across 50 servers in 90 seconds.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.

Last updated: May 24, 2026
Quick Answer
  • Ansible is agentless configuration management — SSH in from a control node, no software installed on targets
  • Three core concepts: Inventory (which servers), Playbooks (what state they should be in), Modules (how to get there)
  • Idempotency means running the same playbook 100 times produces the same result as running it once — only true if you use proper modules like apt, template, and service instead of shell
  • Performance trade-off: agentless means zero agent maintenance on targets but higher SSH overhead on the control node; default forks=5 is too low for any real fleet
  • Production trap: 'state: latest' installs whatever the package mirror serves that day — a Monday morning playbook run can silently upgrade Nginx across 50 servers and break TLS configuration you never touched
  • Biggest mistake: skipping handlers and using a plain 'service: state=restarted' task — that restarts Nginx every single run, even when the config file didn't change, which means unnecessary downtime on every playbook execution
Definition
What is Ansible Basics?

Ansible is an open-source IT automation engine that eliminates manual work for configuration management, application deployment, and task orchestration. Unlike Chef or Puppet which use a client-server model with agents on every node, Ansible is agentless — it connects via SSH (or WinRM for Windows) and pushes declarative instructions written in YAML.

This means you don't install or maintain background services on managed servers, reducing attack surface and operational overhead. Red Hat acquired Ansible in 2015, and it is now one of the most widely adopted configuration management tools.

Ansible's core architecture is deceptively simple: a control node (any Unix-like machine with Python, such as Linux or macOS) runs ansible-playbook or ansible commands against an inventory file listing managed hosts. Modules, standalone scripts that enforce desired state, handle everything from package installation (apt, yum) to service management (systemd) to cloud provisioning (amazon.aws.ec2_instance, google.cloud.gcp_compute_instance).

Where Puppet and Chef agents pull their configuration on a schedule, Ansible pushes changes from the control node, which gives you immediate feedback and makes ad-hoc operations trivial. With forks raised well above the default of 5, ansible all -m ping can test connectivity across 500 servers in a matter of seconds.

Where Ansible falls short is real-time event-driven automation and Windows-heavy environments. SaltStack's event bus and Chef's compliance reporting are more mature for continuous drift detection. For Windows, Ansible requires WinRM configuration and lacks the native module depth of PowerShell DSC.

But for Linux-centric shops running standard workflows — patching, config files, service restarts — Ansible's YAML readability and zero-agent overhead make it the pragmatic choice. The tradeoff is performance: agent-based tools cache facts locally, while Ansible gathers them fresh on each run, which can slow down playbooks on 1000+ node fleets.

Plain-English First

Imagine you're a school principal and you need to deliver the same set of instructions to 500 students spread across 10 classrooms. You wouldn't walk into each room yourself, repeat yourself 500 times, and hope you said it the same way each time. You'd write a single instruction sheet and hand it to every teacher simultaneously — they deliver the message to their rooms in parallel, in exactly the same words, every time. Ansible works exactly like that. You write your instructions once in a file called a playbook, tell Ansible which servers to target in an inventory file, and it SSHes into all of them concurrently and executes everything in the order you specified. No agent software on any server. No daemons to babysit. Just SSH and a clean YAML file that describes the world you want to exist.

Every cloud infrastructure beyond a certain size hits the same wall. Someone on the team is spending their Friday afternoon manually SSH-ing into 30 servers, running the same five commands in sequence, and quietly hoping they didn't typo on server 24. It's slow. It's completely unauditable. And it doesn't scale — not to 100 servers, not to three environments, certainly not to a team of ten engineers who all have slightly different opinions about how to run that one sed command.

Scale that manual process to hundreds of EC2 instances or GCP VMs and the problem stops being annoying and starts being a business risk. An outage caused by configuration drift — two servers out of forty that silently diverged from the others — is nearly impossible to diagnose if you have no record of what was changed, when, and by whom.

That's the exact world Ansible was built to fix. It enforces a declared, version-controlled state across every machine in your fleet simultaneously, starting from nothing more than an SSH key and a YAML file checked into Git.

But Ansible has traps that aren't obvious from the documentation. state: latest looks safe until it upgrades Nginx on a Monday morning and changes a default TLS cipher suite. Handlers look optional until you realize your playbook has been restarting your app server on every run for three months. Roles look like bureaucratic overhead until your playbook hits 300 lines and two engineers are editing conflicting sections.
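To make the handler trap concrete, here is a minimal sketch of restart-on-change. The group name web_servers and the template file nginx.conf.j2 are assumptions for illustration, not files this article ships.

```yaml
---
# Sketch: restart via a handler instead of an unconditional restart task.
# Assumes an inventory group named web_servers and a template nginx.conf.j2
# stored next to this playbook.
- name: Configure Nginx without restarting on every run
  hosts: web_servers
  become: true
  tasks:
    - name: Deploy Nginx configuration
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
        mode: "0644"
      notify: Restart Nginx   # queued only when the rendered file differs

    # Anti-pattern, left commented out: this would restart Nginx on every run.
    # - name: Restart Nginx
    #   ansible.builtin.service:
    #     name: nginx
    #     state: restarted

  handlers:
    - name: Restart Nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```

On a second run with an unchanged template, the template task reports ok and the handler never fires, so Nginx keeps serving traffic.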

By the end of this article you'll understand not just how to write playbooks, but why they're structured the way they are. You'll know how inventory files map to real cloud environments, how roles package automation that other teams can actually reuse, and how handlers restart services only when config genuinely changed — not on every run. You'll leave with the mental model and the production lessons that take most engineers two or three incidents to learn the hard way.

Why Ansible Basics Are Not Optional

Ansible is an agentless automation tool that uses SSH (or WinRM) to push declarative state to remote hosts. You write YAML playbooks describing the desired state (package installed, service running, file present) and Ansible idempotently converges the system to match. No agents, no daemons, no persistent connections: it spins up, executes, and tears down. The core mechanic is a push-based, stateless model where each playbook run is a fresh transaction against the target inventory. This means zero overhead on managed nodes, but also nothing that detects or corrects drift between runs.

In practice, Ansible gathers facts first and then applies tasks in order, running each task across the targeted hosts before moving to the next. Idempotency is not magic: it depends on modules checking current state before acting (for example, the yum module checks whether a package is already installed). If a module doesn't support check mode or fails to detect state correctly, you get unintended side effects. The control node does all the heavy lifting; targets only need Python and SSH. This makes Ansible fast to adopt but slow at scale: with the default forks=5, a 1000-host run proceeds in 200 batches per task unless you tune forks or use async.
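The difference between a state-checking module and a raw command is easiest to see side by side. The sketch below expresses the same intent twice; the group name web_servers is a placeholder.

```yaml
---
# Sketch: one intent, two implementations. Only the first is idempotent.
- name: Compare a state-checking module with a raw shell command
  hosts: web_servers
  become: true
  tasks:
    # The apt module inspects the package database first and reports
    # "ok" (no change) when nginx is already installed.
    - name: Ensure nginx is installed (idempotent)
      ansible.builtin.apt:
        name: nginx
        state: present

    # The shell module cannot know what the command did, so it reports
    # "changed" on every run and is skipped under --check.
    - name: Install nginx via shell (not idempotent, avoid)
      ansible.builtin.shell: apt-get install -y nginx
```

Running this playbook twice should show the first task as ok on the second pass while the second task still reports changed.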

Use Ansible when you need to bootstrap, configure, or enforce state across a fleet without installing permanent infrastructure. It excels at one-shot provisioning, compliance checks, and ad-hoc fixes. But it is not a real-time configuration daemon—if you need continuous enforcement, pair it with a pull-based tool like Chef or a GitOps operator. The moment you treat Ansible as a live monitoring system, you will miss drift until the next playbook run.

Idempotency Is a Contract, Not a Guarantee
A task that sets 'state: latest' upgrades the package whenever the mirror publishes a newer version, so two runs of the same playbook can leave different software installed and trigger unexpected restarts.
Production Insight
A team used 'state: latest' for a payment service dependency, causing an automatic upgrade mid-day that changed the API contract.
Symptom: 47 minutes of 500 errors and failed transactions before the rollback playbook completed.
Rule: Pin all production dependencies to a specific version—never use 'latest' in a playbook that touches live systems.
Key Takeaway
Ansible is push-based and stateless—every run starts from scratch, so you must manage drift externally.
Idempotency depends on the module, not the tool—test each module's behavior with '--check' and '--diff'.
Never use 'state: latest' in production—always pin versions to prevent surprise upgrades.
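One way to apply the pinning rule is sketched below. The variable name nginx_version is hypothetical, and the 1.24.* pattern simply mirrors the version line discussed in this article rather than a tested release string.

```yaml
---
# Sketch: pin the package instead of tracking whatever the mirror serves.
- name: Install a pinned Nginx
  hosts: web_servers            # placeholder group name
  become: true
  vars:
    nginx_version: "1.24.*"     # hypothetical variable; adjust to your release
  tasks:
    - name: Install Nginx at the pinned version
      ansible.builtin.apt:
        name: "nginx={{ nginx_version }}"
        state: present          # never "latest" on live systems
        update_cache: true

    - name: Hold the package so an unrelated apt upgrade cannot move it
      ansible.builtin.dpkg_selections:
        name: nginx
        selection: hold
```

A held package also resists intentional version changes, so a later planned upgrade likely needs the hold lifted first or the apt module's allow_change_held_packages option.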
Figure: Flow from idempotent playbook design to production failure. Idempotent playbook design (tasks are repeatable without side effects) → state:latest in a package task (forces an upgrade even if the package is already present) → unplanned package upgrade (breaks the dependency chain for the payment service) → payment service outage (47 minutes of failed transactions) → rollback to state:present (restores idempotent behavior and stability). Avoid state:latest in production playbooks; use state:present or pin an exact version.

Step-by-Step Ansible Installation on Ubuntu 22.04 — Control Node Setup That Lasts

Before you write a single playbook, you need a control node. This is the machine from which you'll run all Ansible commands. It can be your laptop, a dedicated jump box, or a CI runner. The installation method you choose has operational consequences for upgrade cycles and environment consistency.

Three installation methods compete for your attention. The Python package manager (pip) is the most flexible and lets you pin exact versions. The distribution's apt repository gives you system integration and automatic updates. The newer pipx method isolates Ansible in its own virtual environment and is the official Python Packaging Authority (PyPA) recommendation for installing CLI tools.

For production control nodes — dedicated VMs or CI runners — pip installation inside a Python virtual environment is the standard. It gives you version pinning (critical for consistency), isolation from the system Python, and easy upgrades via requirements files. The following sequence sets up Ansible in a virtual environment under /opt/ansible, with a symlink in /usr/local/bin for global access.

After installation, you configure the control node's SSH access. Ansible needs to reach every managed server via SSH with key-based authentication. The common failure point is SSH host key checking. When you connect to a server for the first time, Ansible's default behavior is to verify the host key against ~/.ssh/known_hosts. In dynamic cloud environments where IPs are recycled, an unknown or changed host key stops the run with a prompt or a verification failure. The production fix is to manage known_hosts via a pre-seeded file, or to set host_key_checking=False in ansible.cfg while accepting that this removes protection against man-in-the-middle attacks.
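As a rough illustration of the pre-seeded approach, the following sketch records host keys on the control node before any real playbook runs. The addresses are placeholders, and it assumes ssh-keyscan is installed locally and that the scan happens over a network you trust, since a scan is trust-on-first-use rather than proof of identity.

```yaml
---
# Sketch: pre-seed known_hosts so host key checking can stay enabled.
- name: Pre-seed SSH host keys for managed nodes
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    managed_hosts:              # placeholder addresses
      - 10.0.1.1
      - 10.0.1.2
  tasks:
    - name: Record each host's ed25519 key
      ansible.builtin.known_hosts:
        name: "{{ item }}"
        # ssh-keyscan prints "<host> <keytype> <key>", the format this module expects
        key: "{{ lookup('ansible.builtin.pipe', 'ssh-keyscan -t ed25519 ' ~ item) }}"
        state: present
      loop: "{{ managed_hosts }}"
```

For stronger guarantees, the key could come from a trusted source such as the cloud provider's console output instead of a network scan.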

The inventory file is your first configuration file. It lists your managed nodes and groups them logically. For testing, a one-line inventory with a single server is enough. In production, you'll use dynamic inventory plugins that query cloud APIs.
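For reference, a small static inventory in YAML format might look like the sketch below. The host names, addresses, and the deploy login are placeholders, not real machines.

```yaml
# inventory.yml: a minimal static inventory sketch.
all:
  children:
    web_servers:
      hosts:
        web-01:
          ansible_host: 10.0.1.1
        web-02:
          ansible_host: 10.0.1.2
      vars:
        http_port: 8080          # group variable visible to every web server
    db_servers:
      hosts:
        db-01:
          ansible_host: 10.0.2.1
    production:
      children:                  # a group made of other groups
        web_servers:
        db_servers:
  vars:
    ansible_user: deploy         # hypothetical SSH login used for every host
```

Running ansible-inventory -i inventory.yml --graph prints the resulting group tree, which is a quick way to confirm the file parses as intended.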

To verify installation, run ansible all -i 'localhost,' -m ping -c local. This pings the control node itself without SSH, confirming the Ansible engine works.

BASH
# ── Option A: pip in a virtual environment (recommended for dedicated control nodes) ──
sudo apt update && sudo apt install python3-venv python3-pip -y
sudo python3 -m venv /opt/ansible
# The venv is owned by root, so pip must also run under sudo
sudo /opt/ansible/bin/pip install --upgrade pip
sudo /opt/ansible/bin/pip install ansible-core==2.17.4
sudo ln -s /opt/ansible/bin/ansible* /usr/local/bin/

# Verify
ansible --version

# ── Option B: apt (simple but version lags) ──
sudo apt update
sudo apt install software-properties-common -y
sudo add-apt-repository --yes --update ppa:ansible/ansible
sudo apt install ansible -y

# ── Option C: pipx (isolated CLI tool, good for laptops) ──
sudo apt install pipx -y
pipx ensurepath
pipx install ansible-core==2.17.4

# ── After any method: configure ansible.cfg ──
mkdir -p ~/ansible
cat > ~/ansible/ansible.cfg << 'EOF'
[defaults]
inventory = inventory
host_key_checking = False
pipelining = True
forks = 25
EOF

# ── Create a test inventory ──
echo 'localhost ansible_connection=local' > ~/ansible/inventory

# ── Test the ping module ──
cd ~/ansible
ansible all -i inventory -m ping
Always run Ansible from a Python virtual environment, not the system Python
System Python is used by the OS package manager. If you install Ansible with pip system-wide, you risk breaking the system Python when you upgrade Ansible or its dependencies. The virtual environment under /opt/ansible is self-contained. If you need to roll back an Ansible version, you delete the venv and recreate it — no system pollution. In CI, use a fresh venv per pipeline run, pinned to the exact same Ansible version your team uses locally.
Production Insight
The most common installation failure in CI is an Ansible version mismatch. Your laptop runs ansible-core 2.17.4, but the CI container or runner has 2.14.0 from the base OS. Module behavior changes across major versions. The fix: always pin the Ansible version in your project's requirements.txt or Dockerfile. For cloud VMs used as persistent control nodes (e.g., a Jenkins agent), recreate the venv from a locked requirements file after any base OS update that might have pulled in a newer Python version.
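A cheap guard against this mismatch is to have the playbook itself refuse to run on the wrong version. This is a sketch only; expected_ansible_version is a hypothetical variable name, and its value mirrors the version pinned in the install commands above.

```yaml
---
# Sketch: abort early when the control node runs an unexpected ansible-core.
- name: Guard against Ansible version drift
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    expected_ansible_version: "2.17.4"   # hypothetical variable name
  tasks:
    - name: Check ansible-core version
      ansible.builtin.assert:
        that:
          - ansible_version.full is version(expected_ansible_version, '==')
        fail_msg: >-
          Expected ansible-core {{ expected_ansible_version }},
          found {{ ansible_version.full }}.
```

Placing this play first in a pipeline turns a subtle behavior difference into an immediate, readable failure.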
Key Takeaway
Install Ansible in an isolated Python virtual environment on your control node, pin a specific ansible-core version in a requirements file, and configure ansible.cfg with host_key_checking=False, pipelining=True, and forks=25 before writing any playbooks.

Ansible vs Chef vs Puppet vs SaltStack — Choosing the Right Configuration Management Tool

If you're new to configuration management, the first question is not 'how do I use Ansible' but 'should I use Ansible at all?' Chef, Puppet, and SaltStack are the three other major players, and each has strengths that match different operational philosophies. The right choice depends on your team's existing skills, infrastructure scale, and whether you prefer a push or pull model.

Ansible is the only major tool that is agentless — it connects to managed nodes over SSH (or WinRM) and executes tasks without installing any software. This makes initial setup trivial: if you have SSH keys, you have Ansible. The trade-off is that every playbook run opens new SSH connections, which creates overhead at scale. Ansible uses YAML for its playbooks, which is the easiest language for non-programmers to read and write. Its push model means you initiate changes from a central control node, which is natural for ad-hoc operations and CI/CD pipelines.

Chef uses a pull model: a client agent on each node periodically fetches the desired state from a Chef server. This is more resilient in environments where nodes are behind firewalls or have intermittent connectivity. Chef uses a Ruby DSL for its cookbooks, which is more powerful but has a steeper learning curve. Its Test Kitchen framework is among the most mature testing tools in the CM space. Chef is strong for organizations that need a full audit trail and have dedicated platform teams.

Puppet also uses a pull model with an agent and is the oldest of the four. It has its own declarative language (Puppet DSL) that is designed for idempotency from the ground up. Puppet's module ecosystem (Puppet Forge) is vast, and its reporting capabilities are excellent. The downside is the complexity of running a Puppet server and the agent overhead on each node.

SaltStack (Salt) offers both push and pull modes via its ZeroMQ message bus, making it extremely fast at scale — it can manage thousands of nodes in seconds. It uses YAML or Python for its states (SLS files) and includes a powerful event-driven reactor system. Salt's master-minion architecture requires an agent and a master, but the agent is lightweight. It's popular in high-performance computing and environments that need real-time command execution.

The table below summarizes the key differences so you can make an informed decision based on your team's context.

HTML
<table>
  <thead>
    <tr>
      <th>Feature</th>
      <th>Ansible</th>
      <th>Chef</th>
      <th>Puppet</th>
      <th>SaltStack</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>Architecture</td><td>Agentless (push)</td><td>Agent (pull)</td><td>Agent (pull)</td><td>Agent (push/pull)</td></tr>
    <tr><td>Language</td><td>YAML</td><td>Ruby DSL</td><td>Puppet DSL</td><td>YAML / Python</td></tr>
    <tr><td>Setup complexity</td><td>Low (SSH only)</td><td>High (server + agent)</td><td>High (server + agent)</td><td>Medium (master-minion)</td></tr>
    <tr><td>Learning curve</td><td>Lowest</td><td>Medium-High</td><td>Medium</td><td>Medium</td></tr>
    <tr><td>Scale performance</td><td>Moderate (SSH bottleneck)</td><td>Good (background agent)</td><td>Good</td><td>Excellent (ZeroMQ)</td></tr>
    <tr><td>Idempotency enforcement</td><td>Module-dependent</td><td>Built-in resource model</td><td>Built-in</td><td>Module-dependent</td></tr>
    <tr><td>Testing ecosystem</td><td>Molecule (evolving)</td><td>Test Kitchen (mature)</td><td>Beaker (mature)</td><td>Test Kitchen (via kitchen-salt)</td></tr>
    <tr><td>Best for</td><td>Small-medium fleets, CI/CD, ad-hoc</td><td>Large enterprises, audit-heavy</td><td>Large fleets with strict compliance</td><td>Massive fleets, HPC, real-time</td></tr>
  </tbody>
</table>
Don't overthink the choice — start with Ansible, graduate if needed
Ansible's low barrier to entry means you can be productive in hours, not days. If you later hit scalability limits or need advanced compliance reporting, you can migrate to Chef or Puppet incrementally. Most teams never outgrow Ansible. The tool selection matters far less than having any configuration management at all.
Production Insight
In practice, the decision often comes down to whether your organization has a dedicated platform team. Teams without one overwhelmingly choose Ansible because a single DevOps engineer can maintain it. Teams with a platform team often choose Chef or Puppet for the robust testing frameworks and audit trails. SaltStack is the dark horse — it's fast, but its event-driven model requires a mindset shift that many teams don't take. The most important production consideration: whichever tool you pick, use it to manage everything from day one. A hybrid toolset creates more drift than no tool at all.
Key Takeaway
Ansible's agentless design and YAML-based playbooks make it the easiest configuration management tool to get started with; switch to Chef, Puppet, or SaltStack only if you outgrow its scalability or require deeper compliance features.

Ansible Architecture — How the Control Node, Inventory, and Managed Nodes Interact

Understanding Ansible's architecture is the foundation for debugging connection issues, scaling automation, and choosing the right deployment model. The architecture is deceptively simple: a control node runs Ansible, reads an inventory, and connects to managed nodes via SSH. But the simplicity hides a few sharp edges that only show up in production at scale.

The control node is any machine with Ansible installed — your laptop, a build server, a dedicated jump host. It's the single point of failure in the architecture. If your control node goes down, you cannot run any automation until it's restored. This is why production setups use multiple control nodes in a load-balanced fashion or rely on a CI/CD platform that can re-run jobs from any agent.

The inventory is the source of truth for which nodes exist and how they're grouped. Static inventory files map hostnames to IP addresses. Dynamic inventory plugins query cloud provider APIs and build the host list at runtime. The inventory also holds variables that travel with hosts into playbooks.
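As an example of the dynamic approach, a configuration for the amazon.aws.aws_ec2 inventory plugin could look roughly like this. It assumes the amazon.aws collection and boto3 are installed and that AWS credentials are available in the environment; the region and tag names are placeholders.

```yaml
# production.aws_ec2.yml: dynamic inventory sketch.
# The plugin only recognizes files whose names end in aws_ec2.yml or aws_ec2.yaml.
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  instance-state-name: running
  tag:Environment: production    # placeholder tag key and value
keyed_groups:
  # Builds groups such as role_web from a "Role" tag on each instance.
  - key: tags.Role
    prefix: role
hostnames:
  - private-ip-address
compose:
  ansible_host: private_ip_address
```

Passing this file with -i makes Ansible query the API at runtime, so newly launched instances appear without anyone editing a host list.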

Managed nodes are the target servers. They need SSH access from the control node and Python 3 installed. That's it. No agent, no daemon, no open ports beyond SSH. This is the biggest architectural advantage over agent-based tools: you can manage any server that's reachable over SSH, including on-premise machines, cloud VMs, containers (via Docker exec), and even Windows via WinRM.

Ansible's execution model is push-based. You run a command on the control node, Ansible opens SSH connections to each managed node in parallel (controlled by the forks setting), copies the Python module code over, executes it, collects JSON results, and closes the connection. There is no persistent connection. This simplicity means Ansible is stateless from the managed node's perspective, but it also means every playbook run pays the SSH connection overhead.

The following diagram visualizes the flow during a typical playbook run. The control node reads the playbook and inventory, resolves variables, and then runs plays one after another. Within a play, each task runs on all targeted hosts in parallel, up to forks concurrent connections, before the next task starts.

TEXT
┌─────────────────────────────────────────────────────────────────────┐
│                     CONTROL NODE (Ansible CLI)                      │
│  ┌─────────────┐   ┌──────────────┐   ┌────────────────────────┐   │
│  │  Playbook    │   │  Inventory    │   │  ansible.cfg          │   │
│  │  (YAML)      │   │  (static/    │   │  forks=25             │   │
│  │  - hosts     │──▶│   dynamic)    │   │  pipelining=True      │   │
│  │  - tasks     │   │  - groups    │   │  host_key_checking=no │   │
│  │  - handlers  │   │  - variables │   └────────────────────────┘   │
│  └──────┬───────┘   └──────────────┘                                │
│         │                                                            │
│         │  Ansible compiles task list per host                       │
│         ▼                                                            │
│  ┌──────────────────────────────────┐                                │
│  │  SSH Connection Pool (forks=25) │                                │
│  │  (Manages up to 25 parallel SSH)│                                │
│  └────────────┬─────────────────────┘                                │
└────────────────┼─────────────────────────────────────────────────────┘
                 │
                 │  SSH (port 22) + Python execution
                 ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      MANAGED NODES                                  │
│                                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐     ┌──────────┐        │
│  │ Web-01   │  │ Web-02   │  │ DB-01    │     │ DB-02    │        │
│  │ 10.0.1.1 │  │ 10.0.1.2 │  │ 10.0.2.1 │     │ 10.0.2.2 │        │
│  │ Python 3 │  │ Python 3 │  │ Python 3 │     │ Python 3 │        │
│  │ SSH key  │  │ SSH key  │  │ SSH key  │     │ SSH key  │        │
│  └──────────┘  └──────────┘  └──────────┘     └──────────┘        │
│                                                                     │
│  On each node:                                                      │
│  1. Ansible copies Python script (module)                           │
│  2. Executes with given arguments                                   │
│  3. Returns JSON result (changed, ok, failed)                       │
│  4. Temporary files are cleaned up                                  │
└─────────────────────────────────────────────────────────────────────┘
The control node is a single point of failure — plan accordingly
If your only control node is your laptop and you're on vacation, no one can run playbooks. In production, run Ansible from a CI/CD platform (GitHub Actions, GitLab CI, Jenkins) that can retry from multiple agents. If you need a persistent control node for troubleshooting, set up a managed jump box in each environment with Ansible installed and the vault password available via a secure secret store. Never use a single developer's machine as the only control node.
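A CI-hosted control node can be as small as the GitHub Actions sketch below. The playbook name site.yml, the inventory path, and the secret name ANSIBLE_SSH_KEY are hypothetical, and the runner must be able to reach the managed nodes over the network.

```yaml
# .github/workflows/ansible.yml: sketch of running a playbook from CI.
name: Run Ansible playbook
on:
  workflow_dispatch:             # manual trigger; swap for push or schedule as needed
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install pinned ansible-core
        run: pip install ansible-core==2.17.4
      - name: Load the SSH key for managed nodes
        run: |
          install -m 700 -d ~/.ssh
          printf '%s\n' "$SSH_PRIVATE_KEY" > ~/.ssh/id_ed25519
          chmod 600 ~/.ssh/id_ed25519
        env:
          SSH_PRIVATE_KEY: ${{ secrets.ANSIBLE_SSH_KEY }}   # hypothetical secret name
      - name: Run the playbook
        run: ansible-playbook -i inventory site.yml
```

Because every run starts from a fresh runner with a pinned version, any engineer with repository access can trigger the same automation without depending on one person's laptop.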
Production Insight
Parallelism is capped by forks, and per-task SSH overhead is the usual bottleneck. Without pipelining, each task costs several SSH operations per host (create a temporary directory, copy the module, execute it); pipelining=True streams the module over the existing session instead, and OpenSSH ControlPersist, which Ansible enables by default, keeps one connection per host open between tasks. With forks=25, a 100-server fleet completes each task in 4 batches instead of the 20 needed at the default forks=5. If you see 'Maximum number of SSH sessions reached' errors, reduce forks or increase the SSH MaxSessions on the managed nodes. A common oversight: cloud network security groups limit inbound connections; at high forks, the control node may hit the flow table limit on NAT instances.
Key Takeaway
Ansible's architecture is agentless and push-based: one control node fans out tasks over SSH to all managed nodes in parallel. The inventory tells Ansible what nodes to target and what groups they belong to. Managed nodes need only SSH and Python. The control node is the operational keystone — harden it, back it up, and don't let it be a single point of failure.

Ad-Hoc Ansible Commands — Quick Operations Without Writing a Playbook

Not every task deserves a playbook. Sometimes you need to check the uptime on 50 servers, copy a configuration file to a specific server, or restart a service immediately during an incident. Ad-hoc commands are single-module operations you run directly from the command line without a playbook file. They're ideal for read-only queries, one-off changes, and emergency responses.

The pattern is always: ansible <host-pattern> -m <module> -a '<module arguments>'. The host pattern matches inventory groups, wildcards, or specific hostnames. The module name is the Ansible module to use. The -a argument string depends on the module.

Three modules dominate ad-hoc usage. The ping module tests SSH connectivity and Python availability; it's the first command you run after setting up a new inventory. The command module runs an executable with its arguments directly, without a shell, so pipes and redirects do not work. It reports 'changed' on every run and is not idempotent. In ad-hoc mode, that's usually fine because you're making a one-off change. The shell module is similar but runs through /bin/sh and supports shell operators like pipes and redirects.

For copying files, the copy module is idempotent even in ad-hoc mode: it only transfers the file if the source and destination differ. This makes it safe to use for emergency configuration pushes without worrying about overwriting an identical file unnecessarily.

Ad-hoc commands are powerful but leave no audit trail unless you log them. Every ad-hoc change should be logged with script or tee and followed up with a permanent playbook change. If you find yourself running the same ad-hoc command twice, it's a sign that operation should be a playbook.

A practical production use case: a security vulnerability requires updating a package version across the fleet immediately. You cannot wait for the CI pipeline. ansible all -m ansible.builtin.apt -a 'name=openssl state=latest update_cache=yes' --become patches all servers in one command. After the emergency, you pin the intended version in your main playbook and remove the state=latest usage.

BASH
# ── Ping all servers in the 'production' group ─────────────────────
ansible production -i inventory -m ping

# ── Run uptime on web servers ───────────────────────────────────────
ansible web_servers -i inventory -m command -a 'uptime'

# ── Check disk usage on all servers ─────────────────────────────────
ansible all -i inventory -m shell -a 'df -h / | tail -1'

# ── Copy a configuration file to one server ─────────────────────────
ansible app-01 -i inventory -m copy -a 'src=./nginx.conf dest=/etc/nginx/nginx.conf owner=root group=root mode=0644' --become

# ── Restart Nginx service on all web servers ────────────────────────
ansible web_servers -i inventory -m service -a 'name=nginx state=restarted' --become

# ── Gather facts for a specific host ─────────────────────────────────
ansible db-primary -i inventory -m setup -a 'gather_subset=network'

# ── Install a package everywhere (emergency) ─────────────────────────
ansible all -i inventory -m apt -a 'name=openssl state=latest update_cache=yes' --become

# ── Check if a service is running using shell operators (&&, ||) ─────
ansible all -i inventory -m shell -a 'systemctl is-active nginx && echo "active" || echo "inactive"'
Ad-hoc commands bypass playbook reviews — log them or lose them
Every ad-hoc command you run in production is an unreviewed, un-audited change. Use script to log the session, or pipe output to a file. Better yet, append the command to a runbook that later becomes a playbook. If an ad-hoc command caused a production incident, you'll need the exact command and output for the post-mortem. Without logs, you've lost the evidence.
Production Insight
Ad-hoc commands respect the forks setting, so ansible all -m ping on 200 servers with forks=25 runs in 8 sequential batches. Use the -f or --forks flag to temporarily increase parallelism for a large ad-hoc command: ansible all -m ping -f 50. Be cautious with shell commands that produce large output on many hosts — the control node's memory can spike. For read-only queries, pipe through grep or summarize with ansible all -m command -a 'your_command' --one-line.
Key Takeaway
Ad-hoc commands are fast, one-shot operations using Ansible modules directly from the CLI. Use them for read-only checks, emergency changes, and exploratory troubleshooting. After an ad-hoc fix, translate it into a playbook with proper idempotency and commit it to version control.
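As one possible translation, the read-only disk and service checks from the ad-hoc examples could become a small playbook like the sketch below. The registered variable names are arbitrary.

```yaml
---
# Sketch: two read-only ad-hoc checks promoted to a playbook.
# changed_when: false stops command/shell from reporting "changed" for a query.
- name: Fleet health checks
  hosts: all
  gather_facts: false
  tasks:
    - name: Check root filesystem usage
      ansible.builtin.shell: df -h / | tail -1
      register: disk_usage
      changed_when: false

    - name: Check whether nginx is active
      ansible.builtin.command: systemctl is-active nginx
      register: nginx_state
      changed_when: false
      failed_when: false         # a non-zero exit just means "not active"

    - name: Report results
      ansible.builtin.debug:
        msg: "disk={{ disk_usage.stdout }} nginx={{ nginx_state.stdout }}"
```

Once committed, the check is reviewable and repeatable, and its play recap stays at changed=0 because nothing on the hosts is modified.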

Ansible Modules Quick-Reference — The 15 Most Common Modules and When to Use Them

Modules are the building blocks of every Ansible task. Each module is a small Python script that performs a specific operation — installing a package, copying a file, managing a service. The art of writing good playbooks is knowing which module to use for which job. Using the wrong module (especially shell or command when a dedicated module exists) breaks idempotency, disables check mode, and makes your playbooks unreliable.

The table below lists the 15 most commonly used modules in production environments, along with their primary use case and a quick example. Master these and you can automate roughly 90% of infrastructure tasks.

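<table>
  <thead>
    <tr><th>Module</th><th>Description</th><th>Use Case</th><th>Example</th></tr>
  </thead>
  <tbody>
    <tr><td>apt</td><td>Manage apt packages</td><td>Install/update packages on Debian/Ubuntu</td><td>name=nginx state=present</td></tr>
    <tr><td>yum</td><td>Manage yum packages</td><td>Install/update packages on RHEL/CentOS</td><td>name=httpd state=present (avoid latest)</td></tr>
    <tr><td>copy</td><td>Copy file to remote node</td><td>Deploy config files, scripts</td><td>src=nginx.conf dest=/etc/nginx/nginx.conf</td></tr>
    <tr><td>template</td><td>Render Jinja2 template and copy</td><td>Deploy config with dynamic variables</td><td>src=app.conf.j2 dest=/etc/app/app.conf</td></tr>
    <tr><td>service</td><td>Manage services on any init system</td><td>Start/stop/enable services</td><td>name=nginx state=started enabled=yes</td></tr>
    <tr><td>systemd</td><td>Manage systemd services</td><td>Start/stop/enable systemd services</td><td>name=webapp state=reloaded daemon_reload=yes</td></tr>
    <tr><td>file</td><td>Manage files and directories</td><td>Create directories, set permissions</td><td>path=/opt/app state=directory mode=0755</td></tr>
    <tr><td>user</td><td>Manage OS users</td><td>Create/delete user accounts</td><td>name=deploy state=present groups=sudo</td></tr>
    <tr><td>group</td><td>Manage OS groups</td><td>Create/delete groups</td><td>name=webadmin state=present</td></tr>
    <tr><td>command</td><td>Execute a command</td><td>Run arbitrary commands (no shell)</td><td>cmd=/usr/bin/uptime</td></tr>
    <tr><td>shell</td><td>Execute via shell</td><td>Run commands with pipes, redirects</td><td>cmd="df -h / | tail -1"</td></tr>
    <tr><td>debug</td><td>Print variables during execution</td><td>Troubleshooting variable values</td><td>msg="Current user is {{ ansible_user }}"</td></tr>
    <tr><td>assert</td><td>Validate conditions</td><td>Halt playbook if precondition fails</td><td>that: "ansible_os_family == 'Debian'"</td></tr>
    <tr><td>wait_for</td><td>Wait for port/condition</td><td>Pause until service is ready</td><td>port=8080 host=10.0.1.10 state=started</td></tr>
    <tr><td>uri</td><td>Interact with HTTP APIs</td><td>Health checks, REST API calls</td><td>url=https://api.example.com/health</td></tr>
  </tbody>
</table>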

The most important production rule: always prefer the dedicated module over command or shell. If you find yourself writing shell: apt-get install, replace it with the apt module. The dedicated module provides idempotency, check mode support, and proper change detection. The only legitimate use for command/shell is when no module exists for the operation, or in ad-hoc emergency commands.

io/thecodeforge/ansible/module_examples.ymlYAML
---
# Example: Using the top 5 modules in a real playbook snippet
- name: Install and configure Nginx
  hosts: web_servers
  become: true
  tasks:
    - name: Ensure Nginx is installed (apt module)
      ansible.builtin.apt:
        name: nginx=1.24.*
        state: present
        update_cache: true

    - name: Deploy Nginx configuration (template module)
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Reload Nginx

    - name: Create document root (file module)
      ansible.builtin.file:
        path: /var/www/html
        state: directory
        owner: www-data
        group: www-data

    - name: Copy index page (copy module)
      ansible.builtin.copy:
        src: index.html
        dest: /var/www/html/index.html

    - name: Ensure Nginx is running (service module)
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: yes

  handlers:
    - name: Reload Nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
shell and command always report 'changed' — use them sparingly
The shell and command modules have no way to check current state before executing. By default they report 'changed' on every run, which means any handler they notify fires every time and your playbook output gets noisy. When you must use them, register the result and add changed_when with a condition that reflects a real change (or changed_when: false for read-only commands), or use creates:/removes: so the task is skipped once its result already exists. But the best practice is to find a dedicated module whenever possible.
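A minimal sketch of those escape hatches, for the cases where no dedicated module exists. The script paths, the marker file, and the idea that sync-config.sh prints the word 'updated' are all assumptions made up for this example.

```yaml
# Sketch only: the scripts and marker file below are hypothetical.
- name: Keep command tasks honest about 'changed'
  hosts: web_servers
  become: true
  tasks:
    - name: Read the running kernel version (read-only, never a change)
      ansible.builtin.command:
        cmd: uname -r
      register: kernel_version
      changed_when: false

    - name: Run a one-time setup script only if its marker file is missing
      ansible.builtin.command:
        cmd: /opt/app/bin/initial-setup.sh
        creates: /opt/app/.setup-complete

    - name: Report 'changed' only when the command says it did something
      ansible.builtin.command:
        cmd: /opt/app/bin/sync-config.sh
      register: sync_result
      changed_when: "'updated' in sync_result.stdout"
```

The third pattern only works if the command's output reliably distinguishes "did work" from "nothing to do", so check that before trusting it.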
Production Insight
The module that surprises most production engineers is wait_for. It's invaluable in deployment pipelines where an application server takes 30 seconds to start listening on its port. Without it, downstream checks fail and the deploy appears broken. An example: wait_for: port=3000 host={{ inventory_hostname }} timeout=60 after starting a Node.js service. This single module eliminates the most common false-positive deployment failure.
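In playbook form, that wait_for pattern might look like the sketch below. The service name nodeapp and the app_servers group are assumptions; the port and timeout come from the example above.

```yaml
# Sketch only: 'nodeapp' and 'app_servers' are hypothetical names.
- name: Start the app and wait until it accepts connections
  hosts: app_servers
  become: true
  tasks:
    - name: Start the Node.js service
      ansible.builtin.systemd:
        name: nodeapp
        state: started

    - name: Wait up to 60 seconds for port 3000 to open
      ansible.builtin.wait_for:
        host: "{{ inventory_hostname }}"
        port: 3000
        state: started
        timeout: 60
```

If the port never opens, the task fails after the timeout, which is a far clearer signal than a downstream health check failing for no obvious reason.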
Key Takeaway
Master these 15 modules to handle 90% of infrastructure tasks; always use the dedicated module over shell/command to preserve idempotency, check mode, and reliable change detection.

Your First Ansible Playbook — Install Apache on a Web Server Cluster

A playbook is a YAML file that describes the desired state of a set of hosts. It's the core unit of automation in Ansible. Writing your first playbook reinforces the mental model of declaring 'what should be true' rather than scripting 'what commands to run'.

The canonical first playbook installs and configures Apache on a group of web servers. It exercises three of the most common modules: apt for package management, copy for configuration files, and service for daemon management. It also introduces handlers by restarting Apache only when the configuration file actually changes.

Create an inventory file with one or two test servers, or use localhost with ansible_connection=local for a safe first run. The playbook below assumes an inventory group called web_servers that you define. It becomes root via become: true because installing packages and starting services requires superuser privileges.

The playbook has four tasks plus a handler. The first task installs Apache at the version provided by the distribution's default repositories. Using state: present ensures it's installed but won't upgrade it unexpectedly. The second task creates a custom index.html using the copy module with inline content, which avoids a separate template file for this simple example. The third task copies an Apache virtualhost configuration from a file on the control node into sites-available. The fourth task ensures Apache is running and starts on boot. (Enabling the new site with a2ensite is left out to keep the example short.) The handler restarts Apache only when the virtualhost config changes.

Running the playbook the first time changes state (installs, writes, restarts). Running it again reports 'ok' for all tasks because the desired state is already in place — this is idempotency in action.

Use --check --diff before the first real run to see what would change without actually changing anything. This is covered in the next section.

io/thecodeforge/ansible/apache_first_playbook.ymlYAML
# ── First playbook: install and configure Apache ──────────────────
# Run with: ansible-playbook -i inventory apache.yml
# Run dry-run: ansible-playbook -i inventory apache.yml --check --diff

- name: Install and configure Apache on web servers
  hosts: web_servers
  become: true
  gather_facts: false

  vars:
    # Default index.html content
    welcome_message: "Welcome to TheCodeForge demo!"
    # Apache virtualhost port (can be overridden per host via inventory)
    apache_port: 80

  handlers:
    - name: Restart Apache
      ansible.builtin.service:
        name: apache2
        state: restarted
      listen: "apache config changed"

  tasks:
    - name: Install Apache2 package
      ansible.builtin.apt:
        name: apache2
        state: present
        update_cache: true
        cache_valid_time: 3600

    - name: Create custom index.html
      ansible.builtin.copy:
        content: |
          <html>
          <head><title>TheCodeForge Demo</title></head>
          <body><h1>{{ welcome_message }}</h1></body>
          </html>
        dest: /var/www/html/index.html
        owner: www-data
        group: www-data
        mode: '0644'

    - name: Deploy virtualhost configuration
      ansible.builtin.copy:
        src: files/apache-site.conf
        dest: /etc/apache2/sites-available/001-webapp.conf
        owner: root
        group: root
        mode: '0644'
      notify: "apache config changed"
      # The handler only fires if this task reports 'changed'

    - name: Ensure Apache is enabled and running
      ansible.builtin.service:
        name: apache2
        state: started
        enabled: true

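# ── files/apache-site.conf ──────────────────────────────────────────
# Place this file alongside the playbook. The copy module does not render
# Jinja2, so the port is written literally; switch the task to
# ansible.builtin.template if you want to use {{ apache_port }} here.
# <VirtualHost *:80>
#     DocumentRoot /var/www/html
#     ServerName localhost
# </VirtualHost>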

# ── Sample output on first run ───────────────────────────────────────
# PLAY [Install and configure Apache on web servers] ********************
# TASK [Install Apache2 package] ***************************************
# changed: [web-01]
# TASK [Create custom index.html] ***************************************
# changed: [web-01]
# TASK [Deploy virtualhost configuration] *******************************
# changed: [web-01]
# TASK [Ensure Apache is enabled and running] ***************************
# ok: [web-01]   (the package install already started and enabled the service)
# RUNNING HANDLER [Restart Apache] **************************************
# changed: [web-01]
# PLAY RECAP ************************************************************
# web-01   : ok=5   changed=4   unreachable=0   failed=0   skipped=0
Start with localhost before targeting remote servers
Create an inventory with localhost ansible_connection=local and run this playbook against it. You'll see the full playbook cycle without needing SSH keys or remote access. Once you understand the flow, replace localhost with a real server. This is the fastest way to learn the syntax without network debugging distractions.
Production Insight
The cache_valid_time: 3600 on the apt task matters on every run after the first. Without it, update_cache: true forces an apt update on every playbook run, which typically costs 10-15 seconds per server. With it, Ansible skips the update whenever the cache is less than an hour old. In pipelines that build fresh target instances each run, the cache is always cold, so consider pre-seeding the apt cache or removing update_cache: true and relying on base AMI images with up-to-date package lists.
Key Takeaway
A playbook declares desired state as a list of tasks using idempotent modules. Running it once converges the system to that state. Running it again does nothing if the state already matches — that's idempotency. The Apache playbook demonstrates apt, copy, service, and handlers, the four building blocks of 80% of all automation.

Playbook Check Mode and Diff — Validate Before You Change

The --check flag runs a playbook in 'dry-run' mode: Ansible evaluates every task's condition and reports what would change without actually making any changes. Combined with --diff, it shows the exact content differences for template, copy, and other modules that manage file content. This combination is the closest thing Ansible has to a pre-deployment validation step.

Check mode is a simulation, but an informed one. Modules that support it (most built-in modules) inspect the current state and report 'changed' or 'ok' based on whether the task would alter the system. Modules that don't support check mode are skipped, which reduces confidence in the result. The shell and command modules, for example, are skipped in check mode by default because Ansible cannot predict what an arbitrary command would do. This is another reason to prefer dedicated modules.

The --diff flag shows the before-and-after content for files managed by copy, template, lineinfile, blockinfile, and others. It shows which lines in configuration files would be added, removed, or modified. Review this output to catch mistakes like a typo in a template variable or an incorrect file permission before they reach production.

In production CI pipelines, every playbook run that targets staging or production should first execute a check-diff run. If the playbook would change more than expected (e.g., 200 files changed when you only expected 2), the pipeline should halt and alert a human. This is a classic 'change validation' pattern.

One caveat: handlers notified during a dry-run also run in check mode, so no service is actually restarted or reloaded. Check mode also skips command and shell tasks, so if your playbook relies on their registered results, later tasks may fail or be skipped and the dry-run output becomes less reliable. The rule of thumb: the more dedicated modules you use, the more accurate your dry-run results will be.
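One way to narrow that gap is to let read-only commands run even during a dry-run. The sketch below assumes the command is genuinely side-effect free; nginx -v is used here only as a stand-in for that kind of query.

```yaml
# Sketch only: 'nginx -v' stands in for any read-only command.
- name: Make read-only commands usable in a dry-run
  hosts: web_servers
  become: true
  tasks:
    - name: Read the installed Nginx version (safe to run even with --check)
      ansible.builtin.command:
        cmd: nginx -v
      register: nginx_version
      changed_when: false
      check_mode: false   # always execute, even under --check

    - name: Show the version so it appears in the dry-run log
      ansible.builtin.debug:
        msg: "{{ nginx_version.stderr }}"   # nginx -v prints to stderr
```

Only set check_mode: false on tasks that cannot modify the host. On anything else it defeats the purpose of a dry-run.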

BASH
# ── Dry-run a playbook against staging ─────────────────────────────
ansible-playbook -i staging site.yml --check --diff

# ── Dry-run against a single host ────────────────────────────────────
ansible-playbook -i production site.yml --limit app-01 --check --diff

# ── Capture diff output for review ───────────────────────────────────
ansible-playbook -i staging site.yml --check --diff 2>&1 | tee /tmp/dry-run-$(date +%Y%m%d-%H%M).log

# ── Grep for any tasks that would change in dry-run ───────────────────
grep -E '(changed|TASK)' /tmp/dry-run-*.log | grep -v 'ok:'

# ── Run in check mode with increased verbosity for detailed output ───
ansible-playbook -i production site.yml --check -vv

# ── In a CI pipeline, fail if check mode shows unexpected changes ─────
# Count the tasks that report 'changed:' in the dry-run; the pipeline
# compares the count against an expected threshold and fails above it.
ansible-playbook -i staging site.yml --check --diff 2>&1 | \
  awk '/^TASK/ { task=$0 } /^changed:/ { print task, $0 }' | \
  wc -l
# (examine count, set threshold in pipeline logic)
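Check mode does not execute shell tasks: beware of false negatives
Because shell and command tasks are skipped under --check, a clean dry-run says nothing about what those tasks will do in a real run. Treat any playbook that leans on them as only partially validated.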
Production Insight
The highest-value use of check mode is in CI pipelines. Put a --check --diff step before the real deployment. If the dry-run shows more than a trivial number of changes (e.g., more than 3 tasks changed), fail the pipeline. This catches accidental edits to group_vars, stale templates, or a misconfigured inventory that would cause a mass config update. Many teams skip this because they trust their playbooks. The production incident at the start of this article could have been prevented by a dry-run showing that the Nginx package would change across 50 servers.
Key Takeaway
Always precede a production playbook run with --check --diff. It shows what would change without touching any server. Use it as a CI validation gate to catch unexpected changes before they cause an incident. The accuracy of check mode improves with every shell task you replace with a dedicated module.

Ansible Variables and Precedence — The 22 Levels That Bite You

Variable precedence is the most common source of 'why is my playbook using the wrong value?' in production. Ansible has 22 different places where a variable can be defined, with a specific order of precedence. When the same variable name appears in multiple places, the highest-precedence definition wins. Misunderstanding this order leads to subtle bugs that only appear in certain environments.

The precedence order from lowest to highest (later overrides earlier), abridged from the full list in the Ansible documentation:
- role defaults (defaults/main.yml)
- inventory group_vars/all
- playbook group_vars/all
- inventory group_vars/groupname
- playbook group_vars/groupname
- inventory host_vars/hostname
- playbook host_vars/hostname
- host facts and cached set_fact values
- play vars (vars block), then vars_prompt, then vars_files
- role vars (vars/main.yml)
- block vars (within a block)
- task vars (vars on a task)
- include_vars
- set_fact and registered variables (same level)
- role and include_role parameters
- extra vars (-e, highest priority)

In practice, the conflict you'll encounter most often is between group_vars/all (low) and --extra-vars (high). A stray -e in a CI pipeline can override everything else in your inventory, causing the wrong environment to be configured. Another common trap: using set_fact inside a loop — it overwrites the variable each iteration instead of accumulating.

To debug variable precedence, use the debug module to print the variable value at different points in the playbook. Running with -v also prints each task's result, including values set by set_fact. When you need to merge dictionaries instead of overriding them, use the combine filter with the recursive=true option; for lists, concatenate with + or use the union filter.

io/thecodeforge/ansible/variable_precedence.ymlYAML
# Example showing variable precedence and debugging
- name: Demonstrate variable precedence
  hosts: localhost
  gather_facts: no
  vars:
    app_port: 8080          # playbook vars
  tasks:
    - name: Set a fact (runtime, higher precedence)
      set_fact:
        app_port: 9090

    - name: Print variable – will show 9090, not 8080
      debug:
        msg: "app_port is {{ app_port }}"

    - name: Demonstrate extra-vars override (run with -e app_port=9999)
      debug:
        msg: "app_port is {{ app_port }} (9999 when passed with -e, which beats set_fact)"

    - name: Merge dictionaries instead of overriding
      set_fact:
        config: "{{ config | default({}) | combine({'timeout': 30}, recursive=True) }}"
set_fact inside a loop overwrites each iteration — use a different approach
If you need to collect values in a loop, use set_fact with the + operator on a list, or use combine for dictionaries. Example: set_fact: mylist={{ mylist | default([]) + [item] }} with loop. Without this, you'll only keep the last iteration's value.
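The accumulation pattern described here can be sketched as follows. The variable names open_ports and port_labels and the port values are illustrative only.

```yaml
# Sketch only: the port numbers and variable names are illustrative.
- name: Accumulate values across loop iterations
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Append each item to a list
      ansible.builtin.set_fact:
        open_ports: "{{ open_ports | default([]) + [item] }}"
      loop: [80, 443, 8080]

    - name: Merge each item into a dictionary
      ansible.builtin.set_fact:
        port_labels: "{{ port_labels | default({}) | combine({item.name: item.port}) }}"
      loop:
        - { name: http, port: 80 }
        - { name: https, port: 443 }

    - name: Show the accumulated results
      ansible.builtin.debug:
        msg: "ports={{ open_ports }} labels={{ port_labels }}"
```

The default([]) and default({}) filters matter on the first iteration, when the variable does not exist yet.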
Production Insight
The most damaging variable bug is when --extra-vars in a CI pipeline overrides environment: production to development because of a stale Jenkins parameter. Always validate that the highest-priority source (CLI or extra-vars) contains the expected values before running a production playbook. A simple debug task at the start of the playbook that prints all critical variables would have prevented many incidents.
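A sanity check along those lines might look like the sketch below. The variable names target_env and app_port are assumptions; target_env is used instead of environment because environment is already a playbook keyword.

```yaml
# Sketch only: 'target_env' and 'app_port' are hypothetical variable names.
- name: Deploy with a variable sanity check up front
  hosts: web_servers
  gather_facts: false
  pre_tasks:
    - name: Print the values that decide where this run lands
      ansible.builtin.debug:
        msg: "target_env={{ target_env | default('UNSET') }} app_port={{ app_port | default('UNSET') }}"
      run_once: true

    - name: Stop immediately if a critical variable is missing or unexpected
      ansible.builtin.assert:
        that:
          - target_env is defined
          - target_env in ['staging', 'production']
          - app_port is defined
          - app_port | int > 0
        fail_msg: "Refusing to run: check -e overrides and group_vars for target_env/app_port"
      run_once: true
```

Because pre_tasks run before roles and tasks, a bad override stops the play before anything on the hosts is touched.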
Key Takeaway
Ansible's variable precedence has 22 levels; the highest priority is --extra-vars. Debug variable values at runtime with the debug module. When you need to merge dictionaries rather than replace them, use the combine filter with recursive=true; concatenate lists with +.

Ansible Roles — Stop Copy-Pasting Playbooks Like an Amateur

Roles are how you stop treating Ansible like a scripting language and start treating it like infrastructure code you’d ship to production. Without roles, your playbooks become a tangled mess of tasks, handlers, and variables that only you understand, until you don't. Roles enforce a filesystem contract: tasks go in tasks/, handlers in handlers/, defaults in defaults/, templates in templates/. This isn't bureaucracy; it's survival. When a junior onboards and your playbook has 500 lines of unorganized YAML, they will break something. Roles give you modular, reusable, testable units. Use ansible-galaxy init to scaffold one. Wire it into a playbook with the roles: keyword, or with include_role when you need to pull it in dynamically from a task. That's it. Your production deploys should be a composition of roles, not a novel.

AssignWebRoleToProduction.ymlYAML
# io.thecodeforge - devops tutorial

# Scaffold the role structure once:
#   ansible-galaxy init web_server

- name: Deploy web application to production cluster
  hosts: webservers
  become: yes
  roles:
    - role: common_packages         # Installs curl, htop, etc. from common_packages role
    - role: web_server               # Installs and configures nginx
      vars:
        nginx_port: 8443             # Override default from roles/web_server/defaults/main.yml
        ssl_cert_path: /etc/ssl/certs/server.crt
    - role: monitoring_agent         # Installs and configures datadog agent
      when: env == "production"      # Only run this role in production, not staging
Output
PLAY [Deploy web application to production cluster] ************************
TASK [common_packages : Install essential packages] *************************
changed: [web-01] => (item=curl)
changed: [web-01] => (item=htop)
TASK [web_server : Install nginx] ********************************************
changed: [web-01]
TASK [web_server : Copy nginx configuration] *********************************
changed: [web-01]
TASK [monitoring_agent : Install datadog agent] ******************************
skipping: [web-01]
PLAY RECAP *******************************************************************
web-01 : ok=3 changed=3 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
Production Trap: One Role, One Responsibility
Never stuff database config, web server config, and app deployment into a single role. If your role has more than one job, split it. A monolithic role is just a playbook with extra steps — and a nightmare to debug.
Key Takeaway
Roles enforce modularity through filesystem structure. If your playbook doesn't fit in 20 lines of orchestration, you need a role.
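To make the filesystem contract concrete, here is a rough sketch of what three files inside a web_server role could contain. The variable names come from the playbook above; the task bodies, default values, and handler are assumptions about what such a role might do.

```yaml
# Sketch only: file contents for a hypothetical web_server role.
# Each '---' below starts a separate file; the path is in the comment.
---
# roles/web_server/defaults/main.yml  (lowest precedence, meant to be overridden)
nginx_port: 80
ssl_cert_path: /etc/ssl/certs/server.crt
---
# roles/web_server/tasks/main.yml
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present

- name: Copy nginx configuration
  ansible.builtin.template:
    src: nginx.conf.j2          # resolved from roles/web_server/templates/
    dest: /etc/nginx/nginx.conf
    mode: '0644'
  notify: Reload nginx
---
# roles/web_server/handlers/main.yml
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded
```

Putting nginx_port in defaults rather than vars is what lets the calling playbook override it, since role defaults sit at the bottom of the precedence order.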

Ansible vs Terraform — Two Different Hammers, One Toolbox

Here’s what nobody tells you: Ansible and Terraform are not competitors. They solve different layers of the same problem. Terraform is a provisioning tool. It talks to cloud APIs to create infrastructure — VMs, networks, load balancers. It cares about state, drift detection, and idempotent resource creation. Ansible is a configuration tool. It logs into that provisioned machine and installs software, tweaks config files, starts services. Terraform asks 'Does this VM exist?' Ansible asks 'Is Apache running?' Use Terraform to build the house. Use Ansible to furnish it. The moment you try to use Ansible to create an AWS EC2 instance, you’re fighting the tool. The moment you use Terraform to configure nginx inside that instance, you’re fighting the tool. Know the boundary. Your CI/CD pipeline should call Terraform first, then Ansible. Every senior engineer I know does this.

TerraformThenAnsiblePipeline.ymlYAML
# io.thecodeforge - devops tutorial

# Example CI/CD pipeline stage (GitLab CI / GitHub Actions equivalent)

# Step 1: Terraform provisions the infrastructure
# terraform apply -auto-approve

# Step 2: Ansible inventory is generated from Terraform output
# terraform output -json ansible_inventory > inventory.json

- name: Configure provisioned EC2 instances
  hosts: all
  become: yes
  vars:
    # Terraform outputs these IPs; inject via --extra-vars or dynamic inventory script
    app_version: "1.4.2"
    db_connection_string: "{{ lookup('env', 'DB_CONN') }}"  # Fetch from CI/CD secrets
  tasks:
    - name: Install application binary
      ansible.builtin.copy:
        src: "/artifacts/myapp-{{ app_version }}.bin"
        dest: "/usr/local/bin/myapp"
        mode: '0755'
      notify: restart app

  handlers:
    - name: restart app
      ansible.builtin.systemd:
        name: myapp
        state: restarted
        daemon_reload: yes
Output
No direct output — this is a pipeline concept. But your deploy log should show:
Terraform apply complete! Resources: 3 added, 0 changed, 0 destroyed.
...
PLAY RECAP *******************************************************************
prod-web-01 : ok=4 changed=2 unreachable=0 failed=0
prod-api-01 : ok=4 changed=2 unreachable=0 failed=0
Senior Shortcut: Dynamic Inventory from Terraform
Write a small Python script that parses terraform output -json and emits an Ansible inventory in YAML. Then your playbook always targets exactly what Terraform just created. No manual inventory updates. No IP copy-paste errors.
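The generated file might look something like the sketch below. The host names match the sample recap above; the group names, IP addresses, and ansible_user are placeholders that a generator script would fill in from the Terraform output.

```yaml
# Sketch only: IPs, group names, and ansible_user are placeholders.
all:
  children:
    web_servers:
      hosts:
        prod-web-01:
          ansible_host: 10.0.1.11
    api_servers:
      hosts:
        prod-api-01:
          ansible_host: 10.0.2.21
  vars:
    ansible_user: ubuntu
```

Pass the file with -i like any static inventory, and regenerate it on every pipeline run so it never outlives the infrastructure it describes.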
Key Takeaway
Provision with Terraform, configure with Ansible. Mixing them is a recipe for state explosion and failed deploys at 2 AM.

Ansible Tower (AWX) — Centralized Execution Without the Spreadsheet Mayhem

If you're still ssh-ing into your control node and running ansible-playbook by hand, you're one fat-finger away from taking down production. Ansible Tower (or its open-source upstream, AWX) gives you a web UI, RBAC, job scheduling, and — most importantly — an audit trail. When the VP asks 'Who ran that playbook at 3 AM?', you don't shrug. You pull up the job log. Tower also solves the 'works on my machine' problem. Playbooks run on Tower's execution environment, not your laptop with the experimental Python 3.12. Use it to segment environments: developers get 'Run' access to staging, read-only to production. Operations gets full control. No more shared passwords on a sticky note. The real power? The REST API. Your CI/CD can POST to Tower instead of SSH-ing into a jump box. That way, every deploy is recorded, every failure is logged, and every success is celebrated.

LaunchTowerJobViaAPI.ymlYAML
# io.thecodeforge - devops tutorial

# Trigger a Tower job template from a CI/CD pipeline
# (get_tower_token is a placeholder for however you fetch the token):
#   curl -X POST https://tower.example.com/api/v2/job_templates/7/launch/ \
#     -H "Authorization: Bearer $(get_tower_token)" \
#     -H "Content-Type: application/json" \
#     -d '{"extra_vars": {"env": "production", "version": "1.4.2"}}'

# Inside the job template, the playbook looks like:
- name: Deploy app from Tower job
  hosts: "{{ env }}"                # Environment passed as extra_var
  become: yes
  roles:
    - role: deploy_app
      vars:
        app_version: "{{ version }}" # Version passed from CI/CD
  
  # Tower automatically captures:
  # - who triggered the job
  # - when it ran
  # - output of every task
  # - exit code and failures
Output
HTTP/2 201 Created
{
"id": 8237,
"status": "pending",
"created": "2024-07-15T14:30:00Z",
"summary_fields": {
"created_by": {
"username": "ci-bot"
}
}
}
# After job completes:
{
"id": 8237,
"status": "successful",
"elapsed": 34.5,
"failed": false
}
Never Do This: Sleeping on RBAC
I’ve seen teams give everyone 'admin' in Tower because it's easier. Don't. Grant 'Execute' on specific job templates. Use teams and organizations. A junior with admin can delete credential secrets. It takes 10 minutes to set up proper roles. It takes 10 hours to recover from a credential leak.
Key Takeaway
Tower or AWX is your audit log, RBAC layer, and execution gateway. If your playbook runs without a record, it didn't happen.
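If the pipeline already speaks Ansible, the same launch call can be made with the uri module instead of curl. This is a sketch under assumptions: the URL and template ID are taken from the curl example, and tower_token is a hypothetical variable holding an OAuth2 token supplied by your secrets store.

```yaml
# Sketch only: 'tower_token' is a hypothetical variable; the URL and
# template ID are copied from the curl example.
- name: Launch a Tower/AWX job template from a playbook
  hosts: localhost
  gather_facts: false
  tasks:
    - name: POST to the job template launch endpoint
      ansible.builtin.uri:
        url: https://tower.example.com/api/v2/job_templates/7/launch/
        method: POST
        headers:
          Authorization: "Bearer {{ tower_token }}"
        body_format: json
        body:
          extra_vars:
            env: production
            version: "1.4.2"
        status_code: 201
      register: launch

    - name: Show the job ID so the pipeline can poll it
      ansible.builtin.debug:
        msg: "Launched job {{ launch.json.id }}"
```

Setting status_code: 201 makes the task fail on any other response, so a rejected launch stops the pipeline instead of passing silently.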
● Production incident · POST-MORTEM · severity: high

The Monday Morning Nginx Upgrade That Broke Payment Processing for 47 Minutes

Symptom
Users started seeing 'SSL handshake failed' errors immediately after the playbook completed. Payment gateway API calls timed out at the application layer. Transaction success rate dropped to roughly 10% of normal. Every monitoring dashboard looked clean — CPU normal, memory normal, error rate on the application was zero because the failure was happening at the TLS handshake layer before requests even reached the app. The first alert that fired was a business-level one: revenue dropped off a cliff in the payment processing dashboard.
Assumption
The team had been running this playbook weekly as a 'configuration health check'. They'd added state: latest six months earlier as a way to keep the fleet current without managing explicit versions. The assumption was that staging had been on the new Nginx version for two weeks without issues, so production was safe. What they didn't know was that staging used a different package mirror that received the new version on a different schedule. Staging had never actually run the version that hit production that Monday.
Root cause
The specific task was ansible.builtin.apt: name=nginx state=latest update_cache=yes. Nginx went from 1.24 to 1.26 across the entire production fleet in a single playbook run — 50 servers in about 90 seconds. Nginx 1.26 changed the default TLS configuration to deprecate certain cipher suites that the payment gateway's SSL terminator still required. The application code was unchanged. The Nginx configuration file was unchanged. The only thing that changed was the Nginx binary itself — and the team had no test that validated TLS handshake compatibility with external payment processors after an Nginx version change. The state: latest task also didn't log which version was installed, only that the package was 'changed'. The post-incident investigation had to reconstruct the version change from package manager logs on individual servers.
Fix
Three changes were made, and all three were required. First, every state: latest in every production playbook was changed to state: present with an explicit version pin: name: nginx=1.24.*. The wildcard on the patch version allows security patches within the pinned minor version but prevents major behavior changes. Second, a separate security_updates.yml playbook was created that runs on an explicit schedule — Thursday afternoon after a staging validation run — and includes rollback instructions as inline comments. Third, an integration test was added to the deployment pipeline that validates TLS handshake success against the payment gateway endpoint using a real certificate, run after any Nginx configuration or version change. If the handshake test fails, the pipeline rolls back the Nginx version automatically.
Key lesson
  • state: latest is a footgun in any production playbook that runs on a schedule. It delegates the upgrade decision to whatever your package mirror happens to serve that day. Pin versions explicitly with name: package=version.* and make upgrades a deliberate, tested decision — not a side effect of a routine playbook run.
  • Staging and production must use the same package repository mirror and must be on the same versions at all times. If your staging environment can silently diverge from production's package versions, it provides no safety guarantee. Mirror the production repo exactly, or use a private artifact repository that you control.
  • Configuration drift detection is not the same as behavior validation. You can have perfectly idempotent configuration management and still have external integrations break when an underlying package changes its defaults. Write integration tests that validate the behavior your external dependencies rely on — TLS cipher suites, header handling, timeout behavior.
  • Monday morning is the statistically worst time to run untested automation against production. You have maximum blast radius (full week of traffic ahead), minimum time since the last human review of the change (over the weekend), and maximum cognitive load on engineers who are just starting the day. Schedule risky playbooks for Thursday, after a staging run earlier in the week.
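A minimal sketch of the first lesson, combined with the logging gap noted in the root cause. The 1.24.* pin mirrors the fix described above; package_facts with the apt manager needs the python3-apt library on the target, which is assumed to be present here.

```yaml
# Sketch only: adjust the pin to your own version; assumes python3-apt
# is installed on the targets so package_facts can query apt.
- name: Install a pinned Nginx and record what was installed
  hosts: web_servers
  become: true
  tasks:
    - name: Install Nginx pinned to the 1.24 series
      ansible.builtin.apt:
        name: nginx=1.24.*
        state: present
        update_cache: true
        cache_valid_time: 3600

    - name: Collect installed package versions
      ansible.builtin.package_facts:
        manager: apt

    - name: Log the exact Nginx version in the playbook output
      ansible.builtin.debug:
        msg: "nginx {{ ansible_facts.packages['nginx'][0].version }} on {{ inventory_hostname }}"
```

With the version in the run log, a post-incident review can read the upgrade history from CI output instead of reconstructing it from package manager logs on each server.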
Production debug guide
Three failure patterns that together account for the majority of Ansible production incidents, with exact diagnostics and the specific fix for each.
Symptom · 01
Playbook runs and shows 'changed' on every execution for the same task, but nothing appears different on the server — the change output is noise you've stopped reading
Fix
You're using the shell or command module for something a dedicated module could handle, and those modules report 'changed' on every run because they have no way to inspect current state. To confirm, run the playbook twice in a row against the same host: any task that still reports 'changed' on the second run, with nothing different on the server, is the offender. (A --check run will not help here, because shell and command tasks are skipped in check mode.) Fix: replace shell: apt-get install nginx with ansible.builtin.apt: name=nginx state=present. If no dedicated module exists for your task, add changed_when: false to read-only commands to suppress false positives, or add creates: /path/to/file to skip the task when its output already exists. The goal is a playbook where 'changed' means something actually changed. Otherwise you stop trusting the output entirely and miss real changes.
Symptom · 02
A handler runs on every playbook execution even when no configuration actually changed — you're getting unnecessary service restarts on every CI pipeline run
Fix
Some task that notifies the handler is always reporting 'changed', which triggers the handler every time. Find the culprit by running ansible-playbook playbook.yml --check --diff 2>&1 | grep -B 5 'changed' and look for the task just above each 'changed' marker; shell and command tasks are skipped in check mode, so check those by comparing two consecutive real runs. Common causes: a template task that reports changed because of whitespace differences or line ending inconsistencies between the template and the deployed file; a shell task that always reports changed. For template issues, set trim_blocks: true and lstrip_blocks: true on the template task, and check that the deployed file's line endings match the template's. For shell tasks causing spurious handler triggers, add changed_when with an explicit condition based on the command's output.
Symptom · 03
Playbook fails on some hosts in an inventory group but succeeds on others — all hosts were supposedly provisioned identically
Fix
At least one host has drifted from the expected state. Run ansible -i inventory all -m setup --limit drifted-host > /tmp/drifted-facts.json and ansible -i inventory all -m setup --limit good-host > /tmp/good-facts.json, then diff /tmp/good-facts.json /tmp/drifted-facts.json to find the divergence. Common drift sources: manual SSH changes made during a previous incident, a failed partial playbook run that left a host mid-state, or autoscaling replacing an instance from an outdated AMI. Fix: add ansible.builtin.assert tasks at the top of your playbook that validate preconditions — OS version, required directories existing, expected kernel parameters — so playbook failures are explicit and informative rather than cryptic mid-play errors.
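The precondition checks suggested in that fix could be sketched like this. The OS version, the /opt/app directory, and the somaxconn threshold are example baselines, not requirements taken from this article.

```yaml
# Sketch only: the OS version, directory, and sysctl threshold are
# example preconditions; replace them with what your hosts must satisfy.
- name: Fail fast when a host has drifted from the expected baseline
  hosts: web_servers
  become: true
  gather_facts: true
  pre_tasks:
    - name: Check that the application directory exists
      ansible.builtin.stat:
        path: /opt/app
      register: app_dir

    - name: Read a kernel parameter the application depends on
      ansible.builtin.command:
        cmd: sysctl -n net.core.somaxconn
      register: somaxconn
      changed_when: false
      check_mode: false

    - name: Validate preconditions before any change is made
      ansible.builtin.assert:
        that:
          - ansible_facts['distribution'] == 'Ubuntu'
          - ansible_facts['distribution_version'] is version('22.04', '>=')
          - app_dir.stat.exists and app_dir.stat.isdir
          - somaxconn.stdout | int >= 1024
        fail_msg: "{{ inventory_hostname }} does not match the expected baseline"
```

A drifted host then fails on the assert with its own name in the message, before any later task has a chance to leave it half-configured.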
★ Ansible Production Debug Cheat Sheet
The five commands you actually run to diagnose 80% of Ansible production failures. Run these in order before escalating or restarting services manually.
Playbook hangs at the start, SSH timeout, or 'UNREACHABLE' errors
Immediate action
Test SSH connectivity completely independently of Ansible before assuming the playbook is broken
Commands
ansible -i inventory all -m ping -vvv
ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no user@target-ip 'echo connected'
Fix now
Check AWS/GCP security group rules for port 22 inbound from your control node's IP. Verify ansible_user matches the AMI's default user (ubuntu for Ubuntu, ec2-user for Amazon Linux). For ephemeral environments like CI runners with dynamic IPs, set ANSIBLE_HOST_KEY_CHECKING=False or add host_key_checking = False to the [defaults] section of ansible.cfg. For VPC private subnets, ensure your control node is in the same VPC or a connected one: a NAT gateway only carries outbound traffic, so it gives the control node no inbound path to private hosts.
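One way to keep the per-image login user out of the playbook is to set it per group in a YAML inventory. This is a sketch: the group names, addresses, and the bastion host are assumptions, and the ProxyJump line is only one possible way to reach hosts in a private subnet.

```yaml
# inventory.yml (sketch): all host names and addresses are illustrative.
all:
  children:
    ubuntu_web:
      hosts:
        web1:
          ansible_host: 10.0.1.10
      vars:
        ansible_user: ubuntu          # default user on Ubuntu images
    amazon_linux_app:
      hosts:
        app1:
          ansible_host: 10.0.2.10
      vars:
        ansible_user: ec2-user        # default user on Amazon Linux images
        # Hypothetical jump host for hosts without a direct route
        ansible_ssh_common_args: "-o ProxyJump=ec2-user@bastion.example.com"
```

Running ansible -i inventory.yml all -m ping against this file tests each group with its own user.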
A variable has the wrong value at runtime: the task uses a stale or unexpected value
Immediate action
Dump the fully resolved variable set for the specific host before running the playbook — don't guess at precedence
Commands
ansible-inventory -i inventory.ini --host $HOST
ansible -i inventory.ini -m debug -a 'var=your_variable_name' $HOST
Fix now
Ansible has 22 variable precedence levels. The most common conflict is between group_vars/all/ (lower priority) and host_vars/hostname/ (higher priority) or --extra-vars (highest priority, overrides everything). If a CI pipeline is passing -e extra vars, those win over everything in your inventory. Move all environment-specific configuration to group_vars/$ENVIRONMENT.yml and audit any -e flags in CI pipeline definitions — they frequently contain stale overrides from a previous debugging session that never got cleaned up.
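A small playbook makes the override behaviour visible. The variable name and values here are made up for the demonstration:

```yaml
# precedence-demo.yml (sketch): app_port and its values are illustrative.
- name: Show which value wins
  hosts: all
  gather_facts: false
  vars:
    app_port: 8080                    # play-level value, beaten by extra vars
  tasks:
    - name: Print the resolved value
      ansible.builtin.debug:
        var: app_port
```

Run it once as ansible-playbook -i inventory.ini precedence-demo.yml and once with -e app_port=9090 appended. The second run reports the extra-vars value, which is exactly what a forgotten -e flag in a CI definition does to your inventory settings.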
Task shows 'changed' on every run: idempotency is broken and CI output is unreadable
Immediate action
Run in check-plus-diff mode and capture the output — then read it before touching the playbook
Commands
ansible-playbook playbook.yml --check --diff 2>&1 | tee /tmp/ansible-diff.txt
grep -B 3 -A 15 'changed:' /tmp/ansible-diff.txt
Fix now
If the diff shows actual content changes, a file is genuinely different every run: check for timestamps, PIDs, or randomly generated values being written into a template. If the diff shows nothing but a normal run still reports 'changed', you have a shell or command task with no idempotency guard. Replace it with the appropriate module, or register its result and add changed_when with a condition on the output, for example changed_when: "'updated' in command_result.stdout". A condition such as command_result.rc != 0 does not help, because a non-zero return code already fails the task by default. A playbook where every run shows 50 'changed' items is a playbook nobody trusts and everybody ignores.
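A sketch of a changed_when guard on a command task. The migrate script and the 'applied' marker in its output are assumptions; substitute whatever your command actually prints when it does real work:

```yaml
# Sketch only: the script path and its output format are hypothetical.
- name: Command task with an explicit changed condition
  hosts: all
  become: true
  tasks:
    - name: Apply pending migrations
      ansible.builtin.command:
        cmd: /opt/app/bin/migrate     # hypothetical script
      register: migrate_result
      # Report 'changed' only when the tool says it applied something
      changed_when: "'applied' in migrate_result.stdout"
```

With this guard, a run where nothing was pending reports 'ok', so any handler attached to the task stays quiet.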
Playbook runs cleanly on your laptop but fails in CI with connection or authentication errors
Immediate action
Compare the full Ansible execution environment between local and CI — the problem is almost always environment, not the playbook
Commands
env | grep -E 'ANSIBLE|PYTHON|SSH|AWS'
ansible --version && python3 --version && pip3 show ansible-core
Fix now
The most common CI-specific failures are: SSH host key checking blocking on a new instance (ANSIBLE_HOST_KEY_CHECKING=False in CI environment); vault password prompt blocking the CI runner (--vault-password-file /path/to/vault.key instead of --ask-vault-pass); relative inventory paths resolving differently from the CI working directory (use absolute paths or $(pwd)/inventory); and Ansible version differences between your laptop and the CI image causing module behavior changes. Pin your Ansible version in CI: pip install ansible-core==2.17.4.
Handler is notified but never runs: service doesn't restart after a config change
Immediate action
Verify the handler is defined in the right place and has exactly the right name — handler names are case-sensitive and whitespace-sensitive
Commands
grep -n 'handlers\|notify\|listen' playbook.yml
ansible-playbook playbook.yml --list-tasks 2>&1 | grep -i handler
Fix now
Handlers must be defined in the handlers: block at the play level — not inside a tasks: block, not inside a role's tasks/main.yml (they go in the role's handlers/main.yml). The string in notify: must match the handler's name: field exactly, including case and trailing spaces. If using listen: topics, verify the topic string matches identically. Also confirm the notifying task actually reported 'changed' — if it reported 'ok', the handler is intentionally skipped. Add -v to the playbook run to see handler trigger events in the output.
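The placement and naming rules above can be sketched as a single play. The file names and the topic string are illustrative; what matters is that the notify string and the listen string are identical:

```yaml
# Sketch only: site.conf and the topic name are hypothetical.
- name: Deploy web config
  hosts: webservers
  become: true
  tasks:
    - name: Install site config
      ansible.builtin.copy:
        src: site.conf                # hypothetical file next to the playbook
        dest: /etc/nginx/conf.d/site.conf
        mode: "0644"
      notify: restart web stack       # must match the listen topic exactly

  handlers:                           # play level, not inside tasks
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
      listen: restart web stack
```

Using a listen topic lets several handlers react to one notify string without the tasks having to know each handler's name.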
Ansible vs Chef vs Puppet vs SaltStack — Key Differences at a Glance
| Feature | Ansible | Chef | Puppet | SaltStack |
| --- | --- | --- | --- | --- |
| Architecture | Agentless (push) | Agent (pull) | Agent (pull) | Agent (push/pull) |
| Language | YAML | Ruby DSL | Puppet DSL | YAML / Python |
| Setup complexity | Low (SSH only) | High (server + agent) | High (server + agent) | Medium (master-minion) |
| Learning curve | Lowest | Medium-High | Medium | Medium |
| Scale performance | Moderate (SSH bottleneck) | Good (background agent) | Good | Excellent (ZeroMQ) |
| Idempotency | Module-dependent | Built-in | Built-in | Module-dependent |
| Testing ecosystem | Molecule (evolving) | Test Kitchen (mature) | Beaker (mature) | Test Kitchen (via kitchen-salt) |
| Best for | Small-medium fleets, CI/CD, ad-hoc | Large enterprises, audit-heavy | Large fleets with strict compliance | Massive fleets, HPC, real-time |

Key takeaways

1
Ansible is agentless: the control node pushes tasks over SSH. Idempotency requires dedicated modules, not shell.
2
`state: latest` is dangerous in production; pin versions explicitly with `name: package=1.2.*`.
3
Handlers restart services only when config changes. Define them in the handlers: block, not inside tasks.
4
Variable precedence has 22 levels; --extra-vars overrides everything. Use debug to inspect values.
5
Always run --check --diff before production playbooks. Fail CI pipelines if unexpected changes appear.
6
Facts (ansible_*) are dynamic but can be cached. Never set `gather_facts: false` unless you know facts are already fresh.
7
The wait_for module prevents deployment false positives by waiting for services to actually listen.
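As an illustration of the wait_for takeaway, here is a sketch that only reports success once the service accepts connections. The port and timings are assumptions:

```yaml
# Sketch only: port 443 and the timing values are illustrative.
- name: Start nginx and verify it is listening
  hosts: webservers
  become: true
  tasks:
    - name: Ensure nginx is running
      ansible.builtin.service:
        name: nginx
        state: started

    - name: Wait until nginx accepts connections
      ansible.builtin.wait_for:
        host: 127.0.0.1
        port: 443                     # assumed listen port
        delay: 2                      # seconds before the first check
        timeout: 30                   # fail the task if the port never opens
```

The service task succeeds as soon as the init system accepts the request; the wait_for task is what catches a process that starts and then dies on a bad config.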

Common mistakes to avoid

4 patterns
×

Using `state: latest` in production playbooks

Symptom
A routine playbook run unexpectedly upgrades critical packages (e.g., Nginx, OpenSSL) across the entire fleet, changing default behavior and breaking integrations. The change is logged as 'changed' with no version information, making post-incident forensics slow.
Fix
Pin package versions explicitly: name: nginx=1.24.* (wildcard on patch version). Run version upgrades through a separate, scheduled playbook with staging validation. Never use state: latest on production runs that are not explicitly upgrade workflows.
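A pinned install along the lines described in the fix might look like this. The 1.24 series comes from the incident above; the cache settings are assumptions:

```yaml
# Sketch only: cache settings are illustrative defaults.
- name: Install a pinned nginx
  hosts: webservers
  become: true
  tasks:
    - name: Install nginx from the 1.24 series
      ansible.builtin.apt:
        name: "nginx=1.24.*"          # wildcard on the patch version only
        state: present                # never 'latest' outside upgrade playbooks
        update_cache: true
        cache_valid_time: 3600        # reuse the apt cache for an hour
```

Moving to a new minor series then becomes a one-line change in a reviewed commit rather than a side effect of whatever the mirror serves that day.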
×

Placing handlers inside tasks or roles incorrectly

Symptom
A handler defined in a role's tasks/main.yml is never triggered, or a handler with a listen topic never runs despite tasks notifying that topic. The service does not restart after a config change, but the playbook shows 'changed' on the config task.
Fix
Handlers must be defined in the handlers: block at the play level, or in a role's handlers/main.yml. For listen topics, ensure the string matches exactly (case-sensitive). Always verify handler triggers with -v flag to see which handlers were notified.
×

Using `shell` or `command` modules where dedicated modules exist

Symptom
Every playbook run shows 'changed' for the same tasks, even when nothing has changed. The output is noisy, handlers restart services unnecessarily, and check mode reports false positives. Idempotency is completely broken.
Fix
Replace shell: apt-get install nginx with apt module. Replace shell: echo 'config' > /etc/file.conf with copy or template. If no dedicated module exists, add creates or changed_when to make the task idempotent. Audit all shell and command tasks regularly.
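The replacements above, sketched as tasks. The vendor installer and its marker path are hypothetical and stand in for any command that has no dedicated module:

```yaml
# Sketch only: the installer script and marker path are hypothetical.
- name: Modules instead of shell
  hosts: all
  become: true
  tasks:
    # Instead of: shell: echo 'config' > /etc/file.conf
    - name: Write the config file
      ansible.builtin.copy:
        content: "config\n"
        dest: /etc/file.conf
        mode: "0644"

    # No dedicated module available: guard the command with creates
    - name: Run the vendor installer once
      ansible.builtin.command:
        cmd: /tmp/vendor-installer.sh --quiet
        creates: /opt/vendor/bin/agent  # task is skipped when this file exists
```

The copy task compares content before writing, so it reports 'ok' on every run after the first. The creates guard only works if the command really does produce that file.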
×

Assuming facts are always up to date or disabling them unnecessarily

Symptom
A playbook uses ansible_distribution or ansible_memory_mb, but gather_facts: false was set, so those variables are undefined or served stale from an old cache. The playbook fails outright or applies incorrect configuration (e.g., installing yum packages on Ubuntu) and misallocates resources.
Fix
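Set gather_facts: true at the play level or call the setup module explicitly. If you need performance, enable fact caching in ansible.cfg (gathering = smart with a fact_caching plugin such as jsonfile and a fact_caching_timeout) so facts are reused until they expire; cacheable: yes on set_fact only adds your own variables to that cache. Never disable facts globally.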
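If full fact gathering is too slow, a middle ground is to call setup explicitly with only the subsets the play needs. This sketch assumes the play branches on distribution and memory:

```yaml
# Sketch only: the chosen subsets depend on which facts your play reads.
- name: Gather only the facts the play needs
  hosts: all
  gather_facts: false                 # skip the implicit full gather
  tasks:
    - name: Collect minimal and hardware facts
      ansible.builtin.setup:
        gather_subset:
          - min                       # includes distribution facts
          - hardware                  # includes memory facts

    - name: Show what the play will branch on
      ansible.builtin.debug:
        msg: >-
          {{ ansible_facts['distribution'] }} with
          {{ ansible_facts['memtotal_mb'] }} MB RAM
```

The facts are fresh on every run, but the slower network and virtualisation collectors are never invoked.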
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
What is idempotency in Ansible and why does it matter for production aut...
Q02 · SENIOR
Explain the difference between `state: present`, `state: latest`, and `s...
Q03 · SENIOR
What are Ansible handlers and how do they differ from regular tasks? Giv...
Q04 · SENIOR
How does variable precedence work in Ansible? What is the highest preced...
Q05 · SENIOR
What is the purpose of `--check` mode and what are its limitations? How ...
Q01 of 05 · SENIOR

What is idempotency in Ansible and why does it matter for production automation?

ANSWER
Idempotency means running the same playbook multiple times produces the same final state as running it once. If a task is idempotent, the second run should report 'ok' (no change) because the system is already in the desired state. This matters in production because you can safely re-run playbooks without worrying about unintended side effects like duplicate configuration lines or unnecessary service restarts. Idempotency comes from using modules like apt, copy, template, and service instead of shell/command. The shell module is not idempotent by default.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
Can Ansible manage Windows servers?
02
How do I securely store secrets like API keys or passwords in Ansible?
03
What is the difference between `include_tasks` and `import_tasks`?
04
How do I speed up Ansible on large fleets (1000+ hosts)?