Ansible state:latest — One Task Broke Payments for 47 Min
Nginx 1.24→1.26 broke TLS handshakes with payment processors across 50 servers in 90 seconds.
20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.
- Ansible is agentless configuration management — SSH in from a control node, no software installed on targets
- Three core concepts: Inventory (which servers), Playbooks (what state they should be in), Modules (how to get there)
- Idempotency means running the same playbook 100 times produces the same result as running it once — only true if you use proper modules like apt, template, and service instead of shell
- Performance trade-off: agentless means zero agent maintenance on targets but higher SSH overhead on the control node; default forks=5 is too low for any real fleet
- Production trap: 'state: latest' installs whatever the package mirror serves that day — a Monday morning playbook run can silently upgrade Nginx across 50 servers and break TLS configuration you never touched
- Biggest mistake: skipping handlers and using a plain 'service: state=restarted' task — that restarts Nginx every single run, even when the config file didn't change, which means unnecessary downtime on every playbook execution
Imagine you're a school principal and you need to deliver the same set of instructions to 500 students spread across 10 classrooms. You wouldn't walk into each room yourself, repeat yourself 500 times, and hope you said it the same way each time. You'd write a single instruction sheet and hand it to every teacher simultaneously — they deliver the message to their rooms in parallel, in exactly the same words, every time. Ansible works exactly like that. You write your instructions once in a file called a playbook, tell Ansible which servers to target in an inventory file, and it SSHes into all of them concurrently and executes everything in the order you specified. No agent software on any server. No daemons to babysit. Just SSH and a clean YAML file that describes the world you want to exist.
Every cloud infrastructure beyond a certain size hits the same wall. Someone on the team is spending their Friday afternoon manually SSH-ing into 30 servers, running the same five commands in sequence, and quietly hoping they didn't typo on server 24. It's slow. It's completely unauditable. And it doesn't scale — not to 100 servers, not to three environments, certainly not to a team of ten engineers who all have slightly different opinions about how to run that one sed command.
Scale that manual process to hundreds of EC2 instances or GCP VMs and the problem stops being annoying and starts being a business risk. An outage caused by configuration drift — two servers out of forty that silently diverged from the others — is nearly impossible to diagnose if you have no record of what was changed, when, and by whom.
That's the exact world Ansible was built to fix. It enforces a declared, version-controlled state across every machine in your fleet simultaneously, starting from nothing more than an SSH key and a YAML file checked into Git.
But Ansible has traps that aren't obvious from the documentation. state: latest looks safe until it upgrades Nginx on a Monday morning and changes a default TLS cipher suite. Handlers look optional until you realize your playbook has been restarting your app server on every run for three months. Roles look like bureaucratic overhead until your playbook hits 300 lines and two engineers are editing conflicting sections.
By the end of this article you'll understand not just how to write playbooks, but why they're structured the way they are. You'll know how inventory files map to real cloud environments, how roles package automation that other teams can actually reuse, and how handlers restart services only when config genuinely changed — not on every run. You'll leave with the mental model and the production lessons that take most engineers two or three incidents to learn the hard way.
Why Ansible Basics Are Not Optional
Ansible is an agentless automation tool that uses SSH (or WinRM) to push declarative state to remote hosts. You write YAML playbooks describing the desired state (package installed, service running, file present) and Ansible idempotently converges the system to match. No agents, no daemons, no persistent connections: it spins up, executes, and tears down. The core mechanic is a push-based, stateless model where each playbook run is a fresh transaction against the target inventory. This means zero overhead on managed nodes, but also zero visibility into drift between runs: nothing corrects or even notices a manual change until the next playbook execution.
In practice, Ansible evaluates tasks sequentially, gathering facts first, then applying changes in order. Idempotency is not magic: it depends on modules checking current state before acting (for example, the yum module checks whether a package is already installed before installing it). If a module doesn't support check mode or fails to detect state correctly, you get unintended side effects. The control machine does all the heavy lifting; targets only need Python and SSH. This makes Ansible fast to adopt but slow at scale: with the default forks=5, each task works through 1000 hosts in roughly 200 batches unless you raise forks or use async.
Use Ansible when you need to bootstrap, configure, or enforce state across a fleet without installing permanent infrastructure. It excels at one-shot provisioning, compliance checks, and ad-hoc fixes. But it is not a real-time configuration daemon—if you need continuous enforcement, pair it with a pull-based tool like Chef or a GitOps operator. The moment you treat Ansible as a live monitoring system, you will miss drift until the next playbook run.
Step-by-Step Ansible Installation on Ubuntu 22.04 — Control Node Setup That Lasts
Before you write a single playbook, you need a control node. This is the machine from which you'll run all Ansible commands. It can be your laptop, a dedicated jump box, or a CI runner. The installation method you choose has operational consequences for upgrade cycles and environment consistency.
Three installation methods compete for your attention. The Python package manager (pip) is the most flexible and lets you pin exact versions. The distribution's apt repository gives you system integration and automatic updates. The newer pipx method isolates Ansible in its own virtual environment and is the official Python Packaging Authority (PyPA) recommendation for installing CLI tools.
For production control nodes — dedicated VMs or CI runners — pip installation inside a Python virtual environment is the standard. It gives you version pinning (critical for consistency), isolation from the system Python, and easy upgrades via requirements files. The following sequence sets up Ansible in a virtual environment under /opt/ansible, with a symlink in /usr/local/bin for global access.
After installation, you configure the control node's SSH access. Ansible needs to reach every managed server via SSH with key-based authentication. The common failure point is SSH host key checking. When you connect to a server for the first time, Ansible's default behavior is to verify the host key against ~/.ssh/known_hosts. In dynamic cloud environments where IPs are recycled, this causes prompt blocks. The production fix is to manage known_hosts via a pre-seeded file or use host_key_checking=False in ansible.cfg with an understanding of the security trade-off.
The inventory file is your first configuration file. It lists your managed nodes and groups them logically. For testing, a one-line inventory with a single server is enough. In production, you'll use dynamic inventory plugins that query cloud APIs.
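As a rough illustration of the grouping described above, a static inventory in Ansible's YAML format might look like the sketch below. The host names, IP addresses, user name, key path, and the postgres_port variable are all hypothetical placeholders, not values from a real environment.

```yaml
# inventory.yml (hypothetical hosts and addresses, for illustration only)
all:
  vars:
    # Connection settings shared by every host in this inventory
    ansible_user: deploy
    ansible_ssh_private_key_file: ~/.ssh/id_ed25519
  children:
    web_servers:
      hosts:
        web01:
          ansible_host: 10.0.1.10
        web02:
          ansible_host: 10.0.1.11
    db_servers:
      hosts:
        db01:
          ansible_host: 10.0.2.10
      vars:
        # Group variable: visible to every host in db_servers
        postgres_port: 5432
```

Running ansible-inventory -i inventory.yml --graph prints the group tree Ansible parsed from the file, which is a quick way to catch indentation mistakes before any playbook touches a server.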
To verify installation, run ansible all -i 'localhost,' -m ping -c local. This pings the control node itself without SSH, confirming the Ansible engine works.
The virtual environment under /opt/ansible is self-contained. If you need to roll back an Ansible version, you delete the venv and recreate it, with no system pollution. In CI, use a fresh venv per pipeline run, pinned to the exact same Ansible version your team uses locally.
A common version-skew failure: engineers run ansible-core 2.17.4 locally, but the CI container or runner has 2.14.0 from the base OS. Module behavior changes across major versions. The fix: always pin the Ansible version in your project's requirements.txt or Dockerfile. For cloud VMs used as persistent control nodes (e.g., a Jenkins agent), recreate the venv from a locked requirements file after any base OS update that might have pulled in a newer Python version.
Pin the ansible-core version in a requirements file, and configure ansible.cfg with host_key_checking=False, pipelining=True, and forks=25 before writing any playbooks.
Ansible vs Chef vs Puppet vs SaltStack — Choosing the Right Configuration Management Tool
If you're new to configuration management, the first question is not 'how do I use Ansible' but 'should I use Ansible at all?' Chef, Puppet, and SaltStack are the three other major players, and each has strengths that match different operational philosophies. The right choice depends on your team's existing skills, infrastructure scale, and whether you prefer a push or pull model.
Ansible is the only major tool that is agentless — it connects to managed nodes over SSH (or WinRM) and executes tasks without installing any software. This makes initial setup trivial: if you have SSH keys, you have Ansible. The trade-off is that every playbook run opens new SSH connections, which creates overhead at scale. Ansible uses YAML for its playbooks, which is the easiest language for non-programmers to read and write. Its push model means you initiate changes from a central control node, which is natural for ad-hoc operations and CI/CD pipelines.
Chef uses a pull model: a client agent on each node periodically fetches the desired state from a Chef server. This is more resilient in environments where nodes are behind firewalls or have intermittent connectivity. Chef uses Ruby DSL for its cookbooks, which is more powerful but has a steeper learning curve. Its test-kitchen testing framework is the most mature in the CM space. Chef is strong for organizations that need a full audit trail and have dedicated platform teams.
Puppet also uses a pull model with an agent and is the oldest of the four. It has its own declarative language (Puppet DSL) that is designed for idempotency from the ground up. Puppet's module ecosystem (Puppet Forge) is vast, and its reporting capabilities are excellent. The downside is the complexity of running a Puppet server and the agent overhead on each node.
SaltStack (Salt) offers both push and pull modes via its ZeroMQ message bus, making it extremely fast at scale — it can manage thousands of nodes in seconds. It uses YAML or Python for its states (SLS files) and includes a powerful event-driven reactor system. Salt's master-minion architecture requires an agent and a master, but the agent is lightweight. It's popular in high-performance computing and environments that need real-time command execution.
The table below summarizes the key differences so you can make an informed decision based on your team's context.
Ansible Architecture — How the Control Node, Inventory, and Managed Nodes Interact
Understanding Ansible's architecture is the foundation for debugging connection issues, scaling automation, and choosing the right deployment model. The architecture is deceptively simple: a control node runs Ansible, reads an inventory, and connects to managed nodes via SSH. But the simplicity hides a few sharp edges that only show up in production at scale.
The control node is any machine with Ansible installed — your laptop, a build server, a dedicated jump host. It's the single point of failure in the architecture. If your control node goes down, you cannot run any automation until it's restored. This is why production setups use multiple control nodes in a load-balanced fashion or rely on a CI/CD platform that can re-run jobs from any agent.
The inventory is the source of truth for which nodes exist and how they're grouped. Static inventory files map hostnames to IP addresses. Dynamic inventory plugins query cloud provider APIs and build the host list at runtime. The inventory also holds variables that travel with hosts into playbooks.
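For the dynamic case, a minimal sketch using the amazon.aws.aws_ec2 inventory plugin is shown here. It assumes the amazon.aws collection and boto3 are installed on the control node and that AWS credentials are available in the environment; the region and the Environment and Role tag names are assumptions chosen for the example.

```yaml
# inventory.aws_ec2.yml (the plugin expects the file name to end in aws_ec2.yml or aws_ec2.yaml)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1                      # assumed region
filters:
  # Only running instances tagged Environment=production (assumed tag scheme)
  instance-state-name: running
  tag:Environment: production
keyed_groups:
  # Build groups such as role_web or role_db from each instance's Role tag
  - key: tags.Role
    prefix: role
hostnames:
  # Use the private IP as the inventory hostname
  - private-ip-address
compose:
  # Connect over the private address
  ansible_host: private_ip_address
```

Because the host list is built at runtime, an instance replaced by autoscaling shows up in the next run without anyone editing a file.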
Managed nodes are the target servers. They need SSH access from the control node and Python 3 installed. That's it. No agent, no daemon, no open ports beyond SSH. This is the biggest architectural advantage over agent-based tools: you can manage any server that's reachable over SSH, including on-premise machines, cloud VMs, containers (via Docker exec), and even Windows via WinRM.
Ansible's execution model is push-based. You run a command on the control node, Ansible opens SSH connections to each managed node in parallel (controlled by the forks setting), copies the Python module code over, executes it, collects JSON results, and closes the connection. There is no persistent connection. This simplicity means Ansible is stateless from the managed node's perspective, but it also means every playbook run pays the SSH connection overhead.
The following diagram visualizes the flow during a typical playbook run. The control node reads the playbook and inventory, resolves variables, then works through the plays one after another. Within a play, each task runs on all targeted hosts in parallel, up to forks concurrent connections, before the next task starts.
With forks=25 and pipelining=True, a 100-server fleet completes a play in 4 batches instead of 20. If you see 'Maximum number of SSH sessions reached' errors, reduce forks or increase the SSH MaxSessions on the managed nodes. A common oversight: cloud network security groups limit inbound connections; at high forks, the control node may hit the flow table limit on NAT instances.
Ad-Hoc Ansible Commands — Quick Operations Without Writing a Playbook
Not every task deserves a playbook. Sometimes you need to check the uptime on 50 servers, copy a configuration file to a specific server, or restart a service immediately during an incident. Ad-hoc commands are single-module operations you run directly from the command line without a playbook file. They're ideal for read-only queries, one-off changes, and emergency responses.
The pattern is always: ansible <host-pattern> -m <module> -a '<module arguments>'. The host pattern matches inventory groups, wildcards, or specific hostnames. The module name is the Ansible module to use. The -a argument string depends on the module.
Three modules dominate ad-hoc usage. The ping module tests SSH connectivity and Python availability; it's the first command you run after setting up a new inventory. The command module runs a program with its arguments directly, without a shell; it always reports 'changed' and is not idempotent. In ad-hoc mode, that's usually fine because you're making a one-off change. The shell module is similar but runs through /bin/sh and supports shell operators like pipes and redirects.
For copying files, the copy module is idempotent even in ad-hoc mode: it only transfers the file if the source and destination differ. This makes it safe to use for emergency configuration pushes without worrying about overwriting an identical file unnecessarily.
Ad-hoc commands are powerful but leave no audit trail unless you log them. Every ad-hoc change should be logged with script or tee and followed up with a permanent playbook change. If you find yourself running the same ad-hoc command twice, it's a sign that operation should be a playbook.
A practical production use case: a security vulnerability requires updating a package version across the fleet immediately. You cannot wait for the CI pipeline. ansible all -m ansible.builtin.apt -a 'name=openssl state=latest update_cache=yes' --become patches all servers in one command. After the emergency, you pin the intended version in your main playbook and remove the state=latest usage.
Use script to log the session, or pipe output to a file. Better yet, append the command to a runbook that later becomes a playbook. If an ad-hoc command caused a production incident, you'll need the exact command and output for the post-mortem. Without logs, you've lost the evidence.
Ad-hoc commands honor the forks setting, so ansible all -m ping on 200 servers with forks=25 runs in 8 sequential batches. Use the -f or --forks flag to temporarily increase parallelism for a large ad-hoc command: ansible all -m ping -f 50. Be cautious with shell commands that produce large output on many hosts, because the control node's memory can spike. For read-only queries, pipe through grep or summarize with ansible all -m command -a 'your_command' --one-line.
Ansible Modules Quick-Reference — The 15 Most Common Modules and When to Use Them
Modules are the building blocks of every Ansible task. Each module is a small Python script that performs a specific operation — installing a package, copying a file, managing a service. The art of writing good playbooks is knowing which module to use for which job. Using the wrong module (especially shell or command when a dedicated module exists) breaks idempotency, disables check mode, and makes your playbooks unreliable.
The table below lists the 15 most commonly used modules in production environments, along with their primary use case and a quick example. Master these and you can automate roughly 90% of infrastructure tasks.
| Module | Description | Use Case | Example |
|---|---|---|---|
| apt | Manage apt packages | Install/update packages on Debian/Ubuntu | name=nginx state=present |
| yum | Manage yum packages | Install/update packages on RHEL/CentOS | name=httpd state=latest (avoid latest) |
| copy | Copy file to remote node | Deploy config files, scripts | src=nginx.conf dest=/etc/nginx/nginx.conf |
| template | Render Jinja2 template and copy | Deploy config with dynamic variables | src=app.conf.j2 dest=/etc/app/app.conf |
| service | Manage system services (upstart/sysv) | Start/stop/enable services | name=nginx state=started enabled=yes |
| systemd | Manage systemd services | Start/stop/enable systemd services | name=webapp state=reloaded daemon_reload=yes |
| file | Manage files and directories | Create directories, set permissions | path=/opt/app state=directory mode=0755 |
| user | Manage OS users | Create/delete user accounts | name=deploy state=present groups=sudo |
| group | Manage OS groups | Create/delete groups | name=webadmin state=present |
| command | Execute a command | Run arbitrary commands (no shell) | cmd=/usr/bin/uptime |
| shell | Execute via shell | Run commands with pipes, redirects | `cmd: df -h / \| tail -1` |
| debug | Print variables during execution | Troubleshooting variable values | msg="Current user is {{ ansible_user }}" |
| assert | Validate conditions | Halt playbook if precondition fails | that: "ansible_os_family == 'Debian'" |
| wait_for | Wait for port/condition | Pause until service is ready | port=8080 host=10.0.1.10 state=started |
| uri | Interact with HTTP APIs | Health checks, REST API calls | url=https://api.example.com/health |
The most important production rule: always prefer the dedicated module over command or shell. If you find yourself writing shell: apt-get install, replace it with the apt module. The dedicated module provides idempotency, check mode support, and proper change detection. The only legitimate use for command/shell is when no module exists for the operation, or in ad-hoc emergency commands.
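The rule above can be sketched as a short play. This is a minimal illustration under stated assumptions: the web_servers group, the /opt/app/bin/init-data and /opt/app/bin/schema-version programs, and the marker file path are hypothetical names invented for the example.

```yaml
- name: Prefer dedicated modules, constrain command when you cannot
  hosts: web_servers
  become: true
  tasks:
    # Anti-pattern, left commented out: reports "changed" on every run
    # - name: Install nginx via shell
    #   ansible.builtin.shell: apt-get install -y nginx

    - name: Install nginx with the dedicated module
      ansible.builtin.apt:
        name: nginx
        state: present

    - name: Initialise the app data directory (hypothetical tool, no module exists)
      ansible.builtin.command:
        cmd: /opt/app/bin/init-data --dir /var/lib/app
        # The task is skipped once this marker file exists
        creates: /var/lib/app/.initialized

    - name: Read the current schema version (hypothetical read-only command)
      ansible.builtin.command:
        cmd: /opt/app/bin/schema-version
      register: schema_version
      # A read-only query never changes the host, so never report "changed"
      changed_when: false
```

The creates argument and changed_when: false are the two cheapest ways to stop a command task from lying about what it did; neither makes the command itself any smarter.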
The shell and command modules have no way to check current state before executing. They always show 'changed' in the playbook output, which means they trigger handlers unconditionally and make your playbook output noisy. Add a changed_when condition based on the registered result, or use creates:/removes:, to restore idempotency when you must use these modules. But the best practice is to find a dedicated module whenever possible.
Don't overlook wait_for. It's invaluable in deployment pipelines where an application server takes 30 seconds to start listening on its port. Without it, downstream checks fail and the deploy appears broken. An example: wait_for: port=3000 host={{ inventory_hostname }} timeout=60 after starting a Node.js service. This single module eliminates the most common false-positive deployment failure.
Prefer dedicated modules over shell/command to preserve idempotency, check mode, and reliable change detection.
Your First Ansible Playbook — Install Apache on a Web Server Cluster
A playbook is a YAML file that describes the desired state of a set of hosts. It's the core unit of automation in Ansible. Writing your first playbook reinforces the mental model of declaring 'what should be true' rather than scripting 'what commands to run'.
The canonical first playbook installs and configures Apache on a group of web servers. It exercises the three most common modules: apt for package management, template for configuration files, and service for daemon management. It also introduces handlers by restarting Apache only when the configuration file actually changes.
Create an inventory file with one or two test servers, or use localhost with ansible_connection=local for a safe first run. The playbook below assumes an inventory group called web_servers that you define. It becomes root via become: true because installing packages and starting services requires superuser privileges.
The playbook has four tasks plus a handler. The first task installs Apache at the version provided by the distribution's default repositories. Using state: present ensures it's installed but won't upgrade it unexpectedly. The second task creates a custom index.html using the copy module with content directly — avoids a template file for this simple example. The third task copies an Apache virtualhost configuration from a file on the control node. The fourth task enables the site and ensures Apache runs on boot. The handler restarts Apache only when the virtualhost config changes.
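A sketch of a playbook matching that description follows. It is an approximation, not the original listing: the file name files/mysite.conf, the page content, and the choice to split the fourth step into two tasks (a symlink to enable the site, then a service task) are assumptions made for this example.

```yaml
---
- name: Install and configure Apache
  hosts: web_servers
  become: true

  tasks:
    - name: Install Apache from the distribution repositories
      ansible.builtin.apt:
        name: apache2
        state: present            # install if missing, never upgrade
        update_cache: true
        cache_valid_time: 3600    # reuse the apt cache for an hour

    - name: Deploy a simple index page
      ansible.builtin.copy:
        content: "<h1>Served by {{ inventory_hostname }}</h1>\n"
        dest: /var/www/html/index.html
        owner: root
        group: root
        mode: "0644"

    - name: Copy the virtual host configuration from the control node
      ansible.builtin.copy:
        src: files/mysite.conf    # hypothetical file next to the playbook
        dest: /etc/apache2/sites-available/mysite.conf
        owner: root
        group: root
        mode: "0644"
      notify: Restart Apache

    - name: Enable the site (the same symlink a2ensite would create)
      ansible.builtin.file:
        src: /etc/apache2/sites-available/mysite.conf
        dest: /etc/apache2/sites-enabled/mysite.conf
        state: link

    - name: Ensure Apache is running and starts on boot
      ansible.builtin.service:
        name: apache2
        state: started
        enabled: true

  handlers:
    # Runs once at the end of the play, and only if a notifying task changed
    - name: Restart Apache
      ansible.builtin.service:
        name: apache2
        state: restarted
```

The handler is the part worth studying: a plain restart task in the tasks list would bounce Apache on every run, while the notify wiring restarts it only when the virtual host file actually changed.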
Running the playbook the first time changes state (installs, writes, restarts). Running it again reports 'ok' for all tasks because the desired state is already in place — this is idempotency in action.
Use --check --diff before the first real run to see what would change without actually changing anything. This is covered in the next section.
Create an inventory containing localhost ansible_connection=local and run this playbook against it. You'll see the full playbook cycle without needing SSH keys or remote access. Once you understand the flow, replace localhost with a real server. This is the fastest way to learn the syntax without network debugging distractions.
Setting cache_valid_time: 3600 on the apt task is critical the first time. Without it, every playbook run pays a 10-15 second apt update per server. After the first run, the cache is fresh, and subsequent runs within the hour skip the update. In CI pipelines that create fresh control nodes each run, consider pre-seeding apt cache or removing update_cache: true and relying on base AMI images with up-to-date packages.
This first playbook covers apt, copy, service, and handlers, the four building blocks of 80% of all automation.
Playbook Check Mode and Diff — Validate Before You Change
The --check flag runs a playbook in 'dry-run' mode: Ansible evaluates every task's condition and reports what would change without actually making any changes. Combined with --diff, it shows the exact content differences for template, copy, and other modules that manage file content. This combination is the closest thing Ansible has to a pre-deployment validation step.
Check mode is not a blind simulation. Modules that support it (most built-in modules) inspect the real current state and report 'changed' or 'ok' based on whether the task would alter the system. Modules that don't support check mode are skipped, which reduces confidence in the result. The shell and command modules, for example, are skipped in check mode unless you give them creates or removes, because Ansible cannot predict what an arbitrary command would do. This is another reason to prefer dedicated modules.
The --diff flag shows the before-and-after content for files managed by copy, template, lineinfile, and similar modules. It also shows which lines in configuration files would be added, removed, or modified. You review this output to catch mistakes like a typo in a template variable or an incorrect file permission before they reach production.
In production CI pipelines, every playbook run that targets staging or production should first execute a check-diff run. If the playbook would change more than expected (e.g., 200 files changed when you only expected 2), the pipeline should halt and alert a human. This is a classic 'change validation' pattern.
One caveat: a handler notified during a dry run never actually restarts anything, so check mode cannot tell you whether the restart itself would succeed. It also skips command and shell tasks, so any later task that depends on their registered output has nothing useful to work with, and the dry run becomes less reliable. The rule of thumb: the more dedicated modules you use, the more accurate your dry-run results will be.
In CI, add a --check --diff step before the real deployment. If the dry-run shows more than a trivial number of changes (e.g., more than 3 tasks changed), fail the pipeline. This catches accidental edits to group_vars, stale templates, or a misconfigured inventory that would cause a mass config update. Many teams skip this because they trust their playbooks. The production incident at the start of this article could have been prevented by a dry-run that showed the Nginx version would change across 50 servers.
Always dry-run with --check --diff. It shows what would change without touching any server. Use it as a CI validation gate to catch unexpected changes before they cause an incident. The accuracy of check mode improves with every shell task you replace with a dedicated module.
Ansible Variables and Precedence — The 22 Levels That Bite You
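One way to soften that caveat is to mark read-only commands so they still run during a dry run. The sketch below assumes an nginx host group member with a nginx.conf.j2 template next to the playbook; both are placeholders for illustration.

```yaml
- name: Tasks that behave sensibly under --check --diff
  hosts: web_servers
  become: true
  tasks:
    - name: Read the installed nginx version
      ansible.builtin.command:
        cmd: nginx -v
      register: nginx_version
      changed_when: false   # a query never changes the host
      check_mode: false     # run for real even when --check is passed

    - name: Show the version the dry run is evaluating against
      ansible.builtin.debug:
        # nginx -v writes its version string to stderr
        msg: "{{ nginx_version.stderr }}"

    - name: Render nginx.conf (shows a line-level diff under --check --diff)
      ansible.builtin.template:
        src: nginx.conf.j2          # hypothetical template
        dest: /etc/nginx/nginx.conf
        mode: "0644"
      notify: Reload nginx

  handlers:
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

Only use check_mode: false on tasks that are genuinely read-only. Putting it on a task that modifies the host defeats the point of a dry run.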
Variable precedence is the most common source of 'why is my playbook using the wrong value?' in production. Ansible has 22 different places where a variable can be defined, with a specific order of precedence. When the same variable name appears in multiple places, the highest-precedence definition wins. Misunderstanding this order leads to subtle bugs that only appear in certain environments.
The precedence order from lowest to highest (later overrides earlier), abbreviated to the sources you will meet most often:
- role defaults (defaults/main.yml)
- inventory group_vars/all
- playbook group_vars/all
- inventory group_vars/groupname
- playbook group_vars/groupname
- inventory host_vars/hostname
- playbook host_vars/hostname
- host facts and cached set_fact values
- vars in the play (vars block)
- vars_files listed in the play
- role vars (vars/main.yml)
- block vars (within a block)
- task vars (vars on a task)
- include_vars
- set_fact and registered variables (at runtime)
- role and include parameters
- extra vars (-e, highest priority)
In practice, the conflict you'll encounter most often is between group_vars/all (low) and --extra-vars (high). A stray -e in a CI pipeline can override everything else in your inventory, causing the wrong environment to be configured. Another common trap: using set_fact inside a loop — it overwrites the variable each iteration instead of accumulating.
To debug variable precedence, use the debug module to print the variable value at different points in the playbook. The -v flag also shows variable interpolation. When you need to merge lists or dictionaries instead of overriding, use the combine filter with the recursive=true option.
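The accumulate-versus-overwrite behavior and the combine filter are easier to see in a runnable play. This is a small sketch that runs against localhost only; the variable names and values are invented for the example.

```yaml
- name: Accumulate and merge variables
  hosts: localhost
  gather_facts: false
  vars:
    base_settings:
      tls:
        min_version: "1.2"
      workers: 4
    override_settings:
      tls:
        ciphers: HIGH
  tasks:
    - name: Build a list across loop iterations instead of overwriting it
      ansible.builtin.set_fact:
        # default([]) supplies an empty list on the first iteration
        upstream_ports: "{{ upstream_ports | default([]) + [item] }}"
      loop: [8080, 8081, 8082]

    - name: Merge nested dictionaries instead of replacing the tls key
      ansible.builtin.set_fact:
        merged_settings: "{{ base_settings | combine(override_settings, recursive=true) }}"

    - name: Print the values the play will actually use
      ansible.builtin.debug:
        msg:
          ports: "{{ upstream_ports }}"
          settings: "{{ merged_settings }}"
```

With recursive=true the merged tls dictionary keeps min_version and gains ciphers. Without it, the override's tls dictionary replaces the base one entirely and min_version is lost.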
To accumulate values in a loop, use set_fact with the + operator on a list, or use combine for dictionaries. Example: set_fact: mylist={{ mylist | default([]) + [item] }} with loop. Without this, you'll only keep the last iteration's value.
A typical failure: a stray --extra-vars in a CI pipeline overrides environment: production to development because of a stale Jenkins parameter. Always validate that the highest-priority source (CLI or extra-vars) contains the expected values before running a production playbook. A simple debug task at the start of the playbook that prints all critical variables would have prevented many incidents.
Nothing overrides --extra-vars. Debug variable values at runtime with the debug module. When overriding dictionaries or lists, use the combine filter with recursive=true to merge instead of replace.
Ansible Roles — Stop Copy-Pasting Playbooks Like an Amateur
Roles are how you stop treating Ansible like a scripting language and start treating it like infrastructure code you’d ship to production. Without roles, your playbooks become a tangled mess of tasks, handlers, and variables that only you understand — until you don't. Roles enforce a filesystem contract: tasks go in tasks/, handlers in handlers/, defaults in defaults/, templates in templates/. This isn't bureaucracy; it's survival. When a junior onboards and your playbook has 500 lines of unorganized YAML, they will break something. Roles give you modular, reusable, testable units. Use ansible-galaxy init to scaffold one. Wire it into a playbook with a single include_role call. That's it. Your production deploys should be a composition of roles, not a novel.
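To make the filesystem contract concrete, here is a sketch of a small role and the playbook that consumes it. The four YAML documents below live in four separate files, named in the comments; the role name nginx, the nginx_worker_processes variable, and the template name are hypothetical.

```yaml
# roles/nginx/defaults/main.yml (lowest-precedence values, meant to be overridden)
nginx_worker_processes: auto
---
# roles/nginx/tasks/main.yml
- name: Install nginx
  ansible.builtin.apt:
    name: nginx
    state: present

- name: Render nginx.conf from the role's templates/ directory
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    mode: "0644"
  notify: Reload nginx
---
# roles/nginx/handlers/main.yml
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded
---
# site.yml (the playbook that composes roles)
- name: Configure the web tier
  hosts: web_servers
  become: true
  tasks:
    - name: Apply the nginx role with one default overridden
      ansible.builtin.include_role:
        name: nginx
      vars:
        nginx_worker_processes: 4
```

Keeping tunables in defaults/main.yml rather than vars/main.yml is deliberate: role defaults sit at the bottom of the precedence order, so inventory and playbook values can override them without editing the role.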
Ansible vs Terraform — Two Different Hammers, One Toolbox
Here’s what nobody tells you: Ansible and Terraform are not competitors. They solve different layers of the same problem. Terraform is a provisioning tool. It talks to cloud APIs to create infrastructure — VMs, networks, load balancers. It cares about state, drift detection, and idempotent resource creation. Ansible is a configuration tool. It logs into that provisioned machine and installs software, tweaks config files, starts services. Terraform asks 'Does this VM exist?' Ansible asks 'Is Apache running?' Use Terraform to build the house. Use Ansible to furnish it. The moment you try to use Ansible to create an AWS EC2 instance, you’re fighting the tool. The moment you use Terraform to configure nginx inside that instance, you’re fighting the tool. Know the boundary. Your CI/CD pipeline should call Terraform first, then Ansible. Every senior engineer I know does this.
Write a small script that reads terraform output -json and spits out Ansible inventory YAML. Then your playbook always targets exactly what Terraform just created. No manual inventory updates. No IP copy-paste errors.
Ansible Tower (AWX) — Centralized Execution Without the Spreadsheet Mayhem
If you're still ssh-ing into your control node and running ansible-playbook by hand, you're one fat-finger away from taking down production. Ansible Tower (or its open-source upstream, AWX) gives you a web UI, RBAC, job scheduling, and — most importantly — an audit trail. When the VP asks 'Who ran that playbook at 3 AM?', you don't shrug. You pull up the job log. Tower also solves the 'works on my machine' problem. Playbooks run on Tower's execution environment, not your laptop with the experimental Python 3.12. Use it to segment environments: developers get 'Run' access to staging, read-only to production. Operations gets full control. No more shared passwords on a sticky note. The real power? The REST API. Your CI/CD can POST to Tower instead of SSH-ing into a jump box. That way, every deploy is recorded, every failure is logged, and every success is celebrated.
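As a rough sketch of the API-driven deploy described above, a CI job could launch an AWX job template with the uri module. The AWX hostname, the job template ID, the AWX_TOKEN environment variable, and the target_env variable are all assumptions for illustration; extra_vars are only honored if the template is configured to prompt for them on launch.

```yaml
- name: Trigger a deploy through AWX instead of SSH-ing to a jump box
  hosts: localhost
  gather_facts: false
  vars:
    awx_url: https://awx.example.com   # hypothetical AWX address
    job_template_id: 42                # hypothetical job template ID
  tasks:
    - name: Launch the job template
      ansible.builtin.uri:
        url: "{{ awx_url }}/api/v2/job_templates/{{ job_template_id }}/launch/"
        method: POST
        headers:
          # Token read from the CI environment (assumed variable name)
          Authorization: "Bearer {{ lookup('ansible.builtin.env', 'AWX_TOKEN') }}"
        body_format: json
        body:
          extra_vars:
            target_env: staging
        status_code: 201
      register: launch

    - name: Report the job ID so the CI log links back to the audit trail
      ansible.builtin.debug:
        msg: "AWX job {{ launch.json.job }} started"
```

The CI runner needs only a scoped API token, not SSH keys to production, and every launch appears in the AWX job history under the token's owner.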
The Monday Morning Nginx Upgrade That Broke Payment Processing for 47 Minutes
The team had adopted state: latest six months earlier as a way to keep the fleet current without managing explicit versions. The assumption was that staging had been on the new Nginx version for two weeks without issues, so production was safe. What they didn't know was that staging used a different package mirror that received the new version on a different schedule. Staging had never actually run the version that hit production that Monday.
The task responsible was ansible.builtin.apt: name=nginx state=latest update_cache=yes. Nginx went from 1.24 to 1.26 across the entire production fleet in a single playbook run: 50 servers in about 90 seconds. Nginx 1.26 changed the default TLS configuration to deprecate certain cipher suites that the payment gateway's SSL terminator still required. The application code was unchanged. The Nginx configuration file was unchanged. The only thing that changed was the Nginx binary itself, and the team had no test that validated TLS handshake compatibility with external payment processors after an Nginx version change. The state: latest task also didn't log which version was installed, only that the package was 'changed'. The post-incident investigation had to reconstruct the version change from package manager logs on individual servers.
The fix had three parts. First, state: latest in every production playbook was changed to state: present with an explicit version pin: name: nginx=1.24.*. The wildcard on the patch version allows security patches within the pinned minor version but prevents major behavior changes. Second, a separate security_updates.yml playbook was created that runs on an explicit schedule (Thursday afternoon, after a staging validation run) and includes rollback instructions as inline comments. Third, an integration test was added to the deployment pipeline that validates TLS handshake success against the payment gateway endpoint using a real certificate, run after any Nginx configuration or version change. If the handshake test fails, the pipeline rolls back the Nginx version automatically.
- state: latest is a footgun in any production playbook that runs on a schedule. It delegates the upgrade decision to whatever your package mirror happens to serve that day. Pin versions explicitly with name: package=version.* and make upgrades a deliberate, tested decision, not a side effect of a routine playbook run.
- Staging and production must use the same package repository mirror and must be on the same versions at all times. If your staging environment can silently diverge from production's package versions, it provides no safety guarantee. Mirror the production repo exactly, or use a private artifact repository that you control.
- Configuration drift detection is not the same as behavior validation. You can have perfectly idempotent configuration management and still have external integrations break when an underlying package changes its defaults. Write integration tests that validate the behavior your external dependencies rely on — TLS cipher suites, header handling, timeout behavior.
- Monday morning is the statistically worst time to run untested automation against production. You have maximum blast radius (full week of traffic ahead), minimum time since the last human review of the change (over the weekend), and maximum cognitive load on engineers who are just starting the day. Schedule risky playbooks for Thursday, after a staging run earlier in the week.
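A minimal sketch of the pinned install from the fix, assuming Debian or Ubuntu targets and an inventory group named `webservers` (a hypothetical name). It also records the installed version on every run, which covers the gap where the original task only reported 'changed'. The variable `nginx_series` is introduced here purely for illustration.

```yaml
- name: Install Nginx within a pinned minor version and record what landed
  hosts: webservers             # assumption: inventory group name
  become: true
  vars:
    nginx_series: "1.24"        # patch updates allowed, minor upgrades blocked
  tasks:
    - name: Install Nginx pinned to the approved series
      ansible.builtin.apt:
        name: "nginx={{ nginx_series }}.*"
        state: present
        update_cache: true

    - name: Collect installed package versions
      ansible.builtin.package_facts:
        manager: apt

    - name: Log the exact Nginx version on each host
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }} runs nginx {{ ansible_facts.packages['nginx'][0].version }}"

    - name: Fail loudly if the installed version left the pinned series
      ansible.builtin.assert:
        that:
          - ansible_facts.packages['nginx'][0].version.startswith(nginx_series ~ '.')
        fail_msg: "Unexpected Nginx version installed on {{ inventory_hostname }}"
```

Keeping the series in one variable means a deliberate upgrade is a one-line, reviewable change rather than a side effect of whatever the mirror serves.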
A task that reports 'changed' on every run usually means you're using the `shell` or `command` module for something a dedicated module could handle, and those modules always report 'changed' because they have no way to inspect current state. Run `ansible-playbook playbook.yml --check --diff` and look at the diff output for the offending task: if the diff shows nothing changed but the task still reports 'changed', that confirms the diagnosis. Note that `shell` and `command` tasks are skipped in check mode unless they set `creates` or `removes`, so for those a second real run against an unchanged host is the more reliable confirmation. Fix: replace `shell: apt-get install nginx` with `ansible.builtin.apt: name=nginx state=present`. If no dedicated module exists for your task, add `changed_when: false` to read-only commands to suppress false positives, or add `creates: /path/to/file` to skip the task when its output already exists. The goal is a playbook where 'changed' means something actually changed. Otherwise you stop trusting the output entirely and miss real changes.
If a handler fires on every run, run `ansible-playbook playbook.yml --check --diff 2>&1 | grep -B 5 'changed'` and look for the task just above each 'changed' marker. Common causes: a template task that reports changed because of whitespace differences or line ending inconsistencies between the template and the deployed file; a shell task that always reports changed. For template issues, set `trim_blocks: true` and `lstrip_blocks: true` on the template task, and check that the deployed file's line endings match the template's. For shell tasks causing spurious handler triggers, add `changed_when` with an explicit condition based on the command's output.
When one host behaves differently from the rest of the fleet, run `ansible -i inventory all -m setup --limit drifted-host > /tmp/drifted-facts.json` and `ansible -i inventory all -m setup --limit good-host > /tmp/good-facts.json`, then `diff /tmp/good-facts.json /tmp/drifted-facts.json` to find the divergence. Common drift sources: manual SSH changes made during a previous incident, a failed partial playbook run that left a host mid-state, or autoscaling replacing an instance from an outdated AMI. Fix: add `ansible.builtin.assert` tasks at the top of your playbook that validate preconditions (OS version, required directories existing, expected kernel parameters) so playbook failures are explicit and informative rather than cryptic mid-play errors.
When Ansible cannot reach a host over SSH, run `ansible -i inventory all -m ping -vvv`, then test the connection directly with `ssh -o ConnectTimeout=5 -o StrictHostKeyChecking=no user@target-ip 'echo connected'`. Check that `ansible_user` matches the AMI's default user (`ubuntu` for Ubuntu, `ec2-user` for Amazon Linux). For ephemeral environments like CI runners with dynamic IPs, set `ANSIBLE_HOST_KEY_CHECKING=False` or add `host_key_checking = False` to `ansible.cfg`. For VPC private subnets, ensure your control node is in the same VPC or a connected one: a NAT gateway only handles outbound traffic, so it gives Ansible no inbound path to those hosts.
Key takeaways
- Idempotency only holds if you use proper modules like `apt`, `template`, and `service` instead of `shell`.
- `state: latest` is dangerous in production: pin versions explicitly with `name: package=1.2.*`.
- Define handlers in a `handlers:` block, not inside tasks.
- In variable precedence, `--extra-vars` overrides everything. Use `debug` to inspect values.
- Run `--check --diff` before production playbooks. Fail CI pipelines if unexpected changes appear.
- Facts (`ansible_*`) are dynamic but can be cached. Never disable `gather_facts` globally.
- The `wait_for` module prevents deployment false positives by waiting for services to actually listen.
Common mistakes to avoid
4 patterns
Using `state: latest` in production playbooks
Fix: pin versions explicitly with `name: nginx=1.24.*` (wildcard on patch version). Run version upgrades through a separate, scheduled playbook with staging validation. Never use `state: latest` on production runs that are not explicitly upgrade workflows.
Placing handlers inside tasks or roles incorrectly
A handler defined in `tasks/main.yml` is never triggered, or a handler with a `listen` topic never runs despite tasks notifying that topic. The service does not restart after a config change, but the playbook shows 'changed' on the config task.
Fix: define handlers in a `handlers:` block at the play level, or in a role's `handlers/main.yml`. For `listen` topics, ensure the string matches exactly (case-sensitive). Always verify handler triggers with the `-v` flag to see which handlers were notified.
Using `shell` or `command` modules where dedicated modules exist
Fix: replace `shell: apt-get install nginx` with the `apt` module. Replace `shell: echo 'config' > /etc/file.conf` with `copy` or `template`. If no dedicated module exists, add `creates` or `changed_when` to make the task idempotent. Audit all `shell` and `command` tasks regularly.
Assuming facts are always up to date or disabling them unnecessarily
A playbook relies on `ansible_distribution` or `ansible_memory_mb`, but those variables are undefined or stale because `gather_facts: false` was set. The playbook applies incorrect configuration (e.g., installing yum packages on Ubuntu) or misallocates resources.
Fix: set `gather_facts: true` at the play level or use the `setup` module explicitly. If you need performance, enable fact caching (with `cacheable: yes` in `set_fact` for custom facts) and run a separate fact-gathering playbook periodically. Never disable facts globally.
Interview Questions on This Topic
What is idempotency in Ansible and why does it matter for production automation?
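Idempotency means that running the same playbook 100 times produces the same result as running it once. It only holds if you use proper modules like `apt`, `copy`, `template`, and `service` instead of `shell`/`command`. The `shell` module is not idempotent by default.
Frequently Asked Questions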