Introduction to Ansible — Automate Infrastructure Without an Agent
- Ansible is agentless — it connects over SSH, requiring no software installation on the target nodes. This means zero maintenance overhead on managed servers and instant onboarding for new infrastructure.
- Playbooks are human-readable YAML files that describe the 'Desired State' rather than just a list of scripts. Run them once or a hundred times — the outcome is the same.
- Prioritize dedicated modules (apt, systemd, git, copy) over generic shell commands to maintain idempotency. The shell module is a last resort, not a shortcut.
Managing 100 servers by logging into each one and typing commands is like calling 100 employees individually to give the same instruction. Ansible is like sending one company-wide email that everyone acts on simultaneously. You describe the desired state of your servers in plain English-like YAML, and Ansible connects over SSH and makes it happen — on all servers at once, with no software installed on them. Think of it this way: if your server is a hotel room, Ansible is the housekeeping checklist pinned to the door. It doesn't live in the room. It walks in, checks what needs fixing, fixes only what's broken, and walks out. The room doesn't even know Ansible was there — it just ends up clean.
Before configuration management tools, sysadmins maintained hundreds of servers by hand — logging in, running commands, hoping nothing went wrong. I lived this. In 2015, I managed a fleet of 80 web servers at a mid-size SaaS company, and every deploy night was a three-hour marathon of SSH sessions, copy-pasted commands, and prayer. One night, someone restarted the wrong database server. We lost four hours of customer data. That was the last straw.
Ansible was created by Michael DeHaan in 2012 and acquired by Red Hat in 2015 (now part of IBM). Today it runs infrastructure at NASA JPL, Capital One, and thousands of companies from Series A startups to Fortune 50 enterprises. Not because it's the most powerful automation tool, but because it's the simplest one that actually gets used.
What makes Ansible different from competitors like Chef and Puppet is that it is agentless. There is no daemon running on your managed servers, no SSL certificates to exchange, and no extra ports to open beyond standard SSH (or WinRM for Windows). Ansible runs from your control node, pushes small programs called 'Ansible Modules' to the remote nodes, executes them, and then cleans up after itself.
One important nuance: Ansible and Terraform are not competitors — they're complementary. Terraform creates infrastructure (provisions the server). Ansible configures that server (installs software, deploys code, manages services). In practice, most teams use Terraform to build the house and Ansible to furnish it.
In this guide, we'll break down Ansible's core architecture — inventories, playbooks, modules, and roles — cover ad-hoc commands for quick fleet operations, and build production-grade automation with real error handling, secret management, and reusable patterns.
Inventory, Playbooks, and Modules — The Three Core Concepts
Ansible's architecture relies on three primary building blocks.
- The Inventory: A simple file (INI or YAML) that lists the servers you want to manage. You can organize these into groups like `[webservers]` or `[databases]` to target specific logic to specific hardware. In production, you'll often use dynamic inventory — pulling host lists directly from AWS, GCP, or Azure APIs so your inventory stays accurate as servers are created and destroyed. Static inventories are fine for learning and small setups, but once you're past 20 servers with autoscaling, dynamic inventory isn't optional — it's survival.
- The Playbook: This is your automation blueprint. Written in YAML, it maps a group of hosts to a series of tasks. Playbooks describe desired state, not step-by-step instructions. This distinction matters: if Nginx is already installed and running, Ansible doesn't reinstall it. It checks, confirms, and moves on.
- Modules: These are the 'tools' in the toolbox. Instead of writing complex bash scripts, you use modules like `apt`, `yum`, `service`, or `copy`. These modules are idempotent, meaning they check the current state of the server and only make changes if the server doesn't already match your desired state. The `shell` and `command` modules are the notable exceptions — they run blindly every time, which is why experienced Ansible users avoid them unless there's no dedicated module alternative.
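When no dedicated module exists and `shell` is unavoidable, you can still restore idempotency with a guard. A minimal sketch, assuming a hypothetical release tarball and install path:

```yaml
# Hypothetical task: unpack a release tarball with the shell module.
# Without a guard this would re-run on every play; the 'creates'
# argument skips the task when the target file already exists.
- name: Unpack application release
  ansible.builtin.shell: tar xzf /tmp/app.tar.gz -C /opt/app
  args:
    creates: /opt/app/bin/app   # skip if this path already exists
```

The `creates` argument turns a blind command into a state check: Ansible reports `ok` instead of `changed` on repeat runs.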
```ini
# io.thecodeforge: Inventory for Project Forge
[webservers]
web-01.thecodeforge.io ansible_host=192.168.1.10 ansible_user=ubuntu
web-02.thecodeforge.io ansible_host=192.168.1.11 ansible_user=ubuntu

[databases]
db-01.thecodeforge.io ansible_host=192.168.1.20 ansible_user=ubuntu

[production:children]
webservers
databases

[production:vars]
ansible_ssh_private_key_file=~/.ssh/forge_deploy_key
```

```text
# ansible production -i inventory.ini -m ping
web-01.thecodeforge.io | SUCCESS => {"ping": "pong"}
db-01.thecodeforge.io | SUCCESS => {"ping": "pong"}
```
Your First Production Playbook
A playbook is a collection of 'plays'. Each play targets a specific group from your inventory and executes a sequence of tasks.
Ansible processes tasks in order, from top to bottom. If a task fails on a specific host, Ansible will stop executing the rest of the playbook for that host but continue for others. To handle configuration changes — like restarting a web server only when a config file is updated — Ansible uses Handlers. Handlers are special tasks that only run when 'notified' by another task that actually changed something.
Here's a production playbook we actually use. Notice the structure: update the package cache, install the binary, deploy a templated config, ensure the service is running. Every task is idempotent. Every task uses a dedicated module. No shell commands anywhere.
One thing the Ansible docs don't emphasize enough: variable precedence will bite you if you don't understand it. Ansible has a 22-level precedence ladder. Variables defined with -e on the command line override everything. Role defaults are the weakest. Variables in group_vars override role defaults but get overridden by host_vars. When a playbook behaves 'randomly' — applying different values on different runs — it's almost always a precedence conflict. Learn the hierarchy early. It will save you hours of debugging.
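To make the ladder concrete, here is one variable defined at three levels (file names and values are illustrative):

```yaml
# roles/nginx/defaults/main.yml  -- weakest: role default
nginx_port: 80

# group_vars/webservers.yml      -- overrides the role default
nginx_port: 8080

# host_vars/web-01.thecodeforge.io.yml -- overrides group_vars
nginx_port: 8081
```

A plain `ansible-playbook site.yml` run would apply 8081 on web-01 and 8080 on every other web server, while `ansible-playbook site.yml -e nginx_port=9090` would force 9090 everywhere, because `-e` outranks all other sources.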
```yaml
---
# io.thecodeforge: Standard Nginx Deployment Playbook
- name: Deploy and Configure Nginx
  hosts: webservers
  become: true
  vars:
    nginx_port: 80
    server_name: "thecodeforge.io"
  tasks:
    - name: Ensure apt cache is updated
      ansible.builtin.apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install Nginx production package
      ansible.builtin.apt:
        name: nginx
        state: present

    - name: Deploy custom Nginx configuration
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/default
        owner: root
        group: root
        mode: '0644'
      notify: Reload Nginx service

    - name: Ensure Nginx service is enabled and running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: yes

  handlers:
    - name: Reload Nginx service
      ansible.builtin.service:
        name: nginx
        state: reloaded
```

```text
PLAY [Deploy and Configure Nginx] ********************************************

TASK [Gathering Facts] *******************************************************
ok: [web-01.thecodeforge.io]

TASK [Install Nginx production package] **************************************
changed: [web-01.thecodeforge.io]

RUNNING HANDLER [Reload Nginx service] ***************************************
changed: [web-01.thecodeforge.io]

RECAP: ok=4 changed=2 unreachable=0 failed=0
```
Ad-hoc Commands — Quick Fleet Operations Without a Playbook
Not everything needs a playbook. Sometimes you need to run a single command across your fleet right now — check disk space, restart a hung service, verify a patch applied. That's what ad-hoc commands are for.
Ad-hoc commands are Ansible's secret weapon for day-two operations. They're the reason senior SREs reach for Ansible over SSH loops. An SSH for-loop runs the command on every server sequentially and gives you raw output. Ansible ad-hoc runs in parallel, returns structured JSON, and handles failures gracefully.
Syntax: `ansible <host-pattern> -i <inventory> -m <module> -a '<arguments>'`
In production, I use ad-hoc commands daily. Checking if all nodes in a 200-server fleet have enough disk space before a deploy? That's a one-liner. Killing a run-away process across 50 app servers? One-liner. Gathering system facts to verify a kernel patch applied? One-liner. These replace what used to be 20-minute SSH marathons.
```shell
# io.thecodeforge: Ad-hoc Command Reference

# Ping all production hosts to verify SSH connectivity
ansible production -i inventory.ini -m ping

# Check disk space across all web servers
ansible webservers -i inventory.ini -m command -a "df -h"

# Restart Nginx on all web servers immediately
ansible webservers -i inventory.ini -m service -a "name=nginx state=restarted" --become

# Gather system facts (CPU, memory, OS) from one host
ansible db-01.thecodeforge.io -i inventory.ini -m setup

# Install a security patch across the fleet
ansible production -i inventory.ini -m apt -a "name=openssl state=latest update_cache=yes" --become
```

```text
web-01.thecodeforge.io | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   12G   36G  25% /
web-02.thecodeforge.io | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   38G   10G  79% /
```

Notice web-02 is at 79% — that's your alert to investigate before it fills up.
Roles — Reusable Automation at Scale
Once your playbooks grow beyond 50 lines, you'll start copying tasks between files. That's when you need roles. A role is a self-contained unit of automation — tasks, handlers, templates, default variables, and files — packaged in a standardized directory structure. Roles are how Ansible scales from 'one playbook' to 'an entire infrastructure codebase.'
The directory structure isn't optional decoration. It's Ansible's loading convention. When you reference a role, Ansible automatically loads tasks/main.yml, handlers/main.yml, defaults/main.yml, and so on. Skip the convention and things silently don't load.
Roles come from two sources: you write your own, or you pull community roles from Ansible Galaxy (ansible-galaxy install geerlingguy.nginx). Galaxy has thousands of pre-built roles. For common software — Nginx, Docker, PostgreSQL, certbot — a community role saves hours. For your application-specific logic — deploying your Java app, configuring your monitoring — you write your own.
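If you adopt community roles, pin them in a `requirements.yml` so every engineer and CI run installs the same versions (the version numbers below are illustrative, not recommendations):

```yaml
# requirements.yml -- pinned Galaxy roles for reproducible installs
roles:
  - name: geerlingguy.nginx
    version: "3.1.4"   # hypothetical pin; check Galaxy for current releases
  - name: geerlingguy.docker
    version: "7.0.2"   # hypothetical pin
```

Install everything at once with `ansible-galaxy install -r requirements.yml`.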
In our production codebase at io.thecodeforge, every server role follows the same pattern: a defaults/main.yml with sane defaults, a tasks/main.yml with the core logic, and a templates/ directory with Jinja2 config files. When a new engineer joins, they can read any role and understand it immediately because the structure is predictable.
```yaml
---
# io.thecodeforge: Reusable Nginx Role
# Directory structure:
# roles/nginx/
# ├── defaults/main.yml   (default variables)
# ├── handlers/main.yml   (service reload/restart)
# ├── tasks/main.yml      (this file)
# └── templates/
#     └── vhost.conf.j2

- name: Install Nginx
  ansible.builtin.apt:
    name: nginx
    state: present
    update_cache: yes

- name: Deploy virtual host configuration
  ansible.builtin.template:
    src: vhost.conf.j2
    dest: "/etc/nginx/sites-available/{{ server_name }}.conf"
    owner: root
    group: root
    mode: '0644'
  notify: Reload Nginx

- name: Enable virtual host
  ansible.builtin.file:
    src: "/etc/nginx/sites-available/{{ server_name }}.conf"
    dest: "/etc/nginx/sites-enabled/{{ server_name }}.conf"
    state: link
  notify: Reload Nginx

- name: Ensure Nginx is running
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: yes
```

Calling the role from a playbook:

```yaml
---
- hosts: webservers
  roles:
    - role: nginx
      vars:
        server_name: "thecodeforge.io"
```

```text
$ ansible-playbook -i inventory.ini site.yml

PLAY [webservers] ************************************************************
TASK [nginx : Install Nginx] *************************************************
ok: [web-01.thecodeforge.io]
TASK [nginx : Deploy virtual host configuration] *****************************
changed: [web-01.thecodeforge.io]
RUNNING HANDLER [nginx : Reload Nginx] ***************************************
changed: [web-01.thecodeforge.io]

RECAP: ok=5 changed=2 unreachable=0 failed=0
```
`ansible-galaxy install geerlingguy.nginx` gives you a battle-tested role maintained by one of the most prolific Ansible contributors. Save your custom role-writing energy for application-specific automation that Galaxy can't provide.

Production Patterns — Error Handling, Vault, and Rolling Deploys
The playbook we built above works great for a single server. But production is messier. Databases fail mid-migration. Network blips cause intermittent SSH timeouts. You need to deploy to 50 servers without taking all 50 offline simultaneously. And you absolutely cannot store database passwords in plain text YAML committed to Git.
Error Handling with `block`/`rescue`/`always`: Ansible has a try/catch equivalent. Wrap risky tasks in a `block`. If anything inside fails, the `rescue` section runs — rollback, alert, log. The `always` section runs regardless — cleanup, notifications. Without this, a failed database migration leaves your server in a half-configured state with no automatic recovery.
Rolling Deploys with `serial`: The `serial` keyword controls how many hosts Ansible processes at once. `serial: 3` means 'update 3 servers, verify they're healthy, then move to the next 3.' Without `serial`, Ansible hits all hosts simultaneously — which is fine for config management but catastrophic for application deploys where you need zero downtime.
Ansible Vault for Secrets: Never commit plain-text passwords to version control. Vault encrypts individual variables or entire files. In CI/CD, you pass the vault password via environment variable or a password file. This keeps secrets out of your playbooks, out of your logs, and out of your Git history.
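The day-to-day Vault workflow looks roughly like this (the file paths follow this article's layout; the password file location is an example):

```shell
# Encrypt a single value; paste the resulting !vault block into group_vars
ansible-vault encrypt_string 'S3cr3tPassw0rd' --name 'db_password'

# Or encrypt an entire variable file
ansible-vault encrypt vault/secrets.yml

# At runtime, supply the password from a file that is kept out of Git
ansible-playbook -i inventory.ini deploy.yml --vault-password-file ~/.vault_pass
```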
```yaml
---
# io.thecodeforge: Production Deploy with Error Handling
- name: Deploy Application with Safety Rails
  hosts: webservers
  become: true
  serial: 3                 # Rolling deploy: 3 servers at a time
  max_fail_percentage: 0    # Stop everything if any server fails
  vars_files:
    - vault/secrets.yml     # Encrypted with ansible-vault
  tasks:
    - name: Deploy application release
      block:
        - name: Pull latest code
          ansible.builtin.git:
            repo: "https://github.com/thecodeforge/app.git"
            dest: /opt/app
            version: "{{ release_version }}"

        - name: Run database migrations
          ansible.builtin.command:
            cmd: /opt/app/bin/migrate --env production
            chdir: /opt/app

        - name: Verify application health endpoint
          ansible.builtin.uri:
            url: "http://localhost:8080/health"
            status_code: 200
          register: health_check
          retries: 5
          delay: 3
          until: health_check.status == 200
      rescue:
        - name: Log deployment failure
          ansible.builtin.debug:
            msg: "Deploy FAILED on {{ inventory_hostname }} — rolling back to {{ previous_release }}"

        - name: Rollback to previous release
          ansible.builtin.git:
            repo: "https://github.com/thecodeforge/app.git"
            dest: /opt/app
            version: "{{ previous_release }}"
      always:
        - name: Notify deployment status
          ansible.builtin.uri:
            url: "{{ webhook_url }}"
            method: POST
            body_format: json
            body:
              host: "{{ inventory_hostname }}"
              status: "{{ 'success' if ansible_failed_task is not defined else 'failed' }}"
```

```text
PLAY [Deploy Application with Safety Rails] ***********************************

TASK [Pull latest code] *******************************************************
changed: [web-01.thecodeforge.io]
changed: [web-02.thecodeforge.io]
changed: [web-03.thecodeforge.io]

TASK [Run database migrations] ************************************************
changed: [web-01.thecodeforge.io]
fatal: [web-02.thecodeforge.io]: FAILED! => {"msg": "migration timeout"}

TASK [Log deployment failure] *************************************************
ok: [web-02.thecodeforge.io] => {"msg": "Deploy FAILED on web-02 — rolling back"}

TASK [Rollback to previous release] ******************************************
changed: [web-02.thecodeforge.io]

RECAP: ok=8 changed=5 unreachable=0 failed=1
```
| Tool | Agent Required | Language | Learning Curve | Best For |
|---|---|---|---|---|
| Ansible | No (agentless) | YAML | Low | Configuration management, app deployment, ad-hoc operations, and orchestration across mixed environments. The fastest path from 'zero automation' to 'everything automated.' |
| Chef | Yes (chef-client daemon) | Ruby DSL | High | Complex, policy-based configuration in large enterprise fleets where teams have Ruby expertise and need a Pull-based model with a central Chef Server. |
| Puppet | Yes (puppet agent) | Puppet DSL | High | Long-term compliance and drift remediation in regulated industries (finance, healthcare) where continuous enforcement matters more than on-demand execution. |
| Terraform | No | HCL | Medium | Infrastructure provisioning — creating servers, networks, load balancers, and DNS records. Complementary to Ansible, not a replacement. Most teams use both. |
🎯 Key Takeaways
- Ansible is agentless — it connects over SSH, requiring no software installation on the target nodes. This means zero maintenance overhead on managed servers and instant onboarding for new infrastructure.
- Playbooks are human-readable YAML files that describe the 'Desired State' rather than just a list of scripts. Run them once or a hundred times — the outcome is the same.
- Prioritize dedicated modules (apt, systemd, git, copy) over generic shell commands to maintain idempotency. The shell module is a last resort, not a shortcut.
- Handlers provide a clean way to trigger service reloads or restarts only when a configuration change actually occurs — preventing unnecessary downtime from blind restarts.
- Roles are how you scale Ansible beyond toy projects. A role encapsulates tasks, handlers, templates, and variables into a reusable, predictable directory structure that any team member can read.
- Ad-hoc commands are Ansible's underrated superpower for day-two operations — fleet-wide disk checks, service restarts, and health verifications in one line instead of an SSH marathon.
- Use `block`/`rescue`/`always` for error handling in any playbook that modifies production state. Without it, a single failure leaves your fleet in a split-brain configuration that's painful to debug.
- Ansible Vault is non-negotiable for secrets management. Never commit plain-text credentials to version control — encrypt variables or entire files and pass the vault password through your CI/CD pipeline.
- Ansible is complementary to Terraform, not a competitor. Terraform provisions infrastructure; Ansible configures it. Most mature DevOps teams use both in sequence.
Interview Questions on This Topic
- Q: Explain the 'Agentless' architecture of Ansible. How does it compare to agent-based tools like Puppet or Chef in terms of security footprint, operational overhead, and onboarding friction for new servers?
- Q: What is Idempotency in the context of Ansible modules? Can you name a module that is NOT idempotent by default, and explain when you'd intentionally use it?
- Q: How does Ansible handle parallel execution? What is a 'fork' in Ansible configuration (`ansible.cfg`), and how does tuning it impact performance on a 500-node fleet?
- Q: What is the difference between a Task and a Handler? In what scenario would a Handler be skipped even if it is notified by a task that reported 'changed'?
- Q: How would you use Ansible Vault to manage environment-specific secrets in a CI/CD pipeline? Walk through the workflow from encrypting the variable to injecting it during a Jenkins or GitLab CI run.
- Q: What are Ansible Facts? How can you disable fact gathering to speed up playbook execution, and in what situations would you actually need facts?
- Q: Explain how dynamic inventory works with a cloud provider like AWS. What are the advantages over a static inventory file, and what challenges does it introduce?
- Q: Describe the difference between `include_role` and `import_role`. When would you choose one over the other, and how does each affect task execution order and variable scope?
- Q: How would you structure an Ansible project to manage 500+ servers across dev, staging, and production environments? Describe your directory layout, variable hierarchy, and how you'd prevent production changes from accidentally running against dev.
Frequently Asked Questions
What is the difference between an ad-hoc command and a playbook in Ansible?
An ad-hoc command is a single one-liner executed directly from the command line — ideal for quick checks or one-off operations like restarting a service or checking disk space across your fleet. A playbook is a reusable, version-controlled YAML file that defines a sequence of tasks with variables, handlers, and error handling. Think of ad-hoc commands as shouting instructions across the room, and playbooks as writing a detailed SOP that anyone can execute repeatedly with the same result.
How does Ansible handle secrets and sensitive data?
Ansible provides Ansible Vault, which encrypts variables or entire files using AES256. You can encrypt individual strings with ansible-vault encrypt_string and paste them directly into your playbooks, or encrypt entire variable files with ansible-vault encrypt. At runtime, you provide the vault password via --ask-vault-pass, a password file, or an environment variable in CI/CD. Vault-encrypted content is safe to commit to Git — without the password, it's gibberish. For larger teams, integrate Vault with a secrets manager like HashiCorp Vault using community lookup plugins.
What is dynamic inventory in Ansible, and when should you use it?
Dynamic inventory pulls your host list from an external source — typically a cloud provider API like AWS EC2, GCP, or Azure — instead of maintaining a static file. Ansible queries the API at runtime and builds the inventory based on tags, regions, or instance states. Use dynamic inventory when your infrastructure is elastic: auto-scaling groups, spot instances, or any environment where servers are created and destroyed regularly. Static inventory works fine for fixed infrastructure under 20 servers, but beyond that, dynamic inventory prevents stale host lists and 'host not found' errors.
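As a sketch, an AWS dynamic inventory is just a small YAML file consumed by the `amazon.aws.aws_ec2` inventory plugin (it requires the `amazon.aws` collection and `boto3` on the control node; the tag names here are illustrative):

```yaml
# inventory_aws_ec2.yml -- the filename must end in aws_ec2.yml or
# aws_ec2.yaml for the plugin to recognize it
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  instance-state-name: running
  tag:Environment: production    # only pick up tagged production hosts
keyed_groups:
  - key: tags.Role               # builds groups like role_webserver
    prefix: role
```

Run `ansible-inventory -i inventory_aws_ec2.yml --graph` to inspect the groups it generates before pointing a playbook at them.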
How do you handle errors and rollbacks in Ansible playbooks?
Ansible provides a block/rescue/always construct that works like try/catch in programming languages. Wrap risky operations (deployments, migrations) in a block. If any task in the block fails, the rescue section executes — typically rolling back to a known-good state and sending an alert. The always section runs regardless of success or failure for cleanup. For rolling deployments, combine this with serial (how many hosts to update at once) and max_fail_percentage (abort if too many hosts fail). Without this pattern, a failed migration on server 3 of 20 leaves your fleet in a split-brain state.
What is the difference between Ansible and Terraform? Do I need both?
They solve different problems and are designed to be used together. Terraform provisions infrastructure — it creates servers, networks, load balancers, and DNS records. Ansible configures that infrastructure — it installs software, deploys application code, manages services, and enforces desired state. A common pattern: Terraform provisions an AWS EC2 instance and outputs its IP address; Ansible takes that IP, connects via SSH, and configures the server. Terraform is declarative infrastructure-as-code. Ansible is configuration management and orchestration. Most production teams use both.
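A minimal version of that handoff, assuming a Terraform output named `web_ip` (the output name is hypothetical):

```shell
# Grab the IP address Terraform just provisioned
IP=$(terraform output -raw web_ip)

# The trailing comma tells Ansible this is an inline host list,
# not an inventory file path
ansible-playbook -i "${IP}," -u ubuntu site.yml
```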
How do you test Ansible playbooks before running them in production?
Use --check mode (dry run) to see what changes Ansible would make without actually applying them. Combine it with --diff to see the exact file content differences. For automated testing, use Molecule — it spins up disposable containers or VMs, runs your role, verifies the outcome with test frameworks like Testinfra, and tears everything down. Run Molecule in CI to catch regressions before they hit production. Also lint your playbooks with ansible-lint to catch style issues, deprecated modules, and common mistakes before execution.
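In practice, the pre-production checks boil down to a few commands (`ansible-lint` and `molecule` are separate installs on the control node):

```shell
# Dry run: report what would change without touching the servers
ansible-playbook -i inventory.ini site.yml --check --diff

# Static analysis: deprecated modules, risky shell usage, style issues
ansible-lint site.yml

# Full role test cycle in disposable instances (run from the role directory)
molecule test
```

Note that `--check` has limits: tasks whose outcome depends on an earlier task's real changes may report inaccurately, which is why Molecule's real-execution tests complement it rather than replace it.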
What is Ansible Galaxy, and should I use community roles?
Ansible Galaxy is a repository of community-contributed roles — pre-built automation for common software like Nginx, Docker, PostgreSQL, certbot, and hundreds more. Install a role with ansible-galaxy install geerlingguy.nginx and use it in your playbook immediately. For commodity software, community roles save hours and are often more battle-tested than what you'd write yourself. For application-specific automation — deploying your Java app, configuring your monitoring stack — write custom roles. The best practice is a hybrid: community roles for infrastructure software, custom roles for business logic.
How does Ansible perform on very large fleets (1000+ servers)?
Ansible's performance scales with the forks setting in ansible.cfg (default: 5). This controls how many hosts Ansible connects to in parallel. For 1000+ servers, increase forks to 50-100 depending on your control node's resources. Use pipelining = True in ansible.cfg to reduce SSH overhead by combining module transfer and execution into a single SSH call. For very large fleets, consider Ansible Automation Platform (AAP) or the open-source AWX project, which adds a web UI, job scheduling, RBAC, logging, and workflow orchestration on top of Ansible. Plain Ansible works fine at scale, but AWX gives you the operational visibility that large teams need.
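A tuned `ansible.cfg` for a large fleet might look like this (the numbers are starting points to benchmark against your control node's CPU and file-descriptor limits, not universal settings):

```ini
# ansible.cfg -- control-node tuning for large fleets
[defaults]
forks = 50            # parallel host connections; the default is 5

[ssh_connection]
pipelining = True     # one SSH round-trip per task instead of several
```

One caveat: pipelining requires that `requiretty` is disabled in sudoers on the managed hosts, which is the default on most modern distributions.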
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.