
Introduction to Ansible — Automate Infrastructure Without an Agent

📍 Part of: Ansible → Topic 1 of 3
Ansible explained from scratch — what it is, how agentless automation works, and how to write your first playbook to configure a server with real YAML examples.
🧑‍💻 Beginner-friendly — no prior DevOps experience needed
In this tutorial, you'll learn
  • Ansible is agentless — it connects over SSH, requiring no software installation on the target nodes. This means zero maintenance overhead on managed servers and instant onboarding for new infrastructure.
  • Playbooks are human-readable YAML files that describe the 'Desired State' rather than just a list of scripts. Run them once or a hundred times — the outcome is the same.
  • Prioritize dedicated modules (apt, systemd, git, copy) over generic shell commands to maintain idempotency. The shell module is a last resort, not a shortcut.
Quick Answer

Managing 100 servers by logging into each one and typing commands is like calling 100 employees individually to give the same instruction. Ansible is like sending one company-wide email that everyone acts on simultaneously. You describe the desired state of your servers in plain English-like YAML, and Ansible connects over SSH and makes it happen — on all servers at once, with no software installed on them. Think of it this way: if your server is a hotel room, Ansible is the housekeeping checklist pinned to the door. It doesn't live in the room. It walks in, checks what needs fixing, fixes only what's broken, and walks out. The room doesn't even know Ansible was there — it just ends up clean.

Before configuration management tools, sysadmins maintained hundreds of servers by hand — logging in, running commands, hoping nothing went wrong. I lived this. In 2015, I managed a fleet of 80 web servers at a mid-size SaaS company, and every deploy night was a three-hour marathon of SSH sessions, copy-pasted commands, and prayer. One night, someone restarted the wrong database server. We lost four hours of customer data. That was the last straw.

Ansible was created by Michael DeHaan in 2012 and acquired by Red Hat in 2015 (now part of IBM). Today it runs infrastructure at NASA JPL, Capital One, and thousands of companies from Series A startups to Fortune 50 enterprises. Not because it's the most powerful automation tool, but because it's the simplest one that actually gets used.

What makes Ansible different from competitors like Chef and Puppet is that it is agentless. There is no daemon running on your managed servers, no SSL certificates to exchange, and no extra ports to open beyond standard SSH (or WinRM for Windows). Ansible runs from your control node, pushes small programs called 'Ansible Modules' to the remote nodes, executes them, and then cleans up after itself.

One important nuance: Ansible and Terraform are not competitors — they're complementary. Terraform creates infrastructure (provisions the server). Ansible configures that server (installs software, deploys code, manages services). In practice, most teams use Terraform to build the house and Ansible to furnish it.

In this guide, we'll break down Ansible's core architecture — inventories, playbooks, modules, and roles — cover ad-hoc commands for quick fleet operations, and build production-grade automation with real error handling, secret management, and reusable patterns.

Inventory, Playbooks, and Modules — The Three Core Concepts

Ansible's architecture relies on three primary building blocks.

  1. The Inventory: A simple file (INI or YAML) that lists the servers you want to manage. You can organize these into groups like [webservers] or [databases] to target specific logic to specific hardware. In production, you'll often use dynamic inventory — pulling host lists directly from AWS, GCP, or Azure APIs so your inventory stays accurate as servers are created and destroyed. Static inventories are fine for learning and small setups, but once you're past 20 servers with autoscaling, dynamic inventory isn't optional — it's survival.
  2. The Playbook: This is your automation blueprint. Written in YAML, it maps a group of hosts to a series of tasks. Playbooks describe desired state, not step-by-step instructions. This distinction matters: if Nginx is already installed and running, Ansible doesn't reinstall it. It checks, confirms, and moves on.
  3. Modules: These are the 'tools' in the toolbox. Instead of writing complex bash scripts, you use modules like apt, yum, service, or copy. These modules are idempotent, meaning they check the current state of the server and only make changes if the server doesn't already match your desired state. The shell and command modules are the notable exceptions — they run blindly every time, which is why experienced Ansible users avoid them unless there's no dedicated module alternative.
io/thecodeforge/ansible/inventory.ini · INI
# io.thecodeforge: Inventory for Project Forge

[webservers]
web-01.thecodeforge.io ansible_host=192.168.1.10 ansible_user=ubuntu
web-02.thecodeforge.io ansible_host=192.168.1.11 ansible_user=ubuntu

[databases]
db-01.thecodeforge.io  ansible_host=192.168.1.20 ansible_user=ubuntu

[production:children]
webservers
databases

[production:vars]
ansible_ssh_private_key_file=~/.ssh/forge_deploy_key
▶ Output
# Test connectivity across the fleet:
# ansible production -i inventory.ini -m ping

web-01.thecodeforge.io | SUCCESS => {"ping": "pong"}
web-02.thecodeforge.io | SUCCESS => {"ping": "pong"}
db-01.thecodeforge.io | SUCCESS => {"ping": "pong"}
💡Test Connectivity First:
Always run ansible all -m ping before running playbooks. If ping fails, fix SSH connectivity before debugging anything else. 90% of Ansible problems are SSH or permissions issues. I've watched junior engineers spend two hours debugging a 'module error' that was really a missing SSH key.
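
For elastic environments, the static INI file above gives way to an inventory plugin. Here's a minimal sketch of the `amazon.aws.aws_ec2` dynamic inventory plugin — the region, tag names, and filename are illustrative, and it assumes the `amazon.aws` collection and `boto3` are installed on the control node:

```yaml
# inventory_aws_ec2.yml — filename must end in "aws_ec2.yml" for the plugin to claim it
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  # Only include instances tagged Environment=production (example tag)
  tag:Environment: production
keyed_groups:
  # Build groups like role_webservers from each instance's Role tag
  - key: tags.Role
    prefix: role
compose:
  # Connect over the instance's public IP
  ansible_host: public_ip_address
```

Preview what groups the plugin generates before trusting it in a playbook: `ansible-inventory -i inventory_aws_ec2.yml --graph`.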

Your First Production Playbook

A playbook is a collection of 'plays'. Each play targets a specific group from your inventory and executes a sequence of tasks.

Ansible processes tasks in order, from top to bottom. If a task fails on a specific host, Ansible will stop executing the rest of the playbook for that host but continue for others. To handle configuration changes — like restarting a web server only when a config file is updated — Ansible uses Handlers. Handlers are special tasks that only run when 'notified' by another task that actually changed something.

Here's a production playbook we actually use. Notice the structure: update the package cache, install the binary, deploy a templated config, ensure the service is running. Every task is idempotent. Every task uses a dedicated module. No shell commands anywhere.

One thing the Ansible docs don't emphasize enough: variable precedence will bite you if you don't understand it. Ansible has a 22-level precedence ladder. Variables defined with -e on the command line override everything. Role defaults are the weakest. Variables in group_vars override role defaults but get overridden by host_vars. When a playbook behaves 'randomly' — applying different values on different runs — it's almost always a precedence conflict. Learn the hierarchy early. It will save you hours of debugging.
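
To make the precedence ladder concrete, here's a sketch of the same variable defined at three levels — the filenames follow Ansible's standard layout, and the port values are purely illustrative:

```yaml
# roles/nginx/defaults/main.yml — weakest: role defaults
nginx_port: 80

# group_vars/webservers.yml — overrides role defaults for every host in [webservers]
nginx_port: 8080

# host_vars/web-01.thecodeforge.io.yml — overrides group_vars for this one host
nginx_port: 8081

# Command line: -e (extra vars) overrides everything above
#   ansible-playbook site.yml -e nginx_port=9090
```

With all four in place, web-01 gets 8081, every other web server gets 8080, and passing `-e nginx_port=9090` forces 9090 fleet-wide.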

io/thecodeforge/ansible/site_setup.yml · YAML
---
# io.thecodeforge: Standard Nginx Deployment Playbook
- name: Deploy and Configure Nginx
  hosts: webservers
  become: true

  vars:
    nginx_port: 80
    server_name: "thecodeforge.io"

  tasks:
    - name: Ensure apt cache is updated
      ansible.builtin.apt:
        update_cache: yes
        cache_valid_time: 3600

    - name: Install Nginx production package
      ansible.builtin.apt:
        name: nginx
        state: present

    - name: Deploy custom Nginx configuration
      ansible.builtin.template:
        src: templates/nginx.conf.j2
        dest: /etc/nginx/sites-available/default
        owner: root
        group: root
        mode: '0644'
      notify: Reload Nginx service

    - name: Ensure Nginx service is enabled and running
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: yes

  handlers:
    - name: Reload Nginx service
      ansible.builtin.service:
        name: nginx
        state: reloaded
▶ Output
ansible-playbook -i inventory.ini site_setup.yml

PLAY [Deploy and Configure Nginx] ********************************************
TASK [Gathering Facts] *******************************************************
ok: [web-01.thecodeforge.io]
TASK [Install Nginx production package] **************************************
changed: [web-01.thecodeforge.io]
RUNNING HANDLER [Reload Nginx service] ***************************************
changed: [web-01.thecodeforge.io]

PLAY RECAP: ok=4 changed=2 unreachable=0 failed=0
🔥Idempotency:
Run this playbook 10 times — the result is the same as running it once. If Nginx is already installed, the install task shows 'ok' not 'changed'. This idempotency is what makes Ansible safe to run repeatedly in CI/CD pipelines. I've had Ansible running on a cron every 30 minutes in production for two years. It corrects config drift automatically. The only time it 'changes' anything is when someone SSH'd in and broke something manually.

Ad-hoc Commands — Quick Fleet Operations Without a Playbook

Not everything needs a playbook. Sometimes you need to run a single command across your fleet right now — check disk space, restart a hung service, verify a patch applied. That's what ad-hoc commands are for.

Ad-hoc commands are Ansible's secret weapon for day-two operations. They're the reason senior SREs reach for Ansible over SSH loops. An SSH for-loop runs the command on every server sequentially and gives you raw output. Ansible ad-hoc runs in parallel, returns structured JSON, and handles failures gracefully.

Syntax: ansible <host-pattern> -i <inventory> -m <module> -a '<arguments>'

In production, I use ad-hoc commands daily. Checking if all nodes in a 200-server fleet have enough disk space before a deploy? That's a one-liner. Killing a run-away process across 50 app servers? One-liner. Gathering system facts to verify a kernel patch applied? One-liner. These replace what used to be 20-minute SSH marathons.

io/thecodeforge/ansible/adhoc_examples.sh · BASH
# io.thecodeforge: Ad-hoc Command Reference

# Ping all production hosts to verify SSH connectivity
ansible production -i inventory.ini -m ping

# Check disk space across all web servers
ansible webservers -i inventory.ini -m command -a "df -h"

# Restart Nginx on all web servers immediately
ansible webservers -i inventory.ini -m service -a "name=nginx state=restarted" --become

# Gather system facts (CPU, memory, OS) from one host
ansible db-01.thecodeforge.io -i inventory.ini -m setup

# Install a security patch across the fleet
ansible production -i inventory.ini -m apt -a "name=openssl state=latest update_cache=yes" --become
▶ Output
ansible webservers -i inventory.ini -m command -a "df -h /"

web-01.thecodeforge.io | CHANGED | rc=0 >>
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 12G 36G 25% /
web-02.thecodeforge.io | CHANGED | rc=0 >>
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 38G 10G 79% /

# Notice web-02 is at 79% — that's your alert to investigate before it fills up.
⚠ Ad-hoc Is Not Idempotent by Default:
The command module runs every time — it doesn't check state. For one-off operations, this is fine. But if you find yourself running the same ad-hoc command repeatedly, write a playbook instead. Ad-hoc is for exploration and emergencies, not for repeatable automation.

Roles — Reusable Automation at Scale

Once your playbooks grow beyond 50 lines, you'll start copying tasks between files. That's when you need roles. A role is a self-contained unit of automation — tasks, handlers, templates, default variables, and files — packaged in a standardized directory structure. Roles are how Ansible scales from 'one playbook' to 'an entire infrastructure codebase.'

The directory structure isn't optional decoration. It's Ansible's loading convention. When you reference a role, Ansible automatically loads tasks/main.yml, handlers/main.yml, defaults/main.yml, and so on. Skip the convention and things silently don't load.

Roles come from two sources: you write your own, or you pull community roles from Ansible Galaxy (ansible-galaxy install geerlingguy.nginx). Galaxy has thousands of pre-built roles. For common software — Nginx, Docker, PostgreSQL, certbot — a community role saves hours. For your application-specific logic — deploying your Java app, configuring your monitoring — you write your own.

In our production codebase at io.thecodeforge, every server role follows the same pattern: a defaults/main.yml with sane defaults, a tasks/main.yml with the core logic, and a templates/ directory with Jinja2 config files. When a new engineer joins, they can read any role and understand it immediately because the structure is predictable.
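
Applying a role from a playbook is a one-stanza affair. A sketch of a top-level `site.yml` that consumes an nginx role like the one in this section (the play name is illustrative):

```yaml
---
# site.yml — apply the reusable nginx role to the web tier
- name: Configure web tier
  hosts: webservers
  become: true

  roles:
    - role: nginx
      vars:
        server_name: "thecodeforge.io"
```

Ansible resolves `role: nginx` by convention: it looks for `roles/nginx/` relative to the playbook and auto-loads its `tasks/main.yml`, `handlers/main.yml`, and `defaults/main.yml`.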

io/thecodeforge/ansible/roles/nginx/tasks/main.yml · YAML
---
# io.thecodeforge: Reusable Nginx Role
# Directory structure:
# roles/nginx/
#   ├── defaults/main.yml    (default variables)
#   ├── handlers/main.yml    (service reload/restart)
#   ├── tasks/main.yml       (this file)
#   └── templates/
#       └── vhost.conf.j2

- name: Install Nginx
  ansible.builtin.apt:
    name: nginx
    state: present
    update_cache: yes

- name: Deploy virtual host configuration
  ansible.builtin.template:
    src: vhost.conf.j2
    dest: "/etc/nginx/sites-available/{{ server_name }}.conf"
    owner: root
    group: root
    mode: '0644'
  notify: Reload Nginx

- name: Enable virtual host
  ansible.builtin.file:
    src: "/etc/nginx/sites-available/{{ server_name }}.conf"
    dest: "/etc/nginx/sites-enabled/{{ server_name }}.conf"
    state: link
  notify: Reload Nginx

- name: Ensure Nginx is running
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: yes
▶ Output
# Using the role in a playbook:
# ---
# - hosts: webservers
#   roles:
#     - role: nginx
#       vars:
#         server_name: "thecodeforge.io"
#
# ansible-playbook -i inventory.ini site.yml
#
# PLAY [webservers] ************************************************************
# TASK [nginx : Install Nginx] *************************************************
# ok: [web-01.thecodeforge.io]
# TASK [nginx : Deploy virtual host configuration] *****************************
# changed: [web-01.thecodeforge.io]
# RUNNING HANDLER [nginx : Reload Nginx] ***************************************
# changed: [web-01.thecodeforge.io]
#
# PLAY RECAP: ok=5 changed=2 unreachable=0 failed=0
💡Use Galaxy for Commodity Software:
Don't write your own Nginx or Docker role from scratch unless you have a specific reason. ansible-galaxy install geerlingguy.nginx gives you a battle-tested role maintained by one of the most prolific Ansible contributors. Save your custom role-writing energy for application-specific automation that Galaxy can't provide.

Production Patterns — Error Handling, Vault, and Rolling Deploys

The playbook we built above works great for a single server. But production is messier. Databases fail mid-migration. Network blips cause intermittent SSH timeouts. You need to deploy to 50 servers without taking all 50 offline simultaneously. And you absolutely cannot store database passwords in plain text YAML committed to Git.

Error Handling with block/rescue/always: Ansible has a try/catch equivalent. Wrap risky tasks in a block. If anything inside fails, the rescue section runs — rollback, alert, log. The always section runs regardless — cleanup, notifications. Without this, a failed database migration leaves your server in a half-configured state with no automatic recovery.

Rolling Deploys with serial: The serial keyword controls how many hosts Ansible processes at once. serial: 3 means 'update 3 servers, verify they're healthy, then move to the next 3.' Without serial, Ansible hits all hosts simultaneously — which is fine for config management but catastrophic for application deploys where you need zero-downtime.

Ansible Vault for Secrets: Never commit plain-text passwords to version control. Vault encrypts individual variables or entire files. In CI/CD, you pass the vault password via environment variable or a password file. This keeps secrets out of your playbooks, out of your logs, and out of your Git history.
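
The Vault workflow, as a quick command reference (the file paths match the playbook below; the secret value is a placeholder):

```
# Create a new encrypted variables file (prompts for a vault password)
ansible-vault create vault/secrets.yml

# Encrypt an existing plain-text file in place
ansible-vault encrypt group_vars/production/vault.yml

# Encrypt a single value and paste the output into any vars file
ansible-vault encrypt_string 'S3cr3t!' --name 'db_password'

# Run with the password from a file — CI-friendly; keep the file out of Git
ansible-playbook -i inventory.ini deploy_with_safety.yml --vault-password-file ~/.vault_pass
```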

io/thecodeforge/ansible/deploy_with_safety.yml · YAML
---
# io.thecodeforge: Production Deploy with Error Handling
- name: Deploy Application with Safety Rails
  hosts: webservers
  become: true
  serial: 3              # Rolling deploy: 3 servers at a time
  max_fail_percentage: 0 # Stop everything if any server fails

  vars_files:
    - vault/secrets.yml  # Encrypted with ansible-vault

  tasks:
    - name: Deploy application release
      block:
        - name: Pull latest code
          ansible.builtin.git:
            repo: "https://github.com/thecodeforge/app.git"
            dest: /opt/app
            version: "{{ release_version }}"

        - name: Run database migrations
          ansible.builtin.command:
            cmd: /opt/app/bin/migrate --env production
          args:
            chdir: /opt/app

        - name: Verify application health endpoint
          ansible.builtin.uri:
            url: "http://localhost:8080/health"
            status_code: 200
          register: health_check          # retries require a registered result
          until: health_check.status == 200
          retries: 5
          delay: 3

      rescue:
        - name: Log deployment failure
          ansible.builtin.debug:
            msg: "Deploy FAILED on {{ inventory_hostname }} — rolling back to {{ previous_release }}"

        - name: Rollback to previous release
          ansible.builtin.git:
            repo: "https://github.com/thecodeforge/app.git"
            dest: /opt/app
            version: "{{ previous_release }}"

      always:
        - name: Notify deployment status
          ansible.builtin.uri:
            url: "{{ webhook_url }}"
            method: POST
            body_format: json
            body:
              host: "{{ inventory_hostname }}"
              status: "{{ 'success' if ansible_failed_task is not defined else 'failed' }}"
▶ Output
ansible-playbook -i inventory.ini deploy_with_safety.yml --ask-vault-pass

PLAY [Deploy Application with Safety Rails] ***********************************
TASK [Pull latest code] *******************************************************
changed: [web-01.thecodeforge.io]
changed: [web-02.thecodeforge.io]
changed: [web-03.thecodeforge.io]
TASK [Run database migrations] ************************************************
changed: [web-01.thecodeforge.io]
fatal: [web-02.thecodeforge.io]: FAILED! => {"msg": "migration timeout"}
TASK [Log deployment failure] *************************************************
ok: [web-02.thecodeforge.io] => {"msg": "Deploy FAILED on web-02 — rolling back"}
TASK [Rollback to previous release] ******************************************
changed: [web-02.thecodeforge.io]

PLAY RECAP: ok=8 changed=5 unreachable=0 failed=0 rescued=1
⚠ Never Skip Error Handling in Production:
I once watched a team deploy without block/rescue. A migration failed on server 3 of 20. The playbook stopped for that host but continued for the rest. Result: 19 servers on the new schema, 1 on the old. The application broke in spectacular ways for three hours while they figured out what happened. Always use block/rescue for anything that modifies state. Always.
Tool comparison (Agent Required · Language · Learning Curve · Best For):

  • Ansible · No agent (agentless) · YAML · Low learning curve · Best for configuration management, app deployment, ad-hoc operations, and orchestration across mixed environments. The fastest path from 'zero automation' to 'everything automated.'
  • Chef · Agent required (chef-client daemon) · Ruby DSL · High learning curve · Best for complex, policy-based configuration in large enterprise fleets where teams have Ruby expertise and need a pull-based model with a central Chef Server.
  • Puppet · Agent required (puppet agent) · Puppet DSL · High learning curve · Best for long-term compliance and drift remediation in regulated industries (finance, healthcare) where continuous enforcement matters more than on-demand execution.
  • Terraform · No agent · HCL · Medium learning curve · Best for infrastructure provisioning — creating servers, networks, load balancers, and DNS records. Complementary to Ansible, not a replacement. Most teams use both.

🎯 Key Takeaways

  • Ansible is agentless — it connects over SSH, requiring no software installation on the target nodes. This means zero maintenance overhead on managed servers and instant onboarding for new infrastructure.
  • Playbooks are human-readable YAML files that describe the 'Desired State' rather than just a list of scripts. Run them once or a hundred times — the outcome is the same.
  • Prioritize dedicated modules (apt, systemd, git, copy) over generic shell commands to maintain idempotency. The shell module is a last resort, not a shortcut.
  • Handlers provide a clean way to trigger service reloads or restarts only when a configuration change actually occurs — preventing unnecessary downtime from blind restarts.
  • Roles are how you scale Ansible beyond toy projects. A role encapsulates tasks, handlers, templates, and variables into a reusable, predictable directory structure that any team member can read.
  • Ad-hoc commands are Ansible's underrated superpower for day-two operations — fleet-wide disk checks, service restarts, and health verifications in one line instead of an SSH marathon.
  • Use block/rescue/always for error handling in any playbook that modifies production state. Without it, a single failure leaves your fleet in a split-brain configuration that's painful to debug.
  • Ansible Vault is non-negotiable for secrets management. Never commit plain-text credentials to version control — encrypt variables or entire files and pass the vault password through your CI/CD pipeline.
  • Ansible is complementary to Terraform, not a competitor. Terraform provisions infrastructure; Ansible configures it. Most mature DevOps teams use both in sequence.

⚠ Common Mistakes to Avoid

    Using shell/command modules when a dedicated module exists — using `ansible.builtin.shell: apt install nginx` works, but it is not idempotent and will return 'changed' every single time, even if Nginx is already installed. Use the `apt` module so Ansible can check the state properly. I've seen CI/CD dashboards showing 'changed' on every run for months because someone used `shell: systemctl restart nginx` instead of the `service` module. The ops team thought deployments were happening. Nothing was actually changing.

    Ignoring YAML syntax and spacing — YAML is extremely sensitive to indentation. A single missing space in a list can cause the entire playbook to fail with a cryptic parser error. Always use a linter (`yamllint`) and configure your editor to show whitespace. Mixing tabs and spaces is the most common cause — YAML requires spaces, period.

    Hardcoding sensitive secrets — Storing database passwords in plain text YAML is a massive security risk. Use **Ansible Vault** to encrypt sensitive variables: `ansible-vault encrypt_string 'secret_pass' --name 'db_password'`. Even better, encrypt entire variable files with `ansible-vault encrypt group_vars/production/vault.yml` so secrets never touch your Git history unencrypted.

    Not disabling host key checking in CI/CD — In automated environments, SSH will prompt 'Are you sure you want to continue connecting?' and hang forever. Set `ANSIBLE_HOST_KEY_CHECKING=False` in your CI environment, or add `host_key_checking = False` in `ansible.cfg`. For production, use `ssh-keyscan` to pre-populate `known_hosts` instead of disabling checks entirely.

    Forgetting `become: true` and debugging the wrong thing — Without privilege escalation, modules fail with permission errors that look like deeper problems. I've watched engineers spend an hour debugging 'file not found' errors on `/etc/nginx/conf.d/` when the real issue was simply missing `become: true`. If a task touches system files, services, or package managers, it needs root. Make `become` explicit at the play level so you never forget it.

    Using `ignore_errors: yes` as a band-aid — Slapping `ignore_errors` on a failing task doesn't fix the problem. It just hides it. Three months later, you discover that your SSL certificate renewal has been silently failing and your site is running on an expired cert. Use `block/rescue/always` for proper error handling instead. If a task fails, you need to know about it and handle it deliberately.

Interview Questions on This Topic

  • Q: Explain the 'Agentless' architecture of Ansible. How does it compare to agent-based tools like Puppet or Chef in terms of security footprint, operational overhead, and onboarding friction for new servers?
  • Q: What is Idempotency in the context of Ansible modules? Can you name a module that is NOT idempotent by default, and explain when you'd intentionally use it?
  • Q: How does Ansible handle parallel execution? What is a 'fork' in Ansible configuration (ansible.cfg), and how does tuning it impact performance on a 500-node fleet?
  • Q: What is the difference between a Task and a Handler? In what scenario would a Handler be skipped even if it is notified by a task that reported 'changed'?
  • Q: How would you use Ansible Vault to manage environment-specific secrets in a CI/CD pipeline? Walk through the workflow from encrypting the variable to injecting it during a Jenkins or GitLab CI run.
  • Q: What are Ansible Facts? How can you disable fact gathering to speed up playbook execution, and in what situations would you actually need facts?
  • Q: Explain how dynamic inventory works with a cloud provider like AWS. What are the advantages over a static inventory file, and what challenges does it introduce?
  • Q: Describe the difference between include_role and import_role. When would you choose one over the other, and how does each affect task execution order and variable scope?
  • Q: How would you structure an Ansible project to manage 500+ servers across dev, staging, and production environments? Describe your directory layout, variable hierarchy, and how you'd prevent production changes from accidentally running against dev.

Frequently Asked Questions

What is the difference between an ad-hoc command and a playbook in Ansible?

An ad-hoc command is a single one-liner executed directly from the command line — ideal for quick checks or one-off operations like restarting a service or checking disk space across your fleet. A playbook is a reusable, version-controlled YAML file that defines a sequence of tasks with variables, handlers, and error handling. Think of ad-hoc commands as shouting instructions across the room, and playbooks as writing a detailed SOP that anyone can execute repeatedly with the same result.

How does Ansible handle secrets and sensitive data?

Ansible provides Ansible Vault, which encrypts variables or entire files using AES256. You can encrypt individual strings with ansible-vault encrypt_string and paste them directly into your playbooks, or encrypt entire variable files with ansible-vault encrypt. At runtime, you provide the vault password via --ask-vault-pass, a password file, or an environment variable in CI/CD. Vault-encrypted content is safe to commit to Git — without the password, it's gibberish. For larger teams, integrate Vault with a secrets manager like HashiCorp Vault using community lookup plugins.

What is dynamic inventory in Ansible, and when should you use it?

Dynamic inventory pulls your host list from an external source — typically a cloud provider API like AWS EC2, GCP, or Azure — instead of maintaining a static file. Ansible queries the API at runtime and builds the inventory based on tags, regions, or instance states. Use dynamic inventory when your infrastructure is elastic: auto-scaling groups, spot instances, or any environment where servers are created and destroyed regularly. Static inventory works fine for fixed infrastructure under 20 servers, but beyond that, dynamic inventory prevents stale host lists and 'host not found' errors.

How do you handle errors and rollbacks in Ansible playbooks?

Ansible provides a block/rescue/always construct that works like try/catch in programming languages. Wrap risky operations (deployments, migrations) in a block. If any task in the block fails, the rescue section executes — typically rolling back to a known-good state and sending an alert. The always section runs regardless of success or failure for cleanup. For rolling deployments, combine this with serial (how many hosts to update at once) and max_fail_percentage (abort if too many hosts fail). Without this pattern, a failed migration on server 3 of 20 leaves your fleet in a split-brain state.

What is the difference between Ansible and Terraform? Do I need both?

They solve different problems and are designed to be used together. Terraform provisions infrastructure — it creates servers, networks, load balancers, and DNS records. Ansible configures that infrastructure — it installs software, deploys application code, manages services, and enforces desired state. A common pattern: Terraform provisions an AWS EC2 instance and outputs its IP address; Ansible takes that IP, connects via SSH, and configures the server. Terraform is declarative infrastructure-as-code. Ansible is configuration management and orchestration. Most production teams use both.

How do you test Ansible playbooks before running them in production?

Use --check mode (dry run) to see what changes Ansible would make without actually applying them. Combine it with --diff to see the exact file content differences. For automated testing, use Molecule — it spins up disposable containers or VMs, runs your role, verifies the outcome with test frameworks like Testinfra, and tears everything down. Run Molecule in CI to catch regressions before they hit production. Also lint your playbooks with ansible-lint to catch style issues, deprecated modules, and common mistakes before execution.
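
The commands behind that workflow, as a quick reference (filenames match the examples earlier in this guide; `molecule test` assumes the role has a `molecule/` scenario scaffolded):

```
# Dry run: show what would change, with file diffs, without changing anything
ansible-playbook -i inventory.ini site_setup.yml --check --diff

# Lint for deprecated modules, style issues, and common mistakes
ansible-lint site_setup.yml

# Full role test cycle: create container, converge, verify, destroy
molecule test
```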

What is Ansible Galaxy, and should I use community roles?

Ansible Galaxy is a repository of community-contributed roles — pre-built automation for common software like Nginx, Docker, PostgreSQL, certbot, and hundreds more. Install a role with ansible-galaxy install geerlingguy.nginx and use it in your playbook immediately. For commodity software, community roles save hours and are often more battle-tested than what you'd write yourself. For application-specific automation — deploying your Java app, configuring your monitoring stack — write custom roles. The best practice is a hybrid: community roles for infrastructure software, custom roles for business logic.

How does Ansible perform on very large fleets (1000+ servers)?

Ansible's performance scales with the forks setting in ansible.cfg (default: 5). This controls how many hosts Ansible connects to in parallel. For 1000+ servers, increase forks to 50-100 depending on your control node's resources. Use pipelining = True in ansible.cfg to reduce SSH overhead by combining module transfer and execution into a single SSH call. For very large fleets, consider Ansible Automation Platform (AAP) or the open-source AWX project, which adds a web UI, job scheduling, RBAC, logging, and workflow orchestration on top of Ansible. Plain Ansible works fine at scale, but AWX gives you the operational visibility that large teams need.
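
A sketch of the relevant `ansible.cfg` tuning — the values here are illustrative starting points, not universal recommendations:

```
# ansible.cfg — large-fleet performance tuning
[defaults]
forks = 50            # parallel host connections (default is 5)
gathering = smart     # skip fact gathering for hosts with cached facts

[ssh_connection]
pipelining = True     # combine module transfer and execution into one SSH call
ssh_args = -o ControlMaster=auto -o ControlPersist=60s   # reuse SSH connections
```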

Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

Next → Ansible Playbooks Explained