Ansible Role = packaged unit of automation with standard directory structure (tasks, handlers, defaults, vars, templates, files, meta)
Convention over configuration: Ansible auto-loads main.yml from each directory when the role is called — missing that file means the directory silently does nothing
defaults/ for overridable variables (lowest precedence in all of Ansible), vars/ for internal constants (higher precedence — overrides from inventory won't reach them)
Performance: role lookups add ~100ms per role call — flatten deeply nested role dependency chains for large inventories running under time pressure
Production trap: hardcoding paths in tasks/ instead of using defaults/ — the role works for one team and is useless for everyone else without forking it
Biggest mistake: creating 'God roles' that configure databases AND web servers AND monitoring — break into separate focused roles, compose them in the playbook
Plain-English First
Think of Ansible Roles as a professional toolbox with dedicated, labeled drawers. Instead of throwing every tool — hammers, screwdrivers, drills — into one big pile (a single massive playbook), you organize them. One drawer holds Web Server tools. Another holds Database tools. A third holds Monitoring tools. When you need to build a new system, you grab exactly the drawers you need and leave the rest on the shelf.
The labels on the drawers matter too. Some tools have adjustable settings — the drill's speed, the torque on the wrench. Those settings go on a sticky note on the outside of the drawer so whoever borrows it can change them without opening the drawer and modifying the tool itself. That's what defaults/ is in an Ansible role: the sticky note that says 'this is what we assume, but you can change it.' vars/ is the weld that holds the drawer together — it should not be touched.
Ansible Roles are how you turn automation from scripting into software engineering. A single playbook that works for one team and one environment is a script. A role that any team can pull from Galaxy, override with their own variable values, and deploy to any environment without touching a line of task code — that's reusable infrastructure.
Most tutorials show you how to initialize a role and move on. What they skip is the operational detail that determines whether a role becomes an asset or a liability six months after it's written. The defaults-versus-vars distinction trips up engineers who understand the concept but haven't felt the pain of getting it wrong in production. God roles are written by people who know roles exist but haven't internalized why the single-responsibility principle applies to infrastructure automation as much as it does to application code. And role dependency chains in meta/main.yml can fail in ways that produce no errors and leave your fleet silently misconfigured.
By the end of this article you'll know how to structure roles that teams outside your own can actually use, how to test them with Molecule so regressions surface in CI rather than production, why the variable precedence hierarchy determines whether your role is overridable or effectively hardcoded, and how to recognize the God role pattern early enough to fix it before it becomes entrenched technical debt.
The Architecture of a Role: Convention Over Configuration
Ansible Roles exist to move infrastructure automation from scripting to software engineering. The distinction matters operationally: a script works for its author in their environment. A role works for any team, in any environment, without modification to its internals — only variable overrides at the boundary.
The mechanism that makes this possible is convention over configuration. Every role follows the same directory structure. When Ansible calls a role, it knows exactly where to look for each type of content without being told: tasks/main.yml for the primary logic, handlers/main.yml for service restart definitions, defaults/main.yml for overridable variables, vars/main.yml for internal constants, templates/ for Jinja2 config files, files/ for static assets, and meta/main.yml for dependencies and Galaxy metadata. None of these require explicit loading in your tasks — Ansible finds and loads them automatically based on their location.
This predictability is the entire point. When a new engineer opens a role they've never seen before, they know immediately where the task logic lives, where the variables are defined, and where the templates are. That shared mental model is what allows roles to be shared across teams and organizations via Ansible Galaxy.
The structure isn't optional and it isn't decoration. Missing tasks/main.yml means the role does nothing and produces no error. A template referenced in tasks that doesn't exist in templates/ fails at runtime with a file not found error that points to a path that looks correct. The ansible-galaxy role init command generates the full structure in one command — use it every time rather than creating directories manually and risking missing one.
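To make the lookup convention concrete, here is a minimal sketch. It assumes a hypothetical role named webserver, a template file nginx.conf.j2, a handler called "Reload nginx", and a variable webserver_config_dir defined in the role's defaults/main.yml; none of these names come from a real published role. The task refers to the template by bare filename and to the handler by name, and Ansible resolves both from the role's own directories.

```yaml
---
# roles/webserver/tasks/main.yml (hypothetical role and file names)
# Loaded automatically when the role is called; no include statement is needed.
- name: Render nginx configuration
  ansible.builtin.template:
    src: nginx.conf.j2                             # found in roles/webserver/templates/
    dest: "{{ webserver_config_dir }}/nginx.conf"  # assumed to be set in defaults/main.yml
    mode: "0644"
  notify: Reload nginx                             # matched by name in handlers/main.yml
---
# roles/webserver/handlers/main.yml
# Runs only when a task that notified it reports "changed".
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded
```

If nginx.conf.j2 were missing from templates/, the first task is where the runtime file-not-found error would surface.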
One part of the generated structure that teams often underuse is testing. ansible-galaxy role init creates a tests/ directory containing only a stub inventory and a stub test.yml playbook, and most teams never go further. Molecule scenarios live in a separate molecule/ directory at the role root (molecule/default/, molecule/custom_paths/). These are the test scenarios that verify the role works with default variables, with non-default variables, and that it's idempotent on a second run. A role without those scenarios is a role that breaks silently and gets discovered in production.
io/thecodeforge/ansible/init_role.sh (Bash)
```bash
#!/usr/bin/env bash
# io.thecodeforge: Initialize a new role with the full standard structure
# Always use ansible-galaxy role init; never create directories manually.
# The tool generates every standard directory and file, including a stub
# meta/main.yml with the Galaxy metadata fields that must be present
# for publishing to Galaxy or an internal Automation Hub.
ansible-galaxy role init io.thecodeforge.webserver

# Generated structure (annotated):
# io.thecodeforge.webserver/
# ├── defaults/
# │   └── main.yml   <- LOWEST precedence. Everything here is overridable.
# │                     Use for: ports, paths, package versions, feature flags.
# │                     If a value might differ between teams or environments, it goes here.
# ├── files/         <- Static assets. No variable substitution.
# │                     Use for: scripts, static certs, binary configs, SSH keys.
# │                     Loaded by: ansible.builtin.copy with src: filename.ext
# ├── handlers/
# │   └── main.yml   <- Service lifecycle tasks. Only run when notified.
# │                     Use for: reload, restart, enable. Never unconditional tasks.
# ├── meta/
# │   └── main.yml   <- Role dependencies, Galaxy metadata, supported platforms.
# │                     Dependencies here run before this role, automatically.
# ├── tasks/
# │   └── main.yml   <- Primary execution logic. Loaded automatically.
# │                     Should contain NO hardcoded values, only variable references.
# ├── templates/     <- Jinja2 templates. Rendered at runtime with variable substitution.
# │                     Use for: nginx.conf, postgresql.conf, systemd unit files.
# │                     Loaded by: ansible.builtin.template with src: filename.j2
# ├── vars/
# │   └── main.yml   <- HIGH precedence. Inventory and group_vars cannot override these.
# │                     Use for: internal package names, service names, OS-specific constants.
# │                     NOT for values users should change; use defaults/ for those.
# └── tests/
#     ├── inventory  <- Stub inventory generated by init.
#     └── test.yml   <- Stub playbook generated by init.

# After init, immediately set up Molecule. Scenarios are created under molecule/
# at the role root, not under tests/. The docker driver is provided by the
# molecule-plugins[docker] package.
cd io.thecodeforge.webserver
molecule init scenario default --driver-name docker
molecule init scenario custom_paths --driver-name docker
# Result:
#   molecule/default/       (tests with default variable values)
#   molecule/custom_paths/  (tests with non-default values; catches hardcoding)
# Write converge.yml and verify.yml for both scenarios before writing a single task.
```
The Empty defaults/main.yml Is a Red Flag
When reviewing a role, the first file to open is defaults/main.yml. If it's empty — or doesn't exist — the role almost certainly has hardcoded values in tasks/main.yml that make it single-use. Every path, port, package version, username, and config option that could reasonably differ between environments should appear in defaults/main.yml with a sensible value. If you can read through tasks/main.yml and find a literal string where a variable should be, that's a bug in the role's design, not just a style issue.
Production Insight
The standard directory structure is how Ansible finds your files — it is not optional formatting.
Missing tasks/main.yml means the role silently does nothing. Missing templates/ causes runtime failures that point to a path that looks correct and takes time to diagnose.
Rule: run ansible-galaxy role init every time. The directory structure costs nothing to generate and is expensive to debug when wrong. Check the generated meta/main.yml and fill in the galaxy_info block — a role without author and description metadata becomes unmaintainable in a shared Galaxy namespace.
Key Takeaway
Roles are packaged automation. The directory structure is the loading contract — Ansible finds files by convention, not configuration.
tasks/ does the work. defaults/ holds what users can change. vars/ holds what they should not. templates/ holds configs that need variables. files/ holds everything static.
Convention over configuration means every role looks the same from the outside. That shared structure is what makes roles shareable across teams.
Where Does This Value Belong?
If: The value might differ between environments, teams, or use cases (port numbers, paths, package versions, usernames)
→ Use: defaults/main.yml. Lowest precedence, fully overridable by inventory, playbook, or command line.
If: The value is an internal constant the role needs to function correctly and users should never change (internal service name, OS package name, fixed file permission)
→ Use: vars/main.yml. Higher precedence, protected from inventory overrides.
If: The value is a config file that needs variable substitution (nginx.conf, postgresql.conf, systemd unit)
→ Use: templates/ as a Jinja2 .j2 file. Rendered at runtime, references variables from defaults/ or the calling playbook.
If: The value is a static file that never changes (a shell script, a static binary, a fixed certificate)
→ Use: files/. Served verbatim with no variable substitution.
If: The value is a secret (password, API token, private key)
→ Use: Neither defaults/ nor vars/. Store it as an Ansible Vault encrypted variable in group_vars/production/vault.yml and pass it to the role via the calling playbook.
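The placement rules above can be sketched as three small files. The variable names nginx_port, nginx_log_dir, and nginx_worker_processes appear elsewhere in this article; nginx_package_name, nginx_service_name, and vault_nginx_status_password are hypothetical names introduced only for illustration, and the default values are assumptions.

```yaml
---
# roles/nginx/defaults/main.yml: values callers are expected to override
nginx_port: 80
nginx_log_dir: /var/log/nginx
nginx_worker_processes: auto
---
# roles/nginx/vars/main.yml: internal constants, not reachable from inventory
nginx_package_name: nginx      # hypothetical variable name
nginx_service_name: nginx      # hypothetical variable name
---
# group_vars/production/vault.yml: secrets only
# Shown in plaintext for readability; encrypt the file before committing, e.g.
#   ansible-vault encrypt group_vars/production/vault.yml
vault_nginx_status_password: "CHANGE_ME"   # hypothetical secret, placeholder value
```

Keeping the secret under a vault_ prefixed name and mapping it to a role variable at the call site makes it obvious in the playbook which values come from Vault.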
Production Patterns: Reusability, Composition, and the God Role Problem
The single most important design principle for Ansible roles is the same one that applies to microservices, library functions, and Unix commands: do one thing well. A role that installs and configures Nginx is useful to every team that runs Nginx. A role that installs Nginx, PostgreSQL, Redis, and a monitoring agent is useful to exactly one team — the team that chose that exact combination — and becomes a maintenance burden the moment any team's requirements diverge.
This is the God role problem. It emerges gradually. Someone writes a server_setup role that installs the web server and the database because both are needed on the first server they're automating. A few weeks later they add log rotation. A few weeks after that, monitoring. By the time tasks/main.yml has grown to 600 lines, the role is impossible to use partially. A team that only needs the web server configuration must accept the database configuration too, or fork the role.
The fix is decomposition: one role per service, composed in the playbook. A playbook that calls roles: [common, nginx, postgresql, prometheus_node_exporter, log_rotation] is instantly readable. You know exactly what the role list configures. You can remove prometheus_node_exporter from the list for a server where you don't want monitoring. You can test each role independently with Molecule. You can update the nginx role without touching the postgresql role.
The second pattern that determines whether roles scale is the variable boundary. The calling playbook is where environment-specific values should live — not inside the role. A role's defaults/main.yml provides the fallback values that work for the most common case. The playbook's vars block or the inventory's group_vars override those defaults for specific environments. This separation means the role itself is environment-agnostic — it works for dev, staging, and production, with the differences expressed entirely in the calling context.
When a value is passed in the vars: block at role call time in a playbook, it has higher precedence than defaults/, group_vars, and host_vars; only a few sources, such as extra vars (-e), still override it. This is the right level for environment-specific overrides when you want the role to receive a value without the calling team having to set it in inventory, but it also means inventory can no longer change that value for an individual host. It's also where you declare which environment-specific values are expected: a well-documented vars block at the role call site is self-documenting infrastructure.
io/thecodeforge/ansible/site.yml (YAML)
```yaml
---
# io.thecodeforge: Production orchestration playbook
# This file composes roles. It contains no task logic of its own.
# Each role does one thing. This playbook decides which things to do and in what order.
#
# Variable precedence at role call time (from the vars: block):
#   Higher than: defaults/main.yml, group_vars, host_vars
#   Lower than:  extra vars (-e)
# Use the vars: block for environment-specific overrides you want visible in the playbook.
# Use group_vars for overrides that should apply to all plays targeting that group.

# ── Play 1: Load Balancer configuration ──────────────────────────────────────
- name: Configure Load Balancer Tier
  hosts: load_balancers
  become: true
  roles:
    # Common role runs first on every host: SSH hardening, NTP, logging standards
    - role: io.thecodeforge.common
      vars:
        common_ntp_servers:
          - 169.254.169.123   # AWS time sync service, lower latency than pool.ntp.org
        common_ssh_allow_groups: ['deploy', 'sre']

    # HAProxy role: focused exclusively on load balancer configuration
    - role: io.thecodeforge.haproxy
      vars:
        haproxy_max_connections: 10000   # Overrides defaults/main.yml value of 2000
        haproxy_timeout_connect: '5s'
        haproxy_timeout_client: '30s'
        haproxy_timeout_server: '30s'
        haproxy_backend_servers:         # Dynamically populated from inventory
          - { name: 'web01', addr: '{{ hostvars["web-01"]["ansible_host"] }}', port: 8080 }
          - { name: 'web02', addr: '{{ hostvars["web-02"]["ansible_host"] }}', port: 8080 }

    # TLS termination role: manages certificates and nginx-based TLS offloading
    - role: io.thecodeforge.tls_termination
      vars:
        tls_domain: 'api.thecodeforge.io'
        tls_cert_source: 'acme'          # 'acme', 'vault', or 'file'
        tls_acme_email: 'ops@thecodeforge.io'

# ── Play 2: Application Server configuration ─────────────────────────────────
- name: Configure Application Server Tier
  hosts: web_servers
  become: true
  roles:
    - role: io.thecodeforge.common
      # Same role, same defaults: common runs identically on all tiers

    - role: io.thecodeforge.nginx
      vars:
        nginx_worker_processes: auto
        nginx_worker_connections: 4096
        nginx_vhosts:
          - server_name: 'api.thecodeforge.io'
            listen_port: 8080
            root: '/var/www/api'
            access_log: '/var/log/nginx/api_access.log'

    - role: io.thecodeforge.app_deploy
      vars:
        app_repo: 'https://github.com/thecodeforge/api.git'
        app_version: '{{ release_version | default("main") }}'
        app_user: 'www-data'
        app_env: 'production'

# ── Play 3: Database configuration ───────────────────────────────────────────
- name: Configure Database Tier
  hosts: database_servers
  become: true
  serial: 1                # One database server at a time, never parallel for Postgres
  max_fail_percentage: 0   # Any database failure stops the entire play
  roles:
    - role: io.thecodeforge.common

    - role: io.thecodeforge.postgresql
      vars:
        postgres_version: 16
        postgres_data_dir: '/data/pg_production'   # Overrides default /var/lib/postgresql
        postgres_max_connections: 200
        postgres_shared_buffers: '4GB'
        postgres_effective_cache_size: '12GB'
        # Passwords come from Vault, never hardcoded here
        postgres_app_password: '{{ vault_postgres_app_password }}'

    - role: io.thecodeforge.prometheus_node_exporter
      # No vars override: defaults work for all servers
      # Port 9100, /metrics endpoint, standard collectors
```
The God Role Tells You It's a God Role
If your role's tasks/main.yml exceeds 150 lines, or if the role name contains 'and' (postgres_and_redis, webserver_and_monitoring), or if you find yourself writing when: conditions inside the role to skip tasks for certain use cases — these are the three warning signs. A role that needs when: conditions to skip parts of itself for different callers is really two roles sharing one directory. The fix is always decomposition: split, compose in the playbook, test each piece independently.
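The third warning sign is the easiest to spot in code. The sketch below is illustrative only: the server_setup role is the hypothetical God role described earlier, and the toggle variables server_setup_enable_web and server_setup_enable_db are invented for this example. The "before" half shows when: guards that exist purely so some callers can skip half the role; the "after" half moves that decision into the playbook.

```yaml
---
# BEFORE: roles/server_setup/tasks/main.yml (hypothetical God role)
- name: Install nginx
  ansible.builtin.package:
    name: nginx
    state: present
  when: server_setup_enable_web | default(true)   # guard for callers without a web tier

- name: Install PostgreSQL
  ansible.builtin.package:
    name: postgresql
    state: present
  when: server_setup_enable_db | default(true)    # guard for callers without a database
---
# AFTER: site.yml composes focused roles; each play lists only what it needs
- name: Configure web tier
  hosts: web_servers
  become: true
  roles:
    - role: io.thecodeforge.nginx

- name: Configure database tier
  hosts: database_servers
  become: true
  roles:
    - role: io.thecodeforge.postgresql
```

Each guard in the "before" version marks a split line: the tasks behind one guard become one role.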
Production Insight
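Role variables passed in the vars: block at call time have higher precedence than defaults/, group_vars, and host_vars, and are still overridden by extra vars (-e). This is the right level for environment-specific overrides you want visible in the playbook itself, as long as you accept that inventory cannot change them afterwards.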
Never put environment-specific values in vars/main.yml inside the role. vars/ is for internal role constants, not for configuration that differs by environment. An engineer reading the role has no way to know that vars/ is being overridden externally — it looks hardcoded.
Rule: the calling playbook should be readable as documentation. A vars: block at each role call site that lists the non-default values is self-documenting infrastructure. Someone reading site.yml should understand the entire system's configuration without opening any role file.
Key Takeaway
One role per service. Compose multiple roles in the playbook. This is the pattern that scales from 5 servers to 5000.
A reusable role has no environment-specific assumptions. Everything that could differ goes in defaults/. The calling playbook provides the environment-specific values.
If two teams are maintaining separate forks of the same role, the role has a hardcoded value where a defaults/ variable should be. Find it and fix it.
One Role or Multiple Roles?
If: The automation installs and configures a single service (Nginx, PostgreSQL, Redis, Prometheus)
→ Use: One role, named after the service. This is the right granularity for Galaxy and for reuse.
If: The automation configures two related services that are always deployed together on the same host
→ Use: Still two roles, one per service. Compose them in the playbook. The coupling is at the playbook level where it belongs, not inside a role.
If: The automation has a 'base' configuration that applies to every server regardless of role (SSH hardening, NTP, logging standards, security patches)
→ Use: One 'common' role that runs first in every play. This is explicitly the right use case for a shared foundational role.
If: The role's tasks/main.yml has grown past 150 lines or contains when: conditions to skip sections for different callers
→ Use: Split it. The when: conditions are telling you where the split lines are. Each conditional branch is a candidate for its own role.
If: Two teams need the same service configured differently and one is forking the role to make their changes
→ Use: The role needs better defaults/ coverage. A fork means a variable that should be in defaults/ is hardcoded in tasks/. Find it, move it, eliminate the fork.
Testing Roles with Molecule — The Practice That Separates Good Roles from Great Ones
A role without tests is a role that breaks silently in production. You find out when a deployment fails, when a new engineer makes a change that looked harmless, or when a Galaxy role dependency updates and changes behavior. Molecule gives you a way to find out in CI instead.
Molecule is the standard testing framework for Ansible roles. It spins up disposable infrastructure (Docker containers for most roles, cloud instances for roles that need real hardware), runs your role against that infrastructure, verifies the resulting state with its verifier (Ansible assertions in verify.yml by default, Testinfra as an alternative), runs the role a second time to verify idempotency, and then tears everything down. The entire cycle typically takes 2-5 minutes for a Docker-based test.
The most valuable test you can write is the idempotency check: run the role twice and assert that the second run shows zero 'changed' tasks. This is Molecule's default behavior — it runs the role, checks idempotency automatically, and fails if the second run shows any changes. If your role isn't idempotent, Molecule tells you which task is the problem.
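As a small illustration of what the idempotency check catches, consider the pair of tasks below. The command and file path are hypothetical examples, not taken from a real role. The first task reports "changed" on every run because Ansible cannot know whether a command did anything; the second uses the command module's creates argument so the task is skipped once the file exists.

```yaml
---
# Fails the idempotence step: a bare command task reports "changed" on every run.
- name: Generate DH parameters (not idempotent)
  ansible.builtin.command: openssl dhparam -out /etc/nginx/dhparam.pem 2048

# Passes: with creates set, Ansible skips the command when the file already exists.
- name: Generate DH parameters (idempotent)
  ansible.builtin.command:
    cmd: openssl dhparam -out /etc/nginx/dhparam.pem 2048
    creates: /etc/nginx/dhparam.pem
```

For commands that only read state, changed_when: false is the usual alternative to creates.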
The second most valuable test is the non-default variable scenario: create a Molecule scenario that sets every variable in defaults/main.yml to a non-default value and runs the role. If any task contains a hardcoded value instead of a variable reference, this test surfaces it. The production incident in this article would have been caught by this test on the first day the role was written.
For roles that will be shared via Galaxy or an internal Automation Hub, add platform-specific scenarios: test on Ubuntu 22.04 LTS, on Ubuntu 24.04, and on RHEL 9 if your organization runs Red Hat. Platform divergence in package names, service names, and file paths is a major source of 'works on my machine' failures in shared roles.
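A multi-platform scenario could be declared roughly as follows. This is a sketch under stated assumptions: the container images are community images chosen for illustration (with a Rocky Linux 9 image standing in for RHEL 9), and the docker driver is assumed to come from the molecule-plugins[docker] package. Substitute whatever images your organization trusts.

```yaml
---
# molecule/default/molecule.yml (sketch; image names are assumptions)
dependency:
  name: galaxy
driver:
  name: docker              # assumes molecule-plugins[docker] is installed
platforms:
  - name: ubuntu2204
    image: geerlingguy/docker-ubuntu2204-ansible:latest
    pre_build_image: true   # use the image as-is instead of building from a Dockerfile
  - name: ubuntu2404
    image: geerlingguy/docker-ubuntu2404-ansible:latest
    pre_build_image: true
  - name: rockylinux9       # RHEL 9 compatible stand-in
    image: geerlingguy/docker-rockylinux9-ansible:latest
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible             # runs the scenario's verify.yml
```

Roles that start services through systemd usually need extra container settings (privileged mode, cgroup mounts), which are omitted here for brevity.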
```yaml
---
# io.thecodeforge: Molecule converge playbook, default scenario
# File: molecule/default/converge.yml
# This runs the role with default variable values.
# Molecule automatically runs this twice and fails if the second run shows 'changed'.
- name: Converge - test nginx role with default variables
  hosts: all
  become: true
  roles:
    - role: io.thecodeforge.nginx
      # No vars: block here; testing that defaults/main.yml values work correctly
```
```yaml
---
# io.thecodeforge: Molecule verify playbook, Ansible verifier assertions
# File: molecule/default/verify.yml
# These assertions run after converge and confirm the role achieved its intended state.
- name: Verify - confirm nginx role achieved correct state
  hosts: all
  gather_facts: false
  tasks:
    - name: Gather installed package facts
      ansible.builtin.package_facts:
        manager: apt

    - name: Assert Nginx is installed
      ansible.builtin.assert:
        that:
          - "'nginx' in ansible_facts.packages"
        fail_msg: "Nginx is not installed: role task failed silently"

    - name: Gather service facts
      ansible.builtin.service_facts:

    - name: Assert Nginx service is running and enabled
      ansible.builtin.assert:
        that:
          - "ansible_facts.services['nginx.service'].state == 'running'"
          - "ansible_facts.services['nginx.service'].status == 'enabled'"
        fail_msg: "Nginx is not running or not enabled: handler or service task failed"

    - name: Confirm Nginx is listening on the default port
      ansible.builtin.wait_for:
        port: "{{ nginx_port | default(80) }}"
        timeout: 5
        msg: "Nginx is not listening on port {{ nginx_port | default(80) }}"

    - name: Verify Nginx config is valid
      ansible.builtin.command: nginx -t
      register: nginx_test
      changed_when: false
      failed_when: nginx_test.rc != 0
```
```yaml
---
# io.thecodeforge: Molecule converge playbook, custom_paths scenario
# File: molecule/custom_paths/converge.yml
# This scenario runs the role with NON-DEFAULT variable values.
# It catches hardcoded paths and values in tasks/main.yml.
# If this scenario fails where default/ passes, a path is hardcoded.
- name: Converge - test nginx role with non-default variable values
  hosts: all
  become: true
  roles:
    - role: io.thecodeforge.nginx
      vars:
        nginx_port: 8080                       # Non-default: catches port hardcoding
        nginx_worker_processes: 2              # Non-default: catches proc count hardcoding
        nginx_log_dir: /var/log/nginx_custom   # Non-default: catches path hardcoding
        nginx_config_dir: /etc/nginx_custom    # Non-default: catches config path hardcoding
        # Every variable in defaults/main.yml should appear here with a non-default value.
        # If the role fails this scenario, find the hardcoded value and move it to defaults/.
```
```yaml
---
# io.thecodeforge: CI pipeline configuration for Molecule testing
# File: .gitlab-ci.yml excerpt
# This runs both Molecule scenarios on every merge request.
molecule_test:
  stage: test
  image: quay.io/ansible/community-ansible-dev-tools:latest
  before_script:
    - pip install molecule molecule-plugins[docker] ansible-lint
  script:
    - cd roles/io.thecodeforge.nginx
    - ansible-lint .                                 # Lint first: fast failure
    - molecule test --scenario-name default          # Default values + idempotency check
    - molecule test --scenario-name custom_paths     # Non-default values + hardcoding check
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  artifacts:
    when: on_failure
    paths:
      - roles/io.thecodeforge.nginx/.molecule/
    expire_in: 3 days
```
The Two Molecule Scenarios Every Role Needs
Every role needs at minimum two Molecule scenarios. The default scenario tests with the values in defaults/main.yml and verifies idempotency on a second run, which Molecule does automatically. The custom_paths scenario tests with non-default values for every variable in defaults/main.yml and verifies the role still works. If default passes but custom_paths fails, there's a hardcoded value somewhere in tasks/main.yml. The production incident in this article would have been caught by the custom_paths scenario on day one. Neither scenario is optional for a role that will be shared.
Production Insight
Molecule's idempotency check, which runs the role twice and fails on any 'changed' in the second run, is the single most valuable automated test for an Ansible role. It catches unguarded shell and command tasks, timestamps in templates, and non-deterministic file generation before they reach production.
The custom_paths scenario has saved the io.thecodeforge team from three separate hardcoding incidents in shared roles. Each one was caught in CI on a merge request, not in a production deployment.
Rule: no role change merges without Molecule passing both scenarios. This is the hard gate. A role that breaks its test scenarios is not ready for production regardless of how confident the author is.
Key Takeaway
A role without Molecule tests is a role you discover is broken in production.
The default scenario catches idempotency bugs. The custom_paths scenario catches hardcoded values. Both are required for any role that leaves its author's laptop.
Molecule in CI is the hard gate between 'I think this works' and 'I have verified this works.'
Molecule Test Scenario Design
If: Role is used by exactly one team in one environment
→ Use: Minimum: default scenario with idempotency check. The custom_paths scenario is still strongly recommended because it catches issues before the role grows.
If: Role is shared across multiple teams or published to Galaxy
→ Use: Required: default scenario, custom_paths scenario with every defaults/ variable at a non-default value, and platform scenarios for each supported OS.
If: Role manages a stateful service (database, message queue, persistent storage)
→ Use: Add a separate scenario that tests idempotency with data present. The second run should make zero changes even with real data in place.
If: Role uses conditional logic based on OS family or distribution
→ Use: Add separate scenarios per platform (ubuntu_22, rhel_9) to verify the OS-specific code paths independently.
Production incident · Post-mortem · Severity: high
The Role That Couldn't Be Reused
Symptom
The second team spent three days adapting the role to their environment. Every environment-specific change — dev versus prod data directories, different log paths for compliance, a different PostgreSQL major version — required editing tasks/main.yml directly. When the original team patched a security configuration bug six weeks later, the fork never received the fix. Eight months in, the two versions had diverged enough that merging them was estimated at two weeks of work. The team chose to continue maintaining both.
Assumption
The original team assumed every PostgreSQL instance in the organization would use /var/lib/postgresql/14/main. It seemed like a reasonable default at the time — it was the standard Debian package layout, and they'd never needed to deviate. They put the path literal directly in tasks/main.yml because it was faster and they weren't thinking about reuse. There was no defaults/main.yml at all when the second team first opened the role.
Root cause
The role's tasks/main.yml contained literal path strings throughout: dest: /var/lib/postgresql/14/main/postgresql.conf, src: /var/lib/postgresql/14/main/pg_hba.conf. The second team needed /data/pg_production for performance reasons — their storage was mounted separately for I/O isolation. Ansible's variable precedence system would have allowed complete override, but there was no variable to override. The path was a literal string, not a reference to a variable. The defaults/ directory existed but was empty. Every 'configuration' was actually a hardcoded implementation detail.
Fix
Refactored every hardcoded path and version-specific value into defaults/main.yml: postgres_data_dir: /var/lib/postgresql/14/main, postgres_major_version: 14, postgres_config_dir: /etc/postgresql/14/main, postgres_log_dir: /var/log/postgresql. Updated every task reference to use the variable: dest: {{ postgres_data_dir }}/postgresql.conf. The second team set postgres_data_dir: /data/pg_production in their playbook's vars block and the role worked immediately without modification. Added a Molecule test scenario named custom_paths that runs the role with non-default values for every variable in defaults/main.yml — if a future change hardcodes a path, the custom_paths scenario fails.
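The refactor described above might look roughly like this. The variable names and default values are the ones listed in the fix; the template filename, file ownership, and play name are assumptions added to make the sketch complete.

```yaml
---
# roles/postgresql/defaults/main.yml after the refactor
postgres_major_version: 14
postgres_data_dir: /var/lib/postgresql/14/main
postgres_config_dir: /etc/postgresql/14/main
postgres_log_dir: /var/log/postgresql
---
# roles/postgresql/tasks/main.yml excerpt: the literal path becomes a variable reference
- name: Deploy postgresql.conf
  ansible.builtin.template:
    src: postgresql.conf.j2                          # assumed template name
    dest: "{{ postgres_data_dir }}/postgresql.conf"
    owner: postgres                                  # assumed ownership
    group: postgres
    mode: "0600"
---
# Second team's playbook: one override at the call site, no fork required
- name: Configure database tier
  hosts: database_servers
  become: true
  roles:
    - role: io.thecodeforge.postgresql
      vars:
        postgres_data_dir: /data/pg_production
```

Note the quotes around the dest value: a YAML scalar that begins with a Jinja2 expression must be quoted, which is easy to miss when converting literals to variables.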
Key lesson
If a value could possibly differ between environments, teams, or PostgreSQL versions, it belongs in defaults/main.yml — not as a literal string in tasks/main.yml. When in doubt, make it a variable.
A role with zero entries in defaults/main.yml is almost certainly hiding hardcoded site-specific assumptions somewhere in its tasks. Treat an empty defaults/main.yml as a code smell during role review.
A reusable role has no hardcoded site-specific values anywhere in its tasks. Everything that varies belongs in defaults/ with a sensible value that works for the most common case.
Test roles with non-default variable values using a dedicated Molecule scenario. If making the test pass requires editing tasks/ rather than just setting different variables, the role isn't reusable yet.
Production debug guide
Three failure patterns specific to Ansible roles, with exact diagnostics and fixes for each one.
Symptom · 01
Overrides for a role variable aren't taking effect: inventory or playbook values appear to be ignored and the role keeps using its own value
→
Fix
This is almost always a precedence inversion: the value you're trying to override is defined in vars/main.yml, not defaults/main.yml. vars/ has much higher precedence than group_vars or host_vars. Run ansible-inventory --host $HOST | grep variable_name to see the value inventory resolves for the host, then run grep -rn variable_name roles/role_name/vars/ to check whether the role pins the same name in vars/. If the value from vars/ is winning over your group_vars, move it to defaults/ or remove it from vars/ entirely. Only constants the role cannot function without belong in vars/.
Symptom · 02
Role works correctly in isolation but fails when combined with other roles in the same playbook
→
Fix
Two roles are using the same variable name and one is overwriting the other's value. Roles share a global variable namespace; there is no automatic role-level scoping. Run ansible-playbook site.yml --list-tasks to see the execution order, then add a temporary ansible.builtin.debug task with var: port just before the failing task to see what value is actually resolved at that point. An ad-hoc ansible -m debug call is not enough here, because it only sees inventory variables and never loads role defaults/ or vars/. The fix is prefixing all role variables with the role name: nginx_port not port, postgres_port not port. Audit every variable in defaults/main.yml and vars/main.yml for both roles and rename any collisions.
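A sketch of the collision fix, assuming two roles that both originally declared a bare port variable. The role directory names and default values are illustrative. The temporary debug task prints the value the role actually sees at run time, which is the quickest way to confirm which definition is winning.

```yaml
---
# roles/nginx/defaults/main.yml
nginx_port: 80          # was: port: 80
---
# roles/postgresql/defaults/main.yml
postgres_port: 5432     # was: port: 5432
---
# roles/nginx/tasks/main.yml: temporary first task while diagnosing, remove afterwards
- name: Show the values this role actually sees
  ansible.builtin.debug:
    msg: "nginx_port={{ nginx_port }}, legacy port={{ port | default('undefined') }}"
```

Once every variable carries its role prefix, the legacy name should print as undefined on all hosts.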
Symptom · 03
Role dependency declared in meta/main.yml isn't running before the dependent role, or appears to be skipped entirely
Fix
Dependencies in meta/main.yml are deduplicated: within a play, Ansible runs a given role only once unless it is called with different parameters. If the dependency already ran earlier in the same play with the same parameters, Ansible considers it satisfied and skips it. This is the expected behavior, but it produces surprising results when the dependency needs to run again for a different context; in that case pass different parameters or set allow_duplicates: true in the dependency's own meta/main.yml. Run ansible-galaxy role list --roles-path ./roles to verify the dependency is installed locally. If it's missing, run ansible-galaxy install -r requirements.yml. Circular dependencies are a design error that Ansible will not untangle for you: depending on the version, a cycle typically aborts role loading with a recursion-loop error rather than producing a usable order. Use ansible-playbook -vvv to see the order in which roles are loaded.
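When a dependency genuinely has to run more than once in a play, the role itself can opt out of deduplication. A minimal sketch, assuming a role named common_security (the name is borrowed from this article):

```yaml
# roles/common_security/meta/main.yml (role name borrowed from this article)
allow_duplicates: true  # permit more than one run of this role in a single play
dependencies: []
```

Without this setting, a second call only runs if its parameters differ from the first call.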
★ Ansible Role Debug Cheat Sheet
Five checks that diagnose most role-related failures. Run these before refactoring anything.
Role variable not being overridden as expected from inventory or playbook
Immediate action
Dump the value the inventory resolves for the specific failing host
Commands
ansible -m debug -a 'var=postgres_data_dir' -i inventory.ini $HOST
Fix now
An ad hoc debug call shows only what the inventory provides; role defaults/ and vars/ are not loaded. If the expected override appears here but the role still uses another value, the role's vars/ is winning. defaults/ has the lowest precedence in Ansible. group_vars overrides it. host_vars overrides group_vars. vars/ in the role overrides both, meaning inventory cannot override vars/. If your variable is in vars/ and you need it to be overridable, move it to defaults/. If it's already in defaults/ and still not being overridden, check for a conflicting host_vars file.
Role dependency declared in meta/main.yml is not running
Immediate action
Verify the dependency is installed locally and that the declaration syntax is correct
Commands
grep -A 10 'dependencies:' roles/role_name/meta/main.yml
ansible-galaxy role list --roles-path ./roles
Fix now
Dependencies must be installed in the roles/ directory before the playbook runs. Run ansible-galaxy install -r requirements.yml to fetch missing dependencies. If the dependency is installed but still not running, check whether it already ran earlier in the same play with the same parameters, because Ansible deduplicates role runs within a play. Circular dependencies are not resolved for you; run ansible-playbook -vvv to see the order in which roles load and identify cycles.
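Role tasks failing with 'undefined variable' for a variable that exists in defaults/main.yml
Immediate action
Confirm the variable is actually in scope for the task that's failing
Fix now
Defaults from roles listed under roles: or loaded with import_role are visible to the rest of the play, so this error usually has one of three causes: the file is not named main.yml and was never auto-loaded; the role was loaded with include_role without public: true, which keeps its defaults private to that role; or the failing task runs in a different play. To share a value across roles reliably, either set it in inventory group_vars (which all roles can read), or use set_fact in a playbook task to promote it to a host-level fact. Cross-role variable sharing via group_vars is the cleanest pattern.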
Role runs successfully but idempotency is broken: it always shows 'changed' on every run
Immediate action
Identify exactly which task reports changed and what content is changing
Fix now
The most common cause is a Jinja2 template that includes dynamic content: timestamps, randomly generated values, or a fact that changes between runs. Remove dynamic content from templates used for config files. If the template is correct and the file still shows changed, verify that line endings and file encoding are consistent. For files that should only be written once (certificates, initialization tokens), use ansible.builtin.copy with force: false.
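The two fixes above can be sketched as a pair of tasks. The role name app, the file names, and the destination paths are hypothetical.

```yaml
# roles/app/tasks/main.yml (hypothetical role, paths, and file names)
- name: Write the initial token only if it does not exist yet
  ansible.builtin.copy:
    src: bootstrap.token
    dest: /etc/app/bootstrap.token
    mode: "0600"
    force: false  # never overwrite an existing file, so reruns report ok

- name: Render config from a template that contains no timestamps or random values
  ansible.builtin.template:
    src: app.conf.j2
    dest: /etc/app/app.conf
    mode: "0644"
```

With force: false the copy task only transfers the file when the destination is absent, which is what keeps the second run unchanged.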
Role fails with file not found for a template or static file
Immediate action
Verify the role directory structure and the src path in the failing task
Commands
ls -la roles/role_name/{files,templates}/
grep -n 'src:' roles/role_name/tasks/main.yml
Fix now
The template module resolves src: relative to the role's templates/ directory. The copy module resolves src: relative to the role's files/ directory. These paths are case-sensitive. If the file exists in the right directory but the task still fails, check that the filename in the src: field matches exactly — including case and extension. Run ls -la on both directories and compare output against the src: value character by character.
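As an illustration of the lookup rules, the tasks below use bare file names and rely on the role's directory layout. The role name nginx and the file names are assumptions made for the example.

```yaml
# roles/nginx/tasks/main.yml (illustrative role and file names)
- name: Render the site config
  ansible.builtin.template:
    src: nginx.conf.j2        # looked up in roles/nginx/templates/nginx.conf.j2
    dest: /etc/nginx/nginx.conf

- name: Install the static error page
  ansible.builtin.copy:
    src: 50x.html             # looked up in roles/nginx/files/50x.html
    dest: /usr/share/nginx/html/50x.html
```

A src: value such as Nginx.conf.j2 would fail on a case-sensitive filesystem even though the file appears to be present.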
Single Playbook vs Ansible Roles: The Operational Difference

| Aspect | Single Playbook | Ansible Roles |
| --- | --- | --- |
| Appropriate scale | 1-5 tasks on a single host group, one-off operations, scripts you run once. A single playbook is the right tool for small, focused, non-repeating automation. | Multi-tier infrastructure, automation shared across teams, anything that runs on a schedule or in CI/CD. Roles pay for their structure the moment a second team needs the same automation. |
| Reusability | None: reusing a playbook requires copy-pasting blocks of YAML and maintaining multiple copies. Any bug fix must be applied to every copy manually. | First-class: roles are versioned units with defined interfaces (defaults/). Published to Galaxy or internal Automation Hub. Bug fixes propagate to all consumers via requirements.yml version bumps. |
| Variable management | Global namespace: all variables are visible to all tasks. Name collisions between sections are invisible until they cause wrong behavior at runtime. | Structured and separated: defaults/ for overridable config, vars/ for internal constants, clear precedence hierarchy. Prefix variables with role name to prevent global namespace collisions. |
| Testing | Manual: run the playbook in staging and verify by hand. Regressions are caught by the next human who notices something is wrong. | Automated with Molecule: idempotency check, non-default variable scenario, platform scenarios. Regressions are caught in CI on the merge request. |
| Team collaboration | Difficult at scale: multiple people editing one file creates merge conflicts and unclear ownership. Who is responsible for which section? | Parallel ownership: each role has a clear owner and a clear boundary. The nginx team owns the nginx role. The postgres team owns the postgres role. Changes don't conflict. |
| Maintenance over time | Degrades: a 600-line playbook becomes impossible to read or modify without risk. Engineers avoid changing it, leading to workarounds layered on top of workarounds. | Stable: each role stays focused on one service. A postgres role doesn't grow because someone added monitoring. Roles evolve independently at their own pace. |
Key takeaways
1. Roles are packaged automation units: the standard directory structure is how Ansible finds tasks, handlers, templates, and variables automatically. Deviating from convention means silent failures, not helpful errors.
2. defaults/main.yml has the lowest precedence of any variable source: inventory, group_vars, and the calling playbook's vars block all override it. vars/main.yml has much higher precedence, so inventory cannot override it. Put overridable values in defaults/, internal constants in vars/. Confusing these two is the most common reason roles are inflexible.
3. One role per service. A role named after two services, or a tasks/main.yml with when: conditions to skip sections for different callers, is really two roles in one directory. Decompose and compose in the playbook.
4. A reusable role has zero hardcoded site-specific values. Every path, port, version, username, and configurable option must be a variable with a sensible default. An empty defaults/main.yml is a red flag: it almost always means the role has hidden hardcoded assumptions.
5. Molecule is the hard gate between 'I think this works' and 'I have verified this works.' Two required scenarios: default (idempotency check with default values) and custom_paths (verifies the role works with non-default values for every variable). No role change merges without both passing.
6. Role dependencies declared in meta/main.yml run automatically before the dependent role regardless of playbook order. Circular dependencies are a design error: Ansible will not produce a valid order for a cycle and typically aborts role loading with a recursion-loop error, so fix the role boundaries instead of relying on load order.
7. Prefix every role variable with the role name: nginx_port, postgres_port, haproxy_timeout. Roles share a global variable namespace, so unprefixed variable names collide silently and produce wrong behavior with no error message.
8. import_role is static (parsed at playbook load time: handlers work, tags propagate into the role). include_role is dynamic (evaluated at runtime: use it for conditional or looped role application, but handlers from included roles are only registered once the include has executed).
Common mistakes to avoid
Creating God Roles that configure multiple unrelated services
Symptom
Role named server_setup has 600 lines covering Docker installation, PostgreSQL configuration, Nginx setup, Prometheus node exporter, SSH hardening, and log rotation. Teams can't use just the PostgreSQL portion. Every change risks breaking the Nginx section. Testing requires a full-stack container. New engineers avoid modifying it entirely.
Fix
Decompose by service: docker, postgresql, nginx, prometheus_node_exporter, ssh_hardening, log_rotation — each as a separate role with its own defaults/, handlers/, templates/, and Molecule tests. Compose them in site.yml. Each role is independently testable, independently versioned, and independently useful to any team that needs that one service.
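Composition then happens in the playbook. The sketch below reuses the role names from the decomposition above; the host group names are hypothetical.

```yaml
# site.yml (role names follow the decomposition above; host groups are hypothetical)
- name: Baseline for every host
  hosts: all
  become: true
  roles:
    - ssh_hardening
    - log_rotation
    - prometheus_node_exporter

- name: Database servers
  hosts: dbservers
  become: true
  roles:
    - postgresql

- name: Web servers
  hosts: webservers
  become: true
  roles:
    - docker
    - nginx
```

Read top to bottom, the playbook doubles as documentation of which services each host group runs.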
Hardcoding environment-specific values in tasks/main.yml instead of defaults/main.yml
Symptom
Role contains literal strings: dest: /var/lib/postgresql/14/main/postgresql.conf. A second team needs /data/pg_production. They fork the role. Two versions now diverge. Upstream bug fixes never reach the fork. Six months later merging them is estimated at two weeks.
Fix
Move every configurable value to defaults/main.yml: postgres_data_dir: /var/lib/postgresql/14/main. Reference it in tasks with a quoted expression: dest: "{{ postgres_data_dir }}/postgresql.conf" (YAML requires the quotes when a value starts with {{). Add a Molecule custom_paths scenario that sets every defaults/ variable to a non-default value. If that scenario fails, a path is hardcoded somewhere.
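Put together, the default and the task that consumes it might look like the following. The template name postgresql.conf.j2 and the file mode are assumptions; the variable name and path come from the example above.

```yaml
# roles/postgresql/defaults/main.yml
postgres_data_dir: /var/lib/postgresql/14/main
---
# roles/postgresql/tasks/main.yml (template name and mode are illustrative)
- name: Deploy postgresql.conf into the configured data directory
  ansible.builtin.template:
    src: postgresql.conf.j2
    dest: "{{ postgres_data_dir }}/postgresql.conf"  # quoted because the value starts with {{
    mode: "0640"
```

A second team can now set postgres_data_dir: /data/pg_production in group_vars instead of forking the role.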
Confusing defaults/ with vars/ and wondering why inventory overrides have no effect
Symptom
Team sets nginx_port: 8080 in inventory group_vars/webservers.yml. Role still uses port 80. No error message. The port value in vars/main.yml is silently winning over the inventory value because vars/ has higher precedence than group_vars.
Fix
defaults/ has the lowest precedence of any variable source: inventory, group_vars, host_vars, and -e all override it. vars/ has much higher precedence: neither group_vars nor host_vars can override it, and only sources ranked above it, such as set_fact, include_vars, task-level vars, and -e, can. The rule: if a value is meant to be overridden by callers, it belongs in defaults/. If it's an internal constant the role needs to function and users should never touch, it belongs in vars/. If you're ever tempted to put an overridable value in vars/, put it in defaults/ instead.
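A small layout makes the split concrete. The variable names below follow the nginx example in this article, except nginx_worker_processes, nginx_package_name, and nginx_service_name, which are illustrative.

```yaml
# roles/nginx/defaults/main.yml (values callers may override)
nginx_port: 80
nginx_worker_processes: auto
---
# roles/nginx/vars/main.yml (internal constants; inventory cannot change these)
nginx_package_name: nginx
nginx_service_name: nginx
---
# group_vars/webservers.yml (takes effect because nginx_port lives in defaults/)
nginx_port: 8080
```

Had nginx_port been declared in vars/main.yml instead, the group_vars value would be ignored without any error.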
Missing role dependencies in meta/main.yml and relying on playbook ordering
Symptom
Role assumes common_security role has already run and that firewall rules allow its service's port. On a fresh host where the playbook order changed, the role fails because the port is blocked. On hosts where the common_security role ran in an earlier play, it works. Failures are intermittent and hard to trace.
Fix
Declare the dependency explicitly in meta/main.yml: dependencies: [{role: common_security}]. Dependencies run automatically before the dependent role regardless of playbook order. Run ansible-galaxy role list --roles-path ./roles to verify the dependency is installed. For conditional dependencies (only on RedHat), use the when key in the dependency declaration.
Not namespacing role variables — using generic names that collide across roles
Symptom
Both the nginx role and the haproxy role define a variable named port. When both roles run in the same play, the last-loaded role's value overwrites the first's. Nginx listens on HAProxy's port or vice versa. No error is produced — Ansible uses whichever value happens to be resolved last.
Fix
Prefix every role variable with the role name: nginx_port: 80, haproxy_port: 443, postgres_port: 5432. This is more than a naming convention; it is the mechanism that prevents namespace collisions. Audit every variable in every role's defaults/ and vars/ for generic names: port, user, version, path, log_dir. Rename any that aren't prefixed. Then run multiple roles in the same play and add an ansible.builtin.debug task for nginx_port after the roles to verify the expected value resolves correctly (an ad hoc ansible -m debug call does not load role variables).
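One way to run that verification is a throwaway playbook. This sketch assumes roles named nginx and haproxy that define nginx_port and haproxy_port in their defaults/main.yml; the playbook name and host group are hypothetical.

```yaml
# verify_ports.yml (hypothetical playbook; assumes both roles define prefixed defaults)
- name: Confirm prefixed variables do not collide
  hosts: webservers
  roles:
    - nginx
    - haproxy
  post_tasks:
    - name: Each role keeps its own value
      ansible.builtin.debug:
        msg: "nginx_port={{ nginx_port }} haproxy_port={{ haproxy_port }}"
```

If both roles had used a bare port variable, only one value would exist by the time the post_tasks run.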
Skipping Molecule tests because 'the role is simple' or 'it works in staging'
Symptom
A new engineer makes what looks like a minor change — adds a line to the template, adjusts a variable default, adds a new task. The role was never tested with Molecule. Three weeks later, a production deploy fails because the template now renders different content on the second run, breaking idempotency on a cron job that runs every 30 minutes.
Fix
Add Molecule testing before the role is used in production, not after. The investment is roughly a couple of hours to write the default and custom_paths scenarios. The payoff is that every future change gets an idempotency check and a non-default variable check automatically. Enforce Molecule as a CI gate: no merge without Molecule passing. 'Simple' roles grow, and Molecule catches the moment they stop being simple.
Interview Questions on This Topic
Q01 of 06 · SENIOR
Describe the Ansible variable precedence hierarchy. If a variable is defined in both defaults/main.yml and vars/main.yml within a role, which one wins? What about group_vars?
ANSWER
vars/main.yml wins over defaults/main.yml. The full hierarchy from lowest to highest precedence, focusing on the levels that matter most in practice:
1. Role defaults (defaults/main.yml) — lowest of all
2. Inventory file variables
3. group_vars/all
4. group_vars/groupname
5. host_vars/hostname
6. Playbook vars block
7. Role vars (vars/main.yml) — much higher than group_vars
8. set_fact / registered variables
9. Extra vars (-e) — highest, overrides everything
The production implication: group_vars cannot override vars/main.yml. If an operator sets nginx_port: 8080 in group_vars/webservers.yml and the role has nginx_port: 80 in vars/main.yml, the role wins. No error. Wrong port. This is the most common precedence bug in shared roles.
The rule: defaults/ for values users should be able to override. vars/ only for internal constants the role cannot function without — package names, service names, fixed permissions. Every value that might legitimately differ between environments or teams must be in defaults/.
Q02 of 06 · SENIOR
What is the specific use case for meta/main.yml in an Ansible Role? Provide a real example including a conditional dependency.
ANSWER
meta/main.yml serves two purposes: Galaxy metadata (author, description, license, supported platforms, tags) and role dependencies. Dependencies are the operationally important part.
When a playbook calls a role, Ansible reads meta/main.yml, resolves all dependencies, and runs them before the dependent role, automatically and regardless of playbook order. Dependencies are deduplicated within a play: if Role B is a dependency of both Role A and Role C and is called with the same parameters, it runs once.
Example with conditional dependency:
```yaml
# meta/main.yml
dependencies:
  - role: io.thecodeforge.common
    vars:
      common_ntp_servers:
        - 169.254.169.123
  - role: io.thecodeforge.firewall
    vars:
      firewall_allow_ports: [5432]
    # Applies only on RedHat-family hosts.
    when: ansible_os_family == 'RedHat'
    # firewall managed differently on Debian: ufw is handled by common role
```
The operational risk to know: circular dependencies. Ansible does not produce a valid order for a cycle; current versions typically abort role loading with a recursion-loop error, and the cycle itself is the real defect. Test dependency graphs with ansible-playbook -vvv to see the resolution order. Avoid circular dependencies entirely; they're always a sign the role boundaries are wrong.
Q03 of 06 · SENIOR
How does import_role differ from include_role? Describe a production bug caused by choosing the wrong one.
ANSWER
import_role is static — processed at playbook parse time before any tasks execute. All tasks, handlers, and variables from the role are loaded into the play immediately. Tags and when conditions on the import_role task apply to every individual task inside the role.
include_role is dynamic — processed at runtime when the task queue reaches that line. Tags and when conditions on the include_role task apply only to the inclusion itself, not to the tasks inside the role. This means you can use include_role in loops and with runtime-computed conditions.
Production bug from using include_role when import_role was needed: a task outside the role notifies a handler that is defined inside a role loaded with include_role. Handlers from a dynamically included role are not registered until the include actually executes, so at the moment of the notify there may be nothing to match. Depending on the Ansible version and the error_on_missing_handler setting, the result is either a missing-handler error or a task that reports 'changed' while nothing restarts. The silent variant has the same symptoms as a handler name typo: service not reloaded, no error.
The rule: use import_role for roles that are always needed and whose handlers must work. Use include_role when the role is conditional, used in a loop, or applied with different variables per iteration. When in doubt, import_role is safer because its behavior is more predictable.
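The two forms side by side, as a minimal sketch. The role names, the monitoring_roles list, and the enable_monitoring flag are hypothetical.

```yaml
# playbook excerpt (role names, monitoring_roles, and enable_monitoring are hypothetical)
- name: Configure web tier
  hosts: webservers
  tasks:
    - name: Always apply nginx (static, so the tag reaches every task in the role)
      ansible.builtin.import_role:
        name: nginx
      tags: [nginx]

    - name: Apply one monitoring role per list entry, decided at runtime
      ansible.builtin.include_role:
        name: "{{ item }}"
      loop: "{{ monitoring_roles | default([]) }}"
      when: enable_monitoring | default(false)
```

The loop is only possible with include_role; import_role is resolved at parse time and cannot iterate.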
Q04 of 06 · SENIOR
Explain the DRY principle in the context of Ansible. How do roles facilitate it better than include_tasks?
ANSWER
DRY means any piece of logic exists in exactly one place. When it changes, it changes in one place and the change propagates to all consumers automatically.
include_tasks reuses a YAML file of tasks, but only the tasks. It doesn't package the handlers that those tasks depend on, the templates those tasks render, the default variables those tasks reference, or the dependencies that must run first. A team using include_tasks for PostgreSQL setup still has to copy handlers, templates, defaults, and dependency declarations into every playbook that uses it. Three playbooks means three copies. That's three places to apply every bug fix.
A role bundles the complete automation unit: tasks, handlers, templates, defaults, vars, files, and dependencies. A team using the postgresql role references it in one line. When a security configuration bug is fixed in the role, every consumer gets the fix by bumping the version in requirements.yml.
The practical difference: include_tasks is for sharing a file. Roles are for sharing a capability. If you find yourself copying the same handlers and templates alongside an include_tasks call, you've built a role manually — put it in a role directory and make it official.
Q05 of 06 · SENIOR
How would you design a CI/CD pipeline to test an Ansible role independently? Walk through the Molecule configuration and what each stage verifies.
ANSWER
The pipeline has four stages:
1. Lint: ansible-lint and yamllint catch style violations, deprecated modules, and YAML syntax errors. Fast — fails in seconds. Run on every commit.
2. Syntax check: ansible-playbook --syntax-check on the Molecule converge playbook. Catches YAML and playbook structure errors without running anything; it does not detect undefined variables, which only surface at runtime.
3. Molecule default scenario: spins up a Docker container, runs the role with default variable values, then runs the role a second time and fails if any task reports 'changed' (the idempotence step), and finally runs verify.yml assertions (service running, port listening, config file present). The idempotence step is the idempotency gate.
4. Molecule custom_paths scenario: same flow but with non-default values for every variable in defaults/main.yml. A failure here that doesn't appear in the default scenario means a hardcoded value in tasks/main.yml.
For shared roles, add platform scenarios: molecule test --scenario-name ubuntu_22 and molecule test --scenario-name rhel_9.
GitLab CI configuration:
```yaml
# .gitlab-ci.yml
ansible_role_test:
  stage: test
  image: quay.io/ansible/community-ansible-dev-tools:latest
  script:
    # Lint first so style and syntax problems fail fast.
    - ansible-lint roles/io.thecodeforge.postgresql/
    - molecule test --scenario-name default
    - molecule test --scenario-name custom_paths
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  artifacts:
    when: on_failure
    paths: ['.molecule/']
    expire_in: 3 days
```
No role change merges without all four stages green. This is the hard gate.
Q06 of 06 · SENIOR
When would you use vars_prompt in a playbook instead of defining variables in a role's defaults directory?
ANSWER
vars_prompt prompts the user interactively at runtime. The use cases are narrow and specific.
Appropriate: disaster recovery playbooks where a human must explicitly confirm a destructive action before it runs — 'Type DESTROY to drop the production database and restore from backup.' The prompt forces a human decision point that automation cannot bypass. One-time credentials that should never be stored anywhere, not even in Vault — temporary tokens, emergency access passwords.
Inappropriate: anything that runs in CI/CD or on a cron. vars_prompt blocks indefinitely when there's no interactive terminal. The pipeline hangs, times out after the runner's maximum job duration, and produces no useful error message. This is a common mistake when a playbook that worked interactively is added to automation.
For 99% of production use cases: encrypted variables in Vault, passed via --vault-password-file from a CI secret. This is auditable, automatable, and secure. vars_prompt is for the 1% — emergency runbooks where a human is physically present and the interaction is intentional.
If you're tempted to use vars_prompt for a regularly run playbook, the real question is why the value isn't in Vault. Put it there.
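For the emergency-runbook case, a confirmation prompt can be sketched like this. The playbook name, host group, and variable name confirm_destroy are hypothetical; the prompt text is the one quoted above.

```yaml
# restore_db.yml (hypothetical disaster-recovery playbook)
- name: Restore production database from backup
  hosts: dbservers
  vars_prompt:
    - name: confirm_destroy
      prompt: "Type DESTROY to drop the production database and restore from backup"
      private: false  # echo the typed text so the operator can see it
  tasks:
    - name: Stop unless the operator typed the confirmation word
      ansible.builtin.assert:
        that:
          - confirm_destroy == "DESTROY"
        fail_msg: "Confirmation not given; nothing was changed."
```

The destructive tasks would follow the assert, so a mistyped confirmation ends the play before anything is touched.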
Frequently Asked Questions
01
What's the difference between defaults/main.yml and vars/main.yml in an Ansible role?
defaults/main.yml has the lowest variable precedence in all of Ansible. Every other variable source, including inventory file variables, group_vars, host_vars, the playbook vars block, and extra vars (-e), overrides it. This makes it the right place for values you expect callers to customize: ports, file paths, package versions, feature flags, usernames. vars/main.yml has much higher precedence: neither group_vars nor host_vars can override it, and only sources ranked above it, such as set_fact, include_vars, task-level vars, and -e, can. This makes vars/ appropriate only for internal role constants that must not change: OS-specific package names, internal service identifiers, fixed file permissions. Putting overridable values in vars/ is the most common reason roles are inflexible: operators set the value in group_vars and nothing happens because vars/ is silently winning.
02
How do Ansible role dependencies work, and what happens with circular dependencies?
Dependencies are declared in meta/main.yml under the dependencies key. When a playbook calls a role, Ansible reads its meta/main.yml, resolves all dependencies, and runs them before the dependent role automatically, without any explicit task in the playbook. Dependencies are deduplicated within a play: if multiple roles depend on the same role with the same parameters, it runs once, unless that role sets allow_duplicates: true. Circular dependencies (Role A depends on Role B, Role B depends on Role A) are not resolved into a valid order; current Ansible versions typically abort role loading with a recursion-loop error. Test dependency graphs with ansible-playbook -vvv to see the resolution order. Avoid circular dependencies entirely by redesigning the role boundaries.
03
How do you share roles across multiple projects and teams?
Three patterns, in order of organizational maturity: First, requirements.yml with Git source — list roles by Git URL and tag in requirements.yml, run ansible-galaxy install -r requirements.yml in CI before running playbooks. Each project pins the version it needs. Second, private Automation Hub or Pulp — an internal Galaxy server where teams publish versioned roles. Callers install via ansible-galaxy with the internal server URL. Provides access control, download metrics, and deprecation management. Third, public Ansible Galaxy — appropriate for generic infrastructure roles (common OS hardening, standard monitoring agents) that have no proprietary configuration. The non-negotiable practice across all three: version every role with Git tags and pin specific versions in requirements.yml. Never reference a Git branch or 'latest' — a breaking change in the role will silently break every consumer on the next CI run.
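The first pattern, a Git-sourced role pinned to a tag, could look like the following. The repository URL and version are placeholders; the role name is the one used in this article's CI example.

```yaml
# requirements.yml (URL and version are placeholders)
roles:
  - name: io.thecodeforge.postgresql
    src: https://git.example.com/infra/ansible-role-postgresql.git
    scm: git
    version: "2.3.1"  # a Git tag, never a branch
```

Installing with ansible-galaxy install -r requirements.yml then fetches exactly that tag, and upgrades happen only when the pinned version is bumped.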
04
What is a God role and why is it a problem?
A God role is a single Ansible role that configures multiple unrelated services — a server_setup role that installs Docker, PostgreSQL, Nginx, and a monitoring agent in one role. The problems compound over time: teams that need only PostgreSQL must accept Docker and Nginx anyway. Testing requires a full-stack environment. A change to the Nginx section risks breaking the PostgreSQL section. New engineers are afraid to modify it. Upstream consumers can't use partial functionality. The role accumulates when: conditions to skip sections that don't apply to certain callers — which is the signal that it's really multiple roles pretending to be one. The fix is always decomposition: one role per service, composed in the playbook. The playbook becomes readable documentation of which services a host runs. Each role becomes independently testable, independently versioned, and independently useful.
05
How does Molecule verify that a role is idempotent?
Molecule's default test sequence runs the role twice. The first run (converge) applies the role. The second run (the idempotence step) runs the exact same role again and fails the test if any task reports 'changed'. The verify step then checks the resulting state with the assertions in verify.yml. A fully idempotent role shows zero 'changed' tasks on the second run. If any task reports 'changed' on the second run, Molecule prints the task name and fails the CI job before the role reaches any environment. The most common causes of idempotency failures that Molecule catches: Jinja2 templates containing timestamps or dynamic values that differ between renders, shell module tasks running unconditionally, and file tasks that don't properly check existing content. This automated check catches idempotency bugs at code review time rather than weeks later when a production cron job starts reporting unexpected changes.
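The state assertions live in the scenario's verify playbook. A minimal sketch, assuming a PostgreSQL role; the config path and port are illustrative and should match whatever the role actually deploys.

```yaml
# molecule/default/verify.yml (path and port are illustrative)
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Look for the rendered config file
      ansible.builtin.stat:
        path: /etc/postgresql/postgresql.conf
      register: pg_conf

    - name: Assert the config file exists
      ansible.builtin.assert:
        that:
          - pg_conf.stat.exists

    - name: Wait briefly for the service port to accept connections
      ansible.builtin.wait_for:
        port: 5432
        timeout: 10
```

These checks complement the idempotence step: one proves the role converges to the right state, the other proves it stays there.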