Ansible Linting & Molecule Testing: 5 Production Incidents That Forced Us to Get It Right
Master ansible-lint rules, yamllint, Molecule with Docker driver, Testinfra tests, and CI/CD integration.
20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.
Run ansible-lint --profile production to enforce production-level rules; profiles define severity thresholds.
Use yamllint -c .yamllint to enforce YAML style consistency; common rules: line-length, indentation, truthy.
Initialize a Molecule scenario with molecule init scenario --driver-name docker (older releases also offered molecule init role); the default test sequence includes create, converge, verify, and destroy.
Write Testinfra tests in molecule/default/tests/test_default.py, or Ansible checks in molecule/default/verify.yml; in Testinfra, use host.run() for command checks and host.file() for file assertions.
Integrate lint and test in CI with ansible-lint . && molecule test; use --parallel when several scenarios run at the same time.
Common lint violations: name[missing] (tasks need names), no-changed-when (commands need changed_when), fqcn-builtins (use fully qualified collection names).
Always pin molecule and ansible-lint versions in requirements.txt to avoid breaking changes.
Use molecule converge to debug without full destroy cycle; molecule verify runs tests against running instances.
Imagine you're a chef writing a complex recipe for a team of cooks. You've got ingredients (variables), steps (tasks), and special tools (modules). But if your recipe has typos, missing steps, or inconsistent formatting, the dish will fail — maybe even start a kitchen fire. Linting is like having a sous chef who checks your recipe for common mistakes before anyone touches it. They flag missing 'cook for 10 minutes' instructions (changed_when) or secret ingredients you forgot to list (variables). Testing with Molecule is like doing a trial run in a practice kitchen. You set up a small, disposable kitchen (Docker container), follow your recipe step by step, and taste the result (verify tests). If something's off, you fix the recipe, not the real kitchen. This saves your restaurant from serving burnt food to customers — or in our case, deploying broken configs to production servers.
A few years ago, I was on-call for a major e-commerce platform. At 2 AM, our monitoring screamed: all checkout servers were returning 503s. I SSH'd in and found that a recent Ansible rollout had left the nginx config in a broken state — a missing semicolon in a template. The playbook ran without errors because the template module succeeded; it just rendered invalid config. We had no linting, no testing. That night, I vowed to never let that happen again. This article is the result of that incident and many more: the time ansible-lint saved us from a privilege escalation bug, the time Molecule caught a Docker image mismatch before prod deploy, the time a yamllint rule forced us to standardize YAML style across 50+ roles.
Historically, Ansible grew fast without a strong testing culture. Roles were shared as tarballs, and 'it works on my machine' was the norm. The community responded with ansible-lint (a static analyzer) and Molecule (a testing framework). These tools matured alongside Ansible itself, with ansible-lint now supporting profiles (production, safety, etc.) and Molecule integrating with Docker, Podman, Vagrant, and cloud drivers.
This article covers the complete pipeline: linting your Ansible code with ansible-lint and yamllint, testing roles with Molecule and the Docker driver, writing effective Testinfra and Ansible verify tests, and integrating everything into CI/CD. I'll share real production incidents, exact commands, and the gotchas that will save you hours of debugging.
Why ansible-lint Profiles Matter: From 'min' to 'production'
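ansible-lint profiles are ordered rule sets, and each profile includes everything in the one before it. min only checks that files load and parse. basic adds common correctness and style rules, including name[missing]. moderate tightens naming and readability conventions. safety adds rules about risky behavior, such as unset file permissions. shared targets content you publish for other teams and includes no-changed-when. production is the strictest: it includes every earlier profile plus rules such as fully qualified collection names. As far as I know, running without a profile applies the full rule set, so picking a lower profile is mainly a way to stage adoption.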
To use a profile, create .ansible-lint:

```yaml
---
profile: production
```

Or pass it via the CLI: ansible-lint --profile production . (the trailing dot is the path to lint).
In production, we use the production profile in CI and moderate locally to avoid overwhelming devs. A common gotcha: the production profile requires fully qualified collection names (ansible.builtin.copy instead of copy). Older ansible-lint releases report this as fqcn-builtins; newer ones, as far as I know, report it under the fqcn rule (for example fqcn[action-core]). Migrating legacy roles is painful but prevents namespace conflicts.
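To get a feel for what an FQCN migration touches, the rewrite can be sketched on plain task dictionaries. This is a minimal illustration, not how ansible-lint implements the rule; to_fqcn is a hypothetical helper and BUILTIN_MODULES is a small hand-picked subset chosen for the example.

```python
# Hypothetical helper: qualify short builtin module names in a task dict.
# BUILTIN_MODULES is an illustrative subset, not the full ansible.builtin list.
BUILTIN_MODULES = {"copy", "template", "command", "shell", "service", "file", "apt"}


def to_fqcn(task: dict) -> dict:
    """Return a copy of the task with short builtin module names qualified."""
    fixed = {}
    for key, value in task.items():
        if key in BUILTIN_MODULES:
            fixed[f"ansible.builtin.{key}"] = value
        else:
            # Task keywords (name, when, ...) and already-qualified modules pass through.
            fixed[key] = value
    return fixed
```

Calling to_fqcn on a task that uses copy returns a new dictionary keyed by ansible.builtin.copy and leaves the input untouched, which makes it easy to diff before and after.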
Example violation: name[missing]. Every task must have a name. This is not just cosmetic: named tasks appear in play output and make --step and --start-at-task usable. ansible-lint --fix can repair some violations automatically, but to my knowledge it does not invent task names, so write those by hand.
Another critical rule: no-changed-when — commands must have changed_when to avoid reporting 'changed' every run. Without it, a command module always shows 'changed', breaking idempotency.
```yaml
- name: Validate nginx configuration
  ansible.builtin.command: nginx -t
  changed_when: false  # read-only check; for commands that do change state, use a condition
```
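The two rules above boil down to simple checks on each task. The sketch below mimics them on task dictionaries; lint_task is a hypothetical function written for illustration, and it deliberately ignores details the real rules handle (for example, command tasks that use creates or removes).

```python
# Module names that the simplified no-changed-when check applies to.
COMMAND_MODULES = {
    "command", "shell",
    "ansible.builtin.command", "ansible.builtin.shell",
}


def lint_task(task: dict) -> list[str]:
    """Return the IDs of the (simplified) rules this task violates."""
    problems = []
    if not task.get("name"):
        problems.append("name[missing]")
    uses_command = any(key in COMMAND_MODULES for key in task)
    if uses_command and "changed_when" not in task:
        problems.append("no-changed-when")
    return problems
```

A bare {"command": "uptime"} task trips both checks, while the same task with a name and changed_when: false passes.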
Run ansible-lint --list-rules to see all rules. To see violations without failing CI, list the rule IDs under warn_list in .ansible-lint.

A task used shell instead of command because a dev didn't know the difference. The production profile flagged no-changed-when and command-instead-of-shell. We fixed it, but the real win was catching fqcn-builtins: our CI was using a community collection that shadowed builtins. That would have been a nightmare to debug.

Use the production profile in CI; it enforces the strictest rules, which prevent real production issues like missing changed_when and non-FQCN modules.

yamllint: The Unsung Hero of Ansible Maintainability
YAML is notoriously error-prone. A missing space or wrong indentation can break a playbook silently. yamllint catches these. Install: pip install yamllint. Create .yamllint at the repo root:

```yaml
---
extends: default
rules:
  line-length:
    max: 120
  indentation:
    spaces: 2
    indent-sequences: consistent
  truthy:
    allowed-values: ['true', 'false', 'yes', 'no']
```

Run: yamllint .
Common violations: line-length (default 80 is too short for Ansible tasks), truthy (YAML interprets 'on' as true), trailing-spaces (invisible but breaks diffs).
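To make the line-length and trailing-spaces rules concrete, here is a rough stand-in for what they look at. It is a sketch only: check_lines is a hypothetical function, it works on raw text rather than parsed YAML, and yamllint's real rules have more options.

```python
def check_lines(text: str, max_length: int = 120) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) pairs for two simple style problems."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_length:
            findings.append((number, "line-length"))
        if line != line.rstrip():
            # Whitespace after the last visible character.
            findings.append((number, "trailing-spaces"))
    return findings
```

The max_length default of 120 mirrors the .yamllint example above rather than yamllint's own default of 80.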
In production, we enforce yamllint in pre-commit hooks. A real incident: a junior engineer committed a playbook with mixed tabs and spaces. The playbook worked on their machine but failed in CI because the CI runner's YAML parser was stricter. yamllint caught it instantly.
Another gotcha: truthy rule flags true vs True. Stick to lowercase true/false. Use allowed-values: ['true', 'false'] to enforce.
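The truthy rule is essentially an allow-list over the words YAML 1.1 parsers may read as booleans. A minimal sketch, assuming the scalar values have already been extracted as strings; flag_truthy and TRUTHY_WORDS are hypothetical names for this example.

```python
# Words that YAML 1.1 parsers may interpret as booleans (compared case-insensitively).
TRUTHY_WORDS = {"yes", "no", "on", "off", "true", "false", "y", "n"}


def flag_truthy(values, allowed=("true", "false")):
    """Return the values that look boolean but are not spelled in an allowed way."""
    return [v for v in values if v.lower() in TRUTHY_WORDS and v not in allowed]
```

With the strict allow-list, True and yes are flagged, true is accepted, and ordinary strings such as nginx are ignored.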
For Ansible specifically, add ignore: | for files like .travis.yml or inventory that may not follow Ansible YAML conventions.
Example .pre-commit-config.yaml:

```yaml
repos:
  - repo: https://github.com/adrienverge/yamllint.git
    rev: v1.32.0
    hooks:
      - id: yamllint
```

A playbook used a YAML merge key (<<: *base) that worked locally but caused a parsing error in our CI's older PyYAML. yamllint didn't catch it because anchors are valid YAML. We had to pin the PyYAML version. Lesson: yamllint checks syntax, not semantics.

Molecule with Docker Driver: Setting Up for Role Testing
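Molecule's Docker driver is the go-to for CI testing. On Molecule releases before 6.0, initialize a new role with molecule init role --driver-name docker myrole. Molecule 6 removed the init role subcommand, so there you create the role with ansible-galaxy role init myrole and then run molecule init scenario --driver-name docker inside it. A Docker scenario's molecule/default/molecule.yml looks like this:

```yaml
---
driver:
  name: docker
provisioner:
  name: ansible
platforms:
  - name: instance
    image: geerlingguy/docker-ubuntu2204-ansible:latest
    pre_build_image: true
```

Key configs:
- pre_build_image: true uses the image as-is instead of building one from Molecule's Dockerfile template, which speeds up tests.
- image should include systemd or another init system if you are testing services.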
Lifecycle commands:
- molecule create: pulls the image and starts the container.
- molecule converge: runs the role against the container.
- molecule verify: runs tests (Testinfra or Ansible).
- molecule destroy: tears down the container.
- molecule test: runs the full sequence end to end, including these four steps and an idempotence check.
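For testing multiple OSes, define multiple platforms:

```yaml
platforms:
  - name: ubuntu
    image: geerlingguy/docker-ubuntu2204-ansible:latest
    pre_build_image: true
  - name: centos
    image: geerlingguy/docker-centos9-ansible:latest
    pre_build_image: true
```

Ansible converges all platforms in a scenario concurrently, like any multi-host play. The --parallel flag is for running several molecule test invocations (for example one per scenario) at the same time without them clashing.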
Common gotcha: Docker driver requires Docker installed and the user in docker group. Also, pre_build_image: true images must have Ansible installed. Use geerlingguy/docker-*-ansible images — they are maintained and include Python and systemd.
For services, use an image that ships systemd (e.g., geerlingguy/docker-ubuntu2204-ansible). Add privileged: true and volumes: /sys/fs/cgroup:/sys/fs/cgroup:ro to the platform config in molecule.yml.

We started with ubuntu:22.04 without Ansible pre-installed and wasted hours debugging 'python not found'. Switching to geerlingguy/docker-ubuntu2204-ansible:latest fixed it. Also, we forgot pre_build_image: true and molecule rebuilt the image every run: 5 minutes per test.

Use pre_build_image: true to keep test runs under 30 seconds.

Writing Testinfra Verify Tests: Beyond Simple Assertions
Testinfra is a Python framework for testing server state. With verifier: name: testinfra set in molecule.yml, Molecule runs the tests in molecule/default/tests/test_default.py. Example:

```python
def test_nginx_running_and_enabled(host):
    # The service should be active now and enabled at boot.
    nginx = host.service("nginx")
    assert nginx.is_running
    assert nginx.is_enabled


def test_nginx_config(host):
    # Check content, not just existence.
    config = host.file("/etc/nginx/nginx.conf")
    assert config.exists
    assert config.contains("worker_processes auto")


def test_nginx_listening(host):
    socket = host.socket("tcp://0.0.0.0:80")
    assert socket.is_listening
```

Run with: molecule verify.
More Testinfra techniques:
- Use host.ansible("debug", "var=nginx_config") to run ad-hoc Ansible modules.
- Parametrize tests to cover several inputs with one function:

```python
import pytest


@pytest.mark.parametrize("pkg", ["nginx", "python3"])
def test_packages(host, pkg):
    assert host.package(pkg).is_installed
```

- Use host.run("curl -s http://localhost") to test HTTP endpoints.
Common mistakes: forgetting to import pytest when you use its decorators, and misnaming tests. Test functions must start with test_ and take the host fixture as an argument.
For Ansible-based verify (instead of Testinfra), set verifier: name: ansible in molecule.yml and create a verify.yml playbook. This is simpler for Ansible devs but less flexible.
Use molecule login to get a shell in the container and inspect state by hand. If pytest and the test files are available there (Molecule normally runs Testinfra from the controller, so this is an assumption about your setup), run pytest -v /root/tests/test_default.py to see detailed output.

One of our tests asserted host.file("/etc/nginx/nginx.conf").contains("ssl_certificate"). It passed locally but failed in CI because the CI container had a different nginx config template. We had to parameterize the test to check for OS-specific paths.

Ansible Verify Tests: When Testinfra Is Overkill
Molecule supports Ansible as a verifier. In molecule.yml, set verifier: name: ansible. Then create molecule/default/verify.yml:

```yaml
---
- name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Check nginx is running
      ansible.builtin.command: systemctl is-active nginx
      changed_when: false
      register: nginx_status
      failed_when: nginx_status.stdout != "active"
```

Run molecule verify.
Pros: Familiar syntax for Ansible users; reuses modules. Cons: Slower than Testinfra; less expressive.
Use Ansible verify when you need to run complex module checks (e.g., the uri module to test an HTTP response). Example:

```yaml
- name: Test web server response
  ansible.builtin.uri:
    url: http://localhost
    return_content: true
  register: result
  failed_when: '"Welcome" not in result.content'
```
A gotcha: failed_when with register requires careful syntax. Use the assert module for clarity:

```yaml
- name: Assert nginx is active
  ansible.builtin.assert:
    that:
      - nginx_status.stdout == "active"
```
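An HTTP check like the uri task can run before a slow application is ready. Ansible offers until, retries, and delay for that; the same retry idea is sketched here in Python so the control flow is easy to see. wait_until is a hypothetical helper, and the injectable sleep parameter exists only to make it testable.

```python
import time


def wait_until(check, retries=10, delay=3.0, sleep=time.sleep):
    """Call check() until it returns True; return the attempt number that succeeded."""
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            # Wait between attempts, but not after the last one.
            sleep(delay)
    raise TimeoutError(f"condition not met after {retries} attempts")
```

Compared with a fixed pause, a retry loop returns as soon as the service answers and still fails clearly when it never does.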
Set gather_facts: false in verify.yml to speed things up. Testinfra is generally faster for simple checks.

For an HTTP check, the uri module was straightforward. But we hit a timeout because the app took 30 seconds to start. We added a 30-second pause before the test; not ideal, but it worked. Retrying the uri task with until, retries, and delay is a cleaner option.

Integrating Lint and Molecule in CI/CD: A Production Pipeline
A robust CI pipeline runs lint first, then test. Example GitHub Actions workflow:

```yaml
name: Ansible CI
on: [push, pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install ansible-lint yamllint
      - run: ansible-lint --profile production .
      - run: yamllint .
  test:
    needs: lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install molecule molecule-docker docker
      - run: molecule test
```

Key points:
- Keep lint and test as separate jobs. needs: lint makes the slower test job wait for lint, so broken code fails fast.
- Pin tool versions in requirements.txt:

```
ansible-lint==6.22.1
molecule==6.0.3
molecule-docker==2.1.0
yamllint==1.32.0
```

- Use molecule test --parallel when several scenarios run at the same time.
- Pre-pull Docker images to speed things up: docker pull geerlingguy/docker-ubuntu2204-ansible:latest before molecule.
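Pinning is easy to let slip, so a small check in CI can help. The sketch below flags requirement lines without an exact == pin; unpinned_requirements is a hypothetical helper and only understands the simple name==version form used above.

```python
def unpinned_requirements(text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blanks and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned
```

Feeding it the contents of requirements.txt and failing the job when the result is non-empty is one way to enforce the rule.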
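Common CI gotchas:
- Docker socket not mounted: add -v /var/run/docker.sock:/var/run/docker.sock if Molecule itself runs inside a container.
- Python version: Molecule requires Python 3.8+ (newer releases raise this floor, so check the version you pin). Use actions/setup-python@v4.
- Resource limits: Molecule with multiple platforms can consume more than 2 GB of RAM. Use a larger runner (for example runs-on: ubuntu-latest-8-cores, if your organization has one with that label) or use --parallel carefully.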
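The lint-then-test ordering can also be reproduced locally with a few lines of Python. This is a sketch under assumptions: run_pipeline and STEPS are hypothetical names, and the run parameter is injectable so the logic can be exercised without the tools installed.

```python
import subprocess

# Commands in the order the pipeline should run them.
STEPS = [
    ["yamllint", "."],
    ["ansible-lint", "--profile", "production", "."],
    ["molecule", "test"],
]


def run_pipeline(steps=STEPS, run=subprocess.call):
    """Run each step in order; return the first failing command, or None."""
    for command in steps:
        if run(command) != 0:
            # Stop at the first failure so slow steps never run on broken code.
            return command
    return None
```

Calling run_pipeline() from a pre-push hook gives the same fail-fast behavior as the needs: lint dependency in the workflow.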
Add a .pre-commit-config.yaml with hooks for both tools.

We used to run molecule test without lint first. A PR with a syntax error (missing colon) passed lint locally because the dev used a different yamllint config. The molecule create step failed with a cryptic YAML error. We added lint as a required job before test and enforced the same config via .ansible-lint and .yamllint in the repo.

Common Lint Violations and Why They Matter in Production
Here are the most common ansible-lint violations and their production impact:
- name[missing]: Tasks without names make playbooks hard to read and make --step mode hard to follow. In a production outage, you need to know exactly which task is failing. Named tasks appear in logs and alerting.
- no-changed-when: Commands without changed_when always report 'changed'. This breaks idempotency checks: you can't trust the 'changed' count in Ansible Tower or AWX.
- fqcn-builtins: Using copy instead of ansible.builtin.copy can lead to module shadowing if a collection defines a copy module. In production, this caused a role to use a community copy module that behaved differently, corrupting files.
- command-instead-of-shell: Using shell when command suffices introduces shell injection risks. In a production incident, a shell task with user input allowed command injection, leading to a security breach.
- risky-file-permissions: Leaving mode unset, or setting something like mode: 0777, is a security risk. Use an explicit mode such as '0644' or '0750'.
- no-loop: Using with_items instead of loop is discouraged (check ansible-lint --list-rules for the exact rule ID in your version). Not a security issue, but it prevents using loop_control features.
To fix these, run ansible-lint --fix . (only fixes some). Review each violation manually.
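When reviewing permission findings by hand, the question is usually whether a mode grants write access to everyone. A minimal sketch of that test; is_world_writable is a hypothetical helper, and the real risky-file-permissions rule is mostly concerned with modes that are not set at all.

```python
def is_world_writable(mode: str) -> bool:
    """True if an octal mode string such as '0777' grants write access to others."""
    return bool(int(mode, 8) & 0o002)
```

For example, '0777' and '0666' are world-writable, while '0644' and '0750' are not.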
Rules carry a severity: very high, high, medium, low. Profiles are defined as explicit rule lists rather than by severity, so use ansible-lint --list-profiles to see what production covers.

A fqcn-builtins violation once blocked a deployment because a role used service instead of ansible.builtin.service. The community.general collection was installed and its service module required extra parameters. We spent an hour debugging before ansible-lint flagged it.

Molecule Scenarios: Testing Multiple Configurations
Molecule scenarios allow testing different configurations (e.g., different OS, different variables). Create molecule/ubuntu/molecule.yml and molecule/centos/molecule.yml. Run molecule test --scenario-name ubuntu.
Scenario structure:

```
molecule/
  default/
    molecule.yml
    tests/
      test_default.py
  ubuntu/
    molecule.yml
    tests/
      test_ubuntu.py
```

Each scenario can have its own group_vars or host_vars.
For variable testing, override variables in the scenario's molecule.yml:

```yaml
provisioner:
  name: ansible
  inventory:
    group_vars:
      all:
        nginx_port: 8080
```

Then write tests that check port 8080.
Common use case: test the same role with different variables (e.g., ssl_enabled: true vs false). Create molecule/ssl-enabled and molecule/ssl-disabled scenarios.
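Gotcha: as far as I know, the Testinfra verifier looks for tests in the scenario's own tests/ directory by default, so scenarios do not share tests automatically. To control this, set the verifier's directory option, for example verifier: name: testinfra with directory: molecule/ubuntu/tests/, either to point at scenario-specific tests or to reuse one suite across scenarios.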
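With several scenarios in a role, it helps to generate the test commands instead of listing them by hand. A small sketch, assuming the standard molecule/<scenario>/molecule.yml layout; scenario_commands is a hypothetical helper.

```python
from pathlib import Path


def scenario_commands(role_dir) -> list[list[str]]:
    """Build one 'molecule test' command per scenario found under role_dir."""
    commands = []
    # A directory counts as a scenario only if it contains molecule.yml.
    for config in sorted(Path(role_dir).glob("molecule/*/molecule.yml")):
        commands.append(["molecule", "test", "--scenario-name", config.parent.name])
    return commands
```

Each returned list can be passed to subprocess.run, or printed to build a CI matrix.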
Use molecule/default as the base and override only what changes in other scenarios. This reduces duplication.

Our CentOS scenario caught a role that assumed apache2 where CentOS uses httpd instead. Without it, the role would have broken on CentOS in production.

Advanced Molecule: Custom Drivers, Dependencies, and Pre/Post Tasks
Molecule supports drivers beyond Docker: Podman, Vagrant, EC2, GCE, Azure, and more. For production, we use the delegated driver to test against existing VMs. Example molecule.yml:

```yaml
driver:
  name: delegated
platforms:
  - name: my-prod-like-vm
    host: 10.0.0.1
```

But Docker is preferred for CI.
Dependencies are roles or collections required by your role. Define them in molecule.yml:

```yaml
dependency:
  name: galaxy
  options:
    requirements-file: requirements.yml
```

Create requirements.yml:

```yaml
---
roles:
  - src: geerlingguy.nginx
collections:
  - name: community.general
```

Run molecule dependency to install them.
Pre- and post-tasks allow running tasks before and after converge. They are useful for preparing the environment (e.g., installing packages). Example:

```yaml
provisioner:
  name: ansible
  playbooks:
    prepare: prepare.yml
    cleanup: cleanup.yml
```

prepare.yml runs before converge; cleanup.yml runs before destroy.
A production pattern: use prepare.yml to set up test fixtures (e.g., create a dummy database) and cleanup.yml to remove them.
Point ANSIBLE_ROLES_PATH to a shared cache so dependencies are reused between runs. Use ANSIBLE_ROLES_PATH=~/.ansible/roles to reuse.

… requirements.yml.

Debugging Molecule Failures: A Systematic Approach
When molecule test fails, follow this process:
- Check the log: molecule --debug test prints full Ansible output (--debug is a global option, so it goes before the subcommand). Look for fatal: lines.
- Isolate the step: run molecule create, molecule converge, molecule verify separately. If converge fails, the role has a bug. If verify fails, either the test or the resulting state is wrong.
- Log in to the container: molecule login gives a shell. Run commands manually to verify state.
- Check Docker logs: docker logs <container_name> if the container exits immediately.
- Test with a simple playbook: create a minimal playbook to rule out role issues.

Typical causes by phase:
- molecule create fails: Docker image not found or Docker daemon not running.
- molecule converge fails: Ansible syntax error or missing variable.
- molecule verify fails: test assertion fails or test file syntax error.
Example: fatal: [instance]: FAILED! => {"msg": "The task includes an option with an undefined variable."}. The variable is missing from the role's defaults or the playbook. Check molecule/default/group_vars/all/.
Another: ERROR! 'community.general' is not installed. Run molecule dependency to install collections.
For Testinfra errors, get a full traceback with pytest --tb=long.
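Long debug output is easier to triage if the fatal: lines are pulled out first. A minimal sketch, assuming plain (uncolored) Ansible output saved to a string; fatal_lines is a hypothetical helper.

```python
def fatal_lines(log_text: str) -> list[str]:
    """Return the lines of Ansible output that report a fatal task failure."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.lstrip().startswith("fatal:")
    ]
```

Piping the output of a failed run through this gives a short list of failing hosts and messages to start from.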
Molecule keeps per-scenario state under ~/.cache/molecule/<role>/<scenario>/; if a logs/ directory is present there, check converge.log and verify.log for detailed output.

We hit a molecule converge failure that said 'no action detected in task'. The task used import_tasks with a relative path that didn't exist in the container. We fixed it by using {{ role_path }}/tasks/.

Use --debug, isolate phases, and log into the container to inspect state.

Idempotency Testing with Molecule: The 'converge twice' Pattern
Idempotency is critical for Ansible roles. Molecule tests it by converging twice: the first run applies the role; the second should report changed=0. The built-in idempotence action does exactly this, and molecule test runs it by default; you can also call molecule idempotence after molecule converge. To make the step explicit, list it in the scenario's test sequence:

```yaml
# molecule/default/molecule.yml
scenario:
  test_sequence:
    - destroy
    - create
    - converge
    - idempotence
    - verify
    - destroy
```

Testinfra cannot re-run a role: host.ansible executes a single module, and only when the Ansible connection backend is in use (which Molecule provides). What it can do is ask a module, in check mode, whether anything would change:

```python
def test_nginx_conf_is_converged(host):
    # check=True is the default: the module reports what it would do
    # without touching the file.
    result = host.ansible(
        "file",
        "path=/etc/nginx/nginx.conf state=file",
        check=True,
    )
    assert not result["changed"]
```
Common causes of idempotency failures:
- command/shell without changed_when.
- lineinfile without regexp that changes the file every run.
- template with different output every run (e.g., using date in the template).
Production tip: Run idempotency test in CI for every PR. We once had a role that added a cron job every run because cron module didn't check for duplicates. Idempotency test caught it.
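If you capture the output of the second converge yourself, the check reduces to reading the changed= counters in the PLAY RECAP. A sketch under assumptions: the output is uncolored, and changed_counts and is_idempotent are hypothetical helpers, not part of Molecule.

```python
import re

# Matches recap lines such as "instance : ok=5 changed=0 unreachable=0 failed=0".
RECAP_LINE = re.compile(
    r"^(?P<host>\S+)\s*:\s*ok=\d+\s+changed=(?P<changed>\d+)", re.MULTILINE
)


def changed_counts(output: str) -> dict[str, int]:
    """Map each host in the PLAY RECAP to its 'changed' count."""
    return {m.group("host"): int(m.group("changed")) for m in RECAP_LINE.finditer(output)}


def is_idempotent(second_run_output: str) -> bool:
    """True if a recap was found and no host reported any change."""
    counts = changed_counts(second_run_output)
    return bool(counts) and all(count == 0 for count in counts.values())
```

Treating a missing recap as a failure is a deliberate choice: an empty or truncated log should not count as proof of idempotency.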
Use pre_build_image: true and a fixed image to ensure consistent results.

A role used copy with a source file that was regenerated by a previous task. The second converge always changed the file because the source had a new timestamp. We fixed it by using template with force: no.

Keep tasks idempotent with changed_when or idempotent modules.

Advanced ansible-lint: Custom Rules and Ignoring False Positives
Sometimes ansible-lint flags code that is intentionally written that way. You can ignore rules per task or for the whole project. Per task, add a noqa comment inline:

```yaml
- name: Restart service
  ansible.builtin.command: systemctl restart nginx  # noqa: no-changed-when
```

Or project-wide in .ansible-lint:

```yaml
skip_list:
  - fqcn-builtins  # only if you have a good reason
```

Better: use warn_list to see warnings without failing:

```yaml
warn_list:
  - experimental
```
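Custom rules are Python classes that subclass AnsibleLintRule and implement a matchtask method. Example: a rule that forbids using the apt module without update_cache: yes. Create rules/no_apt_without_update.py:

```python
from ansiblelint.rules import AnsibleLintRule


class NoAptWithoutUpdate(AnsibleLintRule):
    """Flag apt tasks that do not refresh the package cache."""

    id = "custom001"
    shortdesc = "apt must have update_cache: yes"
    description = "..."
    tags = ["custom"]

    def matchtask(self, task, file=None):
        action = task["action"]
        # The module may be written short or fully qualified.
        is_apt = action["__ansible_module__"] in ("apt", "ansible.builtin.apt")
        # Returning True reports a violation for this task.
        return is_apt and not action.get("update_cache", False)
```

Add the directory to .ansible-lint:

```yaml
rulesdir:
  - ./rules
use_default_rules: true  # keep the built-in rules active alongside custom ones
```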
Production insight: We wrote a custom rule to enforce that all file tasks have owner and group set. This prevented a security incident where a file was owned by root:root instead of app:app.
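The core of an ownership rule like that is a small predicate over the task's module arguments. The sketch below shows only that logic on plain dictionaries, outside ansible-lint; missing_ownership and FILE_MODULES are hypothetical names, and the choice of file, copy, and template as the modules to check is an assumption for this example.

```python
# Modules treated as "file tasks" in this sketch (an assumption for the example).
FILE_MODULES = (
    "file", "copy", "template",
    "ansible.builtin.file", "ansible.builtin.copy", "ansible.builtin.template",
)


def missing_ownership(task: dict) -> list[str]:
    """Return which of 'owner' and 'group' a file-like task fails to set."""
    for module in FILE_MODULES:
        args = task.get(module)
        if isinstance(args, dict):
            return [key for key in ("owner", "group") if key not in args]
    # Not a file-like task, so nothing to report.
    return []
```

Inside a real rule, the same check would run in matchtask against the task's action arguments and return a message when the list is non-empty.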
Use matchtask for task-level rules, matchplay for play-level.

ansible-lint flagged risky-file-permissions on a file that needed to be world-readable. We used # noqa risky-file-permissions with a comment explaining why. This documented the exception.

Use # noqa sparingly for intentional violations; write custom rules to enforce team-specific policies.

Silent Variable Overwrite Breaks Production Config
The role shipped a vars/main.yml with a default value for db_password (e.g., 'changeme'). The playbook used include_role with vars but forgot to pass db_password, so the default was used, overwriting the vaulted value.

The fix: we marked db_password as required: true in the role's argument spec so that Ansible's role argument validation fails the play when it is not provided. We also added an assert task that fails if db_password equals 'changeme'.

- Always define argument specs for role parameters so that missing required variables fail fast.
- Never rely on defaults for secrets.
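The assert on 'changeme' generalizes to a placeholder check that can run anywhere secrets are loaded, for example in a pre-deploy script. A minimal sketch; validate_secret and the PLACEHOLDERS set are hypothetical and should be adapted to the placeholder values your roles actually use.

```python
# Values that indicate a secret was never really set (illustrative list).
PLACEHOLDERS = {"", "changeme", "password", "secret"}


def validate_secret(name: str, value) -> None:
    """Raise ValueError if a secret is unset or still a known placeholder."""
    if value is None or str(value).strip().lower() in PLACEHOLDERS:
        raise ValueError(f"{name} is unset or still a placeholder")
```

Failing loudly before the play starts is cheaper than discovering a default password on a production database.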
Quick troubleshooting:
- ansible-lint exits with code 2 but no clear error message. Run ansible-lint -v to see which rule is failing. Common: name[missing] if tasks lack names. Add name: to all tasks.
- molecule converge succeeds but molecule verify fails on a file existence check. Use molecule login to inspect the container. Check the role's tasks for the correct dest path. Add debug output.
- molecule create fails with 'Docker connection error'. Check the daemon with systemctl status docker. Ensure the user is in the docker group: sudo usermod -aG docker $USER. Re-login.
- host.file('/etc/nginx/nginx.conf').exists returns False but the file exists. Confirm which host the test is talking to with host.check_output('hostname').

Quick commands for missing task names:
- ansible-lint --list-rules | grep name
- ansible-lint --fix .
- Add name: "Describe task" to each task.
Common mistakes to avoid
Running molecule test without linting first
Not pinning tool versions
Using a lax ansible-lint profile (such as min) in CI
Not testing idempotency
Forgetting to set pre_build_image: true
Writing verify tests that only check file existence, not content
Interview Questions on This Topic
What is the difference between ansible-lint profiles 'min' and 'production'?