Ansible Error Handling: Production Patterns from a 3AM PagerDuty
Master Ansible error handling with ignore_errors, failed_when, block/rescue/always, any_errors_fatal, and max_fail_percentage.
20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.
Use ignore_errors: yes only when you truly don't care about a task's exit status; it still marks the task as 'failed' in output.
Override task failure with failed_when to define custom failure conditions (e.g., failed_when: result.rc != 0 or 'ERROR' in result.stderr).
Use changed_when to prevent false 'changed' status (e.g., changed_when: "'updated' in result.stdout" for a script that always exits 0).
Wrap critical sequences in block/rescue/always for try-catch-finally behavior; rescue runs on failure, always runs regardless.
Set any_errors_fatal: true on a play to stop the play for all hosts once any task fails on any host.
Use max_fail_percentage in rolling updates to abort if more than N% of hosts fail; e.g., serial: 5, max_fail_percentage: 20.
In rescue blocks, use ansible.builtin.include_tasks to run cleanup or notify handlers.
Always test error handling paths in CI; a misconfigured failed_when can silently swallow real failures.
Imagine you're a chef cooking a complex multi-course meal. Your recipe (the Ansible playbook) has steps like 'chop onions' and 'sear steak.' If you burn the onions (a task fails), you have a few options: you can ignore it and move on (ignore_errors), or you can decide that burnt onions are actually a failure only if they're black (failed_when). You might also want to know if the steak is 'changed' only when it's actually cooked differently (changed_when). For risky sequences, like reducing a sauce, you might use a 'try-catch' approach: try the reduction, if it fails, rescue by adding a thickener, and always clean up the pan (block/rescue/always). In a busy kitchen, if one station fails, you might want to stop the whole service (any_errors_fatal) or only if too many stations fail (max_fail_percentage). This article teaches you these patterns so your automated kitchen runs smoothly.
It was 3 AM, and my phone was buzzing with PagerDuty alerts. Our Ansible-driven deployment had taken down half the production fleet. The playbook had a task that checked for a lock file; if it existed, the task failed, and Ansible stopped the entire play. The problem? A stale lock file from a previous deployment that should have been ignored. We had ignore_errors set, but a junior engineer had commented it out during a code review, thinking it was dead code. The result: every host that had that lock file failed, and our rolling update aborted after the first batch. We lost 30 minutes of uptime. That night, I learned that error handling in Ansible isn't just about preventing failures—it's about defining what failure means for your system.
ignore_errors: When to Use and When to Avoid
The ignore_errors directive tells Ansible to continue executing tasks on a host even if the current task fails. It's a blunt instrument. Use it only for non-critical checks where you have a fallback, such as verifying that a service is running. Never use it to hide real failures: the error no longer stops the host, yet the task is still reported as 'failed' in output (with ...ignoring). Setting failed_when: false goes further and silences the task entirely, which is an anti-pattern for the same reason. Real example: checking for a lock file before deployment:
```yaml
- name: Check for deployment lock file
  ansible.builtin.stat:
    path: /var/lock/deploy.lock
  register: lock_check
  ignore_errors: yes
```
Note that ansible.builtin.stat does not fail when the file is missing; it returns stat.exists: false. So ignore_errors is redundant in this example and would only matter if stat itself errored. The point: only use ignore_errors when the task's failure is acceptable and you have subsequent logic to handle it. In production, we once had a task that stopped a service that might already be stopped; we used ignore_errors. But then a real failure (e.g., service not found) was ignored, causing a cascading issue. We switched to failed_when: result.rc != 0 and 'not running' not in result.stderr.
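The two ideas above can be sketched without ignore_errors at all. This is a minimal sketch, not the playbook from the incident: the stop script path is hypothetical, and whether a lock should abort the run or merely be reported is a policy choice for your environment.

```yaml
- name: Check for deployment lock file
  ansible.builtin.stat:
    path: /var/lock/deploy.lock
  register: lock_check

# stat succeeds whether or not the file exists, so decide explicitly what a lock means
- name: Abort if a deployment lock is present
  ansible.builtin.fail:
    msg: "Deployment lock found at /var/lock/deploy.lock"
  when: lock_check.stat.exists

# Hypothetical stop script; only the "already stopped" case is tolerated
- name: Stop service, tolerating the already-stopped case
  ansible.builtin.command: /usr/local/bin/stop_myapp.sh
  register: result
  failed_when: result.rc != 0 and 'not running' not in result.stderr
```

Compared with ignore_errors, any other non-zero exit (for example, a missing service) still fails the task.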
A similar case involved a mount command. It failed when the mount wasn't present. We used ignore_errors, but then the mount actually failed due to a bad filesystem, and we didn't notice. We changed to register the output and use failed_when.
failed_when: Defining Custom Failure Conditions
failed_when overrides Ansible's default failure detection. You provide a Jinja2 expression that evaluates to true when the task should be considered failed. This is essential for commands that return non-zero on success (e.g., grep returning 1 for no match) or for complex checks based on stdout/stderr. Syntax:
```yaml
- name: Run custom script
  ansible.builtin.shell: /usr/local/bin/check_health.sh
  register: health_result
  failed_when: health_result.rc != 0 or 'CRITICAL' in health_result.stdout
```
Common gotcha: failed_when is evaluated after the task runs. If the task fails before running (e.g., invalid parameters), failed_when is not evaluated. Also, failed_when and ignore_errors interact: failed_when decides whether the task counts as failed, and ignore_errors then decides whether that failure stops the host. With both set, a task whose failed_when is true is still reported as failed, followed by '...ignoring'. To keep a task from ever being reported as failed, set failed_when: false (though that's odd). Production tip: always test your failed_when condition with a known failure case. We once had failed_when: result.rc == 1 but the command returned 2 for a different error; we missed a failure.
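The grep case mentioned earlier makes a compact illustration. grep exits 0 on a match, 1 on no match, and 2 or higher on a real error, so only the last case should fail the task. The log path below is a hypothetical placeholder.

```yaml
- name: Look for error lines in the application log
  # Hypothetical log path
  ansible.builtin.command: grep -c ERROR /var/log/myapp/app.log
  register: grep_result
  changed_when: false
  # rc 1 means "no match", which is a valid answer, not a failure
  failed_when: grep_result.rc > 1

- name: Report when errors were found
  ansible.builtin.debug:
    msg: "Found {{ grep_result.stdout }} ERROR lines"
  when: grep_result.rc == 0
```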
Use and/or to combine conditions. Example: failed_when: (result.rc != 0) or ('ERROR' in result.stderr). A check for 'FAIL' in result.stdout once saved us from a corrupted database.
changed_when: Preventing False Changes
changed_when controls whether a task reports 'changed' or 'ok'. Most modules report 'changed' only when they actually modify state (e.g., the file module). The command and shell modules cannot know what a command did, so they report 'changed' on every successful run. For those tasks you can override the result:
```yaml
- name: Run idempotent script
  ansible.builtin.shell: /usr/local/bin/update_cache.sh
  register: cache_update
  changed_when: cache_update.rc == 0 and 'updated' in cache_update.stdout
```
If you want a task to never report changed, use changed_when: false. This is common for read-only checks. However, be careful: if a task that should change things never reports changed, you lose the audit trail. In production, we had a task that restarted a service only if a config file changed; we used changed_when: config_changed, where config_changed was a boolean derived from a registered result. This gave accurate change tracking.
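A minimal sketch of both cases follows: a read-only command pinned to 'ok', and a restart driven by a registered result. The binary, template, and service names are hypothetical placeholders, not the playbook described above.

```yaml
# Hypothetical binary; a version query never changes anything
- name: Read current app version
  ansible.builtin.command: /usr/local/bin/myapp --version
  register: app_version
  changed_when: false

# Hypothetical template and destination
- name: Deploy config file
  ansible.builtin.template:
    src: myapp.conf.j2
    dest: /etc/myapp/myapp.conf
  register: config_result

- name: Restart service only when the config actually changed
  ansible.builtin.service:
    name: myapp
    state: restarted
  when: config_result is changed
```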
In one playbook we set changed_when: false and then used a separate task to detect actual changes. This reduced noise in our deployment logs.
block/rescue/always: The Try-Catch-Finally of Ansible
The block/rescue/always pattern provides structured error handling for a group of tasks. block contains the main tasks. If any task in the block fails, the rescue block executes. The always block runs regardless of success or failure. This is perfect for cleanup operations:
```yaml
# The docker modules live in the community.docker collection, not ansible.builtin
- name: Deploy application
  block:
    - name: Pull latest image
      community.docker.docker_image:
        name: myapp:latest
        source: pull
    - name: Start container
      community.docker.docker_container:
        name: myapp
        image: myapp:latest
        state: started
  rescue:
    - name: Rollback to previous image
      community.docker.docker_image:
        name: myapp:previous
        source: pull
    - name: Notify team
      ansible.builtin.uri:
        url: https://hooks.slack.com/services/...
        method: POST
        body_format: json
        body: '{"text":"Deployment failed, rolled back"}'
  always:
    - name: Clean up temp files
      ansible.builtin.file:
        path: /tmp/deploy_temp
        state: absent
```
Important: variables set in block are available in rescue and always. However, if a task in rescue fails, the host is marked failed (unless you handle it). Use ignore_errors in rescue if needed. Also, rescue runs only for tasks that end in a failed state; invalid task definitions and unreachable hosts do not trigger it.
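Inside rescue, Ansible exposes ansible_failed_task and ansible_failed_result, which describe the task that triggered the rescue. The sketch below uses them to report the failure and then re-raises it with ansible.builtin.fail, so the host still counts as failed after the report. The migration script is a hypothetical placeholder.

```yaml
- name: Apply database migration with failure details
  block:
    # Hypothetical script
    - name: Run migration
      ansible.builtin.command: /usr/local/bin/migrate.sh
      register: migrate_result
  rescue:
    - name: Report which task failed and why
      ansible.builtin.debug:
        msg: >-
          Task '{{ ansible_failed_task.name }}' failed on {{ inventory_hostname }}:
          {{ ansible_failed_result.msg | default('no message') }}
    # Without this, a successful rescue leaves the host in a non-failed state
    - name: Re-raise so the host still counts as failed
      ansible.builtin.fail:
        msg: "Migration failed; see the report above"
```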
any_errors_fatal: Stop the Play on Any Failure
By default, if a task fails on a host, Ansible stops executing further tasks on that host but continues on other hosts. Setting any_errors_fatal: true changes this: if any task fails on any host, Ansible finishes that task on the remaining hosts in the current batch and then stops the play for all hosts. This is useful when a failure on one host indicates a systemic issue that should halt the entire deployment. Use it sparingly, as it can cause unnecessary downtime.
```yaml
- name: Deploy critical update
  hosts: all
  any_errors_fatal: true
  tasks:
    - name: Validate config
      ansible.builtin.shell: /usr/local/bin/validate_config.sh
```
In production, we used this for a security patch that had to be applied consistently across all hosts. If one host failed validation, we wanted to stop and investigate. However, we combined it with serial: 1 to limit blast radius. A common mistake is setting any_errors_fatal: true without serial, causing all hosts to fail if one has a transient issue.
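The combination described above might look like the following sketch. The package name is a hypothetical stand-in for the security patch, and the validation script is the one from the earlier example.

```yaml
- name: Apply security patch one host at a time
  hosts: all
  serial: 1                # one host per batch limits the blast radius
  any_errors_fatal: true   # the first failure stops the play for every host
  tasks:
    - name: Validate config
      ansible.builtin.command: /usr/local/bin/validate_config.sh
      changed_when: false
    # Hypothetical package name
    - name: Apply patch
      ansible.builtin.package:
        name: openssl
        state: latest
```

With serial: 1, a failure stops the rollout before the next host is touched, so only the failing host is left in an uncertain state.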
Use serial: 1 or a small batch size with any_errors_fatal: true to avoid taking down the entire fleet on a single failure.
max_fail_percentage: Graceful Degradation in Rolling Updates
max_fail_percentage is a play-level directive that sets the maximum percentage of hosts that can fail before Ansible aborts the entire play. It's typically used with serial for rolling updates. For example:
```yaml
- name: Rolling update
  hosts: webservers
  serial: 5
  max_fail_percentage: 20
  tasks:
    - name: Update app
      ansible.builtin.yum:
        name: myapp
        state: latest
```
If more than 20% of the hosts in a batch fail, the play stops. This prevents a bad deployment from taking down too many hosts. The percentage is calculated per batch, not globally, and it must be exceeded, not just reached. With 5 hosts per batch, 1 failure is exactly 20%, so the play continues; 2 failures (40%) exceed 20%, so the play stops. If max_fail_percentage is not set, there is no percentage limit, and Ansible keeps going as long as a batch still has healthy hosts. Setting it to 0 aborts on any failure. To allow some failures, set a positive integer. In production, we use 20% for rolling updates to tolerate transient issues.
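The per-batch arithmetic can be read straight off the play header. The values below are illustrative, not a recommendation.

```yaml
- name: Rolling update with a stricter threshold
  hosts: webservers
  serial: 10                # 10 hosts per batch
  # 1 failure  = 10% of the batch: not exceeded, the play continues
  # 2 failures = 20% of the batch: exceeded, the play aborts
  max_fail_percentage: 10
  tasks:
    - name: Update app
      ansible.builtin.package:
        name: myapp
        state: latest
```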
Using Rescue to Notify and Clean Up After Failures
The rescue block is not just for rollback; it's also for notification and cleanup. You can use ansible.builtin.uri to call webhooks, community.general.mail to send emails, or community.general.slack to notify teams. For cleanup, use ansible.builtin.file to remove temporary files, or ansible.builtin.service to stop services. Example:
```yaml
- name: Deploy with notification
  block:
    - name: Deploy app
      ansible.builtin.copy:
        src: /tmp/app.war
        dest: /opt/tomcat/webapps/
    - name: Restart tomcat
      ansible.builtin.service:
        name: tomcat
        state: restarted
  rescue:
    - name: Notify failure
      ansible.builtin.uri:
        url: "https://hooks.slack.com/services/T00/B00/xxx"
        method: POST
        body_format: json
        body:
          text: "Deployment failed on {{ inventory_hostname }}"
      ignore_errors: yes
    - name: Clean up deployed file
      ansible.builtin.file:
        path: /opt/tomcat/webapps/app.war
        state: absent
      ignore_errors: yes
  always:
    - name: Remove temp files
      ansible.builtin.file:
        path: /tmp/deploy_temp
        state: absent
```
Note the `ignore_errors: yes` on rescue tasks: if the notification fails, you don't want that to compound the failure. Also, the always block runs even if rescue fails. This pattern is essential for maintaining observability and cleanliness in production.
Combining Error Handling Directives: A Production Pattern
In real playbooks, you'll combine multiple directives. Here's a pattern for a rolling update with error handling:
```yaml
- name: Rolling update with error handling
  hosts: webservers
  serial: 10
  max_fail_percentage: 20
  any_errors_fatal: false
  tasks:
    - name: Pre-check
      block:
        - name: Check disk space
          ansible.builtin.shell: df / | awk 'NR==2 {print $5}' | sed 's/%//'
          register: disk_usage
          changed_when: false
          failed_when: disk_usage.stdout | int > 90
        - name: Check service health
          ansible.builtin.uri:
            url: http://localhost:80/health
            status_code: 200
          register: health
          ignore_errors: yes
      rescue:
        - name: Skip host and notify
          ansible.builtin.debug:
            msg: "Host {{ inventory_hostname }} failed pre-check, skipping"
          changed_when: false
        - name: Notify
          ansible.builtin.uri:
            url: https://hooks.slack.com/...
            method: POST
            body_format: json
            body: '{"text":"Pre-check failed on {{ inventory_hostname }}"}'
          ignore_errors: yes
      always:
        - name: Log check result
          ansible.builtin.copy:
            # default('') guards against a missing stdout
            content: "{{ disk_usage.stdout | default('') }}"
            dest: /var/log/precheck.log
          ignore_errors: yes
```
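This pattern checks prerequisites, reports hosts that fail them, and logs the result. Be aware that a host whose rescue section completes successfully is treated as rescued, not failed: it carries on with the rest of the play and does not count toward max_fail_percentage. To actually drop the host, end the rescue with ansible.builtin.meta: end_host; to make it count as a failure, re-raise with ansible.builtin.fail. With that in place, the play continues with other hosts, and max_fail_percentage aborts it if too many fail. This is a robust pattern for large fleets.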
Testing Error Handling: CI/CD Patterns
Error handling code is only as good as its test coverage. In CI, create test playbooks that intentionally fail to verify your error paths. Use ansible-playbook --syntax-check to catch syntax errors. For logic testing, use ansible-playbook --check --diff to see what would change. But for error handling, you need to actually trigger failures. We use molecule with scenarios that simulate failures:
```yaml
# molecule/default/verify.yml
- name: Verify error handling
  hosts: all
  tasks:
    - name: Trigger failure
      ansible.builtin.command: /bin/false
      register: result
      failed_when: result.rc != 0
```
Then assert that the rescue block ran. Another pattern: use ansible.builtin.fail module in test plays. For example, to test max_fail_percentage, run a playbook with multiple hosts and force failures on some. Use ansible-playbook --limit to target specific hosts. Also, use -v flags to see error handling output: -vvv shows failed_when evaluation. In production, we have a CI pipeline that runs a dedicated 'chaos' playbook that injects failures to validate our error handling.
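One way to assert that a rescue path ran is to set a fact inside rescue and check it afterwards. This is a minimal sketch; the rescue_ran fact name is an assumption, and a real test would exercise your own block rather than /bin/false.

```yaml
- name: Verify that the rescue path runs
  hosts: all
  gather_facts: false
  tasks:
    - name: Exercise the error path
      block:
        - name: Force a failure
          ansible.builtin.command: /bin/false
          changed_when: false
      rescue:
        # Hypothetical fact name used only by this test
        - name: Record that rescue ran
          ansible.builtin.set_fact:
            rescue_ran: true

    - name: Assert the rescue block executed
      ansible.builtin.assert:
        that:
          - rescue_ran | default(false)
        fail_msg: "rescue block did not run"
```

If the forced failure were swallowed by a wrong failed_when, rescue would never set the fact and the assert would fail the test run.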
We once missed a misconfigured failed_when because the test never triggered the failure condition. We added a step that explicitly forces the failure condition to validate the error path.
Common Pitfalls with ignore_errors and failed_when Interactions
The interaction between ignore_errors and failed_when can be confusing. Key rule: ignore_errors is evaluated after failed_when. So if both are set, the task is first evaluated for failure using failed_when. If failed_when returns true, the task is marked failed, but then ignore_errors causes the play to continue. The task output still shows 'failed' with 'ignoring'. This can mislead operators. A common pitfall is setting ignore_errors: yes on a task with failed_when thinking it will suppress the failure display. It doesn't. To truly suppress, use failed_when: false and no ignore_errors. But that's an anti-pattern. Better: use register and conditionals on subsequent tasks. Example:
```yaml
- name: Attempt to stop service
  ansible.builtin.service:
    name: myapp
    state: stopped
  register: stop_result
  ignore_errors: yes

- name: Handle failure
  ansible.builtin.debug:
    msg: "Service stop failed, continuing"
  when: stop_result is failed
```
This pattern is clearer than relying on ignore_errors alone. In production, we avoid ignore_errors on critical tasks; we use register and when to handle failures explicitly.
Error Handling in Loops: With_items and Failed Items
When using loops (e.g., with_items, loop), a failed iteration does not stop the loop: Ansible runs the remaining items, then marks the whole task as failed and stops the host. To handle per-item failures and keep going, use ignore_errors: yes on the task and then check the registered results. Example:
```yaml
- name: Install packages
  ansible.builtin.yum:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - bad-package
    - mysql
  ignore_errors: yes
  register: install_results

- name: Report failed packages
  ansible.builtin.debug:
    msg: "Package {{ item.item }} failed to install"
  loop: "{{ install_results.results | selectattr('failed', 'equalto', true) | list }}"
```
This pattern allows the play to continue and then process failures. In production, we use this for package installations where some packages might be unavailable. We then send a report of failed packages to a monitoring system.
You can use loop_control with pause to throttle, but for error handling, register the results and filter.
Error Handling Best Practices for Production Playbooks
- Always register results for tasks that can fail, even if you use ignore_errors. This allows debugging later.
- Use failed_when instead of ignore_errors when you have specific failure criteria.
- Limit any_errors_fatal to critical deployments; use max_fail_percentage for rolling updates.
- Test error paths in CI by forcing failures.
- Document error handling decisions in comments, especially why a task is ignored.
- Use block/rescue/always for any multi-step operation that needs cleanup.
- Avoid nested blocks; they complicate error handling.
- Set changed_when: false on read-only tasks to avoid false change notifications.
- Use ansible.builtin.fail in rescue blocks to re-raise failures after cleanup if needed.
- Monitor for ignored failures; they can hide real issues. Use a post-play hook to check for ignored tasks.
Example of a post-play hook:
```yaml
- name: Check for ignored failures
  ansible.builtin.fail:
    msg: "There were {{ ignored_count }} ignored failures"
  when: ignored_count | int > 0
  vars:
    # ansible_failed_result exists only inside rescue, so count registered
    # results from tasks that used ignore_errors (names here are examples).
    ignored_count: "{{ [stop_result, install_results] | select('failed') | list | length }}"
```
This is a simplified example; in practice, you'd need to aggregate across hosts.
Run ansible-lint on your playbooks; it can detect missing error handling or misused directives.
The Stale Lock File Incident
The cause: ignore_errors: yes was removed during a refactor, and any_errors_fatal was set to true on the play. The lock file was stale but harmless.
The fix: we added ignore_errors: yes to the lock file check task and changed any_errors_fatal to false. We also added a rescue block to delete the lock file if the deployment failed.
- Never assume a failure is safe; explicitly declare error handling intent.
- Use ignore_errors for non-critical checks, and always test with a stale state.
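- If a failing task does not stop the play, check whether ignore_errors: yes is set. If not intended, remove it. If intended, verify the task's failure condition is correct.
- If a task fails even though the command succeeded, check the failed_when condition. It might be evaluating to true on success. Example: failed_when: result.rc != 0 but the command returns 1 on success. Fix by adjusting the condition.
- If a task always reports 'changed', add changed_when: false, or set a condition on the output such as changed_when: "'updated' in result.stdout" if the command always returns 0.
- If the play stops for every host despite serial being set, check whether any_errors_fatal: true is set on the play. Set it to false or remove it. Also check for max_fail_percentage: 0, which acts similarly.
- For a dry run before the real one: ansible-playbook playbook.yml --check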
Common mistakes to avoid
Using ignore_errors to suppress all failures without understanding the cause
Setting any_errors_fatal: true without serial, causing all hosts to fail on one failure
Not testing error handling paths in CI
Using changed_when: false on tasks that actually change state
Nesting block/rescue blocks without understanding failure propagation
Forgetting to add ignore_errors to rescue tasks
Interview Questions on This Topic
What is the difference between ignore_errors and failed_when?